Substance Submissions: | data specifications | file formats (SDF, CSV, compressed files (GZIP)) | sample files (chemical submission, RNAi submission)
Assay Submissions: | three parts to an assay submission: (1) substance submission + (2) assay description + (3) assay data | data specifications | file formats (spreadsheet, XML, ASN, CSV) | sample files (single readout IC50, multiple readouts with dose-response data, panel assay with multiple targets, RNAi assay)
Small molecules and RNAi can be submitted to the PubChem Substance database in either SDF or CSV format. The allowable tags that can be used in both formats are provided in the data specifications below. (The complete Upload Help document provides additional details about substance submissions.)
- Data specifications for submissions to PubChem Substance:
- File formats that can be used for submissions to PubChem Substance:
- SDF - The *.sdf format is generally used for submitting chemicals because they typically include chemical structures. SDF is one of a family of chemical-data file formats intended especially for structural information, and "SDF" stands for structure-data file. The *.sdf file format contains two main types of information for each small molecule submitted: (1) an optional structure specification, in the form of 2D or 3D coordinates, SMILES, InChI, chemical synonyms, or a PubChem Compound ID (CID), and (2) metadata that use the appropriate tags from the data specifications and provide descriptive information about the substance. If the structure is specified as a chemical synonym or CID, the PubChem Upload system will find the corresponding PubChem Compound record and import the chemical structure from that Compound record into your substance submission. Example SDF files are provided below. Note that chemicals can also be submitted in *.csv format, with their structures specified in any of the forms mentioned above (preferably CID) except explicit 2D or 3D coordinates, which can only be accommodated by *.sdf files.
- CSV - The *.csv format is most convenient for submitting RNAi's because they generally don't include a chemical structure and the tabular format makes it easy to note relevant information, such as gene or protein targets. A *.csv file can include any or all of the SD tags listed in the substance data specifications (separate document), and the tags serve as column headers. The SD tags can come in any order, with the exception of PUBCHEM_EXT_DATASOURCE_REGID, which should always be in the first column. Each substance must occupy exactly one row. When an SD tag is associated with multiple values, the values are separated by a newline character within the same cell. (Hint: when entering data with Excel, use Alt-Enter to create a newline in a data cell, or double-click a data cell before pasting a multi-line entry into it.) Example CSV files are provided below. CSV files can be imported into any spreadsheet program for viewing or editing.
- Compressed files - Please note that the file you upload to the PubChem Upload system may be compressed. Compressing your file may substantially reduce the time it takes to transfer your data. We support files compressed using the "gzip" compressor. Please note that we do not support "zip" or "bzip2" compressed files.
- If you choose to enter your data by filling in forms (using the wizards) provided by the PubChem Upload system, rather than by uploading files, the data you enter will be formatted for you and can be exported in any of these file formats.
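The CSV and compression rules above can be illustrated with a minimal Python sketch. It assumes nothing about PubChem beyond what is stated here: the regid column comes first, a tag with several values keeps them in one cell separated by newlines, and the file is compressed with gzip. The regids, the comment text, and the helper names `build_substance_csv` and `gzip_bytes` are hypothetical; consult the substance data specifications for the tags you actually need.

```python
import csv
import gzip
import io

REGID = "PUBCHEM_EXT_DATASOURCE_REGID"
COMMENT = "PUBCHEM_SUBSTANCE_COMMENT"


def build_substance_csv(rows, extra_tags):
    """Return CSV text with the regid column first and one record per substance."""
    header = [REGID] + [tag for tag in extra_tags if tag != REGID]
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        cells = []
        for tag in header:
            value = row.get(tag, "")
            # Several values for one tag share a single cell, separated by
            # newlines; the csv module quotes such cells automatically.
            if isinstance(value, (list, tuple)):
                value = "\n".join(value)
            cells.append(value)
        writer.writerow(cells)
    return buf.getvalue()


def gzip_bytes(text):
    """Compress text with gzip (zip and bzip2 are not supported by Upload)."""
    return gzip.compress(text.encode("utf-8"))


# Hypothetical example data, not taken from any PubChem sample file.
rows = [
    {REGID: "demo-001", COMMENT: ["first comment line", "second comment line"]},
    {REGID: "demo-002", COMMENT: "single comment"},
]
csv_text = build_substance_csv(rows, [COMMENT])
compressed = gzip_bytes(csv_text)
```

Writing `compressed` to a file named with a `.gz` suffix gives a gzip file of the kind described above. Each substance is still a single CSV record even though a multi-value cell spans several physical lines inside its quotes.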
- Sample files for submissions to PubChem Substance:
- SDF file examples for submitting chemicals:
- new submission
This sample file contains 9 substances, which vary primarily in how their chemical structures are depicted and in the amount of annotation provided. The chemical structures are represented as 3D coordinates, SMILES, InChI, a MeSH synonym, or a PubChem Compound ID. If a chemical structure is specified as a MeSH synonym or PubChem Compound ID, then PubChem Upload will find the PubChem Compound record that corresponds to the MeSH term or CID, and will import the chemical structure from that Compound record into your substance submission. (A separate file provides additional information about chemical name synonyms, including MeSH.) One of the substances in the sample file does not have a chemical structure, and is included to illustrate the optional nature of chemical structures. Each substance in the sample SDF file is tagged with a PUBCHEM_EXT_DATASOURCE_REGID (external registry ID), and salient features of the example are noted in the corresponding PUBCHEM_SUBSTANCE_COMMENT field. The end of an individual substance is denoted by four dollar signs $$$$.
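As a rough illustration of the record layout described above, the following Python sketch splits SDF text on the `$$$$` delimiter and collects the tagged data items of each record. It assumes the common SDF convention of a `> <TAG>` header line followed by value lines and a blank line. The two records in `SAMPLE` are hypothetical and are not copied from the PubChem sample file; the function names are illustrative only.

```python
def _data_items(lines):
    """Collect '> <TAG>' data items from the lines of one SDF record."""
    items = {}
    tag = None
    for line in lines:
        if line.startswith(">") and "<" in line:
            # Data header such as "> <PUBCHEM_EXT_DATASOURCE_REGID>"
            tag = line[line.index("<") + 1 : line.rindex(">")]
            items[tag] = []
        elif tag is not None:
            if line.strip() == "":
                tag = None  # a blank line ends the current data item
            else:
                items[tag].append(line)
    return {name: "\n".join(values) for name, values in items.items()}


def parse_sdf_records(text):
    """Split SDF text on '$$$$' lines and return one tag dictionary per record."""
    records = []
    current = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append(_data_items(current))
            current = []
        else:
            current.append(line)
    return records


# Hypothetical two-record file; both records have an empty structure block.
EMPTY_MOLBLOCK = ["", "  hypothetical", "",
                  "  0  0  0  0  0  0  0  0  0  0999 V2000", "M  END"]
SAMPLE = "\n".join(
    EMPTY_MOLBLOCK
    + ["> <PUBCHEM_EXT_DATASOURCE_REGID>", "demo-001", "",
       "> <PUBCHEM_SUBSTANCE_COMMENT>", "No structure provided", "", "$$$$"]
    + EMPTY_MOLBLOCK
    + ["> <PUBCHEM_EXT_DATASOURCE_REGID>", "demo-002", "", "$$$$"]
) + "\n"

records = parse_sdf_records(SAMPLE)
```

A check like this can confirm that every record in a file carries a regid before upload; it does not validate the structure block itself.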
- update existing PubChem substance record
SDF files for updates generally look no different than SDF files for new submissions. The only distinction is that an update uses the same regid that is already present in an existing PubChem Substance record (from the *same* data source, either publicly available or on hold), while a new submission uses a regid that is not yet found in any PubChem Substance record from that data source. If the regid provided in an SDF file matches a record that's already in PubChem Substance from the same data source, it's an update; if not, it's considered new.
- revoke PubChem substance record
Registry IDs (regids) that are listed in the revoke file must exactly match existing records in PubChem Substance, and must be from the *same* data source. Only two fields are necessary in the revoke file: the regid and a short comment explaining the revoke.
- mix of new submission, update, and revoke
If desired, you can submit a single *.sdf file that contains a mix of new substance submissions, updates to existing records, and revokes.
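The matching rules for updates and revokes can be sketched in a few lines of Python. This is only an illustration of the logic described above, not how PubChem Upload is implemented: `existing_regids` is a hypothetical stand-in for the regids your data source has already deposited, and the function names are invented for this example.

```python
def classify_regids(submitted_regids, existing_regids):
    """Label each submitted regid 'update' if the same data source already
    has a record with that regid, otherwise 'new'."""
    existing = set(existing_regids)
    return {
        regid: ("update" if regid in existing else "new")
        for regid in submitted_regids
    }


def unmatched_revokes(revoke_regids, existing_regids):
    """Return revoke regids that do not exactly match an existing record."""
    existing = set(existing_regids)
    return [regid for regid in revoke_regids if regid not in existing]


# Hypothetical regids for one data source.
existing_regids = ["demo-001", "demo-002"]
labels = classify_regids(["demo-002", "demo-003"], existing_regids)
bad_revokes = unmatched_revokes(["demo-001", "Demo-002"], existing_regids)
```

Because the match must be exact, a regid that differs only in letter case (`Demo-002` here) is reported as unmatched.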
- CSV file example for submitting RNAi's:
Biological activity data for small molecules and RNAi can be submitted to the PubChem BioAssay database in either spreadsheet (XLSX or ODS), XML, ASN, or CSV format. The allowable tags that can be used in all formats are provided in the data specifications below. If you choose to enter your data by using the wizards provided by the PubChem Upload system, rather than by importing files, the data you enter will be formatted for you and can be exported in any of these file formats.
- Three parts to an assay submission:
- The three parts of an assay submission include:
(1) substance submission + (2) assay description + (3) assay data
They are described in detail in the complete PubChem Upload help document.
- Please note that the substances tested in your assay must be submitted to the PubChem Substance database before you submit your assay description and data to the PubChem BioAssay database.
- After you have completed the substance submission, the links below provide access to some key information about submitting the assay description and data. Those links are intended to serve as a quick reference, and the complete help document provides additional tips on completing the various assay description fields, such as activity outcome method, panel assays, and more.
- Data specifications for submissions to PubChem BioAssay:
- File formats for submissions to PubChem BioAssay:
- Biological activity data for small molecules and RNAi can be submitted to the PubChem BioAssay database in either spreadsheet (XLSX or ODS), XML, ASN, or CSV format. Examples of each are provided below.
- If you choose to enter your data by filling in forms (i.e., using wizards) provided by the PubChem Upload system, rather than by importing files, the data you enter will be formatted for you and can be exported in any of these file formats.
- Please see the complete help document for tips for assay submissions in CSV format, including the ordering of columns, if you choose to submit CSV files.
- Sample formatted files for submissions to PubChem BioAssay:
- Example 1: single readout IC50 (derived from AID 449747)
- spreadsheet format (*.xlsx) (can also be *.ods)
- XML format
- ASN format
- CSV format:
- part 1: general (information about the assay experiment, such as assay title, description, protocol, comments, etc.)
- part 2: targets (protein and/or gene targets)
- part 3: xrefs (cross references to other databases)
- part 4: results (assay data)
- optional CSV file: categorized comments - An additional type of CSV file, not associated with the example above but which can be included in assay submissions, is categorized comments. It contains user defined TAG-VALUE pairs that are stored in the assay record as comments, and provides a convenient placeholder for user-defined ontologies or other definitions outside the scope of the PubChem data specification.
A separate file provides additional details about CSV files, including descriptions of the five file types described above, tags, and allowable values for submitting data to the PubChem BioAssay database.
- Example 2: multiple readouts with dose-response data (derived from AID 623973)
- Example 3: panel assay with multiple targets & bioactivity outcomes (derived from AID 1433)
- Example 4: RNAi assay (derived from AID 624150)
Note that prior to submitting assay data, the substances or RNAi that were tested must first be submitted to the PubChem Substance database. The biological test results can then be submitted to the PubChem BioAssay database. For example, the RNAi substance submission file, in CSV format, that corresponds to Example 4 is provided in an earlier section of this document and lists the RNAis that were tested by this assay.
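Before uploading, it can help to confirm locally that every substance referenced in the assay results also appears in the substance submission. The Python sketch below does this for two CSV files. It assumes that both files identify substances in a PUBCHEM_EXT_DATASOURCE_REGID column; that is an assumption for the results file, so check the assay data specifications. The `ACTIVITY` column, the sample rows, and the function names are hypothetical.

```python
import csv
import io

REGID = "PUBCHEM_EXT_DATASOURCE_REGID"


def regids_in_csv(csv_text, column=REGID):
    """Return the set of non-empty regids found in the given column."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    return {row[column].strip() for row in reader if row.get(column)}


def missing_substances(substance_csv, results_csv):
    """List regids used in the assay results but absent from the substance file."""
    return sorted(regids_in_csv(results_csv) - regids_in_csv(substance_csv))


# Hypothetical file contents for illustration only.
substance_csv = REGID + ",PUBCHEM_SUBSTANCE_COMMENT\nrnai-001,first\nrnai-002,second\n"
results_csv = REGID + ",ACTIVITY\nrnai-001,active\nrnai-003,inactive\n"

missing = missing_substances(substance_csv, results_csv)
```

An empty result means every tested substance has a matching row in the substance file; anything listed should be added to the substance submission first.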