PubChem Upload Help (brief version)
 
 
   

This brief help document provides basic information about the PubChem Upload tool, including sample files for submitting substances and assays. The complete help document includes the information provided here, plus technical details about the PubChem Upload tool and FTP submissions. A separate file provides answers to frequently asked questions (FAQs), and a tutorial provides step-by-step examples of how to use PubChem Upload for substance and assay submissions. If you need additional help, please contact pubchem-deposit-help@ncbi.nlm.nih.gov.

 
     
   
 
What is PubChem Upload?

Tool for PubChem data submissions and updates

PubChem Upload is a tool for submitting small molecule and bioassay data to the PubChem Substance and BioAssay databases. It offers streamlined procedures for both data submissions and updates, and replaces the original PubChem Deposition Gateway. To allow for the greatest flexibility, the Upload system allows you to submit data using any one of the methods noted below.

PubChem Upload can also be used for updating existing PubChem records. It retrieves your "In PubChem" substance and assay records, loads them into the interface, and allows you to make edits -- large or small -- directly in the interface, and then commit the revised file to PubChem. It's also still possible, if desired, to make an update using the old approach -- by importing a formatted file that contains the original Registry ID (RegID) for the submission and includes the desired revisions.

Flexible methods for data entry

You can submit data to the PubChem Substance or BioAssay databases in any of the following ways:
  • Use wizards to enter data into web forms -- Wizards assist novice users in entering substance and/or assay data, without requiring knowledge of detailed data specifications. Hints appear at the bottom of most screens (with links to examples in the upload tutorial, where appropriate), providing tips on data entry. After you type or import data into the wizards, the Upload system will prepare a properly formatted file that conforms to the data specifications. The wizards also give you the option of using an existing PubChem record as a template for your new submission. The complete Upload Help document provides illustrated examples of the PubChem Upload interface.  -OR-

  • Upload formatted files -- Upload also accepts formatted files for substance and assay submissions, for depositors who prefer that approach instead of using wizards. Details about file format specifications, as well as sample files, are provided in subsequent sections of this document.  -OR-

  • FTP depositions -- For large and/or frequent data uploads, depositions by FTP are recommended. Details about requesting an FTP account and the procedures for FTP submissions are provided in the complete PubChem Upload help document. Once your data are in your FTP directory, you can review and edit them, and commit them to PubChem, using the new Upload system.

 
Account Types

Test Account

A "test" account enables you to upload data into the pending submissions database, but does not require signing the data transfer agreement and does not enable you to "commit" the data into the official PubChem database. The test account is just for trying the system to determine whether you can successfully upload and see your structure.

NOTE: When you register for a test account, you must enter your e-mail address. When you go to the next screen, PubChem Upload will send a message to your e-mail address with a code that you must enter on the second screen in order to proceed. This confirms that the e-mail address is valid, so that communications about your PubChem submissions will reach you.

Full Account

A full account requires you to sign the data transfer agreement and enables you to "commit" your data into the official PubChem database.

Upload account users must agree to and are bound by the data transfer agreement. This agreement gives PubChem the right to redistribute the information deposited. It is important to note that all data deposited into PubChem is covered under fair-usage and, as such, does not require the depositor to assign away copyrights or other ownership prior to PubChem deposition, i.e., the deposited data does not have to become part of the public domain.

If your organization requires modifications to the Data Transfer Agreement, please contact a PubChem Curator.

NOTE: In the previous PubChem Deposition Gateway, test accounts were separate from full accounts, and after submitters uploaded data into a test account, they were required to resubmit data into a full account in order to deposit them into PubChem. In the new Upload system, a test account can be converted to a full account at any time, and the data do not have to be resubmitted in order to deposit them into PubChem.

The complete Upload Help document provides additional details about account types.



 
Data Release

The complete Upload Help document provides an overview of the three main phases in the submission process:
(1) upload data to the pending submissions database
(2) review the submission
(3) commit to PubChem
Once the submission process is finished, your data can either be released immediately, or held until a date that you specify, as noted below.

Immediate release of data

After you "Commit" your data into PubChem, it usually takes about one to two days for the data to be processed and publicly released. (Delays can sometimes occur, depending on the volume of data PubChem is receiving or other extenuating factors, such as technical issues due to infrastructure upgrades; however, such delays are infrequent.)

By default, data are released immediately after processing is complete, unless you have indicated a PUBCHEM_HOLD_UNTIL_DATE.

If you want to indicate a hold-until-date, it is critical that you do that as part of the INITIAL submission process, as noted below.

Hold-until-date

If you prefer to keep your data private until a specific date, rather than have it released immediately, it is critical that you indicate a "PUBCHEM_HOLD_UNTIL_DATE" as part of the initial submission process. In such a case, your data will be added to PubChem after you press "Commit," but they will not be visible to the public until the "PUBCHEM_HOLD_UNTIL_DATE" that you specified. (The substance data specifications and bioassay data specifications provide additional details about the PUBCHEM_HOLD_UNTIL_DATE for each type of data.)

An initial hold of up to one year from initial submission is accepted, with the possibility of additional extensions.

You can update your record any time to remove the on-hold request and release the record to the public. The complete PubChem Upload Help document describes how to remove the on-hold request for substances and how to remove the on-hold request for assays.

If you DO NOT specify a hold-until-date at the time of initial submission, your PubChem record will be released automatically after the record goes through standard PubChem data processing.

Once a record is "published" (publicly released) in PubChem, it can no longer be put back to the on-hold status. (That is, a retroactive hold cannot be placed on PubChem records that are already publicly accessible.) The only action available after that time is a revoke.

Notes for assay submissions:
  • If you have chosen to hold your assay data private, but you would like to provide access to select individuals such as collaborators or reviewers, you can create a URL that provides temporary access to a given on-hold assay submission (i.e., to a given AID and its associated substance records). You can then share it with the desired individuals. This function was added in order to facilitate collaboration (whether external or in-house) and administrative work such as publication and grant processing.

  • An assay cannot be released to the public before all of its associated substances (SIDs) are public. (Validation checks are performed on the release dates of an assay and its associated substance records, and an error is generated if a request is made to release an assay before all of its SIDs are public.) Therefore, if you would like to release your assay data to the public, you may need to do two steps if both the assay data and the substances are on-hold:

    1. Release the substances: Open your PubChem Upload home page/"Substances" folder tab/"On Hold" subfolder, and type the Assay ID (AID) in the text box entitled "Release (remove Hold Until Date) Substances." That will release all PubChem Substance records that are associated with that AID.

    2. Release the assay: Once the substances are uploading to PubChem, release the assay data by following the steps described in the Assay data release section of the complete Upload Help document.
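The ordering rule described in the notes above (an assay cannot be released before all of its substances are public) can be sketched as a simple date comparison. This is only an illustration and not PubChem's actual validation code: the function name, the use of SIDs as dictionary keys, and the convention that None means "already public" are assumptions made for the example, as is treating a substance released on the same day as the assay as acceptable.

```python
from datetime import date


def blocking_substances(assay_release_date, substance_hold_dates):
    """Return the SIDs that would still be on hold on the assay's release date.

    substance_hold_dates maps each SID to its hold-until date, or to None
    if the substance is already public (an assumption of this sketch).
    """
    blocking = []
    for sid, hold_until in substance_hold_dates.items():
        # A substance blocks the release only if its hold ends after the assay's date.
        if hold_until is not None and hold_until > assay_release_date:
            blocking.append(sid)
    return sorted(blocking)


# Hypothetical SIDs and dates, for illustration only.
holds = {101: None, 102: date(2014, 12, 1), 103: date(2015, 6, 1)}
print(blocking_substances(date(2015, 1, 1), holds))
```

An empty result would mean the assay release request passes this check; any SIDs returned would need to be released first, as described in step 1.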

See the complete Upload Help document for information about the three main phases in the submission process, the release of data, and how to remove an on-hold request and release the record to the public.



 
Data Specifications and File Formats

Substance Submissions: | data specifications | file formats (SDF, CSV, compressed files (GZIP)) | sample files (chemical submission, RNAi submission)

Assay Submissions: | three parts to an assay submission: (1) substance submission + (2) assay description + (3) assay data | data specifications | file formats (spreadsheet, XML, ASN, CSV, compressed files (GZIP)) | sample files (single readout IC50, multiple readouts with dose-response data, panel assay with multiple targets, RNAi assay)

PubChem Substance: submission of small molecules and RNAi

Small molecules and RNAi can be submitted to the PubChem Substance database via the wizards provided by the PubChem Upload system, or by uploading files in either SDF or CSV format. The allowable tags that can be used in both file formats are provided in the data specifications below.

The information below is intended to provide a brief reference to some key aspects of submitting substance data to PubChem. Additional details about substance submissions are provided in the complete Upload Help document.
  • Data specifications for submissions to PubChem Substance:


  • File formats that can be used for submissions to PubChem Substance:

    • SDF - The *.sdf format is generally used for submitting chemicals because they typically include chemical structures. SDF is one of a family of chemical-data file formats intended especially for structural information, and "SDF" stands for structure-data file. The *.sdf file format contains two main types of information for each small molecule submitted: (1) an optional structure specification, in the form of 2D or 3D coordinates, SMILES, InChI, chemical synonyms, or a PubChem Compound ID (CID), and (2) metadata that use the appropriate tags from the data specifications and provide descriptive information about the substance. If the structure is specified as a chemical synonym or CID, the PubChem Upload system will find the corresponding PubChem Compound record and import the chemical structure from that Compound record into your substance submission. Example SDF files are provided below. Note that chemicals can also be submitted in *.csv format, with their structures specified in any of the forms mentioned above (preferably CID) except explicit 2D or 3D coordinates, which can only be accommodated by *.sdf files.

    • CSV - The *.csv format is most convenient for submitting RNAi's because they generally don't include a chemical structure and the tabular format makes it easy to note relevant information, such as gene or protein targets. A *.csv file can include any/all of the SD tags listed in the substance data specifications, and they serve as column headers. The SD tags can come in any order, with the exception of the PUBCHEM_EXT_DATASOURCE_REGID, which should always be in the first column. Each substance must occupy exactly one row. When an SD tag is associated with multiple values, the values are separated with a newline character within the same cell. (Hint: when entering data with Excel, use Alt-Enter to create a newline in a data cell, or double-click a data cell before pasting a multi-line entry into it.) Example CSV files are provided below. CSV files can be imported into any spreadsheet program for viewing or editing.

      • Information about the rules for proper formatting of CSV files is provided on sites such as Wikipedia (http://en.wikipedia.org/wiki/Comma-separated_values) and the Internet Engineering Task Force (IETF) (http://tools.ietf.org/pdf/rfc4180.pdf, http://tools.ietf.org/html/rfc4180). For example:

        • If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example, if your CSV file contains a field that looks like:
          QC'd by "Quicktest"
          then that field should be revised to look like:
          "QC'd by ""Quicktest"""

      • If you are uploading a very large file through a web browser interface, it is possible you might see an error message such as a time out or bad gateway. This usually means that your browser timed out before it could get notification back to you confirming that your file was uploaded. If you simply close the pop-up that displays the error message and wait a few minutes, you should see that the file has been uploaded. After a few minutes, refresh your original window, which should display a message such as "validating" or "validated." If you see a message such as "parsing failed," then the file might have had some problem. But generally, PubChem Upload eventually receives everything in large files, even if a time out error message is displayed during the upload process.

      • As an alternative, you can use FTP to upload very large files. Details about requesting an FTP account and the procedures for FTP submissions are provided in the complete PubChem Upload help document. Please note, however, that FTP is mainly intended as a way to begin a new submission, and not necessarily as a way to add to an existing submission.

    • Compressed files - Submissions uploaded by FTP may be compressed. This may substantially reduce the time it takes to transfer your data. We support files compressed using the "gzip" compressor. Please note that we do not support "zip" or "bzip2" compressed files.

    • If you choose to enter your data by filling in forms (using the wizards) provided by the PubChem Upload system, rather than by uploading files, the data you enter will be formatted for you and can be exported in any of these file formats.
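The CSV rules described above (PUBCHEM_EXT_DATASOURCE_REGID in the first column, one row per substance, multiple values separated by a newline within a cell, and embedded double-quotes escaped by doubling) can be sketched in Python as follows. This is a minimal illustration rather than an official PubChem script. The function name and the registry ID are hypothetical, and PUBCHEM_SUBSTANCE_SYNONYM is assumed here to be a tag that can hold multiple values; please check the substance data specifications for the tags you actually use. Python's standard csv module applies the RFC 4180 quoting automatically.

```python
import csv
import io

REGID = "PUBCHEM_EXT_DATASOURCE_REGID"


def substances_to_csv(substances, tags):
    """Return CSV text with one row per substance and the RegID column first."""
    # Force the RegID tag into the first column; other tags keep their given order.
    columns = [REGID] + [tag for tag in tags if tag != REGID]
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    for substance in substances:
        row = []
        for column in columns:
            value = substance.get(column, "")
            # Multiple values for one tag share a cell, separated by newlines.
            if isinstance(value, (list, tuple)):
                value = "\n".join(value)
            row.append(value)
        writer.writerow(row)
    return buffer.getvalue()


# Hypothetical substance record, for illustration only.
example = {
    REGID: "REG-001",
    "PUBCHEM_SUBSTANCE_COMMENT": 'QC\'d by "Quicktest"',
    "PUBCHEM_SUBSTANCE_SYNONYM": ["name one", "name two"],
}
print(substances_to_csv(
    [example],
    ["PUBCHEM_SUBSTANCE_SYNONYM", REGID, "PUBCHEM_SUBSTANCE_COMMENT"],
))
```

In the printed output, the comment field appears as "QC'd by ""Quicktest""", matching the escaping rule shown earlier, and the two synonyms share one quoted cell.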


  • Sample files for submissions to PubChem Substance:
    • SDF file examples for submitting chemicals:

      • new submission
        • This sample file contains 9 substances, which vary primarily in how their chemical structures are depicted and in the amount of annotation provided. The chemical structures are represented either as 3D coordinates, SMILES, InChI, a MeSH synonym, or a PubChem Compound ID. If a chemical structure is specified as a MeSH synonym or PubChem Compound ID, then PubChem Upload will find the PubChem Compound record that corresponds to the MeSH term or CID, and will import the chemical structure from that Compound record into your substance submission. (A separate file provides additional information about chemical name synonyms, including MeSH.) One of the substances in the sample file does not have a chemical structure, and is included to illustrate the optional nature of chemical structures. Each substance in the sample SDF file is tagged with a PUBCHEM_EXT_DATASOURCE_REGID (external registry ID), and salient features of the example are noted in the corresponding PUBCHEM_SUBSTANCE_COMMENT field. The end of an individual substance is denoted by a line containing four dollar signs ($$$$).

      • update existing PubChem substance record
        • SDF files for updates generally look no different than SDF files for new submissions. The only distinction is that an update uses the same regid that is already present in an existing PubChem Substance record (from the *same* data source, either publicly available or on hold), while a new submission uses a regid that is not yet found in any PubChem Substance record from that data source. If the regid provided in an SDF file matches a record that's already in PubChem Substance from the same data source, it's an update; if not, it's considered new.

      • revoke PubChem substance record
        • Registry IDs (regids) that are listed in the revoke file must exactly match existing records in PubChem Substance, and must be from the *same* data source. Only two fields are necessary in the revoke file: the regid and a short comment explaining the revoke.

      • mix of new submission, update, and revoke
        • If desired, you can submit a single *.sdf file that contains a mix of new substance submissions, updates to existing records, and revokes.

    • CSV file example for submitting RNAi's:

    Additional details about substance submissions are provided in the complete Upload Help document.
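To illustrate how the SDF conventions described above fit together, the Python sketch below splits SDF text into records at each $$$$ line, reads the data tags, and labels each record as "update" or "new" depending on whether its regid is already known. It is a simplified illustration, not the PubChem Upload implementation: the function names, the regids, and the existing_regids set (standing in for the regids you have already deposited from your data source) are hypothetical, and the sample records omit the structure block for brevity. The "> <TAG>" header followed by value lines and a blank line is the usual SDF data-item layout.

```python
import re

TAG_HEADER = re.compile(r"^>\s*<([^>]+)>")
REGID = "PUBCHEM_EXT_DATASOURCE_REGID"


def split_sdf_records(text):
    """Split SDF text into lists of lines; each record ends at a '$$$$' line."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append(current)
            current = []
        else:
            current.append(line)
    return records


def read_tags(record_lines):
    """Collect data items: a '> <TAG>' header, then value lines up to a blank line."""
    tags, name = {}, None
    for line in record_lines:
        header = TAG_HEADER.match(line)
        if header:
            name = header.group(1)
            tags[name] = []
        elif name is not None:
            if line.strip() == "":
                name = None  # a blank line closes the current data item
            else:
                tags[name].append(line)
    return {tag: "\n".join(values) for tag, values in tags.items()}


def classify_records(text, existing_regids):
    """Map each regid to 'update' if it is already known, otherwise 'new'."""
    result = {}
    for record in split_sdf_records(text):
        regid = read_tags(record).get(REGID)
        result[regid] = "update" if regid in existing_regids else "new"
    return result


# Simplified, hypothetical sample: two records without structure blocks.
SAMPLE = "\n".join([
    "> <PUBCHEM_EXT_DATASOURCE_REGID>", "REG-001", "",
    "> <PUBCHEM_SUBSTANCE_COMMENT>", "hypothetical record", "",
    "$$$$",
    "> <PUBCHEM_EXT_DATASOURCE_REGID>", "REG-002", "",
    "$$$$",
]) + "\n"

print(classify_records(SAMPLE, existing_regids={"REG-001"}))
```

A check like this can be a convenient way to confirm, before uploading, which records in a mixed file you expect to be treated as updates.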

PubChem BioAssay: submission of biological activity data

Biological activity data for small molecules and RNAi can be submitted to the PubChem BioAssay database via the wizards provided by the PubChem Upload system, or by uploading files in either spreadsheet (XLSX or ODS), XML, ASN, or CSV format. The allowable tags that can be used in all file formats are provided in the data specifications below. If you choose to enter your data by using the wizards provided by the PubChem Upload system, rather than by importing files, the data you enter will be formatted for you and can be exported in any of these file formats.

The information below is intended to provide a brief reference to some key aspects of submitting assay data to PubChem. Additional details about assay submissions are provided in the complete Upload Help document.
  • Three parts to an assay submission:

  • Data specifications for submissions to PubChem BioAssay:


  • File formats for submissions of assay descriptions and assay data to PubChem BioAssay:

     

    Note: The substances that were tested also need to be submitted, but in a separate procedure, as noted in the section of this document on "three parts to an assay submission." The information below is for the submission of assay descriptions and data.

     

    • Assay descriptions and biological activity data for small molecules and RNAi can be submitted to the PubChem BioAssay database in either spreadsheet (XLSX or ODS), XML, ASN, or CSV format. Examples of each are provided below.

    • If you choose to enter your data by filling in forms (i.e., using wizards) provided by the PubChem Upload system, rather than by importing files, the data you enter will be formatted for you and can be exported in any of these file formats.

    • Compressed files - Submissions uploaded by FTP may be compressed. This may substantially reduce the time it takes to transfer your data. We support files compressed using the "gzip" compressor. Please note that we do not support "zip" or "bzip2" compressed files.

    • Please see the complete help document for tips for assay submissions in CSV format, including the ordering of columns, if you choose to submit CSV files.
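As an illustration of the compression note above, the following Python sketch gzip-compresses a file before an FTP transfer and checks whether a file starts with the gzip signature. The helper names and the file name in the usage note are hypothetical, the FTP transfer itself is not shown, and none of this is part of the PubChem Upload tool.

```python
import gzip
import shutil
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def gzip_file(source):
    """Write a gzip-compressed copy of `source` next to it and return its path."""
    source = Path(source)
    target = source.with_name(source.name + ".gz")
    with open(source, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


def looks_like_gzip(path):
    """Return True if the file starts with the gzip signature (zip and bzip2 files do not)."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC
```

For example, calling gzip_file("assay_data.csv") would write assay_data.csv.gz alongside the original file, and looks_like_gzip can be used as a quick check that a file was not produced by "zip" or "bzip2" instead.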

  • Sample files for submissions of assay descriptions and assay data to PubChem BioAssay:

     

    Note: The substances that were tested also need to be submitted, but in a separate procedure, as noted in the section of this document on "three parts to an assay submission." The information below is for the submission of assay descriptions and data.

     

Additional details about assay submissions are provided in the complete Upload Help document.

 
 Revised 11 December 2014