PubChem Upload Help (complete version)
 
 
   

This is the complete help document for the PubChem Upload tool, with technical details about the system and about FTP submissions. A brief help document is also available for quick reference, with a subset of information from this file, such as sample files for submitting substances and assays. A separate file provides answers to frequently asked questions (FAQs), and a tutorial provides step-by-step examples of how to use PubChem Upload for substance and assay submissions. If you need additional help, please contact pubchem-deposit-help@ncbi.nlm.nih.gov.

 
     
 
TABLE OF CONTENTS
 
What is PubChem Upload?
Tool for data submissions & updates
Flexible methods for data entry (wizards, upload files, FTP)

Overview
Interface: four different page views
Account setup & login
Test account
Full account
Special situations (multiple users, multiple accounts)
Account Login
Three main phases in submission process:
(1) upload data to pending submissions database (2) review submission (3) commit to PubChem
Data release (immediate or hold-until-date)

Home Page for your account
"New Submission" button
"Welcome" folder tab
"Substances" folder tab
"Assays" folder tab
"Account Settings" folder tab

Substance submissions
General guidelines
Data specifications
File Formats (SDF, CSV, compressed files (GZIP))
Sample files
chemical submission
RNAi submission
Data Processing
submission status
seven steps in a substance submission
submission status report
Data Release

Assay submissions
General guidelines
Three parts to an assay submission: (1) substance submission + (2) assay description + (3) assay data
Data specifications
File formats (spreadsheet, XML, ASN, CSV, compressed files (GZIP))
Sample files:
single readout IC50
multiple readouts with dose-response data
panel assay with multiple targets
RNAi assay
Assay description fields
Data Processing
submission status
five steps in an assay submission
Create URL to share on hold assay data with collaborators
Data Release

FTP depositions
Request FTP account
Substance depositions by FTP
Assay depositions by FTP

Log of changes to PubChem Upload

Differences between old PubChem Deposition Gateway vs. new PubChem Upload system

References

 
 
 
What is PubChem Upload?

Tool for PubChem data submissions and updates

PubChem Upload is a tool for submitting small molecule and bioassay data to the PubChem Substance and BioAssay databases. It offers streamlined procedures for both data submissions and updates, and replaces the original PubChem Deposition Gateway. (A separate section of this file describes the differences between the Deposition Gateway and Upload.) To allow for greatest flexibility, the Upload system allows you to submit data using any one of the methods noted below.

PubChem Upload can also be used for updating existing PubChem records. It retrieves your "In PubChem" substance and assay records, loads them into the interface, and allows you to make edits -- large or small -- directly in the interface, and then commit the revised file to PubChem. It's also still possible, if desired, to make an update using the old approach -- by importing a formatted file that contains the original Registry ID (RegID) for the submission and includes the desired revisions.

Flexible methods for data entry

You can submit data to the PubChem Substance or BioAssay databases in any of the following ways:
  • Use wizards to enter data into web forms -- Wizards assist novice users in entering substance and/or assay data, without requiring knowledge of detailed data specifications. Hints appear at the bottom of most screens, providing tips on data entry and links to examples in the upload tutorial, if/as appropriate. The wizards also give you the option of using an existing PubChem record as a template, if desired, in order to automatically populate some information into your new submission. After you type or import data into the wizards, the Upload system will prepare a properly formatted file that conforms to the data specifications. A separate section of this document provides illustrated examples of the PubChem Upload interface.  -OR-

  • Upload formatted files -- Upload also accepts formatted files for substance and assay submissions, for depositors who prefer that approach instead of using wizards. Details about file format specifications, as well as sample files, are provided in subsequent sections of this document.  -OR-

  • FTP depositions -- For large and/or frequent data uploads, depositions by FTP are recommended. Details about requesting an FTP account and the procedures for FTP submissions are provided in a separate section of this help document. Once your data are in your FTP directory, you can review and edit them, and commit them to PubChem, using the new Upload system.

 
Overview

interface (four page views) | account setup & login (test account, full account + data transfer agreement) |
special situations (multiple users on one account, multiple accounts for one organization) |
three phases in submission process (upload to pending submissions database, review/edit, commit to PubChem) |
data release (immediate or hold-until-date)

Interface: four different page views

login page | home page for your account | substance submissions interface | assay submissions interface
Page view: Login page
Use this page to:
  • Sign in to your account with your e-mail address as the username ("Login"), OR
  • Create an account ("Register").
Example: [Image: thumbnail view of the PubChem Upload login page]

Page view: Home page for your account
Use this page to:
  • Begin a New Submission for one or more substances, or for a bioassay.
  • Use the Welcome folder tab to access a tutorial, this help document, and an e-mail link to the PubChem depositions help desk, for comments and questions about the system.
  • Use the Substances folder tab and Assays folder tab to access all of your PubChem records, including pending submissions as well as those already in the PubChem database (as publicly available or on-hold records). Mouse over an Upload ID to access the view/edit function, which opens the selected record in the submissions interface. Use the "On Hold" tab, "Release" dialog box, to release data, if desired.
  • Use the Account Settings folder tab to view or edit your account information and Upload preferences.
Example: [Image: thumbnail view of the PubChem Upload home page, which provides access to all of your pending and in-PubChem records, in both the Substance and BioAssay databases, and to a tutorial and help documentation on how to use the system]

Page view: Substance submissions interface
Use this page to:
  • Edit, commit, or delete pending small molecule submissions (chemicals or RNAi) to the PubChem Substance database. Mouse over an item in the desired folder tab to see the available actions for that item.
  • Make updates to existing "In PubChem" substance records, whether they are publicly available or on hold.
Example: [Image: thumbnail view of the PubChem Upload substance submissions interface]

Page view: BioAssay submissions interface
Use this page to:
  • Edit, commit, or delete pending bioassay submissions to the PubChem BioAssay database. Mouse over an item in the desired folder tab to see the available actions for that item.
  • Make updates to existing "In PubChem" bioassay records, whether they are publicly available or on hold.
Example: [Image: thumbnail view of the PubChem Upload assay submissions interface]

Account Setup & Login

test account | full account (Data Transfer Agreement) | special situations (multiple users on one account, multiple accounts for one organization) | account login
  • Test Account

    • A "test" account enables you to upload data into the pending submissions database, but does not require signing the data transfer agreement (HTML format) and does not enable you to "commit" the data into the official PubChem database. The test account is just for trying the system, to determine whether you can successfully upload and see your structure.

      When you register for a test account, you must enter your e-mail address. When you go to the next screen, PubChem Upload will send a message to your e-mail address with a code that you must enter on the second screen in order to proceed. This confirms the validity of the e-mail address, in order to ensure communications about your PubChem submissions.

      When you are ready to upgrade from a test account to a full account, you can do that in any one of several ways:
      • The first time you press the commit button, PubChem Upload will prompt you to upgrade to a full account in order to complete the commit step.  -OR-
      • On your PubChem Upload home page, you can click on the "Upgrade to Full Account" link in the upper right corner of the page (in the blue header bar area). That link will no longer appear on your Upload home page after you have upgraded to a full account.  -OR-
      • On your PubChem Upload home page, you can select the "Account Settings" folder tab and then press the "Upgrade to Full Account" button. That button will no longer appear on your Upload home page after you have upgraded to a full account.

  • Full Account

    • A "full" account requires you to sign the data transfer agreement and enables you to "commit" your data into the official PubChem database.

      Upload account users must agree to and are bound by the data transfer agreement. This agreement gives PubChem the right to redistribute the information deposited. It is important to note that all data deposited into PubChem is covered under fair-usage and, as such, does not require the depositor to assign away copyrights or other ownership prior to PubChem deposition, i.e., the deposited data does not have to become part of the public domain.

      If your organization requires modifications to the Data Transfer Agreement, please contact a PubChem Curator.

      NOTE: In the previous PubChem Deposition Gateway, test accounts were separate from full accounts, and after submitters uploaded data into a test account, they were required to resubmit data into a full account in order to deposit them into PubChem. In the new Upload system, a test account can be converted to a full account at any time, and the data do not have to be resubmitted in order to deposit them into PubChem.

  • Special Situations

    • Multiple users on one account -- If desired, you can create one Upload account that contains multiple users, each having their own login and password. This makes it possible to have:

      • Separate logins - You can have an arbitrary number of users each with her own password on one Upload account. No one can see the password of another user, but there is one primary user who can add and remove other users.

      • Separate deposition tracking - For each action that a user takes, the processing history for a substance or assay deposition will record that user's id. The user who initiates a deposition, however, will remain associated with the deposition even if others work on it. The pending listing, for example, of that substance or assay deposition will show that user as the owner.

      • Joint access - Though logins and tracking are kept separate, access is not. All users from the same data source will see all of their depositions whether they initiated them or not. Users are also free to make follow-up actions on any submission from their data source. For example, if jane_doe deposits a file of substances that pass validation, juan_carlos (from the same data source) may commit the submission for publishing or, for that matter, may even delete it.
      To implement this option:
      1. Choose one person to be the administrator (primary user) and have that person create the initial Upload account. If you already have an Upload account and would like to add users, skip to the next step.

      2. The administrator must log in to PubChem Upload and then, on the Upload home page, open the Account Settings > Contacts tab and click on the "Add Contact" button. Fill out this form for each additional user. The user can enter her password while the administrator is present, or the administrator can enter a temporary password that the user changes afterwards.

    • Multiple accounts for one organization -- Organizations with multiple data collections may need multiple Upload accounts.

      • PubChem requires a separate Upload account with a unique username for each unique data source. If you have multiple substance collections that need to be treated independently within PubChem, you will need a separate Upload account for each. For example, the "NIST" and "NIST Chemistry WebBook" substance collections are both generated by the organization NIST, but are treated as separate data sources. (PubChem Upload regards your e-mail address as the username. It is still possible to open a second account with the same e-mail address; in that case, PubChem Upload will prompt you to create a different username for the second account.)

  • Account Login

    • Enter your e-mail address as your username. If your e-mail address is associated with two or more PubChem Upload accounts, then the e-mail address can serve as the login for only one account, and each other account must have a different username; the PubChem staff check for this during the registration process. If you had an account with the old PubChem Deposition Gateway, the username and password you had for that account are valid in the new PubChem Upload system.
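
The multi-user behavior described under "Special Situations" (separate logins, per-user tracking, joint access, and a single primary user who manages the others) can be pictured with a small model. This is only an illustrative sketch in Python; the class and method names are hypothetical and do not correspond to any PubChem Upload interface.

```python
from dataclasses import dataclass, field


@dataclass
class Deposition:
    upload_id: int
    owner: str                                   # user who initiated the deposition
    history: list = field(default_factory=list)  # (user, action) pairs


@dataclass
class UploadAccount:
    """Hypothetical model of one Upload account shared by several users."""
    primary_user: str
    users: set = field(default_factory=set)
    depositions: dict = field(default_factory=dict)

    def __post_init__(self):
        self.users.add(self.primary_user)

    def add_user(self, acting_user: str, new_user: str) -> None:
        # Only the primary user (administrator) may add other users.
        if acting_user != self.primary_user:
            raise PermissionError("only the primary user can add users")
        self.users.add(new_user)

    def start_deposition(self, user: str, upload_id: int) -> Deposition:
        self._require_member(user)
        dep = Deposition(upload_id, owner=user, history=[(user, "upload")])
        self.depositions[upload_id] = dep
        return dep

    def act(self, user: str, upload_id: int, action: str) -> None:
        # Joint access: any user on the account may act on any deposition,
        # but the owner stays the user who initiated it.
        self._require_member(user)
        self.depositions[upload_id].history.append((user, action))

    def _require_member(self, user: str) -> None:
        if user not in self.users:
            raise PermissionError(f"{user} is not a user on this account")


account = UploadAccount(primary_user="admin")
account.add_user("admin", "jane_doe")
account.add_user("admin", "juan_carlos")
dep = account.start_deposition("jane_doe", upload_id=1)
account.act("juan_carlos", 1, "commit")
```

After these calls, jane_doe is still recorded as the owner, while the history shows that juan_carlos performed the commit.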

Three main phases in submission process

  1. Upload data into Pending Submissions database

    • When you upload substances or bioassays, they initially go into a pending submissions database. You can submit data into the pending submissions database from a test account or a full account.

      There are some things you can do in the pending submissions database that you can no longer do once your data are committed to PubChem. For example, it is easy to delete individual substances or whole submissions (i.e., the complete set of substance or bioassay data that have been submitted under a given Upload ID) from the pending database, but once they have been committed to PubChem, they can only be revoked.

      If you upload data to PubChem Upload, but do not complete the submission by pressing the "Commit" button, your data will remain in the pending submissions database until you press "Commit." The system will periodically send automated e-mail reminders that your upload is unfinished. PubChem Upload is not a long-term storage system, and unfinished data may be removed after a reasonable timeframe. Upload data is never added to the PubChem database until the submitter has committed it.

      If you are concerned about keeping your data private until a specific date, you can indicate a "PUBCHEM_HOLD_UNTIL_DATE" as part of the submission process. In such a case, your data will be added to PubChem after you press "Commit," but they will not be visible to the public until the "PUBCHEM_HOLD_UNTIL_DATE" that you specified.


  2. Review the pending records and edit if/as needed

    • While your data are in the pending submissions database, you have options to edit them if/as needed (for example, to correct errors reported by the validation procedures, or to enhance annotations), to preview them in PubChem, and to commit or delete the data. (See illustrated examples of the interfaces that display pending substances and pending assays.)

  3. "Commit" the data into the official PubChem database

    • When you "commit" your data, they are moved from the pending submissions database into the PubChem database. You must have a full account (which includes signing of a data transfer agreement) in order to commit your data into PubChem. Once in PubChem, your data will undergo some additional processing in order to fully integrate them into the database. By default, data are released immediately after processing is complete, unless you have indicated a PUBCHEM_HOLD_UNTIL_DATE.
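
The three phases above can be summarized as a small state model. The sketch below is a simplification under stated assumptions: the names `Status` and `Submission` are hypothetical, and the processing delay between "Commit" and public release is ignored.

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"    # in the pending submissions database
    ON_HOLD = "on hold"    # committed, private until the hold-until date
    PUBLIC = "public"      # committed and publicly released
    DELETED = "deleted"    # removed from the pending database


class Submission:
    """Hypothetical model of one submission moving through the phases."""

    def __init__(self, hold_until_date=None):
        self.status = Status.PENDING
        self.hold_until_date = hold_until_date

    def delete(self):
        # Deleting is possible only while the submission is still pending.
        if self.status is not Status.PENDING:
            raise ValueError("only pending submissions can be deleted")
        self.status = Status.DELETED

    def commit(self, full_account: bool):
        if self.status is not Status.PENDING:
            raise ValueError("only pending submissions can be committed")
        # A test account can upload and review, but cannot commit.
        if not full_account:
            raise PermissionError("a full account is required to commit")
        # Simplification: real release happens after additional processing.
        self.status = Status.ON_HOLD if self.hold_until_date else Status.PUBLIC
```

The model makes the asymmetry explicit: deletion belongs to the pending phase only, and the hold-until date decides whether a committed submission starts out private or public.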

Data Release

  • Immediate release of data

    • After you "Commit" your data into PubChem, it usually takes about one to two days for the data to be processed and publicly released. (Delays can sometimes occur, depending on the volume of data PubChem is receiving or other extenuating factors, such as technical issues due to infrastructure upgrades; however, such delays are infrequent.)

      By default, data are released immediately after processing is complete, unless you have indicated a PUBCHEM_HOLD_UNTIL_DATE.

      If you want to indicate a hold-until-date, it is critical that you do that as part of the INITIAL submission process, as noted below.


  • Hold-until date

    • If you prefer to keep your data private until a specific date, rather than have it released immediately, it is critical that you indicate a "PUBCHEM_HOLD_UNTIL_DATE" as part of the initial submission process. In such a case, your data will be added to PubChem after you press "Commit," but they will not be visible to the public until the "PUBCHEM_HOLD_UNTIL_DATE" that you specified. (The substance data specifications and bioassay data specifications provide additional details about the PUBCHEM_HOLD_UNTIL_DATE for each type of data.)

    • An initial hold of up to one year from initial submission is accepted, with the possibility of additional extensions.

    • You can update your record any time to remove the on-hold request and release the record to the public. Separate sections of this document describe how to remove the on-hold request for substances and how to remove the on-hold request for assays.

    • If you DO NOT specify a hold-until-date at the time of initial submission, your PubChem record will be released automatically after the record goes through standard PubChem data processing.

    • Once a record is "published" (publicly released) in PubChem, it can no longer be put back to the on-hold status. (That is, a retroactive hold cannot be placed on PubChem records that are already publicly accessible.) The only action available after that time is a revoke.

    • Notes for assay submissions:

      • If you have chosen to hold your assay data private, but you would like to provide access to select individuals such as collaborators or reviewers, you can create a URL that provides temporary access to a given on-hold assay submission (i.e., to a given AID and its associated substance records). You can then share it with the desired individuals. This function was added in order to facilitate collaboration (whether external or in-house) and administrative work such as publication and grant processing.

      • An assay cannot be released to the public before all of its associated substances (SIDs) are public. (Validation checks are performed on the release dates of an assay and its associated substance records, and an error is generated if a request is made to release an assay before all of its SIDs are public.) Therefore, if you would like to release your assay data to the public, you may need to do two steps if both the assay data and the substances are on-hold:

        1. Release the substances: Open your PubChem Upload home page/"Substances" folder tab/"On Hold" subfolder, and type the Assay ID (AID) in the text box entitled "Release (remove Hold Until Date) Substances." That will release all PubChem Substance records that are associated with that AID.

        2. Release the assay: Once the substances are uploading to PubChem, release the assay data by following the steps described in the Assay data release section of this document.
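
Two of the rules above lend themselves to a quick pre-check before you submit or release data: the initial hold may extend up to one year from submission, and an assay cannot be released until all of its SIDs are public. The sketch below is an assumption-laden illustration, not PubChem's validation code; the function names are hypothetical, and how the system itself counts "one year" is not specified in this document.

```python
from datetime import date


def hold_date_is_acceptable(submitted: date, hold_until: date) -> bool:
    """True if hold_until is after submission and at most one year later."""
    try:
        limit = submitted.replace(year=submitted.year + 1)
    except ValueError:
        # Submitted on Feb 29: assume Feb 28 of the following year as the limit.
        limit = submitted.replace(year=submitted.year + 1, day=28)
    return submitted < hold_until <= limit


def can_release_assay(sid_is_public: dict) -> bool:
    """An assay may be released only when every associated SID is public."""
    return all(sid_is_public.values())
```

For example, `can_release_assay({101: True, 102: False})` is false because one of the (made-up) SIDs is still on hold, which is why the substances are released first in the two-step procedure above.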

 
Home Page for your account

When you log in to your PubChem Upload account, the web browser automatically displays your PubChem Upload home page. It serves as a: (1) hub for accessing all of your PubChem data (including in-progress submissions that are in the pending submissions database, as well as data you have committed to the PubChem database, whether those data are publicly available or are being kept private until a release date that you've specified), and (2) springboard for starting new submissions, editing in-progress submissions, and updating your publicly available and/or on-hold PubChem records. Details about the features of your Upload home page are provided below, including the:


"new submission" button | "welcome" folder tab | "substances" folder tab | "assays" folder tab | "account settings" folder tab


"New Submission" button
[Image: thumbnail view of the Welcome folder tab on the PubChem Upload home page, which provides access to a tutorial, this help document, and an e-mail link for sending comments about the system]


The "New Submission" button appears in the upper right corner of your PubChem Upload home page, regardless of which folder tab you are viewing. After you click on "New Submission," you will be asked what type of data (bioassay or substances) you want to submit, and what data entry method you prefer (upload files or fill in forms).

If you choose to fill in forms, you will be asked if there is a particular PubChem record that you would like to start from (i.e., that you would like to use as a template).
  • For a substance submission, you can enter either the PubChem Substance identifier (SID) or PubChem Compound identifier (CID) of the record that you'd like to use as a template. If you use this option, PubChem Upload will import the chemical structure, chemical name, and synonyms of the SID into the "new submission" form.

  • For an assay submission, you can enter the Assay identifier (AID) of the record you'd like to use as a template. If you use this option, PubChem Upload will import the assay's protocol and description into a new submission, but NOT its RegID, and not its data. However, the protocol supplies the data definition (the column headers) that will be used in the data table.
Select "no" (default) if you prefer to start your submission from scratch rather than use a template. The wizards will prompt you for the information needed at each step of the submission process, with asterisks indicating required data elements. Click on the "?" at the right edge of the form beside any data field to see hints for completing that field.
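
As a rough picture of what "use as a template" copies, the sketch below keeps only the fields named above: structure, name, and synonyms for a substance; protocol and description for an assay (never the RegID or the data). The dictionary keys and the function name are hypothetical and chosen only for illustration.

```python
# Fields carried over from a template record, per the description above.
TEMPLATE_FIELDS = {
    "substance": ("structure", "name", "synonyms"),
    "assay": ("protocol", "description"),
}


def new_submission_from_template(record: dict, kind: str) -> dict:
    """Return a new submission dict holding only the template fields."""
    if kind not in TEMPLATE_FIELDS:
        raise ValueError(f"unknown submission type: {kind!r}")
    return {name: record[name] for name in TEMPLATE_FIELDS[kind] if name in record}
```

Anything not listed for the given type, such as an assay's RegID or its data table, is simply left out of the new submission.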

Upload ID:
Whether you begin a submission from scratch or use an existing PubChem record as a template, the PubChem Upload system will assign an Upload ID to your new submission. Your Upload IDs are visible only to you through your PubChem Upload Home Page (in the Substances/Pending, Substances/SubmissionHistory, and Assays/Pending subfolders).
  • For substance submissions, an Upload ID can be associated with one to many substances. That is, a new substance submission can be used to deposit one to many small molecules. A single Upload ID will be assigned to that data set. When the submission is committed to PubChem, each small molecule will receive its own PubChem Substance Identifier (SID). Once the molecules are in the PubChem Substance database, only the SIDs are shown in the PubChem records. The association between the Upload ID and corresponding SIDs is visible only to you, in the Substances folder tab, Submission History subfolder, of your PubChem Upload Home Page.

  • For assay submissions, each Upload ID is associated with a single assay submission. When the submission is committed to PubChem, the assay will receive a PubChem Assay Identifier (AID). (Therefore, for each Upload ID, there is only one corresponding AID.) Once the assay record is in the PubChem BioAssay database, only the AID is shown.
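
The relationship between identifiers described above (one substance Upload ID to many SIDs; one assay Upload ID to exactly one AID) can be sketched as a simple data structure. The class below is hypothetical, and all identifier values used with it are made up.

```python
from collections import defaultdict


class SubmissionHistory:
    """Hypothetical record of which identifiers each Upload ID produced."""

    def __init__(self):
        self.sids_by_upload = defaultdict(list)  # Upload ID -> [SID, ...]
        self.aid_by_upload = {}                  # Upload ID -> AID

    def record_substances(self, upload_id, sids):
        # One substance Upload ID can yield one to many SIDs.
        self.sids_by_upload[upload_id].extend(sids)

    def record_assay(self, upload_id, aid):
        # One assay Upload ID corresponds to exactly one AID.
        if upload_id in self.aid_by_upload:
            raise ValueError("this Upload ID already has an AID")
        self.aid_by_upload[upload_id] = aid

    def sid_list(self, upload_id):
        return list(self.sids_by_upload.get(upload_id, []))


history = SubmissionHistory()
history.record_substances(10, [5001, 5002, 5003])
history.record_assay(11, 900)
```

Here `history.sid_list(10)` plays the role of the SID list that the Submission History subfolder lets you download for a given Upload ID.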
The PubChem Upload FAQs provide additional information about the various types of identifiers, including Upload ID, Registry ID (RegID), Substance ID (SID), Compound ID (CID), and Assay ID (AID).

"Welcome" folder tab


Use the Welcome folder tab (illustrated above) to access a tutorial, this help document, and an e-mail link to the PubChem depositions help desk, for comments and questions about the system.

"Substances" folder tab
[Image: thumbnail view of the Substances folder tab on the PubChem Upload home page, which provides access to all of your pending and in-PubChem substance records]

"pending" substances | "in PubChem" substances | "on hold" substances | "submission history"
"Pending" subfolder:
This subfolder lists, in a tabular format, all of the pending substance submissions you have in the PubChem Upload system. From here, you can:
  • Commit or delete submission
    • Select the check box(es) of the submission(s) that you want to commit or delete, then click on the desired button (commit or delete). Note that you must have a full account in order to commit your data into PubChem.

  • View/edit submission
    • If you mouse over a row, it will be highlighted in yellow and a row of buttons will appear, indicating the available actions for that submission. If you click on view/edit submission, Upload will open the selected data set in the Substance submissions interface, where you can add more substances to the data set, or delete substances from it, and where you can modify/add annotations. A separate section of this document provides more details about the Substance submissions interface, and about the status codes for each Upload ID.

"In PubChem" subfolder:
This subfolder lists, in a tabular format, all of the substance records you have in the PubChem Substance database, including those that are publicly available as well as those that were committed to PubChem but are being kept private until the hold_until_date you have specified. If you mouse over any row in the table, you can choose from the following actions:
  • View in PubChem
    • Opens your substance record in the PubChem Substance database. (If your record is on-hold, then the link will open a preview of the record that is only accessible from your PubChem Upload account.)

  • Use as Template
    • Copies the data from this record into a "new submission" form, thereby using the record as a template. If you use this option, PubChem Upload will import the chemical structure, chemical name, and synonyms of the SID into the "new submission" form. If you own the SID that you have entered, then the PubChem Upload system will give you a choice to either use the SID as a template or to make modifications to that record.

  • Modify
    • Once you have chosen "modify" for a substance in the "In PubChem" folder tab, that item instantly appears in the "Pending" folder tab, with an upload ID that has been assigned to the "modify" action. As long as that SID appears in the "Pending" folder tab, you can only modify it from the "Pending" folder tab. You cannot modify it from the "In PubChem" folder tab, because doing that would create a duplicate "modify" entry in the Pending folder tab for the same RegID. (For example, if you try to modify a substance from the "In PubChem" folder tab while a modify request for that substance already exists in the "Pending" folder tab, the system will display the following error message: "Failed to create substance: RegID/name already exists for pending deposition".)

  • Revoke
    • Allows you to suppress your PubChem record from Entrez searches. Once your substance is revoked, it will only be publicly available through the PubChem Substance Summary service if a user enters the record's SID. (PubChem does not allow you to delete data that were previously uploaded to the archive, but the revoke operation allows you to suppress this data from PubChem search results.)

"On Hold" subfolder:
  • Release On-Hold Substances
    • The "On Hold" subfolder allows you to release, if desired, any of the substance records that you have committed to the PubChem Substance database but that are being kept private until the hold_until_date you have specified. To do this, log in to PubChem Upload and click on the "Substances > On Hold" tabs. There you will see a text box entitled "Release (remove Hold Until Date) Substances," where you can either:
      • paste a list of SIDs (one per line) of the substances you would like to release,
        OR
      • enter the AID of a PubChem BioAssay record that you submitted, in order to release all of the PubChem Substance records that are associated with that AID.
      In either case, the substances will be released with the daily upload schedule. This function automatically commits your data to PubChem, so if you change your mind, you need to email the PubChem depositions help desk (pubchem-deposit-help@ncbi.nlm.nih.gov) quickly!
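
Because the release takes effect automatically, it can be worth checking a pasted SID list before submitting it. The sketch below is a depositor-side pre-check written for illustration; the function name is hypothetical, and the validation that PubChem Upload itself applies to this text box is not described here.

```python
def parse_sid_list(text: str) -> list:
    """Parse SIDs pasted one per line into a de-duplicated list of ints."""
    sids = []
    seen = set()
    for line_number, line in enumerate(text.splitlines(), start=1):
        value = line.strip()
        if not value:
            continue  # ignore blank lines
        if not (value.isascii() and value.isdigit()):
            raise ValueError(f"line {line_number}: {value!r} is not a numeric SID")
        sid = int(value)
        if sid not in seen:  # keep the first occurrence, preserve order
            seen.add(sid)
            sids.append(sid)
    return sids
```

A line that is not a plain integer raises an error that names the offending line, so typos can be caught before anything is released.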

"Submission History" subfolder:
  • Substance Submission History
    • The "Submission History" subfolder allows you to download a list of substance identifiers (SIDs) that were assigned to a particular Upload ID. Downloading an SID list is the only function available in the subfolder. (In contrast, the "In PubChem" tab lists all of the SIDs associated with your account, and does not group them by Upload ID. You can use that tab to view/edit your substance records, or use them as a template for a new submission.)

"Assays" folder tab
[Image: thumbnail view of the Assays folder tab on the PubChem Upload home page, which provides access to all of your pending and in-PubChem BioAssay records]

"pending" assays | "in PubChem" assays | "tools"
"Pending" subfolder:
This subfolder lists, in a tabular format, all of the pending assay submissions you have in the PubChem Upload system. If you mouse over any row in the table, you can choose from the following actions:
  • Commit or delete submission (check box)
    • Select the check box(es) of the submission(s) that you want to commit or delete, then click on the desired button (commit or delete). Note that you must have a full account in order to commit your data into PubChem.

  • View/edit submission (mouseover row)
    • If you mouse over a row, it will be highlighted in yellow and a row of buttons will appear, indicating the available actions for that submission. If you click on view/edit submission, Upload will open the selected data set in the Assay submissions interface, where you can modify or add assay data and/or annotations. A separate section of this document provides more details about the Assay submissions interface, and about the status codes for each Upload ID.

"In PubChem" subfolder:
This subfolder lists, in a tabular format, all of the assay records you have in the PubChem BioAssay database, including those that are publicly available as well as those that were committed to PubChem but are being kept private until the hold_until_date you have specified. If you mouse over any row in the table, you can choose from the following actions:
  • View in PubChem
    • Opens your assay record in the PubChem BioAssay database. (If your record is on-hold, then the link will open a preview of the record that is only accessible from your PubChem Upload account.)

  • Use as Template
    • Copies the data from this record into a new submission form, thereby using the record as a template. If you use this option, PubChem Upload will import the assay's protocol and description into a new submission, but NOT its RegID, and not its data. However, the protocol supplies the data definition (the column headers) that will be used in the data table.

  • Modify
    • Once you have chosen "modify" for an Assay listed on the "In PubChem" folder tab, that assay instantly appears in the "Pending" folder tab, with an upload ID that has been assigned to the "modify" action.

      As long as that AID appears in the "Pending" folder tab, you can only modify it from the "Pending" folder tab. You cannot modify it from the "In PubChem" folder tab, because doing that would create a duplicate "modify" entry in the Pending folder tab for the same RegID. (For example, if you try to modify an Assay from the "In PubChem" folder tab while a modify request for that assay already exists in the "Pending" folder tab, the system will display the following error message: "Failed to create assay: RegID/name already exists for pending deposition".)

      While a PubChem BioAssay record is undergoing modification, the "Pending" folder tab, "Action" column will show (-.-) as a placeholder for the Version.Revision; the integers will be populated at a later processing step.

      Note that you can use the "Modify" option to release your assay data if they are on hold.

  • Revoke
    • Allows you to suppress your PubChem record from Entrez searches. Once your assay is revoked, it will only be publicly available through the PubChem BioAssay Summary service if a user enters the record's AID. (PubChem does not allow you to delete data that were previously uploaded to the archive, but the revoke operation allows you to suppress these data from PubChem search results.)

  • Export SID list
    • Allows you to obtain the substance IDs (SIDs), including those on-hold, for any given assay.

  • Create URL

"Tools" subfolder:
The Assay Tools are particularly useful for siRNA assay submissions that include gene symbols or nucleotide sequence accessions (such as mRNA RefSeq accessions). The Assay Tools look up the Gene ID that corresponds to each gene symbol or nucleotide accession, and you can then include the Gene IDs in your assay submission, if desired. That will establish a bidirectional link between your assay and the Gene record(s), enabling users to easily access both.
  • Gene Symbol to Gene ID
    • The "Gene Symbol to Gene ID" subfolder allows you to enter one or more gene symbols and the taxonomy ID (TaxID) of each source organism, and returns the Gene ID(s) for the corresponding record(s) in the Gene database. (For the gene symbol input, TaxIDs are also needed because symbols are not unique to a given species.)
      For example, enter the gene symbol CLOCK and the TaxID 9606, and the tool returns Gene ID 9575 for the human CLOCK circadian regulator gene. (Note: If you enter an alternative symbol by which the gene has been known, such as KAT13D, the Assay Tools will still find the correct Gene ID, as long as the alternate gene symbol you have entered is listed in the "Also known as" section of the Gene record.)

  • Nucleotide Accession to Gene ID
    • The "Nucleotide Accession to Gene ID" subfolder allows you to enter the accession number(s) of a nucleotide sequence record(s), and returns the Gene ID(s) for the corresponding record(s) in the Gene database.
      For example, enter the nucleotide accession AF011568 for the Homo sapiens CLOCK mRNA and the tool returns Gene ID 9575 for the human CLOCK circadian regulator gene. (Note: If you enter any of the nucleotide accession numbers listed in the "reference sequences" or "related sequences" sections of the gene records, the Assay Tools will retrieve the correct Gene ID. You can also enter the identifier in the form of accession.version, such as AF011568.1.)
After obtaining the Gene ID(s), you can include them in your assay submission, either as a column in the data table, or as XRefs in the assay description. For example, if your data table includes a column with a long list of gene symbols, you could create a new column in the data table for the corresponding Gene IDs (setting the column type to be "NCBI Gene-Id"). You could then paste your list of gene symbols into the Assay Tools tab, along with a single TaxID (if all of the genes are from a single organism), or one TaxID for each gene symbol (if the genes are from two or more organisms). After the tool looks up the corresponding Gene IDs, you can copy/paste them into the new data column. Once your PubChem BioAssay data becomes public, users who view the Gene record(s) will be able to easily access your assay, and those who view your assay will be able to easily access information about the genes that you tested.
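The pairing rule described above (one TaxID shared by every gene symbol, or one TaxID per symbol) can be prepared ahead of time before pasting into the Assay Tools tab. The following is only a minimal sketch in Python; the helper name pair_symbols_with_taxids is hypothetical and is not part of PubChem Upload. TaxID 9606 is the NCBI Taxonomy ID for Homo sapiens.

```python
def pair_symbols_with_taxids(symbols, taxids):
    """Pair each gene symbol with a TaxID (hypothetical helper, not a PubChem API)."""
    if len(taxids) == 1:
        # Single organism: reuse the one TaxID for every gene symbol.
        taxids = list(taxids) * len(symbols)
    if len(taxids) != len(symbols):
        raise ValueError("Provide one TaxID, or exactly one TaxID per gene symbol")
    return list(zip(symbols, taxids))


# Two symbols for the same human gene, sharing one TaxID.
pairs = pair_symbols_with_taxids(["CLOCK", "KAT13D"], [9606])
for symbol, taxid in pairs:
    print(f"{symbol}\t{taxid}")
```

Raising an error on a length mismatch is a design choice for this sketch: it avoids silently pairing a symbol with the wrong organism.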

"Account Settings" folder tab

account info | contacts | upload preferences | data source categories
The "Account Settings" folder tab allows you to manage your contact information and account preferences.
"Account Info" subfolder:
  • This subfolder shows your contact information. Click on any line in that subfolder to make edits.
"Contacts" subfolder:
  • You will see a "Contacts" subfolder only if you are the Upload account administrator (i.e., primary user) for your organization. You can add additional e-mail addresses for yourself, and/or you can add other users, in that tab. When users who are listed as contacts log into their account, they will be able to view/edit their "Account Info" tab, but will not see the "Contacts," "Upload Preferences," and "Data Source Categories" subfolders. Those subfolders can be viewed/edited only by the account administrator.
"Upload Preferences" subfolder:
  • The "Upload Preferences" subfolder enables you to select options for submission processing, such as:

    • For FTP depositors only: Auto-confirm all substance and assay depositions?

      This checkbox only applies to substance and assay depositions made via FTP. If checked, all such substance and assay (Alter description only) depositions will be automatically confirmed on your end if they pass validation. This means you will not have to click on the "Commit" button on the user interface in such a case. The submission will still need to be reviewed and approved by a PubChem curator, but one manual step will be eliminated. Note that for assay submissions this automated process only applies to the alter description type of depositions.

    • For substances provided without chemical structure, attempt to generate from synonyms?

      If this box is checked and if no structure is provided in the substance record, PubChem processing will attempt to use provided synonyms to auto-generate the deposited compound structure. This processing includes the use of CID as synonym (e.g. "CID1" will use the structure of CID 1 for the structure record), matching synonym to MeSH (e.g. "Aspirin" will use the structure of CID 2244), and name to structure software (e.g. "2-acetyloxybenzoic acid" will yield the same structure as CID 2244).

    • If 3D substance coordinates supplied, are they experimental?

      If this box is checked, the depositor confirms that all 3D substance coordinates supplied were experimentally-derived. If 3D coordinates were generated by a computational algorithm, do not mark this box as it is not in the scope of the PubChem database to display such information.

    • Include CIDs with 'Get SIDs' substance download report?

      If this box under the Preferences tab is checked, an extra CID column will be included at the end of the CSV file downloaded with the 'Get SIDs' link for substance depositions. Note: If you are not the administrator (primary user) on your PubChem Upload account, you will need to ask that person to log in and check this box.

    • Ignore Hold-Until dates already in the past in substance records?

      If this box under the Preferences tab is checked, any substance record Hold-Until date set in the past will be stripped out and ignored for the sake of versioning.

    • Avoid registry ID in list of chemical structure synonyms?

      If this box under the Preferences tab is checked, registry IDs will not be used as synonyms, so they will not get used as preferential names for records.

    • For RNAi Assay depositors only: Use outside RNAi substance provider?

      If this box under the Preferences tab is checked, RNAi assay depositors are able to use substance records from an outside RNAi provider in addition to their own deposited substance records.

"Data Source Categories" subfolder:
  • Depositor categories indicate the nature of a depositor's institution and the type of data they supply. They give users an idea about the type of information they can expect to find for a molecule in that depositor's PubChem records or on the depositor's web site. Initial account setup allows only one selection; in this subfolder, you can add more categories. Examples of the available categories include:
Depositor Category | Meaning
Biological Properties | Depositor provides information about the biological properties of a substance or compound.
Chemical Reactions | Depositor provides information about the reactivity, synthesis, or known reactions of a substance or compound.
Imaging Agents | Depositor provides information about the contrast agent or imaging agent used in, for example, MRI.
Journal Publishers | Depositor is a journal publisher and has articles published about a substance or compound.
Metabolic Pathways | Depositor provides information on the metabolic pathways involving a substance or compound.
Molecular Libraries Screening Center Network | Depositor is part of the NIH Molecular Libraries Screening Center Network (MLSCN).
NIH Substance Repository | Depositor is an NIH Molecular Libraries Small Molecule Repository servicing the MLSCN.
Physical Properties | Depositor provides information about the experimental physical properties of a substance or compound.
Protein 3D Structures | Depositor provides information about the experimental 3-D structure of a substance or compound. (Most of the molecule records that fall into this depositor category are derived from Molecular Modeling Database records, which generally contain the 3-D structures of biomolecules, such as proteins, that may be bound to the substance or compound.)
Substance Vendors | Depositor is a seller of a substance or compound.
Theoretical Properties | Depositor provides information about the theoretical properties of a substance or compound.
Toxicology | Depositor provides information about the toxicological properties of a substance or compound.

 
Substance Submissions

general guidelines | data entry methods | data specifications | file formats (SDF, CSV, compressed files (GZIP)) | sample files (chemical submission, RNAi submission) | data processing for substances (submission status, seven steps in a substance submission, submission status report) | data release

General guidelines for submitting substances

The types of substances submitted to PubChem, and the annotations that accompany them, can vary widely. For this reason, the minimum amount of information needed in a submission is a substance name and Registry ID (REG ID, which is a submitter's local identifier for the substance). No single piece of annotation is required. However, PubChem encourages the inclusion of as much annotation as possible in order to make the data as useful as possible. The types of annotations that can be included are listed in the substance data specifications.
Data entry methods for substances -- When you begin a new submission to the PubChem Substance database, you have the option to:
  • Fill in forms (use wizards), where you can either:

    • Use a template -- Enter the CID or SID of the PubChem record of interest in order to copy the chemical structure, chemical name, and synonyms from that compound or substance record into your new submission forms.

      OR

    • Start from scratch -- If you say "no" to the option of entering a CID or SID, you can enter your substance submission from scratch.

    • In either case, wizards will assist you in entering your data without requiring knowledge of detailed data specifications. Hints appear at the bottom of most screens, providing tips on data entry and links to examples in the upload tutorial, if/as appropriate. After you type or import data into the wizards, the Upload system will prepare a properly formatted file that conforms to the data specifications.

  • Upload files

    • If you choose to upload files, you can format your data as an SDF or CSV file. Files can be compressed, if desired. Examples of each file format are provided in the next section.

Data specifications, file formats & sample files for submitting substances

Small molecules and RNAi can be submitted to the PubChem Substance database in either SDF or CSV format. The allowable tags that can be used in both formats are provided in the data specifications below.

  • Data specifications for submissions to PubChem Substance:


  • File formats that can be used for submissions to PubChem Substance:

    • SDF - The *.sdf format is generally used for submitting chemicals because they typically include chemical structures. SDF is one of a family of chemical-data file formats intended especially for structural information, and "SDF" stands for structure-data file. The *.sdf file format contains two main types of information for each small molecule submitted: (1) an optional structure specification, in the form of 2D or 3D coordinates, SMILES, InChI, chemical synonyms, or a PubChem Compound ID (CID), and (2) metadata that use the appropriate tags from the data specifications and provide descriptive information about the substance. If the structure is specified as a chemical synonym or CID, the PubChem Upload system will find the corresponding PubChem Compound record and import the chemical structure from that Compound record into your substance submission. Example SDF files are provided below. (Note that chemicals can also be submitted in *.csv format, with their structures specified in any of the forms mentioned above (preferably CID), except in the form of explicit 2D or 3D coordinates (coordinates can only be accommodated by *.sdf files).)

    • CSV - The *.csv format is most convenient for submitting RNAi's because they generally don't include a chemical structure and the tabular format makes it easy to note relevant information, such as gene or protein targets. A *.csv file can include any/all of the SD tags listed in the substance data specifications, and they serve as column headers. The SD tags can come in any order, with the exception of PUBCHEM_EXT_DATASOURCE_REGID, which should always be in the first column. Each substance must occupy exactly one row. When an SD tag is associated with multiple values, the values are separated with a newline character inside that tag's cell. (Hint: when entering data with Excel, use Alt-Enter to create a newline in a data cell, or double-click a data cell before pasting a multiple-line entry into it.) Example CSV files are provided below. CSV files can be imported into any spreadsheet program for viewing or editing.

      • Information about the rules for proper formatting of CSV files is provided on sites such as Wikipedia (http://en.wikipedia.org/wiki/Comma-separated_values) and the Internet Engineering Task Force (IETF) (http://tools.ietf.org/pdf/rfc4180.pdf, http://tools.ietf.org/html/rfc4180). For example:

        • If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example, if your CSV file contains a field that looks like:
          QC'd by "Quicktest"
          then that field should be revised to look like:
          "QC'd by ""Quicktest"""

      • If you are uploading a very large file through a web browser interface, it is possible you might see an error message such as a time out or bad gateway. This usually means that your browser timed out before it could get notification back to you confirming that your file was uploaded. If you simply close the pop-up that displays the error message and wait a few minutes, you should see that the file has been uploaded. After a few minutes, refresh your original window, which should display a message such as "validating" or "validated." If you see a message such as "parsing failed," then the file might have had some problem. But generally, PubChem Upload eventually receives everything in large files, even if a time out error message is displayed during the upload process.

      • As an alternative, you can use FTP to upload very large files. Details about requesting an FTP account and the procedures for FTP submissions are provided elsewhere in this help document. Please note, however, that FTP is mainly intended as a way to begin a new submission, and not necessarily as a way to add to an existing submission.
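The CSV rules above (registry ID in the first column, multiple values separated by a newline inside one cell, embedded double quotes doubled) are handled automatically by most CSV libraries. The following is a minimal sketch using Python's standard csv module. The registry ID, synonyms, and function name are made up for illustration, and PUBCHEM_SUBSTANCE_SYNONYM is assumed here to be an allowed tag; check the substance data specifications before relying on any tag name.

```python
import csv
import io

# Column headers: the REGID column must come first. The synonym tag name is an
# assumption for this sketch; confirm it against the data specifications.
HEADERS = [
    "PUBCHEM_EXT_DATASOURCE_REGID",
    "PUBCHEM_SUBSTANCE_SYNONYM",
    "PUBCHEM_SUBSTANCE_COMMENT",
]


def substances_to_csv(records):
    """Serialize (regid, synonyms, comment) tuples, one substance per CSV row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADERS)
    for regid, synonyms, comment in records:
        # Multiple values for one tag share a cell, separated by newlines.
        writer.writerow([regid, "\n".join(synonyms), comment])
    return buffer.getvalue()


csv_text = substances_to_csv([
    ("EXAMPLE-001", ["example siRNA", "example alias"], 'QC\'d by "Quicktest"'),
])
print(csv_text)
```

The writer encloses the multi-line cell in double quotes and doubles the quotes around Quicktest, so the file still holds exactly one substance per CSV record.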

    • Compressed files - Submissions uploaded by FTP may be compressed. This may substantially reduce the time it takes to transfer your data. We support files compressed using the "gzip" compressor. Please note that we do not support "zip" or "bzip2" compressed files.

    • If you choose to enter your data by filling in forms (using the wizards) provided by the PubChem Upload system, rather than by uploading files, the data you enter will be formatted for you.
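If you plan to send a compressed file by FTP, the gzip format mentioned above can be produced with standard tools. The following is a minimal sketch using Python's standard gzip module; the function name gzip_file and the choice to append ".gz" to the original file name are assumptions for illustration, not requirements stated in this document.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(source_path):
    """Write a gzip-compressed copy of source_path next to it; return the new path."""
    source = Path(source_path)
    target = source.with_name(source.name + ".gz")
    # Stream the bytes so that large submission files are not held in memory.
    with open(source, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

For example, gzip_file("substances.sdf") would create substances.sdf.gz in the same directory and leave the original file untouched.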


  • Sample files for submissions to PubChem Substance:

    • SDF file examples for submitting chemicals:

      • new submission
        • This sample file contains 9 substances, which vary primarily in how their chemical structures are depicted and in the amount of annotations provided. The chemical structures are represented either as 3D coordinates, SMILES, InChI, a MeSH synonym, or a PubChem Compound ID. If a structure is provided as a MeSH synonym or PubChem Compound ID, then PubChem Upload will retrieve the PubChem Compound record that is associated with the specified MeSH term or CID, and will use the chemical structure from that PubChem Compound record. (A separate file provides additional information about chemical name synonyms, including MeSH.) In the sample SDF file, each substance is tagged with a PUBCHEM_EXT_DATASOURCE_REGID (external registry ID), and salient features of the example are noted in the corresponding PUBCHEM_SUBSTANCE_COMMENT field. The end of an individual substance is denoted by four dollar signs $$$$.

      • update existing PubChem substance record
        • SDF files for updates generally look no different than SDF files for new submissions. The only distinction is that an update uses the same regid that is already present in an existing PubChem Substance record (from the *same* data source, either publicly available or on hold), while a new submission uses a regid that is not yet found in any PubChem Substance record from that data source. If the regid provided in an SDF file matches a record that's already in PubChem Substance from the same data source, it's an update; if not, it's considered new.

      • revoke PubChem substance record
        • Registry IDs (regids) that are listed in the revoke file must exactly match existing records in PubChem Substance, and must be from the *same* data source. Only two fields are necessary in the revoke file: the regid and a short comment explaining the revoke.

      • mix of new submission, update, and revoke
        • If desired, you can submit a single *.sdf file that contains a mix of new substance submissions, updates to existing records, and revokes.
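Before uploading, it can help to list the registry IDs in an SDF file, since the regid decides whether a record is treated as new, an update, or a revoke. The following is a rough sketch in Python that relies only on the general SDF conventions noted above: data items introduced by a "> <TAG>" line and records terminated by $$$$. The function name and the two placeholder records (with empty, zero-atom structure blocks) are hypothetical and exist only to exercise the parser.

```python
def read_sdf_tags(sdf_text):
    """Return one {tag: value} dict per record found in an SDF string."""
    records, tags, current = [], {}, None
    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":            # end of one substance record
            records.append({k: "\n".join(v) for k, v in tags.items()})
            tags, current = {}, None
        elif line.startswith(">") and "<" in line:
            # Data header line such as "> <PUBCHEM_EXT_DATASOURCE_REGID>"
            current = line[line.index("<") + 1 : line.rindex(">")]
            tags[current] = []
        elif current is not None:
            if line.strip():
                tags[current].append(line)    # value line for the open tag
            else:
                current = None                # a blank line closes the data item
    return records


def placeholder_record(regid, comment):
    """Build a placeholder SDF record with an empty structure block (illustration only)."""
    return "\n".join([
        "", "  placeholder", "",                      # three molfile header lines
        "  0  0  0  0  0  0  0  0  0  0999 V2000",   # counts line: zero atoms
        "M  END",
        "> <PUBCHEM_EXT_DATASOURCE_REGID>", regid, "",
        "> <PUBCHEM_SUBSTANCE_COMMENT>", comment, "",
        "$$$$",
    ])


SAMPLE = "\n".join([
    placeholder_record("EXAMPLE-001", "First placeholder record."),
    placeholder_record("EXAMPLE-002", "Second placeholder record."),
]) + "\n"

regids = [r["PUBCHEM_EXT_DATASOURCE_REGID"] for r in read_sdf_tags(SAMPLE)]
print(regids)
```

The parser ignores the structure block entirely, so it only reports which tags are present; it does not check that a record would pass PubChem's own parsing or standardization.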

    • CSV file example for submitting RNAi's:

Data Processing for Substances

  • Substance Submission Status

    • A series of seven dots indicates the status of a substance submission. These dots appear on both your Upload home page (in the "Substances > Pending" folder tab), and in the "History" tab of the Substance submissions interface. In general, the various dot colors mean the following:

      Grey dot means that the step has not been started yet.
      Orange dot means the step is currently in progress (processing).
      Blue dot means the step has been completed.
      Red dot means the step has failed.

  • The seven steps in substance submissions are below, including a few notes about dot colors (status indicators) for some of the steps:

    1. Initial creation of submission.
      The only possible dot color for this step is blue.

    2. Parsing of submission.
      This is a check to see if the PubChem deposition system can recognize the file format, and if the PUBCHEM tags, especially the PUBCHEM_EXT_DATASOURCE_REGID, are present -- this step fails if PubChem cannot parse the chemical records out.

    3. Standardization of chemical structure.
      If any substances in a submission have failed standardization (red dot), corresponding error messages will appear in the "Validation" folder tab of the substance submissions interface. Click on an error message to view a list of the substances to which that error applies. Click on a substance in the list for options to edit that substance, either in forms or directly in its SDF file. If your substances do not have chemical information available, then it is expected and normal that they will "fail" Standardization.

    4. Validation part I.
      The first validation action crosschecks your submission with other submissions you have within PubChem Upload. These crosschecks detect duplicate registry IDs and duplicate structures.

    5. Validation part II.
      The second validation action crosschecks your submission with previous submissions already deposited in PubChem. These crosschecks detect duplicate registry IDs and duplicate structures.

    6. Commitment of submission.
      By pressing the commit button, you are giving permission for the data to be moved from the pending submissions database into PubChem.

    7. Deposition of data into PubChem.
      This step represents the stage at which SIDs have been assigned to the substances and are available for download through the PubChem Upload user interface.
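Because the validation steps above flag duplicate registry IDs, you may want to look for repeats in your own list of regids before uploading. The following is a minimal local pre-check sketched in Python; it is not how PubChem performs validation, the function name is hypothetical, and it cannot see regids in your other pending or deposited submissions.

```python
from collections import Counter


def find_duplicate_regids(regids):
    """Return the registry IDs that occur more than once, sorted alphabetically."""
    counts = Counter(regid.strip() for regid in regids)
    return sorted(regid for regid, n in counts.items() if n > 1)


duplicates = find_duplicate_regids(
    ["EXAMPLE-001", "EXAMPLE-002", "EXAMPLE-001 ", "EXAMPLE-003"]
)
print(duplicates)
```

Stripping surrounding whitespace before counting is an assumption of this sketch; it treats "EXAMPLE-001" and "EXAMPLE-001 " as the same identifier.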

  • Download a Substance Submission Status Report File

    • A full report that summarizes your substance submission (including assigned PubChem SIDs, Load Codes, and CIDs that have been mapped to your unique registry identifiers) is available for any Upload submission that is approved and loading to PubChem.

    • This report can be obtained from either:
      1. the PubChem Upload web interface (by pressing the "Get SIDs" button that appears in the substance submission interface, once you have committed the submission to PubChem), or,
      2. in the case of submissions made by FTP, the status report is provided in the *.out file that the Upload system places in your FTP directory after the substances have been received by PubChem.

    • The report includes one row for each substance record in your submission and has the following column headers:

Data Release of Substances

After committing your data into PubChem, substances can be released immediately (default) to the public or held private until a date that you specify. An initial hold of up to one year from initial submission is accepted, with the possibility of additional extensions, as noted in the overview/data release section of this document.
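The one-year limit on an initial hold can be checked with simple date arithmetic before you choose a hold-until date. The following is only a sketch in Python: the function names are hypothetical, and reading "one year" as the same calendar day in the following year (with February 29 mapped to February 28) is an assumption, not a rule stated in this document.

```python
from datetime import date


def latest_initial_hold(submitted):
    """Latest hold-until date, assuming 'one year' means the same day next year."""
    try:
        return submitted.replace(year=submitted.year + 1)
    except ValueError:
        # February 29 has no counterpart in the following year; use February 28.
        return submitted.replace(year=submitted.year + 1, day=28)


def is_acceptable_initial_hold(submitted, hold_until):
    """True if hold_until is on or after submission and within the assumed limit."""
    return submitted <= hold_until <= latest_initial_hold(submitted)


print(is_acceptable_initial_hold(date(2024, 1, 15), date(2024, 12, 31)))
print(is_acceptable_initial_hold(date(2024, 1, 15), date(2025, 1, 16)))
```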

You can update your record any time to remove the on-hold request and release the record to the public. To do this:
  • Open your PubChem Upload home page/"Substances" folder tab/"On Hold" subfolder.
  • There you will see a text box entitled "Release (remove Hold Until Date) Substances," where you can either:
    • Paste a list of SIDs (one SID per line) of the substances you would like to release
      OR
    • Enter the AID of a PubChem BioAssay record that you submitted, in order to release all PubChem Substance records that are associated with that AID. Important: this will only release the substance records, and not the associated assay record. You must take a separate step to remove the on-hold request for the assay record and release it to the public.
    In either case, the substances will be released with the daily upload schedule.
  • This function automatically commits your data to PubChem, so if you change your mind, you need to email the PubChem depositions help desk (pubchem-deposit-help@ncbi.nlm.nih.gov) quickly!

 
Assay Submissions

general guidelines (three parts to assay submission: (1) substance submission + (2) assay description + (3) assay data) | data entry methods | data specifications | file formats (spreadsheet, XML, ASN, CSV, compressed files (GZIP)) | sample files (single readout IC50; multiple readouts with dose-response data; panel assay with multiple targets; RNAi assay) | assay description fields | data processing for assays (submission status) | data release

General guidelines

The types of assays submitted to PubChem, and the annotations that accompany them, can vary widely. The minimum amount of information needed in a summary assay submission is an assay name and Registry ID (REG ID, which is a submitter's local identifier for the assay). For all other types of assays, the submissions must also include data. No single piece of annotation is required. However, PubChem encourages the inclusion of as much annotation as possible in order to make the data as useful as possible. The types of annotations that can be included are listed in the assay data specifications (separate document), and additional tips about the various assay description fields are provided in this document.
There are three parts to an assay submission:

(1) substance submission + (2) assay description + (3) assay data
  1. Substance submission -- A substance is the stuff being tested; typically it is what is in an assay plate well. A substance can be a discrete chemical entity (e.g. aspirin), or a complex mixture (e.g. a plant extract), or an RNAi molecule.

    The substances that were tested in your assay must be submitted to the PubChem Substance database before you submit your assay description and data to the PubChem BioAssay database. This is necessary because you will need to indicate their internal registry IDs (RegIDs) or PubChem Substance identifiers (SIDs) when you begin your bioassay submission, in order to associate the substance records with their bioactivity data.

    If you think the material in two assay plate wells is the same, we ask that you refer to it as the same substance with a single activity summary. If you think the material in two wells differs, please refer to them as two distinct Substances, hopefully with different chemical structures (or different mixtures), and surely with distinct Activity Summaries.

    It is of course very common to do replicates across different batches and salt forms of a Substance when you believe the salt form to be irrelevant to activity. Your data, however, must be reduced to a single activity summary per substance that is submitted as an integer value: "inactive" - 1, "active" - 2, "inconclusive" - 3 (if there are indeed contradictory replicates), "unspecified" - 4, or "probe" - 5.

    In this way, your results will be much more accessible and understandable to users through the various searching and graphing functions of the PubChem Bioassay system.
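    The integer encoding of the activity summary listed above can be captured in a small lookup, shown here as a Python sketch. The enumeration values come from this document; the reduction rule in summarize_replicates (agreement keeps the shared outcome, disagreement becomes "inconclusive", no data becomes "unspecified") is only one possible policy and is an assumption of this example, since the depositor decides how replicates are summarized.

```python
from enum import IntEnum


class ActivityOutcome(IntEnum):
    """Integer codes for the per-substance activity summary."""
    INACTIVE = 1
    ACTIVE = 2
    INCONCLUSIVE = 3
    UNSPECIFIED = 4
    PROBE = 5


def summarize_replicates(outcomes):
    """Reduce replicate outcomes to one summary (assumed policy, see lead-in)."""
    distinct = {ActivityOutcome(o) for o in outcomes}
    if not distinct:
        return ActivityOutcome.UNSPECIFIED
    if len(distinct) == 1:
        return distinct.pop()
    # Contradictory replicates are reported as inconclusive.
    return ActivityOutcome.INCONCLUSIVE


summary = summarize_replicates([ActivityOutcome.ACTIVE, ActivityOutcome.INACTIVE])
print(summary.name.lower(), int(summary))
```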

    Use either one of the following methods to submit the substances you have tested:

    • Fill in web forms: You can use the Upload Substance Submission wizards (i.e., fill out forms) for submitting chemicals or RNAi, and the data you enter will be formatted for you. If desired, you can use an existing SID or CID as a template for your new substance submission.    —OR—

    • Upload files: You can format your substance data as an SDF or CSV file and upload it.

      • chemicals are typically uploaded into the PubChem Substance database in the *.sdf file format (sample files), although the *.csv format is accepted as well.

      • RNAi are typically uploaded to the PubChem Substance database in the *.csv file format (sample files).

        Note: RNAi assay depositors also have the option of using the SID or External ID that was submitted to PubChem by an outside RNAi substance provider (such as a vendor), as an alternative to submitting RNAi substance records from scratch. To do this, you can just specify, in your assay data file, either the SID or External ID of the RNAi records that the outside provider submitted to the PubChem Substance database. This allows RNAi assay depositors to essentially skip the substance submission step of the "three parts to an assay submission."

        If the substance provider is an RNAi vendor, the External ID is usually a catalog ID. Both the External ID and the PubChem SID are displayed in PubChem Substance records, and can therefore be obtained by viewing the vendor's PubChem records. In addition, some vendors have provided files that list their catalog IDs and corresponding SIDs, and these are available on the PubChem FTP site (ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay/Extras/VendorCatalogs/).

        In order to exercise the option of using an RNAi vendor's catalog ID or SID (instead of your own SID), you need to upgrade your PubChem Upload account from a "test" account to a "full" account, and configure the account to indicate you are using a third-party or vendor's product, by checking the "For RNAi Assay depositors only: Use outside RNAi substance provider?" option under the "Upload Preferences" tab. (The Preferences tab is accessible from the "Account Settings" folder of your PubChem Upload home page.)

        Then, when you submit the assay data file (which is the third step of the three parts to an assay submission), you can enter the vendor's External ID or their PubChem SID for the RNAi you have tested.

        If you use the vendor's catalog ID, place it in the "PUBCHEM_EXT_DATASOURCE_REGID" column of your assay data file; if you use the vendor's SID, place it in the "PUBCHEM_SID" column of the data file.

        For example, if you tested the siGENOME siRNA reagent for TXNDC15, which was submitted to PubChem Substance by GE Healthcare Dharmacon RNAi Technologies, you could enter the submitter's External ID for that RNAi (in this case, "D-056123-17") in the "PUBCHEM_EXT_DATASOURCE_REGID" column of your assay data file, or you could enter the corresponding PubChem Substance identifier (193126385) in the "PUBCHEM_SID" column of the file.
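        The column choice described above can be expressed as a small helper, sketched here in Python. The column names and the example identifiers come from this document; the function name identifier_cell is hypothetical, and the sketch only decides which column receives the identifier, not the layout of the rest of the assay data file.

```python
def identifier_cell(external_id=None, sid=None):
    """Return (column name, value) for the substance identifier in an assay data row."""
    if (external_id is None) == (sid is None):
        raise ValueError("Give exactly one of external_id or sid")
    if external_id is not None:
        # A vendor catalog ID (External ID) goes in the REGID column.
        return ("PUBCHEM_EXT_DATASOURCE_REGID", external_id)
    # A PubChem Substance identifier goes in the SID column.
    return ("PUBCHEM_SID", str(sid))


print(identifier_cell(external_id="D-056123-17"))
print(identifier_cell(sid=193126385))
```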

         

    • The Substance submissions section of this document, and the substance data specifications (separate document), provide additional details about submitting substances.

  2. Assay description -- An Assay Description refers to the protocol and parameters of an assay, which can only be defined once. Fundamentally, the Assay Description defines the "columns" that are populated by the Assay Data "rows". Each "column" is assigned a result type identity (TID) in the Results Definition section. The Assay Data uploaded later must be reported in the same order as the TIDs defined in the Assay Description. Additionally, the Assay Data must be consistent with the Assay Description TID definitions.

  3. Assay data -- The results of your assay experiment. As long as they follow the protocol of the assay description, assay data for new substances can be continually added.

    • The assay data specifications (separate document) indicate the allowable result types (e.g., Float, Integer, Boolean or String), allowable result units (e.g., parts per thousand, million or billion; milliM, microM, nanoM, picoM, femtoM; percent; seconds, minutes, days; etc.), and additional descriptive information about the results (e.g., series_ID for defining dose-response curves, and active concentration indicator for confirmatory assays).

    • Information about file formats, as well as some sample files, for submitting assay descriptions and data are below.

Data entry methods for assay description and data -- When you begin a new submission to the PubChem BioAssay database, you have the option to:
  • Fill in forms (use wizards), where you can either:

    • Use a template -- Enter the AID of the PubChem record of interest in order to copy the protocol and description from that assay record into your new submission forms. (Upload will NOT copy the assay's RegID, and will not copy its data. However, the protocol will supply the data definition (the column headers) that will be used in the data table.)

      OR

    • Start from scratch -- If you say "no" to the option of entering an AID, you can enter your assay submission from scratch.

    • In either case, wizards will assist you in entering your data without requiring knowledge of detailed data specifications. Hints appear at the bottom of most screens, providing tips on data entry and links to examples in the upload tutorial, as appropriate. After you type or import data into the wizards, the Upload system will prepare a properly formatted file that conforms to the data specifications.

  • Upload files


Data specifications, file formats, & sample files for submitting assays

  • Data specifications for submissions to PubChem Assay:


  • File formats for submissions of assay descriptions and assay data to PubChem BioAssay:

     

    Note: The substances that were tested also need to be submitted, but in a separate procedure, as noted in the section of this document on "three parts to an assay submission." The information below is for the submission of assay descriptions and data.

     

    • Assay descriptions and biological activity data for small molecules and RNAi can be submitted to the PubChem BioAssay database in spreadsheet (XLSX or ODS), XML, ASN, or CSV format. Examples of each are provided below.

    • If you choose to enter your data by filling in forms (i.e., using wizards) provided by the PubChem Upload system, rather than by importing files, the data you enter will be formatted for you and can be exported in any of these file formats.

    • Compressed files - Submissions uploaded by FTP may be compressed. This may substantially reduce the time it takes to transfer your data. We support files compressed using the "gzip" compressor. Please note that we do not support "zip" or "bzip2" compressed files.

    • Tips for assay submissions in CSV format:

      The *.csv file can include any/all of the tags listed in the assay data specifications, and they serve as column headers. The CSV column ordering for the first seven columns is fixed and must be exactly as documented below. Beyond that, there must be a column for each result (TID) defined in the description. An example of an assay submission in CSV format is provided below, for the single readout IC50 assay.

      The best way to become familiar with CSV format is to use an existing assay as a template, enter your desired Assay Description, and save the template in CSV format. You can then cut and paste your data into this CSV file while maintaining the correct number of columns. For fields without data there will be nothing but consecutive commas. Your CSV file should have the column headers shown below as well as the names of the result definitions that you have defined; any deviations will cause errors.

      Note that any duplicated substance (SID/RegID) test results for a given assay (whether in the same data file or not) will be archived in PubChem. Only the most recent one will be available for searching.

      Additional details about the rules for proper formatting of CSV files are provided on sites such as Wikipedia (http://en.wikipedia.org/wiki/Comma-separated_values) and the Internet Engineering Task Force (IETF) (http://tools.ietf.org/pdf/rfc4180.pdf, http://tools.ietf.org/html/rfc4180). For example:
      If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example, if your CSV file contains a field that looks like:
      QC'd by "Quicktest"
      then that field should be revised to look like:
      "QC'd by ""Quicktest"""

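      The double-quote rule above does not have to be applied by hand. As a minimal sketch (the helper names to_csv_line and from_csv_line are illustrative and not part of PubChem Upload), Python's standard csv module applies the same doubling when it writes a field that contains a double-quote:

```python
import csv
import io


def to_csv_line(fields):
    """Serialize one row; the csv module doubles embedded double-quotes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(fields)
    return buffer.getvalue().rstrip("\n")


def from_csv_line(line):
    """Parse one CSV line back into a list of field values."""
    return next(csv.reader([line]))


comment = 'QC\'d by "Quicktest"'
line = to_csv_line([comment])
print(line)  # "QC'd by ""Quicktest"""
```

      Reading the line back with from_csv_line returns the original, unescaped text, which is a quick way to check a field before upload.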
      The following COLUMNS are accepted in your CSV file along with COLUMN HEADERS using the names of your result definitions. If a particular data cell does not have anything to report for a given column or it is not applicable, simply leave it blank.

      • Column 1: PUBCHEM_SID
        If you have previously deposited your Substance description into PubChem, you may use the Substance identifier (SID) assigned by PubChem. This must be an unsigned integer value and, in nearly all cases, your organization must have deposited the Substance associated with this SID. Alternatively, you may provide your own Substance identifier in Column 2, in which case you must set this column to "0". Substances that have not yet been deposited into PubChem must, at a minimum, be in the PubChem deposition system before you upload Assay Data; your Assay Data can refer to such Substances by setting this column to "0" and giving your identifier in Column 2.

      • Column 2: PUBCHEM_EXT_DATASOURCE_REGID
        You may use your own identifier for Substance descriptions previously loaded into either PubChem or the PubChem deposition system. If you provide a value in this column, you must set the value in "Column1" to "0" or leave it blank. If you choose to identify the Substance for which you are providing data using "Column1" (i.e., using its SID), please leave this column blank.

      • Column 3: PUBCHEM_ACTIVITY_OUTCOME
        The Activity Summary for every Substance has two parts, the outcome and the score. The outcome for each Substance is reported as an integer value in this column and must be one of five different values:

        1. Substance is considered inactive.
        2. Substance is considered active.
        3. Substance activity outcome is inconclusive.
        4. Substance activity outcome is unspecified.
        5. Substance identified as a probe.

      • Column 4: PUBCHEM_ACTIVITY_SCORE
        The Activity Summary for every Substance has two parts, the outcome and the score. The score for this Substance is reported in this column and must be an integer value, where larger values are more active and smaller values are less active. Please make sure your scores are on a linear scale because that's how they will be interpreted. We encourage depositors to consider using the range 0-100, although values larger and smaller are allowed. The score values are used to allow PubChem users to partition, sort, and profile Assay Data results within and between biological assays.

      • Column 5: PUBCHEM_ACTIVITY_URL
        A URL may optionally be provided in this column for the Assay Data reported for this Substance. This URL will be provided within PubChem displays to allow a PubChem user to link to your website, where you may choose to provide additional information or interfaces to your Assay Data, for example, dose-response curves, replicate data, etc.

      • Column 6: PUBCHEM_ASSAYDATA_COMMENT
        Your textual annotation and comments may optionally be provided for Assay Data reported for this Substance in this column.

      • Column 7: PUBCHEM_ASSAYDATA_REVOKE
        For a normal data submission, leave this column blank or set it to "0". To suppress (revoke) Assay Data for this Substance, set this column to "1" and leave all other columns blank except for Column 1: PUBCHEM_SID. Suppressing Assay Data does not delete data from PubChem; rather, it eliminates all references and links to this information. However, all pre-existing links to this information will still function, and a disclaimer will be displayed specifying that this data is revoked.

        You may un-revoke Assay Data for a Substance by depositing either the same or new data for this Substance. Do not revoke and submit the same substance in the same file.

      • Columns 8 and higher (one column per TID): PUBCHEM_ASSAYDATA_VALUE
        All remaining columns correspond one-to-one, in order, to the result definitions (TIDs) defined in the associated Assay Description. All defined "columns" must be present; however, values are optional in individual fields. Consult the auto-generated CSV template file with your description information to see the layout.

      • Panel Assays - CSV file to define panel components
        A comma-delimited CSV file is used to define panel components. Note that this CSV file is additional to and independent of the CSV file used for your assay data. The section of this help document on Panel Assays provides details about the required headers and columns in the CSV file that defines the panel components.
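      The column layout described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Upload system's own code: the function name build_assay_csv, the result name "IC50", the RegID "MY-REG-0002", and all scores and values are hypothetical. Only the seven fixed header names, the outcome codes, and SID 193126385 come from this document.

```python
import csv
import io

# The seven fixed leading columns, in the documented order.
FIXED_COLUMNS = [
    "PUBCHEM_SID",
    "PUBCHEM_EXT_DATASOURCE_REGID",
    "PUBCHEM_ACTIVITY_OUTCOME",
    "PUBCHEM_ACTIVITY_SCORE",
    "PUBCHEM_ACTIVITY_URL",
    "PUBCHEM_ASSAYDATA_COMMENT",
    "PUBCHEM_ASSAYDATA_REVOKE",
]

# Outcome codes as listed for Column 3.
OUTCOMES = {"inactive": 1, "active": 2, "inconclusive": 3, "unspecified": 4, "probe": 5}


def build_assay_csv(result_names, rows):
    """Return CSV text: fixed columns first, then one column per TID."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=FIXED_COLUMNS + list(result_names),
        restval="",  # cells with nothing to report are left blank
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    # Substance identified by its PubChem SID; Column 2 stays blank.
    {"PUBCHEM_SID": 193126385, "PUBCHEM_ACTIVITY_OUTCOME": OUTCOMES["active"],
     "PUBCHEM_ACTIVITY_SCORE": 80, "IC50": 1.5},
    # Substance identified by a depositor RegID (hypothetical); Column 1 is 0.
    {"PUBCHEM_SID": 0, "PUBCHEM_EXT_DATASOURCE_REGID": "MY-REG-0002",
     "PUBCHEM_ACTIVITY_OUTCOME": OUTCOMES["inactive"], "PUBCHEM_ACTIVITY_SCORE": 5},
]
assay_csv = build_assay_csv(["IC50"], rows)
print(assay_csv)
```

      Because csv.DictWriter raises an error for any key that is not a declared column, a misspelled header is caught before the file is written rather than at upload time.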

  • Sample files for submissions of assay descriptions and assay data to PubChem BioAssay:

     

    Note: The substances that were tested also need to be submitted, but in a separate procedure, as noted in the section of this document on "three parts to an assay submission." The information below is for the submission of assay descriptions and data.

     

    • Example 1: single readout IC50 (derived from AID 449747)

    • Example 2: multiple readouts with dose-response data (derived from AID 623973)

    • Example 3: panel assay with multiple targets & bioactivity outcomes (derived from AID 1433)
      • spreadsheet format (*.xlsx) (can also be *.ods) -- A few notes about the sample *.xlsx file:
        • The sample *.xlsx file contains three sheets: "General" (descriptive information about the experiment), "Results" (data generated by the experiment), and "Xrefs" (cross-references to associated information, such as genes, proteins, literature, external web sites, etc.).
        • The "Results" sheet is organized in a way that groups data from a given panel. All columns that have a RESULT_PANEL_ID starting with the digit 1 belong to the first panel, those starting with the digit 2 belong to the second panel, etc. (For example, in the "Results" sheet of the sample file, you can see that columns E and F contain data from the first panel, because their RESULT_PANEL_ID values are 1_OUTCOME and 1_AC (activity concentration). Similarly, columns G and H contain data from the second panel, because their RESULT_PANEL_ID values are 2_OUTCOME and 2_AC, etc.)
        • The header rows beneath RESULT_PANEL_ID can be used to enter metadata (descriptive information) for each panel (e.g., RESULT_PANEL_NAME, RESULT_PANEL_DESCRIPTION, RESULT_PANEL_TARGET_ID, RESULT_PANEL_TYPE, RESULT_PANEL_ACT_OUTCOME_METHOD).
        • If a panel has multiple data columns, you only need to enter the descriptive information in the first column for that panel. If you prefer, you can repeat that information in each column for the panel, whichever is more convenient for you.
      • XML format
      • ASN format
      • CSV format files can also be submitted, and would follow the same type of format provided for the single readout IC50, above. This document also provides some tips for assay submissions in CSV format, including the ordering of columns. An additional CSV file, panel assay information, is also needed for panel assay submissions, and is described in a separate document.

    • Example 4: RNAi assay (derived from AID 624150)

Assay Description Fields

RegID | Name | Description | Protocol | Comments | Substance Type | Grant Number | Hold Until Date | Project Category | Activity Outcome Method | Activity Concentration TID | Target Data | XRefs | Results Definition | Panel Assay
As you prepare your bioassay submission to PubChem BioAssay, below are some tips about entering the assay description, to supplement the information provided in the assay submission guidelines section of this document and the assay data specifications (separate document):
  • External Assay RegID (RegID)
    • The external assay identifier assigned by the depositor. This must be unique amongst your other PubChem assays.

  • Name
    • A short, informative name of the assay for display purposes.

  • Description
    • A definition of the assay purpose and parameters.

  • Protocol
    • The assay protocol description must be provided here.

  • Comments
    • Any comments on the assay can be provided here.

  • Substance Type
    • By default assays are assumed to be tested on small molecules. With this pulldown, nucleotides can also be specified.

  • Grant Number
    • For NIH screening centers only, a grant number can be specified. Note that this string is not validated.

  • Hold Until Date
    • Optional hold-until date for bioassay data you upload into PubChem. If this field is set to a future date, your bioassay data will be made accessible to PubChem users only after that date. Until then, you can access the data only through the PubChem deposition-system account you used for the upload. Set a hold-until date only if you wish to delay public release of bioassay data, for example to match public access in PubChem with the publication date of a journal article about that bioassay. Please note that PubChem accepts bioassays only with no hold-until date, or with an initial hold-until date of up to one year from initial submission (with the possibility of additional extensions at a later time).

  • Project Category

    • NIH Molecular Libraries Probe Production Network (MLPCN) - This assay category should be selected by depositors who participate in MLPCN and whose assay experiment was funded by an MLPCN grant.
    • NIH Molecular Libraries Screening Center Network (MLSCN) - This assay category should be selected by depositors who participate in MLSCN and whose assay experiment was funded by an MLSCN grant.
    • NIH Molecular Libraries Probe Production Network (MLPCN), Assay Provider - This category should be selected for bioassay depositions where assay data is provided or developed by assay providers participating in MLPCN projects.
    • NIH Molecular Libraries Screening Center Network (MLSCN), Assay Provider - This category should be selected for bioassay depositions where assay data is provided by assay providers participating in MLSCN projects.
    • Literature, Extracted - This assay category should be used for assays whose data were extracted from the literature by a third party (not by the author or the article publisher).
    • Literature, Author - This assay category should be used for assays whose data were extracted from an article by its author.
    • Literature, Publisher - This assay category should be used for assays whose data were extracted from the literature by the publisher.
    • RNAi Global Initiative - This assay category should be used for assays that are being deposited under the RNAi Global Initiative.
    • Assay Vendor - This category should be used for bioassay depositions contributed by assay service providers.

  • Activity Outcome Method

    • Screening assay - Single Concentration Activity Observed:
      • Activity outcome was defined based on the percentage of inhibition from a test at a single dose.
    • Confirmatory assay - Concentration-Response Relationship Observed:
      • Activity outcome was defined based on values such as EC50 or IC50, derived from dose-response curves obtained by testing at multiple concentrations.
    • Summary assay - Candidate Probes/Leads with Supporting Evidence:
      • An assay which summarizes information from multiple assays.
      • A summary assay is a special kind of assay that gives users a summary of the project and a brief overview of all related screening and confirmatory assays. A summary assay should be created simultaneously with the first (screening or confirmatory) assay of the project. At the beginning, data is optional for creating a summary assay (unlike other assay types). As the project progresses, the summary assay needs to be updated with additional descriptions, related assays, any probes identified and associated test results (if need be). A summary assay should always reference all assays it summarizes through its XRef fields to related assays. When linking to related assays with XRefs, make sure to provide a brief comment on how each assay fits into the overall picture of the project. Note that if you are linking to another assay that is pending in the deposition system, but not yet deposited into PubChem, you must link to it with the RegID that you supplied (its PubChem AID will not yet be assigned).
      • To identify probes, depositors must minimally supply a CSV data file with two columns defined including their headers: PUBCHEM_SID and PUBCHEM_ACTIVITY_OUTCOME, where the latter column will have a value of 5 set for probes. In case additional depositor-defined readouts are provided, the regular CSV file format should be used. Readouts previously reported in related assays do not need to be repeated in the summary assay.
    • Other - An assay which does not fall into the above categories
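    The minimal two-column CSV file for identifying probes in a summary assay, as described above, might be produced as follows. This is only a sketch: the function name probe_csv and the SIDs 111 and 222 are hypothetical placeholders, while the two header names and the outcome value 5 come from this document.

```python
import csv
import io

PROBE_OUTCOME = 5  # PUBCHEM_ACTIVITY_OUTCOME value for "Substance identified as a probe"


def probe_csv(sids):
    """Return the minimal two-column CSV text that flags each SID as a probe."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["PUBCHEM_SID", "PUBCHEM_ACTIVITY_OUTCOME"])
    for sid in sids:
        writer.writerow([int(sid), PROBE_OUTCOME])
    return buffer.getvalue()


print(probe_csv([111, 222]))  # placeholder SIDs, not real probes
```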

  • Activity Concentration TID
    • For Confirmatory assays only, an additional pulldown menu appears, requiring you to indicate which of your TIDs provides the active concentration summary. Such a summary might be reported as the concentration that produces 50% of the maximum possible biological response (such as IC50, EC50, AC50, or GI50) or as a constant parameter such as Ki, on which the activity outcome call for your assay is based. Please choose the column number and TID name as found in your Results Definition list.

  • Target Data
    • For any assay designed to identify chemicals interacting with a protein target, such as enzyme inhibitors, please add the identifier of the target molecule from one of the following NCBI databases: Please note that for such assays you should not add an additional XRef protein link. In the opposite case, where it is known only that an assay identifies modulators of some biological process (for example, compounds affecting the expression of a certain protein), it is appropriate to identify a protein with an XRef link (described in the next section) and not with Target data.

  • XRefs
    • Related Assay XRefs -
      • The Related Assay XRefs section allows for linking an assay (e.g., "A") to other PubChem assays (e.g., "B"), including relevant assays from other depositors. To link assays "A" and "B", a depositor can add links (XRefs) to both of them, in which case the XRefs become part of the assay records. Being part of the records, the links will be included in assay ASN blobs when those assays are exported using PubChem web interfaces or FTP.
      • Depositors also have the option of adding XRefs to only one of the assays (e.g., "A"); PubChem will then automatically add a reciprocal link to all display interfaces for assay "B". In that case, however, the XRef link will not become part of the assay record for "B" and will not be included in export functions (e.g., FTP). Also note that PubChem does not automatically build back-links from assay "B" while assay "A" has a hold-until date. PubChem-built back-links will appear after the hold-until date.
    • Other XRefs -
      • The Other XRefs section links to relevant data from other NCBI databases and beyond. Examples include PubMed Ids (PMIDs), Taxonomy Ids, OMIM Ids, reference URLs to your source database/assay, etc.
      • Attention: for XRef protein links please see the previous section on Target data to determine whether you should make an XRef protein link or fill out Target data information. You should not do both.
        • Type - Choose from a list to classify the data type.
        • Primary Citation? (PubMed-Id Type Only) - If checked for a PubMed-Id, indicates that the citation is directly relevant to the assay, thereby allowing your assay to be discoverable in PubMed from the cited record.
        • Value - The actual data, such as a URL or an identifier.
        • Annotation - A comment to describe the XRef data.

  • Results Definition
    • Name - The name of a result. Keep this short, but informative.
    • Type - The result type typically is either a Float, Integer, Boolean or String.
      Optionally, the type can be used to specify an identifier, such as one coming from another NCBI Entrez database. For example, if PubMed Id is chosen as the type, then all data values in this column will be checked to ensure that they are valid PubMed identifiers. The following is a list of accepted identifier types:
      • PubMed Id
      • MMDB Id
      • URL
      • Protein GI
      • Nucleotide GI
      • Taxonomy Id
      • OMIM Id
      • Gene Id
      • Probe Id
      • PubChem BioAssay Id
      • PubChem Substance Id
      • PubChem Compound Id
      • Protein Target GI
        Use this only when an assay contains multiple targets.
      • Biosystems Target Id
        Use this only when an assay contains multiple targets.
      • Target Name
        Use this only when an assay contains multiple targets.
      • Target Description
        Use this only when an assay contains multiple targets.
      • Target Tax-Id
        Use this only when an assay contains multiple targets.
      • Gene Target Id
        Use this only when an assay contains multiple targets.
      • DNA Target GI
        Use this only when an assay contains multiple targets.
      • RNA Target GI
        Use this only when an assay contains multiple targets.
    • Unit - Various units are available to choose from if appropriate.
    • Description - Further description of the result beyond its name.
    • Constraint - Limits on the range of accepted values for integers and floats. The more limits that can be introduced, the more validation can be performed on future data added to the assay. A minimum and/or maximum can be specified or specific acceptable values can be specified.
      • Set of Values - Individual allowed values for integer type only.
      • Minimal Value - A single number to specify minimum possible allowed value for integer or float type only.
      • Maximal Value - A single number to specify the maximum possible allowed value for integer or float type only.
      • Range - A Minimal Value and a Maximal Value.
    • Attribute: Tested Concentration - If the box is checked, this attribute gives the micromolar concentration at which this result was tested. It indicates that the readout in this result field is biological concentration-response data.
    • Attribute: Concentration-Response (CR) Plot Labels
      • Use this attribute to track concentration-response series for confirmatory assays only.
      • If the Tested Concentration attribute for a result definition is filled in, then the optional "CR Plot Labels" menu appears for that TID. By default, only one CR label appears in the menu but the user can add labels by visiting the "Concentration-Response (CR) Plot Labels" section at the bottom of the description page.
      • Multiple labels are useful for assays with multiple series of data and tested concentrations. For each CR Plot Label series there should be at least three activity data points with tested concentration attributes set.
      • Collecting this information allows PubChem to annotate and track the concentration-response series reported, and will facilitate the development of new features such as drawing dose-response curves upon request of PubChem users.
    • Derived by equation?
      • PubChem attempts to record and distinguish experimental dose-response data points from data that were calculated theoretically, for example with curve-fitting algorithms. For each concentration-response series input, if this box is not checked, the status 'experimental data' is assigned and recorded.
      • If checked, this option allows one to define an alternative curve fit as desired (e.g., dropping outliers, using other fitting functions) by supplying just enough data points (about 10 are recommended) to allow a Hill equation to draw a line that presumably fits another experimental series that you have defined.
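    The Constraint options in the Results Definition list above (Set of Values, Minimal Value, Maximal Value, Range) can be pictured as a small validation rule applied to each incoming value. The sketch below is an assumption about how such a check might behave, not a description of PubChem's validator: the class name Constraint is hypothetical, and treating the minimum and maximum as inclusive bounds is an assumption.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Union

Number = Union[int, float]


@dataclass(frozen=True)
class Constraint:
    """Hypothetical model of a result-definition constraint."""
    minimum: Optional[Number] = None          # "Minimal Value"
    maximum: Optional[Number] = None          # "Maximal Value"
    allowed: Optional[FrozenSet[int]] = None  # "Set of Values" (integer type only)

    def accepts(self, value: Number) -> bool:
        """Return True if the value satisfies every limit that is set."""
        if self.allowed is not None:
            return value in self.allowed
        if self.minimum is not None and value < self.minimum:
            return False  # assumption: bounds are inclusive
        if self.maximum is not None and value > self.maximum:
            return False
        return True


percent = Constraint(minimum=0, maximum=100)       # a "Range"
category = Constraint(allowed=frozenset({1, 2, 3}))  # a "Set of Values"
print(percent.accepts(42.5), percent.accepts(101), category.accepts(4))
```

    A Range is simply a Constraint with both minimum and maximum set, which is why the sketch needs no separate case for it.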

  • Panel Assay
    • The PubChem bioassay data model supports the presentation and annotation of profiling screening results.

    • A single panel-type PubChem bioassay record may contain readouts and the respective bioactivity outcome annotations for screening tests over multiple individual targets, cell lines or species. Each such target, cell line or species is regarded as a "panel component". A description of the experiments, including a short name, general goal, specific experimental protocol, and information about the assay target, can be provided for each individual panel component. A panel component should be associated with one or multiple test result fields (TID). The test results for each panel component can be designated as "bioactivity outcome" or, if need be, "active concentration"; otherwise they are treated as regular readouts. Profiling test results are complex; this expansion of the PubChem bioassay data model allows one to describe a compound profiling screening test, and enables PubChem to record and annotate multiple related bioactivity outcomes under a single AID. Such grouping facilitates straightforward comparison and evaluation of compound bioactivities using the profiling results through the PubChem data analysis tools. To see a panel assay example, check out the kinase profiling assay.

    • Entering RESULT_PANEL_ID values in your assay data designates your assay as a multi-target panel assay and enables an additional input mechanism to define your assay. Such assays are very complex in nature and we have tried to make the interface as user-friendly as possible. Please remember, however, that extra attention should be paid to panel assay definitions and data to ensure their accuracy. Also remember, if the assay seems too complicated to deposit, it may also be too complicated for PubChem users to understand!

      • Optional Panel Headers in Assay Data
        A comma-delimited CSV file is used to define panel components. Note that this CSV file is additional to and independent of the CSV file used later for your assay data.

        The Panel Component Info CSV file consists of one required and several optional columns as follows:
        • RESULT_PANEL_ID (Required)
          This is your panel component id and is important because it allows you to associate one or more result descriptions (TIDs) with it. It must be an integer starting from one and ascending by ones.
        • RESULT_PANEL_NAME (Optional)
          Short name of panel component.
        • RESULT_PANEL_DESCRIPTION (Optional)
          Short description about specifics of panel component, such as about cell line, or target information.
        • The following labels are used to specify a target, which is often provided for profiling assays across protein families.
          • RESULT_PANEL_TARGET_ID (Optional)
            This is mandatory if any of the target fields are present.
          • RESULT_PANEL_TARGET_TYPE (Optional)
            This is mandatory if any of the target fields are present. Target type should be expressed as an integer: Protein(1), DNA(2), RNA(3), Gene(4), BioSystems(5).
        • RESULT_PANEL_TAXONOMY (Optional)
          NCBI Taxonomy ID (also called "taxid", integer).
        • RESULT_PANEL_GENE (Optional)
          NCBI Gene ID (integer).
        • RESULT_PANEL_ACT_OUTCOME_METHOD (Optional)
          Assay outcome qualifier (integer). Choices include screening (1), confirmatory (2), summary (3), and other (0). (See additional information about activity outcome methods.)
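        As a rough illustration of the Panel Component Info CSV file described above, the following sketch writes the listed columns and checks two of the stated rules. Several things here are assumptions rather than documented behavior: the function name build_panel_csv, the column order (this section does not say whether order matters in this file), the reading that target ID and target type must be supplied together, and the component names and target ID, which are placeholders.

```python
import csv
import io

PANEL_COLUMNS = [
    "RESULT_PANEL_ID",
    "RESULT_PANEL_NAME",
    "RESULT_PANEL_DESCRIPTION",
    "RESULT_PANEL_TARGET_ID",
    "RESULT_PANEL_TARGET_TYPE",
    "RESULT_PANEL_TAXONOMY",
    "RESULT_PANEL_GENE",
    "RESULT_PANEL_ACT_OUTCOME_METHOD",
]
TARGET_TYPES = {"protein": 1, "dna": 2, "rna": 3, "gene": 4, "biosystems": 5}
OUTCOME_METHODS = {"other": 0, "screening": 1, "confirmatory": 2, "summary": 3}


def build_panel_csv(components):
    """Validate panel components and return the CSV text."""
    for expected_id, component in enumerate(components, start=1):
        # IDs must be integers starting from one and ascending by ones.
        if component.get("RESULT_PANEL_ID") != expected_id:
            raise ValueError(f"RESULT_PANEL_ID must be {expected_id}")
        has_target_id = "RESULT_PANEL_TARGET_ID" in component
        has_target_type = "RESULT_PANEL_TARGET_TYPE" in component
        if has_target_id != has_target_type:
            raise ValueError("target ID and target type must be given together")
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=PANEL_COLUMNS, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(components)
    return buffer.getvalue()


components = [
    {"RESULT_PANEL_ID": 1, "RESULT_PANEL_NAME": "Kinase A",  # placeholder name
     "RESULT_PANEL_TARGET_ID": 1000001,                      # placeholder ID
     "RESULT_PANEL_TARGET_TYPE": TARGET_TYPES["protein"],
     "RESULT_PANEL_ACT_OUTCOME_METHOD": OUTCOME_METHODS["confirmatory"]},
    {"RESULT_PANEL_ID": 2, "RESULT_PANEL_NAME": "Kinase B"},
]
panel_csv = build_panel_csv(components)
print(panel_csv)
```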

Data Processing for Assays

  • Assay Submission Status

    • A series of five dots indicates the status of an assay submission. These dots appear on both your Upload home page (in the "Assays > Pending" folder tab), and in the "History" tab of the Assay submissions interface. In general, the various dot colors mean the following:

      Grey dot means that the step has not been started yet.
      Orange dot means the step is currently in progress (processing).
      Blue dot means the step has been completed.
      Red dot means the step has failed.

  • The five steps in assay submissions are below, including the specific meanings of the dot colors for each step:

    1. Initial creation of submission.
      The only possible dot color for this step is blue.
    2. Parsing of submission.
      orange = parsing data
      blue = data parsed
      red = failed while trying to parse data
    3. Validation
      orange = validating
      blue = validated
      red = failed validation
    4. Approval of submission by PubChem curators (following their review) and commitment of submission by depositor, giving permission for the data to be moved from the pending submissions database into PubChem.
      orange = committed by depositor
      blue = approved by PubChem curators
      red = a revision has been requested by PubChem curators
    5. Deposition of data into PubChem. This step represents the stage at which Assay identifiers (AIDs) have been assigned to the assays and are available for download through the PubChem Upload user interface.
      orange = uploading to PubChem
      blue = uploaded to PubChem
      red = the dot should never become red; if any technical issues arise while data are uploading, the dot will still be orange, and the PubChem staff will deal with it on our end

  • Note: For assay submissions made by FTP, a series of alphanumeric characters (rather than colored dots) are used to indicate the status codes for the submission.
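  For readers who track submissions in their own scripts, the step and color meanings listed above can be written down as a lookup table. This is purely an illustrative sketch: PubChem Upload shows the dots in its web interface, and the names STEP_NAMES, MEANINGS, and describe_status, as well as the idea of representing the dots as a list of color strings, are hypothetical.

```python
STEP_NAMES = {
    1: "Initial creation",
    2: "Parsing",
    3: "Validation",
    4: "Approval and commit",
    5: "Deposition into PubChem",
}

# (step number, dot color) -> meaning, transcribed from the list above.
MEANINGS = {
    (1, "blue"): "submission created",
    (2, "orange"): "parsing data",
    (2, "blue"): "data parsed",
    (2, "red"): "failed while trying to parse data",
    (3, "orange"): "validating",
    (3, "blue"): "validated",
    (3, "red"): "failed validation",
    (4, "orange"): "committed by depositor",
    (4, "blue"): "approved by PubChem curators",
    (4, "red"): "a revision has been requested by PubChem curators",
    (5, "orange"): "uploading to PubChem",
    (5, "blue"): "uploaded to PubChem",
}


def describe_status(dots):
    """Summarize the furthest step that has been started."""
    if len(dots) != 5:
        raise ValueError("expected exactly five dots")
    started = [n for n, color in enumerate(dots, start=1) if color != "grey"]
    if not started:
        return "No step has been started yet"
    step = started[-1]
    color = dots[step - 1]
    if (step, color) not in MEANINGS:
        raise ValueError(f"unexpected color {color!r} for step {step}")
    return f"Step {step} ({STEP_NAMES[step]}): {MEANINGS[(step, color)]}"


print(describe_status(["blue", "blue", "red", "grey", "grey"]))
```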

Create URL to share on hold assay data with collaborators

If you have chosen to hold your assay data private (as noted in the overview/data release section of this document), but you would like to provide access to select individuals such as collaborators or reviewers, you can create a URL that provides temporary access to a given assay submission (i.e., to a given AID).

To do this:
  • Log in to your Upload home page
  • Open the "Assays/In PubChem" folder
  • Mouse over the assay submission (AID) of interest to view the available actions (e.g., view in PubChem, use as template, modify, revoke, export SID list, and on-hold access).
  • Click on the option for "On-Hold Access".
  • Click on the "Create URL" button. That will open a new folder tab with the temporary URL that points to the on-hold assay. You can copy/paste that URL into an e-mail to the desired individuals (e.g., collaborators, reviewers) in order to enable them to access your on-hold assay. The dialog box also provides the following options:
    • Share: The "Share" button simply opens a new folder tab with the URL.
    • Expiration Date: By default, the expiration date for the URL is 90 days from the day on which it was created. If the Expiration date text box is left blank, the default 90 days will be applied. You can change the expiration date, if desired.
  • The temporary URL will expire on the expiration date, or when the data comes off of its on-hold status, whichever comes first.
Please note that:
  • The ability to create a temporary URL to on-hold assay data was added to the PubChem Upload system in order to facilitate collaboration (whether external or in-house) and administrative work such as publication and grant processing.
  • The PubChem Upload system allows you to create a temporary URL only after your assay has been deposited to PubChem and has received an Assay identifier (AID). (A separate section of this help document describes the assay submission steps, the last of which is deposition into PubChem and assignment of an AID.)
  • You can create, and then delete, a URL for a given AID as many times as you'd like. A different URL will be generated each time you create one. Each AID can have only a single URL at any given time.
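The expiration rule described above (a 90-day default, capped by the end of the on-hold period) can be sketched as follows. This is only an illustration of the rule as stated in this document, not PubChem's implementation; the function name and the idea of passing the hold-until date as an argument are assumptions.

```python
from datetime import date, timedelta
from typing import Optional

DEFAULT_LIFETIME_DAYS = 90  # default lifetime stated in this help document


def effective_expiry(created: date, hold_until: date,
                     expiration: Optional[date] = None) -> date:
    """Return the date on which a temporary on-hold URL stops working.

    If no expiration date is given, the 90-day default applies. The URL
    also stops working when the data comes off hold, whichever is first.
    """
    if expiration is None:
        expiration = created + timedelta(days=DEFAULT_LIFETIME_DAYS)
    return min(expiration, hold_until)
```

For example, a URL created with the default expiration for an assay that is released 30 days later would stop working on the release date rather than after 90 days.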

Data Release of Assays

After you commit your data to PubChem, assays can be released to the public immediately (the default) or held private until a date that you specify. An initial hold of up to one year from initial submission is accepted, with the possibility of additional extensions, as noted in the "overview/data release/hold-until date" section of this document.

You can update your record any time to remove the on-hold request and release the record to the public. To do this:
  • Open your PubChem Upload home page/"Assays" folder tab/"In PubChem" subfolder.
  • Use the filters in the upper right corner of that tab to view your assays that have an "on hold" status.
  • Mouse over the assay you would like to release, then select the options for "Modify/Options/Hold Until Date."
  • Change the hold-until date (for example, to today's or tomorrow's date, as desired).
  • Press the "Save" button, press the "Validate" button, and then "Commit" the revision to PubChem.
Notes for assay submissions:
  • If you have chosen to hold your assay data private, but you would like to provide access to select individuals such as collaborators or reviewers, you can create a URL that provides temporary access to a given on-hold assay submission (i.e., to a given AID and its associated substance records). You can then share it with the desired individuals. This function was added in order to facilitate collaboration (whether external or in-house) and administrative work such as publication and grant processing.

  • An assay cannot be released to the public before all of its associated substances (SIDs) are public. (Validation checks are performed on the release dates of an assay and its associated substance records, and an error is generated if a request is made to release an assay before all of its SIDs are public.) Therefore, if you would like to release your assay data to the public, you may need to do two steps if both the assay data and the substances are on-hold:

    1. Release the substances: Open your PubChem Upload home page/"Substances" folder tab/"On Hold" subfolder, and type the Assay ID (AID) in the text box entitled "Release (remove Hold Until Date) Substances." That will release all PubChem Substance records that are associated with that AID.

    2. Release the assay: Once the substances are uploading to PubChem, release the assay data by following the steps described in the Assay data release section of this document.


 
FTP Depositions

request FTP account | substance depositions by FTP | assay depositions by FTP

Request FTP account

FTP-based deposition provides a path for completely automated data upload into PubChem. If you have a large amount of data to be uploaded into PubChem or if you update your data on a daily or weekly basis, you may be a good candidate to use the PubChem FTP deposition method.

To get started with FTP-based depositions, you must:
  1. Have an approved PubChem Upload account.
  2. Have performed previous data uploads into PubChem.
  3. Request an FTP account from PubChem by writing to pubchem-deposit-help@ncbi.nlm.nih.gov.
An FTP account is separate from your PubChem Upload account and has different login credentials. The PubChem Upload account will be configured to interact directly with data uploaded via FTP, so any data you upload via FTP will be accessible and editable through the PubChem Upload interface.

The procedure to create, set up, and configure your FTP account to interact with your PubChem Upload account will take one or more business days.

Substance depositions by FTP

sdf file format | file suffix (*.in, *.status, *.err, *.out) | status codes | load codes | updates to substance records via FTP
  • SDF file format

    • Substance depositions by FTP must be made using the SDF file format.
    • To deposit data for a Substance deposition via FTP, you must:
      1. Upload a file using your FTP account
      2. Name the file you upload with the suffix ".sdf.in" (or ".sdf.gz.in", if a compressed file)
    • There may be a delay between the completion of your FTP upload transfer and the processing of your uploaded file. This is because the deposition system processes FTP deposited data at particular times of the day and it may wait to verify that your transfer is actually complete.
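The naming and upload steps above could be scripted roughly as follows. This is a minimal sketch using Python's standard ftplib; the host name and credentials are placeholders that you would receive with your FTP account, and the helper names are hypothetical.

```python
from ftplib import FTP
from pathlib import Path


def remote_name(local_path: str) -> str:
    """Return the name to use on the FTP server for a substance SDF file.

    The ".in" suffix marks the file as a new substance deposition.
    """
    name = Path(local_path).name
    if not (name.endswith(".sdf") or name.endswith(".sdf.gz")):
        raise ValueError("expected a .sdf or .sdf.gz file: " + name)
    return name + ".in"


def upload_substances(local_path: str, host: str, user: str, password: str) -> str:
    """Upload one SDF file; host, user and password are placeholders."""
    target = remote_name(local_path)
    with FTP(host) as ftp:
        ftp.login(user=user, passwd=password)
        with open(local_path, "rb") as handle:
            # Binary transfer, so compressed files are not altered.
            ftp.storbinary("STOR " + target, handle)
    return target
```

Because processing runs at particular times of day, a script like this should not expect a status file to appear immediately after the transfer finishes.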

  • FTP file suffix (*.in, *.status, *.err, *.out)
  • The names of the files that appear in your PubChem FTP directory will carry various suffixes, depending upon the stage of your submission:

    File Suffix Description
    *.in The substance submission file that you FTP to PubChem should end with the suffix "*.sdf.in". The "*.in" suffix lets the deposition system know that your file is intended to be a new Substance deposition. After the file is recognized as being present, the file is transferred into the deposition system.

    You can tell that FTP-based deposition processing has begun when the ".sdf.in" (or ".sdf.gz.in") file is removed from your FTP account directory and a "*.sdf.status" file is created. For example, if you upload the file "smid.sdf.gz.in", the status file created will be "smid.sdf.status".

    *.status The "*.sdf.status" file contains the current status of the data processing.

    *.err The ".sdf.err" file may contain a human-readable explanation of why your FTP-based substance deposition failed.

    The presence of a non-zero-length file with the suffix ".sdf.err" (e.g., "smid.sdf.err") indicates that there was a problem with your uploaded file and that your data may not be loaded into PubChem. The ".sdf.err" file will contain a human-readable text message explaining why the uploaded file failed. Please note that the status file is not deleted after processing and publishing are completed. Its final contents, if all went well, will be "D", which means "Published" (listed as "Deposited in PubChem" in the status codes below).

    *.out The "*.sdf.out" file contains a load report listing the records and the load action taken for each. This file will appear after processing is completed and your substances are loaded into PubChem. It contains the SIDs that correspond to your registry IDs. CIDs are not provided.

    The ".sdf.out" log file is a CSV text file, easily read by Excel or other spreadsheet applications. It contains no column headers. The columns are in the following order:

    • Data Source Name
    • External Registry ID
    • SID
    • SID Version
    • Load Code
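As an alternative to opening the load report in a spreadsheet, the five header-less columns listed above can be read with Python's csv module. This is a minimal sketch; it assumes every row has exactly five comma-separated fields and that the SID, SID version and load code are integers. The record type and function name are hypothetical.

```python
import csv
import io
from typing import List, NamedTuple


class LoadRecord(NamedTuple):
    source_name: str   # Data Source Name
    registry_id: str   # External Registry ID
    sid: int           # SID
    sid_version: int   # SID Version
    load_code: int     # Load Code


def parse_load_report(text: str) -> List[LoadRecord]:
    """Parse the header-less CSV content of an ".sdf.out" file."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        if not row:  # skip blank lines
            continue
        source, reg_id, sid, version, code = [field.strip() for field in row]
        records.append(
            LoadRecord(source, reg_id, int(sid), int(version), int(code))
        )
    return records
```

Mapping each registry ID to its SID is then a one-liner, for example `{r.registry_id: r.sid for r in parse_load_report(text)}`.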



  • Status codes for Substance Submissions by FTP
  • The status file informs you of the processing progress. The possible status file contents and their meaning are listed below.

    Status  Meaning                         Step
    I       Submitted                       STEP 1 -- completed (see note * below about the seven steps in the substance submission process)
    -P      Parsing                         STEP 2 -- progress
    !P      Parsing Failed                  STEP 2 -- failed
    P       Parsed                          STEP 2 -- completed
    -S      Standardizing                   STEP 3 -- progress
    !S      Standardization Failed          STEP 3 -- failed
    S       Standardized                    STEP 3 -- completed
    -V0     Validating I                    STEP 4 -- progress
    !V0     Validation I Failed             STEP 4 -- failed
    V0      Validated I                     STEP 4 -- completed
    -V1     Validating II                   STEP 5 -- progress
    !V1     Validation II Failed            STEP 5 -- failed
    V1      Validated II                    STEP 5 -- completed (see note † below about the actions you need to take at this point)
    C       Committed for PubChem           STEP 6 -- progress
    A       Approved for PubChem            STEP 6 -- completed
    R       Rejected for PubChem            STEP 6 -- failed
    -D0     Uploading to PubChem            STEP 7 -- progress
    -D      Depositing to PubChem           STEP 7 -- progress
    !D      Depositing to PubChem Failed    STEP 7 -- failed
    D       Deposited in PubChem            STEP 7 -- completed


    * After you FTP your submission to PubChem, you can view it, if desired, in the "Substances/Pending" folder tab of your PubChem Upload Home Page. There, the status of your substance submission is indicated graphically by a series of seven dots, representing the seven steps of the submission process, where dot color indicates the status of each step. (In many ways, an FTP-based deposition behaves like a file deposited through the web interface. You can log in to your Upload account at any time to see the progress of your deposition(s) or to get your SIDs. You can also download a submission status report from your PubChem FTP directory or through the PubChem Upload web interface.)

    † After processing completes to the point of "Validated II", you will need to log in to the PubChem Upload system, review your submission, and then, if there are no issues, commit your data to be loaded into PubChem. An auto-commit feature can be requested, whereby the deposition commit step is performed on your behalf automatically. This removes the need for you to log in and commit your data into PubChem.

    When processing is complete and your data is loaded into PubChem, you will see the suffixed file ".sdf.err" and, if all went well, the suffixed file ".sdf.out". The file with the ".sdf.out" suffix (e.g., "smid.sdf.out") is your report file containing your PubChem Substance identifiers (SIDs).
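A script that monitors a substance deposition could interpret the status file with a lookup like the one below. This is a sketch under two assumptions: that the status file contains just one of the codes listed above (possibly followed by whitespace), and that the status file name is derived as in the "smid.sdf.gz.in" example earlier in this section. All names in the code are hypothetical.

```python
from typing import NamedTuple


class StepStatus(NamedTuple):
    step: int      # step number in the seven-step substance process
    state: str     # "completed", "progress" or "failed"
    meaning: str


# Codes, meanings and steps copied from the status table above.
SUBSTANCE_STATUS = {
    "I": StepStatus(1, "completed", "Submitted"),
    "-P": StepStatus(2, "progress", "Parsing"),
    "!P": StepStatus(2, "failed", "Parsing Failed"),
    "P": StepStatus(2, "completed", "Parsed"),
    "-S": StepStatus(3, "progress", "Standardizing"),
    "!S": StepStatus(3, "failed", "Standardization Failed"),
    "S": StepStatus(3, "completed", "Standardized"),
    "-V0": StepStatus(4, "progress", "Validating I"),
    "!V0": StepStatus(4, "failed", "Validation I Failed"),
    "V0": StepStatus(4, "completed", "Validated I"),
    "-V1": StepStatus(5, "progress", "Validating II"),
    "!V1": StepStatus(5, "failed", "Validation II Failed"),
    "V1": StepStatus(5, "completed", "Validated II"),
    "C": StepStatus(6, "progress", "Committed for PubChem"),
    "A": StepStatus(6, "completed", "Approved for PubChem"),
    "R": StepStatus(6, "failed", "Rejected for PubChem"),
    "-D0": StepStatus(7, "progress", "Uploading to PubChem"),
    "-D": StepStatus(7, "progress", "Depositing to PubChem"),
    "!D": StepStatus(7, "failed", "Depositing to PubChem Failed"),
    "D": StepStatus(7, "completed", "Deposited in PubChem"),
}


def status_file_name(uploaded_name: str) -> str:
    """Derive the status file name, e.g. smid.sdf.gz.in -> smid.sdf.status."""
    base = uploaded_name
    if base.endswith(".in"):
        base = base[: -len(".in")]
    if base.endswith(".gz"):
        base = base[: -len(".gz")]
    return base + ".status"


def describe(status_text: str) -> StepStatus:
    """Look up the contents of a status file; raises KeyError if unknown."""
    return SUBSTANCE_STATUS[status_text.strip()]


def needs_commit(status_text: str) -> bool:
    """True at "Validated II", when a manual review and commit is required
    (unless the auto-commit feature has been enabled for the account)."""
    return status_text.strip() == "V1"
```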

  • Load codes

  • The "Load Code" column values, described below, allow you to know or track the substances that you have added, modified, or suppressed.

    Load Code Description
    0 substance load failed (internal error)
    1 existing substance replaced (internal use only)
    2 new substance created
    3 new substance version, PubChem structure same
    4 new substance version, PubChem structure changed
    5 no change, identical substance
    6 no change, but new PubChem structure (internal use only)
    7 substance revoked/suppressed
    8 substance is "on-hold"
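To see at a glance what a deposition did, the load codes from a report can be tallied by description. The sketch below copies the descriptions from the table above; the function name is hypothetical, and the input is assumed to be a list of integer load codes taken from the last column of the ".sdf.out" file.

```python
from collections import Counter
from typing import Dict, Iterable

# Descriptions copied from the load code table above.
LOAD_CODES = {
    0: "substance load failed (internal error)",
    1: "existing substance replaced (internal use only)",
    2: "new substance created",
    3: "new substance version, PubChem structure same",
    4: "new substance version, PubChem structure changed",
    5: "no change, identical substance",
    6: "no change, but new PubChem structure (internal use only)",
    7: "substance revoked/suppressed",
    8: 'substance is "on-hold"',
}


def summarize_load_codes(codes: Iterable[int]) -> Dict[str, int]:
    """Count how many records received each load action."""
    counts = Counter(LOAD_CODES.get(code, "unknown load code") for code in codes)
    return dict(counts)
```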

  • Updates to substance records via FTP

    • To update, please re-deposit the complete substance record, including the new information, using the same registry identifier. Updates are versioned but only the most current data will be readily visible, searchable, or downloadable. Please note that the revised record will still have the same SID. Please also note that PubChem will not version substance records if nothing has changed.

Assay depositions by FTP

general guidelines | four types of assay deposition operations (new, modify: data_only, alter_descr, replace_all) | file format (XML only for assay FTP) | file suffix (*.in, *.status, *.err, *.out) | status codes
  • General guidelines

    • Request FTP account - Bioassay data depositions can be initiated via FTP in much the same way that substance depositions can, but for assays you must additionally specify the type of assay deposition operation. To begin, follow the three-step FTP account setup procedure described above. Note that you use the same FTP account for depositing either substance or assay data.

    • Directory structure - Once your FTP account is set up, you should have the following directories under your top-level directory (the paths are those named in the operation descriptions below):

        /assay/new
        /assay/modify/data_only
        /assay/modify/alter_descr
        /assay/modify/replace_all

      You must decide which of the four types of assay operations you want to perform and place the file to be deposited into the corresponding directory. You should be familiar with performing these assay deposition operations before trying them with FTP.


  • Four types of assay deposition operations

    • new - For this operation, you need to fill out new description information, including a unique aid-source (RegId) and name for your assay, together with the assay data results. You can look at one of your existing assays from the PubChem FTP site for guidance. Your assay FTP upload file goes in the directory /assay/new.
      XML example is here.

    • modify - For these three operations, you are doing something to affect your current assay in PubChem. Therefore, you need to specify the assay AID correctly so that it can be found. The best way to do this is to first copy the XML file of your current assay and modify it as you wish. In the following three types of Modify operations, we'll briefly mention what you should change.

      • Add/Change data without description change - For this operation, you should take a copy of your current assay's XML file and replace the data section with the data that you want to add, delete or modify. Be careful to first remove all data or the system will think that you want to add that data again! Also note that for this operation you must make no changes to the existing description.
        Your assay FTP upload file goes in the directory /assay/modify/data_only. XML example is here.

      • Alter description - For this operation, you should take a copy of your current assay's XML description file without the data section and make your minor alterations. Any significant changes, such as adding TID data result definitions, will result in an error. Your assay FTP upload file goes in the directory /assay/modify/alter_descr.
        XML example is here.

      • Replace assay - For this operation, you should take a copy of your current assay's XML file and make your description and data section modifications. Please note that all of your existing data for this assay in PubChem will be replaced by the contents of this uploaded file.
        Your assay FTP upload file goes in the directory /assay/modify/replace_all. XML example is here.
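When uploads are scripted, the choice of directory for the four operations above can be captured in a small lookup. This is a minimal sketch; the operation keys and function name are hypothetical, the file name in the usage note is invented, and the paths are used exactly as written in this document.

```python
import posixpath

# Directory for each assay operation, as named in the descriptions above.
ASSAY_DIRECTORIES = {
    "new": "/assay/new",
    "data_only": "/assay/modify/data_only",
    "alter_descr": "/assay/modify/alter_descr",
    "replace_all": "/assay/modify/replace_all",
}


def assay_upload_path(operation: str, file_name: str) -> str:
    """Return the remote path for an assay file and operation.

    The ".in" suffix, which marks a file as a deposition, is added
    if it is not already present.
    """
    try:
        directory = ASSAY_DIRECTORIES[operation]
    except KeyError:
        raise ValueError("unknown assay operation: " + operation) from None
    if not file_name.endswith(".in"):
        file_name += ".in"
    return posixpath.join(directory, file_name)
```

For instance, `assay_upload_path("alter_descr", "my_assay.xml")` gives `/assay/modify/alter_descr/my_assay.xml.in`.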


  • File format

    • Only XML or ASN format allowed for FTP - To upload any kind of assay data or description changes, a single XML or ASN.1 file is required. This file must adhere to the specification for assays and be filled out as appropriate. Search in the specification file (XML Schema, ASN.1) for the tag PC-AssayContainer; this will always be the outermost container for your assay, whether it contains description and data or only description. You can find examples of such XML files from the PubChem public FTP site for bioassays. For assay deposition path-specific XML examples, see the following files, which are Bioassay XML examples for FTP:

      1. New assay deposition: XML file, Panel XML file
      2. Altering assay description: XML file
      3. Replacing assay: XML file, Panel XML file
      4. Adding/changing data of assay: XML file

      You can also download templates of XML files from pending depositions that you are making in the Deposition Gateway. You will need one file with both the data and description filled out in the cases of new, data_only or replace_all operations. For the alter_descr operation, only the description should be filled out.

    • XML Validation against PubChem XSD Schema - To make your Bioassay FTP submission more efficient, PubChem strongly recommends that depositors validate XML files before uploading them to the PubChem FTP site for processing. XML validation confirms that your file conforms to the PubChem Bioassay specification and should help speed up deposition by isolating XML errors early. To check conformance, validate the XML document against the PubChem XSD Schema, which you can find on the PubChem FTP site.

      One XML validator that you might use is xmllint which is often included in standard Linux installations. To validate XML using xmllint one would run the following Linux command:

      xmllint --noout --schema "ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem.xsd" FileName.xml

      Please be advised that PubChem does not support or maintain xmllint, but you can find more information on it here. Depositors may of course use any other equivalent XML package for validation.

    • No CSV files are permitted via FTP for assay depositions
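If validation is part of an automated pipeline, the xmllint command shown above can be wrapped in a short script. This sketch assumes that xmllint is installed and on the PATH and that it exits with a non-zero code when validation fails; the function names are hypothetical.

```python
import subprocess
from typing import List

# Schema location taken from the xmllint command in this document.
SCHEMA_URL = "ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem.xsd"


def xmllint_command(xml_path: str, schema: str = SCHEMA_URL) -> List[str]:
    """Build the argument list for the xmllint call shown in this document."""
    return ["xmllint", "--noout", "--schema", schema, xml_path]


def is_valid(xml_path: str, schema: str = SCHEMA_URL) -> bool:
    """Run xmllint and report whether the file validated.

    Raises FileNotFoundError if xmllint is not installed.
    """
    result = subprocess.run(
        xmllint_command(xml_path, schema), capture_output=True, text=True
    )
    if result.returncode != 0:
        print(result.stderr)  # xmllint writes its diagnostics to stderr
    return result.returncode == 0
```

Pointing the `schema` argument at a locally saved copy of the XSD avoids fetching it on every run.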


  • FTP file suffix (*.in, *.status, *.err, *.out)

  • File Suffix Description
    *.in The assay submission file that you FTP to PubChem should end with the suffix "*.in". That suffix lets the deposition system know that your file is intended to be a new Assay deposition. After the file is recognized as being present, the file is transferred into the deposition system.
    *.status The "*.status" file contains the current status of the data processing. The status codes are listed below.
    *.err The ".err" file may contain a human-readable explanation of why your FTP-based assay deposition failed.

    The presence of a non-zero-length file with the suffix ".err" indicates that there was a problem with your uploaded file and that your data may not be loaded into PubChem. The ".err" file will contain a human-readable text message explaining why the uploaded file failed. Please note that the status file is not deleted after processing and publishing are completed. Its final contents, if all went well, will be "D", which means "Published" (listed as "Deposited in PubChem" in the status codes below).
    *.out The "*.out" file contains a load report listing the records and the load action taken for each. This file will appear after processing is completed and your assay is loaded into PubChem. It contains the AID that corresponds to your registry ID.

  • Status codes for Assay Submissions by FTP

  • Status Meaning
    I Description created
    U Data Submitted
    -P Parsing Data
    !P Data Parsing Failed
    P Parsed
    -V Validating
    !V Data Validation Failed
    V Data Validated
    C Committed for PubChem
    A Approved for PubChem
    R Revise for PubChem
    -D Depositing to PubChem
    !D Depositing to PubChem Failed
    D Deposited in PubChem


    After processing completes to the point of "Validated", you will need to log in to the PubChem Upload system, review your submission, and then, if there are no issues, commit your data to be loaded into PubChem. (An auto-commit feature can be requested, whereby the deposition commit step is performed on your behalf automatically. This removes the need for you to log in and commit your data into PubChem.) In many ways, an FTP-based deposition behaves like a file deposited through the web interface. You can log in to your deposition account at any time to see the progress of your deposition(s). From the validated stage onward, you no longer need the FTP system.
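    For scripted monitoring of an assay deposition, the status codes can be turned into a suggested next action. This is a sketch of one possible reading of the table and paragraph above, not an official workflow; the action strings and function name are hypothetical, and the status file is assumed to contain just the code.

```python
# Codes and meanings copied from the assay status table above.
ASSAY_STATUS = {
    "I": "Description created",
    "U": "Data Submitted",
    "-P": "Parsing Data",
    "!P": "Data Parsing Failed",
    "P": "Parsed",
    "-V": "Validating",
    "!V": "Data Validation Failed",
    "V": "Data Validated",
    "C": "Committed for PubChem",
    "A": "Approved for PubChem",
    "R": "Revise for PubChem",
    "-D": "Depositing to PubChem",
    "!D": "Depositing to PubChem Failed",
    "D": "Deposited in PubChem",
}


def next_action(status_text: str) -> str:
    """Suggest what a depositor should do for a given assay status code."""
    code = status_text.strip()
    if code not in ASSAY_STATUS:
        raise ValueError("unknown assay status code: " + repr(code))
    if code.startswith("!"):
        return "check the .err file"
    if code == "R":
        return "revise the submission"
    if code == "V":
        # Not needed if auto-commit has been enabled for the account.
        return "log in, review and commit"
    if code == "D":
        return "done"
    return "wait"
```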

 
Log of Changes to PubChem Upload
16 MAY 2014 The Upload system now has a function that allows you to create a temporary URL to on-hold assay data, which you can share with others such as colleagues or reviewers, if/as desired. This function was added in order to facilitate collaboration (whether external or in-house) and administrative work such as publication and grant processing.
29 OCT 2013 A new "Submission History" subfolder allows you to download a list of substance identifiers (SIDs) that were assigned to a particular Upload ID. It appears in the Substances folder tab of your PubChem Upload home page. Also, the "Tools" link that was previously available in the PubChem Upload header has been moved to a "Tools" subfolder, which appears in the Assays folder tab of your PubChem Upload home page.
30 JUL 2013 Release a list of currently on-hold SIDs to the public in the new system. If you wish to release your own SIDs that are currently on-hold, log in to PubChem Upload, and on your Upload home page, click on the "Substances > On Hold" tabs. There you will see a "Release Substances" box where you can paste a list of on-hold SIDs (one per line), and they will be released with the daily upload schedule. This function automatically commits for you, so be careful!
30 JUL 2013 Fetch all SIDs per bioassay in the new system. Log in to PubChem Upload, and on your Upload home page, click on the "Assays > In PubChem" tabs. There, you will see a list of all of your assays currently in PubChem, including those on-hold. You can search on any of the summary info about your assays. Once you find the one you are interested in, mouse over it and you will see an "Export SID List" button. This will allow you to easily obtain the SIDs (including on-hold) for any given assay.
05 APR 2013 Initial release of PubChem Upload, which offers streamlined procedures for data submissions and updates to both the PubChem Substance and BioAssay databases. This system is currently in Beta release, but will eventually replace the original PubChem Deposition Gateway. The new capabilities are summarized in a PubChem News announcement.
 
Differences between old Deposition Gateway vs. new PubChem Upload

The initial announcement of the PubChem Upload 1.0 beta release, on April 5, 2013, summarized some of the new features available in the Upload system. Additional details about the differences between the old PubChem Deposition Gateway and the new PubChem Upload system are below.
Each feature or function is listed below, followed by how it works in the old PubChem Deposition Gateway and in the new PubChem Upload.

Account setup
  • Old PubChem Deposition Gateway: Requires separate "test" accounts (for uploading and validating data) and "deposition" accounts (for making data public in PubChem). Data cannot simply be moved from one to the other; rather, it must be removed from the test account and re-entered into the full account in order to submit the data to the official PubChem database.
  • New PubChem Upload: New users establish a single account that may, if desired, start out as a "test" account and later be upgraded to a "full" account. There is no need to remove data from a test account and re-enter it into a full account. Rather, you only need to enter your data once. Then you can edit it until you are satisfied with the submission, and easily move data from the pending submissions database to the official PubChem database by pressing the "commit" button.

User interface
  • Old PubChem Deposition Gateway: Data are input as specially formatted files (e.g., *.sdf files for substances) and therefore require a knowledge of valid tags (i.e., substance data specifications and/or assay data specifications).
  • New PubChem Upload: Flexible methods for data entry, including wizards that assist novice users in entering substance and/or assay data, without requiring knowledge of detailed data specifications. Hints appear at the bottom of most screens, providing tips on data entry and links to examples in the upload tutorial, if/as appropriate. After you type or import data into the wizards, the Upload system will prepare a properly formatted file that conforms to the data specifications.

Performance
  • Old PubChem Deposition Gateway: Modifications of large or complex submissions can sometimes encounter performance issues, as the result of an older web technology.
  • New PubChem Upload: The speed and performance of the user interface are greatly improved with a newer web technology, minimizing possible time-outs and making it easier to modify submissions, including those that are large and complex.

Substance submissions
  • Old PubChem Deposition Gateway: Substances must be uploaded as *.sdf data files, requiring knowledge of the file format and substance data specifications.
  • New PubChem Upload: Substances can be uploaded as *.sdf data files or by filling in forms (using wizards). The forms allow you to either start from scratch or use an existing substance as a template.

Assay submissions
  • Old PubChem Deposition Gateway: Requires you to upload files for each part of the assay submission (description, substances, and bioassay data). Requires knowledge of the assay data specifications in order to format the submission file properly. Requires an SID for each substance before you can proceed with submission of the assay description and data.
  • New PubChem Upload: Assays can be uploaded as data files or by filling in forms (using wizards). The forms allow you to either start from scratch or use an existing assay as a template. You don't have to wait for an SID for each substance in order to proceed with the assay submission; instead, you can use the Registry ID (RegID) for each substance. Of course, you can use the SID if you have it, but RegID is now another option.

 
References


Citing PubChem Resources:

Please refer to the PubChem Publications page if you are referencing the overall PubChem Substance, Compound, or BioAssay database, or the various PubChem tools. That page lists recommended citations as well as additional articles that have been written about the PubChem resources and how they can be used.

PubChem Data Usage and Citation Guidelines:

Please see the PubChem Data Usage and Citation Guidelines page for information about how to cite individual or multiple records from a PubChem database.


 Revised 11 December 2014