PubChem Substance Tags

PUBCHEM_EXT_DATASOURCE_REGID

Your Catalog Number or Other Unique Registry Identifier External to PubChem.

This mandatory identifier never changes. It can't be duplicated within one submission. If we see a given RegId again in another submission, we treat it as a replacement update for the record and create a new version. You can also use it to revoke records with the PUBCHEM_REVOKE_SUBSTANCE tag. Please use only ASCII text characters in the identifier.
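The two rules above that can be checked before upload (ASCII-only text, no duplicates within one submission) can be sketched in Python. This is only an illustration: the function name `check_regids` and the sample catalog numbers are hypothetical, and PubChem's own validation may apply further checks.

```python
from collections import Counter


def check_regids(regids):
    """Return a list of problems found among one submission's RegIds.

    Only the rules stated in the text are checked: the identifier is
    mandatory, must be ASCII text, and must not repeat in a submission.
    """
    problems = []
    for regid in regids:
        if not regid:
            problems.append("empty RegId")
        elif not regid.isascii():
            problems.append(f"non-ASCII RegId: {regid!r}")
    for regid, count in Counter(regids).items():
        if regid and count > 1:
            problems.append(f"duplicate RegId: {regid!r} ({count} times)")
    return problems


if __name__ == "__main__":
    # Hypothetical catalog numbers, for illustration only.
    for problem in check_regids(["CAT-0001", "CAT-0002", "CAT-0001"]):
        print(problem)
```

An empty result list means none of these particular rules was violated.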

PUBCHEM_SUBSTANCE_SYNONYM

Synonyms including common names, registry-ids (like CAS), chemical names and trade names. This tag can have multiple synonyms by putting each on a separate line or by repeating this tag with a different synonym for each tag occurrence.

Synonyms are the primary keywords by which a substance is known and found via text search. They are collected to name our aggregated PubChem Compound records. Your records are most frequently discovered through your high-quality synonyms and chemical structure. Synonyms must be ASCII text; as with other input, all special characters and HTML tags will be converted to text or stripped out as appropriate.
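As an illustration of the "one synonym per line" form, the sketch below writes a multi-line SD data item: a `> <TAG>` header line, the values, then a terminating blank line. The helper name `sd_data_item` is hypothetical, and the rejection of non-ASCII values is a choice made for this sketch (the text says such characters are converted or stripped on input).

```python
def sd_data_item(tag, values):
    """Format one SD data item with one value per line.

    The item is a '> <TAG>' header, the value lines, and a blank line
    that ends the item.
    """
    lines = [f"> <{tag}>"]
    for value in values:
        if not value.isascii():
            raise ValueError(f"value is not ASCII: {value!r}")
        if "\n" in value or not value.strip():
            raise ValueError("each value must be a single non-blank line")
        lines.append(value)
    return "\n".join(lines) + "\n\n"


if __name__ == "__main__":
    synonyms = ["aspirin", "acetylsalicylic acid", "50-78-2"]
    print(sd_data_item("PUBCHEM_SUBSTANCE_SYNONYM", synonyms), end="")
```

The same helper could be called once per synonym to produce the repeated-tag form instead.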

PUBCHEM_SUBSTANCE_COMMENT

Comments, annotations or keywords available for indexed PubChem text searches.

Multiple lines of comments help PubChem users learn about, for example, biological activity, safety data, or special availability information. For a submitter, comments are another opportunity to have your data discovered through keyword text searches.

We reserve the right to suppress or eliminate unsuitable or excessive comments. Comments must consist of printable ASCII characters. URLs should be provided without any HTML tags such as "...". All HTML tags are stripped out of the text.

PUBCHEM_EXT_SUBSTANCE_URL

A specific substance webpage (URL) relevant to this record within your organization (external to PubChem).

PUBCHEM_REVOKE_SUBSTANCE

Remove a Substance Record From Search Results of Live Records.

To use this tag, your record must contain only two tags: the PUBCHEM_EXT_DATASOURCE_REGID tag identifying the record to revoke, and this tag whose value is a short comment stating a reason for the revoke.

Note: The Substance will remain in the PubChem archive; however, there will be no direct links to this substance from within PubChem. Effectively, this deletes the record from public view.
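The "only two tags" rule can be checked mechanically before a revoke record is submitted. The sketch below assumes the record's SD tags have already been read into a dictionary of tag name to value; the function name `is_valid_revoke_record` and the sample values are hypothetical.

```python
REGID_TAG = "PUBCHEM_EXT_DATASOURCE_REGID"
REVOKE_TAG = "PUBCHEM_REVOKE_SUBSTANCE"


def is_valid_revoke_record(tags):
    """Return True if `tags` holds exactly the two tags a revoke needs.

    `tags` maps SD tag names to their text values. Both values must be
    non-blank: the RegId to revoke and a short reason for the revoke.
    """
    if set(tags) != {REGID_TAG, REVOKE_TAG}:
        return False
    return all(value.strip() for value in tags.values())


if __name__ == "__main__":
    # Hypothetical RegId and reason, for illustration only.
    record = {REGID_TAG: "CAT-0001", REVOKE_TAG: "Product discontinued"}
    print(is_valid_revoke_record(record))
```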

PUBCHEM_EXT_DATASOURCE_SMILES

SMILES string specifying the chemical structure.

This tag is ignored if a chemical structure with atoms is also provided in the SD file format CTAB section in the same SDF record submitted.

Please provide a single-line, Kekulized SMILES (no aromatic atoms) to avoid ambiguities.

PUBCHEM_EXT_DATASOURCE_INCHI

InChI string specifying the chemical structure.

This is ignored if a chemical structure with atoms is also provided in the SD file format CTAB section in the same SDF record submitted. Only a single InChI string is allowed for a given Substance. The expected format is a single line of text containing a valid InChI string. This InChI can be standard or non-standard.

PUBCHEM_EXT_DATASOURCE_CID

PubChem Compound Identifier (CID) specifying the chemical structure.

This is ignored if a chemical structure with atoms is also provided in the SD file format CTAB section in the same SDF record submitted. Only a single CID is allowed for a given Substance. The expected format is a single line of text containing a valid PubChem Compound identifier. The CID cannot be on-hold.

PUBCHEM_HOLD_UNTIL_DATE

Optional Hold-Until Date delaying the allowed day on which the record can become public in PubChem. Absent this tag, public release is allowed once the submission is committed by the submitter.

You may wish to coordinate the public release of your data with a journal publication, a patent application or other grant-related administrative deadlines. Each substance record has its own hold-until date. Public records cannot be put on-hold.

A single date is expected, written in the international standard date notation ISO 8601 (YYYY-MM-DD).

Note that the release date specified by the submitter is the first day on which public release is allowed. Upload pipelines typically add a further delay of 24 hours, sometimes more.
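ISO 8601 calendar dates are written YYYY-MM-DD, which Python's standard `datetime.date` produces and parses directly. The sketch below shows one way to generate and check a hold-until value; the function names and the sample date are hypothetical.

```python
from datetime import date


def hold_until_value(release_day):
    """Return the ISO 8601 calendar date (YYYY-MM-DD) for a datetime.date."""
    return release_day.isoformat()


def parse_hold_until(text):
    """Parse an ISO 8601 date string; raises ValueError if it is invalid."""
    return date.fromisoformat(text.strip())


if __name__ == "__main__":
    # Hypothetical release day, for illustration only.
    value = hold_until_value(date(2030, 1, 5))
    print(value)
    print(parse_hold_until(value))
```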

PUBCHEM_EXT_DATASOURCE_URL

The main webpage (URL) for your organization. NOTE: You don't need to set this because it is auto-populated with your account's URL.

You might choose to set this field only to override your account's URL. For example, you might have a summary webpage for a class of compounds to which you wish to refer.

PUBCHEM_BONDANNOTATIONS

Substance Bond Annotations. This tag is offered as a convenient alternative to directly encoding the information in the SDF format.

Bond Annotations will affect how the Substance is interpreted and validated within PubChem. Multiple Bond Annotations may be provided for a Substance. The allowed format for a Bond Annotation is three unsigned numbers, separated by white-space, per line, representing the AtomIDs of the two atoms, followed by the annotation ID, respectively. Only a single Bond Annotation may be provided per line. The atoms do not have to be explicitly bonded in the SD file format to have a bond annotation. Nonsensical annotations will be suppressed.

Atom-Atom Annotation list is in the format:

AtomID AtomID AnnotationID

where AtomID and AnnotationID are unsigned integer numbers.

AnnotationID  Meaning
------------  --------------------------------------------------
1             Crossed Bond, a non-specific stereo double bond
2             Dashed Bond, a 3-D hydrogen bond
3             Wavy Bond, a non-specific stereo single bond
4             Dotted Bond, a complex or fractional bond
5             Wedge-up Bond, a solid wedge stereo bond
6             Wedge-down Bond, a dashed wedge stereo bond
7             Arrow Bond, a dative bond
8             Aromatic Bond, an aromatic bond
9             Resonance Bond, a resonating bond
10            Bold Bond, a thick bond
11            Fischer Bond, use Fischer stereo conventions
12            Close Contact, a 3-D atom-atom close contact
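The line format described for this tag (two AtomIDs followed by an AnnotationID, one annotation per line) is simple enough to check before submission. The following sketch parses a tag value and rejects lines that do not fit; the function name `parse_bond_annotations` is hypothetical, and the lookup table repeats only the annotation names listed for this tag.

```python
ANNOTATIONS = {
    1: "Crossed Bond", 2: "Dashed Bond", 3: "Wavy Bond",
    4: "Dotted Bond", 5: "Wedge-up Bond", 6: "Wedge-down Bond",
    7: "Arrow Bond", 8: "Aromatic Bond", 9: "Resonance Bond",
    10: "Bold Bond", 11: "Fischer Bond", 12: "Close Contact",
}


def parse_bond_annotations(text):
    """Parse 'AtomID AtomID AnnotationID' lines into a list of tuples.

    Each tuple is (atom1, atom2, annotation_name). Blank lines are
    skipped; any other malformed line raises ValueError.
    """
    result = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        fields = line.split()
        if len(fields) != 3 or not all(f.isascii() and f.isdigit() for f in fields):
            raise ValueError(f"line {lineno}: expected three unsigned integers")
        atom1, atom2, annotation_id = (int(f) for f in fields)
        if annotation_id not in ANNOTATIONS:
            raise ValueError(f"line {lineno}: unknown AnnotationID {annotation_id}")
        result.append((atom1, atom2, ANNOTATIONS[annotation_id]))
    return result


if __name__ == "__main__":
    # Two aromatic bond annotations between hypothetical atoms 1-2 and 2-3.
    for entry in parse_bond_annotations("1 2 8\n2 3 8\n"):
        print(entry)
```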

PUBCHEM_NONSTANDARDBOND

Substance Non-Standard Bonds. This tag is offered as a convenient alternative to directly encoding the information in the SDF format.

Non-Standard Bonds will affect how the Substance is interpreted and standardized within PubChem. Multiple Non-Standard Bonds may be provided for a Substance. The allowed format for a Non-Standard Bond is three unsigned numbers, separated by white-space, per line, representing the AtomIDs of the two atoms, followed by the bond type ID, respectively. Only a single Non-Standard Bond may be provided per line. The atoms do not have to be actually bonded in the SD file format to have a nonstandard bond. If the atoms are already bonded in the SD file format, the non-standard bonds provided using this SD tag will supersede that interpreted from the SD file format.

Atom-Atom Non-Standard Bond list is in the format:

AtomID AtomID BondTypeID

where AtomID and BondTypeID are unsigned integer numbers.

BondTypeID  Meaning
----------  -----------------
1           Single Bond
2           Double Bond
3           Triple Bond
4           Quadruple Bond
5           Dative Bond
6           Complex Bond
7           Ionic Bond
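Going the other way, a tag value can be generated from bond data held in a program. The sketch below turns (atom, atom, bond type name) triples into "AtomID AtomID BondTypeID" lines; the function name `format_nonstandard_bonds`, the lowercase type keys, and the sample atoms are hypothetical, while the numeric IDs are the ones listed for this tag.

```python
BOND_TYPE_IDS = {
    "single": 1, "double": 2, "triple": 3, "quadruple": 4,
    "dative": 5, "complex": 6, "ionic": 7,
}


def format_nonstandard_bonds(bonds):
    """Return the tag value: one 'AtomID AtomID BondTypeID' line per bond.

    `bonds` is an iterable of (atom1, atom2, type_name) triples, where
    type_name is a key of BOND_TYPE_IDS.
    """
    lines = []
    for atom1, atom2, type_name in bonds:
        if atom1 < 0 or atom2 < 0:
            raise ValueError("AtomIDs must be unsigned integers")
        if type_name not in BOND_TYPE_IDS:
            raise ValueError(f"unknown bond type: {type_name!r}")
        lines.append(f"{atom1} {atom2} {BOND_TYPE_IDS[type_name]}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical metal center (atom 5) with one dative and one ionic bond.
    print(format_nonstandard_bonds([(1, 5, "dative"), (2, 5, "ionic")]))
```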

PUBCHEM_GENERIC_REGISTRY_NAME

Substance Generic Registry Name/ID. Generic registry names are typically assigned by an outside organization.

Note: Instead of using this tag, you can add any Registry Names or IDs to your list of Synonyms.

One or more valid Registry Names or IDs, one per line. The expected format for each is either an unsigned number or a series of three unsigned numbers delimited by a "-" character.
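The expected format (an unsigned number, or three unsigned numbers joined by "-") maps onto a short regular expression. This is a sketch of a format check only; it does not verify that an identifier actually exists in any registry, and the function name `is_registry_id` is hypothetical.

```python
import re

# Either one run of digits, or three runs of digits joined by "-".
_REGISTRY_ID = re.compile(r"[0-9]+(?:-[0-9]+-[0-9]+)?")


def is_registry_id(text):
    """Return True if `text` matches the expected registry ID format."""
    return _REGISTRY_ID.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["50-78-2", "12345", "50-78", "abc"]:
        print(candidate, is_registry_id(candidate))
```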

PUBCHEM_PUBMED_ID

PubMed Identifier to link to and from one or more articles. One PubMed ID (integer) per line.

PUBCHEM_DOI

Digital Object Identifier (DOI) uniquely and persistently refers to data objects like journal articles, research reports, data sets, and official publications.

PUBCHEM_CITATION

Citation refers to a publication when a PubMed-Id and DOI are not available.

PUBCHEM_PATENT_ID

Patent refers to a patent identifier from any of the widely-recognized patent collections.

PUBCHEM_NCBI_OMIM_ID

Online Mendelian Inheritance in Man (OMIM) ID to link to and from one or more NCBI OMIM records. One OMIM ID (integer) per line.

PUBCHEM_NCBI_MMDB_ID

NCBI/NLM/NIH MMDB ID to link to and from one or more protein structure complexes. One MMDB ID (integer) per line.

PUBCHEM_NCBI_GENE_ID

NCBI/NLM/NIH Gene ID to link to and from one or more Gene IDs (not gene names). One Gene ID (integer) per line.

PUBCHEM_PROTEIN_ACCESSION

Protein Accession can be any current accession (or accession.version) found in NCBI Protein. This includes common protein accessions like UniProt and SWISS-PROT.

PUBCHEM_NUCLEOTIDE_ACCESSION

Nucleotide Accession can be any current accession (or accession.version) found in NCBI Nucleotide. This includes common nucleotide accessions from EMBL and DDBJ.

PUBCHEM_NCBI_PROBE_ID

NCBI/NLM/NIH Probe ID to link to and from one or more Probe IDs. One Probe ID (integer) per line.

PUBCHEM_NCBI_GEO_GSE_ID

NCBI/NLM/NIH Gene Expression Omnibus Series Accession (GEO GSE) ID to link to and from one or more GEO GSE IDs. One GEO GSE ID (integer) per line.

PUBCHEM_NCBI_GEO_GSM_ID

NCBI/NLM/NIH Gene Expression Omnibus Sample Accession (GEO GSM) ID to link to and from one or more GEO GSM IDs. One GEO GSM ID (integer) per line.

PUBCHEM_NCBI_BIOSYSTEM_ID

PubChem Pathway Id (Biosystems) is the integer BioSystems ID from PubChem Pathways.

PUBCHEM_GENBANK_GENERIC_ID

NCBI/NLM/NIH GenBank General ID to link to and from one or more Protein or Nucleotide sequences. One GenBank ID (integer) per line (not accessions).

PUBCHEM_NCBI_TAXONOMY_ID

NCBI/NLM/NIH Taxonomy ID to link to and from one or more organisms (Taxonomy IDs). One Taxonomy ID (integer) per line.

PUBCHEM_DEPOSITOR_RECORD_DATE

Optional Date from Submitter. Specify a publicly searchable internal creation or modification date of your substance record.

This date is not related to PubChem submission or processing; rather it is intended to be the date the substance was last changed in your internal database (mapping to PubChem Entrez search field "SourceReleaseDate"). PubChem provides its own date when the record is added or updated ("DepositDate"). For date format, see Hold-Until Date.

PUBCHEM_STRUCTURE (not SDF tag; info only)

Substance structure may be provided as a PubChem CID (such as "2244"), a SMILES string (such as "C1C(CCC1)CCC"), an InChI string (such as "InChI=1S/C8H16/c1-2-5-8-6-3-4-7-8/h8H,2-7H2,1H3"), or a PubChem SID (such as "123").

Alternatively, a structure can be input via the PubChem Sketcher, either by drawing it or by uploading a file in one of the chemical formats that the Sketcher accepts. To input or edit a structure using the Sketcher, click the Edit button or the structure image area.