PubChem Compounds
Because PubChem is an open archive that accepts information about a given molecule from many sources, it is imperative to give the end user an aggregated view of everything known about a single chemical structure. PubChem Compound records are derived summaries that give users access to a rich set of related content. Compound records contain the unique chemical structures extracted from contributed Substance records through a process called ‘standardization’. Each Compound record points to at least one Substance record. In contrast, a Substance record may have no derived Compound record if its structure cannot be standardized or is missing (e.g., a Chinese tea extract).
Show Me Some Compound Examples
PubChem Compound pages (accession: CID, the Compound identifier) summarize the information known about a particular chemical. Take a look at these example pages:
To learn more about the Compound Summary pages, please read this PubChem blog.
What Can I Find On a Compound Page?
You can browse the chemical information currently available for PubChem Compound records at the following link:
https://pubchem.ncbi.nlm.nih.gov/classification/#hid=72
This page can also be reached by selecting the PubChem Table of Contents (TOC) classification from the "Select classification" drop-down menu in the PubChem Classification Browser.
What Is Standardization?
Standardization in PubChem is the validation and determination of a unique chemical structure that is used to create a PubChem Compound from one or more submitted Substance records. Standardization is part of the PubChem Upload pipeline for submitted records with valid chemical structures. It allows PubChem to display one Compound page for aspirin (for example) that includes information from many submitted aspirin Substance records.
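As a rough illustration of the many-to-one relationship described above, the following C++ sketch groups Substance records by a standardized structure key, so that each distinct key stands for one Compound. Everything in it is hypothetical: the `SubstanceRecord` type, the `groupIntoCompounds` function, and the use of a plain string as the structure key are assumptions made for illustration. It does not reproduce PubChem's actual standardization procedure.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical, heavily simplified Substance record.
// A real Substance record carries far more information.
struct SubstanceRecord {
    int sid;                   // placeholder identifier, not a real SID
    std::string structureKey;  // stand-in for a standardized structure;
                               // empty if the structure is missing or
                               // could not be standardized
};

// Groups substances by structure key. Each distinct, non-empty key plays
// the role of one Compound that points back to its source substances.
std::map<std::string, std::vector<int>> groupIntoCompounds(
        const std::vector<SubstanceRecord>& substances) {
    std::map<std::string, std::vector<int>> compounds;
    for (const SubstanceRecord& s : substances) {
        if (s.structureKey.empty()) {
            continue;  // this substance has no derived Compound
        }
        compounds[s.structureKey].push_back(s.sid);
    }
    return compounds;
}
```

In this sketch, three records that share one key collapse into a single entry holding three identifiers, while a record with an empty key appears in no entry. That mirrors the two rules stated earlier: every Compound has at least one Substance, but not every Substance has a Compound.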
Read about the details of PubChem standardization or view a schematic:
Name Weighting
For Compound name lists, PubChem uses a weighting scheme that attempts to put "good" names first. See, for example, the list of synonyms for aspirin. The scheme is empirical: in effect, it up-weights common names supplied by many different organizations and down-weights very short names and all-caps names (which tend to be database identifiers or abbreviations) as well as very long names (such as IUPAC-style names). The actual code used for this is below. "Frequency" here is the number of PubChem depositors that supply the same name for a given Compound.
static const double WeightDropOnSynonymLength = 15.0;
// deprecate synonyms longer than 15 chars
// There is a sharp drop in the number of synonyms longer than 15 chars:
// total number of distinct synonyms: 1,467,773
// datalength > N: Count
// 10: 774636; 11: 727255; 12: 707138; 13: 687647; 14: 678756; 15: 463964;
// 16: 456838; 17: 449340; 18: 441031; 19: 432274; 20: 423138; 21: 413218
// 30: 318323; 40: 223743; 45: 183241; 50: 148660; 55: 119642; 60: 94911;
static const double BiasRatioDisfavorUpperCaseLetter = 0.6;
// versus 1.0 for lower case letters and 0.0 for non-alphabetic characters
static const double kT = WeightDropOnSynonymLength;
// so that synonyms with lengths greater than 2 * WeightDropOnSynonymLength
// start to be substantially (exponentially) deprecated
// with the following scheme, all-lower-case words
// shorter than WeightDropOnSynonymLength chars
// will be judged solely on the basis of their frequency;
// if they have the same frequency, then they all have the same weight
//
unsigned SHProjNS::GetSynonymWeight(const string & synonym, unsigned frequency)
{
    unsigned upperCnt = 0, lowerCnt = 0, synonymSz = synonym.size();
    for(string::const_iterator iter = synonym.begin(); iter != synonym.end(); iter++)
    {
        if(*iter >= 'A' && *iter <= 'Z')
            upperCnt++;
        else if(*iter >= 'a' && *iter <= 'z')
            lowerCnt++;
    }
    double BiasFavorLowerCaseLetter = (1.0 * lowerCnt + BiasRatioDisfavorUpperCaseLetter * upperCnt) / synonymSz;
    double Weight = 100.0 + 20.0 * frequency;
    if(synonymSz < 4)
        Weight *= 0.1;
    else if(synonymSz >= 4 && synonymSz <= WeightDropOnSynonymLength)
        Weight *= 0.1 + 0.9 * BiasFavorLowerCaseLetter;
    else if(synonymSz > WeightDropOnSynonymLength) {
        Weight *= 0.1 + 0.9 * BiasFavorLowerCaseLetter * exp(-(synonymSz-WeightDropOnSynonymLength)/kT);
    }
    // special case to promote MLSMR identifiers slightly
    if (synonymSz == 12 && IsStringStartWithSubstring(synonym,"MLS")) {
        unsigned int i;
        for (i=3; i<12; ++i)
            if (!isdigit(synonym[i]))
                break;
        if (i == 12)
            Weight += 20;
    }
    return (unsigned)(Weight + 0.5);
} // GetSynonymWeight()
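To experiment with the scheme outside PubChem's code base, the listing above can be restated as a standalone function. The sketch below is an approximation under stated assumptions: the name `synonymWeightSketch` and the constant names are ours, the unshown helper `IsStringStartWithSubstring` is replaced by `std::string::compare`, and a guard for empty strings is added that the original does not have.

```cpp
#include <cmath>
#include <string>

// Values copied from the listing above; the names are ours.
static const double kLengthCutoff = 15.0;   // WeightDropOnSynonymLength (also kT)
static const double kUpperCaseBias = 0.6;   // BiasRatioDisfavorUpperCaseLetter

// Standalone restatement of GetSynonymWeight() for experimentation.
unsigned synonymWeightSketch(const std::string& synonym, unsigned frequency) {
    const std::size_t size = synonym.size();
    unsigned upper = 0, lower = 0;
    for (char c : synonym) {
        if (c >= 'A' && c <= 'Z') ++upper;
        else if (c >= 'a' && c <= 'z') ++lower;
    }
    // Guard added for this sketch only: avoids 0/0 on an empty string.
    const double bias =
        size == 0 ? 0.0 : (1.0 * lower + kUpperCaseBias * upper) / size;

    // The base weight grows linearly with the number of depositors.
    double weight = 100.0 + 20.0 * frequency;
    if (size < 4) {
        weight *= 0.1;  // very short names are heavily down-weighted
    } else if (size <= kLengthCutoff) {
        weight *= 0.1 + 0.9 * bias;
    } else {
        // Beyond the cutoff the case-dependent part decays exponentially.
        weight *= 0.1 + 0.9 * bias *
                  std::exp(-(size - kLengthCutoff) / kLengthCutoff);
    }

    // MLSMR special case: "MLS" followed by exactly nine digits.
    if (size == 12 && synonym.compare(0, 3, "MLS") == 0 &&
        synonym.find_first_not_of("0123456789", 3) == std::string::npos) {
        weight += 20;
    }
    return static_cast<unsigned>(weight + 0.5);
}
```

With an arbitrary frequency of 10, this sketch returns 300 for "aspirin", 192 for "ASPIRIN", and 30 for "ASA", which shows how letter case and length alone reorder names that the same number of depositors supplied.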
References
- "Data Organization" in the PubChem Help documentation
- The "Data Organization" section in "PubChem Substance and Compound databases", S. Kim et al., Nucleic Acids Res. 2016; 44(Database issue):D1202–D1213. doi:10.1093/nar/gkv951
- "What is the difference between a substance and a compound in PubChem?" on the PubChem Blog