PubChemRDF Release Notes

V1.5 beta (See the V1.1 beta Release Notes)

1. Introduction

1.1. What is RDF?
1.2. How can PubChemRDF help your research?

2. Ontology-based Data Integration

3. PubChemRDF URI Constructions

4. PubChemRDF Subdomains

4.1. PubChem Compound
4.2. PubChem Substance
4.3. PubChem Descriptors
4.4. PubChem InChIKey
4.5. PubChem Synonym
4.6. PubChem BioAssay
4.7. PubChem MeasureGroup
4.8. PubChem Endpoint
4.9. PubChem Protein
4.10. PubChem ConservedDomain
4.11. PubChem Gene
4.12. PubChem Biosystem
4.13. PubChem Neighbor
4.14. PubChem Source
4.15. PubChem Reference
4.16. PubChem Concept

5. RESTful Interface

5.1. URI Dereferencing
5.2. Query RESTful Interface
5.3. HTTP Response Status

6. RDF FTP Download Directory Layout

6.1. PubChem Compound
6.2. PubChem Substance
6.3. PubChem Descriptor
6.4. PubChem InChIKey
6.5. PubChem Synonym
6.6. PubChem Bioassay
6.7. PubChem MeasureGroup
6.8. PubChem Endpoint
6.9. PubChem Protein
6.10. PubChem ConservedDomain
6.11. PubChem Gene
6.12. PubChem Biosystem
6.13. PubChem Source
6.14. PubChem Reference

7. PubChemRDF Use Cases

8. Document Version History

1. Introduction

Semantic Web technologies are emerging as an increasingly important approach to distributing and integrating scientific data. These technologies include the trio of the Resource Description Framework (RDF), the Web Ontology Language (OWL), and the SPARQL query language. The PubChemRDF project provides RDF-formatted information for the PubChem Compound, Substance, and BioAssay databases.

1.1. What is RDF?

RDF is a family of World Wide Web Consortium (W3C) specifications for data interchange on the Web. RDF breaks knowledge down into discrete, machine-readable pieces called “triples.” Each triple is organized as a trio of ‘subject-predicate-object’. For example, in the phrase “atorvastatin may treat hypercholesterolemia,” the subject is “atorvastatin”, the predicate is “may treat”, and the object is “hypercholesterolemia.” RDF uses a Uniform Resource Identifier (URI) to name each part of the “subject-predicate-object” triple. A URI looks just like a typical web URL. RDF is a core part of the Semantic Web standards. As an extension of the existing World Wide Web, the Semantic Web attempts to make it easier for users to find, share, and combine information. The Semantic Web builds on the following technologies: Extensible Markup Language (XML), which provides a syntax for RDF; Web Ontology Language (OWL), which extends the ability of RDF to encode information; Resource Description Framework (RDF), which expresses knowledge; and the RDF query language (SPARQL), which enables query and manipulation of RDF content.
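The subject-predicate-object structure can be sketched in a few lines of Python. This is only an illustration of the data model: the http://example.org/ namespace and the names Triple, EX, and may_treat are assumptions made for this example and are not PubChemRDF identifiers.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical namespace used only for this illustration.
EX = "http://example.org/"

statement = Triple(
    subject=EX + "atorvastatin",
    predicate=EX + "may_treat",
    obj=EX + "hypercholesterolemia",
)

# An RDF graph can be thought of as a set of such triples.
graph = {statement}

# Look up every object linked to atorvastatin by the "may_treat" predicate.
treats = [
    t.obj
    for t in graph
    if t.subject == EX + "atorvastatin" and t.predicate == EX + "may_treat"
]
print(treats)
```

A triplestore performs the same kind of pattern matching at scale, usually through SPARQL rather than hand-written loops.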

1.2. How can PubChemRDF help your research?

PubChem users have frequently expressed interest in having a downloadable database. Using PubChemRDF, one can download the desired RDF-formatted data files from the PubChem FTP site, import them into a triplestore, and query them through a SPARQL query interface. Together these tools enable schema-less database access and query. There are a number of open-source and commercial triplestores, such as Apache Jena TDB and OpenLink Virtuoso (a list can be found here: http://en.wikipedia.org/wiki/Triplestore). Besides triplestores, PubChemRDF data can also be loaded into RDF-aware graph databases such as Neo4j, where graph traversal algorithms can be used to query the RDF graphs. Last but not least, the ontological representation of the PubChem knowledge base allows logical inference, such as forward/backward chaining. The RDF data on the PubChem FTP site is arranged so that you only need to download the type of information in which you are interested, and can avoid downloading parts of PubChem data you will not use. For example, if you are just interested in computed chemical properties, you only need to download the PubChemRDF data in the compound descriptor directory. In addition to bulk download, PubChemRDF also provides programmatic data access through a RESTful interface.

This document provides detailed technical information (release notes) about the PubChemRDF project. Additional information is available on the PubChemRDF FTP Site, the slide presentation: PubChemRDF introduction, the slide presentation: PubChemRDF details, and on the PubChem Blog.

2. Ontology-based Data Integration

As depicted in Figure 1, the PubChemRDF content includes a number of semantic relationships, such as those between compounds and substances, the chemical descriptors associated with compounds and substances, the relationships between compounds, the provenance and attribution metadata of substances, and the concise bioactivity data view of substances. Whenever possible, pre-existing ontological frameworks were used to semantically describe information available in the PubChem archive, rather than creating new ones. However, in some cases, no suitable types or relations were defined in standard ontologies, and a PubChem vocabulary was created to define these terms. The standardized ontologies used to define the domain-specific knowledge are listed in Table 1 and include: Chemical Entities of Biological Interest (ChEBI), CHEMical INFormation ontology (CHEMINF), Protein Ontology (PRO), Gene Ontology (GO), Semanticscience Integrated Ontology (SIO), Basic Formal Ontology (BFO), Ontology for Biomedical Investigations (OBI), Information Artifact Ontology (IAO), BioAssay Ontology (BAO), Units of Measurement (UO), Quantities, Units, Dimensions and Data Types (QUDT), Citation Typing Ontology (CiTO), FRBR-aligned Bibliographic Ontology (FaBiO), Dublin Core Metadata Initiative (DCMI) Terms, Provenance Authoring and Versioning ontology (PAV), Simple Knowledge Organization System (SKOS), the Friend Of A Friend (FOAF) vocabulary, BioPAX, National Drug File-Reference Terminology (NDF-RT), and National Cancer Institute Thesaurus (NCIt). All of the biomedical ontologies, such as ChEBI, CHEMINF, PRO, GO, BFO, SIO, and BAO, are made accessible by the NIH Roadmap National Center for Biomedical Ontology (NCBO) through its BioPortal, and comply with an evolving set of shared principles established by the Open Biomedical Ontologies (OBO) Foundry. Adoption of these core ontologies helps to ensure that the mapping of chemical and biological information is compatible across multiple Semantic Web resources.

Figure 1. Color-coded diagram showing a high-level overview of the PubChemRDF semantic relationships.

Table 1. The prefixes and corresponding namespaces of standardized ontologies used in PubChemRDF.

Prefix | Namespace | Vocabularies
rdfs | http://www.w3.org/2000/01/rdf-schema# | RDF Schema
rdf | http://www.w3.org/1999/02/22-rdf-syntax-ns# | RDF
owl | http://www.w3.org/2002/07/owl# | OWL
xsd | http://www.w3.org/2001/XMLSchema# | XML Schema
ndfrt | http://evs.nci.nih.gov/ftp1/NDF-RT/NDF-RT.owl# | NDF-RT
ncit | http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl# | NCIt
sio [a] | http://semanticscience.org/resource/ | SIO
cheminf [a] | http://semanticscience.org/resource/ | CHEMINF
skos | http://www.w3.org/2004/02/skos/core# | SKOS
obo | http://purl.obolibrary.org/obo/ | BFO, OBI, IAO, UO, ChEBI, PR, GO
bao | http://www.bioassayontology.org/bao# | BAO
bp | http://www.biopax.org/release/biopax-level3.owl# | BioPAX
cito | http://purl.org/spar/cito/ | CiTO
fabio | http://purl.org/spar/fabio/ | FaBiO
pdbo | http://rdf.wwpdb.org/schema/pdbx-v40.owl# | PDBo
dcterms | http://purl.org/dc/terms/ | DCMI Terms
pav | http://purl.org/pav/ | PAV
foaf | http://xmlns.com/foaf/0.1/ | FOAF Vocabulary

[a] The sio and cheminf ontologies share a URI namespace but are distinct.

3. PubChemRDF URI Constructions

In this document, PubChemRDF statements are written in the Turtle syntax with Uniform Resource Identifiers (URIs) in abbreviated (prefixed) form. Turtle prefix directives map each prefix to its namespace URI, so that a prefixed name can be expanded to the full URI by appending its local part to the namespace. The PubChem subdomain namespaces are listed in Table 2. Both “303 URIs” and “hash URIs” are used in the PubChemRDF project, following W3C recommendations; however, “hash URIs” are used only for the PubChem vocabulary subdomain, and “303 URIs” are used for all other PubChemRDF subdomains. The PubChem vocabulary serves as a terminology defining the types and relations of some PubChem-specific terms. For instance, the URI for the type of PubChem-specific 3-D structural similarity defined in the PubChem vocabulary is as follows:

http://rdf.ncbi.nlm.nih.gov/pubchem/vocabulary#3D_structural_similarity

Table 2. The prefixes and corresponding namespaces of subdomains used in PubChemRDF.

Prefix | Namespace
compound | http://rdf.ncbi.nlm.nih.gov/pubchem/compound/
substance | http://rdf.ncbi.nlm.nih.gov/pubchem/substance/
descr | http://rdf.ncbi.nlm.nih.gov/pubchem/descriptor/
inchikey | http://rdf.ncbi.nlm.nih.gov/pubchem/inchikey/
syno | http://rdf.ncbi.nlm.nih.gov/pubchem/synonym/
bioassay | http://rdf.ncbi.nlm.nih.gov/pubchem/bioassay/
measuregroup | http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup/
endpoint | http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint/
protein | http://rdf.ncbi.nlm.nih.gov/pubchem/protein/
conserveddomain | http://rdf.ncbi.nlm.nih.gov/pubchem/conserveddomain/
biosystem | http://rdf.ncbi.nlm.nih.gov/pubchem/biosystem/
gene | http://rdf.ncbi.nlm.nih.gov/pubchem/gene/
reference | http://rdf.ncbi.nlm.nih.gov/pubchem/reference/
nbr [a] | http://rdf.ncbi.nlm.nih.gov/pubchem/neighbor/
source | http://rdf.ncbi.nlm.nih.gov/pubchem/source/
concept | http://rdf.ncbi.nlm.nih.gov/pubchem/concept/
vocab | http://rdf.ncbi.nlm.nih.gov/pubchem/vocabulary#

[a] The RDF triples for the neighbor subdomain are currently only available through the RESTful interface.
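The relationship between a prefixed name and its full URI can be sketched in Python as follows. The prefix-to-namespace pairs are a subset of Table 2; the names PREFIXES, expand, and turtle_prefix_directives are hypothetical helpers introduced for this illustration and are not part of any PubChem tool.

```python
PUBCHEM_BASE = "http://rdf.ncbi.nlm.nih.gov/pubchem/"

# A subset of the prefixes listed in Table 2.
PREFIXES = {
    "compound": PUBCHEM_BASE + "compound/",
    "substance": PUBCHEM_BASE + "substance/",
    "descr": PUBCHEM_BASE + "descriptor/",
    "syno": PUBCHEM_BASE + "synonym/",
    "measuregroup": PUBCHEM_BASE + "measuregroup/",
    "vocab": PUBCHEM_BASE + "vocabulary#",
}


def expand(prefixed_name: str) -> str:
    """Expand a prefixed name such as 'compound:CID60823' to a full URI."""
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or prefix not in PREFIXES:
        raise ValueError(f"unknown or missing prefix in {prefixed_name!r}")
    return PREFIXES[prefix] + local


def turtle_prefix_directives() -> str:
    """Render the prefix table as Turtle @prefix directives."""
    return "\n".join(f"@prefix {p}: <{ns}> ." for p, ns in PREFIXES.items())


print(expand("compound:CID60823"))
print(turtle_prefix_directives())
```

Note that the vocabulary namespace ends in “#” (a hash URI), while the other namespaces end in “/”; the expansion rule is the same in both cases.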

The URIs for PubChem compounds and substances were constructed based on primary accession identifiers (CID and SID). For instance, the URIs for CID60823 and SID103554720 can be represented as:

http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID60823 

http://rdf.ncbi.nlm.nih.gov/pubchem/substance/SID103554720 

which can be abbreviated as compound:CID60823 and substance:SID103554720, respectively. The InChIKey URIs were constructed based on the value of InChIKey. For instance, the URI for InChIKey with value of “XUKUURHRXDUEBC-KAYWLYCHSA-N” (case-insensitive) can be represented as:

http://rdf.ncbi.nlm.nih.gov/pubchem/inchikey/XUKUURHRXDUEBC-KAYWLYCHSA-N

Most chemical descriptor namespace URIs were constructed based on a combination of CID/SID and descriptor labels, except in the case of depositor-provided synonyms. For instance, the URI for the molecular weight of CID60823 can be represented as:

http://rdf.ncbi.nlm.nih.gov/pubchem/descriptor/CID60823_Molecular_Weight

or simply as descr:CID60823_Molecular_Weight. The URIs for depositor-provided synonyms were constructed from MD5 hash values, after first converting the chemical names to lower case. For example, ‘Atorvastatin [INN:BAN]’ becomes ‘atorvastatin [inn:ban]’ to produce the MD5 hash ‘7be8fb160fff31a7beea9df539fd36bd’, and ‘(3R,5R)-7-[3-(anilinocarbonyl)-5-(4-fluorophenyl)-2-isopropyl-4-phenyl-1H-pyrrol-1-yl]-3,5-dihydroxyheptanoic acid’ becomes ‘(3r,5r)-7-[3-(anilinocarbonyl)-5-(4-fluorophenyl)-2-isopropyl-4-phenyl-1h-pyrrol-1-yl]-3,5-dihydroxyheptanoic acid’ to produce the MD5 hash ‘c576a26b0c67fa6b072b61a0b4c57a6c’. Using an MD5 hash in place of the actual chemical name allows the PubChem information associated with any given chemical name to be accessed directly using RDF. For instance, the depositor-provided synonym ‘Atorvastatin’ can be represented as:

http://rdf.ncbi.nlm.nih.gov/pubchem/synonym/MD5_9a05646d461669f86de312d88ab5748a

or simply as syno:MD5_9a05646d461669f86de312d88ab5748a.
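The synonym URI construction described above can be sketched as follows. This is a minimal sketch under stated assumptions: the text says only that names are lower-cased before hashing, so the UTF-8 encoding and the absence of any other normalization (such as whitespace trimming) are assumptions, and synonym_uri is a hypothetical helper name.

```python
import hashlib

SYNONYM_NS = "http://rdf.ncbi.nlm.nih.gov/pubchem/synonym/"


def synonym_uri(name: str) -> str:
    """Build a synonym URI from the MD5 hash of the lower-cased name.

    Assumption: the name is encoded as UTF-8 and not otherwise normalized.
    """
    digest = hashlib.md5(name.lower().encode("utf-8")).hexdigest()
    return f"{SYNONYM_NS}MD5_{digest}"


# Names that differ only in case map to the same URI.
print(synonym_uri("Atorvastatin [INN:BAN]"))
print(synonym_uri("atorvastatin [inn:ban]"))
```

If the encoding assumption matches the procedure actually used by PubChem, the printed URIs should end in the hash quoted above for ‘atorvastatin [inn:ban]’; this has not been verified here.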

Some of the PubChem synonyms that are equivalent to World Health Organization (WHO) International Nonproprietary Names (INNs) represent pharmaceutical substances, and some of them are assigned WHO Anatomical Therapeutic Chemical (ATC) codes. The ATC classification system can be used to search and group active ingredients in clinical drugs. Each ATC class is exposed as a skos:Concept in the PubChemRDF concept subdomain, and the ATC codes are used to construct the URIs of those concepts. For instance, protein kinase inhibitors have the ATC code “L01XE”, and the URI for this concept is:

http://rdf.ncbi.nlm.nih.gov/pubchem/concept/ATC_L01XE 

PubChem BioAssay records were annotated in different ways depending on the assay type, in accordance with the BioAssay Ontology (BAO). Literature-extracted bioassays, such as those from ChEMBL, were represented as instances of BAO measure group (BAO_0000040), since these are summary results abstracted from the literature and lack specific information on how the biological experiment was performed. The URIs for assay records are constructed from the PubChem BioAssay accession identifiers (AIDs). For instance, the URI for AID447528 can be assigned as:

http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup/AID447528

or abbreviated as measuregroup:AID447528.

Some assays in PubChem aggregate literature-abstracted bioactivity data from multiple publications. For instance, AID363, deposited by BindingDB, contains bioactivity data tested against human Src kinase from 23 different publications. In literature-derived assays of this type, a single bioassay record is broken down into multiple measure groups, and the identifier of each individual measure group combines the AID and the PubMed identifier (PMID), for example:

http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup/AID363_PMID16161995

or abbreviated as measuregroup:AID363_PMID16161995. However, in contrast to literature-extracted assays, biological screening experiments, such as those from the NIH Molecular Library Program (MLP), were represented as an instance of BAO bioassay (BAO_0000015). For instance, the URI for AID1788 can be assigned as:

http://rdf.ncbi.nlm.nih.gov/pubchem/bioassay/AID1788

or abbreviated as bioassay:AID1788.

Although screening assays and literature-extracted assays are different, they are related. Each screening assay refers to an operational unit and may have one or more instances of BAO measure group (BAO_0000040). If the screening assay is a panel assay (for instance, testing against a panel of multiple targets as occurs when performing a lead profiling screen), the URIs are constructed based on the combination of AID and panel component identifier (PID), for example:

http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup/AID1788_1

or abbreviated as measuregroup:AID1788_1. Therefore, the measure group serves as a basic concept interlinking chemical substances, molecular targets, and the bioactivity endpoints for a given PubChem BioAssay record.

The URIs of bioactivity endpoints were constructed from the combination of SID and AID, plus the PID if the endpoint was produced by a panel screening assay, or the PMID if the endpoint was derived from an aggregated literature-extracted assay. The following URIs demonstrate the different reference approaches used for bioactivity endpoints:

http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint/SID103164874_AID443491

http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint/SID99445338_AID2202_1

http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint/SID8033500_AID363_PMID10395478

or abbreviated as endpoint:SID103164874_AID443491, endpoint:SID99445338_AID2202_1, and endpoint:SID8033500_AID363_PMID10395478, respectively. The first URI above refers to an endpoint derived from a ChEMBL assay, the second URI refers to an endpoint produced by a panel screening assay (assay panel PID is 1), and the last URI refers to an endpoint derived from a literature-extracted assay (PMID is 10395478).
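The measure group and endpoint URI patterns described above can be summarized in a short Python sketch. The function names measuregroup_uri and endpoint_uri and the helper _assay_suffix are hypothetical and exist only for this illustration; the URI patterns themselves are taken from the examples in this section.

```python
from typing import Optional

BASE = "http://rdf.ncbi.nlm.nih.gov/pubchem/"


def _assay_suffix(aid: int, pid: Optional[int], pmid: Optional[int]) -> str:
    """Return 'AID<aid>', optionally followed by '_<pid>' or '_PMID<pmid>'."""
    if pid is not None and pmid is not None:
        raise ValueError("give at most one of pid and pmid")
    suffix = f"AID{aid}"
    if pid is not None:
        suffix += f"_{pid}"        # panel screening assay
    elif pmid is not None:
        suffix += f"_PMID{pmid}"   # aggregated literature-extracted assay
    return suffix


def measuregroup_uri(aid: int, pid: Optional[int] = None,
                     pmid: Optional[int] = None) -> str:
    return f"{BASE}measuregroup/{_assay_suffix(aid, pid, pmid)}"


def endpoint_uri(sid: int, aid: int, pid: Optional[int] = None,
                 pmid: Optional[int] = None) -> str:
    return f"{BASE}endpoint/SID{sid}_{_assay_suffix(aid, pid, pmid)}"


print(measuregroup_uri(1788, pid=1))
print(endpoint_uri(8033500, 363, pmid=10395478))
```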

In the case of protein targets, the URIs are created based upon the National Center for Biotechnology Information (NCBI) Protein GI numbers:

http://rdf.ncbi.nlm.nih.gov/pubchem/protein/GI124375976

or abbreviated as protein:GI124375976. When a protein complex is the bioassay target, the URI is constructed by concatenating the GI numbers of its components in ascending order:

http://rdf.ncbi.nlm.nih.gov/pubchem/protein/GI2506129GI254763435
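A minimal sketch of this construction is shown below. It assumes that “ascending order” means numeric (not lexicographic) ordering of the GI numbers; protein_uri is a hypothetical helper name.

```python
from typing import Iterable

PROTEIN_NS = "http://rdf.ncbi.nlm.nih.gov/pubchem/protein/"


def protein_uri(gi_numbers: Iterable[int]) -> str:
    """Build a protein URI from one GI number, or a complex URI from several.

    Assumption: GI numbers are sorted numerically before concatenation.
    """
    ordered = sorted(int(gi) for gi in gi_numbers)
    if not ordered:
        raise ValueError("at least one GI number is required")
    return PROTEIN_NS + "".join(f"GI{gi}" for gi in ordered)


print(protein_uri([124375976]))
print(protein_uri([254763435, 2506129]))
```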

The protein targets tested in the PubChem BioAssay database can be linked to other NCBI databases, including Conserved Domain, Gene, BioSystems, and PubMed. Protein conserved domains contain recurring sequence patterns that define the functional and/or structural units of protein sequences. NCBI conserved domains are identified through multiple sequence alignment and are distinguished by position-specific scoring matrix (PSSM) models. Each PSSM model has a unique PSSM identifier (PSSMID). NCBI Gene integrates gene information for various species. Each gene has a unique Gene ID (GID). NCBI BioSystems groups biological entities that interact in a biological process into a single conceptual unit called a biosystem, which may belong to one of several categories: biological pathway, molecular function, cellular location, and signature module. Only the three categories corresponding to Gene Ontology classifications (biological pathway, molecular function, and cellular location) are taken into account in the PubChemRDF project. Each biosystem is assigned a unique identifier (BSID). The NCBI PubMed database comprises more than 23 million citations from the biomedical literature. Each PubMed record is assigned a unique identifier (PMID). PubChemRDF provides RDF triples that expose the linkage information and the basic descriptions of these resources.

The URIs for conserved domains use PSSMIDs:

http://rdf.ncbi.nlm.nih.gov/pubchem/conserveddomain/PSSMID132758

or abbreviated as conserveddomain:PSSMID132758. The URIs for genes use NCBI gene IDs:

http://rdf.ncbi.nlm.nih.gov/pubchem/gene/GID367

or abbreviated as gene:GID367. The URIs for biosystems use BSIDs:

http://rdf.ncbi.nlm.nih.gov/pubchem/biosystem/BSID82991

or abbreviated as biosystem:BSID82991. The URIs for publication references use PMIDs:

http://rdf.ncbi.nlm.nih.gov/pubchem/reference/PMID10395478

or abbreviated as reference:PMID10395478.

The URIs for PubChem depositors are based on the names of depositors:

http://rdf.ncbi.nlm.nih.gov/pubchem/source/ChEMBL

or abbreviated as source:ChEMBL. Depositor names are adjusted as follows: if a name is purely numeric, the prefix “ID” is added; the symbols “,”, “.”, “&”, “(”, “)”, and “/” are deleted; and spaces are replaced by “_”.
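These naming rules can be sketched in Python as follows. The order in which the rules are applied is an assumption (the text does not specify it), source_uri is a hypothetical helper name, and the spelled-out depositor name in the usage example is an illustrative input inferred from the source URI that appears in Section 4.6.

```python
SOURCE_NS = "http://rdf.ncbi.nlm.nih.gov/pubchem/source/"

# Symbols that are deleted from depositor names.
_DELETED_SYMBOLS = ",.&()/"


def source_uri(depositor_name: str) -> str:
    """Turn a depositor name into a source URI.

    Assumed order: add 'ID' to numeric names, delete symbols, replace spaces.
    """
    name = depositor_name
    if name.isdigit():
        name = "ID" + name
    name = name.translate(str.maketrans("", "", _DELETED_SYMBOLS))
    name = name.replace(" ", "_")
    return SOURCE_NS + name


print(source_uri("ChEMBL"))
# Illustrative input; the exact depositor name is an assumption.
print(source_uri("Vanderbilt Screening Center for GPCRs, Ion Channels and Transporters"))
```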

The URIs for PubChem Compound 2-D and 3-D similarity neighbors and for PubChem BioAssay protein target sequence similarity neighbors are illustrated by these compound examples:

http://rdf.ncbi.nlm.nih.gov/pubchem/neighbor/CID60823_CID68019409_2DSimilarity

http://rdf.ncbi.nlm.nih.gov/pubchem/neighbor/CID60823_CID11330946_3DSimilarity

or abbreviated as nbr:CID60823_CID68019409_2DSimilarity, and nbr:CID60823_CID11330946_3DSimilarity, respectively.

4. PubChemRDF Subdomains

4.1. PubChem Compound

PubChem Compound RDF triples expose the linkage from compound to the chemical descriptor resources and interrelated compounds, such as compound identity groups (CIGs). [See Figure 1 for a diagram of links to other RDF subdomains.] For example, to resolve the URI in the RESTful interface for compound CID60823:

http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID60823

Link Type | Example RDF Triple
calculated chemical descriptor | compound:CID60823 sio:has-attribute descr:CID60823_Molecular_Weight .
parent compound | compound:CID60823 vocab:hasParentCompound compound:CID60823 .
component compound | compound:CID22765305 cheminf:CHEMINF_000478 compound:CID2244 .
compound identity group (CIG) | compound:CID60823 cheminf:CHEMINF_000462 compound:CID2250 .
2-D similarity neighbor [a] | compound:CID60823 cheminf:CHEMINF_000482 compound:CID60822 .
3-D similarity neighbor [a] | compound:CID60823 cheminf:CHEMINF_000483 compound:CID10745515 .

[a] Given the large number of similarity neighbors for a given compound, the RDF statements for similarity neighbors are moved to a separate URI: http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID60823/nbr

If the compound has links to biological pathways in the NCBI Biosystems database, the RDF triples representing the participation relations are provided:

    compound:CID305 obo:BFO_0000056 biosystem:BSID122548 .

If the compound has links to the corresponding Wikidata records, the RDF triples representing the cross-reference relations are provided:

    compound:CID60823 skos:closeMatch <http://www.wikidata.org/entity/Q668093> .

If the compound can be mapped to the drug ontologies (i.e., ChEBI, NDF-RT and NCIt), the drug classes defined in the classification terminologies are used to annotate the compound:

    compound:CID60823 rdf:type obo:CHEBI_39548 .
    compound:CID60823 rdf:type ndfrt:N0000022046 .
    compound:CID60823 rdf:type ncit:C61527 .

4.2. PubChem Substance

PubChem Substance RDF triples expose the linkage between: substance and chemical descriptor resources, substance and standardized compound resources, substance and measure group resources, and substance and data source resources. For example, to resolve the URI for SID8032774:

http://rdf.ncbi.nlm.nih.gov/pubchem/substance/SID8032774

Link Type | Example RDF Triple
depositor-provided descriptor | substance:SID8032774 sio:has-attribute descr:SID8032774_Depositor_Identifier .
standardized compound | substance:SID8032774 cheminf:CHEMINF_000477 compound:CID5327844 .
data source | substance:SID8032774 dcterms:source source:BindingDB .
measure group | substance:SID8032774 obo:BFO_0000056 measuregroup:AID363_PMID9357527 .

If the substance was deposited by the ChEMBL database, a cross-link to the ChEMBL RDF resource is provided:

    substance:SID194189249 skos:exactMatch <http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL3138729> .

If the substance was deposited by NCBI MMDB, a cross-link to the RDF-based PDB resource is provided:

    substance:SID198184368 pdbo:link_to_pdb <http://rdf.wwpdb.org/pdb/4UUB> .

4.3. PubChem Descriptors

PubChem descriptor RDF triples expose the type, value and unit for a given descriptor.

For example, to resolve the URIs in the RESTful interface for the molecular weight of PubChem Compound record CID60823 and for a depositor-provided descriptor of PubChem Substance record SID8032774:

http://rdf.ncbi.nlm.nih.gov/pubchem/descriptor/CID60823_Molecular_Weight

http://rdf.ncbi.nlm.nih.gov/pubchem/descriptor/SID8032774_Substance_Version

Link Type | Example RDF Triple
type | descr:CID60823_Molecular_Weight rdf:type cheminf:CHEMINF_000334 .
value | descr:CID60823_Molecular_Weight sio:has-value "558.639803"^^xsd:double .
unit | descr:CID60823_Molecular_Weight sio:has-unit obo:UO_0000055 .
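The one-statement-per-line triples shown in these examples can be split into subject, predicate, and object with a few lines of Python. This is a minimal sketch for the simple form used in this document only; it is not a Turtle parser (it does not handle object lists, predicate lists, or multi-line statements), and parse_triple_line is a hypothetical helper name. For real data files, an RDF library or a triplestore loader is the appropriate tool.

```python
from typing import Tuple


def parse_triple_line(line: str) -> Tuple[str, str, str]:
    """Split a simple 'subject predicate object .' line into its three parts."""
    text = line.strip()
    if not text.endswith("."):
        raise ValueError("statement must end with '.'")
    text = text[:-1].rstrip()
    # Split on the first two runs of whitespace only, so that a literal
    # object containing spaces stays in one piece.
    parts = text.split(None, 2)
    if len(parts) != 3:
        raise ValueError(f"expected three terms, got {len(parts)}")
    return parts[0], parts[1], parts[2]


s, p, o = parse_triple_line(
    'descr:CID60823_Molecular_Weight sio:has-value "558.639803"^^xsd:double .'
)
print(s, p, o, sep="\n")
```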

4.4. PubChem InChIKey

PubChem InChIKey RDF triples expose the type, value and the link to the corresponding compound(s) for a given InChIKey. For example, to resolve the URI for the InChIKey with a value of “BSYNRYMUTXBXSQ-UHFFFAOYSA-N” (case-insensitive):

http://rdf.ncbi.nlm.nih.gov/pubchem/inchikey/BSYNRYMUTXBXSQ-UHFFFAOYSA-N

Link Type | Example RDF Triple
type | inchikey:BSYNRYMUTXBXSQ-UHFFFAOYSA-N rdf:type cheminf:CHEMINF_000399 .
value | inchikey:BSYNRYMUTXBXSQ-UHFFFAOYSA-N sio:has-value "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"@en .
compound | inchikey:BSYNRYMUTXBXSQ-UHFFFAOYSA-N sio:is-attribute-of compound:CID2244 .

If the InChIKey represents a chemical structure in the FDA UNII database, and the UNII code is incorporated as a registry number in a MeSH concept, the annotation of the InChIKey with the MeSH concept is provided:

    inchikey:ADUKCCWBEDSMEB-NSHDSACASA-N dcterms:subject <http://id.nlm.nih.gov/mesh/M0017537> .

4.5. PubChem Synonym

PubChem synonym RDF triples expose the type and value of a given MD5 hash string, and the link to the corresponding compound(s). For example, to resolve the URI for the synonym “5'-CYTIDYLIC ACID”:

http://rdf.ncbi.nlm.nih.gov/pubchem/synonym/MD5_8437e4cbbdae1037d9da9fc57f8fdd1A

Link Type | Example RDF Triple
compound | syno:MD5_8437e4cbbdae1037d9da9fc57f8fdd1A sio:is-attribute-of compound:SID2244 .
type | syno:MD5_8437e4cbbdae1037d9da9fc57f8fdd1A rdf:type cheminf:CHEMINF_000339 .
value | syno:MD5_8437e4cbbdae1037d9da9fc57f8fdd1A sio:has-value "5'-CYTIDYLIC ACID"@en .

If the synonym represents a MeSH term, or a registry number for a MeSH concept, the annotation of the synonym with the MeSH concept is provided:

    syno:MD5_a5d089eba9ea66d56b43ee9635a478e1 dcterms:subject <http://id.nlm.nih.gov/mesh/M0000161> .

If the synonym represents a WHO INN that has been assigned an ATC code, the annotation of the synonym with the ATC classification is provided:

    syno:MD5_47a63fc97978537204b7a7371de13662 dcterms:subject concept:ATC_L01XE01 .

4.6. PubChem BioAssay

PubChem BioAssay RDF triples expose the type, title, data source, and the linkage to measure groups for a given assay. For example, to resolve the URI for the PubChem BioAssay record AID1788:

http://rdf.ncbi.nlm.nih.gov/pubchem/bioassay/AID1788

Link Type | Example RDF Triple
type | bioassay:AID1788 rdf:type bao:BAO_0000015 .
title | bioassay:AID1788 dcterms:title "Discovery of novel allosteric modulators of the M1 muscarinic receptor: Agonist Ancillary Activity"@en .
data source | bioassay:AID1788 dcterms:source source:Vanderbilt_Screening_Center_for_GPCRs_Ion_Channels_and_Transporters .
measure group | bioassay:AID1788 bao:BAO_0000209 measuregroup:AID1788_1 .

4.7. PubChem MeasureGroup

For high-throughput screening assays, including panel assays, PubChem measure group RDF triples expose the title and type, as well as the linkage to the participating proteins and endpoints. For example, to resolve the URI for assay panel 1 of the PubChem BioAssay record AID1788:

http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup/AID1788_1

Link Type | Example RDF Triple
type | measuregroup:AID1788_1 rdf:type bao:BAO_0000040 .
title | measuregroup:AID1788_1 dcterms:title "Adenosine A1 (human) "@en .
protein | measuregroup:AID1788_1 obo:BFO_0000057 protein:GI4501947 .
endpoint | measuregroup:AID1788_1 obo:OBI_0000299 endpoint:SID56353039_AID1788_1 .

For literature-extracted assays, PubChem measure group RDF triples expose the title, type, data source, as well as the linkage to the participating proteins and endpoints. For example, to resolve the URI for literature-extracted assay AID447528:

http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup/AID447528

Link Type | Example RDF Triple
type | measuregroup:AID447528 rdf:type bao:BAO_0000040 .
title | measuregroup:AID447528 dcterms:title "Inhibition of ovine COX1 by enzyme immunoassay"@en .
data source | measuregroup:AID447528 dcterms:source source:ChEMBL .
protein | measuregroup:AID447528 obo:BFO_0000057 protein:GI548481 .
endpoint | measuregroup:AID447528 obo:OBI_0000299 endpoint:SID103164874_AID447528 .

4.8. PubChem Endpoint

PubChem endpoint RDF triples expose the type, value, unit, reference, and the linkage to substance. For example, to resolve the URI for the bioassay endpoint between PubChem records SID103164874 and AID443491:

http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint/SID103164874_AID443491

Link Type | Example RDF Triple
type | endpoint:SID103164874_AID443491 rdf:type bao:BAO_0000190 .
value | endpoint:SID103164874_AID443491 qudt:numericValue "0.162"^^xsd:double .
unit | endpoint:SID103164874_AID443491 qudt:unit ops:Micromolar .
substance | endpoint:SID103164874_AID443491 obo:IAO_0000136 substance:SID103164874 .
reference | endpoint:SID103164874_AID443491 cito:citesAsDataSource reference:PMID19880317 .

4.9. PubChem Protein

PubChem protein RDF triples expose the type (Protein Ontology), title, similarity neighbors, conserved domains, encoding genes, organisms, biosystems, references, and the cross-links to UniProt RDF. For example, to resolve the URI for the NCBI Protein record GI124375976:

http://rdf.ncbi.nlm.nih.gov/pubchem/protein/GI124375976

Link Type | Example RDF Triple
type | protein:GI124375976 rdf:type obo:PR_000004191 .
title | protein:GI124375976 dcterms:title "AR protein [Homo sapiens] "@en .
similarity neighbor | protein:GI124375976 vocab:hasSimilarProtein protein:GI6978663 .
conserved domain | protein:GI124375976 obo:BFO_0000110 conserveddomain:PSSMID132758 .
encoding gene | protein:GI124375976 vocab:encodedBy gene:GID367 .
cross-link to UniProt | protein:GI124375976 skos:closeMatch uniprot:P10275 .
organism | protein:GI124375976 bp:organism taxonomy:9606 .
reference | protein:GI124375976 cito:isDiscussedBy reference:PMID12477932 .
biosystem | protein:GI124375976 obo:BFO_0000056 biosystem:BSID105937 .

If the protein has a crystal structure in the PDB database, a cross-link to the RDF-based PDB resource is provided, for example:

    protein:GI230489 pdbo:link_to_pdb <http://rdf.wwpdb.org/pdb/2ER0> .

If the protein complexes have been tested in the bioassays, the measure groups are linked to the protein complexes, which are typed and linked to their component protein units, for example:

    measuregroup:AID1048 obo:BFO_0000057 protein:GI20336229GI32307126 .
    protein:GI20336229GI32307126 rdf:type obo:GO_0043234 .
    protein:GI20336229GI32307126 obo:BFO_0000178 protein:GI20336229 .
    protein:GI20336229GI32307126 obo:BFO_0000178 protein:GI32307126 .

4.10. PubChem ConservedDomain

PubChem conserved domain RDF triples expose the type, title, and references of a given conserved protein domain. For example, to resolve the URI for the NCBI Conserved Domain record PSSMID132758:

http://rdf.ncbi.nlm.nih.gov/pubchem/conserveddomain/PSSMID132758

Link Type | Example RDF Triple
type | conserveddomain:PSSMID132758 rdf:type obo:SO_0000417 .
title | conserveddomain:PSSMID132758 dcterms:title "NR_LBD_AR"@en .
reference [a] | conserveddomain:PSSMID132758 cito:isDiscussedBy reference:PMID17940184 .

[a] The links to literature references are obtained from the NCBI Entrez system.

4.11. PubChem Gene

PubChem gene RDF triples expose the type, title, symbol, description, organism, and references. For example, to resolve the URI for NCBI Gene record GID367:

http://rdf.ncbi.nlm.nih.gov/pubchem/gene/GID367

Link Type | Example RDF Triple
type | gene:GID367 rdf:type bp:Gene .
title [a] | gene:GID367 dcterms:title "AR"@en .
description | gene:GID367 dcterms:description "androgen receptor"@en .
organism | gene:GID367 bp:organism taxonomy:9606 .
symbol | gene:GID367 vocab:geneSymbol "AR"@en .
reference [a] | gene:GID367 cito:isDiscussedBy reference:PMID19815331 .

[a] The links to literature references are obtained from the NCBI Entrez system.

4.12. PubChem Biosystem

PubChem biosystem RDF triples expose the type, title, data source, organism, and references. For example, to resolve the URI for the NCBI Biosystems record BSID82991:

http://rdf.ncbi.nlm.nih.gov/pubchem/biosystem/BSID82991

Link Type | Example RDF Triple
type | biosystem:BSID82991 rdf:type bp:Pathway .
title | biosystem:BSID82991 dcterms:title "Arachidonic acid metabolism"@en .
organism | biosystem:BSID82991 bp:organism taxonomy:9606 .
reference [a] | biosystem:BSID82991 cito:isDiscussedBy reference:PMID14622984 .
source | biosystem:BSID82991 dcterms:source source:KEGG .

[a] The links to literature references are obtained from the NCBI Entrez system.

If the biosystem was deposited by the Reactome pathway database, a cross-link to the Reactome RDF is provided, for example:

    biosystem:BSID105648 skos:exactMatch reactome:REACT_578 .

4.13. PubChem Neighbor

PubChem neighbor RDF triples describe similarity relationships and their supporting information. Currently, these exist between chemical records and between protein sequences, and the information is only available through the RESTful interface. The chemical 2-D similarity neighbor RDF triples expose the similarity relation type, the compounds involved in the neighboring relation, the value and type of the similarity score, as well as the linkage between the neighboring relation and the evaluating score. For example, to resolve the URI for the neighboring relationship between the PubChem Compound records CID60823 and CID68019409:

http://rdf.ncbi.nlm.nih.gov/pubchem/neighbor/CID60823_CID68019409_2DSimilarity

http://rdf.ncbi.nlm.nih.gov/pubchem/neighbor/CID60823_CID68019409_2DTanimotoScore

| Link Type | Example RDF Triple |
|---|---|
| relation type | nbr:CID60823_CID10030610_2DSimilarity rdf:type vocab:PC2D_structural_similarity . |
| compound | nbr:CID60823_CID10030610_2DSimilarity sio:refers-to compound:CID60823 , compound:CID10030610 . |
| supporting score | nbr:CID60823_CID10030610_2DSimilarity sio:has-measurement-value nbr:CID60823_CID10030610_2DTanimotoScore . |
| score type | nbr:CID60823_CID10030610_2DTanimotoScore rdf:type vocab:PC2D_Fingerprint_TanimotoScore . |
| score value | nbr:CID60823_CID10030610_2DTanimotoScore sio:has-value "0.98"^^xsd:double . |

The chemical 3-D similarity neighbor RDF triples expose the similarity relation type, compounds involved in the neighboring relation, the value and type of the shape and feature similarity scores (ST and CT, respectively), as well as the linkage between the neighboring relation and the evaluating score. For example, to resolve the URI for the neighboring relationship between the PubChem Compound records CID60823 and CID11330946:

http://rdf.ncbi.nlm.nih.gov/pubchem/neighbor/CID60823_CID11330946_3DSimilarity

http://rdf.ncbi.nlm.nih.gov/pubchem/neighbor/CID60823_CID11330946_3DFeatureTanimotoScore

| Link Type | Example RDF Triple |
|---|---|
| relation type | nbr:CID60823_CID11330946_3DSimilarity rdf:type vocab:PC3D_structural_similarity . |
| compound | nbr:CID60823_CID11330946_3DSimilarity sio:refers-to compound:CID60823 , compound:CID11330946 . |
| supporting score | nbr:CID60823_CID11330946_3DSimilarity sio:has-measurement-value nbr:CID60823_CID11330946_3DFeatureTanimotoScore , nbr:CID60823_CID11330946_3DShapeTanimotoScore . |
| score type | nbr:CID60823_CID11330946_3DFeatureTanimotoScore rdf:type vocab:PC3D_Feature_TanimotoScore . nbr:CID60823_CID11330946_3DShapeTanimotoScore rdf:type vocab:PC3D_Shape_TanimotoScore . |
| score value | nbr:CID60823_CID11330946_3DFeatureTanimotoScore sio:has-value "0.59"^^xsd:double . nbr:CID60823_CID11330946_3DShapeTanimotoScore sio:has-value "0.88"^^xsd:double . |
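The neighbor URIs in the 2-D and 3-D examples share one naming pattern: two CIDs joined by underscores, followed by a suffix naming the relation or score. As a minimal sketch, assuming that pattern holds in general, the Python helper below assembles such a URI. The function name `neighbor_uri` is hypothetical, and the order of the two CIDs is taken as supplied by the caller because the ordering rule is not stated here.

```python
NEIGHBOR_BASE = "http://rdf.ncbi.nlm.nih.gov/pubchem/neighbor/"

# Suffixes that appear in the example URIs and triples of this section.
KNOWN_SUFFIXES = {
    "2DSimilarity",
    "2DTanimotoScore",
    "3DSimilarity",
    "3DFeatureTanimotoScore",
    "3DShapeTanimotoScore",
}


def neighbor_uri(cid_a: int, cid_b: int, suffix: str) -> str:
    """Build a neighbor URI of the form CID<a>_CID<b>_<suffix>."""
    if suffix not in KNOWN_SUFFIXES:
        raise ValueError(f"suffix not seen in the examples: {suffix!r}")
    return f"{NEIGHBOR_BASE}CID{cid_a}_CID{cid_b}_{suffix}"


print(neighbor_uri(60823, 11330946, "3DSimilarity"))
```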

4.14. PubChem Source

PubChem data source RDF triples expose the type, title, contributor, homepage, and substance categorization classification for a given data source. For example, to resolve the URI for the PubChem data source “ChEMBL”:

http://rdf.ncbi.nlm.nih.gov/pubchem/source/ChEMBL

| Link Type | Example RDF Triple |
|---|---|
| type | source:ChEMBL rdf:type dcterms:Dataset . |
| title | source:ChEMBL dcterms:title "ChEMBL"@en . |
| contributor | source:ChEMBL dcterms:contributor [ rdfs:label "European Bioinformatics Institute, EBI, ChEMBL"@en ; rdf:type foaf:Organization ] . |
| substance categorization classification | source:ChEMBL dcterms:subject concept:Biological_Property . |

4.15. PubChem Reference

PubChem reference RDF triples expose the type, publication date, citation, title, and MeSH headings/subheadings (from the MeshHeadingList of PubMed XML files), as well as the mentions of chemicals (from the ChemicalList of PubMed XML files) and diseases (from the SupplMeshList of PubMed XML files) in literature abstracts, as provided by Medline indexers. All headings/subheadings are represented by MeSH Descriptor (identifier starts with ‘D’)/Qualifier (identifier starts with ‘Q’) pairs, all chemicals are represented by MeSH concepts (identifier starts with ‘M’), and all diseases are represented by MeSH supplementary concept records (SCRs; identifier starts with ‘C’). For example, to resolve the URI for the NCBI PubMed record PMID 10395478:

http://rdf.ncbi.nlm.nih.gov/pubchem/reference/PMID10395478

| Link Type | Example RDF Triple |
|---|---|
| type | reference:PMID10395478 rdf:type fabio:JournalArticle . |
| citation | reference:PMID10395478 dcterms:bibliographicCitation "B D Palmer, A J Kraker, B G Hartl, A D Panopoulos, R L Panek, B L Batley, G H Lu, S Trumpp-Kallmeyer, H D Hollis Showalter, W A Denny, Journal of medicinal chemistry, 1999 Jul;42(13):2373-82" . |
| title | reference:PMID10395478 dcterms:title "Structure-activity relationships for 5-substituted 1-phenylbenzimidazoles as selective inhibitors of the platelet-derived growth factor receptor. "@en . |
| publication date | reference:PMID10395478 dcterms:date "1999-07-01"^^xsd:date . |
| heading/subheading | reference:PMID10395478 fabio:hasSubjectTerm mesh:D000255Q000378 . |
| chemical list | reference:PMID10395478 cito:discusses mesh:M0000395 . |

Mentions of diseases in literature abstracts are optional, for instance:

        reference:PMID15835820 cito:discusses mesh:C538339 .
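The identifier conventions described above (D/Q pairs for headings/subheadings, ‘M’ for concepts, ‘C’ for supplementary concept records) can be checked mechanically. The following Python sketch classifies the local part of a `mesh:` identifier; the function name `classify_mesh_id` is hypothetical, and the handling of a lone descriptor or qualifier is an assumption, since the text only describes pairs.

```python
import re

# Patterns follow the leading letters described in the text.
# The lone "descriptor" and "qualifier" cases are assumptions.
_MESH_PATTERNS = [
    (re.compile(r"D\d+Q\d+"), "descriptor/qualifier pair"),
    (re.compile(r"D\d+"), "descriptor"),
    (re.compile(r"Q\d+"), "qualifier"),
    (re.compile(r"M\d+"), "concept"),
    (re.compile(r"C\d+"), "supplementary concept record"),
]


def classify_mesh_id(identifier: str) -> str:
    """Return the kind of MeSH identifier, or 'unknown' if none matches."""
    for pattern, kind in _MESH_PATTERNS:
        if pattern.fullmatch(identifier):
            return kind
    return "unknown"


for example in ("D000255Q000378", "M0000395", "C538339"):
    print(example, "->", classify_mesh_id(example))
```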

4.16. PubChem Concept

The PubChem “concept” subdomain exposes RDF triples related to the biomedical concepts used to annotate the PubChemRDF resource, for instance, the WHO ATC codes used to annotate synonym instances. For example, to resolve the URI for the WHO ATC code L01XE:

http://rdf.ncbi.nlm.nih.gov/pubchem/concept/ATC_L01XE

| Link Type | Example RDF Triple |
|---|---|
| type | concept:ATC_L01XE rdf:type skos:Concept . |
| concept scheme | concept:ATC_L01XE skos:inScheme concept:ATC . |
| source | concept:ATC_L01XE pav:importedFrom source:WHO . |
| parent concept | concept:ATC_L01XE skos:broader concept:ATC_L01X . |
| label | concept:ATC_L01XE skos:prefLabel "protein kinase inhibitors"@en . |
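The “parent concept” link in the example mirrors the ATC hierarchy, in which a code is extended level by level (code lengths of 1, 3, 4, 5, and 7 characters). As a rough sketch, assuming that `skos:broader` always points to the next shorter ATC level, the parent code can be derived as follows. The function name `atc_parent` is hypothetical, and only the L01XE to L01X step is confirmed by the example above.

```python
# Lengths of ATC codes at levels 1 to 5 of the WHO ATC classification.
_LEVEL_LENGTHS = (1, 3, 4, 5, 7)


def atc_parent(code: str):
    """Return the parent ATC code, or None for a top-level code."""
    if len(code) not in _LEVEL_LENGTHS:
        raise ValueError(f"not a valid ATC code length: {code!r}")
    level = _LEVEL_LENGTHS.index(len(code))
    if level == 0:
        return None  # a one-letter code has no parent
    return code[: _LEVEL_LENGTHS[level - 1]]


print("concept:ATC_L01XE skos:broader concept:ATC_" + atc_parent("L01XE"))
```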

5. RESTful INTERFACE

5.1. URI Dereferencing

All of the aforementioned URIs can be resolved through a RESTful interface, which has some additional functionality beyond resolving URIs. The RDF triples can be presented according to different MIME types (see Table 3).

Table 3. The MIME types allowed and used in the PubChemRDF REST interface for dereferencing URIs.

| MIME Type | HTTP Accept Header | URI Suffix Extension |
|---|---|---|
| Abbreviated RDF/XML | application/rdf+xml+abbrev | rdfxml-abbrev |
| RDF/XML | application/rdf+xml, text/rdf | rdfxml, rdf, xml |
| HTML | application/xhtml+xml, text/html | html, htm |
| TURTLE (a) | application/n3, application/rdf+n3, application/turtle, application/x-turtle, text/n3, text/turtle, text/rdf+n3, text/rdf+turtle | turtle, ttl, n3 |
| JSON (b) | application/json, text/json | json |
| JSON-LD (c) | application/x-json+ld, application/x-json+rdf, application/json+ld, application/json+rdf, application/ld+json, application/rdf+json | jsonld, json-ld, ldjson, ld-json |
| N-TRIPLES | text/plain | ntriples (default) |

(a) Turtle is an abbreviation for Terse RDF Triples Language; (b) JSON is short for JavaScript Object Notation; (c) JSON-LD is short for JavaScript Object Notation for Linked Data.

Different presentations can be requested by specifying the HTTP Accept header (see Table 3). For instance, the Linux cURL command can retrieve the RDF triples for CID2244 in a chosen format and write them to a file.

If no Accept header is specified, the default output format is N-Triples (text/plain).
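As an illustration of header-based content negotiation, the following Python sketch prepares (but does not send) a request for CID2244 with an Accept header chosen from Table 3. The dictionary keys and the function name `build_request` are hypothetical; one representative MIME type is picked per format, and any other value listed in Table 3 should work equally.

```python
from urllib.request import Request

# One representative Accept header value per format, taken from Table 3.
ACCEPT = {
    "rdfxml": "application/rdf+xml",
    "html": "text/html",
    "turtle": "text/turtle",
    "json": "application/json",
    "jsonld": "application/ld+json",
    "ntriples": "text/plain",
}


def build_request(uri: str, fmt: str = "ntriples") -> Request:
    """Return an unsent GET request asking for `uri` in the given format."""
    return Request(uri, headers={"Accept": ACCEPT[fmt]})


req = build_request(
    "http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244", "turtle"
)
print(req.full_url, req.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would send it and follow the redirect; that step is left out here so that the sketch runs without network access.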

If web browsers are used to retrieve RDF triples, HTML format is typically the default. For instance, what Google Chrome sends in the accept header would be something like:

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

which means that it will take HTML or XML or others, but it prefers HTML (q=1.0) to XML (q=0.9) and others (q=0.8).
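The preference order expressed by the q-values can be made explicit with a small parser. This is a simplified Python sketch: it reads only the `q` parameter, treats a missing `q` as 1.0, and ignores the finer specificity rules of HTTP content negotiation. The function name `parse_accept` is hypothetical.

```python
def parse_accept(header: str):
    """Return (media_type, q) pairs sorted by descending preference."""
    prefs = []
    for item in header.split(","):
        parts = [part.strip() for part in item.split(";")]
        media_type = parts[0]
        q = 1.0  # default quality when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                q = float(value)
        prefs.append((media_type, q))
    # sorted() is stable, so entries with equal q keep their header order.
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


header = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
for media_type, q in parse_accept(header):
    print(f"{media_type}: q={q}")
```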

In order to force the desired MIME type using web browsers, the URI suffix extension preceded by a dot (‘.’) can be used (see Table 3). For instance, the following URLs can present the RDF triples with respect to CID2244 (Aspirin) in various RDF data formats:

http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244.rdf

http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244.html

http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244.turtle

http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244.json

http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244.jsonld

http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244.ntriples

5.2. Query RESTful Interface

Resolving a URI under the http://rdf.ncbi.nlm.nih.gov/pubchem/ domain returns a 303 redirect HTTP status code, and the request is redirected to the https://pubchem.ncbi.nlm.nih.gov/rest/rdf/ domain. The RESTful interface under the latter domain can also be used to query RDF triples. The query calls share the same base URL:

https://pubchem.ncbi.nlm.nih.gov/rest/rdf/query?

The input queries can be provided through key-value pairs, and the keys can be as follows (all lowercase): “graph” (or “domain”), “name” (or “string”), “return” (or “retrieve”), “contain” (or “substring”), “subject” (or “subj”), “predicate” (or “pred”), “object” (or “obj”), “offset”, and “format”.

5.2.1. Queries Based on String Values

The following string values can be used to query PubChemRDF resources: substance synonyms, inchikey values, protein names, gene symbols, data source names, conserved domain titles, biosystem titles, bioassay titles, measuregroup titles, reference titles, and concept labels. Two basic parameters must be provided including “graph” (or “domain”) and “name” (or “string”). For instance, the following query can retrieve the PubChemRDF synonym resource having the value of “aspirin”:

https://pubchem.ncbi.nlm.nih.gov/rest/rdf/query?graph=synonym&name=aspirin

Substring search is supported as well with the parameter “contain” (or “substring”), which can be either true or false:

https://pubchem.ncbi.nlm.nih.gov/rest/rdf/query?graph=synonym&name=aspirin&contain=true

The above queries return synonym resources. If the related compounds or substances are intended, another parameter, “return” (or “retrieve”), should be provided, which can be either “compound” (or “cid”) or “substance” (or “sid”):

https://pubchem.ncbi.nlm.nih.gov/rest/rdf/query?graph=synonym&name=aspirin&return=compound

The query functions support content negotiation with parameter “format” specified in Table 4. For instance, the following query will return JSON format:

https://pubchem.ncbi.nlm.nih.gov/rest/rdf/query?graph=synonym&name=aspirin&format=json
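The string-based query URLs above can also be assembled programmatically. The following Python sketch builds them from the documented keys (“graph”, “name”, “contain”, “return”, “format”); the function name `string_query_url` and its argument names are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

QUERY_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/rdf/query?"


def string_query_url(graph, name, contain=None, ret=None, fmt=None):
    """Build a string-value query URL from the documented parameters."""
    params = {"graph": graph, "name": name}
    if contain is not None:
        params["contain"] = "true" if contain else "false"
    if ret is not None:
        params["return"] = ret  # "compound" or "substance"
    if fmt is not None:
        params["format"] = fmt  # a suffix extension from Table 4
    return QUERY_BASE + urlencode(params)


print(string_query_url("synonym", "aspirin"))
print(string_query_url("synonym", "aspirin", contain=True))
print(string_query_url("synonym", "aspirin", ret="compound"))
print(string_query_url("synonym", "aspirin", fmt="json"))
```

Using `urlencode` also takes care of escaping names that contain spaces or other reserved characters.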

Table 4. The MIME types allowed and used in the PubChemRDF REST interface for query functions.

| MIME Type | HTTP Accept Header | URI Suffix Extension |
|---|---|---|
| RDF/XML | application/rdf+xml, text/rdf | rdfxml, rdf, xml |
| HTML | application/xhtml+xml, text/html | html, htm |
| JSON (a) | application/json, text/json | json |
| CSV (b) | text/csv | csv |

(a) JSON is short for JavaScript Object Notation; (b) CSV is short for comma-separated values.

5.2.2. Queries Based on Triple Patterns

The PubChemRDF REST interface provides simple SPARQL-like query capabilities for grouping and filtering relevant resources. Given the high computational cost of complicated SPARQL queries, only one triple pattern is allowed per query. Two basic parameters must be provided: “graph” (or “domain”) and “predicate” (or “pred”). For instance, the following query retrieves the ChEBI class assignments for the PubChem substances:

https://pubchem.ncbi.nlm.nih.gov/rest/rdf/query?graph=substance&predicate=rdf:type

The number of records returned by each query request can be configured using the “limit” parameter, whose default and maximum value is 10 000. Since all of the records are pre-sorted, the remaining records can be retrieved by specifying the “offset” parameter. For instance, the next 10 000 records (10 001 to 20 000) can be retrieved using the following query:

https://pubchem.ncbi.nlm.nih.gov/rest/rdf/query?graph=substance&predicate=rdf:type&offset=10000

In addition to the two basic parameters, either “subject” (or “subj”) or “object” (or “obj”) can be provided for filtering and grouping purposes. For instance, the following query can retrieve the first 10 000 synonyms that are drug brand names (trademarks):

https://pubchem.ncbi.nlm.nih.gov/rest/rdf/query?graph=synonym&pred=rdf:type&obj=sio:CHEMINF_000561
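Triple-pattern queries and their paging can be sketched in the same way. The Python helper below builds one URL per page of 10 000 records, using the long parameter names (“predicate”, “subject”, “object”), which the interface treats as equivalent to the short aliases. The function name `triple_query_urls` is hypothetical, and no request is sent.

```python
from urllib.parse import urlencode

QUERY_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/rdf/query?"
PAGE_SIZE = 10000  # default and maximum number of records per request


def triple_query_urls(graph, predicate, subject=None, obj=None, pages=1):
    """Yield one query URL per page for a single triple pattern."""
    for page in range(pages):
        params = {"graph": graph, "predicate": predicate}
        if subject is not None:
            params["subject"] = subject
        if obj is not None:
            params["object"] = obj
        offset = page * PAGE_SIZE
        if offset:
            params["offset"] = offset
        # Keep ':' unescaped so prefixed names stay readable in the URL.
        yield QUERY_BASE + urlencode(params, safe=":")


for url in triple_query_urls("substance", "rdf:type", pages=2):
    print(url)
```

How many pages a given query spans is not known in advance; a caller would typically keep requesting pages until one returns fewer than 10 000 records.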

5.3. HTTP Response Status

If the operation after redirection on RESTful interface was successful, the RDF triples will be retrieved along with a 200 HTTP status code. If the server encounters an error, it will return an HTTP status code other than 200 in the response header. The HTTP codes in the 400 range indicate errors on the request side (invalid input of some form), and the HTTP codes in the 500 range indicate errors on the PubChem side (timeout or other issue). In the response content, some descriptive messages will be returned, indicating the potential causes of the errors. The HTTP status codes and corresponding descriptions are as follows:

| HTTP Status | Error | Description |
|---|---|---|
| 400 | eBadRequest | Bad Query URL or Request URI |
| 404 | eNotFound | Input URI is invalid or cannot be identified in databases |
| 405 | eNotAllowed | MIME output format is unspecified or invalid |
| 500 | eServerError | Some problem on the server side occurs |
| 504 | eTimeout | The request timed out (over 28 seconds) |
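Client code can turn these status codes into readable diagnostics. The following Python sketch maps a status code to the error name and description listed above and notes which side the error is attributed to; the function name `describe_status` and the wording of the fallback text are hypothetical.

```python
# Status codes and error names as listed in the table above.
STATUS_TABLE = {
    400: ("eBadRequest", "Bad query URL or request URI"),
    404: ("eNotFound", "Input URI is invalid or cannot be identified in databases"),
    405: ("eNotAllowed", "MIME output format is unspecified or invalid"),
    500: ("eServerError", "Some problem on the server side occurs"),
    504: ("eTimeout", "The request timed out"),
}


def describe_status(code: int) -> str:
    """Summarize an HTTP status code returned by the REST interface."""
    if code == 200:
        return "success"
    if 400 <= code < 500:
        side = "request side"
    elif 500 <= code < 600:
        side = "PubChem side"
    else:
        side = "unclassified"
    name, text = STATUS_TABLE.get(code, ("unlisted", "no description in the table"))
    return f"{code} {name} ({side}): {text}"


for code in (200, 404, 504):
    print(describe_status(code))
```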

Please note that the HTTPS protocol works seamlessly in place of the HTTP protocol for all URIs in the PubChemRDF RESTful interface. If you request data using ‘https’, the URIs returned will also use ‘https’. This is a feature, but it may cause issues for software packages that rely on the URI, including its protocol, to uniquely identify an entity. Generally speaking, all PubChem web-based resources are configured to work seamlessly with or without HTTP encryption (via the ‘https’ or ‘http’ protocol, respectively). By default, and on the PubChemRDF FTP site, all URIs are specified using the HTTP protocol to the NCBI RDF website (http://rdf.ncbi.nlm.nih.gov).

6. RDF FTP Download Directory Layout

The PubChemRDF data can be found on the PubChem FTP site for bulk download:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/

Data is updated in its entirety approximately once per week (i.e., no incremental update is provided at this time). A Vocabulary of Interlinked Datasets (VoID) description file (void.ttl) is provided at the root directory of the PubChemRDF FTP site. This file provides general metadata about each release of PubChemRDF, such as provenance information, statistics (e.g., triple counts), dataset release date, and files.

The PubChemRDF FTP site is partitioned into subsets corresponding to the different PubChemRDF subdomains, which allows individual subdomains to be downloaded. The top-level FTP directories correspond to the subdomains: compound, substance, descriptor, inchikey, synonym, bioassay, measuregroup, endpoint, protein, conserveddomain, gene, biosystem, source, concept, and reference. Since the compound and descriptor subdomains have the largest numbers of triples, additional partitions have been applied to them. The compound subdomain has three subsets: general, nbr2d, and nbr3d; the descriptor subdomain has two subsets: compound and substance. Within each subdomain, the RDF triples are further split by semantic relation, such that the RDF predicates in each downloadable file are the same.

All RDF data files are in Turtle format and gzip-compressed, as indicated by the suffix “.ttl.gz”. Data from one RDF subdomain may refer to other subdomains. Figure 1 depicts these interdependencies by means of arrows indicating outgoing references to other subdomains. Each file name follows the pattern “pc_<link>_<range>.ttl.gz”. The <link> indicates the file content type, and the optional <range> is a number that distinguishes multiple files of the same content type; it is present only for large data sets. For example, “pc_compound2descriptor_000001.ttl.gz” indicates that the semantic links are from the compound to the descriptor subdomain, and the number indicates that there are other files containing the same semantic associations.
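The file naming pattern can be parsed with a short regular expression. The Python sketch below splits a file name into its <link> part and its optional <range> part, assuming the range is always a run of digits directly before “.ttl.gz”. The function name `parse_file_name` is hypothetical.

```python
import re

# pc_<link>_<range>.ttl.gz, where _<range> is optional and assumed numeric.
_FILE_RE = re.compile(r"pc_(?P<link>.+?)(?:_(?P<range>\d+))?\.ttl\.gz")


def parse_file_name(name: str):
    """Return (link, range) for a PubChemRDF file name; range may be None."""
    match = _FILE_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a PubChemRDF data file name: {name!r}")
    return match.group("link"), match.group("range")


print(parse_file_name("pc_compound2descriptor_000001.ttl.gz"))
print(parse_file_name("pc_compound_type.ttl.gz"))
```

A content type that itself ended in an underscore followed by digits would be ambiguous under this pattern; no such name appears in the tables of this section.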

6.1. PubChem Compound

Data for the PubChem “compound” RDF subdomain directory can be found here:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/compound/

There are three subdirectories: “general”, “nbr2d”, and “nbr3d”. Each directory may have a ‘README’ file with more current information or additional information.

6.1.1. PubChem Compound “general”

Information contained here includes the links to chemical descriptors, biosystems, ChEBI types, and the non-similarity-based compound interrelationships, including parent, components, and compound identity group (CIG). CIGs relate chemicals by varying degrees of identity, for example, chemicals with identical connectivity (same atoms and bonds) whose stereoisomer and isotopic information may vary. The semantic links with the corresponding file names or prefixes are listed in the following table:

| Semantic link | File name or prefix |
|---|---|
| sio:has-attribute (a) | pc_compound2descriptor_ |
| vocab:has-parent (a) | pc_compound2parent_ |
| cheminf:has-component | pc_compound2component.ttl.gz |
| cheminf:has-stereoisomer | pc_compound2stereoisomer.ttl.gz |
| cheminf:has-isotopologue | pc_compound2isotopologue.ttl.gz |
| cheminf:has-uncharged-counterpart | pc_compound2uncharged.ttl.gz |
| cheminf:has-same-connectivity-with (a) | pc_compound2sameconnectivity_ |
| rdf:type | pc_compound_type.ttl.gz |
| bfo:participates-in | pc_compound2biosystem.ttl.gz |

(a) File prefixes followed by the range numbers.

6.1.2. PubChem Compound “nbr2d”

Information contained here includes links between compounds according to the PubChem 2-D “Similar Compounds” neighboring relationship. The semantic link is “cheminf:has-2D-similar-compound”.

6.1.3. PubChem Compound “nbr3d”

Information contained here includes links between compounds according to the PubChem 3-D “Similar Conformers” neighboring relationship. The semantic link is “cheminf:has-3D-similar-compound”.

6.2. PubChem Substance

Data for the “substance” RDF subdomain directory can be found here:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/substance/

Information contained here includes the links to the ChEBI types, the PDB crystal structures, depositor-provided descriptors including synonyms, the depositor-provided PubMed references, the standardized compound records, the measure groups, data sources, and the cross links to ChEMBL RDF. The semantic links with the corresponding file names or prefixes are listed in the following table:

| Semantic link | File name or prefix |
|---|---|
| rdf:type | pc_substance_type.ttl.gz |
| pdbo:link_to_pdb | pc_substance2pdb.ttl.gz |
| sio:has-attribute (a) | pc_substance2descriptor_ |
| cito:isDiscussedBy | pc_substance2reference.ttl.gz |
| cheminf:has-standardized-compound (a) | pc_substance2compound_ |
| bfo:participates-in (a) | pc_substance2measuregroup_ |
| dcterms:source (a) | pc_substance_source_ |
| skos:exactMatch | pc_substance_match.ttl.gz |

(a) File prefixes followed by the range numbers.

6.3. PubChem Descriptor

Data for the “descriptor” RDF subdomain directory can be found here:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/descriptor/

There are two subdirectories: “compound” and “substance”.

6.3.1. PubChem Descriptor “compound”

Information contained here includes the type, value and unit (when applicable) of chemical descriptors (not including InChIKey). Since the chemical descriptor is a very large subdomain, it is further categorized into different descriptor types, and the downloadable files are organized accordingly. The file prefixes have the following pattern: pc_descr_<type>_<link>_<range>.ttl.gz. The descriptor types include InChI, canSMILES, isoSMILES, IUPACName, HBondDonor, HBondAcceptor, RotatableBond, Complexity, TautomerCount, XLogP3, DefinedAtomStereoCount, DefinedBondStereoCount, IsotopeAtomCount, HeavyAtomCount, UndefinedAtomStereoCount, UndefinedBondStereoCount, CovalentUnitCount, MolecularFormula, FormalCharge, MolecularWeight, MonoIsotopicWeight, ExactMass, and TPSA. Representative semantic links with the corresponding file prefixes are listed in the following table:

| Semantic type | Semantic link | File prefix |
|---|---|---|
| TPSA | rdf:type | pc_descr_TPSA_type_ (a) |
| TPSA | sio:has-value | pc_descr_TPSA_value_ (a) |
| TPSA | sio:has-unit | pc_descr_TPSA_unit_ (a) |

(a) File prefixes followed by the range numbers.

6.3.2. PubChem Descriptor “substance”

Information contained here includes the type and value for a given descriptor (i.e. substance version). The semantic links with the corresponding file prefixes are listed in the following table:

| Semantic link | File prefix |
|---|---|
| rdf:type | pc_SubstanceVersion_type_ (a) |
| sio:has-value | pc_SubstanceVersion_value_ (a) |

(a) File prefixes followed by the range numbers.

6.4. PubChem InChIKey

Data for the PubChem “inchikey” RDF subdomain directory can be found here:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/inchikey/

Information contained here includes the type and the value of a given InChIKey. The semantic links with the corresponding file names or prefixes are listed in the following table:

| Semantic link | File name or prefix |
|---|---|
| rdf:type | pc_inchikey_type_ |
| sio:has-value | pc_inchikey_value_ |
| dcterms:subject | pc_inchikey_topic.ttl.gz |
| sio:is-attribute-of | pc_inchikey2compound_ (a) |

(a) File prefixes followed by the range numbers.

6.5. PubChem Synonym

Data for the PubChem “synonym” RDF subdomain directory can be found here:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/synonym/

Information contained here includes the type and the corresponding name string of a given MD5 hash string, as well as the links to the corresponding CID(s). The MD5 hash is used to provide a stable identifier for a given synonym. The mappings of synonyms to CIDs may be a subset of those possible from corresponding SID(s). PubChem performs processing on aggregated chemical information between PubChem contributors. This consistency filtering helps to eliminate promiscuous synonyms that correspond to multiple chemical structures (perhaps erroneously). The semantic links with the corresponding file names or prefixes are listed in the following table:

| Semantic link | File name or prefix |
|---|---|
| rdf:type | pc_synonym_type_ |
| sio:has-value | pc_synonym_value_ |
| dcterms:subject | pc_synonym_topic.ttl.gz |
| sio:is-attribute-of | pc_synonym2compound_ (a) |

(a) File prefixes followed by the range numbers.

6.6. PubChem Bioassay

Data for the “bioassay” RDF subdomain directory can be found here:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/bioassay/

The file “pc_bioassay.ttl.gz” contains the descriptive information for a given AID that is represented as an instance of BAO_0000015, including the type, title, depositor, as well as the links to bioassay neighbors (if it has any) and the corresponding measure groups. The semantic links are “rdf:type”, “dcterms:title”, “dcterms:source”, “bao:has-measure-group”, and “bao:has-summary-assay” (optional).

6.7. PubChem MeasureGroup

Data for the “measuregroup” RDF subdomain directory can be found here:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/measuregroup/

The files contain the descriptive information for a given AID that cannot be represented as an instance of BAO_0000015, as well as the links from measure group to the participating protein targets (if it has any), and the links from measure group to the corresponding endpoints (if it has any). The semantic links with the corresponding file names or prefixes are listed in the following table:

| Semantic link | File name or prefix |
|---|---|
| rdf:type | pc_measuregroup_type.ttl.gz |
| dcterms:title | pc_measuregroup_title.ttl.gz |
| dcterms:source | pc_measuregroup_source.ttl.gz |
| bfo:has-participants | pc_measuregroup2protein.ttl.gz |
| obi:has-specified-output | pc_measuregroup2endpoint_ (a) |

(a) File prefixes followed by the range numbers.

6.8. PubChem Endpoint

Data for the “endpoint” RDF subdomain directory can be found here:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/endpoint/

There are several files containing the type, value, unit, and label of a given endpoint, as well as the links to the substance and reference. The semantic links with the corresponding file names or prefixes are listed in the following table:

| Semantic link | File name or prefix |
|---|---|
| rdf:type | pc_endpoint_type.ttl.gz |
| qudt:value | pc_endpoint_value.ttl.gz |
| qudt:unit | pc_endpoint_unit.ttl.gz |
| vocab:PubChemAssayOutcome | pc_endpoint_outcome_ (a) |
| rdfs:label | pc_endpoint_label.ttl.gz |
| iao:is-about | pc_endpoint2substance_ (a) |
| cito:citeAsDataSource | pc_endpoint2reference.ttl.gz |

(a) File prefixes followed by the range numbers.

6.9. PubChem Protein

Data for the “protein” RDF subdomain directory can be found here:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/protein/

The file “pc_protein.ttl.gz” contains descriptive information for a given protein GI identifier, including the type, title, alternative names, cross-links, conserved domain associations, encoding gene information, and involved biosystems, as well as the links to the measure groups and the neighboring relationships between proteins. The semantic links are “rdf:type”, “dcterms:title”, “dcterms:alternative”, “skos:closeMatch”, “bfo:has-part”, “bfo:participates-in”, “bfo:has-function”, “bfo:located-in”, “vocab:encodedBy”, “bp:organism”, “cito:isDiscussedBy”, “pdbo:link_to_pdb”, and “vocab:has-similar-protein”.

6.10. PubChem ConservedDomain

Data for the “conserveddomain” RDF subdomain directory can be found here:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/conserveddomain/

The file “pc_conserveddomain.ttl.gz” contains the type of a given PSSMID. The semantic links are “rdf:type”, “dcterms:title”, “dcterms:abstract”, and “cito:isDiscussedBy”.

6.11. PubChem Gene

Data for the “gene” RDF subdomain directory can be found here:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/gene/

The file “pc_gene.ttl.gz” contains the type and symbol of a given GID. The semantic links are “rdf:type”, “bp:organism”, “cito:isDiscussedBy”, “dcterms:title”, “dcterms:description”, “skos:closeMatch”, and “vocab:geneSymbol”.

6.12. PubChem Biosystem

Data for the “biosystem” RDF subdomain directory can be found here:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/biosystem/

The RDF statements contained in the file “pc_biosystem.ttl.gz” provide the basic descriptions, including the type, title, and source for a given BSID. The semantic links are “rdf:type”, “bp:organism”, “cito:isDiscussedBy”, “skos:exactMatch”, “dcterms:title”, and “dcterms:source”.

6.13. PubChem Source

Data for the “source” RDF subdomain directory can be found here:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/source/

The file “pc_source.ttl.gz” contains descriptive information for a given PubChem contributor, including data source identifier, display name, alternative names, organization, and homepage, as well as any classification of the given source, such as the Substance Categorization Classification information.

6.14. PubChem Reference

Data for the “reference” RDF subdomain directory can be found here:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/reference/

There are several files containing the type, topics, citation, title, and date of a given reference. The semantic links with the corresponding file names or prefixes are listed in the following table:

| Semantic link | File name or prefix |
|---|---|
| rdf:type | pc_reference_type.ttl.gz |
| cito:discusses | pc_reference2chemical_disease_ (a) |
| fabio:hasSubjectTerm | pc_reference2meshheading_ (a) |
| dcterms:bibliographicCitation | pc_reference_citation_ (a) |
| dcterms:title | pc_reference_title_ (a) |
| dcterms:date | pc_reference_date.ttl.gz |

(a) File prefixes followed by the range numbers.

7. PubChemRDF Use Cases

This section gives some SPARQL query examples of how PubChemRDF can be used with available Semantic Web frameworks. [Please note that these use cases assume some familiarity and proficiency with these tools.] Three popular Semantic Web frameworks that provide collections of API functions to process RDF data are the Apache Jena, OpenRDF Sesame, and Redland RDF libraries. Jena and Sesame are publicly available Java frameworks, and Redland comprises a set of open source C libraries. All of them can be readily used to read, write, parse, serialize, and interpret RDF statements, and all of them provide both in-memory and persistent storage, as well as SPARQL querying mechanisms.

Recent technology development has changed the landscape, in particular for very large RDF stores (such as FRANZ AllegroGraph, OpenLink Virtuoso, Ontotext OWLIM, Garlik 4store, and SYSTAP Bigdata) that can handle fast loading and querying of billions of triples. AllegroGraph is compatible with the Jena framework. Virtuoso provides fully operational data access and management through interface implementations of the Jena, Sesame, and Redland frameworks. Bigdata supports Sesame API functions. OWLIM can deliver extensible and configurable performance with the Jena and Sesame frameworks. According to the most recent DB-Engines ranking, Jena and Virtuoso are among the most popular RDF persistent stores. The open source version of Virtuoso 7 installed on a single server can readily handle the core subset of PubChemRDF data (over 7 billion triples), which does not include the compound 2D/3D similarity triples. Therefore, we will describe how to load and query the PubChemRDF data (the core subset) on a CentOS 6 Linux system.

The core subset of PubChemRDF data can be downloaded with the “wget” command. You can copy and save the following script in a file, such as download_script.sh:

    #!/bin/sh
    wget -r -A ttl.gz -nH --cut-dirs=3 -P compound ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/compound/general
    wget -r -A ttl.gz -nH --cut-dirs=2 ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/substance
    wget -r -A ttl.gz -nH --cut-dirs=2 ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/descriptor
    wget -r -A ttl.gz -nH --cut-dirs=2 ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/synonym
    wget -r -A ttl.gz -nH --cut-dirs=2 ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/inchikey
    wget -r -A ttl.gz -nH --cut-dirs=2 ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/bioassay
    wget -r -A ttl.gz -nH --cut-dirs=2 ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/measuregroup
    wget -r -A ttl.gz -nH --cut-dirs=2 ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/endpoint
    wget -r -A ttl.gz -nH --cut-dirs=2 ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/protein
    wget -r -A ttl.gz -nH --cut-dirs=2 ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/biosystem
    wget -r -A ttl.gz -nH --cut-dirs=2 ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/conserveddomain
    wget -r -A ttl.gz -nH --cut-dirs=2 ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/gene
    wget -r -A ttl.gz -nH --cut-dirs=2 ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/source
    wget -r -A ttl.gz -nH --cut-dirs=2 ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/concept
    wget -r -A ttl.gz -nH --cut-dirs=2 ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/reference

Then make the file executable:

    chmod u+x download_script.sh

This script will download all of the PubChemRDF data in the current working directory except compound 2D/3D similarity triples, and the downloaded files (over 40 GB) are organized in the same way as shown on FTP site.

It is recommended to install Virtuoso on a server with at least 64 GB of memory and a 500 GB solid-state disk (SSD), and to tune the performance of the Virtuoso quad store by editing the “virtuoso.ini” file. Linux swapping should be rendered as well. Extensive experiments have shown that the default index scheme of Virtuoso 7 already yields acceptable performance. To further improve query performance, one more index, OGSP (O: object; G: graph; S: subject; P: predicate), should be added by running the following command in the “isql” command line:

CREATE COLUMN INDEX RDF_QUAD_OGSP on DB.DBA.RDF_QUAD (O, G, S, P);

Virtuoso has built-in bulk load functions to load RDF triples from multiple files in parallel. Before running the bulk load, it is recommended to change the transaction isolation level to “read committed” to avoid deadlocks during bulk loading. The transaction isolation level can be changed by adding the following line to the “virtuoso.ini” file:

DefaultIsolation = 2

Another way to change the transaction isolation is through the “isql” command line:

SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

It is highly recommended to load PubChemRDF data into multiple graphs, and specify the graph using FROM clause in SPARQL queries. If so, the graph index provided by Virtuoso can be readily used to avoid full index scanning and, as a result, to improve the query performance. The bulk loading can be achieved through two steps:

First you should register all the files to be loaded in a given directory to the corresponding graph. The following scripts can be run in “isql” command line to register datasets to be loaded in the given graphs (<Path> should be substituted by the local directory):

    ld_dir ('<Path>/compound/general', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/compound'); \
    ld_dir_all ('<Path>/substance', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/substance'); \
    ld_dir ('<Path>/descriptor/compound', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/descriptor/compound'); \
    ld_dir ('<Path>/descriptor/substance', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/descriptor/substance'); \
    ld_dir_all ('<Path>/synonym', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/synonym'); \
    ld_dir_all ('<Path>/inchikey', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/inchikey'); \
    ld_dir ('<Path>/measuregroup', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup'); \
    ld_dir ('<Path>/endpoint', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint'); \
    ld_dir ('<Path>/bioassay', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/bioassay'); \
    ld_dir ('<Path>/protein', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/protein'); \
    ld_dir ('<Path>/biosystem', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/biosystem'); \
    ld_dir ('<Path>/conserveddomain', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/conserveddomain'); \
    ld_dir ('<Path>/gene', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/gene'); \
    ld_dir ('<Path>/reference', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/reference'); \
    ld_dir ('<Path>/source', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/source'); \
    ld_dir ('<Path>/concept', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/concept'); \
    checkpoint;
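
The registration statements above follow a regular pattern, so they can also be generated rather than typed by hand. The following Python sketch is one way to do that; the `DATASETS` table is a shortened illustration (not the full list), the function name `registration_statements` is hypothetical, and `/data/pubchemrdf` is an assumed download directory. The output is meant to be pasted into, or piped to, “isql”.

```python
# Sketch: build the isql registration statements from a small table.
# DATASETS is a shortened, illustrative list; extend it for the other subdomains.
BASE_GRAPH = "http://rdf.ncbi.nlm.nih.gov/pubchem/"

# (subdirectory, recurse into subdirectories?, graph name)
DATASETS = [
    ("compound/general", False, "compound"),
    ("substance", True, "substance"),
    ("descriptor/compound", False, "descriptor/compound"),
    ("synonym", True, "synonym"),
]


def registration_statements(root, datasets=DATASETS, pattern="*.ttl.gz"):
    """Return one ld_dir / ld_dir_all call per dataset, followed by a checkpoint."""
    root = root.rstrip("/")
    statements = []
    for subdir, recursive, graph in datasets:
        # ld_dir_all also walks subdirectories; ld_dir only reads the directory itself.
        func = "ld_dir_all" if recursive else "ld_dir"
        statements.append(
            f"{func} ('{root}/{subdir}', '{pattern}', '{BASE_GRAPH}{graph}');"
        )
    statements.append("checkpoint;")
    return statements


# "/data/pubchemrdf" is a hypothetical download directory.
print("\n".join(registration_statements("/data/pubchemrdf")))
```

Keeping the directory-to-graph mapping in one table makes it easier to spot a dataset that was registered to the wrong graph.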

Second, execute the bulk load function (“rdf_loader_run()”) several times in parallel, for example:

    isql 1111 dba dba exec="rdf_loader_run();" &
    isql 1111 dba dba exec="rdf_loader_run();" &
    isql 1111 dba dba exec="rdf_loader_run();" &
    isql 1111 dba dba exec="rdf_loader_run();" &
    isql 1111 dba dba exec="rdf_loader_run();" &
    isql 1111 dba dba exec="rdf_loader_run();" &
    isql 1111 dba dba exec="rdf_loader_run();" &
    wait
    isql 1111 dba dba exec="checkpoint;"
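
The same start-several-loaders, wait, then checkpoint pattern can be driven from Python when the number of loaders should be a parameter. This is only a sketch: the function names are hypothetical, and the port, user name and password defaults simply mirror the example values used in this section. It assumes “isql” is on the PATH.

```python
import subprocess


def loader_command(port=1111, user="dba", password="dba",
                   statement="rdf_loader_run();"):
    """Build the argument list for one isql call (same arguments as the shell form)."""
    return ["isql", str(port), user, password, f"exec={statement}"]


def run_parallel_loaders(n_loaders, port=1111):
    """Start n_loaders bulk-load processes, wait for all of them, then checkpoint."""
    procs = [subprocess.Popen(loader_command(port)) for _ in range(n_loaders)]
    exit_codes = [p.wait() for p in procs]  # equivalent of the shell "wait"
    subprocess.run(loader_command(port, statement="checkpoint;"), check=True)
    return exit_codes


# Example (not executed here, because it needs a running Virtuoso server):
#     run_parallel_loaders(7)
print(" ".join(loader_command()))
```

Returning the exit codes gives a first indication of a loader that stopped early; the loading registry table remains the authoritative record.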

The number of loader processes to use depends on the number of available processors. The core subset of PubChemRDF data can be loaded into the Virtuoso quad store using 10 processes within 10 hours, and the loaded datasets take approximately 500 GB of SSD space at the time of writing. After loading, it is recommended to inspect the loading registry table (DB.DBA.load_list) for any job that was killed or failed due to errors, by running the following command in the “isql” command line:

    select * from DB.DBA.load_list;
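
With many files registered, the output of this query is long, and a small summary helps. The sketch below assumes the rows have already been fetched (for example through an ODBC client) into dictionaries, and it assumes the column meanings described in the Virtuoso bulk loader documentation: `ll_state` is 0 (not loaded), 1 (loading) or 2 (loaded), and `ll_error` is empty unless the file failed. The function name and the sample rows are hypothetical.

```python
def summarize_load_list(rows):
    """Split load_list rows into files that failed and files not yet finished.

    Assumes each row is a dict with the keys ll_file, ll_state and ll_error.
    """
    failed, pending = [], []
    for row in rows:
        if row.get("ll_error"):
            failed.append((row["ll_file"], row["ll_error"]))
        elif row["ll_state"] != 2:  # assumption: 2 means "loaded"
            pending.append(row["ll_file"])
    return {"failed": failed, "pending": pending}


# Made-up rows, for illustration only.
example_rows = [
    {"ll_file": "a.ttl.gz", "ll_state": 2, "ll_error": None},
    {"ll_file": "b.ttl.gz", "ll_state": 1, "ll_error": None},
    {"ll_file": "c.ttl.gz", "ll_state": 2, "ll_error": "syntax error"},
]
print(summarize_load_list(example_rows))
```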

It is always good practice to check the “virtuoso.log” file after any operation. If the files being loaded contain syntax errors, the error messages will appear in the log file.

Virtuoso has a “dashboard” graphical user interface (GUI) called Virtuoso Conductor, which can be accessed through http://<server-name>:<port-number>. After logging in, you can run SPARQL queries through Virtuoso Conductor just as you would through the “isql” command line. Another option is the Virtuoso SPARQL endpoint, which can be accessed through http://<server-name>:<port-number>/sparql. The SPARQL endpoint is protected by a set of parameters defined in the “virtuoso.ini” file, such as the timeout limit and the maximum number of concurrent users. You can change the configuration either through Virtuoso Conductor or by editing the “virtuoso.ini” file directly.

You can access the SPARQL endpoint service either with a browser or by sending HTTP GET/POST requests to the SPARQL query service:

    curl -F "query=<your SPARQL query here> from <graphURI>" http://<server-name>:<port-number>/sparql
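
The same kind of request can be prepared from a program. The Python sketch below builds, but does not send, a URL-encoded POST request as defined by the W3C SPARQL Protocol (parameters `query` and `default-graph-uri`); note that this differs from the multipart form that `curl -F` produces. The function name is hypothetical, and `http://localhost:8890/sparql` is an assumed endpoint address that must be replaced by your own server and port.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sparql_request(endpoint, query, graph_uri=None):
    """Prepare a SPARQL Protocol POST request asking for JSON results."""
    params = {"query": query}
    if graph_uri:
        # Alternative to writing a FROM clause inside the query text.
        params["default-graph-uri"] = graph_uri
    return Request(
        endpoint,
        data=urlencode(params).encode("utf-8"),  # a body makes this a POST
        headers={
            "Accept": "application/sparql-results+json",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


request = build_sparql_request(
    "http://localhost:8890/sparql",  # assumed endpoint address
    "SELECT * WHERE { ?s ?p ?o } LIMIT 1",
    "http://rdf.ncbi.nlm.nih.gov/pubchem/compound",
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request).read()
```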

It is recommended to set namespace prefixes for the ontologies used in PubChemRDF using the predefined function DB.DBA.XML_SET_NS_DECL, which can be run in the “isql” command line:

    DB.DBA.XML_SET_NS_DECL ('compound', 'http://rdf.ncbi.nlm.nih.gov/pubchem/compound/', 2);
    DB.DBA.XML_SET_NS_DECL ('substance', 'http://rdf.ncbi.nlm.nih.gov/pubchem/substance/', 2);
    DB.DBA.XML_SET_NS_DECL ('descriptor', 'http://rdf.ncbi.nlm.nih.gov/pubchem/descriptor/', 2);
    DB.DBA.XML_SET_NS_DECL ('synonym', 'http://rdf.ncbi.nlm.nih.gov/pubchem/synonym/', 2);
    DB.DBA.XML_SET_NS_DECL ('inchikey', 'http://rdf.ncbi.nlm.nih.gov/pubchem/inchikey/', 2);
    DB.DBA.XML_SET_NS_DECL ('bioassay', 'http://rdf.ncbi.nlm.nih.gov/pubchem/bioassay/', 2);
    DB.DBA.XML_SET_NS_DECL ('measuregroup', 'http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup/', 2);
    DB.DBA.XML_SET_NS_DECL ('endpoint', 'http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint/', 2);
    DB.DBA.XML_SET_NS_DECL ('reference', 'http://rdf.ncbi.nlm.nih.gov/pubchem/reference/', 2);
    DB.DBA.XML_SET_NS_DECL ('protein', 'http://rdf.ncbi.nlm.nih.gov/pubchem/protein/', 2);
    DB.DBA.XML_SET_NS_DECL ('conserveddomain', 'http://rdf.ncbi.nlm.nih.gov/pubchem/conserveddomain/', 2);
    DB.DBA.XML_SET_NS_DECL ('gene', 'http://rdf.ncbi.nlm.nih.gov/pubchem/gene/', 2);
    DB.DBA.XML_SET_NS_DECL ('biosystem', 'http://rdf.ncbi.nlm.nih.gov/pubchem/biosystem/', 2);
    DB.DBA.XML_SET_NS_DECL ('source', 'http://rdf.ncbi.nlm.nih.gov/pubchem/source/', 2);
    DB.DBA.XML_SET_NS_DECL ('concept', 'http://rdf.ncbi.nlm.nih.gov/pubchem/concept/', 2);
    DB.DBA.XML_SET_NS_DECL ('vocab', 'http://rdf.ncbi.nlm.nih.gov/pubchem/vocabulary#', 2);
    DB.DBA.XML_SET_NS_DECL ('obo', 'http://purl.obolibrary.org/obo/', 2);
    DB.DBA.XML_SET_NS_DECL ('sio', 'http://semanticscience.org/resource/', 2);
    DB.DBA.XML_SET_NS_DECL ('skos', 'http://www.w3.org/2004/02/skos/core#', 2);
    DB.DBA.XML_SET_NS_DECL ('bao', 'http://www.bioassayontology.org/bao#', 2);
    DB.DBA.XML_SET_NS_DECL ('bp', 'http://www.biopax.org/release/biopax-level3.owl#', 2);
    DB.DBA.XML_SET_NS_DECL ('ndfrt', 'http://evs.nci.nih.gov/ftp1/NDF-RT/NDF-RT.owl#', 2);
    DB.DBA.XML_SET_NS_DECL ('ncit', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#', 2);
    DB.DBA.XML_SET_NS_DECL ('wikidata', 'http://www.wikidata.org/entity/', 2);
    DB.DBA.XML_SET_NS_DECL ('ops', 'http://www.openphacts.org/units/', 2);
    DB.DBA.XML_SET_NS_DECL ('cito', 'http://purl.org/spar/cito/', 2);
    DB.DBA.XML_SET_NS_DECL ('fabio', 'http://purl.org/spar/fabio/', 2);
    DB.DBA.XML_SET_NS_DECL ('pav', 'http://purl.org/pav/', 2);
    DB.DBA.XML_SET_NS_DECL ('uniprot', 'http://purl.uniprot.org/uniprot/', 2);
    DB.DBA.XML_SET_NS_DECL ('pdbo', 'http://rdf.wwpdb.org/schema/pdbx-v40.owl#', 2);
    DB.DBA.XML_SET_NS_DECL ('pdbr', 'http://rdf.wwpdb.org/pdb/', 2);
    DB.DBA.XML_SET_NS_DECL ('taxonomy', 'http://identifiers.org/taxonomy/', 2);
    DB.DBA.XML_SET_NS_DECL ('reactome', 'http://identifiers.org/reactome/', 2);
    DB.DBA.XML_SET_NS_DECL ('chembl', 'http://rdf.ebi.ac.uk/resource/chembl/molecule/', 2);
    DB.DBA.XML_SET_NS_DECL ('chemblchembl', 'http://linkedchemistry.info/chembl/chemblid/', 2);
    DB.DBA.XML_SET_NS_DECL ('foaf', 'http://xmlns.com/foaf/0.1/', 2);
    DB.DBA.XML_SET_NS_DECL ('void', 'http://rdfs.org/ns/void#', 2);
    DB.DBA.XML_SET_NS_DECL ('dcterms', 'http://purl.org/dc/terms/', 2);

With these prefixes in place, query results are much easier to read, in particular in the RDF Turtle format. Note that these “isql” commands cannot override existing namespaces, so if a prefix with the same name already exists, you need to either remove it first with another predefined function (DB.DBA.XML_REMOVE_NS_BY_PREFIX) or locate and change it manually through Virtuoso Conductor, under the “Linked Data” -> “Namespaces” tab. For example, to remove the existing namespace for the prefix ‘obo’, the following “isql” command can be used:

    DB.DBA.XML_REMOVE_NS_BY_PREFIX ('obo', 2);
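
When a query has to run against an endpoint where these prefixes have not been registered, the declarations can be written into the query itself as SPARQL `PREFIX` lines. The following Python sketch prepends only the prefixes a query actually uses. The `PREFIXES` table is a small subset of the list above plus the standard `rdf` namespace, and the function name is hypothetical.

```python
import re

# Subset of the namespace list in this section, plus the standard rdf namespace.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "obo": "http://purl.obolibrary.org/obo/",
    "sio": "http://semanticscience.org/resource/",
    "bao": "http://www.bioassayontology.org/bao#",
    "bp": "http://www.biopax.org/release/biopax-level3.owl#",
    "dcterms": "http://purl.org/dc/terms/",
}


def with_prefixes(query, prefixes=PREFIXES):
    """Prepend a PREFIX line for every known prefix that appears in the query."""
    used = [p for p in prefixes if re.search(rf"\b{re.escape(p)}:", query)]
    header = "\n".join(f"PREFIX {p}: <{prefixes[p]}>" for p in used)
    return f"{header}\n{query}" if header else query


example_query = "SELECT ?t WHERE { ?p rdf:type bp:Protein ; dcterms:title ?t }"
print(with_prefixes(example_query))
```

The word-boundary match is a simple heuristic; it does not parse SPARQL, so a prefix name that appears followed by a colon inside a string literal would also be picked up.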

The following sample SPARQL queries can help you to understand more about the PubChemRDF dataset. You can build other SPARQL queries based on the sample queries below.

Query 1: What protein targets does donepezil (CHEBI_53289) inhibit with an IC50 less than 10 microMolar?

    SELECT distinct ?protein ?title
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/protein>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/substance>
    WHERE {
		?sub rdf:type obo:CHEBI_53289 ; obo:BFO_0000056 ?mg .
		?mg obo:BFO_0000057 ?protein ; obo:OBI_0000299 ?ep .
		?protein rdf:type bp:Protein ; dcterms:title ?title .
		?ep rdf:type bao:BAO_0000190 ; obo:IAO_0000136 ?sub ; sio:has-value ?value .
		filter (?value < 10 )
    }

Query 2: What pharmacological roles of SID46505803 are defined by CHEBI? (CHEBI ontology should be downloaded, and loaded into a separate graph, i.e. <http://rdf.ncbi.nlm.nih.gov/pubchem/ruleset>)

    PREFIX obov: <http://purl.obolibrary.org/obo#>
    SELECT DISTINCT ?rolelabel
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/substance>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/compound>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/ruleset>
    WHERE {
		substance:SID46505803 sio:CHEMINF_000477 ?comp .
		?comp rdf:type ?chebi .
		?chebi rdfs:subClassOf [ a owl:Restriction ;
		owl:onProperty obov:has_role ;
		owl:someValuesFrom ?role ] .
		?role rdfs:label ?rolelabel .
    }

Query 3: What compounds have a pharmacological role of NSAID as defined by CHEBI and a molecular weight less than 200 g/mol? (CHEBI ontology should be downloaded, and loaded into a separate graph, i.e. <http://rdf.ncbi.nlm.nih.gov/pubchem/ruleset>)

    PREFIX obov: <http://purl.obolibrary.org/obo#>
    SELECT distinct ?compound
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/compound>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/descriptor/compound>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/ruleset>
    WHERE {
    ?compound rdf:type ?chebi .
    ?chebi rdfs:subClassOf [ a owl:Restriction ;
    owl:onProperty obov:has_role ;
    owl:someValuesFrom obo:CHEBI_35475 ] .
    ?compound sio:has-attribute ?MW .
    ?MW rdf:type sio:CHEMINF_000334 .
    ?MW sio:has-value ?MWValue .
    filter (?MWValue < 200 )
    }

Query 4: What substances have a pharmacological role of NSAID as defined by CHEBI and the depositor-provided 3D X-ray structure information? (CHEBI ontology should be downloaded, and loaded into a separate graph, i.e. <http://rdf.ncbi.nlm.nih.gov/pubchem/ruleset>)

    PREFIX obov: <http://purl.obolibrary.org/obo#>
    SELECT DISTINCT ?substance ?source
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/substance>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/source>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/ruleset>
    WHERE {
		?substance dcterms:source ?source .
		?source dcterms:subject concept:Protein_3D_Structures .
		?substance rdf:type ?chebi .
		?chebi rdfs:subClassOf [ a owl:Restriction ;
		owl:onProperty obov:has_role ;
		owl:someValuesFrom obo:CHEBI_35475 ] .
    }

Query 5: What protein targets are inhibited, with an IC50 less than 10 µM, by substances that have a pharmacological role of cholinesterase inhibitor as defined by CHEBI?

    PREFIX obov: <http://purl.obolibrary.org/obo#>
    select distinct ?title
    from <http://purl.obolibrary.org/obo>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/substance>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/protein>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/ruleset>
    where {
		?chebi rdfs:subClassOf [ a owl:Restriction ; owl:onProperty obov:has_role ; owl:someValuesFrom obo:CHEBI_37733 ] .
		?sub rdf:type ?chebi ; obo:BFO_0000056 ?mg .
		?mg obo:BFO_0000057 ?protein ; obo:OBI_0000299 ?ep .
		?protein rdf:type bp:Protein ; dcterms:title ?title .
		?ep rdf:type bao:BAO_0000190 ; obo:IAO_0000136 ?sub ; sio:has-value ?value .
		filter (?value < 10 )
    }

Query 6: Which substances inhibit protein targets that are similar to GI548481 and have the functional domain PSSMID188648?

    select distinct ?substance ?protein ?value
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/substance>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/protein>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/conserveddomain>
    where {
		?substance obo:BFO_0000056 ?measuregroup .
		?measuregroup obo:BFO_0000057 ?protein .
		protein:GI548481 vocab:hasSimilarProtein ?protein .
		?protein obo:BFO_0000110 conserveddomain:PSSMID188648 .
		?measuregroup obo:OBI_0000299 ?endpoint .
		?endpoint obo:IAO_0000136 ?substance .
		?endpoint rdf:type bao:BAO_0000190 .
		?endpoint sio:has-value ?value .
    }

Query 7: What protein targets are inhibited, with an IC50 less than 10 µM, by substances that share the same standardized chemical structure (CID3152)?

    select distinct ?sub ?protein ?title
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/protein>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/substance>
    where {
		?sub sio:CHEMINF_000477 compound:CID3152 ; obo:BFO_0000056 ?mg .
		?mg obo:BFO_0000057 ?protein ; obo:OBI_0000299 ?ep .
		?protein rdf:type bp:Protein ; dcterms:title ?title .
		?ep rdf:type bao:BAO_0000190 ; obo:IAO_0000136 ?sub ; sio:has-value ?value .
		filter (?value < 10 )
    }

Query 8: What substances inhibit the proteins involved in the same biological pathway, prostaglandin biosynthetic process (GO:0001516), with an IC50 less than 10 µM?

    select distinct ?substance ?protein
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/substance>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/protein>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/biosystem>
    where {
		?substance obo:BFO_0000056 ?measuregroup .
		?measuregroup obo:BFO_0000057 ?protein .
		?protein rdf:type bp:Protein .
		?protein obo:BFO_0000056 obo:GO_0001516 .
		?measuregroup obo:OBI_0000299 ?endpoint .
		?endpoint obo:IAO_0000136 ?substance .
		?endpoint rdf:type bao:BAO_0000190 .
		?endpoint sio:has-value ?value .
		filter (?value < 10)
    }

Query 9: What are the pharmacological roles, as defined by CHEBI, of the substances that inhibit protein target GI17531135 with an IC50 less than 10 µM?

    select distinct ?rolelabel
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/substance>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/ruleset>
    from <http://purl.obolibrary.org/obo>
    where {
		?sub obo:BFO_0000056 ?mg .
		?mg obo:BFO_0000057 protein:GI17531135 ; obo:OBI_0000299 ?ep .
		?sub rdf:type ?chebi .
		?chebi rdfs:subClassOf _:I .
		_:I a owl:Restriction .
		_:I owl:onProperty <http://purl.obolibrary.org/obo#has_role> .
		_:I owl:someValuesFrom ?role .
		?role rdfs:label ?rolelabel .
		?ep obo:IAO_0000136 ?sub ; rdf:type bao:BAO_0000190 ; sio:has-value ?value .
		filter (?value < 10 )
    }

Query 10: Summarize the statistics about the total number of substances tested in the PubChem database against each protein target (please note that this may be a time-consuming query).

    select (count(?sub) as ?subcnt) ?protein
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/substance>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint>
    from <http://rdf.ncbi.nlm.nih.gov/pubchem/protein>
    where {
		?sub obo:BFO_0000056 ?mg .
		?mg obo:BFO_0000057 ?protein .
		?protein rdf:type bp:Protein .
		?mg obo:OBI_0000299 ?ep .
		?ep rdf:type bao:BAO_0000190 ; obo:IAO_0000136 ?sub ; sio:has-value ?value .
    }
    group by ?protein
    order by ?subcnt
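
When a query such as Query 10 is sent to the SPARQL endpoint with JSON output requested, the response follows the W3C SPARQL Query Results JSON Format: variable names are listed under `head.vars` and one object per row appears under `results.bindings`. The Python sketch below turns such a response into (protein, count) pairs. The function names are hypothetical, and the sample response is made up for illustration: the counts are not real PubChem statistics.

```python
import json


def rows_from_sparql_json(text):
    """Flatten a SPARQL JSON result into a list of {variable: value} dicts."""
    doc = json.loads(text)
    variables = doc["head"]["vars"]
    return [
        {v: binding[v]["value"] for v in variables if v in binding}
        for binding in doc["results"]["bindings"]
    ]


def counts_by_protein(text):
    """Return (protein URI, substance count) pairs, largest count first."""
    pairs = [(r["protein"], int(r["subcnt"])) for r in rows_from_sparql_json(text)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Made-up response with invented counts, for illustration only.
sample_response = json.dumps({
    "head": {"vars": ["subcnt", "protein"]},
    "results": {"bindings": [
        {"subcnt": {"type": "literal", "value": "3"},
         "protein": {"type": "uri",
                     "value": "http://rdf.ncbi.nlm.nih.gov/pubchem/protein/GI548481"}},
        {"subcnt": {"type": "literal", "value": "12"},
         "protein": {"type": "uri",
                     "value": "http://rdf.ncbi.nlm.nih.gov/pubchem/protein/GI17531135"}},
    ]},
})
for protein, count in counts_by_protein(sample_response):
    print(count, protein)
```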

8. Document Version History

V1.0.b – 2014Jan13 – Initial beta release.

V1.1.b – 2014Mar05 – added sections 1.1 and 1.2 and added Virtuoso bulk loading description.

V1.5.b – 2015June – overview of major changes: