PubChemRDF Release Notes

V1.1.beta

1. Introduction

1.1. What is RDF
1.2. How can PubChemRDF help your research

2. Ontology-based Data Integration

3. PubChemRDF URI Constructions

4. PubChemRDF Subdomains

4.1. PubChem Compound
4.2. PubChem Substance
4.3. PubChem Descriptors
4.4. PubChem InChIKey
4.5. PubChem Synonym
4.6. PubChem BioAssay
4.7. PubChem MeasureGroup
4.8. PubChem Endpoint
4.9. PubChem Protein
4.10. PubChem ConservedDomain
4.11. PubChem Gene
4.12. PubChem Biosystem
4.13. PubChem Neighbor
4.14. PubChem Source
4.15. PubChem Reference

5. RESTful Interface

6. RDF FTP Download Directory Layout

6.1. PubChem Compound
6.2. PubChem Substance
6.3. PubChem Descriptor
6.4. PubChem InChIKey
6.5. PubChem Synonym
6.6. PubChem Bioassay
6.7. PubChem MeasureGroup
6.8. PubChem Endpoint
6.9. PubChem Protein
6.10. PubChem ConservedDomain
6.11. PubChem Gene
6.12. PubChem Biosystem
6.13. PubChem Source
6.14. PubChem Reference

7. PubChemRDF Use Cases

8. Document Version History

1. Introduction

Semantic Web technologies are emerging as an increasingly important approach to distribute and integrate scientific data. These technologies include the trio of the Resource Description Framework (RDF), Web Ontology Language (OWL), and SPARQL query language. The PubChemRDF project provides RDF formatted information for the PubChem Compound, Substance, and Bioassay databases.

1.1. What is RDF?

RDF is a family of World Wide Web Consortium (W3C) specifications for data interchange on the Web. RDF breaks knowledge down into discrete, machine-readable pieces called "triples." Each triple is organized as a 'subject-predicate-object' trio. For example, in the phrase "atorvastatin may treat hypercholesterolemia," the subject is "atorvastatin", the predicate is "may treat", and the object is "hypercholesterolemia." RDF uses a Uniform Resource Identifier (URI) to name each part of the triple. A URI looks just like a typical web URL. RDF is a core part of the Semantic Web standards. As an extension of the existing World Wide Web, the Semantic Web attempts to make it easier for users to find, share, and combine information. The Semantic Web leverages the following technologies: Extensible Markup Language (XML), which provides a syntax for RDF; RDF itself, which expresses knowledge; the Web Ontology Language (OWL), which extends the ability of RDF to encode information; and the SPARQL query language, which enables query and manipulation of RDF content.
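
A triple can be pictured as a three-field record. The sketch below models the atorvastatin statement in Python. The `example.org` URIs and the `Triple` class are placeholders invented for illustration; they are not PubChemRDF identifiers.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Placeholder namespace (hypothetical, for illustration only).
EX = "http://example.org/"

statement = Triple(
    subject=EX + "atorvastatin",
    predicate=EX + "may_treat",
    obj=EX + "hypercholesterolemia",
)

# Each part of the statement is named by a URI.
print(statement.subject, statement.predicate, statement.obj)
```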

1.2. How can PubChemRDF help your research?

PubChem users have frequently expressed interest in having a downloadable database. Using PubChemRDF, one can download the desired RDF-formatted data files from the PubChem FTP site, import them into a triplestore, and query them through a SPARQL query interface. Together these tools enable schema-less database access and query. There are a number of open-source and commercial triplestores, such as Apache Jena TDB and OpenLink Virtuoso (a list can be found here: http://en.wikipedia.org/wiki/Triplestore). Besides triplestores, PubChemRDF data can also be loaded into RDF-aware graph databases such as Neo4j, where graph traversal algorithms can be used to query the RDF graphs. Last but not least, the ontological representation of the PubChem knowledge base allows logical inference, such as forward and backward chaining. The RDF data on the PubChem FTP site is arranged so that you only need to download the type of information in which you are interested, and can avoid downloading parts of PubChem data you will not use. For example, if you are interested only in computed chemical properties, you only need to download the PubChemRDF data in the compound descriptor subdomain. In addition to bulk download, PubChemRDF also provides programmatic data access through a RESTful interface.

This document provides detailed technical information (release notes) about the PubChemRDF project. Downloadable RDF data is available on the PubChemRDF FTP site. Past presentations are available that give an introduction to PubChemRDF and cover its details. The PubChem Blog may provide the most recent updates on the project. Please note that PubChemRDF continues to evolve. We intend for such enhancements to remain backwards compatible, made by adding information and annotations.

2. Ontology-based Data Integration

As depicted in Figure 1, the PubChemRDF content includes a number of semantic relationships, such as those between compounds and substances, the chemical descriptors associated with compounds and substances, the relationships between compounds, the provenance and attribution metadata of substances, and the concise bioactivity data view of substances. Whenever possible, pre-existing ontological frameworks were used to semantically describe information available in the PubChem archive, rather than creating new ones. However, in some cases, no suitable types or relations were defined in standard ontologies, and so a PubChem vocabulary was created to define these terms. The set of standardized ontologies used to define the domain-specific knowledge is listed in Table 1 and includes: Chemical Entities of Biological Interest (ChEBI), CHEMical INFormation ontology (CHEMINF), Protein Ontology (PRO), Gene Ontology (GO), Semanticscience Integrated Ontology (SIO), Basic Formal Ontology (BFO), Ontology for Biomedical Investigations (OBI), Information Artifact Ontology (IAO), BioAssay Ontology (BAO), Units of Measurement (UO), Quantities, Units, Dimensions and Data Types (QUDT), Citation Typing Ontology (CiTO), FRBR-aligned Bibliographic Ontology (FaBiO), Dublin Core Metadata Initiative (DCMI) Terms, Provenance Authoring and Versioning ontology (PAV), Simple Knowledge Organization System (SKOS), the Friend Of A Friend (FOAF) vocabulary, and BioPAX. All of the biomedical ontologies, such as ChEBI, CHEMINF, PRO, GO, BFO, SIO, and BAO, are made accessible by the NIH Roadmap National Center for Biomedical Ontology (NCBO) through its BioPortal, and comply with an evolving set of shared principles established by the Open Biomedical Ontologies (OBO) Foundry. Adoption of these core ontologies helps to ensure that the mapping of chemical and biological information is compatible across multiple Semantic Web resources.

Figure 1. Color-coded diagram showing a high-level overview of the PubChemRDF semantic relationships.

Table 1. The prefixes and corresponding namespaces of standardized ontologies used in PubChemRDF.

Standardized Ontologies

Prefix | Namespace | Vocabularies
rdfs | http://www.w3.org/2000/01/rdf-schema# | RDF Schema
rdf | http://www.w3.org/1999/02/22-rdf-syntax-ns# | RDF
owl | http://www.w3.org/2002/07/owl# | OWL
xsd | http://www.w3.org/2001/XMLSchema# | XML Schema
chebi (a) | http://purl.obolibrary.org/obo/ | ChEBI
uo (a) | http://purl.obolibrary.org/obo/ | UO
sio (b) | http://semanticscience.org/resource/ | SIO
cheminf (b) | http://semanticscience.org/resource/ | CHEMINF
skos | http://www.w3.org/2004/02/skos/core# | SKOS
obo | http://purl.obolibrary.org/obo/ | BFO, OBI, and IAO
bao | http://www.bioassayontology.org/bao# | BAO
bp | http://www.biopax.org/release/biopax-level3.owl# | BioPAX
qudt | http://data.nasa.gov/qudt/owl/qudt# | QUDT
cito | http://purl.org/spar/cito/ | CiTO
fabio | http://purl.org/spar/fabio/ | FaBiO
ops | http://www.openphacts.org/units/ | Open PHACTS Vocabulary
pr | http://purl.obolibrary.org/obo/pr# | PRO
go | http://purl.obolibrary.org/obo/go# | GO
dcterms | http://purl.org/dc/terms/ | DCMI Terms
pav | http://purl.org/pav/ | PAV
foaf | http://xmlns.com/foaf/0.1/ | FOAF Vocabulary

(a) The chebi and uo ontologies share a URI namespace but are distinct.
(b) The sio and cheminf ontologies share a URI namespace but are distinct.

3. PubChemRDF URI Constructions

In this document, PubChemRDF statements are written in the Turtle syntax with Uniform Resource Identifiers (URIs) abbreviated as prefixed names. Turtle prefix directives map each prefix to its namespace, so a prefixed name can be expanded back to the full URI. The PubChem subdomain namespaces are listed in Table 2. Both "303 URIs" and "hash URIs" were employed in the PubChemRDF project according to W3C recommendations; however, hash URIs were only used for the PubChem vocabulary subdomain, and 303 URIs were used for the rest of the PubChemRDF subdomains. The PubChem vocabulary serves as a terminology defining the types and relations of some PubChem-specific terms. For instance, the URI for the type of PubChem-specific 3-D structural similarity defined in the PubChem vocabulary is as follows:

http://rdf.ncbi.nlm.nih.gov/pubchem/vocabulary#3D_structural_similarity

Table 2. The prefixes and corresponding namespaces of subdomains used in PubChemRDF.

PubChemRDF Subdomains

Prefix | Namespace
compound | http://rdf.ncbi.nlm.nih.gov/pubchem/compound/
substance | http://rdf.ncbi.nlm.nih.gov/pubchem/substance/
descr | http://rdf.ncbi.nlm.nih.gov/pubchem/descriptor/
inchikey | http://rdf.ncbi.nlm.nih.gov/pubchem/inchikey/
syno | http://rdf.ncbi.nlm.nih.gov/pubchem/synonym/
bioassay | http://rdf.ncbi.nlm.nih.gov/pubchem/bioassay/
measuregroup | http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup/
endpoint | http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint/
protein | http://rdf.ncbi.nlm.nih.gov/pubchem/protein/
domain | http://rdf.ncbi.nlm.nih.gov/pubchem/conserveddomain/
biosystem | http://rdf.ncbi.nlm.nih.gov/pubchem/biosystem/
gene | http://rdf.ncbi.nlm.nih.gov/pubchem/gene/
reference | http://rdf.ncbi.nlm.nih.gov/pubchem/reference/
nbr (a) | http://rdf.ncbi.nlm.nih.gov/pubchem/neighbor/
source | http://rdf.ncbi.nlm.nih.gov/pubchem/source/
vocab | http://rdf.ncbi.nlm.nih.gov/pubchem/vocabulary#

(a) The RDF triples for the neighbor subdomain are currently only available through the RESTful interface.

The URIs for PubChem compounds and substances were constructed based on primary accession identifiers (CID and SID). For instance, the URIs for CID60823 and SID103554720 can be represented as:

http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID60823

http://rdf.ncbi.nlm.nih.gov/pubchem/substance/SID103554720

which can be abbreviated as compound:CID60823 and substance:SID103554720, respectively. The InChIKey URIs were constructed based on the value of InChIKey. For instance, the URI for InChIKey with value of "XUKUURHRXDUEBC-KAYWLYCHSA-N" (case-insensitive) can be represented as:

http://rdf.ncbi.nlm.nih.gov/pubchem/inchikey/XUKUURHRXDUEBC-KAYWLYCHSA-N
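
The abbreviation scheme amounts to a lookup from prefix to namespace. The following is a minimal sketch, assuming a small subset of the Table 2 prefixes; the helper names `expand` and `abbreviate` are hypothetical and not part of any PubChem tool.

```python
# Subset of the Table 2 prefixes; extend as needed.
NAMESPACES = {
    "compound": "http://rdf.ncbi.nlm.nih.gov/pubchem/compound/",
    "substance": "http://rdf.ncbi.nlm.nih.gov/pubchem/substance/",
    "inchikey": "http://rdf.ncbi.nlm.nih.gov/pubchem/inchikey/",
    "vocab": "http://rdf.ncbi.nlm.nih.gov/pubchem/vocabulary#",
}


def expand(prefixed_name: str) -> str:
    """Expand 'prefix:local' into a full URI using NAMESPACES."""
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or prefix not in NAMESPACES:
        raise ValueError(f"unknown or missing prefix in {prefixed_name!r}")
    return NAMESPACES[prefix] + local


def abbreviate(uri: str) -> str:
    """Abbreviate a full URI to 'prefix:local' when a known namespace matches."""
    for prefix, base in NAMESPACES.items():
        if uri.startswith(base):
            return f"{prefix}:{uri[len(base):]}"
    return uri  # no known namespace: leave the URI unchanged


print(expand("compound:CID60823"))
print(abbreviate("http://rdf.ncbi.nlm.nih.gov/pubchem/substance/SID103554720"))
```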

Most chemical descriptor namespace URIs were constructed based on a combination of CID/SID and descriptor labels, except in the case of depositor-provided synonyms. For instance, the URI for the molecular weight of CID 60823 can be represented as:

http://rdf.ncbi.nlm.nih.gov/pubchem/descriptor/CID60823_Molecular_Weight

or simply as descr:CID60823_Molecular_Weight. The URIs for the depositor-provided synonyms were constructed based on MD5 hash values, after first converting chemical names to lower case. For example, 'Atorvastatin [INN:BAN]' becomes 'atorvastatin [inn:ban]' to produce the MD5 hash '7be8fb160fff31a7beea9df539fd36bd' and '(3R,5R)-7-[3-(anilinocarbonyl)-5-(4-fluorophenyl)-2-isopropyl-4-phenyl-1H-pyrrol-1-yl]-3,5-dihydroxyheptanoic acid' becomes '(3r,5r)-7-[3-(anilinocarbonyl)-5-(4-fluorophenyl)-2-isopropyl-4-phenyl-1h-pyrrol-1-yl]-3,5-dihydroxyheptanoic acid' to produce the MD5 hash 'c576a26b0c67fa6b072b61a0b4c57a6c'. The use of an MD5 hash in place of the actual chemical name allows PubChem information associated with any given chemical name to be directly accessed using RDF. For instance, the depositor-provided synonym of 'Atorvastatin' can be represented as:

http://rdf.ncbi.nlm.nih.gov/pubchem/synonym/MD5_9a05646d461669f86de312d88ab5748a

or simply as syno:MD5_9a05646d461669f86de312d88ab5748a.
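
The synonym URI construction can be sketched with Python's standard `hashlib` module. This is an illustration under assumptions: the function name `synonym_uri` is hypothetical, and the lower-cased name is assumed to be UTF-8 encoded with no further normalization, so the resulting hashes may differ from PubChem's if other preprocessing is applied.

```python
import hashlib

SYNONYM_BASE = "http://rdf.ncbi.nlm.nih.gov/pubchem/synonym/"


def synonym_uri(name: str) -> str:
    """Lower-case the name, hash it with MD5, and build the synonym URI."""
    # Assumption: the lower-cased name is encoded as UTF-8 before hashing.
    digest = hashlib.md5(name.lower().encode("utf-8")).hexdigest()
    return f"{SYNONYM_BASE}MD5_{digest}"


# Names differing only in case map to the same URI.
print(synonym_uri("Atorvastatin"))
print(synonym_uri("ATORVASTATIN") == synonym_uri("atorvastatin"))
```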

PubChem BioAssay records were annotated in different ways depending on the assay type, in accordance with the BioAssay Ontology (BAO). Literature-extracted bioassays, such as those from ChEMBL, were represented as instances of BAO measure group (BAO_0000040), since these are summary results abstracted from the literature and lack specific information on how the biological experiment was performed. The URIs for assay records are constructed based on the PubChem BioAssay accession identifiers (AID). For instance, the URI for AID 447528 can be assigned as:

http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup/AID447528

or abbreviated as measuregroup:AID447528.

Some literature extracted assays aggregate literature abstracted bioactivity data from multiple publications. For instance, AID 363 deposited by BindingDB contains bioactivity data tested against human Src kinase from 23 different publications. In literature-derived assays of this type, a single bioassay record is broken down into multiple measure groups, and the fragment identifier of each individual measure group is based on the combination of AID and PubMed identifier (PMID), for example:

http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup/AID363_PMID16161995

or abbreviated as measuregroup:AID363_PMID16161995. However, in contrast to literature-extracted assays, biological screening experiments, such as those from the NIH Molecular Library Program (MLP), were represented as an instance of BAO bioassay (BAO_0000015). For instance, the URI for AID 1788 can be assigned as:

http://rdf.ncbi.nlm.nih.gov/pubchem/bioassay/AID1788

or abbreviated as bioassay:AID1788.

Although screening assays and literature-extracted assays are different, they are related. Each screening assay refers to an operational unit and may have one or more instances of BAO measure group (BAO_0000040). If the screening assay is a panel assay (for instance, testing against a panel of multiple targets as occurs when performing a lead profiling screen), the URIs are constructed based on the combination of AID and panel component identifier (PID), for example:

http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup/AID1788_1

or abbreviated as measuregroup:AID1788_1. Therefore, the measure group serves as a basic concept interlinking chemical substances, molecular targets, and the bioactivity endpoints for a given PubChem BioAssay record.

The URIs of bioactivity endpoints were constructed based on the combination of SID and AID, plus PID if the endpoints were produced by panel screening assays or PMID if the endpoints were derived from aggregated literature-extracted assay. The following URIs demonstrate the different reference approaches used for bioactivity endpoints:

http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint/SID103164874_AID443491

http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint/SID99445338_AID2202_1

http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint/SID8033500_AID363_PMID10395478

or abbreviated as endpoint:SID103164874_AID443491, endpoint:SID99445338_AID2202_1, and endpoint:SID8033500_AID363_PMID10395478, respectively. The first URI above refers to an endpoint derived from a ChEMBL assay, the second URI refers to an endpoint produced by a panel screening assay (assay panel PID is 1), and the last URI refers to an endpoint derived from a literature-extracted assay (PMID is 10395478).
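
The measure group and endpoint naming rules can be summarized in a short sketch. The function names are hypothetical, and the code assumes that a PID and a PMID are never combined in one identifier, which matches the examples above but is not stated as a rule.

```python
from typing import Optional

BASE = "http://rdf.ncbi.nlm.nih.gov/pubchem/"


def _suffix(pid: Optional[int], pmid: Optional[int]) -> str:
    """'_<PID>' for panel assays, '_PMID<PMID>' for aggregated literature assays."""
    if pid is not None and pmid is not None:
        # Assumption: the two qualifiers are mutually exclusive.
        raise ValueError("give either pid or pmid, not both")
    if pid is not None:
        return f"_{pid}"
    if pmid is not None:
        return f"_PMID{pmid}"
    return ""


def measuregroup_uri(aid: int, pid: Optional[int] = None,
                     pmid: Optional[int] = None) -> str:
    """Build a measure group URI from an AID and an optional PID or PMID."""
    return f"{BASE}measuregroup/AID{aid}{_suffix(pid, pmid)}"


def endpoint_uri(sid: int, aid: int, pid: Optional[int] = None,
                 pmid: Optional[int] = None) -> str:
    """Build an endpoint URI from a SID, an AID, and an optional PID or PMID."""
    return f"{BASE}endpoint/SID{sid}_AID{aid}{_suffix(pid, pmid)}"


print(endpoint_uri(103164874, 443491))
print(endpoint_uri(99445338, 2202, pid=1))
print(endpoint_uri(8033500, 363, pmid=10395478))
```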

In the case of protein targets, the URIs utilize National Center for Biotechnology Information (NCBI) Protein GI numbers:

http://rdf.ncbi.nlm.nih.gov/pubchem/protein/GI124375976

or abbreviated as protein:GI124375976. The protein targets tested in the PubChem BioAssay database can be linked out to other NCBI databases, including the NCBI Conserved Domain, Gene, BioSystems, and PubMed databases. Protein conserved domains contain recurring sequence patterns, which define the functional and/or structural units of protein sequences. NCBI conserved domains were identified through multiple sequence alignment and are distinguished through position-specific scoring matrix (PSSM) models. Each PSSM model has a unique PSSM identifier (PSSMID). The NCBI Gene database integrates gene information for various species. Each gene has a unique Gene ID (GID). The NCBI BioSystems database integrates a group of biological entities interacting in a biological process into a single conceptual unit called a biosystem, which may belong to one of several categories: biological pathway, molecular function, cellular location, and signature module. Each biosystem is assigned a unique identifier (BSID). The NCBI PubMed database comprises more than 23 million citations for biomedical literature. Each PubMed record is assigned a unique identifier (PMID). PubChemRDF provides RDF triples that expose the linkage information and basic descriptions of those resources.

The URIs for conserved domains use PSSMIDs:

http://rdf.ncbi.nlm.nih.gov/pubchem/conserveddomain/PSSMID132758

or abbreviated as domain:PSSMID132758. The URIs for genes use NCBI gene IDs:

http://rdf.ncbi.nlm.nih.gov/pubchem/gene/GID367

or abbreviated as gene:GID367. The URIs for biosystems use BSIDs:

http://rdf.ncbi.nlm.nih.gov/pubchem/biosystem/BSID82991

or abbreviated as biosystem:BSID82991. The URIs for publication references use PMIDs:

http://rdf.ncbi.nlm.nih.gov/pubchem/reference/PMID10395478

or abbreviated as reference:PMID10395478.

The URIs for PubChem depositors are based on the names of depositors:

http://rdf.ncbi.nlm.nih.gov/pubchem/source/ChEMBL

or abbreviated as source:ChEMBL. Depositor names are adjusted as follows: if the name is purely numeric, the prefix "ID" is added; if the name contains any of the symbols ",", ".", "&", "(", ")", and "/", those symbols are deleted; and if the name contains spaces, they are replaced by "_".
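
The depositor name rules can be sketched as follows. The function name `source_uri` is hypothetical, the rules are assumed to apply together in the order shown, and the long input name in the usage line is an assumed spelling of the depositor whose source URI appears in the BioAssay example of section 4.6.

```python
import re

SOURCE_BASE = "http://rdf.ncbi.nlm.nih.gov/pubchem/source/"


def source_uri(depositor_name: str) -> str:
    """Apply the depositor-name rules and build the source URI."""
    name = depositor_name
    if name.isdigit():
        # Purely numeric names get the prefix "ID" (assumed to be joined directly).
        name = "ID" + name
    # Delete the symbols , . & ( ) /
    name = re.sub(r"[,.&()/]", "", name)
    # Replace each space with an underscore.
    name = name.replace(" ", "_")
    return SOURCE_BASE + name


print(source_uri("ChEMBL"))
print(source_uri("Vanderbilt Screening Center for GPCRs, Ion Channels and Transporters"))
```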

The URIs for PubChem Compound 2-D and 3-D similarity neighbors and PubChem BioAssay protein target sequence similarity neighbors are available as follows:

http://rdf.ncbi.nlm.nih.gov/pubchem/neighbor/CID60823_CID68019409_2DSimilarity

http://rdf.ncbi.nlm.nih.gov/pubchem/neighbor/CID60823_CID11330946_3DSimilarity

http://rdf.ncbi.nlm.nih.gov/pubchem/neighbor/GI548481_GI129900_SequenceSimilarity

or abbreviated as nbr:CID60823_CID68019409_2DSimilarity, nbr:CID60823_CID11330946_3DSimilarity, and nbr:GI548481_GI129900_SequenceSimilarity, respectively.

4. PubChemRDF Subdomains

4.1. PubChem Compound

PubChem Compound RDF triples expose the linkage between compound and chemical descriptor resources, and between interrelated compounds, such as compound identity groups (CIGs). [See Figure 1 for a diagram of links to other RDF subdomains.] For example, to resolve the URI in the RESTful interface for compound CID 60823:

http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID60823

Link Type | Example RDF Triple
calculated chemical descriptor | compound:CID60823 sio:has-attribute descr:CID60823_Molecular_Weight .
parent compound | compound:CID60823 vocab:hasParentCompound compound:CID60823 .
component compound | compound:CID22765305 cheminf:CHEMINF_000478 compound:CID2244 .
compound identity group (CIG) | compound:CID60823 cheminf:CHEMINF_000462 compound:CID2250 .
2-D similarity neighbor | compound:CID60823 cheminf:CHEMINF_000482 compound:CID60822 .
3-D similarity neighbor | compound:CID60823 cheminf:CHEMINF_000483 compound:CID10745515 .

4.2. PubChem Substance

PubChem Substance RDF triples expose the linkage between substance and chemical descriptor resources, substance and standardized compound resources, substance and measure group resources, and substance and data source resources. For example, to resolve the URI for SID 8032774:

http://rdf.ncbi.nlm.nih.gov/pubchem/substance/SID8032774

Link Type | Example RDF Triple
depositor-provided descriptor | substance:SID8032774 sio:has-attribute descr:SID8032774_Depositor_Identifier .
standardized compound | substance:SID8032774 cheminf:CHEMINF_000477 compound:CID5327844 .
data source | substance:SID8032774 dcterms:source source:BindingDB .
measure group | substance:SID8032774 obo:BFO_0000056 measuregroup:AID363_PMID9357527 .

4.3. PubChem Descriptors

PubChem descriptor RDF triples expose the linkage between chemical descriptor and compound and substance resources, as well as the type, value and unit for a given descriptor.

For example, to resolve the URIs in the RESTful interface for the molecular weight of PubChem Compound record CID 60823 and for the external depositor identifier of PubChem Substance record SID 8032774:

http://rdf.ncbi.nlm.nih.gov/pubchem/descriptor/CID60823_Molecular_Weight

http://rdf.ncbi.nlm.nih.gov/pubchem/descriptor/SID8032774_Depositor_Identifier

Link Type | Example RDF Triple
compound or substance | descr:CID60823_Molecular_Weight sio:is-attribute-of compound:CID60823 .
type | descr:CID60823_Molecular_Weight rdf:type cheminf:CHEMINF_000334 .
value | descr:CID60823_Molecular_Weight sio:has-value "558.639803"^^xsd:double .
unit | descr:CID60823_Molecular_Weight sio:has-unit obo:UO_0000055 .
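
To illustrate how such triples are queried without a fixed schema, the sketch below holds the four example descriptor triples in memory and matches them against a pattern with wildcards. It is a toy stand-in for a triplestore and SPARQL, not a description of how PubChem serves the data; the names `match` and `descriptor_record` are hypothetical, and the literal value is stored as a plain string for simplicity.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]

# The example descriptor triples above, in prefixed form.
TRIPLES = [
    ("descr:CID60823_Molecular_Weight", "sio:is-attribute-of", "compound:CID60823"),
    ("descr:CID60823_Molecular_Weight", "rdf:type", "cheminf:CHEMINF_000334"),
    ("descr:CID60823_Molecular_Weight", "sio:has-value", "558.639803"),
    ("descr:CID60823_Molecular_Weight", "sio:has-unit", "obo:UO_0000055"),
]


def match(triples: Iterable[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None) -> Iterator[Triple]:
    """Yield triples matching the pattern; None acts as a wildcard."""
    for triple in triples:
        if all(want is None or want == got
               for want, got in zip((s, p, o), triple)):
            yield triple


def descriptor_record(triples: Iterable[Triple], descriptor: str) -> Dict[str, str]:
    """Collect predicate -> object pairs for one descriptor resource."""
    return {p: o for _, p, o in match(triples, s=descriptor)}


record = descriptor_record(TRIPLES, "descr:CID60823_Molecular_Weight")
print(record["sio:has-value"], record["sio:has-unit"])
```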

4.4. PubChem InChIKey

PubChem InChIKey RDF triples expose the type and the corresponding InChI of a given InChIKey, as well as the linkage to the corresponding CID(s). For example, to resolve the URI for the InChIKey with a value of "BSYNRYMUTXBXSQ-UHFFFAOYSA-N" (case-insensitive):

http://rdf.ncbi.nlm.nih.gov/pubchem/inchikey/BSYNRYMUTXBXSQ-UHFFFAOYSA-N

Link Type | Example RDF Triple
compound | inchikey:BSYNRYMUTXBXSQ-UHFFFAOYSA-N sio:is-attribute-of compound:CID2244 .
type | inchikey:BSYNRYMUTXBXSQ-UHFFFAOYSA-N rdf:type cheminf:CHEMINF_000399 .
value | inchikey:BSYNRYMUTXBXSQ-UHFFFAOYSA-N sio:has-value "InChI=1S/C33H35FN2O5/c1-21(2)31-30(33(41)35-25-11-7-4-8-12-25)29(22-9-5-3-6-10-22)32(23-13-15-24(34)16-14-23)36(31)18-17-26(37)19-27(38)20-28(39)40/h3-16,21,26-27,37-38H,17-20H2,1-2H3,(H,35,41)(H,39,40)/t26-,27-/m1/s1"@en .

4.5. PubChem Synonym

PubChem synonym RDF triples expose the type and value of a given MD5 hash string, as well as the linkage to the corresponding SID(s). For example, to resolve the URI for the synonym "5'-CYTIDYLIC ACID":

http://rdf.ncbi.nlm.nih.gov/pubchem/synonym/MD5_8437e4cbbdae1037d9da9fc57f8fdd1A

Link Type | Example RDF Triple
compound | syno:MD5_8437e4cbbdae1037d9da9fc57f8fdd1A sio:is-attribute-of compound:SID2244 .
type | syno:MD5_8437e4cbbdae1037d9da9fc57f8fdd1A rdf:type cheminf:CHEMINF_000339 .
value | syno:MD5_8437e4cbbdae1037d9da9fc57f8fdd1A sio:has-value "5'-CYTIDYLIC ACID"@en .

4.6. PubChem BioAssay

PubChem BioAssay RDF triples expose the type, title, homepage, data source and the linkage to measure groups for a given assay. For example, to resolve the URI for the PubChem Assay record AID 1788:

http://rdf.ncbi.nlm.nih.gov/pubchem/bioassay/AID1788

Link Type | Example RDF Triple
type | bioassay:AID1788 rdf:type bao:BAO_0000015 .
title | bioassay:AID1788 dcterms:title "Discovery of novel allosteric modulators of the M1 muscarinic receptor: Agonist Ancillary Activity"@en .
data source | bioassay:AID1788 dcterms:source source:Vanderbilt_Screening_Center_for_GPCRs_Ion_Channels_and_Transporters .
measure group | bioassay:AID1788 bao:BAO_0000209 measuregroup:AID1788_1 .

4.7. PubChem MeasureGroup

For high throughput screening assays, including panel assays, PubChem measure group RDF triples expose the title, type, as well as linkage to bioassay, participants, and endpoints. For example, to resolve the URI for assay panel 1 from the PubChem Assay record AID 1788:

http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup/AID1788_1

Link Type | Example RDF Triple
type | measuregroup:AID1788_1 rdf:type bao:BAO_0000040 .
title | measuregroup:AID1788_1 dcterms:title "Adenosine A1 (human)"@en .
bioassay | measuregroup:AID1788_1 bao:BAO_0000426 bioassay:AID1788 .
participant | measuregroup:AID1788_1 obo:BFO_0000057 substance:SID56353039 .
endpoint | measuregroup:AID1788_1 obo:OBI_0000299 endpoint:SID56353039_AID1788_1 .

For literature-extracted assays, PubChem measure group RDF triples expose the title, type, data source, as well as the linkage to participants and endpoints.

Link Type | Example RDF Triple
type | measuregroup:AID447528 rdf:type bao:BAO_0000040 .
title | measuregroup:AID447528 dcterms:title "Inhibition of ovine COX1 by enzyme immunoassay"@en .
data source | measuregroup:AID447528 dcterms:source source:ChEMBL .
participant | measuregroup:AID447528 obo:BFO_0000057 substance:SID103164874 .
endpoint | measuregroup:AID447528 obo:OBI_0000299 endpoint:SID103164874_AID447528 .

4.8. PubChem Endpoint

PubChem endpoint RDF triples expose the type, value, unit and the linkage to measure groups, substance, and reference resources. For example, to resolve the URI for the bioassay endpoint between PubChem records SID 103164874 and AID 443491:

http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint/SID103164874_AID443491

Link Type | Example RDF Triple
type | endpoint:SID103164874_AID443491 rdf:type bao:BAO_0000190 .
value | endpoint:SID103164874_AID443491 qudt:numericValue "0.162"^^xsd:double .
unit | endpoint:SID103164874_AID443491 qudt:unit ops:Micromolar .
substance | endpoint:SID103164874_AID443491 obo:IAO_0000136 substance:SID103164874 .
measure group | endpoint:SID103164874_AID443491 obo:OBI_0000312 measuregroup:AID443491 .
reference | endpoint:SID103164874_AID443491 cito:citesAsDataSource reference:PMID19880317 .

4.9. PubChem Protein

PubChem protein RDF triples expose the type (Protein Ontology), title, similarity neighbors, conserved domains, encoding genes, as well as the linkages to the measure groups and biosystems for a given protein. For example, to resolve the URI for the NCBI Protein record GI 124375976:

http://rdf.ncbi.nlm.nih.gov/pubchem/protein/GI124375976

Link Type | Example RDF Triple
type | protein:GI124375976 rdf:type pr:PR_000004191 .
title | protein:GI124375976 dcterms:title "AR protein [Homo sapiens]"@en .
similarity neighbor | protein:GI124375976 vocab:hasSimilarProtein protein:GI6978663 .
conserved domain | protein:GI124375976 obo:BFO_0000110 domain:PSSMID132758 .
encoding gene | protein:GI124375976 vocab:encodedBy gene:GID367 .
measure group | protein:GI124375976 obo:BFO_0000056 measuregroup:AID588515 .
biosystem | protein:GI124375976 obo:BFO_0000056 biosystem:BSID219801 .

4.10. PubChem ConservedDomain

PubChem conserved domain RDF triples expose the type and linkages to the proteins that have been tested in measure groups. For example, to resolve the URI for the NCBI Conserved Domain record PSSMID 132758:

http://rdf.ncbi.nlm.nih.gov/pubchem/conserveddomain/PSSMID132758

Link Type | Example RDF Triple
type | domain:PSSMID132758 rdf:type obo:SO_0000417 .
protein | domain:PSSMID132758 obo:BFO_0000177 protein:GI124375976 .

4.11. PubChem Gene

PubChem gene RDF triples expose the type, symbol, and linkages to the proteins that have been tested in measure groups. For example, to resolve the URI for NCBI Gene record Gene ID 367:

http://rdf.ncbi.nlm.nih.gov/pubchem/gene/GID367

Link Type | Example RDF Triple
type | gene:GID367 rdf:type bp:Gene .
symbol | gene:GID367 vocab:geneSymbol "AR"@en .
protein | gene:GID367 vocab:encoding protein:GI124375976 .

4.12. PubChem Biosystem

PubChem biosystem RDF triples expose the type, title, homepage, and the linkages to the proteins that have been tested in measure groups. For example, to resolve the URI for the NCBI Biosystems record BSID 82991:

http://rdf.ncbi.nlm.nih.gov/pubchem/biosystem/BSID82991

Link Type | Example RDF Triple
type | biosystem:BSID82991 rdf:type bp:Pathway .
title | biosystem:BSID82991 dcterms:title "Arachidonic acid metabolism"@en .
homepage | biosystem:BSID82991 foaf:homepage <http://www.kegg.jp/pathway/hsa00590> .
protein | biosystem:BSID82991 obo:BFO_0000057 protein:GI543583725 .

4.13. PubChem Neighbor

PubChem neighbor RDF triples describe similarity relationships and their supporting information. Currently, these exist between chemical records and between protein sequences. The chemical 2-D similarity neighbor RDF triples expose the similarity relation type, compounds involved in the neighboring relation, the value and type of the similarity score, as well as the linkage between the neighboring relation and the evaluating score. For example, to resolve the URI for the neighboring relationship between the PubChem Compound records CID 60823 and CID 68019409:

http://rdf.ncbi.nlm.nih.gov/pubchem/neighbor/CID60823_CID68019409_2DSimilarity

http://rdf.ncbi.nlm.nih.gov/pubchem/neighbor/CID60823_CID68019409_2DTanimotoScore

Link Type | Example RDF Triple
relation type | nbr:CID60823_CID10030610_2DSimilarity rdf:type vocab:PC2D_structural_similarity .
compound | nbr:CID60823_CID10030610_2DSimilarity sio:refers-to compound:CID60823 , compound:CID10030610 .
supporting score | nbr:CID60823_CID10030610_2DSimilarity sio:has-measurement-value nbr:CID60823_CID10030610_2DTanimotoScore .
score type | nbr:CID60823_CID10030610_2DTanimotoScore rdf:type vocab:PC2D_Fingerprint_TanimotoScore .
score value | nbr:CID60823_CID10030610_2DTanimotoScore sio:has-value "0.98"^^xsd:double .

The chemical 3-D similarity neighbor RDF triples expose the similarity relation type, compounds involved in the neighboring relation, the value and type of the shape and feature similarity scores (ST and CT, respectively), as well as the linkage between the neighboring relation and the evaluating score. For example, to resolve the URI for the neighboring relationship between the PubChem Compound records CID 60823 and CID 11330946:

http://rdf.ncbi.nlm.nih.gov/pubchem/neighbor/CID60823_CID11330946_3DSimilarity

http://rdf.ncbi.nlm.nih.gov/pubchem/neighbor/CID60823_CID11330946_3DFeatureTanimotoScore

Link Type | Example RDF Triple
relation type | nbr:CID60823_CID11330946_3DSimilarity rdf:type vocab:PC3D_structural_similarity .
compound | nbr:CID60823_CID11330946_3DSimilarity sio:refers-to compound:CID60823 , compound:CID11330946 .
supporting score | nbr:CID60823_CID11330946_3DSimilarity sio:has-measurement-value nbr:CID60823_CID11330946_3DFeatureTanimotoScore , nbr:CID60823_CID11330946_3DShapeTanimotoScore .
score type | nbr:CID60823_CID11330946_3DFeatureTanimotoScore rdf:type vocab:PC3D_Feature_TanimotoScore .
score type | nbr:CID60823_CID11330946_3DShapeTanimotoScore rdf:type vocab:PC3D_Shape_TanimotoScore .
value | nbr:CID60823_CID11330946_3DFeatureTanimotoScore sio:has-value "0.59"^^xsd:double .
value | nbr:CID60823_CID11330946_3DShapeTanimotoScore sio:has-value "0.88"^^xsd:double .

The protein sequence similarity neighbor RDF triples expose the type and proteins involved in the neighboring relation, the value and type of the protein sequence similarity scores (E-value and identity), as well as the linkage between the neighboring relation and the evaluating score. For example, to resolve the URI for the neighboring relationship between the NCBI Protein records GI 548481 and GI 129900:

http://rdf.ncbi.nlm.nih.gov/pubchem/neighbor/GI548481_GI129900_SequenceSimilarity

http://rdf.ncbi.nlm.nih.gov/pubchem/neighbor/GI548481_GI129900_Evalue

Link Type | Example RDF Triple
relation type | nbr:GI548481_GI129900_SequenceSimilarity rdf:type vocab:ProteinSequenceSimilarityRelation .
protein | nbr:GI548481_GI129900_SequenceSimilarity sio:refers-to protein:GI548481 , protein:GI129900 .
supporting score | nbr:GI548481_GI129900_SequenceSimilarity sio:has-measurement-value nbr:GI548481_GI129900_Evalue , nbr:GI548481_GI129900_SimilarityIdentity .
score type | nbr:GI548481_GI129900_Evalue rdf:type vocab:BLAST-Evalue .
score type | nbr:GI548481_GI129900_SimilarityIdentity rdf:type vocab:ProteinSequenceSimilarityIdentity .
similarity value | nbr:GI548481_GI129900_Evalue sio:has-value "0"^^xsd:double .
similarity value | nbr:GI548481_GI129900_SimilarityIdentity sio:has-value "0.864"^^xsd:double .

4.14. PubChem Source

PubChem data source RDF triples expose the type, title, contributor, homepage, and substance categorization classification for a given data source. For example, to resolve the URI for the PubChem data source 'ChEMBL':

http://rdf.ncbi.nlm.nih.gov/pubchem/source/ChEMBL

Link Type

Example RDF Triple

type

source:ChEMBL rdf:type dcterms:Dataset .

title

source:ChEMBL dcterms:title "ChEMBL"@en .

homepage

Source:ChEMBL foaf:homepage <http://www.ebi.ac.uk/chembldb/> .

contributor

source:ChEMBL dcterms:contributor [ rdfs:label "European Bioinformatics Institute, EBI, ChEMBL"@en ; rdf:type foaf:Organization ] .

substance categorization classification

source:ChEMBL dcterms:subject vocab:Biological_Property .

4.15. PubChem Reference

PubChem reference RDF triples expose the type, publication date, citation, title, and webpage for a given PMID. For example, to resolve the URI for the NCBI PubMed record PMID 10395478:

http://rdf.ncbi.nlm.nih.gov/pubchem/reference/PMID10395478

Link Type

Example RDF Triple

type

reference:PMID10395478 rdf:type fabio:JournalArticle .

citation

reference:PMID10395478 dcterms:bibliographicCitation "B D Palmer, A J Kraker, B G Hartl, A D Panopoulos, R L Panek, B L Batley, G H Lu, S Trumpp-Kallmeyer, H D Hollis Showalter, W A Denny, Journal of medicinal chemistry, 1999 Jul;42(13):2373-82" .

title

reference:PMID10395478 dcterms:title "Structure-activity relationships for 5-substituted 1-phenylbenzimidazoles as selective inhibitors of the platelet-derived growth factor receptor."@en .

publication date

reference:PMID10395478 dcterms:date "1999-07-01"^^xsd:date .

homepage

reference:PMID10395478 foaf:homepage <http://www.ncbi.nlm.nih.gov/pubmed/10395478> .

5. REST-FUL INTERFACE

All of the aforementioned URIs can be resolved through a REST interface, which has some additional functionality beyond resolving URIs. The RDF triples can be presented according to different MIME types (see Table 3).

Table 3. The MIME types allowed and used in the PubChemRDF REST interface.

MIME Type                     HTTP Accept Header or URI extension

application/rdf+xml+abbrev    default, application/rdf+xml+abbrev, rdfxml-abbrev
application/rdf+xml           application/rdf+xml, text/rdf, rdfxml, rdf, xml
application/xhtml+xml         application/xhtml+xml, text/html, html
application/x-turtle (a)      application/rdf+n3, application/turtle, application/x-turtle, turtle, ttl
application/json (b)          application/json, text/json, json
text/plain                    text/plain, ntriples
text/rdf+n3                   text/n3, text/rdf+n3, Accept: n3

(a) Turtle is an abbreviation for Terse RDF Triples Language; (b) JSON is short for JavaScript Object Notation.

Different presentations can be requested by specifying the HTTP Accept header (see Table 3). For instance, the following Linux cURL command retrieves the RDF triples for CID2244 and writes them to a file:

curl -v -L -H "Accept: text/rdf" -o CID2244.rdf http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244

If no HTTP Accept header is specified, the default output format is application/rdf+xml+abbrev. The abbreviated RDF/XML format is compact and easier for a human to read.

If web browsers are used to retrieve RDF triples, HTML format is typically the default. For instance, what Google Chrome sends in the accept header would be something like:

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

which means that it will take HTML or XML but prefers HTML: generic XML is given a "q" (quality) value of 0.9, whereas media types listed without an explicit "q" value default to 1.0. So through Google Chrome, the HTML output format is preferred.
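To make the role of the "q" values concrete, the following Java sketch ranks the media types of an Accept header by weight. It is a simplified illustration and not the logic used by the PubChemRDF server: it only reads the "q" parameter and ignores other aspects of content negotiation, such as the specificity of wildcards. The class and method names are hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class AcceptHeaderSketch {
    // Returns the media ranges of an Accept header value, most preferred first.
    // A media range without an explicit "q" parameter gets the default weight 1.0.
    static List<String> rankMediaTypes(String acceptValue) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>();
        for (String part : acceptValue.split(",")) {
            String[] pieces = part.trim().split(";");
            double q = 1.0;
            for (int i = 1; i < pieces.length; i++) {
                String param = pieces[i].trim();
                if (param.startsWith("q=")) {
                    q = Double.parseDouble(param.substring(2));
                }
            }
            entries.add(new AbstractMap.SimpleEntry<String, Double>(pieces[0].trim(), q));
        }
        // List.sort is stable, so equally weighted types keep their header order.
        entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        List<String> ranked = new ArrayList<>();
        for (Map.Entry<String, Double> e : entries) {
            ranked.add(e.getKey());
        }
        return ranked;
    }

    public static void main(String[] args) {
        String chrome = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
        System.out.println(rankMediaTypes(chrome));
    }
}
```

For the Chrome header shown above, "text/html" and "application/xhtml+xml" rank ahead of "application/xml" and "*/*", which matches the observation that a browser request yields HTML.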

In order to force the desired MIME type using web browsers, the URI extension preceded by a dot ('.') can be used (see Table 3). For instance, the following URLs can present the RDF triples with respect to CID2244 (Aspirin) in various RDF data formats:

http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244.rdf

http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244.html

http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244.turtle

http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244.json

http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244.ntriples
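The extension-based URLs above can be generated programmatically. The Java sketch below builds such a URL for a compound and records which MIME type each extension is expected to select. The mapping is taken from Table 3 as read here and covers only the five extensions used in the examples above; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FormatUrlSketch {
    static final String COMPOUND_BASE = "http://rdf.ncbi.nlm.nih.gov/pubchem/compound/";

    // URI extension -> MIME type, following Table 3 (only the extensions used above).
    static final Map<String, String> MIME_BY_EXTENSION = new LinkedHashMap<>();
    static {
        MIME_BY_EXTENSION.put("rdf", "application/rdf+xml");
        MIME_BY_EXTENSION.put("html", "application/xhtml+xml");
        MIME_BY_EXTENSION.put("turtle", "application/x-turtle");
        MIME_BY_EXTENSION.put("json", "application/json");
        MIME_BY_EXTENSION.put("ntriples", "text/plain");
    }

    // Builds e.g. http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244.turtle
    static String compoundUrl(int cid, String extension) {
        if (!MIME_BY_EXTENSION.containsKey(extension)) {
            throw new IllegalArgumentException("Unsupported extension: " + extension);
        }
        return COMPOUND_BASE + "CID" + cid + "." + extension;
    }

    public static void main(String[] args) {
        for (String ext : MIME_BY_EXTENSION.keySet()) {
            System.out.println(compoundUrl(2244, ext) + "  (" + MIME_BY_EXTENSION.get(ext) + ")");
        }
    }
}
```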

The resolution of URIs under the "http://rdf.ncbi.nlm.nih.gov/pubchem/" domain returns a 303 redirect HTTP status code, and the request is redirected to the "http://pubchem.ncbi.nlm.nih.gov/rest/rdf/" domain. The RESTful interface under the latter domain can also be used to query RDF triples based on the input string values of Compound InChI and Substance synonym. The query calls share the same base URL:

http://pubchem.ncbi.nlm.nih.gov/rest/rdf/query?

The input queries are provided as key-value pairs, with the following keys: "synonym", "inchi", and "format". The string values of "synonym" and "inchi" specify the input queries, and the string value of "format" specifies one of the aforementioned MIME types as the output format. In the "synonym" query, the REST API first converts the string value to the corresponding MD5 hash, then outputs the RDF triples for the URI based on that MD5 hash value. In the "inchi" query, the REST API first converts the string value to the corresponding InChIKey, then outputs the RDF triples for the URI based on that InChIKey value. If the conversion fails, the REST API returns a 405 HTTP status. For instance, the following query strings can be used to retrieve PubChem RDF triples:

http://pubchem.ncbi.nlm.nih.gov/rest/rdf/query?synonym=Atorvastatin&format=html

http://pubchem.ncbi.nlm.nih.gov/rest/rdf/query?synonym=5'-cytidylic%20acid&format=html

http://pubchem.ncbi.nlm.nih.gov/rest/rdf/query?InChI=1S/C33H35FN2O5/c1-21(2)31-30(33(41)35-25-11-7-4-8-12-25)29(22-9-5-3-6-10-22)32(23-13-15-24(34)16-14-23)36(31)18-17-26(37)19-27(38)20-28(39)40/h3-16,21,26-27,37-38H,17-20H2,1-2H3,(H,35,41)(H,39,40)/t26-,27-/m1/s1&format=html
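The synonym queries above can be assembled in code. The Java sketch below percent-encodes a synonym into a query URL and also computes an MD5 hex digest of a string, to illustrate the kind of hashing the REST API performs on the server side. Two assumptions are made here: the digest is taken over the UTF-8 bytes of the string exactly as given (this document does not say whether PubChem normalizes the synonym first), and the encoder output is stricter than the hand-written example above (the apostrophe becomes %27). The class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class RdfQuerySketch {
    static final String QUERY_BASE = "http://pubchem.ncbi.nlm.nih.gov/rest/rdf/query?";

    // Builds a query URL such as ...query?synonym=Atorvastatin&format=html
    static String queryUrl(String key, String value, String format) {
        try {
            // URLEncoder writes spaces as '+'; use %20 as in the examples above.
            // A literal '+' in the input is already encoded as %2B at this point.
            String encoded = URLEncoder.encode(value, "UTF-8").replace("+", "%20");
            return QUERY_BASE + key + "=" + encoded + "&format=" + format;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // MD5 hex digest of the UTF-8 bytes of the text (assumed: no normalization).
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is a required JDK algorithm
        }
    }

    public static void main(String[] args) {
        System.out.println(queryUrl("synonym", "5'-cytidylic acid", "html"));
        System.out.println(md5Hex("Atorvastatin"));
    }
}
```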

If the operation after redirection on the RESTful interface is successful, the RDF triples are returned along with a 200 HTTP status code. If the server encounters an error, it returns an HTTP status code other than 200 in the response header. Codes in the 400 range indicate errors on the request side (invalid input of some form), and codes in the 500 range indicate errors on the PubChem side (timeout or other issue). The response content includes a descriptive message indicating the potential cause of the error. The HTTP status codes and corresponding descriptions are as follows:

HTTP Status    Error           Description

400            eBadRequest     Bad Query URL or Request URI
404            eNotFound       Input URI is invalid or cannot be identified in databases
405            eNotAllowed     MIME output format is unspecified or invalid
500            eServerError    Some problem on the server side occurs
504            eTimeout        The request timed out (over 28 seconds)
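A client can use the status codes listed above to decide how to react to a failed request. The Java sketch below maps each listed code to its error name and separates request-side from server-side failures. The retry rule (retry only on codes in the 500 range) is an assumption made for illustration, not a documented PubChem policy, and the class and method names are hypothetical.

```java
class RestStatusSketch {
    // Error names as listed in the status table above.
    static String errorName(int status) {
        switch (status) {
            case 400: return "eBadRequest";
            case 404: return "eNotFound";
            case 405: return "eNotAllowed";
            case 500: return "eServerError";
            case 504: return "eTimeout";
            default:  return "unlisted";
        }
    }

    // 400 range: the request itself is at fault, so repeating it unchanged will not help.
    static boolean isRequestError(int status) {
        return status >= 400 && status < 500;
    }

    // 500 range: the problem is on the PubChem side (timeout or other issue).
    static boolean isServerError(int status) {
        return status >= 500 && status < 600;
    }

    // Assumption: only server-side failures are worth retrying.
    static boolean worthRetrying(int status) {
        return isServerError(status);
    }

    public static void main(String[] args) {
        int[] codes = { 400, 404, 405, 500, 504 };
        for (int code : codes) {
            System.out.println(code + " " + errorName(code) + " retry=" + worthRetrying(code));
        }
    }
}
```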

Please note that the HTTPS protocol works seamlessly in place of the HTTP protocol for all URIs in the PubChemRDF RESTful interface. If you request data using 'https', the URIs returned will be 'https'. This is a feature, but it may cause issues for software packages that depend on a URI uniquely identifying an entity, down to the protocol used to request it. Generally speaking, all PubChem web-based resources are configured to work seamlessly with or without HTTP encryption (via the 'https' or 'http' protocol, respectively). By default, and on the PubChemRDF FTP site, all URIs are specified using the HTTP protocol to the NCBI RDF website (http://rdf.ncbi.nlm.nih.gov).
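For software that compares URIs as plain strings, one possible workaround is to normalize the scheme before comparison, so that 'https' and 'http' forms of the same URI are treated as one entity. The Java sketch below rewrites 'https' URIs to the 'http' form used on the FTP site. It is a minimal illustration with a hypothetical class and method name, and it does not validate the rest of the URI.

```java
class UriSchemeSketch {
    // Rewrites an https:// URI to the http:// form; other URIs are returned unchanged.
    static String toCanonicalHttp(String uri) {
        String httpsPrefix = "https://";
        if (uri.regionMatches(true, 0, httpsPrefix, 0, httpsPrefix.length())) {
            return "http://" + uri.substring(httpsPrefix.length());
        }
        return uri;
    }

    public static void main(String[] args) {
        System.out.println(toCanonicalHttp("https://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244"));
    }
}
```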

6. RDF FTP Download Directory Layout

The PubChemRDF data can be found on the PubChem FTP site for bulk download:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/

Data is updated in its entirety approximately once per week [i.e., no incremental update is provided at this time]. A Vocabulary of Interlinked Datasets (VoID) description file (void.ttl) is provided at the root directory of the PubChemRDF FTP site. This file provides general metadata about each release of PubChemRDF, such as provenance information, statistics (e.g., triple counts), dataset release date, and files.

The fundamental layout of the PubChemRDF FTP site is such that it is partitioned into subsets corresponding to different PubChemRDF subdomains. This allows desired subdomains to be downloaded. The top level FTP directories correspond to the subdomains: compound, substance, descriptor, inchikey, synonym, bioassay, measuregroup, endpoint, protein, conserveddomain, gene, biosystem, source, and reference.

All RDF data files are in turtle format and gzip compressed, as indicated by the suffix ".ttl.gz". Data from one RDF subdomain may refer to other subdomains. Figure 1 helps to depict these interdependencies by means of arrows indicating out-going references to other subdomains. Each file is of the type "prefix_xxx_yyy.ttl.gz". The "prefix" indicates the file content type, the ".ttl.gz" suffix indicates the file is in turtle RDF format and gzip compressed, while "xxx" and "yyy" indicate a range of values, where what the value means is context dependent. For example, "compound_1_100.ttl.gz" indicates the "compound" subdomain and "1_100" indicates data for CID1 through CID100.
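The "prefix_xxx_yyy.ttl.gz" naming scheme lends itself to a simple parser, for example to find which file covers a given identifier. The Java sketch below splits a file name into its prefix and numeric range. It assumes that the range bounds are plain decimal numbers (possibly zero-padded) and that both bounds are inclusive, as in the "CID1 through CID100" example; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RdfFileNameSketch {
    // prefix, range start, range end, followed by the ".ttl.gz" suffix
    private static final Pattern NAME = Pattern.compile("^(.+)_(\\d+)_(\\d+)\\.ttl\\.gz$");

    // Returns {prefix, start, end}, or null if the name does not follow the scheme.
    static String[] parse(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    // True if the file's range includes the identifier (both bounds inclusive).
    static boolean coversId(String fileName, long id) {
        String[] parts = parse(fileName);
        if (parts == null) {
            return false;
        }
        return Long.parseLong(parts[1]) <= id && id <= Long.parseLong(parts[2]);
    }

    public static void main(String[] args) {
        String[] parts = parse("compound_1_100.ttl.gz");
        System.out.println(parts[0] + " covers " + parts[1] + " through " + parts[2]);
        System.out.println(coversId("compound_1_100.ttl.gz", 42));
    }
}
```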

6.1. PubChem Compound

Data for the PubChem "compound" RDF subdomain directory can be found here:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/compound/

There are three subdirectories: "general", "nbr2d", and "nbr3d". Each directory may have a 'README' file with more current information or additional information.

6.1.1. PubChem Compound "general"

Information contained here includes links to chemical descriptors and non-similarity based compound interrelationships, including parent, components, and compound identity group (CIG). CIGs relate chemicals by varying degrees of identity. For example, one group contains chemicals with identical connectivity (same atoms and bonds) whose stereoisomer and isotopic information may vary.

6.1.2. PubChem Compound "nbr2d"

Information contained here includes links between compounds according to the PubChem 2-D "Similar Compounds" neighboring relationship.

6.1.3. PubChem Compound "nbr3d"

Information contained here includes links between compounds according to the PubChem 3-D "Similar Conformers" neighboring relationship.

6.2. PubChem Substance

Data for the "substance" RDF subdomain directory can be found here:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/substance/

There are two subdirectories: "general" and "assay".

6.2.1. PubChem Substance "general"

Information contained here includes links to depositor-provided descriptors including synonyms, and a link to the standardized compound record for a given substance record.

6.2.2. PubChem Substance "assay"

Information contained here includes links from the substance to the participating measure groups.

6.3. PubChem Descriptor

Data for the "descriptor" RDF subdomain directory can be found here:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/descriptor/

There are two subdirectories: "compound" and "substance".

6.3.1. PubChem Descriptor "compound"

Information contained here includes the type, value and unit (when applicable) of chemical descriptors, as well as the link from the chemical descriptor to the corresponding compound.

6.3.2. PubChem Descriptor "substance"

Information contained here includes the type and value for a given descriptor (including depositor-provided synonyms), as well as the link from the descriptors and synonyms to the corresponding substances.

6.4. PubChem InChIKey

Data for the PubChem "inchikey" RDF subdomain directory can be found here:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/inchikey/

There are two subdirectories: "general" and "compound".

6.4.1. PubChem InChIKey "general"

Information contained here includes the type and the InChI of a given InChIKey. [Please note that (to date) there is a one-to-one association between InChI and InChIKey in PubChem. This is not guaranteed to always be the case due to the hash nature of the InChIKey. (E.g., InChIKey hash collisions can exist when programmatically generating various possible chemical structure variations.)]

6.4.2. PubChem InChIKey "compound"

Information contained here includes the linkages to the corresponding CID record(s).

6.5. PubChem Synonym

Data for the PubChem "synonym" RDF subdomain directory can be found here:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/synonym/

There are three subdirectories: "general", "substance", and "compound".

6.5.1. PubChem Synonym "general"

Information contained here includes the type and the corresponding synonym of a given MD5 hash string. The MD5 hash is used to provide a stable identifier for a given synonym.

6.5.2. PubChem Synonym "substance"

Information contained here includes the linkages to corresponding SID(s).

6.5.3. PubChem Synonym "compound"

Information contained here includes linkages to corresponding CID(s). The mappings of synonyms to CIDs may be a subset of those possible from corresponding SID(s). PubChem performs processing on aggregated chemical information between PubChem contributors. This consistency filtering helps to eliminate promiscuous synonyms that correspond to multiple chemical structures (perhaps erroneously).

6.6. PubChem Bioassay

Data for the "bioassay" RDF subdomain directory can be found here:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/bioassay/

Information contained here includes the descriptive information for a given AID that is represented as an instance of BAO_0000015, including the type, title, contributor, and depositor provided URL. It also includes links to bioassay neighbors and any corresponding measure groups.

6.7. PubChem MeasureGroup

Data for the "measuregroup" RDF subdomain directory can be found here:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/measuregroup/

Some files contain the descriptive information for a given AID that cannot be represented as an instance of BAO_0000015, as well as the links to the corresponding bioassays. The remaining files contain the links from each measure group to the participating substances and protein targets, as well as the links from each measure group to the corresponding endpoints.

6.8. PubChem Endpoint

Data for the "endpoint" RDF subdomain directory can be found here:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/endpoint/

There are several files containing the type, value, unit, and label of a given endpoint, as well as links to the corresponding measure group, substance and reference.

6.9. PubChem Protein

Data for the "protein" RDF subdomain directory can be found here:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/protein/

The file 'pc_protein.ttl.gz' contains descriptive information for a given protein GI identifier, including the type, title, alternative names, conserved domain associations, encoding gene information, and involved biosystems. The files 'pc_protein2measuregroup_xxx_yyy.ttl.gz' contain the links from proteins to the measure groups. The file 'pc_protein_nbr.ttl.gz' contains the neighboring relationships between proteins.

6.10. PubChem ConservedDomain

Data for the "conserveddomain" RDF subdomain directory can be found here:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/conserveddomain/

The file 'pc_conserveddomain.ttl.gz' contains the type of a given PSSMID and the links to the proteins that have been tested in measure groups.

6.11. PubChem Gene

Data for the "gene" RDF subdomain directory can be found here:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/gene/

The file 'pc_gene.ttl.gz' contains the type and symbol of a given GID, as well as the links to the proteins that have been tested in measure groups.

6.12. PubChem Biosystem

Data for the 'biosystem' RDF subdomain directory can be found here:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/biosystem/

The RDF statements contained in the file 'pc_biosystem.ttl.gz' provide the basic descriptions for a given BSID, including the type, title, homepage, and the related proteins (GIs) that have been tested in measure groups.

6.13. PubChem Source

Data for the "source" RDF subdomain directory can be found here:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/source/

The file 'pc_source.ttl.gz' contains descriptive information for a given PubChem contributor, including the data source identifier, display name, alternative names, organization, and homepage, as well as any classification of the given source, such as the Substance Categorization Classification information.

6.14. PubChem Reference

Data for the "reference" RDF subdomain directory can be found here:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/reference/

The file 'pc_reference.ttl.gz' contains the type, publication date, citation, title, and homepage linkage for a given PMID. Each PMID described here has one or more links from bioassay endpoints; the links from bioassay endpoints to PMIDs are exposed in the PubChem Endpoint subdomain.

7. PubChemRDF Use Cases

This section gives some examples of how PubChemRDF can be used with available Semantic Web frameworks. [Please note that these use cases assume some familiarity and proficiency with these tools.] The three most popular Semantic Web frameworks that provide collections of API functions to process RDF data are Apache Jena, OpenRDF Sesame, and the Redland RDF libraries. Jena and Sesame are publicly available Java frameworks, and Redland comprises a set of open source C libraries. All of them can be readily used to read, write, parse, serialize, and interpret RDF statements, and all of them provide both in-memory and persistent storage, as well as SPARQL querying mechanisms. A list of persistent RDF triple stores can be found here: http://en.wikipedia.org/wiki/Triplestore

Recent technology development has changed the landscape, in particular, for very large triple stores (such as FRANZ AllegroGraph, OpenLink Virtuoso, Ontotext OWLIM, Garlik 4store, and SYSTAP Bigdata) that can handle fast loading and querying of billions of triples. AllegroGraph is compatible with Jena framework. Virtuoso provides fully operational data access and management through interface implementations of Jena, Sesame, and Redland frameworks. Bigdata supports Sesame API functions. OWLIM can deliver extensible and configurable performance with Jena and Sesame frameworks. According to the most recent DB-Engines ranking, Jena and Virtuoso are among the most popular RDF triple stores. Hence, we will demonstrate how to load PubChemRDF data into Jena and Virtuoso triple stores.

In the Jena framework, each semantic web resource (e.g., a subject, predicate, or object in an RDF triple) is represented as a Java object. The Jena framework provides a collection of API functions to read, write, add, delete, and search the associated properties of a given Java object (representing the semantic web resource). Jena has specific modules, the TDB triple store and the ARQ query engine, to handle the persistent storage and SPARQL querying of RDF triples, respectively. TDB can handle multiple millions of triples on a single machine with sufficient performance. A TDB triple store can be accessed and managed using both command line scripts and the Jena API functions. In addition, a TDB triple store can be queried over the HTTP protocol using the Fuseki HTTP SPARQL Server. ARQ is a SPARQL query engine that supports the SPARQL 1.1 query language specification, and it has API functions to retrieve and analyze the query results. In order to work with the Jena API functions, you need to download the JAR files and make sure they are available on the active CLASSPATH.

In Jena, a data structure called "model" contains all of the RDF statements, and an in-memory model can be created:

Model model = ModelFactory.createDefaultModel();

The model can read RDF statements as an input stream either through a file:

String filename = "pc_compound_00000001_00100000.ttl";
InputStream in = FileManager.get().open(filename);
if (in != null) {
	model.read(in, "TURTLE");
}

or a URI resolved through the RESTful interface:

try {
	// created inside the try block because the URL constructor can throw MalformedURLException
	URL urlobj = new URL( "http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244" );
	HttpURLConnection urlConnection = (HttpURLConnection) urlobj.openConnection();
	urlConnection.setInstanceFollowRedirects(true);	// follow the 303 redirect
	urlConnection.setRequestProperty("Accept","application/x-turtle");	// request Turtle (see Table 3)
	if (urlConnection.getResponseCode() == HttpURLConnection.HTTP_OK) {
		InputStream in = new BufferedInputStream(urlConnection.getInputStream());
		model.read( in, "http://rdf.ncbi.nlm.nih.gov/pubchem/", "TURTLE" );
	}
} catch (Exception e) {
	e.printStackTrace();
}

The in-memory storage usually can only handle hundreds or thousands of RDF triples, depending on the size of memory. In contrast, the persistent storage can deal with a much larger number (millions) of RDF triples. The persistent storage can be achieved through a TDB backed model:

String directory = "pubchemrdf-dataset" ;
Dataset dataset = TDBFactory.createDataset( directory ) ;
dataset.begin(ReadWrite.READ) ;	// use ReadWrite.WRITE, followed by dataset.commit(), when the model is modified
Model model = dataset.getDefaultModel() ;
…	// model reading and writing functions
dataset.end() ;

TDB provides a wrapper function, called the TDB loader, that performs the same loading step:

String directory = "pubchem-dataset";
String filename = "pc_compound_00000001_00100000.ttl";
TDBLoader.load( TDBInternal.getBaseDatasetGraphTDB( TDBFactory.createDatasetGraph( directory ) ), filename, true);
Dataset dataset = TDBFactory.createDataset( directory );

Once the RDF statements are loaded into the TDB dataset, SPARQL queries can be carried out:

String queryString = "SELECT ?s ?p ?o WHERE { ?s ?p ?o }";
Query query = QueryFactory.create(queryString);
QueryExecution qexec = QueryExecutionFactory.create(query, dataset);
try 
{
    ResultSet results = qexec.execSelect();
    ResultSetFormatter.out( results, query ) ;    
} finally {
    qexec.close();
}

Virtuoso is a multi-model data server with a hybrid architecture that can be installed on a single server or a cluster of servers for RDF data management, access, and integration. An RDF graph is the basic storage unit in a Virtuoso server. A single server can store and access several RDF graphs with up to billions of triples. Beyond two billion triples, Virtuoso recommends operating on a cluster of servers with proper hash partitioning. The Virtuoso universal server provides several data access providers (drivers), including the Virtuoso Jena provider, the Virtuoso Sesame provider, and the Virtuoso Redland provider. The Virtuoso Jena provider takes advantage of the Jena API functions to read and write RDF triples from files, databases, and URL-based interfaces into a Virtuoso RDF quad store. In order to work with the Virtuoso Jena provider API functions, additional JAR files should be downloaded and made available on the active CLASSPATH.

A Virtuoso JDBC driver is required to establish a connection to the server:

String url = "jdbc:virtuoso://:";
VirtDataSource conn = new VirtDataSource( url, "user", "password");

A JDBC connection backed graph can be created:

VirtGraph graph = new VirtGraph ( "graphname", conn );

The graph can directly read the RDF triples obtained by dereferencing a URI:

graph.read("http://pubchem.ncbi.nlm.nih.gov/rest/rdf/compound/CID2244.rdf", "RDF/XML");

The graph can also read RDF triples from an input stream or a file:

VirtuosoUpdateRequest req = VirtuosoUpdateFactory.read( inputstream, graph );
req.exec();

or

VirtuosoUpdateRequest req = VirtuosoUpdateFactory.read( "filename", graph );
req.exec();

Once the RDF triples are loaded into the graph, a SPARQL query can be run on the graph:

Query query = QueryFactory.create("SELECT * WHERE { GRAPH ?graph { ?s ?p ?o } } limit 100");
VirtuosoQueryExecution qexec = VirtuosoQueryExecutionFactory.create ( query, graph);

The results can be retrieved as follows:

ResultSet results = qexec.execSelect();	// qexec is the query execution created above
while (results.hasNext()) {
    QuerySolution result = results.nextSolution();
    RDFNode graphName = result.get("graph");	// named differently from the VirtGraph variable "graph"
    RDFNode s = result.get("s");
    RDFNode p = result.get("p");
    RDFNode o = result.get("o");
    System.out.println(graphName + " { " + s + " " + p + " " + o + " . }");
}

The SPARQL query can also be used to insert or load RDF triples into the graph:

Query loadquery = QueryFactory.create( "sparql load <ResourceURI> into graph <GraphURI>" );

Besides API functions, Virtuoso has built-in bulk load functions to load RDF triples from one or more files into one graph dataset at one time. Either the Virtuoso "dash board" graphical user interface (GUI) or the "isql" command line can be used in this case. The GUI can be accessed through:

http://<server-name>:<port-number>

Once the files of PubChemRDF data are downloaded and placed on the local server, two "isql" commands can be used:

    SQL> ld_dir ('<path-to-directory>', '*.ttl.gz', '<GraphURI>');
    SQL> rdf_loader_run();

The first command registers the files to be loaded, and the second one executes the bulk loading process.

PubChemRDF can also be used in a graph database. According to the DB-Engines ranking, Neo4j is among the most popular graph databases. Michael's blog has explained how to load RDF triples into Neo4j using the Sesame parser. Based on his blog, we will demonstrate how to load PubChemRDF into Neo4j using Jena API functions. We follow the same logic when loading RDF triples into the graph database: the RDF predicate "rdf:type" is interpreted as a unique node label, and literal strings are translated into node properties instead of independent nodes in the Neo4j graph. In addition to the Jena libraries, you need to download the Neo4j JAR files, and probably the Apache Commons IO JAR files. All of the Java libraries should be available on the active CLASSPATH.

First, we need to create an instance of the graph database service:

String dbpath = "target/neo4j_rest" ;
Map<String, String> config = new HashMap<String, String>();
config.put( "neostore.nodestore.db.mapped_memory", "1024M" );
GraphDatabaseService graphDb = new GraphDatabaseFactory()
    .newEmbeddedDatabaseBuilder( dbpath )
    .setConfig( config )
    .newGraphDatabase();

Then we can create a node factory, which can create nodes and set the URI string values as the distinguishable properties for the nodes:

UniqueFactory<Node> factory = new UniqueFactory.UniqueNodeFactory( graphDb, "URIs" )
{
    protected void initialize( Node created, Map<String, Object> properties ) {
        created.setProperty( "URI", properties.get( "URI" ) );
    }
};

The node factory can check whether a node already exists in a given graph based on the property value of the unique URI string value provided:

Node subjectNode = factory.getOrCreate( "URI", "uri_string_value" );

Then we can read the RDF triples for a URI reference into a graph database:

// URI of the subject resource (assumed to match the URIs used in the returned triples)
String curURL = "http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244";
URL urlobj = new URL( "http://pubchem.ncbi.nlm.nih.gov/rest/rdf/compound/CID2244.ttl" );
HttpURLConnection urlConnection = (HttpURLConnection) urlobj.openConnection();
urlConnection.setInstanceFollowRedirects(true);
if (urlConnection.getResponseCode() == HttpURLConnection.HTTP_OK) {
    InputStream in = new BufferedInputStream(urlConnection.getInputStream());
    Model tripleModel = ModelFactory.createDefaultModel();	// Jena RDF model
    tripleModel.read( in, "http://rdf.ncbi.nlm.nih.gov/pubchem/", "TURTLE" );
    StmtIterator stmt = tripleModel.getResource(curURL).listProperties();
    // subjectNode is the node returned by factory.getOrCreate( "URI", curURL ).
    // Note: in Neo4j 2.x, graph operations must run inside a transaction (graphDb.beginTx()).
    while ( stmt.hasNext() ) {
        Statement st = stmt.nextStatement();
        Property predicate = st.getPredicate();
        RDFNode object = st.getObject();
        // add a label to the subjectNode if the predicate specifies its type
        if ( predicate.getURI().equals("http://www.w3.org/1999/02/22-rdf-syntax-ns#type") ) {
            Label label = DynamicLabel.label(((Resource) object).getLocalName());
            if ( !subjectNode.hasLabel(label) ) {
                subjectNode.addLabel(label);
            }
        }
        // add a property to the subjectNode if the object of the triple is a literal
        else if (object.isLiteral()) {
            String datatype = object.asLiteral().getDatatypeURI();
            Object value;
            if (datatype == null) {	// treat as String
                value = object.asLiteral().getValue();
            } else if (datatype.toLowerCase().contains("int")) {
                value = object.asLiteral().getInt();
            } else if (datatype.toLowerCase().contains("float")) {
                value = object.asLiteral().getFloat();
            } else if (datatype.toLowerCase().contains("double")) {
                value = object.asLiteral().getDouble();
            } else if (datatype.toLowerCase().contains("date")) {
                // Jena's Literal has no getDate(); keep the lexical form, e.g. "1999-07-01"
                value = object.asLiteral().getLexicalForm();
            } else {
                value = object.asLiteral().getString();
            }
            subjectNode.setProperty( predicate.getLocalName(), value );
        }
        // otherwise the object is a resource: create (or reuse) its node and link to it
        else {
            Node objectNode = factory.getOrCreate("URI", object.asResource().getURI());
            // the predicate's local name is used as the relationship type
            RelationshipType relType = DynamicRelationshipType.withName( predicate.getLocalName() );
            subjectNode.createRelationshipTo(objectNode, relType);
        }
    }
}

Once the PubChemRDF data is loaded into Neo4j, one can perform graph traversal and other graph queries using the Cypher query language.
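The mapping rules used in this use case (an "rdf:type" triple becomes a node label, a literal object becomes a node property, and any other object becomes a related node) can be tried out without installing Neo4j or Jena. The Java sketch below applies the three rules to triples given as plain strings, using only the standard library. It is a simplified stand-in for the real loader: the classes are hypothetical, relationships are recorded as strings, and the full predicate URIs are the usual expansions of the "fabio", "dcterms", and "foaf" prefixes, stated here as an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TripleToGraphSketch {
    static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

    // Minimal stand-in for a property-graph node.
    static class GraphNode {
        final String uri;
        final Set<String> labels = new LinkedHashSet<>();
        final Map<String, String> properties = new LinkedHashMap<>();
        GraphNode(String uri) { this.uri = uri; }
    }

    final Map<String, GraphNode> nodes = new LinkedHashMap<>();
    final List<String> relationships = new ArrayList<>(); // "subject -[type]-> object"

    // One node per URI, mirroring the unique node factory.
    GraphNode getOrCreate(String uri) {
        return nodes.computeIfAbsent(uri, GraphNode::new);
    }

    // Part of the URI after the last '#' or '/'.
    static String localName(String uri) {
        int cut = Math.max(uri.lastIndexOf('#'), uri.lastIndexOf('/'));
        return uri.substring(cut + 1);
    }

    void addTriple(String subject, String predicate, String object, boolean objectIsLiteral) {
        GraphNode s = getOrCreate(subject);
        if (RDF_TYPE.equals(predicate)) {
            s.labels.add(localName(object));                 // rule 1: type -> label
        } else if (objectIsLiteral) {
            s.properties.put(localName(predicate), object);  // rule 2: literal -> property
        } else {
            GraphNode o = getOrCreate(object);               // rule 3: resource -> relationship
            relationships.add(s.uri + " -[" + localName(predicate) + "]-> " + o.uri);
        }
    }

    public static void main(String[] args) {
        TripleToGraphSketch g = new TripleToGraphSketch();
        String ref = "http://rdf.ncbi.nlm.nih.gov/pubchem/reference/PMID10395478";
        g.addTriple(ref, RDF_TYPE, "http://purl.org/spar/fabio/JournalArticle", false);
        g.addTriple(ref, "http://purl.org/dc/terms/date", "1999-07-01", true);
        g.addTriple(ref, "http://xmlns.com/foaf/0.1/homepage",
                "http://www.ncbi.nlm.nih.gov/pubmed/10395478", false);
        System.out.println(g.nodes.get(ref).labels + " " + g.nodes.get(ref).properties);
        System.out.println(g.relationships);
    }
}
```

Keeping literals as properties rather than nodes keeps the graph small, at the cost of not being able to traverse from one literal value to all resources that share it.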

8. Document Version History

V1.0.b – 2014Jan13 – Initial beta release.

V1.1.b – 2014Mar05 – added sections 1.1 and 1.2 and added Virtuoso bulk load case to the section 7.