A PUG REST Tutorial


The purpose of this document is to explain how PubChem’s PUG REST service is structured, with a variety of usage cases as illustrations, to help new users learn how the service works and how to construct the URLs that are the interface to this service. PUG stands for Power User Gateway, which encompasses several variants of methods for programmatic access to PubChem data and services. This REST-style interface is intended to be a simple access route to PubChem for things like scripts, JavaScript embedded in web pages, and third-party applications, without the overhead of XML, SOAP envelopes, etc. that are required for other versions of PUG. PUG REST also provides convenient access to information on PubChem records that is not possible with any other service.

Some other documents that may be useful are:

· A more technical and complete PUG REST specification document, which is a little harder to read: https://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST.html

· The original purely XML-based PUG: https://pubchem.ncbi.nlm.nih.gov/pug/pughelp.html

· PUG SOAP, for applications that have built-in SOAP handlers, or programming languages with an API generated from a SOAP WSDL: https://pubchem.ncbi.nlm.nih.gov/pug_soap/pug_soap_help.html

· A general (and large) PubChem help page: https://pubchem.ncbi.nlm.nih.gov/help.html


PUG REST is a new service and is undergoing active development, so check this page for new features. For comments, help, or to suggest new functionality or topics for this tutorial, please contact pubchem-help@ncbi.nlm.nih.gov.


USAGE POLICY: Please note that PUG REST is not designed for very large volumes (millions) of requests. We ask that any script or application not make more than 5 requests per second, in order to avoid overloading the PubChem servers. If you have a large data set that you need to compute with, please contact us for help on optimizing your task, as there are likely more efficient ways to approach such bulk queries.


Table of Contents


How PUG REST Works
    Input: Design of the URL
    Output: What You Get Back
    Error Handling
Access to PubChem Substances and Compounds
    Input Methods
        By Identifier
        By Name
        By Structure Identity
        By Structure Search
        By Fast (Synchronous) Structure Search
        By Cross-Reference (XRef)
    Available Data
        Full Records
        Images
        Compound Properties
        Synonyms
        Cross-References (XRefs)
        And More…
Access to PubChem BioAssays
    From AID
        Assay Description
        Assay Data
        Targets
        Activity Name
    From SID and CID
Dealing with Lists of Identifiers
    Storing Lists on the Server
    Inter-conversion of Identifiers
How To Use HTTP POST
Conclusion



How PUG REST Works


The fundamental unit upon which PUG REST is built is the PubChem identifier, which comes in three flavors – SID for substances, CID for compounds, and AID for assays. The conceptual framework of this service, which uses these identifiers, is the three-part request: 1) input – that is, what identifiers are we talking about; 2) operation – what to do with those identifiers; and 3) output – what information should be returned. The beauty of this design is that each of these three parts of the request is (mostly) independent, allowing a combinatorial expansion of the things you can do in a single request. This means that, for example, any form of input that specifies some group of CIDs can be combined with any operation that deals with CIDs, and any output format that’s relevant to the chosen operation. So instead of a list of separate, narrowly defined service requests, you can combine these building blocks in many ways to create customized requests.

For example, this service supports input of chemical structure by SMILES. It supports output of chemical structure as images in PNG format. You can combine these two into a visualization request for a SMILES string – in this case, whether or not that particular chemical is even in the PubChem database at all! And it’s something you can almost type manually into a web browser:


Or, combine input by chemical name with InChI property retrieval, and you have a simple name-to-InChI service in a single request:


The possibilities are nearly endless, and more importantly, the action of the service is simple to understand from the URL alone, without needing any extra programming, XML parsing, etc. And at the same time, more complex data handling is available for programmers who want rigorous schema-based XML communications, or who want to use JSON data to embed functionality in a web page via JavaScript.
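As a minimal sketch of the two combinations just described, the following Python snippet assembles request URLs from their building blocks. The helper name `pug_url` and the SMILES string `CCCCBr` are illustrative choices rather than part of the service; the path segments follow the input/operation/output pattern this tutorial describes.

```python
from urllib.parse import quote

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pug_url(*parts):
    """Join path segments onto the PUG REST base address, escaping each one."""
    return BASE + "/" + "/".join(quote(str(p), safe=",") for p in parts)


# SMILES in, PNG image out (the SMILES here is an arbitrary example).
image_url = pug_url("compound", "smiles", "CCCCBr", "PNG")

# Chemical name in, InChI property out as plain text.
inchi_url = pug_url("compound", "name", "vioxx", "property", "InChI", "TXT")

print(image_url)
print(inchi_url)
```

Either printed URL can be pasted into a web browser to see the result.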


Input: Design of the URL


PUG REST is entirely based on HTTP (or HTTPS) requests, and most of the details of the request are encoded directly in the URL path – which is what makes the service RESTful (informally anyway, as it does not adhere strictly to REST principles, which are beyond the scope of this discussion). Continuing with the last example above, let’s examine the structure of the URL, which is divided into the three main parts of input, operation, and output, in an ordered sequence that will sometimes be referred to in this document as the URL path:










Taking each section individually, first we have the prolog – the HTTP address of the service itself, which is common to all PUG REST requests. The next part is the input, which in this case says “I want to look in the PubChem Compound database for records that match the name ‘vioxx’.” Note that there are some subtleties here, in that the name must already be present in the PubChem database, and that a name may refer to multiple CIDs. But the underlying principle is that we are specifying a set of CIDs based on a name; at the time of writing, there is only one CID with this name. The next section is the operation, in this case “I want to retrieve the InChI property for this CID.” And finally the output format specification, “I want to get back plain text.”
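To make the four sections concrete, here is a small sketch that splits the name-to-InChI request into prolog, input, operation, and output. The function `split_request` is hypothetical, and it assumes the simple layout used in this example (a three-segment input followed by the operation and the output format); other request shapes would need more care.

```python
PROLOG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def split_request(url):
    """Split a PUG REST URL into prolog, input, operation, and output.

    Assumes <prolog>/<domain>/<namespace>/<identifiers>/<operation...>/<output>.
    """
    if not url.startswith(PROLOG + "/"):
        raise ValueError("not a PUG REST URL")
    # Drop the prolog and any optional '?' parameters.
    path = url[len(PROLOG) + 1:].split("?", 1)[0]
    segments = path.split("/")
    if len(segments) < 4:
        raise ValueError("expected a three-segment input plus an output format")
    return {
        "prolog": PROLOG,
        "input": "/".join(segments[:3]),
        "operation": "/".join(segments[3:-1]),
        "output": segments[-1],
    }


parts = split_request(PROLOG + "/compound/name/vioxx/property/InChI/TXT")
for name, value in parts.items():
    print(f"{name:>9}: {value}")
```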

Some requests may use optional parameters, things after the ‘?’ at the end of the URL path. PUG REST is also able to handle by HTTP POST some types of input that cannot be put into a URL path, such as InChI strings or certain types of SMILES that contain special characters that conflict with URL syntax, or multi-line SDF files. There is a separate section towards the end of this document that explains this process in more detail. There are some additional complexities to the HTTP protocol details that aren’t covered here – see the specification document for more information.


Output: What You Get Back


The results of most operations can be expressed in a variety of data formats, though not all formats are relevant to all operations, meaning for example you can’t get back a list of CIDs in SDF format, or a chemical structure in CSV format. Which format to use is your choice, and it will likely depend on the context of the requests; a C++ application may wish to use XML that is automatically parsed into class objects based on the schema, while JavaScript in a web page would more naturally use JSON to have convenient access to the result data. Plain text (TXT) output is limited to certain cases where all output values are of the same simple type, such as a list of chemical name synonyms or SMILES strings. The available output formats are listed below.

Output Format    Description

XML              standard XML, for which a schema is available
JSON             JSON, JavaScript Object Notation
JSONP            JSONP, like JSON but wrapped in a callback function
ASNB             standard binary ASN.1, NCBI’s native format in many cases
ASNT             NCBI’s human-readable text flavor of ASN.1
SDF              chemical structure data
CSV              comma-separated values, spreadsheet compatible
PNG              standard PNG image data
TXT              plain text



Error Handling


If there is a problem with a request, PUG REST will usually return some sort of human-readable message indicating what went wrong – whether it’s an invalid input, or nothing was found for the given query, or the request was too broad and took too long to complete (more than 30 seconds, the NCBI standard time limit on web service requests), etc. See the specification document for more detail on result and HTTP status codes.
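A script will usually want to capture that message rather than crash. The sketch below uses Python’s standard `urllib` and returns the HTTP status together with the response text, whether or not the request succeeded. The function name `fetch`, the injectable `opener` argument, and the 60 second client-side timeout are choices made for this example, not anything PUG REST requires.

```python
import urllib.error
import urllib.request


def fetch(url, opener=urllib.request.urlopen, timeout=60):
    """Return (status, text); on an HTTP error, return the server's message."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # The error body usually carries the human-readable explanation.
        return err.code, err.read().decode("utf-8", "replace")
    # Network-level failures (urllib.error.URLError) are left to the caller.
```

A caller can then branch on the status, for example treating anything other than 200 as a failure and logging the returned text.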


Access to PubChem Substances and Compounds


This section covers some of the basic methods of accessing PubChem chemical structure information, with many working samples. It is not intended to be a comprehensive document covering all PUG REST features, but rather enough of the common access methods to give a reasonable overview of how the service is designed and what it can do, so that one can quickly begin to use it for custom applications.


Input Methods


The first part of any PUG REST request is the input, which tells the service which records you’re interested in. There are many ways to approach this; the most common are presented here, with examples.


By Identifier


The most straightforward way to tell PUG REST what records you’re interested in is by specifying the SIDs or CIDs directly. This example says “give me the names of the substance with SID 10000 in XML format”:


IDs may also be specified in a comma-separated list, here retrieving a CSV table of compound properties:


Large lists of IDs that are too long to put in the URL itself may be specified in the POST body, but be aware that if a PUG REST request takes more than 30 seconds to complete, it will time out, so it’s better to deal with moderately sized lists of identifiers.
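One way to keep requests moderately sized is to split a long identifier list into batches and issue one request per batch. The sketch below shows the idea; the function names and the batch size are arbitrary choices for illustration, and the property names are assumed to match those in the specification document.

```python
BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_table_url(cids, properties, fmt="CSV"):
    """Build a compound property request for a comma-separated list of CIDs."""
    id_list = ",".join(str(int(cid)) for cid in cids)
    prop_list = ",".join(properties)
    return f"{BASE}/compound/cid/{id_list}/property/{prop_list}/{fmt}"


def chunks(ids, size):
    """Yield successive slices so that each request stays moderately sized."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


cids = [1, 2, 3, 4, 5]
for batch in chunks(cids, 2):
    print(property_table_url(batch, ["MolecularFormula", "MolecularWeight"]))
```

Remember the usage policy when looping over batches: no more than 5 requests per second.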


By Name


It is often convenient to refer to a chemical by name. Be aware though that matching chemical names to structure is an inexact science at best, and a name may often refer to more than one record. For example, “glucose” gives (at the time of writing) four CIDs, the same as if you were to search for that name by full synonym match in Entrez:


Some operations will use only the first identifier in the list, so if you want a picture of glucose, you can get what PubChem considers the “best” match to that name with the following URL:


By default, the given name must be an exact match to the entire name of the record; optionally, you can specify that matches may be to individual words in the record name:
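A name lookup can be sketched as follows. The helper `name_to_cids_url` is hypothetical; the `name_type=word` option is given here as recalled from the specification document, so check it there before relying on it. Names containing spaces are percent-encoded, and names containing a forward slash are rejected because they cannot be placed in the URL path.

```python
from urllib.parse import quote

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def name_to_cids_url(name, word_match=False, fmt="TXT"):
    """Build a URL that returns the CIDs matching a chemical name."""
    if "/" in name:
        raise ValueError("names containing '/' must be sent by POST instead")
    url = f"{BASE}/compound/name/{quote(name, safe='')}/cids/{fmt}"
    if word_match:
        # Assumed option: match individual words instead of the entire name.
        url += "?name_type=word"
    return url


print(name_to_cids_url("glucose"))
print(name_to_cids_url("acetic acid"))
print(name_to_cids_url("glucose", word_match=True))
```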



By Structure Identity


There are numerous ways to specify a compound by its chemical structure, using SMILES, InChI, InChI key, or SDF. InChI and SDF require the use of POST because the format is incompatible with a simple URL string, so they won’t be discussed here. But specifying with most SMILES strings, or InChI key, is straightforward. For some operations, a SMILES can be used to get data even if the structure is not present in PubChem already, but this may not work for other operations, such as retrieval of precomputed properties. The InChI key must always be present in the database, since unlike these other formats, it is not possible to determine structure from the key alone. This example will give the PubChem CID for the SMILES string “CCCC” (butane):



By Structure Search


The previous section describes how one can specify a structure, for which PUG REST will return only an exact match. There are however more sophisticated structure search techniques available, including substructure and superstructure, similarity (2D Tanimoto), and various partial identity matches (like same atom connectivity but unspecified stereochemistry). Molecular formula search also falls into this category in PUG REST. The complication with these searches is that it takes time to search the entire PubChem database of tens of millions of compounds, and so results may not be available within the previously mentioned 30 second time period of a PUG REST request. To work around this, PUG REST uses what is called an “asynchronous” operation, where you get a request identifier that is a sort of job ticket when you start the search. Then it is the caller’s responsibility to check periodically (say, every 5-10 seconds) whether the search has finished, and if so, retrieve the results. It is a two stage process, where the search is initiated by a request like this search for all records containing a seven-membered carbon ring:


The result of that request will include what PUG REST calls a “ListKey” – which is currently a unique, private, randomly-assigned 64-bit integer. This key is used in a subsequent request like:


… where the listkey number in the above URL is the actual ListKey returned in the first call. This will either give a message to the effect of “your search is still running,” or if complete, will return the list of CIDs found - in this example at this time, around 230,000 records. (See below for more on how to deal with large lists of identifiers.)
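The two-stage process can be sketched as a polling loop. This is only an outline under stated assumptions: `fetch_json` is a caller-supplied (hypothetical) function that takes a URL and returns the decoded JSON reply, and the reply shapes `{"Waiting": {"ListKey": ...}}` and `{"IdentifierList": {"CID": [...]}}` reflect how the JSON output is understood to look, so verify them against real responses.

```python
import time

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def run_async_search(start_url, fetch_json, interval=5.0, max_polls=60,
                     sleep=time.sleep):
    """Start an asynchronous search and poll until the CID list is ready."""
    reply = fetch_json(start_url)
    polls = 0
    while "Waiting" in reply:
        if polls >= max_polls:
            raise TimeoutError("search did not finish within the polling budget")
        list_key = reply["Waiting"]["ListKey"]
        sleep(interval)  # be polite: check only every few seconds
        reply = fetch_json(f"{BASE}/compound/listkey/{list_key}/cids/JSON")
        polls += 1
    return reply["IdentifierList"]["CID"]


# Substructure search for a seven-membered carbon ring, as in the text.
START_URL = f"{BASE}/compound/substructure/smiles/C1CCCCCC1/JSON"
```

Passing the fetch function in as an argument keeps the polling logic separate from the HTTP details, which also makes it easy to try out without touching the network.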


By Fast (Synchronous) Structure Search


Some re-engineering of the PubChem search methods has enabled faster searching by identity, similarity (2D and 3D), substructure, and superstructure. These methods are synchronous inputs, meaning there is no waiting/polling necessary, as in the majority of cases they will return results in a single call. (Timeouts are possible if the search is too broad or complex.) These are normal input methods and can be used with any output. Some examples are below:
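As a sketch, fast searches can be written like any other input. The search keywords and the `Threshold` option below are given as recalled from the specification document and should be checked there; the helper `fast_search_url` and the example structures are illustrative only.

```python
from urllib.parse import quote, urlencode

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

# Keywords assumed from the specification document.
FAST_SEARCHES = {"fastidentity", "fastsimilarity_2d",
                 "fastsubstructure", "fastsuperstructure"}


def fast_search_url(kind, namespace, value, fmt="TXT", **options):
    """Build a synchronous structure search that returns matching CIDs."""
    if kind not in FAST_SEARCHES:
        raise ValueError(f"unknown fast search type: {kind}")
    query = quote(str(value), safe="")
    url = f"{BASE}/compound/{kind}/{namespace}/{query}/cids/{fmt}"
    if options:
        url += "?" + urlencode(options)
    return url


print(fast_search_url("fastsubstructure", "smiles", "C1CCCCCC1"))
print(fast_search_url("fastsimilarity_2d", "cid", 2244, Threshold=95))
```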






By Cross-Reference (XRef)


PubChem substances and compounds often include a variety of cross-references to records in other databases. Sometimes it’s useful to do a reverse lookup by cross-reference value, such as this request that returns all SIDs that are linked to a patent identifier:


For a full list of cross-references available, see the specification document.


Available Data


Now that you’ve learned how to tell PUG REST what records you want to access, the next stage is to indicate what to do with these records – what information about them you want to retrieve. One of the major design goals of PUG REST is to provide convenient access to small “bits” of information about each record, like individual properties, cross-references, etc., which may not be possible with any other PubChem service without having to download a large quantity of data and sort through it for the one piece you need. That is, PubChem provides many ways to retrieve bulk data covering the entire database, but if all you want is, say, the molecular weight of one compound, PUG REST is the way to get this simply and quickly. (Whereas PUG REST is not the best way to get information for the whole database – so it’s probably not a good idea to write a “crawler” that calls PUG REST individually for every SID or CID in the system – there are better ways to get data for all records.)


Full Records


PUG REST can be used to retrieve entire records, in the usual formats that PubChem supports – ASN.1 (NCBI’s native format), XML, SDF. Now you can even get full records in JSON(P) as well. In fact, full record retrieval is the default action if you don’t specify some other operation. For example, both of these will return the record for aspirin (CID 2244) in various fully equivalent formats:



You can also request multiple records at once, though be aware that there is still the timeout limit on record retrieval so large lists of records may not be practical this way – but of course PubChem provides separate bulk download facilities.
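The record requests described above can be sketched as follows, showing both the implicit form and the form with an explicit `record` operation. The helper `record_url` is a hypothetical name; the rest follows the URL pattern used throughout this tutorial.

```python
BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

# Identifier namespace for each record domain.
ID_NAMESPACE = {"compound": "cid", "substance": "sid"}


def record_url(domain, ids, fmt, explicit=False):
    """URL for full records; 'record' may be omitted since it is the default."""
    id_list = ",".join(str(int(i)) for i in ids)
    operation = "record/" if explicit else ""
    return f"{BASE}/{domain}/{ID_NAMESPACE[domain]}/{id_list}/{operation}{fmt}"


# Aspirin (CID 2244), first implicitly as SDF, then explicitly as XML.
print(record_url("compound", [2244], "SDF"))
print(record_url("compound", [2244], "XML", explicit=True))

# Several records in one request.
print(record_url("compound", [1, 2, 3, 4, 5], "SDF"))
```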





Images


As far as PUG REST is concerned, images are really a flavor of full-record output, as they depict the structure as a whole. So all you have to do to retrieve an image is to specify PNG format output instead of one of the other data formats described in the previous section. Note though that an image request will only show the first SID or CID in the input identifier list; there is currently no way to get multiple images in a single request. (However, PubChem’s download service can be used to get multiple images.) Image retrieval is fully compatible with all the various input methods, so for example you can use this to get an image for a chemical name, SMILES string, InChI key, etc.:
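A short sketch of image requests for different kinds of input follows. The helper names are made up for this example. Because an error reply is text rather than image data, it can be worth checking the first bytes of the response against the standard PNG file signature before saving it.

```python
from urllib.parse import quote

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

# Every PNG file starts with this fixed eight-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def image_url(namespace, value):
    """Build a PNG request for a compound given by cid, name, smiles, etc."""
    return f"{BASE}/compound/{namespace}/{quote(str(value), safe='')}/PNG"


def is_png(data):
    """Return True if the downloaded bytes look like PNG image data."""
    return data.startswith(PNG_SIGNATURE)


print(image_url("cid", 2244))
print(image_url("name", "aspirin"))
print(image_url("smiles", "CCCC"))
```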





Compound Properties


All of the pre-computed properties for PubChem compounds are available through PUG REST, individually or in tables. See the specification document for a table of all the property names. For example, to get just a single molecular weight value:


Or a CSV table of multiple compounds and properties:
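The sketch below shows what the two kinds of property request might look like, along with one way to turn a CSV reply into a dictionary keyed by CID. The property names are assumed to match the table in the specification document, and the parser assumes the first CSV column is headed `CID`; both should be verified against a real response.

```python
import csv
import io

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

# A single value as plain text, and a table for several compounds.
single_value_url = f"{BASE}/compound/cid/2244/property/MolecularWeight/TXT"
table_url = (f"{BASE}/compound/cid/1,2,3,4,5/property/"
             "MolecularFormula,MolecularWeight,InChIKey/CSV")


def parse_property_csv(text):
    """Turn a property CSV response into {CID: {property: value}}."""
    table = {}
    for row in csv.DictReader(io.StringIO(text)):
        cid = int(row.pop("CID"))  # assumed header of the identifier column
        table[cid] = row
    return table
```

All values come back as strings, so convert numeric properties such as the molecular weight with `float()` where needed.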





Synonyms


Chemical names can be both input and output in PUG REST. For example, to see all the synonyms of Vioxx that PubChem has, a rather long list:



Cross-References (XRefs)


PubChem has many cross-references between databases, all of which are available through PUG REST. See the specification document for a table of all the cross-reference types. For example, to get all the MMDB identifiers for protein structures that contain aspirin:


Or the inverse of an example above, retrieving all the patent identifiers associated with a given SID:
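Both directions of cross-reference access can be sketched with two small helpers. The function names are hypothetical, and the cross-reference type names (`MMDBID`, `PatentID`) are given as recalled from the specification document; no particular patent identifier is assumed here.

```python
from urllib.parse import quote

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

ID_NAMESPACE = {"compound": "cid", "substance": "sid"}


def xrefs_of_record_url(domain, record_id, xref_type, fmt="TXT"):
    """Cross-references of one type attached to a given record."""
    namespace = ID_NAMESPACE[domain]
    return f"{BASE}/{domain}/{namespace}/{int(record_id)}/xrefs/{xref_type}/{fmt}"


def records_with_xref_url(domain, xref_type, xref_value, fmt="TXT"):
    """Reverse lookup: records linked to a given cross-reference value."""
    ids = ID_NAMESPACE[domain] + "s"
    value = quote(str(xref_value), safe="")
    return f"{BASE}/{domain}/xref/{xref_type}/{value}/{ids}/{fmt}"


# MMDB identifiers for protein structures that contain aspirin (CID 2244).
print(xrefs_of_record_url("compound", 2244, "MMDBID", "XML"))
```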



And More…


This gives you some idea of the sorts of data one can access through PUG REST. It is not a comprehensive list, as we have not covered dates, classifications, BioAssay information and SID/CID/AID cross-links (detailed more below), etc.; more features may be added in the future. And we welcome feedback on new feature suggestions as well!

Access to PubChem BioAssays


In this section we describe the various types of BioAssay information available through PUG REST. A PubChem BioAssay is a fairly complex and sometimes very large entity with a great deal of data, so there are routes both to entire assay records and various component data readouts, etc., so that you can more easily get just the data that you’re interested in.


From AID

Assay Description


An assay is composed of two general parts: the description of the assay as a whole, including authorship, general description, protocol, and definitions of the data readout columns; and then the data section with all the actual test result values. To get just the description section via PUG REST, use a request like:


There is also a simplified summary format that does not have the full complexity of the original description as above, and includes some information on targets, active and inactive SID and CID counts, etc. For example:



Assay Data


BioAssay data may be conceptualized as a large table where the columns are the readouts (enumerated in the description section), and the rows are the individual substances tested and their results for each column. So, retrieving an entire assay record involves the primary AID – the identifier for the assay itself – and a list of SIDs. If you want all the data rows of an assay, you can use a simple request like this one, which will return a CSV table of results. Note that full-data retrieval is the default operation for assays.


However, as some assays have many thousands of SID rows, there is a limit, currently 10,000, on the number of rows that can be retrieved in a single request. If you are interested in only a subset of the total data rows, you can use an optional argument to the PUG REST request to limit the output to just those SIDs (and note that with XML/ASN output you get the description as well when doing data retrieval). There are other ways to input the SID list, such as in the HTTP POST body or via a list key; see below for more detail on lists stored on the server.


Some assay data may be recast as dose-response curves, in which case you can request a simplified output:
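The assay requests discussed in this section can be sketched with a single helper. The function `assay_url` and the AID used below are illustrative only; the operation keywords (`description`, `summary`, `doseresponse`) and the `sid` argument are given as recalled from the specification document.

```python
from urllib.parse import urlencode

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def assay_url(aid, operation=None, fmt="CSV", sids=None):
    """Build an assay request; with no operation, full data is returned."""
    url = f"{BASE}/assay/aid/{int(aid)}"
    if operation:
        url += f"/{operation}"
    url += f"/{fmt}"
    if sids:
        # Limit the output to the given SID rows.
        url += "?" + urlencode({"sid": ",".join(map(str, sids))}, safe=",")
    return url


aid = 1000  # arbitrary example AID
print(assay_url(aid, "description", "XML"))   # description section only
print(assay_url(aid, "summary", "JSON"))      # simplified summary
print(assay_url(aid))                         # all data rows as CSV
print(assay_url(aid, "doseresponse", sids=[1, 2]))  # placeholder SIDs
```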





Targets


When the target of a BioAssay is known, it can be retrieved either as a sequence or gene, including identifiers in NCBI’s respective databases for these:


Note though that not all assays have protein or gene targets defined.

It is also possible to select assays via target identifier, specified by GI, Gene ID, or gene symbol, for example:



Activity Name


BioAssays may be selected by the name of the primary activity column, for example to get all the AIDs that are measuring an EC50:



From SID and CID


PUG REST can also retrieve a summary of BioAssay information associated with a given SID or CID; in this case, multiple AIDs may be returned (and possibly multiple data rows in a given AID). For example:




Dealing with Lists of Identifiers

Storing Lists on the Server


Some PUG REST requests may result in a very long list of identifiers, and it may not be practical to deal with all of them at once. Or you may have a set of identifiers you want to be able to use for several subsequent requests of different types. For this reason, we provide a way to store lists on the server side, and retrieve them in part or whole. The basic idea is that you request a “List Key” for your identifiers – in fact the same sort of key you get from a structure search as mentioned above. But any operation that results in a list of SIDs, CIDs, or AIDs can be stored in a ListKey this way, not just structure search.

Say for example you want to look at all the SIDs tested in a large assay. First make the request to get the SIDs and store them on the server:


This will return a ListKey – along with the size of the list, and values needed to retrieve this same list from Entrez’s eUtils services. You can then use that ListKey in subsequent requests. For example, since assay data retrieval is limited in the number of rows, you could break it up into multiple requests of 1,000 SID rows at a time, like:


Here, substitute the “listkey” value with the key returned by the initial request above, then “listkey_start” is the zero-based index of the first element of the list to use, and “listkey_count” is how many. Simply repeat the request with increasing values of “listkey_start” in order to loop over the entire assay – either to get the contents of the whole assay, or (with a smaller count value perhaps) to show one page of results at a time in a custom assay data viewer, with pagination controls to move through the whole set of results.
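The paging scheme just described might be sketched like this. The AID, the list key, and the list size are placeholders; in practice the key and size come from the reply to the initial storage request. The `list_return=listkey` option on that request is given as recalled from the specification document.

```python
from urllib.parse import urlencode

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def store_sids_url(aid, fmt="XML"):
    """Request that stores an assay's SIDs on the server and returns a ListKey."""
    return f"{BASE}/assay/aid/{int(aid)}/sids/{fmt}?list_return=listkey"


def listkey_pages(total, page_size):
    """Yield (listkey_start, listkey_count) pairs covering the whole list."""
    for start in range(0, total, page_size):
        yield start, min(page_size, total - start)


def assay_page_url(aid, list_key, start, count, fmt="CSV"):
    """Assay data for one slice of a stored SID list."""
    query = urlencode({"sid": "listkey", "listkey": list_key,
                       "listkey_start": start, "listkey_count": count})
    return f"{BASE}/assay/aid/{int(aid)}/{fmt}?{query}"


# Placeholder values: AID 1000, key "42", and a list of 2,500 SIDs.
for start, count in listkey_pages(2500, 1000):
    print(assay_page_url(1000, "42", start, count))
```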

A ListKey can be used in most places that could otherwise take an explicit list of identifiers. So, for example, the same list of SIDs can be used in the context of substance requests, such as this one to get the synonyms associated with the first 10 records on the same list:


You can even create lists from identifiers specified in the URL (or in the HTTP POST body):



Inter-conversion of Identifiers


PubChem has many (many) types of cross-links between databases, or between one record and other records in the same database. That is, you can move from “SID space” to “CID space” in a variety of ways, depending on just what relationship you’re interested in. The specification document has a complete table of these identifier inter-conversion options, depending on whether you’re starting from SIDs, CIDs, or AIDs. We’ll show a few examples here.

You’ve already seen one example just above of getting back SIDs associated with a given AID. That request returns all SIDs, but it’s also possible to get just the SIDs that are active in the assay, in this case a much smaller list than the full set of ~96,000 SIDs that were tested:


Or you can retrieve all the substances corresponding exactly to the structure of aspirin (CID 2244), which shows all the records of this chemical structure supplied to PubChem by multiple depositors – and there are many in this case. This sort of conversion operation can also be combined with ListKey storage in the same way discussed above, in case the results list is long.


There are operations to retrieve the various groups of related chemical structures that PubChem computes, such as this request to retrieve all compounds – salts, mixtures, etc. – whose parent compound is aspirin; that is, where aspirin is considered to be the “important” part of the structure:


Sometimes it’s possible to group lists of identifiers in the result according to identifiers in the input, and PUG REST includes options for that as well. Compare the output of the following two requests. The first simply returns one group of all standardized SIDs corresponding to any compound with the name ‘glucose’ (that is, deposited records that match one of the glucose CIDs exactly). The second groups them by CID, which is actually the default for this sort of request, unless you are storing the list on the server via ListKey, in which case it is necessarily flattened.
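The conversions in this section can be sketched with one helper that appends the conversion options as URL arguments. The option names and values used here (`sids_type`, `cids_type`, `list_return`, and values such as `active`, `standardized`, `same_parent`, `flat`, `grouped`) are given as recalled from the specification document, which holds the authoritative table; the helper name and the AID are illustrative.

```python
from urllib.parse import quote, urlencode

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def convert_url(domain, namespace, value, target, fmt="TXT", **options):
    """Build an identifier conversion request, e.g. CIDs to SIDs."""
    ident = quote(str(value), safe=",")
    url = f"{BASE}/{domain}/{namespace}/{ident}/{target}/{fmt}"
    if options:
        url += "?" + urlencode(options)
    return url


# SIDs that are active in an assay (arbitrary example AID).
print(convert_url("assay", "aid", 1000, "sids", sids_type="active"))

# Compounds whose parent compound is aspirin.
print(convert_url("compound", "cid", 2244, "cids", cids_type="same_parent"))

# Standardized SIDs for 'glucose', flat versus grouped by CID.
for mode in ("flat", "grouped"):
    print(convert_url("compound", "name", "glucose", "sids", "XML",
                      sids_type="standardized", list_return=mode))
```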






How To Use HTTP POST


While being able to write most PUG REST requests as simple URLs is convenient, sometimes there are inputs that do not work well with this approach because of syntax conflicts or size restrictions. For example, a multi-line SDF file, any name or SMILES string or InChI that has ‘/’ (forward slash) or other special characters that are reserved in URL syntax, or long lists of identifiers that are too big to put directly in the URL of an HTTP GET request, can be put in the HTTP POST body instead. Many (though not necessarily all) of the PUG REST input types allow the argument to be specified by POST. While this isn’t something that one can type into a regular web client, most programmatic HTTP interface libraries will have the ability to use POST. Technically, there is no limit to the size of the POST body, but practically, a very large input may take a long time for PUG REST to process, leading to timeouts if it takes longer than 30 seconds.

There are existing standards for just how the information in the POST body is formatted, and you must include in the PUG REST call an HTTP header that indicates which content type you are supplying. The simpler format is “Content-Type: application/x-www-form-urlencoded”, which is the same as the URL argument syntax after the ‘?’ separator, of the general form “arg1=value1&arg2=value2&…”. (See the W3C HTML forms specification for more technical detail on these content types.) For example, use a URL like


with “Content-Type: application/x-www-form-urlencoded” in the request header, and put the string


in the POST body. (With InChI this looks a little weird, because the first “inchi=” is the name of the PUG REST argument, and the second “InChI=” is part of the InChI string itself.) You should get back CID 6334 (propane). Note that some special characters may still need to be escaped, in particular the ‘+’ (plus sign) character which is a standard replacement for a space character in URL syntax. You must replace this with “%2B”, such as “smiles=CC(=O)OC(CC(=O)[O-])C[N%2B](C)(C)C” to use the SMILES string for CID 1 (acetylcarnitine). If PUG REST is giving you a “bad request” or “structure cannot be standardized” error message with your input, it’s possible there are other special characters that need to be escaped this way.
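Most HTTP libraries can do this escaping for you. The sketch below builds such a POST request with Python’s standard library; `urlencode` percent-encodes the reserved characters, including ‘+’ as “%2B”. The helper name `form_post_request` is made up for this example, and the InChI shown is the standard InChI for propane.

```python
import urllib.request
from urllib.parse import urlencode

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def form_post_request(url, **fields):
    """Build a POST request with a form-urlencoded body."""
    body = urlencode(fields).encode("ascii")
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"})


inchi_request = form_post_request(
    f"{BASE}/compound/inchi/cids/JSON",
    inchi="InChI=1S/C3H8/c1-3-2/h3H2,1-2H3")

smiles_request = form_post_request(
    f"{BASE}/compound/smiles/cids/JSON",
    smiles="CC(=O)OC(CC(=O)[O-])C[N+](C)(C)C")

print(inchi_request.data.decode())
print(smiles_request.data.decode())
```

To send a request, pass it to `urllib.request.urlopen`. The encoded body escapes more characters than strictly necessary, which is harmless because the server decodes them again.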

The first method described above works well for single-line input strings, but is not applicable to inputs like SDF, which are necessarily multi-line. For this type, you’ll need to use the multipart/form-data type, and an appropriately formatted input. This method is a little more complex because of the existing protocol standard. To use the same example as above, first prepare a file (or string) that looks like this:


--AaB03x
Content-Disposition: form-data; name="inchi"
Content-Type: text/plain

InChI=1S/C3H8/c1-3-2/h3H2,1-2H3
--AaB03x--





Note that the POST body string/file in this case must have DOS-style “CR+LF” line endings, and there must be an empty line between the content headers and actual data line(s) (and no blank lines anywhere else). But in this format, no further escaping of special characters is needed. It looks a little strange, but your HTTP library may know how to construct this sort of thing automatically; check your documentation. This would be sent to the same URL as before, e.g.:


but this time with “Content-Type: multipart/form-data; boundary=AaB03x” in the request header. It is essential that the arbitrary boundary string given in the header match what’s used in the POST body (“AaB03x” in this example).
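If your HTTP library does not build multipart bodies for you, the format is simple enough to assemble by hand. The following is a minimal sketch for a single text field; the function name `multipart_body` is hypothetical, and it assumes the field value itself already uses the line endings you intend to send.

```python
def multipart_body(name, value, boundary="AaB03x"):
    """Return (body, content_type) for a one-field multipart/form-data POST."""
    lines = [
        f"--{boundary}",
        f'Content-Disposition: form-data; name="{name}"',
        "Content-Type: text/plain",
        "",                 # the one required empty line before the data
        value,
        f"--{boundary}--",  # closing boundary
        "",
    ]
    body = "\r\n".join(lines).encode("utf-8")  # CR+LF line endings
    content_type = f"multipart/form-data; boundary={boundary}"
    return body, content_type


body, content_type = multipart_body("inchi", "InChI=1S/C3H8/c1-3-2/h3H2,1-2H3")
print(content_type)
print(body.decode())
```

The returned content type goes in the request header, so the boundary in the header and in the body always agree.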




Conclusion


If you’ve read this far, hopefully by now you have a good understanding of the sorts of things PUG REST can do to facilitate access to PubChem data, and how to write your own PUG REST requests. Please feel free to contact us at pubchem-help@ncbi.nlm.nih.gov for assistance, if there’s something you’d like to be able to do with this service but can’t quite figure out how to formulate the requests, or if the features you need simply aren’t present and you would like us to consider adding them.