PubChem PUG Help

The PubChem Power User Gateway (PUG) provides programmatic access to PubChem services. The basic design principle is straightforward: a single CGI (pug.cgi, referred to hereafter as simply PUG) is the central gateway to multiple PubChem functions. PUG takes no URL arguments; all communication with PUG is through XML. To perform any request, you formulate the input in XML and then send it to PUG via HTTP POST. The CGI interprets your incoming request, initiates the appropriate action, and returns the results, also in XML. (This document assumes a basic familiarity with XML tags and data structures. To learn more about XML, visit the URL: http://en.wikipedia.org/wiki/XML)

PubChem services are queued. As such, a submitted task will (usually) complete sometime after PUG responds to the initial request. The initial PUG response contains the request ID of your task. This request ID must be used for all further communication with PUG concerning that task. When PUG is asked about an outstanding request using the request ID, it returns either the results of your task, if completed, or the status of your task.

Each PubChem service enabled for use with PUG is documented separately. This service-by-service documentation details the input, output, and options.

All XML used by PUG is specified in the document type definition (DTD), which may be found at: http://pubchem.ncbi.nlm.nih.gov/pug/pug.dtd or in the equivalent XML Schema definition at: http://pubchem.ncbi.nlm.nih.gov/pug/pug.xsd We strongly recommend using an XML parser/generator tool to read and write the XML data, rather than composing XML manually.

PubChem PUG enabled services (typically) have the ability to save and open valid PUG requests designed for that service. You can use this feature to learn how to compose valid PUG XML requests and to verify that your PUG XML request does what is intended. An example of such a service is the PubChem Compound Structure Search service: http://pubchem.ncbi.nlm.nih.gov/search/search.cgi

Additional documentation on PubChem and its services may be found at http://pubchem.ncbi.nlm.nih.gov and via help links throughout PubChem’s web site. If you can’t find what you need there, further requests for information or help may be sent to the highly knowledgeable and responsive NCBI help desk at info@ncbi.nlm.nih.gov.
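The request cycle described in this introduction (compose XML, HTTP POST it to PUG, read XML back) can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: the function names `build_pug_post` and `send_to_pug` are hypothetical, and the `Content-Type` header is an assumption, since this document does not say which one PUG expects.

```python
import urllib.request

# URL of the PUG CGI as given in this document.
PUG_URL = "http://pubchem.ncbi.nlm.nih.gov/pug/pug.cgi"


def build_pug_post(xml_text: str, url: str = PUG_URL) -> urllib.request.Request:
    """Wrap a PUG XML message in an HTTP POST request (nothing is sent yet)."""
    return urllib.request.Request(
        url,
        data=xml_text.encode("utf-8"),
        # Assumption: the document does not specify a required Content-Type.
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def send_to_pug(xml_text: str, timeout: float = 60.0) -> str:
    """POST the XML message to PUG and return the XML reply as text."""
    with urllib.request.urlopen(build_pug_post(xml_text), timeout=timeout) as reply:
        return reply.read().decode("utf-8")
```

Keeping request construction separate from sending makes it possible to inspect or log a message before it goes over the network.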
All communication to PUG is via XML sent to the CGI at the URL: http://pubchem.ncbi.nlm.nih.gov/pug/pug.cgi The primary data container used in all transactions is <PCT-Data>, the top-level container for any PUG input or output. (“PCT” stands for PubChem Tools, a data specification that is shared by both PUG and internal PubChem applications. See the introduction in this document for more information.) This <PCT-Data> object may contain either a <PCT-InputData> or a <PCT-OutputData> object. Users of PUG will always send <PCT-Data> containing <PCT-InputData>, and always receive <PCT-Data> containing <PCT-OutputData>. After a new task is submitted to PUG, your request is queued, rather than executing immediately. As such, PUG will return an XML message containing a request ID to be used for further actions on your request. In the example PUG XML reply below, the message says that the request was successfully submitted and that the request ID is “402936103567975582”. It will then be up to you to (periodically) poll PUG, using the request ID, until your task is complete. When your task is completed, PUG will return the result; otherwise it will simply return a status message. See examples in other sections of this document on how to properly poll PUG. Example of a PUG reply to a newly submitted request:
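As a hedged illustration of such a reply, the Python sketch below embeds a reconstructed sample message and extracts the request ID from it. Only `<PCT-Data>`, `<PCT-OutputData>`, `<PCT-Status-Message>`, `<PCT-Status>`, `<PCT-Waiting>`, and `<PCT-Waiting_reqid>` are named in this document; the wrapper tags between them and the `value="success"` attribute are assumptions that should be checked against pug.dtd. The function name `extract_request_id` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Reconstructed sample reply for illustration only. Wrapper tags such as
# <PCT-Data_output> and the status value "success" are assumed, not quoted
# from this document; verify them against pug.dtd.
SAMPLE_REPLY = """\
<PCT-Data>
  <PCT-Data_output>
    <PCT-OutputData>
      <PCT-OutputData_status>
        <PCT-Status-Message>
          <PCT-Status-Message_status>
            <PCT-Status value="success"/>
          </PCT-Status-Message_status>
        </PCT-Status-Message>
      </PCT-OutputData_status>
      <PCT-OutputData_output>
        <PCT-OutputData_output_waiting>
          <PCT-Waiting>
            <PCT-Waiting_reqid>402936103567975582</PCT-Waiting_reqid>
          </PCT-Waiting>
        </PCT-OutputData_output_waiting>
      </PCT-OutputData_output>
    </PCT-OutputData>
  </PCT-Data_output>
</PCT-Data>
"""


def extract_request_id(reply_xml: str) -> Optional[str]:
    """Return the request ID from a PUG reply, or None if there is none."""
    root = ET.fromstring(reply_xml)
    # Search by tag name anywhere in the tree so that the lookup does not
    # depend on the assumed wrapper elements.
    node = next(root.iter("PCT-Waiting_reqid"), None)
    if node is None or not node.text:
        return None
    return node.text.strip()
```

The request ID is kept as a string because it is an opaque token that is only ever sent back to PUG.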
The <PCT-InputData> object is a choice between request types. Tasks specific to various PubChem services are contained by <PCT-InputData> and are described in different sections of this document. Primary to the use of PUG is the <PCT-InputData> input type used to perform request management, <PCT-Request>. Request management enables you to enquire about the status of or to cancel a previous PUG request. For example, to cancel a PUG request with request ID “402936103567975582”, the PUG XML input message will look like this:
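A request-management message of this kind could be generated as in the following Python sketch. The element names below `<PCT-Request>` (`PCT-Request_reqid`, `PCT-Request_type`) and the wrappers around it are assumptions based on the naming pattern of the PUG DTD and should be verified against pug.dtd; `build_request_message` is a hypothetical helper name. The two request types, “status” and “cancel”, are the ones this document describes.

```python
import xml.etree.ElementTree as ET


def build_request_message(request_id: str, action: str) -> str:
    """Build a <PCT-Request> message that polls ("status") or cancels a task."""
    if action not in ("status", "cancel"):
        raise ValueError("action must be 'status' or 'cancel'")
    data = ET.Element("PCT-Data")
    node = data
    # Assumed nesting; check the wrapper tag names against pug.dtd.
    for tag in ("PCT-Data_input", "PCT-InputData",
                "PCT-InputData_request", "PCT-Request"):
        node = ET.SubElement(node, tag)
    ET.SubElement(node, "PCT-Request_reqid").text = request_id
    ET.SubElement(node, "PCT-Request_type", {"value": action})
    return ET.tostring(data, encoding="unicode")


# Cancel the request used as the example in the text.
cancel_xml = build_request_message("402936103567975582", "cancel")
```

Generating the message with an XML library, rather than by string concatenation, follows the recommendation in the introduction to avoid composing XML manually.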
The <PCT-OutputData> object contained in the output from PUG will always include a status message in a <PCT-Status-Message>, which consists of an enumerated status in <PCT-Status> and an optional message string. When a new task is queued by PUG, the <PCT-OutputData> returned to you will (likely) contain a <PCT-Waiting> which contains your request ID. If the request finishes quickly, the initially returned <PCT-OutputData> object will actually contain the appropriate result of your task specific to the requested service. Similarly, when polling PUG using your request ID, the <PCT-OutputData> object will contain either your task result or a status message.
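The reading of a `<PCT-OutputData>` reply described above can be sketched in Python as follows. This sketch assumes that the enumerated status is carried in a `value` attribute of `<PCT-Status>` and that the optional message string sits in a `<PCT-Status-Message_message>` element; neither detail is stated in this document, so check both against pug.dtd. `PugReply` and `summarize_reply` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple, Optional


class PugReply(NamedTuple):
    status: Optional[str]      # enumerated status, e.g. "success"
    message: Optional[str]     # optional human-readable message
    request_id: Optional[str]  # set only while the task is still waiting


def _text(node: Optional[ET.Element]) -> Optional[str]:
    """Return stripped element text, or None for a missing or empty element."""
    if node is None or not node.text:
        return None
    return node.text.strip()


def summarize_reply(reply_xml: str) -> PugReply:
    """Pull the status, message, and request ID (if any) out of a PUG reply."""
    root = ET.fromstring(reply_xml)
    status = next(root.iter("PCT-Status"), None)
    return PugReply(
        # Assumption: the status enumeration is stored in a "value" attribute.
        status=status.get("value") if status is not None else None,
        # Assumption: the message string uses this element name.
        message=_text(next(root.iter("PCT-Status-Message_message"), None)),
        request_id=_text(next(root.iter("PCT-Waiting_reqid"), None)),
    )
```

A reply whose `request_id` is set means the task is still queued or running; a reply without one carries either the task result or an error status.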
PubChem services currently enabled for use by PUG include PubChem Download (http://pubchem.ncbi.nlm.nih.gov/pc_fetch/pc_fetch.cgi), PubChem Compound Structure Search (http://pubchem.ncbi.nlm.nih.gov/search/search.cgi), and PubChem Structure Standardization (http://pubchem.ncbi.nlm.nih.gov/standardize/standardize.cgi). Each PUG service has its own expected input and provided output. The sections below detail how to use each service with PUG.
This service allows you to download sets of PubChem records – substances or compounds – using PUG’s <PCT-Download> sub-object. You will need to specify which records to download, using a <PCT-QueryUids> object, the desired output format (ASN.1, XML, or SDF), and, optionally, the desired compression method (gzip or bzip2). The options available through PUG are equivalent to those for the interactive PubChem Download service.

The <PCT-QueryUids> object enables you to specify an explicit list of record IDs, or to provide an existing Entrez history key (see eUtils documentation: http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html) from either the PubChem Compound (“pccompound”) or the PubChem Substance (“pcsubstance”) Entrez databases. Currently there is an upper limit of 250,000 structures per download request; if you find this limit too restrictive for your purposes, please consider using the PubChem FTP site, which contains all available PubChem contents: ftp://ftp.ncbi.nlm.nih.gov/pubchem/

When your download request is successfully completed, the returned <PCT-OutputData> object will hold a <PCT-Download-URL> containing the URL you may use to download your results. Again, please note that the result of a download task is a URL, not the record data itself (which may be quite large). To obtain the data requested, you must use the provided URL.

Example: You want to download CID 1 and CID 99 – being uids 1 and 99 in the “pccompound” Entrez database – in SDF format with gzip compression. The typical flow of information is as follows. First, the initial input XML is sent to PUG via HTTP POST. Note the input data container with the download request and uid and format options:
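A download request of this shape could be generated as in the Python sketch below. Only `<PCT-Data>`, `<PCT-InputData>`, `<PCT-Download>`, and `<PCT-QueryUids>` are named in this document; every other element name, and the enumeration spellings `sdf` and `gzip`, are assumptions drawn from the DTD's naming pattern and should be verified against pug.dtd. `build_download_request` is a hypothetical helper name. The 250,000-record check mirrors the limit stated above.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Optional

MAX_UIDS = 250_000  # upper limit per download request stated in this document


def _descend(parent: ET.Element, *tags: str) -> ET.Element:
    """Create a chain of nested child elements and return the innermost one."""
    node = parent
    for tag in tags:
        node = ET.SubElement(node, tag)
    return node


def build_download_request(uids: Iterable[int], fmt: str = "sdf",
                           compression: Optional[str] = "gzip",
                           db: str = "pccompound") -> str:
    """Build a <PCT-Download> request for an explicit list of record IDs."""
    uid_list = [int(uid) for uid in uids]
    if not uid_list:
        raise ValueError("at least one record ID is required")
    if len(uid_list) > MAX_UIDS:
        raise ValueError(f"at most {MAX_UIDS} records per download request")
    data = ET.Element("PCT-Data")
    # Assumed nesting and tag names; verify against pug.dtd before use.
    download = _descend(data, "PCT-Data_input", "PCT-InputData",
                        "PCT-InputData_download", "PCT-Download")
    id_list = _descend(download, "PCT-Download_uids", "PCT-QueryUids",
                       "PCT-QueryUids_ids", "PCT-ID-List")
    ET.SubElement(id_list, "PCT-ID-List_db").text = db
    uid_parent = ET.SubElement(id_list, "PCT-ID-List_uids")
    for uid in uid_list:
        ET.SubElement(uid_parent, "PCT-ID-List_uids_E").text = str(uid)
    ET.SubElement(download, "PCT-Download_format", {"value": fmt})
    if compression is not None:  # compression is optional
        ET.SubElement(download, "PCT-Download_compression", {"value": compression})
    return ET.tostring(data, encoding="unicode")


# CID 1 and CID 99 in SDF format with gzip compression, as in the example.
request_xml = build_download_request([1, 99])
```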
If the request is small and finishes very quickly, you may get a final URL right away (see further below). But usually PUG will respond initially with a waiting message and a request ID (<PCT-Waiting_reqid>) such as:
You would then parse out this request id, being “402936103567975582”, in this case, and use this id to “poll” PUG on the status of the request, composing an XML message like:
Note that here the request type “status” is used; there is also the request type “cancel” that you may use to cancel a running job. If the request is still running, you will get back another waiting message as above, and then you would poll again after some reasonable interval. If the request is finished, you will get a final result message like:
You would parse out the URL from the <PCT-Download-URL_url> tag, and then use a tool of your choice to connect to that URL to retrieve the actual requested data.
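The polling cycle for a download task can be sketched as a small Python loop. This is a sketch under stated assumptions: `send` is any caller-supplied function that POSTs an XML string to PUG and returns the XML reply as text, the tag names inside the status message other than `<PCT-Request>` are assumed from the DTD's naming pattern, and the five-second interval and the cap of 120 polls are arbitrary choices, since this document only asks for “some reasonable interval”.

```python
import time
import xml.etree.ElementTree as ET
from typing import Callable


def status_message(request_id: str) -> str:
    """Status poll for a request ID (wrapper tag names are assumed)."""
    return (
        "<PCT-Data><PCT-Data_input><PCT-InputData><PCT-InputData_request>"
        f"<PCT-Request><PCT-Request_reqid>{request_id}</PCT-Request_reqid>"
        '<PCT-Request_type value="status"/></PCT-Request>'
        "</PCT-InputData_request></PCT-InputData></PCT-Data_input></PCT-Data>"
    )


def wait_for_download_url(first_reply: str, send: Callable[[str], str],
                          interval: float = 5.0, max_polls: int = 120,
                          sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll PUG until the reply contains a <PCT-Download-URL_url>."""
    reply = first_reply
    for attempt in range(max_polls + 1):
        root = ET.fromstring(reply)
        url_node = next(root.iter("PCT-Download-URL_url"), None)
        if url_node is not None and url_node.text:
            return url_node.text.strip()  # task finished
        reqid_node = next(root.iter("PCT-Waiting_reqid"), None)
        if reqid_node is None or not reqid_node.text:
            raise RuntimeError("reply has neither a download URL nor a request ID")
        if attempt == max_polls:
            break  # give up without sending another poll
        sleep(interval)
        reply = send(status_message(reqid_node.text.strip()))
    raise TimeoutError("PUG task did not finish within the polling limit")
```

Passing `send` and `sleep` in as parameters keeps the loop independent of any particular HTTP library and lets it be exercised without network access.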
The <PCT-Query> object may be used to perform queries against PubChem data that are not possible using Entrez. The <PCT-Query> object consists of a series of queries and a database to query against. One must be careful when formulating queries to select compatible database and query types, as outlined in the documentation of each query task type. For example, you will not want to perform a chemical similarity search on a list of bioassay identifiers (AIDs), since chemical searches may only be performed on compound identifiers (CIDs). The <PCT-Query> object can perform multiple queries in a single request. Multiple queries in a single task are combined with “AND”, which is to say that the resulting list of identifiers will satisfy all queries requested. Sometimes it is best to perform multiple search tasks individually rather than in a single task, unless otherwise noted in the task documentation.
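The “AND” semantics can be pictured as a set intersection over the hit lists that each query would return on its own. The Python sketch below only illustrates that idea locally with made-up identifiers; PUG performs the combination itself on the server, and `combine_with_and` is a hypothetical name.

```python
from typing import Iterable, List


def combine_with_and(hit_lists: Iterable[Iterable[int]]) -> List[int]:
    """Return the identifiers that appear in every hit list, sorted."""
    sets = [set(hits) for hits in hit_lists]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


# Made-up hit lists for three queries; only 2 and 4 satisfy all of them.
hits = combine_with_and([[1, 2, 3, 4], [2, 4, 6], [4, 2, 9]])
```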
3.2.1. Chemical Structure Query Tasks.

To perform PubChem Compound structure searches using PUG, you will need to make a request using a <PCT-Query> object. Chemical structure search tasks use the query objects <PCT-QueryCompoundCS> and <PCT-QueryCompoundEL>. You may submit a structure search by mixing and matching more than one of these two query types in a series. Furthermore, only the PubChem Compound (“pccompound”) Entrez database may be specified in <PCT-QueryUids>, when performing chemical structure queries.

The <PCT-QueryCompoundCS> and <PCT-QueryCompoundEL> objects can encode many different types of chemical structure searches. To help you understand how to encode a structure search, please consider using the PubChem Structure Search web site. It has the ability to translate your structure search query into the XML necessary for use with PUG, and can be very helpful to demonstrate how to encode complex queries. The PubChem Structure Search system is located at the URL: http://pubchem.ncbi.nlm.nih.gov/search/

Please note that the output result of a chemical structure search is an Entrez history key (see eUtils documentation). To obtain the list of compounds matching your query, you must use eUtils; more information on eUtils is below. There is currently a limit of two million compound identifiers returned by the structure search (through either PUG or the interactive web site).
Example: You wish to perform a chemical similarity search of CID 2244 at a Tanimoto similarity value of 80% with at most 300 results returned. The initial HTTP POST to PUG to initiate the search would contain XML like:
If the request is processed and started successfully, PUG would respond with a waiting message and request id, for example:
You would then use this request ID, “271473836860076709” in this case, to “poll” PUG for the status of the request:
If the search is still running, you would get another waiting message as above, and you would then need to poll again after a reasonable interval. If the search task is completed, PUG would give an Entrez history key for the resulting CID (compound identifier) list:
More information on using Entrez history (and eUtils) to retrieve hit lists is below. If for some reason your initial query cannot be properly interpreted, PUG would respond with an error message with some indication of the problem encountered:
If you wish to cancel a queued or running request, you would send to PUG:
And when PUG cancels your task, you would get back:
3.3. PubChem Standardization Tasks.

The PubChem Standardization service allows you to standardize the representation of a chemical structure using a <PCT-Standardize> sub-object. PubChem uses a normalization procedure on all PubChem substance records to remove variation due to different representations of functional groups, tautomeric or resonance forms, etc., to create the PubChem Compound database, which contains the unique chemical structures in the PubChem Substance database. This procedure verifies and validates that a chemical structure is reasonable (to a certain degree) through examination of the atoms and their valence, and involves valence-bond canonicalization processing for tautomer invariance.

The input to structure standardization is a chemical structure and the output is either a failure message or a chemical structure. To use this service, you will need to specify an input structure and its format. You also need to specify the output format you desire. This service operates on only a single structure at a time.

Example: You would like to standardize the representation of guanine input in SMILES format and output in SDF format. The typical flow of information is as follows. First, the initial input XML is sent to PUG via HTTP POST. Note the input data container with the standardization request, the input structure, and the format options:
If the request is small and finishes very quickly, you may get the final result right away (see further below). But usually PUG will respond initially with a waiting message and a request ID (<PCT-Waiting_reqid>) such as:
You would then parse out this request id, being “402936103567975582”, in this case, and use this id to “poll” PUG on the status of the request, composing an XML message like:
Note that here the request type “status” is used; there is also the request type “cancel” that you may use to cancel a running job. If the request is still running, you will get back another waiting message as above, and then you would poll again after some reasonable interval. If the request is finished, you will get a final result message like:
You would parse out the output from the <PCT-Download-URL_url> tag to retrieve the standardized structure.
4. PUG, NCBI eUtils, and Entrez History.

NCBI’s Entrez integrates the scientific literature, DNA and protein sequence databases, 3D protein structure and protein domain data, population study datasets, expression data, assemblies of complete genomes, taxonomic information, and PubChem Compound, Substance, and BioAssay databases (among others) into a tightly interlinked system. It is a retrieval system designed for searching its linked databases. Entrez history provides a record of the searches performed during a search session. PubChem communicates with Entrez history through Entrez Programming Utilities (eUtils) to enhance data analysis.

NCBI’s eUtils are used extensively by PubChem services. Results from queries are often provided in the form of an Entrez history, which represents a list of database specific identifiers within the Entrez search system. These identifiers are, for example, your PubChem CIDs (compound identifiers). This allows you, the user, to interact with other Entrez databases and to perform hit list management tasks using eUtils, e.g., to logically combine the results of different queries using AND, OR, or NOT operations. PubChem services typically accept an Entrez history as a means to provide a subset of identifiers as input, so that your query operates only on a subset of a PubChem database contents. Use of Entrez history can help you avoid sending and receiving (potentially) very large lists of identifiers. To learn more about eUtils, please visit the URL: http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html

Histories in Entrez are database specific. Each time an Entrez search is executed, the search terms, the time the search was executed, and the search results are numbered consecutively and saved automatically in Entrez history for that database. The history can be recalled at any time during a search session, but histories are lost after 8 hours of inactivity.
There is also a running limit of 100 searches (across all databases) saved in any given session.

PUG is integrated with Entrez in that it may use Entrez history keys (also known as “webenv” keys) as both input and output, depending on the task. For example, structure search via PUG may return an Entrez history, and the resulting hit list can be retrieved as a list of CIDs using Entrez’s eFetch utility. PUG can also take a history key as input, if you wanted to download the records resulting from either a prior structure search or a programmatic Entrez search via the eSearch utility. (See the eUtils documentation for more details.)

Entrez histories are referred to programmatically by the trio of a database name, a WebEnv string, and a query key number. You can see this in the example structure search above. The part of PUG’s response that contains this information is the <PCT-Entrez> tag:
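Reading that trio out of a PUG reply might look like the Python sketch below. The child element names `PCT-Entrez_db`, `PCT-Entrez_query-key`, and `PCT-Entrez_webenv` are assumptions based on the DTD's naming pattern and should be checked against pug.dtd; `EntrezHistory` and `extract_entrez_history` are hypothetical names, and the WebEnv string in the sample is a placeholder rather than a real key.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple, Optional


class EntrezHistory(NamedTuple):
    db: str         # Entrez database name, e.g. "pccompound"
    query_key: str  # query key number within the history
    webenv: str     # WebEnv string identifying the history


# Placeholder fragment; child tag names are assumed, not quoted from this document.
SAMPLE_ENTREZ = """\
<PCT-Data>
  <PCT-Entrez>
    <PCT-Entrez_db>pccompound</PCT-Entrez_db>
    <PCT-Entrez_query-key>1</PCT-Entrez_query-key>
    <PCT-Entrez_webenv>EXAMPLE_WEBENV</PCT-Entrez_webenv>
  </PCT-Entrez>
</PCT-Data>
"""


def extract_entrez_history(reply_xml: str) -> Optional[EntrezHistory]:
    """Return the Entrez history trio from a reply, or None if it has none."""
    root = ET.fromstring(reply_xml)
    entrez = next(root.iter("PCT-Entrez"), None)
    if entrez is None:
        return None
    fields = {child.tag: (child.text or "").strip() for child in entrez}
    try:
        return EntrezHistory(
            db=fields["PCT-Entrez_db"],
            query_key=fields["PCT-Entrez_query-key"],
            webenv=fields["PCT-Entrez_webenv"],
        )
    except KeyError as missing:
        raise ValueError(f"<PCT-Entrez> lacks expected child {missing}") from missing
```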
This Entrez history information may be used in a variety of ways. If you want to view these hits on a regular web page, you can direct a browser to a URL as follows, which shows the results in HTML in the usual Entrez docsum format:
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Select+from+History&

On the other hand, if you’re writing an application and want to retrieve the hit list directly via HTTP, you can use eFetch with the same information, which can return the list in XML (with its own DTD/XSD that is not related to PUG’s), for example:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?

Finally, if you want to download the compounds from this search in, e.g., SDF format with gzip compression, you would send PUG a request with the <PCT-Entrez> information instead of an explicit CID list. From here, the download process would continue as in the example above.
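For the eFetch retrieval mentioned above, a URL can be assembled from the history trio as in this Python sketch. The eFetch URL in this document is cut off after the question mark, so the query string here is not the author's original; it uses `db`, `WebEnv`, and `query_key`, which are the standard eUtils parameters for referring to a history. `build_efetch_url` is a hypothetical name and the WebEnv value is a placeholder.

```python
from urllib.parse import urlencode

# Base eFetch URL as given in this document.
EFETCH_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(db: str, webenv: str, query_key: str, **extra: str) -> str:
    """Build an eFetch URL that refers to an Entrez history entry."""
    params = {"db": db, "WebEnv": webenv, "query_key": query_key}
    # Any further eUtils options (see the eUtils documentation) go here.
    params.update(extra)
    return EFETCH_URL + "?" + urlencode(params)


# Placeholder history values for illustration only.
example_url = build_efetch_url("pccompound", "EXAMPLE_WEBENV", "1")
```

`urlencode` takes care of escaping, which matters because real WebEnv strings may contain characters that are not safe in a URL.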
PUG and eUtils together make possible a wide variety of powerful programmatic data analysis tools for PubChem and other Entrez databases.

There is a SOAP wrapper for PUG, the WSDL for which can be found at: http://pubchem.ncbi.nlm.nih.gov/pug/pug.wsdl The interface is very simple, with a single “RunPUG” operation, and a <PCT-Data> element in the body for both request and response envelopes. Otherwise the behavior is the same as regular (non-SOAP) PUG, where the request <PCT-Data> expects to have <PCT-InputData>, and the response <PCT-Data> will contain <PCT-OutputData>.

Document Version History.
V1.1.0 – 2007Jan11 – Added new section on the PubChem Standardization service. Minor cleanup of and additions to the previous documentation.
V1.0.0 – 2007May10 – Initial release.