Wiring ClinicalTrials.gov v2 + openFDA + ChEMBL into a License-Clean Drug Intelligence Endpoint

Pharma R&D, medical-AI startups, and pharmacovigilance teams all want the same thing: one endpoint that pulls trials, labels, adverse events, and bioactivity in a license-clean way. Here's the working architecture, with the trap doors that have surprised teams in production.
TL;DR
- ClinicalTrials.gov v2 (REST + JSON) replaced the legacy v1 in 2024. The schema is cleaner, but pagination, optional fields, and historical drift catch new integrators out.
- openFDA covers drug labels (SPL), FAERS adverse-event reports, and recall data for free. It is rate-limited to 240 req/min, with a daily cap of 1,000 requests without an API key and 120,000 with one.
- ChEMBL gives you bioactivity (IC50, Ki, Kd, EC50), targets, and assays: the structural / mechanistic dimension other databases lack.
- DrugBank's commercial license is the trap: academic use is permitted; any product, even small SaaS, falls under commercial licensing terms most builders don't read until they're served notice.
- API Pick Clinical Search wraps ClinicalTrials, openFDA, ChEMBL, and DrugBank pharmacology in one POST endpoint: 30 credits per call, license-clean, billed only on success.
The shape of the problem
Three audiences end up needing roughly the same drug-data pipeline, for different reasons:
- Biopharma R&D and drug-repurposing teams want bioactivity data (ChEMBL), trial history (ClinicalTrials.gov), and adverse-event signals (FAERS) joined to evaluate a candidate.
- Medical-AI startups building chatbots or clinical decision support layers need drug labels (openFDA SPL) and trial data joined to ground LLM answers in regulatory sources.
- Pharmacovigilance teams want FAERS plus drug-label structured fields plus mechanism information from ChEMBL/DrugBank to assess signal plausibility.
Each of these audiences ends up wiring four databases together: ClinicalTrials.gov, openFDA, ChEMBL, and DrugBank. Each database has its own schema, rate limits, and license terms. DrugBank is the one that bites: its commercial-use clause catches teams that integrated it in development without reading the license, and small-SaaS founders do get served notice.
Here's the architecture we recommend, including a license-clean alternative path that avoids the DrugBank trap.
The four sources, one paragraph each
ClinicalTrials.gov v2
US National Library of Medicine. Definitive registry for US-registered clinical trials and the de facto global standard. v2 launched 2024 — REST + JSON, replacing the legacy v1 CSV/XML. Free, rate-limited (10 req/sec). Documentation at clinicaltrials.gov/data-api/api. Strengths: authoritative, comprehensive, no license issues. Weaknesses: optional-field sparseness for older studies, schema migration friction for teams still on v1.
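Pagination is one of the places new integrators slip. v2 pages with a token: a response may carry a `nextPageToken`, which you send back as `pageToken` until it no longer appears. Here is a minimal sketch of that loop. The function names (`fetch_page_http`, `iter_studies`) and the `max_pages` cap are illustrative assumptions, and the fetcher is injectable so the loop can be exercised without touching the network.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator, Optional

BASE_URL = "https://clinicaltrials.gov/api/v2/studies"


def fetch_page_http(params: dict, page_token: Optional[str]) -> dict:
    """Fetch one page from the v2 studies endpoint (performs a real HTTP call)."""
    query = dict(params)
    if page_token:
        query["pageToken"] = page_token
    url = f"{BASE_URL}?{urllib.parse.urlencode(query)}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def iter_studies(
    params: dict,
    fetch_page: Callable[[dict, Optional[str]], dict] = fetch_page_http,
    max_pages: int = 100,
) -> Iterator[dict]:
    """Yield studies across pages, following nextPageToken until it runs out."""
    token = None
    for _ in range(max_pages):  # hard cap so a bad token cannot loop forever
        page = fetch_page(params, token)
        yield from page.get("studies", [])
        token = page.get("nextPageToken")
        if not token:
            break
```

Calling `iter_studies({"query.cond": "non-small cell lung cancer", "pageSize": 100})` would walk the full result set lazily; tune `max_pages` to the volume you expect.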
openFDA
FDA-run public API. Covers drug labels (SPL, Structured Product Labeling), FAERS (FDA Adverse Event Reporting System), recall data, and food/device equivalents. Free, rate-limited to 240 req/min; the daily cap is 1,000 requests without an API key and 120,000 with one. Strengths: authoritative regulatory source, structured data, broad coverage. Weaknesses: SPL parsing requires understanding HL7 conventions; FAERS deduplication is the user's problem.
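Hitting the rate limit is routine once a pipeline runs in batch, so it helps to wrap calls in a retry. The sketch below assumes throttling is signalled with HTTP 429 and retries with exponential backoff. `call_with_backoff` and its `send` callable are hypothetical names; `send` is whatever function issues your request and returns `(status_code, json_body)`.

```python
import time
from typing import Callable, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, dict]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Retry `send` on HTTP 429 with exponential backoff; return the JSON body."""
    for attempt in range(max_attempts):
        status, body = send()
        if status == 200:
            return body
        if status != 429:
            # Anything other than throttling is not worth retrying blindly
            raise RuntimeError(f"request failed with HTTP {status}")
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("still rate-limited after retries")
```

In practice `send` would wrap a GET to `https://api.fda.gov/drug/event.json` with your key passed as the `api_key` query parameter. Injecting `sleep` keeps the function testable without real waiting.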
ChEMBL
EBI / EMBL-EBI. Curated bioactivity database — IC50, Ki, Kd, EC50 measurements across compounds, targets, and assays. Free, REST + JSON, no rate-limit headaches at moderate volume. Strengths: structural and mechanistic data nothing else covers. Weaknesses: focus is research-grade; therapeutic/clinical mappings are partial.
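Potency values are easier to compare on a log scale. pIC50 is the negative base-10 log of the IC50 in molar units, which works out to 9 minus log10 of the value in nM. A small sketch, assuming activity records shaped like ChEMBL's `standard_value` / `standard_units` fields; the `pic50` helper and the unit table are illustrative, and some ChEMBL records already carry a precomputed `pchembl_value` you may prefer.

```python
import math
from typing import Optional

# Multipliers to convert common concentration units to nanomolar
_TO_NM = {"pM": 1e-3, "nM": 1.0, "uM": 1e3, "mM": 1e6, "M": 1e9}


def pic50(activity: dict) -> Optional[float]:
    """Return pIC50 for one activity record, or None if it cannot be computed."""
    value = activity.get("standard_value")
    factor = _TO_NM.get(activity.get("standard_units"))
    if value is None or factor is None:
        return None
    try:
        nanomolar = float(value) * factor  # values often arrive as strings
    except (TypeError, ValueError):
        return None
    if nanomolar <= 0:
        return None  # log is undefined for zero or negative concentrations
    return 9.0 - math.log10(nanomolar)
```

Records with missing values or unrecognised units return `None` rather than raising, so they can be filtered out in one pass.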
DrugBank
University of Alberta-originated, now commercial. Drug-target mappings, pharmacology, drug-drug interactions, polypharmacology. Academic use is free; commercial use requires a paid license. The license applies to any commercial product, including free SaaS tools — read the terms before integrating.
API Pick Clinical Search (license-clean alternative)
Semantic search across ClinicalTrials.gov, FDA drug labels, ChEMBL bioactivity, and DrugBank pharmacology metadata that we license. JSON in / JSON out, 30 credits per call (~$0.03), only-on-success billing. Output is consistent with regulatory and structural data terms; no commercial-license trap for end users.
Side by side
| | ClinicalTrials.gov v2 | openFDA | ChEMBL | DrugBank | API Pick Clinical |
|---|---|---|---|---|---|
| Coverage | Trials registry | Labels + FAERS + recalls | Bioactivity, targets, assays | Drugs + targets + interactions | All four, semantic |
| Format | REST + JSON | REST + JSON | REST + JSON | REST + JSON / SQL dumps | JSON, snippets pre-shaped |
| Rate limit | 10 req/sec | 240/min; 1,000/day without key, 120k/day with key | Generous | License-tier dependent | Per-call (no per-user) |
| License | Public domain | Public domain | CC-BY-SA | Academic free / commercial paid | API Pick TOS |
| Best fit | Trial protocols, sponsor disambig | Regulatory labels, AE signals | Mechanistic / structural | Drug-drug interactions, polypharm | AI-agent retrieval over all |
Working code: each source
ClinicalTrials.gov v2
```python
import requests

# Trials for a specific condition + intervention
r = requests.get(
    "https://clinicaltrials.gov/api/v2/studies",
    params={
        "query.cond": "non-small cell lung cancer",
        "query.intr": "pembrolizumab",
        "filter.overallStatus": "RECRUITING",
        "pageSize": 25,
        "format": "json",
    },
    timeout=30,
)
r.raise_for_status()
studies = r.json().get("studies", [])

for s in studies[:3]:
    proto = s["protocolSection"]
    ident = proto["identificationModule"]
    nct = ident["nctId"]
    title = ident.get("briefTitle", "n/a")
    # Sponsor fields can be missing on older records, so read them defensively
    sponsor = (
        proto.get("sponsorCollaboratorsModule", {})
        .get("leadSponsor", {})
        .get("name", "n/a")
    )
    print(f"{nct}: {title} (sponsor: {sponsor})")
```

openFDA: drug label + FAERS signal
```python
import requests

# Drug label lookup
resp = requests.get(
    "https://api.fda.gov/drug/label.json",
    params={"search": "openfda.brand_name:Lipitor", "limit": 1},
    timeout=30,
)
resp.raise_for_status()
label = resp.json()["results"][0]
print("Indications:", label.get("indications_and_usage", ["n/a"])[0][:200])

# FAERS: most reported adverse events for atorvastatin
resp = requests.get(
    "https://api.fda.gov/drug/event.json",
    params={
        "search": 'patient.drug.medicinalproduct:"ATORVASTATIN CALCIUM"',
        "count": "patient.reaction.reactionmeddrapt.exact",
        "limit": 10,
    },
    timeout=30,
)
resp.raise_for_status()
print("Top reported reactions:")
for row in resp.json()["results"]:
    print(f"  {row['term']}: {row['count']}")
```

ChEMBL: target bioactivity
```python
import requests

# Target search → activity for a specific target
resp = requests.get(
    "https://www.ebi.ac.uk/chembl/api/data/activity.json",
    params={
        # CHEMBL204 is thrombin in ChEMBL, not PD-L1. Look up the ID for your
        # own target first via /chembl/api/data/target/search.json?q=...
        "target_chembl_id": "CHEMBL204",
        "standard_type": "IC50",
        "limit": 25,
    },
    timeout=30,
)
resp.raise_for_status()

for a in resp.json()["activities"][:5]:
    cid = a["molecule_chembl_id"]
    val = a["standard_value"]  # may be None for some records
    unit = a["standard_units"]
    print(f"{cid}: IC50 = {val} {unit}")
```

API Pick Clinical Search: one call, all sources
```python
import requests

resp = requests.post(
    "https://www.apipick.com/api/search/clinical",
    headers={"x-api-key": "pk_yourkey"},
    json={"query": "PD-L1 inhibitors in NSCLC trials and adverse events"},
    timeout=30,
)
resp.raise_for_status()

for hit in resp.json()["results"][:5]:
    print(hit["title"], "→", hit["url"], f"(source: {hit.get('source')})")

# Returns ranked semantic matches across trials + labels + bioactivity.
# 30 credits per call, only on HTTP 200.
```

Three patterns that come up in production
1. Drug-repurposing screening
Take an approved drug. Pull its mechanism (ChEMBL targets), its current indications (openFDA label), and any trials testing it in new indications (ClinicalTrials.gov). Cross-reference with FAERS for safety signals in the new indication. The agent assembles all four pieces and surfaces candidates worth a pharmacologist's time.
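The join between label and trials can start very simply. The sketch below lists trial conditions that do not appear in the label's indications text, along with the NCT IDs testing them. It assumes openFDA-style label records (`indications_and_usage` as a list of strings) and v2-style study records (`protocolSection.conditionsModule.conditions`). The function name is hypothetical, and plain substring matching is a deliberately naive stand-in for proper terminology mapping.

```python
from typing import Dict, List


def off_label_trial_conditions(label: dict, studies: List[dict]) -> Dict[str, List[str]]:
    """Map each trial condition absent from the label text to the NCT IDs testing it."""
    label_text = " ".join(label.get("indications_and_usage", [])).lower()
    candidates: Dict[str, List[str]] = {}
    for study in studies:
        proto = study.get("protocolSection", {})
        nct = proto.get("identificationModule", {}).get("nctId", "unknown")
        conditions = proto.get("conditionsModule", {}).get("conditions", [])
        for cond in conditions:
            # Naive check: a condition counts as on-label if its name appears verbatim
            if cond.lower() not in label_text:
                candidates.setdefault(cond, []).append(nct)
    return candidates
```

Synonyms and spelling variants will produce false positives here, so treat the output as a list for a pharmacologist to review, not as a finding.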
2. Pharmacovigilance signal triage
Hourly cron pulls new FAERS reports for a watchlist of drugs. Compute Reporting Odds Ratio against the rest of the database. Flag any signals where ROR > 2 with 95% CI excluding 1. Pair with ClinicalTrials.gov to check whether the indication-of-use is on-label or off-label. Output a ranked list for the team's morning review — analogous to the morning-briefing pattern for news.
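The ROR itself is a few lines of arithmetic over a 2x2 table of report counts: ROR = (a·d)/(b·c), with a 95% confidence interval of exp(ln ROR ± 1.96·SE), where SE = sqrt(1/a + 1/b + 1/c + 1/d). A minimal sketch of the flagging rule described above; the names `reporting_odds_ratio` and `RorResult` are illustrative, and the function assumes all four cells are non-zero.

```python
import math
from typing import NamedTuple


class RorResult(NamedTuple):
    ror: float
    ci_low: float
    ci_high: float
    flagged: bool


def reporting_odds_ratio(a: int, b: int, c: int, d: int, threshold: float = 2.0) -> RorResult:
    """
    a: reports with the drug and the event
    b: reports with the drug, without the event
    c: reports without the drug, with the event
    d: reports without the drug and without the event
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for this simple estimate")
    ror = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of ln(ROR)
    ci_low = math.exp(math.log(ror) - 1.96 * se)
    ci_high = math.exp(math.log(ror) + 1.96 * se)
    # Flag when the point estimate clears the threshold and the CI excludes 1
    return RorResult(ror, ci_low, ci_high, ror > threshold and ci_low > 1.0)
```

For example, with a=20, b=80, c=100, d=9800 the ROR is 24.5 and the signal is flagged. Small cell counts give wide intervals, which is why the confidence-interval check matters as much as the point estimate.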
3. AI medical assistant grounding
For any drug-related answer the assistant gives, pull the openFDA label and use it as the authoritative ground truth. Cite the FDA label section explicitly. Refuse to answer dose-related questions when the label can't be retrieved. This is the citation-grounded pattern from the UK case law write-up applied to medicine — with even higher stakes.
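The refusal rule is worth encoding outside the model, so it cannot be talked out of it. Below is a sketch that returns label text with a citation, or a refusal when the dosing section is unavailable. It assumes openFDA-style label records (`dosage_and_administration` and `openfda.brand_name` as lists of strings). The function name, return shape, and refusal wording are all illustrative.

```python
from typing import Optional

REFUSAL = "I can't answer dosing questions without the current FDA label."


def grounded_dose_answer(label: Optional[dict]) -> dict:
    """Return label text with a citation, or a refusal when no dosing section exists."""
    sections = (label or {}).get("dosage_and_administration") or []
    if not sections:
        # No label, or a label without the section: refuse instead of guessing
        return {"answer": None, "refusal": REFUSAL, "citation": None}
    brand = ((label.get("openfda") or {}).get("brand_name") or ["unknown product"])[0]
    return {
        "answer": sections[0],
        "refusal": None,
        "citation": f"FDA label for {brand}, section: Dosage and Administration",
    }
```

The assistant would then quote `answer` verbatim alongside `citation`, and surface `refusal` unchanged when retrieval fails.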
The DrugBank trap
Worth re-emphasising. The DrugBank academic license is well-publicised but its terms shift the moment you charge anyone money for anything that uses the data — including a free product whose users you intend to convert to paid later. Several small SaaS founders have discovered this the hard way after a notice landed in their inbox.
Two clean paths:
- Pay the commercial license. Standard pricing is opaque; expect to negotiate. For mature products with funding this is the right answer because DrugBank's drug-drug interaction data is hard to match.
- Use license-clean alternatives for early stages. ChEMBL covers most mechanistic data. RxNorm + DailyMed (NIH) cover drug-name normalisation and labels. FAERS covers adverse events. The combination misses some DrugBank-specific data (rich interaction tables, polypharmacology) but is enough for most early-stage products. API Pick Clinical Search wraps the license-clean subset for you.
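One way to keep license discipline from depending on memory is to encode it in the pipeline. The sketch below is a small source registry with a gate that excludes commercially restricted sources unless a license has been recorded. The license labels mirror the comparison table in this article and should be checked against each provider's current terms. The `Source` class and `allowed_sources` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Source:
    name: str
    license: str
    commercial_ok: bool  # whether the default terms allow commercial use


# Labels follow the article's comparison table; verify before relying on them.
SOURCES = [
    Source("ClinicalTrials.gov v2", "Public domain", True),
    Source("openFDA", "Public domain", True),
    Source("ChEMBL", "CC-BY-SA", True),  # attribution and share-alike duties still apply
    Source("DrugBank", "Academic free / commercial paid", False),
]


def allowed_sources(commercial: bool, licensed: FrozenSet[str] = frozenset()) -> List[str]:
    """List the sources a build may query, given its commercial status and paid licenses."""
    return [
        s.name
        for s in SOURCES
        if not commercial or s.commercial_ok or s.name in licensed
    ]
```

A commercial build calls `allowed_sources(True)` and gets the three open sources; adding `frozenset({"DrugBank"})` once the license is signed turns the fourth on without touching the rest of the code.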
Where this generalises
The 'wire four public databases together with rate-limit handling and license discipline' pattern shows up in many regulated verticals — financial filings (SEC + earnings transcripts + equity stats), patents (USPTO + EPO + WIPO + JPO + KIPO + CNIPA), legal (Find Case Law + legislation.gov.uk + foreign equivalents). The drug-data version is unusual mostly because the license footing is more contested. Every other axis — schema diversity, rate limits, deduplication, cross-source identifier mapping — generalises.
For one-call retrieval across the license-clean drug data sources, API Pick Clinical Search does the wiring. For the deeper integrations (full SPL parsing, FAERS signal computation, ChEMBL target trees), you still go to each source directly. Pick the right level of abstraction for each part of the pipeline.
Frequently Asked Questions
What changed in ClinicalTrials.gov v2 that breaks pipelines?
Three things. (1) Endpoint structure — v2 is REST + JSON instead of v1's CSV/XML. (2) Field names — the new protocolSection wrapper and snake_case-to-camelCase changes are the most common refactor. (3) Optional-field population — many fields documented as 'available' are sparsely populated, especially for older studies. Migration usually takes 2-3 days plus a week of bug-fixing as edge-case studies turn up.
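Most of the bug-fixing week goes on `KeyError`s from sparsely populated records. A tiny helper that walks a dotted path and returns a default on any gap removes most of them. This is a sketch; `dig` is an illustrative name, not part of any client library.

```python
from typing import Any


def dig(record: dict, path: str, default: Any = None) -> Any:
    """Walk a dotted path through nested dicts, returning `default` on any gap."""
    current: Any = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current
```

For instance, `dig(study, "protocolSection.sponsorCollaboratorsModule.leadSponsor.name", "n/a")` returns the sponsor name when present and `"n/a"` for records that lack the module.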
What's the deal with DrugBank licensing?
DrugBank is free for academic and personal research. Any commercial use, including a free SaaS product, a startup's MVP, or a tool used in a paid consultancy, falls under DrugBank's commercial licensing terms. The Thinklab post 'Sounding the alarm on DrugBank's new license' from a few years back is still the canonical write-up. Many builders integrate DrugBank in development without realising that the license applies the moment they ship a product. Read the terms before integrating, or use a license-clean alternative.
How do I do basic pharmacovigilance signal detection?
The standard disproportionality measures are the Reporting Odds Ratio (ROR), the Proportional Reporting Ratio (PRR), and the Bayesian BCPNN, all computed over the FAERS adverse-event database. Open-source libraries like vigipy implement these. The trap is FAERS deduplication: many reports are duplicate filings of the same case from different parties. Tools like the WHO's VigiMatch handle this, but you pay for it. For most med-AI use cases, openFDA's FAERS endpoint plus a simple ROR calculation is enough for surfacing signals worth investigating.
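If you cannot pay for proper deduplication, a crude fingerprint catches the most obvious repeats. The sketch below keys each report on patient sex, onset age, country, and the sets of drugs and reactions, using openFDA drug-event field names. It is a naive heuristic and not a reimplementation of VigiMatch; the function names are illustrative.

```python
from typing import Iterable, List, Tuple


def case_fingerprint(report: dict) -> Tuple:
    """Build a hashable key from fields that duplicate filings tend to share."""
    patient = report.get("patient", {})
    drugs = frozenset(
        d.get("medicinalproduct", "").strip().upper() for d in patient.get("drug", [])
    )
    reactions = frozenset(
        r.get("reactionmeddrapt", "").strip().upper() for r in patient.get("reaction", [])
    )
    return (
        patient.get("patientsex"),
        patient.get("patientonsetage"),
        report.get("occurcountry"),
        drugs,
        reactions,
    )


def dedupe_reports(reports: Iterable[dict]) -> List[dict]:
    """Keep the first report seen for each fingerprint, preserving input order."""
    seen = set()
    unique = []
    for report in reports:
        key = case_fingerprint(report)
        if key not in seen:
            seen.add(key)
            unique.append(report)
    return unique
```

Reports with many missing fields will collapse into each other under this key, so it is safer to skip deduplication for records that lack age or sex than to trust the match.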
Is API Pick Clinical Search HIPAA-compliant?
The data sources we wrap (ClinicalTrials.gov, openFDA, ChEMBL, DrugBank pharmacology metadata) contain no protected health information — they cover trial protocols, drug labels, adverse-event aggregates, and structural/bioactivity data. HIPAA compliance applies to PHI, which doesn't appear in our index. If you're building a downstream product that does handle PHI (e.g. clinical decision support over patient records), you'll need to handle that separately. The data flowing through our endpoint is consistent with public regulatory data.
Can output from this be used for clinical decisions?
No. Output from any retrieval API is informational; it does not constitute medical advice or clinical decision support. Use the data to support qualified personnel — a pharmacist, physician, regulatory specialist — not to replace them. This applies to API Pick Clinical Search and to any other API in this space.
Sarah Choy is the CEO of API Pick. She writes about building production-ready APIs for AI agents and LLM workflows.