Wiring ClinicalTrials.gov v2 + openFDA + ChEMBL into a License-Clean Drug Intelligence Endpoint

Sarah Choy | Published May 3, 2026 | 11 min read

Pharma R&D, medical-AI startups, and pharmacovigilance teams all want the same thing: one endpoint that pulls trials, labels, adverse events, and bioactivity in a license-clean way. Here's the working architecture, with the trap doors that have surprised teams in production.

TL;DR

  • ClinicalTrials.gov v2 (REST + JSON) replaced the legacy v1 in 2024 — schema is cleaner but pagination, optional fields, and historical drift catch new integrators out.
  • openFDA covers drug labels (SPL), FAERS adverse-event reports, and recall data for free; it is rate-limited to 240 req/min, with a daily cap of 1,000 requests without an API key and 120,000 with one.
  • ChEMBL gives you bioactivity (IC50, Ki, Kd, EC50), targets, and assays — the structural / mechanistic dimension other databases lack.
  • DrugBank's commercial license is the trap: academic use is permitted; any product, even small SaaS, falls under commercial licensing terms most builders don't read until they're served notice.
  • API Pick Clinical Search wraps ClinicalTrials.gov, openFDA, ChEMBL, and DrugBank pharmacology in one POST endpoint: 30 credits per call, license-clean, billed only on success.

The shape of the problem

Three audiences end up needing roughly the same drug-data pipeline, for different reasons:

  • Biopharma R&D and drug-repurposing teams want bioactivity data (ChEMBL), trial history (ClinicalTrials.gov), and adverse-event signals (FAERS) joined to evaluate a candidate.
  • Medical-AI startups building chatbots or clinical decision support layers need drug labels (openFDA SPL) and trial data joined to ground LLM answers in regulatory sources.
  • Pharmacovigilance teams want FAERS plus drug-label structured fields plus mechanism information from ChEMBL/DrugBank to assess signal plausibility.

Each of these audiences ends up wiring four databases together: ClinicalTrials.gov, openFDA, ChEMBL, and DrugBank. Each database has its own schema, rate limits, and license terms. DrugBank is the one that bites: its commercial-use clause catches teams that integrated it during development without reading the license, and small-SaaS founders do get served notice.

Here's the architecture we recommend, including a license-clean alternative path that avoids the DrugBank trap.

The four sources, one paragraph each

ClinicalTrials.gov v2

US National Library of Medicine. Definitive registry for US-registered clinical trials and the de facto global standard. v2 launched 2024 — REST + JSON, replacing the legacy v1 CSV/XML. Free, rate-limited (10 req/sec). Documentation at clinicaltrials.gov/data-api/api. Strengths: authoritative, comprehensive, no license issues. Weaknesses: optional-field sparseness for older studies, schema migration friction for teams still on v1.

openFDA

FDA-run public API. Covers drug labels (SPL, Structured Product Labeling), FAERS (the FDA Adverse Event Reporting System), recall data, and food/device equivalents. Free; rate-limited to 240 req/min, with a daily cap of 1,000 requests without an API key and 120,000 with one. Strengths: authoritative regulatory source, structured data, broad coverage. Weaknesses: SPL parsing requires understanding HL7 conventions; FAERS deduplication is the user's problem.

ChEMBL

EMBL-EBI (the European Bioinformatics Institute). Curated bioactivity database with IC50, Ki, Kd, and EC50 measurements across compounds, targets, and assays. Free, REST + JSON, no rate-limit headaches at moderate volume. Strengths: structural and mechanistic data nothing else covers. Weaknesses: focus is research-grade; therapeutic/clinical mappings are partial.

DrugBank

University of Alberta-originated, now commercial. Drug-target mappings, pharmacology, drug-drug interactions, polypharmacology. Academic use is free; commercial use requires a paid license. The license applies to any commercial product, including free SaaS tools — read the terms before integrating.

API Pick Clinical Search (license-clean alternative)

Semantic search across ClinicalTrials.gov, FDA drug labels, ChEMBL bioactivity, and DrugBank pharmacology metadata that we license. JSON in / JSON out, 30 credits per call (~$0.03), only-on-success billing. Output is consistent with regulatory and structural data terms; no commercial-license trap for end users.

Side by side

Snapshot at the time of writing. Verify current rate limits and licensing before commercial integration.
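|            | ClinicalTrials.gov v2             | openFDA                                            | ChEMBL                        | DrugBank                          | API Pick Clinical           |
|------------|-----------------------------------|----------------------------------------------------|-------------------------------|-----------------------------------|-----------------------------|
| Coverage   | Trials registry                   | Labels + FAERS + recalls                           | Bioactivity, targets, assays  | Drugs + targets + interactions    | All four, semantic          |
| Format     | REST + JSON                       | REST + JSON                                        | REST + JSON                   | REST + JSON / SQL dumps           | JSON, snippets pre-shaped   |
| Rate limit | 10 req/sec                        | 240/min; 1,000/day without key, 120k/day with key  | Generous                      | License-tier dependent            | Per-call (no per-user)      |
| License    | Public domain                     | Public domain                                      | CC-BY-SA                      | Academic free / commercial paid   | API Pick TOS                |
| Best fit   | Trial protocols, sponsor disambig | Regulatory labels, AE signals                      | Mechanistic / structural      | Drug-drug interactions, polypharm | AI-agent retrieval over all |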

Working code: each source

ClinicalTrials.gov v2

import requests

# Recruiting trials for a specific condition + intervention
resp = requests.get(
    "https://clinicaltrials.gov/api/v2/studies",
    params={
        "query.cond": "non-small cell lung cancer",
        "query.intr": "pembrolizumab",
        "filter.overallStatus": "RECRUITING",
        "pageSize": 25,
        "format": "json",
    },
    timeout=30,
)
resp.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error body
studies = resp.json().get("studies", [])
for s in studies[:3]:
    proto = s["protocolSection"]
    ident = proto["identificationModule"]
    # Sponsor fields are optional, so fall back rather than raise KeyError
    sponsor = (
        proto.get("sponsorCollaboratorsModule", {})
        .get("leadSponsor", {})
        .get("name", "unknown")
    )
    print(f"{ident['nctId']}: {ident.get('briefTitle', '')} (sponsor: {sponsor})")
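
The request above reads only the first page. As far as we understand the v2 API, a response that has more results carries a `nextPageToken`, which you send back as the `pageToken` parameter. A minimal sketch of that loop follows; `iter_studies` and `fetch_page` are hypothetical helper names, and the fetch function is injectable so the loop can be exercised without network access.

```python
import json
from typing import Callable, Iterator, Optional
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://clinicaltrials.gov/api/v2/studies"


def fetch_page(params: dict) -> dict:
    """Fetch one page of results from the live API (standard library only)."""
    with urlopen(f"{API_URL}?{urlencode(params)}", timeout=30) as resp:
        return json.load(resp)


def iter_studies(
    params: dict,
    fetch: Callable[[dict], dict] = fetch_page,
    max_pages: Optional[int] = None,
) -> Iterator[dict]:
    """Yield studies across pages by following nextPageToken."""
    query = dict(params)  # copy so the caller's dict is not mutated
    pages = 0
    while max_pages is None or pages < max_pages:
        payload = fetch(query)
        yield from payload.get("studies", [])
        pages += 1
        token = payload.get("nextPageToken")
        if not token:
            break  # last page: the API returned no token
        query["pageToken"] = token
```

Passing `max_pages` is a cheap guard against walking tens of thousands of studies by accident while you are still tuning a query.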

openFDA: drug label + FAERS signal

import requests

# Drug label lookup
resp = requests.get(
    "https://api.fda.gov/drug/label.json",
    params={"search": "openfda.brand_name:Lipitor", "limit": 1},
    timeout=30,
)
resp.raise_for_status()  # openFDA answers 404 when a search matches nothing
label = resp.json()["results"][0]
# Label sections are lists of strings; show the first 200 characters
print("Indications:", label.get("indications_and_usage", ["n/a"])[0][:200])

# FAERS: most-reported adverse events for atorvastatin
resp = requests.get(
    "https://api.fda.gov/drug/event.json",
    params={
        "search": 'patient.drug.medicinalproduct:"ATORVASTATIN CALCIUM"',
        "count": "patient.reaction.reactionmeddrapt.exact",
        "limit": 10,
    },
    timeout=30,
)
resp.raise_for_status()
print("Top reported reactions:")
for row in resp.json()["results"]:
    print(f"  {row['term']}: {row['count']}")
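
A loop over a watchlist of drugs can hit openFDA's 240 req/min ceiling quickly. One simple way to stay under it is to space requests client-side. The sketch below is an illustration, not an openFDA feature: `Throttle` is a hypothetical helper, and the clock and sleep functions are injectable so the behaviour can be checked without real waiting.

```python
import time
from typing import Callable


class Throttle:
    """Space calls at least 60 / per_minute seconds apart."""

    def __init__(
        self,
        per_minute: int,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0  # earliest time the next call may start

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay:
            self._sleep(delay)
        # Schedule the following slot relative to when this call actually runs.
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return delay
```

Create one `Throttle(240)` per process and call `wait()` immediately before each openFDA request. This caps the per-minute rate only; the daily quota still needs its own counter.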

ChEMBL: target bioactivity

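import requests

# Activities for one target. Look up the target's ChEMBL ID first
# (for example in ChEMBL's target search), then filter activities by it.
resp = requests.get(
    "https://www.ebi.ac.uk/chembl/api/data/activity.json",
    params={
        "target_chembl_id": "CHEMBL204",      # thrombin (substitute your target's ChEMBL ID)
        "standard_type": "IC50",
        "limit": 25,
    },
    timeout=30,
)
resp.raise_for_status()
for a in resp.json()["activities"][:5]:
    cid = a["molecule_chembl_id"]
    val = a["standard_value"]    # may be None for unstandardised records
    unit = a["standard_units"]
    print(f"{cid}: IC50 = {val} {unit}")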
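
IC50 values come back with mixed units and occasional nulls, so comparing compounds usually means putting them on one scale first. A common choice is pIC50, the negative base-10 logarithm of the IC50 in mol/L. The sketch below assumes `standard_value` arrives as a numeric string and that `standard_units` is one of a handful of molar units; `pic50` and the unit table are hypothetical helpers, and anything outside the table is skipped rather than guessed.

```python
import math
from typing import Optional

# Multipliers that convert common molar units to mol/L.
# Assumption: only these units are handled; others return None.
_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def pic50(standard_value: Optional[str], standard_units: Optional[str]) -> Optional[float]:
    """Return -log10(IC50 in mol/L), or None when the record is unusable."""
    if standard_value is None or standard_units not in _TO_MOLAR:
        return None
    value = float(standard_value)
    if value <= 0:
        return None  # log10 is undefined for zero or negative concentrations
    return -math.log10(value * _TO_MOLAR[standard_units])
```

For example, an IC50 of 10 nM is 1e-8 mol/L, which gives a pIC50 of 8. Higher pIC50 means a more potent compound.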

API Pick Clinical Search: one call, all sources

import requests

resp = requests.post(
    "https://www.apipick.com/api/search/clinical",
    headers={"x-api-key": "pk_yourkey"},
    json={"query": "PD-L1 inhibitors in NSCLC trials and adverse events"},
    timeout=30,
)
resp.raise_for_status()
for hit in resp.json()["results"][:5]:
    print(hit["title"], "→", hit["url"], f"(source: {hit.get('source')})")
# Returns ranked semantic matches across trials + labels + bioactivity.
# 30 credits per call, only on HTTP 200.

Three patterns that come up in production

1. Drug-repurposing screening

Take an approved drug. Pull its mechanism (ChEMBL targets), its current indications (openFDA label), and any trials testing it in new indications (ClinicalTrials.gov). Cross-reference with FAERS for safety signals in the new indication. The agent assembles all four pieces and surfaces candidates worth a pharmacologist's time.
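
The "new indications" step can be sketched as a comparison between the conditions listed on a drug's trials and the indications text on its label. The version below is deliberately naive: it uses case-insensitive substring matching, whereas a production pipeline would more likely map both sides to a controlled vocabulary. `candidate_indications` is a hypothetical helper name.

```python
from typing import Iterable, List


def candidate_indications(
    label_indications: str, trial_conditions: Iterable[str]
) -> List[str]:
    """Return trial conditions that the label's indications text does not mention."""
    label_text = label_indications.lower()
    seen = set()
    candidates = []
    for cond in trial_conditions:
        key = cond.strip().lower()
        # Skip blanks, repeats, and anything already named on the label.
        if key and key not in seen and key not in label_text:
            seen.add(key)
            candidates.append(cond.strip())
    return candidates
```

Each surviving condition is a lead to check against FAERS and mechanism data, not a finding in itself.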

2. Pharmacovigilance signal triage

Hourly cron pulls new FAERS reports for a watchlist of drugs. Compute the Reporting Odds Ratio (ROR) for each drug-event pair against the rest of the database. Flag any pair where ROR > 2 and the lower bound of the 95% confidence interval is above 1. Pair with ClinicalTrials.gov to check whether the indication of use is on-label or off-label. Output a ranked list for the team's morning review, analogous to the morning-briefing pattern for news.
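
The ROR comes from a 2x2 table of report counts: a (drug and event), b (drug, other events), c (other drugs, event), d (other drugs, other events). A minimal sketch of the calculation follows, using the usual log-normal approximation for the confidence interval. `reporting_odds_ratio` and `RorResult` are hypothetical names, and the default thresholds simply mirror the rule described above.

```python
import math
from typing import NamedTuple


class RorResult(NamedTuple):
    ror: float
    ci_low: float
    ci_high: float
    flagged: bool


def reporting_odds_ratio(
    a: int, b: int, c: int, d: int, threshold: float = 2.0, z: float = 1.96
) -> RorResult:
    """ROR with an approximate 95% CI from 2x2 report counts.

    a: drug + event      b: drug + other events
    c: other drugs + event   d: other drugs + other events
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    ror = (a * d) / (b * c)
    # Standard error of ln(ROR) under the log-normal approximation.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    ci_low = math.exp(math.log(ror) - z * se)
    ci_high = math.exp(math.log(ror) + z * se)
    return RorResult(ror, ci_low, ci_high, ror > threshold and ci_low > 1.0)
```

With a=20, b=80, c=100, d=9800 the ROR is (20 × 9800) / (80 × 100) = 24.5. Cells with zero counts are rejected here; whether to apply a continuity correction instead is a policy choice for your team.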

3. AI medical assistant grounding

For any drug-related answer the assistant gives, pull the openFDA label and use it as the authoritative ground truth. Cite the FDA label section explicitly. Refuse to answer dose-related questions when the label can't be retrieved. This is the citation-grounded pattern from the UK case law write-up applied to medicine — with even higher stakes.
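
The refuse-when-ungrounded rule can be sketched as a small gate in front of the model's answer. The example assumes the label record is an openFDA label result, where sections such as `dosage_and_administration` are lists of strings and `set_id` identifies the SPL document. `grounded_dose_answer` and the refusal wording are hypothetical.

```python
from typing import Optional

REFUSAL = "I can't answer dosing questions without the current FDA label."


def grounded_dose_answer(label: Optional[dict]) -> dict:
    """Return label dosing text with a citation, or a refusal if none is available."""
    sections = (label or {}).get("dosage_and_administration") or []
    if not sections:
        # No label retrieved, or the label has no dosing section: refuse.
        return {"answer": REFUSAL, "citation": None}
    set_id = label.get("set_id", "unknown")
    return {
        "answer": sections[0],
        "citation": f"FDA label, Dosage and Administration (SPL set_id {set_id})",
    }
```

Keeping the refusal in plain code rather than in the prompt means a retrieval failure cannot be papered over by the model.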

The DrugBank trap

Worth re-emphasising. The DrugBank academic license is well-publicised but its terms shift the moment you charge anyone money for anything that uses the data — including a free product whose users you intend to convert to paid later. Several small SaaS founders have discovered this the hard way after a notice landed in their inbox.

Two clean paths:

  • Pay the commercial license. Standard pricing is opaque; expect to negotiate. For mature products with funding this is the right answer because DrugBank's drug-drug interaction data is hard to match.
  • Use license-clean alternatives for early stages. ChEMBL covers most mechanistic data. RxNorm + DailyMed (NIH) cover drug-name normalisation and labels. FAERS covers adverse events. The combination misses some DrugBank-specific data (rich interaction tables, polypharmacology) but is enough for most early-stage products. API Pick Clinical Search wraps the license-clean subset for you.

Where this generalises

The 'wire four public databases together with rate-limit handling and license discipline' pattern shows up in many regulated verticals — financial filings (SEC + earnings transcripts + equity stats), patents (USPTO + EPO + WIPO + JPO + KIPO + CNIPA), legal (Find Case Law + legislation.gov.uk + foreign equivalents). The drug-data version is unusual mostly because the license footing is more contested. Every other axis — schema diversity, rate limits, deduplication, cross-source identifier mapping — generalises.

For one-call retrieval across the license-clean drug data sources, API Pick Clinical Search does the wiring. For the deeper integrations (full SPL parsing, FAERS signal computation, ChEMBL target trees), you still go to each source directly. Pick the right level of abstraction for each part of the pipeline.

Frequently Asked Questions

What changed in ClinicalTrials.gov v2 that breaks pipelines?

Three things. (1) Endpoint structure — v2 is REST + JSON instead of v1's CSV/XML. (2) Field names — the new protocolSection wrapper and snake_case-to-camelCase changes are the most common refactor. (3) Optional-field population — many fields documented as 'available' are sparsely populated, especially for older studies. Migration usually takes 2-3 days plus a week of bug-fixing as edge-case studies turn up.
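
Because of point (3), code that indexes straight into nested modules tends to fail on older studies. One defensive approach is a small helper that walks the nested dictionaries and returns a default when any level is missing. `dig` is a hypothetical helper name; the field path in the usage note is the sponsor path from the v2 study record.

```python
from typing import Any


def dig(record: Any, *path: str, default: Any = None) -> Any:
    """Walk nested dicts, returning `default` as soon as a key is missing."""
    current = record
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current
```

For a study record `s`, `dig(s, "protocolSection", "sponsorCollaboratorsModule", "leadSponsor", "name", default="unknown")` yields the sponsor name when it is present and `"unknown"` otherwise.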

What's the deal with DrugBank licensing?

DrugBank is free for academic and personal research. Any commercial use, including a free SaaS product, a startup's MVP, or a tool used in a paid consultancy, falls under DrugBank's commercial licensing terms. The Thinklab post 'Sounding the alarm on DrugBank's new license' from a few years back is still the canonical write-up. Many builders integrate DrugBank in development without realising that the licence applies the moment they ship a product. Read the terms before integrating, or use a license-clean alternative.

How do I do basic pharmacovigilance signal detection?

The standard disproportionality measures are the Reporting Odds Ratio (ROR), the Proportional Reporting Ratio (PRR), and the Bayesian Confidence Propagation Neural Network (BCPNN), all computed over the FAERS adverse-event database. Open-source libraries like vigipy implement these. The trap is FAERS deduplication: many reports are duplicate filings of the same case from different parties. Dedicated tools such as vigiMatch, from the Uppsala Monitoring Centre that maintains the WHO's VigiBase, handle this, but you pay for it. For most med-AI use cases, openFDA's FAERS endpoint plus a simple ROR calculation is enough for surfacing signals worth investigating.
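
A cheap first pass at deduplication is to keep only the latest version of each report. The sketch below assumes openFDA event records, which carry `safetyreportid` and `safetyreportversion`. It collapses follow-up versions of the same report only; it does not detect the same case filed separately by different parties, which is the harder problem described above. `latest_versions` is a hypothetical helper name.

```python
from typing import Dict, Iterable, List


def latest_versions(reports: Iterable[dict]) -> List[dict]:
    """Keep only the highest safetyreportversion for each safetyreportid."""
    latest: Dict[str, dict] = {}
    for report in reports:
        report_id = report.get("safetyreportid")
        if report_id is None:
            continue  # cannot deduplicate a record without an ID
        version = int(report.get("safetyreportversion") or 0)
        kept = latest.get(report_id)
        if kept is None or version > int(kept.get("safetyreportversion") or 0):
            latest[report_id] = report
    return list(latest.values())
```

Run this before building the 2x2 counts so that follow-up versions of one report are not counted as separate cases.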

Is API Pick Clinical Search HIPAA-compliant?

The data sources we wrap (ClinicalTrials.gov, openFDA, ChEMBL, DrugBank pharmacology metadata) contain no protected health information — they cover trial protocols, drug labels, adverse-event aggregates, and structural/bioactivity data. HIPAA compliance applies to PHI, which doesn't appear in our index. If you're building a downstream product that does handle PHI (e.g. clinical decision support over patient records), you'll need to handle that separately. The data flowing through our endpoint is consistent with public regulatory data.

Can output from this be used for clinical decisions?

No. Output from any retrieval API is informational; it does not constitute medical advice or clinical decision support. Use the data to support qualified personnel — a pharmacist, physician, regulatory specialist — not to replace them. This applies to API Pick Clinical Search and to any other API in this space.


Written by
Sarah Choy
CEO, API Pick

Sarah Choy is the CEO of API Pick. She writes about building production-ready APIs for AI agents and LLM workflows.