BioSkepsis vs Gemini vs ChatGPT — A Biomedical Benchmark on the BAF/SWI-SNF Complex

April 23, 2026

Reviewed

Three frontier AI systems, one research-grade biomedical question about a tumour-suppressor chromatin-remodeller. BioSkepsis returned 25+ PMIDs with inline verification badges — and caught its own wrong citation before the reader saw it. Gemini and ChatGPT returned zero citations. Here is the full side-by-side.

The query

The same prompt went to all three systems:

  • The structural organisation of the mammalian BAF / SWI-SNF complex.
  • How cBAF, PBAF, and ncBAF subfamilies differ in composition and genomic targeting.
  • How recurrent cancer mutations (SMARCB1, SMARCA4, ARID1A, SS18-SSX fusion) disrupt its tumour-suppressor function.
  • Any recent mechanistic insights — post-transcriptional regulation, quality-control pathways, 3D-genome involvement.

This is a representative biomedical literature-scan question: mechanistically deep, citation-hungry, and impossible to answer responsibly without grounding in primary papers. The kind of prompt a reviewer, grant-writer, or early-stage PI actually runs.

The three systems

Architecture and output style
| System | Architecture | Output style |
|---|---|---|
| BioSkepsis AI | Dual-LLM pipeline (generation → verification) over a 40M+ biomedical corpus | Dense technical prose with PMIDs, evidence tiers (Direct / Derived / Indirect), four-check verification badges |
| Google Gemini | Single-pass frontier model | Narrative-first, metaphor-heavy ("Broken GPS", "Stalled Engine"), zero citations |
| OpenAI ChatGPT | Single-pass frontier model | Schematic, modular, emoji-annotated tables, zero citations |
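
One way to picture the BioSkepsis output style described in the table is as one record per claim. The sketch below is illustrative only, not the product's actual schema: the names `EvidenceTier`, `CitedClaim`, `checks`, and `badge` are hypothetical, the article does not say what the four verification checks are (so they are modelled as four unnamed booleans), and the tier assigned to the example claim is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum


class EvidenceTier(Enum):
    # Tier names come from the article; their precise definitions are not given there.
    DIRECT = "Direct"
    DERIVED = "Derived"
    INDIRECT = "Indirect"


@dataclass
class CitedClaim:
    text: str
    pmid: str
    tier: EvidenceTier
    # Hypothetical: one pass/fail result per (unspecified) verification check.
    checks: list[bool] = field(default_factory=lambda: [False] * 4)

    @property
    def badge(self) -> str:
        """PASS only if all four checks are present and passed; otherwise FAIL."""
        return "PASS" if len(self.checks) == 4 and all(self.checks) else "FAIL"


# Example record; the DIRECT tier here is assumed for illustration.
claim = CitedClaim(
    text="SS18-SSX retargets BAF to Polycomb-repressed domains",
    pmid="29861296",
    tier=EvidenceTier.DIRECT,
    checks=[True, True, True, True],
)
print(claim.pmid, claim.tier.value, claim.badge)
```

Keeping the tier and the check results on the claim itself, rather than on the answer as a whole, is what makes a per-claim badge possible.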

Composite scores

Each dimension was scored from 0 to 10 against the PubMed full-text literature.

Composite scores by dimension
| Dimension | BioSkepsis | Gemini | ChatGPT |
|---|---|---|---|
| Citation rigour | 9.2 | 0 | 0 |
| Mechanistic depth | 9.3 | 6.5 | 4.5 |
| Factual accuracy | 9.1 | 7.8 | 7.2 |
| Readability | 6.2 | 8.8 | 8.2 |
| Completeness | 9.0 | 5.8 | 4.8 |

Readability is the only row where BioSkepsis is clearly beaten — its output is dense technical prose, not narrative storytelling. For a domain expert, that is a feature. For a patient-facing explainer, it isn't.
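
The article reports per-dimension scores but does not say how, or whether, they are combined into a single number. As a rough sketch, assuming an unweighted mean across the five dimensions (an assumption, not the article's stated method), the arithmetic looks like this:

```python
from statistics import mean

# Scores copied from the composite-score table (0-10 per dimension).
scores = {
    "BioSkepsis": {"citation": 9.2, "depth": 9.3, "accuracy": 9.1, "readability": 6.2, "completeness": 9.0},
    "Gemini": {"citation": 0.0, "depth": 6.5, "accuracy": 7.8, "readability": 8.8, "completeness": 5.8},
    "ChatGPT": {"citation": 0.0, "depth": 4.5, "accuracy": 7.2, "readability": 8.2, "completeness": 4.8},
}


def unweighted_mean(system: str) -> float:
    """Average the five dimension scores with equal weight (assumed weighting)."""
    return round(mean(scores[system].values()), 2)


for name in scores:
    print(name, unweighted_mean(name))
```

Under equal weighting the means come out at 8.56, 5.78, and 4.94. A reader who weights readability more heavily, for example for patient-facing material, would narrow that gap.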

Citation provenance — the single largest gap

BioSkepsis returned 25+ PMIDs with inline evidence-tier annotations and a verification badge next to each claim. The independent audit spot-checked six of the verifier's verdicts (five passes and one fail), and the auditor agreed with the verifier on all six.
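
The spot-check result can be expressed as a simple agreement rate. The sketch below is a minimal illustration: only the failing PMID is named in the article, so the five passing items use placeholder keys rather than real identifiers, and the function name `agreement_rate` is hypothetical.

```python
# Auditor verdicts for the six spot-checked citations.
# "pass-1" ... "pass-5" are placeholders, not real PMIDs.
auditor = {"35390276": "FAIL", **{f"pass-{i}": "PASS" for i in range(1, 6)}}

# The article reports that the verifier agreed with the auditor on all six.
verifier = dict(auditor)


def agreement_rate(a: dict[str, str], b: dict[str, str]) -> float:
    """Fraction of shared items on which the two sets of verdicts match."""
    shared = a.keys() & b.keys()
    if not shared:
        raise ValueError("no shared items to compare")
    return sum(a[k] == b[k] for k in shared) / len(shared)


print(f"{agreement_rate(auditor, verifier):.0%} agreement on {len(auditor)} spot-checks")
```

With only six items, this says the verifier and the auditor agreed on this sample; it is too small to estimate the verifier's error rate in general.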

Gemini and ChatGPT returned zero citations. Every claim either model made requires the reader to redo the literature search from scratch. For grant writing, systematic reviews, and manuscript preparation, that effectively rules both out for citation-bearing paragraphs, regardless of how pleasant the prose is to read.

The argument for a specialist isn't that it writes better. It's that a general-purpose model cannot give you a bibliography you can submit.

BAF subfamily resolution — cBAF / PBAF / ncBAF

BioSkepsis distinguished all three BAF subfamilies and described their distinct genomic targeting — including ncBAF at CTCF sites and its role in TAD-boundary maintenance. Gemini and ChatGPT treated the complex as a single monolithic entity.

This isn't a cosmetic distinction. cBAF, PBAF, and ncBAF have different compositions, different genomic addresses, and different disease associations. An answer that elides the subfamily structure is painting over the mechanism with a broad brush.

Specific mutations vs. abstract categories

BioSkepsis — named lesions, specific mechanisms

  • SMARCB1 K364del: disruption of the nucleosome acidic-patch contact.
  • SS18-SSX fusion: retargeting BAF to Polycomb-repressed domains.
  • DCAF5: E3-ligase-mediated quality control of mis-assembled complexes.
  • SMARCA4 ATPase-domain mutations: failure to evict PRC1 at bivalent promoters.

Gemini — correct direction, less specificity

ARID1A loss → enhancer detachment. SMARCB1 loss → PRC2 takeover. SMARCA4 → "stalled ATPase engine". Direction is right; specific mutations are absent.

ChatGPT — categories without examples

"Targeting loss", "ATPase loss", "scaffold loss", "subunit switching", "Polycomb imbalance". Correct categories; no named mutations and no mechanism-level specifics.

Recency of literature — two 2024–2025 papers only BioSkepsis surfaced

  • PMID:38538798 (Nature, 2024) — DCAF5 as a quality-control E3 ligase degrading mis-assembled BAF complexes in SMARCB1-mutant cancers.
  • PMID:40447637 (Nat Commun, 2025) — m⁶A / RBM15 regulation of SWI/SNF subunit stoichiometry via CRISPR screen.

Neither Gemini nor ChatGPT mentioned post-transcriptional regulation, quality-control pathways, or 3D-genome involvement. These are recent findings that may fall outside, or be only thinly represented in, a model's training data: exactly what a retrieval-first pipeline is built to surface and a pretraining-only model can miss.

Self-correction — the PMID:35390276 case study

BioSkepsis cited PMID:35390276 as support for the SS18-SSX fusion mechanism claim. The verifier pipeline flagged this as a FAIL before the reader saw the answer: that paper is about the FUS::DDIT3 fusion in myxoid liposarcoma — a different fusion, a different cancer, and an opposite mechanism (loss-of-function vs. gain-of-function).

This is a textbook neighbour-citation error: same research group, adjacent journal, topically close — but wrong. The correct paper (PMID:29861296, McBride et al., Cancer Cell 2018) was already co-cited and fully supports the claim. The independent audit confirmed the verifier's FAIL was correct.
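
To make the failure mode concrete, here is a deliberately crude sketch of one check a verifier could run: does the cited paper mention the entity the claim is about? This is not BioSkepsis's actual verifier, whose checks the article does not describe. The function name `mentions_entity` is hypothetical, and the paper descriptions are the short summaries given in this article, not the papers' real titles or abstracts.

```python
def normalise(text: str) -> str:
    """Lower-case and unify fusion separators so 'A::B' and 'A-B' compare equal."""
    return text.lower().replace("::", "-")


def mentions_entity(claim_entity: str, paper_description: str) -> bool:
    """Crude topical check: is the claim's key entity named in the paper description?"""
    return normalise(claim_entity) in normalise(paper_description)


# Descriptions as summarised in this article (assumed stand-ins for real metadata).
papers = {
    "35390276": "FUS::DDIT3 in myxoid liposarcoma",
    "29861296": "SS18-SSX fusion retargeting of BAF to Polycomb domains",
}

claim_entity = "SS18-SSX"
verdicts = {
    pmid: "PASS" if mentions_entity(claim_entity, description) else "FAIL"
    for pmid, description in papers.items()
}
print(verdicts)
```

A string match like this would catch the wrong-fusion case here, but it would miss subtler errors, such as a paper that names the right fusion yet does not support the specific mechanism claimed.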

The uncomfortable counterfactual: if Gemini or ChatGPT made the same misattribution — and both do make similar, unqualified claims about SS18-SSX — there would be no way for a reader to detect it, because they cite nothing at all. A system that catches its own errors is more trustworthy than one that cannot even expose them.

The unfalsifiability problem

ChatGPT's response was structured at such a high level of abstraction that most claims couldn't be individually verified or refuted. Statements like "the targeting module finds correct genes" are true — and also vacuous.

There's a paradox here: the output appears more accurate (fewer surface errors) precisely because it makes fewer commitments. BioSkepsis makes many more specific, falsifiable claims and therefore exposes more surface area for error. For scientific use, that is the right trade.

Verdict — who should use what

BioSkepsis: Biomedical researchers and reviewers

Citation grounding, evidence tiering, and automated verification make the output directly usable in literature reviews, grant applications, and manuscript preparation. The 25+ PMIDs with inline badges save hours of manual citation-chasing. The verifier's ~93% accuracy — with transparent failure reports — means trust-but-verify is actually fast.

Gemini: Science communication and teaching

The "Broken GPS / Stalled Engine / PRC2 Imbalance" framing is pedagogically strong, even if mechanistically simplified. Best for lectures, explainer articles, and patient-facing content — places where narrative coherence matters more than citation auditability.

ChatGPT: Quick conceptual overviews

Useful for students who need a 30-second orientation before diving into primary literature. Not usable on its own as a source for anything that will be cited.

Why this matters for biomedical research

The competitive gap between BioSkepsis and general-purpose LLMs is widest on exactly the two dimensions that matter most for professional scientific work: citation provenance and verifier transparency. General-purpose LLMs score zero on both. That is a categorical difference, not a marginal one.

For the target user — a biomedical researcher preparing a grant, a systematic review, or a regulatory dossier — the question isn't "which answer is better-written?" It's "which answer can I actually use without redoing the literature search myself?"

On this benchmark, only BioSkepsis cleared that bar.

Frequently asked questions

Was this benchmark independently audited?

Yes. The PMIDs returned by BioSkepsis were spot-checked against PubMed full text. The auditor cross-validated six of the verifier's verdicts (five passes and one fail) and agreed with the verifier on all six. The verifier's accuracy is quoted as approximately 93%.

Does BioSkepsis ever hallucinate citations?

BioSkepsis measurably reduces citation errors; it does not promise zero. The PMID:35390276 case in this benchmark is exactly the sort of neighbour-citation error that a retrieval layer can still produce. The point is that the verifier pipeline flagged it before the reader saw it. General-purpose LLMs do not have that layer at all, so when they misattribute a claim there is no way for a reader to detect it.

Why did Gemini score well on narrative quality?

Gemini uses metaphorical framings ("Broken GPS", "Stalled Engine") to make complex mechanisms accessible to non-specialists. For lectures, explainer articles, and patient-facing content, that narrative accessibility is a real strength. For grant writing or manuscript preparation, where every claim must be citation-backed, it is insufficient on its own.

Why did ChatGPT produce no falsifiable errors?

ChatGPT's response was pitched at a high level of abstraction. Statements like "the targeting module finds correct genes" are true but vacuous — they contain no testable specifics. This made the output appear more accurate on the surface while being less useful for scientific work: fewer concrete commitments means fewer verifiable claims.

What is the BAF/SWI-SNF complex and why does it matter?

The BAF (mammalian SWI/SNF) complex is an ATP-dependent chromatin-remodeller that regulates gene expression by repositioning nucleosomes. It is one of the most frequently mutated tumour-suppressor systems in human cancer, with pathogenic alterations in SMARCB1, SMARCA4, ARID1A, and SS18 reported across rhabdoid tumours, synovial sarcoma, ovarian clear-cell carcinoma, and many other malignancies. Accurate mechanistic understanding matters directly for therapeutic targeting.

Can I use Gemini or ChatGPT alongside BioSkepsis?

Yes. Gemini and ChatGPT are strong at drafting, rephrasing, translating, and explaining in plain language. BioSkepsis is where the cited, mechanistic claims come from. A sensible workflow: brainstorm and draft in the generalist model, source citation-bearing statements via BioSkepsis, polish the final prose in the generalist model.

Can I reproduce this benchmark?

Yes. The methodology is simple: an identical prompt, a single session, and no follow-up refinement, applied to all three systems. Any mechanistically detailed biomedical question should expose the same gaps: general-purpose LLMs produce no citations, BioSkepsis produces PMIDs with inline verification badges. Choose a question from your own field and compare the outputs.
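
A minimal harness for that comparison might look like the sketch below. It assumes each system can be wrapped as a function that takes a prompt and returns answer text. The two stub functions and their canned answers are hypothetical stand-ins, not real API clients or real model output, and the `PMID:` pattern assumes citations are written in the form used in this article.

```python
import re
from typing import Callable

# Matches citations written as "PMID:12345678" or "PMID: 12345678".
PMID_PATTERN = re.compile(r"PMID:\s*(\d+)")


def extract_pmids(answer: str) -> list[str]:
    """Return the unique PMIDs in an answer, in order of first appearance."""
    return list(dict.fromkeys(PMID_PATTERN.findall(answer)))


def run_benchmark(prompt: str, systems: dict[str, Callable[[str], str]]) -> dict[str, list[str]]:
    """Send the identical prompt once to each system and collect the PMIDs cited."""
    return {name: extract_pmids(ask(prompt)) for name, ask in systems.items()}


# Hypothetical stubs standing in for real clients.
def stub_with_citations(prompt: str) -> str:
    return ("SS18-SSX retargets BAF (PMID:29861296). DCAF5 degrades mis-assembled "
            "complexes (PMID: 38538798; see also PMID:29861296).")


def stub_without_citations(prompt: str) -> str:
    return "The complex behaves like a broken GPS."


results = run_benchmark(
    "Describe the structural organisation of the mammalian BAF complex.",
    {"specialist": stub_with_citations, "generalist": stub_without_citations},
)
for name, pmids in results.items():
    print(name, len(pmids), pmids)
```

Counting PMIDs only measures whether citations are present. Checking that each one supports its claim still requires reading the cited paper.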

Run the benchmark on your own question — free

Biology-native knowledge graph across 40M+ biomedical papers. Every claim grounded in a real, retrievable PMID, with an inline verification badge next to each. Free tier: 100 papers per session, no credit card.

  1. PMID:33053319 — Cryo-EM structure of the human BAF complex (core/base-module architecture, SMARCB1 placement).
  2. PMID:29861296 — McBride et al., Cancer Cell 2018 — SS18-SSX fusion retargeting of BAF to Polycomb domains.
  3. PMID:38538798 — Nature, 2024 — DCAF5 quality control of mis-assembled BAF complexes in SMARCB1-mutant cancers.
  4. PMID:40447637 — Nat Commun, 2025 — m⁶A / RBM15 regulation of SWI/SNF subunit stoichiometry (CRISPR screen).
  5. PMID:35390276 — FUS::DDIT3 in myxoid liposarcoma (flagged by the verifier as an incorrect citation for the SS18-SSX claim in this benchmark).