Does Factually use large language models to fact-check? How does Factually function?

Checked on December 1, 2025
Disclaimer: Factually can make mistakes. Please verify important information or breaking news.

Executive summary

If “Factually” refers to automated fact‑checking systems built around large language models (LLMs), it is not a single, universally defined product in the sources provided. Researchers and fact‑checking organizations are increasingly using LLMs both to detect claims and to check them against external evidence, but the literature emphasizes limits: LLMs hallucinate, factuality degrades in long outputs, and professional fact‑checkers remain skeptical of fully automated checks [1] [2] [3].

1. Why researchers are turning to LLMs for fact‑checking

Academic teams and industry labs have built benchmarks and toolchains that use LLMs to surface claims, generate candidate evidence, and score how well claims are supported, because LLMs are good at understanding language and producing structured outputs. Projects such as FACTS Grounding and LongFact evaluate whether models’ responses are grounded in supplied documents and use automated judges to approximate human ratings [4] [5]. Researchers also break long responses into individual facts and then search for supporting evidence for each one, the approach embodied by SAFE in the LongFact work [6] [5].
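The details of this decomposition differ across projects, but the core move is a single LLM call that rewrites a long answer as a list of standalone claims. The snippet below is only a minimal sketch under that assumption; `complete` is a hypothetical stand-in for an LLM API call, not a function from SAFE, LongFact, or FACTS Grounding.

```python
# Minimal sketch of the "break a long response into individual facts" step.
# `complete` is a hypothetical placeholder for any instruction-following LLM
# call; it is not an API from the cited papers.

def complete(prompt: str) -> str:
    """Placeholder: wire this to whatever LLM provider you use."""
    raise NotImplementedError

def extract_atomic_facts(answer: str) -> list[str]:
    """Ask the model to restate a long answer as standalone factual claims."""
    prompt = (
        "Rewrite the following answer as a list of short, self-contained "
        "factual claims, one per line. Do not add or drop information.\n\n"
        f"Answer:\n{answer}"
    )
    raw = complete(prompt)
    # One claim per non-empty line, with any list markers stripped.
    return [line.strip("-• ").strip() for line in raw.splitlines() if line.strip()]
```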

2. How these LLM‑powered pipelines typically function

The common architecture in the literature is: (a) extract or generate claims from a text; (b) retrieve external evidence (search, databases); (c) use an LLM or another model to verify whether evidence supports the claim; (d) score or revise the answer and sometimes iterate [7] [5]. Some systems add automated judges or utility functions that assess factuality and then prompt the LLM to “check your facts and try again,” improving grounding by looping on external knowledge [7] [8].
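As a rough illustration of steps (a) through (d), the sketch below wires the stages together as injectable functions. The four callables, `extract_claims`, `search_evidence`, `judge_support`, and `revise_answer`, are hypothetical stand-ins for whatever claim extractor, retriever, verifier, and reviser a given system actually uses; this is a sketch of the loop's shape, not any specific system's implementation.

```python
# Illustrative extract -> retrieve -> verify -> revise loop for the pipeline
# described above. The four callables are assumptions for this sketch, not a
# real library API.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    claim: str
    supported: bool
    evidence: list[str]

def fact_check_loop(
    draft: str,
    extract_claims: Callable[[str], list[str]],
    search_evidence: Callable[[str], list[str]],
    judge_support: Callable[[str, list[str]], bool],
    revise_answer: Callable[[str, list[Verdict]], str],
    max_rounds: int = 2,
) -> tuple[str, list[Verdict]]:
    """Run the extract/retrieve/verify/revise loop until the answer is grounded."""
    answer, verdicts = draft, []
    for _ in range(max_rounds):
        verdicts = []
        for claim in extract_claims(answer):             # (a) extract claims
            evidence = search_evidence(claim)            # (b) retrieve evidence
            supported = judge_support(claim, evidence)   # (c) verify against evidence
            verdicts.append(Verdict(claim, supported, evidence))
        if all(v.supported for v in verdicts):
            break                                        # all claims supported: stop
        answer = revise_answer(answer, verdicts)         # (d) revise and iterate
    return answer, verdicts
```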

3. What the benchmarks measure — and what they don’t

Benchmarks like FACTS Grounding and LongFact measure whether model outputs are fully grounded in provided context, or whether individual facts in long answers are supported by web evidence; other efforts such as SimpleQA and short‑form factuality tests narrow the task to single factoid questions [4] [9] [10]. These evaluations make progress measurable, but they do not capture broader editorial judgment, nuances of source trustworthiness, or real‑world adversarial behavior [4] [1].
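The scoring side of these benchmarks usually reduces to some ratio of supported facts to checked facts. The toy function below shows that general shape, assuming per-fact labels already exist; the actual judging and aggregation rules in FACTS Grounding and SAFE are more involved.

```python
# Toy grounding score: share of checked facts judged "supported".
# Real benchmarks use more elaborate judging and aggregation; this only
# illustrates the general shape of the metric.

def grounding_score(labels: list[str]) -> float:
    """labels: 'supported', 'unsupported', or 'irrelevant', one per extracted fact."""
    checked = [label for label in labels if label != "irrelevant"]
    if not checked:
        return 0.0
    return sum(label == "supported" for label in checked) / len(checked)

print(grounding_score(["supported", "supported", "unsupported"]))  # ~0.67
```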

4. Known failure modes: hallucinations, drift, and gaming evaluations

Multiple surveys and papers document persistent faults: LLMs hallucinate, make unsupported claims in later sentences of long outputs, and sometimes produce plausible‑sounding but incorrect statements [11] [5] [2]. The literature also raises a specific concern that models could be optimized to “game” search‑based checkers, producing text that passes automated evidence searches without being truly faithful to facts [6] [5].

5. Evidence that automation can help — with caveats

Automated systems can speed and scale parts of the fact‑checking workflow: SAFE and related methods can break responses into facts, search at scale, and in some tests match or exceed crowd annotators on agreement measures, while being cheaper than human review [5] [6]. But these gains are bounded: larger models often perform better on long‑form factuality, and improvements rely on retrieval and careful benchmarking rather than raw generation alone [5] [6].

6. How professional fact‑checkers see LLM tools

Fact‑checking professionals welcome assistance but remain skeptical about full automation: interviews and CHI research show that fact‑checkers demand explainable outputs, clear evidence trails, and the ability to scrutinize automated reasoning before adopting it into their workflows [3]. The International Fact‑Checking Network and established outlets still prioritize human processes and transparent sourcing [12] [13].

7. Two competing perspectives — promise versus prudence

Optimists point to measurable advances: benchmarks, iterative verification systems and retrieval‑augmented pipelines show that LLMs can improve factual grounding and scale detection of claims [7] [5]. Skeptics warn that hallucination, overfitting to benchmarks and the need for editorial judgment make fully automated fact‑checking risky without human oversight [11] [3].

8. Practical takeaway for users and newsrooms

LLM‑based fact‑checking systems function best as evidence‑retrieval and triage tools that accelerate human reviewers: they can extract claims, gather candidate sources, and surface likely contradictions, but they do not yet replace human evaluation of source reliability, context, and intent [7] [3]. For high‑stakes decisions, the literature recommends combining retrieval, automated verification, and human fact‑checkers rather than relying solely on model outputs [7] [3].
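One way to read that recommendation in practice is to treat automated verdicts as triage signals rather than final rulings. The sketch below assumes each verdict carries a verifier confidence score (an assumption for illustration, not a standard field in any cited system) and routes anything uncertain to a human review queue.

```python
# Illustrative triage: only claims verified as supported with high confidence
# skip the human queue; everything else goes to a fact-checker. The Verdict
# shape and the 0.9 threshold are assumptions for this sketch.

from dataclasses import dataclass

@dataclass
class Verdict:
    claim: str
    supported: bool
    confidence: float  # verifier confidence in [0.0, 1.0]

def triage(verdicts: list[Verdict], threshold: float = 0.9) -> dict[str, list[Verdict]]:
    """Split automated verdicts into auto-accepted and human-review buckets."""
    routed: dict[str, list[Verdict]] = {"auto_accept": [], "human_review": []}
    for verdict in verdicts:
        if verdict.supported and verdict.confidence >= threshold:
            routed["auto_accept"].append(verdict)
        else:
            routed["human_review"].append(verdict)
    return routed
```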

Limitations: available sources describe research systems, benchmarks and surveys but do not document a single commercial product called “Factually” or a comprehensive list of production deployments by specific newsrooms; those specifics are not found in current reporting.

Want to dive deeper?
What is the business model and pricing of Factually's fact-checking service?
Which datasets and sources does Factually use to validate claims?
How accurate is Factually compared with human fact-checkers and other automated tools?
What methods does Factually use to detect misinformation and manage adversarial prompts?
How can newsrooms or researchers integrate Factually's API into their workflows?