Average doc length
Executive summary
“Average document length” is a metric with multiple meanings. In information retrieval it usually means the total number of terms in a document (the length input to BM25), while in practical file-management studies it can mean file size in kilobytes: the Metadata Consulting blog reported an “average office document” size of 3,210 KB, derived from a 2012 dataset extrapolated to 2020 [1] [2]. Different disciplines measure “length” as words, characters, pages, or bytes; pick the definition that matches your use case [1] [3].
1. What people mean when they ask “average doc length” — multiple definitions battle for attention
There is no single canonical meaning of “average document length.” In search and ranking research, document length typically equals the total number of terms in the document and is a required input for algorithms such as BM25; alternative definitions also appear in the literature and in implementation notes, such as counting unique terms, using vector norms, or counting tokens only after preprocessing steps like stopword removal [1]. In writing and publishing communities, “length” is usually measured in words or pages, with guides using word counts (for example, an average novel cited at ~90,000 words) to set expectations for book length [3]. Systems and storage teams, by contrast, often mean file size in bytes or kilobytes when they ask the same question [2] [4].
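The competing definitions above can be made concrete with a minimal sketch: the same document yields four different “lengths” depending on whether you count total terms, unique terms, characters, or bytes. The sample sentence and the naive whitespace tokenizer are illustrative assumptions, not a recommendation.

```python
# One document, four "lengths" -- the number you get depends entirely
# on the definition you choose.

doc = "The quick brown fox jumps over the lazy dog the end"

tokens = doc.lower().split()           # naive whitespace tokenization
total_terms = len(tokens)              # IR/BM25-style length: all terms
unique_terms = len(set(tokens))        # alternative: distinct terms only
characters = len(doc)                  # character count
size_bytes = len(doc.encode("utf-8"))  # storage-style size in bytes

print(total_terms, unique_terms, characters, size_bytes)
```

For this ASCII sentence the character and byte counts coincide, but any non-ASCII text would split them apart, which is exactly why the definitions cannot be used interchangeably.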
2. How IR systems compute document length — practical choices change the number
Implementations matter: for BM25 and similar probabilistic models, practitioners treat document length as the total term count after any preprocessing you adopt—lowercasing, stop‑word removal, stemming—because those steps change both per‑doc length and corpus averages; some authors even advocate counting unique terms instead, while vector‑space models sometimes use vector norm rather than raw term counts [1]. The Stack Overflow discussion clarifies that simple examples may show a document length of “4” (and an average of 4) but warns developers to inspect the exact tokenization and preprocessing used by their library [1].
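The preprocessing sensitivity described above is easy to demonstrate. The sketch below uses an invented three-document corpus and a made-up stopword list to show how stopword removal changes both per-document lengths and the corpus average (the avgdl that BM25 normalizes against); it is not any library’s actual pipeline.

```python
# Hypothetical corpus and stopword list, for illustration only.
STOPWORDS = {"the", "a", "of", "to", "is"}

corpus = [
    "the cat sat on the mat",
    "a brief history of time",
    "to be or not to be is the question",
]

def doc_len(text, remove_stopwords=False):
    """Document length = token count after the chosen preprocessing."""
    tokens = text.lower().split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return len(tokens)

raw_lengths = [doc_len(d) for d in corpus]                          # [6, 5, 9]
filtered = [doc_len(d, remove_stopwords=True) for d in corpus]      # [4, 3, 5]

avgdl_raw = sum(raw_lengths) / len(raw_lengths)        # ~6.67
avgdl_filtered = sum(filtered) / len(filtered)         # 4.0
```

A one-line change to the pipeline shifts the corpus average from roughly 6.67 to 4.0, which in turn shifts every BM25 length-normalization factor: this is why the preprocessing rules must be pinned down before the “average” means anything.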
3. Storage and operational teams mean bytes — beware of conflating metrics
When capacity planners ask for “average document size,” they usually want bytes per item: a Microsoft Enterprise Search survey and related summaries have been used to estimate average office document sizes (one blog cites 3,210 KB as an average office doc size derived from broader enterprise sampling and projects that figure forward) [2]. Database and document‑store communities perform this calculation directly (for example, Couchbase forum users compute AVG(length(ENCODE_JSON(t))) to get average document size) to inform sizing and queries [4]. Treating KB as interchangeable with “words” will produce misleading capacity and performance decisions [2] [4].
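For the storage-side meaning, a minimal Python analogue of the Couchbase AVG(length(ENCODE_JSON(t))) pattern is to serialize each document to JSON and average the encoded byte lengths. The two sample documents here are invented.

```python
import json

# Hypothetical stored documents.
docs = [
    {"id": 1, "title": "Q3 report", "body": "Revenue was up."},
    {"id": 2, "title": "Memo", "body": "Meeting moved to Friday."},
]

# Serialize each document and measure its encoded size in bytes,
# mirroring length(ENCODE_JSON(t)) on the server side.
sizes = [len(json.dumps(d).encode("utf-8")) for d in docs]
avg_size_bytes = sum(sizes) / len(sizes)
```

Note that this measures serialized size, not word count: the same calculation applied to word counts would answer a completely different question, which is the conflation the section warns against.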
4. Genre and time drive “average” — averages hide broad distributions
Research into corpora shows that average document length varies dramatically by genre and tends to grow over time for some domains: studies of financial disclosures and corpora annotated for NLP report increasing word counts in certain document types and large differences across genres such as broadcast transcripts versus longform documents [5] [6]. Using a single corpus average to normalize scores or plan storage without checking the distribution and temporal trends misrepresents reality [5] [6].
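The hazard of a single corpus average can be sketched with invented word counts that mimic a skewed mixed-genre corpus: many short documents plus a few very long ones pull the mean far away from the typical document.

```python
import statistics

# Hypothetical word counts: five short docs and two book-length ones.
word_counts = [120, 95, 150, 110, 130, 48_000, 52_000]

mean_len = statistics.mean(word_counts)      # ~14,372 words
median_len = statistics.median(word_counts)  # 130 words
```

Here the mean (~14,372 words) describes no actual document, while the median (130) describes most of them; checking percentiles or per-genre breakdowns before normalizing scores or planning storage avoids this trap.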
5. Practical guidance — choose the right metric and be explicit about preprocessing
Decide up front whether you need words, unique terms, characters, pages, or bytes. For ranking and BM25 use term counts after your chosen tokenization and stop‑word/stemming pipeline; for authorship and publishing use word counts or pages (e.g., targets like 90,000 words for novels); for infrastructure use bytes or encoded JSON length as in practical Couchbase examples [1] [3] [4]. Document and publish your preprocessing rules so that your “average” is reproducible and comparable [1].
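One way to make the guidance above operational is to publish the measurement definition alongside the number, so the “average” is reproducible. The spec fields and helper below are illustrative assumptions, not a standard format.

```python
# Hypothetical, self-describing measurement spec: anyone rerunning the
# calculation can see exactly what was counted and how.
measurement_spec = {
    "unit": "terms",            # terms | unique_terms | chars | bytes
    "tokenizer": "whitespace",
    "lowercase": True,
    "stopwords_removed": False,
    "stemming": None,
}

def average_length(docs, spec):
    """Average document length under the declared spec (terms only here)."""
    def length(text):
        tokens = text.lower().split() if spec["lowercase"] else text.split()
        return len(tokens)
    return sum(length(d) for d in docs) / len(docs)

avg = average_length(["Hello world", "one two three"], measurement_spec)  # 2.5
```

Shipping the spec with the figure is what makes two teams’ “average document length” comparable at all.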
6. Limitations in available reporting and next steps
Available sources show examples and methodologies but do not provide a single, authoritative global number for “average document length” because the metric depends on definition, corpus, and preprocessing [1] [2] [3]. If you want a concrete number for your use case, supply the corpus or the exact measurement target (words, unique terms, characters, KB) and the preprocessing rules; absent that, compute averages directly—e.g., AVG(length(ENCODE_JSON(t))) for stored documents in Couchbase or an average word count across your text collection [4] [3].