Linguistics

Linguistic Corpora

There are many types of data of potential interest to linguists; however, for the time being, this page will focus on corpus data.

A corpus is a body of texts collected as a representative sample. For example, a corpus may be compiled to represent a particular language at a particular time, or to capture how a language is used by a particular subset of speakers.

Researchers can use corpora to:

formulate hypotheses about the workings of language
create statistics and metrics to reinforce theories and research
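
As a rough illustration of the second point, the sketch below computes two common corpus statistics: normalized frequency (occurrences per million words) and type/token ratio. It uses only the Python standard library. The sample text, the tokenization rule, and all function names are made up for this example and do not come from any corpus listed on this page.

```python
from collections import Counter
import re

# Tiny invented sample standing in for a real corpus (illustrative only).
SAMPLE_TEXT = (
    "The cat sat on the mat. The dog sat on the log. "
    "A cat and a dog can share the mat."
)


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())


def per_million(count, total_tokens):
    """Normalize a raw count to occurrences per million tokens."""
    return count / total_tokens * 1_000_000


tokens = tokenize(SAMPLE_TEXT)
counts = Counter(tokens)  # word type -> raw frequency

# Type/token ratio: distinct word types divided by total tokens.
type_token_ratio = len(counts) / len(tokens)

print("tokens:", len(tokens))
print("types:", len(counts))
print("type/token ratio:", round(type_token_ratio, 2))
print("'the' per million words:", round(per_million(counts["the"], len(tokens))))
```

Normalizing to a per-million rate is what makes counts comparable across corpora of different sizes. On a sample this small the figures are not meaningful; they only show the arithmetic.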

Many corpora are free to use; others charge a fee for access.

Individual corpora

British National Corpus
The British National Corpus (BNC) is a 100-million-word collection of written and spoken language samples from a wide range of sources, designed to represent a broad cross-section of current British English. (British English)
Brown Corpus
Compiled in the 1960s, the Brown Corpus was the first computerized corpus. Its contents are accessible through the Natural Language Toolkit (NLTK) for Python. (American English)
语料库在线 / Chinese National Corpus (CNCorpus)
CNCorpus was created by the Chinese Ministry of Education and is arguably the most comprehensive corpus for both classical and modern Chinese. Chinese interface. (Chinese).
Corpus of Contemporary American English (COCA)
COCA is a collection of 450 million words maintained by Brigham Young University. (American English)
Digital Corpus of Sanskrit (DCS)
Sandhi-split corpus of Sanskrit texts with full morphological and lexical analysis. (Sanskrit).
Национальный корпус русского языка / Russian National Corpus
Corpus created by the Institute of Russian Language at the Russian Academy of Sciences containing 1 billion+ word forms. (Russian)
Open American National Corpus (OANC)
A fully open subset of the American National Corpus that includes texts of all genres and transcripts of spoken data produced from 1990 onward. Free with no restrictions. (American English)
The Speech Accent Archive
A large set of speech samples from a variety of language backgrounds. Native and non-native speakers of English read the same paragraph, which is transcribed. (English)
Türkçe Ulusal Derlemi (TUD) / Turkish National Corpus (TNC)
A 50 million word corpus of contemporary Turkish that consists of textual data across a wide variety of genres captured between 1990 and 2013. (Turkish).
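
The Brown Corpus entry above notes that its contents are accessible through the Natural Language Toolkit (NLTK) for Python. A minimal sketch of what that can look like follows. It assumes NLTK is installed (for example with pip install nltk) and that the corpus data can be downloaded on first use. The helper names load_brown and category_token_counts are invented for this example; nltk.download, nltk.corpus.brown, categories(), and words() are NLTK's own.

```python
def load_brown():
    """Download the Brown Corpus data if needed and return NLTK's corpus reader."""
    import nltk  # imported here so the rest of the file works without NLTK installed

    nltk.download("brown", quiet=True)  # no-op if the data is already present
    from nltk.corpus import brown

    return brown


def category_token_counts(reader):
    """Map each category (genre) in a categorized corpus reader to its token count."""
    return {
        category: len(reader.words(categories=category))
        for category in reader.categories()
    }


# Usage (requires NLTK, plus a network connection for the first download):
#   brown = load_brown()
#   print(brown.words(categories="news")[:10])
#   print(category_token_counts(brown))
```

The Brown Corpus is divided into genre categories such as news and fiction, so per-category counts are a quick way to see how the sample is balanced.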

Collections of corpora

Centre for English Corpus Linguistics
The Centre is responsible for the compilation of several learner, pedagogical, multilingual, and native novice writing corpora. (English plus some French, Dutch, Swedish)
中文语言资源联盟 / Chinese Linguistic Data Consortium (CLDC)
CLDC is a collection of corpora managed by a government entity, the Chinese Information Processing Society of China. Chinese and English interfaces. Mix of free and paid. (Chinese).
European Language Grid (ELG)
Platform hosting datasets, corpora, language models, source code, and language technology tools and services for European languages. Mix of free and paid corpora. (multi-language)
International Corpus of English (ICE)
ICE documents varieties of English from over 20 countries or regions. Each ICE corpus consists of one million words of spoken and written English produced after 1989, and data collection remains ongoing as of 2023. (English)
Linguistic Data Consortium (LDC)
The LDC is an open consortium of universities, libraries, corporations and government research laboratories. It serves as a repository that includes numerous corpora. Mix of free and paid corpora. Note that UMass Amherst is not a member. (multi-language)
Oxford Text Archive (OTA)
Oxford Text Archive is an archive of electronic texts and other literary and language resources intended for research into literary and linguistic topics. (multi-language).
English-Corpora.org
A collection of corpora created by Mark Davies in 2004. Each corpus provides information on how native speakers speak and write, language variation, bibliometrics, and the design of language teaching material and resources. The site provides direct access to the Corpus of Contemporary American English (COCA) and NGram viewers. (English, Spanish, Portuguese)
Survey of English Usage
The Survey has collected samples of naturally-occurring language for the purposes of description and analysis since 1959. Mix of free and paid corpora. (English)
TalkBank
TalkBank was established in 2002 to foster research in the study of human communication, with an emphasis on spoken communication. It includes corpora related to first language acquisition, second language acquisition, conversation analysis, classroom discourse, and aphasic language. (multi-language)
Traitement de Corpus Oraux en Français (TCOF)
A collection of French-language corpora. The spoken corpora were recorded in the 1980s and 1990s and later enriched with written corpora. (French)

📚Further reading

Searching for SU "Corpora (Linguistics)" in Discovery will show you resources that employ corpora or are about them.

Corpora in applied linguistics by Hunston, Susan
Publication Date: 2022
Corpus linguistics and linguistically annotated corpora by Kubler, Sandra; Zinsmeister, Heike
Call Number: ebook
Publication Date: 2015
Exploring corpus linguistics: language in action by Cheng, Winnie
Call Number: ebook
Publication Date: 2012
The fundamental principles of corpus linguistics by McEnery, Tony; Brezina, Vaclav
Publication Date: 2022
Introduction to corpus linguistics by Zufferey, Sandrine
Publication Date: 2020

ebook
The Oxford handbook of corpus phonology by Durand, Jacques; Gut, Ulrike; Kristoffersen, Gjert (editors)
Publication Date: 2014

Last Updated: May 21, 2025 2:42 PM
URL: https://guides.library.umass.edu/linguistics

Subjects: Linguistics

Tags: corpora, language