Linguistics
Linguistic Corpora
There are many types of data of potential interest to linguistics; however, for the time being, this page will focus on corpus data.
A corpus is a body of texts collected as a representative sample. For example, the contents of a corpus may be gathered to represent a particular language at a particular time or capture a language among a particular subset of users.
Researchers can use corpora to:
- formulate hypotheses about the workings of language
- create statistics and metrics to reinforce theories and research
Many corpora are free to use, while others charge a fee for access.
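As a rough illustration of the kind of statistic a corpus can supply, the sketch below computes word frequencies normalized to occurrences per million tokens, a common way to compare counts across corpora of different sizes. The tokenizer is deliberately crude, and the sample sentence and the function names `tokenize` and `per_million` are made up for this example; they do not come from any corpus or tool listed on this page.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes.

    This is a simplification; real corpus work usually relies on a
    language-aware tokenizer.
    """
    return re.findall(r"[a-z']+", text.lower())


def per_million(tokens):
    """Map each word to its frequency per million tokens."""
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {word: count * 1_000_000 / total for word, count in counts.items()}


# Hypothetical toy "corpus"; an actual study would load many full texts.
sample = "The cat sat on the mat. The dog sat too."
tokens = tokenize(sample)
freqs = per_million(tokens)
```

With a real corpus, the same normalized figures make it possible to compare how often a word occurs in, say, spoken versus written material even when the two samples differ greatly in size.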
Individual corpora
- British National Corpus: The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English. (British English)
- Brown Corpus: Compiled in the 1960s, the Brown Corpus was the first computerized corpus. The contents of the corpus are accessible through the Natural Language Toolkit for Python. (American English)
- 语料库在线 / Chinese National Corpus (CNCorpus): CNCorpus was created by the Chinese Ministry of Education and is arguably the most comprehensive corpus for both classical and modern Chinese. Chinese interface. (Chinese)
- Corpus of Contemporary American English (COCA): COCA is a collection of 450 million words maintained by Brigham Young University. (American English)
- Digital Corpus of Sanskrit (DCS): Sandhi-split corpus of Sanskrit texts with full morphological and lexical analysis. (Sanskrit)
- Национальный корпус русского языка / Russian National Corpus: Corpus created by the Institute of Russian Language at the Russian Academy of Sciences containing 1 billion+ word forms. (Russian)
- Open American National Corpus (OANC): A fully open subset of the American National Corpus that includes texts of all genres and transcripts of spoken data produced from 1990 onward. Free with no restrictions. (American English)
- The Speech Accent Archive: A large set of speech samples from a variety of language backgrounds. Native and non-native speakers of English read the same paragraph, which is transcribed. (English)
- Türkçe Ulusal Derlemi (TUD) / Turkish National Corpus (TNC): A 50 million word corpus of contemporary Turkish that consists of textual data across a wide variety of genres captured between 1990 and 2013. (Turkish)
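The Brown Corpus entry above mentions access through the Natural Language Toolkit (NLTK). A minimal sketch of what that might look like follows, assuming NLTK is installed (`pip install nltk`) and the machine can download the corpus data on first use. The helper names `load_brown_words` and `top_words` are invented for this example and are not part of NLTK.

```python
from collections import Counter


def load_brown_words(category=None):
    """Return Brown Corpus tokens through NLTK.

    Assumes the third-party nltk package is installed. The corpus data
    is downloaded the first time if it is not already present locally.
    """
    import nltk
    from nltk.corpus import brown

    try:
        nltk.data.find("corpora/brown")
    except LookupError:
        nltk.download("brown")
    # category=None returns the whole corpus; a name such as "news"
    # restricts the result to that genre.
    return brown.words(categories=category)


def top_words(tokens, n=5):
    """Return the n most frequent alphabetic tokens, ignoring case."""
    counts = Counter(t.lower() for t in tokens if t.isalpha())
    return counts.most_common(n)
```

For example, `top_words(load_brown_words("news"))` would list the most frequent words in the corpus's news texts. Other corpora on this page have their own access methods and licenses, so this pattern applies only to the Brown Corpus as distributed with NLTK.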
Collections of corpora
- Centre for English Corpus Linguistics: The Centre is responsible for the compilation of several learner, pedagogical, multilingual, and native novice writing corpora. (English plus some French, Dutch, Swedish)
- 中文语言资源联盟 / Chinese Linguistic Data Consortium (CLDC): CLDC is a collection of corpora managed by a government entity, the Chinese Information Processing Society of China. Chinese and English interfaces. Mix of free and paid. (Chinese)
- European Language Grid (ELG): Platform hosting datasets, corpora, language models, source code, and language technology tools and services for European languages. Mix of free and paid corpora. (multi-language)
- International Corpus of English (ICE): ICE documents varieties of English from over 20 countries or regions. Each ICE corpus consists of one million words of spoken and written English produced after 1989, and data collection remained ongoing as of 2023. (English)
- Linguistic Data Consortium (LDC): The LDC is an open consortium of universities, libraries, corporations, and government research laboratories. It serves as a repository that includes numerous corpora. Mix of free and paid corpora. Note that UMass Amherst is not a member. (multi-language)
- Oxford Text Archive (OTA): Oxford Text Archive is an archive of electronic texts and other literary and language resources intended for research into literary and linguistic topics. (multi-language)
- English-Corpora.org: A collection of corpora created by Mark Davies in 2004. Each corpus provides information on how native speakers speak and write, language variation, bibliometrics, and the design of language teaching material and resources. The site provides direct access to the Corpus of Contemporary American English (COCA) and NGram viewers. (English, Spanish, Portuguese)
- Survey of English Usage: The Survey has collected samples of naturally occurring language for the purposes of description and analysis since 1959. Mix of free and paid corpora. (English)
- TalkBank: TalkBank was established in 2002 to foster research in the study of human communication with an emphasis on spoken communication. It includes corpora related to first language acquisition, second language acquisition, conversation analysis, classroom discourse, and aphasic language. (multi-language)
- Traitement de Corpus Oraux en Français (TCOF): A collection of French language corpora. The spoken corpora were recorded in the 1980s and 1990s and later enriched with written corpora. (French)
📚Further Reading
Searching for SU "Corpora (Linguistics)" in Discovery will show you resources that employ corpora or are about them.
- Corpora in applied linguistics by Hunston, Susan. Publication Date: 2022
- Corpus linguistics and linguistically annotated corpora by Kubler, Sandra; Zinsmeister, Heike. Call Number: ebook. Publication Date: 2015
- Exploring corpus linguistics: language in action by Cheng, Winnie. Call Number: ebook. Publication Date: 2012
- The fundamental principles of corpus linguistics by McEnery, Tony; Brezina, Vaclav. Publication Date: 2022
- Last Updated: Oct 2, 2024 4:14 PM
- URL: https://guides.library.umass.edu/linguistics