The University of Massachusetts Amherst


Linguistic Corpora

Many types of data are of potential interest to linguists; for now, this page focuses on corpus data.

A corpus is a body of texts collected as a representative sample. For example, the contents of a corpus may be gathered to represent a particular language at a particular time or capture a language among a particular subset of users.

Researchers can use corpora to:

  • formulate hypotheses about the workings of language
  • compute statistics and metrics to support theories and research
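
As a rough illustration of the kind of statistic a corpus can yield, the sketch below counts word frequencies over a tiny made-up corpus. The two sample sentences, the `word_frequencies` helper, and the simple regular-expression tokenizer are all assumptions for demonstration only; real corpora usually ship with their own tokenization and tools.

```python
import re
from collections import Counter


def word_frequencies(texts):
    """Count lowercase word tokens across an iterable of text strings."""
    counts = Counter()
    for text in texts:
        # Naive tokenizer (an assumption): runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


# Hypothetical two-sentence "corpus", used only for illustration.
toy_corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
]

freqs = word_frequencies(toy_corpus)
total = sum(freqs.values())

# Relative frequency makes counts comparable across corpora of different sizes.
relative = {word: count / total for word, count in freqs.items()}

print(freqs.most_common(2))  # [('the', 4), ('cat', 2)]
```

Relative frequencies rather than raw counts are typically what one would compare between corpora, since corpora differ widely in size.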

Many corpora are free to use; others charge a fee for access.

Individual corpora

Collections of corpora