Linguistic research and automated text analysis often involve breaking sentences apart into tokens and identifying the base form and meaning of each individual element, with the resulting "interpretative" data being either used directly or fed into various kinds of downstream processing tools and algorithms. The tokenisation algorithms used to perform this task draw on carefully crafted linguistic datasets, the quality of which often predicts the usefulness of the produced annotations.
The Oxford English Dictionary, or OED, is one of the largest dictionaries ever compiled, which makes it one of the most valuable and comprehensive linguistic datasets for the English language. Published by Oxford University Press as the definitive record of the language, it features about 600,000 words and 3 million quotations, and covers over 1,000 years of English.
The OED Text Annotator is the first and only publicly available tool for linguistic annotation based on the OED. Built by Oxford Languages, the system performs tokenisation, part-of-speech tagging, and lemmatisation on a given text, and links each word to its corresponding OED lexeme through sense disambiguation. In the publicly available version of the application, the word origin and usage data from the resulting annotations are picked up by another tool, the OED Text Visualizer, which displays them in an interactive visual format.
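To make the stages named above more concrete, the following is a minimal Python sketch of an annotation pipeline of this general shape: tokenise, then attach a part of speech, a lemma, and a lexeme identifier to each token. It is not the OED Text Annotator's implementation or API. The names `TOY_LEXICON`, `Annotation`, `tokenise`, and `annotate` are hypothetical, the lexeme identifiers are made up, and a plain dictionary lookup stands in for real tagging and context-based sense disambiguation.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical toy lexicon: surface form -> (part of speech, lemma, lexeme id).
# The ids are invented for illustration and are NOT real OED identifiers.
TOY_LEXICON = {
    "the": ("DET", "the", "toy:the"),
    "cat": ("NOUN", "cat", "toy:cat"),
    "cats": ("NOUN", "cat", "toy:cat"),
    "sat": ("VERB", "sit", "toy:sit"),
    "on": ("ADP", "on", "toy:on"),
    "mat": ("NOUN", "mat", "toy:mat"),
}


@dataclass
class Annotation:
    token: str                # surface form as it appears in the text
    pos: Optional[str]        # part-of-speech tag, None if unknown
    lemma: Optional[str]      # base form, None if unknown
    lexeme_id: Optional[str]  # link to a lexicon entry, None if unknown


def tokenise(text: str) -> List[str]:
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*|[^\sA-Za-z]", text)


def annotate(text: str) -> List[Annotation]:
    """Annotate each token via a case-insensitive lexicon lookup.

    Assumption: one entry per surface form. A real system would use the
    surrounding context to choose among several candidate lexemes.
    """
    annotations = []
    for token in tokenise(text):
        entry = TOY_LEXICON.get(token.lower())
        if entry is None:
            annotations.append(Annotation(token, None, None, None))
        else:
            pos, lemma, lexeme_id = entry
            annotations.append(Annotation(token, pos, lemma, lexeme_id))
    return annotations


if __name__ == "__main__":
    for item in annotate("The cats sat on the mat."):
        print(item)
```

In this sketch, tokens missing from the lexicon (here, the final full stop) are kept with empty annotations rather than dropped, so the output stays aligned with the input text.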
The OED Text Annotator mobilises the value and richness of the OED dataset for linguistic analysis and NLP applications. It may therefore both lead to important discoveries in all areas of research and help power more advanced and robust NLP systems supporting various industrial applications.