Opera del Vocabolario Italiano

Istituto del Consiglio Nazionale delle Ricerche

Data Structures

Word form

By word form, we mean any single word, distinguishable from others solely on the basis of its written form, which occurs one or more times in a corpus or text.
The concept of a word form should not be confused with that of an occurrence, which represents an event – a single appearance of a word form in a text.
So, for example, the phrase ‘showing page numbers on every page’ is made up of six occurrences of five different word forms:
• Two occurrences of the word form page
• One occurrence of the word form showing
• One occurrence of the word form numbers
• One occurrence of the word form on
• One occurrence of the word form every
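The distinction between word forms and occurrences can be illustrated with a short Python sketch. It is not GATTO's code: the function name `count_word_forms` is hypothetical, and splitting on whitespace is a simplifying assumption.

```python
from collections import Counter


def count_word_forms(text):
    """Map each word form to its number of occurrences.

    Assumption: tokens are separated by whitespace only.
    """
    occurrences = text.split()
    return Counter(occurrences)


counts = count_word_forms("showing page numbers on every page")
total_occurrences = sum(counts.values())  # every appearance is counted
distinct_word_forms = len(counts)         # each word form is counted once
```

Here `total_occurrences` is 6 and `distinct_word_forms` is 5, with `counts["page"]` equal to 2, matching the list above.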
Each archive contains a directory of all the different word forms present in its corpus, known as the word form index. No distinction is made between upper and lower case letters, so, for example, the strings Oxford, oxford and OXFORD all refer to the same word form.
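A case-insensitive index of this kind could be sketched in Python as follows. The name `build_word_form_index` and the choice to store token positions are assumptions made for illustration; the internal layout of GATTO's index is not described here.

```python
def build_word_form_index(tokens):
    """Group token positions under a lower-cased key.

    Assumption: lower-casing is enough to merge case variants.
    """
    index = {}
    for position, token in enumerate(tokens):
        key = token.lower()
        index.setdefault(key, []).append(position)
    return index


index = build_word_form_index(["Oxford", "oxford", "OXFORD", "page"])
```

In this sketch all three spellings of Oxford end up under the single key `"oxford"`, so the index holds two word forms for four occurrences.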
Word forms can be either simple or compound. Generally speaking, all word forms lifted from the text are considered to be simple word forms. However, there are special codes which allow several words from a text to be gathered together in a compound expression which is treated as if it were one word, and can be lemmatised as such. For example, the string ‘Elizabeth of York’ would normally generate the three word forms ‘Elizabeth’, ‘of’ and ‘York’ in the word form index, which would be searchable and lemmatisable separately. However, if the string were enclosed between codes that defined it as a compound expression, GATTO would save one word form, ‘Elizabeth of York’, in the word form index.
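The effect of compound codes can be sketched in Python. The actual codes are not given in this section, so the markers `[[` and `]]` below are purely hypothetical placeholders, as is the function name `extract_word_forms`.

```python
import re

# Hypothetical markers: "[[" opens and "]]" closes a compound expression.
# The real GATTO codes are not specified in this section.
TOKEN_PATTERN = re.compile(r"\[\[(.+?)\]\]|(\S+)")


def extract_word_forms(text):
    """Return word forms, keeping each marked compound as one item."""
    forms = []
    for compound, simple in TOKEN_PATTERN.findall(text):
        if compound:
            # Collapse internal whitespace so the compound is stored uniformly.
            forms.append(" ".join(compound.split()))
        else:
            forms.append(simple)
    return forms


plain = extract_word_forms("Elizabeth of York")
marked = extract_word_forms("[[Elizabeth of York]] reigned")
```

Without markers the sketch yields three simple word forms; with them, ‘Elizabeth of York’ is kept as a single compound word form.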
A word form can be made up of any sequence of letters, with or without diacritics.
When creating a new corpus, you can specify whether Arabic numerals are to be accepted within word forms. If they are, all the numerals present in the corpus texts, including any mixed sequences of letters and numerals, become searchable.
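The numeral option can be approximated with a Python tokeniser. This is a sketch under stated assumptions: the function `tokenize` and its `accept_numerals` parameter are hypothetical, and treating digits as separators when numerals are not accepted is a guess, since that case is not described here.

```python
import re
import unicodedata

# Letters only: any Unicode letter, including precomposed letters with diacritics.
LETTERS_ONLY = re.compile(r"[^\W\d_]+")
# Letters plus the Arabic numerals 0-9, so mixed sequences stay together.
LETTERS_AND_NUMERALS = re.compile(r"(?:[^\W\d_]|[0-9])+")


def tokenize(text, accept_numerals=False):
    """Split text into word forms.

    Assumption: when numerals are not accepted, digits act as separators.
    """
    # Compose letter + combining mark pairs so diacritics stay inside the token.
    text = unicodedata.normalize("NFC", text)
    pattern = LETTERS_AND_NUMERALS if accept_numerals else LETTERS_ONLY
    return pattern.findall(text)


with_numerals = tokenize("room 42b \u00e8 chiusa", accept_numerals=True)
without_numerals = tokenize("room 42b \u00e8 chiusa", accept_numerals=False)
```

With numerals accepted, the mixed sequence `42b` survives as one word form; without them, only the letter `b` is kept in this sketch.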