Data Structures
By word form, we mean every single word, distinguishable from others exclusively on the basis of its graphical structure, which occurs any number of times in a corpus or text.
The concept of a word form should not be confused with that of an occurrence, which represents an event – a single appearance of a word form in a text.
So, for example, the phrase ‘showing page numbers on every page’ is made up of six occurrences of five
different word forms:
• Two occurrences of the word form page
• One occurrence of the word form showing
• One occurrence of the word form numbers
• One occurrence of the word form on
• One occurrence of the word form every
All the different word forms present in a corpus are listed in an archive known as the word form index.
Upper and lower case letters are not distinguished, so, for example,
the strings Oxford, oxford and OXFORD all refer to the same word form.
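The difference between occurrences and word forms can be illustrated with a short Python sketch. This is not GATTO's own code: the helper name `word_form_counts` and the decision to split on whitespace only are assumptions made for illustration. Case is folded, as described above.

```python
from collections import Counter

def word_form_counts(text: str) -> Counter:
    """Count occurrences per word form, ignoring case (hypothetical helper)."""
    # Assumption: word forms are separated by whitespace only.
    return Counter(token.lower() for token in text.split())

counts = word_form_counts("showing page numbers on every page")
total_occurrences = sum(counts.values())   # every appearance counts
distinct_word_forms = len(counts)          # each word form counts once
```

Here `total_occurrences` is 6 and `distinct_word_forms` is 5, matching the example in the text.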
Word forms can be either simple or compound. Generally speaking, all word forms lifted from the text
are considered to be simple word forms. However, there are special codes which allow several words from a
text to be gathered together in a compound expression which is treated as if it were one word, and can be
lemmatised as such. For example, the string ‘Elizabeth of York’ would normally generate the three word
forms ‘Elizabeth’, ‘of’ and ‘York’ in the word form index, which would be searchable and lemmatisable
separately. However, if the string were enclosed between codes that defined it as a compound expression,
GATTO would save one word form, ‘Elizabeth of York’, in the word form index.
A word form can be made up of any sequence of letters, with or without diacritics.
During the creation of a new corpus, you can specify whether you want arabic numerals to be accepted
within word forms – if this were the case, all the numerals present in the corpus texts, including any mixed
sequences of letters and numerals, would be searchable.
A lemma refers to a group of word forms which are distinguishable from each other only by their
graphical structure (graphical variations, either with or without phonetic variations) and/or by the fact that
they are formed from inflections of the same verb, noun or adjective. They usually – but not always –
correspond to a dictionary entry.
No distinction is made in the lemma index between upper and lower case letters, although you can
differentiate between homographic lemmas by specifying a grammatical category. Each lemma may be
linked to a differentiating string, known as the disambiguator, which can be used to differentiate between
homographic lemmas which belong to the same grammatical category. Lastly, each lemma may have a brief
comment attached to it, which is often displayed but does not contribute to a definition of the lemma or help
to distinguish it from others. Put simply, if GATTO is to recognise two lemmas as being distinct from each
other, they must belong to different lexicographical entries and/or to different grammatical categories. If it is
necessary to distinguish between two lemmas which belong to the same grammatical category, arbitrary
strings must be entered into the disambiguator field of one or both of them. While a new lemma has to be
assigned to an occurrence and grammatical category, the disambiguator and comment fields are optional.
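One way to picture these identity rules is as a composite key. The sketch below is an illustration only, not GATTO's internal representation: the class name `LemmaKey`, the field names and the category labels are all assumptions. The comment is stored separately from the key because it plays no part in distinguishing lemmas.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LemmaKey:
    """Hypothetical identity of a lemma: form, category, optional disambiguator."""
    form: str
    category: str            # grammatical category (labels here are assumed)
    disambiguator: str = ""  # optional arbitrary string

    def __post_init__(self):
        # Upper and lower case are not distinguished in the lemma index.
        object.__setattr__(self, "form", self.form.lower())

# Comments are attached to a lemma but are not part of its identity.
comments = {LemmaKey("bank", "noun", "river"): "edge of a river"}

same = LemmaKey("Bank", "noun") == LemmaKey("bank", "noun")
by_category = LemmaKey("bank", "noun") != LemmaKey("bank", "verb")
by_disambiguator = LemmaKey("bank", "noun", "river") != LemmaKey("bank", "noun", "money")
```

All three comparisons evaluate to `True`: case alone never separates two lemmas, whereas a different category or disambiguator does.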
All this information can be modified by the user. However, the insertion and deletion of lemmas within
the lemma index are functions controlled automatically by GATTO.
When a link is created between a lemma and a specific occurrence of a word form in the lemmatisation
areas, a lemma-word form link is created which is valid throughout the corpus and allows the word form to
be accessed by searching for lemmas, and vice versa. Since it is likely that each lemma will be linked to
occurrences of several different word forms, when searching for a lemma you can change the search settings
to search for all or some of the occurrences which have specifically been lemmatised with the lemma in
question, those which have been lemmatised with other lemmas, and those which have not been lemmatised.
In the same way that the first linking of a lemma with an occurrence automatically inserts the new
lemma into the lemma index of the corpus, the removal of the last lemmatisation which makes use of it,
either by changing the lemmatisation or deleting the text from the corpus, automatically deletes the lemma
from the lemma index.
Another mechanism by which lemmas are inserted into the lemma index is linked to the inclusion of
previously lemmatised texts in a corpus. (These could have been extracted from another corpus.) In this case,
each text would bring its own lemmas with it which would be automatically inserted into the lemma index if
they were not already included in it.
Together, all the lemmas associated with a word form within a corpus make up the lemma list of that
word form. This list, which is used during lemmatisation, is maintained entirely automatically. Each list entry, or
rather link between a word form and a lemma, is created as soon as a link is made between the lemma and an
occurrence in a corpus text and is removed as soon as the last use of this link is removed from the corpus.
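The automatic behaviour described above can be read as a form of reference counting: an entry exists for as long as at least one occurrence uses it. The following Python sketch illustrates the idea under simplifying assumptions. The class `LemmaIndex` and its methods are hypothetical, and a lemma is reduced to a plain string, without grammatical category or disambiguator.

```python
class LemmaIndex:
    """Hypothetical reference-counted store of word form-lemma links."""

    def __init__(self):
        # (word_form, lemma) -> number of occurrences lemmatised with that pair
        self._links = {}

    def link(self, word_form: str, lemma: str) -> None:
        key = (word_form.lower(), lemma.lower())
        self._links[key] = self._links.get(key, 0) + 1

    def unlink(self, word_form: str, lemma: str) -> None:
        key = (word_form.lower(), lemma.lower())
        if key not in self._links:
            raise KeyError(key)
        self._links[key] -= 1
        if self._links[key] == 0:
            # Last use removed: the link disappears automatically.
            del self._links[key]

    def lemmas(self) -> set:
        """Lemmas currently in the index (those with at least one link)."""
        return {lemma for _, lemma in self._links}

    def lemma_list(self, word_form: str) -> set:
        """Lemmas currently linked to the given word form."""
        wanted = word_form.lower()
        return {lemma for form, lemma in self._links if form == wanted}
```

In this sketch a lemma drops out of `lemmas()` once the last link that uses it has been removed, mirroring the automatic deletion from the lemma index.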
A lemma can be made up of any sequence of letters, with or without diacritics.
During the creation of a new corpus, you can specify whether you want arabic numerals to be accepted
within lemmas.
The mechanism GATTO uses to manage the lemma index and the machine dictionary dynamically is optimal in most cases, but it makes these archives difficult to apply to a corpus other than the one on which they were built.
A new category of lemmas and word form-lemma links has therefore been introduced. These are called mute because they are present in the archives without referring to any occurrence in the corpus: for a mute word form-lemma pair, no occurrence of the word form in the corpus has been lemmatised with that lemma; likewise, no occurrence in the corpus has been lemmatised with a mute lemma.
Mute elements are tools that a corpus can inherit from another, already lemmatised corpus and use for its own lemmatisation.
Imagine that within a corpus a number of lemmas referring to animals have been defined (dog, cat,
fox, wolf, eagle, sparrow, owl...). To find the occurrences of word forms linked to lemmas that denote
mammals, you could search for the lemmas dog, cat, etc. However, you would need to be sure that you had cited all the relevant lemmas to obtain a complete result. This is why the option exists to link each of the first four lemmas in the example with a hyperlemma, for example the
hyperlemma mammal, in a similar way to that used to link lemmas with word forms.
Running a search for the hyperlemma mammal would bring up the same results as a search for all
occurrences linked to the lemmas dog, cat, fox and wolf.
Still using the original example, the lemmas eagle, sparrow and owl could be linked to the hyperlemma
bird, with the same result.
If we extend this concept, we could think about introducing a hyperlemma animal, which could be
linked not just with other lemmas but also with the hyperlemmas mammal and bird. Running a search for the
hyperlemma animal would bring up all occurrences of word forms that have been lemmatised with lemmas
connected (through mammal and bird) to the hyperlemma animal.
In GATTO terminology, mammal and bird are known as level 1 hyperlemmas since they are connected directly with lemmas, whereas animal is a level 2 hyperlemma since it is connected with level 1
hyperlemmas. You can create hyperlemmas at higher levels even than this as long as a chain of descending
connections is in place to link them to one or more lemmas present in the corpus.
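A search on a hyperlemma can be thought of as expanding it, level by level, into the lemmas it ultimately covers. The Python sketch below shows one way to do this. It is an illustration, not GATTO's implementation: the function `expand` and the `children` mapping are hypothetical, hyperlemmas are identified by name alone, and the links are assumed to contain no cycles.

```python
from typing import Dict, Set

def expand(hyperlemma: str, children: Dict[str, Set[str]]) -> Set[str]:
    """Return the lemmas reachable from a hyperlemma through descending links."""
    lemmas: Set[str] = set()
    for child in children.get(hyperlemma, set()):
        if child in children:
            # The child is itself a hyperlemma: descend one level.
            lemmas |= expand(child, children)
        else:
            # The child is an ordinary lemma.
            lemmas.add(child)
    return lemmas

# Links from the example in the text.
children = {
    "mammal": {"dog", "cat", "fox", "wolf"},     # level 1
    "bird": {"eagle", "sparrow", "owl"},         # level 1
    "animal": {"mammal", "bird"},                # level 2
}

mammal_lemmas = expand("mammal", children)
animal_lemmas = expand("animal", children)
```

With these links, `animal_lemmas` contains all seven lemmas, while `mammal_lemmas` contains only dog, cat, fox and wolf.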
A second use of hyperlemmas involves linking them directly with specific occurrences. Only level 1
hyperlemmas can be created and used in this way. As a result, you can run searches on hyperlemmas
specifying that you do not wish to search for all occurrences lemmatised with the lemmas linked to that
hyperlemma, but only those expressly linked to the hyperlemma indicated. Consequently this second type of
hyperlemma search is limited to occurrences linked to specific word form-lemma-hyperlemma groups.
In conclusion, a hyperlemma can simultaneously be linked to hyperlemmas of both higher and lower levels (and lemmas if it is a level 1 hyperlemma). In the example given, mammal, a level 1 hyperlemma, is
linked on a higher level with animal and on a lower level with cat, dog, fox and wolf.
GATTO uniquely identifies a hyperlemma by means of 3 attributes: its graphical appearance
(compulsory), its disambiguator (optional) and its level (compulsory).
In GATTO a text is a text file. Within GATTO, texts are uniquely identifiable by codes made up of
between one and three alphanumeric characters.
Generally speaking, a text coincides with a complete literary document. However, there is nothing to prevent a complete work being divided up into more than one text file, and therefore more than one GATTO
text. For example, you might want to divide The Divine Comedy into three files, assigning each of the three
books a different code. Conversely, you could also insert a collection of novels by one author into the same
file, thereby treating it as one document.
You should bear in mind, however, that each text file is linked to a single bibliographical record and will
therefore be assigned the same author, title etc.
In practical terms, a text is contained within a text file (also known as a base text) held within the text directory. These text files, written in ANSI, must be prepared using a text editor or appropriate word
processing program. Any notes are inserted into another ANSI file which is called an associated text (Note).
Any second editions or translations into other languages can be inserted into a third ANSI file known as an associated text (Trad).
functions are lemmatisations, searches, data changes etc. Each corpus is identifiable by its name, made up of
between 1 and 12 alphanumeric characters. Physically, a corpus is placed in a directory whose name is
derived from that of the corpus. For example, the corpus Dante on drive C is automatically placed in the
directory C:\dante.gat. You can either have corpuses with different names on the same disk or drive, or
corpuses of the same name on different disks or drives.
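The naming rule can be sketched in Python as follows. The derivation shown (lower-casing the name and appending `.gat`) is inferred from the single Dante example and is an assumption, as are the function name `corpus_directory` and the restriction of "alphanumeric" to ASCII letters and digits.

```python
import re
from pathlib import PureWindowsPath

def corpus_directory(drive: str, name: str) -> PureWindowsPath:
    """Derive the corpus directory from drive letter and corpus name (hypothetical)."""
    # A corpus name has between 1 and 12 alphanumeric characters.
    if not re.fullmatch(r"[A-Za-z0-9]{1,12}", name):
        raise ValueError(f"invalid corpus name: {name!r}")
    # Assumption: the directory is the lower-cased name plus a .gat suffix.
    return PureWindowsPath(f"{drive}:\\") / f"{name.lower()}.gat"

dante_dir = str(corpus_directory("C", "Dante"))
```

Because the drive is part of the path, two corpuses with the same name on different drives map to different directories.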
A corpus is made up of texts which are inserted into it one by one. You can group the texts within a
corpus into corpus subsets dynamically defined according to their bibliographical data. Up to six corpus
subsets may be defined at any one time and a text can belong to more than one corpus subset.
The structure of a corpus subset (essentially a list of its constituent texts) can be saved to disk and
recalled in future sessions for further use.
If corpus subsets are not saved to disk before exiting from the corpus, they will be lost, in the sense that
they will need to be redefined for future sessions if they are needed again.
Corpuses and corpus subsets can consist of just one single text.
Searches can be run within a corpus, a corpus subset or a combination of different corpus subsets. Any
duplicated results will be automatically removed.
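Since a text can belong to more than one corpus subset, searching a combination of subsets behaves like searching the union of their texts, so each text is examined only once. The Python sketch below illustrates this. The function `search`, the text codes and the token lists are hypothetical, and a hit is represented simply as a pair of text code and token position.

```python
from typing import Dict, Iterable, List, Set, Tuple

def search(word_form: str,
           selected: Iterable[str],
           subsets: Dict[str, Set[str]],
           texts: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """Find a word form in the union of the selected corpus subsets."""
    # Union of text codes: a text shared by two subsets appears only once.
    codes = set().union(*(subsets[name] for name in selected))
    hits = []
    for code in sorted(codes):
        for position, token in enumerate(texts[code]):
            if token.lower() == word_form.lower():
                hits.append((code, position))
    return hits

# Hypothetical data: text B2 belongs to both subsets.
texts = {"A1": ["Page", "one"], "B2": ["the", "page"], "C3": ["page"]}
subsets = {"s1": {"A1", "B2"}, "s2": {"B2", "C3"}}
hits = search("page", ["s1", "s2"], subsets, texts)
```

The occurrence in text B2 is reported once, even though B2 belongs to both subsets.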
A corpus can contain up to 16,000 texts and 2 billion occurrences.
Each text included in a corpus is linked to a whole array of bibliographical information (title, author,
edition, code etc.), which is held in a record in a special database known as the
bibliography or bibliographical archive.
GATTO’s functions therefore include the creation and deletion of bibliographical archives and the
insertion, modification and deletion of records within a bibliographical archive.
When a new corpus is created, the user is asked to indicate the file which contains the bibliographical
archive which should be linked to this corpus. If the file does not exist, it will be created.
Each corpus is linked to one bibliographical archive which must eventually contain all the records
relating to texts within that corpus. However, one bibliographical archive may cover more than one corpus.
Depending on which environment is in use, GATTO enables either the entire archive or just the part which
contains records linked to the texts of the corpus in use to be accessed.
You can always replace the bibliographical archive linked to a corpus. All the user needs to do is check that the new archive contains all the necessary records. If this is not done, then the functions which require
access to texts which no longer have bibliographical data will be shut down, and the user notified of the
problem.
As stated previously, one bibliographical archive may serve different corpuses.
Physically, a bibliographical archive is a file whose name and location are chosen by the user.