Data Structures
A form refers to any single word, distinguished from others solely by its spelling, which may occur any number of times in a corpus or text.
A form may be either monorhematic or polyrhematic: in the latter case, the nature of the form must be indicated in the text using appropriate codes.
The concept of form must not be confused with that of occurrence, which represents an event, namely a single appearance of a form in a text.
The list of different forms present in a corpus is contained in a file known as a formarium.
Depending on the settings decided by the creator of the text corpus, forms may consist solely of letters or also of Arabic numerals.
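The difference between forms and occurrences can be illustrated with a minimal sketch. The tokeniser, the function names and the sample sentence below are assumptions made for illustration, not GATTO's own routines; the allow_digits flag stands in for the corpus setting that decides whether Arabic numerals may be part of a form.

```python
import re
from collections import Counter


def tokenize(text: str, allow_digits: bool = False) -> list[str]:
    """Return the occurrences found in text, in order of appearance.

    Hypothetical tokeniser: a form is a run of letters, optionally
    including digits when allow_digits is True.
    """
    pattern = r"[^\W_]+" if allow_digits else r"[^\W\d_]+"
    return re.findall(pattern, text)


def build_formarium(text: str, allow_digits: bool = False) -> Counter:
    """Map each distinct form to its number of occurrences."""
    return Counter(tokenize(text, allow_digits))


sample = "the cat saw the other cat in 1321"
occurrences = tokenize(sample)       # every appearance counts separately
formarium = build_formarium(sample)  # each distinct spelling is listed once
```

Here the list of occurrences is longer than the formarium, because "the" and "cat" each occur twice but are each a single form.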
A lemma refers to a group of word forms which are distinguishable from each other only by their graphical structure (graphical variations, either with or without phonetic variations) and/or by the fact that they are formed from inflections of the same verb, noun or adjective. They usually – but not always – correspond to a dictionary entry.
No distinction is made in the lemma index between upper and lower case letters, although you can differentiate between homographic lemmas by specifying a grammatical category. Each lemma may be linked to a differentiating string, known as the disambiguator, which can be used to differentiate between homographic lemmas that belong to the same grammatical category. Lastly, each lemma may have a brief comment attached to it, which is often displayed but does not contribute to the definition of the lemma or help to distinguish it from others. Put simply, if GATTO is to recognise two lemmas as being distinct from each other, they must belong to different lexicographical entries and/or to different grammatical categories. If it is necessary to distinguish between two lemmas which belong to the same grammatical category, arbitrary strings must be entered into the disambiguator field of one or both of them. A new lemma must be assigned to an occurrence and given a grammatical category, whereas the disambiguator and comment fields are optional.
All this information can be modified by the user. However, the insertion and deletion of lemmas within the lemma index are functions controlled automatically by GATTO.
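The rules that decide whether two lemmas are distinct can be sketched as a comparison key. The Lemma class, its field names and the sample values are hypothetical and do not reflect GATTO's internal record layout; the sketch only assumes what is stated above: case is ignored, the grammatical category and the disambiguator distinguish homographs, and the comment plays no part.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lemma:
    entry: str               # the lemma as written
    category: str            # grammatical category (compulsory)
    disambiguator: str = ""  # optional differentiating string
    comment: str = ""        # optional note, never used for identity

    def key(self) -> tuple[str, str, str]:
        """Identity of the lemma: case-insensitive entry, category, disambiguator."""
        return (self.entry.casefold(), self.category, self.disambiguator)


def same_lemma(a: Lemma, b: Lemma) -> bool:
    """True when the two records would count as one and the same lemma."""
    return a.key() == b.key()


noun = Lemma("Porta", "noun", comment="a door")
noun_lower = Lemma("porta", "noun")
verb = Lemma("porta", "verb")
noun_other = Lemma("porta", "noun", disambiguator="2")
```

With these values, noun and noun_lower count as the same lemma, while verb and noun_other are distinct from it.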
When a link is created between a lemma and a specific occurrence of a word form in the lemmatisation areas, a lemma-word form link is created which is valid throughout the corpus and allows the word form to be accessed by searching for lemmas, and vice versa. Since it is likely that each lemma will be linked to occurrences of several different word forms, when searching for a lemma you can change the search settings to search for all or some of the occurrences which have specifically been lemmatised with the lemma in question, those which have been lemmatised with other lemmas, and those which have not been lemmatised.
In the same way that the first linking of a lemma with an occurrence automatically inserts the new lemma into the lemma index of the corpus, the removal of the last lemmatisation which makes use of it, either by changing the lemmatisation or deleting the text from the corpus, automatically deletes the lemma from the lemma index.
Another mechanism by which lemmas are inserted into the lemma index is linked to the inclusion of previously lemmatised texts in a corpus (these could have been extracted from another corpus). In this case, each text brings its own lemmas with it, and these are automatically inserted into the lemma index if they are not already included in it.
Together, all the lemmas associated with a word form within a corpus make up that word form's lemma list. This list, which is used during lemmatisation, is maintained completely automatically. Each list entry, or rather each link between a word form and a lemma, is created as soon as a link is made between the lemma and an occurrence in a corpus text, and is removed as soon as the last use of this link is removed from the corpus.
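The automatic insertion and deletion described above behaves like reference counting on word form-lemma links. The following is a minimal sketch of that idea, not GATTO's implementation; the LemmaIndex class, its method names and the occurrence identifiers are assumptions made for illustration.

```python
from collections import Counter


class LemmaIndex:
    """Keeps lemmas and form-lemma links alive only while they are in use."""

    def __init__(self) -> None:
        self._by_occurrence: dict[int, tuple[str, str]] = {}  # occurrence id -> (form, lemma)
        self._link_uses: Counter = Counter()                  # (form, lemma) -> number of uses

    def lemmatise(self, occurrence_id: int, form: str, lemma: str) -> None:
        """Link an occurrence to a lemma, replacing any earlier lemmatisation."""
        self.unlemmatise(occurrence_id)
        self._by_occurrence[occurrence_id] = (form, lemma)
        self._link_uses[(form, lemma)] += 1

    def unlemmatise(self, occurrence_id: int) -> None:
        """Remove a lemmatisation; drop the link when its last use disappears."""
        link = self._by_occurrence.pop(occurrence_id, None)
        if link is not None:
            self._link_uses[link] -= 1
            if self._link_uses[link] == 0:
                del self._link_uses[link]

    def lemmas(self) -> set[str]:
        """The lemma index: every lemma still used by at least one occurrence."""
        return {lemma for _, lemma in self._link_uses}

    def lemma_list(self, form: str) -> set[str]:
        """The lemma list of a word form."""
        return {lemma for f, lemma in self._link_uses if f == form}


index = LemmaIndex()
index.lemmatise(1, "cats", "cat")
index.lemmatise(2, "cat", "cat")
index.unlemmatise(1)  # the link cats-cat disappears, the lemma cat remains
```

Removing the lemmatisation of occurrence 2 as well would leave the lemma index empty, mirroring the automatic deletion of a lemma once its last use is gone.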
A lemma can be made up of any sequence of letters, with or without diacritics.
During the creation of a new corpus, you can specify whether you want Arabic numerals to be accepted within lemmas.
The mechanism for dynamically managing the headword list and machine dictionary adopted in GATTO, whilst optimal in most cases, makes it difficult to apply the system to a corpus other than the one on which these archives were built.
A new category of headwords and form-headword links has therefore been introduced, termed ‘silent’ because they are present in the archives without, however, referring to any occurrence in the corpus: given a silent form-headword pair, there is no occurrence in the corpus of the form lemmatised with that headword; just as there is no occurrence in the corpus lemmatised with a silent headword.
Silent elements are tools that a corpus can inherit from another, already lemmatised corpus and utilise for the purposes of its own lemmatisation.
Imagine that within a corpus a number of lemmas referring to animals have been defined (dog, cat, fox, wolf, eagle, sparrow, owl...). You could run a search for occurrences of word forms linked to any lemma that denotes a mammal by searching for the lemmas dog, cat, etc. However, you would need to be sure that you had cited all the relevant lemmas to obtain a complete result. This is why the option exists to link each of the first four lemmas in the example with a hyperlemma (for example, the hyperlemma mammal) in a similar way to that used to link lemmas with word forms.
Running a search for the hyperlemma mammal would bring up the same results as a search for lemmas linked to all occurrences of dog, cat, fox and wolf.
Still using the original example, the lemmas eagle, sparrow and owl could be linked to the hyperlemma bird, with the same result.
If we extend this concept, we could think about introducing a hyperlemma animal, which could be linked not just with other lemmas but also with the hyperlemmas mammal and bird. Running a search for the hyperlemma animal would bring up all occurrences of word forms lemmatised with lemmas connected (through mammal and bird) to the hyperlemma animal.
In GATTO terminology, mammal and bird are known as level 1 hyperlemmas since they are connected directly with lemmas, whereas animal is a level 2 hyperlemma since it is connected with level 1 hyperlemmas. You can create hyperlemmas at higher levels even than this as long as a chain of descending connections is in place to link them to one or more lemmas present in the corpus.
A second use of hyperlemmas involves linking them directly with specific occurrences. Only level 1 hyperlemmas can be created and used in this way. As a result, you can run searches on hyperlemmas specifying that you do not wish to search for all occurrences lemmatised with the lemmas linked to that hyperlemma, but only those expressly linked to the hyperlemma indicated. Consequently this second type of hyperlemma search is limited to occurrences linked to specific word form-lemma-hyperlemma groups.
In conclusion, a hyperlemma can simultaneously be linked to hyperlemmas of both higher and lower levels (and lemmas if it is a level 1 hyperlemma). In the example given, mammal, a level 1 hyperlemma, is linked on a higher level with animal and on a lower level with cat, dog, fox and wolf.
GATTO uniquely identifies a hyperlemma by means of three attributes: its graphical appearance (compulsory), its disambiguator (optional) and its level (compulsory).
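The animal, mammal and bird example can be sketched as a small tree of links. The dictionary layout and the function names are assumptions made for illustration, not GATTO's data format; hyperlemmas are identified by name only for brevity, and the links are assumed to contain no cycles.

```python
# Each hyperlemma maps to the lemmas or lower-level hyperlemmas linked to it.
links = {
    "animal": {"mammal", "bird"},
    "mammal": {"dog", "cat", "fox", "wolf"},
    "bird": {"eagle", "sparrow", "owl"},
}


def lemmas_under(name: str, links: dict[str, set[str]]) -> set[str]:
    """Collect every lemma reachable from a hyperlemma by descending links."""
    found: set[str] = set()
    for child in links.get(name, ()):
        if child in links:   # the child is itself a hyperlemma
            found |= lemmas_under(child, links)
        else:                # the child is a plain lemma
            found.add(child)
    return found


def level(name: str, links: dict[str, set[str]]) -> int:
    """Level 1 links directly to lemmas; each hyperlemma layer below adds one."""
    below = (level(child, links) for child in links[name] if child in links)
    return 1 + max(below, default=0)
```

Under these assumptions, lemmas_under("animal", links) gathers all seven lemmas of the example, which is the set a search on the hyperlemma animal would then look up occurrences for.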
In GATTO a text is a text file. Within GATTO, texts are uniquely identifiable by codes made up of between one and three alphanumeric characters.
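A check on the shape of a text code might look like the sketch below. It assumes that "alphanumeric" means unaccented Latin letters and digits, which the description above does not specify; the function name is hypothetical.

```python
import re


def is_valid_text_code(code: str) -> bool:
    """True for codes of one to three characters drawn from A-Z, a-z and 0-9."""
    return re.fullmatch(r"[A-Za-z0-9]{1,3}", code) is not None
```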
Generally speaking, a text coincides with a complete literary document. However, there is nothing to prevent a complete work being divided up into more than one text file, and therefore more than one GATTO text. For example, you might want to divide The Divine Comedy into three files, assigning each of the three books a different code. Conversely, you could also insert a collection of novels by one author into the same file, thereby treating it as one document.
You should bear in mind, however, that each text file is linked to a single bibliographical record, so everything in the file will be assigned the same author, title, etc.
In practical terms, a text is contained within a text file (also known as a base text) held within the text directory. These text files, written in ANSI, must be prepared using a text editor or appropriate word processing program. Any notes are inserted into another ANSI file which is called an associated text (Note).
Any second editions or translations into other languages can be inserted into a third ANSI file known as an associated text (Trad).
A corpus is the textual domain within which operations performed using GATTO take place, whether these involve lemmatisation, searches, data modifications, etc. Each corpus is identified by a name. A single disk may contain corpora with different names, or corpora with the same name may be located on different disks.
The elements of a corpus are the texts added to it over time. It is possible to group the texts within a corpus into subcorpora defined dynamically on the basis of their bibliographic data. Up to six subcorpora can be defined simultaneously. A single text may belong to more than one subcorpus.
The structure of a subcorpus, understood as the list of constituent texts, can be saved to a file and recalled in a subsequent session to redefine the same subcorpus.
Corpora and subcorpora may also consist of a single text.
Searches can be carried out within the corpus, a subcorpus or a combination of several subcorpora; any duplicates in the results will be automatically removed.
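How subcorpora defined on bibliographic data combine in a search can be sketched as follows. The record fields, text codes, sample occurrences and function names are all invented for illustration and are not GATTO's own structures; the limit of six comes from the description above.

```python
from typing import Callable

MAX_SUBCORPORA = 6  # limit on simultaneously defined subcorpora stated above

# Invented bibliographic records, keyed by text code.
records = {
    "T1": {"author": "Author A", "genre": "verse"},
    "T2": {"author": "Author A", "genre": "prose"},
    "T3": {"author": "Author B", "genre": "verse"},
}


def define_subcorpus(records: dict, predicate: Callable[[dict], bool]) -> set[str]:
    """Select the codes of the texts whose record satisfies the predicate."""
    return {code for code, record in records.items() if predicate(record)}


def search(form: str, occurrences: list[tuple[str, int, str]],
           subcorpora: list[set[str]]) -> list[tuple[str, int, str]]:
    """Find a form in the union of the subcorpora; each occurrence is reported once."""
    if len(subcorpora) > MAX_SUBCORPORA:
        raise ValueError("at most six subcorpora can be defined at once")
    scope = set().union(*subcorpora)  # a text shared by two subcorpora counts once
    return [occ for occ in occurrences if occ[0] in scope and occ[2] == form]


by_author_a = define_subcorpus(records, lambda r: r["author"] == "Author A")
verse = define_subcorpus(records, lambda r: r["genre"] == "verse")

# (text code, position in text, form)
occurrences = [("T1", 4, "gatto"), ("T2", 9, "gatto"), ("T3", 2, "gatto"), ("T3", 5, "cane")]
hits = search("gatto", occurrences, [by_author_a, verse])
```

Text T1 belongs to both subcorpora, yet its occurrence appears only once in hits because the search scope is the union of the selected texts.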
A corpus can contain over 16,000 texts and 2 billion occurrences.
Each text included in a corpus is linked to a whole array of bibliographical information (title, author, edition, code, etc.), which is stored as a record in a special database known as the bibliographical archive.
GATTO’s functions therefore include the creation and deletion of bibliographical archives and the insertion, modification and deletion of records within a bibliographical archive.
When a new corpus is created, the user is asked to indicate the file which contains the bibliographical archive which should be linked to this corpus. If the file does not exist, it will be created.
Each corpus is linked to one bibliographical archive which must eventually contain all the records relating to texts within that corpus. However, one bibliographical archive may cover more than one corpus. Depending on which environment is in use, GATTO enables either the entire archive or just the part which contains records linked to the texts of the corpus in use to be accessed.
You can always replace the bibliographical archive linked to a corpus. All the user needs to do is check that the new archive contains all the necessary records. If it does not, the functions which require access to texts that no longer have bibliographical data will be disabled, and the user will be notified of the problem.
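The check the user is asked to perform before replacing an archive amounts to a set difference. This is a minimal sketch under the assumption that archive records can be looked up by text code; the function name and the sample data are hypothetical.

```python
def missing_records(corpus_codes: set[str], archive: dict[str, dict]) -> list[str]:
    """Return, in sorted order, the text codes that have no record in the archive."""
    return sorted(set(corpus_codes) - set(archive))


corpus_codes = {"T1", "T2", "T3"}
new_archive = {"T1": {"title": "Title one"}, "T3": {"title": "Title three"}}
missing = missing_records(corpus_codes, new_archive)
```

An empty result would mean the new archive covers every text in the corpus; here T2 would be left without bibliographical data.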
As stated previously, one bibliographical archive may serve different corpora.
Physically, a bibliographical archive is a file whose name and location are chosen by the user.