User:MWindhouwer

From Open Annotation Collaboration

Menzo Windhouwer

Head of Knowledge Software, The Language Archive

Max Planck Institute for Psycholinguistics

Use Case: Annotation of linguistic resources

The archive of the Max Planck Institute for Psycholinguistics contains various large collections of linguistic resources, including the DOBES archive for material on endangered languages. These materials include:

  • audio and video recordings,
  • annotation files of these recordings,
  • images and photos,
  • lexica, possibly with multimedia extensions, and
  • textual material of all sorts, such as field notes, notes about phonetics and prosody, sketch grammars, etc.

For a research project, a linguist collects and annotates many references to related materials, or fragments of them, and of course to many other (web-based) sources. These reference annotations can take any form the linguistic research requires: they could mark a good example, but could also contain a critique of a scientific method or theory that was used, or point out a (presumed) error in a gloss or metadata entry. A typologist, for example, would be interested in collecting and annotating citations from textual materials on a specific linguistic phenomenon across a wide range of languages. The field linguist who creates such textual materials for a specific language, like sketch grammars and various types of notes, is on the other hand interested in keeping track of example utterances and their annotation tiers. If these annotations could be shared among all linguists, ethnologists, musicologists, etc., the typologist could, for example, extend his collection of annotated references to include fragments of the primary multimedia data on which the field linguist has based his analyses.

Throughout the years various tools have been developed for the archive, which provide some of this functionality:

  • the IMDI editor allows metadata to be associated with all materials stored in the archive,
  • using ELAN and ANNEX, any form of exactly aligned standoff annotations for a recording can be created and viewed,
  • in LEXUS, a web-based tool to create and edit lexical databases, multimedia resources can be associated with lexical entries, and
  • using VICOS, relations can be drawn between the lexicalized concepts from various lexica in LEXUS, thus allowing the creation of conceptual spaces.

These tools each create their own resources, linked where needed to primary data in the archive, such as a recording or a series of photos. ADDIT was a first experiment not only to annotate and relate data elements within these resources, e.g., elements in metadata files, annotations and lexical entries, but also to include other arbitrary web-based content. Users found the approach interesting, yet it did not take off, mainly because it lacked a properly user-friendly software design (interaction, visualization, etc.) and, as the bigger problem, the technology to make the weaving an efficient task. It therefore remains an open issue how to establish and annotate a network of arbitrary resources that covers all aspects of the linguistic analysis of a language, from primary data via grammatical descriptions to typological abstractions. From our tests we know that this attractive paradigm will only succeed if suitable technology is offered.

Within the large-scale European infrastructure project CLARIN, a foundation for such an annotation network has been laid. For example, European consortia and federations have been established for authentication and authorization and for the assignment of persistent identifiers to resources. This means that annotations can use persistent references to resources and that researchers can easily access these resources. Large proposed projects closely linked with CLARIN, and even with other initiatives in the humanities and social sciences, for example CLARICLE and DASISH, now include the goal of establishing an efficient annotation layer on top of the infrastructure and of investing substantially in suitable technology. The OAC model is considered a prime candidate for making this new annotation layer interoperable.
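
To make the idea of annotations that use persistent references more concrete, the following is a minimal sketch in Python. It is not the OAC vocabulary or any existing tool of the archive; it only assumes that an annotation has a creator, a free-text body, and one or more targets, each addressed by a persistent identifier and optionally narrowed to a time interval of a recording. The class names, field names, the helper annotations_on, and the identifiers used with them are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Target:
    """A resource, or a fragment of it, addressed by a persistent identifier."""
    pid: str                          # persistent identifier (hypothetical format)
    start_ms: Optional[int] = None    # optional start of a time interval
    end_ms: Optional[int] = None      # optional end of a time interval

    def is_fragment(self) -> bool:
        # A target is a fragment only if both ends of the interval are given.
        return self.start_ms is not None and self.end_ms is not None


@dataclass
class Annotation:
    """A reference annotation: who said what about which resources."""
    creator: str
    body: str
    targets: List[Target] = field(default_factory=list)


def annotations_on(annotations: List[Annotation], pid: str) -> List[Annotation]:
    """Return the annotations that reference the resource with this identifier."""
    return [a for a in annotations if any(t.pid == pid for t in a.targets)]
```

With such a structure, a typologist's annotation could point both at a passage in a sketch grammar and at a fragment of the recording on which that passage is based, and anyone holding the identifier of the recording could look up all annotations that reference it.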