Transversal project: Corpora and databases

Coordination: Christian Chanard & Amina Mettouchi
Duration: until 2018 and beyond
Participants: all members of the research unit


A) Establishing and disseminating good practices
Based on existing conventions (cf. CLARIN, IRCOM, etc.), we develop good practices for field linguistics concerning:

  1. the recording of audio and video in the field: criteria for choosing equipment, standard audio and video formats, training in audio and video recording
  2. the type of information to be recorded and provided with each recording, so that data can be exchanged and reused in different settings (linguistics, anthropology, literature, etc.)
  3. ethical and legal questions.

B) Management of corpora and databases
Creating the conditions for long-term archiving (managed through TGE-Adonis/Cines)

C) Scientific aspects

  1. Scientific objectives and decisions about annotations: For what purpose is a corpus set up? What does one want to learn from the annotations? Can one define a “minimal corpus” for linguistic fieldwork / sociolinguistics / literature / typology?
  2. Problems of comparability between corpora of different languages: How can the problem of irreducible differences between languages be addressed (what is the status of so-called “comparative” or “universal” categories)?
  3. Connections between corpora and databases: Reflections on the scientific bases of these links (required before addressing technical questions of interoperability)
    Corpora = archives of resources: audio, video, texts, metadata
    Databases = organisation of data in tables to allow for efficient exploitation (e.g. indexing of annotations to facilitate complex searches)
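
The distinction in point 3 can be illustrated with a minimal sketch, which is not the research unit's actual database design. It assumes a hypothetical single-table schema (`annotation`, with columns `resource`, `tier`, `value`, `start_ms`, `end_ms`) and invented sample rows, and uses Python's standard `sqlite3` module. The archived files stay in the corpus; the table only stores annotations that point back to them, and an index on (tier, value) supports searches that combine several annotation layers.

```python
import sqlite3

# Hypothetical schema: one row per annotation, pointing back to the
# archived resource (file name) and to a time span in the recording.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE annotation (
        id        INTEGER PRIMARY KEY,
        resource  TEXT NOT NULL,    -- archived file in the corpus
        tier      TEXT NOT NULL,    -- annotation layer, e.g. 'gloss' or 'pos'
        value     TEXT NOT NULL,    -- annotation label
        start_ms  INTEGER NOT NULL,
        end_ms    INTEGER NOT NULL
    );
    -- The index is what makes lookups by annotation label efficient.
    CREATE INDEX idx_annotation_tier_value ON annotation (tier, value);
""")

# Invented sample data, for illustration only.
rows = [
    ("rec01.wav", "gloss", "PFV", 0, 400),
    ("rec01.wav", "pos", "V", 0, 400),
    ("rec01.wav", "gloss", "NEG", 400, 700),
    ("rec02.wav", "gloss", "PFV", 100, 500),
    ("rec02.wav", "pos", "N", 100, 500),
]
conn.executemany(
    "INSERT INTO annotation (resource, tier, value, start_ms, end_ms) "
    "VALUES (?, ?, ?, ?, ?)",
    rows,
)


def find_cooccurring(conn, tier_a, value_a, tier_b, value_b):
    """Return (resource, start_ms, end_ms) spans carrying both annotations."""
    query = """
        SELECT a.resource, a.start_ms, a.end_ms
        FROM annotation AS a
        JOIN annotation AS b
          ON a.resource = b.resource
         AND a.start_ms = b.start_ms
         AND a.end_ms = b.end_ms
        WHERE a.tier = ? AND a.value = ?
          AND b.tier = ? AND b.value = ?
        ORDER BY a.resource, a.start_ms
    """
    return conn.execute(query, (tier_a, value_a, tier_b, value_b)).fetchall()


# A "complex search": spans glossed PFV that are also tagged as verbs.
hits = find_cooccurring(conn, "gloss", "PFV", "pos", "V")
print(hits)
```

Each hit carries the resource name and time span, so a result can be traced back to the corresponding passage in the archived recording.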

D) Valorisation (mobilisation) of corpora
In collaboration with UPS 2259: production of documentaries, presentation of languages on maps, samples, etc.
Giving back to the communities: production of websites for the communities (with their participation), after reflection on the tools for disseminating information, possible hosting sites, etc.

Concrete deliverables of the transversal project

  • Creation and development of a corpus management tool for adding resources (audio, video, annotated texts) together with their metadata, updating texts and existing metadata, and establishing links between metadata and resources
  • Creation and development of a browser (similar to the MPI's IMDI Browser) that makes the existing data (metadata, audio, video, texts) accessible and searchable in different modes
  • Drafting of a charter for the consultation and exploitation of the data, developed in cooperation with the three programmes of our research unit.
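
The three operations named for the corpus management tool (adding resources with metadata, updating metadata, linking metadata to resources) can be sketched as a small data structure. This is only an illustration under stated assumptions, not the tool itself: the class names `Resource` and `Corpus`, the method names, and the sample file names and metadata fields are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    path: str   # file name of the archived resource
    kind: str   # 'audio', 'video' or 'text'


@dataclass
class Corpus:
    """Hypothetical in-memory model of resources, metadata and their links."""
    resources: dict = field(default_factory=dict)  # path -> Resource
    metadata: dict = field(default_factory=dict)   # metadata id -> dict of fields
    links: dict = field(default_factory=dict)      # metadata id -> set of paths

    def add_resource(self, path, kind, meta_id, meta):
        """Add a resource together with (or attached to) a metadata record."""
        self.resources[path] = Resource(path, kind)
        self.metadata.setdefault(meta_id, {}).update(meta)
        self.link(meta_id, path)

    def update_metadata(self, meta_id, **changes):
        """Update fields of an existing metadata record."""
        if meta_id not in self.metadata:
            raise KeyError(f"unknown metadata record: {meta_id}")
        self.metadata[meta_id].update(changes)

    def link(self, meta_id, path):
        """Link an existing metadata record to an existing resource."""
        if meta_id not in self.metadata:
            raise KeyError(f"unknown metadata record: {meta_id}")
        if path not in self.resources:
            raise KeyError(f"unknown resource: {path}")
        self.links.setdefault(meta_id, set()).add(path)

    def resources_for(self, meta_id):
        """Return the paths linked to a metadata record, sorted by name."""
        return sorted(self.links.get(meta_id, set()))


# Invented example: one recording session with an audio file and its annotation.
corpus = Corpus()
corpus.add_resource("session01.wav", "audio", "session01", {"speaker": "S1"})
corpus.add_resource("session01.eaf", "text", "session01", {})
corpus.update_metadata("session01", recorded="2015")
print(corpus.resources_for("session01"))
```

Keeping the links in a separate mapping, rather than inside each resource, means one metadata record can describe several files (a recording and its annotated text, for example), which is also the information a browser would need in order to present them together.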