Metadata, Data Management, and Archiving in Language Documentation

Universitas Katolik Indonesia Atma Jaya

Bradley McDonnell
University of Hawaii at Manoa

October 13, 2023

Workshop outline

  1. Preliminaries
  2. Data Management
  3. Archiving
  4. Metadata

Metadata activity with Lameta software.

What is Language Documentation?

‘a language documentation is a lasting, multipurpose record of a language … for generations and user groups whose identity is still unknown and who may want to explore questions not yet raised at the time when the language documentation was compiled’

(Himmelmann 2006: 1–2)

What is Language Documentation?

‘Language documentation is the creation, annotation, preservation and dissemination of transparent records of a language.’

(Woodbury 2011: 159)

What are the goals of language documentation?

Produce a multipurpose record of the (speech/sign) practices of a community that can be used by future generations.

Products of language documentation

  1. An archival collection of audio/video recordings of various natural and staged events with transcriptions and translations of the audio/video recordings
  2. Materials based on the archival collection, such as dictionaries, grammars, and other descriptions of the language.

What’s an archive?

We take archive to mean “a trusted repository created and maintained by an institution with a demonstrated commitment to permanence and the long-term preservation of archived resources” (Johnson 2004:143).

As cited in (Henke & Berez-Kroeker 2016: 412)

  • Archives ensure the data is safe for now and in the future

Choosing an archive

Is it a reputable archive?

  1. Member or Associate Member of Digital Endangered Languages and Musics Archives Network (DELAMAN)
  2. Meets Open Language Archives Community (OLAC) standards
  3. Metadata harvested through OLAC

Choosing an archive

Are you given permission to archive with them?

  1. May require funding from a particular institution: e.g. ELAR, for projects funded by ELDP
  2. Restricted areal focus: PARADISEC

Summary

  1. Language documentation seeks to create a multipurpose record of the practices of a speech/sign community (especially when one of the languages is endangered!)
  2. To ensure that these records are available (and useful!) now and far into the future

Data management and language documentation

Proper data management is crucial to develop such products in language documentation!

Data management

Materials

Planning

Plan as much as you can before you begin work.

Data Management Plan

A document (2 pages minimum) that outlines how you are going to go about managing data.

Components of a Data Management Plan (DMP)

  1. Data collection, analysis, and handling
  2. Data storage, backup, and security
  3. Ethics of data collection and use
  4. Documentation and metadata

Data collection, analysis, and handling

DMP: collection, analysis, handling

  1. What kind of data will your project produce? - e.g. What type, format, and amount of digital data will you produce?
  2. What best practices for data collection … will you follow for collecting the data?
  3. What equipment, software, and/or other tools will you use to collect, process and analyze the data?
  4. How will you manage different versions of the files?

Data collection in language documentation

  1. Audio and video recordings
    • Speech events (conversations, narratives, songs)
    • Elicitation sessions (asking questions about phonology, grammar, culture)
  2. ELAN transcriptions
  3. FLEx lexical database
  4. Scans of fieldnotes and other documents
  5. Photos of participants

Audio/video recording and processing

  1. Record audio and video at high quality
    • Video: at least HD quality
    • Audio: at least 44.1 kHz, 16 bit; 96 kHz, 24 bit recommended
  2. Archive original audio/video files
    • If files need to be exported
  3. Create smaller working copies for transcription or to hand out to the community
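
The minimum audio specification above can be checked automatically before files go to the archive. The following is a minimal sketch, assuming uncompressed WAV files and using Python's standard `wave` module; the function name and constants are illustrative, not part of any archive's tooling.

```python
import wave

# Minimum values taken from the recommendation above.
MIN_SAMPLE_RATE_HZ = 44_100
MIN_SAMPLE_WIDTH_BYTES = 2  # 16 bit = 2 bytes per sample


def meets_minimum_audio_quality(path):
    """Return True if a WAV file is at least 44.1 kHz and 16 bit."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() >= MIN_SAMPLE_RATE_HZ
            and wav.getsampwidth() >= MIN_SAMPLE_WIDTH_BYTES
        )
```

Running such a check on the original recordings, not on the smaller working copies, is probably the more useful habit, since the working copies are expected to be lower quality.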

Software for video editing

HandBrake (Free, but limited)

DaVinci Resolve (Free, takes a lot of computing power)

Adobe Premiere (Paid)

DMP: Data storage, backup, and security

  1. How will you store and back up your data during data collection and analysis?
  2. How/when will you migrate your data?
  3. If you will be generating data at a field site, how will you safely and securely transfer it to your office/home?
  4. How will you keep your data secure?

Data storage

3-2-1 rule

  • 3 copies in 2 different storage media with at least 1 off-site copy.
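
One way to make the 3-2-1 rule concrete is to write down every copy of the data and test the list against the three conditions. This is only a sketch: the `BackupCopy` class, its fields, and the media labels are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    label: str      # where the copy lives, e.g. "laptop"
    medium: str     # kind of storage, e.g. "internal-ssd", "cloud"
    off_site: bool  # True if stored away from the main work location


def satisfies_3_2_1(copies):
    """Check: at least 3 copies, on at least 2 media, at least 1 off-site."""
    media = {copy.medium for copy in copies}
    has_off_site = any(copy.off_site for copy in copies)
    return len(copies) >= 3 and len(media) >= 2 and has_off_site


plan = [
    BackupCopy("laptop", "internal-ssd", off_site=False),
    BackupCopy("external drive", "external-hdd", off_site=False),
    BackupCopy("archive deposit", "cloud", off_site=True),
]
print(satisfies_3_2_1(plan))
```

The archive deposit counts as the off-site copy here, which is one reason to archive early.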

Importance of archive

Archives already do this for you, so archiving early and often will help!

Back up ongoing transcriptions, fieldnotes

  • Back up data as often as you can
    • In ELAN, set automatic saving to 1 min
  • Create online (Drive, Dropbox) and offline (external hard drive) backups
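
An offline backup is only useful if the copy is intact. The sketch below copies a file to a backup folder (for example, on an external hard drive) and compares SHA-256 checksums of the source and the copy. It uses only the Python standard library; the function names are illustrative, and the backup folder is whatever path you choose.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up_file(source, backup_dir):
    """Copy source into backup_dir and verify that the copy is identical."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / Path(source).name
    shutil.copy2(source, target)  # copy2 also preserves timestamps
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"Backup of {source} does not match the original")
    return target
```

Note that this overwrites an existing file of the same name in the backup folder, so it suits routine backups of work in progress rather than version tracking.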

Ethics of data generation and use

DMP: Ethics of data generation and use

What are the ethical considerations associated with the data?

  1. What are the requirements of your institution?
  2. Does the community you’re working with have any cultural restrictions?
  3. How will you obtain informed consent?
  4. Will you need to anonymize or de-identify the data?

Ethics and Open data

“Data should be as open as possible, but as closed as necessary (Horizon 2020 Programme 2017).”

(Kung 2022: 110)

  • Many archives have “access conditions” that allow you to make items open or closed to users of the archive.

Documentation and metadata

DMP: Documentation and metadata

  1. What file-naming schema will you use?
  2. How will you manage versions?
  3. How will you collect and track your metadata?

File naming

There is no one way to name files and good practice is a memorable and sustainable one for the individual researcher and team. (68)

Rules of Thumb

  1. Brevity
  2. No spaces or special characters
  3. Separating with _, -, or CamelCase
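
These rules of thumb can be enforced with a few lines of code. The sketch below assumes a hypothetical schema of project code, recording date, and session code joined by hyphens, loosely modelled on the filename PSE-20180726-E quoted in the session-level metadata; the meaning of each part and the 40-character limit are assumptions made for illustration.

```python
import re
from datetime import date

# Only letters, digits, underscore, and hyphen: no spaces or special characters.
ALLOWED = re.compile(r"[A-Za-z0-9_-]+")
MAX_STEM_LENGTH = 40  # arbitrary limit standing in for "brevity"


def is_valid_stem(stem):
    """Check a filename (without extension) against the rules of thumb."""
    return len(stem) <= MAX_STEM_LENGTH and ALLOWED.fullmatch(stem) is not None


def build_stem(project, recorded_on, session_code):
    """Join project code, date (YYYYMMDD), and session code with hyphens."""
    stem = f"{project}-{recorded_on:%Y%m%d}-{session_code}"
    if not is_valid_stem(stem):
        raise ValueError(f"Invalid filename stem: {stem!r}")
    return stem


print(build_stem("PSE", date(2018, 7, 26), "E"))  # PSE-20180726-E
print(is_valid_stem("my recording (final)"))      # False
```

Writing the date as YYYYMMDD has the side benefit that files sort chronologically by name.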

Metadata

Metadata are information about an object that helps us to understand, find, and use that object.

Metadata hierarchy

Project

(Corpus)

Session

Resource
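
The hierarchy can be pictured as nested records, where each level holds a list of the level below it. The following sketch uses Python dataclasses; the class and field names are hypothetical and far smaller than what an archive actually asks for. The corpus level is modelled as an ordinary layer here, although the parentheses above suggest it is optional.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Resource:
    filename: str          # a single file, e.g. one audio recording
    description: str = ""


@dataclass
class Session:
    session_id: str
    description: str = ""
    resources: List[Resource] = field(default_factory=list)


@dataclass
class Corpus:
    name: str
    sessions: List[Session] = field(default_factory=list)


@dataclass
class Project:
    title: str
    corpora: List[Corpus] = field(default_factory=list)

    def all_resources(self):
        """Walk the hierarchy and return every resource in the project."""
        return [
            resource
            for corpus in self.corpora
            for session in corpus.sessions
            for resource in session.resources
        ]
```

Metadata recorded at a higher level (for example, the project's language) applies to everything beneath it, so it only needs to be entered once.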

Project-level metadata

  1. Claudia Leto, Winarno Salim Alamudi, Nikolaus P. Himmelmann, Sonja Riesberg, Jani Kuhnt-Saptodewo, Antara News Tolitoli, and Bapak Zaharman. (1988 - 2010). Collection “Totoli”. The Language Archive. https://hdl.handle.net/1839/da11addf-bef3-4742-9c00-d85a446f2cdb. (Accessed 2023-10-11).

  2. Louise Baird (collector). Keo Corpus. Collection LRB2 at catalog.paradisec.org.au [Closed Access]. https://dx.doi.org/10.4225/72/56E97A07C4F58 (Accessed 2023-10-11).

Session-level metadata

This is a conversation at night between extended family members at Yawan’s house on the downriver side in the village of Karang Tanding, Jarai, Lahat, South Sumatra. Sira (M, 48) was in charge of recording the event. His goal was to have a casual conversation with his mother Sawia (F, 76) and his aunt Juria (F, 86), two older members of the community, about the history of Karang Tanding. Most of the conversation revolves around Sira asking questions to Sawia and Juria, but occasionally Yawan (M, 46) and his wife, Partiwi (F, 44), join in the conversation.

Recorded on a Marantz PMD 670 solid-state audio recorder with an Audio-Technica AT875 stereo microphone.

Session-level metadata

This is an elicitation session with Neti and her daughter Nefi Amelia. It was video recorded with a Canon XA30 video camera outfitted with a Rode RTG2 shotgun microphone and audio recorded with a Tascam DR-70D with an Audio Technica AT8022 stereo microphone and Shure SM-35 or AKG C520 headset microphones. The session elicits different properties of the universal quantifier, the voice system, and generalized noun modifying clause constructions in Besemah. The original filename for these recordings was PSE-20180726-E.

Resource-level metadata

Collecting participant metadata

Detailed metadata for participants is critical to understanding the data

  • See examples 1 and 2

Participant metadata

Make sure to fit the participant metadata to the context in which you are collecting data and not the other way around.

Organizing metadata

Two ways of organizing metadata:

  1. Spreadsheet (e.g. PARADISEC)
  2. Database
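
For the spreadsheet option, session metadata can be kept as one row per session and written out as CSV, which any spreadsheet program can open. This is a minimal sketch using Python's standard `csv` module; the column names and the example row are made up for illustration and are not the template of PARADISEC or any other archive.

```python
import csv
import io

# Hypothetical columns; adapt them to the archive's own template.
FIELDNAMES = ["session_id", "date", "genre", "location", "participants", "description"]


def sessions_to_csv(sessions):
    """Serialize a list of session dictionaries to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    for session in sessions:
        writer.writerow(session)  # missing columns are left empty
    return buffer.getvalue()


example = [
    {
        "session_id": "DEMO-001",  # placeholder values, not real data
        "date": "2023-10-13",
        "genre": "conversation",
        "participants": "Speaker A; Speaker B",
    }
]
print(sessions_to_csv(example))
```

A dictionary with a key that is not in `FIELDNAMES` raises a `ValueError`, which helps catch misspelled column names early.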

Lameta

  • lameta can be used on Mac and PC
  • based on SayMore, which is only available for PC

Organized into three main areas

  1. Project
  2. Sessions
  3. People

Lameta activities

  1. Create a new lameta project based on
    1. an existing project
    2. a project you are planning to initiate
    3. a completely made up project
  2. Fill in some information about the project, such as…
    • Project ID
    • Subject and Working Language
    • Location, Region, Country, Continent
    • Depositor, Description
  3. Create a new participant in the People tab with information about yourself, including a picture as well.

  4. Create a second participant

    • This could be a friend or family member or even someone in the workshop

If you have time…

Add a custom field with additional information about your participants.

  1. Create a new session in the Sessions tab
  2. In the Session Tab, input information about your session, including a description of what happened, genre, location, …
  3. In the Contributions Tab, add people to this session.

People with multiple roles

People often have multiple roles in a session. They can be a recorder and a speaker, for example. Lameta allows you to add the same person more than once with different roles.

  1. Add media files to the session in the upper right hand corner.
  2. You can add additional metadata about each file (resource-level metadata).

Creating a separate media folder

Often our media files are larger than we can store on our laptop, so lameta allows us to add media files from an external drive.

Click File > Media Folders Settings... and select the drive you would like to use.

You are now ready to export your project and send your data to the archive.

  • Click File > Export Project...
  • For now I recommend exporting as Paradisec CSV

Summary

  1. With a focus on preserving linguistic diversity, language documenters need to develop a clear plan for data management that contains a plan for archiving.
  2. Tools like lameta can help in this process, especially in organizing data and associated metadata.

Also, you can help translate Lameta into Indonesian!

Want to know more?

I recommend a self-paced online course

References

Cox, Christopher. 2022. Managing Data in a Language Documentation Corpus. In Andrea L. Berez-Kroeker, Bradley McDonnell, Eve Koller, & Lauren B. Collister (eds.), The Open Handbook of Linguistic Data Management. Cambridge: The MIT Press. DOI: https://doi.org/10.7551/mitpress/12200.003.0027
Henke, Ryan E. & Andrea L. Berez-Kroeker. 2016. A Brief History of Archiving in Language Documentation, with an Annotated Bibliography. Language Documentation & Conservation 10. 411–457.
Himmelmann, Nikolaus P. 2006. Language documentation: What is it and what is it good for. In Jost Gippert, Nikolaus P. Himmelmann, & Ulrike Mosel (eds.), Essentials of language documentation, 1–30. Berlin: Mouton de Gruyter.
Kung, Susan Smythe. 2022. Developing a Data Management Plan. In Andrea L. Berez-Kroeker, Bradley McDonnell, Eve Koller, & Lauren B. Collister (eds.), The Open Handbook of Linguistic Data Management. Cambridge: The MIT Press. DOI: https://doi.org/10.7551/mitpress/12200.003.0012
Woodbury, Anthony C. 2011. Language Documentation. In Peter K. Austin & Julia Sallabank (eds.), The Cambridge Handbook of Endangered Languages, 159–176. Cambridge: Cambridge University Press.