
Internet of Samples: Creating and Mapping Controlled Vocabularies for Specimen Type, Material Type, and Sampled Feature
Dave Vieglais, University of Kansas (Corresponding Author: [email protected])
Quan Gan, University of Arizona
Yuxuan Zhou, University of Arizona
Stephen Richard, U.S. Geoscience Information Network
Hong Cui, University of Arizona
Neil Davies, University of California Berkeley
John Deck, University of California Berkeley
Eric Kansa, Open Context, The Alexandria Archive Institute
Sarah Kansa, Open Context, The Alexandria Archive Institute
John Kunze, California Digital Library
Danny Mandel, University of Arizona
Chris Meyer, Smithsonian National Museum of Natural History
Thomas Orrell, Smithsonian Institution
Sarah Ramdeen, Columbia University
Rebecca Snyder, Smithsonian Institution
Ramona Walls, University of Arizona
Kerstin Lehnert, Columbia University

Abstract

Material samples are vital across multiple scientific disciplines, with samples collected for one project often proving valuable for additional studies. The Internet of Samples (iSamples) project aims to integrate large, diverse, cross-discipline sample repositories and enable access and discovery of material samples as FAIR data (Findable, Accessible, Interoperable, and Reusable). Here we report our recent progress in controlled vocabulary development and mapping. In addition to a core metadata schema that integrates SESAR, GEOME, Open Context, and Smithsonian natural history collections, three small but important controlled vocabularies (CVs) describing specimen type, material type, and sampled feature were created. The new CVs provide consistent semantics for high-level integration of the existing vocabularies used in the source collections. Two methods were used to map source record properties to terms in the new CVs. Keyword-based heuristic rules were manually created where existing terminologies were similar to the new CVs, as in records from SESAR, GEOME, and Open Context and in some aspects of Smithsonian Darwin Core records. For example, specimen type = liquid>aqueous in SESAR records was mapped to specimen type = liquid or gas sample and material type = liquid water. A machine learning approach was applied to Smithsonian Darwin Core records to infer sampled feature terms from record text in the habitat, locality, higher geography, and higher classification fields. By applying fastText with word embeddings trained on a 600-billion-token general-domain corpus, we provided the model with a baseline level of “understanding” of English words. With training sets of 200 and 995 records, precision of 87% and 94% and recall of 85% and 92% were obtained, respectively, performance sufficient for production use. Using these approaches, more than 3×10⁶ records from the four large collections have been successfully mapped to a common core data model, facilitating cross-domain discovery and retrieval of the sample records.
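To make the two mapping approaches in the abstract concrete, the Python sketch below illustrates (1) a keyword-based heuristic rule of the liquid>aqueous kind described above and (2) a supervised fastText classifier trained on concatenated habitat, locality, higher geography, and higher classification text to predict a sampled feature term. This is a minimal sketch under stated assumptions, not the iSamples implementation: the rule table structure, file names, label strings, and the choice of the crawl-300d-2M pretrained vectors are illustrative assumptions.

import fasttext  # pip install fasttext

# (1) Keyword-based heuristic rule (illustrative; the real rule set is much larger).
# Maps a SESAR keyword to iSamples specimen type and material type terms.
SESAR_RULES = {
    "liquid>aqueous": {
        "specimen_type": "liquid or gas sample",
        "material_type": "liquid water",
    },
}

def map_sesar_keyword(keyword):
    """Return iSamples CV terms for a SESAR keyword, or an empty dict when no rule matches."""
    return SESAR_RULES.get(keyword.strip().lower(), {})

# (2) Supervised fastText classifier for sampled feature (hypothetical file and label names).
# Training data is in fastText's supervised format, one record per line, built by
# concatenating the habitat, locality, higher geography, and higher classification
# fields of a Darwin Core record, e.g.
# "__label__marine_water_body coral reef flat | Moorea | French Polynesia | Cnidaria"
model = fasttext.train_supervised(
    input="sampled_feature_train.txt",       # hypothetical training file
    pretrainedVectors="crawl-300d-2M.vec",   # general-domain vectors from a 600B-token corpus (assumed choice)
    dim=300,                                 # must match the pretrained vector dimension
    epoch=25,
    lr=0.5,
)

labels, probabilities = model.predict(
    "coral reef flat | Moorea | French Polynesia | Cnidaria"
)
print(labels[0], probabilities[0])  # predicted sampled feature label and its score

In this setup the pretrained general-domain vectors supply the word-level “understanding” mentioned in the abstract, while the small labeled training set (a few hundred records) only has to teach the classifier how that vocabulary maps onto the sampled feature CV terms.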