Peggy Newman, Martin Westgate, Amanda Buyan, Dax Kellie & Shandiya Balasubramaniam

The problem


For researchers, getting data out of GBIF nodes is easy…

…but sharing your own data is hard.

Hurdles


  • Darwin Core Standard formatting isn’t easy (e.g., .xml)
  • Existing documentation isn’t well-suited to newbies
  • Poor integration with existing workflows (i.e. in R or Python)
  • Sharing data is low on the priority list

Q: How can we help researchers share biodiversity data?

galaxias (and friends)


galaxias: Build, check & publish Darwin Core Archives (DwC-As)
corella: Convert a tibble to Darwin Core
delma: Convert markdown to EML or xml

Darwin Core

An archive is a .zip file containing three things:

  • data (csv format)
  • metadata (eml format)
  • schema (xml format)
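
To make the schema part concrete, here is a minimal base R sketch of what a meta.xml descriptor can look like. The helper build_meta_xml() is hypothetical (it is not part of galaxias), and it assumes every column name is a term in the Darwin Core namespace and that the data file is a comma-separated occurrence table.

```r
# Hypothetical helper, not part of galaxias: build a minimal meta.xml
# for a single occurrence file. Assumes all column names are Darwin Core terms.
build_meta_xml <- function(columns, file = "occurrences.csv") {
  dwc <- "http://rs.tdwg.org/dwc/terms/"
  # One <field> element per column; indices are zero-based
  fields <- sprintf('    <field index="%d" term="%s%s"/>',
                    seq_along(columns) - 1L, dwc, columns)
  c('<?xml version="1.0" encoding="UTF-8"?>',
    '<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">',
    sprintf('  <core encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\\n" fieldsEnclosedBy="&quot;" ignoreHeaderLines="1" rowType="%sOccurrence">', dwc),
    '    <files>',
    sprintf('      <location>%s</location>', file),
    '    </files>',
    fields,
    '  </core>',
    '</archive>')
}

meta <- build_meta_xml(c("occurrenceID", "eventDate"))
cat(meta, sep = "\n")
```

The point is only that the schema maps each column position in the csv to a Darwin Core term, so a reader of the archive knows what every column means.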

Process



data → metadata → schema → archive → validate → submit

Data

Load galaxias

library(galaxias)



delma and corella are loaded automatically

Data

Load an example dataset

library(readr)

df <- read_csv("my_example_data.csv")
df
# A tibble: 2 × 5
  latitude longitude date       time  species                 
     <dbl>     <dbl> <chr>      <chr> <chr>                   
1    -35.3      149. 14-01-2023 10:23 Callocephalon fimbriatum
2    -35.3      149. 15-01-2023 11:25 Eolophus roseicapilla   

Data

How should we convert this dataset to Darwin Core?

suggest_workflow(df)

Data

If we follow that advice:

df_dwc <- df |>
  set_occurrences(occurrenceID = sequential_id(),
                  basisOfRecord = "humanObservation") |> 
  set_coordinates(decimalLatitude = latitude, 
                  decimalLongitude = longitude) |>
  set_datetime(eventDate = lubridate::dmy(date),
               eventTime = lubridate::hm(time)) |>
  set_scientific_name(scientificName = species, 
                      taxonRank = "species")

df_dwc
# A tibble: 2 × 8
  basisOfRecord    occurrenceID decimalLatitude decimalLongitude eventDate 
  <chr>            <chr>                  <dbl>            <dbl> <date>    
1 humanObservation 01                     -35.3             149. 2023-01-14
2 humanObservation 02                     -35.3             149. 2023-01-15
# ℹ 3 more variables: eventTime <Period>, scientificName <chr>, taxonRank <chr>
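
For comparison, the same steps can be approximated by hand in base R. This is a rough sketch of what the conversion amounts to, not how corella implements it. The values are typed in as printed above (so coordinates are rounded), and eventTime is kept as text rather than a lubridate Period.

```r
# The example dataset, re-typed by hand (coordinates rounded as printed)
df <- data.frame(
  latitude  = c(-35.3, -35.3),
  longitude = c(149, 149),
  date      = c("14-01-2023", "15-01-2023"),
  time      = c("10:23", "11:25"),
  species   = c("Callocephalon fimbriatum", "Eolophus roseicapilla")
)

# Rename columns to Darwin Core terms, add the required constants,
# and parse day-month-year strings into proper dates
df_manual <- data.frame(
  basisOfRecord    = "humanObservation",
  occurrenceID     = sprintf("%02d", seq_len(nrow(df))),
  decimalLatitude  = df$latitude,
  decimalLongitude = df$longitude,
  eventDate        = as.Date(df$date, format = "%d-%m-%Y"),
  eventTime        = df$time,
  scientificName   = df$species,
  taxonRank        = "species"
)

df_manual
```

The manual version works, but it skips the checks that the set_ functions run on each column, which is the main reason to prefer them.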

Data

Save as occurrences.csv:

use_data(df_dwc)
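
A by-hand equivalent of this step might look like the sketch below. It assumes the file belongs in a data-publish folder (here created inside a temporary directory so nothing in your project is touched) and uses a cut-down, illustrative table.

```r
# Illustrative two-column table; a real one would hold all Darwin Core columns
occurrences <- data.frame(
  occurrenceID = c("01", "02"),
  eventDate    = as.Date(c("2023-01-14", "2023-01-15"))
)

# Assumed folder name; written under tempdir() to keep the sketch side-effect free
out_dir <- file.path(tempdir(), "data-publish")
dir.create(out_dir, showWarnings = FALSE)
out_file <- file.path(out_dir, "occurrences.csv")

# No row names, and empty strings rather than "NA" for missing values
write.csv(occurrences, out_file, row.names = FALSE, na = "")

written <- readLines(out_file)
written
```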

Metadata

Generate a metadata file

use_metadata_template() # creates the following file:
---
 title: A Descriptive Title for your Dataset in Title Case
 output: html_document
 date: 2025-02-01
 ---
 
 ```{=html}
 <!--
 This is a metadata template. 
 
 It is formatted to render as an html document (using the "Knit" button or
 `knitr::knit()`) AND to Ecological Metadata language (EML) using the 
 {delma} R package. Sections can 
 be added, re-arranged or removed to suit the dataset being described. Some 
 features to be aware of:

Metadata

Convert to EML

use_metadata("metadata.Rmd") # creates the following file:
<?xml version="1.0" encoding="UTF-8"?>
 <eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" packageId="the-doi-for-this-archive" system="https://doi.org" scope="system" xsi:schemaLocation="http://rs.gbif.org/schema/eml-gbif-profile/1.3/eml-gbif-profile.xsd">
   <dataset>
     <title>A Descriptive Title for your Dataset in Title Case</title>
     <creator>
       <individualName>
         <givenName>Firstname</givenName>
         <surName>Lastname</surName>
       </individualName>
       <address>
         <deliveryPoint>215 Road Street</deliveryPoint>
         <city>Canberra</city>
         <administrativeArea>ACT</administrativeArea>
         <postalCode>2601</postalCode>
         <country>Australia</country>

Archive

Automated process for zipping the /data-publish folder.

build_archive()
Data (minimum of one)
  • occurrences.csv ✔
  • events.csv      ✖
  • multimedia.csv  ✖
Metadata
  • eml.xml         ✔
Schema
  • meta.xml        ✔

Archive

We can check that the correct files are present.

fs::path_abs("../dwc-archive.zip") |>
  zip::zip_list() |>
  tibble::as_tibble() |>
  dplyr::select(filename:timestamp)
# A tibble: 3 × 4
  filename        compressed_size uncompressed_size timestamp          
  <chr>                     <dbl>             <dbl> <dttm>             
1 occurrences.csv             194               283 2026-05-12 12:11:38
2 eml.xml                    1220              2973 2026-05-12 12:11:38
3 meta.xml                    336               940 2026-05-12 12:11:38


The schema file (meta.xml) has been built automatically.
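
The same check can be scripted. The helper below is hypothetical (not part of galaxias) and simply compares a vector of file names against the three components listed by build_archive(): at least one data file, eml.xml and meta.xml.

```r
# Hypothetical helper: report which expected components are absent
missing_archive_files <- function(filenames,
                                  data_files = c("occurrences.csv",
                                                 "events.csv",
                                                 "multimedia.csv")) {
  # Metadata and schema files are always expected
  missing <- setdiff(c("eml.xml", "meta.xml"), filenames)
  # Only one of the data files needs to be present
  if (!any(data_files %in% filenames)) {
    missing <- c(missing, "at least one data file")
  }
  missing
}

complete   <- missing_archive_files(c("occurrences.csv", "eml.xml", "meta.xml"))
incomplete <- missing_archive_files("eml.xml")

# For a real archive, file names could come from:
# utils::unzip("dwc-archive.zip", list = TRUE)$Name
```

An empty result means nothing is missing.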

Validate

# validate locally
check_directory() 

# validate via GBIF API
check_archive(username = "a_gbif_user",
              email = "my@email.com",
              password = "a_secure_password")
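
To give a feel for what local validation involves, here is a minimal base R sketch. It is not what check_directory() does internally. The function check_occurrences() is hypothetical, and its list of required columns is an assumption based on GBIF's published minimum requirements for occurrence data.

```r
# Hypothetical checker: returns a character vector of problems (empty if none)
check_occurrences <- function(x) {
  problems <- character(0)

  # Assumed minimum set of columns for an occurrence dataset
  required <- c("occurrenceID", "basisOfRecord", "scientificName", "eventDate")
  absent <- setdiff(required, names(x))
  if (length(absent) > 0) {
    problems <- c(problems, paste("missing column:", absent))
  }

  # Identifiers must be unique within the dataset
  if ("occurrenceID" %in% names(x) && anyDuplicated(x$occurrenceID) > 0) {
    problems <- c(problems, "occurrenceID values are not unique")
  }

  # Coordinates must fall inside valid decimal-degree ranges
  if ("decimalLatitude" %in% names(x) &&
      any(abs(x$decimalLatitude) > 90, na.rm = TRUE)) {
    problems <- c(problems, "decimalLatitude outside [-90, 90]")
  }
  if ("decimalLongitude" %in% names(x) &&
      any(abs(x$decimalLongitude) > 180, na.rm = TRUE)) {
    problems <- c(problems, "decimalLongitude outside [-180, 180]")
  }

  problems
}

bad <- data.frame(occurrenceID = c("01", "01"),
                  decimalLatitude = c(-35.3, 95))
bad_problems <- check_occurrences(bad)
bad_problems
```

Returning a vector of messages rather than stopping at the first error lets a user fix everything in one pass.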

Submitting

Run submit_archive() to create an issue on the data-publication repository.

Benefits of galaxias


  • Darwin Core Standard formatting is easy (e.g., .xml)
  • Documentation well-suited to newbies
  • Good integration with existing workflows (i.e. in R or Python)
  • Sharing data is on the priority list (?)

Thank you


Peggy Newman
Martin Westgate
Amanda Buyan
Dax Kellie
Shandiya Balasubramaniam

galaxias
corella
delma
galah