GBIF biodiversity example

This short example shows how the packages rgbif, taxa, and metacoder can be used to retrieve biodiversity data, filter it by taxonomic characteristics, and visualize it. Each of these packages do much more than this; see the links at the end of this example for more information.

Install packages

install.packages("metacoder")
install.packages("taxa")
install.packages("rgbif")

Load packages

library(rgbif)
library(taxa)
library(metacoder)

Geting biodiversity data

We can use the rgbif package to get a table of recorded plant occurrences in Oregon from the Global Biodiversity Information Facility (GBIF). Each row is a record of a plant being observed.

occ <- occ_data(scientificName = "Plantae", stateProvince = "Oregon")
occ$data
## # A tibble: 500 x 72
##       key scientificName decimalLatitude decimalLongitude issues datasetKey publishingOrgKey
##     <int> <chr>                    <dbl>            <dbl> <chr>  <chr>      <chr>           
##  1 2.01e9 Quercus suber…            44.6            -123. cdrou… 50c9509d-… 28eb1a3f-1c15-4…
##  2 1.99e9 Polystichum m…            45.5            -123. cdrou… 50c9509d-… 28eb1a3f-1c15-4…
##  3 1.99e9 Chrysolepis c…            44.4            -122. cdrou… 50c9509d-… 28eb1a3f-1c15-4…
##  4 1.99e9 Lysichiton am…            46.1            -124. cdrou… 50c9509d-… 28eb1a3f-1c15-4…
##  5 1.99e9 Cardamine hir…            45.5            -123. cdrou… 50c9509d-… 28eb1a3f-1c15-4…
##  6 1.99e9 Thuja plicata…            45.4            -123. cdrou… 50c9509d-… 28eb1a3f-1c15-4…
##  7 2.01e9 Quercus garry…            44.9            -123. cdrou… 50c9509d-… 28eb1a3f-1c15-4…
##  8 1.99e9 Arum italicum…            45.5            -122. cdrou… 50c9509d-… 28eb1a3f-1c15-4…
##  9 1.99e9 Umbellularia …            43.5            -124. cdrou… 50c9509d-… 28eb1a3f-1c15-4…
## 10 1.99e9 Oenanthe sarm…            44.6            -123. cdrou… 50c9509d-… 28eb1a3f-1c15-4…
## # … with 490 more rows, and 65 more variables: networkKeys <list>, installationKey <chr>,
## #   publishingCountry <chr>, protocol <chr>, lastCrawled <chr>, lastParsed <chr>, crawlId <int>,
## #   basisOfRecord <chr>, taxonKey <int>, kingdomKey <int>, …

Converting to the taxmap format

Note how the taxonomy is encoded in a set of columns:

occ$data[26:31]
## # A tibble: 500 x 6
##    kingdom phylum       order        family          genus        species                 
##    <chr>   <chr>        <chr>        <chr>           <chr>        <chr>                   
##  1 Plantae Tracheophyta Fagales      Fagaceae        Quercus      Quercus suber           
##  2 Plantae Tracheophyta Polypodiales Dryopteridaceae Polystichum  Polystichum munitum     
##  3 Plantae Tracheophyta Fagales      Fagaceae        Chrysolepis  Chrysolepis chrysophylla
##  4 Plantae Tracheophyta Alismatales  Araceae         Lysichiton   Lysichiton americanus   
##  5 Plantae Tracheophyta Brassicales  Brassicaceae    Cardamine    Cardamine hirsuta       
##  6 Plantae Tracheophyta Pinales      Cupressaceae    Thuja        Thuja plicata           
##  7 Plantae Tracheophyta Fagales      Fagaceae        Quercus      Quercus garryana        
##  8 Plantae Tracheophyta Alismatales  Araceae         Arum         Arum italicum           
##  9 Plantae Tracheophyta Laurales     Lauraceae       Umbellularia Umbellularia californica
## 10 Plantae Tracheophyta Apiales      Apiaceae        Oenanthe     Oenanthe sarmentosa     
## # … with 490 more rows

We can use these columns to infer a taxonomy and organize the data according to it. The function parse_tax_data from the taxa package can convert data that includes a taxonomy into the taxmap format.

obj <- parse_tax_data(occ$data, class_cols = 26:31, named_by_rank = TRUE)
obj
## <Taxmap>
##   454 taxa: ab. Plantae, ac. Tracheophyta ... rm. Clematis vitalba
##   454 edges: NA->ab, ab->ac, ab->ad, ab->ae, ab->af ... im->rj, gs->rk, fq->rl, ki->rm
##   1 data sets:
##     tax_data:
##       # A tibble: 500 x 73
##         taxon_id    key scientificName decimalLatitude decimalLongitude issues datasetKey
##         <chr>     <int> <chr>                    <dbl>            <dbl> <chr>  <chr>     
##       1 kj       2.01e9 Quercus suber…            44.6            -123. cdrou… 50c9509d-…
##       2 kk       1.99e9 Polystichum m…            45.5            -123. cdrou… 50c9509d-…
##       3 kl       1.99e9 Chrysolepis c…            44.4            -122. cdrou… 50c9509d-…
##       # … with 497 more rows, and 66 more variables: publishingOrgKey <chr>,
##       #   networkKeys <list>, installationKey <chr>, publishingCountry <chr>,
##       #   protocol <chr>, lastCrawled <chr>, lastParsed <chr>, crawlId <int>,
##       #   basisOfRecord <chr>, taxonKey <int>, …
##   0 functions:

Note how the original table is now part of the object and a unique taxon ID has been added to each row.

Plotting and filtering taxonomic data

Using the taxmap format keeps the taxonomic and non-taxonomic parts of the data in sync, allowing for filtering data based on taxon characteristics and vice versa. It also allows you to use the heat_tree function from the metacoder package to plot per-taxon data. In the example below, we plot the number of occurrence for each taxon after filtering out:

  • any taxa at the rank of species
  • any taxa not a subtaxa of Tracheophyta (vascular plants)
  • any order that does not have at least 10 subtaxa
obj %>% 
  filter_taxa(taxon_ranks != "species") %>% 
  filter_taxa(taxon_names == "Tracheophyta", subtaxa = TRUE) %>%
  filter_taxa(taxon_ranks == "order", n_subtaxa > 10, subtaxa = TRUE, supertaxa = TRUE) %>% 
  heat_tree(node_label = taxon_names,
            node_color = n_obs,
            node_size = n_obs,
            node_color_axis_label = "# occurrences")

Other examples

Voting organized by geography

Any hierarchical data can be used with metacoder and taxa. Here is an example of plotting voting results from the 2016 US Democratic primary between Bernie Sanders and Hillary Clinton, using geography as the hierarchy instead of taxonomy.

Link to analysis code

Differential taxon abundance in the human microbiome

The metacoder package focuses on ecological analysis of taxonomic data, especially microbiome data. Here is an example that uses metacoder to visualize differences in the human microbiome among body sites.

Link to analysis code

The grey tree on the lower left functions as a key for the unlabeled trees. Each of the smaller trees represent a comparison between body sites in the columns and rows. A taxon colored brown is more abundant in the body site in the column and a taxon colored green is more abundant in body site of the row. For example, Bacteroidetes is more abundant (i.e. has more reads) in the throat samples than the nose samples and this is due to the greater abundance of two genera Prevotella and Porphyromonas.