“What species are in my region of interest?” is probably the first question you will ask. As OBIS integrates data from many sources, it can provide a comprehensive list of all taxa inside a defined geometry, such as a country’s EEZ, an EBSA, or a user-defined polygon.

The robis package comes with a very handy function that retrieves this information: checklist().

In the next pages you will learn to:

Define the geometry of the area of interest 
Extract the checklist of all taxa present
Count how many species are in your region of interest

So open an RStudio session, load the robis package, and get ready to play!

First, let's load the required packages:

library(robis)
## it is also a good habit to set up your working directory at the beginning.
## Uncomment the setwd line and insert the path of your directory
## setwd("your_working_directory here")

Define the Geometry

As you probably already know, the geometry is the polygon that defines our region of interest (ROI). It could also be another type of geometry, but a polygon is the most common.

The geometry must be described in WKT (well-known text) format, and this string is passed as a parameter to the checklist function. We will do that later. First, let's define a simple geometry: a rectangle.

In RStudio, create a variable called WKT with the following coordinates:

WKT = "POLYGON ((142 -40, 150 -40, 150 -45, 142 -45, 142 -40))"

This is an area around the island of Tasmania, Australia. This will be our ROI. Note the double parentheses in the WKT definition.
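Typing the corner coordinates by hand is error-prone, so a rectangle like this one can also be assembled from its bounding box. The following is a minimal sketch in base R; `bbox_to_wkt` is a hypothetical helper written for this illustration and is not part of robis.

```r
# Hypothetical helper (not part of robis): build a rectangular WKT polygon
# from a bounding box given in decimal degrees.
bbox_to_wkt <- function(lon_min, lat_min, lon_max, lat_max) {
  stopifnot(lon_min < lon_max, lat_min < lat_max)
  # The ring is closed by repeating the first corner at the end.
  sprintf("POLYGON ((%s %s, %s %s, %s %s, %s %s, %s %s))",
          lon_min, lat_max, lon_max, lat_max,
          lon_max, lat_min, lon_min, lat_min,
          lon_min, lat_max)
}

# Reproduces the rectangle around Tasmania used in this lesson
WKT <- bbox_to_wkt(142, -45, 150, -40)
print(WKT)
```

The resulting string can be passed to checklist() in the same way as the hand-written one.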

Get the checklist of taxa

Next, extract the list of the taxa present in our ROI (this operation could take some time):

taxa = checklist(geometry=WKT)

Retrieved 2000 records of 2947 (67%)
Retrieved 2947 records of 2947 (100%)

Once completed, the function returns a data frame with all the taxa and some associated fields. Explore the structure of the returned table:

str(taxa)
'data.frame':   2947 obs. of  19 variables:
 $ id       : int  395361 395467 395584 395585 395618 395650 395664 395665 395723 395764 ...
 $ valid_id : int  395361 395467 395584 395585 395618 395650 395664 395665 395723 395764 ...
 $ parent_id: int  695212 438892 775109 775109 395617 748828 776102 776102 769378 695236 ...
 $ rank_name: chr  "Species" "Genus" "Species" "Species" ...
 $ tname    : chr  "Aaptos aaptos" "Abralia" "Abyssianira bathyalis" "Abyssianira tasmaniensis" ...
 $ tauthor  : chr  "(Schmidt, 1864)" "Gray, 1849" "Just, 1990" "Just, 1990" ...
 $ worms_id : int  134241 137930 258633 258636 235514 125303 279463 279464 165648 107581 ...
 $ records  : int  1 1 4 5 1 5 2 58 1 2 ...
 $ datasets : int  1 1 1 2 1 2 1 3 1 1 ...
 $ phylum   : chr  "Porifera" "Mollusca" "Arthropoda" "Arthropoda" ...
 $ order    : chr  "Suberitida" "Oegopsida" "Isopoda" "Isopoda" ...
 $ family   : chr  "Suberitidae" "Enoploteuthidae" "Paramunnidae" "Paramunnidae" ...
 $ genus    : chr  "Aaptos" "Abralia" "Abyssianira" "Abyssianira" ...
 $ species  : chr  "Aaptos aaptos" NA "Abyssianira bathyalis" "Abyssianira tasmaniensis" ...
 $ class    : chr  "Demospongiae" "Cephalopoda" "Malacostraca" "Malacostraca" ...
 $ redlist  : logi  NA NA NA NA NA NA ...
 $ status   : chr  NA NA NA NA ...
 $ gisd     : logi  NA NA NA NA NA NA ...
 $ hab      : logi  NA NA NA NA NA NA ...
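To get a quick feel for the table, it can help to sort it by the number of records. The sketch below uses a hand-made data frame that holds only the first four rows shown in the str() output (a toy stand-in, not a real query result) and orders it with base R.

```r
# Toy stand-in for the checklist table: first four rows from the str() output
taxa_toy <- data.frame(
  tname     = c("Aaptos aaptos", "Abralia",
                "Abyssianira bathyalis", "Abyssianira tasmaniensis"),
  rank_name = c("Species", "Genus", "Species", "Species"),
  records   = c(1, 1, 4, 5),
  stringsAsFactors = FALSE
)

# Sort by number of records, most frequently recorded taxa first
top_taxa <- taxa_toy[order(taxa_toy$records, decreasing = TRUE), ]
print(head(top_taxa, 3))
```

The same two lines should work on the full table returned by checklist(), assuming it has a records column as shown.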

An example

As the OBIS database changes constantly, let's do the next exercise with a test file.

Please download the checklist of taxa for the Caribbean Sea (of course, you can use your own data extracted with the checklist function).

Once it is downloaded to your computer, import the data into R using the following command:

## the test file is located under the data directory of the working directory
taxa = read.csv("data/taxaChecklistCarib-20170419.csv", stringsAsFactors = FALSE)

Remember to correctly specify the location and name of the file downloaded.
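If you prefer to work with your own checklist, one option is to save it once and reload it later, so the exercise does not depend on the live database. Here is a minimal sketch with base R, using a toy data frame and a temporary file in place of a real checklist() result and path (both are assumptions made for illustration).

```r
# Toy stand-in for a data frame returned by checklist()
my_taxa <- data.frame(
  tname     = c("Aaptos aaptos", "Abralia"),
  rank_name = c("Species", "Genus"),
  stringsAsFactors = FALSE
)

# Temporary file used for illustration; replace it with a path of your choice
csv_path <- tempfile(fileext = ".csv")
write.csv(my_taxa, csv_path, row.names = FALSE)

# Reload it the same way as the test file
taxa_reloaded <- read.csv(csv_path, stringsAsFactors = FALSE)
print(taxa_reloaded)
```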

As we saw, the checklist data frame contains many fields (variables). We are interested in how many phyla exist in our ROI (the Caribbean area in this test). The variable that contains the phylum is, of course, “phylum”.

Explore how many different phyla are in the data frame. As usual in R, there is more than one way to do that. Try making a frequency table of the phyla using “table”:

table(taxa$phylum)

         Acidobacteria         Actinobacteria              Amoebozoa               Annelida             Arthropoda 
                     2                     58                      1                    502                   2303 
            Ascomycota          Bacteroidetes          Basidiomycota                 Bigyra            Brachiopoda 
                     7                     79                      1                      1                     31 
               Bryozoa         Cephalorhyncha               Cercozoa           Chaetognatha             Chlamydiae 
                     8                     19                      1                      4                      7 
              Chlorobi            Chloroflexi            Chlorophyta               Chordata             Ciliophora 
                     3                     12                     62                   2693                     12 
              Cnidaria          Crenarchaeota            Cryptophyta             Ctenophora          Cyanobacteria 
                  1216                      4                      2                      3                     25 
       Deferribacteres    Deinococcus-Thermus          Echinodermata          Elusimicrobia            Embryophyta 
                     3                      4                    621                      1                      1 
            Euglenozoa          Euryarchaeota          Fibrobacteres             Firmicutes           Foraminifera 
                     2                     10                      1                     78                    630 
          Fusobacteria           Gastrotricha       Gemmatimonadetes        Gnathostomulida             Haptophyta 
                     7                      2                      2                      1                     39 
          Hemichordata          Lentisphaerae          Magnoliophyta               Mollusca                Myzozoa 
                     2                      2                      7                   2008                     38 
              Nematoda               Nemertea            Nitrospirae             Ochrophyta               Oomycota 
                    10                      8                      4                     45                      1 
             Phoronida         Planctomycetes Plantae incertae sedis        Platyhelminthes               Porifera 
                     2                      7                      1                      6                    591 
        Proteobacteria               Radiozoa             Rhodophyta               Rotifera              Sipuncula 
                   338                      1                    115                      1                     32 
          Spirochaetes          Synergistetes             Tardigrada            Tenericutes            Thermotogae 
                     3                      2                      7                      4                      1 
          Tracheophyta        Verrucomicrobia        Xenacoelomorpha 
                     7                      9                     18 

The result is a full list of all phyla with the number of checklist records (taxa) for each of them. Because we are working with the checklist table, each record can be at any taxonomic level from phylum downwards, so be careful with the interpretation.
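The frequency table is long, so the number of distinct phyla is easier to compute than to count by eye. Here is a small sketch on a toy data frame (values invented for illustration). Note that unique() treats NA as a value of its own, so rows without a phylum are dropped first.

```r
# Toy checklist with one row that has no phylum (e.g. a kingdom-level taxon)
taxa_toy <- data.frame(
  phylum    = c("Porifera", "Mollusca", "Arthropoda", "Arthropoda", NA),
  rank_name = c("Species", "Genus", "Species", "Species", "Kingdom"),
  stringsAsFactors = FALSE
)

# Number of distinct phyla, ignoring rows without a phylum
known_phyla <- taxa_toy$phylum[!is.na(taxa_toy$phylum)]
n_phyla <- length(unique(known_phyla))
print(n_phyla)  # 3

# Same number from the frequency table, since table() drops NA by default
n_phyla_table <- length(table(taxa_toy$phylum))
print(n_phyla_table)  # 3
```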

OBIS allows any taxonomic level in the stored records, so you may well find a record that has been classified only to the genus or family level. This is acceptable, but normally most OBIS records are classified to the species level.

The checklist table has one variable that indicates the taxonomic level of the record: “rank_name”.

You can make a frequency table of all the rank names present in this checklist table:

table(taxa$rank_name)

      Class    Division      Family       Forma       Genus  Infraclass  Infraorder Infraphylum     Kingdom       Order 
         67           5         523           4        2050           4           9           1           6          93 
     Phylum     Species    Subclass   Subfamily    Subgenus    Suborder   Subphylum  Subspecies  Superclass Superdomain 
         43        8549          12          11          13           7           6         322           1           1 
Superfamily  Superorder     Variety 
          5           2           3 

You could also plot it as a bar chart to compare the numbers easily. Try to make the plot by yourself!

barplot(sort(table(taxa$rank_name), decreasing = TRUE), horiz = TRUE, las=1, col="coral", xlab="Number of records")

From the table we see that 8549 species are reported for the Caribbean (using this test file!). Let's see the full table, sorted:

## we want the table sorted by the number of records at each rank
sort(table(taxa$rank_name))

Infraphylum  Superclass Superdomain  Superorder     Variety       Forma  Infraclass    Division Superfamily     Kingdom 
          1           1           1           2           3           4           4           5           5           6 
  Subphylum    Suborder  Infraorder   Subfamily    Subclass    Subgenus      Phylum       Class       Order  Subspecies 
          6           7           9          11          12          13          43          67          93         322 
     Family       Genus     Species 
        523        2050        8549 
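The species count read off the table can also be computed directly, which is handy when the number is needed later in a script. Here is a minimal sketch on a toy data frame (rows invented for illustration, column names as in the checklist table).

```r
taxa_toy <- data.frame(
  tname     = c("Aaptos aaptos", "Abralia",
                "Abyssianira bathyalis", "Paramunnidae"),
  rank_name = c("Species", "Genus", "Species", "Family"),
  stringsAsFactors = FALSE
)

# Count the rows identified exactly to species rank
n_species <- sum(taxa_toy$rank_name == "Species", na.rm = TRUE)
print(n_species)  # 2

# The same value taken from the frequency table
n_species_table <- table(taxa_toy$rank_name)[["Species"]]
print(n_species_table)  # 2
```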

Let's then make a barplot of the number of species per phylum (that is, leaving out any record with a rank_name different from “Species”). Try to make the plot by yourself!

barplot(sort(table(taxa$phylum[taxa$rank_name=="Species"]), decreasing = TRUE), horiz = TRUE, las=1, col="coral", xlab="Number of species")

How many phyla don’t have a single record identified to the species level?

This is a very interesting question. A frequent mistake is to take the number of higher taxa as the number of species. The two numbers differ, and so does their interpretation.

As usual, there is more than one way to find out how many phyla don’t have a single record identified to the species level. This is one:

  1. Count the number of unique phyla
  2. Count how many phyla have records identified down to the species level
  3. Take the difference between the two numbers

## Number of unique phyla
phyla.all = length(unique(taxa$phylum))
## Number of phyla with species names
phyla.species = length(table(taxa[!is.na(taxa$species),]$phylum))
## take the difference to get the number of phyla without species names
phyla.all - phyla.species
[1] 40

There are 40 phyla without a single record identified to the species level.
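Another way, sketched here on a toy data frame (values invented for illustration), is to use setdiff(), which also returns the names of the phyla and not only their count. One caveat about the calculation above: unique() counts NA as a value while table() drops it, so if any rows lack a phylum the difference comes out one too high. The sketch removes missing phyla explicitly to avoid that.

```r
taxa_toy <- data.frame(
  phylum  = c("Porifera", "Mollusca", "Arthropoda", "Nematoda", NA),
  species = c("Aaptos aaptos", NA, "Abyssianira bathyalis", NA, NA),
  stringsAsFactors = FALSE
)

# All phyla present, with missing values dropped explicitly
phyla_all <- unique(taxa_toy$phylum[!is.na(taxa_toy$phylum)])

# Phyla with at least one record that has a species name
has_species <- !is.na(taxa_toy$species) & !is.na(taxa_toy$phylum)
phyla_with_species <- unique(taxa_toy$phylum[has_species])

# Phyla without any species-level record, by name
phyla_no_species <- setdiff(phyla_all, phyla_with_species)
print(phyla_no_species)          # "Mollusca" "Nematoda"
print(length(phyla_no_species))  # 2
```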

Conclusion

Congratulations, you have completed the lesson.

Please remember:

  1. checklist() produces a data frame with all the unique taxon names present in a specified geometry.

  2. The variable rank_name indicates the taxonomic level of each record.

  3. Not all the records are identified to the species level. If you want to know the number of species retrieved by checklist(), just count the number of records with the variable “rank_name” equal to “Species”.

---
title: "What Species are in my Region of Interest?"
author: "E Klein. eklein@usb.ve"
output: html_notebook
editor_options: 
  chunk_output_type: inline
---
