R language practice – rWCVP: R package for the World Checklist of Vascular Plants

rWCVP: An R package for the World Checklist of Vascular Plants

  • Introduction
    • 1. Installation and a simple example from GitHub
      • 1.1 Install rWCVP
      • 1.2 Install rWCVPdata
      • 1.3 GitHub example
    • 2. Reading the original rWCVP paper
      • 2.1 Preface (Background)
      • 2.2 Function overview
        • 2.2.1 Name matching (wcvp_match_names(), wcvp_match_exact(), wcvp_match_fuzzy())
        • 2.2.2 Name resolution after matching
        • 2.2.3 Spatial integration and distribution mapping (wcvp_distribution(), wcvp_distribution_map() and wgsrpd3)
        • 2.2.4 Mapping geographic locations between levels (get_wgsrpd3_codes() and get_area_name())
        • 2.2.5 Summary table (wcvp_summary() and wcvp_summary_gt())
        • 2.2.6 Occurrence matrix (wcvp_occ_mat())
        • 2.2.7 Generate checklist (wcvp_checklist())
      • 2.3 rWCVP application

Introduction

To introduce the package, here is a paraphrase of the abstract of the original paper: the World Checklist of Vascular Plants (WCVP) is a very high-quality database that provides a solid foundation for plant science, plant conservation, ecology, and evolution. However, handling such a large and complex database is a challenge for many users. The authors therefore released rWCVP, an open source R package that makes data cleaning and other common processing of WCVP convenient. Its capabilities include taxonomic name reconciliation, geospatial integration, mapping, and the generation of several different WCVP summaries in both data and report formats.

1. Installation and a simple example from GitHub

1.1 Install rWCVP

Run the following line:

devtools::install_github("matildabrown/rWCVP")

or

install.packages("rWCVP")

or install it in RStudio through Tools > Install Packages.

1.2 Install rWCVPdata

Running the example reports that rWCVPdata must be installed. The paper mentions alternatives that avoid installing rWCVPdata, but installing it is the simplest route and is recommended here.

Run the following line:

devtools::install_github("matildabrown/rWCVPdata")

1.3 GitHub example

https://github.com/matildabrown/rWCVP

With rWCVP it is easy to retrieve and map the known distribution of a plant species:

library(rWCVP)

# retrieve the distribution of one species as a spatial object
distribution <- wcvp_distribution("Myrcia guianensis", taxon_rank = "species")

# global map
wcvp_distribution_map(distribution)

# zoomed-in map, cropped to the extent of the distribution
wcvp_distribution_map(distribution, crop_map = TRUE)


2. Reading the original rWCVP paper

Reading the original paper gives a better understanding of why the authors built the package and of the principles and workflow behind each function, which in turn guides how to use it. Section 3 will work through the user manual; this section, by contrast, presents the package at the conceptual level rather than the operational level.

2.1 Preface (Background)

WCVP is a globally recognized checklist of vascular plants, and the database is continuously updated. It is linked to the International Plant Names Index (IPNI). **In the latest version, WCVP also includes distribution data for all species (recorded at level 3 of the World Geographical Scheme for Recording Plant Distributions, WGSRPD).** Additional data can be integrated using name matching and analyzed taxonomically, spatially and morphologically; WCVP also includes life form data for over 75% of species.

Taxonomic and distributional data held in WCVP are useful for applications in plant science and biodiversity research, but need to be processed, filtered or aggregated to produce meaningful output. The size of the dataset itself limits the tools available to analyze it – at 2.9 million rows (when names and distributions are combined), WCVP cannot be fully opened by Microsoft Excel, so even simple filtering operations require some programming skills.

The main functions of the package are summarized below.

2.2 Function overview

2.2.1 Name matching (wcvp_match_names(), wcvp_match_exact(), wcvp_match_fuzzy())

In practice, name matching only requires a call to wcvp_match_names(). wcvp_match_exact() and wcvp_match_fuzzy() can also be called directly, but this is not necessary, because wcvp_match_names() internally chains the other two functions: exact matching is attempted first, and fuzzy matching is then applied to the names that remain unmatched.
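The exact-then-fuzzy flow can be illustrated with a small base R sketch. This is not the package's implementation: the reference list, the query names, the helper match_names_sketch() and the max_dist threshold are all hypothetical, and plain edit distance stands in for whatever fuzzy strategy rWCVP applies internally.

```r
# Toy reference list standing in for the WCVP names table (hypothetical data).
reference <- c("Myrcia guianensis", "Myrcia splendens", "Poa annua")

# Names to match: one exact, one misspelled, one with no plausible match.
queries <- c("Poa annua", "Myrcia guianensiss", "Quercus robur")

match_names_sketch <- function(queries, reference, max_dist = 2) {
  # Step 1: exact matching against the reference list.
  idx <- match(queries, reference)
  match_type <- ifelse(is.na(idx), NA_character_, "exact")

  # Step 2: fuzzy matching, only for names left unmatched by step 1.
  todo <- which(is.na(idx))
  if (length(todo) > 0) {
    d <- utils::adist(queries[todo], reference)   # edit-distance matrix
    best <- apply(d, 1, which.min)                # closest reference name
    best_dist <- d[cbind(seq_along(todo), best)]
    ok <- best_dist <= max_dist                   # accept close matches only
    idx[todo[ok]] <- best[ok]
    match_type[todo[ok]] <- "fuzzy"
  }

  data.frame(
    query = queries,
    matched = reference[idx],
    match_type = match_type,
    stringsAsFactors = FALSE
  )
}

result <- match_names_sketch(queries, reference)
print(result)
```

With real data, the equivalent step would be a single call to wcvp_match_names() on a data frame of names; the sketch only shows why fuzzy matching is reserved for the names that exact matching could not place.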

2.2.2 Name resolution after matching

Name matching is a fundamental step in many kinds of analysis, so there is no one-size-fits-all way to handle the output of the name matching functions in rWCVP. For some studies, all ambiguous and multiple matches can be checked and curated manually, while for others this must be done algorithmically.

To simplify this process, rWCVP provides additional information in the data frame returned by ‘wcvp_match_names’.

One approach that can be used to resolve multiple matches is:
(1) If exactly one of the matched names is accepted, keep that name and discard the others.
(2) If none of the names is accepted and only one is a synonym (as opposed to illegitimate/invalid), keep that one and discard the others.
(3) Otherwise, keep the name whose author string is most similar (using “author_edit_distance” or “author_lcs”).
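The three rules above can be sketched in base R on a toy table. The data and the helper resolve_one() are hypothetical, and the column names (input_id, candidate, wcvp_status, wcvp_author_edit_distance) are assumptions modeled on the matching output, so check them against the actual data frame before reusing this.

```r
# Hypothetical matching output: two candidate matches for each input name.
matches <- data.frame(
  input_id = c(1, 1, 2, 2, 3, 3),
  candidate = c("A1", "A2", "B1", "B2", "C1", "C2"),
  wcvp_status = c("Accepted", "Synonym",
                  "Synonym", "Illegitimate",
                  "Synonym", "Synonym"),
  wcvp_author_edit_distance = c(3, 0, 2, 1, 4, 1),
  stringsAsFactors = FALSE
)

resolve_one <- function(df) {
  accepted <- df[df$wcvp_status == "Accepted", ]
  synonyms <- df[df$wcvp_status == "Synonym", ]
  # Rule 1: exactly one accepted name, keep it.
  if (nrow(accepted) == 1) return(accepted)
  # Rule 2: no accepted name and exactly one synonym, keep the synonym.
  if (nrow(accepted) == 0 && nrow(synonyms) == 1) return(synonyms)
  # Rule 3: otherwise keep the candidate with the closest author string.
  df[which.min(df$wcvp_author_edit_distance), ]
}

# Apply the rules separately to each input name and stack the results.
resolved <- do.call(rbind, lapply(split(matches, matches$input_id), resolve_one))
rownames(resolved) <- NULL
print(resolved)
```

Input 1 is settled by rule 1, input 2 by rule 2, and input 3 falls through to the author comparison. Whether rule 2 is appropriate depends on how synonyms should be treated in the study at hand.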

However, not all of these steps apply to any dataset. In particular, the handling of synonyms should be carefully considered during name resolution.

The handling of synonyms after name matching will depend on the dataset and the purpose of the name matching. For example, consider a species name that has been reduced to a synonym (i.e. material that was previously distinguished as two different species is now considered the same species).

2.2.3 Spatial integration and distribution mapping (wcvp_distribution(), wcvp_distribution_map() and wgsrpd3)

The rWCVP package provides functions for retrieving and plotting plant species distributions from WCVP. These distributions are returned as spatial vector objects (polygons) of WGSRPD level 3 botanical countries (the base distribution unit of WCVP) in simple features (sf) format. Note that although these regions are often referred to as “botanical countries”, they are frequently sub-national units or otherwise differ from political boundaries. The WGSRPD level 3 polygons are included in the rWCVPdata package as the object ‘wgsrpd3’.

The ‘wcvp_distribution’ function retrieves the distribution of a taxon, which can then be plotted as a range map using ‘wcvp_distribution_map’ or used in other analyses.

The ‘wcvp_distribution_map’ function has options to include or exclude each occurrence type (native, introduced, extinct, and doubtful) and produces a map that is visually consistent with the POWO web interface, where occurrence types are indicated by color. The “crop_map” parameter can be used to zoom to the extent of the distribution.
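As a hedged illustration of these options, the sketch below maps only the native range of the species used in the GitHub example. It assumes the logical arguments are named introduced, extinct and location_doubtful, as I understand the package documentation, and it only runs the rWCVP calls when both rWCVP and rWCVPdata are installed.

```r
# Run the rWCVP calls only when both packages are available.
have_wcvp <- requireNamespace("rWCVP", quietly = TRUE) &&
  requireNamespace("rWCVPdata", quietly = TRUE)

if (have_wcvp) {
  distribution <- rWCVP::wcvp_distribution(
    "Myrcia guianensis",
    taxon_rank = "species"
  )

  # Keep only native occurrences and zoom to their extent.
  native_map <- rWCVP::wcvp_distribution_map(
    distribution,
    crop_map = TRUE,
    introduced = FALSE,
    extinct = FALSE,
    location_doubtful = FALSE
  )
  print(native_map)
}
```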

2.2.4 Mapping geographic locations between levels (get_wgsrpd3_codes() and get_area_name())

Many of the functions in rWCVP are designed to work with multiple WGSRPD botanical countries, but manually identifying the three-letter codes that make up a broader geographic area is tedious and error-prone.

To solve this problem, the package includes the function “get_wgsrpd3_codes”, which returns the set of level 3 codes that make up a named geography. Supported named geographies include: WGSRPD level 3 botanical country names (e.g. “Tasmania”), level 2 region names (e.g. “Caribbean”), level 1 continent names (e.g. “Europe”), hemispheres (e.g. “Southern Hemisphere”), countries (e.g. “South Africa”), and “Global”. The complete set of supported geographies is contained in the data object “wgsrpd3_mapping”.

When this function is called, a message tells the user which geographic level was matched, because some names are both a country and a region (e.g. “Brazil”). If the input geography is a hemisphere, the user can choose to include or exclude level 3 regions that straddle the equator using the “include_equatorial” parameter; if it is not specified, these regions are included and a message notifies the user of this behavior. The package also includes the inverse function, “get_area_name”, which takes a vector of area codes and returns the geographic name. It can be used to generate titles, file names, or informative plot labels automatically. For examples, refer to “R language practice – rWCVP introduction”.
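A minimal sketch of the round trip between a named geography and its level 3 codes follows. The geography strings are taken from the text above; whether the package accepts exactly these spellings is an assumption, and the calls are skipped when rWCVP and rWCVPdata are not installed.

```r
# Run the rWCVP calls only when both packages are available.
have_wcvp <- requireNamespace("rWCVP", quietly = TRUE) &&
  requireNamespace("rWCVPdata", quietly = TRUE)

if (have_wcvp) {
  # Named geography -> level 3 codes (a message reports the matched level).
  brazil_codes <- rWCVP::get_wgsrpd3_codes("Brazil")
  print(brazil_codes)

  # Level 3 codes -> geographic name, e.g. for a plot title or file name.
  print(rWCVP::get_area_name(brazil_codes))

  # Hemisphere input, explicitly excluding areas that straddle the equator.
  southern_codes <- rWCVP::get_wgsrpd3_codes(
    "Southern Hemisphere",
    include_equatorial = FALSE
  )
  print(length(southern_codes))
}
```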

2.2.5 Summary table (wcvp_summary() and wcvp_summary_gt())

**WCVP is often used to explore and describe plant species richness at the national level quickly, clearly, and concisely.** The package implements this in “wcvp_summary”, which performs the data manipulation, and “wcvp_summary_gt”, which formats the resulting table using the “gt” R package. The summary can be filtered by taxon and geography, and can provide summary statistics grouped in various ways.

For example, we might want to determine how many species of grasses are found in each state of Australia, including a breakdown of native, endemic, introduced and extinct species.
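A hedged sketch of that grass example is given below. It assumes that grasses correspond to the family Poaceae, that “Australia” is an accepted named geography, and that grouping_var = "area_code_l3" groups the summary by level 3 area; the argument names follow my reading of the package documentation, and the calls are skipped when the packages are missing.

```r
# Run the rWCVP calls only when both packages are available.
have_wcvp <- requireNamespace("rWCVP", quietly = TRUE) &&
  requireNamespace("rWCVPdata", quietly = TRUE)

if (have_wcvp) {
  # Summarise grasses (Poaceae) across the level 3 areas of Australia.
  grass_summary <- rWCVP::wcvp_summary(
    taxon = "Poaceae",
    taxon_rank = "family",
    area_codes = rWCVP::get_wgsrpd3_codes("Australia"),
    grouping_var = "area_code_l3"
  )

  # Format the summary as a gt table when the gt package is installed.
  if (requireNamespace("gt", quietly = TRUE)) {
    grass_table <- rWCVP::wcvp_summary_gt(grass_summary)
  }
}
```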

2.2.6 Occurrence matrix (wcvp_occ_mat())

Occurrence matrices, where each row is a taxon and each column is a site or location, are useful for summarizing species distributions, generating indicators of diversity and species richness, and analyzing co-occurrence patterns. In rWCVP they can be produced with the wcvp_occ_mat() function (referred to as generate_occurrence_matrix() in the source text). The function returns a data frame with accepted species as rows and WGSRPD botanical country codes as columns. Presence (i.e., a record exists) is indicated by 1 and absence by 0.

When calling the function, the user can limit the taxonomic and geographic extent of the matrix, and can choose to include or exclude each occurrence type (native, introduced, extinct, and doubtful), but note that the occurrence types are pooled together in the output. For an example workflow that generates and formats an occurrence matrix, see “Using rWCVP to generate a publishable occurrence matrix”.
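The shape of such a matrix can be shown with a base R sketch on made-up records. The taxa and their placement are hypothetical; only the area codes (NSW, TAS, VIC) are real WGSRPD level 3 codes. This is an illustration of the 1/0 layout, not the code that wcvp_occ_mat() runs.

```r
# Hypothetical long-format records: one row per taxon-area combination.
records <- data.frame(
  taxon = c("Species A", "Species A", "Species B", "Species C", "Species C"),
  area_code_l3 = c("NSW", "TAS", "TAS", "NSW", "VIC"),
  stringsAsFactors = FALSE
)

# Cross-tabulate taxa against areas, then reduce counts to presence (1)
# or absence (0). Occurrence types are not distinguished, as in the text.
counts <- table(records$taxon, records$area_code_l3)
occ_mat <- (unclass(counts) > 0) * 1L
print(occ_mat)

# Row sums give the number of areas per taxon; column sums give the
# species richness per area.
areas_per_taxon <- rowSums(occ_mat)
richness_per_area <- colSums(occ_mat)

# With the package installed, the analogous call might look like:
# rWCVP::wcvp_occ_mat(taxon = "Poaceae", taxon_rank = "family",
#                     area_codes = c("NSW", "TAS", "VIC"))
```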

2.2.7 Generate checklist (wcvp_checklist())

Some purposes require more information than simple summary tables or occurrence matrices. For example, specimens may be labeled with older names, so it is useful to be able to look up synonyms and quickly identify the currently accepted names. Again, the full WCVP dataset is unwieldy, so the authors devised the function “wcvp_checklist”. The function and its output format are based on the checklist-building tool of the World Plant Portal.

As with other functions, the names to be included in the checklist can be filtered by taxonomy (species, genus, family, order or higher taxa), geography (by botanical country code, or by passing a named geography to get_wgsrpd3_codes) and taxonomic status (all names, or only accepted names).
Setting “render_report = TRUE” generates a formatted HTML report for printing and/or offline use. The report includes informative front matter, such as a map of the input geography, a key to the botanical country codes, and citation information.

The checklist itself can be arranged in one of two ways using the “report_type” parameter: names can be listed alphabetically (grouped by family) or by taxonomic status (i.e. synonyms appear below their accepted names).
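A hedged sketch of a checklist call is shown below. The taxon and geography are arbitrary choices for illustration, the argument names (synonyms, render_report, report_type) follow my reading of the package documentation, and the report is left switched off so that no file is written.

```r
# Run the rWCVP calls only when both packages are available.
have_wcvp <- requireNamespace("rWCVP", quietly = TRUE) &&
  requireNamespace("rWCVPdata", quietly = TRUE)

if (have_wcvp) {
  # Checklist of the genus Myrcia in Brazil, with synonyms included.
  checklist <- rWCVP::wcvp_checklist(
    taxon = "Myrcia",
    taxon_rank = "genus",
    area_codes = rWCVP::get_wgsrpd3_codes("Brazil"),
    synonyms = TRUE,
    render_report = FALSE
  )
  print(head(checklist))

  # For a formatted HTML report, the call would instead set
  # render_report = TRUE and choose report_type = "alphabetical"
  # or report_type = "taxonomic".
}
```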

Checklists can also be customized (e.g. with additional filters); see, for example, “Using rWCVP to generate a custom checklist”.

2.3 rWCVP application

The sections above detail some uses of specific functions in rWCVP, but there are many ways to combine them as part of a larger workflow. The summary functions are particularly useful for identifying candidate groups for feasibility studies or conservation work (an example is given in “Using rWCVP to generate a publishable occurrence matrix”, where a genus is selected based on its number of species).
One of the most immediate and potentially impactful applications of rWCVP is the identification of important areas of plant diversity, which can then be prioritized for conservation. Plants are often excluded from conservation priority plans due to a lack of authoritative, accessible data. Ironically, the vast amount of digitally available data now presents a problem for botanists focusing on specific countries, as many of the available specimen records lack georeferencing, and some of the most important datasets even lack information on the country where the plants were collected.