Genesys enhanced by GRIN Taxonomy

By matija.obreza@croptrust.org
6 September 2020

USDA’s curated scientific name database provides a taxonomic backbone for Genesys, making searching easier and more productive.

Genesys receives accession passport data from genebanks in Multi-Crop Passport Descriptors (MCPD) format. This common standard allows genebanks to document the taxonomic identity of an accession in up to five fields in their databases: GENUS, SPECIES, SPAUTHOR (species authority), SUBTAXA and SUBTAUTHOR (subtaxon authority at the lowest taxonomic level). But what if there’s a typo in the contents of any of these descriptors? That will make it difficult to find material of the misspelled genus or species.

Fortunately, USDA maintains the GRIN Taxonomy database, which is widely, though not universally, used in the plant genetic resources community. The Genesys Passport Data Validation Tool uses GRIN Taxonomy to identify possibly problematic names among those provided by genebanks, for example because the names in the GENUS, SPECIES or SUBTAXA descriptors are misspelled or outdated. Exact matches between the scientific names provided by genebanks and records in GRIN Taxonomy are easy to find. When no exact match is found, however, it gets a little more complicated. In such cases, the algorithm in the Validation Tool attempts to find the best match in GRIN Taxonomy based on two measures of text similarity: the Levenshtein distance and the Dice coefficient. This will return multiple potential matches, with the best match first, followed by other, less “good”, options. It is up to the data curator to decide if a change is warranted in their own database and to select a “better” taxonomic name for any given accession.
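To make the fuzzy-matching idea concrete, here is a minimal Python sketch of ranking candidate names by Levenshtein distance and Dice coefficient. It is an illustration only: the function names, the use of character bigrams for the Dice coefficient, and the way the two scores are combined (edit distance first, Dice as a tie-breaker) are assumptions, not a description of how the Validation Tool actually works.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def bigrams(s: str) -> list:
    """Adjacent character pairs, e.g. 'abc' -> ['ab', 'bc']."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def dice(a: str, b: str) -> float:
    """Dice coefficient on character bigrams: 2 * shared / total."""
    ba, bb = Counter(bigrams(a)), Counter(bigrams(b))
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:
        return 1.0 if a == b else 0.0
    shared = sum((ba & bb).values())
    return 2 * shared / total


def rank_candidates(name: str, reference_names: list, limit: int = 3) -> list:
    """Return up to `limit` reference names, best match first.

    Assumed ordering: smaller edit distance wins, higher Dice breaks ties.
    """
    query = name.lower()
    scored = []
    for ref in reference_names:
        r = ref.lower()
        scored.append((levenshtein(query, r), -dice(query, r), ref))
    scored.sort()
    return [ref for _, _, ref in scored[:limit]]


# Hypothetical reference list standing in for GRIN Taxonomy.
reference = ["Solanum lycopersicum", "Solanum tuberosum", "Lycopersicon esculentum"]
suggestions = rank_candidates("Solanum lycopersicon", reference)
```

Here the misspelled query differs from Solanum lycopersicum by only two characters, so that name is suggested first; the curator still decides whether to accept it.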

While this is useful to genebanks in identifying possible issues in their data, it does not help much in searching Genesys itself.

That’s because, up to now, Genesys has allowed users to filter accessions based only on the taxonomic data provided by genebanks, i.e. the values of the GENUS, SPECIES and SUBTAXA descriptors as received by Genesys. And those may be misspelled, as we’ve seen, or reflect an outdated taxonomy. To give an example, some genebanks use the name Lycopersicon esculentum (i.e. GENUS=Lycopersicon AND SPECIES=esculentum) for tomato, and some use Solanum lycopersicum. This means that some records for tomato accessions in the Genesys database are under one name, and some under the other. To find them all, the user has to know that two names have existed for the crop in question, and search for each species name separately.

No longer. With the recent adoption of the taxonomic backbone provided by GRIN Taxonomy, a search of Genesys for Solanum lycopersicum, which is the currently accepted name for the tomato in GRIN Taxonomy, will also return accessions documented as Lycopersicon esculentum, and indeed other synonyms.
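The effect of a taxonomic backbone on searching can be sketched in a few lines of Python. This is a toy model under stated assumptions: the lookup table, the taxon IDs, the accession records and the function name are all made up for illustration and do not reflect the Genesys data model or real GRIN identifiers.

```python
# Hypothetical mini-backbone: every known name points to the ID of its
# accepted taxon. The IDs are invented, not real GRIN Taxonomy identifiers.
ACCEPTED_TAXON_ID = {
    "Solanum lycopersicum": 1,      # accepted name for tomato
    "Lycopersicon esculentum": 1,   # synonym, resolves to the same taxon
    "Solanum tuberosum": 2,
}

# Hypothetical accession records, with names as provided by genebanks.
ACCESSIONS = [
    {"id": "A-001", "GENUS": "Lycopersicon", "SPECIES": "esculentum"},
    {"id": "A-002", "GENUS": "Solanum", "SPECIES": "lycopersicum"},
    {"id": "A-003", "GENUS": "Solanum", "SPECIES": "tuberosum"},
]


def search_by_taxon(query: str, accessions: list) -> list:
    """Return accessions whose provided name resolves to the same accepted taxon as `query`."""
    target = ACCEPTED_TAXON_ID.get(query)
    if target is None:
        return []
    return [
        acc for acc in accessions
        if ACCEPTED_TAXON_ID.get(f"{acc['GENUS']} {acc['SPECIES']}") == target
    ]


tomatoes = search_by_taxon("Solanum lycopersicum", ACCESSIONS)
tomato_ids = [acc["id"] for acc in tomatoes]
```

In this sketch a single search for Solanum lycopersicum returns both A-001 and A-002, even though A-001 was documented under the older name.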

The algorithm developed for the Validation Tool was slightly adapted to allow Genesys to automatically find a match in GRIN Taxonomy based on the GENUS, SPECIES and SUBTAXA values provided by genebanks. However, when there are multiple matches, the method does not automatically link the taxon name provided by genebanks to a GRIN Taxonomy name: manual intervention by the Genesys team is required to find and assign a name. That’s not so bad, though. When we checked in August 2020, names provided by genebanks matched GRIN Taxonomy names for about 80% of the accessions in Genesys, and manual intervention for only a handful of additional distinct names resolved a further 10%.
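The split between automatic linking and manual review might look something like the Python sketch below. It assumes a matcher that returns a list of candidate names for each provided name; the function name, the toy matcher data, and the choice to send names with no candidates to review as well are illustrative assumptions rather than the actual Genesys implementation.

```python
def link_names(provided_names, find_matches):
    """Link a name automatically only when exactly one candidate is found.

    `find_matches` is any function that takes a provided name and returns a
    list of candidate reference names. Names with several candidates (or none,
    an assumption of this sketch) are set aside for manual review.
    """
    linked = {}
    needs_review = {}
    for name in provided_names:
        candidates = find_matches(name)
        if len(candidates) == 1:
            linked[name] = candidates[0]
        else:
            needs_review[name] = candidates
    return linked, needs_review


# Toy matcher with hard-coded results, standing in for a real lookup.
TOY_MATCHES = {
    "Solanum lycopersicum": ["Solanum lycopersicum"],
    "Solanum lycopersicon": ["Solanum lycopersicum", "Solanum lycopersicoides"],
    "Unknownus plantus": [],
}

linked, needs_review = link_names(
    list(TOY_MATCHES),
    lambda name: TOY_MATCHES.get(name, []),
)
```

Only the unambiguous name ends up in `linked`; the other two wait in `needs_review` for a person to assign a name.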

As more data is shared on Genesys, most accessions will immediately be linked to the taxonomic names accepted by GRIN Taxonomy. Experts at USDA are continuously updating the GRIN Taxonomy database, as concepts change and new taxonomic literature is published. We will periodically retrieve the latest version, and in turn update the Genesys database accordingly.

For users who download passport data from Genesys, we’ve extended the resulting Excel spreadsheet with three new columns: GRIN_TAXON_ID, GRIN_NAME (full species name) and GRIN_AUTHOR (authority at the lowest taxonomic level). These document the accepted name, according to GRIN Taxonomy, associated with the name that was originally provided by the genebank for each accession, hopefully making data analysis a bit easier. 
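As one example of how the new columns could help with analysis, the Python sketch below groups the names provided by genebanks under their accepted GRIN name. It assumes the downloaded sheet has been saved as CSV and contains GENUS and SPECIES columns alongside the three new ones; the sample rows, including the taxon IDs, are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows standing in for a downloaded sheet saved as CSV.
# The GRIN_TAXON_ID values are placeholders, not real GRIN identifiers.
SAMPLE = """GENUS,SPECIES,GRIN_TAXON_ID,GRIN_NAME,GRIN_AUTHOR
Lycopersicon,esculentum,1,Solanum lycopersicum,L.
Solanum,lycopersicum,1,Solanum lycopersicum,L.
Solanum,tuberosum,2,Solanum tuberosum,L.
"""


def provided_names_by_accepted(rows):
    """Map each accepted GRIN name to the set of names genebanks provided for it."""
    groups = defaultdict(set)
    for row in rows:
        provided = f"{row['GENUS']} {row['SPECIES']}"
        groups[row["GRIN_NAME"]].add(provided)
    return dict(groups)


groups = provided_names_by_accepted(csv.DictReader(io.StringIO(SAMPLE)))
```

With real data, replacing `io.StringIO(SAMPLE)` with an open file handle would give the same grouping, showing at a glance which older or variant names sit behind each accepted name.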

These enhancements of Genesys will of course benefit users of the website, whether they are looking for data or germplasm. However, genebanks sharing data on Genesys can also benefit from the integration of GRIN Taxonomy. If genebank managers don’t have such a link built into their own documentation system, they can use this new Genesys functionality to update their own databases on the basis of the latest taxonomic knowledge.

  1. USDA, Agricultural Research Service, National Plant Germplasm System. 2020. Germplasm Resources Information Network (GRIN-Taxonomy). National Germplasm Resources Laboratory, Beltsville, Maryland. Link.

  2. Food and Agriculture Organization of the United Nations (FAO); Bioversity International. 2015. FAO/Bioversity Multi-Crop Passport Descriptors V.2.1. Rome, Italy. Link.
