Searching for similar accessions

By matija.obreza@croptrust.org
8 February 2021

How Genesys makes a complex problem easier, and how history repeats itself.

Before adding new samples to their collection, curators consult their records to see whether the material was previously accessioned from another source. They do this in the first instance by checking the passport data of the incoming material against all existing records in their database. If characterization data are available, they will of course check those too, but passport data are the first step.

DOIs can reliably and unambiguously identify a specific sample, and thus assist in this process. The GLIS DOI Portal scans the internet for connections between DOI-enabled PGR and is able to correctly link accessions in different genebanks. However, the adoption of DOIs for PGR is optional, meaning that only a sub-set of conserved and managed crop diversity can be linked in this way using fully automated tools. But even accessions with a DOI commonly do not link to the material from which they were sourced, as the DOIs were not known at the time of acquisition.

The GLIS DOI system, and also pedigree databases, record two key pieces of information that allow germplasm tracking and cross referencing: the method by which the germplasm was created from its source, and a reference to the immediate parent, or source. Unfortunately, both of these are limited in the domain they cover: the GLIS database does not contain information about PGR without a DOI and pedigree databases generally cover specific crops with large breeding programmes.

Groundhog day

From 2007 to 2010, CGIAR genebanks developed eight Crop Registries as part of the Global Public Goods Project Phase 2 (GPG2). The aim was to recover missing information about accessions of crops held in common among the centres, both to complete the centres' own databases and to identify which samples are unique and which are duplicates of samples held in different genebanks (SGRP, 2010). Other organizations also implemented duplicate searches, e.g. CGN’s “PGR duplicate finder” (CGN, 2012).

What all these tools have in common is that some type of data standardization and validation is applied before data analysis is performed. The standardization minimizes differences in data formatting, coding etc. so that the comparative analysis becomes more effective. Genesys already takes care of this preliminary, time-consuming step by enforcing strong standardization rules. Genesys now does the matching analysis too.

History repeats itself

The Similarity Search in Genesys applies the same approach as the tools developed a decade ago. Both exact and fuzzy text matches are considered. Clean taxonomic data based on GRIN Taxonomy helps with typos and synonyms in species names. The Search returns a list of ranked matches to the selected accession by evaluating the similarity of two accessions based on the following rules:

  1. Accession number matching the donor number (and donor institute) scores highest
  2. Matching taxonomic data contributes strongly to the score
  3. Matching country of provenance, collecting date, site and coordinates (with a margin of error) add to the score
  4. All other identifiers are compared, scored for similarity, and the results added to obtain a final score

The search for similar accessions is available to registered users and can be started from the accession details page. Please contact helpdesk@genesys-pgr.org if the function is not available to you!

Genesys returns a limited number of matches (with the best first) after scanning the entire database of 4,000,000+ records for similar accessions. Each candidate is displayed side by side with the selected accession to allow the user to evaluate the result.

The Match rating ⭐⭐⭐⭐ is a representation of the closeness of the match as assessed by Genesys. Matching entries in the passport data of the two accessions are listed under the label “Matches”. In the example below, accession 2766 at CIAT is a good match to accession 11635 at ILRI because they reference ILCA-11635 and CIAT 2766 respectively in “Other identifiers”. The two also match in species, coordinates and country of provenance, making this pair the highest ranked match among all candidate pairs.


Figure: Comparing ILRI accession 11635 against accession 2766 at CIAT.

The results bar shows the currently selected match in the middle, with a button to the next most highly ranked match (>) on the right and the previous match on the left (<).

Now what?

Genesys contains over 250,000 records of Hordeum accessions. As Genesys returns up to 20 best matches for each of 250,000 accessions, scanning all Hordeum will generate up to 2,500,000 pairs from 5,000,000 potential matches. The number of pairs can be trimmed down by considering only the matches within 90% (or 80%) of the highest scoring match per accession. This still leaves a million pairs that require expert evaluation to confirm or reject the match.
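The trimming step can be illustrated with a minimal Python sketch. It assumes that "within 90%" means a score of at least 90% of the best score for the same accession, and the tuple layout (accession, candidate, score) and the name trim_matches are hypothetical, not part of Genesys.

```python
from collections import defaultdict


def trim_matches(matches, threshold=0.9):
    """Keep matches scoring at least `threshold` times the best score per accession.

    `matches` is a list of (accession, candidate, score) tuples.
    This layout is an assumption made for illustration only.
    """
    # First pass: find the highest score recorded for each accession.
    best = defaultdict(float)
    for accession, _candidate, score in matches:
        best[accession] = max(best[accession], score)

    # Second pass: keep only the matches close enough to that best score.
    return [
        (accession, candidate, score)
        for accession, candidate, score in matches
        if best[accession] > 0 and score >= threshold * best[accession]
    ]


matches = [
    ("A1", "B1", 1000.0),
    ("A1", "B2", 950.0),
    ("A1", "B3", 400.0),
    ("A2", "B4", 300.0),
    ("A2", "B5", 100.0),
]
kept = trim_matches(matches, threshold=0.9)
```

With these made-up scores, only B1 and B2 survive for A1, and only B4 for A2; lowering the threshold to 0.8 would let more pairs through for expert review.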

The results of each expert evaluation must be recorded so that the same pair of accessions does not have to be re-evaluated later. In the future, Genesys can maintain the list of confirmed and rejected matches. Confirmed matches can be publicly listed and made available to all users, and could be permanently stored in GLIS, forever!

The GLIS DOI Portal allows genebanks to permanently record the relationships of an accession to other material with a DOI. There is no reliable mechanism for material without a DOI, however. Exact information about replication of material will be possible when DOIs are adopted for all PGR in genebanks. So, please mint DOIs for all genebank accessions!

So for now, have fun exploring potential duplicate accessions on Genesys! 

Contact helpdesk@genesys-pgr.org with your questions and suggestions for improvements.


How does Genesys do it?

To find accessions with similar passport data to accession A, each candidate accession B is evaluated for similarity score SS to accession A, where a score of 0 means there is no similarity between the two.

The passport data of candidate accession B are compared to the passport data of accession A, field by field. When the two strings in a given field are an exact match, they get full points (e.g. +200). When there are differences in notation or spelling, the similarity of the two strings is calculated as a score between 0 and 1, and only the same proportion of the full points is added: a 90% string match for 200 points adds 180 points (200 x 90% = 180) to the overall accession similarity. This is denoted as ≤200.
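The proportional scoring described above can be written as a short Python sketch. The function name field_points is hypothetical; only the arithmetic comes from the text.

```python
def field_points(full_points, similarity):
    """Return the points one field adds: full points scaled by a similarity in [0, 1]."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be between 0 and 1")
    return full_points * similarity


exact = field_points(200, 1.0)    # exact match: the full 200 points
partial = field_points(200, 0.9)  # 90% string match: 180 points
```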

For free-text fields such as the collecting site description, a combination of Dice's coefficient and Levenshtein distance is used. This approach does not work well when comparing accession numbers: “PI 123123” and “PI123124” are two completely different accession numbers, but have a very high text similarity (>0.9). We developed a distance method for accession numbers that evaluates prefixes, numbers and suffixes separately. This method gives “PI 123123” and “PI123124” a similarity value of 0.5 because of the matching prefix “PI”; “123123” and “PI 123124” score 0, while “123123” and “PI 123123” score 0.666.
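To illustrate what evaluating prefixes, numbers and suffixes separately might look like, here is a minimal Python sketch. The regular expression and the function names are assumptions made for illustration, not the Genesys implementation. How Genesys weights the three parts to arrive at values such as 0.5 or 0.666 is not described here, so the sketch stops at the part-by-part comparison.

```python
import re

# Hypothetical pattern: optional letters, optional separators, optional digits, the rest.
_ACCESSION_NUMBER = re.compile(
    r"^\s*([A-Za-z]*)[\s\-_.]*(\d*)\s*(.*?)\s*$", re.DOTALL
)


def split_accession_number(value):
    """Split an accession number into (prefix, number, suffix)."""
    prefix, number, suffix = _ACCESSION_NUMBER.match(value).groups()
    return prefix.upper(), number, suffix


def compare_parts(a, b):
    """Report, part by part, whether two accession numbers agree."""
    pa, pb = split_accession_number(a), split_accession_number(b)
    return {
        "prefix": pa[0] == pb[0],
        "number": pa[1] == pb[1],
        "suffix": pa[2] == pb[2],
    }


# Same prefix "PI", different numbers: similar as text, different as accession numbers.
result = compare_parts("PI 123123", "PI123124")
```

Splitting first means the space in “PI 123123” no longer matters, while a one-digit difference in the numeric part is treated as a mismatch of the whole number rather than a small edit.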

The points arising from the field-by-field comparisons are added to the score of B with respect to its similarity to A as follows:

Donor information:

  • +100 if A or B specify the other institute in donor institute code
  • +50 if A and B have the same donor institute code
  • ≤50 if the donor names of A and B partly match
  • +400 if accession number of one exactly matches donor number of the other
    • ≤400 if accession number of one partly matches donor number of the other
  • +200 if donor number of one exactly matches donor number of the other
    • ≤200 if donor number of one partly matches donor number of the other

Taxonomic data:

  • +200 if genus and species epithet match
    • +50 if only genus matches
  • ≤50 for matching subtaxa
  • +100 if the current GRIN taxon of A and B is the same

Country of provenance, collecting data and coordinates:

  • +80 if the country of provenance matches
  • +200 if collecting site description matches
    • ≤100 for similar collecting site description
  • ≤160 for collecting date (+20 for each matching character)
  • +100 for matching collecting number
    • ≤100 if collecting numbers match only partly
  • +20 for matching collecting mission identifier
    • ≤10 for partly matching identifiers
  • +50 if the difference in elevation is less than 100 m
  • ≤200 when latitude and longitude are within 2 degrees (+200 if coordinates match exactly, linearly decreasing to 0 at 2 degrees difference)
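
The collecting date and coordinate rules above can be sketched in Python as follows. Two assumptions are made that the rules do not spell out: dates are compared character by character at the same position (eight characters, as in YYYYMMDD, giving the 160-point maximum), and the coordinate difference is taken as the larger of the latitude and longitude differences in degrees. The function names are hypothetical, and longitude wrap-around at 180 degrees is ignored.

```python
def date_points(date_a, date_b, points_per_char=20):
    """Add points for every position where the two date strings agree."""
    # Assumption: position-by-position comparison of strings such as "19850612".
    return points_per_char * sum(1 for x, y in zip(date_a, date_b) if x == y)


def coordinate_points(lat_a, lon_a, lat_b, lon_b, full_points=200.0, max_diff=2.0):
    """Full points for identical coordinates, decreasing linearly to 0 at max_diff degrees."""
    # Assumption: the difference is the larger of the latitude and longitude gaps.
    diff = max(abs(lat_a - lat_b), abs(lon_a - lon_b))
    if diff >= max_diff:
        return 0.0
    return full_points * (1.0 - diff / max_diff)


same_month = date_points("19850612", "19850600")        # six matching characters
one_degree = coordinate_points(10.0, 40.0, 11.0, 40.0)  # halfway to the 2-degree limit
```

Under these assumptions, two dates that agree on year and month but not on day earn 120 of the 160 points, and sites one degree apart earn 100 of the 200 points.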

Other identifiers:

Each unique name and other number (aliases) of A is compared to aliases of B and vice-versa:

  • ≤200 if accession name of one matches the other
  • +100 if the alias is exactly the same in A and B
    • ≤80 if the alias partly matches any other alias
  • +1000 if the alias is the DOI of the other accession
  • +200 if the alias is the accession number of the other accession
    • ≤200 if the alias partly matches the accession number
  • +100 if the alias matches the donor number of the other accession
    • ≤100 if the alias partly matches the donor number of the other accession

Similarity rating

After scoring all candidates, the 20 highest scoring candidates are taken and given a similarity rating SR (from 1 to 4). The score of the best candidate SMAX is used to determine the similarity rating for each hit. SR is assigned as follows:

  1. SR = 4 when SS / SMAX > 90%
  2. SR = 3 when SS / SMAX > 70%
  3. SR = 2 when SS / SMAX > 40%
  4. SR = 1 in all other cases

When the score of the best candidate is very high, all other (good) candidates get a very low rating. To keep good candidates visible, we cap SMAX at 1000 before calculating the ratings.
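
The rating thresholds and the SMAX cap can be put together in a small Python sketch. The function name similarity_rating and the example scores are made up for illustration; the thresholds and the cap of 1000 are the ones given above.

```python
def similarity_rating(score, best_score, cap=1000.0):
    """Return the rating SR (1 to 4) for a candidate score SS, given the best score SMAX."""
    s_max = min(best_score, cap)  # cap SMAX before calculating the ratio
    if s_max <= 0:
        return 1  # no similarity at all among the candidates
    ratio = score / s_max
    if ratio > 0.9:
        return 4
    if ratio > 0.7:
        return 3
    if ratio > 0.4:
        return 2
    return 1


# Best candidate scores 1600 (for example through a DOI match), another scores 800.
# Without the cap the ratio would be 0.5 (SR = 2); with the cap it is 0.8 (SR = 3).
rating = similarity_rating(800, 1600)
```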