Subsetting Tool user guide

By helpdesk@genesys-pgr.org
27 April 2023

Text provided by Steven Sotelo¹, Julian Ramirez¹, Zakaria Kehel², Khadija Aouzal², Brayan Mora¹, Victor Manuel Hernandez¹, and figures drafted by Christelle Rabil³.

¹Alliance of Bioversity International & CIAT
²International Center for Agricultural Research in the Dry Areas (ICARDA)
³The Global Crop Diversity Trust (Crop Trust)

The Subsetting Tool allows you to create scenarios of abiotic factors, to search for accessions that can grow under these conditions. The tool will consider only accessions that specify the coordinates of their collecting sites. To generate different scenarios, the tool provides a set of agroclimatic indicators, grouped into six categories and explained below:

Drought stress

  • Total precipitation: Total monthly precipitation measured in millimeters (mm).

  • Consecutive dry days: Maximum number of consecutive days with precipitation less than 1 mm. 

  • Number of days with water stress: Number of days per month in which the ratio between actual and potential evapotranspiration (ERATIO) is less than 0.5. ERATIO is calculated following a simple water balance model as in Jones and Thornton (2009).
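The drought indicators above can be derived from daily series. The following is a minimal sketch, assuming you already have daily precipitation and daily ERATIO values for one month. The function names and sample values are hypothetical and are not part of the Subsetting Tool, and the water balance model that produces ERATIO is not reproduced here.

```python
from typing import Sequence


def max_consecutive_dry_days(daily_precip_mm: Sequence[float],
                             dry_threshold_mm: float = 1.0) -> int:
    """Longest run of days with precipitation below the threshold (1 mm)."""
    longest = current = 0
    for precip in daily_precip_mm:
        if precip < dry_threshold_mm:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # a wet day ends the current dry spell
    return longest


def count_water_stress_days(daily_eratio: Sequence[float],
                            stress_threshold: float = 0.5) -> int:
    """Number of days on which ERATIO is below the threshold (0.5)."""
    return sum(1 for ratio in daily_eratio if ratio < stress_threshold)


# Hypothetical ten-day series, for illustration only.
precip = [0.0, 0.4, 3.2, 0.0, 0.0, 0.9, 12.5, 0.0, 1.0, 0.2]
eratio = [0.3, 0.4, 0.8, 0.6, 0.45, 0.2, 0.9, 0.7, 0.5, 0.1]

print("Total precipitation (mm):", sum(precip))
print("Consecutive dry days:", max_consecutive_dry_days(precip))
print("Days with water stress:", count_water_stress_days(eratio))
```

Note that a day with exactly 1 mm of precipitation, or an ERATIO of exactly 0.5, does not count, because both definitions use "less than".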

Flood stress

  • Extreme daily precipitation: 95th percentile of precipitation per month measured in millimeters; this indicator shows the highest values of precipitation.

  • Number of days with flooding: Number of days per month in which the ratio between actual and potential evapotranspiration (ERATIO) is greater than 0.5. ERATIO is calculated following a simple water balance model as in Jones and Thornton (2009).
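As an illustration of the flood indicators, the sketch below computes a 95th percentile of daily precipitation and counts the days on which ERATIO exceeds 0.5. It is only a sketch under assumptions: the percentile uses linear interpolation between ranks (the method used by the tool is not stated in this guide), and the function names and sample values are hypothetical.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0-100), linear interpolation between ranks (assumed method)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def count_flood_days(daily_eratio: Sequence[float], threshold: float = 0.5) -> int:
    """Number of days on which ERATIO is greater than the threshold (0.5)."""
    return sum(1 for ratio in daily_eratio if ratio > threshold)


# Hypothetical daily values, for illustration only.
daily_precip_mm = [0.0, 2.0, 5.0, 1.0, 40.0, 0.0, 8.0, 3.0, 0.0, 12.0, 6.0]
daily_eratio = [0.3, 0.6, 0.9, 0.5, 0.7]

print("Extreme daily precipitation (mm):", percentile(daily_precip_mm, 95))
print("Days with flooding:", count_flood_days(daily_eratio))
```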

Heat stress

  • Average minimum temperature: Average monthly minimum temperature, measured in degrees Celsius. 

  • Average maximum temperature: Average monthly maximum temperature, measured in degrees Celsius.

  • Average VPD: Monthly average vapor pressure deficit.

  • Number of days with high VPD: Number of days per month for which the vapor pressure deficit is greater than or equal to 4 kPa.

Photoperiod indicators

  • Mean solar radiation: Monthly average of solar radiation measured in W/m².

  • Julian day length: Average daylight hours per month. 

Soil indicators

  • Bulk density: Bulk density averaged over the top 60 cm of soil, measured in cg/cm³.

  • Cation exchange capacity: Average cation exchange capacity over the top 60 cm of the soil, measured in mmol(c)/kg.

  • Type of soil texture: Qualitative variable indicating soil classes taken from USDA soil taxonomy.

  • Organic carbon content: Organic carbon content averaged over the top 60 cm of soil, measured in dg/kg.

  • pH: Soil pH averaged over the top 60 cm of soil, reported as pH × 10.

  • Salinity: Qualitative variable indicating soil salinity class taken from FAO.
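Several soil indicators are stored in scaled units (for example, pH multiplied by 10). The sketch below converts such stored values to more familiar units. The divisors follow directly from the units listed above; the dictionary keys, the target units and the sample values are assumptions chosen for illustration.

```python
from typing import Dict, Tuple

# indicator key -> (target unit, divisor applied to the stored value)
SOIL_SCALES: Dict[str, Tuple[str, float]] = {
    "bulk_density": ("g/cm3", 100.0),             # stored in cg/cm3
    "cation_exchange_capacity": ("cmol(c)/kg", 10.0),  # stored in mmol(c)/kg
    "organic_carbon": ("g/kg", 10.0),             # stored in dg/kg
    "ph": ("pH", 10.0),                           # stored as pH x 10
}


def to_conventional_units(stored: Dict[str, float]) -> Dict[str, Tuple[float, str]]:
    """Convert stored soil values to (value, unit) pairs in conventional units."""
    converted = {}
    for name, value in stored.items():
        unit, divisor = SOIL_SCALES[name]
        converted[name] = (value / divisor, unit)
    return converted


# Hypothetical stored values for one pixel.
pixel = {"bulk_density": 135, "cation_exchange_capacity": 214,
         "organic_carbon": 182, "ph": 63}

for name, (value, unit) in to_conventional_units(pixel).items():
    print(f"{name}: {value} {unit}")
```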

Crop-specific indicators

The crop-specific indicators are only available for the crops indicated in Table 1. The table describes the cut-off points used to calculate the indicators for each crop.

  • Number of days with high temperatures: Number of days per month in which the maximum temperature exceeds the maximum temperature supported by the crop.

  • Number of days with low temperatures: Number of days per month in which the minimum temperature is below the minimum temperature supported by the crop.

  • Number of days with optimal temperatures: Number of days per month in which the average temperature oscillates between the optimum temperatures for crop growth.

image.png

Table 1: Temperature cut-off points by crop
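The crop-specific indicators are counts of days against crop cut-off points. The sketch below shows one way such counts could be computed. The cut-off values are placeholders and are not the values from Table 1, the class and function names are hypothetical, and the daily average temperature is assumed to be the mean of the daily minimum and maximum.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class CropThresholds:
    t_min: float       # minimum temperature supported by the crop (deg C)
    t_max: float       # maximum temperature supported by the crop (deg C)
    t_opt_low: float   # lower bound of the optimum range (deg C)
    t_opt_high: float  # upper bound of the optimum range (deg C)


def crop_temperature_days(tmin: Sequence[float], tmax: Sequence[float],
                          limits: CropThresholds) -> Dict[str, int]:
    """Count days with high, low and optimal temperatures for one month."""
    high = sum(1 for t in tmax if t > limits.t_max)
    low = sum(1 for t in tmin if t < limits.t_min)
    optimal = sum(
        1 for lo, hi in zip(tmin, tmax)
        # assumption: daily average = (daily minimum + daily maximum) / 2
        if limits.t_opt_low <= (lo + hi) / 2 <= limits.t_opt_high
    )
    return {"high": high, "low": low, "optimal": optimal}


# Placeholder cut-off points and hypothetical daily temperatures.
placeholder_crop = CropThresholds(t_min=10.0, t_max=30.0, t_opt_low=18.0, t_opt_high=24.0)
daily_tmin = [8.0, 12.0, 15.0, 9.0, 14.0]
daily_tmax = [22.0, 31.0, 33.0, 20.0, 28.0]

print(crop_temperature_days(daily_tmin, daily_tmax, placeholder_crop))
```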

Accessing the Subsetting Tool

The tool is accessible only to registered Genesys users. You can log in at the top of any page. If you don’t have an account, you need to register.

Step 1: Once in the Genesys database, first filter for accessions of interest based on their passport data.

image.png

Figure 1: Applying filters by passport data

Figure 1 shows a set of filters that allow you to narrow your search only to the accessions of interest. Genesys allows filtering of accessions in multiple ways:

  • TEXT SEARCH: Search accessions that contain the text in their fields.

  • HOLDING INSTITUTE: Search accessions by the institute code of the institute holding the accessions. 

  • ACCESSION NUMBER: Search by accession number. 

  • DATE SEARCH: Search accessions that were registered or updated within a custom time period.

  • CROP: Search accessions that belong to the selected crop. 

  • TAXONOMY: Search accessions by taxonomy. 

  • ORIGIN OF MATERIAL: Search accessions by origin country.

  • COLLECTING DATA: Search accessions by collecting date, collecting number, collecting mission, and collecting location. 

  • BIOLOGICAL STATUS OF ACCESSION: Search accessions that were categorized as wild, weedy, landrace, breeding material, improved cultivar, GMO, or other. 

  • TYPE OF GERMPLASM STORAGE: Search accessions considering the type of germplasm storage: seed collection, field collection, in vitro collection, cryopreserved collection, DNA collection, or other.

  • STATUS: Search accessions that are available for distribution, that have georeferencing data, that are included in the Multilateral System, that have a backup copy in the Svalbard Global Seed Vault, that have images, or that are part of A European Genebank Integrated System (AEGIS).

  • REFERENCED ACCESSIONS: Search accessions that have been referenced in subsets or trait datasets.

  • CLIMATE AT ORIGIN: Search accessions by the climatic patterns of the places where they were collected.

Step 2: Once you have changed the passport data filters, you should click on the button “APPLY FILTERS” (Figure 2, Step 1) and then go to the “SUBSETTING TOOL” tab (Figure 2, Step 2). 

image.png

Figure 2: Steps to apply filters and open the Subsetting Tool

Using the Subsetting Tool

The Subsetting Tool is best explained through use cases. In the first, a Colombian farmer who grows only beans wishes to request seeds from the genebank located in Palmira, Colombia, to plant in his field. From experience, the farmer knows that his field has high rainfall and low temperatures.

Because the farmer is interested in accessions of a single crop (beans), a passport filter should be applied before entering the subsetting tool, as shown in Figure 3.

image.png

Figure 3: Filtering beans accessions based on passport data before entering the subsetting tool

Note: Figure 3 shows the steps required for the subsetting tool to work correctly. A common mistake is to skip Step 2, which causes the subsetting tool to ignore the previous selection. In this case, if Step 2 is omitted, the crop filter for beans is not applied and the subsetting tool will consider all accessions of all crops available in Genesys.

Once the procedure indicated in Figure 3 has been carried out, the user can use the subsetting tool, as shown in Figure 4.

image.png

Figure 4: Subsetting tool

In Figure 4 there are 82,403 records of bean accessions, of which 37,526 have geographic coordinate information. The subsetting tool works at a spatial data resolution of 5 km (approximately a 5 × 5 km box near the equator). This means some accessions share the same pixel, which explains why the analysis is finally carried out on 6,654 area records, or pixels of interest.
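To illustrate why the number of pixels is smaller than the number of georeferenced accessions, the sketch below snaps coordinates to a regular grid and groups accessions by grid cell. It is not the tool's implementation: the cell size of 0.05 degrees (about 5.5 km at the equator) is an assumption, and the accession identifiers and coordinates are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

CELL_SIZE_DEG = 0.05  # assumed grid cell size in decimal degrees

Accession = Tuple[str, Optional[float], Optional[float]]  # (id, latitude, longitude)


def pixel_of(lat: float, lon: float, cell: float = CELL_SIZE_DEG) -> Tuple[int, int]:
    """Row and column index of the grid cell containing the coordinates."""
    return (math.floor(lat / cell), math.floor(lon / cell))


def group_by_pixel(accessions: Iterable[Accession]) -> Dict[Tuple[int, int], List[str]]:
    """Group accession ids by pixel, skipping accessions without coordinates."""
    pixels: Dict[Tuple[int, int], List[str]] = defaultdict(list)
    for acc_id, lat, lon in accessions:
        if lat is None or lon is None:
            continue  # only georeferenced accessions are considered
        pixels[pixel_of(lat, lon)].append(acc_id)
    return dict(pixels)


# Hypothetical accessions, for illustration only.
sample = [
    ("A1", 3.512, -76.303),
    ("A2", 3.534, -76.321),  # falls in the same cell as A1
    ("A3", 3.561, -76.303),
    ("A4", 4.610, -74.080),
    ("A5", None, None),      # no coordinates, ignored
]

pixels = group_by_pixel(sample)
print(len(sample), "accessions ->", len(pixels), "pixels")
```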

The Subsetting Tool has two modes of use: 1) basic indicators selection and 2) advanced indicators selection. The basic mode allows a selection of indicators with general parameters. The advanced mode allows a finer selection, in which you must provide a range of values for each indicator of interest. How to use the advanced mode is explained later in the document. The tool defaults to the basic indicators selection, so for this use case the user does not need to change the selection mode.

The farmer knows the climatic conditions of his field are high rainfall and low temperature. Therefore, in the subsetting tool, he proceeds to select indicators that meet these conditions as shown in Figure 5.

image.png

Figure 5: Selection of indicators

Figure 5 shows that, given the climatic conditions of his field, the farmer has selected three indicators for the analysis. Two of them belong to the Flood stress category and the third belongs to the Crop-specific indicators category. The farmer can select a whole category of indicators or specific indicators (one by one), according to his needs.

Figure 5 also shows a histogram for each indicator, giving the distribution of the number of accessions over the range of values that the indicator takes.

1. The user should select the range for the number of subsets he is interested in generating, as shown in Figure 6. In this case, he sets the range from two to five subsets and then clicks on the GENERATE 2–5 SETS button. This runs the analysis and builds the subsets.

image.png

Figure 6: Generation of subsets

2. Once the subsets have been generated, the user must choose the subset of interest; in other words, the farmer must choose the subset whose climatic conditions are most similar to those of the field in which he wants to plant the seeds. The subsetting tool provides information to help the farmer make this decision. Figure 7 shows the results of the analysis and a summary of the outcomes.

image.png

Figure 7: Results of the analysis

Four subsets were obtained as a result of the analysis. The pie chart in Figure 7 shows the number of accessions found in each subset, next to a table of descriptive statistics by indicator and subset.

The descriptive statistics table has three fixed columns: the first is the name of the generated subset; the second indicates the number of pixels or unique locations of the accessions considered for analysis; and the third indicates the number of accessions. In addition to these three columns, the table has one column for each selected indicator with its respective statistics.

The subsetting tool offers three types of summary statistics: minimum, average (mean), or maximum. The type can be changed as shown in Figure 7 (Select the type of summary).
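The structure of the descriptive statistics table can be sketched as follows: the indicator values of the pixels are grouped by subset, and one summary function is applied per group. The field names, set labels and values are hypothetical, and the sketch does not reproduce how the tool itself computes the table.

```python
from statistics import mean
from typing import Callable, Dict, Iterable, List

SUMMARIES: Dict[str, Callable[[List[float]], float]] = {
    "minimum": min,
    "average": mean,
    "maximum": max,
}


def summarise(rows: Iterable[dict], indicator: str, summary: str = "average") -> Dict[str, float]:
    """Apply the chosen summary to one indicator, separately for each subset."""
    groups: Dict[str, List[float]] = {}
    for row in rows:
        groups.setdefault(row["set"], []).append(row[indicator])
    func = SUMMARIES[summary]
    return {name: func(values) for name, values in sorted(groups.items())}


# Hypothetical pixel records, for illustration only.
records = [
    {"set": "Set 1", "extreme_precipitation": 10.0},
    {"set": "Set 1", "extreme_precipitation": 20.0},
    {"set": "Set 2", "extreme_precipitation": 40.0},
]

for kind in SUMMARIES:
    print(kind, summarise(records, "extreme_precipitation", kind))
```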

As for the farmer’s choice, the set with the heaviest rainfall is set 4, followed by set 2. In addition, set 2 has the highest number of days with extreme minimum temperatures compared to the other sets. Because these descriptive statistics are calculated by month as multi-year averages (average of 33 years), the behavior of each indicator can also be checked graphically month by month (Figure 8 and Figure 9) in order to make a better decision.

image.png

Figure 8: Mean extreme precipitation (line plot)

Figure 8 shows that set 4 and set 2 present their highest values of extreme precipitation in the months of January–April and October–December.

image.png

Figure 9: Mean number of days with extreme minimum temperatures (line plot)

Figure 9 shows that set 2 has nearly constant values for the number of days with extreme minimum temperature throughout the year (between 20 and 28 days every month), while the other sets fluctuate sharply from month to month.

The subsetting tool offers a further option to view the spatial distribution for all accessions grouped by set. Figure 10 shows a view of the accessions retrieved in the analysis.

image.png

Figure 10: Geographical distribution of accessions

With these views of the data, the user has good information for making decisions. The next step is to select one subset.

3. Figure 11 shows two options to select a set with a click. In this case, the farmer selects set 2, since it is the one that most resembles the conditions he is looking for.

image.png

Figure 11: Selection of the subset of interest

4. After choosing a subset, the user gets a list of accessions (Figure 12) with their respective passport data. All of these accessions meet the conditions he requested.

image.png

Figure 12: List of accessions resulting after the choice of the subset

If the group of resulting accessions is large (> 50 accessions), there are three options for filtering and obtaining the candidate subset. We recommend reviewing the section Candidate subsets and downloading accession data of subsets below.

Using advanced mode of indicator selection

Context: A PhD student with a degree in agricultural engineering wants to carry out a study in the first semester of next year to evaluate the mortality rate and the yield of different cassava accessions. For this purpose, she will request, through her university, accessions from the genebank that can develop well under the climatic conditions of the field she has available for the study.

From previous studies carried out in this same field, the student knows that during the months of January–June there are high temperatures (maximum temperatures above 25°C) and very little rainfall (approximately 51 mm per month), which generally leads to drought problems.

1. Considering her needs, the student first applies a passport filter for cassava as the crop and then enters the subsetting tool, as shown in Figure 13.

image.png

Figure 13: Steps to enter the subsetting tool with cassava selected as the crop

2. The student should then select “Advanced indicator selection with ranges” as shown in Figure 14. 

image.png

Figure 14: Activating advanced mode 

In Figure 14 there are 13,470 records of cassava accessions, of which 8,659 have geographic coordinate information. The subsetting tool works at a spatial data resolution of 5 km (approximately a 5 × 5 km box near the equator). This means some accessions share the same pixel, which explains why the analysis is finally carried out on 2,000 area records, or pixels of interest.

In advanced mode, the user can also filter by time period; that is, she can choose the months for which the analysis is to be carried out (in this case January–June). Because the indicators were calculated for each year over a period of 33 years, multi-year aggregations are available for the following time periods:

  • 2010–2016 (Last 6 years)

  • 2005–2016 (Last 11 years)

  • 2000–2016 (Last 16 years)

  • 1995–2016 (Last 21 years)

  • 1990–2016 (Last 26 years)

  • 1983–2016 (All years)
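The combination of a year range and a selection of months can be sketched as a simple aggregation. The example below assumes the aggregation is an arithmetic mean over the selected years, computed separately for each selected month. The data layout (a dictionary keyed by year and month), the function name and the values are hypothetical.

```python
from statistics import mean
from typing import Dict, Iterable, Tuple


def multi_year_monthly_mean(series: Dict[Tuple[int, int], float],
                            first_year: int, last_year: int,
                            months: Iterable[int]) -> Dict[int, float]:
    """Mean value per selected month over the years first_year..last_year (inclusive)."""
    result: Dict[int, float] = {}
    for month in months:
        values = [
            value for (year, m), value in series.items()
            if m == month and first_year <= year <= last_year
        ]
        if values:  # months without data are left out
            result[month] = mean(values)
    return result


# Hypothetical monthly total precipitation (mm) for one pixel: (year, month) -> value.
precipitation = {
    (2009, 1): 100.0,
    (2015, 1): 40.0,
    (2016, 1): 60.0,
    (2016, 7): 5.0,
}

# Period 2010-2016, first semester only (January to June).
print(multi_year_monthly_mean(precipitation, 2010, 2016, range(1, 7)))
```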

3. The user’s requirement is to analyze all years, but just the first semester. Figure 15 shows how to select the period needed.

image.png

Figure 15: Selecting temporality

4. The student wishes to carry out her test in a field where, from experience, she knows that during the months of January–June there are high temperatures (maximum temperatures above 25°C) and very little rainfall (approximately 51 mm per month). She therefore applies filters to create a scenario with the following climatic conditions. 

  • Drought stress: 

    • Total precipitation [0–51 mm]

    • Consecutive dry days [All range]

    • Number of days with water stress [All range]

  • Heat stress:   

    • Average maximum temperature [> 25°C] 

  • Crop-specific indicators

    • Number of days with high temperatures [All range]

Figure 16 shows how to select indicators and filter them in intervals.

image.png

Figure 16: Selection of indicators
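The scenario above can be read as a set of conditions that each pixel must satisfy. The sketch below expresses it as one predicate per constrained indicator; indicators left at "All range" impose no constraint and are therefore omitted. The field names and pixel values are hypothetical, and each pixel is assumed to carry a single value per indicator, already aggregated over the selected months and years.

```python
from typing import Callable, Dict, List

Scenario = Dict[str, Callable[[float], bool]]

# Conditions from the student's scenario; "All range" indicators are not listed.
scenario: Scenario = {
    "total_precipitation_mm": lambda v: 0 <= v <= 51,
    "avg_max_temperature_c": lambda v: v > 25,
}


def matches(pixel: dict, conditions: Scenario) -> bool:
    """True if the pixel satisfies every condition of the scenario."""
    return all(test(pixel[name]) for name, test in conditions.items())


def filter_pixels(pixels: List[dict], conditions: Scenario) -> List[dict]:
    """Keep only the pixels that satisfy the scenario."""
    return [pixel for pixel in pixels if matches(pixel, conditions)]


# Hypothetical pixels, for illustration only.
pixels = [
    {"id": "p1", "total_precipitation_mm": 30.0, "avg_max_temperature_c": 29.5},
    {"id": "p2", "total_precipitation_mm": 80.0, "avg_max_temperature_c": 31.0},
    {"id": "p3", "total_precipitation_mm": 45.0, "avg_max_temperature_c": 24.0},
    {"id": "p4", "total_precipitation_mm": 51.0, "avg_max_temperature_c": 25.1},
]

print([pixel["id"] for pixel in filter_pixels(pixels, scenario)])
```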

5. The next step is to filter the accessions that meet the conditions selected by the user. The student should click on the button as shown in Figure 17.

image.png 

Figure 17: Filter accessions

6. Figure 18 shows the accessions that make up the first subset generated, considering the indicator filters applied by the user.

image.png

Figure 18: Accessions tab

Furthermore, the tool also allows the student to view the geographical distribution of all accessions filtered. She can see these on the Map tab as shown in Figure 19.

image.png

Figure 19: Geographic distribution of accessions on the Map tab

In the Plots tab, she can see statistical information on the filtered accessions for each selected indicator. Figure 20 shows how to change the indicator in order to view its statistics; in this case, the student is looking at the indicator for consecutive dry days. The box plot shows that, for the accessions that meet the user’s initial requirements, the month with the most consecutive dry days is January and the one with the fewest is May.

image.png

Figure 20: Plots tab

The accessions in this first subset meet the initial requirements of the user, but she is also interested in the accessions that tolerate more extreme values of drought and temperature. She therefore carries out a grouping (clustering) analysis of the resulting accessions. For this, the following methods are available:

  • Agglomerative method: Classic method where the user needs to propose a range for the number of subsets she is interested in generating.

  • DBSCAN method: Method based on the density of the points. The user needs to provide two values: Epsilon, which refers to the maximum distance between points to be joined; and Minpts, which refers to the minimum number of accessions per group.

  • HDBSCAN method: Hierarchical density-based method where the user needs to provide a Min_cluster_size value, which refers to the minimum number of points per cluster.
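To give an intuition for the Epsilon and Minpts parameters, the following is a compact, textbook-style DBSCAN written in plain Python. It is not the implementation used by the Subsetting Tool. It assumes Euclidean distance between indicator values, counts a point as its own neighbour when comparing against Minpts, and uses hypothetical points (consecutive dry days, average maximum temperature).

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]
NOISE = -1  # label for points that belong to no group


def neighbours_of(points: Sequence[Point], index: int, eps: float) -> List[int]:
    """Indices of all points within eps of points[index], itself included."""
    return [j for j, other in enumerate(points)
            if math.dist(points[index], other) <= eps]


def dbscan(points: Sequence[Point], eps: float, min_pts: int) -> List[int]:
    """Return one label per point: 0, 1, ... for groups, -1 for noise."""
    labels: List[Optional[int]] = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned
        seeds = neighbours_of(points, i, eps)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later become a border point of a group
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the group, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours_of(points, j, eps)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return [label if label is not None else NOISE for label in labels]


# Hypothetical pixels: (consecutive dry days, average maximum temperature).
points = [(10.0, 26.0), (11.0, 26.5), (10.5, 27.0),
          (25.0, 33.0), (26.0, 33.5), (25.5, 34.0),
          (60.0, 20.0)]

print(dbscan(points, eps=1.5, min_pts=3))
```

A larger Epsilon merges groups that are further apart, while a larger Minpts turns small, sparse groups into noise. In practice, indicators measured on different scales would usually be standardized before computing distances; that step is omitted here.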

7. The student selects the agglomerative method (Figure 21) and generates three to five subsets from the accessions that make up the first subset generated.

image.png

Figure 21: Selecting the clustering method and number of subsets

8. The student should choose the subset with the most similar climatic conditions to those of the field in which she wants to reproduce the cassava material. The subsetting tool displays information that will help her make this decision. These aids are the same ones available in the basic indicator selection method above; therefore, this section focuses on the interpretation of results rather than on the operation of the tool. 

Figure 22 shows that five subsets were generated from the accessions that made up the first subset generated. In addition to this, the student can see that the accessions supporting more hostile environments of drought and heat are those that make up set 5.

image.png

Figure 22: Selecting subset candidates

Figure 23 shows that set 5 is the driest during the months of January–April and set 1 is the driest during the months of May–June.

image.png

Figure 23: Line graph (mean number of consecutive dry days)

Figure 24 shows the geographic distribution of the accessions (each subset in a different color). 

image.png

Figure 24: Geographical distribution of accessions

9. Based on this analysis, the student establishes that set 5 is the most suitable for her study. To select this subset, she just has to click on the summary box or on the pie chart. Once she has selected the subset, she can see a list of accessions with their passport data, as shown in Figure 25.

image.png

Figure 25: List of accessions resulting after the choice of a subset

If the group of resulting accessions is large (> 50 accessions), there are three options for filtering and obtaining the candidate subset. We recommend reviewing the section Candidate subsets and downloading accession data of subsets, which follows.

Candidate subsets and downloading accession data of subsets

If the group of resulting accessions is large (> 50 accessions), the subsetting tool offers three options to filter and obtain the candidate subset.

  • Random: A random filter can be applied to a resulting subset, by default selecting 10 accessions. If you want to choose a larger number of accessions, this value can be modified.

  • Core collection: A resulting subset can be filtered by core collection, selecting the most representative accessions that capture the greatest variability of their species. This also chooses 10 accessions by default, which can be modified if you want to choose a larger number. 

  • Manual: If you are familiar with the resulting accessions, you can also do a manual selection of accessions. 

image.png

Figure 26: Choice of candidate subset
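The Random option can be pictured as sampling without replacement from the resulting subset. The sketch below is an illustration only: the function name, the optional seed and the accession identifiers are hypothetical, and the default of 10 mirrors the default described above.

```python
import random
from typing import List, Optional, Sequence


def random_candidates(accessions: Sequence[str], size: int = 10,
                      seed: Optional[int] = None) -> List[str]:
    """Pick `size` distinct accessions at random; return all if there are fewer."""
    if len(accessions) <= size:
        return list(accessions)
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    return rng.sample(list(accessions), size)


# Hypothetical subset of 120 accessions.
subset = [f"ACC{i:03d}" for i in range(1, 121)]

print(random_candidates(subset))           # 10 accessions by default
print(random_candidates(subset, size=25))  # a larger candidate subset
```

Because the selection is random, running it again (without a fixed seed) generally produces a different candidate subset.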

At this point there is a subset of accessions that meets the initial conditions, but it is not the only subset that can be generated: if the selection method changes, the accessions within the generated group can change. In other words, you can have several candidate subsets. You must select one of the methods explained above to generate a candidate subset.

The accessions can be saved to a list in your profile. Figure 27 shows the accessions with their passport data, which are part of the candidate subset, in addition to the geographic distribution of the accessions.

image.png

Figure 27: A candidate subset

As the accessions are selected, they are saved in a section called My List, from where they can be downloaded as seen in Figure 28.

image.png

Figure 28: Downloading selected accessions from the candidate subset

Resources

List of links to download indicators used in the subsetting tool

Generic indicators

Crop-specific indicators

Bibliography

  • Hengl, T., Mendes de Jesus, J., Heuvelink, G. B., Ruiperez Gonzalez, M., Kilibarda, M., Blagotić, A., ... & Kempen, B. (2017). SoilGrids250m: Global gridded soil information based on machine learning. PLoS ONE, 12(2), e0169748.

  • Funk, C., Peterson, P., Landsfeld, M., Pedreros, D., Verdin, J., Shukla, S., ... & Michaelsen, J. (2015). The climate hazards infrared precipitation with stations—a new environmental record for monitoring extremes. Scientific Data, 2(1), 1–21.

  • Verdin, A., Funk, C., Peterson, P., Landsfeld, M., Tuholske, C., & Grace, K. (2020). Development and validation of the CHIRTS-daily quasi-global high-resolution daily temperature data set. Scientific Data, 7(1), 303.

  • Ruane, A. C. (2021). AgMERRA and AgCFSR Climate Forcing Datasets for Agricultural Modeling.
