Training and Prediction Dataset for Seed Nitrogen and Protein Content in Common Bean Using Near-Infrared Spectroscopy
Common bean (Phaseolus vulgaris L.) is the world’s most important grain legume for direct human consumption and a vital staple food for millions of people in Latin America and Africa. Given the rising demand for beans and their critical role in nutrition and food security, especially in these regions, improving the nutritional quality of common bean seeds through breeding has become increasingly urgent. García et al. (2025) demonstrated that Near-Infrared Spectroscopy (NIRS) can estimate seed nitrogen content non-destructively, providing nutritional information that facilitates the use of large genebank collections in bean improvement.
Here, we report the data used to train and test the prediction models, together with the nitrogen values predicted by the best-performing model, which achieved a concordance correlation coefficient (CCC) of 0.84. The training and test data comprise laboratory-determined nitrogen content (g/kg) and the derived protein content (total N × 6.25) for 300 cultivated accessions from the common bean core collection, 100 wild accessions, and one biofortified variety, all quantified using the Kjeldahl method as described by García et al. (2025) (S-Table 1). In addition, the dataset provides NIRS-predicted nitrogen content for 1,392 cultivated accessions from the core collection, 360 wild accessions, and two biofortified varieties, obtained with the best predictive model described in García et al. (2025) (S-Table 6). Descriptive statistics for the NIRS-predicted nitrogen values, including minimum, maximum, and mean, are reported in S-Table 7.
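The two calculations mentioned in this description, the nitrogen-to-protein conversion and the concordance correlation coefficient, can be sketched as follows. This is a minimal illustration, not the code used by García et al. (2025): the function names are hypothetical, the CCC is assumed to be Lin's concordance correlation coefficient computed with population (1/n) variances, and the example values are invented for demonstration and are not taken from the dataset.

```python
from statistics import fmean

# Conversion factor stated in the dataset description (protein = total N x 6.25).
N_TO_PROTEIN = 6.25


def protein_from_nitrogen(n_g_per_kg: float) -> float:
    """Convert total nitrogen (g/kg) to protein content (g/kg)."""
    return n_g_per_kg * N_TO_PROTEIN


def concordance_ccc(observed, predicted) -> float:
    """Lin's concordance correlation coefficient between two equal-length series.

    Assumption: population (1/n) variances and covariance are used; the source
    does not state which estimator was applied. Raises ZeroDivisionError if
    both series are constant and have the same mean.
    """
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two series of equal length with at least 2 values")
    n = len(observed)
    mean_obs = fmean(observed)
    mean_pred = fmean(predicted)
    var_obs = sum((x - mean_obs) ** 2 for x in observed) / n
    var_pred = sum((y - mean_pred) ** 2 for y in predicted) / n
    cov = sum((x - mean_obs) * (y - mean_pred)
              for x, y in zip(observed, predicted)) / n
    # CCC penalizes both scatter (via cov) and systematic bias (via the mean shift).
    return 2 * cov / (var_obs + var_pred + (mean_obs - mean_pred) ** 2)


# Invented example values (g/kg), for illustration only.
kjeldahl_n = [30.0, 35.0, 40.0]
nirs_n = [32.0, 37.0, 42.0]

protein = [protein_from_nitrogen(n) for n in kjeldahl_n]
ccc = concordance_ccc(kjeldahl_n, nirs_n)
```

Unlike a plain Pearson correlation, the CCC drops below 1 when predictions are consistently offset from the reference values, which is why it is a common choice for judging agreement between NIRS predictions and Kjeldahl measurements.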
List of accessions included in the dataset