Details of data processing

Sequence alignment and variation identification

Version 7.0 of the genomic pseudomolecule assembly of japonica cv. Nipponbare was downloaded from Michigan State University and used as the reference genome. Reads from all varieties were aligned to the pseudomolecules with BWA mem (BWA v0.7.12-r1039). SNPs and INDELs were identified with GATK v3.3-0-g37228af. For each sample, a GVCF file was generated with HaplotypeCaller (parameters: -T HaplotypeCaller --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000; only reads with mapping quality ≥20 were used). After the per-sample GVCF files were created, CombineGVCFs was used to generate the VCF file. The variations identified by GATK were then filtered: the allele count in the VCF file had to be >10 and the depth ≥50.
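The final filtering step can be illustrated with a minimal Python sketch. It assumes that the allele count and depth are read from the standard `AC` and `DP` keys of the VCF INFO column, and that for multi-allelic sites the largest ALT allele count is the one compared with the threshold; neither assumption is stated in the original pipeline, and the function names are hypothetical.

```python
def parse_info(info):
    """Parse a VCF INFO string such as 'AC=12;DP=80;DB' into a dict."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value  # flag-only keys (e.g. 'DB') get an empty value
    return fields


def passes_filter(vcf_line, min_ac=10, min_dp=50):
    """Return True if a VCF data line has AC > min_ac and DP >= min_dp."""
    cols = vcf_line.rstrip("\n").split("\t")
    info = parse_info(cols[7])  # INFO is the 8th VCF column
    if not info.get("AC") or not info.get("DP"):
        return False
    # Assumption: for multi-allelic sites, use the largest ALT allele count.
    allele_count = max(int(x) for x in info["AC"].split(","))
    depth = int(info["DP"])
    return allele_count > min_ac and depth >= min_dp


def filter_vcf(lines):
    """Yield header lines unchanged and only the data lines that pass."""
    for line in lines:
        if line.startswith("#") or passes_filter(line):
            yield line
```

For example, `filter_vcf(open("raw.vcf"))` (with a hypothetical file name) would stream the header plus every record that satisfies both thresholds.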

Imputing missing genotypes using an LD-KNN algorithm

In the raw genotype calls from GATK, 33.4% of genotypes were missing because of low-coverage sequencing. We therefore imputed the missing genotypes with an in-house modified k-nearest-neighbor algorithm. For imputation, heterozygous calls were set to missing, and the variations were split into 4058 bins of 5000 variations each. Missing genotypes in high-coverage regions were set to 'DEL'. After imputation, the overall missing data rate fell to 2.32% and the overall DEL rate was 9.57%. The precision rate and missing rate of each bin after imputation are shown below:
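The in-house algorithm itself is not described here, so the following is only a generic k-nearest-neighbor imputation sketch, not the authors' LD-KNN implementation. It assumes genotypes are stored as a samples-by-sites list of lists with `None` for missing calls, measures similarity as the mismatch rate over sites called in both samples, and fills each gap by majority vote among the k closest samples. The function names and the value of `k` are hypothetical.

```python
from collections import Counter


def split_into_bins(n_sites, bin_size=5000):
    """Split site indices 0..n_sites-1 into consecutive bins of bin_size."""
    return [range(start, min(start + bin_size, n_sites))
            for start in range(0, n_sites, bin_size)]


def mismatch_rate(a, b, skip):
    """Fraction of differing calls between two samples, ignoring site
    `skip` and any site where either sample is missing."""
    compared = mismatches = 0
    for idx, (x, y) in enumerate(zip(a, b)):
        if idx == skip or x is None or y is None:
            continue
        compared += 1
        mismatches += x != y
    # With no shared sites, treat the pair as maximally distant.
    return mismatches / compared if compared else 1.0


def knn_impute(genotypes, k=3):
    """Return a copy of `genotypes` with missing calls filled by majority
    vote among the k nearest samples that have a call at that site."""
    imputed = [row[:] for row in genotypes]
    for i, row in enumerate(genotypes):
        for j, call in enumerate(row):
            if call is not None:
                continue
            donors = [(mismatch_rate(row, other, j), other[j])
                      for n, other in enumerate(genotypes)
                      if n != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [allele for _, allele in donors[:k]]
            if nearest:  # leave the call missing if no donor exists
                imputed[i][j] = Counter(nearest).most_common(1)[0][0]
    return imputed
```

In a binned workflow such as the one described above, `knn_impute` would be applied separately to the columns of each bin returned by `split_into_bins`, which keeps the neighbor search local to a genomic region.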

 

      

Figure 1. Precision rate statistics after imputation

 

   

Figure 2. Missing rate statistics after imputation

 

The genetic structure and diversity of the rice germplasms

The population structure of the 4,726 accessions was inferred using ADMIXTURE based on 195,349 SNPs that were randomly selected from the genome (3 SNPs picked at random per 5 kb). The number of ancestral clusters, K, was set from 2 to 7 to obtain different inferences. Each accession was classified according to its largest subpopulation component. Accessions whose largest component exceeded the second-largest component by less than 0.4 were classified as intermediate.
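The SNP sampling and the classification rule described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used in the study: windows are taken as fixed, non-overlapping 5 kb intervals per chromosome, the ADMIXTURE components of one accession are given as a dict of hypothetical subpopulation names, and all function names are invented for the example.

```python
import random
from collections import defaultdict


def sample_snps(positions, window=5000, per_window=3, seed=0):
    """Randomly keep at most `per_window` SNPs in each fixed window.

    `positions` is an iterable of (chromosome, 1-based position) tuples.
    """
    rng = random.Random(seed)  # seeded so the selection is reproducible
    windows = defaultdict(list)
    for chrom, pos in positions:
        windows[(chrom, (pos - 1) // window)].append((chrom, pos))
    chosen = []
    for key in sorted(windows):
        snps = windows[key]
        chosen.extend(rng.sample(snps, min(per_window, len(snps))))
    return sorted(chosen)


def classify(components, threshold=0.4):
    """Assign an accession to its largest component, or to 'Intermediate'
    if the top two components differ by less than `threshold`.

    `components` maps subpopulation name -> ancestry fraction and must
    contain at least two entries.
    """
    ranked = sorted(components.items(), key=lambda kv: kv[1], reverse=True)
    (top_name, top_value), (_, second_value) = ranked[0], ranked[1]
    if top_value - second_value < threshold:
        return "Intermediate"
    return top_name
```

For instance, `classify({"indica": 0.9, "japonica": 0.1})` returns `"indica"`, whereas `classify({"indica": 0.6, "japonica": 0.4})` returns `"Intermediate"` because the two components differ by only 0.2.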

When K=2, accessions were divided into indica and japonica varietal groups. 

At K=3, the aus cluster (Aus) appeared within the indica varietal group.

At K=4, indica was further divided into two subgroups (indica I and indica III, denoted IndI and IndIII); indica accessions whose IndI and IndIII components differed by less than 0.4 were classified as Indica Intermediate.

At K=5, IndI was further divided into two subgroups (indica I and indica II, denoted IndI and IndII); indica accessions whose IndI and IndII components differed by less than 0.4 were classified as Indica Intermediate.

At K=6, japonica was divided into two subgroups, corresponding to tropical japonica (TrJ) and temperate japonica (TeJ); japonica accessions whose TeJ and TrJ components differed by less than 0.4 were classified as Japonica Intermediate.

At K=7, an independent group (VI) emerged that is intermediate between indica and japonica. Only fourteen accessions belonged to VI, and nine of them carried a mutated fragrance gene fgr, which suggests that VI corresponds to the Group V/Aromatic group reported in other studies (Glaszmann et al. Theor Appl Genet, 1987, 74: 21-30; Garris et al. Genetics, 2005, 169: 1631-1638).

The set of 4729 rice accessions sequenced in this study was accordingly classified into 595 IndI, 465 IndII, 913 IndIII, 786 indica intermediate, 767 TeJ, 504 TrJ, 241 japonica intermediate, 269 Aus, 96 VI, and 90 intermediate. The classification details and subpopulation component values can be queried on the Cultivar Information page.

 

Figure 3. Neighbor-joining tree of 4729 accessions constructed from the matching distances of 202,509 evenly distributed, randomly selected SNPs. The subpopulations indica I (IndI), indica II (IndII), indica III (IndIII), Aus, temperate japonica (TeJ) and tropical japonica (TrJ) are shown in different colors, and the number of accessions in each subpopulation is marked. In this figure, the Intermediate count includes the VI group (shown in pink).

Figure 4. Distribution of the estimated subpopulation components for each of the 4729 accessions, as analyzed by ADMIXTURE with the number of ancestral clusters K set from 2 to 7.