'The Thousand Polish Genomes Project' - a national database of Polish variant allele frequencies
https://doi.org/10.1101/2021.07.07.451425
Abstract
Although Slavic populations account for over 3.5% of the world's inhabitants, no centralized, open-source reference database of genetic variation exists to date for any Slavic population. Such data are crucial for both biomedical research and genetic counseling, and are essential for archeological and historical studies. The Polish population, homogenous and sedentary in nature but influenced by many past migrations, is unique and could serve as a good genetic reference for Central European Slavic nations.
The aim of the present study was to describe the first results of analyses of a newly created national database of Polish genomic variant allele frequencies. No previous whole-genome study of the Polish population has included as many individuals (1,079).
A wide spectrum of genomic variation was identified and genotyped, including small and structural variants, runs of homozygosity, mitochondrial haplogroups and Mendelian inconsistencies. The allele frequencies were calculated for 943 unrelated individuals and released publicly as The Thousand Polish Genomes database. Precise detection and characterisation of rare variants enriched in the Polish population allowed us to confirm the allele frequencies of known pathogenic variants in diseases such as Smith-Lemli-Opitz syndrome (SLOS) and Nijmegen breakage syndrome (NBS). Additionally, the analysis of autosomal recessive (AR) genes listed in OMIM led to the identification of 22 genes with significantly different cumulative allele frequencies in the Polish (POL) vs. non-Finnish European (NFE) population. We hope that The Thousand Polish Genomes database will contribute to the worldwide genomic data resources for researchers and clinicians.
---------------------------------------------------------------
High depth (>30x) PCR-free whole-genome sequencing was applied to all samples.
Whole-genome sequencing (WGS) was performed on the Illumina NovaSeq 6000 platform using 150 bp paired-end reads, with a median average depth of 35.72x.
The reads were subsequently mapped to the GRCh38 human reference genome.
Mitochondrial haplogroups
Using variant calls in the mitochondrial genome, we inferred haplogroups among the unrelated individuals. Among the 930 individuals with a high-quality haplogroup assignment, the most abundant haplogroup was H with 410 (44.1%) representatives, followed by U with 161 (17.3%), J with 92 (9.9%), and T with 83 (8.9%) individuals (Fig. 6). The largest H sub-haplogroup was H1 (N=128; 31.2% of the H haplogroup), and a similar number of individuals was divided between subclades H2, H5, H6 and H11 (N=116; together 28.3% of the H haplogroup). The second most abundant sub-haplogroup in the cohort was U5 with 98 (10.5%) individuals.
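The reported percentages use two different denominators: the 930 individuals with a high-quality assignment for cohort-level shares, and the 410 haplogroup H carriers for the H sub-haplogroup shares. The following minimal sketch recomputes them from the counts given above. The variable and function names are illustrative and not part of the project's pipeline.

```python
# Counts as reported in the text; the cohort-level denominator is the 930
# individuals with a high-quality haplogroup assignment.
TOTAL_ASSIGNED = 930
haplogroup_counts = {"H": 410, "U": 161, "J": 92, "T": 83}


def percent(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Cohort-level shares of the four most abundant haplogroups.
haplogroup_pct = {
    name: percent(count, TOTAL_ASSIGNED)
    for name, count in haplogroup_counts.items()
}

# Sub-haplogroup shares of H1 and of H2/H5/H6/H11 combined are relative to
# haplogroup H, whereas the U5 share is relative to the whole assigned cohort.
h1_share_of_h = percent(128, haplogroup_counts["H"])
other_h_share_of_h = percent(116, haplogroup_counts["H"])
u5_pct = percent(98, TOTAL_ASSIGNED)

print(haplogroup_pct, h1_share_of_h, other_h_share_of_h, u5_pct)
```

The recomputed values agree with the percentages quoted in the text to one decimal place.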
In the first comparison, with continental populations, we observed that the POL cohort is homogenous and clustered within the European population (Fig. 11A and 11B). After ancestry prediction using a random forest method, only one sample was placed in the AMR population cluster. In the PCA of European subpopulations, almost all POL samples (938 out of 943) clustered with other European ancestries, with 496 individuals assigned to the GBR, 427 to the CEU, 12 to the TSI, and 3 to the IBS subpopulations. Five samples were closer to non-European populations.
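As a quick consistency check on the subpopulation assignments, the sketch below tallies the per-subpopulation counts quoted above and derives the number of samples falling outside the European clusters. It only restates the reported numbers; the names are illustrative, and it does not reproduce the PCA or the random forest prediction itself.

```python
# Counts of POL samples assigned to each European subpopulation,
# as reported in the text.
COHORT_SIZE = 943
assigned_subpopulation = {"GBR": 496, "CEU": 427, "TSI": 12, "IBS": 3}

# Samples clustering with European ancestries, and the remainder.
european_total = sum(assigned_subpopulation.values())
non_european = COHORT_SIZE - european_total

# Share of the whole cohort assigned to each subpopulation, in percent.
subpopulation_pct = {
    name: round(100 * count / COHORT_SIZE, 1)
    for name, count in assigned_subpopulation.items()
}

print(european_total, non_european, subpopulation_pct)
```

The four counts sum to the 938 European-clustered samples, leaving the five samples described as closer to non-European populations.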
Compared against the world populations, the POL cohort was similar to the European populations at low K values (K = 2 to 5). At K above 5 (favoured by the cross-validation analysis for the world dataset), it formed a distinct cluster, sharing some ancestry with GBR, CEU and FIN.
Although our project does not compare to the world's largest in terms of sample size, it remains one of the largest open allele-frequency datasets generated from high-coverage WGS data and the largest for any Slavic population.
https://naszegenomy.pl/