From the DNA.Land Facebook group, on working with genomes and the algorithm for computing relatedness:
As we work on improving our results, I thought I would provide some information about the algorithmic steps behind relative matching. Please note that this is a work in progress and things are likely to change.
1. Files uploaded to the website are converted to the standard format of build 37/hg19.
2. The files are phased with SHAPEIT and imputed with IMPUTE2, using the samples in the 1000 Genomes phase 1 data.
IMPUTE2 Reference:
http://journals.plos.org/plosgenetics/article?id=10.1371%2Fjournal.pgen.1000529
For imputation accuracy see:
https://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated_SHAPEIT2_9-12-13.html
3. Out of the 39 million imputed SNPs, we select 4 million SNPs. This list is pre-determined and fixed for all samples. The selected SNPs (i) have MAF > 5% across all human populations, (ii) are bi-allelic, and (iii) are not in repeat regions.
4. We use GERMLINE to find IBD matches between the phased haplotypes of pairs of samples using the four million SNPs.
GERMLINE reference:
http://genome.cshlp.org/content/19/2/318.long
[One issue we discovered is that our current GERMLINE parameters are a bit sensitive to imputation errors. This creates breaks in long IBD segments, as described in the excellent post by dnagenealogy. Another issue is that GERMLINE produces false positive matches in specific segments of the genome. Dr. Tris Hayeck is working on that.]
5. The IBD segments are fed to ERSA to estimate the most likely number of meiosis events. ERSA performs hypothesis testing that classifies IBD segments as "ancient" or "recent" based on their length. "Ancient" segments are short IBD segments that segregate in unrelated individuals and tell us nothing about relatedness. Only "recent" segments are scored towards relatedness. [There was some discussion about our use of a very short cutoff (3 cM) in IBD detection. These short segments are (almost) always classified as ancient and do not confound our model. However, please note that the relative finder report shows the total cM, which includes both **ancient and recent** segments.]
ERSA reference:
http://genome.cshlp.org/content/21/5/768.full
6. ERSA has some glitches with very close relatives such as brothers, MZ twins (or duplicated samples), etc. As a final step, we measure Identity-by-State between samples that show high relatedness in ERSA to refine the classification of close relatives.
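As a rough illustration of the Identity-by-State check in step 6, the sketch below counts, for two samples, the fraction of sites at which they share 0, 1, or 2 alleles. The genotype encoding (0/1/2 copies of the alternate allele at bi-allelic SNPs), the function name `ibs_proportions`, and the toy data are assumptions made for illustration; the post does not say how DNA.Land actually computes or thresholds IBS.

```python
from collections import Counter


def ibs_proportions(geno_a, geno_b):
    """Return the fractions of sites that are IBS0, IBS1 and IBS2.

    Assumes each genotype is the count (0, 1 or 2) of alternate alleles
    at a bi-allelic SNP, with None marking a missing call.
    """
    if len(geno_a) != len(geno_b):
        raise ValueError("genotype lists must cover the same SNPs")
    counts = Counter()
    for a, b in zip(geno_a, geno_b):
        if a is None or b is None:
            continue  # skip sites where either sample has no call
        # Alleles shared by state at a bi-allelic site:
        # 2 minus the absolute difference in alternate-allele counts.
        counts[2 - abs(a - b)] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no sites with calls in both samples")
    return {state: counts[state] / total for state in (0, 1, 2)}


# Toy data (hypothetical): a duplicated sample and a parent-child-like pair.
sample = [0, 1, 2, 1, 0, 2, 1, 1]
duplicate = list(sample)
child_like = [1, 1, 1, 2, 0, 1, 0, 1]

print(ibs_proportions(sample, duplicate))
print(ibs_proportions(sample, child_like))
```

Duplicates or MZ twins should be IBS2 at essentially every site, while a parent and child should show essentially no IBS0 sites, apart from genotyping or imputation errors. That kind of signal is what makes an IBS pass useful for separating very close relationships.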
DONE.
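The SNP selection in step 3 can be pictured as a simple filter over variant records. The sketch below only illustrates the three stated criteria. The `Variant` record, its field names, the population labels, and the repeat intervals are hypothetical, and reading "across all human populations" as "in every population" is an assumption. The actual DNA.Land list is pre-determined and fixed, not recomputed per sample.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Variant:
    chrom: str
    pos: int                      # 1-based position (build 37 assumed)
    alleles: Tuple[str, ...]      # reference allele followed by alternates
    maf_by_pop: Dict[str, float]  # minor allele frequency per population


def in_repeat(v: Variant, repeats: List[Tuple[str, int, int]]) -> bool:
    """True if the variant falls inside any (chrom, start, end) interval."""
    return any(c == v.chrom and s <= v.pos <= e for c, s, e in repeats)


def keep_variant(v: Variant, repeats, min_maf: float = 0.05) -> bool:
    """Apply the three criteria from step 3 to one variant."""
    # (i) MAF above the threshold in every population we have data for.
    common = bool(v.maf_by_pop) and all(
        m > min_maf for m in v.maf_by_pop.values()
    )
    # (ii) exactly two alleles: one reference and one alternate.
    bi_allelic = len(v.alleles) == 2
    # (iii) outside annotated repeat regions.
    return common and bi_allelic and not in_repeat(v, repeats)


# Toy inputs (hypothetical values).
repeats = [("1", 1000, 2000)]
variants = [
    Variant("1", 500, ("A", "G"), {"AFR": 0.20, "EUR": 0.12, "EAS": 0.08}),
    Variant("1", 1500, ("C", "T"), {"AFR": 0.25, "EUR": 0.30, "EAS": 0.15}),
    Variant("2", 300, ("G", "A", "T"), {"AFR": 0.10, "EUR": 0.10, "EAS": 0.10}),
    Variant("2", 900, ("T", "C"), {"AFR": 0.30, "EUR": 0.02, "EAS": 0.10}),
]

selected = [v for v in variants if keep_variant(v, repeats)]
print([(v.chrom, v.pos) for v in selected])  # [('1', 500)]
```

Only the first toy variant survives: the second lies in a repeat interval, the third has three alleles, and the fourth is rare in one population.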