Speeding up Ytime
You may well find that locating age confidence intervals using the “Ytime” function is a slow process. There are two things you can do to speed things up.
1)Use “Ytimeboot” rather than “Ytime”. “Ytimeboot” is described in Section 9. Informal evaluations I have carried out suggest that the confidence interval estimates returned by “Ytimeboot” are very similar to those returned by “Ytime”, and take a fraction of the time to estimate
2)Copy the “*.dll” files from the “\dll” folder to its parent directory. This will make Matlab use the compiled *.dll functions rather than the uncompiled *.m functions. However, please note that these *.dll files have been compiled using Matlab Compiler 3.0 for Matlab version 6.5. They may not work or, worse, produce erratic results in other versions. Users are welcome to compile the *.m files themselves if they know how to – please refer to the “makefile.txt” file.
Calculating observed ASD
Estimates of the TMRCA and its confidence interval are obtained via a statistic here called ASD, or “Average Square Distance”. The term “Average Square Distance” was originally coined as statistic measuring genetic distance between populations. Here the term “Distance” is a misnomer, as the statistic is being calculated is an average square difference between a sample of chromosomes and the MRCA, but its use is retained for consistency with other published papers. ASD is defined in this context as:
where the term within the large brackets defines, for a given STR locus i, the average difference (over a sample of n chromosomes) in the STR repeat size between each sampled chromosome () and the STR repeat size of the MRCA (). The complete formula, therefore, gives the average of this average over all STR loci.
You may, if you wish, calculate ASD yourself and input it into the function “Ytime”. Alternatively, there are two ways within the Ytime package to calculate the ASD from raw data that is input into MATLAB. The first takes data from a matrix “M” where each row of “M” represents a different sampled chromosome and each column represents a different STR locus. The values in “M” are the corresponding STR repeat sizes. For example:
M=[
13 25 15 10 11
12 25 16 10 11
13 25 16 10 11
13 24 15 10 11
13 25 15 10 11
];
anc=[13 25 16 10 11];
oASD = calcASD(M,anc)
“anc” defines the STR repeat sizes for the MRCA.
The function “calcASD” returns “oASD”, the observed ASD for this data from the STR haplotype defined for the MRCA.
The second way takes a matrix “M” similar to the one defined above, only now there is one row for each unique haplotype observed in the sample of n chromosomes. An additional vector “Freq” defines the number of times each haplotype is observed in the sample of n chromosomes. For example:
M=[
13 25 15 10 11
12 25 16 10 11
13 25 16 10 11
13 24 15 10 11
];
Freq=[
2
1
1
1
];
anc=[13 25 16 10 11];
oASD = calcASD(M,anc,Freq)
As a convenient way of altering the above to match your data, you can cut and paste data from, say, an Excel file into the above examples and then cut and paste the commands into MATLAB. Be aware, however, that typically Excel with export values separated by tabs, and some versions of MATLAB will only support values separated by regular spaces.
Calculating TMRCA with confidence limits
To find the unbiased TMRCA for a given STR mutation model use:
[T,Tq] = Ytime(q,oASD, n, nloci, SMM, Rgrowth, mua, mub, anc)
An example of how to call ‘YTime’ is given at the end of this section.
The outputs (inside square brackets) are as follows:
T is the unbiased estimate of the TMRCA, in generations
Tq is the series of confidence limit estimates provided for the quantiles specified in “q” below.
The input parameters (inside curved brackets) are as follows:
q is a vector giving the quantiles of the confidence interval to be found. For example, set q=[0.025 0.975] to return equal-tailed 95% confidence limits. Set q=[] if you do not want any confidence limits to be found.
oASD is the observed Average Square Distance between each sampled chromosome and the MRCA STR-haplotype. See function “ASDcalc” below for more details.
n is the number of sampled chromosomes.
nloci is the number of STR loci typed for each chromosome.
SMM specifies the Stepwise Mutation Model. [‘S-SMM’] selects the Simple Stepwise Mutation Model, [‘L-SMM’] selects the Linear Length Dependent Stepwise Mutation Model. Note ‘S-SMM’ and ‘L-SMM’ must be in upper case, and must be enclosed in single quotes.
Rgrowth specifies scaled rate parameter R for the growth model. R=N*r where N is the current effective population size and r is the instantaneous growth rate per generation. Exponential growth is assumed to extend indefinitely back in time. If Rgrowth=0, a no-growth model is specified. If Rgrowth=[‘star’], a star-genealogy model is specified, in which all lineages trace back independently to the MRCA. Note ‘star’ must be in lower case.
mua is the mutation rate per generation when the S-SMM is specified. If a single number is supplied, it is assumed that this is the same mutation rate for all STR loci. Alternatively, if a series of numbers is supplied within square brackets, then this series specifies a different mutation rate for each locus (series must be same length as nloci). If the L-SMM is specified, mua stores the intercept value(s) in the equation mu = mua + mub*L, where L is the repeat size of the STR. Again, mua may either be a single value for all loci, or a series specifying a different value for each locus.
mub need only be entered if the L-SMM has been specified, in which case mub stores the slope value(s) in the equation mu = mua + mub*L as defined above. Again, mub may either be a single value for all loci, or a series specifying a different value for each locus.
anc need only be entered if the L-SMM has been specified, in which case anc stores a series of numbers representing the STR repeat size of the MRCA at each one of the STR loci typed. e.g. anc=[12 15 10], if there are 3 STR loci.
The following is an example of how to set up the input parameters for Ytime. You can amend the following code to suit your needs and cut and paste it into the MATLAB command window.
oASD = 0.0234; %Observed ASD
n = 34; %No. sampled chromosomes
nloci = 3; %No. STR loci typed
SMM = ['S-SMM']; %Specifies STR mutation model
Rgrowth = 0; %Specifies growth model
mua = [0.0018 0.0023 0.0012]; %Specifies mutation rate
q = [0.025 0.975]; %Specifies the confidence limits
[T,Tq] = Ytime(q,oASD, n, nloci, SMM, Rgrowth, mua)
How unbiased estimates of the TMRCA are calculated
Unbiased estimates of T are provided from the following equations:
For S-SMM:
T = oASD / mean(mua);
where mean(mua) is the mean mutation rate over all loci
For L-SMM;
T = oASD / mean(mua + mub*anc)
where mean(mua + mub*anc) is the mean mutation rate over all loci, where STR repeat size values are those specified for the MRCA.
See Slatkin (1995), Genetics 139: 457-62 and Goldstein et al. (1995) , Genetics 139: 463-71 for a derivation of the S-SMM equation, and Calabrese et al. (2001), Genetics 159: 839-852 for a derivation of the L-SMM equation (what they call the “PS/0M” model)
How confidence limits for the TMRCA are found using Ytime
Let S be a statistic S that can be calculated from some data, and let S0 be the observed value of the statistic. The sampling distribution of S depends on some known parameters ? plus an unknown scalar parameter ?. Provided there is a monotonic relationship between S and ?, we can define ?L[?] as the value of ? such that the probability of observing S?S0, given ? and ?, is ?/2. Since all values of ? less than ?L[?] will produce even smaller probabilities of observing S?S0, ?L[?] represents the smallest value of ? that has at least a ?/2 chance of producing data as extreme as S0. A similar argument can be used to define ?U[?], the value of ? such that the probability of observing S?S0, given ? and ?, is ?/2. ?L[?] and ?U[?] define the equal-tailed lower and upper confidence limits for ? at the 1-? level. For example, if ?=0.05 these are the 95% confidence limits.
Ytime finds ?L[?] and ?U[?] directly by finding Monte Carlo sampling distributions for S for a range of values of ? and interpolating the values for which P(S?S0|?,?)=?/2 and P(S?S0|?,?)=?/2. Here, ? is the TMRCA and S is the ASD.
By default, Ytime uses 9 values of ? over a sensible range, performs 1000 Monte Carlo simulations at each point and quadratic interpolation to estimate ?crit. However, it is also possible to specify these options yourself, using the function ‘findt’.
function [tq,x,y] = findt(q, oASD, n, nloci, SMM, Rgrowth, mua, mub, anc, Tqlow, Tqup, bins, MCruns)
%The input parameters (inside curved brackets) are as follows:
%q = vector of quantiles for which tq required. e.g, set q=[0.025 0.975] for equal-tailed 95% confidence limits. Set q=[] for no Tq.
%oASD = observed Average Square Distance between each sampled chromosome and the MRCA STR-haplotype.
%n = number of sampled chromosomes.
%nloci = number of STR loci typed for each chromosome.
%SMM = 'S-SMM' selects Simple Stepwise Mutation Model. SMM='L-SMM' selects the Linear Length Dependent Stepwise Mutation Model.
%Rgrowth = scaled rate parameter R for the growth model (R=N*r). Rgrowth=0 sets no-growth model. Rgrowth='star' sets a star-genealogy model.
%mua = mutation rate (per gen) under S-SMM, = intercept under L_SMM. Specify single value or 'nloci' values.
%mub = intercept under L_SMM. Specify single value or 'nloci' values. Do not need to specifiy under L-SMM.
%anc = row vector giving MRCA haplotype under L-SSM, not used under S-SMM.
%
%The following additional parameters are available:
%Tqlow specifies the lower value of Tq to be simulated in range-finding.
%Tqup specifies the upper value of Tq to be simulated.
%bins specifies the number of Tq values to be simulated (evenly spaced between Tqlow and Tqup).
%MCruns specifies the number of simulations to perform at each value of T (default=1000).
Note that if SMM=‘S-SMM’ is specified, values for mub and anc must still be specified, but these values will be ignored by the program. For example, you could specify mub=[]; and anc=[];
Suitable values for mua and mub under the L-SMM model
Stumpf and Goldstein (2001), Science 291: 1738-42, suggest that Y-STRs follow the following formula: V = –6.62 + 0.62R
Using Kayser et al (2000), AJHG 66: 1580-8, mean mu=0.0028 and mean rpt size (R) = 16.96.
It follows that:
mu = a + bR where
a = (-6.62)*0.0028/(-6.62 + 0.62*16.96)
b = (0.62)*0.0028/(-6.62 + 0.62*16.96)
Thus the following are reasonable estimates for mua and mub.
a = -0.00475867734648
b = 4.456767303347710e-004
Finding bootstrap confidence limits for the TMRCA
The method employed by the ‘Ytime’ function is robust but time consuming. This is because many simulated distribution of S (the sample ASD value) but be generated for each one of a range of values of ? (the TMRCA parameter). A faster route to confidence intervals is by parametric bootstrap using one vale of ? only. The obvious value to use is the unbiased estimate obtained from Section 4 above. The quantiles from this bootstrap distribution are used directly to obtain confidence intervals for S, and these are then used via the equations in Section 4 to generate the CI for ?. The implicit assumption when deriving CI’s by this procedure is that there exists some transformation f(S) which converts the sampling distribution of S into a symmetric distribution of constant shape regardless of the value of ?. For a clear exposition of this point, see BJF Manly (1997) “Randomization, Bootstrap and Monte Carlo Methods in Biology, Second Edition”, CRC Press. It is difficult to prove that this assumption is true, but on the other hand bootstrap CI’s are widely used in biology without bothering to check this.
To find CI’s using this method, use the function YtimeBoot. The input and output parameters fro this function are the same as for Ytime. For example:
oASD = 0.0234; %Observed ASD
n = 34; %No. sampled chromosomes
nloci = 3; %No. STR loci typed
SMM = ['S-SMM']; %Specifies STR mutation model
Rgrowth = 0; %Specifies growth model
mua = [0.0018 0.0023 0.0012]; %Specifies mutation rate
q = [0.025 0.975]; %Specifies the confidence limits
[T,Tq] = Ytimeboot(q,oASD, n, nloci, SMM, Rgrowth, mua)
Troubleshooting
You may find you get error messages regarding the DLL files in this installation. If so, delete all DLL files then relaunch Matlab. The functions will still work, but slower. If you wish, use the “mcc” command to recompile the functions to work on your platform. See Matlab User Guide for more details.
Citing YTime
The YTime package was first described in the following paper, which should be cited when referring to this software.
Behar DM, Thomas MG, Skorecki K, Hammer MF, Bulygina E, Rosengarten D, Jones AL, Held K, Moses V, Goldstein D, Bradman N & Weale ME (2003) Multiple origins of Ashkenazi Levites: Y chromosome evidence for both Near Eastern and European ancestries. American Journal of Human Genetics 73: 768-79