Last updated: 20 Nov 2013
It was probably evident from Chapter 1 that, as a mathematician, I have been looking, so far almost completely in vain, for a rigorous mathematical or statistical model to aid my understanding of the science of genetics and DNA. While possibly in danger of merely re-inventing the wheel, I have tried to develop a simple mathematical and statistical approach to the genealogical aspects that particularly interest me.
The next best thing to a mathematical or statistical model is a graphical tool, and Family Tree DNA's chromosome browser is a beautiful, multicoloured graphical tool. At first, however, it was definitely a hindrance rather than an aid to my understanding. So I have written this page to explain how the light dawned for me.
First, we need a tiny bit of basic mathematics.
The mathematical relation R is said to be transitive if XRY and YRZ imply XRZ.
For example, equals (=) is a transitive relation, since X=Y and Y=Z imply X=Z.
Other simple mathematical examples of transitive relations are is greater than (>), is less than (<) and is a subset of.
A mathematical example of a relation which is not transitive is is not equal to. For example, 1 is not equal to 2 and 2 is not equal to 1, but 1 is equal to 1.
More generally - verbally - is identical to is a transitive relation, since X is identical to Y and Y is identical to Z imply that X is identical to Z.
In genealogy, is related to is not a transitive relation, since X is related to Y and Y is related to Z do not necessarily imply that X is related to Z. X could be related to Y on Y's paternal side and Y related to Z on Y's maternal side, in which case X is not related to Z. Or X, Y and Z could have a common ancestor (more likely, a common ancestral couple, depending on whether they are full-cousins or half-cousins) in which case X is related to Z. Or Y could have some more remote ancestral couple for whom one spouse is related to X and the other spouse is related to Z in which case again X is not related to Z. I think that takes care of all the possibilities. Some genealogists like to talk about the concept of is connected to, which is a transitive relation. (If X is related to Y and Y is related to Z, then we say that X is connected to Z, regardless of whether X is related to Z.)
Other simple genealogical examples of transitive relations are is an ancestor of and is a descendant of.
As an aside, and speaking of ancestors versus ancestral couples, it is surprising how much more common the former phrase is than the latter in what I have read about DNA. If two cousins find that they share a significant segment of autosomal DNA (see below), then they can, with a bit of co-operation, work out which is the most recent ancestral couple from whom they have both inherited that segment. At that stage, it is generally equally likely that the segment was inherited from the husband in that couple as from the wife. (Political correctness probably means that I should call them male partner and female partner rather than husband and wife!) Eventually, a more distant cousin sharing the same segment may show up and reveal whether the segment came from the husband or from the wife in the most recent ancestral couple of the first two cousins. This only pushes the conundrum back another generation or more, to the most recent ancestral couple shared by all three cousins.
The advantage of mathematics over the English language is its lack of ambiguity.
Is matches a transitive relation? It depends on the sense in which the word is used.
In many uses of the word matches, if X matches Y and Y matches Z, then X matches Z, and so matches is (sometimes) a transitive relation.
For example, if matches is used in the sense of is identical to, then we have already seen that it is a transitive relation.
However, Family Tree DNA uses the word matches in the sense of is related to, and we have already seen that then matches is clearly NOT a transitive relation (even without the added complication that in this case the relationship is probable rather than known).
Further confusion can arise from the multiple uses of the verb match, with different connotations, by Family Tree DNA and its users:
To users of familytreedna.com, matches can also sometimes mean matches on a particular segment in the chromosome browser.
Here's an example of what the chromosome browser looks like:
Someone I know through genealogy sent me this example, based on her mother's FTDNA kit. Let's call her Terry. There are lots more similar examples on the ISOGG Wiki.
Terry and her mother have both tested with FTDNA and therefore are, of course, FTDNA matches.
Terry's mother and I are also FTDNA matches. As discussed on the FTDNA facebook page, we match on 14 segments. Our number of shared segments, longest block shared (8.93) and total shared cM (38.24) combine to bring us in above FTDNA's threshold for matches.
Terry and I are not FTDNA matches. For reasons that I will come to later, we cannot see how much, or which segments, of the DNA that her mother and I share Terry has inherited from her mother, but we obviously expect it to be about half. What matters for now is that it brings Terry and I in below FTDNA's threshold for matches.
The blue segments in the image above are the 14 segments of 1cM or more on which Terry's mother and I match; the original default image showed only the 1 segment of 5cM or more on which we match, but there is a dropdown menu which can be used to reduce the threshold and display the smaller segments.
The orange segments in the image are those on which Terry and her mother match: pretty much everywhere.
For my first couple of days looking at these pretty pictures in the chromosome browser, I made the false assumption that this meaning of matches was a transitive relation, but this example shows that it clearly isn't. Terry matches her mother and her mother matches me, both in general and in the blue segments in the chromosome browser, but Terry doesn't match me in either sense.
The next clue that I had misunderstood something was the fact that Terry's DNA and her mother's DNA seem to match in 100% of locations, but we expect to find that Terry inherited only 50% of her DNA from her mother (one chromosome in each pair), and the other 50% (the other chromosome in each pair) from her father. The parent-child example in the ISOGG wiki looks just the same as Terry's example, so I knew there had to be a rational explanation.
I turned to Google in search of this explanation, and eventually found a reference to half-identical regions. Thinking that I might be on the right track, I googled that phrase, which brought me to Lesson 9 of the Beginners Guide To Genetic Genealogy on the Wheaton Surname Resources website, at which stage a light-bulb finally went off in my head!
My initial confusion comes from the fact that there are 22 pairs of chromosomes, but the chromosome browser appears to show only 22 single chromosomes.
A chromosome, as shown in the chromosome browser, is essentially an array of pairs of the four letters A, C, G and T. Suppose we are comparing person X and person Y. At each locus, X has a pair and Y has another pair.
As one letter from each pair comes from the father and the other from the mother, one might expect these to be ordered pairs (of which there are 16 possible values, namely any one of ACGT with any one of ACGT: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG and TT). While I may have misunderstood this point, it appears that the parental source of the letters can not be observed, so we can only observe unordered pairs (of which there are ten possible values: AA, CC, GG, TT, AC, AG, AT, CG, CT and GT).
Consider the comparison between X's DNA and Y's DNA on a particular chromosome. At a particular location on the chromosome, X's unordered pair in fact matches Y's unordered pair if at least one letter is common to both pairs. In other words, for example, AC does not match GT, but AC matches any other unordered pair. Or AA matches AC, AG and AT, but not any other unordered pair. Pairs which match in this sense are called half-identical.
If the two pairs are chosen at random, there is quite a high probability that they match. A quarter of the time, the first pair will comprise two identical letters, e.g. AA; 7 of the 16 possible values of the second pair will match this (AA, AC, AG, AT, CA, GA, TA). (While we cannot observe the ordered pairs, we must use them to calculate the probabilities.) Three quarters of the time, the first pair will comprise two different letters, e.g. AC. 12 of the 16 possible values of the second pair will match this (all except GT, TG, GG and TT). So the probability that two randomly chosen pairs match is 0.25*(7/16)+0.75*(12/16)=43/64=0.671875.
Things could get confusing here, as we are comparing two people, each of whom has a pair of letters at every point on the chromosome. Remember that pair refers to the two letters, not to the two people.
If we look at 10 consecutive locations on the chromosome, and assume that the (unobservable) ordered pairs for the two people at each location are independently chosen randomly from a uniform distribution, what is the probability that the two people are half-identical at all 10 locations? The answer is 0.671875 to the power of 10, which is only around 1.9%, compared to around 67.2% for a single location. For 20 consecutive locations, the probability drops to roughly 0.04%. For 50 consecutive locations, it is of the order of 10-9. For 100 consecutive locations, it is of the order of 10-18 and it quickly becomes vanishingly small. A sequence of locations where the pairs are half-identical is known as a half-identical region.
In practice, consecutive pairs are not independent, but are inherited in chunks from both parents. So we observe many more half-identical regions in practice than pure chance would suggest. These half-identical regions don't arise by chance, but by inheritance. In the unlikely event that they arise by chance, they are said to be identical by state (IBS); in the more likely event (at least for longer regions) that they arise by inheritance, they are called identical by descent (IBD).
Long half-identical regions are described at familytreedna.com as shared segments. Their lengths are measured in the strange variable unit known as the centiMorgan, which I don't really understand. It is derived from the "expected number of chromosomal crossovers" and the number of base-pairs to which it corresponds varies widely across the genome because different regions of a chromosome have different propensities towards crossover. These expectations and propensities presumably come from experimental data and change as more data is collected, so that the definition of centiMorgan may also change. The number of pairs per centiMorgan varies both from chromosome to chromosome and within chromosomes. As can be calculated from the table below, a centiMorgan in one part of chromosome 3 can be under 800,000 pairs, but a centiMorgan in one part of chromosome 11 can be over 6,000,000 pairs. More precisely, a shared segment in FTDNA is a half-identical region one centiMorgan or more in length.
CHROMOSOME | START LOCATION | END LOCATION | LENGTH | CENTIMORGANS |
1 | 44805958 | 47175419 | 2369461 | 1.08 |
2 | 106254302 | 116973471 | 10719169 | 8.09 |
2 | 157113214 | 159347591 | 2234377 | 2.44 |
3 | 11537627 | 12600665 | 1063038 | 1.41 |
4 | 165504024 | 167423895 | 1919871 | 2.29 |
6 | 29267608 | 31571470 | 2303862 | 1.53 |
11 | 46718718 | 56273717 | 9554999 | 1.54 |
11 | 103382220 | 105990699 | 2608479 | 2.38 |
17 | 36593956 | 38838321 | 2244365 | 1.92 |
Now back to transitivity:
Several words and phrases suggest themselves to describe the two relations shown in the coloured regions in the chromosome browser:
The words and phrases which spring to mind include:
For the first relation (that between the reference person and the person represented by one of the colours) let's stick to the last of these to make it completely unambiguous what we mean.
The first and most important thing to note is that, just like - and for exactly the same reasons as - is related to, is half-identical with is NOT a transitive relation. The person represented by the orange segments may not be half-identical with the person represented by the blue segments, even if the segments overlap.
Suppose X and Y share a half-identical region on, say, chromosome 11, and Y and Z share a half-identical region starting at the same location on chromosome 11. It does not follow that X and Z share a half-identical region here. For example, X's first pair could be AC, Y's first pair could be CG, and Z's first pair could be GT (which is not half-identical with AC). The same could be the case at many other locations within the region.
In practice, this means that Y inherited this region of his (or her) paternal chromosome from an ancestor shared with X but inherited the corresponding region of his maternal chromosome from an ancestor shared with Z (or vice versa, maternal sharted with X and paternal shared with Y).
When Y looks at X and Z together in the chromosome browser, there will be an overlap of coloured regions in the relevant part of chromosome 11.
When X looks at Y and Z together in the chromosome browser, there will be just one coloured region in the relevant part of chromosome 11.
And when Z looks at X and Y together in the chromosome browser, there will be just one coloured region in the relevant part of chromosome 11.
This assumes that each of the three individuals matches both of the other two overall; otherwise FamilyTreeDNA.com does not allow them to do the comparisons in the chromosome browser.
If X, Y and Z in this example want to research effectively, it appears that they will have to share their FamilyTreeDNA.com passwords, so that each can compare the other two in the chromosome browser. In Chapter 1, I have already pointed out other circumstances in which sharing passwords seems to be necessary for effective and productive research. It would be nice if there were two levels of access to kits - read-only guest access to allow this sort of chromosome browsing; and full write access to allow changing of passwords, editing of GEDCOMs, ancestral surnames, known relationships, etc., in much the same way as online family trees published using systems such as TNG and even the much-maligned ancestry.com allow different levels of access to different people.
On the FTDNA website, there are lots of routes from the Matches page to the Chromosome Browser page and to some of the data which the chromosome browser displays. (I haven't yet figured out how to download the full ACGT sequences which would surely help my understanding greatly.)
For example, find the match you want to compare with on the Matches page. Click the tiny dropdown just below his or her mugshot (or missing mugshot icon). The Longest Block figure is immediately revealed. Click the "Compare in Chromosome Browser" link. (Repeat for up to five individuals.) Click the large blue "compare" arrow. Now the number of Shared Segments between you and each selected match is revealed along with the lovely colour diagram.
I first found the following more circuitous alternative route: click Family Finder, Chromosome Browser, Filter Matches By ..., Name, [type name, don't hit <Enter> key], Find, checkbox. Not yet having found the quick route to the Longest Block figure, I thought I then had to View this data in a table and scan the centiMorgans column for the largest value. The centiMorgans data is shown to two decimal places in the table, but for some strange reason trailing zeroes are omitted and the numbers are centred instead of aligning on the decimal point, which doesn't make it any easier to find the maximum by eyeballing the column.
Now it is time for a discussion of an important flaw in Family Tree DNA's policy:
I can not use the Chromosome Browser to compare my DNA with that of someone who is not one of my matches, not even with someone like Terry, who is the daughter of a match, and even uses the same e-mail address for her own kit and for her mother's kit. Likewise, I can not use the Chromosome Browser to compare my DNA with that of a known relative who has tested but has not shown up as a match.
I don't see why any two consenting adult customers of Family Tree DNA should not be allowed to compare their autosomal DNA in the chromosome browser, but the company thinks differently, citing unexplained "compliance with our matching and privacy policies" when I raised this question on the company's facebook page.
My Shared cM with the children is between 49% and 73% of my Shared cM with the mother. The four siblings somehow have slightly different lists of ancestral surnames - probably because FamilyTreeDNA hasn't thought of allowing, or forcing, siblings who share an e-mail address (and even those who don't) to also automatically share their list of ancestral surnames. My understanding is that the genetic process is an example of what statisticians call a Markov process: whether or not any or all of the four children have inherited this block of DNA from their mother adds nothing further to what can be inferred about my relation to the mother from the overlap between my DNA and hers.
Until Terry and I can compare ourselves directly in the chromosome browser, we cannot rule out possibility 3. If we could compare ourselves directly, then we might find that we have smaller half-identical regions within the half-identical regions that I share with her mother.
While I have no known relative who has tested, perhaps Terry has and we can do some sort of triangulation to distinguish between possibilities 1 and 2 and thus see on which side I match her mother.
By way of background, Terry and I know that her mother and I both have McNamara ancestors who lived in adjoining townlands. Hers were apparently there in the Tithe Applotment Book of 1827. Mine are blow-ins, having arrived from a townland about 26km away only with my greatgrandparents' marriage in 1876. We each have a cousin named Michael McNamara still living in the original homesteads in the adjoining townlands, so maybe we'll have to persuade them to be tested (Y-DNA and autosomal DNA)! But as Terry's mother and my grandmother come from the Loop Head peninsula and all of their known ancestors lived there, the common ancestor from whom the shared segments (or some of them) derive might not be a McNamara at all.
Anyone who has read Oscar Wilde's The Importance of Being Earnest would probably expect the play to be responsible for the first hit in a Google search for Worthing abandoned baby, but that is not the case for me right now. (As with any Google search, your mileage may vary!) I am waiting for this match's permission before I fully explain the relevance of the Wildean allusion here ...
My own online family tree is at http://pwaldron.info/tng/index.php but to see it you will have to Register for a New TNG User Account.
Comments about this page can be left on facebook.