Last updated: 1 January 2024
Lengths in centiMorgans (cMs) and, to a lesser extent, SNP (single nucleotide polymorphism) counts have been used as the main proxies for closeness of relationship when ranking and filtering autosomal DNA matches. Beyond close cousins, these proxies are only weakly correlated with the true relationship (see 23andMe graph). For this and other reasons, existing methods clearly produce many false positives and probably also produce many false negatives by dismissing genuine relatives. Children are very often shown as matching people who don't match either of their parents, which is most unlikely, or shown as closer matches to third parties than their parents are, which is equally improbable.
An improved understanding of the limitations of the current methods will make it clear that additional criteria, based on statistics easily calculated from existing DNA databases, are required in order to improve the methods used to search for matches, thereby finding more of the relatives in the database, and dismissing more of the non-relatives.
As alternatives or supplements to the currently used units of length, I propose using two new measures which I will define below as rarity and relevance.
I will conclude by noting the many advantages of using the proposed new measures in conjunction with the current flawed and inadequate measures.
In A Beginner's Adventures in Genetic Genealogy, I have attempted to give a comprehensive account of everything that I have learned about the subject, starting from the very basic definitions and principles, and working right up towards cutting-edge science. I have moved some of the more important and surprising results of my research to the first part of this document in an effort to provide more concise motivation for the new concepts introduced in the second part.
In this document, I assume a knowledge of the basics and concentrate on the major flaws in autosomal DNA-matching as currently practiced. I will continue to play devil's advocate and to pick holes in the methodologies being used by the main DNA companies, by the third-party websites and by many genetic genealogists.
I will argue here that we need a great improvement in the current statistical analysis of DNA data before we can reliably trace small half-identical regions of DNA to a common ancestor more than a handful of generations back.
Users of autosomal DNA-matching services quickly realise that perhaps the biggest problem in genetic genealogy is that lists of matches include many false positives and exclude many genuine relatives (false negatives).
A child's list of matches (assuming that the child does not have any distant descendants in the database) should not in general include anyone who is not also on the list of matches of at least one of the child's parents. Every one of the child's true matches must be related to one (or possibly both) of the parents, and, assuming that the child does not have any descendants at all in the database, must be genealogically related more closely to the parent than to the child and genetically related at least as closely to the parent as to the child.
Anyone lucky enough to have DNA samples from both parents will realise as soon as their data has been uploaded to GEDmatch.com that GEDmatch.com's matching algorithm does not satisfy this basic accuracy check. Those, like myself, unfortunate enough to have lost either or both parents before consumer DNA testing became available may take a little longer to learn this important lesson. If you have not already learned this lesson, then the experiment that I am about to describe will teach you the need for scepticism.
There are few published statistics on the frequency of obvious false positives such as I have just described. As I know of several families in which one or more children and both parents have provided DNA samples for analysis and in which all of the relevant raw data has been copied from FamilyTreeDNA.com to GEDmatch.com, I decided to compile my own statistics using GEDmatch's top-1500 lists for three of these families. The following table shows the results:
Child | Neither | Father only | Mother only | Both | Total |
---|---|---|---|---|---|
Robert | 876 | 231 | 336 | 28 | 1471 |
Maureen | 867 | 238 | 365 | 30 | 1500 |
Gregory | 878 | 247 | 349 | 26 | 1500 |
James D | 784 | 280 | 400 | 36 | 1500 |
John R Jr. | 851 | 272 | 345 | 32 | 1500 |
Aileen | 870 | 298 | 308 | 24 | 1500 |
Terry | 955 | 226 | 288 | 31 | 1500 |
The results in the table are valid as of the various dates when I carried out the analysis; results will vary as the database grows. Furthermore, the experiments in this document were carried out in late 2013 and early 2014, since when GEDmatch.com may have changed some of its default settings.
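The tabulation above can be reproduced with simple set operations once the three match lists have been exported. The following is only a minimal sketch: the function name, the input format (plain lists of kit numbers) and the kit numbers themselves are hypothetical, and it assumes that the parents' own kits have already been removed from the child's list.

```python
def classify_matches(child, father, mother):
    """Count a child's matches by which parent(s) they also match.

    Each argument is an iterable of kit identifiers taken from one
    match list (hypothetical input format, not a GEDmatch export spec).
    """
    father, mother = set(father), set(mother)
    counts = {"Neither": 0, "Father only": 0, "Mother only": 0, "Both": 0}
    for kit in set(child):
        in_father, in_mother = kit in father, kit in mother
        if in_father and in_mother:
            counts["Both"] += 1
        elif in_father:
            counts["Father only"] += 1
        elif in_mother:
            counts["Mother only"] += 1
        else:
            counts["Neither"] += 1
    return counts


# Made-up kit numbers, for illustration only.
child = ["K1", "K2", "K3", "K4", "K5"]
father = ["K1", "K2", "K9"]
mother = ["K2", "K3"]
summary = classify_matches(child, father, mother)
print(summary)
```

The four counts always add up to the length of the child's list, which is how the "Total" column in the table is formed.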
The first five people in the table are a group of siblings whom I know of only because they appear near the top of my lists of matches; the last two are personal friends whom I knew before I sent my DNA for analysis.
The surprising fact that jumps out from this table is that in all cases the majority of the child's matches match neither of the parents. While in principle all those related to the children should be related to at least one of the parents, a few explanations for the fact that the "Neither" column is not a column of zeroes suggest themselves:
On Terry's list, for example, I had to go no further than the 13th match by Gen (not counting her parents' kits from any of the DNA companies) in order to find the first kit in the "Neither" column. This is kit number F214942.
This problem is not specific to GEDmatch, although it is probably exacerbated by the fact that GEDmatch shows more matches for every user than FamilyTreeDNA does. For example, in the case of Aileen in the table above, at least 101 of her first 462 FTDNA-overall-matches did not match either her mother or her father. 234 matched one parent and 127 matched the other and some of these matched both, so that there were more than 101 who matched neither.
The sample of three families on which the table is based is probably too small to draw any firm conclusions as to why all seven children share more matches with their mothers than with their fathers.
I am not alone in querying the reasons for large numbers of false positives on match lists or querying apparent relationships to the children of parents to whom one is not related. I hope that my explanations are more convincing than some of those provided in discussions in various Facebook groups.
To understand why matching algorithms throw up so many false matches, we need to delve a little deeper into the data. First we need to avoid the possibility of confusion by recalling some basic definitions.
The vocabulary used to describe pieces of autosomal DNA, to describe the letters representing them and to describe comparisons between them is often ambiguous, and therefore confusing. I will confine myself to a minimal vocabulary here and will define it as precisely as I can.
The most ambiguous word of all is "match" and I will endeavour to make it as clear as possible whenever I use it in what sense I am using it.
I will use the word segment to describe a piece of a single chromosome, represented by a string of letters, e.g. ACTGCAGACGACTA. I will use the term identical segments to describe corresponding segments of two different chromosomes represented by the same string of letters. By corresponding I mean having the same chromosome number, the same start location and the same end location.
I will use the word region to describe the corresponding pieces of two chromosomes, represented by an array of unordered pairs of letters, e.g. AA AT AC CC CG CT CA CC GG TT AC AT (assuming that only unphased data can be observed).
Other writers may use these words with slightly different meanings, but my interpretations appear to be the majority view.
(I am still not certain whether a segment is to a region as an allele is to a genotype and/or as a haplotype is to a diplotype. Genotype means the two alleles (or letters) at a single location and diplotype means the two haplotypes (or strings of letters) at a sequence of locations. It is not clear from anything that I have read whether genotype means ordered pair or unordered pair and it is equally unclear whether diplotype means sequence of ordered pairs or sequence of unordered pairs. Consequently, I will avoid using these terms. Similarly, I will avoid using the term block, which also seems to be used ambiguously.)
Lists of matches are based on what I will call half-identical regions.
As of 28 Apr 2014, Google returned 24 hits for "half-identical segment", no hits for "half-identical block", 42 hits for "half-identical region" and 20 hits for "half-identical DNA". Within this tiny sample, the word "region" unambiguously wins the popular vote. Furthermore, there is a three-letter acronym (TLA) for it, namely HIR, widely attributed to Leon Kull. GEDmatch.com unfortunately uses the term "matching segment" to describe a sequence of "Base Pairs with Half Match".
Corresponding regions of two individuals' DNA are said to be half-identical to each other if the corresponding unordered pairs within the region each share at least one letter.
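This definition translates directly into a test on unordered pairs of letters. The sketch below assumes that each person's region is given as a list of two-letter strings over the same SNPs; the function names are hypothetical and missing data is not handled.

```python
def half_identical_at(pair_a, pair_b):
    """True if two unordered letter pairs share at least one letter."""
    return bool(set(pair_a) & set(pair_b))


def half_identical_region(region_a, region_b):
    """True if every corresponding pair in two regions shares a letter."""
    if len(region_a) != len(region_b):
        raise ValueError("regions must cover the same SNPs")
    return all(half_identical_at(a, b) for a, b in zip(region_a, region_b))


print(half_identical_at("AG", "GG"))  # True: both contain G
print(half_identical_at("AA", "GG"))  # False: opposite homozygous
print(half_identical_region(["AA", "AG", "CT"], ["AG", "GG", "TT"]))  # True
```

Note that a heterozygous pair such as AG shares a letter with every pair that can be observed at a biallelic A/G SNP, so only homozygous pairs can ever break a half-identical region there.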
For example, at a biallelic SNP, where either A or G can be observed on each chromosome, the unordered pairs which can be observed are AA, AG and GG:
A half-identical region must be built up of a sequence of one or more overlapping subregions where one or more of the following is true:
I will use the term overlapping segments to describe the segments which make up a half-identical region where there are not identical segments running the full length of the region.
I think that the term overlapping segments is much clearer than either FTDNA's phrase compound segments or Ann Turner's phrase pseudo-segment.
In the definition of compound segment formerly given in FamilyTreeDNA.com's FAQ, the example looks at just one paternal segment and just one maternal segment.
Ann Turner's "Identity Crisis" article in the Journal of Genetic Genealogy (Volume 7, Fall, 2011) looks at several cases where a half-identical region is made up of overlapping segments, but it considers only the possibility of two or at most three overlapping segments. She describes a region with several alternating identical paternal and maternal segments as "identical by state" or a "pseudo-segment"; a region including a long segment from one parent in the middle, with a short segment from the other parent at either or both ends, as having "a fuzzy boundary"; and a region comprising one paternal segment and one similarly sized maternal segment as a "compound segment".
Regions made up of overlapping segments are merely half-identical by chance, whatever the number or the length of the overlapping segments making up the region.
It seems to be a generally accepted belief that the longer a half-identical region, the more likely it is that it contains matching segments of its full length, and the less likely it is to be made up of overlapping segments. I would like to see some statistical evidence to confirm this.
It can be assumed, and frequently is assumed, that the longer a half-identical region, on any length scale, the longer the longest pair of identical segments it contains.
It can be equally validly assumed, but rarely is considered, that the longer a half-identical region, the greater the number of individual overlapping segments it contains.
While it is probably still valid to rank matches by the length of half-identical regions, there are several reasons suggesting that the most recent common ancestor is typically much further back than implied by the pure cM length of the half-identical region:
Each overlapping segment represents a paternal/paternal, paternal/maternal, maternal/paternal or maternal/maternal match. Overlapping segments which all involve the same parent of one of the parties can appear to be inherited. For example, on a region comprising a paternal/maternal, paternal/paternal, paternal/maternal sequence of identical segments, the first person's father must also be half-identical to the second person, and there is up to 50% probability that the first person's child will also be half-identical to the second person. Thus, an individual (on one side of a genealogical brick wall, for example) can be half-identical by chance to a whole dynasty (on the other side of the genealogical brick wall). The possibility that a half-identical region is composed of overlapping segments can be safely dismissed only if two or more known relatives from one side of the brick wall match two or more known relatives from the other side; furthermore, the known relatives must not be doubly related. For example, two siblings who are full-identical to each other on a particular region can both be half-identical to another pair of siblings who are full-identical to each other on the same region, but purely because there is a sequence of overlapping short paternal/paternal, paternal/maternal, maternal/paternal and maternal/maternal matches covering the region.
If both parents of one of the two individuals being compared come from the same endogamous population, then half-identical regions are more likely to be made up of overlapping segments. Ashkenazi Jews are typically cited as a population in which this is a major problem in determining relationships and identifying common ancestors.
If, as in the examples above, samples are taken from a child and both of the child's parents and if the same matching algorithm returns a fourth person as a match to the child but not to either parent, then the child and the fourth person are unambiguously half-identical purely by chance on the regions where they are deemed to be half-identical. As I will demonstrate in an example below, there may still be a smaller subregion within the half-identical by chance region where a less strict matching algorithm might correctly identify a more distant relationship between one of the parents and the fourth person.
The black box methodology used by the operators of the online DNA databases to compile match lists can be broken down into two components:
The ranking algorithms used are generally based on the cM length of the longest single half-identical region and the aggregate cM length of all half-identical regions. In the case of GEDmatch, these are combined in its default ranking algorithm, along with X-DNA observations, into a "Gen" estimate, denoting the estimated number of generations to the most recent common ancestor. The "Gen" estimate can, quite bizarrely, be reduced by reducing the cM and/or SNP thresholds for inclusion in the calculations.
GEDmatch.com allows the user to select from multiple cut-off criteria, in the sense that the one-to-many comparison can display a different set of top 1500 matches depending on which cut-off criteria are chosen.
FamilyTreeDNA.com has a fixed set of cut-off criteria and gives the user access only to the fixed set of matches meeting those criteria (shared cM > 20 and longest block > 7 cM and longest block > 500 SNPs) [subject to confirmation].
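A fixed set of cut-off criteria of this kind amounts to a simple filter applied to each pairwise comparison. The sketch below uses the unconfirmed thresholds quoted above as default parameters; the class, the function and the example figures are hypothetical and are not taken from any company's software.

```python
from dataclasses import dataclass


@dataclass
class MatchSummary:
    shared_cm: float    # aggregate cM of all half-identical regions
    longest_cm: float   # cM length of the longest half-identical region
    longest_snps: int   # SNP count of the longest half-identical region


def passes_cutoff(match, min_shared_cm=20.0, min_longest_cm=7.0,
                  min_longest_snps=500):
    """Apply all three thresholds; defaults are the unconfirmed values above."""
    return (match.shared_cm > min_shared_cm
            and match.longest_cm > min_longest_cm
            and match.longest_snps > min_longest_snps)


# Illustrative figures only.
print(passes_cutoff(MatchSummary(25.0, 9.1, 687)))  # True
print(passes_cutoff(MatchSummary(25.0, 6.4, 487)))  # False: longest region too short
```

Making the thresholds parameters, as GEDmatch.com does for its users, shows how the set of reported matches changes with the chosen cut-offs.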
The rarity and relevance criteria proposed below for ranking half-identical regions can be used to improve the quite primitive ranking algorithms currently used. They will push false positives and overlapping segments down the list, and will push genuine relatives up the list or onto the list.
While ranking criteria can be improved, the cut-off criteria remain somewhat arbitrary. The objectives are:
Furthermore, the same cut-off criteria will produce introductions to genetic matches that provoke different emotional responses in adoptees and descendants of migrants than in those who have grown up in an established and stable community. Adoptees and the descendants of migrants will probably be much more excited to meet a third cousin than those who have always lived in the same community as their third cousins.
The autosomal DNA of the two parents gets shuffled randomly like a deck of cards during the process of reproduction. To continue the card-playing metaphor, the task of working out which parent and which more remote ancestors each piece of DNA came from is akin to that facing a card counter playing blackjack in a casino.
cM lengths give rise to reasonable estimates of the number of generations to the most recent common ancestor when comparisons are made between two randomly selected individuals.
These estimates become biased in many cases:
A slightly more sophisticated way of investigating and identifying false matches involves looking not just at lists of names, but also at the cM lengths of the half-identical regions which a parent and a child share with a third party.
I will illustrate this with one detailed example, the first such investigation of a false positive which I carried out. We will see that the half-identical region that I was investigating was built up from no fewer than six overlapping segments.
In the early days of studying my own DNA results, I discovered that it is possible for me to share a half-identical region of autosomal DNA with someone which is far bigger than the longest half-identical region which I share with either of her parents. I soon realised that half-identical regions can be made up of overlapping segments, some paternal and some maternal, and that such regions are of far less genealogical significance than other half-identical regions of the same length, as measured in cMs or SNPs or even in base pairs.
This example will also illustrate that even when there is strong evidence of a genealogical relationship and some evidence of a genetic relationship, the cM length of a half-identical region may greatly overstate the closeness of the genetic relationship.
I matched both Terry and her mother (from the table above) on their GEDmatch.com top-1500 lists. Terry's mother appears in my FTDNA match list, but Terry herself does not. Comparing name lists alone would therefore not reveal that my match on GEDmatch.com with Terry is a false genetic match.
There is lots of genealogical and geographical evidence that Terry's mother and I might be related. We both have McNamaras in our family tree, and they have lived in adjoining townlands since the 1870s. We may have another common ancestor elsewhere on the Loop Head peninsula in county Clare in Ireland, where my grandmother and Terry's mother were both born.
While Terry is obviously expected to have inherited half of the DNA which I share with her mother, FTDNA's matching algorithm does not deem Terry and myself to be matches, so we had to wait a few weeks until GEDmatch.com could upload and process our raw data before making direct comparisons. I then had seven kits with which to compare my own: Terry's from all three companies and each of her parents' from two of the companies. This table shows the results, with "Minimum segment cM size to be included in total" set to 1cM and the default values for the other parameters:
Paddy v. | Company | Largest region | Chromosome | Total of regions > 1 cM | Estimated number of generations to MRCA |
---|---|---|---|---|---|
Terry | FTDNA | 9.1 | 18 | 83.0 | 4.6 |
Terry | 23andMe | 7.3 | 18 | 87.8 | 5.7 |
Terry | ancestry | 9.1 | 18 | 87.9 | 4.6 |
mother | FTDNA | 11.1 | 1 | 71.0 | 3.8 |
mother | 23andMe | 11.1 | 1 | 66.5 | 3.9 |
father | FTDNA | 4.6 | 9 | 58.1 | 6.0 |
father | 23andMe | 4.6 | 9 | 52.4 | 6.0 |
The most interesting number in this table will turn out to be the region of 9.1cM on chromosome 18 between locations 6,480,913 and 8,377,195 on which Terry and I are half-identical in both her FTDNA results (687 SNPs) and AncestryDNA results (668 SNPs). The 23andMe data show a shorter half-identical region of just 7.3cM (552 SNPs), beginning at a different start location but with the same end location. We compared our FTDNA raw data in the longer region.
The start location 6,480,913 is the first of 137 consecutive SNPs where either Terry and I are half-identical or there is missing data for one or other of us. At the 138th SNP, at location 6,830,063, Paddy is GG and Terry is AA, so we are not half-identical. Immediately after this location, we share another half-identical region of 571 SNPs. There are 13 missing observations (--) for Paddy and 3 for Terry within the 137+571=708 otherwise half-identical locations. For some reason, GEDmatch.com seems to have ignored one blip (the difference at location 6,830,063) in a run of 709 SNPs where we are otherwise half-identical and reported a single half-identical region.
Half-identical regions are intervals bounded by SNPs which are either opposite homozygous or polyallelic.
The start location 6,480,913 is the first half-identical SNP within the region (where Paddy is TC and Terry is CC), so the next largest location below the start must be either opposite homozygous or polyallelic.
The end location 8,377,195 is the first non-matching SNP outside the half-identical region (where Paddy is AA and Terry is GG). In this case it is opposite homozygous, but an end location can also be polyallelic.
In other words, the start and end locations reported by GEDmatch describe half-open intervals: the start point is included in the half-identical region, but the end point is not included.
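The scan of the raw data described above can be sketched as follows. This is only an illustration under stated assumptions: missing observations (--) are treated as compatible, each run is reported as a half-open interval from its first SNP to the first non-matching SNP, and no attempt is made to reproduce GEDmatch.com's apparent tolerance of an isolated blip. The function names and the toy data are hypothetical.

```python
MISSING = "--"


def compatible(g1, g2):
    """True if either observation is missing or the pairs share a letter."""
    if g1 == MISSING or g2 == MISSING:
        return True
    return bool(set(g1) & set(g2))


def half_identical_runs(positions, geno_a, geno_b):
    """Return (start, end, snp_count) for each run of compatible SNPs.

    start is the position of the first SNP in the run; end is the position
    of the first non-matching SNP after it (None if the run reaches the end
    of the data), so each run is the half-open interval [start, end).
    """
    runs = []
    run_start, count = None, 0
    for pos, g1, g2 in zip(positions, geno_a, geno_b):
        if compatible(g1, g2):
            if run_start is None:
                run_start, count = pos, 0
            count += 1
        elif run_start is not None:
            runs.append((run_start, pos, count))
            run_start = None
    if run_start is not None:
        runs.append((run_start, None, count))
    return runs


# Toy data: six SNPs with one opposite-homozygous break at position 400.
positions = [100, 200, 300, 400, 500, 600]
paddy = ["TC", "AG", "--", "GG", "AA", "CT"]
terry = ["CC", "GG", "AA", "AA", "AG", "TT"]
runs = half_identical_runs(positions, paddy, terry)
print(runs)  # [(100, 400, 3), (500, None, 2)]
```

The SNP counts returned here include SNPs with missing data, which is one reason why counts from such a scan can differ from those reported by a matching service.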
Furthermore, GEDmatch's SNP counts (687 from FTDNA or 668 from Ancestry) are both smaller than the 692 (137+571-13-3) half-identical SNPs observed in the raw data. We will see shortly that the Ancestry discrepancy is because Ancestry observes slightly fewer SNPs than FTDNA; the slight FTDNA discrepancy remains a puzzle.
Terry is naturally automatically half-identical to both of her parents on this region. The really interesting observation is that I have no half-identical region with either parent on chromosome 18 above the minimum reported length of 1cM with the default GEDmatch "SNP count minimum threshold to be considered a matching segment and included in the total" (500 SNPs). The 9.1cM half-identical region must be made up of a sequence of shorter overlapping segments on which I am alternately half-identical to Terry's maternal and paternal chromosomes. Looking at the raw data confirms that there are overlapping subregions of 120 SNPs, 74 SNPs, 86 SNPs, 119 SNPs, 70 SNPs and 443 SNPs where I am alternately half-identical to Terry's father and Terry's mother. The last subregion is the start of a region running from location 7,070,135 to location 8,470,128 of length 6.4cM and 487 SNPs where I am half-identical to Terry's mother. This region is not identified by the default GEDmatch.com settings as it is 13 SNPs too short to be counted. However, given the evidence from other chromosomes, from ancestral surnames and from ancestral places, I am quite satisfied that the final run of 443 SNPs where I am half-identical to both Terry and her mother comes from a common ancestor. GEDmatch.com removed its centiMorgan calculator before I could calculate the cM length of this false negative.
GEDmatch.com's black-box matching algorithm is apparently not aware of the known relationship between Terry and her mother. If it were, then it could have highlighted the significance of this probable half-identical by descent region, over and above much longer half-identical by chance regions.
In the case of parent/child relationships, the matching algorithms do not have to rely on the assumed known relationships claimed by those who submit samples or data. Two DNA kits are half-identical along their full length if and only if one is a parent and the other a child. It is less easy to determine which is the parent and which is the child. This requires comparing the two kits with third parties and looking for a pattern of one kit (the parent) averaging twice the length of half-identical regions with the third party as the other (the child). Any worthwhile matching algorithm should be doing this. This will involve an additional computing load, but I firmly believe that the associated costs would be justified.
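The comparison with third parties might be sketched as below. It assumes that the two kits have already been found to be half-identical along their full length, and that the total cM shared with each third party is available for both kits. The function name, the input format and the figures are hypothetical. Only third parties appearing on both lists are used, since the child will also have matches inherited through the other parent.

```python
def likely_parent(shared_a, shared_b):
    """Guess which of two parent/child kits is the parent.

    shared_a and shared_b map third-party kit ids to the total cM of
    half-identical regions shared with kit A and kit B respectively.
    Returns (label, ratio), where label is "A" or "B" and ratio is the
    larger total divided by the smaller, or (None, None) if undecidable.
    """
    common = [kit for kit in shared_a if kit in shared_b]
    total_a = sum(shared_a[kit] for kit in common)
    total_b = sum(shared_b[kit] for kit in common)
    if not common or total_a == total_b or min(total_a, total_b) <= 0:
        return None, None
    if total_a > total_b:
        return "A", total_a / total_b
    return "B", total_b / total_a


# Made-up totals in cM for three third parties per kit.
kit_a = {"X": 40.0, "Y": 22.0, "Z": 10.0}
kit_b = {"X": 20.0, "Y": 12.0, "W": 30.0}
guess, ratio = likely_parent(kit_a, kit_b)
print(guess, ratio)  # A 1.9375
```

A ratio close to two would be consistent with the pattern described above; a ratio close to one would suggest that the third parties used are too few or too distant to decide.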
While Terry's mother and I may have inherited a matching segment on chromosome 1, as suggested by the table above, from some common ancestor, the half-identical region of 9.1cM between Terry and myself on chromosome 18 is unambiguously a false positive. In other words, this region is merely half-identical by chance.
The moral of this example is probably that the first step in analysing a possible relationship to someone whose parents' DNA has been sampled or is available for sampling is to confirm that the half-identical regions are also half-identical with at least one of the parents.
Naturally, I am now inclined to treat half-identical regions of 9.1cM or less with anyone with whom I don't have a known relationship, and the possibility that they might be inherited in their entirety from a common ancestor, with extreme scepticism.
The ISOGG website provides a table estimating the probability that half-identical regions of various lengths are "IBD" (identical by descent). Other measures are needed to distinguish which 20% (say) of half-identical regions of length 6cM are IBD and which 80% are not:
The new rarity and relevance measures proposed below are other measures which will help to divide half-identical regions of similar cM length into those which are half-identical by descent and those which are half-identical by chance.
In the light of the above basic experiments, a radical re-think is clearly called for.
I spent many months cogitating on the concept of regions such as that analysed in the above example which are half-identical by chance. It was clear to me that many others in the genetic genealogy community were also thinking about and struggling with this problem, but most of them describe the problematic regions by the meaningless TLA IBS or the equally meaningless and terribly confusing phrase identical by state.
Regions of autosomal DNA are currently measured in a number of different ways:
Chr | FamilyTreeDNA Start | FamilyTreeDNA End | FamilyTreeDNA cM | FamilyTreeDNA SNPs | 23andMe Start | 23andMe End | 23andMe cM | 23andMe SNPs | Ancestry Start | Ancestry End | Ancestry cM | Ancestry SNPs |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 742429 | 247169190 | 281.5 | 58160 | 72017 | 247174776 | 281.5 | 75091 | 72017 | 247169190 | 281.5 | 55621 |
2 | 8674 | 242683192 | 263.7 | 56794 | 8674 | 242683192 | 263.7 | 75447 | 8674 | 242669396 | 263.7 | 54495 |
3 | 36495 | 199310226 | 224.2 | 46522 | 36495 | 199310226 | 224.2 | 61654 | 36495 | 199310226 | 224.2 | 44536 |
4 | 61566 | 191140682 | 214.5 | 39897 | 61566 | 191173722 | 214.5 | 53790 | 61566 | 191117403 | 214.4 | 38115 |
5 | 91139 | 180623543 | 209.3 | 41375 | 91139 | 180625733 | 209.3 | 54814 | 91139 | 180623543 | 209.3 | 39731 |
6 | 148878 | 170732528 | 194.1 | 47406 | 100815 | 170746898 | 194.1 | 61638 | 148878 | 170732309 | 194.1 | 44852 |
7 | 140018 | 158811958 | 187.0 | 37462 | 140018 | 158811958 | 187.0 | 49736 | 140018 | 158811958 | 187.0 | 35577 |
8 | 154984 | 146255887 | 169.2 | 36371 | 154984 | 146255887 | 169.2 | 48070 | 154984 | 146255887 | 169.2 | 34700 |
9 | 36587 | 140147760 | 167.2 | 32259 | 36587 | 140147760 | 167.2 | 42004 | 36587 | 140145149 | 167.2 | 30921 |
10 | 88087 | 135297961 | 174.1 | 38411 | 88087 | 135297961 | 174.1 | 49245 | 88087 | 135297961 | 174.1 | 36787 |
11 | 188510 | 134436845 | 161.1 | 36047 | 188510 | 134439273 | 161.1 | 46775 | 188510 | 134436845 | 161.1 | 34437 |
12 | 61880 | 132276195 | 176.0 | 34982 | 61880 | 132281048 | 176.0 | 45962 | 61880 | 132276195 | 176.0 | 33375 |
13 | 17956717 | 114108295 | 131.9 | 27388 | 17956717 | 114109121 | 131.9 | 35293 | 17956717 | 114106015 | 131.9 | 26196 |
14 | 18325726 | 106345097 | 125.2 | 23042 | 18397823 | 106353025 | 125.2 | 30075 | 18397823 | 106345097 | 125.2 | 22034 |
15 | 18331687 | 100215359 | 132.4 | 21421 | 18331687 | 100215583 | 132.4 | 27775 | 18331687 | 100214895 | 132.4 | 20542 |
16 | 28165 | 88668978 | 133.8 | 22456 | 28165 | 88676480 | 133.8 | 29340 | 28165 | 88668978 | 133.8 | 21409 |
17 | 8547 | 78637198 | 137.3 | 19928 | 12344 | 78644402 | 137.3 | 25951 | 12344 | 78639702 | 137.3 | 19013 |
18 | 3034 | 76112951 | 129.5 | 21391 | 3034 | 76112951 | 129.5 | 27396 | 3034 | 76102273 | 129.5 | 20523 |
19 | 211912 | 63776118 | 111.1 | 14763 | 211912 | 63779291 | 111.1 | 18009 | 211912 | 63776118 | 111.1 | 13871 |
20 | 11244 | 62374274 | 114.8 | 18197 | 11244 | 62376958 | 114.8 | 23343 | 11244 | 62374274 | 114.8 | 17428 |
21 | 9849404 | 46909175 | 70.1 | 10124 | 9849404 | 46909417 | 70.1 | 13096 | 9849404 | 46909175 | 70.1 | 9711 |
22 | 14494244 | 49528625 | 79.1 | 10331 | 14494244 | 49528625 | 79.1 | 13624 | 14494244 | 49528625 | 79.1 | 9727 |
Total | | | 3587.1 | 694727 | | | 3587.1 | 908128 | | | 3587.0 | 663601 |
These first three units of measurement are pure, universal, abstract measures of length, and do not vary from individual to individual, although the first two can vary for one individual from organisation to organisation.
For the purposes of matching algorithms, something much better and more individual-specific than mere length is required.
Recognising pattern matches is something that we all do frequently in everyday life. Examples from the realm of genealogy and elsewhere include deciphering bad handwriting, searching for evidence of plagiarism or breach of copyright, attributing authorship for a written work or work of art, predicting the outcomes of elections or sporting events, etc. In any of these spheres, the search for patterns instinctively begins by looking for unusual aspects of the two items being compared and checking whether they share these unusual aspects. This seems an equally good starting point for the search for autosomal DNA matches. After all, it is a pretty close description of how Y-DNA matches and mtDNA matches are sought.
I will now propose two new individual-specific units of measurement, namely the rarity of an individual's DNA observed in a region (represented by a single number for each individual and region) and relevance of a half-identical region shared by two individuals (represented by two numbers for each region, one for each individual).
To summarise the above discussion: DNA matches are generally ranked either by the length in cMs of the longest half-identical region or by the aggregate length in cMs of all half-identical regions meeting certain arbitrary criteria. The two rankings are generally very different, even allowing for differences in these arbitrary criteria between websites or changes to the criteria on a single website. It appears that the search is still on for the best ranking algorithm.
I eventually began to wonder if people whose names I might encounter on my long and growing lists of DNA matches were likely to be any more closely related to me than people I might encounter walking down any street in the U.S.A. where most DNA customers live (probably) or any street in Ireland where I live myself (possibly) or any street in counties Mayo, Clare, Kerry, Limerick or Roscommon, where my ancestors farmed for many generations (probably not).
After almost a year, I had a light bulb moment and realised that what makes genetic genealogy so difficult is that the same units (cM, SNP, bp) are used to measure three very different things:
These are three very different concepts and three different units of measurement are surely required to adequately describe them.
Rarity is the easier concept to explain, but relevance will prove the most useful in finding relatives.
By the rarity of a region of one's own DNA, I mean an estimate of the probability that a randomly selected individual from a DNA database has exactly the same DNA in that region, in other words the probability that one's own DNA is full-identical to the randomly selected individual's DNA in that region.
Similarly, by the relevance of a half-identical region of autosomal DNA shared by two individuals, I mean separate estimates of the probability that, first, one's own DNA and, second, the other individual's DNA, are half-identical to the randomly selected individual's DNA in that region.
All the DNA companies and the meta-sites such as GEDmatch.com have large databases of raw autosomal DNA data. They can easily (in terms of computing load on their servers relative to the load imposed by multiple one-on-one comparisons) produce frequency distributions giving, for each SNP i, the proportion of the user base (ignoring misreads) which has each of the 10 possible unordered pairs of letters at that SNP (AA, CC, GG, TT, AC, AG, AT, CG, CT, GT). Let us denote these proportions p(i,UV), where i denotes the location of the SNP and U and V denote the letters.
The probability that a randomly selected individual from the database has a particular sequence of pairs of letters in a particular region, i.e. the rarity of that sequence in that region, can then be estimated by multiplying the relevant proportions p(i,UV) together. For regions of any appreciable length, these estimated probabilities will be very small, and the standard statistical approach in these cases is to express the probabilities on a logarithmic scale. This has the nice property that the logarithm of the product of the probabilities is just the sum of the logarithms of the probabilities. The logarithms will all be negative (or zero) so it is their magnitude or distance from zero which measures rarity.
The observed population proportions can thus be stored in logarithmic form and added, generating a much smaller computing load than storing them in probability form and multiplying them.
The observed population proportions can be re-calculated on any chosen frequency as more individuals are added to the database in question - daily, weekly, or even just once annually.
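The calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four-person database is invented, '--' is an assumed code for a no-call, and the function names (genotype, log_proportions, log_rarity) are hypothetical rather than taken from any existing tool.

```python
import math
from collections import Counter

def genotype(call):
    """Normalise a two-letter call to an unordered pair, e.g. 'CA' -> 'AC'."""
    return "".join(sorted(call))

def log_proportions(database):
    """For each SNP i, return {pair: log p(i, pair)} over the whole database.

    `database` is a list of individuals, each a list of two-letter calls.
    '--' (an assumed no-call code) is ignored when counting.
    """
    tables = []
    for i in range(len(database[0])):
        counts = Counter(genotype(p[i]) for p in database if p[i] != "--")
        total = sum(counts.values())
        tables.append({pair: math.log(c / total) for pair, c in counts.items()})
    return tables

def log_rarity(person, tables, start, end):
    """Sum log p(i, UV) over the called SNPs in [start, end).

    Returns the sum and the number of SNPs actually used.
    """
    total, used = 0.0, 0
    for i in range(start, end):
        if person[i] == "--":
            continue
        total += tables[i][genotype(person[i])]
        used += 1
    return total, used

# Invented toy database: four individuals, three SNPs.
db = [
    ["AA", "CT", "GG"],
    ["AC", "CC", "GG"],
    ["CC", "CT", "GG"],
    ["CA", "TT", "GG"],
]
tables = log_proportions(db)
total, used = log_rarity(db[0], tables, 0, 3)
print(total, used)
```

For the first individual the three proportions are 1/4, 1/2 and 1, so the sum of the logarithms equals the logarithm of 1/8. The third SNP, where nobody varies, contributes nothing.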
I first proposed estimating the relevance of a region from my own perspective merely by counting the number of SNPs within the region at which I myself am homozygous. I even implemented this in a Microsoft Excel spreadsheet, but it is tediously slow on my top-of-the-range 2012 laptop and clearly requires far more computationally efficient software than Microsoft Excel. I soon realised that the holders of large DNA databases can provide much better measures of relevance than this crude measure.
Two individuals can be half-identical on a region, but the relevance for one individual can appear very different from the relevance for the other. If the region corresponds to a run of heterozygosity for either individual, then it has absolutely no genealogical significance. If the region corresponds to runs of homozygosity for both individuals, then it is clear that not only do the two individuals share identical segments, but also all four of their parents do (unless there has been a crossover within the region when either of the individuals was conceived). If the region corresponds to a run of heterozygosity for one individual and a run of homozygosity for the other, then its relevance will appear very different from the two different perspectives. While one can get an initial handle on the relevance of half-identical regions from one's own perspective by counting homozygous and heterozygous SNPs in one's own raw data, there is unfortunately currently no way of establishing the corresponding relevance for those appearing on one's match lists without access to their raw data also.
While tracing a single common ancestral source of half-identical regions which are runs of homozygosity for both parties is difficult (since, as we have just seen, there are typically two common ancestral sources for each party), half-identical regions which are runs of heterozygosity for both parties present a different problem. In the latter, the regions are half-identical precisely because they are runs of heterozygosity; there is no evidence whatsoever of common ancestry from such regions.
Recognising that it is not the total length in SNPs of half-identical regions that is informative, but the number of mutually homozygous SNPs in each region, let us return to the example above where I compared myself to Terry and her parents. The proportion of mutually homozygous SNPs between Terry and myself varies dramatically from one overlapping segment to another, running from 36.1% up to 63.4% as shown in this table:
Subregion | Total SNPs | Mutually homozygous SNPs | % mutually homozygous |
1 | 120 | 57 | 47.5% |
2 | 74 | 40 | 54.1% |
3 | 86 | 39 | 45.3% |
4 | 119 | 43 | 36.1% |
5 | 70 | 40 | 57.1% |
6 | 443 | 281 | 63.4% |
It may be significant that the longest subregion is also the subregion with the highest proportion of mutually homozygous SNPs and thus the one relatively least likely to be half-identical by chance.
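The counting behind a table like this one can be sketched as follows. The two short lists of calls are invented for illustration, '--' is an assumed no-call code, and the function names are hypothetical. The last lines simply recompute the percentages from the totals in the table.

```python
def is_homozygous(call):
    """True for a called pair with two identical letters, e.g. 'AA'."""
    return len(call) == 2 and call[0] == call[1] and call[0] in "ACGT"

def mutually_homozygous(a, b):
    """Return (SNPs compared, SNPs where both individuals are homozygous).

    `a` and `b` are aligned lists of calls over one half-identical region.
    SNPs with a no-call for either individual are left out of both counts.
    """
    compared = both_hom = 0
    for x, y in zip(a, b):
        if "-" in x or "-" in y:
            continue
        compared += 1
        if is_homozygous(x) and is_homozygous(y):
            both_hom += 1
    return compared, both_hom

# Invented calls for two individuals over five SNPs.
me = ["AA", "AC", "GG", "--", "TT"]
other = ["AA", "CC", "GG", "GG", "CT"]
compared, both_hom = mutually_homozygous(me, other)

# Recompute the percentage column from the (total, mutually homozygous) counts.
table = [(120, 57), (74, 40), (86, 39), (119, 43), (70, 40), (443, 281)]
percentages = [round(100 * hom / total, 1) for total, hom in table]
print(compared, both_hom, percentages)
```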
In September 2014, DNA Rockstar Genealogist Roberta Estes encouraged people to drop the SNP threshold at GEDmatch.com to 100 SNPs. She made no attempt to discuss how few or how many heterozygous and homozygous and even no-call SNPs those being compared may have within these tiny 100-SNP regions. Within a few days, Roberta, advised by Ann Turner, who in turn was advised by John Olson, had to eat humble pie. Ann Turner's advice was:
These results, finding “what appear to be contemporary matches for the Anzick child”, seemed very counter-intuitive to me, so I asked John Olson of GEDMatch to look under the hood a bit more. It turns out the ancient DNA sequence has many no-calls, which are treated as universal matches for segment analysis. Another factor which should be examined is whether some of the matching alleles are simply the variants with the highest frequency in all populations. If so, that would also lead to spurious matching segments. It may not be appropriate to apply tools developed for genetic genealogy to ancient DNA sequences like this without a more thorough examination of the underlying data.
If rockstar genealogists can get things as badly wrong as this, then I probably shouldn't feel so bad about being confused and possibly making my own mistakes.
I strongly recommend using David Pike's calculators to give you a feel for your own runs of homozygosity and runs of heterozygosity.
The failure of genetic genealogists to date to distinguish properly between length, rarity and relevance is just another symptom of the poor public understanding of probability. Genetic genealogists are not alone in this. For example, in the run-up to the Scottish independence referendum on 18 September 2014, countless journalists in Scotland and elsewhere insisted up to the eve of polling that it was `too close to call', mainly because just one single opinion poll from the numerous polls carried out had predicted a Yes vote. The betting market had no hesitation in calling the result, and called it correctly. On the eve of polling, it estimated the probability of rejection at 80% and over 99% of the substantial betting turnover had put the probability of rejection at over two thirds. One wonders why journalists don't put their money where their mouths are, but perhaps they remember the Bush v. Kerry U.S. presidential election of 2004, which the betting markets thought was equally cut and dried - in favour of Kerry, who lost. The problem may be that many journalists do not appreciate that the forecast share of the vote and the probability of winning are not the same thing. They are both quantities usually measured in percentage points, in the same way as apples and oranges are both fruits and the cM length of the longest half-identical region and the canonical degrees of consanguinity are both measures of relatedness.
The simplest statistical model in most text books is that of the coin toss. Depending on local coin design, this may be described as `heads or tails' (sterling and dollar coins and those of many other currencies all have a head of a head of state on one side) or `heads or harps' (Irish coins all have a harp on one side) or something else (many euro coins have neither head nor harp!).
An unbiased coin has 50% probability of landing heads-up and 50% probability of landing tails-up. Some flaw or asymmetry may cause a coin to be biased, so that the probability p of landing heads-up is different from 50%. In an extreme case, a faulty coin might have a head on both sides, so that the probability of landing heads-up is 100%.
In the same way that the true relationship between two individuals in a DNA database cannot be directly observed from the DNA, the true probability p of landing heads-up cannot be directly observed. Both can only be estimated.
Where p can in theory simply take on any value in the continuous range from 0% to 100%, the range of possible relationships between the two individuals is not continuous and the true relationship is more difficult to model and to estimate.
The long-established statistical techniques used to estimate p for the coin toss cannot be directly applied to estimating relationships from DNA data, but they provide starting points for the much more sophisticated models which must be developed as genetic genealogy evolves.
Given observations from a series of experiments (say 100 coin tosses) and the assumptions of a model (say p=48%), the likelihood or probability of the observed data given the model can be calculated. The best model (known as the maximum likelihood estimate) is the one which assigns the highest probability to the observed data. In simple cases like the coin toss with continuous parameter ranges, the maximum likelihood estimate can be easily found using calculus.
The probability of a particular run of results in a sequence of 100 coin tosses is very small, so it is usually expressed on a logarithmic scale to make it easier to interpret, and is known as the log likelihood.
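As a minimal sketch of the coin-toss case, the following Python computes the log likelihood of one particular sequence of 100 tosses and finds the maximum likelihood estimate. The figure of 48 heads is invented, and the grid search over whole percentages is only a stand-in for the calculus, which gives p = heads/tosses directly.

```python
import math

def log_likelihood(p, heads, tosses):
    """Log probability of one particular sequence with `heads` heads,
    assuming independent tosses that each land heads-up with probability p."""
    tails = tosses - heads
    return heads * math.log(p) + tails * math.log(1 - p)

heads, tosses = 48, 100  # invented observations

# Calculus: setting the derivative of the log likelihood to zero gives this value.
p_hat = heads / tosses

# Check by brute force over candidate models p = 1%, 2%, ..., 99%.
candidates = [i / 100 for i in range(1, 100)]
best = max(candidates, key=lambda p: log_likelihood(p, heads, tosses))

print(p_hat, best)
print(log_likelihood(best, heads, tosses), log_likelihood(0.5, heads, tosses))
```

The model p=48% assigns a higher log likelihood to these data than the unbiased model p=50%, although the difference is small for only 100 tosses.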
Some of these simple principles can be carried over to the genetic genealogy problem.
The parallels with the coin toss example break down when we come to the point of using calculus (which deals with variables like probability measured on a continuous scale) to find the maximum likelihood genealogical relationship (a variable which takes on distinct and discrete values like first cousin or third cousin once removed).
As well as the three standard units of length for regions of DNA, we need a unit of rarity for one individual's regions of DNA and a unit of relevance for two individuals' half-identical regions of DNA, measured from the different perspectives of both. I propose again using log likelihood for both cases.
The initial model in the genetic genealogy case is that the probability of observing a particular pair of letters at a particular location is just the relative or percentage frequency of that pair of letters at that location in the entire relevant DNA database.
The rarity of a region of one's DNA is the probability of observing the same DNA (i.e., the same unordered pairs of letters) by selecting each unordered pair randomly from this sample distribution, assuming that each unordered pair of letters is chosen independently of those around it. In practice, of course, adjacent pairs of letters are not generated independently, but inherited in long sequences. However, independence provides an obvious and useful benchmark.
The probability of observing the data given the model, calculated in this way, can then be expressed on a log likelihood scale.
On regions where there is no variation within the population, all the population proportions will be 1, and the corresponding log likelihoods will be zero (log 1 = 0).
A SNP where 5% of the letters observed are A and 95% of the letters observed are C might be expected to have 0.25% AA, 9.5% AC and 90.25% CC. However, this assumes a certain degree of independence. It could be that A is recessive and C is dominant, and that AA is incompatible with life or reduces life expectancy, in which case the proportion of AA in the database will be much less than 0.25%. Thus, the calculations should be based on the frequencies of unordered pairs of letters rather than on the frequencies of individual letters. This will unfortunately increase the size of the lookup tables necessary to do the calculations and the costs and complexity of doing the calculations.
Assuming for the purposes of this example that the observed population frequencies actually are 0.25%, 9.5% and 90.25%, a half-identical region including this location will be of far greater relevance for an individual who has the very rare AA at this location than for an individual with the very common CC or the universally matched AC.
Thus the rarity of the same region of DNA will be different for different individuals.
More generally, if the proportion of one letter in the population at a biallelic SNP is q and the proportion of the other letter is 1-q and these are independent, then the proportions who are homozygous with each letter will be q^2 and (1-q)^2 respectively and the proportion who are heterozygous will be 2q(1-q).
If the letters in the paternal and maternal chromosomes are independent at a particular biallelic location, then at most 50% of the population will be heterozygous at that location (if the two letters are equally likely). If each letter occurs more than 1/3 of the time, then heterozygosity will be more common than either form of homozygosity. If one letter (e.g. A) occurs less than 1/3 of the time, then that type of homozygosity (AA) will be less common than heterozygosity (e.g. AC), which in turn will be less common than the other type of homozygosity (CC).
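These proportions are easy to check numerically. The sketch below assumes independence of the paternal and maternal letters, as in the text. The function name and the test frequencies of 40% and 20% are chosen purely for illustration.

```python
def genotype_proportions(q):
    """Expected proportions (AA, AC, CC) at a biallelic SNP where letter A has
    frequency q, assuming the two letters are inherited independently."""
    return q * q, 2 * q * (1 - q), (1 - q) * (1 - q)

# The 5% / 95% example: roughly 0.25% AA, 9.5% AC and 90.25% CC.
aa, ac, cc = genotype_proportions(0.05)
print(aa, ac, cc)

# Both letters above 1/3: heterozygosity is the most common outcome.
aa40, ac40, cc40 = genotype_proportions(0.4)

# One letter below 1/3: AA is rarer than AC, which is rarer than CC.
aa20, ac20, cc20 = genotype_proportions(0.2)

# Heterozygosity never exceeds 50%, reached when both letters are equally likely.
max_het = max(genotype_proportions(i / 100)[1] for i in range(101))
print(max_het)
```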
The magnitude of the log likelihood will clearly grow with the number of SNPs observed in the region. To get a pure measure of rarity, it can be divided by the number of SNPs observed. The length and rarity are probably best presented jointly in cMs and SNPs (as at present) and in log likelihood per SNP observed.
Now consider individual W looking for a match with individual Z in a particular region. For simplicity, assume that all SNPs within the region are biallelic. The extension to polyallelic SNPs is straightforward.
At a SNP where W is heterozygous (e.g. AC), the probability that individual Z is half-identical is 1 (as, in our example, AA, AC and CC are all half-identical to AC).
At a SNP at location i where W is homozygous (e.g. AA), the probability that individual Z is half-identical is 1-p(i,CC) (since W is half-identical to everyone except those who are homozygous with the opposite letter).
The relevance from W's perspective of the half-identical region is just the product of these probabilities 1-p(i,VV) over W's homozygous SNPs within the region, where V denotes the opposite letter at the relevant location.
Individual heterozygous SNPs contribute 1 to the product of the probabilities or equivalently contribute 0 to the sum of the logarithms of the probabilities, so do not influence the relevance of the half-identical region. Indeed, the same can be said of runs of heterozygosity of any length.
This assumes independence of the letters at nearby locations, which is again a reasonable benchmark.
Similarly, the relevance from Z's perspective of the half-identical region is just the product of the relevant probabilities over Z's homozygous SNPs within the region.
Note that the relevance can be very different from the two different individuals' perspectives. Both relevances must be displayed alongside each other, and alongside the conventional length measures and the rarity. A region which is highly relevant for both sides merits the closest analysis.
Again the probabilities will be very small and will have to be expressed on the log likelihood scale.
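A minimal sketch of the relevance calculation from each individual's perspective might look like this in Python. The two-SNP region, the population proportions and the function name log_relevance are all invented for illustration. The sketch assumes biallelic SNPs, independence between locations, and that no-calls (written with '-') are simply skipped.

```python
import math

def log_relevance(person, snps):
    """Sum of log(1 - p(i, VV)) over the person's homozygous SNPs in a region.

    `person` is a list of two-letter calls. `snps` is a parallel list of
    (alleles, proportions) pairs, e.g. ('AC', {'AA': .., 'AC': .., 'CC': ..}).
    Assumes p(i, VV) < 1 for the opposite letter V.
    """
    total = 0.0
    for call, (alleles, props) in zip(person, snps):
        if "-" in call or call[0] != call[1]:
            continue  # no-call skipped; heterozygous SNP contributes log 1 = 0
        opposite = alleles.replace(call[0], "")  # the other letter at this SNP
        total += math.log(1.0 - props.get(opposite * 2, 0.0))
    return total

# Invented population proportions for a two-SNP region.
snps = [
    ("AC", {"AA": 0.0025, "AC": 0.095, "CC": 0.9025}),
    ("GT", {"GG": 0.25, "GT": 0.5, "TT": 0.25}),
]
w = ["AA", "GT"]  # W carries the rare AA at the first SNP
z = ["AC", "GG"]  # Z is half-identical to W at both SNPs

rel_w = log_relevance(w, snps)
rel_z = log_relevance(z, snps)
print(rel_w, rel_z)
```

Here W's log likelihood is log(1 - 0.9025) and Z's is log(1 - 0.25), so the same half-identical region is far more relevant from W's perspective than from Z's.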
I do not yet see any obvious way of combining the two measures from the two individuals' perspectives - adding them would probably be misleading, as it suggests that the two probabilities are independent, which they clearly are not.
As noted in the example above, it is not clear when one observes a length in SNPs for a half-identical region at GEDmatch.com, particularly a half-identical region where one individual's data comes from FTDNA and the other's from 23andMe, whether the number of SNPs is the number where a pair of letters is observed for both individuals, the number where a pair of letters is observed for at least one individual, or the number where a pair of letters has been sought for at least one individual but perhaps misread.
In calculating rarity and relevance, no-calls, whether caused by read errors or at SNPs used by one company and not by the other company, must just be omitted from the log likelihood calculation.
Paradoxically, in endogamous populations, where long runs of homozygosity are more common, the absolute values of the relevance measure will be higher than in less endogamous populations, but the relative values of the relevance measure can still be used to better identify more likely relatives in the database.
I am not yet in a position to provide numerical examples of my proposed measures, as I have neither the population database, the fast and efficient software nor the servers required. However, I am more than willing to provide my services on suitable terms to anyone interested in developing a prototype. In the meantime, I welcome offers of raw data to build a test database and recommendations for fast and efficient software to do the necessary calculations.
I already have access to raw data for about nine kits, but they are hardly a representative sample as almost all are either my own known relatives or my close matches based on current matching algorithms.
Rarity and relevance calculators will not be very different from centiMorgan calculators.
I presume that a centiMorgan calculator assigns a single weight to each location and calculates the centiMorgan length of a region by summing these weights over all locations in the region.
A rarity calculator will assign an array of ten weights to each location; at biallelic SNPs, these will comprise three log likelihoods (e.g. the logs of the probabilities of observing AA, AC and CC) and seven missing value codes (for the pairs that are not observed in the population). It will calculate the rarity of a region for an individual by summing the relevant log likelihoods over all the locations in the region.
A relevance calculator will assign an array of four weights to each location; at biallelic SNPs, these will comprise two log likelihoods (e.g. the logs of the probabilities of not observing AA or CC) and two zeroes (the logs of the probabilities of not observing the homozygous pairs that are not observed in the population). It will calculate the relevance of a region for an individual by summing the relevant log likelihoods over all the individual's homozygous locations in the region.
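The two lookup arrays described above could be laid out as follows. This is a sketch only: the ordering of the pairs, the use of None as the missing value code and the function names are my own assumptions for illustration, and the population proportions are the invented ones from the earlier 5%/95% example.

```python
import math

PAIRS = ["AA", "CC", "GG", "TT", "AC", "AG", "AT", "CG", "CT", "GT"]
HOMOZYGOUS = ["AA", "CC", "GG", "TT"]

def rarity_weights(props):
    """Ten weights for one location: log p(i, UV) for each observed pair,
    None (an assumed missing value code) for pairs never observed."""
    return [math.log(props[pair]) if props.get(pair, 0) > 0 else None
            for pair in PAIRS]

def relevance_weights(props):
    """Four weights for one location: log(1 - p(i, VV)) for each homozygous
    pair VV, which is zero for pairs never observed in the population."""
    return [math.log(1.0 - props.get(pair, 0.0)) for pair in HOMOZYGOUS]

# Invented proportions at one biallelic A/C location.
props = {"AA": 0.0025, "AC": 0.095, "CC": 0.9025}
rarity = rarity_weights(props)
relevance = relevance_weights(props)

# An individual who is AA here looks up the weight for the opposite pair, CC.
weight_for_aa_individual = relevance[HOMOZYGOUS.index("CC")]
print(rarity, relevance, weight_for_aa_individual)
```

A calculator would then sum the looked-up weights over the locations in a region, exactly as a centiMorgan calculator sums its single weight per location.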
For the purposes of constructing an initial prototype it might be adequate to assume that the letters in maternal and paternal chromosomes are independent.
So how can we persuade the DNA websites to implement this?
Or have the rarity and relevance measures been tried already and shown on average to have no appreciable difference from the length measure?
Given that both FTDNA and GEDmatch are currently struggling to fix their badly broken GEDCOM displays, I suspect that anything innovative will be a hard sell at the moment.
I expect technological progress to move genetic genealogy forward in two directions:
By far the easiest and cheapest way today to establish the genome of a deceased ancestor is probably still exhumation. This is still very expensive, and relies on a number of assumptions:
The sophisticated statistical models and software necessary to avoid the need for exhumation are not yet readily available and are deserving of further research and development efforts.
Sometimes it is easy to infer an ancestral couple's unordered pair of letters at a particular location precisely. For example, if two full siblings are opposite homozygous at a location (say AA and CC) then both parents must be heterozygous (AC in this case) at this location. More generally, it is impossible to distinguish between a man's autosomal DNA and a woman's autosomal DNA by looking at the DNA of their shared descendants; this requires DNA from half-siblings of the children, siblings of the parents, cousins of the parents, or descendants of any of these. DNA samples from earlier generations of the family would also be very helpful, but are unlikely to be available today. This will not be the case for future generations, who will have the benefit of DNA samples and data stored with reputable companies from the earliest days of genetic genealogy. (Reputable companies sadly do not include ancestry.com, which deliberately destroyed the DNA samples of many of its customers, and customers of companies which it had acquired, in the autumn of 2014.)
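The sibling example can be checked by brute force. The sketch below lists every couple whose genotypes at one biallelic location are compatible with the observed children, assuming no mutations and no misreads. The function names are hypothetical. Couples are returned as unordered pairs, reflecting the point that the children's DNA alone cannot say which genotype belongs to which parent.

```python
from itertools import combinations_with_replacement, product

def child_genotypes(parent1, parent2):
    """All unordered pairs a child can inherit: one letter from each parent."""
    return {"".join(sorted(a + b)) for a, b in product(parent1, parent2)}

def consistent_parents(children, alleles="AC"):
    """Unordered couples of parental genotypes compatible with every child."""
    genotypes = ["".join(g) for g in combinations_with_replacement(alleles, 2)]
    couples = combinations_with_replacement(genotypes, 2)
    return [couple for couple in couples
            if all(child in child_genotypes(*couple) for child in children)]

# Two full siblings who are opposite homozygous: only one couple fits.
opposite = consistent_parents(["AA", "CC"])

# Siblings AA and AC leave two possible couples, so the inference is not precise.
ambiguous = consistent_parents(["AA", "AC"])
print(opposite, ambiguous)
```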
As I wrote this, I realised that the last two people whose funerals I attended had been influenced by their respective daughters to provide DNA samples for genealogical analysis during what turned out to be their final illnesses. Both were cremated.
The models and software developed to estimate missing genomes must allow from the beginning for the possible availability of data from both previous and subsequent generations.
When calculating relevance in comparing oneself to a stranger, one is effectively comparing one's own known observed DNA to the population proportions.
When calculating relevance in comparing an ancestor whose DNA is not directly observable to a stranger, one must rely on the estimated distribution of the ancestor's DNA instead of one's own observed DNA. At some locations, we may know that the ancestor is heterozygous; at others we may have Bayesian estimates of the probabilities of heterozygosity and each homozygous possibility. The population proportions are our priors for these Bayesian estimates; they are then updated using the observed DNA of known relatives.
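As a rough sketch of such a Bayesian update, consider a single biallelic location and a single observed child of the ancestor. The sketch assumes that the letter passed on by the child's other parent is drawn independently from the population with frequency q_a for the letter A. That is a simplifying assumption of mine, as are the function name and the numbers, which reuse the 5%/95% example.

```python
def posterior_parent(prior, child, q_a):
    """Posterior over a parent's genotype (AA, AC, CC) given one child's genotype.

    prior: population proportions of the three unordered pairs.
    q_a:   assumed frequency of letter A in the letter from the other parent.
    """
    transmit_a = {"AA": 1.0, "AC": 0.5, "CC": 0.0}  # chance this parent passes on A
    other_a, other_c = q_a, 1.0 - q_a
    unnormalised = {}
    for g, p in prior.items():
        ta = transmit_a[g]
        likelihood = {
            "AA": ta * other_a,
            "AC": ta * other_c + (1 - ta) * other_a,
            "CC": (1 - ta) * other_c,
        }[child]
        unnormalised[g] = p * likelihood
    total = sum(unnormalised.values())
    return {g: v / total for g, v in unnormalised.items()}

prior = {"AA": 0.0025, "AC": 0.095, "CC": 0.9025}
post = posterior_parent(prior, child="AA", q_a=0.05)
print(post)
```

Observing a child with the rare AA rules out CC for the ancestor entirely and, under these assumptions, leaves heterozygosity as by far the most probable genotype.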
But, before we start comparing long-dead ancestors with living or deceased strangers, we must perfect our algorithms for comparing living people with each other.
Comments about this page by my Facebook friends can be left on the Facebook post where I originally announced it.