Genealogically useful, misattributed and false DNA matches

by Paddy Waldron

Last updated: 23 July 2020

URL: http://pwaldron.info/DNA/GenealogicallyUseful.html

The changes to AncestryDNA's matching algorithm announced on 14 July 2020 and scheduled for early August 2020 sparked off a variety of reactions and got me thinking again about how to identify genealogically useful autosomal DNA matches.

When I first got involved in genetic genealogy back in 2013, I was very confused by a variety of poorly defined jargon used to dismiss matches which are not considered genealogically useful, for example terms such as "IBS", "identical-by-state", "pseudo-segment", "false positive", "false match", etc. I soon realised that many of these terms are generally just as subjective as the thresholds built into the matching algorithms used in the various DNA comparison databases.

Several years of experience in studying the DNA matches of myself and others on many different websites, and my background in mathematical sciences including statistics, have subsequently led me to my own subjective opinions on criteria for assessing the genealogical usefulness of DNA matches, which can be divided into four broad categories:
  1. SNP-based criteria;
  2. centiMorgan-based criteria;
  3. base-pair-based criteria; and
  4. other criteria.

I will deal with each of these in turn, starting with the most technical.

SNP-based criteria for genealogical usefulness

Until such time as a technology is developed to read our paternal chromosomes and maternal chromosomes separately, autosomal DNA matching is based on identifying half-identical regions bounded by opposite homozygous locations. At every location read along the pairs of autosomal chromosomes, the current technology reads an unordered pair of letters, e.g. AA, AT, AC, etc. At every location within a half-identical region, the two individuals being compared have at least one letter in common. For example, at a location where A or C can be observed, two half-identical individuals can have AA and AA, AA and AC, AC and AC, CC and AC, or CC and CC. On the boundaries of half-identical regions, one of the individuals has AA and the other has CC. The vast majority of locations observed on the autosomal chromosomes are bi-allelic, so that only two of the four letters (A,C,G,T) are observed at each location.
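The definition above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: genotypes are given as two-letter strings in the same order for both individuals, there are no no-calls, and no minimum SNP count or cM threshold is applied (real matching algorithms apply both). The function names are hypothetical and do not correspond to any company's code.

```python
from typing import List, Tuple


def share_a_letter(g1: str, g2: str) -> bool:
    """True when two unordered genotypes (e.g. "AC" and "CC") have a letter in common."""
    return bool(set(g1) & set(g2))


def half_identical_regions(v: List[str], w: List[str]) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index pairs of maximal runs of locations
    at which v and w share at least one letter."""
    regions: List[Tuple[int, int]] = []
    start = None
    n = min(len(v), len(w))
    for i in range(n):
        if share_a_letter(v[i], w[i]):
            if start is None:
                start = i  # a new half-identical run begins here
        elif start is not None:
            # No letter in common (AA vs CC at a bi-allelic location): the run ends.
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, n - 1))
    return regions


# Illustrative genotypes only; the last location is opposite homozygous (AA vs CC).
v = ["AA", "AC", "AC", "CC", "CC", "AA"]
w = ["AA", "AA", "AC", "AC", "CC", "CC"]
print(half_identical_regions(v, w))  # [(0, 4)]
```

At a bi-allelic location, "no letter in common" is the same thing as opposite homozygous, which is why the run in the example stops just before the final AA/CC pair.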

The first principle of DNA comparison is that within a long half-identical region there is a long, but not directly observable, identical segment (e.g. ACGTAAGTTGGAC ...) which is common to, for example, one individual's paternal chromosome and the other individual's maternal chromosome. The second principle of DNA comparison is that this identical segment, or most of it, came to both parties from a (probably deceased) common ancestor, although that common ancestor is likewise generally not directly observable (barring exhumation).

Something which is not directly observable can in general neither be proven true nor proven false.

It is often possible to estimate the likelihood or probability that a proposition is true or false. Unless the estimated probability that something is false is 100%, then it can be described at best as probably false and must not be described as false without this qualification.

One reason to dismiss a half-identical region as not genealogically useful is that it may be half-identical by chance.

This refers both to regions in which there is a sequence of overlapping short paternal/paternal, paternal/maternal, maternal/paternal and maternal/maternal matches. The fewer the SNPs being compared within the region, the more likely it is to be half-identical by chance. As different laboratories use different chips with different sets of SNPs, the more the matching algorithms have to guard against half-identical by chance matching, and the more half-identical by chance regions may slip through the net. As of 16 July 2020, the number of SNPs available for comparison with my top 3,000 matches at GEDmatch, called the overlap, ranged from 47,557 to 345,972. In the early days, when most comparisons were based on the same underlying chip and an overlap of hundreds of thousands of SNPs, I was dubious of any half-identical region containing fewer than 1,000 SNPs. The present diversity of chips in use means that one can no longer afford to be so fussy about SNP density.
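The direction of the SNP-count effect can be illustrated with a deliberately crude calculation. The sketch below assumes two unrelated individuals, bi-allelic SNPs in Hardy-Weinberg proportions with the same allele frequency p at every SNP, and independence between neighbouring SNPs. The independence assumption in particular is unrealistic because of linkage disequilibrium, so the numbers show only why fewer compared SNPs make chance matching easier; they are not estimates of real false-match rates and are not any company's method.

```python
def prob_opposite_homozygous(p: float) -> float:
    """Chance that two unrelated people are AA and CC (in either order) at one
    bi-allelic SNP with allele frequencies p and 1 - p (Hardy-Weinberg assumed)."""
    q = 1.0 - p
    return 2.0 * p**2 * q**2


def prob_run_half_identical_by_chance(n_snps: int, p: float = 0.5) -> float:
    """Chance that n_snps consecutive SNPs contain no opposite homozygous location,
    treating SNPs as independent (an assumption, not a fact about real genomes)."""
    return (1.0 - prob_opposite_homozygous(p)) ** n_snps


# Compare sparsely and densely sampled versions of a region of the same physical length.
for n in (50, 200, 1000):
    print(n, prob_run_half_identical_by_chance(n))
```

Under these assumptions the probability falls geometrically as the SNP count rises, which is the qualitative point made above: a region covered by few overlapping SNPs has far fewer opportunities to reveal an opposite homozygous location.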

Another reason to dismiss a half-identical region as not genealogically useful is that it may be half-identical by omission.

This refers to the fact that the SNPs which the DNA companies examine are generally not all the SNPs. In other words, the locations not examined are not necessarily locations at which all humans are identical. Thus, it is possible that two people match at a long sequence of consecutive observed SNPs, but that there are unobserved SNPs between the observed SNPs at which the two people do not match. Dave Nicolson has written a paper about this.

A half-identical by chance or half-identical by omission region can be considered a false match.

Half-identical regions generally have fuzzy boundaries at each end of the identical segment which they contain. In other words, there will generally be a gap between the last location in the identical segment and the next opposite homozygous location, e.g.

In this example, the identical segment ends with five As, but V and W are half-identical at the next four locations (where each has at least one C) until the half-identical region eventually ends at the last location shown, where one has two As and the other has two Cs.

For more on these concepts, see Ann Turner's "Identity Crisis" article in the Journal of Genetic Genealogy (Volume 7, Fall, 2011).

In general, an identical segment shared by individuals V and W will have descended to both individuals in its entirety from a single common ancestor. The only other possibility is that there has been a crossover within the identical segment in a recent generation, so that, for example, V inherited the first part of it from one ancestor and the second part of it from another ancestor. If the crossover was near either end of the identical segment, then it represents another type of fuzzy boundary. If the crossover was in the middle of a long identical segment, then, by the same logic, V must share both of the ancestors from whom he inherited the two ends of the identical segment with W. As one climbs up the family tree, fuzzy boundaries can be shaved off each end of the initial segment, so that it can shrink considerably, or even disappear, before the common ancestor is reached.

Another reason to dismiss a half-identical region as not genealogically useful is that it may be

Such a segment could be called identical by chance and can be considered a false match.
The prevalence of fuzzy boundaries means that the lengths of identical segments can only be estimated and that these estimates incorporate substantial measurement error. I have not seen any formal study of the distribution of this measurement error.

A consequence of this measurement error is that when three individuals share the same identical segment, the estimated lengths of the three corresponding half-identical regions can be slightly different. It can often be the case that one or two of the half-identical regions are longer than the threshold being used for a particular comparison, and the other two or one are shorter than the threshold. If two of the three individuals are parent and child, then there is a strong (and justifiable) temptation to dismiss these half-identical regions as not being genealogically useful. This was the subject of Debbie Kennett's 2017 study and of other parent/child studies which she cites. Those studies concentrated on the 6cM threshold then used by AncestryDNA for identifying matches, but exactly the same phenomenon is observed around the 20cM threshold used by AncestryDNA for identifying shared matches.

With AncestryDNA about to raise its threshold from 6 centiMorgans (cM) to 8cM, I would like to see one of these parent/child studies extended, as a matter of urgency, to examine how many of the shared matches between parent and child are estimated to share over 8cM (and over 20cM) with one and under 8cM (and under 20cM) with the other, so that, in the 8cM case, they currently appear "true", but will appear "false" when one of the matches has disappeared under the new regime. As I have no living parent and no child and do not have access to any parents-and-child trio of AncestryDNA kits, I cannot carry out this research myself.
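The counting exercise proposed above could be sketched as follows. Everything here is hypothetical: the match identifiers and cM values are invented for illustration, the input is assumed to be a pair of dictionaries prepared by hand from the parent's and child's match lists, and treating a match at exactly the threshold as "over" is an assumption.

```python
from typing import Dict, List


def straddling_matches(parent_cm: Dict[str, float],
                       child_cm: Dict[str, float],
                       threshold: float) -> List[str]:
    """Matches present in both lists whose estimated length meets the threshold
    with one of parent/child but falls below it with the other."""
    shared = parent_cm.keys() & child_cm.keys()
    return sorted(
        m for m in shared
        if (parent_cm[m] >= threshold) != (child_cm[m] >= threshold)
    )


# Invented example data: match identifier -> estimated shared cM.
parent = {"m1": 8.4, "m2": 7.6, "m3": 12.0, "m4": 6.5}
child = {"m1": 7.9, "m2": 8.1, "m3": 11.0, "m5": 9.0}

print(straddling_matches(parent, child, 8.0))   # ['m1', 'm2']
print(straddling_matches(parent, child, 20.0))  # []
```

Running the same function at 8cM and at 20cM on a real parent/child pair would give the two counts asked for above; matches found in only one of the two lists (m4 and m5 here) are ignored because they are not shared matches.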

A match of which the estimated length is within a normal margin of error of some arbitrary threshold must not be considered a false match.

centiMorgan-based criteria for genealogical usefulness

Only in the last three paragraphs above have I mentioned the centiMorgan length of a half-identical region or of an identical segment, which is another indicator of whether or not it is genealogically useful. Before returning to centiMorgans, let us consider who our ancestors were. I have estimated that in somewhere like Ireland, where the population is small and there was little inward migration in recent centuries, it is unlikely that any two randomly selected people with no tradition of recent immigrant ancestors are more distantly related than about twelfth cousins. Mark Humphrys argues that we Irish are all descended from Brian Ború, the High King of Ireland who was killed in battle in 1014.

This simple observation has profound implications:

If your ancestors come from a smaller population (e.g. an offshore island or a coastal peninsula), then your parents were probably even more closely related. If your ancestors come from a larger population, then your parents might not be as closely related as twelfth cousins. However, the other principles above remain valid.
Hence, a "long" identical segment shared by two known cousins is generally assumed to have come from one of their most recent known common ancestral couple (and a "long" identical segment shared by two known half-cousins is generally assumed to have come from their most recent known common ancestor), while a "short" identical segment shared by two known relatives may, however, have come from some other and still unidentified common ancestor.

"Long" and "short" in this case are subjective terms, and how to interpret them also depends both on the quality of surviving genealogical records for the areas where your ancestor lived and on how geographically diverse your ancestry is.

If surviving genealogical records are good, then you should be able to identify common ancestors with DNA matches with whom you share long identical segments and should also be able to identify the other (not shared) ancestors of yourself and your match on the relevant generation.

If surviving genealogical records are good, and particularly if your ancestors did not move around very much, then you will find that you are multiply related to many of your DNA matches.

If surviving genealogical records are bad, then you will struggle to identify your relationships to any of your DNA matches beyond immediate family members.

The identical segments which you share with a DNA match who is a known relative will cease to be genealogically useful when you cannot reliably assign them to one particular ancestor or ancestral couple among those that you share with the match.

In summary, other reasons to dismiss half-identical regions (besides those which are half-identical by chance, etc.) as not genealogically useful are that they may be

A match attributed to the wrong common ancestor can be described as a misattributed match; this alone does not make it a false match.

True (and false) matches can become misattributed matches in all sorts of ways, including:
I ran into the problem of misattributed matches when experimenting with DNA PAINTER. I started out with a 7cM threshold and found no fewer than four regions of between 7cM and 8cM in length which I shared with known relatives, but which I could not reliably assign to a particular ancestor. In other words, I appear to be doubly related to at least one of the matches whom I match in each of these regions. So I decided that for anything under 8cM the risk that the shared DNA came from a common ancestor other than the known common ancestor is unacceptably high, and for anything over 8cM, the corresponding risk is acceptably low. FamilyTreeDNA has long had an 8cM threshold in its matching algorithm (although it annoyingly reports and counts much smaller half-identical regions once one of them exceeds 8cM). I am delighted to see that AncestryDNA is now doing something similar.

I also ran into the problem of misattributed matches when a region (of slightly over 8cM) in which many of my kits triangulated with each other turned out to be a pile-up region (details here).

Base-pair-based criteria for genealogical usefulness

The ISOGG Wiki cites a relatively old (2014) study by Speed and Balding which used computer simulations based on base pairs and going back for 50 generations. They showed that, for example:
The relationship between base pairs and centiMorgans is highly non-linear, which is the very reason for using the centiMorgan scale rather than the base-pair scale to assess the genealogical usefulness of a match. Hence, the Speed and Balding results can NOT be used directly to draw reliable inferences about the age of segments measured in centiMorgans. I am not aware of any similar study calibrated in centiMorgans, but one is badly needed. Its results will undoubtedly be similar to, but less extreme than, those of Speed and Balding.

In the absence of such a study, the editors of the ISOGG Wiki propose using the Speed and Balding results, in conjunction with a rule of thumb that on average one centiMorgan equals one megabase.

To see where this rule of thumb comes from, I did a one-to-one comparison of my own kit with my own kit at GEDmatch. I added the end locations of the 22 segments to get the aggregate length of the autosomes in base pairs (2,865,561,399), then added the cM lengths of the 22 segments to get the aggregate length of the autosomes in cM (3587.1). This gives an average of 798,852 base pairs per cM, which the rule of thumb rounds up to 1,000,000 base pairs per cM.
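The arithmetic behind this average can be checked in a couple of lines. The two totals are the figures quoted above; the variable names are only illustrative.

```python
# Totals from the GEDmatch one-to-one self-comparison quoted in the text.
TOTAL_BASE_PAIRS = 2_865_561_399  # sum of the end locations of the 22 autosomal segments
TOTAL_CM = 3587.1                 # sum of the cM lengths of the same 22 segments

bp_per_cm = TOTAL_BASE_PAIRS / TOTAL_CM
mb_per_cm = bp_per_cm / 1_000_000

# Factor by which the "1cM = 1Mb" rule of thumb overstates the average Mb per cM.
overstatement = 1_000_000 / bp_per_cm

print(round(bp_per_cm))  # 798852
print(f"{mb_per_cm:.3f} Mb per cM")
print(f"rule of thumb is {overstatement:.2f} times the observed average")
```

The last line makes the size of the rounding explicit: the rule of thumb is roughly a quarter larger than the average observed on this one kit.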

Thus this rule of thumb introduces two opposite biases:
Without further research (see below), it is very unclear which of these two biases will be greater. Nevertheless, the ISOGG rule of thumb is still often (mis)used in combination with the Speed and Balding results (and other more justifiable reasons) to warn against the dangers of over-reliance on segments which are short in centiMorgan terms, e.g. here. No theory can be either proven or disproven by a biased methodology. A biased methodology may demonstrate a relationship in the correct direction, but will exaggerate (or understate) the strength of the relationship.

By definition, if a set of segments is sorted by megabase length (longest-to-shortest) and then re-sorted by centiMorgan length, segments which descend from very distant ancestors will generally move down the list after re-sorting and segments which descend from more recent ancestors will generally move up the list after re-sorting.

I have been challenged to illustrate the biases introduced by using Mb as a proxy for cM. To do this quickly, I ran the GEDmatch matching segment search on my own kit (LR012759C1) with the minimum thresholds. This gave a sample of exactly 10,000, mostly small, half-identical regions (HIRs) with both cM and Mb length for each HIR. In the absence of any better data, I assessed the genealogical usefulness of each HIR by whether or not I have established my relationship to the other party. In the table below, the columns headed "known" show the half-identical regions that I share with individuals to whom I have established my relationship; the columns headed "unknown" show the half-identical regions that I share with individuals to whom I have NOT established my relationship. All of these relationships are closer than sixth cousin. I divided the HIRs into groups using the Mb ranges from the oft-cited Speed and Balding Figure 2, which I converted into cM equivalents using the true average of just under 0.8Mb=1cM. Each row represents one of the Speed and Balding ranges. Here are the results:



Mb ranges     known  unknown  Grand Total  %known   cM ranges      known  unknown  Grand Total  %known
0.2-0.5Mb         6       74           80    7.5%
0.5-1.0Mb        56      995         1051    5.3%
1-2Mb           182     3112         3294    5.5%
2-5Mb           206     3579         3785    5.4%   3.0-6.2cM        408     7160         7568    5.4%
5-10Mb          106     1025         1131    9.4%   6.3-12.5cM       113     1599         1712    6.6%
10-20Mb          92      406          498   18.5%   12.6-25.0cM      115      469          584   19.7%
20-30Mb          40       69          109   36.7%   25.1-37.5cM       49       34           83   59.0%
30-40Mb          27        7           34   79.4%   37.6-50.0cM       27        4           31   87.1%
40-50Mb           8        -            8  100.0%   50.1-62.5cM        9        1           10   90.0%
50-60Mb           4        -            4  100.0%   62.6-75.1cM        8        -            8  100.0%
60-80Mb           4        -            4  100.0%   75.2-100.1cM       3        -            3  100.0%
Over 80Mb         2        -            2  100.0%   over 100.1cM       1        -            1  100.0%
Grand Total     733     9267        10000    7.3%   Grand Total      733     9267        10000    7.3%

The two principal objectives for this exercise were to demonstrate:
  1. that the Mb and cM scales are very different, which is illustrated by the differences between the two columns headed "Grand Total" (the fourth and ninth columns); and
  2. that assuming equivalence of Mb and cM scales introduces systematic biases to subsequent calculations, which is illustrated by the differences between the two columns headed "%known" (the fifth and last columns).
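For readers who want to reproduce the conversion, here is a minimal sketch of how the Speed and Balding Mb boundaries map to cM using the average from my GEDmatch self-comparison, and of how the %known figures are derived. The constant and function names are illustrative only. The conversion applies a single genome-wide average, so it says nothing about how any individual segment's Mb and cM lengths relate.

```python
MB_PER_CM = 0.798852  # average Mb per cM from the self-comparison described above


def mb_to_cm(mb: float) -> float:
    """Convert a length in megabases to centiMorgans using the genome-wide average."""
    return mb / MB_PER_CM


def percent_known(known: int, unknown: int) -> float:
    """Share of half-identical regions in a group that belong to known relatives."""
    return 100.0 * known / (known + unknown)


# Range boundaries used in Speed and Balding's Figure 2, expressed in cM.
for mb in (5, 10, 20, 30, 40, 50, 60, 80):
    print(f"{mb}Mb is about {mb_to_cm(mb):.1f}cM")

# The 20-30Mb group and the corresponding 25.1-37.5cM group from the table.
print(f"{percent_known(40, 69):.1f}% known when grouped by Mb")
print(f"{percent_known(49, 34):.1f}% known when grouped by cM")
```

The last two lines compare one pair of corresponding rows: the same nominal range contains a quite different mix of known and unknown matches depending on which scale is used to form the group.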
Some other interesting points from this table:
I still agree with Blaine Bettinger and others that small segments potentially poison our genealogical research, but the poison is not as deadly as some would have us believe.

In summary, a match sharing a segment inherited from a very distant ancestor can be described as an ancient match and is not genealogically useful, but must not be considered a false match.

Other criteria for genealogical usefulness and a wish list

The cM lengths of DNA matches are strongly correlated with the degree of relationship for close relationships, out to about first cousins.

The cM lengths of DNA matches are only very weakly correlated with the degree of relationship for distant relationships, certainly beyond third cousins.

We need other and better criteria to identify the dozens of genealogically useful matches almost certainly lurking among the many thousands of useless, irrelevant and even false small matches which will disappear under the new AncestryDNA matching algorithm. There is no point in throwing the baby out with the bathwater.

While a single small DNA segment shared by two individuals is not genealogically useful in isolation, a pattern of large and small DNA segments shared by multiple descendants of one ancestor with multiple descendants of another ancestor can be extremely useful (in combination with archival evidence and family traditions) in establishing the relationship between the two ancestors. We need better ways of finding these patterns. A good start is to fish in all the online gene pools and to share your DNA match lists with your known relatives and with your other close matches.

The thresholds at which matches cease to be genealogically useful are different for different purposes:
The points above inspire part of my wish list for the next round of improvements to the DNA websites:
I will conclude with some statistics as of 16 July 2020 on the matches in danger of removal from my own AncestryDNA match list:
Based on these statistics, I would much rather lose matches with more centiMorgans and fewer shared matches than those with many shared matches and fewer centiMorgans. I will update the statistics after the changes are implemented.

I hope that my arguments and examples have convinced readers that, as DNA comparison databases grow to tens of millions of individuals and as pile-up regions are identified and eliminated, a well-defined count of shared matches will cease to be just a random artifact of the matching process, but will converge to a very useful measure of the relative genealogical usefulness of small matches and even of non-matches.
Many thanks to those who provided useful feedback via Facebook on previous versions of this page.