Genealogically useful, misattributed and false DNA matches
Last updated: 23 July 2020
The changes to AncestryDNA's matching algorithm announced on 14 July 2020
and scheduled for early August 2020 sparked off a variety of
reactions and got me thinking again about how to
identify useful autosomal DNA matches.
When I first got involved in genetic genealogy back in 2013, I
was very confused by a variety of poorly defined jargon used to dismiss
matches not considered genealogically useful, for example terms such as "IBS",
"identical-by-state", "pseudo-segment", "false
positive", "false match", etc. I soon realised
that many of these terms are generally just as subjective as the
matching algorithms used in the various DNA comparison databases.
Several years of experience in studying the DNA matches of myself
and others on many different websites, and my background in
mathematical sciences including statistics, have subsequently led me to
form my own subjective opinions on criteria for assessing the genealogical
usefulness of DNA matches, which can be divided into four broad
categories.
I will deal with each of these in turn, starting with the most
technical. You may skip ahead to the other sections if you wish.
criteria for genealogical usefulness
Until such time as a technology is developed to read our
paternal chromosomes and maternal chromosomes separately, autosomal DNA
matching is based on identifying half-identical
regions bounded by opposite homozygous locations. At every
location read along the pairs of autosomal chromosomes, the current
technology reads an unordered pair of letters, e.g. AA, AT, AC, etc. At every
location within a half-identical region, the two individuals being
compared have at least one letter in common. For example, at a location
where A or C can be observed, two half-identical individuals can have
AA and AA,
AA and AC, AC and AC, CC and AC, or CC and CC. On the boundaries of
half-identical regions, one of the individuals has AA and the other has
CC. The vast majority of locations observed on the autosomal
chromosomes are bi-allelic, so that only two of the four letters
(A,C,G,T) are observed at each location.
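The half-identical test described above can be sketched in a few lines of Python. This is only an illustration of the principle, not the matching algorithm of any DNA company; the function names `share_allele` and `opposite_homozygous` are my own inventions for this example.

```python
from itertools import combinations_with_replacement


def share_allele(g1: str, g2: str) -> bool:
    """True if two unordered genotypes (e.g. "AC") have at least one letter in common."""
    return bool(set(g1) & set(g2))


def opposite_homozygous(g1: str, g2: str) -> bool:
    """True if both genotypes are homozygous for different letters (e.g. AA and CC)."""
    return len(set(g1)) == 1 and len(set(g2)) == 1 and g1[0] != g2[0]


# The three unordered genotypes possible at a bi-allelic location where A or C is observed
genotypes = ["AA", "AC", "CC"]

for g1, g2 in combinations_with_replacement(genotypes, 2):
    label = "half-identical" if share_allele(g1, g2) else "opposite homozygous"
    print(g1, g2, label)
```

Of the six possible unordered pairings, five are half-identical and only AA with CC is opposite homozygous, which is the pairing that marks the boundary of a half-identical region.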
The first principle of DNA comparison is that within a long
half-identical region there is a long, but not directly observable,
identical segment (e.g.
ACGTAAGTTGGAC ...) which is common to, for example, one
individual's paternal chromosome and the other individual's maternal
chromosome. The second principle of DNA comparison is that this
identical segment, or most of it, came to both parties from a (probably
deceased) common ancestor,
although that common ancestor is likewise generally not directly observable.
Something which is not directly observable can in general
neither be proven true nor proven false.
It is often possible to estimate the likelihood or probability that
a proposition is true or false. Unless the estimated probability that
something is false is 100%, then it can be described at best as probably false and
must not be described as false without this qualification.
One reason to dismiss a half-identical region as not
genealogically useful is that it may be half-identical by chance.
This refers both to regions in which there is a sequence of overlapping
maternal/paternal and maternal/maternal matches. The fewer the SNPs
being compared within the region, the more likely it is to be
half-identical by chance. As more laboratories use different chips
with different sets of SNPs, the matching algorithms have to
guard more against half-identical by chance matching, and more
half-identical by chance regions may slip through the net. As of 16
July 2020, the number of SNPs available for comparison with my top
3,000 matches at GEDmatch, called the overlap,
ranged from 47,557 to 345,972. In the early days, when most comparisons
were based on the same underlying chip and an overlap of hundreds of
thousands of SNPs, I was dubious of any half-identical region
containing fewer than 1,000 SNPs. The present diversity of chips in use
means that one can no longer afford to be so fussy about SNP density.
Another reason to dismiss a half-identical region as not
genealogically useful is that it may be half-identical by omission.
This refers to the fact that the SNPs which the DNA companies examine
are generally not all the SNPs. In other words, the locations not
examined are not necessarily locations at which all humans are
identical. Thus, it is possible that two people match at a long
sequence of consecutive observed SNPs, but that there are unobserved
SNPs between the observed SNPs at which the two people do not match.
Dave Nicolson has written a paper about this.
A half-identical by chance or half-identical by omission region can be
considered a false match.
Half-identical regions generally have fuzzy boundaries
at each end of the identical segment which they contain. In other
words, there will generally be a gap between the last location in the
identical segment and the next opposite homozygous location, e.g.
- individual V paternal: AAAAACCCCC
- individual V maternal: CCCCCCCCCC
- individual W paternal: AAAAAAAAAA
- individual W maternal: ACACACCCCA
In this example, the identical segment ends with five As, but V and W
are half-identical at the next four locations (where each has at least
one C) until the half-identical region eventually ends at the last
location shown, where one has two As and the other has two Cs.
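The V and W example can be checked mechanically. The following Python sketch is purely illustrative: the two helper functions are hypothetical names of my own, and the code assumes bi-allelic locations, so that two genotypes with no letter in common must be opposite homozygous.

```python
# The four chromosomes from the V and W example
v_pat, v_mat = "AAAAACCCCC", "CCCCCCCCCC"
w_pat, w_mat = "AAAAAAAAAA", "ACACACCCCA"


def half_identical_length(a_pat, a_mat, b_pat, b_mat):
    """Count locations from the start up to the first opposite homozygous location."""
    length = 0
    for ap, am, bp, bm in zip(a_pat, a_mat, b_pat, b_mat):
        if not ({ap, am} & {bp, bm}):  # no letter in common at this location
            break
        length += 1
    return length


def identical_prefix_length(s1, s2):
    """Count locations from the start at which two single chromosomes agree."""
    length = 0
    for x, y in zip(s1, s2):
        if x != y:
            break
        length += 1
    return length


segment = identical_prefix_length(v_pat, w_pat)
region = half_identical_length(v_pat, v_mat, w_pat, w_mat)
print("identical segment:", segment)
print("half-identical region:", region)
print("fuzzy boundary:", region - segment)
```

The identical segment (V paternal against W paternal) covers five locations, the half-identical region covers nine, and the difference of four locations is the fuzzy boundary. In real data only the unordered pairs are observed, so only the nine could be measured.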
For more on these concepts, see Ann
in the Journal
7, Fall, 2011).
In general, an identical segment shared by individuals V and W will
have descended to both individuals in its entirety from a single common
ancestor. The only other possibility is that there has been a crossover
within the identical segment in a recent generation, so that, for
example, V inherited the first part of it from one ancestor and the
second part of it from another ancestor. If the crossover was near
either end of the identical segment, then it represents another type of
fuzzy boundary. If the crossover was in the middle of a long identical
segment, then, by the same logic, V must share both of the ancestors
from whom he inherited the two ends of the identical segment with W. As
one climbs up the family tree, fuzzy boundaries can be shaved off each
end of the initial segment, so that it can shrink considerably, or even
disappear, before the common ancestor is reached.
Another reason to dismiss a half-identical region as not
genealogically useful is that it may be
Such a segment could be called identical
by chance and can be considered a false match.
- contain long
fuzzy boundaries around a small core segment
The prevalence of fuzzy boundaries means that the lengths of identical
segments can only be estimated and that these estimates incorporate
error. I have not seen any formal study of the
distribution of this measurement error.
A consequence of this measurement error is that when three individuals
share the same identical segment, the estimated lengths of the three
corresponding half-identical regions can be slightly different. It can
often be the case that one or two of the half-identical regions are
longer than the threshold being used for a particular comparison, and
the other two or one are shorter than the threshold. If two of the
three individuals are parent and child, then there is a strong (and
justifiable) temptation to dismiss these half-identical regions as not
being genealogically useful. This was the subject of Debbie Kennett's 2017 study
and of other parent/child studies which she cites. Those studies
concentrated on the 6cM threshold then used by AncestryDNA for
identifying matches, but exactly the same phenomenon is observed around
the 20cM threshold used by AncestryDNA for identifying shared matches.
With AncestryDNA about to raise its threshold from 6 centiMorgans (cM)
to 8cM, I would like to see one of these parent/child studies extended,
as a matter of urgency, to examine how many of the shared matches
between parent and child are estimated to share over 8cM (and over
20cM) with one and under 8cM (and under 20cM) with the other, so that,
in the 8cM case, they currently appear "true", but will appear "false"
when one of the matches has disappeared under the new regime. As I have
no living parent and no child and do not have access to any
parents-and-child trio of AncestryDNA kits, I cannot carry out this
study myself.
A match of which the estimated length is within a normal margin of
error of some arbitrary threshold must not be considered a false match.
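The kind of count that such a parent/child study would produce can be sketched as follows. The estimates below are invented purely for illustration and do not come from any real kit; the function name `straddles` is likewise hypothetical.

```python
def straddles(cm_parent: float, cm_child: float, threshold: float) -> bool:
    """True if exactly one of the two estimated lengths reaches the threshold."""
    return (cm_parent >= threshold) != (cm_child >= threshold)


# Invented estimates in cM: match -> (shared with parent, shared with child)
estimates = {
    "match1": (8.4, 7.6),
    "match2": (9.1, 8.8),
    "match3": (6.9, 6.2),
    "match4": (21.0, 19.5),
}

for threshold in (8.0, 20.0):
    affected = sorted(
        name for name, (parent, child) in estimates.items()
        if straddles(parent, child, threshold)
    )
    print(f"{threshold}cM threshold: {affected}")
```

In this invented data, match1 straddles the 8cM threshold and match4 straddles the 20cM threshold, so each would be reported for one member of the parent/child pair but not for the other.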
criteria for genealogical usefulness
Only in the last three paragraphs above have I mentioned the
length of a half-identical region or of an identical segment, which is
another indicator of whether or not it is genealogically useful. Before
returning to centiMorgans, let us consider who our ancestors were. I
have estimated that in somewhere like Ireland, where the population is
small and there was little inward migration in recent centuries, it
is unlikely that any two randomly selected people with no tradition of
immigrant ancestors are more distantly related than about twelfth
cousins. Humphrys argues that we Irish are all
descended from Brian Ború, the High King of Ireland who was
killed in battle.
This simple observation has profound implications:
- if your parents came from the same population, then they
were probably twelfth cousins or closer;
- if your match's parents came from the same population, then
they too were probably twelfth cousins or closer;
- so you and your match are probably related in at least four
different ways: paternal/paternal, paternal/maternal,
maternal/paternal and maternal/maternal;
- extending the analysis backwards, you and your match are
probably related in many more ways;
- if you and your match have a known recent common ancestral
couple, then one of that couple is probably the source of any
"long" identical segment (in centiMorgan terms) that you share
with your match;
- the source of a "short" half-identical region that you share with
your match could be a more distant common ancestor of whom you are
unaware.
If your ancestors come from a smaller population (e.g. an offshore
island or a coastal peninsula), then your parents were probably even
more closely related. If your ancestors come from a larger population,
then your parents might not be as closely related as twelfth cousins.
However, the other principles above remain valid.
Hence, a "long" identical segment shared by two known cousins is
generally assumed to have come from one of their most recent known
common ancestral couple (and a "long" identical segment shared by two
known half-cousins is generally
assumed to have come from their most recent known common ancestor),
while a "short" identical segment shared by two known relatives may,
however, have come from some other and still unidentified common
ancestor.
"Long" and "short" in this case are subjective terms, and how to
interpret them also depends both on the quality of surviving
genealogical records for the areas where your ancestor lived and on how
geographically diverse your ancestry is.
If surviving genealogical records are good, then you should be able to
identify common ancestors with DNA matches with whom you share long
identical segments and should also be able to identify the other (not
shared) ancestors of yourself and your match on the relevant generation.
If surviving genealogical records are good, and particularly if your
ancestors did not move around very much, then you will find that you
are multiply related to many of your DNA matches.
If surviving genealogical records are bad, then you will struggle to
identify your relationships to any of your DNA matches beyond immediate
family.
The identical segments which you share with a DNA match who is a known
relative will cease to be genealogically useful when you cannot
reliably assign them to one particular ancestor or ancestral couple
among those that you share with the match.
In summary, other reasons to dismiss half-identical regions (besides
those which are half-identical by chance, etc.) as not genealogically
useful are that they may be
- shared with a
stranger and inherited from a common ancestor too distant to be
identified from surviving genealogical records; or
- shared with a known relative
but inherited from a common ancestor other than one of the known most
recent common ancestral couple.
A match attributed to the wrong common ancestor can be described as a misattributed match,
but this alone does not make it a false match.
True (and false) matches can become misattributed matches in all sorts
of ways:
- misattribution to someone who is not a common ancestor;
- misattribution to a common ancestor who is not the source
of the shared DNA;
- automated misattribution:
  - using unvalidated family trees (e.g. trees with children
  born before their parents!);
  - using wrong trees (e.g. trees which have confused two
  namesakes or which are linked to the wrong DNA kit);
  - using incomplete trees (e.g. attributing a shared DNA segment
  to an ancestor in the tree when it comes from an ancestor omitted from
  the tree);
  - using correct trees (e.g. attributing a shared
  segment to a more recent common ancestor when it comes from a more
  distant common ancestor);
- manual misattribution.
I ran into the problem of misattributed matches when experimenting
with DNA PAINTER.
I started out with a 7cM threshold and found no fewer than four regions
of between 7cM and 8cM in length which I shared with known relatives,
but which I could not reliably assign to a particular ancestor. In
other words, I appear to be doubly related to at least one of the
matches whom I match in each of these regions. So I decided that
for anything under 8cM the risk that the shared DNA came from a common
ancestor other than the known common ancestor is unacceptably high, and
for anything over 8cM, the corresponding risk is acceptably low.
FamilyTreeDNA has long had an 8cM threshold in its matching algorithm
(although it annoyingly reports and counts much smaller half-identical
regions once one of them exceeds 8cM). I am delighted to see that
AncestryDNA is now doing something similar.
I also ran into the problem of misattributed matches when a region (of
slightly over 8cM) in which many of my kits triangulated with each
other turned out to be a pile-up region.
criteria for genealogical usefulness
The ISOGG Wiki cites a relatively old
(2014) study by Speed and Balding which
used computer simulations based on base pairs and going back for 50
generations. They showed that, for example:
- over 50% of 5 Mb (5,000,000 base pair)
segments date back over 20 generations;
- over 60% of 10 Mb
segments date back over 10 generations; and
- around 40% of 20 Mb
segments date back over 10 generations.
The relationship between base pairs and centiMorgans is highly
non-linear, which is the very reason for using the centiMorgan scale
rather than the base-pair scale to assess the genealogical usefulness
of a match. Hence, the Speed and Balding results can NOT be used to make
reliable inferences about the age of segments measured in centiMorgans.
I am not aware of any similar study calibrated in centiMorgans, but one
is badly needed. Its results will undoubtedly be similar to, but less
extreme than, those of Speed and Balding.
In the absence of such a
study, the editors of the ISOGG Wiki propose using the Speed and
Balding results, in conjunction with a rule of thumb that on average one
centiMorgan equals one megabase.
To see where this rule of thumb comes from, I did a one-to-one
comparison of my own kit with my own kit at GEDmatch and added the end
locations of the 22 segments to get the aggregate length of the
autosomes in base pairs (2,865,561,399); then added the cM lengths of
the 22 segments to get the aggregate length of the autosomes in cMs
(3587.1), which gives an average of 798,852 base pairs per cM, while
the rule of thumb rounds this up to
1,000,000 base pairs per cM.
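The arithmetic behind this average is easily reproduced. The short Python sketch below uses only the two aggregate figures quoted above; the variable names are my own.

```python
# Aggregate autosomal lengths from the GEDmatch one-to-one self-comparison
total_base_pairs = 2_865_561_399
total_centimorgans = 3587.1

bp_per_cm = total_base_pairs / total_centimorgans
mb_per_cm = bp_per_cm / 1_000_000

print(f"{bp_per_cm:,.0f} base pairs per cM")
print(f"{mb_per_cm:.2f} Mb per cM")
```

This gives just under 0.8 Mb per cM, about 20% below the one-to-one ratio assumed by the rule of thumb.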
Thus this rule of thumb introduces two opposite biases:
- because the relationship between the cM and Mb scales is
non-linear, assuming any sort of direct proportionality or linear
relationship causes a bias which undoubtedly makes small segments look worse than they
are;
- because the actual average ratio between the cM and Mb
scales is just under 0.8Mb=1cM, using a one-to-one ratio causes an
opposite bias which undoubtedly makes small segments look better than they
are.
Without further research (see below), it is very unclear which of these
two biases will be greater. Nevertheless, the ISOGG rule of thumb is
still often (mis)used in combination with the Speed and Balding results
(and other more justifiable reasons) to warn against the dangers of
over-reliance on segments which are short in centiMorgan terms. No theory can be either
proven or disproven by a biased methodology. A biased methodology may
demonstrate a relationship in the correct direction, but will
exaggerate (or understate) the strength of the relationship.
By definition, if a set of segments is sorted by megabase length
(longest-to-shortest) and then re-sorted by centiMorgan length,
segments which descend from very distant ancestors will
generally move down the list after re-sorting and segments which
descend from more recent ancestors will generally
move up the list after re-sorting.
I have been challenged to illustrate the biases introduced by using Mb
as a proxy for cM. To do this quickly, I ran the GEDmatch matching segment
search on my own kit (LR012759C1) with the minimum thresholds. This gave a
sample of exactly 10,000, mostly small, half-identical regions (HIRs)
with both cM and
Mb length for each HIR. In the absence of any better data, I assessed
the genealogical usefulness of each
HIR by whether or not I have established my relationship to the other
party. In the table below, the columns headed "known" show the
half-identical regions that
I share with individuals to whom I have established my relationship;
the columns headed "unknown" show the half-identical regions that I
share with individuals to whom I have NOT established my relationship.
All of these relationships are closer than sixth cousin. I
divided the HIRs into groups using the Mb ranges from the oft-cited Speed and Balding Figure 2,
which I converted into cM equivalents using the true average of just
under 0.8Mb=1cM. Each row represents one of the Speed and Balding
ranges. Here are the results:
The two principal objectives for this exercise were to demonstrate:
- that the Mb and cM scales are very different, which is
illustrated by the differences between the two columns headed "Grand
Total" (the fourth and ninth columns); and
- that assuming equivalence of Mb and cM scales introduces
systematic biases to subsequent calculations, which is illustrated by
the differences between the two columns headed "%known" (the fifth and
tenth columns).
I still agree with Blaine Bettinger
and others that small segments potentially poison our genealogical
research, but the poison is not as deadly as some would have us believe.
Some other interesting points from this table:
- The distribution on the cM scale looks very different from
the distribution on the Mb scale:
- GEDmatch filters out all HIRs under 3.0cM, regardless of
their Mb length.
If the 1cM=1Mb rule of thumb were appropriate, there would be no HIR
under 3.0Mb, but 63.0% of the HIRs are below this threshold.
If the 0.8Mb=1cM rule of thumb were appropriate, there would be no HIR
under 2.4Mb, but 52.8% of the HIRs are below this threshold.
- If this was a random sample, then the average Mb/cM ratio
(expressed in base pairs per cM) would be around 798,852.
- As this is not a random sample but has been filtered by
GEDmatch for genealogical usefulness, the average Mb/cM ratio is lower.
- Because the Mb/cM relationship is non-linear, the average of
the Mb/cM ratios for the individual HIRs (613,148) is different to the
ratio of the aggregate Mb length to the aggregate cM length (635,709).
- If the relationship between the Mb and cM scales was
linear, then the correlation between the two measures would be 1.00.
- Because the relationship between the Mb and cM scales is
non-linear, the actual correlation between the two measures is 0.86.
- The widely cited result that "40% of 20 Mb
segments date back beyond 10 generations" is far more pessimistic than
this table, which shows that I have found a much more recent common
ancestor for 36.7% of the HIRs in the corresponding 20-30Mb group, and
an even more
reassuring 59.0% of those in the equivalent cM-based group. Of course,
without digging up the common ancestral couples I cannot prove to the
doubters that the known common ancestors were the sources of the
shared segments.
- There is one outlier of 55.0cM/28.4Mb for which I have not been
able to identify the common ancestor; the relatively short Mb length
seems to give a
better estimate of the age of this particular segment.
- Ignoring this single outlier, my success rate in finding
common ancestors is at least as good when grouping by cM as when grouping
by Mb for every group above 10Mb/12.6cM.
- The results on the right of the table look a little less
depressing than those on the left because of the use of the cM scale in
place of the Mb scale.
- The results on the left of the table in turn look a little
less depressing than those of Speed and Balding
because the GEDmatch matching algorithm, like those of the other DNA
companies and the efforts of all good genetic genealogists, has already
filtered out to the best of its ability many segments from extremely
distant ancestors.
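Two of the statistical points in the list above (the average of the individual ratios differing from the ratio of the aggregates, and a correlation below 1.00) can be illustrated with a small Python sketch. The four segments below are invented for illustration only and are not taken from my GEDmatch sample; the correlation is computed by hand as an ordinary Pearson coefficient.

```python
from math import sqrt

# Invented half-identical regions as (length in Mb, length in cM)
hirs = [(1.0, 4.0), (2.0, 3.0), (5.0, 5.0), (12.0, 10.0)]

mb = [m for m, _ in hirs]
cm = [c for _, c in hirs]
n = len(hirs)

# Average of the individual Mb/cM ratios versus ratio of the aggregate lengths
mean_of_ratios = sum(m / c for m, c in hirs) / n
ratio_of_sums = sum(mb) / sum(cm)

# Pearson correlation between the Mb and cM lengths
mean_mb, mean_cm = sum(mb) / n, sum(cm) / n
covariance = sum((m - mean_mb) * (c - mean_cm) for m, c in hirs)
spread_mb = sum((m - mean_mb) ** 2 for m in mb)
spread_cm = sum((c - mean_cm) ** 2 for c in cm)
correlation = covariance / sqrt(spread_mb * spread_cm)

print(f"average of ratios: {mean_of_ratios:.3f} Mb per cM")
print(f"ratio of aggregates: {ratio_of_sums:.3f} Mb per cM")
print(f"correlation: {correlation:.2f}")
```

If every segment had exactly the same Mb/cM ratio, the two averages would coincide and the correlation would be exactly 1.00; any departure from proportionality pulls them apart, as in the real sample.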
In summary, a match sharing a segment inherited from a very distant
ancestor can be described as an ancient match and is not
genealogically useful, but must not be considered a false match.
criteria for genealogical usefulness and a wish list
The cM lengths of DNA matches are strongly correlated with the degree
of relationship for close relationships, out to about first cousins.
The cM lengths of DNA matches are only very weakly correlated with the
degree of relationship for distant relationships, certainly beyond third
cousins.
We need other and better criteria to identify the dozens of
genealogically useful matches almost certainly lurking among the many
thousands of useless, irrelevant and even false small matches which
will disappear under the new AncestryDNA matching algorithm. There is
no point in throwing the baby out with the bathwater.
While a single small DNA segment shared by two individuals is not
genealogically useful in isolation, a pattern of large and small DNA
segments shared by multiple descendants of one ancestor with multiple
descendants of another ancestor can be extremely useful (in combination
with archival evidence and family traditions) in establishing the
relationship between the two ancestors. We need better ways of finding
these patterns. A good start is to fish in all the online gene pools
and to share your DNA match lists with
your known relatives and with your other close matches.
The thresholds at which matches cease to be genealogically useful are
different for different purposes:
- For chromosome mapping purposes, as already stated, my
opinion is that 8cM is a sensible threshold for determining what is
genealogically useful.
- For triangulation purposes, much smaller half-identical
regions can still be very useful. The more individuals one adds to a
triangulation group, the smaller the overlap shared by everyone in the
group, but the more confidently one can predict that the overlap was
inherited from one of the common ancestral couple shared by everyone in
the group. The overlap may be small, but will be part of much longer
segments unquestionably shared by different subgroups of the
triangulation group. Sometimes, however, I feel obliged to omit a known
relative from a triangulation group in order to make my conclusion more
convincing to those with a stronger bias than my own against small
segments.
- For purposes of identifying matches at AncestryDNA with whom I
can trace common ancestors and validate common ancestor hints,
particularly those far down my match list, I find
that the best criterion for
assessing genealogical usefulness is not the shared cMs but the number
of shared matches.
The points above inspire part of my wish list for the next round of
improvements to the DNA websites:
- I wish that GEDmatch would drop the 7cM minimum
threshold used in its Tier 1 triangulation tool to the minimum
threshold available for one-to-one comparisons, currently 3cM, and that
it would include the SNP overlap as well as the centiMorgan length of
each triangulated segment which it reports.
- I wish that AncestryDNA would show the number
of shared matches for each kit on my
match list, rather than forcing me to click twice and then potentially
scroll down repeatedly for every single kit in order to count the
shared matches. I would also like to be able to sort and filter my
match list by the number of shared matches, as I can by the number of
shared centiMorgans.
- I wish that FTDNA would improve its
in-common-with lists to give some indication of how much DNA the other
two parties share, as the other DNA comparison websites do: GEDmatch
and MyHeritage show the estimated shared cM, while AncestryDNA shows
only shared matches where the estimated shared cM exceeds 20 (not
counting the tiny segments that FTDNA counts).
- I wish that AncestryDNA, FTDNA and GEDmatch would allow
shared match lists to be sorted by the average (or equivalently the
total) cM shared with the two kits being compared:
  - MyHeritage already does this;
  - GEDmatch sorts by the cM shared with whichever kit is
  entered first in the web form;
  - FTDNA and AncestryDNA sort by the cM shared with
  whichever kit is logged in.
- I wish that both AncestryDNA and FTDNA would improve their
shared/in-common-with lists to indicate whether the matches are
triangulated, as MyHeritage does and as GEDmatch allows via its
additional display and processing options.
- I wish that MyHeritage would allow users to filter out
triangulated shared matches, rather than forcing the user to click and
scroll down repeatedly for every single match in order to find and/or
count the triangulated matches.
- I wish that all the companies would extend the ability to
view the matches shared:
  - by two kits which are not deemed to be matches, such as
  two known third or fourth cousins who don't meet the relevant matching
  threshold (currently
  possible only at GEDmatch; and at FTDNA, but only by administrators of
  projects of which both kits are members);
  - by three or more kits (currently possible only at FTDNA
  and only by administrators of projects of which all the kits are
  members).
For example, I would like to see a list of the matches shared by all of
the descendants of one of my ancestors whose DNA is linked to my online
tree, as the more descendants of the relevant ancestor that an
individual matches, the more likely he or she is to be also descended from or
closely related to that ancestor.
GEDmatch could add the shared match list to its Multi Kit Analysis in
the same way as FTDNA's GAP interface for project administrators adds
it to its own autosomal matrix comparisons.
- I wish that users had more control over the shared match
thresholds for one-to-one comparisons:
  - FTDNA gives the user no control and their
  built-in thresholds are so low and their shared match lists are now so
  long that they are genealogically almost useless;
  - GEDmatch proposes a 10cM default threshold, which is
  still too low, but which can be adjusted upwards by the user;
  - AncestryDNA's fixed 20cM threshold has proven enormously
  useful in my own research;
  - the more sensible sort order used by MyHeritage partly
  compensates for its low built-in threshold.
Some users might even want to set an upper threshold to eliminate
shared close cousins who do not share ancestors, for example if one
party's great-aunt was married to the other party's great-uncle.
I will conclude with some statistics as of 16 July 2020 on the
matches in danger of removal from my own AncestryDNA match list:
- I have identified (and
starred) 142 known relatives among my 36,571 AncestryDNA matches.
- AncestryDNA does not appear to report the total number of
common ancestor hints for a kit.
- Only 11 of my known relatives are among the countless
matches with whom I am estimated to
share 6cM or 7cM. The numbers of shared matches for these bottom 11
known relatives are 8, 16, 2, 4, 4, 1, 2, 15, 3, 2 and 1 respectively.
- Only 12 of my common ancestor hints are among the
matches with whom I am estimated to
share 6cM or 7cM. I consider six of these 12 hints to be wrong,
spurious or plain nonsense (and far more dangerous in the hands of
inexperienced genealogists than small DNA segments). The numbers of
shared matches for the six
hints with which I agree are 8, 2, 4, 1, 2 and 2 (average 3.2). The
numbers of shared matches for the six hints with which I disagree are 5,
0, 0, 2, 2 and 0 (average 1.5).
- As a control group, I looked at the 12 most recent matches
(as of 18 July 2020) with whom I am estimated to
share 6cM or 7cM. The numbers of shared matches are 0, 0, 0,
1, 0, 0, 0, 0, 0, 0, 0 and 0 (average 0.1).
Based on these statistics, I would much rather lose matches with more
centiMorgans and fewer
shared matches than those with many shared matches and fewer
centiMorgans. I will update the statistics after the changes are
implemented.
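The averages quoted in the statistics above can be recomputed from the listed counts with a few lines of Python. The group labels are my own shorthand for this sketch; the counts are copied directly from the lists.

```python
from statistics import mean

# Shared match counts for matches estimated to share 6cM or 7cM
shared_matches = {
    "known relatives": [8, 16, 2, 4, 4, 1, 2, 15, 3, 2, 1],
    "hints with which I agree": [8, 2, 4, 1, 2, 2],
    "hints with which I disagree": [5, 0, 0, 2, 2, 0],
    "control group": [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
}

for label, counts in shared_matches.items():
    print(f"{label}: n={len(counts)}, average={mean(counts):.1f}")
```

The known relatives average between five and six shared matches each, against roughly three for the hints with which I agree, one or two for the hints with which I disagree and almost none for the control group.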
I hope that my arguments and examples have convinced readers that, as DNA
comparison databases grow to tens of millions of individuals and as pile-up regions
are identified and eliminated, a well-defined count of shared matches will cease to be
just a random artifact of the matching process, but will converge to a
very useful measure of the relative genealogical usefulness of small
matches and even of non-matches.
Many thanks to those who provided useful feedback via Facebook on previous
versions of this page.