Genealogically useful, misattributed and false DNA matches
by Paddy Waldron
Last updated: 5 September 2020
URL: http://pwaldron.info/DNA/GenealogicallyUseful.html
The
changes to AncestryDNA's matching algorithm announced on 14 July 2020
and implemented throughout August 2020 sparked off a variety of
reactions and got me thinking again about how to
identify genealogically
useful autosomal DNA matches.
When I first got involved in genetic genealogy back in 2013, I
was very confused by a variety of poorly defined jargon which is used to dismiss
matches that
are
not considered genealogically useful. This jargon includes terms such as "IBS",
"identical-by-state", "pseudo-segment", "false
positive", "false match", etc. I soon realised
that many of these terms are generally just as subjective as the
thresholds built
into the
matching algorithms used in the various DNA comparison databases.
Several years of experience in studying the DNA matches of
myself
and others on many different websites, and my background in
mathematical sciences including statistics, have subsequently led me to
my
own subjective opinions on criteria for assessing the genealogical
usefulness of DNA matches, which can be divided into four broad
categories:
- SNP-based criteria;
- centiMorgan-based criteria;
- base-pair-based criteria; and
- other criteria.
I will deal with each of these in turn, starting with the most
technical.
SNP-based criteria for genealogical usefulness
Until such time as a technology is developed to read our
paternal chromosomes and maternal chromosomes separately, autosomal DNA
matching is based on identifying half-identical
regions bounded by opposite homozygous locations. At every
location read along the pairs of autosomal chromosomes, the current
technology reads
an unordered pair of letters, e.g. AA, AT, AC, etc. At every
location within a half-identical region, the two individuals being
compared have at least one letter in common. For example, at a location
where A or C can be observed, two half-identical individuals can have
AA and AA,
AA and AC, AC and AC, CC and AC, or CC and CC. On the boundaries of
half-identical regions, one of the individuals has AA and the other has
CC. The vast majority of locations observed on the autosomal
chromosomes are bi-allelic, so that only two of the four letters
(A,C,G,T) are observed at each location.
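The half-identical test described above can be sketched in a few lines of Python. This is only an illustration: the function name `half_identical` is hypothetical and is not taken from any company's matching algorithm. Two unordered genotypes are treated as half-identical when they have at least one letter in common.

```python
from itertools import combinations_with_replacement

def half_identical(g1: str, g2: str) -> bool:
    """Return True if two unordered genotypes share at least one letter."""
    return bool(set(g1) & set(g2))

# The three unordered genotypes possible where only A or C can be observed.
genotypes = ["AA", "AC", "CC"]

# Compare every unordered pair of genotypes, including a genotype with itself.
for g1, g2 in combinations_with_replacement(genotypes, 2):
    status = "half-identical" if half_identical(g1, g2) else "opposite homozygous"
    print(g1, g2, status)
```

Of the six possible pairings, only AA with CC fails the test, which is why opposite homozygous locations mark the boundaries of a half-identical region.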
The first principle of DNA comparison is that within a long
half-identical region there is a long, but not directly observable,
identical segment (e.g.
ACGTAAGTTGGAC ...) which is common to, for example, one
individual's paternal chromosome and the other individual's maternal
chromosome. The second principle of DNA comparison is that this
identical segment, or most of it, came to both parties from a (probably
deceased) common ancestor,
although that common ancestor is likewise generally not directly observable
(barring exhumation).
Something which is not directly observable can in general
neither be proven true
nor proven false.
It is often possible to estimate the likelihood or probability
that
a proposition is true or false. Unless the estimated probability that
something is false is 100%, then it can be described at best as probably false and
must not be described as false
without this qualification.
One reason to dismiss a half-identical region as not
genealogically useful is that it may be
- half-identical by chance
This refers to regions in which there is a sequence of overlapping
short paternal/paternal,
paternal/maternal,
maternal/paternal and maternal/maternal matches. The fewer the SNPs
being compared within the region, the more likely it is to be
half-identical by chance. As different laboratories introduce different chips
with different sets of SNPs, the more the matching algorithms have to
guard against half-identical by chance matching, and the more
half-identical by chance regions may slip through the net. As of 16
July 2020, the number of SNPs available for comparison with my top
3,000 matches at GEDmatch, called the overlap,
ranged from 47,557 to 345,972. In the early days, when most comparisons
were based on the same underlying chip and an overlap of hundreds of
thousands of SNPs, I was dubious of any half-identical region
containing less than 1,000 SNPs. The present diversity of chips in use
means that one can no longer afford to be so fussy about SNP density.
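To illustrate why fewer SNPs make chance matching more likely, here is a deliberately crude sketch. It assumes that SNPs are independent of each other, that both letters at every SNP have a frequency of 0.5, and that genotypes follow Hardy-Weinberg proportions. None of these assumptions comes from the matching algorithms discussed here, and real SNPs are strongly correlated with their neighbours, so the numbers are illustrative only.

```python
def prob_opposite_homozygous(p: float) -> float:
    """Chance that two unrelated people are opposite homozygous at one SNP.

    Assumes Hardy-Weinberg proportions: P(AA) = p^2 and P(CC) = q^2,
    so either (AA, CC) or (CC, AA) occurs with probability 2 * p^2 * q^2.
    """
    q = 1.0 - p
    return 2.0 * (p ** 2) * (q ** 2)

def prob_half_identical_by_chance(n_snps: int, p: float = 0.5) -> float:
    """Chance that n independent SNPs in a row contain no opposite homozygous location."""
    return (1.0 - prob_opposite_homozygous(p)) ** n_snps

# The probability falls quickly as more SNPs are compared within the region.
for n in (10, 50, 100, 500):
    print(n, prob_half_identical_by_chance(n))
```

Under these simplified assumptions the probability shrinks geometrically with the number of SNPs compared, which is the qualitative point made above.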
Another reason to dismiss a half-identical region as not
genealogically useful is that it may be
- half-identical by omission
This refers to the fact that the SNPs which the DNA companies examine
are generally not all the SNPs. In other words, the locations not
examined are not necessarily locations at which all humans are
identical. Thus, it is possible that two people match at a long
sequence of consecutive observed SNPs, but that there are unobserved
SNPs between the observed SNPs at which the two people do not match.
Dave Nicolson has written a paper about this.
A half-identical by chance or half-identical by omission
region can be
considered a false match.
Half-identical regions generally have fuzzy boundaries
at each end of the identical segment which they contain. In other
words, there will generally be a gap between the last location in the
identical segment and the next opposite homozygous location, e.g.
- individual V paternal: AAAAACCCCC
- individual V maternal: CCCCCCCCCC
- individual W paternal: AAAAAAAAAA
- individual W maternal: ACACACCCCA
In this example, the identical segment ends with five As, but V and W
are half-identical at the next four locations (where each has at least
one C) until the half-identical region eventually ends at the last
location shown, where one has two As and the other has two Cs.
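The V and W example can be checked with a short sketch. The helper names `genotypes` and `first_opposite_homozygous` are hypothetical; the four letter sequences are copied from the example above.

```python
from typing import List, Optional, Set

def genotypes(paternal: str, maternal: str) -> List[Set[str]]:
    """Combine two chromosomes into the unordered letter pairs that a chip reads."""
    return [{p, m} for p, m in zip(paternal, maternal)]

def first_opposite_homozygous(a: List[Set[str]], b: List[Set[str]]) -> Optional[int]:
    """Return the index of the first location where the two people share no letter."""
    for i, (ga, gb) in enumerate(zip(a, b)):
        if not ga & gb:
            return i
    return None

v_pat, v_mat = "AAAAACCCCC", "CCCCCCCCCC"
w_pat, w_mat = "AAAAAAAAAA", "ACACACCCCA"

v = genotypes(v_pat, v_mat)
w = genotypes(w_pat, w_mat)

# Length of the truly identical run at the start of the two paternal chromosomes.
identical_len = 0
for x, y in zip(v_pat, w_pat):
    if x != y:
        break
    identical_len += 1

# The half-identical region runs from index 0 up to the first opposite homozygous location.
hir_end = first_opposite_homozygous(v, w)

print("identical segment length:", identical_len)
print("half-identical region length:", hir_end)
print("fuzzy boundary length:", hir_end - identical_len)
```

The identical segment covers five locations, the half-identical region covers nine, and the difference of four is the fuzzy boundary.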
For more on these concepts, see Ann Turner's "Identity Crisis" article
in the Journal of Genetic Genealogy (Volume 7, Fall, 2011).
In general, an identical segment shared by individuals V and W will
have descended to both individuals in its entirety from a single common
ancestor. The only other possibility is that there has been a crossover
within the identical segment in a recent generation, so that, for
example, V inherited the first part of it from one member of an ancestral couple and the
second part of it from the other member of that ancestral couple. If the crossover was near
either end of the identical segment, then it represents another type of
fuzzy boundary. If the crossover was in the middle of a long identical
segment, then, by the same logic, V must ultimately share one of the ancestors
from whom he inherited each end of the identical segment with W. As
one climbs up the family tree, fuzzy boundaries can be shaved off each
end of the initial segment, so that it can shrink considerably, or even
disappear, before the common ancestor is reached.
Another reason to dismiss a half-identical region as not
genealogically useful is that it may
- contain long fuzzy boundaries around a small core segment
Such a segment could be called identical
by chance and can be considered a false match.
The prevalence of fuzzy boundaries means that the lengths of identical
segments can only be estimated and that these estimates incorporate
substantial measurement
error. I have not seen any formal study of the
distribution of this measurement error.
A consequence of this measurement error is that when three individuals
share the same identical segment, the estimated lengths of the three
corresponding half-identical regions can be slightly different. It can
often be the case that one or two of the half-identical regions are
longer than the threshold being used for a particular comparison, and
the other two or one are shorter than the threshold. If two of the
three individuals are parent and child, then there is a strong (and
justifiable) temptation to dismiss these half-identical regions as not
being genealogically useful. This was the subject of Debbie Kennett's 2017 study
and of other parent/child studies which she cites. Those studies
concentrated on the 6cM threshold then used by AncestryDNA for
identifying matches, but exactly the same phenomenon is observed around
the 20cM threshold used by AncestryDNA for identifying shared matches.
When AncestryDNA announced its intention to raise its threshold from 6 centiMorgans (cM)
to 8cM, I would like to have seen one of these parent/child studies extended,
as a matter of urgency, to examine how many of the shared matches
between parent and child are estimated to share over 8cM (and over
20cM) with one and under 8cM (and under 20cM) with the other, so that,
in the 8cM case, they originally appeared "true", but then appeared "false"
when one of the matches had disappeared under the new regime. As I have
no living parent and no child and do not have access to any
parents-and-child trio of AncestryDNA kits, I could not carry out this
research myself, and I am not aware of anyone else who did it.
A match of which the estimated length is within a normal margin of
error of some arbitrary threshold must not be considered a false match.
centiMorgan-based criteria for genealogical usefulness
Only in the last three paragraphs above have I mentioned the
centiMorgan
length of a half-identical region or of an identical segment, which is
another indicator of whether or not it is genealogically useful. Before
returning to centiMorgans, let us consider who our ancestors were. I
have estimated that in somewhere like Ireland, where the
population is
small and there was little inward migration in recent
centuries, it
is unlikely that any two randomly selected people with no tradition of
recent
immigrant ancestors are more distantly related than about twelfth
cousins. Going back about three times as far, Mark
Humphrys argues that we
Irish are all
descended from Brian Ború, the High King of Ireland who was
killed in battle
in 1014.
This simple observation has profound implications:
- if your parents came from the same population, then they
were probably twelfth cousins or closer;
- if your match's parents came from the same population, then
they too were probably twelfth cousins or closer;
- so you and your match are probably related in at least four
different ways: paternal/paternal, paternal/maternal,
maternal/paternal, maternal/maternal;
- extending the analysis backwards, you and your match are
probably related in many more ways;
- if you and your match have a known recent common ancestral
couple, then one of that couple is probably the source of any
"long" identical segment (in centiMorgan terms) that you share
with your match;
- the source of a "short" segment that you
share with
your match could be a more distant common ancestor of whom you are
currently unaware.
If your ancestors come from a smaller population (e.g. an offshore
island or a coastal peninsula), then your parents were probably even
more closely related than twelfth cousins. If your ancestors come from
a larger population,
then your parents might not be as closely related as twelfth cousins.
However, the other principles above remain valid, whether the 95th
percentile relationship is sixth cousin, twelfth cousin or eighteenth
cousin.
Hence, a "long" identical segment shared by two known cousins is
generally assumed to have come from one of their most recent known
common ancestral couple (and a "long" identical segment shared by two
known half-cousins is generally
assumed to have come from their most recent known common ancestor),
while a "short" identical segment shared by two known relatives may,
however, have come from some other, less recent, and still unidentified, common
ancestor.
"Long" and "short" in this case are subjective terms, and how to
interpret them also depends both on the quality of surviving
genealogical records for the areas where your ancestor lived and on how
geographically diverse your ancestry is.
If surviving genealogical records are good, then you should be able to
identify common ancestors with DNA matches with whom you share long
identical segments and should also be able to identify the other (not
shared) ancestors of yourself and your match on the relevant generation.
If surviving genealogical records are good, and particularly if your
ancestors did not move around very much, then you will find that you
are multiply related to many of your DNA matches.
If surviving genealogical records are bad, then you will struggle to
identify your relationships to any of your DNA matches beyond immediate
family members.
The identical segments which you share with a DNA match who is a known
relative will cease to be genealogically useful when you can not
reliably assign them to one particular ancestor or ancestral couple
among those that you share with the match.
In summary, other reasons to dismiss half-identical regions (besides
those which are half-identical by chance, etc.) as not genealogically
useful are that they may be
- shared with a
stranger and inherited from a common ancestor too distant to be
identified from surviving genealogical records; or
- shared with a
known relative,
but inherited from a common ancestor other than one of the known most
recent common ancestral couple.
A match attributed to the wrong common ancestor can be described as a misattributed match,
but this alone does not make it a false
match.
True (and false) matches can become misattributed matches in all sorts
of ways,
including:
- misattribution to someone who is not a common ancestor;
- misattribution to a common ancestor who is not the source
of the shared DNA;
- automated misattribution:
- using unvalidated family trees (e.g. trees with children
born before their parents!);
- using wrong trees (e.g. trees which have confused two
namesakes or which are linked to the wrong DNA kit);
- using incomplete trees (e.g. attributing a shared DNA
segment
to an ancestor in the tree when it comes from an ancestor omitted from
the tree);
- using correct trees (e.g. attributing a shared
DNA
segment to a more recent common ancestor when it comes from a more
distant common ancestor);
- manual misattribution;
- etc.
I ran into the problem of misattributed matches when experimenting
with DNA PAINTER.
I started out with a 7cM threshold and found no fewer than four regions
of between 7cM and 8cM in length which I shared with known relatives,
but which I could not reliably assign to a particular ancestor. In
other words, I appear to be doubly related to at least one of the
matches whom I match in each of these regions. So I decided
that
for anything under 8cM the risk that the shared DNA came from a common
ancestor other than the known common ancestor is unacceptably high, and
for anything over 8cM, the corresponding risk is acceptably low.
FamilyTreeDNA has long had an 8cM threshold in its matching algorithm
(although it annoyingly reports and counts much smaller half-identical
regions once one of them exceeds 8cM). I am delighted to see that
AncestryDNA
is now doing something similar.
I also ran into the problem of misattributed matches when a region (of
slightly over 8cM) in which many of my kits triangulated with each
other turned out to be a pile-up region (details here).
Base-pair-based criteria for genealogical usefulness
The ISOGG Wiki cites a relatively old
(2014) study by Speed and Balding which
used computer simulations based on base pairs and going back for 50
generations. They showed that, for example:
- over 50% of 5 Mb (5,000,000 base pair)
segments date back over 20 generations;
- over 60% of 10 Mb
segments date back over 10 generations; and
- around 40% of 20 Mb
segments date back over 10 generations.
The relationship between base pairs and centiMorgans is highly
non-linear, which is the very reason for using the centiMorgan scale
rather than the base-pair scale to assess the genealogical usefulness
of a match. Hence, the Speed and Balding results can NOT be used
directly to
draw
reliable inferences about the age of segments measured in centiMorgans.
I am not aware of any similar study calibrated in centiMorgans, but one
is badly needed. Its results will undoubtedly be similar to, but less
extreme than, those of Speed and Balding.
In the absence of such a
study, the editors of the ISOGG Wiki propose using the Speed and
Balding
results, in conjunction with a rule of thumb that on average one
centiMorgan equals one
megabase.
To see where this rule of thumb comes from, I did a one-to-one
comparison of my own kit with my own kit at GEDmatch and added the end
locations of the 22 segments to get the aggregate length of the
autosomes in base pairs (2,865,561,399); then added the cM lengths of
the 22 segments to get the aggregate length of the autosomes in cMs
(3587.1), which gives an average of 798,852 base pairs per cM, while
the rule of thumb rounds this to the nearest Mb per cM and proposes
1,000,000 base pairs per cM.
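The arithmetic behind that average can be reproduced directly from the two aggregate figures quoted above; the variable names are illustrative only.

```python
# Aggregate lengths of the 22 autosomes from the self-comparison quoted above.
aggregate_bp = 2_865_561_399   # base pairs
aggregate_cm = 3587.1          # centiMorgans

bp_per_cm = aggregate_bp / aggregate_cm

print(round(bp_per_cm))            # average base pairs per cM
print(round(bp_per_cm / 1e6, 1))   # the same average expressed in Mb per cM
```

The first figure rounds to 798,852 base pairs per cM, or just under 0.8Mb per cM, rather than the 1Mb per cM assumed by the rule of thumb.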
Thus this rule of thumb introduces two opposite biases:
- because the relationship between the cM and Mb scales is
highly
non-linear, assuming any sort of direct proportionality or linear
relationship causes a bias which undoubtedly makes small segments look worse than they
really are;
- because the actual average ratio between the cM
and Mb
scales is just under 0.8Mb=1cM, using a one-to-one ratio causes an
opposite bias which undoubtedly makes small segments look better than they
really are.
Without further research (see below), it is very unclear which of these
two biases will be greater. Nevertheless, the ISOGG rule of
thumb is
still often (mis)used in combination with the Speed and Balding results
(and other more justifiable reasons) to warn against the dangers of
over-reliance on segments which are short in centiMorgan terms, e.g. here. No theory can be either
proven or disproven by a biased methodology. A biased methodology may
demonstrate a relationship in the correct direction, but, depending on the direction of the bias, will
either exaggerate or understate the strength of the relationship.
By definition, if a set of segments is sorted by megabase length
(longest-to-shortest) and then re-sorted by centiMorgan length,
segments which descend from very distant ancestors will
generally move down the list after re-sorting and segments which
descend from more recent ancestors will generally
move up the list after re-sorting.
I have been challenged to illustrate the biases introduced by using Mb
as
a proxy for cM. To do this quickly, I ran the GEDmatch Tier 1 Segment Search
on my own kit (VA864386C1) with the minimum thresholds. This gave a
sample of exactly 10,000, mostly small, half-identical regions (HIRs)
with both cM and
Mb length for each HIR. In the absence of any better data, I assessed
the genealogical usefulness of each
HIR by whether or not I have established my relationship to the other
party. In the table below, the columns headed "known" show the
half-identical regions that
I share with individuals to whom I have established my relationship;
the columns headed "unknown" show the half-identical regions that I
share with individuals to whom I have NOT established my relationship.
All of these relationships are closer than sixth cousin. I
divided the HIRs into groups using the Mb ranges from the oft-cited Speed and Balding Figure 2,
which I converted into cM equivalents using the true average of just
under 0.8Mb=1cM. Each row represents one of the Speed and Balding
ranges. Here are the results:
| Mb ranges | known | unknown | Grand Total | %known | cM ranges | known | unknown | Grand Total | %known |
|---|---|---|---|---|---|---|---|---|---|
| 0.2-0.5Mb | 6 | 74 | 80 | 7.5% | | | | | |
| 0.5-1.0Mb | 56 | 995 | 1051 | 5.3% | | | | | |
| 1-2Mb | 182 | 3112 | 3294 | 5.5% | | | | | |
| 2-5Mb | 206 | 3579 | 3785 | 5.4% | 3.0-6.2cM | 408 | 7160 | 7568 | 5.4% |
| 5-10Mb | 106 | 1025 | 1131 | 9.4% | 6.3-12.5cM | 113 | 1599 | 1712 | 6.6% |
| 10-20Mb | 92 | 406 | 498 | 18.5% | 12.6-25.0cM | 115 | 469 | 584 | 19.7% |
| 20-30Mb | 40 | 69 | 109 | 36.7% | 25.1-37.5cM | 49 | 34 | 83 | 59.0% |
| 30-40Mb | 27 | 7 | 34 | 79.4% | 37.6-50.0cM | 27 | 4 | 31 | 87.1% |
| 40-50Mb | 8 | | 8 | 100.0% | 50.1-62.5cM | 9 | 1 | 10 | 90.0% |
| 50-60Mb | 4 | | 4 | 100.0% | 62.6-75.1cM | 8 | | 8 | 100.0% |
| 60-80Mb | 4 | | 4 | 100.0% | 75.2-100.1cM | 3 | | 3 | 100.0% |
| Over 80Mb | 2 | | 2 | 100.0% | over 100.1cM | 1 | | 1 | 100.0% |
| Grand Total | 733 | 9267 | 10000 | 7.3% | Grand Total | 733 | 9267 | 10000 | 7.3% |
The two principal objectives for this exercise were to demonstrate:
- that the Mb and cM scales are very different, which is
illustrated by the differences between the two columns headed "Grand
Total" (the fourth and ninth columns); and
- that assuming equivalence of Mb and cM scales introduces
systematic biases to subsequent calculations, which is illustrated by
the differences between the two columns headed "%known" (the fifth and
last columns).
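As a cross-check, the "%known" figures and the grand totals can be recomputed from the known and unknown counts in the table. The sketch below only restates those counts (blank cells are entered as 0); the names `mb_rows`, `cm_rows` and `pct_known` are illustrative.

```python
# (range, known, unknown) for each row of the Mb-based half of the table.
mb_rows = [
    ("0.2-0.5Mb", 6, 74), ("0.5-1.0Mb", 56, 995), ("1-2Mb", 182, 3112),
    ("2-5Mb", 206, 3579), ("5-10Mb", 106, 1025), ("10-20Mb", 92, 406),
    ("20-30Mb", 40, 69), ("30-40Mb", 27, 7), ("40-50Mb", 8, 0),
    ("50-60Mb", 4, 0), ("60-80Mb", 4, 0), ("Over 80Mb", 2, 0),
]

# (range, known, unknown) for each row of the cM-based half of the table.
cm_rows = [
    ("3.0-6.2cM", 408, 7160), ("6.3-12.5cM", 113, 1599), ("12.6-25.0cM", 115, 469),
    ("25.1-37.5cM", 49, 34), ("37.6-50.0cM", 27, 4), ("50.1-62.5cM", 9, 1),
    ("62.6-75.1cM", 8, 0), ("75.2-100.1cM", 3, 0), ("over 100.1cM", 1, 0),
]

def pct_known(known: int, unknown: int) -> float:
    """Percentage of HIRs in a group shared with someone of known relationship."""
    return round(100 * known / (known + unknown), 1)

for label, rows in (("Mb", mb_rows), ("cM", cm_rows)):
    print(label, "scale")
    for name, known, unknown in rows:
        print(f"  {name:<13} {known + unknown:>5} {pct_known(known, unknown):>6}%")
    total_known = sum(k for _, k, _ in rows)
    total_unknown = sum(u for _, _, u in rows)
    print(f"  {'Grand Total':<13} {total_known + total_unknown:>5} "
          f"{pct_known(total_known, total_unknown):>6}%")
```

Both halves of the table add up to the same 733 known and 9,267 unknown HIRs, so the differences between the two sides come entirely from how the same 10,000 regions are grouped.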
Interesting points from this table and the underlying data include the following:
- The distribution on the cM scale looks very different from
the distribution on the Mb scale:
- GEDmatch filters out all HIRs under 3.0cM, regardless of
their Mb length.
If the 1cM=1Mb rule of thumb were appropriate, there would be no HIR
under 3.0Mb, but 63.0% of the HIRs are below this threshold.
If the 0.8Mb=1cM rule of thumb were appropriate, there would be no HIR
under 2.4Mb, but 52.8% of the HIRs are below this threshold.
- If this was a random sample, then the average ratio
would be around 798,852 base pairs per cM.
- As this is not a random sample but has been filtered by
GEDmatch for genealogical usefulness, the average ratio is lower.
- Because the Mb/cM relationship is non-linear, the average of
the ratios for the individual HIRs (613,148 base pairs per cM) is different to the
ratio of the aggregate Mb length to the aggregate cM length (635,709 base pairs per cM).
- If the relationship between the Mb and cM scales was
linear, then the correlation between the two measures would be 1.00.
- Because the relationship between the Mb and cM scales is
non-linear, the actual correlation between the two measures is 0.86.
- In the five Mb-based groups above 30Mb and in the six
cM-based groups above 25cM, I have identified a common ancestor for
most of the HIRs.
- The widely cited result that "40% of 20 Mb
segments date back beyond 10 generations" is far more pessimistic than
this table, which shows that I have found a much more recent common
ancestor for 36.7% of the HIRs in the corresponding 20-30Mb group, and
an even more
reassuring 59.0% of those in the equivalent cM-based group. Of course,
without digging up the common ancestral couples I cannot prove to the
doubters that the known common ancestors were the sources of the
relevant HIRs.
- There is one outlier of 55.0cM/28.4Mb for which I have not
been
able to identify the common ancestor; the relatively short Mb length
seems to give a
better estimate of the age of this particular segment.
- Ignoring this single outlier, my success rate in finding
common
ancestors is as good or better when grouping by cM as when grouping
by Mb for every group above 10Mb/12.6cM.
- The results on the right of the table look a little less
depressing than those on the left because of the use of the cM scale in
place of the Mb scale.
- The results on the left of the table in turn look a little
less depressing than those of Speed and Balding
because the GEDmatch matching algorithm, like those of the other DNA
companies and the efforts of all good genetic genealogists, has already
filtered out to the best of its ability many segments from extremely
distant ancestors.
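Two of the points in the list above, that the average of the individual ratios differs from the ratio of the aggregates and that the correlation falls below 1.00 when the relationship is non-linear, can be illustrated with a small sketch. The four segments below are invented for illustration and are not taken from the 10,000 HIRs; the `pearson` helper is a plain textbook implementation.

```python
from math import sqrt
from statistics import mean

# Hypothetical HIRs as (length in Mb, length in cM); NOT real data.
hirs = [(1.0, 3.0), (3.0, 4.0), (10.0, 10.0), (20.0, 40.0)]
mb = [m for m, _ in hirs]
cm = [c for _, c in hirs]

# Average of the per-segment ratios versus ratio of the aggregate lengths.
mean_of_ratios = mean(m / c for m, c in hirs)
ratio_of_sums = sum(mb) / sum(cm)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

print("average of Mb/cM ratios:", mean_of_ratios)
print("aggregate Mb / aggregate cM:", ratio_of_sums)
print("correlation between Mb and cM:", pearson(mb, cm))
```

If every segment had the same Mb/cM ratio, the two averages would coincide and the correlation would be exactly 1; any departure from proportionality pulls them apart.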
I still agree with Blaine Bettinger
and others that small segments potentially poison our genealogical
research, but the poison is not as deadly as some would have us believe.
In summary, a match sharing a segment inherited from a very distant
ancestor can be
described as an ancient
match and is not genealogically useful, but must not be
considered a false match.
Other criteria for genealogical usefulness and a wish list
The cM lengths of DNA matches are strongly correlated with the degree
of relationship for close relationships, out to about first cousins.
The cM lengths of DNA matches are only very weakly correlated with the
degree
of relationship for distant relationships, certainly beyond third
cousins.
We need other and better criteria to identify the dozens of
genealogically useful matches almost certainly lurking among the many
thousands of useless, irrelevant and even false small matches which
disappeared under the new AncestryDNA matching algorithm. There is
no point in throwing the baby out with the bathwater.
There are three almost equally important criteria which can be used to
assess the genealogical usefulness of a half-identical region between
yourself and a DNA match:
- the length of the half-identical region, whether measured in cM, SNPs or Mb;
- the number of other similar and longer half-identical regions
which you and your known relatives share with the match and his or her
known relatives; and
- the number of other individuals who are half-identical to you and your match on this region.
The first of these criteria is widely observable and widely used (and
misused). The second criterion is observable for
individual-to-individual comparisons, but takes a little more effort to
observe for family-to-family comparisons. The third criterion is
generally observable only by using advanced third-party tools such as
the Tier 1 Segment Search at GEDmatch.
While a single small DNA segment shared by two individuals is not
genealogically useful in isolation, a pattern of large and small DNA
segments shared by multiple descendants of one ancestor with multiple
descendants of another ancestor can be extremely useful (in combination
with archival evidence and family traditions) in establishing the
relationship between the two ancestors. We need better ways of finding
these patterns. A good start is to fish in all the online gene pools
and to share your DNA match lists with
your known relatives and with your other close matches.
The thresholds at which matches cease to be genealogically useful are
different for different purposes:
- For chromosome mapping purposes, as already stated, my
experience
is that 8cM is a sensible threshold for determining what is
genealogically useful.
- For triangulation purposes, much smaller half-identical
regions
can still be very useful. The more individuals one adds to a
triangulation group, the smaller the overlap shared by everyone in the
group, but the more confidently one can predict that the overlap was
inherited from one of the common ancestral couple shared by everyone in
the group. The overlap may be small, but will be part of much longer
segments unquestionably shared by different subgroups of the
triangulation group. Sometimes, however, I feel obliged to omit a known
relative
from a triangulation group in order to make my conclusion more
convincing to those with a stronger bias than my own against small
triangulated segments.
- For purposes of identifying matches at AncestryDNA with
whom I
can trace common ancestors and validate common ancestor hints,
particularly those far down my match list, I find
that the best criterion for
assessing genealogical usefulness is currently not the shared cMs but the number
of shared matches.
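The shrinking overlap in a growing triangulation group can be sketched as an interval intersection. This is a simplification under stated assumptions: each tuple is taken to be the start and end position (in base pairs, on the same chromosome) of the half-identical region that one group member shares with a reference kit, the coordinates are invented, and the function name `common_overlap` is hypothetical.

```python
from typing import List, Optional, Tuple

Segment = Tuple[int, int]  # (start, end) in base pairs on one chromosome

def common_overlap(segments: List[Segment]) -> Optional[Segment]:
    """Return the region covered by every segment, or None if there is none."""
    start = max(s for s, _ in segments)
    end = min(e for _, e in segments)
    return (start, end) if start < end else None

# Hypothetical segments shared with a reference kit by four group members.
segments = [
    (10_000_000, 30_000_000),
    (12_000_000, 35_000_000),
    (18_000_000, 26_000_000),
    (20_000_000, 40_000_000),
]

# Add one member at a time and watch the common overlap shrink.
for n in range(2, len(segments) + 1):
    print(n, "members:", common_overlap(segments[:n]))
```

With these invented figures the overlap shrinks from 18Mb for two members to 6Mb for four, even though each member shares a much longer region with the reference kit.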
The points above inspire part of my wish list for the next round of
improvements to the DNA websites:
- I wish that AncestryDNA would divert some of the
resources devoted to tweaking its quite acceptable DNA-matching
algorithm to tweaking, or to completely redesigning, its unacceptably
poor tree-matching algorithm, which generates many hints that any
sentient human can see are complete nonsense.
- I wish that GEDmatch would drop the 7cM minimum
threshold used in its Tier 1 triangulation tool to the minimum
threshold available for one-to-one comparisons, currently 3cM, and that
it would include the SNP overlap as well as the centiMorgan length of
each triangulated segment which it reports.
- I wish that AncestryDNA would show the number
of shared matches for each kit on my
match list, rather than forcing me to click twice and then potentially
scroll down repeatedly for every single kit in order to count the
shared matches. I would also like to be able to sort and filter my
match list by the number of shared matches, as I can by the number of
shared centiMorgans.
- I wish that FTDNA would improve its
in-common-with lists to give some indication of how much DNA the other
two parties share, as the other DNA comparison websites do: GEDmatch
and MyHeritage show the estimated shared cM, while AncestryDNA shows
only shared matches where the estimated shared cM exceeds 20 (not
counting the tiny segments that FTDNA counts).
- I wish that AncestryDNA, FTDNA and GEDmatch would allow
their
shared match lists to be sorted by the average (or equivalently the
total) cM shared with the two kits being compared:
- MyHeritage already does this;
- GEDmatch sorts by the cM shared with whichever kit is
entered first in the web form;
- FTDNA and AncestryDNA sort by the cM shared with
whichever kit is logged in.
- I wish that both AncestryDNA and FTDNA would improve their
shared/in-common-with lists to indicate whether the matches are
triangulated, as MyHeritage does and as GEDmatch allows via its
additional display and processing options.
- I wish that MyHeritage would allow users to filter
triangulated and untriangulated shared matches, rather than forcing the
user to click and
scroll down repeatedly for every single match in order to find and/or
count the triangulated
shared matches.
- I wish that all the companies would extend the ability to
see matches
shared:
- by two kits which are not deemed to be matches, such as
two known third or fourth cousins who don't meet the relevant matching
threshold (currently
possible only at GEDmatch; and at FTDNA, but only by administrators of
projects of which both kits are
members); and
- by three or more kits (currently possible only at FTDNA
and only by administrators of projects of which all the kits are
members).
For example, I would like to see a list of the matches shared by all of
the descendants of one of my ancestors whose DNA is linked to my online
family
tree, as the more descendants of the relevant ancestor that an
individual
matches, the more likely he or she is to be also descended from or
closely related to that ancestor.
GEDmatch could add the shared match list to its Multi Kit Analysis in
the same way as FTDNA's GAP interface for project administrators adds
it to its own autosomal matrix comparisons.
- I wish that users had more control over the shared match
thresholds for one-to-one comparisons:
- FTDNA gives the user no control and their
built-in thresholds are so low and their shared match lists are now so
long that they are genealogically almost useless;
- GEDmatch proposes a 10cM default threshold, which is
still too low, but which can be adjusted upwards by the user;
- AncestryDNA's fixed 20cM threshold has proven enormously
useful in my own research;
- the more sensible sort order used by MyHeritage partly
compensates for its low built-in threshold.
Some users might even want to set an upper threshold to eliminate
shared close cousins who do not share ancestors, for example if one
party's great-aunt was married to the other party's great-uncle.
- I wish that AncestryDNA would provide a chromosome browser, but I
know that I am wasting my breath in adding my voice to those of many
thousands of AncestryDNA customers who have been campaigning for this basic and essential tool for many years.
I will conclude with some statistics as of 16 July 2020 on the
matches in danger of removal from my own AncestryDNA match list:
- I have identified (and
starred) 142 known relatives among my 36,571 AncestryDNA matches
(0.39%).
- AncestryDNA does not appear to report the total number of
common ancestor hints for a kit.
- Only 11 of my known relatives are among the countless
matches with whom I am estimated to
share 6cM or 7cM. The numbers of shared matches for these bottom 11
known relatives are 8, 16, 2, 4, 4, 1, 2, 15, 3, 2 and 1 respectively
(average 5.3).
- Only 12 of my common ancestor hints are among the
countless
matches with whom I am estimated to
share 6cM or 7cM. I consider six of these 12 hints to be wrong,
spurious or plain nonsense (and far more dangerous in the hands of
inexperienced genealogists than small DNA segments). The numbers of
shared matches for the six
hints with which I agree are 8, 2, 4, 1, 2 and 2 (average 3.2). The
numbers of shared matches for the six hints with which I disagree are 5,
0, 0, 2, 2 and 0 (average 1.5).
- As a control group, I looked at the 12 most recent matches
(as of 18 July 2020) with whom I am estimated to
share 6cM or 7cM. The numbers of shared matches are 0, 0, 0,
1, 0, 0, 0, 0, 0, 0, 0 and 0 (average 0.1).
Based on these statistics, I would much rather lose matches with more
centiMorgans and fewer
shared matches than those with many shared matches and fewer
centiMorgans.
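The averages quoted in the statistics above can be recomputed from the listed counts of shared matches; the group labels in this sketch are illustrative.

```python
from statistics import mean

# Shared-match counts for each group of 6cM or 7cM matches, as listed above.
shared_matches = {
    "known relatives": [8, 16, 2, 4, 4, 1, 2, 15, 3, 2, 1],
    "hints agreed with": [8, 2, 4, 1, 2, 2],
    "hints disagreed with": [5, 0, 0, 2, 2, 0],
    "control group": [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
}

# Average number of shared matches per group, rounded to one decimal place.
averages = {group: round(mean(counts), 1) for group, counts in shared_matches.items()}

for group, average in averages.items():
    print(f"{group:<22} n={len(shared_matches[group]):>2}  average={average}")
```

The known relatives average more than five shared matches each, while the control group of recent small matches averages close to zero.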
I will update these statistics after the changes have been fully
implemented.
As of 1 September 2020, there was still some confusion as to what
AncestryDNA's new matching criteria were and as to whether they had
been fully implemented:
- matches under 8cM which had not had notes or coloured dots
added by either party and had not been the subject of messages between
the parties had been deleted from the perspective of both parties;
- matches under 8cM which had notes or coloured dots added by one party remained visible to that party only;
- conflicting information had been provided by AncestryDNA to
different customers as to whether matches under 8cM which had notes or
coloured dots added by one party would become visible again to the
other party (see this Facebook discussion).
I hope that my arguments and examples have convinced readers that, as DNA
comparison databases grow to tens of millions of individuals and as pile-up regions
are identified and
eliminated, a well-defined count of shared matches will cease to be
just a random artifact of the matching process, but will converge to a
very useful measure of the relative genealogical usefulness of small
matches and even of non-matches.
Many thanks to those who provided useful feedback via Facebook on previous
versions of this page.