Last updated: 29 May 2024.
The purposes of this web page are two-fold:
It was only after a lot of thought over a number of years that I submitted a sample of my own DNA to FamilyTreeDNA (FTDNA) on 20 October 2013, receiving the results on 15 November 2013.
At the time, FamilyTreeDNA was really the only practical and affordable option for those residing outside the United States of America, where the first three big DNA companies were based. The alternatives at the time were AncestryDNA and 23andMe.
The basic selfish reasons that motivate most people to submit DNA samples to a genetic genealogy database are to confirm their own genealogical relationships and to find their own long lost cousins.
However, you should consider submitting your DNA not just for your own benefit, but for the benefit of others, including those only distantly related to you. You can rest assured that your own descendants will be literally eternally grateful to you for doing so. It is much cheaper and easier to do it now than for your survivors to do it when you are dead. Even if submitting your DNA doesn't help you directly, you might have the missing jigsaw pieces that will solve someone else's mystery. Genetic genealogy databases are of particular value to those who don't know much about their biological ancestry due to adoption, abandonment, infidelity, sperm or egg donation and similar causes.
The value of an online genetic genealogy database to those searching for relatives depends fundamentally on the number of people in the database. The first to join do need to make a leap of faith and exercise patience until the database reaches critical mass. When you do join, there are huge positive externalities for those already in the database and also for those who join subsequently. If your close relatives are not in the database, then you cannot be matched with them. If your close relatives are already in the database, then by joining you are presenting them with a valuable and much appreciated gift.
The first three major genetic genealogy companies were all based in the USA and some either do not welcome DNA samples from outside the USA or have a pricing policy designed to rip off those resident outside the USA. For those whose roots are in the USA, the databases were already approaching critical mass by 2014. Critical mass for those in countries like Ireland, where I live, was much further off when I submitted my sample. Someone has to get the ball rolling, so why not you? If members of your extended family (like all families in Ireland) emigrated to the USA, then you already have a good chance of finding their descendants. And you may even find cousins among the early participants from other jurisdictions. Or you may just help distant relatives without a well documented family tree to focus their searches in the correct direction, whether that is on your own branch of the family or on another branch of the family.
On the other hand, if there is a family secret that you, or others in your family, would like to remain a secret, then genetic genealogy may not be for you. Conversely, if you feel that now is the time to bring the family secret into the open and obtain closure for all involved, then genetic genealogy is definitely the way to go.
Genetic genealogy can also reveal family secrets that nobody ever suspected, for example an inadvertent baby swap at Fordham Hospital in The Bronx in 1913 which went unnoticed for over a century.
In fact, a disproportionate number of those resorting to DNA to trace their family history are adoptees, or parents or descendants of adoptees, or even foundlings with no paper trail at all on their biological family. I will do my best to assist and advise any such people among my own DNA matches, insofar as this is possible without putting unwanted pressure on those on the other side of the family secret who may want to keep it a secret. Doing my best includes encouraging all my relatives to submit DNA samples to one of the DNA companies so that I will be able to tell those with matching DNA and no paper trail as precisely as possible on which side of my ancestry they are likely to be related to me.
I also have my own, possibly selfish, reasons for wanting my own relatives to join a genetic genealogy database. Having submitted my own DNA, I now want to compare my results with those of:
If you are reading this, then there is a fair chance that you are already in one of these categories, so please read on! Even if you have no reason to suspect that you are closely related to me, I hope that what follows will help your understanding of what you can learn from your own DNA results.
If you belong to one of the first two categories and you have already submitted a sample of your DNA to one of the DNA companies, then please get in touch so that we can compare results. If you belong to the third category, then we will be automatically put in touch through either the databases maintained by the DNA companies or a third party database like GEDmatch.com (to which you should copy your results right now, if you have not already done so). However, if you have not published the information that you already have on the relevant website, or explained in your profile why you have not done this, then I will assume that you are not sufficiently interested in your ancestry to want to correspond further with me.
To be a successful genealogist, you must record what you know of your relatives in a good software package, offline (preferably) or online (if you have very fast broadband and more faith than I do in cloud computing, and trust your online service provider and your cloud data not to evaporate, as my Vodafone e-mail, mundia.com messages, etc., have done). I use Ancestral Quest. All good genealogy software packages will export selected information, typically just names, dates and places for your direct ancestors, to a GEDCOM file (a standardised format for exchange of genealogical data) which can be uploaded to the relevant DNA website. When a FamilyTreeDNA customer uploads a GEDCOM file, the list of ancestral surnames on the FamilyTreeDNA profile is automatically populated, but the names of your most distant known patrilineal and matrilineal ancestors must be entered separately. You must also link the DNA sample to the correct individual in the family tree. I have seen a few cases of DNA data purporting to be from an individual who has been dead for hundreds of years, something that is not yet commercially available. If the family genealogist persuades you to provide a DNA sample, he or she should be able in return to provide a GEDCOM file to go with it, although it may initially cover only your shared ancestors. I have written much more about this here.
At GEDmatch.com, multiple DNA samples can be linked to a single e-mail address and a single GEDCOM file.
I recognise that there are conflicting opinions on how much detail adoptees should include on their FamilyTreeDNA profiles; the onus is on those who choose to leave their profiles blank to initiate contact with all their possible relatives. My general guidance for those aiming to reunite families separated by adoption, whether using DNA or more conventional methods, is not to rush in without the advice of an experienced genetic genealogist and/or an experienced social worker, as there will only be one chance to make the critical first contact a success.
If you are one of my known or probable relatives and you have not yet submitted a DNA sample to one of the DNA companies, then please consider doing so.
As of 7 June 2018, the two best initial options are:
FamilyTreeDNA is clearly less expensive, particularly for those outside the USA and for those not already paying annual membership fees to ancestry.com. FamilyTreeDNA also has the advantage of geographic and surname projects and Y-DNA and mitochondrial DNA products. However, AncestryDNA has a much bigger customer base (over 5 million as of August 2017, about five times the size of FTDNA, but representing only 30 countries as of October 2016) and will probably find you more relatives. The choice is yours. If you choose AncestryDNA, you can still join the FamilyTreeDNA database via the free autosomal transfer. However, you will need to send a second sample direct to FamilyTreeDNA for Y-DNA or mitochondrial DNA analysis.
There is a growing number of DNA comparison websites, and those interested in finding long-lost relatives should be in all of them. While helping an adoptee who is married to a Murphy, I coined what I have called Murphy's Law of Genetic Genealogy:
If there are N DNA comparison websites and your DNA is in N-1 of them, then your most important match will be in the Nth.
In the words of another widely used metaphor, there are many online gene pools out there and there are many people who are in only one or two of them; for maximum effect, you must fish in all of these pools.
I strongly urge all those who have sent DNA samples to AncestryDNA and those who sent DNA samples to 23andMe between November 2010 and August 2017 to transfer their autosomal DNA data to FamilyTreeDNA.com if they do not already have a Family Finder presence. You may find new long lost relatives who have sent their DNA to FamilyTreeDNA but have not yet transferred their data to GEDmatch.com. This free service was announced on 16 February 2017.
If you have used FamilyTreeDNA for Y-DNA or mitochondrial DNA but used one of the other companies for autosomal DNA, then this advice also applies to you.
Before May 2016 (see here), AncestryDNA and FamilyTreeDNA used the same set of autosomal SNPs, so importing AncestryDNA data to the FamilyTreeDNA database and running comparisons was straightforward. After AncestryDNA changed to a different and smaller set of autosomal SNPs, it took nine months to develop a new matching algorithm.
Whichever DNA company you use, you can copy your results file to www.gedmatch.com for free to compare with people who have used the other lab (or 23andMe.com, which is a poor third from a genealogical perspective).
End-of-year sales have become an annual tradition in the world of genetic genealogy. For the end-of-2013 sale, a USD100 restaurant.com gift card and Family Finder Testing together cost only USD99, so those living in the USA who like eating out could actually profit by USD1 by submitting DNA samples! While this particular offer is unlikely to be repeated, keep an eye out for other special offers. Prices are also traditionally cut around DNA Day (25 April), Mother's Day in the United States (the second Sunday of May) and Father's Day in the United States (the third Sunday of June).
The more remote our known connection, the more interesting I think the results will be.
On my paternal side, I am naturally interested in exploring the various origins of the Waldron surname in Ireland. If you are a male Waldron with Irish roots, I particularly urge you to please consider purchasing a kit! In this case, I initially recommend not Family Finder Testing but the Y-DNA37, Y-DNA67, Y-DNA111 or Big Y-700 products, whichever suits your budget. I would love to see some other Irish Waldron ancestors listed in the Paternal Ancestor Name column on The WALDRON Surname DNA Project - Y-DNA Colorized Chart (where I am kit number 310654). If you are a male Waldron and are already a FamilyTreeDNA customer, please click the Join Request link on this page to join the project. My genealogical research has been stuck for many decades at my GGgrandfather Thomas Waldron (c1825/6 Roscommon-1902 Limerick). I would love to find a Waldron Y-DNA match to give me some idea where I should be looking in order to go back another generation. Likewise, I would love to find more Irish Waldron Y-DNA non-matches in order to rule out some of the wild goose chases on which I have gone over the years.
Also on my paternal side, I am particularly interested in exploring my relationship to a number of reputed fifth cousins, whom I know about from notes in my County Clare grandmother's diaries and in her letters about meetings with various sets of her third cousins. In particular, these include the Nolans of Kilkee, the Houlihans of Killard, the Burkes of Cloonnagarnaun and their apparent descendants the O'Maras and O'Connells of Moveen. While the family friendship with all of these families remains strong down to the present day, nobody remembers any longer who our common ancestors were, and furthermore the genealogical records which might unlock these secrets have not survived. There are also the Clancys of Cranny who remained the closest of friends with the Clancys of Killard (one of whom was my greatgrandmother) long after the details of their relationship were forgotten. If you belong to one of these families, I particularly urge you to please consider purchasing a kit and ordering Family Finder Testing! I also encourage all Clancys to order Y-DNA analysis. If you have any ancestors from any part of County Clare, then as soon as you get a password for your FTDNA kit, I recommend that you visit the Clare Roots project Activity Feed and hit the JOIN button at the top right. (Disclaimer: I am Administrator of this project. Project administrators can see your DNA results when you join their projects and can help you to interpret them.)
Finally on my paternal side, DNA helped to confirm the relationship between the Blackalls and Clancys of Killard in County Clare.
Similarly, on my maternal side, I am particularly interested in exploring whether my grandparents (both Durkans by birth) or my grandfather's parents (also both Durkans by birth) might have been related other than by marriage. Both couples lived in the townland of Cuilmore (the one in the civil parish of Kilconduff in County Mayo). I would also like to explore how my mother was related to her "Aunt Ellie" McDonagh, who clearly wasn't a genealogical aunt, but did have a Durkan grandmother. Aunt Ellie was from the townland of Cuillalea in County Mayo, from an area apparently known locally as Cartoonbawn or Cortoonbawn or Cortoon Bawn or Ballincurry or Ballinacurry, or however you would like to spell it yourself. If you are connected to the Durkans or to the McDonaghs, please consider purchasing a kit and ordering Family Finder Testing! [Sending my DNA to AncestryDNA in 2015 threw up a critical clue in the "Aunt Ellie" mystery.] I am not yet aware of a County Mayo project at FTDNA, but would love to see one started.
If you descend from any of these families and have already submitted a DNA sample, then please copy your raw data to GEDmatch.com and let me know your GEDmatch kit number; mine is VA864386C1. I am aware of at least one Nolan descendant and of several Houlihan descendants who have submitted DNA samples but have not yet sent me a GEDmatch kit number.
On all sides, for reasons which will become clear as you read on, I would like to have the DNA of at least eight third cousins available for comparison, at least one of them descended from each pair of my greatgreatgrandparents.
I have prepared this web page in an effort to help my known and probable and possible relatives, and anyone else who is interested, to understand the rapidly evolving, but often poorly explained and poorly understood, discipline of genetic genealogy. It may even be of help to the service providers struggling to help their customers to make their way more easily up the steep learning curve that I have experienced in my own adventures in genetic genealogy. I certainly hope that it will help customers and FamilyTreeDNA itself respectively to get more out of the FamilyTreeDNA.com website and to fix some of its shortcomings.
If you fall into one of the categories of my known, probable or possible relatives and would like to consult my password-protected online family tree, please read on or scroll down to the end of this web page for details of how to obtain a password.
Some people are reluctant to submit DNA samples to commercial organisations and/or to submit the resulting raw data to third party DNA-comparison websites and/or to publish kit numbers which would allow strangers to do one-to-one DNA comparisons, often for reasons that they cannot articulate. For example, one person considered it a bad idea to publish GEDmatch kit numbers in a closed facebook.com group with only 2,824 members, monitored by a handful of administrators of great integrity. Any information posted in such a group is far less public than information that is posted at GEDmatch.com itself, where the audience is orders of magnitude larger and registration is automated and not monitored. All GEDmatch.com users have already happily handed over their actual DNA to a commercial organisation and handed over all the raw data extracted from it by the commercial organisation to GEDmatch, so there is no reason to think that merely giving the kit number in a closed facebook group is a bad idea. Of course, DNA may reveal unknown or unsuspected relationships, but GEDmatch.com users must have been aware of that before sending off their samples, and we cannot change history. Those whose genetic ancestry has been concealed from them have a right, and usually a great desire, to know it.
The genetic locations examined for genealogical purposes are generally not the same as the genetic locations that have been examined for evidence of possibly elevated risk of certain diseases. It is widely believed that some individuals have "good genes" and are likely to live longer, and other individuals have "bad genes" and are likely to suffer more ill health. The margins of error associated with health predictions derived from DNA are just as hard to find, and probably just as large, as the margins of error associated with the estimated ethnicity percentages derived from DNA and peddled by many DNA companies.
Revealing to insurers and financiers that one has good genes is likely to make one's health care less expensive and one's pensions more expensive, and conversely revealing that one has bad genes is likely to make one's pensions less expensive and one's health care more expensive. If an insurance company got its hands on one's raw data (which it certainly won't do merely by knowing the GEDmatch.com kit number), it might wish to reduce premiums for illness and death insurance, as I call them, and increase premiums for pensions, or vice versa. (The marketing people call these products health and life insurance, but somehow still market fire and theft insurance!) In many jurisdictions (certainly in the U.S.), law prohibits insurance companies from charging higher premiums to those whose bad genes put them at higher-than-average risk of certain diseases. This is referred to as "non-discrimination", although the law actually discriminates against those whose good genes put them at lower-than-average risk of the same diseases, by forcing them to pay the same as if they were at average risk.
Others worry that DNA samples sent to genetic genealogy companies may be used, with or without search warrants, to identify them or their close or distant relatives as suspects for criminal offences. Those of us who are not criminals need not worry about our DNA being used to convict us of crime. Those who don't trust the courts and juries to see reasonable doubts in DNA evidence probably don't trust courts and juries to process other types of evidence fairly either. The DNA locations used to uniquely identify criminal suspects beyond reasonable doubt are the fastest-mutating locations, large numbers of which are unlikely to be the same for any two individuals bar identical twins. The DNA locations used to identify closely related individuals are slower-mutating locations, which are very likely to be the same for those who are closely related.
This long introduction may have raised lots of questions in the minds of readers. Please read on for the answers to those questions.
[The text of these chapters still needs to be embellished with many more illustrations, which I might have to borrow from someone like Maurice Gleeson!]
Having been increasingly addicted to genealogy from the age of 12 or earlier and having a degree in mathematical sciences with a particular interest in probability and statistics, it was inevitable that I would develop an interest in DNA and in genetic genealogy.
I attended various one-off lectures on these subjects over a number of years, and read lots of explanations, often ending up more confused rather than less confused after an effort to improve my understanding. I have still not found the inspirational book or inspirational teacher that suddenly fits everything into place within the context of my prior knowledge, such as happened with probability and statistics when I took Adrian Raftery's course (251) as a third year undergraduate at Trinity College Dublin back in 1983/4. (In the genetic genealogy field, my brief exposure to lectures by Maurice Gleeson and Dan Bradley has, however, helped a lot.)
The more I have read, the more sceptical I have become about the lack of scientific and statistical rigour in genetic genealogy and about some of the inferences apparently drawn from DNA evidence, to the extent that I considered entitling this web page "A Sceptic's Adventures in Genetic Genealogy". Then I discovered that there is an ongoing debate about whether the second letter of sceptic should be a C or a K and whether the spelling difference reflects a slight nuance in the meaning of the word rather than merely the side of the Atlantic Ocean on which I grew up! When I was publicly accused of being a DNA "Luddite", I thought I should perhaps put that word in the title, or perhaps just admit to being "confused", but I eventually settled for the more neutral "beginner".
My scepticism made me reluctant to submit my DNA for analysis, and I continue to exercise caution rather than jump to unwarranted conclusions on the basis of sloppy statistical analysis, sloppy science and sloppy explanations, all of which I still believe are typical of the DNA industry.
On the third day of the joint Back To Our Past (BTOP) and Genetic Genealogy Ireland 2013 shows at the Royal Dublin Society (20 Oct 2013), Kathy Borges of the International Society of Genetic Genealogy (ISOGG) eventually did persuade me to purchase Y-DNA and autosomal DNA products from Family Tree DNA. Notification arrived by e-mail that my autosomal DNA results were available online on 16 Nov 2013 and that my Y-DNA results were available online on 21 Nov 2013.
I should probably try to weave my initial thoughts and the answers that I have found to my questions into the ISOGG Wiki, but for now I still have more questions than answers and it is much quicker and easier to post them all together here on this single web page on my own personal web site documenting my own adventures as a sceptic in genetic genealogy.
Perhaps I should go and study the subject formally somewhere like The Mathematical Genetics Group at the University of Oxford.
I hope that this chapter will help to dispel some myths, in particular about the need for a little jargon, and that the next chapter will get me some feedback about interpreting my own autosomal DNA results, or lack thereof. To begin with, however, some definitions will help to add some rigour.
My good friend Kevin O'Brien summarised the difficulties of DNA research succinctly in one sentence:
"This DNA research is different from tracing and is more like geometry as you are given the answer and then you have to prove the theorem."
To prove the theorems, one must understand a few essential basic concepts.
If you are reading this page, you hopefully have some basic understanding of DNA and of genetic genealogy. For those who don't, I had better begin by outlining some basic definitions.
DNA (short for deoxyribonucleic acid) is material contained within human cells (and the cells of any living organism) and inherited by children from their parents. Genetic genealogy is the use of variations in DNA between individuals in order to assist genealogical research. For the purposes of genetic genealogy, DNA is represented by long strings of the letters A, C, G and T, for example ACCTGAGTCAGTAC. As far as genetic genealogy is concerned, the precise details of the chemical structures which these four letters represent are unimportant. (If you must know, they are initials representing the four bases adenine (A), cytosine (C), guanine (G) and thymine (T).)
As an occasional computer programmer, I like to describe something like ACCTGAGTCAGTAC as a string of letters and something like GTCAGT as a substring of ACCTGAGTCAGTAC. The words sequence and subsequence may be used by others as synonyms of string and substring.
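For readers who program, the string and substring terminology maps directly onto ordinary string operations. The following is only a minimal Python sketch using the two example strings from the text; the variable names are illustrative and not taken from any genetic genealogy tool.

```python
# The example string and substring from the text.
genome_fragment = "ACCTGAGTCAGTAC"
candidate = "GTCAGT"

# `in` tests whether candidate occurs as a contiguous run of letters.
is_substring = candidate in genome_fragment

# str.find returns the 0-based offset of the first occurrence, or -1 if absent.
offset = genome_fragment.find(candidate)

print(is_substring, offset)
```

Note that Python counts offsets from 0, whereas locations on a chromosome are conventionally counted from 1.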
A person's genome is the very long string containing his or her complete complement of DNA. For the purposes of genetic genealogy, various shorter strings from within the genome will be of greater relevance. These shorter strings include, for example, chromosomes, segments and short tandem repeats (STRs).
The human genome is made up, inter alia, of 46 chromosomes.
The FTDNA glossary (faq id: 684) defines a DNA segment as "any continuous run or length of DNA" "described by the place where it starts and the place where it stops". In other words, a DNA segment runs from one location (or locus) on the genome to another location. For example, the segment on chromosome 1 starting at location 117,139,047 and ending at location 145,233,773 is represented by a long string of 28,094,727 letters (including both endpoints).
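The length quoted above counts both endpoints, which is why it is one more than the simple difference between the two locations. A small Python sketch of that arithmetic follows; the function name segment_length is hypothetical and is not part of any FTDNA tool.

```python
def segment_length(start: int, end: int) -> int:
    """Number of locations in a segment, counting both endpoints."""
    if end < start:
        raise ValueError("end must not be before start")
    return end - start + 1


# The chromosome 1 example from the text.
length = segment_length(117_139_047, 145_233_773)
print(f"{length:,}")
```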
For simplicity, I will refer to the value observed at each location (A, C, G or T) as a letter; others may use various equivalent technical terms such as allele, nucleotide or base instead of 'letter'.
The FTDNA glossary does not define the word block, but FTDNA appears to use this word frequently on its website merely as a synonym of segment.
A short tandem repeat (STR) is a string of letters consisting of the same short substring repeated several times, for example CCTGCCTGCCTGCCTGCCTGCCTGCCTG is CCTG repeated seven times.
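Counting the repeats in an STR is easy to express in code. The sketch below assumes that the whole string consists of nothing but back-to-back copies of the repeated substring, as in the example above; the function name count_tandem_repeats is illustrative only.

```python
def count_tandem_repeats(sequence: str, motif: str) -> int:
    """Return the number of copies if sequence is motif repeated end to end, else 0."""
    if not motif or len(sequence) % len(motif) != 0:
        return 0
    copies = len(sequence) // len(motif)
    return copies if motif * copies == sequence else 0


# Build the example from the text: CCTG repeated seven times.
str_example = "CCTG" * 7
repeat_count = count_tandem_repeats(str_example, "CCTG")
print(repeat_count)
```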
A gene is any short segment associated with some physical characteristic, but is generally too short to be of any great use or significance in genetic genealogy.
Every random variable has an expected value or expectation which is the average value that it takes in a large number of repeated experiments. For example, if an unbiased coin is tossed 100 times, the expected value of the proportion of heads is 50%. Similarly, if a person has many grandchildren, then the expected value of the proportion of the grandparent's autosomal DNA inherited by each grandchild is 25%. Just as one coin toss does not result in exactly half a head, one grandchild will not inherit exactly 25% from every grandparent, but may inherit slightly more from two and correspondingly less from the other two.
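The coin-tossing half of this example can be tried out with a short simulation. This is only an illustrative Python sketch: the seed and the number of tosses are arbitrary choices of mine, and it makes no attempt to model how DNA is actually inherited.

```python
import random

random.seed(2013)  # arbitrary seed so that repeated runs give the same tosses

NUM_TOSSES = 100_000

# Each toss is a head with probability one half.
heads = sum(random.random() < 0.5 for _ in range(NUM_TOSSES))
proportion_heads = heads / NUM_TOSSES

# The observed proportion should be close to, but rarely exactly, 50%.
print(f"{proportion_heads:.4f}")
```

The more tosses are simulated, the closer the observed proportion tends to get to the expected value, even though no single toss can ever produce half a head.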
There are four main types of DNA, which each have very different inheritance paths, and which I will discuss in four separate chapters later:
While autosomal DNA comes equally from both parents, this is not true of DNA as a whole. Not only does mtDNA come from the mother only, but we will also see below that the Y chromosome is much shorter than the X chromosome. Thus everyone inherits slightly more DNA from the mother than from the father, and this is particularly true for men.
The first exposure to DNA analysis for some readers may have been the two-part Blood Of The Irish television documentary first broadcast in 2008. The second part can be viewed on YouTube. The genetic narrative jumps back and forth, without any explanation, between Y-DNA and mtDNA. The climax of the programme was the revelation that three children from a sample of unspecified size from Carron and Kilnaboy in County Clare had similar mitochondrial DNA to ancient remains found in a nearby cave, estimated to be 3,500 years old.
The objective of the programme may have been to investigate whether the direct male line and direct female line ancestors thousands of years ago of those living in Ireland today were men and women who also lived in Ireland thousands of years ago. However, much of the programme appeared to assume that this desired conclusion had already been proven, and to extrapolate from the direct male line and direct female line ancestors to the billions of other ancestors living at the same time.
There have been great scientific, technological and commercial advances in the world of DNA analysis since 2008, but, even by the standards of its time, this programme left much to be desired.
If you are not a computer programmer or software developer, then you may want to skip ahead to the next section on mutation.
Traditional genealogy applications will produce pedigree charts and descendancy charts for any individual in a GEDCOM file showing respectively all the ancestors from whom the root individual may have inherited autosomal DNA and all the descendants to whom the root individual may have passed on autosomal DNA (and their spouses). These charts were probably not designed with autosomal DNA in mind. It is just coincidence that one can potentially inherit autosomal DNA from all of one's ancestors, and that one can potentially pass on autosomal DNA to all of one's descendants.
I am still looking for a genealogy application which will produce similar pedigree charts and descendancy charts showing the inheritance paths of the other three types of DNA. For example, an X pedigree chart should show just the ancestors from whom the root individual could have inherited segments of X-DNA and an X descendancy chart should show just the individuals to whom the root individual might have passed on segments of X-DNA.
Blank X descendancy charts are widely available, but software to fill them in for specific individuals is hard to find.
It is surprising that even GEDmatch.com has not as of September 2014 implemented X pedigree charts.
Back in 1991, I wrote a program myself to produce descendancy charts showing only descendants inheriting the Y chromosome from the root individual, but it assumed the underlying database was in the original PAF format and contained fewer than 32K individuals. It is hardly worth resurrecting now: PAF switched to a new format and was later discontinued, my own database is now several times that maximum size, and it has become almost impossible to find a Pascal compiler for a modern computer.
My hope is that these charts can be most easily added to TNG which I use for my own genealogy website. I have started a discussion of this topic in the TNG forums.
Programmers working on genealogy software may be interested in the minor modifications to existing code required to provide these options. A new variable with four possible values (Y, X, autosomal [the current default] and mt) is required. Four cases must be dealt with depending on the value of this new variable. The default autosomal case remains unchanged, certainly if there is already a choice as to whether spouses of descendants (who clearly do not inherit the root individual's autosomal DNA) are included or omitted. The other three cases are dealt with as follows:
Descendants inheriting the root individual's Y chromosome:
IF descendant is female THEN
    proceed to next descendant
ELSE {descendant is male}
    output descendant
    stack descendant's children for later processing
    proceed to next descendant
Ancestors from whom the root individual may have inherited X-DNA:
IF ancestor is female THEN
    stack ancestor's father and mother for later processing
    output ancestor
    proceed to next ancestor
ELSE {ancestor is male}
    stack ancestor's mother for later processing
    output ancestor
    proceed to next ancestor
Descendants who may have inherited the root individual's X-DNA:
IF descendant is female THEN
    output descendant
    stack descendant's children for later processing
    proceed to next descendant
ELSE {descendant is male}
    output descendant
    stack descendant's daughters for later processing
    proceed to next descendant
Descendants inheriting the root individual's mitochondrial DNA:
IF descendant is male THEN
    output descendant
    proceed to next descendant
ELSE {descendant is female}
    output descendant
    stack descendant's children for later processing
    proceed to next descendant
Ann Turner did this for the MS-DOS version of Personal Ancestral File (PAF) away back in 1994.
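As a rough illustration only, the cases above might be sketched in Python as follows. The Person record, the link helper and the function names are all hypothetical and are not taken from PAF, TNG or any other genealogy package. The sketch also assumes that nobody appears twice in the tree; where cousins have married, a real implementation would need to guard against outputting the same person more than once.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Person:
    # Hypothetical minimal record; real genealogy software stores far more.
    name: str
    sex: str  # "M" or "F"
    father: Optional["Person"] = None
    mother: Optional["Person"] = None
    children: List["Person"] = field(default_factory=list)


def link(child, father=None, mother=None):
    """Record both directions of each parent/child relationship."""
    child.father, child.mother = father, mother
    for parent in (father, mother):
        if parent is not None:
            parent.children.append(child)


def descendants(root, dna="autosomal"):
    """Names of the root and of the descendants who can inherit the chosen DNA."""
    output, stack = [], [root]
    while stack:
        person = stack.pop()
        if dna == "Y" and person.sex == "F":
            continue  # a female descendant does not inherit the Y chromosome
        output.append(person.name)
        if dna == "X" and person.sex == "M":
            # a male passes his X chromosome only to his daughters
            following = [c for c in person.children if c.sex == "F"]
        elif dna == "mt" and person.sex == "M":
            following = []  # a male does not pass on mitochondrial DNA
        else:
            following = person.children
        stack.extend(reversed(following))  # reversed keeps the stored order
    return output


def x_ancestors(root):
    """Names of the root and of the ancestors who may have contributed X-DNA."""
    output, stack = [], [root]
    while stack:
        person = stack.pop()
        output.append(person.name)
        # a female has X-DNA from both parents, a male from his mother only
        parents = [person.father, person.mother] if person.sex == "F" else [person.mother]
        stack.extend(p for p in reversed(parents) if p is not None)
    return output
```

The single dna argument plays the role of the new four-valued variable, and the default autosomal case falls through to the unchanged behaviour of stacking every child.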
The letters observed at each location on a child's genome are typically inherited unchanged (other than by recombination) from one or other parent.
A son inherits his Y chromosome and one set of 22 autosomes virtually unchanged from his father and inherits his X chromosome, his mitochondrial DNA and another set of 22 autosomes virtually unchanged from his mother.
Similarly, a daughter inherits one X chromosome and one set of 22 autosomes virtually unchanged from her father and inherits her mitochondrial DNA, another X chromosome and another set of 22 autosomes virtually unchanged from her mother.
However, isolated mutations, essentially just transcription errors, can occur.
Mutation rates vary greatly along the human genome.
At most locations on the genome, the mutation rate is effectively zero and the same letter is observed for all humans.
Some locations have a slightly greater mutation rate, in the range of one mutation in the entire history of mankind. Such locations on the Y-chromosome and in mitochondrial DNA are very useful for slotting individuals into the appropriate locations on the relevant evolutionary tree or phylogenetic tree. While a great deal of effort has gone into identifying such locations, they are not useful for practical genealogical purposes, as two individuals with the same letters at a set of such locations may still not have any common ancestor within thousands of years. By 2015, hopes were high that some surname-specific Y-DNA mutations might soon be identified.
If locations have a higher mutation rate, perhaps as high as 1-in-20 or even 1-in-10 reproductions, then comparing the letters observed at a set of such locations can have great genealogical value. Two individuals with the same observations at a set of such fast-mutating locations are very likely to have a relatively recent common ancestor or common ancestral couple.
Estimation of the time to most recent common ancestral couple depends crucially on both the number of locations compared and on the estimated mutation rates for each of those locations, based on research involving many parent/child observations.
There are two different basic units in which the length of a segment of DNA is frequently measured, and a third unit used only for the types of DNA which are subject to recombination, namely autosomal DNA and X-DNA:
CHROMOSOME | START LOCATION | END LOCATION | LENGTH | CENTIMORGANS |
1 | 44805958 | 47175419 | 2369461 | 1.08 |
2 | 106254302 | 116973471 | 10719169 | 8.09 |
2 | 157113214 | 159347591 | 2234377 | 2.44 |
3 | 11537627 | 12600665 | 1063038 | 1.41 |
4 | 165504024 | 167423895 | 1919871 | 2.29 |
6 | 29267608 | 31571470 | 2303862 | 1.53 |
11 | 46718718 | 56273717 | 9554999 | 1.54 |
11 | 103382220 | 105990699 | 2608479 | 2.38 |
17 | 36593956 | 38838321 | 2244365 | 1.92 |
If a segment of X-DNA or autosomal DNA has lots of SNPs, then two people's DNA is unlikely to be identical purely by chance on that segment.
Conversely, if a segment is small in terms of centiMorgans, then it won't have seen many recombinations over the generations, and may have been inherited unchanged from a very distant ancestor, particularly in the case of X-DNA which is not subject to recombination when passed from father to daughter.
Thus, to be sure that a segment is inherited from a recent common ancestor, one would like to see that it is long on both the centiMorgan and SNP scales.
Given a long shared segment, unless we have a complete pedigree for both parties going back many generations, it will always be difficult to know whether the shared segment comes from a known common ancestor or an unknown common ancestor on some other ancestral line.
Converting between units of measurement
There must be plots somewhere showing the monotonic relationship between the length along each chromosome measured in base pairs, the length along the chromosome measured in centiMorgans and the length along the chromosome measured in SNPs, but I have not yet come across them.
Genealogists are probably used to variables which can be measured in either of two units of measurement which are linearly related to each other. For example, those with nineteenth century rural Irish ancestors will have converted the areas of their ancestors' landholdings from the Irish acres generally used in the Tithe Applotment Books to the statute acres used in Griffith's Valuation using the fixed conversion ratio 121 Irish acres=196 statute acres. A graph of areas in Irish acres versus areas in statute acres will look like a straight line.
For the three units of measurement in which DNA is measured, there are no such fixed conversion ratios, as the relationships between the units of measurement are non-linear. The local conversion ratios between base pairs, centiMorgans and SNPs vary considerably along the genome. Graphs of the relationship between base pairs and centiMorgans or between base pairs and SNPs or between centiMorgans and SNPs will slope upwards, but otherwise will not look anything like a straight line.
In the absence of such representative graphs, the best that I can show here is a table based on the local conversion ratios in a (non-random) sample of 4,339 regions (those where I am half-identical with one or more of my 381 FTDNA-overall-matches as of 10 Jan 2014; by construction, this is an unrepresentative sample). These may be biased estimates of the average conversion ratios throughout the genome.
| bp/cM | bp/SNP | SNP/cM |
Minimum | 112,200 | 118 | 89 |
Average | 1,413,219 | 2,292 | 331 |
Maximum | 10,576,336 | 18,786 | 2,384 |
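To make the idea of a local conversion ratio concrete, the sketch below computes bp/cM for the nine segments tabulated earlier (chromosome, length in base pairs, length in centiMorgans). The variable and function names are illustrative only, and nine segments are far too few to say anything about the genome as a whole.

```python
# (chromosome, length in base pairs, length in centiMorgans),
# copied from the table of segments shown earlier
SEGMENTS = [
    (1, 2_369_461, 1.08),
    (2, 10_719_169, 8.09),
    (2, 2_234_377, 2.44),
    (3, 1_063_038, 1.41),
    (4, 1_919_871, 2.29),
    (6, 2_303_862, 1.53),
    (11, 9_554_999, 1.54),
    (11, 2_608_479, 2.38),
    (17, 2_244_365, 1.92),
]


def bp_per_cm(length_bp, length_cm):
    """Local conversion ratio for one segment."""
    return length_bp / length_cm


ratios = [bp_per_cm(bp, cm) for _, bp, cm in SEGMENTS]

for (chromosome, _, _), ratio in zip(SEGMENTS, ratios):
    print(f"chromosome {chromosome:>2}: {ratio:>12,.0f} bp/cM")
print(f"largest ratio is {max(ratios) / min(ratios):.1f} times the smallest")
```

Even in this tiny sample the ratio differs several-fold from one segment to the next, which is why no single conversion factor can be quoted.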
Each of the measurement units defined above can also be converted into percentages of the total length of the genome, which are a much simpler way of viewing the results for autosomal DNA and X-DNA, which both come in segments from multiple ancestors.
The use of percentages assumes that a precise value of the total (the denominator in the percentage calculation) is known.
The total length of the human genome in base pairs is typically imprecisely specified as "over 3 billion DNA base pairs" (see table in Wikipedia). This total length, however, includes only one copy of each of the 22 autosomal chromosomes. The genome actually contains around 6 billion base pairs, as it contains two copies of each autosomal chromosome. James Michael Connor (Medical Genetics for the MRCOG and Beyond, RCOG, 2005, page 3) confirms, for example, that there are "280Mb in each copy of chromosome 1", so that the base pairs figures in the Wikipedia table clearly represent the numbers of base pairs in one copy of each autosomal chromosome. Gianpiero Cavalleri confirms that, roughly speaking, "Each of us inherits 6 billion letters of DNA, 3 billion from our mother and 3 billion from our father."
Since it is common to speak about the length of DNA, the width of the human genome can correspondingly be viewed as two base pairs for the autosomal chromosomes and for a woman's X chromosomes; elsewhere it can be viewed as one base pair wide. The following table summarises the details:
| Male: Length | Male: Width | Male: Total | Female: Length | Female: Width | Female: Total |
Autosomal | 2,881,033,286 | 2 | 5,762,066,572 | 2,881,033,286 | 2 | 5,762,066,572 |
X | 155,270,560 | 1 | 155,270,560 | 155,270,560 | 2 | 310,541,120 |
Y | 59,373,566 | 1 | 59,373,566 | 0 | | |
Mitochondrial | 16,569 | 1 | 16,569 | 16,569 | 1 | 16,569 |
GRAND TOTAL | 3,095,693,981 | | 5,976,727,267 | 3,036,320,415 | | 6,072,624,261 |
Note that the X chromosome contains almost three times as many base pairs as the Y chromosome, so the total number of base pairs in the female human genome is greater than the total number of base pairs in the male human genome.
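The grand totals can be checked by multiplying each length by its width and summing. In this small sketch the lengths are copied from the table; encoding the absence of a Y chromosome in women as a width of 0 is simply a convenience of the sketch.

```python
# Lengths in base pairs, as given in the table above
LENGTH = {
    "autosomal": 2_881_033_286,
    "X": 155_270_560,
    "Y": 59_373_566,
    "mitochondrial": 16_569,
}

# Width = number of copies carried (0 for the Y chromosome in women,
# an encoding chosen for this sketch)
WIDTH = {
    "male": {"autosomal": 2, "X": 1, "Y": 1, "mitochondrial": 1},
    "female": {"autosomal": 2, "X": 2, "Y": 0, "mitochondrial": 1},
}


def total_base_pairs(sex):
    """Sum of length times width over all components of the genome."""
    return sum(LENGTH[part] * width for part, width in WIDTH[sex].items())


for sex in ("male", "female"):
    print(f"{sex}: {total_base_pairs(sex):,} base pairs")
```

The difference between the two totals is exactly the length of the X chromosome minus the length of the Y chromosome.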
Despite this confusion about the total length of the genome, the base pair remains the most precise and unambiguous of the three units of measurement; however, it is also the least appropriate as a measure of the genealogical relevance of a shared segment of DNA.
The total number of cM is also imprecisely specified, apparently varying slightly from one DNA website to another. Figures for the length in cM of the autosomal chromosomes only and figures for the length in cM of the autosomal chromosomes and the X chromosome combined may be seen and should not be confused. Furthermore, the definition of the centiMorgan is based on empirical observation of recombination frequencies, and thus can vary based on the particular experimental data on which it is based.
The total number of SNPs used by a particular DNA company is at least directly observable in the raw data downloadable from the company. For example, my raw autosomal DNA data from FamilyTreeDNA.com includes precisely 696,752 SNPs, with one letter from my paternal chromosome and one letter from my maternal chromosome observed at each SNP. My raw X-DNA data includes one letter from each of precisely 17,797 SNPs. If I were female, then I would have another letter from my second X chromosome at each of these 17,797 SNPs. As with centiMorgans, the definition of SNPs is based on empirical observation of variation, and thus can also vary based on the particular experimental data on which it is based and on the DNA company collecting the data. A location where no variation is observed in a small sample may exhibit variation in a larger sample and be reclassified as a SNP. DNA observation is also subject to measurement error, so there will be occasional SNPs which result in no calls so that there can be slight variation in the number of SNPs observed between different individuals even with the same DNA company.
For all these reasons, it is critically important to avoid ambiguity by giving precise details of how the centiMorgan or the SNP has been defined, including specifying the full length of the genome and its components according to the relevant definition.
One way of getting a feel for the length of your autosomes in SNPs and cMs is to do a one-to-one comparison of your own kit with your own kit at GEDmatch.com. This table shows my details:
Chr | End Location | Centimorgans (cM) | SNPs | bp/cM | bp/SNP | SNP/cM |
1 | 247,169,190 | 281.5 | 57,186 | 878,043 | 4,322 | 203 |
2 | 242,683,192 | 263.7 | 55,850 | 920,300 | 4,345 | 212 |
3 | 199,310,226 | 224.2 | 45,709 | 888,984 | 4,360 | 204 |
4 | 191,140,682 | 214.5 | 39,248 | 891,099 | 4,870 | 183 |
5 | 180,623,543 | 209.3 | 40,685 | 862,989 | 4,440 | 194 |
6 | 170,732,528 | 194.1 | 46,476 | 879,611 | 3,674 | 239 |
7 | 158,811,958 | 187.0 | 36,759 | 849,262 | 4,320 | 197 |
8 | 146,255,887 | 169.2 | 35,757 | 864,396 | 4,090 | 211 |
9 | 140,147,760 | 167.2 | 31,717 | 838,204 | 4,419 | 190 |
10 | 135,297,961 | 174.1 | 37,783 | 777,128 | 3,581 | 217 |
11 | 134,436,845 | 161.1 | 35,392 | 834,493 | 3,799 | 220 |
12 | 132,276,195 | 176.0 | 34,384 | 751,569 | 3,847 | 195 |
13 | 114,108,121 | 131.9 | 26,933 | 865,111 | 4,237 | 204 |
14 | 106,345,097 | 125.2 | 22,630 | 849,402 | 4,699 | 181 |
15 | 100,214,895 | 132.4 | 21,052 | 756,910 | 4,760 | 159 |
16 | 88,668,978 | 133.8 | 22,030 | 662,698 | 4,025 | 165 |
17 | 78,637,198 | 137.3 | 19,564 | 572,740 | 4,019 | 142 |
18 | 76,112,951 | 129.5 | 21,052 | 587,745 | 3,615 | 163 |
19 | 63,776,118 | 111.1 | 14,454 | 574,042 | 4,412 | 130 |
20 | 62,374,274 | 114.8 | 17,887 | 543,330 | 3,487 | 156 |
21 | 46,909,175 | 70.1 | 9,948 | 669,175 | 4,715 | 142 |
22 | 49,528,625 | 79.1 | 10,112 | 626,152 | 4,898 | 128 |
All autosomes | 2,865,561,399 | 3587.1 | 682,608 | 798,852 | 4,198 | 190 |
The End Location column may understate the chromosome lengths in bps, as it probably refers to the location of the last SNP on the chromosome, and there may be several thousand more bps beyond that last SNP.
Note that the variation in the overall ratios between the different units of measurement from one chromosome to another is small compared to the variation between smaller segments illustrated in an earlier table and that the various ratios are very different from those in the earlier unrepresentative sample.
While the length in centiMorgans of each chromosome appears to be the same from one FTDNA customer to another, the number of SNPs observed on every chromosome varies from customer to customer and the end locations can also vary in some cases.
Note that for each of the chromosomes, the probability of recombination is greater than 50%, ranging from 50.4% for Chromosome 21 to 94.0% for Chromosome 1. Conversely, the probability of inheriting an entire chromosome intact from one grandparent ranges from 6.0% for Chromosome 1 to 49.6% for Chromosome 21.
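The percentages just quoted appear to be consistent with treating crossovers as a Poisson process, under which a chromosome of length L centiMorgans is passed on without any recombination with probability exp(-L/100). That model is an assumption inferred from the figures, not something stated above, and the function names below are illustrative.

```python
import math


def prob_intact(length_cm):
    """Probability of no recombination, assuming a Poisson model of crossovers."""
    return math.exp(-length_cm / 100)


def prob_recombination(length_cm):
    """Probability of at least one recombination under the same assumption."""
    return 1 - prob_intact(length_cm)


# Lengths in cM for three chromosomes, copied from the GEDmatch table above
for chromosome, length_cm in [(1, 281.5), (21, 70.1), (22, 79.1)]:
    print(
        f"Chromosome {chromosome}: "
        f"recombination {prob_recombination(length_cm):.1%}, "
        f"intact {prob_intact(length_cm):.1%}"
    )
```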
Although in theory the chromosomes are numbered in order of decreasing length, this is not the case in the table, where Chromosome 22 is longer on all three scales than Chromosome 21.
It is neither practical nor essential nor affordable to observe all 6,072,624,261 base pairs in the female human genome, as the vast majority of these have the same value for all women, and similarly for men.
Instead we just observe the locations which are known to vary from one person to another.
In the case of autosomal DNA, FTDNA makes observations at 696,752 paternal SNPs and at the corresponding 696,752 maternal SNPs.
For each of the 696,752 locations, two letters are observed, say A and G, but it is not possible to tell whether the A comes from the paternal copy of the relevant chromosome and the G from the maternal copy, or vice versa.
Presumably if we moved along the genome observing every letter along the way we could keep track of which were the paternal letters and which were the maternal letters; instead, we pop in just once every 4000 or so base pairs, at which stage we can no longer look back and see which is the paternal chromosome and which the maternal chromosome.
In other words, instead of observing 696,752 ordered pairs of letters (of which there are 16 possible values, namely any one of ACGT with any one of ACGT: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG and TT), since the parental source of the letters cannot be observed, we observe 696,752 unordered pairs (of which there are ten possible values: AA, CC, GG, TT, AC, AG, AT, CG, CT and GT).
In other words, observed autosomal DNA is represented not by two (unobservable) ordered strings of letters, but by one array of unordered pairs of letters.
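The counts of 16 ordered pairs and ten unordered pairs can be confirmed by enumeration. This is a minimal sketch using Python's standard itertools module; the variable names are arbitrary.

```python
from itertools import combinations_with_replacement, product

LETTERS = "ACGT"

# Ordered pairs: (paternal letter, maternal letter), so AC differs from CA
ordered = ["".join(pair) for pair in product(LETTERS, repeat=2)]

# Unordered pairs: the parental source is unknown, so AC and CA coincide
unordered = ["".join(pair) for pair in combinations_with_replacement(LETTERS, 2)]

print(len(ordered), ordered)
print(len(unordered), unordered)
```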
The observed unordered data is said to be unphased; the unobserved ordered data which we would like to have is said to be phased. There are various limited techniques available for phasing the unordered data. A certain amount of simple phasing of a child's data is possible if samples are available from both of the child's parents. Ancestry.com uses more sophisticated phasing algorithms, particularly in the new matching process which it introduced in November 2014.
I took an interest in equine pedigrees from a very young age, even before I began to be interested in human pedigrees. I have long taken an interest in the activities of Equinome, a University College Dublin campus company which claims to have identified a SNP called the speed gene which predicts a racehorse's distance preference. It was only when I realised that the unordered pairs observed at the location of Equinome's speed gene can be C:C, C:T and T:T that I realised the vast difference between the two possible A-with-T and C-with-G base pairs in a single chromosome and the ten possible unordered pairs observed in maternal and paternal chromosome pairs.
A region is a run of unordered pairs, starting at one specified locus on a specific chromosome and ending at another specified locus on the same chromosome. In theory, the region comprises one DNA segment on the paternal copy of the chromosome and another DNA segment on the maternal copy of the same chromosome. In practice, neither of these segments is independently observed.
Consider the comparison between person W's DNA and person Z's DNA in a particular region on a particular chromosome. (Since the mathematician's usual generic variables X and Y refer to chromosomes in genetics, I'll try to avoid confusion by using V, W and Z instead as variables to denote generic people.)
If we could observe W's paternal segment, W's maternal segment, Z's paternal segment and Z's maternal segment in this region, then we could tell whether or not one of W's segments was identical to one of Z's segments. If we found two matching segments, then we could state that these segments were identical by state (IBS) and that W and Z segment-match on this segment.
Provided that W's paternal segment is different from W's maternal segment and likewise Z's paternal segment is different from Z's maternal segment, then we can start to investigate whether these IBS segments are identical by descent (IBD).
If
then we have proven that the matching segments are IBD. The term IBD is often loosely used when this level of rigorous proof is lacking.
Even if no DNA samples are available for some of the people on the family tree, it may still be possible to prove that the matching segments are IBD.
More generally, the unavailability of some DNA samples means that we can merely draw conclusions about the likelihood of a hypothesised relationship given the DNA or, equivalently, the probability of the observed DNA given the hypothesised relationship.
In general, the longer two IBS segments, the more likely they are to be IBD.
Since we do not independently observe W's paternal segment, W's maternal segment, Z's paternal segment and Z's maternal segment in the region of interest, we must base comparisons on the unordered pairs that we do observe.
Think of W as yourself and Z as another person who has been selected at random from a DNA database.
Let us first consider a particular location or SNP.
At this single location, W's unordered pair matches Z's unordered pair if it is possible that one of W's unobservable segments matches one of Z's unobservable segments. In other words, they match if at least one letter is common to both pairs.
Things could get confusing here, as we are comparing two people, each of whom has a pair of letters at every point on the chromosome. Remember that pair refers to the two letters, not to the two people.
For example, at a biallelic SNP, where either A or G can be observed on each chromosome, the unordered pairs which can be observed are AA, AG and GG.
To avoid the confusion which would arise if the word 'match' was used in multiple different senses, we say that unordered pairs which match in this sense are half-identical pairs and if their pairs are half-identical, we say that W is half-identical to Z at this location.
A person whose paternal and maternal letters are the same at a particular location (AA, CC, GG or TT) is said to be homozygous (or homozygotic) at that location. A person whose paternal and maternal letters are different at a particular location (e.g. AC) is said to be heterozygous (or heterozygotic) at this location.
At locations which are biallelic (the vast majority), someone who is heterozygous will automatically be half-identical to everyone. Thus, observing a heterozygous pair provides no information whatsoever about the possibility that the two people are related.
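The definition of half-identical pairs at a single location is easy to express in code. The sketch below represents each unordered pair as a two-letter string, which is an assumption of the sketch and not a format used by any particular DNA company.

```python
def half_identical(pair_w, pair_z):
    """True if the two unordered pairs have at least one letter in common."""
    return bool(set(pair_w) & set(pair_z))


# At a biallelic SNP where only A or G occurs, these are the possible pairs
BIALLELIC_PAIRS = ["AA", "AG", "GG"]

for pair_w in BIALLELIC_PAIRS:
    matched = [pair_z for pair_z in BIALLELIC_PAIRS if half_identical(pair_w, pair_z)]
    print(f"{pair_w} is half-identical to {matched}")
```

Running this shows that the heterozygous pair AG is half-identical to every possible pair, while AA and GG fail to match only each other.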
All the relevant information comes from locations at which both W and Z are homozygous, and the few locations which are polyallelic.
If W and Z are homozygous at a particular location, but with different letters (e.g. W is AA and Z is GG), then they clearly did not inherit that location from a common ancestor.
However, if W and Z are homozygous at a particular location with the same letter (e.g. both W and Z are AA), then they may have inherited their autosomal DNA at that location from a common ancestor.
When investigating the possibility of a relationship, we can discard any biallelic SNPs at which either W or Z is heterozygous, since those SNPs provide no information about the likelihood of a relationship. We just need to compare the locations at which both W and Z are homozygous, i.e. their mutually homozygous locations.
The more consecutive mutually homozygous locations we find, the more likely it is that the relevant region includes a segment of DNA inherited from a common ancestor.
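A comparison tool could pick out the mutually homozygous locations along the following lines. This is only a sketch: the lists of two-letter strings stand in for the unordered pairs observed for W and Z at the same run of SNPs, and the sample values are invented purely for illustration.

```python
def is_homozygous(pair):
    """True if both letters of the unordered pair are the same."""
    return pair[0] == pair[1]


def mutually_homozygous(pairs_w, pairs_z):
    """Indexes of the locations at which both people are homozygous."""
    return [
        index
        for index, (pair_w, pair_z) in enumerate(zip(pairs_w, pairs_z))
        if is_homozygous(pair_w) and is_homozygous(pair_z)
    ]


# Invented example: four consecutive SNPs for W and for Z
pairs_w = ["AA", "AG", "CC", "TT"]
pairs_z = ["AA", "GG", "CT", "TT"]

informative = mutually_homozygous(pairs_w, pairs_z)
agreeing = [i for i in informative if pairs_w[i] == pairs_z[i]]
print(f"{len(informative)} mutually homozygous SNPs, {len(agreeing)} with the same letter")
```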
To explore the probabilities involved, let us suppose that the SNP we are considering is biallelic with the proportions p and 1-p of the population having each value, say A and C respectively, and correspondingly, assuming independence of paternal and maternal letters, the proportions p², 2p(1-p) and (1-p)² of the population having each unordered pair, say AA, AC and CC respectively.
If you are homozygous AA at that SNP, then the other person is half-identical to you unless he or she is CC, in other words half-identical with probability 1-(1-p)(1-p)=2p-p²=p(2-p).
Similarly, if you are homozygous CC, then the other person is half-identical to you with probability 1-p².
If you are heterozygous AC or CA, then the other person is half-identical to you with probability 1, since everyone is half-identical to you.
This table shows these probabilities for different values of p, shown in the first column.
The next three columns show the corresponding proportions of the population who are homozygous AA (p²), heterozygous (2p(1-p)) and homozygous CC ((1-p)²).
The second block of three columns shows the probability that a randomly chosen individual is half-identical to you conditional on your letters at the relevant location.
The last column shows the unconditional (ex ante) probability that a randomly chosen individual is half-identical to you at the relevant location.
Before you see your own results, the last column gives the relevant probability; once you know your own results, you can update the probability by looking at the fifth, sixth or seventh column, whichever is relevant.
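Population proportion A (p) | Population proportion AA | Population proportion AC or CA | Population proportion CC | Probability half-identical to AA | Probability half-identical to AC or CA | Probability half-identical to CC | Probability half-identical to unknown |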
0.0% | 0.0% | 0.0% | 100.0% | 0.0% | 100.0% | 100.0% | 100.0% |
5.0% | 0.3% | 9.5% | 90.3% | 9.8% | 100.0% | 99.8% | 99.5% |
10.0% | 1.0% | 18.0% | 81.0% | 19.0% | 100.0% | 99.0% | 98.4% |
15.0% | 2.3% | 25.5% | 72.3% | 27.8% | 100.0% | 97.8% | 96.7% |
20.0% | 4.0% | 32.0% | 64.0% | 36.0% | 100.0% | 96.0% | 94.9% |
25.0% | 6.3% | 37.5% | 56.3% | 43.8% | 100.0% | 93.8% | 93.0% |
30.0% | 9.0% | 42.0% | 49.0% | 51.0% | 100.0% | 91.0% | 91.2% |
35.0% | 12.3% | 45.5% | 42.3% | 57.8% | 100.0% | 87.8% | 89.6% |
40.0% | 16.0% | 48.0% | 36.0% | 64.0% | 100.0% | 84.0% | 88.5% |
45.0% | 20.3% | 49.5% | 30.3% | 69.8% | 100.0% | 79.8% | 87.7% |
50.0% | 25.0% | 50.0% | 25.0% | 75.0% | 100.0% | 75.0% | 87.5% |
55.0% | 30.3% | 49.5% | 20.3% | 79.8% | 100.0% | 69.8% | 87.7% |
60.0% | 36.0% | 48.0% | 16.0% | 84.0% | 100.0% | 64.0% | 88.5% |
64.6% | 41.7% | 45.7% | 12.5% | 87.5% | 100.0% | 58.3% | 89.5% |
65.0% | 42.3% | 45.5% | 12.3% | 87.8% | 100.0% | 57.8% | 89.6% |
70.0% | 49.0% | 42.0% | 9.0% | 91.0% | 100.0% | 51.0% | 91.2% |
75.0% | 56.3% | 37.5% | 6.3% | 93.8% | 100.0% | 43.8% | 93.0% |
80.0% | 64.0% | 32.0% | 4.0% | 96.0% | 100.0% | 36.0% | 94.9% |
85.0% | 72.3% | 25.5% | 2.3% | 97.8% | 100.0% | 27.8% | 96.7% |
90.0% | 81.0% | 18.0% | 1.0% | 99.0% | 100.0% | 19.0% | 98.4% |
95.0% | 90.3% | 9.5% | 0.3% | 99.8% | 100.0% | 9.8% | 99.5% |
100.0% | 100.0% | 0.0% | 0.0% | 100.0% | 100.0% | 0.0% | 100.0% |
The first point to note from this table is that for all biallelic SNPs, you can expect ex ante to be half-identical to at least 87.5% of the population. This proportion rises if you subsequently discover that you are heterozygous, or homozygous with a very common value (>64.6% probability) at the SNP; it falls if you discover that you are homozygous with a less common value (<64.6% probability).
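Any row of the table can be reproduced from the formulas given earlier. In the sketch below the function names are illustrative; the ex ante probability is the average of the three conditional probabilities, weighted by the population proportions of AA, AC or CA, and CC.

```python
import math


def population_proportions(p):
    """Proportions of the population that are AA, AC or CA, and CC."""
    return p ** 2, 2 * p * (1 - p), (1 - p) ** 2


def half_identical_given(p):
    """Probability a random individual is half-identical to AA, AC or CA, CC."""
    return p * (2 - p), 1.0, 1 - p ** 2


def half_identical_ex_ante(p):
    """Unconditional probability, before your own letters are known."""
    return sum(
        weight * conditional
        for weight, conditional in zip(population_proportions(p), half_identical_given(p))
    )


# Value of p at which being homozygous AA gives exactly 87.5%:
# solve p(2-p) = 0.875, i.e. p = 1 - sqrt(1 - 0.875)
threshold = 1 - math.sqrt(0.125)

for p in (0.05, 0.30, 0.50, threshold):
    print(f"p = {p:.1%}: ex ante probability {half_identical_ex_ante(p):.1%}")
```

The last line of output corresponds to the 64.6% row of the table, and the minimum of the ex ante probability occurs at p = 50%.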
At a polyallelic SNP, the probability that W and Z are half-identical is smaller. (We have already noted that such locations are very rare, if not non-existent.)
For example, suppose that the four letters A, C, G and T occur equally often in the population at the chosen polyallelic SNP. Under this simplifying assumption, for a quarter of the population the pair will comprise two identical letters, AA, CC, GG or TT; the remaining three-quarters of the population will be heterozygous.
If W is homozygous, say AA, then 7 of the 16 possible values of Z's ordered pair will match W (AA, AC, AG, AT, CA, GA, TA).
If W is heterozygous, say AC, then 12 of the 16 possible values of Z's ordered pair will match W (all except GT, TG, GG and TT).
So the probability of finding a match in this sense for this example is 0.25*(7/16)+0.75*(12/16)=43/64=0.671875 or 67.1875%, much lower than for a biallelic SNP, whatever the distribution of the two letters in the population at the biallelic SNP.
In practice, we know that the distribution of letters at most SNPs is far from uniform: rather than A, C, G and T each occurring 25% of the time, at some SNPs A may occur 90% of the time, G 10% of the time and C and T never.
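The figure of 43/64 can be verified by brute force, enumerating every combination of W's ordered pair and Z's ordered pair under the same simplifying assumption that all 16 ordered pairs are equally likely. This is a minimal sketch with arbitrary variable names.

```python
from fractions import Fraction
from itertools import product

# All 16 equally likely ordered pairs at the hypothetical four-letter SNP
ordered_pairs = list(product("ACGT", repeat=2))

# Count the (W, Z) combinations sharing at least one letter
matches = sum(
    1
    for pair_w in ordered_pairs
    for pair_z in ordered_pairs
    if set(pair_w) & set(pair_z)
)

probability = Fraction(matches, len(ordered_pairs) ** 2)
print(probability, float(probability))
```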
If we look at 10 consecutive biallelic SNPs, what is the probability that the two people are half-identical at all 10 locations?
If we assume that both letters are equally likely at each SNP, then the ex ante probability that W and Z are half-identical at 10 consecutive locations is 0.875 to the power of 10, which is around 26.3%. For 20 consecutive biallelic SNPs, the probability drops to roughly 6.9%. For 50 consecutive biallelic SNPs, it is of the order of 0.1%. For 100 consecutive SNPs, it is of the order of 10⁻⁶ and it soon becomes vanishingly small.
If we assume, as in our other example, that the (unobservable) ordered pairs for the two people at each SNP are independently chosen randomly from a uniform distribution with four possible values, then the answer is 0.671875 to the power of 10, which is only around 1.9%, compared to around 67.2% for a single SNP. For 20 consecutive SNPs, the probability drops to roughly 0.04%. For 50 consecutive SNPs, it is of the order of 10⁻⁹. For 100 consecutive SNPs, it is of the order of 10⁻¹⁸ and it becomes vanishingly small more quickly than for biallelic SNPs.
As can be seen from the table above, these probabilities decline less quickly if one letter is more common than the other at each SNP.
Nevertheless, it holds as a general principle that the probability that two individuals are half-identical throughout a long region purely by chance becomes vanishingly small as the length in SNPs of the region increases.
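These figures are simply the single-SNP probability raised to the power of the number of SNPs, which relies on the assumption that consecutive SNPs are independent. The short sketch below reproduces them for both examples; the function name is illustrative.

```python
def chance_half_identical_run(single_snp_probability, number_of_snps):
    """Probability of a purely chance half-identical run, assuming independent SNPs."""
    return single_snp_probability ** number_of_snps


EXAMPLES = {
    "biallelic, both letters equally likely": 0.875,
    "four equally likely letters": 43 / 64,
}

for label, single in EXAMPLES.items():
    for n in (10, 20, 50, 100):
        print(f"{label}, {n} SNPs: {chance_half_identical_run(single, n):.3g}")
```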
Note, however, that the more SNPs in a region at which you are homozygous and the rarer the letters you have at those homozygous SNPs, the smaller all the probabilities are, and the less likely you are to be half-identical to another randomly chosen individual purely by chance on that region. These are the regions in which you should begin your search for relatives.
The likelihood that W and Z's DNA is half-identical purely by chance throughout a particular region decreases the more mutually homozygous SNPs they share in the region.
While the total number of SNPs in such regions is universally reported, I have yet to find any comparison tool which routinely reports the far more informative total number of mutually homozygous SNPs.
A region in which all the pairs are half-identical is known as a half-identical region. It would be more sensible to call it a half/half identical region, as it is a region in which half of one individual's DNA matches half of the other individual's DNA. The two individuals could be said to region-match in this region.
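Finding half-identical regions amounts to scanning two people's unordered pairs for unbroken runs of half-identical locations. The following sketch shows one way this might be done; it is not the algorithm of any DNA company, the min_snps parameter is a hypothetical threshold, and the sample data are invented.

```python
def half_identical_regions(pairs_w, pairs_z, min_snps=1):
    """Return (first index, last index) for each maximal half-identical run."""
    regions = []
    start = None  # index at which the current run began, if any
    compared = min(len(pairs_w), len(pairs_z))
    for index in range(compared):
        if set(pairs_w[index]) & set(pairs_z[index]):
            if start is None:
                start = index  # a new run begins here
        else:
            # the run, if any, ended at the previous location
            if start is not None and index - start >= min_snps:
                regions.append((start, index - 1))
            start = None
    # a run may extend to the last location compared
    if start is not None and compared - start >= min_snps:
        regions.append((start, compared - 1))
    return regions


# Invented example: six consecutive SNPs for W and for Z
pairs_w = ["AA", "AG", "CC", "TT", "GG", "AC"]
pairs_z = ["AG", "GG", "TT", "CT", "AG", "CC"]

print(half_identical_regions(pairs_w, pairs_z))
print(half_identical_regions(pairs_w, pairs_z, min_snps=3))
```

A practical tool would also apply a minimum length in centiMorgans, which requires the positions of the SNPs as well as the letters observed.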
In practice, consecutive base pairs or consecutive SNPs are not independent, as was implicitly assumed in all of the above probability calculations, but are inherited in chunks from both parents. So we observe many more half-identical regions in practice than pure chance would suggest. The longest (in mutually homozygous SNPs) of these half-identical regions arise not by chance, but by inheritance.
There are several possible origins of a half-identical region when comparing W and Z:
In the course of investigation, it may be possible to deduce with high probability that there is a full/half match: that one of W's segments is half-identical to Z's DNA in a particular region. Typically, this deduction will be made when W and one or more of W's known relatives (who are not doubly related to W) all exhibit half/half matches with Z.
Once again, in general, the longer a half-identical region, the more likely it is to be half-identical by descent rather than by chance.
Half-identical regions which are both more than 1 cM long and more than 500(?) SNPs long are described at familytreedna.com as shared segments.
Logically, if two people have identical segments (a full/full match), then they have a half-identical region.
The converse, that there are identical segments in a half-identical region (a half/half match), does not automatically follow.
It should be possible to do some probability calculations to estimate the probability that a half-identical region of a given length (in SNPs) contains an identical segment.
Ann Turner's "Identity Crisis" article in the Journal of Genetic Genealogy (Volume 7, Fall, 2011) looks at several cases where a half-identical region is made up of overlapping segments, but it considers only the possibility of two or at most three overlapping segments. She describes a region with several alternating identical paternal and maternal segments as "identical by state" or a "pseudo-segment"; a region including a long segment from one parent in the middle, with a short segment from the other parent at either or both ends as having "a fuzzy boundary"; and a region comprised of one paternal segment and one similarly sized maternal segment as a "compound segment".
When a golfer, no matter how good, strikes a tee-shot, the probability of a hole-in-one is minuscule.
When a golfer plays a 72-hole tournament, the probability of a hole-in-one during the tournament is much larger.
In the course of a 72-hole tournament, the probability of a hole-in-one by one of a field of perhaps hundreds of players is much larger again than the probability that any individual named player scores a hole-in-one.
Some gamblers who understood this principle made a lot of money at the expense of bookmakers who did not.
Similar principles apply when comparing DNA.
When comparing the DNA of two individuals who have a priori evidence suggesting a relationship, the probability that a reasonably long half-identical region is half-identical by chance is minuscule.
When comparing the DNA of one individual with that of all the other individuals who have sent their DNA to a DNA company, the probability that one or more of the half-identical regions identified are half-identical by chance is much larger.
When comparing the DNA of one individual with that of all the other individuals in a meta-database like GEDmatch.com containing observations from many DNA companies, the probability that one or more of the half-identical regions identified are half-identical by chance is much larger again.
This can also be viewed as an example of data-mining - the principle that if you look hard enough for a pattern in a large database, you will eventually find one.
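The arithmetic behind both the golfing example and the database example is the same: if each individual comparison has a small probability p of producing a chance result, then the probability of at least one chance result in n independent comparisons is 1-(1-p)ⁿ. The sketch below uses an invented value of p purely for illustration, and the independence of the comparisons is a simplifying assumption.

```python
def prob_at_least_one(p_single, n_comparisons):
    """Probability of one or more chance results in n independent comparisons."""
    return 1 - (1 - p_single) ** n_comparisons


# Hypothetical probability that one named comparison shows a chance match
P_SINGLE = 1e-4

for n in (1, 1_000, 100_000):
    print(f"{n:>7,} comparisons: {prob_at_least_one(P_SINGLE, n):.4f}")
```

With these invented numbers, a chance result that is very unlikely in any one comparison becomes almost certain somewhere in a large enough database.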
The word test is used with many different meanings in many different fields. To a scientist or a medic, it may be a deterministic test with a definite positive or negative outcome. To a statistician, it is a hypothesis test which can accept or reject (but not prove or disprove) a hypothesis based on the observed outcome of one or more random experiments. The word is used loosely by genetic genealogists with other meanings, but I will try to stick to the rigorous statistical meaning. In particular, I prefer to refer to the companies involved in genetic genealogy simply as DNA companies rather than "DNA testing companies" or "DNA sequencing companies" as they are more often described. The term "DNA sampling company", which more accurately describes what the companies really do, is rarely used. It will become clear from what follows what the DNA companies do, what they don't do, what they should do, and what their customers and organisations like ISOGG or the FDA representing the interests of their customers should lobby them to do.
To a statistician, a sample is the set of data collected from the random experiments on the basis of which a hypothesis is tested. So a DNA sample comprises either the strings of letters returned by a DNA company or the cells collected from its customers from which those letters are observed. The relevant random experiment is not the collection of cells and observation of letters (which is deterministic, apart from some measurement error) but the act of reproduction many years earlier in which the random processes of mutation and recombination produced the child's DNA from the parents'.
The various competing DNA companies market various products which comprise the collection of cells, the raw data returned in the form of strings of letters, and the interpretation of both the genealogical and medical implications of those raw data.
Genetic genealogy has been very poorly explained, or even mis-explained, to the public. See, for example, Blaine Bettinger's well-reasoned post on genetic exceptionalism.
The results of DNA analysis are frequently combined with the history and mythology of human migration. The connection between genetics and the history of human migration is generally extremely poorly explained. Is it based on DNA extracted from prehistoric human remains, on other evidence from excavation of prehistoric settlements, or on pure guesswork based on the geographical spread of DNA in today's living people?
Analysis of DNA can provide estimates of the probability that an individual currently living in place A and an individual currently living in place B had a common ancestor, either at any time, or within a specified number of generations. DNA from living people on its own cannot provide any information as to whether any such common ancestor lived in place A, lived in place B or lived in some other place C, or moved between places A, B and C.
Consider the extreme example of a family of two brothers, one of whom continued to live in his birthplace and fathered 10 daughters and no sons, the other of whom emigrated and fathered 10 sons. Their shared Y-DNA (passed from father to son) disappeared in one generation from their birthplace, but increased and multiplied in the emigrant's destination. The present location of the Y-DNA is therefore far away from the location where the common ancestor lived. (The initial brothers could of course have had male line cousins who passed on the same Y-DNA, perhaps in yet another different location.)
The units in which DNA testing (Y-DNA testing in particular) measures the genetic distance between two individuals are numbers of mutations, i.e. rare (small probability) differences in DNA between a child and the parent from whom the child inherits the DNA. By studying the frequency distribution of mutations per reproduction (or recombinations per reproduction for autosomal DNA), we can begin to understand the significance of this genetic distance. With some knowledge of the number of reproductions per generation (i.e. the average number of children fathered by each male) and its variation over centuries and millennia, estimates of the average number of mutations per generation or recombinations per generation can be derived. These can then be used to provide further estimates of the number of generations between the two individuals. By studying the frequency distribution of the age of parents at reproduction (i.e. years per generation) and its variation over centuries and millennia, estimated numbers of years for variables like the time to the most recent common ancestor can be derived. As stated by Dan Bradley of Trinity College Dublin at BTOP, the error bars for such time estimates are typically of the order of +/-50% of the point estimate. (I presume that "error bar" is geneticists' jargon for what statisticians' jargon calls "confidence interval".)
Genetics is a branch of applied probability and statistics in exactly the same way as insurance, gambling, investment, lots of sports, medicine and many other aspects of everyday life are. The highly educated population of the 21st century are well capable of understanding it, provided that it is defined precisely and explained clearly in this context. Indeed, as Kelly Wheaton says, "a statistics course is more important than a genetics one for genetic genealogists".
Genetic genealogy is a branch of genealogy which likewise has its place alongside traditional genealogical methods. Statistics prove nothing and likewise genetic genealogy alone proves nothing. Both, however, can be of great help in telling researchers where to look for the desired proof, and in rejecting wrong hypotheses.
Annelies van den Belt, chief executive of DC Thomson Family History, told the Oireachtas Joint Committee on Environment, Culture and the Gaeltacht on 12 December 2013 that her company's DNA products are tools to allow casual users to discover their roots without in-depth research. This sounds like "dumbing down" rather than education. The intended audience of this page does not include such casual users. Genealogy is not possible without in-depth research.
Many of my doubts about aspects of what is passed off as genetic genealogy are reinforced by the Genetic Astrology page of the Molecular and Cultural Evolution Lab at University College London. Another good source is Elliot Aguilar's article Selling Roots.
After looking at lists of autosomal DNA matches for some time, most people realise that some are close enough to be of genealogical interest and others are distant enough to be a waste of time. Somehow the vague and poorly defined terms IBD and IBS have come to be used to describe respectively the half-identical regions which are of genealogical interest and those which are not. Debate rages as to where one should draw the line between the two categories and as to what terms should be used to describe each category.
In my case, the first new relative that I discovered through DNA testing was a ninth cousin twice removed whom I found through GEDmatch.com and facebook.com; he was not deemed an FTDNA-overall-match to either myself or any of my known relatives. The first new relative that I discovered through FamilyTreeDNA.com was not an FTDNA-overall-match to myself, but was second to me among my paternal first cousin's FTDNA-overall-matches, hiding behind an e-mail address from the other side of the Atlantic, with a longest half-identical region of 32.41cM. So I certainly won't be dismissing anything shorter than 32.41cM as "IBS".
The question one must really ask here is what length of half-identical region suggests that someone in a DNA database is more closely related than the average person one might pass walking down the street? We all have millions of distant cousins. We all share descent from anyone who lived in the same geographical area a thousand years ago. We all have a billion slots to fill on the 30th generation of our family tree. If two people can document that they are 10th or 15th cousins, it is quite likely that they are equally closely or more closely related on other lines that they have not yet documented. At that distance, DNA does not add anything to our knowledge of relationships that pure mathematics has not already told us.
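The "billion slots" figure is simple doubling. The sketch below assumes that parents are counted as generation 1 and grandparents as generation 2; the function name is only for illustration.

```python
def ancestor_slots(generation: int) -> int:
    """Number of slots in a pedigree chart at the given generation,
    counting parents as generation 1 (an assumed numbering convention)."""
    return 2 ** generation


for g in (1, 10, 20, 30):
    print(f"generation {g:>2}: {ancestor_slots(g):,} slots")
```

The number of slots soon exceeds any plausible population of the relevant place and time, so the same ancestors must fill many slots in each of our trees.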
I was asked to look at the ADSA output for two full-siblings and noticed a remarkable difference. On Chromosome 6, they are half-identical (and possibly fully identical) from location 148,878 to location 75,903,756 and from location 165,993,090 to location 170,761,395.
Between location 34,600,991 and location 67,897,582 the brother has only one match, namely the sister. However, the sister has no less than 115 matching segments (including the brother) in this region. It is probably safe to conclude that the siblings inherited both their paternal chromosomes and their maternal chromosomes from opposite grandparents in this region.
ADSA really makes this hidden conundrum stick out like a sore thumb.
The lengths of these 115 matching segments range from 6,300 SNPs to 9,100 SNPs and from 7.09cM to 9.46cM. This is clearly an area where the SNP/cM ratio is unusually high.
Many of the sister's 115 matches in this area are not FTDNA-overall-matches to each other, presumably because they don't meet FTDNA's 20cM threshold for overall matches.
The term "IBS" is ambiguous and widely misused. It seems to be used to describe both half-identical regions which are small on the centiMorgan scale and half-identical regions which are small on the SNP scale.
Half-identical regions which are small on the SNP scale (no matter what their size on the centiMorgan scale) are quite likely to be half-identical-by-chance, i.e. to be comprised of sequences of alternating small compound segments, representing paternal/paternal, paternal/maternal, maternal/paternal and/or maternal/maternal matches lined up together.
Half-identical regions which are small on the centiMorgan scale (but possibly very large on the SNP scale, as in the present example) are most unlikely to be merely half-identical-by-chance. However, they may come from a very distant MRCA and consequently (as in the present case) be shared by a very large group of descendants.
Throwing away the half-identical-by-chance regions makes perfect sense, but throwing away the others with them is definitely a case of throwing away the baby with the bathwater.
We know that anyone who was alive 1000 years ago and has living descendants today must statistically be an ancestor of practically everyone living today, at least within a defined geographical region such as Ireland. Our example is Brian Boru, a High King of Ireland, the millennium of whose death at the Battle of Clontarf in 1014 was celebrated recently.
I like to think of the region on Chromosome 6 where the sister in this example is half-identical to 115 other people as coming from a MRCA of some time around Brian Boru's generation.
If we plotted half-identical regions on a scatter plot with the length in SNPs on the X-axis and the length in cM on the Y-axis, everyone would agree that those which fall near the axes are of little genealogical value and those that fall far out to the north-east of the diagram are of great genealogical value. The question remains of where to draw the boundary between the valuable ones (which a lot of people call IBD) and the others (which a lot of people call IBS). The general convention seems to be to draw a right-angled boundary and throw out everything below a particular cM threshold AND everything below a particular SNP threshold.
The quadrant northwest of the threshold point contains the regions that are half-identical-by-chance; the quadrant southeast of the threshold point contains the segments from very distant MRCAs; the quadrant southwest of the threshold point contains regions failing on both criteria; and the quadrant northeast of the threshold point contains the good matches.
Why don't we use a diagonal boundary rather than a right-angled boundary? And why don't we set a different boundary for matches where there is additional evidence of a possible close relationship? E.g., in the present example, people who match the reference person's full-sibling and are just outside the chosen boundary should be considered better matches than people who don't match the reference person's full-sibling but are just inside the chosen boundary.
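The right-angled boundary and one possible diagonal alternative can be sketched as follows. The thresholds of 7cM and 500 SNPs are chosen only for illustration, and the particular diagonal rule is a hypothetical example of my own, not a rule used by any DNA company.

```python
def quadrant(cm: float, snps: int,
             cm_threshold: float = 7.0, snp_threshold: int = 500) -> str:
    """Place a half-identical region in one of the four quadrants,
    with SNPs on the X-axis and cM on the Y-axis."""
    long_in_cm = cm >= cm_threshold
    long_in_snps = snps >= snp_threshold
    if long_in_cm and long_in_snps:
        return "northeast: good match"
    if long_in_cm:
        return "northwest: possibly half-identical-by-chance"
    if long_in_snps:
        return "southeast: possibly from a very distant MRCA"
    return "southwest: fails both criteria"


def passes_diagonal(cm: float, snps: int,
                    cm_threshold: float = 7.0, snp_threshold: int = 500) -> bool:
    """Hypothetical diagonal boundary: a shortfall on one scale can be
    offset by a surplus on the other. The line passes through the same
    threshold point as the right-angled boundary."""
    return cm / cm_threshold + snps / snp_threshold >= 2.0


print(quadrant(9.46, 9100))       # like the longest of the sister's 115 segments
print(quadrant(4.6, 9000))        # short in cM but long in SNPs
print(passes_diagonal(4.6, 9000)) # kept by the diagonal rule despite failing on cM
```

A diagonal rule of this kind keeps regions that are slightly short on one scale but very long on the other, which the right-angled rule throws away.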
The word 'ethnicity' is widely used in the marketing of DNA products. I have not even attempted to research how this word is defined. While the word is not used in Mark Thomas's Guardian article, I suspect that it is one of the concepts which he says 'are better thought of as genetic astrology'.
Nevertheless, the estimated ethnicity percentages derived from DNA and peddled by many DNA companies seem to be a key selling point, prompting many people with no interest in genealogy to end up on the DNA comparison websites.
FamilyTreeDNA's concepts of ethnicity long relegated all those of Irish ancestry to "British Isles" ethnicity, which many Irish people consider at best objectionable and at worst plain wrong. This finally changed with the release of myOrigins 3.0 starting on 22 September 2020.
All estimated ethnicity percentages that I have seen are displayed without any estimated margin of error. Until I can see solid statistical evidence that they are significantly different from zero, I will work on the assumption that all small percentages are essentially zero.
Large estimated ethnicity percentages are sometimes useful geographical clues in unknown parentage searches. An estimated percentage of over 25 is probably worth investigating.
There are many good articles on the mis-selling of estimated ethnicity percentages, such as this one by John Grenham in 2016 and this one in WIRED magazine in 2020.
Some DNA companies (ancestry.com in particular) have employed marketing people to sell their products by promising not to use jargon. In other words, they admit that they want to sell only to people who don't know what they are buying. Consumer protection authorities should look into this: the better regulated financial sector would never be allowed to get away with it! See Roberta Estes's blog and Heather Collins's blog for further critique of the ancestry.com product.
Any new science requires a new vocabulary to explain it. However, an attempt to reconcile the geneticists' vocabulary, the genealogists' vocabulary and the statisticians' vocabulary is urgently required. Scientists and marketers should agree on the vocabulary, minimise the number of different synonyms used for each concept, avoid mentioning concepts which are not directly relevant to their audience, and define any new words which are necessary clearly and precisely, with whatever diagrams and mathematical models are necessary to help the understanding of those who prefer verbal, spatial or quantitative approaches respectively. The problem is epitomised both by the looseness of FTDNA's glossary and by AncestryDNA's refusal to even use what it terms "jargon" to make its statements intelligible to multiple audiences.
For example, as noted above, at FamilyTreeDNA.com, the simple words "block" and "segment" appear to mean exactly the same thing and to be used interchangeably on the same page, unnecessarily confusing the company's customers. (If there is a subtle difference that I have missed, please let me know.)
As genetics is a branch of applied probability and statistics, it cannot be explained clearly without using the basic vocabulary of those subjects, i.e. words and phrases like probability, estimate, confidence interval and hypothesis test. Beware of anyone who tries to persuade you otherwise.
Like any sophisticated and rapidly developing website, FamilyTreeDNA.com is bound to take some getting used to.
It appears that every visit has to start with a login screen even if one ticks the apparently useless 'Remember me' checkbox. One must also remember to click the small dark "LOG IN" button towards the middle left, not the larger and brighter "Login" button or plain text "Login" at the top right, which merely reloads the login screen. There are regular annoying pop-ups saying things like: 'You have been idle for 120 minutes. Your session may have timed out. The page will be reloaded and you may need to log in again.' or 'Your session will expire on Sun Nov 17 2013 13:19:52 GMT+0000 (GMT Standard Time). You have 5 minutes remaining until your session times out. Click OK to keep this session.' If facebook.com can keep its billions of users permanently logged in, there is no excuse for any smaller website such as FamilyTreeDNA.com not to provide this option. At least the timeout was increased from 30 minutes to 120 minutes soon after I started to use the website.
Those managing DNA samples from multiple family members must become project administrators in order to use a single login for all the kits. Even then, up to 2016 there was no way to use the known relationships between the kits of family members in more appropriately targeted searching for matches.
If you are a project administrator and wish to scan multiple kits simultaneously for new matches, then I recommend the following strategy:
I have already mentioned GEDmatch.com in passing a few times, and noted that my own GEDmatch.com kit can still be found using the original kit number F310654. However, I now appear in other people's match lists as T205074 (based on a FamilyTreeDNA data file) and A931453 (based on an AncestryDNA data file). There are only minor differences between these files as the samples were submitted when the two companies were using essentially the same set of SNPs. More recently, they have used very different sets of SNPs and the same person's data files from samples submitted more recently will give very different results at GEDmatch.
As with all DNA comparison websites, paranoia leads to some potential users expressing reservations from what some describe as a "security" point of view. Personally, I am far more comfortable with GEDmatch.com having my DNA data than I am with AncestryDNA having it! The people running GEDmatch.com clearly want to help me to find my relatives; the people running AncestryDNA clearly are solely interested in separating me from my money.
As of 6 February 2015, my 664 FTDNA-overall-matches yielded 622 distinct e-mail addresses and my top-1500 GEDmatch one-to-many matches yielded 1144 distinct e-mail addresses. The overlap was only 96 e-mail addresses, or just over 15% of the researchers that I could contact via FTDNA. That means that GEDmatch.com allows me to contact 1048 researchers who either are not customers of FTDNA at all or are customers of FTDNA but don't meet the thresholds required to be deemed FTDNA-overall-matches to me. Of my GEDmatch list, 614 came from ancestry.com, 449 directly from FTDNA, 45 from other sources via FTDNA, 391 from 23AndMe, and one was a phased kit. My conclusion is that, despite the great promise of GEDmatch, sending DNA samples to the other two companies will vastly increase the size of the pool in which I can fish for possible relatives.
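The overlap figures above can be reproduced from two lists of e-mail addresses with a few set operations. This is a minimal sketch: the function name and the toy addresses are hypothetical, and only the counts at the end are taken from the figures quoted above.

```python
def overlap_summary(ftdna_emails: set, gedmatch_emails: set) -> dict:
    """Summarise the overlap between two sets of distinct e-mail addresses."""
    common = ftdna_emails & gedmatch_emails
    return {
        "common": len(common),
        "share_of_ftdna": len(common) / len(ftdna_emails),
        "gedmatch_only": len(gedmatch_emails - ftdna_emails),
    }


# Toy data, purely to show the mechanics.
toy = overlap_summary({"a@x.ie", "b@x.ie", "c@x.ie"},
                      {"b@x.ie", "c@x.ie", "d@x.ie", "e@x.ie"})
print(toy)

# The counts quoted in the text.
ftdna_distinct, gedmatch_distinct, common = 622, 1144, 96
print(f"share of FTDNA contacts also at GEDmatch: {common / ftdna_distinct:.1%}")
print(f"contacts reachable only via GEDmatch: {gedmatch_distinct - common}")

# Breakdown of the top-1500 GEDmatch list by source of the data file.
gedmatch_sources = {
    "ancestry.com": 614,
    "FTDNA (direct)": 449,
    "other sources via FTDNA": 45,
    "23AndMe": 391,
    "phased kit": 1,
}
print(f"kits accounted for: {sum(gedmatch_sources.values())}")
```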
One huge disadvantage of GEDmatch.com is that it doesn't operate according to the basic principles underlying the vast majority of websites, such as recognising logins for a fixed time or permitting hyperlinking and bookmarking of individual web pages within the website.
For example, if you go to http://www.gedmatch.com/ in one browser tab and login, and immediately open a new tab and go to the same URL, you will again be presented with the login screen. If you wish to look at several GEDmatch.com reports at the same time, then login and open the main menu in one browser tab. Every time you want to look at a new report, go back to that tab, right-click on the link to the form for whichever report you want next, and select Open Link in New Tab or equivalent.
A consequence of this bizarre policy is that the instructions that I have put together for getting your DNA data and your pedigree chart to GEDmatch are extremely clunky and not the simple "click here" style instructions that I would love to provide.
As with any DNA comparison website, in order to be of maximum assistance to DNA matches, all DNA kits at GEDmatch.com must be associated with GEDCOM files giving the known direct ancestors of the DNA subject.
Full details of how to do this are here.
If you think that you may be related to me, then you will also want to compare your GEDmatch kit with those of my other close known relatives, which you will find at the top of the one-to-many match list for my own combined kit VA864386C1.
It is possible to download raw data from FamilyTreeDNA.com, ancestry.com, MyHeritage.com, 23andMe.com and some other websites and upload it to GEDmatch.com, which provides alternative and generally superior tools for analysis of the DNA data.
Another huge advantage of GEDmatch.com is that it has long allowed comparison of X chromosome data, which, after several missed deadlines, was introduced in a limited way on the FamilyTreeDNA website itself on 2 January 2014. GEDmatch.com shows that one may not share any detectable autosomal DNA with those with whom one shares the most (in cM) X-DNA. FamilyTreeDNA.com permits X-DNA comparisons only between those who also share autosomal DNA.
Another huge disadvantage of GEDmatch.com is that it was long run by part-time amateurs and funded by voluntary donations, so that there has been a constant struggle to match the facilities offered to the demand for those facilities. Service has improved since the months around February 2015, when the following message was displayed on the login page:
Due to increased usage we are experiencing slow or non response for GEDmatch programs.
We apologize and suggest logging on during off hours if you experience slow response.
GEDmatch.com was taken over by Verogen in late 2019.
The 'Gen' column by which GEDmatch.com sorts one-to-many matches by default is confusing. If a parent and child are both in the database, then GEDmatch.com finds that they are half-identical everywhere, and estimates 'Gen' as 1. However, if an individual submits samples to two DNA companies and uploads the data from both companies, then GEDmatch.com finds that the two kits are half-identical everywhere, and again estimates 'Gen' as 1. The matching algorithm fails to check for full matches everywhere, which would distinguish the parent-child relationship from the duplicate (or identical twin) relationship. If it did so, then it would presumably set 'Gen' to 0 for the latter comparison.
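The missing check can be sketched in a few lines. This is a toy illustration under stated assumptions, not GEDmatch's actual algorithm: each kit is assumed to be a list of unphased two-letter genotypes for the same SNPs in the same order, and measurement error and no-calls are ignored. The function names and example genotypes are hypothetical.

```python
def shares_allele(g1: str, g2: str) -> bool:
    """Half-identical at one SNP: the two genotypes have at least one letter in common."""
    return bool(set(g1) & set(g2))


def same_genotype(g1: str, g2: str) -> bool:
    """Fully identical at one SNP: both letters agree, in either order."""
    return sorted(g1) == sorted(g2)


def classify(kit1: list, kit2: list) -> str:
    """Distinguish 'fully identical everywhere' from 'half-identical everywhere'."""
    pairs = list(zip(kit1, kit2))
    if all(same_genotype(a, b) for a, b in pairs):
        return "duplicate kit or identical twin"
    if all(shares_allele(a, b) for a, b in pairs):
        return "consistent with parent/child"
    return "neither"


parent = ["AG", "CC", "TT", "AG"]
child = ["AA", "CT", "TT", "GG"]
same_person_other_company = ["GA", "CC", "TT", "AG"]

print(classify(parent, child))
print(classify(parent, same_person_other_company))
```

Testing for full identity first is what separates the two cases, since any pair of kits that is fully identical everywhere is also half-identical everywhere.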
Similarly, you can click through to 'One-to-one' compare form at http://ww2.gedmatch.com:8006/autosomal/u_compare1.php and fill in the form to go to http://ww2.gedmatch.com:8006/autosomal/u_compare2.fnx?kit1=F310654&kit2=FU2924&chart=0&resolution=1000&threshold=&shared=&noise=&win_size=&bunch_limit=&xsubmit=Submit, but if you follow that link directly, you will get a different error message: (500) Internal Server Error. At least it is accompanied by a statement of the bizarre GEDmatch policy:
One common cause is trying to link to this page from a forum or an email. Most GEDmatch pages do not allow this, and require that you log-in to the site directly. Other than the main page, pages on this site should not be accessed from your browser history, or from links posted in forums, Google, etc.
The 'one-to-one' results show "Estimated number of generations to MRCA"; however, the algorithm used to arrive at this estimate depends on the parameters selected when submitting the form, so I have chosen to ignore all "Gen" figures generated by GEDmatch. I find the centiMorgan estimates much easier to get my head around than the Gen estimates, which, as already noted, don't seem to distinguish at the most basic level between self/identical-twin matches (Gen=1.0) and parent/child matches (also Gen=1.0). Attempting to distinguish between Gen=4.8 and Gen=4.9 can only convey a false and spurious sense of accuracy about estimates which have huge margins of error associated with them.
For example, comparing kits A475217 and A370395 with the default settings as of 3 Nov 2018 (Minimum threshold size to be included in total = 500 SNPs; Minimum segment cM to be included in total = 7.0 cM) produces an estimate of 3.5 Gen; changing to Minimum threshold size to be included in total = 400 SNPs and Minimum segment cM to be included in total = 6.0 cM produces no estimate. Thresholds of 1000 SNPs and 10cM give a more distant relationship estimate of 3.7 Gen. The default settings for the one-to-many comparison appear to produce the same estimates as the default settings for the one-to-one comparison.
As of 27 January 2014, the Find people who match with you on a specified segment page at http://ww2.gedmatch.com:8006/autosomal/seg_compare1.php behaved more normally. Following that link often produced ERROR(49) Not Logged in. Filling in the form brought one to http://ww2.gedmatch.com:8006/autosomal/seg_compare2.fnx?kit1=F310654&chrom=X&start=23955089&end=36111764&shared=7&chart=0&resolution=1000&xsubmit=Submit but if you follow that link directly, you will again be redirected to the error message: (500) Internal Server Error. The GEDmatch.Com Chromosome Segment Comparison took a very long time to complete (this one reported: "Comparison took 2714.723849 seconds"), so one had to make sure to start it at a time when one would not need to shut down one's computer shortly! Because of the load it was imposing on the GEDmatch servers, this analysis tool was withdrawn around February 2014.
On most of the GEDmatch.com forms, if you navigate back to the form using the browser back button or Ctrl-LeftArrow, the last values that you filled in will often be wiped out and you need to start from scratch. This behaviour is not consistent: I have managed to open the same form in two tabs in the same window, and found that in one tab I could repeatedly navigate back and find the values still filled in, while in the other tab no matter how often I navigated back the values were wiped out. The usual workaround for that is to edit the URL in the navigation toolbar of your browser, but that is also disabled at GEDmatch.com!
Finally, if you close and re-open your browser with GEDmatch.com output visible in several open tabs, you will find the above error messages in each tab and have to re-enter the data in every form.
On 24 Mar 2014, I posted the following query in the GEDmatch Forums/DNA Utilities/Triangulation:
The Tier 1 Matching Segment Search now provides the solution to the problem outlined in this query.
Subject: How can I find people who match two kits on a segment where those two match each other?
I have some known relatives who have uploaded their data to GEDmatch, such as my fifth cousin Cindy.
Our longest half-identical region is 4.6cM and our Estimated number of generations to MRCA = 6.9. Hence we don't appear on each other's top 1500 'One-to-many' matches even if we set minimum Autosomal largest segment to 4.5cM. Both of our top 1500 lists stop at 6.5 generations to MRCA.
There are 64 kits in common between my top 1500 and Cindy's top 1500.
The next thing I would like to do is to go through these 64 kits and find the people who match both of us and also each other on the same regions. Any such people are the most likely to be descended from our common GGGGgrandparents.
It seems that there should be a form where I can enter my kit number and my fifth cousin's kit number and get a list of the 64 kits who are common to our top 1500 along with some indication of which ones are half-identical to either or both of us on the regions where we are half-identical to each other.
The current Triangulation and Segment Triangulation utilities each ask for a single kit number, and look for pairs of kits which match the selected kit and match each other.
I would prefer a triangulation utility which asks for two kit numbers (obviously of known relatives who match each other) and looks for other kits which match both selected kits on the same regions where they match each other.
How can I do this without using Excel to find the 64 common matches between the two top-1500 lists and then manually doing 128 'One-to-one' compares between each of the 64 with me and my known relative?
The 'People who match one or both of 2 kits' utility does not appear to check whether the matches are in the same regions (suggesting a single common ancestor for all three) or in different regions (allowing the possibility that the third person is connected to the first two through different common ancestors).
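The two-kit triangulation requested in that query can be sketched as follows. This is a minimal illustration, not the method used by any GEDmatch utility: it assumes that the segment lists for both kits have already been exported in some form, and the Segment layout, function names, kit labels and positions are all hypothetical.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    kit: str    # the other kit in the match
    chrom: str  # chromosome label
    start: int  # start location
    end: int    # end location


def overlaps(a: Segment, b: Segment, min_overlap: int = 1) -> bool:
    """True if two segments lie on the same chromosome and overlap."""
    return (a.chrom == b.chrom
            and min(a.end, b.end) - max(a.start, b.start) >= min_overlap)


def two_kit_triangulation(shared, matches_a, matches_b) -> dict:
    """For every kit that matches both A and B, list the regions shared by
    A and B on which that kit also overlaps both of them.
    An empty list means the third kit matches A and B only in other regions."""
    common_kits = {s.kit for s in matches_a} & {s.kit for s in matches_b}
    result = {}
    for kit in sorted(common_kits):
        result[kit] = [
            region for region in shared
            if any(s.kit == kit and overlaps(s, region) for s in matches_a)
            and any(s.kit == kit and overlaps(s, region) for s in matches_b)
        ]
    return result


# Hypothetical data: A and B are half-identical to each other on one region.
shared = [Segment("A-B", "6", 1000, 5000)]
matches_a = [Segment("X1", "6", 900, 4000), Segment("X2", "3", 100, 900)]
matches_b = [Segment("X1", "6", 2000, 6000), Segment("X2", "6", 1500, 3000),
             Segment("X3", "1", 1, 10)]

for kit, regions in two_kit_triangulation(shared, matches_a, matches_b).items():
    print(kit, regions)
```

Overlap with both kits on the same region is only an indication of a single common ancestor; the third kit would still need to be compared with each of the two on that region to confirm it.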
GEDmatch.com continually tweaks its matching algorithm. On 29 Sep 2014, I looked at the top-1500 cut-off point in Gen for 11 kits and found the following range of values:
F303343 Donna: 6.6
F318138 Cindy: 6.8
FU2924 Anthea: 6.8
F310654 Paddy: 6.9
F335377 Antoin: 7.4
F325507 Colm: 7.4
F325763 Dara: 7.4
M090954 Sean: 7.4
M081357 Joanne: 7.5
F325501 Aileen: 7.5
F335391 Mary: 7.5
Anthea appears on Joanne's top-1500 at 6.9 Gen, but Joanne doesn't make Anthea's top-1500 which stops at 6.8 Gen.
The chapters which follow on interpreting the different types of DNA results will all eventually contain reviews and what I hope will be some constructive criticism of the relevant parts of the FamilyTreeDNA.com and GEDmatch.com websites.
DNAGedcom.com is another independent DNA tools website, featuring the Autosomal DNA Segment Analyzer amongst other tools.
You can register here.
Then bookmark the login page. There is a "Keep me logged in" tickbox, but it seems unreliable.
While the transfer of your raw DNA data from FTDNA, AncestryDNA or 23andMe to GEDmatch.com is a one-off procedure, you will want to periodically transfer your match data from FamilyTreeDNA.com to DNAGedcom.com as new matches appear.
As with GEDmatch.com, you can manage multiple DNA kits within a single DNAGedcom.com account.
You will also need to bookmark the download page.
Note that your web browser may automatically prompt you to use your DNAGedcom.com password as the password for each new FTDNA kit from which you wish to download match data. Remember to type the appropriate password for the relevant FTDNA Kit Number over the suggested password before hitting the "Get Data" button. Your web browser should then remember the relevant passwords whenever you want to refresh the match data for existing FTDNA kits.
If your download fails, the error message (in red on black) may not be easy to read or scroll through. You can copy and paste the entire error message to a text editor where it will be easier to read, if not easier to understand.
FTDNA frequently changes its website without warning, which often knocks out the DNAGedcom.com download procedure.
If you have an autosomal transfer kit at FTDNA, you can download your match data to DNAGedcom.com without paying the USD19 fee to unlock the chromosome browser etc.
There is a DNAGedcom User Group on Facebook which you should also join.
I will return to the Autosomal DNA Segment Analyzer later.
There are four different levels at which one can compare autosomal DNA and look for matches or potential relatives:
My initial autosomal DNA results were presented at FamilyTreeDNA.com in the form of 36 pages of matches, with 10 people per page; subsequently this was increased to 30 people per page. In an attempt to avoid ambiguity, I will call these people my FTDNA-overall-matches. The term autosomal or Family Finder is implicit in this definition, as those who have had their Y-DNA and/or mtDNA analysed by FamilyTreeDNA will have different (possibly overlapping) sets of matches from those analyses.
The silly mania for dividing everything into groups of 10 still applies (as of 11 June 2017) to the screen for entering Current Surnames, where the browser back and forward commands don't even work properly between the groups of 10.
I have yet to find any formal definition of what the word "match" means in this context or of what matching algorithm is used at FamilyTreeDNA.com. The nearest to a definition that I can find in the FTDNA FAQs is:
The Family Finder program has calculated all of your matches to be your relatives within the relationship range. Family Tree DNA uses stringent standards for the relationship range and for the degree of relatedness. Thus, only those determined with high confidence to be your actual genetic relatives are included.
Where are the "stringent standards" published? How high is "high" confidence? What statistical principles lie behind this secret definition? Are user-entered known relationships used within the matching algorithm, as appears to be the case for ancestry.com's DNA matching algorithm?
At GEDmatch.com, the One-to-many DNA comparison page at least allows the user to tweak the parameters used to define a match. Note, however, that the GEDmatch.com 'One-to-one' compare page by default once looked for segments > 3cM in FTDNA data but only for segments > 5cM in 23AndMe data.
After you have spent a while looking at your autosomal DNA matches, you will inevitably begin to question both to what extent the people on the long lists of matches thrown up by autosomal DNA comparison are any more likely to be related to you than the people that you might pass walking down the street in a place where your ancestors lived for a couple of generations, and to what extent you will be able to prove this using traditional genealogical methods.
I have estimated that in somewhere like Ireland, where the population is small and there was little inward migration in recent centuries, it is unlikely that any two randomly selected people with no tradition of recent immigrant ancestors are more distantly related than about twelfth cousins. Mark Humphrys argues that we Irish are all descended from Brian Ború, the High King of Ireland who was killed in battle in 1014.
This simple observation has profound implications:
On the other hand, there are documented cases of people who found each other because they were deemed to be autosomal DNA matches discovering a paper trail which shows that they are more distantly related than twelfth cousins. I personally have found a documented ninth cousin twice removed because we were deemed (by GEDmatch.com) to be autosomal DNA matches. It is of course possible, if not probable, that there is a closer but less well documented relationship between such distant cousins.
Lists of matches based purely on one-to-one autosomal DNA comparisons will undoubtedly include some people whose genealogical relationship cannot be established and omit some true genealogical relatives. The extent to which you will be able to filter the true genealogical relatives from the others depends not only on the closeness of the DNA match, but also on the answers to many questions which cannot be considered by these pure one-to-one autosomal DNA comparisons:
At this stage, some basic ideas about false positives (matches with no known relationship to you) and false negatives (known relatives who are not deemed to be matches) may be helpful.
Every statistical inference is subject to two types of error. For no particular reason, they are known as Type I and Type II errors:
Relationship | Match Probability |
---|---|
2nd cousins or closer | > 99% |
3rd cousin | > 90% |
4th cousin | > 50% |
5th cousin | > 10% |
6th cousin and more distant | Remote (typically less than 2%) |

Cousin relationship | Probability of detecting |
---|---|
1st Cousin or closer | ~100% |
2nd Cousin | >99% |
3rd Cousin | ~90% |
4th Cousin | ~45% |
5th Cousin | ~15% |
6th Cousin and beyond | <5% |
As with any hypothesis test, there is a tradeoff between sensitivity and specificity when using DNA to test whether two individuals are related. In the language of the preceding section, reducing the probability of a Type I error will increase the probability of a Type II error.
Choosing where to set the threshold is more of an art than a science. The various DNA companies have all developed their own matching algorithms, which can give very different results.
Here is an extreme example involving myself (VA864386C1) and a man (A831973) who shares his surname with at least four of my 16 GGgrandparents, and shares one reasonably large half-identical region with me:
In this case, I have more confidence in the conclusions by FTDNA and GEDmatch than in those by AncestryDNA, as our ancestors not only shared a surname but lived near each other.
In another case, with no common ancestral surnames and no common ancestral locations, I might have more confidence in the DNA company which does not deem there to be a match.
The other DNA companies should follow the example of GEDmatch and allow their customers to choose where they would like to set the threshold. At the very least, they should state clearly and unambiguously where the threshold has actually been set.
The basic matching algorithms do not currently look at surnames or family trees. Both AncestryDNA (ThruLines) and MyHeritage (Theory of Family Relativity) are trying to develop more sophisticated matching algorithms which combine DNA evidence and genealogical evidence. In the future, the matching thresholds and ranking of matches used by the DNA companies may be revised to combine genealogical evidence with DNA evidence.
Major changes in the Family Finder interface were implemented on 6 Jul 2016. Parts of the following discussion may still need to be revised in the light of these changes.
The only URL for the Family Finder interface is https://my.familytreedna.com/family-finder/matches.aspx. There are various ways of customising the display, but in general no parameters are added to the URL and no cookies are set to remember your preferred view (if you are like me, this will probably be with newest matches first and with Expand, formerly referred to as Show Full View, turned on), so you cannot bookmark your customised display and you must repeat the customisation every time you go to the page.
Personally, I like to see my Family Finder matches ordered by descending Match Date, so I have to click the Match Date column header to re-sort the matches every time I visit my bookmark. For Y-DNA matches, I have to click twice, as the first click sorts with oldest first and the second click then sorts with newest first. I suspect that all customers will want quick access to the top 30 by Match Date, but this requires an additional point-and-click after selecting your bookmark. The number of FTDNA-overall-matches is displayed at the top of the list of matches; if this has changed since your last visit, you will know that you have at least one new match.
After accidentally managing to add a parameter to the URL, I posted about this on 24 Mar 2015 at
https://www.facebook.com/groups/isogg/permalink/10153215548052922/
Each FamilyTreeDNA kit has an ekit identifier which is used in URLs to avoid confusion when you share your Family Tree or DNA results with project administrators or others.
To find your ekit identifier, go to your pedigree chart and select the "Share Tree" button which is the fourth item from the right on the white menu bar above your blank pedigree chart. This will display a URL in the form http://www.familytreedna.com/my/family-tree/share?k=j0XmKW8S87H%2Ffjy92CTXbQ%3D%3D
Whatever appears after the equals sign in the URL is your ekit identifier.
If you have access to multiple kits and want to bookmark pages for particular kits, you can generally add ?ekit= followed by the relevant ekit identifier to the end of any URL. This will be particularly helpful when you have become a project administrator.
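For anyone who would rather script this than copy and paste, here is a minimal Python sketch of the two steps just described: pulling the identifier out of a "Share Tree" URL and appending it to another page's URL. The function names are my own invention for illustration. The sketch assumes that the identifier is simply everything after the first equals sign, left in its percent-encoded form, and that the page being bookmarked has no query string of its own.

```python
from urllib.parse import urlsplit


def extract_ekit(share_url: str) -> str:
    """Return whatever follows the first '=' in the query string, still percent-encoded."""
    query = urlsplit(share_url).query
    _, _, value = query.partition("=")
    return value


def with_ekit(page_url: str, ekit: str) -> str:
    """Append ?ekit=<identifier> to a URL (assumes page_url has no query string yet)."""
    return f"{page_url}?ekit={ekit}"


# The example "Share Tree" URL quoted in the text.
share = "http://www.familytreedna.com/my/family-tree/share?k=j0XmKW8S87H%2Ffjy92CTXbQ%3D%3D"
ekit = extract_ekit(share)
print(ekit)
print(with_ekit("https://my.familytreedna.com/family-finder/matches.aspx", ekit))
```

Whether a given page actually honours the ekit parameter can only be checked by trying it while logged in.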
To upload your GEDCOM file to FamilyTreeDNA.com, just select the "Upload GEDCOM" button which is the third item from the right on the white menu bar above your blank pedigree chart at, e.g., https://www.familytreedna.com/my/family-tree?ekit=AyxNwlRR9Y6t4mCCdnLA%2fw%3d%3d#mode=1
There seems to be no means of viewing all my hundreds of FTDNA-overall-matches in the same web browser window. I can, however, see them all, sortable by all fields, in a single Microsoft Excel window by clicking the Excel button at the bottom right of the browser window. This causes Mozilla Firefox to offer to open an XML Document in XML Editor. I am not familiar with either of these, but clicking OK then opens a normal Excel window. The file downloaded is not a properly formatted Excel file and is probably just a CSV file: column widths are not set to match the content width; panes are not frozen; autofilter is not turned on; dates are not in my preferred Microsoft Windows date format; e-mail addresses are not hyperlinked; long lists of surnames and placenames are not set to wrap in a readable manner; etc. As I will be re-downloading this file, I had to record this macro in order to make it usable in Excel 2010. Hopefully the macro will be of use to other FamilyTreeDNA customers. If you know how to use a macro in Excel, hopefully you know how to copy and paste someone else's macro into an appropriate place, and how to back up the macros which you want to access via your Excel Add-Ins menu, which Excel stupidly insists on storing in a fixed location in the directory hierarchy.
Nick Reddan has sent me some helpful additional information on Excel macros which I will be inserting here in due course.
For each FTDNA-overall-match, I can see the fields below either in the web browser window or in the Excel window or in both windows or somewhere else. The full (but invisible) list of matches is sortable in the web browser by clicking on any of three column headers (Match Date, Relationship Range or Shared cM); it is also sortable by four additional fields (Name (married surname), Longest Block, Y-DNA Haplogroup or mtDNA Haplogroup) by selecting from a dropdown Sort By menu and clicking an Apply button. These sorts in each case display the top 10 FTDNA-overall-matches by the chosen criterion. It is not possible to bookmark or hyperlink to the top 10 in any order bar the default Relationship Range order at https://my.familytreedna.com/family-finder/matches.aspx.
In Excel, FTDNA-overall-matches are naturally sortable by all 12 columns, including the following, by which it is not possible to sort on the website: first name, Suggested Relationship, Known Relationship, E-mail, first ancestral surname listed and Notes. Since married surnames are not a separate column in the spreadsheet, Excel cannot sort by them.
On the website, one can click on any match's Full Name or mugshot to bring up a Profile pop-up.
The following items of information are available for each FTDNA-overall-match:
What did I actually find?
You must refresh your internet browser
Uh-oh. It looks like you caught us just after we rolled a site upgrade. In order to continue, please refresh your internet browser. We apologize for the inconvenience.
Refresh browser Cancel
So I was initially forced to struggle mostly with Safari and its horrendous ignorance of basic Windows keyboard shortcuts and clipboard practice.
Before looking at the chromosome browser, we need a tiny bit of basic mathematics.
The mathematical relation R is said to be transitive if VRW and WRZ imply VRZ.
For example, equals (=) is a transitive relation, since V=W and W=Z imply V=Z.
Other simple mathematical examples of transitive relations are is greater than (>), is less than (<) and is a subset of.
A mathematical example of a relation which is not transitive is is not equal to. For example, 1 is not equal to 2 and 2 is not equal to 1, but 1 is equal to 1.
More generally — verbally — is identical to is a transitive relation, since V is identical to W and W is identical to Z imply that V is identical to Z.
In genealogy, is related to is not a transitive relation, since Tom is related to Dick and Dick is related to Harry do not necessarily imply that Tom is related to Harry. Tom could be related to Dick on Dick's paternal side and Dick related to Harry on Dick's maternal side, in which case Tom is not related to Harry. Or Tom, Dick and Harry could have a common ancestor (more likely, a common ancestral couple, depending on whether they are full-cousins or half-cousins) in which case Tom is related to Harry. Or Dick could have some more remote ancestral couple for whom one spouse is related to Tom and the other spouse is related to Harry in which case again Tom is not related to Harry. I think that takes care of all the possibilities. Some genealogists like to talk about the concept of is connected to, which is a transitive relation. (If X is related to Y and Y is related to Z, then we say that X is connected to Z, regardless of whether X is related to Z.)
Other simple genealogical examples of transitive relations are is an ancestor of and is a descendant of.
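On a small finite set, the definition of transitivity can be checked by brute force. The following Python sketch (the function name is_transitive is my own, and the Tom, Dick and Harry data simply encodes the first scenario described above) tests every triple V, W, Z and confirms the examples given so far.

```python
from itertools import product


def is_transitive(elements, related):
    """True if related(v, w) and related(w, z) always imply related(v, z)."""
    return all(
        related(v, z)
        for v, w, z in product(elements, repeat=3)
        if related(v, w) and related(w, z)
    )


numbers = range(1, 6)
print(is_transitive(numbers, lambda a, b: a == b))  # equals
print(is_transitive(numbers, lambda a, b: a > b))   # is greater than
print(is_transitive(numbers, lambda a, b: a != b))  # is not equal to

# Tom is related to Dick, and Dick to Harry, but Tom is not related to Harry.
# Everyone is treated as related to himself so that symmetry alone does not
# break the test.
known_pairs = {frozenset(("Tom", "Dick")), frozenset(("Dick", "Harry"))}


def is_related(a, b):
    return a == b or frozenset((a, b)) in known_pairs


print(is_transitive(["Tom", "Dick", "Harry"], is_related))
```

The first two checks succeed and the last two fail, in line with the discussion above.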
As an aside, and speaking of ancestors versus ancestral couples, it is surprising how much more common the former phrase is than the latter in what I have read about DNA. If two cousins find a significant half-identical region in their autosomal DNA, then they can, with a bit of co-operation, work out which is the most recent ancestral couple from whom they have both inherited the relevant segment. At that stage, it is generally equally likely that the segment was inherited from the husband in that couple as from the wife. (Political correctness probably means that I should call them male partner and female partner rather than husband and wife!) Eventually, a more distant cousin sharing the same segment may show up and reveal whether the segment came from the husband or from the wife in the most recent ancestral couple of the first two cousins. This only pushes the conundrum back another generation or more, to the most recent ancestral couple shared by all three cousins.
The advantage of mathematics over the English language is its lack of ambiguity.
Is matches a transitive relation? It depends on the sense in which the word is used.
In many uses of the word matches, if V matches W and W matches Z, then V matches Z, and so matches is (sometimes) a transitive relation.
For example, if matches is used in the sense of is identical to, then we have already seen that it is a transitive relation.
However, Family Tree DNA uses the word matches in the sense of is related to, and we have already seen that then matches is clearly NOT a transitive relation (even without the added complication that in this case the relationship is probable rather than known).
Further confusion can arise, and certainly arose initially for me, from the multiple uses of the verb match, with different connotations, by Family Tree DNA and its users:
Genealogists are familiar with the extension of an is-related-to type of relation to an is-connected-to type of relation.
Mathematicians in exactly the same way frequently extend a mathematical relation to the equivalence relation generated by the underlying mathematical relation.
Any equivalence relation divides the set of objects which it compares into equivalence classes.
We can define such an is-related-to type of relation on the set of all my FTDNA-overall-matches. Let's call this relation P (for Paddy). We will say that WPZ if (a) W FTDNA-overall-matches both Z and me and (b) Z FTDNA-overall-matches both W and me. This just means that there is DNA evidence that W and Z may be related to each other and to me.
Two people are in the same equivalence class of the corresponding is-connected-to type of relation if it is possible to trace a path from one to the other, but without going through me.
So if my siblings are not in the FTDNA database and my parents are not related to each other and my parents have no relatives in common who have tested, then my paternal relatives and my maternal relatives will not end up in the same equivalence class.
Similarly, if none of my first cousins or their descendants are in the FTDNA database and similar assumptions hold, then people related to me through each of my four grandparents will end up in four (or more) equivalence classes.
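The equivalence classes generated by a list of related pairs can be computed with a standard union-find (disjoint-set) structure. The Python sketch below is only an illustration: the six labels and three pairs are made up, and the function name is hypothetical. It assumes that every name appearing in a pair also appears in the list of matches.

```python
def equivalence_classes(matches, related_pairs):
    """Group matches into the classes generated by the given related pairs."""
    parent = {m: m for m in matches}

    def find(m):
        # Follow parent links to the representative, shortening the path as we go.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for w, z in related_pairs:
        parent[find(w)] = find(z)  # merge the two classes

    classes = {}
    for m in matches:
        classes.setdefault(find(m), set()).add(m)
    # Largest class first; ties broken alphabetically for a stable order.
    return sorted(classes.values(), key=lambda c: (-len(c), sorted(c)))


# Hypothetical data: three paternal matches, two maternal matches, one loner.
matches = ["P1", "P2", "P3", "M1", "M2", "X"]
pairs = [("P1", "P2"), ("P2", "P3"), ("M1", "M2")]
for group in equivalence_classes(matches, pairs):
    print(sorted(group))
```

Note that P1 and P3 end up in the same class even though they are not listed as a pair, because both are paired with P2. The kit owner is deliberately left out of the list of matches, since every match is by definition related to the kit owner.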
Many people believe that there are no more than six degrees of separation between any two human beings, but I don't. I think the world is much smaller than that. As one of my friends says, "there are only 200 people in the world and the rest are an illusion created with mirrors". It took me several decades of doing genealogy to come up with a path from my father to my mother: a cousin of my father and a cousin of my mother whose spouses were uncle and niece. Within a couple of years, I had found a second such path. There may well be such paths between your paternal relatives and your maternal relatives in the FTDNA database, so you may find that your FTDNA-overall-matches do not divide naturally into equivalence classes like this at all.
DNAGEDCOM.com allows me to download an ??????_ICW.csv file (where ?????? represents my kit number) defining the is-related-to pairs by which the relation P is generated. If I open this .csv file in Microsoft Excel, then I can use the PivotTable tool to start generating the is-connected-to relation which breaks my FTDNA-overall-matches up into the equivalence classes of interest. The next step is to use a few matrix multiplications to calculate for each of my FTDNA-overall-matches how many of my other FTDNA-overall-matches are within one, two, three, etc., degrees of separation. Unfortunately, as I feared, having tried to use it back around 2001 for handicapping racehorses, the MMULT function in Microsoft Excel is still incredibly slow and inefficient in the current version, even on my top-of-the-range 2012 laptop, so the details will have to wait until I install some real software like SciLab to do the job.
What I have discovered is that as of 10 Jan 2014 my 381 FTDNA-overall-matches included eight people each in equivalence classes on their own, in other words eight people with whom I have no ICW matches.
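As an alternative to repeated matrix multiplication, the number of matches at each degree of separation from a given match can be found with a breadth-first search, which only ever follows the pairs that actually exist. This is a different technique from the MMULT approach described above, sketched here in Python with made-up pairs. Reading the pairs out of the ICW file is left out, because the sketch makes no assumption about that file's column layout.

```python
from collections import deque


def degrees_of_separation(start, related_pairs):
    """Breadth-first search: map each reachable match to its distance from start."""
    neighbours = {}
    for w, z in related_pairs:
        neighbours.setdefault(w, set()).add(z)
        neighbours.setdefault(z, set()).add(w)

    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for other in neighbours.get(current, ()):
            if other not in distance:
                distance[other] = distance[current] + 1
                queue.append(other)
    return distance


def count_by_degree(start, related_pairs):
    """Count how many other matches lie at one, two, three, ... degrees from start."""
    counts = {}
    for person, d in degrees_of_separation(start, related_pairs).items():
        if d > 0:
            counts[d] = counts.get(d, 0) + 1
    return counts


# Hypothetical pairs: a chain A-B-C-D and a separate pair E-F.
pairs = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "F")]
print(count_by_degree("A", pairs))
```

Anyone who never appears in the result for a given starting match belongs to a different equivalence class, and a match who appears in no pair at all is in a class on his or her own.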
To users of FamilyTreeDNA.com, the word "matches" can also sometimes mean region-matches.
Here's an example of what the FTDNA chromosome browser looks like when comparing the logged-in kit to two other kits:
A maximum of five kits can be viewed simultaneously in the chromosome browser. These can be selected by ticking boxes in one of the worst designed selectors imaginable. To preserve the anonymity of my FTDNA-overall-matches and to avoid shaming the designer, I don't show it here. It is about 7.5 lines tall, but shows one's FTDNA-overall-matches 10 at a time, ordered by surname, necessitating a vertical scrollbar. It runs along to the left of chromosomes 8 to 18, with plenty of blank white space to the left of chromosomes 19 to X. Instead of showing the usual colour-coded place-holders where no mugshots are available, it shows grey mugshots for both males and females. My first FTDNA-overall-match by surname is identified only by his or her initials "N A". I have to go to another tab and search for his or her colour-coded mugshot to determine whether this is a male or a female, which is very significant when looking at the X chromosome.
Someone I know through genealogy sent me the screenshot above (taken before FTDNA added the X chromosome), generated while logged in to her mother's FTDNA kit. Let's call her Terry. There are lots more similar examples on the ISOGG Wiki.
Terry and her mother have both tested with FTDNA and therefore are, of course, FTDNA-overall-matches.
Terry's mother and I are also FTDNA-overall-matches. As discussed on the FTDNA Facebook page, we region-match on 14 regions of 1cM or more. Our number of shared segments, longest block shared (8.93) and total shared cM (38.24) combine to bring us in above the secret threshold for FTDNA-overall-matches.
Terry and I are not FTDNA-overall-matches. For reasons that I will come to later, we cannot see in the FTDNA data how much, or which segments, Terry has inherited from her mother of the autosomal DNA that her mother and I share, but we obviously expect it to be about half. What matters for now is that Terry and I come in below the threshold for FTDNA-overall-matches.
The blue regions in the image above are the 14 segments of 1cM or more on which Terry's mother and I region-match; the original default image showed only the 1 segment of 5cM or more on which we region-match, but there is a dropdown menu which can be used to reduce the threshold and display the smaller regions.
The orange regions in the image are those on which Terry and her mother match: pretty much everywhere.
For my first couple of days looking at these pretty pictures in the chromosome browser, I made the false assumption that region-matches was a transitive relation, but this example shows that it clearly isn't. If Terry region-matched me everywhere that she region-matches her mother and that her mother region-matches me, i.e. in all the blue segments in the chromosome browser, then Terry would have to FTDNA-overall-match me, which she doesn't.
The next clue that I had misunderstood something was the fact that Terry's DNA and her mother's DNA seem to match in 100% of locations, but we expect to find that Terry inherited only 50% of her DNA from her mother (one chromosome in each pair), and the other 50% (the other chromosome in each pair) from her father. The parent-child example in the ISOGG wiki looks just the same as Terry's example, so I knew there had to be a rational explanation.
I turned to Google in search of this explanation, and eventually found a reference to half-identical regions. Thinking that I might be on the right track, I googled that phrase, which brought me to Lesson 9 of the Beginners Guide To Genetic Genealogy on the Wheaton Surname Resources website, at which stage a light-bulb finally went off in my head! I hope I have explained the concept of half-identical region clearly above.
My initial confusion came from the fact that there are 22 pairs of chromosomes, but the chromosome browser appears to show only 22 single chromosomes.
ADSA provides a nice graphical overview of your matches and allows you to identify groups of people related to each other, and in some areas to separate your paternal and maternal matches.
The FTDNA chromosome browser allows comparison of the logged-in kit with up to five FTDNA-overall-matches. The ADSA automatically compares all those with half-identical regions longer than the selected cM threshold on each chromosome. I recommend thresholds of 10cM and 1000 SNPs for beginners, but you will eventually want to drop these thresholds to check whether relevant individuals appear on particular chromosomes.
The FTDNA matrix tool allows comparison of a selected group of up to ten of the logged-in kit's FTDNA-overall-matches and shows which of the selected group FTDNA-overall-match each other. The ADSA automatically displays the equivalent matrix for 22 naturally defined groups, one for each chromosome, comprising all my FTDNA-overall-matches with half-identical regions longer than the selected cM threshold on the relevant chromosome.
The FTDNA chromosome browser and matrix tool sort the people being compared according to the probably arbitrary order in which the user selected them. The ADSA very helpfully sorts the people being compared by the starting location of the shared half-identical region.
The pretty FTDNA chromosome browser pictures can only be shared as screen-grabs. The even more colourful ADSA output is just a single clever self-contained (admittedly large) HTML file, which can be saved to disk and even shared as an e-mail attachment.
The FTDNA website breaks my FTDNA-overall-matches up into dozens of web pages with 10 people on each. The ADSA displays all of them on a single web page.
The FTDNA website displays ancestral surnames and locations with horrible nested scroll bars. The ADSA displays the full string on mouseover with no need for additional clicking (if your screen is wide enough).
The ADSA uses data transferred by the user from FTDNA, so cannot be used to view matches from the perspective of anyone other than the kit owner.
Note that the ADSA displays anyone who region-matches the kit-owner in more than one region of the same chromosome as if he or she is two separate people (who sometimes appear not to even FTDNA-overall-match each other).
A chromosome, as shown in the chromosome browser, is essentially an array of pairs of the four letters A, C, G and T.
This table shows the distribution of the paternal/maternal unordered pairs in my raw autosomal results:
paternal/maternal unordered pair | No. |
---|---|
CC | 132111 |
GG | 131631 |
AA | 113797 |
TT | 112998 |
TC | 83322 |
AG | 82671 |
AC | 18129 |
TG | 18011 |
-- | 3501 |
GC | 178 |
CG | 168 |
TA | 122 |
AT | 113 |
Grand Total | 696752 |
Lots of interesting things can be seen from this table:
Note that at a typical (biallelic) SNP where I am heterozygous, I am half-identical to everyone. For example, TC is half-identical to CC, TT and TC. Hence, biallelic SNPs where I am heterozygous provide no information in the search for half-identical regions. On the other hand, at a biallelic SNP where I am homozygous, I am not half-identical to anyone who is homozygous with the other possible letter, i.e. where we are opposite homozygous. For example, if I am AA, I am half-identical to those who are AG or AA, but not to those who are GG. Thus the significance of a region where I am half-identical to a stranger is determined not by the total number of SNPs in the region, but by the number of SNPs in the region at which I am homozygous. The more heterozygous SNPs I have in a region, the more false positives will be expected to appear among those with whom I am half-identical on that region.
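The rule in the preceding paragraph is easy to state in code. In this minimal Python sketch (the function names are my own), two unordered pairs are half-identical if they have at least one letter in common, and a SNP is counted as informative for me only if I am homozygous there. No-calls (shown as -- in the raw data) are ignored here, which is an assumption of the sketch rather than a statement about how any company treats them.

```python
def is_half_identical(pair_a: str, pair_b: str) -> bool:
    """True if the two unordered pairs share at least one letter."""
    return bool(set(pair_a) & set(pair_b))


def is_homozygous(pair: str) -> bool:
    """True if both letters are the same (no-calls are not counted)."""
    return pair != "--" and len(pair) == 2 and pair[0] == pair[1]


# At a biallelic SNP with letters T and C there are three possible pairs.
possible = ["CC", "TC", "TT"]

# Heterozygous TC is half-identical to all of them, so it rules nobody out.
print(all(is_half_identical("TC", other) for other in possible))

# Homozygous CC rules out exactly the opposite homozygous pair.
print([other for other in possible if not is_half_identical("CC", other)])

# Informative SNPs in a (made-up) five-SNP region of my own results.
my_region = ["AA", "AG", "CC", "TC", "GG"]
print(sum(is_homozygous(pair) for pair in my_region))
```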
The prevalence of homozygous SNPs varies markedly throughout my genome. I have a spreadsheet in which I have compiled details of 281 regions on which I am half-identical to various people. The proportion of SNPs at which I am homozygous in 280 of these regions varies between 51.3% to 86.3%, with one small outlier where I am homozygous at 74 of 75 SNPs. The mean proportion is 72.8%, which surprised me. I am homozygous at only 70.4% of all SNPs, and expected homozygous SNPs to be under-represented in regions where I am half-identical to others. The standard deviation of the proportion is 6.6 percentage points.
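The overall figure of 70.4% can be reproduced directly from the table of unordered pairs above. This short Python sketch just re-enters the counts from that table and divides. Treating -- as a no-call, counted as neither homozygous nor heterozygous, is my reading of the table.

```python
# Counts copied from the table of paternal/maternal unordered pairs.
counts = {
    "CC": 132111, "GG": 131631, "AA": 113797, "TT": 112998,
    "TC": 83322, "AG": 82671, "AC": 18129, "TG": 18011,
    "--": 3501, "GC": 178, "CG": 168, "TA": 122, "AT": 113,
}

total = sum(counts.values())
no_calls = counts["--"]
homozygous = sum(
    n for pair, n in counts.items() if pair != "--" and pair[0] == pair[1]
)
heterozygous = total - homozygous - no_calls

print(total)                              # 696752, the Grand Total
print(f"{homozygous / total:.2%}")        # 70.40%
print(f"{heterozygous / total:.2%}")      # 29.09%
print(f"{no_calls / total:.2%}")          # 0.50%
```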
Surprisingly, I have not yet found any comparison tool which automatically reports either the number or percentage of informative homozygous SNPs in a region of interest. For a region on which two individuals are half-identical, any such comparison tool should report the number of homozygous SNPs for each of the two individuals. A comparison tool which does this is urgently needed by all genetic genealogists.
genesis.GEDmatch.com began to address this problem by introducing the concept of slimming, which remains poorly documented, but has been discussed on a Facebook thread.
David Pike's utilities report, inter alia, on runs or sequences of consecutive homozygous and heterozygous SNPs. Here are summaries for the first four kits to which I had access:
Name | %Heterozygous | %Homozygous | %NoCalls | Total | Longest heterozygous sequence (SNPs) | Longest homozygous sequence (SNPs) |
---|---|---|---|---|---|---|
Paddy | 29.09% | 70.40% | 0.50% | 100.00% | 18 | 480 |
Antoin | 29.34% | 70.31% | 0.35% | 100.00% | 18 | 720 |
Mary | 29.35% | 70.05% | 0.60% | 100.00% | 23 | 428 |
Anthea | 29.63% | 70.17% | 0.19% | 100.00% | 21 | 350 |
The extraordinary thing about this table is the difference between the length of the longest sequence of heterozygous SNPs and the length of the longest sequence of homozygous SNPs. As there are more homozygous SNPs than heterozygous SNPs, one would certainly expect the longest homozygous sequence to be longer than the longest heterozygous sequence, but not more than ten times longer. Why is this?
The data for my relatives does not appear to be unusual, as the default settings for the two utilities are 20 SNPs and 200 SNPs respectively.
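To put a rough number on "longer, but not more than ten times longer", here is a back-of-envelope Python sketch. It rests on a strong assumption that is certainly false for real DNA, namely that each SNP is homozygous or heterozygous independently of its neighbours. Under that assumption, a standard approximation for the longest run of an outcome with probability p in n independent trials is the logarithm, to base 1/p, of n(1 − p). The function name is my own.

```python
import math


def expected_longest_run(n: int, p: float) -> float:
    """Approximate longest run of an outcome with probability p in n independent trials."""
    return math.log(n * (1 - p)) / math.log(1 / p)


n = 696752  # total SNPs in my raw autosomal results

# My own proportions of heterozygous and homozygous SNPs.
print(round(expected_longest_run(n, 0.2909), 1))
print(round(expected_longest_run(n, 0.7040), 1))
```

With my figures this gives runs of about 11 heterozygous and about 35 homozygous SNPs, a ratio of little more than three to one. So independence alone comes nowhere near explaining observed homozygous runs of several hundred SNPs, which suggests that neighbouring SNPs are far from independent.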
Several words and phrases suggest themselves to describe the two relations shown in the coloured regions in the FTDNA chromosome browser:
The words and phrases which spring to mind include:
For the first relation (that between the reference person and the person represented by one of the colours) let's stick to the last of these to make it completely unambiguous what we mean.
The first and most important thing to note is that, just like - and for exactly the same reasons as - is related to, is half-identical with is NOT a transitive relation. The person represented by the orange segments may not be half-identical with the person represented by the blue segments, even if the segments overlap.
Suppose V and W share a half-identical region on, say, chromosome 11, and W and Z share a half-identical region starting at the same location on chromosome 11. It does not follow that V and Z share a half-identical region here. For example, V's first pair could be AC, W's first pair could be CT, and Z's first pair could be GT (which is not half-identical with AC). The same could be the case at many other locations within the region.
In practice, this means that W inherited this segment of his (or her) paternal chromosome from an ancestor shared with V but inherited the corresponding segment of his maternal chromosome from an ancestor shared with Z (or vice versa, maternal shared with V and paternal shared with Z).
When W looks at V and Z together in the chromosome browser, there will be an overlap of coloured regions in the relevant part of chromosome 11.
When V looks at W and Z together in the chromosome browser, there will be just one coloured region in the relevant part of chromosome 11.
And when Z looks at V and W together in the chromosome browser, there will be just one coloured region in the relevant part of chromosome 11.
This assumes that each of the three individuals FTDNA-overall-matches both of the other two; otherwise FamilyTreeDNA.com does not allow them to do the comparisons in the chromosome browser.
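The V, W and Z example can be checked mechanically. In this Python sketch, the first pair for each person is the AC, CT and GT example from the text, and the other two positions are made up purely for illustration. A region counts as half-identical if the two people share at least one letter at every SNP in it, and thresholds on length and SNP count are ignored.

```python
def half_identical_region(person_a, person_b):
    """True if the two people share at least one letter at every SNP in the region."""
    return all(set(a) & set(b) for a, b in zip(person_a, person_b))


# Unordered pairs at three consecutive SNPs (hypothetical beyond the first).
people = {
    "V": ["AC", "AA", "CC"],
    "W": ["CT", "AG", "CT"],
    "Z": ["GT", "GG", "TT"],
}

# Who would each person see coloured in on this region of the chromosome browser?
for viewer, own in people.items():
    shown = [
        name
        for name, other in people.items()
        if name != viewer and half_identical_region(own, other)
    ]
    print(viewer, "sees", shown)
```

W sees both V and Z, while V and Z each see only W, which is exactly the pattern of one overlap and two single coloured regions described above.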
If V, W and Z in this example want to research effectively, it appears that they will have to share their FamilyTreeDNA.com passwords, so that each can compare the other two in the chromosome browser. In Chapter 1, I have already pointed out other circumstances in which sharing passwords seems to be necessary for effective and productive research. It would be nice if there were two levels of access to kits - read-only guest access to allow this sort of chromosome browsing; and full write access to allow changing of passwords, editing of GEDCOMs, ancestral surnames, known relationships, etc., in much the same way as online family trees published using systems such as TNG and even the much-maligned ancestry.com allow different levels of access to different people.
On the FTDNA website, there are lots of routes from the Matches page to the Chromosome Browser page and to some of the data which the chromosome browser displays.
For example, find the person you want to compare with on the Matches page. Click the tiny dropdown just below his or her mugshot (or missing mugshot icon). The Longest Block figure is immediately revealed. Click the "Compare in Chromosome Browser" link. (Repeat for up to five individuals.) Click the large blue "compare" arrow. Now the number of Shared Segments between you and each selected person is revealed along with the lovely colour diagram.
I first found the following more circuitous alternative route: click Family Finder, Chromosome Browser, Filter Matches By ..., Name, [type name, don't hit <Enter> key], Find, checkbox. Not yet having found the quick route to the Longest Block figure, I thought I then had to View this data in a table and scan the centiMorgans column for the largest value. The centiMorgans data is shown to two decimal places in the table, but for some strange reason trailing zeroes are omitted and the numbers are centred instead of aligning on the decimal point, which doesn't make it any easier to find the maximum by eyeballing the column.
Now it is time for a discussion of an important flaw in Family Tree DNA's policy:
I cannot use the Chromosome Browser to compare my DNA with that of someone who is not one of my FTDNA-overall-matches, not even with someone like Terry, who is the daughter of a match, and even uses the same e-mail address for her own kit and for her mother's kit. Likewise, I cannot use the Chromosome Browser to compare my DNA with that of a known relative who has tested but has not shown up as an FTDNA-overall-match.
I don't see why any two consenting adult customers of Family Tree DNA should not be allowed to compare their autosomal DNA in the chromosome browser, but the company thinks differently, citing unexplained "compliance with our matching and privacy policies" when I raised this question on the company's Facebook page.
Such comparisons must be done on third party websites, such as GEDmatch.com.
The next two chapters will each deal in some detail with a single initial FTDNA-overall-match with whom I have had extensive correspondence. In this chapter, I will include brief reviews of two other groups of FTDNA-overall-matches whom I have not yet contacted directly.
As already mentioned, my initial top ten FTDNA-overall-matches, which are the same both by Relationship Range and by Longest Block, include no less than five members of the Dengen family, sharing an e-mail address - a mother and four of her children. (I was bemused when I first looked at this in the small hours of the morning and thought one of the sons was his father, with whom he shares his name! I am still bemused that the father has two Buckley lines, apparently from Cork and Tipperary, while I have one Buckley line, from Kerry; I don't seem to have any surnames in common with the mother, with whom I share DNA!) This is where the Chromosome Browser seemed to come into its own. All five Dengens are half-identical with me on the same large segment on chromosome 20 (27.96cM for the mother; 20.11cM or greater for the children). As explained above, I cannot compare two of the siblings to each other or compare mother and child in the chromosome browser, so I cannot tell whether any pair of them are half-identical in this segment. But because I know their relationships to each other from their posted family trees, I know that mother and each child must be half-identical in this segment. Ex ante, I cannot tell whether any pair of siblings are half-identical in this segment, but (I think) because all four are half-identical with me, the chances that any two of them are half-identical with each other become larger.
My Shared cM with the children is between 49% and 73% of my Shared cM with the mother. The four siblings somehow have slightly different lists of ancestral surnames - probably because FamilyTreeDNA hasn't thought of allowing, or forcing, siblings who share an e-mail address (and even those who don't) to also automatically share their list of ancestral surnames. My understanding is that the genetic process is an example of what statisticians call a Markov process: whether or not any or all of the four children have inherited this block of DNA from their mother adds nothing further to what can be inferred about my relation to the mother from the overlap between my DNA and hers.
So far, I have not discussed the crucial relationship between the lengths of half-identical regions, sometimes expressed as percentages of autosomal DNA shared by two individuals, and their genealogical relationship.
This relationship provides a crude way of estimating the genealogical relationship between two people whose autosomal DNA has half-identical regions.
Everyone inherits exactly half of his or her autosomal DNA from his or her father (22 paternal chromosomes) and the other half (22 maternal chromosomes) from his or her mother.
We have already seen that this implies that the regions where a parent and a child are half-identical cover the whole of the 22 autosomal chromosomes.
Because recombination is random in nature, the proportion of DNA inherited from grandparents and more distant ancestors becomes random.
On average, we each inherit 25% of our autosomal DNA from each of our four grandparents. Random variation means that some people, for example, inherit 24% from their paternal grandfather and 26% from their paternal grandmother; or 27.1% from maternal grandfather and 22.9% from maternal grandmother; or even 32% from paternal grandfather and only 18% from paternal grandmother. The latter is a real example from GEDmatch, where the grandson is A260081, the grandfather is A237206 and the grandmother is A329975.
On average, siblings are fully identical on 25% of their autosomal DNA (where they have inherited the same DNA from both parents); they are half-identical on 50% of their autosomal DNA (where they have inherited the same DNA from one parent, but different DNA from the other parent); and they do not match at all on 25% of their autosomal DNA (where they have inherited DNA from different grandparents on both maternal and paternal sides).
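The 25%/50%/25% split for siblings can be checked by simple enumeration. The following is a minimal Python sketch, not taken from any testing company's software: at a single autosomal position, each sibling independently receives one of the father's two copies and one of the mother's two copies, and the function name `sibling_sharing` is purely illustrative.

```python
from fractions import Fraction
from itertools import product

# At any one position, each parent passes on the copy that came from
# one of his or her own two parents (labelled 0 and 1 here).
GRANDPARENT_CHOICES = (0, 1)


def sibling_sharing():
    """Return exact probabilities that two siblings are fully identical,
    half-identical, or not matching at a single autosomal position."""
    counts = {"full": 0, "half": 0, "none": 0}
    # Enumerate the paternal and maternal copies received by each sibling.
    for pat1, mat1, pat2, mat2 in product(GRANDPARENT_CHOICES, repeat=4):
        same_sides = (pat1 == pat2) + (mat1 == mat2)
        key = {2: "full", 1: "half", 0: "none"}[same_sides]
        counts[key] += 1
    total = sum(counts.values())
    return {kind: Fraction(n, total) for kind, n in counts.items()}


if __name__ == "__main__":
    for kind, prob in sibling_sharing().items():
        print(f"{kind}: {float(prob):.0%}")
```

These are averages over the whole genome; any one pair of siblings will deviate from them because recombination is random.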
The sibling relationship is special because siblings are related on both the paternal and the maternal side.
The calculations are much easier for those who are related on just one side.
Siblings are expected to share 50%, uncle-or-aunt/nephew-or-niece 25%, first cousins or greatuncle-or-greataunt/greatnephew-or-greatniece 12.5% and so on.
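The pattern behind these expected figures is a halving for every generation separating each person from the common ancestors. As a rough sketch in Python (the function name and parameters are illustrative assumptions, and the results are expectations only, not what any particular pair will actually share):

```python
def expected_share(gen_a, gen_b, common_ancestors=2):
    """Expected fraction of autosomal DNA shared by two collateral relatives.

    gen_a, gen_b: number of generations from each person back to the
        common ancestor(s).
    common_ancestors: 2 for a full relationship (an ancestral couple),
        1 for a half relationship (a single shared ancestor).
    """
    if gen_a < 1 or gen_b < 1:
        raise ValueError("both people must descend from the common ancestor(s)")
    # Each generation halves the expected contribution of an ancestor.
    return common_ancestors * 0.5 ** (gen_a + gen_b)


if __name__ == "__main__":
    print("siblings:", expected_share(1, 1))             # 0.5
    print("uncle/nephew:", expected_share(1, 2))         # 0.25
    print("first cousins:", expected_share(2, 2))        # 0.125
    print("greatuncle/greatnephew:", expected_share(1, 3))  # 0.125
```

Note that first cousins and greatuncle/greatnephew give the same expected figure, which is why other evidence such as ages is needed to tell them apart.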
There are some nice pictures and tables which illustrate the shared percentages, such as this one from the FTDNA website:
This graph from 23andMe shows why it is easy to exactly identify close relationships but more difficult to be precise about relationships beyond first cousins:
On 24 October 2015, Robert James Ligouri announced his Autosomal (atDNA) Prediction Grid. For a given value of Total Shared cMs, this shows the range of plausible full relationships (not considering the additional possibility of a half-relationship).
This 50% principle can be quite difficult to grasp, as illustrated by Roberta Estes's blog. The sort of loose language that Roberta occasionally uses can lead to confusion. Thus she appears surprised to find that almost 50% of short (in cM) segments of maternal (say) DNA are inherited in their entirety by a child from a parent, and almost 50% are not inherited at all by the child (who instead inherits the corresponding paternal segment). In the limit, this is like a coin toss, where almost 50% of the time the coin lands heads up, almost 50% of the time the coin lands tails up, and there is a minuscule probability that the coin lands standing on its edge.
At the opposite extreme, when results are aggregated over the entire genome, the proportions of a child's maternal autosomes that come from the maternal grandmother and from the maternal grandfather are each expected to be 50%, and the standard deviation around this figure is small.
For a large collection of short segments, it is of course expected that half of the segments will be passed on to the child and half will not. It is NOT expected that half of each individual segment will be passed on. By definition of a centiMorgan, it is expected that a 100cM segment will experience one crossover and thus be passed on in two parts. In this sense, we could say that segments longer than 100cM are expected to break into more than two parts, but segments shorter than 100cM are expected (on average) to break into fewer than two parts.
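The coin-toss analogy can be made concrete with a small sketch. The expected number of parts follows directly from the definition of the centiMorgan; the probabilities, however, rest on an ASSUMPTION that is not made anywhere on this page, namely that crossovers occur as a Poisson process (Haldane's model, ignoring interference). The function names are illustrative.

```python
import math


def expected_parts(length_cm):
    """Expected number of parts a segment is split into in one meiosis.

    By definition, 1cM corresponds to an expected 0.01 crossovers.
    """
    return 1 + length_cm / 100.0


def transmission_probabilities(length_cm):
    """Probabilities that a child inherits a given parental segment
    whole, not at all, or only in part.

    ASSUMPTION: crossovers follow a Poisson process (Haldane's model).
    """
    p_no_crossover = math.exp(-length_cm / 100.0)
    return {
        "whole": 0.5 * p_no_crossover,    # the coin lands heads up
        "none": 0.5 * p_no_crossover,     # the coin lands tails up
        "partial": 1.0 - p_no_crossover,  # the coin lands on its edge
    }


if __name__ == "__main__":
    for cm in (1, 5, 20, 100):
        probs = transmission_probabilities(cm)
        print(cm, expected_parts(cm), {k: round(v, 3) for k, v in probs.items()})
```

Under this assumption, a 1cM segment is passed on whole or not at all in all but about 1% of cases, while the "edge" outcome becomes steadily more likely as segments get longer.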
When checking whether a small segment has been partially passed on by a parent to a child or not passed on at all, it is important to set both the cM and SNP filters low enough to pick up even smaller segments. See Example III below.
If the percentage of autosomal DNA shared by two people is 12.5%, then other variables like age differences need to be considered in working out the most likely relationship between them. Two men of similar age sharing 12.5% are most likely first cousins, but if their ages are a couple of generations apart they are more likely to be greatuncle and greatnephew.
How can we estimate the shared percentages from the observed data?
The standard procedure appears to be to do this on the cM scale, by adding up the lengths of any half-identical regions longer than 1cM. This overestimates the shared percentage by including half-identical regions which are not half-identical by descent, but underestimates the shared percentage by excluding half-identical regions which are shorter than 1cM but are half-identical by descent. It seems that these two biases cancel each other out quite effectively.
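As a minimal sketch of this procedure in Python, the total below simply adds up the half-identical regions longer than a chosen threshold. The segment lengths in the usage example are hypothetical and purely illustrative; the function name is likewise an assumption.

```python
def total_shared_cm(segment_lengths_cm, min_cm=1.0):
    """Add up the lengths (in cM) of half-identical regions longer than min_cm."""
    return sum(length for length in segment_lengths_cm if length > min_cm)


if __name__ == "__main__":
    # Hypothetical lengths: two regions above the threshold, two below it.
    lengths = [35.8, 3.5, 0.8, 0.4]
    print(f"{total_shared_cm(lengths):.1f}cM counted")
    print(f"{total_shared_cm(lengths, min_cm=0.5):.1f}cM with a 0.5cM threshold")
```

Lowering the threshold picks up more of the short regions that are half-identical by descent, but also more of those that are half-identical only by chance.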
If we have an unbiased estimate of the shared percentage between W and Z and want to estimate the shared percentage between W and one of Z's parents whose DNA is unavailable, say V (the one on whose side the relationship is), what can we do? Since Z got half of her DNA from V, we just double the estimate. But what about half-identical regions between 0.5cM and 1cM which are not counted when comparing W and Z, but which are expected to come from half-identical regions of between 1cM and 2cM shared by W and V? Do we need to use a factor slightly greater than 2.0 to take account of these regions?
Similarly, if we have separate estimates of the shared percentages between B and the siblings D, E and F, we can get a more precise estimate of the shared percentage between B and the parent of the sibling group who is related to B, say H, by doubling the figures for D, E and F and then taking the average of the three results.
The method of doubling the shared percentage appears to work fine until the result is greater than 50% or even 100%. By the time that stage is reached, in theory the analysis has gone back beyond the common ancestor, possibly by taking a wrong turn somewhere along the road. In practice, however, the shared segments that are detected are precisely those that are longer than expected, so implied shared percentages greater than 50% or even 100% are likely to be encountered even in a correct lineage.
In fact, beyond a common ancestor, a halving principle replaces the doubling principle. Suppose I share 23% of my DNA with one of my grandfathers (slightly less than the expected 25%). Given this information, I am expected to share 11.5% of my DNA with each of his parents (slightly less than the prior expectation of 12.5%).
If you are an adoptee, or even a foundling, or for some other reason have doubts about your parentage or paternity, then DNA can be a big help. In summary:
Also:
Also:
If you know who your mother is, but don't know who your father is, then DNA can also help, particularly if your mother has other children who are willing to provide DNA samples. Your mother's other children will be either:
These are just expected values; actual values will be distributed around these averages. The standard deviations are poorly documented but there is an online spreadsheet showing a small sample of comparisons. The percentages seem to be calculated differently in this spreadsheet (can someone please explain?), but the maximum shared percentage observed for half-siblings is much smaller than the minimum shared percentage observed for full-siblings. In other words, this test can unambiguously distinguish between half-siblings and full-siblings.
Anyone who shares half-identical regions of autosomal DNA (particularly overlapping regions) with two half-siblings (or two groups of half-siblings) with the same mother is unlikely to be related to the father of either group.
My FTDNA-overall-matches include two people whose GEDCOMs suggest, with various caveats, that they may be sixth cousins, namely Charles (with whom I share 38.19cM) and Janice (27.90cM), and also Janice's nephew Walter (29.39cM).
Note that I share more with the nephew than with the aunt! Walter's father must have inherited much more than expected and Janice inherited much less than expected from their mother. My implied shared cM with Walter's father, calculated by doubling my shared cM with Walter, is 58.78. Similarly, my shared cM with Janice's mother, calculated by adding (averaging, then doubling) the shared cM figures for two of her children (one figure directly observed, the other inferred from his father's figure) is 86.68. Ultimately, my implied shared cM with their common ancestral couple, Christopher Choate and his wife, calculated by doubling the shared cM figures back to the common ancestor and then averaging, and subject to the aforementioned caveats, is 5217.92cM or 77.1%, which is implausible. One of my five closest FTDNA-overall-matches (who submitted his DNA after I did and whose shared cM with me is 49.36) is also a Choate and shares an e-mail address with Walter. Unfortunately Walter has not yet uploaded a GEDCOM for this latest member of his immediate family to give a DNA sample, so I have no idea where to incorporate the new person into this calculation. He has listed the new person's Most Distant Paternal Ancestors as "James Choate b.1813 TN m. Elmira Farmer b.1816 MO".
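The first two steps of the doubling arithmetic above can be reproduced with a short sketch. The function name is illustrative; the two input figures are the shared cM values quoted for Walter and Janice, and the calculation stops at Janice's mother because the remaining generations back to Christopher Choate are not spelled out here.

```python
def implied_parent_cm(children_cm):
    """Double each child's shared cM, then average across the siblings."""
    doubled = [2 * cm for cm in children_cm]
    return sum(doubled) / len(doubled)


walter = 29.39   # my shared cM with Walter
janice = 27.90   # my shared cM with Janice

# Walter is the only child tested, so his figure is simply doubled.
walters_father = implied_parent_cm([walter])

# Janice (observed) and her brother, Walter's father (inferred), are siblings.
janices_mother = implied_parent_cm([janice, walters_father])

if __name__ == "__main__":
    print(f"Walter's father: {walters_father:.2f}cM")   # 58.78cM
    print(f"Janice's mother: {janices_mother:.2f}cM")   # 86.68cM
```

Each further generation doubles the figure again, which is how an implausibly large total is reached after only a few more steps.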
I was not surprised when I checked the chromosome browser and found that I share the same half-identical region on Chromosome 9 with all four of these people. The half-identical region which I share with Janice is 9.76Mb or 12.69cM or 2477 SNPs long, and is contained within the even longer half-identical regions which I share with the other three. There seems little doubt, given the additional genealogical evidence, that all four of them have inherited this segment from Christopher Choate or his wife.
Christopher is said to have been born in Maryland in 1720, so at first seemed very unlikely to have descendants in Ireland, where all my known ancestors back to 1720 have lived. There is certainly no mention of Ireland in the extensive Choate pedigree on genforum. However, another Choate pedigree says that Thomas Choate, one of Christopher's many American-born sons, married in Ireland. In the complete absence of genealogical evidence to link me to his descendants, it could of course be that I am merely half-identical by chance with Christopher or his wife in the relevant region of chromosome 9, and thus that I am also half-identical by chance with many of their descendants who have inherited the relevant segment from one of them, including these four FTDNA customers. Or perhaps the caveats in the GEDCOMs are correct, and all these people are descended from their daughter-in-law, the woman whom Thomas Choate married in Ireland.
Two more people FTDNA-overall-match myself and the four Choate descendants and share half-identical regions with me overlapping with that on chromosome 9 that I share with Janice and the others. GEDmatch.com has no fewer than 17 kits with which I am half-identical on the same segment as with Janice, including Janice herself and Walter again, three other FTDNA kits and twelve 23andMe kits. I have not yet tried to trace any of them back to Christopher Choate and his wife. Some of them may be half-identical with the Choate descendants, some of them may be related to me on the opposite side from the Choates (paternal or maternal), and some of them may again be half-identical purely by chance.
Triangulation and phasing are really opposite sides of the same coin. If V is half-identical on the same region with W and Z, then there are two possibilities:
I further dislike the word triangulate because it is used ambiguously, often being used also in the context of FTDNA-overall-matches as well as in this context of region-matches. In the case of FTDNA-overall-matches, the logic is also that if three people each FTDNA-overall-match the other two, then it is more likely that all three share a common ancestor. While this triangulation argument makes a common ancestor more likely, it does not definitively prove that one exists, as each pair may share completely different half-identical regions coming from three different common ancestral couples.
Furthermore, a triangle has three sides, but in the case of region-matches FTDNA continues to effectively insist on displaying just two sides of the triangle - I can see my half-identical regions with two of my FTDNA-overall-matches, but without asking one of them to look in the chromosome browser or ADSA or GEDmatch or to join a project which I administer, I cannot tell whether they are half-identical with each other in a region where both are half-identical with me, or alternatively appear to be related to me on opposite sides (one paternal, the other maternal).
While the word triangulation suggests that inferences can be drawn and even that proofs can be established based on groups of three people, who may only be FTDNA-overall-matches, in fact one needs to look at groups of four people, who must region-match on the same region, to draw truly valuable inferences. If three people region-match you on the same region, then either all three share your paternal segment on that region, or all three share your maternal segment on that region, or two share one segment and one shares the other. In other words, at least three of the four people have an identical segment. Now it will be much easier to make progress.
This all assumes that the half-identical regions in question contain enough SNPs that they are not merely half-identical by chance. If you find two people with whom you are half-identical on the same region, there are really three possible explanations:
In the second of these cases, it can be deduced that at any biallelic SNP where the other two people don't match (i.e. are opposite homozygous), you are heterozygous (since you match both). Furthermore, if you know which is the paternal match and which is the maternal match, then you can deduce which letter came from your father and which from your mother at that SNP. For example, if you are AG, the paternal match is AA and the maternal match is GG, then your A must have come from your father and your G from your mother. Even if you don't know which is the paternal match and which is the maternal match (e.g. if you are an adoptee), you can still draw valuable inferences. For example, if both matches have ancestral surnames and/or ancestral places which crop up repeatedly among your matches, then you can phase the surnames and places into two groups, one associated with each parent. Such an analysis between Anthea (an adoptee) and myself and Michael (who both match Anthea on the same region on Chromosome 4 but don't match each other) provided strong evidence that Anthea's many matches from Connemara are probably through the parent related to Michael, and that her many matches from east Mayo are probably through her other parent, the one related to me.
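The SNP-level deduction described above (AG with a paternal match AA and a maternal match GG) can be sketched as a small function. This is only an illustration: the function name is hypothetical, genotypes are assumed to be two-letter strings, and real raw data would also need handling for no-calls and genotyping errors.

```python
def phase_snp(own, paternal_match, maternal_match):
    """Return (allele_from_father, allele_from_mother), or None if nothing
    can be deduced at this SNP.

    A deduction is made only where the two matches are opposite
    homozygous and your own genotype contains one allele from each.
    """
    def homozygous(genotype):
        return len(genotype) == 2 and genotype[0] == genotype[1]

    if not (homozygous(paternal_match) and homozygous(maternal_match)):
        return None
    from_father, from_mother = paternal_match[0], maternal_match[0]
    if from_father == from_mother:
        return None  # the matches are not opposite homozygous here
    if sorted(own) != sorted(from_father + from_mother):
        return None  # inconsistent with matching both people at this SNP
    return from_father, from_mother


if __name__ == "__main__":
    print(phase_snp("AG", "AA", "GG"))  # ('A', 'G')
    print(phase_snp("AG", "AG", "GG"))  # None
```

Applied SNP by SNP along the region, this phases only those positions where the two matches happen to be opposite homozygous; the remaining positions stay unphased.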
Another word of caution (which still applies to this last example): if the region in question is shorter than 20cM, then it is not sufficient to check whether the two people are FTDNA-overall-matches or even to check whether ADSA shows that they match in the region of interest. It could be that they are half-identical on the region of interest, but don't meet the 20cM overall Shared cM criterion to be deemed FTDNA-overall-matches. It is essential to copy both parties' raw data to GEDmatch.com and to run a one-to-one comparison between them there. This has not yet been done in this example. It is also advisable to set the cM threshold for the one-to-one comparison at GEDmatch.com much lower than the length of the region of interest, since it is well-known that half-identical regions have fuzzy boundaries. Many people argue that the word 'triangulate' should never be used when discussing ICW matches unless all three have been shown to be half-identical to one another on the same region of a particular chromosome.
Here is another interesting example combining triangulation and phasing:
Kit1 | Kit2 | Chr | Start Location | End Location | Centimorgans (cM) | SNPs |
F310654 | F335391 | 6 | 9,024,323 | 37,698,281 | 35.8 | 13187 |
F310654 | M090954 | 6 | 25,059,788 | 32,851,195 | 3.5 | 6272 |
F310841 | M090954 | 6 | 25,790,241 | 33,593,237 | 3.2 | 6849 |
F310654 | F310841 | 6 | 25,988,473 | 32,851,195 | 2.2 | 6052 |
F335391 | M090954 | 6 | 29,194,808 | 32,795,951 | 1.5 | 4338 |
F310841 | F335391 | 6 | 29,194,808 | 32,795,951 | 1.5 | 4443 |
The above table shows six half-identical regions between four individuals: each is half-identical to the other three, at a minimum at 4,338 consecutive SNPs between locations 29,194,808 and 32,795,951 on Chromosome 6. Note that this example deals with a region with a very high SNP/cM ratio. (Ann Turner has written a paper about this strange region on chromosome 6.) As the cM lengths of the half-identical regions are small, the most recent common ancestors (apart from for the known first cousins with a half-identical region of 35.8cM) are probably very distant.
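The region common to all six comparisons is found by taking the latest start location and the earliest end location. A minimal Python sketch using the figures from the table above (the function name is illustrative):

```python
# (kit 1, kit 2, start location, end location) on Chromosome 6,
# copied from the table above.
SEGMENTS = [
    ("F310654", "F335391", 9_024_323, 37_698_281),
    ("F310654", "M090954", 25_059_788, 32_851_195),
    ("F310841", "M090954", 25_790_241, 33_593_237),
    ("F310654", "F310841", 25_988_473, 32_851_195),
    ("F335391", "M090954", 29_194_808, 32_795_951),
    ("F310841", "F335391", 29_194_808, 32_795_951),
]


def common_overlap(segments):
    """Return (start, end) of the region covered by every segment,
    or None if the segments have no region in common."""
    start = max(segment[2] for segment in segments)
    end = min(segment[3] for segment in segments)
    return (start, end) if start < end else None


if __name__ == "__main__":
    print(common_overlap(SEGMENTS))  # (29194808, 32795951)
```

Because half-identical regions have fuzzy boundaries, the reported end points should be treated as approximate.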
F310654 and F335391 in the first row are known first cousins on F310654's maternal side. M090954 is half-identical to both of them between locations 29,194,808 and 32,795,951, so must also match F310654's maternal chromosome between these locations, and the same applies to F310841. So we can claim successful triangulation (or even quadrangulation) in this region.
However, between locations 25,988,473 and 29,194,808 both M090954 and F310841 are half-identical to F310654, but not to F335391. Now we can claim successful phasing in this region: it is safe to assume that F335391 matches F310654's maternal chromosome in this region, since they are known first cousins. Hence, M090954 and F310841 must match F310654's paternal chromosome in this region.
Now the question arises as to whether M090954 and F310841 also match F310654's paternal chromosome in the adjoining region between locations 29,194,808 and 32,795,951. One way of checking this is to look at the colour-coded graphic bars in the GEDmatch.com Autosomal Comparison:
[Colour key to the graphic bars: base pairs with full match; base pairs with half match; match with phased data; base pairs with no match; base pairs not included in comparison; matching segments greater than 1 centiMorgan; centromere.]
The following table shows the output for F310654 and M090954:
Chr | Start Location | End Location | Centimorgans (cM) | SNPs |
6 | 25059788 | 32851195 | 3.5 | 6272 |
The following table shows the output for F310654 and F310841:
Chr | Start Location | End Location | Centimorgans (cM) | SNPs |
6 | 25988473 | 32851195 | 2.2 | 6052 |
In both cases, the output makes it clear that there are subregions (coloured green) where each of the other parties match both F310654's paternal chromosome and F310654's maternal chromosome. It would be necessary to inspect raw data to identify the overlapping segments more precisely.
F310654 and M090954 are believed to be fourth cousins twice removed on F310654's paternal side, but the half-identical region of 3.5cM considered here may be from a more distant common ancestor than this known relationship.
The ultimate objective is to collect DNA matches into triangulation groups. A triangulation group is a set of three or more people who are all half-identical to each of the other group members on overlapping regions.
Here is an example from Chromosome 9:
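Chr | Start Location | End Location | Centimorgans (cM) | SNPs | Pair |
9 | 106,881,785 | 124,347,952 | 24.0 | 4,938 | Hincken/Morrissey |
9 | 114,714,102 | 132,034,981 | 24.4 | 4,397 | Quinn/Hincken |
9 | 114,714,102 | 125,890,734 | 15.7 | 2,999 | Morrissey/Quinn |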
All three members of the triangulation group are half-identical to each other from 114,714,102 to 124,347,952. The interpolator can be used to confirm that the (sex-averaged) length of this overlap region is approximately 13.7cM:
Chr | Query (bp) | Sex-averaged (Kos cM) | Female (Kos cM) | Male (Kos cM) |
9 | 114714102 | 118.8510593879110 | 147.9192814455530 | 91.1703069169749 |
9 | 124347952 | 132.5099030861300 | 164.5840116062510 | 101.8401906721300 |
This region is long enough that it almost certainly contains a DNA segment inherited by all the members of the triangulation group from a relatively recent single common ancestor.
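The length of the overlap is simply the difference between the interpolated map positions of its two ends. A minimal sketch using the figures quoted above (the dictionary and function names are illustrative, and the interpolation itself is taken as given rather than reimplemented):

```python
# Interpolated Kosambi map positions (cM) on chromosome 9 for the two
# ends of the overlap, copied from the table above.
MAP_POSITIONS_CM = {
    "sex-averaged": (118.8510593879110, 132.5099030861300),
    "female": (147.9192814455530, 164.5840116062510),
    "male": (91.1703069169749, 101.8401906721300),
}


def overlap_length_cm(map_name="sex-averaged"):
    """Return the genetic length of the overlap on the chosen map."""
    start_cm, end_cm = MAP_POSITIONS_CM[map_name]
    return end_cm - start_cm


if __name__ == "__main__":
    for name in MAP_POSITIONS_CM:
        print(f"{name}: {overlap_length_cm(name):.1f}cM")
```

The sex-averaged figure comes to approximately 13.7cM; the female map gives a longer figure and the male map a shorter one for the same stretch of base pairs.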
A triangulation group including at least one subgroup who are known relatives is of particular use. In the above example, Hincken and Quinn are known third cousins, with most recent common ancestral couple Denis O'Connell (died 1887 aged 90) and Kate O'Dea (died 1889 aged 78). It can then be concluded that the other members of the triangulation group are related to one of the most recent common ancestral couple of the known relatives, or even descended from both. In this case, we can infer that Morrissey is related to either Denis O'Connell or Kate O'Dea. The total amount of shared DNA will indicate whether the most recent common ancestor of the entire triangulation group is upstream or downstream from the known most recent common ancestral couple of the known relatives. In this case, Morrissey has known ancestry from the same townland as the others, so geographic evidence added to genetic evidence points in a very specific direction.
As half-identical regions have fuzzy boundaries, it is rare to find that even two of the half-identical regions in a triangulation group share either the same beginning or ending location. In the example above, there is just one boundary location (114,714,102) common to two of the three half-identical regions. One should concentrate on looking for overlaps rather than exact shared boundaries, which have no special significance.
If a triangulation group contains two subgroups of known relatives, then it can be concluded that one of the most recent common ancestral couple of the first subgroup is related to one of the most recent common ancestral couple of the second subgroup.
This question from IGP's County Clare Ireland Genealogy Group on Facebook illustrates some of the factors to bear in mind when looking at possible triangulation groups:
This is the diagram of chromosome 14 for five of my DNA matches. The first two are first cousins to one another and my third cousins. We three are descended from our ggg grandparents from Clare or Galway. The last match shares their surname. Are we all likely to be related through that line?
In theory, some of these five FTDNA-overall-matches could match the questioner's paternal chromosome 14 and others match her maternal chromosome 14. In practice, something like this usually does turn out to be a genuine triangulation group. It is necessary to look at all 15 one-to-one comparisons between the six people in order to confirm that each is half-identical to the other five on the overlap, and thus that there is a DNA segment which all six have inherited from a shared common ancestor. The easiest way to do this is to have all six people at GEDmatch. If five of the six can be seen at FTDNA, either by having the passwords or by administering a project to which they belong, then one can check that way. One can also get a good indicator from the ADSA which will show which of these kits are FTDNA-overall-matches to each other. This does not prove that they are half-identical to each other on this particular region: one could be related to the questioner's father but have a second relationship through a different common ancestral couple to those who are related to her mother. The DNA match still doesn't prove who the common ancestor was, but one can be virtually certain in this case that the shared DNA came through one of the known shared GGgrandparents of three members of the triangulation group.
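The bookkeeping involved in confirming a triangulation group can be sketched in a few lines of Python. The labels for the six people are hypothetical, and the set of confirmed pairs would in practice come from one-to-one comparisons such as those at GEDmatch; nothing here is part of any company's tools.

```python
from itertools import combinations


def pairs_to_check(people):
    """List every one-to-one comparison needed within a group."""
    return list(combinations(sorted(people), 2))


def is_triangulation_group(people, confirmed_pairs):
    """True only if every pair in the group has been confirmed as
    half-identical on the overlapping region."""
    confirmed = {frozenset(pair) for pair in confirmed_pairs}
    return all(frozenset(pair) in confirmed for pair in combinations(people, 2))


if __name__ == "__main__":
    # Hypothetical labels: the questioner plus her five matches.
    group = ["questioner", "match1", "match2", "match3", "match4", "match5"]
    print(len(pairs_to_check(group)))  # 15

    # What the questioner can see for herself: only her own five comparisons.
    seen = [("questioner", match) for match in group[1:]]
    print(is_triangulation_group(group, seen))  # False
```

The questioner's own five comparisons leave ten of the fifteen pairs unchecked, which is why access to the other kits, or to a third party site, is needed.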
The likelihood that it came through a common ancestor with the shared surname depends on various other factors:
Terry and I got to know each other through the County Clare Ireland Genealogy group and the Kilrush Local History Group page on facebook.com. We discovered that we both have ancestors named McNamara who lived in the adjoining townlands of Breaghva and Moveen West in the civil parish of Moyarta on the Loop Head peninsula in the west of County Clare. Terry's cousin Michael McNamara still lives in Breaghva and my cousin Michael McNamara still lives in Moveen West. When necessary, they are distinguished by the age-old Irish tradition of using patronymics rather than surnames: Michael "Anthony" and Michael "Pádraig" respectively. In any case, the surname is invariably shortened to "Mack" in that part of County Clare, where "Mack" is almost always short for McNamara (although I did know a Tommy Mack, whose full name was Tommy McInerney).
Terry and I became facebook friends in January 2013 and met when she returned to her mother's native Kilrush for the National Famine Commemoration in May 2013. We have a lot in common besides our McNamara roots. She is an excellent genealogist. When she couldn't find Breaghva in the Tithe Applotment Book of 1827 for Moyarta parish, she converted the acreages of all the townlands from Irish acres to statute acres, compared them with the corresponding acreages from the Ordnance Survey, and discovered that Breaghva was originally considered as part of the adjoining townland of Moyarta West, where she found what is surely her McNamara ancestor living as early as 1827. My McNamara relatives, on the other hand, are mere blow-ins, and didn't come to Moveen West until my greatgrandfather Old Johnnie McNamara arrived from a townland about 26km away on his marriage in 1876. The two families have known each other since then, but I have heard it explained many times that they are not closely related.
Terry is also a technical expert in many areas. Her TNG site was among those that inspired me to switch in August 2013 from tribalpages.com to TNG for my (password-protected) online family tree. The fact that she had sent DNA samples for herself to all three companies (FTDNA, 23andMe and ancestry.com) and for both of her parents to both FTDNA and 23andMe was among the triggers that finally persuaded me to send mine to FTDNA.
The Loop Head peninsula, where our ancestors lived, is bounded on the south by the Shannon Estuary and on the north by the Atlantic Ocean. A peninsular population by definition is not quite an island population, but it is still a population group with a limited number of founders and a certain lack of mobility when it comes to finding marriage partners. So I was not too surprised to find Terry's mother among my initial FTDNA-overall-matches. Sadly she was gravely ill at the time and died a few weeks later (RIP).
Terry's first thought was that we are probably related on the McNamara side after all, and that we should probably get our respective cousins, the two Michael Macks, to submit DNA samples, for both Y-DNA and autosomal DNA comparisons. But as Terry's mother and my paternal grandmother come from the Loop Head peninsula and all of their known ancestors lived there, the common ancestor from whom the shared segments (or some of them) derive might not be a McNamara at all.
[Example details moved here.]
Here I should probably give a counter-example. Joseph and Paul (father and son) caught my attention among my FTDNA-overall-matches because their ancestral names and places include surnames and townlands in County Mayo which also appear in my mother's family tree. My longest block with each of them is the same 8.07cM block on chromosome 19. In this case, there is no doubt that this block was inherited by the son in its entirety from the father. I would not be in the least surprised to find that both of them and myself all inherited it from a common ancestor. The first surprise in this case was to find that Terry's mother (from County Clare) also FTDNA-overall-matched Paul (with roots in County Mayo), but not his father. This is another spurious match. The second surprise was to find that my total Shared cM with the son (33.64cM) was greater than my total Shared cM with the father (26.53cM), even though the mother is of German ancestry.
The many anomalies already noted in the relative lengths of half-identical regions make it clear that short half-identical regions are often merely half-identical by chance and are not indicative of a close genealogical relationship. This gives rise to the related questions of "how long is 'long'?" and "how close is 'close'?"
Peter Ralph and Graham Coop, who have a far better grasp of the relevant mathematics and statistics than most of those writing about genetic genealogy, have written about the identification of genomic regions shared between distant relatives and conclude that "sharing a single long block doesn’t imply a particularly close genealogical relationship". While the words "long" and "close" are left open to different interpretations, this is a good principle to bear in mind.
The ultimate answers to these questions depend on what supplementary evidence is available:
No matter what the answers to the above supplementary questions, the answers to the initial questions remain subjective.
I started out innocently assuming that I could prove my relationship to all my FTDNA-overall-matches. Then I found the spurious 9.1cM half-identical region in the previous example, and began to dismiss anything below that. Finbar O'Mahony then advised me that he does not normally contact anyone (presumably apart from a known relative) whose longest half-identical region is less than 15 cMs, and I began to think about increasing my threshold further. In other words, in the case of complete strangers, the guidance is that anything under 15cM is short.
As the threshold for FTDNA-overall-matches is 20cM, one is left wondering in the case of half-identical regions under 20cM whether those who are not FTDNA-overall-matches are not half-identical on the relevant region, or are half-identical on the relevant region but lack enough small additional half-identical regions to make up the 20cM threshold. This soon convinced me that it's probably not worth the effort of investigating possible relationships with those sharing longest half-identical regions under 20cM.
It was stated above that autosomal DNA contains a few hundred segments of genealogical value per individual; this estimate is based on dividing the total length of autosomal DNA (3587.1cM) by a reasonable cut-off value for the length of a half-identical region of genealogical interest, say 20cM.
As time has gone on, I have wondered whether the threshold should be even higher than 20cM.
In other words, I still don't know where it is sensible to draw the line, but I would certainly accept weaker DNA evidence if there is corroborating genealogical evidence than if there is no genealogical evidence at all. The argument is purely academic while I concentrate on half-identical regions of 30cM and more for which I have still not found a common ancestor.
As of 17 March 2015:
In the case of known relatives, guidance as to the length of half-identical regions certain to come from the common ancestor is in short (no pun intended) supply, but 20cM is clearly quite long in this case. Indeed, this question did not arise in my own mind until I found that I shared half-identical regions with a ninth cousin twice removed. A half-identical region of 8.8cM/918SNPs with that ninth cousin twice removed seemed very long. With a known fifth cousin, my longest shared half-identical regions are 4.6cM/875SNPs and 3.8cM/1177SNPs, which seemed quite short in this context.
With both of these known relatives, there is a suspicion of a second relationship. My ninth cousin twice removed has Waldron cousins, who could be my ancestors. My fifth cousin and I both have O'Halloran ancestors as well as our known shared Keas ancestors. What is the likelihood that our shared half-identical regions contain shared segments from our known common ancestors (as opposed to shared segments from these other possible common ancestors)?
For now, I will have to leave these questions unanswered, but for known relatives I will certainly look at all half-identical regions greater than 1cM and greater than 500SNPs.
Among my 354 initial FTDNA-overall-matches, I had:
At GEDmatch.com as of 23 Dec 2013, I had:
It can be seen from these figures that probably only around one in six FTDNA customers have copied their raw data to GEDmatch.com.
It can also be seen that GEDmatch.com's definition of a match (apparently based solely on largest cM) includes far more so-called matches than FTDNA's definition (apparently based on additional criteria such as number of shared segments and total shared cM).
I certainly suspect that the vast majority of the 1,311 matches for the default GEDmatch.com parameters are false positives. Besides, 11 e-mail addresses are much more manageable than 1,311!
For someone like me with a large database of confirmed blood relatives (9,531 people on my father's side, but only 813 people on my mother's side at Christmas 2013), 19 interested and interesting matches is a poor return.
For a foundling or any adoptee with no confirmed blood relatives, on the other hand, 19 interested and interesting matches would be a fantastic return.
Now back to myself and Terry and her mother: In the segments where my DNA is half-identical with Terry's mother, it could be the case that either:
Until Terry and I can compare ourselves directly in the chromosome browser, we cannot rule out possibility 3. If we could compare ourselves directly, then we might find that we have smaller half-identical regions within the half-identical regions that I share with her mother.
While I had no known relative who had tested before me, perhaps Terry has others besides her parents and we can do some sort of further analysis to distinguish between possibilities 1 and 2 and thus see on which side I match her mother.
After a wait of almost three days, I heard back from Anthea, the first FTDNA-overall-match that I attempted to contact when my initial results arrived. She has already featured in passing in some of my examples above.
Anthea told me a very interesting and totally unexpected story. She has given me permission completely to publish any details of her story, which in any case has been in the public domain for a very long time, wherever I want. However, the details of how she came to be adopted in Worthing in England in 1938 are of no relevance here.
In short, Anthea's only evidence in the search for her biological parents comes from DNA. She received her first autosomal results from FamilyTreeDNA.com on 10 February 2012. She was my closest initial FTDNA-overall-match by Shared centiMorgans (119.09108) and my second closest by Longest Block (30.45). We have 16 Shared DNA Segments. A year later, she was still my closest match by Shared cM and by Longest Block (ignoring my two known first cousins), and I was her closest match by Shared cM, but had dropped from 5th to 7th among her matches by Longest Block. Anthea soon admitted that she and her husband "do not understand chromozones [sic] or triangulating!" That was further motivation for me to try to explain these topics more clearly here.
By looking at my FTDNA-overall-matches in common with Anthea, it soon became apparent that we must be connected through my mother, as many of the ICW matches, like my mother, had roots in the triangle between Swinford, Charlestown and Kilkelly in County Mayo. When the Family Finder results for my paternal and maternal first cousins arrived, they confirmed that my relationship to Anthea is on my maternal side. However, the connection to the first of my maternal first cousins to submit a DNA sample (estimated by FTDNA as 4th Cousin - Remote Cousin) is not as strong as the connection to me (estimated by FTDNA as 2nd Cousin - 4th Cousin), implying that the relationship must be less close than was suggested by my own results alone. When another of my maternal first cousins later submitted a DNA sample, the estimated relationship had to be revised again, and it now appears to be closer to the initial estimate.
As of 29 July 2016, Anthea's closest match is an estimated second cousin at AncestryDNA. Initial analysis points very strongly at one pair of his great-grandparents as direct ancestors of Anthea, so the puzzle may be 25% solved.
A lady whom I will call Anne is Anthea's top FTDNA-overall-match by the old default sort order (i.e. by longest block). It is surprising that Anne and I had not met before we discovered this DNA connection, as we have connections in the worlds of academia and local history and numerous mutual friends, and as her maternal ancestors and my paternal ancestors have roots in adjoining parishes in County Clare. All this is pure coincidence, however.
Anne subsequently obtained DNA samples from her full-brothers Garry and Terence and their first cousin Patricia (brothers' children). Neither brother is deemed an FTDNA-overall-match to Anthea at all, but Patricia is. This raises a number of questions.
First of all, it clearly implies on the one hand that FTDNA's estimated relationships between Anne and Anthea and between Anne and Patricia (2nd Cousin - 4th Cousin in both cases) must also be revised outwards but on the other hand that FTDNA's estimated relationship between Anne's brothers and Anthea (none at all) must be revised inwards. (Recall that FTDNA's estimated relationships consider solely the DNA of the two people being compared, and completely ignore the DNA of known relatives of either party.)
At the next level, the fact that one sibling can top someone else's match list and the other not appear on it at all suggests investigating the Shared Segments between the matches, in this case between Anne and Anthea and between Patricia and Anthea. Between Anne and Anthea, there is one Shared Segment of 43.53cM but there is no other longer than 4.14cM. Between Patricia and Anthea, there is one Shared Segment of 23.52cM but there is no other longer than 4.21cM.
In fact, on Chromosome 10, based on GEDmatch comparisons:
Given that Anne inherited this long 43.53cM segment from one of her parents, what is the probability that her full-sibling did not inherit it, or did not inherit enough of it to register as an FTDNA-overall-match with Anthea? Suppose this segment came from the, say, paternal chromosome of one of Anne's parents. There is a 50% probability that her brother inherited from the corresponding maternal chromosome at the relevant start location (91,205,789 on Chromosome 10), and a Poisson probability of 64.71% of no crossover throughout the relevant segment. Thus there is a probability of over 32% that the brother does not match Anthea anywhere in this region. If one allows for possible crossovers near the ends of the region, or multiple crossovers, so that Anne's brother and Anthea share small segments of DNA, but not enough to meet the FTDNA thresholds, the ex ante probability that Anthea and Anne's brother are not FTDNA-overall-matches is clearly somewhat more than 32%. With two brothers, since the events are independent, the probability can just be squared, but there is still an ex ante probability of over 10% that neither brother FTDNA-overall-matches Anthea.
Now suppose that Anne and Anthea were half-identical on two separate regions, on two different chromosomes, each half as long as the actual 43.53cM match, i.e. each 21.765cM long. The probability that Anne's brother and Anthea don't match in one of these regions is again the 50% probability that the brother inherited the opposite chromosome at the start of the region, multiplied by the relevant Poisson probability of no crossover in the shorter region, which is clearly much larger, in fact 80.44%, giving a result of 40.22%. However, inheritance on two different chromosomes is independent, so the probability that the brother and Anthea don't match on either of the two regions is 40.22% x 40.22% or just 16.18%.
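The arithmetic in the last two paragraphs can be reproduced with a short Python sketch. It assumes, as the text does, that crossovers follow a Poisson process averaging one per 100cM and that a sibling has a 50% chance of inheriting the opposite parental chromosome at the start of a region; the function name is illustrative only, and possible crossovers near the ends of the region are ignored.

```python
import math


def p_sibling_shares_nothing(segment_cm: float) -> float:
    """Chance that a full sibling inherits none of a segment of this length.

    Assumptions (taken from the text, not from any FTDNA algorithm):
    - 50% chance the sibling starts the region on the opposite chromosome;
    - crossovers are Poisson with mean segment_cm / 100 over the region,
      so the chance of no crossover is exp(-segment_cm / 100).
    """
    p_no_crossover = math.exp(-segment_cm / 100.0)
    return 0.5 * p_no_crossover


LONG_SEGMENT_CM = 43.53

# One long region: one brother, then both brothers independently.
one_brother = p_sibling_shares_nothing(LONG_SEGMENT_CM)
both_brothers = one_brother ** 2

# Two regions of half the length on different chromosomes (independent).
half_region = p_sibling_shares_nothing(LONG_SEGMENT_CM / 2)
two_half_regions = half_region ** 2

print(f"one region of 43.53cM, one brother:  {one_brother:.2%}")
print(f"one region of 43.53cM, two brothers: {both_brothers:.2%}")
print(f"one region of 21.765cM:              {half_region:.2%}")
print(f"two regions of 21.765cM:             {two_half_regions:.2%}")
```

Splitting the same total length across two chromosomes roughly halves the chance that a sibling shares nothing, because the sibling must miss both regions independently.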
This gives the slightly counter-intuitive result that if you have two FTDNA-overall-matches with the same Shared cM but different Longest Block, you are more likely to match a sibling (or any known relative) of the FTDNA-overall-match with the shorter Longest Block.
Like the diversification principle in investment (which essentially says "don't put all your eggs in one basket"), this is essentially just another application of the Law of Large Numbers.
With DNA samples from three siblings and a first cousin, we can try to divide their FTDNA-overall-matches into four subsets, one subset for each grandparent of the sibling group. I will consider the subset of FTDNA-overall-matches who are half-identical to Anthea or to one of the sibling group on the region where Anne and Anthea are half-identical to each other (Chromosome 10 between locations 91,205,789 and 126,559,193). The easiest way to do this is using the ADSA for each of the four people, with "Chromosome to graph" set to 10, "Base Pairs" set from 91205789 "to" 126559193 and "Minimum Segment Length in cM" set to 8. I ignored some half-identical regions which had only a tiny overlap with the start or end of the region where Anne and Anthea are half-identical.
Let us label the grandparents as follows:
The autosomal DNA of the three siblings alone could not tell us whether 1A and 1B were the paternal grandparents and 2A and 2B the maternal grandparents or vice versa. However, the fact that a paternal cousin is half-identical to Anthea on the same region confirms that the relationship is through one of the paternal grandparents. Likewise, autosomal DNA alone still cannot tell us whether A is the grandfather and B the grandmother or vice versa on either side.
On side 1, Anne has inherited from grandparent 1A and, as Garry and Terence do not match Anthea, they must both have inherited from grandparent 1B.
Garry is half-identical to both of his siblings on this region, so he must have inherited from the same grandparent as Anne on the other side, i.e. from 2A.
Anne and Terence are not half-identical on this region, so they must have inherited from opposite grandparents on both sides; in other words, Terence inherited from 1B and 2B.
So we have the following pattern:
There is also an outside chance that either of these reasons could also result in a person being assigned to the wrong group, unless the lengths of the relevant half-identical regions are 20cM or greater.
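The deduction of which grandparent each sibling inherited from on this region can be checked by brute force. The sketch below is an illustration only: it assumes that each sibling inherited the whole region from exactly one grandparent on each side (no crossover inside the region), and it fixes Anne as (1A, 2A) because that is how the labels were chosen above.

```python
from itertools import product

SIBLINGS = ("Anne", "Garry", "Terence")

# Each sibling gets one grandparent from side 1 and one from side 2.
OPTIONS = list(product(("1A", "1B"), ("2A", "2B")))


def half_identical(x, y):
    """Two siblings are half-identical if they share a grandparent on either side."""
    return x[0] == y[0] or x[1] == y[1]


def consistent(assignment):
    """Apply the observations described in the text."""
    anne = assignment["Anne"]
    garry = assignment["Garry"]
    terence = assignment["Terence"]
    return (
        anne == ("1A", "2A")                 # labelling convention; Anne matches Anthea
        and garry[0] == "1B"                 # Garry does not match Anthea
        and terence[0] == "1B"               # Terence does not match Anthea
        and half_identical(garry, anne)      # Garry is half-identical to Anne
        and half_identical(garry, terence)   # and to Terence
        and not half_identical(anne, terence)  # Anne and Terence are not
    )


solutions = []
for combo in product(OPTIONS, repeat=len(SIBLINGS)):
    assignment = dict(zip(SIBLINGS, combo))
    if consistent(assignment):
        solutions.append(assignment)

for solution in solutions:
    print(solution)
```

Under these assumptions only one assignment survives: Anne from 1A and 2A, Garry from 1B and 2A, Terence from 1B and 2B.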
The following table shows the matches and categories:
cM length of half-identical region with | |||||
Match surname | Anthea | Anne | Garry | Terence | Grandparent group |
Lewis | 8.20 | 9.19 | 1A | ||
Likens | 20.91 | 34.81 | 1A | ||
James | 9.90 | 9.90 | 1B | ||
Bean | 17.59 | 18.01 | 2A | ||
Fetherston | 13.60 | 14.29 | 2A | ||
Johnson | 9.70 | 9.89 | 2A | ||
Murphy | 11.15 | 11.27 | 2A | ||
Smoyer | 10.00 | 9.34 | 2A | ||
Swanson, A | 13.79 | 13.79 | 2A | ||
Swanson, M | 13.38 | 13.38 | 2A | ||
Wellar | 9.81 | 8.99 | 2A | ||
Blake | 9.53 | 3 | |||
Clifford | 12.16 | 3 | |||
Laffey | 16.63 | 3 | |||
Peneycad | 8.46 | 3 | |||
Doble | 8.00 | ambiguous | |||
McDade | 8.29 | ambiguous | |||
Mellick | 8.22 | ambiguous | |||
Ottum | 9.25 | ambiguous | |||
Scott | 8.25 | ambiguous | |||
Svircev | 8.25 | ambiguous |
It is reassuring that the two people in grandparent group 1A,
matching Anne and Anthea, namely Lewis and Likens,
FTDNA-overall-match each other.
Of the eight people in group 2A, matching Anne and Garry, Smoyer
and Johnson FTDNA-overall-match each other; Johnson, Murphy, the
two Swansons and Bean FTDNA-overall-match each other; the
Swansons, Bean and Fetherston FTDNA-overall-match each other; and
Bean, Fetherston and Wellar FTDNA-overall-match each other. These
subgroups arise from different subregions of the long region where
Anne and Anthea are half-identical.
Of the four people in grandparent group 3, matching Anthea only,
only Blake and Peneycad FTDNA-overall-match each other, leaving
the possibility that the other two may be half-identical by chance
to Anthea.
The next step is to look for common ancestors of the three siblings and each of the four grandparent groups in the above table. If the common ancestor with group 2A can be identified, then we will know that Anthea is related through the other parent. If the common ancestor with group 1A or group 1B can be identified, then we will know which grandparent Anthea is related to.
There is a lot more work still to be done!
Update: Since I compiled the above table, two Kane samples have been submitted to FTDNA. In the region of interest, they are half-identical to Anne and Garry (12.94cM and 12.69cM for both siblings) but not to Anthea or Terence, so they must be related to grandparent 2A.
For another interesting example concerning my Kett GGGgrandfather, see facebook discussion.
Y-DNA comparison is the best way to rule in or rule out distant relationships between men with the same or similar surnames. For closer relationships, autosomal DNA comparison gives more precise answers.
Only men have a Y chromosome for comparison.
A woman does not have a Y chromosome, so should find a male relative with the relevant surname to swab:
By far the most comprehensive Y-DNA service is provided by FamilyTreeDNA.com, but there are competitors like YFull.com and YSEQ.
If you have already sent cheek swabs to FamilyTreeDNA for Family Finder or mitochondrial analysis, then they are held in storage and will be re-used for Y-DNA analysis.
[surname] FTDNA project
If there is no surname project for your surname, then you can apply to set up your own project by following a simple five-step application process (which actually consists of only four steps!).
New FamilyTreeDNA.com customers need to fill in the names of both their Direct Maternal (i.e. matrilineal) and Direct Paternal (i.e. patrilineal) Most Distant (i.e. most distant known) Ancestors here in order to help those looking for mitochondrial DNA matches and Y-DNA matches respectively. An extraordinarily high proportion of customers have failed to attempt this, or have attempted it but have filled in details of the wrong ancestors, often filling in details of ancestors of the wrong gender. It is particularly important for anyone who has ordered mtDNA analysis and for men who have ordered Y-DNA analysis to fill in details of these ancestors which then appear in the relevant project reports, but it's a good idea for all FTDNA customers to fill in the details, which then show on the basic customer profile shown to those who match any part of their DNA.
Most people squeeze in names, places and dates to the limited string length of 50 characters available for the names of the most distant ancestors, but FamilyTreeDNA really should provide separate fields and columns for name, birth date, death date, country, county, etc., to help those scanning this information in the tables in surname projects and mitochondrial projects.
As I have already noted, some people are actually confused by the simple concept that Y-DNA follows the male line, and even by the simpler concept that in most cultures the surname generally, but not always, follows the same male line. Even more people are confused by the related concept that in many cultures grants of arms generally, but not always, follow the male line and the surname. Y-DNA will identify relationships that go back much further than the adoption of surnames, which in most cultures was around or after the year 1000 AD. Other cultures, for example European royalty, still do not use surnames in the 21st century.
In practice, there are many exceptions to the foregoing cultural principles, which result in sons inheriting DNA from their genetic father, but not inheriting the exact surname used by their genetic father.
The spelling of surnames mutates over the generations in much the same way as DNA mutates. After many generations, the surnames used by two men from the same genetic male line may end up being unrecognisably different.
A surname/DNA switch is defined as the use of a surname different from that used by the genetic father, which may be:
When the surname does not follow the male line, some genetic genealogists once used the term non-paternity event (NPE), but most now prefer to refer to these occurrences more precisely as surname/DNA switches. After all, every birth involves paternity, so NPE is now more usually expanded as "Not the Parent Expected" and used when the DNA results do not match the oral family tradition.
Among the myriad of, possibly one-off, circumstances causing surname/DNA switches (and other forms of NPE) are:
There are many examples of surname/DNA switches in the pedigrees
of recent world leaders, including the following:
Grants of arms have historically been associated with specific families and never with surnames; thus sharing a surname does not automatically confer the right to bear the same arms. Similarly, sharing a surname does not automatically mean that two men carry the same Y-DNA.
I was initially under the misapprehension that genetic distance was a simple count of the number of differences between the sequence of integers reported for two individuals. For example, two individuals who have purchased Y-DNA37 and are reported to have a genetic distance of 2/37 might be expected to have the same results in 35 positions in the sequence and different results in the other 2 positions in the sequence. The actual situation turns out to be much more complex than this. Some of the numbers reported are related. For example, the 10th and 12th numbers in the sequence both relate to DYS389. One mutation can cause both of these numbers to change. Similarly, the 22nd, 23rd, 24th and 25th numbers in the sequence are all related to DYS464. One mutation can cause more than one of these four numbers to change. For some men, Y-DNA37 results may contain more than 37 numbers. For example, most men have four numbers for DYS464, but some O'Deas have six numbers, as can be seen on the results page. For more details, see Deconstructing TMRCA & Genetic Distance by John Barrett Robb.
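The DYS389 point can be illustrated with a small Python sketch. It is a simplified illustration, not FamilyTreeDNA's actual genetic distance algorithm: the marker values are hypothetical, only four markers are shown, and the adjustment relies on the fact that the reported DYS389II value includes the DYS389I repeats.

```python
def naive_distance(a, b):
    """Count the markers whose reported values differ (the naive view)."""
    return sum(1 for marker in a if a[marker] != b[marker])


def adjusted_distance(a, b):
    """Same count, after subtracting DYS389I from the reported DYS389II.

    The reported DYS389II value contains the DYS389I repeats, so one
    mutation in DYS389I changes both reported numbers.
    """
    def unpack(haplotype):
        unpacked = dict(haplotype)
        unpacked["DYS389II"] = haplotype["DYS389II"] - haplotype["DYS389I"]
        return unpacked

    return naive_distance(unpack(a), unpack(b))


# Hypothetical values for two men who differ by one mutation in DYS389I.
man_a = {"DYS393": 13, "DYS390": 24, "DYS389I": 13, "DYS389II": 29}
man_b = {"DYS393": 13, "DYS390": 24, "DYS389I": 14, "DYS389II": 30}

print(naive_distance(man_a, man_b))     # differences in the reported numbers
print(adjusted_distance(man_a, man_b))  # differences after unpacking DYS389II
```

The naive count sees two differing numbers, while the adjusted count attributes both to a single change, which is why a simple tally of differing positions can overstate the distance.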
Y-DNA genetic distance is no more closely correlated with genealogical relationship than shared centiMorgans of autosomal DNA. Note also that lower genetic distance indicates a closer relationship, while higher shared centiMorgans indicates a closer relationship.
Terry Barton of WorldFamilies.net gives a nice example in his 2008 interview with Blaine Bettinger: "my Dad and I each started a mutation (his is at DYS388, while mine is at DYS452) So, I am 41/43 when compared to my Uncle. I use this example to explain how you can’t count mutations to determine how closely related you are to someone ... Dr. Richard Barton [and myself] have no paper trail connection [but] Rich is a 43/43 match to my Uncle."
Having made yourself findable by others who share your surname and potentially share your Y-DNA, the next step is to look for such men in the FamilyTreeDNA community.
On the Surname & Geographical Projects page, FamilyTreeDNA has a "Project Search" box (the search box at the top of the right-hand column). This actually functions as a Surname Search box. Whether or not one is logged in, one can enter a surname and see results like this:
The following names matched your search request:
NAME | COUNT |
---|---|
Marrinan | 4 |
The COUNT is apparently the number of FTDNA customers with the surname. It is not clear whether it is the number of male customers, the number of Y-DNA customers, the number of customers whose results are fully processed, the total number of customers who have ordered kits, or what. It can at least be used to give an upper bound on the number of men with the same surname with whom another male can potentially compare his Y-DNA.
FamilyTreeDNA provides the infrastructure for volunteer administrators to run projects, including surname projects and other types of Y-DNA projects.
FamilyTreeDNA.com insists on sending e-mail notifications of what it deems Y-DNA "Test Matches"; the customer has little control over what is deemed worthy of generating an e-mail, so a "Test Match" may be somebody with a different surname, no online family tree, no Most Distant Ancestor recorded, and the sixth such match at a genetic distance of two steps on a 37-marker scale. One e-mail alerted me to an individual with a different surname for whom the data could not reject the null hypothesis that we did not share a common ancestor within the last 14 generations; he was still my closest Y-DNA match at the time. These probability calculations appear to be blind to the surnames of the men whose Y-DNA is being compared.
FamilyTreeDNA's e-mail notification policies are very inflexible: I don't want e-mails about distant Y-DNA matches with different surnames with whom I have no hope of establishing a genealogical relationship, but must receive them; I do want e-mails about possible distant cousins who are deemed to be my Family Finder matches, but the FAQs state that FamilyTreeDNA.com does "not send notifications for Family Finder matches that are more distant than 3rd cousins". This should presumably be interpreted as indicating a high likelihood of false positives amongst such matches.
Once you have received your Y-DNA37 (or higher) results, you will probably want to control the flow of e-mail notifications of new matches. On the Email Notifications page, as the entry-level Y-DNA product is now Y-DNA37, you will probably want to untick the Y-12 and Y-25 boxes (unless you have very few or no Y-DNA37 matches). If you leave these ticked, remember that when you get an e-mail notification of a new Y-DNA12 or Y-DNA25 match, then:
In the case of Waldron, the COUNT of FTDNA customers with the surname Waldron was 29 when I first learned of this search facility on 26 October 2015; it had not increased as of 23 December 2015, but had grown to 60 by 23 Jul 2017.
Further down the search results, it was stated that there were 33 members in the Waldron project on 26 October 2015; this too had not increased as of 23 December 2015, but had increased to 49 by 23 Jul 2017.
There are several places on the FamilyTreeDNA website where one might find relatives using Y-DNA, including:
My efforts, and those of the Waldron Clan Association in County Mayo organising the 2013 Waldron Clan Gathering, to communicate with the administrators of the Waldron project were slow to get off the ground, but we eventually made contact in December 2015.
I am advised every time I get an e-mail about a very distant Y-DNA match that "We recommend ordering the Y-DNA67 to narrow down your matches with more precision & confidence." It is not immediately obvious why this recommendation is of any relevance until such time as I have at least two people matching me on all of the 37 markers that I initially purchased. Apart from being a good marketing ploy, the principle seems to be that "65-of-67" or "64-of-67" matches are typically more closely related than "35-of-37" matches. Furthermore, all expert advice now is that identifying and ordering a relevant SNP pack is better value than an STR upgrade.
However, my known first cousin once removed has also done the Genographic Project test and that gives his paternal line as R-M343 (of which R-M269 is a subclade).
On 28 Nov 2015, I finally persuaded the Waldron project administrators to look again at the grouping of project members and to assign myself and my known first cousin once removed to the same group, which they still describe in the old notation as R1b1a2. When the group name was assigned, this was probably equivalent to R-M269, but, if so, two new SNPs upstream from M269 have since been discovered and R-M269 has become R1b1a1a2 on the 2017 ISOGG haplotree. The old-fashioned R1b1a1a2-type descriptors no longer appear on the FTDNA haplotree.
As of 8 Dec 2015, the Waldron project members had been arranged by the administrators into four groups: three Waldrons in I1 (all with predicted haplogroup I-M253); one Buso in J2 (predicted haplogroup J-M172); two ungrouped; and the remaining 14 in R1b1a2, including myself. 12 of these have predicted haplogroup R-M269, one has confirmed haplogroup R-L20 (which is a subgroup of the R-M269 group; see below) and I have confirmed haplogroup R-FGC29367. Note that project members can seem to disappear and later reappear from day to day depending on whether you look at the results when not logged in (or timed out), when logged in as a kit which is a project member, when logged in as a kit which is not a project member, when logged in as administrator of the project, or when logged in as administrator of another project. This is determined by each individual project member's selected privacy settings.
As of 8 December 2015, when my own Y-DNA haplogroup was confirmed, only two other members of the Waldron Surname DNA Project had a confirmed haplogroup, namely a Waldron in R-L20 and a Phillips in R-DF23.
It is possible to order individual SNP tests from FamilyTreeDNA at USD39 each (and more cheaply from YSEQ), but I had not been able to figure out from online sources which ones I should order when Joss Ar Gall of ISOGG at the FTDNA stand at Genetic Genealogy Ireland 2015 taught me more in a few minutes about my own Y-DNA than I had learned in several years of reading web pages and attending lectures.
Joss's advice was to look at my Y-DNA matches and sort them (twice) by Terminal SNP. This revealed nine matches with identified Terminal SNPs, one a "35/37" match, two "34/37" matches, and six "33/37" matches. The terminal SNPs were L148, L48, two M269s, U106, Z11, Z343, Z383 and Z8. My closest STR match ("35/37") of those with identified Terminal SNPs has terminal SNP Z383 which did not appear on the Haplogroup R page as of 18 Oct 2015.
Having looked at these results, Joss advised me to join the U106 Project at FTDNA and consult the project co-administrators. Mike Maddi sent this advice on Sunday 18 Oct 2015:
"looking at his 37 STRs, I think it's very likely that he's U106+ and just about as likely that he's Z8+, which is downstream from U106. I base that on his results for DYS390, DYS447, DYS464d and H4. His results for those markers match the clear modal values found in Z8. If Patrick hasn't ordered the new Z8 SNP pack yet, he should do so. (It just became available on Thursday.)"
After a lot of searching, I found the order form for SNP packs. The only pack that it offers me is the R1b-M343 Backbone SNP Pack.
So I sought further advice on where to find an order form for the Z8 SNP pack and got this from Mike:
"To order the Z8 SNP pack, log into your FTDNA account and click on the blue "Upgrade" button in the upper right corner of the page. On the page you're sent to, look for the box labeled "Advanced Tests" and click on the blue "Buy Now" button. You'll be sent to a page with a pull down menu on the left side labeled "Test Type" - choose "SNP Pack" from the menu that's revealed. You'll see the Z8 SNP pack as the last one on the list. Click on the "Add" link for that test and follow the directions for completing the purchase."
The Z8 SNP pack results were scheduled to arrive by 23 Dec 2015. I looked for them on 8 Dec 2015 and found that they had been waiting for me since 3 Dec 2015! The bottom line was "Your Confirmed Haplogroup is R-FGC12057. Haplogroup R-U106 is the descendant of the major R-P25 (aka R-M343) lineage and is found from Eastern Europe to its highest frequency in Central Europe and the British Isles."
FGC12057 is also known as Y30001. The 95% confidence interval for the time to the most recent common ancestor is from 1450 years ago to 3100 years ago (as of 2015).
My Big Y-700 results subsequently moved me down the haplotree from R-FGC12057 to the more recent R-FGC29367.
Having found a known relative at last via Y-DNA testing on 14 Dec 2014, I put out an appeal to the Waldron Clan facebook page and group for other Waldrons to order kits and join the project. I presume that the basic Y-DNA37 product is the best starting point for every Waldron at this stage, but would welcome advice on whether and when people should be advised to order individual SNP tests in addition to or in place of the Y-DNA37 product.
I was the second member of the Waldron Surname DNA Project to purchase a
SNP product. The following table shows the subclades to which
myself and my fellow pioneer belonged, with the defining SNPs for
each subclade:
P J M Waldron | 129522 | ||
SNP | Haplogroup | SNP | Haplogroup |
M207 | R | M207 | R |
P241 | R1 | P241 | R1 |
M343 | R1b | M343 | R1b |
L389 | R1b1 | L389 | R1b1 |
P297 | R1b1a | P297 | R1b1a |
M269/L483/L150 | R1b1a2 | M269/L483/L150 | R1b1a2 |
L23 | R1b1a2a | L23 | R1b1a2a |
L51 | R1b1a2a1 | L51 | R1b1a2a1 |
L151/P311 | R1b1a2a1a | L151/P311 | R1b1a2a1a |
U106 | R1b1a2a1a1 | P312 | R1b1a2a1a2 |
Z381 | R1b1a2a1a1c | U152 | R1b1a2a1a2b |
Z301 | R1b1a2a1a1c2 | L2 | R1b1a2a1a2b1 |
L48 | R1b1a2a1a1c2b | Z367 | R1b1a2a1a2b1a |
Z9 | R1b1a2a1a1c2b2 | L20 | R1b1a2a1a2b1a1 |
Z30 | R1b1a2a1a1c2b2a | ||
Z2 | R1b1a2a1a1c2b2a1 | ||
Z7 | R1b1a2a1a1c2b2a1a | ||
Z8 | R1b1a2a1a1c2b2a1a1 | ||
Z11 | R1b1a2a1a1c2b2a1a1a | ||
Z12 | R1b1a2a1a1c2b2a1a1a1 | ||
Z8175 | R1b1a2a1a1c2b2a1a1a1 | ||
FGC12057 | R1b1a2a1a1c2b2a1a1a1 |
Note that the nomenclature of the subclades is subject to annual revisions as more SNPs are identified. The table above is based on the 2015 nomenclature.
My fellow Waldron pioneer and I belong to different subclades of R1b1a2a1a, also known as R-L151. According to YFull, the 95% confidence interval for the Time to Most Recent Common Ancestor (TMRCA) for two people who are L151 positive is from 4,400 to 5,300 years, long before the adoption of surnames.
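The point at which the two lineages diverge can be read off the table above by walking down both SNP paths until they differ. The following Python sketch does this with the SNP names copied from the table; the function and variable names are illustrative only.

```python
# SNP paths copied from the table above (2015 nomenclature).
SHARED_TRUNK = [
    "M207", "P241", "M343", "L389", "P297",
    "M269/L483/L150", "L23", "L51", "L151/P311",
]
PJM_WALDRON_PATH = SHARED_TRUNK + [
    "U106", "Z381", "Z301", "L48", "Z9", "Z30", "Z2",
    "Z7", "Z8", "Z11", "Z12", "Z8175", "FGC12057",
]
KIT_129522_PATH = SHARED_TRUNK + ["P312", "U152", "L2", "Z367", "L20"]


def last_shared_snp(path_a, path_b):
    """Return the most downstream SNP that both paths have in common."""
    shared = None
    for snp_a, snp_b in zip(path_a, path_b):
        if snp_a != snp_b:
            break
        shared = snp_a
    return shared


branch_point = last_shared_snp(PJM_WALDRON_PATH, KIT_129522_PATH)
print(branch_point)
```

The last shared entry is L151/P311, after which one path continues through U106 and the other through P312.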
The commonest Y-DNA haplotypes and terminal SNPs identified in Ireland are also from subclades of R1b1a2a1a and are listed in the following table:
2015 subclade | Terminal SNP | Name |
R1b1a2a1a2c | L21 | |
R1b1a2a1a2c1 | DF13 | |
R1b1a2a1a2c1a1a1 | M222 | North West Irish/Irish Type I |
R1b1a2a1a2c1c1b | CTS4466 | South Irish/Irish Type II |
R1b1a2a1a2c1f2a | L226 | Dalcassian/Irish Type III |
R1b1a2a1a2c1g1a1 | L362 or L362.2 | Munster Type I |
One of the longest documented male line lineages in Ireland is that from Brian Boru (d. 1014), Dalcassian High King of Ireland, to the present Lord Inchiquin, Conor O'Brien, whose subclade of R-L226 is R1b1a2a1a2c1f2a1a1a or R-YFS231286 or R-Y6913. See kit 29355 at the O'Brien Y-Chromosome DNA Surname Project.
Surname projects are only one of three quite different ways of identifying close male-line relatives:
Source | Kit no. | Surname | "Genetic distance" | Most distant known ancestor | Terminal SNP | Haplogroup |
Y-DNA37 | | Boswell | 34/37 | [BLANK] | M269 | R1b1a2 |
Y-DNA37 | | Taylor | 33/37 | John Taylor 1627 ENG - 1702 VA | M269 | R1b1a2 |
Waldron project | N41617 | Waldron | 12/12 | Thomas Waldron (c1825/6 Roscommon-1902 Limerick) | M269 (predicted) | R1b1a2 |
Y-DNA37 | | Cottrell | 33/37 | Richard N Cottrell, b.1794, Clun, Salop, England | U106 | R1b1a2a1a1 |
Y-DNA37 | | Braden | 33/37 | [BLANK] | L48 | R1b1a2a1a1c2b |
Y-DNA37 | | Cantley | 34/37 | Alexander Cantley | Z8 | R1b1a2a1a1c2b2a1a1 |
Y-DNA37 | | Watkins | 33/37 | Issacs Watkins, b. 1776 Caswell County, NC | Z11 | R1b1a2a1a1c2b2a1a1a |
Myself | 310654 | Waldron | 37/37 | Thomas Waldron (c1825/6 Roscommon-1902 Limerick) | Z12>Z8175>FGC12057>Z383>FGC29367 | R1b1a2a1a1c2b2a1a1a1 |
Y-DNA37 | | Morganstein | 35/37 | [BLANK] | Z12 > Z383 | R1b1a2a1a1c2b2a1a1a1 |
U106 project | 313521 | | 31/37 | Closest Big-Y Surname Match: BEATTIE | Z12 > Z8175> FGC12057 | R1b1a2a1a1c2b2a1a1a1 |
U106 project | 61021 | | 31/37 | John Gibson, DOB 1580 - Dysart, Scotland | Z12 > Z8175> FGC12057 | R1b1a2a1a1c2b2a1a1a1 |
U106 project | 193127 | | 27/37 | Andrew Beatty, m. 1789, Kilskeery, Tyrone | Z12 > Z8175> FGC12057 | R1b1a2a1a1c2b2a1a1a1 |
Y-DNA37/U106 project | 13318 | Crisp | 33/37 | Chesley Crisp, b.c. 1805, North Carolina, USA | L148 | R1b1a2a1a1c2b2a1a1a1a |
Y-DNA37 | | Weaver | 33/37 | John Williams c1700 Brunswick,VA-1763 Brunswick,VA | Z343 | R1b1a2a1a1c2b2a1a1b2a |
Waldron project | 129522 | Waldron | 23/37 | CHARLES A.WALDRON c.1820-New York City NY L20* | L20 | R1b1a2a1a2b1a1 |
I am sceptical of the TiP calculator, which estimates the probability that the most recent common ancestor of two Y-DNA matches is within any number of generations.
These estimated probabilities depend on the genetic distance between the two men, the mutation rates of the particular markers at which they differ, and the number of known generations of male line ancestors of each man.
However, I don't think that the TiP calculator takes any account of the number of matches reported, of the surnames of the men being compared and of their matches, or of the locations where their known ancestors lived.
If the men being compared have similar surnames and known ancestors from the same location, then the most recent common ancestor is probably closer than suggested by the TiP calculator.
On the other hand, if the men being compared have different surnames and their known ancestors lived far apart, then the most recent common ancestor is probably further back than suggested by the TiP calculator.
Similarly, someone from a dynasty which has many more matches than average, with many more surnames than average amongst those matches, probably has many fewer mutations than average, so the most recent common ancestor of two people from that dynasty is probably more distant than predicted by a model that seems to ignore this information.
I tend to use the TiP calculator to come up with a 95% confidence interval for the number of generations to the most recent common ancestor; I know that others look at the 50% probability cut-off or median number of generations to MRCA.
Bearing all these caveats in mind, it is interesting to look at what perfect STR matches tell us about the number of generations back to the MRCA.
At 0/37, the upper bound of the 95% confidence interval is
between 6 and 7 generations.
At 0/67, the upper bound of the 95% confidence interval is between
5 and 6 generations.
At 0/111, the upper bound of the 95% confidence interval is
between 3 and 4 generations.
In these cases, an STR upgrade can be viewed as giving a more precise estimate of the number of generations to the MRCA. The 37-67 upgrade moves the MRCA one generation closer if no additional mutation is discovered (and can move the MRCA further back if additional mutations ARE discovered). The 67-111 upgrade moves the MRCA two generations closer if no additional mutation is discovered (and can also move the MRCA further back if additional mutations ARE discovered).
In one example that I have studied, one additional mutation in the last 44 STR markers left the upper bound of the 95% confidence interval between 5 and 6 generations.
The main objective of analysing a man's Y chromosome is to find
the lowest (i.e. most recent) SNP on the haplotree for which he is
positive, or likely to be positive.
There are several ways of doing this:
According to Jim Owston:
a man may inherit his X as one of his mother’s two X-chromosomes completely intact; however, it is more likely he will receive a recombined X that includes segments from each of his mother’s two Xs.
The same remark about recombination applies also to a woman's maternal X chromosome; her paternal X chromosome is a copy of her father's single X chromosome, so is not subject to any recombination.
As I am a male, my single X chromosome is a combination of my mother's two X chromosomes. Anywhere between 0% and 100% of that recombined X chromosome comes from my maternal grandfather and the remainder from my maternal grandmother.
Roberta Estes says, without linking to any source, that:
a complete X chromosome ... is comprised of 18092 SNPs and is 195.93cM in length, barring anomalies like read errors and such, which do periodically occur.
Taking this length in cM and assuming that recombination follows a Poisson process (i.e. that successive recombinations are independent events), the distribution of the number of crossovers on one X chromosome in one generation is:
#crossovers | probability |
0 | 14.1% |
1 | 27.6% |
2 | 27.1% |
3 | 17.7% |
4 | 8.7% |
5 | 3.4% |
6 | 1.1% |
7 | 0.3% |
8 or more | 0.1% |
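The figures in this table can be reproduced with a few lines of Python. This is only a sketch under the assumptions already stated: the 195.93cM length quoted from Roberta Estes, the usual convention that 100cM corresponds to one expected crossover per generation, and independence of crossovers. The function name crossover_distribution is my own invention for illustration.

```python
import math

X_LENGTH_CM = 195.93  # X chromosome length in cM, as quoted above


def crossover_distribution(length_cm, max_k=7):
    """Poisson probabilities of 0..max_k crossovers, plus the upper tail."""
    mean = length_cm / 100.0  # assumed: 100cM = one expected crossover
    probs = [math.exp(-mean) * mean**k / math.factorial(k)
             for k in range(max_k + 1)]
    tail = 1.0 - sum(probs)  # probability of more than max_k crossovers
    return probs, tail


probs, tail = crossover_distribution(X_LENGTH_CM)
for k, p in enumerate(probs):
    print(f"{k} | {p:.1%} |")
print(f"{len(probs)} or more | {tail:.1%} |")
```

Changing X_LENGTH_CM to the cM length of any other chromosome gives the corresponding distribution under the same assumptions.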
In a small sample of five, using data from Matt Dexter, Roberta finds one instance of transmission without recombination. She expresses herself "staggered" that, in another study by Robert Paine with 21 people, 25% of participants show no recombination on the X chromosome. Five successes in 21 binomial trials with a success probability of 14.1% in each trial is not significantly different from the expected 2.96 successes. Perhaps I should express this result in terms of failures rather than successes, as those anticipating recombinations and not finding them may consider their absence to be a failure. On the other hand, those of us who want to know exactly where our X-DNA came from will probably get excited by the discovery that there was no recombination, eliminating a whole branch of our ancestry as a potential source of X-DNA, and consider this a success!
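The claim about five cases in 21 can be checked with a short calculation. The sketch below assumes a one-sided test (the probability of seeing five or more unrecombined X chromosomes out of 21) and the same Poisson-based probability of zero crossovers; the choice of a one-sided test and the function name binomial_tail are my own assumptions, not something taken from either study.

```python
from math import comb, exp

p_no_recomb = exp(-195.93 / 100.0)  # probability of zero crossovers (about 14.1%)
n, observed = 21, 5                 # sample size and unrecombined cases, from the text

expected = n * p_no_recomb          # expected number of unrecombined cases


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


p_value = binomial_tail(n, observed, p_no_recomb)
print(f"expected = {expected:.2f}, P(X >= {observed}) = {p_value:.3f}")
```

A tail probability well above the conventional 5% level is what "not significantly different" means here.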
Roberta also shows a screenshot from the 23andMe chromosome browser, suggesting that the fact that individuals from different generations are half-identical throughout the X chromosome is evidence of transmission without recombination. As with the autosomal chromosomes, parent and child will always be half-identical throughout the X chromosome, regardless of recombination.
We have seen in an earlier table that only the first five autosomal chromosomes are longer in terms of cM than the X chromosome; thus the last 17 autosomal chromosomes have a probability higher than 14.1% of being transmitted without recombination (again making the assumption of independence so that the Poisson distribution applies). As can also be seen in the chromosome browser or the Wikipedia table, on the base pair scale the X chromosome is longer than all bar seven of the autosomal chromosomes.
Elsewhere, Roberta says
I really do suspect that [the X chromosome is] recombining less frequently than the ... autosomes.
She actually appears to suspect either that the accepted length in cM of the X chromosome is wrong, or that recombinations are not independent, so that the Poisson distribution misrepresents the distribution of crossovers.
While, as a male, my single X chromosome comes entirely from my mother, it includes a contribution from at least one ancestor in every generation beyond my mother, and parts of it almost certainly come from two or more ancestors in more remote generations. Only by comparing DNA test results can I figure out with whom I share all or part of my X chromosome. If enough potential X cousins, as they might be called, submit DNA samples for analysis and upload their data to gedmatch.com to permit X chromosome comparisons, and if there is enough diversity among their X chromosomes, then it may be possible to narrow down the potential sources of my own X chromosome to less than the theoretical maximum number of ancestors, in my grandparents' generation and beyond.
For autosomal DNA, a segment of, say, 10cM, is equally likely to have come down unbroken from any of the ancestors in a given generation, say from each of my eight greatgrandparents. This does not apply to X-DNA, for which the number of opportunities for recombination in the inheritance path depends not on the number of generations in the path but on the number of females in the path.
If we let M denote male ancestors and F denote female ancestors, then the inheritance path to an ancestor can be described by a string of Ms and Fs, for example MFF for a man's maternal grandmother or FMFM for a woman's father's maternal grandfather.
The expected proportion of one's X-DNA inherited from a particular ancestor is 0 if there are two consecutive Ms in the inheritance path, and otherwise 2^(-f), where f is the number of Fs in the path excluding the last letter.
From my father (MM), I (as a male) inherit no X-DNA.
From my mother (MF), I inherit 2^0 = 1 or 100% of my X-DNA.
For any ancestor on my paternal side, the inheritance path begins MM... so I inherit no X-DNA.
From each of my maternal grandfather (MFM) and my maternal grandmother (MFF), I expect to inherit 1/2 of my X-DNA.
From my matrilineal greatgreatgrandmother (MFFFF), I expect to inherit 1/8 of my X-DNA, but I expect to inherit exactly the same percentage from a far more distant ancestor with inheritance path MFMFMFMF.
Conversely, the typical X-DNA segment of, say, 10cM, is equally likely to have come down unbroken from these two ancestors three generations apart with different inheritance paths.
Similarly, an X-DNA segment of, say, 10cM, is far more likely to have come down unbroken from an X-ancestor on a given distant generation than an autosomal DNA segment of the same length. The exception to this is the matrilineal X-ancestor in the given generation, whose X chromosomes are subject to recombination in every generation.
Finding the paper trail to explain the source of a shared X-DNA segment will therefore on average be more difficult than finding the paper trail to explain the source of a shared autosomal DNA segment of the same length.
In other words, the closeness of an X match cannot be thought of in standard cousin terms. A father passes on his X unchanged to his daughters, but a mother passes on a recombination of her two Xs to all her children. So the distribution of the amount of X shared by two people depends only on the number of females on the path between the two people in the family tree. For example, two male third cousins whose mothers' PATERNAL grandmothers were sisters are expected to share more X-DNA than two male third cousins whose mothers' MATERNAL grandmothers were sisters, because there is one less recombination on each side. In the first case, the path to the common greatgreatgrandmother for both third cousins is MFMFF and in the second case it is MFFFF. Draw yourself a little relationship diagram if you don't follow!
This difficulty is both exacerbated and simplified by the inheritance path of surnames: exacerbated because the surname follows the X-DNA for at most two generations (father's surname and daughter's maiden surname; or mother's married surname and daughter's maiden surname); but simplified if two people sharing X-DNA find that they also share an ancestral surname (there are only two generations where the surname inheritance path crosses the X-DNA inheritance path). For example, I share X-DNA with someone who has a Walsh ancestor. My own matrilineal greatgreatgrandmother may have been a Walsh (this remains unproven). If our shared X-DNA comes through these Walshes, then my "match" must descend from my possible GGGGgrandmother Mrs. Walsh. This is just too far back to be sure that we would also share autosomal DNA (which we don't).
These X inheritance paths have their advantages as well as their disadvantages.
For a mitochondrial or matrilineal ancestor, the proportion of her X-DNA which a descendant is expected to inherit is the same as the proportion of her autosomal DNA which the descendant is expected to inherit. Thus the ratio of expected cM of autosomal DNA inherited to expected cM of X-DNA inherited is the same as the ratio of the total cM length of the autosomes to the total cM length of an X chromosome: 3587.1/195.9 or approximately 18.3:1.
For an X ancestor with one male on the line of descent, the proportion of the ancestor's X-DNA inherited by a descendant (male or female) is twice the proportion of the same ancestor's autosomal DNA inherited by the same descendant; thus the ratio of expected cM inherited is half the figure for the matrilineal ancestor, or approximately 9.2:1. Similarly, with two males in the inheritance path, the autosomal:X ratio drops to 4.6:1, and so on, until with five males in the path the ratio becomes 0.6:1, and we can expect to find more X-DNA than autosomal DNA inherited from that particular ancestor.
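The halving of the ratio can be tabulated as follows. This is a sketch only, using the two cM totals quoted above and assuming that each male on the line of descent halves the autosomal:X ratio; the function name autosomal_to_x_ratio is my own.

```python
AUTOSOMAL_CM = 3587.1  # total length of the autosomes in cM, as quoted above
X_CM = 195.9           # length of one X chromosome in cM, as quoted above


def autosomal_to_x_ratio(males_in_path: int) -> float:
    """Expected autosomal cM per expected X cM inherited from one X-ancestor."""
    return (AUTOSOMAL_CM / X_CM) / 2**males_in_path


for males in range(6):
    print(f"{males} males in path: {autosomal_to_x_ratio(males):.1f}:1")
```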
This explains why GEDmatch.com often shows matches sharing substantial segments of X-DNA but no autosomal DNA. Unfortunately, the FTDNA matching algorithm ignores X-DNA, so does not report those who share large segments of X-DNA but not autosomal DNA. I have written about a great example of this here.
One can actually take advantage of this quirk when selecting descendants to provide DNA samples for particular research problems. For example, my GGGGgrandfather John Keas was one of three men of similar age with that unusual surname (now generally spelled Keyes) who held land in 1833 in Carrig in county Limerick. I would like to test the hypothesis that the three men were brothers. The first strategy that comes to mind is to find one descendant of each Keas man, from the closest living generation to the 1833 landholder in each case, and to look for half-identical regions of autosomal DNA. In two cases (John Keas and William Keas), there are many descendants from whom to choose; in the third case (Michael Keas), we are still struggling to find a single proven living descendant. Where there are many descendants to choose from, and when economic circumstances dictate that DNA samples from all of them cannot be submitted, then the possible use of the X chromosome should be considered. An X-descendant with an alternating male/female line of descent from one of the three Keas men is expected to have inherited a much larger proportion of the X-DNA of Mrs. Keas, the unknown hypothesised common mother of the three men, than a direct female line descendant or any descendant who is not an X descendant. So, when exploring the family tree in search of candidates from whom to collect DNA samples, concentrate on these alternating male/female lines.
For example, in the case of this Mrs. Keas, who was bearing children by the late 1780s, there is a living GGGGgrandchild with a FMFMFMF line of descent. She is expected to have inherited 1/8th of each of Mrs. Keas's two X chromosomes, or about 49cM, but only 1/64th of Mrs. Keas's 44 autosomes, or about 112cM.
More importantly, if there is shared autosomal DNA, it could have come from any of the 64 ancestors on Mrs. Keas's generation, but shared X-DNA can come from only 21 of those 64 ancestors.
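The arithmetic for the Mrs. Keas example can be checked by brute force. The sketch below assumes the cM lengths quoted earlier and simply enumerates every possible six-generation ancestral path, keeping those with no two consecutive Ms; the function name x_ancestor_paths is hypothetical.

```python
from itertools import product

X_CM = 195.93          # one X chromosome, in cM
AUTOSOMAL_CM = 3587.1  # one set of 22 autosomes, in cM


def x_ancestor_paths(descendant_sex, generations):
    """All ancestral paths of the given depth that can carry X-DNA (no 'MM')."""
    paths = []
    for tail in product("MF", repeat=generations):
        path = descendant_sex + "".join(tail)
        if "MM" not in path:
            paths.append(path)
    return paths


path = "FMFMFMF"                  # living female descendant back to Mrs. Keas
generations = len(path) - 1       # 6 generations
f = path[:-1].count("F")          # 3 recombination opportunities

x_cm = 2 * X_CM / 2**f                       # share of her two X chromosomes
auto_cm = 2 * AUTOSOMAL_CM / 2**generations  # share of her 44 autosomes
x_ancestors = len(x_ancestor_paths("F", generations))

print(f"X: {x_cm:.0f}cM, autosomal: {auto_cm:.0f}cM")
print(f"{x_ancestors} X-ancestors out of {2**generations}")
```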
Finally, if the two subjects whose X-DNA matches are both male, then it is certain that there is an identical segment, and not just a region which is half-identical by chance because it is made up of overlapping paternal and maternal segments.
For men, matches on the X chromosome must come from the maternal side. For women, matches on the X chromosome can come from either the paternal or maternal side.
When comparing X-DNA results for two females, the same principles apply as when comparing autosomal DNA results for two individuals of any gender. Each female has two X chromosomes, and it is possible only to identify half-identical regions or half/half matches.
When comparing X-DNA results for two males, things are a lot easier, since each has just one X chromosome, and it is possible to unambiguously identify identical segments, or to find full/full matches.
When comparing X-DNA results for a male with those for a female, we are faced with a new complication, as the male has one X chromosome and the female has two, so one can observe only half/full matches.
Two individuals could share an identical segment on the X chromosome without sharing one on any of the autosomal chromosomes. Many autosomal comparisons show only one substantial half-identical region, so there is no reason why the only such region can not be on the X chromosome, particularly in the case of a female-to-female comparison.
In a male-to-male X-DNA comparison, there is no danger of finding half-identical by chance regions. In a female-to-female X-DNA comparison, half-identical by chance regions are very possible.
The simplest comparison is between the single X chromosome of two males. If matching segments are found, then there is a full/full match, which has definitely not arisen by chance and is most likely to have arisen by descent. For those males, like me, who remain unconvinced that half-identical regions of autosomal DNA are very likely to contain identical segments, male matches on the X chromosome are a good place to start looking hard for genealogical relationships with strangers.
Things get more complicated when comparing the single X chromosome of a male with the two X chromosomes of a female. The result may be a full/half match, in which the male's X chromosome matches at least one of the female's X chromosomes throughout some half-identical region. This may arise by chance or by descent.
Stronger conclusions can be drawn when comparing the single X chromosome of a male with the four X chromosomes of two known X-related females (who are not doubly related). If there is a segment where the two females have a half/half match, then, because there is evidence of a recent common X-ancestor, it is extremely likely that the two females have matching segments in that region. If the male has a full/half match with both females in the same region, then it is extremely likely that their common X-ancestor is also X-related to him.
Finally, comparisons can be made between two pairs of known X-related females. Any segments where one of the X-related pairs have a half/half match probably come from their most recent common X-ancestor. If all four females are half/half matches with each other on the same region, then it is extremely likely that the most recent common X-ancestors of each pair were related.
Suppose that two sisters look at the half-identical regions on the X chromosome that they share with a third person (male or female).
If the third person matches the sisters' shared paternal X chromosome, then both sets of half-identical regions will be the same (unless the third person is related to the sisters on both paternal and maternal sides).
The contrapositive of this statement is also true: if the two sets of half-identical regions are different, then the third person must be related to the sisters on their maternal side (or merely half-identical by chance). This is because the maternal X chromosomes inherited by the sisters from their mother are the result of recombination, so only 50% of them are expected to be the same. In other words, sisters' pairs of X chromosomes are expected to be full-identical on half of their length and half-identical everywhere they are not full-identical. Where they are full-identical, they come from the same grandparents (paternal grandmother and maternal grandfather or paternal grandmother and maternal grandmother). Where they are half-identical, the maternal X chromosomes come from opposite grandparents (one sister from maternal grandfather and the other sister from maternal grandmother).
Comparing autosomal DNA is just like comparing the X-DNA of two females, with the additional complication that the source of the shared DNA can be any ancestor, not just an X-ancestor.
Before any definitive conclusions can be drawn, both parties need to have not only their own DNA analysed, but also the DNA of some half-related individuals - in other words, the DNA of any relative other than a full-sibling or a double cousin.
So really this X-DNA chapter should come before the autosomal DNA chapters.
As of 2 January 2014, FamilyTreeDNA added two new features:
I have not yet found FTDNA's definition of X-Match. As of 25 January 2014, the FAQs still report that:
the Family Finder test does not currently use X-chromosome DNA (X-DNA) test results. The X-chromosome follows a different inheritance pattern than your autosomal DNA. Therefore, it requires a different matching algorithm to be accurate and scientifically valid.
The Bioinformatics team is investigating the math and programing for an accurate X-chromosome program.
Roberta Estes says, without linking to any source:
The X matching criteria [sic] at Family Tree DNA is: 1cM/500SNPs.
All that I can report is the nature of my own first three FTDNA-X-matches (as of 25 January 2014), which comprise two males (6.93cM or 650SNP identical segment with one, 5.12cM or 575SNP identical segment with the other) and one female (4.98cM or 550SNP half-identical region, or full/half match, by some fluke almost totally overlapping the identical segment with the first male).
Not yet having any known relative among my FTDNA-overall-matches, these are the first confirmed segment matches of any sort that I have found via FTDNA.
FTDNA apparently allows comparison of the X chromosomes of two FTDNA-overall-matches, but not of those not deemed to be FTDNA-overall-matches based on the autosomes.
How does GEDmatch.com deal with autosomal match v. X match?
For one-to-one comparisons, regardless of gender, GEDmatch.com presents a report beginning with a legend listing Base Pairs with Full Match (green), Base Pairs with Half Match (yellow) and Base Pairs with No Match (red). If both parties in the comparison are male, the only possible results are Full Match and No Match, so there will be no yellow regions.
Here's an example of a male v. male comparison, with a threshold of 3cM and one fully matching segment above that threshold, of 4.6cM. The images appear to be generated on a SNP scale rather than a cM or bp scale. Note the absence of yellow regions.
Here's an example of a male v. female comparison, again with a threshold of 3cM, this time with two half-identical regions above that threshold. Note the presence of yellow regions.
The thresholds to use for X comparisons depend on the genders involved. When comparing two females (with two X-chromosomes each), use the same thresholds as for autosomal comparisons. Use lower thresholds when comparing a male (with one X-chromosome) to a female (with two X-chromosomes). Then lower the thresholds again when comparing two males (with one X-chromosome each). Clearly, there is no way that an X match between two men can be half-identical by chance (i.e. made up of overlapping segments from maternal and paternal chromosomes), as frequently happens with half-identical regions on the X chromosome between two females and with half-identical regions between autosomal chromosomes for any combination of genders. For some bizarre reason, the GEDmatch one-to-one X comparison is set to use the same default of 7cM whatever the genders associated with the two kits being compared.
If you suspect that a kit has been uploaded to GEDmatch with the gender misreported, probably the easiest way of checking this is to look at the graphic for the one-to-one X comparison with a known male kit. If there is no yellow region in the graphic, then the suspect kit must be male; if there are yellow regions, then the suspect kit must be female.
Because women have two X-chromosomes and men have only one, it is inevitable that women have many more X-matches than men. Women have twice as much X-DNA available for comparison with the same database of potential matches, so one would expect a woman to have on average just twice as many X-matches in the same database as a man.
Roberta Estes says on her blog that the reason women have many more X-matches than men is because women have so many more ancestors in the “mix”. This statement strictly is not even true, let alone an explanation of the gender bias. Although most people believe that human history is finite, in any mathematical representation of that finite history that we might use, both men and women ultimately have an infinite number of X-ancestors. In each generation, the ratio of the number of X-ancestors that a female has on that generation to the number of X-ancestors that a male has on that generation approaches the golden ratio, approximately 1.6180339887... (the limit of the ratio between two consecutive Fibonacci numbers). The expected number of matches for a given amount of X-DNA found by searching a given database depends only on the amount of X-DNA (one or two X-chromosomes) and the size of the database, and not in any way on the number of ancestors from whom that X-DNA might have been inherited, whatever form of ancestor-counting is used.
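The convergence to the golden ratio is easy to verify numerically. The sketch below relies only on the X inheritance rules: a female's X-ancestors in a given generation are those of her mother plus those of her father one generation earlier, while a male's are those of his mother alone. The function name x_ancestor_counts is my own.

```python
import math


def x_ancestor_counts(generations):
    """Numbers of X-ancestors per generation for a female and for a male."""
    female, male = [1], [1]          # generation 0 is the person tested
    for _ in range(generations):
        f_prev, m_prev = female[-1], male[-1]
        female.append(f_prev + m_prev)  # via her mother and via her father
        male.append(f_prev)             # via his mother only
    return female, male


female, male = x_ancestor_counts(30)
golden_ratio = (1 + math.sqrt(5)) / 2

for n in (1, 2, 3, 6, 10, 30):
    print(n, female[n], male[n], round(female[n] / male[n], 6))
print("golden ratio:", round(golden_ratio, 6))
```

Both columns are Fibonacci numbers, one step apart, which is why the ratio settles so quickly.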
The observed gender bias appears to be far larger than the 2:1 ratio that one would expect to find.
In my case, I have access to three FTDNA accounts, one male and two females. I am male and have only three FTDNA-X-Matches out of 398 FTDNA-overall-matches (as of 25 January 2014) but one female has 69 FTDNA-X-Matches out of 357 FTDNA-overall-matches and the other has 48 FTDNA-X-Matches out of 416 FTDNA-overall-matches.
Similar discrepancies can be seen at GEDmatch.com, which shows that in many cases two people can share more X-DNA than autosomal DNA (on the cM scale).
As of 22 January 2014, I have only 20 GEDmatch matches by X-DNA Total cM of longer than 9.1cM, of whom only 4 are male and with only 6 of whom I share autosomal half-identical regions longer than 7cM.
Anthea, on the other hand, based only on 23andMe customers, has 20 GEDmatch matches by X-DNA Total cM of longer than 49.6cM, of whom none are male and none share an autosomal half-identical region longer than 7cM with her.
It is obvious from the X-DNA inheritance pattern that the majority of X-DNA matches will be female, but 20 out of 20 is still surprising.
The observed gender bias must arise instead from a preponderance of half-identical by chance regions.
At GEDmatch.com, I recommend that males go to the 'One-to-many' matches page and enter their kit number. In the X-DNA group of columns, click the blue arrow in the largest cM column to sort by that column. Look down the Sex column and note the Kit Nbr for any M that you find. These are the men with whom you have identical segments on the X chromosome and are worth investigating further.
In my own case, by using this approach I found five males including myself who have identical segments of various lengths in the region of the X chromosome between locations 23,955,089 and 36,111,764. Two are from FTDNA, two from 23andMe and one from Ancestry. Of the other four, only the other FTDNA customer has published an e-mail address, enabling us to establish contact. However, his X ancestors cannot be traced out of the USA and mine cannot be traced out of Ireland, so we have failed to establish our precise relationship. I would love to hear from the other three anonymous X matches whose GEDmatch Kit Numbers appear in the table below.
As these are identical segments and not just half-identical regions, all five men are very likely to share a common X ancestor.
Here's a table showing the ten one-on-one matching segments between these five men, identified by their GEDmatch kit numbers:
Kit Nbr 1 | Kit Nbr 2 | Start | End | cM | SNPs |
A241230 | F156355 | 23,955,089 | 32,690,504 | 12.7 | 1,302 |
M223101 | A241230 | 28,606,324 | 32,387,789 | 8.1 | 764 |
M223101 | F156355 | 28,606,324 | 32,387,789 | 8.1 | 773 |
F310654 | A241230 | 29,528,059 | 32,387,789 | 7.1 | 639 |
F310654 | F156355 | 29,528,059 | 32,387,789 | 7.1 | 666 |
F310654 | M223101 | 29,528,059 | 32,496,045 | 7.9 | 694 |
F156355 | M391301 | 31,333,265 | 32,387,789 | 3.8 | 322 |
A241230 | M391301 | 31,333,265 | 32,387,789 | 3.8 | 315 |
M223101 | M391301 | 31,333,265 | 32,493,780 | 4.6 | 520 |
F310654 | M391301 | 31,333,265 | 36,111,764 | 10.0 | 812 |
This table identifies eight crossover locations, viewing 32,493,780 and 32,496,045 as the same crossover for two reasons:
As the latter is impossible (since male/male X matching is transitive - unlike male/female or female/female X matching), there must be some form of measurement error in this tiny segment.
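The count of eight crossover locations can be reproduced from the table. This is a rough sketch: the 5,000 base pair tolerance used to treat two nearby endpoints as the same crossover is purely my own assumption for illustration, as is the function name crossover_locations.

```python
# (kit 1, kit 2, start, end) for the ten male/male segments in the table above
segments = [
    ("A241230", "F156355", 23_955_089, 32_690_504),
    ("M223101", "A241230", 28_606_324, 32_387_789),
    ("M223101", "F156355", 28_606_324, 32_387_789),
    ("F310654", "A241230", 29_528_059, 32_387_789),
    ("F310654", "F156355", 29_528_059, 32_387_789),
    ("F310654", "M223101", 29_528_059, 32_496_045),
    ("F156355", "M391301", 31_333_265, 32_387_789),
    ("A241230", "M391301", 31_333_265, 32_387_789),
    ("M223101", "M391301", 31_333_265, 32_493_780),
    ("F310654", "M391301", 31_333_265, 36_111_764),
]


def crossover_locations(segments, tolerance_bp=5_000):
    """Group distinct segment endpoints, merging those within tolerance_bp."""
    points = sorted({pos for _, _, start, end in segments for pos in (start, end)})
    clusters = [[points[0]]]
    for pos in points[1:]:
        if pos - clusters[-1][-1] <= tolerance_bp:
            clusters[-1].append(pos)   # close enough: same crossover
        else:
            clusters.append([pos])     # a new crossover location
    return clusters


clusters = crossover_locations(segments)
for cluster in clusters:
    print(cluster)
print(len(clusters), "crossover locations")
```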
I needed this hand-drawn diagram in order to figure out what was going on:
All five men are identical on the segments up to the crossover at 32,387,789, so these segments appear to descend from a common X ancestor of all five men.
Beyond that crossover, the five men break into two subgroups: A241230 and F156355 are identical to each other; and M223101, F310654 and M391301 are all identical to one another. For one or other subgroup, these segments appear to descend from a more recent common X ancestor of the subgroup.
These small shared segments probably come from a very distant common X ancestor, but, with at least five people known to be identical on the same segment, the chances of finding a common X ancestor for at least two of the five men are increased. If all of them were willing to provide a means of contact and to discuss their possible common X ancestors, the chances would be even greater.
For further reading on X-DNA, see Louise Coakley's blog post.
In practice, however, if you look at the relationship diagram connecting you to a distant mitochondrial cousin, everyone included will (typically) have a different surname, apart from the most recent common male ancestor and his two daughters.
Just as some surnames proliferate, due to many men of the surname having several sons, and other surnames get "daughtered out", due to men of the surname not marrying or fathering only daughters, so some mitochondrial DNA signatures are more prolific than others. For example, as of 31 January 2016, I had managed to document only 101 people (living and deceased) sharing my own mitochondrial DNA, but no fewer than 449 sharing my father's mitochondrial DNA.
Indeed, my grandmothers' mitochondrial DNA is doomed, as each of them had only one daughter, and each of those daughters in turn had only sons. Once we are gone, our respective grandmothers will never again have mitochondrial descendants.
My greatgrandmothers' mitochondrial DNA is in safer hands. Greatgrandmother Waldron née Nolan has a female mitochondrial descendant born in 2005; greatgrandmother McNamara née Clancy has a female mitochondrial descendant born in 2013; greatgrandmother Durkan née Durkan has several female mitochondrial descendants born since the 1980s; and my other greatgrandmother Durkan née O'Neil also has a female mitochondrial descendant born in the new millennium.
In genetic genealogy, mitochondrial DNA is useful for situations like these:
When I submitted my own DNA sample, I was advised to defer purchase of mtDNA analysis for various reasons, which may have included:
I eventually purchased the mtFull Sequence (FMS) product for myself and received my results on 24 February 2015. My first and only FMS match at a genetic distance of 0 at that stage was an adoptee. Three more perfect matches appeared subsequently, one each in 2015, 2016 and 2017, but we have been unable to find a common ancestor between any two of us.
My most distant known mitochondrial ancestor is my GGgrandmother Mrs. Mary O'Neil, who died in Barnalyra in County Mayo on 6 June 1887.
I also ordered mtFull Sequence for my paternal first cousin (who
shares my father's mtDNA) on 10 February 2016, because there was a
good chance that it could prove or disprove my hunch that two
early nineteenth century West Clare matriarchs are closely
related, namely:
Generation 1 (Ancestor) | Catherine Crotty | Mary Conors | Mrs. Mary O'Neil |
Generation 2 (Children) | 3 | 5 | 6 |
Generation 3 (Grandchildren) | 5 | 44 | 25 |
Generation 4 (Greatgrandchildren) | 7 | 91 | 33 |
Generation 5 (GGgrandchildren) | 8 | 99 | 19 |
Generation 6 (GGGgrandchildren) | 9 | 100 | 11 |
Generation 7 (GGGGgrandchildren) | 7 | 83 | 7 |
Generation 8 (GGGGGgrandchildren) | 0 | 27 | 0 |
My facebook friends | 2 | 21 | 2 |
If Catherine and Mary were first cousins, with different maiden surnames, then there is one chance in three that their mothers were sisters, and that they shared mitochondrial DNA.
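The one-in-three figure can be checked by simple enumeration. The following is a minimal sketch, assuming that the four possible parental links between two first cousins are equally likely a priori and that cousins whose fathers were brothers would share a maiden surname; the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

# Two first cousins are linked through one parent of each cousin,
# and those two parents are siblings.
links = list(product(["father", "mother"], repeat=2))

# Assumption: if the link ran through both fathers (two brothers), the
# cousins would share a maiden surname, so that case is excluded here.
different_surname = [link for link in links if link != ("father", "father")]

# Mitochondrial DNA is shared only when the link runs through both mothers.
shared_mtdna = [link for link in different_surname if link == ("mother", "mother")]

chance = Fraction(len(shared_mtdna), len(different_surname))
print(chance)  # 1/3
```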
Catherine has two GGgrandchildren (descended from different children) with Family Finder results at FamilyTreeDNA.com (AF and FM); Mary has two GGGgrandchildren from her first marriage (myself PW and my first cousin AD) and one suspected GGgrandchild from her second marriage (TE) with Family Finder results at FamilyTreeDNA.com. (Family Finder results for children and grandchildren of some of these five people are also available, but cannot add any further information.)
The 3x2=6 possible Family Finder comparisons between Catherine's descendants and Mary's known and suspected descendants reveal that three of these six pairs are deemed to be FTDNA-overall matches, as shown in this table (courtesy of the Clare Roots project at FTDNA):
Person | AF | FM | TE | PW | AD |
AF | - | 79.99883 | 35.98273 | 38.23731 | 0.00000 |
FM | 79.99883 | - | 59.88818 | 0.00000 | 0.00000 |
TE | 35.98273 | 59.88818 | - | 0.00000 | 0.00000 |
PW | 38.23731 | 0.00000 | 0.00000 | - | 903.06281 |
AD | 0.00000 | 0.00000 | 0.00000 | 903.06281 | - |
If Catherine and Mary were first cousins, then two of the six
comparisons between the two families (each repeated on either side
of the diagonal in this symmetric table) would involve comparing
fifth cousins and the other four would involve comparing fifth
cousins once removed. The observed 50% match rate is far more than
the predicted 10% or less match rate given by FamilyTreeDNA for such
distant relationships. If the match rate for pairs of fifth cousins
was 10%, then the probability of three or more matches in six
independent fifth cousin comparisons would be about 1.6%. Although
the comparisons in this case are not independent, my theory that
Catherine and Mary were closely related is clearly a little more
than a hunch. The fact that there is only one match out of three
between Mary's descendants raises a red flag and suggests that
there may be a common ancestor somewhere else on the family tree.
There are two more of Mary's descendants at AncestryDNA and
GEDmatch.com, but one-to-one comparisons there don't strengthen
the case for a relationship between the two dynasties. Further
evidence unearthed after I generated the above table confirmed
that TE's GGgrandfather Thomas Kelly of Glascloon was not Mary
Conors' second husband of that name and townland, but an older
namesake, possibly his father.
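The binomial tail probability for the six comparisons can be computed directly. This is a minimal sketch, assuming six independent comparisons, each with a 10% chance of a match; the function name is hypothetical. Under those assumptions the chance of three or more matches works out at about 1.6%, and the chance of four or more at about 0.13%.

```python
from math import comb

def binomial_tail(k_min: int, n: int, p: float) -> float:
    """Probability of at least k_min successes in n independent trials."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

# Assumed inputs: six comparisons, 10% match rate for fifth cousins.
at_least_three = binomial_tail(3, 6, 0.10)
at_least_four = binomial_tail(4, 6, 0.10)
print(f"{at_least_three:.2%}")  # 1.58%
print(f"{at_least_four:.2%}")   # 0.13%
```

Because the real comparisons share ancestors and are therefore not independent, these figures are only a rough guide.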
It would be nice to see if the five individuals in this table
have any other matches in common, which is a report that is
theoretically available to project administrators, but this operation usually times out. It finally
completed on 11 March 2016 and reported "No In Common With
Members".
Catherine's mitochondrial DNA has long been known, through one of
her GGgrandchildren, to be from haplogroup H3-T16311C!. As of 4
May 2018, this GGgranddaughter's only Full Mitochondrial Sequence
match with a genetic distance of zero is one of her own daughters.
(Her mitochondrial DNA is unlikely to die out any time soon as all
of her four children, five of her six grandchildren and her first
greatgrandchild are females who have inherited her mtDNA.) As
Mary's mitochondrial DNA was available, through one of her
GGGgrandchildren, on whom only the Family Finder analysis had been
ordered so far, I decided that it was worth paying for
mitochondrial analysis on Mary's descendant for the one-in-three
or smaller chance that the possible relationship between Catherine
and Mary is on the mitochondrial line. Results were promised for
some time between 23 March 2016 and 6 April 2016, but I got an
e-mail on 11 March 2016 to say that they were already available,
and they showed a match date of 2 March 2016.
These results suggested that Mary Conors' numerous mitochondrial descendants belong to mitochondrial haplogroup H27. On 13 February 2018, I was alerted by another FTDNA customer to one of her mtDNA matches who appears to be my second cousin and also a mitochondrial descendant of Mary Conors. But his mitochondrial haplogroup is H27-T16093C. It appears that one mutation since my greatgrandmother has thwarted the FTDNA matching algorithm, which believes that the two second cousins are unrelated. The matching algorithm has worse shortcomings, for it concluded due to similar recent mutations that Blaine Bettinger and his mother were unrelated, although autosomal DNA comparison had confirmed the relationship.
As of 4 May 2018, FTDNA had no perfect (genetic distance of 0) mitochondrial match to my first cousin, but two such matches to my second cousin. By way of comparison, I belong to mitochondrial haplogroup U4b1b2 and FTDNA then had four such mitochondrial matches to me, and I was disappointed at how few matches I had!
FTDNA recalculated its mitochondrial matches in May 2016. Mine remained unchanged. My first cousin acquired three new matches, all backdated to 2 March 2016, two at genetic distance 1 and one at genetic distance 3.
So it's
back to the drawing board on the theory that Catherine and
Mary were sisters' children. Even if we got a
perfect mitochondrial match, we still could not have ruled out the
possibility that the relationship between Catherine and Mary was
more distant - possibly much more distant - than first cousins.
A sample of six Family Finder comparisons is very small, and we
also need more descendants of both women to order Family Finder to
provide additional independent evidence, especially at least one
descendant each of Mary's daughters Biddy, Kitty and Peggie
Galvin.
There are unexpected and unexplained differences between the tools for analysing mitochondrial DNA matches and those for analysing Y-DNA matches.
For each of my Y-DNA matches at FamilyTreeDNA.com, I can click on a TiP icon and get a Y-DNA TiP Report allowing me to see the likelihood of the most recent common ancestor that I share with the relevant match being a particular number of generations in the past.
For mtDNA matches, there is no equivalent of the TiP icon.
Question 1: How can I work out the corresponding likelihoods in the mitochondrial case?
My 71 Y-DNA37 matches come from many different Y-DNA Haplogroups and are all a genetic distance of 2, 3 or 4 from me.
My 6 mtDNA matches as of 22 November 2015 all come from mtDNA Haplogroup U4b1b, two at a genetic distance of 0 from me and four at a genetic distance of 2 from me.
Anthea's 31 mtDNA matches as of 22 November 2015 all come from mtDNA Haplogroup U5a2b4 and are all a genetic distance of 1, 2 or 3 from her.
Without the ability to turn "genetic distance" into a probability scale, I find it impossible to interpret these results.
Question 2: Why are the Y-DNA interface and the mtDNA interface at FamilyTreeDNA.com so different?
Myself, Anthea and a third person are each half-identical to the other two on the same substantial part of Chromosome 6. Anthea and the third person both belong to mtDNA haplogroups beginning with U5a2. It would be nice to get some idea of the likelihood that the most recent common ancestor from whom we inherited the autosomal match was on the shared matrilineal line of Anthea and the third person. As I have read that mitochondrial mutations are very rare, I suspect that the likelihood in this case is small.
Question 3: How can I put some sort of number or probability on this likelihood?
I started a discussion around these three questions on facebook.com.
For further reading about mitochondrial DNA, these are probably the places to look:
http://www.isogg.org/wiki/MtDNA_testing_comparison_chart
http://www.mitosearch.org/
http://www.mtdnacommunity.org/
Genome Mate can be installed and run within your web browser from genomemate.org.
I have created separate Genome Mate profiles for myself, for my paternal first cousin and for my maternal first cousin, by importing match data and chromosome browser data downloaded from FamilyTreeDNA.com for each of us, on 20 July 2014. I accepted the default 7cM/500SNP cutoffs for now.
This enables me to work through my FTDNA-overall-matches and identify those who must be paternal (because they match both me and my paternal first cousin where we match each other) and those who must be maternal (because they match both me and my maternal first cousin where we match each other).
I eventually worked out the following methodology:
This will be a long and tedious process!
Among the trickier parts of genetic genealogy is developing appropriate strategies to adopt:
Given an individual of interest, these strategies can involve a mixture of:
If not properly thought out and planned, the process can become overwhelming, frustrating or confusing for a variety of reasons:
In this chapter, I will set out, in no particular order as yet, some of my thoughts on strategies which may prove helpful.
This section is inspired by my totally unexpected discovery that my closest initial FTDNA-overall-match is not only an adoptee, but a foundling.
Developing the optimal research strategy, particularly when there has been an adoption or other non-paternity event, requires the researcher to reconcile conflicting emotional, scientific and economic motivations.
People on both sides of the adoption brick wall may experience unexpected and unpredictable emotional reactions. Some people on both sides of the adoption brick wall will accept that they cannot change the past; others may wish to let bygones be bygones and be unwilling to investigate the past. The biological parent(s) of the adoptee may or may not be the first suspect(s) in the family tree that come to mind as perhaps having had a child out of wedlock.
The scientific approach will suggest taking DNA samples from as many potential relatives as possible, rather than zeroing in immediately on the most likely suspects suggested by the historical evidence.
The economic problem is that DNA analysis is still costly, particularly if there are dozens of members of a large extended family available for sampling. Unless a fairy godmother is willing to pay for the analysis of all the samples that statistical rigour requires, a degree of negotiation between the parties will possibly be required.
Some compromise is inevitable between these three motivations. This may require the combined skills of a social worker, a statistician and a businessman.
Before planning a scientific research strategy, one must decide on the specific objective of the research. In other words, the first step is to decide on the precise hypothesis that one wishes to test. This hypothesis should be something along the lines of "Jack and Jill were siblings" or "Tom was the father of Dick and Harry".
The optimal research strategy will depend critically on both the hypothesis to be tested and the budget, the time and the degree of co-operation from the extended family which are available. Costs will undoubtedly continue to fall, but if we wait for ever we'll all be dead before we get anywhere.
The strategy to be decided at each step basically involves answering the question: "Whose autosomal DNA should be analysed next?" As the solution comes within sight, it may become appropriate to look at other forms of DNA. For example, if you find two female ancestors that you suspect were sisters, then you might want to look at mitochondrial DNA from a matrilineal descendant of each.
Thus one should probably follow all of the following strategies in parallel as funds and time permit.
Let your genealogical instinct suggest the precise hypothesis to be tested. If the evidence causes you to reject your hypothesis, go back and come up with a new hypothesis.
In an adoption case, you may have a hunch as to which member or members of the extended family is most likely to have produced a secret love child. So cut to the chase and look for a DNA sample from his or her closest known relative. Even if your suspect is long dead or has lost contact with the family many years ago, a certain amount of diplomacy will be required here. Adoption counselling services are there to help. (As of 7 September 2014, however, the Adoption Rights Alliance has no reference to DNA or genetics on its home page.)
Is the DNA of a particular branch of your family in danger of extinction or dilution?
If you have a relative who is of great age or seriously ill, collect a sample of his or her DNA before it is too late.
Digging someone up to collect DNA after he or she is dead and buried is awkward and expensive and unpleasant and messy and requires exhumation orders and incurs legal costs.
The day is probably not far away when funeral directors will routinely offer to collect and preserve a DNA sample from the body of a deceased person.
Rarity value arises not only from calendar age or life expectancy, but also from position in the family tree.
The DNA of an only surviving child is more irreplaceable than the DNA of a child from a large family.
The higher up the family tree an individual is, the more useful his or her DNA will be. For example, the youngest child of the youngest child of an ancestor may have been born long after the oldest child of the oldest child of the oldest child of the ancestor (his first cousin once removed). But he is one generation higher up the tree, so his autosomal DNA is expected to contain twice as much of the common ancestral couple's DNA as that of the oldest person on the next generation down.
If you are just interested in identifying a common ancestor, then the DNA of a child whose parents' DNA has already been collected will be of no incremental value. On the other hand, if you are interested in chromosome mapping and phasing, then sampling a child and both parents is of critical importance. We have already seen in Example A that looking at parents' DNA can help to quickly identify smaller half-identical regions as merely half-identical by chance.
In my case, the rarity strategy points me towards my mother's maternal first cousins. Her last sibling died in 2006 and her last paternal first cousin died in 1981, but three or four of her maternal first cousins were still alive, aged 78 and upwards when I got my own DNA results. My father's paternal first cousins and maternal half first cousins are younger and/or more plentiful.
In any case, start with the earliest living generation - it may cost twice as much, but you will learn more than twice as much by sampling your father or paternal uncle or aunt and your mother or maternal uncle or aunt as you will learn by sampling yourself.
If you find a suspected relative, for example an adoptee, and have no idea which side of the family he or she comes from, the most logical approach is to repeatedly bisect your pedigree chart. If neither person is an adoptee, then both can pursue this strategy simultaneously. And it will leave a framework in place for identifying your common ancestor with future mystery matches.
Find a living paternal relative (who is not also a maternal relative), by working through father, paternal grandparent or greatgrandparent, paternal uncle or aunt, paternal first cousin, and so on, until you find someone living and willing to provide a DNA sample. (Your own siblings, nieces, nephews, etc., will not give any independent evidence about the relationship of interest beyond what your own DNA sample has revealed.)
Then find a living maternal relative, by working through mother, maternal grandparent or greatgrandparent, maternal uncle or aunt, maternal first cousin, and so on, until you find someone living and willing to provide a DNA sample.
If the original match matches the paternal relative but not the maternal relative, then you can be fairly sure that he or she is related on the paternal side, and vice versa.
Now move back a generation and repeat the process:
Suppose the original match turns out to come from your maternal side.
Find and sample a living relative on your maternal grandfather's side (maternal grandfather, greatuncle or greataunt on that side, first cousin once removed on that side, second cousin on that side, etc.).
Also find and sample a living relative on your maternal grandmother's side (maternal grandmother, greatuncle or greataunt on that side, first cousin once removed on that side, second cousin on that side, etc.).
If the original match matches the maternal grandfather's relative but not the maternal grandmother's relative, then you can be fairly sure that he or she is related on the maternal grandfather's side, and vice versa.
Continue until you come to a step where the original match matches both sides. Now you've almost certainly found your common ancestral couple.
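The bisection procedure described above can be sketched as a short loop. This is only an illustration: `matches_side` is a hypothetical callback standing in for the real work of finding, sampling and comparing a relative on the given side, and the evidence set below is invented.

```python
def bisect_pedigree(matches_side, max_generations=4):
    """Walk up a pedigree chart, halving the candidate ancestors each step.

    matches_side(path) is a hypothetical callback: given a tuple such as
    ("mother", "father"), it reports whether the mystery match also matches
    a sampled relative on that side of the tree.
    Returns the path to the narrowest branch identified.
    """
    path = ()
    for _ in range(max_generations):
        on_father_side = matches_side(path + ("father",))
        on_mother_side = matches_side(path + ("mother",))
        if on_father_side == on_mother_side:
            # Matches both sides (probable common ancestral couple reached)
            # or neither side (no usable evidence): stop bisecting here.
            break
        path += ("father",) if on_father_side else ("mother",)
    return path

# Invented evidence: sides on which a sampled relative matches the mystery match.
evidence = {
    ("mother",),
    ("mother", "father"),
    ("mother", "father", "father"),
    ("mother", "father", "mother"),
}
branch = bisect_pedigree(lambda side: side in evidence)
print(branch)  # ('mother', 'father')
```

In this invented case the search stops at the maternal grandfather, because relatives on both of his parents' sides match, which points to his parents as the common ancestral couple.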
If cost is not an obstacle, then you can immediately start collecting DNA samples from all those who you might in future want to approach in connection with a particular study, for example:
If you have living relatives from a generation before your own on any side, then substitute one of them for the person from the subsequent generation.
As noted above, beyond third cousins there is a significant probability that one or other or both parties will not have inherited any autosomal DNA from the common ancestral couple, so the value of this approach declines.
If the objective is purely to work out to which branch of your ancestry distant DNA matches belong, then you might skip the first and second cousins and just recruit third cousins.
Note that I have not mentioned siblings in this section; the usefulness of their DNA is in increasing the precision of estimated relationships, a topic to which I will now turn.
Nobody would carry out an opinion poll based on a sample size of one, but that is precisely what the matching algorithms of the DNA companies do.
Another field of applied statistics in which I have extensive experience is the handicapping of racehorses - allotting the weights to be carried by each horse in a race on the basis of observed ability in order to equalise their chances of winning. Handicappers are not required to assess a horse's ability until it has raced three times, typically in races where the number of runners is in double figures. The matching algorithms of the DNA companies are comparable to requiring handicappers to make a definitive judgement of a horse's ability after a single run in a two-horse race.
As in both of these examples, proper statistical analysis generally requires taking a large sample of independent and identically distributed observations and looking at the average of those observations.
The bigger the sample, the smaller the margin of error.
If you have not already done so, ask living relatives from generations older than your own to provide DNA samples. It will be a great help in distinguishing from which side your matches come and how closely they are related to you.
Anyone from an earlier generation is expected to share twice as much autosomal DNA with mutual relatives as you do; those from two generations earlier are expected to share four times as much; and so on. So it is important to obtain samples from parents, aunts and uncles and cousins once removed (and, even more so, from grandparents, greataunts, great uncles and cousins twice removed) before it is too late. These samples will both provide more accurate estimates of relationships and more segments of your common ancestral couple's DNA which may match those inherited by other descendants.
As the observed ranges of shared DNA for close relatives do not overlap, it is immediately possible to identify and distinguish between parent/child (100%), full-sibling (75%), half-sibling (50%), uncle-or-aunt/nephew-or-niece (25%) and first cousin (12.5%) relationships. Beyond this, the shared percentages are closer together and the standard deviations relatively larger, so it becomes impossible to distinguish between relationships using samples of one. The solution is to obtain DNA samples from known relatives of either or both parties.
This is easy to do if you and/or your suspected relative have siblings and/or half-siblings on the side you are interested in.
A simple way to arrive at more precise relationship estimates is to just collect DNA samples from all the siblings on each side of the brick wall and look at the average of the pairwise Shared cM between the two families. The original single observation may not have been able to distinguish between, say, second cousin (and equivalents), second cousin once removed (and equivalents) and third cousin (and equivalents). The average pairwise Shared cM between the two families will have a much smaller associated standard error, so will potentially give an unambiguous inference. Remember, however, that two first cousins, for example, are expected to share the same percentage of their DNA as a greatuncle and his greatnephew, so traditional genealogical evidence, such as birth dates, will still be required to distinguish between such equivalent relationships.
If you collect DNA from several family groups of first cousins, your observations are no longer independent, so a simple weighted averaging process is necessary.
In the case of the two sibling groups, the averaging procedure is essentially equivalent to estimating the Shared cM between one of the parents of one sibling group and one of the parents of the other sibling group.
If you have samples from several family groups descended from a common ancestor, then you can work back through the generations to estimate the Shared cM between the common ancestor and the person you are trying to fit into the family tree.
For example, suppose Joan and Peter are siblings and have Shared cM of 75.1 and 60.4 respectively with Anthea, whom we have reason to believe is related to them on their now deceased father Richard's side. Joan and Peter each got half of their overall DNA from Richard, so we expect that on average they got half of the DNA that Richard shared with Anthea. From Joan's figure, we can estimate that the Shared cM between Richard and Anthea was 2 x 75.1 = 150.2, and from Peter's figure we can estimate that the Shared cM between Richard and Anthea was 2 x 60.4 = 120.8. The obvious way to combine these estimates is to take the average, so our first crude point estimate of the Shared cM between Richard and Anthea is (150.2 + 120.8)/2 = 135.5.
Now suppose we also have a sample from Richard's first cousin Patricia, whose Shared cM with Anthea is 78.1. Using the same logic as before, Richard's mother Lillian is expected to have 2 x 135.5 = 271.0 Shared cM with Anthea and Patricia's father Thomas is expected to have 2 x 78.1 = 156.2 Shared cM with Anthea. Lillian and Thomas are siblings, so the average of their Shared cM with Anthea will be a more precise estimate than either of these figures on its own: (271.0 + 156.2)/2 = 213.6.
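The arithmetic in the Joan, Peter and Patricia example can be reproduced with a few lines of code. This is a minimal sketch using the figures quoted above; the function name `parent_estimate` is hypothetical.

```python
def parent_estimate(child_shared_cm):
    """Estimate a parent's Shared cM from his or her children's figures.

    Each child is expected to inherit half of the parent's shared DNA, so
    every child's figure is doubled and the doubled figures are averaged.
    """
    doubled = [2 * cm for cm in child_shared_cm]
    return sum(doubled) / len(doubled)

richard = parent_estimate([75.1, 60.4])    # from Joan and Peter
lillian = parent_estimate([richard])       # Richard's mother
thomas = parent_estimate([78.1])           # Patricia's father
sibling_average = (lillian + thomas) / 2   # Lillian and Thomas are siblings

print(round(richard, 1), round(lillian, 1), round(thomas, 1), round(sibling_average, 1))
# 135.5 271.0 156.2 213.6
```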
As the standard errors for the original one-to-one comparisons are not reported, it is difficult to work out the standard errors for the averaged comparisons, but they must be smaller.
These examples look only at the aggregate length of the regions on which two siblings are half-identical to a third person. A more sophisticated estimation technique can be used if the locations of the half-identical regions are also known. Ex ante, we would expect both siblings to be half-identical to the third person on 1/3 of the aggregate length; just one sibling to be half-identical on 2/3 of the aggregate length; and their relevant parent to be half-identical to the third person on a further 1/3 of the aggregate length on which neither child is half-identical to the third person. When we add the aggregate lengths together, double counting the 1/3 on which both siblings are half-identical exactly compensates for not counting the 1/3 on which neither sibling is half-identical. If both siblings are half-identical to the third person on exactly the same regions, then we should begin to doubt that there were any other regions on which the parent was half-identical to the third person; adding the siblings' Shared cM in this case probably overestimates the parent's Shared cM. Conversely, if there is no overlap between the regions where the two siblings are half-identical to the third person, then we should begin to suspect that there are other segments shared by the parent and the third person which neither child inherited; adding the siblings' Shared cM in this case probably underestimates the parent's Shared cM. A better estimate of the Shared cM between the parent and the third person would be based on two separate estimates of the aggregate length of the regions where the parent but neither child matched the third person.
A little algebraic notation will make the calculations easier to follow. Let x denote the aggregate length of the regions where both siblings match the third person, and let y denote the aggregate length of the regions where exactly one sibling matches the third person. If we just knew x, then we would expect the aggregate length of the regions where the parent but neither child matched the third person to also equal x. Similarly, if we just knew y, then we would expect the aggregate length of the regions where the parent but neither child matched the third person to equal 0.5y. A better estimate of the uninherited length can be obtained by averaging these estimates: 0.5x+0.25y. Thus the best estimate of the Shared cM between the parent and the third party is x+y+0.5x+0.25y=1.5x+1.25y.
For example, if both siblings match the third party on the same 100cM regions, then x=100 and y=0, so our best estimate is that the parent matched the third party on 150cM. However, if both siblings matched the third party on non-overlapping 100cM regions, then x=0 and y=200, so our best estimate is that the parent matched the third party on 250cM. As intuition suggested, the first of these estimates is less than the initial crude estimate of 200cM, but the second is greater.
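The two-sibling formula is easy to express in code. This is a minimal sketch of the 1.5x+1.25y estimate described above, checked against the two examples; the function name is hypothetical.

```python
def parent_shared_two_siblings(x: float, y: float) -> float:
    """Estimate the parent's Shared cM with a third party.

    x: aggregate cM on which both siblings match the third party.
    y: aggregate cM on which exactly one sibling matches the third party.
    """
    unseen_from_x = x          # uninherited length judged from x alone
    unseen_from_y = 0.5 * y    # uninherited length judged from y alone
    unseen = (unseen_from_x + unseen_from_y) / 2
    return x + y + unseen      # equals 1.5x + 1.25y

print(parent_shared_two_siblings(100, 0))  # 150.0 (complete overlap)
print(parent_shared_two_siblings(0, 200))  # 250.0 (no overlap)
```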
Note that if two known relatives match a third party, the match is more significant (indicative of a closer relationship) if the half-identical regions are non-overlapping. Conversely, if two unconfirmed relatives match a third party, the match is more significant (indicative of an ancestor common to all three people) if the half-identical regions are overlapping; otherwise, the two relationships might be on opposite sides, or the half-identical regions might be only half-identical by chance.
A similar formula is easily derived for the case where n siblings are being compared to a single possible relative. Let l(i) denote the aggregate length of the regions on which exactly i of the n siblings are half-identical to the possible relative. Let f(i,n) = C(n,i) * 2^(-n), where C(n,i) is the binomial coefficient "n choose i", denote the probability of i successes in n binomial trials where the probability of a success in each trial is 50%. We have n separate estimates of the unobservable l(0), namely l(i)*f(0,n)/f(i,n) = l(i)/C(n,i) for i = 1, 2, ..., n. The best estimate of the unobservable l(0) is the simple average of these n separate estimates.
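The n-sibling version can be sketched in the same way. This is an illustration only: the function name and the input layout (a list whose entry i-1 holds l(i)) are assumptions, and the three-sibling figures are invented to match the expected binomial proportions.

```python
from math import comb

def estimate_parent_shared(lengths):
    """Estimate a parent's Shared cM from n siblings' overlap pattern.

    lengths[i - 1] is l(i): the aggregate cM on which exactly i of the
    n siblings are half-identical to the possible relative.
    """
    n = len(lengths)
    # Each l(i) gives its own estimate of the unobservable l(0).
    estimates = [lengths[i - 1] / comb(n, i) for i in range(1, n + 1)]
    unseen = sum(estimates) / n          # simple average of the n estimates
    return sum(lengths) + unseen         # observed length plus estimated l(0)

# With two siblings this reduces to 1.5x + 1.25y (here y = l(1), x = l(2)).
print(estimate_parent_shared([0, 100]))      # 150.0
print(estimate_parent_shared([200, 0]))      # 250.0
# Invented three-sibling case in exact 3:3:1 binomial proportion.
print(estimate_parent_shared([30, 30, 10]))  # 80.0
```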
The estimation becomes a little more intricate when DNA from multiple siblings on each side of the brick wall is available for analysis. It becomes more intricate again if we move back another generation and want to estimate the shared cM between the deceased grandparent of two living cousin groups and a possible relative. The segments shared by the grandparent and the possible relative can be broken down into the following categories:
The first two categories can be measured. The last three have to be estimated using a similar methodology to that already used when going back just one generation. [A little more thought will be required to come up with a sensible methodology here.]
If you and/or your suspected relative have no siblings or members of an earlier generation available for sampling, or have already sampled all available siblings and want to move out to first cousins, then start with one cousin from each family. If an uncle by chance inherited less autosomal DNA than expected from the relevant side of the family, then his children can all be expected to have also inherited less autosomal DNA than otherwise expected from that side of the family. In other words, the information supplied by the DNA from two of his children is not independent in the same way as that supplied by two first cousins would be.
For this reason, if you are sure that your suspected relative is from your maternal side rather than from your paternal side, then you will actually learn more by sampling your maternal first cousins (one from each family) than by sampling your own siblings.
If funds extend to sampling multiple first cousins from different family groups, similar principles apply. Suppose Beatrice, Thomas (Jr.), Martin, Catherine and Anne, all now deceased, were siblings, children of Thomas (Sr.) and Mary, but you have DNA samples from three of Beatrice's children, five of Thomas's children, one of Martin's children, one of Catherine's children, and two of Anne's children. Let's suppose we want to investigate how Thomas (Sr.) or Mary was related to Anthea. First we estimate Beatrice's relationship to Anthea by averaging the Shared cM for Beatrice's three children, then doubling the result (since each child is expected to have inherited half of Beatrice's Shared cM). Similarly, average the Shared cM for Thomas (Jr.)'s and Anne's children and double the results, and double the Shared cM for Martin's child and Catherine's child. Now we have five independent estimates of the expected Shared cM between Anthea and the children of Thomas (Sr.) and Mary. Again just average these five estimates and double the result to estimate the Shared cM between Anthea and whichever of Thomas (Sr.) and Mary is related to her. If this Shared cM figure is not high enough to prove that Anthea is a direct descendant of Thomas (Sr.) and Mary, then further evidence (following the bisection strategy above) will be required to determine whether the relationship is on Thomas (Sr.)'s side or Mary's side.
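The two-level averaging for the five family groups can be sketched as follows. The Shared cM values are invented purely for illustration, and the helper name `double_mean` is hypothetical; only the family structure comes from the example above.

```python
def double_mean(values):
    """Average the figures, then double to step back one generation."""
    return 2 * sum(values) / len(values)

# Invented Shared cM with Anthea for each sampled grandchild,
# grouped by deceased parent.
families = {
    "Beatrice": [40.0, 55.0, 25.0],
    "Thomas (Jr.)": [30.0, 0.0, 60.0, 45.0, 15.0],
    "Martin": [35.0],
    "Catherine": [50.0],
    "Anne": [20.0, 40.0],
}

# One estimate per child of Thomas (Sr.) and Mary, however many
# grandchildren were sampled in that family group.
per_child = {name: double_mean(cms) for name, cms in families.items()}

# Average the five estimates and double again to reach the ancestor.
ancestor_estimate = double_mean(list(per_child.values()))

print(per_child)          # Beatrice 80.0, Thomas (Jr.) 60.0, Martin 70.0, ...
print(ancestor_estimate)  # 148.0
```

Averaging within each family group first prevents the five samples from Thomas (Jr.)'s children from outweighing the single samples from Martin's and Catherine's families.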
In the same way as we can now combine Shared cM data for many descendants of a long-dead ancestor to estimate that long-dead ancestor's relationship to a living person, we could potentially combine the observed genome or SNPs of many descendants of the long-dead ancestor to estimate the genome of that ancestor. Blaine Bettinger takes this idea much further in his blog post Genetic Genealogy in 2050 (or Maybe 2015?).
If there is a choice between equally related individuals, always choose someone who might share X-DNA with someone whose DNA has already been sampled over someone who cannot. Within those who might share X-DNA, choose a male, who has only one X chromosome.
For example, I have two paternal first cousins, my uncle's daughter and my aunt's son. The uncle and aunt are both long deceased, so not available for sampling. My uncle's daughter has two X chromosomes: one on her mother's side which is of no interest to me; the other on her father's side, which comes from my paternal grandmother's parents (McNamara and Clancy). My aunt's son has one X chromosome, which comes from my paternal (his maternal) grandparents (Waldron and McNamara). If I was interested in investigating a relationship on the Clancy side, then I should get a sample from my uncle's daughter. If I was interested in investigating a relationship on the Waldron side, then I should get a sample from my aunt's son. Otherwise, my aunt's son's DNA will be more useful, as his mitochondrial DNA traces back to the mother of the four Galvin sisters and their Kelly half-sister, who founded an enormous dynasty of direct female line descendants.
On the other side, I have nine living maternal first cousins, all uncle's children, five male and four female. The males get their X-DNA from their mothers, who are not related to me. The females are my X cousins, so their samples are of more interest to me. One of them is an only child, so is the obvious candidate to sample first.
Ultimately, X-DNA can show whether the search for the common ancestor can be narrowed down to the sets of people from whom each individual could have inherited X-DNA. If the two autosomal matches are also X-DNA matches and either of them is male, then a relationship on that male's paternal side (and, for either match, on any line involving two males in consecutive generations) can immediately be ruled out.
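The sets of ancestors from whom a person can inherit X-DNA follow from two rules: a male gets his single X chromosome from his mother only, while a female gets one X chromosome from each parent. A minimal sketch of those rules follows, in which the function name x_paths and the path notation (F for father, M for mother, read from the tested person backwards) are my own illustrative choices.

```python
def x_paths(sex, generations):
    """List the ancestral lines, `generations` steps back, along which a
    person of the given sex ("male" or "female") can inherit X-DNA.

    Each line is a string of 'F' (father) and 'M' (mother) steps,
    starting from the tested person.
    """
    if generations == 0:
        return [""]
    paths = []
    if sex == "female":
        # Only a daughter receives an X chromosome from her father.
        paths += ["F" + rest for rest in x_paths("male", generations - 1)]
    # Both sons and daughters receive an X chromosome from their mother.
    paths += ["M" + rest for rest in x_paths("female", generations - 1)]
    return paths


# A male's X-DNA can come only from his mother's parents.
print(x_paths("male", 2))    # ['MF', 'MM']
# A female's can also come from her father's mother.
print(x_paths("female", 2))  # ['FM', 'MF', 'MM']
```

No line ever contains two consecutive F steps, which is why a line involving two males in consecutive generations can be ruled out for an X-DNA match.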
Using X-DNA to further one's analysis involves no extra cost; using Y-DNA or mtDNA will require an extra payment. Once that extra payment has been made, it is relatively straightforward to use Y-DNA or mitochondrial DNA to investigate whether the relationship being investigated might be on the direct male line or on the direct female line.
I eventually decided to follow the bisection strategy outlined above.
On 23 February 2014, I convinced two first cousins, my father's sister's son Antoin and my mother's brother's daughter Mary, to allow me to order Family Finder for them. As they live outside the USA, there was really no choice about which DNA company to use: ancestry.com still refused at that time to take orders from anywhere outside the USA and 23andMe imposes exorbitant shipping charges which are far better spent on having an extra person's DNA analysed by FamilyTreeDNA.
If you already have a FamilyTreeDNA account and want to order a kit for another person:
Neither of my first cousins was interested enough to want their own e-mail address or telephone number used! So I immediately received their kit numbers and passwords. What next?
While waiting for DNA results to arrive from FamilyTreeDNA, the following steps need to be taken. You will want to make an immediate impression on your DNA matches, so it is too late to do this after the results arrive. You can keep an eye on your Order History page to see when you should expect to receive your results. Sometimes the results begin to appear on the website a day or two before e-mail notification arrives.
After the results arrive for each person:
GEDmatch estimates that my parents were not related to each other, so I had no reason to expect that my first cousins Antoin and Mary would be related to each other. However, GEDmatch reports that they share numerous half-identical regions, the longest in cM being 4.3 cM on Chromosome 12 and the longest in SNPs being 958 on Chromosome 3. In fact, each of us is half-identical to the other two on a region of 4.28 cM and 584 SNPs comprising most of the aforementioned region on Chromosome 12. Standard triangulation arguments would deduce that all three of us have a common ancestral couple, and thus that my parents were indeed related.
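One simple way to check whether three pairwise half-identical regions share a common stretch is to intersect their start and end positions. The sketch below shows the idea only: the positions are invented, not the real Chromosome 12 coordinates, and the function name common_region is hypothetical. An overlap in positions is a necessary condition for triangulation, not proof of a common ancestor.

```python
def common_region(segments):
    """Return the (start, end) stretch shared by all the given
    (chromosome, start, end) segments, or None if there is none."""
    chromosomes = {chromosome for chromosome, _, _ in segments}
    if len(chromosomes) != 1:
        return None  # segments on different chromosomes cannot overlap
    start = max(seg_start for _, seg_start, _ in segments)
    end = min(seg_end for _, _, seg_end in segments)
    return (start, end) if start < end else None


# Hypothetical pairwise half-identical regions (illustrative positions only).
pairwise = {
    ("me", "Antoin"): (12, 1_000_000, 5_000_000),
    ("me", "Mary"): (12, 1_500_000, 5_500_000),
    ("Antoin", "Mary"): (12, 1_200_000, 4_800_000),
}

triangulated = common_region(list(pairwise.values()))
print(triangulated)
```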
As mentioned above, my first known relative to give a DNA sample was my fifth cousin Cindy. We don't share an awful lot of autosomal DNA. GEDmatch.com, with its then default settings, reported these four half-identical regions:
Chr | Start Location | End Location | Centimorgans (cM) | SNPs |
5 | 87,936,919 | 92,507,944 | 3.5 | 731 |
8 | 6,933,286 | 10,377,521 | 4.6 | 875 |
10 | 61,608,258 | 65,937,864 | 3.2 | 876 |
14 | 57,185,249 | 62,008,482 | 3.8 | 1,177 |
The first thought that crossed my mind was to search for other people who may have inherited any such identical segments from the same Keas or O'Halloran ancestors.
I decided to start with the longest half-identical region on the centiMorgan scale (the one on chromosome 8) and the longest half-identical region on the SNP scale (the one on chromosome 14), in the hope of finding someone half-identical to both of us in both regions. I opened GEDmatch.com in four browser tabs and started the Find people who match with you on a specified segment process in each tab - two for Cindy and two for me; two for the region on chromosome 8, two for the region on chromosome 14. (GEDmatch.com has since removed this process as it placed too heavy a load on its servers.) The maximum number of hits that can be returned by each process is 301, and three of the four ran into this limit.
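The choice of starting regions can be reproduced from the table above. The following sketch uses the four regions exactly as reported there; the Segment type and variable names are my own and are not part of any GEDmatch tool.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    chromosome: int
    start: int
    end: int
    cm: float
    snps: int


# The four half-identical regions shared with Cindy, from the table above.
segments = [
    Segment(5, 87_936_919, 92_507_944, 3.5, 731),
    Segment(8, 6_933_286, 10_377_521, 4.6, 875),
    Segment(10, 61_608_258, 65_937_864, 3.2, 876),
    Segment(14, 57_185_249, 62_008_482, 3.8, 1_177),
]

# The two scales can disagree, so rank the regions on each separately.
longest_by_cm = max(segments, key=lambda s: s.cm)
longest_by_snps = max(segments, key=lambda s: s.snps)

print(f"Longest in cM: chromosome {longest_by_cm.chromosome}")
print(f"Longest in SNPs: chromosome {longest_by_snps.chromosome}")
```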
This procedure eventually identified:
Which, if any, of these should I approach about comparing paper trails?
The two kits with the same e-mail address seemed the most promising, as the chance of being half-identical by chance would be greatly reduced if they were only half-related. However, the total of segments > 3 cM shared by these two kits is 2,799.2 cM which suggests that they are siblings. The GEDmatch expanded graphic confirms that they are full-identical in the region of interest, so the second kit adds nothing to the information provided by the first kit.
The vast difference between the numbers who match with us on regions of 875 and 1,177 SNPs suggests that many half-identical regions of 875 SNPs or fewer must be merely half-identical by chance.
I have slowly evolved this regular manual procedure. I really must automate it, but full automation will be prevented by GEDmatch.com's login policy.
Two people who have found a common ancestral couple and half-identical regions of DNA will naturally want to progress their research further, both in traditional genealogy and in genetic genealogy. This section will address how they can make progress in genetic genealogy. The following steps are suggested:
Run 'People who match one or both of 2 kits' for each pair of descendants of the common ancestral couple.
Dear Charles
I found your e-mail address at FamilyTreeDNA.com (where I am Patrick Joseph Martin Waldron)/GEDmatch.com (where my kit number is F310654) and I believe that we may be related.
Our autosomal DNA is half-identical on the following region(s):
Chromosome | Start | End | centiMorgans | SNPs |
17 | 9,185,149 | 12,566,303 | 11.5 | 1,068 |
Furthermore, we are part of a group of three people, also including ..., who are each half-identical to the others on the ... of these regions.
I have uploaded a copy of my family tree database (as it was on 24 Feb 2014, minus the details of living people) to
http://pwaldron.info/tng/
You can Register for a User Account at
http://pwaldron.info/tng/newacctform.php
Furthermore, ... and I are known relatives (9th cousins twice removed). Our most recent common ancestral couple are Richard Blackall and his wife (whose maiden name was also Blackall). Richard came to Ireland from England during the reign of Charles I of England (1625/1649) and settled in county Limerick.
... and I have another possible but unconfirmed relationship through Robert Blakeney (d.1658/60) and his wife Susannah Ormsby (d.1659) of Castle Blakeney, county Galway; if that is confirmed, then it would also make us ninth cousins twice removed.
Once I have approved your registration on my TNG website, you will be able to see exactly how ...'s GGGGGuncle and I are related at
http://pwaldron.info/tng/relationship.php?secondpersonID=I1&primarypersonID=I11731
and how ...'s GGGGGuncle and my possible GGGGuncle are related at
http://pwaldron.info/tng/relationship.php?secondpersonID=I33441&primarypersonID=I11731
from where you can navigate around the tree and look for a possible link to your own known ancestry.
In ...'s version of the family tree, he is at
http://trees.ancestry.com/tree/6240035/person/-1324643958
the son of our common ancestral couple is at
http://trees.ancestry.com/tree/6240035/person/7017465396
and the grandson of our other possible common ancestral couple is at
http://trees.ancestry.com/tree/6240035/person/24046251078
I have looked at your GEDCOM file at FamilyTreeDNA.com/GEDmatch.com. I have (not) found any common ancestor.
/Where can I see your family tree so that I can search for our possible common ancestral couple?
I have four known relatives whose DNA is, or shortly will be, available for comparison:
What known relatives do you have whose DNA is available for comparison? If you don't have any yet, I recommend that you encourage at least one paternal relative (if still living, a paternal grandparent, greatuncle or greataunt, father or paternal cousin) and at least one maternal relative (if still living, a maternal grandparent, greatuncle or greataunt, mother or maternal cousin) to submit a DNA sample for analysis.
If you have not already copied your DNA data to GEDmatch.com, I recommend that you do so. I also recommend the Autosomal DNA Segment Analyzer at
http://www.dnagedcom.com/adsa/
and my own account of my experiences with autosomal DNA at
http://pwaldron.info/newdna.html
Yours sincerely
Paddy Waldron
Dear John
You show up as a DNA match to your fourth cousin Paddy Waldron, whose AncestryDNA kit I manage, as I am his cousin on the other side of his family.
He also has his DNA at FamilyTreeDNA.com and GEDmatch.com and would love to hear from you directly by e-mail at ...
He has a vast amount of information on your common McNamara ancestors which he will be delighted to share.
Dear John
Thank you very much for your e-mail. I am very anxious to look more closely at our shared DNA segments and see what we can learn from them about our ancestors and other common relatives, but first let's share what we know of our common ancestors.
You might enjoy the account of how I became aware of my relationship to the Kunzmann family through the Talty Millions case, at
http://pwaldron.info/oks/
I have uploaded a copy of my family tree database (as it was on 18 Oct 2015, minus the details of living people) to
http://pwaldron.info/tng/
You can Register for a User Account at
http://pwaldron.info/tng/newacctform.php
Once I have approved your registration on my TNG website, you will be able to see exactly how we are related at
http://pwaldron.info/tng/relationship.php?secondpersonID=I1&primarypersonID=I61141
from where you can navigate around the tree.
I have looked at your own family tree at
http://trees.ancestry.com/tree/79437082/family?cfpid=34401375965
We have two other known mutual relatives whose DNA is available for comparison at GEDmatch.com and FamilyTreeDNA.com but not at Ancestry.com
What other known relatives do you have whose DNA is available for comparison? If you don't have any yet, I recommend that you encourage at least one paternal relative (if still living, a paternal grandparent, greatuncle or greataunt, father or paternal cousin) and at least one maternal relative (if still living, a maternal grandparent, greatuncle or greataunt, mother or maternal cousin) to submit a DNA sample for analysis.
If you have not already copied your DNA data to GEDmatch.com, I strongly recommend that you do so. I also recommend my own account of my experiences with autosomal DNA at
http://pwaldron.info/newdna.html
I will do my best to answer any other questions that you may have about our common ancestors or about DNA.
Yours sincerely
Paddy Waldron
Dear Nancy
Thank you for your e-mail about our possible DNA match.
You neglected to mention in your e-mail which of the many DNA kits associated with this e-mail address on several different websites you are writing about.
First, we need to know each other's GEDmatch.com kit numbers.
I used the GEDmatch User Lookup tool to search for your kit number(s), but it does not recognise your e-mail address. What is your GEDmatch.com kit number?
You can use the same tool if you need a reminder of the long list of kit numbers associated with this e-mail address. Which of them do you match?
Second, we need to see each other's online pedigree charts.
My own is at
http://pwaldron.info/tng/pedigreetext.php?personID=I1&generations=5
In order to see it, you will have to first register at
http://pwaldron.info/tng/newacctform.php
and then wait for an e-mail confirming that I have approved your registration.
Where is your online pedigree chart?
Finally, you may find the answers to other questions that you might want to ask me about DNA testing on my website, starting at
http://pwaldron.info/DNA/nextstep.html
Best wishes
Paddy Waldron
For my further thoughts on this subject, see Measuring the length, the rarity and the relevance of pieces of shared autosomal DNA.
My own online family tree is at http://pwaldron.info/tng/index.php but to see it you will have to Register for a New TNG User Account.
Comments about this page can be left on the Facebook posts where I originally announced Chapter 1 and Chapter 2.