A Sceptic's Adventures in Genetic Genealogy

Chapter 1:

Thoughts and questions on DNA sampling, testing and marketing

by Paddy Waldron

Last updated: 11 Dec 2013

Background

Having been increasingly addicted to genealogy from the age of 12 or earlier and having a degree in mathematical sciences with a particular interest in probability and statistics, it was inevitable that I would develop an interest in DNA and in genetic genealogy.

I have attended various lectures on these subjects over the past several years, and have read lots of explanations, often ending up more confused rather than less confused after an effort to improve my understanding. I have still not found the inspirational book or inspirational teacher that suddenly fits everything into place within the context of my prior knowledge, such as happened with probability and statistics when I took Adrian Raftery's course (251) as a third year undergraduate at Trinity College Dublin back in 1983/4. (In the genetic genealogy field, my brief exposure to lectures by Maurice Gleeson and Dan Bradley has, however, helped a lot.)

On the third day of the joint Back To Our Past (BTOP) and Genetic Genealogy Ireland 2013 shows at the Royal Dublin Society (20 Oct 2013), Kathy Borges of the International Society of Genetic Genealogy (ISOGG) persuaded me to submit my DNA to FamilyTreeDNA and to purchase Y-DNA and autosomal DNA products. Notification arrived by e-mail that my autosomal DNA results were available on 16 Nov 2013 and that my Y-DNA results were available on 21 Nov 2013.

I should probably try to weave my thoughts and the answers to my questions into the ISOGG Wiki, but for now I have more questions than answers and it is much quicker and easier to post them all together here in these new web pages on my own personal web site documenting my adventures as a sceptic in genetic genealogy. I hope that this first chapter will help to dispel some myths, in particular about the need for a little jargon; will help customers and FamilyTreeDNA respectively to get more out of the FamilyTreeDNA.com website and to fix some of its shortcomings; and will get me some feedback about interpreting my own autosomal DNA results, or lack thereof. First, some definitions will help.

Definitions

If you are reading this page, you hopefully have some basic understanding of DNA and of genetic genealogy. For those who don't, I had better begin by outlining some basics.

DNA is material contained within human cells and inherited by children from their parents. Genetic genealogy is the use of DNA to assist genealogical research. For the purposes of genetic genealogy, DNA is represented by long strings of the letters A, C, G and T. These long strings are divided naturally into shorter strings called chromosomes.

There are four main types of DNA, each with a very different inheritance path:

Y-DNA
Males have one Y chromosome containing Y-DNA and one X chromosome containing X-DNA. Females have two X chromosomes, but do not have a Y chromosome. Y-DNA is inherited patrilineally by sons from their fathers, and so on, "back to Adam". (Most geneticists are not creationists, but the concept of "Adam" is still useful and used!) Some people are actually confused by the simple concept that Y-DNA follows the male line, and even by the simpler concept that in most cultures the surname follows the same male line. If you belong to (or join) the relevant facebook groups, you can read about examples of this confusion in discussions in the County Clare Ireland Genealogy group and The Waldron Clan Association group.
X-DNA
Every male inherits his single X chromosome from his mother. Every female inherits two X chromosomes, one each from her father and from her mother.
mtDNA
Mitochondrial DNA (or mtDNA for short) is the matrilineal counterpart of Y-DNA: it is inherited by both sons and daughters from their mothers, and so on, "back to Eve".
Autosomal DNA
Autosomal DNA (or atDNA for short) is inherited by everyone in 22 pairs of chromosomes. One chromosome in each pair comes from the father and the other from the mother. These chromosomes are further broken down, because of the random process of recombination, into many smaller segments represented by shorter strings of A, C, G and T (fewer than 1000 segments of genealogical value per individual). The segments in the paternal chromosomes come roughly equally from both paternal grandparents, and those in the maternal chromosomes likewise come roughly equally from both maternal grandparents. Segments come ultimately from all ancestors in recent generations, but those large enough to be of genealogical value can be traced back to a vanishingly small proportion of the exponentially increasing number of ancestors in earlier generations.
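The arithmetic behind that last sentence can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this page, the count is of positions in the pedigree rather than distinct people (cousin marriages make the number of distinct ancestors smaller), and the "average share" is a simple average, not the share actually inherited from any particular ancestor.

```python
def ancestor_slots(generations_back: int) -> int:
    """Number of positions in the pedigree at a given generation (parents = 1)."""
    if generations_back < 0:
        raise ValueError("generations_back must be zero or positive")
    return 2 ** generations_back


def average_autosomal_share(generations_back: int) -> float:
    """Average fraction of autosomal DNA per pedigree position in that generation."""
    return 1 / ancestor_slots(generations_back)


if __name__ == "__main__":
    # Print the number of pedigree positions and the average share per position.
    for g in (1, 2, 5, 10, 20):
        print(g, ancestor_slots(g), f"{average_autosomal_share(g):.6%}")
```

Ten generations back there are already 1024 positions in the pedigree, which is more than the figure of fewer than 1000 segments of genealogical value quoted above, so from that point on some ancestors cannot be represented by any such segment.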

The word test is used with many different meanings in many different fields. To a scientist or a medic, it may be a deterministic test with a definite positive or negative outcome. To a statistician, it is a hypothesis test which can accept or reject (but not prove or disprove) a hypothesis based on the observed outcome of one or more random experiments. The word is used loosely by genetic genealogists with other meanings, but I will try to stick to the rigorous statistical meaning.

To a statistician, a sample is the set of data collected from the random experiments on the basis of which a hypothesis is tested. So a DNA sample comprises the strings of letters returned by a DNA company from the cells collected from its customers. The relevant random experiment is not the collection of cells (which is deterministic) but the act of reproduction in which the random processes of mutation and recombination produce the child's DNA from the parents'.

The various competing DNA companies market various products which comprise both raw DNA samples and interpretation of both the genealogical and medical implications of those samples.

Dispelling myths

Genetic genealogy has been very poorly explained to the public.

The results of DNA testing are frequently combined with the history and mythology of human migration. The connection between genetics and the history of human migration is generally extremely poorly explained. Is it based on DNA extracted from prehistoric human remains, on other evidence from excavation of prehistoric settlements, or on pure guesswork based on the geographical spread of DNA in today's living people?

DNA tests CAN provide estimates of the probability that an individual living in place A and an individual living in place B had a common ancestor, either at any time, or within a specified number of generations. DNA from living people on its own CANNOT provide any information as to whether any such common ancestor lived in place A, lived in place B or lived in some other place C, or moved between places A, B and C.

Consider the extreme example of a family of two brothers, one of whom continued to live in his birthplace and fathered 10 daughters and no sons, the other of whom emigrated and fathered 10 sons. Their shared Y-DNA (passed from father to son) disappeared in one generation from their birthplace, but increased and multiplied in the emigrant's destination. The present location of the Y-DNA is therefore far away from the location where the common ancestor lived. (The initial brothers could of course have had male line cousins who passed on the same Y-DNA, perhaps in yet another different location.)

The units in which DNA testing (Y-DNA testing in particular) measures the genetic distance between two individuals are numbers of mutations or recombinations, i.e. rare (small probability) differences in DNA between a child and the parent from whom the child inherits the DNA. By studying the frequency distribution of mutations per reproduction or recombinations per reproduction (for autosomal DNA), we can begin to understand the significance of this genetic distance. With some knowledge of the number of reproductions per generation (i.e. the average number of children fathered by each male) and its variation over centuries and millennia, ESTIMATES of the average number of mutations per generation or recombinations per generation can be derived. These can then be used to provide further ESTIMATES of the number of generations between the two individuals. By studying the frequency distribution of the age of parents at reproduction (i.e. years per generation) and its variation over centuries and millennia, estimated numbers of years for variables like the time to the most recent common ancestor can be derived. As stated by Dan Bradley of Trinity College Dublin at BTOP, the error bars for such time estimates are typically of the order of +/-50% of the point estimate. (I presume that "error bar" is geneticists' jargon for what statisticians' jargon calls "confidence interval".)
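The chain of estimates described above can be sketched in a few lines of Python. This is a crude moment estimate, not the method used by FamilyTreeDNA or any other company: it assumes that the two men are equally distant from their common ancestor, that each of the markers compared mutates independently with the same small probability per father-to-son transmission, and that no mutation is reversed or repeated. The mutation rate, marker count and years per generation in the example are placeholder values chosen purely for illustration; only the +/-50% band comes from the text above.

```python
from typing import Tuple


def estimate_generations(genetic_distance: int, markers: int,
                         mutation_rate: float) -> float:
    """Point estimate of generations back to the most recent common ancestor.

    There are two lines of descent from the common ancestor, so each
    generation contributes two transmissions of every marker.
    """
    if genetic_distance < 0 or markers <= 0 or mutation_rate <= 0:
        raise ValueError("distance must be >= 0; markers and rate must be > 0")
    transmissions_per_generation = 2
    expected_mutations_per_generation = (
        transmissions_per_generation * markers * mutation_rate
    )
    return genetic_distance / expected_mutations_per_generation


def estimate_years(generations: float, years_per_generation: float) -> float:
    """Convert generations to years using an assumed average parental age."""
    return generations * years_per_generation


def error_band(point_estimate: float,
               relative_error: float = 0.5) -> Tuple[float, float]:
    """Lower and upper limits for a band of +/- relative_error around the estimate."""
    return (point_estimate * (1 - relative_error),
            point_estimate * (1 + relative_error))


if __name__ == "__main__":
    # Placeholder inputs: 2 differences over 37 markers, assumed rate 0.002,
    # assumed 30 years per generation.
    generations = estimate_generations(2, 37, 0.002)
    years = estimate_years(generations, 30)
    print(generations, years, error_band(years))
```

Note that a genetic distance of zero gives an estimate of zero generations, which shows how little weight a point estimate of this kind can bear without its error band.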

Genetics is a branch of applied probability and statistics in exactly the same way as insurance, gambling, investment, lots of sports, medicine and many other aspects of everyday life are. The highly educated population of the 21st century are well capable of understanding it if it is explained clearly in this context. Indeed, as Kelly Wheaton says, "a statistics course is more important than a genetics one for genetic genealogists".

Genetic genealogy is a branch of genealogy which likewise has its place alongside traditional genealogical methods. Statistics prove nothing and likewise genetic genealogy alone proves nothing. Both, however, can be of great help in telling researchers where to look for the desired proof, and in disproving wrong hypotheses.

Jargon busting

Some DNA testing companies (ancestry.com in particular) have employed marketing people to sell their products by promising not to use jargon. In other words, they admit that they want to sell only to people who don't know what they are buying. Consumer protection authorities should look into this: the better regulated financial sector would never be allowed to get away with it! See Roberta Estes's blog for a further critique of the ancestry.com product.

Any new science requires a new vocabulary to explain it. However, an attempt to reconcile the geneticists' vocabulary, the genealogists' vocabulary and the statisticians' vocabulary is urgently required. Scientists and marketers should agree on the vocabulary, minimise the number of different synonyms used for each concept, avoid mentioning concepts which are not directly relevant to their audience, and define all new words clearly and precisely, with whatever diagrams and mathematical models are necessary to help the understanding of those who prefer verbal, spatial or quantitative approaches respectively. The problem is epitomised both by the looseness of FTDNA's glossary and by AncestryDNA's refusal to even use what it terms "jargon" to make its statements intelligible to multiple audiences.

For example, at familytreedna.com, the words "block" and "segment" appear to mean exactly the same thing and to be used interchangeably on the same page, unnecessarily confusing the company's customers. (If there is a subtle difference that I have missed, please let me know.)

As genetics is a branch of applied probability and statistics, it cannot be explained clearly without using the basic vocabulary of those subjects, i.e. words and phrases like probability, estimate, confidence interval and hypothesis test. Beware of anyone who tries to persuade you otherwise.

The FamilyTreeDNA.com website

Like any sophisticated and rapidly developing website, FamilyTreeDNA.com is bound to take some getting used to.

It appears that every visit has to start with a login screen even if one ticks the apparently useless 'Remember me' checkbox. One must also remember to click the small dark "LOG IN" button towards the middle left, not the larger and brighter "Login" button at the top right, which merely reloads the login screen. There are regular annoying pop-ups saying things like: 'You have been idle for 120 minutes. Your session may have timed out. The page will be reloaded and you may need to log in again.' or 'Your session will expire on Sun Nov 17 2013 13:19:52 GMT+0000 (GMT Standard Time). You have 5 minutes remaining until your session times out. Click OK to keep this session.' If facebook.com can keep its billions of users permanently logged in, there is no excuse for any smaller website such as FamilyTreeDNA.com not to provide this option. At least the timeout was increased from 30 minutes to 120 minutes soon after I started to use the website.

My initial autosomal DNA results are presented in the form of 36 pages of matches, with 10 matches per page, sortable by 3 fields. I have yet to find a formal statistical definition of the word "match".

The nearest to a definition that I can find in the FAQs is:

The Family Finder program has calculated all of your matches to be your relatives within the relationship range. Family Tree DNA uses stringent standards for the relationship range and for the degree of relatedness. Thus, only those determined with high confidence to be your actual genetic relatives are included.

Where are the "stringent standards" published? How high is "high"?

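Every statistical inference is subject to two types of error. For no particular reason, they are known as Type I errors (rejecting a hypothesis that is in fact true) and Type II errors (accepting a hypothesis that is in fact false).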
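Applied to matches, and assuming that the hypothesis being tested is "these two customers are not related within the relationship range", a Type I error is a reported match between people who are not in fact related, and a Type II error is a real relative who is not reported as a match. The following Python sketch shows how the two error rates would be counted if the truth were known for a set of comparisons. The function name and the input layout are invented for this illustration; nothing here describes how FamilyTreeDNA actually sets or measures its "stringent standards".

```python
from typing import Iterable, Tuple


def error_rates(outcomes: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (Type I rate, Type II rate).

    Each outcome is a pair (declared_match, truly_related).
    Type I rate: share of truly unrelated pairs that were declared matches.
    Type II rate: share of truly related pairs that were not declared matches.
    """
    false_matches = missed_relatives = unrelated = related = 0
    for declared_match, truly_related in outcomes:
        if truly_related:
            related += 1
            if not declared_match:
                missed_relatives += 1
        else:
            unrelated += 1
            if declared_match:
                false_matches += 1
    type_i = false_matches / unrelated if unrelated else float("nan")
    type_ii = missed_relatives / related if related else float("nan")
    return type_i, type_ii
```

Tightening the standard for declaring a match lowers the first rate at the cost of raising the second, which is why a claim of "high confidence" means little unless both rates are published.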

Back to the website layout:

There seems to be no means of viewing all 350+ matches in the same web browser window. I can, however, see all 350+ matches, sortable by all fields, in a single Microsoft Excel window by clicking the Excel button at the bottom right of the browser window. This causes Mozilla Firefox to offer to open an XML Document in XML Editor. I am not familiar with either of these, but clicking OK then opens a normal Excel window. The file downloaded is not a properly formatted Excel file and is probably just a CSV file: column widths are not set to match the content; panes are not frozen; autofilter is not turned on; dates are not in my preferred Microsoft Windows date format; e-mail addresses are not hyperlinked; long lists of surnames and placenames are not set to wrap in a readable manner; etc. As I will be re-downloading this file, I had to record this macro in order to make it usable in Excel 2010. Hopefully the macro will be of use to other FamilyTreeDNA customers. If you know how to use a macro in Excel, hopefully you know how to copy and paste someone else's macro into the appropriate place (and how to back up your macros, which Excel stupidly insists on storing in the Program Files directory hierarchy).
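For readers who would rather not use an Excel macro, part of the same tidying can be sketched in Python. This is not a translation of my macro, only a minimal sketch under stated assumptions: that the download has been saved as a plain CSV file with a header row, and that the date column is headed "Match Date" (the real header may differ, so check it first). The function names are invented for this illustration.

```python
import csv
import io
from datetime import datetime
from typing import TextIO

DATE_COLUMN = "Match Date"  # assumed header; adjust to the actual download


def to_iso_date(us_date: str) -> str:
    """Convert an all-numeric mm/dd/yyyy date to unambiguous yyyy-mm-dd."""
    return datetime.strptime(us_date.strip(), "%m/%d/%Y").date().isoformat()


def tidy_csv(source: TextIO, destination: TextIO) -> None:
    """Copy a CSV file, rewriting the date column in yyyy-mm-dd format."""
    reader = csv.DictReader(source)
    fieldnames = reader.fieldnames or []
    writer = csv.DictWriter(destination, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        if row.get(DATE_COLUMN):
            row[DATE_COLUMN] = to_iso_date(row[DATE_COLUMN])
        writer.writerow(row)


if __name__ == "__main__":
    # Made-up two-line file standing in for a real download.
    sample = io.StringIO("Full Name,Match Date\nA. N. Other,11/16/2013\n")
    result = io.StringIO()
    tidy_csv(sample, result)
    print(result.getvalue())
```

With real files, open the source and destination with open(..., newline="") as the csv module documentation recommends, and pass the two file objects to tidy_csv.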

For each match, I can see the following fields either in the web browser window or in the Excel window or in both windows or somewhere else:

Longest block length:
This is the field by which results are sorted in Excel (Column F). It is not immediately visible in the web browser, but can be revealed by expanding a tiny, almost invisible, dropdown menu under any match's mugshot. I initially thought it was seven clicks away: click Family Finder, Chromosome Browser, Filter Matches By ..., Name, [type name, don't hit <Enter> key], Find, checkbox, View this data in a table, scan the centiMorgans column for the largest value. See my separate chapter on the Chromosome Browser for more details and some examples.
Shared segments:
The number of shared segments is not visible in either the web browser window or the Excel window, but appears after the sixth click in the above alternative path to the longest block.
Full Name:
The full name may include a title (Mr. or Mrs.) in the web browser, but not in Excel (Column A). In the case of Mrs., the Full Name does not include the maiden surname, the only one of interest to the genealogist. Sortable by first name in Excel, but not sortable by surname! One of my matches (Mr. Robert M. Elliott) appears to have entered the "Mr." as part of his first names as it does appear in Excel.
Mugshot:
On a randomly selected page, 2 of the 10 matches have uploaded mugshots. Understandably not visible in Excel.
E-mail address:
A hyperlinked icon in the web browser; not hyperlinked in Excel (Column H). It took a lot of googling to find this page which taught me how to add appropriate hyperlinks using my Excel macro!
Note icon:
Allows notes to be added on the website. These are then included in Column L of the next Excel download.
Family tree icon:
Green indicates that the match has uploaded a GEDCOM. Grey (red on mouseover) indicates that the match has not uploaded a GEDCOM. This is not part of the Excel download, where it would be extremely useful. Why is mouseover required to show the red? A surprisingly and disappointingly small proportion of users have uploaded a GEDCOM - as few as one out of ten on a randomly selected page of matches.
Run Triangulate icon:
Allows filtering on "In Common" matches (which can then be downloaded into a smaller spreadsheet; for some bizarre reason this smaller spreadsheet does not include any information about the match with whom those included are in common!).
Match Date:
One of the three fields on which the web results can be sorted. In North American all-numeric mm/dd/yyyy format on the web page. All genealogists know that the use of ambiguous all-numeric dates is a mortal sin, guaranteed to lead to the hell of confused months and days for events in the first 12 days of any month. My macro, inter alia, converts the Excel date (Column B) to my own preferred yyyy-mm-dd format.
Relationship Range:
This is the primary field by which results are sorted by default in the web browser, but not in the Excel download (Column C). Where is the algorithm explaining how it is calculated from the numeric results?
Suggested Relationship:
Column D in the Excel spreadsheet, but not shown in web browser. Does "Suggested" mean what statisticians call "Estimated"? If so, then is it a maximum likelihood estimate? I can't think of any other estimation method for a discrete parameter.
Known Relationship:
User-entered field, which must then be confirmed by the other Match. There is a limited dropdown menu of possibilities with the vague "Distant Cousin" hidden in the middle to cover any omitted relationships. "Distant Cousin" is NOT a "Known Relationship"! There should be submenus or numeric fields to allow any degree of cousin and any degree of remove to be selected. What will I do if I find one of my 3rd Cousins 3R? If and when I find and enter known relatives, this field will be downloaded as Column G in the Excel spreadsheet.
Shared cM:
Presumably the sum in centiMorgans of block lengths for all shared segments longer than some unspecified minimum length, probably 1 cM. Also represented by a graphical icon, which wastes a lot of valuable real estate in the web browser window, forcing other variables into the hidden dropdown underneath. For distant relatives, the graphical icons are so similar that they are much less informative than the numerical representation (in which trailing zeroes are properly included here). Column E in the Excel spreadsheet.
Ancestral Surnames:
This appears to be a free-format text field, combining surnames, placenames and the punctuation marks chosen by the customer entering the text. The entry of ancestral surnames appears to be completely independent of the uploading of a GEDCOM. Entries are delimited by " / ". The order of entries is unclear: should it be alphabetical, or ahnentafel order, or perhaps whatever random order the user entered the names in? The first few words are visible in the web browser window, and the remainder are revealed by mouseover. There appears to be no limit on the number of entries allowed. If known relationships are limited to 6th cousins, then should ancestral surnames not be limited to those of the GGGGGgrandparents, i.e. 128 surnames, not allowing for spelling changes or surnames not conventionally inherited? Even that is too many to be of use in free-form unsortable text. Surnames which I have entered as ancestral surnames and which I share with my matches are apparently bolded. It is too tedious to check these by clicking on all ten matches on all 36 pages. In Excel 2010 (Column I), to find matches using a particular word in this field, click the auto-filter dropdown in Cell I1, type F then A then the word of interest then <Enter>.
Ancestral Placenames:
Hidden in with ancestral surnames. This should surely be a separate field. It should be (but is not) possible to download matches into a spreadsheet with one line per ancestral surname and/or one line per ancestral placename for further analysis.
Y-DNA Haplogroup:
Shown, if applicable (i.e. for males) and available (i.e. customer has paid for it and order has come to the head of the queue) on the tiny, almost invisible, dropdown menu under each match's mugshot and in Column J of the Excel spreadsheet.
mtDNA Haplogroup:
Shown, if available (i.e. customer has paid for it and order has come to the head of the queue) on the tiny, almost invisible, dropdown menu under each match's mugshot and in Column K of the Excel spreadsheet.
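Two of the shortcomings listed above (no sorting by surname, and no download with one line per ancestral surname or placename) can be worked around outside the website. The Python sketch below is only an illustration: the function names are invented, the surname is guessed as the last word of the Full Name (which fails for names ending in a suffix such as "Jr."), and the only thing taken from the website is the " / " delimiter described above. The names in the example are made up apart from one already quoted on this page.

```python
from typing import Iterable, Iterator, List, Tuple


def split_entries(field: str) -> List[str]:
    """Split a free-format ancestral surnames field on the ' / ' delimiter."""
    return [part.strip() for part in field.split(" / ") if part.strip()]


def one_line_per_entry(
    matches: Iterable[Tuple[str, str]]
) -> Iterator[Tuple[str, str]]:
    """Turn (full_name, surnames_field) pairs into one pair per entry."""
    for full_name, surnames_field in matches:
        for entry in split_entries(surnames_field):
            yield full_name, entry


def surname_sort_key(full_name: str) -> str:
    """Crude sort key: the last word of the name, lower-cased."""
    words = full_name.split()
    return words[-1].lower() if words else ""


if __name__ == "__main__":
    # Made-up matches for illustration only.
    matches = [
        ("Mr. Robert M. Elliott", "Waldron / Clancy (Clare)"),
        ("Ann Byrne", "Byrne / Waldron (Mayo) / "),
    ]
    for name, entry in one_line_per_entry(matches):
        print(name, "|", entry)
    print(sorted((name for name, _ in matches), key=surname_sort_key))
```

The one-line-per-entry output can be pasted back into a spreadsheet and sorted or filtered in the usual way.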

My autosomal results, or lack thereof

What was I expecting to find?

What did I actually find?

Continue to Chapter 2

My own online family tree is at http://pwaldron.info/tng/index.php but to see it you will have to Register for a New TNG User Account.

Comments about this page can be left on facebook.