Thursday 25 May 2017

Convergence - what is it?

There are several phenomena encountered in the analysis of Y-DNA STR data that can throw a genetic spanner in the works, and Convergence is one of them!

In genetic genealogy, Convergence occurs when two men have DNA signatures that are exactly or nearly identical, but have evolved that way purely by chance. As a result, the two men will show up in each other's list of matches and will give the false impression that they may be closely related (e.g. within the last several hundred years) when in fact they are much more distantly related (e.g. within the last several thousand years). The problem is that we cannot tell that Convergence has occurred simply by looking at the two men's present-day STR results. It is hidden from our view. And the danger is that if the two men think they are closely related, they may start chasing their common connection, thinking that they will find the answer via further documentary research, when in fact there is little hope of that at all. Their "close match" is a red herring. And their pursuit of the Common Ancestor is a wild goose chase.

So what can we do about it? How can we recognise it? How can we avoid it wasting our precious research time?

Confusion

The concept is occasionally discussed in Facebook groups or on various blogs, but there tends to be quite a lot of confusion around what it actually means. And there are a variety of quite understandable reasons for this. 

Firstly, there isn't a standard definition for Convergence, so how it is used varies from person to person. Some people apply it only to exact matches, others apply it to exact and close matches. Moreover, the concept of Convergence is closely tied up with the concept of lack of Divergence. They are different phenomena, but their effects and consequences are very similar. Another contributing factor is that Convergence is difficult to see or detect in practice. We know that it exists, but we have no way of identifying it just by comparing two sets of STR results. In other words, it's largely a hidden phenomenon (like Black Holes). It is only when we do SNP testing that the extent of Convergence becomes apparent. And the problem is that not enough people have done SNP testing.

The good news is that more and more people are doing SNP testing and as they do, the extent of Convergence becomes more apparent. The Lineage II members in the Gleason DNA Project are trailblazers in this regard and we will explore the results of the recent Z255 SNP Pack testing in subsequent blog posts.

But in this post, we will look at an example of Convergence from the Gleason DNA Project in order to illustrate some of the key characteristics and consequences of Convergence. In later posts, we will look at clues that may indicate that Convergence is present, attempt to quantify the number of Back Mutations & Parallel Mutations that occur over time (using the Mutation History Tree that we have previously constructed for Lineage II - the North Tipperary Gleesons), and finally we will attempt to quantify Convergence itself.

But first of all, let's look at some of the aspects of the definition of the term.

Definition

A general definition for the term convergence from the Concise Oxford English Dictionary illustrates some general characteristics of convergence that are worth exploring because they are of relevance to how the term is applied in genetic genealogy and to the analysis of Y-DNA STR data in particular:
converge 1. come together from different directions so as eventually to meet

convergent 2. Biology (of unrelated animals and plants) showing a tendency to evolve superficially similar characteristics ...

There are several important aspects to these definitions that we can apply to the analysis of STR data (e.g. your 37 marker data). First of all, the sense that things were initially apart, but then they come together. Secondly, the idea that two things can look the same or similar on the surface, but in fact they have come from very different directions. And thirdly, the idea that two things can evolve from something different into something the same.

Let's look at how this more general concept can be applied to the analysis of Y-STR data.

And a good starting point is the description of Convergence on the ISOGG Wiki:
Convergence (also known as evolutionary convergence) is a term used in genetic genealogy to describe the process whereby two different genetic signatures (usually Y-STR-based haplotypes) have mutated over time to become identical or near identical resulting in an accidental or coincidental match.
One can think of convergence as producing misleading matches – two men appear to be more closely related than they actually are. The same situation may result (very occasionally) if there is an exceptional lack of divergence. In other words, so few mutations occurred in the descendants of a common ancestor over the course of time that the common ancestor may appear to have lived only a few hundred years ago when in fact he lived much further back than that, perhaps several thousand years ago.

So let's pick apart some of the key elements of this definition. You might like to refamiliarise yourself with some basic concepts, such as the different types of DNA markers (STRs and SNPs), and what you are actually seeing when you look at the DNA Results page.

Basic Concepts

Firstly, the above description of Convergence refers to the genetic signature - the Y-STR haplotype. This is the string of numbers you see associated with your results on the DNA Results page of the project. I like to think of it as if the Y-chromosomes of all the men in the group were stacked up on top of each other, in such a way that the individual markers along the chromosome were all aligned, with one column for each marker. Thus in the diagram below, each of the men has a value of 13 for the first marker. The values for the second marker are a mixture of 23 and 24. And so on.

The Y-STR results for the men of Lineage II

Another key point in the above description is the concept that some markers mutate over time e.g. the number changes from 14 to 15. These mutations are identified by comparing the value in each square to the modal value for the entire group (i.e. the most frequent value among the men in that group). The most frequent values for each of the markers are used to generate the "modal haplotype" which is a virtual signature constructed from these most frequent values (and is represented by the row marked "MODE", the 3rd row from the top in the diagram above).

Mutations are indicated by coloured squares. If the value for any marker is the same as the modal value for that marker (i.e. the most common value among the men in that group), then the square that the value is in will not have a colour. If however, the value is higher than the norm, it will be coloured pink; if it is lower than the norm, it will be coloured purple.

If you and someone else have exactly the same string of numbers, you will have the same coloured squares and the same "no-colour" squares. If you are not exactly identical, you will have some coloured squares that the other person does not have ... and vice versa. In other words, the sequence of numbers, and hence colours, will be different. Each coloured square represents a mutation - a small increase or decrease in the number (compared to the norm) for that particular marker, in that particular individual.
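
For readers who like to see the mechanics, here is a minimal Python sketch of how a modal haplotype and the coloured squares could be derived. The four men and their three marker values are made up purely for illustration (they are not project data), and the function names are my own. If two values tie for most frequent, this sketch simply takes the one seen first.

```python
from collections import Counter

# Hypothetical results: three markers for four men (values are invented).
results = {
    "man_1": [13, 24, 14],
    "man_2": [13, 23, 14],
    "man_3": [13, 24, 15],
    "man_4": [13, 24, 13],
}

def modal_haplotype(haplotypes):
    """Return the most frequent value in each marker column."""
    columns = zip(*haplotypes)  # stack the rows and read down each column
    return [Counter(column).most_common(1)[0][0] for column in columns]

def colour_row(row, mode):
    """Label each value the way the results chart colours its squares."""
    labels = []
    for value, modal_value in zip(row, mode):
        if value > modal_value:
            labels.append("pink")    # higher than the norm
        elif value < modal_value:
            labels.append("purple")  # lower than the norm
        else:
            labels.append("none")    # same as the modal value
    return labels

mode = modal_haplotype(list(results.values()))
colours = {name: colour_row(row, mode) for name, row in results.items()}

print("MODE", mode)
for name, labels in colours.items():
    print(name, labels)
```

In this made-up group, only the second and third markers show any coloured squares, because every man has the modal value of 13 on the first marker.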

Convergence in theory

Let's imagine that some distant ancestor living 10,000 years ago gave rise to four distinct lines of descent surviving today (represented by the men A, B, C, and D in the diagram below). Let's look at what happened to their first 37 STR markers over time, and let's assume that mutations only occurred in 5 of these STR markers, as shown in the diagram below. How did the values change over the passage of time, from 10,000 years ago to the present day? And how many of the descendants of this ancestor "match" each other today?

In descendant A, only one of these 5 STR markers mutated. It underwent a single mutation (from 13 to 14) about 6000 years ago, and that was the only mutation over the span of 10,000 years. This is a rather extreme example of "lack of Divergence".

Descendant B had several mutations in his line of descent, but only affecting the first and the fifth markers. These show progressive "forward mutations" away from their original values. With the first marker, the mutations go forward in an upward direction (14,15,16,17) whilst with the fifth marker they go forward in a downward direction (15,14,13,12). This latter may seem counterintuitive but it serves to emphasise that "forward" means "away from" the original value, no matter if it is up numerically or down numerically.

Descendant C also has experienced mutations in only the first and fifth marker. But here we see two examples of a Back Mutation. The first marker shows a forward mutation 6000 years ago (13 becomes 12) but this has gone back to 13 by 4000 years ago. It then undergoes another forward mutation by the time of the present day (13 to 14). Similarly, the fifth marker undergoes a forward mutation (16 to 17) by 4000 years ago but a Back Mutation by 2000 years ago.

Descendant D undergoes mutations on all 5 of his STR markers. A Back Mutation occurs with the second marker between 2000 years ago and the present day (15 to 14); and likewise with the third marker (12 to 13); and likewise with the fifth marker (17 to 16). Two Back Mutations occur with the fourth marker (29 to 30 by 4000 years ago; and 31 to 30 by the present day).

Mutations over time in 4 distinct lines of descendants

Remember, these are four distinct lines of descent, with the MRCA (Most Recent Common Ancestor) represented by the first row of 5 STR markers in the diagram above. So now let's look to see if any of the mutations that occurred in these four individual lines of descent occurred in parallel i.e. the same mutational change occurred in two completely separate lines of descent.

Have a look at the first marker in A, B and C. All three men developed the same mutation on this marker - a change from a value of 13 to 14. In Lines A and B this change occurred in parallel around 6000 years ago. In Line C, the change occurred in parallel around about the present day.

There is a similar parallel mutation between Line C and D. Look at the fifth marker - it increases in value from 16 to 17 around about 6000 years ago in Line D and 4000 years ago in Line C.

And there is a parallel back mutation present in Lines C and D also - the fifth marker switches from 17 to 16 about 2000 years ago in Line C and around about the present day in Line D.

With Back Mutations we are only looking at a single line of descent. With Parallel Mutations we are comparing two or more lines of descent. And we will see that in practice Parallel Mutations are much more common than Back Mutations and have a much greater role to play in the development of Convergence.

The STR results of living people today tell us nothing about their evolutionary history - it is hidden from view

Which brings us to Convergence itself. Let's look at the Genetic Distance between each of these lines of descent. This helps to make the point that the DNA results from living people are only a snapshot in time. They do not tell us anything about how those STR values have evolved over the past 10,000 years:
  • A and B have a Genetic Distance (GD) of 7. This is made up of a 3-step difference on the first marker (14 vs 17) and a 4-step difference on the fifth marker (16 vs 12). And as these were the only changes on their first 37 markers, the GD would be written as 7/37. This exceeds FTDNA's threshold for declaring a match (i.e. 4 steps or less over the first 37 markers; written as 0-4/37) and so A and B would not appear in each other's list of matches.
  • A and C have a GD of zero. They are an exact match. Their GD for the first 37 markers is thus 0/37. They appear in each other's match list and the match looks really close. They think they have a common ancestor in the last few hundred years. They start comparing family trees, looking for the elusive ancestor. They will never find him. This is a wild goose chase. This is the consequence of Convergence.
  • A and D have a GD of 2 (or 2/37). This GD falls within the threshold for declaring a match. They both appear in the other's match list. They email each other, looking for the common ancestor - another wild goose chase. Another example of Convergence and its consequences.
  • B and C have a GD of 7/37. No match.
  • B and D have a GD of 9/37. No match.
  • C and D have a GD of 2/37. It's a match. It's Convergence. They don't know that. They spend months researching their connection. It's a wild goose chase.
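
The comparisons above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Genetic Distance is counted as the simple sum of step differences across markers (testing companies may treat some markers differently), and the other 32 markers are taken to be identical in all four men. The values for markers 1 and 5 in A, B and C come from the description above. The values for markers 2 to 4 (14, 13, 30) and D's first marker (12) are not stated in the text; they are inferred from the Back Mutations described for D and from the distances quoted in the list.

```python
from itertools import combinations

# Present-day values of the 5 STR markers that mutated in the example.
# Markers 2-4 and D's first marker are inferred values (see the note above).
lines = {
    "A": [14, 14, 13, 30, 16],
    "B": [17, 14, 13, 30, 12],
    "C": [14, 14, 13, 30, 16],
    "D": [12, 14, 13, 30, 16],
}

MATCH_THRESHOLD_37 = 4  # a GD of 4 or less over 37 markers counts as a match

def genetic_distance(h1, h2):
    """Sum of the step differences between two haplotypes, marker by marker."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

distances = {
    (x, y): genetic_distance(lines[x], lines[y])
    for x, y in combinations(sorted(lines), 2)
}

for (x, y), gd in distances.items():
    verdict = "match" if gd <= MATCH_THRESHOLD_37 else "no match"
    print(f"{x} vs {y}: GD {gd}/37 - {verdict}")
```

Every "match" this produces is coincidental, because in the example all four lines separated 10,000 years ago.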

The STR results of people living today tell us nothing about how those STR marker values have evolved over time. They may have come from a relatively recent common source, or they may have come from widely differing directions.

Below is another way of conceptualising how the numerical value of a single STR marker might evolve over time. This marker started out with a value of 8 for the common ancestor of 4 distinct lines of descent. But by the time of the present day, two lines had a value of 9, one had a value of 13 and one had a value of 5. But the evolutionary history of these 4 lines of descent is peppered with Back Mutations and Parallel Mutations:
  • Back Mutations
    • Line 2 (red) - 14 becomes 13 some time between 1000 years ago and the present day (0)
    • Line 4 (purple) - 4 to 5 between 1000 and 0 years ago
    • Line 3 (green) - 5 to 6, 6 to 7, and 7 to 8 between 7000 (7K) and 4000 (4K) years ago
  • Parallel Mutations
    • 8 to 9 in Line 2 (10K to 9K), Line 1 (7K to 6K), and Line 3 (2K to 1K)
    • 8 to 7 in Line 3 (10K to 9K) and Line 4 (9K to 8K)
    • 7 to 6 in Line 3 (9K to 8K) and Line 4 (7K to 6K)
    • 6 to 5 in Line 3 (8K to 7K) and Line 4 (4K to 3K)

The evolution of values in a single STR marker over time in 4 descendant lines
of a common ancestor who lived some 10,000 years ago

The consequence of all these Parallel & Back Mutations is that the present day descendants of two of the lines (green Line 3 & blue Line 1) have exactly the same numerical value for this STR marker despite the fact that their evolutionary histories are so different.
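
As a rough sketch of how these two types of mutation could be picked out automatically, the Python below walks along Line 3 (green) and labels each step as forward or back, then looks for steps that Line 3 shares with Lines 1 and 4. The assumptions are mine: Line 3's values are reconstructed from the lists above with one value per thousand years, a Back Mutation is taken to be any step that moves the value closer to the ancestral value of 8, and for Lines 1 and 4 only the steps are listed (the 5 to 4 step in Line 4 is implied by the lists rather than stated).

```python
ANCESTRAL_VALUE = 8

# Line 3 (green), one value per 1000 years from 10K years ago to the present.
line3 = [8, 7, 6, 5, 6, 7, 8, 8, 8, 9, 9]

def steps(values):
    """Return the (old, new) pairs wherever the value changes."""
    return [(old, new) for old, new in zip(values, values[1:]) if old != new]

def is_back_mutation(step, ancestral=ANCESTRAL_VALUE):
    """A step counts as 'back' if it ends closer to the ancestral value."""
    old, new = step
    return abs(new - ancestral) < abs(old - ancestral)

line3_steps = steps(line3)
back = [s for s in line3_steps if is_back_mutation(s)]
forward = [s for s in line3_steps if not is_back_mutation(s)]

# Steps for the other lines, taken from the lists above.
line1_steps = [(8, 9)]
line4_steps = [(8, 7), (7, 6), (6, 5), (5, 4), (4, 5)]  # (5, 4) is implied

# A Parallel Mutation is the same step turning up in two separate lines.
parallel_with_line1 = sorted(set(line3_steps) & set(line1_steps))
parallel_with_line4 = sorted(set(line3_steps) & set(line4_steps), reverse=True)

print("Line 3 back mutations:", back)
print("Line 3 forward mutations:", forward)
print("Parallel with Line 1:", parallel_with_line1)
print("Parallel with Line 4:", parallel_with_line4)
```

Only the final value of each line is visible in a present-day test result. The steps themselves are exactly the part of the story that the STR data cannot show us.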

This is an example of the evolutionary history for a single STR marker. And if this is representative of all STR markers, then the chances that the values for a particular marker will converge over time are really quite high. But our DNA results usually consist of 37 markers (the standard test most people start with) so what are the chances of the first 37 markers evolving in such a way as to result in convergence of a sufficient number of STR values to cause a coincidental match? ... well, the probability of that happening would be a lot lower. And the probability would be lower still with 67 markers, and lower still with 111 markers. But because so many people have tested (over 600,000 currently), we do see the phenomenon occurring even at higher marker levels (67 and 111).

And in a subsequent post we will look at clues to the presence of Convergence, so that you can look at your own or anyone's list of matches and adjust your suspicion level accordingly.

Convergence in practice

And to illustrate these points, I have temporarily moved one of the ungrouped project members into Lineage II, namely member Jim Treacy (B38804)*. He is third from the end in the diagram below. Don't worry about not being able to read the text (you can click to enlarge the diagram if you like) - just focus on the coloured squares. 

The Y-STR results for the men of Lineage II (with a Treacy third from the end)

And Jim has no coloured squares for the first half of the markers. It is only when we reach the 19th marker in the row that he has a pink square with the value 16 inside it - everyone else in that column has a value of 15 for that marker, except for one person who has a value of 14. And as we continue along Jim's row, there are 4 other coloured squares, bringing the total to 5. This can be expressed as a Genetic Distance of 5/37 from the modal haplotype (i.e. the 3rd row from the top, which - to remind you - is a virtual signature constructed from the most frequent values for each of the markers).

Now a GD of 5/37 between two men would mean that they do not appear in each other's list of matches (because FTDNA has set the threshold for "declaring" a match at 4/37 or less). But among Jim's list of matches at the 37 marker level, there are two members of Lineage II (with a GD of 4/37). And at the 67 marker level, Jim has 6 members of Lineage II among his matches (with a GD of 6 to 7/67). So it looks (on the surface) as if Jim is relatively closely related to our Lineage II group. And this suggests (on the surface) that there may be a common ancestor some time in the past several hundred years, maybe somewhere between 1700 and 1850 (on the basis of TMRCA calculations based on the TiP Report).

So what do we do next? Do we start looking for documentary evidence? Do we go back to the church records and land records and old newspapers to see if there is mention of a Gleeson-Treacy connection? 

We could do. But it would be a wild goose chase. Because the Treacy-Gleeson connection is a red herring. And we know this because we have done SNP testing.

Jim has done the Big Y test, as have 10 of the members of Lineage II. Both Jim and the Lineage II members belong to Haplogroup R, and they share some SNP markers in common. Each marker characterises a branching point in the Tree of Mankind and a SNP Progression is a list of these SNP markers down to the finer "more downstream" branches of the Tree. Here are the SNP Progressions for Jim and for the Lineage II Gleesons:
  • Jim Treacy: R-P312 > Z290 > L21 > DF13 > ZZ10 > Z255 > Z16437 > A557 > Z29008 > A10891
  • Lineage II Gleesons: R-P312 > Z290 > L21 > DF13 > ZZ10 > Z255 > Z16437 > Z16438 > BY2852 > A5631

You can see that the branching points are exactly the same ... until marker Z16437. Thereafter, Jim goes down one branch and the Gleesons go down another one. Now, let's be clear: the Gleesons and Jim do share a common ancestor. And if he were around today he would test positive for the SNP marker Z16437. But his descendants evolved along different paths - one path taking us down to our present-day Jim Treacy, the other taking us down to our present-day Gleesons. You can see where Jim and the Gleesons are placed on the Tree of Mankind in the diagram below.
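
Finding the point where two SNP Progressions part company is easy to do by eye, but here is a small Python sketch of the same idea for anyone comparing a lot of them. The function names are my own and the approach is simply to walk down both lists until they differ; it is an illustration, not a tool provided by any testing company.

```python
jim = "R-P312 > Z290 > L21 > DF13 > ZZ10 > Z255 > Z16437 > A557 > Z29008 > A10891"
gleeson = "R-P312 > Z290 > L21 > DF13 > ZZ10 > Z255 > Z16437 > Z16438 > BY2852 > A5631"

def parse_progression(text):
    """Split a SNP Progression into a list of marker names."""
    return [snp.strip() for snp in text.split(">")]

def split_point(progression_a, progression_b):
    """Return the shared markers, then each line's private markers."""
    shared = []
    for snp_a, snp_b in zip(progression_a, progression_b):
        if snp_a != snp_b:
            break
        shared.append(snp_a)
    return shared, progression_a[len(shared):], progression_b[len(shared):]

shared, jim_only, gleeson_only = split_point(
    parse_progression(jim), parse_progression(gleeson)
)

print("Last shared SNP:", shared[-1])
print("Jim's branch:", " > ".join(jim_only))
print("Gleeson branch:", " > ".join(gleeson_only))
```

The last shared marker is the one carried by the common ancestor, and everything after it is private to one branch or the other.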

And when did this common ancestor live? YFull dates the formation of Z16437 to 1650 years ago. The two markers downstream of this, A557 (Jim Treacy) and A5631 (Gleeson), both have formation dates of 1400 years ago. So from this we can say that the common ancestor of Treacy & the Gleesons lived somewhere between 1400 and 1650 years ago. Or to give it an actual date (by subtracting from 1950, the approximate birth year for members of Lineage II), sometime between 300 and 550 AD.

This is clearly a lot further back in time than the 1700-1850 AD estimate suggested by the STR data.
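
The date arithmetic can be laid out as a short Python sketch. It uses only the figures quoted above (the formation dates of 1650 and 1400 years ago, the reference year of 1950, and the 1700-1850 STR-based estimate); the variable and function names are my own.

```python
REFERENCE_YEAR = 1950  # approximate birth year for members of Lineage II

def years_ago_to_ad(years_ago, reference_year=REFERENCE_YEAR):
    """Convert 'years before the reference year' into a calendar year AD."""
    return reference_year - years_ago

# SNP-based window for the common ancestor.
snp_earliest = years_ago_to_ad(1650)  # formation of Z16437
snp_latest = years_ago_to_ad(1400)    # formation of A557 and A5631

# STR-based estimate quoted from the TiP Report calculations.
str_earliest, str_latest = 1700, 1850

# Smallest possible gap between the two estimates.
minimum_gap = str_earliest - snp_latest

print(f"SNP-based estimate: {snp_earliest} to {snp_latest} AD")
print(f"STR-based estimate: {str_earliest} to {str_latest} AD")
print(f"The STR estimate is too recent by at least {minimum_gap} years")
```

Even taking the most recent end of the SNP window and the earliest end of the STR window, the two estimates are more than a thousand years apart.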

So this is a great example of Convergence. By chance, Jim's STR signature has evolved over time to approximate that of the Gleesons of Lineage II and as a result, he looks a lot more closely related to the group than he actually is.

Maurice Gleeson
May 2017

* a big thank you to Jim for allowing me to use his name and his results in this example


Gleesons to the left, Treacys to the right, & about 1500 years in between






Friday 19 May 2017

23andMe Transition arrives in UK & Ireland

Some time ago, 23andMe transitioned their US customers to a new website format, whilst those of us in Europe remained with the old format. That was quite some time ago! But just this week, I have received an email informing me that I will be transitioned to the new format in June 2017. 

Below is the email I received. Of note, all Health Reports will be archived as PDF documents. I received mine before the FDA (Food & Drug Administration) put the extended hold on 23andMe's Health Reports, so I have 63 reports on physical traits, 53 on carrier status for inherited conditions, 25 on drug response, and 122 on health risks for a variety of medical conditions including Alzheimer's Disease and Parkinson's.




The first bullet point talks about "Ethnicity" but on my screen it is described as "Ancestry" - click on your name (top right), then Edit Profile, & you will see it directly under the Ancestry Information heading. Click on Update.

You can also enter or update your ethnicity by clicking on the green button above (in the email you receive). Of particular note, if you manage several kits, after filling out the survey for your first kit, be sure to switch profiles and complete the survey for each one of your kits.

The new 23andMe experience is discussed on their international webpages here, and additional information for European customers is available on this link here and is abstracted below.

Some of the key features that stand out for me include:
  • some Health Reports may be available (depending on which chip was used - you can find this information on your Download Raw Data page in the Profile box toward the end of the page)
  • the maximum number of matches has increased to 2000
  • linking to online trees is allowed, even if they are with other companies (saves you the hassle of having to upload a GEDCOM ... which is no longer available with the new experience anyway)
  • when defining haplogroup subclades, they have switched from the old terminology (e.g. R1b1a) to the new one (e.g. I-M253)
  • any connections you currently share with your matches will be maintained in the new experience

One of the best additional features of the new experience will be the Relatives in Common feature. This is similar to the Shared Matches feature on Ancestry and the ICW (In Common With) Matches feature on FamilyTreeDNA.







Maurice Gleeson
May 2017