Tuesday, 13 June 2017

WDYTYA 2017 - videos going online

This year was the last year of Who Do You Think You Are? - Live! The event was an annual staple of the British genealogical calendar for the last 10 years. Starting in 2008 in Olympia in London, it moved to the National Exhibition Centre in Birmingham in 2014. The event attracted thousands of attendees year on year, and in 2009 Brian Swann, ISOGG-UK representative, persuaded FamilyTreeDNA to sponsor a stand at the event.

Shortly thereafter, the DNA Workshop began. And for the past several years this has been kindly sponsored by FamilyTreeDNA and run by volunteers from ISOGG. Each year the lecture schedule has attracted a host of international and local speakers, both academics and citizen scientists. And in addition, videos of the presentations have been made available free of charge on our dedicated YouTube channel as a service to the genetic genealogy community.

It is sad to see the demise of the WDYTYA event. It was a wonderful way of keeping in touch with friends and colleagues, and everyone looked forward to the manic three days of early mornings and late nights. Hopefully another annual event will rise to take its place. Nevertheless, at least the videos of the presentations will serve as a lasting legacy of the ten-year run of WDYTYA.

The last ever batch of videos is ready to be uploaded to the YouTube channel, and uploads will take place every Monday and Thursday over the coming 6 weeks or so. Three videos have already been uploaded and are attracting a large audience:

The Science of Admixture Percentages (Garrett Hellenthal)

DNA, emigration and shipping (Brian Swann)

Autosomal DNA demystified (Debbie Kennett)

The schedule of lectures from this year's event is indicated below (click the image to enlarge). Most of the presenters gave permission to upload their lectures and a big thank you is due to the speakers for their generosity.

And this year we managed to get better audio recordings than ever before. Who would have thought that dangling a small microphone in front of the loudspeaker and recording a separate audio track on your iPhone would be the best way of conquering the ever-present background noise from the 10,000 people in the auditorium?!


Maurice Gleeson
June 2017

Thursday, 1 June 2017

Convergence - quantifying Back & Parallel Mutations (Part 1)

In a recent post I explored the concept of Convergence and made the point that the mechanism by which Convergence arises is via a combination of Parallel Mutations and Back Mutations in the STR marker values. These mutations are changes that occurred at some time in the past but because they remain hidden to us in the present, we cannot tell when they occurred or how frequently they occurred just by looking at two sets of STR results from people living today.

However, there is a way around this problem. Or at least a partial solution.

By using a combination of STR data and SNP data we can build a Mutation History Tree that is a more accurate representation of the branching structure of the "family tree" for a specific genetic group. And this type of tree allows us to more easily (and more accurately) spot Back Mutations and Parallel Mutations.

I did this for one particular genetic family in one of my surname projects - the North Tipperary Gleesons (Lineage II of the Gleason DNA Project). This tree is a "best fit" tree, by which I mean a tree constructed in such a way as to explain the STR & SNP data in the most parsimonious way i.e. with the fewest branches that will accommodate or "fit" the data. This approach is also called the "maximum parsimony" approach and is often used when building cladograms or phylogenetic trees. The Mutation History Tree (MHT) is simply another type of cladogram. You can read about the process of how the tree was developed in this blog post here and subsequent posts.

But a key point here is that this "best fit" tree is likely to change as more data becomes available. And to illustrate this point, I'm going to compare the current version of the tree (Dec 2016) with the next version that is being prepared following the recent availability of new data from 12 sets of Z255 SNP Pack results.

Below is the current version of the MHT for Lineage II. By comparing each mutation in the tree with every other one, we can identify which mutations are Back Mutations (occurring on a single line of descent) and which are Parallel Mutations (occurring on two or more lines of descent). I have highlighted the Back Mutations in yellow and the Parallel Mutations in green.

Back Mutations in yellow, Parallel Mutations in green
from Gleeson Lineage II MHT (version Dec 2016)

Parallel Mutations occur in the following lines of descent:
  • CDYb 40-39 ... A, E, D, F (4 times)
  • CDYa 39-38 ... A, B, C, F (4 times)
  • 464c 17-16 ... A x2, D (3 times)
  • 461 12-11 ... A, B (2 times)
  • 576 18-19 ... A, D (2 times)
  • 390 23-24 ... A, B, C (3 times)
  • 390 24-23 ... B, C (2 times)
  • 456 16-15 ... B, D (2 times)
  • and so on ...
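
As a rough illustration of the counting above (a sketch only, not the method used to build the MHT), each mutation can be recorded as a (marker, old value, new value, branch) event, and any change seen on two or more lines of descent is then flagged as parallel. The event list below is transcribed from the bullets above; the items covered by "and so on" are not included, and "A x2" is assumed to mean two separate lines within Branch A. The function name `find_parallel` is made up for this example.

```python
from collections import defaultdict

# Events transcribed from the list above: (marker, old value, new value, branch).
# Assumption: "A x2" means the change happened on two separate lines within Branch A.
events = [
    ("CDYb", 40, 39, "A"), ("CDYb", 40, 39, "E"), ("CDYb", 40, 39, "D"), ("CDYb", 40, 39, "F"),
    ("CDYa", 39, 38, "A"), ("CDYa", 39, 38, "B"), ("CDYa", 39, 38, "C"), ("CDYa", 39, 38, "F"),
    ("464c", 17, 16, "A"), ("464c", 17, 16, "A"), ("464c", 17, 16, "D"),
    ("461", 12, 11, "A"), ("461", 12, 11, "B"),
    ("576", 18, 19, "A"), ("576", 18, 19, "D"),
    ("390", 23, 24, "A"), ("390", 23, 24, "B"), ("390", 23, 24, "C"),
    ("390", 24, 23, "B"), ("390", 24, 23, "C"),
    ("456", 16, 15, "B"), ("456", 16, 15, "D"),
]


def find_parallel(events):
    """Group identical changes and keep those seen on two or more lines of descent."""
    by_change = defaultdict(list)
    for marker, old, new, branch in events:
        by_change[(marker, old, new)].append(branch)
    return {change: branches for change, branches in by_change.items() if len(branches) >= 2}


parallel = find_parallel(events)
for (marker, old, new), branches in parallel.items():
    print(f"{marker} {old}-{new} ... {', '.join(branches)} ({len(branches)} times)")
```

The listed items alone account for 22 parallel occurrences; the remainder of the 26 counted later in this post come from the mutations not itemised here.
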
Back Mutations are more difficult to count, and to conceptualise. Whether you consider the value as mutating forward or back is entirely dependent on your reference point. If our anchor is the upstream Z255 branch, then the original value of marker 390 (for example) is 24, mutating (forward) to 23 on the Z16438 branch, and then back to 24 (in parallel) on Branches A, B & C, and then back to 23 (again in parallel) on Branches B & C. So there are several points to make here:
  • this is in fact a Back Mutation that occurs in parallel in 3 separate lines of descent. It is thus both a Back Mutation (relative to its earlier value of 24 on the Z255 branch) and a Parallel Mutation, occurring at (presumably) different time points in Branches A, B & C. It is thus coloured yellow and green.
  • It can also be considered a Triple Mutation relative to the Z255 branch - in the sense that it mutates forward to 23 then back to 24, then back to 23 again. But what happens if it flips forward and back 5 times? What would we call that? And what do we call it if it goes two steps forward and one step back? This is where terminology fails us. I'm not sure if there is a standardised way of describing these different kinds of mutation (if there is, please leave a comment below).
  • the mutation 390 24-23 occurs in Branches B & C ... relative to its value of 24 in the Z255 branch, this could be considered a Parallel Forward Back Forward Mutation ... for Pete's Sake!!

But let us just focus on the Back Mutations that occur downstream of the branch characterised by the STR mutation (710 36-37), just above the A5627 SNP Block. This "710 branch" incorporates all the Gleesons of Lineage II, from Branch A to F.* On this overarching branch for Lineage II, the value of the STR marker 390 is 23 and Back Mutations are as follows:
  • 390 24-23 ... B, C ... this is the only Back Mutation below the "710 branch"
  • And it is also a Parallel Mutation
  • All the other yellow Back Mutations are relative to the upstream Z255 branch, and not our downstream "710 branch", and so are not counted in this particular exercise.
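
To make the dependence on the reference point concrete, here is a small sketch that labels each step in the history of marker 390 as "forward" or "back" relative to a chosen anchor value. The rule used (a step counts as "back" if it moves the value closer to the anchor value, otherwise "forward") is my own simplifying assumption for illustration, not a standard definition, and the function name `classify_steps` is hypothetical. The marker values are those quoted in the text.

```python
def classify_steps(path, anchor_value):
    """Label each step as 'back' (moves closer to the anchor value) or 'forward' (moves away)."""
    labels = []
    for old, new in zip(path, path[1:]):
        if abs(new - anchor_value) < abs(old - anchor_value):
            labels.append("back")
        else:
            labels.append("forward")
    return labels


# Marker 390 as described in the text: 24 on Z255, 23 on Z16438,
# 24 on Branches A/B/C, then 23 again on Branches B/C.
from_z255 = [24, 23, 24, 23]
# The same line of descent seen from the "710 branch", where the value is 23.
from_710 = [23, 24, 23]

print(classify_steps(from_z255, anchor_value=24))  # ['forward', 'back', 'forward']
print(classify_steps(from_710, anchor_value=23))   # ['forward', 'back']
```

The final step (24 to 23) is the same physical mutation in both cases, but it is labelled "forward" relative to Z255 and "back" relative to the "710 branch", which is why only the latter is counted in this exercise.
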

So, let's generate some statistics from these numbers:
  • The total number of mutations below the "710 branch" (irrespective of whether they are forward or back) is 71.
  • There are 69 Forward Mutations (i.e. away from the original value of the relevant marker on the "710 branch")
    • 31 Forward Mutations show an increase in the number (e.g. 9 to 10)
    • 38 Forward Mutations show a decrease in the number (e.g. 9 to 8)
  • There are 2 Back Mutations 
    • both Back Mutations show a decrease in the number (i.e. 24 to 23)
  • There are 26 Parallel Mutations
  • Forward Mutations outnumber Back Mutations by a ratio of 34.5 : 1 (i.e. 69 : 2)
  • Parallel Mutations outnumber Back Mutations by a ratio of 13 : 1
  • There are 16 people in this tree, and if we make the big assumption that the "710 branch" starts 1000 years ago (i.e. roughly at the time of the introduction of the Gleeson surname), then over the course of 1000 years, the rate of each type of mutation is (crudely) as follows:
    • Forward Mutations = 69/16 = 4.3125 mutations per "line of descent" per 1000 years
    • Back Mutations = 2/16 = 0.125 mutations per "line of descent" per 1000 years
    • Parallel Mutations = 26/16 = 1.625 mutations per "line of descent" per 1000 years
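
For anyone who wants to rerun the arithmetic, the figures above can be reproduced with a few lines of Python. This is simply a sketch of the crude calculation described in the bullets; the variable names are my own, and the 1000-year span is the same big assumption made in the text.

```python
# Counts quoted in the text for mutations below the "710 branch".
forward, back, parallel = 69, 2, 26
lines_of_descent = 16   # people in the tree
span_years = 1000       # assumed age of the "710 branch"

total = forward + back                # 71 mutations in all
forward_to_back = forward / back      # 69 / 2 = 34.5 (dividing the total of 71 by 2 would give 35.5)
parallel_to_back = parallel / back    # 26 / 2 = 13.0

# Crude rate: mutations per line of descent over the assumed span.
counts = {"forward": forward, "back": back, "parallel": parallel}
rates = {name: count / lines_of_descent for name, count in counts.items()}

print(f"total mutations: {total}")
print(f"forward : back = {forward_to_back} : 1")
print(f"parallel : back = {parallel_to_back} : 1")
for name, rate in rates.items():
    print(f"{name}: {rate} per line of descent per {span_years} years")
```

Changing `lines_of_descent` or the counts makes it easy to see how sensitive these crude rates are to the size of the sample.
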

These are crude estimates but they give some idea of the relative importance of Parallel Mutations compared to Back Mutations. And applying this information to the phenomenon of Convergence, it would seem that Back Mutations play a very minor role compared to Parallel Mutations.

This conjecture is supported by some recent modelling work undertaken by Dave Vance and written up for the L21 Yahoo Discussion Forum. In Dave's simple model, which is an extremely useful basis for further discussion, the "average tree" could expect to have a ratio of Parallel to Back Mutations in the range of 25:1 to 50:1.

This is a lot higher than the 13:1 ratio I have shown in my MHT for the Lineage II Gleesons, but this can be partly explained by the fact that there are only 16 people in my Gleeson sample, and we are looking at (perhaps) only the last 1000 years. I would predict that the ratio will increase further as 1) I add more people to the sample; and 2) the duration of observation is extended backward from 1000 years ago (the 710 Branch) to 4300 years ago (the Z255 Branch).

In subsequent posts we will see how these calculations stand up when we add in additional data from 12 SNP Pack results and reconfigure the MHT for Gleeson Lineage II into the next version of the "best fit" model. And we will also attempt to quantify the total number of Back & Parallel Mutations below the upstream marker Z255. And lastly, we will attempt to quantify Convergence itself.

Maurice Gleeson
June 2017

* the Big Y results of a 10th member of the group indicate that this branch is characterised by the SNP A5631 although this result is not reflected in this version of the MHT