Thursday, August 3, 2017

Genomes_dnj Overview

Human Genetic History


Studies of the sequence and function of human DNA have shown that more of that DNA functions as a scaffold for complex protein assemblies than codes for protein amino acid sequences.  Over two thousand protein transcription factors have been identified that can bind to DNA and are believed to influence DNA transcription into RNA.  Some transcription factors are known to mediate the activity of chemical signalling systems that influence the synthesis of specific proteins.  Other transcription factors are known to have a role in the operation of enhancer DNA segments that control cell type specific transcription of protein coding RNA.

Studies of the operation of enhancer DNA sequences have shown that their cell type specific operation depends on complex three dimensional chromatin DNA scaffolds that wrap complex enhancer patterns back on the DNA that codes for the regulated proteins.  Enhancer DNA sequences can be separated by hundreds of thousands of DNA bases from the DNA whose transcription is enhanced.  Complex assemblies of proteins and perhaps RNA interact with both the enhancer DNA and the promoter of the regulated protein coding DNA to enable the transcription of that DNA.

Studies of complete human genomes have shown that the average human expresses more than 3,000,000 single nucleotide polymorphisms (SNPs).  Most of these genetic variations do not change protein amino acids.  Instead those SNPs impact the scaffold DNA.  Statistical studies of the relationship between SNPs and phenotypes have shown that many of these SNPs are quantitative trait loci.  Those SNPs have modest statistical impacts on their phenotypes.  But, overall they have a more significant genetic impact on disease than defects in proteins.

The work of the 1000 genomes project showed a strong tendency for correlated SNPs to form haplotypes.  That correlation allowed the 1000 genomes project to impute the haploid association of SNPs for each of the 5008 haploid chromosome samples from the 2504 complete genomes in its phase 3 data.  The data with those imputed associations of SNPs make it possible to analyze variation in SNP associations within the different populations sampled in the 1000 genomes data.

The analysis presented in this set of notebooks uses a simple heuristic algorithm to group SNPs from 1000 genomes data into series.  Series of four or more SNPs were identified that were all expressed by the same chromosome samples.  The goal was to identify series of SNPs that met two criteria.  One required 16 or more chromosome samples to express 90% of the SNPs in the series.  The second required 90% of the chromosome samples expressing any SNP in the series to meet the first requirement.
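The two criteria can be written as a check on one candidate series.  The sketch below is only an illustration and is not the code used in the notebooks.  The function name meets_series_criteria, the dictionary layout, and the reading of "90%" as "at least 90%" are all assumptions.

```python
from typing import Dict, Set


def meets_series_criteria(
    snp_samples: Dict[str, Set[int]],
    min_core_samples: int = 16,
    snp_fraction: float = 0.9,
    sample_fraction: float = 0.9,
) -> bool:
    """Check the two series criteria for one candidate series.

    snp_samples maps each SNP id in the candidate series to the set of
    chromosome sample ids that express it (a hypothetical data layout).
    """
    n_snps = len(snp_samples)
    if n_snps < 4:
        return False  # series must contain four or more SNPs

    # Count how many of the series SNPs each sample expresses.
    counts: Dict[int, int] = {}
    for samples in snp_samples.values():
        for sample in samples:
            counts[sample] = counts.get(sample, 0) + 1

    # Core samples express at least 90% of the SNPs in the series.
    core = [s for s, c in counts.items() if c >= snp_fraction * n_snps]

    # Criterion 1: 16 or more core samples.
    if len(core) < min_core_samples:
        return False

    # Criterion 2: at least 90% of the samples that express any SNP
    # in the series are core samples.
    return len(core) >= sample_fraction * len(counts)
```

With made-up data, a series of 4 SNPs that are all expressed by the same 20 samples passes.  Adding 5 samples that express only one of the SNPs makes it fail the second criterion, because only 20 of 25 samples (80%) then meet the first.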

More than 940,000 series of at least 4 SNPs were identified with this technique from the 1000 genomes phase 3 data for autosomal chromosomes.  Some summary statistics on the identified series and some plots of chromosome 2 series characteristics are presented in the notebook chrom2_plots.ipynb. Along most of the length of all of the different autosomal chromosomes, most of the 1000 genomes samples express some series.  But, there is considerable variation in the number of series crossing different chromosome positions, the number of SNPs in the series, and the length of the chromosome region from the first SNP in a series to the last.  There is also a pattern for positions on a chromosome that are not crossed by any identified series.

The analysis presented in these notebooks focused on the million base interval between positions 135,757,320 and 136,786,630 of chromosome 2.  This interval includes the genes rab3gap1, zranb3, r3hdm1, ubxn4, lct, mcm6, and dars.  This interval was chosen for analysis because of the location of the gene lct, which codes for the protein that is needed to digest lactose, and the location of the SNP rs4988235 in an intron of mcm6 that is generally thought to be the genetic variation responsible for the phenotype of lactase persistence.  The specific interval endpoint locations were chosen because they are the nearest positions on both sides of the lct gene where the number of active SNP series goes to zero.
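The endpoint rule can be sketched as a search that starts at the gene and grows the interval until no series crosses either edge.  This is a minimal illustration under assumed inputs, not the method in the notebooks.  The function name zero_coverage_bounds, the (first position, last position) tuples, and the toy positions in the example are hypothetical.

```python
from typing import List, Tuple


def zero_coverage_bounds(
    series_spans: List[Tuple[int, int]],
    gene_start: int,
    gene_end: int,
) -> Tuple[int, int]:
    """Grow [gene_start, gene_end] until no series crosses its edges.

    series_spans holds (first SNP position, last SNP position) for each
    series.  The result is the outermost extent of the chain of
    overlapping series around the gene; just outside it, the number of
    active series is zero.
    """
    left, right = gene_start, gene_end
    changed = True
    while changed:
        changed = False
        for first, last in series_spans:
            overlaps = first <= right and last >= left
            extends = first < left or last > right
            if overlaps and extends:
                left, right = min(left, first), max(right, last)
                changed = True
    return left, right


# Toy positions, not real chromosome 2 coordinates.
spans = [(100, 300), (250, 600), (700, 900), (10, 50)]
bounds = zero_coverage_bounds(spans, gene_start=350, gene_end=400)
```

In the toy example the two series that chain across the gene give bounds of (100, 600), and the series at 10 to 50 and 700 to 900 are left outside.  For the interval studied here, the endpoints given above are 1,029,310 bases apart.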

The statistics of this interval are exceptional.  The distance between endpoints with no active series is at the long end for chromosome 2.  An unusually large number of series are expressed in this interval.  But, it is the number of SNPs in those series that makes this interval the most exceptional one on chromosome 2.

The 870,000 base 11 SNP series specifically associated with lactase persistence is the second longest one identified in the interval.  That 11 SNP series, expressed by 765 of the 1000 genomes chromosome samples, has selected a hierarchy that also includes the series 6_1503, 4_1699, 4_911, 26_1414, 64_1575, 10_2206, and 7_1868.  The emergence of lactase persistence was a genetic process that selected 8 series including a total of 132 SNPs for overexpression.  See the notebook lactase_persistence.ipynb for more information about this hierarchy.

The process that resulted in 765 chromosome samples expressing the 8 series and 132 SNPs associated with lactase persistence is the most dramatic instance of selection visible in this interval.  But, the series 11_765 is only one of the 76 SNP series documented in this study.  For all of those series, the distribution of chromosome samples expressing the series SNPs is very far from the expectation for a collection of independent random variables.
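A rough calculation shows how far such a series is from the independence expectation.  The sketch below assumes, as a simplification, that each of the 11 SNPs in 11_765 is expressed by exactly 765 of the 5008 chromosome samples and that the SNPs are independent of one another.  The function name expected_joint_carriers is hypothetical.

```python
def expected_joint_carriers(
    n_samples: int, carriers_per_snp: int, n_snps: int
) -> float:
    """Expected number of samples expressing every SNP in a series
    if the SNPs were independent and each had the same frequency."""
    freq = carriers_per_snp / n_samples
    return n_samples * freq ** n_snps


# Numbers taken from the text: 11 SNPs, 765 of 5008 samples.
expected = expected_joint_carriers(5008, 765, 11)
print(f"expected samples under independence: {expected:.2e}")
```

Under those assumptions the expected number of samples expressing all 11 SNPs is far below one, while 765 samples are observed to express the series.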

A large number of these series emerged in three hierarchies during the human expansion out of Africa.  But, the exceptional characteristics of the studied region of chromosome 2 appear more to be the result of four overlapping series that each fill a large part of the 629,000 base region that covers the genes rab3gap1, zranb3, and half of r3hdm1.  In total these series include 495 SNPs.  All four series are expressed by 842 of the 843 chromosome samples that express the South Asian (SAS) tree hierarchy.  The series 193_843 that includes 193 SNPs expressed by 843 chromosome samples is only expressed in this hierarchy.  The other three series, 62_1265, 123_1561, and 117_1685, including a total of 302 SNPs, are also expressed in varying combinations by large numbers of mostly African chromosome samples.  These 4 exceptionally long series with exceptionally large numbers of SNPs appear to be the most exceptional feature of this part of chromosome 2.  The data presented in these notebooks show several instances of independent processes that resulted in large overexpression of different combinations of the same series of these SNPs.

At least two well understood processes have the potential to produce overexpression of groups of SNPs largely by chance.  One is a population bottleneck.  It is likely that population bottlenecks have played some role in generating the differences observed for expression of SNPs among the different 1000 genomes regional populations.  Another is linkage disequilibrium, the non-random association of nearby SNPs.  It allows the selective functional advantage of a single SNP to result in overexpression of chance nearby SNPs because genetic events that recombine them are rare.

There is no doubt about the reality of linkage disequilibrium and little doubt about the significance of population bottlenecks.  But, several arguments based on the evidence presented in these notebooks suggest that these phenomena are not adequate to explain the observed results.

One is just the large fraction of SNPs that can be grouped into stable series.  More than half the SNPs expressed by 16 or more samples of autosomal chromosomes grouped into series of 4 or more SNPs with the algorithm used for this study.  Many of those series are in African populations and appear to have a long history that preceded the human expansion out of Africa.

A second is the large number of observed cases where the identity of a series is preserved across recombination events and in varied highly overexpressed associations with other series.  Consider the series in the hierarchy overexpressed with the lactase persistence phenotype.  The 765 samples that express the series 11_765 include 764 that express all of the series 6_1503, 4_1699, 26_1414, 64_1575, 10_2206, and 7_1868 in mostly European populations.  There are also 139 samples that express 6_1503 and 4_1699 with 4_1149, 95_176, and 51_176 that all come from East Asian populations.  There are 200 samples that express 4_1699 without 6_1503.  Those samples include 149 that express 4_1699 with 13_1696 and 32_1361.  The overexpression of 4_1699 by those samples is largely the result of several selection processes among African populations.  There are 259 mostly East Asian samples that express an association of series that include 6_1396, 5_684, 9_944, 32_1361, 64_1575, 10_2206, and 7_1868. There are 378 samples overexpressed in African populations that express the series 10_2206 in association with the series 9_378.  There are 161 almost all African samples that express 7_1868 with 4_163.  There are 48 almost all African samples that express 26_1414 in association with the series 14_48.  There are 16 samples from African populations that express 26_1414 in association with the series 5_16.
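Counts like the ones above are intersections and differences of the sets of samples that express each series.  The sketch below shows the idea with made-up data.  The function name samples_expressing, the dictionary layout, and the placeholder series names and sample ids are all hypothetical and do not come from the 1000 genomes data.

```python
from typing import Dict, Iterable, Set


def samples_expressing(
    series_samples: Dict[str, Set[int]],
    include: Iterable[str],
    exclude: Iterable[str] = (),
) -> Set[int]:
    """Samples that express every series in include and none in exclude.

    series_samples maps a series name to the set of chromosome sample
    ids that express it (a hypothetical data layout).
    """
    include = list(include)
    if not include:
        raise ValueError("include must name at least one series")
    result = set(series_samples[include[0]])
    for name in include[1:]:
        result &= series_samples[name]  # must express this series too
    for name in exclude:
        result -= series_samples[name]  # must not express this series
    return result


# Placeholder series with made-up sample ids.
toy = {
    "series_a": {1, 2, 3, 4},
    "series_b": {2, 3, 4, 5},
    "series_c": {4},
}
both = samples_expressing(toy, ["series_a", "series_b"])
both_not_c = samples_expressing(toy, ["series_a", "series_b"], ["series_c"])
```

With real data, a query such as include=["4_1699"] and exclude=["6_1503"] would correspond to the samples that express 4_1699 without 6_1503.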

A third is the multiple cases of independent selection of complex series.  For example, the 117 SNP series 117_1685 was selected along with 123_1561, 62_1265, and 193_843 for expression by 842 samples through the processes that generated the SAS tree.  117_1685 was also selected along with 123_1561, 62_1265, and 67_329 without 193_843 for expression by 309 samples through the processes that generated the 67 SNP series 67_329 and the descendant series in its hierarchy.  117_1685 was also selected along with 74_210 without 123_1561, 62_1265, 193_843 or 67_329 for expression by 210 samples through the processes that generated 74_210 and the descendant series in its hierarchy.  117_1685 was also selected along with 123_1561, 62_1265, and the 209 SNP series 209_56 through the processes that generated the 209 SNPs for expression by 56 samples.  117_1685 was also selected along with the 290 SNP series 290_16 without 123_1561, 62_1265, 193_843, 67_329, or 209_56 for expression by 16 samples through the processes that generated those 290 SNPs.

A fourth is the varying patterns of recombination among series in different parts of the studied region and within different hierarchies.  Some history of recombination events has been detected in all parts of the 1,000,000 base studied region.  But, recombination events are more common in the upper 400,000 bases than in the lower 600,000 bases.  The observed hierarchies suggest that processes which select a series that extends over the whole million base region are common.  But, almost all of those series have become partially fragmented through recombination events between the lower and upper parts of the region.  There also appears to be much more history of recombination within the lower region of the East Asian tree than in other observed hierarchies.

The 870,000 base series 11_765 associated with lactase persistence is the second longest one observed in the studied region.  It is exceptional for the length of the series and the absence of any history of recombination events by the samples that express it.  The 765 samples that express 11_765 include 760 that express 6_1503, 4_1699, 26_1414, 64_1575, 10_2206, 7_1868, 4_911 and no others.  This pattern fits the idea of a functional contribution from multiple SNPs much better than one where only a single SNP is functional and all the other associations result from chance.  Certainly, the kind of complex 3D scaffolding observed for enhancer controlled cell type specific regulation of DNA transcription is consistent with multiple SNP contributions to phenotypes.

The notebooks in this study provide a more detailed analysis of the hierarchies of series in the studied region of chromosome 2 and of sample populations that express them.  The results of the study include both charts of series associations and tables of data on sample expression.  The methods used are described in some detail with example charts in the notebook methods.ipynb.


Assembling Genomes_dnj Packages

The genomes_dnj_2  github repository was split from the  genomes_dnj  repository to make it easier for anyone who wanted to use the genomes_...