Thursday, July 27, 2017

Genomes_dnj Notebooks


The results of project genomes_dnj are contained in a tree of 100 jupyter notebooks.

The top layer of the tree contains several notebooks that provide a summary of the patterns of human genetic history revealed by the project's study of 1000 genomes phase 3 data.

The next layer is divided into the main hierarchies of human genetic history observed in the study.

For each of the major hierarchies, the bottom layer documents the SNPs in each of the series associated with the hierarchy.  In total, 76 different series of 4 or more SNPs are documented.  Each series notebook provides distributions of expression of the series' SNPs among the 5008 chromosome samples in the 1000 genomes phase 3 data.  It also provides regional population data for the expression of the series and for all of its series associations within the studied interval of chromosome 2.
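As a rough illustration of what a distribution of expression can look like, the sketch below counts how many of a series' SNPs each chromosome sample carries and then tallies how many samples fall at each count.  The function name, the input layout and the toy data are assumptions made for this example.  The notebooks may compute and present their distributions differently.

```python
from collections import Counter


def expression_distribution(series_snp_carriers, n_samples):
    """Tally samples by the number of series SNPs they carry.

    series_snp_carriers maps a SNP identifier to the set of chromosome
    sample indices that carry the SNP (hypothetical layout).
    Returns a Counter: {number of series SNPs carried: number of samples}.
    """
    snps_per_sample = [0] * n_samples
    for carriers in series_snp_carriers.values():
        for sample_index in carriers:
            snps_per_sample[sample_index] += 1
    return Counter(snps_per_sample)


# Toy series of 4 SNPs over 6 chromosome samples (the real data has 5008).
toy_series = {
    "snp_1": {0, 1},
    "snp_2": {0, 1},
    "snp_3": {0, 1, 2},
    "snp_4": {0},
}
distribution = expression_distribution(toy_series, n_samples=6)
for snp_count in sorted(distribution):
    print(snp_count, "SNPs carried by", distribution[snp_count], "samples")
```

A sample that carries every SNP of a series lands in the highest bin, and a sample that carries none lands in bin 0.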

The easiest way to access the notebooks is to download the whole tree in html format from the notebooks html folder on google drive.

Another possibility is to download the notebooks in native format.  One method is to clone the github master branch for genomes_dnj.  An alternative is to download the whole tree from the notebooks folder on google drive.

Viewing native notebooks requires a python installation.  Anaconda2-4.1.1 was used for all of the project work.

The top level notebooks can be viewed directly online from anaconda cloud.

Wednesday, July 26, 2017

Project Genomes_dnj


The genomes_dnj project has been an effort to use 1000 genomes phase 3 data to identify the different DNA sequences expressed by its 5008 chromosome samples.  Over 10,000,000 SNPs across the autosomal chromosomes, each expressed by 16 or more chromosome samples, were grouped into more than 940,000 series of 4 or more SNPs.  The series data for a 1,000,000 base segment of chromosome 2 has been explored in detail.
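The two thresholds in that paragraph (16 or more chromosome samples per SNP, 4 or more SNPs per series) can be made concrete with a small sketch.  The grouping rule used here is a deliberately simple stand-in: SNPs are placed together only when exactly the same set of samples carries them.  That rule, the function name and the toy data are assumptions for illustration.  The project's real grouping method is in its source code and is not reproduced here.

```python
from collections import defaultdict

MIN_SAMPLES = 16      # from the post: SNPs expressed by 16 or more samples
MIN_SERIES_SNPS = 4   # from the post: series of 4 or more SNPs


def group_snps_by_carriers(snp_carriers,
                           min_samples=MIN_SAMPLES,
                           min_snps=MIN_SERIES_SNPS):
    """Group SNPs carried by exactly the same set of chromosome samples.

    snp_carriers maps a SNP identifier to the set of sample indices that
    carry it (hypothetical layout).  The exact-match rule is an assumption
    for illustration, not the project's series definition.
    """
    groups = defaultdict(list)
    for snp_id, carriers in snp_carriers.items():
        # Drop SNPs carried by too few chromosome samples.
        if len(carriers) >= min_samples:
            groups[frozenset(carriers)].append(snp_id)
    # Keep only groups with enough SNPs to count as a series.
    return {carriers: sorted(snps)
            for carriers, snps in groups.items()
            if len(snps) >= min_snps}


# Toy data with a lowered sample threshold so the example stays small.
toy = {
    "snp_a": {0, 1, 2}, "snp_b": {0, 1, 2},
    "snp_c": {0, 1, 2}, "snp_d": {0, 1, 2},
    "snp_e": {3, 4, 5},   # enough samples, but alone in its group
    "snp_f": {7},         # too few samples
}
series = group_snps_by_carriers(toy, min_samples=3)
print(series)
```

Only the four SNPs that share samples 0, 1 and 2 survive both filters in the toy data.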

The results include the identification of three hierarchical histories of series that have generated a large part of the segment's DNA sequences for the 1000 genomes populations that emerged from Africa.  Each of those histories is rooted in a different complex association of series with histories that extend into the African past.

The series in the lower 600,000 bases of the studied segment show more history of stability than those in the upper 400,000 bases.  One of the out-of-Africa hierarchies and several African hierarchies show a record of independent selection of the same series of hundreds of SNPs.  Other samples show more complex patterns of series remodeling or patterns that appear to have resulted from randomization of series associations.

A stream of results over the last several decades has shown that far more human DNA serves as a scaffold for organizing the activity of assemblies of proteins than codes for those proteins.  Over two thousand human proteins are known to bind to DNA.  Large numbers of those proteins are known to function as transcription factors that influence the transcription of DNA that does code for specific proteins.

That DNA scaffold has been shown to form a complex 3D structure that can wrap a complex pattern of cell-type-specific enhancer DNA segments back on the protein coding DNA that the enhancers control.  The enhancer process involves complex assemblies of proteins that are anchored and guided by the enhancer DNA segments.  To be effective, those assemblies must be aligned with the promoter regions of the protein coding DNA.  Enhancer DNA segments can be separated by hundreds of thousands of DNA bases from the protein coding genes they enhance.

The DNA scaffolds function as part of an extremely complex system of biochemical activity.  Even small changes to the dynamics of some part of that system have a potential for producing significant results.  The sequence of scaffold DNA is the least constrained part of the system.  Variations in that sequence are very good candidates for providing the degrees of freedom needed for successful evolution by natural selection.

The 1000 genomes project has identified large numbers of common genetic variations.  It discovered the tendency of many of those genetic variations to cluster on a single chromosome.  It used statistical techniques based on that discovery to impute the haploid association of single nucleotide genetic variations from diploid sequence data.  This work has exploited the 1000 genomes results.  Its results provide a strong confirmation of the effectiveness of the 1000 genomes statistical methods.

The 1000 genomes results imply that there is no single normal human DNA sequence.  Instead, all human beings express varying associated genetic variations in patterns that are very far from any kind of random distribution.  Nevertheless, most of the scientific community still appears to think of human DNA as a single reference sequence with some number of independent randomly distributed genetic variations.  That model underlies the assumption that correlation of individual SNPs with some phenotype is an effective technique to understand the functional role of individual genetic variations.
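One way to see how far a pair of SNPs is from independent random distribution is to compare how often they appear together on the same chromosome sample with how often they would appear together if each were distributed independently.  The sketch below computes that difference, the standard linkage disequilibrium coefficient D, for two SNPs.  The function name and the toy haplotype vectors are assumptions for illustration and are not taken from the project code.

```python
def cooccurrence(hap_a, hap_b):
    """Compare observed and expected co-occurrence of two SNPs.

    hap_a and hap_b are equal-length sequences of 0/1 values, one entry
    per chromosome sample, where 1 means the sample carries the SNP.
    Returns (observed, expected, d):
      observed: fraction of samples carrying both SNPs
      expected: product of the two individual frequencies (independence)
      d:        observed - expected
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype vectors must be non-empty and equal length")
    freq_a = sum(hap_a) / n
    freq_b = sum(hap_b) / n
    observed = sum(1 for a, b in zip(hap_a, hap_b) if a and b) / n
    expected = freq_a * freq_b
    return observed, expected, observed - expected


# Toy example: two SNPs always carried by the same 4 of 8 samples.
snp_x = [1, 1, 1, 1, 0, 0, 0, 0]
snp_y = [1, 1, 1, 1, 0, 0, 0, 0]
observed, expected, d = cooccurrence(snp_x, snp_y)
print(observed, expected, d)
```

For independently distributed SNPs d stays near zero.  SNPs that travel together on the same chromosomes, as in the toy data, give a d well above zero.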

The results of this work call that model into question.  Central to this work is a technique for visualizing the associations of clustered SNPs.  Use of this technique was a major factor in the recognition of the structures and histories reported in these results.  Perhaps inspection of some of the genomes_dnj notebooks can help in the recognition of the reality that is implied both by the 1000 genomes results and by all of the work that shows the role of complex DNA scaffolds in regulating major cellular processes.

This blog has several goals:

Guide access to the jupyter notebooks that contain the results of the work.

Guide access to the source code that provides the details of the methods used.

Outline the construction of a python package capable of repeating the project's results.

Outline the process required to extend the project with data for all of the autosomal chromosomes.

Discuss the value of the work and its relation to the emerging understanding of human genetic function.

There are two project repositories:

genomes_dnj on github contains the project source code and the jupyter notebooks.

genomes_dnj on google drive duplicates the github content. It adds the data for all autosomal chromosomes in hdf5 format and the notebooks in html format.


Assembling Genomes_dnj Packages

The genomes_dnj_2 github repository was split from the genomes_dnj repository to make it easier for anyone who wanted to use the genomes_...