Thursday, November 23, 2017

Assembling Genomes_dnj Packages

The genomes_dnj_2 github repository was split from the genomes_dnj repository to make it easier for anyone who wanted to use the genomes_dnj source code to understand its specific work or to reproduce that work.  Both repositories have similar package structures and the same large dependency on hdf5 files of thousand genome data for each autosomal chromosome.

The google drive genomes_dnj folder contains all of the packages from the genomes_dnj and genomes_dnj_2 repositories along with all of the hdf5 data files used in the analysis of thousand genome data.  In all cases, the data in the hdf5 files is accessed through modules that are in the same package as the data.  The data_preparation folders in some packages do not contain a __init__.py file and are, therefore, not subpackages.  They contain source modules that are primarily provided to document how the data was prepared.

The simplest way to set up a python executable environment is to clone one of the github repositories and replace the packages containing hdf5 files with copies downloaded from google drive.  The autosome_snp_data package is the big dependency.  It contains more than 7 gigabytes of hdf5 files.  Note that it is possible to do the genomes_dnj style analysis on data from individual chromosomes just by downloading the hdf5 files for those chromosomes and placing them in the autosome_snp_data package.

The genomes_dnj_2 packages revise the processing for doing statistical analysis by series, by chromosome, and across the whole genome.  The chrom_plots, genome_plots, stats_by_pos, and stats_by_series subpackages all create hdf5 files of statistics data from the data in the autosome_snp_data package.  In all cases, the code for creating the hdf5 files is in a package module.  Packages with the hdf5 data already created are in the google drive genomes_dnj folder.  It should be possible to download these packages and do the genomes_dnj_2 analysis without downloading the individual chromosome data needed for the autosome_snp_data_package.

Except for jupyter notebook processing, all of the module references in the source code are relative to the root package.  It should be possible to give that root package any name a user wants.

Tuesday, November 21, 2017

Grouping SNPs Into Series

Filtering Thousand Genome Data


The first step in processing thousand genome data was filtering the data for all autosomal chromosomes and selecting single nucleotide polymorphisms (SNPs) that were expressed by at least 16 of the data's 5008 chromosome samples.  This process selected 18,336,427 SNPs.  The data for those SNPs was formated in chromosome hdf5 SNP data files to enable easier access for analysis processing.

Grouping SNPs


The second step was to group SNPs into series.  The code for that process is in the file autosome_snp_data/data_preparation/chrom_snp_series_data_finder.py.  The ideal goal was to group SNPs by an association with a set of samples where each of the samples expressed all of the SNPs in a series and no other samples expressed any of those SNPs.

90% Threshold


Both match requirements were relaxed to 90%.  Samples expressing the series were required to express 90% of the SNPs in it.  90% of the samples expressing any SNP in the series needed to meet the requirement for expressing 90% of all the SNPs in the series.

Recursive Matching


The second relaxation of the algorithm was a process for recursive extension of a series.  The process started with the lowest chromosome position SNP that had not already been grouped into a series.  The samples that expressed that SNP were used as a test set.  SNPs were added to the series if 90% of the test set expressed the SNP and if the test set samples expressing the candidate SNP were more than 90% of the samples that expressed that SNP.  When no more matches could be found for a test set, the recursive procedure tried to use the highest position SNP in the series as the source of a new test sample set.  Additional SNPs at higher chromosome positions that met the match criteria for the new test set were added to the series.  This process was carried out recursively as long as additional SNPs could be added to the series.  Chromosome samples were considered to express the series if they expressed at least 90% of the complete series of SNPs.

Results


All of the SNPs from each of the 22 autosomal chromosomes were grouped into series.  A row of data for each of those series was created in the chromosomes snp series data hdf5 file.  That file included 5,045,658 rows for single SNP series.  The rest of the SNPs could be grouped into series of at least two SNPs.  The procedure identified 946,618 series of four or more SNPs.  A total of 10,123,510 SNPs were grouped into those series.  The analysis carried out with this data focuses on the 946,618 series of four or more SNPs.  Generally the term series used in the discussion of the results of this analysis refers those series that include at least four SNPs.

The analysis documented in the genomes_dnj git repository 
https://github.com/dnjake/genomes_dnj examined 76 SNP series in detail.  All of them have distributions of SNP and chromosome sample associations that are very far from the kind of distribution expected for independent random variables.

Genomes_dnj Data Structures

The main data structure used the genomes_dnj work is a 5008 element numpy array of boolean values.  Each value indicates whether one of the chromosome samples in the thousand genome phase 3 data expresses a SNP or a SNP series.  This data structure is usually called an allele mask.  It maps directly to the values in the variant call format used by thousand genome phase 3 data to indicate which samples from its sample populations express a particular SNP.

Two data types have been used for the allele mask array.  One is a one byte unsigned integer that takes the values 0 or 1.  The other is a one byte boolean that takes the values True or False.  In most numerical or logical numpy operations, these data types work exactly the same.  The one exception is when the array is used as a mask to select the elements of another same sized array at the indexes where the mask has a True value.  This operation requires a boolean data type for the mask array.

The allele mask array for an SNP is associated with a row of structured data.  That data identifies the SNP, locates it by chromosome position, and provides some population data for the chromosome samples that express the SNP.

SNP Data File


A separate hdf5 file is used to hold the SNP data for each chromosome.  The file chrom_1_snp_data.h5 holds the SNP data for chromosome 1.  The files for other chromosomes use the same naming convention except the number of the chromosome. The SNP data is held in two root folder tables.  The table named snp_data holds the structured data.  The other named bitpacked_allele_values contains the allele mask for each SNP with one bit for each of the 5008 chromosome samples to indicate whether not that sample expresses the SNP.

SNP Data Tables


The columns in the SNP data table are:

'snp_index'

An identifier assigned to the SNP that is unique for its chromosome.

'chrom'
'ref'
'alt'
'id'

Fields copied directly from the thousand genome variant call format data.

'not_expressed_is_variant'

For the work carried out in this effort the allele that is expressed at lower frequency by the thousand genome samples is considered the variant.  This field is set to 1 when that allele is not the reference variant.

'pos'

The position of the SNP within its chromosome.

'all_count'

The total number of samples that express the variant.

'afr_obs_to_pred' and other region observed to predicted ratios.

The observed to predicted ratios indicate the relative expression of the variant within regional populations.  The African populations that don't live in Africa were broken out because gene flow into those populations is a common pattern for SNPs close to absent in African populations.  The South Asian populations were also broken out.  But, the pattern that causes the issue for African populations does not appear to exist for them.

'beb_count' and other country counts

The counts of chromosome samples from each of the thousand genome country populations that expresses the SNP.

The bitpacked_allele_values table has two fields:

'snp_index'

This is the same index as the one used in the SNP data table.  Associated rows in both tables are at the same offset.  The index is duplicated to guard against potential problems with corruption.

'bitpacked_values'

This 326 byte field holds the data 8 bits to the byte for the 5008 byte array used for the analysis operations.

SNP Series Data File


A separate hdf5 file holds the series data for each chromosome's SNPs.  The file chrom_1_snp_series_data.h5' holds the data for chromosome 1.  A similar naming convention is used for the other chromosomes.  The data is held in two root folder tables.  The chrom_1_snp_series_data table holds the structured data for each of chromosome 1's SNP series.  The chrom_1_series_bitpacked_p90_allele_mask holds a bit packed  array of true false values for chromosome samples that express the SNP series.

SNP Series Data Tables


The columns of the chromosome snp series data tables are:

'data_index'

A unique index by chromosome for the series.  Within the analysis a series is identified by its chromosome and its data index.

'first_snp_index'

The index used in the SNP data table for the first SNP in the series.

'chrom'

The series chromosome.

'first_pos'

The chromosome position of the first SNP in the series.

'last_pos'

The chromosome position of the last SNP in the series

'item_data_start'

A reference to a row of data in another hdf5 that identifies the specific SNPs that form the series.  The value of this field is the row in that table that identifies the
first SNP in this series.

'item_count'

The number of SNPs in this series and the number of item rows that associate this series with specific SNPs.

'p90_allele_count'

The count of chromosome samples that express 90% of the SNPs in this series.

'afr_obs_to_pred' and other region observed to predicted values

Relative expression values for the series by different thousand genome regional populations.

The columns of the chromosome series_bitpacked_allele_mask table are:

'data_index'

The id of the series.

'bitpacked_p90_allele_mask'

A 326 byte version of the 5008 byte true false array of samples that express the series.

SNP Series File


There is also a separate file for each chromosome with data that associates SNP series data rows with the data rows for each of the series SNPs.  The file chrom_1_snp_series.h5 holds the associations for chromosome 1 SNPs and SNP series.  Each file has two tables in its root folder.  The chrom_1_snp_series table has a row for each of the chromosome 1 SNP series.  Each row in that table locates the records that associate a particular series with its SNPs in the chrom_1_snp_series_items table.  Information in each of the rows in the SNP series data tables can also be used directly to locate the SNP associations in the snp_series_items table for the particular chromosome.

SNP Series Table


'first_snp_index'

The snp_index of the first SNP in the series.  This index identifies the series.

'item_data_start'

The offset of the row in the snp_series_items table that starts the rows that associate this series and its SNPs.

'item_data_count'

The number of SNPs in this series and the number of rows in the snp_series_items table that associates this series with its SNPs.

SNP Series Items Table


'snp_index'

The index of the SNP that can be used to find its data in the SNP data file.

'first_snp_index'

The index of the first SNP in the series this SNP belongs to.

Chromosome Sample Populations File


The file allele_country_and_region_plus_maps.h5 holds the data that associates region and country population chromosome samples with the 5008 elements in the allele mask array.  The root folder of the file holds two tables.  The country_alleles table associates each country with the mask of samples that comes from that country's population.  The region_alleles table associates each region with the mask of samples that comes from that region's population.  Thousand genome African and South Asian regions were divided into populations that actually live in the region and those that live elsewhere.

Country Alleles Table


'index'

Offset of the row with the country's data.

'region_index'

Offset of the row in the region table that holds the data for the region that the country is part of.

'code'

A three letter code for the country.

'name'

The full name of the country.

'allele_mask'

A true false array that indicates the elements of the 5008 byte allele mask array that come from this country.

Region Alleles Table


'index'

Offset of the row with the region's data.

'code'

A three letter code for the region.

'name'

The full name of the region.

'allele_mask'

A true false array that indicates the elements of the 5008 byte allele mask array that come from this region.





Thursday, August 3, 2017

Genomes_dnj Overview

Human Genetic History


Studies of the sequence and function of human DNA have shown that more of that DNA functions as a scaffold for complex protein assemblies than codes for protein amino acid sequences.  Over two thousand protein transcription factors have been identified that can bind to DNA and are believed to influence DNA transcription into RNA.  Some transcription factors are known to mediate the activity of chemical signalling systems that influence the synthesis of specific proteins.  Other transcription factors are known to have a role in the operation of enhancer DNA segments that control cell type specific transcription of protein coding RNA.

Studies of the operation of enhancer DNA sequences have shown that their cell type specific operation depends on complex three dimensional chromatin DNA scaffolds that wrap complex enhancer patterns back on the DNA that codes for the regulated proteins.  Enhancer DNA sequences can be separated by hundreds of thousands of DNA bases from the DNA whose transcription is enhanced.  Complex assemblies of proteins and perhaps RNA interact with both the enhancer DNA and the promoter of the regulated protein coding DNA to enable the transcription of that DNA.

Studies of complete human genomes have shown that the average human expresses more than 3,000,000 single nucleotide polymorphisms (SNPs).  Most of these genetic variations do not change protein amino acids.  Instead those SNPs impact the scaffold DNA.  Statistical studies of the relationship between SNPs and phenotypes have shown that many of these SNPs are quantitative trait loci.  Those SNPs have modest statistical impacts on their phenotypes.  But, overall they have a more significant genetic impact on disease than defects in proteins.

The work of the 1000 genomes project showed a strong tendency for correlated SNPs to form haplotypes.  That correlation allowed the 1000 genome project to impute the haploid association of SNPs for each of the 5008 haploid chromosome samples from the 2504 complete genomes in its phase 3 data.  The data with those imputed associations of SNPs make it possible to analyze variation in SNP associations within the different populations sampled in the 1000 genomes data.

The analysis presented in this set of notebooks uses a simple heuristic algorithm to group SNPs from 1000 genomes data into series.  Series of four or more SNPs were identified that were all expressed by the same chromosome samples.  The goal was to identify series of SNPs that met two criteria.  One required 16 or more chromosome samples to express 90% of the SNPs in the series.  The second required 90% of the chromosome samples expressing any SNP in the series to meet the first requirement.

More than 940,000 series of at least 4 SNPs were identified with this technique from the 1000 genomes phase 3 data for autosomal chromosomes.  Some summary statistics on the identified series and some plots of chromosome 2 series characteristics are presented in the notebook chrom2_plots.ipynb. Along most of the length of all of the different autosomal chromosomes, most of the 1000 genomes samples express some series.  But, there is considerable variation in the number of series crossing different chromosome positions, the number of SNPs in the series, and the length of the chromosome region from the first SNP in a series to the last.  There is also a pattern for positions on a chromosome that are not crossed by any identified series.

The analysis presented in these notebooks focused on the million base interval between positions 135,757,320 and 136,786,630 of chromosome 2.  This interval includes the genes rab3gap1, zranb3, r3hdm1, ubxn4, lct, mcm6, and dars.  This interval was chosen for analysis because of the location of the gene lct which codes for the protein that is needed to digest lactase and the location of the SNP rs4988235 in an intron of mcm6 that is generally thought to be the genetic variation responsible for the phenotype of lactase persistence.  The specific interval endpoint locations were chosen because they are the nearest positions on both sides of the lct gene where the number of active SNP series goes to zero.

The statistics of this interval are exceptional.  The distance between endpoints with no active series is at the long end for chromosome 2. An unusually large number of series are expressed in this interval.  But, it is the number of SNPs in those series that make this interval the most exceptional one on chromosome 2.

The 870,000 base 11 SNP series specifically associated with lactase persistence is the second longest one identified in the interval.  That 11 SNP series expressed by 765 1000 genomes chromosome samples has selected an hierarchy that also includes the series 6_1503, 4_1699, 4_911, 26_1414, 64_1575, 10_2206, and 7_1818.  The emergence of lactase persistence was a genetic process that selected 8 series including a total of 132 SNPs for overexpression.  See the notebook lactase_persistence.ipynb for more information about this hierarchy.

The process that resulted in 765 chromosome samples expressing the 8 series and 132 SNPs associated with lactase persistence is the most dramatic instance of selection visible in this interval.  But, the series 11_765 is only one of the 76 SNP series documented in this study.  For all of those series, the distribution of chromosome samples expressing the series SNPs is very far from the expectation for a collection of independent random variables.

A large number of these series emerged in three hierarchies during the human expansion out of Africa.  But, the exceptional characteristics of the studied region of chromosome 2 appear more to be the result of four overlapping series that each fill a large part of the 629,000 base region that covers the genes rab3gap1, zranb3, and half of r3hdm1.  In total these series include 493 SNPs.  All four series are expressed by 842 of the 843 chromosome samples that express the South Asian (SAS) tree hierarchy. The series 193_843 that includes 193 SNPs expressed by 843 chromosome samples is only expressed in this this hierarchy.  The other three series, 62_1265, 123_1561, and 117_1685 including a total of 302 SNPs are also expressed in varying combinations by large numbers of mostly African chromosome samples.  These 4 exceptionally long series with exceptionally large numbers of SNPs appear to be the most exceptional feature of this part of chromosome 2.  The data presented in these notebooks show several instances of independent processes that resulted in large overexpression of different combinations of the same series of these SNPs.

At least two well understood processes have the potential to produce overexpression of groups of SNPs largely by chance.  One is a population bottleneck.  It is likely that population bottlenecks have played some role in generating the differences observed for expression of SNPs among the different 1000 genomes regional populations. Another is linkage disequilibrium.  That is the potential for the selective functional advantage of a single SNP to result in overexpression of chance nearby SNPs because genetic events that recombine them are rare.

There is no doubt about the reality of linkage disequilibrium and little doubt about the significance of population bottlenecks.  But, several arguments based on the evidence presented in these notebooks suggest that these phenomena are not adequate to explain the observed results.

One is just the large fraction of SNPs that can be grouped into stable series.  More than half the SNPs expressed by 16 or more samples of autosomal chromosomes grouped into series of 4 or more SNPs with the algorithm used for this study.  Many of those series are in African populations and appear to have a long history that preceded the human expansion out of Africa.

A second is the large number of observed cases where the identity of a series is preserved across recombination events and in varied highly overexpressed associations with other series.  Consider the series in the hierarchy overexpressed with the lactase persistence phenotype.  The 765 samples that express the series 11_765 include 764 that express all of the series 6_1503, 4_1699, 26_1414, 64_1575, 10_2206, and 7_1868 in mostly European populations.  There are also 139 samples that express 6_1503 and 4_1699 with 4_1149, 95_176, and 51_176 that all come from East Asian populations.  There are 200 samples that express 4_1699 without 6_1503.  Those samples include 149 that express 4_1699 with 13_1696 and 32_1361.  The overexpression of 4_1699 by those samples is largely the result of several selection processes among African populations.  There are 259 mostly East Asian samples that express an association of series that include 6_1396, 5_684, 9_944, 32_1361, 64_1575, 10_2206, and 7_1868. There are 378 samples overexpressed in African populations that express the series 10_2206 in association with the series 9_378.  There are 161 almost all African samples that express 7_1868 with 4_163.  There are 48 almost all African samples that express 26_1414 in association with the series 14_48.  There are 16 samples from African populations that express 26_1414 in association with the series 5_16.

A third is the multiple cases of independent selection of complex series.  For example, the 117 SNP series 117_1685 was selected along with 123_1561, 62_1265, and 193_843 for expression by 842 samples through the processes that generated the SAS tree.  117_1685 was also selected along with 123_1561, 62_1265, and 67_329 without 193_843 for expression by 309 samples through the processes that generated the 67 SNP series 67_329 and the descendant series in its hierarchy.  117_1685 was also selected along with 74_210 without 123_1561, 62_1265, 193_843 or 67_329 for expression by 210 samples through the processes that generated 74_210 and the descendant series in its hierarchy.  117_1685 was also selected along with 123_1561, 62_1265, and the 209 SNP series 209_56 through the processes that generated the 209 SNPs for expression by 56 samples.  117_1685 was also selected along with the 290 SNP series 290_16 without 123_1561, 62_1265, 193_843, 67_329, or 209_56 for expression by 16 samples through the processes that generated those 290 SNPs.

A fourth is the varying patterns of recombination among series in different parts of the studied region and within different hierarchies.  Some history of recombination events has been detected in all parts of the 1,000,000 base studied region.  But, recombination events are more common in the top 400,000 bases than in the lower 600,000 bases.  The observed hierarchies suggest that processes which select a series that extends over the whole million base region are common.  But, almost all of those sequences have become partially fragmented through recombination events between the lower and upper parts of the region.  There also appears to be much more history of recombination within the lower region of the East Asian tree than in other observed hierarchies.

The 870,000 base series 11_765 associated with lactase persistence is the second longest one observed in the studied region.  It is exceptional for the length of the series and the absence of any history of recombination events by the samples that express it.  The 765 samples that express 11_765 include 760 that express 6_1503, 4_1699, 26_1414, 64_1575, 10_2206, 7_1868, 4_911 and no others.  This pattern fits the idea of a functional contribution from multiple SNPs much better than one where only a single SNP is functional and all the other associations result from chance. Certainly the kind of complex 3d scaffolding observed for enhancer controlled cell type specific regulation of DNA transcription is consistent with multiple SNP contributions to phenotypes.

The notebooks in this study provide a more detailed analysis of the hierarchies of series in the studied region of chromosome 2 and of sample populations that express them.  The results of the study include both charts of series associations and tables of data on sample expression.  The methods used are described in some detail with example charts in the notebook methods.ipynb.


Thursday, July 27, 2017

Genomes_dnj Notebooks


The results of project genomes_dnj are contained in a tree of 100 jupyter notebooks.

The top layer of the tree contains several notebooks that provide a summary of the patterns of human genetic history revealed by the project's study of 1000 genomes phase 3 data.

The next layer is divided into the main hierarchies of human genetic history observed in the study

For each of the major hierarchies, the bottom layer documents the SNPs in each of the series associated with the hierarchy.  In total 76 different series of 4 or more SNPs are documented.  The individual series notebooks provide distributions of expression of its SNPs among the 5008 chromosome samples in the 1000 genomes phase 3 data.  They also provide regional population data for the expression of the series and for all of its series associations within the studied interval of chromosome 2.

The easiest way to access the notebooks is to download the whole tree in html format from the notebooks html folder on google drive.

Another possibility is to download the notebooks in native format.  One method is to clone the github master branch for genomes_dnj.  An alternative is to download the whole tree from the notebooks folder on google drive.

Viewing native notebooks requires a python installation.  Anaconda2-4.1.1 was used for all of the project work.

The top level notebooks can be viewed directly online from anaconda cloud.

Wednesday, July 26, 2017

Project Genomes_dnj


The genomes_dnj project has been an effort to use 1000 genomes phase 3 data to identify the different DNA sequences expressed by its 5008 chromosome samples.  Over 10,000,000 SNPs across the autosomal chromosomes expressed by 16 or more chromosome samples were grouped into more than 940,000 series of 4 or more SNPs.  The series data for a 1,000,000 DNA base segment of chromosome 2 has been explored in detail.

The results include the identification of three hierarchical histories of series that have generated a large part of the segment's DNA sequences for the 1000 genomes populations that emerged from Africa.  Each of those histories is rooted in a different complex association of series with histories that extend into the African past.

The series in the lower 600,000 bases of the studied segment show more history of stability than those in the upper 400,000 bases.  One of the out of Africa hierarchies plus several African hierarchies show a record of independent selection of the same series of hundreds of SNPs.  Other samples show more complex patterns of series remodeling or patterns that appear to have resulted from randomization of series associations.

A stream of results over the last several decades has shown that much more human DNA has a role as a scaffold for organizing the activity of assemblies of proteins than codes for those proteins.  Over two thousand human proteins are known to bind to DNA.  Large numbers of those proteins are known to function as transcription factors that influence the transcription of DNA that does code for specific proteins.

That DNA scaffold has been shown to form a complex 3d structure that can wrap a complex pattern of cell type specific enhancer DNA segments back on the protein coding DNA that the enhancers control.  The enhancer process involves complex assemblies of proteins that are anchored and guided by the enhancer DNA segments.  To be effective those assemblies must be aligned with the promoter regions of the protein coding DNA.  Enhancer DNA segments can be separated by hundreds of thousands of DNA bases from the protein coding genes they enhance.

The DNA scaffolds function as part of an extremely complex system of biochemical activity.  Even small changes to the dynamics of some part of that system have a potential for producing significant results.  The sequence of scaffold DNA is the least constrained part of the system.  Variations in that sequence are very good candidates for the providing the degrees of freedom needed for successful evolution by natural selection.

The 1000 genomes project has identified large numbers of common genetic variations.  It discovered the tendency of many of those genetic variations to cluster on a single chromosome.  It used statistical techniques based on that discovery to impute the haploid association of single nucleotide genetic variations from diploid sequence data.  This work has exploited the 1000 genomes results.  Its results provide a strong confirmation of the effectiveness of the 1000 genomes statistical methods.

The 1000 genomes results imply that there is no single normal human DNA sequence.  Instead, all human beings express varying associated genetic variations in patterns that are very far from any kind of random distribution.  Nevertheless, most of the scientific community still appears to think of human DNA as a single reference sequence with some number of independent randomly distributed genetic variations.  That model underlies the assumption that correlation of individual SNPs with some phenotype is an effective technique to understand the functional role of individual genetic variations.

The results of this work call that model into question.  Central to this work is a technique for visualizing the associations of clustered SNPs.  Use of this technique was a major factor in the recognition of the structures and histories reported in these results.  Perhaps inspection of some of the genomes_dnj notebooks can help in the recognition of the reality that is implied both by the 1000 genomes results and by all of the work that shows the role of complex DNA scaffolds in regulating major cellular processes.

This blog has several goals:

Guide access to the jupyter notebooks that contain the results of the work.

Guide access to the source code that provides the details of the methods used.

Outline the construction of a python package capable of repeating the project's results.

Outline the process required to extend the project with data for all of the autosomal chromosomes.

Discuss the value of the work and its relation to the emerging understanding of human genetic function.

There are two project repositories:

 genomes_dnj on github contains the project source code and the jupyter notebooks.

genomes_dnj on google drive duplicates the github content. It adds the data for all autosomal chromosomes in hdf5 format and the notebooks in html format.


Assembling Genomes_dnj Packages

The genomes_dnj_2  github repository was split from the  genomes_dnj  repository to make it easier for anyone who wanted to use the genomes_...