genomes_dnj: November 2017

Thursday, November 23, 2017

Assembling Genomes_dnj Packages

The genomes_dnj_2 github repository was split from the genomes_dnj repository to make it easier for anyone who wanted to use the genomes_dnj source code to understand its specific work or to reproduce that work. Both repositories have similar package structures and the same large dependency on hdf5 files of thousand genome data for each autosomal chromosome.

The google drive genomes_dnj folder contains all of the packages from the genomes_dnj and genomes_dnj_2 repositories along with all of the hdf5 data files used in the analysis of thousand genome data. In all cases, the data in the hdf5 files is accessed through modules that are in the same package as the data. The data_preparation folders in some packages do not contain a __init__.py file and are, therefore, not subpackages. They contain source modules that are primarily provided to document how the data was prepared.

The simplest way to set up a python executable environment is to clone one of the github repositories and replace the packages containing hdf5 files with copies downloaded from google drive. The autosome_snp_data package is the big dependency. It contains more than 7 gigabytes of hdf5 files. Note that it is possible to do the genomes_dnj style analysis on data from individual chromosomes just by downloading the hdf5 files for those chromosomes and placing them in the autosome_snp_data package.

The genomes_dnj_2 packages revise the processing for doing statistical analysis by series, by chromosome, and across the whole genome. The chrom_plots, genome_plots, stats_by_pos, and stats_by_series subpackages all create hdf5 files of statistics data from the data in the autosome_snp_data package. In all cases, the code for creating the hdf5 files is in a package module. Packages with the hdf5 data already created are in the google drive genomes_dnj folder. It should be possible to download these packages and do the genomes_dnj_2 analysis without downloading the individual chromosome data needed for the autosome_snp_data_package.

Except for jupyter notebook processing, all of the module references in the source code are relative to the root package. It should be possible to give that root package any name a user wants.

Tuesday, November 21, 2017

Grouping SNPs Into Series

Filtering Thousand Genome Data

The first step in processing thousand genome data was filtering the data for all autosomal chromosomes and selecting single nucleotide polymorphisms (SNPs) that were expressed by at least 16 of the data's 5008 chromosome samples. This process selected 18,336,427 SNPs. The data for those SNPs was formated in chromosome hdf5 SNP data files to enable easier access for analysis processing.

Grouping SNPs

The second step was to group SNPs into series. The code for that process is in the file autosome_snp_data/data_preparation/chrom_snp_series_data_finder.py. The ideal goal was to group SNPs by an association with a set of samples where each of the samples expressed all of the SNPs in a series and no other samples expressed any of those SNPs.

90% Threshold

Both match requirements were relaxed to 90%. Samples expressing the series were required to express 90% of the SNPs in it. 90% of the samples expressing any SNP in the series needed to meet the requirement for expressing 90% of all the SNPs in the series.

Recursive Matching

The second relaxation of the algorithm was a process for recursive extension of a series. The process started with the lowest chromosome position SNP that had not already been grouped into a series. The samples that expressed that SNP were used as a test set. SNPs were added to the series if 90% of the test set expressed the SNP and if the test set samples expressing the candidate SNP were more than 90% of the samples that expressed that SNP. When no more matches could be found for a test set, the recursive procedure tried to use the highest position SNP in the series as the source of a new test sample set. Additional SNPs at higher chromosome positions that met the match criteria for the new test set were added to the series. This process was carried out recursively as long as additional SNPs could be added to the series. Chromosome samples were considered to express the series if they expressed at least 90% of the complete series of SNPs.

Results

All of the SNPs from each of the 22 autosomal chromosomes were grouped into series. A row of data for each of those series was created in the chromosomes snp series data hdf5 file. That file included 5,045,658 rows for single SNP series. The rest of the SNPs could be grouped into series of at least two SNPs. The procedure identified 946,618 series of four or more SNPs. A total of 10,123,510 SNPs were grouped into those series. The analysis carried out with this data focuses on the 946,618 series of four or more SNPs. Generally the term series used in the discussion of the results of this analysis refers those series that include at least four SNPs.

The analysis documented in the genomes_dnj git repository

https://github.com/dnjake/genomes_dnj examined 76 SNP series in detail. All of them have distributions of SNP and chromosome sample associations that are very far from the kind of distribution expected for independent random variables.

Genomes_dnj Data Structures

The main data structure used the genomes_dnj work is a 5008 element numpy array of boolean values. Each value indicates whether one of the chromosome samples in the thousand genome phase 3 data expresses a SNP or a SNP series. This data structure is usually called an allele mask. It maps directly to the values in the variant call format used by thousand genome phase 3 data to indicate which samples from its sample populations express a particular SNP.

Two data types have been used for the allele mask array. One is a one byte unsigned integer that takes the values 0 or 1. The other is a one byte boolean that takes the values True or False. In most numerical or logical numpy operations, these data types work exactly the same. The one exception is when the array is used as a mask to select the elements of another same sized array at the indexes where the mask has a True value. This operation requires a boolean data type for the mask array.

The allele mask array for an SNP is associated with a row of structured data. That data identifies the SNP, locates it by chromosome position, and provides some population data for the chromosome samples that express the SNP.

SNP Data File

A separate hdf5 file is used to hold the SNP data for each chromosome. The file chrom_1_snp_data.h5 holds the SNP data for chromosome 1. The files for other chromosomes use the same naming convention except the number of the chromosome. The SNP data is held in two root folder tables. The table named snp_data holds the structured data. The other named bitpacked_allele_values contains the allele mask for each SNP with one bit for each of the 5008 chromosome samples to indicate whether not that sample expresses the SNP.

SNP Data Tables

The columns in the SNP data table are:

'snp_index'

An identifier assigned to the SNP that is unique for its chromosome.

'chrom'
'ref'
'alt'
'id'

Fields copied directly from the thousand genome variant call format data.

'not_expressed_is_variant'

For the work carried out in this effort the allele that is expressed at lower frequency by the thousand genome samples is considered the variant. This field is set to 1 when that allele is not the reference variant.

'pos'

The position of the SNP within its chromosome.

'all_count'

The total number of samples that express the variant.

'afr_obs_to_pred' and other region observed to predicted ratios.

The observed to predicted ratios indicate the relative expression of the variant within regional populations. The African populations that don't live in Africa were broken out because gene flow into those populations is a common pattern for SNPs close to absent in African populations. The South Asian populations were also broken out. But, the pattern that causes the issue for African populations does not appear to exist for them.

'beb_count' and other country counts

The counts of chromosome samples from each of the thousand genome country populations that expresses the SNP.

The bitpacked_allele_values table has two fields:

'snp_index'

This is the same index as the one used in the SNP data table. Associated rows in both tables are at the same offset. The index is duplicated to guard against potential problems with corruption.

'bitpacked_values'

This 326 byte field holds the data 8 bits to the byte for the 5008 byte array used for the analysis operations.

SNP Series Data File

A separate hdf5 file holds the series data for each chromosome's SNPs. The file chrom_1_snp_series_data.h5' holds the data for chromosome 1. A similar naming convention is used for the other chromosomes. The data is held in two root folder tables. The chrom_1_snp_series_data table holds the structured data for each of chromosome 1's SNP series. The chrom_1_series_bitpacked_p90_allele_mask holds a bit packed array of true false values for chromosome samples that express the SNP series.

SNP Series Data Tables

The columns of the chromosome snp series data tables are:

'data_index'

A unique index by chromosome for the series. Within the analysis a series is identified by its chromosome and its data index.

'first_snp_index'

The index used in the SNP data table for the first SNP in the series.

'chrom'

The series chromosome.

'first_pos'

The chromosome position of the first SNP in the series.

'last_pos'

The chromosome position of the last SNP in the series

'item_data_start'

A reference to a row of data in another hdf5 that identifies the specific SNPs that form the series. The value of this field is the row in that table that identifies the

first SNP in this series.

'item_count'

The number of SNPs in this series and the number of item rows that associate this series with specific SNPs.

'p90_allele_count'

The count of chromosome samples that express 90% of the SNPs in this series.

'afr_obs_to_pred' and other region observed to predicted values

Relative expression values for the series by different thousand genome regional populations.

The columns of the chromosome series_bitpacked_allele_mask table are:

'data_index'

The id of the series.

'bitpacked_p90_allele_mask'

A 326 byte version of the 5008 byte true false array of samples that express the series.

SNP Series File

There is also a separate file for each chromosome with data that associates SNP series data rows with the data rows for each of the series SNPs. The file chrom_1_snp_series.h5 holds the associations for chromosome 1 SNPs and SNP series. Each file has two tables in its root folder. The chrom_1_snp_series table has a row for each of the chromosome 1 SNP series. Each row in that table locates the records that associate a particular series with its SNPs in the chrom_1_snp_series_items table. Information in each of the rows in the SNP series data tables can also be used directly to locate the SNP associations in the snp_series_items table for the particular chromosome.

SNP Series Table

'first_snp_index'

The snp_index of the first SNP in the series. This index identifies the series.

'item_data_start'

The offset of the row in the snp_series_items table that starts the rows that associate this series and its SNPs.

'item_data_count'

The number of SNPs in this series and the number of rows in the snp_series_items table that associates this series with its SNPs.

SNP Series Items Table

'snp_index'

The index of the SNP that can be used to find its data in the SNP data file.

'first_snp_index'

The index of the first SNP in the series this SNP belongs to.

Chromosome Sample Populations File

The file allele_country_and_region_plus_maps.h5 holds the data that associates region and country population chromosome samples with the 5008 elements in the allele mask array. The root folder of the file holds two tables. The country_alleles table associates each country with the mask of samples that comes from that country's population. The region_alleles table associates each region with the mask of samples that comes from that region's population. Thousand genome African and South Asian regions were divided into populations that actually live in the region and those that live elsewhere.

Country Alleles Table

'index'

Offset of the row with the country's data.

'region_index'

Offset of the row in the region table that holds the data for the region that the country is part of.

'code'

A three letter code for the country.

'name'

The full name of the country.

'allele_mask'

A true false array that indicates the elements of the 5008 byte allele mask array that come from this country.

Region Alleles Table

'index'

Offset of the row with the region's data.

'code'

A three letter code for the region.

'name'

The full name of the region.

'allele_mask'

A true false array that indicates the elements of the 5008 byte allele mask array that come from this region.