Tuesday, November 21, 2017

Genomes_dnj Data Structures

The main data structure used the genomes_dnj work is a 5008 element numpy array of boolean values.  Each value indicates whether one of the chromosome samples in the thousand genome phase 3 data expresses a SNP or a SNP series.  This data structure is usually called an allele mask.  It maps directly to the values in the variant call format used by thousand genome phase 3 data to indicate which samples from its sample populations express a particular SNP.

Two data types have been used for the allele mask array.  One is a one byte unsigned integer that takes the values 0 or 1.  The other is a one byte boolean that takes the values True or False.  In most numerical or logical numpy operations, these data types work exactly the same.  The one exception is when the array is used as a mask to select the elements of another same sized array at the indexes where the mask has a True value.  This operation requires a boolean data type for the mask array.

The allele mask array for an SNP is associated with a row of structured data.  That data identifies the SNP, locates it by chromosome position, and provides some population data for the chromosome samples that express the SNP.

SNP Data File


A separate hdf5 file is used to hold the SNP data for each chromosome.  The file chrom_1_snp_data.h5 holds the SNP data for chromosome 1.  The files for other chromosomes use the same naming convention except the number of the chromosome. The SNP data is held in two root folder tables.  The table named snp_data holds the structured data.  The other named bitpacked_allele_values contains the allele mask for each SNP with one bit for each of the 5008 chromosome samples to indicate whether not that sample expresses the SNP.

SNP Data Tables


The columns in the SNP data table are:

'snp_index'

An identifier assigned to the SNP that is unique for its chromosome.

'chrom'
'ref'
'alt'
'id'

Fields copied directly from the thousand genome variant call format data.

'not_expressed_is_variant'

For the work carried out in this effort the allele that is expressed at lower frequency by the thousand genome samples is considered the variant.  This field is set to 1 when that allele is not the reference variant.

'pos'

The position of the SNP within its chromosome.

'all_count'

The total number of samples that express the variant.

'afr_obs_to_pred' and other region observed to predicted ratios.

The observed to predicted ratios indicate the relative expression of the variant within regional populations.  The African populations that don't live in Africa were broken out because gene flow into those populations is a common pattern for SNPs close to absent in African populations.  The South Asian populations were also broken out.  But, the pattern that causes the issue for African populations does not appear to exist for them.

'beb_count' and other country counts

The counts of chromosome samples from each of the thousand genome country populations that expresses the SNP.

The bitpacked_allele_values table has two fields:

'snp_index'

This is the same index as the one used in the SNP data table.  Associated rows in both tables are at the same offset.  The index is duplicated to guard against potential problems with corruption.

'bitpacked_values'

This 326 byte field holds the data 8 bits to the byte for the 5008 byte array used for the analysis operations.

SNP Series Data File


A separate hdf5 file holds the series data for each chromosome's SNPs.  The file chrom_1_snp_series_data.h5' holds the data for chromosome 1.  A similar naming convention is used for the other chromosomes.  The data is held in two root folder tables.  The chrom_1_snp_series_data table holds the structured data for each of chromosome 1's SNP series.  The chrom_1_series_bitpacked_p90_allele_mask holds a bit packed  array of true false values for chromosome samples that express the SNP series.

SNP Series Data Tables


The columns of the chromosome snp series data tables are:

'data_index'

A unique index by chromosome for the series.  Within the analysis a series is identified by its chromosome and its data index.

'first_snp_index'

The index used in the SNP data table for the first SNP in the series.

'chrom'

The series chromosome.

'first_pos'

The chromosome position of the first SNP in the series.

'last_pos'

The chromosome position of the last SNP in the series

'item_data_start'

A reference to a row of data in another hdf5 that identifies the specific SNPs that form the series.  The value of this field is the row in that table that identifies the
first SNP in this series.

'item_count'

The number of SNPs in this series and the number of item rows that associate this series with specific SNPs.

'p90_allele_count'

The count of chromosome samples that express 90% of the SNPs in this series.

'afr_obs_to_pred' and other region observed to predicted values

Relative expression values for the series by different thousand genome regional populations.

The columns of the chromosome series_bitpacked_allele_mask table are:

'data_index'

The id of the series.

'bitpacked_p90_allele_mask'

A 326 byte version of the 5008 byte true false array of samples that express the series.

SNP Series File


There is also a separate file for each chromosome with data that associates SNP series data rows with the data rows for each of the series SNPs.  The file chrom_1_snp_series.h5 holds the associations for chromosome 1 SNPs and SNP series.  Each file has two tables in its root folder.  The chrom_1_snp_series table has a row for each of the chromosome 1 SNP series.  Each row in that table locates the records that associate a particular series with its SNPs in the chrom_1_snp_series_items table.  Information in each of the rows in the SNP series data tables can also be used directly to locate the SNP associations in the snp_series_items table for the particular chromosome.

SNP Series Table


'first_snp_index'

The snp_index of the first SNP in the series.  This index identifies the series.

'item_data_start'

The offset of the row in the snp_series_items table that starts the rows that associate this series and its SNPs.

'item_data_count'

The number of SNPs in this series and the number of rows in the snp_series_items table that associates this series with its SNPs.

SNP Series Items Table


'snp_index'

The index of the SNP that can be used to find its data in the SNP data file.

'first_snp_index'

The index of the first SNP in the series this SNP belongs to.

Chromosome Sample Populations File


The file allele_country_and_region_plus_maps.h5 holds the data that associates region and country population chromosome samples with the 5008 elements in the allele mask array.  The root folder of the file holds two tables.  The country_alleles table associates each country with the mask of samples that comes from that country's population.  The region_alleles table associates each region with the mask of samples that comes from that region's population.  Thousand genome African and South Asian regions were divided into populations that actually live in the region and those that live elsewhere.

Country Alleles Table


'index'

Offset of the row with the country's data.

'region_index'

Offset of the row in the region table that holds the data for the region that the country is part of.

'code'

A three letter code for the country.

'name'

The full name of the country.

'allele_mask'

A true false array that indicates the elements of the 5008 byte allele mask array that come from this country.

Region Alleles Table


'index'

Offset of the row with the region's data.

'code'

A three letter code for the region.

'name'

The full name of the region.

'allele_mask'

A true false array that indicates the elements of the 5008 byte allele mask array that come from this region.





No comments:

Post a Comment

Assembling Genomes_dnj Packages

The genomes_dnj_2  github repository was split from the  genomes_dnj  repository to make it easier for anyone who wanted to use the genomes_...