Tuesday, November 21, 2017

Grouping SNPs Into Series

Filtering Thousand Genome Data


The first step in processing the Thousand Genomes data was to filter the data for all autosomal chromosomes, selecting the single nucleotide polymorphisms (SNPs) that were expressed by at least 16 of the data's 5008 chromosome samples.  This process selected 18,336,427 SNPs.  The data for those SNPs was formatted into chromosome HDF5 SNP data files to make it easier to access during analysis.
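As a rough illustration of the filtering rule, the sketch below keeps the SNPs carried by at least 16 chromosome samples. It is not the code from the repository: the `select_snps` function and the 0/1 genotype matrix layout (one row per SNP, one column per chromosome sample) are assumptions made for the example, and the toy matrix has only 20 samples instead of 5008.

```python
import numpy as np

MIN_SAMPLE_COUNT = 16  # threshold stated in the post


def select_snps(genotypes: np.ndarray, min_count: int = MIN_SAMPLE_COUNT) -> np.ndarray:
    """Return the row indices of SNPs expressed by at least min_count samples.

    genotypes is assumed to be a 0/1 matrix with one row per SNP and one
    column per chromosome sample (a hypothetical layout for this sketch).
    """
    sample_counts = genotypes.sum(axis=1)
    return np.flatnonzero(sample_counts >= min_count)


# Toy data: 3 SNPs across 20 chromosome samples.
genotypes = np.zeros((3, 20), dtype=np.uint8)
genotypes[0, :16] = 1   # expressed by exactly 16 samples: kept
genotypes[1, :15] = 1   # expressed by 15 samples: dropped
genotypes[2, :] = 1     # expressed by all 20 samples: kept

selected = select_snps(genotypes)
```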

Grouping SNPs


The second step was to group SNPs into series.  The code for that process is in the file autosome_snp_data/data_preparation/chrom_snp_series_data_finder.py.  The ideal goal was to group SNPs by their association with a set of samples, such that every sample in the set expressed all of the SNPs in a series and no other sample expressed any of those SNPs.

90% Threshold


Both match requirements were relaxed to 90%.  A sample was counted as expressing a series if it expressed 90% of the SNPs in it.  In turn, 90% of the samples expressing any SNP in the series needed to meet that requirement of expressing 90% of all the SNPs in the series.
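The two 90% requirements can be sketched as a check on a small matrix holding only the SNPs of one candidate series. This is an illustration, not the repository's implementation: the function names are hypothetical, the matrix layout (rows are SNPs, columns are chromosome samples) is assumed, and "90%" is read as "at least 90%". Integer arithmetic is used so the comparison at exactly 90% is not affected by floating-point rounding.

```python
import numpy as np

THRESHOLD_PERCENT = 90  # relaxed match requirement from the post


def series_carriers(series_genotypes: np.ndarray) -> np.ndarray:
    """Boolean mask of samples that express at least 90% of the series SNPs."""
    snp_count = series_genotypes.shape[0]
    snps_per_sample = series_genotypes.sum(axis=0)
    return 100 * snps_per_sample >= THRESHOLD_PERCENT * snp_count


def series_is_consistent(series_genotypes: np.ndarray) -> bool:
    """True if at least 90% of the samples expressing any SNP are carriers."""
    carriers = series_carriers(series_genotypes)
    expresses_any = series_genotypes.any(axis=0)
    any_count = int(expresses_any.sum())
    if any_count == 0:
        return False
    return 100 * int(carriers.sum()) >= THRESHOLD_PERCENT * any_count


# Toy series: 10 SNPs across 12 chromosome samples.
series = np.zeros((10, 12), dtype=np.uint8)
series[:, :10] = 1   # samples 0-9 express every SNP ...
series[0, 9] = 0     # ... except sample 9, which misses one (9 of 10)
series[3, 10] = 1    # sample 10 expresses a single stray SNP

carrier_count = int(series_carriers(series).sum())
consistent = series_is_consistent(series)

# A second stray sample pushes carriers below 90% of all expressing samples.
noisier = series.copy()
noisier[5, 11] = 1
noisier_consistent = series_is_consistent(noisier)
```

In the toy data 10 of the 11 samples that express any SNP are carriers, which passes the check; with 10 of 12 it fails.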

Recursive Matching


The second relaxation of the algorithm was a process for recursive extension of a series.  The process started with the ungrouped SNP at the lowest chromosome position, that is, the first SNP that had not already been grouped into a series.  The samples that expressed that SNP were used as a test set.  A candidate SNP was added to the series if 90% of the test set expressed it and if the test set samples expressing it were more than 90% of all the samples that expressed it.  When no more matches could be found for a test set, the recursive procedure tried to use the highest position SNP in the series as the source of a new test sample set.  Additional SNPs at higher chromosome positions that met the match criteria for the new test set were added to the series.  This process was carried out recursively for as long as additional SNPs could be added to the series.  Chromosome samples were considered to express the series if they expressed at least 90% of the complete series of SNPs.
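The extension procedure can be sketched as follows. This is a simplified reading of the description above, not the code in chrom_snp_series_data_finder.py: the function names are hypothetical, the genotype matrix is assumed to be boolean with rows sorted by chromosome position, the recursion is written as a loop, and every higher-position SNP is scanned on each pass, which would be far too slow for real chromosome data.

```python
import numpy as np

PCT = 90  # match threshold from the post


def matches(genotypes: np.ndarray, test_set: np.ndarray, snp: int) -> bool:
    """Apply both match criteria for a candidate SNP against a test set."""
    carriers = genotypes[snp]
    shared = int(np.count_nonzero(carriers & test_set))
    total = int(np.count_nonzero(carriers))
    test_size = int(np.count_nonzero(test_set))
    # At least 90% of the test set expresses the SNP, and the test set
    # samples are more than 90% of all samples expressing the SNP.
    return total > 0 and 100 * shared >= PCT * test_size and 100 * shared > PCT * total


def build_series(genotypes: np.ndarray, seed: int, grouped: np.ndarray) -> list:
    """Grow a series from seed, re-seeding from its highest-position SNP."""
    series = [seed]
    start = seed
    while True:
        test_set = genotypes[start]
        added = [snp for snp in range(start + 1, genotypes.shape[0])
                 if not grouped[snp] and matches(genotypes, test_set, snp)]
        if not added:
            return series
        series.extend(added)
        start = added[-1]  # highest-position SNP now in the series


def group_snps(genotypes: np.ndarray) -> list:
    """Group every SNP into a series, seeding from the lowest ungrouped SNP."""
    grouped = np.zeros(genotypes.shape[0], dtype=bool)
    all_series = []
    for seed in range(genotypes.shape[0]):
        if grouped[seed]:
            continue
        series = build_series(genotypes, seed, grouped)
        grouped[series] = True
        all_series.append(series)
    return all_series


# Toy data: 5 SNPs (rows, ordered by position) across 22 chromosome samples.
g = np.zeros((5, 22), dtype=bool)
g[0, :20] = True    # seed of the first series
g[1, :18] = True    # matches the SNP 0 test set
g[2, 20:] = True    # unrelated samples
g[3, :17] = True    # too few carriers for SNP 0, but matches the SNP 1 test set
g[4, 20:] = True    # same samples as SNP 2

result = group_snps(g)
```

In the toy data SNP 3 fails against the test set from SNP 0 (17 of 20 samples) but is picked up once SNP 1 becomes the source of the test set (17 of 18), which is the effect the re-seeding step is meant to have.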

Results


All of the SNPs from each of the 22 autosomal chromosomes were grouped into series.  A row of data for each of those series was created in the chromosomes SNP series data HDF5 file.  That file included 5,045,658 rows for single SNP series.  The rest of the SNPs could be grouped into series of at least two SNPs.  The procedure identified 946,618 series of four or more SNPs.  A total of 10,123,510 SNPs were grouped into those series.  The analysis carried out with this data focuses on the 946,618 series of four or more SNPs.  Generally, the term series as used in the discussion of the results of this analysis refers to those series that include at least four SNPs.
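A quick tally of the reported counts shows how the SNPs divide up. The sketch assumes that every one of the 18,336,427 SNPs selected in the filtering step belongs to exactly one series. The figure for series of two or three SNPs is derived here by subtraction and is not a number reported in the post.

```python
total_snps = 18_336_427          # SNPs selected in the filtering step
single_snp_series = 5_045_658    # rows for single SNP series
series_of_four_plus = 946_618    # series of four or more SNPs
snps_in_four_plus = 10_123_510   # SNPs grouped into those series

# Derived by subtraction, assuming each SNP is in exactly one series.
snps_in_two_or_three = total_snps - single_snp_series - snps_in_four_plus

# Average number of SNPs in a series of four or more.
mean_series_size = snps_in_four_plus / series_of_four_plus

print(f"SNPs in series of two or three: {snps_in_two_or_three:,}")
print(f"Mean size of series with four or more SNPs: {mean_series_size:.2f}")
```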

The analysis documented in the genomes_dnj git repository (https://github.com/dnjake/genomes_dnj) examined 76 SNP series in detail.  All of them have distributions of SNP and chromosome sample associations that are very far from the kind of distribution expected for independent random variables.
