Thursday, November 23, 2017

Assembling Genomes_dnj Packages

The genomes_dnj_2 github repository was split from the genomes_dnj repository to make it easier for anyone who wants to use the genomes_dnj source code to understand a specific part of the work or to reproduce it.  Both repositories have similar package structures and share the same large dependency: hdf5 files of thousand genome data for each autosomal chromosome.

The google drive genomes_dnj folder contains all of the packages from the genomes_dnj and genomes_dnj_2 repositories along with all of the hdf5 data files used in the analysis of thousand genome data.  In all cases, the data in the hdf5 files is accessed through modules that are in the same package as the data.  The data_preparation folders in some packages do not contain an __init__.py file and are, therefore, not subpackages.  They contain source modules that are provided primarily to document how the data was prepared.
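As an illustration of the package versus plain-folder distinction, here is a minimal sketch that walks a directory tree and separates folders that contain an __init__.py file from folders that do not.  The function name classify_folders is made up for this example and is not part of the genomes_dnj code.  Note that newer python 3 versions can still import a folder without __init__.py as a namespace package, so "package" here means a regular package.

```python
import os


def classify_folders(root):
    """Split the folders under root into regular packages and plain folders.

    A folder counts as a regular package when it contains __init__.py.
    Returns two sorted lists of paths relative to root.
    """
    packages, plain_folders = [], []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip hidden folders and caches such as .git and __pycache__.
        dirnames[:] = [d for d in dirnames if not d.startswith((".", "__"))]
        rel = os.path.relpath(dirpath, root)
        if rel == ".":
            continue  # the root itself is not reported
        if "__init__.py" in filenames:
            packages.append(rel)
        else:
            plain_folders.append(rel)
    return sorted(packages), sorted(plain_folders)
```

Run against a cloned repository, a call such as classify_folders("path/to/clone") would be expected to list the data_preparation folders among the plain folders.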

The simplest way to set up a working python environment is to clone one of the github repositories and replace the packages containing hdf5 files with copies downloaded from google drive.  The autosome_snp_data package is the big dependency; it contains more than 7 gigabytes of hdf5 files.  Note that the genomes_dnj style of analysis can be run on individual chromosomes just by downloading the hdf5 files for those chromosomes and placing them in the autosome_snp_data package.
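After copying files into place, a quick inventory can confirm which hdf5 files actually landed in a package folder and how much space they take.  The sketch below uses only the standard library.  The file extensions it looks for are an assumption, since the naming of the files is not described here, and the function names are made up for this example.

```python
from pathlib import Path

# Assumed extensions; adjust to match the names of the downloaded files.
HDF5_SUFFIXES = {".h5", ".hdf5", ".hdf"}


def hdf5_inventory(package_dir):
    """Return a sorted list of (file name, size in bytes) for hdf5 files."""
    files = sorted(
        p for p in Path(package_dir).iterdir()
        if p.is_file() and p.suffix.lower() in HDF5_SUFFIXES
    )
    return [(p.name, p.stat().st_size) for p in files]


def total_gigabytes(inventory):
    """Sum the sizes in an inventory and convert bytes to gigabytes."""
    return sum(size for _, size in inventory) / 1e9
```

For a complete copy of the autosome_snp_data package, total_gigabytes(hdf5_inventory("path/to/autosome_snp_data")) should come out above 7.  A partial download of a few chromosomes will show a shorter list and a smaller total.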

The genomes_dnj_2 packages revise the processing for doing statistical analysis by series, by chromosome, and across the whole genome.  The chrom_plots, genome_plots, stats_by_pos, and stats_by_series subpackages all create hdf5 files of statistics data from the data in the autosome_snp_data package.  In all cases, the code for creating the hdf5 files is in a package module.  Packages with the hdf5 data already created are in the google drive genomes_dnj folder.  It should be possible to download these packages and do the genomes_dnj_2 analysis without downloading the individual chromosome data needed for the autosome_snp_data package.

Except for jupyter notebook processing, all of the module references in the source code are relative to the root package.  It should be possible to give that root package any name a user wants.
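To show why relative references make the root name arbitrary, here is a small self-contained sketch.  It builds a throwaway package in a temporary folder, with one module that reaches a sibling subpackage through a relative import, and then imports it under whatever root name is given.  The subpackage and module names (data_demo, stats_demo, reader, summary) are invented for the demonstration and do not exist in the repositories.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_package(parent, root_name):
    """Create a tiny package tree whose modules use only relative imports."""
    root = Path(parent) / root_name
    (root / "stats_demo").mkdir(parents=True)
    (root / "data_demo").mkdir()
    for folder in (root, root / "stats_demo", root / "data_demo"):
        (folder / "__init__.py").write_text("")
    (root / "data_demo" / "reader.py").write_text(
        "def chrom_count():\n"
        "    return 22\n"
    )
    # The leading dots refer to the enclosing package, not to its name.
    (root / "stats_demo" / "summary.py").write_text(
        "from ..data_demo.reader import chrom_count\n"
        "def describe():\n"
        "    return 'autosomes: %d' % chrom_count()\n"
    )


def describe_under_name(root_name):
    """Build the demo package under root_name, import it, and call describe()."""
    with tempfile.TemporaryDirectory() as tmp:
        build_demo_package(tmp, root_name)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            module = importlib.import_module(root_name + ".stats_demo.summary")
            return module.describe()
        finally:
            sys.path.remove(tmp)
            # Forget the demo modules so the temporary folder can be discarded.
            for name in list(sys.modules):
                if name == root_name or name.startswith(root_name + "."):
                    del sys.modules[name]
```

Calling describe_under_name("genomes_dnj_2") and describe_under_name("my_own_name") gives the same result, because nothing inside the package spells out the root name.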
