Difference between revisions of "ChunkChromosome"

From Genome Analysis Wiki

Latest revision as of 10:02, 2 February 2017

ChunkChromosome is a helper utility for minimac and MaCH. It splits the markers on a chromosome into overlapping chunks, so that very large datasets can be analyzed in slices. For information on how to put the resulting chunks back together, see the Ligate Minimac page.

Parameters

ChunkChromosome expects three parameters:

  • A data file (specified with the -d command line option), listing all the markers along one chromosome. The data file can optionally include phenotype and other information, which is safely ignored.
  • A desired core chunk size, in markers (specified with the -n command line option and defaulting to 5000 markers).
  • A desired overlap between chunks, also in markers (specified with the -o command line option and defaulting to 500 markers).
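To make the two size parameters concrete, here is a minimal sketch in POSIX sh of how a chromosome might be laid out in chunks. It assumes that cores are consecutive blocks of n markers and that each chunk is padded by o markers on both sides; the actual program may place boundaries differently (for example, to even out chunk sizes). The helpers count_markers and chunk_bounds and the sample data file are hypothetical and are not part of ChunkChromosome.

```shell
#!/bin/sh
# Illustrative sketch only: these helpers are hypothetical, not part of
# ChunkChromosome, and the layout rule below is an assumption.

# Count marker lines (type code "M" in column 1) in a data file.
count_markers() {
    awk '$1 == "M" { n++ } END { print n + 0 }' "$1"
}

# chunk_bounds TOTAL N O
# Print one line per chunk: "chunk core_start core_end window_start window_end"
# (1-based marker indices). The window is the core padded by O markers on each
# side, clamped to the ends of the chromosome.
chunk_bounds() {
    total=$1; n=$2; o=$3
    chunk=1
    core_start=1
    while [ "$core_start" -le "$total" ]; do
        core_end=$((core_start + n - 1))
        if [ "$core_end" -gt "$total" ]; then core_end=$total; fi
        win_start=$((core_start - o))
        if [ "$win_start" -lt 1 ]; then win_start=1; fi
        win_end=$((core_end + o))
        if [ "$win_end" -gt "$total" ]; then win_end=$total; fi
        echo "$chunk $core_start $core_end $win_start $win_end"
        chunk=$((chunk + 1))
        core_start=$((core_end + 1))
    done
}

# Example: a made-up data file with one trait line and seven markers,
# chunked with a core size of 3 and an overlap of 1.
dat=$(mktemp)
printf 'T height\nM rs1\nM rs2\nM rs3\nM rs4\nM rs5\nM rs6\nM rs7\n' > "$dat"
markers=$(count_markers "$dat")
chunk_bounds "$markers" 3 1
rm -f "$dat"
```

Under these assumptions the example prints three chunks with cores 1-3, 4-6 and 7-7 and windows 1-4, 3-7 and 6-7. The trait line is skipped when counting, in line with the note that non-marker information in the data file is ignored.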

Usage

Suppose you plan to run 1000 Genomes Imputation using MaCH and minimac. Typically, you'd accomplish this by running the following commands:

awk '{ if ($1 == "M") print $2; }' < chr1.dat > chr1.snps
mach1 -d chr1.dat -p chr1.ped --rounds 20 --states 200 --phase --prefix chr1.haps
minimac --refHaps 1000genomes.chr1.haps.gz  --refSnps 1000genomes.chr1.snps --haps chr1.haps.gz --snps chr1.snps --rounds 5 --states 200 --prefix chr1.imputed

These commands would phase (with MaCH) and then impute (with minimac) an entire chromosome. While this works, it can be time-consuming for large chromosomes and large numbers of individuals. ChunkChromosome streamlines the process by splitting each chromosome into chunks that can be run in parallel.

#!/bin/tcsh

@ length = 2500
@ overlap = 500

# Estimate haplotypes for all individuals in 2500-marker chunks with a 500-marker overlap
foreach chr (`seq 1 22`)

   ChunkChromosome -d chr$chr.dat -n $length -o $overlap

   foreach chunk (chunk*-chr$chr.dat)

      mach -d $chunk -p chr$chr.ped --prefix ${chunk:r} \
           --rounds 20 --states 200 --phase --sample 5 >& ${chunk:r}-mach.log &

   end

end
# Block until every background phasing job has finished
wait

# Impute into phased haplotypes
foreach chr (`seq 1 22`)

   foreach chunk (chunk*-chr$chr.dat)

      set haps = /data/1000g/hap/all/20101123.chr$chr.hap.gz
      set snps = /data/1000g/snps/chr$chr.snps

      minimac --refHaps $haps --refSnps $snps --rounds 5 --states 200 \
              --haps ${chunk:r}.gz --snps ${chunk}.snps  --autoClip autoChunk-chr$chr.dat  \
              --prefix ${chunk:r}.imputed >& ${chunk:r}-minimac.log &

   end

end
# Block until every background imputation job has finished
wait

The autoChunk file, generated by ChunkChromosome, tells minimac which markers are of interest in each chunk. This allows chunks to overlap (which improves accuracy near the edges) while still ensuring that each marker is imputed only once.
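The clipping idea can be illustrated with a small POSIX sh sketch. It does not read the autoChunk file, whose format is not described here. Instead it assumes the same layout as before (cores of n consecutive markers, each padded by o markers on both sides) and reports, for one marker, which chunks cover it and which single chunk keeps it. The helper classify_marker is hypothetical.

```shell
#!/bin/sh
# Illustrative sketch only: classify_marker is a hypothetical helper, and the
# layout (cores of N markers padded by O on each side) is an assumption, not
# the documented behaviour of ChunkChromosome or minimac.

# classify_marker MARKER TOTAL N O
# For every chunk whose padded window covers MARKER (a 1-based index), print
# "<chunk> keep" if MARKER lies in that chunk's core, or "<chunk> clip" if it
# lies only in the overlap and would be discarded for that chunk.
classify_marker() {
    marker=$1; total=$2; n=$3; o=$4
    chunk=1
    core_start=1
    while [ "$core_start" -le "$total" ]; do
        core_end=$((core_start + n - 1))
        if [ "$core_end" -gt "$total" ]; then core_end=$total; fi
        win_start=$((core_start - o))
        if [ "$win_start" -lt 1 ]; then win_start=1; fi
        win_end=$((core_end + o))
        if [ "$win_end" -gt "$total" ]; then win_end=$total; fi
        if [ "$marker" -ge "$win_start" ] && [ "$marker" -le "$win_end" ]; then
            if [ "$marker" -ge "$core_start" ] && [ "$marker" -le "$core_end" ]; then
                echo "$chunk keep"
            else
                echo "$chunk clip"
            fi
        fi
        chunk=$((chunk + 1))
        core_start=$((core_end + 1))
    done
}

# Marker 2400 of 6000, with a core size of 2500 and an overlap of 500.
classify_marker 2400 6000 2500 500
```

With these numbers, marker 2400 falls inside the windows of chunks 1 and 2 but only in the core of chunk 1, so the sketch prints "1 keep" followed by "2 clip". Every marker is kept by exactly one chunk because the cores do not overlap.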

Download

You can download the source code for ChunkChromosome as a tarball, generic-ChunkChromosome-2014-05-27.tar.gz (http://csg.sph.umich.edu//cfuchsb/generic-ChunkChromosome-2014-05-27.tar.gz). After downloading it, unpack the archive and run make to compile the tool.