Changes

From Genome Analysis Wiki
no edit summary
Line 6: Line 6:     
[[Media:Variant Calling and Filtering for INDELs.pdf|Intro Slides]]
 
[[Media:Variant Calling and Filtering for INDELs.pdf|Intro Slides]]
 +
 +
== Setup in person at the SeqShop Workshop ==
 +
''This section is specifically for the SeqShop Workshop computers.''
 +
<div class="mw-collapsible mw-collapsed" style="width:600px">
 +
''If you are not running during the SeqShop Workshop, please skip this section.''
 +
<div class="mw-collapsible-content">
    
{{SeqShopLogin}}
 
{{SeqShopLogin}}
   −
== Setup your run environment==
+
=== Setup your run environment===
 +
This is the same setup you did for the previous tutorial, but you need to redo it each time you log in.
   −
This is the same setup you did for the previous tutorial, but you need to redo it each time you log in. This will setup some environment variables to point you to:
+
This will set up some environment variables to point you to:
* GotCloud program
+
* [[GotCloud]] program
 
* Tutorial input files
 
* Tutorial input files
 
* Setup an output directory
 
* Setup an output directory
Line 19: Line 26:  
* You won't see any output after running <code>source</code>
 
* You won't see any output after running <code>source</code>
 
** It silently sets up your environment
 
** It silently sets up your environment
 +
** If you want to view the details of the setup, type
 +
less /home/mktrost/seqshop/setup.txt
 +
and press 'q' to finish.
 +
 
<div class="mw-collapsible mw-collapsed" style="width:200px">
 
<div class="mw-collapsible mw-collapsed" style="width:200px">
 
View setup.txt
 
View setup.txt
Line 25: Line 36:  
</div>
 
</div>
 
</div>
 
</div>
 +
</div>
 +
</div>
 +
 +
== Setup when running on your own outside of the SeqShop Workshop ==
 +
''This section is specifically for running on your own outside of the SeqShop Workshop.''
 +
<div class="mw-collapsible" style="width:600px">
 +
''If you are running during the SeqShop Workshop, please skip this section.''
 +
<div class="mw-collapsible-content">
 +
=== Download the example data ===
 +
 +
=== Setup your run environment ===
 +
 +
Environment variables will be used throughout the tutorial.
    +
We recommend that you set up these variables so you won't have to modify every command in the tutorial.
 +
 +
<div class="mw-collapsible mw-collapsed" style="width:500px">
 +
I'm using bash (replace the paths below with the appropriate paths):
 +
<div class="mw-collapsible-content">
 +
* Point to where you installed GotCloud
 +
*:<pre>export GC=/home/username/gotcloud</pre>
 +
* Point to where you installed the seqshop files
 +
*:<pre>export SS=/home/username/seqshop/</pre>
 +
* Point to where you want the output to go
 +
*:<pre>export OUT=/home/username/seqshop_output/</pre>
 +
</div>
 +
</div>
 +
 +
<div class="mw-collapsible mw-collapsed" style="width:500px">
 +
I'm using tcsh (replace the paths below with the appropriate paths):
 +
<div class="mw-collapsible-content">
 +
* Point to where you installed GotCloud
 +
*:<pre>setenv GC /home/username/gotcloud</pre>
 +
* Point to where you installed the seqshop files
 +
*:<pre>setenv SS /home/username/seqshop/</pre>
 +
* Point to where you want the output to go
 +
*:<pre>setenv OUT /home/username/seqshop_output/</pre>
 +
</div>
 +
</div>
 +
 +
</div>
 +
</div>
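Because every later command depends on <code>GC</code>, <code>SS</code>, and <code>OUT</code>, it can help to confirm they are set before continuing. The following is a minimal sketch for bash/sh users only; <code>check_seqshop_env</code> is a hypothetical helper name, not part of GotCloud or the seqshop files.

```shell
# Hypothetical helper (not part of GotCloud): report which tutorial
# variables are set, and return non-zero if any of them is missing.
check_seqshop_env() {
    missing=0
    for name in GC SS OUT; do
        # Indirect lookup via eval keeps this usable in plain POSIX sh.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set"
            missing=1
        else
            echo "$name=$value"
        fi
    done
    return "$missing"
}
```

After pasting the function into your shell, run <code>check_seqshop_env</code>; any line ending in "is not set" means the matching <code>export</code> above still needs to be run.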
    
== Examining GotCloud Indel Input files ==
 
== Examining GotCloud Indel Input files ==
Line 34: Line 86:     
== Running GotCloud Indel ==
 
== Running GotCloud Indel ==
  ${GC}/gotcloud indel --conf ${IN}/gotcloud.conf --numjobs 2 --region 22:36000000-37000000
+
  ${GC}/gotcloud indel --conf ${SS}/gotcloud.conf --numjobs 2 --region 22:36000000-37000000 --base_prefix ${SS} --outdir ${OUT}
 +
* <code>${GC}/gotcloud</code> runs GotCloud
 +
* <code>indel</code> tells GotCloud you want to run the indel calling pipeline.
 +
* <code>--conf</code> tells GotCloud the name of the configuration file to use.
 +
** The configuration for this test was downloaded with the seqshop input files.
 
* --numjobs tells GotCloud how many jobs to run in parallel
 
* --numjobs tells GotCloud how many jobs to run in parallel
 
** Depends on your system
 
** Depends on your system
 
* --region 22:36000000-37000000
 
* --region 22:36000000-37000000
 
** The sample files are just a small region of chromosome 22, so to save time, we tell GotCloud to ignore the other regions
 
** The sample files are just a small region of chromosome 22, so to save time, we tell GotCloud to ignore the other regions
 +
* <code>--base_prefix</code> tells GotCloud the prefix to prepend to relative paths.
 +
** The Configuration file cannot read environment variables, so we need to tell GotCloud the path to the input files, ${SS}
 +
** Alternatively, gotcloud.conf could be updated to specify the full paths
 +
* <code>--outdir</code> tells GotCloud where to write the output.
 +
** This could be specified in gotcloud.conf, but it is given on the command line so that you can change the output location through ${OUT}
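The options above all depend on paths that come from your environment, so a quick check before a long run can save time. The sketch below assumes bash/sh; <code>preflight_indel</code> is a hypothetical name, and it only verifies that the GotCloud executable and the configuration file are reachable, then prints the command from this tutorial without running it.

```shell
# Hypothetical pre-flight check: confirm the program and configuration file
# exist, then print (not run) the tutorial's indel command for inspection.
preflight_indel() {
    if [ ! -x "${GC}/gotcloud" ]; then
        echo "cannot execute ${GC}/gotcloud" >&2
        return 1
    fi
    if [ ! -r "${SS}/gotcloud.conf" ]; then
        echo "cannot read ${SS}/gotcloud.conf" >&2
        return 1
    fi
    # Same options as the tutorial command; nothing here is executed.
    echo "${GC}/gotcloud indel --conf ${SS}/gotcloud.conf --numjobs 2 --region 22:36000000-37000000 --base_prefix ${SS} --outdir ${OUT}"
}
```

If <code>preflight_indel</code> prints the command, the paths resolved as expected and you can run the command itself; an error message names the path that needs fixing.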
    
<div class="mw-collapsible mw-collapsed" style="width:500px">
 
<div class="mw-collapsible mw-collapsed" style="width:500px">
Line 66: Line 127:  
Let's look at the <code>final</code> directory:
 
Let's look at the <code>final</code> directory:
 
  ls ${OUT}/final
 
  ls ${OUT}/final
Just a <code>chr22</code> directory, so look inside of there:
+
 
ls ${OUT}/vcfs/chr22
   
;Can you identify the final indel VCF?
 
;Can you identify the final indel VCF?
 
<div class="mw-collapsible mw-collapsed" style="width:350px">
 
<div class="mw-collapsible mw-collapsed" style="width:350px">
Line 97: Line 157:  
Let's see if we found the indel  
 
Let's see if we found the indel  
   −
  $GC/bin/tabix $OUT/final/all.genotypes.vcf.gz 22:36662041 | head -1  
+
  ${GC}/bin/tabix ${OUT}/final/all.genotypes.vcf.gz 22:36662041 | head -1  
    
Did you see a variant at the position?
 
Did you see a variant at the position?
Line 103: Line 163:  
Let's check the sequence data to confirm that the variant really exists
 
Let's check the sequence data to confirm that the variant really exists
   −
  $GC/bin/samtools tview $IN/bams/HG01101.recal.bam $REF/human.g1k.v37.chr22.fa
+
  ${GC}/bin/samtools tview ${SS}/bams/HG01101.recal.bam ${SS}/ref22/human.g1k.v37.chr22.fa
    
* Type 'g' to go to a specific position
 
* Type 'g' to go to a specific position
Line 125: Line 185:  
==== Header ====
 
==== Header ====
 
First, let's look at the header:
 
First, let's look at the header:
  $GC/bin/tabix -H $OUT/final/all.genotypes.vcf.gz
+
  ${GC}/bin/tabix -H ${OUT}/final/all.genotypes.vcf.gz
    
The header is as follows:
 
The header is as follows:
Line 164: Line 224:  
To view a specific region of records (such as APOL1 g2 allele)
 
To view a specific region of records (such as APOL1 g2 allele)
   −
  $GC/bin/tabix $OUT/final/all.genotypes.vcf.gz 22:36662041-36662041
+
  ${GC}/bin/tabix ${OUT}/final/all.genotypes.vcf.gz 22:36662041-36662041
    
The columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, Genotype fields denoted by the sample name.
 
The columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, Genotype fields denoted by the sample name.
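To make the column layout easier to read, the fixed columns can be labeled with a small filter. This is a sketch assuming bash/sh and a standard <code>awk</code>; <code>vcf_fixed_columns</code> is a hypothetical helper name, and it prints only the first seven columns (INFO, FORMAT, and the per-sample genotype fields are left out because they are long).

```shell
# Hypothetical helper: label the first seven tab-separated VCF columns
# of each record read from standard input, skipping header lines.
vcf_fixed_columns() {
    awk -F '\t' '
        /^#/ { next }   # header lines start with "#"
        {
            printf "CHROM=%s POS=%s ID=%s REF=%s ALT=%s QUAL=%s FILTER=%s\n", $1, $2, $3, $4, $5, $6, $7
        }'
}
```

One way to use it is to pipe the region query into it:

 ${GC}/bin/tabix ${OUT}/final/all.genotypes.vcf.gz 22:36662041-36662041 | vcf_fixed_columns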
Line 279: Line 339:  
It is usually useful to examine the call sets against known data sets for the passed variants.
 
It is usually useful to examine the call sets against known data sets for the passed variants.
   −
  ${GC}/bin/vt profile_indels -g ${VTREF}/indel.reference.txt  -r ${REF}/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS"
+
  ${GC}/bin/vt profile_indels -g ${VTREF}/indel.reference.txt  -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS"
      Line 319: Line 379:  
We perform the same analysis for the failed variants. The relatively low overlap with known data sets implies a reasonable tradeoff between sensitivity and specificity.
 
We perform the same analysis for the failed variants. The relatively low overlap with known data sets implies a reasonable tradeoff between sensitivity and specificity.
   −
   ${GC}/bin/vt profile_indels -g ${VTREF}/indel.reference.txt  -r ${REF}/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f  "~PASS"
+
   ${GC}/bin/vt profile_indels -g ${VTREF}/indel.reference.txt  -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f  "~PASS"
    
   data set
 
   data set
