Protocols
Preprocessing (aligning --> BAM)
Step 1: PrepareFastQ
Spike in the PhiX reads
PhiX reads are spiked into each sample to verify that the pipeline ran correctly. Later in the pipeline (step 25) a concordance check verifies that the spiked-in SNPs are found.
Scriptname: SpikePhiX
Input: raw sequence file in the form of a gzipped fastq file
Output: FastQ files (${filePrefix}${lane}${barcode}.fq.gz)
Step 2: Check the Illumina encoding
In this step the encoding of the FastQ files is checked. Older sequence data (2012 and earlier) uses the old Phred+64 encoding (known as Illumina 1.5 encoding); newer sequence data is encoded in Illumina 1.8 or 1.9 (Phred+33). If the data is 1.5, it is converted to 1.9 encoding.
Toolname: seqtk Scriptname: CheckIlluminaEncoding Input: FastQ files (${filePrefix}${lane}${barcode}.fq.gz) Output: encoded.fq.gz (only if conversion was necessary)
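If conversion is needed, a minimal sketch of the seqtk call (output name illustrative, not the pipeline's exact invocation):
seqtk seq -Q64 -V ${filePrefix}.fq.gz | gzip -c > ${filePrefix}.encoded.fq.gz  # shift Phred+64 qualities to Phred+33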
Step 3: Calculate QC metrics on raw data
In this step, quality control (QC) metrics are calculated for the raw sequencing data. This is done using the tool FastQC, which runs a series of tests on the input file. The output is a text file which is used to create a summary in the form of several HTML pages with graphs for each test. Both the text file and the HTML document provide a flag for each test: pass, warning or fail. This flag is based on criteria set by the makers of this tool. Warnings or even failures do not necessarily mean that there is a problem with the data, only that the data is unusual compared to these criteria. It is possible that the biological nature of the sample means that this particular bias is to be expected.
Toolname: FastQC Scriptname: Fastqc Input: fastq_1.fq.gz and fastq_2.fq.gz (${filePrefix}${lane}_${barcode}.fq.gz) Output: ${filePrefix}.fastqc.zip archive containing amongst others the HTML document and the text file
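A minimal sketch of the FastQC call (output directory illustrative):
fastqc fastq_1.fq.gz fastq_2.fq.gz --outdir ${qcDir}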
Step 4: Alignment + SortSam
In this step, the Burrows-Wheeler Aligner (BWA) is used to align the (mostly paired-end) sequencing data to the reference genome. The method that is used is BWA mem. The output is a FIFO-piped SAM file; this way the sorting of the SAM/BAM file can be done in one go, without writing it to disk in between.
Scriptname: BwaAlignAndSortBam
Aligning
Toolname: BWA Input: raw sequence file in the form of a gzipped fastq file (${filePrefix}.fq.gz) Output: FIFO piped SAM formatted file (${filePrefix}.sam)
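A minimal sketch of the alignment command (thread count and read-group string are illustrative, not the pipeline's exact invocation):
bwa mem -M -t 8 -R "@RG\tID:${filePrefix}\tSM:${sample}\tPL:ILLUMINA" ${referenceGenome} fastq_1.fq.gz fastq_2.fq.gz > aligned.sam  # paths illustrative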
Sorting
java -Djava.io.tmpdir=${tempDir} -XX:ParallelGCThreads=4 -Xmx29G -jar /folder/to/picard/picard.jar SortSam INPUT=aligned.sam OUTPUT=aligned.sorted.bam SORT_ORDER=coordinate CREATE_INDEX=true
The sorted BAM files are then merged into a single BAM per sample:
sambamba merge merged.bam ${arrayOfSortedBams[@]}
Step 7: Marking duplicates
In this step, the BAM file is examined to locate duplicate reads, using Sambamba markdup. A mapped read is considered a duplicate if the start and end base of two or more reads are located at the same chromosomal position in comparison to the reference genome. For paired-end data the start and end locations of both ends need to be the same for the pair to be called a duplicate. One read pair of these duplicates is kept; the remaining ones are flagged as duplicate.
Toolname: Sambamba markdup Scriptname: MarkDuplicates Input: Merged BAM file (${sample}.merged.bam)
Output:
BAM file with duplicates flagged (${sample}.dedup.bam)
BAM index file (${sample}.dedup.bam.bai)
Mark duplicates:
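A minimal sketch (thread count illustrative; a separate index step is shown to produce the .bai):
sambamba markdup -t 4 ${sample}.merged.bam ${sample}.dedup.bam
sambamba index ${sample}.dedup.bam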
Step 8: Flagstat (dedup metrics)
Calculating dedup metrics
Toolname: Sambamba flagstat Scriptname: FlagstatMetrics Input: dedup BAM file (${sample}.dedup.bam) Output: .flagstat file (${sample}.dedup.bam.flagstat)
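A minimal sketch:
sambamba flagstat ${sample}.dedup.bam > ${sample}.dedup.bam.flagstat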
Indel calling with Manta, Convading & XHMM
Step 9a: Calling big deletions with Manta
In this step, the program Manta calls all types of structural variants (DEL, DUP, INV, TRA, INS) from the merged BAM file. The calls are written to 3 different gzipped VCF files: candidateSmallIndels, candidateSV and diploidSV, along with information such as the difference in length between the REF and ALT alleles, the type of structural variant and information about allele depth.
Toolname: Manta Scriptname: Manta Input: dedup BAM file (${sample}.dedup.bam) Output: ${sample}.candidateSmallIndels.vcf.gz ${sample}.candidateSV.vcf.gz ${sample}.diploidSV.vcf.gz
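Manta runs in two phases, sketched below (run directory and thread count illustrative, not the pipeline's exact invocation):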
Prepare workflow
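configManta.py --bam ${sample}.dedup.bam --referenceFasta ${referenceGenome} --runDir ${sample}_manta  # --exome can be added for targeted data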
Run workflow
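${sample}_manta/runWorkflow.py -m local -j 8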
Step 9b: CoNVaDING
CoNVaDING (Copy Number Variation Detection In Next-generation sequencing Gene panels) was designed for small (single-exon) copy number variation (CNV) detection in high-coverage next-generation sequencing (NGS) data, such as obtained by analysis of smaller targeted gene panels. This step runs the 4 Convading steps in one protocol. It is gender specific for the sex chromosomes. For more detail about this step: http://molgenis.github.io/software/CoNVaDING. Note: this step needs an already defined control group!
Toolname: Convading Scriptname: Convading Input: dedup BAM file (${sample}.dedup.bam) Output: file containing regions that contain a CNV ${sample}.finallist
Step 1: StartWithBam
Step 2: StartWithMatchScore
Step 3: StartWithBestScore
Step 4: CreateFinalList
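A hedged sketch of the first mode (flags per the CoNVaDING documentation; paths illustrative, and the later modes follow the same pattern):
perl CoNVaDING.pl -mode StartWithBam -inputDir ${bamDir} -outputDir ${outDir} -controlsDir ${controlsDir} -bed ${targets}.bed  # paths illustrative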
Step 9c: XHMM
The XHMM software suite was written to call copy number variation (CNV) from next-generation sequencing projects, where exome capture was used (or targeted sequencing, more generally). This protocol contains all the steps described here: http://atgu.mgh.harvard.edu/xhmm/tutorial.shtml
Toolname: XHMM
Scriptname: XHMM
Input: dedup BAM file (${sample}.dedup.bam)
Output: file containing regions that contain a CNV ${sample}.xcnv
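An abridged sketch of two of the tutorial steps (depth matrices come from GATK DepthOfCoverage; the PCA normalization and filtering steps in between are omitted, and file names follow the tutorial, not necessarily this pipeline):
xhmm --mergeGATKdepths -o DATA.RD.txt --GATKdepths sample1.DoC.sample_interval_summary --GATKdepths sample2.DoC.sample_interval_summary
xhmm --discover -p params.txt -r DATA.PCA_normalized.filtered.sample_zscores.RD.txt -R DATA.same_filtered.RD.txt -c DATA.xcnv -a DATA.aux_xcnv -s DATA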
Step 9d: Decision tree
To determine whether a CNV is true, the output of the Convading and XHMM steps is checked against a decision tree developed by L. Johannson and a student of his (M. Frans). The decision tree produces a file with very high-confidence CNVs.
Scriptname: DecisionTree
Input: multiple output files from XHMM (${SAMPLE}_step10.xcnv.final) and Convading (.shortlist.finallist.txt, .shortlist.txt, .totallist.txt and .longlist.txt)
Output: ${SAMPLE}.longlistplusplusFinal.txt
Step 9e: Annotating manta output
The Manta output is annotated with VEP. To prevent long runtimes and messy output, only the diploid output of Manta is annotated, since this output is QC'ed.
Toolname: VEP
Scriptname: MantaAnnotation
Input: Manta output (${sample}.diploidSV.vcf.gz)
Output: (${sample}.diploidSV_VEP.vcf.gz)
Determine gender
Step 10: GenderCalculate
Because a male has only one X chromosome, it is important to know whether the sample is male or female. Calculating the coverage on the non-pseudoautosomal region of chromosome X and comparing it to the average coverage on the complete genome predicts male or female well.
Toolname: Picard CalculateHsMetrics Scriptname: GenderCalculate Input: dedup BAM file (${sample}.dedup.bam) Output: ${dedupBam}.nonAutosomalRegionChrX_hs_metrics
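A minimal sketch (the interval list covering the non-pseudoautosomal ChrX region is illustrative):
java -jar picard.jar CalculateHsMetrics INPUT=${sample}.dedup.bam OUTPUT=${dedupBam}.nonAutosomalRegionChrX_hs_metrics BAIT_INTERVALS=nonAutosomalRegionChrX.interval_list TARGET_INTERVALS=nonAutosomalRegionChrX.interval_list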
Side steps (Cram conversion and concordance check)
Step 11: CramConversion
Producing more compressed alignment files (CRAM), decreasing file size by about 40%.
Toolname: Scramble Scriptname:CramConversion Input: dedup BAM file (${sample}.dedup.bam) Output: dedup CRAM file (${sample}.dedup.bam.cram)
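A minimal sketch of the conversion (flags per the Scramble documentation; reference path illustrative):
scramble -I bam -O cram -r ${referenceGenome} ${sample}.dedup.bam ${sample}.dedup.bam.cram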
Step 12: Make md5s for the BAM files
Small step to create md5sums for the BAMs created in the MarkDuplicates step.
Scriptname: MakeDedupBamMd5 Input: dedup BAM file (.merged.dedup.bam) Output: md5sums (.merged.dedup.bam.md5)
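A minimal sketch:
md5sum ${sample}.merged.dedup.bam > ${sample}.merged.dedup.bam.md5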
Coverage calculations (Diagnostics only)
Step 13: Calculate coverage per base and per target
Calculates coverage per base and per target; the output contains the chromosomal position, coverage per base and gene annotation.
Toolname: GATK DepthOfCoverage Scriptname: CoverageCalculations Input: dedup BAM file (.merged.dedup.bam) Output: tab-delimited file containing chromosomal position, coverage per base and gene annotation name (.coveragePerBase.txt)
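Both invocations are sketched below (GATK 3.x syntax; interval files illustrative, gene-annotation inputs omitted):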
Per base:
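java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R ${referenceGenome} -I ${sample}.merged.dedup.bam -L ${targets}.bed -o ${sample}.coveragePerBase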
Per target:
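java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R ${referenceGenome} -I ${sample}.merged.dedup.bam -L ${targets}.bed --omitDepthOutputAtEachBase -o ${sample}.coveragePerTarget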
Metrics calculations
Step 14 (a,b,c,d): Calculate alignment QC metrics
In this step, QC metrics are calculated for the alignment created in the previous steps. This is done using several QC related Picard tools:
● CollectAlignmentSummaryMetrics
● CollectGcBiasMetrics
● CollectInsertSizeMetrics
● MeanQualityByCycle (machine cycle)
● QualityScoreDistribution
● CalculateHsMetrics (hybrid selection)
● BamIndexStats
These metrics are later used to create tables and graphs (step 26). The Picard tools themselves also output a PDF version of the data, containing graphs.
Toolname: several Picard QC tools Scriptname: Collect metrics Input: dedup BAM file (.merged.dedup.bam) Output: alignmentmetrics, gcbiasmetrics, insertsizemetrics, meanqualitybycycle, qualityscoredistribution, hsmetrics, bamindexstats (text files and matching PDF files)
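As an example, a minimal sketch of one of these calls (the others follow the same pattern; paths illustrative):
java -jar picard.jar CollectAlignmentSummaryMetrics REFERENCE_SEQUENCE=${referenceGenome} INPUT=${sample}.merged.dedup.bam OUTPUT=${sample}.alignmentmetrics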
Determine Gender
Step 15: Gender check
Because a male has only one X chromosome, it is important to know whether the sample is male or female. Calculating the coverage on the non-pseudoautosomal region of chromosome X and comparing it to the average coverage on the complete genome predicts male or female well.
Scriptname: GenderCheck Input: ${dedupBam}.hs_metrics (from the CalculateHsMetrics step) and ${dedupBam}.nonAutosomalRegionChrX_hs_metrics (from the GenderCalculate step)
Output: ${sample}.chosenSex.txt
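A hypothetical illustration of the decision rule (the actual GenderCheck script and its threshold may differ):
# hypothetical threshold; a male is expected to have roughly half the autosomal coverage on non-PAR ChrX
ratio=$(echo "${nonParChrXCoverage} / ${meanTargetCoverage}" | bc -l)
if [ "$(echo "${ratio} < 0.75" | bc -l)" -eq 1 ]; then echo "Male" > ${sample}.chosenSex.txt; else echo "Female" > ${sample}.chosenSex.txt; fi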
Variant discovery
Step 16a: Call variants (VariantCalling)
The GATK HaplotypeCaller estimates the most likely genotypes and allele frequencies in an alignment using a Bayesian likelihood model for every position of the genome regardless of whether a variant was detected at that site or not. This information can later be used in the project based genotyping step.
Toolname: GATK HaplotypeCaller Scriptname: VariantGVCFCalling Input: merged BAM files Output: gVCF file (${sample}.${batchBed}.variant.calls.g.vcf)
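A minimal sketch (GATK 3.x syntax; paths illustrative):
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ${referenceGenome} -I ${sample}.merged.bam -L ${batchBed} --emitRefConfidence GVCF -o ${sample}.${batchBed}.variant.calls.g.vcf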
Step 16b: Combine variants
When there are 200 or more samples, the gVCF files are combined into batches of equal size. (NB: these batches are different from the ${batchBed} batches.) The batches are calculated and created in this step. If there are fewer than 200 samples, this step is automatically skipped.
Toolname: GATK CombineGVCFs Scriptname: VariantGVCFCombine Input: gVCF file Output: multiple combined gVCF files (${project}.${batchBed}.variant.calls.combined.g.vcf.{batch})
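A minimal sketch for one batch (GATK 3.x syntax; sample list illustrative):
java -jar GenomeAnalysisTK.jar -T CombineGVCFs -R ${referenceGenome} --variant sample1.g.vcf --variant sample2.g.vcf -o ${project}.${batchBed}.variant.calls.combined.g.vcf.0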
Step 16c: Genotype variants
In this step there will be a joint analysis over all the samples in the project. This leads to a posterior probability of a variant allele at a site. SNPs and small Indels are written to a VCF file, along with information such as genotype quality, allele frequency, strand bias and read depth for that SNP/Indel.
Toolname: GATK GenotypeGVCFs Scriptname: VariantGVCFGenotype Input: gVCF files from step 16a or combined gVCF files from step 16b Output: VCF file for all the samples in the project (${project}.${batchBed}.variant.calls.genotyped.vcf)
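A minimal sketch (GATK 3.x syntax):
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R ${referenceGenome} --variant ${project}.${batchBed}.variant.calls.combined.g.vcf.0 -o ${project}.${batchBed}.variant.calls.genotyped.vcf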
Annotation
Step 17a: Annotating with SnpEff
Data will be annotated with SnpEff, a genetic variant annotation and effect prediction toolbox. It annotates and predicts the effects of variants on genes (such as amino acid changes).
Toolname: SnpEff ScriptName: SnpEff Input: genotyped vcf (.genotyped.vcf) Output: snpeff annotated vcf (.snpeff.vcf)
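A minimal sketch (database version illustrative):
java -Xmx4g -jar snpEff.jar -v GRCh37.75 ${project}.genotyped.vcf > ${project}.snpeff.vcf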
Step 17b: Annotating with VEP
Data will be annotated with VEP. VEP determines the effect of your variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts and protein sequence, as well as regulatory regions.
Toolname: VEP ScriptName: VEP Input: genotyped vcf (.genotyped.vcf) Output: VEP annotated vcf (.variant.calls.VEP.vcf)
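A minimal sketch of a cache-based offline annotation (the executable name depends on the VEP release; file names illustrative):
vep -i ${project}.genotyped.vcf -o ${project}.variant.calls.VEP.vcf --vcf --cache --offline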
Step 18: Annotating with CADD, GoNL, ExAC (CmdLineAnnotator)
Data will be annotated with CADD, GoNL and ExAC.
Toolname: CmdLineAnnotator Scriptname: CmdLineAnnotator Input: snpeff annotated vcf (.snpeff.vcf) Output: cadd, gonl and exac annotated vcf (.exac.gonl.cadd.vcf)
Step 19: Merge batches
Running GATK CatVariants to merge all the files created in the previous steps into one.
Toolname: GATK CatVariants Scriptname: MergeChrAndSplitVariants Input: CmdLineAnnotator vcf file (.genotyped.snpeff.exac.gonl.cadd.vcf) Output: merged (batches) vcf per project (${project}.variant.calls.vcf)
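A minimal sketch (GATK 3.x invokes CatVariants via its classpath; batch file names illustrative):
java -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants -R ${referenceGenome} -V batch1.vcf -V batch2.vcf -out ${project}.variant.calls.vcf -assumeSorted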
Step 20: Split indels and SNPs
This step is necessary because the filtering of the VCF needs to be done separately for SNPs and indels.
Toolname: GATK SelectVariants Scriptname: SplitIndelsAndSNPs Input: merged (batches) vcf per project (${project}.variant.calls.GATK.vcf) (from step 19) Output:
.annotated.indels.vcf
.annotated.snps.vcf
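A minimal sketch of the two selections (GATK 3.x syntax):
java -jar GenomeAnalysisTK.jar -T SelectVariants -R ${referenceGenome} -V ${project}.variant.calls.vcf -selectType SNP -o ${project}.annotated.snps.vcf
java -jar GenomeAnalysisTK.jar -T SelectVariants -R ${referenceGenome} -V ${project}.variant.calls.vcf -selectType INDEL -o ${project}.annotated.indels.vcf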
Step 21: (a) SNP and (b) Indel filtration
Based on certain quality thresholds (following the GATK best practices), the SNPs and indels are filtered or marked as PASS.
Toolname: GATK VariantFiltration Scriptname: VariantFiltration Input: .annotated.snps.vcf and .annotated.indels.vcf Output:
Filtered snp vcf file (.annotated.filtered.snps.vcf)
Filtered indel vcf file (.annotated.filtered.indels.vcf)
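Both calls are sketched below with the generic GATK 3.x best-practices hard-filter thresholds (the pipeline's exact expressions may differ):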
SNP:
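java -jar GenomeAnalysisTK.jar -T VariantFiltration -R ${referenceGenome} -V ${project}.annotated.snps.vcf --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filterName "SNPfilter" -o ${project}.annotated.filtered.snps.vcf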
Indel:
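java -jar GenomeAnalysisTK.jar -T VariantFiltration -R ${referenceGenome} -V ${project}.annotated.indels.vcf --filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" --filterName "INDELfilter" -o ${project}.annotated.filtered.indels.vcf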
Step 22: Merge indels and SNPs
Merge all the SNPs and indels into one file (per project) and merge SNPs and indels per sample.
Toolname: GATK CombineVariants Scriptname: MergeIndelsAndSnps Input: .annotated.filtered.indels.vcf and .annotated.filtered.snps.vcf
Output:
sample.final.vcf
project.final.vcf
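Both merge calls are sketched below (GATK 3.x syntax; file names and the genotype-merge option are illustrative):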
Per sample:
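java -jar GenomeAnalysisTK.jar -T CombineVariants -R ${referenceGenome} --variant ${sample}.annotated.filtered.snps.vcf --variant ${sample}.annotated.filtered.indels.vcf -genotypeMergeOptions UNSORTED -o ${sample}.final.vcf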
Per project:
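java -jar GenomeAnalysisTK.jar -T CombineVariants -R ${referenceGenome} --variant ${project}.annotated.filtered.snps.vcf --variant ${project}.annotated.filtered.indels.vcf -genotypeMergeOptions UNSORTED -o ${project}.final.vcf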
Step 23a: Gavin
Tool that predicts the impact of variants with the help of different databases (CADD etc.).
Gavin first round
Toolname: Gavin_toolpack Scriptname: Gavin Input: merged vcf per sample ${sample}.variant.calls.GATK.vcf Output:
First draft (.GAVIN.RVCF.firstpass.vcf)
Tab-separated file to be sent to CADD (.toCadd.tsv)
Get CADD annotations locally
Toolname: CADD Input: tab-separated file to be sent to CADD (.toCadd.tsv) Output: tab-separated file returned from CADD (.fromCadd.tsv.gz)
Gavin second round
Toolname: Gavin_toolpack Input:
merged vcf per sample ${sample}.variant.calls.GATK.vcf
tab-separated file returned from CADD (.fromCadd.tsv.gz)
Output: Gavin final output (.GAVIN.RVCF.final.vcf)
Merging Gavin output with original
Toolname: Gavin_toolpack Input:
merged vcf per sample ${sample}.variant.calls.GATK.vcf
Gavin final output (.GAVIN.RVCF.final.vcf)
Output: Gavin final output merged with original (.GAVIN.rlv.vcf)
Step 23b: GeneNetwork
Tool that ranks genes based on HPO IDs; two extra INFO fields with a score and position are added to the output vcf.
Toolname: VEP
Scriptname: GeneNetwork
Input:
HPO IDs
output.GAVIN.rlv.vcf
Output: ${sample}.GAVIN.geneNetwork.final.vcf
Step 24: Convert structural variants VCF to table
In this step the indels in VCF format are converted into a tabular format using the Perl script vcf2tab by F. van Dijk.
Toolname: vcf2tab.pl Scriptname: IndelVcfToTable Input: (${sample}.final.vcf) Output: (${sample}.final.vcf.table)
QC-ing
Step 25: In silico concordance check
The spiked-in reads contain handmade SNPs. To verify that the pipeline ran correctly, at least these SNPs should be found.
Input: InSilicoData.chrNC_001422.1.variant.calls.vcf and ${sample}.variant.calls.sorted.vcf Output: inSilicoConcordance.txt
Step 26a: Prepare QC Report, collecting metrics
Combining all the statistics which are used in the QC report.
Scriptname: QCStats Toolname: pull_DNA_Seq_Stats.py Input: metrics files (flagstat file, .hsmetrics, .alignmentmetrics, .insertsizemetrics) and concordance file (.dedup.metrics.concordance.ngsVSarray.txt) Output: ${sample}.total.qc.metrics.table
Step 26b: Generate quality control report
This step of the inhouse sequence analysis pipeline outputs the statistics and metrics that were collected in the QCStats step. We use the MultiQC tool to create an interactive HTML report with all the statistics.
Toolname: MultiQC Scriptname: MultiQC Input: ${sample}.total.qc.metrics.table Output: a quality control report in HTML (*_multiqc.html)
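A minimal sketch (directory and report name illustrative; MultiQC scans a directory for recognized metrics files):
multiqc ${qcDir} -o ${qcDir} -n ${project}_multiqc.html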
Step 28: Check if all files are finished
This step checks whether all the steps in the pipeline have actually finished; it sometimes happens that a job is not submitted to the scheduler. If everything is finished it writes a file called CountAllFinishedFiles_CORRECT; if not, it makes CountAllFinishedFiles_INCORRECT, which lists the files that are not finished yet.
Scriptname: CountAllFinishedFiles Input: all .sh scripts + all .sh.finished files in the jobs folder Output: CountAllFinishedFiles_CORRECT or CountAllFinishedFiles_INCORRECT
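An illustrative equivalent of the check (the actual CountAllFinishedFiles script may differ):
# compare submitted jobs (*.sh) against finished markers (*.sh.finished); jobs directory illustrative
cd ${jobsDir}
if [ "$(ls *.sh | wc -l)" -eq "$(ls *.sh.finished | wc -l)" ]
then touch CountAllFinishedFiles_CORRECT
else comm -23 <(ls *.sh | sed 's/\.sh$//' | sort) <(ls *.sh.finished | sed 's/\.sh\.finished$//' | sort) > CountAllFinishedFiles_INCORRECT
fi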
Step 29: Prepare data to ship to the customer
In this last step the final results of the inhouse sequence analysis pipeline are gathered and prepared to be shipped to the customer. The pipeline tools and scripts write intermediate results to a temporary directory. From these, a selection is copied to a results directory. This directory has the following subdirectories:
o alignment: the merged BAM file with index
o coverage: coverage statistics and plots
o coverage_visualization: coverage BED files
o qc: all QC files, from which the QC report is made
o rawdata/ngs: symbolic links to the raw sequence files and their md5 sums
o snps: all SNP calls in VCF format and in tab-delimited format
o structural_variants: all SV calls in VCF and in tab-delimited format
Additionally, the results directory contains the final QC report, the worksheet which was the basis for this analysis (see 4.2) and a zipped archive with the data that will be shipped to the client (see: GCC_P0006_Datashipment.docx). The archive is accompanied by an md5 sum and a README file explaining the contents.
Scriptname: CopyToResultsDir