A Beginner’s Guide to DNA-seq: Bioinformatics Analysis
Next-generation sequencing (NGS) technologies have made large-scale genomic sequencing projects practical. The study of genetic variation is applicable across a wide range of fields and can be used to answer questions ranging from ecology and evolution to the study of disease.
DNA sequencing at Novogene
DNA sequencing is used to determine the sequence of the four bases that make up the DNA molecule. The advancement of next-generation sequencing has made it a popular tool for answering a variety of questions in the life science and medical fields.
Novogene offers a range of DNA sequencing services on three different platforms. The Illumina platform is used for short reads, and for long reads we use both the PacBio and ONT systems. The choice of platform depends on the type of sample you are sequencing and on how much data is required to reach the breadth and depth of coverage you need. At Novogene, we offer both whole genome and whole exome sequencing. Whole genome sequencing will provide you with data on all the genes as well as the non-coding regions, while whole exome sequencing only examines the DNA contained in exons (the protein-coding regions) and can be more cost-effective depending on the type of data you are looking for.
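To make the idea of depth and breadth more concrete, here is a minimal Python sketch. It assumes the usual back-of-the-envelope definitions: average depth is the total number of sequenced bases divided by the genome size, and breadth is the fraction of positions covered by at least a minimum number of reads. The function names, the 3.1 Gb genome size, and the 150 bp read length are illustrative assumptions, not Novogene parameters.

```python
import math


def expected_depth(num_reads, read_length, genome_size):
    """Average depth = total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


def reads_needed(target_depth, read_length, genome_size):
    """Number of reads needed to reach a target average depth (rounded up)."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(target_depth * genome_size / read_length)


def breadth_of_coverage(per_base_depth, min_depth=1):
    """Fraction of positions covered by at least min_depth reads."""
    if not per_base_depth:
        return 0.0
    covered = sum(1 for d in per_base_depth if d >= min_depth)
    return covered / len(per_base_depth)


# Illustrative values only: an approximately 3.1 Gb genome and 150 bp reads.
GENOME_SIZE = 3_100_000_000
READ_LENGTH = 150

print(reads_needed(30, READ_LENGTH, GENOME_SIZE))   # reads for ~30x average depth
print(breadth_of_coverage([0, 5, 12, 0, 30]))       # 3 of 5 positions covered
```

Real projects usually sequence somewhat more than this estimate suggests, because duplicates, low-quality reads, and uneven coverage all reduce the usable depth.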
An overview of DNA-Seq Bioinformatics Analysis: Germline vs Somatic
At Novogene, we can also perform the bioinformatics analysis on your samples. To give you an idea of what that analysis will look like, let’s briefly go over some of the pipelines that we use. The first variant calling workflow we will examine is the Germline Variant Calling workflow, which is used for short reads generated on the Illumina platform. This pipeline consists of five steps:
In the first step, we clean up the raw reads before using a program called BWA to align them to the reference genome. This can be either a local alignment or a global alignment depending on the type of reads produced. Once aligned, the reads are stored in a BAM file and duplicate reads are marked. Finally, we apply base quality score recalibration (BQSR), which corrects systematic errors in the base quality scores reported by the sequencer. Once this has been done, we use the GATK4 HaplotypeCaller, which is one of the common tools for germline variant calling. It moves a sliding window across the reference genome to identify active regions, that is, regions that show evidence of variation. This generates a genomic VCF (GVCF) file that can be used for joint genotyping, in which variants are called across all samples to produce a final VCF file. Another tool we can use for germline variant calling is DeepVariant. This tool takes the BAM file and the reference genome as input and also produces a VCF file. The VCF file can then be annotated with either genome annotation or region-based annotation depending on your needs.
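As a rough illustration of how these steps chain together, the Python sketch below only builds the command lines for one sample and prints them; it does not run anything. It is not Novogene's production pipeline. The sample name, file names, thread count, and read group string are hypothetical, read clean-up is assumed to have been done already, and the single-sample call to GenotypeGVCFs stands in for joint genotyping (a multi-sample cohort would first need its GVCFs combined, for example with GATK's CombineGVCFs or GenomicsDBImport).

```python
import shlex


def germline_commands(sample, ref, fq1, fq2, known_sites, threads=4):
    """Return the shell commands for one sample, in order, without running them."""
    q = shlex.quote
    # Read group in the form bwa expects (literal "\t" separators); GATK needs one.
    read_group = rf"@RG\tID:{sample}\tSM:{sample}\tPL:ILLUMINA"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    recal_table = f"{sample}.recal.table"
    recal_bam = f"{sample}.recal.bam"
    gvcf = f"{sample}.g.vcf.gz"
    vcf = f"{sample}.vcf.gz"
    return [
        # Align cleaned reads with BWA-MEM and sort the output into a BAM file
        f"bwa mem -t {threads} -R {q(read_group)} {q(ref)} {q(fq1)} {q(fq2)}"
        f" | samtools sort -o {q(sorted_bam)} -",
        # Mark duplicate reads
        f"gatk MarkDuplicates -I {q(sorted_bam)} -O {q(dedup_bam)}"
        f" -M {q(sample + '.dup_metrics.txt')}",
        f"samtools index {q(dedup_bam)}",
        # Base quality score recalibration: build the model, then apply it
        f"gatk BaseRecalibrator -R {q(ref)} -I {q(dedup_bam)}"
        f" --known-sites {q(known_sites)} -O {q(recal_table)}",
        f"gatk ApplyBQSR -R {q(ref)} -I {q(dedup_bam)}"
        f" --bqsr-recal-file {q(recal_table)} -O {q(recal_bam)}",
        # Per-sample variant calling in GVCF mode
        f"gatk HaplotypeCaller -R {q(ref)} -I {q(recal_bam)} -O {q(gvcf)} -ERC GVCF",
        # Genotyping (single-sample case shown here)
        f"gatk GenotypeGVCFs -R {q(ref)} -V {q(gvcf)} -O {q(vcf)}",
    ]


# Hypothetical inputs, for illustration only
for cmd in germline_commands(
    "S1", "ref.fa", "S1_R1.clean.fq.gz", "S1_R2.clean.fq.gz", "known_sites.vcf.gz"
):
    print(cmd)
```

Building the commands as data before running them makes it easy to review or log exactly what would be executed for each sample.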
The second type of workflow that we are going to talk you through is somatic variant calling. This is a very different story from the germline pipeline, as we are dealing with low-frequency variants which need to be identified and separated from artifacts. An example of a basic somatic variant calling pipeline is the VarScan2 pipeline. Here we start with BAM files or germline population resources and perform SNV and indel calling. This produces an indel VCF or an SNV VCF depending on what you start with. Variant filtering can then be used to produce analysis-ready variants.
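To show why low-frequency variants need careful filtering, here is a minimal Python sketch of a filter based on variant allele frequency (VAF) in a tumor/normal pair. It is not VarScan2's algorithm or its default settings. The record layout, the function names, and every threshold below are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """Read counts for one candidate site (hypothetical record layout)."""
    chrom: str
    pos: int
    tumor_ref: int
    tumor_alt: int
    normal_ref: int
    normal_alt: int


def vaf(ref_reads, alt_reads):
    """Variant allele frequency = alt reads / (ref reads + alt reads)."""
    depth = ref_reads + alt_reads
    return alt_reads / depth if depth else 0.0


def is_candidate_somatic(call, min_tumor_vaf=0.05, max_normal_vaf=0.01,
                         min_depth=20, min_alt_reads=4):
    """Keep sites well supported in the tumor and essentially absent in the normal.

    All thresholds are illustrative assumptions, not recommended values.
    """
    tumor_depth = call.tumor_ref + call.tumor_alt
    normal_depth = call.normal_ref + call.normal_alt
    if tumor_depth < min_depth or normal_depth < min_depth:
        return False  # too little evidence in either sample
    if call.tumor_alt < min_alt_reads:
        return False  # a handful of reads is more likely an artifact
    if vaf(call.normal_ref, call.normal_alt) > max_normal_vaf:
        return False  # also present in the normal sample: likely germline
    return vaf(call.tumor_ref, call.tumor_alt) >= min_tumor_vaf


calls = [
    Call("chr1", 100, 90, 10, 50, 0),    # 10% in tumor, absent in normal
    Call("chr1", 200, 50, 50, 25, 25),   # 50% in both samples
    Call("chr2", 300, 199, 1, 100, 0),   # a single supporting read
]
for c in calls:
    print(c.chrom, c.pos, is_candidate_somatic(c))
```

In this toy example only the first site passes: the second looks germline because it is equally frequent in the normal sample, and the third has too few supporting reads to separate from sequencing error.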
Long reads and advanced analysis
For long reads, we use PacBio and Nanopore variant calling pipelines. These are invaluable tools for the discovery of structural variants (SVs). Structural variants are typically defined as genomic variations of 50 base pairs or longer. SVs are identified using one of four methods:
4. De novo assembly
Depending on the data that you have, these methods work in different ways to identify regions where there have been insertions or deletions in the DNA sequence. In addition to these analyses, we also offer advanced analyses such as Mendelian disease analysis and in-depth cancer analysis.
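The size-based distinction between small variants and structural variants can be sketched in a few lines of Python. This is only an illustration of the 50 bp cut-off, not one of the detection methods themselves. It assumes sequence-resolved REF and ALT alleles as they appear in a VCF record, treats a length difference of exactly 50 bp as structural (an assumption about the boundary), and does not handle symbolic alleles such as <DEL>. The function and label names are hypothetical.

```python
SV_MIN_LENGTH = 50  # base pairs; cut-off assumed for this sketch


def classify_variant(ref, alt):
    """Classify a sequence-resolved variant by the length change it causes."""
    if len(ref) == len(alt):
        # Same length: a substitution of one or more bases
        return "SNV" if len(ref) == 1 else "MNV"
    size = abs(len(alt) - len(ref))
    kind = "insertion" if len(alt) > len(ref) else "deletion"
    scale = "structural" if size >= SV_MIN_LENGTH else "small"
    return f"{scale} {kind}"


print(classify_variant("A", "G"))                # single-base substitution
print(classify_variant("ACGT", "A"))             # 3 bp deletion
print(classify_variant("A", "A" + "T" * 60))     # 60 bp insertion
```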
Novogene is a world expert in the sequencing field and will provide you with a comprehensive service that includes recommendations on sequencing and bioinformatics depending on the types of samples that you wish to process.
For more information on variant calling pipelines for Illumina, PacBio, and Nanopore data and our advanced analysis approaches for disease and cancer studies, you can listen to our webinar, available here: A Beginner’s Guide to DNA-seq Bioinformatics Analysis – Novogene. Feel free to learn more about Whole Genome Sequencing here: Novogene Whole Genome Sequencing