Data Analysis: The perks and fruits of Next Generation Sequencing
Advances in next generation sequencing (NGS) research have created a noteworthy paradigm shift in deep learning, bioanalysis and many other fields. DNA and RNA sequencing make it possible to identify unknown genomes or to search a sequenced genome for variations, especially between different samples. Another practical use is the identification of distinctive regions in the DNA strand, such as binding sites for DNA-binding proteins or transcription factors. These regions are potential therapeutic targets for diseases with a genetic component.
However, the most frequent application is the detection of variations (mutations), such as Single Nucleotide Polymorphisms (SNPs), insertions and deletions (InDels), Copy Number Variations (CNVs) and many other kinds of Structural Variations (SVs) that may have an impact on the pathogenesis of diseases or on changes in the species’ phenotypes.
Genomic sequencing can be used as a clinical tool to assess the prognosis of different diseases, such as acute myeloid leukemia, lung cancer, breast cancer, renal pathologies and more. Prognostic assessment allows accurate stratification of low-risk and high-risk individuals, which is pivotal in epidemiologic surveillance.
Whole Genome Sequencing (WGS) and Whole Exome Sequencing (WES) have tremendous research value: extensive phenotyping of large cohorts (allowing researchers to pinpoint the underlying genetics of specific traits in a species), cataloguing somatic mutations of rare tumor types, and innovation in pharmacogenetics and even in molecular agriculture, which has direct economic implications.
Processing and understanding your data
Before this data becomes useful and allows researchers to draw conclusions, it must be thoroughly analyzed through a bioinformatics workflow. DNA sequencing companies apply rigorous standards to ensure the utmost quality in their output results.
At Novogene, the analysis of data can be divided into multiple steps: data quality control, alignment to the reference genome, variant calling (SNPs, InDels, CNVs and other structural variants, including somatic analysis of samples) and annotation.
To assess quality, genomic services rely on tools such as FastQC, a quality control application for high-throughput sequence data. Essentially, this step flags unfavourable or poor-quality reads that do not meet the standards so that they can be trimmed or filtered out; FastQC itself reports on quality rather than modifying the reads. With this software, you can determine whether your data has any problems through the set of analyses it provides (quick overviews, summary graphs and tables).
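As a rough illustration of what a quality filter does, the sketch below decodes FASTQ quality strings and keeps only reads whose mean Phred score meets a threshold. It is not how FastQC works internally. The Phred+33 encoding, the threshold of 20 and the function names are assumptions chosen for the example.

```python
from statistics import mean


def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into per-base Phred scores.

    The Phred+33 offset is an assumption; older data may use a different one.
    """
    return [ord(ch) - offset for ch in quality_line]


def parse_fastq(text):
    """Yield (read_id, sequence, quality) from FASTQ text, four lines per record."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    for i in range(0, len(lines), 4):
        header, seq, _plus, qual = lines[i:i + 4]
        yield header[1:], seq, qual


def filter_reads(text, min_mean_quality=20):
    """Return the IDs of reads whose mean Phred score meets the threshold.

    min_mean_quality=20 is a hypothetical cut-off used only for illustration.
    """
    kept = []
    for read_id, _seq, qual in parse_fastq(text):
        if mean(phred_scores(qual)) >= min_mean_quality:
            kept.append(read_id)
    return kept


# Two toy reads: 'I' decodes to Phred 40, '#' decodes to Phred 2.
example = """@read1
ACGTACGT
+
IIIIIIII
@read2
ACGTACGT
+
########
"""

print(filter_reads(example))  # prints ['read1']
```

A production workflow would usually trim low-quality bases from read ends as well as discarding whole reads, and would stream the file rather than load it into memory.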
Sequence alignment, the second step, arranges DNA or RNA sequences against each other (or against a reference genome) to identify areas of similarity between them. By doing so, you can establish relationships between similar or identical regions and explore their possible implications. Accurate alignment of high-throughput RNA-seq data can be achieved with efficient software tools such as STAR.
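To make the idea of alignment concrete, here is a minimal sketch of the classic Needleman-Wunsch global alignment of two short sequences. It is a textbook method shown for illustration only, not the approach used by STAR or other high-throughput aligners, which rely on far more efficient indexed strategies. The scoring values (match +1, mismatch -1, gap -1) are assumptions.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment.

    Returns (score, aligned_a, aligned_b), with '-' marking gaps.
    Scoring parameters are illustrative defaults, not recommended values.
    """
    def sub(x, y):
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First row and column: aligning a prefix against nothing costs gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Fill the table: each cell is the best of diagonal, up, or left.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


best_score, top, bottom = global_align("GATTACA", "GATCA")
print(best_score)  # prints 3
print(top)
print(bottom)
```

Several alignments can share the same optimal score, so the exact placement of the gaps depends on the tie-breaking order used in the traceback.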
In alignment, identity refers to positions where the nucleotides or amino acids are exactly the same. An identity of 25% or higher indicates a degree of similarity in function, whereas an identity of 18-25% implies similarity of structure or function. Similarity means there is a degree of resemblance between two sequences: although they share common properties, they are not exactly identical. It indicates some degree of conserved function or structure.
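Percent identity can be computed directly from two aligned sequences. The sketch below counts identical positions and divides by the number of aligned columns. Note that the choice of denominator (alignment length here, rather than the length of the shorter sequence) is an assumption, and different tools use different conventions.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns where both sequences carry the same residue.

    Both inputs must already be aligned (equal length, '-' for gaps).
    Columns where both sequences have a gap are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    columns = [
        (x, y) for x, y in zip(aligned_a, aligned_b)
        if not (x == "-" and y == "-")
    ]
    if not columns:
        return 0.0

    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


print(percent_identity("AAAA", "AATT"))  # prints 50.0
print(round(percent_identity("GATTACA", "GAT--CA"), 1))  # prints 71.4
```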
The following step in NGS genomics is the identification of variants, such as Single Nucleotide Polymorphisms (SNPs), indels, translocations, inversions and more. Software used for this purpose includes GATK for germline SNPs and MuTect/Strelka for somatic samples, to name a few. Variant annotation is the process of assigning information to variants, drawing on those that have previously been described and added to variant databases.
This saves researchers time and helps identify the variants associated with diseases. Efficient software tools include the likes of ANNOVAR.
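The two ideas, finding variants and then annotating them, can be sketched in a few lines. This toy example compares a sample sequence with a reference of the same length, lists the single-base differences, and looks each one up in a small dictionary standing in for a variant database. It is not how GATK, MuTect, Strelka or ANNOVAR work: real callers use statistical models over many reads. The database contents, identifiers and function names here are entirely hypothetical.

```python
# Hypothetical stand-in for a variant database, keyed by
# (chromosome, 1-based position, reference base, alternate base).
KNOWN_VARIANTS = {
    ("chr1", 3, "G", "A"): {
        "id": "example_variant_1",
        "note": "hypothetical disease association",
    },
}


def find_snvs(chrom, reference, sample, start=1):
    """List single-base substitutions between two equal-length, ungapped sequences."""
    if len(reference) != len(sample):
        raise ValueError("sequences must have the same length")
    return [
        (chrom, start + i, ref_base, alt_base)
        for i, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]


def annotate(variants, database):
    """Attach the database record to each variant, or None if it is not described."""
    return [
        {"variant": variant, "annotation": database.get(variant)}
        for variant in variants
    ]


variants = find_snvs("chr1", "ACGTAC", "ACATAT")
for record in annotate(variants, KNOWN_VARIANTS):
    print(record["variant"], record["annotation"])
```

In this example the substitution at position 3 is found in the lookup table, while the one at position 6 has no entry and would be treated as not previously described.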
Finally, NGS data is visualized using dedicated tools and genome browsers. Visualizing data helps gauge mapping quality and draw information from aligned reads, annotations, their impact and more. Each tool brings its own advantages and disadvantages to the sequencing workflow, and each facilitates analysis tasks for researchers.
The development of NGS technologies has made data analysis, rather than data generation, the rate-limiting step in genomics studies. The steps described above allow researchers to detect mismatches, faulty reads, mutations and other structural changes among sequences, making the entire process much more accessible and less time-consuming.
 Rucha M. Wadapurkar, Renu Vyas, Computational analysis of next generation sequencing data and its applications in clinical oncology, Informatics in Medicine Unlocked, Volume 11, 2018, pages 75-82. [ScienceDirect]
 John M. Butler, Chapter 12 – Single Nucleotide Polymorphisms and Applications, Editor(s): John M. Butler, Advanced Topics in Forensic DNA Typing: Methodology, Academic Press, 2012, Pages 347-369. [ScienceDirect]
 Holly J. Pederson, Jennifer R. Klemp, 85 – Breast Cancer Survivorship, Editor(s): Kirby I. Bland, Edward M. Copeland, V. Suzanne Klimberg, William J. Gradishar, The Breast (Fifth Edition), Elsevier, 2018, Pages 1049-1056.e4.[ScienceDirect]
 Aono, A.H., Costa, E.A., Rody, H.V.S. et al. Machine learning approaches reveal genomic regions associated with sugarcane brown rust resistance. Sci Rep 10, 20057 (2020). [Nature scientific reports]
 Babraham Bioinformatics. FastQC. [Bioinformatics]
 Nielsen, C., Cantor, M., Dubchak, I. et al. Visualizing genomes: techniques and challenges. Nat Methods 7, S5–S15 (2010). [Nature methods]