{"id":35489,"date":"2022-12-27T02:01:36","date_gmt":"2022-12-27T10:01:36","guid":{"rendered":"https:\/\/www.novogene.com\/us-en\/?post_type=resources&#038;p=35489"},"modified":"2025-05-29T03:41:26","modified_gmt":"2025-05-29T10:41:26","slug":"geo-data-mining-ib-downloading-sequence-read-archive-raw-data","status":"publish","type":"resources","link":"https:\/\/www.novogene.com\/us-en\/resources\/blog\/geo-data-mining-ib-downloading-sequence-read-archive-raw-data\/","title":{"rendered":"GEO data mining (IB) &#8211; Downloading Sequence Read Archive raw data"},"content":{"rendered":"<p>Life science researchers may be interested in performing bioinformatics research or in incorporating bioinformatics analysis into their research programs, but they may not be sure which gene is the best starting point. Alternatively, they may have a gene of interest, but may be unsure of how to find disease research data that can be used for further analysis. <\/p>\n<p>The Sequence Read Archive (SRA) database is an excellent source of raw sequencing data that can be mined to address those questions or uncertainties. For example, the source data behind high-impact publications can be downloaded from SRA and mined in depth for potentially interesting genes, and the bioinformatics results can then be verified with \u201cwet lab\u201d experiments. <\/p>\n<p>In <a href=\"https:\/\/www.novogene.com\/eu-en\/resources\/blog\/gene-expression-omnibus-data-mining-ia-quick-and-easy-download-of-geo-data\/\" target=\"_blank\" rel=\"noopener noreferrer\">Part IA<\/a>, we covered how to download Gene Expression Omnibus (GEO) data. However, using GEO expression matrix data to analyze up- or down-regulation of genes is not the only source of innovative research, and some \u201cstar\u201d genes in the GEO database may have already been thoroughly analyzed. The raw sequencing data from published studies in the SRA database is another source of great potential waiting to be tapped. 
Investigators who are used to traditional work on coding genes can use SRA data to enrich their studies with non-coding genes and explore DNA regulatory elements. If the sequencing depth is sufficient, back-splicing events can be studied to mine potential circular RNAs. The raw sequencing data can even be analyzed from scratch to explore new genes. Therefore, combining the GEO database and SRA database for data mining enables investigators to explore gene functions and pathways more broadly and in greater detail than if they used only one of the two databases.<\/p>\n<p><strong>Introduction to the SRA Database<\/strong><\/p>\n<p>The SRA database is an NCBI (National Center for Biotechnology Information) database for storing high-throughput sequencing data. It collects raw sequencing data, and the raw data from published articles around the world can be downloaded for free. The basic framework of the SRA database uses four types of metadata: STUDY, SAMPLE, EXPERIMENT, and RUN.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.novogene.com\/us-en\/wp-content\/uploads\/sites\/4\/2022\/12\/GEO-data-mining.png\" alt=\"\" \/><\/p>\n<ul>\n<li>STUDY: corresponds to a research topic or research project. These have the prefix \u201cSRP,\u201d \u201cDRP,\u201d or \u201cERP.\u201d<\/li>\n<li>SAMPLE: one or more samples make up an experiment. These have the prefix \u201cSRS,\u201d \u201cDRS,\u201d or \u201cERS.\u201d<\/li>\n<li>EXPERIMENT: includes one or more samples for one or more RUN sequencing results. These have the prefix \u201cSRX,\u201d \u201cDRX,\u201d or \u201cERX.\u201d<\/li>\n<li>RUN: corresponds to the sequencing results. 
These have the prefix \u201cSRR,\u201d \u201cDRR,\u201d or \u201cERR.\u201d The rest of this article will focus on downloading RUN files because they contain actual sequence data in FASTQ format.\n<\/li>\n<\/ul>\n<p>The first letter of a prefix indicates the source database to which the sample was originally uploaded: S, SRA; D, DNA Data Bank of Japan (DDBJ); and E, European Bioinformatics Institute (EBI). The SRA database synchronizes the sequencing data from EBI and DDBJ.<\/p>\n<p><strong>Downloading SRA library raw data<\/strong><\/p>\n<p>Raw sequence data can be downloaded from the SRA database in both non-Linux and Linux operating environments.<\/p>\n<p>Web-based data downloads for non-Linux operating environments<br \/>\n1. Web page download<br \/>\nSRA data can be downloaded using any browser. Enter an SRA Run Accession Number into the Run Browser on the SRA website, which is currently accessible at: <a href=\"https:\/\/trace.ncbi.nlm.nih.gov\/Traces\/index.html?view=run_browser&#038;display=download\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/trace.ncbi.nlm.nih.gov\/Traces\/index.html?view=run_browser&#038;display=download<\/a>. <\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.novogene.com\/us-en\/wp-content\/uploads\/sites\/4\/2022\/12\/GEO-data-mining1.png\" alt=\"\" \/><\/p>\n<p>For example, entering SRA Run Accession Number SRR9826926 brings up the record for Experiment SRX6583594<br \/>\n(<a href=\"https:\/\/trace.ncbi.nlm.nih.gov\/Traces\/index.html?view=run_browser&#038;acc=SRR9826926&#038;display=download\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/trace.ncbi.nlm.nih.gov\/Traces\/index.html?view=run_browser&#038;acc=SRR9826926&#038;display=download<\/a>), as shown in the image below. Sequence data can be downloaded automatically by clicking on either the FASTA or FASTQ button (circled in red). 
<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.novogene.com\/us-en\/wp-content\/uploads\/sites\/4\/2022\/12\/GEO-data-mining2.png\" alt=\"\" \/><\/p>\n<p>2. Browser plug-ins<br \/>\nThe IBM Aspera Connect browser plug-in enables quick downloading of batched SRA data. Download the Aspera Connect plug-in (<a href=\"https:\/\/www.ibm.com\/aspera\/connect\/\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/www.ibm.com\/aspera\/connect\/<\/a>), as shown in the image below, then follow the web page download steps described above.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.novogene.com\/us-en\/wp-content\/uploads\/sites\/4\/2022\/12\/GEO-data-mining3.png\" alt=\"\" \/><\/p>\n<p>Downloading SRA data in a Linux operating environment<br \/>\nSRA data can be downloaded in a Linux environment either with the prefetch command, which takes the Accession List exported from SRA (shown in the image below), or with Aspera Connect, which downloads the required FASTQ file(s).<\/p>\n<p>Prefetch command download<br \/>\nFirst, compile the Accession List to be downloaded through the Run Selector of SRA, and then download the SRA files in batches from the Linux command line.<br \/>\n<a href=\"https:\/\/www.ncbi.nlm.nih.gov\/sra\/?term=SRP216259\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/www.ncbi.nlm.nih.gov\/sra\/?term=SRP216259<\/a><\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.novogene.com\/us-en\/wp-content\/uploads\/sites\/4\/2022\/12\/GEO-data-mining4.png\" alt=\"\" \/><\/p>\n<p>Click the \u2018Create file\u2019 button to get the Accession List.<\/p>\n<p style=\"background-color: #aaaaaa;padding: 20px;\"># Use the prefetch command to download a single file, such as SRR1039510<br \/>\nprefetch SRR1039510<br \/>\n# Batch download: loop over the accession list to build a script of prefetch commands<br \/>\noutputdir=\/**\/sra<br \/>\ncat sampleId.txt | while read id<br \/>\ndo<br \/>\necho &quot;prefetch ${id} -O ${outputdir}&quot;<br \/>\ndone >download.sh<br \/>\nnohup sh 
download.sh >download.log &#038;<br \/>\n# Verify data integrity<br \/>\nvdb-validate SRR1039510<\/p>\n<p>Aspera Connect download<br \/>\nSearch the European Nucleotide Archive (ENA; https:\/\/www.ebi.ac.uk\/ena\/browser\/home) for a project number of interest to obtain the download addresses of the relevant FASTQ file(s), and select the required fields under Column Selection. For example, the image below shows the Column Selection options for project PRJNA229998 (<a href=\"https:\/\/www.ebi.ac.uk\/ena\/browser\/view\/PRJNA229998\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/www.ebi.ac.uk\/ena\/browser\/view\/PRJNA229998<\/a>).<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.novogene.com\/us-en\/wp-content\/uploads\/sites\/4\/2022\/12\/GEO-data-mining5.png\" alt=\"\" \/><\/p>\n<p>Use the following code to download the FASTQ files via Aspera Connect:<\/p>\n<p style=\"background-color: #aaaaaa;padding: 20px;\"># Download a single file<br \/>\n# SRA format<br \/>\nascp -k 1 -QT -l 300m -P33001 -i ~\/**\/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:\/vol1\/srr\/SRR103\/008\/SRR1039508 .<br \/>\n# gzipped FASTQ format<br \/>\nascp -k 1 -QT -l 300m -P33001 -i ~\/**\/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:\/vol1\/fastq\/SRR103\/000\/SRR1039510\/SRR1039510_1.fastq.gz .<br \/>\n# Batch download<br \/>\n# Build the sra.url file; if lines end with stray whitespace, run sed -i &quot;s\/\\s*$\/\/g&quot; sra.url to strip it<br \/>\ncat filereport_read_run_PRJNA229998_tsv.txt |awk &#39;NR>1{print $NF}&#39; >sra.url<br \/>\ncat filereport_read_run_PRJNA310728_tsv.xls |awk -F &#39;\\t&#39; &#39;NR>1 {print $20}&#39; |tr &#39;;&#39; &#39;\\n&#39; >fastq.url<br \/>\n# Generate the download commands<br \/>\noutputdir=\/**\/sra<br \/>\ncat sra.url | while read id<br \/>\ndo<br \/>\necho &quot;ascp -k 1 -QT -l 300m -P33001 -i ~\/**\/asperaweb_id_dsa.openssh era-fasp@${id} ${outputdir}&quot;<br \/>\ndone >sra.download.sh<br \/>\n# Run in the background<br \/>\nnohup sh sra.download.sh >sra.download.log &#038;<br \/>\n## Data integrity check<br \/>\n# Extract the MD5 checksums<br \/>\nawk &#39;NR>1{print $11&quot;\\t&quot;$4}&#39; filereport_read_run_PRJNA229998_tsv.txt >md5.txt<br \/>\n# Verify the checksums<br \/>\nmd5sum -c md5.txt<\/p>\n<p>We have now obtained the raw materials for analysis by downloading sequencing data from the GEO and SRA databases. Stay tuned for the next tutorials in our &#8220;GEO Data Mining&#8221; series. We will cover analysis of differential gene expression and visualization, pathway enrichment analysis, and more.  
<\/p>\n","protected":false},"featured_media":0,"parent":0,"template":"","yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v20.8 (Yoast SEO v20.8) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>GEO data mining (IB) - Downloading Sequence Read Archive raw data - Novogene<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.novogene.com\/us-en\/resources\/blog\/geo-data-mining-ib-downloading-sequence-read-archive-raw-data\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"GEO data mining (IB) - Downloading Sequence Read Archive raw data\" \/>\n<meta property=\"og:description\" content=\"Life science researchers may be interested in performing bioinformatics research or in incorporating bioinformatics analysis into their research programs, but they may not be sure which gene is the best starting point. Alternatively, they may have a gene of interest, but may be unsure of how to find disease research data that can be used\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.novogene.com\/us-en\/resources\/blog\/geo-data-mining-ib-downloading-sequence-read-archive-raw-data\/\" \/>\n<meta property=\"og:site_name\" content=\"Novogene\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/NovogeneAmerica\/\" \/>\n<meta property=\"article:modified_time\" content=\"2025-05-29T10:41:26+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.novogene.com\/us-en\/wp-content\/uploads\/sites\/4\/2022\/12\/GEO-data-mining.png\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:site\" content=\"@Novogene_Global\" \/>\n<meta name=\"twitter:label1\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"6 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.novogene.com\/us-en\/resources\/blog\/geo-data-mining-ib-downloading-sequence-read-archive-raw-data\/\",\"url\":\"https:\/\/www.novogene.com\/us-en\/resources\/blog\/geo-data-mining-ib-downloading-sequence-read-archive-raw-data\/\",\"name\":\"GEO data mining (IB) - Downloading Sequence Read Archive raw data - Novogene\",\"isPartOf\":{\"@id\":\"https:\/\/www.novogene.com\/us-en\/#website\"},\"datePublished\":\"2022-12-27T10:01:36+00:00\",\"dateModified\":\"2025-05-29T10:41:26+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/www.novogene.com\/us-en\/resources\/blog\/geo-data-mining-ib-downloading-sequence-read-archive-raw-data\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.novogene.com\/us-en\/resources\/blog\/geo-data-mining-ib-downloading-sequence-read-archive-raw-data\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.novogene.com\/us-en\/resources\/blog\/geo-data-mining-ib-downloading-sequence-read-archive-raw-data\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.novogene.com\/us-en\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Resources\",\"item\":\"https:\/\/www.novogene.com\/us-en\/resources\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"GEO data mining (IB) &#8211; Downloading Sequence Read Archive raw data\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.novogene.com\/us-en\/#website\",\"url\":\"https:\/\/www.novogene.com\/us-en\/\",\"name\":\"Novogene\",\"description\":\"USA Based Lab Guaranteed Data 
Security\",\"publisher\":{\"@id\":\"https:\/\/www.novogene.com\/us-en\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.novogene.com\/us-en\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.novogene.com\/us-en\/#organization\",\"name\":\"Novogene\",\"url\":\"https:\/\/www.novogene.com\/us-en\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.novogene.com\/us-en\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.novogene.com\/us-en\/wp-content\/uploads\/sites\/4\/2020\/05\/20200506113246.png\",\"contentUrl\":\"https:\/\/www.novogene.com\/us-en\/wp-content\/uploads\/sites\/4\/2020\/05\/20200506113246.png\",\"width\":941,\"height\":269,\"caption\":\"Novogene\"},\"image\":{\"@id\":\"https:\/\/www.novogene.com\/us-en\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/NovogeneAmerica\/\",\"https:\/\/twitter.com\/Novogene_Global\",\"https:\/\/www.linkedin.com\/company\/novogene\/\",\"https:\/\/www.youtube.com\/c\/NovogeneGlobal\"]}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"GEO data mining (IB) - Downloading Sequence Read Archive raw data - Novogene","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.novogene.com\/us-en\/resources\/blog\/geo-data-mining-ib-downloading-sequence-read-archive-raw-data\/","og_locale":"en_US","og_type":"article","og_title":"GEO data mining (IB) - Downloading Sequence Read Archive raw data","og_description":"Life science researchers may be interested in performing bioinformatics research or in incorporating bioinformatics analysis into their research programs, but they may not be sure which gene is the best starting point. Alternatively, they may have a gene of interest, but may be unsure of how to find disease research data that can be used","og_url":"https:\/\/www.novogene.com\/us-en\/resources\/blog\/geo-data-mining-ib-downloading-sequence-read-archive-raw-data\/","og_site_name":"Novogene","article_publisher":"https:\/\/www.facebook.com\/NovogeneAmerica\/","article_modified_time":"2025-05-29T10:41:26+00:00","og_image":[{"url":"https:\/\/www.novogene.com\/us-en\/wp-content\/uploads\/sites\/4\/2022\/12\/GEO-data-mining.png"}],"twitter_card":"summary_large_image","twitter_site":"@Novogene_Global","twitter_misc":{"Est. 
reading time":"6 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/www.novogene.com\/us-en\/resources\/blog\/geo-data-mining-ib-downloading-sequence-read-archive-raw-data\/","url":"https:\/\/www.novogene.com\/us-en\/resources\/blog\/geo-data-mining-ib-downloading-sequence-read-archive-raw-data\/","name":"GEO data mining (IB) - Downloading Sequence Read Archive raw data - Novogene","isPartOf":{"@id":"https:\/\/www.novogene.com\/us-en\/#website"},"datePublished":"2022-12-27T10:01:36+00:00","dateModified":"2025-05-29T10:41:26+00:00","breadcrumb":{"@id":"https:\/\/www.novogene.com\/us-en\/resources\/blog\/geo-data-mining-ib-downloading-sequence-read-archive-raw-data\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.novogene.com\/us-en\/resources\/blog\/geo-data-mining-ib-downloading-sequence-read-archive-raw-data\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.novogene.com\/us-en\/resources\/blog\/geo-data-mining-ib-downloading-sequence-read-archive-raw-data\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.novogene.com\/us-en\/"},{"@type":"ListItem","position":2,"name":"Resources","item":"https:\/\/www.novogene.com\/us-en\/resources\/"},{"@type":"ListItem","position":3,"name":"GEO data mining (IB) &#8211; Downloading Sequence Read Archive raw data"}]},{"@type":"WebSite","@id":"https:\/\/www.novogene.com\/us-en\/#website","url":"https:\/\/www.novogene.com\/us-en\/","name":"Novogene","description":"USA Based Lab Guaranteed Data Security","publisher":{"@id":"https:\/\/www.novogene.com\/us-en\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.novogene.com\/us-en\/?s={search_term_string}"},"query-input":"required 
name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.novogene.com\/us-en\/#organization","name":"Novogene","url":"https:\/\/www.novogene.com\/us-en\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.novogene.com\/us-en\/#\/schema\/logo\/image\/","url":"https:\/\/www.novogene.com\/us-en\/wp-content\/uploads\/sites\/4\/2020\/05\/20200506113246.png","contentUrl":"https:\/\/www.novogene.com\/us-en\/wp-content\/uploads\/sites\/4\/2020\/05\/20200506113246.png","width":941,"height":269,"caption":"Novogene"},"image":{"@id":"https:\/\/www.novogene.com\/us-en\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/NovogeneAmerica\/","https:\/\/twitter.com\/Novogene_Global","https:\/\/www.linkedin.com\/company\/novogene\/","https:\/\/www.youtube.com\/c\/NovogeneGlobal"]}]}},"acf":[],"_links":{"self":[{"href":"https:\/\/www.novogene.com\/us-en\/wp-json\/wp\/v2\/resources\/35489"}],"collection":[{"href":"https:\/\/www.novogene.com\/us-en\/wp-json\/wp\/v2\/resources"}],"about":[{"href":"https:\/\/www.novogene.com\/us-en\/wp-json\/wp\/v2\/types\/resources"}],"wp:attachment":[{"href":"https:\/\/www.novogene.com\/us-en\/wp-json\/wp\/v2\/media?parent=35489"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}