ipyrad-analysis toolkit: sratools
For reproducibility, it is useful to be able to download the raw data for your analysis from an online repository such as NCBI with a short script at the top of your notebook. We've written a simple wrapper around the sratools command-line program (which is notoriously difficult to use and poorly documented) to make this easier.
Required software
[1]:
# conda install ipyrad -c bioconda
# conda install sra-tools -c bioconda
[2]:
import ipyrad.analysis as ipa
Fetch info for a published data set by its accession ID
You can find the study ID or individual sample IDs in published papers or by searching NCBI or related databases. ipyrad accepts one or more accession IDs for individual Runs or Studies (SRR or SRP, and similarly ERR or ERP, etc.).
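Before passing IDs to the wrapper, it can help to check whether each one is a Run or a Study accession. The sketch below is a hypothetical helper, not part of ipyrad, and it assumes the usual prefix pattern: a first letter of S, E, or D for the archive, followed by R, then R for a Run or P for a Study, then digits.

```python
import re

# Hypothetical helper (not an ipyrad function): classify an accession by prefix.
# Assumed pattern: [S|E|D] + "R" + [R = run | P = study] + digits.
ACCESSION_RE = re.compile(r"^([SED])R([RP])\d+$")


def classify_accession(accession):
    """Return 'run' or 'study' for a Run/Study accession, else raise ValueError."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        raise ValueError(f"not a Run or Study accession: {accession!r}")
    return "run" if match.group(2) == "R" else "study"


# Both IDs below appear in this notebook.
kinds = {acc: classify_accession(acc) for acc in ["SRP065788", "SRR2895732"]}
print(kinds)
```

Other accession types, such as the BioProject ID PRJNA299402, do not match the pattern and raise a ValueError in this sketch.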
[3]:
# init sratools object with an accessions argument
sra = ipa.sratools(accessions="SRP065788")
[4]:
# fetch info for all samples from this study, save as a dataframe
stable = sra.fetch_runinfo()
Fetching project data...
[5]:
# the dataframe has all information about this study
stable.head()
[5]:
| | Run | ReleaseDate | LoadDate | spots | bases | spots_with_mates | avgLength | size_MB | AssemblyName | download_path | ... | SRAStudy | BioProject | Study_Pubmed_id | ProjectID | Sample | BioSample | SampleType | TaxID | ScientificName | SampleName |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | SRR2895732 | 2015-11-04 15:50:01 | 2015-11-04 17:19:15 | 2009174 | 182834834 | 0 | 91 | 116 | NaN | https://sra-download.ncbi.nlm.nih.gov/sos/sra-... | ... | SRP065788 | PRJNA299402 | NaN | 299402 | SRS1146158 | SAMN04202163 | simple | 224736 | Viburnum betulifolium | Lib1_betulifolium |
1 | SRR2895743 | 2015-11-04 15:50:01 | 2015-11-04 17:18:35 | 2452970 | 223220270 | 0 | 91 | 140 | NaN | https://sra-download.ncbi.nlm.nih.gov/sos/sra-... | ... | SRP065788 | PRJNA299402 | NaN | 299402 | SRS1146171 | SAMN04202164 | simple | 1220044 | Viburnum bitchiuense | Lib1_bitchiuense_combined |
2 | SRR2895755 | 2015-11-04 15:50:01 | 2015-11-04 17:18:46 | 4640732 | 422306612 | 0 | 91 | 264 | NaN | https://sra-download.ncbi.nlm.nih.gov/sos/sra-... | ... | SRP065788 | PRJNA299402 | NaN | 299402 | SRS1146182 | SAMN04202165 | simple | 237927 | Viburnum carlesii | Lib1_carlesii_D1_BP_001 |
3 | SRR2895756 | 2015-11-04 15:50:01 | 2015-11-04 17:20:18 | 3719383 | 338463853 | 0 | 91 | 214 | NaN | https://sra-download.ncbi.nlm.nih.gov/sos/sra-... | ... | SRP065788 | PRJNA299402 | NaN | 299402 | SRS1146183 | SAMN04202166 | simple | 237928 | Viburnum cinnamomifolium | Lib1_cinnamomifolium_PWS2105X |
4 | SRR2895757 | 2015-11-04 15:50:01 | 2015-11-04 17:20:06 | 3745852 | 340872532 | 0 | 91 | 213 | NaN | https://sra-download.ncbi.nlm.nih.gov/sos/sra-... | ... | SRP065788 | PRJNA299402 | NaN | 299402 | SRS1146181 | SAMN04202167 | simple | 237929 | Viburnum clemensae | Lib1_clemensiae_DRY6_PWS_2135 |
5 rows × 30 columns
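Because the run info is an ordinary pandas DataFrame, it can be summarized before anything is downloaded. The sketch below rebuilds three columns of the five rows shown above as a stand-in named `stable_demo` (a hypothetical name) and totals the `spots` and `size_MB` columns. It assumes those columns are numeric; if they were read as strings, `pd.to_numeric` would be needed first.

```python
import pandas as pd

# Stand-in for `stable`: values copied from the five rows shown above.
stable_demo = pd.DataFrame({
    "Run": ["SRR2895732", "SRR2895743", "SRR2895755", "SRR2895756", "SRR2895757"],
    "spots": [2009174, 2452970, 4640732, 3719383, 3745852],
    "size_MB": [116, 140, 264, 214, 213],
})

# Totals across the selected runs.
total_spots = int(stable_demo["spots"].sum())
total_mb = int(stable_demo["size_MB"].sum())
print(f"{len(stable_demo)} runs, {total_spots} spots, {total_mb} MB listed in size_MB")
```

With the real table, the same two `.sum()` calls on `stable` give a rough idea of the download size. The extracted fastq files listed at the end of this notebook are considerably larger than the `size_MB` values.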
File names
You can use columns of this table to name the downloaded files. Columns are selected by their index number, as shown below.
[8]:
stable.iloc[:5, [0, 28, 29]]
[8]:
| | Run | ScientificName | SampleName |
---|---|---|---|
0 | SRR2895732 | Viburnum betulifolium | Lib1_betulifolium |
1 | SRR2895743 | Viburnum bitchiuense | Lib1_bitchiuense_combined |
2 | SRR2895755 | Viburnum carlesii | Lib1_carlesii_D1_BP_001 |
3 | SRR2895756 | Viburnum cinnamomifolium | Lib1_cinnamomifolium_PWS2105X |
4 | SRR2895757 | Viburnum clemensae | Lib1_clemensiae_DRY6_PWS_2135 |
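Hard-coded column positions are easy to get wrong, so one option is to look them up by name. The sketch below uses a three-column stand-in (`demo`, a hypothetical name) and converts the 0-based positions used by `iloc` into 1-based positions. The 1-based convention for `name_fields` is an inference from this notebook, where `name_fields=(1,30)` produces files named from the Run and SampleName columns. It is not taken from the ipyrad documentation.

```python
import pandas as pd

# Stand-in with only the three columns shown above; the real table has 30.
demo = pd.DataFrame({
    "Run": ["SRR2895732", "SRR2895743"],
    "ScientificName": ["Viburnum betulifolium", "Viburnum bitchiuense"],
    "SampleName": ["Lib1_betulifolium", "Lib1_bitchiuense_combined"],
})

# 0-based positions, as used by .iloc
zero_based = [demo.columns.get_loc(name) for name in ("Run", "SampleName")]

# 1-based positions (assumed convention for name_fields, see lead-in)
one_based = tuple(pos + 1 for pos in zero_based)
print(zero_based, one_based)
```

On the full 30-column table the same lookup would return positions 0 and 29, which correspond to 1 and 30 when counted from 1.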
Download the data
From an sratools object you can fetch just the info, or you can download the files as well. Here we call .run() to download the data into a designated workdir. The name_fields argument sets which fields of the fetch_runinfo table are used to name the files. The values used here, (1,30), correspond to the Run and SampleName columns, so name_fields appears to count columns from 1, unlike the 0-based positions used with iloc. The accessions argument here is the first five SRR run IDs from the table above.
[10]:
# select first 5 samples
list_of_srrs = stable.Run[:5]
list_of_srrs
[10]:
0 SRR2895732
1 SRR2895743
2 SRR2895755
3 SRR2895756
4 SRR2895757
Name: Run, dtype: object
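The selection above is a pandas Series, and taking the first five rows is only one way to choose runs. As a sketch, runs can also be picked by filtering on another column and converted to a plain Python list with `.tolist()`. The stand-in frame `demo` below is hypothetical and copies values from the table above.

```python
import pandas as pd

# Stand-in for `stable`, with values copied from the table above.
demo = pd.DataFrame({
    "Run": ["SRR2895732", "SRR2895743", "SRR2895755", "SRR2895756", "SRR2895757"],
    "ScientificName": [
        "Viburnum betulifolium",
        "Viburnum bitchiuense",
        "Viburnum carlesii",
        "Viburnum cinnamomifolium",
        "Viburnum clemensae",
    ],
})

# Keep runs whose species name matches either epithet (regex alternation).
mask = demo["ScientificName"].str.contains("carlesii|clemensae")
selected = demo.loc[mask, "Run"].tolist()
print(selected)
```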
[11]:
# new sra object
sra2 = ipa.sratools(accessions=list_of_srrs, workdir="downloaded")
# call download (run) function
sra2.run(auto=True, name_fields=(1,30))
Parallel connection | oud: 4 cores
[####################] 100% 0:02:07 | downloading/extracting fastq data
5 fastq files downloaded to /home/deren/Documents/ipyrad/newdocs/cookbook/downloaded
Check the data files
You can see that the files were named according to the Run accession (SRR) and the SampleName field in the table. The intermediate .sra files were removed and only the fastq files were saved.
[12]:
! ls -l downloaded
total 6174784
-rw-rw-r-- 1 deren deren 1372440058 Aug 17 16:36 SRR2895732_Lib1_betulifolium.fastq
-rw-rw-r-- 1 deren deren 1422226640 Aug 17 16:36 SRR2895743_Lib1_bitchiuense_combined.fastq
-rw-rw-r-- 1 deren deren 759216310 Aug 17 16:37 SRR2895755_Lib1_carlesii_D1_BP_001.fastq
-rw-rw-r-- 1 deren deren 1812215534 Aug 17 16:36 SRR2895756_Lib1_cinnamomifolium_PWS2105X.fastq
-rw-rw-r-- 1 deren deren 956848184 Aug 17 16:36 SRR2895757_Lib1_clemensiae_DRY6_PWS_2135.fastq
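The same check can be scripted so that a notebook stops early when a download is incomplete. The sketch below defines a hypothetical helper, `missing_fastqs`, which assumes the `<Run>_<SampleName>.fastq` naming seen in the listing above and reports any expected file that is absent or empty. The demo runs on a temporary directory with a tiny placeholder file, so it does not need the real downloads.

```python
import tempfile
from pathlib import Path


def missing_fastqs(workdir, runs, sample_names):
    """Return expected '<Run>_<SampleName>.fastq' names that are absent or empty."""
    workdir = Path(workdir)
    expected = [f"{run}_{name}.fastq" for run, name in zip(runs, sample_names)]
    return [
        fname for fname in expected
        if not (workdir / fname).is_file() or (workdir / fname).stat().st_size == 0
    ]


# Demo: a throwaway directory holding one placeholder fastq record.
with tempfile.TemporaryDirectory() as tmp:
    placeholder = Path(tmp) / "SRR2895732_Lib1_betulifolium.fastq"
    placeholder.write_text("@read1\nACGT\n+\nIIII\n")
    missing = missing_fastqs(
        tmp,
        ["SRR2895732", "SRR2895743"],
        ["Lib1_betulifolium", "Lib1_bitchiuense_combined"],
    )
print(missing)
```

For the download above, the call would be `missing_fastqs("downloaded", stable.Run[:5], stable.SampleName[:5])`, and an empty list would mean all five files are present.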