ipyrad-analysis toolkit: sratools¶

For reproducibility, it helps to download the raw data for your analysis from an online repository such as NCBI with a short script at the top of your notebook. We’ve written a simple wrapper around the sratools command-line program (which is notoriously difficult to use and poorly documented) to make this easier.

Required software¶

[1]:

# conda install ipyrad -c bioconda
# conda install sra-tools -c bioconda

[2]:

import ipyrad.analysis as ipa

Fetch info for a published data set by its accession ID¶

You can find the study ID or individual sample IDs in published papers or by searching NCBI or related databases. ipyrad takes as input one or more accession IDs for individual Runs or Studies (SRR or SRP, and similarly ERR or ERP, etc.).
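As a quick sanity check before passing IDs to the wrapper, the accession prefixes mentioned above can be told apart with a small pattern match. This is a minimal sketch and not part of ipyrad: the helper name `classify_accession` is hypothetical. It also recognizes experiment (X) and sample (S) accessions, although the text above only states that Runs and Studies are accepted.

```python
import re

# SRA-style accession: archive letter (S = NCBI, E = EBI, D = DDBJ),
# then "R", then an object-type letter, then digits.
ACCESSION_RE = re.compile(r"([SED])R([RPXS])(\d+)")

KINDS = {"R": "run", "P": "study", "X": "experiment", "S": "sample"}


def classify_accession(acc):
    """Return the object type of an accession string (hypothetical helper)."""
    match = ACCESSION_RE.fullmatch(acc.strip())
    if match is None:
        raise ValueError(f"not a recognized SRA accession: {acc!r}")
    return KINDS[match.group(2)]


for acc in ["SRP065788", "SRR2895732"]:
    print(acc, classify_accession(acc))
```

A check like this catches typos in an accession before any network request is made.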

[3]:

# init sratools object with an accessions argument
sra = ipa.sratools(accessions="SRP065788")

[4]:

# fetch info for all samples from this study, save as a dataframe
stable = sra.fetch_runinfo()

Fetching project data...

[5]:

# the dataframe has all information about this study
stable.head()

[5]:

	Run	ReleaseDate	LoadDate	spots	bases	avgLength	size_MB	AssemblyName	download_path	...	SRAStudy	BioProject	Study_Pubmed_id	ProjectID	Sample	BioSample	SampleType	TaxID	ScientificName	SampleName
0	SRR2895732	2015-11-04 15:50:01	2015-11-04 17:19:15	2009174	182834834	91	116	NaN	https://sra-download.ncbi.nlm.nih.gov/sos/sra-...	...	SRP065788	PRJNA299402	NaN	299402	SRS1146158	SAMN04202163	simple	224736	Viburnum betulifolium	Lib1_betulifolium
1	SRR2895743	2015-11-04 15:50:01	2015-11-04 17:18:35	2452970	223220270	91	140	NaN	https://sra-download.ncbi.nlm.nih.gov/sos/sra-...	...	SRP065788	PRJNA299402	NaN	299402	SRS1146171	SAMN04202164	simple	1220044	Viburnum bitchiuense	Lib1_bitchiuense_combined
2	SRR2895755	2015-11-04 15:50:01	2015-11-04 17:18:46	4640732	422306612	91	264	NaN	https://sra-download.ncbi.nlm.nih.gov/sos/sra-...	...	SRP065788	PRJNA299402	NaN	299402	SRS1146182	SAMN04202165	simple	237927	Viburnum carlesii	Lib1_carlesii_D1_BP_001
3	SRR2895756	2015-11-04 15:50:01	2015-11-04 17:20:18	3719383	338463853	91	214	NaN	https://sra-download.ncbi.nlm.nih.gov/sos/sra-...	...	SRP065788	PRJNA299402	NaN	299402	SRS1146183	SAMN04202166	simple	237928	Viburnum cinnamomifolium	Lib1_cinnamomifolium_PWS2105X
4	SRR2895757	2015-11-04 15:50:01	2015-11-04 17:20:06	3745852	340872532	91	213	NaN	https://sra-download.ncbi.nlm.nih.gov/sos/sra-...	...	SRP065788	PRJNA299402	NaN	299402	SRS1146181	SAMN04202167	simple	237929	Viburnum clemensae	Lib1_clemensiae_DRY6_PWS_2135

5 rows × 30 columns
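Before downloading, it can be useful to total the `spots` and `size_MB` columns to get a rough idea of how much data is involved. The sketch below uses plain Python with values copied from the five rows shown above. It assumes that `size_MB` is the archive size in megabytes; extracted fastq files may take more space on disk.

```python
# Values copied from the first five rows of the run-info table above.
runs = [
    {"Run": "SRR2895732", "spots": 2009174, "size_MB": 116},
    {"Run": "SRR2895743", "spots": 2452970, "size_MB": 140},
    {"Run": "SRR2895755", "spots": 4640732, "size_MB": 264},
    {"Run": "SRR2895756", "spots": 3719383, "size_MB": 214},
    {"Run": "SRR2895757", "spots": 3745852, "size_MB": 213},
]

total_mb = sum(r["size_MB"] for r in runs)
total_spots = sum(r["spots"] for r in runs)

print(f"{len(runs)} runs, {total_spots:,} spots, about {total_mb} MB of archives")
```

With the real dataframe the same totals are available directly from pandas, e.g. `stable["size_MB"].sum()`.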

File names¶

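Columns of the run-info table can be used to name the downloaded files, and are selected by their position. Here we view the Run, ScientificName, and SampleName columns (positions 0, 28, and 29 in pandas' 0-based indexing).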

[8]:

stable.iloc[:5, [0, 28, 29]]

[8]:

	Run	ScientificName	SampleName
0	SRR2895732	Viburnum betulifolium	Lib1_betulifolium
1	SRR2895743	Viburnum bitchiuense	Lib1_bitchiuense_combined
2	SRR2895755	Viburnum carlesii	Lib1_carlesii_D1_BP_001
3	SRR2895756	Viburnum cinnamomifolium	Lib1_cinnamomifolium_PWS2105X
4	SRR2895757	Viburnum clemensae	Lib1_clemensiae_DRY6_PWS_2135
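Counting column positions by hand is error-prone in a 30-column table, so it may be easier to look positions up by name. The sketch below is a hypothetical helper, not an ipyrad function. It returns 1-based positions, on the assumption (inferred from the `name_fields=(1,30)` call used in this notebook) that `name_fields` counts columns from 1.

```python
def name_field_positions(columns, wanted):
    """Map column names to 1-based positions (hypothetical helper)."""
    columns = list(columns)
    return tuple(columns.index(name) + 1 for name in wanted)


# Toy header standing in for `stable.columns`; the real table has 30 columns.
toy_columns = ["Run", "ReleaseDate", "LoadDate", "spots"]

print(name_field_positions(toy_columns, ["Run", "spots"]))  # (1, 4)
```

With the real table, the call would be `name_field_positions(stable.columns, ["Run", "SampleName"])`.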

Download the data¶

From an sratools object you can fetch just the run info, or you can also download the files. Here we call .run() to download the data into a designated workdir. The name_fields argument sets which columns of the fetch_runinfo table are used to name the files. The values (1,30) used below select the Run and SampleName columns, so they appear to be counted from 1 rather than from 0 as in iloc. The accessions argument here is the first five SRR run IDs from the table above.

[10]:

# select first 5 samples
list_of_srrs = stable.Run[:5]
list_of_srrs

[10]:

0    SRR2895732
1    SRR2895743
2    SRR2895755
3    SRR2895756
4    SRR2895757
Name: Run, dtype: object

[11]:

# new sra object
sra2 = ipa.sratools(accessions=list_of_srrs, workdir="downloaded")

# call download (run) function
sra2.run(auto=True, name_fields=(1,30))

Parallel connection | oud: 4 cores
[####################] 100% 0:02:07 | downloading/extracting fastq data

5 fastq files downloaded to /home/deren/Documents/ipyrad/newdocs/cookbook/downloaded
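To make the effect of `name_fields` concrete, the sketch below imitates how the selected fields could be combined into a file name. It is an illustration only, not ipyrad's implementation. The helper `build_fastq_name`, the 1-based indexing, the underscore separator, and the replacement of spaces are all assumptions based on the file names this notebook produces.

```python
def build_fastq_name(row, name_fields=(1, 2), sep="_"):
    """Join the selected 1-based fields of a row into a fastq file name."""
    parts = [str(row[i - 1]) for i in name_fields]
    # Assumption: spaces are not wanted in file names, so replace them.
    return sep.join(parts).replace(" ", "_") + ".fastq"


# Toy three-column row (Run, ScientificName, SampleName) from the table above.
row = ("SRR2895732", "Viburnum betulifolium", "Lib1_betulifolium")

print(build_fastq_name(row, name_fields=(1, 3)))
print(build_fastq_name(row, name_fields=(1, 2)))
```

In this toy row, fields 1 and 3 correspond to fields 1 and 30 of the full table.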

Check the data files¶

You can see that the files were named according to the SRR run ID and the SampleName field of the table. The intermediate .sra files were removed and only the fastq files were saved.

[12]:

! ls -l downloaded

total 6174784
-rw-rw-r-- 1 deren deren 1372440058 Aug 17 16:36 SRR2895732_Lib1_betulifolium.fastq
-rw-rw-r-- 1 deren deren 1422226640 Aug 17 16:36 SRR2895743_Lib1_bitchiuense_combined.fastq
-rw-rw-r-- 1 deren deren  759216310 Aug 17 16:37 SRR2895755_Lib1_carlesii_D1_BP_001.fastq
-rw-rw-r-- 1 deren deren 1812215534 Aug 17 16:36 SRR2895756_Lib1_cinnamomifolium_PWS2105X.fastq
-rw-rw-r-- 1 deren deren  956848184 Aug 17 16:36 SRR2895757_Lib1_clemensiae_DRY6_PWS_2135.fastq
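For a notebook that is re-run often, a scripted check that every requested accession produced a non-empty fastq file can replace reading the listing by eye. The following is a minimal sketch using only the standard library. The helper `missing_fastqs` is hypothetical, and it assumes the naming pattern seen above (accession, optionally followed by an underscore and further fields, ending in `.fastq`).

```python
import tempfile
from pathlib import Path


def missing_fastqs(workdir, accessions):
    """Return the accessions that have no non-empty .fastq file in workdir."""
    fastqs = [p for p in Path(workdir).glob("*.fastq") if p.stat().st_size > 0]
    missing = []
    for acc in accessions:
        # Match "ACC.fastq" or "ACC_<anything>.fastq", but not a longer ID.
        found = any(p.stem == acc or p.name.startswith(acc + "_") for p in fastqs)
        if not found:
            missing.append(acc)
    return missing


# Demonstration in a temporary directory with one fake fastq file.
with tempfile.TemporaryDirectory() as tmp:
    fake = Path(tmp) / "SRR2895732_Lib1_betulifolium.fastq"
    fake.write_text("@read1\nACGT\n+\nIIII\n")
    print(missing_fastqs(tmp, ["SRR2895732", "SRR2895743"]))  # ['SRR2895743']
```

In this notebook the call would be `missing_fastqs("downloaded", list_of_srrs)`, and an empty list would mean that all five runs are present.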