Tutorial

The following examples show how KvarQ can be used to quickly analyze genomic data in .fastq format. All examples assume that you have successfully downloaded and installed KvarQ, but have no other prerequisites. The tutorials are ordered from simpler to more complicated.

Ebola Outbreak 2014

Gire et al have sequenced some 99 virus genomes during the 2014 Ebola outbreak and immediately released all sequence data to “facilitate rapid global research”. In the following, we’re going to develop a simple testsuite that allows KvarQ to say whether a .fastq file is from a) a ebola virus, b) the 2014 outbreak, and c) from any of the three sublineages defined in the paper (see figure 4A).

Creating the Testsuite

The following steps led to the Kvarq Ebola sierraleone14 testsuite.

All necessary supplementary materials can be downloaded from the Science webpage. In particular, we’re interested in

  • Table S2 with the accession numbers of all 99 sequenced genomes. This are serial isolates from 78 patients.
  • File S1 that contains ebov.mafft.fasta with the sequence alignment. There are 20 sequences before 2014 and 81 sequences from 2014. The sequences form Sierra Leone (sequenced in this paper) are identified by comparing them to the ID column in Table S2.
  • Table S4 with the sheet 2014_specific_snps that lists all SNPs that were found in the new sequences.

With these three files we can generate a list of SNPs that are unique to the isolates from Sierra Leone. Gire et al define three sub-lineages (figure 4A) and by comparing the number of sequences that have specific SNPs we can compile the following table (see the script to extract the SNPs on github).

Position Ancestral Derived Sublineage
800 C T SL2
1849 T C SL1
6283 C T SL1
8928 A C SL2
10218 G A SL3
13856 A G SL1
15660 T C SL1
15963 G A SL2
17142 T C SL2

Creating a testsuite from this data is quite straightforward. First, we choose a reference genome to extract data from. This reference genome should have the ancestral genotype for every SNP that we define. Because we are only interested in SNPs from the 2014 strains, we simply take a genome from a previous isolate, for example the sequence EBOV_1976_KC242801 from the file ebov.mafft.fasta and we save it in a new file called EBOV76.fasta. this file is then loaded as a Genome in the testsuite:

# old ebola genome from previous outbreak
EBOV76 = Genome(os.path.join(os.path.dirname(__file__), 'EBOV76.fasta'))

Next, we define the a Reference that identifies the source of the data. Then, we Genotype can be bound to a gene, but in our case we simply specify it by name.

gire14 = Reference('Gire et al (2014) doi 10.1126/science.1259657')

# sub-lineages as defined in gire14
SL1 = Genotype('SL1')
SL2 = Genotype('SL2')
SL3 = Genotype('SL3')

In the next step we define the actual SNPs and bind them to the genotypes defined above.

# SNPs extracted from primary data using suppl/_extract_SNPs.py
SNPs = [
        Test(SNP(genome=EBOV76, pos=800, orig='C', base='T'), SL2, gire14),
        Test(SNP(genome=EBOV76, pos=1849, orig='T', base='C'), SL1, gire14),
        Test(SNP(genome=EBOV76, pos=6283, orig='C', base='T'), SL1, gire14),
        Test(SNP(genome=EBOV76, pos=8928, orig='A', base='C'), SL2, gire14),
        Test(SNP(genome=EBOV76, pos=10218, orig='G', base='A'), SL3, gire14),
        Test(SNP(genome=EBOV76, pos=13856, orig='A', base='G'), SL1, gire14),
        Test(SNP(genome=EBOV76, pos=15660, orig='T', base='C'), SL1, gire14),
        Test(SNP(genome=EBOV76, pos=15963, orig='G', base='A'), SL2, gire14),
        Test(SNP(genome=EBOV76, pos=17142, orig='T', base='C'), SL2, gire14),
    ]

Finally, we define a new Testsuite from these SNPs and instantiate it to a variable called sierraleone14, which must be the same name as python file. We could use the standard testsuite:

sierraleone14 = Testsuite(SNPs, VERSION)

But we instead choose to define a new testsuite called CountGenotype that subclasses the _analyse method, to summarize all SNPs into one line that shows the genotype and the number of SNPs found for this genotype. See the complete testsuite on github.

Creating a testsuite from this test data is quite straightforward: simply define each of the SNPs as a Test and instantiate a Testsuite. As a bonus, the Kvarq Ebola sierraleone14 testsuite overrides the _analyse method of the testsuite to display how many of the specified SNPs have been found for every sublineage. The reference genome EBOV76.fasta is the first genome found in the file ebov.mafft.fasta.

View the complete source code of the testsuite on github.

Running the Testsuite

Choose any of the patients, e.g. EM119 from table S2. The corresponding accession number KM233042 in the nucleotide archive yields another link into the biosample database : SAMN02951962, from where the raw sequencing data can be downloaded. NCBI stores the .fastq file in .sra format, but this can easily be converted after download using the fastq-dump command from the SRA Toolkit.

Now simply download the Kvarq Ebola sierraleone14 testsuite, start the KvarQ GUI, load the testsuite and analyze the .fastq file, or launch KvarQ from the command line. The resulting .json file can be opened in the explorer and should show that the sample from EM119 is sublineage three, showing all the 6 SNPs from SL1, the 4 SNPs from SL2, and the SNP from SL3.

Downloading the sample from EM120 (biosample SAMN02951963) and analyzing it the same way shows that this sample also is positive for the 6 SNPs from SL1, and the 4 SNPs from SL2, but that is missing the SNP from SL3 (opening the SNP with the explorer shows that it has the original base G at position 10218).

Creating a new SNP testsuite

After reading the interesting article A robust SNP barcode for typing Mycobacterium tuberculosis complex strains I thought it would be nice to analyze some .fastq files with that new barcoding scheme.

To get things done quickly, I was browsing through the testsuites in the testsuites/examples directory and found a testsuite called SNPs.py that looked promising. This testsuite defines a function that loads SNP declarations from a .tsv file that can easily be edited with a popular spreadsheet program.

here = os.path.dirname(__file__)
# we borrow the reference from the ../MTBC testsuites
genome_path = os.path.join(here, os.path.pardir, 'MTBC', 'MTB_ancestor_reference.bases')
genome = Genome(genome_path, 'MTB ancestor')
ref = Reference('specify reference here')
# load SNP information from .tsv file (can be edited with Excel)
SNPs = tsv2SNPs(os.path.join(here, 'SNPs.tsv'), genome, ref)

The format of the .tsv file is straightfoward:

lineage1 3920109 G/T
lineage1 3597682 C/T
lineage1 1590555 C/T
lineage2 1834177 A/C
lineage2 3304966 G/A
... ... ...

There is simply an identifier, followed by the position of the SNP within the reference genome (loaded from the file ../MTBC/MTB_ancestor_reference.bases in the example), then the original base, and finally the derived base. Actually, the SNPs defined in this example testsuite are the same as the ones used for the main lineage classification in the testsuite MTBC/phylo. We can quickly confirm this by performing a scan using the MTBC/phylo and the examples/SNPs testsuites and comparing the result (type these commands in KvarQ’s root directory)

kvarq scan -l MTBC/phylo -l examples/SNPs tests/fastqs/N0116_1_hits_1k.fastq N0116_phylo_SNPs.json
kvarq illustrate -r N0116_phylo_SNPs.json

This should result in the following output:

examples/SNPs
-------------
['lineage2::SNP1834177AC', 'lineage2::SNP3304966GA']

MTBC/phylo
----------
'lineage 2 -- low coverage (median below 10x)'

So indeed both testsuites report lineage2 – because examples/SNPs does not subclass kvarq.genes.Testsuite, the result is simply the list of SNPs that were found in the file, while MTBC/phylo fuses the two SNPs into one lineage result and warns at the same time of low coverage, but that’s material for another tutorial post...

Coming back the SNP barcoding: it’s simple enough to compile a list of all SNPs mentioned in the paper. It starts like this:

lineage1 615938 G/A
lineage1.1 4404247 G/A
lineage1.1.1 3021283 G/A
lineage1.1.1.1 3216553 G/A
lineage1.1.2 2622402 G/A
lineage1.1.3 1491275 G/A
lineage1.2.1 3479545 C/A
... ... ...

So let’s first create a new directory for the testsuite-to-be-created, calling it testsuites/MTBC-SNP-barcodes. Then we copy the following files

  • testsuites/MTBC-SNP-barcodes/coll14.py : a copy of the file testsuites/examples/SNPs.py, will be modified below
  • testsuites/MTBC-SNP-barcodes/coll14.tsv : the SNP list extracted from the paper; you can download the list from github

Some parts of the example testsuite have to be modified accordingly

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
VERSION = '0.1'
GENES_COMPATIBILITY = '0.0'

import os.path

from kvarq.genes import Genome, Reference, SNP, Test, Testsuite, Genotype

def tsv2SNPs(path, genome, reference):

    tests = []
    for line in file(path):

        parts = line.strip().split('\t')
        name = parts[0]
        pos = int(parts[1])
        bases = parts[2].split('/')

        snp = SNP(genome=genome, pos=pos, orig=bases[0], base=bases[1])
        test = Test(snp, Genotype(name), reference)
        tests.append(test)

    return tests

here = os.path.dirname(__file__)
genome = Genome(os.path.join(here, 'MTB_ancestor_reference_coll.bases'), 'MTB ancestor')
coll14 = Reference('Coll et al (2014) -- doi: 10.1038/ncomms5812')
SNPs = tsv2SNPs(os.path.join(here, 'coll14.tsv'), genome, coll14)

coll14 = Testsuite(SNPs, VERSION)

Remarks

  • line 1 : it doesn’t really matter what VERSION we specify, but it’s important to increase it when the testsuite is modified to maintain compatibility
  • line 25 : because the reference genome MTBC/MTB_ancestor_reference.bases that was used in the MTBC/phylo testsuite has already the derived base in some of the SNPs defined in coll14.tsv, we cannot use it as a reference genome (KvarQ asserts that the reference genome has the ancestral base for all defined SNPs to prevent errors). therefore, I have assembled a new reference genome that has the ancestral base for all SNPs
  • line 26 : the reference for the testsuite is the original publication from which the SNPs are taken
  • line 27 : the SNPs are read rom the file coll14.tsv
  • line 29 : the testsuite must be named like the file

Now let’s see whether KvarQ accepts the new testsuite: the command kvarq info -l MTBC-SNP-barcodes/coll14 should produce the following output:

version=0.12.2
testsuites=MTBC-SNP-barcodes/coll14-0.1[62:3162bp]
sum=62 tests,3162bp
sys.prefix=/Library/Frameworks/Python.framework/Versions/2.7

So the new testsuite is accepted and KvarQ tells us that it contains 62 tests totaling 3162 base pairs (that’s 62 times 1 base plus two flanks of 25 base pairs each).

Running the testsuite

Let’s first scan a single .fastq to make sure the testsuite works as expected. For example from the internet: MTB_98_1833. Then we scan this file with our new testsuite:

kvarq scan -p -l MTBC-SNP-barcodes/coll14 MTB_98_1833.fastq.gz MTB_98_1833.json

After a couple of minutes we can examine the result of the scan:

kvarq illustrate MTB_98_1833.json
kvarq explorer MTB_98_1833.json

This shows us the file contained at the same time SNPs characteristic for lineage 2 and lineage 4, and that the reads are quite short (around 35 base pairs after quality trimming). Practicaly all SNPs were found (with a coverage ranging from 20 to 50), most in their ancestral variant.

Ok, so everything seems to work and we can proceed scanning our local library of .fastq files, by writing a simple bash script

#!/bin/bash
mkdir results/
for fastq in /genomes/fastqs/*.fastq; do
    json=`basename "$fastq"`.json
    kvarq -l coll14_scan.log scan -l MTBC-SNP-barcodes/coll14 $fastq results/$json
done

Some hours later we have scanned for the SNP barcodes of hundreds of genomes, with a copy of the KvarQ log in the file coll14_scan.log and a new .json file in the results/ directory for every genome scanned. This information can then be further analyzed using a script that shows all information in tabular form and can also be downloaded from github (note that the script starts with an underscore _ because it is not a testsuite itself and should not be auto-discovered).

The finished testsuite can also be found on github.