Complete Genome Sequences of Strains from 36 Serotypes of Salmonella
James Robertson, Catherine Yoshida, Simone Gurnik, John H. E. Nash
We report here the complete, closed genome sequences of strains representing 36 serotypes of Salmonella. These sequences will serve as references for understanding the genetic variation between serotypes, particularly for mapping raw reads or creating higher-quality assemblies, and will aid comparative genomic studies of Salmonella.
Salmonella spp. are the leading cause of bacterial gastroenteritis in North America, with over 1.7 million cases per annum (1). Public health jurisdictions are replacing traditional serotyping with whole-genome sequencing (WGS) methodologies for quicker and more accurate outbreak detection and surveillance activities (2). To this end, we previously developed an in silico serotyping platform for Salmonella (3, 4).
Unfortunately, the large amount of raw data available in the Sequence Read Archive (SRA) is composed primarily of Illumina short reads, which cannot be assembled into a closed Salmonella genome as one contiguous nucleic acid molecule. As of November 2017, there were 501 fully closed genomes for Salmonella enterica and 4 for Salmonella bongori. Therefore, we sequenced 36 diverse serotypes of Salmonella using a combination of Illumina and PacBio technologies to produce high-quality genomes for public health and comparative genomics applications. Of these, 25 serotypes are represented by a closed reference genome for the first time.
Genomic DNA was isolated using the automated Qiagen EZ1 DNA tissue kit according to the manufacturer’s protocol, except that 180 μl of G2 buffer was used with 10 μl of proteinase K and 10 μl of lysozyme (10 mg/ml; Sigma-Aldrich, Gillingham, UK). PacBio sequencing was performed at the Génome Québec Innovation Centre (McGill University, Quebec, Canada) using single-molecule real-time (SMRT) cells in an RSII sequencer, which produced 100,000 to 150,000 reads per sample, with an average read length of 6,000 bp. The PacBio read sets were assembled into circular consensus sequences using the HGAP workflow 1.1.13. Illumina sequencing on MiSeq version 3 (600-cycle kit) using Nextera XT libraries was performed at the National Microbiology Laboratory at Winnipeg (Winnipeg, Manitoba, Canada) to a target of 60-fold coverage. The quality of the Illumina read sets was examined using FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Illumina read correction was performed using Lighter version 1.1.1 (https://github.com/mourisl/Lighter). Corrected Illumina reads were then mapped to the PacBio assembly using Bowtie2 version 2.1.0 (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) with the very-sensitive-local option. The output was sorted and converted into a BAM file using SAMtools version 1.3 (http://samtools.sourceforge.net/) and used as input to Pilon version 1.2.2 (https://github.com/broadinstitute/pilon). This mapping and correction process was repeated on each corrected assembly until Pilon made no further changes to the output. Final assemblies were examined using Gap5 software version 1.2.14 (http://www.sanger.ac.uk/science/tools/gap5). Completed assemblies were processed through the Salmonella In Silico Typing Resource (SISTR) (3, 4) to confirm that the in silico predictions matched the serotyping results previously obtained by our OIE Reference Laboratory for Salmonellosis in Guelph, Ontario, Canada.
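The iterative mapping and polishing step described above can be sketched roughly as follows. This is a minimal illustration and not the authors' actual scripts: the file names, the `pilon.jar` path, the paired-end read layout, and the `max_rounds` safety limit are all assumptions introduced here, and the loop simply stops once a round reports zero changes.

```python
import subprocess
from pathlib import Path
from typing import Callable, List, Tuple


def build_round_commands(assembly: str, reads_1: str, reads_2: str,
                         prefix: str, pilon_jar: str = "pilon.jar") -> List[List[str]]:
    """Build (but do not run) the commands for one polishing round.

    All paths are hypothetical; reads are assumed to be paired-end and
    already error-corrected.
    """
    index = f"{prefix}.idx"
    sam = f"{prefix}.sam"
    bam = f"{prefix}.sorted.bam"
    return [
        # Index the current assembly and map the corrected reads to it.
        ["bowtie2-build", assembly, index],
        ["bowtie2", "--very-sensitive-local", "-x", index,
         "-1", reads_1, "-2", reads_2, "-S", sam],
        # Sort and convert to BAM, then index the BAM for Pilon.
        ["samtools", "sort", "-o", bam, sam],
        ["samtools", "index", bam],
        # Polish; --changes asks Pilon to list every edit it makes.
        ["java", "-jar", pilon_jar, "--genome", assembly,
         "--frags", bam, "--output", prefix, "--changes"],
    ]


def run_round(assembly: str, reads_1: str, reads_2: str,
              prefix: str) -> Tuple[str, int]:
    """Run one round and return (polished FASTA path, number of changes)."""
    for cmd in build_round_commands(assembly, reads_1, reads_2, prefix):
        subprocess.run(cmd, check=True)
    changes = Path(f"{prefix}.changes")
    # Assumption: a missing or empty changes file means nothing was edited.
    lines = changes.read_text().splitlines() if changes.exists() else []
    return f"{prefix}.fasta", sum(1 for line in lines if line.strip())


def polish_until_stable(assembly: str,
                        polish_round: Callable[[str, int], Tuple[str, int]],
                        max_rounds: int = 10) -> Tuple[str, int]:
    """Repeat polishing until a round makes no changes.

    polish_round(assembly, round_no) must return (new_assembly, n_changes).
    Returns the final assembly path and the number of rounds performed.
    """
    for round_no in range(1, max_rounds + 1):
        assembly, n_changes = polish_round(assembly, round_no)
        if n_changes == 0:
            return assembly, round_no
    raise RuntimeError(f"assembly still changing after {max_rounds} rounds")
```

In practice the loop would be driven by something like `polish_until_stable("hgap.fasta", lambda asm, n: run_round(asm, "r1.cor.fq", "r2.cor.fq", f"round{n}"))`, where the read file names are placeholders. Passing the round as a callable keeps the stopping logic separate from the external tools, so it can be checked without the tools installed.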
Closed reference genomes are of great value for understanding the biology of pathogens, and it is therefore important that genome repositories contain as many of them as possible. The genomes reported here can serve as reference sequences for the WGS assembly of isolates of the same or highly similar serotypes and provide more accurate genomes for comparative and epidemiological studies, including outbreak detection and surveillance of Salmonella.
DOI: 10.1128/genomeA.01472-17