NGS – Next Generation Sequencing
Next generation sequencing (NGS) is an extremely high-throughput technology that generates orders of magnitude more data than traditional Sanger sequencing. This is made possible by generating millions of sequence clusters in parallel and reading all of them base by base through repeated cycles of nucleotide incorporation, fluorescence imaging, and dye cleavage. The explosion of experimental data from NGS has driven the need for new paradigms for data computation and knowledge extraction. In this knowledge base, we describe the common formats of NGS raw data files and some downstream analyses of NGS data.
The raw output of all Illumina-based next-generation sequencing machines is the .bcl format. The name refers to base calls: these binary files store, cycle by cycle, both the base call and the quality of that base call for the clusters in every “tile”. Each lane of the flow cell has a fixed number of swaths on its top and bottom surfaces; each swath contains a number of tiles, and each tile contains a variable number of nucleotide clusters. Illumina sequencing platforms use flow cells that vary in the number of lanes, swaths per lane, and tiles per swath. For example, the high output flow cell for the NextSeq 500 system has four lanes, each with three swaths on the top and bottom surfaces, for a total of 864 tiles, each containing thousands of template clusters.
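To make the idea of a combined base call and quality value concrete, here is a minimal Python sketch of decoding such a file. It assumes the commonly documented uncompressed .bcl layout, which is not spelled out in the text above: a 4-byte little-endian cluster count, followed by one byte per cluster whose lowest two bits encode the base (A, C, G, T) and whose upper six bits encode the quality score, with a zero byte meaning "no call". The function names and the sample payload are hypothetical, and real instruments may wrap this layout in compression or per-lane aggregation.

```python
import struct

BASES = "ACGT"  # assumed 2-bit encoding: 0=A, 1=C, 2=G, 3=T


def decode_bcl(data: bytes):
    """Decode an uncompressed BCL payload into (base, quality) pairs.

    Assumed layout: 4-byte little-endian cluster count, then one byte
    per cluster (low 2 bits = base, high 6 bits = quality, 0 = no call).
    """
    (n_clusters,) = struct.unpack_from("<I", data, 0)
    calls = []
    for byte in data[4:4 + n_clusters]:
        if byte == 0:
            calls.append(("N", 0))  # no call for this cluster in this cycle
        else:
            calls.append((BASES[byte & 0b11], byte >> 2))
    return calls


def encode_call(base: str, quality: int) -> int:
    """Hypothetical helper that packs one call into a byte, for the demo only."""
    return BASES.index(base) | (quality << 2)


# A made-up payload for one tile in one cycle: three clusters.
payload = struct.pack("<I", 3) + bytes(
    [encode_call("A", 30), encode_call("T", 12), 0]
)
print(decode_bcl(payload))  # [('A', 30), ('T', 12), ('N', 0)]
```

Packing the call and its quality into a single byte keeps each cycle's output small, which matters when a tile holds thousands of clusters.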
The advantage of the .bcl file format is that each base call is recorded as the machine makes it. In the fastq file format, by contrast, a record can only be written after the entire sequence has been read. Writing the base calls for every cluster in a particular tile at a particular cycle number, as the .bcl format does, is a much more efficient process for the sequencing machine.
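The difference between the two layouts can be sketched in Python. The example below is an illustration under simplifying assumptions, not the conversion performed by any particular tool: it takes per-cycle lists of (base, quality) calls, which is how the sequencer produces them, and transposes them into per-cluster reads with Phred+33 quality strings, which is what a fastq record needs. The function names, read identifiers, and toy data are hypothetical.

```python
def cycles_to_reads(cycles):
    """Transpose cycle-major calls into per-cluster (sequence, quality) reads.

    cycles[i][j] is the (base, quality) call for cluster j in cycle i.
    """
    n_clusters = len(cycles[0])
    reads = []
    for j in range(n_clusters):
        seq = "".join(cycle[j][0] for cycle in cycles)
        # Phred+33: quality score q is written as the character chr(q + 33).
        qual = "".join(chr(cycle[j][1] + 33) for cycle in cycles)
        reads.append((seq, qual))
    return reads


def to_fastq(reads, prefix="cluster"):
    """Format reads as fastq text; the read names here are placeholders."""
    return "".join(
        f"@{prefix}_{i}\n{seq}\n+\n{qual}\n"
        for i, (seq, qual) in enumerate(reads)
    )


# Toy data: three cycles, two clusters.
cycles = [
    [("A", 30), ("C", 30)],  # cycle 1
    [("G", 20), ("T", 40)],  # cycle 2
    [("T", 10), ("N", 0)],   # cycle 3
]
reads = cycles_to_reads(cycles)
print(to_fastq(reads), end="")
```

Note that no read can be completed until the last cycle has been processed, which is why the per-cycle layout suits the instrument and the per-read layout is produced afterwards.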