Saturday, August 22, 2020

[gfcxhqoa] SARS-CoV-2 genome compressed to 2 bits per nucleotide

Here is a Perl script that emits the SARS-CoV-2 genome.  The script internally stores the genome in a simple but compressed form.

SARS-CoV-2 is the RNA virus which causes COVID-19.  It has a genome of 29903 nucleotides (GenBank reference sequence NC_045512.2).  (Note: the virus mutates, so it also has sequences of other lengths in GenBank.)

RNA has 4 possible nucleotides (A C G U) so can be encoded with 2 bits per nucleotide or 4 nucleotides per byte.  Using this "naive" compression method, the genome can be compressed to 29903/4 = 7475.75 bytes.

To deal with the partial byte at the end, we append one additional adenine to the end of the genome.  The genome already has a poly-A tail of 34 adenines; we make it 35.  (We assume this change does not biologically affect the virus: it is likely poly-A tails of many different lengths exist among the viruses in the wild.)  This makes the compressed length 7476 bytes.

We prepend an 86-byte Perl script (also given below) in front of the binary compressed genome.  The script reads the appended binary compressed genome out of the DATA file handle, making the whole thing self-extracting under Perl.  The script prints the text genome to standard output with the nucleotides A C G and U (not T as originally deposited on GenBank).

Even if you don't have Perl, the script specifies how to decode the binary data (documenting details such endianness within a byte, how nucleotides are encoded as 2 bits).  The script uses Perl's well documented vec() function to parse the bitstring.  Encoding "A" as zero causes the poly-A tail to grow by 1, as mentioned above.

Can code golf reduce the length of this script further?  If you don't need the specification to be executable, just "perl vec ACGU\n" (14 bytes) might be sufficient to hint at how to decode.

#!perl
undef$/;$_=<DATA>;for$i(1..4*length){print(qw/A C G U/[vec$_,$i-1,2])}
__END__

For completeness, here is the bash command line that produced the self-extracting Perl script.  sars-cov-2.fasta is the sequence downloaded from GenBank in FASTA format.  The files rna-header.pl and rna-compress.pl are available here.

( cat rna-header.pl ; perl -nwe 'next if /^>/;y/T/U/;die unless/^[ACGU]*$/;chomp;print' sars-cov-2.fasta | perl rna-compress.pl ) > sars-cov-2-genome.pl

The resulting self-extracting Perl script has length 86+7476=7562 bytes.

Incidentally, conventional data compression programs do not do well with the genome.  We will explore such data compression of the SARS-CoV-2 genome in a future post.

No comments :