Monday, August 31, 2020

[fpseyamg] COVID-19 genome as QR

Below is the SARS-CoV-2 genome encoded as 6 QR code symbols.  The 6 symbols are linked with the "Structured Append" QR code feature.  Unfortunately, I do not know of any QR code reader application that handles Structured Append.  The ZXing library does sort of decode Structured Append metadata (also see https://programresource.net/en/2013/05/04/2202.html), but no app I know of does anything with it.  Also, hand-held QR code scanners typically do poorly on large QR codes.  If you submit an image with Structured Append to https://zxing.org/w/decode, the raw bytes in the output include 5 Structured-Append-related header bytes, plus trailer bytes in the final code, all of which you have to parse and strip yourself.  The data is encoded in QR code's 8-bit mode, which typical scanners and decoders handle poorly if the data is not printable text.
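Here is a rough sketch of how those 5 header bytes might be picked apart.  It assumes the usual bit layout for a byte-mode symbol of version 10 or larger inside a Structured Append sequence: a 4-bit Structured Append mode indicator, a 4-bit symbol position, a 4-bit symbol count minus one, 8 bits of parity, a 4-bit byte-mode indicator, and a 16-bit length.  The function name parse_sa_header and the sample bytes 305ab404f7 (including the parity value) are made up for illustration; they are not taken from real decoder output.

```shell
#!/usr/bin/env bash
# Split 5 raw header bytes (given as 10 hex digits) into bit fields.
# Assumed layout: 4 + 4 + 4 + 8 + 4 + 16 = 40 bits, most significant first.
parse_sa_header() {
  local hex=$1
  local v=$(( 16#$hex ))
  echo "mode=$(( (v >> 36) & 0xF ))"         # 3 means Structured Append
  echo "index=$(( (v >> 32) & 0xF ))"        # 0-based position in the sequence
  echo "total=$(( ((v >> 28) & 0xF) + 1 ))"  # stored as count minus one
  echo "parity=$(( (v >> 20) & 0xFF ))"      # XOR of all bytes of the whole message
  echo "data_mode=$(( (v >> 16) & 0xF ))"    # 4 means 8-bit byte mode
  echo "length=$(( v & 0xFFFF ))"            # payload bytes in this symbol
}

# Hypothetical header for the first of 6 symbols carrying 1271 bytes.
parse_sa_header 305ab404f7
```

The parity field is the same in every symbol of a sequence, which is how a reader could tell that scanned symbols belong together.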

Input was the reference genome NC_045512.2 encoded with 2 bits per nucleotide, with a short Perl script prepended specifying exactly how to decode.
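To illustrate what 2 bits per nucleotide means, here is a minimal sketch that packs 4 bases into each byte and prints the result as hex.  The mapping A=0, C=1, G=2, T=3, the most-significant-first bit order, the zero padding of the last byte, and the name pack_nucleotides are all assumptions made for this example; the Perl script in the QR codes defines the actual encoding.

```shell
#!/usr/bin/env bash
# Pack a nucleotide string into bytes, 4 bases per byte, printed as hex.
# Mapping and bit order are illustrative assumptions, not the post's encoding.
pack_nucleotides() {
  local seq=$1 out="" byte=0 n=0 i code
  for (( i = 0; i < ${#seq}; i++ )); do
    case ${seq:i:1} in
      A) code=0 ;;
      C) code=1 ;;
      G) code=2 ;;
      T) code=3 ;;
      *) echo "unexpected symbol: ${seq:i:1}" >&2; return 1 ;;
    esac
    byte=$(( (byte << 2) | code ))
    n=$(( n + 1 ))
    if (( n == 4 )); then
      out+=$(printf '%02x' "$byte")
      byte=0
      n=0
    fi
  done
  # Pad a final partial byte with zero bits on the right.
  if (( n > 0 )); then
    out+=$(printf '%02x' $(( byte << (2 * (4 - n)) )))
  fi
  echo "$out"
}

pack_nucleotides ACGTTGCA            # prints 1be4

# Bytes needed for the 29903 bases of NC_045512.2 at 4 bases per byte.
echo $(( (29903 + 3) / 4 ))          # prints 7476
```

If the packing is exactly 4 bases per byte, that would leave roughly 86 of the 7562 input bytes for the Perl script.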

Here is the bash command line that generated the image:

qrencode -l H -S --verbose -8 -o foo.png -v 40 < sars-cov-2-genome.pl && for file in foo*.png ; do pngtopnm "$file" > "$file.pnm" ; done && pnmcat -tb foo-0*.pnm | pnmtopng > sars-cov-2-genome-qr.png

The default QR code image settings felt good: dot size 3, quiet zone 4 dots as per spec.
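As a quick check of what those settings produce, the pixel size of one symbol can be worked out from the version number.  A version v symbol is 17 + 4v modules on a side; the variable names below are just for illustration.

```shell
#!/usr/bin/env bash
# Pixel dimensions of one QR symbol image, quiet zone included.
version=40   # symbol version
dot=3        # pixels per module (the dot size)
quiet=4      # quiet zone width in modules, on each side

modules=$(( 17 + 4 * version ))
side=$(( (modules + 2 * quiet) * dot ))

echo "modules per side: $modules"   # prints 177
echo "pixels per side:  $side"      # prints 555
```

This is where the 555 in the pnmcut command further down comes from.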

The 7562 bytes of input fit into 6 "version 40" (i.e., size 40) QR code symbols, using error correction level H.  Version 40 is the largest possible symbol.  6 of the version 40 symbols can encode at most 7626 bytes, so the data fits with not too much space left over.  6 version 39 symbols can encode at most 7302 bytes.  5 version 40 symbols can encode at most 6355 bytes.
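The totals above can be reproduced with a little arithmetic.  This sketch assumes each symbol spends 20 bits on the Structured Append header, 4 bits on the byte-mode indicator, and 16 bits on the length field, with the rest of its data codewords available as payload.  The data codeword counts (1276 for version 40-H, 1222 for version 39-H) come from the QR code capacity tables, not from this post, and sa_byte_capacity is a made-up helper name.

```shell
#!/usr/bin/env bash
# Payload bytes in one byte-mode symbol of a Structured Append sequence,
# given its number of data codewords (valid for versions 10 through 40,
# where the length field is 16 bits wide).
sa_byte_capacity() {
  local data_codewords=$1
  echo $(( (data_codewords * 8 - 20 - 4 - 16) / 8 ))
}

v40h=$(sa_byte_capacity 1276)   # version 40, level H
v39h=$(sa_byte_capacity 1222)   # version 39, level H

echo "6 x version 40-H: $(( 6 * v40h )) bytes"   # 7626
echo "5 x version 40-H: $(( 5 * v40h )) bytes"   # 6355
echo "6 x version 39-H: $(( 6 * v39h )) bytes"   # 7302
```

With 7562 bytes of input, only the first of these three totals is large enough.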

Incidentally, the qrencode program cannot emit Structured Append QR codes of differing sizes, for example with the last one sized to fit the remaining data as tightly as possible.  (Would that be permitted by the QR code specification?)  Instead, the last code might have a lot of empty space.  If so, it gets a visually detectable filler pattern.

Structured Append can combine at most 16 symbols.  The genome data would fit into 16 version 24 symbols.  16 version 23 symbols can encode at most 7344 bytes.

Above, we used error correction level "H", the most error correction available.  If we had instead used "L", the least error correction, we would have the following:

The data would fit into 3 version 37 symbols.  3 version 36 symbols can encode at most 7287 bytes.  2 version 40 symbols (the maximum QR code size) can encode at most 5902 bytes.

The data would fit into 15 version 15 symbols.  16 version 14 symbols can encode at most 7296 bytes.

Here are Netpbm commands to cut the tall image below into 6:

for i in $(seq 0 5); do pnmcut 0 $((i * 555)) 0 555 orig.pnm > "cut$i.pnm" ; done
