Sunday, March 25, 2007

gs and netpbm

gs -sDEVICE=ppmraw -sOutputFile=- -r1720x1720 -dNOPAUSE -q -dBATCH MBTA-system_map-back.pdf | pnmdepth 65535 | pnmscale 0.5 | pnmgamma 0.5 | pnmdepth 255 | gzip >| f.gz &

gs

I got bits of this command line from pstopnm -verbose; in general you do better to call gs directly than to use pstopnm. pstopnm was doing some weird "translate" that made the image get clipped wrong. I suspect a sign error.

Watching gs in top, it uses only about 14 megabytes of memory even when creating very large images. This suggests it allocates a small buffer "window", renders to that window, emits it, then slides the window down and makes another pass. There seems to be no way to make the window larger if you have memory to spare and want better performance.

Windowed output like this leaves room for a computer-graphics trick: parse the PDF into a range tree so that only the parts that fall inside the current window get rendered.
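To make the windowing idea concrete, here is a minimal sketch in Python. It is a guess at the general technique, not a description of what gs actually does. A plain linear scan stands in for the range tree, and the object list and band height are made-up illustration values.

```python
# Sketch: render a tall page in horizontal bands, visiting only the objects
# whose vertical extent overlaps the current band.
# A linear scan is used here in place of a real range tree.

def bands(page_height, band_height):
    """Yield (top, bottom) pixel rows for each band, bottom exclusive."""
    for top in range(0, page_height, band_height):
        yield top, min(top + band_height, page_height)

def objects_in_band(objects, top, bottom):
    """Return names of objects whose [y0, y1) extent overlaps [top, bottom)."""
    return [name for name, y0, y1 in objects if y0 < bottom and y1 > top]

# Hypothetical page content: (name, first row, one past last row).
objects = [("title", 0, 40), ("map", 30, 250), ("legend", 260, 300)]

schedule = [(top, bottom, objects_in_band(objects, top, bottom))
            for top, bottom in bands(300, 100)]
for top, bottom, names in schedule:
    print(top, bottom, names)
```

A range tree (or interval tree) would answer the same overlap query without scanning every object for every band, which matters once a page has many objects.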

-sDEVICE=ppmraw

You can also use png16m ("16 million", i.e. 24-bit color). But pngtopnm has a bug: it tries to render the entire image at once and runs out of memory. If the PNG is not interlaced, it should work row by row. If it is interlaced, it should buffer the compressed data in memory, build an index (possibly including zlib state) of where each sub-image and each row within each sub-image begins, and then render row by row.
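The row-by-row approach for the non-interlaced case can be sketched with Python's zlib module. This is a simplification, not a PNG decoder: it assumes a bare zlib stream of fixed-length rows and ignores PNG chunk framing and per-row filter bytes. The function name and test data are made up for illustration.

```python
import zlib

def rows_from_zlib(chunks, row_bytes):
    """Yield complete rows from an iterable of compressed chunks.

    Holds the decompressed output of one chunk plus at most one partial row,
    never the whole image.
    """
    decomp = zlib.decompressobj()
    pending = b""
    for chunk in chunks:
        pending += decomp.decompress(chunk)
        while len(pending) >= row_bytes:
            yield pending[:row_bytes]
            pending = pending[row_bytes:]
    pending += decomp.flush()
    while len(pending) >= row_bytes:
        yield pending[:row_bytes]
        pending = pending[row_bytes:]

# Tiny made-up "image": 4 rows of 6 bytes (2 RGB pixels per row).
raw = bytes(range(24))
packed = zlib.compress(raw)
# Feed the compressed data in 5-byte pieces, as if it arrived on a pipe.
pieces = [packed[i:i + 5] for i in range(0, len(packed), 5)]
rows = list(rows_from_zlib(pieces, 6))
print(len(rows))
```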

-sOutputFile=-

gs is broken for large files: writing an output file larger than 2 GB fails. If this happens from within pstopnm, it dies with a horribly useless error (signal #foo), but invoking gs directly gets you an error message like "file size limit exceeded". Use the png16m option above, or pipe to stdout. My shell (bash) handles large pipes fine (old versions of tcsh did not). Note that piping to stdout won't work quite so easily for multi-page documents.

-r1720x1720

Specify the resolution directly, rather than the output image size. I don't know if this accepts decimal points, though it ought to.
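To see how quickly a given resolution runs into the 2 GB problem, here is a rough size estimate in Python. It assumes 3 bytes per pixel for raw PPM at maxval 255, ignores the small header, and treats the limit as a signed 32-bit offset (2^31 bytes). The page sizes are hypothetical, not the actual dimensions of the map.

```python
def raw_ppm_bytes(width_in, height_in, dpi, bytes_per_pixel=3):
    """Approximate size of the pixel data of a raw PPM, ignoring the header."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel

LIMIT = 2**31  # assumed limit: signed 32-bit file offset

# Hypothetical page sizes in inches.
for name, w, h in [("letter", 8.5, 11), ("poster", 17, 22)]:
    size = raw_ppm_bytes(w, h, 1720)
    print(name, size, "over limit" if size >= LIMIT else "fits")
```

Doubling both page dimensions quadruples the byte count, so a page only twice as wide and tall as letter size is already past the limit at this resolution.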

-dNOPAUSE

Don't pause at the end of each page (gs would otherwise wait on stdin).

-q

Don't emit gs introductory text. It would be bad if this got prepended to stdout.

-dBATCH

Without this option, gs gives you the "GS>" prompt at the end (which gets written to stdout), and waits for user input, e.g., quit. This option makes it quit automatically after the last page.

MBTA-system_map-back.pdf

Hey look! gs can take PDFs as input, letting you skip the sketchy PDF-to-PS (PostScript) conversion.

| pnmdepth 65535 | pnmscale 0.5 | pnmgamma 0.5 | pnmdepth 255

Fine black-on-white detail (text) gets shaded down to grey when an image is shrunk, so change the gamma to compensate. In hopes of not losing color, we increase the depth to 16 bits during the operation.

There ought to be a pnmdepth option to do dithering.
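A small Python experiment shows why the detour through 16 bits helps. It rests on several assumptions: that the gamma step is a plain power law, out = in^(1/gamma), so 0.5 squares the normalized value; that the 0.5 scale averages 2x2 blocks; and that depth changes round to nearest. The real tools may differ in detail.

```python
def rescale(v, old_max, new_max):
    """Change maxval with rounding, roughly what a depth conversion does."""
    return (v * new_max + old_max // 2) // old_max

def gamma(v, maxval, g=0.5):
    """Assumed power-law gamma: out = in ** (1/g), rounded to an integer."""
    return int((v / maxval) ** (1.0 / g) * maxval + 0.5)

def levels_8bit():
    """Distinct outputs when averaging and gamma both round to 8 bits."""
    out = set()
    for total in range(4 * 255 + 1):      # every possible sum of a 2x2 block
        avg = (total + 2) // 4            # average rounded to 8 bits
        out.add(gamma(avg, 255))
    return len(out)

def levels_16bit():
    """Distinct outputs when the same work is done at maxval 65535."""
    out = set()
    for total in range(4 * 255 + 1):
        avg = (total * 257 + 2) // 4      # same average, kept at 16 bits
        out.add(rescale(gamma(avg, 65535), 65535, 255))
    return len(out)

print(levels_8bit(), levels_16bit())
```

The 8-bit path rounds twice and ends up using fewer of the 256 available output levels. The 16-bit path rounds to 8 bits only once, at the very end.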

pnmscalefixed has a bug: it fails with "pnmscalefixed: object too large".

| gzip > f.gz

Use gzip to prevent the uncompressed data from ever hitting disk. Disk I/O is kind of slow, and the O (output) is especially slow with journaling (ext3). zcat undoes gzip, writing the uncompressed data to stdout.

In other news, pnmcrop should have a two-pass mode in which the first pass outputs the crop coordinates, which you can then feed to pnmcut. That way it does not have to buffer the image in memory when cropping left and right.
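The first pass of such a two-pass crop might look like the Python sketch below. It is an illustration of the idea, not pnmcrop's behavior: the function name is made up, pixels are compared to a single exact background value, and rows arrive as an iterator so only one row is held at a time.

```python
def crop_box(rows, background):
    """First pass: scan rows once, return (left, top, width, height).

    Returns None if every pixel equals `background`.
    """
    left = right = top = bottom = None
    for y, row in enumerate(rows):
        xs = [x for x, px in enumerate(row) if px != background]
        if not xs:
            continue                      # row is all background
        if top is None:
            top = y
        bottom = y
        left = xs[0] if left is None else min(left, xs[0])
        right = xs[-1] if right is None else max(right, xs[-1])
    if top is None:
        return None
    return left, top, right - left + 1, bottom - top + 1

# Made-up 5x4 greyscale image on a white (255) background.
image = [
    [255, 255, 255, 255, 255],
    [255, 255,   0, 255, 255],
    [255,   0,   0,   0, 255],
    [255, 255, 255, 255, 255],
]
box = crop_box(iter(image), 255)
print(box)
```

The second pass would then be an ordinary streaming cut using those four numbers.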

pnmcut (or another tool) should have a way of cutting out many rectangles all at once, say breaking an image into a 100x100 tiling.
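Here is one way a many-rectangles cut could work in a single pass, sketched in Python. It reads "100x100 tiling" as a grid of nx by ny tiles, which is an assumption. The function names are hypothetical, and the tiles are collected in lists where a real tool would append to one output file per tile.

```python
def tile_rects(width, height, nx, ny):
    """Return (left, top, w, h) for an nx-by-ny grid covering the image.

    Tile sizes differ by at most one pixel when the division is uneven.
    """
    xs = [width * i // nx for i in range(nx + 1)]
    ys = [height * j // ny for j in range(ny + 1)]
    return [(xs[i], ys[j], xs[i + 1] - xs[i], ys[j + 1] - ys[j])
            for j in range(ny) for i in range(nx)]

def split_rows(rows, width, height, nx, ny):
    """Single pass over rows; returns {(i, j): list of row slices}."""
    xs = [width * i // nx for i in range(nx + 1)]
    ys = [height * j // ny for j in range(ny + 1)]
    tiles = {}
    j = 0
    for y, row in enumerate(rows):
        while y >= ys[j + 1]:             # advance to the band holding row y
            j += 1
        for i in range(nx):
            tiles.setdefault((i, j), []).append(row[xs[i]:xs[i + 1]])
    return tiles

# Made-up 10x6 image whose pixel value encodes its position (y*10 + x).
image = [[y * 10 + x for x in range(10)] for y in range(6)]
rects = tile_rects(10, 6, 3, 2)
tiles = split_rows(iter(image), 10, 6, 3, 2)
print(rects)
```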

pnmcut should exit after it's emitted the last output; no need to read the rest of the input (this might be fixed).
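The early exit is easy to get in a row-streaming reader. The Python sketch below handles only raw PPM (P6) without header comments and cuts whole rows only. The function names are made up, and it shows the idea without claiming to match how pnmcut is written.

```python
import io

def read_token(stream):
    """Read one whitespace-delimited header token (no comment support)."""
    token = b""
    while True:
        c = stream.read(1)
        if not c:
            raise ValueError("unexpected end of header")
        if c.isspace():
            if token:
                return token
        else:
            token += c

def cut_rows(stream, top, count):
    """Return rows top .. top+count-1 of a raw PPM, then stop reading."""
    if read_token(stream) != b"P6":
        raise ValueError("not a raw PPM")
    width = int(read_token(stream))
    height = int(read_token(stream))
    maxval = int(read_token(stream))
    row_bytes = width * 3 * (2 if maxval > 255 else 1)
    rows = []
    for y in range(min(top + count, height)):
        row = stream.read(row_bytes)
        if len(row) != row_bytes:
            raise ValueError("truncated image")
        if y >= top:                      # rows above the cut are discarded
            rows.append(row)
    return rows                           # rows below the cut are never read

# Made-up 2x4 image: 4 rows of 6 bytes each.
pixels = bytes(range(24))
stream = io.BytesIO(b"P6\n2 4\n255\n" + pixels)
rows = cut_rows(stream, 1, 2)
unread = len(stream.read())               # what the reader left untouched
print(len(rows), unread)
```

On a pipe, rows above the cut still have to be read and thrown away, but everything after the last wanted row can be skipped by simply exiting.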
