## Monday, October 16, 2017

### [duimuqdd] Compressing prime numbers

Given a large prime number p not of any special form, can one data-compress it to specify it in less than log_2(p) bits?

Straightforward is to say it is the m-th prime.  Expressing m will take about log_2(p/log(p)) bits by the prime number theorem.  This is the theoretical optimum by some measure, because the m's map 1-to-1 with the primes.  The number of bits saved is log_2(p)- log_2(p/log(p)) = log_2(p) - [log_2(p) - log_2(log(p))] = log_2(log(p)) which, because of the twice-iterated logarithm, is a very small savings.  For 370-bit primes around exp(256), the savings is 8 bits, or a reduction of 1 part in 46.  For larger primes, the relative savings is even less.  Around exp(2^16), the reduction is 1 part in 5909.

Despite the mostly uselessness of this endeavor, we charge forward.

Computing m is not practical for large primes.  For a practical method, we form a new number x = 2^k * floor(p / 2^k), equivalent to setting k lower bits of p to zero.  Then we can encode p as the n-th prime larger than x.  For small k, n is easy to compute, and it is easy to recover p from n and x (e.g., with a prime sieve starting from x).  Storing n instead of the lower bits of p results in some data savings because primes are less dense.  Curiously, the number of bits saved on average is log_2(log(p)), which is the theoretical maximum, and not dependent on k.  However, we also need to store k, which eats up some of those savings.  Because the number of bits saved is constant, this method paradoxically works less and less well for larger k.  However, with small k, n might vary significantly from the average, which could be good or bad.  Here is some Haskell code to compute n.  Here is some example output on the prime 3^233 + 176, which is around exp(2^8).  We can see there is generally around 8 bits of savings, except for 2^21 where we got lucky.

input 1476564251485392778927857721313837180933869708288569663932077079002031653266328641356763872492873429131586567699
prime # 1 after 2^ 9 * 2883914553682407771343472111941088244011464274001112624867338044925843072785798127649929438462643416272630015
prime # 3 after 2^ 10 * 1441957276841203885671736055970544122005732137000556312433669022462921536392899063824964719231321708136315007
prime # 8 after 2^ 11 * 720978638420601942835868027985272061002866068500278156216834511231460768196449531912482359615660854068157503
prime # 12 after 2^ 12 * 360489319210300971417934013992636030501433034250139078108417255615730384098224765956241179807830427034078751
prime # 24 after 2^ 13 * 180244659605150485708967006996318015250716517125069539054208627807865192049112382978120589903915213517039375
prime # 59 after 2^ 14 * 90122329802575242854483503498159007625358258562534769527104313903932596024556191489060294951957606758519687
prime # 129 after 2^ 15 * 45061164901287621427241751749079503812679129281267384763552156951966298012278095744530147475978803379259843
prime # 252 after 2^ 16 * 22530582450643810713620875874539751906339564640633692381776078475983149006139047872265073737989401689629921
prime # 526 after 2^ 21 * 704080701582619084800652371079367247073111395019802886930502452374473406441845246008283554312168802800935
prime # 8789 after 2^ 22 * 352040350791309542400326185539683623536555697509901443465251226187236703220922623004141777156084401400467

Future exploration:  "Luck" above is easily explained by long internal strings of zeroes in the binary representation of p: the least significant 21 bits of p are 000011111111000010011.  We could also try to take advantage of long internal strings of ones.  More generally, sieving for primes works well for regions defined by arithmetic progressions.  We might search generally for representations of p as the "a-th prime number of the form b*x + c larger/smaller than d*e^k".  Above we explored b=1, c=0, "larger", e=2, d unrestricted.  We could also explore anchor expressions more complicated than d*e^k.