Friday, October 29, 2021

[gnsalavv] counting sentences with parentheses

consider the following rules defining a "sentence":

  1. a sentence is string of letters, spaces, and parentheses.
  2. the first character in a sentence may not be a space.
  3. the final character in a sentence may not be a space.
  4. there may not be 2 or more spaces in a row.
  5. parentheses must be balanced
  6. the string between balanced parentheses must be a valid sentence.

any string of letters forms a word: there is no dictionary or rules concerning permitted words.  there is no grammar about where words can go in a sentence.

(previously, without parentheses.)

there are three possibilities for rules concerning a space before an open-parenthesis and after a close-parenthesis:

  1. "spaces required": a space is always required before an open parenthesis and after a close parentheses, except at the beginning or end of a sentence.  in other words, a parenthetical substitutes for a word.
  2. "spaces forbidden": spaces are forbidden before an open parenthesis and after a close parentheses.  this makes #1 more compact, taking the following view: a parenthesis is a separator, so space is redundant.
  3. "spaces optional": no rule, i.e., space permitted but not required in those locations.  sentences otherwise identical can differ by the presence or absence of such spaces.  in other words, a parenthetical can be a portion of a word.

we also consider a 4th possibility for completeness: "no spaces anywhere" = spaces forbidden everywhere.  sentences consist of letters and matched parentheses only.  previously.

(note that, because the string inside parentheses must be a valid sentence, and because sentences may not begin or end with spaces, spaces are never permitted after an open parenthesis or before a close parenthesis in any of the possibilities.)

possibility 1: spaces required

space required before an open parenthesis and after a close parenthesis.

assume an alphabet of exactly one letter for simplicity.

number of possible sentences of a given length, for an alphabet of 1 letter:

0: 1
1: 1
2: 2
3: 3
4: 7
5: 13
6: 28
7: 57
8: 122
9: 260
10: 563
20: 1803320
50: 234639824043294365
100: 2401704148722713218387226080782103540
200: 675802081149514595440924093540053480236325063284358890328823379454019858491
500: 82616673865208219840168987274343611550170249309187075398615393168193465853781619472575603429097239515922012333672624393300660923478883779991454212379508123961866005624093205144548567401645612

here are the 7 sentences exactly 4 characters long.  we use period as the space character to make it visible:

(()) ().a (aa) a.() a.aa aa.a aaaa

number of possible sentences of a given length, for an alphabet of X letters:

0: 1
1: X + 0
2: X^2 + 0 X + 1
3: X^3 + 1 X^2 + 1 X + 0
4: X^4 + 2 X^3 + 1 X^2 + 2 X + 1
5: X^5 + 3 X^4 + 2 X^3 + 5 X^2 + 1 X + 1
6: X^6 + 4 X^5 + 4 X^4 + 8 X^3 + 4 X^2 + 6 X + 1
7: X^7 + 5 X^6 + 7 X^5 + 12 X^4 + 13 X^3 + 12 X^2 + 4 X + 3
8: X^8 + 6 X^7 + 11 X^6 + 18 X^5 + 28 X^4 + 22 X^3 + 22 X^2 + 12 X + 2
9: X^9 + 7 X^8 + 16 X^7 + 27 X^6 + 50 X^5 + 46 X^4 + 60 X^3 + 28 X^2 + 19 X + 6
10: X^10 + 8 X^9 + 22 X^8 + 40 X^7 + 81 X^6 + 94 X^5 + 123 X^4 + 84 X^3 + 79 X^2 + 24 X + 7

the constant terms of the polynomials are OEIS A329691.  these are sentences formed from an empty alphabet, sentences containing only spaces and parentheses, no letters.

possibility 2: spaces forbidden

spaces forbidden next to parentheses.

number of possible sentences of a given length, for an alphabet of 1 letter:

0: 1
1: 1
2: 2
3: 5
4: 10
5: 24
6: 54
7: 128
8: 305
9: 738
10: 1812
20: 22305995
50: 209085861583658998988
100: 2896163012069368328112246952862062350725222
200: 1538301376752204151873892519966611640610203038661778430235437213872294001191888210488379
500: 1296046302196493674416139158608742818799860682561639815148641768773774389295081119244166279227434997260848624438315999195498401549080170742562237181628517294428031369534378524852432685463153189441233522913434883852344497135

here are the 10 sentences exactly 4 characters long: (()) ()() ()aa (a)a (aa) a(a) a.aa aa() aa.a aaaa

number of possible sentences of a given length, for an alphabet of X letters:

0: 1
1: X + 0
2: X^2 + 0 X + 1
3: X^3 + 1 X^2 + 3 X + 0
4: X^4 + 2 X^3 + 5 X^2 + 0 X + 2
5: X^5 + 3 X^4 + 8 X^3 + 3 X^2 + 9 X + 0
6: X^6 + 4 X^5 + 12 X^4 + 10 X^3 + 22 X^2 + 0 X + 5
7: X^7 + 5 X^6 + 17 X^5 + 22 X^4 + 44 X^3 + 9 X^2 + 30 X + 0
8: X^8 + 6 X^7 + 23 X^6 + 40 X^5 + 81 X^4 + 44 X^3 + 96 X^2 + 0 X + 14
9: X^9 + 7 X^8 + 30 X^7 + 65 X^6 + 140 X^5 + 126 X^4 + 234 X^3 + 30 X^2 + 105 X + 0
10: X^10 + 8 X^9 + 38 X^8 + 98 X^7 + 229 X^6 + 284 X^5 + 505 X^4 + 192 X^3 + 415 X^2 + 0 X + 42

the constant terms of the polynomials, corresponding to an empty alphabet, are OEIS A126120, the Catalan sequence interspersed with zeros.

possibility 3: spaces optional

number of possible sentences of a given length, for an alphabet of 1 letter:

0: 1
1: 1
2: 2
3: 5
4: 13
5: 35
6: 97
7: 275
8: 794
9: 2327
10: 6905
20: 543861219
50: 1128333693246416457543548
100: 119628765495379222079660420511035921565831351707845
200: 3649654665420403223682340275125820396477186776743324179123217492162687232294919994358381100091839560848
500: 573687344404198770128089072739235272009117916044030235515101699291546310381453636803458852283006441770075054703364427541927298229216743855593043191380481458147380991797504771043559300958504387502948387474913767477040710263802219479301907466467399365606135187581

here are the 13 sentences exactly 4 characters long: (()) ()() ().a ()aa (a)a (aa) a()a a(a) a.() a.aa aa() aa.a aaaa

number of possible sentences of a given length, for an alphabet of X letters:

0: 1
1: X + 0
2: X^2 + 0 X + 1
3: X^3 + 1 X^2 + 3 X + 0
4: X^4 + 2 X^3 + 6 X^2 + 2 X + 2
5: X^5 + 3 X^4 + 11 X^3 + 9 X^2 + 10 X + 1
6: X^6 + 4 X^5 + 18 X^4 + 24 X^3 + 33 X^2 + 12 X + 5
7: X^7 + 5 X^6 + 27 X^5 + 51 X^4 + 88 X^3 + 60 X^2 + 38 X + 5
8: X^8 + 6 X^7 + 38 X^6 + 94 X^5 + 200 X^4 + 204 X^3 + 176 X^2 + 60 X + 15
9: X^9 + 7 X^8 + 51 X^7 + 157 X^6 + 403 X^5 + 555 X^4 + 620 X^3 + 356 X^2 + 156 X + 21
10: X^10 + 8 X^9 + 66 X^8 + 244 X^7 + 740 X^6 + 1296 X^5 + 1805 X^4 + 1480 X^3 + 930 X^2 + 284 X + 51

the constant terms of the polynomials, corresponding to an empty alphabet, are OEIS A167638.

possibility 4: no spaces anywhere

number of possible sentences of a given length, for an alphabet of 1 letter.  this OEIS A001006 "Motzkin numbers".

0: 1
1: 1
2: 2
3: 4
4: 9
5: 21
6: 51
7: 127
8: 323
9: 835
10: 2188
20: 50852019
50: 2837208756709314025578
100: 737415571391164350797051905752637361193303669
200: 135992213076511509724385771675602872730799169849536546904800645523735889653182288054195352807
500: 4743905065248174705076073568924606940687938425039221715559064934518576868883249780520846342406761520737185061080260761031842147475829201295002628662481146249642585695179800443503586610960281508335011647289023649206194869780049646575619

here are the 9 sentences exactly 4 characters long: (()) ()() ()aa (a)a (aa) a()a a(a) aa() aaaa

number of possible sentences of a given length, for an alphabet of X letters.  there are many zero coefficients:

0: 1
1: X + 0
2: X^2 + 0 X + 1
3: X^3 + 0 X^2 + 3 X + 0
4: X^4 + 0 X^3 + 6 X^2 + 0 X + 2
5: X^5 + 0 X^4 + 10 X^3 + 0 X^2 + 10 X + 0
6: X^6 + 0 X^5 + 15 X^4 + 0 X^3 + 30 X^2 + 0 X + 5
7: X^7 + 0 X^6 + 21 X^5 + 0 X^4 + 70 X^3 + 0 X^2 + 35 X + 0
8: X^8 + 0 X^7 + 28 X^6 + 0 X^5 + 140 X^4 + 0 X^3 + 140 X^2 + 0 X + 14
9: X^9 + 0 X^8 + 36 X^7 + 0 X^6 + 252 X^5 + 0 X^4 + 420 X^3 + 0 X^2 + 126 X + 0
10: X^10 + 0 X^9 + 45 X^8 + 0 X^7 + 420 X^6 + 0 X^5 + 1050 X^4 + 0 X^3 + 630 X^2 + 0 X + 42

the constant terms of the polynomials, corresponding to an empty alphabet, are OEIS A126120, the Catalan sequence interspersed with zeros, same as possibility 2.

alphabet of size 26

next, we consider an alphabet of size X=26, in contrast to X=1 above.

number of possible sentences of length 18, for the 4 possibilities of spaces around parentheses:

53015540666536623728839718 = 10^ 25.724403194816087 (spaces required)

55175049982065362539432242 = 10^ 25.741742735253272 (spaces forbidden)

64004269864112471169216906 = 10^ 25.806208947680233 (spaces optional)

36555393489772334041228182 = 10^ 25.56295146310067 (no spaces anywhere)

number of possible sentences ("tweets") of length 140, for the 4 possibilities:

195934261323419694751910685930748987539493098537206064673150995769544781921774070731971968699512843893026062219012006297509999482003429684592290894981480864413992714261505322833423124487191189991245116 = 10^ 200.29211038394115 (spaces required)

279651170699176161293244273654763429207797519615693389895644252455841236471855853136070695512690866656907376176359966857585855146382978692320648870033749655770564609113840755678875203578876312834446376 = 10^ 200.44661664174802 (spaces forbidden)

82796358617823453501990692494372407625581449023146800949812002179372327361110631037893437558483173977254085396667494514825913324957849322646741729490697338529492350764942937726613380909403414602288119700 = 10^ 202.91801123694236 (spaces optional)

958111401189737767248075704214258864131901794586118314462994973720998051971111561994445027276127246857724857253761295408341276986563999954106395387550976416770852601625771829448219928651378985693500472 = 10^ 200.98141600814867 (no spaces anywhere)

in general, the number of possible sentences increases through "spaces required", "spaces forbidden", then "spaces optional".  possibility #4, "no spaces anywhere", does not fit in a consistent place among them.

implementation

computations were done with a Haskell program.  source code here.

we used the poly package to do arithmetic on polynomials.

the number of possibilities of given length is often expressed in terms of a smaller size, so we used memoization to reuse previously calculated smaller results.  we manually implemented memoization using a State monad, storing results in a collection of Maps.  (using Data.Map to store a collection indexed by consecutive integers was a little bit of overkill.)

(previously, seemingly magical automatic memoization without a monad using the MemoTrie package.)

an important space optimization was to fully evaluate values with seq before inserting into a memoization map:

do{
...
  answer :: Poly.VPoly Integer <- calcanswer;
-- Poly (version 0.3.1) does not yet support DeepSeq ; it was added 0.3.2
  seq (answer==answer) $ State.modify (setter n answer);
...}

length 500 was approximately the longest sentence we could count the number of sentences.  beyond that, computation time became hours and memory usage became gigabytes.  large space usage was partially because we memoized polynomials, not integers, only evaluating a polynomial at the end if we were given a concrete size of the alphabet.

to avoid boilerplate for referring to State components, we used fclabels to create lenses for manipulating State.  the mkLabels declaration (the Template Haskell "splice") has to be carefully placed in the code.  the declaration has to come after the data declaration.  the declaration has to come before any code that uses the lenses.  any code before mkLabels cannot refer to anything declared after mkLabels, even if neither does anything with the lenses.  these rules were quite surprising.  i discovered them by trial and error.  i had never used Template Haskell before.  in a normal Haskell program, all declarations other than import statements can be in any order.

normally i prefer to put the main function near the beginning of the program, but this became not possible.

the -ddump-splices compilation flag dumps the splices generated by Template Haskell.  the generated code looks formidable, heavily using arrows, categorical points, and categories, so i chose not to try to understand what it is doing.

we tried both Control.Monad.State.Strict and Control.Monad.State.Lazy.  Strict had less memory usage, but Lazy allows output of infinite lists, which is convenient, demonstrated in "cumulative1family" which computes the running sum of all sentences of n characters or less.

generating all possible sentences of a given length is equivalent to running a parser in reverse, with additional bookkeeping to keep track of the number of characters still left to output.  we used list as the nondeterminism monad.  we did not do memoization while generating all possible sentences because we never did generation on large sentence lengths.

because we use a state monad for memoization of counting and the nondeterminism monad for generation of sentences, the corresponding monadic codes are structurally similar.  below are some examples.  (the suffixes "1family" and "2family" refer to possibilities 1 (spaces required) and 2 (spaces forbidden) respectively.)

gensentence1family :: Integer -> [String];
gensentence1family n | n<0 = Monad.mzero
| n==0 = return ""
| True = do {
  i <- [1..n];
  Applicative.liftA2 (++)
   (genword1family i)
   (genrestsentence1family $ n-i);
};

countsentence1family :: Integer -> Monad1family Mypoly;
countsentence1family n | n<0 = return 0
| n==0 = return 1
| True = memoizelens n lenssentence1family $ sumM [1..n] (\i -> Applicative.liftA2 (*)
  (countword1family i)
  (countrestsentence1family $ n-i));

...

genparenthesized2family :: Integer -> [String];
genparenthesized2family n = do {
  y <- gensentence2family $ n-2;
  return $ ('(':y)++")";
};

countparenthesized2family :: Integer -> Monad2family Mypoly;
countparenthesized2family n = countsentence2family $ n-2;

it seems within the realm of imagination to unify the structurally similar codes, parameterizing over a polymorphic monad.  we did not pursue this.

it also seems possible to explicitly define a grammar then derive the generation and counting functions generically from grammars.  avoid ambiguous grammars to prevent generating duplicates and double-counting.  we did not pursue this.

future work: if there are N possible sentences, define and compute both directions of a bijection between sentences and the numbers 1 to N.

No comments :