TL;DR → The main problem is coming up with a way to reorder an array non-randomly but without introducing bulky code. Like the effect of shuffling a deck of cards in a deterministic cheating way.


Full background:

I would like to generate reference numbers for letters sent via postal mail. An sqlite db is used to track the sequence numbers (but not the reference numbers). This is the bash code I have so far:

typeset -a symbolset=(a b c d e f g h   j k   m n   p q r s t u v w x y z     2 3 4 5 6 7 8 9)
ln_symbolset=${#symbolset[@]}; # 41 is the answer, not 42
itemseq=$(sqlite3 ltr_tracking.db "select max(counter) from $tbl;")
printf '%s\n' "next letter reference number is: $(date +%Y)-${symbolset[$((itemseq / ln_symbolset))]}${symbolset[$((itemseq % ln_symbolset))]}"

An array is defined with alphanumeric symbols, taking care to eliminate symbols that humans struggle to distinguish (e.g. 1l0o). Then integer div and mod operations produce a two character number which is then prefixed with the year. So e.g. 2024-aa. Just two chars gives more numbers than would ever be generated in one calandar year.

This code mostly satisfies the need. But there’s a problem: a recipient who receives two letters can easily realise how many letters were sent in the time span of the two letters they receive. Most numbers will start with “a” “b” or “c”.

I do not need or want a cryptographic level of security which then leads to ungodly 16 byte numbers. Simplicity¹ is far more important than confidentiality. Just a small tweak to stifle the most trivial analysis would be useful.

One temptation is to simply manually mix up the order of chars in the symbolset array, hard-coded. But then that makes the code less readible. So I probably need to create a 2nd array “symbolseq” which arbitrarily unorders the symbolset array. I say arbitrary and not random because the sequence must be deterministic and static from one execution to the next.

An associative array is one idea:

typeset -A symbolset_lookup_table=(
[a]=k
[b]=3
[c]=s
…

I’m just slightly put off by the fact that it’s not readily evident that the RHS values are all used from the same set as the LHS keys exactly once.

I should probably encode the year as well. This would give a two char year:

printf '%s ' "$(((2024/41) % 41))" "$((2024 % 41))" "→ ${symbolset[$(((2024 / 41) % 41))]}" "${symbolset[$((2024 % 41))]}"

output:
8 15 → j s

(edit)
All the calculations must be easily reversible so a ref number can be converted back into a sequence number for DB queries.

¹ simplicity in both the code and in the numbers generated.

  • @breadsmasher
    link
    English
    2
    edit-2
    4 months ago

    Do they need to always be unique? j s being two letters would presumably clash pretty quickly?

    edit

    Rereading, maybe I misunderstood - would the full string include the date? so 2024-js as a complete example?

    • @[email protected]OP
      link
      fedilink
      English
      1
      edit-2
      4 months ago

      The “js” example is just to encode the year which is a prefix to the encoded sequence number. So if 2024 gives “js”, then ref numbers would look like this:
      js-aa
      js-ab
      js-ac
      …etc.

      And I do not reset the counter at the beginning of the year. So 2025 would be like:

      jt-ad
      jt-ae
      jt-af
      …etc.

      (update)

      Rereading, maybe I misunderstood - would the full string include the date? so 2024-js as a complete example?

      Yes, but note that “js” in my example was for an encoding of the year, which helps shrink the reference number and mask the fact that the 2nd token is a sequence.

  • @[email protected]
    link
    fedilink
    English
    24 months ago

    Date sent plus database ID base64 encoded. Why make it so complicated?

    I.e 2024-15 becomes MjAyNC0xNQ==

    Can even drop any equals signs to make it less obvious it’s base64 if that’s a concern

    • @[email protected]OP
      link
      fedilink
      English
      1
      edit-2
      4 months ago

      That is certainly a winner from the standpoint of code simplicity. And it’s trivially reversible. But I’m also prioritizing simplicity for human recipients above code simplicity. Base64 output is case sensitive and someone writing back and referencing a ref number would not necessarily preserve case. It’s also intolerant of human errors like confusing a “1” for a “l”.

      (edit) I think base32 would avoid the case sensitivity problem. So here’s a sample:

      for seq in {1..60}; do printf '%s → ' 2024-"$seq"; printf '%s\n' 2024-"$seq" | base32 | awk '{print tolower($1)}' | sed 's/=//g'; done
      
      output:
      2024-1 → giydenbngefa
      2024-2 → giydenbngifa
      2024-3 → giydenbngmfa
      2024-4 → giydenbngqfa
      2024-5 → giydenbngufa
      2024-6 → giydenbngyfa
      2024-7 → giydenbng4fa
      2024-8 → giydenbnhafa
      2024-9 → giydenbnhefa
      2024-10 → giydenbngeyau
      2024-11 → giydenbngeyqu
      2024-12 → giydenbngezau
      2024-13 → giydenbngezqu
      2024-14 → giydenbnge2au
      2024-15 → giydenbnge2qu
      2024-16 → giydenbnge3au
      2024-17 → giydenbnge3qu
      2024-18 → giydenbnge4au
      2024-19 → giydenbnge4qu
      2024-20 → giydenbngiyau
      2024-21 → giydenbngiyqu
      2024-22 → giydenbngizau
      2024-23 → giydenbngizqu
      2024-24 → giydenbngi2au
      2024-25 → giydenbngi2qu
      
  • @[email protected]OP
    link
    fedilink
    English
    1
    edit-2
    4 months ago

    I probably need a perfect hash function. This code seems to do the job:

    encoded_reference()
    {
        local -r yr=$1
        local -r seqno=$2
        
        local -ar symbolset=(a b c d e f g h   j k   m n   p q r s t u v w x y z     2 3 4 5 6 7 8 9)
        local -a seedset=("${symbolset[@]}")
        local -r ln_symbolset=${#symbolset[@]}; # 31
        local ln_seedset=${#seedset[@]}
        local -A lookup_table=()
    
        for sym in "${symbolset[@]}"
        do
            pos=$((50 % ln_seedset)); # 50 is just an arbitrary static number
            lookup_table+=(["$sym"]=${seedset["$pos"]})
            seedset=(${seedset[@]/${seedset[$pos]}}); # remove used elements from the seedset
            ln_seedset=${#seedset[@]}
        done
        
        local yr_enc=${symbolset[$(((yr / ln_symbolset) % ln_symbolset))]}${symbolset[$(($yr % ln_symbolset))]}
        local most_sig_fig=$((seqno / ln_symbolset))
        local least_sig_fig=$((seqno % ln_symbolset))
        
        # caution: if the seqno exceeds ln_symbolset², this calculation is out of range
        local seq_enc=${lookup_table[${symbolset[$most_sig_fig]}]}${lookup_table[${symbolset[$least_sig_fig]}]}
        
        printf '%s\n' "answer → ${yr_enc}-$seq_enc"
    };#encoded_reference
    
    for yr in 2024 2025 2026
    do
        for seqno in {1..20}
        do
            encoded_reference "$yr" "$seqno"
        done
    done
    
    output

    answer → js-wy answer → js-w2 answer → js-w4 answer → js-w6 answer → js-w8 answer → js-wa answer → js-wd answer → js-wg answer → js-wk answer → js-wp answer → js-ws answer → js-wv answer → js-w3 answer → js-w9 answer → js-we answer → js-wm answer → js-wt answer → js-w5 answer → js-wf answer → js-wr answer → jt-wy answer → jt-w2 answer → jt-w4 answer → jt-w6 answer → jt-w8 answer → jt-wa answer → jt-wd answer → jt-wg answer → jt-wk answer → jt-wp answer → jt-ws answer → jt-wv answer → jt-w3 answer → jt-w9 answer → jt-we answer → jt-wm answer → jt-wt answer → jt-w5 answer → jt-wf answer → jt-wr answer → ju-wy answer → ju-w2 answer → ju-w4 answer → ju-w6 answer → ju-w8 answer → ju-wa answer → ju-wd answer → ju-wg answer → ju-wk answer → ju-wp answer → ju-ws answer → ju-wv answer → ju-w3 answer → ju-w9 answer → ju-we answer → ju-wm answer → ju-wt answer → ju-w5 answer → ju-wf answer → ju-wr

    This is close to ideal, but I just thought of another problem: what if a year-seq pair were to derive an encoded number like “fy-ou” or “us-uk” or “sh-it”? A bias that nearly ensures a digit is used would help avoid generating offending words. But I guess I’m getting well into over-engineering territory.

    • @[email protected]OP
      link
      fedilink
      English
      1
      edit-2
      4 months ago

      This is the decode function if anyone is interested:

      decoded_reference()
      decoded_reference()
      {
          local yr_msd=${1:0:1}
          local yr_lsd=${1:1:1}
          local seq_enc_msd=${1:3:1}
          local seq_enc_lsd=${1:4:1}
          local seq_msd=${lookup_table_reverse[$seq_enc_msd]}
          local seq_lsd=${lookup_table_reverse[$seq_enc_lsd]}
          local seq_msd_index=$(typeset -p symbolset | grep -oP '[0-9]+(?=]="'"$seq_msd"'")')
          local seq_lsd_index=$(typeset -p symbolset | grep -oP '[0-9]+(?=]="'"$seq_lsd"'")')
          local seq=$((seq_msd_index * ln_symbolset + seq_lsd_index))
          local yr_msd_index=$(typeset -p symbolset | grep -oP '[0-9]+(?=]="'"$yr_msd"'")')
          local yr_lsd_index=$(typeset -p symbolset | grep -oP '[0-9]+(?=]="'"$yr_lsd"'")')
          local yr=$((ln_symbolset * ln_symbolset * 2 + yr_msd_index * ln_symbolset + yr_lsd_index)); # warning: the “2” is a dangerous hard-coding! Hopefully that bug manifests after I am dead
      
          printf '%s\n' "${yr}-$seq"
      };#decoded_reference