I have thousands of side-by-side translations for two computer languages (lower level to higher level), and I would like to train a model that is able to do translations on new data with higher accuracy.

Got any suggestions on what to do? I don’t think I want to fine tune a ChatGPT-style model since I think the task is more structured than that. Also, I consider myself technically competent but probably would fail at designing my own model and pipeline.

  • @larlyssa
    link
    English
    3
    edit-2
    1 year ago

    If you’re working with a well known language, then you can probably use NLTK to tokenize your words. Word2vec is also helpful if you want a word embedding approach. https://github.com/nltk/nltk

    • @[email protected]OP
      link
      fedilink
      English
      41 year ago

      Thanks for the tips. After doing a bunch of searching, I found that what I needed was BPE, or byte-pair encoding. This allows the token set to contain sub-word sequences, which lets the tokenizer represent a unique constant like 0x0373 as ['__sow', '0x', '03', '73', '__eow'].