Free Open-Source Artificial Intelligence

2928 readers

1 users here now

Welcome to Free Open-Source Artificial Intelligence!

We are a community dedicated to forwarding the availability and access to:

Free Open Source Artificial Intelligence (F.O.S.A.I.)

More AI Communities

LLM Leaderboards

Developer Resources

GitHub Projects

GitHub Stars

FOSAI Time Capsule

founded 2 years ago

MODERATORS

[email protected]

Is there anything that makes training a translation task easy? (lemmy.dbzer0.com)

submitted 1 year ago by [email protected] to c/[email protected]

4 comments fedilink hide all child comments

I have thousands of side-by-side translations for two computer languages (lower level to higher level), and I would like to train a model that is able to do translations on new data with higher accuracy.

Got any suggestions on what to do? I don't think I want to fine tune a ChatGPT-style model since I think the task is more structured than that. Also, I consider myself technically competent but probably would fail at designing my own model and pipeline.

you are viewing a single comment's thread
view the rest of the comments

[–] [email protected] 0 points 1 year ago* (last edited 1 year ago) (1 children)

Try looking into OpenNMT, I used it for a similar task.

https://opennmt.net

[–] [email protected] 0 points 1 year ago (1 children)

Thanks, the quickstart guide was straightforward to follow. Do you have any suggestions on how to do word splitting with code, if any? For example, on a test run, I found that the model was not able to synthesize unique constants correctly even though this test run consisted only of obvious "a to b" relationships.

[–] [email protected] 0 points 1 year ago* (last edited 1 year ago) (1 children)

If you’re working with a well known language, then you can probably use NLTK to tokenize your words. Word2vec is also helpful if you want a word embedding approach. https://github.com/nltk/nltk

[–] [email protected] 1 points 1 year ago

Thanks for the tips. After doing a bunch of searching, I found that what I needed was BPE, or byte-pair encoding. This allows the token set to contain sub-word sequences, which lets the tokenizer represent a unique constant like 0x0373 as ['__sow', '0x', '03', '73', '__eow'].