Devlog by @Sabrina

@Sabrina on LLM Text Compression · about 2 months ago

3h 51m 1s logged

I built a way to train a model on text files.
But more importantly, I used that model to compress and decompress text.

To compress text, it takes the input and a seed, then goes letter by letter and has the AI try to predict the next letter. If the prediction is correct, it marks that letter as predicted and moves on. If it is wrong, it marks the letter as needing to be remembered, and the AI's memory is updated with the correct letter, as if it had predicted it.

It turns that list of marks into a bit mask of which letters to predict and which to read back, and ends the file with all the letters it marked to remember.
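Here is a minimal Python sketch of the compression pass described above. It is not the actual project code: the real model is swapped for a toy predictor that always guesses the most common next letter seen in training, so no seed is needed here. The names `train` and `compress` are made up for illustration, and packing the mask into real bits and writing the file is left out.

```python
from collections import Counter, defaultdict

def train(text):
    """Toy stand-in for the model: most common next letter after each letter."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return {prev: c.most_common(1)[0][0] for prev, c in counts.items()}

def compress(model, text):
    mask = []        # 0 = the model predicted the letter, 1 = letter is stored
    remembered = []  # letters the model got wrong, in order
    prev = ""        # the model's memory: always the true previous letter
    for ch in text:
        if model.get(prev) == ch:
            mask.append(0)
        else:
            mask.append(1)
            remembered.append(ch)
        prev = ch    # even after a miss, memory holds the correct letter
    return mask, "".join(remembered)

model = train("abababab")       # learns: after "a" comes "b", after "b" comes "a"
print(compress(model, "abba"))  # → ([1, 0, 1, 0], 'ab')
```

The first letter is always stored because the toy model has nothing to predict from yet. The better the predictor, the more 0s in the mask and the fewer letters at the end of the file.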

Decompressing is pretty much the same. It takes the seed and sets the RNG to it, then walks the bit mask. If it reads a 1, meaning a remembered letter, it reads the next stored letter, adds it to the output and to the AI's memory, and moves on.
If it reads a 0, it uses the previous letter and states to predict the next letter. Since the seed and states are the same as during compression, the AI is deterministic, so it generates the same letter the AI thought of when compressing.
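To show why the shared seed matters, here is a small round-trip sketch in Python. Again, this is not the project's code: the model is a toy sampler that picks the next letter at random, weighted by how often it followed the previous letter in training, and all function names are hypothetical. One thing this sketch assumes is that the decoder asks for a prediction on every letter, even remembered ones, so both sides pull random numbers in the same order.

```python
import random
from collections import Counter, defaultdict

def train(text):
    """Toy stand-in for the model: next-letter counts for each previous letter."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts

def predict(counts, prev, rng):
    """Sample a next letter from the counts, or None if prev was never seen."""
    options = counts.get(prev)
    if not options:
        return None
    letters = sorted(options)  # fixed order so sampling is repeatable
    weights = [options[c] for c in letters]
    return rng.choices(letters, weights=weights)[0]

def compress(counts, text, seed):
    rng = random.Random(seed)
    mask, remembered, prev = [], [], ""
    for ch in text:
        if predict(counts, prev, rng) == ch:
            mask.append(0)          # predicted correctly
        else:
            mask.append(1)          # wrong guess: store the letter
            remembered.append(ch)
        prev = ch
    return mask, remembered

def decompress(counts, mask, remembered, seed):
    rng = random.Random(seed)       # same seed as the compressor
    stored = iter(remembered)
    out, prev = [], ""
    for bit in mask:
        guess = predict(counts, prev, rng)  # drawn every step, even if unused
        ch = next(stored) if bit else guess
        out.append(ch)
        prev = ch
    return "".join(out)

counts = train("the cat sat on the mat. the cat ate the rat.")
text = "the rat sat"
mask, remembered = compress(counts, text, seed=42)
assert decompress(counts, mask, remembered, seed=42) == text
```

With a different seed on the decoding side the guesses no longer line up with the mask, so the seed has to travel with the compressed data or be agreed on ahead of time.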

What do I need to do now? I need to increase its training data. Right now it only uses “Alice in Wonderland” by Lewis Carroll, taken from Project Gutenberg.
I need to find some way to ethically source all of the data, so if you know where, let me know! I'm thinking right now about using Wikipedia, but I have to look into it first.