LLM Text Compression
- 10 Devlogs
- 20 Total hours
I'm using an LLM to compress text as much as possible
I'm using an LLM to compress text as much as possible
Created the readme
I finished the website and the API! I added a mode to stream the output of the decompression so you can watch it happen in real time.
I also made a bit writer class to clean up the compress function. I was also gonan stream that but it was gonna be a little weird since i reverse all the bytes at the end and modify the head which wouldve been in the start byte.
But I have deployed the website to “https://llm-text-compression.netlify.app/” and I have deployed the API to nest @ “https://aicompressapi.sabrina.hackclub.app//compress”
I shall now ship it <3
well i need a readme.md also so in like 10 mins
I added support for longer utf-8 characters which had me rewrite a little bit of the compression and decompression parts. It inflated the size a little bit but i’d say its manageable right now.
I also started working on an API and website so its easy for people to vote. You can input hex bytes or text to compress or decompress.
I also wrote a lot about how the compression and decompression algorithm works!
Okay I got it working!!! I huffman encoded the top 12 predicted next characters and got the size down to 57 Bytes! the huffman coding on the same string is about 56 bytes!
I havent crossed the threshold of beating huffman coding but thats okay. I may experiment with arithmetic coding for the index to possibly reduce it further but im happy with this result so far
I increased the return amount to the top 7 results because in the worse case where the correct character is at index 6, the encoded data for it would use 8 bits, which is the same as just encoding the character, and we already used the llm to generate the character
i want to work on better huffman coding for the top results as currently it just encodes it by the number of 1s prefixing a 0
I fixed the varint4 reading and writing but as it seems right now it increases the size of the bit mask. but i think i mightve found a way to do without the bit mask.
the vocab is encoded in utf-8 and so for all my characters (a-z, A-Z, 0-9, and a space), only the lowder 7 bits are used, the 8th bit is ALWAYS 0
because of this, a character will NEVER start with a 1, meaning i can see if the next bit in the compress data is a 1 and if so we read an index for the llm’s prediction!
I got compression and decompression w/ the top 4 predictions working!! I encoded the indexes and then just slapped them in there. i had to pad the ending of the bit stream with 0s to align it to a byte but thats fine. I had to make a bit reader class to read it back and change up my decompresser code to read a byte and keep reading until we finish reading our index, but it was quite simple.
Now ive got my message being compressed to about 66 bytes, much better than the ~100 bytes from earlier with just raw prediction.
For reference the input text is 110 bytes.
The huffman encoding website says it compresses it down to 441 bits or ~56 bytes.
Right now im working on run length encoding to compress the llm prediction bitmask down. for my example message it is 16 bytes.
but since it is a bit mask there are lots of 0s and 1s in a row, so RLE should work wonders here. Hopefully i can reduce those 16 bytes donn to 6 and then i’d be tied with the huffman compression!
That would be cool,,, it is a 40% reduction in bytes so not impossible but we’ll see how it goes.
I have been working on trying to get the LLM to compress more. I tried to brute force a bunch of seeds and they only kinda worked
I tried a few other leads but no cigar
Now im planning on having my predict next character return the top 5 characters, check if the correct character is in there, and if so then we use entropy/huffman coding to write which one it is. the lower on the list the less
example for 5 characters
11 -> 1st result
10 -> 2nd
01 -> 3rd
001 -> 4th
000 -> 5th
if i return 4 characters then my tree looks like this:
0 -> 1st
10 -> 2nd
110 -> 3rd
1110 -> 4th
the less probable characters have longer codes!
this does have the added complexity of throwing off the bit count, but i can either just pop off bits or i can pad the current byte to 8 bits before i encode a character
I created utilized a wikipedia scraper via getting featured articles from a toolforge.org and then downloaded some articles using wikitext.eluni.co
I then added the training support for it by reading all the wikipedia text files and loading them into an array
I will now run some training on them to hopefully achieve better compression results!
I created a way to train a model based on text files.
But more importantly i used that model to compress and decompress text.
To compress text it takes the input and seed and goes letter by letter and has the ai try and predict the next letter. if it gets it correct it will mark that letter as correct and move on. if it was wrong it will mark it as needing to be remembered and then it’s memory will think it predicted it.
It uses that list to make a bit mask of which letters to predict or not, and will end the file with all the letters it marked to remember.
To decompress it’s pretty much the same. it’ll take a seed and set the rng to it. it will traverse the bit mask. if it reads a 1, meaning to read a remembered letter, it will read the first letter it can, add it to the output and ai memory and move on.
If it reads a 0 then it will use the previous letter and states to predict the next letter. and since we have the same seeds and states the ai is deterministic so we will generate the same letter the ai though of.
What do i need to do now? I need to increase it’s training data. Right now it only uses “Alice in Wonderland” by Lewis Carroll, taken from Project Gutenberg.
I need to find some way to ethnically source all of the data so if you know where let me know! I’m think rn about using wikipedia, but i have to look into it first.