LLM Text Compression

Ship #1

@Sabrina on LLM Text Compression · about 2 months ago

I created a lossless text compression algorithm that uses an AI to guess which letter should come next!
My project works best on the latin alphabet, and specifically english text! Even better with minimal punctuation! It works best with this set of characters: a-z, A-Z, 0-9, and the space

Two things were challenging. First was getting even getting a model to train, I had to followed a tutorial on tensorflow, but after that google colab trained like there was no tomorrow. The second thing was making the AI deterministic and finding an optimal way to compress it.
Initially, I had the ability to set seeds and then i would compress it many many many times over again with different seeds to see which one was optimal. I eventually changed the entire format and how it worked. I made it 100% deterministic and then stored the top 11 results instead of just the top result.
I’m proud that in certain cases my compression beat huffman coding by a few bytes. in like 90% of the cases though I am consistently like 1-10 bytes more inflated than huffman coding. But my decoder works on all strings, but for huffman coding you need the table as well, and the optimal one changes.

10 devlogs
20h
18.19x multiplier
360 Stardust

Try project → See source code →

Open comments for this post

@Sabrina on LLM Text Compression · about 2 months ago

15m 57s logged

Created the readme

Open comments for this post

@Sabrina on LLM Text Compression · about 2 months ago

2h 2m 32s logged

I finished the website and the API! I added a mode to stream the output of the decompression so you can watch it happen in real time.
I also made a bit writer class to clean up the compress function. I was also gonan stream that but it was gonna be a little weird since i reverse all the bytes at the end and modify the head which wouldve been in the start byte.
But I have deployed the website to “https://llm-text-compression.netlify.app/” and I have deployed the API to nest @ “https://aicompressapi.sabrina.hackclub.app//compress”
I shall now ship it <3
well i need a readme.md also so in like 10 mins

Open comments for this post

@Sabrina on LLM Text Compression · about 2 months ago

5h 11m 35s logged

I added support for longer utf-8 characters which had me rewrite a little bit of the compression and decompression parts. It inflated the size a little bit but i’d say its manageable right now.
I also started working on an API and website so its easy for people to vote. You can input hex bytes or text to compress or decompress.
I also wrote a lot about how the compression and decompression algorithm works!

Open comments for this post

@Sabrina on LLM Text Compression · about 2 months ago

1h 28m 57s logged

Okay I got it working!!! I huffman encoded the top 12 predicted next characters and got the size down to 57 Bytes! the huffman coding on the same string is about 56 bytes!

I havent crossed the threshold of beating huffman coding but thats okay. I may experiment with arithmetic coding for the index to possibly reduce it further but im happy with this result so far

Open comments for this post

@Sabrina on LLM Text Compression · about 2 months ago

1h 22m 12s logged

I increased the return amount to the top 7 results because in the worse case where the correct character is at index 6, the encoded data for it would use 8 bits, which is the same as just encoding the character, and we already used the llm to generate the character

i want to work on better huffman coding for the top results as currently it just encodes it by the number of 1s prefixing a 0

Open comments for this post

@Sabrina on LLM Text Compression · about 2 months ago

25m 43s logged

I fixed the varint4 reading and writing but as it seems right now it increases the size of the bit mask. but i think i mightve found a way to do without the bit mask.
the vocab is encoded in utf-8 and so for all my characters (a-z, A-Z, 0-9, and a space), only the lowder 7 bits are used, the 8th bit is ALWAYS 0
because of this, a character will NEVER start with a 1, meaning i can see if the next bit in the compress data is a 1 and if so we read an index for the llm’s prediction!

Open comments for this post

@Sabrina on LLM Text Compression · about 2 months ago

58m 14s logged

I got compression and decompression w/ the top 4 predictions working!! I encoded the indexes and then just slapped them in there. i had to pad the ending of the bit stream with 0s to align it to a byte but thats fine. I had to make a bit reader class to read it back and change up my decompresser code to read a byte and keep reading until we finish reading our index, but it was quite simple.

Now ive got my message being compressed to about 66 bytes, much better than the ~100 bytes from earlier with just raw prediction.
For reference the input text is 110 bytes.
The huffman encoding website says it compresses it down to 441 bits or ~56 bytes.

Right now im working on run length encoding to compress the llm prediction bitmask down. for my example message it is 16 bytes.
but since it is a bit mask there are lots of 0s and 1s in a row, so RLE should work wonders here. Hopefully i can reduce those 16 bytes donn to 6 and then i’d be tied with the huffman compression!
That would be cool,,, it is a 40% reduction in bytes so not impossible but we’ll see how it goes.

Open comments for this post

@Sabrina on LLM Text Compression · about 2 months ago

3h 0m 38s logged

I have been working on trying to get the LLM to compress more. I tried to brute force a bunch of seeds and they only kinda worked
I tried a few other leads but no cigar

Now im planning on having my predict next character return the top 5 characters, check if the correct character is in there, and if so then we use entropy/huffman coding to write which one it is. the lower on the list the less

example for 5 characters
11 -> 1st result
10 -> 2nd
01 -> 3rd
001 -> 4th
000 -> 5th

if i return 4 characters then my tree looks like this:
0 -> 1st
10 -> 2nd
110 -> 3rd
1110 -> 4th

the less probable characters have longer codes!
this does have the added complexity of throwing off the bit count, but i can either just pop off bits or i can pad the current byte to 8 bits before i encode a character

Open comments for this post

@Sabrina on LLM Text Compression · about 2 months ago

1h 9m 14s logged

I created utilized a wikipedia scraper via getting featured articles from a toolforge.org and then downloaded some articles using wikitext.eluni.co
I then added the training support for it by reading all the wikipedia text files and loading them into an array

I will now run some training on them to hopefully achieve better compression results!

Open comments for this post

@Sabrina on LLM Text Compression · about 2 months ago

3h 51m 1s logged

I created a way to train a model based on text files.
But more importantly i used that model to compress and decompress text.

To compress text it takes the input and seed and goes letter by letter and has the ai try and predict the next letter. if it gets it correct it will mark that letter as correct and move on. if it was wrong it will mark it as needing to be remembered and then it’s memory will think it predicted it.

It uses that list to make a bit mask of which letters to predict or not, and will end the file with all the letters it marked to remember.

To decompress it’s pretty much the same. it’ll take a seed and set the rng to it. it will traverse the bit mask. if it reads a 1, meaning to read a remembered letter, it will read the first letter it can, add it to the output and ai memory and move on.
If it reads a 0 then it will use the previous letter and states to predict the next letter. and since we have the same seeds and states the ai is deterministic so we will generate the same letter the ai though of.

What do i need to do now? I need to increase it’s training data. Right now it only uses “Alice in Wonderland” by Lewis Carroll, taken from Project Gutenberg.
I need to find some way to ethnically source all of the data so if you know where let me know! I’m think rn about using wikipedia, but i have to look into it first.