I recently discovered Karpathy’s YouTube series on machine learning, and subsequently his nanoGPT project for training and fine-tuning GPT models. Between that and the amazing results coming out of the open-source Stable Diffusion community, I got to thinking that maybe I could do something interesting with a fine-tuned GPT model.
I wanted to fine-tune a GPT model on scary stories and use it to generate content for a sketchy-looking website claiming to be haunted. To do this I would need a pre-trained base model; GPT-J-6B or another Hugging Face GPT model would do. I would also need a lot of text data to fine-tune on.
I learned about pushshift.io, a site run by someone named Jason Michael Baumgartner, which hosts archives of various social media sites, including monthly dumps of their content. I eventually found a torrent with the Reddit submissions up to June 2022 and began downloading the 421GB archive. While waiting, I found a repo full of scripts specifically for processing this data, and used it to run 16 processes in parallel, filtering for the nosleep subreddit.
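The repo isn't reproduced here, but the core of what those scripts do is simple enough. Below is a minimal single-process sketch of that filtering step, assuming the monthly dumps are zstd-compressed newline-delimited JSON; the file and field names are illustrative, and the real scripts shard this work across the 16 processes.

```python
import io
import json
import zstandard

def filter_subreddit(dump_path: str, out_path: str, subreddit: str = "nosleep") -> None:
    """Stream a compressed monthly dump and keep only one subreddit's submissions."""
    # Pushshift archives use a large zstd window, so raise the decoder's limit.
    dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
    with open(dump_path, "rb") as fh, open(out_path, "w", encoding="utf-8") as out:
        with dctx.stream_reader(fh) as reader:
            for line in io.TextIOWrapper(reader, encoding="utf-8"):
                try:
                    post = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip the occasional malformed line
                if post.get("subreddit", "").lower() == subreddit:
                    out.write(json.dumps(post) + "\n")

filter_subreddit("RS_2022-06.zst", "nosleep_2022-06.ndjson")
```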
The script reported a pretty consistent 150 MB/s, which at first seemed slow; cursory googling led me to believe I should be seeing 2500 MB/s. But since I was running 16 processes, the per-process rate makes sense: \(150\cdot16=2400\).
Regardless, the scan finished in about an hour, and I had a pretty decent dataset. I eventually cleaned it up a little more by removing [deleted], [removed], the name of the subreddit, etc. I also crafted a nifty (and probably flawed) regex to convert markdown-formatted links to plain text: `sed "s/\[\([^]]*\)\]([^\)]*)/\1/g"`.
Another improvement was to threshold the dataset by number of upvotes. A cut-off of 1000 upvotes gave a dataset of roughly 10k entries.
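Putting those cleanup steps together, here is a hedged sketch of that pipeline: dropping [deleted]/[removed] posts, stripping the subreddit name, applying the Python equivalent of the sed expression above, and thresholding on score. The field names (`title`, `selftext`, `score`) follow the pushshift submission schema; the file names are illustrative.

```python
import json
import re

# Python equivalent of the sed expression above: "[text](url)" -> "text"
MD_LINK = re.compile(r"\[([^\]]*)\]\([^)]*\)")

def clean(text: str) -> str:
    text = MD_LINK.sub(r"\1", text)
    text = text.replace("nosleep", "")  # crude removal of the subreddit name
    return text.strip()

def build_dataset(in_path: str, out_path: str, min_score: int = 1000) -> None:
    with open(in_path, encoding="utf-8") as f, open(out_path, "w", encoding="utf-8") as out:
        for line in f:
            post = json.loads(line)
            body = post.get("selftext", "")
            # Drop deleted/removed posts and anything below the upvote cut-off
            if body in ("[deleted]", "[removed]") or post.get("score", 0) < min_score:
                continue
            out.write(clean(post.get("title", "")) + "\n\n" + clean(body) + "\n\n")

build_dataset("nosleep_2022-06.ndjson", "nosleep.txt")
```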
At the time of writing Karpathy’s instructions for setting up a working environment were as follows:
Dependencies:
- [pytorch](https://pytorch.org) <3
- [numpy](https://numpy.org/install/) <3
- `pip install transformers` for huggingface transformers <3 (to load GPT-2 checkpoints)
- `pip install datasets` for huggingface datasets <3 (if you want to download + preprocess OpenWebText)
- `pip install tiktoken` for OpenAI's fast BPE code <3
- `pip install wandb` for optional logging <3
- `pip install tqdm`
So not nothing, but not exactly repeatable either. I am very grateful that the requirements are so minimal, though. I was able to get things running on a fresh Ubuntu install, both locally and in the cloud, using conda. Lambda GPU Cloud helpfully has torch installed by default, but apparently not the nightly build and not torch 2.0 at the time of writing.
I trained a medium model on my home machine, achieving a loss of about 2.6. The results were pretty weird, and not very coherent. After playing with the learning rate to no avail, I figured I would need a bigger model to get more coherent output. Lambda GPU Cloud is in high demand these days, but I managed to snag a single A100 with 40GB of VRAM for the day and got cracking. With this GPU I was able to use the gpt2-xl model, but only with a batch size of 1. With that, I was able to get the loss down to something like 2.3, and I generated around 700 fairly low-quality samples.
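For reference, a nanoGPT fine-tuning config along the lines of its config/finetune_shakespeare.py example looks roughly like this; the values below are illustrative, not the exact ones from these runs.

```python
# config/finetune_nosleep.py -- run as: python train.py config/finetune_nosleep.py
out_dir = 'out-nosleep'
eval_interval = 50
eval_iters = 40
wandb_log = False

dataset = 'nosleep'            # expects data/nosleep/train.bin and val.bin
init_from = 'gpt2-xl'          # start from the pretrained GPT-2 XL checkpoint
always_save_checkpoint = False

# the 40GB A100 only fit the XL model with a batch size of 1,
# so lean on gradient accumulation instead
batch_size = 1
gradient_accumulation_steps = 32
max_iters = 1000

# small, constant learning rate, as in nanoGPT's own fine-tuning example
learning_rate = 3e-5
decay_lr = False
```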
Eventually I moved to the GPT-J models using this repo, which leverages DeepSpeed and includes a very useful Dockerfile, getting the loss down to 1.983 (and learning to use wandb along the way). These produced some nice results: not totally coherent, but not awful either.
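The fine-tuning repo and Dockerfile aren't reproduced here, but sampling from the resulting checkpoint is just standard Hugging Face transformers usage. The sketch below assumes a hypothetical local checkpoint directory and an A100-sized GPU.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "./gptj-nosleep-checkpoint"  # hypothetical path to the fine-tuned GPT-J weights
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16).to("cuda")

prompt = "I should never have answered the knock at my door."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Sample a few stories with temperature / nucleus sampling
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.9,
    top_p=0.95,
    max_new_tokens=512,
    num_return_sequences=4,
    pad_token_id=tokenizer.eos_token_id,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
    print("\n---\n")
```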
I threw together a static site using the Next.js framework and terraformed some infrastructure in AWS. The site is hosted at ghostsdontdie.com, a domain I already owned. Some of the scripting to break up the output files was a little sketchy, and some of the outputs were actually empty, which resulted in some strange breaks in a few of the stories from the 7B model.
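For what it's worth, a less sketchy version of that splitting script might look like the sketch below. It assumes the generator wrote all samples to one file with a separator between stories (the names and separator are hypothetical), and it skips empty generations instead of publishing them.

```python
import json
from pathlib import Path

def split_samples(samples_path: str, out_dir: str, separator: str = "\n---\n") -> None:
    """Break a file of generated samples into one JSON file per story for the site."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    raw = Path(samples_path).read_text(encoding="utf-8")
    stories = [s.strip() for s in raw.split(separator)]
    for i, story in enumerate(stories):
        if not story:
            continue  # drop empty generations instead of publishing blank pages
        (out / f"story-{i:04d}.json").write_text(
            json.dumps({"id": i, "text": story}), encoding="utf-8"
        )

split_samples("samples_gptj.txt", "site/content/stories")
```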