Fine-tuning various GPT models to write scary stories

I recently discovered Karpathy’s YouTube series on machine learning, and through it his nanoGPT project for training and fine-tuning GPT models. Between that and the amazing results coming out of the open-source Stable Diffusion community, I started thinking that maybe I could do something interesting with a fine-tuned GPT model.


I wanted to fine-tune a GPT model on scary stories and use it to generate content for a sketchy-looking website claiming to be haunted. To do this I would need a pre-trained base model, such as GPT-J-6B or another GPT model from Hugging Face, plus a lot of text data to fine-tune on.


I learned about Pushshift, a site run by someone named Jason Michael Baumgartner. There I found archives of various social media sites, including monthly dumps of their content. I eventually found a torrent with the Reddit submissions up to June 2022, and began downloading the 421 GB archive. While waiting, I found a repo full of scripts specifically for processing this data, and used it to run 16 processes in parallel, filtering for the nosleep subreddit.
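The core of that filtering step is simple once each line of the dump is parsed. Here is a minimal sketch of the per-line filter; the function name is mine, not from the repo's scripts, and it assumes the dumps are newline-delimited JSON (in practice the lines would be streamed out of the zstd-compressed archive, e.g. with the `zstandard` package, but plain strings keep the sketch dependency-free):

```python
import json

def filter_submissions(lines, subreddit="nosleep"):
    """Keep only submissions from the target subreddit, skipping malformed lines."""
    kept = []
    for line in lines:
        try:
            post = json.loads(line)
        except json.JSONDecodeError:
            # Dumps occasionally contain truncated or corrupt lines; skip them.
            continue
        if post.get("subreddit", "").lower() == subreddit.lower():
            kept.append(post)
    return kept
```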

The script reported a pretty consistent 150 MB/s per process, which at first seemed slow; cursory googling led me to believe I should be seeing around 2,500 MB/s. But since I was running 16 processes, the aggregate throughput was actually \(150\cdot16=2400\) MB/s, right in line with expectations.

Regardless, the scan finished in about an hour, leaving me with a pretty high-quality dataset. I later cleaned it up a little more by removing [deleted] and [removed] placeholders, mentions of the subreddit name, and so on. I also crafted a nifty (and probably flawed) regex to convert markdown-formatted links to plain text: `sed "s/\[\([^]]*\)\]([^\)]*)/\1/g"`.
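The same cleanup can be sketched in Python, which is easier to test than the sed one-liner. The pattern below mirrors the regex above (link text kept, URL dropped); the helper names are mine:

```python
import re

# Matches markdown links like [text](url); group 1 is the visible text.
LINK_RE = re.compile(r"\[([^]]*)\]\([^)]*\)")

def clean_story(text):
    """Replace markdown links with their visible text."""
    return LINK_RE.sub(r"\1", text)

def keep_story(text):
    """Drop posts whose body is just a deletion placeholder."""
    return text.strip() not in ("[deleted]", "[removed]")
```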

Another improvement was to threshold the dataset by number of upvotes. A cut-off of 1000 upvotes gave a dataset of roughly 10k entries.
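A minimal sketch of that thresholding step, assuming each submission dict carries Pushshift's `score` field (the function name and default are mine):

```python
def threshold_by_score(posts, min_score=1000):
    """Keep only submissions at or above the upvote cut-off."""
    return [p for p in posts if p.get("score", 0) >= min_score]
```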

Environment setup

At the time of writing Karpathy’s instructions for setting up a working environment were as follows:

-   [pytorch](https://pytorch.org) <3
-   [numpy](https://numpy.org) <3
-   `pip install transformers` for huggingface transformers <3 (to load GPT-2 checkpoints)
-   `pip install datasets` for huggingface datasets <3 (if you want to download + preprocess OpenWebText)
-   `pip install tiktoken` for OpenAI's fast BPE code <3
-   `pip install wandb` for optional logging <3
-   `pip install tqdm`

So not nothing, but not exactly repeatable either. I am very grateful that the requirements are so minimal, though. I was able to get things running on a fresh Ubuntu install, both locally and in the cloud, using conda. Lambda cloud GPU instances helpfully have torch installed by default, but apparently not the nightly build or torch 2.0 at this time.
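For repeatability, the steps above can be collapsed into a few commands. This is a sketch of my setup, not an official recipe; the environment name and Python version are my own choices:

```shell
# Fresh conda environment for nanoGPT fine-tuning (name and version are arbitrary).
conda create -n nanogpt python=3.10 -y
conda activate nanogpt

# The dependencies listed in the nanoGPT README.
pip install torch numpy transformers datasets tiktoken wandb tqdm
```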

Training pt 1

I trained a medium model on my home machine, achieving a loss of about 2.6. The results were pretty weird, and not very coherent. After playing with the learning rate to no avail, I figured I would need a bigger model to get more coherent output. Lambda GPU cloud is in high demand these days, but I managed to snag a single A100 with 40 GB of VRAM for the day and got cracking. With this GPU I was able to use the GPT-2 XL model, but only with a batch size of 1. That got the loss down to something like 2.3, and I generated around 700 fairly low-quality samples.
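nanoGPT drives training from small Python config files, so a run like this one can be captured in a few lines. The sketch below is modeled on the repo's fine-tuning configs; the dataset name and exact hyperparameter values are my assumptions, not a record of the actual run:

```python
# Hypothetical nanoGPT config for fine-tuning GPT-2 XL on the nosleep dataset.
# Values here are illustrative, not the exact ones used.
out_dir = 'out-nosleep'
init_from = 'gpt2-xl'             # start from the pretrained 1.5B checkpoint
dataset = 'nosleep'               # expects data/nosleep/{train,val}.bin
batch_size = 1                    # all that fits on a single 40 GB A100
gradient_accumulation_steps = 32  # recover a larger effective batch
max_iters = 2000
learning_rate = 3e-5              # small learning rate for fine-tuning
decay_lr = False
```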

Training pt 2

Eventually I moved to the GPT-J models, using a repo that leverages DeepSpeed and provides a very useful Dockerfile, getting the loss down to 1.983 (and learning to use wandb along the way).

These produced some nice results. Not totally coherent but not awful either.

Displaying the results

I threw together a static site using the Next.js framework and terraformed some infrastructure in AWS. The site is hosted at a domain I already owned. Some of the scripting to break up the output files was a little sketchy, and some of the outputs were actually empty, resulting in strange breaks in some of the stories for the 7B model.
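In hindsight, the splitting step could have guarded against those empty outputs. A hedged sketch of what it might look like; the delimiter is an assumption based on nanoGPT's default sample output, and the function name is mine:

```python
from pathlib import Path

def split_samples(raw, out_dir, delimiter="---------------"):
    """Split a raw sample dump into one file per story, skipping empties."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = 0
    for chunk in raw.split(delimiter):
        story = chunk.strip()
        if not story:  # this guard would have avoided the empty pages
            continue
        (out / f"story-{written:04d}.txt").write_text(story, encoding="utf-8")
        written += 1
    return written
```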