I recently discovered Karpathy’s YouTube series on machine learning, and subsequently his nanoGPT project for training and fine-tuning GPT models. Between that and the amazing results coming out of the open-source Stable Diffusion community, I got to thinking that maybe I could do something interesting with a fine-tuned GPT model.
I wanted to fine-tune a GPT model on scary stories and use it to generate content for a sketchy-looking website claiming to be haunted. To do this I would need a pre-trained base model; GPT-J-6B or another Hugging Face GPT model would do. I would also need a lot of text data to fine-tune on.
I learned about pushshift.io, a site run by someone named Jason Michael Baumgartner, which hosts archives of various social media sites, including monthly dumps of their content. I eventually found a torrent with the Reddit submissions up to June 2022 and began downloading the 421GB archive. While waiting, I found a repo full of scripts specifically for processing this data, and used it to run 16 processes in parallel, filtering for the nosleep subreddit.
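The repo isn't reproduced here, but the core of what those scripts do is simple enough. Below is a minimal single-process sketch of that filtering step, assuming the monthly dumps are zstd-compressed newline-delimited JSON; the file and field names are illustrative, and the real scripts shard this work across the 16 processes.

```python
import io
import json
import zstandard

def filter_subreddit(dump_path: str, out_path: str, subreddit: str = "nosleep") -> None:
    """Stream a compressed monthly dump and keep only one subreddit's submissions."""
    # Pushshift archives use a large zstd window, so raise the decoder's limit.
    dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
    with open(dump_path, "rb") as fh, open(out_path, "w", encoding="utf-8") as out:
        with dctx.stream_reader(fh) as reader:
            for line in io.TextIOWrapper(reader, encoding="utf-8"):
                try:
                    post = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip the occasional malformed line
                if post.get("subreddit", "").lower() == subreddit:
                    out.write(json.dumps(post) + "\n")

filter_subreddit("RS_2022-06.zst", "nosleep_2022-06.ndjson")
```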
The script reported a pretty consistent 150 MB/s, which at first seemed slow; cursory googling led me to believe I should be seeing 2500 MB/s. But since I was running 16 processes, the per-process rate makes sense: \(150\cdot16=2400\).
Regardless, the scan finished in about an hour, and I had a pretty decent dataset. I eventually cleaned it up a little more by removing [deleted], [removed], the name of the subreddit, etc. I also crafted a nifty (and probably flawed) regex to convert markdown-formatted links to plain text: `sed "s/\[\([^]]*\)\]([^\)]*)/\1/g"`.
Another improvement was to threshold the dataset by number of upvotes. A cut-off of 1000 upvotes gave a dataset of roughly 10k entries.
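Putting those cleanup steps together, here is a hedged sketch of that pipeline: dropping [deleted]/[removed] posts, stripping the subreddit name, applying the Python equivalent of the sed expression above, and thresholding on score. The field names (`title`, `selftext`, `score`) follow the pushshift submission schema; the file names are illustrative.

```python
import json
import re

# Python equivalent of the sed expression above: "[text](url)" -> "text"
MD_LINK = re.compile(r"\[([^\]]*)\]\([^)]*\)")

def clean(text: str) -> str:
    text = MD_LINK.sub(r"\1", text)
    text = text.replace("nosleep", "")  # crude removal of the subreddit name
    return text.strip()

def build_dataset(in_path: str, out_path: str, min_score: int = 1000) -> None:
    with open(in_path, encoding="utf-8") as f, open(out_path, "w", encoding="utf-8") as out:
        for line in f:
            post = json.loads(line)
            body = post.get("selftext", "")
            # Drop deleted/removed posts and anything below the upvote cut-off
            if body in ("[deleted]", "[removed]") or post.get("score", 0) < min_score:
                continue
            out.write(clean(post.get("title", "")) + "\n\n" + clean(body) + "\n\n")

build_dataset("nosleep_2022-06.ndjson", "nosleep.txt")
```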
At the time of writing Karpathy’s instructions for setting up a working environment were as follows:
Dependencies:
- [pytorch](https://pytorch.org) <3
- [numpy](https://numpy.org/install/) <3
- `pip install transformers` for huggingface transformers <3 (to load GPT-2 checkpoints)
- `pip install datasets` for huggingface datasets <3 (if you want to download + preprocess OpenWebText)
- `pip install tiktoken` for OpenAI's fast BPE code <3
- `pip install wandb` for optional logging <3
- `pip install tqdm`
So not nothing, but not exactly repeatable either. I am very grateful that the requirements are so minimal, though. I was able to get things running on a fresh Ubuntu install, both locally and in the cloud, using conda. Lambda GPU Cloud helpfully has torch installed by default, but apparently not the nightly build and not torch 2.0 at the time of writing.
I trained a medium model on my home machine, achieving a loss of about 2.6. The results were pretty weird, and not very coherent. After playing with the learning rate to no avail, I figured I would need a bigger model to get more coherent output. Lambda GPU Cloud is in high demand these days, but I managed to snag a single A100 with 40GB of VRAM for the day and got cracking. With this GPU I was able to use the gpt2-xl model, but only with a batch size of 1. With that, I was able to get the loss down to something like 2.3, and I generated around 700 fairly low-quality samples.
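For reference, a nanoGPT fine-tuning config along the lines of its config/finetune_shakespeare.py example looks roughly like this; the values below are illustrative, not the exact ones from these runs.

```python
# config/finetune_nosleep.py -- run as: python train.py config/finetune_nosleep.py
out_dir = 'out-nosleep'
eval_interval = 50
eval_iters = 40
wandb_log = False

dataset = 'nosleep'            # expects data/nosleep/train.bin and val.bin
init_from = 'gpt2-xl'          # start from the pretrained GPT-2 XL checkpoint
always_save_checkpoint = False

# the 40GB A100 only fit the XL model with a batch size of 1,
# so lean on gradient accumulation instead
batch_size = 1
gradient_accumulation_steps = 32
max_iters = 1000

# small, constant learning rate, as in nanoGPT's own fine-tuning example
learning_rate = 3e-5
decay_lr = False
```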
Eventually I moved to the GPT-J models using this repo, which leverages DeepSpeed and includes a very useful Dockerfile, getting the loss down to 1.983 (and learning to use wandb along the way). These produced some nice results: not totally coherent, but not awful either.
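The fine-tuning repo and Dockerfile aren't reproduced here, but sampling from the resulting checkpoint is just standard Hugging Face transformers usage. The sketch below assumes a hypothetical local checkpoint directory and an A100-sized GPU.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "./gptj-nosleep-checkpoint"  # hypothetical path to the fine-tuned GPT-J weights
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16).to("cuda")

prompt = "I should never have answered the knock at my door."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Sample a few stories with temperature / nucleus sampling
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.9,
    top_p=0.95,
    max_new_tokens=512,
    num_return_sequences=4,
    pad_token_id=tokenizer.eos_token_id,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
    print("\n---\n")
```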
I threw together a static site using the Next.js framework and terraformed some infrastructure in AWS. The site is hosted at ghostsdontdie.com, a domain I already owned. Some of the scripting to break up the output files was a little sketchy, and some of the outputs were actually empty, which resulted in some strange breaks in a few of the stories from the 7B model.
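For what it's worth, a less sketchy version of that splitting script might look like the sketch below. It assumes the generator wrote all samples to one file with a separator between stories (the names and separator are hypothetical), and it skips empty generations instead of publishing them.

```python
import json
from pathlib import Path

def split_samples(samples_path: str, out_dir: str, separator: str = "\n---\n") -> None:
    """Break a file of generated samples into one JSON file per story for the site."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    raw = Path(samples_path).read_text(encoding="utf-8")
    stories = [s.strip() for s in raw.split(separator)]
    for i, story in enumerate(stories):
        if not story:
            continue  # drop empty generations instead of publishing blank pages
        (out / f"story-{i:04d}.json").write_text(
            json.dumps({"id": i, "text": story}), encoding="utf-8"
        )

split_samples("samples_gptj.txt", "site/content/stories")
```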