Learning Boids behavior via Reinforcement Learning

Boids

The boids algorithm is a well-known algorithm for simulating realistic-looking flocking behavior using simple rules. I have wanted to implement this algorithm for years, but never had a good reason to do so, since it has been implemented many times by people more dedicated to it than I.

Recently, I began learning about Reinforcement Learning and thought that, rather than using the standard CartPole, Pendulum, or Lunar Lander introductory environments, I could craft my own environment, and boids seemed like a great choice. After all, boids consists of three simple rules, usually formulated as direct positional update rules, but I guessed they could be formulated instead as rewards in the Reinforcement Learning framework.

Boids rules as rewards

Here’s what I think is a basic formulation of the classic boids algorithm in terms of rewards.

Let’s define a single Boid to be a position vector \(X\), and a heading \(\theta\). Each boid also has a perceptual region of radius \(R\) around it which it uses to examine the position and heading of other boids. The rewards for boids can be formulated in the following way:
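
To make the setup concrete, here is a minimal sketch of that state and the perception step; the class and function names are my own, not taken from any particular implementation:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Boid:
    position: np.ndarray  # X, a 2D position vector
    heading: float        # theta, in radians

def perceived_neighbors(boid, flock, R):
    """Return the other boids whose positions fall within the perceptual radius R."""
    return [
        other for other in flock
        if other is not boid
        and np.linalg.norm(boid.position - other.position) <= R
    ]
```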

Separation

My initial sketches were based on \(\frac{1}{||X-\alpha||}\), but I couldn’t get that to normalize in a nice way. Eventually I ended up with:

\[-\frac{1}{R}\cdot\frac{\sum_{\alpha}||X-\alpha||}{n}\] This takes the average distance between our position and the perceived neighbor positions \(\alpha\), normalized by \(R\). I suspect this is equivalent to the negative of the cohesion reward formula below, which considers the centroid within the region of radius \(R\), and this similarity bothered me: if they were equivalent, they would cancel each other out unless bounded by different radii.

I ended up simplifying this to

\[\frac{\min_{\alpha}||X-\alpha||}{R}\]

That is, simply using the closest object and normalizing the difference to \([0,1]\) using the radius \(R\). You can check that a distance of \(R\) (the maximum distance) would result in a reward of \(1\), whereas a distance of \(0\) would result in a reward of \(0\).
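
As code, this simplified separation reward might look like the following sketch (the empty-neighborhood convention is my own assumption):

```python
import numpy as np

def separation_reward(boid, neighbors, R):
    # Distance to the closest perceived neighbor, normalized by R into [0, 1].
    if not neighbors:
        return 1.0  # assumption: no perceived neighbors counts as maximally separated
    closest = min(np.linalg.norm(boid.position - n.position) for n in neighbors)
    return closest / R
```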

Alignment

\[|\theta - \frac{\sum_{\phi} \phi}{n}|\] This seems like a very straightforward formula, comparing our heading to the average heading \(\phi\) of neighbors in the area, but it does get a little messy when you consider edge cases such as angle wrap-around. It should also be normalized to \([0,1]\) by dividing by 180 or \(\pi\), depending on whether you work in degrees or radians.
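
Here is a rough sketch of that term in code, assuming radians; folding the difference into \([-\pi,\pi]\) is one way to handle the wrap-around edge case, and the naive averaging of headings is kept as in the formula:

```python
import numpy as np

def alignment_term(boid, neighbors):
    # Naive average of neighbor headings, as in the formula above
    # (a circular mean would be more robust near the +/- pi boundary).
    mean_heading = np.mean([n.heading for n in neighbors])
    # Fold the difference into [-pi, pi] so opposite headings score pi, not ~2*pi.
    diff = np.arctan2(np.sin(boid.heading - mean_heading),
                      np.cos(boid.heading - mean_heading))
    return abs(diff) / np.pi  # 0 when aligned, 1 when facing opposite directions
```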

Cohesion

\[||X - \frac{\sum_{\alpha} \alpha}{n}||\] Nearly identical to alignment, but for position: this compares our position to the average position (the centroid) of neighbors in the area. It should be divided by \(R\) to normalize it into the \([0,1]\) range.
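
A matching sketch of the cohesion term, again normalized by \(R\):

```python
import numpy as np

def cohesion_term(boid, neighbors, R):
    # Distance from our position to the centroid of perceived neighbors,
    # normalized by R into [0, 1].
    centroid = np.mean([n.position for n in neighbors], axis=0)
    return np.linalg.norm(boid.position - centroid) / R
```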

Reinforcement Learning

Now that I had formulated rewards, I looked for a way to simulate the environment and perform reinforcement learning. I found a convenient repo for using Godot as a Gym environment, and set about hacking up a run of OpenAI’s darling PPO algorithm from the Stable Baselines repo.
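
The training run itself is only a few lines; here is a minimal sketch of what it looks like with the Stable Baselines3 API, where `make_godot_boids_env` is a placeholder for however the Godot-Gym wrapper actually exposes the environment:

```python
from stable_baselines3 import PPO

env = make_godot_boids_env()            # placeholder for the Godot/Gym wrapper
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=1_000_000)  # training budget chosen arbitrarily here
model.save("boids_ppo")
```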

There’s an issue here: boids are almost inherently multi-agent, and PPO is not a multi-agent algorithm. It could be argued that each boid is the same agent, though. Naively, I thought I could just cycle through every boid after the reward phase and get away with it. Unfortunately, I now think this is the wrong approach: though I did have some success, the boids do not reproduce the behavior correctly.
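
Roughly, the cycling idea amounts to a single-agent wrapper along these lines; this is a reconstruction of the idea only, and every method on `flock_env` is hypothetical:

```python
import gym

class CyclingBoidsEnv(gym.Env):
    """Treats all boids as one agent by taking turns: each step controls one boid."""

    def __init__(self, flock_env):
        self.flock_env = flock_env  # hypothetical underlying multi-boid simulation
        self.current = 0

    def reset(self):
        self.current = 0
        self.flock_env.reset()
        return self.flock_env.observe(self.current)

    def step(self, action):
        # Apply the action to the boid whose turn it is and collect its reward.
        reward = self.flock_env.apply_action(self.current, action)
        # Advance to the next boid and hand the policy its observation.
        self.current = (self.current + 1) % self.flock_env.num_boids
        obs = self.flock_env.observe(self.current)
        done = self.flock_env.episode_over()
        return obs, reward, done, {}
```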

Because of this, I ended up training each rule separately at first. One hack I used was having the boids move forward at a constant rate. This made it quite easy to get visually appealing results just by training, say, the alignment rule on its own.

Here is a video of the alignment rule after being learned by single-agent PPO with the cycling approach; the agents simply try to adjust their rotation to the average rotation in their perceptual area:

And here’s another one where I vary the speed of each of the agents randomly each step:

Cohesion and separation appeared to do something (there were definitely bugs in my early separation implementation), but the overall behavior when I combined the three looked nothing like a classic Boids implementation.

You can also already see in the above videos that the agents have a bias in one direction. This would remain a persistent issue across all my attempts.

Multi-Agent Reinforcement Learning

So now I had to find a Gym-compatible multi-agent reinforcement learning library. As I began to hack this together, it occurred to me that, except for visualizing the end result, neither Godot nor the Gym interface was necessary. Still, I think they did provide some benefits: Godot allows one to quickly create environments and agents, and Gym provides a uniform interface that is compatible with a lot of RL libraries.

For multi-agent training, I found epymarl, an extension of pymarl, a framework for doing reinforcement learning on the game StarCraft. epymarl extends pymarl to include support for Gym environments and more algorithms, including multi-agent versions of modern reinforcement learning algorithms, namely MAPPO and MADDPG.

I quickly hacked together a MAPPO run using the same action and reward spaces as before, and running the data through the same Godot-Gym framework. This led to some results that look a little better, though still not quite close enough to Boids.

Replacing Softmax

Unfortunately, epymarl does not support continuous action spaces, so I attempted to replace the Gumbel-Softmax non-linearity in the MADDPG implementation, instead feeding the logits into a variant of tanh that squashes them into \([-1,1]\). However, the more I learned about MADDPG, the less confident I felt about using it at all. It seems PPO is more appropriate for this task. Work on this project is paused for now until I can learn more about these algorithms in detail.
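
Conceptually, the substitution I was attempting looks something like the sketch below; this is illustrative only, not epymarl’s actual code:

```python
import torch

def continuous_action(logits):
    # Instead of gumbel_softmax(logits) producing a (near-)one-hot discrete action,
    # squash the raw logits into [-1, 1] to use them directly as continuous actions.
    return torch.tanh(logits)
```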