The *boids* algorithm is a well-known algorithm for
simulating realistic-looking flocking behavior using simple rules. I
have wanted to implement this algorithm for years, but never had a
good reason to do so, since it has been implemented many times by
people more dedicated to it than I am.

Recently, I began learning about Reinforcement Learning and thought
that, rather than the standard CartPole, Pendulum, or Lunar Lander
introductory environments, I could craft my own environment, and
*boids* seemed like a great choice. After all, *boids*
consists of three simple rules, usually formulated as direct positional
update rules, but I guessed they could be formulated instead as
*rewards* in the Reinforcement Learning framework.

Here’s what I think is a basic formulation of the classic
*boids* algorithm in terms of rewards.

Let’s define a single boid to be a position vector \(X\) and a heading \(\theta\). Each boid also has a perceptual
region of radius \(R\) around it,
which it uses to examine the positions and headings of other
*boids*. The three classic rules, *separation*,
*alignment*, and *cohesion*, can then be formulated as
rewards in the following way.

My initial sketches for *separation* were based on \(\frac{1}{||X-\alpha||}\), but I couldn’t get that to normalize in a nice way. Eventually I ended up with:

\[-\frac{1}{R}\cdot\frac{\sum_{\alpha}||X-{\alpha}||}{n}\]
This takes the average distance between ourselves and the perceived neighbor
positions \(\alpha\), and normalizes
it by \(R\). I suspect this is
equivalent to the negative of the cohesion reward formula below, which
considers the centroid within the region of radius \(R\), and this similarity bothered me:
if they *were* equivalent, they would cancel each other
out unless bounded by different radii.

I ended up simplifying this to

\[\frac{\min_{\alpha}||X-\alpha||}{R}\]

That is, simply using the closest object and normalizing the difference to \([0,1]\) using the radius \(R\). You can check that a distance of \(R\) (the maximum distance) would result in a reward of \(1\), whereas a distance of \(0\) would result in a reward of \(0\).
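As a sketch (plain Python, with hypothetical function and argument names, and assuming `neighbors` already contains only the boids within radius \(R\)), the final separation reward could be computed like this:

```python
import math

def separation_reward(pos, neighbors, R):
    """Separation reward sketch: distance to the closest perceived
    neighbor, normalized to [0, 1] by the perception radius R.
    `pos` and each neighbor are (x, y) tuples."""
    if not neighbors:
        return 1.0  # assumption: no neighbors means maximal separation
    closest = min(math.dist(pos, a) for a in neighbors)
    return closest / R
```

For example, with the nearest neighbor at distance 3 and \(R = 5\), the reward is 0.6.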

For *alignment*:

\[\left|\theta - \frac{\sum_{\phi} \phi}{n}\right|\]

This seems like a very straightforward formula, comparing our heading to the average heading in the area, but it does get a little messy when you consider edge cases, such as the wrap-around at \(\pm 180°\). Also, this should be normalized to \([0,1]\) by dividing by 180 or \(\pi\), depending on units.
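A sketch of the alignment term, handling the wrap-around edge case with a circular mean instead of a naive arithmetic average (function and argument names are mine, not from any library). Note that, as written in the formula above, the term is 0 at perfect alignment, so presumably it enters the reward negated or subtracted from 1:

```python
import math

def alignment_term(theta, headings):
    """Alignment term sketch (radians): absolute difference between our
    heading and the circular mean of perceived headings, normalized to
    [0, 1] by pi. The circular mean avoids the wrap-around edge case
    (naively averaging +179 and -179 degrees gives ~0 instead of 180)."""
    if not headings:
        return 0.0  # assumption: nothing to align with
    mean = math.atan2(sum(math.sin(p) for p in headings),
                      sum(math.cos(p) for p in headings))
    # smallest signed angular difference, then take its magnitude
    diff = abs(math.atan2(math.sin(theta - mean), math.cos(theta - mean)))
    return diff / math.pi
```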

For *cohesion*:

\[\left|\left|X - \frac{\sum_{\alpha} \alpha}{n}\right|\right|\]

Nearly identical to alignment, but for position: comparing our position to the average position (the centroid) in the area. This should be divided by \(R\) to normalize into the \([0,1]\) region.
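And a matching sketch of the cohesion term (again with hypothetical names, and the same caveat that a 0-at-centroid quantity would be negated or inverted to act as a reward):

```python
import math

def cohesion_term(pos, neighbors, R):
    """Cohesion term sketch: distance from our position to the centroid
    of perceived neighbor positions, normalized to [0, 1] by R."""
    if not neighbors:
        return 0.0  # assumption: alone in the region, nothing to cohere to
    cx = sum(a[0] for a in neighbors) / len(neighbors)
    cy = sum(a[1] for a in neighbors) / len(neighbors)
    return math.dist(pos, (cx, cy)) / R
```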

Now that I had formulated the rewards, I looked for a way to simulate the environment and perform reinforcement learning. I found a convenient repo for using Godot as a Gym environment, and set about hacking up a run of OpenAI’s darling PPO algorithm from the Stable Baselines repo.
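To make the setup concrete, here is a minimal Gym-style environment skeleton in plain Python. This is a hypothetical sketch of the interface shape, not the author’s Godot setup: observations are just the indices of neighbors within \(R\), the action is a heading change, and boids always move forward at a fixed speed.

```python
import math
import random

class BoidsEnvSketch:
    """Hypothetical Gym-style skeleton: reset() returns an observation,
    step() returns the classic (obs, reward, done, info) tuple."""
    def __init__(self, n_boids=8, radius=5.0, speed=0.1):
        self.n, self.R, self.speed = n_boids, radius, speed
        self.reset()

    def reset(self):
        self.pos = [(random.uniform(0, 10), random.uniform(0, 10))
                    for _ in range(self.n)]
        self.heading = [random.uniform(-math.pi, math.pi)
                        for _ in range(self.n)]
        return self._obs(0)

    def _obs(self, i):
        # indices of neighbors within the perceptual radius R of boid i
        return [j for j in range(self.n)
                if j != i and math.dist(self.pos[i], self.pos[j]) < self.R]

    def step(self, action, i=0):
        # action: heading delta; the boid always moves forward at fixed speed
        self.heading[i] += action
        x, y = self.pos[i]
        self.pos[i] = (x + self.speed * math.cos(self.heading[i]),
                       y + self.speed * math.sin(self.heading[i]))
        reward = 0.0  # the separation/alignment/cohesion terms would combine here
        return self._obs(i), reward, False, {}
```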

There’s an issue here: *boids* are almost
*inherently* multi-agent, and PPO is not a multi-agent
algorithm. It could be argued that every boid is the same agent, though.
Naively, I thought I could just cycle through every boid after the
*reward* phase and get away with it. Unfortunately, I now think
this is the wrong approach: though I did have *some* success,
the boids do not simulate the behavior correctly.
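The cycling scheme can be sketched as follows. `StubAgent` is a stand-in for the PPO learner (random actions, one shared buffer); the point is only the control flow of stepping a single shared agent once per boid each frame:

```python
import random

class StubAgent:
    """Stand-in for the shared single-agent learner."""
    def __init__(self):
        self.buffer = []
    def act(self, obs):
        return random.uniform(-1.0, 1.0)  # e.g. a turn-rate action
    def record(self, obs, action, reward):
        self.buffer.append((obs, action, reward))

def cycle_step(agent, observations, reward_fn):
    """One frame of the naive cycling scheme: the single shared agent is
    stepped once per boid, and all transitions land in one buffer, as if
    every boid were the same agent."""
    actions = []
    for obs in observations:
        action = agent.act(obs)
        agent.record(obs, action, reward_fn(obs, action))
        actions.append(action)
    return actions
```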

Because of this, I ended up training each rule separately at first.
One of the hacks I used was having the boids move forward at a
constant rate. This made it quite easy to get visually appealing results
just by training, say, the *alignment* rule on its own.

Here is a video of the alignment rule after having been learnt by single-agent PPO by just cycling; the agents simply try to adjust their rotation to the average rotation in their perceptual area:

And here’s another one where I vary the speed of each of the agents randomly each step:

Cohesion and separation appeared to do *something* (there
were definitely bugs in my early separation implementation), but the
overall behavior when I combined the three looked nothing like a
classic *Boids* implementation.

You can also already see in the videos above that the agents have a bias in one direction. This would be a persistent issue across all my attempts.

So now I had to find a gym-compatible multi-agent reinforcement
learning library. As I began to hack this together, it occurred to me
that, except for visualizing the end-result, neither Godot nor the Gym
interface was necessary. Still, I think they did provide some
benefits; Godot allows one to quickly create environments and agents,
and Gym provides a uniform interface that is compatible with **a
lot** of RL libraries.

For multi-agent training, I found `epymarl`, an extension
of `pymarl`, a framework
for doing reinforcement learning on the game *StarCraft*.
`epymarl` extends `pymarl` to include support
for `gym` environments, and more algorithms, including
multi-agent versions of modern reinforcement learning algorithms,
namely MAPPO and MADDPG.

I quickly hacked together a MAPPO run using the same action and reward spaces as before, and running the data through the same Godot-Gym framework. This led to some results that look a little better, though still not quite close enough to Boids.

Unfortunately, epymarl does not support continuous action spaces, so I attempted to replace the Gumbel-softmax non-linearity in the MADDPG implementation, instead feeding the logits into a variant of tanh that squashes them into \([-1,1]\). But the more I learned about MADDPG, the less confident I felt about using it at all; PPO seems more appropriate for this task. Work on this project is paused for now until I can learn more about these algorithms in detail.
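The substitution I tried can be sketched like this (a plain-Python illustration of the idea, not the actual epymarl/MADDPG code, which operates on tensors):

```python
import math

def squash(logits):
    """Instead of a Gumbel-softmax over discrete actions, squash raw
    logits elementwise into the continuous box [-1, 1] with tanh."""
    return [math.tanh(x) for x in logits]
```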