Adaptive Traffic Light Control 3
Group members: Lars Peer André Bonczek, Christian Helbig, Antonia Köngeter, Marc-Fabio Pascal Niemella, Nico Rohland, Volodymyr Shcherbyna, Franz Martin Zuther
In this project, we optimized the traffic flow in a simulation of the Ernst-Reuter-Platz roundabout on the TU Berlin campus. We control the independent microcontroller in every traffic light with a Deep Reinforcement Learning (Deep RL) algorithm and show the results of our simulation in a Webots environment.
From the start, we set ambitious goals. As TU Berlin students, we all know the daily hustle at the Ernst-Reuter-Platz. Even though we haven't seen it in a while, our goals were clear. Instead of just switching some lights at an ordinary intersection, we wanted to untangle this traffic-jam-ridden mess. We wanted an independent RL agent for each individual intersection, so that our software can handle all kinds of intersections and, with that, all kinds of combinations of them.
The project required three separate components, which communicate continuously using our own network protocol:
- Simulation environment (Webots, provided by Cyberbotics Ltd.)
  - Simulates realistic traffic flow
  - Tracks the cars with several sensors
  - Shows the results
- Microcontroller that controls the actions of a specific traffic light
  - All controllers run the same software
  - Written in C++
- Deep Reinforcement Learning agent written in Python
  - Dynamically changes the traffic lights based on the current (and previous) traffic situation
To achieve our goals, we split our big group into three teams, so that each team could concentrate on a single component. We organized our work with a Kanban board and discussed our progress in weekly meetings. The project was divided into four stages:
In the first stage, we started with a project plan. Each team researched its assignment and began building a basic structure. For example, the backend team decided to concentrate on fine-tuning optimal training parameters instead of spending time reimplementing an already existing algorithm. We decided to use a Stable Baselines environment, which let us choose between several pre-implemented algorithms, and we picked one to adapt to our specific problem.
In the second step, we concentrated on the communication between our components, implementing a network protocol to exchange the required data. With this, we were able to control traffic lights using commands from the backend and display the resulting traffic in a Webots simulation.
In the third step, the three teams worked separately on their components. The Webots team tried to make our simulation environment look as realistic as possible, focusing on realistic traffic scenarios and visual assets such as buildings. The microcontroller team worked on optimizing the communication, as well as implementing a fallback mechanism in case we lose our connection to the backend. The backend team focused on the training of our algorithm in our toy problem. At the third milestone, we were able to fully integrate our reinforcement learning agent into our full Webots simulation.
After our minimum viable product (MVP) was achieved, we concentrated on fine-tuning and testing in the last phase.
Even though our system would only be used in a virtual environment, we still wanted to make it as realistic as possible. Therefore, we included several fail-safes.
Our software has three main components:
Here we built and trained our deep reinforcement learning agent. In the backend, we first researched different RL algorithms. To be able to train our algorithm effectively, we first built a toy problem that consisted of only one simple intersection.
Then we trained our algorithm on this problem. You can find out more about the algorithm we used in the "Algorithms" section. Lastly, we have a TCP connection to our microcontroller to send the traffic light commands and to receive the measured data from the Webots environment. This data is fed into our trained agent, which returns the traffic light commands.
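The backend side of this exchange can be sketched as follows. The message format and field names here are hypothetical (our actual protocol is not reproduced in this document); the `agent` parameter stands in for a trained model's prediction call, e.g. a Stable Baselines model's `predict` method:

```python
import json

def handle_message(line, agent):
    """Parse one sensor message from the microcontroller and build the reply.

    `line` is a hypothetical newline-delimited JSON message such as
    '{"sensors": [3, 0, 5, 1]}' (waiting cars per approach); `agent` is any
    callable mapping that observation to a traffic light phase index.
    """
    msg = json.loads(line)
    action = agent(msg["sensors"])          # e.g. model.predict(obs)[0]
    return json.dumps({"phase": int(action)})
```

In the real system this function would sit inside the TCP receive loop, with one reply sent back per incoming sensor message.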
This is the communication layer between the Webots and backend parts. To keep the robot's controller lightweight and fast, we needed a layer in between to handle all the logic necessary for our project to work. Furthermore, the microcontroller is responsible for making our system fail-safe: whenever the connection to the backend breaks, the microcontroller notices, takes over the given intersection, and switches the lights locally using a time-based approach instead. It also checks whether each command is valid (e.g., all traffic lights at an intersection should never be green at once); invalid messages from the backend are ignored.
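The two fail-safes described above — command validation and the time-based fallback — can be illustrated with a short sketch. The actual microcontroller code is written in C++; this Python version only mirrors the logic, and the four-light layout and ten-second period are illustrative assumptions:

```python
import time

GREEN, RED = "green", "red"

def is_valid(command, num_lights):
    """Reject malformed commands and any command where every light
    at the intersection would be green at once."""
    return len(command) == num_lights and command.count(GREEN) < num_lights

def fallback_phase(period=10.0, now=None):
    """Time-based fallback used when the backend connection is lost:
    alternate which pair of lights is green every `period` seconds."""
    now = time.time() if now is None else now
    if int(now // period) % 2 == 0:
        return [GREEN, GREEN, RED, RED]
    return [RED, RED, GREEN, GREEN]
```

A controller loop would call `is_valid` on every backend command and switch to `fallback_phase` as soon as the connection drops.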
In this part of the project, we focused on making a realistic simulation. We integrated SUMO into our architecture here, combining the traffic and world simulation. We implemented a supervisor that is responsible for starting traffic light controllers, reading their light states, writing simulated sensor data back, and stepping through the simulation. It is started by Webots in its own process.
In our implementation, this is the already mentioned traffic light controller in the Webots part. It receives commands and sends back sensor data over the network. It keeps an internal model of the traffic light's state, which the supervisor uses to adjust the simulation. The traffic light controller is started automatically by the supervisor, depending on the configuration file. Every traffic light controller runs in its own thread, so the code can benefit from multithreading.
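The supervisor's startup logic can be sketched as follows. The configuration keys and the controller body are placeholders (our actual configuration schema and controller loop are not reproduced here); the point is the one-thread-per-controller structure:

```python
import threading

def run_controller(light_id, states):
    """Stand-in for one controller's loop: in the real system this would
    receive commands and update the light's internal state model."""
    states[light_id] = "red"  # each controller keeps its own light state

def start_controllers(config):
    """Start one controller thread per traffic light named in the configuration."""
    states = {}
    threads = [
        threading.Thread(target=run_controller, args=(light_id, states))
        for light_id in config["traffic_lights"]
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return states
```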
At first, we were unsure how to design an environment in Webots efficiently. Luckily, we then discovered SUMO (https://cyberbotics.com/doc/automobile/sumo-interface), which made this task easier: with SUMO's help, generating a large number of vehicles to make our simulation more realistic was simple. Furthermore, we decided that we didn't want to optimize just any intersection. Our goal was the Ernst-Reuter-Platz, because every student at TU Berlin knows it is slow and impractical, and we wanted to see if our algorithm could make it better.
And if you take a look at our finished product, you can see we achieved it:
We decided to use Stable Baselines (https://stable-baselines.readthedocs.io/en/master/), which builds on OpenAI Baselines and contains ready-made implementations of reinforcement learning algorithms.
Our next step was choosing an algorithm, and after some contemplation, we settled on DQN (explanation of DQN: https://towardsdatascience.com/welcome-to-deep-reinforcement-learning-part-1-dqn-c3cab4d41b6b; paper that first introduced DQN: https://arxiv.org/abs/1312.5602).
To understand how DQN works, we first need to know what it stands for: Deep Q-Network. Q-learning observes the current state and reward (as well as the previous ones) and learns a value for each action in each state, so the agent can pick the action with the highest expected value. "Deep" means that this value function is approximated by a neural network instead of a table. If you want to find out more, check out https://wiki.pathmind.com/neural-network for a more in-depth explanation.
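In DQN the value table is replaced by a neural network, but the core update is ordinary Q-learning. A minimal tabular sketch of one update step (all numbers and the step size/discount values are illustrative, not our training parameters):

```python
def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step: move Q[s][a] toward the bootstrapped
    target r + gamma * max over a' of Q[s_next][a']."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]
```

DQN performs the same kind of update, but computes the target from a neural network's output and adjusts the network's weights by gradient descent instead of editing a table entry.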
Despite setting ambitious goals, we were able to fulfill each one. Even in these difficult times, our team was able to work together perfectly, and we reached all our targets long before the deadlines. We built an easily adaptable RL agent, which can handle almost every type of intersection. Furthermore, our code allows us to dynamically instantiate every agent and microcontroller using a single JSON configuration file. With this, we can simulate all kinds of street layouts in Webots, while the rest of the code doesn't need to be changed. Our end goal of optimizing the Ernst-Reuter-Platz traffic flow was achieved.
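The exact schema of our configuration file is not reproduced in this document; a hypothetical example of the kind of layout such a file could describe (all field names illustrative):

```json
{
  "intersections": [
    {
      "id": "erp-north",
      "traffic_lights": 4,
      "controller_port": 9001,
      "fallback_period_s": 10
    }
  ]
}
```

From a file like this, the supervisor can instantiate one agent and one microcontroller per intersection without any code changes.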
Our algorithm shows outstanding results.
We were able to improve the average reward* by up to 67% for an individual intersection, compared to a conventional time-based traffic light program. To achieve this, we trained our agent for 100,000 steps. As the graphic below shows, longer training doesn't necessarily yield better results: the agent performed best after 20,000 training steps.
*Reward of 0 means all cars waiting anywhere at the intersection were able to pass.
- Time-based algorithm, mean reward: -10.79
- Trained (20k steps), mean reward: -3.54
- Trained (70k steps), mean reward: -5.50
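The reward footnote above can be realized, for example, as the negative count of waiting cars; this is a sketch consistent with that definition, not necessarily our exact implementation:

```python
def reward(waiting_counts):
    """Negative total number of waiting cars at the intersection:
    exactly 0 when every car that was waiting has been able to pass."""
    return -sum(waiting_counts)
```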
We learned that planning and communication can be a greater challenge than the project goal itself. Every team member is unique, has a different skill set, and different working habits. The difficulty is further amplified in the current situation, since we are only able to communicate online. This makes it very important to plan everything, set realistic deadlines, and stay in continuous contact. Fortunately, we were up to the challenge and delivered a great product.