Shantanu Acharya and Rakhee

One of the major areas of research in Reinforcement Learning is making policies generalizable. In most tasks, a policy trained to perform well in one environment begins to struggle when deployed in even a slightly different environment.

We aim to tackle this issue by creating a multi-environment decision-making policy that performs well across different environment settings. We approach this goal by using a Deep Q-Network (DQN) and modifying its policy network.
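As a rough illustration of the base agent, the sketch below shows the kind of Q-network a DQN policy uses: a small fully connected network that maps a flattened state to one Q-value per discrete action. The layer sizes, the state dimension of 25 (5 vehicles x 5 features), and the 5 actions are assumptions made for illustration, not the exact architecture we modify for multi-environment training.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a flattened state vector to one Q-value per discrete action."""

    def __init__(self, state_dim: int = 25, n_actions: int = 5, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Greedy action selection from the predicted Q-values:
# q_net = QNetwork()
# action = q_net(state.flatten(start_dim=1)).argmax(dim=1)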

Environment

We use Eleurent's highway-env project, which offers a collection of environments for autonomous driving and tactical decision-making tasks. From this project, we use three environments named Highway, Merge, and Roundabout for our experiments. The goal of the agent (a.k.a. the ego vehicle) is to drive at high speed without colliding with neighboring vehicles.
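The three environments are registered by highway-env under the gym interface. The sketch below shows how they can be instantiated; it assumes the older gym API (reset returning only the observation, step returning a 4-tuple), which may differ in newer gym/gymnasium releases.

import gym
import highway_env  # registers highway-v0, merge-v0 and roundabout-v0 with gym

env_ids = ["highway-v0", "merge-v0", "roundabout-v0"]

for env_id in env_ids:
    env = gym.make(env_id)
    obs = env.reset()                    # initial V x F kinematics observation
    action = env.action_space.sample()   # random action, just to exercise the interface
    obs, reward, done, info = env.step(action)
    env.close()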

Highway

The ego vehicle drives on a multilane highway populated with other vehicles. A large negative reward is given for a collision, and a positive reward is given for changing lanes, driving at high speed, and driving in the right lane.

Figure 1: Highway
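A minimal sketch of how these reward terms can be adjusted through highway-env's configuration dictionary is shown below. The key names follow highway-env's config for the Highway environment, but the exact keys and default values depend on the installed version, so treat the numbers as placeholders rather than the weights we actually train with.

import gym
import highway_env  # registers the environments with gym

env = gym.make("highway-v0")

# Placeholder reward weights; on some gym versions env.configure(...) works
# directly, otherwise go through env.unwrapped.
env.unwrapped.configure({
    "collision_reward": -1.0,   # large penalty for crashing
    "high_speed_reward": 0.4,   # bonus for driving fast
    "right_lane_reward": 0.1,   # bonus for staying in the right lane
    "lane_change_reward": 0.0,  # weight on changing lanes
})
env.reset()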

Merge

Merge has a road structure similar to Highway. The ego vehicle starts on a multilane highway, and after a while some of the lanes merge, forming a road junction. The agent is required to slow down before approaching the junction to make room for the other vehicles so that everyone can merge safely into the traffic. A negative reward is given for collision, changing lanes, and speeding up near the junction, and a positive reward is given for driving at high speed and in the right lane.

Figure 2: Merge

Roundabout

In this environment, the ego vehicle has to pass through a roundabout as fast as possible while following a planned route. A positive reward is given for driving at high speed and in the right lane, and a negative reward is given for collision and changing lanes.

Figure 3: Roundabout

Observations

The observations, which are fed to the agent as states, contain kinematics information about the nearby vehicles. Each observation is a 2-D matrix of shape V x F, where V is the number of nearby vehicles and F is the number of features per vehicle. The feature values are measured with respect to the ego vehicle and contain the following information.
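A sketch of how such a Kinematics observation can be configured in highway-env is shown below. The feature names listed here are highway-env's defaults and are an assumption, not necessarily the exact feature set used in our experiments; the older gym API (reset returning only the observation) is also assumed.

import gym
import highway_env  # registers the environments with gym

env = gym.make("highway-v0")

env.unwrapped.configure({
    "observation": {
        "type": "Kinematics",
        "vehicles_count": 5,                             # V: ego vehicle plus nearby vehicles
        "features": ["presence", "x", "y", "vx", "vy"],  # F: per-vehicle features
        "absolute": False,                               # other vehicles described relative to the ego vehicle
    }
})
obs = env.reset()
print(obs.shape)  # (V, F) -> (5, 5)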

Actions