Training and validating end-to-end models based on self-driving car simulator data

Friedrich Krull
13 min read · Jun 14, 2021

Authors: Risto Künnapas, Friedrich Krull

Project for the University of Tartu Neural Networks (LTAT.02.001) course in the spring semester of 2020/2021.

CARLA simulated environment

Introduction

Project description and goals

The initial idea behind this project was to collect driving data (perception + steering angles) from a simulator and train end-to-end models to predict steering angles on a real-world (perception) dataset. Essentially, we wanted to see how models trained on simulator data would stack up against actual humans driving a car on real roads.

While working on the project, we stumbled on multiple problems installing and running the CARLA simulator on our computers, and because of time constraints we had to scale back the idea. We ended up leaving out the part where we tested the models on real-world data and decided to validate them only in the simulator.

In this project we collected data from the simulated world using a LIDAR and an RGB camera (perception) sensor, while the steering angles were retrieved from the control outputs of the car controller. As the simulation environment, we used CARLA, an open-source driving simulator. CARLA is one of the more advanced open-source simulators out there specifically dedicated to self-driving cars. Based on the Unreal game engine, it has relatively good resolution and physics modules, for example to simulate traffic conditions, weather patterns and driving dynamics. The downside of working with CARLA is that it is quite GPU-heavy; on an average computer it is very laggy and the frame rate is low.

Before we continue on to our work, there are two important concepts we also want to touch on: synthetic data and end-to-end models.

Synthetic a.k.a simulator data

Self-driving cars are vehicles that are capable of driving without the input of a human driver. To achieve that, most of these vehicles today use ML models that have been trained only on real-world data to determine the next course of action, such as turning, braking or accelerating.

Although in theory we could just collect data from the real world, it is quite evident that covering all of the different scenarios that could happen would require massive amounts of time and effort. This is a problem where synthetically generated datasets can clearly be used to complement real-world data. Since computer simulations can be run much faster than driving down a road, and multiple simulations can run simultaneously, gathering, testing and analyzing data becomes much faster. Simulated data does not, however, come without downsides: because it is often much “cleaner” and the physical processes are not modeled very accurately, the collected data may not give a precise enough representation of, for example, vehicle driving dynamics or weather conditions [1, 2].

The main driving force so far behind the generation of simulator data and physics engines has been the gaming industry, and it will probably continue to be so. Even today, despite the problems that simulators have, many companies are using them more and more in product creation and refinement. Boston Dynamics, for example, uses simulation extensively when developing its robot family (Figure 1).

Figure 1. Boston Dynamics Atlas [3].

It is quite clear that in the following decades one of the advancements made in AI/ML will be synthetic data, and this will most likely make training and deploying custom solutions even faster.

End-to-End models

End-to-end machine learning models are a class of deep learning models. The main difference between end-to-end models and other deep learning approaches is how the learning process is divided inside the model. In end-to-end learning, all of the parameters are learned jointly from the input data, whereas in other deep learning pipelines this process is usually divided into separate steps. The easiest way to understand this is that an end-to-end model decides by itself which features of the input data are most important for predicting the output, while in other pipelines the feature extraction is usually designed by a human and incorporated into the model. For example, to detect a person in an image (Figure 2):

  • End-to-end models take the input and give the output. There are no specific feature extraction methods in between.
  • Other deep learning models take the input and divide the problem into multiple steps during training, or use multiple models altogether. Some model parts or models may (1) zoom into the data and detect a human face, some (2) extract the facial features for comparison, and some (3) compare the facial features in the image to the features existing in a database.
Figure 2. Difference between end-to-end and traditional learning

End-to-end models thus have the advantage of not requiring any specific human-designed processing steps, but as disadvantages, they usually need lots of data to work and are black boxes whose behavior is difficult to predict.

End-to-End Driving using camera sensor data

Dataset collection

The CARLA environment was used to collect the camera output and the steering wheel output, which was between -1 and 1, where negative values were left rotations and positive values were right rotations. During this project images were collected in different simulated conditions; however, the initial goal was to get the simulated vehicle to drive in a sunny environment and only after that to train models that would be able to drive around regardless of the conditions.
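To make the collection step concrete, below is a minimal sketch of how such a camera + steering logging loop can be set up with the CARLA Python API. The vehicle blueprint, image resolution and file names are placeholders rather than the exact values used in the project, and a CARLA server is assumed to be running on localhost:2000.

```python
# Minimal sketch of the camera + steering collection loop (CARLA Python API).
import carla
import random

client = carla.Client('localhost', 2000)
client.set_timeout(10.0)
world = client.get_world()

blueprints = world.get_blueprint_library()
vehicle_bp = blueprints.filter('vehicle.tesla.model3')[0]   # placeholder vehicle
spawn_point = random.choice(world.get_map().get_spawn_points())
vehicle = world.spawn_actor(vehicle_bp, spawn_point)
vehicle.set_autopilot(True)   # the autopilot provides the "expert" steering labels

camera_bp = blueprints.find('sensor.camera.rgb')
camera_bp.set_attribute('image_size_x', '320')   # placeholder resolution
camera_bp.set_attribute('image_size_y', '160')
camera_transform = carla.Transform(carla.Location(x=1.5, z=2.4))
camera = world.spawn_actor(camera_bp, camera_transform, attach_to=vehicle)

steering_log = open('steering.txt', 'w')

def save_frame(image):
    # Save the camera frame and the steering angle of the same tick.
    image.save_to_disk('frames/%06d.png' % image.frame)
    steering_log.write('%d,%.4f\n' % (image.frame, vehicle.get_control().steer))

camera.listen(save_frame)
```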

Data Preprocessing (and augmentation)

After collecting the data, a model was trained using just the image data and steering output with almost no preprocessing. This resulted in a model that could only move forward: after looking at the steering labels (Figure 3), it became clear that the data was imbalanced, because a lot of the driving was done on straight roads and there were very few turns along the path. The data therefore needed to be processed so that the model could learn to identify left and right turns.

Figure 3. Distribution of rotations in the driving data

This was done by under-sampling the images that depicted moving straight ahead and over-sampling the rotations, so that the model could predict the rotations when needed. The only other preprocessing done to the camera sensor images was resizing, so that they would be lighter on the computational end, and normalizing the images so that the pixel values were between 0 and 1.
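A rough sketch of this balancing and preprocessing step is shown below; the straight-driving threshold, sampling ratios and target resolution are illustrative values, not the exact ones used in the project.

```python
# Rough sketch of under-sampling straight frames, over-sampling turns,
# resizing and normalizing the camera images.
import cv2
import numpy as np

STRAIGHT_THRESHOLD = 0.05     # |steer| below this counts as "driving straight"
KEEP_STRAIGHT_FRACTION = 0.2  # keep only a fraction of straight-driving frames
TURN_REPEATS = 3              # duplicate turning frames

def balance_and_preprocess(image_paths, steer_angles, size=(200, 66)):
    images, labels = [], []
    rng = np.random.default_rng(42)
    for path, steer in zip(image_paths, steer_angles):
        if abs(steer) < STRAIGHT_THRESHOLD:
            if rng.random() > KEEP_STRAIGHT_FRACTION:
                continue                      # drop most straight-driving frames
            repeats = 1
        else:
            repeats = TURN_REPEATS            # over-sample turning frames
        img = cv2.imread(path)
        img = cv2.resize(img, size)           # lighter on the computational end
        img = img.astype(np.float32) / 255.0  # normalize pixels to [0, 1]
        for _ in range(repeats):
            images.append(img)
            labels.append(steer)
    return np.array(images), np.array(labels)
```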

Model training and selection

For model training with the camera sensor images, the input of the model was a normalized image with color values between 0 and 1, and the label was the corresponding steering wheel rotation between -1 and 1, where negative values indicated a left turn and positive values a right turn. The trained models were saved and then used in the environment to predict how the steering wheel needed to rotate. The architecture was based on the model used by NVIDIA in their end-to-end learning system [4], with small changes (Figure 4).

Figure 4. The model architecture, based on the model used by NVIDIA [4].
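For reference, below is a Keras sketch of the underlying NVIDIA architecture [4]; it follows the original paper rather than our modified version, since the exact "small changes" are only shown in Figure 4.

```python
# Sketch of the NVIDIA-style end-to-end architecture [4] in Keras.
from tensorflow.keras import layers, models

def build_pilotnet(input_shape=(66, 200, 3)):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(24, (5, 5), strides=(2, 2), activation='relu'),
        layers.Conv2D(36, (5, 5), strides=(2, 2), activation='relu'),
        layers.Conv2D(48, (5, 5), strides=(2, 2), activation='relu'),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.Flatten(),
        layers.Dense(100, activation='relu'),
        layers.Dense(50, activation='relu'),
        layers.Dense(10, activation='relu'),
        layers.Dense(1)   # predicted steering angle in [-1, 1]
    ])
    model.compile(optimizer='adam', loss='mse', metrics=['mae'])
    return model
```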

End-to-End Driving using LIDAR and camera sensor data

Dataset collection

In total, we collected about 3000 lidar point cloud and RGB image frames + steering angles in synchronous mode. Using synchronous mode means that the simulator only advances when the client asks it to, which makes sure that the frames and angles match. Lidar information was saved in .ply format and RGB images in .png format. Steering angles were saved into a .txt file.
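The sketch below shows how synchronous mode can be enabled and used to keep frames and steering angles aligned; it assumes that `world`, `vehicle` and the camera and lidar sensors have already been created as in the earlier camera example, and the file names are placeholders.

```python
# Sketch of the synchronous-mode collection loop (CARLA Python API).
import carla

settings = world.get_settings()
settings.synchronous_mode = True      # the server only advances on world.tick()
settings.fixed_delta_seconds = 0.05   # 20 simulation steps per second
world.apply_settings(settings)

# Sensor callbacks save their data tagged with the frame id of the current tick, e.g.:
#   camera.listen(lambda img: img.save_to_disk('rgb/%06d.png' % img.frame))
#   lidar.listen(lambda scan: scan.save_to_disk('lidar/%06d.ply' % scan.frame))

with open('steering_angles.txt', 'w') as steering_file:
    for _ in range(3000):             # ~3000 synchronized frames
        frame_id = world.tick()       # advance the simulation by one step
        steering_file.write('%d,%.4f\n' % (frame_id, vehicle.get_control().steer))
```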

The most difficult part of the dataset collection was reading CARLA's documentation and examining their own results. The official documentation covers multiple simulator versions, but there are minor differences that do not seem to be fully documented. For example, in the case of semantic lidar data, the documentation said that some function parameters had to be set to True, when in practice they had to be set to False, etc. It is also unclear from the provided parameters and code snippets how to reproduce some of the results they have shown, like the quite dense semantic point cloud representations of the environment in Figure 5. What makes the data collection even more troublesome is that in order to visualize the lidar data, you either have to rely on the provided code snippets, which do not produce very good results, or write something yourself, which in the case of 3D data is not easy (top-down views, sensor fusion, etc.).

Figure 5. Quite dense semantic point cloud representation of the environment.

Data Preprocessing (and augmentation)

After collecting the data, I tried to look into utilizing the lidar data to its fullest. Initially, the plan was not to use the RGB camera information or do any sensor fusion at all, but to rely only on lidar data. The premise was that if the road features (specifically road lines) could be detected with the lidar, the RGB images would not have to be incorporated.

I looked into the following possibilities in the case of lidar data:

1. Top-down views

I generated top-down views from the LIDAR data by summing the points from the different height planes Zi onto a single plane (Figure 6). The result was a single-plane, i.e. 2D, image which showed the points collected from the different heights on one plane Z.

Figure 6. Top-down views from LIDAR data

In the top image, you can see that some areas are denser because of this and almost appear to have some depth to them. But you can also see that the resolution is not good enough to detect the road lines (they are not visible).
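Below is a minimal sketch of such a bird's-eye-view projection. It assumes each .ply frame is loaded into an N x 3 point array (here via Open3D, which is an assumption rather than the exact tooling used in the project), and the range and cell-size values are illustrative.

```python
# Minimal sketch of a top-down (bird's-eye-view) projection of a lidar frame.
import numpy as np
import open3d as o3d

def lidar_to_topdown(ply_path, x_range=(-40, 40), y_range=(-40, 40), cell=0.2):
    points = np.asarray(o3d.io.read_point_cloud(ply_path).points)  # (N, 3)
    x, y = points[:, 0], points[:, 1]
    # Count how many points fall into each ground-plane cell; the height (z)
    # could also be averaged per cell to keep some depth information.
    x_bins = np.arange(x_range[0], x_range[1], cell)
    y_bins = np.arange(y_range[0], y_range[1], cell)
    grid, _, _ = np.histogram2d(x, y, bins=[x_bins, y_bins])
    # Normalize to [0, 1] so the grid can be treated as a grayscale image.
    return grid / max(grid.max(), 1.0)
```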

2. Semantic lidar data

The idea behind the semantic lidar was to use it to generate front views or top-down images. It was the next logical step, because in CARLA's documentation demo videos it had really good resolution and capability in detecting the road lines. The problem, however, was that the semantic lidar did not work in all of the provided simulator worlds. Initially I tried to understand why some of the information was displayed and some was not, and then, when I accidentally used it in Town3 (a different world), it worked. This was not written down in the documentation or any other source.

Figure 7. Semantic 3D map [5].

3. Sensor fusion

In theory, CARLA should have an example of how to fuse camera and lidar data together. Utilizing it, though, proved to be more difficult than anticipated. For some reason, CARLA kept crashing whenever the script was used, so I had to write a lighter version of the fusion. Most of my work here went into adding semantic lidar data to the RGB image. I almost succeeded, but I had difficulties assigning the right color, based on the semantic tag, to the objects that the lidar had “seen”. In the end, I had to settle for an image where the distance points of the lidar were simply overlaid on the RGB image (Figure 8).

Figure 8. Distance points of the LIDAR laid over the RGB image.
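A rough sketch of this kind of overlay is shown below. It assumes the lidar points have already been transformed into the camera coordinate frame and that the camera intrinsics are derived from CARLA's resolution and field-of-view attributes; the distance-based coloring is an assumption for illustration.

```python
# Rough sketch of projecting lidar points onto an RGB image (pinhole model).
import numpy as np
import cv2

def build_intrinsics(width, height, fov_deg):
    # Focal length from the horizontal field of view, as CARLA defines it.
    focal = width / (2.0 * np.tan(np.radians(fov_deg) / 2.0))
    return np.array([[focal, 0.0, width / 2.0],
                     [0.0, focal, height / 2.0],
                     [0.0, 0.0, 1.0]])

def overlay_lidar(image, points_cam, K):
    # points_cam: (N, 3) lidar points in the camera frame (x right, y down, z forward).
    pts = points_cam[points_cam[:, 2] > 0.1]   # keep only points in front of the camera
    uvw = (K @ pts.T).T                        # project with the pinhole model
    u = (uvw[:, 0] / uvw[:, 2]).astype(int)
    v = (uvw[:, 1] / uvw[:, 2]).astype(int)
    h, w = image.shape[:2]
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    for x, y, depth in zip(u[valid], v[valid], pts[valid, 2]):
        intensity = int(max(0, 255 - 4 * depth))   # closer points appear brighter
        cv2.circle(image, (int(x), int(y)), 1, (0, intensity, 255 - intensity), -1)
    return image
```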

In the next step of data preprocessing, I dealt with balancing the steering wheel data. The histogram below shows the initial steering angles that I received from the simulator. They should be in the range [-1, 1], but because the car kept turning only right in the simulator, I mostly got angles between 0 and 0.55.

Figure 9. Distribution of steering angles

To mitigate this problem, I used OpenCV to flip the right-turn images horizontally and add the corresponding negated steering angles to the dataset (Figure 10).

Figure 10. Flipped images were used to counter imbalanced data
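A minimal sketch of this flip-based balancing is shown below; the threshold for deciding which frames to mirror is an assumption.

```python
# Sketch of the horizontal-flip balancing step: each right-turn frame gets a
# mirrored copy with the negated steering angle.
import cv2

def add_flipped_samples(images, angles, threshold=0.0):
    new_images, new_angles = list(images), list(angles)
    for img, angle in zip(images, angles):
        if angle > threshold:                    # only the over-represented right turns
            new_images.append(cv2.flip(img, 1))  # 1 = flip around the vertical axis
            new_angles.append(-angle)            # a mirrored right turn is a left turn
    return new_images, new_angles
```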

Model training and selection

As already mentioned, end-to-end learning needs lots of data (and training) in order for the model to generalize well enough to solve the problem at hand. To mitigate this issue, I looked for a model that was already mostly pre-trained, so that I could train only the last layers while keeping the base layers and their weights frozen. I chose the VGG-16 CNN, which has proven to be quite good at detecting geometric features. Because the model also had to detect the road lines (which requires geometric understanding), I selected it as the basis.

Figure 11. The model used with LIDAR and RGB

The model (Figure 11) used the part marked with “Frozen weights” as its base. After that, I flattened the output and pushed it through three consecutive dense layers with 3000, 1000, and 500 neurons. For the dense layers, I used ReLU as the activation function and L2 regularization. The final two layers were a dropout layer (to prevent overfitting) and one more dense layer for the output.
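Below is a Keras sketch of this transfer-learning setup. The input resolution, dropout rate and L2 strength are assumptions, as they are not given above; the frozen VGG-16 base and the 3000/1000/500 dense head follow the description.

```python
# Sketch of the VGG-16-based regression model in Figure 11.
from tensorflow.keras import layers, models, regularizers
from tensorflow.keras.applications import VGG16

def build_vgg16_regressor(input_shape=(160, 320, 3), l2=1e-3, dropout=0.5):
    base = VGG16(weights='imagenet', include_top=False, input_shape=input_shape)
    base.trainable = False                      # "Frozen weights" part of Figure 11
    model = models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(3000, activation='relu', kernel_regularizer=regularizers.l2(l2)),
        layers.Dense(1000, activation='relu', kernel_regularizer=regularizers.l2(l2)),
        layers.Dense(500, activation='relu', kernel_regularizer=regularizers.l2(l2)),
        layers.Dropout(dropout),                # reduce overfitting
        layers.Dense(1)                         # steering angle output
    ])
    model.compile(optimizer='adam', loss='mse', metrics=['mae'])
    return model
```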

Results

Camera sensor

The models created using camera data were not usable enough for the vehicle to drive far without crashing into a wall. The initial models had a mean absolute error of 0.03, but the problem was that the training data was imbalanced, so the trained model was only capable of driving forward, which was a huge flaw. The error was low because most of the data corresponded to simply driving forward, and mean absolute error is not great at showing outlying values. After processing the data so that the numbers of rotations and non-rotations were balanced, the mean absolute error got worse: the model started to predict steering wheel rotations even when they were unnecessary, so its mean absolute error went up to 0.1, which led to it taking incorrect driving lanes (Figure 12).

Figure 12. Mean absolute error in different epochs

From the video below it is possible to see that the model started to turn left almost as soon as it started moving and quickly drove into a wall.

Comparing that to the path driven by the autopilot, it is clear that the model is not usable for driving down this path.

Camera + LIDAR sensor

The goal was to train the model for 30 epochs and then validate the output on held-out data. I split the dataset 80/20 into training and validation sets, which meant 2400 images for training and 600 images for validation. I managed to get the model to train on a smaller dataset, but for some reason Google Colab gave a mount point error when training on the full data.

Although I did not manage to finish training and obtain the results, the model architecture and code are available on GitHub for review.

Conclusion

Overall, the created models were not capable of driving in the simulated environment without constantly crashing. Taking into consideration the results of our data analysis, we assume that the main problem was the amount of data needed to train end-to-end models. In total we used about 3000 images and did not use augmentation extensively to create more data; instead, we balanced it. Although some published approaches have used much less data than we did (~500 images), such results should probably be taken with a grain of salt.

The main problem that prevented us from collecting and using more data unfortunately came down to our hardware performance. CARLA requires at least 4 GB of dedicated GPU memory to run efficiently; otherwise it crashes constantly, which made collecting the necessary data in a timely manner impossible.

Future work should regard the following:

  • Experimentation with different dataset sizes (intervals) — there is a point beyond which training the end-to-end model gets disproportionately more difficult while the returns are quite small. It would be interesting to find this point of diminishing returns.
  • More extensive sensor fusion — in practice, adding more information means more possibilities to extract features.

  • End-to-end model explainability — as with end-to-end models, but also many other deep learning models, the ability of a model to explain its decisions is quite important in situations where people (or even artificial systems) require further insight.

References

[1] Ruiz, Nataniel “Learning to Simulate”, Towards Data Science, https://towardsdatascience.com/learning-to-simulate-c53d8b393a56 (2019).

[2] Dilmegani, Cem “Synthetic data: Pros, Cons, Case studies & AI Applications”, AI Multiple, https://research.aimultiple.com/synthetic-data/ (2021).

[3] Coumans, Erwin, https://github.com/erwincoumans/pybullet_robots.

[4] Bojarski, M., Firner, B., Flepp, B., Jackel, L., Muller, U., Zieba, K., Testa, D. D., “End-to-End Deep Learning for Self-Driving Cars”, NVIDIA Developer, https://developer.nvidia.com/blog/deep-learning-self-driving-cars/, (2016).

[5] Jeong, J., Yoon, T. S., Park, J. B., “Towards a Meaningful 3D Map Using a 3D Lidar and a Camera”, Sensors, 18, 2571, (2018).

Additional links

  1. GitHub repository: https://github.com/ristokynnapas/neural_networks_project
  2. CARLA environment: https://carla.readthedocs.io/en/0.9.11
