Overview
In this project I made 17 data visualisations for Formula 1 race tracks using Tableau & SQL. I wanted to do a data project on a sport that I'm a fan of and Formula 1 has been something that I grew up from a young age watching on Sunday afternoons with my Dad.
I wanted to create a visually striking set of visualisations, similar to what you see on Sky Sports before lights out. I made a dashboard for some of my favourite circuits my pulling data for three areas of interest: The fastest lap achieved on each track and by which driver, the total points earned by constructor team (and driver) and my favourite part, a heatmap showing the average points earned by starting position.
I downloaded this dataset from Kaggle, the dataset can be found here. https://www.kaggle.com/datasets/rohanrao/formula-1-world-championship-1950-2020?select=results.csv
I decided to only analyse data from 2010 to 2023, the reason being, in 2010 the FIA changed the rules regarding points won for races, pre 2010 a first place would earn you 10 points and post 2010 this was changed to 25 points, this was so that more teams could score points. However, a lot of the data I present here is the sum of data points and I didn't want the post 2010 data sets to skew the results.
Creating the Treemap
A treemap is a visualisation that shows the composition of a sum of data points split by categories. In the above example, the size of each tile represents the total number of points scored by each driver between 2010 and 2023.
To work with any of the visualisations I decided to use a MySQL database that I created for this project, I then imported the csv's to create tables I could work with. I ended up needing four tables from the dataset: circuits, constructors, drivers & lap_times.
First, I needed to do some data aggregation to calculate the total sum of points obtained by each driver racing with a particular constructor. For anyone not familiar with F1, the constructors are the teams who make the cars and manage race strategy (e.g. Ferrari, Mercedes, Red Bull). I achieved this by using GROUP BY on both the driver's code and the constructor name (since some drivers have raced for more than one constructor). Since the data is normalised across tables, I had to do a few joins between the results table foreign keys and the detail tables primary keys (specifically on raceId, driverId, constructorId and circuitId). This allowed me to display the results with the actual names of the different entities, instead of using hard to read IDs.
I also wanted to filter out any results earlier than 2010, since the point system changed.
Finally, to save myself some time in Tableau, I also grouped by the circuit name. That way I could export the data once and then just filter by circuit when making each individual dashboard. Our query looks like this:
SQL query to return the sum of points earned by each driver, grouped by drivers, constructors and circuit |
Which returned the following data.
Returned data from our query, ready to be exported then connected to Tableau |
Then I exported the data as a csv, converted it to xls and then connected it as a data source to Tableau. This is because I'm using Tableau public and I can't easily connect to a SQL server without upgrading to Tableau desktop.
To create the treemap I would begin with a blank worksheet and then drag the results.points column into the size field. This gives us a single small tile, which isn't very useful just yet.
The first stage of creating a treemap |
Next I add the Name column (which is the constructor team name) to the colour field. Now it's starting to look more like a treemap. I also went and formatted each colour to match the constructors traditional colour (teal for Mercedes-Petronas, Red for Ferrari etc).
Adding a field to the colour tile seperates out the inital tile into a more familiar treemap |
Next I drag over the Code column (which is the drivers name, e.g HAM) and the sum of points column over to the label field to give some info about the data. This gives the following image.
A finished treemap but for all circuit data |
You may notice there are a high number of points in these tiles (>3000 for Hamilton for example). That's because we're summing the data from all of the race tracks and we want to do this by circuit. In my SQL query to generate this table I deliberately didn't filter by the specific race tracks using WHERE. That's because I want to recreate this table 16 times and that would be very tedious to do. So instead I will use the filter in Tableau and pass in the column called Name 1 (which is the circuit name). Let's do this for the Silverstone track.
By not filtering the SQL query I can easily replicate the visualisations for different tracks |
Almost perfect. But I want to make the background of this black so it looks better on the finished dashboard. I can do that by going to format > shading and selecting black. I'll also want the borders between the tiles to be black as well, I can do this by clicking on the colour field in the marks tab and changing border colour from automatic to black. Our finished treemap looks like this.
The finished treemap for Silverstone Circuit (2010 - 2023) |
And now, to create this for another circuit all I have to do is duplicate the sheet and change the filter to whatever race track I want to get the data for, easy.
Creating the Heatmap
The next part of my visualisation is to create a heatmap. The purpose of a heatmap is to highlight to the reader where either categorical or positional data can influence an outcome. I wanted to know what is the best position to start the race in in order to get the most points (with a win earning the maximum 25 points, 2nd being 18 points, then 15, 12, 10 and so on). Now it might sound obvious that first place is the best place to start in, however some tracks have a sharp left turn for example so if you're starting in 3rd place you may have a better advantage than the driver in 2nd place if you get off the grid at the right pace.
To make this visualisation I knew I would have to create some 'bins' for left and right side of the grid, as the grid is made up of 20 positions with 2 columns for left and right I decided to use CASE WHEN in SQL to create a customised output. The query is like so.
A SQL query using CASE WHEN to build our grid starting position |
What we're doing here is taking the grid data which gives us a starting position from 1 to 20, using CASE WHEN we're saying when the grid number is even (i.e. grid % 2 = 0, meaning there are no remainders) THEN assign the value to bin "Right", ELSE assign to bin "Left". This gives us a new column called grid_column.
We then create grid_row by simply dividing the number on the grid by 2 and rounding to get the integer row.
The output is as follows:
Returned results for the heatmap query |
Now when I create an xls file of this table and connect it to Tableau I can do the following.
I pass in grid columns to column and grid row to row and set the marks to square, now it looks like our starting grid.
Building the heatmap visualisation |
Next I add some text to the visualisation by dragging in the Grid position to the labels field. Then I add the average points to the colour field.
Adding labels to the heatmap |
The 'grid' is upside down relative to what we know as an F1 starting grid, so I adjust this by clicking on the y axis and selecting 'reversed' which reverses the order. I also change the colour type to red-green diverging to emulate the race track starting lights.
Reversing the order of the y axis creates the familiar F1 starting grid |
Lastly, I want to add what the actual average points per grid position are, so I do this by again dragging the average points column into the labels field. Finally to keep things consistent with the rest of the dashboard I make the background black and the text white, I remove the grid lines and select a filter for a race track, this time we'll use the Spa circuit in Belgium. We're left with the following visualisation.
A finished heatmap visualisation for Spa Francorchamps (2010 - 2023) |
I recreate a new heatmap for each circuit by simply duplicating the worksheet and then changing the filter in the same way as was done with the treemap.
Generating the Fastest Lap statistic
The last piece of info I want for each track is who achieved the fastest lap and what that lap time was. This was a pretty tricky query to create and it is as follows:
A SQL query to return the fastest lap achieved and driver for each circuit |
This query would have been more straightforward if I just had wanted to retrieve the fastest lap time per circuit. This could have been achieved with a GROUP BY circuit name, with an aggregate function of the minimum lap time. However, I also wanted to return which driver had achieved this fastest lap time, or in some cases which drivers, since there could be ties on time.
To achieve this I made a common table expression (CTE) to join to other tables (e.g. to get the circuit and drivers name), but also to rank the lap times, from smallest to largest. This was done with the RANK() function over a partition by circuit name, and ordered by the lap time.
With that subquery, I could then filter for lap_rank equal to 1 (the fastest), grouping by circuit name and aggregating the remaining columns: the minimum time and by concatenating the list of drivers who tied in achieving that time.
This returned the following data, with one row per circuit, a list of drivers who tied on the fastest lap time, and of course the lap time itself.
The returned fastest lap data |
Then in Tableau I simply drag the driver's name (code) and lap_time into the text field and I'm given the fastest time for the circuit I have applied the filter for, in this case Monaco. As before duplicating the sheet and changing the filter for the track gets me the fastest lap of each track.
Putting it all together |
Creating a dashboard in Tableau now allows me to pull all of these graphs into a single place, I add images of each circuit that I've found and place each graph in the same location (using the position and size features in the dashboard layout tab). The finished result looks like this.
A finished dashboard for Portimão Circuit - Portugal |
To recreate the other circuits all I needed to do was duplicate the sheet (to keep formatting consistent) and then replace each treemap, heatmap, fastest lap and image with the corresponding data sources and the dashboard is complete.
Changing data sources
When making these visualisations I made a serious mistake that I didn't realise until I'd finished all 17 visualisations! I had accidentally downloaded the wrong dataset. When looking through Kaggle I had bookmarked the dataset I was going to use however I ended up downloaded the original data source, but the data was 5 years old; unfortunately I did all my visualisation work on this dataset. However all was not lost as the data was formatted in exactly the same way. So I simply downloaded the updated data (up to 2023) and ran my queries again.
Inside Tableau you can select to change the data source. This automatically updates all your data to the new source (providing the data is formatted in the same way). This saved me a lot of time and just added 5 years worth of data to my visualisations, a useful lesson.
Replacing data sources in Tableau |
Conclusion
In this project I created several dashboards for Formula 1 circuits, I used SQL to correctly format the data prior to creating the visualisations that I wanted.
This project has covered:
- Working with multiple tables of data and creating JOINs
- Creating treemaps from SQL queries
- Using CASE WHEN to create custom tables for positional information
- Creating heatmaps
- Using common table expressions (CTEs) and RANK to create the fastest lap statistic
Thanks for reading, please feel free to check out my other portfolio work or and connect with me on LinkedIn.