Challenges and issues with the rating process
=============================================
Two main obstacles had to be overcome to achieve satisfactory quality of the analysis and of the project as a whole.
One of them was purely technical, the other theoretical/methodological. This document presents them very briefly.
Gathering and unifying data
---------------------------
The project obviously involved dealing with lots of race and qualifying results (well, basically, with *all* of them).
The usual process of compiling results into the project database consisted of several steps:
1. scraping data from various public web sources
2. converting the data to a consistent format (i.e. CSV)
3. semi-manual compilation of the CSV data into an importable format
4. importing the CSV data into the project database
5. normalization of the imported data
### Scraping the data
At first, I tried to use Wikipedia as an initial source for race results. Since any source with uniform tabular data would do just fine, it seemed like a good idea at the time. Unfortunately, Wikipedia is far from uniform. Across 1000+ Wikipedia pages for championship and non-championship races, not only does the format of the results table vary, but so do the number of tables (a separate qualifying table vs. a "grid" column) and the placement and headings of those tables.
Plus, Wikipedia is far from complete when it comes to non-championship race results. Since that meant I would need a separate data source anyway, fighting with Wikipedia scrape automation didn't seem worth it (despite the presence of a proper API for fetching revisions of wiki pages).
The next source, this time with consistently formatted tables, was the [A Second A Lap blog](http://second-a-lap.blogspot.com/), which showcases one race each day. Main advantage of the source: a lightweight native Blogspot API for fetching article contents. Main (and critical) flaw: the series has only reached the mid-60s as of now, so the data would not be complete for a long time. Also, the tables lacked entry numbers, a nice bit of flavour in the data.
Fortunately, that's where the [Ergast Developer API](http://ergast.com/mrd/db) came in handy. The SQL database image it provides gave an achievable baseline for testing the algorithm (on championship race results and the qualifying data included in the dump). As a side-effect, after running some checks on the Ergast-imported data, some inconsistencies were detected and reported to the Ergast maintainers.
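To give an idea of what such a check might look like, here is a minimal sketch of a duplicate-position check over the imported dump. The table and column names (`results`, `raceId`, `positionOrder`) follow the Ergast schema as I understand it, and the SQLite connection merely stands in for whatever engine the dump is actually loaded into - treat the whole snippet as an illustration rather than project code.

```python
# Minimal sanity check: flag races where two result rows share the same
# finishing order, which usually points at a broken or duplicated row.
# Table/column names are assumed from the Ergast dump; adjust if needed.
import sqlite3  # stand-in for whatever engine hosts the dump

conn = sqlite3.connect("f1db.sqlite")  # hypothetical local copy of the dump

query = """
    SELECT raceId, positionOrder, COUNT(*) AS n
    FROM results
    GROUP BY raceId, positionOrder
    HAVING COUNT(*) > 1
"""
for race_id, position, count in conn.execute(query):
    print(f"race {race_id}: position {position} appears {count} times")
```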
The initial source for non-championship races was then chosen: [The Formula One Archives](http://www.silhouet.com/motorsport/archive/f1/title.html). That meant dealing with plain-text data and lots of manual aligning of the data into columns. Surprisingly, manipulating CSV-formatted data in any office suite works very well, so this wasn't as hard as it seemed.
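Getting those plain-text listings into columns was mostly hand work, but the final cut into CSV can be scripted. Below is a rough sketch of such a fixed-width split; the column boundaries are invented for illustration, and every archive page really needed its own alignment pass first.

```python
# Rough sketch: cut whitespace-aligned plain-text result lines into CSV.
# The character ranges below are hypothetical - each archive page had to
# be aligned by hand before anything like this could work.
import csv
import sys

# Hypothetical column boundaries: position, car number, driver, result.
COLUMNS = [(0, 4), (4, 8), (8, 40), (40, 60)]

def split_fixed_width(line):
    """Cut one aligned text line into stripped column values."""
    return [line[start:end].strip() for start, end in COLUMNS]

with open(sys.argv[1]) as src, open(sys.argv[2], "w", newline="") as dst:
    writer = csv.writer(dst)
    for line in src:
        if line.strip():                 # skip blank separator lines
            writer.writerow(split_fixed_width(line))
```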
Still, there was a need for qualifying/prequalifying data not included in the Ergast DB. So the ultimate source (without which compiling the dataset in a relatively short time wouldn't have been possible) was the magnificent [Chicane F1 website](http://chicanef1.com/). Despite some minor errors in the data (most likely propagated print errors from the original sources - small things like wrong digits in lap times, which then led to incorrect qualifying speeds, by which I had to sort my data to discard grid penalties), the website's data turned out bullet-proof: comprehensive, consistent, and covering at least everything covered by the other sources. With it, the dataset was completed.
### Scraped data conversion
The conversion process for the sources I managed to automate happened on the fly. The helper scripts for both scraping and HTML->CSV conversion are available in the dumps/ directory of the project tree.
The directory includes:
- scrapers for chicanef1 and second-a-lap single result pages
- scrapers which extract all possible qualifying and non-championship result page URLs from chicanef1
- a utility which fetches the URLs, converts the tables to CSV files, and compiles multiple CSV files into a single file, producing a list of the compiled files as a side-effect (see the sketch below)
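Boiled down, that conversion step looks roughly like the following. This is not the actual utility from dumps/: the URL is a placeholder, `requests` and BeautifulSoup are just one convenient toolset for the job, and the results are assumed to sit in the first `<table>` on the page.

```python
# Sketch of the HTML-table -> CSV step (not the actual dumps/ utility).
import csv

import requests
from bs4 import BeautifulSoup

def table_to_csv(url, out_path):
    """Fetch one result page and dump its first table as CSV rows."""
    html = requests.get(url, timeout=30).text
    table = BeautifulSoup(html, "html.parser").find("table")
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for row in table.find_all("tr"):
            cells = row.find_all(["th", "td"])
            writer.writerow(cell.get_text(strip=True) for cell in cells)

# Placeholder URL - point it at a real result page before running.
table_to_csv("http://example.com/some-result-page", "result.csv")
```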
**I urge you to proceed with the highest level of attention when trying to scrape these websites on your own using these tools.**
They involve lots of excess requests (since it was easier to fetch everything that moves and filter out the weeds afterwards) and may generate substantial traffic to the source websites if used without common sense. And the last thing I'd want is to cause trouble for the amazing people who put these data up for public access in the first place.
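If you do re-run the scrapers, one simple way to stay friendly to the source sites is to cache every fetched page locally and pause between live requests. The snippet below is only a sketch of that idea (the delay and cache directory are arbitrary) and is not part of the tools in dumps/.

```python
# Polite fetching: cache every page on disk, pause before each live request.
import hashlib
import os
import time

import requests

CACHE_DIR = "cache"    # hypothetical local cache directory
DELAY_SECONDS = 5      # arbitrary pause between live requests

def polite_get(url):
    """Return page contents, hitting the network at most once per URL."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, hashlib.sha1(url.encode()).hexdigest())
    if os.path.exists(path):
        with open(path) as cached:
            return cached.read()
    time.sleep(DELAY_SECONDS)
    text = requests.get(url, timeout=30).text
    with open(path, "w") as cached:
        cached.write(text)
    return text
```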
### Compiling the data
The most time-consuming and boring part of the whole process: extending the CSV data to accommodate the algorithm's ranking criteria. Manual spreadsheet labor with lots of filtering, sanity checks, conditional formulas, more filtering, more formulas, etc.
Nothing useful in terms of source code or data (beyond what ended up in the resulting database) came out of this stage, apart from a sense of satisfaction and a drastically increased appreciation of spreadsheet software.
### Importing the data
That's where the result database structure kicks in. For clarity, I'll show you the schema first:
    +---------+              +------------+
    | races   |              | race_types |
    +---------+ 0..*       1 +------------+
    | + date  |--------------| + code     |
    | + race  |    _type     +------------+
    +---------+
          1 | _race
            |
       0..* |
    +-----------------+                      +-----------+                +--------------+
    | entries         |                      | drivers   |                | rankings     |
    +-----------------+                      +-----------+ 1         0..* +--------------+
    | + result        | 0..*            0..* | + driver  |----------------| + ranking    |
    | + car_no        |----------------------| + country |    _driver     | + rank_date  |
    | + result_group  |   _entry   _driver   +-----------+                +--------------+
    +-----------------+   (driver_entries)
Since the `race_types` table is a pre-filled dictionary and the values in `rankings` are only ever calculated by the main rating application, the dataset import operates on `races`, `entries` and `drivers`.
The `races` table is pretty much straightforward, so a CSV-formatted file can easily be imported into it (e.g. with the help of any proper web-based RDBMS administration tool).
The aim of the main import procedure was to populate the `drivers`, `driver_entries` and `entries` tables, with support for shared drives where possible.
That imposed some constraints on the amount of information in, and the format of, the imported CSV file. The very rudimentary import script (import-csv.py) assumes a CSV file with either of the following line formats:
- 6 columns: race ID, text description of entry result, car number, driver country, driver name, result group for Elo algorithm outcome
- 2 columns: driver country, driver name
A row in the first format creates a new entry for a race; a row in the second format appends another driver to the shared drive of the last processed entry. On top of that, the `drivers` table is filled with every driver name not yet present in the database.
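A condensed sketch of that logic is shown below. It is not the actual import-csv.py: plain dicts and lists stand in for the real database tables, and only the two row formats come from the description above.

```python
# Condensed sketch of the import logic (not the actual import-csv.py).
# In-memory dicts/lists stand in for the drivers/entries/driver_entries tables.
import csv
import sys

drivers = {}   # (name, country) -> driver record
entries = []   # imported entries, each carrying its list of drivers

def get_or_create_driver(name, country):
    """Reuse a driver already seen in this run, create a record otherwise."""
    return drivers.setdefault((name, country), {"driver": name, "country": country})

def import_results(path):
    last_entry = None
    with open(path, newline="") as src:
        for row in csv.reader(src):
            if len(row) == 6:
                # race ID, result text, car number, country, driver, result group
                race_id, result, car_no, country, name, group = row
                last_entry = {"race": race_id, "result": result, "car_no": car_no,
                              "result_group": group,
                              "drivers": [get_or_create_driver(name, country)]}
                entries.append(last_entry)
            elif len(row) == 2:
                # country, driver -> a shared drive on the last processed entry
                country, name = row
                last_entry["drivers"].append(get_or_create_driver(name, country))

if __name__ == "__main__":
    import_results(sys.argv[1])
    print(f"{len(entries)} entries, {len(drivers)} distinct drivers")
```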
### Data normalization
Usually, after running the import script, the main concern was the normalization of driver names present in the database. Since lookup and identification of drivers during the import took only their names into account, there were lots of duplicates: drivers racing under nicknames, drivers who had figured under multiple names, or drivers with various spelling variants of their names.
After manual normalization of such values, the dataset was ready for processing (usually by resetting the ranking DB to the date of the first newly imported session and running the rating onwards).
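One thing that can make that manual pass quicker is flagging names that look suspiciously alike before digging through the database. The helper below is merely one possible way to do that, not part of the project; the actual clean-up was done by hand.

```python
# Flag pairs of driver names that look alike, as candidates for manual review.
from difflib import SequenceMatcher
from itertools import combinations

def similar_pairs(names, threshold=0.85):
    """Yield name pairs whose similarity ratio exceeds the threshold."""
    for a, b in combinations(sorted(set(names)), 2):
        if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold:
            yield a, b

for a, b in similar_pairs(["Juan Manuel Fangio", "Juan-Manuel Fangio", "Jack Brabham"]):
    print(f"possible duplicate: {a!r} / {b!r}")
```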