From 8f87ad195fb0ec981c9cbf78356f01ce96712d00 Mon Sep 17 00:00:00 2001 From: emkael Date: Wed, 12 Nov 2014 22:18:29 +0100 Subject: * more documentation, markdown --- doc/challenges.txt | 53 ++++++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 48 insertions(+), 5 deletions(-) (limited to 'doc') diff --git a/doc/challenges.txt b/doc/challenges.txt index 1e838a6..4221bcd 100644 --- a/doc/challenges.txt +++ b/doc/challenges.txt @@ -11,6 +11,7 @@ Gathering and unifying data The project obviously involved dealing with lots of race and qualifying results (well, basically, with *all* of it). The usual process of compiling results into project database consisted of several steps: + 1. scraping data from various public web sources 2. converting data to consistent format (i.e. CSV) 3. semi-manual compilation of data in the CSV to importable format @@ -36,9 +37,10 @@ Yet still, there was a need for qualifying/prequalifying data not included in Er The conversion process for the sources which I managed to automate happened on the fly. The helper script for both scraping and HTML->CSV conversion is available in the dumps/ directory of project tree. The directory includes: - - scrapers for chicanef1 and second-a-lap single result pages - - scrapers which extract all possible qualifying and non-championship result page URLs from chicane - - utility which fetches URLs and converts the tables to CSV files and compiles multiple CSV files to single file, producing the list of compiled files as a side-effect + + * scrapers for chicanef1 and second-a-lap single result pages + * scrapers which extract all possible qualifying and non-championship result page URLs from chicane + * utility which fetches URLs and converts the tables to CSV files and compiles multiple CSV files to single file, producing the list of compiled files as a side-effect **I urge you to proceed with highest level of attention when trying to scrape the websites on your own using these tools.** They involve lots of excess requests (since it was easier to fetch everything that moves and filter out the weeds afterwards) and may generate substantial bandwidth on the source websites if used without common sense. And the last thing I'd want is to cause trouble to the amazing people who put these data for public access in the first place. @@ -77,8 +79,10 @@ The `races` table is pretty much straight-forward, so CSV-formatted file can eas The aim of main import procedure was to populate the `drivers`, `driver_entries` and `entries` tables, with - if possible - shared drives support. That pushed some constraints on the amount of information and format of the imported CSV file. Very rudimentary import script (import-csv.py) assumes CSV file with either of the following line formats: - - 6 columns: race ID, text description of entry result, car number, driver country, driver name, result group for Elo algorithm outcome - - 2 columns: driver country, driver name + + * 6 columns: race ID, text description of entry result, car number, driver country, driver name, result group for Elo algorithm outcome + * 2 columns: driver country, driver name + Detecting first row format created a new entry for race, the second one - appended another driver to a shared drive for the last processed entry. On top of that, the `drivers` table was being filled with every driver name not yet present in the database. ### Data normalization @@ -86,3 +90,42 @@ Detecting first row format created a new entry for race, the second one - append Usually after running import script, the main concern was the normalization of driver names present in the database. Since lookup and identification of drivers during the import took only their into account, there were lots of duplicates - drivers racing under nicknames, drivers with multiple names they'd figured under or drivers with various spelling variant of their names. After manual normalization of such values, the dataset was ready for processing (usually - resetting the ranking DB to the date of first newly imported session and running the rating onwards). + +Preventing ranking inflation/deflation +-------------------------------------- + +Applying Elo rating algorithm to a selected sample of players and not to a wider spectrum of skill among competitors has several flaws. + +It's highly unlikely that the distribution of skill among drivers who already reached the top class of motorsports is compatible with normal distribution or logistic distribution, which makes it rather inconvenient for purposes of Elo rating. + +The evolution of ranking values over time is sensitive to various effects: + + * drivers entering the sport aren't "rookies" with negligible skill (in the classic algorithm: until some number of ranked games) + * drivers with lower skill tend to leave the sport sooner than drivers with higher skill + * the conditions which influence outcomes of duels between drivers change rapidly due to technical development and/or team changes - so the likelyhood of yet lower rated driver defeating higher rated driver is sometimes unpredictable (and long streaks of lower ranked drivers defeating higher ranked drivers are not uncommon, which leads to constant "leap-frogging" in the rankings and rapid rating changes) + +In classical Elo model (i.e. players entering the rankings with fairly low ranking, with constant disparity factor of the game) this leads to ranking inflation over time. This can be managed to some extent by a few simple adjustement of Elo parameters: + + * the entry ranking for new players (which can be changed in application configuration and has been set at a fairly high level of 2000 points) + * the threshold ranking values for decreasing rating changes coming from duels (by default the thresholds are set so that newcomers already fall into the middle interval and very quickly fall into the highest interval, while it's still possible for them to fall into to lower interval of higher rating swings) + * the disparity value of the field (lower value tends to stop already dominating drivers from gaining even more advantage while higher value is better when the field is less predictive, e.g. at the beginning of the season) + +The aforementioned adjustements made thigns slightly better, but the model was still unstable, as it was very easy to overcorrect - which, in turn, led to rating deflation and underappreciation of drivers as time went by. + +From the beginning I hoped to avoid switching to some more complex algorithms, like Glicko, and to stay as closely to core Elo rating as possible. Yet some changes were still necessary. + +As I've already mentioned in the write-up of pairing and ranking method (see: doc/results.txt), the field disparity can be adjusted dynamically. The more the ratings change over recent time, the lower the disparity (so that initial shift of power among the grid pushes the drivers closer the top but prevents their dominanve to escalate until the ratings are stabilized again). + +The cut-off had to be more subtle than linear decrease of disparity, so the following formula was used: + +`disparity` = `base_disparity` * 0.5 * (2.5 + `base_rating_change` / (`rating_change` - 2 * `base_rating_change`))) [if `rating_change` <= `base_rating_change`] +`disparity` = 0.75 * `base_disparity` [if `rating_change` >= `base_rating_change`], where: + + * `base_disparity` and `base_rating_change` are application parameters + * `rating_change` is average amplitude of driver rating for previous 3 months (max(rating) - min(rating) over 3 months for every driver, averaged) + +The equation may look scary, but it's just a concave segment of hyperbolic function, spanning from base_disparity in 0 to 75% of base_disparity in base_rating_change and staying constant for higher values of rating change. The adjustement can be turned off in application configuration. + +Now, it has to be said that if you have a predetermined bias or prejudice for (or against) certain drivers or decades, it's fairly easy to tweak the algorithm so that it yields results that satisfy you. For the results I'm showing, I tried my best to keep overall ranking conditions over the entire field constant over time. It's debatable if that's not defeating the entire point of the simulation - one of the main reasons for using Elo-like rating was to separate absolute driver achievements (yes, "achievements", not "skill", since comparisons among the entire field do not factor out technical level of cars driven by competitors) from the general "quality" of the grid they're fielded against. But as the average rating is comparable over decades, the variation is not - so some conclusions on overall quality of certain grids can be drawn. + +For full disclosure purposes: simple tweaks to field disparity either underline the achievements of Fangio era (with rapid inflation of rating in the early years - the fact that the sport got dominated that much not long after the beginning doesn't help) or Vettel's achievements (with inflation kicking in around Schumacher era). It's interesting to see, though, that no matter which way the absolute ratings are skewed, last decade of the sport shows signs of "standing on the shoulders of giants": Alonso, by defeating Schumacher in 2005-06, gets higher ratings, which then are being surpassed by Vettel and, sometimes, Hamilton at the moment. Basically - beating highly rated drivers in a space of 1-2 seasons elevates a driver even higher. -- cgit v1.2.3