author | emkael <emkael@tlen.pl> | 2014-11-12 17:06:46 +0100 |
---|---|---|
committer | emkael <emkael@tlen.pl> | 2014-11-12 17:06:46 +0100 |
commit | fc0bd3faac636e4f9125c39a61b4a84d2758cd1f | |
tree | 61edafb28345e78f0653c703c2d7e85358ac587b | |
parent | 7ea4582fa20c76c6189816b798a0fa7d0a5dea28 |
* data import documentation
-rw-r--r-- | doc/challenges.txt | 21 |
1 files changed, 21 insertions, 0 deletions
diff --git a/doc/challenges.txt b/doc/challenges.txt
index 48683c8..1e838a6 100644
--- a/doc/challenges.txt
+++ b/doc/challenges.txt
@@ -1,6 +1,10 @@
 Challenges and issues with the rating process
 =============================================
 
+Two main obstacles had to be overcome to achieve satisfactory quality of the analysis and of the project as a whole.
+
+One of them was purely technical, the other theoretical/methodological. This document presents them very briefly.
+
 Gathering and unifying data
 ---------------------------
 
@@ -65,3 +69,20 @@ That's where the result database structure kicks in. For clarity, I'll show you
 | + car_no |------------------| + country | _driver | + rank_date | | + result_group | _entry _driver +-----------+ +--------------+ +-----------------+ (driver_entries)
+
+Since the `race_types` table is a pre-filled dictionary and values in `rankings` are calculated only by the main rating application, dataset import operates on `races`, `entries` and `drivers`.
+
+The `races` table is pretty much straightforward, so a CSV-formatted file can easily be imported into it (e.g. with the help of any proper web-based RDBMS administration tool).
+
+The aim of the main import procedure was to populate the `drivers`, `driver_entries` and `entries` tables, with support for shared drives where possible.
+
+That imposed some constraints on the amount of information in the imported CSV file and on its format. The very rudimentary import script (import-csv.py) assumes a CSV file with either of the following line formats:
+ - 6 columns: race ID, text description of the entry result, car number, driver country, driver name, result group for the Elo algorithm outcome
+ - 2 columns: driver country, driver name
+A row in the first format created a new entry for the race; a row in the second format appended another driver to the shared drive of the last processed entry. On top of that, the `drivers` table was filled with every driver name not yet present in the database.
+
+### Data normalization
+
+Usually, after running the import script, the main concern was the normalization of driver names present in the database. Since lookup and identification of drivers during the import took only their names into account, there were lots of duplicates: drivers racing under nicknames, drivers who had appeared under multiple names, or drivers with various spelling variants of their names.
+
+After manual normalization of such values, the dataset was ready for processing (usually by resetting the ranking DB to the date of the first newly imported session and running the rating onwards).
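
The two row formats described in the added documentation can be illustrated with a minimal Python sketch. This is not the actual import-csv.py from the repository: the function name `import_rows`, the in-memory list and dict standing in for the `entries`/`driver_entries` and `drivers` tables, and the sample data are all assumptions made for illustration.

```python
import csv
import io


def import_rows(csv_text):
    """Group CSV rows into entries, each with one or more drivers.

    Hypothetical stand-in for the import logic: a 6-column row opens a
    new entry, a 2-column row adds a driver to the last opened entry.
    """
    entries = []   # stand-in for the `entries` / `driver_entries` tables
    drivers = {}   # stand-in for the `drivers` table: name -> country
    for row in csv.reader(io.StringIO(csv_text)):
        row = [cell.strip() for cell in row]
        if not any(row):
            continue  # skip blank lines
        if len(row) == 6:
            race_id, result, car_no, country, name, result_group = row
            entries.append({
                "race": race_id,
                "result": result,
                "car_no": car_no,
                "result_group": result_group,
                "drivers": [name],
            })
        elif len(row) == 2:
            country, name = row
            if not entries:
                raise ValueError("driver row before any entry row: %r" % (row,))
            # Shared drive: attach the driver to the last processed entry.
            entries[-1]["drivers"].append(name)
        else:
            raise ValueError("unexpected column count: %r" % (row,))
        # Drivers are identified by name only, so a differently spelled
        # name becomes a separate driver record.
        drivers.setdefault(name, country)
    return entries, drivers


# Made-up sample: one shared drive (two drivers) and one solo entry.
sample = """\
1,1st,7,GB,John Smith,1
FR,Jean Dupont
1,2nd,12,DE,Hans Meier,2
"""
entries, drivers = import_rows(sample)
```

Because the sketch keys drivers on the name alone, importing "J. Smith" and "John Smith" would yield two separate driver records, which is the kind of duplicate the normalization step has to clean up afterwards.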