This repository is used to create the training dataset and to train a power outage model. The operational code lives in another repository. This code is published to support a related article; the data itself, however, is proprietary to the power distribution companies.
The overall process is as follows:
- Fetch and prepare necessary data
- Identify and track storm objects
- Extract predictive features
- Train classifier
- Classify
The data consists of three parts: 1) ERA5 reanalysis, 2) Luke forest inventory, and 3) power outage data.
ERA5 data is fetched from https://cds.climate.copernicus.eu/ and stored in the (private) AWS bucket `fmi-era5-world-nwp-parameters`.
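As an illustration only (the actual retrieval scripts live elsewhere, and the variable selection below is hypothetical), a minimal `cdsapi` request for ERA5 surface winds could look like this:

```python
import cdsapi

# Requires CDS credentials in ~/.cdsapirc; see https://cds.climate.copernicus.eu/
c = cdsapi.Client()
c.retrieve(
    "reanalysis-era5-single-levels",
    {
        "product_type": "reanalysis",
        # Hypothetical variable selection for storm-related features
        "variable": ["10m_u_component_of_wind", "10m_v_component_of_wind"],
        "year": "2017",
        "month": "08",
        "day": "12",
        "time": ["00:00", "06:00", "12:00", "18:00"],
        "format": "grib",
    },
    "era5_sample.grib",  # local target file; the project stores such files in S3
)
```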
Luke forest inventory data is fetched from http://kartta.luke.fi/opendata/valinta.html and stored in the (private) bucket `fmi-asi-data-puusto`. The data is stored as GeoTIFF at the original 16 m resolution and at a lowres 1.6 km resolution, plus lowres GRIB files.
The following actions can be used to fetch new versions of the files:

- Fetch the tiles from the service and upload them to the bucket with the correct name (e.g. `luke/2017/fra_luokka/xx.tif`)
- Create a composite GeoTIFF: `bin/process_tree_files.sh fra_luokka` (repeat for all parameters)
- Convert to GRIB: `bin/geotiff_to_grib` (process all parameters); a sketch of the underlying GDAL operations follows this list
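The shell scripts above wrap standard GDAL operations. As a hedged sketch of the kind of processing involved (file names and the resampling choice are assumptions, not read from the scripts), the composite and lowres conversion could be expressed in Python as:

```python
from osgeo import gdal

# Mosaic the downloaded 16 m tiles into one composite GeoTIFF
# (the tile list here is hypothetical).
gdal.Warp("fra_luokka_composite.tif", ["tile_01.tif", "tile_02.tif"])

# Downsample the composite to ~1.6 km; resolution is given in CRS units
# (meters for the Finnish grid), and 'average' resampling is an assumption
# about how the lowres product is made.
gdal.Warp(
    "fra_luokka_lowres.tif",
    "fra_luokka_composite.tif",
    xRes=1600,
    yRes=1600,
    resampleAlg="average",
)

# Convert the lowres GeoTIFF to GRIB (recent GDAL versions can write GRIB2).
gdal.Translate("fra_luokka_lowres.grib", "fra_luokka_lowres.tif", format="GRIB")
```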
Power outage data is received from power distribution companies.
See https://github.com/fmidev/sasse-era5-smartmet-grid
The dataset is split into train and test sets separately from training the classifier. This ensures a fair comparison and enables creating test examples. The split is conducted in the notebook `dataset_split`.
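The split itself lives in the notebook, but as a minimal sketch (file names, column layout, and the split ratio are assumptions), it is conceptually similar to:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file name; the real split is done once in the dataset_split
# notebook, before any classifier training, and the resulting files are reused.
df = pd.read_csv("data/dataset.csv")
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df.to_csv("data/dataset_train.csv", index=False)
test_df.to_csv("data/dataset_test.csv", index=False)
```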
Training the classifier is done partly with the script `classifier/train_classifier.py` and partly in the notebooks `train_and_validate_rfc` (random search of the best hyperparameters) and `train_and_validate_mlp`.
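As an illustration of the random hyperparameter search done in `train_and_validate_rfc` (the search space below is hypothetical, not the one in the notebook):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical search space; the notebook defines its own grid.
param_distributions = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 10, 20, 40],
    "min_samples_leaf": [1, 2, 5],
}
search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions,
    n_iter=20,
    cv=5,
    scoring="f1",
    random_state=42,
)
# search.fit(X_train, y_train)  # features/labels come from the split above
```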
The script reads the relevant configuration from command-line arguments and from `cnf/options.ini`. The config file and config section name can be set with the arguments `config_filename` and `config_name` respectively; the relevant setups at the moment are `thin` and `thin_energiateollisuus`. Train and test data are always given as the arguments `train_data` and `test_data`. If the files do not exist locally, the script tries to fetch them from the AWS bucket named by the `s3_bucket` variable in `cnf/options.ini`. The `model` argument selects the classifier (the script supports svct, rfc, gnb, and gp), and the `dataset` argument is used to format the model and results output paths.
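A minimal sketch of how such a config lookup can work (the exact option names are defined in `cnf/options.ini`; this is an assumption about the mechanism, not the script's actual code):

```python
import configparser

config = configparser.ConfigParser()
config.read("cnf/options.ini")              # config_filename argument
section = config["thin_energiateollisuus"]  # config_name argument
s3_bucket = section["s3_bucket"]            # bucket used when data files are missing locally
```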
In practice, the script is run with docker-compose. To train, for example, the energiateollisuus dataset with a 20 m/s threshold, one could use the following commands:
```bash
export model=rfc
export dataset=national_random_20
export train_data=data/energiateollisuus_random_20_thin_res.csv
export test_data=data/energiateollisuus_random_20_thin_test.csv
export config_name=thin_energiateollisuus
docker-compose run --rm cl
```
For operational use, see https://github.com/fmidev/contour-storms
`classifier/create_examples.py` is used to create examples. Consult the source code and the docker-compose file for more details.