Global fuel prices, 1924–2024
This page is a structured presentation of the work in RMAID.ipynb: historical gasoline and related
features across countries, exploratory charts, a chronologically split machine learning pipeline, several
regression models, hyperparameter search, gradient boosting interpretability, and a detailed discussion of tree-based
models and domain-driven feature importance.
Data acquisition and loading
The notebook installs helpers, downloads all_countries_combined.csv (Google Drive export URL in the
notebook), and validates the file. Columns include year, country, region, fuel prices (nominal and inflation-adjusted
to 2024 USD), crude oil, tax share of pump price, subsidy regime, oil producer flag, CPI, and more.
The companion script presentation/scripts/build_data.py uses the same URL (or FUEL_CSV)
to regenerate data.json for the charts and tables on this site, so your deployed numbers always match the
CSV you last built from.
Exploratory Data Analysis (expanded)
This section visualizes distributions, correlations, and categorical structure to understand what drives pump prices. It includes missing-value maps; global and regional time series; country comparisons for the latest year; histograms of nominal vs real prices; a broad correlation matrix; crude vs gasoline scatter plots (often colored by year); decade box plots; tax vs price by region; year-on-year change heatmaps; subsidy and oil-producer analyses; and price-tier evolution.
Pearson correlation between crude oil and nominal gasoline (full loaded dataset): —.
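The correlation above is the usual pairwise Pearson statistic; a minimal sketch of how it can be computed with pandas, assuming the column names described in the data section:

```python
import pandas as pd

def crude_gasoline_correlation(df: pd.DataFrame) -> float:
    """Pearson correlation between crude oil and nominal gasoline,
    dropping rows where either value is missing."""
    pair = df[["crude_oil_usd_per_barrel", "gasoline_usd_per_liter"]].dropna()
    return pair["crude_oil_usd_per_barrel"].corr(pair["gasoline_usd_per_liter"])
```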
Global averages: nominal vs real gasoline and crude oil
Average gasoline price by region
Global average gasoline vs diesel over time
Feature engineering, encoding, and chronological split
Because the panel spans a century, the notebook uses a chronological split, not a random split, to
avoid using future years to predict the past. Categorical fields region and subsidy_regime
are label-encoded. Rows with missing values in the modelling columns are dropped, and the frame is sorted by year.
Intended periods in the notebook narrative: training on the earliest years, validation on the middle period, and testing on the most recent years (approximately 1924–1994 train, 1994–2009 validation, 2009–2024 test, depending on row availability after dropping missing data).
Features used
year, region (encoded), is_oil_producer,
crude_oil_usd_per_barrel, tax_pct_of_pump_price, subsidy_regime (encoded),
us_cpi.
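The encoding and split can be sketched as below. The encoded-column names and the 1994/2009 boundary years are assumptions drawn from the notebook narrative above; the notebook's exact cut points depend on row availability after dropping missing data:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Feature list from the section above.
FEATURES = ["year", "region_encoded", "is_oil_producer",
            "crude_oil_usd_per_barrel", "tax_pct_of_pump_price",
            "subsidy_encoded", "us_cpi"]

def chronological_split(df: pd.DataFrame,
                        target: str = "gasoline_usd_per_liter",
                        train_end: int = 1994, val_end: int = 2009):
    """Label-encode categoricals, drop incomplete rows, sort by year,
    and cut the panel into past (train), middle (val), recent (test)."""
    df = df.copy()
    df["region_encoded"] = LabelEncoder().fit_transform(df["region"])
    df["subsidy_encoded"] = LabelEncoder().fit_transform(df["subsidy_regime"])
    df = df.dropna(subset=FEATURES + [target]).sort_values("year")
    train = df[df["year"] < train_end]
    val = df[(df["year"] >= train_end) & (df["year"] < val_end)]
    test = df[df["year"] >= val_end]
    return [(part[FEATURES], part[target]) for part in (train, val, test)]
```

Splitting on the year column rather than at random is what keeps future observations out of the training slice.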
First modelling pass (nominal target)
The notebook describes training five models on the nominal target gasoline_usd_per_liter for comparison:
Linear Regression, Ridge, Decision Tree, Random Forest, and Gradient Boosting. In that narrative, Gradient Boosting
led on validation with an R² of about 0.46 and the lowest MAE (about 0.22), while linear models underperformed, consistent with
strong non-linear relationships across a hundred years of regions and policy regimes. Ensemble methods captured
long-horizon structure better than linear baselines.
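A minimal sketch of such a comparison loop, assuming default scikit-learn hyperparameters (the notebook's exact settings may differ):

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.tree import DecisionTreeRegressor

MODELS = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

def compare_models(X_train, y_train, X_val, y_val):
    """Fit each candidate on the training slice and score it on validation."""
    rows = []
    for name, model in MODELS.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_val)
        rows.append({"model": name,
                     "MAE": mean_absolute_error(y_val, pred),
                     "R2": r2_score(y_val, pred)})
    return rows
```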
Real 2024 USD target and Gradient Boosting
The analysis then switches the target to gasoline_real_2024usd so predictions are in inflation-adjusted
terms. The same feature matrix and chronological split are reused. Gradient Boosting is fit on the training slice and
reported on validation and test; the notebook also shows an actual-vs-predicted scatter on the test window and a
horizontal bar chart of GB feature importances.
SHAP for Gradient Boosting
The notebook uses shap.TreeExplainer on the fitted gradient boosting model and the real-target test
features. It generates a summary plot (each point is a country-year: position shows SHAP impact,
color shows feature value) and a dependence plot for tax_pct_of_pump_price, showing how
the marginal effect of tax on the model output changes across observed tax levels. Together, these complement raw
feature importances by exposing directionality and heterogeneity.
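The explainer call can be sketched as below; the plotting calls in the comments mirror the two figures described above, and the lazy import reflects that shap is an optional dependency:

```python
from sklearn.ensemble import GradientBoostingRegressor

def shap_for_gb(model: GradientBoostingRegressor, X_test):
    """Compute SHAP values for a fitted gradient boosting model
    on the test features."""
    import shap  # optional dependency: pip install shap

    explainer = shap.TreeExplainer(model)
    values = explainer.shap_values(X_test)  # one row of impacts per observation
    # shap.summary_plot(values, X_test) and
    # shap.dependence_plot("tax_pct_of_pump_price", values, X_test)
    # reproduce the notebook's two figures when X_test is a DataFrame
    # with that column.
    return values
```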
Scaling numerical inputs
Continuous fields year, crude_oil_usd_per_barrel, tax_pct_of_pump_price, and
us_cpi are standardized with StandardScaler fit on the training split. The notebook notes
that linear models should be trained and evaluated on these scaled matrices, while tree ensembles are less sensitive
to monotonic scaling but can still occasionally benefit.
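A sketch of the leakage-safe scaling step, fitting on the training slice only (the helper name is illustrative):

```python
from sklearn.preprocessing import StandardScaler

# Continuous fields named in the section above.
NUMERIC = ["year", "crude_oil_usd_per_barrel", "tax_pct_of_pump_price", "us_cpi"]

def scale_numeric(X_train, X_val, X_test):
    """Standardize the continuous columns using statistics from the
    training split only, so no future-period information leaks in."""
    scaler = StandardScaler().fit(X_train[NUMERIC])
    out = []
    for X in (X_train, X_val, X_test):
        X = X.copy()
        X[NUMERIC] = scaler.transform(X[NUMERIC])
        out.append(X)
    return out
```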
Hyperparameter tuning
Gradient Boosting: RandomizedSearchCV over estimators, learning rate, tree depth,
min_samples_split, and subsample; scored by R² on cross-validation folds fit on the real-target training
data.
Other models: The notebook also runs randomized search for Ridge, Decision Tree, and Random Forest on the real target and prints a consolidated table of best parameters.
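The gradient boosting search can be sketched as follows. The distributions are illustrative assumptions matching the hyperparameters named above; the notebook's exact grids, n_iter, and cv settings may differ:

```python
from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": randint(100, 400),
    "learning_rate": uniform(0.01, 0.2),   # samples from [0.01, 0.21)
    "max_depth": randint(2, 6),
    "min_samples_split": randint(2, 20),
    "subsample": uniform(0.6, 0.4),        # samples from [0.6, 1.0)
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_distributions,
    n_iter=20, scoring="r2", cv=3, random_state=42, n_jobs=-1,
)
# search.fit(X_train_r, y_train_r)
# search.best_params_, search.best_score_
```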
CatBoost and LightGBM
CatBoost and LightGBM regressors are trained on the same X_train_r / y_train_r setup
(encoded categoricals as numeric inputs), evaluated on validation, and merged into the running comparison table with
MAE, RMSE, and R². If those libraries are not installed when you run build_data.py, they are skipped and
omitted from data.json.
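The skip-when-missing behavior described above can be sketched with guarded imports (the helper name and constructor arguments are illustrative):

```python
def optional_boosters(random_state: int = 42) -> dict:
    """Return whichever of CatBoost/LightGBM can be imported,
    skipping any library that is not installed."""
    models = {}
    try:
        from catboost import CatBoostRegressor
        models["CatBoost"] = CatBoostRegressor(verbose=0,
                                               random_state=random_state)
    except ImportError:
        pass
    try:
        from lightgbm import LGBMRegressor
        models["LightGBM"] = LGBMRegressor(random_state=random_state)
    except ImportError:
        pass
    return models
```

Any model returned here can be fit on X_train_r / y_train_r and appended to the same comparison table as the scikit-learn models.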
Test-set evaluation and visualizations
All models (including untuned and tuned gradient boosting, CatBoost, and LightGBM when available) are evaluated on
the held-out test slice X_test_r / y_test_r. The notebook plans three visual families: bar
charts comparing R², MAE, and RMSE; actual-vs-predicted scatter grids per model; and residual plots (actual minus
predicted vs predicted) to inspect heteroskedasticity and bias.
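The per-model metrics and residuals behind those charts can be computed with a small helper like this (the function name is illustrative):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate_on_test(model, X_test, y_test):
    """Score a fitted model on the held-out test slice and return the
    residuals (actual minus predicted) used for the residual plots."""
    pred = model.predict(X_test)
    residuals = np.asarray(y_test) - pred
    metrics = {
        "R2": r2_score(y_test, pred),
        "MAE": mean_absolute_error(y_test, pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
    }
    return metrics, residuals
```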
Top test model in the current build: — (ranked by R² on the test split). Rankings can change with the exact CSV rows and library versions; the prose below records the notebook’s written interpretation when the Decision Tree led on the test set.
| Model | R² | MAE | RMSE |
|---|---|---|---|
Test R² (higher is better)
Test MAE (lower is better)
Test RMSE (lower is better)
Actual vs predicted (test subsample)
In-depth analysis: Decision Tree
The notebook highlights the Decision Tree for interpretability: feature importances from the fitted tree, SHAP
summary and tax dependence plots with TreeExplainer, and extended discussion.
Decision Tree feature importances
Gradient Boosting feature importances (real target)
Model complexity, regularization, and ensembles
The Decision Tree emerged as the best performer on the test set in the notebook's reported run for predicting real gasoline prices. Trees capture non-linear relationships and interactions without explicit interaction terms, which is useful over a century of socio-economic change. Single trees can overfit, yet in that run the tree generalized slightly better than the bagging and boosting variants. The notebook suggests three possible reasons: the decision boundary is relatively simple, the test window happens to align well with the tree's structure, or the ensemble hyperparameters were not fully optimal for this split. Linear models struggled, underscoring the non-linearity of the problem.
Domain relevance of feature importances
- tax_pct_of_pump_price (dominant): Taxes are a direct policy lever on pump prices; high tax shares align with higher consumer prices.
- subsidy_encoded: Subsidies lower effective prices; regimes differ markedly from "none" to heavy support.
- region_encoded: Geography bundles regulation, logistics, and market structure beyond crude price alone.
- us_cpi: Acts as a macro proxy linked to the real price target and global inflation conditions.
- crude_oil_usd_per_barrel: Essential for input costs, but pump prices add tax, distribution, and policy wedges, so crude may rank below tax and subsidy in tree splits.
- year: Captures slow trends not fully encoded elsewhere; may be secondary to policy and price drivers.
- is_oil_producer: Producer status matters for supply and domestic policy, but consumer pump prices still reflect taxes, subsidies, and local markets, so importance can be modest.
Overall, importances match economic intuition: fiscal policy and regional context sit alongside global crude and macro indicators.
SHAP analysis for the Decision Tree
As with gradient boosting, SHAP summary and dependence plots for the Decision Tree summarize how each feature pushes predictions up or down on held-out observations, with the tax dependence plot highlighting non-linear marginal effects of the tax share on predicted real prices.
Team contribution table
| Section | Responsible member |
|---|---|
| Data acquisition & loading | Monta |
| Exploratory Data Analysis | Ruslans |
| Feature engineering & scaling | Auseklis |
| Model development & tuning | Igors |
| Evaluation & visualization | Dāvis |
References
- Dataset: Global Fuel Prices (1924–2024) on Kaggle (historical fuel price data).