Global fuel prices, 1924–2024
This page is a structured presentation of the work in RMAID.ipynb: historical gasoline and related
features across countries, exploratory charts, a chronologically split machine learning pipeline, several
regression models, hyperparameter search, gradient boosting interpretability, and a detailed discussion of tree-based
models and domain-driven feature importance.
Data acquisition and loading
The notebook installs helpers, downloads all_countries_combined.csv (Google Drive export URL in the
notebook), and validates the file. Columns include year, country, region, fuel prices (nominal and inflation-adjusted
to 2024 USD), crude oil, tax share of pump price, subsidy regime, oil producer flag, CPI, and more.
The companion script presentation/scripts/build_data.py uses the same URL (or FUEL_CSV)
to regenerate data.json for the charts and tables on this site, so your deployed numbers always match the
CSV you last built from.
Exploratory Data Analysis (expanded)
This section visualizes distributions, correlations, and categorical structure to understand what drives pump prices. It includes missing-value maps; global and regional time series; country comparisons for the latest year; histograms of nominal vs real prices; a broad correlation matrix; crude vs gasoline scatter plots (often colored by year); decade box plots; tax vs price by region; year-on-year change heatmaps; subsidy and oil-producer analyses; and price-tier evolution.
Pearson correlation between crude oil and nominal gasoline (full loaded dataset): —.
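The correlation above is the usual pairwise Pearson statistic; a minimal sketch of how it can be computed with pandas, assuming the column names described in the data section:

```python
import pandas as pd

def crude_gasoline_correlation(df: pd.DataFrame) -> float:
    """Pearson correlation between crude oil and nominal gasoline,
    dropping rows where either value is missing."""
    pair = df[["crude_oil_usd_per_barrel", "gasoline_usd_per_liter"]].dropna()
    return pair["crude_oil_usd_per_barrel"].corr(pair["gasoline_usd_per_liter"])
```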
Global averages: nominal vs real gasoline and crude oil
Average gasoline price by region
Global average gasoline vs diesel over time
Feature engineering, encoding, and chronological split
Because the panel spans a century, the notebook uses a chronological split, not a random split, to
avoid using future years to predict the past. Categorical fields region and subsidy_regime
are label-encoded. Rows with missing values in the modelling columns are dropped, and the frame is sorted by year.
Intended periods in the notebook narrative: training on the earliest years, validation on the middle period, and testing on the most recent years (approximately 1924–1994 train, 1994–2009 validation, 2009–2024 test, depending on row availability after dropping missing data).
Features used
year, region (encoded), is_oil_producer,
crude_oil_usd_per_barrel, tax_pct_of_pump_price, subsidy_regime (encoded),
us_cpi.
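The encoding and split can be sketched as below. The encoded-column names and the 1994/2009 boundary years are assumptions drawn from the notebook narrative above; the notebook's exact cut points depend on row availability after dropping missing data:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Feature list from the section above.
FEATURES = ["year", "region_encoded", "is_oil_producer",
            "crude_oil_usd_per_barrel", "tax_pct_of_pump_price",
            "subsidy_encoded", "us_cpi"]

def chronological_split(df: pd.DataFrame,
                        target: str = "gasoline_usd_per_liter",
                        train_end: int = 1994, val_end: int = 2009):
    """Label-encode categoricals, drop incomplete rows, sort by year,
    and cut the panel into past (train), middle (val), recent (test)."""
    df = df.copy()
    df["region_encoded"] = LabelEncoder().fit_transform(df["region"])
    df["subsidy_encoded"] = LabelEncoder().fit_transform(df["subsidy_regime"])
    df = df.dropna(subset=FEATURES + [target]).sort_values("year")
    train = df[df["year"] < train_end]
    val = df[(df["year"] >= train_end) & (df["year"] < val_end)]
    test = df[df["year"] >= val_end]
    return [(part[FEATURES], part[target]) for part in (train, val, test)]
```

Splitting on the year column rather than at random is what keeps future observations out of the training slice.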
First modelling pass (nominal target)
The notebook describes training five models on the nominal target gasoline_usd_per_liter for comparison:
Linear Regression, Ridge, Decision Tree, Random Forest, and Gradient Boosting. In that narrative, Gradient Boosting
led on validation with an R² of about 0.46 and the lowest MAE (about 0.22), while linear models underperformed, consistent with
strong non-linear relationships across a hundred years of regions and policy regimes. Ensemble methods captured
long-horizon structure better than linear baselines.
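A minimal sketch of such a comparison loop, assuming default scikit-learn hyperparameters (the notebook's exact settings may differ):

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.tree import DecisionTreeRegressor

MODELS = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

def compare_models(X_train, y_train, X_val, y_val):
    """Fit each candidate on the training slice and score it on validation."""
    rows = []
    for name, model in MODELS.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_val)
        rows.append({"model": name,
                     "MAE": mean_absolute_error(y_val, pred),
                     "R2": r2_score(y_val, pred)})
    return rows
```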
Real 2024 USD target and Gradient Boosting
The analysis then switches the target to gasoline_real_2024usd so predictions are in inflation-adjusted
terms. The same feature matrix and chronological split are reused. Gradient Boosting is fit on the training slice and
reported on validation and test; the notebook also shows an actual-vs-predicted scatter on the test window and a
horizontal bar chart of GB feature importances.
SHAP for Gradient Boosting
The notebook uses shap.TreeExplainer on the fitted gradient boosting model and the real-target test
features. It generates a summary plot (each point is a country-year: position shows SHAP impact,
color shows feature value) and a dependence plot for tax_pct_of_pump_price, showing how
the marginal effect of tax on the model output changes across observed tax levels. Together, these complement raw
feature importances by exposing directionality and heterogeneity.
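The explainer call can be sketched as below; the plotting calls in the comments mirror the two figures described above, and the lazy import reflects that shap is an optional dependency:

```python
from sklearn.ensemble import GradientBoostingRegressor

def shap_for_gb(model: GradientBoostingRegressor, X_test):
    """Compute SHAP values for a fitted gradient boosting model
    on the test features."""
    import shap  # optional dependency: pip install shap

    explainer = shap.TreeExplainer(model)
    values = explainer.shap_values(X_test)  # one row of impacts per observation
    # shap.summary_plot(values, X_test) and
    # shap.dependence_plot("tax_pct_of_pump_price", values, X_test)
    # reproduce the notebook's two figures when X_test is a DataFrame
    # with that column.
    return values
```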
Scaling numerical inputs
Continuous fields year, crude_oil_usd_per_barrel, tax_pct_of_pump_price, and
us_cpi are standardized with StandardScaler fit on the training split. The notebook notes
that linear models should be trained and evaluated on these scaled matrices, while tree ensembles are less sensitive
to monotonic scaling but can still occasionally benefit.
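A sketch of the leakage-safe scaling step, fitting on the training slice only (the helper name is illustrative):

```python
from sklearn.preprocessing import StandardScaler

# Continuous fields named in the section above.
NUMERIC = ["year", "crude_oil_usd_per_barrel", "tax_pct_of_pump_price", "us_cpi"]

def scale_numeric(X_train, X_val, X_test):
    """Standardize the continuous columns using statistics from the
    training split only, so no future-period information leaks in."""
    scaler = StandardScaler().fit(X_train[NUMERIC])
    out = []
    for X in (X_train, X_val, X_test):
        X = X.copy()
        X[NUMERIC] = scaler.transform(X[NUMERIC])
        out.append(X)
    return out
```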
Hyperparameter tuning
Gradient Boosting: RandomizedSearchCV over estimators, learning rate, tree depth,
min_samples_split, and subsample; scored by R² on cross-validation folds fit on the real-target training
data.
Other models: The notebook also runs randomized search for Ridge, Decision Tree, and Random Forest on the real target and prints a consolidated table of best parameters.
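The gradient boosting search can be sketched as follows. The distributions are illustrative assumptions matching the hyperparameters named above; the notebook's exact grids, n_iter, and cv settings may differ:

```python
from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": randint(100, 400),
    "learning_rate": uniform(0.01, 0.2),   # samples from [0.01, 0.21)
    "max_depth": randint(2, 6),
    "min_samples_split": randint(2, 20),
    "subsample": uniform(0.6, 0.4),        # samples from [0.6, 1.0)
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_distributions,
    n_iter=20, scoring="r2", cv=3, random_state=42, n_jobs=-1,
)
# search.fit(X_train_r, y_train_r)
# search.best_params_, search.best_score_
```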
CatBoost and LightGBM
CatBoost and LightGBM regressors are trained on the same X_train_r / y_train_r setup
(encoded categoricals as numeric inputs), evaluated on validation, and merged into the running comparison table with
MAE, RMSE, and R². If those libraries are not installed when you run build_data.py, they are skipped and
omitted from data.json.
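The skip-when-missing behavior described above can be sketched with guarded imports (the helper name and constructor arguments are illustrative):

```python
def optional_boosters(random_state: int = 42) -> dict:
    """Return whichever of CatBoost/LightGBM can be imported,
    skipping any library that is not installed."""
    models = {}
    try:
        from catboost import CatBoostRegressor
        models["CatBoost"] = CatBoostRegressor(verbose=0,
                                               random_state=random_state)
    except ImportError:
        pass
    try:
        from lightgbm import LGBMRegressor
        models["LightGBM"] = LGBMRegressor(random_state=random_state)
    except ImportError:
        pass
    return models
```

Any model returned here can be fit on X_train_r / y_train_r and appended to the same comparison table as the scikit-learn models.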
Test-set evaluation and visualizations
All models (including untuned and tuned gradient boosting, CatBoost, and LightGBM when available) are evaluated on
the held-out test slice X_test_r / y_test_r. The notebook plans three visual families: bar
charts comparing R², MAE, and RMSE; actual-vs-predicted scatter grids per model; and residual plots (actual minus
predicted vs predicted) to inspect heteroskedasticity and bias.
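The per-model metrics and residuals behind those charts can be computed with a small helper like this (the function name is illustrative):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate_on_test(model, X_test, y_test):
    """Score a fitted model on the held-out test slice and return the
    residuals (actual minus predicted) used for the residual plots."""
    pred = model.predict(X_test)
    residuals = np.asarray(y_test) - pred
    metrics = {
        "R2": r2_score(y_test, pred),
        "MAE": mean_absolute_error(y_test, pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
    }
    return metrics, residuals
```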
Top test model in the current build: — (ranked by R² on the test split). Rankings can change with the exact CSV rows and library versions; the prose below records the notebook’s written interpretation when the Decision Tree led on the test set.
| Model | R² | MAE | RMSE |
|---|---|---|---|
Test R² (higher is better)
Test MAE (lower is better)
Test RMSE (lower is better)
Actual vs predicted (test subsample)
In-depth analysis: Decision Tree
The notebook highlights the Decision Tree for interpretability: feature importances from the fitted tree, SHAP
summary and tax dependence plots with TreeExplainer, and extended discussion.
Decision Tree feature importances
Gradient Boosting feature importances (real target)
Model complexity, regularization, and ensembles
The Decision Tree emerged as the best performer on the test set in the notebook's reported run for predicting real gasoline prices. Trees capture non-linear relationships and interactions without explicit interaction terms, which is useful over a century of socio-economic change. Single trees can overfit, yet in that run the tree generalized slightly better than the bagging and boosting variants. The notebook suggests three possible reasons: the decision boundary is relatively simple, the test window happens to align well with the tree's structure, or the ensemble hyperparameters were not fully optimal for this split. Linear models struggled, underscoring the non-linearity of the problem.
Domain relevance of feature importances
- tax_pct_of_pump_price (dominant): Taxes are a direct policy lever on pump prices; high tax shares align with higher consumer prices.
- subsidy_encoded: Subsidies lower effective prices; regimes differ markedly from "none" to heavy support.
- region_encoded: Geography bundles regulation, logistics, and market structure beyond crude price alone.
- us_cpi: Acts as a macro proxy linked to the real price target and global inflation conditions.
- crude_oil_usd_per_barrel: Essential for input costs, but pump prices add tax, distribution, and policy wedges, so crude may rank below tax and subsidy in tree splits.
- year: Captures slow trends not fully encoded elsewhere; may be secondary to policy and price drivers.
- is_oil_producer: Producer status matters for supply and domestic policy, but consumer pump prices still reflect taxes, subsidies, and local markets, so importance can be modest.
Overall, importances match economic intuition: fiscal policy and regional context sit alongside global crude and macro indicators.
SHAP analysis for the Decision Tree
As with gradient boosting, SHAP summary and dependence plots for the Decision Tree summarize how each feature pushes predictions up or down on held-out observations, with the tax dependence plot highlighting non-linear marginal effects of the tax share on predicted real prices.
Team contribution table
| Section | Responsible member |
|---|---|
| Data acquisition & loading | Monta |
| Exploratory Data Analysis | Ruslans |
| Feature engineering & scaling | Auseklis |
| Model development & tuning | Igors |
| Evaluation & visualization | Dāvis |
References
- Dataset: Global Fuel Prices (1924–2024) on Kaggle (historical fuel price data).