Practice 7. Ensemble Learning
Combining models to improve prediction quality.
The material has been fully migrated to the documentation; the original notebook has been removed from the repository.
Минькин Александр Дмитриевич - ИКБО-25-22
Practical Work No. 7
Ensemble Learning // Stacking // Bagging // Boosting
- Find a dataset (datasets must not repeat within the group). If the features have widely different scales, normalize the data beforehand.
from google.colab import files
uploaded = files.upload()

Output:
Saving data_population2023.csv to data_population2023.csv

- Implement bagging.
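Random forest, used in the implementation below, is a specific case of bagging over decision trees. For reference, a minimal generic bagging sketch with scikit-learn's BaggingRegressor is shown here; it assumes scikit-learn >= 1.2 (older versions call the first parameter base_estimator) and reuses the CSV uploaded above.

# Generic bagging: decision trees trained on bootstrap samples,
# predictions averaged over the ensemble.
import pandas as pd
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv('/content/data_population2023.csv')
X = df.select_dtypes('number').drop(columns=['sr.no.', 'population_2023'])
y = df['population_2023']
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

bag = BaggingRegressor(
    estimator=DecisionTreeRegressor(),  # base learner
    n_estimators=100,                   # number of bootstrap models
    random_state=42
)
bag.fit(X_tr, y_tr)
print('Bagging R^2:', bag.score(X_te, y_te))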
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
import seaborn as sns
# Load the uploaded dataset
data = pd.read_csv('/content/data_population2023.csv')
print("First 5 rows of the dataset:")
print(data.head())
features = ['households', 'average_household_size', 'population_2017', 'growth_rate', 'area(km²)',
'density_2023(people/km²)', 'primary_boys_schools', 'primary_girls_schools',
'secondary_boys_schools', 'secondary_girls_schools', 'high_boys_schools', 'high_girls_schools',
'Intermediate_boys_schools', 'Intermediate_girls_schools', 'total_boys_schools', 'total_girls_schools', 'total_schools']
X = data[features]
y = data['population_2023']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Random forest: bagging over decision trees with feature subsampling
rf_model = RandomForestRegressor(
    n_estimators=100,    # number of trees in the ensemble
    max_depth=None,      # grow each tree until its leaves are pure
    min_samples_leaf=1,
    random_state=42
)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'\nMean Squared Error (MSE): {mse:.2f}')
print(f'Coefficient of determination (R²): {r2:.2f}')
feature_importances = pd.Series(rf_model.feature_importances_, index=features).sort_values(ascending=False)
plt.figure(figsize=(12,6))
sns.barplot(x=feature_importances.values, y=feature_importances.index)
plt.title('Feature importance in the Random Forest')
plt.show()
tree = rf_model.estimators_[0]
plt.figure(figsize=(20,10))
plot_tree(tree, feature_names=features, filled=True, rounded=True, fontsize=10)
plt.title("Пример одного дерева из Random Forest")
plt.show()Вывод:
First 5 rows of the dataset:
sr.no. province division district households population_2023 \
0 1 Balochistan Kalat Awaran 27808 178958
1 2 Balochistan Kalat Kalat 33415 272506
2 3 Balochistan Kalat Khuzdar 161594 997214
3 4 Balochistan Kalat Lasbela 115635 680977
4 5 Balochistan Kalat Mastung 43736 313271
average_household_size population_2017 growth_rate area(km²) ... \
0 6.44 121821 6.64 29510 ...
1 8.16 211201 4.35 6622 ...
2 6.17 798896 3.78 35380 ...
3 5.89 576271 2.83 15153 ...
4 7.16 265676 2.79 5896 ...
primary_girls_schools secondary_boys_schools secondary_girls_schools \
0 107 24 13
1 84 23 18
2 177 45 46
3 124 46 29
4 94 24 31
high_boys_schools high_girls_schools Intermediate_boys_schools \
0 29 10 2
1 27 17 1
2 32 21 6
3 27 11 4
4 18 18 2
Intermediate_girls_schools total_boys_schools total_girls_schools \
0 2 249 132
1 1 269 120
2 1 569 245
3 4 507 168
4 0 268 143
total_schools
0 381
1 389
2 814
3 675
4 411
[5 rows x 22 columns]
Mean Squared Error (MSE): 7068997444.97
Coefficient of determination (R²): 0.99
[Figure: feature importance in the Random Forest]
[Figure: example of a single tree from the Random Forest]

- Implement boosting on the same data that was used for bagging.
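Unlike bagging, boosting builds trees sequentially: each new tree is fitted to the residual errors of the ensemble built so far. Before the library implementation below, here is a hand-rolled toy sketch of this idea for squared loss (all names are illustrative, not part of the assignment):

# Toy gradient boosting for squared loss: every shallow tree is fitted
# to the residuals (the negative gradient of MSE) of the current ensemble.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_gradient_boosting(X, y, n_estimators=50, learning_rate=0.1):
    prediction = np.full(len(y), float(y.mean()))  # start from the mean
    trees = []
    for _ in range(n_estimators):
        residuals = y - prediction
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return trees, float(y.mean())

def toy_gb_predict(trees, baseline, X, learning_rate=0.1):
    pred = np.full(X.shape[0], baseline)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred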
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Load the data and separate the features from the target
data = pd.read_csv('/content/data_population2023.csv')
features = data.drop(columns=['sr.no.', 'population_2023', 'province', 'division', 'district'])
target = data['population_2023']
# Tree-based models are insensitive to feature scale, so no normalization is needed
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
rf = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,        # shallower trees than in the previous section
    min_samples_leaf=5,
    random_state=42
)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print("Random Forest R^2:", r2_score(y_test, y_pred_rf))
print("Random Forest MSE:", mean_squared_error(y_test, y_pred_rf))
gbr = GradientBoostingRegressor(
    n_estimators=200,
    learning_rate=0.1,   # shrinkage applied to each tree's contribution
    max_depth=3,         # boosting typically uses shallow trees
    min_samples_leaf=5,
    random_state=42
)
gbr.fit(X_train, y_train)
y_pred_gbr = gbr.predict(X_test)
print("Gradient Boosting R^2:", r2_score(y_test, y_pred_gbr))
print("Gradient Boosting MSE:", mean_squared_error(y_test, y_pred_gbr))
print("R^2 Random Forest:", r2_score(y_test, y_pred_rf))
print("R^2 Gradient Boosting:", r2_score(y_test, y_pred_gbr))Вывод:
Random Forest R^2: 0.9902127230594779
Random Forest MSE: 8824200825.067791
Gradient Boosting R^2: 0.9680482164967752
Gradient Boosting MSE: 28807701678.92102
- Compare how the algorithms perform (training time and model quality) and draw conclusions.
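A single train/test split makes the quality comparison somewhat noisy. As a supplement to the timing code below, a sketch of a cross-validated comparison (assuming features and target from the previous cell are still in scope):

# Cross-validated R^2 gives a more stable quality estimate than one split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

for name, model in [
    ('Random Forest', RandomForestRegressor(n_estimators=100, random_state=42)),
    ('Gradient Boosting', GradientBoostingRegressor(n_estimators=200, random_state=42)),
]:
    scores = cross_val_score(model, features, target, cv=5, scoring='r2')
    print(f'{name}: mean R^2 = {scores.mean():.4f} +/- {scores.std():.4f}')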
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
import time

# Load the data; strip stray whitespace from column names
data = pd.read_csv("data_population2023.csv")
data.columns = data.columns.str.strip()
print(data.columns)
features = data.drop(columns=['sr.no.', 'population_2023', 'province', 'division', 'district'])
target = data['population_2023']
# Both models get the same unscaled features: trees are scale-invariant,
# and identical inputs keep the timing comparison fair
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
rf = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,
    min_samples_leaf=5,
    random_state=42
)
start_time = time.time()
rf.fit(X_train, y_train)
rf_time = time.time() - start_time
y_pred_rf = rf.predict(X_test)
rf_r2 = r2_score(y_test, y_pred_rf)
rf_mse = mean_squared_error(y_test, y_pred_rf)
print(f"Random Forest: R^2 = {rf_r2:.4f}, MSE = {rf_mse:.2f}, Время = {rf_time:.2f} сек")
gbr = GradientBoostingRegressor(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,
    min_samples_leaf=5,
    random_state=42
)
start_time = time.time()
gbr.fit(X_train, y_train)
gbr_time = time.time() - start_time
y_pred_gbr = gbr.predict(X_test)
gbr_r2 = r2_score(y_test, y_pred_gbr)
gbr_mse = mean_squared_error(y_test, y_pred_gbr)
print(f"Gradient Boosting: R^2 = {gbr_r2:.4f}, MSE = {gbr_mse:.2f}, time = {gbr_time:.2f} s")
comparison = pd.DataFrame({
    'Model': ['Random Forest (Bagging)', 'Gradient Boosting'],
    'R^2': [rf_r2, gbr_r2],
    'MSE': [rf_mse, gbr_mse],
    'Training time (s)': [rf_time, gbr_time]
})
print(comparison)

Output:
Index(['sr.no.', 'province', 'division', 'district', 'households',
'population_2023', 'average_household_size', 'population_2017',
'growth_rate', 'area(km²)', 'density_2023(people/km²)',
'primary_boys_schools', 'primary_girls_schools',
'secondary_boys_schools', 'secondary_girls_schools',
'high_boys_schools', 'high_girls_schools', 'Intermediate_boys_schools',
'Intermediate_girls_schools', 'total_boys_schools',
'total_girls_schools', 'total_schools'],
dtype='object')
Random Forest: R^2 = 0.9902, MSE = 8824200825.07, time = 0.34 s
Gradient Boosting: R^2 = 0.9680, MSE = 28807701678.92, time = 0.38 s
                     Model       R^2           MSE  Training time (s)
0  Random Forest (Bagging)  0.990213  8.824201e+09           0.341938
1        Gradient Boosting  0.968048  2.880770e+10           0.379149

Conclusion:
In this practical work, a regression task was solved on data about district populations and schools.
Bagging was implemented first, using a random forest, which produced stable predictions and trained quickly.
Gradient boosting was then applied to the same data; with the chosen hyperparameters it scored slightly lower (R² = 0.968 vs 0.990, with a correspondingly higher MSE) and took marginally longer to train (0.38 s vs 0.34 s).
The comparison shows that a random forest provides a fast and robust baseline, while boosting usually requires more careful tuning of the number of trees, the learning rate, and the tree depth to realize its accuracy potential.
Overall, the results illustrate the key advantage of ensemble methods: bagging primarily reduces variance, while boosting primarily reduces bias.
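The title of this work also lists stacking, which the notebook does not implement. For completeness, a minimal sketch of how stacking could be added on the same data with scikit-learn's StackingRegressor (the variables features and target assume the earlier cells are in scope):

# Stacking: base models' out-of-fold predictions become the inputs
# of a meta-model that learns how to combine them.
from sklearn.ensemble import StackingRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

stack = StackingRegressor(
    estimators=[
        ('rf', RandomForestRegressor(n_estimators=100, random_state=42)),
        ('gbr', GradientBoostingRegressor(n_estimators=200, random_state=42)),
    ],
    final_estimator=Ridge()  # simple linear meta-model
)
scores = cross_val_score(stack, features, target, cv=5, scoring='r2')
print(f'Stacking: mean R^2 = {scores.mean():.4f}')

A linear meta-model is a common default here: the base ensembles already capture the nonlinearities, so the final estimator only needs to learn how to weight their predictions.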