Updated Exploratory Data Analysis¶
IMPORTS
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
from scipy.stats import shapiro
from scipy.stats import spearmanr
from scipy.stats import pearsonr
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from datetime import datetime, timedelta
from io import StringIO
from math import radians, cos, sin, asin, sqrt
Extract Data
- Metro Manila Cities CO2 Vehicle Emission
df1 = pd.read_csv('Downloads/Metro Manila Monthly Emisions - Metro Manila Monthly CO2 Emissions from Vehicle.csv')
co2_df = df1[['source_name','start_time','emissions_quantity']]
co2_df.head()
| source_name | start_time | emissions_quantity | |
|---|---|---|---|
| 0 | Kalookan City | 01/01/2021 0:00 | 8273.046 |
| 1 | Kalookan City | 01/02/2021 0:00 | 7430.703 |
| 2 | Kalookan City | 01/03/2021 0:00 | 8501.110 |
| 3 | Kalookan City | 01/04/2021 0:00 | 8277.595 |
| 4 | Kalookan City | 01/05/2021 0:00 | 8713.652 |
- Temperatures in Metro Manila and Rural Regions Around It
df2 = pd.read_csv('Downloads/temps.csv')
temp_df = df2[['location_id', 'time','apparent_temperature_max (°C)']]
temp_df.head()
| location_id | time | apparent_temperature_max (°C) | |
|---|---|---|---|
| 0 | 0 | 2015-01-01 | 24.6 |
| 1 | 0 | 2015-01-02 | 25.0 |
| 2 | 0 | 2015-01-03 | 27.0 |
| 3 | 0 | 2015-01-04 | 29.6 |
| 4 | 0 | 2015-01-05 | 31.6 |
- Temperatures in Metro Manila and Rural Regions Around It (METADATA)
metadata = pd.read_csv('Downloads/metadata.csv')
metadata.head()
| location_id | latitude | longitude | elevation | utc_offset_seconds | timezone | timezone_abbreviation | |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 14.586995 | 121.002785 | 6.0 | 28800 | Asia/Singapore | GMT+8 |
| 1 | 1 | 15.008787 | 121.176476 | 451.0 | 28800 | Asia/Singapore | GMT+8 |
| 2 | 2 | 14.586995 | 121.337040 | 340.0 | 28800 | Asia/Singapore | GMT+8 |
| 3 | 3 | 14.235501 | 121.857666 | 0.0 | 28800 | Asia/Singapore | GMT+8 |
| 4 | 4 | 14.657293 | 121.449814 | 409.0 | 28800 | Asia/Singapore | GMT+8 |
Data Preprocessing¶
First our goal is to preprocess the data we need for our correlation analysis.
The data that we need is listed below:
uhi = Monthly Urban Heat Island Intensity in Metro Manila
- Metro Manila Monthly Temperature
- Rural Areas Monthly Temperature
co2 = Monthly CO2 Vehicle Emissions in Metro Manila
- CO2 Vehicle Emissions of Vehicles in each City of Metro Manila
Preprocess Monthly CO2 Emissions in Metro Manila
def sum(values):
s = 0
for val in values:
s += val
return s
co2_df['start_time'] = co2_df['start_time'].astype(str)
start_date = datetime.strptime(co2_df.loc[0, 'start_time'][:10], "%d/%m/%Y")
end_date = datetime.strptime('01/12/2024', "%d/%m/%Y")
co2_df['start_time'] = pd.to_datetime(co2_df['start_time'].str[:10], format='%d/%m/%Y')
data = []
while start_date <= end_date:
data.append({'time': start_date, 'emissions': sum(co2_df.loc[co2_df['start_time']==start_date, 'emissions_quantity'])})
start_date = start_date + pd.DateOffset(months=1)
co2 = pd.DataFrame(data)
co2.head(len(co2))
C:\Users\emilm\AppData\Local\Temp\ipykernel_5064\1027906301.py:7: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy co2_df['start_time'] = co2_df['start_time'].astype(str) C:\Users\emilm\AppData\Local\Temp\ipykernel_5064\1027906301.py:12: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy co2_df['start_time'] = pd.to_datetime(co2_df['start_time'].str[:10], format='%d/%m/%Y')
| time | emissions | |
|---|---|---|
| 0 | 2021-01-01 | 377670.22149 |
| 1 | 2021-02-01 | 339525.14791 |
| 2 | 2021-03-01 | 387995.25552 |
| 3 | 2021-04-01 | 377475.05045 |
| 4 | 2021-05-01 | 396570.43498 |
| 5 | 2021-06-01 | 379921.14372 |
| 6 | 2021-07-01 | 392807.14084 |
| 7 | 2021-08-01 | 390733.51388 |
| 8 | 2021-09-01 | 379318.16057 |
| 9 | 2021-10-01 | 392591.88060 |
| 10 | 2021-11-01 | 381343.78644 |
| 11 | 2021-12-01 | 386515.27786 |
| 12 | 2022-01-01 | 381604.33461 |
| 13 | 2022-02-01 | 353575.80509 |
| 14 | 2022-03-01 | 400996.92616 |
| 15 | 2022-04-01 | 391602.20090 |
| 16 | 2022-05-01 | 408975.94898 |
| 17 | 2022-06-01 | 394134.70433 |
| 18 | 2022-07-01 | 404456.35178 |
| 19 | 2022-08-01 | 397526.56620 |
| 20 | 2022-09-01 | 383068.23510 |
| 21 | 2022-10-01 | 391313.28340 |
| 22 | 2022-11-01 | 377836.88060 |
| 23 | 2022-12-01 | 378581.76460 |
| 24 | 2023-01-01 | 368564.99500 |
| 25 | 2023-02-01 | 337600.68730 |
| 26 | 2023-03-01 | 378488.24370 |
| 27 | 2023-04-01 | 365214.93230 |
| 28 | 2023-05-01 | 376824.74200 |
| 29 | 2023-06-01 | 358761.45820 |
| 30 | 2023-07-01 | 367646.73790 |
| 31 | 2023-08-01 | 365413.87780 |
| 32 | 2023-09-01 | 356072.74510 |
| 33 | 2023-10-01 | 367862.70530 |
| 34 | 2023-11-01 | 359216.45320 |
| 35 | 2023-12-01 | 364077.70730 |
| 36 | 2024-01-01 | 358601.57630 |
| 37 | 2024-02-01 | 344006.02104 |
| 38 | 2024-03-01 | 376537.27276 |
| 39 | 2024-04-01 | 367546.37059 |
| 40 | 2024-05-01 | 383132.59850 |
| 41 | 2024-06-01 | 367984.67010 |
| 42 | 2024-07-01 | 378345.58970 |
| 43 | 2024-08-01 | 374929.52700 |
| 44 | 2024-09-01 | 356072.74510 |
| 45 | 2024-10-01 | 367862.70530 |
| 46 | 2024-11-01 | 359216.45320 |
| 47 | 2024-12-01 | 364077.70730 |
We now plot the time-series of CO2 vehicle emission in Metro Manila to check its trend over the years
co2['time'] = pd.to_datetime(co2['time'], format="%Y-%m-%d")
plt.figure(figsize=(10, 6))
plt.plot(co2['time'], co2['emissions'], color="#2e292b")#f0d7cc, #244C33, #2e292b
plt.title('CO2 Emissions in Metro Manila')
plt.xlabel('Time')
plt.ylabel('CO2 (in tons)')
# Improve x-axis formatting for dates
plt.gcf().autofmt_xdate() # Auto-rotate date labels
plt.tight_layout()
plt.show()
Preprocess Monthly UHI Intensity in Metro Manila
- Some necessary functions for UHI data preprocessing
def get_distance(lat1, lon1, lat2, lon2): #computes the distance between two regions using latitude and longitude
# Convert decimal degrees to radians
lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
# Haversine formula
dlat = lat2 - lat1
dlon = lon2 - lon1
a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
c = 2 * asin(sqrt(a))
# Radius of Earth in kilometers (use 6371 for km or 3956 for miles)
r = 6371
return c * r
def normalize_distance(metadata): # Returns a list of normalized distance for regions idx 1-6
metro_manila_lat = metadata.loc[0, 'latitude']
metro_manila_lon = metadata.loc[0, 'longitude']
distance = []
for i in range(1, 7):
region_lat = metadata.loc[i, 'latitude']
region_lon = metadata.loc[i, 'longitude']
distance.append(get_distance(metro_manila_lat, metro_manila_lon, region_lat, region_lon))
norm_distance = []
for d in distance:
norm_distance.append(max(distance) - d / max(distance) - min(distance))
return norm_distance
def normalize_elevation(metadata):
metro_manila_elev = metadata.loc[0, 'elevation']
elevation = []
for i in range(1, 7):
region_elev = metadata.loc[i, 'elevation']
elevation.append(abs(region_elev - metro_manila_elev))
norm_elevation = []
for e in elevation:
norm_elevation.append(max(elevation) - e / max(elevation) - min(elevation))
return norm_elevation
def weighted_avg(distance, elevation, data):
sum = 0
sum_weight = 0
for i in range(6):
tot_weight = distance[i]*0.6 + elevation[i]*0.4
sum_weight += tot_weight
sum += data[i]*tot_weight
return sum/sum_weight
UHI intensity in Metro Manila is calculated by subtracting the weighted average of apparent maximum temperature of rural regions around Metro Manila, namely:
- Bulacan
- Rizal
- Cavite
- Bataan
- Pampanga
- Laguna
to the apparent maximum temperature of Metro Manila.
A region is given more weight if it is closer to Metro Manila by distance and by elevation.
In the metadata, the rural regions are classified by location_id 1 to 6 while Metro Manila has location_id 0.
start_date = datetime.strptime(temp_df.loc[0, 'time'], "%Y-%m-%d")
end_date = datetime.strptime('2024-12-31', "%Y-%m-%d")
print(start_date)
print(end_date)
2015-01-01 00:00:00 2024-12-31 00:00:00
Extract Aggragated Daily Apparent Maximum Temperature of Rural Regions
distance = normalize_distance(metadata)
elevation = normalize_elevation(metadata)
rural_data = []
temp_df['time'] = pd.to_datetime(temp_df['time'], format="%Y-%m-%d")
while start_date <= end_date:
values = temp_df.loc[temp_df['time'] == start_date, 'apparent_temperature_max (°C)'].tolist()
avg = {'time':start_date, 'temp': weighted_avg(distance, elevation, values[1:])}
rural_data.append(avg)
start_date = start_date + pd.DateOffset(days=1)
rural_df = pd.DataFrame(rural_data)
rural_df.head()
C:\Users\emilm\AppData\Local\Temp\ipykernel_5064\517857554.py:6: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy temp_df['time'] = pd.to_datetime(temp_df['time'], format="%Y-%m-%d")
| time | temp | |
|---|---|---|
| 0 | 2015-01-01 | 22.700370 |
| 1 | 2015-01-02 | 22.300317 |
| 2 | 2015-01-03 | 24.700380 |
| 3 | 2015-01-04 | 26.766945 |
| 4 | 2015-01-05 | 28.416957 |
Extract Metro Manila Daily Apparent Maximum Temperature to Compute UHI Intensity
urban_df = temp_df.loc[temp_df['location_id'] == 0, ['time', 'apparent_temperature_max (°C)']]
urban_df = urban_df[urban_df['time'] != pd.to_datetime('2025-01-01')]
urban_df.head()
| time | apparent_temperature_max (°C) | |
|---|---|---|
| 0 | 2015-01-01 | 24.6 |
| 1 | 2015-01-02 | 25.0 |
| 2 | 2015-01-03 | 27.0 |
| 3 | 2015-01-04 | 29.6 |
| 4 | 2015-01-05 | 31.6 |
Plot Time-series of Apparent Max Temp of Metro Manila and Rural Regions Around It
plt.figure(figsize=(10, 5))
plt.plot(rural_df['time'], rural_df['temp'], label='Rural Regions', marker='o')
plt.plot(urban_df['time'], urban_df['apparent_temperature_max (°C)'], label='Metro Manila', marker='s')
plt.xlabel('Days')
plt.ylabel('Temperature (°C)')
plt.title('Temperature Trends at Metro Manila and Rural Regions Around It')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
From this graph, we can see that Metro Manila has a slightly higher temperature than rural regions around it. We can see this clearer if we compute the UHI intensity in Metro Manila and plot it once again.
UHI_days = pd.DataFrame({'days' : rural_df['time'], 'uhi_intensity' : urban_df['apparent_temperature_max (°C)'] - rural_df['temp']})
UHI_days.head()
| days | uhi_intensity | |
|---|---|---|
| 0 | 2015-01-01 | 1.899630 |
| 1 | 2015-01-02 | 2.699683 |
| 2 | 2015-01-03 | 2.299620 |
| 3 | 2015-01-04 | 2.833055 |
| 4 | 2015-01-05 | 3.183043 |
plt.plot(UHI_days['days'], UHI_days['uhi_intensity'])
plt.title('UHI Intensity in Metro Manila')
plt.xlabel('Days')
plt.ylabel('UHI Intensity')
plt.tight_layout()
plt.show()
Let's smoothen the trend to fit the monthly vehicle emissions in Metro Manila
uhi = UHI_days.set_index('days').resample('MS').mean().reset_index()
uhi = uhi.rename(columns={'days': 'time'}) # Now 'days' is a column again
uhi.head(len(uhi))
| time | uhi_intensity | |
|---|---|---|
| 0 | 2015-01-01 | 2.302392 |
| 1 | 2015-02-01 | 2.110446 |
| 2 | 2015-03-01 | 2.540628 |
| 3 | 2015-04-01 | 2.765333 |
| 4 | 2015-05-01 | 1.565850 |
| ... | ... | ... |
| 115 | 2024-08-01 | 2.129909 |
| 116 | 2024-09-01 | 2.234727 |
| 117 | 2024-10-01 | 3.851909 |
| 118 | 2024-11-01 | 5.033693 |
| 119 | 2024-12-01 | 6.350254 |
120 rows × 2 columns
plt.figure(figsize=(10, 6))
plt.plot(uhi['time'], uhi['uhi_intensity'])
plt.title('UHI Intensity in Metro Manila')
plt.xlabel('Time')
plt.ylabel('UHI Intensity')
# Improve x-axis formatting for dates
plt.gcf().autofmt_xdate() # Auto-rotate date labels
plt.tight_layout()
plt.show()
As we can see in the graph above, the UHI intensity in Metro Manila increased from 2015-2024, starting at a difference of 2 degrees celsius and ending with a difference of 6 degrees celsius.
Test For Normality
sns.histplot(co2['emissions'], kde=True)
plt.title('CO2 Vehicle Emissions in Metro Manila')
plt.show()
sns.histplot(uhi['uhi_intensity'], kde=True)
plt.title('UHI Intensity in Metro Manila')
plt.show()
Looks like co2 emission data is normal whilst uhi intensity has a bimodal shape.
UHI intensity seems to cluster around 2 degrees celsius difference and 5-6 degrees celsius difference. We can also observe that the trend of UHI intensity tends to dip during midyear, which can be due to seasonal changes.
What we can do is to cluster the UHI dataset into two seasons, the Wet and the Dry season.
The Dry season is classified as the UHI during months {1,2,3,4,11,12} and Wet season are the months that do not belong to this set.
uhi['month'] = uhi['time'].dt.month
uhi['season'] = uhi['month'].apply(lambda m: 'Dry' if m in [1,2,3,4,5,11,12] else 'Wet')
uhi.head(len(uhi))
| time | uhi_intensity | month | season | |
|---|---|---|---|---|
| 0 | 2015-01-01 | 2.302392 | 1 | Dry |
| 1 | 2015-02-01 | 2.110446 | 2 | Dry |
| 2 | 2015-03-01 | 2.540628 | 3 | Dry |
| 3 | 2015-04-01 | 2.765333 | 4 | Dry |
| 4 | 2015-05-01 | 1.565850 | 5 | Dry |
| ... | ... | ... | ... | ... |
| 115 | 2024-08-01 | 2.129909 | 8 | Wet |
| 116 | 2024-09-01 | 2.234727 | 9 | Wet |
| 117 | 2024-10-01 | 3.851909 | 10 | Wet |
| 118 | 2024-11-01 | 5.033693 | 11 | Dry |
| 119 | 2024-12-01 | 6.350254 | 12 | Dry |
120 rows × 4 columns
We check if our data is now normal if we cluster it per season
uhi_wet = uhi.loc[uhi['season'] == 'Wet']
sns.histplot(uhi_wet['uhi_intensity'], kde=True)
plt.title('UHI Intensity in Metro Manila (Wet Season)')
plt.show()
uhi_dry = uhi.loc[uhi['season'] == 'Dry']
sns.histplot(uhi_dry['uhi_intensity'], kde=True)
plt.title('UHI Intensity in Metro Manila (Dry Season)')
plt.show()
wet = uhi.loc[uhi['season'] == 'Wet', 'uhi_intensity']
dry = uhi.loc[uhi['season'] == 'Dry', 'uhi_intensity']
print("Wet:", shapiro(wet))
print("Dry:", shapiro(dry))
Wet: ShapiroResult(statistic=np.float64(0.9855929347076213), pvalue=np.float64(0.7967748536985817)) Dry: ShapiroResult(statistic=np.float64(0.8847006381621425), pvalue=np.float64(1.0152257530121138e-05))
Thus, our hypothesis is true that seasonal changes affect the UHI intensity of Metro Manila
However, the dry season still is not normal, maybe we can find the months that are potential outliers for the dry season.
find_outliers = uhi_dry.loc[uhi_dry['uhi_intensity'] < 3, ['time', 'month']]
find_outliers.head(len(find_outliers))
| time | month | |
|---|---|---|
| 0 | 2015-01-01 | 1 |
| 1 | 2015-02-01 | 2 |
| 2 | 2015-03-01 | 3 |
| 3 | 2015-04-01 | 4 |
| 4 | 2015-05-01 | 5 |
| 10 | 2015-11-01 | 11 |
| 11 | 2015-12-01 | 12 |
| 12 | 2016-01-01 | 1 |
| 13 | 2016-02-01 | 2 |
| 14 | 2016-03-01 | 3 |
| 15 | 2016-04-01 | 4 |
| 16 | 2016-05-01 | 5 |
| 22 | 2016-11-01 | 11 |
| 23 | 2016-12-01 | 12 |
| 52 | 2019-05-01 | 5 |
We can see that during years 2015 and 2016, a difference of less than 3 degrees celsius is the norm. But as the years go by, the temperature differece spiked to 5-6 degrees celsius difference in the dry season.
Since our emission data consists of only 2021 onwards, we no longer need these years but we will proceed in analyzing the correlation between emissions and UHI intensity per season.
uhi = uhi.loc[uhi['time'] >= pd.to_datetime('2021-01-01')].reset_index()
uhi.head(len(uhi))
| index | time | uhi_intensity | month | season | |
|---|---|---|---|---|---|
| 0 | 72 | 2021-01-01 | 5.631497 | 1 | Dry |
| 1 | 73 | 2021-02-01 | 5.513525 | 2 | Dry |
| 2 | 74 | 2021-03-01 | 5.515976 | 3 | Dry |
| 3 | 75 | 2021-04-01 | 4.908226 | 4 | Dry |
| 4 | 76 | 2021-05-01 | 4.619258 | 5 | Dry |
| 5 | 77 | 2021-06-01 | 2.083114 | 6 | Wet |
| 6 | 78 | 2021-07-01 | 2.198122 | 7 | Wet |
| 7 | 79 | 2021-08-01 | 2.255679 | 8 | Wet |
| 8 | 80 | 2021-09-01 | 3.275938 | 9 | Wet |
| 9 | 81 | 2021-10-01 | 4.454611 | 10 | Wet |
| 10 | 82 | 2021-11-01 | 5.915847 | 11 | Dry |
| 11 | 83 | 2021-12-01 | 6.502946 | 12 | Dry |
| 12 | 84 | 2022-01-01 | 5.414303 | 1 | Dry |
| 13 | 85 | 2022-02-01 | 6.689056 | 2 | Dry |
| 14 | 86 | 2022-03-01 | 6.306831 | 3 | Dry |
| 15 | 87 | 2022-04-01 | 5.870384 | 4 | Dry |
| 16 | 88 | 2022-05-01 | 3.495092 | 5 | Dry |
| 17 | 89 | 2022-06-01 | 3.297120 | 6 | Wet |
| 18 | 90 | 2022-07-01 | 2.904144 | 7 | Wet |
| 19 | 91 | 2022-08-01 | 2.400901 | 8 | Wet |
| 20 | 92 | 2022-09-01 | 1.804772 | 9 | Wet |
| 21 | 93 | 2022-10-01 | 4.174536 | 10 | Wet |
| 22 | 94 | 2022-11-01 | 5.530873 | 11 | Dry |
| 23 | 95 | 2022-12-01 | 5.826043 | 12 | Dry |
| 24 | 96 | 2023-01-01 | 6.559921 | 1 | Dry |
| 25 | 97 | 2023-02-01 | 6.088468 | 2 | Dry |
| 26 | 98 | 2023-03-01 | 6.444481 | 3 | Dry |
| 27 | 99 | 2023-04-01 | 5.060991 | 4 | Dry |
| 28 | 100 | 2023-05-01 | 3.771385 | 5 | Dry |
| 29 | 101 | 2023-06-01 | 2.365374 | 6 | Wet |
| 30 | 102 | 2023-07-01 | 2.540081 | 7 | Wet |
| 31 | 103 | 2023-08-01 | 1.279245 | 8 | Wet |
| 32 | 104 | 2023-09-01 | 2.255993 | 9 | Wet |
| 33 | 105 | 2023-10-01 | 4.101384 | 10 | Wet |
| 34 | 106 | 2023-11-01 | 6.643614 | 11 | Dry |
| 35 | 107 | 2023-12-01 | 6.240068 | 12 | Dry |
| 36 | 108 | 2024-01-01 | 6.912077 | 1 | Dry |
| 37 | 109 | 2024-02-01 | 6.623362 | 2 | Dry |
| 38 | 110 | 2024-03-01 | 6.069189 | 3 | Dry |
| 39 | 111 | 2024-04-01 | 4.986055 | 4 | Dry |
| 40 | 112 | 2024-05-01 | 4.732681 | 5 | Dry |
| 41 | 113 | 2024-06-01 | 3.254375 | 6 | Wet |
| 42 | 114 | 2024-07-01 | 3.138519 | 7 | Wet |
| 43 | 115 | 2024-08-01 | 2.129909 | 8 | Wet |
| 44 | 116 | 2024-09-01 | 2.234727 | 9 | Wet |
| 45 | 117 | 2024-10-01 | 3.851909 | 10 | Wet |
| 46 | 118 | 2024-11-01 | 5.033693 | 11 | Dry |
| 47 | 119 | 2024-12-01 | 6.350254 | 12 | Dry |
uhi['season'] = uhi['month'].apply(lambda m: 'Dry' if m in [1,2,3,4,11,12] else 'Wet')
uhi_dry = uhi.loc[uhi['season'] == 'Dry']
sns.histplot(uhi_dry['uhi_intensity'], kde=True)
plt.title('UHI Intensity in Metro Manila (Dry Season)')
plt.show()
uhi_wet = uhi.loc[uhi['season'] == 'Wet']
sns.histplot(uhi_wet['uhi_intensity'], kde=True)
plt.title('UHI Intensity in Metro Manila (Wet Season)')
plt.show()
wet = uhi.loc[uhi['season'] == 'Wet', 'uhi_intensity']
dry = uhi.loc[uhi['season'] == 'Dry', 'uhi_intensity']
print("Wet:", shapiro(wet))
print("Dry:", shapiro(dry))
Wet: ShapiroResult(statistic=np.float64(0.9503512024972561), pvalue=np.float64(0.2757285588204675)) Dry: ShapiroResult(statistic=np.float64(0.9474607387350519), pvalue=np.float64(0.2385556747288774))
Both groups now looks normal and it is ready for correlation analysis
uhi = pd.merge(co2, uhi, on='time')
uhi.head(len(uhi))
| time | emissions | index | uhi_intensity | month | season | |
|---|---|---|---|---|---|---|
| 0 | 2021-01-01 | 377670.22149 | 72 | 5.631497 | 1 | Dry |
| 1 | 2021-02-01 | 339525.14791 | 73 | 5.513525 | 2 | Dry |
| 2 | 2021-03-01 | 387995.25552 | 74 | 5.515976 | 3 | Dry |
| 3 | 2021-04-01 | 377475.05045 | 75 | 4.908226 | 4 | Dry |
| 4 | 2021-05-01 | 396570.43498 | 76 | 4.619258 | 5 | Wet |
| 5 | 2021-06-01 | 379921.14372 | 77 | 2.083114 | 6 | Wet |
| 6 | 2021-07-01 | 392807.14084 | 78 | 2.198122 | 7 | Wet |
| 7 | 2021-08-01 | 390733.51388 | 79 | 2.255679 | 8 | Wet |
| 8 | 2021-09-01 | 379318.16057 | 80 | 3.275938 | 9 | Wet |
| 9 | 2021-10-01 | 392591.88060 | 81 | 4.454611 | 10 | Wet |
| 10 | 2021-11-01 | 381343.78644 | 82 | 5.915847 | 11 | Dry |
| 11 | 2021-12-01 | 386515.27786 | 83 | 6.502946 | 12 | Dry |
| 12 | 2022-01-01 | 381604.33461 | 84 | 5.414303 | 1 | Dry |
| 13 | 2022-02-01 | 353575.80509 | 85 | 6.689056 | 2 | Dry |
| 14 | 2022-03-01 | 400996.92616 | 86 | 6.306831 | 3 | Dry |
| 15 | 2022-04-01 | 391602.20090 | 87 | 5.870384 | 4 | Dry |
| 16 | 2022-05-01 | 408975.94898 | 88 | 3.495092 | 5 | Wet |
| 17 | 2022-06-01 | 394134.70433 | 89 | 3.297120 | 6 | Wet |
| 18 | 2022-07-01 | 404456.35178 | 90 | 2.904144 | 7 | Wet |
| 19 | 2022-08-01 | 397526.56620 | 91 | 2.400901 | 8 | Wet |
| 20 | 2022-09-01 | 383068.23510 | 92 | 1.804772 | 9 | Wet |
| 21 | 2022-10-01 | 391313.28340 | 93 | 4.174536 | 10 | Wet |
| 22 | 2022-11-01 | 377836.88060 | 94 | 5.530873 | 11 | Dry |
| 23 | 2022-12-01 | 378581.76460 | 95 | 5.826043 | 12 | Dry |
| 24 | 2023-01-01 | 368564.99500 | 96 | 6.559921 | 1 | Dry |
| 25 | 2023-02-01 | 337600.68730 | 97 | 6.088468 | 2 | Dry |
| 26 | 2023-03-01 | 378488.24370 | 98 | 6.444481 | 3 | Dry |
| 27 | 2023-04-01 | 365214.93230 | 99 | 5.060991 | 4 | Dry |
| 28 | 2023-05-01 | 376824.74200 | 100 | 3.771385 | 5 | Wet |
| 29 | 2023-06-01 | 358761.45820 | 101 | 2.365374 | 6 | Wet |
| 30 | 2023-07-01 | 367646.73790 | 102 | 2.540081 | 7 | Wet |
| 31 | 2023-08-01 | 365413.87780 | 103 | 1.279245 | 8 | Wet |
| 32 | 2023-09-01 | 356072.74510 | 104 | 2.255993 | 9 | Wet |
| 33 | 2023-10-01 | 367862.70530 | 105 | 4.101384 | 10 | Wet |
| 34 | 2023-11-01 | 359216.45320 | 106 | 6.643614 | 11 | Dry |
| 35 | 2023-12-01 | 364077.70730 | 107 | 6.240068 | 12 | Dry |
| 36 | 2024-01-01 | 358601.57630 | 108 | 6.912077 | 1 | Dry |
| 37 | 2024-02-01 | 344006.02104 | 109 | 6.623362 | 2 | Dry |
| 38 | 2024-03-01 | 376537.27276 | 110 | 6.069189 | 3 | Dry |
| 39 | 2024-04-01 | 367546.37059 | 111 | 4.986055 | 4 | Dry |
| 40 | 2024-05-01 | 383132.59850 | 112 | 4.732681 | 5 | Wet |
| 41 | 2024-06-01 | 367984.67010 | 113 | 3.254375 | 6 | Wet |
| 42 | 2024-07-01 | 378345.58970 | 114 | 3.138519 | 7 | Wet |
| 43 | 2024-08-01 | 374929.52700 | 115 | 2.129909 | 8 | Wet |
| 44 | 2024-09-01 | 356072.74510 | 116 | 2.234727 | 9 | Wet |
| 45 | 2024-10-01 | 367862.70530 | 117 | 3.851909 | 10 | Wet |
| 46 | 2024-11-01 | 359216.45320 | 118 | 5.033693 | 11 | Dry |
| 47 | 2024-12-01 | 364077.70730 | 119 | 6.350254 | 12 | Dry |
Pearson Correlation Analysis
for group in uhi['season'].unique():
sub = uhi[uhi['season'] == group]
rho, pval = spearmanr(sub['emissions'], sub['uhi_intensity'])
print(f"{group} -> Spearman r = {rho:.3f}, p = {pval:.3f}")
Dry -> Spearman r = -0.228, p = 0.283 Wet -> Spearman r = 0.309, p = 0.141
Spearman Correlation Analysis
for group in uhi['season'].unique():
sub = uhi[uhi['season'] == group]
rho, pval = pearsonr(sub['emissions'], sub['uhi_intensity'])
print(f"{group} -> Pearson r = {rho:.3f}, p = {pval:.3f}")
Dry -> Pearson r = -0.148, p = 0.489 Wet -> Pearson r = 0.306, p = 0.146
plt.figure(figsize=(10, 6))
sns.lmplot(data=uhi.loc[uhi['season'] == 'Wet'], x='emissions', y='uhi_intensity', hue='season')
plt.title('CO2 Vehicle Emission vs. UHI Intensity in Metro Manila During Wet Season')
plt.show()
<Figure size 1000x600 with 0 Axes>
plt.figure(figsize=(10, 6))
sns.lmplot(data=uhi.loc[uhi['season'] == 'Dry'], x='emissions', y='uhi_intensity', hue='season')
plt.title('CO2 Vehicle Emission vs. UHI Intensity in Metro Manila During Dry Season')
plt.show()
<Figure size 1000x600 with 0 Axes>
From here, we can see that our hypothesis is proven to be true only during the wet season since we have a weak positive correlation and a relatively smaller p-value than the p-value of the dry season. But during the dry season, our result is not reliable since it shows a negative correlation and a large p-value.
CO2 Vehicle Emission is positively correlated to the intensification of the UHI Effect in Metro Manila during the Wet season.
Machine Learning Model¶
To make sure that our results would be meaningful, we must study more on the reason as to why the Dry season is negatively correlated to the CO2 vehicle emission given that the Wet season is positively correlated.
From here, if we discover that there are ways we can potentially reverse the correlation between the CO2 vehicle emissions and the UHI intensity during the Dry season, we can design a model that needs to be sensitive to season changes.
To further investigate, I will plot the time-series of the two groups.
uhi_wet = uhi.loc[uhi['season'] == 'Wet']
fig, ax1 = plt.subplots(figsize=(14,6))
ax1.set_xlabel('time')
ax1.set_ylabel('Emissions (tons)', color='tab:red')
ax1.plot(uhi_wet['time'], uhi_wet['emissions'], color='tab:red', label='Emissions')
ax1.tick_params(axis='y', labelcolor='tab:red')
ax2 = ax1.twinx()
ax2.set_ylabel('UHI Intensity (Celsius)', color='tab:blue')
ax2.plot(uhi_wet['time'], uhi_wet['uhi_intensity'], color='tab:blue', label='UHI')
ax2.tick_params(axis='y', labelcolor='tab:blue')
plt.title('Emissions and UHI Intensity in Metro Manila During Wet Season')
fig.tight_layout()
plt.show()
uhi_dry = uhi.loc[uhi['season'] == 'Dry']
fig, ax1 = plt.subplots(figsize=(14,6))
ax1.set_xlabel('time')
ax1.set_ylabel('Emissions (tons)', color='tab:red')
ax1.plot(uhi_dry['time'], uhi_dry['emissions'], color='tab:red', label='Emissions')
ax1.tick_params(axis='y', labelcolor='tab:red')
ax2 = ax1.twinx()
ax2.set_ylabel('UHI Intensity (Celsius)', color='tab:blue')
ax2.plot(uhi_dry['time'], uhi_dry['uhi_intensity'], color='tab:blue', label='UHI')
ax2.tick_params(axis='y', labelcolor='tab:blue')
plt.title('Emissions and UHI Intensity in Metro Manila During Dry Season')
fig.tight_layout()
plt.show()
We can see that the Vehicle CO2 Emission and UHI Intensity during the Wet Season is pretty consistent with its correlation results since their trends are directly proportional. This is also true for the Dry Season but the relationship is inverse.
However, we realized that if you lag the CO2 emission by an x amount of month during the Dry Season, we can see that their plot would slightly align and turn into a directly proportional relationship.
Let us test that hypothesis by shifting the Dry Season CO2 vehicle emission to the right by 1, 2, or 3 months first.
# Create lagged emissions (e.g., 1 to 3 months)
uhi_dry['emissions_lag1'] = uhi_dry['emissions'].shift(1)
uhi_dry['emissions_lag2'] = uhi_dry['emissions'].shift(2)
uhi_dry['emissions_lag3'] = uhi_dry['emissions'].shift(3)
for lag in ['emissions_lag1', 'emissions_lag2', 'emissions_lag3']:
valid_data = uhi_dry[[lag, 'uhi_intensity']].dropna()
r, p = pearsonr(valid_data[lag], valid_data['uhi_intensity'])
print(f"Lag: {lag} → r = {r:.2f}, p = {p:.3f}")
Lag: emissions_lag1 → r = -0.32, p = 0.136 Lag: emissions_lag2 → r = 0.31, p = 0.154 Lag: emissions_lag3 → r = 0.18, p = 0.427
C:\Users\emilm\AppData\Local\Temp\ipykernel_5064\4145901415.py:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy uhi_dry['emissions_lag1'] = uhi_dry['emissions'].shift(1) C:\Users\emilm\AppData\Local\Temp\ipykernel_5064\4145901415.py:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy uhi_dry['emissions_lag2'] = uhi_dry['emissions'].shift(2) C:\Users\emilm\AppData\Local\Temp\ipykernel_5064\4145901415.py:4: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy uhi_dry['emissions_lag3'] = uhi_dry['emissions'].shift(3)
fig, ax1 = plt.subplots(figsize=(14,6))
ax1.set_xlabel('time')
ax1.set_ylabel('Emissions (tons)', color='tab:red')
ax1.plot(uhi_dry['time'], uhi_dry['emissions_lag2'], color='tab:red', label='Emissions (lag 2)')
ax1.tick_params(axis='y', labelcolor='tab:red')
ax2 = ax1.twinx()
ax2.set_ylabel('UHI Intensity (Celsius)', color='tab:blue')
ax2.plot(uhi_dry['time'], uhi_dry['uhi_intensity'], color='tab:blue', label='UHI')
ax2.tick_params(axis='y', labelcolor='tab:blue')
plt.title('Emissions and UHI Intensity in Metro Manila During Dry Season Lagged by 2 Months')
fig.tight_layout()
plt.show()
As we can see here, we have a relative maximum when we shift the CO2 emissions to the right by two months, proving our hypothesis to be true.
Also, if we observe the plot, we can see that the UHI Intensity and CO2 Vehicle Emissions aligned properly in the graph.
Meaning that during the dry season, the intensification of the UHI effect in Metro Manila is due to the increase in CO2 Vehicle Emissions two months ago.
Thus, we will create another Machine Learning model that shifts the CO2 emissions to the right by two months if there is a Dry season, otherwise, it would retain the original plotting if it is Wet Season.
Creating the Model
Our goal now is to create a model that would take in a CO2 Vehicle Emission value (in tons) and the month that it is measured and would predict the UHI intensity (in celsius) of Metro Manila during that month.
To do this, we need to train two separate model, one for the Wet season and one for the Dry season.
# train model 1 Wet Model
uhi_wet = uhi.loc[uhi['season'] == 'Wet', ['time', 'season', 'emissions', 'uhi_intensity']].reset_index()
X_w = uhi_wet[['emissions']]
Y_w = uhi_wet['uhi_intensity']
split_idx = int(len(uhi_wet) * 0.8)
X_train, X_test = X_w.iloc[:split_idx], X_w.iloc[split_idx:]
y_train, y_test = Y_w.iloc[:split_idx], Y_w.iloc[split_idx:]
wet_model = LinearRegression()
wet_model.fit(X_train, y_train)
y_pred = wet_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2f}, R²: {r2:.2f}")
MSE: 0.44, R²: -0.05
# train model 2 Dry Model
uhi_dry = uhi.loc[uhi['season'] == 'Dry', ['time', 'season', 'emissions', 'uhi_intensity']].reset_index()
uhi_dry['emissions_lag2'] = uhi_dry['emissions'].shift(2)
valid_data = uhi_dry[['emissions_lag2', 'uhi_intensity']].dropna()
X_w = valid_data[['emissions_lag2']]
Y_w = valid_data['uhi_intensity']
split_idx = int(len(valid_data) * 0.8)
X_train, X_test = X_w.iloc[:split_idx], X_w.iloc[split_idx:]
y_train, y_test = Y_w.iloc[:split_idx], Y_w.iloc[split_idx:]
dry_model = LinearRegression()
dry_model.fit(X_train, y_train)
y_pred = dry_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2f}, R²: {r2:.2f}")
MSE: 0.45, R²: 0.02
sns.regplot(x='emissions', y='uhi_intensity', data=uhi_wet)
plt.title("Check Linearity")
plt.show()
sns.regplot(x='emissions_lag2', y='uhi_intensity', data=valid_data)
plt.title("Check Linearity")
plt.show()
We now create the main seasonal model to predict the whole dataset
def seasonal_model(df):
y_pred = np.full(len(df), np.nan)
# Wet season
wet_mask = df['season'] == 'Wet'
if wet_mask.any():
y_pred[wet_mask] = wet_model.predict(df.loc[wet_mask, ['emissions']])
dry_mask = df['season'] == 'Dry'
valid_dry = dry_mask & df['emissions_lag2'].notnull()
if valid_dry.any():
y_pred[valid_dry] = dry_model.predict(df.loc[valid_dry, ['emissions_lag2']])
return y_pred
uhi['emissions_lag2'] = uhi['emissions'].shift(2)
uhi['uhi_predicted'] = seasonal_model(uhi)
uhi.head()
| time | emissions | index | uhi_intensity | month | season | emissions_lag2 | uhi_predicted | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2021-01-01 | 377670.22149 | 72 | 5.631497 | 1 | Dry | NaN | NaN |
| 1 | 2021-02-01 | 339525.14791 | 73 | 5.513525 | 2 | Dry | NaN | NaN |
| 2 | 2021-03-01 | 387995.25552 | 74 | 5.515976 | 3 | Dry | 377670.22149 | 6.073800 |
| 3 | 2021-04-01 | 377475.05045 | 75 | 4.908226 | 4 | Dry | 339525.14791 | 5.663141 |
| 4 | 2021-05-01 | 396570.43498 | 76 | 4.619258 | 5 | Wet | 387995.25552 | 3.338978 |
plt.figure(figsize=(14, 6))
plt.plot(uhi['time'], uhi['uhi_intensity'], label='Actual UHI Intensity', color='blue', linewidth=2)
plt.plot(uhi['time'], uhi['uhi_predicted'], label='Predicted UHI Intensity', color='red', linestyle='--', linewidth=2)
plt.title('Actual vs Predicted UHI Intensity Over Time')
plt.xlabel('Time')
plt.ylabel('UHI Intensity (°C)')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
filtered = uhi[uhi['uhi_predicted'].notna()]
mse = mean_squared_error(filtered['uhi_intensity'],filtered['uhi_predicted'])
r2 = r2_score(filtered['uhi_intensity'], filtered['uhi_predicted'])
print(f"MSE: {mse:.2f}, R²: {r2:.2f}")
MSE: 0.57, R²: 0.80
Conclusion¶
Even though the model formulated only has an 80% accuracy on the dataset given, there are still a lot of insights covered during the EDA and modelling.
- There is an increase in the temperature difference between Metro Manila and rural regions around it (UHI intensification) since 2015 (starting with 2 degrees celsius) to 2024 (ending with 6 degrees celsius).
- The magnitude of the temperature difference between Metro Manila and rural regions around it are significantly greater in the Dry season (ranging from 5 to 6 degrees celsius on average) than in the wet season (ranging from 2 to 3 degrees celsius on average).
- There is a positive correlation between the CO2 vehicle emissions and UHI intensification in Metro Manila during the Wet season (with a coefficient of 0.3). This shows that the increase in CO2 vehicle emissions will likely result to an increase in the UHI intensification in Metro Manila.
- There is a delayed (by 2 months) positive correlation between the CO2 vehicle emissions and UHI intensification in Metro Manila during the Dry season (with a coefficient of 0.3). This shows that the increase in CO2 vehicle emissions will likely result to an increase in the UHI intensification in Metro Manila 2 months later.
Do note that the model has a high variance and low bias, making it an overfitting model. This means that if we use a completely different dataset than the ones used in this study, it might not have a consistent accuracy of 80%.
Call To Action¶
As citizens of the Philippines, and especially as citizens of Metro Manila, this project shows that the temperature rise in our region compared to other regions is not purely coincidental, the result of this project exemplifies that. The scope of this study is limited only to CO2 emissions from vehicle transportation and majority of our transportation emits CO2, which causes temperature rise in Metro Manila. In reality, there are a lot more underlying factors that cause UHI intensification in Metro Manila. In the end, it is the job of students and researchers to reveal these underlying factors but it is the job of the general public to take action and contribute to the betterment of our climate.
What can you do?
Based on this project, we citizens of Metro Manila should do the following when travelling:
Short Trips? Walk or Bike!
- For nearby destinations, choose walking or cycling instead of driving.
- It’s healthier for you and produces zero CO₂ emissions.
Medium Distances? Try an E-Bike!
- If you regularly travel mid-range distances, consider investing in an electric bike.
- E-bikes are efficient, eco-friendly, and keep you active without the sweat.
Buying a Vehicle? Go Electric!
- If you’re in the market for a new car, opt for an electric vehicle (EV).
- EVs reduce CO₂ emissions and lower your carbon footprint significantly.
Other things we can do outside the scope of this project:
Reduce Energy Use
– Turn off lights, unplug devices, and switch to energy-efficient appliances to lower CO₂ emissions.
Go Green
– Plant trees and recycle waste to absorb carbon and reduce environmental harm.