This is the third part of the tutorial, combining classification and regression
import numpy as np
import os
import sys
from sklearn.metrics import (mean_absolute_error,
mean_squared_error,
mean_squared_log_error,
f1_score)
import pandas as pd
sys.path.insert(0, '..') # We add the parent dir to the path
from src.mercs.core import MERCS
from src.mercs.utils import *
import src.datasets as datasets
First, we import the fertility dataset.
train, test = datasets.load_fertility()
train.head()
We observe that some attributes are nominal, whereas others appear numerical. MERCS can handle both, but of course has to somehow figure out which kind each attribute is.
This is, in general, very hard to do correctly and a genuine A.I. problem in its own right (type inference).
So we won't get into that. Instead, MERCS follows a very simple policy based on sklearn's assumptions. Let us demonstrate.
model = MERCS()
model.fit(train)
Let us see what happened here. MERCS remembers some useful things about the datasets it encounters, among them the class labels. Since a numeric attribute has no class labels in the real sense of the word, MERCS uses a placeholder there.
Nevertheless, the class-label data structure in MERCS tells us all we need to know.
model.s['metadata']['clf_labels']
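To make the structure concrete, here is a minimal sketch of how such a list could be split into nominal and numeric attributes. It is an illustration, not MERCS code: it assumes, following the type checks used at the end of this tutorial, that a nominal attribute holds a NumPy array of class labels and a numeric attribute holds the placeholder list `['numeric']`. The values in `example_clf_labels` and the helper `split_by_kind` are invented for this example.

```python
import numpy as np

# Hypothetical stand-in for model.s['metadata']['clf_labels'];
# the values are invented for illustration.
example_clf_labels = [
    ['numeric'],           # numeric attribute: placeholder
    np.array([0, 1]),      # nominal attribute: its class labels
    ['numeric'],
    np.array([0, 1, 2]),
]


def split_by_kind(clf_labels):
    """Return (nominal_idx, numeric_idx), two lists of attribute indices."""
    nominal_idx, numeric_idx = [], []
    for idx, labels in enumerate(clf_labels):
        if isinstance(labels, np.ndarray):
            nominal_idx.append(idx)
        elif isinstance(labels, list) and labels == ['numeric']:
            numeric_idx.append(idx)
        else:
            raise TypeError(f"Unexpected entry at index {idx}: {labels!r}")
    return nominal_idx, numeric_idx


nominal_idx, numeric_idx = split_by_kind(example_clf_labels)
print("nominal attributes:", nominal_idx)
print("numeric attributes:", numeric_idx)
```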
So, what about the assumptions? Well, 2 things:
1) They work
Training went well, without any issues. This tells us that MERCS at least makes assumptions that ensure the component models can handle the data they are fed and the outputs they are expected to provide.
2) They're too simple
The assumptions do not really correspond to reality, as we will see.
To see for ourselves how these assumptions compare to reality, let us simply look at reality. How many unique values are present in the DataFrame we provided?
train.nunique()
It seems like MERCS made a mistake on the first attribute, `season`. MERCS thinks it is numeric, but it really looks more like a nominal attribute. All the rest seems to correspond.
How did this happen?
Well, MERCS knows about numeric and nominal attributes, and makes its decisions in utils.py, in the method `get_metadata_df`.
An attribute is numeric
UNLESS both of the following hold:
1) Its type is `int` (necessary for sklearn)
2) It has 10 or fewer distinct values (this is a choice made by MERCS)
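The rule above can be sketched in a few lines. This is a minimal illustration under the reading that both conditions must hold for an attribute to count as nominal; it is not the actual `get_metadata_df` implementation. The function name `guess_attribute_kinds`, the parameter `max_nominal_values`, and the toy DataFrame are all assumptions made for this example.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype


def guess_attribute_kinds(df, max_nominal_values=10):
    """Label each column 'nominal' or 'numeric' (hypothetical helper)."""
    kinds = {}
    for col in df.columns:
        is_int = is_integer_dtype(df[col])                    # condition 1
        few_values = df[col].nunique() <= max_nominal_values  # condition 2
        kinds[col] = 'nominal' if (is_int and few_values) else 'numeric'
    return kinds


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'season': [-1.0, -0.33, 0.33, 1.0, -1.0],  # few values, but floats
    'age': [0.69, 0.94, 0.50, 0.75, 0.67],     # continuous floats
    'diagnosis': [0, 1, 0, 0, 1],              # ints with few values
})

kinds = guess_attribute_kinds(toy)
print(kinds)
```

In this toy frame `season` has only four distinct values but is stored as floats, so it fails the first condition and ends up labelled numeric.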
We can solve this by simple preprocessing and converting season to ints.
train['season'] = pd.factorize(train['season'])[0]
train.head()
train.dtypes
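A caveat worth keeping in mind if the same conversion is applied to a separate test set: `pd.factorize` assigns codes in order of first appearance, so factorizing two frames independently can map the same value to different codes. The sketch below uses invented toy values to show the effect, together with one possible workaround, `sort=True`. That workaround only yields matching codes when both series contain the same set of values, which is an assumption here.

```python
import pandas as pd

# Toy values, invented for illustration: same categories, different order.
train_season = pd.Series([-1.0, 0.33, 1.0, -0.33])
test_season = pd.Series([1.0, -1.0, -0.33, 0.33])

# Default behaviour: codes follow the order of first appearance.
train_codes, train_uniques = pd.factorize(train_season)
test_codes, test_uniques = pd.factorize(test_season)
print(list(train_uniques))  # value order behind the train codes
print(list(test_uniques))   # value order behind the test codes (differs)

# With sort=True the codes follow the sorted unique values instead.
train_sorted, train_sorted_uniques = pd.factorize(train_season, sort=True)
test_sorted, test_sorted_uniques = pd.factorize(test_season, sort=True)
print(list(train_sorted), list(test_sorted))
```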
Let us take this again from the top.
train, test = datasets.load_fertility()
train['season'] = pd.factorize(train['season'])[0]
test['season'] = pd.factorize(test['season'])[0]
train.head(13)
model = MERCS()
ind_parameters = {'ind_type': 'RF',
'ind_n_estimators': 10,
'ind_max_depth': 4}
sel_parameters = {'sel_type': 'Base',
'sel_its': 8,
'sel_param': 2}
model.fit(train, **ind_parameters, **sel_parameters)
MERCS makes some decisions regarding the attribute types automatically.
model.s['metadata']['clf_labels']
model.s['metadata']['nb_values']
code = [0]*model.s['metadata']['nb_atts']
code[-2] = 1
code[-1] = 1
print(code)
target_boolean = np.array(code) == 1
y_true = test[test.columns.values[target_boolean]].values
pred_parameters = {'pred_type': 'IT',
'pred_param': 0.1,
'pred_its': 4}
y_pred = model.predict(test,
**pred_parameters,
qry_code=code)
y_pred
clf_labels_targets = [model.s['metadata']['clf_labels'][t]
for t, check in enumerate(target_boolean)
if check]
clf_labels_targets
def verify_numeric_prediction(y_true, y_pred):
    obs_1 = mean_absolute_error(y_true, y_pred)
    obs_2 = mean_squared_error(y_true, y_pred)
    obs_3 = mean_squared_log_error(y_true, y_pred)
    obs = [obs_1, obs_2, obs_3]
    for o in obs:
        assert isinstance(o, (int, float))
        assert 0 <= o
    return


def verify_nominal_prediction(y_true, y_pred):
    obs = f1_score(y_true, y_pred, average='macro')
    assert isinstance(obs, (int, float))
    assert 0 <= obs <= 1
    return
# Check each target with the metrics that match its type
for t_idx, clf_labels_targ in enumerate(clf_labels_targets):
    print(t_idx)
    single_y_true = y_true[:, t_idx]
    single_y_pred = y_pred[:, t_idx].astype(int)
    print(single_y_pred)

    if isinstance(clf_labels_targ, np.ndarray):
        # Nominal target
        print('NOMINAL')
        verify_nominal_prediction(single_y_true, single_y_pred)
    elif isinstance(clf_labels_targ, list):
        # Numeric target
        print('NUMERIC')
        assert clf_labels_targ == ['numeric']
        # Compare the predictions against the ground truth
        verify_numeric_prediction(single_y_true, single_y_pred)
    else:
        msg = """clf_labels of MERCS are either:\n
        np.ndarray, shape (classlabels,)\n
        \t for nominal attributes\n
        list, shape (1,)\n
        \t ['numeric'] for numeric attributes \n"""
        raise TypeError(msg)