MERCS 101 - Lecture 03: Mix Classification & Regression

This is the third part of the tutorial, in which we combine classification and regression.

Preliminaries

External Imports

In [1]:
import numpy as np
import os
import sys

from sklearn.metrics import (mean_absolute_error,
                             mean_squared_error,
                             mean_squared_log_error,
                             f1_score)
import pandas as pd

MERCS imports

In [2]:
sys.path.insert(0, '..') # We add the parent dir to the path
from src.mercs.core import MERCS
from src.mercs.utils import *

import src.datasets as datasets

Induction

Importing Data

First, we import the fertility dataset.

In [3]:
train, test = datasets.load_fertility()
In [4]:
train.head()
Out[4]:
season age child_diseases accident surgical_intervention high_fever alco smoking h_seating diagnosis
0 -0.33 0.69 0 1 1 0 0.8 0 0.88 1
1 -0.33 0.94 1 0 1 0 0.8 1 0.31 0
2 -0.33 0.50 1 0 0 0 1.0 -1 0.50 1
3 -0.33 0.75 0 1 1 0 1.0 -1 0.38 1
4 -0.33 0.67 1 1 0 0 0.8 -1 0.50 0

We observe that some attributes are nominal, whereas others appear numerical. MERCS can handle both, but it somehow has to figure out which kind each attribute is.

In general, this is very hard to do correctly and a genuine A.I. problem in its own right (type inference).

So we won't get into that. MERCS follows a very, very simple policy: it relies on sklearn's assumptions. Let us demonstrate.

In [5]:
model = MERCS()
model.fit(train)

Let us see what happened here. MERCS remembers some useful things about the datasets it encounters, among which the class labels. Since a numeric attribute has no class labels in the real sense of the word, MERCS uses a placeholder there.

Nevertheless, the class-labels data structure in MERCS tells us all we need to know.

In [6]:
model.s['metadata']['clf_labels']
Out[6]:
[['numeric'],
 ['numeric'],
 array([0., 1.]),
 array([0., 1.]),
 array([0., 1.]),
 array([-1.,  0.,  1.]),
 ['numeric'],
 array([-1.,  0.,  1.]),
 ['numeric'],
 array([0., 1.])]
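
To read this structure programmatically, here is a small sketch (our addition, relying only on the convention visible in the output above) that maps each attribute name to the kind MERCS inferred:

# Sketch: recover the inferred type of each attribute from clf_labels.
# Nominal attributes are stored as an np.ndarray of class labels,
# numeric ones as the list placeholder ['numeric'].
for name, labels in zip(train.columns, model.s['metadata']['clf_labels']):
    kind = 'nominal' if isinstance(labels, np.ndarray) else 'numeric'
    print(name, kind)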

So, what about the assumptions? Well, two things:

1) They work.
    Training went well, without any issues. This tells us that MERCS at least makes assumptions that ensure the component models can handle the data they are fed and the outputs they are expected to provide.
2) They're too simple.
    The assumptions do not really correspond to reality, as we will see.


To see for ourselves how these assumptions compare to reality, let us simply look at reality. How many unique values are present in the DataFrame we provided?

In [7]:
train.nunique()
Out[7]:
season                    3
age                      14
child_diseases            2
accident                  2
surgical_intervention     2
high_fever                3
alco                      5
smoking                   3
h_seating                13
diagnosis                 2
dtype: int64

It seems like MERCS made a mistake on the first attribute, season. MERCS thinks it is numeric, but it really appears to be a nominal attribute. All the rest corresponds.

How did this happen?

Well, MERCS knows about numeric and nominal, and makes its decisions in utils.py, in the method get_metadata_df.

An attribute is considered numeric,

UNLESS both of the following hold:

1) its type is `int` (necessary for sklearn), and
2) it has 10 or fewer distinct values (a threshold chosen by MERCS),

in which case it is considered nominal (see the sketch below). This explains season: it has only 3 distinct values, but its type is float, so MERCS treated it as numeric.
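
To make this concrete, here is a minimal re-implementation of that heuristic (our own illustration; the name is_nominal is ours, the actual logic lives in get_metadata_df):

# Hypothetical re-implementation of MERCS's type heuristic.
def is_nominal(column, max_values=10):
    # Nominal iff integer dtype AND at most max_values distinct values.
    return pd.api.types.is_integer_dtype(column) and column.nunique() <= max_values

# season has only 3 distinct values, but a float dtype, hence: numeric.
print(is_nominal(train['season']))  # False before factorizing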

We can solve this with a simple preprocessing step: converting season to ints.

In [8]:
train['season'] = pd.factorize(train['season'])[0]
train.head()
Out[8]:
season age child_diseases accident surgical_intervention high_fever alco smoking h_seating diagnosis
0 0 0.69 0 1 1 0 0.8 0 0.88 1
1 0 0.94 1 0 1 0 0.8 1 0.31 0
2 0 0.50 1 0 0 0 1.0 -1 0.50 1
3 0 0.75 0 1 1 0 1.0 -1 0.38 1
4 0 0.67 1 1 0 0 0.8 -1 0.50 0
In [9]:
train.dtypes
Out[9]:
season                     int64
age                      float64
child_diseases             int64
accident                   int64
surgical_intervention      int64
high_fever                 int64
alco                     float64
smoking                    int64
h_seating                float64
diagnosis                  int64
dtype: object
model.fit(train)
model.s['metadata']['clf_labels']

Preprocessing

Let us take this again from the top.

In [10]:
train, test = datasets.load_fertility()

train['season'] = pd.factorize(train['season'])[0]
test['season'] = pd.factorize(test['season'])[0]
In [11]:
train.head(13)
Out[11]:
season age child_diseases accident surgical_intervention high_fever alco smoking h_seating diagnosis
0 0 0.69 0 1 1 0 0.8 0 0.88 1
1 0 0.94 1 0 1 0 0.8 1 0.31 0
2 0 0.50 1 0 0 0 1.0 -1 0.50 1
3 0 0.75 0 1 1 0 1.0 -1 0.38 1
4 0 0.67 1 1 0 0 0.8 -1 0.50 0
5 0 0.67 1 0 1 0 0.8 0 0.50 1
6 0 0.67 0 0 0 -1 0.8 -1 0.44 1
7 0 1.00 1 1 1 0 0.6 -1 0.38 1
8 1 0.64 0 0 1 0 0.8 -1 0.25 1
9 1 0.61 1 0 0 0 1.0 -1 0.25 1
10 1 0.67 1 1 0 -1 0.8 0 0.31 1
11 1 0.78 1 1 1 0 0.6 0 0.13 1
12 1 0.75 1 1 1 0 0.8 1 0.25 1

Training

We now train a MERCS model with explicit parameters. The ind_* settings presumably configure the induction of the component models (here random forests, 'RF', with 10 estimators of maximum depth 4), while the sel_* settings steer how MERCS selects the input and target attributes of those models.

In [12]:
model = MERCS()
In [13]:
ind_parameters = {'ind_type':           'RF',
                  'ind_n_estimators':   10,
                  'ind_max_depth':      4}

sel_parameters = {'sel_type':           'Base',
                  'sel_its':            8,
                  'sel_param':          2}
In [14]:
model.fit(train, **ind_parameters, **sel_parameters)

Introspection

Identification of types

MERCS makes some decisions regarding the attribute types automatically. After the factorize step, season (the first entry below) is now recognized as a nominal attribute with three classes, instead of the numeric placeholder we saw in Out[6].

In [15]:
model.s['metadata']['clf_labels']
Out[15]:
[array([0., 1., 2.]),
 ['numeric'],
 array([0., 1.]),
 array([0., 1.]),
 array([0., 1.]),
 array([-1.,  0.,  1.]),
 ['numeric'],
 array([-1.,  0.,  1.]),
 ['numeric'],
 array([0., 1.])]
In [16]:
model.s['metadata']['nb_values']
Out[16]:
array([ 3, 14,  2,  2,  2,  3,  5,  3, 13,  2])
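
As a quick sanity check (our addition, not part of the original notebook), nb_values should simply mirror the per-column distinct-value counts we obtained earlier via train.nunique():

# Sanity check: metadata value counts match the DataFrame's distinct counts.
assert (model.s['metadata']['nb_values'] == train.nunique().values).all()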

Inference

Prediction

To query a MERCS model, we build a query code that assigns a role to every attribute: 0 marks a descriptive (input) attribute and 1 marks a target. Here we predict the last two attributes, h_seating and diagnosis, from all the others.

In [57]:
code = [0]*model.s['metadata']['nb_atts']
code[-2] = 1
code[-1] = 1
print(code)

target_boolean = np.array(code) == 1
y_true = test[test.columns.values[target_boolean]].values
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
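
For convenience, one could wrap this pattern in a small helper (a hypothetical function for this tutorial, not part of the MERCS API):

# Hypothetical helper: build a query code targeting the given columns.
def make_query_code(df, targets):
    # 0 = descriptive (input), 1 = target (output)
    return [1 if col in targets else 0 for col in df.columns]

assert make_query_code(test, {'h_seating', 'diagnosis'}) == code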
In [40]:
pred_parameters = {'pred_type':     'IT',
                   'pred_param':    0.1,
                   'pred_its':      4}
In [41]:
y_pred = model.predict(test,
                       **pred_parameters,
                       qry_code=code)
In [42]:
y_pred
Out[42]:
array([[0.72198958, 1.        ],
       [0.34198542, 1.        ],
       [0.46525   , 1.        ],
       [0.394875  , 1.        ],
       [0.390375  , 1.        ],
       [0.312875  , 1.        ],
       [0.30854167, 1.        ],
       [0.36529167, 1.        ],
       [0.45025   , 1.        ],
       [0.32229167, 1.        ],
       [0.437125  , 1.        ],
       [0.39266667, 1.        ],
       [0.32816458, 1.        ],
       [0.441625  , 1.        ],
       [0.3755625 , 1.        ],
       [0.33141667, 1.        ],
       [0.23495833, 1.        ],
       [0.29609375, 1.        ],
       [0.43279167, 1.        ],
       [0.39188542, 1.        ],
       [0.365625  , 1.        ],
       [0.32228542, 1.        ],
       [0.26225   , 1.        ],
       [0.280625  , 1.        ],
       [0.38225   , 1.        ],
       [0.31059167, 1.        ],
       [0.3845625 , 1.        ],
       [0.37720833, 1.        ],
       [0.43351042, 1.        ],
       [0.42177083, 1.        ]])

Evaluation

Note that the two columns of y_pred match our two targets: the first holds the numeric estimates for h_seating, the second the nominal predictions for diagnosis (here uniformly class 1).

In [59]:
clf_labels_targets = [model.s['metadata']['clf_labels'][t]
                      for t, check in enumerate(target_boolean)
                      if check]

clf_labels_targets 
Out[59]:
[['numeric'], array([0., 1.])]
In [44]:
def verify_numeric_prediction(y_true, y_pred):
    obs_1 = mean_absolute_error(y_true, y_pred)
    obs_2 = mean_squared_error(y_true, y_pred)
    obs_3 = mean_squared_log_error(y_true, y_pred)

    obs = [obs_1, obs_2, obs_3]

    for o in obs:
        assert isinstance(o, (int, float))
        assert 0 <= o 
    return
In [51]:
def verify_nominal_prediction(y_true, y_pred):
    obs = f1_score(y_true, y_pred, average='macro')

    assert isinstance(obs, (int, float))
    assert 0 <= obs <= 1
    return
In [65]:
# Verify the predictions for every target, nominal or numeric
for t_idx, clf_labels_targ in enumerate(clf_labels_targets):
    print(t_idx)
    single_y_true = y_true[:, t_idx]
    single_y_pred = y_pred[:, t_idx]

    if isinstance(clf_labels_targ, np.ndarray):
        # Nominal target: cast the predictions back to integer class labels
        print('NOMINAL')
        verify_nominal_prediction(single_y_true, single_y_pred.astype(int))
    elif isinstance(clf_labels_targ, list):
        # Numeric target: compare predictions against the ground truth
        print('NUMERIC')
        assert clf_labels_targ == ['numeric']

        verify_numeric_prediction(single_y_true, single_y_pred)
    else:
        msg = """clf_labels of MERCS are either:\n
        np.ndarray, shape (classlabels,)\n
        \t for nominal attributes\n
        list, shape (1,)\n
        \t ['numeric'] for numeric attributes \n"""
        raise TypeError(msg)
0
NUMERIC
1
NOMINAL
/home/elia/Software/anaconda3/lib/python3.6/site-packages/sklearn/metrics/classification.py:1135: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

This warning is expected: as Out[42] shows, the model predicts class 1 for every instance of diagnosis, so class 0 never occurs among the predictions and its F-score defaults to 0.