MERCS 101 - Lecture 03: Mix Classification & Regression

This is the third part of the tutorial, in which we combine classification and regression.

Preliminaries

External Imports

In [1]:
import numpy as np
import os
import sys

from sklearn.metrics import (mean_absolute_error,
                             mean_squared_error,
                             mean_squared_log_error,
                             f1_score)
import pandas as pd

MERCS imports

In [2]:
sys.path.insert(0, '..') # We add the parent dir to the path
from src.mercs.core import MERCS
from src.mercs.utils import *

import src.datasets as datasets

Induction

Importing Data

First, we import the fertility dataset.

In [3]:
train, test = datasets.load_fertility()
In [4]:
train.head()
Out[4]:
season age child_diseases accident surgical_intervention high_fever alco smoking h_seating diagnosis
0 -0.33 0.69 0 1 1 0 0.8 0 0.88 1
1 -0.33 0.94 1 0 1 0 0.8 1 0.31 0
2 -0.33 0.50 1 0 0 0 1.0 -1 0.50 1
3 -0.33 0.75 0 1 1 0 1.0 -1 0.38 1
4 -0.33 0.67 1 1 0 0 0.8 -1 0.50 0

We observe that some attributes are nominal, whereas others appear numerical. MERCS can handle both, but it somehow has to figure out which kind each attribute is.

In general, this is very hard to do correctly and a genuine A.I. problem in its own right (type inference).

So we won't get into that. MERCS follows a very, very simple policy: it relies on sklearn's assumptions. Let us demonstrate.

In [5]:
model = MERCS()
model.fit(train)

Let us see what happened here. MERCS remembers some useful things about the datasets it encounters, among which the class labels. Since a numeric attribute has no class labels in the real sense of the word, MERCS uses a placeholder there.

Nevertheless, the class-labels data structure in MERCS tells us all we need to know.

In [6]:
model.s['metadata']['clf_labels']
Out[6]:
[['numeric'],
 ['numeric'],
 array([0., 1.]),
 array([0., 1.]),
 array([0., 1.]),
 array([-1.,  0.,  1.]),
 ['numeric'],
 array([-1.,  0.,  1.]),
 ['numeric'],
 array([0., 1.])]
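
To read this structure programmatically, here is a small sketch (our addition, relying only on the convention visible in the output above) that maps each attribute name to the kind MERCS inferred:

# Sketch: recover the inferred type of each attribute from clf_labels.
# Nominal attributes are stored as an np.ndarray of class labels,
# numeric ones as the list placeholder ['numeric'].
for name, labels in zip(train.columns, model.s['metadata']['clf_labels']):
    kind = 'nominal' if isinstance(labels, np.ndarray) else 'numeric'
    print(name, kind)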

So, what about the assumptions? Well, two things:

1) They work.
    Training went well, without any issues. This tells us that MERCS at least makes assumptions that ensure the component models can handle the data they are fed and the outputs they are expected to provide.
2) They're too simple.
    The assumptions do not really correspond to reality, as we will see.


To see for ourselves how these assumptions compare to reality, let us simply look at reality. How many unique values are present in the DataFrame we provided?

In [7]:
train.nunique()
Out[7]:
season                    3
age                      14
child_diseases            2
accident                  2
surgical_intervention     2
high_fever                3
alco                      5
smoking                   3
h_seating                13
diagnosis                 2
dtype: int64

It seems like MERCS made a mistake on the first attribute, season. MERCS thinks it is numeric, but it really appears to be a nominal attribute. All the rest corresponds.

How did this happen?

Well, MERCS knows about numeric and nominal, and makes its decisions in utils.py, in the method get_metadata_df.

An attribute is considered numeric,

UNLESS both of the following hold:

1) its type is `int` (necessary for sklearn), and
2) it has 10 or fewer distinct values (a threshold chosen by MERCS),

in which case it is considered nominal (see the sketch below). This explains season: it has only 3 distinct values, but its type is float, so MERCS treated it as numeric.
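
To make this concrete, here is a minimal re-implementation of that heuristic (our own illustration; the name is_nominal is ours, the actual logic lives in get_metadata_df):

# Hypothetical re-implementation of MERCS's type heuristic.
def is_nominal(column, max_values=10):
    # Nominal iff integer dtype AND at most max_values distinct values.
    return pd.api.types.is_integer_dtype(column) and column.nunique() <= max_values

# season has only 3 distinct values, but a float dtype, hence: numeric.
print(is_nominal(train['season']))  # False before factorizing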

We can solve this with a simple preprocessing step: converting season to ints.

In [8]:
train['season'] = pd.factorize(train['season'])[0]
train.head()
Out[8]:
season age child_diseases accident surgical_intervention high_fever alco smoking h_seating diagnosis
0 0 0.69 0 1 1 0 0.8 0 0.88 1
1 0 0.94 1 0 1 0 0.8 1 0.31 0
2 0 0.50 1 0 0 0 1.0 -1 0.50 1
3 0 0.75 0 1 1 0 1.0 -1 0.38 1
4 0 0.67 1 1 0 0 0.8 -1 0.50 0
In [9]:
train.dtypes
Out[9]:
season                     int64
age                      float64
child_diseases             int64
accident                   int64
surgical_intervention      int64
high_fever                 int64
alco                     float64
smoking                    int64
h_seating                float64
diagnosis                  int64
dtype: object
model.fit(train)
model.s['metadata']['clf_labels']

Preprocessing

Let us take this again from the top.

In [10]:
train, test = datasets.load_fertility()

train['season'] = pd.factorize(train['season'])[0]
test['season'] = pd.factorize(test['season'])[0]
In [11]:
train.head(13)
Out[11]:
season age child_diseases accident surgical_intervention high_fever alco smoking h_seating diagnosis
0 0 0.69 0 1 1 0 0.8 0 0.88 1
1 0 0.94 1 0 1 0 0.8 1 0.31 0
2 0 0.50 1 0 0 0 1.0 -1 0.50 1
3 0 0.75 0 1 1 0 1.0 -1 0.38 1
4 0 0.67 1 1 0 0 0.8 -1 0.50 0
5 0 0.67 1 0 1 0 0.8 0 0.50 1
6 0 0.67 0 0 0 -1 0.8 -1 0.44 1
7 0 1.00 1 1 1 0 0.6 -1 0.38 1
8 1 0.64 0 0 1 0 0.8 -1 0.25 1
9 1 0.61 1 0 0 0 1.0 -1 0.25 1
10 1 0.67 1 1 0 -1 0.8 0 0.31 1
11 1 0.78 1 1 1 0 0.6 0 0.13 1
12 1 0.75 1 1 1 0 0.8 1 0.25 1

Training

We now train a MERCS model with explicit parameters. The ind_* settings presumably configure the induction of the component models (here random forests, 'RF', with 10 estimators of maximum depth 4), while the sel_* settings steer how MERCS selects the input and target attributes of those models.

In [12]:
model = MERCS()
In [13]:
ind_parameters = {'ind_type':           'RF',
                  'ind_n_estimators':   10,
                  'ind_max_depth':      4}

sel_parameters = {'sel_type':           'Base',
                  'sel_its':            8,
                  'sel_param':          2}
In [14]:
model.fit(train, **ind_parameters, **sel_parameters)

Introspection

Identification of types

MERCS makes some decisions regarding the attribute types automatically. After the factorize step, season (the first entry below) is now recognized as a nominal attribute with three classes, instead of the numeric placeholder we saw in Out[6].

In [15]:
model.s['metadata']['clf_labels']
Out[15]:
[array([0., 1., 2.]),
 ['numeric'],
 array([0., 1.]),
 array([0., 1.]),
 array([0., 1.]),
 array([-1.,  0.,  1.]),
 ['numeric'],
 array([-1.,  0.,  1.]),
 ['numeric'],
 array([0., 1.])]
In [16]:
model.s['metadata']['nb_values']
Out[16]:
array([ 3, 14,  2,  2,  2,  3,  5,  3, 13,  2])
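
As a quick sanity check (our addition, not part of the original notebook), nb_values should simply mirror the per-column distinct-value counts we obtained earlier via train.nunique():

# Sanity check: metadata value counts match the DataFrame's distinct counts.
assert (model.s['metadata']['nb_values'] == train.nunique().values).all()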

Inference

Prediction

To query a MERCS model, we build a query code that assigns a role to every attribute: 0 marks a descriptive (input) attribute and 1 marks a target. Here we predict the last two attributes, h_seating and diagnosis, from all the others.

In [57]:
code = [0]*model.s['metadata']['nb_atts']
code[-2] = 1
code[-1] = 1
print(code)

target_boolean = np.array(code) == 1
y_true = test[test.columns.values[target_boolean]].values
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
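
For convenience, one could wrap this pattern in a small helper (a hypothetical function for this tutorial, not part of the MERCS API):

# Hypothetical helper: build a query code targeting the given columns.
def make_query_code(df, targets):
    # 0 = descriptive (input), 1 = target (output)
    return [1 if col in targets else 0 for col in df.columns]

assert make_query_code(test, {'h_seating', 'diagnosis'}) == code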
In [40]:
pred_parameters = {'pred_type':     'IT',
                   'pred_param':    0.1,
                   'pred_its':      4}
In [41]:
y_pred = model.predict(test,
                       **pred_parameters,
                       qry_code=code)
In [42]:
y_pred
Out[42]:
array([[0.72198958, 1.        ],
       [0.34198542, 1.        ],
       [0.46525   , 1.        ],
       [0.394875  , 1.        ],
       [0.390375  , 1.        ],
       [0.312875  , 1.        ],
       [0.30854167, 1.        ],
       [0.36529167, 1.        ],
       [0.45025   , 1.        ],
       [0.32229167, 1.        ],
       [0.437125  , 1.        ],
       [0.39266667, 1.        ],
       [0.32816458, 1.        ],
       [0.441625  , 1.        ],
       [0.3755625 , 1.        ],
       [0.33141667, 1.        ],
       [0.23495833, 1.        ],
       [0.29609375, 1.        ],
       [0.43279167, 1.        ],
       [0.39188542, 1.        ],
       [0.365625  , 1.        ],
       [0.32228542, 1.        ],
       [0.26225   , 1.        ],
       [0.280625  , 1.        ],
       [0.38225   , 1.        ],
       [0.31059167, 1.        ],
       [0.3845625 , 1.        ],
       [0.37720833, 1.        ],
       [0.43351042, 1.        ],
       [0.42177083, 1.        ]])

Evaluation

Note that the two columns of y_pred match our two targets: the first holds the numeric estimates for h_seating, the second the nominal predictions for diagnosis (here uniformly class 1).

In [59]:
clf_labels_targets = [model.s['metadata']['clf_labels'][t]
                      for t, check in enumerate(target_boolean)
                      if check]

clf_labels_targets 
Out[59]:
[['numeric'], array([0., 1.])]
In [44]:
def verify_numeric_prediction(y_true, y_pred):
    obs_1 = mean_absolute_error(y_true, y_pred)
    obs_2 = mean_squared_error(y_true, y_pred)
    obs_3 = mean_squared_log_error(y_true, y_pred)

    obs = [obs_1, obs_2, obs_3]

    for o in obs:
        assert isinstance(o, (int, float))
        assert 0 <= o 
    return
In [51]:
def verify_nominal_prediction(y_true, y_pred):
    obs = f1_score(y_true, y_pred, average='macro')

    assert isinstance(obs, (int, float))
    assert 0 <= obs <= 1
    return
In [65]:
# Verify the predictions for every target, nominal or numeric
for t_idx, clf_labels_targ in enumerate(clf_labels_targets):
    print(t_idx)
    single_y_true = y_true[:, t_idx]
    single_y_pred = y_pred[:, t_idx]

    if isinstance(clf_labels_targ, np.ndarray):
        # Nominal target: cast the predictions back to integer class labels
        print('NOMINAL')
        verify_nominal_prediction(single_y_true, single_y_pred.astype(int))
    elif isinstance(clf_labels_targ, list):
        # Numeric target: compare predictions against the ground truth
        print('NUMERIC')
        assert clf_labels_targ == ['numeric']

        verify_numeric_prediction(single_y_true, single_y_pred)
    else:
        msg = """clf_labels of MERCS are either:\n
        np.ndarray, shape (classlabels,)\n
        \t for nominal attributes\n
        list, shape (1,)\n
        \t ['numeric'] for numeric attributes \n"""
        raise TypeError(msg)
0
NUMERIC
1
NOMINAL
/home/elia/Software/anaconda3/lib/python3.6/site-packages/sklearn/metrics/classification.py:1135: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

This warning is expected: as Out[42] shows, the model predicts class 1 for every instance of diagnosis, so class 0 never occurs among the predictions and its F-score defaults to 0.