How to Handle Class Imbalance in Machine Learning

Imbalanced classes put “accuracy” out of business. This is a surprisingly common problem in machine learning (specifically in classification): it shows up whenever a dataset has a disproportionate ratio of observations in each class. Several strategies can help:

  • Up-sample the minority class

    • sklearn.utils.resample with replace=True (see the worked example further below)
  • Down-sample the majority class

    • sklearn.utils.resample with replace=False (a sketch follows this list)
  • Change your performance metric

    • Area Under the ROC Curve (AUROC)
    • from sklearn.metrics import roc_auc_score (a sketch follows this list)
  • Penalize algorithms (cost-sensitive training)

    SVC(kernel='linear',
        class_weight='balanced',  # penalize mistakes on the minority class more heavily
        probability=True)
  • Use tree-based algorithms

    • from sklearn.ensemble import RandomForestClassifier (a sketch follows this list)
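
Down-sampling is the mirror image of up-sampling: draw from the majority class without replacement until it matches the minority-class count. A minimal sketch, assuming the same CSV and the same trimethoprim_sulfamethoxazole label column used in the worked example further below:

import pandas as pd
from sklearn.utils import resample

input = pd.read_csv('trime_skinput.csv')
input_major = input[input.trimethoprim_sulfamethoxazole == "R"]  # majority class
input_minor = input[input.trimethoprim_sulfamethoxazole == "S"]  # minority class

# draw from the majority class without replacement, down to the minority-class size
input_major_downsampled = resample(input_major,
                                   replace=False,
                                   n_samples=len(input_minor),
                                   random_state=123)

input_downsampled = pd.concat([input_major_downsampled, input_minor])
input_downsampled.trimethoprim_sulfamethoxazole.value_counts()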
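
Accuracy rewards a model that always predicts the majority class; AUROC instead scores how well the predicted probabilities rank the two classes. A minimal sketch, assuming a fitted classifier clf that exposes predict_proba (for example an SVC built with probability=True) and a held-out X_test/y_test split like the one in the worked example below:

from sklearn.metrics import roc_auc_score

# roc_auc_score needs continuous scores, not hard labels;
# with string labels scikit-learn treats the alphabetically larger one ("S") as the positive class
prob_s = clf.predict_proba(X_test)[:, 1]   # column order follows clf.classes_, here ['R', 'S']
print(roc_auc_score(y_test, prob_s))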
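
Tree-based ensembles tend to be less sensitive to imbalance, and they combine well with class weighting. A minimal sketch, assuming the same X_train/X_test/y_train/y_test split as in the worked example below (the hyperparameters are illustrative, not tuned):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# class_weight='balanced' reweights classes inversely to their frequency,
# so errors on the rare class cost more during training
rf = RandomForestClassifier(n_estimators=100,
                            class_weight='balanced',
                            random_state=123)
rf.fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test)))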
The full worked example: up-sample the minority class, then train and evaluate a linear SVM.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample


input = pd.read_csv('trime_skinput.csv')  # rows are indexed by plain numbers (default index)
input.trimethoprim_sulfamethoxazole.value_counts()

# "R" is the majority class, "S" the minority
input_major = input[input.trimethoprim_sulfamethoxazole == "R"]
input_minor = input[input.trimethoprim_sulfamethoxazole == "S"]

# up-sample the minority class with replacement (n_samples=67 to match the majority-class count)
input_minor_upsampled = resample(input_minor,
                                 replace=True,
                                 n_samples=67,
                                 random_state=123)

input_upsampled = pd.concat([input_major, input_minor_upsampled])
input_upsampled.trimethoprim_sulfamethoxazole.value_counts()

# the first 19277 columns are features; the last column is the label
X = input_upsampled.iloc[:, 0:19277]
y = input_upsampled.iloc[:, 19277]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

# train the model
from sklearn.svm import SVC
svclassifier = SVC(kernel='linear')
svclassifier.fit(X_train, y_train)

# predict
y_pred = svclassifier.predict(X_test)

# evaluation
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
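
For comparison, the second report below ("without resampling") presumably comes from running the same split/train/evaluate steps on the original, imbalanced data; a hedged sketch of that baseline, reusing the variables and hard-coded column indices from above:

# baseline: same model on the original data, no resampling
X_orig = input.iloc[:, 0:19277]
y_orig = input.iloc[:, 19277]

X_train, X_test, y_train, y_test = train_test_split(X_orig, y_orig, test_size=0.20)
baseline = SVC(kernel='linear')
baseline.fit(X_train, y_train)
print(classification_report(y_test, baseline.predict(X_test)))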
# with up-sampling
              precision    recall  f1-score   support

           R       1.00      0.88      0.93        16
           S       0.85      1.00      0.92        11

    accuracy                           0.93        27
   macro avg       0.92      0.94      0.93        27
weighted avg       0.94      0.93      0.93        27

# without resampling (baseline)
              precision    recall  f1-score   support

           R       0.88      1.00      0.93        14
           S       0.00      0.00      0.00         2

    accuracy                           0.88        16
   macro avg       0.44      0.50      0.47        16
weighted avg       0.77      0.88      0.82        16

Without resampling, the model never predicts the minority class S (recall 0.00) even though accuracy still looks respectable at 0.88; after up-sampling, recall for S reaches 1.00 at only a small cost in precision.

References