Data Mining and Machine Learning - Chapter 4: Classification Algorithms


Chapter 4: Classification Algorithms

Overview of Classification Algorithms

Definition of Classification

image-20220423205910301

Applications of Classification

image-20220423210010708

What Kind of Data Is Suitable for Classification?

image-20220423205438985

image-20220423205536080

image-20220423205553795

image-20220423210255075

Criteria for Building a Classifier

image-20220423222634239

image-20220423222816691 image-20220423222843876

image-20220423210406604

Naive Bayes (NB)

Introduction

image-20220423210906433

Frequency & Probability

image-20220423211214265

image-20220423211200738

Prior, Posterior & Conditional Probability

image-20220423211255437

image-20220423211406624

The Core of the Bayes Algorithm

image-20220423211754348
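The core relation above, Bayes' rule P(A|B) = P(B|A)·P(A) / P(B), can be checked with a small numerical sketch. The disease prevalence and test accuracies below are made-up numbers for illustration only:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Hypothetical example: a disease with 1% prevalence, a test with
# 90% sensitivity and a 5% false-positive rate.
p_disease = 0.01                # prior P(A)
p_pos_given_disease = 0.90      # likelihood P(B|A)
p_pos_given_healthy = 0.05      # false-positive rate

# Total probability of a positive test, P(B), by the law of total probability
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Posterior P(A|B): probability of disease given a positive test
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 4))
```

Even with a fairly accurate test, the posterior stays small because the prior is small, which is exactly the point Bayes' rule makes.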

Multinomial Naive Bayes

A Naive Bayes Example

image-20220423233036616

image-20220423233153735

image-20220423233216892

The Naive Bayes Classification Algorithm

image-20220423232843432
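Multinomial naive Bayes is typically applied to count features such as word frequencies. A minimal sketch with scikit-learn's `MultinomialNB`; the tiny count matrix is invented for illustration:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count features (rows = documents, columns = vocabulary terms)
X = np.array([[2, 1, 0],
              [1, 2, 0],
              [0, 1, 3],
              [0, 0, 4]])
y = np.array([0, 0, 1, 1])   # two document classes

clf = MultinomialNB()        # Laplace smoothing alpha=1.0 by default
clf.fit(X, y)

# A document heavy in term 0 looks like class 0; one heavy in term 2 like class 1
print(clf.predict(np.array([[3, 0, 0], [0, 1, 5]])))
```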

Gaussian Naive Bayes

image-20220423233437441

image-20220423233645001

image-20220423233934867

image-20220423234236933

image-20220423234447246
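Gaussian naive Bayes models each continuous feature within a class as a normal distribution N(μ, σ²) and uses its density as the likelihood. A hand-computed sketch for a single feature; the class means and standard deviations are hypothetical:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x: the per-feature likelihood Gaussian NB uses."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical per-class statistics for one feature (height in cm)
mu_male, sigma_male = 178.0, 6.0
mu_female, sigma_female = 165.0, 6.0

x = 170.0
lik_male = gaussian_pdf(x, mu_male, sigma_male)
lik_female = gaussian_pdf(x, mu_female, sigma_female)

# With equal priors, the posterior is proportional to the likelihood
print('female' if lik_female > lik_male else 'male')
```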

Application Scenarios

image-20220423235523805

Implementation

image-20220423234701473

image-20220423234842350
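As a minimal end-to-end sketch of the implementation idea, here is scikit-learn's `GaussianNB` fit on the built-in iris dataset (not the adult dataset used in the lab below):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load data and hold out 25% for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=5)

# Fit Gaussian NB and measure hold-out accuracy (usually well above 0.9 on iris)
clf = GaussianNB()
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print(round(acc, 3))
```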

K-Nearest Neighbors (KNN)

Space

image-20220424113315121

Dimension

image-20220424113409264

Vector

image-20220424112817814

Distance

image-20220424113525899

image-20220424112851183

Euclidean Distance

image-20220424113633832

Manhattan Distance

image-20220424113839988

Chebyshev Distance

image-20220424114447305

Minkowski Distance

image-20220424114615988
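The three distances above are all special cases of the Minkowski distance: p = 1 gives Manhattan, p = 2 gives Euclidean, and the limit p → ∞ gives Chebyshev. A quick NumPy check on two made-up points:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

def minkowski(x, y, p):
    """Minkowski distance: (sum |x_i - y_i|^p)^(1/p)."""
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

manhattan = minkowski(a, b, 1)       # |3| + |4| + |0| = 7
euclidean = minkowski(a, b, 2)       # sqrt(9 + 16 + 0) = 5
chebyshev = np.max(np.abs(a - b))    # p -> infinity: largest coordinate gap = 4
print(manhattan, euclidean, chebyshev)
```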

Jaccard Distance

image-20220424114753868

image-20220424114902226

Cosine Distance

image-20220424114954666

image-20220424115224915

Correlation Distance

image-20220424115436723

Hamming Distance

image-20220424115458359

image-20220424115509099
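The similarity-based distances above (cosine, correlation, Jaccard, Hamming) are all available in `scipy.spatial.distance`. A sketch on two made-up binary vectors; note that SciPy's `hamming` returns the *fraction* of differing positions, not the count:

```python
import numpy as np
from scipy.spatial import distance

u = np.array([1, 0, 1, 1])
v = np.array([1, 1, 0, 1])

cos_d = distance.cosine(u, v)         # 1 - cosine similarity
corr_d = distance.correlation(u, v)   # 1 - Pearson correlation
ham_d = distance.hamming(u, v)        # 2 of 4 positions differ -> 0.5
jac_d = distance.jaccard(u, v)        # 1 - |intersection| / |union| = 1 - 2/4
print(cos_d, corr_d, ham_d, jac_d)
```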

Distance Summary

image-20220424115710684

image-20220424115730672

image-20220424115922126

image-20220424115929505

Exercises

image-20220424113002694

image-20220424115944637

image-20220424115953912

The Nearest Neighbor Algorithm

image-20220424121809756

The K-Nearest Neighbors Algorithm

image-20220424121907864

image-20220424122045644

image-20220424122248692
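The KNN procedure described above (find the k training points closest to the query, then take a majority vote of their labels) can be sketched in a few lines of NumPy. The two toy clusters are made up:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points (Euclidean)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.5, 1.5])))
print(knn_predict(X_train, y_train, np.array([6.5, 6.5])))
```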

The Effect of K

image-20220424122328955

Commonly Used Distance Metrics

image-20220424122349211

Advantages

image-20220424122521591

Model

image-20220424122844497

image-20220424122858657

image-20220424122912176

Implementation

image-20220424123157511

image-20220424123304029
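In practice KNN is rarely written by hand; a minimal sketch with scikit-learn's `KNeighborsClassifier` on the built-in iris dataset (an assumption here, chosen only because it ships with the library):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=5)

# k=5 neighbors, Minkowski p=2 (Euclidean) distance by default
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
acc = knn.score(X_test, y_test)
print(round(acc, 3))
```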

Classifying Multiclass Problems

image-20220424221609739

image-20220424221731390

image-20220424221810933
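The standard ways to reduce a multiclass problem to binary ones are one-vs-rest (one classifier per class) and one-vs-one (one classifier per pair of classes). A sketch with scikit-learn's wrappers around a logistic regression base learner (the base learner choice is just an assumption for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)   # 3 classes

# One-vs-rest: trains one binary classifier per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
# One-vs-one: trains one binary classifier per pair of classes, C(3,2) = 3 here
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(len(ovr.estimators_), len(ovo.estimators_))
```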

Lab 4: Classification Algorithms

image-20220425113052503

image-20220425111240273

image-20220425111253702

image-20220425113112690

image-20220425113558681

image-20220425125001738

import numpy as np
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB

input_file = r'D:\重庆第二师范学院\2020秋大三上\数据挖掘与机器学习 程雪峰\实验四\实验四 分类算法\数据源\adult.data.txt'
X = []
num_lessthan50k = 0
num_morethan50k = 0
num_threshold = 30000
with open(input_file, 'r') as f:
    for line in f.readlines():
        if '?' in line:          # skip records with missing values
            continue
        data = line[:-1].split(', ')
        # Cap each class at num_threshold samples to keep the classes balanced;
        # both classes go into X, since the label is simply the last column
        if data[-1] == '<=50K' and num_lessthan50k < num_threshold:
            X.append(data)
            num_lessthan50k += 1
        elif data[-1] == '>50K' and num_morethan50k < num_threshold:
            X.append(data)
            num_morethan50k += 1
        if num_lessthan50k >= num_threshold and num_morethan50k >= num_threshold:
            break
X = np.array(X)
print(X)

image-20220425123730771

image-20220425124639268

label_encoder = []
X_encoded = np.empty(X.shape)
for i, item in enumerate(X[0]):
    if item.isdigit():
        # Numeric columns pass through unchanged
        X_encoded[:, i] = X[:, i]
    else:
        # String columns get integer codes from a per-column LabelEncoder
        le = preprocessing.LabelEncoder()
        label_encoder.append(le)
        X_encoded[:, i] = label_encoder[-1].fit_transform(X[:, i])
X = X_encoded[:, :-1].astype(int)   # features: all but the last column
y = X_encoded[:, -1].astype(int)    # label: the last column
print(X)
print(y)

image-20220426180354403

[Using preprocessing.LabelEncoder]

image-20220425130640763
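A quick sketch of how `LabelEncoder` behaves, using invented labels: `fit` learns the sorted unique classes, `transform` maps labels to integer codes, and `inverse_transform` maps codes back:

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(['Male', 'Female', 'Female', 'Male'])

print(le.classes_)                        # sorted unique labels
print(le.transform(['Male', 'Female']))   # labels -> integer codes
print(le.inverse_transform([0, 1]))       # codes -> labels
```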

# ###STEP3###
from sklearn.model_selection import cross_val_score, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=5)
classifier_gaussiannb = GaussianNB()
classifier_gaussiannb.fit(X_train, y_train)
y_test_pred = classifier_gaussiannb.predict(X_test)

# 5-fold cross-validation, scored with the weighted F1 measure
f1 = cross_val_score(classifier_gaussiannb, X, y, scoring='f1_weighted', cv=5)
print('F1 score: ' + str(round(100 * f1.mean(), 2)) + '%')

image-20220426180410228

Cross-Validation

image-20220425130705085
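Cross-validation splits the data into k folds, trains on k−1 of them, and scores on the held-out fold, repeating k times. A self-contained sketch with `cross_val_score` on the built-in iris dataset (chosen here only because it ships with scikit-learn):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: one accuracy score per held-out fold
scores = cross_val_score(GaussianNB(), X, y, cv=5, scoring='accuracy')
print(scores, round(scores.mean(), 3))
```

The mean of the fold scores is a less optimistic performance estimate than a single train/test split.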

###STEP4###
# Build a single sample and apply the same encoding used for training
input_data = ['39', 'State-gov', '77516', 'Bachelors', '13', 'Never-married', 'Adm-clerical', 'Not-in-family', 'White', 'Male', '2174', '0', '40', 'United-States']
count = 0
input_data_encoded = [-1] * len(input_data)
for i, item in enumerate(input_data):
    if item.isdigit():
        input_data_encoded[i] = int(item)
    else:
        # Reuse the LabelEncoder fitted for this string column
        input_data_encoded[i] = int(label_encoder[count].transform([item])[0])
        count += 1
input_data_encoded = np.array(input_data_encoded)

# Predict the class of this sample and print the human-readable label
output_class = classifier_gaussiannb.predict(input_data_encoded.reshape(1, -1))
print(label_encoder[-1].inverse_transform(output_class)[0])

image-20220426181449680


Title: Data Mining and Machine Learning - Chapter 4: Classification Algorithms

Author: TTYONG

Published: April 23, 2022 - 20:04

Last updated: May 5, 2022 - 10:05

Original link: http://tianyong.fun/%E6%95%B0%E6%8D%AE%E6%8C%96%E6%8E%98%E6%8A%80%E6%9C%AF%E4%B8%8E%E5%BA%94%E7%94%A8-%E7%AC%AC%E5%9B%9B%E7%AB%A0-%E5%88%86%E7%B1%BB%E7%AE%97%E6%B3%95.html

License: Please keep the original link and author attribution when reposting.
