Predicting Benign vs. Malignant Tumors with the Wisconsin Breast Cancer Dataset

Date: 2018-11-01 18:27:10


1. Getting the Data

wget https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data

The raw data is comma-separated and has no header row.

Column attributes:

   1. Sample Code Number             ID number
   2. Clump Thickness                1 - 10
   3. Uniformity of Cell Size        1 - 10
   4. Uniformity of Cell Shape       1 - 10
   5. Marginal Adhesion              1 - 10
   6. Single Epithelial Cell Size    1 - 10
   7. Bare Nuclei                    1 - 10
   8. Bland Chromatin                1 - 10
   9. Normal Nucleoli                1 - 10
  10. Mitoses                        1 - 10
  11. Class                          2 = benign, 4 = malignant
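To make the layout above concrete, here is a minimal sketch of parsing rows in this format with pandas. The three rows in SAMPLE are made up for illustration (they are not copied from the file), and passing na_values="?" is one possible way to handle the "?" placeholder that the script in the next section replaces by hand.

```python
import io

import pandas as pd

# Three rows in the same layout as breast-cancer-wisconsin.data.
# These values are invented for illustration only.
SAMPLE = """\
1000001,5,1,1,1,2,1,3,1,1,2
1000002,8,10,10,8,7,?,9,7,1,4
1000003,4,2,1,1,2,3,3,1,1,2
"""

COLUMN_NAMES = [
    "Sample code number", "Clump Thickness", "Uniformity of Cell Size",
    "Uniformity of Cell Shape", "Marginal Adhesion",
    "Single Epithelial Cell Size", "Bare Nuclei", "Bland Chromatin",
    "Normal Nucleoli", "Mitoses", "Class",
]

# header=None: there is no header row, so the names are supplied explicitly.
# na_values="?": treat the "?" placeholder as a missing value while parsing.
sample_df = pd.read_csv(io.StringIO(SAMPLE), header=None,
                        names=COLUMN_NAMES, na_values="?")

n_missing = int(sample_df["Bare Nuclei"].isna().sum())
clean_df = sample_df.dropna(how="any")

print(sample_df.shape)   # (rows, columns) of the parsed sample
print(n_missing)         # rows whose Bare Nuclei value was "?"
print(len(clean_df))     # rows left after dropping incomplete ones
```

Parsing "?" as missing at read time keeps the affected column numeric, so no separate type conversion is needed afterwards.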

2. Using Logistic Regression (LR) and SGD

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn import metrics

# The file has no header row, so pass header=None
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', header=None)

column_names = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size',
                'Uniformity of Cell Shape', 'Marginal Adhesion',
                'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
                'Normal Nucleoli', 'Mitoses', 'Class']

data.columns = column_names
# Missing values are marked with '?': turn them into NaN and drop those rows
data = data.replace(to_replace='?', value=np.nan)
data = data.dropna(how='any')
# A column that contained '?' was read as strings, so convert everything back to integers
data = data.astype(int)

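# By convention 1 means malignant and 0 means benign.
# This dataset uses 4 for malignant and 2 for benign, so map 4 -> 1 and 2 -> 0.
# Use .loc rather than chained indexing (data['Class'][mask] = ...) to avoid SettingWithCopyWarning.
data.loc[data['Class'] == 4, 'Class'] = 1
data.loc[data['Class'] == 2, 'Class'] = 0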

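# 'Sample code number' carries no information for classification, so only columns 2-10 are used as features.
# 75% of the data is used for training and 25% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    data[column_names[1:10]], data[column_names[10]],
    test_size=0.25, random_state=33)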

ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

lr = LogisticRegression()
lr.fit(X_train, y_train)
lr_y_predict = lr.predict(X_test)
print('The LR Predict Result', metrics.accuracy_score(y_test, lr_y_predict))
# LogisticRegression also has its own score method
print("The LR Predict Result Show By lr.score", lr.score(X_test, y_test))


sgdc = SGDClassifier(max_iter = 1000)
sgdc.fit(X_train, y_train)
sgdc_y_predict = sgdc.predict(X_test)
print("The SGDC Predict Result", metrics.accuracy_score(y_test, sgdc_y_predict))
# SGDClassifier also has its own score method
print("The SGDC Predict Result Show By SGDC.score", sgdc.score(X_test, y_test))
print("\n")
print("Performance analysis:\n")
from sklearn.metrics import classification_report
# Use classification_report to get three metrics for LR: precision, recall, and F1 (their harmonic mean)
print(classification_report(y_test, lr_y_predict, target_names=['Benign', 'Malignant']))

# Use classification_report to get the same three metrics for SGDC
print(classification_report(y_test, sgdc_y_predict, target_names=['Benign', 'Malignant']))

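'''
Analysis:
LogisticRegression fits its parameters with an iterative solver that uses the whole training set at every step; training takes longer, but the resulting model tends to perform better.
SGDClassifier estimates its parameters with stochastic gradient descent, updating from one sample at a time; training is faster, but the resulting model tends to perform slightly worse.
(With its default hinge loss, SGDClassifier fits a linear SVM rather than a logistic regression.)
As a rule of thumb, for training sets on the order of 100,000 samples or more, SGDClassifier is the recommended choice because of training time.
'''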
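The difference between using the full training set per step and updating from one sample at a time can be sketched from scratch with NumPy. This is only an illustration under stated assumptions: the data is synthetic, the helper names (fit_full_batch, fit_sgd, accuracy) are hypothetical, and neither function is how scikit-learn implements LogisticRegression or SGDClassifier internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data (an assumption for illustration only).
n_samples, n_features = 2000, 5
true_w = np.array([2.0, -1.0, 0.5, 1.5, -2.0])
X = rng.normal(size=(n_samples, n_features))
y = (X @ true_w + 0.5 * rng.normal(size=n_samples) > 0).astype(float)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_full_batch(X, y, lr=0.5, n_iter=200):
    """Gradient descent on the mean log loss, using every sample per step."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= lr * grad
    return w


def fit_sgd(X, y, lr=0.05, n_epochs=5, seed=0):
    """Stochastic gradient descent: one sample per update, shuffled each epoch."""
    order_rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in order_rng.permutation(len(y)):
            grad_i = (sigmoid(X[i] @ w) - y[i]) * X[i]
            w -= lr * grad_i
    return w


def accuracy(w, X, y):
    """Fraction of samples whose predicted class (score > 0) matches the label."""
    return float(np.mean((X @ w > 0) == (y == 1)))


w_batch = fit_full_batch(X, y)
w_sgd = fit_sgd(X, y)
acc_batch = accuracy(w_batch, X, y)
acc_sgd = accuracy(w_sgd, X, y)
print("full-batch accuracy:", acc_batch)
print("SGD accuracy:       ", acc_sgd)
```

Each full-batch step touches all 2000 rows, while each SGD step touches one, which is why the stochastic variant scales better to very large training sets even though its individual updates are noisier.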

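The three metrics that classification_report prints (precision, recall, and F1) can also be computed by hand, which helps when reading the report. The sketch below uses made-up labels, not results from the model above, and binary_metrics is a hypothetical helper name.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Precision: of everything predicted positive, how much really is positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of everything really positive, how much was found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Made-up labels for illustration: 1 = malignant, 0 = benign.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

precision, recall, f1 = binary_metrics(y_true_demo, y_pred_demo, positive=1)
print(f"malignant: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: malignant: precision=0.75 recall=0.75 f1=0.75
```

For a screening task like this one, recall on the malignant class is usually the number to watch, since a missed malignant case is costlier than a false alarm.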


Original post: https://www.cnblogs.com/always-fight/p/9888353.html
