Notes on a failed Kaggle competition (3): what went wrong, plus greedy feature selection, cross-validation, and blending
The competition ended today; the results are here: https://www.kaggle.com/c/santander-customer-satisfaction/leaderboard
Public leaderboard results: (screenshot not preserved)
Private leaderboard results: (screenshot not preserved)
Comparing the private and public results, a few things stand out:
1) Almost everyone overfit; or, put another way, the private half of the test data is less regular than the public half.
2) Five of the private top ten were nowhere near the top few hundred on the public leaderboard, and four of them even sat somewhere between 1000th and 2000th. This says that using a sound methodology matters far more than blindly chasing your public ranking!!!
3) I myself climbed from 2323rd on public to 1063rd on private, a jump of 1260 places. As someone entering this kind of competition for the first time, and one buried under coursework, I'm fairly satisfied with that result out of 5236 teams and 5831 competitors; with so little experience, I wasted a lot of effort on detours.
4) Back to the crucial question: what exactly counts as "a sound methodology"??? That is the failure I want to dig into here:
1. Pick the right model. Because I didn't understand the data, I went straight to trying the following models:
models=[
    RandomForestClassifier(n_estimators=1999, criterion='gini', n_jobs=-1, random_state=SEED),
    RandomForestClassifier(n_estimators=1999, criterion='entropy', n_jobs=-1, random_state=SEED),
    ExtraTreesClassifier(n_estimators=1999, criterion='gini', n_jobs=-1, random_state=SEED),
    ExtraTreesClassifier(n_estimators=1999, criterion='entropy', n_jobs=-1, random_state=SEED),
    GradientBoostingClassifier(learning_rate=0.1, n_estimators=101, subsample=0.6, max_depth=8, random_state=SEED)
]
What I actually want to say here is that all of these models are painfully slow! At first I couldn't be bothered to set up xgBoost, and that choice cost me an enormous amount of time; only after switching to xgBoost did I get my final result. So when you don't yet understand the data, choosing a model that is fast and generalizes well matters a great deal, and xgBoost is the first choice.
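For reference, a fast first pass could look like the sketch below. This is a minimal illustration only: every parameter value in it is my assumption, not a setting I actually tuned.
import xgboost as xgb
# a quick, strong baseline: boosted trees train far faster than the
# 1999-tree forests above (parameter values here are illustrative only)
clf = xgb.XGBClassifier(n_estimators=300, max_depth=5, learning_rate=0.1,
                        subsample=0.8, colsample_bytree=0.8, seed=SEED)
clf.fit(trainX, trainY)
proba = clf.predict_proba(testX)[:, 1]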
2. Jumping straight into complicated models without a moment's thought, without even a baseline: yes, that was me, and as a first-timer I simply lacked the experience. Complicated models overfit easily, so the longer you compete, the deeper you sink; they also eat up huge amounts of time, which is a genuine waste of your youth. I only realized this when I was nearly out of time, and sure enough, my final result came from a very simple model. So start by knocking together a simple model and use it as the reference point for everything you build afterwards. What counts as a simple model: the raw dataset (or one with only light processing, e.g. dropping constant columns, filling missing values, normalizing) fed to logistic regression, a plain SVM, or xgBoost; a sketch follows.
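As a minimal illustration only (the preprocessing choices here are my assumptions, and trainX/trainY are assumed to be loaded as in the full script at the end):
import numpy as np
from sklearn import cross_validation
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# drop constant columns, scale, then score a plain logistic regression
keepCols=[i for i in range(trainX.shape[1]) if np.unique(trainX[:,i]).size>1]
baseline=Pipeline([("scale", StandardScaler()), ("lr", LogisticRegression())])
scores=cross_validation.cross_val_score(baseline, trainX[:,keepCols], trainY, cv=5, scoring="roc_auc")
print "baseline auc: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std()*2)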
3. Trust your cross-validation results: don't just split the data into two parts, because with cross-validation you will see that some folds do very well, with AUC around 0.85, while others are poor, not even reaching 0.82; a sketch of reporting the per-fold spread follows (the CrossValidationScore function in the full code below does the same thing).
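A minimal sketch, assuming a classifier clf and trainX/trainY as in the full script:
from sklearn import metrics
from sklearn.cross_validation import StratifiedKFold
# print every fold's AUC, not just the mean, to expose the fold-to-fold spread
kfcv=StratifiedKFold(y=trainY, n_folds=5, shuffle=True, random_state=SEED)
for j, (trainI, cvI) in enumerate(kfcv):
    clf.fit(trainX[trainI], trainY[trainI])
    auc=metrics.roc_auc_score(trainY[cvI], clf.predict_proba(trainX[cvI])[:,1])
    print "fold %d auc: %0.4f" % (j, auc)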
4. On noise: I never found a good way to deal with it, so it's no surprise the final result wasn't great.
5. On columns that are mostly zeros: normalize your features, this is absolutely necessary! Otherwise all of your later feature engineering will appear useless, because 0+k=k, 0*k=0, 0^2=0. As for exactly how to normalize, I won't say more than this; one possible direction is sketched below.
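Purely as one possible direction (my assumption, not necessarily the transform I used): shift and scale each column so that zero stops being a fixed point of the arithmetic above.
from sklearn.preprocessing import StandardScaler
# after centering/scaling, the many zero entries no longer map to zero,
# so products and powers of features can actually carry information
scaler=StandardScaler()
trainXs=scaler.fit_transform(trainX.astype(float))
testXs=scaler.transform(testX.astype(float))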
6. Then there are the small details. For instance, feature selection: if your final model is a GBDT, then do the selection with a GBDT, because features that an LR finds useful may not be useful to a GBDT (a sketch follows). There are many more of these, things you only notice through practice; for example, should feature processing run on train+test or on train alone? In theory, train only, because we pretend the test set is unknown; but in a competition like this where you do have the test set, you might as well use it.... I'll stop here; practice beats listening to me. However busy research gets, there is always time for one competition per semester........
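For the selection point, a hedged sketch of ranking features with the same model family as the final model (the importance cutoff and the cap of 150 are my assumptions; 150 just mirrors the maxFeaNum used in the script below):
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
# rank columns by GBDT importance instead of LR coefficients,
# so the selected features suit the model that will actually be trained
gbc=GradientBoostingClassifier(n_estimators=101, random_state=SEED)
gbc.fit(trainX, trainY)
order=np.argsort(gbc.feature_importances_)[::-1]
gbcFeaInds=[i for i in order if gbc.feature_importances_[i]>0][:150]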
7. Enough preaching; here is some code. The key pieces are greedy feature selection, cross-validation, and blending, but the code as a whole is complete:
#!/usr/bin/env python
#-*- coding:utf-8 -*-
import pandas as pd
import numpy as np
from sklearn import preprocessing, cross_validation, metrics
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.cross_validation import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.externals import joblib
SEED=1126
nFold=5
def SaveFile(submitID, testSubmit, fileName="submit.csv"):
    #write the submission: one "ID,TARGET" line per test sample
    content="ID,TARGET"
    for i in range(submitID.shape[0]):
        content+="\n"+str(submitID[i])+","+str(testSubmit[i])
    file=open(fileName,"w")
    file.write(content)
    file.close()
def CrossValidationScore(data, label, clf, nFold=5, scoreType="accuracy"):
    #mean cross-validation score of clf on (data, label)
    if scoreType=="accuracy":
        scores=cross_validation.cross_val_score(clf,data,label,cv=nFold)
        #print("mean accuracy: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() * 2))
        return scores.mean()
    elif scoreType=="auc":
        meanAUC=0.0
        kfcv=StratifiedKFold(y=label, n_folds=nFold, shuffle=True, random_state=SEED)
        for j, (trainI, cvI) in enumerate(kfcv):
            print "Fold ", j, "^"*20
            Xtrain=data[trainI]
            Xcv=data[cvI]
            Ytrain=label[trainI]
            Ycv=label[cvI]
            clf.fit(Xtrain,Ytrain)
            probas=clf.predict_proba(Xcv)
            aucScore=metrics.roc_auc_score(Ycv, probas[:,1])
            #print "auc (fold %d/%d): %0.4f" % (j+1,nFold, aucScore)
            meanAUC+=aucScore
        #print "mean auc: %0.4f" % (meanAUC/nFold)
        return meanAUC/nFold
def GreedyFeatureAdd(clf, data, label, scoreType="accuracy", goodFeatures=[], maxFeaNum=100, eps=0.00005):
    #greedy forward selection: each round, add the single feature with the
    #biggest cross-validation gain, until the gain drops below eps
    #(note: pass goodFeatures explicitly; a mutable default persists across calls)
    scoreHistorys=[]
    while len(scoreHistorys)<=2 or scoreHistorys[-1]>scoreHistorys[-2]+eps:
        if len(goodFeatures)==maxFeaNum:
            break
        scores=[]
        for testFeaInd in range(data.shape[1]):
            if testFeaInd not in goodFeatures:
                #build a new list; list.append() returns None, so don't use it here
                tempFeaInds=goodFeatures+[testFeaInd]
                tempData=data[:,tempFeaInds]
                score=CrossValidationScore(tempData, label, clf, nFold, scoreType)
                scores.append((score,testFeaInd))
                print "feature: "+str(testFeaInd)+"==>mean "+scoreType+": %0.4f" % score
        goodFeatures.append(sorted(scores)[-1][1]) #only add the feature with the biggest gain score
        scoreHistorys.append(sorted(scores)[-1][0]) #record the biggest gain score
        #print scoreHistorys
        print "current features: %s" % sorted(goodFeatures)
    if len(goodFeatures)<maxFeaNum:
        goodFeatures.pop(-1) #the loop stopped because the last feature didn't improve the score: drop it
    #don't sort goodFeatures here; we may want the first maxFeaNum "biggest gain" features in addition order
    print "selected %d features: %s" % (len(goodFeatures), goodFeatures)
    return goodFeatures #a feature list
trainD=pd.read_csv("train.csv")
trainY=np.array(trainD.iloc[:,-1])
trainX=np.array(trainD.iloc[:,1:-1]) #drop ID and TARGET
testD=pd.read_csv("test.csv")
submitID=np.array(testD.iloc[:,0]) #ID column
testX=np.array(testD.iloc[:,1:])#drop ID
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! better use a RFC or GBC as the clf
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! because the final predict model are those two
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! we should select better feature for RFC or GBC, not for LR
clf = LogisticRegression(class_weight='balanced', penalty='l2', n_jobs=-1)
selectedFeaInds=GreedyFeatureAdd(clf, trainX, trainY, scoreType="auc", goodFeatures=[], maxFeaNum=150)
joblib.dump(selectedFeaInds, 'modelPersistence/selectedFeaInds.pkl')
#selectedFeaInds=joblib.load('modelPersistence/selectedFeaInds.pkl')
trainX=trainX[:,selectedFeaInds]
testX=testX[:,selectedFeaInds]
print trainX.shape
trainN=len(trainY)
print "Creating train and test sets for blending..."
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! always use a seed for randomized procedures
models=[
    RandomForestClassifier(n_estimators=1999, criterion='gini', n_jobs=-1, random_state=SEED),
    RandomForestClassifier(n_estimators=1999, criterion='entropy', n_jobs=-1, random_state=SEED),
    ExtraTreesClassifier(n_estimators=1999, criterion='gini', n_jobs=-1, random_state=SEED),
    ExtraTreesClassifier(n_estimators=1999, criterion='entropy', n_jobs=-1, random_state=SEED),
    GradientBoostingClassifier(learning_rate=0.1, n_estimators=101, subsample=0.6, max_depth=8, random_state=SEED)
]
#StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.
#kfcv=KFold(n=trainN, n_folds=nFold, shuffle=True, random_state=SEED)
kfcv=StratifiedKFold(y=trainY, n_folds=nFold, shuffle=True, random_state=SEED)
dataset_trainBlend=np.zeros( ( trainN, len(models) ) )
dataset_testBlend=np.zeros( ( len(testX), len(models) ) )
meanAUC=0.0
for i, model in enumerate(models):
    print "model ", i, "=="*20
    dataset_testBlend_j=np.zeros( ( len(testX), nFold ) )
    for j, (trainI, testI) in enumerate(kfcv):
        print "Fold ", j, "^"*20
        Xtrain=trainX[trainI]
        Xcv=trainX[testI]
        Ytrain=trainY[trainI]
        Ycv=trainY[testI]
        model.fit(Xtrain,Ytrain)
        Ypred=model.predict_proba(Xcv)[:,1]
        dataset_trainBlend[testI, i]=Ypred #out-of-fold predictions become the meta-features for blending
        dataset_testBlend_j[:,j]=model.predict_proba(testX)[:,1]
    dataset_testBlend[:,i]=dataset_testBlend_j.mean(1) #average the nFold test-set predictions per model
    aucScore=metrics.roc_auc_score(trainY, dataset_trainBlend[:, i])
    print "model %d, cv mean auc: %0.9f" % (i, aucScore)
    meanAUC+=aucScore
print "ALL models, cv mean auc: %0.9f" % (meanAUC/len(models))
'''
RF(gini):     0.7786
RF(entropy):  0.7814
ET(gini):     0.7230
ET(entropy):  0.7239
GBC:          0.8199
mean auc: 0.7654
'''
print "Blending models..."
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! if we want to predict some real values, use RidgeCV
model=LogisticRegression(n_jobs=-1)
C=np.linspace(0.001,1.0,1000)
trainAucList=[]
for c in C:
    model.C=c
    model.fit(dataset_trainBlend,trainY)
    trainProba=model.predict_proba(dataset_trainBlend)[:,1]
    trainAuc=metrics.roc_auc_score(trainY, trainProba)
    trainAucList.append((trainAuc, c))
sortedtrainAucList=sorted(trainAucList)
for trainAuc, c in sortedtrainAucList:
    print "c=%f => trainAuc=%f" % (c, trainAuc)
'''
C => trainAuc
0.0001 => 0.126..
0.001 => 0.807188
0.01 => 0.815833
0.03 => 0.820674
0.04 => 0.821295
0.05 => 0.821439 ***
0.06 => 0.821129
0.07 => 0.820521
0.08 => 0.820067
0.1 => 0.819036
0.3 => 0.813210
1.0 => 0.809002
10.0 => 0.807334
'''
model.C=sortedtrainAucList[-1][1] #0.05
model.fit(dataset_trainBlend,trainY)
trainProba=model.predict_proba(dataset_trainBlend)[:,1]
print "train auc: %f" % metrics.roc_auc_score(trainY, trainProba) #0.821439
print "model.coef_: ", model.coef_
print "Predict and saving results..."
submitProba=model.predict_proba(dataset_testBlend)[:,1]
df=pd.DataFrame(submitProba)
print df.describe()
SaveFile(submitID, submitProba, fileName="1submit.csv") #0.815536 [blending makes result < GBC 0.8199]
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Blending models ISN'T a good idea when one model OBVIOUSLY better than others...
'''
count 75818.000000
mean 0.039187
std 0.033691
min 0.024876
25% 0.028400
50% 0.029650
75% 0.034284
max 0.806586
'''
print "MinMaxScaler predictions to [0,1]..."
mms=preprocessing.MinMaxScaler(feature_range=(0, 1))
#note: newer scikit-learn versions require a 2-D input here, e.g. submitProba.reshape(-1, 1)
submitProba=mms.fit_transform(submitProba)
df=pd.DataFrame(submitProba)
print df.describe()
SaveFile(submitID, submitProba, fileName="1submitScale.csv") #0.815536
'''
count 75818.000000
mean 0.018307
std 0.043099
min 0.000000
25% 0.004509
50% 0.006107
75% 0.012035
max 1.000000
'''
There is actually a lot more I want to say, but I'll end this post here; after all, a lecture from someone ranked 1000+ gets tiresome. I'll save the rest for when I enter another competition.
It turns out this agrees with the advice of a top Kaggler: http://blog.kaggle.com/2016/02/22/profiling-top-kagglers-leustagos-current-7-highest-1/
Original post: http://blog.csdn.net/mmc2015/article/details/51301865