This summer, the ICDM 2015 conference sponsored a competition focused on making individual user connections across multiple digital devices. Top teams were invited to submit a paper for presentation at an ICDM workshop.
Roberto Diaz, competing as team "CookieMonster", took 3rd place. In this blog, he shares how he became a Kaggle addict, what he values in a competition, and most importantly, details on his approach to this unique dataset. Congrats to Roberto for achieving his goal of becoming a top 100 Kaggle user!
407 players on 340 teams competed in ICDM 2015: Drawbridge Cross-Device Connections
In addition to being a Kaggle addict, I am a researcher at Treelogic working in the machine learning area. In parallel, I am working on my PhD thesis at the University Carlos III de Madrid, focused on the parallelization of kernel methods.
Roberto's Kaggle profile
I didn't have any knowledge of this domain. The topic is quite new and I couldn't find any papers related to this problem, most probably because there are no public datasets.
I started with the first Facebook competition a long time ago. A friend of mine was taking part in the challenge and he encouraged me to compete. That piqued my curiosity, so I went to the challenge's forum and read a post with a solution that scored quite well on the leaderboard. I thought, "I think I can do better than that". In the end I finished 9th on the leaderboard.
For my second challenge (the EMC Israel Data Science Challenge) I was on a team with my PhD colleagues. We finished 3rd and received a prize.
After that it was too late for me: I had become an addict.
The things I value most in a challenge are:
Díaz-Morales, R., & Navia-Vázquez, A. (2015, September). Optimization of AMS using Weighted AUC optimized models. In *JMLR: Workshop and Conference Proceedings*, Vol. 42, pp. 109-127.
This challenge looked very interesting to me because all the conditions were met.
In this challenge we had a list of devices and a list of cookies, and we had to tell which cookies belonged to the person using each device.
The most important part was the feature extraction procedure. The features had to capture the relation between devices and cookies (for example, the number of IP addresses visited by each one and the number visited by both).
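As a rough illustration of the kind of device/cookie feature described above, here is a minimal sketch. It assumes each device and each cookie has already been reduced to a set of IP address identifiers. The function name `ip_features`, the toy identifiers, and the Jaccard ratio are illustrative assumptions, not part of the original solution.

```python
def ip_features(device_ips, cookie_ips):
    """Build IP-overlap features for one device/cookie pair.

    Both arguments are sets of IP identifiers (hypothetical format).
    """
    shared = device_ips & cookie_ips
    union = device_ips | cookie_ips
    return {
        "n_ips_device": len(device_ips),   # IPs visited by the device
        "n_ips_cookie": len(cookie_ips),   # IPs visited by the cookie
        "n_ips_shared": len(shared),       # IPs visited by both
        # Jaccard overlap: an extra ratio added here for illustration only
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Toy data with made-up identifiers
device_ips = {"ip_1", "ip_2", "ip_3"}
cookie_ips = {"ip_2", "ip_3", "ip_4", "ip_5"}
features = ip_features(device_ips, cookie_ips)
print(features)
```

One such feature dictionary per candidate pair could then be fed to a supervised classifier.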
Once I had the features I tried both simple and complex supervised machine learning algorithms (my winning methodology was a semi-supervised learning procedure using Gradient Boosting + Bagging), and the score only grew from 0.865 to 0.88.
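The scores quoted above are presumably values of the F-measure with beta = 0.5 that the post mentions further down. As a reminder of how that measure behaves, here is a minimal sketch, assuming it is computed for one device over the set of predicted cookies and the set of true cookies. The function name `f_beta` and the toy cookie IDs are illustrative assumptions.

```python
def f_beta(predicted, actual, beta=0.5):
    """F-beta score between a predicted and a true set of cookies."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# One wrong extra cookie (precision 0.5, recall 1.0)
extra_cookie = f_beta({"c1", "c2"}, {"c1"})
# One missed cookie (precision 1.0, recall 0.5)
missed_cookie = f_beta({"c1"}, {"c1", "c2"})
print(extra_cookie, missed_cookie)
```

With beta below 1, precision weighs more than recall, so predicting a wrong extra cookie costs more than missing a true one.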
A key part of the solution was the initial selection of candidates and the post processing:
The initial selection of candidates reduces the complexity of the problem, and the post-processing step recovers most of the device/cookie pairs lost by that initial selection strategy.
Yes. When I sorted the scores obtained by the classifier for every candidate, I saw that if the first score is high and the second is very low, it is extremely likely that the first cookie belongs to the device. I used this information to create a semi-supervised learning procedure, updating some features in the training set and retraining the algorithm with this new information to improve the results.
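The selection rule described above can be sketched as follows. This is a minimal illustration, assuming the classifier scores are available as a nested dictionary. The thresholds `high` and `low`, the function name `confident_matches`, and the choice to treat a missing second candidate as score 0.0 are all assumptions, not values from the original solution.

```python
def confident_matches(scores, high=0.9, low=0.1):
    """Pick devices whose top candidate is high and runner-up is very low.

    scores: dict mapping device -> dict mapping cookie -> classifier score.
    Returns a dict mapping device -> cookie for the confident pairs.
    """
    confident = {}
    for device, cookie_scores in scores.items():
        ranked = sorted(cookie_scores.items(), key=lambda kv: kv[1], reverse=True)
        if not ranked:
            continue
        best_cookie, best_score = ranked[0]
        # Assumption: with a single candidate, the runner-up score is 0.0
        second_score = ranked[1][1] if len(ranked) > 1 else 0.0
        if best_score >= high and second_score <= low:
            confident[device] = best_cookie
    return confident


# Toy scores with made-up identifiers
scores = {
    "dev_a": {"c1": 0.97, "c2": 0.03},   # clear winner
    "dev_b": {"c3": 0.95, "c4": 0.60},   # ambiguous runner-up
    "dev_c": {"c5": 0.40, "c6": 0.05},   # top score too low
}
matches = confident_matches(scores)
print(matches)
```

The confident pairs could then be used to update features in the training set before retraining, in the spirit of the semi-supervised procedure described above.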
This picture shows the F0.5 score and the percentage of devices that fulfill the condition when we match each device with its first cookie candidate whenever the second candidate scores below a threshold:
This solution was implemented in Python and uses the external software XGBoost.
The Python libraries used were:
I spent about 20% of the time on feature engineering, 10% on the supervised learning part and 70% eagerly awaiting the results.
Too long. The training procedure takes around 9 hours using 12 cores.
The prediction procedure takes around 30 minutes, because some features have to be extracted from the relational database.
I was trying to reach the top 100 of the global user ranking, and I finally made it.
Regarding the challenge:
"All hope abandon, ye who enter here".
No, seriously, at the beginning you may feel frustrated because it is a difficult area, but you are in the right place if:
Roberto Diaz is a researcher in the R&D department of Treelogic, a Spanish SME focused on Machine Learning, Computer Vision and Big Data that takes part in many EU Research and Innovation programmes. In parallel, he is working on his PhD thesis at the University Carlos III de Madrid, focused on the parallelization of kernel methods.
ICDM Winner's Interview: 3rd place, Roberto Diaz
Original article: http://www.cnblogs.com/yymn/p/4817370.html