
A webmagic crawler program

Date: 2014-07-05 18:52:01

Tags: http, java, os, html, for, htm

package com.letv.cloud.spider;

import java.util.HashSet;
import java.util.List;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class MoviePaperPageProcessor implements PageProcessor {

    // Retry a failed download up to 3 times; sleep 1000 ms between requests.
    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    public Site getSite() {
        return site;
    }

    public void process(Page page) {
        // Collect links to individual poster pages and queue them for crawling.
        List<String> links = page.getHtml().links().regex(
                "http://posters.imdb.cn/poster/\\d+").all();
        links = removeDuplicate(links);
        page.addTargetRequests(links);
        // Extract the poster title and image URL from the current page.
        page.putField("title", page.getHtml().xpath(
                "//div[@id='imdbleftsecc']/center/h1/text()").toString());
        page.putField("imgurl", page.getHtml().xpath(
                "//div[@id='imdbleftsecc']/center/img/@src").toString());
    }

    public static void main(String[] args) {
        // Crawl listing pages 1 to 3, one after another, with 5 threads each.
        for (int i = 1; i <= 3; i++) {
            Spider.create(new MoviePaperPageProcessor()).addUrl(
                    "http://posters.imdb.cn/poster_page/" + i).thread(5).run();
        }
    }

    // Remove duplicates from the list in place (HashSet does not keep the original order).
    public static List<String> removeDuplicate(List<String> list) {
        HashSet<String> hs = new HashSet<String>(list);
        list.clear();
        list.addAll(hs);
        return list;
    }
}
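
The removeDuplicate helper above goes through a HashSet, so the links come back in no particular order, and it modifies the list it is given. If crawl order matters, an order-preserving variant might look like the sketch below. The class name LinkDedup and the method name removeDuplicatesKeepOrder are hypothetical and not part of the original program or of webmagic; only java.util classes are used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper, not part of the original post or of webmagic.
final class LinkDedup {

    // Returns a new list with duplicates removed, keeping the first
    // occurrence of each link in its original position. The input list
    // is left untouched, so it also works with unmodifiable lists.
    static List<String> removeDuplicatesKeepOrder(List<String> links) {
        return new ArrayList<String>(new LinkedHashSet<String>(links));
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
                "http://posters.imdb.cn/poster/1",
                "http://posters.imdb.cn/poster/2",
                "http://posters.imdb.cn/poster/1");
        System.out.println(removeDuplicatesKeepOrder(links));
    }
}
```

In process(), the call links = removeDuplicate(links) could then be swapped for this helper, assuming the rest of the class stays as written.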


Original article: http://www.cnblogs.com/cjings/p/3822887.html
