码迷,mamicode.com

Fetching Web Page Content

Posted: 2014-11-26 18:17:05


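The previous post showed how to determine a web page's encoding. Once the encoding is known, we open the response stream with a reader that uses that encoding, read the page content, and store it in a String variable.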
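The decoding step described above can be tried in isolation, without a network connection. The following is only a minimal sketch: the class name `StreamDecodeSketch` and the helper `readAll` are hypothetical names introduced for illustration, and an in-memory byte array stands in for the HTTP response stream.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original post.
class StreamDecodeSketch {

    // Reads the whole stream, decoding its bytes with the named charset.
    // Each line read is followed by "\n", as in the post's listing.
    static String readAll(InputStream is, String charsetName) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(is, Charset.forName(charsetName)))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append("\n");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "\u767e\u5ea6" is the word "Baidu" written in Chinese characters.
        byte[] body = "<title>\u767e\u5ea6</title>".getBytes(StandardCharsets.UTF_8);
        // Decoding with the matching charset restores the original text.
        System.out.println(readAll(new ByteArrayInputStream(body), "UTF-8"));
        // Decoding with a mismatched charset produces garbled characters.
        System.out.println(readAll(new ByteArrayInputStream(body), "ISO-8859-1"));
    }
}
```

The second call in `main` shows why the encoding has to be detected first: the same bytes decoded with the wrong charset no longer match the original text.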


package Spider;

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;

import org.junit.Test;

public class Capture {

    // JUnit 4 test methods must return void, so the test only calls k() and prints the page.
    @Test
    public void testK() throws Exception {
        System.out.println(k());
    }

    // Downloads the Baidu home page and returns its content as a String.
    // Errors propagate to the caller instead of being swallowed.
    public String k() throws Exception {
        URL url = new URL("http://www.baidu.com");
        // Get the encoding of the Baidu home page
        String charset = i(url);
        if (charset == null) {
            // Assumed fallback: the response declared no charset, so UTF-8 is only a guess
            charset = "UTF-8";
        }
        // Set the proxy
        Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress("proxy3.bj.petrochina", 8080));
        // Open the connection
        HttpURLConnection urlcon = (HttpURLConnection) url.openConnection(proxy);
        urlcon.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows XP; DigExt)");
        StringBuilder codeBuffer = new StringBuilder();
        // try-with-resources closes the reader and the stream even if reading fails
        try (InputStream is = urlcon.getInputStream();
             BufferedReader in = new BufferedReader(new InputStreamReader(is, charset))) {
            String tmpCode;
            // Read the response line by line and append each line to codeBuffer
            while ((tmpCode = in.readLine()) != null) {
                codeBuffer.append(tmpCode).append("\n");
            }
        }
        return codeBuffer.toString();
    }

    // Returns the charset declared in the Content-Type response header,
    // or null if the header is missing or declares no charset.
    public String i(URL url) throws Exception {
        // Set the proxy
        Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress("proxy3.bj.petrochina", 8080));
        // Open the connection
        HttpURLConnection urlcon = (HttpURLConnection) url.openConnection(proxy);
        urlcon.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows XP; DigExt)");
        String charset = null;
        String contentType = urlcon.getHeaderField("Content-Type");
        if (contentType == null) {
            // No Content-Type header, so there is nothing to parse
            return null;
        }
        // A header such as "text/html; charset=utf-8" is split into its parameters
        for (String param : contentType.replace(" ", "").split(";")) {
            if (param.startsWith("charset=")) {
                charset = param.split("=", 2)[1];
                break;
            }
        }
        return charset;
    }
}
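The header parsing in i() can also be checked without a network connection. The sketch below is an illustration under stated assumptions, not the author's method: `ContentTypeCharsetSketch` and `charsetOf` are hypothetical names, and the explicit fallback argument, the case-insensitive match, and the quote stripping are additions that the original listing does not have.

```java
import java.util.Locale;

// Hypothetical helper class, not part of the original post.
class ContentTypeCharsetSketch {

    // Extracts the charset parameter from a Content-Type header value.
    // Returns the fallback when the header is null or declares no charset.
    static String charsetOf(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            // Parameter names are matched case-insensitively (an addition to the original logic)
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional double quotes, as in charset="GBK"
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? fallback : value;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=utf-8", "ISO-8859-1"));
        System.out.println(charsetOf("text/html;Charset=\"GBK\"", "UTF-8"));
        System.out.println(charsetOf("text/html", "UTF-8"));
        System.out.println(charsetOf(null, "UTF-8"));
    }
}
```

Passing the fallback explicitly keeps the choice of default encoding visible at the call site, so a missing charset never reaches InputStreamReader as null.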

 

 


Original post: http://www.cnblogs.com/jiejiecool/p/4123464.html
