Implementing the Principles of a Java Web Crawler

Posted: 2016-12-18 14:55:38

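With nothing better to do, I spent another two or three days digging into web crawlers. I finally have the principles more or less figured out and a basic implementation working. As before there are two programs, one breadth-first and one depth-first. The breadth-first one works fine. Use the depth-first one with caution: it is very likely to get stuck in an infinite loop, and it did exactly that on the site I tested it against below... OK, I admit my luck isn't great. Now, over to the code~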

1. Breadth-first

/**
 * Breadth-first crawl.
 *
 * @author 魏诗尧
 * @version 1.8
 * @email inwsy@hotmail.com
 */
[Listing body not recoverable from the source page. What survives shows a package declaration and eleven imports, one class with three private void methods and a main method, a read loop over a 1024-byte buffer, a chain of six if/continue checks on each extracted link, and try/catch/finally blocks that close the streams.]
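As a rough illustration of the breadth-first idea (not the author's original code), here is a minimal sketch. It assumes a hypothetical `linksOf` hook that returns the links found on a page, so the traversal can be shown without any network code. The class name, the `maxPages` cap, and the tiny in-memory "web" are also assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch of a breadth-first crawl; names are illustrative only. */
class BfsCrawlSketch {

    /**
     * Visits pages level by level, starting from seed.
     * linksOf is an assumed hook: given a URL, it returns the links on that page.
     * maxPages caps the number of visited pages so the crawl always terminates.
     */
    static List<String> crawl(String seed,
                              Function<String, List<String>> linksOf,
                              int maxPages) {
        List<String> visited = new ArrayList<>();
        if (maxPages <= 0) {
            return visited;
        }
        Set<String> seen = new LinkedHashSet<>();     // every URL queued so far
        Queue<String> frontier = new ArrayDeque<>();  // FIFO order gives breadth-first
        seen.add(seed);
        frontier.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.remove();
            visited.add(url);
            for (String link : linksOf.apply(url)) {
                if (seen.add(link)) {  // add() returns false for URLs already seen
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // A tiny in-memory "web" stands in for real pages; A and B link to each other.
        Map<String, List<String>> web = Map.of(
                "A", List.of("B", "C"),
                "B", List.of("A", "D"),
                "C", List.of("D"),
                "D", List.<String>of());
        System.out.println(crawl("A", url -> web.getOrDefault(url, List.of()), 100));
    }
}
```

The `seen` set is what keeps the mutual links between A and B from being queued twice. In a real crawler `linksOf` would download the page and extract its href values.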

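The breadth-first version above works fine. I tested it yesterday a little after 3 a.m., and in about 15 minutes this little crawler collected more than 300,000 links, so it is fairly capable. By the way, it is extremely slow when tested during the day, so I recommend testing after midnight... not very humane, I know.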

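Below is the depth-first code. Every test run ended in an infinite loop... OK, I admit I have no luck. The basic approach is hardly different from the breadth-first version: from the links found on each page I take only the first one and use it to crawl the next page. I was too lazy to define how many levels to crawl in total; I just wanted to see how far it could get... and every time it sadly looped forever. I did add a way to break out: I have a check for valid links, although that check is incomplete, and the exit sits in the catch block, so a single invalid link should be enough to get out... Early this morning every run was an infinite loop... I give up... Over to the code~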

/**
 * Depth-first crawl.
 * A depth-first crawler is very likely to end up in an infinite loop.
 */

/**
 * @author 魏诗尧
 * @version 1.8
 * @email inwsy@hotmail.com
 */
[Listing body not recoverable from the source page. What survives shows a package declaration and thirteen imports, one class with a private static field, two private void methods, one public void method and a main method, a read loop over a 1024-byte buffer, a chain of six if/continue checks on each extracted link followed by a break, and try/catch/finally blocks that close the streams.]
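The single-chain descent described above (follow one link per page, then repeat) can be sketched as follows. This is not the author's code. It adds two assumed safeguards that the description says were missing or incomplete: a `seen` set, so a page that links back to an earlier page cannot restart the chain, and a `maxDepth` cap. The `linksOf` hook and all names are hypothetical. Whether revisited pages were the actual cause of the author's infinite loop is not known; it is simply one common cause.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch of a "first link only" descent with loop protection. */
class DfsFirstLinkSketch {

    /**
     * Follows one link per page, starting from seed, and returns the path taken.
     * linksOf is an assumed hook: given a URL, it returns the links on that page.
     * The walk stops when maxDepth pages were visited or no unseen link remains.
     */
    static List<String> follow(String seed,
                               Function<String, List<String>> linksOf,
                               int maxDepth) {
        List<String> path = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        String current = seed;
        while (current != null && path.size() < maxDepth) {
            seen.add(current);
            path.add(current);
            current = firstUnseen(linksOf.apply(current), seen);
        }
        return path;
    }

    /** Returns the first link not visited yet, or null if every link was seen. */
    private static String firstUnseen(List<String> links, Set<String> seen) {
        for (String link : links) {
            if (!seen.contains(link)) {
                return link;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // A -> B -> A would loop forever without the seen set.
        Map<String, List<String>> web = Map.of(
                "A", List.of("B"),
                "B", List.of("A", "C"),
                "C", List.of("A"));
        System.out.println(follow("A", url -> web.getOrDefault(url, List.of()), 10));
    }
}
```

Taking the first unseen link instead of strictly the first link is a design choice of this sketch: with strictly the first link, page B above would send the walk straight back to A.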

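Both listings above are far from complete. For lack of time I only implemented the most basic principle, and there is plenty that could be changed or added, mostly added: in many places extra code would make the program more robust. Take the valid-link check, for example. Apart from the few conditions I wrote, a lot of what comes out of href attributes is still not handled, so there is plenty of room to add more there.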
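One possible direction for a stricter valid-link check is sketched below using `java.net.URI` from the standard library. The specific rules (skip empty values and bare fragments, resolve relative links against the page URL, keep only http and https, strip the fragment) are assumptions chosen for illustration, not the conditions from the original listings. The class and method names are hypothetical.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical sketch of an href filter; the rules are illustrative assumptions. */
class LinkFilterSketch {

    /**
     * Turns a raw href value into an absolute http(s) URL without a fragment,
     * or returns an empty Optional if the link should be skipped.
     */
    static Optional<String> normalize(String pageUrl, String href) {
        if (href == null) {
            return Optional.empty();
        }
        String value = href.trim();
        if (value.isEmpty() || value.startsWith("#")) {
            return Optional.empty();  // nothing to follow, or same-page anchor
        }
        try {
            // resolve() handles relative links such as "/about" or "c.html"
            URI resolved = URI.create(pageUrl).resolve(value);
            String scheme = resolved.getScheme();
            if (scheme == null) {
                return Optional.empty();
            }
            scheme = scheme.toLowerCase(Locale.ROOT);
            if (!scheme.equals("http") && !scheme.equals("https")) {
                return Optional.empty();  // drops javascript:, mailto:, ftp:, ...
            }
            if (resolved.getHost() == null) {
                return Optional.empty();  // no host that java.net.URI can parse
            }
            String text = resolved.toString();
            int hash = text.indexOf('#');
            return Optional.of(hash >= 0 ? text.substring(0, hash) : text);
        } catch (IllegalArgumentException e) {
            return Optional.empty();  // href is not a syntactically valid URI
        }
    }

    public static void main(String[] args) {
        String page = "http://example.com/a/b.html";
        System.out.println(normalize(page, "c.html#top"));
        System.out.println(normalize(page, "javascript:void(0)"));
    }
}
```

Stripping the fragment also helps a `seen` set, since `page.html#a` and `page.html#b` then count as the same page.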


Original article: http://www.cnblogs.com/arxive/p/6194372.html
