记一次爬虫经历

时间：2015-11-04 19:33:24 阅读：371 评论：0 收藏：0 [点我收藏+]

标签：

一个同学想要提取某网站的数据，然后导出为excel。

刚开始，我在网上找了一个php生成excel的demo，名字叫PHPExcel。

刚开始我是直接从那个网站上读数据，然后利用那个demo导出为excel，但是遇到了一个特别坑的问题——

那个网站服务器实在太差了，我要从那里读1400多个网页的内容，经常出现读取失败的问题，然后程序就进行不下去了。

于是，想到办法1：

每次读取5页数据，最后可以生成大概1400/5个表。

这样每次读取失败，再重新运行程序读取。

但是有可以改进的地方——

那个网站一次读取失败，可以重复读取几次，99%是能读取成功的。

之后想到，数据库有导出excel的方法，于是乎，我先把数据存入数据库，最后一下子导出为excel。

<?php
//避免程序运行30秒后停止
ini_set(‘max_execution_time‘, ‘0‘);		
header("content-Type: text/html; charset=utf8");

//连接数据库
$conn=mysql_connect("localhost","root","")or die("连接错误");
mysql_select_db("newdb",$conn);
mysql_query("set names ‘utf8‘");

//读取网页超时时间
$opts = array(   
  ‘http‘=>array(   
    ‘method‘=>"GET",   
    ‘timeout‘=>15,//单位秒  
   )   
);

for($i = 1; $i<143; $i++)
{
	//读取超时后重新读网页内容，最多3次，否则数据出错
	$cnt1=0; 
	while($cnt1<3 && ($strJson=file_get_contents("http://stockdata.stock.hexun.com/zrbg/data/zrbList.aspx?date=2014-12-31&count=20&page=".$i, false, stream_context_create($opts)))===FALSE) 
		$cnt1++;
	$strJson = iconv("gb2312", "utf-8//IGNORE",$strJson);
	preg_match_all(‘/(?<=industry:\‘).+?(?=\‘,stockNumber)/‘, $strJson, $arrLala);		//获得所有企业名称
	preg_match_all(‘/(?<=StockNameLink:\‘).+?(?=\‘,industry)/‘, $strJson, $arrLala2);	//获得所有链接
	for($j = 0; $j <20; $j++)
	{
		$cnt=0; 
		while($cnt<3 && ($str=file_get_contents("http://stockdata.stock.hexun.com/zrbg/".$arrLala2[0][$j], false, stream_context_create($opts)))===FALSE) 
			$cnt++;
		$str = iconv("gb2312", "utf-8//IGNORE",$str);
		preg_match_all(‘/(?<=\().+?分/‘, $str, $arr);		
		preg_match_all(‘/(?<=：).+?(?=<)/‘, $str, $arr2);
		
		$sql="insert into report(
		a1,b1,c1,d1,e1,f1,g1,h1,i1,j1,k1,l1,m1,n1,o1,p1,q1,r1,s1,t1,u1,v1,w1,x1,y1,z1,
		a2,b2,c2,d2,e2,f2,g2,h2,i2,j2,k2,l2,m2,n2,o2,p2,q2,r2,s2,t2,u2,v2,w2,x2,y2,z2,
		a3,b3,c3,d3) values(‘".$arrLala[0][$j]."‘,‘".$arr[0][0]."‘,‘".$arr[0][1]."‘,‘".$arr2[0][5]."‘,‘".$arr2[0][6]."‘,‘".$arr2[0][7]."‘,‘".$arr2[0][8]."‘,‘".$arr2[0][9]."‘,‘".$arr2[0][10]."‘,‘".
			$arr[0][8]."‘,‘".$arr2[0][11]."‘,‘".$arr2[0][12]."‘,‘".$arr2[0][13]."‘,‘".$arr2[0][14]."‘,‘".$arr2[0][15]."‘,‘".
			$arr[0][14]."‘,‘".$arr2[0][16]."‘,‘".$arr2[0][17]."‘,‘".$arr2[0][18]."‘,‘".
			$arr[0][18]."‘,‘".$arr2[0][19]."‘,‘".
			$arr[0][20]."‘,‘".$arr2[0][20]."‘,‘".$arr2[0][21]."‘,‘".$arr2[0][22]."‘,‘".
			$arr[0][24]."‘,‘".$arr[0][25]."‘,‘".$arr2[0][23]."‘,‘".$arr2[0][24]."‘,‘".
			$arr[0][28]."‘,‘".$arr2[0][25]."‘,‘".$arr2[0][26]."‘,‘".
			$arr[0][31]."‘,‘".$arr2[0][27]."‘,‘".$arr2[0][28]."‘,‘".$arr2[0][29]."‘,‘".
			$arr[0][35]."‘,‘".$arr[0][36]."‘,‘".$arr2[0][30]."‘,‘".$arr2[0][31]."‘,‘".
			$arr[0][39]."‘,‘".$arr2[0][32]."‘,‘".
			$arr[0][41]."‘,‘".$arr2[0][33]."‘,‘".$arr2[0][34]."‘,‘".
			$arr[0][44]."‘,‘".$arr[0][45]."‘,‘".$arr2[0][35]."‘,‘".$arr2[0][36]."‘,‘".$arr2[0][37]."‘,‘".$arr2[0][38]."‘,‘".$arr2[0][39]."‘,‘".
			$arr[0][51]."‘,‘".$arr[0][52]."‘,‘".$arr2[0][40]."‘,‘".$arr2[0][41]."‘)";
		mysql_query($sql,$conn);	
	}
}
echo "success!";
?>

如果有读取失败，或者读取错误的数据，那么重新读取更新数据库。

<?php
//避免程序运行30秒后停止
ini_set(‘max_execution_time‘, ‘0‘);		
header("content-Type: text/html; charset=utf8");

//连接数据库
$conn=mysql_connect("localhost","root","")or die("连接错误");
mysql_select_db("newdb",$conn);
mysql_query("set names ‘utf8‘");

//读取网页超时时间
$opts = array(   
  ‘http‘=>array(   
    ‘method‘=>"GET",   
    ‘timeout‘=>2,//单位秒  
   )   
);


$cnt=0; 
while($cnt<3 && ($str=file_get_contents("http://stockdata.stock.hexun.com/zrbg/stock_bg.aspx?code=300256", false, stream_context_create($opts)))===FALSE) 
{	
	$cnt++;
	if($cnt<3)
		echo "Try to get contents again...<br>";
	else
		echo "Read failed!<br>";
}
$str = iconv("gb2312", "utf-8//IGNORE",$str);
preg_match_all(‘/(?<=\().+?分/‘, $str, $arr);		
preg_match_all(‘/(?<=：).+?(?=<)/‘, $str, $arr2);

$sql = "update report set 	b1=‘".$arr[0][0]."‘,
							c1=‘".$arr[0][1]."‘,
							d1=‘".$arr2[0][5]."‘,
							e1=‘".$arr2[0][6]."‘,
							f1=‘".$arr2[0][7]."‘,
							g1=‘".$arr2[0][8]."‘,
							h1=‘".$arr2[0][9]."‘,
							i1=‘".$arr2[0][10]."‘,
							j1=‘".$arr[0][8]."‘,
							k1=‘".$arr2[0][11]."‘,
							l1=‘".$arr2[0][12]."‘,
							m1=‘".$arr2[0][13]."‘,
							n1=‘".$arr2[0][14]."‘,
							o1=‘".$arr2[0][15]."‘,
							p1=‘".$arr[0][14]."‘,
							q1=‘".$arr2[0][16]."‘,
							r1=‘".$arr2[0][17]."‘,
							s1=‘".$arr2[0][18]."‘,
							t1=‘".$arr[0][18]."‘,
							u1=‘".$arr2[0][19]."‘,
							v1=‘".$arr[0][20]."‘,
							w1=‘".$arr2[0][20]."‘,
							x1=‘".$arr2[0][21]."‘,
							y1=‘".$arr2[0][22]."‘,
							z1=‘".$arr[0][24]."‘,
							a2=‘".$arr[0][25]."‘,
							b2=‘".$arr2[0][23]."‘,
							c2=‘".$arr2[0][24]."‘,
							d2=‘".$arr[0][28]."‘,
							e2=‘".$arr2[0][25]."‘,
							f2=‘".$arr2[0][26]."‘,
							g2=‘".$arr[0][31]."‘,
							h2=‘".$arr2[0][27]."‘,
							i2=‘".$arr2[0][28]."‘,
							j2=‘".$arr2[0][29]."‘,
							k2=‘".$arr[0][35]."‘,
							l2=‘".$arr[0][36]."‘,
							m2=‘".$arr2[0][30]."‘,
							n2=‘".$arr2[0][31]."‘,
							o2=‘".$arr[0][39]."‘,
							p2=‘".$arr2[0][32]."‘,
							q2=‘".$arr[0][41]."‘,
							r2=‘".$arr2[0][33]."‘,
							s2=‘".$arr2[0][34]."‘,
							t2=‘".$arr[0][44]."‘,
							u2=‘".$arr[0][45]."‘,
							v2=‘".$arr2[0][35]."‘,
							w2=‘".$arr2[0][36]."‘,
							x2=‘".$arr2[0][37]."‘,
							y2=‘".$arr2[0][38]."‘,
							z2=‘".$arr2[0][39]."‘,
							a3=‘".$arr[0][51]."‘,
							b3=‘".$arr[0][51]."‘,
							c3=‘".$arr2[0][40]."‘,
							d3=‘".$arr[0][41]."‘
					where 	a1=‘星星科技(300256)‘";
mysql_query($sql,$conn);	

?>

记一次爬虫经历

标签：

原文地址：http://www.cnblogs.com/hellovenus/p/4936903.html

踩

(0)

评论一句话评论（0）

分享档案

更多>

2021年07月29日 (22)
2021年07月28日 (40)
2021年07月27日 (32)
2021年07月26日 (79)
2021年07月23日 (29)
2021年07月22日 (30)
2021年07月21日 (42)
2021年07月20日 (16)
2021年07月19日 (90)
2021年07月16日 (35)

周排行