Scraping Images

Date: 2018-01-20 20:34:45


# coding=utf-8

import re
import urllib.request
from urllib import error

from bs4 import BeautifulSoup


def validateTitle(title):
    # Characters that are not allowed in file names: / \ : * ? " < > |
    rstr = r'[/\\:*?"<>|]'
    new_title = re.sub(rstr, "_", title)  # replace each one with an underscore
    return new_title


for j in range(1, 151637):  # numeric IDs in the URL
    url_origin = "http://www.7160.com/meinv/" + str(j)
    for i in range(1, 30):  # pages under one ID
        # The first page is index.html; later pages are index_2.html, index_3.html, ...
        if i == 1:
            url = url_origin + "/index.html"
        else:
            url = url_origin + "/index_" + str(i) + ".html"
        request = urllib.request.Request(url)
        try:
            res = urllib.request.urlopen(request)

            soup = BeautifulSoup(res, 'lxml')
            title_obj = soup.find(attrs={"class": "picmainer"})

            if title_obj is not None:
                print(url)
                title = title_obj.h1.string
                content = soup.find('img')
                src = content.get("src")

                # Use the sanitized page title as the file name
                file_name = validateTitle(title) + ".jpg"
                urllib.request.urlretrieve(src, file_name)
                print(file_name + " saved successfully")
        except error.URLError as e:
            print(e.reason)
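
The script above builds each page URL inline and passes the img src straight to urlretrieve. Below is a minimal sketch of how those two steps could be pulled out into small helpers that can be checked without touching the network. The names page_url and resolve_src are hypothetical and do not appear in the original script. urljoin is used on the assumption that some src values may be relative, which the original does not confirm.

```python
from urllib.parse import urljoin

BASE = "http://www.7160.com/meinv/"  # same base URL as the original script


def page_url(gallery_id, page):
    """Return the URL of one page under an ID (hypothetical helper)."""
    prefix = BASE + str(gallery_id)
    if page == 1:
        # The first page has no numeric suffix
        return prefix + "/index.html"
    return prefix + "/index_" + str(page) + ".html"


def resolve_src(page, src):
    """Turn a possibly relative img src into an absolute URL (assumption:
    relative paths may occur; absolute URLs pass through unchanged)."""
    return urljoin(page, src)
```

With helpers like these, the inner loop would call page_url(j, i) instead of concatenating strings, and resolve_src(url, src) before downloading.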


Original article: https://www.cnblogs.com/php-linux/p/8321709.html
