Tags: win rank for requests gecko request links page path
The approach is as follows:
First, locate the image node on the page:
<div class="thumb"> <a href="/article/121672165" target="_blank"> <img src="//pic.qiushibaike.com/system/pictures/12167/121672165/medium/NTDNQY3EJKUSRZ2X.jpg" alt="糗事#121672165" class="illustration" width="100%" height="auto"> </a> </div>
Find the URL of the page to crawl:
https://www.qiushibaike.com/imgrank/
Send the request and get the response (omitted here; it appears in the full listing below).
Use a regular expression to extract each image's src:
re.compile('<div class="thumb">.*?<img src="(.*?)".*?>.*?</div>', re.S)
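To see what this pattern captures, it can be run against the sample node shown above. The `re.S` flag makes `.` match newlines as well, which matters because each `<div class="thumb">` block spans several lines in the real page source. A minimal check:

```python
import re

# Sample node copied from the page (same markup as shown above)
html = ('<div class="thumb"> <a href="/article/121672165" target="_blank"> '
        '<img src="//pic.qiushibaike.com/system/pictures/12167/121672165/medium/NTDNQY3EJKUSRZ2X.jpg" '
        'alt="糗事#121672165" class="illustration" width="100%" height="auto"> </a> </div>')

# Lazy quantifiers (.*?) keep each match inside a single thumb div;
# the group captures only the src attribute value
pattern = re.compile('<div class="thumb">.*?<img src="(.*?)".*?>.*?</div>', re.S)

print(pattern.findall(html))
# → ['//pic.qiushibaike.com/system/pictures/12167/121672165/medium/NTDNQY3EJKUSRZ2X.jpg']
```

Note that the captured src is protocol-relative (it starts with `//`), which is why the scheme is prepended later before downloading.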
Finally, write the images to local files.
The full code is as follows:
import requests
import re
import os

url = "https://www.qiushibaike.com/imgrank/page/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
}
pattern = re.compile('<div class="thumb">.*?<img src="(.*?)".*?>.*?</div>', re.S)

if not os.path.exists("./imgs"):
    os.mkdir("imgs")

# The pagination URLs of qiushibaike carry the page number in the path,
# so no query parameters are needed
for page in range(1, 6):
    # Switch pages by rebuilding the URL
    new_url = url + "%s/" % page
    response = requests.get(url=new_url, headers=headers)
    page_text = response.text
    # Collect the list of image links on this page
    list_img = pattern.findall(page_text)
    # Persist the images
    page_path = "pages%s/" % page
    os.mkdir("imgs/%s" % page_path)
    for my_url in list_img:
        # Complete the protocol-relative image URL
        url_img = "https:" + my_url
        print(url_img)
        # Fetch the image's binary content
        data_img = requests.get(url=url_img, headers=headers).content
        # Use the last path segment as the image's file name
        name_img = my_url.split("/")[-1]
        print(name_img)
        # Local path to write the file to
        path_img = "imgs/" + page_path + name_img
        print(path_img)
        with open(path_img, "wb") as fp:
            fp.write(data_img)
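Two small hardening points, should you rerun the script: `os.mkdir` raises `FileExistsError` if a page directory is already there, while `os.makedirs(..., exist_ok=True)` is idempotent; and `urllib.parse.urljoin` resolves protocol-relative src values against the page's scheme without manual string concatenation. Both are standard-library calls; a sketch of the two substitutions:

```python
import os
from urllib.parse import urljoin

page = 1
src = "//pic.qiushibaike.com/system/pictures/12167/121672165/medium/NTDNQY3EJKUSRZ2X.jpg"

# makedirs creates intermediate directories and, with exist_ok=True,
# does not raise if the directory already exists (safe to re-run)
page_dir = os.path.join("imgs", "pages%s" % page)
os.makedirs(page_dir, exist_ok=True)

# urljoin resolves a protocol-relative URL against the crawled page's scheme
img_url = urljoin("https://www.qiushibaike.com/imgrank/", src)
print(img_url)
# → https://pic.qiushibaike.com/system/pictures/12167/121672165/medium/NTDNQY3EJKUSRZ2X.jpg
```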
Original post: https://www.cnblogs.com/haoqirui/p/10658270.html