XPath examples: scraping 58.com rental listings, parsing and downloading image data, and fixing garbled text

Date: 2019-09-30 14:46:53

58.com housing listings: parsing the listing titles

from lxml import etree
import requests

url = 'https://haikou.58.com/chuzu/j2/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Mobile Safari/537.36'
}
parser = etree.HTMLParser(encoding='utf-8')
# Pass the headers so the request actually carries the User-Agent defined above
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text, parser=parser)
lis = tree.xpath('//ul[@class="house-list"]/li')
for li_item in lis:
    # Note the leading "./": it makes the XPath relative to the current <li>
    res = li_item.xpath('.//h2/a/text()')
    if res:  # skip <li> items that have no title link
        print(res[0].strip())
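
The note about `./` matters because an XPath that starts with `//` always searches from the document root, even when it is called on a single `<li>` element. A minimal sketch on an inline HTML snippet; the markup and titles are made up for illustration and are not taken from 58.com:

```python
from lxml import etree

# Hypothetical markup that mimics the structure assumed by the XPath above
html = """
<ul class="house-list">
  <li><h2><a href="#"> Listing A </a></h2></li>
  <li><h2><a href="#"> Listing B </a></h2></li>
</ul>
"""

tree = etree.HTML(html)
first_li = tree.xpath('//ul[@class="house-list"]/li')[0]

# Relative path: searches only inside first_li
relative = [t.strip() for t in first_li.xpath('.//h2/a/text()')]

# Absolute path: searches the whole document, ignoring first_li
absolute = [t.strip() for t in first_li.xpath('//h2/a/text()')]

print(relative)
print(absolute)
```

The relative query returns only the title of the first item, while the absolute query returns every title on the page, which is why the loop needs the `./` prefix.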

Scraping images from pic.netbian.com (彼岸图网)

from urllib.parse import urljoin

from lxml import etree
import requests

url = 'http://pic.netbian.com/4kfengjing'
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Mobile Safari/537.36'
}
parser = etree.HTMLParser(encoding='utf-8')
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text, parser=parser)
# Collect the src attribute of every thumbnail in the list
src_list = tree.xpath('//div[@class="slist"]//li/a/img/@src')
for count, url_item in enumerate(src_list):
    # urljoin avoids a double slash when url_item starts with "/"
    full_url = urljoin('http://pic.netbian.com/', url_item)
    # .content returns the raw bytes of the image
    img_data = requests.get(url=full_url, headers=headers).content
    with open('图片%s.jpg' % count, 'wb') as f:
        f.write(img_data)
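
When building the full image URL, plain concatenation of the site root and an `@src` value that begins with `/` yields a double slash, whereas `urllib.parse.urljoin` normalizes both forms. A small sketch of that step in isolation; the helper name `build_download_plan`, the `image_N.jpg` naming, and the sample `@src` values are assumptions for illustration:

```python
from urllib.parse import urljoin

BASE_URL = 'http://pic.netbian.com/'


def build_download_plan(src_values, base_url=BASE_URL):
    """Pair each absolute image URL with a local file name (hypothetical helper)."""
    plan = []
    for index, src in enumerate(src_values):
        full_url = urljoin(base_url, src)
        plan.append((full_url, 'image_%s.jpg' % index))
    return plan


# Made-up @src values, one with and one without a leading slash
sample = ['/uploads/allimg/a.jpg', 'uploads/allimg/b.jpg']
plan = build_download_plan(sample)

# Plain concatenation, for comparison
concatenated = '%s%s' % (BASE_URL, sample[0])

for full_url, filename in plan:
    print(full_url, '->', filename)
print(concatenated)
```

Separating URL construction from the download itself also makes it possible to check the URLs without sending any requests.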

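Garbled text (mojibake):

  1. Fix the whole response

    - response = requests.get(url=xxx, headers=xxx)

    - response.encoding = 'utf-8'  (set this before reading response.text)

  2. Fix a single string

    - xxx.encode('iso-8859-1').decode('gbk')  (a common fix for garbled Chinese text; it applies when a GBK page was wrongly decoded as ISO-8859-1)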
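
Why the single-string fix works can be shown without any network access. The sketch below assumes the page was served as GBK and decoded as ISO-8859-1, which is the fallback `requests` uses when the response headers declare a text type without a charset; the sample string is chosen for illustration:

```python
original = '风景图片'  # sample text, not taken from a real page

# The bytes a GBK-encoded server would send
raw_bytes = original.encode('gbk')

# What you see if those bytes are wrongly decoded as ISO-8859-1
garbled = raw_bytes.decode('iso-8859-1')

# The fix: recover the raw bytes, then decode them with the right codec
repaired = garbled.encode('iso-8859-1').decode('gbk')

print(repr(garbled))
print(repaired)
```

ISO-8859-1 maps every byte to exactly one character, so the round trip back to bytes is lossless. If the page had been decoded with some other wrong codec, this particular recipe would not apply as written.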


Original article: https://www.cnblogs.com/Jnhnsnow/p/11612292.html
