Beginner Web Scraping (Part 3)

Date: 2019-03-10 19:09:04

Here is a first snippet:

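import requests

url = "https://en.wikipedia.org/wiki/Steve_Jobs"
res = requests.get(url)
print(res.status_code)  # 200 means the request succeeded
# Write the page as UTF-8 so the file does not depend on the platform default
with open('a.html', 'w', encoding='utf-8') as f:
    f.write(res.text)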

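This saves one web page to disk. Pass encoding='utf-8' to open() explicitly: on Windows, Python's open() defaults to the system locale encoding (for example GBK on a Chinese-language system) rather than UTF-8, so writing characters outside that code page raises UnicodeEncodeError.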
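The encoding point can be checked without any network access. The sketch below is only an illustration: the sample string and the temporary file path are made up for the demo, and cp1252 stands in for "some legacy code page" (the actual default on a given Windows machine may differ).

```python
import os
import tempfile

# Hypothetical sample text: mixes ASCII with Chinese characters.
text = "Steve Jobs (史蒂夫·乔布斯)"

# A legacy single-byte code page cannot represent the Chinese characters,
# which is the kind of failure an implicit default encoding can cause.
try:
    text.encode("cp1252")
    legacy_ok = True
except UnicodeEncodeError:
    legacy_ok = False

# With an explicit encoding, writing and reading round-trip on any platform.
path = os.path.join(tempfile.mkdtemp(), "a.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()

print(legacy_ok, round_trip == text)
```

Reading the file back later needs the same encoding argument, which is why the second snippet also opens a.html with encoding="utf-8".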

Here is another snippet:

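import re
from lxml import etree

# Load the page saved by the first snippet
with open("a.html", "r", encoding="utf-8") as f:
    c = f.read()
tree = etree.HTML(c)
# @class='...' is an exact string match on the whole class attribute
table_element = tree.xpath("//table[@class='infobox biography vcard']")
# Rows of the first matching infobox table
table_row = tree.xpath("//table[@class='infobox biography vcard'][1]/tbody/tr")
# Non-greedy pattern that matches any single HTML tag
pattern_attrib = re.compile("<.*?>")
# print(table_element)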
for row in table_row:
    try:
        thead=row.xpath("th")[0]
        title=etree.tostring(thead).decode("utf-8")
        title=pattern_attrib.sub(" ",title)
        desc=row.xpath("td")[0]
        desc=etree.tostring(desc).decode("utf-8")
        desc=pattern_attrib.sub(" ",desc)
        print(title+":"+desc)
        print("=========")
    except Exception as err:
        print(err)
        # pass
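# Every <h2> heading and <p> paragraph inside the main content div
content = tree.xpath("//div[@id='mw-content-text'][1]//*[self::h2 or self::p]")
for line in content:
    # line.text stops at the first child element, so join all text nodes instead
    print("".join(line.itertext()))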
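To see what the pattern_attrib substitution does to one serialized cell, here is a small standalone sketch using only the standard library. The row_html string is a made-up stand-in for what etree.tostring(...).decode("utf-8") might return, not real Wikipedia markup, and the html.unescape and whitespace-collapsing steps are an optional addition that the snippet above does not perform.

```python
import re
from html import unescape

# Same non-greedy tag pattern as in the snippet above.
pattern_attrib = re.compile("<.*?>")

# Hypothetical serialized cell, invented for illustration.
row_html = (
    '<td class="infobox-data">Steven Paul Jobs<br/>'
    'February 24, 1955<sup>[1]</sup> &amp; more</td>'
)

# Each tag becomes a single space, so neighbouring words stay separated.
stripped = pattern_attrib.sub(" ", row_html)

# Optional cleanup: decode entities such as &amp; and collapse runs of spaces.
cleaned = " ".join(unescape(stripped).split())

print(cleaned)  # Steven Paul Jobs February 24, 1955 [1] & more
```

Replacing tags with a space rather than an empty string keeps text from adjacent elements from running together, at the cost of some extra whitespace that the last step removes.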

Original article: https://blog.51cto.com/14156081/2360824
