python3: web scraping ---- urllib, beautifulsoup

Posted: 2018-07-16 22:16:50

Tags: language, lib, index, learning, body, python3, data extraction, err, mozilla

I have been learning web scraping in the evenings lately, starting with the basics.

Python 3 merged Python 2's urllib and urllib2 into a single urllib package. urllib can request and download a given web page, and BeautifulSoup can pick the parts we need out of messy HTML.

Note: BeautifulSoup is a Python library for extracting data from HTML or XML files.
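To make the extraction step concrete before touching the network, here is a minimal sketch that runs BeautifulSoup over a small inline HTML string. The markup, the example.com URLs and the `extract_links` helper are made up for illustration; only `BeautifulSoup`, `find_all`, `find` and `get_text` come from bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <div class="nav"><a href="/about">About</a></div>
  <h3><a href="https://example.com/post/1">First post</a></h3>
  <h3><a href="https://example.com/post/2">Second post</a></h3>
  <h3>No link here</h3>
</body></html>
"""

def extract_links(html):
    """Return (href, title) pairs for every <h3> that wraps a link."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for heading in soup.find_all("h3"):
        anchor = heading.find("a", href=True)
        if anchor is not None:  # skip headings without a link
            links.append((anchor["href"], anchor.get_text(strip=True)))
    return links

if __name__ == "__main__":
    for href, title in extract_links(SAMPLE_HTML):
        print(href, title)
```

The navigation link is ignored because it is not inside an `<h3>`, and the heading without a link is skipped rather than raising an error.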

Example 1:

from urllib import request
from bs4 import BeautifulSoup as bs
import re

# Send a browser-like User-Agent so the site treats the request as a normal visit.
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36'
}

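def download():
    """
    Fetch the cnblogs front page while imitating a browser, then save
    every post linked from an <h3> heading as a local HTML file.
    """
    for pageIdx in range(1, 3):
        # Note: everything after '#' is a URL fragment and is never sent to
        # the server, so each iteration requests the same front page.
        url = "https://www.cnblogs.com/#p%s" % pageIdx
        print(url)
        req = request.Request(url, headers=header)
        data = request.urlopen(req).read().decode('utf-8')
        #print(data)
        # Name the parser explicitly; otherwise bs4 picks one and warns.
        content = bs(data, 'html.parser')
        for link in content.find_all('h3'):
            a = link.a
            if a is None or not a.get('href'):
                continue  # heading without a usable link
            title = a.get_text(strip=True)
            print(a['href'], title)
            curhtmlcontent = request.urlopen(request.Request(a['href'], headers=header)).read()
            # Replace characters that are not allowed in file names.
            filename = re.sub(r'[\\/:*?"<>|]', '_', title) + '.html'
            with open(filename, 'w', encoding='utf-8') as f:
                f.write(curhtmlcontent.decode('utf-8'))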

if __name__ == "__main__":
    download()
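One detail in Example 1 is worth checking: the page number is placed after `#`, which makes it a URL fragment. Fragments are handled on the client side and are not part of the HTTP request. The sketch below uses only the standard library to show which URL would actually be requested; `request_target` is a hypothetical helper name.

```python
from urllib.parse import urldefrag, urlsplit

def request_target(url):
    """Return the URL without its fragment, i.e. the part sent to the server."""
    return urldefrag(url).url

if __name__ == "__main__":
    for page in range(1, 3):
        url = "https://www.cnblogs.com/#p%s" % page
        parts = urlsplit(url)
        print(url, "-> fragment:", parts.fragment, "| requested:", request_target(url))
```

Both iterations resolve to the same request target, so a loop built this way downloads the same page each time. Real pagination needs the page number in the path or query string, and the exact form depends on the site.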

Example 2:

# -*- coding: utf-8 -*-
import unittest
import lxml
import requests
from bs4 import BeautifulSoup as bs

def school():
    for index in range(2, 34):
        url = "http://gaokao.chsi.com.cn/gkxx/zszcgd/dnzszc/201706/20170615/1611254988-%s.html" % index
        r = requests.get(url=url)
        soup = bs(r.content, 'lxml')
        # The cell spanning all 7 columns supplies the output file name.
        headers = soup.find_all(name="td", attrs={"colspan": "7"})
        if not headers:
            continue  # this page has no such table
        city = headers[0].get_text(strip=True)
        # "with" closes the file even if a row fails to parse.
        with open("%s.txt" % city, "w", encoding="utf-8") as fp:
            # Data rows are the <tr height="29"> elements; keep the second cell.
            for row in soup.find_all(name="tr", attrs={"height": "29"}):
                cells = row.find_all(name="td")
                if len(cells) < 2:
                    continue  # row is too short to hold the value
                # get_text() always returns a str; .string can be None.
                soup_content = cells[1].get_text(strip=True)
                fp.write(soup_content + "\n")
                print(soup_content)


class MyTestCase(unittest.TestCase):
    def test_something(self):
        school()


if __name__ == '__main__':
    unittest.main()
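Example 2 reads cell text through `.string`. In bs4, `.string` is `None` whenever a tag has more than one child, and concatenating `None` with `"\n"` raises a `TypeError`. The sketch below uses a made-up table row to contrast `.string` with `get_text()`, which always returns a string; `second_cell_text` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Hypothetical row: the second cell mixes plain text with a nested tag.
ROW_HTML = '<table><tr height="29"><td>1</td><td>Some <b>University</b></td></tr></table>'

def second_cell_text(html):
    """Return (.string, get_text()) for the second <td> in the markup."""
    soup = BeautifulSoup(html, "html.parser")
    cells = soup.find_all("td")
    # .string is None here because the cell has two children (text and <b>).
    # get_text(" ", strip=True) strips each piece and joins them with a space.
    return cells[1].string, cells[1].get_text(" ", strip=True)

if __name__ == "__main__":
    print(second_cell_text(ROW_HTML))
```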

BeautifulSoup supports several HTML parsers; the main ones are listed below:

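Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	(1) Built into Python (2) Decent speed (3) Lenient with malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	(1) Fast (2) Lenient with malformed documents	Requires an external C library
lxml XML parser	BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")	(1) Fast (2) The only XML parser supported	Requires an external C library
html5lib	BeautifulSoup(markup, "html5lib")	(1) Most lenient (2) Parses documents the way a browser does (3) Produces valid HTML5	(1) Slow (2) Requires an external Python package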
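Since lxml and html5lib are optional installs, a script can try them in order and fall back to the built-in parser. The sketch below is one way to do that, assuming that preference order; `make_soup` is a hypothetical helper, while `FeatureNotFound` is the exception bs4 raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Returns a (soup, parser_name) pair so the caller can see which one was used.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser is not installed, try the next one
    raise RuntimeError("no usable parser found")

if __name__ == "__main__":
    soup, used = make_soup("<p>hello</p>")
    print(used, soup.p.get_text())
```

Different parsers can build different trees from the same malformed markup, so pinning a single parser is the safer choice when results must be reproducible across machines.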


Original article: https://www.cnblogs.com/yinwei-space/p/9320640.html
