1 Overview

Date: 2017-11-11 19:52:41


urllib is part of the Python standard library (in Python 2 the equivalent module was urllib2).

Submodules covered here: request, parse, error
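Only request and error are used later in this post, so here is a small sketch of what the parse submodule does: it splits a URL into its parts and resolves relative links. The file name 'page2.html' is a made-up example, not a page referenced by the post.

```python
from urllib.parse import urlparse, urljoin

page = 'http://pythonscraping.com/pages/page1.html'

# Split the URL into its components.
parts = urlparse(page)
print(parts.scheme)   # protocol part of the URL
print(parts.netloc)   # host name
print(parts.path)     # path on the server

# Resolve a relative link against the page it appeared on.
# 'page2.html' is a hypothetical link used only for illustration.
sibling = urljoin(page, 'page2.html')
print(sibling)
```

This is handy in a scraper because links found in a page are often relative and have to be turned into full URLs before they can be passed to urlopen.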

request:

The urlopen function opens a remote object over the network and returns a response object whose contents can then be read.

from urllib.request import urlopen
# html = urlopen('http://pythonscraping.com/pages/page1.html')
html2 = urlopen('http://baidu.com/robots.txt')      # open the URL; returns a response object
print(html2.read())                                 # read() returns the whole body as bytes, so a bytes literal is printed
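Because read() returns bytes, the printed output is a bytes literal rather than readable text. A minimal sketch of turning the body into a str follows; the helper name decode_body and the UTF-8 fallback are assumptions made for this illustration, not something from the original post.

```python
def decode_body(raw, charset=None, fallback='utf-8'):
    """Decode the bytes returned by response.read() into a str.

    charset is whatever the server declared (may be None);
    fallback is an assumed default used when nothing was declared.
    """
    return raw.decode(charset or fallback, errors='replace')


# Usage with a real response (needs network access):
#   from urllib.request import urlopen
#   response = urlopen('http://baidu.com/robots.txt')
#   charset = response.headers.get_content_charset()   # None if not declared
#   print(decode_body(response.read(), charset))

# Offline demonstration on a bytes literal:
print(decode_body(b'User-agent: *\nDisallow: /\n'))
```

Using errors='replace' keeps the sketch from crashing on a wrongly declared encoding; a stricter scraper might prefer to let the UnicodeDecodeError surface.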

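beautifulsoup4 (bs4) is not part of the standard library and has to be installed separately:

sudo apt-get install python3-bs4    # Python 3 package; alternatively: pip install beautifulsoup4

from bs4 import BeautifulSoup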

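# Improved version
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://news.ifeng.com/a/20171111/53165257_0.shtml')
# Things that can go wrong here:
# - the page does not exist or the server returns an error: urlopen raises HTTPError, so wrap the call in try-except;
# - the server cannot be reached (dead link or mistyped URL): urlopen raises URLError;
# - the tag we want does not exist: BeautifulSoup returns None, so check with an if statement before using it.
bsobj = BeautifulSoup(html.read(), 'html.parser')   # naming 'html.parser' explicitly avoids the "no parser specified" warning
print(bsobj.h1)                                     # get the first <h1> tag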


<h1 id="artical_topic" itemprop="headline">又一名女空乘从波音客机上掉下</h1>
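The "tag does not exist" case can be tried without any network access by parsing a small inline string. This is only a sketch: the helper name first_heading_text and the sample markup are invented for illustration, and it assumes bs4 is installed.

```python
from bs4 import BeautifulSoup


def first_heading_text(markup):
    """Return the text of the first <h1> in markup, or None if there is none."""
    soup = BeautifulSoup(markup, 'html.parser')
    h1 = soup.h1              # None when the document has no <h1>
    if h1 is None:
        return None
    return h1.get_text(strip=True)


# Sample markup made up for this example.
print(first_heading_text('<html><body><h1 id="t">Title</h1></body></html>'))
print(first_heading_text('<p>no heading here</p>'))
```

Checking for None before calling a method on the result is what prevents the AttributeError that would otherwise be raised when a lookup is chained on a missing tag.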

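Heading tags, also called H tags, come in six levels in HTML: <h1>, <h2>, <h3> and so on down to <h6>. They mark up the titles and section headings of a page, and their real purpose is to express the structure of the content. The default text size shrinks from <h1> to <h6>, which mirrors decreasing importance: each level carries less weight than the one before it.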
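To make the idea of headings as content structure concrete, here is a sketch that collects every <h1> to <h6> in a document together with its level. Note that it deliberately uses the standard library's html.parser module directly instead of BeautifulSoup, so it runs without extra packages; the class and function names are invented for this example.

```python
from html.parser import HTMLParser

HEADING_TAGS = {'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}


class HeadingCollector(HTMLParser):
    """Collect (level, text) pairs for every heading tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None    # level of the heading currently open, if any
        self._parts = []      # text fragments seen inside that heading

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._level is not None:
            self.headings.append((self._level, ''.join(self._parts).strip()))
            self._level = None


def outline(markup):
    """Return the list of (level, text) headings found in markup."""
    collector = HeadingCollector()
    collector.feed(markup)
    collector.close()
    return collector.headings


# Print an indented outline for a made-up document.
for level, text in outline('<h1>Guide</h1><p>intro</p><h2>Setup</h2><h2>Usage</h2>'):
    print('  ' * (level - 1) + text)
```

The indentation in the printed outline follows the heading level, which is the same hierarchy a browser or search engine reads from these tags.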

Improved version with exception handling (fetching the page title):

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except (HTTPError, URLError):
        # HTTPError: page missing or server error; URLError: server unreachable
        return None
    try:
        bsobj = BeautifulSoup(html.read(), 'html.parser')
        title = bsobj.body.h1
    except AttributeError:
        # bsobj.body was None, so looking up .h1 on it failed
        return None
    return title

title = getTitle('http://news.ifeng.com/a/20171111/53165257_0.shtml')
if title is None:
    print('Title could not be found')
else:
    print(title)
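When it matters why a request failed, the two urllib.error exceptions can be told apart. The sketch below is an illustration, not the post's method: the function name open_or_explain and its opener parameter are hypothetical, the latter added only so the error paths can be exercised without a network connection.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def open_or_explain(url, opener=urlopen):
    """Return (response, None) on success or (None, reason) on failure.

    opener is a hypothetical hook for this example; by default it is urlopen.
    """
    try:
        return opener(url), None
    except HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError.
        return None, 'HTTP error {}'.format(e.code)
    except URLError as e:
        return None, 'could not reach server: {}'.format(e.reason)


# Stand-in openers that simulate the two failure modes offline.
def fake_missing_page(url):
    raise HTTPError(url, 404, 'Not Found', None, None)


def fake_unreachable(url):
    raise URLError('no such host')


print(open_or_explain('http://example.com/missing', fake_missing_page))
print(open_or_explain('http://example.invalid/', fake_unreachable))
```

The order of the except clauses is the design point here: if URLError were listed first, it would also swallow every HTTPError and the status code would never be reported.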


Original article: http://www.cnblogs.com/lybpy/p/7819643.html
