Commonly Used BeautifulSoup Functions [Repost]

Date: 2015-04-27 21:33:35

Example:

HTML document:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<p class="title">The Dormouse's story</p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Code:

from bs4 import BeautifulSoup

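# Naming the parser explicitly avoids a warning and keeps results consistent across machines.
soup = BeautifulSoup(html_doc, "html.parser")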

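With the soup object created, the various features can now be used.

soup.X (where X is any tag name) returns the first matching tag in full, including its attributes and contents.

For example: soup.title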

# <title>The Dormouse's story</title>

soup.p

# <p class="title">The Dormouse's story</p>

soup.a (note: this returns only the first match)

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a') (find_all returns every match as a list)

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
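The difference between attribute-style access and find_all can be checked with a short, self-contained sketch. The shortened markup and the variable names are assumptions made for illustration; they are not part of the original example.

```python
from bs4 import BeautifulSoup

# A shortened document used only for this illustration.
html = (
    '<html><head><title>Demo</title></head><body>'
    '<p class="story">'
    '<a href="http://example.com/elsie" id="link1">Elsie</a>, '
    '<a href="http://example.com/lacie" id="link2">Lacie</a>'
    '</p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a               # only the first <a> tag
all_links = soup.find_all("a")    # every <a> tag, in document order
missing = soup.table              # no <table> exists, so this is None

print(first_link["id"])               # link1
print([a["id"] for a in all_links])   # ['link1', 'link2']
print(missing)                        # None
```

Because soup.X yields None for a tag that does not exist, it is worth checking the result before chaining further attribute access onto it.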

find can also search by attribute:

soup.find(id="link3")

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
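Attribute filters are not limited to id. The sketch below, which uses a shortened two-link document as an assumption, shows a few filter forms that find and find_all accept.

```python
from bs4 import BeautifulSoup

# Shortened markup, assumed for illustration only.
html = (
    '<p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
    '<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find(id="link3")
# class is a reserved word in Python, so the keyword argument is spelled class_.
by_class = soup.find_all("a", class_="sister")
# Any attribute can also be matched through the attrs dictionary.
by_attrs = soup.find("a", attrs={"href": "http://example.com/elsie"})
# find returns None when nothing matches.
not_found = soup.find(id="link9")

print(by_id.get_text())   # Tillie
print(len(by_class))      # 2
print(by_attrs["id"])     # link1
print(not_found)          # None
```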

To read a particular attribute from tags, combine find_all with get:

for link in soup.find_all('a'):
    print(link.get('href'))

# http://example.com/elsie

# http://example.com/lacie

# http://example.com/tillie
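One detail worth knowing about get is that it returns None when the attribute is absent, whereas indexing with square brackets raises KeyError. A minimal sketch, assuming a two-link document in which the second link has no href:

```python
from bs4 import BeautifulSoup

# The second <a> deliberately has no href attribute (illustrative markup).
html = '<a href="http://example.com/elsie">Elsie</a><a id="anchor">No link</a>'
soup = BeautifulSoup(html, "html.parser")

# get() tolerates the missing attribute and yields None for it.
hrefs = [link.get("href") for link in soup.find_all("a")]
print(hrefs)       # ['http://example.com/elsie', None]

# href=True restricts the search to tags that actually carry an href.
with_href = [link["href"] for link in soup.find_all("a", href=True)]
print(with_href)   # ['http://example.com/elsie']
```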

To extract all of the text in the HTML document, use get_text():

print(soup.get_text())

# The Dormouse's story

# Once upon a time there were three little sisters; and their names were

# Elsie,

# Lacie and

# Tillie;

# and they lived at the bottom of a well.

# ...
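get_text() also accepts a separator string and a strip flag, which help when the raw output carries unwanted whitespace. The tiny document below is an assumption chosen so the results are easy to verify by eye.

```python
from bs4 import BeautifulSoup

# Two paragraphs, the first padded with spaces (illustrative markup).
soup = BeautifulSoup("<div><p> Hello </p><p>world</p></div>", "html.parser")

plain = soup.get_text()                    # strings joined as-is
piped = soup.get_text("|")                 # strings joined with a separator
clean = soup.get_text(" ", strip=True)     # each string stripped, then joined
pieces = list(soup.stripped_strings)       # the stripped strings themselves

print(repr(plain))   # ' Hello world'
print(repr(piped))   # ' Hello |world'
print(repr(clean))   # 'Hello world'
print(pieces)        # ['Hello', 'world']
```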

To parse an HTML file from disk, pass an open file object instead of a string:

with open("index.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")
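To try file-based parsing without an existing index.html, the following sketch writes a small file into a temporary directory first. The file contents and the temporary location are assumptions made purely for the demonstration.

```python
import os
import tempfile

from bs4 import BeautifulSoup

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.html")

    # Create a throwaway HTML file to parse.
    with open(path, "w", encoding="utf-8") as fp:
        fp.write("<html><head><title>From disk</title></head></html>")

    # Parse from the open file handle; the with block closes it afterwards.
    with open(path, encoding="utf-8") as fp:
        soup = BeautifulSoup(fp, "html.parser")

title_text = soup.title.string
print(title_text)   # From disk
```

Using a with block ensures the file handle is closed, which a bare open() call passed straight to BeautifulSoup does not guarantee.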

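Objects in BeautifulSoup

tag (corresponds to a tag in the HTML)

tag.attrs (returns all of the tag's attributes as a dictionary)

A tag's attributes can be added, removed, and modified directly, just like dictionary entries. In the snippet below, tag is assumed to be a <blockquote>Extremely bold</blockquote> element.

tag['class'] = 'verybold'

tag['id'] = 1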

tag

# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag['class']

del tag['id']

tag

# <blockquote>Extremely bold</blockquote>

tag['class']

# KeyError: 'class'

print(tag.get('class'))

# None
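The attribute operations above can be run end to end with the sketch below. The starting markup, a blockquote with class "boldest", is an assumption, since the original snippet does not show where tag comes from.

```python
from bs4 import BeautifulSoup

# Assumed starting markup; the original does not show how tag was created.
soup = BeautifulSoup(
    '<blockquote class="boldest">Extremely bold</blockquote>', "html.parser"
)
tag = soup.blockquote

# class can hold several values, so it is parsed into a list.
original_class = list(tag["class"])

# Add and modify attributes exactly as with a dictionary.
tag["class"] = "verybold"
tag["id"] = "1"
after_set = str(tag)

# Delete them again.
del tag["class"]
del tag["id"]
after_delete = str(tag)

# get() is the safe way to read an attribute that may be missing.
missing_class = tag.get("class")

print(original_class)   # ['boldest']
print(after_set)        # <blockquote class="verybold" id="1">Extremely bold</blockquote>
print(after_delete)     # <blockquote>Extremely bold</blockquote>
print(missing_class)    # None
```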

X.contents (where X is a tag) returns the tag's direct children as a list.

For example:

head_tag = soup.head

head_tag

# <head><title>The Dormouse's story</title></head>

head_tag.contents

# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]

title_tag

# <title>The Dormouse's story</title>

title_tag.contents

# ["The Dormouse's story"]
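As a runnable companion to the walkthrough above, this sketch rebuilds just the head of the document (an assumption, to keep it short) and compares .contents with the related .children and .string accessors.

```python
from bs4 import BeautifulSoup

# Only the <head> portion of the example document, for brevity.
soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser"
)

head_tag = soup.head
title_tag = head_tag.contents[0]     # the only child of <head>
text_node = title_tag.contents[0]    # the string inside <title>

# .children iterates over the same direct children without building a list.
child_names = [child.name for child in head_tag.children]

print(len(head_tag.contents))        # 1
print(title_tag.name)                # title
print(text_node)                     # The Dormouse's story
print(title_tag.string == text_node) # True
print(child_names)                   # ['title']
```

When a tag has exactly one string child, .string is a convenient shortcut for contents[0].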

Original article: http://www.cnblogs.com/Blaxon/p/4461279.html
