
Commonly Used BeautifulSoup Functions [Repost]

Posted: 2015-04-27 21:33:35

Example:

HTML document:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<p class="title">The Dormouse's story</p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

 

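Code:

from bs4 import BeautifulSoup

# Name the parser explicitly; otherwise bs4 picks one itself and emits a warning.
soup = BeautifulSoup(html_doc, "html.parser")

With the soup object in place, the features below are ready to use.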

soup.X (where X is any tag name) returns the first matching tag in full, including its attributes and contents.

For example:

    soup.title

    # <title>The Dormouse's story</title>

    soup.p

    # <p class="title">The Dormouse's story</p>

    soup.a  # note: returns only the first match

    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    soup.find_all('a')  # find_all returns every match

    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 

    find can also search by attribute:

    soup.find(id="link3")

    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
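The difference between find and find_all, and the attribute filters they accept, can be sketched as follows. This is a minimal illustration, not the original author's code: the reduced html_doc and all variable names are assumptions made for the example, and the class_ keyword is the spelling bs4 uses because class is a reserved word in Python.

```python
from bs4 import BeautifulSoup

# Reduced copy of the sample document (an assumption for this sketch).
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find() returns the first match, or None when nothing matches.
first_link = soup.find("a")
missing = soup.find("table")

# find_all() returns every match as a list-like result.
all_links = soup.find_all("a")

# Attribute filters can be combined with a tag name.
sisters = soup.find_all("a", class_="sister")
tillie = soup.find("a", id="link3")

print(first_link["id"])            # link1
print(missing)                     # None
print(len(all_links))              # 3
print([a["id"] for a in sisters])  # ['link1', 'link2', 'link3']
print(tillie.get_text())           # Tillie
```

Because find returns None on a miss, it is worth checking the result before chaining further attribute access onto it.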

 

    To read an attribute from each matching tag, combine find_all with get:

    for link in soup.find_all('a'):
        print(link.get('href'))

    # http://example.com/elsie

    # http://example.com/lacie

    # http://example.com/tillie
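One detail the loop above glosses over is what happens when a tag lacks the attribute. A small sketch, assuming a made-up two-link snippet in which the second link has no href: get returns None for the missing attribute, while passing href=True to find_all keeps only the tags that define it.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second <a> deliberately has no href.
html_doc = """
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a id="no-href">placeholder</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# get() yields None instead of raising when the attribute is absent.
hrefs = [a.get("href") for a in soup.find_all("a")]
print(hrefs)      # ['http://example.com/elsie', None]

# href=True matches only tags that actually carry an href attribute,
# so subscript access is safe inside the comprehension.
with_href = [a["href"] for a in soup.find_all("a", href=True)]
print(with_href)  # ['http://example.com/elsie']
```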

 

    To extract all the text in the HTML document, use get_text():

    print(soup.get_text())

    # The Dormouse's story
    #
    # The Dormouse's story
    #
    # Once upon a time there were three little sisters; and their names were
    # Elsie,
    # Lacie and
    # Tillie;
    # and they lived at the bottom of a well.
    #
    # ...
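As the output above shows, get_text() keeps the document's original whitespace. A possible way to tidy it, sketched here on a made-up one-line snippet, is to pass a separator and strip=True; neither argument is mentioned in the original post, so treat this as an optional extra rather than part of the author's workflow.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with stray whitespace around the text nodes.
soup = BeautifulSoup("<p>  Hello <b>big</b> world  </p>", "html.parser")

# Default: text nodes are concatenated exactly as they appear.
raw = soup.get_text()

# With strip=True each text node is stripped, then joined by the separator.
spaced = soup.get_text(" ", strip=True)
piped = soup.get_text("|", strip=True)

print(repr(raw))     # '  Hello big world  '
print(repr(spaced))  # 'Hello big world'
print(repr(piped))   # 'Hello|big|world'
```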

 

    To parse an HTML file from disk instead, open it and pass the file object:

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, "html.parser")
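To try the file-based form without an existing index.html, the sketch below writes a throwaway file into a temporary directory first. The file contents and the tempfile setup are assumptions made purely so the example is self-contained.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

with tempfile.TemporaryDirectory() as tmp:
    # Create a small stand-in for index.html (hypothetical content).
    path = Path(tmp) / "index.html"
    path.write_text("<html><head><title>Demo</title></head></html>",
                    encoding="utf-8")

    # The with-block closes the file handle once parsing is done.
    with open(path, encoding="utf-8") as fp:
        soup = BeautifulSoup(fp, "html.parser")

# The parsed tree stays usable after the file is closed.
title_text = str(soup.title.string)
print(title_text)  # Demo
```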

 

 

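Objects in BeautifulSoup

tag (corresponds to a tag in the HTML)

tag.attrs (returns all of the tag's attributes as a dictionary)

A tag's attributes can be added, removed, and modified directly, just like dictionary entries. The example assumes tag is a <blockquote> element whose text is "Extremely bold":

tag['class'] = 'verybold'

tag['id'] = 1

tag

# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag['class']

del tag['id']

tag

# <blockquote>Extremely bold</blockquote>

tag['class']

# KeyError: 'class'

print(tag.get('class'))

# None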
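The attribute operations above can be run end to end with a sketch like the following. The one-element document is an assumption chosen to match the <blockquote> in the post, and the id is set as the string "1" here so the resulting dictionary is easy to compare.

```python
from bs4 import BeautifulSoup

# Hypothetical one-element document matching the <blockquote> in the post.
soup = BeautifulSoup("<blockquote>Extremely bold</blockquote>", "html.parser")
tag = soup.blockquote

# Add or modify attributes with dictionary-style assignment.
tag["class"] = "verybold"
tag["id"] = "1"
attrs_after_set = dict(tag.attrs)
print(attrs_after_set)  # {'class': 'verybold', 'id': '1'}

# Remove attributes with del.
del tag["class"]
del tag["id"]
attrs_after_del = dict(tag.attrs)
print(attrs_after_del)  # {}

# Subscript access raises KeyError for a missing attribute,
# while get() returns None.
try:
    tag["class"]
    raised = False
except KeyError:
    raised = True

print(raised)            # True
print(tag.get("class"))  # None
```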

 

X.contents (where X is a tag) returns the tag's direct children as a list.

For example:

head_tag = soup.head

head_tag

# <head><title>The Dormouse's story</title></head>

head_tag.contents

# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]

title_tag

# <title>The Dormouse's story</title>

title_tag.contents

# ["The Dormouse's story"]
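The walk from head to title to text can be reproduced with a short sketch. It assumes a cut-down document containing only the <head> element; the point it illustrates is that the items in .contents are either Tag objects or text nodes (NavigableString).

```python
from bs4 import BeautifulSoup

# Cut-down document containing only the <head> (an assumption for brevity).
soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>",
                     "html.parser")

head_tag = soup.head
print(len(head_tag.contents))  # 1

# The only child of <head> is the <title> tag.
title_tag = head_tag.contents[0]
print(title_tag.name)          # title

# The only child of <title> is a text node, not another tag.
text_node = title_tag.contents[0]
print(type(text_node).__name__)  # NavigableString
print(str(text_node))            # The Dormouse's story
```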


Original source: http://www.cnblogs.com/Blaxon/p/4461279.html
