Parsing XML with Python

Posted: 2014-09-22 21:43:53

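Understanding XML:

Before parsing XML, it helps to understand what XML is.

Wikipedia's explanation:

XML is designed to transport and carry data, not to present or display it. HTML is the language used to present data, so the focus of XML is on describing what the data is and on carrying that data.

If you already know XML, you can skip this section.

XML is a generalized way of describing hierarchical structured data. An XML document contains one or more elements, which are delimited by start and end tags. The following is a complete (albeit empty) XML document: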

<foo>   ①
</foo>  ②

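①	This is the start tag of the `foo` element.
②	This is the matching end tag of the `foo` element. Like balancing parentheses in writing, mathematics, or code, every start tag must be closed (matched) by a corresponding end tag.

Elements can be nested to any depth. An element bar inside an element foo is said to be a child element of foo.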

<foo>
  <bar></bar>
</foo>

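The first element in an XML document is called the root element, and an XML document can have only one root element. The following is not an XML document, because it has two "root elements".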

<foo></foo>
<bar></bar>
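A parser enforces this rule. As a minimal sketch, assuming Python's standard xml.etree.ElementTree module (the variable names here are only illustrative), parsing a string with two top-level elements raises an error:

```python
import xml.etree.ElementTree as ET

# One root element: parses fine.
root = ET.fromstring("<foo><bar></bar></foo>")

# Two top-level elements: not well-formed, so the parser refuses it.
try:
    ET.fromstring("<foo></foo><bar></bar>")
    two_roots_ok = True
except ET.ParseError as err:
    two_roots_ok = False
    print("not well-formed:", err)
```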

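Elements can have attributes, which are name-value pairs. Attributes are listed within the start tag of an element and separated by whitespace. Attribute names cannot be repeated within an element. Attribute values must be quoted; either single or double quotes may be used.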

<foo lang='en'>                          ①
  <bar id='papayawhip' lang="fr"></bar>  ②
</foo>

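①	The `foo` element has one attribute, named `lang`. The value of `lang` is `en`.
②	The `bar` element has two attributes, named `id` and `lang`. The value of its `lang` attribute is `fr`. This does not conflict with the attribute on `foo`. Each element has its own set of attributes.

If an element has more than one attribute, the order in which they are written is not significant. An element's attributes form an unordered set of key-value pairs, like a Python dictionary. There is no limit to the number of attributes an element may have.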
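ElementTree in Python's standard library exposes an element's attributes as a dictionary, which makes the "order does not matter" point easy to check. A small sketch (the names a and b are only illustrative):

```python
import xml.etree.ElementTree as ET

# The same element written with its attributes in two different orders.
a = ET.fromstring("<bar id='papayawhip' lang='fr'></bar>")
b = ET.fromstring("<bar lang='fr' id='papayawhip'></bar>")

print(a.attrib == b.attrib)   # dictionaries compare equal regardless of order
print(a.get("lang"))          # look up a single attribute by name
```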

Elements can have text content.

<foo lang='en'>
  <bar lang='fr'>PapayaWhip</bar>
</foo>

An element that has neither text content nor child elements is called an empty element.

<foo></foo>

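There is a shorthand for writing empty elements. By putting a / character at the end of the start tag, you can omit the end tag. The XML document in the previous example could be written like this: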

<foo/>
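To a parser the two spellings are the same element. A quick check with the standard library's ElementTree (a sketch, with variable names chosen for illustration):

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring("<foo></foo>")
short_form = ET.fromstring("<foo/>")

for el in (long_form, short_form):
    # An empty element has no text (None) and no children (length 0).
    print(el.tag, el.text, len(el))
```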

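Just as Python functions can be declared in different modules, XML elements can be declared in different namespaces. An XML namespace usually looks like a URL. You use an xmlns declaration to define a default namespace. A namespace declaration looks similar to an attribute, but it serves a different purpose.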

<feed xmlns='http://www.w3.org/2005/Atom'>  ①
  <title>dive into mark</title>             ②
</feed>

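①	The `feed` element is in the `http://www.w3.org/2005/Atom` namespace.
②	So is the `title` element. A namespace declaration applies not only to the element where it is declared but also to all of that element's child elements.

You can also use an xmlns:prefix declaration to define a namespace and give it the name prefix. Every element in that namespace must then be declared explicitly with the prefix.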

<atom:feed xmlns:atom='http://www.w3.org/2005/Atom'>  ①
  <atom:title>dive into mark</atom:title>             ②
</atom:feed>

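① The feed element is in the http://www.w3.org/2005/Atom namespace.

② The title element is in that namespace too.

As far as an XML parser is concerned, the two XML documents above are identical. Namespace + element name = XML identity. Prefixes exist only to refer to namespaces, so the actual prefix name (atom:) is irrelevant to the parser. If the namespaces match, the element names match, the attributes (or lack of attributes) match, and each element's text content matches, then the XML documents are the same.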
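This equivalence can be observed directly. The sketch below, assuming the standard library's ElementTree, parses both spellings and compares the resulting tags (the variable names are illustrative):

```python
import xml.etree.ElementTree as ET

default_ns = ET.fromstring(
    "<feed xmlns='http://www.w3.org/2005/Atom'>"
    "<title>dive into mark</title></feed>"
)
prefixed = ET.fromstring(
    "<atom:feed xmlns:atom='http://www.w3.org/2005/Atom'>"
    "<atom:title>dive into mark</atom:title></atom:feed>"
)

# Both roots report the same {namespace}localname tag; the prefix is gone.
print(default_ns.tag)
print(default_ns.tag == prefixed.tag)
print(default_ns[0].tag == prefixed[0].tag)
```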

Source: <http://woodpecker.org.cn/diveintopython3/xml.html>

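Parsing XML with Python:

Parsing XML mainly relies on the XML library that ships with Python (xml.etree.ElementTree), with the third-party lxml library as a second option.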
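Because lxml.etree offers an API largely compatible with ElementTree, a common pattern is to try lxml first and fall back to the standard library. This is only a sketch of that pattern, not something the examples in this post depend on:

```python
try:
    # lxml is a third-party package; it may not be installed.
    from lxml import etree
except ImportError:
    # Fall back to the parser in the standard library.
    import xml.etree.ElementTree as etree

root = etree.fromstring("<foo><bar>PapayaWhip</bar></foo>")
print(root.find("bar").text)
```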

XML structure. Let's start with an XML document that is relatively simple but covers most features:

<?xml version='1.0' encoding='utf-8'?>
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
  <title>dive into mark</title>
  <subtitle>currently between addictions</subtitle>
  <id>tag:diveintomark.org,2001-07-29:/</id>
  <updated>2009-03-27T21:56:07Z</updated>
  <link rel='alternate' type='text/html' href='http://diveintomark.org/'/>
  <entry>
    <author>
      <name>Mark</name>
      <uri>http://diveintomark.org/</uri>
    </author>
    <title>Dive into history, 2009 edition</title>
    <link rel='alternate' type='text/html'
      href='http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'/>
    <id>tag:diveintomark.org,2009-03-27:/archives/20090327172042</id>
    <updated>2009-03-27T21:56:07Z</updated>
    <published>2009-03-27T17:20:42Z</published>
    <category scheme='http://diveintomark.org' term='diveintopython'/>
    <category scheme='http://diveintomark.org' term='docbook'/>
    <category scheme='http://diveintomark.org' term='html'/>
    <summary type='html'>Putting an entire chapter on one page sounds
      bloated, but consider this &amp;mdash; my longest chapter so far
      would be 75 printed pages, and it loads in under 5 seconds&amp;hellip;
      On dialup.</summary>
  </entry>
  <entry>
    <author>
      <name>Mark</name>
      <uri>http://diveintomark.org/</uri>
    </author>
    <title>Accessibility is a harsh mistress</title>
    <link rel='alternate' type='text/html'
      href='http://diveintomark.org/archives/2009/03/21/accessibility-is-a-harsh-mistress'/>
    <id>tag:diveintomark.org,2009-03-21:/archives/20090321200928</id>
    <updated>2009-03-22T01:05:37Z</updated>
    <published>2009-03-21T20:09:28Z</published>
    <category scheme='http://diveintomark.org' term='accessibility'/>
    <summary type='html'>The accessibility orthodoxy does not permit people to
      question the value of features that are rarely useful and rarely used.</summary>
  </entry>
  <entry>
    <author>
      <name>Mark</name>
    </author>
    <title>A gentle introduction to video encoding, part 1: container formats</title>
    <link rel='alternate' type='text/html'
      href='http://diveintomark.org/archives/2008/12/18/give-part-1-container-formats'/>
    <id>tag:diveintomark.org,2008-12-18:/archives/20081218155422</id>
    <updated>2009-01-11T19:39:22Z</updated>
    <published>2008-12-18T15:54:22Z</published>
    <category scheme='http://diveintomark.org' term='asf'/>
    <category scheme='http://diveintomark.org' term='avi'/>
    <category scheme='http://diveintomark.org' term='encoding'/>
    <category scheme='http://diveintomark.org' term='flv'/>
    <category scheme='http://diveintomark.org' term='GIVE'/>
    <category scheme='http://diveintomark.org' term='mp4'/>
    <category scheme='http://diveintomark.org' term='ogg'/>
    <category scheme='http://diveintomark.org' term='video'/>
    <summary type='html'>These notes will eventually become part of a
      tech talk on video encoding.</summary>
  </entry>
</feed>

First, a quick look at the structure of this XML:

<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'> # this declares the namespace http://www.w3.org/2005/Atom
  <title></title>
  <subtitle></subtitle>
  <id></id>
  <updated></updated>
  <link rel='alternate' type='text/html' href='http://diveintomark.org/'/> # this <link> has no text, but it does carry attributes
  <entry>
    <author>
      <name></name>
      <uri></uri>
    </author>
    <title></title>
    <link rel='alternate' type='text/html'
      href='http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'/>
    <id></id>
    <updated></updated>
    <published></published>
    <category scheme='http://diveintomark.org' term='diveintopython'/>
    <summary type='html'></summary>
  </entry>
</feed>

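First there is a single global root element, <feed></feed>.

Under the root element are the child elements title, subtitle, id, updated, link, and entry.

Under each entry element are the child elements author, title, link, id, updated, published, category, and summary.

Under each author element are the child elements name and uri.

The structure is quite clear.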
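The same outline can be produced programmatically by walking the tree recursively. The sketch below uses a trimmed stand-in for the feed and a hypothetical helper named outline; neither comes from the original post:

```python
import xml.etree.ElementTree as ET

# A trimmed stand-in for feed.xml so the sketch runs on its own.
SAMPLE = """<feed xmlns='http://www.w3.org/2005/Atom'>
  <title>dive into mark</title>
  <entry>
    <author><name>Mark</name></author>
    <title>Dive into history, 2009 edition</title>
  </entry>
</feed>"""

def outline(element, depth=0):
    """Return one indented line per element, without the namespace part."""
    local_name = element.tag.split('}')[-1]
    lines = ['  ' * depth + local_name]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

lines = outline(ET.fromstring(SAMPLE))
print('\n'.join(lines))
```

Iterating over an element yields only its direct children, so the recursion is what reaches the grandchildren.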

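Next we use Python to extract, step by step, the content between each element's start and end tags, as well as the attributes inside the elements.

The main calls used are:

tree = etree.parse() parses the XML

root = tree.getroot() gets the root element

root.tag the name of the root element

root.attrib the attributes of the element

root.findall() finds elements

Here is the code, with comments and results written inline: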

import xml.etree.ElementTree as etree  # import xml.etree.ElementTree under the name etree
tree = etree.parse('feed.xml')  # parse the XML file
root = tree.getroot()
print(root)
# <Element '{http://www.w3.org/2005/Atom}feed' at 0xcd1eb0>
 
# Elements behave like lists
print(root.tag)
# {http://www.w3.org/2005/Atom}feed
# ElementTree writes XML element names as {namespace}localname
 
for child in root:
    print(child)
 
    # <Element '{http://www.w3.org/2005/Atom}title' at 0xe2b5d0>
    # <Element '{http://www.w3.org/2005/Atom}subtitle' at 0xe2b4e0>
    # <Element '{http://www.w3.org/2005/Atom}id' at 0xe2b6c0>
    # <Element '{http://www.w3.org/2005/Atom}updated' at 0xe2b6f0>
    # <Element '{http://www.w3.org/2005/Atom}link' at 0xe2b4b0>
    # <Element '{http://www.w3.org/2005/Atom}entry' at 0xe2b720>
    # <Element '{http://www.w3.org/2005/Atom}entry' at 0xe2b510>
    # <Element '{http://www.w3.org/2005/Atom}entry' at 0xe2b750>
    # Only direct children are shown here; the children of those children are not visited
 
# Attributes are a dictionary
print(root.attrib)
# {'{http://www.w3.org/XML/1998/namespace}lang': 'en'}
# Note that the link element under feed has attributes
print(root[4].attrib)
# {'rel': 'alternate', 'type': 'text/html', 'href': 'http://diveintomark.org/'}
print(root[3].attrib)
# {} an empty dictionary, because the updated element has no attributes
 
# Finding elements
entrylist = root.findall('{http://www.w3.org/2005/Atom}entry')
print(entrylist)
# [<Element '{http://www.w3.org/2005/Atom}entry' at 0x18423a0>,
#  <Element '{http://www.w3.org/2005/Atom}entry' at 0x18425d0>,
#  <Element '{http://www.w3.org/2005/Atom}entry' at 0x1842968>]
print(root.findall('{http://www.w3.org/2005/Atom}author'))
# This returns an empty list, because author is not a direct child of feed
 
# Finding child elements
entries = tree.findall('{http://www.w3.org/2005/Atom}entry')  # first find the entry elements
title = entries[0].find('{http://www.w3.org/2005/Atom}title')  # then find the title element
print(title.text)
# Dive into history, 2009 edition
 
# A leading './/' searches all descendants, including children and grandchildren
# (a bare leading '//' on a tree triggers a FutureWarning in current Python versions)
all_links = tree.findall('.//{http://www.w3.org/2005/Atom}link')
print(all_links)
# [<Element '{http://www.w3.org/2005/Atom}link' at 0xe181b0>,
#  <Element '{http://www.w3.org/2005/Atom}link' at 0xe2b570>,
#  <Element '{http://www.w3.org/2005/Atom}link' at 0xe2b480>,
#  <Element '{http://www.w3.org/2005/Atom}link' at 0xe2b5a0>]
 
print(all_links[0].attrib)  # the attribute dictionary of this link
# {'rel': 'alternate',
#  'type': 'text/html',
#  'href': 'http://diveintomark.org/'}

Source: <http://my.oschina.net/yangyanxing/blog/159212>
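Writing the full {namespace} in every search path gets repetitive. find() and findall() also accept a prefix-to-namespace mapping, so paths can use a short prefix instead. A sketch on a trimmed stand-in for the feed (the prefix name atom and the constant NS are our own choices, not part of the original code):

```python
import xml.etree.ElementTree as ET

# A trimmed stand-in for feed.xml so the sketch runs on its own.
SAMPLE = """<feed xmlns='http://www.w3.org/2005/Atom'>
  <link rel='alternate' href='http://diveintomark.org/'/>
  <entry>
    <title>Dive into history, 2009 edition</title>
    <link rel='alternate' href='http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'/>
  </entry>
</feed>"""

NS = {'atom': 'http://www.w3.org/2005/Atom'}  # the prefix name is arbitrary

root = ET.fromstring(SAMPLE)
# Titles of direct entry children, then every link anywhere below the root.
titles = [t.text for t in root.findall('atom:entry/atom:title', NS)]
hrefs = [link.get('href') for link in root.findall('.//atom:link', NS)]
print(titles)
print(hrefs)
```

The prefix used in the mapping does not have to match any prefix in the document; only the namespace URI matters.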

Python parsing examples:

1. Parsing WeChat XML. WeChat POSTs the user's message to your server, basically in this form:

<xml>
<ToUserName><![CDATA[toUser]]></ToUserName>
<FromUserName><![CDATA[fromUser]]></FromUserName> 
<CreateTime>1348831860</CreateTime>
<MsgType><![CDATA[text]]></MsgType>
<Content><![CDATA[this is a test]]></Content>
<MsgId>1234567890123456</MsgId>
</xml>

Parsing the XML file with Python to get the element names and the text inside Content:

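import xml.etree.ElementTree as etree
 
weixinxml = etree.parse('weixinpost.xml')
wroot = weixinxml.getroot()
print(wroot.tag)
 
# Print the tag name of every direct child of the root
for child in wroot:
    print(child.tag)
 
# find() returns None when the element is missing, so check before reading .text
content = wroot.find('Content')
if content is not None:
    print(content.text)
else:
    print('Nothing found')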

The result looks like this:

xml
ToUserName
FromUserName
CreateTime
MsgType
Content
MsgId
this is a test
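On a server the message arrives as the body of a POST request rather than as a file, so etree.fromstring() is the more natural entry point. The sketch below assumes the body has already been read into a string; parse_message is a hypothetical helper, and how the body is obtained depends on the web framework in use:

```python
import xml.etree.ElementTree as ET

# Stand-in for the raw POST body; a real handler would read it from the request.
body = """<xml>
<ToUserName><![CDATA[toUser]]></ToUserName>
<FromUserName><![CDATA[fromUser]]></FromUserName>
<CreateTime>1348831860</CreateTime>
<MsgType><![CDATA[text]]></MsgType>
<Content><![CDATA[this is a test]]></Content>
<MsgId>1234567890123456</MsgId>
</xml>"""

def parse_message(xml_text):
    """Map each child element's tag to its text content."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}

message = parse_message(body)
print(message['MsgType'], message.get('Content', 'Nothing found'))
```

CDATA sections need no special handling; the parser returns their content as ordinary text.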

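2. As an exercise, parse Yahoo's weather forecast in XML format to get the weather for today and the next few days:

http://weather.yahooapis.com/forecastrss?u=c&w=2151330

The parameter w is the city code. To look up the code for a city, search for the city on weather.yahoo.com; the URL in the browser's address bar contains the city code.

This leads into the topic of "web crawlers". The implementation code will follow later...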
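In the meantime, here is a rough sketch of the parsing half of the exercise. Everything about the data is an assumption: the sample below is invented for illustration and is not a real response, and the yweather namespace URI and the forecast attribute names (day, low, high, text) are recalled from the feed format rather than taken from this post. Fetching the URL is left out on purpose.

```python
import xml.etree.ElementTree as ET

# Invented sample data, only to illustrate the shape; not a real response.
SAMPLE = """<rss version='2.0' xmlns:yweather='http://xml.weather.yahoo.com/ns/rss/1.0'>
  <channel>
    <item>
      <yweather:forecast day='Mon' low='10' high='20' text='Sunny'/>
      <yweather:forecast day='Tue' low='12' high='22' text='Cloudy'/>
    </item>
  </channel>
</rss>"""

# Assumed namespace URI for the yweather prefix.
NS = {'yweather': 'http://xml.weather.yahoo.com/ns/rss/1.0'}

root = ET.fromstring(SAMPLE)
# Each forecast carries its data in attributes, so copy the attribute dictionaries.
forecasts = [dict(f.attrib) for f in root.findall('.//yweather:forecast', NS)]
for f in forecasts:
    print(f['day'], f['low'], f['high'], f['text'])
```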

Original article: http://blog.csdn.net/zwto1/article/details/39479925
