Parsing XML in Python with lxml

Date: 2016-04-13 14:45:03

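Python has many libraries for parsing XML, but lxml is built on the C libraries libxml2 and libxslt, which gives it a clear speed advantage. Beyond speed, lxml is also easy to use. This post walks through basic lxml usage, with the XML data below as the example.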

Example: dblp.xml (a fragment of the DBLP data)
<?xml version='1.0' encoding='utf-8'?>
<dblp>
    <article mdate="2012-11-28" key="journals/entropy/BellucciFMY08">
        <author>Stefano Bellucci</author>
        <author>Sergio Ferrara</author>
        <author>Alessio Marrani</author>
        <author>Armen Yeranyan</author>
        <title>ES<sup>2</sup>: A cloud data storage system for supporting both OLTP and OLAP.</title>
        <pages>507-555</pages>
        <year>2008</year>
        <volume>10</volume>
        <journal>Entropy</journal>
        <number>4</number>
        <ee>http://dx.doi.org/10.3390/e10040507</ee>
        <url>db/journals/entropy/entropy10.html#BellucciFMY08</url>
    </article>
    <article mdate="2013-03-04" key="journals/entropy/Knuth13">
        <author>Kevin H. Knuth</author>
        <title><i>Entropy</i> Best Paper Award 2013.</title>
        <pages>698-699</pages>
        <year>2013</year>
        <volume>15</volume>
        <journal>Entropy</journal>
        <number>2</number>
        <ee>http://dx.doi.org/10.3390/e15020698</ee>
        <url>db/journals/entropy/entropy15.html#Knuth13</url>
    </article>
</dblp>

1. Parse the XML into a tree and get its root

To parse the XML into a tree structure and obtain the root of that tree, do the following:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from lxml import etree  # import the etree module from lxml

tree = etree.parse("dblp.xml")  # parse the XML file into a tree
root = tree.getroot()  # get the root element of the tree
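
As a quick check of the step above, the following sketch parses a small in-memory document instead of a file, so it runs without dblp.xml on disk. The sample data and variable names are illustrative assumptions. `etree.parse` accepts a file name or a file-like object and returns a tree, while `etree.fromstring` returns the root element directly.

```python
import io
from lxml import etree

# Illustrative sample data, modeled on the dblp.xml fragment above.
SAMPLE = b"""<?xml version='1.0' encoding='utf-8'?>
<dblp>
    <article mdate="2013-03-04" key="journals/entropy/Knuth13">
        <author>Kevin H. Knuth</author>
        <year>2013</year>
    </article>
</dblp>
"""

# etree.parse() accepts a file name or a file-like object and returns a tree.
tree = etree.parse(io.BytesIO(SAMPLE))
root = tree.getroot()

# etree.fromstring() parses the bytes and returns the root element directly.
root_from_string = etree.fromstring(SAMPLE)

print(root.tag)   # name of the root element
print(len(root))  # number of direct child elements of the root
```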

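In addition, if the XML data contains a DTD declaration (as in the example below), lxml has to be told to load the DTD when parsing. This is needed, for example, when the document uses entities that are defined only in the DTD.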

Example of an XML file that contains a DTD declaration:
hadoop@hadoop:~/20130722dblpxml$ head -15 dblp.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2002-01-03" key="persons/Codd71a">
<author>E. F. Codd</author>
<title>Further Normalization of the Data Base Relational Model.</title>
<journal>IBM Research Report, San Jose, California</journal>
<volume>RJ909</volume>
<month>August</month>
<year>1971</year>
<cdrom>ibmTR/rj909.pdf</cdrom>
<ee>db/labs/ibm/RJ909.html</ee>
</article>
</dblp>

In this case, to parse the XML data into a tree and get its root, do the following:

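#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from lxml import etree  # import the etree module from lxml

# First create a parser that loads the DTD
# (note: the DTD file must be in the same directory as the XML file)
parser = etree.XMLParser(load_dtd=True)
tree = etree.parse("dblp.xml", parser)  # parse the XML into a tree with that parser
root = tree.getroot()  # get the root element of the tree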
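
One case where loading the DTD matters is a document that uses named entities declared only in the DTD. The sketch below writes a tiny DTD and XML file to a temporary directory and parses them with `load_dtd=True`. The file names `sample.dtd` and `sample.xml`, the author name, and the helper `parse_with_dtd` are all made up for illustration. `resolve_entities=True` is passed explicitly so that the example does not depend on the library default.

```python
import os
import tempfile
from lxml import etree

# Hypothetical miniature DTD: declares the structure and one named entity.
SAMPLE_DTD = """<!ELEMENT dblp (article*)>
<!ELEMENT article (author*)>
<!ELEMENT author (#PCDATA)>
<!ENTITY uuml "&#252;">
"""

# Hypothetical XML file that refers to the DTD and uses the entity.
SAMPLE_XML = """<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "sample.dtd">
<dblp>
<article>
<author>J&uuml;rgen Example</author>
</article>
</dblp>
"""


def parse_with_dtd(xml_path):
    """Parse xml_path with a parser that loads the DTD named in its DOCTYPE."""
    parser = etree.XMLParser(load_dtd=True, resolve_entities=True)
    return etree.parse(xml_path, parser).getroot()


# The DTD is written next to the XML file so the relative SYSTEM id resolves.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "sample.dtd"), "w", encoding="ascii") as f:
    f.write(SAMPLE_DTD)
xml_path = os.path.join(workdir, "sample.xml")
with open(xml_path, "w", encoding="ascii") as f:
    f.write(SAMPLE_XML)

root = parse_with_dtd(xml_path)
author = root[0][0]
print(author.text)  # the entity reference is replaced by the character it stands for
```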

2. Traverse the tree and read each element's attributes and child elements

for article in root:  # iterate over the root's children (here, the article elements)
    print("Element name:", article.tag)  # .tag gives the name of the child element
    for field in article:  # iterate over the article's children (author, title, volume, year, ...)
        print(field.tag, ":", field.text)  # .tag is the element's name, .text is its text content
    mdate = article.get("mdate")  # .get("attribute name") returns the value of that attribute
    key = article.get("key")
    print("mdate:", mdate)
    print("key:", key)
    print("")  # blank line between article elements
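
The same traversal can also collect the data instead of printing it. The sketch below is one possible way to do that, not part of the original walkthrough: the helper name `article_to_record` and the dictionary layout are assumptions, and the sample string is a trimmed copy of the second article above.

```python
from lxml import etree

# Trimmed copy of the second article from dblp.xml, for illustration.
SAMPLE = b"""<dblp>
    <article mdate="2013-03-04" key="journals/entropy/Knuth13">
        <author>Kevin H. Knuth</author>
        <pages>698-699</pages>
        <year>2013</year>
    </article>
</dblp>
"""


def article_to_record(article):
    """Return the tag, attributes, and child fields of one article element."""
    return {
        "tag": article.tag,
        "attributes": dict(article.attrib),  # here: mdate and key
        # (name, text) pairs keep repeated fields such as several authors
        "fields": [(field.tag, field.text) for field in article],
    }


root = etree.fromstring(SAMPLE)
records = [article_to_record(article) for article in root]
for record in records:
    print(record)
```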

At this point, you can already do simple XML parsing.

3. An example of parsing XML data

The code below parses the dblp.xml data shown at the beginning of this post.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from lxml import etree  # import the etree module from lxml

tree = etree.parse("dblp.xml")  # parse the XML file into a tree
root = tree.getroot()  # get the root element of the tree

for article in root:  # iterate over the root's children (here, the article elements)
    print("Element name:", article.tag)  # .tag gives the name of the child element
    for field in article:  # iterate over the article's children (author, title, volume, year, ...)
        print(field.tag, ":", field.text)  # .tag is the element's name, .text is its text content
    mdate = article.get("mdate")  # .get("attribute name") returns the value of that attribute
    key = article.get("key")
    print("mdate:", mdate)
    print("key:", key)
    print("")  # blank line between article elements

It produces the following output:

Element name: article
author : Stefano Bellucci
author : Sergio Ferrara
author : Alessio Marrani
author : Armen Yeranyan
title : ES
pages : 507-555
year : 2008
volume : 10
journal : Entropy
number : 4
ee : http://dx.doi.org/10.3390/e10040507
url : db/journals/entropy/entropy10.html#BellucciFMY08
mdate: 2012-11-28
key: journals/entropy/BellucciFMY08

Element name: article
author : Kevin H. Knuth
title : None
pages : 698-699
year : 2013
volume : 15
journal : Entropy
number : 2
ee : http://dx.doi.org/10.3390/e15020698
url : db/journals/entropy/entropy15.html#Knuth13
mdate: 2013-03-04
key: journals/entropy/Knuth13

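4. Handling elements that contain both sub-elements and text

In the example above, the content of the title elements is wrong. A title element contains both sub-elements and text (see below), so plain .text does not return its full content: .text holds only the text that comes before the first sub-element, and text that follows a sub-element is stored in that sub-element's .tail. For the first article, only "ES" was retrieved; for the second article, nothing was retrieved at all (None).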

<title>ES<sup>2</sup>: A cloud data storage system for supporting both OLTP and OLAP.</title>
<title><i>Entropy</i> Best Paper Award 2013.</title>
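
To see where the rest of each title ends up, the sketch below inspects the two titles with `.text` and `.tail`, both standard attributes of lxml elements. The variable names are illustrative.

```python
from lxml import etree

title = etree.fromstring(
    "<title>ES<sup>2</sup>: A cloud data storage system "
    "for supporting both OLTP and OLAP.</title>"
)
sup = title[0]  # the <sup> sub-element

print(repr(title.text))  # text before the first sub-element
print(repr(sup.text))    # text inside <sup>
print(repr(sup.tail))    # text after </sup>, stored on the sub-element

# The second title starts with a sub-element, so nothing precedes it.
title2 = etree.fromstring("<title><i>Entropy</i> Best Paper Award 2013.</title>")
print(repr(title2.text))     # None
print(repr(title2[0].tail))  # text after </i>
```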

Because the sub-elements in this example are simple, a quick workaround is to print the sub-elements together with the text. The code is as follows:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from lxml import etree  # import the etree module from lxml

tree = etree.parse("dblp.xml")  # parse the XML file into a tree
root = tree.getroot()  # get the root element of the tree

for article in root:  # iterate over the root's children (here, the article elements)
    print("Element name:", article.tag)  # .tag gives the name of the child element
    for field in article:  # iterate over the article's children (author, title, volume, year, ...)
        if field.tag == "title":
            # print the element's text together with its sub-elements;
            # encoding="unicode" makes tostring() return a str
            print(field.tag, ":", etree.tostring(field, encoding="unicode", pretty_print=False))
        else:
            print(field.tag, ":", field.text)  # .tag is the element's name, .text is its text content
    mdate = article.get("mdate")  # .get("attribute name") returns the value of that attribute
    key = article.get("key")
    print("mdate:", mdate)
    print("key:", key)
    print("")  # blank line between article elements

The output is as follows:

Element name: article
author : Stefano Bellucci
author : Sergio Ferrara
author : Alessio Marrani
author : Armen Yeranyan
title : <title>ES<sup>2</sup>: A cloud data storage system for supporting both OLTP and OLAP.</title>

pages : 507-555
year : 2008
volume : 10
journal : Entropy
number : 4
ee : http://dx.doi.org/10.3390/e10040507
url : db/journals/entropy/entropy10.html#BellucciFMY08
mdate: 2012-11-28
key: journals/entropy/BellucciFMY08

Element name: article
author : Kevin H. Knuth
title : <title><i>Entropy</i> Best Paper Award 2013.</title>

pages : 698-699
year : 2013
volume : 15
journal : Entropy
number : 2
ee : http://dx.doi.org/10.3390/e15020698
url : db/journals/entropy/entropy15.html#Knuth13
mdate: 2013-03-04
key: journals/entropy/Knuth13

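Clearly, this workaround is rather clumsy: the tags and other unwanted parts of the title content still have to be stripped out afterwards with string processing. It would be better to have a simple way to get an element's text and its sub-elements separately. The next section shows how to do that.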

5. A more elegant way to handle sub-elements and text

Suppose the XML file paper.xml has the following content:

<?xml version="1.0" encoding="ISO-8859-1"?>
<dblp>
    <article mdate="2002-01-03" key="persons/Codd71a">
        <author>E. F. Codd</author>
        <title>ES<sup>2</sup>: A cloud data storage system for supporting both OLTP and OLAP.</title>
        <journal>IBM Research Report, San Jose, California</journal>
        <volume>RJ909</volume>
        <month>August</month>
        <year>1971</year>
    </article>
    <article mdate="2002-01-03" key="persons/Codd71a">
        <author>E. F. Codd</author>
        <title><i>Entropy</i> Best Paper Award 2013.</title>
        <journal>IBM Research Report, San Jose, California</journal>
        <volume>RJ909</volume>
        <month>August</month>
        <year>1971</year>
        <cdrom>ibmTR/rj909.pdf</cdrom>
        <ee>db/labs/ibm/RJ909.html</ee>
    </article>
</dblp>

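As you can see, the title field in the file above contains sub-elements nested within its text. To get both the element's own text and the text inside its sub-elements, we write a dedicated function for extracting this field's text. Two concrete implementations follow.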

5.1 v1.0

The first idea is to read the text of each element recursively and then join the pieces together. The code is as follows:

from lxml import etree  # paper2.py


def node_text(node):
    """Recursively join the .text of an element and of its descendants."""
    result = node.text.strip() if node.text else ''
    for child in node:
        child_text = node_text(child)
        if child_text:
            result = (result + ' %s' % child_text) if result else child_text
    return result


if __name__ == '__main__':
    parser = etree.XMLParser()
    root = etree.parse('paper.xml', parser).getroot()
    for element in root:
        category = element.tag
        for attribute in element:
            if attribute.tag == "title":
                print("title:", node_text(attribute))
            else:
                # guard against empty elements, whose .text is None
                print(attribute.tag + ":", (attribute.text or "").strip())
        print("")
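
Note that `node_text` above reads only `.text`, so text that follows a sub-element (that sub-element's `.tail`) is not included in its result. As a hedged sketch, and not necessarily the author's own next version, the hypothetical helper `node_text_with_tail` below also appends each child's `.tail`. Its result is compared with lxml's built-in `itertext()`, which yields the same text fragments in document order.

```python
from lxml import etree


def node_text_with_tail(node):
    """Concatenate an element's text, its descendants' text, and their tails."""
    parts = [node.text or ""]
    for child in node:
        parts.append(node_text_with_tail(child))
        parts.append(child.tail or "")  # text after the child's closing tag
    return "".join(parts)


# The two title elements from paper.xml, parsed from strings for illustration.
titles = [
    etree.fromstring(
        "<title>ES<sup>2</sup>: A cloud data storage system "
        "for supporting both OLTP and OLAP.</title>"
    ),
    etree.fromstring("<title><i>Entropy</i> Best Paper Award 2013.</title>"),
]

for title in titles:
    collected = node_text_with_tail(title)
    # itertext() walks the same text and tail fragments in document order
    assert collected == "".join(title.itertext())
    print("title:", collected)
```

Unlike `node_text`, this variant does not strip or re-space the fragments, so the title text comes out exactly as it appears in the file.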