
Parsing XML in Python with lxml

Date: 2016-04-13 14:45:03


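Python has many libraries for parsing XML, but lxml is implemented in C under the hood (it wraps the libxml2 and libxslt libraries), which generally makes it noticeably faster than pure-Python parsers. Besides speed, lxml is also very easy to use. This article uses the XML data below as an example to introduce basic lxml usage.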


Example: dblp.xml (a fragment of the DBLP data)

    <?xml version='1.0' encoding='utf-8'?>
    <dblp>
        <article mdate="2012-11-28" key="journals/entropy/BellucciFMY08">
            <author>Stefano Bellucci</author>
            <author>Sergio Ferrara</author>
            <author>Alessio Marrani</author>
            <author>Armen Yeranyan</author>
            <title>ES<sup>2</sup>: A cloud data storage system for supporting both OLTP and OLAP.</title>
            <pages>507-555</pages>
            <year>2008</year>
            <volume>10</volume>
            <journal>Entropy</journal>
            <number>4</number>
            <ee>http://dx.doi.org/10.3390/e10040507</ee>
            <url>db/journals/entropy/entropy10.html#BellucciFMY08</url>
        </article>
        <article mdate="2013-03-04" key="journals/entropy/Knuth13">
            <author>Kevin H. Knuth</author>
            <title><i>Entropy</i> Best Paper Award 2013.</title>
            <pages>698-699</pages>
            <year>2013</year>
            <volume>15</volume>
            <journal>Entropy</journal>
            <number>2</number>
            <ee>http://dx.doi.org/10.3390/e15020698</ee>
            <url>db/journals/entropy/entropy15.html#Knuth13</url>
        </article>
    </dblp>

1. Parse the XML into a tree and get the root of the tree

To parse the XML into a tree structure and get its root element, do the following:


    #!/usr/bin/env python3
    from lxml import etree  # import lxml's ElementTree API

    tree = etree.parse("dblp.xml")  # parse the XML file into a tree
    root = tree.getroot()           # get the root element of the tree
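When the XML is already in memory rather than in a file, the same tree can be built with `etree.fromstring`. The following is a minimal sketch; the inline sample `XML_BYTES` is a made-up fragment in the style of dblp.xml, not part of the original data. Note that lxml rejects `str` input that carries an encoding declaration, so the sample is passed as bytes.

```python
from lxml import etree

# Hypothetical inline sample in the style of dblp.xml (an assumption for this sketch).
XML_BYTES = b"""<?xml version='1.0' encoding='utf-8'?>
<dblp>
    <article mdate="2013-03-04" key="journals/entropy/Knuth13">
        <author>Kevin H. Knuth</author>
        <year>2013</year>
    </article>
</dblp>"""

# fromstring() returns the root element directly, so no getroot() call is needed.
root = etree.fromstring(XML_BYTES)

# getroottree() gives back the enclosing ElementTree when one is required.
tree = root.getroottree()

print(root.tag)             # name of the root element
print(len(root))            # number of direct children of the root
print(tree.getroot().tag)   # same root, reached through the tree
```

Everything shown in the rest of the article for `root` works the same way regardless of whether the root came from `etree.parse` or `etree.fromstring`.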

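In addition, if the XML data contains a DTD declaration (as in the example below), the parser has to be told about it. lxml does not load an external DTD by default, and that matters as soon as the document depends on the DTD, for example through entities that are defined there.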


Example of an XML file that contains a DTD declaration:

    hadoop@hadoop:~/20130722dblpxml$ head -15 dblp.xml
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <!DOCTYPE dblp SYSTEM "dblp.dtd">
    <dblp>
    <article mdate="2002-01-03" key="persons/Codd71a">
    <author>E. F. Codd</author>
    <title>Further Normalization of the Data Base Relational Model.</title>
    <journal>IBM Research Report, San Jose, California</journal>
    <volume>RJ909</volume>
    <month>August</month>
    <year>1971</year>
    <cdrom>ibmTR/rj909.pdf</cdrom>
    <ee>db/labs/ibm/RJ909.html</ee>
    </article>
    </dblp>

In this case, to parse the XML data into a tree and get its root, do the following:


    #!/usr/bin/env python3
    from lxml import etree  # import lxml's ElementTree API

    # Create a parser that loads the DTD (dblp.dtd must be in the same directory as the XML file)
    parser = etree.XMLParser(load_dtd=True)
    tree = etree.parse("dblp.xml", parser)  # parse the XML into a tree with that parser
    root = tree.getroot()                   # get the root element of the tree
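To try the DTD-loading parser without downloading the real DBLP files, the sketch below writes a small XML file and a DTD next to each other in a temporary directory and then parses them. `DTD_TEXT` is a heavily reduced, hypothetical stand-in for dblp.dtd and `XML_TEXT` is a shortened copy of the example above; both are assumptions made for illustration only.

```python
import os
import tempfile

from lxml import etree

# Hypothetical, heavily reduced stand-in for dblp.dtd (not the real DTD).
DTD_TEXT = """<!ELEMENT dblp (article*)>
<!ELEMENT article (author*, title, year)>
<!ATTLIST article mdate CDATA #IMPLIED key CDATA #REQUIRED>
<!ELEMENT author (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT year (#PCDATA)>
"""

# Shortened copy of the example document (an assumption for this sketch).
XML_TEXT = """<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2002-01-03" key="persons/Codd71a">
<author>E. F. Codd</author>
<title>Further Normalization of the Data Base Relational Model.</title>
<year>1971</year>
</article>
</dblp>
"""

with tempfile.TemporaryDirectory() as workdir:
    # Write both files side by side so the relative SYSTEM identifier resolves.
    with open(os.path.join(workdir, "dblp.dtd"), "w", encoding="iso-8859-1") as f:
        f.write(DTD_TEXT)
    xml_path = os.path.join(workdir, "dblp.xml")
    with open(xml_path, "w", encoding="iso-8859-1") as f:
        f.write(XML_TEXT)

    parser = etree.XMLParser(load_dtd=True)
    tree = etree.parse(xml_path, parser)

root = tree.getroot()
docinfo = tree.docinfo            # DOCTYPE details recorded while parsing
print(docinfo.system_url)         # SYSTEM identifier from the DOCTYPE line
print(docinfo.encoding)           # encoding declared in the XML header
print(root[0].findtext("author"))
```

The relative identifier `dblp.dtd` in the DOCTYPE line is resolved against the location of the XML file, which is why the two files need to live in the same directory.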

2. Traverse the tree and read each element's attributes and child elements


    for article in root:  # iterate over all children of the root (the article elements here)
        print("Element name:", article.tag)  # .tag gives the name of the element
        for field in article:  # iterate over the article's children (author, title, volume, year, ...)
            print(field.tag, ":", field.text)  # .tag gives the name, .text gives the content
        mdate = article.get("mdate")  # .get("attribute name") returns the value of that attribute
        key = article.get("key")
        print("mdate:", mdate)
        print("key:", key)
        print("")  # blank line to separate article elements

With this much, you can already do simple parsing of XML data.
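Looping over every child is not the only option. When only specific fields are needed, lxml's `iter`, `findall`, `findtext` and `attrib` can pick them out directly. The sketch below assumes a shortened, made-up version of the dblp.xml fragment (`XML_BYTES`) and collects each article into a plain dict; the dict layout is an illustrative choice, not something the original code does.

```python
from lxml import etree

# Shortened, hypothetical version of the dblp.xml fragment used in this article.
XML_BYTES = b"""<dblp>
    <article mdate="2012-11-28" key="journals/entropy/BellucciFMY08">
        <author>Stefano Bellucci</author>
        <author>Sergio Ferrara</author>
        <pages>507-555</pages>
        <year>2008</year>
    </article>
    <article mdate="2013-03-04" key="journals/entropy/Knuth13">
        <author>Kevin H. Knuth</author>
        <pages>698-699</pages>
        <year>2013</year>
    </article>
</dblp>"""

root = etree.fromstring(XML_BYTES)

records = []
for article in root.iter("article"):        # only <article> elements, at any depth
    records.append({
        "key": article.get("key"),
        "mdate": article.attrib["mdate"],   # .attrib behaves like a dict
        "authors": [a.text for a in article.findall("author")],
        "year": article.findtext("year"),   # text of the first <year> child
        "volume": article.findtext("volume", default=""),  # field missing in this sample
    })

for record in records:
    print(record)
```

`findtext` returns only the element's leading `.text`, so it has the same limitation with mixed content as the `.text` attribute itself.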

3. A complete example of parsing XML data

The code below parses the dblp.xml data shown at the beginning of this article.

    #!/usr/bin/env python3
    from lxml import etree  # import lxml's ElementTree API

    tree = etree.parse("dblp.xml")  # parse the XML file into a tree
    root = tree.getroot()           # get the root element of the tree

    for article in root:  # iterate over all children of the root (the article elements here)
        print("Element name:", article.tag)  # .tag gives the name of the element
        for field in article:  # iterate over the article's children (author, title, volume, year, ...)
            print(field.tag, ":", field.text)  # .tag gives the name, .text gives the content
        mdate = article.get("mdate")  # .get("attribute name") returns the value of that attribute
        key = article.get("key")
        print("mdate:", mdate)
        print("key:", key)
        print("")  # blank line to separate article elements

This produces the following output:

    Element name: article
    author : Stefano Bellucci
    author : Sergio Ferrara
    author : Alessio Marrani
    author : Armen Yeranyan
    title : ES
    pages : 507-555
    year : 2008
    volume : 10
    journal : Entropy
    number : 4
    ee : http://dx.doi.org/10.3390/e10040507
    url : db/journals/entropy/entropy10.html#BellucciFMY08
    mdate: 2012-11-28
    key: journals/entropy/BellucciFMY08

    Element name: article
    author : Kevin H. Knuth
    title : None
    pages : 698-699
    year : 2013
    volume : 15
    journal : Entropy
    number : 2
    ee : http://dx.doi.org/10.3390/e15020698
    url : db/journals/entropy/entropy15.html#Knuth13
    mdate: 2013-03-04
    key: journals/entropy/Knuth13

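4. Handling elements that contain both sub-elements and text

As the output above shows, the content of the title element is wrong. The title element contains both sub-elements and text (see below), so plain .text cannot return its full content: .text holds only the text before the first child element, while any text that follows a child is stored in that child's .tail. In the example above, the first article's title came out as just ES, and the second article's title came out as nothing at all (None).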

    <title>ES<sup>2</sup>: A cloud data storage system for supporting both OLTP and OLAP.</title>
    <title><i>Entropy</i> Best Paper Award 2013.</title>
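To see where the missing pieces of these two titles actually live, the short sketch below parses both title elements on their own and prints `.text` of the title together with `.text` and `.tail` of its first child. The variable names `first` and `second` are just labels chosen for this illustration.

```python
from lxml import etree

first = etree.fromstring(
    "<title>ES<sup>2</sup>: A cloud data storage system "
    "for supporting both OLTP and OLAP.</title>"
)
second = etree.fromstring("<title><i>Entropy</i> Best Paper Award 2013.</title>")

for title in (first, second):
    child = title[0]
    print("title.text:", repr(title.text))   # text before the first child, or None
    print("child.tag: ", child.tag)
    print("child.text:", repr(child.text))   # text inside the child element
    print("child.tail:", repr(child.tail))   # text after the child's closing tag
    print()
```

In the second title there is no text before `<i>`, which is why `.text` is `None` there, and the rest of the sentence sits entirely in the tail of `<i>`.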

Since the sub-elements in this example are fairly simple, the quick fix here is to print the sub-elements together with the text. The code is as follows:

    #!/usr/bin/env python3
    from lxml import etree  # import lxml's ElementTree API

    tree = etree.parse("dblp.xml")  # parse the XML file into a tree
    root = tree.getroot()           # get the root element of the tree

    for article in root:  # iterate over all children of the root (the article elements here)
        print("Element name:", article.tag)  # .tag gives the name of the element
        for field in article:  # iterate over the article's children (author, title, volume, year, ...)
            if field.tag == "title":
                # serialize the element's text together with its sub-elements
                print(field.tag, ":", etree.tostring(field, encoding="unicode", pretty_print=False))
            else:
                print(field.tag, ":", field.text)  # .tag gives the name, .text gives the content
        mdate = article.get("mdate")  # .get("attribute name") returns the value of that attribute
        key = article.get("key")
        print("mdate:", mdate)
        print("key:", key)
        print("")  # blank line to separate article elements

The output is as follows:

    Element name: article
    author : Stefano Bellucci
    author : Sergio Ferrara
    author : Alessio Marrani
    author : Armen Yeranyan
    title : <title>ES<sup>2</sup>: A cloud data storage system for supporting both OLTP and OLAP.</title>

    pages : 507-555
    year : 2008
    volume : 10
    journal : Entropy
    number : 4
    ee : http://dx.doi.org/10.3390/e10040507
    url : db/journals/entropy/entropy10.html#BellucciFMY08
    mdate: 2012-11-28
    key: journals/entropy/BellucciFMY08

    Element name: article
    author : Kevin H. Knuth
    title : <title><i>Entropy</i> Best Paper Award 2013.</title>

    pages : 698-699
    year : 2013
    volume : 15
    journal : Entropy
    number : 2
    ee : http://dx.doi.org/10.3390/e15020698
    url : db/journals/entropy/entropy15.html#Knuth13
    mdate: 2013-03-04
    key: journals/entropy/Knuth13

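Of course, it is not hard to see that this is a rather clumsy fix: afterwards, the tags and other unwanted parts of the title still have to be stripped out with string processing. It would be much better to have a simple way to get an element's own text and the text of its sub-elements. The next section shows how to do that.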

5. An elegant way to handle sub-elements and text

Suppose the XML file paper.xml has the following content:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <dblp>
        <article mdate="2002-01-03" key="persons/Codd71a">
            <author>E. F. Codd</author>
            <title>ES<sup>2</sup>: A cloud data storage system for supporting both OLTP and OLAP.</title>
            <journal>IBM Research Report, San Jose, California</journal>
            <volume>RJ909</volume>
            <month>August</month>
            <year>1971</year>
        </article>
        <article mdate="2002-01-03" key="persons/Codd71a">
            <author>E. F. Codd</author>
            <title><i>Entropy</i> Best Paper Award 2013.</title>
            <journal>IBM Research Report, San Jose, California</journal>
            <volume>RJ909</volume>
            <month>August</month>
            <year>1971</year>
            <cdrom>ibmTR/rj909.pdf</cdrom>
            <ee>db/labs/ibm/RJ909.html</ee>
        </article>
    </dblp>

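As you can see, the title field in the file above contains both text of its own and nested child elements. To get the element's text and the text inside its child elements at the same time, it is worth writing a dedicated function for extracting the text of this field. Two concrete implementations follow.

5.1 v1.0

The first idea was to recursively read the text of each element and concatenate the pieces. The code is as follows: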

    # paper2.py
    from lxml import etree

    def node_text(node):
        """Recursively join the .text of a node and its descendants (tail text is not read)."""
        result = node.text.strip() if node.text else ''
        for child in node:
            child_text = node_text(child)
            if child_text:
                result = result + ' %s' % child_text if result else child_text
        return result

    if __name__ == '__main__':
        parser = etree.XMLParser()
        root = etree.parse('paper.xml', parser).getroot()
        for element in root:
            category = element.tag
            for attribute in element:
                if attribute.tag == "title":
                    print("title:", node_text(attribute))
                else:
                    print(attribute.tag + ":", attribute.text.strip())
            print("")

Running it gives the following result:

    $ python paper2.py
    author: E. F. Codd
    title: ES 2
    journal: IBM Research Report, San Jose, California
    volume: RJ909
    month: August
    year: 1971

    author: E. F. Codd
    title: Entropy
    journal: IBM Research Report, San Jose, California
    volume: RJ909
    month: August
    year: 1971
    cdrom: ibmTR/rj909.pdf
    ee: db/labs/ibm/RJ909.html

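Clearly, this method only collects the .text of each element and joins the pieces together. It never reads the text that follows a child element (the child's tail), such as ": A cloud data storage system ..." after the sup element, so it is not what we want. I have no idea what I was thinking at the time, but I actually used it just like that. Looking back: too young, too simple, always naive.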
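To make the gap concrete, here is a hedged sketch of what a recursive version would need to do in order to be complete: besides each child's own text, it also has to append the child's `.tail`. The function name `node_text_with_tail` is made up for this illustration and is not part of the original code.

```python
from lxml import etree

def node_text_with_tail(node):
    """Recursively collect node.text plus, for every child, its text and its tail."""
    parts = [node.text or ""]
    for child in node:
        parts.append(node_text_with_tail(child))  # text inside the child
        parts.append(child.tail or "")            # text that follows the child
    return "".join(parts)

first = etree.fromstring(
    "<title>ES<sup>2</sup>: A cloud data storage system "
    "for supporting both OLTP and OLAP.</title>"
)
second = etree.fromstring("<title><i>Entropy</i> Best Paper Award 2013.</title>")

print(node_text_with_tail(first))
print(node_text_with_tail(second))
```

Unlike v1.0, this sketch does not strip or insert whitespace, so the pieces are joined exactly as they appear in the document.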

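5.2 v2.0

The data had been live for almost a year before I noticed this problem. I could hardly have been more stupid. So the function above, which is supposed to collect all the text inside an XML node, has to be rewritten (in hindsight, putting this functionality into its own function was at least a sensible decision). Here is the current solution: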

    # paper.py
    from lxml import etree

    def node_text(node):
        """Return all text inside node, including the text of nested sub-elements."""
        return "".join(node.itertext())

    if __name__ == '__main__':
        parser = etree.XMLParser()
        root = etree.parse('paper.xml', parser).getroot()
        for element in root:
            category = element.tag
            for attribute in element:
                if attribute.tag == "title":
                    print("title:", node_text(attribute))
                else:
                    print(attribute.tag + ":", attribute.text.strip())
            print("")

Running it gives the following result:

    $ python paper.py
    author: E. F. Codd
    title: ES2: A cloud data storage system for supporting both OLTP and OLAP.
    journal: IBM Research Report, San Jose, California
    volume: RJ909
    month: August
    year: 1971

    author: E. F. Codd
    title: Entropy Best Paper Award 2013.
    journal: IBM Research Report, San Jose, California
    volume: RJ909
    month: August
    year: 1971
    cdrom: ibmTR/rj909.pdf
    ee: db/labs/ibm/RJ909.html
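For comparison, lxml offers at least two other ways to get the same flattened text as `itertext()`: the XPath `string()` function and `etree.tostring` with `method="text"`. The sketch below applies all three to one of the titles; the variable names are chosen just for this illustration, and which variant reads best is largely a matter of taste.

```python
from lxml import etree

title = etree.fromstring("<title><i>Entropy</i> Best Paper Award 2013.</title>")

via_itertext = "".join(title.itertext())
via_xpath = title.xpath("string()")   # XPath string-value of the element
via_tostring = etree.tostring(
    title, method="text", encoding="unicode", with_tail=False
)

print(via_itertext)
print(via_xpath)
print(via_tostring)
```

`with_tail=False` is passed to `tostring` so that any text following the title element itself is left out, which matches what `itertext()` and `string()` return.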

With that, the problem is finally solved. What remains is how to correct the data that is already live, but that is a separate problem.


Original article: http://www.cnblogs.com/Yiutto/p/5387021.html
