
Python Web Scraping with XPath

Date: 2018-02-28 20:14:47


Basic syntax

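 Expression  Description
 nodename    Selects the child elements of the current node with that tag name; use * to match any tag
 /           Selects from the root node; each / step selects direct children only, not deeper descendants (grandchildren, great-grandchildren)
 //          Selects matching nodes anywhere below the current node, regardless of their position, including all descendants
 .           Selects the current node
 ..          Selects the parent of the current node
 @           Selects attributes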
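The expressions in the table can be tried on a small document. The following is a minimal sketch using lxml; the markup is a hypothetical sample made up for illustration, not taken from a real page.

```python
from lxml import etree

# Hypothetical sample document, made up for illustration.
doc = etree.fromstring(
    "<html><body>"
    "<div id='main'><p>first</p><span><p>nested</p></span></div>"
    "</body></html>"
)

# "/" steps select direct children only: the <p> directly under the <div>.
direct = doc.xpath("/html/body/div/p")

# "//" selects matching nodes at any depth: both <p> elements.
anywhere = doc.xpath("//p")

# "." is the current node, ".." is its parent.
nested_p = doc.xpath("//span/p")[0]
same_node = nested_p.xpath(".")[0]
parent = nested_p.xpath("..")[0]

# "@" selects an attribute; the result is a list of strings.
ids = doc.xpath("//div/@id")

print([p.text for p in direct])    # ['first']
print([p.text for p in anywhere])  # ['first', 'nested']
print(parent.tag, ids)             # span ['main']
```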

@ attributes

XPath locates nodes in the DOM tree by path.

Attributes are selected with the @ symbol.

<a rel="nofollow" class="external text" href="http://google.ac">goole<wbr/>.ac</a>

rel, class and href are all attributes. The element can be selected with "//*[@class='external text']".
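As a minimal sketch, the anchor element above can be parsed with lxml and queried both for the element itself and for its attribute values. Parsing the fragment with etree.HTML is a choice made here for illustration.

```python
from lxml import etree

# The anchor element shown above, parsed as an HTML fragment.
snippet = '<a rel="nofollow" class="external text" href="http://google.ac">goole<wbr/>.ac</a>'
tree = etree.HTML(snippet)

# Select the element by the full value of its class attribute.
links = tree.xpath("//*[@class='external text']")

# Select attribute values directly; each result is a list of strings.
hrefs = tree.xpath("//a/@href")
rels = tree.xpath("//a/@rel")

print(links[0].tag)  # a
print(hrefs, rels)   # ['http://google.ac'] ['nofollow']
```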

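The = operator requires the attribute value to match exactly. Use the contains() function for a partial match. For example,

"//*[contains(@class, 'external')]" matches the element above, whereas

"//*[@class='external']" does not.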

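The difference between = and contains() can be checked against the same anchor element. This is a small sketch rather than part of the original demo; note that contains() is a plain substring test, so even a fragment of a class name matches.

```python
from lxml import etree

tree = etree.HTML(
    '<a rel="nofollow" class="external text" href="http://google.ac">goole<wbr/>.ac</a>'
)

# "=" compares against the whole attribute value.
exact_full = tree.xpath("//*[@class='external text']")
exact_partial = tree.xpath("//*[@class='external']")

# contains() succeeds when the value contains the given substring.
partial = tree.xpath("//*[contains(@class, 'external')]")
fragment = tree.xpath("//*[contains(@class, 'ext')]")

print(len(exact_full), len(exact_partial))  # 1 0
print(len(partial), len(fragment))          # 1 1
```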
Operators

The and and or operators:

  Select elements whose tag is p, span or h1

soup = tree.xpath('//td[@class="editor bbsDetailContainer"]//*[self::p or self::span or self::h1]')

  Select elements whose class is editor or tag

soup = tree.xpath('//td[@class="editor" or @class="tag"]')
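A minimal sketch of both operators follows. The table markup is hypothetical, loosely modelled on the selectors above, and the last query adds an and example that is not in the original text.

```python
from lxml import etree

# Hypothetical markup, loosely modelled on the selectors above.
page = """
<table><tr>
  <td class="editor"><p>para</p><span>span</span><h1>title</h1><div>skip</div></td>
  <td class="tag">tags</td>
  <td class="other">other</td>
</tr></table>
"""
tree = etree.HTML(page)

# or between self:: tests: any descendant that is a p, span or h1.
nodes = tree.xpath('//td[@class="editor"]//*[self::p or self::span or self::h1]')

# or between attribute tests: class is exactly "editor" or exactly "tag".
cells = tree.xpath('//td[@class="editor" or @class="tag"]')

# and requires both conditions to hold for the same element.
both = tree.xpath('//td[@class="editor" and p]')

print([n.tag for n in nodes])           # ['p', 'span', 'h1']
print([c.get("class") for c in cells])  # ['editor', 'tag']
print(len(both))                        # 1
```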

 Demo

import lxml.html
from lxml import etree
from bs4 import BeautifulSoup

# Load a saved copy of the product page, decoded as UTF-8.
with open('jd.com_2131674.html', 'r', encoding='utf-8') as f:
    content = f.read()

tree = etree.HTML(content)


def banner(title):
    # Print a header before each result.
    print('--------------------------------------------')
    print(title)
    print('--------------------------------------------')


# Single quotes inside the expression behave the same as double quotes.
banner('# different quote //*[@class="p-price J-p-2131674"]')
print(tree.xpath("//*[@class='p-price J-p-2131674']"))
print('')

# "=" compares the whole attribute value, so a single class name
# is not expected to match an element that carries two classes.
banner('# partial match ' + "//*[@class='J-p-2131674']")
print(tree.xpath("//*[@class='J-p-2131674']"))
print('')

banner('# exactly match class string ' + '//*[@class="p-price J-p-2131674"]')
print(tree.xpath('//*[@class="p-price J-p-2131674"]'))
print('')

banner('# use contains ' + "//*[contains(@class, 'J-p-2131674')]")
print(tree.xpath("//*[contains(@class, 'J-p-2131674')]"))
print('')

banner('# specify tag name ' + "//strong[contains(@class, 'J-p-2131674')]")
print(tree.xpath("//strong[contains(@class, 'J-p-2131674')]"))
print('')

# cssselect() relies on the cssselect package, installed separately from lxml.
htree = lxml.html.fromstring(content)
banner('# css selector with tag ' + "cssselect('strong.J-p-2131674')")
print(htree.cssselect('strong.J-p-2131674'))
print('')

banner('# css selector without tag, partial match ' + "cssselect('.J-p-2131674')")
elements = htree.cssselect('.J-p-2131674')
print(elements)
print('')

banner('# attrib and text')
for element in tree.xpath("//strong[contains(@class, 'J-p-2131674')]"):
    print(element.text)
    print(element.attrib)
print('')

banner('########## use BeautifulSoup ##############')
print('# loading content to BeautifulSoup')
soup = BeautifulSoup(content, 'html.parser')
print('# loaded, show result')
# BeautifulSoup treats class as multi-valued, so one class name is enough.
print(soup.find(attrs={'class': 'J-p-2131674'}).text)
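One caveat about the contains() queries in the demo: contains() tests for a substring, so it would also match a longer class name that merely starts with J-p-2131674, whereas the CSS selector .J-p-2131674 compares whole class names. The sketch below shows a common XPath 1.0 idiom for whole-token matching. The markup and the second class name are hypothetical and are not taken from the JD page.

```python
from lxml import etree

# Hypothetical markup: the second class name merely starts with the first.
tree = etree.HTML(
    '<div>'
    '<strong class="p-price J-p-2131674">10.00</strong>'
    '<strong class="p-price J-p-21316749">20.00</strong>'
    '</div>'
)

# Plain contains() is a substring test, so it matches both elements.
loose = tree.xpath("//strong[contains(@class, 'J-p-2131674')]")

# Padding both the attribute value and the wanted name with spaces
# compares whole class tokens instead.
strict = tree.xpath(
    "//strong[contains(concat(' ', normalize-space(@class), ' '), ' J-p-2131674 ')]"
)

print([e.text for e in loose])   # ['10.00', '20.00']
print([e.text for e in strict])  # ['10.00']
```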

 


Original article: https://www.cnblogs.com/wuwen19940508/p/8485409.html
