XML

Posted: 2019-08-14 18:43:44

Tags: project, author, write, file, tac, field, extract, str, settings

On Sina Blog, clicking Subscribe brings up a prompt showing that the subscription address is a URL in XML format.

Blog address:

http://blog.sina.com.cn/u/3980770831

The XML address returns the blog's feed as an RSS document.

XML address:

http://blog.sina.com.cn/rss/3980770831.xml
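
For orientation, the spider in this post relies on the feed having the usual RSS layout: an rss root, a channel, and one item per post with title, link, and author children. The following is a minimal sketch of that layout using only Python's standard library. The sample document and its values are made up for illustration and are not the actual content of the Sina feed, and list_titles is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like an RSS 2.0 feed; not the real Sina data.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/post/1</link>
      <author>alice</author>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/post/2</link>
      <author>alice</author>
    </item>
  </channel>
</rss>
"""


def list_titles(xml_text):
    """Return the title of every item under rss/channel."""
    root = ET.fromstring(xml_text)  # root is the <rss> element
    return [item.findtext("title") for item in root.findall("./channel/item")]


if __name__ == "__main__":
    print(list_titles(SAMPLE_RSS))
```

The path ./channel/item here plays the same role as the XPath expression //rss/channel/item used in the spider.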

Create a project:

$ scrapy startproject blogxml

Define the fields to be stored in the project's items.py:

import scrapy

class BlogxmlItem(scrapy.Item):
    # Article title
    title = scrapy.Field()
    # Link to the article
    link = scrapy.Field()
    # Author of the article
    author = scrapy.Field()

Generate a spider from the xmlfeed template (the -l option lists the available templates):

$ cd blogxml
$ scrapy genspider -l
$ scrapy genspider -t xmlfeed blogxmlspider sina.com.cn

Open the generated blogxmlspider.py and change the code as follows:

# -*- coding: utf-8 -*-
from scrapy.spiders import XMLFeedSpider
from blogxml.items import BlogxmlItem

class BlogxmlspiderSpider(XMLFeedSpider):
    name = 'blogxmlspider'
    allowed_domains = ['sina.com.cn']
    # URL of the XML feed
    start_urls = ['http://blog.sina.com.cn/rss/3980770831.xml']
    iterator = 'iternodes' # you can change this; see the docs
    # Start iterating from the first (root) node, rss
    itertag = 'rss' # change it accordingly

    def parse_node(self, response, node):
        i = BlogxmlItem()
        # Use XPath expressions to extract the data and store it in the Item
        i['title'] = node.xpath("//rss/channel/item/title/text()").extract()
        i['link'] = node.xpath("//rss/channel/item/link/text()").extract()
        i['author'] = node.xpath("//rss/channel/item/author/text()").extract()
        # Each field holds a list; print the entries article by article
        for j in range(len(i['title'])):
            print("Article " + str(j + 1))
            print("Title:")
            print(i['title'][j])
            print("Link:")
            print(i['link'][j])
            print("Author:")
            print(i['author'][j])
            print("------------------------")
        return i
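
XMLFeedSpider calls parse_node once for each node matching itertag, so with itertag set to rss the method runs once and the three XPath queries return parallel lists. Indexing those lists by position assumes every item carries all three fields. If one item lacked an author, the lists would differ in length and the entries would no longer line up. A common alternative is to work per item (in Scrapy, by setting itertag to item and using XPaths relative to the node). The sketch below illustrates the per-item idea with the standard library only, so it is not the author's spider; the sample feed and the name extract_items are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed: the second item has no <author> element.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <item>
      <title>First post</title>
      <link>http://example.com/post/1</link>
      <author>alice</author>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/post/2</link>
    </item>
  </channel>
</rss>
"""


def extract_items(xml_text):
    """Build one dict per <item>, reading fields relative to that item."""
    root = ET.fromstring(xml_text)
    records = []
    for item in root.iter("item"):
        records.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # A missing field becomes "" instead of shifting later entries.
            "author": item.findtext("author", default=""),
        })
    return records


if __name__ == "__main__":
    for number, record in enumerate(extract_items(SAMPLE_RSS), start=1):
        print("Article", number, record)
```

Because each record is assembled from a single item node, a missing field affects only that record.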

Create a main.py with the following content:

from scrapy import cmdline
cmdline.execute("scrapy crawl blogxmlspider".split())

Run main.py directly from inside the project directory (the one containing scrapy.cfg) to start the crawl.
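
cmdline.execute takes the same argument list that the scrapy command would receive on the command line, which is why main.py splits the command string on whitespace. As a small sketch, a helper (the name build_crawl_argv is hypothetical) can build that list and optionally append Scrapy's -o option to export the scraped items to a file. The output file name items.json is an assumption for illustration.

```python
def build_crawl_argv(spider_name, output_path=None):
    """Return the argument list for 'scrapy crawl <spider_name>'."""
    argv = ["scrapy", "crawl", spider_name]
    if output_path is not None:
        # -o tells Scrapy to export scraped items to the given file.
        argv += ["-o", output_path]
    return argv


if __name__ == "__main__":
    # Passing this list to scrapy.cmdline.execute would start the crawl;
    # here it is only printed so the sketch runs without Scrapy installed.
    print(build_crawl_argv("blogxmlspider", "items.json"))
```

In main.py the resulting list would take the place of the split string passed to cmdline.execute.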

Original article: https://www.cnblogs.com/dalton/p/11353878.html