Web Scraping in Practice 1

Date: 2019-08-28 19:54:56

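re regular expressions:

re.findall(pattern, text, flags)  # e.g. flags=re.S, so that "." also matches newlines
.*?    ----> non-greedy match, used to skip over content you do not need
(.*?)  ----> non-greedy capture group, used to extract the content you want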
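
As a minimal sketch of the two idioms above, the snippet below runs re.findall on a small made-up HTML string. The markup, the example.com URLs, and the variable names are assumptions chosen for illustration, not taken from a real page. It also shows what changes when re.S is left out.

```python
import re

# Hypothetical multi-line HTML fragment (made up for illustration)
html = '''<div class="item">
  <a href="https://example.com/1">
    <span class="title">First</span>
  </a>
</div>
<div class="item">
  <a href="https://example.com/2">
    <span class="title">Second</span>
  </a>
</div>'''

# .*? lazily skips unwanted text; (.*?) captures the part we want
pattern = '<div class="item">.*?<a href="(.*?)">.*?<span class="title">(.*?)</span>'

# With re.S, "." also matches newlines, so the pattern can span several lines
with_dotall = re.findall(pattern, html, re.S)

# Without re.S, "." stops at line breaks, so nothing matches in this fragment
without_dotall = re.findall(pattern, html)

print(with_dotall)
print(without_dotall)
```

When the pattern contains more than one capture group, re.findall returns a list of tuples, one tuple per match, which is why the crawler later indexes each result by position.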

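Using json:

json         ----> a language-independent data interchange format (the json module is part of Python's standard library)
json.dump()  ----> serializes Python data as JSON and writes it to a file object
json.loads() ----> parses a JSON string and converts it into Python data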
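
A short round trip may make the direction of each call clearer. This is only a sketch: the movie dictionary and the temporary file name are made up, and json.dumps and json.load are shown alongside the two functions named above because they are their string-based and file-based counterparts.

```python
import json
import os
import tempfile

# Hypothetical record, made up for illustration
movie = {'name': 'Example Movie', 'rating': 9.5, 'tags': ['drama', 'classic']}

# json.dumps: Python data -> JSON string
text = json.dumps(movie, ensure_ascii=False)

# json.loads: JSON string -> Python data
restored = json.loads(text)

# json.dump / json.load do the same conversions through a file object
path = os.path.join(tempfile.gettempdir(), 'movie_example.json')
with open(path, 'w', encoding='utf-8') as f:
    json.dump(movie, f, ensure_ascii=False)
with open(path, encoding='utf-8') as f:
    from_file = json.load(f)

print(restored == movie, from_file == movie)
```

Passing ensure_ascii=False keeps non-ASCII characters such as Chinese titles readable in the output instead of escaping them.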

Scraping multiple pages:

import re

import requests

# Assumption: Douban tends to reject the default requests User-Agent,
# so a browser-like header is sent; adjust or remove it as needed.
headers = {'User-Agent': 'Mozilla/5.0'}

# Captures (detail-page URL, title, rating, number of ratings) for each movie.
# re.S lets "." match newlines, because each entry spans several lines of HTML.
pattern = re.compile(
    '<div class="item">.*?<a href="(.*?)">.*?<span class="title">(.*?)</span>'
    '.*?<span class="rating_num" property="v:average">(.*?)</span>'
    '.*?<span>(.*?)人评价</span>',
    re.S)

with open('douban.txt', 'a', encoding='utf-8') as f:
    # 0. Build the URL of every list page: start = 0, 25, 50, ..., 225
    for start in range(0, 250, 25):
        url = f'https://movie.douban.com/top250?start={start}&filter='

        # 1. Send the request
        response = requests.get(url=url, headers=headers)

        # 2. Extract each movie's detail-page URL, title, rating, and rating count
        movie_list = pattern.findall(response.text)

        # 3. Write one line per movie
        for movie_url, movie_name, movie_point, movie_count in movie_list:
            f.write(movie_url + '---' + movie_name + '---' + movie_point + '---' + movie_count + '\n')

print('Data written successfully; crawler finished.')
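
The paging and parsing steps can also be tried without touching the network. The sketch below is not the author's code: page_urls and parse_movies are hypothetical helper names, and the sample HTML is a made-up fragment that merely imitates the structure the regular expression expects.

```python
import re

# Same pattern as the crawler: URL, title, rating, number of ratings
PATTERN = re.compile(
    '<div class="item">.*?<a href="(.*?)">.*?<span class="title">(.*?)</span>'
    '.*?<span class="rating_num" property="v:average">(.*?)</span>'
    '.*?<span>(.*?)人评价</span>',
    re.S)


def page_urls(pages=10, page_size=25):
    """Return the list-page URLs with start = 0, 25, 50, ..."""
    return [f'https://movie.douban.com/top250?start={i * page_size}&filter='
            for i in range(pages)]


def parse_movies(html):
    """Turn the HTML of one list page into a list of dicts."""
    return [
        {'url': url, 'name': name, 'point': point, 'count': count}
        for url, name, point, count in PATTERN.findall(html)
    ]


# Made-up sample imitating one list entry (not a real response)
sample = '''
<div class="item">
  <a href="https://example.com/subject/1/">
    <span class="title">Sample Movie</span>
  </a>
  <span class="rating_num" property="v:average">9.0</span>
  <span>12345人评价</span>
</div>
'''

movies = parse_movies(sample)
urls = page_urls()
print(movies)
print(urls[0], urls[-1])
```

Keeping URL generation and parsing in separate functions makes each step checkable on its own, and the page counter can no longer be overwritten by code that runs later in the loop.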


Original article: https://www.cnblogs.com/shaozheng/p/11426076.html
