Web scraping: the selenium module

Posted: 2019-12-28 21:20:06

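Question: how does the selenium module relate to web scraping?

It makes it easy to retrieve data that a site loads dynamically.
It makes it easy to simulate a login.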

What is the selenium module?

selenium is a module for browser automation: it drives a real browser from code.

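How to use selenium

Install the package: pip install selenium
Download the driver program for your browser
- Download location (Chrome): http://chromedriver.storage.googleapis.com/index.html
- Mapping between driver versions and browser versions: http://blog.csdn.net/huilan_same/article/details/51896672
Instantiate a browser object: bro = webdriver.Chrome(executable_path='./chromedriver.exe')
Write the browser automation code
- Send a request: get(url), e.g. bro.get('https://www.taobao.com/')
- Locate an element: the find family of methods, e.g. search_input = bro.find_element_by_id('q')
- Interact with an element: send_keys('xxx'), e.g. search_input.send_keys('Iphone')
- Click the search button:
```
btn = bro.find_element_by_css_selector('.btn-search')
btn.click()
```
- Execute JavaScript: execute_script('jsCode'), e.g. bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
- Get the page source: html_source = bro.page_source; this property returns the HTML source of the page currently loaded in the browser
- Go back and forward: back(), forward(), e.g. bro.back()
- Close the browser: quit(), e.g. bro.quit()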
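
The steps above can be strung together roughly as follows. This is only a sketch under a few assumptions: the helper names search_taobao and make_chrome are made up for illustration, the selectors 'q' and '.btn-search' are simply the ones quoted above and may no longer match the live site, and make_chrome assumes a Selenium 4 installation where the driver path is passed through a Service object. The helper uses find_element(by, value), which exists in both Selenium 3 and Selenium 4; to my knowledge, newer Selenium 4 releases no longer ship the find_element_by_* shortcuts or the executable_path argument.

```python
def search_taobao(bro, keyword):
    """Run the steps listed above on an existing driver and return the page HTML."""
    bro.get('https://www.taobao.com/')

    # 'id' and 'css selector' are the string values of By.ID and By.CSS_SELECTOR
    search_input = bro.find_element('id', 'q')
    search_input.send_keys(keyword)

    btn = bro.find_element('css selector', '.btn-search')
    btn.click()

    # Scroll to the bottom so that lazily loaded content gets requested
    bro.execute_script('window.scrollTo(0, document.body.scrollHeight)')
    return bro.page_source


def make_chrome(driver_path='./chromedriver.exe'):
    """Create a Chrome driver in the Selenium 4 style (assumed environment)."""
    # Imported lazily so search_taobao can be tried out without selenium installed
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    return webdriver.Chrome(service=Service(executable_path=driver_path))
```

A typical call would be bro = make_chrome(), then html = search_taobao(bro, 'Iphone'), and finally bro.quit() in a finally block so the browser is closed even if a step fails.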

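Handling iframes with selenium

If the element you want to locate sits inside an iframe, you must first switch into that iframe with switch_to.frame(id).
Action chains: import the class with from selenium.webdriver import ActionChains

Create an action chain object: action = ActionChains(bro)
click_and_hold(div): press the mouse button on the element and keep it held down
move_by_offset(x, y): move the mouse x pixels horizontally and y pixels vertically from its current position
perform(): execute the queued actions immediately
release(): release the held mouse button; like every other action, it only takes effect once perform() is called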

Example: iframe + action chain

from selenium import webdriver
from time import sleep
from selenium.webdriver import ActionChains     # the action chain class

bro = webdriver.Chrome(executable_path='./chromedriver.exe')
bro.get('https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable')

# The target element is inside an iframe, so switch into it before locating
bro.switch_to.frame('iframeResult')  # changes the scope used for element lookup
div = bro.find_element_by_id('draggable')

# Build the action chain
action = ActionChains(bro)
# Press and hold the mouse button on the element
action.click_and_hold(div)

for i in range(5):
    # move_by_offset(x, y): x is horizontal, y is vertical
    # perform() executes the queued actions immediately
    action.move_by_offset(17, 0).perform()
    sleep(0.5)

# Release the mouse button; release() only queues the action, so perform it
action.release().perform()
bro.quit()
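
The loop above hard-codes five moves of 17 pixels. As a hedged variation, the sketch below splits an arbitrary distance into steps, queues the whole drag on the chain, and runs it with a single perform(). The names split_offset and drag_horizontally are hypothetical helpers, not part of selenium; the chain methods they call (click_and_hold, move_by_offset, pause, release, perform) are the ones ActionChains provides.

```python
def split_offset(total, steps):
    """Split a pixel distance into `steps` integer moves that add up to `total`."""
    if steps <= 0:
        raise ValueError('steps must be positive')
    base, extra = divmod(total, steps)
    # The first `extra` moves take one additional pixel so nothing is lost
    return [base + 1 if i < extra else base for i in range(steps)]


def drag_horizontally(action, element, total_x, steps=5, pause=0.5):
    """Queue a stepwise horizontal drag on an ActionChains object and run it once."""
    action.click_and_hold(element)
    for dx in split_offset(total_x, steps):
        # pause() is queued inside the chain, so no sleep() is needed between moves
        action.move_by_offset(dx, 0).pause(pause)
    action.release()
    action.perform()
```

With the objects from the example, the call would be drag_horizontally(ActionChains(bro), div, 85), which queues the same five 17-pixel moves.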

Headless browser (no visible window)

Basic usage 1: running without a visible window

# Older style: pass the options through the chrome_options parameter
from selenium import webdriver
from time import sleep
from selenium.webdriver.chrome.options import Options
# Run Chrome without a visible window
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
bro = webdriver.Chrome(executable_path='chromedriver.exe', chrome_options=chrome_options)


# Newer style: the chrome_options parameter is deprecated, use options instead
from selenium import webdriver
from time import sleep
from selenium.webdriver import ChromeOptions

option = ChromeOptions()
# Run Chrome without a visible window
option.add_argument('--headless')
option.add_argument('--disable-gpu')
bro = webdriver.Chrome(executable_path='chromedriver.exe', options=option)

Basic usage 2: countering anti-scraping detection

from selenium import webdriver
from time import sleep
from selenium.webdriver import ChromeOptions

option = ChromeOptions()
# Drop the automation switch so the site is less likely to detect selenium
option.add_experimental_option('excludeSwitches', ['enable-automation'])
bro = webdriver.Chrome(executable_path='chromedriver.exe', options=option)
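
The two option sets above can be combined on one options object. The helper below is a small sketch; build_options is a hypothetical name, and it only calls add_argument and add_experimental_option, the same two methods used above, on whatever options object it is given.

```python
def build_options(options, headless=True, hide_automation=True):
    """Apply the settings from this section to a ChromeOptions-like object."""
    if headless:
        # No visible window
        options.add_argument('--headless')
        options.add_argument('--disable-gpu')
    if hide_automation:
        # Remove the switch that marks the browser as automated
        options.add_experimental_option('excludeSwitches', ['enable-automation'])
    return options
```

With selenium installed, option = build_options(ChromeOptions()) produces an object that can be passed as options=option when creating the browser.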

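Basic usage of Chaojiying

Chaojiying (超级鹰) is an online captcha-solving service.

# The code below is the sample code provided by Chaojiying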
import requests
from hashlib import md5

class Chaojiying_Client(object):

    def __init__(self, username, password, soft_id):
        self.username = username
        password =  password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: image bytes
        codetype: captcha type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files, headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: ID of the image whose answer is being reported as wrong
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()


# Replace the placeholders with your own user name, password and software ID
chaojiying = Chaojiying_Client('your_username', 'your_password', 'your_soft_id')
with open('12306.jpg', 'rb') as f:
    im = f.read()
print(chaojiying.PostPic(im, 9004)['pic_str'])
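
Indexing ['pic_str'] directly hides failures such as a wrong password or an empty balance. The sketch below checks the result first. It assumes the service answers with a JSON object containing err_no, err_str, pic_id and pic_str, with err_no equal to 0 on success; that layout is an assumption here and should be checked against the Chaojiying documentation. extract_answer is a hypothetical helper name.

```python
def extract_answer(result):
    """Return (pic_id, pic_str) from a PostPic() result dict, or raise on failure.

    Assumes the response carries 'err_no', 'err_str', 'pic_id' and 'pic_str'.
    """
    if result.get('err_no') != 0:
        raise RuntimeError('recognition failed: %s' % result.get('err_str'))
    return result['pic_id'], result['pic_str']
```

Keeping pic_id around is useful because ReportError(im_id) in the sample class takes exactly that ID when an answer turns out to be wrong.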

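Simulated login to 12306

Use Chaojiying to solve the captcha shown when logging in to 12306.

Chaojiying: http://www.chaojiying.com/about.html

How to use Chaojiying

Register as a regular user
Log in as a regular user
Check your points balance and top up
Create a software entry to obtain its ID
Download the sample code

Coding steps for the simulated 12306 login

Open the login page with selenium
Take a screenshot of the page that selenium currently has open
Crop the captcha region out of that screenshot
- Benefit: the captcha image is guaranteed to be the one that belongs to this login attempt.
Use Chaojiying to recognize the captcha image and return the click coordinates
Use an action chain to click at those coordinates
Enter the username and password, then click the login button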
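
Two of the steps above are plain arithmetic and can be sketched without a browser: working out the crop region of the captcha inside the full-page screenshot, and turning the recognition result into click points. The helper names are hypothetical. crop_box takes the dicts that selenium exposes as element.location and element.size; the scale parameter is an assumed correction for displays where the screenshot is larger than the CSS pixel size. parse_points assumes the answer looks like 'x1,y1|x2,y2', which is an assumption about the 9004 captcha type.

```python
def crop_box(location, size, scale=1.0):
    """Return (left, top, right, bottom) of an element inside a page screenshot."""
    left = int(location['x'] * scale)
    top = int(location['y'] * scale)
    right = int((location['x'] + size['width']) * scale)
    bottom = int((location['y'] + size['height']) * scale)
    return left, top, right, bottom


def parse_points(pic_str):
    """Turn 'x1,y1|x2,y2' into [(x1, y1), (x2, y2)] (assumed answer format)."""
    points = []
    for pair in pic_str.split('|'):
        x, y = pair.split(',')
        points.append((int(x), int(y)))
    return points
```

The box can be handed to an image library to cut the captcha out of the file written by bro.save_screenshot(); with Pillow, for example, that would be Image.open(path).crop(box). Each parsed point is an offset inside the captcha image, so an action chain can click it with move_to_element_with_offset(element, x, y) followed by click() and perform(). Depending on the Selenium version, that offset may be measured from the element's top-left corner or from its center, so verify this before relying on it.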

Original article: https://www.cnblogs.com/liuxu2019/p/12112698.html
