Scraping bank names and official website URLs with Python

Date: 2018-10-09 21:47:07

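Scrape the name and official website URL of every bank (ignoring banks that have no official website) and write them to a database.
Target URL: http://www.cbrc.gov.cn/chinese/jrjg/index.html
(This site has anti-crawler measures, so the crawler has to present itself as a browser.)
For more on making a crawler look like a browser, see this article:
https://blog.csdn.net/a877415861/article/details/79468878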
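
As a quick offline illustration of the idea, the sketch below builds one request without and one with an explicit User-Agent header and inspects them. Nothing is sent over the network. The names URL, USER_AGENT, plain and disguised are chosen here for illustration only.

```python
from urllib import request

URL = 'http://www.cbrc.gov.cn/chinese/jrjg/index.html'
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'

# Default request: no User-Agent is stored on the object. When it is sent,
# urllib adds "Python-urllib/<version>", which a simple filter can reject.
plain = request.Request(URL)

# Disguised request: an explicit User-Agent header replaces that default.
disguised = request.Request(URL, headers={'User-Agent': USER_AGENT})

# Request stores header names via str.capitalize(), hence 'User-agent'.
print(plain.has_header('User-agent'))
print(disguised.get_header('User-agent'))
```

Whether a browser-like User-Agent is enough depends on the site; stricter checks (cookies, JavaScript challenges) would need more than this.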

Here is the code:

import re
from urllib import request
from urllib.request import urlopen
import pymysql as mysql

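# MySQL connection settings (adjust to your environment)
u = 'root'      # user
p = 'root'      # password
d = 'python'    # database name
# bank_info must already exist with two columns, filled in the order (name, url)
sql = 'insert into bank_info values(%s,%s)'

url = 'http://www.cbrc.gov.cn/chinese/jrjg/index.html'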

# Steps to make the crawler look like a browser:

# 1. Define the User-Agent string of a real browser
myAgent = "Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0"    # taken from my current Firefox

# 2. Put the User-Agent into the request headers
myrequest = request.Request(url, headers={'User-Agent': myAgent})

# 3. Open the page and read its content
content = urlopen(myrequest).read().decode('utf-8')

# Target element: <a href="http://www.icbc.com.cn/" target="_blank" style="color:#08619D">中国工商银行</a>
# Groups: (1) URL and (2) name for banks with a website; (3) name for banks without one
pattern = r'<li style="margin.*inline;">\s*<a href="(http://.+?)" target="_blank" style="color:#08619D">\s*?([\S]*?)\s*?</a>|<li style="margin.*inline;">\s*?([\S]*?)\s*?</li>'

def main():
    res = re.findall(pattern, content)
    # [('http://www.hsbc.com.cn', '汇丰中国', ''), ...('', '', '蒙特利尔银行（中国）有限公司')...]
    conn = mysql.connect(user=u, passwd=p, db=d, charset='utf8', autocommit=True)
    try:
        with conn.cursor() as cur:
            for info in res:
                if info[0]:
                    info = info[1::-1]     # has a website: (name, url)
                else:
                    # no website: stored as (name, ''); use `continue` here
                    # instead to skip such banks entirely
                    info = info[:-3:-1]
                # autocommit=True, so no explicit commit is needed
                cur.execute(sql, (info[0], info[1]))
    finally:
        conn.close()

if __name__ == "__main__":
    main()
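
To see what the regular expression returns and how the two slices reorder each tuple, here is a small offline sketch. The sample HTML is made up for illustration: the style value "margin:0;display:inline;" is an assumption, not copied from the real page, and normalize is a hypothetical helper that spells out what the slices in main() do.

```python
import re

# Same expression as in the article, written with plain ASCII quotes.
pattern = (
    r'<li style="margin.*inline;">\s*<a href="(http://.+?)" target="_blank" '
    r'style="color:#08619D">\s*?([\S]*?)\s*?</a>'
    r'|<li style="margin.*inline;">\s*?([\S]*?)\s*?</li>'
)

# Hypothetical page fragment: one bank with a website, one without.
sample_html = (
    '<li style="margin:0;display:inline;">'
    '<a href="http://www.icbc.com.cn/" target="_blank" '
    'style="color:#08619D">中国工商银行</a></li>\n'
    '<li style="margin:0;display:inline;">蒙特利尔银行（中国）有限公司</li>\n'
)

# With three groups, findall returns 3-tuples; unmatched groups are ''.
matches = re.findall(pattern, sample_html)


def normalize(info):
    """Reorder a findall tuple into (name, url)."""
    url, name_with_site, name_without_site = info
    if url:
        return (name_with_site, url)      # same as info[1::-1]
    return (name_without_site, '')        # same as info[:-3:-1]


rows = [normalize(m) for m in matches]
for row in rows:
    print(row)
```

Because `.` does not match a newline, the greedy `.*` in the pattern stays within one line; if the real page puts several `<li>` elements on a single line, some entries could be missed.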

Result:
[image]
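
To try the storage step without a MySQL server, the sketch below uses Python's built-in sqlite3 module as a stand-in. This is a substitute for illustration, not the setup used above: the two-column table definition is an assumption (the article does not show the real one), and sqlite3 uses "?" placeholders where pymysql uses "%s".

```python
import sqlite3

# In-memory database; the (name, url) layout is an assumed schema.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE bank_info (name TEXT, url TEXT)')

# Rows already reordered to (name, url), as main() does before inserting.
rows = [
    ('汇丰中国', 'http://www.hsbc.com.cn'),
    ('蒙特利尔银行（中国）有限公司', ''),
]

# Parameterized insert: values are passed separately from the SQL text.
conn.executemany('INSERT INTO bank_info VALUES (?, ?)', rows)
conn.commit()

total = conn.execute('SELECT COUNT(*) FROM bank_info').fetchone()[0]
with_site = conn.execute(
    "SELECT name, url FROM bank_info WHERE url <> '' ORDER BY name"
).fetchall()
conn.close()

print(total)
print(with_site)
```

Filtering on a non-empty url at query time is one way to ignore banks without a website while still keeping their names in the table.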

Original article: http://blog.51cto.com/13885935/2296512
