Python--34 Web Crawlers

Time: 2017-09-14 18:48:50


How Python accesses the internet

　　URL + lib --> urllib

The general format of a URL is

　　protocol://hostname[:port]/path/[;parameters][?query]#fragment

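A URL consists of three parts

　　The first part is the protocol: http, https, ftp, file, ed2k, ...

　　The second part is the domain name or IP address of the server that hosts the resource (sometimes a port number must be included; every transfer protocol has a default port, e.g. 80 for http)

　　The third part is the specific location of the resource, such as a directory or file name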
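The parts described above can be inspected with urllib.parse.urlparse from the standard library. This is a minimal sketch; the URL below is made up for illustration and does not come from the original notes.

```python
from urllib.parse import urlparse

# Hypothetical URL, chosen only so that every component is present.
url = 'http://www.example.com:8080/docs/index.html;v=1?lang=en#top'
parts = urlparse(url)

print(parts.scheme)    # protocol
print(parts.hostname)  # host name (domain name or IP address)
print(parts.port)      # explicit port as an int, or None if omitted
print(parts.path)      # location of the resource on the server
print(parts.params)    # parameters after ';'
print(parts.query)     # query string after '?'
print(parts.fragment)  # fragment after '#'
```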

urllib contains four modules

　　urllib.request for opening and reading URLs

　　urllib.error containing the exceptions raised by urllib.request

　　urllib.parse for parsing URLs

　　urllib.robotparser for parsing robots.txt files

　　　　urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False)

　　　　Open the URL url, which can be either a string or a Request object.

>>> import urllib.request
>>> response = urllib.request.urlopen('http://www.weparts.net')
>>> html = response.read()
>>> print(html.decode('utf-8'))
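A request like the one above can fail, for instance when the host is unreachable or the server answers with an error status. The sketch below wraps urlopen() with a timeout and the exceptions from urllib.error. The helper name fetch_text and its defaults are assumptions made for this example, and the demo uses a data: URL so that it runs without network access.

```python
import urllib.error
import urllib.request


def fetch_text(url, timeout=10, encoding='utf-8'):
    """Return the decoded body of url, or None if the request fails."""
    try:
        # The with-block closes the response once the body has been read.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode(encoding)
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status such as 404.
        print('HTTP error:', e.code, e.reason)
    except urllib.error.URLError as e:
        # No usable answer: unknown host, refused connection, unknown scheme...
        print('Failed to reach the server:', e.reason)
    return None


# A data: URL carries its own content, so this demo needs no network.
print(fetch_text('data:text/plain;charset=utf-8,hello%20crawler'))
```

HTTPError is a subclass of URLError, which is why it has to be caught first.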

Hands-on practice

import urllib.request
# Pass the URL string directly to urlopen()
response = urllib.request.urlopen('http://placekitten.com/g/500/600')
cat_img = response.read()
# Write the raw bytes in binary mode
with open('cat_500_600.jpg', 'wb') as f:
    f.write(cat_img)

import urllib.request
# Equivalent: build a Request object first, then pass it to urlopen()
req = urllib.request.Request('http://placekitten.com/g/500/600')
response = urllib.request.urlopen(req)
cat_img = response.read()
with open('cat_500_600.jpg', 'wb') as f:
    f.write(cat_img)
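Both listings above read the entire image into memory before writing it out. For larger files, one possible variation is to copy the response to disk in chunks with shutil.copyfileobj. The helper name download is an assumption for this sketch, and the demo again uses a data: URL so that it runs offline.

```python
import os
import shutil
import tempfile
import urllib.request


def download(url, filename, timeout=10):
    """Copy the body of url into filename in chunks; return the size in bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(filename, 'wb') as f:
        shutil.copyfileobj(response, f)
    return os.path.getsize(filename)


# Offline demo: the data: URL below encodes the four bytes 00 01 02 03.
target = os.path.join(tempfile.mkdtemp(), 'demo.bin')
print(download('data:application/octet-stream;base64,AAECAw==', target))
```

For the cat picture, the call would be download('http://placekitten.com/g/500/600', 'cat_500_600.jpg').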

>>> response.geturl()
'http://placekitten.com/g/500/600'

>>> response.geturl  # without parentheses the REPL shows the bound method instead of calling it
<bound method HTTPResponse.geturl of <http.client.HTTPResponse object at 0x7fe88d136f60>>
>>> print(response.info())
Date: Thu, 14 Sep 2017 08:10:46 GMT
Content-Type: image/jpeg
Content-Length: 26590
Connection: close
Set-Cookie: __cfduid=dc52691cf479658e05d15824990dabeb11505376646; expires=Fri, 14-Sep-18 08:10:46 GMT; path=/; domain=.placekitten.com; HttpOnly
Accept-Ranges: bytes
X-Powered-By: PleskLin
Access-Control-Allow-Origin: *
Cache-Control: public
Expires: Thu, 31 Dec 2020 20:00:00 GMT
Server: cloudflare-nginx
CF-RAY: 39e1df2a94ee77a2-LAX

>>> response.getcode()
200
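The three methods above can also be used from a script, and the object returned by info() lets individual headers be looked up by name. The following is a small sketch rather than part of the original session. It uses a data: URL so that it runs offline, which is also why getcode() returns None here instead of 200.

```python
import urllib.request

# An http:// URL returns an object with the same three methods.
with urllib.request.urlopen('data:text/plain;charset=utf-8,hello') as response:
    final_url = response.geturl()  # URL that was actually retrieved
    headers = response.info()      # message object holding the headers
    status = response.getcode()    # None here: no HTTP exchange took place
    body = response.read()

# Header lookup is case-insensitive and returns a string (None if absent).
print(final_url)
print(headers['Content-Type'])
print(headers.get_content_type())
print(headers['Content-Length'])
print(status)
```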

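The data argument: the urllib.parse.urlencode() function takes a mapping or a sequence of 2-tuples and returns an ASCII string in the application/x-www-form-urlencoded format. That string must be encoded to bytes before it is passed as data.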
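To make the format concrete, here is a short sketch of what urlencode() produces and how the result becomes the data of a request. The URL is a placeholder and is never contacted, since constructing a Request object does not open a connection.

```python
import urllib.parse
import urllib.request

# A mapping and a sequence of 2-tuples give the same kind of string.
from_dict = urllib.parse.urlencode({'i': 'I love Junjie', 'doctype': 'json'})
from_pairs = urllib.parse.urlencode([('i', 'I love Junjie'), ('doctype', 'json')])
print(from_dict)   # i=I+love+Junjie&doctype=json

# urlopen() and Request() expect bytes for data, so encode the string.
body = from_dict.encode('utf-8')

# Placeholder URL; building the Request sends nothing.
req = urllib.request.Request('http://www.example.com/form', data=body)
print(req.get_method())   # POST, because data was supplied
```

Supplying data is what turns the request from a GET into a POST.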

import urllib.parse
import urllib.request

url = 'http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule&sessionFrom='
# Form fields sent with the POST request
data = {}
data['i'] = 'I love Junjie'
data['from'] = 'AUTO'
data['to'] = 'AUTO'
data['smartresult'] = 'dict'
data['client'] = 'fanyideskweb'
data['salt'] = '1505376958945'
data['sign'] = '86bb3d2294c81c8d6718e800f939bf45'
data['doctype'] = 'json'
data['version'] = '2.1'
data['keyfrom'] = 'fanyi.web'
data['action'] = 'FY_BY_CLICKBUTTION'
data['typoResult'] = 'true'
# urlencode() returns a str; urlopen() requires bytes for data
data = urllib.parse.urlencode(data).encode('utf-8')
response = urllib.request.urlopen(url, data)
html = response.read().decode('utf-8')
print(html)

import json
json.loads(html)  # the result is a dictionary
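Once the text has been parsed, values are read with ordinary dictionary and list indexing. The JSON string below is a hypothetical stand-in for the server's reply, made up for this sketch; the fields returned by the real service may differ.

```python
import json

# Hypothetical response body, NOT captured from the real service.
html = '{"errorCode": 0, "translateResult": [[{"src": "hello", "tgt": "bonjour"}]]}'

result = json.loads(html)
print(type(result).__name__)                   # dict
print(result['translateResult'][0][0]['tgt'])  # bonjour
```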

Hiding the crawler

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

headers should be a dictionary

Alternatively, create the Request first and then call its add_header() method
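By default urllib identifies itself with a User-Agent of the form Python-urllib/3.x, which some servers refuse. The sketch below shows both ways of setting a browser-like value: the headers parameter and add_header(). The URL and the User-Agent string are placeholders chosen for illustration, and nothing is sent because the requests are only constructed.

```python
import urllib.request

url = 'http://www.example.com/'   # placeholder URL
# Example value only; a real browser's User-Agent string could be used.
user_agent = 'Mozilla/5.0 (X11; Linux x86_64)'

# Option 1: pass a dictionary through the headers parameter.
req1 = urllib.request.Request(url, headers={'User-Agent': user_agent})

# Option 2: create the Request first, then call add_header().
req2 = urllib.request.Request(url)
req2.add_header('User-Agent', user_agent)

# Request stores header names capitalized, i.e. as 'User-agent'.
print(req1.get_header('User-agent'))
print(req1.header_items() == req2.header_items())
```

Either request object can then be passed to urllib.request.urlopen() in place of the URL string.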


Original article: http://www.cnblogs.com/fengjunjie-w/p/7521480.html
