
Python--34 Web Crawlers

Posted: 2017-09-14 18:48:50


How Python accesses the Internet

  URL + lib -->  urllib 

The general format of a URL is

  protocol://hostname[:port]/path/[;parameters][?query]#fragment

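A URL consists of three parts

  The first part is the protocol: http, https, ftp, file, ed2k, ...

  The second part is the domain name or IP address of the server that hosts the resource (sometimes a port number must be included; every transfer protocol has a default port, e.g. 80 for http)

  The third part is the specific location of the resource, such as a directory or file name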
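
The three parts described above can be pulled out of a URL with urllib.parse.urlsplit(). The following is a minimal sketch; the URL is made up for illustration and is not from the original notes.

```python
from urllib.parse import urlsplit

# Example URL made up for illustration; it exercises the optional parts too.
url = "http://www.example.com:8080/docs/index.html?lang=en&page=2#intro"
parts = urlsplit(url)

print(parts.scheme)    # the protocol
print(parts.hostname)  # the host name or IP address
print(parts.port)      # the explicit port, or None when the URL omits it
print(parts.path)      # the location of the resource on the server
print(parts.query)     # the text after "?"
print(parts.fragment)  # the text after "#"
```

Note that port is None when the URL does not spell one out; urlsplit() does not fill in the protocol's default port.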

urllib contains four modules

  urllib.request for opening and reading URLs

  urllib.error containing the exceptions raised by urllib.request

  urllib.parse for parsing URLs

  urllib.robotparser for parsing robots.txt files
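
Of the four modules, urllib.robotparser is the one these notes do not exercise later, so here is a small sketch of it. The robots.txt rules and the example.com URLs are hypothetical, and they are fed in as a list of lines so that nothing is fetched from the network.

```python
import urllib.robotparser

# Hypothetical robots.txt content, supplied inline so nothing is downloaded.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_lines)

# can_fetch() reports whether the given user agent may request the URL.
allowed_public = parser.can_fetch("*", "http://www.example.com/index.html")
allowed_private = parser.can_fetch("*", "http://www.example.com/private/data.html")
print(allowed_public, allowed_private)
```

For a live site one would normally call set_url() and read() on the parser instead of parse(), which needs network access.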

    urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False)

    Open the URL url, which can be either a string or a Request object.

>>> import urllib.request
>>> response = urllib.request.urlopen('http://www.weparts.net')
>>> html = response.read()
>>> print(html.decode('utf-8'))
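
The session above assumes the request succeeds and that the page is UTF-8. A slightly more defensive sketch is shown below; the helper name fetch_text and its default values are assumptions for illustration, not part of urllib.

```python
import urllib.error
import urllib.request


def fetch_text(url, timeout=10, default_charset="utf-8"):
    """Return the decoded body of url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            raw = response.read()
            # Prefer the charset declared in Content-Type, if there is one.
            charset = response.headers.get_content_charset() or default_charset
    except urllib.error.URLError as err:
        # HTTPError is a subclass of URLError, so HTTP error statuses land here too.
        print("request failed:", err.reason)
        return None
    return raw.decode(charset)


# A data: URL keeps this demonstration offline.
print(fetch_text("data:text/plain;charset=utf-8,hello"))
```

A timeout that fires while the body is being read may surface as a separate timeout exception rather than URLError, so production code might want to catch that as well.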

Hands-on practice

import urllib.request

# Download the image by passing the URL string straight to urlopen()
response = urllib.request.urlopen('http://placekitten.com/g/500/600')
cat_img = response.read()
with open('cat_500_600.jpg', 'wb') as f:
    f.write(cat_img)

# The same download, this time passing a Request object to urlopen()
req = urllib.request.Request('http://placekitten.com/g/500/600')
response = urllib.request.urlopen(req)
cat_img = response.read()
with open('cat_500_600.jpg', 'wb') as f:
    f.write(cat_img)
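
The download above reads the whole image into memory before writing it out. For larger files, one possible variation is to stream the response into the file with shutil.copyfileobj(). The helper name download is an assumption for illustration.

```python
import shutil
import urllib.request


def download(url, filename, timeout=10):
    """Stream the body of url into filename and return the filename."""
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(filename, "wb") as out_file:
        # copyfileobj copies in chunks instead of one big read().
        shutil.copyfileobj(response, out_file)
    return filename
```

With network access, a call such as download('http://placekitten.com/g/500/600', 'cat_500_600.jpg') would mirror the example above.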
>>> response.geturl()
'http://placekitten.com/g/500/600'
>>> response.geturl  # without parentheses the REPL shows the bound method itself
<bound method HTTPResponse.geturl of <http.client.HTTPResponse object at 0x7fe88d136f60>>
>>> print(response.info())
Date: Thu, 14 Sep 2017 08:10:46 GMT
Content-Type: image/jpeg
Content-Length: 26590
Connection: close
Set-Cookie: __cfduid=dc52691cf479658e05d15824990dabeb11505376646; expires=Fri, 14-Sep-18 08:10:46 GMT; path=/; domain=.placekitten.com; HttpOnly
Accept-Ranges: bytes
X-Powered-By: PleskLin
Access-Control-Allow-Origin: *
Cache-Control: public
Expires: Thu, 31 Dec 2020 20:00:00 GMT
Server: cloudflare-nginx
CF-RAY: 39e1df2a94ee77a2-LAX
>>> response.getcode()
200
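
The three inspection methods used above can be gathered into one small helper. This is only a sketch: the name summarize is hypothetical, and a data: URL stands in for a real HTTP request so that the code runs offline.

```python
import urllib.request


def summarize(response):
    """Collect the metadata exposed by geturl(), info() and getcode()."""
    headers = response.info()
    return {
        "url": response.geturl(),
        "status": response.getcode(),
        "content_type": headers.get_content_type(),
        "content_length": headers.get("Content-Length"),
    }


# A data: URL keeps the demo offline; such responses carry no HTTP status.
with urllib.request.urlopen("data:text/plain;charset=utf-8,hello") as response:
    summary = summarize(response)
print(summary)
```

For an HTTP response like the one in the session above, the status entry would hold the numeric code (200 there) instead of None.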

data: the urllib.parse.urlencode() function takes a mapping or a sequence of 2-tuples and returns a string in the format that data expects (application/x-www-form-urlencoded).
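
As a quick illustration of that format, the sketch below encodes a mapping and a sequence of 2-tuples. The field names are borrowed from the translation example in these notes; the variable names are assumptions.

```python
from urllib.parse import urlencode

# A mapping and a sequence of 2-tuples produce the same kind of string.
from_mapping = urlencode({"i": "I love Junjie", "doctype": "json"})
from_pairs = urlencode([("smartresult", "dict"), ("smartresult", "rule")])

print(from_mapping)  # i=I+love+Junjie&doctype=json
print(from_pairs)    # smartresult=dict&smartresult=rule

# urlopen() expects bytes for data, so the string is encoded before sending.
post_body = from_mapping.encode("utf-8")
```

The sequence form is handy when the same key has to appear more than once, which a dictionary cannot express.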

 

import json
import urllib.parse
import urllib.request

url = 'http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule&sessionFrom='
# Form fields sent in the POST body
data = {}
data['i'] = 'I love Junjie'
data['from'] = 'AUTO'
data['to'] = 'AUTO'
data['smartresult'] = 'dict'
data['client'] = 'fanyideskweb'
data['salt'] = '1505376958945'
data['sign'] = '86bb3d2294c81c8d6718e800f939bf45'
data['doctype'] = 'json'
data['version'] = '2.1'
data['keyfrom'] = 'fanyi.web'
data['action'] = 'FY_BY_CLICKBUTTION'
data['typoResult'] = 'true'
# urlencode() builds the form string; urlopen() needs it as bytes
data = urllib.parse.urlencode(data).encode('utf-8')
response = urllib.request.urlopen(url, data)
html = response.read().decode('utf-8')
print(html)
json.loads(html)  # the result is a dict
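
Two details of the script above can be checked without contacting any server: supplying data turns the request into a POST, and json.loads() turns JSON text into Python objects. In the sketch below, the helper name, the example.com endpoint, and the sample reply are all made up for illustration; the reply is not what the translation service actually returns.

```python
import json
import urllib.parse
import urllib.request


def build_post_request(url, fields):
    """Return a Request that will be sent as POST because data is set."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(url, data=body)


# Hypothetical endpoint and fields, used only to inspect the request offline.
req = build_post_request("http://www.example.com/translate",
                         {"i": "hello", "doctype": "json"})
print(req.get_method())  # POST
print(req.data)          # b'i=hello&doctype=json'

# Made-up JSON text standing in for a server reply; real replies will differ.
sample_reply = '{"status": "ok", "items": ["a", "b"]}'
parsed = json.loads(sample_reply)
print(type(parsed).__name__, parsed["items"][0])
```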

 Hiding the crawler

 urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

headers should be a dictionary.

Headers can also be added to an existing Request object with add_header().
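
Both ways of setting a header can be compared side by side without sending anything. In the sketch below, the URL and the User-Agent string are placeholders chosen for illustration; a real crawler would typically copy the value from an actual browser.

```python
import urllib.request

url = "http://www.example.com/"  # placeholder URL for illustration
user_agent = "Mozilla/5.0 (illustrative value)"

# Option 1: pass the headers dictionary when constructing the Request.
req_a = urllib.request.Request(url, headers={"User-Agent": user_agent})

# Option 2: attach the header afterwards with add_header().
req_b = urllib.request.Request(url)
req_b.add_header("User-Agent", user_agent)

# Request stores header names capitalized, i.e. as "User-agent".
print(req_a.get_header("User-agent"))
print(req_b.header_items())
```

If no User-Agent is set, urllib identifies itself with a Python-urllib value, which is why some sites treat such requests differently from browser traffic.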


Original article: http://www.cnblogs.com/fengjunjie-w/p/7521480.html
