While reading up recently on web crawling and simulated login, I came across a package called
mechanize (/ˈmekənaɪz/, literally "to mechanize"), and it lives up to its name: it automates interaction with web pages.
mechanize.Browser and mechanize.UserAgentBase implement the interface of urllib2.OpenerDirector, so:
any URL can be opened, not just http:
mechanize.UserAgentBase offers easy dynamic configuration of user-agent features like protocol, cookie, redirection and robots.txt handling, without having to make a new OpenerDirector each time, e.g. by calling build_opener().
Easy HTML form filling.
Convenient link parsing and following.
Browser history (.back() and .reload() methods).
The Referer HTTP header is added properly (optional).
Automatic observance of robots.txt.
Automatic handling of HTTP-Equiv and Refresh.
In other words, mechanize.Browser and mechanize.UserAgentBase implement the urllib2.OpenerDirector interface, so any URL can be opened, not just HTTP.
They also provide a simpler way to configure user-agent behaviour without creating a new OpenerDirector each time,
plus convenient handling of forms and links, browser history and reloading, Refresh redirections, robots.txt observance, and so on.
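As a quick, hedged sketch (not part of the official example), mechanize also exports a urllib2-style module-level interface, so the simplest possible session looks roughly like this; the URL is only a placeholder:

import mechanize

# mechanize mirrors urllib2's urlopen(), going through its own OpenerDirector
response = mechanize.urlopen("http://www.example.com/")
print response.read()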
import re
import mechanize

(1) Instantiate a browser object:
br = mechanize.Browser()

(2) Open a URL:
br.open("http://www.example.com/")

(3) Follow the second link on the page whose text matches text_regex:
# follow second link with element text matching regular expression
response1 = br.follow_link(text_regex=r"cheese\s*shop", nr=1)
assert br.viewing_html()

(4) The page title:
print br.title()

(5) Print the page's URL:
print response1.geturl()

(6) The response headers:
print response1.info()  # headers

(7) The response body:
print response1.read()  # body

(8) Select the form named "order" in the body:
br.select_form(name="order")
# Browser passes through unknown attributes (including methods)
# to the selected HTMLForm.

(9) Assign values to the form control named "cheeses":
br["cheeses"] = ["mozzarella", "caerphilly"]  # (the method here is __setitem__)

(10) Submit the current form:
# Browser calls .close() on the current response on
# navigation, so this closes response1
response2 = br.submit()
# print currently selected form (don't call .submit() on this, use br.submit())
print br.form

(11) Go back:
response3 = br.back()  # back to cheese shop (same data as response1)
# the history mechanism returns cached response objects
# we can still use the response, even though it was .close()d
response3.get_data()  # like .seek(0) followed by .read()

(12) Reload the page:
response4 = br.reload()  # fetches from server

(13) List all forms on the page and walk its links:
for form in br.forms():
    print form
# .links() optionally accepts the keyword args of .follow_/.find_link()
for link in br.links(url_regex="python.org"):
    print link
    br.follow_link(link)  # takes EITHER Link instance OR keyword args
    br.back()
This is the example given in the documentation; the basic explanations are included in the code above.
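Since the topic here is simulated login, the following is a minimal sketch of a form-based login with the same API. It is not from the documentation; the login URL and the control names username and password are assumptions that depend on the target site:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)               # many login pages are disallowed by robots.txt; use with care
br.open("http://www.example.com/login")   # placeholder login URL

br.select_form(nr=0)                      # assumes the login form is the first form on the page
br["username"] = "joe"                    # assumed control name
br["password"] = "secret"                 # assumed control name
response = br.submit()

# the Browser keeps the session cookies, so later br.open() calls stay logged in
print response.geturl()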
You may control the browser’s policy by using the methods of mechanize.Browser’s base class, mechanize.UserAgent. For example:
Through the methods of mechanize.UserAgent we can control the browser's policy. The code below, also taken from the documentation, shows how:
br = mechanize.Browser()
# Explicitly configure proxies (Browser will attempt to set good defaults).
# Note the userinfo ("joe:password@") and port number (":3128") are optional.
br.set_proxies({"http": "joe:password@myproxy.example.com:3128",
"ftp": "proxy.example.com",
})
# Add HTTP Basic/Digest auth username and password for HTTP proxy access.
# (equivalent to using "joe:password@..." form above)
br.add_proxy_password("joe", "password")
# Add HTTP Basic/Digest auth username and password for website access.
br.add_password("http://example.com/protected/", "joe", "password")
# Don't handle HTTP-EQUIV headers (HTTP headers embedded in HTML).
br.set_handle_equiv(False)
# Ignore robots.txt. Do not do this without thought and consideration.
br.set_handle_robots(False)
# Don't add Referer (sic) header
br.set_handle_referer(False)
# Don't handle Refresh redirections
br.set_handle_refresh(False)
# Don't handle cookies
br.set_cookiejar()
# Supply your own mechanize.CookieJar (NOTE: cookie handling is ON by
# default: no need to do this unless you have some reason to use a
# particular cookiejar)
cj = mechanize.CookieJar()  # construct the jar you want to use
br.set_cookiejar(cj)
# Log information about HTTP redirects and Refreshes.
br.set_debug_redirects(True)
# Log HTTP response bodies (ie. the HTML, most of the time).
br.set_debug_responses(True)
# Print HTTP headers.
br.set_debug_http(True)
# To make sure you're seeing all debug output:
import logging
import sys
logger = logging.getLogger("mechanize")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.INFO)
# Sometimes it's useful to process bad headers or bad HTML:
response = br.response() # this is a copy of response
headers = response.info() # currently, this is a mimetools.Message
headers["Content-type"] = "text/html; charset=utf-8"
response.set_data(response.get_data().replace("<!---", "<!--"))
br.set_response(response)
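In scraping code these policy methods are often combined with a custom User-Agent header via the addheaders attribute inherited from urllib2.OpenerDirector. A hedged sketch, with a placeholder UA string and URL:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)      # see the caveat above about robots.txt
br.set_handle_refresh(False)     # don't follow meta-refresh redirections
# addheaders is a list of (name, value) pairs sent with every request
br.addheaders = [("User-agent", "Mozilla/5.0 (compatible; example-crawler)")]
response = br.open("http://www.example.com/")
print response.info()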
Besides, there are other web-interaction modules similar to mechanize:
There are several wrappers around mechanize designed for functional testing of web applications:
At the end of the day, they are all wrappers around urllib2, so just pick whichever module works best for you!
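To make the "wrapper around urllib2" point concrete, here is a rough sketch of what an equivalent plain urllib2 session looks like: cookies and headers have to be wired up by hand, which is roughly what mechanize configures for you internally:

import urllib2
import cookielib

# build an opener with cookie support, similar to what mechanize sets up by default
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [("User-agent", "Mozilla/5.0")]

response = opener.open("http://www.example.com/")
print response.read()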
Original post: http://www.cnblogs.com/CBDoctor/p/3855738.html