
My Web Scraping Notes (1)

Date: 2017-11-03 23:47:17

Tags: selected   htm   style   message   complete   table   mod   http   copy

The simplest step first: download the HTML source of a web page.

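from urllib.request import Request, urlopen

# Add a browser-like User-Agent header in case the site blocks scrapers
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
# Address of the site to fetch
url = "https://www.zhihu.com/"
req_timeout = 5  # Timeout in seconds, so an unreachable or slow URL does not waste time
req = Request(url=url, headers=headers)
f = urlopen(req, None, req_timeout)
s = f.read()  # read() returns bytes
s = s.decode('utf-8')  # Decode as UTF-8 so Chinese text in the page is not garbled
ss = str(s)  # s is already a str at this point; kept from the original notes
print(ss)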
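The script above stops with a traceback if the URL is unreachable or the timeout expires. As a minimal sketch, the same request can be wrapped in a small helper that returns None on failure. The name fetch_html and the shortened User-Agent string are assumptions made for illustration and are not part of the original notes.

```python
from urllib.request import Request, urlopen


def fetch_html(url, timeout=5):
    """Hypothetical helper: return the decoded page body, or None on failure."""
    headers = {"User-Agent": "Mozilla/5.0"}  # shortened placeholder UA string
    req = Request(url=url, headers=headers)
    try:
        # The "with" block closes the connection even if read() raises
        with urlopen(req, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses
        print("request failed:", exc)
        return None
```

Calling fetch_html("https://www.zhihu.com/") would then give either the page text or None. A page that is not valid UTF-8 still raises UnicodeDecodeError, which this sketch deliberately does not catch.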

Problems encountered:

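1. Most sites have anti-scraping measures, so we need to add this line:

headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}

This sets a browser-like User-Agent request header, which gets around the block. I tested it myself: the HTML of both Baidu and Zhihu can be fetched this way.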
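To see what the header changes, the sketch below compares the User-Agent that urllib sends by default with the one set on the Request. It makes no network call. The variable names are chosen here for illustration.

```python
from urllib.request import Request, build_opener

# A default opener identifies itself as Python-urllib/<version>,
# which is easy for a site to recognize and reject.
default_ua = dict(build_opener().addheaders)["User-agent"]
print(default_ua)

browser_ua = ("Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) "
              "Gecko/20091201 Firefox/3.5.6")
req = Request("https://www.zhihu.com/", headers={"User-Agent": browser_ua})

# Request stores header names capitalized, hence "User-agent" here
sent_ua = req.get_header("User-agent")
print(sent_ua)
```

A header set on the Request takes the place of the opener's default when the request is sent, so the server sees the browser string.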

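2. The fetched content contains garbled characters

s=s.decode('utf-8')

read() returns raw bytes; decoding them as UTF-8 fixes this for pages that are served in UTF-8.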
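Not every page is UTF-8, so a hard-coded decode('utf-8') can fail on other sites. As a rough sketch, the charset can be taken from the Content-Type header when the server declares one. decode_body is a hypothetical helper name, and falling back to UTF-8 with replacement characters is an assumption rather than something the original notes do.

```python
from email.message import Message


def decode_body(raw, content_type=""):
    """Hypothetical helper: decode bytes using the declared charset if any."""
    msg = Message()
    msg["Content-Type"] = content_type or "text/html"
    # get_content_charset() returns None when no charset parameter is present
    charset = msg.get_content_charset() or "utf-8"
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or wrong declaration: fall back without crashing
        return raw.decode("utf-8", errors="replace")


print(decode_body("知乎".encode("gbk"), "text/html; charset=GBK"))
```

With a real response object f from urlopen, the same lookup is available directly as f.headers.get_content_charset().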

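3. The output needs to be of type str

Convert it to str. Note that decode() already returns a str, so the extra str(s) call is harmless but not required.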
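The difference between the two types is easy to check in isolation. The short sketch below uses a literal string instead of a downloaded page and shows why str() must not be called on the undecoded bytes.

```python
raw = "知乎".encode("utf-8")   # what read() returns: bytes
text = raw.decode("utf-8")     # what decode() returns: str

print(type(raw).__name__)
print(type(text).__name__)

# str() on a str changes nothing
same = str(text) == text

# str() on bytes does NOT decode; it yields the repr, starting with b'
wrong = str(raw)
print(wrong)
```

So the conversion that matters is decode(); calling str() directly on the bytes would print escape sequences instead of the page text.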

Output of the code above (HTML of the Zhihu home page):

<!DOCTYPE html>
<html lang="zh-CN" class="">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
<meta http-equiv="X-ZA-Response-Id" content="1b244bb1a32b4315">
<meta http-equiv="X-ZA-Experiment" content="default:None,ge3:ge3_9,ge2:ge2_1,nweb_sticky_sidebar:sticky,live_review_buy_bar:live_review_buy_bar_2,is_office:false,home_ui2:default,is_show_unicom_free_entry:unicom_free_entry_off,app_store_rate_dialog:close,qa_sticky_sidebar:sticky_sidebar,android_profile_panel:panel_b,live_store:ls_a2_b2_c1_f2,search_hybrid_tabs:without-tabs,answer_related_readings:qa_recommend_with_ads_and_article,asdfadsf:asdfad,new_mobile_column_appheader:new_header,fav_act:default,remix_one_key_play_button:headerButton,mobile_qa_page_proxy_heifetz:m_qa_page_nweb,nweb_write_answer:default,android_pass_through_push:getui,new_more:new,new_buy_bar:livenewbuy3,zcm-lighting:zcm,iOS_newest_version:4.2.0,qrcode_login:qrcode,wechat_share_modal:wechat_share_modal_show">
<meta name="renderer" content="webkit" />
<meta name="description" content="中文互联网最大的知识平台,帮助人们便捷地分享彼此的知识、经验和见解。"/>
<meta name="viewport" content="user-scalable=no, width=device-width, initial-scale=1.0, maximum-scale=1.0"/>
<title>知乎 - 发现更大的世界</title>



<link rel="apple-touch-icon" href="https://static.zhihu.com/static/revved/img/ios/touch-icon-152.87c020b9.png" sizes="152x152">
<link rel="apple-touch-icon" href="https://static.zhihu.com/static/revved/img/ios/touch-icon-120.496c913b.png" sizes="120x120">
<link rel="apple-touch-icon" href="https://static.zhihu.com/static/revved/img/ios/touch-icon-76.dcf79352.png" sizes="76x76">
<link rel="apple-touch-icon" href="https://static.zhihu.com/static/revved/img/ios/touch-icon-60.9911cffb.png" sizes="60x60">

<link rel="shortcut icon" href="https://static.zhihu.com/static/favicon.ico" type="image/x-icon" />
<link rel="dns-prefetch" href="p1.zhimg.com"/>
<link rel="dns-prefetch" href="p2.zhimg.com"/>
<link rel="dns-prefetch" href="p3.zhimg.com"/>
<link rel="dns-prefetch" href="p4.zhimg.com"/>
<link rel="dns-prefetch" href="comet.zhihu.com"/>
<link rel="dns-prefetch" href="static.zhihu.com"/>
<link rel="dns-prefetch" href="upload.zhihu.com"/>
<link rel="stylesheet" href="https://static.zhihu.com/static/revved/-/css/pages/unlogin-index/main.f214513a.css">
<meta name="google-site-verification" content="FTeR0c8arOPKh8c5DYh_9uu98_zJbaWw53J-Sch9MTg" />
<meta name="baidu-site-verification" content="KPFppAFoYF4Kkdv9" />
<meta property="qc:admins" content="00544670776201056375" />
<link rel="canonical" href="http://www.zhihu.com" />
<meta id="znonce" name="znonce" content="d5e581328572473aad8501685dae174f">
<!--[if lt IE 9]>
<script src="https://static.zhihu.com/static/components/respond/dest/respond.min.js"></script>
<link href="https://static.zhihu.com/static/components/respond/cross-domain/respond-proxy.html" id="respond-proxy" rel="respond-proxy" />
<link href="/static/components/respond/cross-domain/respond.proxy.gif" id="respond-redirect" rel="respond-redirect" />
<script src="/static/components/respond/cross-domain/respond.proxy.js"></script>
<![endif]-->
<script src="https://static.zhihu.com/static/revved/-/js/instant.14757a4a.js"></script>
</head>
<body class="zhi ">




<div class="index-main">
<div class="index-main-body">
<div class="index-header">
<h1 class="logo hide-text">知乎</h1>

<h2 class="subtitle">与世界分享你的知识、经验和见解</h2>

</div>

<div class="desk-front sign-flow sign-flow clearfix sign-flow-simple">


<div class="index-tab-navs">
<div class="navs-slider">
<a href="#signup" class="active">注册</a>
<a href="#signin">登录</a>
<span class="navs-slider-bar"></span>
</div>
</div>



<div class="view view-signin" data-za-module="SignInForm">
<form method="POST">
<input type="hidden" name="_xsrf" value="060f3dedb5b35e2ac6fe354bed716d04"/>
<div class="group-inputs">

<div class="account input-wrapper">

<input type="text" name="account" aria-label="手机号或邮箱" placeholder="手机号或邮箱" required>
</div>
<div class="verification input-wrapper">
<input type="password" name="password" aria-label="密码" placeholder="密码" required /><button type="button" class="send-code-button">获取验证码</button>
</div>

<div class="Captcha input-wrapper" data-type="cn" data-za-module="Captcha">
<div class="Captcha-operate">
<input type="hidden" name="captcha" required data-rule-required="true" data-msg-required="请点击图中所有倒立的文字">
<input type="hidden" name="captcha_type" value="cn" required>
<label class="Captcha-prompt">请点击图中所有倒立的文字</label>
<span class="Captcha-refresh js-refreshCaptcha sprite-index-icon-refresh"></span>
</div>
<div class="Captcha-imageConatiner">
<img class="Captcha-image" alt="验证码" >
</div>
</div>

</div>
<div class="button-wrapper command">
<button class="sign-button submit" type="submit">登录</button>
</div>
<div class="signin-misc-wrapper clearfix">

<button type="button" class="signin-switch-button">手机验证码登录</button>

<a class="unable-login" href="#">无法登录?</a>
</div>

<div class="other-signup-wrapper" data-za-module="SNSSignIn">

<span class="name signin-switch-qrcode-buttons">二维码登录</span>
<span class="signup-footer-separate signup-footer-se"> · </span>

<span class="name signup-social-buttons js-toggle-sns-buttons">社交帐号登录</span>

<div class="sns-buttons">
<a title="微信登录" class="js-bindwechat" href="#"><i class="sprite-index-icon-wechat"></i></a>
<a title="微博登录" class="js-bindweibo" href="#"><i class="sprite-index-icon-weibo"></i></a>
<a title="QQ 登录" class="js-bindqq" href="#"><i class="sprite-index-icon-qq"></i></a>
</div>


</div>

</form>

<div class="qrcode-signin-container">
<div class="qrcode-signin-step1">
<div class="qrcode-signin-img-wrapper">
<img src="/static/img/spinner/grey-loading.gif" class="qrcode-signin-loading"/>
</div>
<p>打开最新 <a href="https://www.zhihu.com/app/" target="_blank">知乎 App</a></p>
<p>在「更多」页面右上角打开扫一扫</p>
<div class="qrcode-signin-cut-button">
<span class="signin-switch-password">使用密码登录</span>
</div>
</div>
<div class="qrcode-signin-step2">
<div class="qrcode-signin-scan-status"></div>
<p class="qrcode-signin-scan-tips">扫描成功</p>
<p>请在手机上「确认登录」</p>
<div class="qrcode-signin-cut-button">
<span class="qrcode-goto-scan">返回二维码</span>
</div>
</div>
<div class="qrcode-signin-failure">
<div class="qrcode-signin-failure-icon"></div>
<p class="qrcode-signin-failure-message"></p>
<div class="qrcode-signin-cut-button">
<span class="signin-switch-password">使用密码登录</span>
</div>
</div>
<div class="qrcode-signin-guide"></div>
</div>



<div class="QRCode">
<button class="QRCode-toggleButton">
<span class="sprite-global-icon-qrcode"></span>
<span class="QRCode-toggleButtonText ">下载知乎 App</span>
</button>
<div class="QRCode-card">
<div class="QRCode-image"></div>
<div class="sprite-index-icon-arrow"></div>
</div>
</div>


</div>
<div class="view  view-signup selected" data-za-module="SignUpForm">

<form class="zu-side-login-box" action="/register/email" id="sign-form-1" autocomplete="off" method="POST">
<input type="password" hidden> 
<input type="hidden" name="_xsrf" value="060f3dedb5b35e2ac6fe354bed716d04"/>

<div class="group-inputs">


<div class="name input-wrapper">
<input required type="text" name="fullname" aria-label="姓名" placeholder="姓名">
</div>
<div class="email input-wrapper">

<input required type="text" class="account" name="phone_num" aria-label="手机号" placeholder="手机号">

</div>
<div class="input-wrapper">
<input required type="password" name="password" aria-label="密码" placeholder="密码(不少于 6 位)" autocomplete="off">
</div>


<div class="Captcha input-wrapper" data-type="cn" data-za-module="Captcha">
<div class="Captcha-operate">
<input type="hidden" name="captcha" required data-rule-required="true" data-msg-required="请点击图中所有倒立的文字">
<input type="hidden" name="captcha_type" value="cn" required>
<label class="Captcha-prompt">请点击图中所有倒立的文字</label>
<span class="Captcha-refresh js-refreshCaptcha sprite-index-icon-refresh"></span>
</div>
<div class="Captcha-imageConatiner">
<img class="Captcha-image" alt="验证码" >
</div>
</div>

</div>
<div class="button-wrapper command">
<button class="sign-button submit" type="submit">注册知乎</button>
</div>

</form>

<p class="agreement-tip">点击「注册」按钮,即代表你同意<a href="/terms" target="_blank">《知乎协议》</a></p>
<a class="signup-entry--org" href="/org/signup">注册机构号</a>

<div class="QRCode">
<button class="QRCode-toggleButton">
<span class="sprite-global-icon-qrcode"></span>
<span class="QRCode-toggleButtonText ">下载知乎 App</span>
</button>
<div class="QRCode-card">
<div class="QRCode-image"></div>
<div class="sprite-index-icon-arrow"></div>
</div>
</div>



</div>
</div>
</div>


</div>

<div class="footer">
<a target="_blank" href="https://zhuanlan.zhihu.com">知乎专栏</a>
<span class="dot">·</span>
<a target="_blank" href="/roundtable">知乎圆桌</a>
<span class="dot">·</span>
<a target="_blank" href="/explore" data-za-c="explore" data-za-a="visit_explore" data-za-l="home_bottom_explore">发现</a>
<span class="dot">·</span>
<a target="_blank" href="/app">移动应用</a>
<span class="dot">·</span>
<a href="/contact" class="footer-mobile-show">联系我们</a>
<span class="dot">·</span>
<a target="_blank" href="/careers">来知乎工作</a>
<br />
<span>&copy; 2017 知乎</span>
<span class="dot">·</span>
<a href="http://www.miibeian.gov.cn/" target="_blank">京 ICP 证 110745 号</a>
<span class="dot">·</span>
<span>京公网安备 11010802010035 号</span>
<span class="dot">·</span>
<a href="http://zhstatic.zhihu.com/assets/zhihu/publish-license.jpg" target="_blank">出版物经营许可证</a>
<br />
<a target="_blank" href="https://zhuanlan.zhihu.com/p/28852607">侵权举报</a>
<span class="dot">·</span>
<a target="_blank" href="http://www.12377.cn">网上有害信息举报专区</a>
<span class="dot">·</span>
<a target="_blank" href="/jubao">儿童色情信息举报专区</a>
<span class="dot">·</span>
<span>违法和不良信息举报:010-82716601</span>
<div class="chengxing">
<a id='___szfw_logo___' href='https://credit.szfw.org/CX20170607038331320388.html' target='_blank'>
<img src="https://static.zhihu.com/static/revved/img/index/chengxing_logo@2x.65dc76e8.png" border='0' />
</a>
<script type='text/javascript'>(function(){document.getElementById('___szfw_logo___').oncontextmenu = function(){return false;}})();</script>
</div>
</div>




<script type="text/json" class="json-inline" data-name="disabled_components">["back_to_top"]</script>
<script type="text/json" class="json-inline" data-name="current_user">["","","","-1","",0,0]</script>
<script type="text/json" class="json-inline" data-name="env">["zhihu.com","comet.zhihu.com",false,null,false,false]</script>

<script type="text/json" class="json-inline" data-name="ga_vars">{"user_created":0,"now":1509713487000,"abtest_mask":"------------------------------","user_attr":[0,0,0,"-","-"],"user_hash":0}</script>

<script src="https://static.zhihu.com/static/revved/-/js/vendor.cb14a042.js"></script>
<script src="https://static.zhihu.com/static/revved/-/js/closure/base.41bb3b24.js"></script>

<script src="https://static.zhihu.com/static/revved/-/js/closure/common.ef6c9c27.js"></script>
<script src="https://static.zhihu.com/static/revved/-/js/closure/page-index.f17f3a40.js"></script>
<meta name="entry" content="ZH.entrySignPage" data-module-id="page-index">


<input type="hidden" name="_xsrf" value="060f3dedb5b35e2ac6fe354bed716d04"/>
</body>
</html>

 


Original post: http://www.cnblogs.com/wssx/p/7780462.html
