
Finding dead links among indexed pages (By SEO)

Posted: 2016-06-14 11:29:29

A friend's site went down, and he wanted to know how many of his indexed pages are now dead links, so I sketched out a process. Getting the indexed count from a site: query is of course not accurate, but there is no better source; the real index only lives in the search engine's own database.

To check the status codes of the indexed pages, the flow is: fetch the indexed URLs > resolve the real URL > fetch the status code.
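For a single result, the middle two steps look roughly like this. This is just a sketch of mine (the href value is a placeholder): the link Baidu shows on the results page is a redirect, so the real URL comes back in the Location header, and a normal request to that URL gives the status code.

import requests

baidu_href = 'http://www.baidu.com/link?url=...'    # placeholder: an href taken from a result on the site: page
real_url = requests.head(baidu_href).headers['location']    # the redirect target is the real URL
print real_url, requests.get(real_url).status_code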

It runs rather slowly, though, and I'm not sure whether the slow step is BeautifulSoup or resolving the real URL from the Location header; the timing sketch after the script below is one way to find out.

#coding:utf-8

import urllib2,re,requests
from bs4 import BeautifulSoup as bs

domain = 'www.123.com'    # the domain to check
page_num = 10 * 10        # the first number is how many result pages to crawl

def gethtml(url):
    headers = {
        'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        # 'Accept-Encoding':'gzip, deflate, sdch',
        'Accept-Language':'zh-CN,zh;q=0.8',
        'Cache-Control':'max-age=0',
        'Connection':'keep-alive',
        'Cookie':'BDUSS=ng4UFVyUUpWU2hUR2R3b3hKamtpaE9ocW40LTFZcGdWeDBjbXkzdE83eDJQSE5YQVFBQUFBJCQAAAAAAAAAAAEAAADD3IYSamFjazE1NDUAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAHavS1d2r0tXa; ispeed_lsm=2; PSTM=1465195705; BIDUPSID=2274339847BBF9B1E97DA3ECE6469761; H_WISE_SIDS=102907_106764_106364_101556_100121_102478_102628_106368_103569_106502_106349_106665_106589_104341_106323_104000_104613_104638_106071_106599_106795; BAIDUID=D94A8DE66CF701AB5C3332B1BF883DDC:FG=1; BDSFRCVID=UEusJeC62m80hjJRoxzDhboaBeKaL6vTH6aIa6lTlb9Zx-72yRF7EG0PfOlQpYD-d1GyogKK3gOTH4jP; H_BDCLCKID_SF=fR-foIPbtKvSq5rvKbOEhPCX-fvQh4JXHD7yWCvG3455OR5Jj65Ve58JM46N2bvE3IbaWbjP5lvH8KQC3MA--fF_jxvn2PD8yj-L_KoXLqLbsq0x0-jchh_QWt8LKToxMCOMahkb5h7xOKbF056jK4JKjH0qt5cP; SIGNIN_UC=70a2711cf1d3d9b1a82d2f87d633bd8a02157232777; BD_HOME=1; BD_UPN=12314353; sug=3; sugstore=1; ORIGIN=0; bdime=0; BDRCVFR[feWj1Vr5u3D]=I67x6TjHwwYf0; H_PS_645EC=a5cfUippkbo0uQPU%2F4QbUFVCqXu4W9g5gr5yrxTnJT10%2FElVEvJBbeyjWJq8QUHgepjd; BD_CK_SAM=1; BDSVRTM=323; H_PS_PSSID=1434_20317_12896_20076_19860_17001_15506_11866; __bsi=16130066511508055252_00_0_I_R_326_0303_C02F_N_I_I_0',
        # 'Host':'www.baidu.com',
        'Upgrade-Insecure-Requests':'1',
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36',
    }

    req = urllib2.Request(url=url, headers=headers)
    html = urllib2.urlopen(req, timeout=30).read()
    return html

def status(url):    # return the HTTP status code
    status = requests.get(url).status_code
    return status

status_file = open('url_status.txt', 'a+')
for i in range(10, page_num, 10):
    url = 'https://www.baidu.com/s?wd=site%3A' + domain + '&pn=' + str(i)
    html = gethtml(url)

    soup = bs(html, "lxml")
    for i in soup.select('.c-showurl'):
        # print i.get('href')
        urls = i.get('href')
        # url_list.append(urls)
        header = requests.head(urls).headers
        header_url = header['location']    # resolve the real URL from the Location header
        if int(status(header_url)) == 404:
            print status(header_url), header_url    # print the status code and the real URL
            status_file.write(str(status(header_url)) + '  ' + header_url + '\n')    # write the status code and the link to a file

status_file.close()
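To find out which of the two suspects actually eats the time, a rough per-stage timing helps. The following is only a sketch of mine, not part of the original script: it assumes it runs after the definitions above (it reuses gethtml(), domain and the .c-showurl selector) and it times a single results page.

import time
import requests
from bs4 import BeautifulSoup as bs

# Stage 1: fetch one page of Baidu results (reuses gethtml() and domain from the script above)
t0 = time.time()
html = gethtml('https://www.baidu.com/s?wd=site%3A' + domain + '&pn=10')
t1 = time.time()

# Stage 2: parse it with BeautifulSoup and collect the result links
soup = bs(html, "lxml")
links = [a.get('href') for a in soup.select('.c-showurl')]
t2 = time.time()

# Stage 3: resolve each redirect and fetch the real URL for a status code
for link in links:
    real_url = requests.head(link).headers.get('location', link)
    requests.get(real_url)
t3 = time.time()

print 'fetch: %.2fs  parse: %.2fs  resolve+status: %.2fs' % (t1 - t0, t2 - t1, t3 - t2)

Whatever the timing shows, one cheap win stands out in the loop above: status() issues a full GET and gets called up to three times for every dead link (once in the if, once in the print, once in the write), so calling it once, storing the result and reusing it should already shave a good part of the slow stage.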

 

The code snippet I borrowed:

#coding: utf-8
import sys
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup
 
question_word = "吃货 程序员"
url = "http://www.baidu.com/s?wd=" + urllib.quote(question_word.decode(sys.stdin.encoding).encode(gbk))
htmlpage = urllib2.urlopen(url).read()
soup = BeautifulSoup(htmlpage)
print len(soup.findAll("table", {"class": "result"}))
for result_table in soup.findAll("table", {"class": "result"}):
    a_click = result_table.find("a")
    print "-----标题----\n" + a_click.renderContents()#标题
    print "----链接----\n" + str(a_click.get("href"))#链接
    print "----描述----\n" + result_table.find("div", {"class": "c-abstract"}).renderContents()#描述
    print
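The borrowed snippet uses the old BeautifulSoup 3 import (from BeautifulSoup import BeautifulSoup), while the script above already uses bs4. For anyone who only has bs4 installed, roughly the same extraction would look like the sketch below; this is mine, not from the original post, and it keeps only the title and link parts.

#coding: utf-8
import urllib
import urllib2
from bs4 import BeautifulSoup

question_word = "吃货 程序员"
url = "http://www.baidu.com/s?wd=" + urllib.quote(question_word.decode('utf-8').encode('gbk'))
htmlpage = urllib2.urlopen(url).read()
soup = BeautifulSoup(htmlpage, "lxml")    # bs4 wants an explicit parser

for result_table in soup.find_all("table", {"class": "result"}):    # find_all is the bs4 spelling of findAll
    a_click = result_table.find("a")
    print "----title----\n" + a_click.get_text().encode('utf-8')    # title
    print "----link----\n" + str(a_click.get("href"))               # link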

 


Original post: http://www.cnblogs.com/wuzhi-seo/p/5583139.html
