
Scraping the Lianjia site for listing information

Posted: 2018-10-14 14:09:30 · Views: 149 · Comments: 0


import re
import json
from urllib.request import urlopen
import ssl
# Disable SSL certificate verification (skip the digital-signature check)
ssl._create_default_https_context = ssl._create_unverified_context

ershoufang_url = "https://bj.lianjia.com/ershoufang/rs/"

def get_html_content(url):
    # Fetch a page and return its decoded HTML
    html = urlopen(url)
    content = html.read().decode("utf-8")
    # print(content)
    return content
def chuli(content):
    # Match each listing card and capture price, title, area, orientation,
    # renovation status and deed info via named groups
    obj = re.compile(
        r'<span.*?>关注</span></div><div.*?><span></span></div><div.*?>'
        r'<span></span></div><div class="price"><span>(?P<price>.*?)</span>万'
        r'</div></a><a.*?>(?P<title>.*?)</a><div class="info">.*?<span>/</span>'
        r'.*?<span>/</span>(?P<pingmi>.*?)<span>/</span>(?P<fangxiang>.*?)'
        r'<span>/</span>(?P<zhuangxiu>.*?)</div><div .*?>'
        r'(?:<span .*?>.*?</span>)?<span.*?>(?P<fangben>.*?)</span>',
        re.S)
    it = obj.finditer(content)
    for el in it:
        yield {
            "价格": el.group("price") + "万",
            "房屋信息": el.group("title"),
            "平米数": el.group("pingmi"),
            "朝向": el.group("fangxiang"),
            "装修": el.group("zhuangxiu").replace("<span>/</span>", ","),
            "房本信息": el.group("fangben").replace("随时看房", "无信息").replace("关注", "无信息"),
        }
def xieru(jieguo):
    # Append one JSON object per line (JSON Lines) to the output file
    txt = json.dumps(jieguo, ensure_ascii=False)
    with open("houseInfo", mode="a", encoding="utf-8") as f:
        f.write(txt + "\n")

def main():
    # Page 1 uses the base URL; pages 2-100 use the pgN path segment
    for i in range(1, 101):
        if i == 1:
            new_content = get_html_content(ershoufang_url)
        else:
            dong_url = "https://bj.lianjia.com/ershoufang/pg%d/" % i
            new_content = get_html_content(dong_url)
        ret = chuli(new_content)
        for el in ret:
            xieru(el)
            print(el)

if __name__ == "__main__":
    main()
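The core of the script is the `chuli`/`xieru` pair: a compiled regex with named groups pulls fields out of raw HTML, and each resulting dict is serialized as one JSON line with `ensure_ascii=False` so Chinese text stays readable. The sketch below demonstrates that same pattern on a tiny hypothetical HTML fragment (the snippet and field subset are stand-ins, not real Lianjia markup):

```python
import json
import re

# Hypothetical fragment mimicking the price/title structure the scraper targets
sample = (
    '<div class="price"><span>520</span>万</div>'
    '<a href="#">南北通透精装两居</a>'
)

# Named groups let us refer to captures by name instead of position
pattern = re.compile(
    r'<div class="price"><span>(?P<price>.*?)</span>万</div>'
    r'<a.*?>(?P<title>.*?)</a>',
    re.S,
)

records = []
for m in pattern.finditer(sample):
    records.append({
        "价格": m.group("price") + "万",  # price, in units of 10,000 RMB
        "房屋信息": m.group("title"),      # listing title
    })

# One JSON object per line; ensure_ascii=False keeps Chinese characters as-is
lines = [json.dumps(r, ensure_ascii=False) for r in records]
print(lines[0])  # prints: {"价格": "520万", "房屋信息": "南北通透精装两居"}
```

Note that parsing HTML with a single rigid regex is fragile: any change in the page markup silently breaks every capture, which is why an HTML parser such as the standard library's `html.parser` is usually preferred for this kind of extraction.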


Original post: https://www.cnblogs.com/PythonMrChu/p/9785661.html
