码迷,mamicode.com
首页 > 编程语言 > 详细

使用python爬取淘宝商品名称和价格

时间:2020-02-13 20:53:21      阅读:333      评论:0      收藏:0      [点我收藏+]

标签:format   val   time   ike   淘宝网   veh   swa   font   int   

最近在家闲得无聊,由于家里开网点,妈妈对于起商品标题感到很头痛,所以我就想在淘宝爬取一些信息。

小破站找了个学习视频,跟一遍发现视频是2018年的,而淘宝在2019年可能加入了反爬取机制,使用正常的方法爬不到结果。

但是有一种方式可以爬取,要先登陆淘宝网页版,然后去搜索,获取cookie和user-agent。

代码如下:

import requests
import re


def getHTMLText(url):
    kv = {‘cookie‘:‘cna=54y8Fm+TyioCATzcP+BwvvDA; thw=cn; lgc=%5Cu58A8%5Cu8FF9%5Cu9519%5Cu54AF%5Cu548C; tracknick=%5Cu58A8%5Cu8FF9%5Cu9519%5Cu54AF%5Cu548C; tg=0; enc=p6YWWbSWACqr5t1PcdDiNADVd7zKpnQG9X%2FZ666%2Fl7CM9%2FsOLpiM1WX5QQNnS%2B5ydtOFYKtHlmwg9AgeUX0Rjg%3D%3D; mt=ci=25_1; hng=CN%7Czh-CN%7CCNY%7C156; _m_h5_tk=9564049e168909dda591afc00632fed0_1581060181038; _m_h5_tk_enc=69c0046fffbc1f3750258e3f8fb06eb6; v=0; t=a8c0eb2a0d2265808242379d7f81bf64; cookie2=1f0913be01a97ec8038fb3c5354c793a; _tb_token_=ed66e3be155ef; alitrackid=www.taobao.com; _samesite_flag_=true; sgcookie=Q0iMW4gaGTiQdKuBbG0Q; unb=2514996592; uc3=vt3=F8dBxdzxpjLCa1%2BSc2Y%3D&id2=UU2zVEbkyyO2WQ%3D%3D&nk2=p2NDIgVDkDZm7A%3D%3D&lg2=VT5L2FSpMGV7TQ%3D%3D; csg=5b7ad9c4; cookie17=UU2zVEbkyyO2WQ%3D%3D; dnk=%5Cu58A8%5Cu8FF9%5Cu9519%5Cu54AF%5Cu548C; skt=c0aa8f1a6e59edcd; existShop=MTU4MTU5NTkzOQ%3D%3D; uc4=nk4=0%40pVWU4YQkw8jOJbWHe0nFlK5IE6%2Bq&id4=0%40U2%2F0ltjMwUTM8KkFFqREWIu1Zr5o; _cc_=VFC%2FuZ9ajQ%3D%3D; _l_g_=Ug%3D%3D; sg=%E5%92%8C22; _nk_=%5Cu58A8%5Cu8FF9%5Cu9519%5Cu54AF%5Cu548C; cookie1=BxAV4i9dmFVSXeCQYeZRwnecoaXNB46utcHDDLh6ZgY%3D; lastalitrackid=i.taobao.com; uc1=cookie16=W5iHLLyFPlMGbLDwA%2BdvAGZqLg%3D%3D&cookie21=VT5L2FSpccLuJBreK%2BBd&cookie15=W5iHLLyFOGW7aA%3D%3D&existShop=false&pas=0&cookie14=UoTUO8VjZOZt2g%3D%3D&tag=8&lng=zh_CN; JSESSIONID=214606DAA803E833B2D19064DCFFE666; isg=BFZW9gNpFGQi5iC5GO6s0NHppwxY95oxuW-OAcC_eTnUg_cdPocWQaoxGxdvK5JJ; l=dBLXnIncQINxmF32BOCgC40XkGbTvIRfgukohvEHi_5Kv_8sGz_Oo7aMMEJ6cfWAMjxM4cULng2tieLYJiuKHdGJ4AadZxDDB‘,
          ‘user-agent‘:‘Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36‘}
    try:
        r = requests.get(url, headers=kv,timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""


def parsePage(ilt, html):
    try:
        plt = re.findall(r‘\"view_price\"\:\"[\d\.]*\"‘, html)
        tlt = re.findall(r‘\"raw_title\"\:\".*?\"‘, html)
        for i in range(len(plt)):
            price = eval(plt[i].split(‘:‘)[1])
            title = eval(tlt[i].split(‘:‘)[1])
            ilt.append([price, title])
    except:
        print("")


def printGoodsList(ilt):
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("序号", "价格", "商品名称"))
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1]))


def main():
    goods = ‘中老年服饰‘
    depth = 3
    start_url = ‘https://s.taobao.com/search?q=‘ + goods
    infoList = []
    for i in range(depth):
        try:
            url = start_url + ‘&s=‘ + str(44 * i)
            html = getHTMLText(url)
            parsePage(infoList, html)
        except:
            continue
    printGoodsList(infoList)


main()

 最终爬取的结果如下图所示:

技术图片

 

 问题:

需要间歇的更换cookie,否则会爬几次后爬不到数据

使用python爬取淘宝商品名称和价格

标签:format   val   time   ike   淘宝网   veh   swa   font   int   

原文地址:https://www.cnblogs.com/lijiahaoAA/p/12305034.html

(0)
(0)
   
举报
评论 一句话评论(0
登录后才能评论!
© 2014 mamicode.com 版权所有  联系我们:gaon5@hotmail.com
迷上了代码!