
Advanced Python Web Scraping



Fetch the names of the first 25 movies on page 1 of Douban's Top 250, https://movie.douban.com/top250.

My answer:

import requests
from bs4 import BeautifulSoup

head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36"}
res = requests.get("https://movie.douban.com/top250", headers=head)
soup = BeautifulSoup(res.content, "html.parser")

# Page 1 lists 25 movies; pick each title by its position in the <ol>.
for i in range(1, 26):
    get = soup.select("#content > div > div.article > ol > li:nth-child(%s) > div > div.info > div.hd > a > span:nth-child(1)" % i)
    print(get[0].string)
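The per-position loop is not strictly needed: assuming the page keeps each primary title as the first <span> inside the div.hd link, one CSS query returns all 25 at once. A minimal sketch, reusing the soup built above:

# Hedged alternative: select every first-title span in a single query.
for span in soup.select("div.hd > a > span:first-child"):
    print(span.string)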





Scrape the names of all 250 movies from https://movie.douban.com/top250.

My answer:

import requests
from bs4 import BeautifulSoup

head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"}

# The 250 movies span 10 pages; the start parameter offsets the list by 25 per page.
for j in range(0, 10):
    res = requests.get("https://movie.douban.com/top250?start=%s" % (25 * j), headers=head)
    soup = BeautifulSoup(res.content, "html.parser")
    for i in range(1, 26):
        get = soup.select("#content > div > div.article > ol > li:nth-child(%s) > div > div.info > div.hd > a > span:nth-child(1)" % i)
        print(get[0].string)
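requests can also build the query string itself through its params argument, which avoids assembling start= by hand; a minimal sketch of the request line inside the loop above:

# Equivalent request; requests encodes ?start=... from the params dict.
res = requests.get("https://movie.douban.com/top250",
                   params={"start": 25 * j},
                   headers=head)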




Scrape all of the images at http://www.netbian.com/s/huyan/index.htm into local files, saving them as 第1张.jpg, 第2张.jpg, and so on.

My answer:


import requests
from bs4 import BeautifulSoup

head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36"}
res = requests.get("http://www.netbian.com/s/huyan/index.htm", headers=head)
soup = BeautifulSoup(res.text, "html.parser")

# The thumbnails sit in list items 4-19; read each <img> src and download it.
for n, i in enumerate(range(4, 20), start=1):
    get = soup.select("#main > div.list > ul > li:nth-child(%s) > a > img" % i)
    url = get[0]["src"]               # the image address is the src attribute
    print(url)
    res1 = requests.get(url, headers=head)
    with open("D:\\Project\\第%s张.jpg" % n, "wb") as f:
        f.write(res1.content)
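If the list positions ever shift, selecting every <img> under the list in one query is more robust; a sketch under the same page-structure assumption, reusing the soup and head above:

import os
os.makedirs("D:\\Project", exist_ok=True)   # make sure the target folder exists
for n, img in enumerate(soup.select("#main > div.list > ul li a img"), start=1):
    data = requests.get(img["src"], headers=head).content
    with open("D:\\Project\\第%s张.jpg" % n, "wb") as f:
        f.write(data)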




URL: https://www.kugou.com/yy/rank/home/1-8888.html

Scrape the song titles and artist info from the first page.

My answer:

import requests
from lxml import etree

head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36"}
res = requests.get("https://www.kugou.com/yy/rank/home/1-8888.html", headers=head)
sele = etree.HTML(res.text)

# The first page lists 22 entries of the form "singer - title"; the XPath
# string must be quoted, which the original code omitted.
for i in range(1, 23):
    temp = sele.xpath('//*[@id="rankWrap"]/div[2]/ul/li[%s]/a/text()' % i)
    print(temp)
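One XPath call can return every row at once instead of indexing li by li; a minimal sketch against the same page, reusing sele from above:

# Hedged variant: all "singer - title" texts in a single query.
for entry in sele.xpath('//*[@id="rankWrap"]/div[2]/ul/li/a/text()'):
    print(entry.strip())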






Using a regular expression, match every occurrence of "zz followed by digits" and print the matches.

string="asdasdzz234234adas,asdasdzz2348weqesad,zz657878asd"



My answer:

import re

line = "asdasdzz234234adas,asdasdzz2348weqesad,zz657878asd"
pat = re.compile(r"zz\d+")   # "zz" followed by one or more digits
date = pat.findall(line)
print(date)
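Running this prints every match in order: ['zz234234', 'zz2348', 'zz657878'].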




Using a user-agent pool, scrape the titles and artist info of all 500 songs in the TOP500 chart at https://www.kugou.com/yy/rank/home/1-8888.html.



My answer:

import requests
from lxml import etree
import random

# Pool of user-agent strings; one is drawn at random for each request.
uapools = [
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)"
]

def ua():
    thisua = random.choice(uapools)
    head = {"User-Agent": thisua}
    return head

# Pages 1-22 each list 22 songs (484 in total). Call ua() per request so the
# pool is actually used, rather than picking one user agent once up front.
for j in range(1, 23):
    res = requests.get("https://www.kugou.com/yy/rank/home/%s-8888.html" % j, headers=ua())
    sele = etree.HTML(res.text)
    for i in range(1, 23):
        temp = sele.xpath('//*[@id="rankWrap"]/div[2]/ul/li[%s]/a/text()' % i)
        print(temp)

# Page 23 holds the remaining 16 songs; fetch it once, not once per song.
res1 = requests.get("https://www.kugou.com/yy/rank/home/23-8888.html", headers=ua())
sele = etree.HTML(res1.text)
for n in range(1, 17):
    temp1 = sele.xpath('//*[@id="rankWrap"]/div[2]/ul/li[%s]/a/text()' % n)
    print(temp1)
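The two loops can be folded into one by deriving each page's row count; a sketch reusing the ua() helper above, assuming the layout stays at 22 rows per page with 16 on page 23:

for page in range(1, 24):
    rows = 22 if page < 23 else 16   # per-page counts taken from the exercise
    res = requests.get("https://www.kugou.com/yy/rank/home/%s-8888.html" % page,
                       headers=ua())
    sele = etree.HTML(res.text)
    for entry in sele.xpath('//*[@id="rankWrap"]/div[2]/ul/li/a/text()')[:rows]:
        print(entry.strip())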






Chat with the bot in the format shown below, converting {br} in the replies into a newline:

我: kitten

小K: 崔燚

我: hello

小K: {face:14}Hi~

我: tell me a joke

小K: ★ On international theory{br}A businessman boarded a plane and was lucky enough to find himself seated next to a beautiful woman. After a brief exchange of greetings, he noticed she was reading a handbook of sex statistics, so he asked her about the book, and she replied:{br}

我:



My answer:

import requests
import urllib.request
import json

shuru = input("我:")
while shuru != "0":                       # type 0 to end the chat; the original
                                          # compared a string to the int 0 and never exited
    key = urllib.request.quote(shuru)     # URL-encode the message
    res = requests.get("http://api.qingyunke.com/api.php?key=free&appid=0&msg=" + key)
    ddd = json.loads(res.text)
    s = ddd["content"]
    s_replace = s.replace("{br}", "\n")   # the API marks line breaks as the literal {br}
    print("小K:", s_replace)
    shuru = input("我:")
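Passing the message through requests' params argument lets the library handle the URL encoding, so urllib.request.quote is unnecessary; a minimal sketch of the request-and-parse step inside the same loop:

# Equivalent call; requests encodes msg= and res.json() parses the JSON body.
res = requests.get("http://api.qingyunke.com/api.php",
                   params={"key": "free", "appid": 0, "msg": shuru})
reply = res.json()["content"].replace("{br}", "\n")
print("小K:", reply)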



Original post: https://www.cnblogs.com/toooof/p/14254086.html
