
Reading and Writing Excel with Python

Posted: 2017-12-05 17:37:37


Recently my boss asked me to pull some data from a website. Doing it by hand was too slow, so I found some Python snippets online and put together a script.

import re

import xlrd
import requests
from bs4 import BeautifulSoup
from xlutils.copy import copy


def read_excel(path):
    # Open the workbook
    workbook = xlrd.open_workbook(path)

    # Get sheet contents by index or by name (sheet indexes start at 0)
    sheet1 = workbook.sheet_by_index(0)

    pattern1 = r"^https://ews-aln-core\.cisco\.com/applmgmt/view-appl/[0-9]+$"
    pattern2 = r"^https://ews-aln-core\.cisco\.com/applmgmt/view-endpoint/[0-9]+$"
    pattern3 = r"^https://ews-aln-core\.cisco\.com/applmgmt/view-appl/by-name/"

    for i in range(sheet1.nrows):
        url = sheet1.cell_value(i, 0).replace("'", "")  # strip stray quotes
        print(url, i)
        response = get_responseHtml(url)
        soup = get_beautifulSoup(response)
        priority = None
        if pattern_match(url, pattern1) or pattern_match(url, pattern3):
            priority = (soup.find("table", class_="main_table_layout")
                            .find("tr", class_="centered sub_section_header")
                            .find_next("tr", align="center")
                            .find_all("td"))
        elif pattern_match(url, pattern2):
            priority = (soup.find("table", class_="main_table_layout")
                            .find("tr", class_="centered")
                            .find_next("tr", align="center")
                            .find_all("td"))
        else:
            print("no pattern")
        try:
            priorityNumber = "P" + get_last_td(priority)
        except Exception:
            print("not found: " + url)
            continue  # nothing to write for this row
        write_excel(path, i, 1, priorityNumber)


def write_excel(path, row, col, value):
    # xlrd workbooks are read-only; xlutils.copy() returns a writable copy
    oldwb = xlrd.open_workbook(path)
    wb = copy(oldwb)
    ws = wb.get_sheet(0)
    ws.write(row, col, value)
    wb.save(path)


def get_last_td(result):
    # Text of the last <td> cell in the row
    return result[-1].contents[0]


def get_beautifulSoup(request):
    return BeautifulSoup(request, "html.parser", from_encoding="utf-8")


def get_responseHtml(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/56.0.2924.87 Safari/537.36"}
    return requests.get(url, auth=(userName, passWord), headers=headers).content


def pattern_match(s, pattern, flags=0):
    return re.match(pattern, s, flags)


if __name__ == "__main__":
    userName = "*"   # credentials redacted in the original post
    passWord = "*"
    path = r"*"      # path to the source .xls file
    read_excel(path)

There were quite a few pitfalls along the way:

  1. At first I used an .xlsx file, and the saved file couldn't be opened afterwards; switching the Excel format to .xls fixed it.
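The .xls constraint comes from xlwt, which only emits the legacy binary format; a minimal sketch (the file name and cell value here are illustrative, not from the original script):

```python
import xlwt

# xlwt only writes the legacy BIFF (.xls) format; saving the workbook
# under an .xlsx name produces a file that Excel refuses to open.
wb = xlwt.Workbook()
ws = wb.add_sheet("Sheet1")
ws.write(0, 0, "P1")   # row 0, column 0
wb.save("demo.xls")    # the extension must be .xls
```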

  2. The header came from a post online; sending it keeps the request from being treated as a web crawler, which had been causing: http.client.RemoteDisconnected: Remote end closed connection without response.
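A sketch of attaching that header with requests; a prepared request is used here only so the outgoing headers can be inspected without touching the network (example.com is a stand-in URL):

```python
import requests

# A browser-style User-Agent keeps some servers from treating the
# client as a crawler and closing the connection without a response.
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/56.0.2924.87 Safari/537.36"}

# Prepare (but do not send) a request to inspect what would go out
req = requests.Request("GET", "https://example.com", headers=headers).prepare()
print(req.headers["User-Agent"][:11])  # Mozilla/5.0
```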

  3. The argument to copy() must be a workbook object, not the .xls file name, otherwise it raises: AttributeError: 'str' object has no attribute 'datemode'.

  4. Found a very good blog post on appending data to an existing Excel .xls file in Python, i.e. opening an Excel file and writing new data into it.

  5. At first I tried saving to a new file under a new path, but that doesn't work: every pass of the for loop copies from the source Excel again, so in the end only one row was actually inserted.

  6. Regular expression syntax: Regular Expressions - Syntax, and Python regular expressions.

  7. Usage of Beautiful Soup in Python, a very complete reference: the Beautiful Soup 4.2.0 documentation.

  8. A demo that scrapes a novel: Python3 Web Crawler (7): Scraping a Novel with Beautiful Soup.
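A quick sketch of how the script's URL patterns behave with re.match (the id 12345 is made up; the dots in the host are escaped here, which the original patterns omitted):

```python
import re

# The two exact-match URL shapes the script distinguishes
pattern1 = r"^https://ews-aln-core\.cisco\.com/applmgmt/view-appl/[0-9]+$"
pattern2 = r"^https://ews-aln-core\.cisco\.com/applmgmt/view-endpoint/[0-9]+$"

url = "https://ews-aln-core.cisco.com/applmgmt/view-appl/12345"  # made-up id
print(bool(re.match(pattern1, url)))  # True: view-appl followed by digits
print(bool(re.match(pattern2, url)))  # False: different path segment
```

Note that re.match only anchors at the start of the string, which is why the exact-match patterns also need the trailing `$`.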

  9. I had never written Python before; this first attempt took half a day, and there is still plenty of room for improvement.

Original source: http://www.cnblogs.com/lizhang4/p/7988065.html
