
Data Processing (HTML to PDF)

Posted: 2018-01-13 20:44:31


Scraping website content and saving it as PDF

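1. Install the PDF dependency: pip install pdfkit

pdfkit is only a wrapper around the wkhtmltopdf executable, so with nothing but the pip package installed, calling pdfkit still fails with an error: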

Traceback (most recent call last):
  File "C:\Users\zhan\AppData\Roaming\Python\Python36\site-packages\pdfkit\configuration.py", line 21, in __init__
    with open(self.wkhtmltopdf) as f:
FileNotFoundError: [Errno 2] No such file or directory: b''

During handling of the above exception, another exception occurred:

OSError: No wkhtmltopdf executable found: "b''"
If this file exists please check that this process can read it. Otherwise please install wkhtmltopdf - https://github.com/JazzCore/python-pdfkit/wiki/Installing-wkhtmltopdf

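As the message suggests, download wkhtmltopdf from its official site, install it, and note the install path.

pdfkit can then be pointed at the executable like this: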

import pdfkit
path_wk = r'D:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'  # install location
config = pdfkit.configuration(wkhtmltopdf=path_wk)
pdfkit.from_string("hello world", "1.pdf", configuration=config)
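
If wkhtmltopdf is already on the PATH, the hard-coded install path may not be needed. The following is a minimal sketch, assuming a hypothetical helper named find_wkhtmltopdf (it is not part of pdfkit) that checks the PATH first and then falls back to an install location you pass in:

```python
import os
import shutil


def find_wkhtmltopdf(fallback=None):
    """Return a path to the wkhtmltopdf executable, or None if not found.

    Hypothetical helper, not part of pdfkit: it checks the PATH first,
    then an optional fallback install location.
    """
    found = shutil.which("wkhtmltopdf")
    if found:
        return found
    if fallback and os.path.isfile(fallback):
        return fallback
    return None
```

If the helper returns a path, it can be passed to pdfkit.configuration(wkhtmltopdf=...) exactly as above; if it returns None, wkhtmltopdf still needs to be installed.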

With the setup complete, here is the implementation:

#!/usr/bin/env python
# coding: utf8
import os
import re

import pdfkit
import requests

class HtmlToPdf():
    def __init__(self):
        # Install location of the wkhtmltopdf executable
        self.path_wk = r'D:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'
        self.config = pdfkit.configuration(wkhtmltopdf=self.path_wk)
        self.url = "http://www.apelearn.com/study_v2/"
        # self.reg = re.compile(r'<li class="toctree-l1"><a.*?href="(.*?)">.*?</a></li>')
        # Capture (href, link text) for every top-level chapter entry
        self.reg = re.compile(r'<li class="toctree-l1"><a.*?href="(.*?)">(.*?)</a></li>')
        self.dirName = "aminglinuxbook"
        self.result = ""
        self.chapter = ""
        self.chapter_content = ""

    def get_html(self):
        s = requests.session()
        response = s.get(self.url)
        response.encoding = 'utf-8'
        text = self.reg.findall(response.text)
        self.result = list(set(text))

    def get_pdfdir(self):
        if not os.path.exists(self.dirName):
            os.makedirs(self.dirName)

    def get_chapter(self):
        self.get_pdfdir()
        for chapter in self.result:
            pdfFileName = "{0}-{1}.pdf".format(chapter[0].split('.')[0], chapter[1])
            # pdfFileName = chapter[0].replace("html", "pdf")
            pdfUrl = "{0}{1}".format(self.url, chapter[0])
            filePath = os.path.join(self.dirName, pdfFileName).strip()
            print(pdfUrl)
            print(filePath)
            try:
                pdfkit.from_url(pdfUrl, filePath, configuration=self.config)
            except Exception as e:
                print(e)

def main():
    html2pdf = HtmlToPdf()
    html2pdf.get_html()
    html2pdf.get_chapter()

if __name__ == "__main__":
    main()
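
The link extraction and file naming can be tried without touching the network. The sketch below reuses the script's regular expression on made-up sample markup (SAMPLE_HTML is an assumption, not the real index page). It also illustrates two optional variations that are not in the original script: dict.fromkeys removes duplicates while keeping page order (list(set(...)) does not preserve order), and a hypothetical pdf_name helper replaces characters that Windows rejects in file names.

```python
import re
from urllib.parse import urljoin

# Same pattern as in the script: captures (href, link text) per chapter entry.
LINK_RE = re.compile(r'<li class="toctree-l1"><a.*?href="(.*?)">(.*?)</a></li>')

# Made-up sample markup standing in for the real index page.
SAMPLE_HTML = (
    '<li class="toctree-l1"><a class="reference internal" href="chapter1.html">Intro</a></li>\n'
    '<li class="toctree-l1"><a class="reference internal" href="chapter2.html">Install: step 1/2</a></li>\n'
    '<li class="toctree-l1"><a class="reference internal" href="chapter1.html">Intro</a></li>\n'
)


def extract_chapters(html):
    """Return unique (href, title) pairs in page order."""
    return list(dict.fromkeys(LINK_RE.findall(html)))


def pdf_name(href, title):
    """Build '<stem>-<title>.pdf', replacing characters Windows rejects in file names."""
    stem = href.split('.')[0]
    safe_title = re.sub(r'[\\/:*?"<>|]', '_', title).strip()
    return "{0}-{1}.pdf".format(stem, safe_title)


BASE_URL = "http://www.apelearn.com/study_v2/"
for href, title in extract_chapters(SAMPLE_HTML):
    # urljoin resolves the relative href against the index URL.
    print(urljoin(BASE_URL, href), pdf_name(href, title))
```

Each resulting URL and file name pair could then be passed to pdfkit.from_url in the same way get_chapter does.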

Output:

[Screenshot: console output of the run]

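The downloaded PDF files can then be seen in the output directory:

[Screenshot: directory listing of the generated PDF files]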
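
The same check can be done from Python instead of a file browser. This is a small sketch, assuming a hypothetical list_pdfs helper and the aminglinuxbook directory name used by the script:

```python
from pathlib import Path


def list_pdfs(dir_name):
    """Return (file name, size in bytes) for each PDF in dir_name, sorted by name."""
    folder = Path(dir_name)
    if not folder.is_dir():
        return []
    return [(p.name, p.stat().st_size) for p in sorted(folder.glob("*.pdf"))]


for name, size in list_pdfs("aminglinuxbook"):
    print("{0}\t{1} bytes".format(name, size))
```

A file with a size of zero bytes would suggest that the conversion for that chapter failed.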


Original article: https://www.cnblogs.com/pythonlx/p/8280155.html
