
Reading PPT content with Python

Date: 2021-01-13 11:11:20


import json

from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE

def ppt_catch_format_text(filename):
    """
    Extract the text of a PPT file and return it as a dict that maps
    each slide index to a list of text strings (one per shape).
    filename is the path to the PPT file.
    """
    prs = Presentation(filename)
    txt_oa = {}
    for x, slide in enumerate(prs.slides):
        txt_oa[x] = []
        # --- Text-bearing shapes outside group elements ---
        for shape in slide.shapes:
            if shape.has_text_frame:
                txt_oa[x].append(shape.text.strip())
        # --- Shapes inside group shapes (one level deep only) ---
        group_shapes = [shp for shp in slide.shapes
                        if shp.shape_type == MSO_SHAPE_TYPE.GROUP]
        for group_shape in group_shapes:
            for shape in group_shape.shapes:
                if shape.has_text_frame:
                    txt_oa[x].append(shape.text.strip())
    return txt_oa

text_list = ppt_catch_format_text('report.pptx')
# Dump as readable JSON (non-ASCII kept as-is), then drop escaped newlines (\n) from the output
text_list = json.dumps(text_list, ensure_ascii=False, indent=4).replace("\\n", "")
print(text_list)
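
The script above removes newlines by editing the JSON string after serialization. That can corrupt text containing a literal backslash followed by `n` (for example a Windows path such as `C:\new`), because the replace also matches part of the escaped backslash. A minimal sketch of an alternative, assuming the same `{slide_index: [text, ...]}` structure: strip the newlines from the data first, then serialize. The helper name `dump_slide_text` is hypothetical and not part of the original post.

```python
import json


def dump_slide_text(texts_by_slide):
    """Serialize {slide_index: [text, ...]} to JSON, removing newlines from each text."""
    # Clean the Python strings themselves, so JSON escaping is never touched.
    cleaned = {
        slide: [text.replace("\n", "") for text in texts]
        for slide, texts in texts_by_slide.items()
    }
    return json.dumps(cleaned, ensure_ascii=False, indent=4)


if __name__ == "__main__":
    # Hypothetical sample data standing in for the output of ppt_catch_format_text().
    sample = {0: ["Title\nSubtitle", "C:\\new"]}
    print(dump_slide_text(sample))
```

Note that `json.dumps` converts the integer slide indexes to string keys, so the result loads back with keys such as `"0"`.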

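'''
Vocabulary notes (English term, pronunciation, Chinese meaning):
Presentation /priːzenˈteɪʃn/: presentation (演示)
slides /slaɪdz/: slides (幻灯片)
shape /ʃeɪp/: shape (形状)

'''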
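
The function in the script looks only one level into group shapes, so text inside a group nested within another group is missed. The following is a rough sketch of a recursive variant, not the original author's method. It assumes that only group shapes expose a `.shapes` attribute, which is how it detects them, and unlike the original it skips shapes whose text is empty. The names `iter_shape_text` and `slide_texts` are hypothetical.

```python
def iter_shape_text(shapes):
    """Yield stripped, non-empty text from every shape, descending into groups."""
    for shape in shapes:
        if hasattr(shape, "shapes"):
            # Assumption: only group shapes carry a nested .shapes collection.
            yield from iter_shape_text(shape.shapes)
        elif getattr(shape, "has_text_frame", False):
            text = shape.text_frame.text.strip()
            if text:
                yield text


def slide_texts(filename):
    """Return {slide_index: [text, ...]} for the .pptx file at filename."""
    # Imported here so iter_shape_text itself has no dependency on python-pptx.
    from pptx import Presentation

    prs = Presentation(filename)
    return {
        index: list(iter_shape_text(slide.shapes))
        for index, slide in enumerate(prs.slides)
    }
```

Calling `slide_texts('report.pptx')` would give the same kind of dict as `ppt_catch_format_text`, with text from groups at any nesting depth included.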

Original article: https://www.cnblogs.com/primula/p/14264645.html
