Background: at work I needed to convert a number of Word documents (.docx) in a folder into corresponding .txt files. The job is therefore to read the text out of each .docx file and save it as .txt.

The Python module needed is python-docx: https://python-docx.readthedocs.io/en/latest/index.html. The package is installed as python-docx but imported as docx. It can only read .docx files, not .doc files. Note that PyPI also hosts a separate library named docx, which is no longer updated and is not recommended. http://www.cnblogs.com/geek-arking/p/9300617.html

The approach above can only read .docx files. Reading a .doc file raises an error:

docx.opc.exceptions.PackageNotFoundError: Package not found

This happens because the .doc format is not recognized. "Changing the extension does not change how the file is encoded, so the text still cannot be read. When you meet a .doc file, use Word to save it as .docx and then read its content with python-docx."

For .doc files that need converting, the material online all uses win32, which requires installing pypiwin32: https://www.cnblogs.com/AlgorithmDot/p/3386918.html. With that approach, converting .doc directly to .txt sometimes works and sometimes raises an error. So consider converting the .doc file to .docx first and then reading it into .txt with the approach above. If you simply rename a .doc file to .txt or .docx by hand, the file shows garbled text when opened. Word's SaveAs method, however, does the same thing as a manual "Save As" to .docx, and the converted .docx document then opens correctly:

doc.SaveAs(tmp + '.docx', 16)

The meaning of 16 is explained below. Calling the Office API directly through the win32com interface has the advantage of being simple and compatible: anything Office can handle, Python can handle, and the result is identical to "Save As" in Office Word. Here is the full table of file format constants supported by Office 2007:

wdFormatDocument = 0
wdFormatDocument97 = 0
wdFormatDocumentDefault = 16
wdFormatDOSText = 4
wdFormatDOSTextLineBreaks = 5
wdFormatEncodedText = 7
wdFormatFilteredHTML = 10
wdFormatFlatXML = 19
wdFormatFlatXMLMacroEnabled = 20
wdFormatFlatXMLTemplate = 21
wdFormatFlatXMLTemplateMacroEnabled = 22
wdFormatHTML = 8
wdFormatPDF = 17
wdFormatRTF = 6
wdFormatTemplate = 1
wdFormatTemplate97 = 1
wdFormatText = 2
wdFormatTextLineBreaks = 3
wdFormatUnicodeText = 7
wdFormatWebArchive = 9
wdFormatXML = 11
wdFormatXMLDocument = 12
wdFormatXMLDocumentMacroEnabled = 13
wdFormatXMLTemplate = 14
wdFormatXMLTemplateMacroEnabled = 15
wdFormatXPS = 18
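As a rough illustration of the SaveAs call with format 16 (wdFormatDocumentDefault), the sketch below wraps it in a small helper. It assumes Windows with Microsoft Word and pywin32 (pypiwin32) installed. The names docx_path_for and convert_doc_to_docx are made up for this example and are not part of any library.

```python
import os

WD_FORMAT_DOCUMENT_DEFAULT = 16  # wdFormatDocumentDefault, i.e. save as .docx


def docx_path_for(doc_path):
    """Return the absolute .docx path that sits next to the given .doc file."""
    root, _ext = os.path.splitext(os.path.abspath(doc_path))
    return root + ".docx"


def convert_doc_to_docx(doc_path):
    """Ask Word to open a .doc file and save it as .docx; return the new path."""
    # Lazy import: win32com is only available on Windows (assumption: pywin32 installed).
    from win32com import client

    target = docx_path_for(doc_path)
    word = client.Dispatch("Word.Application")
    try:
        # Word resolves relative paths against its own working directory,
        # so an absolute path is passed here.
        doc = word.Documents.Open(os.path.abspath(doc_path))
        try:
            doc.SaveAs(target, WD_FORMAT_DOCUMENT_DEFAULT)
        finally:
            doc.Close()
    finally:
        word.Quit()
    return target
```

Calling convert_doc_to_docx on a .doc file should leave a .docx file beside it, which python-docx can then read.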
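Going by their literal names, the constants should map onto the corresponding file formats.

1. Create or open a file. This is fairly simple: use the Document class from docx. If a path is given, the document is opened; if no path is given, a new document is created.
2. Save a file. Where there is open, there is save. Use the save method of the Document class; its argument is the file path to save to, or a file stream to write to. Usually a path is enough.
    doc.save(path_or_stream)
3. Object collections. python-docx includes the relevant object collections of a Word document.
4. Insert a paragraph. The paragraph is one of the most basic objects in Word.
5. Add a style. The help documentation does not cover this in detail, and it is only in English. A project I had on hand needed it, so I worked out how to use it myself.
6. Apply character styles. Characters naturally live inside paragraphs, so the following approach appends text to a paragraph and sets character styles.
    # Insert an empty paragraph
    p = doc.add_paragraph('')
    p.add_run('123', style="Heading 1 Char")
    p.add_run('456')
    p.add_run('789', style="Heading 2 Char")
    # The paragraph now uses two character styles; the middle "456" has no style applied
    print(p.text)  # the text is 123456789, still contiguous
7. Set fonts. Instead of going through styles, some characters can also be formatted directly.
    p = doc.add_paragraph('')
    r = p.add_run('123')
    r.font.bold = True  # bold
    r.font.italic = True  # italic
   And so on.
8. Table operations. Tables are another frequently used object type.

PDF files must be opened with a PDF reader, and most PDF readers do not support editing the text of the document once it is open. If a PDF can be converted to a plain .txt file, reading and editing it becomes much easier. There are already many PDF conversion programs on the market, but most of them require payment. Once you learn to convert files to .txt with Python, though, it takes only a few short lines of code.

# -*- coding:utf-8 -*-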
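from win32com import client as wc
import os

# Password used to open protected documents (placeholder value)
key = 'document password'


def Translate(input_path, output_path):
    # Convert one Word document to a .txt file
    wordapp = wc.Dispatch('Word.Application')
    try:
        doc = wordapp.Documents.Open(input_path, False, False, False, key)
        # FileFormat=4 (wdFormatDOSText) so that Python can later read the
        # .txt file in 'r' mode without garbled characters
        doc.SaveAs(FileName=output_path, FileFormat=4, Encoding="gb2312")
        doc.Close()
        print(input_path, "done")
        os.remove(input_path)  # note: deletes the source document
    except Exception:
        # Usually a wrong password, but any other COM error also lands here
        print(input_path, "wrong password")


if __name__ == '__main__':
    # Folder containing the Word documents
    path = r"C:\Users\docx"
    # Folder for the .txt output
    out_dir = r"C:\Users\txt"
    j = 0
    for file in os.listdir(path):
        # splitext handles both .doc and .docx; split(".docx") left ".doc" in the name
        name, ext = os.path.splitext(file)
        if ext.lower() not in ('.doc', '.docx'):
            continue
        input_file = os.path.join(path, file)
        output_file = os.path.join(out_dir, name + ".txt")
        Translate(input_file, output_file)
        j = j + 1
        print(j)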
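
For the .docx files described in the background, Word itself is not needed: python-docx can read the paragraphs directly. The following is a minimal sketch of that route, assuming python-docx is installed. The function names join_paragraphs, docx_to_txt and convert_folder are hypothetical, and only body paragraphs are extracted (text inside tables, headers and footers is not included).

```python
from pathlib import Path


def join_paragraphs(paragraph_texts):
    """Join paragraph strings into one text body, one paragraph per line."""
    return "\n".join(paragraph_texts)


def docx_to_txt(docx_path, txt_path, encoding="utf-8"):
    """Read the body paragraphs of a .docx file and write them to a .txt file."""
    # Lazy import so join_paragraphs stays usable without python-docx installed.
    from docx import Document  # provided by the python-docx package

    document = Document(str(docx_path))
    text = join_paragraphs(p.text for p in document.paragraphs)
    Path(txt_path).write_text(text, encoding=encoding)


def convert_folder(folder):
    """Convert every .docx file in `folder` to a .txt file next to it."""
    for docx_path in sorted(Path(folder).glob("*.docx")):
        if docx_path.name.startswith("~$"):
            continue  # skip the lock files Word creates while a document is open
        docx_to_txt(docx_path, docx_path.with_suffix(".txt"))
```

Unlike the win32com script, this sketch leaves the source files in place and writes UTF-8 rather than gb2312; pass a different encoding to docx_to_txt if another one is needed.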