Detecting corrupt document files with python-docx

Can you please help me figure this out?
When reading a .docx file with python-docx (docx.Document(file_name)), how can I detect whether the file is valid or corrupt?
I've got some cases where the input .docx files are either empty or corrupt.
How can I flag these cases using this library?

There is no such feature in python-docx. Part of the reason is that while a file could be determined to be valid or invalid according to the schema in the ISO specification, many small discrepancies are permitted by each client. What is permitted varies between clients; some things that LibreOffice will accept produce a repair error in Microsoft Word, for example.
The only reliable way to determine this is to attempt to open the file with the target client, perhaps using automation like VBA in the case of Microsoft Word.
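Short of opening the file in the target client, a coarse structural pre-check can still flag empty or badly damaged files. The following is a minimal sketch using only the standard library, not a python-docx feature. The function name check_docx is hypothetical, and the sketch assumes the main document part is stored at word/document.xml, which is the usual convention but is not guaranteed. Passing this check does not mean Word will open the file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Parts this sketch assumes a word-processing package contains.
REQUIRED_PARTS = ("[Content_Types].xml", "word/document.xml")


def check_docx(source):
    """Return a list of problems found; an empty list means none were detected.

    `source` is a path or a binary file-like object. This is a structural
    check only and says nothing about whether Word will accept the file.
    """
    try:
        with zipfile.ZipFile(source) as archive:
            # testzip() returns the first member with a bad CRC, or None.
            bad_member = archive.testzip()
            if bad_member is not None:
                return [f"damaged zip member: {bad_member}"]
            names = set(archive.namelist())
            missing = [p for p in REQUIRED_PARTS if p not in names]
            if missing:
                return [f"missing part: {p}" for p in missing]
            try:
                ET.fromstring(archive.read("word/document.xml"))
            except ET.ParseError as exc:
                return [f"word/document.xml is not well-formed: {exc}"]
    except (zipfile.BadZipFile, OSError) as exc:
        # Empty files and non-zip data end up here.
        return [f"not a readable zip archive: {exc}"]
    return []
```

A file that passes can then be handed to docx.Document(file_name); a file that fails can be flagged without attempting to load it.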

Related

How to extract all files from a p7m file

I have a bunch of p7m files (used to digitally sign some files, usually pdf files) and I would like some help to find a way to extract the content. I know how to iterate a process over the files in a folder using Python, I need help just with the extraction part.
I tried PyPDF2.PdfFileReader.decrypt(), but I get an "EOF marker not found" error, apparently because PyPDF2 cannot handle encrypted files.
I saw somebody used the mime library, but that is way above my level honestly.
Thank you

saving back to a docx xml file with python

I am trying to perform find and replace on docx files while still maintaining formatting.
From reading up on this, the best way seems to be performing the find/replace on the XML file of the document.
I can load the XML file and find/replace on it, but I am unsure how to write it back.
docx:
Hello {text}!
python:
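import zipfile
# A .docx file is a zip archive; read the main document part from it.
with zipfile.ZipFile('test.docx') as archive:
    xmlString = archive.read('word/document.xml').decode('utf-8')
# str.replace returns a new string, so keep the result.
xmlString = xmlString.replace('{text}', 'world')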

What you are trying is dangerous, because you are processing a high-level docx file at a low level. If you really want to do it, use the hints from "overwriting file in ziparchive" as suggested by @shahvishal.
But unless you know all the details of the docx format, my advice is: do not do that. Suppose the string {text} happens to occur in an internal field or attribute. You are likely to change the file in an unexpected way, leading immediately, or even worse later, to the destruction of the file (Word no longer being able to process it).
If you do your processing on a Windows machine with Word installed, you could try to use automation to process the file with Microsoft Word. Unfortunately, I only did that a long time ago and cannot give useful links. You will need:
general knowledge of the pywin32 module
sufficient knowledge of the Automation interface of MS Word. Hopefully its documentation is good, with many examples, provided you have a full installation of Microsoft Office including macro help

You're really going to want to use a library for reading/writing docx files rather than trying to just deal with them as raw XML. A cursory search came up with the pypi module docx but I haven't used this module so I can't endorse it:
https://pypi.python.org/pypi/docx/0.2.4
I've had the (unfortunate) experience of dealing with the manipulation of MS Office documents from other programming languages, and spending the time to find good libraries really paid off.
The old saying goes "don't reinvent the wheel" and I think that's definitely true when manipulating non-trivial file formats. If a somewhat mature library exists to do the job, use it!

You would need to replace the file in the zip archive. There is no "simple" way of achieving this. The following is a question that should help:
overwriting file in ziparchive
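Since the zipfile module cannot overwrite a member in place, the usual workaround is to copy every member into a new archive, substituting the changed part, and then swap the files. A minimal sketch of that approach follows; the function name replace_part is hypothetical, and it assumes the caller has already produced the new bytes for the part.

```python
import os
import tempfile
import zipfile


def replace_part(docx_path, part_name, new_data):
    """Rewrite the archive at docx_path with part_name's content replaced."""
    folder = os.path.dirname(os.path.abspath(docx_path))
    fd, tmp_path = tempfile.mkstemp(suffix=".docx", dir=folder)
    os.close(fd)
    try:
        with zipfile.ZipFile(docx_path) as src, \
                zipfile.ZipFile(tmp_path, "w", zipfile.ZIP_DEFLATED) as dst:
            for item in src.infolist():
                # Copy every member unchanged except the one being replaced.
                if item.filename == part_name:
                    data = new_data
                else:
                    data = src.read(item.filename)
                dst.writestr(item, data)
        # Swap the rewritten archive into place only after it is complete.
        os.replace(tmp_path, docx_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With the question's variables, the call would look like replace_part('test.docx', 'word/document.xml', xmlString.encode('utf-8')). Writing to a temporary file first means a failure part-way through leaves the original document untouched.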

Can you modify only a text string in an XML file and still maintain integrity and functionality of .docx encasement?

I want to enter data into a Microsoft Excel Spreadsheet, and for that data to interact and write itself to other documents and webforms.
With success, I am pulling data from an Excel spreadsheet using xlwings. Right now, I’m stuck working with .docx files. The goal here is to write the Excel data into specific parts of a Microsoft Word .docx file template and create a new file.
My specific question is:
Can you modify just a text string(s) in a word/document.xml file and still maintain the integrity and functionality of its .docx encasement? It seems that there are numerous things that can change in the XML code when making even the slightest change to a Word document. I've been working with python-docx and lxml, but I'm not sure if what I seek to do is possible via this route.
Any suggestions or experiences to share would be greatly appreciated. I feel I've read every article that is easily discoverable through a google search at least 5 times.
Let me know if anything needs clarification.
Some things to note:
I started getting into coding about 2 months ago. I’ve been doing it intensively for that time and I feel I’m picking up the essential concepts, but there are severe gaps in my knowledge.
Here are my tools:
Yosemite 10.10,
Microsoft Office 2011 for Mac

You probably need to be more specific, but the short answer is, in principle, yes.
At a certain level, all python-docx does is modify strings in the XML. A couple things though:
The XML you create needs to remain well-formed and valid according to the schema. So if you change the text enclosed in a <w:t> element, for example, that works fine. Conversely, if you inject a bunch of random XML at an arbitrary point in one of the .xml parts, that will corrupt the file.
The XML "files", known as parts, that make up a .docx file are contained in a Zip archive known as a package. You must unpackage and repackage that set of parts properly in order to have a valid .docx file afterward. python-docx takes care of all those details for you, but if you're going directly at the .docx file you'll need to take care of that yourself.
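To illustrate the first point, here is a minimal sketch of a string replacement that keeps the XML well-formed. The function name replace_placeholder is hypothetical. It escapes the inserted value and then re-parses the result; note that this checks well-formedness only, not validity against the schema, and that a placeholder typed in Word may be split across several <w:t> elements, in which case a plain string search will not find it.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def replace_placeholder(xml_text, placeholder, value):
    """Replace placeholder with an XML-escaped value and verify the result parses."""
    # escape() turns &, < and > into entities so they cannot break the markup.
    updated = xml_text.replace(placeholder, escape(value))
    # Raises xml.etree.ElementTree.ParseError if the result is not well-formed.
    ET.fromstring(updated.encode("utf-8"))
    return updated
```

Without the escaping step, a value such as "a < b" would inject a stray "<" into the part and corrupt the file.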

How to parse just the text from a Word Doc using Python?

When you try opening a MS Word document or for that matter most Windows file formats, you will see gibberish as given below broken intermittently by the actual text. I need to extract the text that goes in and want to ignore the gibberish -- which is something like given below. How do I extract only the text that matters, and ignore rest of the stuff. Please advise.
Here's a sample of open("sample.doc", "r").read() of a Word doc. Thanks
00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00In an Interesting news,his is the first time we polled Indian channel community for their preferred memory supplier. Transcend came a close second, was seen to be more popular among class A city based resellers, was also the most recalled memory brand among customers according to resellers. However Transcend channels complained of parallel imports and constant unavailability of the products in grey x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x

The tool that seems the most viable, particularly if you need an all python solution is OleFileIO.

doc is a binary format, it's not a markup language or something.
Specs: http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx

There is no generic way to extract information from every file format. You need to know the format to know how to extract the information.
Just wanted to state that first. So what you should look for is libraries and software that can convert or extract the information you want. And as mentioned by Ofir, Microsoft has tools for that for their formats.
But if you cannot do this, and want to take the chance that there is visible text in the file that is interesting to read, you could do a normal read and look for sequences of bytes that form text. Then comes the question: which languages and charsets should I support in my hunt for text? Is it multi-byte text?
The easy start is to loop through the data and look for sequences of [a-zA-Z0-9_ -] to find the text. But Word is probably multi-byte, so you should scan two bytes as one character.
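The byte-scanning idea can be sketched as follows. This is a crude heuristic, not a parser for the .doc format: the function name find_text_runs and the min_len threshold are hypothetical, and the sketch assumes the text is stored either one byte per character or as UTF-16LE, and only looks for printable ASCII characters.

```python
import re


def find_text_runs(data, min_len=6):
    """Return runs of printable ASCII text found in raw bytes."""
    # Runs stored one byte per character.
    ascii_runs = re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)
    # The same characters stored as UTF-16LE: each byte followed by a zero byte.
    utf16_runs = re.findall(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len, data)
    return ([run.decode("ascii") for run in ascii_runs] +
            [run.decode("utf-16-le") for run in utf16_runs])
```

Used as find_text_runs(open("sample.doc", "rb").read()), this will also return formatting and metadata strings that happen to look like text, so the output needs filtering.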
Note: some of the newer formats, such as OpenOffice documents and docx, are multiple files in a compressed container. So you need to decompress the file first and then scan the XML documents for the text you are looking for.

A .docx file is a compressed format (the older binary .doc is not). You need to uncompress it first to get the real data (try opening a .docx file in a program like WinRAR, and you'll see it contains multiple files).
The contents are XML, so reading the format should not be that hard, although I'm not sure you get all the data this way.
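For .docx files (the zipped XML format, not the binary .doc), the unzip-and-read-XML approach can be sketched with the standard library alone. The function name docx_paragraphs is hypothetical, and the sketch assumes the main part lives at word/document.xml. It collects only the text in <w:t> elements of the main body, so headers, footers, footnotes, and similar parts are not included.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the WordprocessingML elements (w:p, w:t, ...).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the text of each paragraph found in word/document.xml."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread over one or more w:t elements.
        texts = [t.text or "" for t in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(texts))
    return paragraphs
```

Calling "\n".join(docx_paragraphs("sample.docx")) then gives the plain text of the document body.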

I had a similar problem, needing to query hundreds of Word documents. I converted the Word files to text files and used normal text parsing tools. Worked well.

python adding gibberish when reading from a .rtf file?

I have a .rtf file that contains nothing but an integer, say 15. I wish to read this integer in through python and manipulate that integer in some way. However, it seems that python is reading in much of the metadata associated with .rtf files. Why is that? How can I avoid it? For example, trying to read in this file, I get..
{\rtf1\ansi\ansicpg1252\cocoartf949\cocoasubrtf460
{\fonttbl\f0\fswiss\fcharset0
Helvetica;}
{\colortbl;\red255\green255\blue255;}
\margl720\margr720\margb720\margt720\vieww9000\viewh8400\viewkind0
\pard\tx566\tx1133\tx1700\tx2267\tx2834\tx3401\tx3968\tx4535\tx5102\tx5669\tx6236\tx6803\ql\qnatural\pardirnatural

That's the nature of .RTF (i.e. Rich Text Format) files: they include extra data to define how the text is laid out and formatted.
It is not recommended to store data in such files, lest you encounter the difficulties you noted. Were you to go through the effort of parsing this file to "recover" your one numeric value, you would expose your application to the risk that updated versions of the RTF format render the parsing logic partially incorrect and hence yield wrong numeric data for the application.
Why not store this info in a true text file? This could be a flat text file or, preferably, an XML, YAML, or JSON file, for added "forward" compatibility, as you may later add extra parameters and such to the file.
If this file is a given, however, there probably exist Python libraries to read and write to it. Check the Python Package Index (PyPI) for the RTF keyword.

That's exactly what the RTF file contains, so Python (in the absence of further instruction) is giving you what the file contains.
You may be looking for a library to read the contents of RTF files, such as pyrtf-ng.
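If the file really is a given and adding a dependency is not an option, a naive stripper can recover the visible text from simple files like the one in the question. The sketch below is not a complete RTF parser and is not how any particular library works: the function name rtf_to_text and the list of skipped groups are assumptions, and it ignores Unicode escapes and many other features of the format.

```python
import re

# One token per match: control word, hex escape, control symbol, brace, or character.
_TOKEN = re.compile(
    r"\\([a-z]+)(-?\d+)? ?|\\'([0-9a-fA-F]{2})|\\(.)|([{}])|[\r\n]+|(.)", re.S)
# Groups whose content is metadata rather than document text (assumed list).
_SKIP = {"fonttbl", "colortbl", "stylesheet", "info"}


def rtf_to_text(rtf):
    """Return the visible text of a simple RTF string."""
    out = []
    stack = []        # "ignoring" flag of each enclosing group
    ignoring = False
    for match in _TOKEN.finditer(rtf):
        word, _arg, hexchar, symbol, brace, char = match.groups()
        if brace == "{":
            stack.append(ignoring)
        elif brace == "}":
            ignoring = stack.pop() if stack else False
        elif word:
            if word in _SKIP:
                ignoring = True
            elif word in ("par", "line") and not ignoring:
                out.append("\n")
        elif symbol:
            if symbol == "*":          # \* marks an optional destination group
                ignoring = True
            elif symbol in "\\{}" and not ignoring:
                out.append(symbol)     # escaped literal character
        elif hexchar and not ignoring:
            out.append(bytes.fromhex(hexchar).decode("cp1252", errors="replace"))
        elif char and not ignoring:
            out.append(char)
    return "".join(out)
```

For the file in the question, int(rtf_to_text(open("file.rtf").read()).strip()) would then yield the number, where file.rtf is a placeholder name.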
