How to retrieve the author of an Office file in Python?

The title explains the problem: there are .doc and .docx files whose author information I want to retrieve so that I can restructure my files.
os.stat returns only size, datetimes, and other filesystem-level information.
open(filename, 'rb').read(200) returns many characters that I could not parse.
There is a module called xlrd for reading Excel files, yet this still doesn't let me read .doc or .docx files. I am aware that the newer Office files are not easily read by non-MS-Office programs, so if that's impossible, gathering the info from old Office files would suffice.

Since .docx files are just zipped XML, you can unzip the docx file and pull the author information out of an XML file. I'm not quite sure where it'd be stored, but looking around at it briefly leads me to suspect it's stored as dc:creator in docProps/core.xml.
Here's how you can open the docx file and retrieve the creator:
import zipfile, lxml.etree
# open zipfile
zf = zipfile.ZipFile('my_doc.docx')
# use lxml to parse the xml file we are interested in
doc = lxml.etree.fromstring(zf.read('docProps/core.xml'))
# retrieve creator
ns={'dc': 'http://purl.org/dc/elements/1.1/'}
creator = doc.xpath('//dc:creator', namespaces=ns)[0].text

You can use COM interop to access the Word object model. This link talks about the technique: http://www.blog.pythonlibrary.org/2010/07/16/python-and-microsoft-office-using-pywin32/
The secret when working with any of the Office objects is knowing what item to access from the overwhelming number of methods and properties. In this case each document has a list of BuiltInDocumentProperties. The property of interest is "Last Author".
After you open the document you will access the author with something like word.ActiveDocument.BuiltInDocumentProperties("Last Author")
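An untested sketch of the full flow, assuming Word is installed and pywin32 is available; the file path is a placeholder:
import win32com.client
# Start (or attach to) a Word instance via COM
word = win32com.client.Dispatch("Word.Application")
word.Visible = False
doc = word.Documents.Open(r"C:\path\to\my_doc.docx")
try:
    # BuiltInDocumentProperties is indexed by property name; .Value holds the string
    author = doc.BuiltInDocumentProperties("Author").Value
    last_author = doc.BuiltInDocumentProperties("Last Author").Value
    print(author, last_author)
finally:
    doc.Close(False)  # close without saving
    word.Quit()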

How about using the python-docx library? You can pull more information about the file, not only the author.
#sudo pip install python-docx
#sudo pip2 install python-docx
#sudo pip3 install python-docx
import docx
file_name = 'file_path_name.docx'
document = docx.Document(file_name)
core_properties = document.core_properties
print(core_properties.author)
print(core_properties.created)
print(core_properties.last_modified_by)
print(core_properties.last_printed)
print(core_properties.modified)
print(core_properties.revision)
print(core_properties.title)
print(core_properties.category)
print(core_properties.comments)
print(core_properties.identifier)
print(core_properties.keywords)
print(core_properties.language)
print(core_properties.subject)
print(core_properties.version)
print(core_properties.content_status)
You can find more information about the python-docx library in its documentation and in its GitHub repository.

For old office documents (.doc, .xls) you can use hachoir-metadata.
It does not work well with the newer file formats: for example, it can parse .xlsx files, but will not provide you with an author name.
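If you prefer calling it from Python rather than using the hachoir-metadata command-line tool, something like this should work (a sketch against the current hachoir package; older hachoir-metadata releases split the modules differently, and the file name is a placeholder):
from hachoir.parser import createParser
from hachoir.metadata import extractMetadata

parser = createParser("legacy_report.doc")
if parser is None:
    raise ValueError("Unable to parse the file")
with parser:
    metadata = extractMetadata(parser)
# Print every metadata line hachoir could recover, author included
for line in metadata.exportPlaintext():
    print(line)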

The newer Office formats are just zip containers containing xml files. You can have a look here https://github.com/profHajal/Microsoft-Office-Documents-Metadata-with-Python/blob/main/mso_md.py for a very simple straightforward approach.
The code listed is easily extendable for OpenOffice formats.
Pseudocode:
import zipfile, xml.dom.minidom
zf = zipfile.ZipFile(filename, 'r')
data = zf.read('docProps/core.xml')  # MS Office
# or
data = zf.read('meta.xml')           # OpenOffice
doc = xml.dom.minidom.parseString(data)
tag = "data you're interested in"
metadata_string = doc.getElementsByTagName(tag)[0].childNodes[0].data
Files to search metadata in:
docProps/core.xml for MS Office files
meta.xml for OpenOffice files
A non-exhaustive list of tags you can search for:
From the Dublin core namespace rules: dc
Title: dc:title
Creator (of most recent modification): dc:creator
Description: dc:description
Subject: dc:subject
Date (last modified): dc:date
Language: dc:language
From the ODF specification: meta
Generator (creating software application): meta:generator
Keywords: meta:keyword
Initial Creator: meta:initial-creator
Creation Date and Time: meta:creation-date
Modification Date and Time: dc:date
Print Date and Time: meta:print-date
Document Template: meta:template (data in attributes)
Document Statistics (word count, page count, etc.): meta:document-statistic (data in attributes)
MS Office specific:
Creation Date and Time: dcterms:created
Date (last modified): dcterms:modified
Creator of most recent modification: cp:lastModifiedBy
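Putting the pseudocode and the tag list together, a small helper might look like this (a sketch; the extension-based choice of metadata member and the file names are my assumptions):
import zipfile
import xml.dom.minidom

def read_metadata_tag(filename, tag):
    # OpenOffice documents keep metadata in meta.xml, MS Office in docProps/core.xml
    member = 'meta.xml' if filename.endswith(('.odt', '.ods', '.odp')) else 'docProps/core.xml'
    with zipfile.ZipFile(filename) as zf:
        doc = xml.dom.minidom.parseString(zf.read(member))
    nodes = doc.getElementsByTagName(tag)  # minidom matches the prefixed name as written
    if nodes and nodes[0].childNodes:
        return nodes[0].childNodes[0].data
    return None

print(read_metadata_tag('my_doc.docx', 'dc:creator'))
print(read_metadata_tag('my_doc.docx', 'cp:lastModifiedBy'))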

Related

Extracting text from MS Word Document uploaded through FileUpload from ipyWidgets in Jupyter Notebook

I am trying to allow the user to upload an MS Word file, and then I run a certain function that takes a string as an input argument. I am uploading the Word file through FileUpload; however, I am getting an encoded object. I am unable to decode the bytes as UTF-8, and using upload.value or upload.data just returns the encoded content.
Any ideas how I can extract content from uploaded Word File?
> upload = widgets.FileUpload()
> upload
#I select the file I want to upload
> upload.value #Returns coded text
> upload.data #Returns coded text
> #Previously upload['content'] worked, but I read this no longer works in IPYWidgets 8.0
Modern ms-word files (.docx) are actually zip-files.
The text (but not the page headers) is actually inside an XML document called word/document.xml in the zip-file.
The python-docx module can be used to extract text from these documents. It is mainly used for creating documents, but it can read existing ones. Example from here.
>>> import docx
>>> gkzDoc = docx.Document('grokonez.docx')
>>> fullText = []
>>> for paragraph in gkzDoc.paragraphs:
... fullText.append(paragraph.text)
...
Note that this will only extract the text from paragraphs. Not e.g. the text from tables.
Edit:
I want to be able to upload the MS file through the FileUpload widget.
There are a couple of ways you can do that.
First, isolate the actual file data. upload.data is actually a list, see here. So do something like:
rawdata = upload.data[0]
(Note that this format has changed over different version of ipywidgets. The above example is from the documentation of the latest version. Read the relevant version of the documentation, or investigate the data in IPython, and adjust accordingly.)
Write rawdata to e.g. foo.docx and open that. That would certainly work, but it seems somewhat inelegant.
docx.Document can work with file-like objects. So you could create an io.BytesIO object, and use that.
Like this:
foo = io.BytesIO(rawdata)
doc = docx.Document(foo)
Tweaking @Roland Smith's great suggestions, the following code finally worked:
import io
import docx
from docx import Document
upload = widgets.FileUpload()
upload
rawdata = upload.data[0]
test = io.BytesIO(rawdata)
doc = Document(test)
for p in doc.paragraphs:
    print(p.text)

Python Camelot - export files without the additional string appended to the filename

Python 3.7 with Camelot 0.7.3. Currently, Camelot exports the converted file with 'page-*-table-*' appended to the file name - we have very specific file name requirements for our application, and I'm trying to export the file without that extra string appended to the file name. Is this possible? The documentation does not mention anything about how to get around this.
The documentation does not mention anything about how to get around this.
I'm not sure what you mean. https://camelot-py.readthedocs.io/en/master/ says:
Here’s how you can extract tables from PDF files. Check out the PDF
used in this example here.
>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html
Using tables.export exports all the tables in the PDF to separate files and needs to distinguish them by the filenames.
If you only need to export a specific table, use the example further down on the page:
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html
This passes the filename unchanged to pandas.DataFrame.to_csv, as can be seen in https://github.com/camelot-dev/camelot/blob/master/camelot/core.py#L571.
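So if the exact file names matter, skip tables.export and loop over the tables yourself (a sketch, using 'foo.pdf' as in the documentation example; the naming scheme is up to you):
import camelot

tables = camelot.read_pdf('foo.pdf')
for i, table in enumerate(tables):
    # The name passed to to_csv is used as-is, with no extra string appended
    table.to_csv('my_required_name_{}.csv'.format(i))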

How to parse doc files on a mac with python? [duplicate]

For working with MS Word files in Python, there are the Python win32 extensions, which can be used on Windows. How do I do the same in Linux?
Is there any library?
Use the native Python docx module. Here's how to extract all the text from a doc:
import docx
document = docx.Document(filename)
docText = '\n\n'.join(
paragraph.text for paragraph in document.paragraphs
)
print(docText)
See Python DocX site
Also check out Textract which pulls out tables etc.
Parsing XML with regexes invokes Cthulhu. Don't do it!
You could make a subprocess call to antiword. Antiword is a linux commandline utility for dumping text out of a word doc. Works pretty well for simple documents (obviously it loses formatting). It's available through apt, and probably as RPM, or you could compile it yourself.
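A minimal sketch of that subprocess call (assumes antiword is installed and on the PATH; the file name is a placeholder):
import subprocess

# antiword prints the extracted text to stdout by default
text = subprocess.check_output(['antiword', 'mydocument.doc']).decode('utf-8', 'replace')
print(text)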
benjamin's answer is a pretty good one. I have just consolidated...
import zipfile, re
docx = zipfile.ZipFile('/path/to/file/mydocument.docx')
content = docx.read('word/document.xml').decode('utf-8')
cleaned = re.sub('<(.|\n)*?>','',content)
print(cleaned)
OpenOffice.org can be scripted with Python: see here.
Since OOo can load most MS Word files flawlessly, I'd say that's your best bet.
I know this is an old question, but I was recently trying to find a way to extract text from MS word files, and the best solution by far I found was with wvLib:
http://wvware.sourceforge.net/
After installing the library, using it in Python is pretty easy:
import commands  # Python 2 only; on Python 3 use subprocess instead
exe = 'wvText ' + word_file + ' ' + output_txt_file
out = commands.getoutput(exe)
exe = 'cat ' + output_txt_file
out = commands.getoutput(exe)
And that's it. Pretty much, what we're doing is using the commands.getoutput function to run a couple of shell commands, namely wvText (which extracts text from a Word document) and cat (to read the file output). After that, the entire text from the Word document will be in the out variable, ready to use.
Hopefully this will help anyone having similar issues in the future.
Take a look at how the doc format works and create word document using PHP in linux. The former is especially useful. Abiword is my recommended tool. There are limitations though:
However, if the document has complicated tables, text boxes, embedded spreadsheets, and so forth, then it might not work as expected. Developing good MS Word filters is a very difficult process, so please bear with us as we work on getting Word documents to open correctly. If you have a Word document which fails to load, please open a Bug and include the document so we can improve the importer.
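If you want to script that conversion, AbiWord has a command-line conversion mode; I believe the invocation is along these lines (the --to flag behavior and the file name are assumptions, so check abiword --help on your install):
import subprocess

# Should produce mydocument.txt next to the input file
subprocess.run(['abiword', '--to=txt', 'mydocument.doc'], check=True)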
(Note: I posted this on this question as well, but it seems relevant here, so please excuse the repost.)
Now, this is pretty ugly and pretty hacky, but it seems to work for me for basic text extraction. Obviously to use this in a Qt program you'd have to spawn a process for it etc, but the command line I've hacked together is:
unzip -p file.docx | grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'
So that's:
unzip -p file.docx: -p == "unzip to stdout"
grep '<w:t': Grab just the lines containing '<w:t' (<w:t> is the Word 2007 XML element for "text", as far as I can tell)
sed 's/<[^<]*>//g': Remove everything inside tags
grep -v '^[[:space:]]*$': Remove blank lines
There is likely a more efficient way to do this, but it seems to work for me on the few docs I've tested it with.
As far as I'm aware, unzip, grep and sed all have ports for Windows and any of the Unixes, so it should be reasonably cross-platform. Despite being a bit of an ugly hack ;)
If your intention is to use purely Python modules without calling a subprocess, you can use the zipfile Python module.
content = ""
# Load DocX into zipfile
docx = zipfile.ZipFile('/home/whateverdocument.docx')
# Unpack zipfile
unpacked = docx.infolist()
# Find the /word/document.xml file in the package and assign it to variable
for item in unpacked:
if item.orig_filename == 'word/document.xml':
content = docx.read(item.orig_filename)
else:
pass
Your content string however needs to be cleaned up, one way of doing this is:
# Clean the content string from xml tags for better search
fullyclean = []
halfclean = content.split('<')
for item in halfclean:
    if '>' in item:
        bad_good = item.split('>')
        if bad_good[-1] != '':
            fullyclean.append(bad_good[-1])
# Assemble a new string with all pure content
content = " ".join(fullyclean)
But there is surely a more elegant way to clean up the string, probably using the re module.
Hope this helps.
Unoconv might also be a good alternative: http://linux.die.net/man/1/unoconv
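A quick sketch of driving unoconv from Python (assumes unoconv with a LibreOffice/OpenOffice install is available; as far as I know the converted file lands next to the input by default):
import subprocess

# -f selects the output format; mydocument.doc -> mydocument.txt
subprocess.run(['unoconv', '-f', 'txt', 'mydocument.doc'], check=True)
with open('mydocument.txt') as f:
    print(f.read())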
To read Word 2007 and later files, including .docx files, you can use the python-docx package:
from docx import Document
document = Document('existing-document-file.docx')
document.save('new-file-name.docx')
To read .doc files from Word 2003 and earlier, make a subprocess call to antiword. You need to install antiword first:
sudo apt-get install antiword
Then just call it from your python script:
import os
input_word_file = "input_file.doc"
output_text_file = "output_file.txt"
os.system('antiword %s > %s' % (input_word_file, output_text_file))
If you have LibreOffice installed, you can simply call it from the command line to convert the file to text, then load the text into Python.
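For example, something like this (a sketch; the binary may be named soffice or libreoffice depending on the install, and the file name is a placeholder):
import subprocess

# Convert mydocument.doc to mydocument.txt in the current directory
subprocess.run(['libreoffice', '--headless', '--convert-to', 'txt',
                '--outdir', '.', 'mydocument.doc'], check=True)
with open('mydocument.txt') as f:
    print(f.read())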
Is this an old question?
I believe that such a thing does not exist.
There are only answered and unanswered ones.
This one is pretty unanswered, or half answered if you wish.
Well, methods for reading *.docx (MS Word 2007 and later) documents without using COM interop are all covered.
But methods for extracting text from *.doc (MS Word 97-2000), using Python only, are lacking.
Is this complicated?
To do: not really, to understand: well, that's another thing.
When I didn't find any finished code, I read some format specifications and dug out some proposed algorithms in other languages.
MS Word (*.doc) file is an OLE2 compound file.
Not to bother you with a lot of unnecessary details, think of it as a file-system stored in a file. It actually uses FAT structure, so the definition holds. (Hm, maybe you can loop-mount it in Linux???)
In this way, you can store more files within a file, like pictures etc.
The same is done in *.docx by using ZIP archive instead.
There are packages available on PyPI that can read OLE files. Like (olefile, compoundfiles, ...)
I used compoundfiles package to open *.doc file.
However, in MS Word 97-2000, internal subfiles are not XML or HTML, but binary files.
And as if this were not enough, each contains information about the other, so you have to read at least two of them and unravel the stored info accordingly.
To understand fully, read the PDF document from which I took the algorithm.
The code below is very hastily composed and tested on a small number of files.
As far as I can see, it works as intended.
Sometimes some gibberish appears at the start, and almost always at the end of text.
And there can be some odd characters in-between as well.
Those of you who just wish to search for text will be happy.
Still, I urge anyone who can help to improve this code to do so.
doc2text module:
"""
This is Python implementation of C# algorithm proposed in:
http://b2xtranslator.sourceforge.net/howtos/How_to_retrieve_text_from_a_binary_doc_file.pdf
Python implementation author is Dalen Bernaca.
Code needs refining and probably bug fixing!
As I am not a C# expert I would like some code rechecks by one.
Parts of which I am uncertain are:
* Did the author of the original algorithm use uint32 and int32 correctly when unpacking?
I copied each occurrence as in the original algorithm.
* Is the FIB length for MS Word 97 1472 bytes as in MS Word 2000, and would it make any difference if it is not?
* Did I interpret each C# command correctly?
I think I did!
"""
from compoundfiles import CompoundFileReader, CompoundFileError
from struct import unpack
__all__ = ["doc2text"]
def doc2text(path):
    text = u""
    cr = CompoundFileReader(path)
    # Load the WordDocument stream:
    try:
        f = cr.open("WordDocument")
        doc = f.read()
        f.close()
    except Exception:
        cr.close()
        raise CompoundFileError("The file is corrupted or it is not a Word document at all.")
    # Extract the file information block (FIB) and piece table stream information from it.
    # "<L"/"<l" force little-endian 32-bit values, as the .doc format requires:
    fib = doc[:1472]
    fcClx = unpack("<L", fib[0x01A2:0x01A6])[0]
    lcbClx = unpack("<L", fib[0x01A6:0x01A6 + 4])[0]
    tableFlag = unpack("<L", fib[0x000A:0x000A + 4])[0] & 0x0200 == 0x0200
    tableName = ("0Table", "1Table")[tableFlag]
    # Load the piece table stream:
    try:
        f = cr.open(tableName)
        table = f.read()
        f.close()
    except Exception:
        cr.close()
        raise CompoundFileError("The file is corrupt. '%s' piece table stream is missing." % tableName)
    cr.close()
    # Find the piece table inside the table stream:
    clx = table[fcClx:fcClx + lcbClx]
    pos = 0
    pieceTable = b""
    lcbPieceTable = 0
    while True:
        if clx[pos] == 0x02:
            # This is the piece table; we store it:
            lcbPieceTable = unpack("<l", clx[pos + 1:pos + 5])[0]
            pieceTable = clx[pos + 5:pos + 5 + lcbPieceTable]
            break
        elif clx[pos] == 0x01:
            # This is the beginning of some other substructure; we skip it:
            pos = pos + 1 + 1 + clx[pos + 1]
        else:
            break
    if not pieceTable:
        raise CompoundFileError("The file is corrupt. Cannot locate a piece table.")
    # Read info from pieceTable about each piece and extract it from the WordDocument stream:
    pieceCount = (lcbPieceTable - 4) // 12
    for x in range(pieceCount):
        cpStart = unpack("<l", pieceTable[x * 4:x * 4 + 4])[0]
        cpEnd = unpack("<l", pieceTable[(x + 1) * 4:(x + 1) * 4 + 4])[0]
        offsetDescriptor = ((pieceCount + 1) * 4) + (x * 8)
        pieceDescriptor = pieceTable[offsetDescriptor:offsetDescriptor + 8]
        fcValue = unpack("<L", pieceDescriptor[2:6])[0]
        isANSI = (fcValue & 0x40000000) == 0x40000000
        fc = fcValue & 0xBFFFFFFF
        cb = cpEnd - cpStart
        enc = ("utf-16-le", "cp1252")[isANSI]
        cb = (cb * 2, cb)[isANSI]
        text += doc[fc:fc + cb].decode(enc, "ignore")
    return "\n".join(text.splitlines())
I'm not sure if you're going to have much luck without using COM. The .doc format is ridiculously complex, and is often called a "memory dump" of Word at the time of saving!
@Swati: that's in HTML, which is fine and dandy, but most Word documents aren't so nice!
Just an option for reading 'doc' files without using COM: miette. Should work on any platform.
Aspose.Words Cloud SDK for Python is a platform independent solution to convert MS Word/Open Office files to text. It is a commercial product but free trial plan provides 150 monthly API calls.
P.S: I am a developer evangelist at Aspose.
# For complete examples and data files, please go to https://github.com/aspose-words-cloud/aspose-words-cloud-python
# Import module
import asposewordscloud
import asposewordscloud.models.requests
from shutil import copyfile
# Please get your Client ID and Secret from https://dashboard.aspose.cloud.
client_id='xxxxxxx-xxxx-xxxx-xxxxx-xxxxxxxxxx'
client_secret='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
words_api = asposewordscloud.WordsApi(client_id,client_secret)
words_api.api_client.configuration.host='https://api.aspose.cloud'
filename = 'C:/Temp/02_pages.docx'
dest_name = 'C:/Temp/02_pages.txt'
#Convert DOCX to text
request = asposewordscloud.models.requests.ConvertDocumentRequest(document=open(filename, 'rb'), format='txt')
result = words_api.convert_document(request)
copyfile(result, dest_name)

updating metadata for feature classes programmatically using arcpy

I would like to be able to take an excel file that contains a record for each feature class, and some metadata fields, like summary, description, etc., and convert that to the feature class metadata. From the research I've done it seems like I need to convert each record in the excel table to xml, and then from there I may be able to import the xml file as metadata. Looks like I could use ElementTree, but I'm a little unsure of how to execute. Has anyone done this before and if so could you provide some guidance?
Man, this can be quite a process! I had to update some metadata information for a project at work the other day, so here goes nothing. It would be helpful to store all of the metadata information from the excel table as a dictionary list or other data structure of your choosing (I work with csvs and try to stay away from excel spreadsheets for experience reasons).
metaInfo = [{"featureClass":"fc1",
"abstract":"text goes here",
"description":"text goes here",
"tags":["tag1","tag2","tag3"]},
{"featureClass":"fc2",
"abstract":"text goes here",
"description":"text goes here",
"tags":["tag1","tag2","tag3"]},...]
From there, I would export the current feature class metadata using the Export Metadata function, which converts the feature class metadata into an xml file using an FGDC schema. Here is a code example below:
#Directory containing ArcGIS Install files
installDir = arcpy.GetInstallInfo("desktop")["InstallDir"]
#Path to XML schema for FGDC
translator = os.path.join(installDir, "Metadata/Translator/ARCGIS2FGDC.xml")
#Export your metadata
arcpy.ExportMetadata_conversion(featureClassPath, translator, tempXmlExportPath)
From there, you can use the xml module to access the ElementTree class. However, I would recommend using the lxml module (http://lxml.de/index.html#download) because it allows you to incorporate html code into your metadata through the CDATA factory if you needed special elements like line breaks in your metadata. From there, assuming that you have imported lxml, parse your local xml document:
import lxml.etree as ET
tree = ET.parse(tempXmlExportPath)
root = tree.getroot()
If you want to update the tags use the code below:
idinfo = root[0]
#Create keywords element
keywords = ET.SubElement(idinfo, "keywords")
tree.write(tempXmlExportPath)
#Create theme child
theme = ET.SubElement(keywords, "theme")
tree.write(tempXmlExportPath)
#Create themekt and themekey grandchildren/insert tag info
themekt = ET.SubElement(theme, "themekt")
tree.write(tempXmlExportPath)
for tag in tags: #tags list from your dictionary
themekey = ET.SubElement(theme, "themekey")
themekey.text = tag
tree.write(tempXmlExportPath)
To update the Summary tags, use this code:
#Create descript tag
descript = ET.SubElement(idinfo, "descript")
tree.write(tempXmlExportPath)
#Create purpose child from abstract
abstract = ET.SubElement(descript, "abstract")
text = metaInfo[0]["abstract"] #get abstract string from dictionary
abstract.text = text
tree.write(tempXmlExportPath)
If a tag in the xml already exists, store the tag as an object using the parent.find("child") method, and update the text similar to the code examples above. Once you have updated your local xml file, use the Import Metadata method to import the xml file back into the feature class and remove the local xml file.
arcpy.ImportMetadata_conversion(tempXmlExportPath, "FROM_FGDC", featureClassPath)
os.remove(tempXmlExportPath)
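For the "tag already exists" case mentioned above, the update-in-place might look like this (a sketch continuing the snippet above, so idinfo, ET, tree and the metaInfo dictionary list are assumed to be defined already):
#Look up the existing elements instead of creating duplicates
descript = idinfo.find("descript")
if descript is None:
    descript = ET.SubElement(idinfo, "descript")
abstract = descript.find("abstract")
if abstract is None:
    abstract = ET.SubElement(descript, "abstract")
#Update the text and write the change out
abstract.text = metaInfo[0]["abstract"]
tree.write(tempXmlExportPath)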
Keep in mind that these tools in Arc are only for 32 bit, so if you are scripting through the 64 bit background geoprocessor, this will not work. I am working off of ArcMap 10.1. If you have any questions, please let me know or consult the documentation below:
lxml module
http://lxml.de/index.html#documentation
Export Metadata arcpy
http://resources.arcgis.com/en/help/main/10.1/index.html#//00120000000t000000
Import Metadata arcpy
http://resources.arcgis.com/en/help/main/10.1/index.html#//00120000000w000000

Looking for recommendation on how to convert PDF into structured format

I would like to do some analysis on some properties listed in an upcoming auction. Unfortunately, the city running the auction does not publish the information in a structured format but instead provides a 700+ page PDF of the properties going up for auction.
I'm wondering if the community has any thoughts as to how I can approach parsing said PDF into a structured format for insertion into a db or to create a spreadsheet of the properties.
Here's an image of what each page represents:
And here's a page that lists some properties:
I'm comfortable with Python and Ruby so I don't have any issues scripting up a solution, but because the "columns" and the data in those columns aren't necessarily tied together, it seems like this would be a dubious proposition.
Any ideas would be greatly appreciated.
After mucking around with this for 3 hours, I was able to create a parseable XML document from the data. Unfortunately, I was unsuccessful in putting together a completely reusable set of steps that I can use for future auction publications.
As an aside, I did attempt to call and ask Los Angeles County if they could provide an alternative format of the properties up for auction (excel, etc) and the answer was no. That's government for you.
Here's a high-level view of my approach:
Convert the PDF into a text file using Poppler
Use RegEx foo to clean up and create XML nodes from the data
Use an XML beautifier / validator to find errors and do cleanup
Use Python/Ruby to add a Google Maps link node and a link to the LA County Assessor's Map (http://assessormap.co.la.ca.us/mapping/rolldata.asp?ain=APN-GOES_HERE)
Convert XML to CSV with Ruby
I used http://xmlbeautifier.com/ as my XML beautifier / validator because it was fast and it gave accurate error reporting, including line numbers.
Use Homebrew to install Poppler for Mac:
brew install poppler
After Poppler is installed, you should have access to the pdftotext utility to convert the PDF:
pdftotext -layout -f 24 -l 687 AuctionBook2013.pdf auction_book.txt
Here's a preview of the XML (Click here for full XML):
<?xml version="1.0" encoding="UTF-8"?>
<listings>
<item id="1">
<nsb>536</nsb>
<minbid>3,422</minbid>
<apn>2006 003 001</apn>
<delinquent_year>03</delinquent_year>
<apn_old>2006 003 001</apn_old>
<description>LICENSED SURVEYOR'S MAP
AS PER BK 25 PG 28 OF L S LOT 1
BLK 1 ASSESSED TO J AND S
LIMITED LLC C/O DUNA CSARDAS -
JULIUS JANCSO LOCATION COUNTY OF
LOS ANGELES</description>
<address>VACANT LOT</address>
</item>
Edit: Adding the Ruby I wrote to convert the XML to a CSV.
require 'rexml/document'
require 'csv'

class Auction
  def initialize
    f = File.new('AuctionBook2013.xml', 'r')
    doc = REXML::Document.new(f)
    CSV.open("auction.csv", "w+b") do |csv|
      csv << ['id', 'minbid', 'apn', 'delinquent_year', 'apn_old', 'description', 'address']
      doc.elements.each('/listings/item') do |item|
        csv << [item.attributes['id'],
                item.elements['minbid'].text,
                item.elements['apn'].text,
                item.elements['delinquent_year'].text,
                item.elements['apn_old'].text,
                item.elements['description'].text,
                item.elements['address'].text]
      end
    end
  end
end
a = Auction.new()
Link to Final CSV
Convert to text with Xpdf using the pdftotext command.
I converted your file with the following:
pdftotext.exe -layout -f 23 -l 510 AuctionBook2013.pdf AuctionBook2013.txt
This conversion leaves text exactly in its original layout (due to -layout option). Options -f and -l indicate the first and last page numbers of the range of pages to extract.
From there, parsing should be simple -- a number in column 8 indicates the first line of a record, a blank line ends the record. Follow the guide for the exact positioning of elements within a record.
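A rough sketch of that record-splitting logic in Python (the column test, the skipping of non-record lines, and the file name are assumptions based on this layout):
records = []
current = []
with open('AuctionBook2013.txt') as f:
    for line in f:
        if not line.strip():
            # A blank line ends the current record
            if current:
                records.append(current)
                current = []
        elif len(line) > 7 and line[7].isdigit() and not current:
            # A number in column 8 starts a new record
            current = [line.rstrip()]
        elif current:
            current.append(line.rstrip())
if current:
    records.append(current)
print(len(records), 'records parsed')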
