How to change table contents and style using python-docx? - python

I found python-docx,
it looks very smart, but I have to do some tasks that are not well documented.
I need to open a .docx template that contains a table and, for every instance in a previously created list, format it into the table inside the template.

I have probably found a solution.
It relies on document.xpath. One way to map out the document is to decompress the .docx and read the ./word/document.xml file.
PATH_CELL = 'the path you identified in document.xml'
docbody = document.xpath('/w:document/w:body' + PATH_CELL,
                         namespaces=nsprefixes)[0]
print 'Replacing ...',
docbody = replace(docbody, 'Welcome', 'Hello')
This is how I got it working. Is there any other way?
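To find the path of a cell without unzipping by hand, the package can be read with the standard library. This is a minimal sketch, not part of python-docx: the helper name list_table_cells is made up for illustration, and it only visits tables that sit directly in the document body.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w:" prefix in document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def list_table_cells(docx_path):
    """Return (table, row, cell, text) tuples for every top-level table cell.

    Hypothetical helper for illustration; indices are 1-based so they can be
    used directly in XPath positions.
    """
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    cells = []
    # root is <w:document>; nested tables inside cells are not visited.
    for t, table in enumerate(root.findall("w:body/w:tbl", NS), start=1):
        for r, row in enumerate(table.findall("w:tr", NS), start=1):
            for c, cell in enumerate(row.findall("w:tc", NS), start=1):
                # A cell's visible text is spread over its <w:t> descendants.
                text = "".join(node.text or "" for node in cell.iter("{%s}t" % W_NS))
                cells.append((t, r, c, text))
    return cells
```

A tuple such as (1, 2, 1, 'Welcome') corresponds to the path fragment /w:tbl[1]/w:tr[2]/w:tc[1], which could presumably be used as PATH_CELL above.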

Related

Is it possible to insert text in an existing .otp file through Python?

I'm looking for a way/module to insert a string of text formatted by position and text size, into an already existing .otp file(OpenDocument Presentation Template/Libre&OpenOffice version of a .ppt file). Preferably through Python. I've tried googling but the only thing I find is about macros and I'm not sure if that will work the way I want it to.
As an example I'm looking for an end product like this, where the text is inserted through running a Python script.
I imagine the outline of the script to look something like:
import modules
file = 'cat.otp'
open file
some function for inserting formatted text(right size, text size etc) in otp file
You wrote .opt in the text and .otp in the pseudo-code, but the extension should be .odp for a presentation file; .otp is for a master. If you just want to add content, this should work for both:
While there is a convenient library for pptx (python-pptx), I don't know of one for OpenOffice files.
If you don't have to create a file from scratch but only edit something (and it sounds like that), you can unpack the file with zipfile, edit content.xml, and zip the whole file structure back into an .odp file. If you take your demo file and read the XML (use Notepad++ with the XML Tools extension and its pretty-print function to de-linearize it), it is quite self-explanatory; you'll get the idea pretty fast.
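The unpack, edit, and re-zip steps described above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the function name replace_in_content_xml is invented here, and a plain string replacement only works when the text to change sits in a single XML text node and the new text needs no XML escaping.

```python
import zipfile


def replace_in_content_xml(src_path, dst_path, old, new):
    """Copy an ODF package, replacing `old` with `new` inside content.xml only.

    Hypothetical helper for illustration; every other entry is copied unchanged.
    """
    with zipfile.ZipFile(src_path) as zin, zipfile.ZipFile(dst_path, "w") as zout:
        # infolist() keeps the original entry order, and passing the ZipInfo
        # object to writestr() keeps each entry's compression setting (the
        # "mimetype" entry is expected to come first and be stored uncompressed).
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == "content.xml":
                data = data.decode("utf-8").replace(old, new).encode("utf-8")
            zout.writestr(item, data)
```

Something like replace_in_content_xml('cat.otp', 'cat_out.otp', 'PLACEHOLDER', 'Hello') would then swap a placeholder string that already exists in the template. Adding a new, positioned text box means adding new elements to content.xml, which this sketch does not attempt.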

python-docx does not add picture

I'm trying to insert a picture into a Word document using python-docx but running into errors.
The code is simply:
document.add_picture("test.jpg", width = Cm(2.0))
From looking at the python-docx documentation I can see that the following XML should be generated:
<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
  <pic:nvPicPr>
    <pic:cNvPr id="1" name="python-powered.png"/>
    <pic:cNvPicPr/>
  </pic:nvPicPr>
  <pic:blipFill>
    <a:blip r:embed="rId7"/>
    <a:stretch>
      <a:fillRect/>
    </a:stretch>
  </pic:blipFill>
  <pic:spPr>
    <a:xfrm>
      <a:off x="0" y="0"/>
      <a:ext cx="859536" cy="343814"/>
    </a:xfrm>
    <a:prstGeom prst="rect"/>
  </pic:spPr>
</pic:pic>
This does in fact get generated in my document.xml file (when unzipping the docx file). However, looking into the OOXML format, I can see that the image should also be saved under the media folder and the relationship should be mapped in word/_rels/document.xml.rels:
<Relationship Id="rId20"
              Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
              Target="media/image20.png"/>
None of this happens, however, and when I open the Word document I'm met with a "The picture can't be displayed" placeholder.
Can anyone help me understand what is going on?
It looks like the image is not embedded the way it should be, and I would need to insert it in the media folder and add the mapping for it myself. However, as this is a well-documented feature, it should be working as expected.
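One way to confirm this from outside Word is to look inside the saved package. The following is a rough sketch using only the standard library. The function name inspect_images is made up for illustration. It lists the image relationships in word/_rels/document.xml.rels and the files under word/media/.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of package relationship parts (*.rels) and the image relationship type
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
IMAGE_TYPE = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
RELS_PART = "word/_rels/document.xml.rels"


def inspect_images(docx_path):
    """Return (image_relationships, media_files) found in a .docx package.

    Hypothetical diagnostic helper; image_relationships maps rId -> target.
    """
    with zipfile.ZipFile(docx_path) as archive:
        names = archive.namelist()
        media = sorted(n for n in names if n.startswith("word/media/"))
        rels = {}
        if RELS_PART in names:
            root = ET.fromstring(archive.read(RELS_PART))
            for rel in root.findall("{%s}Relationship" % RELS_NS):
                if rel.get("Type") == IMAGE_TYPE:
                    rels[rel.get("Id")] = rel.get("Target")
    return rels, media
```

If document.xml references r:embed="rId7" but rId7 is missing from the returned relationships, or the media list is empty, the picture part was never written into this file.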
UPDATE:
Testing it with an empty docx file, the image does get added as expected, which leads me to believe it might have something to do with the python-docx-template library. (https://github.com/elapouya/python-docx-template)
It uses python-docx and jinja to allow templating capabilities but runs and works the same way python-docx should. I added the image to a subdoc, which then gets inserted into a full document at a given place.
Sample code can be seen below (from https://github.com/elapouya/python-docx-template/blob/master/tests/subdoc.py):
from docxtpl import DocxTemplate
from docx.shared import Inches
tpl=DocxTemplate('test_files/subdoc_tpl.docx')
sd = tpl.new_subdoc()
sd.add_paragraph('A picture :')
sd.add_picture('test_files/python_logo.png', width=Inches(1.25))
context = {
    'mysubdoc': sd,
}
tpl.render(context)
tpl.save('test_files/subdoc.docx')
I'll keep this up in case anyone else manages to make the same mistake as I did :) I managed to debug it in the end.
The problem was in how I used the python-docx-template library. I opened up a DocxTemplate like so:
report_output = DocxTemplate(template_path)
DoThings(value,template_path)
report_output.render(dictionary)
report_output.save(output_path)
But I accidentally opened it up twice. Instead of passing the template to a function, when working with it, I passed a path to it and opened it again when creating subdocs and building them.
def DoThings(data, template_path):
    doc = DocxTemplate(template_path)
    temp_finding = doc.new_subdoc()
    # DO THINGS
Finally after I had the subdocs built, I rendered the first template which seemed to work fine for paragraphs and such but I'm guessing the images were added to the "second" opened template and not to the first one that I was actually rendering. After passing the template to the function it started working as expected!
I came across this problem, and it was solved after removing the parameter width=(1.0) from the add_picture call.
When the parameter width=(1.0) was present, I could not see the picture in test.docx,
so it might have been caused by an inappropriate size being set for the picture.
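A possible explanation, offered as an assumption rather than something confirmed in this thread: python-docx stores lengths as integer English Metric Units (EMU), with 914400 EMU to the inch and 360000 to the centimetre, so a bare number passed as width may end up being read as a raw EMU count, which would be far too small to see. The helper names below are invented for this arithmetic sketch and are not part of python-docx.

```python
EMU_PER_INCH = 914400  # English Metric Units per inch
EMU_PER_CM = 360000    # English Metric Units per centimetre


def cm_to_emu(cm):
    """Convert centimetres to EMU (hypothetical helper, rounds to an integer)."""
    return int(round(cm * EMU_PER_CM))


def emu_to_inches(emu):
    """Convert an EMU count to inches (hypothetical helper)."""
    return emu / EMU_PER_INCH


width_2cm = cm_to_emu(2.0)             # what Cm(2.0) stands for in EMU
logo_width = emu_to_inches(859536)     # the cx value from the XML in the question
one_emu_in_inches = emu_to_inches(1)   # size of a width of 1 read as raw EMU
```

Here width_2cm is 720000 EMU and logo_width works out to 0.94 inches, while a width of one EMU is roughly a millionth of an inch. This is why wrapping the number in Cm(...) or Inches(...) matters.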
To add pictures, headings and paragraphs to an existing document:
from docx import Document
from docx.shared import Cm
# full_path and tableData are defined elsewhere (document path, rows from the JSON API)
doc = Document(full_path) # open an existing document with existing styles
for row in tableData: # list from the json api ...
    print('row {}'.format(row))
    level = row['level']
    levelStyle = 'Heading ' + str(level)
    title = row['title']
    heading = doc.add_heading(title, level)
    heading.style = doc.styles[levelStyle]
    p = doc.add_paragraph(row['description'])
    if row['img_http_path']:
        ip = doc.add_paragraph()
        r = ip.add_run()
        r.add_text(row['img_name'])
        r.add_text("\n")
        r.add_picture(row['img_http_path'], width=Cm(15.0))
doc.save(full_path)

Problems with PyPDF ignoring some data

Hoping for some help, as I can't find a solution.
We currently have a lot of manual data input done by people reading PDF files, and I have been asked to find a way to cut this time down. My solution would be to transform the PDF into a much more easily readable format, then use grep to get rid of the standard fields (just leaving the data behind). This would then be uploaded into a template, then into SAP.
However, the main problem has come at the first hurdle: transforming the PDF into a txt file. The code I use is as follows -
import sys
import pyPdf
def getPDFContent(path):
    content = ""
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Extract the text of every page
    for i in range(0, pdf.getNumPages()):
        content += pdf.getPage(i).extractText() + "\n"
    # Replace non-breaking spaces and collapse runs of whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content
f = open('test.txt', 'w+')
f.write(getPDFContent("Adminform.pdf").encode("ascii", "ignore"))
f.close()
This works, however it ignores some data from the PDF files. To show you what I mean, this PDF page -
http://s23.postimg.org/6dqykomqj/error.png
From the first section (gender, title, name) produces the below -
*Title: *Legal First Name (s): *Your forename and second name (if applicable) as it appears on your passport or birth certificate. Address: *Legal Surname: *Your surname as it appears on your passport or birth certificate
Basically, the actual data that I want to capture is not being converted.
Anyone have a fix for this?
Thanks,
Generally speaking, converting PDFs to text is a bad idea. It is almost always messy.
There are Linux utilities that do what you have implemented, but I don't expect them to do any better.
I can suggest Tabula, which you can find at
http://tabula.technology/
It is meant for extracting tables out of PDFs by manually delineating the boundaries of the table, but running it on a PDF with no tables would output text with some formatting retained.
There is some automation, although it is limited.
Refer to
https://github.com/tabulapdf/tabula-extractor/wiki/Using-the-command-line-tabula-extractor-tool
Also, though it may not be entirely relevant here, you can use OpenRefine to manage messy data. Refer to
http://openrefine.org/

storing full text from txt file into mongodb

I have created a python script that automates a workflow converting PDF to txt files. I want to be able to store and query these files in MongoDB. Do I need to turn the .txt file into JSON/BSON? Should I be using a program like PyMongo?
I am just not sure what the steps of such a project would be let alone the tools that would help with this.
I've looked at this post: How can one add text files in Mongodb?, which makes me think I need to convert the file to a JSON file, and possibly integrate GridFS?
You don't need to JSON/BSON encode it if you're using a driver. If you're using the MongoDB shell, you'd need to worry about it when you pasted the contents.
You'd likely want to use the Python MongoDB driver:
from pymongo import MongoClient
client = MongoClient()
db = client.test_database # use a database called "test_database"
collection = db.files # and inside that DB, a collection called "files"
with open('test_file_name.txt') as f: # open a file
    text = f.read() # read the entire contents, should be UTF-8 text
# build a document to be inserted
text_file_doc = {"file_name": "test_file_name.txt", "contents" : text }
# insert the contents into the "files" collection
# (insert_one replaces the older insert(), which was removed in PyMongo 4)
collection.insert_one(text_file_doc)
(Untested code)
If you made sure that the file names are unique, you could set the _id property of the document and retrieve it like:
text_file_doc = collection.find_one({"_id": "test_file_name.txt"})
Or, you could ensure the file_name property as shown above is indexed and do:
text_file_doc = collection.find_one({"file_name": "test_file_name.txt"})
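To make the _id option concrete, here is a minimal sketch of building such a document from a file. The helper name build_text_file_doc and its use_name_as_id flag are invented for illustration, and the MongoDB calls are left as comments because they need PyMongo and a running server.

```python
import os


def build_text_file_doc(path, use_name_as_id=True):
    """Build the dictionary that would be stored for one text file.

    Hypothetical helper for illustration only.
    """
    with open(path, encoding="utf-8") as f:
        text = f.read()
    name = os.path.basename(path)
    doc = {"file_name": name, "contents": text}
    if use_name_as_id:
        # Only safe if file names are unique: _id must be unique in a collection.
        doc["_id"] = name
    return doc


# With PyMongo and a running server, the document could then be stored and
# fetched along these lines (not executed here):
#   collection.insert_one(build_text_file_doc("test_file_name.txt"))
#   collection.find_one({"_id": "test_file_name.txt"})
#   collection.create_index("file_name")  # if querying by file_name instead
```

Using the file name as _id avoids a separate index, at the cost of an error on insert if the same name is stored twice.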
Your other option is to use GridFS, although it's often not recommended for small files.
There's a starter here for Python and GridFS.
Yes, you must convert your file to JSON. There is a trivial way to do that: use something like {"text": "your text"}. It's easy to extend / update such records later.
Of course, you'd need to escape the " occurrences in your text. I suppose that you would use a JSON library and/or the MongoDB library of your favorite language to do all the formatting.
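As a small illustration of letting a library handle the escaping, Python's standard json module takes care of quotes and newlines. The sample text is made up. With a driver such as PyMongo the dictionary would normally be passed as-is, so this step only matters when you really need a JSON string.

```python
import json

# Sample text containing characters that must be escaped in JSON
text = 'She said "hello"\nand left.'

# json.dumps escapes the double quotes and the newline for us
encoded = json.dumps({"text": text})

# json.loads restores the original text unchanged
decoded = json.loads(encoded)
```

After this, decoded["text"] equals the original text, and encoded contains \" and \n escape sequences instead of the raw characters.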

Lost formatting and image after search and replace using python-docx

Experts,
I have a template docx report, which has an image and standard formatting inside it. What I did using docx was just to search for some tags and replace them with values from a config file.
Search and replace was working as expected, but the output file lost all the images and the formatting. Do you know what went wrong? All I did was modify example-makedocument.py to use my docx file.
I've searched the discussion on the python.docx librelist and their page on GitHub. There were a lot of questions like this, but they remained unanswered.
Thank you.
--- my script is simple one like this ---
from docx import *
from ConfigParser import SafeConfigParser
filename = "template.docx"
document = opendocx(filename)
relationships = relationshiplist()
body = document.xpath('/w:document/w:body',namespaces=nsprefixes)[0]
####### get config file
parser = SafeConfigParser()
parser.read('../TESTING1-config.txt')
######## Search and replace
print 'Searching for something in a paragraph ...',
if search(body, ''):
    print 'found it!'
else:
    print 'nope.'
print 'Replacing ...',
body = advReplace(body, '', parser.get('ASD', 'ASD'))
print 'done.'
####### #Create our properties, contenttypes, and other support files
title = 'Python docx demo'
subject = 'A practical example of making docx from Python'
creator = 'Mike MacCana'
keywords = ['python', 'Office Open XML', 'Word']
coreprops = coreproperties(title=title, subject=subject, creator=creator,keywords=keywords)
appprops = appproperties()
contenttypes = contenttypes()
websettings = websettings()
wordrelationships = wordrelationships(relationships)
savedocx(document, coreprops, appprops, contenttypes, websettings, wordrelationships, 'Welcome to the Python docx module.docx')
Python-docx only copies over the document.xml file in the original Docx zip. Everything else is discarded and recreated either from a function or from a preexisting template file. This unfortunately includes the document.xml.rels file that is responsible for mapping images.
The oodocx module that I have developed copies over everything from the old Docx and, at least in my experience, plays nicely with images.
I have answered a similar question about python-docx. Python-docx is not meant to store the images of a docx and export them again.
Python Docx is not a templating engine for Docx.
