I'm finding resources that cover either importing XML from a file or escaping HTML in Python (so that characters like ampersands are converted properly and don't cause errors), but what I'm finding is not recent and/or doesn't combine both tasks.
The task is to take an XML file generated from the Amazon API, which always has unescaped ampersands in the output (I have to create a file because Amazon has to approve the code before they allow me to make API requests from the program itself), and then parse it, eventually giving me an object I can get data from.
How would you do this in 2017? Easy answers definitely appreciated, I'm not exactly a pro at python coding yet.
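One possible approach, as a minimal sketch using only the standard library: escape every ampersand that does not already start an entity, then hand the cleaned text to xml.etree.ElementTree. The helper name parse_lenient, the regular expression, and the sample markup are assumptions made for illustration; the real Amazon response will have different element names.

```python
import re
import xml.etree.ElementTree as ET

# Matches an "&" that does not already start an entity such as &amp;, &#38; or &#x26;
BARE_AMPERSAND = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def parse_lenient(xml_text):
    """Escape bare ampersands, then parse the text into an Element."""
    cleaned = BARE_AMPERSAND.sub("&amp;", xml_text)
    return ET.fromstring(cleaned)


# Hypothetical sample standing in for the saved API response
sample = "<Item><Title>Tom & Jerry &amp; Friends</Title></Item>"
root = parse_lenient(sample)
title = root.find("Title").text
print(title)
```

For a saved file, the same helper can be fed the result of open(path, encoding="utf-8").read(). Note that named entities other than the five XML predefines (for example &nbsp;) are left alone by the regular expression and would still make the parser fail.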
I'm trying to download a .doc file using requests.get() (I've heard about other methods, but they all require saving the file too).
Is there any method I could use to extract the text from it (or even convert it into a .txt for example) straight away without saving it into a file?
I've tried passing request.raw into various converters (docx2txt.process() for example) but I assume they all work with files, not with streams.
While the script is running, memory allocation is handled by the Python interpreter, but if you save the content to a file, the memory is allocated differently. This article may be helpful to you.
Link: article
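As a rough sketch of working on the downloaded bytes without touching the disk: wrap them in io.BytesIO and read the archive from memory. This assumes the file is really a .docx (a Zip archive); a legacy binary .doc is a different format and this will not work for it. The function name docx_bytes_to_text is made up for the example, and the small in-memory archive below only stands in for what requests.get(url).content would return.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the main WordprocessingML elements (w:p, w:r, w:t)
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_bytes_to_text(data):
    """Return the text of a .docx given as bytes, one line per paragraph."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter("{%s}p" % W_NS):
        runs = [t.text or "" for t in para.iter("{%s}t" % W_NS)]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


# Stand-in for `requests.get(url).content`: a tiny in-memory archive that
# contains only the main document part (not a complete Word file).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "word/document.xml",
        '<w:document xmlns:w="%s"><w:body><w:p><w:r><w:t>Hello</w:t></w:r>'
        "<w:r><w:t> world</w:t></w:r></w:p></w:body></w:document>" % W_NS,
    )

text = docx_bytes_to_text(buffer.getvalue())
print(text)
```

With requests, the call would look like docx_bytes_to_text(response.content), since response.content holds the whole body as bytes.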
I have a Python source file which I would like to convert, according to a certain template written in Apache Velocity, to an XML file.
Is it possible to read the contents of a Python code file and extract the necessary information natively in Velocity? Or do I need to write a Python script (or use some other language) to parse the Python file and fill the XML template?
As you have identified, you need to have two distinct phases:
Extract information from the Python source file (aka parsing)
Publish this information back to an XML file
Apache Velocity can help you accomplish step 2, but knows nothing about parsing.
There are several ways to achieve the parsing step:
You could parse it with whatever tool you like, and publish the result in a format that Velocity understands easily, like a Properties file that you would put in the context. If you need to use hierarchical properties, have a look at the ValueParser tool, which can return submaps from sets of properties like foo.bar=woogie and foo.schmoo=wiggie.
You could try Stillness, a parsing tool that uses a Velocity-like syntax (disclaimer: I wrote it).
I am trying to perform find and replace on docx files, whilst still maintaining formatting.
From reading up on this, the best way seems to be performing the find/replace on the XML file of the document.
I can load the XML file and find/replace on it, but I am unsure how to write it back.
docx:
Hello {text}!
python:
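import zipfile

# Open the .docx (a Zip archive) and read the main document part
with zipfile.ZipFile('test.docx') as docx_zip:
    xml_string = docx_zip.read('word/document.xml').decode('utf-8')
# str.replace returns a new string, so the result has to be kept
xml_string = xml_string.replace('{text}', 'world')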
What you are trying is dangerous, because you are processing a high-level docx file at a lower level. If you really want to do it, just use the hints from "overwriting file in ziparchive" as suggested by @shahvishal.
But unless you fully know all the details of the docx format, my advice is: do not do that. Suppose that, for any reason, the string {text} also appears in an internal field or attribute. You are likely to change the file in an unexpected way, leading immediately or, even worse, later to the destruction of the file (Word no longer being able to process it).
If you do your processing on a Windows machine with Word installed, you could certainly try to use automation to process the file with Microsoft Word. Unfortunately, I only did that a long time ago and cannot give useful links. You will need:
general knowledge of the pywin32 module
sufficient knowledge of the Automation interface of MS Word. Hopefully, its documentation is good, with many examples, provided you have a full installation of Microsoft Office including macro help
You're really going to want to use a library for reading/writing docx files rather than trying to just deal with them as raw XML. A cursory search came up with the pypi module docx but I haven't used this module so I can't endorse it:
https://pypi.python.org/pypi/docx/0.2.4
I've had the (unfortunate) experience of dealing with the manipulation of MS Office documents from other programming languages, and spending the time to find good libraries really paid off.
The old saying goes "don't reinvent the wheel" and I think that's definitely true when manipulating non-trivial file formats. If a somewhat mature library exists to do the job, use it!
You would need to replace the file in the zip archive. There is no "simple" way of achieving this. The following is a question that should help:
overwriting file in ziparchive
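Since zipfile cannot overwrite a member in place, the usual workaround is to write a new archive, copying every member and swapping in the changed one. A minimal sketch of that idea follows; the function name replace_in_docx and the throwaway demo archive are assumptions for illustration, and the source and destination paths must be different files.

```python
import os
import shutil
import tempfile
import zipfile


def replace_in_docx(src_path, dst_path, old, new):
    """Copy src_path to dst_path, replacing text in word/document.xml."""
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                data = data.decode("utf-8").replace(old, new).encode("utf-8")
            # Every other member is copied through unchanged
            dst.writestr(item, data)


# Demo on a throwaway archive that only mimics the layout of a .docx
work_dir = tempfile.mkdtemp()
src_path = os.path.join(work_dir, "test.docx")
dst_path = os.path.join(work_dir, "result.docx")
with zipfile.ZipFile(src_path, "w") as archive:
    archive.writestr("word/document.xml", "<w:t>Hello {text}!</w:t>")
    archive.writestr("other.xml", "<x>{text}</x>")

replace_in_docx(src_path, dst_path, "{text}", "world")

with zipfile.ZipFile(dst_path) as archive:
    new_xml = archive.read("word/document.xml").decode("utf-8")
    untouched = archive.read("other.xml").decode("utf-8")
shutil.rmtree(work_dir)
print(new_xml)
```

The plain string replacement here is exactly as blunt as in the question, so it carries the same risk of touching a match that is not visible text.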
I want to enter data into a Microsoft Excel Spreadsheet, and for that data to interact and write itself to other documents and webforms.
With success, I am pulling data from an Excel spreadsheet using xlwings. Right now, I’m stuck working with .docx files. The goal here is to write the Excel data into specific parts of a Microsoft Word .docx file template and create a new file.
My specific question is:
Can you modify just a text string(s) in a word/document.xml file and still maintain the integrity and functionality of its .docx encasement? It seems that there are numerous things that can change in the XML code when making even the slightest change to a Word document. I've been working with python-docx and lxml, but I'm not sure if what I seek to do is possible via this route.
Any suggestions or experiences to share would be greatly appreciated. I feel I've read every article that is easily discoverable through a google search at least 5 times.
Let me know if anything needs clarification.
Some things to note:
I started getting into coding about 2 months ago. I’ve been doing it intensively for that time and I feel I’m picking up the essential concepts, but there are severe gaps in my knowledge.
Here are my tools:
Yosemite 10.10,
Microsoft Office 2011 for Mac
You probably need to be more specific, but the short answer is, in principle, yes.
At a certain level, all python-docx does is modify strings in the XML. A couple of things, though:
The XML you create needs to remain well-formed and valid according to the schema. So if you change the text enclosed in a <w:t> element, for example, that works fine. Conversely, if you inject a bunch of random XML at an arbitrary point in one of the .xml parts, that will corrupt the file.
The XML "files", known as parts, that make up a .docx file are contained in a Zip archive known as a package. You must unpackage and repackage that set of parts properly in order to have a valid .docx file afterward. python-docx takes care of all those details for you, but if you're going directly at the .docx file you'll need to take care of that yourself.
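To illustrate the first point, here is a small sketch that changes only the text inside <w:t> elements by going through an XML parser, so the result stays well-formed even when the new text contains characters such as "&". It uses the standard library rather than python-docx or lxml; the function name replace_text and the sample fragment are hypothetical.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# Keep the conventional "w" prefix when the tree is serialized again
ET.register_namespace("w", W_NS)


def replace_text(document_xml, old, new):
    """Replace `old` with `new` inside <w:t> elements only."""
    root = ET.fromstring(document_xml)
    for node in root.iter("{%s}t" % W_NS):
        if node.text and old in node.text:
            node.text = node.text.replace(old, new)
    return ET.tostring(root, encoding="unicode")


# Hypothetical fragment; a real document.xml declares many more namespaces
fragment = (
    '<w:document xmlns:w="%s"><w:body><w:p><w:r>'
    "<w:t>Hello {text}!</w:t></w:r></w:p></w:body></w:document>" % W_NS
)
updated = replace_text(fragment, "{text}", "Tom & Jerry")
print(updated)
```

Two caveats with this sketch: a placeholder that Word has split across several runs will not be found, and ElementTree renames namespace prefixes it has not been told about when it serializes, so for real documents lxml or python-docx is probably the safer route.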
Using SQL Server 2005+
For over a year now, I've been running a query to return all users from the database for whom I'd like to generate a report. At the end of the query, I've added
for xml auto
The result of this is a link to a file where the xml lives (one row, one column). I then take this file and pass it through some xslt and am left with the final xml that a java program I created nearly two years ago can parse.
What I'd like to do is skip the whole middle part and simply automate this all with Python.
Run the query, put the results into xml format and then either use the xslt or simply format the xml that's returned into the format that I need for the application.
I've searched far and wide on the internet but I'm afraid I'm not sure enough of the technologies available to find just what I'm looking for.
Any help in directing me to an answer would be greatly appreciated.
PyODBC to talk to MS SQL Server, libxml2 to handle the XSLT, and subprocess to run your Java program.