I am using python-docx to convert a Word docx to a custom HTML equivalent. The document that I need to convert has images and tables, but I haven't been able to figure out how to access the images and the tables within a given run. Here is what I am thinking...
for para in doc.paragraphs:
    for run in para.runs:
        # How to tell if this run has images or tables?
...but I don't see anything on the Run that has info on the InlineShape or Table. Do I have to fall back to the XML directly or is there a better, cleaner way to iterate over everything in the document?
Thanks!
There are actually two problems to solve for what you're trying to do. The first is iterating over all the block-level elements in the document, in document order. The second is iterating over all the inline elements within each block element, in the order they appear.
python-docx doesn't yet have the features you would need to do this directly. However, for the first problem there is some example code here that will likely work for you:
https://github.com/python-openxml/python-docx/issues/40
There is no exact counterpart I know of to deal with inline items, but I expect you could get pretty far with paragraph.runs. All inline content will be within a paragraph. If you got most of the way there and were just hung up on getting pictures or something, you could go down to the lxml level and decode some of the XML to get what you need. If you get that far along and are still keen, post a feature request on the GitHub issues list for something like "feature: Paragraph.iter_inline_items()" and I can probably provide you with some similar code to get you what you need.
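To give you an idea, a rough sketch of that lxml-level check could look like the following (run_has_image is just a name I'm making up here, not a python-docx API):

from docx.oxml.ns import qn

def run_has_image(run):
    # a run that holds a picture carries a <w:drawing> child (or a legacy
    # <w:pict> child) in its XML; run._r is the underlying lxml element
    return bool(run._r.findall(qn('w:drawing')) or run._r.findall(qn('w:pict')))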
This requirement comes up from time to time so we'll definitely want to add it at some point.
Note that block-level items (paragraphs and tables primarily) can appear recursively, and a general solution will need to account for that. In particular, a paragraph can (and in fact at least one always must) appear in a table cell. A table can also appear in a table cell. So theoretically it can get pretty deep. A recursive function/method is the right approach for getting to all of those.
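For reference, the code in that issue is along these lines. This is a sketch rather than an official API, and the import paths shown here match recent python-docx versions; older versions may differ:

from docx.document import Document as _Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph

def iter_block_items(parent):
    """Yield each paragraph and table child of *parent*, in document order."""
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("expected a Document or _Cell object")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)

Calling iter_block_items(document) gives you paragraphs and tables in document order; when you hit a Table, iterate table.rows and row.cells and call iter_block_items(cell) again to handle the nesting described above.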
Assuming doc is of type Document, then what you want to do is have 3 separate iterations:
One for the paragraphs, as you have in your code
One for the tables, via doc.tables
One for the shapes, via doc.inline_shapes
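A minimal sketch of those three loops might look like this (the filename is just a placeholder):

from docx import Document

doc = Document('example.docx')  # placeholder filename

for para in doc.paragraphs:
    for run in para.runs:
        print(run.text)

for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text)

for shape in doc.inline_shapes:
    print(shape.type, shape.width, shape.height)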
The reason your code wasn't working is that paragraphs don't hold references to the tables or shapes in the document; those are stored on the Document object.
Here is the documentation for more info: python-docx
I have made many attempts and all fail to record the data I need in a reliable and complete manner. I understand the basics of Python and Selenium for automating simple tasks, but in this case the content is dynamically generated and I am unable to find the correct way to access and subsequently record all the data I need.
The URL I am looking to scrape content from is structured similar to the following:
https://dutchie.com/embedded-menu/revolutionary-clinics-somerville/menu
In particular I am trying to grab all the info using something like -
browser.find_elements_by_xpath('//*[@id="products-container"]')
Is this the right approach? How do I access specific sub-elements of this element (and all elements of the same path)?
I have read that I might need beautifulsoup4, but I am unsure the best way to approach this.
Would the best approach be to use XPaths? If so, is there a way to iterate through all elements and record all the data within, or do I have to specify each and every data point that I am after?
Any assistance to point me in the right direction would be extremely helpful as I am still learning and have hit a roadblock in my progress.
My end goal is a list of all product names, prices and any other data points that I deem relevant based on the specific exercise at hand. If I could find the correct way to access the data points I could then store them and compare/report on them as needed.
Thank you!
I think you are looking for something like
browser.find_elements_by_css_selector('[class*="product-information__Title"]')
This should find all elements with a class beginning with that string.
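For example, using the same (pre-Selenium-4) API as your snippet, something along these lines could pull names and prices. Note that the price selector is a guess on my part; inspect the live page to confirm the exact class names, and you may need to scroll/wait first so the dynamic content has loaded:

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://dutchie.com/embedded-menu/revolutionary-clinics-somerville/menu')

titles = browser.find_elements_by_css_selector('[class*="product-information__Title"]')
# "product-information__Price" is an assumption; check the live markup
prices = browser.find_elements_by_css_selector('[class*="product-information__Price"]')

for title, price in zip(titles, prices):
    print(title.text, price.text)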
I have an original text that I want to translate. I normally do it manually, but I know I could save a lot of time by automatically translating the most frequent words and expressions.
I will figure out how to translate simple words; the problem is not there. I have read some books on Python and I think it can be done with string manipulation.
But I am lost about how to create the output file.
The output file will contain:
short empty forms ready to be filled wherever there is text that has not been translated
the translated words wherever they were in the original file
In the output file I will fill in the empty forms manually; after pressing Tab the cursor should jump to the next empty form
I am lost here; I know how to do forms in HTML, but the language I am used to is Python.
I would like to know what modules from Python I could use. I need some guidance on this.
Can you recommend me a book or a tool that explains how to do something similar to this?
This is what I want to do, assuming I have managed to create a simple database to translate colors from Spanish to English.
The first step contains the original file.
The second step contains the automatic translation.
In the third step I complete the manual translation.
After finishing everything is grouped into a normal txt file ready to be used.
I think it is quite clear. I don't expect people to tell me the code to do this, I just need to know what tools could be used to achieve my goal.
Thanks for editing.
To create an interface that works with a web browser, Flask is a good Python option for building web forms. There are tutorials available.
One method for storing data would be an SQLite file. That may be more than you need, so I'd recommend starting with a CSV file. Libraries exist in Python for both CSVs and SQLite.
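For example, a rough sketch with just the csv and re modules, assuming a two-column CSV such as "rojo,red" holding the Spanish-to-English pairs and a plain-text source file (all filenames here are placeholders):

import csv
import re

# load the Spanish -> English pairs
with open('colors.csv', newline='', encoding='utf-8') as f:
    translations = dict(csv.reader(f))

with open('original.txt', encoding='utf-8') as f:
    text = f.read()

def translate_word(match):
    word = match.group(0)
    # known words are translated; unknown words become an empty form
    # to be filled in by hand later
    return translations.get(word.lower(), '[____]')

output = re.sub(r'\w+', translate_word, text)

with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(output)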
I am trying to build a small program in which I open a docx document and replace characters by others, to do some old-school Caesar-style encrypting. After checking the documentation [ https://python-docx.readthedocs.io ], I am afraid I can't find the object methods and attributes; the documentation just kind of explains how to do certain things like creating paragraphs and sections, but I can't find anything on retrieving document data and parsing it. I would like to find a list of the objects in the document so I can parse through them.
I would like to do something like this:
from docx import Document
document = Document('essay.docx')
paragraph = []
for i in document:
    paragraph.append(i)
for i in paragraph:
    for y in i:
        y.replace("a", "y")
...
Can python-docx do something like this? If so where would I find the documentation that could show me how to do it?
If maybe I am using the incorrect library I would also appreciate it if you could point it out.
The API documentation is indexed (i.e. its table of contents appears) on the page you link to and describes all the objects and methods. https://python-docx.readthedocs.io/en/latest/#api-documentation
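For the replacement itself, a minimal sketch using the documented Paragraph and Run objects would look something like this (note that text in tables, headers and footers needs its own loops):

from docx import Document

document = Document('essay.docx')

for paragraph in document.paragraphs:
    for run in paragraph.runs:
        # run.text is a plain Python string, so str.replace works on it
        run.text = run.text.replace("a", "y")

document.save('essay-encrypted.docx')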
I think I found something useful, in case future readers might be interested. The problem with python-docx is that I could only get paragraphs individually, and it would take a lot of time. I don't even know if titles, footers and headers count as paragraphs.
But there is a library called textract that can read docx and other files; it integrates with python-docx, or at least that's what the short documentation says. What I can do is save my docx file as a PDF and use:
text = textract.process(
    'path/to/norwegian.pdf',
    method='pdftotext',
    language='nor',
)
This allows you to get all the text as a string and save it preserving the layout of the pdf. Haven't tested it yet, will edit this post if it doesn't work as intended.
http://textract.readthedocs.io/en/latest/python_package.html#python-package
I am loading and scrolling on dynamically loading pages. An example is the Facebook "wall", which only loads the next items once you have scrolled to somewhere near the bottom.
I scroll until the page is veeeery long, then I copy the source code, save it as a text file and go on to parsing it.
I would like to extract certain parts of the webpage. I have been using the lxml module in Python, but with limited success. On their website they only show examples with pretty short XPaths.
Below is an example of the function and a path that gets me the user names included on the page.
usersID = elTree.xpath('//a[@class="account-group js-account-group js-action-profile js-user-profile-link js-nav"]')
This works fairly well; however, I am getting some errors (covered in another post of mine), such as:
TypeError: 'NoneType' object has no attribute '__getitem__'
I have also been looking at the XPaths that Firebug provides. These are of course much longer and very specific. Here is an example for a recurring element on the page:
/html/body/div[2]/div[2]/div/div[2]/div[2]/div/div[2]/div/div/div/div/div[2]/ol[1]/li[26]/ol/li/div/div[2]/p
The part towards the end, li[26], shows it is the 26th item in a list of identical elements found at the same level of the HTML tree.
I would like to know how I might use such Firebug XPaths with the lxml library, or if anybody knows of a better way to use XPaths in general?
Using example HTML code and tools like this for test purposes, the XPaths from Firebug don't work at all. Is that path just ridiculous, in people's experience?
Is it very specific to the source code? Are there any other tools like Firebug that produce more reliable output for use with lxml?
Firebug actually generates really poor XPaths. They are long and fragile because they're incredibly non-specific beyond hierarchy.
Pages today are incredibly dynamic.
The best way to work with xpath on dynamic pages is to locate common elements as the hook and perform xpath ops from those as your path root.
What I mean here by common elements is stable structural elements that are highly likely or guaranteed to be present. Pick the one closest to your target in terms of containment hierarchy. Shorter paths are faster and clearer.
From there you need to create paths that locate some specific unique attribute or attribute value on the target element.
Sometimes that's not possible so another strategy is to target the closest uniquely identifiable container element then get all elements similar to yours under that and iterate them looking for your goal.
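A sketch of that idea with lxml might look like this. The id and class names are made up for illustration; substitute whatever stable hooks your page actually has:

from lxml import html

with open('saved_page.html', encoding='utf-8') as f:  # the copied source mentioned above
    tree = html.fromstring(f.read())

# 1. locate a stable container element once (hypothetical id)
for container in tree.xpath('//div[@id="timeline"]'):
    # 2. run short, relative xpaths from that hook instead of from /html/body/...
    for item in container.xpath('.//li[contains(@class, "stream-item")]'):
        links = item.xpath('.//a[contains(@class, "account-group")]/@href')
        if links:
            print(links[0])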
Highly dynamic pages require sophisticated and dynamic approaches.
Facebook changes a lot and will require script maintenance frequently.
I found two things which, together, worked very well for me.
The first thing:
The lxml package allows the use of XPath functions. I used the starts-with function, as follows:
tweetID = elTree.xpath("//div[starts-with(@class, 'the-non-varying-part-of-a-very-long-class-name')]")
When exploring the HTML tree using tools such as Firebug/FirePath, the class attribute is shown nice and neatly on a single line, for example: tweet original-tweet js-original-tweet js-stream-tweet js-actionable-tweet js-profile-popup-actionable has-cards has-native-media with-media-forward media-forward cards-forward. When I used that full class value to search my elTree with the code above, nothing was found. Having a look at the actual HTML file I was trying to parse, I saw the attribute value was really spread over many lines, which explains why the lxml package was not finding it with my search.
The second thing:
I know this is not generally recommended as a workaround, but the Python approach that it is "easier to ask for forgiveness than permission" applied in my case. The next thing I did was to wrap a Python try / except around the TypeError that I kept getting at seemingly arbitrary lines of my code.
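The rough shape of that was something like the following; the data-tweet-id attribute is only illustrative:

try:
    tweets = elTree.xpath(
        "//div[starts-with(@class, 'the-non-varying-part-of-a-very-long-class-name')]")
    tweet_id = tweets[0].get('data-tweet-id')  # hypothetical attribute name
except (TypeError, IndexError):
    # skip the occasional malformed element rather than abort the whole run
    tweet_id = None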
This may well be specific to my code, but after checking the output on many cases, it seems as though it worked well for me.
I just read through the documentation on python-docx.
They mention several times that added content is created at the end of the document, but I didn't notice any way to alter this functionality.
Does anyone know how to add a new page to a pre-existing document, but make it page 1?
Thanks!
The short answer is the library doesn't support that just yet, although those features are high on the backlog so will be among the next to be implemented.
To get it done in the meantime you'll need to go down to the XML level with a "workaround" function. If you add this use case to this issue on GitHub, I'll put together some workaround code you can use.
https://github.com/python-openxml/python-docx/issues/27
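To give a flavour of what such a workaround can look like, here is a sketch that drops to the lxml level. It relies on the private ._p element, so it may break with future versions, and the filename is a placeholder:

from docx import Document
from docx.enum.text import WD_BREAK

doc = Document('existing.docx')  # placeholder filename

# add the new content at the end (the only place python-docx puts it),
# then move its underlying XML element in front of the current first paragraph
new_par = doc.add_paragraph('Content for the new first page')
new_par.add_run().add_break(WD_BREAK.PAGE)

first_par = doc.paragraphs[0]
first_par._p.addprevious(new_par._p)  # lxml moves the element rather than copying it

doc.save('existing.docx')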