How to convert docx file to .chm using Python - python

I want to convert the contents (text, images, links) of a docx file to a .chm file using Python. Can anyone suggest how to do this?
I tried reading the docx content with the docx2txt package (https://github.com/ankushshah89/python-docx2txt), but I am not sure how to read the images and links in the file.
Can someone suggest how to read each kind of content separately and convert it to a .chm file?

Be warned: this has a learning curve.
You need to extract all sections of your Word document into clean HTML files, including the graphic files.
You can try Word's Save as HTML, but I don't think it produces clean HTML.
You need the Microsoft HTML Help compiler to create CHM files. I recommend using a converter tool or a Help Authoring Tool (HAT) for this task.
Search Google for a tool such as "DoctoChm" and give it a try.
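If you would rather drive the compiler yourself, the HTML Help compiler takes a project (.hhp) file that names the output CHM and lists the HTML topics. The following is only a minimal sketch of generating such a file from Python; the function name build_hhp and the file names in the usage note are hypothetical, and a real project will usually also want a contents (.hhc) file.

```python
from typing import Iterable


def build_hhp(title: str, chm_name: str, default_topic: str,
              files: Iterable[str]) -> str:
    """Return the text of a minimal HTML Help project (.hhp) file.

    Hypothetical helper for illustration; it only builds the text and
    does not call the compiler.
    """
    lines = [
        "[OPTIONS]",
        "Compatibility=1.1 or later",
        f"Compiled file={chm_name}",
        f"Default topic={default_topic}",
        "Language=0x409 English (United States)",
        f"Title={title}",
        "",
        "[FILES]",
    ]
    # One HTML topic per line, relative to the project file.
    lines.extend(files)
    # The compiler is a Windows tool, so use Windows line endings.
    return "\r\n".join(lines) + "\r\n"
```

You would write the returned text to something like project.hhp and then run the compiler from HTML Help Workshop on it (hhc.exe project.hhp). The location of hhc.exe depends on your installation.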

I recently needed to convert some resumes to plain text. There are any number of use cases for wanting to extract readable text from binary formats.
See http://davidmburke.com/2014/02/04/python-convert-documents-doc-docx-odt-pdf-to-plain-text-without-libreoffice/

Related

Python + Linux - Excel to HTML (keeping format)

I'm looking for a way to convert excel to html while preserving formatting.
I know this is doable on Windows thanks to some underlying win32 libraries (e.g. via xlwings; see "Python - Excel to HTML (keeping format)"),
but I'm looking for a solution on Linux.
I've also come across Aspose Cells, but it requires a paid license; without one it adds a lot of extra junk to the output that has to be scrubbed out.
Lastly, I tried the Python library xlsx2html, but it does a very poor job of preserving formatting.
Are there any suggestions for a Linux-based solution? I'd also be interested in tools written in other languages that can easily be wrapped from Python.
Thanks in advance!
Update:
Here is an example of a random excel sheet I converted via excel itself that I would like to reproduce. It has some colors, some border variations, some merged cells and some font sizes to see if they all work.
You can use LibreOffice to convert an Excel file to a HTML file using the command line:
# --convert-to implies --headless so it's not mandatory to specify --headless
soffice --headless --convert-to html data.xlsx
You can refer to the documentation to know more about other CLI parameters.
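Since the question asks for something that can be wrapped from Python, here is a minimal sketch of calling the same command through subprocess. It assumes soffice is on the PATH; the helper names soffice_command and convert_to_html are made up for this example, and --outdir is used to choose where the result is written.

```python
import subprocess
from pathlib import Path


def soffice_command(source, out_dir, target_format="html"):
    """Build the LibreOffice command line for a headless conversion."""
    return [
        "soffice",
        "--headless",
        "--convert-to", target_format,
        "--outdir", str(out_dir),
        str(source),
    ]


def convert_to_html(source, out_dir="."):
    """Run LibreOffice and return the path where the HTML is expected.

    Assumes the soffice executable is installed and on the PATH.
    """
    subprocess.run(soffice_command(source, out_dir), check=True)
    return Path(out_dir) / (Path(source).stem + ".html")
```

Calling convert_to_html("data.xlsx", "out") should leave out/data.html behind if the conversion succeeds. How much of the formatting survives still depends on LibreOffice's HTML export.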
I think you should look for Excel-to-HTML tools in the JS world rather than in Python (I am not saying it is impossible in Python, but it is more common in JS). I promise you will get better results.
In my opinion, finding a JS-based solution and writing a Python wrapper around it would be more helpful, because the JS community has had to deal with importing and working with Excel files more than other communities have.
Another idea is to change your approach: look at how to embed an Excel file in an HTML page (for example in an iframe) with JS and then export it.
But again, I highly recommend checking JS libraries or GitHub repositories; some of them take care to preserve formatting.

Is there a way to save an inline shape from docx as an image file?

I'm trying to parse a docx file using python-docx. The file contains images and text. Basically, I need a way to take an image (an InlineShape object) from the file and save it as a separate image file (like "smth.jpg"). Is there a way to do that? From reading the API docs it doesn't seem so, but maybe I'm missing something.
docx2python will pull these images for you.
from docx2python import docx2python
content = docx2python('my_document.docx', 'output_image_directory')
The images will be in whatever directory you supply.
OK, I've figured out a way: treating the docx file as a zip archive and extracting the images from there. It's not the best option, but it's good enough for me.
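The zip approach can be sketched with the standard library alone, because a .docx file is a zip archive and Word stores embedded pictures under word/media/. The function name extract_docx_images is invented for this example, and the sketch keeps the file names Word assigned (image1.png and so on) rather than mapping them back to individual InlineShape objects.

```python
import zipfile
from pathlib import Path


def extract_docx_images(docx_path, out_dir):
    """Copy every file under word/media/ in the docx into out_dir.

    Returns the list of paths that were written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Skip directory entries; keep only the media files.
            if name.startswith("word/media/") and not name.endswith("/"):
                target = out / Path(name).name
                target.write_bytes(archive.read(name))
                saved.append(target)
    return saved
```

For example, extract_docx_images("my_document.docx", "images") writes the embedded pictures into an images directory next to the script.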

Can't get text out of PDF file with PyPDF2

I am trying to use PyPDF2 to get the text from a PDF file I downloaded.
Here is my code:
import PyPDF2
if not PyPDF2.PdfFileReader('download.pdf').isEncrypted:
    PyPDF2.PdfFileReader('download.pdf').getPage(0).extractText()
This is the output:
'\n\n˘ˇ˘ˆ˙\n˝˛˚˜!\n\n\n\n#\nˇ˘ˆ˙ˆ˝˛˝\n˙˙˘ ˘ˆ"˝\n$!%˙(˝)˙*˜+,˝-.#/.(#0)0)/.1.+02345.\n˛˛ˇ/#.$/0/70/#.+322.32˙˘˛˘˘\n˛˘ 8˙˘9:˘ˆ;\n˛˘\n\n˝=\n˙˘˛\n.ˇ<9:˘ˇˇ%˘˛ˇ ˘˘<˘\n˝>"?˝˘$#<˘*ˆˆ˘˙˘A˘B˘˙˘˛ˇ!˛˘˙˘˛ˇ˘\n1C˙ˆ˘06˛˘8+˛9:˘D10+E˝ˆ˘8\n$˘˘9:˘˘1C˙ˆ˘+˘F˛˘D$1+FE˝˘˛˘˘<˘?˝\n////)*˘1˘˛ ?GG˜*HI\nD˘˙A˘E\nJ$\n˛\nDLE///M˛˝˛˙˘˛˘˛\n˛˘˛>"?\n˙˘˛\n˛\n/)M6;˝˛˙˘˛˘\n˛\n///˛\n\n'
When I open the file, its content looks fine. When I use another program to convert the PDF to txt, that also works fine. The PDF is rendered by JavaScript on a web page; I don't know whether that makes any difference.
Under Win 7, Python 3.6, I had the problem that PyPDF2 did not properly encode some PDF files. My solution was to use pdfminer.six.
pip install pdfminer.six
To extract text from a PDF, you can use functions such as the one in this post: https://stackoverflow.com/a/42154976/9524424
Worked perfectly for me.
The following is taken from the documentation (https://pythonhosted.org/PyPDF2/PageObject.html)
extractText() Locate all text drawing commands, in the order they are
provided in the content stream, and extract the text. This works well
for some PDF files, but poorly for others, depending on the generator
used. This will be refined in the future. Do not rely on the order of
text coming out of this function, as it will change if this function
is made more sophisticated. Returns: a unicode string object.
So, it seems that the performance of this function depends on the pdf itself.
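Because the result depends on the PDF, it can help to check whether the extracted text looks usable before relying on it, and to try another extractor such as pdfminer.six if it does not. The sketch below is only a rough heuristic of my own, not part of either library: looks_garbled and its 0.5 threshold are assumptions, and it presumes mostly ASCII text, so it would misjudge documents in other scripts.

```python
def looks_garbled(text, threshold=0.5):
    """Heuristic: True if fewer than `threshold` of the visible
    characters are ASCII letters or digits (or if there are none)."""
    visible = [c for c in text if not c.isspace()]
    if not visible:
        return True
    readable = sum(1 for c in visible if c.isascii() and c.isalnum())
    return readable / len(visible) < threshold


def extract_with_pdfminer(path):
    """Fallback extractor; needs `pip install pdfminer.six`."""
    # Imported lazily so the heuristic above works without pdfminer.
    from pdfminer.high_level import extract_text
    return extract_text(path)
```

A caller could run the PyPDF2 extraction first and switch to extract_with_pdfminer("download.pdf") only when looks_garbled returns True for the result.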

Saving files openoffice Python

I'm working on a script that creates an OpenOffice document. After that I want to save the file, and maybe later also export it as a PDF. Google hasn't turned up any information on how to do this.
My question is: what method should be used to save an OpenOffice Writer document?
Thanks in advance!
You should look at this similar question, whose answer covers both MS Word and OpenOffice Writer (by the way, creating a Word file may be the easiest route, since OpenOffice can read it):
How can I create a Word document using Python?
Alexis
You can create an RTF file with pyrtf or its variants, and for PDF you can use reportlab. These are Python libraries that write the files directly; they do not remote-control OpenOffice. There are other libraries for other formats.

Read contents of a pdf file

Is there a command-line tool to read a PDF file on Linux? Please point me to the appropriate URLs.
Thanks.
Xpdf and Poppler both include the command-line utility pdftotext, which converts PDF files to plain text.
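If you want the text inside a Python script rather than in a file, pdftotext can be called through subprocess. This is a minimal sketch, assuming pdftotext is installed and on the PATH; the helper names are made up for the example. Passing "-" as the output file sends the text to standard output, -layout tries to keep the physical page layout, and -enc UTF-8 fixes the output encoding.

```python
import subprocess


def pdftotext_command(pdf_path, keep_layout=False):
    """Build a pdftotext command line that writes UTF-8 text to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        cmd.append("-layout")
    # "-" as the output file name means standard output.
    cmd += [str(pdf_path), "-"]
    return cmd


def pdf_to_text(pdf_path, keep_layout=False):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        pdftotext_command(pdf_path, keep_layout),
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

For example, print(pdf_to_text("download.pdf")) prints the text of the whole document.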
There is PyODConverter. It uses OpenOffice working as a service and can convert between various document formats including PDF and simple text.
Not a command-line tool, but a PDF reading and generation framework:
http://www.reportlab.com/software/opensource/
You should also be able to write a simple reader using:
https://pypi.org/project/pypdf/
http://code.activestate.com/recipes/511465-pure-python-pdf-to-text-converter/
You can also look at:
http://www.unixuser.org/~euske/python/pdfminer/index.html
