Read contents of a pdf file - python

Is there a commandline tool to read a pdf file on linux.Please indicate the appropriate urls for this.
Thanks..

Xpdf and Poppler contain the commandline-utility pdftotext wich converts PDF files to plain text.

There is PyODConverter. It uses OpenOffice working as a service and can convert between various document formats including PDF and simple text.

Not a command line tool but a pdf reading and generation framework
http://www.reportlab.com/software/opensource/
you should also be able to write a simple reader using
https://pypi.org/project/pypdf/
http://code.activestate.com/recipes/511465-pure-python-pdf-to-text-converter/
you can also look at:
http://www.unixuser.org/~euske/python/pdfminer/index.html

Related

Can't read .docx file which i got after converting pdf using soffice command

I am trying to convert pdf to docx using soffice. It converts it into .docx but it gives textboxes which I am unable to read using the docx api provided by python. Is there any better way to read the file or any better way to convert pdf to docx so that I do not get textboxes?
soffice --infilter="writer_pdf_import" --convert-to docx "convert_this.pdf"
You can try using Aspose.Words for Cloud to convert PDF to Word documents.
https://docs.aspose.cloud/display/wordscloud/Convert+PDF+Document+to+Word
It converts PDF from fixed form to flow form so it is editable in MS Word.
Disclosure: I work at Aspose.Words team.

Can't get text out of PDF file with PyPDF2

I am trying to get the text from a PDF file I downloaded with PyPDF.
Here is my code:
if not PyPDF2.PdfFileReader('download.pdf').isEncrypted:
PyPDF2.PdfFileReader('download.pdf').getPage(0).extractText()
This is the output:
'\n\n˘ˇ˘ˆ˙\n˝˛˚˜!\n\n\n\n#\nˇ˘ˆ˙ˆ˝˛˝\n˙˙˘ ˘ˆ"˝\n$!%˙(˝)˙*˜+,˝-.#/.(#0)0)/.1.+02345.\n˛˛ˇ/#.$/0/70/#.+322.32˙˘˛˘˘\n˛˘ 8˙˘9:˘ˆ;\n˛˘\n\n˝=\n˙˘˛\n.ˇ<9:˘ˇˇ%˘˛ˇ ˘˘<˘\n˝>"?˝˘$#<˘*ˆˆ˘˙˘A˘B˘˙˘˛ˇ!˛˘˙˘˛ˇ˘\n1C˙ˆ˘06˛˘8+˛9:˘D10+E˝ˆ˘8\n$˘˘9:˘˘1C˙ˆ˘+˘F˛˘D$1+FE˝˘˛˘˘<˘?˝\n////)*˘1˘˛ ?GG˜*HI\nD˘˙A˘E\nJ$\n˛\nDLE///M˛˝˛˙˘˛˘˛\n˛˘˛>"?\n˙˘˛\n˛\n/)M6;˝˛˙˘˛˘\n˛\n///˛\n\n'
When I open the file its content is fine. Also when I use another program to transform pdf into txt it works fine. It is a javascript rendered pdf on a webpage, don't know if it makes any difference.
Under Win 7, Python 3.6, I had the problem that PyPDF2 did not properly encode some PDF files. My solution was to use pdfminer.six.
pip install pdfminer.six
To extract text from a PDF, you can use functions such as the one in this post: https://stackoverflow.com/a/42154976/9524424
Worked perfect for me...
The following is taken from the documentation (https://pythonhosted.org/PyPDF2/PageObject.html)
extractText() Locate all text drawing commands, in the order they are
provided in the content stream, and extract the text. This works well
for some PDF files, but poorly for others, depending on the generator
used. This will be refined in the future. Do not rely on the order of
text coming out of this function, as it will change if this function
is made more sophisticated. Returns: a unicode string object.
So, it seems that the performance of this function depends on the pdf itself.

How to convert docx file to .chm using Python

I want to convert contents(text, images, links) of docx file to .chm file using Python. Can anyone please suggest how to do.
I tried to read the docx file content using docx2txt
https://github.com/ankushshah89/python-docx2txt package. But I am not sure how to read the images and links in the file.
Can someone please suggest how to read each content separately and convert it to .chm file.
You maybe warned this has a learn curve.
You need to extract all sections from your Word document into clean HTML files including the graphic files.
Please try to Save Word as HTML. But I think this don't make clean HTML.
You need the Microsoft Htmlhelp compiler for creating Chm files. I recommend using a converter tool or a Help Authoring Tool (Hat) for your task.
Search by Google for such tool "DoctoChm" and give it a try for your needs.
I recently needed to convert some resumes to plain text. There are any number of use cases for wanting to extract readable text from binary formats.
you can see the url 'http://davidmburke.com/2014/02/04/python-convert-documents-doc-docx-odt-pdf-to-plain-text-without-libreoffice/'

Convert .pages to .doc or .pdf in Python

How does one convert a .pages file to a .doc or .pdf file using Python? My use case is basically:
User uploads a .pages file to my service
My service converts the .pages to a .pdf`
The .pdf is rendered in browser using a browser-based .pdf viewer
I've never done it, but it appears the .pages file already contains a pdf version if you unzip the file: http://blog.cleverly.com/
A complete native solution in python will be difficult.
Appropriate solution would be to look at how you can automate pages to export the file in pdf or ms word.
For that, there seems to be an available solution:
pyobjc
Three is an example that automates pages using pyobjc: http://www.mugginsoft.com/kosmictask/help/automation-python

Reading/Writing MS Word files in Python

Is it possible to read and write Word (2003 and 2007) files in Python without using a COM object?
I know that I can:
f = open('c:\file.doc', "w")
f.write(text)
f.close()
but Word will read it as an HTML file not a native .doc file.
See python-docx, its official documentation is available here.
This has worked very well for me.
Note that this does not work for .doc files, only .docx files.
If you only what to read, it is simplest to use the linux soffice command to convert it to text, and then load the text into python:
I'd look into IronPython which intrinsically has access to windows/office APIs because it runs on .NET runtime.
doc (Word 2003 in this case) and docx (Word 2007) are different formats, where the latter is usually just an archive of xml and image files. I would imagine that it is very possible to write to docx files by manipulating the contents of those xml files. However I don't see how you could read and write to a doc file without some type of COM component interface.

Categories

Resources