Parse Arabic PDF to plain text - Python

I've got a problem with parsing an Arabic PDF to plain text. I have tried Apache Tika, PDFBox (both in Java and Python) and a few less popular tools like PyPDF2, every time getting a mixed order of characters. For PDFBox I used the hint from the documentation for RTL languages, but it didn't work.
The example is presented below:
Original PDF:
Generated text:
The order is changed in every line where Latin characters occur. Has anyone faced a similar problem and solved it?
Thanks for your help!
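For reference, the kind of extraction call being described looks roughly like this (a minimal sketch using the tika Python bindings; the file name is a placeholder, since the asker's exact code is not shown):

from tika import parser  # pip install tika; runs a local Tika server, which needs Java

parsed = parser.from_file('example.pdf')  # 'example.pdf' stands in for the Arabic PDF
print(parsed['content'])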

Related

Strange looking Swedish characters in generated PDF

I am building the backend for a system in Django and I am generating PDF files using ReportLab. Please notice that the dots and circles above some letters are moved to the right for some reason. Why does this occur? The font is Times New Roman.
I found that this issue came from pasting. I had the original text in a PDF document, which I copied and pasted into a .py file (UTF-8 encoded). When I typed the same words myself, the problem disappeared.
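One way pasted and typed strings can differ even though they look identical is Unicode normalization: text copied out of a PDF often arrives with combining marks instead of precomposed letters. A minimal sketch of how to check (the decomposed form here is an assumption, not something confirmed in the answer):

import unicodedata

typed = '\u00f6'    # 'ö' as a single precomposed code point
pasted = 'o\u0308'  # 'o' followed by COMBINING DIAERESIS, as pasted text sometimes is
print(typed == pasted)                                # False: different code points
print(unicodedata.normalize('NFC', pasted) == typed)  # True after composing the pair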

Does Python have font faces for strings?

I recently used the Google Vision API to extract text from a PDF. Now I am searching for a keyword in the response text (from the API). When I compare the given string and the found string, they do not match even though they have the same characters.
The only reason I can see is that the font types of the given and found strings look different, which leads to different ASCII/UTF-8 codes for the characters in the string. (I have never come across such a problem before.)
How do I solve this? How can I bring these two strings to the same characters? I am using a Jupyter notebook, but even when I pasted the comparison into a terminal it still evaluated to False.
Here are the strings I am trying to match:
'КА Р5259' == 'KA P5259'
But they look the same on Stack Overflow so here's a screenshot:
Thanks everyone for your comments.
I found the solution. I am posting it here; it might be helpful for someone. It is actually correct that Python does not support font faces. So if you copy a font-faced character and paste it into a Python console or a Jupyter notebook (which renders font faces because it uses HTML to display information), it is treated as a different Unicode character.
So the idea is to first bring the text response into a plain-text format, which I achieved by storing the response in a .txt file (or, more precisely, a .pkl file), which I had to do anyway to preserve the response objects for later data-analysis purposes. Once the response is stored in a plain-text file, you can read it back without the font-face problem I described above.
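A quick way to see that the two strings from the question really are different code points is to print the Unicode name of each mismatching character, e.g.:

import unicodedata

found = 'КА Р5259'   # as returned by the Vision API
given = 'KA P5259'   # as typed on a Latin keyboard
for f, g in zip(found, given):
    if f != g:
        # e.g. CYRILLIC CAPITAL LETTER KA != LATIN CAPITAL LETTER K
        print(unicodedata.name(f), '!=', unicodedata.name(g))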

Can't get text out of PDF file with PyPDF2

I am trying to get the text from a PDF file I downloaded, using PyPDF2.
Here is my code:
import PyPDF2

if not PyPDF2.PdfFileReader('download.pdf').isEncrypted:
    PyPDF2.PdfFileReader('download.pdf').getPage(0).extractText()
This is the output:
'\n\n˘ˇ˘ˆ˙\n˝˛˚˜!\n\n\n\n#\nˇ˘ˆ˙ˆ˝˛˝\n˙˙˘ ˘ˆ"˝\n$!%˙(˝)˙*˜+,˝-.#/.(#0)0)/.1.+02345.\n˛˛ˇ/#.$/0/70/#.+322.32˙˘˛˘˘\n˛˘ 8˙˘9:˘ˆ;\n˛˘\n\n˝=\n˙˘˛\n.ˇ<9:˘ˇˇ%˘˛ˇ ˘˘<˘\n˝>"?˝˘$#<˘*ˆˆ˘˙˘A˘B˘˙˘˛ˇ!˛˘˙˘˛ˇ˘\n1C˙ˆ˘06˛˘8+˛9:˘D10+E˝ˆ˘8\n$˘˘9:˘˘1C˙ˆ˘+˘F˛˘D$1+FE˝˘˛˘˘<˘?˝\n////)*˘1˘˛ ?GG˜*HI\nD˘˙A˘E\nJ$\n˛\nDLE///M˛˝˛˙˘˛˘˛\n˛˘˛>"?\n˙˘˛\n˛\n/)M6;˝˛˙˘˛˘\n˛\n///˛\n\n'
When I open the file, its content is fine. Also, when I use another program to transform the PDF into txt, it works fine. It is a JavaScript-rendered PDF on a webpage; I don't know if that makes any difference.
Under Win 7, Python 3.6, I had the problem that PyPDF2 did not properly encode some PDF files. My solution was to use pdfminer.six.
pip install pdfminer.six
To extract text from a PDF, you can use functions such as the one in this post: https://stackoverflow.com/a/42154976/9524424
Worked perfectly for me...
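With a recent pdfminer.six, the high-level helper is enough for simple cases; a minimal sketch (using 'download.pdf' from the question):

from pdfminer.high_level import extract_text  # pip install pdfminer.six

# Extract all text from the PDF that PyPDF2 could not decode properly
text = extract_text('download.pdf')
print(text)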
The following is taken from the documentation (https://pythonhosted.org/PyPDF2/PageObject.html)
extractText(): Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated. Returns: a unicode string object.
So, it seems that the performance of this function depends on the pdf itself.

HTML Decoding in Python

I am writing a Python script for mass-replacement of links (actually image and script sources) in HTML files; I am using lxml. There is one problem: the HTML files are quizzes and they have data packaged like this (there is also some Cyrillic here):
<input class="question_data" value="{"text":"<p>[1] је наука која се бави чувањем, обрадом и преносом информација помоћу рачунара.</p>","fields":[{"id":"1","type":"fill","element":{"sirina":"103","maxDuzina":"12","odgovor":["Информатика"]}}]}" name="question:1:data" id="id3a1"/>
When I try to print out this data in python using:
print "OLD_DATA:", data
It just prints out the error "UnicodeEncodeError: character maps to <undefined>". There are more of these elements. My goal is to change the links of the images in the value part of the input, but I can't change the links if I don't know how to print this data (or how it should be written to the file). How does Python handle (interpret) this? Please help. Thanks!!! :)
You're running into the same problem I've hit many times in the past. That error almost always means that the console environment you're using can't display the characters it's trying to print. It might be worth trying to log to a file instead, then opening the log in an editor that can display the characters.
If you really want to be able to see it on your console, it might be worth writing a function to screen the strings you're printing for unprintable characters.
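A rough sketch of that screening idea (Python 3 syntax; the question itself uses Python 2, so adapt accordingly):

import sys

def safe_print(label, text):
    # Escape anything the console encoding cannot represent instead of
    # raising UnicodeEncodeError
    enc = sys.stdout.encoding or 'ascii'
    print(label, text.encode(enc, errors='backslashreplace').decode(enc))

safe_print('OLD_DATA:', data)  # 'data' is the string from the question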
I also found a couple other StackOverflow posts that might be helpful in your efforts:
How do I get Cyrillic in the output, Python?
What is right way to use cyrillic in python lxml library
I would also recommend this article and python manual entry:
https://docs.python.org/2/howto/unicode.html
http://www.joelonsoftware.com/articles/Unicode.html

How to parse just the text from a Word Doc using Python?

When you try opening an MS Word document (or, for that matter, most Windows file formats) as plain text, you will see gibberish like the sample below, broken intermittently by the actual text. I need to extract the text and ignore the gibberish, which looks something like the sample given below. How do I extract only the text that matters and ignore the rest? Please advise.
Here's a sample of open("sample.doc", "r").read() on a Word doc. Thanks
00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00In an Interesting news,his is the first time we polled Indian channel community for their preferred memory supplier. Transcend came a close second, was seen to be more popular among class A city based resellers, was also the most recalled memory brand among customers according to resellers. However Transcend channels complained of parallel imports and constant unavailability of the products in grey x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x
The tool that seems the most viable, particularly if you need an all-Python solution, is OleFileIO.
.doc is a binary format; it's not a markup language or anything like that.
Specs: http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx
There is no generic way to extract information from every file format. You need to know the format to know how to extract the information.
Just wanted to state that first. So what you should look for is libraries and software that can convert/extract the information you want. And, as mentioned by Ofir, Microsoft has tools for that for their formats.
But if you cannot do this and want to take the chance that there is readable text in the file, you could do a normal read and look for sequences of bytes that form text. Then comes the question: what languages/charsets should I support in my hunt for text? Is it multi-byte text?
The easy start is to loop through the data and look for sequences of [a-zA-Z0-9_- ] to find the text. But Word is probably multi-byte, so you should scan two bytes as one character.
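A rough sketch of that byte-scanning idea (single-byte ASCII only; if the text is stored two bytes per character, decode as UTF-16 first):

import re

with open('sample.doc', 'rb') as f:  # 'sample.doc' is the file from the question
    raw = f.read()

# Print runs of at least five printable ASCII characters and ignore the rest
for run in re.findall(rb'[a-zA-Z0-9_\-.,;: ]{5,}', raw):
    print(run.decode('ascii'))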
Note: some of the newer formats, like OpenOffice and .docx, are multiple files in a compressed container, so you need to decompress the file first and then scan the XML documents for the text you are looking for.
.docx is a compressed format. You need to uncompress it first to get the real data (try opening a .docx file in a program like WinRAR and you'll see it contains multiple files).
It is even XML inside, so reading the format should not be that hard, although I'm not sure if you get all the data this way.
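For a .docx file this is easy to try, since the main text lives in word/document.xml inside the ZIP container; a minimal sketch (the tag-stripping regex is only for illustration):

import re
import zipfile

with zipfile.ZipFile('sample.docx') as zf:  # a .docx file, not the binary .doc from the question
    xml = zf.read('word/document.xml').decode('utf-8')

# Crudely strip the XML tags and collapse whitespace to get plain text
print(' '.join(re.sub(r'<[^>]+>', ' ', xml).split()))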
I had a similar problem, needing to query hundreds of Word documents. I converted the Word files to text files and used normal text parsing tools. Worked well.
