How to use an Arabic stemmer on Ubuntu - Python

I'm trying to test stemming Arabic text with the ISRIStemmer tool, but the GNOME terminal doesn't properly render Arabic text, which is RTL.
I assume this means I need to keep the texts in external documents and reference them from the code.
Can anyone show me an example of how I might go about doing this?
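
One way to sidestep the terminal is to keep the Arabic text in a UTF-8 file and write the stems to another file, so nothing has to render in the GNOME terminal. The sketch below assumes Python 3, that NLTK is installed, and that its nltk.stem.isri.ISRIStemmer is the stemmer in question; the helper names (make_isri_stem, stem_file) and the file names are made up for illustration.

```python
import io


def make_isri_stem():
    """Return the stem function of NLTK's ISRIStemmer (assumes NLTK is installed)."""
    # Imported lazily so that stem_file itself does not depend on NLTK.
    from nltk.stem.isri import ISRIStemmer
    return ISRIStemmer().stem


def stem_file(in_path, out_path, stem_word):
    """Read UTF-8 text from in_path, stem each whitespace-separated token,
    and write the stems to out_path, one output line per input line."""
    with io.open(in_path, encoding="utf-8") as src, \
            io.open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            stems = [stem_word(token) for token in line.split()]
            dst.write(" ".join(stems) + "\n")


# Usage (hypothetical file names):
# stem_file("arabic_input.txt", "arabic_stems.txt", make_isri_stem())
```

The output file can then be opened in an editor that handles RTL text, such as gedit.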

Related

Azure Language Studio not showing text content in python script regarding OCR

I am running OCR on a Word document to recognize its content. I noticed that the Python script auto-generated in Language Studio does not show the content of the document. I just want a Python script structure in which I can see the tags that identify the sentences, without the table content.
Is the approach I am looking for right? Any flow that explains the requirement is much appreciated.
This behavior comes from Form Recognizer. The generated Python script does not show anything related to general text from an image or from a text file such as DOCX or PDF. Form recognition targets specific structures that have pre-modelled syntax scenarios, so input content that is not in tabular form is not recognized in the Python script. The non-tabular content is visible in the JSON output instead.
Check out the thread for reference.
Azure Cognitive Form Recognizer to Extract Page Numbers using Python

Multilingual Python script

I recently created a Python script and am now considering making it multilingual.
However, I already have all the text written in English and all in that particular script.
So I didn't write down the sentences in any external file and apply variables, but wrote the sentences right into the script.
Now the question: How can I translate the script for another language?
Do I have to replace all the existing sentences with variables? Or is there an API for this?
If I have to replace the sentences with variables, should I use JSON or XML as the external language file?
There are different libraries for handling translations of software and scripts, a task known as internationalization (i18n). Depending on the library, common formats are JSON, gettext, YAML or XML. As an example, there is https://pypi.org/project/python-i18n/.
Yes, you basically import the translation function of an i18n-library and provide your text as a parameter. If you also have variables, insert them properly in your text to place them somewhere different depending on the grammar of the language.
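
To make the idea concrete, here is a minimal hand-rolled sketch of what such a translation function does. It is not the python-i18n API; the catalog contents, the key name and the function t are all hypothetical, and a real project would load each language from its own JSON file. Named placeholders let each language put the variables wherever its grammar needs them.

```python
import json

# Hypothetical catalogs; in practice each would live in its own file, e.g. de.json.
CATALOGS = {
    "en": json.loads('{"greeting": "Hello {name}, you have {count} new messages."}'),
    "de": json.loads('{"greeting": "{name}, du hast {count} neue Nachrichten."}'),
}


def t(key, lang="en", **variables):
    """Look up key for lang, fall back to English and then to the key itself,
    and fill in the named placeholders."""
    template = CATALOGS.get(lang, {}).get(key) or CATALOGS["en"].get(key, key)
    return template.format(**variables)


print(t("greeting", lang="de", name="Ana", count=3))
```

In the script itself, each hard-coded sentence would then become a call such as t("greeting", lang=user_lang, name=..., count=...).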

Parse arabic PDF to plain text

I've got a problem with parsing Arabic PDFs to plain text. I have tried Apache Tika, PDFBox (both in Java and Python) and a few less popular tools like PyPDF2, every time getting a mixed order of signs. For PDFBox I used the hint from the documentation for RTL languages, but it didn't work.
The example is presented below:
Original PDF:
Generated text:
The order is changed in every line in which Latin signs occur. Has anyone faced a similar problem and solved it?
Thanks for help!

HTML Decoding in Python

I am writing a Python script for mass replacement of links (actually image and script sources) in HTML files; I am using lxml. There is one problem: the HTML files are quizzes, and they have data packaged like this (there is also some Cyrillic here):
<input class="question_data" value="{"text":"<p>[1] је наука која се бави чувањем, обрадом и преносом информација помоћу рачунара.</p>","fields":[{"id":"1","type":"fill","element":{"sirina":"103","maxDuzina":"12","odgovor":["Информатика"]}}]}" name="question:1:data" id="id3a1"/>
When I try to print out this data in python using:
print "OLD_DATA:", data
It just prints out the error "UnicodeEncodeError: character maps to undefined". There are more of these elements. My goal is to change the links of images in the value part of the input, but I can't change the links if I don't know how to print this data (or how it should be written to the file). How does Python handle (interpret) this? Please help. Thanks!!! :)
You're running into the same problem I've hit many times in the past. That error almost always means that the console environment you're using can't display the characters it's trying to print. It might be worth trying to log to a file instead, then opening the log in an editor that can display the characters.
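A minimal sketch of the log-to-file approach, assuming Python 3 (or io.open with unicode strings on Python 2); the function name and the log file name are made up for illustration. Writing through an explicit UTF-8 encoding bypasses the console's code page entirely.

```python
import io


def log_to_file(path, label, data):
    """Append label and data to a UTF-8 log file instead of printing to the console."""
    with io.open(path, "a", encoding="utf-8") as log:
        log.write(u"{0} {1}\n".format(label, data))


# Usage (hypothetical file name):
# log_to_file("debug.log", u"OLD_DATA:", data)
```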
If you really want to be able to see it on your console, it might be worth writing a function to screen the strings you're printing for unprintable characters.
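Such a screening function could look roughly like this. It is a sketch for Python 3; the name console_safe is hypothetical. Characters the console encoding cannot represent are replaced with backslash escapes, so print no longer raises UnicodeEncodeError.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with every character the target encoding cannot represent
    replaced by a backslash escape."""
    # Fall back to ASCII when the console encoding is unknown.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


print("OLD_DATA:", console_safe(u"Информатика", encoding="ascii"))
```

The escaped output is not pretty, but it shows exactly which code points the string contains.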
I also found a couple of other Stack Overflow posts that might be helpful in your efforts:
How do I get Cyrillic in the output, Python?
What is right way to use cyrillic in python lxml library
I would also recommend this article and python manual entry:
https://docs.python.org/2/howto/unicode.html
http://www.joelonsoftware.com/articles/Unicode.html

Processing Urdu Bidirectional text in text editors and Python

I wanted to process some bidirectional text (in Urdu and English) in an MS Word document with a Python script that transforms the text into table markup. I can't directly access the bidirectional text from the Word document, as it is in a binary format, and even if I copy and paste the text from the Word document into a text editor, all the bidirectional text renders incorrectly, losing its directionality.
Example:
The following text is rendered in reverse direction from the original MSWord text from where I copied it (Urdu text involved):
images پر ہے۔
So how can I process such bidi text so that it is rendered correctly in a text editor like Notepad++ and can hence be faithfully processed with a Python script?
First, don't rely on bidi text appearing correctly in a Word file. That doesn't guarantee that the same text will appear correctly in some other environment. Microsoft Word has its own way of handling bidirectional text in current and legacy versions, which is not necessarily the way Unicode-compliant text editors (like gedit) handle that text. This might or might not be resolved eventually, if Microsoft implements a newer version of the Unicode Bidirectional Algorithm in its products.
Secondly, the reason you don't see the copied text properly is that your text environment (including here) doesn't support bidi text properly, and it's not even possible to have right-to-left text displayed. I copied your sample string into a Unicode-compliant text editor and changed the direction to right-to-left, and the result is correct.
Now, to be able to process the text in that Word file using Python, you need to improvise a bit. You can export the text content as Unicode text and then process it with Python. Or, in case you want to process the text content in place (inside Word), you might be able to get satisfactory results out of OLE component scripting from Python. See the related question here.
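
A minimal sketch of the export-then-process route, assuming Python 3. The UTF-16 default is an assumption about how the file was exported; use whatever encoding was chosen when saving. The function names are made up for illustration. Once read, the string is held in logical order, so the script can work on it regardless of how an editor or terminal displays it.

```python
import io
import unicodedata


def read_unicode_export(path, encoding="utf-16"):
    """Read a text file exported from Word (the UTF-16 default is an assumption)."""
    with io.open(path, encoding=encoding) as f:
        return f.read()


def bidi_classes(text):
    """Pair each character with its Unicode bidirectional class ('L', 'AL', 'R', 'WS', ...)."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


def has_rtl(text):
    """Return True if the text contains any strongly right-to-left character."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)
```

For instance, has_rtl can be used to flag the lines that mix Urdu and English before they are turned into table markup.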
