I'm working on a project that has some data in Arabic. One task requires me to create a database mapping for some dicts. I don't read Arabic, but with the help of Google Translate and original English versions of the data, I'm able to surmise which Arabic strings map to the database columns.
The problem I'm facing is that Python / MacOS / Something seems to be converting ligatures (?) in the Arabic when I use copy/paste on them, which leads to my code not recognizing some of the dicts.
I believe I have a way around the problem, but given the nature of the work I'm doing, I would like to understand what is happening.
The original Arabic key looks like this:
However, when I copy/paste it on MacOS, it converts to the following:
Google Translate, MacOS, Safari, etc. all seem to treat these as equivalent text, but Python disagrees and throws a KeyError when it encounters the original (due to the system having converted it to the second version). Even if I paste it here, it converts: الفئة
Is there a way to work with this text at the system level that does not end up with it being converted to something that Python doesn't recognize?
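For anyone debugging the same thing, a small inspection sketch may help (the two placeholder strings are hypothetical; it prints each version codepoint by codepoint and checks whether Unicode normalization accounts for the difference, which is only a guess at the cause):

import unicodedata

# Hypothetical placeholders: put the key exactly as it appears in the source
# data into `original`, and the version produced by copy/paste into `pasted`.
original = "..."  # key as stored in the data files
pasted = "..."    # key after going through the clipboard

# Show each string codepoint by codepoint so any difference becomes visible.
for label, s in (("original", original), ("pasted", pasted)):
    print(label, [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in s])

# Check whether a standard normalization form makes the two compare equal.
for form in ("NFC", "NFKC"):
    print(form, unicodedata.normalize(form, original) == unicodedata.normalize(form, pasted))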
In case anybody finds this and runs into a similar problem...
What I needed to do was to parse through 350k structured Arabic records (though not all with the same schema), extract the key values, map them to English database column names, and then insert the original records into a table. Thinking laziness would work, I created a set of the unique keys, printed it to screen, copy/pasted it into a text editor, converted it to a dict, and used the Arabic words as dict keys and the English column names as the values. Except I did not notice that when I pasted the set of Arabic field names, the system "fixed" the Arabic misspellings, resulting in key names that were no longer recognized when parsing the records.
To fix the problem, instead of printing the Arabic column names (there were 32 of them) to the screen, I created a SQLite database and inserted them into a table that also included a blank "standardized" column. In SQLite I updated the records to map each English name to its Arabic counterpart, then read the table back into Python and built a lookup dict that I used when parsing the full data payload. Inserting the Arabic into SQLite did not "correct" the misspellings, so the records extracted from there served as an accurate lookup.
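A minimal sketch of that round trip, with hypothetical file, table, and column names:

import sqlite3

# Hypothetical stand-in for the set of unique Arabic field names collected
# while scanning the records (never routed through copy/paste).
unique_arabic_keys = {"..."}

conn = sqlite3.connect("field_names.db")  # hypothetical file name
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS field_map (arabic TEXT PRIMARY KEY, standardized TEXT)")
for key in unique_arabic_keys:
    cur.execute("INSERT OR IGNORE INTO field_map (arabic, standardized) VALUES (?, NULL)", (key,))
conn.commit()

# ...fill in the `standardized` column by hand in the sqlite3 shell...

# Read the table back and build the lookup dict used while parsing the full payload.
cur.execute("SELECT arabic, standardized FROM field_map WHERE standardized IS NOT NULL")
lookup = dict(cur.fetchall())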
The lookup table ended up looking like this:
In spite of trying, I never figured out how to get MacOS to stop correcting the misspelled Arabic.
I have an original text that I want to translate. I normally do it manually but I know I could save a lot of time translating automatically the most frequent words and expressions.
I will figure out how to translate simple words; that is not the problem. I have read some books on Python, and I think it can be done using string manipulation.
But I am lost about how to create the output file.
The output file will contain:
short empty forms, ready to be filled in, wherever there is text that has not been translated
the translated words wherever they were in the original file
In the output file I will fill in the empty forms manually; after pressing Tab, the cursor should jump to the next empty form.
I am lost here. I know how to make forms in HTML, but the language I am used to is Python.
I would like to know what modules from Python I could use. I need some guidance on this.
Can you recommend a book or a tool that explains how to do something similar to this?
This is what I want to do, assuming I have managed to create a simple database to translate colors from Spanish to English.
The first step contains the original file.
The second step contains the automatic translation.
In the third step I complete the manual translation.
After finishing everything is grouped into a normal txt file ready to be used.
I think it is quite clear. I don't expect people to tell me the code to do this, I just need to know what tools could be used to achieve my goal.
Thanks for editing.
To create an interface that works with a web browser, Flask is a good option for building web forms in Python. There are tutorials available.
One option for storing the data would be an SQLite file. That may be more than you need, so I'd recommend starting with a CSV file; Python ships with libraries for both (the csv and sqlite3 modules).
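As a rough illustration of the CSV route, here is a sketch that loads a word list and writes the partially translated output, leaving a blank form wherever a word is unknown (the file names and the [___] placeholder are just assumptions):

import csv

# Load the Spanish -> English word list from a two-column CSV file.
translations = {}
with open("colors.csv", newline="", encoding="utf-8") as f:  # hypothetical file
    for row in csv.reader(f):
        translations[row[0].lower()] = row[1]

# Replace known words; leave a blank form for anything not in the dictionary,
# to be filled in manually afterwards.
with open("original.txt", encoding="utf-8") as src, \
     open("output.txt", "w", encoding="utf-8") as dst:
    for line in src:
        words = [translations.get(word.lower(), "[___]") for word in line.split()]
        dst.write(" ".join(words) + "\n")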
(This may be a stupid question due to my ignorance.)
Is it possible in Visual Studio Code or PyCharm (perhaps with a plugin) to have the output of a database query, say from an SQLite source, automatically formatted like a Pandas DataFrame? (So when I run the code it is displayed as a nicely formatted table.)
You can use .format(); there are a few different ways you could do this. I'd normally do something like this:
print('{:>{width}}'.format(i, width=len(longestResult)))
If you iterate through all your results to find the longest one, use its length as the field width as above, and then iterate through your results again, you'll get a nicely padded table.
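Putting that together, a small sketch that pads every column of a hypothetical result set:

# Hypothetical query results: a header row plus a few rows fetched from SQLite.
rows = [("id", "name", "score"), (1, "Alice", 9.5), (2, "Bob", 12.25)]

# First pass: find the widest value in each column.
widths = [max(len(str(row[i])) for row in rows) for i in range(len(rows[0]))]

# Second pass: print every row with each column right-aligned to that width.
for row in rows:
    print("  ".join("{:>{w}}".format(value, w=w) for value, w in zip(row, widths)))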
I am indexing data in Elasticsearch using the official Python library for this: elasticsearch-py. The data is taken directly from Oracle using the cx_Oracle Python library, cast into a document format, and sent to Elasticsearch for indexing. For the most part this works great, but sometimes I encounter problems with characters like ö. Sometimes this character is indexed as \xc3\xb8 and sometimes as ö. This happens even within the same database entry: one variable can have the ö indexed correctly while for another variable this is not the case.
Does anyone have an idea what might cause this?
thanks in advance
If your "ö" is sometimes right - and sometimes not, the data must be corrupted in your database. This is not a problem of Elasticsearch. (I had the exact same problem one month ago!)
Strings with various encodings are likely put in your database without being all converted to a single format before.
text = "ö"
asUtf=text.encode('UTF-8')
print(asUtf)
print(asUtf.decode())
Result:
b'\xc3\xb6'
ö
This problem could be solved before the insertion into Elasticsearch: find the text sequences matching '\xXX\xXX', treat them as UTF-8, and decode them to Unicode. Try to sanitize your database and fix the way you put information into it.
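A sketch of that clean-up step, assuming the corrupted values are UTF-8 byte sequences that were decoded with the wrong codec (classic mojibake); the round-trip heuristic below is only an assumption and should be checked against your data:

def fix_mojibake(text):
    """Re-interpret text that looks like UTF-8 bytes read as Latin-1 (e.g. 'Ã¶' -> 'ö')."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not mojibake (or not fixable this way), leave it alone

print(fix_mojibake("Ã¶"))  # prints: ö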
PS: a better practice for moving information from a database into Elasticsearch is to use rivers, or to write a script that sends the data directly to Elasticsearch without saving it to a file first.
2016 edit: rivers are deprecated now, so you should find an alternative such as Logstash.
I am using python-docx to convert a Word docx to a custom HTML equivalent. The document that I need to convert has images and tables, but I haven't been able to figure out how to access the images and the tables within a given run. Here is what I am thinking...
for para in doc.paragraphs:
    for run in para.runs:
        pass  # How to tell if this run has images or tables?
...but I don't see anything on the Run that has info on the InlineShape or Table. Do I have to fall back to the XML directly or is there a better, cleaner way to iterate over everything in the document?
Thanks!
There are actually two problems to solve for what you're trying to do. The first is iterating over all the block-level elements in the document, in document order. The second is iterating over all the inline elements within each block element, in the order they appear.
python-docx doesn't yet have the features you would need to do this directly. However, for the first problem there is some example code here that will likely work for you:
https://github.com/python-openxml/python-docx/issues/40
There is no exact counterpart I know of for inline items, but I expect you could get pretty far with paragraph.runs; all inline content will be within a paragraph. If you got most of the way there and were just hung up on getting pictures or something, you could go down to the lxml level and decode some of the XML to get what you need. If you get that far along and are still keen, post a feature request on the GitHub issues list for something like "feature: Paragraph.iter_inline_items()" and I can probably provide you with some similar code to get what you need.
This requirement comes up from time to time so we'll definitely want to add it at some point.
Note that block-level items (paragraphs and tables primarily) can appear recursively, and a general solution will need to account for that. In particular, a paragraph can (and in fact at least one always must) appear in a table cell. A table can also appear in a table cell. So theoretically it can get pretty deep. A recursive function/method is the right approach for getting to all of those.
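For reference, here is a sketch along the lines of the code in that issue; it reaches into python-docx internals, so treat the exact imports and attributes as assumptions about a reasonably recent version rather than a supported API:

from docx.document import Document as _Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import Table, _Cell
from docx.text.paragraph import Paragraph

def iter_block_items(parent):
    """Yield each Paragraph and Table child of `parent`, in document order."""
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("expected a Document or _Cell")
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)

Recursing into nested content is then a matter of calling iter_block_items() again on each cell of any Table you encounter.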
Assuming doc is of type Document, what you want to do is run three separate iterations:
One for the paragraphs, as you have in your code
One for the tables, via doc.tables
One for the shapes, via doc.inline_shapes
The reason your code wasn't working is that paragraphs don't hold references to the tables or shapes in the document; those are stored on the Document object.
Here is the documentation for more info: python-docx
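A minimal sketch of those three loops (the file name is just a placeholder):

import docx

doc = docx.Document("example.docx")  # hypothetical path

for para in doc.paragraphs:      # all top-level paragraphs
    print("paragraph:", para.text)

for table in doc.tables:         # all top-level tables
    print("table with", len(table.rows), "rows")

for shape in doc.inline_shapes:  # all inline shapes, e.g. pictures
    print("shape type:", shape.type)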
I apologize for asking a character encoding question, since I know you folks get many every day, but I couldn't figure out my problem so I am asking anyway.
Here is what we are doing:
Take data from an Oracle DB using Python and cx_Oracle.
Write the data to a file using Python.
Ingest the file into Postgres using Python and psycopg2.
Here are the important Oracle settings:
SQL> select * from NLS_DATABASE_PARAMETERS;
PARAMETER VALUE
------------------------------ ----------------------------------------
NLS_LANGUAGE AMERICAN
NLS_TERRITORY AMERICA
NLS_CURRENCY $
NLS_ISO_CURRENCY AMERICA
NLS_NUMERIC_CHARACTERS .,
NLS_CHARACTERSET US7ASCII
According to this NLS_LANG faq, you are meant to set the NLS_LANG according to what your client OS is using.
Running locale gives us: LANG=en_US.UTF-8 (all of the other fields were also en_US.UTF-8).
So, in our Python script, we set it like this:
os.environ["NLS_LANG"] = "AMERICAN_AMERICA.AL32UTF8"
Then we import the data and write it to a file.
row = cur.fetchall()
fil.write(row[0][0]) #For this test, I am only writing one row and one field.
We ingest that file into our UTF-8 Postgres DB.
Unfortunately, for some reason, we get this symbol: � in our file and in the subsequent PG table as well. If my understanding is correct, this is the Replacement Character. I believe that character is meant to show up if Unicode does not recognize a symbol.
(In some text editors, the symbol shows up as �).
What I don't understand is why is this happening? I thought UTF-8 was backwards compatible with 7-bit ASCII?
And even if we are using regional pages, shouldn't it still work, since the client is using US and the Oracle server is using AMERICAN?
How can I check if the data is imported correctly and if it isn't correct, how can I fix it so future imports are?
Note: The Oracle field is a CHAR field and not a NCHAR field.
Note2: We are using Python 2.4, so we don't have the native Unicode stuff in Python 3.X. So, it is possible that Python is messing up somewhere though I thought cx_Oracle took care of it all.
Thanks for your time, I hope you have a good day.
Unfortunately, for some reason, we get this symbol: � in our file and in the subsequent PG table as well. If my understanding is correct, this is the Replacement Character. I believe that character is meant to show up if Unicode does not recognize a symbol.
Mostly right but not quite. PostgreSQL will refuse to insert non-UTF8 text characters when using that encoding (do a search on StackOverflow for "Invalid UTF8 postgresql"). Most likely the character you are seeing is a valid UTF8 character that is not recognized by your font and therefore is showing the replacement character. If the symbol is in your Oracle db and is actually the replacement symbol there, then what do you want to replace it with? If that is the case, the information is already missing.
What I don't understand is why is this happening? I thought UTF-8 was backwards compatible with 7-bit ASCII?
It is.
How can I check if the data is imported correctly and if it isn't correct, how can I fix it so future imports are?
Most likely your problem is upstream of the Oracle db. I would find out what is actually inserting problem data into the Oracle db and fix it there. If you can check the data in Pg against the data in Oracle, you should be able to determine if the data is character for character the same (and flag any differences). That's how to check your current import.
Note2: We are using Python 2.4, so we don't have the native Unicode stuff in Python 3.X. So, it is possible that Python is messing up somewhere though I thought cx_Oracle took care of it all.
That's another possibility. Personally, for file transformations I prefer Perl because of its integrated regular expressions and absolutely top-rate PostgreSQL support. However, I recognize your import routine may not be readily convertible at this point. I am a little more familiar with troubleshooting UTF-8 conversion issues in Perl than in Python. I do wonder, however, whether you can check the data that is coming out, in binary form, for such symbols.
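Along those lines, a small Python 2-style sketch (to match the 2.4 environment described above) that dumps the raw bytes of any fetched value containing non-ASCII bytes, so sequences like '\xc3\xb6' or the UTF-8 replacement character '\xef\xbf\xbd' stand out before anything is written to the file; the query is hypothetical and `cur` is assumed to be the cx_Oracle cursor already in use:

def dump_non_ascii(cur, sql):
    """Print the repr() of every fetched value that contains non-ASCII bytes."""
    cur.execute(sql)
    for row in cur.fetchall():
        for value in row:
            if isinstance(value, str) and any(ord(ch) > 127 for ch in value):
                print repr(value)

# Example (hypothetical query):
# dump_non_ascii(cur, "SELECT some_text_column FROM some_table")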