python write unicode to file and retrieve it as a symbol - python

I am working on a Python program which contains an Arabic-English database and allows the user to update this database and also to study the vocabulary. I am almost done with implementing all the functions I need, but the most important part is missing: the encoding of the Arabic strings. To append new vocabulary to the database txt file, a dictionary is created and then its content is appended to the file. To study vocabulary, the content of the txt file is converted into a dictionary again, a random word is printed to the console and the user is asked for its translation. Now the idea is that the user has the possibility to write the English word as well as the Arabic word in Latin letters, and the program will internally convert the pseudo-Arabic string to Arabic letters. For example, if the user writes 'b' when asked for the Arabic word, I want to append 'ب'.
1. There are about 80 signs I have to consider in the implementation. Is there a way of creating some mapping between the Latin-letter input string and the respective Arabic signs? For me, the most intuitive idea would be to write one if statement after the other, but that's probably super slow.
2. I have trouble printing the Arabic string to the console. This input
print('bla{}!'.format(chr(0xfe9e)))
print('bla{}!'.format(chr(int('0x'+'0627',16))))
will result in printing the Arabic sign whereas this won't:
print('{}'.format(chr(0xfe9e)))
What can I do in order to avoid this problem, since I want a sequence which consists of unicode symbols only?

Did you try the encode/decode functions? For example, you can write:
u = "سلام".encode('utf-8')
print(u.decode('utf-8'))
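Since the question is also about writing Unicode to a file and reading it back: in Python 3 the built-in open() takes an encoding argument, which avoids manual encode/decode calls entirely. A minimal sketch (vocab.txt is just a placeholder name):
# Write the Arabic letter beh to a UTF-8 encoded file ...
with open('vocab.txt', 'w', encoding='utf-8') as f:
    f.write('\u0628\n')

# ... and read it back as a str.
with open('vocab.txt', encoding='utf-8') as f:
    print(f.read())  # prints: ب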

This is not the final answer but can give you a start.
First of all, check your encoding:
import sys
sys.getdefaultencoding()
Edit:
sys.setdefaultencoding('UTF8') has been removed from the sys module. Still, you can comment with what sys.getdefaultencoding() returns on your machine.
However, for the Arabic characters, you can generate them all at once from their code point range:
According to this website, Arabic characters run from 0x0620 to 0x064B and Basic Latin lower-case letters run from 0x0061 to 0x007B.
So:
arabic_chr = [chr(k) for k in range(0x0620, 0x064B)]
latin_chr = [chr(k) for k in range(0x0061, 0x007B)]
Now, all you have to do is find a relation between the two lists, or maybe extend the ranges further (I speak Arabic, and I know that one character can have many forms and can change from one word to another).
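To answer the first question more directly: a plain dict gives a constant-time lookup table, so there is no need for a long chain of if statements. Below is a minimal sketch; the four pairs and the multi-letter key 'th' are made-up placeholders for whatever transliteration scheme you settle on, not an established standard:
# Hypothetical transliteration table; fill in the ~80 real pairs.
LATIN_TO_ARABIC = {
    'b': '\u0628',   # ب
    't': '\u062A',   # ت
    'th': '\u062B',  # ث (multi-letter keys need longest-match handling)
    'a': '\u0627',   # ا
}

def to_arabic(text):
    # Greedy longest-match scan so 'th' wins over 't' followed by 'h'.
    longest = max(len(k) for k in LATIN_TO_ARABIC)
    result = []
    i = 0
    while i < len(text):
        for size in range(longest, 0, -1):
            chunk = text[i:i + size]
            if chunk in LATIN_TO_ARABIC:
                result.append(LATIN_TO_ARABIC[chunk])
                i += size
                break
        else:
            result.append(text[i])  # leave unknown characters untouched
            i += 1
    return ''.join(result)

print(to_arabic('bat'))  # بات under the placeholder table above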

Related

Python character constants for chat bot

I'm currently writing a chat bot in Python and I would like to be able to type special characters like emoji, etc. My first attempt was just to place the literal character in the code.
add_reaction('🇦')
Unfortunately not many editors support these characters, so they appear mostly as random gibberish. For readability this isn't very good either.
To solve the gibberish issue I used chr(charcode) with the integer code point, which also made the characters safer to copy and paste.
Then I put all of them into a separate file special_chars.py so I could give each character a name:
thumbs_up = chr(...)
smiley_face = chr(...)
regional_a_z = [chr(127462 + i) for i in range(26)]
...
However this file started to grow really long really quickly.
So is there a better way to do this?
Something to keep in mind:
if a long file isn't avoidable, could the character codes be moved to a non-Python file?
potentially use a list for consecutive characters or character groups, e.g. thumbs up and down, or the list of regional indicators
The unicodedata module of the standard library already contains names for the special characters:
>>> unicodedata.lookup('THUMBS UP SIGN')
'\U0001f44d'
>>> unicodedata.lookup("REGIONAL INDICATOR SYMBOL LETTER A")
'\U0001f1e6'
You can get the official name of a character by its code:
>>> unicodedata.name('\U0001F600')
'GRINNING FACE'
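Building on that, the character names could live in a plain text file instead of a long special_chars.py, which addresses the non-Python-file idea from the question. A sketch under that assumption (special_chars.txt is a hypothetical file with one official Unicode name per line):
import unicodedata

# Hypothetical special_chars.txt, one official Unicode name per line:
#   THUMBS UP SIGN
#   GRINNING FACE
with open('special_chars.txt', encoding='utf-8') as f:
    names = [line.strip() for line in f if line.strip()]

chars = {name: unicodedata.lookup(name) for name in names}
print(chars['THUMBS UP SIGN'])

# Consecutive groups can still be generated in one line:
regional_a_z = [chr(0x1F1E6 + i) for i in range(26)]  # REGIONAL INDICATOR A..Z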

Encoding included in .append? Python

I am a complete beginner when it comes to Python, and I'm currently getting close to the end of LPTHW. Now, in exercise 41, there is a line of code in a for-loop that I do not quite get.
I have searched online to the best of my abilities, but as I am still learning, I was not completely sure how to even search for this.
To clarify:
WORD_URL is just a series of words.
WORDS is an empty list.
This is the loop:
for word in urlopen(WORD_URL).readlines():
    WORDS.append(str(word.strip(), encoding="utf-8"))
Now what I do not really understand is what this WORDS.append(str(word.strip(), encoding='utf-8')) does. Why is encoding="utf-8" included, and what does it do in this context? I suspect the use of str here is connected to this in some way, but I'm not completely sure. Would it not be possible to simply have it like this:
WORDS.append(word.strip())?
Thanks!
The code snippet
WORDS.append(str(word.strip(), encoding='utf-8'))
strips the surrounding whitespace from word, then decodes the bytes object word into a str, treating the bytes as UTF-8 encoded text.
The str constructor takes care of the decoding.
The code you provided,
WORDS.append(word.strip())
would place the variable word, with no surrounding whitespace, at the end of the list WORDS.
The difference is that no decoding takes place: urlopen returns raw bytes, so word would be appended as a bytes object rather than a string. In the first example, every item placed in WORDS is a proper str, decoded from UTF-8.
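To see the difference concretely, here is a quick interactive sketch, with a literal bytes value standing in for one line returned by urlopen:
>>> word = b'yarn\r\n'
>>> word.strip()
b'yarn'
>>> str(word.strip(), encoding='utf-8')
'yarn'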

remap scrambled PDF characters to readable text

I have a problem with cups-PDF creating PDF documents where characters are mapped to strange symbols (on Ubuntu Linux 14.04 and 16.04). I think it's some kind of Unicode issue, even though Python tells me it is of string type: calling type() on the object returns str.
It makes no difference whether I grab the text out of the PDF via mouse copy-paste from evince/Firefox or via the Python PDFminer module. So it's true: the PDF has broken text information which is nevertheless rendered correctly in the PDF document itself. I did not know that, but the text and the rendered glyphs in a PDF document seem not to be bound together very tightly.
When I do copy text from such created PDF document by example the name "Raphael" turns into "✡✍✑✒✍☛✓" so each single character maps to "✡=R ✍=a ✑=p ✒=h ✍=a ☛=e ✓=l"
Another example is: "Devel" turns into "✭☛✮☛✓"
How can I write a function in Python which shifts this "wrong" information to the correct one? On the PDF Document everything is perfectly readable.
This has something to do with cups-PDF using PostScript to create the PDF but not adding the correct font/character information to the document.
The letter 'l' always maps to the symbol '✓', which is the check mark Unicode character.
How can I remap the characters from this strange representation to the correct one in Python? So how can I shift or remap the symbol '✓' to the letter 'l'? Any idea?
Why I need this?
I need to search for a text value in this documents.
The PDF appears to be using a specialised font to prevent copying. The text is scrambled, but so are the letters in the font. So if a was once mapped to Unicode codepoint U+0061, the PDF has replaced all those a's with U+270D instead, and the special font replaced the normal "WRITING HAND" glyph with the letter a.
In other words, it's using a substitution cypher.
You'll have to unscramble this like any other substitution cypher: you need to create a reverse mapping from encrypted codepoint to un-encrypted codepoint. You can use the PDF as a guide; as a human you can easily read the actual text, and you can also see how it relates to the copied Unicode codepoints.
For example, we know that U+270D maps to U+0061:
>>> hex(ord('✍'))
'0x270d'
>>> hex(ord('a'))
'0x61'
because when you copy an a from the PDF, you get the 270d codepoint instead. Simply build up a table for the rest of the alphabet. That may sound like a lot of manual work, but you already have the plaintext. Imagine not knowing what the text contains (e.g. you only had the symbols that copying the text produces); then you'd have to do a full cryptanalysis first (for a substitution cypher, assume a specific language, and count symbols; each language has a typical frequency distribution for its letters, and such a distribution can often be matched in an encrypted body of text to map back to the original letters).
Theoretically, you should be able to extract the specialised font, then analyse that to produce a translation table. This would require some form of computer vision, however; the computer won't easily know that a raster of pixels or series of vector lines forms a specific letter. For roughly 70 codepoints (uppercase, lowercase, digits, some punctuation) it'll probably be easier to just create the table by hand.
Once you have a table, Python can do the translation for you; I've taken your clues and created a partial table for just those letters:
mapping = {
    0x270d: 'a',
    0x261b: 'e',
    0x2712: 'h',
    0x2713: 'l',
    0x2711: 'p',
    0x272e: 'v',
    0x272d: 'D',
    0x2721: 'R',
}
print(encrypted.translate(mapping))
All you need to do is fill in the remaining mappings; the str.translate() method will then take care of the rest.
Demo using the above partial table on your sample encrypted text samples:
>>> print("✡✍✑✒✍☛✓".translate(mapping))
Raphael
>>> print("✭☛✮☛✓".translate(mapping))
Devel
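Because the plaintext is already known, the table can also be built mechanically by zipping a copied (scrambled) sample against what it should read. A small sketch using the two samples from the question:
pairs = [('✡✍✑✒✍☛✓', 'Raphael'), ('✭☛✮☛✓', 'Devel')]
mapping = {}
for scrambled, plain in pairs:
    for s, p in zip(scrambled, plain):
        mapping[ord(s)] = p

print('✡✍✑✒✍☛✓'.translate(mapping))  # Raphael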

Python: Custom sort a list of lists

I know this has been asked before, but I have not been able to find a solution.
I'm trying to alphabetize a list of lists according to a custom alphabet.
The alphabet is a representation of the Burmese script as used by Sgaw Karen in plain ASCII. The Burmese script is an alphasyllabary—a few dozen onsets, a handful of medial diacritics, and a few dozen rhymes that can be combined in thousands of different ways, each of which is a single "character" representing one syllable. The map.txt file has these syllables, listed in (Karen/Burmese) alphabetical order, but converted in some unknown way into ASCII symbols, so the first character is u>m;.Rf rather than က or [ka̰]. For example:
u>m;.Rf ug>m;.Rf uH>m;.Rf uX>m;.Rf uk>m;.Rf ul>m;.Rf uh>m;.Rf uJ>m;.Rf ud>m;.Rf uD>m;.Rf u->m;.Rf uj>m;.Rf us>m;.Rf uV>m;.Rf uG>m;.Rf uU>m;.Rf uS>m;.Rf u+>m;.Rf uO>m;.Rf uF>m;.Rf
c>m;.Rf cg>m;.Rf cH>m;.Rf cX>m;.Rf ck>m;.Rf cl>m;.Rf ch>m;.Rf cJ>m;.Rf cd>m;.Rf cD>m;.Rf c->m;.Rf cj>m;.Rf cs>m;.Rf cV>m;.Rf cG>m;.Rf cU>m;.Rf cS>m;.Rf c+>m;.Rf cO>m;.Rf cF>m;.Rf
Each list in the list of lists has, as its first element, a word of Sgaw Karen converted into ASCII symbols in the same way. For example:
[['u&X>', 'n', 'yard'], ['vk.', 'n', 'yarn'], ['w>ouDxD.', 'n', 'yawn'], ['w>wuDxD.', 'n', 'yawn']]
This is what I have so far:
def alphabetize(word_list):
    alphabet = ''.join([line.rstrip() for line in open('map.txt', 'rb')])
    word_list = sorted(word_list, key=lambda word: [alphabet.index(c) for c in word[0]])
    return word_list
I would like to alphabetize word_list by the first element of each list (eg. 'u&X>', 'vk.'), according to the pattern in alphabet.
My code's not working yet and I'm struggling to understand the sorted command with lambda and the for loop.
First, if you're trying to look up the entire word[0] in alphabet, rather than each character individually, you shouldn't be looping over the characters of word[0]. Just use alphabet.index(word[0]) directly.
From your comments, it sounds like you're trying to look up each transliterated-Burmese-script character in word[0]. That isn't possible unless you can write an algorithm to split a word up into those characters. Splitting it up into the ASCII bytes of the transliteration doesn't help at all.
Second, you probably shouldn't be using index here. When you think you need to use index or similar functions, 90% of the time, that means you're using the wrong data structure. What you want here is a mapping (presumably why it's called map.txt), like a dict, keyed by words, not a list of words that you have to keep explicitly searching. Then, looking up a word in that dictionary is trivial. (It's also a whole lot more efficient, but the fact that it's easy to read and understand can be even more important.)
Finally, I suspect that your map.txt is supposed to be read as a whitespace-separated list of transliterated characters, and what you want to find is the index into that list for any given word.
So, putting it all together, something like this:
with open('map.txt') as f:  # open as text so the keys are str, matching word[0]
    mapping = {word: index for index, word in enumerate(f.read().split())}
word_list = sorted(word_list, key=lambda word: mapping[word[0]])
But, again, that's only going to work for one-syllable words, because until you can figure out how to split a word up into the units that should be alphabetized (in this case, the symbols), there is no way to make it work for multi-syllable words.
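If the syllable inventory in map.txt is unambiguous enough, a greedy longest-match scan is one common way to do that splitting. This is only a sketch of the idea, reusing the mapping dict built above; it is not a guaranteed-correct segmentation for this transliteration:
def split_syllables(word, syllables):
    # Try the longest known syllable first at each position.
    longest = max(len(s) for s in syllables)
    units, i = [], 0
    while i < len(word):
        for size in range(longest, 0, -1):
            if word[i:i + size] in syllables:
                units.append(word[i:i + size])
                i += size
                break
        else:
            raise ValueError('cannot segment %r at position %d' % (word, i))
    return units

word_list = sorted(word_list,
                   key=lambda word: [mapping[s] for s in split_syllables(word[0], mapping)])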
And once you've written the code that does that, I'll bet it would be pretty easy to just convert everything to proper Unicode representations of the Burmese script. Each syllable still takes 1-4 code points in Unicode, but that's fine, because the standard Unicode collation algorithm (available to Python through libraries such as PyICU) already knows how to alphabetize things properly for that script, so you don't have to write it yourself.
Or, even better, unless this is some weird transliteration that you or your teacher invented, there's probably already code to translate between this format and Unicode, which means you shouldn't even have to write anything yourself.

How to read a non-English language text from a text file and print it in python?

I have a text file which contains some Persian text. I want to read the file, calculate the number of occurrences of each word, and then print the calculated values. This is my code:
f = open('C:/python programs/hafez.txt')
wordDict = {}
for line in f:
    wordList = line.strip().split(' ')
    for word in wordList:
        if word not in wordDict:
            wordDict[word] = 1
        else:
            wordDict[word] = wordDict[word] + 1
print(str(wordDict))
It produces results in the wrong encoding; I tried various ways to fix this, but with no good result! Here is part of the text that this code produces:
{"\x00'\x063\x06(\x06": 3, "\x00,\x06'\x06E\x06G\x06": 16, "\x00'\x063\x06*\x06E\x06'\x069\x06": 1, '\x00-\x064\x061\x06': 1, .....}
There are several ways to deal with this, but perhaps the easiest is with codecs.open(). (I'm assuming you're using Python 2.7 for some of the other tricks here with Counter and with).
import codecs
from collections import Counter

wordDict = Counter()
with codecs.open('C:/python programs/hafez.txt', 'r', encoding='cp720') as f:
    for line in f:
        wordDict.update(line.strip().split())

for word, count in wordDict.most_common():
    print word, count
In Python 3, you need the parentheses with print (it's a function in Python 3 but a statement in Python 2), and you don't need to import codecs because the builtin open() has support for different encodings.
If your encoding isn't Code Page 720, then you need to replace that option with the abbreviation for the appropriate encoding.
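For reference, a Python 3 version of the same approach needs neither codecs nor the print statement; the built-in open() accepts the encoding directly (still assuming Code Page 720):
from collections import Counter

wordDict = Counter()
with open('C:/python programs/hafez.txt', encoding='cp720') as f:
    for line in f:
        wordDict.update(line.strip().split())

for word, count in wordDict.most_common():
    print(word, count)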
This is a good opportunity to learn a bit about encodings. While I agree with Joel that no programmer should pretend we live in a US English / ASCII world, the issue of encoding becomes especially pertinent when you're dealing with a non-Latin alphabet on a regular basis. (Besides, ASCII isn't even enough for English; many English words are borrowings that kept their accents, amongst other issues.) Good starting places are Joel's article (The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)), Pragmatic Unicode (including the Unicode sandwich), and, for ease of producing said sandwich in Python 2, the codecs module. There's also a HOWTO in the Python docs, which is easier to understand after you've read the other articles.
If you've decided to go full Python 3, then you can simply select your exact version from the listbox at the top of the documentation pages. The BDFL's summary of the differences between Python 2 and 3 also includes a bit on issues with Unicode and how it's handled differently in Python 2 and 3.
I think in general you could encode your txt file as UTF-8, and declare UTF-8 with # -*- coding: UTF-8 -*- at the start of the py file (note that this declaration sets the encoding of the Python source file itself, not of the files you read).
Consider using Python's Counter class (a dict subclass in the collections module) for counting occurrences of words.
As for the text, python2.7 is not unicode by default. Read: http://docs.python.org/2/howto/unicode.html
You can use
for i, j in wordDict.iteritems():
    print unicode(i), j
I admit I don't know Python; I've only been learning it for a few days. For the moment I'm a little confused between Python 2 and 3, since they handle strings differently.
Just a tip on where to look: read your file into a string (I don't know if you should open the file in binary mode), then convert it to Unicode using the decode method on that string:
unicodestr = filecontents.decode('cp720')
