How to map Arabic letters to phonemes in Python? - python

I want to make a simple Python script that will map each Arabic letter to phoneme sound symbols. I have a file that has a bunch of words that the script will read to convert them to phonemes, and I have the following dictionary in my code:
Content in my .txt file:
السلام عليكم
السلام عليكم و رحمة الله
السلام عليكم و رحمة الله و بركاته
الحمد لله
كيف حالك
كيف الحال
The dictionary in my code:
ar_let_phon_maplist = {u'ﺍ':'A:', u'ﺏ':'B', u'ﺕ':'T', u'ﺙ':'TH', u'ﺝ':'J', u'ﺡ':'H', u'ﺥ':'KH', u'ﻩ':'H', u'ﻉ':'(ayn) ’', u'ﻍ':'GH', u'ﻑ':'F', u'ﻕ':'q', u'ﺹ':u'ṣ', u'ﺽ':u'ḍ', u'ﺩ':'D', u'ﺫ':'DH', u'ﻁ':u'ṭ', u'ﻙ':'K', u'ﻡ':'M', u'ﻥ':'N', u'ﻝ':'L', u'ﻱ':'Y', u'ﺱ':'S', u'ﺵ':'SH', u'ﻅ':u'ẓ', u'ﺯ':'Z', u'ﻭ':'W', u'ﺭ':'R'}
I have a nested loop where I'm reading each line, converting each character:
with codecs.open(sys.argv[1], 'r', encoding='utf-8') as file:
    lines = file.readlines()
line_counter = 0
for line in lines:
    print "Phonetics In Line " + str(line_counter)
    print line + " ",
    for word in line:
        for character in word:
            if character == '\n':
                print ""
            elif character == ' ':
                print " "
            else:
                print ar_let_phon_maplist[character] + " ",
    line_counter +=1
And this is the error I'm getting:
Phonetics In Line 0
السلام عليكم
Traceback (most recent call last):
  File "grapheme2phoneme.py", line 25, in <module>
    print ar_let_phon_maplist[character] + " ",
KeyError: u'\u0627'
And then I checked if the file type is UTF-8 using the Linux command:
file words.txt
The output I got:
words.txt: UTF-8 Unicode text
Is there a solution for this problem? Why does the lookup fail when the character I use as the key in the ar_let_phon_maplist[character] line is also a Unicode object, just like the keys in the dictionary?
Is there something wrong with my code?

The first thing that catches the eye is the KeyError: your dictionary simply does not contain some of the symbols encountered in the file. In fact, it does not contain ANY of the characters in the file, not only the first one.
What can we do about it? We could just add all of the symbols from the Arabic block of the Unicode table to the dictionary. Simple? Yes. Clear? No.
If you want to actually understand the reason for this 'strange' behaviour, you need to know more about Unicode. In short, there are a lot of letters that look similar but have different code points. Moreover, the same letter can sometimes be represented in multiple forms. So comparing Unicode characters is not a trivial task.
So, if I were allowed to use Python 3.3+, I would solve the task as follows. First I would normalize the keys in the ar_let_phon_maplist dictionary:
import unicodedata
ar_let_phon_maplist = {unicodedata.normalize('NFKD', k): v
                       for k, v in ar_let_phon_maplist.items()}
And then we will iterate over lines in file, words in line and characters in word like this:
for index, line in enumerate(lines):
    print('Phonetics in line {0}, total {1} symbols'.format(index, len(line)))
    unknown = []  # Symbols that were not found in the dictionary are stored here
    words = line.split()
    for word in words:
        print(word, ': ', sep='', end='')
        for character in word:
            c = unicodedata.normalize('NFKD', character).casefold()
            try:
                print(ar_let_phon_maplist[c], sep='', end='')
            except KeyError:
                print('_', sep='', end='')
                if c not in unknown:
                    unknown.append(c)
        print()
    if unknown:
        print('Unrecognized symbols: {0}, total {1} symbols'.format(', '.join(unknown),
                                                                    len(unknown)))
The script will produce something like this:
Phonetics in line 4, total 9 symbols
كيف: KYF
حالك: HA:LK

It looks like you forgot that character in the dictionary. You have ﺍ (u'\ufe8d', ARABIC LETTER ALEF ISOLATED FORM), which looks similar, but you don't have ا (u'\u0627', ARABIC LETTER ALEF).
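The mismatch described above can be checked directly. The following is a minimal Python 3 sketch, not the asker's script: sample_map is a hypothetical two-entry stand-in for the full dictionary, and NFKC is used here as one normalization form that folds the isolated presentation forms back to the base letters.

```python
import unicodedata

isolated_alef = "\ufe8d"  # ARABIC LETTER ALEF ISOLATED FORM (the key typed in the dictionary)
plain_alef = "\u0627"     # ARABIC LETTER ALEF (the character read from the file)

print(unicodedata.name(isolated_alef))
print(isolated_alef == plain_alef)  # the two look alike but are different code points

# Hypothetical two-entry sample whose keys are presentation forms (alef, beh).
sample_map = {"\ufe8d": "A:", "\ufe8f": "B"}

# Normalizing the keys maps each presentation form to its base letter.
normalized_map = {unicodedata.normalize("NFKC", k): v for k, v in sample_map.items()}
print(normalized_map[plain_alef])
```

After normalization the lookup with the character from the file succeeds, which is why normalizing both the keys and the input avoids the KeyError.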

Related

Python: how to avoid the space in dna calculated?

I am using Python 2.7.
I want to find the length of a DNA sequence, but I cannot find my mistake. The length is supposed to be 283, but it comes out as 345.
The sequence printed on a single line looks correct; only the length is wrong.
I think the spaces are being counted too. How can I get the length of the DNA without including the spaces?
Thank you.
import re
singleSeq = ""
fh = open("seq.embl.txt")
lines = fh.readlines()
for line in lines:
    lines = line.strip()
    m = re.match(r"\s+(.[^\d]+)\s+\d+", line)
    if m:
        print(m.group(0))
        seqline = m.group(1)
        print(seqline)
        singleSeq += seqline
print("\nSequence in a single line: ")
# print(line.strip(singleSeq))
print(singleSeq)
print("\nSequence length: ", len(singleSeq))
Output
Sequence in a single line:
cccatgtccc agcggcgtat tgctttgcat cgcgaacgca ctttcaatgt cccagcggcg tattgcttct attttataag taccagctaa attttttttt tttttttata agtaccagct aaaatttttt tttttttttt ttataagtac cagctaaaat tttttttttt tttttttata agtaccagct aaaatttttt ttttttttta taagttccag cggcgtattg ctttctgaaa tttaaaaaaa aaaaaaaatt tttttttaat aatatattat ata
Sequence length: 345
This should do the trick:
# Python 3 code to remove spaces
def remove(string):
    return string.replace(" ", "")
# Driver program
string = ' t e s t '
print(remove(string))
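Applied to the sequence itself, the same idea looks like the sketch below. The fragment is only the first three blocks of the output shown in the question, and "".join(s.split()) is used as an alternative to replace(" ", "") that also drops tabs and newlines.

```python
# First three 10-base blocks from the question's output, spaces included.
spaced = "cccatgtccc agcggcgtat tgctttgcat"

# split() with no arguments breaks on any run of whitespace.
compact = "".join(spaced.split())

print(len(spaced))   # counts the two separating spaces as well
print(len(compact))  # counts bases only
```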
It seems you are reinventing the wheel here. I strongly suggest you try Biopython for this:
from Bio import SeqIO
record = SeqIO.read("seq.embl.txt", "embl")
print("\nSequence length: ", len(record))

String for text issue in python

I ran into some problems working with a string read from a text file.
I was given a file named sqroot2_10kdigits.txt.
Its contents are below:
1.4142135623 7309504880 1688724209 6980785696 7187537694 8073176679 7379907324 7846210703 8850387534 3276415727 3501384623 0912297024
9248360558 5073721264 4121497099 9358314132 2266592750 5592755799
9505011527 8206057147 0109559971 6059702745 3459686201 4728517418
6408891986 0955232923 0484308714 3214508397 6260362799 5251407989
6872533965 4633180882 9640620615 2583523950 5474575028 7759961729
8355752203 3753185701 1354374603 4084988471
My code is below:
myfile = open("sqroot2_10kdigits.txt")
txt = myfile.read()
print(txt)
myfile.close()
Q2: Make a new empty string called sqroot_2_string. Note that there's a space between every 10 digits. Instead of using the .rstrip() method, try using .replace(" ", "") to remove all the spaces in the file and save the result in the empty string you just made. Check the length of the string as well; it should be 10002. Then print the first 10 digits followed by "...". Here's an example:
The first 10 digit of square root of 2 is 1.4142135623...
My code is below:
def sqroot_2_string(string):
    count = 0
    list = []
    for i in xrange(len(string)):
        if string[i] != ' ':
            list.append(string[i])
    return toString(list)
# Utility Function
def toString(List):
    return ''.join(List)
# Driver program
string = myfile
print sqroot_2_string(string)
Can anyone check my code for Q2? I don't know how to use .replace(" ", "") to remove all the spaces in the file and save the result in the empty string.
You can just do
def sqroot_2_string(string):
    return string.replace(" ", "")
Also note that you should do
print(sqroot_2_string(txt))
so that you are using the text from the file instead of the file handle.
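A short Python 3 sketch of the whole step, using a few digit groups from the question in place of the real file contents (the variable txt stands in for the result of myfile.read()). One caveat that is an assumption about the file rather than something the question confirms: if the file contains line breaks, as the sample appears to, .replace(" ", "") leaves them in, so the length check may need the newlines removed as well.

```python
def sqroot_2_string(text):
    """Return text with every space character removed."""
    return text.replace(" ", "")

# Stand-in for the file contents: digit groups from the question, with line breaks.
txt = "1.4142135623 7309504880 1688724209\n9248360558 5073721264\n"

digits = sqroot_2_string(txt)
print("\n" in digits)  # replace(" ", "") does not touch newlines

# Removing every kind of whitespace, in case the file spans several lines.
no_whitespace = "".join(txt.split())
print(len(no_whitespace))
print("The first 10 digits of the square root of 2 are " + no_whitespace[:12] + "...")
```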

Taking input from a text file for implementing Caesar Cipher

I am trying to implement Caesar cipher in Python where my program would take input from a text file i.e. input_file.txt, and write the encrypted text as an output to another text file named output_file.txt. The input file contains:
Attack On Titans
4
where "Attack On Titans" is the string to be encrypted and 4 is the key to the encryption algorithm. The correct output for this string should be
Exxego Sr Xmxerw
but my program gives me
Exxego Sr Xmxerwv
i.e. an extra character v. Here is my code for review:
data = open("input_file.txt", "r")
text = data.readline()
print(text)
key = int(data.readline())
def encrypt(text,key):
    result = ""
    for i in range(len(text)):
        char = text[i]
        if char == ' ':
            result += ' '
        elif char.isupper():
            result += chr((ord(char) + key-65) % 26 + 65)
        else:
            result += chr((ord(char) + key - 97) % 26 + 97)
    return result
ex= open("output_file.txt","w")
ex.write(encrypt(text,key))
print(encrypt(text , key))
I just want to know why I am getting this incorrect output. I know I can correct it by changing the for statement to this:
for i in range(len(text)-1)
Please excuse the amateurish code; I am not good at this yet and want to improve. Thanks.
data.readline() will give you the trailing newline character \n. You need to call text.strip() before passing the text to the encrypt function to get rid of it.
It looks like you have a trailing newline character in the file you are reading in.
Testing it in the python interpreter:
>>> a = '\n'
>>> (ord(a)+4-97) % 26 + 97
118
>>> chr(118)
'v'
Remove leading and trailing whitespace by calling text.strip() before passing the text to your encrypt function.
As an aside, you should either explicitly close your files, e.g. ex.close(), or wrap the write in a with block like this, so that the file is closed for you and the output is not lost:
with open('output_file.txt', 'w') as ex:
    ex.write(encrypt(text, key))
data.readline() keeps the '\n' (newline) character at the end of the line. It's the reason why you have an extra character in your output.
To remove it you can replace
text = data.readline()
by
text = data.readline().rstrip('\n')
which will remove the '\n' at the end.
text.strip() (see other answers) will remove all whitespace characters from both ends of the string. If that is not the behaviour you want, use .rstrip('\n'), which removes only '\n' at the end of the string.
You should also add
ex.close()
after
ex.write(encrypt(text,key))
to commit the change to the file.
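The effect of the trailing newline can be reproduced without any files. This is a sketch in Python 3 rather than the asker's exact program: the line variable stands in for what data.readline() returns, and this version of encrypt assumes ASCII text and passes every non-letter through unchanged, which is a design choice of the sketch, not something the question requires.

```python
def encrypt(text, key):
    """Shift ASCII letters by key positions; leave all other characters as they are."""
    result = ""
    for char in text:
        if char.isupper():
            result += chr((ord(char) - 65 + key) % 26 + 65)
        elif char.islower():
            result += chr((ord(char) - 97 + key) % 26 + 97)
        else:
            result += char  # spaces, digits, newlines, punctuation
    return result

line = "Attack On Titans\n"  # stand-in for data.readline(), newline included
text = line.rstrip("\n")     # drop only the trailing newline

print(repr(encrypt(text, 4)))
print(repr(encrypt(line, 4)))  # newline survives untouched instead of becoming 'v'
```

Either change alone removes the extra character: stripping the newline before encrypting, or making the else branch apply only to lowercase letters.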

Python Caesar Cipher project, incorrect output

I can't seem to get this program I'm supposed to do for a project to output the correct output, even though I have tried getting it to work multiple times. The project is:
Your program needs to decode an encrypted text file called "encrypted.txt". The person who wrote it used a cipher specified in "key.txt". This key file would look similar to the following:
A    B
B    C
C    D
D    E
E    F
F    G
G    H
H    I
I    J
J    K
K    L
L    M
M    N
N    O
O    P
P    Q
Q    R
R    S
S    T
T    U
U    V
V    W
W    X
X    Y
Y    Z
Z    A
The left column represents the plaintext letter, and the right column represents the corresponding ciphertext.
Your program should decode the "encrypted.txt" file using "key.txt" and write the plaintext to "decrypted.txt".
Your program should handle both upper and lower case letters in the encrypted text without having two key files (or duplicating keys). You may have the decrypted text in all caps.
You should be able to handle characters in the encrypted text that are not in your key file. In that case, just have the decryption repeat the character. This will allow you to have spaces in your encrypted text that remain spaces when decrypted.
While you may write a program to create the key file, do NOT include that in the submission. You may manually create the encrypted and key text files. Use either the "new file" option in Python Shell (don't forget to save as txt) or an editor such as Notepad. Do not use Word.
Here is my code:
keyFile = open("key.txt", "r")
keylist1= []
keylist2 = []
for line in keyFile:
    keylist1.append(line.split()[0])
    keylist2.append(line.split()[1])
keyFile.close()
encryptedfile = open("encrypted.txt", "r")
lines = encryptedfile.readlines()
currentline = ""
decrypt = ""
for line in lines:
    currentline = line
    letter = list(currentline)
    for i in range(len(letter)):
        currentletter = letter[i]
        if not letter[i].isalpha():
            decrypt += letter[i]
        else:
            for o in range(len(keylist1)):
                if currentletter == keylist1[o]:
                    decrypt += keylist2[o]
print(decrypt)
The only output I get is:
, ?
which is incorrect.
You forgot to handle lowercase letters. Use upper() to convert everything to a common case.
It would also be better to use a dictionary instead of a pair of lists.
mapping = {}
with open("key.txt", "r") as keyFile:
    for line in keyFile:
        l1, l2 = line.split()
        # upper() is a string method, not a built-in function
        mapping[l1.upper()] = l2.upper()
decrypt = ""
with open("encrypted.txt", "r") as encryptedFile:
    for line in encryptedFile:
        for char in line:
            char = char.upper()
            if char in mapping:
                decrypt += mapping[char]
            else:
                decrypt += char
print(decrypt)
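One point the answer above does not address: the assignment says the left column of the key file is the plaintext letter and the right column is the ciphertext, so decoding needs to look letters up from right to left. The sketch below assumes that reading of the assignment. It builds the key lines in memory instead of reading key.txt, and the names key_lines, decode_map and decode are hypothetical.

```python
import string

letters = string.ascii_uppercase
# Stand-in for key.txt: "A B", "B C", ..., "Z A" (plaintext letter, then ciphertext letter).
key_lines = [p + " " + c for p, c in zip(letters, letters[1:] + letters[0])]

# Map ciphertext -> plaintext, the reverse of the column order in the file.
decode_map = {}
for line in key_lines:
    plain, cipher = line.split()
    decode_map[cipher.upper()] = plain.upper()

def decode(text):
    """Upper-case the text, decode known letters, and repeat anything else unchanged."""
    return "".join(decode_map.get(char, char) for char in text.upper())

print(decode("Ifmmp, Xpsme?"))
```

Using dict.get with the character itself as the default covers the requirement that characters missing from the key file are repeated as they are.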

Matching line endings ignores unicode characters

I've got a small Python script that compares a word list imported from document A with a set of line endings in document B in order to copy the ones that don't match those rules to document C. Example:
A (word list):
salir
entrar
leer
B (line endings list):
ir
ar
C (those from A that do not match B):
leer
In general it works fine, but I realized that it doesn't work with line endings that contain a non-ASCII character such as ó: there is no error message and everything seems to run smoothly, but list C still contains words ending in ó.
Here is an excerpt of my code:
inputobj = codecs.open(A, "r")
ruleobj = codecs.open(B, "r")
nomatch = codecs.open(C, "w")
inputtext = inputobj.readlines()
ruletext = ruleobj.readlines()
for line in inputtext:
    x = 0
    line = line.strip()
    for rule in ruletext:
        rule = rule.strip()
        if line.endswith(rule):
            print "rule", rule, " in line", line
            x= x+1
    if x == 0:
        nomatchlist.append(line)
for i in notmatchlist:
    print >> nomatch, i
I tried similar code locally, and it works for 'ó'.
Could you check that A and B are in the same encoding?
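A silent mismatch of this kind can be reproduced in a few lines. The sketch below, in Python 3, shows one possible cause and is not a diagnosis of the asker's files: it assumes the word list stores ó as a single precomposed character while the rule list stores it as o followed by a combining accent. A different byte encoding in the two files would fail in a similar silent way, which opening both with an explicit encoding such as codecs.open(A, "r", encoding="utf-8") would help rule out.

```python
import unicodedata

word = "sali\u00f3"  # "salió" with precomposed ó (U+00F3)
rule = "o\u0301"     # "ó" written as o + COMBINING ACUTE ACCENT

print(word.endswith(rule))  # the strings render the same but do not match

def nfc(text):
    """Normalize to NFC so equivalent sequences compare equal."""
    return unicodedata.normalize("NFC", text)

print(nfc(word).endswith(nfc(rule)))
```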
