I'm a noob student working on a computer vision project.
I'm using the googletrans library to translate characters extracted with Tesseract OCR. The extracted text is in Sinhala, and I need to transform it into English letters without translating the meaning. But googletrans translates the meaning as well; I only need the letters transliterated. So I want to create a mapping file of Sinhala characters with corresponding English values. I can't figure out where to start or what needs to be done. I tried to find online resources, but due to my lack of knowledge I can't connect the dots. Please guide me through this.
Here is a sample of how it should be.
(sinhala letter) = (english letters)
ට = ta
ක = ka
ර = ra
ම = ma
I think you should map all the characters using a dictionary, like so:
characters_map = {
    'ට': 'ta',
    'ක': 'ka',
    'ර': 'ra',
    'ම': 'ma'
}
then you should loop through your text, like so:
for letter in text:
    try:
        text = text.replace(letter, characters_map[letter])
    except KeyError:
        pass  # if a letter is not recognized, it is left as is
As pointed out by OneMadGypsy, this overwrites the initial text, so this might be better practice:
replaced_text = ''
for letter in text:
    try:
        replaced_text += characters_map[letter]
    except KeyError:
        replaced_text += letter
Note that replace() substitutes all occurrences of the current letter with the corresponding value in your dictionary.
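If it helps, here is a more compact variant of the same idea (a small sketch of mine, using dict.get so unmapped characters pass through unchanged):

characters_map = {'ට': 'ta', 'ක': 'ka', 'ර': 'ra', 'ම': 'ma'}
text = 'කරම'
# dict.get falls back to the original character when there is no mapping
replaced_text = ''.join(characters_map.get(letter, letter) for letter in text)
print(replaced_text)  # karama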
I hope this helps, and good luck.
More links :
replace()
loop through string
As @LouisAT stated, a dict is probably the best way to go, but I disagree with the rest of their implementation.
You could create your own str type that fully inherits from str but adds your phonetics and transliteration properties.
class sin_str(str):
    @property
    def __phonetics(self) -> dict:
        return {'ටු': 'tu',
                'කා': 'kaa',
                'ට': 'ta',
                'ක': 'ka',
                'ර': 'ra',
                'ම': 'ma'}

    @property
    def transliteration(self) -> str:
        p, t = self.__phonetics, self[:]
        for k, v in p.items():
            t = t.replace(k, v)
        return t

# usage
text = sin_str('කා')
print(text.transliteration)  # kaa
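One thing to watch (my note, not part of the original answer): since str.replace() is applied key by key, multi-character combinations like 'ටු' must be replaced before their single-character prefixes like 'ට', or the prefix gets consumed first. Sorting the keys by descending length makes that ordering explicit; a small sketch:

phonetics = {'ටු': 'tu', 'ට': 'ta', 'ක': 'ka'}

def transliterate(text):
    # apply longer keys first so 'ටු' is consumed before its prefix 'ට'
    for k in sorted(phonetics, key=len, reverse=True):
        text = text.replace(k, phonetics[k])
    return text

print(transliterate('ටුට'))  # tuta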
I built a pretty basic program that takes input in English, encrypts it using random alphabets from different languages, and also decrypts it:
def encrypt_decrypt():
    inut = input("Text to convert ::-- ")
    # feel free to replace the symbols with your own characters or numbers
    # you can also add numbers and other characters for encryption or decryption
    decideing_variable = input("U wanna encrypt or decrypt ?? ,, write EN or DE ::- ")
    if decideing_variable == "EN":
        deep = inut.replace("a", "ᛟ").replace("b", "ᛃ").replace("c", "Ῡ").replace("d", "ϰ").replace("e", "Г").replace("f", "ξ").replace("g", "ᾫ").replace("h", "ῆ").replace("i", "₪").replace("j", "א").replace("k", "ⴽ").replace("l", "ⵞ").replace("m", "ⵥ").replace("n", "ঙ").replace("o", "Œ").replace("p", "უ").replace("q", "ক").replace("r", "ჶ").replace("s", "Ø").replace("t", "ю").replace("u", "ʧ").replace("v", "ʢ").replace("w", "ұ").replace("x", "Џ").replace("y", "န").replace("z", "໒")
        print(f"\n{deep}\n")
    elif decideing_variable == "DE":
        un_deep = inut.replace("ᛟ", "a").replace("ᛃ", "b").replace("Ῡ", "c").replace("ϰ", "d").replace("Г", "e").replace("ξ","f").replace("ᾫ", "g").replace("ῆ", "h").replace("₪", "i").replace("א", "j").replace("ⴽ", "k").replace("ⵞ", "l").replace("ⵥ", "m").replace("ঙ", "n").replace("Œ", "o").replace("უ", "p").replace("ক", "q").replace("ჶ", "r").replace("Ø", "s").replace("ю", "t").replace("ʧ", "u").replace("ʢ", "v").replace("ұ", "w").replace("Џ", "x").replace("န", "y").replace("໒", "z")
        print(f"\n{un_deep}\n")

encrypt_decrypt()
While writing this I didn't know any better way than chaining the .replace() function, but I have a feeling that this isn't the proper way to do it.
The code works fine, but does anyone know a better way of doing this?
It looks like you are doing a character-by-character replacement. The function you are looking for is str.maketrans(). You can give it two strings of equal length to convert each character to the desired character. Here is a working example:
# first example: two strings of equal length map character to character
firstString = "abc"
secondString = "def"
string = "abc"
print(string.maketrans(firstString, secondString))

# second example: a dictionary can map a character to a longer string;
# note that two strings of unequal length (e.g. "abc" and "defghi")
# would raise a ValueError
trans_dict = {"a": "def", "b": "ghi", "c": "jkl"}
print(string.maketrans(trans_dict))
You can also look at the official documentation for further details.
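To tie it back to your cipher, here is a minimal sketch (my example; only the first three substitutions are shown, so extend both strings to cover the full alphabet):

plain = "abc"
cipher = "ᛟᛃῩ"
encrypt_table = str.maketrans(plain, cipher)
decrypt_table = str.maketrans(cipher, plain)

message = "cab"
encrypted = message.translate(encrypt_table)    # 'Ῡᛟᛃ'
decrypted = encrypted.translate(decrypt_table)  # 'cab'
print(encrypted, decrypted)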
You can make a dictionary of corresponding characters and use it like this:
text = "ababdba"
translation = {'a':'ᛟ', 'b':'ᛃ', 'c':'Ῡ','d': 'ϰ','e': 'Г','f': 'ξ','g': 'ᾫ','h':'ῆ','i': '₪','j': 'א','k': 'ⴽ','l': 'ⵞ','m' :'ⵥ','n': 'ঙ','o': 'Œ','p': 'უ','q': 'ক','r': 'ჶ','s': 'Ø','t': 'ю','u': 'ʧ', 'v':'ʢ','w': 'ұ','x': 'Џ','y': 'န','z': '໒'}
def translate(text,translation):
result = []
for char in text:
result.append( translation[char] )
return "".join(result)
print(translate(text,translation))
The result is:
ᛟᛃᛟᛃϰᛃᛟ
This might help you.
str.translate() and str.maketrans() are built to do all of the replacements in one go.
e.g.
>>> encrypt_table = str.maketrans("abc", "ᛟᛃῩ")
>>> "an abacus".translate(encrypt_table)
'ᛟn ᛟᛃᛟῩus'
NB: not string.maketrans(), which is how it was done in Python 2 and is now outdated; Python 3 split it into two methods, str.maketrans() for text and bytes.maketrans() for bytes. See How come string.maketrans does not work in Python 3.1?
I need a python library that accepts some text, and replaces phone numbers, names, and so on with tokens. Example:
Input: Please call Robert on 0430013454 to discuss this further.
Output: Please call NAME on PHONE to discuss this further.
In other words, I need to take any sentence, run the program on it, and remove anything that looks like a name, phone number, or any other identifier, replacing it with a token (i.e. NAME, PHONE) so that the information is no longer displayed.
It must be Python 2.7 compatible. Would anybody know how this would be done?
Cheers!
As Harrison pointed out, nltk has named entity recognition, which is what you want for this task. Here is a good sample to get you started.
From the site:
import nltk

sentences = nltk.sent_tokenize(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

def extract_entity_names(t):
    entity_names = []
    if hasattr(t, 'label') and t.label:
        if t.label() == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))
    return entity_names

entity_names = []
for tree in chunked_sentences:
    # Print results per sentence
    # print extract_entity_names(tree)
    entity_names.extend(extract_entity_names(tree))

# Print all entity names
# print entity_names

# Print unique entity names
print set(entity_names)
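Once the entity names are collected, a small follow-up (my addition, not part of the linked sample) can substitute the NAME token back into the original text:

redacted = text
for name in set(entity_names):
    redacted = redacted.replace(name, 'NAME')
print redacted  # Python 2 print, per the question's 2.7 requirement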
I'm not really sure about name recognition. However, if you know the names you would be looking for, it would be easy: you could have a list of all of the names, check whether each one is in the string, and if so just use string.replace. If the names are arbitrary, you could look into NLTK; I think it has some named entity recognition, though I really don't know much about it.
As for phone numbers, that's easy. You can split the string into a list and check whether any element consists of digits. You could even check the length to make sure it's 10 digits (I'm assuming all numbers are 10 digits, based on your example).
Something like this...
example_input = 'Please call Robert on 0430013454 to discuss this further.'
new_list = example_input.split(' ')
# enumerate avoids list.index(), which only ever finds the first occurrence
for pos, word in enumerate(new_list):
    if word.isdigit():
        new_list[pos] = 'PHONE'
example_output = ' '.join(new_list)
print example_output
This would be the output: 'Please call Robert on PHONE to discuss this further'
The if statement would be something like if word.isdigit() and len(word) == 10: if you wanted to make sure the number is exactly 10 digits.
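A regex version is also possible (a sketch; it assumes 10-digit numbers written without separators, as in the example):

import re

example_input = 'Please call Robert on 0430013454 to discuss this further.'
# \b\d{10}\b matches a standalone run of exactly ten digits
print re.sub(r'\b\d{10}\b', 'PHONE', example_input)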
I am trying to write some Python code that will replace some unwanted strings using regex. The code I have written has been taken from another question on this site.
I have a text:
text_1=u'I\u2019m \u2018winning\u2019, I\u2019ve enjoyed none of it. That\u2019s why I\u2019m withdrawing from the market,\u201d wrote Arment.'
I want to remove all of the \u2019m, \u2019s, \u2019ve, etc.
The code that I've written is given below:
rep={"\n":" ","\n\n":" ","\n\n\n":" ","\n\n\n\n":" ",u"\u201c":"", u"\u201d":"", u"\u2019[a-z]":"", u"\u2013":"", u"\u2018":""}
rep = dict((re.escape(k), v) for k, v in rep.iteritems())
pattern = re.compile("|".join(rep.keys()))
text = pattern.sub(lambda m: rep[re.escape(m.group(0))], text_1)
The code works perfectly for:
"u"\u201c":"", u"\u201d":"", u"\u2013":"" and u"\u2018":""
However, It doesn't work that great for:
u"\u2019[a-z] : The presence of [a-z] turns rep into \\[a\\-z\\] which doesnt match.
The output I am looking for is:
text_1=u'I winning, I enjoyed none of it. That why I withdrawing from the market,wrote Arment.'
How do I achieve this?
The information about the newlines completely changes the answer. For this, I think building the expression using a loop is actually less legible than just using better formatting in the pattern itself.
import re

replacements = {'newlines': ' ',
                'deletions': ''}

pattern = re.compile(u'(?P<newlines>\n+)|'
                     u'(?P<deletions>\u201c|\u201d|\u2019[a-z]?|\u2013|\u2018)')

def lookup(match):
    return replacements[match.lastgroup]

text = pattern.sub(lookup, text_1)
The problem here is actually the escaping; this code does what you want more directly:
remove = (u"\u201c", u"\u201d", u"\u2019[a-z]?", u"\u2013", u"\u2018")
pattern = re.compile("|".join(remove))
text = pattern.sub("", text_1)
I've added the ? to the u2019 match, as I suppose that's what you want as well given your test string.
For completeness, I think I should also link to the Unidecode package which may actually be more closely what you're trying to achieve by removing these characters.
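For example (a minimal illustration, assuming the unidecode package is installed):

from unidecode import unidecode

print(unidecode(u'I\u2019m \u2018winning\u2019'))  # I'm 'winning'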
The simplest way is a single regex (the \u2019[a-z] alternative comes first so the trailing letter is removed together with the apostrophe):

import re

pattern = re.compile(u'\u2019[a-z]|[\u201c\u201d\u2018\u2019\u2013]')
text = pattern.sub('', text_1)
I am trying to make a word translator.
english_List = ["fire","apple","morning","river","wind"]
spanish_List = ["fuego","manzana","mañana","río","viento",]
Would I be able to make it so when enter an English word e.g. "fire" it will print out the corresponding translation "fuego"?
Use a dictionary. You can make a dictionary where each English word is mapped to the corresponding Spanish word from these two lists, using zip() to couple "fire" with "fuego", "apple" with "manzana", and so forth, and then building the dictionary with dict().
english_list = ["fire","apple","morning","river","wind"]
spanish_list = ["fuego","manzana","mañana","río","viento"]
english_to_spanish = dict(zip(english_list, spanish_list))
You can then get a translation for an English word as:
spanish = english_to_spanish['apple']
If a word is not found, a KeyError exception is raised. A more complete example could use a function for translation, say:
def translate(english_word):
    try:
        print("{} in Spanish is {}".format(
            english_word, english_to_spanish[english_word]))
    except KeyError:
        print("Looks like Spanish does not have the word for {}, sorry"
              .format(english_word))

while True:
    word = input()  # raw_input in Python 2
    translate(word)
Use a dict to map the corresponding words:
trans_dict = {"fire":"fuego","apple":"manzana","morning":"mañana","river":"río","wind":"viento"}
inp = raw_input("Enter your english word to translate:").lower()
print("{} is {}".format(inp.capitalize(),trans_dict.get(inp," not in my translation dict").capitalize()))
You can use zip to make the dict from your lists:
english_List = ["fire","apple","morning","river","wind"]
spanish_List = ["fuego","manzana","mañana","río","viento"]
trans_dict = dict(zip(english_List,spanish_List))
Using trans_dict.get(inp, "not in my translation dict") with a default value makes sure that if the user enters a word that does not exist in our trans_dict, it prints "the_word is not in my translation dict" and avoids a KeyError.
We use .lower() in case the user enters Fire or Apple etc. with an uppercase letter, and we use str.capitalize() to capitalize the output.
dict.get
You can do it with this function:
def translate(word, english_list, spanish_list):
    if word in english_list:
        return spanish_list[english_list.index(word)]
    else:
        return None
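For example (my quick check, using lowercased names for the two lists from the question):

english_list = ["fire", "apple", "morning", "river", "wind"]
spanish_list = ["fuego", "manzana", "mañana", "río", "viento"]
print(translate("fire", english_list, spanish_list))   # fuego
print(translate("cloud", english_list, spanish_list))  # None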
However, the proper way would be to use a dictionary.
I think the best way to do this is to use a dictionary.
For example:
d = {"fire": "fuego", "apple": "manzana"}
And then retrieve the translation:
d.get("fire", "No translation")
BTW, on python.org you will find awesome documentation on how to learn Python:
https://wiki.python.org/moin/BeginnersGuide
I presume that you should start here:
https://wiki.python.org/moin/BeginnersGuide/NonProgrammers
How would you distinguish real words from random strings? Examples of words:
ball
encyclopedia
tableau
Examples of random strings:
qxbogsac
jgaynj
rnnfdwpm
Of course it may happen that a random string actually is a word in some language, or looks like one. But basically a human being is able to say whether something looks 'random' or not, essentially by checking whether they are able to pronounce it.
I was trying to calculate entropy to distinguish the two, but it's far from perfect. Do you have any other ideas or algorithms that work?
There is one important requirement though: I can't use heavyweight libraries like nltk or use dictionaries. Basically what I need is some simple and quick heuristic that works in most cases.
I developed a Python 3 package called Nostril for a problem closely related to what the OP asked: deciding whether text strings extracted during source-code mining are class/function/variable/etc. identifiers or random gibberish. It does not use a dictionary, but it does incorporate a rather large table of n-gram frequencies to support its probabilistic assessment of text strings. (I'm not sure if that qualifies as a "dictionary".) The approach does not check pronunciation, and its specialization may make it unsuitable for general word/nonword detection; nevertheless, perhaps it will be useful for either the OP or someone else looking to solve a similar problem.
Example: the following code,
from nostril import nonsense

real_test = ['bunchofwords', 'getint', 'xywinlist', 'ioFlXFndrInfo',
             'DMEcalPreshowerDigis', 'httpredaksikatakamiwordpresscom']
junk_test = ['faiwtlwexu', 'asfgtqwafazfyiur', 'zxcvbnmlkjhgfdsaqwerty']
for s in real_test + junk_test:
    print('{}: {}'.format(s, 'nonsense' if nonsense(s) else 'real'))
will produce the following output:
bunchofwords: real
getint: real
xywinlist: real
ioFlXFndrInfo: real
DMEcalPreshowerDigis: real
httpredaksikatakamiwordpresscom: real
faiwtlwexu: nonsense
asfgtqwafazfyiur: nonsense
zxcvbnmlkjhgfdsaqwerty: nonsense
Caveat: I am not a natural-language expert.
Assuming that what is mentioned in the link If You Can Raed Tihs, You Msut Be Raelly Smrat is authentic, a simple approach would be:
Have an English dictionary (I believe the approach is language-agnostic).
Create a Python dict of the words, with keys formed from the first and last character of each word in the dictionary:
from collections import defaultdict

words = defaultdict(list)
with open("your_dict.txt") as fin:
    for word in fin:
        word = word.strip()  # drop the trailing newline
        words[word[0] + word[-1]].append(word)
Now for any given word (the needle), search the dict (remember, the key is the first plus last character of the word) and compare whether the characters of each candidate match the needle's:

for match in words[needle[0] + needle[-1]]:
    if sorted(match) == sorted(needle):
        print "Human Readable Word"
A comparably slower approach would be to use difflib.get_close_matches(word, possibilities[, n][, cutoff])
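For example (taken from the difflib documentation):

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']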
If you really mean that your metric of randomness is pronounceability, you're getting into the realm of phonotactics: the allowed sequences of sounds in a language. As @ChrisPosser points out in his comment to your question, these allowed sequences of sounds are language-specific.
This question only makes sense within a specific language.
Whichever language you choose, you might have some luck with an n-gram model trained over the letters themselves (as opposed to the words, which is the usual approach). Then you can calculate a score for a particular string and set a threshold under which a string is random and over which a string is something like a word.
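A minimal sketch of that idea (my illustration; the training corpus, the add-one smoothing, and whatever threshold you pick are all assumptions to tune per language):

from collections import Counter
import math

def train_bigrams(corpus):
    counts = Counter()
    for word in corpus:
        w = '^' + word.lower() + '$'  # boundary markers
        counts.update(zip(w, w[1:]))
    return counts, sum(counts.values())

def score(word, counts, total):
    w = '^' + word.lower() + '$'
    logp = 0.0
    for bigram in zip(w, w[1:]):
        # add-one smoothing so unseen bigrams don't send the score to -inf
        logp += math.log((counts[bigram] + 1.0) / (total + 1.0))
    return logp / (len(w) - 1)  # length-normalized, so strings of different lengths compare fairly

counts, total = train_bigrams(['ball', 'encyclopedia', 'tableau'])
print(score('tables', counts, total) > score('qxbogsac', counts, total))  # True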
EDIT: Someone has done this already and actually implemented it: https://stackoverflow.com/a/6298193/583834
Works quite well for me:
VOWELS = "aeiou"
PHONES = ['sh', 'ch', 'ph', 'sz', 'cz', 'sch', 'rz', 'dz']
def isWord(word):
if word:
consecutiveVowels = 0
consecutiveConsonents = 0
for idx, letter in enumerate(word.lower()):
vowel = True if letter in VOWELS else False
if idx:
prev = word[idx-1]
prevVowel = True if prev in VOWELS else False
if not vowel and letter == 'y' and not prevVowel:
vowel = True
if prevVowel != vowel:
consecutiveVowels = 0
consecutiveConsonents = 0
if vowel:
consecutiveVowels += 1
else:
consecutiveConsonents +=1
if consecutiveVowels >= 3 or consecutiveConsonents > 3:
return False
if consecutiveConsonents == 3:
subStr = word[idx-2:idx+1]
if any(phone in subStr for phone in PHONES):
consecutiveConsonents -= 1
continue
return False
return True
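A quick check against the question's examples (my addition; note that 'tableau' trips the three-vowel rule, so the heuristic isn't perfect):

for s in ['ball', 'encyclopedia', 'tableau', 'qxbogsac', 'jgaynj', 'rnnfdwpm']:
    print(s, isWord(s))
# ball True, encyclopedia True, tableau False,
# qxbogsac False, jgaynj False, rnnfdwpm False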
Use PyDictionary.
You can install PyDictionary using the following command.
easy_install -U PyDictionary
Now in code:
from PyDictionary import PyDictionary

dictionary = PyDictionary()
a = ['ball', 'asdfg']
for item in a:
    x = dictionary.meaning(item)
    if x is None:
        print item + ': Not a valid word'
    else:
        print item + ': Valid'
As far as I know, you can use PyDictionary for some languages other than English.
I wrote this logic to detect the number of consecutive vowels and consonants in a string. You can choose the threshold based on the language.
import re
import numpy as np

def get_num_vowel_bunches(txt, num_consq=3):
    len_txt = len(txt)
    num_viol = 0
    if len_txt >= num_consq:
        pos_iter = re.finditer('[aeiou]', txt)
        # row 0 marks vowel positions; each following row is the previous
        # row shifted left by one position
        pos_mat = np.zeros((num_consq, len_txt), dtype=int)
        for idx in pos_iter:
            pos_mat[0, idx.span()[0]] = 1
        for i in np.arange(1, num_consq):
            pos_mat[i, 0:-1] = pos_mat[i - 1, 1:]
        # a column summing to num_consq marks a run of num_consq vowels
        sum_vec = np.sum(pos_mat, axis=0)
        num_viol = sum(sum_vec == num_consq)
    return num_viol

def get_num_consonent_bunches(txt, num_consq=3):
    len_txt = len(txt)
    num_viol = 0
    if len_txt >= num_consq:
        pos_iter = re.finditer('[bcdfghjklmnpqrstvwxz]', txt)
        pos_mat = np.zeros((num_consq, len_txt), dtype=int)
        for idx in pos_iter:
            pos_mat[0, idx.span()[0]] = 1
        for i in np.arange(1, num_consq):
            pos_mat[i, 0:-1] = pos_mat[i - 1, 1:]
        sum_vec = np.sum(pos_mat, axis=0)
        num_viol = sum(sum_vec == num_consq)
    return num_viol
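A quick usage check (my addition):

print(get_num_vowel_bunches("aaargh"))        # 1: the single 'aaa' run
print(get_num_consonent_bunches("rnnfdwpm"))  # 6: every 3-wide window in a run of 8 consonants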