I am trying to make a word translator.
english_List = ["fire","apple","morning","river","wind"]
spanish_List = ["fuego","manzana","mañana","río","viento"]
Would I be able to make it so that when I enter an English word, e.g. "fire", it prints out the corresponding translation, "fuego"?
Use a dictionary. You can build one that maps each English word to the corresponding Spanish word from these two lists: zip() couples "fire" with "fuego", "apple" with "manzana" and so forth, and dict() turns those pairs into a dictionary.
english_list = ["fire","apple","morning","river","wind"]
spanish_list = ["fuego","manzana","mañana","río","viento"]
english_to_spanish = dict(zip(english_list, spanish_list))
You can then look up the translation of an English word like this:
spanish = english_to_spanish['apple']
If a word is not found, a KeyError is raised. A more complete example could wrap the lookup in a function, say:
def translate(english_word):
    try:
        print("{} in Spanish is {}".format(
            english_word, english_to_spanish[english_word]))
    except KeyError:
        print("Looks like Spanish does not have the word for {}, sorry"
              .format(english_word))

while True:
    word = input()  # raw_input in Python 2
    translate(word)
Use a dict to map the corresponding words:
trans_dict = {"fire":"fuego","apple":"manzana","morning":"mañana","river":"río","wind":"viento"}
inp = raw_input("Enter your english word to translate: ").lower()  # use input() in Python 3
# the leading space in the default stops .capitalize() from capitalizing the fallback message
print("{} is {}".format(inp.capitalize(), trans_dict.get(inp, " not in my translation dict").capitalize()))
You can use zip to make the dict from your lists:
english_List = ["fire","apple","morning","river","wind"]
spanish_List = ["fuego","manzana","mañana","río","viento"]
trans_dict = dict(zip(english_List,spanish_List))
Using trans_dict.get(inp, " not in my translation dict") with a default value ensures that if the user enters a word that does not exist in our trans_dict, it prints "the_word is not in my translation dict" instead of raising a KeyError. We use .lower() in case the user enters Fire or Apple etc. with an uppercase letter, and str.capitalize() to capitalize the output.
dict.get
You can do it with this function:
def translate(word, english_list, spanish_list):
    if word in english_list:
        return spanish_list[english_list.index(word)]
    else:
        return None
However, the proper way would be to use a dictionary.
I think the best way to do this is to use a dictionary.
For example:
d = {"fire": "fuego", "apple": "manzana"}
And then retrieve the translation:
d.get("fire", "No translation")
BTW, on python.org you will find excellent documentation on learning Python:
https://wiki.python.org/moin/BeginnersGuide
You should probably start here:
https://wiki.python.org/moin/BeginnersGuide/NonProgrammers
Related
I'm a noob student working on a computer vision project.
I'm using the googletrans library to translate characters extracted with Tesseract OCR. The extracted text is in the Sinhala language, and I need to transliterate it into English letters without translating the meaning, but googletrans translates the meaning as well. I only need the letters converted. So I want to create a mapping file of characters in the Sinhala language with their corresponding values in English. I can't figure out where to start or what needs to be done. I tried to find online resources, but due to my lack of knowledge I can't connect the dots. Please guide me through this.
Here is a sample of how it should be.
(Sinhala letter) (English letters)
ට = ta
ක = ka
ර = ra
ම = ma
I think you should map all the characters using a dictionary, like so:
characters_map = {
    'ට': 'ta',
    'ක': 'ka',
    'ර': 'ra',
    'ම': 'ma'
}
then you should loop through your text, like so:
for letter in text:
    try:
        text = text.replace(letter, characters_map[letter])
    except KeyError:
        pass  # if a letter is not recognized, it is left as is
As pointed out by OneMadGypsy, this overwrites the initial text, so this might be better practice:
replaced_text = ''
for letter in text:
    try:
        replaced_text += characters_map[letter]
    except KeyError:
        replaced_text += letter
This replaces all occurrences of each letter with the corresponding value in your dictionary.
I hope this helps, and good luck.
More links:
replace()
loop through string
As @LouisAT stated, a dict is probably the best way to go, but I disagree with the rest of their implementation.
You could create your own str type that fully inherits from str but adds your phonetics and transliteration properties.
class sin_str(str):
    @property
    def __phonetics(self) -> dict:
        # multi-character clusters are listed before single characters
        # so they get replaced first
        return {'ටු': 'tu',
                'කා': 'kaa',
                'ට': 'ta',
                'ක': 'ka',
                'ර': 'ra',
                'ම': 'ma'}

    @property
    def transliteration(self) -> str:
        p, t = self.__phonetics, self[:]
        for k, v in p.items():
            t = t.replace(k, v)
        return t

# use
text = sin_str('කා')
print(text.transliteration)  # kaa
I am trying to locate words that contain a certain string inside a list of lists in Python. For example, if I have a list of tuples like:
the_list = [
('Had denoting properly #T-jointure you occasion directly raillery'),
('. In said to of poor full be post face snug. Introduced imprudence'),
('see say #T-unpleasing devonshire acceptance son.'),
('Exeter longer #T-wisdom gay nor design age.', 'Am weather to entered norland'),
('no in showing service. Nor repeated speaking', ' shy appetite.'),
('Excited it hastily an pasture #T-it observe.', 'Snug #T-hand how dare here too.')
]
I want to search for a specific string and extract the complete word that contains it. For example:
for sentence in the_list:
    for word in sentence:
        if '#T-' in word:
            print(word)
import re

wordSearch = re.compile(r'#T-\w+')
for item in the_list:
    # entries can be a bare string or a tuple of strings
    for sentence in (item if isinstance(item, tuple) else (item,)):
        for word in wordSearch.findall(sentence):
            print(word)
You could use a list comprehension on a flattened version of your array:
from pandas.core.common import flatten
[[word for word in x.split(' ') if '#T-' in word] for x in list(flatten(the_list)) if '#T-' in x]
#[['#T-jointure'], ['#T-unpleasing'], ['#T-wisdom'], ['#T-it'], ['#T-hand']]
Relevant places: How to make a flat list out of list of lists? (specifically this answer), Double for loop list comprehension.
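If you would rather not depend on pandas.core.common (a private pandas module), itertools.chain can flatten this one level of nesting just as well; a small sketch assuming every entry is either a string or a tuple of strings:
from itertools import chain

# wrap bare strings in a one-element tuple so chain flattens one level uniformly
flat = chain.from_iterable(x if isinstance(x, tuple) else (x,) for x in the_list)
print([word for sentence in flat for word in sentence.split(' ') if '#T-' in word])
# ['#T-jointure', '#T-unpleasing', '#T-wisdom', '#T-it', '#T-hand']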
You can use re for this task:
import re

a = re.search(r"#(.*?)\s", 'Exeter longer #T-wisdom gay nor design age.')
a.group(0)  # '#T-wisdom ' (includes the trailing space)
Note: you need to account for None (when there is no match), otherwise .group() will throw an error:
for name in the_list:
    try:
        if isinstance(name, (list, tuple)):
            for name1 in name:
                result = re.search(r"#(.*?)\s", name1)
                print(result.group(0))
        else:
            result = re.search(r"#(.*?)\s", name)
            print(result.group(0))
    except AttributeError:
        pass  # re.search returned None: no '#' match in this sentence
I need a Python library that accepts some text and replaces phone numbers, names, and so on with tokens. Example:
Input: Please call Robert on 0430013454 to discuss this further.
Output: Please call NAME on PHONE to discuss this further.
In other words, I need to take any sentence, run the program on it, and remove anything that looks like a name, phone number, or other identifier, replacing it with a token such as NAME or PHONE so that the information is no longer displayed.
It must be Python 2.7 compatible. Would anybody know how this would be done?
Cheers!
As Harrison pointed out, nltk has named entity recognition, which is what you want for this task. Here is a good sample to get you started.
From the site:
import nltk

sentences = nltk.sent_tokenize(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

def extract_entity_names(t):
    entity_names = []
    if hasattr(t, 'label') and t.label:
        if t.label() == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))
    return entity_names

entity_names = []
for tree in chunked_sentences:
    # Print results per sentence
    # print extract_entity_names(tree)
    entity_names.extend(extract_entity_names(tree))

# Print all entity names
# print entity_names

# Print unique entity names
print set(entity_names)
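The sample only collects the entity names; to produce the output the question asks for, you would still substitute them back into the original text, along these lines (a sketch, assuming text still holds the original string):
# replace each detected entity with the NAME token
for name in set(entity_names):
    text = text.replace(name, 'NAME')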
Not really sure about name recognition, but if you know the names you are looking for, it's easy: keep a list of all the names, check whether each one is in the string, and if so use string.replace. If the names are arbitrary you could look into NLTK; I think it has some named entity recognition, though I really don't know much about it.
As for phone numbers, that's easy. You can split the string into a list and check whether any element consists of digits. You could even check the length to make sure it's 10 digits (I'm assuming all numbers are 10 digits, based on your example).
Something like this...
example_input = 'Please call Robert on 0430013454 to discuss this further.'
new_list = example_input.split(' ')
for word in new_list:
    if word.isdigit():
        pos = new_list.index(word)
        new_list[pos] = 'PHONE'
example_output = ' '.join(new_list)
print example_output
This would be the output: 'Please call Robert on PHONE to discuss this further'
The if statement would become if word.isdigit() and len(word) == 10: if you wanted to make sure the number is exactly 10 digits long.
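Alternatively, both replacements can be done with re.sub; a rough sketch (the name list here is hypothetical, you'd supply your own):
import re

known_names = ['Robert', 'Alice']  # hypothetical list of names to redact
text = 'Please call Robert on 0430013454 to discuss this further.'
text = re.sub(r'\b\d{10}\b', 'PHONE', text)
text = re.sub(r'\b(' + '|'.join(re.escape(n) for n in known_names) + r')\b', 'NAME', text)
print text  # Please call NAME on PHONE to discuss this further.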
I need to be able to pick keywords from an Excel CSV file; I've read the file into a list. The program is a phone troubleshooter, and I need the input ("The screen won't turn on") to give the same output as if I had entered ("The display is blank").
"Troubleshooting Program to give the user a solution to a trouble they've encountered based on inputted key words."
phoneprob=input("What problem are you having with your phone? ")
prob=open("phone.csv","r")
phone=prob.read()
prob.close()
eachProb=phone.split("\n")
print(eachProb)
problist=[eachProb]
print (problist)
Are you trying to build a keyword dictionary or retrieve a matching problem sentence?
In either case, you need to associate a problem with keywords.
A basic approach to get keywords is to split each sentence into words (using s.split()) and keep the most frequent of them, as in the sketch below.
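A tiny sketch of that counting idea, using collections.Counter (the stop-word set here is made up for illustration):
from collections import Counter

stop_words = {'the', 'is', 'my', 'on', "won't"}  # hypothetical stop words
sentences = ["The screen won't turn on", "The display is blank"]
counts = Counter(w for s in sentences for w in s.lower().split() if w not in stop_words)
print(counts.most_common(3))  # e.g. [('screen', 1), ('turn', 1), ('display', 1)]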
difflib can help here.
Since we don't know the given file's schema, I assume it's just a list of sentences and that you provide the keywords/problems elsewhere (the situations dict).
For example:
csv_ret = ["I can't turn on my phone", "The screen won't turn on", "phone The display is blank"]
situations = {
    "screen": ["turn on", "blank", "display", "screen"],
    "battery": ["turn on", "phone"]
}

def get_situation_from_sentence(sentence):
    occurrences = {}
    for word in sentence.split():
        for key, value in situations.items():
            if word in value:
                if occurrences.get(key) is None:
                    occurrences[key] = [word]
                elif word not in occurrences.get(key):
                    occurrences[key].append(word)
    averages = {k: ((len(v) * 100) / len(situations[k])) for k, v in occurrences.items()}
    return "{}, {}".format(averages, sentence)

for sentence in csv_ret:
    print(get_situation_from_sentence(sentence))
results:
{'battery': 50.0}, I can't turn on my phone
{'screen': 25.0}, The screen won't turn on
{'screen': 50.0, 'battery': 50.0}, phone The display is blank
This code scores each sentence against each situation's keywords and reports the match as a percentage.
Once again, this is a very basic solution, and you may need something more robust (lexer/parser, machine learning, ...), but sometimes simpler is better :)
Examples of words:
ball
encyclopedia
tableau
Examples of random strings:
qxbogsac
jgaynj
rnnfdwpm
Of course it may happen that a random string is actually a word in some language, or looks like one. But basically a human being can say whether something looks 'random' or not, essentially by checking whether it is pronounceable.
I was trying to calculate entropy to distinguish the two, but it's far from perfect. Do you have any other ideas or algorithms that work?
There is one important requirement, though: I can't use heavyweight libraries like nltk or use dictionaries. Basically what I need is a simple, quick heuristic that works in most cases.
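For concreteness, a minimal sketch of such a character-entropy check (illustrative, not my exact code):
import math
from collections import Counter

def char_entropy(s):
    # Shannon entropy of the character distribution, in bits
    counts = Counter(s)
    n = float(len(s))
    return -sum((c / n) * math.log(c / n, 2) for c in counts.values())

print(char_entropy('encyclopedia'))  # ~3.25 bits
print(char_entropy('qxbogsac'))      # 3.0 bits -- similar, which is why entropy alone separates them poorly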
I developed a Python 3 package called Nostril for a problem closely related to what the OP asked: deciding whether text strings extracted during source-code mining are class/function/variable/etc. identifiers or random gibberish. It does not use a dictionary, but it does incorporate a rather large table of n-gram frequencies to support its probabilistic assessment of text strings. (I'm not sure if that qualifies as a "dictionary".) The approach does not check pronunciation, and its specialization may make it unsuitable for general word/nonword detection; nevertheless, perhaps it will be useful for either the OP or someone else looking to solve a similar problem.
Example: the following code,
from nostril import nonsense

real_test = ['bunchofwords', 'getint', 'xywinlist', 'ioFlXFndrInfo',
             'DMEcalPreshowerDigis', 'httpredaksikatakamiwordpresscom']
junk_test = ['faiwtlwexu', 'asfgtqwafazfyiur', 'zxcvbnmlkjhgfdsaqwerty']
for s in real_test + junk_test:
    print('{}: {}'.format(s, 'nonsense' if nonsense(s) else 'real'))
will produce the following output:
bunchofwords: real
getint: real
xywinlist: real
ioFlXFndrInfo: real
DMEcalPreshowerDigis: real
httpredaksikatakamiwordpresscom: real
faiwtlwexu: nonsense
asfgtqwafazfyiur: nonsense
zxcvbnmlkjhgfdsaqwerty: nonsense
Caveat: I am not a natural-language expert.
Assuming that what is described in the link If You Can Raed Tihs, You Msut Be Raelly Smrat is authentic, a simple approach would be:
Have an English dictionary (I believe the approach is language-agnostic).
Create a Python dict of the words, keyed by the first and last character of each word:
from collections import defaultdict

words = defaultdict(list)
with open("your_dict.txt") as fin:
    for word in fin:
        word = word.strip()
        words[word[0] + word[-1]].append(word)
Now, for any given word (the needle), look it up using the needle's first and last character as the key, and check whether a candidate's characters match the needle's:
for match in words[needle[0] + needle[-1]]:
    if sorted(match) == sorted(needle):
        print("Human readable word")
A comparatively slower approach would be to use difflib.get_close_matches(word, possibilities[, n][, cutoff]).
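For example (the vocabulary here is just illustrative):
import difflib

vocabulary = ['ball', 'encyclopedia', 'tableau']  # illustrative word list
print(difflib.get_close_matches('tabeau', vocabulary, n=1, cutoff=0.8))    # ['tableau']
print(difflib.get_close_matches('qxbogsac', vocabulary, n=1, cutoff=0.8))  # []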
If you really mean that your metric of randomness is pronounceability, you're getting into the realm of phonotactics: the allowed sequences of sounds in a language. As @ChrisPosser points out in his comment to your question, these allowed sequences of sounds are language-specific.
This question only makes sense within a specific language.
Whichever language you choose, you might have some luck with an n-gram model trained over the letters themselves (as opposed to the words, which is the usual approach). Then you can calculate a score for a particular string and set a threshold under which a string is random and over which a string is something like a word.
EDIT: Someone has done this already and actually implemented it: https://stackoverflow.com/a/6298193/583834
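A bare-bones sketch of that letter-level n-gram idea, using character bigrams and add-one smoothing (the training text and threshold are placeholders you would replace with real data):
import math
from collections import Counter

# toy corpus; in practice, train on a large body of text in your language
corpus = 'the quick brown fox jumps over the lazy dog and runs away'
bigrams = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
total = sum(bigrams.values())

def avg_log_prob(s):
    # average log-probability of the string's character bigrams,
    # with add-one smoothing over a 27-character alphabet
    pairs = [s[i:i + 2] for i in range(len(s) - 1)]
    return sum(math.log((bigrams[p] + 1.0) / (total + 27 * 27)) for p in pairs) / len(pairs)

# strings scoring below a tuned threshold would be flagged as random
print(avg_log_prob('brown'))     # higher score: familiar bigrams
print(avg_log_prob('qxbogsac'))  # lower score: rare bigrams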
Works quite well for me:
VOWELS = "aeiou"
PHONES = ['sh', 'ch', 'ph', 'sz', 'cz', 'sch', 'rz', 'dz']

def isWord(word):
    if word:
        consecutiveVowels = 0
        consecutiveConsonents = 0
        for idx, letter in enumerate(word.lower()):
            vowel = True if letter in VOWELS else False
            if idx:
                prev = word[idx-1]
                prevVowel = True if prev in VOWELS else False
                if not vowel and letter == 'y' and not prevVowel:
                    vowel = True
                if prevVowel != vowel:
                    consecutiveVowels = 0
                    consecutiveConsonents = 0
            if vowel:
                consecutiveVowels += 1
            else:
                consecutiveConsonents += 1
            if consecutiveVowels >= 3 or consecutiveConsonents > 3:
                return False
            if consecutiveConsonents == 3:
                subStr = word[idx-2:idx+1]
                if any(phone in subStr for phone in PHONES):
                    consecutiveConsonents -= 1
                    continue
                return False
    return True
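For what it's worth, running it over the question's examples (a quick check of the heuristic's behaviour, not a guarantee of accuracy):
for w in ['ball', 'encyclopedia', 'tableau', 'qxbogsac', 'jgaynj', 'rnnfdwpm']:
    print(w, isWord(w))
# ball and encyclopedia come out True, the random strings False;
# note 'tableau' also comes out False because 'eau' is three vowels in a row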
Use PyDictionary.
You can install PyDictionary with the following command:
easy_install -U PyDictionary
Now in code:
from PyDictionary import PyDictionary

dictionary = PyDictionary()
a = ['ball', 'asdfg']
for item in a:
    x = dictionary.meaning(item)
    if x is None:
        print item + ': Not a valid word'
    else:
        print item + ': Valid'
As far as I know, you can use PyDictionary for some languages other than English as well.
I wrote this logic to detect the number of consecutive vowels and consonants in a string. You can choose the threshold based on the language.
import re
import numpy as np

def get_num_vowel_bunches(txt, num_consq=3):
    len_txt = len(txt)
    num_viol = 0
    if len_txt >= num_consq:
        pos_iter = re.finditer('[aeiou]', txt)
        # mark vowel positions, then stack shifted copies so that a column
        # sums to num_consq exactly where a run of num_consq vowels starts
        pos_mat = np.zeros((num_consq, len_txt), dtype=int)
        for idx in pos_iter:
            pos_mat[0, idx.span()[0]] = 1
        for i in np.arange(1, num_consq):
            pos_mat[i, 0:-1] = pos_mat[i-1, 1:]
        sum_vec = np.sum(pos_mat, axis=0)
        num_viol = sum(sum_vec == num_consq)
    return num_viol

def get_num_consonent_bunches(txt, num_consq=3):
    len_txt = len(txt)
    num_viol = 0
    if len_txt >= num_consq:
        pos_iter = re.finditer('[bcdfghjklmnpqrstvwxz]', txt)
        pos_mat = np.zeros((num_consq, len_txt), dtype=int)
        for idx in pos_iter:
            pos_mat[0, idx.span()[0]] = 1
        for i in np.arange(1, num_consq):
            pos_mat[i, 0:-1] = pos_mat[i-1, 1:]
        sum_vec = np.sum(pos_mat, axis=0)
        num_viol = sum(sum_vec == num_consq)
    return num_viol
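For example, counting runs of three (values from a quick trace of the logic, worth re-verifying):
print(get_num_vowel_bunches('queueing'))      # 3 ('ueu', 'eue', 'uei')
print(get_num_vowel_bunches('tableau'))       # 1 ('eau')
print(get_num_consonent_bunches('rnnfdwpm'))  # 6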