I realize I asked the wrong question a few minutes ago, sorry about that. I am running code that needs to check whether the word at a certain location matches my condition.
The original text is not in English; I am just using a simple example to show the problem I had. There are actually no spaces between words in my language, so using split or re does not work.
I need to find the word before "car" to know whether someone loves the car or not, so I used the location as the condition to identify it.
For example (but this gets too long):
message = "I do not like cars."
# print(message[14:18])  # "cars" starts at location 14
location = 14
if message[int(location) - 5:int(location) - 1] == "like":
    print("like")
elif message[int(location) - 8:int(location) - 1] == "dislike":
    print("dislike")
elif message[int(location) - 5:int(location) - 1] == "hate":
    print("hate")
elif message[int(location) - 5:int(location) - 1] == "cool":
    print("cool")
I actually used this one in my code, but found that I could not print the word:
if (
    message[int(location) - 5:int(location) - 1] == "like" or
    message[int(location) - 8:int(location) - 1] == "dislike" or
    message[int(location) - 5:int(location) - 1] == "hate" or
    message[int(location) - 5:int(location) - 1] == "cool"
):
    #print "like"
    #unable to do it
Is there any way I can solve it by printing the matching word?
Looks like you need regex:
import re

message = "I do not dislike cars."
check_list = {"like", "dislike", "hate", "cool"}
# The group wraps the whole alternation so that \b applies to every word.
# For text without spaces between words, drop the \b anchors: re.compile(r"({})".format("|".join(check_list)))
pattern = re.compile(r"\b({})\b".format("|".join(check_list)))
m = pattern.search(message)
if m:
    print(m.group(1))  # --> dislike
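Since the original text has no spaces between words, a location-based check may be closer to what the question describes. The following is only a sketch of one possible way to do that: `word_before` is a hypothetical helper (it appears neither in the question nor in the regex answer) that looks at the text ending at `location` and returns whichever candidate word sits right before it. Longer candidates are tried first so that "dislike" is not reported as "like".

```python
def word_before(message, location, candidates):
    """Return the candidate that ends right before `location`, or None."""
    # Text in front of the target word; rstrip() drops the separating space
    # and does nothing for languages written without spaces.
    prefix = message[:int(location)].rstrip()
    # Try longer words first so "dislike" wins over its suffix "like".
    for word in sorted(candidates, key=len, reverse=True):
        if prefix.endswith(word):
            return word
    return None


check_list = ["like", "dislike", "hate", "cool"]
message = "I do not dislike cars."
location = message.find("car")
print(word_before(message, location, check_list))  # dislike
```

Because the helper returns the matched word itself, the long if/elif chain collapses into a single call whose result can be printed directly.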
I have been searching for a solution to this problem. I am writing a custom function to count the number of sentences. I tried nltk and textstat, but they give me different counts.
An example of a sentence is something like this:
Annie said, "Are you sure? How is it possible? you are joking, right?"
NLTK gives me count=3:
['Annie said, "Are you sure?', 'How is it possible?', 'you are joking, right?"']
Another example:
Annie said, "It will work like this! you need to go and confront your friend. Okay!"
NLTK gives me count=3.
Please suggest a solution. The expected count is 1, as it is a single sentence of direct speech.
I have written a simple function that does what you want:
def sentences_counter(text: str):
    end_of_sentence = ".?!…"
    # complete with whatever end of a sentence punctuation mark I might have forgotten
    # you might for instance want to add '\n'.
    sentences_count = 0
    sentences = []
    inside_a_quote = False
    start_of_sentence = 0
    last_end_of_sentence = -2
    for i, char in enumerate(text):
        # quote management, to solve your issue
        if char == '"':
            inside_a_quote = not inside_a_quote
            if not inside_a_quote and text[i-1] in end_of_sentence:  # 🚩
                last_end_of_sentence = i  # 🚩
        elif inside_a_quote:
            continue
        # basic management of sentences with the punctuation marks in `end_of_sentence`
        if char in end_of_sentence:
            last_end_of_sentence = i
        elif last_end_of_sentence == i-1:
            sentences.append(text[start_of_sentence:i].strip())
            sentences_count += 1
            start_of_sentence = i
    # same as the last block in case there is no end punctuation mark in the text
    # (stripped first so that trailing whitespace is not counted as a sentence)
    last_sentence = text[start_of_sentence:].strip()
    if last_sentence:
        sentences.append(last_sentence)
        sentences_count += 1
    return sentences_count, sentences
Consider the following:
text = '''Annie said, "Are you sure? How is it possible? you are joking, right?" No, I'm not... I thought you were'''
To generalize your problem a bit, I added 2 more sentences, one with ellipsis and the last one without even any end punctuation mark. Now, if I execute this:
sentences_count, sentences = sentences_counter(text)
print(f'{sentences_count} sentences detected.')
print(f'The detected sentences are: {sentences}')
I obtain this:
3 sentences detected.
The detected sentences are: ['Annie said, "Are you sure? How is it possible? you are joking, right?"', "No, I'm not...", 'I thought you were']
I think it works fine.
Note: The quote management in my solution assumes American-style quotes, where the end punctuation mark of the sentence can be inside the quote. Remove the lines marked with flag emojis 🚩 to disable this.
I want to print a specific word a different color every time it appears in the text. In the existing code, I've printed the lines that contain the relevant word "one".
import json
from colorama import Fore

fh = open(r"fle.json")
corpus = json.loads(fh.read())
for m in corpus['smsCorpus']['message']:
    identity = m['#id']
    text = m['text']['$']
    strtext = str(text)
    utterances = strtext.split()
    if 'one' in utterances:
        print(identity,text, sep ='\t')
I imported Fore but I don't know where to use it. I want to use it to have the word "one" in a different color.
output (section of)
44814 Ohhh that's the one Johnson told us about...can you send it to me?
44870 Kinda... I went but no one else did, I so just went with Sarah to get lunch xP
44951 No, it was directed in one place loudly and stopped when I stoppedmore or less
44961 Because it raised awareness but no one acted on their new awareness, I guess
44984 We need to do a fob analysis like our mcs onec
Thank you
You could also just use the ANSI color codes in your strings:
# define aliases to the color-codes
red = "\033[31m"
green = "\033[32m"
blue = "\033[34m"
reset = "\033[39m"

t = "That was one hell of a show for a one man band!"
utterances = t.split()
if "one" in utterances:
    # figure out the list-indices of occurrences of "one"
    idxs = [i for i, x in enumerate(utterances) if x == "one"]
    # modify the occurrences by wrapping them in ANSI sequences
    for i in idxs:
        utterances[i] = red + utterances[i] + reset

# join the list back into a string and print
utterances = " ".join(utterances)
print(utterances)
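The split-based version above only colors tokens that are exactly "one", so "one..." with trailing punctuation is missed. A possible variation, offered only as a sketch, uses a whole-word regex substitution with the same ANSI codes. `highlight_word` is a hypothetical helper name, and the sketch assumes a terminal that understands ANSI escape sequences.

```python
import re

RED = "\033[31m"
RESET = "\033[39m"


def highlight_word(text, word, colour=RED):
    """Wrap every whole-word occurrence of `word` in an ANSI colour code."""
    # \b keeps "one" from matching inside longer words such as "onec".
    pattern = re.compile(r"\b{}\b".format(re.escape(word)))
    return pattern.sub(lambda m: colour + m.group(0) + RESET, text)


print(highlight_word("Ohhh that's the one Johnson told us about...", "one"))
print(highlight_word("Kinda... I went but no one else did", "one"))
```

The rest of the line is left untouched, so the result can be printed next to the message id exactly as in the question's loop.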
If you only have 1 coloured word you can use this I think, you can expand the logic for n coloured words:
from colorama import Fore, Style

our_str = "Ohhh that's the one Johnson told us about...can you send it to me?"

def colour_one(our_str):
    if "one" in our_str:
        # split only at the first "one" so that unpacking into two parts cannot fail
        str1, str2 = our_str.split("one", 1)
        new_str = str1 + Fore.RED + 'one' + Style.RESET_ALL + str2
    else:
        new_str = our_str
    return new_str
I think this is an ugly solution, and I'm not even sure it works in every case. But it's a solution if you can't find anything else.
I use the colour module from this link or the colored module from that link.
Furthermore, if you don't want to use a module for coloring, you can refer to this link or that link.
I have multiple string variations: "gr_shoulder_r_tmp", "r_shoulder_tmp".
I need to substitute "r_" with "l_" here:
"gr_shoulder_r_tmp" > "gr_shoulder_l_tmp"
"r_shoulder_tmp" > "l_shoulder_tmp"
In other words, I need to substitute the 3rd occurrence in the first example and the 1st occurrence in the second.
I started digging myself and came up with a half-solved result, which raised one more interesting question:
a) Find the index of the right hit
[i for i, x in enumerate(re.findall("(.?)(r_)", "gr_shoulder_r_tmp")) if len(list(filter(None, x))) == 1]
which gives me index = 2.
?) How do I use that hit index? :[
While writing this I found a straightforward solution:
b) Split by underscore, replace the standalone letter, and join back
findtag = "r"
newtag = "l"
itemA = "gr_shoulder_r_tmp"
itemB = "r_shoulderr_tmp"
spl_str = itemA.split("_")
hit = spl_str.index(findtag)
spl_str[hit] = newtag
new_item = "_".join(spl_str)
Both itemA and itemB give me what I need, but I'm not happy with it: it feels too heavy and rough.
A simple regex will do this job.
re.sub(r'(?<![a-zA-Z])r_', 'l_', s)
(?<![a-zA-Z]) is a negative lookbehind which asserts that the match is not preceded by a letter.
Example:
>>> re.sub(r'(?<![a-zA-Z])r_', 'l_',"gr_shoulder_r_tmp")
'gr_shoulder_l_tmp'
>>> re.sub(r'(?<![a-zA-Z])r_', 'l_',"r_shoulder_tmp")
'l_shoulder_tmp'
Examples of words:
ball
encyclopedia
tableau
Examples of random strings:
qxbogsac
jgaynj
rnnfdwpm
Of course it may happen that a random string will actually be a word in some language or look like one. But a human being is basically able to say whether something looks 'random' or not, just by checking whether it can be pronounced.
I was trying to calculate entropy to distinguish those two, but it's far from perfect. Do you have any other ideas or algorithms that work?
There is one important requirement though: I can't use heavyweight libraries like nltk or use dictionaries. Basically what I need is some simple and quick heuristic that works in most cases.
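The question does not show how the entropy was computed. One plausible reading, sketched here purely as an assumption, is the Shannon entropy of the character distribution within each string. `char_entropy` is a hypothetical name. The sketch also hints at why the measure is weak: it ignores letter order, so any anagram of a word scores exactly the same as the word.

```python
import math
from collections import Counter


def char_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


for sample in ["ball", "tableau", "jgaynj", "rnnfdwpm"]:
    print(sample, round(char_entropy(sample), 3))
```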
I developed a Python 3 package called Nostril for a problem closely related to what the OP asked: deciding whether text strings extracted during source-code mining are class/function/variable/etc. identifiers or random gibberish. It does not use a dictionary, but it does incorporate a rather large table of n-gram frequencies to support its probabilistic assessment of text strings. (I'm not sure if that qualifies as a "dictionary".) The approach does not check pronunciation, and its specialization may make it unsuitable for general word/nonword detection; nevertheless, perhaps it will be useful for either the OP or someone else looking to solve a similar problem.
Example: the following code,
from nostril import nonsense
real_test = ['bunchofwords', 'getint', 'xywinlist', 'ioFlXFndrInfo',
             'DMEcalPreshowerDigis', 'httpredaksikatakamiwordpresscom']
junk_test = ['faiwtlwexu', 'asfgtqwafazfyiur', 'zxcvbnmlkjhgfdsaqwerty']
for s in real_test + junk_test:
    print('{}: {}'.format(s, 'nonsense' if nonsense(s) else 'real'))
will produce the following output:
bunchofwords: real
getint: real
xywinlist: real
ioFlXFndrInfo: real
DMEcalPreshowerDigis: real
httpredaksikatakamiwordpresscom: real
faiwtlwexu: nonsense
asfgtqwafazfyiur: nonsense
zxcvbnmlkjhgfdsaqwerty: nonsense
Caveat: I am not a natural language expert.
Assuming whatever is mentioned in the link "If You Can Raed Tihs, You Msut Be Raelly Smrat" is authentic, a simple approach would be:
1. Have an English dictionary (I believe the approach is language agnostic).
2. Create a Python dict of the words, keyed by the first and last character of each word in the dictionary:
from collections import defaultdict

words = defaultdict(list)
with open("your_dict.txt") as fin:
    for word in fin:
        word = word.strip()  # drop the trailing newline
        if word:
            words[word[0] + word[-1]].append(word)
3. Now for any given word, search the dictionary (remember the key is the first and last character of the word):
candidates = words[needle[0] + needle[-1]]
4. Check whether the characters of a dictionary entry and of your needle match:
for match in candidates:
    if sorted(match) == sorted(needle):
        print("Human Readable Word")
A comparatively slower approach would be to use difflib.get_close_matches(word, possibilities[, n][, cutoff]).
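The steps above can be put together into a small self-contained sketch. The names `build_index` and `looks_like_word` are hypothetical, and the three-word list only stands in for a real dictionary file, so treat this as an illustration of the lookup rather than a finished implementation.

```python
from collections import defaultdict


def build_index(dictionary_words):
    """Group words by their first and last character."""
    index = defaultdict(list)
    for word in dictionary_words:
        word = word.strip().lower()
        if word:
            index[word[0] + word[-1]].append(word)
    return index


def looks_like_word(needle, index):
    """True if some indexed word has the same ends and the same letters."""
    needle = needle.strip().lower()
    if not needle:
        return False
    # .get() avoids creating empty entries in the defaultdict on a miss.
    candidates = index.get(needle[0] + needle[-1], [])
    return any(sorted(c) == sorted(needle) for c in candidates)


# Stand-in for a real dictionary file (assumption for this example).
sample_words = ["ball", "encyclopedia", "tableau"]
index = build_index(sample_words)
print(looks_like_word("tbaelau", index))   # scrambled "tableau"
print(looks_like_word("qxbogsac", index))  # random string from the question
```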
If you really mean that your metric of randomness is pronounceability, you're getting into the realm of phonotactics: the allowed sequences of sounds in a language. As @ChrisPosser points out in his comment to your question, these allowed sequences of sounds are language-specific.
This question only makes sense within a specific language.
Whichever language you choose, you might have some luck with an n-gram model trained over the letters themselves (as opposed to the words, which is the usual approach). Then you can calculate a score for a particular string and set a threshold under which a string is random and over which a string is something like a word.
EDIT: Someone has done this already and actually implemented it: https://stackoverflow.com/a/6298193/583834
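To make the letter n-gram idea concrete, here is a minimal sketch of a character bigram scorer. It is not the linked implementation: the function names, the add-one smoothing, the `alphabet_size` constant and the tiny training list are all assumptions made for illustration. A usable model would need a large word list for the chosen language and a threshold tuned on real data.

```python
import math
from collections import Counter


def train_bigrams(words):
    """Count letter bigrams, with ^ and $ marking word start and end."""
    pair_counts = Counter()
    first_counts = Counter()
    for word in words:
        padded = "^" + word.lower() + "$"
        for a, b in zip(padded, padded[1:]):
            pair_counts[a + b] += 1
            first_counts[a] += 1
    return pair_counts, first_counts


def word_score(word, pair_counts, first_counts, alphabet_size=28):
    """Average log-probability per bigram, using add-one smoothing."""
    padded = "^" + word.lower() + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        p = (pair_counts[a + b] + 1) / (first_counts[a] + alphabet_size)
        total += math.log(p)
    return total / (len(padded) - 1)


# Toy training data (assumption): far too small for real use.
SAMPLE_WORDS = ["ball", "table", "tableau", "encyclopedia", "able", "call",
                "tall", "stable", "cable", "pedal", "media"]
pair_counts, first_counts = train_bigrams(SAMPLE_WORDS)
for candidate in ["table", "ball", "qxbgsc", "jgaynj"]:
    print(candidate, round(word_score(candidate, pair_counts, first_counts), 2))
```

Strings built from letter pairs the model has seen score closer to zero, while strings made of unseen pairs score lower. Averaging per bigram keeps long and short strings comparable.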
Works quite well for me:
VOWELS = "aeiou"
PHONES = ['sh', 'ch', 'ph', 'sz', 'cz', 'sch', 'rz', 'dz']

def isWord(word):
    if word:
        # lowercase once so that the look-back at word[idx-1] is case-insensitive too
        word = word.lower()
        consecutiveVowels = 0
        consecutiveConsonents = 0
        for idx, letter in enumerate(word):
            vowel = letter in VOWELS
            if idx:
                prev = word[idx-1]
                prevVowel = prev in VOWELS
                # treat 'y' as a vowel when it follows a consonant
                if not vowel and letter == 'y' and not prevVowel:
                    vowel = True
                if prevVowel != vowel:
                    consecutiveVowels = 0
                    consecutiveConsonents = 0
            if vowel:
                consecutiveVowels += 1
            else:
                consecutiveConsonents += 1
            if consecutiveVowels >= 3 or consecutiveConsonents > 3:
                return False
            if consecutiveConsonents == 3:
                # three consonants are tolerated only if they contain a known cluster
                subStr = word[idx-2:idx+1]
                if any(phone in subStr for phone in PHONES):
                    consecutiveConsonents -= 1
                    continue
                return False
    return True
Use PyDictionary.
You can install PyDictionary using the following command.
easy_install -U PyDictionary
Now in code:
from PyDictionary import PyDictionary

dictionary = PyDictionary()
a = ['ball', 'asdfg']
for item in a:
    x = dictionary.meaning(item)
    if x is None:
        print(item + ': Not a valid word')
    else:
        print(item + ': Valid')
As far as I know, you can use PyDictionary for some languages other than English.
I wrote this logic to detect number of consecutive vowels and consonants in a string. You can choose the threshold based on the language.
import re
import numpy as np

def get_num_vowel_bunches(txt,num_consq = 3):
    len_txt = len(txt)
    num_viol = 0
    if len_txt >=num_consq:
        pos_iter = re.finditer('[aeiou]',txt)
        pos_mat = np.zeros((num_consq,len_txt),dtype=int)
        # row 0 marks the positions of vowels
        for idx in pos_iter:
            pos_mat[0,idx.span()[0]] = 1
        # row i is row 0 shifted left by i positions
        for i in np.arange(1,num_consq):
            pos_mat[i,0:-1] = pos_mat[i-1,1:]
        # a column summing to num_consq starts a run of num_consq vowels
        sum_vec = np.sum(pos_mat,axis=0)
        num_viol = sum(sum_vec == num_consq)
    return num_viol

def get_num_consonent_bunches(txt,num_consq = 3):
    len_txt = len(txt)
    num_viol = 0
    if len_txt >=num_consq:
        pos_iter = re.finditer('[bcdfghjklmnpqrstvwxz]',txt)
        pos_mat = np.zeros((num_consq,len_txt),dtype=int)
        for idx in pos_iter:
            pos_mat[0,idx.span()[0]] = 1
        for i in np.arange(1,num_consq):
            pos_mat[i,0:-1] = pos_mat[i-1,1:]
        sum_vec = np.sum(pos_mat,axis=0)
        num_viol = sum(sum_vec == num_consq)
    return num_viol
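If pulling in NumPy is undesirable, the same count can probably be obtained with a zero-width lookahead, which matches at every position where a run of the required length starts. This is a sketch under that assumption. `count_bunches` is a hypothetical name, and the consonant set is copied from the function above (so 'y' is excluded).

```python
import re

VOWEL_CHARS = "aeiou"
CONSONANT_CHARS = "bcdfghjklmnpqrstvwxz"


def count_bunches(txt, chars, num_consq=3):
    """Count positions where num_consq consecutive characters from chars begin."""
    # (?=...) consumes nothing, so overlapping runs are all counted.
    pattern = "(?=[{}]{{{}}})".format(re.escape(chars), num_consq)
    return len(re.findall(pattern, txt))


print(count_bunches("queue", VOWEL_CHARS))
print(count_bunches("strength", CONSONANT_CHARS))
print(count_bunches("ball", VOWEL_CHARS))
```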
I am a beginner in Python, I am teaching myself off of Google Code University online. One of the exercises in string manipulation is as follows:
# E. not_bad
# Given a string, find the first appearance of the
# substring 'not' and 'bad'. If the 'bad' follows
# the 'not', replace the whole 'not'...'bad' substring
# with 'good'.
# Return the resulting string.
# So 'This dinner is not that bad!' yields:
# This dinner is good!
def not_bad(s):
    # +++your code here+++
    return
I'm stuck. I know it could be put into a list using ls = s.split(' ') and then sorted with various elements removed, but I think that is probably just creating extra work for myself. The lesson hasn't covered RegEx yet so the solution doesn't involve re. Help?
Here's what I tried, but it doesn't quite give the output correctly in all cases:
def not_bad(s):
    if s.find('not') != -1:
        notindex = s.find('not')
        if s.find('bad') != -1:
            badindex = s.find('bad') + 3
            if notindex > badindex:
                removetext = s[notindex:badindex]
                ns = s.replace(removetext, 'good')
            else:
                ns = s
        else:
            ns = s
    else:
        ns = s
    return ns
Here is the output, it worked in 1/4 of the test cases:
not_bad
X got: 'This movie is not so bad' expected: 'This movie is good'
X got: 'This dinner is not that bad!' expected: 'This dinner is good!'
OK got: 'This tea is not hot' expected: 'This tea is not hot'
X got: "goodIgoodtgood'goodsgood goodbgoodagooddgood goodygoodegoodtgood goodngoodogoodtgood" expected: "It's bad yet not"
Test Cases:
print 'not_bad'
test(not_bad('This movie is not so bad'), 'This movie is good')
test(not_bad('This dinner is not that bad!'), 'This dinner is good!')
test(not_bad('This tea is not hot'), 'This tea is not hot')
test(not_bad("It's bad yet not"), "It's bad yet not")
UPDATE: This code solved the problem:
def not_bad(s):
    notindex = s.find('not')
    if notindex != -1:
        if s.find('bad') != -1:
            badindex = s.find('bad') + 3
            if notindex < badindex:
                removetext = s[notindex:badindex]
                return s.replace(removetext, 'good')
    return s
Thanks everyone for helping me discover the solution (and not just giving me the answer)! I appreciate it!
Well, I think that it is time to make a small review ;-)
There is an error in your code: notindex > badindex should be changed into notindex < badindex. The changed code seems to work fine.
Also I have some remarks about your code:
It is usual practice to compute a value once, assign it to a variable, and use that variable in the code below. That rule applies in this particular case:
For example, the head of your function could be replaced by
notindex = s.find('not')
if notindex == -1:
You can use return inside of your function several times.
As a result, the tail of your code could be significantly reduced:
    if (*all right*):
        return s.replace(removetext, 'good')
    return s
Finally, I want to point out that you can solve this problem using split, but it does not seem to be a better solution.
def not_bad( s ):
    q = s.split( "bad" )
    w = q[0].split( "not" )
    if len(q) > 1 < len(w):
        return w[0] + "good" + "bad".join(q[1:])
    return s
Break it down like this:
1. How would you figure out if the word "not" is in a string?
2. How would you figure out where the word "not" is in a string, if it is?
3. How would you combine #1 and #2 in a single operation?
4. Same as #1-3 except for the word "bad"?
5. Given that you know the words "not" and "bad" are both in a string, how would you determine whether the word "bad" came after the word "not"?
6. Given that you know "bad" comes after "not", how would you get every part of the string that comes before the word "not"?
7. And how would you get every part of the string that comes after the word "bad"?
8. How would you combine the answers to #6 and #7 to replace everything from the start of the word "not" to the end of the word "bad" with "good"?
Since you are trying to learn, I don't want to hand you the answer, but I would start by looking in the Python documentation for some of the string functions, including replace and index.
Also, if you have a good IDE, it can help by showing you what methods are attached to an object and even automatically displaying the help string for those methods. I tend to use Eclipse for large projects and the lighter-weight Spyder for small projects.
http://docs.python.org/library/stdtypes.html#string-methods
I suspect that they're wanting you to use string.find to locate the various substrings:
>>> mystr = "abcd"
>>> mystr.find("bc")
1
>>> mystr.find("bce")
-1
Since you're trying to teach yourself (kudos, BTW :) I won't post a complete solution, but also note that you can use indexing to get substrings:
>>> mystr[0:mystr.find("bc")]
'a'
Hope that's enough to get you started! If not, just comment here and I can post more. :)
def not_bad(s):
    snot = s.find("not")
    sbad = s.find("bad")
    # find() returns -1 when "not" is missing, so rule that out before comparing
    if snot != -1 and snot < sbad:
        s = s.replace(s[snot:(sbad+3)], "good")
        return s
    else:
        return s
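For comparison, here is a slicing-based sketch that follows the before/after breakdown given earlier in the thread. It is one possible variant, not the exercise's official solution. It rebuilds the string from the part before 'not' and the part after 'bad' instead of calling replace, which avoids accidentally replacing an empty or repeated substring.

```python
def not_bad(s):
    not_index = s.find('not')
    bad_index = s.find('bad')
    # find() returns -1 for a missing substring, so 'not' must be present
    # and 'bad' must start after it.
    if not_index != -1 and bad_index > not_index:
        return s[:not_index] + 'good' + s[bad_index + len('bad'):]
    return s


for sample in ['This movie is not so bad',
               'This dinner is not that bad!',
               'This tea is not hot',
               "It's bad yet not"]:
    print(not_bad(sample))
```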