Based on the given input:
I can do waaaaaaaaaaaaay better :DDDD!!!! I am sooooooooo exicted about it :))) Good !!
Desired output:
I can do way/LNG better :D/LNG !/LNG I am so/LNG exicted about it :)/LNG Good !/LNG
--- Challenges:
better vs. soooooooooo >> we need to keep the first one as is (its doubled "t" is normal spelling) but shorten the second
for the second we also need to add a tag (/LNG), as the lengthening might carry intensification cues for subjectivity and sentiment analysis
---- Problem: error message "unbalanced parentheses"
Any ideas?
My code is:
import re

lengWords = {}  # a dictionary of lengthened words

def removeDuplicates(corpus):
    data = (open(corpus, 'r').read()).split()
    myString = " ".join(data)
    for word in data:
        for chr in word:
            countChr = word.count(chr)
            if countChr >= 3:
                lengWords[word] = word + "/LNG"
                lengWords[word] = re.sub(r'([A-Za-z])\1+', r'\1', lengWords[word])
                lengWords[word] = re.sub(r'([\'!~.?,()])\1+', r'\1', lengWords[word])
        for k, v in lengWords.items():
            if k == word:
                # the word is used as a regex pattern here, and the result is discarded
                re.sub(word, v, myString)
    return myString
It's not the perfect solution, but I don't have time to refine it now; I just wanted to get you started with an easy approach:
s = "I can do waaaaaaaaaaaaay better :DDDD!!!! I am sooooooooo exicted about it :))) Good !!"
re.sub(r'(.)(\1{2,})',r'\1/LNG',s)
>> 'I can do wa/LNGy better :D/LNG!/LNG I am so/LNG exicted about it :)/LNG Good !!'
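As for the "unbalanced parentheses" error: it is almost certainly raised by re.sub(word, v, myString), because the raw word is used as the regex pattern, and tokens like ":)))" contain unescaped parentheses; also note that re.sub returns a new string that must be assigned back. A minimal sketch of the fix, escaping the pattern with re.escape (plain str.replace would work too, since no regex features are needed there):
import re

myString = "I am so exicted about it :))) Good !!"
word, v = ":)))", ":)/LNG"

# re.sub(word, v, myString) would raise re.error: the ')' are parsed as group closers
myString = re.sub(re.escape(word), v, myString)  # every character is now literal
print(myString)  # I am so exicted about it :)/LNG Good !!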
I want to do fuzzy matching of words within a string.
The target string could be:
"Hello, I am going to watch a film today."
and the words I want to search for are:
"flim toda".
This hopefully should return "film today" as a search result.
I have used the method below, but it seems to work only with a single word.
import difflib
def matches(large_string, query_string, threshold):
    words = large_string.split()
    matched_words = []
    for word in words:
        s = difflib.SequenceMatcher(None, word, query_string)
        match = ''.join(word[i:i+n] for i, j, n in s.get_matching_blocks() if n)
        if len(match) / float(len(query_string)) >= threshold:
            matched_words.append(match)
    return matched_words
large_string = "Hello, I am going to watch a film today"
query_string = "film"
print(list(matches(large_string, query_string, 0.8)))
This only works with a single word, and it only returns a match when there is little noise.
Is there any way to do such fuzzy matching with words?
The feature you are thinking of is called "query suggestion". It does rely on spell checking, but it also relies on Markov chains built from search-engine query logs.
That being said, you can use an approach similar to the one described in this answer: https://stackoverflow.com/a/58166648/140837
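Alternatively, here is a hedged sketch that extends the difflib approach from the question to multi-word queries, by scoring every window with the same number of words as the query (the function name and the 0.8 threshold are illustrative, not from any library):
import difflib

def fuzzy_find(large_string, query_string, threshold=0.8):
    words = large_string.split()
    n = len(query_string.split())
    hits = []
    # compare the query against every n-word window of the text
    for i in range(len(words) - n + 1):
        window = " ".join(words[i:i + n])
        ratio = difflib.SequenceMatcher(None, window.lower(), query_string.lower()).ratio()
        if ratio >= threshold:
            hits.append((ratio, window))
    return [w for _, w in sorted(hits, reverse=True)]  # best matches first

print(fuzzy_find("Hello, I am going to watch a film today.", "flim toda"))
# ['film today.']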
You can simply use Fuzzysearch; please see the example below:
from fuzzysearch import find_near_matches
text_string = "Hello, I am going to watch a film today."
matches = find_near_matches('flim toda', text_string, max_l_dist=2)
print([text_string[m.start:m.end] for m in matches])
This will give you the desired output.
['film toda']
Please note that you can tune the max_l_dist parameter based on how much error you are willing to tolerate.
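For example, a rough sketch of how the tolerance changes the result, using the same text_string as above:
print(find_near_matches('flim toda', text_string, max_l_dist=0))  # [] - exact occurrences only
print(find_near_matches('flim toda', text_string, max_l_dist=2))  # one match covering 'film toda'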
Is there a way to replace a word within a string without using a "string replace function," e.g., string.replace(string, word, replacement)?
[out] = forecast('This snowy weather is so cold.','cold','awesome')
out => 'This snowy weather is so awesome.'
Here the word cold is replaced with awesome.
This is from my MATLAB homework, which I am trying to do in Python. When doing this in MATLAB we were not allowed to use strrep().
In MATLAB, I can use strfind to find the index and work from there. However, I noticed that there is a big difference between lists and strings. Strings are immutable in Python, so I would likely have to import some module to change the string to a different data type in order to work with it the way I want without a string replace function.
just for fun :)
st = 'This snowy weather is so cold .'.split()
given_word = 'awesome'

for i, word in enumerate(st):
    if word == 'cold':
        st[i] = given_word  # replace the matched word in place
        break               # stop after the first match

print(' '.join(st))
Here's another answer that might be closer to the solution you described using MATLAB:
st = 'This snowy weather is so cold.'
given_word = 'awesome'
word_to_replace = 'cold'
n = len(word_to_replace)
index_of_word_to_replace = st.find(word_to_replace)
print(st[:index_of_word_to_replace] + given_word + st[index_of_word_to_replace + n:])
You can convert your string into a list object, find the index of the word you want to replace and then replace the word.
sentence = "This snowy weather is so cold"
# Split the sentence into a list of the words
words = sentence.split(" ")
# Get the index of the word you want to replace
word_to_replace_index = words.index("cold")
# Replace the target word with the new word based on the index
words[word_to_replace_index] = "awesome"
# Generate a new sentence
new_sentence = ' '.join(words)
Using Regex and a list comprehension.
import re
def strReplace(sentence, toReplace, toReplaceWith):
    return " ".join([re.sub(toReplace, toReplaceWith, i) if re.search(toReplace, i) else i
                     for i in sentence.split()])
print(strReplace('This snowy weather is so cold.', 'cold', 'awesome'))
Output:
This snowy weather is so awesome.
I have some text:
s="Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
I'd like to parse this into its individual words. I quickly looked at enchant and nltk, but didn't see anything immediately useful. If I had time to invest in this, I'd look into writing a dynamic program using enchant's ability to check whether a word is English. I would have thought there'd be something to do this online; am I wrong?
Greedy approach using trie
Try this using Biopython (pip install biopython):
from Bio import trie
import string

def get_trie(dictfile='/usr/share/dict/american-english'):
    tr = trie.trie()
    with open(dictfile) as f:
        for line in f:
            word = line.rstrip()
            try:
                word = word.encode('ascii', 'ignore')
                tr[word] = len(word)
                assert tr.has_key(word), "Missing %s" % word
            except UnicodeDecodeError:
                pass
    return tr

def get_trie_word(tr, s):
    # longest-prefix match: try the longest remaining prefix first
    for end in reversed(range(len(s))):
        word = s[:end + 1]
        if tr.has_key(word):
            return word, s[end + 1:]
    return None, s

def main(s):
    tr = get_trie()
    while s:
        word, s = get_trie_word(tr, s)
        print(word)

if __name__ == '__main__':
    s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
    s = s.strip(string.punctuation)
    s = s.replace(" ", '')
    s = s.lower()
    main(s)
Results
>>> if __name__ == '__main__':
... s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
... s = s.strip(string.punctuation)
... s = s.replace(" ", '')
... s = s.lower()
... main(s)
...
image
classification
methods
can
be
roughly
divided
into
two
broad
families
of
approaches
Caveats
There are degenerate cases in English that this will not work for. You need to use backtracking to deal with those, but this should get you started.
Obligatory test
>>> main("expertsexchange")
experts
exchange
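Following up on the backtracking caveat above, here is a hedged sketch of that idea as a memoized dynamic program; words stands for any set of lowercase dictionary words (e.g. loaded from the same /usr/share/dict file), so treat it as an illustration rather than a drop-in replacement:
from functools import lru_cache

def segment(s, words):
    @lru_cache(maxsize=None)
    def best(i):
        # return a list of dictionary words covering s[i:], or None if impossible
        if i == len(s):
            return []
        for j in range(len(s), i, -1):  # greedy longest-first, but backtracks on failure
            w = s[i:j]
            if w in words:
                rest = best(j)
                if rest is not None:
                    return [w] + rest
        return None
    return best(0)

words = {"experts", "expert", "sex", "exchange"}
print(segment("expertsexchange", words))  # ['experts', 'exchange']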
This is the sort of problem that occurs often in Asian NLP. If you have a dictionary, then you can use this: http://code.google.com/p/mini-segmenter/ (disclaimer: I wrote it, hope you don't mind).
Note that the search space might be extremely large, because the number of characters in alphabetic English words is surely larger than in syllabic Chinese/Japanese.
I am a beginner in Python, I am teaching myself off of Google Code University online. One of the exercises in string manipulation is as follows:
# E. not_bad
# Given a string, find the first appearance of the
# substring 'not' and 'bad'. If the 'bad' follows
# the 'not', replace the whole 'not'...'bad' substring
# with 'good'.
# Return the resulting string.
# So 'This dinner is not that bad!' yields:
# This dinner is good!
def not_bad(s):
    # +++your code here+++
    return
I'm stuck. I know the string could be split into a list of words using ls = s.split(' ') and then sorted through with various elements removed, but I think that is probably just creating extra work for myself. The lesson hasn't covered RegEx yet, so the solution shouldn't involve re. Help?
Here's what I tried, but it doesn't quite give the output correctly in all cases:
def not_bad(s):
    if s.find('not') != -1:
        notindex = s.find('not')
        if s.find('bad') != -1:
            badindex = s.find('bad') + 3
            if notindex > badindex:
                removetext = s[notindex:badindex]
                ns = s.replace(removetext, 'good')
            else:
                ns = s
        else:
            ns = s
    else:
        ns = s
    return ns
Here is the output, it worked in 1/4 of the test cases:
not_bad
X got: 'This movie is not so bad' expected: 'This movie is good'
X got: 'This dinner is not that bad!' expected: 'This dinner is good!'
OK got: 'This tea is not hot' expected: 'This tea is not hot'
X got: "goodIgoodtgood'goodsgood goodbgoodagooddgood goodygoodegoodtgood
goodngoodogoodtgood" expected: "It's bad yet not"
Test Cases:
print('not_bad')
test(not_bad('This movie is not so bad'), 'This movie is good')
test(not_bad('This dinner is not that bad!'), 'This dinner is good!')
test(not_bad('This tea is not hot'), 'This tea is not hot')
test(not_bad("It's bad yet not"), "It's bad yet not")
UPDATE: This code solved the problem:
def not_bad(s):
    notindex = s.find('not')
    if notindex != -1:
        if s.find('bad') != -1:
            badindex = s.find('bad') + 3
            if notindex < badindex:
                removetext = s[notindex:badindex]
                return s.replace(removetext, 'good')
    return s
Thanks everyone for helping me discover the solution (and not just giving me the answer)! I appreciate it!
Well, I think that it is time to make a small review ;-)
There is an error in your code: notindex > badindex should be changed to notindex < badindex. With that change, the code seems to work fine.
Also I have some remarks about your code:
It is usual practice to compute a value once, assign it to a variable, and use that variable in the code below. That rule applies to this particular case:
For example, the head of your function could be replaced by
notindex = s.find('not')
if notindex == -1:
You can use return inside your function several times.
As a result, the tail of your code can be significantly reduced:
if (*all right*):
    return s.replace(removetext, 'good')
return s
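Putting both remarks together, the whole function might look like this (a sketch of one possible cleanup, not the only one):
def not_bad(s):
    notindex = s.find('not')
    if notindex == -1:
        return s
    badindex = s.find('bad')
    if badindex > notindex:
        return s.replace(s[notindex:badindex + 3], 'good')
    return s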
Finally, I want to point out that you can also solve this problem using split, though it does not seem to be a better solution:
def not_bad(s):
    q = s.split("bad")
    w = q[0].split("not")
    if len(q) > 1 < len(w):
        return w[0] + "good" + "bad".join(q[1:])
    return s
Break it down like this:
How would you figure out if the word "not" is in a string?
How would you figure out where the word "not" is in a string, if it is?
How would you combine #1 and #2 in a single operation?
Same as #1-3 except for the word "bad"?
Given that you know the words "not" and "bad" are both in a string, how would you determine whether the word "bad" came after the word "not"?
Given that you know "bad" comes after "not", how would you get every part of the string that comes before the word "not"?
And how would you get every part of the string that comes after the word "bad"?
How would you combine the answers to #6 and #7 to replace everything from the start of the word "not" to the end of the word "bad" with "good"? (A sketch follows this list.)
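If you want to check your own attempt afterwards, here is one hedged sketch that follows the steps above using only find and slicing:
def not_bad(s):
    n = s.find('not')      # steps 1-3: is 'not' there, and where?
    b = s.find('bad')      # step 4: the same for 'bad'
    if n != -1 and b > n:  # step 5: 'bad' comes after 'not'
        return s[:n] + 'good' + s[b + 3:]  # steps 6-8: splice the pieces
    return s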
Since you are trying to learn, I don't want to hand you the answer, but I would start by looking in the python documentation for some of the string functions including replace and index.
Also, if you have a good IDE, it can help by showing you what methods are attached to an object and even automatically displaying the help string for those methods. I tend to use Eclipse for large projects and the lighter-weight Spyder for small projects.
http://docs.python.org/library/stdtypes.html#string-methods
I suspect they want you to use string.find to locate the various substrings:
>>> mystr = "abcd"
>>> mystr.find("bc")
1
>>> mystr.find("bce")
-1
Since you're trying to teach yourself (kudos, BTW :) I won't post a complete solution, but also note that you can use indexing to get substrings:
>>> mystr[0:mystr.find("bc")]
'a'
Hope that's enough to get you started! If not, just comment here and I can post more. :)
def not_bad(s):
    snot = s.find("not")
    sbad = s.find("bad")
    # both words must be present, and "bad" must come after "not"
    if snot != -1 and snot < sbad:
        return s.replace(s[snot:sbad + 3], "good")
    return s
I am having a heck of a time taking the information in a tweet, including hashtags, and pulling each hashtag into an array using Python. I am embarrassed to even post what I have tried thus far.
For example, "I love #stackoverflow because #people are very #helpful!"
This should pull the 3 hashtags into an array.
A simple regex should do the job:
>>> import re
>>> s = "I love #stackoverflow because #people are very #helpful!"
>>> re.findall(r"#(\w+)", s)
['stackoverflow', 'people', 'helpful']
Note though, that as suggested in other answers, this may also find non-hashtags, such as a hash location in a URL:
>>> re.findall(r"#(\w+)", "http://example.org/#comments")
['comments']
So another simple solution would be the following (removes duplicates as a bonus):
>>> def extract_hash_tags(s):
... return set(part[1:] for part in s.split() if part.startswith('#'))
...
>>> extract_hash_tags("#test http://example.org/#comments #test")
set(['test'])
>>> s="I love #stackoverflow because #people are very #helpful!"
>>> [i for i in s.split() if i.startswith("#") ]
['#stackoverflow', '#people', '#helpful!']
The best Twitter hashtag regular expression:
>>> import re
>>> text = "#promovolt #1st # promovolt #123"
>>> re.findall(r'\B#\w*[a-zA-Z]+\w*', text)
['#promovolt', '#1st']
Suppose that you have to retrieve your #Hashtags from a sentence full of punctuation symbols. Let's say that #stackoverflow, #people and #helpful are terminated with different symbols; you want to retrieve them from the text, but you may want to avoid repetitions:
>>> text = "I love #stackoverflow, because #people... are very #helpful! Are they really #helpful??? Yes #people in #stackoverflow are really really #helpful!!!"
if you try with set([i for i in text.split() if i.startswith("#")]) alone, you will get:
>>> set(['#helpful???',
'#people',
'#stackoverflow,',
'#stackoverflow',
'#helpful!!!',
'#helpful!',
'#people...'])
which in my mind is redundant. A better solution uses regular expressions with the re module:
>>> import re
>>> set([re.sub(r"(\W+)$", "", j) for j in set([i for i in text.split() if i.startswith("#")])])
set(['#people', '#helpful', '#stackoverflow'])
Now it's ok for me.
EDIT: UNICODE #Hashtags
Add the re.UNICODE flag if you want to delete punctuation but still preserve letters with accents, apostrophes, and other Unicode characters, which may be important if the #Hashtags are not expected to be only in English... maybe this is only an Italian guy's nightmare, maybe not! ;-)
For example:
>>> text = u"I love #stackoverflòw, because #peoplè... are very #helpfùl! Are they really #helpfùl??? Yes #peoplè in #stackoverflòw are really really #helpfùl!!!"
will be unicode-encoded as:
>>> u'I love #stackoverfl\xf2w, because #peopl\xe8... are very #helpf\xf9l! Are they really #helpf\xf9l??? Yes #peopl\xe8 in #stackoverfl\xf2w are really really #helpf\xf9l!!!'
and you can retrieve your (correctly encoded) #Hashtags in this way:
>>> set([re.sub(r"(\W+)$", "", j, flags = re.UNICODE) for j in set([i for i in text.split() if i.startswith("#")])])
set([u'#stackoverfl\xf2w', u'#peopl\xe8', u'#helpf\xf9l'])
EDITx2: UNICODE #Hashtags and control for # repetitions
If you want to control for multiple repetitions of the # symbol, as in (forgive me if the text example has become almost unreadable):
>>> text = u"I love ###stackoverflòw, because ##################peoplè... are very ####helpfùl! Are they really ##helpfùl??? Yes ###peoplè in ######stackoverflòw are really really ######helpfùl!!!"
>>> u'I love ###stackoverfl\xf2w, because ##################peopl\xe8... are very ####helpf\xf9l! Are they really ##helpf\xf9l??? Yes ###peopl\xe8 in ######stackoverfl\xf2w are really really ######helpf\xf9l!!!'
then you should substitute these multiple occurrences with a single #.
A possible solution is to introduce another nested implicit set() definition with the sub() function replacing occurrences of more-than-1 # with a single #:
>>> set([re.sub(r"#+", "#", k) for k in set([re.sub(r"(\W+)$", "", j, flags = re.UNICODE) for j in set([i for i in text.split() if i.startswith("#")])])])
set([u'#stackoverfl\xf2w', u'#peopl\xe8', u'#helpf\xf9l'])
AndiDog's answer will get tripped up by links and other stuff; you may want to filter those out first. After that, use this code:
UTF_CHARS = ur'a-z0-9_\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff'
TAG_EXP = ur'(^|[^0-9A-Z&/]+)(#|\uff03)([0-9A-Z_]*[A-Z_]+[%s]*)' % UTF_CHARS
TAG_REGEX = re.compile(TAG_EXP, re.UNICODE | re.IGNORECASE)
It may seem like overkill, but this has been converted from http://github.com/mzsanford/twitter-text-java.
It will handle about 99% of all hashtags the same way Twitter handles them.
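A hedged usage sketch (Python 2, to match the ur'' literals above; the tag text is in the third capture group, and the expected output is my reading of the pattern):
text = u"I love #stackoverflow because #people are very #helpful!"
print([match[2] for match in TAG_REGEX.findall(text)])
# [u'stackoverflow', u'people', u'helpful']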
For more converted twitter regex check out this: http://github.com/BonsaiDen/Atarashii/blob/master/atarashii/usr/share/pyshared/atarashii/formatter.py
EDIT:
Check out http://github.com/BonsaiDen/AtarashiiFormat
A simple gist (better than the chosen answer): https://gist.github.com/mahmoud/237eb20108b5805aed5f
It also works with Unicode hashtags.
hashtags = [word for word in tweet.split() if word[0] == "#"]
I had a lot of issues with Unicode languages.
I had seen many ways to extract hashtags, but found none of them answering all cases,
so I wrote some small Python code to handle most of the cases. It works for me.
def get_hashtagslist(string):
    ret = []
    s = ''
    hashtag = False
    for char in string:
        if char == '#':
            hashtag = True
            if s:
                ret.append(s)
                s = ''
            continue
        # take only the prefix of the hashtag if it contains one of these chars
        # (e.g. for '#happy,but i..' it takes only 'happy')
        if hashtag and char in [' ', '.', ',', '(', ')', ':', '{', '}'] and s:
            ret.append(s)
            s = ''
            hashtag = False
        if hashtag:
            s += char
    if s:
        ret.append(s)
    return list(set([word for word in ret if 1 < len(word) < 20]))  # keep tags of 2-19 characters
I extracted hashtags in a silly but effective way.
def retrive(s):
    indice_t = []
    tags = []
    tmp_str = ''
    s = s.strip()
    # record the index of every '#'
    for i in range(len(s)):
        if s[i] == "#":
            indice_t.append(i)
    for i in range(len(indice_t)):
        index = indice_t[i]
        if i == len(indice_t) - 1:
            boundary = len(s)
        else:
            boundary = indice_t[i + 1]
        index += 1
        while index < boundary:
            if s[index] in "`~!##$%^&*()-_=+[]{}|\\:;'\",.<>?/ \n\t":
                tags.append(tmp_str)
                tmp_str = ''
                break
            else:
                tmp_str += s[index]
                index += 1
        if tmp_str != '':
            tags.append(tmp_str)
            tmp_str = ''  # reset so the next tag starts clean
    return tags