I was wondering if it's possible to use list comprehension in the following case, or if it should be left as a for loop.
temp = []
for value in my_dataframe[my_col]:
    match = my_regex.search(value)
    if match:
        temp.append(value.replace(match.group(1), ''))
    else:
        temp.append(value)
I believe I can do it with the if/else section, but the 'match' line throws me off. This is close but not exactly it.
temp = [value.replace(match.group(1), '') if (match) else value
        for value in my_dataframe[my_col] if my_regex.search(value)]
Single-statement approach:
result = [
    value.replace(match.group(1), '') if match else value
    for value, match in (
        (value, my_regex.search(value))
        for value in my_dataframe[my_col])]
Functional approach - python 2:
data = my_dataframe[my_col]
gen = zip(data, map(my_regex.search, data))
fix = lambda (v, m): v.replace(m.group(1), '') if m else v
result = map(fix, gen)
Functional approach - python 3:
from itertools import starmap
data = my_dataframe[my_col]
gen = zip(data, map(my_regex.search, data))
fix = lambda v, m: v.replace(m.group(1), '') if m else v
result = list(starmap(fix, gen))
Pragmatic approach:
def fix_string(value):
    match = my_regex.search(value)
    return value.replace(match.group(1), '') if match else value

result = [fix_string(value) for value in my_dataframe[my_col]]
This is actually a good example of a list comprehension that performs worse than its corresponding for-loop and is (far) less readable.
If you wanted to do it, this would be the way:
temp = [value.replace(my_regex.search(value).group(1), '') if my_regex.search(value) else value
        for value in my_dataframe[my_col]]
Note that there is no place for us to define match inside the comprehension, and as a result we have to call my_regex.search(value) twice. This is of course inefficient.
As a result, stick to the for-loop!
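(That said, if you are on Python 3.8 or newer, an assignment expression lets you bind the match inside the comprehension so search runs only once per value; a minimal sketch, assuming Python 3.8+:

# Python 3.8+: the walrus operator binds the match once per value
temp = [value.replace(m.group(1), '') if (m := my_regex.search(value)) else value
        for value in my_dataframe[my_col]]

Unlike the attempt above, this also keeps the non-matching values instead of filtering them out.)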
Use a regular expression with a capturing sub-group: match any word, then a space, then a group consisting of a word ending in "he" ("the", "she"), a space, and a word containing "el" ("well", "fell", "shelves"), followed by a space and one more word.
paragraph="""either the well was very deep, or she fell very slowly, for she had
plenty of time as she went down to look about her and to wonder what was
going to happen next. first, she tried to look down and make out what
she was coming to, but it was too dark to see anything; then she
looked at the sides of the well, and noticed that they were filled with
cupboards and book-shelves; here and there she saw maps and pictures
hung upon pegs. she took down a jar from one of the shelves as
she passed; it was labelled 'orange marmalade', but to her great
disappointment it was empty: she did not like to drop the jar for fear
of killing somebody, so managed to put it into one of the cupboards as
she fell past it."""
sentences=paragraph.split(".")
pattern="\w+\s+((\whe)\s+(\w+el\w+)){1}\s+\w+"
temp=[]
for sentence in sentences:
result=re.findall(pattern,sentence)
for item in result:
temp.append("".join(item[0]).replace(' ',''))
print(temp)
output:
['thewell', 'shefell', 'theshelves', 'shefell']
I have a text which looks like an email body as follows.
To: Abc Cohen <abc.cohen#email.com> Cc: <braggis.mathew#nomail.com>,<samanth.castillo#email.com> Hi
Abc, I happened to see your report. I have not seen any abnormalities and thus I don't think we
should proceed to Braggis. I am open to your thought as well. Regards, Abc On Tue 23 Jul 2017 07:22
PM
Tony Stark wrote:
Then I have a list of key words as follows.
no_wds = ["No","don't","Can't","Not"]
yes_wds = ["Proceed","Approve","May go ahead"]
Objective:
I want to first search the text string given above and, if any of the listed key words is (or are) present, extract the strings in between those key words. In this case, Not and don't are matched from no_wds, and Proceed is matched from yes_wds. Thus the text I want extracted, as a list, is as follows:
txt = ["seen any abnormalities and thus I don't think we should", "think we should"]
My approach:
I have tried
re.findall(r'{}(.*){}'.format(re.escape('|'.join(no_wds)),re.escape('|'.join(yes_wds))),text,re.I)
Or
text_f = []
for i in no_wds:
    for j in yes_wds:
        t = re.findall(r'{}(.*){}'.format(re.escape(i), re.escape(j)), text, re.I)
        text_f.append(t)
I didn't get any suitable result. Then I tried the str.find() method, with no success there either.
I tried to get a clue from here.
Can anybody help in solving this? I am keen to see a non-regex solution, as regex is at times not a good fit. Having said that, if anyone can come up with a regex-based solution where I can iterate over the lists, it is welcome.
Loop through the list containing the keys, use the iterator as a splitter (whatever.split(yourIterator)).
EDIT:
I am not doing your homework, but this should get you on your way:
I decided to loop through the message split at every space, search for the key words, and add the index of each hit to a list; then I used those indexes to slice the message. It is probably worth trying to slice the message without splitting it, but I am not going to do your homework. You must also find a way to automate the process when there are more indexes; tip: check whether the number of indexes is even, or you are going to have a bad time slicing.
*Note that you should replace the \n characters and find a way to sort the key lists.
message = """To: Abc Cohen <abc.cohen#email.com> Cc: <braggis.mathew#nomail.com>,<samanth.castillo#email.com> Hi
Abc, I happened to see your report. I have not seen any abnormalities and thus I don't think we
should proceed to Braggis. I am open to your thought as well. Regards, Abc On Tue 23 Jul 2017 07:22"""
no_wds = ["No","don't","Can't","Not"]
yes_wds = ["Proceed","Approve","May go ahead"]
splittedMessage = message.split( ' ' )
msg = []
for i in range( 0, len( splittedMessage ) ):
temp = splittedMessage[i]
for j, k in zip( no_wds, yes_wds ):
tempJ = j.lower()
tempK = k.lower()
if( tempJ == temp or tempK == temp ):
msg.append( i )
found = ' '.join( splittedMessage[msg[0]:msg[1]] )
print( found )
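As an aside, the reason the re.findall attempt in the question failed is that re.escape was applied to the already-joined string, which escapes the | separators themselves. A minimal, self-contained sketch with each keyword escaped individually (note that findall cannot return overlapping matches, so a pattern like this yields only the first of the two expected strings):

import re

text = ("I have not seen any abnormalities and thus I don't think we "
        "should proceed to Braggis.")
no_wds = ["No", "don't", "Can't", "Not"]
yes_wds = ["Proceed", "Approve", "May go ahead"]

# Escape each keyword individually, THEN join with '|' so the
# alternation survives; escaping the joined string escapes the pipes.
no_pat = '|'.join(map(re.escape, no_wds))
yes_pat = '|'.join(map(re.escape, yes_wds))

# Non-greedy capture of the text between a "no" word and a "yes" word.
matches = re.findall(r'(?:{})\s(.*?)\s(?:{})'.format(no_pat, yes_pat), text, re.I)
print(matches)  # ["seen any abnormalities and thus I don't think we should"]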
I want to auto-correct the words which are in my list.
Say I have a list
kw = ['tiger','lion','elephant','black cat','dog']
I want to check if these words appear in my sentences. If they are wrongly spelled I want to correct them. I don't intend to touch any words other than those in the given list.
Now I have list of str
s = ["I saw a tyger","There are 2 lyons","I mispelled Kat","bulldogs"]
Expected output:
['tiger','lion',None,'dog']
My Efforts:
import difflib
op = [difflib.get_close_matches(i,kw,cutoff=0.5) for i in s]
print(op)
My Output:
[[], [], [], ['dog']]
The problem with the above code is that I want to compare the entire sentence, and my kw list can have more than one word (up to 4-5 words).
If I lower the cutoff value, it starts returning words which it should not.
Even if I create bigrams and trigrams from the given sentences, it would consume a lot of time.
So is there a way to implement this?
I have explored a few more libraries like autocorrect and hunspell, but with no success.
You could implement something based on Levenshtein distance.
It's interesting to note elasticsearch's implementation: https://www.elastic.co/guide/en/elasticsearch/guide/master/fuzziness.html
Clearly, bieber is a long way from beaver—they are too far apart to be
considered a simple misspelling. Damerau observed that 80% of human
misspellings have an edit distance of 1. In other words, 80% of
misspellings could be corrected with a single edit to the original
string.
Elasticsearch supports a maximum edit distance, specified with the
fuzziness parameter, of 2.
Of course, the impact that a single edit has on a string depends on
the length of the string. Two edits to the word hat can produce mad,
so allowing two edits on a string of length 3 is overkill. The
fuzziness parameter can be set to AUTO, which results in the following
maximum edit distances:
0 for strings of one or two characters
1 for strings of three, four, or five characters
2 for strings of more than five characters
I like to use pyxDamerauLevenshtein myself.
pip install pyxDamerauLevenshtein
So you could do a simple implementation like:
from pyxdameraulevenshtein import damerau_levenshtein_distance

keywords = ['tiger', 'lion', 'elephant', 'black cat', 'dog']

def correct_sentence(sentence):
    new_sentence = []
    for word in sentence.split():
        # edit budget follows the AUTO fuzziness rules quoted above
        budget = 2
        n = len(word)
        if n < 3:
            budget = 0
        elif 3 <= n < 6:
            budget = 1
        if budget:
            for keyword in keywords:
                if damerau_levenshtein_distance(word, keyword) <= budget:
                    new_sentence.append(keyword)
                    break
            else:
                new_sentence.append(word)
        else:
            new_sentence.append(word)
    return " ".join(new_sentence)
Just make sure you use a better tokenizer or this will get messy, but you get the point. Also note that this is unoptimized, and will be really slow with a lot of keywords. You should implement some kind of bucketing to not match all words with all keywords.
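As a rough illustration of the bucketing idea, you could index keywords by length, since two strings whose lengths differ by more than the edit budget can never be within that edit distance. A minimal sketch (the candidates helper is hypothetical, not part of any library):

from collections import defaultdict

keywords = ['tiger', 'lion', 'elephant', 'black cat', 'dog']

# Index keywords by length so a word only gets compared against
# keywords whose length is within the edit budget.
buckets = defaultdict(list)
for keyword in keywords:
    buckets[len(keyword)].append(keyword)

def candidates(word, budget):
    """Yield only the keywords whose length is compatible with word."""
    n = len(word)
    for length in range(n - budget, n + budget + 1):
        yield from buckets.get(length, [])

# e.g. list(candidates('tyger', 1)) == ['lion', 'tiger']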
Here is one way using difflib.SequenceMatcher. The SequenceMatcher class lets you measure sentence similarity with its ratio method; you only need to provide a suitable threshold in order to keep words with a ratio that falls above it:
def find_similar_word(s, kw, thr=0.5):
    from difflib import SequenceMatcher
    out = []
    for i in s:
        f = False
        for j in i.split():
            for k in kw:
                if SequenceMatcher(a=j, b=k).ratio() > thr:
                    out.append(k)
                    f = True
                if f:
                    break
            if f:
                break
        else:
            out.append(None)
    return out
Output
find_similar_word(s, kw)
['tiger', 'lion', None, 'dog']
Although this is slightly different from your expected output (it is a list of lists instead of a list of strings), I think it is a step in the right direction. The reason I chose this method is so that you can have multiple corrections per sentence. That is why I added another example sentence.
import difflib
import itertools
kw = ['tiger','lion','elephant','black cat','dog']
s = ["I saw a tyger","There are 2 lyons","I mispelled Kat","bulldogs", "A tyger is different from a doog"]
op = [[difflib.get_close_matches(j,kw,cutoff=0.5) for j in i.split()] for i in s]
op = [list(itertools.chain(*o)) for o in op]
print(op)
The generated output is:
[['tiger'], ['lion'], [], ['dog'], ['tiger', 'dog']]
The trick is to split all the sentences along the whitespaces.
I have a list of strings called txtFreeForm:
['Add roth Sweep non vested money after 5 years of termination',
 'Add roth in-plan to the 401k plan.']
I need to check if only 'Add roth' exists in the sentence. To do that I used this:
for each_line in txtFreeForm:
    match = re.search('add roth', each_line.lower())
    if match is not None:
        print(each_line)
But this obviously returns both the strings in my list, as both contain 'add roth'. Is there a way to exclusively search for 'Add roth' in a sentence? I have a bunch of these patterns to search for in strings.
Thanks for your help!
Can you fix this problem by using the length (len()) of the strings? I'm not an experienced Python programmer, but here is how I think it should work:
for each_line in txtFreeForm:
    match = re.search('add roth', each_line.lower())
    if (match is not None) and (len(each_line) == len("Add Roth")):
        print(each_line)
Basically, if the text is in the string, AND the length of the string is exactly equal to the length of the string "Add Roth", then it must ONLY contain "Add Roth".
I hope this was helpful.
EDIT:
I misunderstood what you were asking. You want to print out sentences that contain "Add Roth", but not sentences that contain "Add Roth in plan". Is this correct?
How about this code?
for each_line in txtFreeForm:
    match_AR = re.search('add roth', each_line.lower())
    match_ARIP = re.search('add roth in-plan', each_line.lower())
    if (match_AR is not None) and (match_ARIP is None):
        print(each_line)
This seems like it should fix the problem. You can exclude any strings (like "in plan") by searching for them too and adding them to the comparison.
You're close :) Give this a shot:
for each_line in txtFreeForm:
    match = re.search('add roth (?!in[-]plan)', each_line.lower())
    if match is not None:
        print(each_line[match.end():])
EDIT:
Ahhh I misread... you have a LOT of these. This calls for some more aggressive magic.
import re
from functools import partial, reduce

txtFreeForm = ['Add roth Sweep non vested money after 5 years of termination',
               'Add roth in-plan to the 401k plan.']

def roths(rows):
    # yield (row, remainder-after-'add roth') pairs
    for row in rows:
        match = re.search(r'add roth\s*', row.lower())
        if match:
            yield row, row[match.end():]

def filter_pattern(pattern):
    return partial(lazy_filter_out, pattern)

def lazy_filter_out(pattern, rows):
    # drop rows whose remainder starts with the unwanted pattern
    for row, rest in rows:
        if not re.match(pattern, rest):
            yield row, rest

def magical_transducer(bad_words, nice_rows):
    # compose the pipeline: first roths, then one filter per bad word
    pipeline = [roths] + [filter_pattern(word) for word in bad_words]
    magical_sentences = reduce(lambda x, y: y(x), pipeline, nice_rows)
    for row, _ in magical_sentences:
        yield row

def main():
    magic = magical_transducer(['in[-]plan'], txtFreeForm)
    print(list(magic))

if __name__ == '__main__':
    main()
To explain a bit about what's happening here: you mentioned you have a LOT of these words to process. The traditional way you might compare two groups of items is with nested for-loops. So,
results = []
for word in words:
    for pattern in patterns:
        data = do_something(word, pattern)
        results.append(data)
        for item in data:
            for thing in item:
                and so on...
                and so forth...
I'm using a few different techniques to attempt to achieve a "flatter" implementation and avoid the nested loops. I'll do my best to describe them.
**Function compositions**
# You will often see patterns that look like this:
x = foo(a)
y = bar(b)
z = baz(y)
# You may also see patterns that look like this:
z = baz(bar(foo(a)))
# an alternative way to do this is to use a functional composition
# the technique works like this:
z = reduce(lambda x, y: y(x), [foo, bar, baz], a)
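Here is a tiny runnable illustration of that reduce-based composition (note that in Python 3, reduce must be imported from functools; foo, bar, and baz are placeholder functions invented for the sketch):

from functools import reduce  # built in on Python 2, in functools on Python 3

foo = lambda x: x + 1   # placeholder functions
bar = lambda x: x * 2
baz = lambda x: x - 3

a = 10
z = reduce(lambda x, y: y(x), [foo, bar, baz], a)
print(z == baz(bar(foo(a))))  # True: both apply foo, then bar, then baz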
I have multiple string variations: "gr_shoulder_r_tmp", "r_shoulder_tmp".
I need to substitute
"r_" with "l_", here:
"gr_shoulder_r_tmp" > "gr_shoulder_l_tmp"
"r_shoulder_tmp" > "l_shoulder_tmp"
In other words, I need to substitute the 3rd occurrence in the first example
and the 1st occurrence in the second example.
I started digging myself...
and came up with a half-resolved result, which raised one more interesting question:
a) Find index of right hit
[i for i, x in enumerate(re.findall("(.?)(r_)", "gr_shoulder_r_tmp")) if filter(None, x).__len__() == 1]
which gives me indx = 2
?) how to use that hit index :[
While writing this I found a straightforward, simple solution:
b) split by underscore, replace standalone letter, and join back
findtag = "r"
newtag = "l"
itemA = "gr_shoulder_r_tmp"
itemB = "r_shoulderr_tmp"
spl_str = itemA.split("_")
hit = spl_str.index(findtag)
spl_str[hit] = newtag
new_item = "_".join(spl_str)
Both itemA and itemB give me what I need, but I'm not happy with it: it's too heavy and rough.
A simple regex will do this job.
re.sub(r'(?<![a-zA-Z])r_', 'l_', s)
(?<![a-zA-Z]) is a negative lookbehind which asserts that the match is not preceded by a letter.
Example:
>>> re.sub(r'(?<![a-zA-Z])r_', 'l_',"gr_shoulder_r_tmp")
'gr_shoulder_l_tmp'
>>> re.sub(r'(?<![a-zA-Z])r_', 'l_',"r_shoulder_tmp")
'l_shoulder_tmp'
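As an aside, a word boundary (\b) would not work here, because regex word boundaries treat the underscore as a word character; a quick sketch showing the difference:

import re

# \b fails: '_' counts as a word character, so there is no boundary
# between '_' and 'r' in 'gr_shoulder_r_tmp'.
print(re.sub(r'\br_', 'l_', 'gr_shoulder_r_tmp'))  # gr_shoulder_r_tmp (unchanged)

# The negative lookbehind only rejects letters, so it allows '_' (and
# the start of the string) before the 'r_' we want to replace.
print(re.sub(r'(?<![a-zA-Z])r_', 'l_', 'gr_shoulder_r_tmp'))  # gr_shoulder_l_tmp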
I have a list of links in an array, such as
results = ['link1/1254245',
           'q%(random part)cache:link2/1254245& (random part) Dclnk',
           'link3/1254245']
whereas link = http://www.whatever.com.
I want to replace the terms q%3(random part)cache and &(random part)Dclnk with nothing, so that the "clean" link2 is "cut" out and left among the other "clean" links. The random part always changes in content and length; the q%3, :, and & Dclnk parts stay the same.
How do I do that? I could not find a straight answer to that so far.
You could achieve this through re.sub and list comprehension.
>>> l = ['link1/1254245', 'q%(random part)cache:link2/1254245& (random part) Dclnk', 'link3/1254245']
>>> [re.sub(r'q%[^(]*\([^()]*\)cache:|&\s*\([^()]*\)\s*Dclnk', r'', i) for i in l]
['link1/1254245', 'link2/1254245', 'link3/1254245']
[^()]* matches any character except ( or ), zero or more times. The | alternation operator lets you combine multiple patterns.
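If readability matters, the same pattern can also be written with re.VERBOSE so each alternative carries its own comment; a sketch of the equivalent pattern:

import re

pattern = re.compile(r'''
    q%[^(]*\([^()]*\)cache:    # prefix junk: q% ... (random part) cache:
    |                          # or
    &\s*\([^()]*\)\s*Dclnk     # suffix junk: & (random part) Dclnk
''', re.VERBOSE)

l = ['link1/1254245',
     'q%(random part)cache:link2/1254245& (random part) Dclnk',
     'link3/1254245']
print([pattern.sub('', i) for i in l])
# ['link1/1254245', 'link2/1254245', 'link3/1254245']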