regex substitute only specific hit sequence - python

I have multiple string variations: "gr_shoulder_r_tmp", "r_shoulder_tmp"
I need to substitute "r_" with "l_" here:
"gr_shoulder_r_tmp" > "gr_shoulder_l_tmp"
"r_shoulder_tmp" > "l_shoulder_tmp"
In other words, I need to substitute the 3rd occurrence in the first example
and the 1st occurrence in the second example.
I started digging myself...
and came up with a half-solved result, which raised one more interesting question:
a) Find the index of the right hit
[i for i, x in enumerate(re.findall("(.?)(r_)", "gr_shoulder_r_tmp")) if len([g for g in x if g]) == 1]
which gives me index 2
?) How do I use that hit index? :[
While writing this I found a straightforward solution:
b) split by underscore, replace standalone letter, and join back
findtag = "r"
newtag = "l"
itemA = "gr_shoulder_r_tmp"
itemB = "r_shoulderr_tmp"
spl_str = itemA.split("_")
hit = spl_str.index(findtag)
spl_str[hit] = newtag
new_item = "_".join(spl_str)
Both itemA and itemB give me what I need, but I'm not happy with it: it feels too heavy and rough.

A simple regex will do this job.
re.sub(r'(?<![a-zA-Z])r_', 'l_', s)
(?<![a-zA-Z]) is a negative lookbehind which asserts that the match is not preceded by a letter.
Example:
>>> re.sub(r'(?<![a-zA-Z])r_', 'l_',"gr_shoulder_r_tmp")
'gr_shoulder_l_tmp'
>>> re.sub(r'(?<![a-zA-Z])r_', 'l_',"r_shoulder_tmp")
'l_shoulder_tmp'
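The same idea can be wrapped in a small helper that treats the tag as standalone only when it is delimited by underscores or by the ends of the string. This is a sketch under that assumption; the name swap_side and its parameters are made up for illustration and are not part of the answer above.

```python
import re

def swap_side(name, old="r", new="l"):
    """Replace `old` with `new` only where it is a standalone token.

    Assumption of this sketch: a token is standalone when it is
    delimited by underscores or by the start/end of the string.
    """
    # (?<![^_]) : preceded by an underscore or the start of the string
    # (?![^_])  : followed by an underscore or the end of the string
    pattern = r"(?<![^_])" + re.escape(old) + r"(?![^_])"
    return re.sub(pattern, new, name)

print(swap_side("gr_shoulder_r_tmp"))  # gr_shoulder_l_tmp
print(swap_side("r_shoulder_tmp"))     # l_shoulder_tmp
```

Unlike the letter-only lookbehind, this variant also leaves a name such as "x1r_tmp" untouched, because the digit before "r" is not an underscore.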

Related

How to extract strings from a list in a column in a python pandas dataframe?

Let's say I have a list
lst = ["fi", "ap", "ko", "co", "ex"]
and we have this series
   Explanation
a  "fi doesn't work correctly"
b  "apples are cool"
c  "this works but translation is ko"
and I'm looking to get something like this:
   Explanation                           Explanation Extracted
a  "fi doesn't work correctly"           "fi"
b  "apples are cool"                     "N/A"
c  "this works but translation is ko"    "ko"
With a dataframe like
df = pd.DataFrame(
{"Explanation": ["fi doesn't co work correctly",
"apples are cool",
"this works but translation is ko"]},
index=["a", "b", "c"]
)
you can use .str.extract() to do
lst = ["fi", "ap", "ko", "co", "ex"]
pattern = r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)"
df["Explanation Extracted"] = df.Explanation.str.extract(pattern, expand=False)
to get
   Explanation                       Explanation Extracted
a  fi doesn't co work correctly      fi
b  apples are cool                   NaN
c  this works but translation is ko  ko
The regex pattern r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)" looks for an occurrence of one of the lst items either at the beginning with whitespace afterwards, in the middle with whitespace before and after, or at the end with whitespace before. str.extract() extracts the capture group (the part in the middle in ()). Without a match the result is NaN.
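To see what the pattern matches without involving pandas, it can be tried with the standard re module on the three example strings. This is only an illustrative sketch: the helper first_token is a made-up name, and re.escape is added as a precaution that the original pattern does not use.

```python
import re

lst = ["fi", "ap", "ko", "co", "ex"]
# re.escape is not needed for these plain tokens; it only guards against
# tokens that happen to contain regex metacharacters.
pattern = re.compile(
    r"(?:^|\s+)(" + "|".join(map(re.escape, lst)) + r")(?:\s+|$)"
)

def first_token(text):
    """Return the first whole-token match from lst, or None."""
    m = pattern.search(text)
    return m.group(1) if m else None

for text in ["fi doesn't co work correctly",
             "apples are cool",
             "this works but translation is ko"]:
    print(repr(text), "->", first_token(text))
```

"apples" and "cool" are not matched because the tokens "ap" and "co" are followed by more letters rather than by whitespace or the end of the string.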
If you want to extract multiple matches, you could use .str.findall() and then ", ".join the results:
pattern = r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)"
df["Explanation Extracted"] = (
df.Explanation.str.findall(pattern).str.join(", ").replace({"": None})
)
Alternative without regex:
df.index = df.index.astype("category")
matches = df.Explanation.str.split().explode().loc[lambda s: s.isin(lst)]
df["Explanation Extracted"] = (
matches.groupby(level=0).agg(set).str.join(", ").replace({"": None})
)
If you only want to match at the beginning or end of the sentences, then replace the first part with:
df.index = df.index.astype("category")
splitted = df.Explanation.str.split()
matches = (
(splitted.str[:1] + splitted.str[-1:]).explode().loc[lambda s: s.isin(lst)]
)
...
I think this solves your problem.
import pandas as pd
lst = ["fi", "ap", "ko", "co", "ex"]
df = pd.DataFrame([["fi doesn't work correctly"],["apples are cool"],["this works but translation is ko"]],columns=["Explanation"])
extracted = []
for index, row in df.iterrows():
    tempList = []
    rowSplit = row['Explanation'].split(" ")
    for val in rowSplit:
        if val in lst:
            tempList.append(val)
    if len(tempList) > 0:
        extracted.append(','.join(tempList))
    else:
        extracted.append('N/A')
df['Explanation Extracted'] = extracted
The apply function of pandas might be helpful:
def extract_explanation(row):
    # With axis=1, apply passes one row at a time
    custom_substring = ["fi", "ap", "ko", "co", "ex"]
    substrings = row['Explanation'].split(" ")
    explanation = "N/A"
    for string in substrings:
        if string in custom_substring:
            explanation = string
    return explanation
df['Explanation Extracted'] = df.apply(extract_explanation, axis=1)
The catch here is the assumption of only one explanation per row (if several tokens match, the last one wins). The function can be changed to return a list if multiple explanations are expected.
Option 1
Assuming that one wants to extract the exact strings in the list lst, one can start by creating a regex
regex = f'\\b({"|".join(lst)})\\b'
where \b is a word boundary (the beginning or end of a word), which ensures that the match is not preceded or followed by additional word characters. So, given the string ap in the list lst, the word apple in the dataframe will not be matched.
Then use pandas.Series.str.extract, passing re.IGNORECASE to make the match case insensitive:
import re
df['Explanation Extracted'] = df['Explanation'].str.extract(regex, flags=re.IGNORECASE, expand=False)
[Out]:
   ID  Explanation                       Explanation Extracted
0  1   fi doesn't work correctly         fi
1  2   cap ples are cool                 NaN
2  3   this works but translation is ko  ko
Option 2
One can also use pandas.Series.apply with a custom lambda function.
df['Explanation Extracted'] = df['Explanation'].apply(lambda x: next((i for i in lst if i.lower() in x.lower().split()), 'N/A'))
[Out]:
   ID  Explanation                       Explanation Extracted
0  1   fi doesn't work correctly         fi
1  2   cap ples are cool                 N/A
2  3   this works but translation is ko  ko
Notes:
.lower() is to make it case insensitive.
.split() makes the comparison whole-word, so even though ap is in the list, the string apple does not produce a match in the Explanation Extracted column.

Python: print condition

I found that I asked the wrong question a few minutes ago, sorry about that. I am running code that needs to identify whether the word at a certain location matches my condition.
The original text is not in English; I have just tried to show the problem in a simple way. There are actually no spaces between words in my language, so using split or re does not work.
I need to find the word before "car" to know whether someone loves the car or not. So I used the location as the condition to identify it.
For example (but this gets too long):
message = "I do not like cars."
# print(message[14:18])  # "cars" starts at location 14
location = 14
if message[int(location)-5:int(location)-1] == "like":
    print("like")
elif message[int(location)-8:int(location)-1] == "dislike":
    print("dislike")
elif message[int(location)-5:int(location)-1] == "hate":
    print("hate")
elif message[int(location)-5:int(location)-1] == "cool":
    print("cool")
I actually used this one in my code, but found that I could not print the word:
if (
    message[int(location) - 5:int(location) - 1] == "like" or
    message[int(location) - 8:int(location) - 1] == "dislike" or
    message[int(location) - 5:int(location) - 1] == "hate" or
    message[int(location) - 5:int(location) - 1] == "cool"
):
    # print "like"
    # unable to do it
Is there any way I can solve it by printing the matching word?
Looks like you need a regex:
import re
message = "I do not dislike cars."
check_list = {"like", "dislike", "hate", "cool"}
# \b must sit outside the group so that it applies to every alternative;
# use re.compile(r"({})".format("|".join(check_list))) if words are not separated by spaces
pattern = re.compile(r"\b({})\b".format("|".join(check_list)))
m = pattern.search(message)
if m:
    print(m.group(1))  # --> dislike
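Because the question says the real text has no spaces between words, a regex with word boundaries may not apply there. One alternative is to check which candidate the text ends with right before the known location. The following is a minimal sketch; the function name word_before is hypothetical, and the rstrip() call only matters for the English demonstration string.

```python
def word_before(message, location, candidates):
    """Return the candidate that ends right before `location`, or None."""
    prefix = message[:location].rstrip()
    # Try longer candidates first so that "dislike" wins over "like".
    for word in sorted(candidates, key=len, reverse=True):
        if prefix.endswith(word):
            return word
    return None

message = "I do not dislike cars."
location = message.find("cars")
print(word_before(message, location, ["like", "dislike", "hate", "cool"]))  # dislike
```

This keeps the slicing idea from the question but removes the need to hard-code one offset per word length.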

python if/else list comprehension

I was wondering if it's possible to use list comprehension in the following case, or if it should be left as a for loop.
temp = []
for value in my_dataframe[my_col]:
    match = my_regex.search(value)
    if match:
        temp.append(value.replace(match.group(1), ''))
    else:
        temp.append(value)
I believe I can do it with the if/else section, but the 'match' line throws me off. This is close but not exactly it.
temp = [value.replace(match.group(1),'') if (match) else value for
value in my_dataframe[my_col] if my_regex.search(value)]
Single-statement approach:
result = [
    value.replace(match.group(1), '') if match else value
    for value, match in (
        (value, my_regex.search(value))
        for value in my_dataframe[my_col])]
Functional approach - python 2:
data = my_dataframe[my_col]
gen = zip(data, map(my_regex.search, data))
fix = lambda (v, m): v.replace(m.group(1), '') if m else v
result = map(fix, gen)
Functional approach - python 3:
from itertools import starmap
data = my_dataframe[my_col]
gen = zip(data, map(my_regex.search, data))
fix = lambda v, m: v.replace(m.group(1), '') if m else v
result = list(starmap(fix, gen))
Pragmatic approach:
def fix_string(value):
    match = my_regex.search(value)
    return value.replace(match.group(1), '') if match else value
result = [fix_string(value) for value in my_dataframe[my_col]]
This is actually a good example of a list comprehension that performs worse than its corresponding for-loop and is (far) less readable.
If you wanted to do it, this would be the way:
temp = [value.replace(my_regex.search(value).group(1),'') if my_regex.search(value) else value for value in my_dataframe[my_col]]
# my_regex.search(value) appears twice in the expression above
Note that there is no place for us to define match inside the comprehension, and as a result we have to call my_regex.search(value) twice. This is of course inefficient.
As a result, stick to the for-loop!
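On Python 3.8 or later, an assignment expression (the := operator) gives the comprehension a place to bind the match, so the search runs only once per value. The sketch below uses made-up stand-ins for my_regex and my_dataframe[my_col], since neither is defined in the question.

```python
import re

# Hypothetical stand-ins for my_regex and my_dataframe[my_col].
my_regex = re.compile(r"(\d+)")
values = ["abc123", "no digits", "x9y9"]

# The condition is evaluated first, so `match` is already bound
# when the left-hand branch uses it.
result = [
    value.replace(match.group(1), '') if (match := my_regex.search(value)) else value
    for value in values
]
print(result)  # ['abc', 'no digits', 'xy']
```

Whether this reads better than the plain for-loop is a matter of taste; it only removes the duplicated search call.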
Use a regular expression with a sub-group that matches a three-letter word ending in "he" (such as "the" or "she"), followed by whitespace and a word containing "el", with one more word required before and after the pair. Apply the pattern to each sentence and collect the sub-group matches.
paragraph="""either the well was very deep, or she fell very slowly, for she had
plenty of time as she went down to look about her and to wonder what was
going to happen next. first, she tried to look down and make out what
she was coming to, but it was too dark to see anything; then she
looked at the sides of the well, and noticed that they were filled with
cupboards and book-shelves; here and there she saw maps and pictures
hung upon pegs. she took down a jar from one of the shelves as
she passed; it was labelled 'orange marmalade', but to her great
disappointment it was empty: she did not like to drop the jar for fear
of killing somebody, so managed to put it into one of the cupboards as
she fell past it."""
import re
sentences = paragraph.split(".")
pattern = r"\w+\s+((\whe)\s+(\w+el\w+)){1}\s+\w+"
temp = []
for sentence in sentences:
    result = re.findall(pattern, sentence)
    for item in result:
        # item[0] is the outer group, e.g. "the well"; drop the space
        temp.append("".join(item[0]).replace(' ', ''))
print(temp)
output:
['thewell', 'shefell', 'theshelves', 'shefell']

How to get a value for a key in a string, when followed by another specific key=value set

my code is like:
string = "title=abcd color=green title=efgh color=blue title=xyxyx color=yellow title=whatIwaht color=red title=xxxy red=anything title=xxxyyy color=red"
pattern = r'title=(.*?) color=red'
print re.compile(pattern).search(string).group(0)
and I got
"title=abcd color=green title=efgh color=blue title=xyxyx color=yellow title=whatIwaht color=red title=xxxy red=anything title=xxxyyy color=red"
But I want to find all the contents of "title"s immediately followed by "color=red"
You want what immediately precedes color=red? Then use
.*title=(.*?) color=red
Demo: https://regex101.com/r/sR4kN2/1
The leading .* is greedy, so it consumes as much of the string as possible, and the capture group ends up holding only the last title that is directly followed by color=red.
Alternatively, if you know there is a character that doesn't appear in the title, you can simplify by just using a character class exclusion. For example, if you know = won't appear:
title=([^=]*?) color=red
Or, if you know whitespace won't appear:
title=([^\s]*?) color=red
A third option, using a bit of code to find all red titles (assuming that the input always alternates title, color):
for title, color in re.findall(r'title=(.*?) color=(.*?)(?: |$)', string):
    if color == 'red':
        print(title)
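If titles never contain whitespace, the whitespace-exclusion idea can be combined with re.findall to collect every title that is immediately followed by color=red. This is a sketch under that assumption, run on the string from the question; the variable name red_titles is made up.

```python
import re

string = ("title=abcd color=green title=efgh color=blue "
          "title=xyxyx color=yellow title=whatIwaht color=red "
          "title=xxxy red=anything title=xxxyyy color=red")

# \S+ cannot cross a space, so the captured title must sit directly
# before " color=red"; (?!\S) rejects values such as "color=reddish".
red_titles = re.findall(r'title=(\S+) color=red(?!\S)', string)
print(red_titles)  # ['whatIwaht', 'xxxyyy']
```

The pair "title=xxxy red=anything" is skipped because the text after the title is not " color=red".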
If you want to get the last match of a sub-regexp before a certain regexp, the solution is to use a greedy skipper. For example:
>>> pattern = '.*title="([^"]*)".*color="#123"'
>>> text = 'title="123" color="#456" title="789" color="#123"'
>>> print(re.match(pattern, text).group(1))
789
The first .* is greedy: it skips as much as possible (thus skipping the first title) and then backs up to the last title that still allows the desired color to match.
As a simpler example consider that
a(.*)b(.*)c
processed on
a1111b2222b3333c
will match 1111b2222 in the first group and 3333 in the second.
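The simpler example can be checked directly, together with a lazy variant for contrast. The variable names below are only for illustration.

```python
import re

text = "a1111b2222b3333c"

# Greedy first group: runs to the end, then backs up to the last "b".
greedy = re.match(r"a(.*)b(.*)c", text).groups()
# Lazy first group: stops at the first "b" that lets the rest match.
lazy = re.match(r"a(.*?)b(.*)c", text).groups()

print(greedy)  # ('1111b2222', '3333')
print(lazy)    # ('1111', '2222b3333')
```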
Why don't you skip the regexes, and use some split functionality instead:
search_title = False
found = None
string = "title=abcd color=green title=efgh color=blue title=xyxyx color=yellow title=whatIwaht color=red title=xxxy red=anything title=xxxyyy color=red"
parts = string.split()
for part in parts:
    key, value = part.split('=', 1)
    if search_title:
        if key == 'title':
            found = value
            search_title = False
    if key == 'color' and value == 'red':
        search_title = True
print(found)
results in
xxxy
Regexes are nice, but can cause headaches at times.
Try this using the re module:
>>> string = 'title=abcd color=green title=efgh color=blue title=xyxyx color=yellow title=whatIwaht color=red'
>>> import re
>>> re.search('(.*title=?)(.*) color=red', string).group(2)
'whatIwaht'
>>> re.search('(.*title=?)(.*) color=yellow', string).group(2)
'xyxyx'

Identify substrings and return responses based on order of substrings in Python

I am a beginner in Python, I am teaching myself off of Google Code University online. One of the exercises in string manipulation is as follows:
# E. not_bad
# Given a string, find the first appearance of the
# substring 'not' and 'bad'. If the 'bad' follows
# the 'not', replace the whole 'not'...'bad' substring
# with 'good'.
# Return the resulting string.
# So 'This dinner is not that bad!' yields:
# This dinner is good!
def not_bad(s):
    # +++your code here+++
    return
I'm stuck. I know it could be put into a list using ls = s.split(' ') and then sorted with various elements removed, but I think that is probably just creating extra work for myself. The lesson hasn't covered RegEx yet so the solution doesn't involve re. Help?
Here's what I tried, but it doesn't quite give the output correctly in all cases:
def not_bad(s):
    if s.find('not') != -1:
        notindex = s.find('not')
        if s.find('bad') != -1:
            badindex = s.find('bad') + 3
            if notindex > badindex:
                removetext = s[notindex:badindex]
                ns = s.replace(removetext, 'good')
            else:
                ns = s
        else:
            ns = s
    else:
        ns = s
    return ns
Here is the output, it worked in 1/4 of the test cases:
not_bad
X got: 'This movie is not so bad' expected: 'This movie is good'
X got: 'This dinner is not that bad!' expected: 'This dinner is good!'
OK got: 'This tea is not hot' expected: 'This tea is not hot'
X got: "goodIgoodtgood'goodsgood goodbgoodagooddgood goodygoodegoodtgood
goodngoodogoodtgood" expected: "It's bad yet not"
Test Cases:
print 'not_bad'
test(not_bad('This movie is not so bad'), 'This movie is good')
test(not_bad('This dinner is not that bad!'), 'This dinner is good!')
test(not_bad('This tea is not hot'), 'This tea is not hot')
test(not_bad("It's bad yet not"), "It's bad yet not")
UPDATE: This code solved the problem:
def not_bad(s):
    notindex = s.find('not')
    if notindex != -1:
        if s.find('bad') != -1:
            badindex = s.find('bad') + 3
            if notindex < badindex:
                removetext = s[notindex:badindex]
                return s.replace(removetext, 'good')
    return s
Thanks everyone for helping me discover the solution (and not just giving me the answer)! I appreciate it!
Well, I think that it is time to make a small review ;-)
There is an error in your code: notindex > badindex should be changed into notindex < badindex. The changed code seems to work fine.
Also I have some remarks about your code:
It is usual practice to compute a value once, assign it to a variable, and use that variable in the code below. That rule applies in this particular case:
For example, the head of your function could be replaced by
notindex = s.find('not')
if notindex == -1:
You can use return inside of your function several times.
As a result, the tail of your code could be significantly reduced:
if (*all right*):
    return s.replace(removetext, 'good')
return s
Finally, I want to point out that you can solve this problem using split, but it does not seem to be a better solution.
def not_bad( s ):
    q = s.split( "bad" )
    w = q[0].split( "not" )
    if len(q) > 1 < len(w):
        return w[0] + "good" + "bad".join(q[1:])
    return s
Break it down like this:
1. How would you figure out if the word "not" is in a string?
2. How would you figure out where the word "not" is in a string, if it is?
3. How would you combine #1 and #2 in a single operation?
4. Same as #1-3 except for the word "bad"?
5. Given that you know the words "not" and "bad" are both in a string, how would you determine whether the word "bad" came after the word "not"?
6. Given that you know "bad" comes after "not", how would you get every part of the string that comes before the word "not"?
7. And how would you get every part of the string that comes after the word "bad"?
8. How would you combine the answers to #6 and #7 to replace everything from the start of the word "not" to the end of the word "bad" with "good"?
Since you are trying to learn, I don't want to hand you the answer, but I would start by looking in the python documentation for some of the string functions including replace and index.
Also, if you have a good IDE it can help by showing you what methods are attached to an object and even automatically displaying the help string for those methods. I tend to use Eclipse for large projects and the lighter weight Spyder for small projects
http://docs.python.org/library/stdtypes.html#string-methods
I suspect that they're wanting you to use string.find to locate the various substrings:
>>> mystr = "abcd"
>>> mystr.find("bc")
1
>>> mystr.find("bce")
-1
Since you're trying to teach yourself (kudos, BTW :) I won't post a complete solution, but also note that you can use indexing to get substrings:
>>> mystr[0:mystr.find("bc")]
'a'
Hope that's enough to get you started! If not, just comment here and I can post more. :)
def not_bad(s):
    snot = s.find("not")
    sbad = s.find("bad")
    # Replace only when 'not' is present and 'bad' comes after it
    if snot != -1 and snot < sbad:
        s = s.replace(s[snot:(sbad+3)], "good")
    return s
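The solutions above call s.replace on the extracted substring, which also rewrites any later occurrence of the same text. Slicing around the two indices changes only the first 'not'...'bad' span. The following is a sketch of that variant; the name not_bad_sliced is made up to keep it apart from the versions above.

```python
def not_bad_sliced(s):
    """Replace the first 'not'...'bad' span with 'good' using slicing."""
    not_index = s.find('not')
    bad_index = s.find('bad')
    # Both words must be present, and 'bad' must come after 'not'.
    if not_index != -1 and bad_index > not_index:
        return s[:not_index] + 'good' + s[bad_index + len('bad'):]
    return s

print(not_bad_sliced('This dinner is not that bad!'))  # This dinner is good!
print(not_bad_sliced("It's bad yet not"))              # It's bad yet not
print(not_bad_sliced('not bad, not bad at all'))       # good, not bad at all
```

The last call shows the difference: a replace-based version would turn both 'not bad' occurrences into 'good'.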
