Replace multiple substrings in a Pandas series with a value - python

All,
To replace one string in one particular column I have done this and it worked fine:
dataUS['sec_type'].str.strip().str.replace("LOCAL","CORP")
I would now like to replace multiple strings with one string, say replace ["LOCAL", "FOREIGN", "HELLO"] with "CORP".
How can I make it work? The code below didn't work:
dataUS['sec_type'].str.strip().str.replace(["LOCAL", "FOREIGN", "HELLO"], "CORP")

You can perform this task by forming a |-separated string. This works because pd.Series.str.replace accepts regex (on newer pandas versions, pass regex=True explicitly):
Replace occurrences of pattern/regex in the Series/Index with some
other string. Equivalent to str.replace() or re.sub().
This avoids the need to create a dictionary.
import pandas as pd
df = pd.DataFrame({'A': ['LOCAL TEST', 'TEST FOREIGN', 'ANOTHER HELLO', 'NOTHING']})
pattern = '|'.join(['LOCAL', 'FOREIGN', 'HELLO'])
df['A'] = df['A'].str.replace(pattern, 'CORP', regex=True)
# A
# 0 CORP TEST
# 1 TEST CORP
# 2 ANOTHER CORP
# 3 NOTHING
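If the terms can contain regex metacharacters (., (, + and so on), it is safer to escape each one before joining. A minimal sketch on the same toy dataframe:
import re
import pandas as pd

df = pd.DataFrame({'A': ['LOCAL TEST', 'TEST FOREIGN', 'ANOTHER HELLO', 'NOTHING']})

# Escape every term so characters like '.' or '(' are matched literally
pattern = '|'.join(re.escape(t) for t in ['LOCAL', 'FOREIGN', 'HELLO'])
df['A'] = df['A'].str.replace(pattern, 'CORP', regex=True)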

replace can accept a dict, so we just create a dict for the values that need to be replaced:
dataUS['sec_type'].str.strip().replace(dict(zip(["LOCAL", "FOREIGN", "HELLO"], ["CORP"]*3)),regex=True)
Info of the dict
dict(zip(["LOCAL", "FOREIGN", "HELLO"], ["CORP"]*3))
Out[585]: {'FOREIGN': 'CORP', 'HELLO': 'CORP', 'LOCAL': 'CORP'}
The reason you received the error is that str.replace is different from replace.
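A minimal sketch of the difference (the toy Series is made up for illustration):
import pandas as pd

s = pd.Series(['LOCAL X', 'FOREIGN Y'])

# Series.replace accepts a dict and, with regex=True, replaces substrings
s.replace({'LOCAL': 'CORP', 'FOREIGN': 'CORP'}, regex=True)
# 0    CORP X
# 1    CORP Y

# Series.str.replace expects a single pattern string, so passing a list raises an error
# s.str.replace(['LOCAL', 'FOREIGN'], 'CORP')  # TypeError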

The answer of @Rakesh is very neat but does not allow for substrings. With a small change, however, it does:
Use a replacement dictionary, because it makes the code much more generic.
Add the keyword argument regex=True to Series.replace() (not Series.str.replace). This does two things: it switches to regex replacement, which is much more powerful but means you have to escape special characters, and it makes the replacement work on substrings instead of requiring the entire string to match. Which is really cool!
replacement = {
    "LOCAL": "CORP",
    "FOREIGN": "CORP",
    "HELLO": "CORP"
}
dataUS['sec_type'].replace(replacement, regex=True)
Full code example
dataUS = pd.DataFrame({'sec_type': ['LOCAL', 'Sample text LOCAL', 'Sample text LOCAL sample FOREIGN']})
replacement = {
    "LOCAL": "CORP",
    "FOREIGN": "CORP",
    "HELLO": "CORP"
}
dataUS['sec_type'].replace(replacement, regex=True)
Output
0 CORP
1 Sample text CORP
2 Sample text CORP sample CORP
Name: sec_type, dtype: object
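Because regex=True turns the dictionary keys into regular expressions, keys containing metacharacters need escaping. A small sketch with re.escape (the "A.B (EU)" key is a made-up example):
import re
import pandas as pd

raw = {"A.B (EU)": "CORP", "LOCAL": "CORP"}
# Escape each key so '.', '(' and ')' are matched literally
replacement = {re.escape(k): v for k, v in raw.items()}

pd.Series(['A.B (EU) bond', 'Sample text LOCAL']).replace(replacement, regex=True)
# 0           CORP bond
# 1    Sample text CORP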

@JJP's answer is a good one if you have a long list. But if you have just two or three items, you can simply use '|' within the pattern. Make sure to add the regex=True parameter.
Clearly .str.strip() is not a requirement, but it is good practice.
import pandas as pd
df = pd.DataFrame({'A': ['LOCAL TEST', 'TEST FOREIGN', 'ANOTHER HELLO', 'NOTHING']})
df['A'] = df['A'].str.strip().str.replace("LOCAL|FOREIGN|HELLO", "CORP", regex=True)
output
A
0 CORP TEST
1 TEST CORP
2 ANOTHER CORP
3 NOTHING

Function to replace multiple values in pandas Series:
def replace_values(series, to_replace, value):
    for i in to_replace:
        series = series.str.replace(i, value)
    return series
Hope this helps someone
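Usage on the original column might look like this (a sketch; the plain words here work with either the literal or the regex interpretation of str.replace, so no regex flag is needed):
dataUS['sec_type'] = replace_values(dataUS['sec_type'].str.strip(), ["LOCAL", "FOREIGN", "HELLO"], "CORP")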

Try:
dataUS.replace({"sec_type": { 'LOCAL' : "CORP", 'FOREIGN' : "CORP"}})
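Note that without regex=True this form only replaces cells whose entire value matches. On the pandas versions I have tried, adding regex=True makes it work on substrings too (a sketch):
dataUS.replace({"sec_type": {'LOCAL': "CORP", 'FOREIGN': "CORP"}}, regex=True)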

Related

How to extract strings from a list in a column in a python pandas dataframe?

Let's say I have a list
lst = ["fi", "ap", "ko", "co", "ex"]
and we have this series
Explanation
a "fi doesn't work correctly"
b "apples are cool"
c "this works but translation is ko"
and I'm looking to get something like this:
   Explanation                          Explanation Extracted
a  "fi doesn't work correctly"          "fi"
b  "apples are cool"                    "N/A"
c  "this works but translation is ko"   "ko"
With a dataframe like
df = pd.DataFrame(
    {"Explanation": ["fi doesn't co work correctly",
                     "apples are cool",
                     "this works but translation is ko"]},
    index=["a", "b", "c"]
)
you can use .str.extract() to do
lst = ["fi", "ap", "ko", "co", "ex"]
pattern = r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)"
df["Explanation Extracted"] = df.Explanation.str.extract(pattern, expand=False)
to get
Explanation Explanation Extracted
a fi doesn't co work correctly fi
b apples are cool NaN
c this works but translation is ko ko
The regex pattern r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)" looks for an occurrence of one of the lst items either at the beginning with whitespace afterwards, in the middle with whitespace before and after, or at the end with whitespace before. str.extract() extracts the capture group (the part in the middle in ()). Without a match, the return is NaN.
If you want to extract multiple matches, you could use .str.findall() and then ", ".join the results:
pattern = r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)"
df["Explanation Extracted"] = (
df.Explanation.str.findall(pattern).str.join(", ").replace({"": None})
)
Alternative without regex:
df.index = df.index.astype("category")
matches = df.Explanation.str.split().explode().loc[lambda s: s.isin(lst)]
df["Explanation Extracted"] = (
matches.groupby(level=0).agg(set).str.join(", ").replace({"": None})
)
If you only want to match at the beginning or end of the sentences, then replace the first part with:
df.index = df.index.astype("category")
splitted = df.Explanation.str.split()
matches = (
    (splitted.str[:1] + splitted.str[-1:]).explode().loc[lambda s: s.isin(lst)]
)
...
I think this solves your problem.
import pandas as pd
lst = ["fi", "ap", "ko", "co", "ex"]
df = pd.DataFrame([["fi doesn't work correctly"],["apples are cool"],["this works but translation is ko"]],columns=["Explanation"])
extracted = []
for index, row in df.iterrows():
    tempList = []
    rowSplit = row['Explanation'].split(" ")
    for val in rowSplit:
        if val in lst:
            tempList.append(val)
    if len(tempList) > 0:
        extracted.append(','.join(tempList))
    else:
        extracted.append('N/A')
df['Explanation Extracted'] = extracted
The apply function of Pandas might be helpful:
def extract_explanation(row):
    custom_substring = ["fi", "ap", "ko", "co", "ex"]
    substrings = row['Explanation'].split(" ")
    explanation = "N/A"
    for string in substrings:
        if string in custom_substring:
            explanation = string
    return explanation
df['Explanation Extracted'] = df.apply(extract_explanation, axis=1)
The catch here is assumption of only one explanation, but it can be converted into a list, if multiple explanations are expected.
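If multiple explanations per row are expected, the function can collect every match instead of keeping only the last one; a sketch:
def extract_explanations(row):
    custom_substring = ["fi", "ap", "ko", "co", "ex"]
    # keep every word that appears in the substring list
    matches = [w for w in row['Explanation'].split(" ") if w in custom_substring]
    return ", ".join(matches) if matches else "N/A"

df['Explanation Extracted'] = df.apply(extract_explanations, axis=1)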
Option 1
Assuming that one wants to extract the exact strings in the list lst, one can start by creating a regex
regex = f'\\b({"|".join(lst)})\\b'
where \b is a word boundary (the beginning or end of a word), ensuring the match is neither preceded nor followed by additional word characters. So, even though the string ap is in the list lst, the word apple in the dataframe won't be matched.
Then, use pandas.Series.str.extract and, to make it case insensitive, pass re.IGNORECASE:
import re
df['Explanation Extracted'] = df['Explanation'].str.extract(regex, flags=re.IGNORECASE, expand=False)
[Out]:
ID Explanation Explanation Extracted
0 1 fi doesn't work correctly fi
1 2 cap ples are cool NaN
2 3 this works but translation is ko ko
Option 2
One can also use pandas.Series.apply with a custom lambda function.
df['Explanation Extracted'] = df['Explanation'].apply(lambda x: next((i for i in lst if i.lower() in x.lower().split()), 'N/A'))
[Out]:
ID Explanation Explanation Extracted
0 1 fi doesn't work correctly fi
1 2 cap ples are cool N/A
2 3 this works but translation is ko ko
Notes:
.lower() is to make it case insensitive.
.split() is what prevents a partial match: even though ap is in the list, the string apple won't end up in the Explanation Extracted column.

How to escape double inverted commas in json values from pandas dataframe for parsing?

So, this is a long question, but please bear with me as I try my best to explain my problem.
I have a dataframe with one column whose rows contain JSON, and I am able to parse them correctly:
id | email | phone no | details
-------------------------------------------------
0 10 | abc#g.com | 123 | {"name" : "John "Smart" Wick", "address" : "123 c "dumb" road"}
1 12 | xyz#g.com | 789 | {"name" : "Peter Parker", "address" : "L "check" street"}
I want this json to be distributed to columns as:
id | email     | phone no | name              | address
--------------------------------------------------------
10 | abc#g.com | 123      | John "Smart" Wick | 123 c "dumb" road
12 | xyz#g.com | 789      | Peter Parker      | L "check" street
To break the json keys into columns, I am able to do as:
# Check: 1
result = df.pop('details').apply(json.loads).apply(pd.Series).join(df)
This always works until I come across a situation like the one above, where there are inverted commas as part of the JSON value in some field. This data is for representation purposes; in reality I have millions of records, and the column 'details' has 10+ key/value pairs.
For a hot fix, this is what I have done:
# check: 2
df['details'] = df['details'].str.replace('John "Smart" Wick','John Smart Wick')
df['details'] = df['details'].str.replace('123 c "dumb" road','123 c dumb road')
df['details'] = df['details'].str.replace('L "check" street','L check street')
Then I run the code at #check: 1 and it works fine, and afterwards I replace the strings back the other way around. In a million records there are just 2 such cases that break the code, so I found those 2 notorious records and, as a hot fix, replaced the data to remove the inverted commas, re-introducing them after processing.
What I want is a way such that, no matter how many times such issues occur, they don't create a problem: the data passes #check: 1 easily and returns the original value, without me having to catch such records manually and replace them just to make the thing run. I was wondering if regex can do this; I tried a few things, but they were not good enough and kept throwing errors.
I am able to solve the issue at my level, but a universal way to handle all such exceptions in JSON key/value pairs for a column in a pandas dataframe would be a great thing to learn. I know the JSON is not clean here, so basically I am after a way to clean it for any such scenario so that we can split the keys/values into individual columns.
Thanks for any help.
Edit: I have put this in a comment also; if I add escape characters, then it works fine, like:
df['details'] = df['details'].str.replace('John "Smart" Wick','John \\"Smart\\" Wick')
df['details'] = df['details'].str.replace('123 c "dumb" road','123 c \\"dumb\\" road')
df['details'] = df['details'].str.replace('L "check" street','L \\"check\\" street')
This will work too but it still requires me to identify the records manually and add a replace command for those records with escape characters. Can this be done in a loop for the entire 'details' column to self-identify such cases and add escape characters wherever required?
Since there are only two fields in the stringified JSONs, you can use contextual matching with regex, making sure you match any text between the two key names or up to the end of the string.
Here is the regex you can use to match and capture the necessary bits:
(?s)("(?:name|address)"\s*:\s*")(.*?)(?="(?:\s*,\s*"(?:name|address)"|}$))
See the regex demo. The matches contain two adjacent groups where the first one needs to be kept as is, and all " chars in the second group should be prepended with a literal backslash.
Use Series.str.replace to perform this manipulation:
import pandas as pd
df = pd.DataFrame(
    {'text': ['{"name" : "John "Smart" Wick", "address" : "123 c "dumb" road"}']}
)
rx = r'(?s)("(?:name|address)"\s*:\s*")(.*?)(?="(?:\s*,\s*"(?:name|address)"|}$))'
df['text'] = df['text'].str.replace(rx, lambda x: x.group(1) + x.group(2).replace('"',r'\"'), regex=True)
# -> df
# text
# 0 {"name" : "John \"Smart\" Wick", "address" : "123 c \"dumb\" road"}

Repeat pattern using python regex

Well, I'm cleaning a dataset using Pandas.
I have a column called "Country", where different rows can contain numbers or other information in parentheses, which I have to remove. For example:
Australia1,
Perú (country),
3Costa Rica, etc. To do this, I take the column and map over it:
pattern = "([a-zA-Z]+[\s]*[a-aZ-Z]+)(?:[(]*.*[)]*)"
df['Country'] = df['Country'].str.extract(pattern)
But I have a problem with this regex: I cannot match names like "United States of America", because it only captures "United ". How can I repeat the pattern of the first group indefinitely to match the whole name?
Thanks!
In this situation, I will clean the data step by step.
import io
import pandas as pd

df_str = '''
Country
Australia1
Perú (country)
3Costa Rica
United States of America
'''
df = pd.read_csv(io.StringIO(df_str.strip()))
# handle the data
(df['Country']
 .str.replace(r'\d+', '', regex=True)  # remove numbers
 .str.split(r'\(').str[0]              # keep the part before `(`
 .str.strip()                          # strip surrounding spaces
)
Thanks for your answer, it worked!
I found another solution: matching the things that I don't want in the df and removing them.
pattern = "([\s]*[(][\w ]*[)][\s]*)|([\d]*)" #I'm selecting info that I don't want
df['Country'] = df['Country'].replace(pattern, "", regex = True) #I replace that information to an empty string
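To answer the original question directly (repeating the first group), the whole multi-word name can also be captured with a repeated non-capturing group; a sketch, using [^\W\d_] so that accented letters like the ú in Perú match too:
# one or more words made of Unicode letters, separated by whitespace
pattern = r'([^\W\d_]+(?:\s+[^\W\d_]+)*)'
df['Country'] = df['Country'].str.extract(pattern, expand=False)
# 'United States of America' is captured in full; leading digits are
# skipped because extract finds the first match anywhere in the string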

Python and Pandas: Using a function to replace text

I have a pandas dataframe with 2 columns (Line, Sentence) and I need to count the number of times the word "RESULT" appears in each sentence, but I don't want to count it when it appears as "AS A RESULT" or "WAS THE RESULT", etc. (the actual list is quite long and includes other words).
I had this problem before with a list and I used a little trick: I replaced the string, ran the count, and then replaced it back to the original; see the function below (version 1, first pass; version 2, second pass).
import re

def ConfusingStrings(text, version):
    if version == 1:
        text = re.sub(r"AS A RESULT", "XXXASAREXULT", text)
        text = re.sub(r"WAS THE RESULT", "XXXWASTHEREXULT", text)
    if version == 2:
        text = re.sub(r"XXXASAREXULT", "AS A RESULT", text)
        text = re.sub(r"XXXWASTHEREXULT", "WAS THE RESULT", text)
    return text
Now, with the pandas dataframe, I am trying to use the apply function, see below, but to be honest I cannot get this to work.
df['sentence'] = df.apply(ConfusingStrings(df['sentence'],1), axis=1)
Thanks for any input.
UPDATE:
import pandas as pd
c = pd.DataFrame({'A': [1,2,3,4], 'B':['ABC RESULTS FROM XYZ', 'AS A RESULT WE WILL NOT', 'THE RESULT IS THAT', 'THE BORDER WAS THE RESULT OF'], 'C':[1, 0,1,0]})
print (c)
The outcome I need is something like column C (which I built here manually), but bear in mind that this is a simplification; the list of confusing words/expressions is in fact quite long, which is why I am looking to separate it into a function (easier to update, and it keeps the main code cleaner). So basically I need to create column C via a function, I think.
Hope this helps: I just created a dummy data frame where 'ab' is in the include list and 'fc ab' and 'ab ac' are in the exclude list.
import pandas as pd
df = pd.DataFrame({'A': [1,2,3,4,5,6], 'B':['ab', 'ab ac', 'fc ab', 'ab', 'ab ac', 'fc ab']})
list_to_include = ['ab']
list_to_exclude = ['fc ab', 'ab ac']
df['match'] = df['B'].str.count(r'|'.join(list_to_include)) - df['B'].str.count(r'|'.join(list_to_exclude))
match is the column containing the count. You can also wrap the difference in abs() as a safeguard against negative values.
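Applied to the RESULT example from the update, the same include/exclude idea reproduces column C (a sketch; with real data the exclude list would simply grow):
include = ['RESULT']
exclude = ['AS A RESULT', 'WAS THE RESULT']
# every raw occurrence minus the occurrences that belong to a confusing phrase
c['C_calc'] = c['B'].str.count('|'.join(include)) - c['B'].str.count('|'.join(exclude))
# C_calc comes out as 1, 0, 1, 0 -- matching the hand-built column C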

How to delete substrings with specific characters in a pandas dataframe?

I have a pandas dataframe that looks like this:
COL
hi A/P_90890 how A/P_True A/P_/93290 are AP_wueiwo A/P_|iwoeu you A/P_?9028k ?
...
Im fine, what A/P_49 A/P_0.0309 about you?
The expected result should be:
COL
hi how are you?
...
Im fine, what about you?
How can I efficiently remove, from one column or from the full pandas dataframe, all the substrings that start with A/P_?
I tried with this regular expression:
A/P_(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+
However, I do not know if there's a simpler or more robust way of removing all those substrings from my dataframe. How can I remove all the substrings that start with A/P_?
UPDATE
I tried:
df_sess['COL'] = df_sess['COL'].str.replace(r'A/P(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '')
And it works; however, I would like to know if there's a more robust way of doing this, possibly with a regular expression.
One way could be to use \S*, matching all non-whitespace characters after A/P_, and also add \s to remove the whitespace after the string to be removed, such as:
df_sess['col'] = df_sess['col'].str.replace(r'A/P_\S*\s', '', regex=True)
In your input, it seems there is a typo (or at least I think so), so with this input:
df_sess = pd.DataFrame({'col': ['hi A/P_90890 how A/P_True A/P_/93290 are A/P_wueiwo A/P_|iwoeu you A/P_?9028k ?',
                                'Im fine, what A/P_49 A/P_0.0309 about you?']})
print(df_sess['col'].str.replace(r'A/P_\S*\s', '', regex=True))
0 hi how are you ?
1 Im fine, what about you?
Name: col, dtype: object
you get the expected output.
How about:
(df['COL'].replace('A[/P|P][^ ]+', '', regex=True)
          .replace(r'\s+', ' ', regex=True))
Full example:
import pandas as pd
df = pd.DataFrame({
    'COL': ["hi A/P_90890 how A/P_True A/P_/93290 AP_wueiwo A/P_|iwoeu you A/P_?9028k ?",
            "Im fine, what A/P_49 A/P_0.0309 about you?"]
})
df['COL'] = (df['COL'].replace('A[/P|P][^ ]+', '', regex=True)
                      .replace(r'\s+', ' ', regex=True))
Returns (oh, there is an extra space before ?):
COL
0 hi how you ?
1 Im fine, what about you?
Because of a pandas 0.23.0 bug in the replace() function (https://github.com/pandas-dev/pandas/issues/21159), an error occurs when trying to replace by regex pattern:
df.COL.str.replace(regex_pat, '', regex=True)
...
--->
TypeError: Type aliases cannot be used with isinstance().
I would suggest using the pandas.Series.apply function with a precompiled regex pattern (import re first):
In [1170]: df4 = pd.DataFrame({'COL': ['hi A/P_90890 how A/P_True A/P_/93290 are AP_wueiwo A/P_|iwoeu you A/P_?9028k ?', 'Im fine, what A/P_49 A/P_0.0309 about you?']})
In [1171]: pat = re.compile(r'\s*A/?P_[^\s]*')
In [1172]: df4['COL']= df4.COL.apply(lambda x: pat.sub('', x))
In [1173]: df4
Out[1173]:
COL
0 hi how are you ?
1 Im fine, what about you?
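On pandas versions where that bug is fixed, the same pattern can be passed straight to .str.replace (a sketch assuming a recent pandas):
# equivalent one-liner once replace() accepts regex patterns again
df4['COL'] = df4['COL'].str.replace(r'\s*A/?P_\S*', '', regex=True)
# 0            hi how are you ?
# 1    Im fine, what about you?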
