Pandas: Find pattern after specific text and replace the pattern - python

I have a sample dataframe below.
df = pd.DataFrame({'col1' : ['The IO operation at logical block address 0x0 for Disk1 (PDO name: \\Device00024','fddasfsa'],'col2': [1,2])
I like to replace the characters between 'Device' and ')' to 'xxxxxx'. Is it possible to do such replacement in pandas?
I thought I can do the following. The code ran with no issue but the replacement never happen.
df['col1'] = df['col1'].replace(r'\\Device(.*)', 'xxxxxx,regex=True)

You could use str.replace here:
df["col1"] = df["col1"].str.replace(r'\bDevice\d+', 'Devicexxxxxx')
The code sample you gave above won't even compile, but it actually looks on the right track. You made the same mistake I initially made here. You need to include Device in the replacement, not just xxxxxx, as your regex match will consume the device string along with the numbers.

Just replace the digits immediately to the left of Device. Code below
df['col1'].str.replace('(?<!Device)\d+','xxxxx')

Another solution, if you want to have the same number of x as digits:
df["col1"] = df["col1"].str.replace(
r"(?<=Device)(\d+)", lambda g: "x" * len(g.group(1)), regex=True
)
print(df)
Prints:
col1 col2
0 adbsdfklj (\Devicexxxxxx) 1
1 fddasfsa 2

Related

How to replace a character with \ in excel file [duplicate]

I have a column in my dataframe like this:
range
"(2,30)"
"(50,290)"
"(400,1000)"
...
and I want to replace the , comma with - dash. I'm currently using this method but nothing is changed.
org_info_exc['range'].replace(',', '-', inplace=True)
Can anybody help?
Use the vectorised str method replace:
df['range'] = df['range'].str.replace(',','-')
df
range
0 (2-30)
1 (50-290)
EDIT: so if we look at what you tried and why it didn't work:
df['range'].replace(',','-',inplace=True)
from the docs we see this description:
str or regex: str: string exactly matching to_replace will be replaced
with value
So because the str values do not match, no replacement occurs, compare with the following:
df = pd.DataFrame({'range':['(2,30)',',']})
df['range'].replace(',','-', inplace=True)
df['range']
0 (2,30)
1 -
Name: range, dtype: object
here we get an exact match on the second row and the replacement occurs.
For anyone else arriving here from Google search on how to do a string replacement on all columns (for example, if one has multiple columns like the OP's 'range' column):
Pandas has a built in replace method available on a dataframe object.
df.replace(',', '-', regex=True)
Source: Docs
If you only need to replace characters in one specific column, somehow regex=True and in place=True all failed, I think this way will work:
data["column_name"] = data["column_name"].apply(lambda x: x.replace("characters_need_to_replace", "new_characters"))
lambda is more like a function that works like a for loop in this scenario.
x here represents every one of the entries in the current column.
The only thing you need to do is to change the "column_name", "characters_need_to_replace" and "new_characters".
Replace all commas with underscore in the column names
data.columns= data.columns.str.replace(' ','_',regex=True)
In addition, for those looking to replace more than one character in a column, you can do it using regular expressions:
import re
chars_to_remove = ['.', '-', '(', ')', '']
regular_expression = '[' + re.escape (''. join (chars_to_remove)) + ']'
df['string_col'].str.replace(regular_expression, '', regex=True)
Almost similar to the answer by Nancy K, this works for me:
data["column_name"] = data["column_name"].apply(lambda x: x.str.replace("characters_need_to_replace", "new_characters"))
If you want to remove two or more elements from a string, example the characters '$' and ',' :
Column_Name
===========
$100,000
$1,100,000
... then use:
data.Column_Name.str.replace("[$,]", "", regex=True)
=> [ 100000, 1100000 ]

How to identify and remove all the special characters from the data frame

col1
Ntwk Lane 0 cannot on high operational\n
TX_PWR ALARM. TX_PWR also fluctuates over time (found Tx power dropped to -2dBm also raises TX_PWR_LO_ALRM
module report ASIC_PLL_REF_CLK_FREQ_ERR(20008=0x800000) and HOST_REF_PLL_2(20014=0x2)
I want to remove all the special characters from the column how to do that. I need only alphabets rest I need to remove
You can use regular expression:
import re
df['col2'] = df['col1'].apply(lambda x: re.compile('[^a-zA-Z]').sub('', x))
As suggested by #9769953
df['col2'] = df['col1'].str.replace('[^a-zA-Z]', '', regex=True)
is also a much cleaner approach. Same performance, but cleaner.

Removing list of strings from column in pandas

I would need to remove a list of strings:
list_strings=['describe','include','any']
from a column in pandas:
My_Column
include details about your goal
describe expected and actual results
show some code anywhere
I tried
df['My_Column']=df['My_Column'].str.replace('|'.join(list_strings), '')
but it removes parts of words.
For example:
My_Column
details about your goal
expected and actual results
show some code where # here it should be anywhere
My expected output:
My_Column
details about your goal
expected and actual results
show some code anywhere
Use the "word boundary" expression \b like.
In [46]: df.My_Column.str.replace(r'\b{}\b'.format('|'.join(list_strings)), '')
Out[46]:
0 details about your goal
1 expected and actual results
2 show some code anywhere
Name: My_Column, dtype: object
Your issue is that pandas doesn't see words, it simply sees a list of characters. So when you ask pandas to remove "any", it doesn't start by delineating words. So one option would be to do that yourself, maybe something like this:
# Your data
df = pd.DataFrame({'My_Column':
['Include details about your goal',
'Describe expected and actual results',
'Show some code anywhere']})
list_strings=['describe','include','any'] # make sure it's lower case
def remove_words(s):
if s is not None:
return ' '.join(x for x in s.split() if x.lower() not in list_strings)
# Apply the function to your column
df.My_Column = df.My_Column.map(remove_words)
The first parameter of .str.replace() method must be a string or compiled regex; not a list as you have.
You probably wanted
list_strings=['Describe','Include','any'] # Note capital D and capital I
for s in [f"\\b{s}\\b" for s in list_strings]: # surrounded word boundaries (\b)
df['My_Column'] = df['My_Column'].str.replace(s, '')
to obtain
My_Column
0 details about your goal
1 expected and actual results
2 Show some code anywhere

How to use str.replace by replacing from a specific character and on/forward

This is a extract from a table that i want to clean.
What i've tried to do:
df_sb['SB'] = df_sb['SB'].str.replace('-R*', '', df_sb['SB'].shape[0])
I expected this (Without -Rxx):
But i've got this (Only dash[-] and character "R" where replaced):
Could you please help me get the desired result from item 4?
str.replace works here, you just need to use a regular expression. So your original answer was very close!
df = pd.DataFrame({"EO": ["A33X-22EO-06690"] * 2, "SB": ["A330-22-3123-R01", "A330-22-3123-R02"]})
print(df)
EO SB
0 A33X-22EO-06690 A330-22-3123-R01
1 A33X-22EO-06690 A330-22-3123-R02
df["new_SB"] = df["SB"].str.replace(r"-R\d+$", "")
print(df)
EO SB new_SB
0 A33X-22EO-06690 A330-22-3123-R01 A330-22-3123
1 A33X-22EO-06690 A330-22-3123-R02 A330-22-3123
What the regular expression means:
r"-R\d+$" means find anywhere in the string we see that characters "-R" followed by 1 or more digits (\d+). Then we constrain this to ONLY work if that pattern occurs at the very end of the string. This way we don't accidentally replace an occurrence of -R(digits) that happens to be in the middle of the SB string (e.g. we don't remove "-R101" in the middle of: "A330-22-R101-R20". We would only remove "-R20"). If you would actually like to remove both "-R101" and "-R20", remove the "$" from the regular expression.
An example using str.partition():
s = ['A330-22-3123-R-01','A330-22-3123-R-02']
for e in s:
print(e.partition('-R')[0])
OUTPUT:
A330-22-3123
A330-22-3123
EDIT:
Not tested, but in your case:
df_sb['SB'] = df_sb['SB'].str.partition('-R')[0]

Why np.where, with a condition, does not work with only one row in a dataframe if condition not satisfied

Here is an example:
cars2 = {'Brand': ['Hon*da\nCivic', 'BM*AMT*B6*W'],'Price': [22000, 55000]}
df2 = pd.DataFrame(cars2, columns = ['Brand', 'Price'])
df2['Allowed_Amount'] = np.where(
df2['Brand'].apply(lambda x: x.count("AMT" + "*" + "B6") > 0),
df2['Brand'].str.split("AMT" + "*").str[1].str.split("B6").str[1].str[1:].str.split('\n').str[0], 0.00)
Output:
Brand Price Allowed_Amount
0 Hon*da\nCivic 22000 0
1 BM*AMT*B6*W 55000 W
Which is exactly what I need.
However, if the df contains only one row, which does not satisfy the condition, I get an error:
cars = {'Brand': ['Hon*da\nCivic'],'Price': [22000]}
df = pd.DataFrame(cars, columns = ['Brand', 'Price'])
df['Allowed_Amount'] = np.where(
df['Brand'].apply(lambda x: x.count("AMT" + "*" + "B6") > 0),
df['Brand'].str.split("AMT" + "*").str[1].str.split("B6").str[1].str[1:].str.split('\n').str[0], 0.00)
Output:
AttributeError: Can only use .str accessor with string values!
What I need:
Brand Price Allowed_Amount
0 Hon*da\nCivic 22000 0
Why doesn't it exit when the condition is not met? How to make this code work with one row as well?
The problem with your code is that df['Brand'].str.split("AMT" + "")* in the
"negative" case returns a list of size 1 (the whole source string in a
single element).
In this case .str[1] (following the previous code) returns None and
"following" methods in your code can not be called on it.
But in Pandas the actual exception is raised only if the above case
occurs for each source element, just like in the case of df.
I also think that such a long sequence of str.split, str and index
selections is difficult to read.
Try another approach based on extract with a regex:
df['Allowed_Amount'] = df['Brand'].str.extract(r'AMT\*.*?B6.(.*)').fillna(0)
Details of the regex:
AMT\* - Match AMT and an asterisk.
.*? - Match any number of characters, as little as possible (chars
between "AMT*" and "B6", if any). Maybe you can drop this fragment
from the regex.
B6 - Represent themselves.
. - Match any single char (a counterpart of [1:] in your code).
(.*) - Match text up to a newline (excluding, as the dot does not match
the newline) or to the end of string, as a capturing group, so this
is just the extracted content.
If the above regex doesn't match, NaN is returned for this row.
These NaN values are then replaced with 0, due to call to fillna(0)
afterwards.
Try the same on df2.
So this way you will achieve your desired result with shorter and more readable code.
Of course, it requires some knowledge of regular expressions but it is
definitely worth to take some time to learn them.
Edit following the question
To replace the literal star it the regex with a given delimiter,
you can define the following function, generating the content
for the new column:
def myExtract(df, delimiter='*'):
pat = rf'AMT\{delimiter}B6.(.*)'
return df['Brand'].str.extract(pat).fillna(0)
As you can see:
the delimiter is incorporated into the regex using f-string
feature (can co-exist with r-string),
it must be preceded with a backslash, to treat it literally
(not as a special regex char).
And to generate the new colum, just call this function, passing at
least the source DataFrame (and optionally the right delimiter):
df['Allowed_Amount'] = myExtract(df); df
The same for df2.

Categories

Resources