I have a small dataframe with entries regarding motorsport balance of performance.
I am trying to get rid of the string after "#". This works fine with the following code:
for col in df_engine.columns[1:]:
    df_engine[col] = df_engine[col].str.rstrip(r"[\ \# \d.[0-9]+]")
but it leaves the last column unchanged, and I do not understand why.
As additional info, the Ferrari column also has a NaN entry in the last position.
Can anyone provide some help?
Thank you in advance!
rstrip does not accept a regex; it treats its argument as a plain set of characters. As per the documentation:
to_strip : str or None, default None
    Specifying the set of characters to be removed. All combinations of this set of characters will be stripped. If None then whitespaces are removed.
>>> "1.76 # 0.88".rstrip("[\ \# \d.[0-9]+]")
'1.76 # 0.88'
>>> "1.76 # 0.88".rstrip("[\ \# \d.[0-8]+]") # It's not treated as regex, instead All combinations of characters(`[\ \# \d.[0-8]+]`) stripped
'1.76'
You could use the replace method instead.
for col in df.columns[1:]:
    df[col] = df[col].str.replace(r"\s#\s[\d\.]+$", "", regex=True)
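Here is a minimal runnable sketch of that fix, assuming a made-up df_engine shaped like the question's table (the column names and values are invented for illustration):

import pandas as pd

# Hypothetical balance-of-performance table; NaN entries pass through .str methods untouched
df_engine = pd.DataFrame({
    "Parameter": ["Boost", "Weight"],
    "Ferrari": ["1.76 # 0.88", "1040 # 20"],
    "Porsche": ["1.62 # 0.91", "1035 # 15"],
})

for col in df_engine.columns[1:]:
    # Drop a trailing " # <number>" suffix
    df_engine[col] = df_engine[col].str.replace(r"\s#\s[\d\.]+$", "", regex=True)

print(df_engine)
#   Parameter Ferrari Porsche
# 0     Boost    1.76    1.62
# 1    Weight    1040    1035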
What about str.split()?
https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html#pandas.Series.str.split
The function splits a Series into DataFrame columns (when expand=True) using the separator provided.
The following example splits the Series df_engine[col] and produces a DataFrame. The first column of the new DataFrame contains the values preceding the first '#' separator found in each value:
df_engine[col].str.split('#', expand=True)[0]
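A minimal sketch of that approach, using an invented Series in place of df_engine[col]:

import pandas as pd

# Hypothetical column holding "value # tolerance" strings
s = pd.Series(["1.76 # 0.88", "1.62 # 0.91", None])

# expand=True returns a DataFrame; column 0 holds everything before the first '#'
parts = s.str.split("#", expand=True)
print(parts[0].str.strip())
# 0    1.76
# 1    1.62
# 2     NaN
# Name: 0, dtype: object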
I have a dataframe that contains a collection of strings. These strings look something like this:
"oop9-hg78-op67_457y"
I need to cut everything from the underscore to the end in order to match this data with another set. My attempt looked something like this:
df['column'] = df['column'].str[0:'_']
I've tried toying around with .find() in this statement but nothing seems to work. Anybody have any ideas? Any and all help would be greatly appreciated!
You can try .str.split and then access the list with .str, or use .str.extract:
df['column'] = df['column'].str.split('_').str[0]
# or
df['column'] = df['column'].str.extract('^([^_]*)_')
print(df)

           column
0  oop9-hg78-op67
df['column'] = df['column'].str.extract(r'(.*)_', expand=False)  # extract requires a capturing group
could also be used if another option is needed.
Adding to the solution provided above by @Ynjxsjmh
You can use str.extract:
df['column2'] = df['column'].str.extract(r'(^[^_]+)')
Output (as separate column for clarity):
                column         column2
0  oop9-hg78-op67_457y  oop9-hg78-op67
Regex:
( # start capturing group
^ # match start of string
[^_]+ # one or more non-underscore
) # end capturing group
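For reference, a small runnable sketch of this answer on the question's sample string (the column2 name is just for display):

import pandas as pd

df = pd.DataFrame({"column": ["oop9-hg78-op67_457y"]})

# expand=False makes extract return a Series instead of a one-column DataFrame
df["column2"] = df["column"].str.extract(r"(^[^_]+)", expand=False)
print(df)
#                 column         column2
# 0  oop9-hg78-op67_457y  oop9-hg78-op67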
In my Dataframe I'm using the following to replace 'stack' in the Brand column with 'stackoverflow'
df['Brand'] = df['Brand'].replace('stack', 'stackoverflow', regex=True)
Problem is if stackoverflow exists in the column, I end up with stackoverflowoverflow.
Is there a way to replace stack only when the field in the column is exactly equal to stack, without affecting other rows in the column that may contain the keyword stack?
This should do it, and would be useful if you have multiple replacements to do:
replace_dict = {'stack' : 'stackoverflow'}
replacement = {rf'\b{k}\b': v for k, v in replace_dict.items()}
df['Brand'] = df['Brand'].replace(replacement, regex=True)
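A quick sketch of how the word boundaries behave, on invented sample rows:

import pandas as pd

df = pd.DataFrame({"Brand": ["stack", "stackoverflow", "stack exchange"]})

replace_dict = {"stack": "stackoverflow"}
# \b word boundaries stop the pattern matching inside 'stackoverflow'
replacement = {rf"\b{k}\b": v for k, v in replace_dict.items()}
print(df["Brand"].replace(replacement, regex=True))
# 0             stackoverflow
# 1             stackoverflow
# 2    stackoverflow exchange
# Name: Brand, dtype: object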
Discovered the solution:
df['Brand'] = df['Brand'].str.replace(r'(?i)stack\b', 'stackoverflow', regex=True)
I have the following dataframe, and I am trying to remove the spaces between the numbers in the value column and then use pd.to_numeric to change the dtype. The current dtype of value is object.
   periodFrom       value
1  17.11.2020  28 621 240
2  18.11.2020  30 211 234
3  19.11.2020  33 065 243
4  20.11.2020  34 811 330
I have tried multiple variations of this but can't work it out:
df['value'] = df['value'].str.strip()
df['value'] = df['value'].str.replace(',', '').astype(int)
df['value'] = df['value'].astype(str).astype(int)
One option is to apply .str.split() first to split on whitespace (even when a gap is more than one character wide), then concatenate with ''.join() while converting to integers with int(), such as:
# iterate over the index alongside the split parts so the labels stay aligned
for idx, parts in zip(df.index, df['value'].str.split()):
    df.loc[idx, 'value'] = int(''.join(parts))
You can do:
df['value'].replace({' ':''}, regex=True)
Or
df['value'].apply(lambda x: re.sub(' ', '', str(x)))  # requires import re
and append .astype(int) to both.
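Since the question mentions pd.to_numeric, here is a vectorised sketch combining both steps, on a made-up frame matching the question's layout:

import pandas as pd

df = pd.DataFrame({
    "periodFrom": ["17.11.2020", "18.11.2020"],
    "value": ["28 621 240", "30 211 234"],
})

# Strip every whitespace run, then let pandas pick a numeric dtype
df["value"] = pd.to_numeric(df["value"].str.replace(r"\s+", "", regex=True))
print(df["value"])
# 0    28621240
# 1    30211234
# Name: value, dtype: int64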
I have a column in my dataframe like this:
range
"(2,30)"
"(50,290)"
"(400,1000)"
...
and I want to replace the comma (,) with a dash (-). I'm currently using this method, but nothing changes.
org_info_exc['range'].replace(',', '-', inplace=True)
Can anybody help?
Use the vectorised str method replace:
df['range'] = df['range'].str.replace(',','-')
df

      range
0    (2-30)
1  (50-290)
EDIT: so if we look at what you tried and why it didn't work:
df['range'].replace(',','-',inplace=True)
from the docs we see this description:
str or regex: str: string exactly matching to_replace will be replaced with value
So because the str values do not match, no replacement occurs, compare with the following:
df = pd.DataFrame({'range':['(2,30)',',']})
df['range'].replace(',','-', inplace=True)
df['range']
0    (2,30)
1         -
Name: range, dtype: object
here we get an exact match on the second row and the replacement occurs.
For anyone else arriving here from a Google search on how to do a string replacement on all columns (for example, if one has multiple columns like the OP's 'range' column):
Pandas has a built in replace method available on a dataframe object.
df.replace(',', '-', regex=True)
Source: Docs
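A minimal sketch, with a second invented column to show the replacement applies everywhere:

import pandas as pd

df = pd.DataFrame({
    "range": ["(2,30)", "(50,290)"],
    "other": ["(400,1000)", "(1,2)"],
})

# regex=True makes replace work on substrings in every column
print(df.replace(",", "-", regex=True))
#       range       other
# 0    (2-30)  (400-1000)
# 1  (50-290)       (1-2)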
If you only need to replace characters in one specific column, and somehow regex=True and inplace=True both failed, I think this way will work:
data["column_name"] = data["column_name"].apply(lambda x: x.replace("characters_need_to_replace", "new_characters"))
The lambda here acts as a small function that apply runs once per entry, much like a for loop would.
x represents each entry in the current column in turn.
The only thing you need to do is to change the "column_name", "characters_need_to_replace" and "new_characters".
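A runnable sketch with invented column and character values:

import pandas as pd

data = pd.DataFrame({"column_name": ["foo-bar", "baz-qux"]})

# Each x is a single string from the column, so plain Python str.replace applies
data["column_name"] = data["column_name"].apply(lambda x: x.replace("-", "_"))
print(data["column_name"])
# 0    foo_bar
# 1    baz_qux
# Name: column_name, dtype: object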
Replace all spaces with underscores in the column names:
data.columns= data.columns.str.replace(' ','_',regex=True)
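For example, assuming hypothetical column names containing spaces:

import pandas as pd

data = pd.DataFrame(columns=["First Name", "Last Name"])
data.columns = data.columns.str.replace(" ", "_", regex=True)
print(list(data.columns))
# ['First_Name', 'Last_Name']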
In addition, for those looking to replace more than one character in a column, you can do it using regular expressions:
import re
chars_to_remove = ['.', '-', '(', ')']
regular_expression = '[' + re.escape(''.join(chars_to_remove)) + ']'
df['string_col'].str.replace(regular_expression, '', regex=True)
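A self-contained sketch of that character-class approach, on invented sample strings:

import re
import pandas as pd

chars_to_remove = ['.', '-', '(', ')']
# re.escape guards metacharacters like '.' and '(' inside the class
regular_expression = '[' + re.escape(''.join(chars_to_remove)) + ']'

s = pd.Series(['(2.30)', '50-290'])
print(s.str.replace(regular_expression, '', regex=True))
# 0      230
# 1    50290
# dtype: object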
Similar to the answer by Nancy K, this works for me:
data["column_name"] = data["column_name"].apply(lambda x: x.replace("characters_need_to_replace", "new_characters"))
Note that inside apply each x is already a plain string, so this is Python's str.replace, not the pandas .str accessor.
If you want to remove two or more characters from a string, for example '$' and ',':
Column_Name
===========
$100,000
$1,100,000
... then use:
data.Column_Name.str.replace("[$,]", "", regex=True)
=> ['100000', '1100000'] (still strings; append .astype(int) to get integers)
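A short sketch showing the full chain including the integer conversion:

import pandas as pd

data = pd.DataFrame({"Column_Name": ["$100,000", "$1,100,000"]})

# str.replace leaves strings behind, so convert explicitly afterwards
cleaned = data.Column_Name.str.replace("[$,]", "", regex=True).astype(int)
print(cleaned)
# 0     100000
# 1    1100000
# Name: Column_Name, dtype: int64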