Remove a bullet character in a Pandas column - python

Consider:
users_df['LASTNAME_TEST'] = users_df['LASTNAME'].replace(u'•','')
for item in users_df['LASTNAME_TEST']:
    if u'•' in item:
        print('yay')
I'm trying to remove the special bullet character in this column using replace, but this still prints yay. What am I missing?

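Per SeaBean's comment, add regex=True to the call to .replace(). Without it, Series.replace only replaces cells whose entire value equals '•'; with regex=True it also replaces matches inside longer strings:
users_df['LASTNAME_TEST'] = users_df['LASTNAME'].replace(u'•', '', regex=True)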
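To make the difference concrete, here is a minimal sketch. The sample last names are made up for illustration; only the column name comes from the question. It compares the default whole-value replacement with the two substring variants.

```python
import pandas as pd

# Hypothetical sample data; the real users_df is not shown in the question.
users_df = pd.DataFrame({"LASTNAME": ["•Smith", "Jones•", "•"]})

# Default Series.replace: only cells that are exactly "•" are replaced.
whole_value = users_df["LASTNAME"].replace("•", "")

# regex=True: the bullet is replaced wherever it occurs inside a cell.
with_regex = users_df["LASTNAME"].replace("•", "", regex=True)

# Series.str.replace always works on substrings; regex=False treats "•" literally.
with_str = users_df["LASTNAME"].str.replace("•", "", regex=False)

print(whole_value.tolist())
print(with_regex.tolist())
print(with_str.tolist())
```

Only the last cell changes in the first call, which is why the loop in the question keeps printing yay.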

Related

How can I replace substring from string by a list in a column dataframe?

I need to replace substrings in a column value in dataframe
Example: I have this column 'code' in a dataframe (in reality, the dataframe is very large)
3816R(motor) #I need '3816R'
97224(Eletro)
502812(Defletor)
97252(Defletor)
97525(Eletro)
5725 ( 56)
And I have this list to replace the values:
list = ['(motor)', '(Eletro)', '(Defletor)', '(Eletro)', '( 56)']
I've tried a lot of methods, like:
df['code'] = df['code'].str.replace(list, '')
I also tried regex=True, but none of the methods removed the substrings.
How can I do that?
You can try regex replace and regex or condition: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html
https://www.ocpsoft.org/tutorials/regular-expressions/or-in-regex/
l = ['(motor)', '(Eletro)', '(Defletor)', '( 56)']
l = [s.replace('(', r'\(').replace(')', r'\)') for s in l]
regex_str = f"({'|'.join(l)})"
df['code'] = df['code'].str.replace(regex_str, '', regex=True)
The regex_str will end up with something like
"(\(motor\)|\(Eletro\)|\(Defletor\)|\( 56\))"
If you are certain any and all rows follow the format provided, you could attempt the following by using a lambda function:
df['code_clean'] = df['code'].apply(lambda x: x.split('(')[0])
You can try the regular expression match method:
https://docs.python.org/3/library/re.html#re.Pattern.match
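df['code'] = df['code'].apply(lambda x: re.match(r'^(\w+)\s*\(', x).group(1))
The first part of the regular expression, ^(\w+), creates a capturing group of the letters or digits at the start of the string. The \s*\( part then requires an opening parenthesis, optionally preceded by whitespace, so rows such as '5725 ( 56)' also match. group(1) then extracts the captured text. This needs import re, and re.match returns None for any row that does not fit the pattern.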
str.replace works with one string, not a list of strings, so you could loop through the list (regex=False makes the parentheses match literally):
rmlist = ['(motor)', '(Eletro)', '(Defletor)', '(Eletro)', '( 56)']
for repl in rmlist:
    df['code'] = df['code'].str.replace(repl, '', regex=False)
Alternatively, if your bracketed substring is always at the end, split at "(" and discard the additional column that is generated. This avoids looping over the list and is likely to be faster:
df["code"] = df["code"].str.split(pat="(", n=1, expand=True)[0]
str.split is reasonably fast.
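For the alternation approach above, the manual escaping can also be delegated to re.escape. The following is a small sketch under the assumption that the column holds plain strings like the sample rows; the code_clean column name and the trailing .str.strip() are additions for illustration.

```python
import re
import pandas as pd

# Sample rows taken from the question.
df = pd.DataFrame(
    {"code": ["3816R(motor)", "97224(Eletro)", "502812(Defletor)", "5725 ( 56)"]}
)
to_remove = ["(motor)", "(Eletro)", "(Defletor)", "( 56)"]

# re.escape makes every substring safe to use as a literal inside a regex,
# and "|" joins them into a single "match any of these" pattern.
pattern = "|".join(re.escape(s) for s in to_remove)

# One vectorised pass removes all substrings; strip() drops leftover spaces.
df["code_clean"] = df["code"].str.replace(pattern, "", regex=True).str.strip()
print(df["code_clean"].tolist())
```

A single combined pattern means one pass over the column instead of one pass per substring.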

How to replace a character with \ in excel file [duplicate]

I have a column in my dataframe like this:
range
"(2,30)"
"(50,290)"
"(400,1000)"
...
and I want to replace the , comma with - dash. I'm currently using this method but nothing is changed.
org_info_exc['range'].replace(',', '-', inplace=True)
Can anybody help?
Use the vectorised str method replace:
df['range'] = df['range'].str.replace(',','-')
df
range
0 (2-30)
1 (50-290)
EDIT: so if we look at what you tried and why it didn't work:
df['range'].replace(',','-',inplace=True)
From the docs we see this description:
str or regex: str: string exactly matching to_replace will be replaced with value
So because no cell value is exactly ',', no replacement occurs. Compare with the following:
df = pd.DataFrame({'range':['(2,30)',',']})
df['range'].replace(',','-', inplace=True)
df['range']
0 (2,30)
1 -
Name: range, dtype: object
here we get an exact match on the second row and the replacement occurs.
For anyone else arriving here from Google search on how to do a string replacement on all columns (for example, if one has multiple columns like the OP's 'range' column):
Pandas has a built in replace method available on a dataframe object.
df.replace(',', '-', regex=True)
Source: Docs
If you only need to replace characters in one specific column, and regex=True and inplace=True both failed for some reason, I think this way will work:
data["column_name"] = data["column_name"].apply(lambda x: x.replace("characters_need_to_replace", "new_characters"))
apply calls the lambda once per entry, so it works like a for loop in this scenario.
x here represents each entry in the current column.
The only thing you need to do is change "column_name", "characters_need_to_replace" and "new_characters".
Replace all spaces with underscores in the column names:
data.columns = data.columns.str.replace(' ', '_', regex=True)
In addition, for those looking to replace more than one character in a column, you can do it using regular expressions:
import re
chars_to_remove = ['.', '-', '(', ')', '']
regular_expression = '[' + re.escape(''.join(chars_to_remove)) + ']'
df['string_col'] = df['string_col'].str.replace(regular_expression, '', regex=True)
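Almost similar to the answer by Nancy K, this works for me when apply is called on a DataFrame selection, so that x is a whole column (a Series) and has the .str accessor:
data[["column_name"]] = data[["column_name"]].apply(lambda x: x.str.replace("characters_need_to_replace", "new_characters"))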
If you want to remove two or more characters from a string, for example the characters '$' and ',':
Column_Name
===========
$100,000
$1,100,000
... then use:
data.Column_Name.str.replace("[$,]", "", regex=True)
=> [ 100000, 1100000 ]
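Note that str.replace returns strings, so the cleaned values are still text. Assuming every cleaned value is a valid integer, a sketch of the full round trip could look like this (the as_int name is hypothetical):

```python
import pandas as pd

# Sample values from the answer above.
data = pd.DataFrame({"Column_Name": ["$100,000", "$1,100,000"]})

# Inside a character class, "$" and "," are matched literally.
cleaned = data["Column_Name"].str.replace("[$,]", "", regex=True)

# The result is still text; convert explicitly if numbers are needed.
as_int = cleaned.astype(int)

print(cleaned.tolist())
print(as_int.tolist())
```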

Regex lambda doesn't iterate through df - prints first row's result for all

def split_it(email):
    return re.findall(r"[^#]+#[^#]+(?:\.com|\.se|\.br|\.org)", s)
df['email_list'] = df['email'].apply(lambda x: split_it(x))
This code seems to work for the first row of the df, but then will print the result of the first row on all other rows.
Is it not iterating through all rows? Or does it print the result of row 1 on all rows?
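apply does iterate over every row, but split_it ignores its email argument and searches a variable named s instead, so every row gets the result for whatever s holds. Passing email to re.findall fixes that, but you do not need to use apply here at all; use Series.str.findall directly: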
df['email_list'] = df['email'].str.findall(r"[^#]+#[^#]+(?:\.com|\.se|\.br|\.org)")
If there are several emails per row, you can join the results:
df['email_list'] = df['email'].str.findall(r"[^#]+#[^#]+(?:\.com|\.se|\.br|\.org)").str.join(", ")
Note that the email pattern can be enhanced in many ways, but I would add \s into the negated character classes to exclude whitespace matching, and move \. outside the group to avoid repetition:
r"[^\s#]+#[^\s#]+\.(?:com|se|br|org)"
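As a rough check, the sketch below runs both routes on made-up rows. The sample addresses are hypothetical and use the same '#' separator as the patterns above; PATTERN, via_apply and via_str are names introduced here for illustration.

```python
import re
import pandas as pd

# The enhanced pattern suggested above.
PATTERN = r"[^\s#]+#[^\s#]+\.(?:com|se|br|org)"

def split_it(email):
    # Search the argument that was passed in, not an outer variable.
    return re.findall(PATTERN, email)

# Hypothetical rows, one of them holding two addresses.
df = pd.DataFrame({"email": ["alice#example.com bob#site.org", "carol#mail.se"]})

via_apply = df["email"].apply(split_it)      # row-by-row Python function
via_str = df["email"].str.findall(PATTERN)   # vectorised string method

print(via_apply.tolist())
print(via_str.tolist())
```

Both columns hold one list per row, and each row now gets its own matches.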

How to replace symbols from object in dataframe

I'm trying to replace symbols from object in python, I used
df_summary.replace('\(|\)!,"-', '', regex=True)
but it didn't change anything.
The replace function does not operate in place. This means that your dataframe is left unchanged, and the result is returned as the return value of the replace function.
You can use the inplace parameter of replace:
df_summary.replace('\(|\)!,"-', '', regex=True, inplace=True)
Most pandas functions are not in place; to keep the result you need either the inplace argument or to assign the result to a new dataframe.
You can either do
df_summary.replace('\(|\)!,"-', '', regex=True, inplace=True)
or
df_summary = df_summary.replace('\(|\)!,"-', '', regex=True)
When you only do df_summary.replace..., this line returns a new dataframe, and you forgot to save it. Please add comments if you need further help.
Apart from adding regex=True or assigning the result to df_summary again, you are using the pattern:
`\(|\)!,"-`
That matches either ( OR the literal string )!,"-
As you are referring to symbols, and you want to replace all separate chars ( ) ! , " - you can use a repeated character class [()!,"-]+ to replace multiple consecutive matches at once.
df_summary.replace('[()!,"-]+', '', regex=True, inplace=True)
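The two points above (the result must be kept, and a character class matches each symbol separately) can be seen together in a short sketch. The frame contents are invented for illustration; only the df_summary name and the pattern come from the thread.

```python
import pandas as pd

# Hypothetical data: one text column with symbols, one numeric column.
df_summary = pd.DataFrame({"text": ['a(b)!', '"x-y",'], "n": [1, 2]})

# Calling replace without keeping the result leaves the frame untouched.
df_summary.replace('[()!,"-]+', "", regex=True)
unchanged = df_summary["text"].tolist()

# Assigning the result back keeps the cleaned values.
df_summary = df_summary.replace('[()!,"-]+', "", regex=True)

print(unchanged)
print(df_summary["text"].tolist())
```

The numeric column is left alone because regex replacement only applies to string values.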

Python - regex, blank element at the end of the list?

I have a code
print(re.split(r"[\s\?\!\,\;]+", "Holy moly, feferoni!"))
which results
['Holy', 'moly', 'feferoni', '']
How can I get rid of this last blank element, what caused it?
If this is a dirty way to get rid of punctuation and spaces from a string, how else can I write but in regex?
Expanding on what @HamZa said in his comment, you would use re.findall and a negated character set:
>>> from re import findall
>>> findall(r"[^\s?!,;]+", "Holy moly, feferoni!")
['Holy', 'moly', 'feferoni']
>>>
You get the empty string as the last element of your list because the RegEx splits at the final !. The split gives you what is before the ! and what is after it, but after it there is simply nothing, i.e. an empty string. You would have the same problem in the middle of the string if you had not wisely added the + to your RegEx.
If you want to elegantly get rid of the optional empty string, do:
filter(None, re.split(r"[\s?!,;]+", "Holy moly, feferoni!"))
In Python 3, filter returns an iterator, so add a call to list if you can't work with an iterator. This will result in:
['Holy', 'moly', 'feferoni']
What this does is remove every element that is not truthy. The filter function generally only returns elements that satisfy a requirement given as a function, but if you pass None it checks the truth value of the element itself. Because an empty string is falsy and every other string is truthy, it removes every empty string from the list.
Also note I removed the escaping of special characters in the character class, as it is simply not necessary and just makes the RegEx harder to read.
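The options discussed so far can be compared side by side. This is a minimal sketch using only the standard library; the variable names are chosen here for illustration.

```python
import re

text = "Holy moly, feferoni!"

# Splitting on the separators leaves an empty string after the final "!".
parts = re.split(r"[\s?!,;]+", text)

# filter(None, ...) drops falsy items; list() materialises the iterator.
filtered = list(filter(None, parts))

# Matching the words directly never produces empty strings.
words = re.findall(r"[^\s?!,;]+", text)

print(parts)
print(filtered)
print(words)
```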
The first thing which comes to my mind is something like this:
>>> mystring = re.split(r"[\s\?\!\,\;]+", "Holy moly, feferoni!")
>>> mystring
['Holy', 'moly', 'feferoni', '']
>>> mystring.pop()
''
>>> print(mystring)
['Holy', 'moly', 'feferoni']
__import__('re').findall(r'[^\s?!,;]+', 'Holy moly, feferoni!')
