How to remove all occurrences of a character in a dataframe? - python

I have a dataframe with multiple columns and most have special characters like $, % or ^ and so on... How can I delete these characters throughout the entire data frame? I only know how to delete by column, for example:
df['Column'] = df['Column'].str.replace(r'^\d+', '', regex=True)

Just noticed that pandas.DataFrame.replace does not work with special characters like $, %, ^ and so on. So you can use the following snippet to get rid of these special characters from the whole dataframe. We need to make sure that a column is of string type before applying str.replace:
import pandas as pd
from pandas.api.types import is_string_dtype

f = pd.DataFrame({'A': [1, 2, 3],
                  'B': [4, 5, 6],
                  'C': ['f;', 'd:', 'sda$sd'],
                  'D': ['s%', 'd;', 'd^p'],
                  'E': [5, 3, 6],
                  'F': [7, 4, 3]})
f looks as follows:
   A  B       C    D  E  F
0  1  4      f;   s%  5  7
1  2  5      d:   d;  3  4
2  3  6  sda$sd  d^p  6  3
Now use a loop to replace the strings.
for col in f.columns:
    if is_string_dtype(f[col]):
        # regex=True must be passed explicitly in newer pandas, where the default changed to False
        f[col] = f[col].str.replace(r'[^A-Za-z0-9\-\s]+', '', regex=True)
Output:
   A  B      C   D  E  F
0  1  4      f   s  5  7
1  2  5      d   d  3  4
2  3  6  sdasd  dp  6  3
UPDATE:
Pandas version 0.24.1 does not replace some special characters, but versions 0.23.4 and 0.25.1 do. So if you have one of the working versions, you can simply use pandas.DataFrame.replace to delete the special characters as follows. Make sure to escape these characters with \.
f = f.replace({r'\$': '', r'\^': '', '%': ''}, regex=True)
This results in the same output as above.

I think you want:
pandas.DataFrame.replace(to_replace, value)
The to_replace parameter accepts regular expressions (with regex=True), and the replacement covers the whole df. Hope this helps.
Here's the documentation on this method:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html#pandas.DataFrame.replace
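For example, a minimal sketch (the frame here is a made-up sample, not the question's data) that strips a few special characters from every column at once:
import pandas as pd

df = pd.DataFrame({'A': ['a$b', 'c%d'], 'B': ['e^f', 'gh']})  # hypothetical sample data
# escape the regex metacharacters so they are matched literally, then replace frame-wide
cleaned = df.replace(r'[\$%\^]', '', regex=True)
print(cleaned)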

Related

Recreate pandas dataframe from a question on Stack Overflow

This is a question from someone who tries to answer questions about pandas dataframes. Consider a question whose dataset is given only as its printed representation (not the actual code), for example:
   numbers  letters       dates         all
0        1        a  20-10-2020         NaN
1        2        b  21-10-2020           b
2        3        c  20-11-2020           4
3        4        d  20-10-2021  20-10-2020
4        5        e  10-10-2020        3.14
Is it possible to quickly import this into Python as a dataframe or as a dictionary? So far I have copied the given text and transformed it into a dataframe by turning the values into strings (adding '') and so on.
I think there are two 'solutions' for this:
Make a function that, given the text as input, somehow transforms it into a dataframe.
Use some feature of the text editor (I use Spyder) which can do this trick for us.
read_clipboard
You can use pd.read_clipboard() optionally with a separator (e.g. pd.read_clipboard('\s\s+') if you have datetime strings or spaces in column names and columns are separated by at least two spaces):
select text on the question and copy to clipboard (ctrl+c/command-c)
move to python shell or notebook and run pd.read_clipboard()
Note that this doesn't work well on all platforms.
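A minimal sketch of that workflow (assuming the question's table is currently on the clipboard):
import pandas as pd

# after selecting and copying the table above (Ctrl+C / Cmd+C)
df = pd.read_clipboard(r'\s\s+')  # two-or-more-spaces separator keeps padded columns together
print(df.dtypes)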
read_csv + io.StringIO
For more complex formats, combine read_csv with io.StringIO:
import io
import pandas as pd

data = '''
numbers letters dates all
0 1 a 20-10-2020 NaN
1 2 b 21-10-2020 b
2 3 c 20-11-2020 4
3 4 d 20-10-2021 20-10-2020
4 5 e 10-10-2020 3.14
'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
df
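A small follow-up observation (not part of the original answer): because each data row has one more whitespace-separated field than the header, read_csv uses that first unnamed column as the index, so the frame keeps the 0-4 row labels from the question.
print(df.index)  # the 0-4 labels read from the first unnamed column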

Regex Sub and Pandas

I am struggling to replace a string in a pandas cell with data from a dictionary. I have a pandas frame:
import pandas as pd
import numpy as np
import re
f = {'GAAP':['<1>','2','3','4'],'CP':['5','6','<7>','8']}
filter = pd.DataFrame(data=f)
filter
and a dictionary:
d = {'GAAP':['100','101'],'CP':['500','501','502']}
d
I am trying to get the following output:
op = {'GAAP':['100|101','2','3','4'],'CP':['5','6','500|501|502','8']}
op = pd.DataFrame(data=op)
op
I tried something like:
def rep1(fr, di):
    op = re.sub(r'\<.*?\>', fr, di)
    return op

a = '|'.join(d['GAAP'])
op = rep1(filter['GAAP'], a)
op
but I get an error saying Series objects are mutable and cannot be hashed. Any suggestions on what I am doing wrong?
Let us try pd.to_numeric to convert the <> entries to NaN, then fillna:
filter=filter.apply(pd.to_numeric,errors='coerce').fillna(pd.Series(d).str.join('|'))
      GAAP           CP
0  100|101            5
1        2            6
2        3  500|501|502
3        4            8
One way about it, using replace: build the regexes that match the <> placeholders and pair them with their replacements from the dictionary.
outcome = filter.replace({'GAAP': r"\<\d\>", 'CP': r"\<\d\>"},
                         {'GAAP': "|".join(d['GAAP']), 'CP': "|".join(d['CP'])},
                         regex=True)
      GAAP           CP
0  100|101            5
1        2            6
2        3  500|501|502
3        4            8
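As a further illustration (not from either answer, and assuming the same filter frame and dictionary d), a per-column Series.str.replace would also work here:
for col in filter.columns:
    # swap each <...> placeholder for the joined values from d; other cells are left untouched
    filter[col] = filter[col].str.replace(r'<.*?>', '|'.join(d[col]), regex=True)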

How to get some string of dataframe column?

I have a dataframe like this.
print(df)
      ID  ...  Control
0  PDF-1  ...      NaN
1  PDF-3  ...      NaN
2  PDF-4  ...      NaN
I want to get only the number from the ID column, so the result will be:
1
3
4
How can I get just part of the strings in a dataframe column?
How about just replacing the common PDF- prefix?
df['ID'].str.replace('PDF-', '')
Could you please try the following.
df['ID'].replace(regex=True, to_replace=r'([^\d])', value=r'')
One could refer to the documentation for df.replace.
Basically this uses a regex to remove everything apart from digits in the column named ID: \d denotes a digit, and [^\d] matches everything except digits.
Another possibility using regex is:
df.ID.str.extract(r'(\d+)')
This avoids changing the original data just to extract the integers.
So for the following simple example:
import pandas as pd
df = pd.DataFrame({'ID': ['PDF-1', 'PDF-2', 'PDF-3', 'PDF-4', 'PDF-5']})
print(df.ID.str.extract(r'(\d+)'))
print(df)
we get the following:
   0
0  1
1  2
2  3
3  4
4  5
      ID
0  PDF-1
1  PDF-2
2  PDF-3
3  PDF-4
4  PDF-5
Find "PDF-" ,and replace it with nothing
df['ID'] = df['ID'].str.replace('PDF-', '')
Then to print how you asked I'd convert the data frame to a string with no index.
print df['cleanID'].to_string(index=False)

How to drop parentheses within column or data frame

df =
A B
1 5
2 6)
(3 7
4 8
To remove parentheses I did:
df.A = df.A.str.replace(r"\(.*\)","")
But nothing changed. I have checked a lot of replies here, but still get the same result.
I would appreciate a way to remove parentheses from the whole data set, or at least in a column.
to remove parentheses from the whole data set
With a regex character class [...] (note that the original pattern \(.*\) only matches cells containing both an opening and a closing parenthesis, which is why it had no effect here):
In [15]: df.apply(lambda s: s.str.replace(r'[()]', '', regex=True))
Out[15]:
   A  B
0  1  5
1  2  6
2  3  7
3  4  8
Or the same with df.replace(r'[()]', '', regex=True) which is a more concise way.
If you want regex, you can use r"[()]" instead of alternation groups, as long as you need to replace only one character at a time.
df.A = df.A.str.replace(r"[()]", "", regex=True)
I find it easier to read and alter if needed.

Pandas overwrite values in column selectively based on condition from another column

I have a dataframe in pandas with four columns. The data consists of strings. Sample:
    A               B             C  D
0   2       asicdsada       v:cVccv  u
1   4  ascccaiiidncll  v:cVccv:ccvc  u
2   9             sca           V:c  u
3  11            lkss          v:cv  u
4  13           lcoao         v:ccv  u
5  14        wuduakkk      V:ccvcv:  u
I want to replace the string 'u' in Col D with the string 'a' if Col C in that row contains the substring 'V' (case sensitive).
Desired outcome:
    A               B             C  D
0   2       asicdsada       v:cVccv  a
1   4  ascccaiiidncll  v:cVccv:ccvc  a
2   9             sca           V:c  a
3  11            lkss          v:cv  u
4  13           lcoao         v:ccv  u
5  14        wuduakkk      V:ccvcv:  a
I prefer to overwrite the value already in Column D, rather than assign two different values, because I'd like to selectively overwrite some of these values again later, under different conditions.
It seems like this should have a simple solution, but I cannot figure it out, and haven't been able to find a fully applicable solution in other answered questions.
df.ix[1]["D"] = "a"
changes an individual value.
df.ix[:]["C"].str.contains("V")
returns a series of booleans, but I am not sure what to do with it. I have tried many many combinations of .loc, apply, contains, re.search, and for loops, and I get either errors or replace every value in column D. I'm a novice with pandas/python so it's hard to know whether my syntax, methods, or conceptualization of what I even need to do are off (probably all of the above).
As you've already tried, use str.contains to get a boolean Series, and then use .loc to say "change these rows and the D column". For example:
In [5]: df.loc[df["C"].str.contains("V"), "D"] = "a"
In [6]: df
Out[6]:
    A               B             C  D
0   2       asicdsada       v:cVccv  a
1   4  ascccaiiidncll  v:cVccv:ccvc  a
2   9             sca           V:c  a
3  11            lkss          v:cv  u
4  13           lcoao         v:ccv  u
5  14        wuduakkk      V:ccvcv:  a
(Avoid using .ix -- it's officially deprecated now.)
