Remove optional characters from a column in pandas - python

I have a column which may contain values like abc,def or abc,def,efg, or ab,12,34, and so on. As you can see, some values end with a comma , and some don't. What I want to do is remove the trailing comma from every value that ends with one.
Assuming the data is loaded and a DataFrame has been created, this is what I do:
df[c] = df[c].astype('unicode').str.replace("/,*$/", '').str.strip()
But it doesn't do anything.
What am I doing wrong?

The way you were trying to do it would look like this:
df[c] = df[c].str.rstrip(',')
rstrip(',') removes the comma only from the end of the string.
strip(',') would remove it from both the start and the end.
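For example, a quick sketch on a throwaway Series:
import pandas as pd

s = pd.Series(['abc,def,', 'abc,def', 'ab,12,34,'])
print(s.str.rstrip(','))
# 0     abc,def
# 1     abc,def
# 2    ab,12,34
# dtype: object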
The above rewrites the text in place; it will not drop any rows from the dataframe. If you want to drop the offending rows instead, use str.endswith:
df[~df['col'].str.endswith(',')]
Consider the df below:
In [1547]: df
Out[1547]:
         date id  value  rolling_mean    col
0  2016-08-28  A      1           nan     a,
1  2016-08-28  B      1           nan      b
2  2016-08-29  C      2           nan     c,
3  2016-09-02  B      0          0.50      d
4  2016-09-03  A      3          2.00  ee,ff
5  2016-09-06  C      1          1.50    gg,
6  2017-01-15  B      2          1.00     i,
7  2017-01-18  C      3          2.00      j
8  2017-01-18  A      2          2.50     k,
In [1548]: df = df[~df['col'].str.endswith(',')]
In [1549]: df
Out[1549]:
         date id  value  rolling_mean    col
1  2016-08-28  B      1           nan      b
3  2016-09-02  B      0          0.50      d
4  2016-09-03  A      3          2.00  ee,ff
7  2017-01-18  C      3          2.00      j
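One caveat, assuming 'col' can contain missing values: str.endswith returns NaN for those rows, which breaks boolean indexing, so pass na=False to keep them:
# NaN entries make str.endswith return NaN; na=False treats them as "does not end with ','"
df = df[~df['col'].str.endswith(',', na=False)]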

Your regex is wrong because it includes the /.../ delimiter characters. Python regexes are plain strings, not regex literals, so the slashes are matched as literal characters.
Use
df[c] = df[c].astype(str).str.replace(r",+$", "", regex=True).str.strip()
The ,+$ pattern matches one or more commas at the end of the string. Note that pandas 2.x treats the str.replace pattern as literal text unless you pass regex=True.
Also, see Regular expression works on regex101.com, but not on prod
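A minimal sketch of the fixed pattern in action:
import pandas as pd

s = pd.Series(['abc,def,', 'abc,def,efg,', 'ab,12,34'])
print(s.str.replace(r',+$', '', regex=True))
# 0        abc,def
# 1    abc,def,efg
# 2       ab,12,34
# dtype: object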

Related

Drop rows in a dataframe based on type of the entry

Suppose I have a dataframe x that has a column terms. Terms are supposed to be of type string, but some contain numbers, and for this reason I want to delete the rows of the dataframe where the corresponding terms values are integers/floats. I tried the following but received a KeyError:
x = x.drop(x[type(x['terms']) is int].index)
How should I change the code?
Use pd.to_numeric:
df = pd.DataFrame({'terms': [13, 0.23, 'hello', 'world', '12', '0.45']})
df = df[pd.to_numeric(df['terms'], errors='coerce').isna()]
print(df)
# Output:
terms
2 hello
3 world
Details:
>>> df
terms
0 13
1 0.23
2 hello
3 world
4 12
5 0.45
>>> pd.to_numeric(df['terms'], errors='coerce')
0 13.00
1 0.23
2 NaN
3 NaN
4 12.00
5 0.45
Name: terms, dtype: float64
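The same mask, inverted, keeps only the rows that do parse as numbers:
# rows whose 'terms' parse as a number (real numbers or numeric strings)
numeric_rows = df[pd.to_numeric(df['terms'], errors='coerce').notna()]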
Say you have a dataframe like this:
df = pd.DataFrame({'a':[1,'sd','sf',2,5,'13','s','143f','d234f','z24']})
# notice 13 is a string here ^^^^
a
0 1
1 sd
2 sf
3 2
4 5
5 13
6 s
7 143f
8 d234f
9 z24
If you want to get rid of items that look like integers, whether real numbers or digit strings, use this (note that str.isdigit catches only integer-like text, not floats):
df = df[~df['a'].astype(str).str.isdigit()]
Output:
>>> df
a
1 sd
2 sf
6 s
7 143f
8 d234f
9 z24
If you want to get rid of items that are actually not strings at all, use this:
df = df[df['a'].transform(type).eq(str)]
Output:
>>> df
a
1 sd
2 sf
5 13 <--- Notice how the string '13' is kept
6 s
7 143f
8 d234f
9 z24
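If transform(type) looks opaque, an equivalent spelling (a sketch giving the same result on this frame) maps type over each element:
# map applies type() to every element, yielding a Series of Python types
df = df[df['a'].map(type) == str]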

Confused about the usage of .apply and lambda

After encountering code along these lines (a min-max normalisation of the columns in a list c):
df[c] = df[c].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
I was confused about the usage of both .apply and lambda. First, does .apply apply the desired change to all elements of all the specified columns at once, or to each column one by one? Second, does the x in lambda x: stand for every element of the specified columns, or for each column separately? Third, do x.min and x.max give the minimum and maximum over all elements of the specified columns, or the minimum and maximum of each column separately? Any answer explaining the whole process would make me more than grateful.
Thanks.
I think it is best to avoid apply here - it loops under the hood - and instead work with a subset of the DataFrame selected by a list of columns:
df = pd.DataFrame({'A': list('abcdef'),
                   'B': [4, 5, 4, 5, 5, 4],
                   'C': [7, 8, 9, 4, 2, 3],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': list('aaabbb')})
print (df)
c = ['B','C','D']
First select the minimum of the chosen columns (the maximum works the same way):
print (df[c].min())
B 4
C 2
D 0
dtype: int64
Then subtract and divide:
print ((df[c] - df[c].min()))
   B  C  D
0  0  5  1
1  1  6  3
2  0  7  5
3  1  2  7
4  1  0  1
5  0  1  0
print (df[c].max() - df[c].min())
B 1
C 7
D 7
dtype: int64
df[c] = (df[c] - df[c].min()) / (df[c].max() - df[c].min())
print (df)
   A    B         C         D  E  F
0  a  0.0  0.714286  0.142857  5  a
1  b  1.0  0.857143  0.428571  3  a
2  c  0.0  1.000000  0.714286  6  a
3  d  1.0  0.285714  1.000000  9  b
4  e  1.0  0.000000  0.142857  2  b
5  f  0.0  0.142857  0.000000  4  b
EDIT:
To debug what apply does, it is best to create a custom function:
def f(x):
    # each call receives one column as a Series
    print (x)
    # x.min() returns a scalar - the column minimum
    print (x.min())
    # returns a new Series - the normalised column
    print ((x - x.min()) / (x.max() - x.min()))
    return (x - x.min()) / (x.max() - x.min())

df[c] = df[c].apply(f)
print (df)
Check whether the data are really being normalised. If the function were applied element-wise, x.min and x.max would see only a single value, so no normalisation would occur.
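A small sketch of the difference: with DataFrame.apply the lambda receives a whole column, so x.min() and x.max() are column statistics; an element-wise application would hand the lambda bare scalars, which have no .min at all:
import pandas as pd

df = pd.DataFrame({'B': [4, 5, 4], 'C': [7, 8, 9]})

# Column-wise: x is a Series, so x.min()/x.max() are the column min and max
print(df.apply(lambda x: (x - x.min()) / (x.max() - x.min())))

# Element-wise (e.g. applymap): x would be a bare scalar such as 4,
# and a plain number has no .min()/.max(), so nothing could be normalised.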

Unmelting a pandas dataframe with two columns

Suppose I have a dataframe
df = pd.DataFrame(np.random.normal(size = (10,3)), columns = list('abc'))
I melt the dataframe using pd.melt so that it looks like
  variable  value
0        a   0.20
1        a   0.03
2        a  -0.99
3        a   0.86
4        a   1.74
Now, I would like to undo the action. Using pivot(columns='variable') almost works, but returns a lot of NaN values:
      a    b    c
0  0.20  NaN  NaN
1  0.03  NaN  NaN
2 -0.99  NaN  NaN
3  0.86  NaN  NaN
4  1.74  NaN  NaN
How can I unmelt the dataframe so that it is as before?
A few ideas:
Assuming d1 is df.melt()
Option 1: groupby + comprehension
pd.DataFrame({n: list(s) for n, s in d1.groupby('variable').value})
a b c
0 -1.087129 -1.264522 1.147618
1 0.403731 0.416867 -0.367249
2 -0.920536 0.442650 -0.351229
3 -1.193876 -0.342237 -2.001431
4 -1.596659 -1.223354 1.323841
5 0.753658 -0.891211 0.541265
6 0.455577 -1.059572 1.017490
7 -0.153736 0.050007 -0.280192
8 1.189587 0.405647 -0.102023
9 -0.103273 0.200320 -0.630194
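The comprehension works because iterating a SeriesGroupBy yields (name, sub-Series) pairs, and list(s) drops the original index so each column starts at 0:
# each iteration yields the group label and that group's values
for n, s in d1.groupby('variable').value:
    print(n, len(s))  # prints a 10, then b 10, then c 10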
Option 2: set_index + unstack
d1.set_index([d1.groupby('variable').cumcount(), 'variable']).value.unstack()
variable a b c
0 -1.087129 -1.264522 1.147618
1 0.403731 0.416867 -0.367249
2 -0.920536 0.442650 -0.351229
3 -1.193876 -0.342237 -2.001431
4 -1.596659 -1.223354 1.323841
5 0.753658 -0.891211 0.541265
6 0.455577 -1.059572 1.017490
7 -0.153736 0.050007 -0.280192
8 1.189587 0.405647 -0.102023
9 -0.103273 0.200320 -0.630194
Use groupby, apply and unstack. Wrapping each group's values in a fresh Series re-indexes them 0..n-1, so unstacking lines the groups up side by side:
df.groupby('variable')['value']\
  .apply(lambda x: pd.Series(x.values)).unstack().T
variable a b c
0 0.617037 -0.321493 0.747025
1 0.576410 -0.498173 0.185723
2 -1.563912 0.741198 1.439692
3 -1.305317 1.203608 -1.112820
4 1.287638 1.649580 0.404494
5 0.923544 0.988020 -1.918680
6 0.497406 -1.373345 0.074963
7 0.528444 -0.019914 -1.666261
8 0.260955 0.103575 0.190424
9 0.614411 -0.165363 -0.149514
Another method uses pivot and transform, provided the value column itself contains no NaN: sorting each column with key=pd.isnull pushes the NaN padding to the bottom, and dropna then trims the all-NaN tail rows.
df1 = df.melt()
df1.pivot(columns='variable', values='value')\
   .transform(lambda x: sorted(x, key=pd.isnull)).dropna()
Output:
variable a b c
0 1.596937 0.431029 0.345441
1 -0.493352 0.135649 -1.559669
2 0.548048 0.667752 0.258160
3 -0.251368 -0.265106 -2.339768
4 -0.397010 -0.381193 -0.359447
5 -0.945300 0.520029 0.362570
6 -0.883771 -0.612628 -0.478003
7 0.833100 -0.387262 -1.195496
8 -1.310178 -0.748359 0.073014
9 0.753457 1.105500 -0.895841
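As a sanity check (a sketch using the Option 2 recipe), the round trip is lossless apart from the columns' axis name that melt introduces:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.normal(size=(10, 3)), columns=list('abc'))
d1 = df.melt()

restored = d1.set_index([d1.groupby('variable').cumcount(), 'variable']).value.unstack()
restored.columns.name = None  # melt stored the old column labels under 'variable'

pd.testing.assert_frame_equal(df, restored)  # no exception: the frames match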

How do I combine two columns within a dataframe in Pandas?

Say I have two columns, A and B, in my dataframe:
A B
1 NaN
2 5
3 NaN
4 6
I want to get a new column, C, which fills in NaN cells in column B using values from column A:
A B C
1 NaN 1
2 5 5
3 NaN 3
4 6 6
How do I do this?
I'm sure this is a very basic question, but as I am new to Pandas, any help will be appreciated!
You can use combine_first:
df['C'] = df['B'].combine_first(df['A'])
Docs: http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.Series.combine_first.html
You can use where, which is a vectorised if/else: the result takes df['A'] wherever df['B'] is null, and df['B'] otherwise:
df['C'] = df['A'].where(df['B'].isnull(), df['B'])
A B C
0 1 NaN 1
1 2 5 5
2 3 NaN 3
3 4 6 6
df['C'] = df['B'].fillna(df['A'])
.fillna fills the NaN values in a Series with whatever you pass to it; you can pass a scalar or another Series. Here we pass df['A'], so the corresponding values of A are put into the NaN positions of B, and the final answer ends up in C.
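For completeness, a minimal runnable version of the example above (B is a float column because it holds NaN, so C comes out as float too):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [np.nan, 5, np.nan, 6]})
df['C'] = df['B'].fillna(df['A'])
print(df)
#    A    B    C
# 0  1  NaN  1.0
# 1  2  5.0  5.0
# 2  3  NaN  3.0
# 3  4  6.0  6.0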

Replacing space with a character

I need to replace all the spaces in a dataframe column with a period, i.e.:
Original df:
symbol
0 AEC
1 BRK A
2 BRK B
3 CTRX
4 FCE A
Desired result df:
symbol
0 AEC
1 BRK.A
2 BRK.B
3 CTRX
4 FCE.A
Is there a way to do this without needing to iterate through each row, replacing the space one at a time? I prefer not to iterate one row at a time if there is a vectorized way to do things.
Use vectorised str.replace:
In [95]:
df['symbol'] = df['symbol'].str.replace(' ','.')
df
Out[95]:
symbol
0 AEC
1 BRK.A
2 BRK.B
3 CTRX
4 FCE.A
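If a symbol could contain runs of whitespace rather than a single space, a regex variant (a sketch, using the regex=True flag of str.replace) collapses each run into one period:
# \s+ matches one or more whitespace characters
df['symbol'] = df['symbol'].str.replace(r'\s+', '.', regex=True)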
