Pandas replace with default value - python

I have a pandas DataFrame in which I want to replace the values of a certain column conditionally, e.g.:
col
0 Mr
1 Miss
2 Mr
3 Mrs
4 Col.
I want to map them as
{'Mr': 0, 'Mrs': 1, 'Miss': 2}
If there are other titles not present in the dict, I want them to get a default value of 3.
The above example becomes
col
0 0
1 2
2 0
3 1
4 3
Can I do this with pandas.replace() without using regex?

You can use map rather than replace because it is faster; then fill the missing values with 3 using fillna and cast to int with astype:
df['col'] = df.col.map({'Mr': 0, 'Mrs': 1, 'Miss': 2}).fillna(3).astype(int)
print (df)
col
0 0
1 2
2 0
3 1
4 3
Another solution uses numpy.where with an isin condition:
import numpy as np

d = {'Mr': 0, 'Mrs': 1, 'Miss': 2}
df['col'] = np.where(df.col.isin(d.keys()), df.col.map(d), 3).astype(int)
print (df)
col
0 0
1 2
2 0
3 1
4 3
Solution with replace (the isin mask is still needed because replace alone leaves unmatched values such as 'Col.' untouched):
d = {'Mr': 0, 'Mrs': 1, 'Miss': 2}
df['col'] = np.where(df.col.isin(d.keys()), df.col.replace(d), 3)
print (df)
col
0 0
1 2
2 0
3 1
4 3
Timings:
df = pd.concat([df]*10000).reset_index(drop=True)
d = {'Mr': 0, 'Mrs': 1, 'Miss': 2}
df['col0'] = df.col.map(d).fillna(3).astype(int)
df['col1'] = np.where(df.col.isin(d.keys()), df.col.replace(d), 3)
df['col2'] = np.where(df.col.isin(d.keys()), df.col.map(d), 3).astype(int)
print (df)
In [447]: %timeit df['col0'] = df.col.map(d).fillna(3).astype(int)
100 loops, best of 3: 4.93 ms per loop
In [448]: %timeit df['col1'] = np.where(df.col.isin(d.keys()), df.col.replace(d), 3)
100 loops, best of 3: 14.3 ms per loop
In [449]: %timeit df['col2'] = np.where(df.col.isin(d.keys()), df.col.map(d), 3).astype(int)
100 loops, best of 3: 7.68 ms per loop
In [450]: %timeit df['col3'] = df.col.map(lambda L: d.get(L, 3))
10 loops, best of 3: 36.2 ms per loop

To add to the answer by @jezrael: the most straightforward solution is to use a defaultdict instead of a dict. This is especially useful when you want missing values not to be replaced with your default value.
from collections import defaultdict
df['col'] = df.col.map(defaultdict(lambda: 3, Mr=0, Mrs=1, Miss=2), na_action='ignore')
The first argument of defaultdict is a function that returns the default value.
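A minimal sketch (not from the original answer) of what na_action='ignore' changes: unknown titles still fall through to the defaultdict's default, but actual missing values stay NaN instead of becoming 3.
import pandas as pd
from collections import defaultdict

s = pd.Series(['Mr', 'Col.', None])
d = defaultdict(lambda: 3, {'Mr': 0, 'Mrs': 1, 'Miss': 2})

print(s.map(d, na_action='ignore'))  # 0.0, 3.0, NaN -- None is left alone
print(s.map(d))                      # 0, 3, 3 -- without it, None also hits the default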

Related

Adding a column of weights to a pandas DF by a condition on the DF's column

What's the most Pythonic way to add a column (of weights) to an existing pandas DataFrame "df" by a condition on df's column?
Small example:
df = pd.DataFrame({'A' : [1, 2, 3], 'B' : [4, 5, 6]})
df
Out[110]:
A B
0 1 4
1 2 5
2 3 6
I'd like to add a "weight" column where if df['B'] >= 6 then df['weight'] = 20, else df['weight'] = 1.
So my output will be:
A B weight
0 1 4 1
1 2 5 1
2 3 6 20
Approach #1
Here's one with type-conversion and scaling (the boolean comparison casts to 0/1, so multiplying by 19 and adding 1 maps False to 1 and True to 20) -
df['weight'] = (df['B'] >= 6)*19+1
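A quick illustration of the intermediate values, using the three-row df from the question:
mask = df['B'] >= 6
print(mask.tolist())             # [False, False, True]
print((mask * 19 + 1).tolist())  # [1, 1, 20]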
Approach #2
Another, possibly faster, one using the underlying array data -
df['weight'] = (df['B'].values >= 6)*19+1
Approach #3
Leverage multi-cores with numexpr module -
import numexpr as ne
val = df['B'].values
df['weight'] = ne.evaluate('(val >= 6)*19+1')
Timings on 500k rows (as commented by the OP) of random data in the range [0,9), for the vectorized methods posted thus far -
In [149]: np.random.seed(0)
...: df = pd.DataFrame({'B' : np.random.randint(0,9,(500000))})
# @jpp's soln
In [150]: %timeit df['weight1'] = np.where(df['B'] >= 6, 20, 1)
100 loops, best of 3: 3.57 ms per loop
# @jpp's soln with array data
In [151]: %timeit df['weight2'] = np.where(df['B'].values >= 6, 20, 1)
100 loops, best of 3: 3.27 ms per loop
In [154]: %timeit df['weight3'] = (df['B'] >= 6)*19+1
100 loops, best of 3: 2.73 ms per loop
In [155]: %timeit df['weight4'] = (df['B'].values >= 6)*19+1
1000 loops, best of 3: 1.76 ms per loop
In [156]: %%timeit
...: val = df['B'].values
...: df['weight5'] = ne.evaluate('(val >= 6)*19+1')
1000 loops, best of 3: 1.14 ms per loop
One last one ...
With the output being 1 or 20, we could safely use lower precision: uint8, for a turbo speedup over the already discussed ones, like so -
In [208]: %timeit df['weight6'] = (df['B'].values >= 6)*np.uint8(19)+1
1000 loops, best of 3: 428 µs per loop
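A minimal sketch of the same trick outside %timeit (assuming numpy is imported as np and the 500k-row df from the timing setup above): keeping the arithmetic in uint8 means smaller intermediate arrays and less memory traffic.
out = (df['B'].values >= 6) * np.uint8(19) + 1
print(out.dtype)  # uint8 on typical NumPy versions
df['weight6'] = out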
You can use numpy.where for a vectorised solution:
df['weight'] = np.where(df['B'] >= 6, 20, 1)
Result:
A B weight
0 1 4 1
1 2 5 1
2 3 6 20
Here's a method using df.apply (simple to read, though row-wise apply is generally slower than the vectorized options above):
df['weight'] = df.apply(lambda row: 20 if row['B'] >= 6 else 1, axis=1)
Output:
In [6]: df
Out[6]:
A B weight
0 1 4 1
1 2 5 1
2 3 6 20

Using list comprehension to number dataframes in a list of dataframes

I have a list of 4 dataframes, called df.
I'd like to add a "number" column to each dataframe (df[i]['number']) that represents the dataframe's number.
I tried to use list comprehension for that:
df=[df['number']=(x+1) for x in range(0,4)]
Which resulted in
File "<ipython-input-52-0b708f543fbb>", line 1
df=[df['number']=(x+1) for x in range(0,4)]
^
SyntaxError: invalid syntax
I also tried:
df=[x['number']=(y+1) for x,y in enumerate(df)]
With the same result, pointing at the '=' sign.
What am I doing wrong?
Use enumerate, starting from 1 and assign to each dataframe in your list.
for i, d in enumerate(df, 1):
    d['number'] = i
In-place assignment is much cheaper than assignment in a list comprehension.
df[0]
id marks
0 1 100
1 2 200
2 3 300
df[1]
name score flag
0 'abc' 100 T
1 'zxc' 300 F
for i, d in enumerate(df, 1):
    d['number'] = i
df[0]
id marks number
0 1 100 1
1 2 200 1
2 3 300 1
df[1]
name score flag number
0 'abc' 100 T 2
1 'zxc' 300 F 2
Performance
Small
1000 loops, best of 3: 278 µs per loop # mine
vs
1000 loops, best of 3: 567 µs per loop # John Galt
Large (df * 10000)
1000 loops, best of 3: 607 µs per loop # mine
vs
1000 loops, best of 3: 1.16 ms per loop # John Galt - assign
1 loop, best of 1: 1.42 ms per loop # John Galt - side effects
Note that the loop-based assignment is also space efficient.
Use
1)
In [454]: df = [x.assign(number=i) for i, x in enumerate(df, 1)]
In [455]: df[0]
Out[455]:
0 1 number
0 0.068330 0.708835 1
1 0.877747 0.586654 1
In [456]: df[1]
Out[456]:
0 1 number
0 0.430418 0.477923 2
1 0.049980 0.018981 2
The good part is that you can assign the result to a new variable without altering the old list, like
dff = [x.assign(number=i) for i, x in enumerate(df, 1)]
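A quick check (hypothetical, assuming a fresh list df whose frames do not yet have the column) that assign really returns copies:
dff = [x.assign(number=i) for i, x in enumerate(df, 1)]
print('number' in df[0].columns)   # False: the originals are untouched
print('number' in dff[0].columns)  # True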
2)
If you want in-place modification with a list comprehension (DataFrame.insert works in place and returns None, which is why the comprehension evaluates to a list of Nones):
In [474]: [x.insert(x.shape[1], 'number', i) for i, x in enumerate(df, 1)]
Out[474]: [None, None, None, None]
In [475]: df[0]
Out[475]:
0 1 number
0 0.207806 0.315701 1
1 0.464864 0.976156 1

pandas rolling max with groupby

I have a problem getting the rolling function of pandas to do what I want. For each row, I want to calculate the maximum so far within the group. Here is an example:
df = pd.DataFrame([[1,3], [1,6], [1,3], [2,2], [2,1]], columns=['id', 'value'])
looks like
id value
0 1 3
1 1 6
2 1 3
3 2 2
4 2 1
Now I wish to obtain the following DataFrame:
id value
0 1 3
1 1 6
2 1 6
3 2 2
4 2 2
The problem is that when I do
df.groupby('id')['value'].rolling(1).max()
I get the same DataFrame back. And when I do
df.groupby('id')['value'].rolling(3).max()
I get a DataFrame with Nans. Can someone explain how to properly use rolling or some other Pandas function to obtain the DataFrame I want?
It looks like you need cummax() instead of .rolling(N).max(): rolling(1) uses a window of one row, so each max is just the value itself, while rolling(3) produces NaN until a full three-row window is available. A cumulative maximum per group gives "the maximum so far":
In [29]: df['new'] = df.groupby('id').value.cummax()
In [30]: df
Out[30]:
id value new
0 1 3 3
1 1 6 6
2 1 3 6
3 2 2 2
4 2 1 2
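As an aside (not part of the original answer), a per-group expanding window expresses the same "maximum so far" idea, though cummax stays simpler and faster; a sketch:
df['new2'] = (df.groupby('id')['value']
                .expanding().max()
                .reset_index(level=0, drop=True)
                .astype(int))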
Timing (using brand new Pandas version 0.20.1):
In [3]: df = pd.concat([df] * 10**4, ignore_index=True)
In [4]: df.shape
Out[4]: (50000, 2)
In [5]: %timeit df.groupby('id').value.apply(lambda x: x.cummax())
100 loops, best of 3: 15.8 ms per loop
In [6]: %timeit df.groupby('id').value.cummax()
100 loops, best of 3: 4.09 ms per loop
NOTE: from the Pandas 0.20.0 release notes:
Improved performance of groupby().cummin() and groupby().cummax() (GH15048, GH15109, GH15561, GH15635)
Using apply will be a tiny bit faster:
# Using apply
df['output'] = df.groupby('id').value.apply(lambda x: x.cummax())
%timeit df['output'] = df.groupby('id').value.apply(lambda x: x.cummax())
1000 loops, best of 3: 1.57 ms per loop
Other method:
df['output'] = df.groupby('id').value.cummax()
%timeit df['output'] = df.groupby('id').value.cummax()
1000 loops, best of 3: 1.66 ms per loop

Renaming values in a column from lists within a dataframe

I have a data frame which looks like this,
df=pd.DataFrame({'col1':[1,2,3,4,5,6], 'col2':list('AASOSP')})
df
and I have two lists,
lis1 = ['A']
lis2 = ['S','O']
I need to replace the values in col2 based on lis1 and lis2, so I used np.where to do it, like this:
df['col2'] = np.where(df.col2.isin(lis1),'PC',df.col2.isin(lis2),'Ln','others')
But it's throwing the following error:
TypeError: function takes at most 3 arguments (5 given)
Any suggestion is appreciated!
In the end I am aiming to have the values in col2 of my data frame replaced as:
col1 col2
0 1 PC
1 2 PC
2 3 Ln
3 4 Ln
4 5 Ln
5 6 others
Use double numpy.where:
lis1=['A']
lis2=['S','O']
df['col2'] = np.where(df.col2.isin(lis1), 'PC',
             np.where(df.col2.isin(lis2), 'Ln', 'others'))
print (df)
col1 col2
0 1 PC
1 2 PC
2 3 Ln
3 4 Ln
4 5 Ln
5 6 others
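If more categories were ever needed, numpy.select generalizes the nested where pattern (a sketch, not from the original answers):
conds = [df.col2.isin(lis1), df.col2.isin(lis2)]
df['col2'] = np.select(conds, ['PC', 'Ln'], default='others')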
Timings:
#[60000 rows x 2 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
In [257]: %timeit np.where(df.col2.isin(lis1),'PC',np.where(df.col2.isin(lis2),'Ln','others'))
100 loops, best of 3: 8.15 ms per loop
In [258]: %timeit in1d_based(df, lis1, lis2)
100 loops, best of 3: 4.98 ms per loop
Here's one approach -
a = df.col2.values
df.col2 = np.take(['others','PC','Ln'], np.in1d(a,lis1) + 2*np.in1d(a,lis2))
Sample step-by-step run -
# Input dataframe
In [206]: df
Out[206]:
col1 col2
0 1 A
1 2 A
2 3 S
3 4 O
4 5 S
5 6 P
# Extract out col2 values
In [207]: a = df.col2.values
# Form an indexing array based on where we have matches in lis1 or lis2 or neither
In [208]: idx = np.in1d(a,lis1) + 2*np.in1d(a,lis2)
In [209]: idx
Out[209]: array([1, 1, 2, 2, 2, 0])
# Index into a list of new strings with those indices
In [210]: newvals = np.take(['others','PC','Ln'], idx)
In [211]: newvals
Out[211]:
array(['PC', 'PC', 'Ln', 'Ln', 'Ln', 'others'],
dtype='|S6')
# Finally assign those into col2
In [212]: df.col2 = newvals
In [213]: df
Out[213]:
col1 col2
0 1 PC
1 2 PC
2 3 Ln
3 4 Ln
4 5 Ln
5 6 others
Runtime test -
In [251]: df=pd.DataFrame({'col1':[1,2,3,4,5,6], 'col2':list('AASOSP')})
In [252]: df = pd.concat([df]*10000).reset_index(drop=True)
In [253]: lis1
Out[253]: ['A']
In [254]: lis2
Out[254]: ['S', 'O']
In [255]: def in1d_based(df, lis1, lis2):
...: a = df.col2.values
...: return np.take(['others','PC','Ln'], np.in1d(a,lis1) + 2*np.in1d(a,lis2))
...:
# @jezrael's soln
In [256]: %timeit np.where(df.col2.isin(lis1),'PC', np.where(df.col2.isin(lis2),'Ln','others'))
100 loops, best of 3: 3.78 ms per loop
In [257]: %timeit in1d_based(df, lis1, lis2)
1000 loops, best of 3: 1.89 ms per loop

Square of each element of a column in pandas

How can I square each element of a column/series of a DataFrame in pandas (and create another column to hold the result)?
>>> import pandas as pd
>>> df = pd.DataFrame([[1,2],[3,4]], columns=list('ab'))
>>> df
a b
0 1 2
1 3 4
>>> df['c'] = df['b']**2
>>> df
a b c
0 1 2 4
1 3 4 16
Nothing wrong with the accepted answer, there is also:
df = pd.DataFrame({'a': range(0,100)})
np.square(df)
np.power(df, 2)
Which is ever so slightly faster:
In [11]: %timeit df ** 2
10000 loops, best of 3: 95.9 µs per loop
In [13]: %timeit np.square(df)
10000 loops, best of 3: 85 µs per loop
In [15]: %timeit np.power(df, 2)
10000 loops, best of 3: 85.6 µs per loop
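To create a new column as the question asks, these apply the same way (a quick, self-contained sketch):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(0, 100)})
df['a_squared'] = np.square(df['a'])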
You can also use the pandas.DataFrame.pow() method.
>>> import pandas as pd
>>> df = pd.DataFrame([[1,2], [3,4]], columns=list('ab'))
>>> df
a b
0 1 2
1 3 4
>>> df['c'] = df['b'].pow(2)
>>> df
a b c
0 1 2 4
1 3 4 16
