Pandas "diff()" with string - python

How can I flag a row in a dataframe every time a column changes its string value?
Ex:
Input
ColumnA ColumnB
1 Blue
2 Blue
3 Red
4 Red
5 Yellow
# diff won't work here with strings... it only works on numerical values
dataframe['changed'] = dataframe['ColumnB'].diff()
ColumnA ColumnB changed
1 Blue 0
2 Blue 0
3 Red 1
4 Red 0
5 Yellow 1

I get better performance with ne instead of using the actual != comparison:
df['changed'] = df['ColumnB'].ne(df['ColumnB'].shift().bfill()).astype(int)
Timings
Using the following setup to produce a larger dataframe:
df = pd.concat([df]*10**5, ignore_index=True)
I get the following timings:
%timeit df['ColumnB'].ne(df['ColumnB'].shift().bfill()).astype(int)
10 loops, best of 3: 38.1 ms per loop
%timeit (df.ColumnB != df.ColumnB.shift()).astype(int)
10 loops, best of 3: 77.7 ms per loop
%timeit df['ColumnB'] == df['ColumnB'].shift(1).fillna(df['ColumnB'])
10 loops, best of 3: 99.6 ms per loop
%timeit (df.ColumnB.ne(df.ColumnB.shift())).astype(int)
10 loops, best of 3: 19.3 ms per loop
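For reference, a minimal runnable sketch of the example above using the bfill variant (a sketch only; the column names mirror the question):

import pandas as pd

df = pd.DataFrame({'ColumnA': [1, 2, 3, 4, 5],
                   'ColumnB': ['Blue', 'Blue', 'Red', 'Red', 'Yellow']})

# shift() moves every value down one row and leaves NaN in the first slot;
# bfill() back-fills that NaN with the first real value, so the first row
# compares equal to itself and is flagged 0 rather than 1.
df['changed'] = df['ColumnB'].ne(df['ColumnB'].shift().bfill()).astype(int)
print(df)

The bare ne(shift()) variant is faster, as the timings show, but it flags the first row as changed, which is why the answer further down resets that row with loc afterwards.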

Use .shift and compare:
dataframe['changed'] = dataframe['ColumnB'] != dataframe['ColumnB'].shift(1).fillna(dataframe['ColumnB'])

Compare with shift, then replace the NaN in the first row with 0, because there is no previous value to compare against:
df['diff'] = (df.ColumnB != df.ColumnB.shift()).astype(int)
df.loc[0,'diff'] = 0
print (df)
ColumnA ColumnB diff
0 1 Blue 0
1 2 Blue 0
2 3 Red 1
3 4 Red 0
4 5 Yellow 1
Edit: per the timings in another answer, the fastest option is ne:
df['diff'] = (df.ColumnB.ne(df.ColumnB.shift())).astype(int)
df.loc[0,'diff'] = 0

Related

Speed up integer encoding of strings in pandas dataframe

I have a pandas dataframe as follows, consisting of string values.
0 1 2
0 o jj ovg
1 j jj jjy
2 y yk yku
3 v vf vfs
4 i iw iwd
I have a function which encodes each column with integer values and counts the number of unique elements in each column. I used the cat.codes and nunique functions of pandas. See the timing results and code snippets below.
As is evident, these operations take a lot of time. How can I speed them up?
Line # Hits Time Per Hit % Time Line Contents
=====================================================================================================================
25 1 7529434.0 7529434.0 79.9 df = df.apply(lambda x: x.astype('category').cat.codes)
26
27 # calculate the number of unique keys for each row
28 1 1825214.0 1825214.0 19.4 len_arr = df.nunique(axis=0).values
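For reference, a self-contained sketch of the profiled snippet on the sample frame above (integer column names 0, 1, 2, matching the display):

import pandas as pd

df = pd.DataFrame({0: ['o', 'j', 'y', 'v', 'i'],
                   1: ['jj', 'jj', 'yk', 'vf', 'iw'],
                   2: ['ovg', 'jjy', 'yku', 'vfs', 'iwd']})

# The two profiled lines: integer-encode each column via category codes,
# then count the number of unique values per column.
df_codes = df.apply(lambda x: x.astype('category').cat.codes)
len_arr = df.nunique(axis=0).values
print(len_arr)   # [5 4 5]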
Edit Timing results from the answer
df.apply(lambda x: pd.factorize(x)[0])
#100 loops, best of 3: 6.24 ms per loop
%timeit df.apply(lambda x: pd.factorize(x)[0])
#100 loops, best of 3: 4.93 ms per loop
%timeit df1.nunique(axis=0).values
#100 loops, best of 3: 2.34 ms per loop
%timeit df1.apply(lambda x: len(pd.factorize(x)[1]))
#100 loops, best of 3: 2.64 ms per loop
Edit 2
More timing results for fun:
# results with 100 rows
%timeit original()
#100 loops, best of 3: 7 ms per loop
%timeit WeNYoBen()
#100 loops, best of 3: 2.4 ms per loop
%timeit jezrael()
#100 loops, best of 3: 4.03 ms per loop
%timeit piRSquared()
#100 loops, best of 3: 2.29 ms per loop
# results with 10000 rows
%timeit original()
#100 loops, best of 3: 16.6 ms per loop
%timeit WeNYoBen()
#10 loops, best of 3: 23 ms per loop
%timeit jezrael()
#100 loops, best of 3: 6.14 ms per loop
%timeit piRSquared()
#100 loops, best of 3: 19.1 ms per loop
Use factorize and take the length of the second returned array:
a = df.apply(lambda x: len(pd.factorize(x)[1]))
print (a)
0 5
1 4
2 5
dtype: int64
For integers:
b = df.apply(lambda x: pd.factorize(x)[0])
print (b)
0 1 2
0 0 0 0
1 1 0 1
2 2 1 2
3 3 2 3
4 4 3 4
All together, to avoid calling factorize twice:
out = {}
def f(x):
    a, b = pd.factorize(x)
    out[x.name] = len(b)
    return a

b = df.apply(f)
print (b)
0 1 2
0 0 0 0
1 1 0 1
2 2 1 2
3 3 2 3
4 4 3 4
a = pd.Series(out)
print (a)
0 5
1 4
2 5
dtype: int64
With pd.factorize
The point of this is to capture both outputs of factorize and use them for the integer encoding as well as the nunique calculation, without having to factorize twice.
Run this to get the encodings and unique values:
e, u = zip(*map(pd.factorize, map(df.get, df)))
Turn the encodings into a dataframe:
pd.DataFrame([*zip(*e)], df.index, df.columns)
0 1 2
0 0 0 0
1 1 0 1
2 2 1 2
3 3 2 3
4 4 3 4
Turn the lengths of the unique values into a series:
pd.Series([*map(len, u)], df.columns)
0 5
1 4
2 5
dtype: int64
All together, the assignment of the two objects is
e, u = zip(*map(pd.factorize, map(df.get, df)))
df_ = pd.DataFrame([*zip(*e)], df.index, df.columns)
c = pd.Series([*map(len, u)], df.columns)
For those stuck with legacy Python, without the [*it] syntax
e, u = zip(*map(pd.factorize, map(df.get, df)))
df_ = pd.DataFrame(list(zip(*e)), df.index, df.columns)
c = pd.Series(list(map(len, u)), df.columns)
I think using list and map is good enough:
l=list(map(set,df.values.T))
l
Out[71]:
[{'i', 'j', 'o', 'v', 'y'},
{'iw', 'jj', 'vf', 'yk'},
{'iwd', 'jjy', 'ovg', 'vfs', 'yku'}]
list(map(len,l))
Out[74]: [5, 4, 5]
Usage of np.unique
def yourfunc(x):
    _, indices = np.unique(x, return_inverse=True)
    return indices
df.apply(yourfunc)
Out[102]:
0 1 2
0 2 1 2
1 1 1 1
2 4 3 4
3 3 2 3
4 0 0 0
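A quick sketch of why the integer labels above differ from the factorize-based answers even though the unique counts agree: factorize numbers values in order of first appearance, while np.unique(return_inverse=True) numbers them by their position in the sorted unique values.

import numpy as np
import pandas as pd

s = pd.Series(['o', 'j', 'y', 'v', 'i'])   # column 0 of the sample frame

codes, uniques = pd.factorize(s)
print(codes)        # [0 1 2 3 4] -- order of first appearance

_, inverse = np.unique(s, return_inverse=True)
print(inverse)      # [2 1 4 3 0] -- positions in the sorted unique values

Either way, the number of unique values (here len(uniques) == 5) is the same.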

pandas: set row values to letter of the alphabet corresponding to index number?

I have a dataframe:
a b c country
0 5 7 11 Morocco
1 5 9 9 Nigeria
2 6 2 13 Spain
I'd like to add a column e that is the letter of the alphabet corresponding to the index number, for example:
a b c country e
0 5 7 11 Morocco A
1 5 9 9 Nigeria B
2 6 2 13 Spain C
How can I do this? I've tried:
df['e'] = chr(ord('a') + df.index.astype(int))
But I get:
TypeError: int() argument must be a string or a number, not 'Int64Index'
One method would be to convert the index to a Series and then call apply and pass a lambda:
In[271]:
df['e'] = df.index.to_series().apply(lambda x: chr(ord('a') + x)).str.upper()
df
Out[271]:
a b c country e
0 5 7 11 Morocco A
1 5 9 9 Nigeria B
2 6 2 13 Spain C
Basically, your error here is that df.index is of type Int64Index and the chr function doesn't understand how to operate on it, so by calling apply on a Series we iterate row-wise and convert each value.
I think performance-wise a list comprehension will be faster:
In[273]:
df['e'] = [chr(ord('a') + x).upper() for x in df.index]
df
Out[273]:
a b c country e
0 5 7 11 Morocco A
1 5 9 9 Nigeria B
2 6 2 13 Spain C
Timings
%timeit df.index.to_series().apply(lambda x: chr(ord('a') + x)).str.upper()
1000 loops, best of 3: 491 µs per loop
%timeit [chr(ord('a') + x).upper() for x in df.index]
100000 loops, best of 3: 19.2 µs per loop
Here the list comprehension method is significantly faster.
Here is an alternative functional solution. It assumes you have fewer countries than letters (see the sketch after the output below for longer indexes).
from string import ascii_uppercase
from operator import itemgetter
df['e'] = itemgetter(*df.index)(ascii_uppercase)
print(df)
a b c country e
0 5 7 11 Morocco A
1 5 9 9 Nigeria B
2 6 2 13 Spain C
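If the index can grow past 25 (more rows than letters), one possible extension is Excel-style bijective base-26 labels. The excel_label helper below is a hypothetical name, not part of any answer above; a sketch only:

from string import ascii_uppercase

def excel_label(n):
    # hypothetical helper: 0 -> 'A', 25 -> 'Z', 26 -> 'AA', 27 -> 'AB', ...
    label = ''
    n += 1
    while n > 0:
        n, rem = divmod(n - 1, 26)
        label = ascii_uppercase[rem] + label
    return label

df['e'] = [excel_label(i) for i in df.index]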
You can also use map with the values from df.index (in Python 3, wrap it in list, since map returns a lazy iterator):
df['e'] = list(map(chr, ord('A') + df.index.values))
If you do a speed comparison:
# Edchum
%timeit df.index.to_series().apply(lambda x: chr(ord('A') + x))
10000 loops, best of 3: 135 µs per loop
%timeit [chr(ord('A') + x) for x in df.index]
100000 loops, best of 3: 7.38 µs per loop
# jpp
%timeit itemgetter(*df.index)(ascii_uppercase)
100000 loops, best of 3: 7.23 µs per loop
# Me
%timeit map(chr,ord('A') + df.index.values)
100000 loops, best of 3: 3.12 µs per loop
So map seems to be the fastest, but that might just be because of the small size of the data sample (note also that in Python 3 a bare map call is lazy, so that timing does not include the actual conversion).

Pandas: Keep rows if at least one of them contains certain value

I have the following dataframe in Pandas
letter number
------ -------
a 2
a 0
b 1
b 5
b 2
c 1
c 0
c 2
I'd like to keep all rows of a letter if at least one of that letter's numbers is 0.
Result would be:
letter number
------ -------
a 2
a 0
c 1
c 0
c 2
as b has no row where number is 0.
What is the best way to do this?
Thanks!
You need filtration:
df = df.groupby('letter').filter(lambda x: (x['number'] == 0).any())
print (df)
letter number
0 a 2
1 a 0
5 c 1
6 c 0
7 c 2
Another solution uses transform to get the count of 0 rows per group and then filters by boolean indexing:
print (df.groupby('letter')['number'].transform(lambda x: (x == 0).sum()))
0 1
1 1
2 0
3 0
4 0
5 1
6 1
7 1
Name: number, dtype: int64
df = df[df.groupby('letter')['number'].transform(lambda x: (x == 0).sum()) > 0]
print (df)
letter number
0 a 2
1 a 0
5 c 1
6 c 0
7 c 2
EDIT:
It is faster to avoid groupby and use loc with isin instead:
df1 = df[df['letter'].isin(df.loc[df['number'] == 0, 'letter'])]
print (df1)
letter number
0 a 2
1 a 0
5 c 1
6 c 0
7 c 2
Comparing with another solution:
In [412]: %timeit df[df['letter'].isin(df[df['number'] == 0]['letter'])]
1000 loops, best of 3: 815 µs per loop
In [413]: %timeit df[df['letter'].isin(df.loc[df['number'] == 0, 'letter'])]
1000 loops, best of 3: 657 µs per loop
You can also do this without the groupby by working out which letters to keep then using isin. I think this is a bit neater personally:
>>> letters_to_keep = df[df['number'] == 0]['letter']
>>> df_reduced = df[df['letter'].isin(letters_to_keep)]
>>> df_reduced
letter number
0 a 2
1 a 0
5 c 1
6 c 0
7 c 2
I suspect this would be faster than doing a groupby, though that may not be relevant here. A simple timeit indicates this is the case:
>>> %%timeit
... df.groupby('letter').filter(lambda x: (x['number'] == 0).any())
100 loops, best of 3: 2.26 ms per loop
>>> %%timeit
... df[df['letter'].isin(df[df['number'] == 0]['letter'])]
1000 loops, best of 3: 820 µs per loop
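For completeness, a sketch of another vectorized variant on recent pandas, building a per-letter "has a zero" mask with GroupBy.transform('any'); it should give the same rows as the isin approaches above:

import pandas as pd

df = pd.DataFrame({'letter': ['a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
                   'number': [2, 0, 1, 5, 2, 1, 0, 2]})

# True for every row whose letter group contains at least one 0
mask = df['number'].eq(0).groupby(df['letter']).transform('any')
print(df[mask])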

Pandas fillna with a lookup table

Having some trouble with filling NaNs. I want to take a dataframe column with a few NaNs and fill them with a value derived from a 'lookup table' based on a value from another column.
(You might recognize my data from the Titanic data set)...
Pclass Age
0 1 33
1 3 24
2 1 23
3 2 NaN
4 1 Nan
I want to fill the NaN with a value from series 'pclass_lookup':
pclass_lookup
1 38.1
2 29.4
3 25.2
I have tried doing fillna with indexing like:
df.Age.fillna(pclass_lookup[df.Pclass])
but it gives me an error of
ValueError: cannot reindex from a duplicate axis
Lambdas were a try too:
df.Age.map(lambda x: x if x else pclass_lookup[df.Pclass])
but that doesn't seem to fill it right either. Am I totally missing the boat here?
Firstly, you have a duff value in row 4: it is the string 'Nan', which is not the same as a real NaN, so even if your code did work, that value would never be replaced.
So you need to replace that duff value first, and then you can just call map to perform the lookup on the NaN values (here df1.pclass_lookup is the lookup Series stored in a dataframe df1):
In [317]:
df.Age.replace('Nan', np.NaN, inplace=True)
df.loc[df['Age'].isnull(),'Age'] = df['Pclass'].map(df1.pclass_lookup)
df
Out[317]:
Pclass Age
0 1 33
1 3 24
2 1 23
3 2 29.4
4 1 38.1
Timings
For a df with 5000 rows:
In [26]:
%timeit df.loc[df['Age'].isnull(),'Age'] = df['Pclass'].map(df1.pclass_lookup)
100 loops, best of 3: 2.41 ms per loop
In [27]:
%%timeit
def remove_na(x):
    if pd.isnull(x['Age']):
        return df1[x['Pclass']]
    else:
        return x['Age']
df['Age'] = df.apply(remove_na, axis=1)
1 loops, best of 3: 278 ms per loop
In [28]:
%%timeit
nulls = df.loc[df.Age.isnull(), 'Pclass']
df.loc[df.Age.isnull(), 'Age'] = df1.loc[nulls].values
100 loops, best of 3: 3.37 ms per loop
So you can see here that apply, as it iterates row-wise, scales poorly compared to the other two methods, which are vectorised, but map is still the fastest.
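An alternative sketch (assuming Age already holds real NaNs rather than the string 'Nan', and that pclass_lookup is the Series from the question): map the lookup onto Pclass and let fillna do the index-aligned fill.

import pandas as pd

df = pd.DataFrame({'Pclass': [1, 3, 1, 2, 1],
                   'Age': [33, 24, 23, None, None]})
pclass_lookup = pd.Series([38.1, 29.4, 25.2], index=range(1, 4))

# map() translates each Pclass into its lookup age, aligned on df's index,
# and fillna() only uses those values where Age is missing.
df['Age'] = df['Age'].fillna(df['Pclass'].map(pclass_lookup))
print(df)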
Building on the response of @vrajs5:
# Create dummy data
df = pd.DataFrame()
df['Pclass'] = [1,3,1,2,1]
df['Age'] = [33,24,23,None, None]
pclass_lookup = pd.Series([38.1,29.4,25.2], index = range(1,4))
# Solution:
nulls = df.loc[df.Age.isnull(), 'Pclass']
df.loc[df.Age.isnull(), 'Age'] = pclass_lookup.loc[nulls].values
>>> df
Pclass Age
0 1 33.0
1 3 24.0
2 1 23.0
3 2 29.4
4 1 38.1
The following should work for you:
df = pd.DataFrame()
df['Pclass'] = [1,3,1,2,1]
df['Age'] = [33,24,23,None, None]
df
Pclass Age
0 1 33
1 3 24
2 1 23
3 2 NaN
4 1 NaN
pclass_lookup = pd.Series([38.1,29.4,25.2], index = range(1,4))
pclass_lookup
1 38.1
2 29.4
3 25.2
dtype: float64
def remove_na(x):
    if pd.isnull(x['Age']):
        return pclass_lookup[x['Pclass']]
    else:
        return x['Age']
df['Age'] = df.apply(remove_na, axis=1)
Pclass Age
0 1 33.0
1 3 24.0
2 1 23.0
3 2 29.4
4 1 38.1

Optimizing pandas groupby on many small groups

I have a pandas DataFrame with many small groups:
In [84]: n=10000
In [85]: df=pd.DataFrame({'group':sorted(list(range(n))*4),'val':np.random.randint(6,size=4*n)}).sort_values(['group','val']).reset_index(drop=True)
In [86]: df.head(9)
Out[86]:
group val
0 0 0
1 0 0
2 0 1
3 0 2
4 1 1
5 1 2
6 1 2
7 1 4
8 2 0
I want to do something special for groups where val==1 appears but not val==0. E.g. replace the 1s in the group with 99 only if val==0 is in that group.
But for DataFrames of this size it is quite slow:
In [87]: def f(s):
   ....:     if (0 not in s) and (1 in s): s[s==1]=99
   ....:     return s
   ....:
In [88]: %timeit df.groupby('group')['val'].transform(f)
1 loops, best of 3: 11.2 s per loop
Looping through the data frame is much uglier but much faster:
In [89]: %paste
def g(df):
    df.sort_values(['group','val'],inplace=True)
    last_g=-1
    for i in range(len(df)):
        if df.group.iloc[i]!=last_g:
            last_g=df.group.iloc[i]
            has_zero=False
        if df.val.iloc[i]==0:
            has_zero=True
        elif has_zero and df.val.iloc[i]==1:
            df.val.iloc[i]=99
    return df
## -- End pasted text --
In [90]: %timeit g(df)
1 loops, best of 3: 2.53 s per loop
But I would like to optimizing it further if possible.
Any idea of how to do so?
Thanks
Based on Jeff's answer, I got a solution that is very fast. I'm putting it here if others find it useful:
In [122]: def do_fast(df):
   .....:     has_zero_mask=df.group.isin(df[df.val==0].group.unique())
   .....:     df.loc[(df.val==1) & has_zero_mask,'val']=99
   .....:     return df
   .....:
In [123]: %timeit do_fast(df)
100 loops, best of 3: 11.2 ms per loop
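On recent pandas, a sketch of an equivalent vectorized form (assuming the same criterion as do_fast above, i.e. set the 1s to 99 in groups that contain a 0):

import numpy as np
import pandas as pd

n = 10000
df = pd.DataFrame({'group': sorted(list(range(n)) * 4),
                   'val': np.random.randint(6, size=4 * n)})

# Per-row flag: does this row's group contain at least one 0?
has_zero = df['val'].eq(0).groupby(df['group']).transform('any')
# Same setting rule as do_fast: replace the 1s with 99 in those groups.
df.loc[df['val'].eq(1) & has_zero, 'val'] = 99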
Not 100% sure this is what you are going for, but it should be simple to swap in a different filtering/setting criterion:
In [37]: pd.set_option('max_rows',10)
In [38]: np.random.seed(1234)
In [39]: def f():
    # create the frame
    df=pd.DataFrame({'group':sorted(list(range(n))*4),
                     'val':np.random.randint(6,size=4*n)}).sort_values(['group','val']).reset_index(drop=True)
    df['result'] = np.nan
    # Create a count per group
    df['counter'] = df.groupby('group').cumcount()
    # select which values you want, returning the indexes of those
    mask = df[df.val==1].groupby('group').grouper.group_info[0]
    # set em
    df.loc[df.index.isin(mask) & df['counter'] == 1,'result'] = 99
In [40]: %timeit f()
10 loops, best of 3: 95 ms per loop
In [41]: df
Out[41]:
group val result counter
0 0 3 NaN 0
1 0 4 99 1
2 0 4 NaN 2
3 0 5 99 3
4 1 0 NaN 0
... ... ... ... ...
39995 9998 4 NaN 3
39996 9999 0 NaN 0
39997 9999 0 NaN 1
39998 9999 2 NaN 2
39999 9999 3 NaN 3
[40000 rows x 4 columns]
