I have the following dataframe in Pandas
letter number
------ -------
a 2
a 0
b 1
b 5
b 2
c 1
c 0
c 2
I'd like to keep all rows if at least one matching number is 0.
Result would be:
letter number
------ -------
a 2
a 0
c 1
c 0
c 2
since b has no number equal to 0.
What is the best way to do this?
Thanks!
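For reference, the example frame can be rebuilt like this (a quick sketch; the column names and values are exactly those shown above):

import pandas as pd

df = pd.DataFrame({'letter': ['a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
                   'number': [2, 0, 1, 5, 2, 1, 0, 2]})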
You need filtration (GroupBy.filter):
df = df.groupby('letter').filter(lambda x: (x['number'] == 0).any())
print (df)
letter number
0 a 2
1 a 0
5 c 1
6 c 0
7 c 2
Another solution with transform: get the count of 0 values per group and filter by boolean indexing:
print (df.groupby('letter')['number'].transform(lambda x: (x == 0).sum()))
0 1
1 1
2 0
3 0
4 0
5 1
6 1
7 1
Name: number, dtype: int64
df = df[df.groupby('letter')['number'].transform(lambda x: (x == 0).sum()) > 0]
print (df)
letter number
0 a 2
1 a 0
5 c 1
6 c 0
7 c 2
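As a side note, a lambda-free sketch of the same transform idea (assuming a reasonably recent pandas, where transform('any') is supported): build a per-row boolean mask that is True for every row whose letter group contains at least one 0, then index with it.

mask = df['number'].eq(0).groupby(df['letter']).transform('any')  # True for groups containing a 0
df1 = df[mask]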
EDIT:
It is faster to avoid groupby and use loc with isin instead:
df1 = df[df['letter'].isin(df.loc[df['number'] == 0, 'letter'])]
print (df1)
letter number
0 a 2
1 a 0
5 c 1
6 c 0
7 c 2
Comparing the loc version with the chained-indexing variant:
In [412]: %timeit df[df['letter'].isin(df[df['number'] == 0]['letter'])]
1000 loops, best of 3: 815 µs per loop
In [413]: %timeit df[df['letter'].isin(df.loc[df['number'] == 0, 'letter'])]
1000 loops, best of 3: 657 µs per loop
You can also do this without the groupby by working out which letters to keep and then using isin. Personally I think this is a bit neater:
>>> letters_to_keep = df[df['number'] == 0]['letter']
>>> df_reduced = df[df['letter'].isin(letters_to_keep)]
>>> df_reduced
letter number
0 a 2
1 a 0
5 c 1
6 c 0
7 c 2
I suspect this is faster than doing a groupby, though that may not be relevant here. A simple timeit indicates this is the case:
>>> %%timeit
... df.groupby('letter').filter(lambda x: (x['number'] == 0).any())
100 loops, best of 3: 2.26 ms per loop
>>> %%timeit
... df[df['letter'].isin(df[df['number'] == 0]['letter'])]
1000 loops, best of 3: 820 µs per loop
Related
I have a pandas dataframe as follows, consisting of string values.
0 1 2
0 o jj ovg
1 j jj jjy
2 y yk yku
3 v vf vfs
4 i iw iwd
I have a function that encodes each column with integer values and counts the number of unique elements in each column, using the pandas cat.codes and nunique functions. The timing results and code snippets are below.
As is evident, these operations take a lot of time. How can I speed them up?
Line # Hits Time Per Hit % Time Line Contents
=====================================================================================================================
25 1 7529434.0 7529434.0 79.9 df = df.apply(lambda x: x.astype('category').cat.codes)
26
27 # calculate the number of unique keys for each row
28 1 1825214.0 1825214.0 19.4 len_arr = df.nunique(axis=0).values
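For reference, a minimal sketch of the setup being profiled (the frame is the one shown above; the two lines are the ones flagged by the profiler):

import pandas as pd

df = pd.DataFrame({0: ['o', 'j', 'y', 'v', 'i'],
                   1: ['jj', 'jj', 'yk', 'vf', 'iw'],
                   2: ['ovg', 'jjy', 'yku', 'vfs', 'iwd']})

encoded = df.apply(lambda x: x.astype('category').cat.codes)  # integer-encode each column
len_arr = df.nunique(axis=0).values                           # unique count per column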
Edit: timing results from the answers
%timeit df.apply(lambda x: pd.factorize(x)[0])
#100 loops, best of 3: 6.24 ms per loop
%timeit df.apply(lambda x: pd.factorize(x)[0])
#100 loops, best of 3: 4.93 ms per loop
%timeit df1.nunique(axis=0).values
#100 loops, best of 3: 2.34 ms per loop
%timeit df1.apply(lambda x: len(pd.factorize(x)[1]))
#100 loops, best of 3: 2.64 ms per loop
Edit 2
More timing results for fun:
# results with 100 rows
%timeit original()
#100 loops, best of 3: 7 ms per loop
%timeit WeNYoBen()
#100 loops, best of 3: 2.4 ms per loop
%timeit jezrael()
#100 loops, best of 3: 4.03 ms per loop
%timeit piRSquared()
#100 loops, best of 3: 2.29 ms per loop
# results with 10000 rows
%timeit original()
#100 loops, best of 3: 16.6 ms per loop
%timeit WeNYoBen()
#10 loops, best of 3: 23 ms per loop
%timeit jezrael()
#100 loops, best of 3: 6.14 ms per loop
%timeit piRSquared()
#100 loops, best of 3: 19.1 ms per loop
Use factorize and take the length of the second returned array:
a = df.apply(lambda x: len(pd.factorize(x)[1]))
print (a)
0 5
1 4
2 5
dtype: int64
For the integer codes:
b = df.apply(lambda x: pd.factorize(x)[0])
print (b)
0 1 2
0 0 0 0
1 1 0 1
2 2 1 2
3 3 2 3
4 4 3 4
All together, to avoid calling factorize twice:
out = {}

def f(x):
    a, b = pd.factorize(x)
    out[x.name] = len(b)
    return a

b = df.apply(f)
print (b)
0 1 2
0 0 0 0
1 1 0 1
2 2 1 2
3 3 2 3
4 4 3 4
a = pd.Series(out)
print (a)
0 5
1 4
2 5
dtype: int64
With pd.factorize
The point of this is to capture both outputs of factorize and use them for the integer encoding as well as the nunique calculation, without having to factorize twice.
Run this to get encoding and unique values
e, u = zip(*map(pd.factorize, map(df.get, df)))
Turn encoding into dataframe
pd.DataFrame([*zip(*e)], df.index, df.columns)
0 1 2
0 0 0 0
1 1 0 1
2 2 1 2
3 3 2 3
4 4 3 4
Turn length of unique values into a series
pd.Series([*map(len, u)], df.columns)
0 5
1 4
2 5
dtype: int64
All together, the assignment of the two objects is
e, u = zip(*map(pd.factorize, map(df.get, df)))
df_ = pd.DataFrame([*zip(*e)], df.index, df.columns)
c = pd.Series([*map(len, u)], df.columns)
For those stuck with legacy Python, without the [*it] syntax
e, u = zip(*map(pd.factorize, map(df.get, df)))
df_ = pd.DataFrame(list(zip(*e)), df.index, df.columns)
c = pd.Series(list(map(len, u)), df.columns)
I think using list and map is good enough:
l=list(map(set,df.values.T))
l
Out[71]:
[{'i', 'j', 'o', 'v', 'y'},
{'iw', 'jj', 'vf', 'yk'},
{'iwd', 'jjy', 'ovg', 'vfs', 'yku'}]
list(map(len,l))
Out[74]: [5, 4, 5]
Usage of np.unique
def yourfunc(x):
    _, indices = np.unique(x, return_inverse=True)
    return indices
df.apply(yourfunc)
Out[102]:
0 1 2
0 2 1 2
1 1 1 1
2 4 3 4
3 3 2 3
4 0 0 0
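As a follow-up sketch combining the two ideas above, np.unique can return both pieces of information in one pass per column. Note that np.unique sorts the values, so the integer codes follow sorted order rather than factorize's order of appearance; the helper name encode_and_count is just illustrative.

import numpy as np
import pandas as pd

def encode_and_count(df):
    codes, counts = {}, {}
    for col in df:
        uniques, inv = np.unique(df[col].values, return_inverse=True)
        codes[col] = inv            # integer codes per row (sorted-unique order)
        counts[col] = len(uniques)  # number of unique values in the column
    return pd.DataFrame(codes, index=df.index), pd.Series(counts)

encoded, n_unique = encode_and_count(df)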
I am still having trouble with this and nothing seems to work for me. I have a data frame with two columns. I am trying to return all of the values in column A in a new column, B. However, I want to loop through column A and stop returning those values, instead returning 0, once the cumulative sum reaches 8 or the next value would make it greater than 8.
df (with max_val = 8):
A
1
2
2
3
4
5
1
The output should look something like this
df (with max_val = 8):
A B
1 1
2 2
2 2
3 3
4 0
5 0
1 0
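For reference, a sketch of the setup (values as shown above):

import pandas as pd

max_val = 8
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 1]})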
I thought of something like this:
def func(x):
    if df['A'].cumsum() <= max_val:
        return x
    else:
        return 0
This doesn't work:
df['B'] = df['A'].apply(func, axis =1 )
Neither does this:
df['B'] = func(df['A'])
You can use Series.where:
df['B'] = df['A'].where(df['A'].cumsum() <= max_val, 0)
print (df)
A B
0 1 1
1 2 2
2 2 2
3 3 3
4 4 0
5 5 0
6 1 0
Approach #1: One approach using np.where -
df['B']= np.where((df.A.cumsum()<=max_val), df.A ,0)
Sample output -
In [145]: df
Out[145]:
A B
0 1 1
1 2 2
2 2 2
3 3 3
4 4 0
5 5 0
6 1 0
Approach #2: Another using array initialization -
def app2(df, max_val):
    a = df.A.values
    colB = np.zeros(df.shape[0], dtype=a.dtype)
    idx = np.searchsorted(a.cumsum(), max_val, 'right')
    colB[:idx] = a[:idx]
    df['B'] = colB
Runtime test
Seems like @jezrael's Series.where based one is the closest, so timing against it on a bigger dataset -
In [293]: df = pd.DataFrame({'A':np.random.randint(0,9,(1000000))})
In [294]: max_val = 1000000
# #jezrael's soln
In [295]: %timeit df['B1'] = df['A'].where(df['A'].cumsum() <= max_val, 0)
100 loops, best of 3: 8.22 ms per loop
# Proposed in this post
In [296]: %timeit df['B2']= np.where((df.A.cumsum()<=max_val), df.A ,0)
100 loops, best of 3: 6.45 ms per loop
# Proposed in this post
In [297]: %timeit app2(df, max_val)
100 loops, best of 3: 4.47 ms per loop
df['B']=[x if x<=8 else 0 for x in df['A'].cumsum()]
df
Out[7]:
A B
0 1 1
1 2 3
2 2 5
3 3 8
4 4 0
5 5 0
6 1 0
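Note that this comprehension keeps the running total itself (1, 3, 5, 8) rather than the original A values shown in the desired output. If the A values are wanted instead, one sketch is to iterate over both the values and their cumulative sum:

df['B'] = [a if total <= 8 else 0 for a, total in zip(df['A'], df['A'].cumsum())]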
Why don't you add the values up in a variable, like this:
total = 0
B = []
for x in df['A']:
    total += x                              # running cumulative sum
    B.append(x if total <= max_val else 0)  # keep the value only while the sum stays within max_val
df['B'] = B
Splitting it into multiple lines:
import pandas as pd
A=[1,2,2,3,4,5,1]
MAXVAL=8
df=pd.DataFrame(data=A,columns=['A'])
df['cumsumA']=df['A'].cumsum()
df['B']=df['cumsumA']*(df['cumsumA']<MAXVAL).astype(int)
You can then drop the 'cumsumA' column
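For completeness, a sketch of dropping the helper column afterwards:

df = df.drop('cumsumA', axis=1)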
The below will work fine -
import numpy as np
max_val = 8
df['B'] = np.where(df['A'].cumsum() <= max_val , df['A'],0)
I hope this helps.
Just a way to do it with .loc (using a temporary cumulative-sum column):
df['C'] = df['A'].cumsum()
df['B'] = df['A']
df.loc[df['C'] > 8, 'B'] = 0
I have a data set indicating who has shopped at which stores.
ID Store
1 C
1 A
2 A
2 B
3 A
3 B
3 C
Can I use a pivot table to determine the frequency of a shopper going to other stores? I'm thinking of a 3x3 matrix where each row/column pair would indicate how many people went to both of those stores.
Desired output
A B C
A 3 2 2
B 2 3 1
C 2 1 3
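For reference, a sketch of the example data:

import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3, 3],
                   'Store': ['C', 'A', 'A', 'B', 'A', 'B', 'C']})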
You can create a contingency table of ID and Store with pd.crosstab() and then compute the matrix product of its transpose with itself, which should produce what you need:
mat = pd.crosstab(df.ID, df.Store)
mat.T.dot(mat)
#Store A B C
#Store
# A 3 2 2
# B 2 2 1
# C 2 1 2
Note: since only two IDs visited store B and only two visited store C, I suppose the corresponding diagonal cells should be 2 instead of 3.
Another, faster solution with groupby, unstack and dot:
df = df.groupby(['ID','Store']).size().unstack(fill_value=0)
df = df.T.dot(df)
print (df)
Store A B C
Store
A 3 2 2
B 2 2 1
C 2 1 2
Timings:
In [119]: %timeit (jez(df))
1000 loops, best of 3: 1.72 ms per loop
In [120]: %timeit (psi(df))
100 loops, best of 3: 7.07 ms per loop
Code for timings:
N = 1000
df = pd.DataFrame({'ID':np.random.choice(5, N),
'Store': np.random.choice(list('ABCDEFGHIJK'), N)})
print (df)
def jez(df):
    df = df.groupby(['ID','Store']).size().unstack(fill_value=0)
    return df.T.dot(df)

def psi(df):
    mat = pd.crosstab(df.ID, df.Store)
    return mat.T.dot(mat)
print (jez(df))
print (psi(df))
I'm new to pandas and I have a DataFrame of this kind:
name value
0 alpha a
1 beta b
2 gamma c
3 alpha a
4 beta b
5 beta a
6 gamma a
7 alpha c
which I would like to turn into one of this kind:
name a b c
0 alpha 2 0 1
1 beta 1 2 0
2 gamma 1 0 1
That is to say I would like to group by "name" and "value", then count them, and create a column for each value of "value" I find.
It is just a cross tabulation:
In [78]:
print pd.crosstab(df.name, df.value)
value a b c
name
alpha 2 0 1
beta 1 2 0
gamma 1 0 1
If you use groupby:
In [90]:
print df.groupby(['name', 'value']).agg(len).unstack().fillna(0)
value a b c
name
alpha 2 0 1
beta 1 2 0
gamma 1 0 1
The latter might be faster:
In [92]:
%timeit df.groupby(['name', 'value']).agg(len).unstack().fillna(0)
100 loops, best of 3: 3.26 ms per loop
In [93]:
%timeit pd.crosstab(df.name, df.value)
100 loops, best of 3: 7.5 ms per loop
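As a side note, the groupby/size/unstack pattern used in the store example above applies here as well and avoids the agg(len)/fillna combination (a sketch):

df.groupby(['name', 'value']).size().unstack(fill_value=0)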
I have a pandas DataFrame with many small groups:
In [84]: n=10000
In [85]: df=pd.DataFrame({'group':sorted(range(n)*4),'val':np.random.randint(6,size=4*n)}).sort(['group','val']).reset_index(drop=True)
In [86]: df.head(9)
Out[86]:
group val
0 0 0
1 0 0
2 0 1
3 0 2
4 1 1
5 1 2
6 1 2
7 1 4
8 2 0
I want to do something special for groups where val==1 appears but not val==0; e.g., replace the 1 in the group with 99 only if val==0 is in that group.
But for DataFrames of this size it is quite slow:
In [87]: def f(s):
....: if (0 not in s) and (1 in s): s[s==1]=99
....: return s
....:
In [88]: %timeit df.groupby('group')['val'].transform(f)
1 loops, best of 3: 11.2 s per loop
Looping through the data frame is much uglier but much faster:
In [89]: %paste
def g(df):
    df.sort(['group','val'], inplace=True)
    last_g = -1
    for i in xrange(len(df)):
        if df.group.iloc[i] != last_g:
            has_zero = False
            last_g = df.group.iloc[i]   # remember which group we are in
        if df.val.iloc[i] == 0:
            has_zero = True
        elif has_zero and df.val.iloc[i] == 1:
            df.val.iloc[i] = 99
    return df
## -- End pasted text --
In [90]: %timeit g(df)
1 loops, best of 3: 2.53 s per loop
But I would like to optimize it further if possible.
Any idea of how to do so?
Thanks
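For newer pandas/Python 3, where DataFrame.sort and xrange no longer exist, the example frame can be rebuilt like this (a sketch):

import numpy as np
import pandas as pd

n = 10000
df = (pd.DataFrame({'group': np.repeat(np.arange(n), 4),
                    'val': np.random.randint(6, size=4 * n)})
        .sort_values(['group', 'val'])
        .reset_index(drop=True))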
Based on Jeff's answer, I got a solution that is very fast. I'm putting it here if others find it useful:
In [122]: def do_fast(df):
.....: has_zero_mask=df.group.isin(df[df.val==0].group.unique())
.....: df.val[(df.val==1) & has_zero_mask]=99
.....: return df
.....:
In [123]: %timeit do_fast(df)
100 loops, best of 3: 11.2 ms per loop
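On current pandas, the same idea written with .loc avoids the chained-assignment warning that df.val[...] = 99 triggers (a sketch):

def do_fast_loc(df):
    has_zero_mask = df['group'].isin(df.loc[df['val'] == 0, 'group'].unique())
    df.loc[(df['val'] == 1) & has_zero_mask, 'val'] = 99
    return df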
Not 100% sure this is what you are going for, but it should be simple to swap in a different filtering/setting criterion:
In [37]: pd.set_option('max_rows',10)
In [38]: np.random.seed(1234)
In [39]: def f():
             # create the frame
             df = pd.DataFrame({'group': sorted(range(n)*4),
                                'val': np.random.randint(6, size=4*n)}).sort(['group','val']).reset_index(drop=True)
             df['result'] = np.nan
             # Create a count per group
             df['counter'] = df.groupby('group').cumcount()
             # select which values you want, returning the indexes of those
             mask = df[df.val==1].groupby('group').grouper.group_info[0]
             # set em
             df.loc[df.index.isin(mask) & df['counter'] == 1, 'result'] = 99
In [40]: %timeit f()
10 loops, best of 3: 95 ms per loop
In [41]: df
Out[41]:
group val result counter
0 0 3 NaN 0
1 0 4 99 1
2 0 4 NaN 2
3 0 5 99 3
4 1 0 NaN 0
... ... ... ... ...
39995 9998 4 NaN 3
39996 9999 0 NaN 0
39997 9999 0 NaN 1
39998 9999 2 NaN 2
39999 9999 3 NaN 3
[40000 rows x 4 columns]