I'm new to pandas and I have a DataFrame of this kind:

    name value
0  alpha     a
1   beta     b
2  gamma     c
3  alpha     a
4   beta     b
5   beta     a
6  gamma     a
7  alpha     c
which I would like to turn into one of this kind:

    name  a  b  c
0  alpha  2  0  1
1   beta  1  2  0
2  gamma  1  0  1
That is to say I would like to group by "name" and "value", then count them, and create a column for each value of "value" I find.
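For anyone reproducing the examples below, a minimal frame matching the sample above can be built like this (the values are copied straight from the question):

import pandas as pd

# rebuild the sample frame from the question
df = pd.DataFrame({'name':  ['alpha', 'beta', 'gamma', 'alpha', 'beta', 'beta', 'gamma', 'alpha'],
                   'value': ['a', 'b', 'c', 'a', 'b', 'a', 'a', 'c']})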
It is just a cross tabulation:
In [78]: print pd.crosstab(df.name, df.value)
value  a  b  c
name
alpha  2  0  1
beta   1  2  0
gamma  1  0  1
If you use groupby:
In [90]: print df.groupby(['name', 'value']).agg(len).unstack().fillna(0)
value  a  b  c
name
alpha  2  0  1
beta   1  2  0
gamma  1  0  1
The groupby version might be faster:
In [92]: %timeit df.groupby(['name', 'value']).agg(len).unstack().fillna(0)
100 loops, best of 3: 3.26 ms per loop

In [93]: %timeit pd.crosstab(df.name, df.value)
100 loops, best of 3: 7.5 ms per loop
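On newer pandas versions, the same table can also be built with size() and unstack(fill_value=0), which avoids the fillna step; this is a sketch, not part of the original answers:

counts = df.groupby(['name', 'value']).size().unstack(fill_value=0)
print(counts)
# value  a  b  c
# name
# alpha  2  0  1
# beta   1  2  0
# gamma  1  0  1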
Related
I am still having trouble with this and nothing seems to work for me. I have a DataFrame with two columns. I am trying to return all of the values in column A in a new column, B. However, I want to walk down column A and stop returning those values, returning 0 instead, once the cumulative sum would go past 8.
df, with max_val = 8:

A
1
2
2
3
4
5
1
The output should look something like this (the running sums of A are 1, 3, 5, 8, 12, 17, 18, so every row after the fourth gets 0):

A  B
1  1
2  2
2  2
3  3
4  0
5  0
1  0
I thought something like this would work:

def func(x):
    if df['A'].cumsum() <= max_val:
        return x
    else:
        return 0
This doesn't work:
df['B'] = df['A'].apply(func, axis =1 )
Neither does this:
df['B'] = func(df['A'])
You can use Series.where:
df['B'] = df['A'].where(df['A'].cumsum() <= max_val, 0)
print (df)
   A  B
0  1  1
1  2  2
2  2  2
3  3  3
4  4  0
5  5  0
6  1  0
Approach #1: One approach using np.where -

df['B'] = np.where(df.A.cumsum() <= max_val, df.A, 0)
Sample output -
In [145]: df
Out[145]:
   A  B
0  1  1
1  2  2
2  2  2
3  3  3
4  4  0
5  5  0
6  1  0
Approach #2: Another using array initialization -

def app2(df, max_val):
    a = df.A.values
    colB = np.zeros(df.shape[0], dtype=a.dtype)
    # first position at which the running total of A exceeds max_val
    idx = np.searchsorted(a.cumsum(), max_val, 'right')
    colB[:idx] = a[:idx]
    df['B'] = colB
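For reference, a quick run of app2 on the sample frame from the question (a sketch; it assumes the A column shown above and max_val = 8):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 1]})
app2(df, max_val=8)
print(df)
#    A  B
# 0  1  1
# 1  2  2
# 2  2  2
# 3  3  3
# 4  4  0
# 5  5  0
# 6  1  0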
Runtime test
Seems like @jezrael's Series.where based one is the closest, so timing against it on a bigger dataset -
In [293]: df = pd.DataFrame({'A':np.random.randint(0,9,(1000000))})
In [294]: max_val = 1000000
# #jezrael's soln
In [295]: %timeit df['B1'] = df['A'].where(df['A'].cumsum() <= max_val, 0)
100 loops, best of 3: 8.22 ms per loop
# Proposed in this post
In [296]: %timeit df['B2']= np.where((df.A.cumsum()<=max_val), df.A ,0)
100 loops, best of 3: 6.45 ms per loop
# Proposed in this post
In [297]: %timeit app2(df, max_val)
100 loops, best of 3: 4.47 ms per loop
A list comprehension over the cumulative sums works too:

df['B'] = [a if c <= 8 else 0 for a, c in zip(df['A'], df['A'].cumsum())]
df
Out[7]:
   A  B
0  1  1
1  2  2
2  2  2
3  3  3
4  4  0
5  5  0
6  1  0
Why don't you accumulate the values into a running total as you loop, like this:

total = 0
B = []
for a in df['A']:
    total += a
    B.append(a if total <= max_val else 0)
df['B'] = B
Splitting it into multiple lines:
import pandas as pd
A=[1,2,2,3,4,5,1]
MAXVAL=8
df=pd.DataFrame(data=A,columns=['A'])
df['cumsumA']=df['A'].cumsum()
df['B'] = df['A'] * (df['cumsumA'] <= MAXVAL).astype(int)
You can then drop the 'cumsumA' column.
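For example (drop(columns=...) needs a reasonably recent pandas; on older versions use the axis=1 form):

df = df.drop(columns=['cumsumA'])   # or: df.drop('cumsumA', axis=1)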
The below will work fine -
import numpy as np
max_val = 8
df['B'] = np.where(df['A'].cumsum() <= max_val, df['A'], 0)
I hope this helps.
Just a way to do it with .loc, using a helper column for the cumulative sum:

df['c'] = df['A'].cumsum()
df['B'] = df['A']
df.loc[df['c'] > 8, 'B'] = 0
Why, when I slice a pandas DataFrame containing only one row, does the slice become a pandas Series?
How can I keep it as a DataFrame?
df = pd.DataFrame(data=[[1,2,3]], columns=['a','b','c'])
df
Out[37]:
   a  b  c
0  1  2  3

a = df.iloc[0]
a
Out[39]:
a    1
b    2
c    3
Name: 0, dtype: int64
To avoid the intermediate step of re-converting back to a DataFrame, use double brackets when indexing:
a = df.iloc[[0]]
print(a)
   a  b  c
0  1  2  3
Speed:
%timeit df.iloc[[0]]
192 µs per loop
%timeit df.loc[0].to_frame().T
468 µs per loop
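The difference is easy to see by checking the types (a quick illustrative check using the df defined above):

print(type(df.iloc[0]))    # <class 'pandas.core.series.Series'>
print(type(df.iloc[[0]]))  # <class 'pandas.core.frame.DataFrame'>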
Or you can select with a boolean mask on the index:

a = df.iloc[df.index == 0]
a
Out[1782]:
   a  b  c
0  1  2  3
Use to_frame() and T to transpose:

df.loc[0].to_frame()

   0
a  1
b  2
c  3

and

df.loc[0].to_frame().T

   a  b  c
0  1  2  3

OR

Option #2: use double brackets [[]]

df.iloc[[0]]

   a  b  c
0  1  2  3
I have the following dataframe in Pandas
letter  number
------  ------
a       2
a       0
b       1
b       5
b       2
c       1
c       0
c       2
I'd like to keep all rows for a letter if at least one of that letter's numbers is 0.
Result would be:
letter  number
------  ------
a       2
a       0
c       1
c       0
c       2

since b has no row where number is 0.
What is the best way to do this?
Thanks!
You need filtration:
df = df.groupby('letter').filter(lambda x: (x['number'] == 0).any())
print (df)
  letter  number
0      a       2
1      a       0
5      c       1
6      c       0
7      c       2
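To see what the filter is doing, it can help to print what the lambda evaluates for each group of the original (unfiltered) df; this is illustration only, not part of the original answer. Groups for which the condition is True are kept in full, the rest are dropped:

for letter, grp in df.groupby('letter'):
    print(letter, (grp['number'] == 0).any())
# a True    -> kept
# b False   -> dropped
# c True    -> kept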
Another solution with transform, which gets the number of rows equal to 0 in each group and then filters by boolean indexing:
print (df.groupby('letter')['number'].transform(lambda x: (x == 0).sum()))
0    1
1    1
2    0
3    0
4    0
5    1
6    1
7    1
Name: number, dtype: int64
df = df[df.groupby('letter')['number'].transform(lambda x: (x == 0).sum()) > 0]
print (df)
  letter  number
0      a       2
1      a       0
5      c       1
6      c       0
7      c       2
EDIT: It is faster not to use groupby at all; better is loc with isin:
df1 = df[df['letter'].isin(df.loc[df['number'] == 0, 'letter'])]
print (df1)
  letter  number
0      a       2
1      a       0
5      c       1
6      c       0
7      c       2
Comparing with another solution:
In [412]: %timeit df[df['letter'].isin(df[df['number'] == 0]['letter'])]
1000 loops, best of 3: 815 µs per loop
In [413]: %timeit df[df['letter'].isin(df.loc[df['number'] == 0, 'letter'])]
1000 loops, best of 3: 657 µs per loop
You can also do this without the groupby by working out which letters to keep then using isin. I think this is a bit neater personally:
>>> letters_to_keep = df[df['number'] == 0]['letter']
>>> df_reduced = df[df['letter'].isin(letters_to_keep)]
>>> df_reduced
  letter  number
0      a       2
1      a       0
5      c       1
6      c       0
7      c       2
I suspect this would be faster than doing a groupby, though that may not be relevant here. A simple timeit indicates this is the case:
>>> %%timeit
... df.groupby('letter').filter(lambda x: (x['number'] == 0).any())
100 loops, best of 3: 2.26 ms per loop
>>> %%timeit
... df[df['letter'].isin(df[df['number'] == 0]['letter'])]
1000 loops, best of 3: 820 µs per loop
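On recent pandas versions there is also a transform-based spelling that builds the keep-mask in one pass; this is a sketch, not from the original answers:

# True for every row whose letter group contains at least one 0
mask = df['number'].eq(0).groupby(df['letter']).transform('any')
df_reduced = df[mask]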
I have a data set indicating who has shopped at which stores.
ID  Store
1   C
1   A
2   A
2   B
3   A
3   B
3   C
Can I use a pivot table to determine the frequency of a shopper going to other stores? I'm thinking of a 3x3 matrix where each cell would indicate how many people went to both the row store and the column store.
Desired output
   A  B  C
A  3  2  2
B  2  3  1
C  2  1  3
You can create a contingency table of ID and Store with pd.crosstab() and then calculate the matrix product of its transpose and itself, which should produce what you need:
mat = pd.crosstab(df.ID, df.Store)
mat.T.dot(mat)
# Store  A  B  C
# Store
# A      3  2  2
# B      2  2  1
# C      2  1  2
Note: since only two IDs visited store B and only two visited store C, I suppose the corresponding diagonal cells should be 2 instead of 3.
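The reason the matrix product works: pd.crosstab(df.ID, df.Store) is an indicator matrix M with one row per ID and one column per store, so entry (i, j) of M.T.dot(M) sums "visited store i" times "visited store j" over all IDs, i.e. the number of shoppers who went to both stores (each ID visits a store at most once in this data, so the entries are plain shopper counts). A tiny self-contained sketch:

import pandas as pd

df = pd.DataFrame({'ID':    [1, 1, 2, 2, 3, 3, 3],
                   'Store': ['C', 'A', 'A', 'B', 'A', 'B', 'C']})
mat = pd.crosstab(df.ID, df.Store)   # rows: IDs, columns: stores
print(mat.T.dot(mat))                # co-visit counts for every pair of stores
# Store  A  B  C
# Store
# A      3  2  2
# B      2  2  1
# C      2  1  2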
Another, faster solution with groupby, unstack and dot:
df = df.groupby(['ID','Store']).size().unstack(fill_value=0)
df = df.T.dot(df)
print (df)
Store  A  B  C
Store
A      3  2  2
B      2  2  1
C      2  1  2
Timings:
In [119]: %timeit (jez(df))
1000 loops, best of 3: 1.72 ms per loop
In [120]: %timeit (psi(df))
100 loops, best of 3: 7.07 ms per loop
Code for timings:
N = 1000
df = pd.DataFrame({'ID': np.random.choice(5, N),
                   'Store': np.random.choice(list('ABCDEFGHIJK'), N)})
print (df)
def jez(df):
    df = df.groupby(['ID','Store']).size().unstack(fill_value=0)
    return df.T.dot(df)

def psi(df):
    mat = pd.crosstab(df.ID, df.Store)
    return mat.T.dot(mat)
print (jez(df))
print (psi(df))
I have a pandas DataFrame with many small groups:
In [84]: n=10000
In [85]: df=pd.DataFrame({'group':sorted(range(n)*4),'val':np.random.randint(6,size=4*n)}).sort(['group','val']).reset_index(drop=True)
In [86]: df.head(9)
Out[86]:
   group  val
0      0    0
1      0    0
2      0    1
3      0    2
4      1    1
5      1    2
6      1    2
7      1    4
8      2    0
I want to do something special for groups where val==1 appears but not val==0. E.g. replace the 1 in the group by 99 only if the val==0 is in that group.
But for DataFrames of this size it is quite slow:
In [87]: def f(s):
   ....:     if (0 not in s) and (1 in s): s[s==1] = 99
   ....:     return s
   ....:
In [88]: %timeit df.groupby('group')['val'].transform(f)
1 loops, best of 3: 11.2 s per loop
Looping through the data frame is much uglier but much faster:
In [89]: %paste
def g(df):
    df.sort(['group','val'], inplace=True)
    last_g = -1
    for i in xrange(len(df)):
        if df.group.iloc[i] != last_g:
            # a new group starts here: remember it and reset the flag
            last_g = df.group.iloc[i]
            has_zero = False
        if df.val.iloc[i] == 0:
            has_zero = True
        elif has_zero and df.val.iloc[i] == 1:
            df.val.iloc[i] = 99
    return df
## -- End pasted text --
In [90]: %timeit g(df)
1 loops, best of 3: 2.53 s per loop
But I would like to optimize it further if possible.
Any idea of how to do so?
Thanks
Based on Jeff's answer, I got a solution that is very fast. I'm putting it here in case others find it useful:
In [122]: def do_fast(df):
   .....:     has_zero_mask = df.group.isin(df[df.val==0].group.unique())
   .....:     df.val[(df.val==1) & has_zero_mask] = 99
   .....:     return df
   .....:
In [123]: %timeit do_fast(df)
100 loops, best of 3: 11.2 ms per loop
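A small variation on the same idea (a sketch, not from the original post; the name do_fast_loc is just for illustration): on newer pandas the chained df.val[...] = 99 assignment can trigger SettingWithCopyWarning, so the same update can be written with .loc:

def do_fast_loc(df):
    # groups that contain at least one val == 0
    has_zero_mask = df.group.isin(df.loc[df.val == 0, 'group'].unique())
    df.loc[(df.val == 1) & has_zero_mask, 'val'] = 99
    return df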
Not 100% sure this is what you are going for, but it should be simple to swap in different filtering/setting criteria:
In [37]: pd.set_option('max_rows',10)
In [38]: np.random.seed(1234)
In [39]: def f():
    # create the frame
    df = pd.DataFrame({'group': sorted(range(n)*4),
                       'val': np.random.randint(6, size=4*n)}).sort(['group','val']).reset_index(drop=True)
    df['result'] = np.nan

    # create a count per group
    df['counter'] = df.groupby('group').cumcount()

    # select which values you want, returning the indexes of those
    mask = df[df.val==1].groupby('group').grouper.group_info[0]

    # set 'em
    df.loc[df.index.isin(mask) & df['counter'] == 1, 'result'] = 99
In [40]: %timeit f()
10 loops, best of 3: 95 ms per loop
In [41]: df
Out[41]:
       group  val  result  counter
0          0    3     NaN        0
1          0    4      99        1
2          0    4     NaN        2
3          0    5      99        3
4          1    0     NaN        0
...      ...  ...     ...      ...
39995   9998    4     NaN        3
39996   9999    0     NaN        0
39997   9999    0     NaN        1
39998   9999    2     NaN        2
39999   9999    3     NaN        3

[40000 rows x 4 columns]