Advanced Pandas chaining: chain index.droplevel after groupby - python

I am trying to find the top 2 most frequent values in the value column, grouped by the id column.
Here is the dataframe:
import pandas as pd

# groupby id and take only the top 2 values.
df = pd.DataFrame({'id':   [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value':[20, 20, 20, 30, 30, 30, 30, 40, 40, 10, 10, 40, 40, 40]})
Here is what I have done without chaining everything:
x = df.groupby('id')['value'].value_counts().groupby(level=0).nlargest(2).to_frame()
x.columns = ['count']
x.index = x.index.droplevel(0)
x = x.reset_index()
x
Result:
   id  value  count
0   1     30      4
1   1     20      3
2   2     40      3
3   2     10      2
Can we do this in ONE SINGLE chained operation?
So far I have this:
(df.groupby('id')['value']
   .value_counts()
   .groupby(level=0)
   .nlargest(2)
   .to_frame()
   .rename(columns={'value': 'count'}))
Now I am stuck at how to drop the index level.
How to do all these operations in one single chain?

You could use apply and head without the second groupby:
df.groupby('id')['value']\
  .apply(lambda x: x.value_counts().head(2))\
  .reset_index(name='count')\
  .rename(columns={'level_1': 'value'})
Output:
   id  value  count
0   1     30      4
1   1     20      3
2   2     40      3
3   2     10      2
Timings:
#This method
7 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#Groupby and groupby(level=0) with nlargest
12.9 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Try the below:
(df.groupby('id')['value']
   .value_counts()
   .groupby(level=0)
   .nlargest(2)
   .to_frame()
   .rename(columns={'value': 'count'})
   .reset_index([1, 2])
   .reset_index(drop=True))

Yet another solution:
df.groupby('id')['value'].value_counts().rename('count')\
.groupby(level=0).nlargest(2).reset_index(level=[1, 2])\
.reset_index(drop=True)
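For reference, the droplevel step the question asks about can also stay inside the chain, since Series has a .droplevel method; a minimal sketch, assuming pandas >= 0.24 where Series.droplevel is available:
(df.groupby('id')['value']
   .value_counts()
   .groupby(level=0)
   .nlargest(2)
   .droplevel(0)          # drop the outer id level added by groupby(level=0)
   .rename('count')
   .reset_index())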

Using the solution from @Scott Boston, I did some testing and also
tried to avoid apply altogether, but apply performs about as well
as the numpy approach:
import numpy as np
import pandas as pd
from collections import Counter

np.random.seed(100)
df = pd.DataFrame({'id':   np.random.randint(0, 5, 10000000),
                   'value':np.random.randint(0, 5, 10000000)})
# df = pd.DataFrame({'id':   [1,1,1,1,1,1,1,1,1,2,2,2,2,2],
#                    'value':[20,20,20,30,30,30,30,40,40,10,10,40,40,40]})
print(df.shape)
df.head()
Using apply (note: a bare %time line magic only times its own line, so the microsecond figures below do not reflect the cost of the groupby itself; %%time at the top of the cell would time the whole block):
%time
df.groupby('id')['value']\
  .apply(lambda x: x.value_counts().head(2))\
  .reset_index(name='count')\
  .rename(columns={'level_1': 'value'})
# CPU times: user 3 µs, sys: 0 ns, total: 3 µs
# Wall time: 6.2 µs
Without using apply at all:
%time
grouped = df.groupby('id')['value']
res = np.zeros([2, 3], dtype=int)
for name, group in grouped:
    data = np.array(Counter(group.values).most_common(2))
    ids = np.ones([2, 1], dtype=int) * name
    data = np.append(ids, data, axis=1)
    res = np.append(res, data, axis=0)
pd.DataFrame(res[2:], columns=['id', 'value', 'count'])
# CPU times: user 3 µs, sys: 0 ns, total: 3 µs
# Wall time: 5.96 µs

Related

Python: using groupby and apply to rebuild dataframe

I want to rebuild my dataframe from df1 into df2.
df1 looks like this:
id  counts  days
1   2       4
1   3       4
1   4       4
2   56      8
2   37      9
2   10      7
2   10      4
df2 looks like this:
id  countsList     daysList
1   '2,3,4'        '4,4,4'
2   '56,37,10,10'  '8,9,7,4'
where countsList and daysList in df2 are strings.
df1 has about 1 million rows, so iterating with a plain for loop would be very slow.
I want to use groupby and apply instead. Do you have an efficient way to do this?
My computer info:
CPU: Xeon 6226R 2.9GHz, 32 cores
RAM: 16G
Python: 3.9.7
You might use agg (and then rename the columns):
import numpy as np
import pandas as pd

np.random.seed(123)
n = 1_000_000
df = pd.DataFrame({
    "id":     np.random.randint(100_000, size=n),
    "counts": np.random.randint(10, size=n),
    "days":   np.random.randint(10, size=n)
})
df2 = df.groupby('id').agg(lambda x: ','.join(map(str, x)))\
        .add_suffix('List').reset_index()
#       id     countsList       daysList
# 0  15725    7,5,6,3,7,0    7,9,5,8,0,1
# 1  28030  7,6,5,1,9,6,5  5,0,8,4,8,6,0
It isn't "that" slow - %%timeit for 1 million rows and 100k groups:
639 ms ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
EDIT: The solution proposed in How to group dataframe rows into list in pandas groupby
is a bit faster:
id, counts, days = df.values[df.values[:, 0].argsort()].T
u_ids, index = np.unique(id, True)
counts = np.split(counts, index[1:])
days = np.split(days, index[1:])
df2 = pd.DataFrame({'id':u_ids, 'counts':counts, 'days':days})
but not mega faster:
313 ms ± 6.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
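Note that the np.split variant leaves NumPy arrays in the counts and days cells; if the comma-joined string form from the question is needed, a small follow-up step (a sketch, not benchmarked) can join them afterwards:
df2['countsList'] = [','.join(map(str, a)) for a in df2['counts']]
df2['daysList'] = [','.join(map(str, a)) for a in df2['days']]
df2 = df2.drop(columns=['counts', 'days'])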

How can I use the fastest method to group the dataframe and generate a column as a sequence of numbers?

My code steps are as follows:
1. Generate the date dataframe date.
2. Generate the code dataframe.
3. Generate the Cartesian product df of date and code.
4. Delete the redundant columns ['a', 'level_1', 'order'].
5. Group by the date column and generate an order column according to the order of value within each group.
My questions:
1. These steps feel too cumbersome; is there an easier way?
2. How can I avoid generating the level_1 and order columns in the fourth step?
3. How can I optimize the code? It currently takes 5 seconds to run.
My code is as follows:
import pandas as pd
import numpy as np

def add_order(df):
    df = df.reset_index(drop=True).reset_index()
    df = df.drop(columns='date')
    return df

def generate_data():
    np.random.seed(202107)
    date = pd.date_range(start='20150101', end='20210723', freq='D')
    date = date.to_pydatetime()
    date = np.vectorize(lambda s: s.strftime('%Y-%m-%d'))(date)
    date = pd.DataFrame(date, columns=['date'])
    date['a'] = 1
    code = pd.DataFrame(range(50), columns=['code'])
    code['a'] = 1
    df = pd.merge(date, code, how='outer')
    df['value'] = np.random.random(len(df)) * 1000
    return df

def get_result(df):
    df = df.sort_values(by='value', ascending=False)
    df = df.groupby('date').apply(add_order)
    df = df.reset_index().sort_values(by=['date', 'code']).reset_index(drop=True)
    df = df.rename(columns={'index': 'order'})
    col = ['date', 'code', 'value', 'order']
    df = df[col]
    # print(df)
    return df

def main():
    df = generate_data()
    df = get_result(df)
%timeit main()
5.25 s ± 130 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The result is:
date code value order
0 2015-01-01 0 227.190649 39
1 2015-01-01 1 543.938036 26
2 2015-01-01 2 175.707748 43
3 2015-01-01 3 789.146427 9
4 2015-01-01 4 585.727841 24
... ... ... ... ...
119795 2021-07-23 45 92.698866 43
119796 2021-07-23 46 111.500843 40
119797 2021-07-23 47 700.675634 12
119798 2021-07-23 48 933.134534 4
119799 2021-07-23 49 108.004811 42
It seems the a column is unnecessary, so the generation can become something like:
def generate_data_mod():
    np.random.seed(202107)
    df = pd.MultiIndex.from_product(
        [pd.date_range(start='20150101', end='20210723', freq='D').strftime('%Y-%m-%d'),
         np.arange(50)],
        names=['date', 'code']
    ).to_frame(index=False)
    df['value'] = np.random.random(len(df)) * 1000
    # df['a'] = 1  # (if it is needed)
    return df
Then we can sort_values by value, use groupby().cumcount() to enumerate rows within each group, and sort_index to restore the original order:
def get_result_mod(df):
    # df = df.drop(columns='a')  # if df has the a column
    df = df.sort_values(by='value', ascending=False)
    df['order'] = df.groupby('date').cumcount()
    df = df.sort_index()
    return df
Sanity Checks:
def main():
    df = generate_data()
    df_mod = generate_data_mod()
    # True (note df_mod has no 'a' column)
    print(df.drop(columns='a').eq(df_mod).all(None))
    # True
    print(get_result(df).eq(get_result_mod(df_mod)).all(None))
Timing Information:
Data generation takes about the same time (merge is very efficient):
%timeit generate_data()
21 ms ± 507 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit generate_data_mod()
20.2 ms ± 428 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
get_result is much faster this way:
df = generate_data()
%timeit get_result(df)
1.77 s ± 28.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit get_result_mod(df)
51 ms ± 4.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
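For reference, the order column can also be computed without any sorting, using groupby rank; a sketch (not benchmarked here), where method='first' breaks ties by position so the result matches cumcount after a descending sort:
df = generate_data_mod()
# Largest value within each date gets order 0, as in get_result_mod.
df['order'] = (df.groupby('date')['value']
                 .rank(ascending=False, method='first')
                 .astype(int) - 1)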
I hope I've understood your question right, but based on your comments:
from itertools import product
date = pd.date_range(start="20150101", end="20210723", freq="D")
date = pd.DataFrame(date, columns=["date"])
code = pd.DataFrame(range(50), columns=["code"])
# generate product of the two columns:
df = pd.DataFrame(product(date["date"], code["code"]), columns=["date", "code"])
# to each "date group" add ascending random values
g = df.groupby("date")
df["value"] = g.transform(lambda x: np.random.random(len(x)) * 1000)
df["order"] = g["value"].transform(np.argsort)
print(df)
Prints:
date code value order
0 2015-01-01 0 72.380011 26
1 2015-01-01 1 888.644908 42
2 2015-01-01 2 205.610256 40
3 2015-01-01 3 425.763108 16
4 2015-01-01 4 198.628891 0
5 2015-01-01 5 659.725661 34
...
119795 2021-07-23 45 376.110403 19
119796 2021-07-23 46 697.473751 13
119797 2021-07-23 47 615.449182 10
119798 2021-07-23 48 741.031350 39
119799 2021-07-23 49 201.422477 15

Perform True/False operation on a column based on the condition present in another column in pandas

I have a dataframe
df_in = pd.DataFrame([[1,"A",32,">30"],[2,"B",12,"<10"],[3,"C",45,">=45"]],columns=['id', 'input', 'val', 'cond'])
I want to perform an operation on column "val" based on the condition present in "cond" column and get the True/False result in "Output" column.
Expected Output:
df_out = pd.DataFrame([[1,"A",32,">30",True],[2,"B",12,"<10",False],[3,"C",45,">=45",True]],columns=['id', 'input', 'val', 'cond',"Output"])
How to do it?
You can try:
df_in['output'] = pd.eval(df_in['val'].astype(str) + df_in['cond'])
OR
If performance matters, use the method below (but also see this thread on the safety of eval; in your case I think it is safe to use):
df_in['output'] = list(map(lambda x: eval(x), (df_in['val'].astype(str) + df_in['cond']).tolist()))
OR
Even faster:
from numpy.core import defchararray
df_in['output'] = list(map(lambda x: eval(x), defchararray.add(df_in['val'].values.astype(str), df_in['cond'].values)))
Output of df_in:
   id input  val  cond  output
0   1     A   32   >30    True
1   2     B   12   <10   False
2   3     C   45  >=45    True
Using numexpr
import numexpr
df_in['output'] = df_in.apply(lambda x: numexpr.evaluate(f"{x['val']}{x['cond']}"), axis=1 )
   id input  val  cond  output
0   1     A   32   >30    True
1   2     B   12   <10   False
2   3     C   45  >=45    True
Time Comparison:
using %%timeit -n 1000
using apply and numexpr:
865 µs ± 140 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
using pd.eval:
2.5 ms ± 363 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
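If eval is a concern, an eval-free variant is possible by mapping the comparison strings to functions from the operator module; this is a sketch that assumes only the operators listed below appear in cond, and reuses df_in from the question:
import operator

ops = {'>=': operator.ge, '<=': operator.le, '==': operator.eq,
       '!=': operator.ne, '>': operator.gt, '<': operator.lt}

def check(val, cond):
    # Try two-character operators first so '>=' is not matched as '>'.
    for sym in ('>=', '<=', '==', '!=', '>', '<'):
        if cond.startswith(sym):
            return ops[sym](val, float(cond[len(sym):]))
    raise ValueError(f'unrecognised condition: {cond}')

df_in['Output'] = [check(v, c) for v, c in zip(df_in['val'], df_in['cond'])]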

How do I add a column for each repeated element in Python Pandas?

I'm trying to group all costs for each client in a report, separated into columns. The number of columns added depends on how many times the same client adds a cost.
For example:
Client  Costs
A       5
B       10
B       2
B       5
A       4
The result that I want:
Client  Cost_1  Cost_2  Cost_3  Cost_n
A       5       4
B       10      2       5
Keep in mind the original database is huge so any efficiency would help.
You can use GroupBy.cumcount() to get a running cost number within each client. Then use df.pivot() to transform the data into columns. Use .add_prefix together with that running number to format the column labels.
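For a self-contained run, the sample data from the question can be rebuilt like this (a small setup sketch):
import pandas as pd

df = pd.DataFrame({'Client': ['A', 'B', 'B', 'B', 'A'],
                   'Costs':  [5, 10, 2, 5, 4]})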
df['cost_num'] = df.groupby('Client').cumcount() + 1
(df.pivot(index='Client', columns='cost_num', values='Costs')
   .add_prefix('Cost_')
   .rename_axis(columns=None)
   .reset_index()
)
Result:
Client Cost_1 Cost_2 Cost_3
0 A 5.0 4.0 NaN
1 B 10.0 2.0 5.0
System Performance
Let's see the system performance for 500,000 rows:
df2 = pd.concat([df] * 100000, ignore_index=True)
%%timeit
df2['cost_num'] = df2.groupby('Client').cumcount() + 1
(df2.pivot(index='Client', columns='cost_num', values='Costs')
    .add_prefix('Cost_')
    .rename_axis(columns=None)
    .reset_index()
)
587 ms ± 26.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It takes about 0.59 seconds for a 500,000-row dataframe.
Let's see the system performance for 5,000,000 rows:
df3 = pd.concat([df] * 1000000, ignore_index=True)
%%timeit
df3['cost_num'] = df3.groupby('Client').cumcount() + 1
(df3.pivot(index='Client', columns='cost_num', values='Costs')
    .add_prefix('Cost_')
    .rename_axis(columns=None)
    .reset_index()
)
6.35 s ± 128 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It takes about 6.35 seconds for a 5,000,000-row dataframe.
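The same reshape can also be written with set_index plus unstack instead of pivot; a sketch (not benchmarked, but it should produce the same frame as the pivot version above):
out = (df.set_index(['Client', df.groupby('Client').cumcount() + 1])['Costs']
         .unstack()                  # cost numbers become columns 1, 2, 3, ...
         .add_prefix('Cost_')
         .rename_axis(columns=None)
         .reset_index())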

Fast, efficient pandas Groupby sum / mean without aggregation

It is easy and fast to perform grouping and aggregation in pandas. However, performing groupby operations that pandas already has built in C without aggregation (broadcasting the group result back to every row), at least in the way I do it, is far slower because of the lambda function.
# Form data
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.random((100,3)),columns=['a','b','c'])
>>> df['g'] = np.random.randint(0,3,100)
>>> df.head()
a b c g
0 0.901610 0.643869 0.094082 1
1 0.536437 0.836622 0.763244 1
2 0.647989 0.150460 0.476552 0
3 0.206455 0.319881 0.690032 2
4 0.153557 0.765174 0.377879 1
# groupby and apply and aggregate
>>> df.groupby('g')['a'].sum()
g
0 17.177280
1 15.395264
2 17.668056
Name: a, dtype: float64
# groupby and apply without aggregation
>>> df.groupby('g')['a'].transform(lambda x: x.sum())
0     15.395264
1     15.395264
2     17.177280
3     17.668056
4     15.395264
        ...
95    15.395264
96    17.668056
97    15.395264
98    17.668056
99    17.177280
Name: a, Length: 100, dtype: float64
Thus, I have the functionality desired with the lambda function, but the speed is bad.
>>> %timeit df.groupby('g')['a'].sum()
1.11 ms ± 143 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit df.groupby('g')['a'].transform(lambda x:x.sum())
4.01 ms ± 699 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This becomes a problem with larger datasets. I assume there is a faster and more efficient way to get this functionality.
You are probably looking for:
df.groupby('g')['a'].transform('sum')
It is indeed faster than the version with the lambda:
import numpy as np
import pandas as pd
import timeit

df = pd.DataFrame(np.random.random((100, 3)), columns=['a', 'b', 'c'])
df['g'] = np.random.randint(0, 3, 100)

def groupby():
    df.groupby('g')['a'].sum()

def transform_apply():
    df.groupby('g')['a'].transform(lambda x: x.sum())

def transform():
    df.groupby('g')['a'].transform('sum')

print('groupby', timeit.timeit(groupby, number=10))
print('lambda transform', timeit.timeit(transform_apply, number=10))
print('transform', timeit.timeit(transform, number=10))
The output:
groupby 0.010655807999999989
lambda transform 0.029328375000000073
transform 0.01493376600000007
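The built-in string form covers the mean from the title in the same way; a minimal sketch adding both broadcast columns (the column names here are just illustrative):
df['a_sum_per_g'] = df.groupby('g')['a'].transform('sum')
df['a_mean_per_g'] = df.groupby('g')['a'].transform('mean')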
