Python: using groupby and apply to rebuild dataframe

I want to rebuild my dataframe from df1 to df2.
df1 looks like this:
id  counts  days
1   2       4
1   3       4
1   4       4
2   56      8
2   37      9
2   10      7
2   10      4
df2 should look like this:
id  countsList     daysList
1   '2,3,4'        '4,4,4'
2   '56,37,10,10'  '8,9,7,4'
where countsList and daysList in df2 are strings.
df1 has about 1 million rows, so iterating over it with a for loop would be very slow.
I want to use groupby and apply to achieve this instead. Do you have any solution or an efficient way to do it?
My computer info:
CPU: Xeon 6226R, 2.9 GHz, 32 cores
RAM: 16 GB
Python: 3.9.7
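For reference, here is a minimal sketch of the transformation on the sample rows above, written with groupby plus apply as the question asks; the answer below shows agg- and numpy-based versions:
import pandas as pd

# sample data copied from the question
df1 = pd.DataFrame({
    "id":     [1, 1, 1, 2, 2, 2, 2],
    "counts": [2, 3, 4, 56, 37, 10, 10],
    "days":   [4, 4, 4, 8, 9, 7, 4],
})

# join each group's values into one comma-separated string per column
df2 = (
    df1.groupby("id")
       .apply(lambda g: pd.Series({
           "countsList": ",".join(map(str, g["counts"])),
           "daysList": ",".join(map(str, g["days"])),
       }))
       .reset_index()
)
print(df2)
#    id   countsList daysList
# 0   1        2,3,4    4,4,4
# 1   2  56,37,10,10  8,9,7,4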

You might use agg (and then rename columns)
import numpy as np
import pandas as pd

np.random.seed(123)
n = 1_000_000
df = pd.DataFrame({
    "id": np.random.randint(100_000, size=n),
    "counts": np.random.randint(10, size=n),
    "days": np.random.randint(10, size=n)
})

df2 = df.groupby('id').agg(lambda x: ','.join(map(str, x)))\
        .add_suffix('List').reset_index()
# id countsList daysList
#0 15725 7,5,6,3,7,0 7,9,5,8,0,1
#1 28030 7,6,5,1,9,6,5 5,0,8,4,8,6,0
It isn't "that" slow - %%timeit for 1 milion rows and 100k groups:
639 ms ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
EDIT: The solution proposed in How to group dataframe rows into list in pandas groupby is a bit faster:
# sort rows by id, then split the counts/days arrays wherever a new id starts
ids, counts, days = df.values[df.values[:, 0].argsort()].T
u_ids, index = np.unique(ids, return_index=True)
counts = np.split(counts, index[1:])
days = np.split(days, index[1:])
df2 = pd.DataFrame({'id': u_ids, 'counts': counts, 'days': days})
but not dramatically faster:
313 ms ± 6.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
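Note that this variant leaves counts and days as arrays per id rather than the comma-separated strings the question asks for; if strings are needed, they can be joined afterwards (an assumed follow-up, not part of the original answer):
df2['counts'] = [','.join(map(str, a)) for a in df2['counts']]
df2['days'] = [','.join(map(str, a)) for a in df2['days']]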

Related

Multi-index Dataframe from dictionary of Dataframes

I'd like to create a multi-index dataframe from a dictionary of dataframes, where the top-level index is the index of the dataframes within the dictionary and the second-level index is the keys of the dictionary.
Example
import pandas as pd
dt_index = pd.to_datetime(['2003-05-01', '2003-05-02', '2003-05-03'])
column_names = ['Y', 'X']
df_dict = {'A': pd.DataFrame([[1,3],[7,4],[5,8]], index=dt_index, columns=column_names),
            'B': pd.DataFrame([[12,3],[9,8],[75,0]], index=dt_index, columns=column_names),
            'C': pd.DataFrame([[3,12],[5,1],[22,5]], index=dt_index, columns=column_names)}
Expected output:
Y X
2003-05-01 A 1 3
2003-05-01 B 12 3
2003-05-01 C 3 12
2003-05-02 A 7 4
2003-05-02 B 9 8
2003-05-02 C 5 1
2003-05-03 A 5 8
2003-05-03 B 75 0
2003-05-03 C 22 5
I've tried
pd.concat(df_dict, axis=0)
but this gives me the levels of the multi-index in the incorrect order.
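For reference, with the example data the dictionary keys end up on the outer index level instead of the dates, roughly like this:
print(pd.concat(df_dict, axis=0).head(4))
#                Y  X
# A 2003-05-01   1  3
#   2003-05-02   7  4
#   2003-05-03   5  8
# B 2003-05-01  12  3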
Edit: Timings
Based on the answers so far, this seems like a slow operation to perform as the Dataframe scales.
Larger dummy data:
import numpy as np
import pandas as pd
D = 3000
C = 500
dt_index = pd.date_range('2000-1-1', periods=D)
keys = 'abcdefghijk'
df_dict = {k:pd.DataFrame(np.random.rand(D,C), index=dt_index) for k in keys}
Converting the dictionary to a dataframe, albeit with the index levels swapped, takes:
%timeit pd.concat(df_dict, axis=0)
63.4 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Even in the best case, creating a dataframe with the indices in the other order takes 8 times longer than the above!
%timeit pd.concat(df_dict, axis=0).swaplevel().sort_index()
528 ms ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pd.concat(df_dict, axis=1).stack(0)
1.72 s ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Use DataFrame.swaplevel with DataFrame.sort_index:
df = pd.concat(df_dict, axis=0).swaplevel(0,1).sort_index()
print (df)
Y X
2003-05-01 A 1 3
B 12 3
C 3 12
2003-05-02 A 7 4
B 9 8
C 5 1
2003-05-03 A 5 8
B 75 0
C 22 5
You can reach down into numpy for a speed-up if you can guarantee two things:
Each of your DataFrames in df_dict has the exact same index
Each of your DataFrames is already sorted
import numpy as np
import pandas as pd
D = 3000
C = 500
dt_index = pd.date_range('2000-1-1', periods=D)
keys = 'abcdefghijk'
df_dict = {k:pd.DataFrame(np.random.rand(D,C), index=dt_index) for k in keys}
out = pd.DataFrame(
    data=np.column_stack([*df_dict.values()]).reshape(-1, C),
    index=pd.MultiIndex.from_product([df_dict["a"].index, df_dict.keys()]),
)
# check if this result is consistent with other answers
assert (pd.concat(df_dict, axis=0).swaplevel().sort_index() == out).all().all()
Timing:
%%timeit
pd.concat(df_dict, axis=0)
# 26.2 ms ± 412 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
pd.DataFrame(
    data=np.column_stack([*df_dict.values()]).reshape(-1, 500),
    index=pd.MultiIndex.from_product([df_dict["a"].index, df_dict.keys()]),
)
# 31.2 ms ± 497 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
pd.concat(df_dict, axis=0).swaplevel().sort_index()
# 123 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Use concat on axis=1 and stack:
out = pd.concat(df_dict, axis=1).stack(0)
Output:
X Y
2003-05-01 A 3 1
B 3 12
C 12 3
2003-05-02 A 4 7
B 8 9
C 1 5
2003-05-03 A 8 5
B 0 75
C 5 22
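Note that in this output the columns come back as X, Y rather than the original Y, X; if that order matters, a plain column selection restores it (a small assumed follow-up, not part of the original answer):
out = out[['Y', 'X']]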

Count elements in defined groups in pandas dataframe

Say I have a dataframe and I want to count how many times the elements of a list, e.g. [1,5,2], occur in a/each column.
I could do something like
elem_list = [1,5,2]
for e in elem_list:
    (df["col1"] == e).sum()
but isn't there a better way like
elem_list = [1,5,2]
df["col1"].count_elements(elem_list)
#1 5 # 1 occurs 5 times
#5 3 # 5 occurs 3 times
#2 0 # 2 occurs 0 times
Note it should count all the elements in the list, and return "0" if an element in the list is not in the column.
You can use value_counts and reindex:
df = pd.DataFrame({'col1': [1,1,5,1,5,1,1,4,3]})
elem_list = [1,5,2]
df['col1'].value_counts().reindex(elem_list, fill_value=0)
output:
1 5
5 2
2 0
benchmark (100k values):
# setup
df = pd.DataFrame({'col1': np.random.randint(0,10, size=100000)})
df['col1'].value_counts().reindex(elem_list, fill_value=0)
# 774 µs ± 10.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
pd.Categorical(df['col1'],elem_list).value_counts()
# 2.72 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df.loc[df["col1"].isin(elem_list), 'col1'].value_counts().reindex(elem_list, fill_value=0)
# 2.98 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Pass to Categorical, which will return 0 for missing items:
pd.Categorical(df['col1'],elem_list).value_counts()
Out[62]:
1 3
5 0
2 1
dtype: int64
First filter with Series.isin and DataFrame.loc and then use Series.value_counts; finally, if order is important, add Series.reindex:
df.loc[df["col1"].isin(elem_list), 'col1'].value_counts().reindex(elem_list, fill_value=0)
You could do something like this:
df = pd.DataFrame({"col1":np.random.randint(0,10, 100)})
df[df["col1"].isin([0,1])].value_counts()
# col1
# 1 17
# 0 10
# dtype: int64

How do I add a column for each repeated element in Python Pandas?

I'm trying to group all costs for each client in a report, separated into columns. The number of columns added depends on how many times the same client adds a cost.
For example:
Client  Costs
A       5
B       10
B       2
B       5
A       4
The result that I want:
Client  Cost_1  Cost_2  Cost_3  ...  Cost_n
A       5       4
B       10      2       5
Keep in mind the original database is huge, so any efficiency gain would help.
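For reference, the answer below assumes the sample data is loaded roughly like this (my reconstruction of the table above, referred to as df):
import pandas as pd

df = pd.DataFrame({
    "Client": ["A", "B", "B", "B", "A"],
    "Costs": [5, 10, 2, 5, 4],
})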
You can use GroupBy.cumcount() to get a serial number for each cost within a client. Then use df.pivot() to spread the data into columns, and use .add_prefix together with that serial number to format the column labels.
df['cost_num'] = df.groupby('Client').cumcount() + 1
(df.pivot(index='Client', columns='cost_num', values='Costs')
   .add_prefix('Cost_')
   .rename_axis(columns=None)
   .reset_index()
)
Result:
Client Cost_1 Cost_2 Cost_3
0 A 5.0 4.0 NaN
1 B 10.0 2.0 5.0
System Performance
Let's see the system performance for 500,000 rows:
df2 = pd.concat([df] * 100000, ignore_index=True)
%%timeit
df2['cost_num'] = df2.groupby('Client').cumcount() + 1
(df2.pivot(index='Client', columns='cost_num', values='Costs')
    .add_prefix('Cost_')
    .rename_axis(columns=None)
    .reset_index()
)
587 ms ± 26.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It takes about 0.59 seconds for 500,000 rows.
Let's see the system performance for 5,000,000 rows:
df3 = pd.concat([df] * 1000000, ignore_index=True)
%%timeit
df3['cost_num'] = df3.groupby('Client').cumcount() + 1
(df3.pivot(index='Client', columns='cost_num', values='Costs')
    .add_prefix('Cost_')
    .rename_axis(columns=None)
    .reset_index()
)
6.35 s ± 128 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It takes about 6.35 seconds for 5,000,000 rows.

Advanced Pandas chaining: chain index.droplevel after groupby

I am trying to find the top 2 most frequent values in the value column, grouped by id.
Here is the dataframe:
# groupby id and take only top 2 values.
df = pd.DataFrame({'id': [1,1,1,1,1,1,1,1,1,2,2,2,2,2],
                   'value': [20,20,20,30,30,30,30,40,40,10,10,40,40,40]})
I have done it without chained grouping:
x = df.groupby('id')['value'].value_counts().groupby(level=0).nlargest(2).to_frame()
x.columns = ['count']
x.index = x.index.droplevel(0)
x = x.reset_index()
x
Result:
id value count
0 1 30 4
1 1 20 3
2 2 40 3
3 2 10 2
Can we do this in ONE single chained operation?
So far, I have done this:
(df.groupby('id')['value']
   .value_counts()
   .groupby(level=0)
   .nlargest(2)
   .to_frame()
   .rename(columns={'value': 'count'}))
Now I am stuck at how to drop the index level.
How do I do all these operations in one single chain?
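For what it's worth, here is one possible fully-chained sketch (my own variant, not taken from the answers below): DataFrame.droplevel can be called inside the chain to drop the duplicated outer id level that groupby(level=0).nlargest(2) adds.
(df.groupby('id')['value']
   .value_counts()
   .groupby(level=0)
   .nlargest(2)
   .to_frame('count')  # name the column directly instead of renaming afterwards
   .droplevel(0)       # drop the duplicated outer id level
   .reset_index())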
You could use apply and head without the second groupby:
df.groupby('id')['value']\
  .apply(lambda x: x.value_counts().head(2))\
  .reset_index(name='count')\
  .rename(columns={'level_1':'value'})
Output:
id value count
0 1 30 4
1 1 20 3
2 2 40 3
3 2 10 2
Timings:
#This method
7 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#Groupby and groupby(level=0) with nlargest
12.9 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Try the below:
(df.groupby('id')['value']
   .value_counts()
   .groupby(level=0)
   .nlargest(2)
   .to_frame()).rename(columns={'value':'count'}).reset_index([1,2]).reset_index(drop=True)
Yet another solution:
df.groupby('id')['value'].value_counts().rename('count')\
  .groupby(level=0).nlargest(2).reset_index(level=[1, 2])\
  .reset_index(drop=True)
Using the solution from @Scott Boston, I did some testing and also tried to avoid apply altogether, but apply still performs as well as using numpy:
import numpy as np
import pandas as pd
from collections import Counter
np.random.seed(100)
df = pd.DataFrame({'id': np.random.randint(0, 5, 10000000),
                   'value': np.random.randint(0, 5, 10000000)})
# df = pd.DataFrame({'id': [1,1,1,1,1,1,1,1,1,2,2,2,2,2],
#                    'value': [20,20,20,30,30,30,30,40,40,10,10,40,40,40]})
print(df.shape)
df.head()
Using apply
%time
df.groupby('id')['value']\
  .apply(lambda x: x.value_counts().head(2))\
  .reset_index(name='count')\
  .rename(columns={'level_1':'value'})
# CPU times: user 3 µs, sys: 0 ns, total: 3 µs
# Wall time: 6.2 µs
Without using apply at all
%time
grouped = df.groupby('id')['value']
res = np.zeros([2, 3], dtype=int)
for name, group in grouped:
    # take the two most common values in this group and prepend the group id
    data = np.array(Counter(group.values).most_common(2))
    ids = np.ones([2, 1], dtype=int) * name
    data = np.append(ids, data, axis=1)
    res = np.append(res, data, axis=0)
pd.DataFrame(res[2:], columns=['id', 'value', 'count'])
# CPU times: user 3 µs, sys: 0 ns, total: 3 µs
# Wall time: 5.96 µs

What is the fastest way to perform a replace on a column of a Pandas DataFrame based on the index of a separate Series?

Sorry if I've been googling the wrong keywords, but I haven't been able to find an efficient way to replace all instances of an integer in a DataFrame column with its corresponding indexed value from a secondary Series.
I'm working with the output of a third party program that strips the row and column labels from an input matrix and replaces them with their corresponding indices. I'd like to restore the true labels from the indices.
I have a dummy example of the dataframe and series in question:
In [6]: df
Out[6]:
idxA idxB var2
0 0 1 2.0
1 0 2 3.0
2 2 4 2.0
3 2 1 1.0
In [8]: labels
Out[8]:
0 A
1 B
2 C
3 D
4 E
Name: label, dtype: object
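For reference, the dummy objects above can be reconstructed like this (my reconstruction, matching the displayed values):
import pandas as pd

df = pd.DataFrame({
    "idxA": [0, 0, 2, 2],
    "idxB": [1, 2, 4, 1],
    "var2": [2.0, 3.0, 2.0, 1.0],
})
labels = pd.Series(list("ABCDE"), name="label")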
Currently, I'm converting the series to a dictionary and using replace:
label_dict = labels.to_dict()
df['idxA'] = df.idxA.replace(label_dict)
df['idxB'] = df.idxB.replace(label_dict)
which does give me the expected result:
In [12]: df
Out[12]:
idxA idxB var2
0 A B 2.0
1 A C 3.0
2 C E 2.0
3 C B 1.0
However, this is very slow for my full dataset (approximately 3.8 million rows in the table, and 19,000 labels). Is there a more efficient way to approach this?
Thanks!
EDIT: I accepted @coldspeed's answer. I couldn't paste a code block in the comment reply to his answer, but his solution sped up the dummy code by about an order of magnitude:
In [10]: %timeit df.idxA.replace(label_dict)
4.41 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: %timeit df.idxA.map(labels)
435 µs ± 3.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You can call map for each column using apply:
df.loc[:, 'idxA':'idxB'] = df.loc[:, 'idxA':'idxB'].apply(lambda x: x.map(labels))
df
idxA idxB var2
0 A B 2.0
1 A C 3.0
2 C E 2.0
3 C B 1.0
This is effectively iterating over every column (but the map operation for a single column is vectorized, so it is fast). It might just be faster to do
cols_of_interest = ['idxA', 'idxB', ...]
for c in cols_of_interest: df[c] = df[c].map(labels)
map is faster than replace, depending on the number of columns to replace. Your mileage may vary.
df_ = df.copy()
df = pd.concat([df_] * 10000, ignore_index=True)
%timeit df.loc[:, 'idxA':'idxB'].replace(labels)
%%timeit
for c in ['idxA', 'idxB']:
    df[c].map(labels)
6.55 ms ± 87.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.95 ms ± 70 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
