How do I add a column for each repeated element in Python Pandas?

I'm trying to group all costs for each client in a report separated in columns. The number of columns added depends on how many times the same client adds a cost.
For example:
Client  Costs
A       5
B       10
B       2
B       5
A       4
The result that I want:
Client  Cost_1  Cost_2  Cost_3  ...  Cost_n
A       5       4
B       10      2       5
Keep in mind the original database is huge, so any efficiency gains would help.
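For reference, a minimal construction of the example data above (a sketch; column names taken from the tables):
import pandas as pd

# one row per cost entry, as in the question
df = pd.DataFrame({'Client': ['A', 'B', 'B', 'B', 'A'],
                   'Costs':  [5, 10, 2, 5, 4]})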

You can use GroupBy.cumcount() to get a running cost number within each client. Then use df.pivot() to spread the costs into columns, and .add_prefix() to build the Cost_n column labels from those numbers.
df['cost_num'] = df.groupby('Client').cumcount() + 1
(df.pivot(index='Client', columns='cost_num', values='Costs')
   .add_prefix('Cost_')
   .rename_axis(columns=None)
   .reset_index()
)
Result:
Client Cost_1 Cost_2 Cost_3
0 A 5.0 4.0 NaN
1 B 10.0 2.0 5.0
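If you prefer not to add the helper column at all, an equivalent reshape with set_index + unstack is a possible sketch:
# build a (Client, running cost number) index, then unstack the number level into columns
out = (df.set_index(['Client', df.groupby('Client').cumcount() + 1])['Costs']
         .unstack()
         .add_prefix('Cost_')
         .reset_index())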
System Performance
Let's see the system performance for 500,000 rows:
df2 = pd.concat([df] * 100000, ignore_index=True)
%%timeit
df2['cost_num'] = df2.groupby('Client').cumcount() + 1
(df2.pivot(index='Client', columns='cost_num', values='Costs')
    .add_prefix('Cost_')
    .rename_axis(columns=None)
    .reset_index()
)
587 ms ± 26.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It takes about 0.59 seconds for a 500,000-row DataFrame.
Let's see the system performance for 5,000,000 rows:
df3 = pd.concat([df] * 1000000, ignore_index=True)
%%timeit
df3['cost_num'] = df3.groupby('Client').cumcount() + 1
(df3.pivot(index='Client', columns='cost_num', values='Costs')
    .add_prefix('Cost_')
    .rename_axis(columns=None)
    .reset_index()
)
6.35 s ± 128 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It takes about 6.35 seconds for a 5,000,000-row DataFrame.
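If you would rather keep integer columns instead of floats with NaN, one option (a sketch, reusing the cost_num column from above) is to cast to pandas' nullable Int64 dtype after pivoting:
out = (df.pivot(index='Client', columns='cost_num', values='Costs')
         .astype('Int64')   # missing cells become <NA> instead of forcing floats
         .add_prefix('Cost_')
         .rename_axis(columns=None)
         .reset_index())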

Related

Count elements in defined groups in pandas dataframe

Say I have a dataframe and I want to count how many times each element of a list, e.g. [1,5,2], occurs in a column.
I could do something like
elem_list = [1,5,2]
for e in elem_list:
    print((df["col1"] == e).sum())
but isn't there a better way like
elem_list = [1,5,2]
df["col1"].count_elements(elem_list)
#1 5 # 1 occurs 5 times
#5 3 # 5 occurs 3 times
#2 0 # 2 occurs 0 times
Note it should count all the elements in the list, and return "0" if an element in the list is not in the column.
You can use value_counts and reindex:
df = pd.DataFrame({'col1': [1,1,5,1,5,1,1,4,3]})
elem_list = [1,5,2]
df['col1'].value_counts().reindex(elem_list, fill_value=0)
output:
1 5
5 2
2 0
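If you want something close to the count_elements call from the question, you can wrap this one-liner in a small helper (count_elements is a hypothetical name, not an existing pandas method):
def count_elements(s, elems):
    # count how often each element of elems occurs in Series s, with 0 for absent elements
    return s.value_counts().reindex(elems, fill_value=0)

count_elements(df['col1'], elem_list)
# 1    5
# 5    2
# 2    0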
benchmark (100k values):
# setup
df = pd.DataFrame({'col1': np.random.randint(0,10, size=100000)})
df['col1'].value_counts().reindex(elem_list, fill_value=0)
# 774 µs ± 10.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
pd.Categorical(df['col1'],elem_list).value_counts()
# 2.72 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df.loc[df["col1"].isin(elem_list), 'col1'].value_counts().reindex(elem_list, fill_value=0)
# 2.98 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Pass the column to pd.Categorical, which will return 0 for missing items:
pd.Categorical(df['col1'],elem_list).value_counts()
Out[62]:
1 3
5 0
2 1
dtype: int64
First filter by Series.isin and DataFrame.loc and then use Series.value_counts; finally, if order is important, add Series.reindex:
df.loc[df["col1"].isin(elem_list), 'col1'].value_counts().reindex(elem_list, fill_value=0)
You could do something like this:
df = pd.DataFrame({"col1":np.random.randint(0,10, 100)})
df[df["col1"].isin([0,1])].value_counts()
# col1
# 1 17
# 0 10
# dtype: int64

Python: using groupby and apply to rebuild a dataframe

I want to rebuild my dataframe from df1 to df2:
df1 like this:
id  counts  days
1   2       4
1   3       4
1   4       4
2   56      8
2   37      9
2   10      7
2   10      4
df2 like this:
id  countsList     daysList
1   '2,3,4'        '4,4,4'
2   '56,37,10,10'  '8,9,7,4'
where countsList and daysList in df2 are strings.
df1 has about 1 million rows, so iterating with a for loop would be very slow.
I want to use groupby and apply to achieve this. Do you have a solution or an efficient way to do it?
My computer info:
CPU: Xeon 6226R 2.9Ghz 32core
RAM: 16G
python:3.9.7
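For reference, a minimal construction of df1 from the table above (a sketch):
import pandas as pd

df1 = pd.DataFrame({'id':     [1, 1, 1, 2, 2, 2, 2],
                    'counts': [2, 3, 4, 56, 37, 10, 10],
                    'days':   [4, 4, 4, 8, 9, 7, 4]})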
You might use agg (and then rename columns)
import numpy as np
import pandas as pd

np.random.seed(123)
n = 1_000_000
df = pd.DataFrame({
    "id": np.random.randint(100_000, size=n),
    "counts": np.random.randint(10, size=n),
    "days": np.random.randint(10, size=n)
})
df2 = df.groupby('id').agg(lambda x: ','.join(map(str, x)))\
        .add_suffix('List').reset_index()
# id countsList daysList
#0 15725 7,5,6,3,7,0 7,9,5,8,0,1
#1 28030 7,6,5,1,9,6,5 5,0,8,4,8,6,0
It isn't "that" slow - %%timeit for 1 million rows and 100k groups:
639 ms ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
EDIT: Solution proposed here: How to group dataframe rows into list in pandas groupby
is a bit faster:
id, counts, days = df.values[df.values[:, 0].argsort()].T
u_ids, index = np.unique(id, True)
counts = np.split(counts, index[1:])
days = np.split(days, index[1:])
df2 = pd.DataFrame({'id':u_ids, 'counts':counts, 'days':days})
but not dramatically faster:
313 ms ± 6.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
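Note that this numpy version leaves arrays in the counts and days columns; to get the comma-joined strings the question asks for, an extra join step would still be needed, for example (a sketch):
# turn the per-group arrays into 'x,y,z' strings as requested in the question
df2['counts'] = [','.join(map(str, a)) for a in df2['counts']]
df2['days'] = [','.join(map(str, a)) for a in df2['days']]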

How can I find the difference in two rows and divide this result by the sum of two rows?

Here is how I do it in Excel; this is the formula I want to replicate using Python:
=ABS(((B3-B2)/(B3+B2)/2)/((A3-A2)/(A3+A2)/2))
I know the difference can be calculated with df.diff(), but I can't figure out how to do the sum.
import pandas as pd
data = {'Price':[50,46],'Quantity':[3,6]}
df = pd.DataFrame(data)
print(df)
You can use rolling(...).sum() with a window size of 2 (the /2 factors in the Excel formula cancel out in the final ratio, so they can be dropped):
(df.diff()/df.rolling(2).sum()).eval('abs(Quantity/Price)')
0 NaN
1 8.0
dtype: float64
Basically you already have the diff, and from it you can also get the two-row sum: since the diff is x[2] - x[1], the sum is x[2] + x[1] = 2*x[2] - (x[2] - x[1]).
In your case the sum can be calculated by
df*2-df.diff()
Out[714]:
Price Quantity
0 NaN NaN
1 96.0 9.0
So the output is
(df.diff()/(df*2-df.diff())).eval('abs(Quantity/Price)')
Out[718]:
0 NaN
1 8.0
dtype: float64
For small dataframes the use of .eval() is not efficient.
The following is faster up to roughly 100,000 rows:
df = (df.diff() / df.rolling(2).sum()).div(2)
df['result'] = abs(df.Quantity / df.Price)
32.9 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
vs.
39.6 ms ± 931 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
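Putting it together, a self-contained sketch of the non-eval variant on the question's two-row example:
import pandas as pd

df = pd.DataFrame({'Price': [50, 46], 'Quantity': [3, 6]})

# row-wise diff divided by the two-row sum; the /2 factors cancel in the final ratio
ratio = df.diff() / df.rolling(2).sum()
result = (ratio['Quantity'] / ratio['Price']).abs()
print(result)
# 0    NaN
# 1    8.0
# dtype: float64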

Advanced Pandas chaining: chain index.droplevel after groupby

I was trying to find the top 2 most frequent values in the value column, grouped by the id column.
Here is the dataframe:
# groupby id and take only top 2 values.
df = pd.DataFrame({'id':    [1,1,1,1,1,1,1,1,1,2,2,2,2,2],
                   'value': [20,20,20,30,30,30,30,40,40,10,10,40,40,40]})
I have done it without chaining everything into one expression:
x = df.groupby('id')['value'].value_counts().groupby(level=0).nlargest(2).to_frame()
x.columns = ['count']
x.index = x.index.droplevel(0)
x = x.reset_index()
x
Result:
id value count
0 1 30 4
1 1 20 3
2 2 40 3
3 2 10 2
Can we do this is ONE-SINGLE chained operation?
So far, I have done this:
(df.groupby('id')['value']
.value_counts()
.groupby(level=0)
.nlargest(2)
.to_frame()
.rename({'value':'count'}))
Now I am stuck at how to drop the index level.
How to do all these operations in one single chain?
You could use apply and head without the second groupby:
df.groupby('id')['value']\
.apply(lambda x: x.value_counts().head(2))\
.reset_index(name='count')\
.rename(columns={'level_1':'value'})
Output:
id value count
0 1 30 4
1 1 20 3
2 2 40 3
3 2 10 2
Timings:
#This method
7 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#Groupby and groupby(level=0) with nlargest
12.9 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Try the below:
(df.groupby('id')['value']
.value_counts()
.groupby(level=0)
.nlargest(2)
.to_frame()).rename(columns={'value':'count'}).reset_index([1,2]).reset_index(drop=True)
Yet another solution:
df.groupby('id')['value'].value_counts().rename('count')\
.groupby(level=0).nlargest(2).reset_index(level=[1, 2])\
.reset_index(drop=True)
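For completeness, since the sticking point was dropping the extra index level inside the chain, Series.droplevel can also be chained directly (a sketch on the same df):
(df.groupby('id')['value']
   .value_counts()
   .groupby(level=0)
   .nlargest(2)
   .droplevel(0)          # drop the duplicate 'id' level added by nlargest
   .rename('count')
   .reset_index())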
Using the solution from @Scott Boston, I did some testing and also tried to avoid apply altogether, but apply is still about as performant as using numpy:
import numpy as np
import pandas as pd
from collections import Counter
np.random.seed(100)
df = pd.DataFrame({'id':    np.random.randint(0,5,10000000),
                   'value': np.random.randint(0,5,10000000)})
# df = pd.DataFrame({'id':[1,1,1,1,1,1,1,1,1,2,2,2,2,2],
# 'value':[20,20,20,30,30,30,30,40, 40,10, 10, 40,40,40]})
print(df.shape)
df.head()
Using apply
%time
df.groupby('id')['value']\
.apply(lambda x: x.value_counts().head(2))\
.reset_index(name='count')\
.rename(columns={'level_1':'value'})
# CPU times: user 3 µs, sys: 0 ns, total: 3 µs
# Wall time: 6.2 µs
Without using apply at all
%time
grouped = df.groupby('id')['value']
res = np.zeros([2,3], dtype=int)
for name, group in grouped:
    data = np.array(Counter(group.values).most_common(2))
    ids = np.ones([2,1], dtype=int) * name
    data = np.append(ids, data, axis=1)
    res = np.append(res, data, axis=0)
pd.DataFrame(res[2:], columns=['id','value','count'])
# CPU times: user 3 µs, sys: 0 ns, total: 3 µs
# Wall time: 5.96 µs
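A caveat on the %time readings above: a bare %time on a line by itself times only that (empty) statement, which is why both blocks report a few microseconds; putting %%time or %%timeit at the top of the cell would time the whole block instead.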

What is the fastest way to perform a replace on a column of a Pandas DataFrame based on the index of a separate Series?

Sorry if I've been googling the wrong keywords, but I haven't been able to find an efficient way to replace all instances of an integer in a DataFrame column with its corresponding indexed value from a secondary Series.
I'm working with the output of a third party program that strips the row and column labels from an input matrix and replaces them with their corresponding indices. I'd like to restore the true labels from the indices.
I have a dummy example of the dataframe and series in question:
In [6]: df
Out[6]:
idxA idxB var2
0 0 1 2.0
1 0 2 3.0
2 2 4 2.0
3 2 1 1.0
In [8]: labels
Out[8]:
0 A
1 B
2 C
3 D
4 E
Name: label, dtype: object
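For reference, a minimal construction of the dummy df and labels (a sketch):
import pandas as pd

df = pd.DataFrame({'idxA': [0, 0, 2, 2],
                   'idxB': [1, 2, 4, 1],
                   'var2': [2.0, 3.0, 2.0, 1.0]})
labels = pd.Series(['A', 'B', 'C', 'D', 'E'], name='label')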
Currently, I'm converting the series to a dictionary and using replace:
label_dict = labels.to_dict()
df['idxA'] = df.idxA.replace(label_dict)
df['idxB'] = df.idxB.replace(label_dict)
which does give me the expected result:
In [12]: df
Out[12]:
idxA idxB var2
0 A B 2.0
1 A C 3.0
2 C E 2.0
3 C B 1.0
However, this is very slow for my full dataset (approximately 3.8 million rows in the table, and 19,000 labels). Is there a more efficient way to approach this?
Thanks!
EDIT: I accepted @coldspeed's answer. I couldn't paste a code block in a comment reply to his answer, but his solution sped up the dummy code by about an order of magnitude:
In [10]: %timeit df.idxA.replace(label_dict)
4.41 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: %timeit df.idxA.map(labels)
435 µs ± 3.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You can call map for each column using apply:
df.loc[:, 'idxA':'idxB'] = df.loc[:, 'idxA':'idxB'].apply(lambda x: x.map(labels))
df
idxA idxB var2
0 A B 2.0
1 A C 3.0
2 C E 2.0
3 C B 1.0
This is effectively iterating over every column (but the map operation for a single column is vectorized, so it is fast). It might just be faster to do
cols_of_interest = ['idxA', 'idxB', ...]
for c in cols_of_interest: df[c] = df[c].map(labels)
map is faster than replace, depending on the number of columns to replace. Your mileage may vary.
df_ = df.copy()
df = pd.concat([df_] * 10000, ignore_index=True)
%timeit df.loc[:, 'idxA':'idxB'].replace(labels)
%%timeit
for c in ['idxA', 'idxB']:
    df[c].map(labels)
6.55 ms ± 87.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.95 ms ± 70 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
