Efficient calculation of frequencies of multiindexed category variables - python

I'm considering a multi-indexed (i,j,k) DataFrame with one column that contains one of three categorical values A, B or C.
I want to compute the frequency of these values for each i over all (j,k). I have a solution, but I think there is a more pythonic and efficient way of doing it.
The code for a MWE reads (in reality len(I)*len(J)*len(K) is large, in the millions, say):
import pandas as pd
import numpy as np
I = range(10)
J = range(3)
K = range(2)
data = pd.DataFrame(
    np.random.randint(0, 3, size=len(I)*len(J)*len(K)),
    index=pd.MultiIndex.from_product([I, J, K]),
    columns=['cat']
)
data.index.names = ['i', 'j', 'k']
data[data['cat'] == 0] = 'A'
data[data['cat'] == 1] = 'B'
data[data['cat'] == 2] = 'C'
data = data.unstack(['j', 'k'])
result = data.apply(lambda x: x.value_counts(), axis=1).fillna(0) / (len(J)*len(K))

You could use groupby, and also normalize your value_counts:
data.groupby(level=0)['cat'].value_counts(normalize=True).unstack(level=1).fillna(0)
To compare, first let's make the dummy data big (60 million rows):
import pandas as pd
import numpy as np
I = range(100000)
J = range(30)
K = range(20)
data = pd.DataFrame(
    np.random.randint(0, 3, size=len(I)*len(J)*len(K)),
    index=pd.MultiIndex.from_product([I, J, K]),
    columns=['cat']
)
data.index.names = ['i', 'j', 'k']
data[data['cat'] == 0] = 'A'
data[data['cat'] == 1] = 'B'
data[data['cat'] == 2] = 'C'
Timing your original method:
data_interim = data.unstack(['j', 'k'])
data_interim.apply(lambda x: x.value_counts(), axis=1).fillna(0) / (len(J)*len(K))
gives (on my machine) 1min 24s ± 1.98 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
the alternative:
data.groupby(level=0)['cat'].value_counts(normalize=True).unstack(level=1).fillna(0)
gives (on my machine) 8.86 s ± 216 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Baseline for s_pike's method:
%%timeit
(data.groupby(level=0)['cat']
 .value_counts(normalize=True)
 .unstack(level=1)
 .fillna(0))
6.41 s ± 243 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If the values are truly categorical, you can get a lot of benefit out of explicitly making the column categorical and then using either of the methods below.
Both are still about twice as fast as the baseline even without the categorical dtype, but become about 3x as fast once the column is made categorical.
data['cat'] = data['cat'].astype('category')
%%timeit
(data.groupby(level=0, as_index=False)['cat']
 .value_counts(normalize=True)
 .pivot(index='i', columns='cat', values='proportion'))
1.82 s ± 91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(x := data.pivot_table(index='i', columns='cat', aggfunc='value_counts')).div(x.sum(1), 0)
1.8 s ± 107 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Outputs:
cat A B C
i
0 0.341667 0.318333 0.340000
1 0.311667 0.388333 0.300000
2 0.351667 0.350000 0.298333
3 0.363333 0.333333 0.303333
4 0.326667 0.350000 0.323333
... ... ... ...
99995 0.315000 0.313333 0.371667
99996 0.323333 0.351667 0.325000
99997 0.305000 0.353333 0.341667
99998 0.318333 0.341667 0.340000
99999 0.331667 0.340000 0.328333
[100000 rows x 3 columns]

Related

Add a column to a df where if a certain value is 0, return 1 else return the original value of the column

The Python code with which I am trying to achieve this result is:
df['column2'] = np.where(df['column1'] == 0, 1, df['column1'])
For the sample dataframe it is fastest to use np.where.
You can also use pandas.DataFrame.where, which replaces values where the condition is False and otherwise keeps the value already in the column.
In the example below, 100 is used instead of 1 to make the update easier to see.
import pandas as pd
# test dataframe
df = pd.DataFrame({'a': [2, 4, 1, 0, 2, 2, 0, 8, 4, 0], 'b': [2, 4, 0, 9, 2, 0, 2, 8, 0, 3]})
# replace 0 with 100 or leave the same number based on the same column
df['0 → 100 on a if a'] = df.a.where(df.a != 0, 100)
# replace 0 with 100 or leave the same number based on a different column
df['0 → 100 on a if b'] = df.a.where(df.b != 0, 100)
# display(df)
a b 0 → 100 on a if a 0 → 100 on a if b
0 2 2 2 2
1 4 4 4 4
2 1 0 1 100
3 0 9 100 0
4 2 2 2 2
5 2 0 2 100
6 0 2 100 0
7 8 8 8 8
8 4 0 4 100
9 0 3 100 0
%%timeit testing
Test Data
import pandas as pd
import numpy as np
# test dataframe with 1M rows
np.random.seed(365)
df = pd.DataFrame({'a': np.random.randint(0, 10, size=(1000000)), 'b': np.random.randint(0, 10, size=(1000000))})
Tests
%%timeit
np.where(df.a == 0, 1, df.a)
[out]:
161 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
np.where(df.b == 0, 1, df.a)
[out]:
164 µs ± 1.94 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
df.a.where(df.a != 0, 1)
[out]:
4.51 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df.a.where(df.b != 0, 1)
[out]:
4.55 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
noah1(df)
[out]:
4.63 ms ± 58.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
noah2(df)
[out]:
15.3 s ± 205 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
paul(df)
[out]:
341 ms ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
karam(df)
[out]:
299 ms ± 4.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Functions
def noah1(d):
    return d.a.replace(0, 1)

def noah2(d):
    return d.apply(lambda x: 1 if x.a == 0 else x.b, axis=1)

def paul(d):
    return [1 if v==0 else v for v in d.a.values]

def karam(d):
    return d.a.apply(lambda x: 1 if x == 0 else x)
The apply example provided above should work, or this works too:
df['column_2'] = [1 if v==0 else v for v in df['col'].values]
My example uses a list comprehension: https://www.w3schools.com/python/python_lists_comprehension.asp
The other answer uses a lambda function: https://www.w3schools.com/python/python_lambda.asp
Personally, when writing scripts that others may use, I think list comprehensions are more widely known and therefore clearer, but the lambda-based apply performs slightly faster here and is a highly useful tool in general, so it is probably recommended over the list comprehension.
What you want is essentially to just copy the column and replace 0s with 1s:
df["Column2"] = df["Column1"].replace(0,1)
More generally, if you wanted the value from some other ColumnX, you can do the following with a lambda function:
df["Column2"] = df.apply(lambda x: 1 if x["Column1"]==0 else x['ColumnX'], axis=1)
You should be able to achieve that using an apply statement in this manner:
df['column2'] = df['column1'].apply(lambda x: 1 if x == 0 else x)

Concatenate Pandas column name to column value

Is there any efficient way to concatenate a Pandas column name to its values? I would like to prefix all my DataFrame values with their column names.
My current method is very slow on a large dataset:
import pandas as pd
# test data
from io import StringIO
df = pd.read_csv(StringIO('''date value data
01/01/2019 30 data1
01/01/2019 40 data2
02/01/2019 20 data1
02/01/2019 10 data2'''), sep=' ')
# slow method
dt = [df[c].apply(lambda x:f'{c}_{x}').values for c in df.columns]
dt = pd.DataFrame(dt, index=df.columns).T
The problem is that the list comprehension and the copying of data slow down the transformation on a large dataset with lots of columns.
Is there a better way to prefix column names to values?
Here is a way without loops:
pd.DataFrame([df.columns]*len(df),columns=df.columns)+"_"+df.astype(str)
date value data
0 date_01/01/2019 value_30 data_data1
1 date_01/01/2019 value_40 data_data2
2 date_02/01/2019 value_20 data_data1
3 date_02/01/2019 value_10 data_data2
Timings (fastest to slowest):
m = pd.concat([df]*20000,ignore_index=True)
%%timeit
m.astype(str).radd(m.columns + '_')
#410 ms ± 39.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
m.astype(str).radd('_').radd([*m]) # courtesy #piR
#470 ms ± 76.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit #piR solution
a = m.to_numpy().astype(str)
b = m.columns.to_numpy().astype(str)
pd.DataFrame(add(add(b, '_'), a), m.index, m.columns)
#710 ms ± 74.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit #anky_91 sol
pd.DataFrame([m.columns]*len(m),columns=m.columns)+"_"+m.astype(str)
#1.7 s ± 114 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit #OP sol
dt = [m[c].apply(lambda x:f'{c}_{x}').values for c in m.columns]
pd.DataFrame(dt, index=m.columns).T
#14.4 s ± 643 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy.core.defchararray.add
from numpy.core.defchararray import add
a = df.to_numpy().astype(str)
b = df.columns.to_numpy().astype(str)
dt = pd.DataFrame(add(add(b, '_'), a), df.index, df.columns)
dt
date value data
0 date_01/01/2019 value_30 data_data1
1 date_01/01/2019 value_40 data_data2
2 date_02/01/2019 value_20 data_data1
3 date_02/01/2019 value_10 data_data2
This isn't as fast as the fastest answer but it's pretty zippy (see what I did there)
a = df.columns.tolist()
pd.DataFrame(
    [[f'{k}_{v}' for k, v in zip(a, t)]
     for t in zip(*map(df.get, a))],
    df.index, df.columns
)
This solution:
result = pd.DataFrame({col: col + "_" + m[col].astype(str) for col in m.columns})
is as performant as the fastest solution above, and might be more readable, at least to some.
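For reference, here is a quick sketch of that dictionary-comprehension approach applied to the small df from the question; the output should match the earlier answers:
# Build each prefixed column with a dict comprehension, then assemble them
# into one DataFrame (uses the small df defined in the question above).
result = pd.DataFrame({col: col + "_" + df[col].astype(str) for col in df.columns})
print(result)
#               date     value        data
# 0  date_01/01/2019  value_30  data_data1
# 1  date_01/01/2019  value_40  data_data2
# 2  date_02/01/2019  value_20  data_data1
# 3  date_02/01/2019  value_10  data_data2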

Pandas vectorization: Compute the fraction of each group that meets a condition

Suppose we have a table of customers and their spending.
import pandas as pd
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Bob", "Charles"],
    "Spend": [3, 5, 7, 9]
})
LIMIT = 6
For each customer, we can compute the fraction of their spending records that are larger than 6 using the apply method:
df.groupby("Name").apply(
lambda grp: len(grp[grp["Spend"] > LIMIT]) / len(grp)
)
Name
Alice 0.0
Bob 0.5
Charles 1.0
However, the apply method is just a loop, which is slow if there are many customers.
Question: Is there a faster way, which presumably uses vectorization?
As of version 0.23.4, SeriesGroupBy does not support comparison operators:
(df.groupby("Name")["Spend"] > LIMIT).mean()
TypeError: '>' not supported between instances of 'SeriesGroupBy' and 'int'
The code below results in a null value for Alice:
df[df["Spend"] > LIMIT].groupby("Name").size() / df.groupby("Name").size()
Name
Alice NaN
Bob 0.5
Charles 1.0
The code below gives the correct result, but it requires us to either modify the table, or make a copy to avoid modifying the original.
df["Dummy"] = 1 * (df["Spend"] > LIMIT)
df.groupby("Name") ["Dummy"] .sum() / df.groupby("Name").size()
Groupby does not use vectorization, but it has aggregate functions that are optimized with Cython.
You can take the mean:
(df["Spend"] > LIMIT).groupby(df["Name"]).mean()
df["Spend"].gt(LIMIT).groupby(df["Name"]).mean()
Or use div to replace NaN with 0:
df[df["Spend"] > LIMIT].groupby("Name").size() \
.div(df.groupby("Name").size(), fill_value = 0)
df["Spend"].gt(LIMIT).groupby(df["Name"]).sum() \
.div(df.groupby("Name").size(), fill_value = 0)
Each of the above will yield
Name
Alice 0.0
Bob 0.5
Charles 1.0
dtype: float64
Performance
Performance depends on the number of rows and on how many rows are filtered by the condition, so it's best to test on real data.
np.random.seed(123)
N = 100000
df = pd.DataFrame({
    "Name": np.random.randint(1000, size=N),
    "Spend": np.random.randint(10, size=N)
})
LIMIT = 6
In [10]: %timeit df["Spend"].gt(LIMIT).groupby(df["Name"]).mean()
6.16 ms ± 332 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: %timeit df[df["Spend"] > LIMIT].groupby("Name").size().div(df.groupby("Name").size(), fill_value = 0)
6.35 ms ± 95.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [12]: %timeit df["Spend"].gt(LIMIT).groupby(df["Name"]).sum().div(df.groupby("Name").size(), fill_value = 0)
9.66 ms ± 365 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# RafaelC comment solution
In [13]: %timeit df.groupby("Name")["Spend"].apply(lambda s: (s > LIMIT).sum() / s.size)
400 ms ± 27.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [14]: %timeit df.groupby("Name")["Spend"].apply(lambda s: (s > LIMIT).mean())
328 ms ± 6.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This NumPy solution is vectorized, but a bit complicated:
In [15]: %%timeit
...: i, r = pd.factorize(df["Name"])
...: a = pd.Series(np.bincount(i), index = r)
...:
...: i1, r1 = pd.factorize(df["Name"].values[df["Spend"].values > LIMIT])
...: b = pd.Series(np.bincount(i1), index = r1)
...:
...: df1 = b.div(a, fill_value = 0)
...:
5.05 ms ± 82.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Get index number from multi-index dataframe in python

There seem to be a lot of answers on how to get the last index value from a pandas DataFrame, but what I am trying to get is the index position number of the last row of every index at level 0 in a multi-index DataFrame. I found a way using a loop, but the DataFrame is millions of lines long and this loop is slow. I assume there is a more pythonic way of doing this.
Here is a mini example of df3. I want to get a list (or maybe an array) of the index position numbers of the last row before the index changes to a new stock. The Index column below shows the values I want; these are the integer positions in the df.
Stock Date Index
AAPL 12/31/2004
1/3/2005
1/4/2005
1/5/2005
1/6/2005
1/7/2005
1/10/2005 3475
AMZN 12/31/2004
1/3/2005
1/4/2005
1/5/2005
1/6/2005
1/7/2005
1/10/2005 6951
BAC 12/31/2004
1/3/2005
1/4/2005
1/5/2005
1/6/2005
1/7/2005
1/10/2005 10427
This is the code I am using, where df3 is the dataframe:
test_index_list = []
for start_index in range(len(df3)-1):
    end_index = start_index + 1
    if df3.index[start_index][0] != df3.index[end_index][0]:
        test_index_list.append(start_index)
I changed Divakar's answer a bit, using get_level_values to work with the indices of the first level of the MultiIndex:
df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbc')}).set_index(['F','A','B'])
print (df)
C D E
F A B
a a 4 7 1 5
b 5 8 3 3
c 4 9 5 6
b d 5 4 7 9
e 5 2 1 2
c f 4 3 0 4
def start_stop_arr(initial_list):
    a = np.asarray(initial_list)
    mask = np.concatenate(([True], a[1:] != a[:-1], [True]))
    idx = np.flatnonzero(mask)
    stop = idx[1:] - 1
    return stop
print (df.index.get_level_values(0))
Index(['a', 'a', 'a', 'b', 'b', 'c'], dtype='object', name='F')
print (start_stop_arr(df.index.get_level_values(0)))
[2 4 5]
dict.values
Using a dict to track positions means that, for each level value, the last position found is the one that is kept.
list(dict(map(reversed, enumerate(df.index.get_level_values(0)))).values())
[2, 4, 5]
With Loop
Create a function that takes a factorization and the number of unique values:
def last(bins, k):
    a = np.zeros(k, np.int64)
    for i, b in enumerate(bins):
        a[b] = i
    return a
You can then get the factorization with
f, u = pd.factorize(df.index.get_level_values(0))
last(f, len(u))
array([2, 4, 5])
However, the way MultiIndex is usually constructed, the labels objects are already factorizations and the levels objects are unique values.
last(df.index.labels[0], df.index.levels[0].size)
array([2, 4, 5])
What's more, we can use Numba's just-in-time compilation to super-charge this.
from numba import njit

@njit
def nlast(bins, k):
    a = np.zeros(k, np.int64)
    for i, b in enumerate(bins):
        a[b] = i
    return a
nlast(df.index.labels[0], df.index.levels[0].size)
array([2, 4, 5])
Timing
%%timeit
f, u = pd.factorize(df.index.get_level_values(0))
last(f, len(u))
641 µs ± 9.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
f, u = pd.factorize(df.index.get_level_values(0))
nlast(f, len(u))
264 µs ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
nlast(df.index.labels[0], len(df.index.levels[0]))
4.06 µs ± 43.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
last(df.index.labels[0], len(df.index.levels[0]))
654 µs ± 14.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
list(dict(map(reversed, enumerate(df.index.get_level_values(0)))).values())
709 µs ± 4.94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
jezrael's solution. Also very fast.
%timeit start_stop_arr(df.index.get_level_values(0))
113 µs ± 83.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
np.unique
I did not time this because I don't like it. See below:
Using np.unique and the return_index argument. This returns the first place each unique value is found. After this, I'd do some shifting to get at the last position of the prior unique value.
Note: this works if the level values are in contiguous groups. If they aren't, we have to do sorting and unsorting, which usually isn't worth it; a sketch of that approach follows the example below.
i = np.unique(df.index.get_level_values(0), return_index=True)[1]
np.append(i[1:], len(df)) - 1
array([2, 4, 5])
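For non-contiguous groups, one possible sketch of that sorting approach (an assumption on my part, not part of the original answer) is to stable-sort the level values, take each group's last position in the sorted order, and map back to the original row positions:
import numpy as np
vals = df.index.get_level_values(0).to_numpy()
order = np.argsort(vals, kind='stable')        # row positions ordered by level value
sorted_vals = vals[order]
# last position of each group within the sorted order
last_sorted = np.flatnonzero(np.concatenate([sorted_vals[1:] != sorted_vals[:-1], [True]]))
last_positions = np.sort(order[last_sorted])   # back to original row positions
last_positions
array([2, 4, 5])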
Setup
from @jezrael
df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbc')}).set_index(['F','A','B'])

How to apply a function on every row on a dataframe?

I am new to Python and I am not sure how to solve the following problem.
I have a function:
import math

def EOQ(D, p, ck, ch):
    Q = math.sqrt((2*D*ck)/(ch*p))
    return Q
Say I have the dataframe
df = pd.DataFrame({"D": [10,20,30], "p": [20, 30, 10]})
D p
0 10 20
1 20 30
2 30 10
ch=0.2
ck=5
And ch and ck are float types. Now I want to apply the formula to every row of the dataframe and return the result as an extra column 'Q'. An example (that does not work) would be:
df['Q']= map(lambda p, D: EOQ(D,p,ck,ch),df['p'], df['D'])
(returns only 'map' types)
I will need this type of processing more in my project and I hope to find something that works.
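Note: in Python 3, map returns a lazy iterator rather than a list, which is likely why the attempt above only yields a 'map' object; a minimal fix (a sketch, assuming the df, EOQ, ck and ch defined in the question) is to materialise the iterator with list():
# materialise the iterator so pandas receives a list of values
df['Q'] = list(map(lambda p, D: EOQ(D, p, ck, ch), df['p'], df['D']))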
The following should work:
def EOQ(D, p, ck, ch):
    Q = math.sqrt((2*D*ck)/(ch*p))
    return Q

ch = 0.2
ck = 5
df['Q'] = df.apply(lambda row: EOQ(row['D'], row['p'], ck, ch), axis=1)
df
If all you're doing is calculating the square root of some result, then use the np.sqrt method; it is vectorised and will be significantly faster:
In [80]:
df['Q'] = np.sqrt((2*df['D']*ck)/(ch*df['p']))
df
Out[80]:
D p Q
0 10 20 5.000000
1 20 30 5.773503
2 30 10 12.247449
Timings
For a 30k row df:
In [92]:
import math
ch=0.2
ck=5
def EOQ(D, p, ck, ch):
    Q = math.sqrt((2*D*ck)/(ch*p))
    return Q
%timeit np.sqrt((2*df['D']*ck)/(ch*df['p']))
%timeit df.apply(lambda row: EOQ(row['D'], row['p'], ck, ch), axis=1)
1000 loops, best of 3: 622 µs per loop
1 loops, best of 3: 1.19 s per loop
You can see that the np method is ~1900x faster
There are a few more ways to apply a function to every row of a DataFrame.
(1) You could modify EOQ a bit by letting it accept a row (a Series object) as its argument and access the relevant elements by column name inside the function. Moreover, you can pass extra arguments to apply as keywords, e.g. ch or ck:
def EOQ1(row, ck, ch):
    Q = math.sqrt((2*row['D']*ck)/(ch*row['p']))
    return Q

df['Q1'] = df.apply(EOQ1, ck=ck, ch=ch, axis=1)
(2) It turns out that apply is often slower than a list comprehension (in the benchmark below, it's 20x slower). To use a list comprehension, you could modify EOQ still further so that it accesses elements by positional index. Then call the function in a loop over the df rows converted to lists:
def EOQ2(row, ck, ch):
    Q = math.sqrt((2*row[0]*ck)/(ch*row[1]))
    return Q

df['Q2a'] = [EOQ2(x, ck, ch) for x in df[['D','p']].to_numpy().tolist()]
(3) As it happens, if the goal is to call a function iteratively, map is usually faster than a list comprehension. So you could convert df into a list, map the function over it, and then unpack the result into a list:
df['Q2b'] = [*map(EOQ2, df[['D','p']].to_numpy().tolist(), [ck]*len(df), [ch]*len(df))]
(4) As @EdChum notes, it's always better to use vectorized methods when possible instead of applying a function row by row. Pandas offers vectorized methods that rival numpy's. In the case of EOQ, for example, instead of math.sqrt you could use pandas' pow method (in the benchmark below, using pandas vectorized methods is ~20% faster than using numpy):
df['Q_pd'] = df['D'].mul(2*ck).div(ch*df['p']).pow(0.5)
Output:
D p Q Q_np Q1 Q2a Q2b Q_pd
0 10 20 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000
1 20 30 5.773503 5.773503 5.773503 5.773503 5.773503 5.773503
2 30 10 12.247449 12.247449 12.247449 12.247449 12.247449 12.247449
Timings:
df = pd.DataFrame({"D": [10,20,30], "p": [20, 30, 10]})
df = pd.concat([df]*10000)
>>> %timeit df['Q'] = df.apply(lambda row: EOQ(row['D'], row['p'], ck, ch), axis=1)
623 ms ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df['Q1'] = df.apply(EOQ1, ck=ck, ch=ch, axis=1)
615 ms ± 39.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df['Q2a'] = [EOQ2(x, ck, ch) for x in df[['D','p']].to_numpy().tolist()]
31.3 ms ± 479 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit df['Q2b'] = [*map(EOQ2, df[['D','p']].to_numpy().tolist(), [ck]*len(df), [ch]*len(df))]
26.9 ms ± 306 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit df['Q_np'] = np.sqrt((2*df['D']*ck)/(ch*df['p']))
1.19 ms ± 53.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit df['Q_pd'] = df['D'].mul(2*ck).div(ch*df['p']).pow(0.5)
966 µs ± 27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
