Get index number from multi-index dataframe in Python

There seem to be a lot of answers on how to get the last index value from a pandas DataFrame, but what I am trying to get is the index position of the last row for every value at level 0 of a MultiIndex DataFrame. I found a way using a loop, but the DataFrame is millions of lines long and the loop is slow. I assume there is a more pythonic way of doing this.
Here is a mini example of df3. I want a list (or maybe an array) of the integer positions of the last row of each stock, i.e. the last row before the index changes to a new stock. The Index column below holds the results I want: the integer position within df3.
Stock Date Index
AAPL 12/31/2004
1/3/2005
1/4/2005
1/5/2005
1/6/2005
1/7/2005
1/10/2005 3475
AMZN 12/31/2004
1/3/2005
1/4/2005
1/5/2005
1/6/2005
1/7/2005
1/10/2005 6951
BAC 12/31/2004
1/3/2005
1/4/2005
1/5/2005
1/6/2005
1/7/2005
1/10/2005 10427
This is the code I am using, where df3 is the DataFrame:
test_index_list = []
for start_index in range(len(df3)-1):
    end_index = start_index + 1
    if df3.index[start_index][0] != df3.index[end_index][0]:
        test_index_list.append(start_index)
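As written, the loop's comparison never reaches past the last row of df3, so the end of the final stock is not recorded. A minimal fix, assuming the loop above has already run:
test_index_list.append(len(df3) - 1)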

I changed divakar's answer a bit, using get_level_values for the indices of the first level of the MultiIndex:
df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbc')}).set_index(['F','A','B'])
print (df)
       C  D  E
F A B
a a 4  7  1  5
  b 5  8  3  3
  c 4  9  5  6
b d 5  4  7  9
  e 5  2  1  2
c f 4  3  0  4
def start_stop_arr(initial_list):
    a = np.asarray(initial_list)
    mask = np.concatenate(([True], a[1:] != a[:-1], [True]))
    idx = np.flatnonzero(mask)
    stop = idx[1:] - 1
    return stop
print (df.index.get_level_values(0))
Index(['a', 'a', 'a', 'b', 'b', 'c'], dtype='object', name='F')
print (start_stop_arr(df.index.get_level_values(0)))
[2 4 5]
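Applied to the question's frame, this would be (a sketch; df3 is the OP's DataFrame):
stop_positions = start_stop_arr(df3.index.get_level_values(0))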

dict.values
Using a dict to track values leaves the last found value as the one that matters.
list(dict(map(reversed, enumerate(df.index.get_level_values(0)))).values())
[2, 4, 5]
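To see why this works, here is a minimal illustration (made-up values, not from the original answer). A dict keeps only the last value inserted for each key, so every level value ends up mapped to its final position:
pairs = map(reversed, enumerate(['a', 'a', 'a', 'b', 'b', 'c']))
print(dict(pairs))  # {'a': 2, 'b': 4, 'c': 5}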
With Loop
Create a function that takes a factorization and the number of unique values:
def last(bins, k):
    a = np.zeros(k, np.int64)
    for i, b in enumerate(bins):
        a[b] = i
    return a
You can then get the factorization with
f, u = pd.factorize(df.index.get_level_values(0))
last(f, len(u))
array([2, 4, 5])
However, the way a MultiIndex is usually constructed, the labels objects are already factorizations and the levels objects are the unique values.
last(df.index.labels[0], df.index.levels[0].size)
array([2, 4, 5])
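Note: in pandas 0.24+, MultiIndex.labels was renamed to MultiIndex.codes, so on newer versions the equivalent call is:
last(df.index.codes[0], df.index.levels[0].size)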
What's more, we can use Numba's just-in-time compilation to super-charge this.
from numba import njit

@njit
def nlast(bins, k):
    a = np.zeros(k, np.int64)
    for i, b in enumerate(bins):
        a[b] = i
    return a
nlast(df.index.labels[0], df.index.levels[0].size)
array([2, 4, 5])
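One caveat (general Numba behavior, not specific to this answer): the first call to nlast pays a one-time JIT compilation cost; the timings below measure the already-compiled function.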
Timing
%%timeit
f, u = pd.factorize(df.index.get_level_values(0))
last(f, len(u))
641 µs ± 9.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
f, u = pd.factorize(df.index.get_level_values(0))
nlast(f, len(u))
264 µs ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
nlast(df.index.labels[0], len(df.index.levels[0]))
4.06 µs ± 43.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
last(df.index.labels[0], len(df.index.levels[0]))
654 µs ± 14.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
list(dict(map(reversed, enumerate(df.index.get_level_values(0)))).values())
709 µs ± 4.94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
jezrael's solution. Also very fast.
%timeit start_stop_arr(df.index.get_level_values(0))
113 µs ± 83.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
np.unique
I did not time this because I don't like it. See below:
Using np.unique with the return_index argument. This returns the first position at which each unique value is found; after that, a shift gets the last position of the prior unique value.
Note: this works only if the level values are in contiguous groups. If they aren't, we'd have to sort and unsort, which usually isn't worth it (a sketch of that follows after the example below).
i = np.unique(df.index.get_level_values(0), return_index=True)[1]
np.append(i[1:], len(df)) - 1
array([2, 4, 5])
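For completeness, here is a sketch of the sort-and-unsort version for non-contiguous groups (my assumption of what that would look like, not code from the original answer):
vals = np.asarray(df.index.get_level_values(0))
order = np.argsort(vals, kind='stable')           # stable sort keeps original order within groups
i = np.unique(vals[order], return_index=True)[1]  # first sorted position of each group
last_sorted = np.append(i[1:], len(vals)) - 1     # last sorted position of each group
last_original = order[last_sorted]                # map back to original row positions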
Setup
from @jezrael
df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbc')}).set_index(['F','A','B'])

Related

Efficient calculation of frequencies of multiindexed category variables

I'm considering a multi-indexed (i, j, k) DataFrame with one column that contains one of three categorical values: A, B or C.
I want to compute the frequency of the values for each i over all (j, k). I have a solution, but I think there is a more pythonic and efficient way of doing it.
The code for an MWE reads (in reality len(I)*len(J)*len(K) is large, in the millions, say):
import pandas as pd
import numpy as np
I = range(10)
J = range(3)
K = range(2)
data = pd.DataFrame(
    np.random.randint(0, 3, size=len(I)*len(J)*len(K)),
    index=pd.MultiIndex.from_product([I, J, K]),
    columns=['cat']
)
data.index.names = ['i', 'j', 'k']
data[data['cat'] == 0] = 'A'
data[data['cat'] == 1] = 'B'
data[data['cat'] == 2] = 'C'
data = data.unstack(['j', 'k'])
result = data.apply(lambda x: x.value_counts(), axis=1).fillna(0) / (len(J)*len(K))
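As an aside (a sketch, not part of the original question), the three masked assignments above could be collapsed into a single map:
data['cat'] = data['cat'].map({0: 'A', 1: 'B', 2: 'C'})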
You could use groupby, and also normalize your value_counts:
data.groupby(level=0)['cat'].value_counts(normalize=True).unstack(level=1).fillna(0)
To compare, first let's make the dummy data big (60 million rows):
import pandas as pd
import numpy as np
I = range(100000)
J = range(30)
K = range(20)
data = pd.DataFrame(
    np.random.randint(0, 3, size=len(I)*len(J)*len(K)),
    index=pd.MultiIndex.from_product([I, J, K]),
    columns=['cat']
)
data.index.names = ['i', 'j', 'k']
data[data['cat'] == 0] = 'A'
data[data['cat'] == 1] = 'B'
data[data['cat'] == 2] = 'C'
Timing your original method:
data_interim = data.unstack(['j', 'k'])
data_interim.apply(lambda x: x.value_counts(), axis=1).fillna(0) / (len(J)*len(K))
gives (on my machine) 1min 24s ± 1.98 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
the alternative:
data.groupby(level=0)['cat'].value_counts(normalize=True).unstack(level=1).fillna(0)
gives (on my machine) 8.86 s ± 216 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Baseline for s_pike's method:
%%timeit
(data.groupby(level=0)['cat']
     .value_counts(normalize=True)
     .unstack(level=1)
     .fillna(0))
6.41 s ± 243 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If they're truly categorical, you can get a lot of benefit out of explicitly making the column categorical, and then using either of these methods.
They're both still about twice as fast without being categorical, but become about 3x as fast when made categorical.
data['cat'] = data['cat'].astype('category')
%%timeit
(data.groupby(level=0, as_index=False)['cat']
     .value_counts(normalize=True)
     .pivot(index='i', columns='cat', values='proportion'))
1.82 s ± 91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(x := data.pivot_table(index='i', columns='cat', aggfunc='value_counts')).div(x.sum(1), 0)
1.8 s ± 107 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Outputs:
cat A B C
i
0 0.341667 0.318333 0.340000
1 0.311667 0.388333 0.300000
2 0.351667 0.350000 0.298333
3 0.363333 0.333333 0.303333
4 0.326667 0.350000 0.323333
... ... ... ...
99995 0.315000 0.313333 0.371667
99996 0.323333 0.351667 0.325000
99997 0.305000 0.353333 0.341667
99998 0.318333 0.341667 0.340000
99999 0.331667 0.340000 0.328333
[100000 rows x 3 columns]

What is the most efficient way to get count of distinct values in a pandas dataframe?

I have a dataframe as shown below.
0 1 2
0 A B C
1 B C B
2 B D E
3 C E E
4 B F A
I need to get the count of unique values across the entire DataFrame, not column-wise unique values.
In the above dataframe, unique values are A, B, C, D, E, F.
So, the result I need is 6.
I'm achieving this using pandas' squeeze, ravel and nunique, which convert the entire DataFrame into a Series:
pd.Series(df.squeeze().values.ravel()).nunique(dropna=True)
Please let me know if there is any better way to achieve this.
Use numpy.unique and take the length of the result:
out = len(np.unique(df))
6
Use NumPy for this, as:
import numpy as np
print(np.unique(df.values).shape[0])
You can use set, len and flatten too:
len(set(df.values.flatten()))
Out:
6
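A minor variant (a performance assumption on my part, not from the original answers): ravel can return a view where flatten always copies, so on large frames this may shave off a little time:
len(set(df.values.ravel()))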
Timings, with a dummy DataFrame containing 6 unique values:
#dummy data
df = pd.DataFrame({'Day':np.random.choice(['aa','bbbb','c','ddddd','EeeeE','xxx'], 10**6),'Heloo':np.random.choice(['aa','bbbb','c','ddddd','EeeeE','xxx'], 10**6)})
print(df.shape)
(1000000, 2)
%timeit len(set(df.values.flatten()))
89.5 ms ± 1.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.unique(df.values).shape[0]
1.61 s ± 25.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit len(np.unique(df))
1.85 s ± 229 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

What is the fastest way to perform a replace on a column of a Pandas DataFrame based on the index of a separate Series?

Sorry if I've been googling the wrong keywords, but I haven't been able to find an efficient way to replace all instances of an integer in a DataFrame column with its corresponding indexed value from a secondary Series.
I'm working with the output of a third party program that strips the row and column labels from an input matrix and replaces them with their corresponding indices. I'd like to restore the true labels from the indices.
I have a dummy example of the dataframe and series in question:
In [6]: df
Out[6]:
idxA idxB var2
0 0 1 2.0
1 0 2 3.0
2 2 4 2.0
3 2 1 1.0
In [8]: labels
Out[8]:
0 A
1 B
2 C
3 D
4 E
Name: label, dtype: object
Currently, I'm converting the series to a dictionary and using replace:
label_dict = labels.to_dict()
df['idxA'] = df.idxA.replace(label_dict)
df['idxB'] = df.idxB.replace(label_dict)
which does give me the expected result:
In [12]: df
Out[12]:
idxA idxB var2
0 A B 2.0
1 A C 3.0
2 C E 2.0
3 C B 1.0
However, this is very slow for my full dataset (approximately 3.8 million rows in the table, and 19,000 labels). Is there a more efficient way to approach this?
Thanks!
EDIT: I accepted @coldspeed's answer. I couldn't paste a code block in a comment reply to his answer, but his solution sped up the dummy code by about an order of magnitude:
In [10]: %timeit df.idxA.replace(label_dict)
4.41 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: %timeit df.idxA.map(labels)
435 µs ± 3.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You can call map for each column using apply:
df.loc[:, 'idxA':'idxB'] = df.loc[:, 'idxA':'idxB'].apply(lambda x: x.map(labels))
df
idxA idxB var2
0 A B 2.0
1 A C 3.0
2 C E 2.0
3 C B 1.0
This is effectively iterating over every column (but the map operation for a single column is vectorized, so it is fast). It might just be faster to do
cols_of_interest = ['idxA', 'idxB', ...]
for c in cols_of_interest:
    df[c] = df[c].map(labels)
How much faster map is than replace depends on the number of columns to replace; your mileage may vary.
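The reason Series.map(labels) works here is index alignment: each value in the column is looked up in the index of labels. A minimal illustration (values made up for this sketch):
labels = pd.Series(list('ABCDE'), name='label')
pd.Series([0, 2, 4]).map(labels)  # -> 'A', 'C', 'E'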
df_ = df.copy()
df = pd.concat([df_] * 10000, ignore_index=True)
%timeit df.loc[:, 'idxA':'idxB'].replace(labels)
6.55 ms ± 87.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
for c in ['idxA', 'idxB']:
    df[c].map(labels)
2.95 ms ± 70 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Slice pandas string in a vectorised way [duplicate]

This question already has answers here:
How to slice strings in a column by another column in pandas
(2 answers)
Closed 4 years ago.
I am trying to slice strings in a vectorized way, but the result is NaN. It works fine if the slice index is a constant (say str[:1]). Any help?
df = pd.DataFrame({'NAME': ['abc','xyz','hello'], 'SEQ': [1,2,1]})
df['SUB'] = df['NAME'].str[:df['SEQ']]
The output is
NAME SEQ SUB
0 abc 1 NaN
1 xyz 2 NaN
2 hello 1 NaN
Unfortunately, a vectorized solution does not exist.
Use apply with a lambda function:
df['SUB'] = df.apply(lambda x: x['NAME'][:x['SEQ']], axis=1)
Or zip with list comprehension for better performance:
df['SUB'] = [x[:y] for x, y in zip(df['NAME'], df['SEQ'])]
print (df)
NAME SEQ SUB
0 abc 1 a
1 xyz 2 xy
2 hello 1 h
Timings:
df = pd.DataFrame({'NAME': ['abc','xyz','hello'], 'SEQ': [1,2,1]})
df = pd.concat([df] * 1000, ignore_index=True)
In [270]: %timeit df["SUB"] = df.groupby("SEQ").NAME.transform(lambda g: g.str[: g.name])
4.23 ms ± 222 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [271]: %timeit df['SUB'] = df.apply(lambda x: x['NAME'][:x['SEQ']], axis=1)
104 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [272]: %timeit df['SUB'] = [x[:y] for x, y in zip(df['NAME'], df['SEQ'])]
785 µs ± 22.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Using groupby:
df["SUB"] = df.groupby("SEQ").NAME.transform(lambda g: g.str[: g.name])
Might make sense if there are few unique values in SEQ.
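This works because g.name is the group's SEQ value, so each group slices all of its strings to the same length with a single vectorized .str call; with few unique SEQ values there are few groups and little overhead.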

How to apply a function on every row on a dataframe?

I am new to Python and I am not sure how to solve the following problem.
I have a function:
def EOQ(D, p, ck, ch):
    Q = math.sqrt((2*D*ck)/(ch*p))
    return Q
Say I have the dataframe
df = pd.DataFrame({"D": [10,20,30], "p": [20, 30, 10]})
D p
0 10 20
1 20 30
2 30 10
ch=0.2
ck=5
And ch and ck are floats. Now I want to apply the formula to every row of the DataFrame and store the result in an extra column 'Q'. An example (that does not work) would be:
df['Q']= map(lambda p, D: EOQ(D,p,ck,ch),df['p'], df['D'])
(returns only 'map' types)
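In Python 3, map returns a lazy iterator rather than a list, which is why you see map objects; as a quick sketch (not from the answers below), wrapping the call in list() would make this attempt work:
df['Q'] = list(map(lambda p, D: EOQ(D, p, ck, ch), df['p'], df['D']))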
I will need this type of processing more often in my project, and I hope to find something that works.
The following should work:
def EOQ(D, p, ck, ch):
    Q = math.sqrt((2*D*ck)/(ch*p))
    return Q

ch = 0.2
ck = 5
df['Q'] = df.apply(lambda row: EOQ(row['D'], row['p'], ck, ch), axis=1)
df
If all you're doing is calculating the square root of some result, then use the np.sqrt method; it is vectorised and will be significantly faster:
In [80]:
df['Q'] = np.sqrt((2*df['D']*ck)/(ch*df['p']))
df
Out[80]:
D p Q
0 10 20 5.000000
1 20 30 5.773503
2 30 10 12.247449
Timings
For a 30k row df:
In [92]:
import math
ch=0.2
ck=5
def EOQ(D, p, ck, ch):
    Q = math.sqrt((2*D*ck)/(ch*p))
    return Q
%timeit np.sqrt((2*df['D']*ck)/(ch*df['p']))
%timeit df.apply(lambda row: EOQ(row['D'], row['p'], ck, ch), axis=1)
1000 loops, best of 3: 622 µs per loop
1 loops, best of 3: 1.19 s per loop
You can see that the np method is ~1900x faster.
There are a few more ways to apply a function to every row of a DataFrame.
(1) You could modify EOQ a bit by letting it accept a row (a Series object) as its argument and access the relevant elements using the column names inside the function. Moreover, you can pass arguments to apply as keywords, e.g. ch or ck:
def EOQ1(row, ck, ch):
    Q = math.sqrt((2*row['D']*ck)/(ch*row['p']))
    return Q
df['Q1'] = df.apply(EOQ1, ck=ck, ch=ch, axis=1)
(2) It turns out that apply is often slower than a list comprehension (in the benchmark below, it's 20x slower). To use a list comprehension, you could modify EOQ still further so that elements are accessed by position. Then call the function in a loop over the df rows converted to lists:
def EOQ2(row, ck, ch):
    Q = math.sqrt((2*row[0]*ck)/(ch*row[1]))
    return Q
df['Q2a'] = [EOQ2(x, ck, ch) for x in df[['D','p']].to_numpy().tolist()]
(3) As it happens, if the goal is to call a function iteratively, map is usually faster than a list comprehension. So you could convert df into a list, map the function over it, and then unpack the result into a list:
df['Q2b'] = [*map(EOQ2, df[['D','p']].to_numpy().tolist(), [ck]*len(df), [ch]*len(df))]
(4) As @EdChum notes, it's always better to use vectorized methods when possible instead of applying a function row by row. Pandas offers vectorized methods that rival numpy's. In the case of EOQ, for example, instead of math.sqrt you could use pandas' pow method (in the benchmark below, using pandas vectorized methods is ~20% faster than using numpy):
df['Q_pd'] = df['D'].mul(2*ck).div(ch*df['p']).pow(0.5)
Output:
D p Q Q_np Q1 Q2a Q2b Q_pd
0 10 20 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000
1 20 30 5.773503 5.773503 5.773503 5.773503 5.773503 5.773503
2 30 10 12.247449 12.247449 12.247449 12.247449 12.247449 12.247449
Timings:
df = pd.DataFrame({"D": [10,20,30], "p": [20, 30, 10]})
df = pd.concat([df]*10000)
>>> %timeit df['Q'] = df.apply(lambda row: EOQ(row['D'], row['p'], ck, ch), axis=1)
623 ms ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df['Q1'] = df.apply(EOQ1, ck=ck, ch=ch, axis=1)
615 ms ± 39.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df['Q2a'] = [EOQ2(x, ck, ch) for x in df[['D','p']].to_numpy().tolist()]
31.3 ms ± 479 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit df['Q2b'] = [*map(EOQ2, df[['D','p']].to_numpy().tolist(), [ck]*len(df), [ch]*len(df))]
26.9 ms ± 306 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit df['Q_np'] = np.sqrt((2*df['D']*ck)/(ch*df['p']))
1.19 ms ± 53.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit df['Q_pd'] = df['D'].mul(2*ck).div(ch*df['p']).pow(0.5)
966 µs ± 27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
