Consider the following example:
import pandas as pd
import numpy as np
myidx = pd.date_range('2016-01-01','2017-01-01')
data = pd.DataFrame({'value': range(len(myidx))}, index=myidx)
data.head()
Out[16]:
value
2016-01-01 0
2016-01-02 1
2016-01-03 2
2016-01-04 3
2016-01-05 4
This problem is related to expanding each row in a dataframe. I absolutely need to improve the performance of something that is intuitively very simple: I need to "enlarge" the dataframe so that each index value gets "enlarged" by a couple of days (2 days before, 2 days after).
To do this task, I have the following function
def expand_onerow(df, ndaysback = 2, nhdaysfwd = 2):
    new_index = pd.date_range(pd.to_datetime(df.index[0]) - pd.Timedelta(days=ndaysback),
                              pd.to_datetime(df.index[0]) + pd.Timedelta(days=nhdaysfwd),
                              freq='D')
    newdf = df.reindex(index=new_index, method='nearest')  # New df with expanded index
    return newdf
Now either using iterrows or the (supposedly) faster itertuples gives poor results.
%timeit pd.concat([expand_onerow(data.loc[[x],:], ndaysback = 2, nhdaysfwd = 2) for x ,_ in data.iterrows()])
1 loop, best of 3: 574 ms per loop
%timeit pd.concat([expand_onerow(data.loc[[x.Index],:], ndaysback = 2, nhdaysfwd = 2) for x in data.itertuples()])
1 loop, best of 3: 643 ms per loop
Any ideas on how to speed up the generation of the final dataframe? I have millions of observations in my real dataframe, and the index dates are not necessarily consecutive as they are in this example.
head(10) on the final dataframe
Out[21]:
value
2015-12-30 0
2015-12-31 0
2016-01-01 0
2016-01-02 0
2016-01-03 0
2015-12-31 1
2016-01-01 1
2016-01-02 1
2016-01-03 1
2016-01-04 1
Thanks!
When using NumPy/Pandas, the key to speed is often applying vectorized functions to the largest arrays/NDFrames possible. The main reason why your original code is slow is that it calls expand_onerow once for each row. The rows are tiny and you have millions of them. To make it faster, we need to find a way to express the calculation in terms of functions applied to whole DataFrames, or at least whole columns. That way, more of the time is spent in fast C or Fortran code and less in slower Python code.
In this case, the result can be obtained by making copies of data and shifting the index of the whole DataFrame by i days:
new = df.copy()
new.index = df.index + pd.Timedelta(days=i)
dfs.append(new)
and then concatenating the shifted copies:
pd.concat(dfs)
import pandas as pd
import numpy as np
myidx = pd.date_range('2016-01-01','2017-01-01')
data = pd.DataFrame({'value' : range(len(myidx))}, index = myidx)
def expand_onerow(df, ndaysback = 2, nhdaysfwd = 2):
    new_index = pd.date_range(pd.to_datetime(df.index[0]) - pd.Timedelta(days=ndaysback),
                              pd.to_datetime(df.index[0]) + pd.Timedelta(days=nhdaysfwd),
                              freq='D')
    newdf = df.reindex(index=new_index, method='nearest')  # New df with expanded index
    return newdf
def orig(df, ndaysback=2, ndaysfwd=2):
    return pd.concat([expand_onerow(df.loc[[x], :], ndaysback=ndaysback, nhdaysfwd=ndaysfwd)
                      for x, _ in df.iterrows()])
def alt(df, ndaysback=2, ndaysfwd=2):
    dfs = [df]
    for i in range(-ndaysback, ndaysfwd+1):
        if i != 0:
            new = df.copy()
            new.index = df.index + pd.Timedelta(days=i)
            # you could instead use
            # new = df.set_index(df.index + pd.Timedelta(days=i))
            # but it made the timeit result a bit slower
            dfs.append(new)
    return pd.concat(dfs)
Notice that alt has a Python loop with (essentially) 4 iterations. orig has a Python loop (in the form of a list comprehension) with len(df) iterations. Making fewer function calls and applying vectorized functions to bigger array-like objects is how alt gains speed over orig.
Here is a benchmark comparing orig and alt on data:
In [40]: %timeit orig(data)
1 loop, best of 3: 1.15 s per loop
In [76]: %timeit alt(data)
100 loops, best of 3: 2.22 ms per loop
In [77]: 1150/2.22
Out[77]: 518.018018018018
So alt is over 500x faster than orig on a 367-row DataFrame. For small-to-medium sized DataFrames, the speed advantage tends to grow as len(data) gets larger, because alt's Python loop will still have 4 iterations, while orig's loop gets longer. At some point, however, for really large DataFrames, I would expect the speed advantage to crest at some constant factor -- I don't know how large it would be, except that it should be greater than 500x.
This checks that the two functions, orig and alt, produce the same result (but in a different order):
result = alt(data)
expected = orig(data)
result = result.reset_index().sort_values(by=['index','value']).reset_index(drop=True)
expected = expected.reset_index().sort_values(by=['index','value']).reset_index(drop=True)
assert expected.equals(result)
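For completeness, here is a further sketch of my own (not part of the answer's timed comparison) that drops even the 4-iteration loop by building the whole expanded index with NumPy repeat/tile. It assumes a single 'value' column, as in the example, and produces rows grouped by original date (the same ordering as orig):
import numpy as np
import pandas as pd

def expand_all(df, ndaysback=2, ndaysfwd=2):
    # Day offsets -ndaysback .. ndaysfwd as a timedelta64 array.
    offsets = pd.to_timedelta(np.arange(-ndaysback, ndaysfwd + 1), unit='D').values
    # Repeat each original timestamp once per offset, then add the tiled offsets.
    new_index = np.repeat(df.index.values, len(offsets)) + np.tile(offsets, len(df))
    # Repeat the row values so they line up with the expanded index.
    new_values = np.repeat(df['value'].values, len(offsets))
    return pd.DataFrame({'value': new_values}, index=new_index)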
Related
Given a DataFrame with multiple columns, how do we select values from specific columns by row to create a new Series?
df = pd.DataFrame({"A":[1,2,3,4],
"B":[10,20,30,40],
"C":[100,200,300,400]})
columns_to_select = ["B", "A", "A", "C"]
Goal:
[10, 2, 3, 400]
One method that works is to use an apply statement.
df["cols"] = columns_to_select
df.apply(lambda x: x[x.cols], axis=1)
Unfortunately, this is not a vectorized operation and takes a long time on a large dataset. Any ideas would be appreciated.
Pandas approach:
In [22]: df['new'] = df.lookup(df.index, columns_to_select)
In [23]: df
Out[23]:
A B C new
0 1 10 100 10
1 2 20 200 2
2 3 30 300 3
3 4 40 400 400
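A note I will add here (not part of the original answer): DataFrame.lookup was deprecated in pandas 1.2 and has since been removed, so on recent versions the same result can be obtained with Index.get_indexer plus NumPy fancy indexing:
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4],
                   "B": [10, 20, 30, 40],
                   "C": [100, 200, 300, 400]})
columns_to_select = ["B", "A", "A", "C"]

# Positional index of the requested column for each row.
col_idx = df.columns.get_indexer(columns_to_select)
df["new"] = df.to_numpy()[np.arange(len(df)), col_idx]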
NumPy way
Here's a vectorized NumPy way using advanced indexing -
# Extract array data
In [10]: a = df.values
# Get integer based column IDs
In [11]: col_idx = np.searchsorted(df.columns, columns_to_select)
# Use NumPy's advanced indexing to extract relevant elem per row
In [12]: a[np.arange(len(col_idx)), col_idx]
Out[12]: array([ 10, 2, 3, 400])
If the column names of df are not sorted, we need to use the sorter argument with np.searchsorted. The code to extract col_idx for such a generic df would be:
# https://stackoverflow.com/a/38489403/ #Divakar
def column_index(df, query_cols):
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols, query_cols, sorter=sidx)]
So, col_idx would be obtained like so -
col_idx = column_index(df, columns_to_select)
Further optimization
Profiling revealed that the bottleneck was processing strings with np.searchsorted, a usual NumPy weakness: it is not so great with strings. So, to overcome that, and exploiting the special case of the column names being single letters, we can quickly convert them to numbers and then feed those to searchsorted for much faster processing.
Thus, an optimized version of getting the integer based column IDs, for the case where the column names are single letters and sorted, would be -
def column_index_singlechar_sorted(df, query_cols):
    c0 = np.fromstring(''.join(df.columns), dtype=np.uint8)
    c1 = np.fromstring(''.join(query_cols), dtype=np.uint8)
    return np.searchsorted(c0, c1)
This gives us a modified version of the solution, like so -
a = df.values
col_idx = column_index_singlechar_sorted(df, columns_to_select)
out = pd.Series(a[np.arange(len(col_idx)), col_idx])
Timings -
In [149]: # Setup df with 26 uppercase column letters and many rows
...: import string
...: df = pd.DataFrame(np.random.randint(0,9,(1000000,26)))
...: s = list(string.ascii_uppercase[:df.shape[1]])
...: df.columns = s
...: idx = np.random.randint(0,df.shape[1],len(df))
...: columns_to_select = np.take(s, idx).tolist()
# With df.lookup from #MaxU's soln
In [150]: %timeit pd.Series(df.lookup(df.index, columns_to_select))
10 loops, best of 3: 76.7 ms per loop
# With proposed one from this soln
In [151]: %%timeit
...: a = df.values
...: col_idx = column_index_singlechar_sorted(df, columns_to_select)
...: out = pd.Series(a[np.arange(len(col_idx)), col_idx])
10 loops, best of 3: 59 ms per loop
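A small portability note from me (not in the original answer): np.fromstring in binary mode is deprecated in NumPy, so on Python 3 the same single-character trick can be written by encoding the labels and using np.frombuffer. This is intended as a drop-in sketch, not a separately benchmarked version:
import numpy as np

def column_index_singlechar_sorted(df, query_cols):
    # View the single-character labels as uint8 codes after encoding to bytes.
    c0 = np.frombuffer(''.join(df.columns).encode('ascii'), dtype=np.uint8)
    c1 = np.frombuffer(''.join(query_cols).encode('ascii'), dtype=np.uint8)
    return np.searchsorted(c0, c1)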
Given that df.lookup solves the generic case, it's probably a better choice, but the other possible optimizations shown in this post could be handy as well!
Derived from another question, here
I have a DataFrame with 2 million rows, something similar to this:
final_df = pd.DataFrame.from_dict({
'ts': [0,1,2,3,4,5],
'speed': [5,4,1,4,1,4],
'temp': [9,8,7,8,7,8],
'temp2': [2,2,7,2,7,2],
})
I need to run calculations with the values on each row and append the results as new columns, something similar to the question in this link.
I know that there are a lot of combinations of speed, temp, and temp2 that are repeated; if I drop_duplicates, the resulting DataFrame is only 50k rows long, which takes significantly less time to process using an apply function like this:
def dafunc(row):
    row['r1'] = row['speed'] * row['temp'] * k1
    row['r2'] = row['speed'] * row['temp2'] * k2
    return row

nodup_df = final_df.drop_duplicates(['speed', 'temp', 'temp2'])
nodup_df = nodup_df.apply(dafunc, axis=1)
The above code is a super-simplified version of what I actually do.
So far I'm trying to use a dictionary where I store the results, with a string formed from the combination as the key; if the dictionary already has those results, I get them instead of doing the calculations again.
Is there a more efficient way to do this using Pandas' vectorized operations?
EDIT:
In the end, the resulting DataFrame should look like this:
#assuming k1 = 0.5, k2 = 1
resulting_df = pd.DataFrame.from_dict({
'ts': [0,1,2,3,4,5],
'speed': [5,4,1,4,1,4],
'temp': [9,8,7,8,7,8],
'temp2': [2,2,7,2,7,2],
'r1': [22.5,16,3.5,16,3.5,16],
'r2': [10,8,7,8,7,8],
})
Well, if you can access the columns from a NumPy array based on the column index, it will be a lot faster, i.e.
final_df['r1'] = final_df.values[:,0]*final_df.values[:,1]*k1
final_df['r2'] = final_df.values[:,0]*final_df.values[:,2]*k2
If you want to create multiple columns at once, you can use a for loop for that and the speed will be similar:
k = [0.5,1]
for i in range(1,3):
    final_df['r'+str(i)] = final_df.values[:,0]*final_df.values[:,i]*k[i-1]
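A side note of my own (not from the answer): selecting by column name avoids depending on the positional order of .values, which is determined by how the DataFrame was built. A minimal sketch using the question's example data:
import pandas as pd

k1, k2 = 0.5, 1  # constants from the question's edit

final_df = pd.DataFrame.from_dict({
    'ts': [0, 1, 2, 3, 4, 5],
    'speed': [5, 4, 1, 4, 1, 4],
    'temp': [9, 8, 7, 8, 7, 8],
    'temp2': [2, 2, 7, 2, 7, 2],
})

# Same vectorized arithmetic, keyed by name rather than by column position.
final_df['r1'] = final_df['speed'].to_numpy() * final_df['temp'].to_numpy() * k1
final_df['r2'] = final_df['speed'].to_numpy() * final_df['temp2'].to_numpy() * k2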
If you drop duplicates it will be much faster.
Output:
speed temp temp2 ts r1 r2
0 5 9 2 0 22.5 10.0
1 4 8 2 1 16.0 8.0
2 1 7 7 2 3.5 7.0
3 4 8 2 3 16.0 8.0
4 1 7 7 4 3.5 7.0
5 4 8 2 5 16.0 8.0
For a small dataframe:
%%timeit
final_df['r1'] = final_df.values[:,0]*final_df.values[:,1]*k1
final_df['r2'] = final_df.values[:,0]*final_df.values[:,2]*k2
1000 loops, best of 3: 708 µs per loop
For a large dataframe:
%%timeit
ndf = pd.concat([final_df]*10000)
ndf['r1'] = ndf.values[:,0]*ndf.values[:,1]*k1
ndf['r2'] = ndf.values[:,0]*ndf.values[:,2]*k2
1 loop, best of 3: 6.19 ms per loop
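Since the question also asks about reusing results for repeated (speed, temp, temp2) combinations, here is a sketch of my own (not from the answer above) of the dedupe-then-merge pattern, using final_df from the question and the constants from its edit: compute each derived column once per unique combination, then merge the results back onto the full frame.
k1, k2 = 0.5, 1  # constants from the question's edit

keys = ['speed', 'temp', 'temp2']
nodup = final_df[keys].drop_duplicates().copy()
# Compute the derived columns once per unique key combination.
nodup['r1'] = nodup['speed'] * nodup['temp'] * k1
nodup['r2'] = nodup['speed'] * nodup['temp2'] * k2
# Broadcast the results back to every original row.
resulting_df = final_df.merge(nodup, on=keys, how='left')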
I'm sorry for the poor phrasing of the question, but it was the best I could do.
I know exactly what I want, but not exactly how to ask for it.
Here is the logic demonstrated by an example:
Two conditions that take on the values 1 or 0 trigger a signal that also takes on the values 1 or 0. Condition A triggers the signal (If A = 1 then signal = 1, else signal = 0) no matter what. Condition B does NOT trigger the signal, but the signal stays triggered if condition B stays equal to 1
after the signal previously has been triggered by condition A.
The signal goes back to 0 only after both A and B have gone back to 0.
1. Input:
2. Desired output (signal_d) and confirmation that a for loop can solve it (signal_l):
3. My attempt using numpy.where():
4. Reproducible snippet:
# Settings
import numpy as np
import pandas as pd
import datetime
# Data frame with input and desired output i column signal_d
df = pd.DataFrame({'condition_A':list('00001100000110'),
'condition_B':list('01110011111000'),
'signal_d':list('00001111111110')})
colnames = list(df)
df[colnames] = df[colnames].apply(pd.to_numeric)
datelist = pd.date_range(pd.datetime.today().strftime('%Y-%m-%d'), periods=14).tolist()
df['dates'] = datelist
df = df.set_index(['dates'])
# Solution using a for loop with nested ifs in column signal_l
df['signal_l'] = df['condition_A'].copy(deep = True)
i = 0
for observations in df['signal_l']:
    if df.ix[i,'condition_A'] == 1:
        df.ix[i,'signal_l'] = 1
    else:
        # Signal previously triggered by condition_A
        # AND kept "alive" by condition_B:
        if df.ix[i - 1,'signal_l'] & df.ix[i,'condition_B'] == 1:
            df.ix[i,'signal_l'] = 1
        else:
            df.ix[i,'signal_l'] = 0
    i = i + 1
# My attempt with np.where in column signal_v1
df['Signal_v1'] = df['condition_A'].copy()
df['Signal_v1'] = np.where(df.condition_A == 1, 1, np.where( (df.shift(1).Signal_v1 == 1) & (df.condition_B == 1), 1, 0))
print(df)
This is pretty straightforward using a for loop with lagged values and nested if statements, but I can't figure it out using vectorized functions like numpy.where(). And I know this would be much faster for bigger data frames.
Thank you for any suggestions!
I don't think there is a way to vectorize this operation that will be significantly faster than a Python loop. (At least, not if you want to stick with just Python, pandas and numpy.)
However, you can improve the performance of this operation by simplifying your code. Your implementation uses if statements and a lot of DataFrame indexing. These are relatively costly operations.
Here's a modification of your script that includes two functions: add_signal_l(df) and add_lagged(df). The first is your code, just wrapped up in a function. The second uses a simpler function to achieve the same result--still a Python loop, but it uses numpy arrays and bitwise operators.
import numpy as np
import pandas as pd
import datetime
#-----------------------------------------------------------------------
# Create the test DataFrame
# Data frame with input and desired output i column signal_d
df = pd.DataFrame({'condition_A':list('00001100000110'),
'condition_B':list('01110011111000'),
'signal_d':list('00001111111110')})
colnames = list(df)
df[colnames] = df[colnames].apply(pd.to_numeric)
datelist = pd.date_range(pd.datetime.today().strftime('%Y-%m-%d'), periods=14).tolist()
df['dates'] = datelist
df = df.set_index(['dates'])
#-----------------------------------------------------------------------
def add_signal_l(df):
    # Solution using a for loop with nested ifs in column signal_l
    df['signal_l'] = df['condition_A'].copy(deep = True)
    i = 0
    for observations in df['signal_l']:
        if df.ix[i,'condition_A'] == 1:
            df.ix[i,'signal_l'] = 1
        else:
            # Signal previously triggered by condition_A
            # AND kept "alive" by condition_B:
            if df.ix[i - 1,'signal_l'] & df.ix[i,'condition_B'] == 1:
                df.ix[i,'signal_l'] = 1
            else:
                df.ix[i,'signal_l'] = 0
        i = i + 1

def compute_lagged_signal(a, b):
    x = np.empty_like(a)
    x[0] = a[0]
    for i in range(1, len(a)):
        x[i] = a[i] | (x[i-1] & b[i])
    return x

def add_lagged(df):
    df['lagged'] = compute_lagged_signal(df['condition_A'].values, df['condition_B'].values)
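As a quick sanity check (my addition, not part of the original answer), the new column reproduces the desired signal_d from the setup above:
# Run after the setup and function definitions above.
add_lagged(df)
assert (df['lagged'] == df['signal_d']).all()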
Here's a comparison of the timing of the two functions, run in an IPython session:
In [85]: df
Out[85]:
condition_A condition_B signal_d
dates
2017-06-09 0 0 0
2017-06-10 0 1 0
2017-06-11 0 1 0
2017-06-12 0 1 0
2017-06-13 1 0 1
2017-06-14 1 0 1
2017-06-15 0 1 1
2017-06-16 0 1 1
2017-06-17 0 1 1
2017-06-18 0 1 1
2017-06-19 0 1 1
2017-06-20 1 0 1
2017-06-21 1 0 1
2017-06-22 0 0 0
In [86]: %timeit add_signal_l(df)
8.45 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [87]: %timeit add_lagged(df)
137 µs ± 581 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
As you can see, add_lagged(df) is much faster.
I have a large csv with two strings per row in this form:
g,k
a,h
c,i
j,e
d,i
i,h
b,b
d,d
i,a
d,h
I read in the first two columns and recode the strings to integers as follows:
import pandas as pd
df = pd.read_csv("test.csv", usecols=[0,1], prefix="ID_", header=None)
from sklearn.preprocessing import LabelEncoder
# Initialize the LabelEncoder.
le = LabelEncoder()
le.fit(df.values.flat)
# Convert to digits.
df = df.apply(le.transform)
This code is from https://stackoverflow.com/a/39419342/2179021.
The code works very well but is slow when df is large. I timed each step and the result was surprising to me.
pd.read_csv takes about 40 seconds.
le.fit(df.values.flat) takes about 30 seconds
df = df.apply(le.transform) takes about 250 seconds.
Is there any way to speed up this last step? It feels like it should be the fastest step of them all!
More timings for the recoding step on a computer with 4GB of RAM
The answer below by maxymoo is fast but doesn't give the right answer. Taking the example csv from the top of the question, it translates it to:
0 1
0 4 6
1 0 4
2 2 5
3 6 3
4 3 5
5 5 4
6 1 1
7 3 2
8 5 0
9 3 4
Notice that 'd' is mapped to 3 in the first column but 2 in the second.
I tried the solution from https://stackoverflow.com/a/39356398/2179021 and get the following.
df = pd.DataFrame({'ID_0':np.random.randint(0,1000,1000000), 'ID_1':np.random.randint(0,1000,1000000)}).astype(str)
df.info()
memory usage: 7.6MB
%timeit x = (df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack())
1 loops, best of 3: 1.7 s per loop
Then I increased the dataframe size by a factor of 10.
df = pd.DataFrame({'ID_0':np.random.randint(0,1000,10000000), 'ID_1':np.random.randint(0,1000,10000000)}).astype(str)
df.info()
memory usage: 76.3+ MB
%timeit x = (df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack())
MemoryError Traceback (most recent call last)
This method appears to use so much RAM trying to translate this relatively small dataframe that it crashes.
I also timed LabelEncoder with the larger dataset of 10 million rows. It runs without crashing, but the fit line alone took 50 seconds. The df.apply(le.transform) step took about 80 seconds.
How can I:
Get something with roughly the speed of maxymoo's answer and roughly the memory usage of LabelEncoder, but that gives the right answer when the dataframe has two columns?
Store the mapping so that I can reuse it for different data (in the way LabelEncoder allows me to do)?
It looks like it will be much faster to use the pandas category datatype; internally this uses a hash table, whereas LabelEncoder uses a sorted search:
In [87]: df = pd.DataFrame({'ID_0':np.random.randint(0,1000,1000000),
'ID_1':np.random.randint(0,1000,1000000)}).astype(str)
In [88]: le.fit(df.values.flat)
%time x = df.apply(le.transform)
CPU times: user 6.28 s, sys: 48.9 ms, total: 6.33 s
Wall time: 6.37 s
In [89]: %time x = df.apply(lambda x: x.astype('category').cat.codes)
CPU times: user 301 ms, sys: 28.6 ms, total: 330 ms
Wall time: 331 ms
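To address the asker's concern about a consistent mapping across both columns, here is a sketch of my own (not from the answer): build one shared vocabulary first and apply it to every column via CategoricalDtype, which also replaces the older .astype('category', categories=...) form that newer pandas versions no longer accept.
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID_0': np.random.randint(0, 1000, 1000000),
                   'ID_1': np.random.randint(0, 1000, 1000000)}).astype(str)

# One shared vocabulary, so the same string maps to the same code in every column.
categories = pd.unique(df.values.ravel())
cat_dtype = pd.CategoricalDtype(categories=categories)
codes = df.apply(lambda col: col.astype(cat_dtype).cat.codes)
# `categories` can be persisted and reused to encode new data with the same mapping.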
EDIT: Here is a custom transformer class that you could use (you probably won't see this in an official scikit-learn release since the maintainers don't want to have pandas as a dependency):
import pandas as pd
from pandas.core.nanops import unique1d
from sklearn.base import BaseEstimator, TransformerMixin
class PandasLabelEncoder(BaseEstimator, TransformerMixin):
    def fit(self, y):
        self.classes_ = unique1d(y)
        return self

    def transform(self, y):
        s = pd.Series(y).astype('category', categories=self.classes_)
        return s.cat.codes
I tried this with the DataFrame:
In [xxx]: import string
In [xxx]: letters = np.array([c for c in string.ascii_lowercase])
In [249]: df = pd.DataFrame({'ID_0': np.random.choice(letters, 10000000), 'ID_1':np.random.choice(letters, 10000000)})
It looks like this:
In [261]: df.head()
Out[261]:
ID_0 ID_1
0 v z
1 i i
2 d n
3 z r
4 x x
In [262]: df.shape
Out[262]: (10000000, 2)
So, 10 million rows. Locally, my timings are:
In [257]: % timeit le.fit(df.values.flat)
1 loops, best of 3: 17.2 s per loop
In [258]: % timeit df2 = df.apply(le.transform)
1 loops, best of 3: 30.2 s per loop
Then I made a dict mapping letters to numbers and used pandas.Series.map:
In [248]: letters = np.array([l for l in string.ascii_lowercase])
In [263]: d = dict(zip(letters, range(26)))
In [273]: %timeit for c in df.columns: df[c] = df[c].map(d)
1 loops, best of 3: 1.12 s per loop
In [274]: df.head()
Out[274]:
ID_0 ID_1
0 21 25
1 8 8
2 3 13
3 25 17
4 23 23
So that might be an option. The dict just needs to have all of the values that occur in the data.
EDIT: The OP asked what timing I have for that second option, with categories. This is what I get:
In [40]: %timeit x=df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack()
1 loops, best of 3: 13.5 s per loop
EDIT: per the 2nd comment:
In [45]: %timeit uniques = np.sort(pd.unique(df.values.ravel()))
1 loops, best of 3: 933 ms per loop
In [46]: %timeit dfc = df.apply(lambda x: x.astype('category', categories=uniques))
1 loops, best of 3: 1.35 s per loop
I would like to point out an alternate solution that should serve many readers well. Although I prefer to have a known set of IDs, that is not always necessary if this is strictly a one-way remapping.
Instead of
df[c] = df[c].apply(le.transform)
or
dict_table = {val: i for i, val in enumerate(uniques)}
df[c] = df[c].map(dict_table)
or (check out _encode() and _encode_python() in the sklearn source code, which I assume is faster on average than the other methods mentioned)
df[c] = np.array([dict_table[v] for v in df[c].values])
you can instead do
df[c] = df[c].apply(hash)
Pros: much faster, less memory needed, no training, hashes can be reduced to smaller representations (more collisions by casting dtype).
Cons: gives funky numbers, can have collisions (not guaranteed to be perfectly unique), can't guarantee the function won't change with a new version of python
Note that the secure hash functions will have fewer collisions at the cost of speed.
Example of when to use: You have somewhat long strings that are mostly unique and the data set is huge. Most importantly, you don't care about rare hash collisions even though it can be a source of noise in your model's predictions.
I've tried all the methods above and my workload was taking about 90 minutes to learn the encoding from training (1M rows and 600 features) and reapply that to several test sets, while also dealing with new values. The hash method brought it down to a few minutes and I don't need to save any model.
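One caveat worth adding (my note, not from the answer above): Python's built-in hash for strings is randomized per process unless PYTHONHASHSEED is fixed, so if the mapping must be reproducible across runs, a stable digest can be used instead, at some speed cost:
import hashlib
import pandas as pd

def stable_hash(value):
    # md5 truncated to 64 bits: deterministic across processes and Python versions.
    return int.from_bytes(hashlib.md5(str(value).encode('utf-8')).digest()[:8], 'little')

s = pd.Series(['alpha', 'beta', 'alpha'])
codes = s.map(stable_hash)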
I have a dataframe in which all values are of the same variety (e.g. a correlation matrix -- but where we expect a unique maximum). I'd like to return the row and the column of the maximum of this matrix.
I can get the max across rows or columns by changing the first argument of
df.idxmax()
however I haven't found a suitable way to return the row/column index of the max of the whole dataframe.
For example, I can do this in numpy:
>>>npa = np.array([[1,2,3],[4,9,5],[6,7,8]])
>>>np.where(npa == np.amax(npa))
(array([1]), array([1]))
But when I try something similar in pandas:
>>>df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
>>>df.where(df == df.max().max())
a b c
d NaN NaN NaN
e NaN 9 NaN
f NaN NaN NaN
At a second level, what I actually want to do is to return the rows and columns of the top n values, e.g. as a Series.
E.g. for the above I'd like a function which does:
>>>topn(df,3)
b e
c f
b f
dtype: object
>>>type(topn(df,3))
pandas.core.series.Series
or even just
>>>topn(df,3)
(['b','c','b'],['e','f','f'])
a la numpy.where()
I figured out the first part:
npa = df.as_matrix()
cols,indx = np.where(npa == np.amax(npa))
([df.columns[c] for c in cols],[df.index[c] for c in indx])
Now I need a way to get the top n. One naive idea is to copy the array and iteratively replace the top values with NaN, grabbing the index as you go. That seems inefficient. Is there a better way to get the top n values of a numpy array? Fortunately, as shown here, there is, through argpartition, but we have to use flattened indexing.
def topn(df, n):
    npa = df.as_matrix()
    topn_ind = np.argpartition(npa, -n, None)[-n:]  # flattened indices, unsorted
    topn_ind = topn_ind[np.argsort(npa.flat[topn_ind])][::-1]  # argsort in descending order
    cols, indx = np.unravel_index(topn_ind, npa.shape, 'F')  # unflatten, using column-major ordering
    return ([df.columns[c] for c in cols], [df.index[i] for i in indx])
Trying this on the example:
>>>df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
>>>topn(df,3)
(['b', 'c', 'b'], ['e', 'f', 'f'])
As desired. Mind you the sorting was not originally asked for, but provides little overhead if n is not large.
What you want to use is stack:
df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
df = df.stack()
df.sort(ascending=False)
df.head(4)
e b 9
f c 8
b 7
a 6
dtype: int64
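On modern pandas, Series.sort no longer exists; the same idea (my adaptation, not the original answer) reads as sort_values, or more directly as nlargest on the stacked Series:
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 9, 5], [6, 7, 8]],
                  columns=list('abc'), index=list('def'))

# stack() yields a Series indexed by (row label, column label); nlargest keeps the top n.
top3 = df.stack().nlargest(3)
print(top3)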
I guess for what you are trying to do, a DataFrame might not be the best choice, since the idea of the columns in a DataFrame is to hold independent data.
>>> def topn(df, n):
...     # pull the data out of the DataFrame
...     # and flatten it to an array
...     vals = df.values.flatten(order='F')
...     # next we sort the array and store the sort mask
...     p = np.argsort(vals)
...     # create two arrays with the column names and indexes
...     # in the same order as vals
...     cols = np.array([[col]*len(df.index) for col in df.columns]).flatten()
...     idxs = np.array([list(df.index) for _ in df.columns]).flatten()
...     # sort and return cols, and idxs
...     return cols[p][:-(n+1):-1], idxs[p][:-(n+1):-1]
>>> topn(df,3)
(array(['b', 'c', 'b'],
dtype='|S1'),
array(['e', 'f', 'f'],
dtype='|S1'))
>>> %timeit(topn(df,3))
10000 loops, best of 3: 29.9 µs per loop
watsonics' solution takes slightly less time:
%timeit(topn(df,3))
10000 loops, best of 3: 24.6 µs per loop
but both are way faster than stack:
def topStack(df, n):
    df = df.stack()
    df.sort(ascending=False)
    return df.head(n)
%timeit(topStack(df,3))
1000 loops, best of 3: 1.91 ms per loop