I've been exploring how to optimize my code and ran across pandas .at method. Per the documentation
Fast label-based scalar accessor
Similarly to loc, at provides label based scalar lookups. You can also set using these indexers.
So I ran some samples:
Setup
import pandas as pd
import numpy as np
from string import letters, lowercase, uppercase
lt = list(letters)
lc = list(lowercase)
uc = list(uppercase)
def gdf(rows, cols, seed=None):
"""rows and cols are what you'd pass
to pd.MultiIndex.from_product()"""
gmi = pd.MultiIndex.from_product
df = pd.DataFrame(index=gmi(rows), columns=gmi(cols))
np.random.seed(seed)
df.iloc[:, :] = np.random.rand(*df.shape)
return df
seed = [3, 1415]
df = gdf([lc, uc], [lc, uc], seed)
print df.head().T.head().T
df looks like:
a
A B C D E
a A 0.444939 0.407554 0.460148 0.465239 0.462691
B 0.032746 0.485650 0.503892 0.351520 0.061569
C 0.777350 0.047677 0.250667 0.602878 0.570528
D 0.927783 0.653868 0.381103 0.959544 0.033253
E 0.191985 0.304597 0.195106 0.370921 0.631576
Let's use .at and .loc and make sure I get the same thing:
print "using .loc", df.loc[('a', 'A'), ('c', 'C')]
print "using .at ", df.at[('a', 'A'), ('c', 'C')]
using .loc 0.37374090276
using .at 0.37374090276
Test speed using .loc
%%timeit
df.loc[('a', 'A'), ('c', 'C')]
10000 loops, best of 3: 180 µs per loop
Test speed using .at
%%timeit
df.at[('a', 'A'), ('c', 'C')]
The slowest run took 6.11 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8 µs per loop
This looks like a huge speed increase. Even allowing for the caching warning, 6.11 * 8 ≈ 49 µs is still a lot faster than 180 µs.
Question
What are the limitations of .at? I'm motivated to use it. The documentation says it's similar to .loc but it doesn't behave similarly. Example:
# small df
sdf = gdf([lc[:2]], [uc[:2]], seed)
print sdf.loc[:, :]
A B
a 0.444939 0.407554
b 0.460148 0.465239
whereas print sdf.at[:, :] results in TypeError: unhashable type.
So obviously not the same even if the intent is to be similar.
That said, can anyone provide guidance on what can and cannot be done with the .at method?
Update: df.get_value is deprecated as of version 0.21.0. Using df.at or df.iat is the recommended method going forward.
df.at can only access a single value at a time.
df.loc can select multiple rows and/or columns.
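A minimal illustration of the difference (a throwaway frame, not the question's MultiIndex setup):

import pandas as pd

d = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]}, index=['a', 'b', 'c'])

d.at['a', 'x']           # scalar 1 (exactly one row label and one column label)
d.loc['a', 'x']          # also the scalar 1
d.loc[['a', 'b'], 'x']   # a Series of two values; .at cannot do this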
Note that there is also df.get_value, which may be even quicker at accessing single values:
In [25]: %timeit df.loc[('a', 'A'), ('c', 'C')]
10000 loops, best of 3: 187 µs per loop
In [26]: %timeit df.at[('a', 'A'), ('c', 'C')]
100000 loops, best of 3: 8.33 µs per loop
In [35]: %timeit df.get_value(('a', 'A'), ('c', 'C'))
100000 loops, best of 3: 3.62 µs per loop
Under the hood, df.at[...] calls df.get_value, but it also does some type checking on the keys.
As you asked about the limitations of .at, here is one thing I recently ran into (using pandas 0.22). Let's use the example from the documentation:
df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]], index=[4, 5, 6], columns=['A', 'B', 'C'])
df2 = df.copy()
A B C
4 0 2 3
5 0 4 1
6 10 20 30
If I now do
df.at[4, 'B'] = 100
the result looks as expected
A B C
4 0 100 3
5 0 4 1
6 10 20 30
However, when I try to do
df.at[4, 'C'] = 10.05
it seems that .at tries to preserve the datatype (here: int):
A B C
4 0 100 10
5 0 4 1
6 10 20 30
That seems to be a difference from .loc:
df2.loc[4, 'C'] = 10.05
yields the desired
A B C
4 0 2 10.05
5 0 4 1.00
6 10 20 30.00
The risky thing in the example above is that this conversion from float to int happens silently. When one tries the same with strings, it will throw an error:
df.at[5, 'A'] = 'a_string'
ValueError: invalid literal for int() with base 10: 'a_string'
It will work, however, if one uses a string on which int() actually works, as noted by @n1k31t4 in the comments, e.g.
df.at[5, 'A'] = '123'
A B C
4 0 2 3
5 123 4 1
6 10 20 30
Adding to the above, the Pandas documentation for the at function states:
Access a single value for a row/column label pair.
Similar to loc, in that both provide label-based lookups. Use at if
you only need to get or set a single value in a DataFrame or Series.
For setting data, loc and at are similar; for example:
df = pd.DataFrame({'A': [1,2,3], 'B': [11,22,33]}, index=[0,0,1])
Both loc and at will produce the same result
df.at[0, 'A'] = [101,102]
df.loc[0, 'A'] = [101,102]
A B
0 101 11
0 102 22
1 3 33
df.at[0, 'A'] = 103
df.loc[0, 'A'] = 103
A B
0 103 11
0 103 22
1 3 33
Also, for accessing a single value, both are the same
df.loc[1, 'A'] # returns a single value (<class 'numpy.int64'>)
df.at[1, 'A'] # returns a single value (<class 'numpy.int64'>)
3
However, when matching multiple values, loc will return a group of rows/columns from the DataFrame, while at will return an array of values:
df.loc[0, 'A'] # returns a Series (<class 'pandas.core.series.Series'>)
0 103
0 103
Name: A, dtype: int64
df.at[0, 'A'] # returns array of values (<class 'numpy.ndarray'>)
array([103, 103])
Moreover, loc can be used to match a group of rows/columns and can be given only an index, while at must also receive the column:
df.loc[0] # returns a DataFrame view (<class 'pandas.core.frame.DataFrame'>)
A B
0 103 11
0 103 22
# df.at[0] # ERROR: must receive column
.at is an optimized data access method compared to .loc.
.loc on a DataFrame selects all the elements located by the indexed rows and labeled columns given in its argument. By contrast, .at selects the particular element of a DataFrame positioned at the given indexed row and labeled column.
Also, .at takes one row and one column as its arguments, whereas .loc may take multiple rows and columns. The output of .at is a single element, while .loc may return a Series or a DataFrame.
Related
This seems like a ridiculously easy question... but I'm not seeing the easy answer I was expecting.
So, how do I get the value at an nth row of a given column in Pandas? (I am particularly interested in the first row, but would be interested in a more general practice as well).
For example, let's say I want to pull the 1.2 value in Btime as a variable.
What's the right way to do this?
>>> df_test
ATime X Y Z Btime C D E
0 1.2 2 15 2 1.2 12 25 12
1 1.4 3 12 1 1.3 13 22 11
2 1.5 1 10 6 1.4 11 20 16
3 1.6 2 9 10 1.7 12 29 12
4 1.9 1 1 9 1.9 11 21 19
5 2.0 0 0 0 2.0 8 10 11
6 2.4 0 0 0 2.4 10 12 15
To select the ith row, use iloc:
In [31]: df_test.iloc[0]
Out[31]:
ATime 1.2
X 2.0
Y 15.0
Z 2.0
Btime 1.2
C 12.0
D 25.0
E 12.0
Name: 0, dtype: float64
To select the ith value in the Btime column you could use:
In [30]: df_test['Btime'].iloc[0]
Out[30]: 1.2
There is a difference between df_test['Btime'].iloc[0] (recommended) and df_test.iloc[0]['Btime']:
DataFrames store data in column-based blocks (where each block has a single
dtype). If you select by column first, a view can be returned (which is
quicker than returning a copy) and the original dtype is preserved. In contrast,
if you select by row first, and if the DataFrame has columns of different
dtypes, then Pandas copies the data into a new Series with a single common dtype (object in the worst case). So
selecting columns is a bit faster than selecting rows. Thus, although
df_test.iloc[0]['Btime'] works, df_test['Btime'].iloc[0] is a little bit
more efficient.
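To see the dtype effect concretely, here is a quick sketch using the question's df_test (its columns are all numeric, so the copied row comes back as float64 rather than object, but the int columns still lose their dtype):

type(df_test['X'].iloc[0])     # numpy.int64: column selected first, integer dtype preserved
type(df_test.iloc[0]['X'])     # numpy.float64: the whole row was copied and upcast first
df_test.iloc[0].dtype          # float64 for this frame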
There is a big difference between the two when it comes to assignment.
df_test['Btime'].iloc[0] = x affects df_test, but df_test.iloc[0]['Btime']
may not. See below for an explanation of why. Because a subtle difference in
the order of indexing makes a big difference in behavior, it is better to use single indexing assignment:
df.iloc[0, df.columns.get_loc('Btime')] = x
df.iloc[0, df.columns.get_loc('Btime')] = x (recommended):
The recommended way to assign new values to a
DataFrame is to avoid chained indexing, and instead use the method shown by
andrew,
df.loc[df.index[n], 'Btime'] = x
or
df.iloc[n, df.columns.get_loc('Btime')] = x
The latter method is a bit faster, because df.loc has to convert the row and column labels to
positional indices, so there is a little less conversion necessary if you use
df.iloc instead.
df['Btime'].iloc[0] = x works, but is not recommended:
Although this works, it is taking advantage of the way DataFrames are currently implemented. There is no guarantee that Pandas has to work this way in the future. In particular, it is taking advantage of the fact that (currently) df['Btime'] always returns a
view (not a copy) so df['Btime'].iloc[n] = x can be used to assign a new value
at the nth location of the Btime column of df.
Since Pandas makes no explicit guarantees about when indexers return a view versus a copy, assignments that use chained indexing generally raise a SettingWithCopyWarning, even though in this case the assignment succeeds in modifying df:
In [22]: df = pd.DataFrame({'foo':list('ABC')}, index=[0,2,1])
In [24]: df['bar'] = 100
In [25]: df['bar'].iloc[0] = 99
/home/unutbu/data/binky/bin/ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
In [26]: df
Out[26]:
foo bar
0 A 99 <-- assignment succeeded
2 B 100
1 C 100
df.iloc[0]['Btime'] = x does not work:
In contrast, assignment with df.iloc[0]['bar'] = 123 does not work because df.iloc[0] is returning a copy:
In [66]: df.iloc[0]['bar'] = 123
/home/unutbu/data/binky/bin/ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
In [67]: df
Out[67]:
foo bar
0 A 99 <-- assignment failed
2 B 100
1 C 100
Warning: I had previously suggested df_test.ix[i, 'Btime']. But this is not guaranteed to give you the ith value since ix tries to index by label before trying to index by position. So if the DataFrame has an integer index which is not in sorted order starting at 0, then using ix[i] will return the row labeled i rather than the ith row. For example,
In [1]: df = pd.DataFrame({'foo':list('ABC')}, index=[0,2,1])
In [2]: df
Out[2]:
foo
0 A
2 B
1 C
In [4]: df.ix[1, 'foo']
Out[4]: 'C'
Note that the answer from @unutbu will be correct until you want to set the value to something new; then it will not work if your dataframe is a view.
In [4]: df = pd.DataFrame({'foo':list('ABC')}, index=[0,2,1])
In [5]: df['bar'] = 100
In [6]: df['bar'].iloc[0] = 99
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas-0.16.0_19_g8d2818e-py2.7-macosx-10.9-x86_64.egg/pandas/core/indexing.py:118: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
Another approach that will consistently work with both setting and getting is:
In [7]: df.loc[df.index[0], 'foo']
Out[7]: 'A'
In [8]: df.loc[df.index[0], 'bar'] = 99
In [9]: df
Out[9]:
foo bar
0 A 99
2 B 100
1 C 100
Another way to do this:
first_value = df['Btime'].values[0]
This way seems to be faster than using .iloc:
In [1]: %timeit -n 1000 df['Btime'].values[20]
5.82 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [2]: %timeit -n 1000 df['Btime'].iloc[20]
29.2 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
df.iloc[0].head(1) - only the first value from the entire first row (as a one-element Series).
df.iloc[0] - the entire first row, returned as a Series.
More generally, if you want to pick up the first N rows of the Jth column of a pandas DataFrame, the best way to do this is:
data = dataframe.iloc[0:N, J]
To access a single value you can use the method iat that is much faster than iloc:
df['Btime'].iat[0]
You can also use the method take:
df['Btime'].take(0)
.iat and .at are the methods for getting and setting single values and are much faster than .iloc and .loc. Mykola Zotko pointed this out in their answer, but they did not use .iat to its full extent.
When we can use .iat or .at, we should only have to index into the dataframe once.
This is not great:
df['Btime'].iat[0]
It is not ideal because the 'Btime' column was first selected as a series, then .iat was used to index into that series.
These two options are the best:
Using zero-indexed positions:
df.iat[0, 4] # get the value in the zeroth row, and 4th column
Using Labels:
df.at[0, 'Btime'] # get the value where the index label is 0 and the column name is "Btime".
Both methods return the value of 1.2.
To get, e.g., the value from column 'test' and row 1, you can use
df[['test']].values[0][0]
since df[['test']].values[0] only gives back an array.
Another way of getting the first row and preserving the index:
x = df.first('d') # Returns the first day. '3d' gives first three days.
According to pandas docs, at is the fastest way to access a scalar value such as the use case in the OP (already suggested by Alex on this page).
Building upon Alex's answer, because dataframes don't necessarily have a range index it might be more complete to index df.index (since dataframe indexes are built on numpy arrays, you can index them like an array) or call get_loc() on columns to get the integer location of a column.
df.at[df.index[0], 'Btime']
df.iat[0, df.columns.get_loc('Btime')]
One common problem is that you used a boolean mask to get a single value, but ended up with a value with an index (actually a Series), e.g.:
0 1.2
Name: Btime, dtype: float64
you can use squeeze() to get the scalar value, i.e.
df.loc[df['Btime']<1.3, 'Btime'].squeeze()
I have a dataframe in which all values are of the same variety (e.g. a correlation matrix -- but where we expect a unique maximum). I'd like to return the row and the column of the maximum of this matrix.
I can get the max across rows or columns by changing the first argument of
df.idxmax()
however I haven't found a suitable way to return the row/column index of the max of the whole dataframe.
For example, I can do this in numpy:
>>>npa = np.array([[1,2,3],[4,9,5],[6,7,8]])
>>>np.where(npa == np.amax(npa))
(array([1]), array([1]))
But when I try something similar in pandas:
>>>df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
>>>df.where(df == df.max().max())
a b c
d NaN NaN NaN
e NaN 9 NaN
f NaN NaN NaN
At a second level, what I actually want to do is to return the rows and columns of the top n values, e.g. as a Series.
E.g. for the above I'd like a function which does:
>>>topn(df,3)
b e
c f
b f
dtype: object
>>>type(topn(df,3))
pandas.core.series.Series
or even just
>>>topn(df,3)
(['b','c','b'],['e','f','f'])
a la numpy.where()
I figured out the first part:
npa = df.as_matrix()
cols,indx = np.where(npa == np.amax(npa))
([df.columns[c] for c in cols],[df.index[c] for c in indx])
Now I need a way to get the top n. One naive idea is to copy the array and iteratively replace the top values with NaN, grabbing the index as you go. That seems inefficient. Is there a better way to get the top n values of a numpy array? Fortunately, as shown here, there is, through argpartition, but we have to use flattened indexing.
def topn(df, n):
    npa = df.as_matrix()
    topn_ind = np.argpartition(npa, -n, None)[-n:]  # flattened indices, unsorted
    topn_ind = topn_ind[np.argsort(npa.flat[topn_ind])][::-1]  # argsort in descending order
    cols, indx = np.unravel_index(topn_ind, npa.shape, 'F')  # unflatten, using column-major ordering
    return ([df.columns[c] for c in cols], [df.index[i] for i in indx])
Trying this on the example:
>>>df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
>>>topn(df,3)
(['b', 'c', 'b'], ['e', 'f', 'f'])
As desired. Mind you, the sorting was not originally asked for, but it adds little overhead if n is not large.
What you want to use is stack:
df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
df = df.stack()
df.sort(ascending=False)
df.head(4)
e b 9
f c 8
b 7
a 6
dtype: int64
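On recent pandas versions Series.sort() no longer exists; a minimal sketch of the same idea there would replace sort + head with nlargest:

import pandas as pd

df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]], columns=list('abc'), index=list('def'))
df.stack().nlargest(4)    # Series with a (row, column) MultiIndex, largest values first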
I guess for what you are trying to do a DataFrame might not be the best choice, since the idea of the columns in the DataFrame is to hold independent data.
>>> def topn(df, n):
        # pull the data out of the DataFrame
        # and flatten it to an array
        vals = df.values.flatten(order='F')
        # next we sort the array and store the sort mask
        p = np.argsort(vals)
        # create two arrays with the column names and indexes
        # in the same order as vals
        cols = np.array([[col] * len(df.index) for col in df.columns]).flatten()
        idxs = np.array([list(df.index) for idx in df.index]).flatten()
        # sort and return cols and idxs
        return cols[p][:-(n + 1):-1], idxs[p][:-(n + 1):-1]
>>> topn(df,3)
(array(['b', 'c', 'b'],
dtype='|S1'),
array(['e', 'f', 'f'],
dtype='|S1'))
>>> %timeit(topn(df,3))
10000 loops, best of 3: 29.9 µs per loop
watsonic's solution takes slightly less time:
%timeit(topn(df,3))
10000 loops, best of 3: 24.6 µs per loop
but both are way faster than stack:
def topStack(df,n):
df = df.stack()
df.sort(ascending=False)
return df.head(n)
%timeit(topStack(df,3))
1000 loops, best of 3: 1.91 ms per loop
Suppose I have two tables A and B.
Table A has a multi-level index (a, b) and one column (ts).
b uniquely determines ts.
A = pd.DataFrame(
[('a', 'x', 4),
('a', 'y', 6),
('a', 'z', 5),
('b', 'x', 4),
('b', 'z', 5),
('c', 'y', 6)],
columns=['a', 'b', 'ts']).set_index(['a', 'b'])
AA = A.reset_index()
Table B is another one-column (ts) table with non-unique index (a).
The ts's are sorted "inside" each group, i.e., B.ix[x] is sorted for each x.
Moreover, there is always a value in B.ix[x] that is greater than or equal to
the values in A.
B = pd.DataFrame(
dict(a=list('aaaaabbcccccc'),
ts=[1, 2, 4, 5, 7, 7, 8, 1, 2, 4, 5, 8, 9])).set_index('a')
The semantics here are that B contains observations of occurrences of events, with the event type indicated by the index.
I would like to find from B the timestamp of the first occurrence of each event type after the timestamp specified in A for each value of b. In other words, I would like to get a table with the same shape of A, that instead of ts contains the "minimum value occurring after ts" as specified by table B.
So, my goal would be:
C:
('a', 'x') 4
('a', 'y') 7
('a', 'z') 5
('b', 'x') 7
('b', 'z') 7
('c', 'y') 8
I have some working code, but it is terribly slow.
C = AA.apply(lambda row: (
row[0],
row[1],
B.ix[row[0]].irow(np.searchsorted(B.ts[row[0]], row[2]))), axis=1).set_index(['a', 'b'])
Profiling shows the culprit is obviously B.ix[row[0]].irow(np.searchsorted(B.ts[row[0]], row[2])). However, standard solutions using merge/join would take too much RAM in the long run.
Consider that right now I have 1000 a's, assume the average number of b's per a is constant (probably 100-200), and consider that the number of observations per a is probably on the order of 300. In production I will have 1000 times more a's.
1,000,000 x 200 x 300 = 60,000,000,000 rows
may be a bit too much to keep in RAM, especially considering that the data I need is perfectly described by a C like the one I discussed above.
How would I improve the performance?
Thanks for providing sample data. I've updated this answer with general suggestions, given anticipated array sizes in the hundreds of millions.
Line profile
Line profiling the guts of your lambda function shows that most time is spent
in B.ix[] (which has been refactored here to only be called once).
In [91]: lprun -f stack.foo1 AA.apply(stack.foo1, B=B, axis=1)
Timer unit: 1e-06 s
File: stack.py
Function: foo1 at line 4
Total time: 0.006651 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
4 def foo1(row, B):
5 6 6158 1026.3 92.6 subset = B.ix[row[0]].ts
6 6 418 69.7 6.3 idx = np.searchsorted(subset, row[2])
7 6 56 9.3 0.8 val = subset.irow(idx)
8 6 19 3.2 0.3 return val
Consider built-in data types and raw numpy arrays over higher-level constructs.
Since B behaves like a dict here and the same key is accessed many times, let's compare df.ix to a normal Python
dictionary (precomputed elsewhere). A dictionary with 1M keys (unique A values) should only require ~34MB (33% capacity: 3 * 1e6 * 12 bytes).
In [102]: timeit B.ix['a']
10000 loops, best of 3: 122 us per loop
In [103]: timeit dct['a']
10000000 loops, best of 3: 53.2 ns per loop
Replace function calls with loops
The last major improvement I can think of would be to replace df.apply() with a for loop to avoid calling any function 200M times (or however large A is).
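A rough sketch of how those two suggestions could be combined (assuming the AA and B frames from the question, with B.ts sorted within each 'a' group as stated; an illustration rather than profiled code):

import numpy as np
import pandas as pd

# precompute a plain dict: key -> sorted numpy array of timestamps
ts_by_a = {k: grp['ts'].values for k, grp in B.groupby(level=0)}

# explicit loop instead of df.apply()
vals = []
for a, b, ts in AA.itertuples(index=False):
    arr = ts_by_a[a]
    vals.append(arr[np.searchsorted(arr, ts)])  # first timestamp in B that is >= ts

C = pd.Series(vals, index=pd.MultiIndex.from_arrays([AA['a'], AA['b']]))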
Hopefully these ideas help.
Original, expressive solution, though not memory efficient:
In [5]: CC = AA.merge(B, left_on='a', right_index=True)
In [6]: CC[CC.ts_x <= CC.ts_y].groupby(['a', 'b']).first()
Out[6]:
ts_x ts_y
a b
a x 4 4
y 6 7
z 5 5
b x 4 7
z 5 7
c y 6 8
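To reduce that to the single goal column (the C from the question), one could, for example, keep just ts_y:

C = CC[CC.ts_x <= CC.ts_y].groupby(['a', 'b'])['ts_y'].first()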
Another option is to use numpy's boolean array notation, which seems an order of magnitude faster than the original (in this tiny example; I suspect it'll be even better on larger datasets...):
I suspect this is largely because picking the minimum is a much faster task than sorting.
In [11]: AA.apply(lambda row: (B.ts.values[(B.ts.values >= row['ts']) &
(B.index == row['a'])].min()),
axis=1)
Out[11]:
0 4
1 7
2 5
3 7
4 7
5 8
In [12]: %timeit AA.apply(lambda row: (B.ts.values[(B.ts.values >= row['ts']) &(B.index == row['a'])].min()), axis=1)
1000 loops, best of 3: 1.46 ms per loop
This seems like the fastest method if you were simply adding this as a column to AA.
If you were creating a new dataframe as in your example - trying to test this "fairly" - it is slower (but still twice as fast as the original):
In [13]: %timeit C = AA.apply(lambda row: (row[0], row[1], B.ix[row[0]].irow(np.searchsorted(B.ts[row[0]], row[2]))), axis=1).set_index(['a', 'b'])
100 loops, best of 3: 10.3 ms per loop
In [14]: %timeit C = AA.apply(lambda row: (row[0], row[1], B.ts.values[(B.ts.values >= row['ts']) & (B.index == row['a'])].min()), axis=1)
100 loops, best of 3: 4.32 ms per loop
I have a series with a MultiIndex like this:
import numpy as np
import pandas as pd
buckets = np.repeat(['a','b','c'], [3,5,1])
sequence = [0,1,5,0,1,2,4,50,0]
s = pd.Series(
np.random.randn(len(sequence)),
index=pd.MultiIndex.from_tuples(zip(buckets, sequence))
)
# In [6]: s
# Out[6]:
# a 0 -1.106047
# 1 1.665214
# 5 0.279190
# b 0 0.326364
# 1 0.900439
# 2 -0.653940
# 4 0.082270
# 50 -0.255482
# c 0 -0.091730
I'd like to get the s['b'] values where the second index ('sequence') is between 2 and 10.
Slicing on the first index works fine:
s['a':'b']
# Out[109]:
# bucket value
# a 0 1.828176
# 1 0.160496
# 5 0.401985
# b 0 -1.514268
# 1 -0.973915
# 2 1.285553
# 4 -0.194625
# 5 -0.144112
But not on the second, at least not by what seem to be the two most obvious ways:
1) This returns elements 1 through 4 by position, which has nothing to do with the index values:
s['b'][1:10]
# In [61]: s['b'][1:10]
# Out[61]:
# 1 0.900439
# 2 -0.653940
# 4 0.082270
# 50 -0.255482
However, if I reverse the index levels, so that the first level is the integer and the second is the string, it works:
In [26]: s
Out[26]:
0 a -0.126299
1 a 1.810928
5 a 0.571873
0 b -0.116108
1 b -0.712184
2 b -1.771264
4 b 0.148961
50 b 0.089683
0 c -0.582578
In [25]: s[0]['a':'b']
Out[25]:
a -0.126299
b -0.116108
As Robbie-Clarken answers, since 0.14 you can pass a slice in the tuple you pass to loc:
In [11]: s.loc[('b', slice(2, 10))]
Out[11]:
b 2 -0.65394
4 0.08227
dtype: float64
Indeed, you can pass a slice for each level:
In [12]: s.loc[(slice('a', 'b'), slice(2, 10))]
Out[12]:
a 5 0.27919
b 2 -0.65394
4 0.08227
dtype: float64
Note: the slice is inclusive.
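On reasonably recent pandas you can also spell this with pd.IndexSlice, which is just sugar for the tuple of slices (the MultiIndex has to be sorted for inner-level slicing):

idx = pd.IndexSlice
s.loc[idx['b', 2:10]]          # same as s.loc[('b', slice(2, 10))]
s.loc[idx['a':'b', 2:10]]      # slices on both levels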
Old answer:
You can also do this using:
s.ix[1:10, "b"]
(It's good practice to do this in a single ix/loc/iloc call, since this version allows assignment.)
This answer was written prior to the introduction of iloc in early 2013, i.e. position/integer location - which may be preferred in this case. The reason it was created was to remove the ambiguity from integer-indexed pandas objects, and be more descriptive: "I'm slicing on position".
s["b"].iloc[1:10]
That said, I kinda disagree with the docs that ix is:
most robust and consistent way
It's not; the most consistent way is to describe what you're doing:
use loc for labels
use iloc for position
use ix for both (if you really have to)
Remember the zen of python:
explicit is better than implicit
Since pandas 0.15.0 this works:
s.loc['b', 2:10]
Output:
b 2 -0.503023
4 0.704880
dtype: float64
With a DataFrame it's slightly different (source):
df.loc(axis=0)['b', 2:10]
As of pandas 0.14.0 it is possible to slice multi-indexed objects by providing .loc a tuple containing slice objects:
In [2]: s.loc[('b', slice(2, 10))]
Out[2]:
b 2 -1.206052
4 -0.735682
dtype: float64
The best way I can think of is to use 'select' in this case. Although it even says in the docs that "This method should be used only when there is no more direct way."
Indexing and selecting data
In [116]: s
Out[116]:
a 0 1.724372
1 0.305923
5 1.780811
b 0 -0.556650
1 0.207783
4 -0.177901
50 0.289365
0 1.168115
In [117]: s.select(lambda x: x[0] == 'b' and 2 <= x[1] <= 10)
Out[117]: b 4 -0.177901
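Since select() was later deprecated and removed, a comparable filter on recent pandas could use the index level values directly (a sketch, not part of the original answer):

lvl0 = s.index.get_level_values(0)
lvl1 = s.index.get_level_values(1)
s[(lvl0 == 'b') & (lvl1 >= 2) & (lvl1 <= 10)]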
Not sure if this is ideal, but it works by filtering the index with a generator of matching tuples:
In [59]: s.index
Out[59]:
MultiIndex
[('a', 0) ('a', 1) ('a', 5) ('b', 0) ('b', 1) ('b', 2) ('b', 4)
('b', 50) ('c', 0)]
In [77]: s[(tpl for tpl in s.index if 2<=tpl[1]<=10 and tpl[0]=='b')]
Out[77]:
b 2 -0.586568
4 1.559988
EDIT: hayden's solution is the way to go.