Non standard interaction among two tables to avoid very large merge - python

Suppose I have two tables A and B.
Table A has a multi-level index (a, b) and one column (ts).
b determines univocally ts.
A = pd.DataFrame(
[('a', 'x', 4),
('a', 'y', 6),
('a', 'z', 5),
('b', 'x', 4),
('b', 'z', 5),
('c', 'y', 6)],
columns=['a', 'b', 'ts']).set_index(['a', 'b'])
AA = A.reset_index()
Table B is another one-column (ts) table with non-unique index (a).
The ts's are sorted "inside" each group, i.e., B.ix[x] is sorted for each x.
Moreover, there is always a value in B.ix[x] that is greater than or equal to
the values in A.
B = pd.DataFrame(
dict(a=list('aaaaabbcccccc'),
ts=[1, 2, 4, 5, 7, 7, 8, 1, 2, 4, 5, 8, 9])).set_index('a')
The semantics in this is that B contains observations of occurrences of an event of type indicated by the index.
I would like to find from B the timestamp of the first occurrence of each event type after the timestamp specified in A for each value of b. In other words, I would like to get a table with the same shape of A, that instead of ts contains the "minimum value occurring after ts" as specified by table B.
So, my goal would be:
C:
('a', 'x') 4
('a', 'y') 7
('a', 'z') 5
('b', 'x') 7
('b', 'z') 7
('c', 'y') 8
I have some working code, but is terribly slow.
C = AA.apply(lambda row: (
row[0],
row[1],
B.ix[row[0]].irow(np.searchsorted(B.ts[row[0]], row[2]))), axis=1).set_index(['a', 'b'])
Profiling shows the culprit is obviously B.ix[row[0]].irow(np.searchsorted(B.ts[row[0]], row[2]))). However, standard solutions using merge/join would take too much RAM in the long run.
Consider that now I have 1000 a's, assume constant the average number of b's per a (probably 100-200), and consider that the number of observations per a is probably in the order of 300. In production I will have 1000 more a's.
1,000,000 x 200 x 300 = 60,000,000,000 rows
may be a bit too much to keep in RAM, especially considering that the data I need is perfectly described by a C like the one I discussed above.
How would I improve the performance?

Thanks for providing sample data. I've updated this answer with general
suggestions given anticipated array sizes in the 100's of million.
Line profile
Line profiling the guts of your lambda function shows that most time is spent
in B.ix[] (which has been refactored here to only be called once).
In [91]: lprun -f stack.foo1 AA.apply(stack.foo1, B=B, axis=1)
Timer unit: 1e-06 s
File: stack.py
Function: foo1 at line 4
Total time: 0.006651 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
4 def foo1(row, B):
5 6 6158 1026.3 92.6 subset = B.ix[row[0]].ts
6 6 418 69.7 6.3 idx = np.searchsorted(subset, row[2])
7 6 56 9.3 0.8 val = subset.irow(idx)
8 6 19 3.2 0.3 return val
Consider built-in data types and raw numpy arrays over higher-level constructs.
Since B behaves like a dict here and the same key is accessed many times, let's compare df.ix to a normal Python
dictionary (precomputed elsewhere). A dictionary with 1M keys (unique A values) should only require ~34MB (33% capacity: 3 * 1e6 * 12 bytes).
In [102]: timeit B.ix['a']
10000 loops, best of 3: 122 us per loop
In [103]: timeit dct['a']
10000000 loops, best of 3: 53.2 ns per loop
Replace function calls with loops
The last major improvement I can think of would be to replace df.apply() with a for loop to avoid calling any function 200M times (or however large A is).
Hopefully these ideas help.
Original, expressive solution, though not memory efficient:
In [5]: CC = AA.merge(B, left_on='a', right_index=True)
In [6]: CC[CC.ts_x <= CC.ts_y].groupby(['a', 'b']).first()
Out[6]:
ts_x ts_y
a b
a x 4 4
y 6 7
z 5 5
b x 4 7
z 5 7
c y 6 8

Another option using numpy's boolean array notation, which seems an order of magnitude faster than the original (in this tiny example, and I suspect it'll be even better on larger datasets...):
I suspect this is largely because picking the minimum is much faster task than sorting.
In [11]: AA.apply(lambda row: (B.ts.values[(B.ts.values >= row['ts']) &
(B.index == row['a'])].min()),
axis=1)
Out[11]:
0 4
1 7
2 5
3 7
4 7
5 8
In [12]: %timeit AA.apply(lambda row: (B.ts.values[(B.ts.values >= row['ts']) &(B.index == row['a'])].min()), axis=1)
1000 loops, best of 3: 1.46 ms per loop
This seems like the fastest method if you were to simply adding this as a column to AA.
If you were creating a new dataframe as in you example - trying to test this "fairly" - it is slower (but still twice as fast as the original):
In [13]: %timeit C = AA.apply(lambda row: (row[0], row[1], B.ix[row[0]].irow(np.searchsorted(B.ts[row[0]], row[2]))), axis=1).set_index(['a', 'b'])
100 loops, best of 3: 10.3 ms per loop
In [14]: %timeit C = AA.apply(lambda row: (row[0], x[1], B.ts.values[(B.ts.values >= row['ts']) & (B.index == row['a'])].min()), axis=1)
100 loops, best of 3: 4.32 ms per loop

Related

pandas unexpected behaviour of .loc vs .at [duplicate]

I've been exploring how to optimize my code and ran across pandas .at method. Per the documentation
Fast label-based scalar accessor
Similarly to loc, at provides label based scalar lookups. You can also set using these indexers.
So I ran some samples:
Setup
import pandas as pd
import numpy as np
from string import letters, lowercase, uppercase
lt = list(letters)
lc = list(lowercase)
uc = list(uppercase)
def gdf(rows, cols, seed=None):
"""rows and cols are what you'd pass
to pd.MultiIndex.from_product()"""
gmi = pd.MultiIndex.from_product
df = pd.DataFrame(index=gmi(rows), columns=gmi(cols))
np.random.seed(seed)
df.iloc[:, :] = np.random.rand(*df.shape)
return df
seed = [3, 1415]
df = gdf([lc, uc], [lc, uc], seed)
print df.head().T.head().T
df looks like:
a
A B C D E
a A 0.444939 0.407554 0.460148 0.465239 0.462691
B 0.032746 0.485650 0.503892 0.351520 0.061569
C 0.777350 0.047677 0.250667 0.602878 0.570528
D 0.927783 0.653868 0.381103 0.959544 0.033253
E 0.191985 0.304597 0.195106 0.370921 0.631576
Lets use .at and .loc and ensure I get the same thing
print "using .loc", df.loc[('a', 'A'), ('c', 'C')]
print "using .at ", df.at[('a', 'A'), ('c', 'C')]
using .loc 0.37374090276
using .at 0.37374090276
Test speed using .loc
%%timeit
df.loc[('a', 'A'), ('c', 'C')]
10000 loops, best of 3: 180 µs per loop
Test speed using .at
%%timeit
df.at[('a', 'A'), ('c', 'C')]
The slowest run took 6.11 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8 µs per loop
This looks to be a huge speed increase. Even at the caching stage 6.11 * 8 is a lot faster than 180
Question
What are the limitations of .at? I'm motivated to use it. The documentation says it's similar to .loc but it doesn't behave similarly. Example:
# small df
sdf = gdf([lc[:2]], [uc[:2]], seed)
print sdf.loc[:, :]
A B
a 0.444939 0.407554
b 0.460148 0.465239
where as print sdf.at[:, :] results in TypeError: unhashable type
So obviously not the same even if the intent is to be similar.
That said, who can provide guidance on what can and cannot be done with the .at method?
Update: df.get_value is deprecated as of version 0.21.0. Using df.at or df.iat is the recommended method going forward.
df.at can only access a single value at a time.
df.loc can select multiple rows and/or columns.
Note that there is also df.get_value, which may be even quicker at accessing single values:
In [25]: %timeit df.loc[('a', 'A'), ('c', 'C')]
10000 loops, best of 3: 187 µs per loop
In [26]: %timeit df.at[('a', 'A'), ('c', 'C')]
100000 loops, best of 3: 8.33 µs per loop
In [35]: %timeit df.get_value(('a', 'A'), ('c', 'C'))
100000 loops, best of 3: 3.62 µs per loop
Under the hood, df.at[...] calls df.get_value, but it also does some type checking on the keys.
As you asked about the limitations of .at, here is one thing I recently ran into (using pandas 0.22). Let's use the example from the documentation:
df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]], index=[4, 5, 6], columns=['A', 'B', 'C'])
df2 = df.copy()
A B C
4 0 2 3
5 0 4 1
6 10 20 30
If I now do
df.at[4, 'B'] = 100
the result looks as expected
A B C
4 0 100 3
5 0 4 1
6 10 20 30
However, when I try to do
df.at[4, 'C'] = 10.05
it seems that .at tries to conserve the datatype (here: int):
A B C
4 0 100 10
5 0 4 1
6 10 20 30
That seems to be a difference to .loc:
df2.loc[4, 'C'] = 10.05
yields the desired
A B C
4 0 2 10.05
5 0 4 1.00
6 10 20 30.00
The risky thing in the example above is that it happens silently (the conversion from float to int). When one tries the same with strings it will throw an error:
df.at[5, 'A'] = 'a_string'
ValueError: invalid literal for int() with base 10: 'a_string'
It will work, however, if one uses a string on which int() actually works as noted by #n1k31t4 in the comments, e.g.
df.at[5, 'A'] = '123'
A B C
4 0 2 3
5 123 4 1
6 10 20 30
Adding to the above, Pandas documentation for the at function states:
Access a single value for a row/column label pair.
Similar to loc, in that both provide label-based lookups. Use at if
you only need to get or set a single value in a DataFrame or Series.
For setting data loc and at are similar, for example:
df = pd.DataFrame({'A': [1,2,3], 'B': [11,22,33]}, index=[0,0,1])
Both loc and at will produce the same result
df.at[0, 'A'] = [101,102]
df.loc[0, 'A'] = [101,102]
A B
0 101 11
0 102 22
1 3 33
df.at[0, 'A'] = 103
df.loc[0, 'A'] = 103
A B
0 103 11
0 103 22
1 3 33
Also, for accessing a single value, both are the same
df.loc[1, 'A'] # returns a single value (<class 'numpy.int64'>)
df.at[1, 'A'] # returns a single value (<class 'numpy.int64'>)
3
However, when matching multiple values, loc will return a group of rows/cols from the DataFrame while at will return an array of values
df.loc[0, 'A'] # returns a Series (<class 'pandas.core.series.Series'>)
0 103
0 103
Name: A, dtype: int64
df.at[0, 'A'] # returns array of values (<class 'numpy.ndarray'>)
array([103, 103])
And more so, loc can be used to match a group of row/cols and can be given only an index, while at must receive the column
df.loc[0] # returns a DataFrame view (<class 'pandas.core.frame.DataFrame'>)
A B
0 103 11
0 103 22
# df.at[0] # ERROR: must receive column
.at is an optimized data access method compared to .loc .
.loc of a data frame selects all the elements located by indexed_rows and labeled_columns as given in its argument. Instead, .at selects particular element of a data frame positioned at the given indexed_row and labeled_column.
Also, .at takes one row and one column as input argument, whereas .loc may take multiple rows and columns. Output using .at is a single element and using .loc maybe a Series or a DataFrame.

Why is `groupby` with `as_index=False` even slower than `groupby` with `reset_index`

I just recently came across this strange Pandas behavior with groupby.
I have this dataframe:
>>> df = pd.DataFrame({'a': [1, 2, 3, 1, 2, 3], 'b': [4, 5, 6, 7, 8, 9]})
>>> df
a b
0 1 4
1 2 5
2 3 6
3 1 7
4 2 8
5 3 9
>>>
And I want to groupby the column a and sum the column b.
Normal groupby would have the column as index, but as_index=False won't:
>>> df.groupby('a')['b'].sum()
a
1 11
2 13
3 15
Name: b, dtype: int64
>>>
But when I time them:
>>> timeit(lambda: df.groupby('a')['b'].sum(), number=1000)
0.5426476000000093
>>> timeit(lambda: df.groupby('a')['b'].sum(), number=10000)
4.912795499999902
>>> timeit(lambda: df.groupby('a', as_index=False)['b'].sum(), number=1000)
1.419923899999958
>>> timeit(lambda: df.groupby('a', as_index=False)['b'].sum(), number=10000)
11.907147600000144
>>>
You can see that for some reason, groupby with as_index=False is 2.75x slower!
Not just that! It's even slower than reset_index!
>>> timeit(lambda: df.groupby('a', as_index=False)['b'].sum(), number=1000)
1.419923899999958
>>> timeit(lambda: df.groupby('a', as_index=False)['b'].sum(), number=10000)
11.907147600000144
>>> timeit(lambda: df.groupby('a')['b'].sum().reset_index(), number=1000)
1.0641113000001496
>>> timeit(lambda: df.groupby('a')['b'].sum().reset_index(), number=10000)
10.01520289999985
>>>
And reset_index obviously also gives the same output as as_index=False:
>>> df.groupby('a')['b'].sum().reset_index()
a b
0 1 11
1 2 13
2 3 15
>>>
as_index=False:
>>> df.groupby('a', as_index=False)['b'].sum()
a b
0 1 11
1 2 13
2 3 15
>>>
I can understand that as_index=False might be slower, but not this much slower... Also the main thing is that I can't wrap my head around that why is reset_index faster? That's an extra function...
Why is this? What is the implementation of as_index?
I am really surprised, I even thought that it's very possible that as_index=False would be faster than as_index=True, since it doesn't need a column to be set as the index.
But it's the opposite, it's actually as_index=True being 2.75 times faster... And even reset_index being faster than as_index=False.
If this is the case why doesn't as_index=False also just simply use reset_index?
as_index=True is the default as the grouper uses internally an index and resets it if as_index is set toFalse:
cf. core.groupby.py source
if not self.as_index:
self._insert_inaxis_grouper_inplace(result)
The time difference between True/False is actually minimal, you should use a larger dataframe to test the speed.
Here on 600k rows:
True: 30.7 ms ± 3.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
False: 32.5 ms ± 2.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each
The difference is thus not proportional (2 times slower), but rather fixed (2.5 ms slower), which is less a burden.
Now as to why the source doesn't use reset_index, well pandas does a lot of things internally, I am only guessing here as the code is complex, but there are likely many checks in place that do more things than just resetting the index.
Assign to #mozway's answer, the _insert_inaxis_grouper_inplace function is:
def _insert_inaxis_grouper_inplace(self, result: DataFrame) -> None:
# zip in reverse so we can always insert at loc 0
columns = result.columns
for name, lev, in_axis in zip(
reversed(self.grouper.names),
reversed(self.grouper.get_group_levels()),
reversed([grp.in_axis for grp in self.grouper.groupings]),
):
# GH #28549
# When using .apply(-), name will be in columns already
if in_axis and name not in columns:
result.insert(0, name, lev)
As you can see, it's using insert, not reset index.
So I suspect if it uses:
def _insert_inaxis_grouper_inplace(self, result: DataFrame) -> None:
# zip in reverse so we can always insert at loc 0
columns = result.columns
for name, lev, in_axis in zip(
reversed(self.grouper.names),
reversed(self.grouper.get_group_levels()),
reversed([grp.in_axis for grp in self.grouper.groupings]),
):
# GH #28549
# When using .apply(-), name will be in columns already
if in_axis and name not in columns:
result = result.reset_index()
Not completely sure about the setup though, but I expect the above to work...
Then it probably would be faster.

How to speed LabelEncoder up recoding a categorical variable into integers

I have a large csv with two strings per row in this form:
g,k
a,h
c,i
j,e
d,i
i,h
b,b
d,d
i,a
d,h
I read in the first two columns and recode the strings to integers as follows:
import pandas as pd
df = pd.read_csv("test.csv", usecols=[0,1], prefix="ID_", header=None)
from sklearn.preprocessing import LabelEncoder
# Initialize the LabelEncoder.
le = LabelEncoder()
le.fit(df.values.flat)
# Convert to digits.
df = df.apply(le.transform)
This code is from https://stackoverflow.com/a/39419342/2179021.
The code works very well but is slow when df is large. I timed each step and the result was surprising to me.
pd.read_csv takes about 40 seconds.
le.fit(df.values.flat) takes about 30 seconds
df = df.apply(le.transform) takes about 250 seconds.
Is there any way to speed up this last step? It feels like it should be the fastest step of them all!
More timings for the recoding step on a computer with 4GB of RAM
The answer below by maxymoo is fast but doesn't give the right answer. Taking the example csv from the top of the question, it translates it to:
0 1
0 4 6
1 0 4
2 2 5
3 6 3
4 3 5
5 5 4
6 1 1
7 3 2
8 5 0
9 3 4
Notice that 'd' is mapped to 3 in the first column but 2 in the second.
I tried the solution from https://stackoverflow.com/a/39356398/2179021 and get the following.
df = pd.DataFrame({'ID_0':np.random.randint(0,1000,1000000), 'ID_1':np.random.randint(0,1000,1000000)}).astype(str)
df.info()
memory usage: 7.6MB
%timeit x = (df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack())
1 loops, best of 3: 1.7 s per loop
Then I increased the dataframe size by a factor of 10.
df = pd.DataFrame({'ID_0':np.random.randint(0,1000,10000000), 'ID_1':np.random.randint(0,1000,10000000)}).astype(str)
df.info()
memory usage: 76.3+ MB
%timeit x = (df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack())
MemoryError Traceback (most recent call last)
This method appears to use so much RAM trying to translate this relatively small dataframe that it crashes.
I also timed LabelEncoder with the larger dataset with 10 millions rows. It runs without crashing but the fit line alone took 50 seconds. The df.apply(le.transform) step took about 80 seconds.
How can I:
Get something of roughly the speed of maxymoo's answer and roughly the memory usage of LabelEncoder but that gives the right answer when the dataframe has two columns.
Store the mapping so that I can reuse it for different data (as in the way LabelEncoder allows me to do)?
It looks like it will be much faster to use the pandas category datatype; internally this uses a hash table rather whereas LabelEncoder uses a sorted search:
In [87]: df = pd.DataFrame({'ID_0':np.random.randint(0,1000,1000000),
'ID_1':np.random.randint(0,1000,1000000)}).astype(str)
In [88]: le.fit(df.values.flat)
%time x = df.apply(le.transform)
CPU times: user 6.28 s, sys: 48.9 ms, total: 6.33 s
Wall time: 6.37 s
In [89]: %time x = df.apply(lambda x: x.astype('category').cat.codes)
CPU times: user 301 ms, sys: 28.6 ms, total: 330 ms
Wall time: 331 ms
EDIT: Here is a custom transformer class that that you could use (you probably won't see this in an official scikit-learn release since the maintainers don't want to have pandas as a dependency)
import pandas as pd
from pandas.core.nanops import unique1d
from sklearn.base import BaseEstimator, TransformerMixin
class PandasLabelEncoder(BaseEstimator, TransformerMixin):
def fit(self, y):
self.classes_ = unique1d(y)
return self
def transform(self, y):
s = pd.Series(y).astype('category', categories=self.classes_)
return s.cat.codes
I tried this with the DataFrame:
In [xxx]: import string
In [xxx]: letters = np.array([c for c in string.ascii_lowercase])
In [249]: df = pd.DataFrame({'ID_0': np.random.choice(letters, 10000000), 'ID_1':np.random.choice(letters, 10000000)})
It looks like this:
In [261]: df.head()
Out[261]:
ID_0 ID_1
0 v z
1 i i
2 d n
3 z r
4 x x
In [262]: df.shape
Out[262]: (10000000, 2)
So, 10 million rows. Locally, my timings are:
In [257]: % timeit le.fit(df.values.flat)
1 loops, best of 3: 17.2 s per loop
In [258]: % timeit df2 = df.apply(le.transform)
1 loops, best of 3: 30.2 s per loop
Then I made a dict mapping letters to numbers and used pandas.Series.map:
In [248]: letters = np.array([l for l in string.ascii_lowercase])
In [263]: d = dict(zip(letters, range(26)))
In [273]: %timeit for c in df.columns: df[c] = df[c].map(d)
1 loops, best of 3: 1.12 s per loop
In [274]: df.head()
Out[274]:
ID_0 ID_1
0 21 25
1 8 8
2 3 13
3 25 17
4 23 23
So that might be an option. The dict just needs to have all of the values that occur in the data.
EDIT: The OP asked what timing I have for that second option, with categories. This is what I get:
In [40]: %timeit x=df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack()
1 loops, best of 3: 13.5 s per loop
EDIT: per the 2nd comment:
In [45]: %timeit uniques = np.sort(pd.unique(df.values.ravel()))
1 loops, best of 3: 933 ms per loop
In [46]: %timeit dfc = df.apply(lambda x: x.astype('category', categories=uniques))
1 loops, best of 3: 1.35 s per loop
I would like to point out an alternate solution that should serve many readers well. Although I prefer to have a known set of IDs, it is not always necessary if this is strictly one-way remapping.
Instead of
df[c] = df[c].apply(le.transform)
or
dict_table = {val: i for i, val in enumerate(uniques)}
df[c] = df[c].map(dict_table)
or (checkout _encode() and _encode_python() in sklearn source code, which I assume is faster on average than other methods mentioned)
df[c] = np.array([dict_table[v] for v in df[c].values])
you can instead do
df[c] = df[c].apply(hash)
Pros: much faster, less memory needed, no training, hashes can be reduced to smaller representations (more collisions by casting dtype).
Cons: gives funky numbers, can have collisions (not guaranteed to be perfectly unique), can't guarantee the function won't change with a new version of python
Note that the secure hash functions will have fewer collisions at the cost of speed.
Example of when to use: You have somewhat long strings that are mostly unique and the data set is huge. Most importantly, you don't care about rare hash collisions even though it can be a source of noise in your model's predictions.
I've tried all the methods above and my workload was taking about 90 minutes to learn the encoding from training (1M rows and 600 features) and reapply that to several test sets, while also dealing with new values. The hash method brought it down to a few minutes and I don't need to save any model.

pandas .at versus .loc

I've been exploring how to optimize my code and ran across pandas .at method. Per the documentation
Fast label-based scalar accessor
Similarly to loc, at provides label based scalar lookups. You can also set using these indexers.
So I ran some samples:
Setup
import pandas as pd
import numpy as np
from string import letters, lowercase, uppercase
lt = list(letters)
lc = list(lowercase)
uc = list(uppercase)
def gdf(rows, cols, seed=None):
"""rows and cols are what you'd pass
to pd.MultiIndex.from_product()"""
gmi = pd.MultiIndex.from_product
df = pd.DataFrame(index=gmi(rows), columns=gmi(cols))
np.random.seed(seed)
df.iloc[:, :] = np.random.rand(*df.shape)
return df
seed = [3, 1415]
df = gdf([lc, uc], [lc, uc], seed)
print df.head().T.head().T
df looks like:
a
A B C D E
a A 0.444939 0.407554 0.460148 0.465239 0.462691
B 0.032746 0.485650 0.503892 0.351520 0.061569
C 0.777350 0.047677 0.250667 0.602878 0.570528
D 0.927783 0.653868 0.381103 0.959544 0.033253
E 0.191985 0.304597 0.195106 0.370921 0.631576
Lets use .at and .loc and ensure I get the same thing
print "using .loc", df.loc[('a', 'A'), ('c', 'C')]
print "using .at ", df.at[('a', 'A'), ('c', 'C')]
using .loc 0.37374090276
using .at 0.37374090276
Test speed using .loc
%%timeit
df.loc[('a', 'A'), ('c', 'C')]
10000 loops, best of 3: 180 µs per loop
Test speed using .at
%%timeit
df.at[('a', 'A'), ('c', 'C')]
The slowest run took 6.11 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8 µs per loop
This looks to be a huge speed increase. Even at the caching stage 6.11 * 8 is a lot faster than 180
Question
What are the limitations of .at? I'm motivated to use it. The documentation says it's similar to .loc but it doesn't behave similarly. Example:
# small df
sdf = gdf([lc[:2]], [uc[:2]], seed)
print sdf.loc[:, :]
A B
a 0.444939 0.407554
b 0.460148 0.465239
where as print sdf.at[:, :] results in TypeError: unhashable type
So obviously not the same even if the intent is to be similar.
That said, who can provide guidance on what can and cannot be done with the .at method?
Update: df.get_value is deprecated as of version 0.21.0. Using df.at or df.iat is the recommended method going forward.
df.at can only access a single value at a time.
df.loc can select multiple rows and/or columns.
Note that there is also df.get_value, which may be even quicker at accessing single values:
In [25]: %timeit df.loc[('a', 'A'), ('c', 'C')]
10000 loops, best of 3: 187 µs per loop
In [26]: %timeit df.at[('a', 'A'), ('c', 'C')]
100000 loops, best of 3: 8.33 µs per loop
In [35]: %timeit df.get_value(('a', 'A'), ('c', 'C'))
100000 loops, best of 3: 3.62 µs per loop
Under the hood, df.at[...] calls df.get_value, but it also does some type checking on the keys.
As you asked about the limitations of .at, here is one thing I recently ran into (using pandas 0.22). Let's use the example from the documentation:
df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]], index=[4, 5, 6], columns=['A', 'B', 'C'])
df2 = df.copy()
A B C
4 0 2 3
5 0 4 1
6 10 20 30
If I now do
df.at[4, 'B'] = 100
the result looks as expected
A B C
4 0 100 3
5 0 4 1
6 10 20 30
However, when I try to do
df.at[4, 'C'] = 10.05
it seems that .at tries to conserve the datatype (here: int):
A B C
4 0 100 10
5 0 4 1
6 10 20 30
That seems to be a difference to .loc:
df2.loc[4, 'C'] = 10.05
yields the desired
A B C
4 0 2 10.05
5 0 4 1.00
6 10 20 30.00
The risky thing in the example above is that it happens silently (the conversion from float to int). When one tries the same with strings it will throw an error:
df.at[5, 'A'] = 'a_string'
ValueError: invalid literal for int() with base 10: 'a_string'
It will work, however, if one uses a string on which int() actually works as noted by #n1k31t4 in the comments, e.g.
df.at[5, 'A'] = '123'
A B C
4 0 2 3
5 123 4 1
6 10 20 30
Adding to the above, Pandas documentation for the at function states:
Access a single value for a row/column label pair.
Similar to loc, in that both provide label-based lookups. Use at if
you only need to get or set a single value in a DataFrame or Series.
For setting data loc and at are similar, for example:
df = pd.DataFrame({'A': [1,2,3], 'B': [11,22,33]}, index=[0,0,1])
Both loc and at will produce the same result
df.at[0, 'A'] = [101,102]
df.loc[0, 'A'] = [101,102]
A B
0 101 11
0 102 22
1 3 33
df.at[0, 'A'] = 103
df.loc[0, 'A'] = 103
A B
0 103 11
0 103 22
1 3 33
Also, for accessing a single value, both are the same
df.loc[1, 'A'] # returns a single value (<class 'numpy.int64'>)
df.at[1, 'A'] # returns a single value (<class 'numpy.int64'>)
3
However, when matching multiple values, loc will return a group of rows/cols from the DataFrame while at will return an array of values
df.loc[0, 'A'] # returns a Series (<class 'pandas.core.series.Series'>)
0 103
0 103
Name: A, dtype: int64
df.at[0, 'A'] # returns array of values (<class 'numpy.ndarray'>)
array([103, 103])
And more so, loc can be used to match a group of row/cols and can be given only an index, while at must receive the column
df.loc[0] # returns a DataFrame view (<class 'pandas.core.frame.DataFrame'>)
A B
0 103 11
0 103 22
# df.at[0] # ERROR: must receive column
.at is an optimized data access method compared to .loc .
.loc of a data frame selects all the elements located by indexed_rows and labeled_columns as given in its argument. Instead, .at selects particular element of a data frame positioned at the given indexed_row and labeled_column.
Also, .at takes one row and one column as input argument, whereas .loc may take multiple rows and columns. Output using .at is a single element and using .loc maybe a Series or a DataFrame.

Pandas dataframe: return row AND column of maximum value(s)

I have a dataframe in which all values are of the same variety (e.g. a correlation matrix -- but where we expect a unique maximum). I'd like to return the row and the column of the maximum of this matrix.
I can get the max across rows or columns by changing the first argument of
df.idxmax()
however I haven't found a suitable way to return the row/column index of the max of the whole dataframe.
For example, I can do this in numpy:
>>>npa = np.array([[1,2,3],[4,9,5],[6,7,8]])
>>>np.where(npa == np.amax(npa))
(array([1]), array([1]))
But when I try something similar in pandas:
>>>df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
>>>df.where(df == df.max().max())
a b c
d NaN NaN NaN
e NaN 9 NaN
f NaN NaN NaN
At a second level, what I acutally want to do is to return the rows and columns of the top n values, e.g. as a Series.
E.g. for the above I'd like a function which does:
>>>topn(df,3)
b e
c f
b f
dtype: object
>>>type(topn(df,3))
pandas.core.series.Series
or even just
>>>topn(df,3)
(['b','c','b'],['e','f','f'])
a la numpy.where()
I figured out the first part:
npa = df.as_matrix()
cols,indx = np.where(npa == np.amax(npa))
([df.columns[c] for c in cols],[df.index[c] for c in indx])
Now I need a way to get the top n. One naive idea is to copy the array, and iteratively replace the top values with NaN grabbing index as you go. Seems inefficient. Is there a better way to get the top n values of a numpy array? Fortunately, as shown here there is, through argpartition, but we have to use flattened indexing.
def topn(df,n):
npa = df.as_matrix()
topn_ind = np.argpartition(npa,-n,None)[-n:] #flatend ind, unsorted
topn_ind = topn_ind[np.argsort(npa.flat[topn_ind])][::-1] #arg sort in descending order
cols,indx = np.unravel_index(topn_ind,npa.shape,'F') #unflatten, using column-major ordering
return ([df.columns[c] for c in cols],[df.index[i] for i in indx])
Trying this on the example:
>>>df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
>>>topn(df,3)
(['b', 'c', 'b'], ['e', 'f', 'f'])
As desired. Mind you the sorting was not originally asked for, but provides little overhead if n is not large.
what you want to use is stack
df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
df = df.stack()
df.sort(ascending=False)
df.head(4)
e b 9
f c 8
b 7
a 6
dtype: int64
I guess for what you are trying to do a DataFrame might not be the best choice, since the idea of the columns in the DataFrame is to hold independent data.
>>> def topn(df,n):
# pull the data ouit of the DataFrame
# and flatten it to an array
vals = df.values.flatten(order='F')
# next we sort the array and store the sort mask
p = np.argsort(vals)
# create two arrays with the column names and indexes
# in the same order as vals
cols = np.array([[col]*len(df.index) for col in df.columns]).flatten()
idxs = np.array([list(df.index) for idx in df.index]).flatten()
# sort and return cols, and idxs
return cols[p][:-(n+1):-1],idxs[p][:-(n+1):-1]
>>> topn(df,3)
(array(['b', 'c', 'b'],
dtype='|S1'),
array(['e', 'f', 'f'],
dtype='|S1'))
>>> %timeit(topn(df,3))
10000 loops, best of 3: 29.9 µs per loop
watsonics solution takes slightly less
%timeit(topn(df,3))
10000 loops, best of 3: 24.6 µs per loop
but way faster than stack
def topStack(df,n):
df = df.stack()
df.sort(ascending=False)
return df.head(n)
%timeit(topStack(df,3))
1000 loops, best of 3: 1.91 ms per loop

Categories

Resources