The following table contains some keys and values:
import numpy as np
import pandas as pd

N = 100
tbl = pd.DataFrame({'key': np.random.randint(0, 10, N),
                    'y': np.random.rand(N),
                    'z': np.random.rand(N)})
I would like to obtain a DataFrame in which each row contains a key and all the fields that correspond to the minimal value of a specified field.
Since the original table is very large, I'm interested in the most efficient way.
NOTE: getting the minimal value of a field is simple:
tbl.groupby('key').agg(pd.Series.min)
But this takes the minimum of every field independently. I would like to know the minimum value of y and the z value that corresponds to it.
Below I post my naive approach as an answer to my own question, but I suspect there are better ways.
Here is a straightforward approach:
gr = tbl.groupby('key')
def take_min_y(t):
    ix = t.y.idxmin()   # index label of the minimal y within the group
    return t.loc[[ix]]
tbl_mins = gr.apply(take_min_y)
Is there a better way?
Based on your update, I believe the following is what you want:
In [107]:
tbl.iloc[gr['y'].agg(pd.Series.idxmin)]
Out[107]:
key y z
47 0 0.094841 0.221435
26 1 0.062200 0.748082
45 2 0.032497 0.160199
28 3 0.002242 0.064829
73 4 0.122438 0.723844
75 5 0.128193 0.638933
79 6 0.071833 0.952624
86 7 0.058974 0.113317
36 8 0.068757 0.611111
12 9 0.082604 0.271268
idxmin returns the index of the min value; we can then use this to filter the original dataframe and select those rows.
Timings show this method is approx 7 times faster:
In [108]:
%timeit tbl.iloc[gr['y'].agg(pd.Series.idxmin)]

def take_min_y(t):
    ix = t.y.idxmin()
    return t.loc[[ix]]

%timeit tbl_mins = gr.apply(take_min_y)
1000 loops, best of 3: 1.08 ms per loop
100 loops, best of 3: 7.06 ms per loop
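As a side note (my addition, not part of the answer above): in more recent pandas the grouped Series exposes idxmin directly, and since idxmin returns index labels, loc is the safer lookup when the index is not a default RangeIndex. An assumed-equivalent form:
# Hedged sketch: same result as the iloc/agg version for a default integer index.
tbl_mins = tbl.loc[gr['y'].idxmin()]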
This seems like a ridiculously easy question... but I'm not seeing the easy answer I was expecting.
So, how do I get the value at an nth row of a given column in Pandas? (I am particularly interested in the first row, but would be interested in a more general practice as well).
For example, let's say I want to pull the 1.2 value in Btime as a variable.
What's the right way to do this?
>>> df_test
ATime X Y Z Btime C D E
0 1.2 2 15 2 1.2 12 25 12
1 1.4 3 12 1 1.3 13 22 11
2 1.5 1 10 6 1.4 11 20 16
3 1.6 2 9 10 1.7 12 29 12
4 1.9 1 1 9 1.9 11 21 19
5 2.0 0 0 0 2.0 8 10 11
6 2.4 0 0 0 2.4 10 12 15
To select the ith row, use iloc:
In [31]: df_test.iloc[0]
Out[31]:
ATime 1.2
X 2.0
Y 15.0
Z 2.0
Btime 1.2
C 12.0
D 25.0
E 12.0
Name: 0, dtype: float64
To select the ith value in the Btime column you could use:
In [30]: df_test['Btime'].iloc[0]
Out[30]: 1.2
There is a difference between df_test['Btime'].iloc[0] (recommended) and df_test.iloc[0]['Btime']:
DataFrames store data in column-based blocks (where each block has a single
dtype). If you select by column first, a view can be returned (which is
quicker than returning a copy) and the original dtype is preserved. In contrast,
if you select by row first, and if the DataFrame has columns of different
dtypes, then Pandas copies the data into a new Series of object dtype. So
selecting columns is a bit faster than selecting rows. Thus, although
df_test.iloc[0]['Btime'] works, df_test['Btime'].iloc[0] is a little bit
more efficient.
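A minimal sketch (my addition) illustrating that dtype point with an assumed mixed-dtype frame; df_test above is all numeric, so the effect is easier to see with a string column:
import pandas as pd

mixed = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})  # hypothetical mixed-dtype frame
print(mixed['a'].dtype)     # int64 -- selecting the column first keeps its dtype
print(mixed.iloc[0].dtype)  # object -- selecting the row first upcasts across dtypes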
There is a big difference between the two when it comes to assignment.
df_test['Btime'].iloc[0] = x affects df_test, but df_test.iloc[0]['Btime']
may not. See below for an explanation of why. Because a subtle difference in
the order of indexing makes a big difference in behavior, it is better to use single indexing assignment:
df.iloc[0, df.columns.get_loc('Btime')] = x
df.iloc[0, df.columns.get_loc('Btime')] = x (recommended):
The recommended way to assign new values to a
DataFrame is to avoid chained indexing, and instead use the method shown by
andrew,
df.loc[df.index[n], 'Btime'] = x
or
df.iloc[n, df.columns.get_loc('Btime')] = x
The latter method is a bit faster, because df.loc has to convert the row and column labels to
positional indices, so there is a little less conversion necessary if you use
df.iloc instead.
df['Btime'].iloc[0] = x works, but is not recommended:
Although this works, it is taking advantage of the way DataFrames are currently implemented. There is no guarantee that Pandas has to work this way in the future. In particular, it is taking advantage of the fact that (currently) df['Btime'] always returns a
view (not a copy) so df['Btime'].iloc[n] = x can be used to assign a new value
at the nth location of the Btime column of df.
Since Pandas makes no explicit guarantees about when indexers return a view versus a copy, assignments that use chained indexing generally raise a SettingWithCopyWarning, even though in this case the assignment succeeds in modifying df:
In [22]: df = pd.DataFrame({'foo':list('ABC')}, index=[0,2,1])
In [24]: df['bar'] = 100
In [25]: df['bar'].iloc[0] = 99
/home/unutbu/data/binky/bin/ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
In [26]: df
Out[26]:
foo bar
0 A 99 <-- assignment succeeded
2 B 100
1 C 100
df.iloc[0]['Btime'] = x does not work:
In contrast, assignment with df.iloc[0]['bar'] = 123 does not work because df.iloc[0] is returning a copy:
In [66]: df.iloc[0]['bar'] = 123
/home/unutbu/data/binky/bin/ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
In [67]: df
Out[67]:
foo bar
0 A 99 <-- assignment failed
2 B 100
1 C 100
Warning: I had previously suggested df_test.ix[i, 'Btime']. But this is not guaranteed to give you the ith value since ix tries to index by label before trying to index by position. So if the DataFrame has an integer index which is not in sorted order starting at 0, then using ix[i] will return the row labeled i rather than the ith row. For example,
In [1]: df = pd.DataFrame({'foo':list('ABC')}, index=[0,2,1])
In [2]: df
Out[2]:
foo
0 A
2 B
1 C
In [4]: df.ix[1, 'foo']
Out[4]: 'C'
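For reference (my addition): ix has since been removed from pandas, and the purely positional and purely label-based lookups on the same frame would be:
df.iloc[1, 0]     # positional: the second row -> 'B'
df.loc[1, 'foo']  # label-based: the row labeled 1 -> 'C'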
Note that the answer from @unutbu will be correct until you want to set the value to something new; then it will not work if your dataframe is a view.
In [4]: df = pd.DataFrame({'foo':list('ABC')}, index=[0,2,1])
In [5]: df['bar'] = 100
In [6]: df['bar'].iloc[0] = 99
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas-0.16.0_19_g8d2818e-py2.7-macosx-10.9-x86_64.egg/pandas/core/indexing.py:118: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
Another approach that will consistently work with both setting and getting is:
In [7]: df.loc[df.index[0], 'foo']
Out[7]: 'A'
In [8]: df.loc[df.index[0], 'bar'] = 99
In [9]: df
Out[9]:
foo bar
0 A 99
2 B 100
1 C 100
Another way to do this:
first_value = df['Btime'].values[0]
This way seems to be faster than using .iloc:
In [1]: %timeit -n 1000 df['Btime'].values[20]
5.82 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [2]: %timeit -n 1000 df['Btime'].iloc[20]
29.2 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
df.iloc[0].head(1) - only the first field from the entire first row.
df.iloc[0] - the entire first row, as a Series.
More generally, if you want to pick up the first N rows of column J from a pandas dataframe, the best way to do this is:
data = dataframe.iloc[0:N, J]
To access a single value you can use the method iat that is much faster than iloc:
df['Btime'].iat[0]
You can also use the method take, which returns a one-element Series:
df['Btime'].take([0])
.iat and .at are the methods for getting and setting single values and are much faster than .iloc and .loc. Mykola Zotko pointed this out in their answer, but they did not use .iat to its full extent.
When we can use .iat or .at, we should only have to index into the dataframe once.
This is not great:
df['Btime'].iat[0]
It is not ideal because the 'Btime' column was first selected as a series, then .iat was used to index into that series.
These two options are the best:
Using zero-indexed positions:
df.iat[0, 4] # get the value in the zeroth row, and 4th column
Using Labels:
df.at[0, 'Btime'] # get the value where the index label is 0 and the column name is "Btime".
Both methods return the value of 1.2.
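Since .at and .iat also handle setting, here is a small example of assignment with them (my addition, using an assumed new value of 1.25):
df.at[0, 'Btime'] = 1.25                          # set by label
df.iat[0, df.columns.get_loc('Btime')] = 1.25     # set by position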
To get, for example, the value from column 'test' in the first row, you can use
df[['test']].values[0][0]
since df[['test']].values[0] only gives back an array.
Another way of getting the first row while preserving the index (this requires a DatetimeIndex):
x = df.first('d') # Returns the first day. '3d' gives the first three days.
According to pandas docs, at is the fastest way to access a scalar value such as the use case in the OP (already suggested by Alex on this page).
Building upon Alex's answer, because dataframes don't necessarily have a range index it might be more complete to index df.index (since dataframe indexes are built on numpy arrays, you can index them like an array) or call get_loc() on columns to get the integer location of a column.
df.at[df.index[0], 'Btime']
df.iat[0, df.columns.get_loc('Btime')]
A common problem is using a boolean mask to get a single value but ending up with a one-element Series (a value with an index); e.g.:
0 1.2
Name: Btime, dtype: float64
you can use squeeze() to get the scalar value, i.e.
df.loc[df['Btime']<1.3, 'Btime'].squeeze()
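One caveat worth adding (my note, not the original answer's): squeeze() only returns a scalar when the mask matches exactly one row; with several matches you still get a Series back. Assuming df is the example frame from the question:
df.loc[df['Btime'] < 1.3, 'Btime'].squeeze()  # exactly one match -> scalar 1.2
df.loc[df['Btime'] < 1.5, 'Btime'].squeeze()  # three matches -> still a Series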
I have a Pandas dataframe X with two columns, 'report_me' and 'n'. I want to get a list (or Series) that contains, for each element of X where report_me is True, the sum of the n values for the previous two elements of the dataframe (regardless of their report_me values). For instance, if the dataframe is:
X = pd.DataFrame({"report_me":[False,False,False,True,False,
False,True,False,False,False],
"n":range(10)})
then I want the result (3, 9).
One way to do this is:
sums = X['n'].shift(1) + X['n'].shift(2)
display(sums[X["report_me"]])
but this is slow because it computes the values of sums for all the indices, not just the ones that are going to be reported. One could also try filtering by report_me first:
reported = df[df["report_me"]]
display(reported["n"].shift(1) + reported["n"].shift(2))
but this gives the wrong answer because now you are getting rid of the previous values that you would be using to compute sums. Is there a way to do this that doesn't do unnecessary work?
If report_me is sparse, you might gain some speed using a numpy solution as follows:
# find the index where report_me is True
idx = np.where(X.report_me.values)
# find previous two indices when report_me is True, subset the value from n, and sum
X.n.values[idx - np.arange(1,3)[:,None]].sum(axis=0)
You might need some extra logic to handle edge cases, as pointed out in the comment.
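A possible sketch of that edge-case handling (my assumption about what is meant): if report_me is True in one of the first two rows, idx - 1 or idx - 2 goes negative and would silently wrap around to the end of the array, so those positions have to be masked out:
idx = np.where(X.report_me.values)[0]
prev = idx - np.arange(1, 3)[:, None]    # shape (2, n_reported): the previous two positions
valid = (prev >= 0).all(axis=0)          # rows that really have two predecessors
sums = np.where(valid, X.n.values[prev.clip(min=0)].sum(axis=0), np.nan)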
Timing:
%%timeit
idx = np.where(X.report_me.values)
X.n.values[idx - np.arange(1,3)[:,None]].sum(axis=0)
# 10000 loops, best of 3: 23 µs per loop
%timeit X.rolling(2).n.sum().shift()[X.report_me]
#1000 loops, best of 3: 684 µs per loop
%%timeit
sums = X['n'].shift(1) + X['n'].shift(2)
sums[X["report_me"]]
# 1000 loops, best of 3: 704 µs per loop
X['report_sum'] = (X.loc[X.report_me]
                    .apply(lambda x: X.iloc[[x.name - 1, x.name - 2]].n.sum(),
                           axis=1))
n report_me report_sum
0 0 False NaN
1 1 False NaN
2 2 False NaN
3 3 True 3.0
4 4 False NaN
5 5 False NaN
6 6 True 9.0
7 7 False NaN
8 8 False NaN
9 9 False NaN
If you just want the non-NaN values, call .values on the right-hand side of the assignment instead of assigning it to a new column.
I have a large csv with two strings per row in this form:
g,k
a,h
c,i
j,e
d,i
i,h
b,b
d,d
i,a
d,h
I read in the first two columns and recode the strings to integers as follows:
import pandas as pd
df = pd.read_csv("test.csv", usecols=[0,1], prefix="ID_", header=None)
from sklearn.preprocessing import LabelEncoder
# Initialize the LabelEncoder.
le = LabelEncoder()
le.fit(df.values.flat)
# Convert to digits.
df = df.apply(le.transform)
This code is from https://stackoverflow.com/a/39419342/2179021.
The code works very well but is slow when df is large. I timed each step and the result was surprising to me.
pd.read_csv takes about 40 seconds.
le.fit(df.values.flat) takes about 30 seconds
df = df.apply(le.transform) takes about 250 seconds.
Is there any way to speed up this last step? It feels like it should be the fastest step of them all!
Update: more timings for the recoding step, on a computer with 4 GB of RAM.
The answer below by maxymoo is fast but doesn't give the right answer. Taking the example csv from the top of the question, it translates it to:
0 1
0 4 6
1 0 4
2 2 5
3 6 3
4 3 5
5 5 4
6 1 1
7 3 2
8 5 0
9 3 4
Notice that 'd' is mapped to 3 in the first column but 2 in the second.
I tried the solution from https://stackoverflow.com/a/39356398/2179021 and get the following.
df = pd.DataFrame({'ID_0':np.random.randint(0,1000,1000000), 'ID_1':np.random.randint(0,1000,1000000)}).astype(str)
df.info()
memory usage: 7.6MB
%timeit x = (df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack())
1 loops, best of 3: 1.7 s per loop
Then I increased the dataframe size by a factor of 10.
df = pd.DataFrame({'ID_0':np.random.randint(0,1000,10000000), 'ID_1':np.random.randint(0,1000,10000000)}).astype(str)
df.info()
memory usage: 76.3+ MB
%timeit x = (df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack())
MemoryError Traceback (most recent call last)
This method appears to use so much RAM trying to translate this relatively small dataframe that it crashes.
I also timed LabelEncoder with the larger dataset of 10 million rows. It runs without crashing, but the fit line alone took 50 seconds and the df.apply(le.transform) step took about 80 seconds.
How can I:
1. Get something with roughly the speed of maxymoo's answer and roughly the memory usage of LabelEncoder, but that gives the right answer when the dataframe has two columns?
2. Store the mapping so that I can reuse it for different data (as LabelEncoder allows me to do)?
It looks like it will be much faster to use the pandas category datatype; internally this uses a hash table, whereas LabelEncoder uses a sorted search:
In [87]: df = pd.DataFrame({'ID_0': np.random.randint(0, 1000, 1000000),
                            'ID_1': np.random.randint(0, 1000, 1000000)}).astype(str)
In [88]: le.fit(df.values.flat)
%time x = df.apply(le.transform)
CPU times: user 6.28 s, sys: 48.9 ms, total: 6.33 s
Wall time: 6.37 s
In [89]: %time x = df.apply(lambda x: x.astype('category').cat.codes)
CPU times: user 301 ms, sys: 28.6 ms, total: 330 ms
Wall time: 331 ms
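Note that cat.codes assigns codes per column independently, which is exactly the 'd maps to 3 in one column and 2 in the other' problem raised in the question. A hedged sketch of one way to share a single category set across both columns:
# Build one set of categories from all values, then encode both columns against it.
uniques = pd.unique(df.values.ravel())
x = df.apply(lambda col: pd.Categorical(col, categories=uniques).codes)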
EDIT: Here is a custom transformer class that you could use (you probably won't see this in an official scikit-learn release, since the maintainers don't want to have pandas as a dependency):
import pandas as pd
from pandas.core.nanops import unique1d
from sklearn.base import BaseEstimator, TransformerMixin
class PandasLabelEncoder(BaseEstimator, TransformerMixin):
    def fit(self, y):
        self.classes_ = unique1d(y)
        return self

    def transform(self, y):
        s = pd.Series(y).astype('category', categories=self.classes_)
        return s.cat.codes
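A rough usage sketch for the class above (my addition; note that the categories= keyword of astype was removed in later pandas, where pd.Categorical(y, categories=self.classes_) would be the equivalent):
ple = PandasLabelEncoder()
ple.fit(df.values.ravel())          # learn one shared set of classes
encoded = df.apply(ple.transform)   # apply the same mapping to every column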
I tried this with the DataFrame:
In [xxx]: import string
In [xxx]: letters = np.array([c for c in string.ascii_lowercase])
In [249]: df = pd.DataFrame({'ID_0': np.random.choice(letters, 10000000), 'ID_1':np.random.choice(letters, 10000000)})
It looks like this:
In [261]: df.head()
Out[261]:
ID_0 ID_1
0 v z
1 i i
2 d n
3 z r
4 x x
In [262]: df.shape
Out[262]: (10000000, 2)
So, 10 million rows. Locally, my timings are:
In [257]: % timeit le.fit(df.values.flat)
1 loops, best of 3: 17.2 s per loop
In [258]: % timeit df2 = df.apply(le.transform)
1 loops, best of 3: 30.2 s per loop
Then I made a dict mapping letters to numbers and used pandas.Series.map:
In [248]: letters = np.array([l for l in string.ascii_lowercase])
In [263]: d = dict(zip(letters, range(26)))
In [273]: %timeit for c in df.columns: df[c] = df[c].map(d)
1 loops, best of 3: 1.12 s per loop
In [274]: df.head()
Out[274]:
ID_0 ID_1
0 21 25
1 8 8
2 3 13
3 25 17
4 23 23
So that might be an option. The dict just needs to have all of the values that occur in the data.
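If you don't want to write the dict by hand, here is a sketch (my addition) of building it from the data itself, per the sentence above:
uniques = pd.unique(df.values.ravel())         # every value that occurs in either column
d = dict(zip(uniques, range(len(uniques))))    # same code for the same value in both columns
for c in df.columns:
    df[c] = df[c].map(d)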
EDIT: The OP asked what timing I have for that second option, with categories. This is what I get:
In [40]: %timeit x=df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack()
1 loops, best of 3: 13.5 s per loop
EDIT: per the 2nd comment:
In [45]: %timeit uniques = np.sort(pd.unique(df.values.ravel()))
1 loops, best of 3: 933 ms per loop
In [46]: %timeit dfc = df.apply(lambda x: x.astype('category', categories=uniques))
1 loops, best of 3: 1.35 s per loop
I would like to point out an alternative solution that should serve many readers well. Although I prefer to have a known set of IDs, that is not always necessary if all you need is a strictly one-way remapping.
Instead of
df[c] = df[c].apply(le.transform)
or
dict_table = {val: i for i, val in enumerate(uniques)}
df[c] = df[c].map(dict_table)
or (check out _encode() and _encode_python() in the sklearn source code, which I assume is faster on average than the other methods mentioned)
df[c] = np.array([dict_table[v] for v in df[c].values])
you can instead do
df[c] = df[c].apply(hash)
Pros: much faster, less memory needed, no training, hashes can be reduced to smaller representations (more collisions by casting dtype).
Cons: gives funky numbers, can have collisions (not guaranteed to be perfectly unique), and there is no guarantee the function won't change with a new version of Python; note also that Python's built-in hash() for strings is salted per interpreter session unless PYTHONHASHSEED is fixed, so values are not stable across runs.
Note that the secure hash functions will have fewer collisions at the cost of speed.
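For example, a hedged sketch of a hashlib-based variant that is also stable across interpreter sessions (the built-in hash() is salted per process unless PYTHONHASHSEED is fixed); stable_hash is a hypothetical helper name:
import hashlib

def stable_hash(s):
    # 64-bit digest prefix; deterministic across runs, unlike the built-in hash()
    return int(hashlib.md5(s.encode('utf-8')).hexdigest()[:16], 16)

df[c] = df[c].apply(stable_hash)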
Example of when to use: You have somewhat long strings that are mostly unique and the data set is huge. Most importantly, you don't care about rare hash collisions even though it can be a source of noise in your model's predictions.
I've tried all the methods above and my workload was taking about 90 minutes to learn the encoding from training (1M rows and 600 features) and reapply that to several test sets, while also dealing with new values. The hash method brought it down to a few minutes and I don't need to save any model.
I have a dataframe with the structure below. I want to get the column numbers that have the same value (for a specific value) when I compare two rows.
1 1 0 1 1
0 1 0 1 0
0 1 0 0 1
1 0 0 0 1
0 0 0 0 0
1 0 0 0 1
So, for example, when I use the sample df above to compare two rows for the columns that contain 1, comparing row(0) and row(1) should give col(1) and col(3). Similarly, comparing row(1) and row(2) should give col(1). I want to know if there is a more efficient solution in Python.
NB: I want only the matching column numbers and also I will specify the rows to compare.
Consider the following dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.binomial(1, 0.2, (2, 10000)))
It will be a binary matrix of size 2x10000.
np.where((df.iloc[0] * df.iloc[1]))
Or,
np.where((df.iloc[0]) & (df.iloc[1]))
returns the columns that have 1s in both rows. Multiplication seems to be faster:
%timeit np.where((df.iloc[0]) & (df.iloc[1]))
1000 loops, best of 3: 400 µs per loop
%timeit np.where((df.iloc[0] * df.iloc[1]))
1000 loops, best of 3: 269 µs per loop
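If the frame is all integers, dropping down to the underlying NumPy array should avoid the per-Series overhead entirely (my sketch, not part of the answer above):
np.flatnonzero(df.values[0] & df.values[1])   # column indices where both rows are 1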
Here's a simple function. You can modify it as needed, depending on how you represent your data. I'm assuming a list of lists:
df = [[1,1,0,1,1],
[0,1,0,1,0],
[0,1,0,0,1],
[1,0,0,0,1],
[0,0,0,0,0],
[1,0,0,0,1]]
def compare_rows(df, row1, row2):
    """Returns the column numbers in which both rows contain 1's."""
    column_numbers = []
    for i, _ in enumerate(df[0]):
        if (df[row1][i] == 1) and (df[row2][i] == 1):
            column_numbers.append(i)
    return column_numbers
compare_rows(df,0,1) produces the output:
[1,3]
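If the data is a pandas DataFrame rather than a list of lists, a hedged NumPy-based equivalent might look like this (compare_rows_np is a hypothetical name; value lets you match something other than 1):
import numpy as np

def compare_rows_np(df, row1, row2, value=1):
    """Column numbers where both rows equal `value`."""
    a = df.iloc[row1].values
    b = df.iloc[row2].values
    return np.flatnonzero((a == value) & (b == value)).tolist()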
I have a dataframe that has months for columns, and various departments for rows.
2013April 2013May 2013June
Dep1 0 10 15
Dep2 10 15 20
I'm looking to add a column that counts the number of months that have a value greater than 0. Ex:
2013April 2013May 2013June Count>0
Dep1 0 10 15 2
Dep2 10 15 20 3
The number of columns this function needs to span is variable. I think defining a function then using .apply is the solution, but I can't seem to figure it out.
First, pick your columns, cols:
df[cols].apply(lambda s: (s > 0).sum(), axis=1)
This takes advantage of the fact that True and False are 1 and 0, respectively, in Python.
Actually, there's a better way:
(df[cols] > 0).sum(1)
because this takes advantage of NumPy vectorization:
%timeit df.apply(lambda s: (s > 0).sum(), axis=1)
10 loops, best of 3: 141 ms per loop
%timeit (df > 0).sum(1)
1000 loops, best of 3: 319 µs per loop
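Tying this back to the original question, a sketch (my addition) of adding the count as a new column, with the month columns assumed to be the ones shown above:
cols = ['2013April', '2013May', '2013June']   # assumed month columns
df['Count>0'] = (df[cols] > 0).sum(axis=1)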