Pandas DataFrame filtering


Let's say I have a DataFrame with four columns, each of which has a threshold value against which I'd like to compare the DataFrame's values.
For each cell, I would simply like the minimum of the DataFrame's value and that column's threshold.
For example:
df = pd.DataFrame(np.random.randn(100,4), columns=list('ABCD'))
>>> df.head()
A B C D
0 -2.060410 -1.390896 -0.595792 -0.374427
1 0.660580 0.726795 -1.326431 -1.488186
2 -0.955792 -1.852701 -0.895178 -1.353669
3 -1.002576 -0.321210 1.711597 -0.063274
4 1.217197 0.202063 -1.407561 0.940371
thresholds = pd.Series({'A': 1, 'B': 1.1, 'C': 1.2, 'D': 1.3})
This solution works (A4 and C3 were filtered), but there must be an easier way:
df_filtered = df.lt(thresholds).multiply(df) + df.gt(thresholds).multiply(thresholds)
>>> df_filtered.head()
A B C D
0 -2.060410 -1.390896 -0.595792 -0.374427
1 0.660580 0.726795 -1.326431 -1.488186
2 -0.955792 -1.852701 -0.895178 -1.353669
3 -1.002576 -0.321210 1.200000 -0.063274
4 1.000000 0.202063 -1.407561 0.940371
Ideally, I'd like to use .loc to filter in place, but I haven't managed to figure it out. I'm using Pandas 0.14.1 (and can't upgrade).
RESPONSE: Below are the timed tests of my initial proposal against the alternatives:
%%timeit
df.lt(thresholds).multiply(df) + df.gt(thresholds).multiply(thresholds)
1000 loops, best of 3: 990 µs per loop
%%timeit
np.minimum(df, thresholds) # <--- Simple, fast, and returns DataFrame!
10000 loops, best of 3: 110 µs per loop
%%timeit
df[df < thresholds].fillna(thresholds, inplace=True)
1000 loops, best of 3: 1.36 ms per loop

This is pretty fast (and returns a dataframe):
np.minimum( df, [1.0,1.1,1.2,1.3] )
A pleasant surprise that numpy is so amenable to this without any reshaping or explicit conversions...
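A minimal sketch of the same capping idea on a small hand-made frame, assuming a recent pandas where DataFrame.clip accepts a Series together with axis=1 (the asker's 0.14.1 may not support this, in which case the np.minimum call above is the one to use). The sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (values are assumptions, not from the question).
df = pd.DataFrame({
    'A': [0.5, 1.2],
    'B': [2.0, -1.0],
    'C': [1.2, 1.5],
    'D': [0.0, 3.0],
})
thresholds = pd.Series({'A': 1.0, 'B': 1.1, 'C': 1.2, 'D': 1.3})

# Cap each column at its own threshold; axis=1 aligns the Series with the columns.
clipped = df.clip(upper=thresholds, axis=1)

# Same computation on the raw arrays: broadcasting pairs each threshold with one column.
capped_values = np.minimum(df.to_numpy(), thresholds.to_numpy())
```

Both results hold the element-wise minimum of the cell and its column threshold; clip keeps the index and column labels, while the raw NumPy call returns a plain array.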

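How about:
df_filtered = df[df < thresholds].fillna(thresholds)
(Assign the result: calling fillna with inplace=True on df[df < thresholds] modifies a temporary copy and leaves df unchanged.)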

Related

Conditionally assign values to DF column using np.where

I'm trying to group some columns into a list inside a single column.
In case one of these columns contain NaN, the result column should be just NaN instead of the list.
df = pd.DataFrame({'a.0':[11, 1.1], 'a.5':[12, 1.2], 'a.10':[13, np.nan]})
The result column of the DF should be as such:
a.0 a.10 a.5 result
0 11.0 13.0 12.0 [11, 13, 12]
1 1.1 NaN 1.2 NaN
These 2 lines do the job:
df['result'] = df[['a.0','a.10','a.5']].values.tolist()
df['result'] = np.where(df[['a.0','a.10','a.5']].isnull().any(axis=1), np.nan, df['result'])
And I was wondering how to do it in one line. Help would be appreciated

Using isnull + any + mask/where
df['result'] = pd.Series(df.values.tolist()).mask(df.isnull().any(axis=1))
Or,
df['result'] = pd.Series(df.values.tolist()).where(~df.isnull().any(axis=1))
a.0 a.10 a.5 result
0 11.0 13.0 12.0 [11.0, 13.0, 12.0]
1 1.1 NaN 1.2 NaN
Performance
df = pd.concat([df] * 100000, ignore_index=True)
# Wen's solution
%timeit df.apply(lambda x : pd.Series([x.tolist()]) if ~x.isnull().any() else np.nan,1)
1 loop, best of 3: 1min 37s per loop
# Anton vBR's solution
%timeit [np.nan if np.isnan(v).any() else list(v[1:]) for v in df.itertuples()]
1 loop, best of 3: 5.79 s per loop
# my answer
%timeit pd.Series(df.values.tolist()).mask(df.isnull().any(1))
10 loops, best of 3: 133 ms per loop
Conclusions
apply is inefficient, even more so than a plain loop: each apply call carries considerable overhead, so a pure Python loop is generally faster.
A vectorised approach such as where/mask avoids Python-level iteration and offers improved performance over loopy solutions.
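As a rough sketch of the mask approach on the question's two-row frame, assuming a recent pandas (np.nan instead of pd.np, keyword axis arguments). The names cols, rows_as_lists and has_nan are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a.0': [11, 1.1], 'a.5': [12, 1.2], 'a.10': [13, np.nan]})
cols = ['a.0', 'a.10', 'a.5']

# One Python list per row, held in an object-dtype Series that shares df's index.
rows_as_lists = pd.Series(df[cols].to_numpy().tolist(), index=df.index)

# True for every row that contains at least one NaN.
has_nan = df[cols].isna().any(axis=1)

# mask() replaces the entries where the condition is True with NaN.
result = rows_as_lists.mask(has_nan)
```

Building the Series with index=df.index keeps the alignment correct when the result is assigned back as df['result'], even if df does not have the default integer index.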

Update: With timings and large data sets, cᴏʟᴅsᴘᴇᴇᴅ's answer is the best. List comprehensions always suffer here. I've updated my previous answer with timings.
You could use itertuples and assign np.nan if any np.nan in row:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a.0':np.random.choice(np.append(np.nan,np.arange(10)), 1000),
'a.5':np.random.choice(10, 1000),
'a.10':np.random.choice(10, 1000)})
# 3 solutions to solve the problem
# Assign with df['results'] =
%timeit [np.nan if np.isnan(v).any() else list(v[1:]) for v in df.itertuples()]
%timeit pd.Series(df.values.tolist()).mask(df.isnull().any(1))
%timeit df.apply(lambda x : pd.Series([x.tolist()]) if ~x.isnull().any() else np.nan,1)
Timings:
100 loops, best of 3: 8.38 ms per loop
1000 loops, best of 3: 772 µs per loop
1 loop, best of 3: 214 ms per loop

df['result']=df.apply(lambda x : pd.Series([x.tolist()]) if ~x.isnull().any() else np.nan,1)
df
Out[30]:
a.0 a.10 a.5 result
0 11.0 13.0 12.0 [11.0, 13.0, 12.0]
1 1.1 NaN 1.2 NaN

How to check if a particular cell in pandas DataFrame isnull?

I have the following df in pandas.
0 A B C
1 2 NaN 8
How can I check if df.iloc[1]['B'] is NaN?
I tried using df.isnull() and I get a table like this:
0 A B C
1 false true false
but I am not sure how to index the table and if this is an efficient way of performing the job at all?

Use pd.isnull; to select the cell, use loc or iloc:
print (df)
0 A B C
0 1 2 NaN 8
print (df.loc[0, 'B'])
nan
a = pd.isnull(df.loc[0, 'B'])
print (a)
True
print (df['B'].iloc[0])
nan
a = pd.isnull(df['B'].iloc[0])
print (a)
True
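A short sketch of the same single-cell check using the scalar accessors at and iat, assuming a one-row frame shaped like the one in the question. pd.isna is the alias of pd.isnull in current pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [2], 'B': [np.nan], 'C': [8]})

# Label-based scalar access: row label 0, column 'B'.
b_is_missing = bool(pd.isna(df.at[0, 'B']))

# Position-based scalar access on the column: first row.
b_is_missing_by_position = bool(pd.isna(df['B'].iat[0]))

# A populated cell for comparison.
a_is_missing = bool(pd.isna(df.at[0, 'A']))
```

Checking one cell this way avoids building the full boolean table that df.isnull() returns.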

jezrael's response is spot on. If you are only concerned with whether any NaN value exists, I explored whether there is a faster option, since in my experience summing flat arrays is (strangely) faster than counting. This code seems faster:
df.isnull().values.any()
For example:
In [2]: df = pd.DataFrame(np.random.randn(1000,1000))
In [3]: df[df > 0.9] = np.nan
In [4]: %timeit df.isnull().any().any()
100 loops, best of 3: 14.7 ms per loop
In [5]: %timeit df.isnull().values.sum()
100 loops, best of 3: 2.15 ms per loop
In [6]: %timeit df.isnull().sum().sum()
100 loops, best of 3: 18 ms per loop
In [7]: %timeit df.isnull().values.any()
1000 loops, best of 3: 948 µs per loop

If you are looking for the indexes of NaN in a specific column you can use
list(df['B'].index[df['B'].apply(np.isnan)])
In case you want to get the indexes of all NaN values in the dataframe, you can do the following
row_col_indexes = list(map(list, np.where(np.isnan(np.array(df)))))
indexes = []
for i in zip(row_col_indexes[0], row_col_indexes[1]):
    indexes.append(list(i))
And if you are looking for a one liner you can use:
list(zip(*[x for x in list(map(list, np.where(np.isnan(np.array(df)))))]))
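The same two lookups can be sketched more directly with isna and np.argwhere. The small frame here is an assumption made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [np.nan, 2.0, np.nan]})

# Row labels where column 'B' is NaN.
nan_rows_in_b = df.index[df['B'].isna()].tolist()

# [row_position, column_position] pairs for every NaN in the frame, in row-major order.
nan_positions = np.argwhere(df.isna().to_numpy()).tolist()
```

Note that nan_positions holds integer positions, not labels; map them through df.index and df.columns if labels are needed.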

pandas sort lambda function

Given a dataframe a with 3 columns, A, B, C, and 3 rows of numerical values, how does one sort all the rows with a comparison operator using only the product A[i]*B[i]? It seems that the pandas sort only takes columns and then a sort method.
I would like to use a comparison function like below.
f = lambda i,j: a['A'][i]*a['B'][i] < a['A'][j]*a['B'][j]

There are at least two ways:
Method 1
Say you start with
In [175]: df = pd.DataFrame({'A': [1, 2], 'B': [1, -1], 'C': [1, 1]})
You can add a column which is your sort key
In [176]: df['sort_val'] = df.A * df.B
Finally sort by it and drop it
In [190]: df.sort_values('sort_val').drop('sort_val', axis=1)
Out[190]:
A B C
1 2 -1 1
0 1 1 1
Method 2
Use numpy.argsort and then use .iloc on the resulting positions (.ix has been removed from pandas):
In [197]: import numpy as np
In [198]: df.iloc[np.argsort(df.A * df.B).values]
Out[198]:
A B C
1 2 -1 1
0 1 1 1

Another way, adding it here because this is the first result at Google:
df.loc[(df.A * df.B).sort_values().index]
This works well for me and is pretty straightforward. @Ami Tavory's answer gave strange results for me with a categorical index; not sure it's because of that though.
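A small sketch contrasting the label-based and position-based variants of sorting by a computed key. The three-row frame and the names product, by_label and by_position are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, -1, 2], 'C': [1, 1, 1]})

# Sort key: the row-wise product A * B, here [1, -2, 6].
product = df['A'] * df['B']

# Label-based: sort the key, then reorder df by the sorted key's index labels.
by_label = df.loc[product.sort_values().index]

# Position-based: argsort gives integer positions, which is what iloc expects.
by_position = df.iloc[np.argsort(product.to_numpy(), kind='stable')]
```

With a non-default index (strings, dates, a categorical index) the two are not interchangeable: labels go to loc and integer positions go to iloc.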

Just adding to @srs's super elegant answer an iloc option, with some time comparisons against loc and the naive solution.
(iloc is preferred when your index is position-based, loc when it is label-based; the iloc version below works only because the index is the default 0..N-1 range.)
import numpy as np
import pandas as pd
N = 10000
df = pd.DataFrame({
'A': np.random.randint(low=1, high=N, size=N),
'B': np.random.randint(low=1, high=N, size=N)
})
%%timeit -n 100
df['C'] = df['A'] * df['B']
df.sort_values(by='C')
naive: 100 loops, best of 3: 1.85 ms per loop
%%timeit -n 100
df.loc[(df.A * df.B).sort_values().index]
loc: 100 loops, best of 3: 2.69 ms per loop
%%timeit -n 100
df.iloc[(df.A * df.B).sort_values().index]
iloc: 100 loops, best of 3: 2.02 ms per loop
Testing that all options produce the same row order (comparing the entire index):
df['C'] = df['A'] * df['B']
df1 = df.sort_values(by='C')
df2 = df.loc[(df.A * df.B).sort_values().index]
df3 = df.iloc[(df.A * df.B).sort_values().index]
print(np.array_equal(df1.index, df2.index))
print(np.array_equal(df2.index, df3.index))
True
True

Pandas DataFrame: How to quickly get first non-NaN value in each row? [duplicate]

If I've got a DataFrame in pandas which looks something like:
A B C
0 1 NaN 2
1 NaN 3 NaN
2 NaN 4 5
3 NaN NaN NaN
How can I get the first non-null value from each row? E.g. for the above, I'd like to get: [1, 3, 4, None] (or equivalent Series).

Back-fill the NaNs along each row, so that the first valid value propagates to the leftmost column, then take that column:
df.bfill(axis=1).iloc[:, 0]
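A worked sketch of the back-fill idea on the frame from the question, showing the intermediate filled frame as well. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, np.nan, np.nan, np.nan],
    'B': [np.nan, 3, 4, np.nan],
    'C': [2, np.nan, 5, np.nan],
})

# Each NaN takes the next valid value to its right within the same row.
filled = df.bfill(axis=1)

# After the fill, the leftmost column holds the first valid value of each row.
first_valid = filled.iloc[:, 0]
```

A row with no valid values stays NaN throughout, so the last entry of first_valid is NaN.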

This is a really messy way to do this: first use first_valid_index to get the valid columns, then convert the returned Series to a DataFrame so we can call apply row-wise and use this to index back into the original df:
In [160]:
def func(x):
    if x.values[0] is None:
        return None
    else:
        return df.loc[x.name, x.values[0]]
pd.DataFrame(df.apply(lambda x: x.first_valid_index(), axis=1)).apply(func,axis=1)

Out[160]:
0 1
1 3
2 4
3 NaN
dtype: float64
EDIT
A slightly cleaner way:
In [12]:
def func(x):
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]
df.apply(func, axis=1)
Out[12]:
0 1
1 3
2 4
3 NaN
dtype: float64

I'm going to weigh in here as I think this is a good deal faster than any of the proposed methods. argmin gives the index of the first False value in each row of the result of np.isnan in a vectorized way, which is the hard part. It still relies on a Python loop to extract the values but the look up is very quick:
def get_first_non_null(df):
    a = df.values
    col_index = np.isnan(a).argmin(axis=1)
    return [a[row, col] for row, col in enumerate(col_index)]
EDIT:
Here's a fully vectorized solution which can be a good deal faster again, depending on the shape of the input. Updated benchmarking below.
def get_first_non_null_vec(df):
    a = df.values
    n_rows, n_cols = a.shape
    col_index = np.isnan(a).argmin(axis=1)
    flat_index = n_cols * np.arange(n_rows) + col_index
    return a.ravel()[flat_index]
If a row is completely null then the corresponding value will be null also.
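To see why an all-null row comes out null, here is a step-by-step sketch of the argmin trick on the question's frame. It uses NumPy integer-array indexing in place of the flat index; the intermediate names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, np.nan, np.nan, np.nan],
    'B': [np.nan, 3, 4, np.nan],
    'C': [2, np.nan, 5, np.nan],
})
a = df.to_numpy()

# In each row of the boolean mask, argmin returns the position of the first False,
# i.e. the first non-NaN cell. For an all-True row it returns 0.
first_valid_col = np.isnan(a).argmin(axis=1)

# Pick one cell per row. For the all-NaN row, column 0 is itself NaN.
values = a[np.arange(a.shape[0]), first_valid_col]
```

The last row has no False in its mask, so argmin falls back to position 0 and the value picked there is NaN, which is the desired answer for that row.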
Here's some benchmarking against unutbu's solution:
df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))
#%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 220 ms per loop
100 loops, best of 3: 16.2 ms per loop
100 loops, best of 3: 12.6 ms per loop
In [109]:
df = pd.DataFrame(np.random.choice([1, np.nan], (100000, 150), p=(0.01, 0.99)))
#%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 246 ms per loop
10 loops, best of 3: 48.2 ms per loop
100 loops, best of 3: 15.7 ms per loop
df = pd.DataFrame(np.random.choice([1, np.nan], (1000000, 15), p=(0.01, 0.99)))
%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 326 ms per loop
1 loops, best of 3: 326 ms per loop
10 loops, best of 3: 35.7 ms per loop

Here is another way to do it:
In [183]: df.stack().groupby(level=0).first().reindex(df.index)
Out[183]:
0 1
1 3
2 4
3 NaN
dtype: float64
The idea here is to use stack to move the columns into a row index level:
In [184]: df.stack()
Out[184]:
0 A 1
C 2
1 B 3
2 B 4
C 5
dtype: float64
Now, if you group by the first row level -- i.e. the original index -- and take the first value from each group, you essentially get the desired result:
In [185]: df.stack().groupby(level=0).first()
Out[185]:
0 1
1 3
2 4
dtype: float64
All we need to do is reindex the result (using the original index) so as to
include rows that are completely NaN:
df.stack().groupby(level=0).first().reindex(df.index)

This is nothing new, but it's a combination of the best bits of @yangie's approach with a list comprehension, and @EdChum's df.apply approach that I think is easiest to understand.
First, which columns do we want to pick our values from?
In [95]: pick_cols = df.apply(pd.Series.first_valid_index, axis=1)
In [96]: pick_cols
Out[96]:
0 A
1 B
2 B
3 None
dtype: object
Now how do we pick the values?
In [100]: [df.loc[k, v] if v is not None else None
....: for k, v in pick_cols.items()]
Out[100]: [1.0, 3.0, 4.0, None]
This is ok, but we really want the index to match that of the original DataFrame:
In [98]: pd.Series({k:df.loc[k, v] if v is not None else None
....: for k, v in pick_cols.items()})
Out[98]:
0 1
1 3
2 4
3 NaN
dtype: float64

groupby in axis=1
If we pass a callable that returns the same value, we group all columns together. This allows us to use groupby.agg which gives us the first method that makes this easy
df.groupby(lambda x: 'Z', 1).first()
Z
0 1.0
1 3.0
2 4.0
3 NaN
This returns a dataframe with the column name of the thing I was returning in my callable
lookup, notna, and idxmax
df.lookup(df.index, df.notna().idxmax(1))
array([ 1., 3., 4., nan])
argmin and slicing
v = df.values
v[np.arange(len(df)), np.isnan(v).argmin(1)]
array([ 1., 3., 4., nan])

Here is a one line solution:
[row[row.first_valid_index()] if row.first_valid_index() is not None else None for _, row in df.iterrows()]
Edit:
This solution iterates over the rows of df. row.first_valid_index() returns the label of the first non-NA/null value, which is then used as the index to get the first non-null item in each row.
If there is no non-null value in the row, row.first_valid_index() is None and cannot be used as an index, so I need an if-else expression. The check compares against None explicitly because a valid label such as 0 is falsy.
I packed everything into a list comprehension for brevity.
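One caveat worth sketching for this row-wise approach: a truthiness test on first_valid_index() misfires when a column label is falsy, such as 0. The frame below uses integer column labels purely as an assumed example.

```python
import numpy as np
import pandas as pd

# Integer column labels, so the first column's label is 0 (falsy).
df = pd.DataFrame(
    [[1.0, np.nan], [np.nan, 2.0], [np.nan, np.nan]],
    columns=[0, 1],
)

# Truthiness check: label 0 is treated like "no valid value".
naive = [
    row[row.first_valid_index()] if row.first_valid_index() else None
    for _, row in df.iterrows()
]

# Explicit None check: label 0 is handled correctly.
firsts = []
for _, row in df.iterrows():
    label = row.first_valid_index()
    firsts.append(row[label] if label is not None else None)
```

Here naive loses the value in the first row, while firsts returns it. With string column labels such as 'A', 'B', 'C' both versions behave the same.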

JoeCondron's answer (EDIT: before his last edit!) is cool but there is margin for significant improvement by avoiding the non-vectorized enumeration:
def get_first_non_null_vect(df):
    a = df.values
    col_index = np.isnan(a).argmin(axis=1)
    return a[np.arange(a.shape[0]), col_index]
The improvement is small if the DataFrame is relatively flat:
In [4]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))
In [5]: %timeit get_first_non_null(df)
10 loops, best of 3: 34.9 ms per loop
In [6]: %timeit get_first_non_null_vect(df)
10 loops, best of 3: 31.6 ms per loop
... but can be relevant on slim DataFrames:
In [7]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 15), p=(0.1, 0.9)))
In [8]: %timeit get_first_non_null(df)
100 loops, best of 3: 3.75 ms per loop
In [9]: %timeit get_first_non_null_vect(df)
1000 loops, best of 3: 718 µs per loop
Compared to JoeCondron's vectorized version, the runtime is very similar (this is still slightly quicker for slim DataFrames, and slightly slower for large ones).

df=pandas.DataFrame({'A':[1, numpy.nan, numpy.nan, numpy.nan], 'B':[numpy.nan, 3, 4, numpy.nan], 'C':[2, numpy.nan, 5, numpy.nan]})
df
A B C
0 1.0 NaN 2.0
1 NaN 3.0 NaN
2 NaN 4.0 5.0
3 NaN NaN NaN
df.apply(lambda x: numpy.nan if all(x.isnull()) else x[x.first_valid_index()], axis=1).tolist()
[1.0, 3.0, 4.0, nan]

In pandas, how can I get a DataFrame as the output while I sum the DataFrame

While I sum a DataFrame, it returns a Series:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([[1, 2, 3], [2, 3, 3]], columns=['a', 'b', 'c'])
In [3]: df
Out[3]:
a b c
0 1 2 3
1 2 3 3
In [4]: s = df.sum()
In [5]: type(s)
Out[5]: pandas.core.series.Series
I know I can construct a new DataFrame by this Series. But, is there any more "pandasic" way?

I'm going to go ahead and say... "No", I don't think there is a direct way to do it, the pandastic way (and pythonic too) is to be explicit:
pd.DataFrame(df.sum(), columns=['sum'])
or, more elegantly, using a dictionary (be aware that this copies the summed array):
pd.DataFrame({'sum': df.sum()})
As @root notes it's faster to use:
pd.DataFrame(np.sum(df.values, axis=0), columns=['sum'])
(As the zen of python states: "practicality beats purity", so if you care about this time, use this).
However, perhaps the most pandastic way is to just use the Series! :)
Some %timeits for your tiny example:
In [11]: %timeit pd.DataFrame(df.sum(), columns=['sum'])
1000 loops, best of 3: 356 us per loop
In [12]: %timeit pd.DataFrame({'sum': df.sum()})
1000 loops, best of 3: 462 us per loop
In [13]: %timeit pd.DataFrame(np.sum(df.values, axis=0), columns=['sum'])
1000 loops, best of 3: 205 us per loop
and for a slightly larger one:
In [21]: df = pd.DataFrame(np.random.randn(100000, 3), columns=list('abc'))
In [22]: %timeit pd.DataFrame(df.sum(), columns=['sum'])
100 loops, best of 3: 7.99 ms per loop
In [23]: %timeit pd.DataFrame({'sum': df.sum()})
100 loops, best of 3: 8.3 ms per loop
In [24]: %timeit pd.DataFrame(np.sum(df.values, axis=0), columns=['sum'])
100 loops, best of 3: 2.47 ms per loop

Often it is necessary not only to convert the sum of the columns into a dataframe, but also to transpose the resulting dataframe. There is also a method for this:
df.sum().to_frame().transpose()

I am not sure about earlier versions, but as of pandas 0.18.1 one can use pandas.Series.to_frame method.
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [2, 3, 3]], columns=['a', 'b', 'c'])
s = df.sum().to_frame(name='sum')
type(s)
>>> pandas.core.frame.DataFrame
The name argument is optional and defines the column name.

You can use agg for simple operations like sum, have a look at how compact this is:
df.agg(['sum'])
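A brief sketch comparing the shapes these options produce on the question's frame, assuming a pandas version that has both Series.to_frame and DataFrame.agg. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 3, 3]], columns=['a', 'b', 'c'])

# One column named 'sum', one row per original column (3 x 1).
as_column = df.sum().to_frame(name='sum')

# Transposed: one row labelled 'sum', original columns kept (1 x 3).
as_row = df.sum().to_frame(name='sum').T

# agg with a list of function names also yields one row per function (1 x 3).
via_agg = df.agg(['sum'])
```

Passing a list to agg is what keeps the result a DataFrame; df.agg('sum') with a plain string returns a Series, like df.sum().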

df.sum().to_frame() should do what you want.
See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.to_frame.html.

Using DF.sum().to_frame(), or otherwise storing aggregate results directly in a DataFrame, is not a great option when you want the aggregated labels and the aggregated sums kept as separate columns: DF.sum().to_frame() keeps the labels in the index next to the sums.
Try the version below for a cleaner result.
a = DF.sum()
sums = list(a)
labels = list(a.index)
series_dict = {"Agg_Value": labels, "Agg_Sum": sums}
Agg_DF = pd.DataFrame(series_dict)
