Find Minimum without Zero and NaN in Pandas Dataframe - python

I have a pandas DataFrame and I want to find its minimum, excluding zeros and NaNs.
I was trying to combine numpy's nonzero and nanmin, but it does not work.
Does anyone have an idea?

If you want the minimum of the whole DataFrame, you can try:
m = np.nanmin(df.replace(0, np.nan).values)

Use numpy.where with numpy.nanmin:
import numpy as np
import pandas as pd

df = pd.DataFrame({'B': [4, 0, 4, 5, 5, np.nan],
                   'C': [7, 8, 9, np.nan, 2, 3],
                   'D': [1, np.nan, 5, 7, 1, 0],
                   'E': [5, 3, 0, 9, 2, 4]})
print (df)
     B    C    D  E
0  4.0  7.0  1.0  5
1  0.0  8.0  NaN  3
2  4.0  9.0  5.0  0
3  5.0  NaN  7.0  9
4  5.0  2.0  1.0  2
5  NaN  3.0  0.0  4
Numpy solution:
arr = df.values
a = np.nanmin(np.where(arr == 0, np.nan, arr))
print (a)
1.0
Pandas solution - NaN values are skipped by min by default:
a = df.mask(df==0).min().min()
print (a)
1.0
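If you need a minimum per column rather than a single overall value, a minimal sketch reusing the same masking idea (not part of the original answer):
# Per-column minimum, ignoring zeros and NaNs; Series.min skips NaN by default.
col_mins = df.mask(df == 0).min()
print(col_mins)
# B    4.0
# C    2.0
# D    1.0
# E    2.0
# dtype: float64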
Performance - one NaN value is added to each row (on the diagonal):
np.random.seed(123)
df = pd.DataFrame(np.random.rand(1000,1000))
np.fill_diagonal(df.values, np.nan)
print (df)
# joe's answer
In [399]: %timeit np.nanmin(df.replace(0, np.nan).values)
15.3 ms ± 425 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [400]: %%timeit
...: arr = df.values
...: a = np.nanmin(np.where(arr == 0, np.nan, arr))
...:
6.41 ms ± 427 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [401]: %%timeit
...: df.mask(df==0).min().min()
...:
23.9 ms ± 727 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
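As a further option (not timed above), a sketch that selects the valid values with a boolean mask instead of converting zeros to NaN - a pure NumPy variant offered only as an assumption:
arr = df.values
vals = arr[(arr != 0) & ~np.isnan(arr)]   # keep only non-zero, non-NaN entries
a = vals.min() if vals.size else np.nan   # guard against everything being filtered out
print(a)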

Related

Python Pandas Fast Way to Divide Row Value by Previous Value

I want to calculate daily bond returns from clean prices, based on the logarithm of the bond price at time t divided by the bond price at t-1. So far, I calculate it like this:
import pandas as pd
import numpy as np

# create example data
col1 = np.random.randint(0, 10, size=10)
df = pd.DataFrame()
df["col1"] = col1
df["result"] = [0] * len(df)

# slow computation
for i in range(len(df)):
    if i == 0:
        df["result"][i] = np.nan
    else:
        df["result"][i] = np.log(df["col1"][i] / df["col1"][i - 1])
However, since I have a large sample this takes a lot of time to compute. Is there a way to improve the code in order to make it faster?
Use Series.shift on the col1 column together with Series.div for the division:
df["result1"] = np.log(df["col1"].div(df["col1"].shift()))
#alternative
#df["result1"] = np.log(df["col1"] / df["col1"].shift())
print (df)
col1 result result1
0 5 NaN NaN
1 0 -inf -inf
2 3 inf inf
3 3 0.000000 0.000000
4 7 0.847298 0.847298
5 9 0.251314 0.251314
6 3 -1.098612 -1.098612
7 5 0.510826 0.510826
8 2 -0.916291 -0.916291
9 4 0.693147 0.693147
I timed the solutions:
np.random.seed(0)
col1 = np.random.randint(0,10,size=10000)
df = pd.DataFrame({'col1':col1})
In [128]: %timeit df["result1"] = np.log(df["col1"] / df["col1"].shift())
865 µs ± 139 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [129]: %timeit df.assign(result=lambda x: np.log(x.col1.pct_change() + 1))
1.16 ms ± 11.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [130]: %timeit df["result1"] = np.log(df["col1"].pct_change() + 1)
1.03 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
np.random.seed(0)
col1 = np.random.randint(0,10,size=100000)
df = pd.DataFrame({'col1':col1})
In [132]: %timeit df["result1"] = np.log(df["col1"] / df["col1"].shift())
3.7 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [133]: %timeit df.assign(result=lambda x: np.log(x.col1.pct_change() + 1))
6.31 ms ± 545 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [134]: %timeit df["result1"] = np.log(df["col1"].pct_change() + 1)
3.75 ms ± 269 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
No need to use multiple functions, use Series.pct_change():
df = df.assign(
    result=lambda x: np.log(x.col1.pct_change() + 1)
)
print(df)
col1 result
0 3 NaN
1 5 0.510826
2 8 0.470004
3 7 -0.133531
4 9 0.251314
5 1 -2.197225
6 1 0.000000
7 2 0.693147
8 7 1.252763
9 0 -inf
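The two formulations agree because col1.pct_change() equals col1 / col1.shift() - 1; a quick hedged check (assuming df holds a numeric col1 column, as above):
via_shift = np.log(df["col1"] / df["col1"].shift())
via_pct = np.log(df["col1"].pct_change() + 1)
# NaN appears only in the first row of both; inf/-inf line up wherever col1 hits 0.
print(np.allclose(via_shift, via_pct, equal_nan=True))   # True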
This should be a much faster way to get the same results:
df["result_2"] = np.log(df["col1"] / df["col1"].shift())

How to convert a particular dataframe into a series by combining columns

Given the following data, where 3 means yes and 2 means no
t = pd.DataFrame({"v_1": [2, 2, 3], "v_2": [2, 3, 2], "v_3": [3, 2, 2],})
which looks as
v_1 v_2 v_3
0 2 2 3
1 2 3 2
2 3 2 2
I would like to create the following series
0 v_3
1 v_2
2 v_1
All I can think of is the following:
t['V'] = t.sum().reset_index(drop=True)
which gives
v_1 v_2 v_3 V
0 v_3 v_1
1 v_2 v_2
2 v_1 v_3
I'm wondering if there's a nicer approach than this, or perhaps more general.
Perhaps this is what you need - keep the 3s and collect their column names into a series?
(
    t.apply(lambda x: np.where(x.eq(3), x.name, None))
     .stack().reset_index(drop=True)
)
0 v_3
1 v_2
2 v_1
dtype: object
Give this a whirl:
(t
 .stack()
 .droplevel(0)
 .loc[lambda x: x.eq(3)]
 .reset_index(name='temp')
 .drop('temp', axis=1)
)
  index
0   v_3
1   v_2
2   v_1
Use DataFrame.where to replace values other than 3 with missing values, then reshape with DataFrame.stack, remove the first level of the MultiIndex and, finally, create a Series from the index. Use this if performance is important:
s = pd.Series(t.where(t.eq(3)).stack().droplevel(0).index)
#alternative
#s = pd.Series(t.where(t.eq(3)).stack().reset_index(0, drop=True).index)
print (s)
0 v_3
1 v_2
2 v_1
dtype: object
Details:
print (t.where(t.eq(3)))
v_1 v_2 v_3
0 NaN NaN 3.0
1 NaN 3.0 NaN
2 3.0 NaN NaN
print (t.where(t.eq(3)).stack())
0 v_3 3.0
1 v_2 3.0
2 v_1 3.0
dtype: float64
print (t.where(t.eq(3)).stack().droplevel(0))
v_3 3.0
v_2 3.0
v_1 3.0
dtype: float64
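The last step (not printed above) simply takes the remaining index and wraps it in a Series:
print(t.where(t.eq(3)).stack().droplevel(0).index)
# Index(['v_3', 'v_2', 'v_1'], dtype='object')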
Performance for 1k rows and 10 columns:
np.random.seed(123)
t = pd.DataFrame(np.random.choice([2,3], (1000, 10))).add_prefix('v_')
#print (t)
In [25]: %timeit pd.Series(t.where(t.eq(3)).stack().droplevel(0).index)
2.66 ms ± 93.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [26]: %timeit pd.Series(t.where(t.eq(3)).stack().reset_index(0, drop=True).index)
2.61 ms ± 41.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [27]: %timeit t.apply(lambda x: np.where(x.eq(3), x.name, None)).stack().reset_index(drop=True)
5.98 ms ± 46.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [28]: %timeit t.stack().droplevel(0).loc[lambda x: x.eq(3)].reset_index(name='temp').drop('temp',axis=1)
3.48 ms ± 36.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Performance for 100k rows and 10 columns:
t = pd.DataFrame(np.random.choice([2,3], (100000, 10))).add_prefix('v_')
print (t)
In [30]: %timeit pd.Series(t.where(t.eq(3)).stack().droplevel(0).index)
84.7 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [31]: %timeit pd.Series(t.where(t.eq(3)).stack().reset_index(0, drop=True).index)
84.1 ms ± 459 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [32]: %timeit t.apply(lambda x: np.where(x.eq(3), x.name, None)).stack().reset_index(drop=True)
147 ms ± 688 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [33]: %timeit t.stack().droplevel(0).loc[lambda x: x.eq(3)].reset_index(name='temp').drop('temp',axis=1)
101 ms ± 635 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
You can create a new index that has the location of 3 for each column. Then you apply that index to your column names.
import pandas as pd
t = pd.DataFrame({"v_1": [2, 2, 3], "v_2": [2, 3, 2], "v_3": [3, 2, 2],})
index_list = [t[t[col]==3].index[0] for col in t.columns] # create new index
series = pd.Series(t.columns) # series of column names
series.index = index_list # apply index to column names
print(series.sort_index())
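For the example frame above, this should print (output not shown in the original):
0    v_3
1    v_2
2    v_1
dtype: object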

Concatenate column values in a pandas DataFrame while ignoring NaNs

I have a the following pandas table
df:
EVNT_ID  col1  col2  col3  col4
 123454     1   NaN     4     5
 628392   NaN     3   NaN     7
 293899     2   NaN   NaN     6
 127820     9    11    12    19
Now I am trying to concatenate all the columns except the first one, and I want my data frame to look the following way:
new_df:
EVNT_ID  col1  col2  col3  col4     new_col
 123454     1   NaN     4     5       1|4|5
 628392   NaN     3   NaN     7         3|7
 293899     2   NaN   NaN     6         2|6
 127820     9    11    12    19  9|11|12|19
I am using the following code
df['new_column'] = df[~df.EVNT_ID].apply(lambda x: '|'.join(x.dropna().astype(str).values), axis=1)
but it is giving me the following error
ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
I would really appreciate it if anyone could point out where I am going wrong.
Try the following code:
df['new_col'] = df.iloc[:, 1:].apply(
    lambda x: '|'.join(str(el) for el in x if str(el) != 'nan'), axis=1)
Initially I thought about x.dropna() instead of the if str(el) != 'nan' filter,
but %timeit showed that dropna() works much slower.
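For reference, a sketch of the dropna-based variant that was benchmarked (not in the original answer; 'new_col_dropna' is a made-up column name):
df['new_col_dropna'] = df.iloc[:, 1:].apply(
    lambda x: '|'.join(str(el) for el in x.dropna()), axis=1)
# Values keep their float formatting, e.g. '1.0|4.0|5.0'.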
You can do this with filter and agg:
df.filter(like='col').agg(
    lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)
0 1|4|5
1 3|7
2 2|6
3 9|11|12|19
dtype: object
Or,
df.drop('EVNT_ID', axis=1).agg(
    lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)
0 1|4|5
1 3|7
2 2|6
3 9|11|12|19
dtype: object
If performance is important, you can use a list comprehension:
joined = [
    '|'.join([str(int(x)) for x in r if pd.notna(x)])
    for r in df.iloc[:, 1:].values.tolist()
]
joined
# ['1|4|5', '3|7', '2|6', '9|11|12|19']
df.assign(new_col=joined)
EVNT_ID col1 col2 col3 col4 new_col
0 123454 1.0 NaN 4.0 5 1|4|5
1 628392 NaN 3.0 NaN 7 3|7
2 293899 2.0 NaN NaN 6 2|6
3 127820 9.0 11.0 12.0 19 9|11|12|19
If you can forgive the overhead of assignment to a DataFrame, here are timings for the two fastest solutions in this thread.
df = pd.concat([df] * 1000, ignore_index=True)
# In this post.
%%timeit
[
    '|'.join([str(int(x)) for x in r if pd.notna(x)])
    for r in df.iloc[:, 1:].values.tolist()
]

# RafaelC's answer.
%%timeit
[
    '|'.join([k for k in a if k])
    for a in zip(*df.fillna('').astype(str).iloc[:, 1:].values.tolist())
]
31.9 ms ± 800 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
23.7 ms ± 409 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Although note the answers aren't identical, because @RafaelC's code produces floats (['1.0|2.0|9.0', '3.0|11.0', ...]) and, since it zips the transposed values, joins per column rather than per row. If that is fine, then great. Otherwise you'll need to convert to int, which adds more overhead.
Using list comprehension and zip
>>> [['|'.join([k for k in a if k])] for a in zip(*df.fillna('').astype(str).iloc[:, 1:].values)]
Timing seems alright
df = pd.concat([df]*1000)
%timeit [['|'.join([k for k in a if k])] for a in zip(*df.fillna('').astype(str).iloc[:, 1:].values)]
10.8 ms ± 568 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.filter(like='col').agg(lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)
1.68 s ± 91.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.iloc[:, 1:].apply(lambda x: '|'.join(str(el) for el in x if str(el) != 'nan'), axis=1)
87.8 ms ± 5.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.assign(new_col=['|'.join([str(int(x)) for x in r if ~np.isnan(x)]) for r in df.iloc[:,1:].values])
45.1 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
import numpy as np
import pandas as pd
from datetime import datetime

df = pd.DataFrame({
    'date': ['05/9/2023', '07/10/2023', '08/11/2023', '06/12/2023'],
    'A': [1, np.nan, 4, 7],
    'B': [2, np.nan, 5, 8],
    'C': [3, 6, 9, np.nan]
}).set_index('date')
print(df)
print('.........')

start_time = datetime.now()
df['ColumnA'] = df[df.columns].agg(
    lambda x: ','.join(x.dropna().astype(str)),
    axis=1
)
print(df['ColumnA'])
end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))
"""
A B C
date
05/9/2023 1.0 2.0 3.0
07/10/2023 NaN NaN 6.0
08/11/2023 4.0 5.0 9.0
06/12/2023 7.0 8.0 NaN
...........................
OUTPUT:
date
05/9/2023 1.0,2.0,3.0
07/10/2023 6.0
08/11/2023 4.0,5.0,9.0
06/12/2023 7.0,8.0
Name: ColumnA, dtype: object
Duration: 0:00:00.002998
"""

How do I find: Is the first non-NaN value in each column the maximum for that column in a DataFrame?

For example:
      0     1
0  87.0   NaN
1   NaN  99.0
2   NaN   NaN
3   NaN   NaN
4   NaN  66.0
5   NaN   NaN
6   NaN  77.0
7   NaN   NaN
8   NaN   NaN
9  88.0   NaN
My expected output is [False, True], since 87 is the first non-NaN value in column 0 but not that column's maximum, while 99 is the first non-NaN value in column 1 and is indeed its max.
Option a): just do groupby with first
(may not be 100% reliable)
df.groupby([1]*len(df)).first()==df.max()
Out[89]:
0 1
1 False True
Option b): bfill
Or use bfill (fill each NaN with the next valid value below it in its column; after bfill, the first row then holds the first non-NaN value of each column):
df.bfill().iloc[0]==df.max()
Out[94]:
0 False
1 True
dtype: bool
Option c): stack
df.stack().reset_index(level=1).drop_duplicates('level_1').set_index('level_1')[0]==df.max()
Out[102]:
level_1
0 False
1 True
dtype: bool
Option d): idxmax with first_valid_index
df.idxmax()==df.apply(pd.Series.first_valid_index)
Out[105]:
0 False
1 True
dtype: bool
Option e) (from Pir): idxmax with notna
df.notna().idxmax() == df.idxmax()
Out[107]:
0 False
1 True
dtype: bool
Using pure numpy (I think this is very fast)
>>> np.isnan(df.values).argmin(axis=0) == df.fillna(-np.inf).values.argmax(axis=0)
array([False, True])
The idea is to compare if the index of the first non-nan is also the index of the argmax.
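A hand-trace on the example frame makes that concrete:
np.isnan(df.values).argmin(axis=0)         # array([0, 1]) -> row of the first non-NaN per column
df.fillna(-np.inf).values.argmax(axis=0)   # array([9, 1]) -> row of the column-wise max
# Element-wise comparison of the two gives array([False,  True]).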
Timings
df = pd.concat([df]*1000).reset_index(drop=True) # setup
%timeit np.isnan(df.values).argmin(axis=0) == df.fillna(-np.inf).values.argmax(axis=0)
207 µs ± 8.83 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.groupby([1]*len(df)).first()==df.max()
9.78 ms ± 339 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.bfill().iloc[0]==df.max()
824 µs ± 47.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.stack().reset_index(level=1).drop_duplicates('level_1').set_index('level_1')[0]==df.max()
3.55 ms ± 249 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.idxmax()==df.apply(pd.Series.first_valid_index)
1.5 ms ± 25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.values[df.notnull().idxmax(), np.arange(df.shape[1])] == df.max(axis=0)
1.13 ms ± 14.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.values[(~np.isnan(df.values)).argmax(axis=0), np.arange(df.shape[1])] == df.max(axis=0).values
450 µs ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
We can use numpy's nanmax here for an efficient solution:
a = df.values
np.nanmax(a, 0) == a[np.isnan(a).argmin(0), np.arange(a.shape[1])]
array([False, True])
Timings (a whole lot of options presented here):
Functions
def chris(df):
    a = df.values
    return np.nanmax(a, 0) == a[np.isnan(a).argmin(0), np.arange(a.shape[1])]

def bradsolomon(df):
    return df.values[df.notnull().idxmax(), np.arange(df.shape[1])] == df.max(axis=0).values

def wen1(df):
    return df.groupby([1]*len(df)).first() == df.max()

def wen2(df):
    return df.bfill().iloc[0] == df.max()

def wen3(df):
    return df.idxmax() == df.apply(pd.Series.first_valid_index)

def rafaelc(df):
    return np.isnan(df.values).argmin(axis=0) == df.fillna(-np.inf).values.argmax(axis=0)

def pir(df):
    return df.notna().idxmax() == df.idxmax()
Setup
from timeit import timeit
import matplotlib.pyplot as plt

res = pd.DataFrame(
    index=['chris', 'bradsolomon', 'wen1', 'wen2', 'wen3', 'rafaelc', 'pir'],
    columns=[10, 20, 30, 100, 500, 1000],
    dtype=float
)

for f in res.index:
    for c in res.columns:
        a = np.random.rand(c, c)
        a[a > 0.4] = np.nan
        df = pd.DataFrame(a)
        stmt = '{}(df)'.format(f)
        setp = 'from __main__ import df, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
plt.show()
Results (the log-log plot of relative timings is not reproduced here)
You can do something similar to Wen's answer with the underlying NumPy arrays:
>>> df.values[df.notnull().idxmax(), np.arange(df.shape[1])] == df.max(axis=0).values
array([False, True])
df.max(axis=0) gives the column-wise max.
The left-hand side indexes df.values, which is a 2d array, reducing it to a 1d array that is compared element-wise with the per-column maxes.
If you exclude .values from the right-hand side, the result will just be a Pandas Series:
>>> df.values[df.notnull().idxmax(), np.arange(df.shape[1])] == df.max(axis=0)
0 False
1 True
dtype: bool
After posting the question I came up with this:
def nice_method_name_here(sr):
    # sr > 0 also filters out NaNs (comparisons with NaN are False);
    # .iloc[0] takes the first surviving value by position.
    return sr[sr > 0].iloc[0] == np.max(sr)

print(df.apply(nice_method_name_here))
which seems to work, but I'm not sure yet!

Round up or down an entire column in a dataframe

        A
0  31.353
1  28.945
2  17.377
I want to create a new column df["B"] with the values of column A rounded up to the nearest multiple of 5.
The desired output:
        A     B
0  31.353  35.0
1  28.945  30.0
2  17.377  20.0
I've tried:
import math

def roundup5(x):
    return int(math.ceil(x / 5.0)) * 5

df["B"] = df["A"].apply(roundup5)
I get:
TypeError: unsupported operand type(s) for /: 'str' and 'float'
I think you need to convert the values to floats first, then divide by 5, apply numpy.ceil and multiply by 5 again:
df["B"] = df["A"].astype(float).div(5.0).apply(np.ceil).mul(5)
df["B"] = np.ceil(df["A"].astype(float).div(5.0)).mul(5)
Loop version:
def roundup5(x):
    return int(math.ceil(float(x) / 5.0)) * 5.0

df["B"] = df["A"].apply(roundup5)
print (df)
A B
0 31.353 35.0
1 28.945 30.0
2 17.377 20.0
Timings:
df = pd.concat([df] * 10000, ignore_index=True)
# [30000 rows x 1 columns]
In [327]: %timeit df["B1"] = df["A"].apply(roundup5)
35.7 ms ± 4.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [328]: %timeit df["B2"] = df["A"].astype(float).div(5.0).apply(np.ceil).mul(5)
1.25 ms ± 76.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [329]: %timeit df["B3"] = np.ceil(df["A"].astype(float).div(5.0)).mul(5)
1.19 ms ± 22.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
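The title also mentions rounding down; a hedged sketch for rounding down to the nearest multiple of 5 (same float conversion, np.floor instead of np.ceil; 'B_down' is a made-up column name):
df["B_down"] = np.floor(df["A"].astype(float).div(5.0)).mul(5)
# 31.353 -> 30.0, 28.945 -> 25.0, 17.377 -> 15.0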
