How to leave NaN behind after shifting over - python

I have a function that shifts the values of one column (Col_5) into another column (Col_6) if that column (Col_6) is blank, like this:
def shift(row):
    return row['Col_6'] if not pd.isnull(row['Col_6']) else row['Col_5']
I then apply this function to my columns like this:
df[['Col_6', 'Col_5']].apply(shift, axis=1)
This works fine, but instead of leaving the original value in Col_5, I need it to move to Col_6 and leave np.nan in its place (so I can apply the same function to the preceding column). Thoughts?

fillna + mask: vectorise, not row-wise
With Pandas, you should try to avoid row-wise operations via apply, as these are processed via Python-level loops. In this case, you can use:
null_mask = df['Col_6'].isnull()
df['Col_6'] = df['Col_6'].fillna(df['Col_5'])
df['Col_5'] = df['Col_5'].mask(null_mask)
Notice we calculate and store a Boolean series representing where Col_6 is null first, then use it later to make those values null where values have been moved across via fillna.
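Since the question mentions repeating the shift for the preceding column, here is a minimal sketch of how the three lines above might be wrapped and chained. The helper name shift_into and the extra Col_4 column are assumptions made for illustration; they are not part of the original answer.

```python
import numpy as np
import pandas as pd

def shift_into(df, src, dst):
    """Move src values into dst wherever dst is null, leaving NaN behind in src."""
    out = df.copy()
    null_mask = out[dst].isnull()           # rows where dst is empty, captured before filling
    out[dst] = out[dst].fillna(out[src])    # fill the gaps in dst from src
    out[src] = out[src].mask(null_mask)     # blank out src wherever dst was empty
    return out

# Hypothetical three-column frame; Col_4 is added only for this illustration.
df = pd.DataFrame({'Col_4': [10, 20, np.nan, 40, 50],
                   'Col_5': [1, np.nan, 3, 4, np.nan],
                   'Col_6': [np.nan, 8, np.nan, 6, np.nan]})

# Work from right to left so each column can receive values from its neighbour.
result = shift_into(df, 'Col_5', 'Col_6')
result = shift_into(result, 'Col_4', 'Col_5')
print(result)
```

Each pass moves a value by one column only, so a value in Col_4 reaches Col_5 here but is not carried on to Col_6.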

import pandas as pd
import numpy as np
df = pd.DataFrame({'Col_5':[1, np.nan, 3, 4, np.nan],
'Col_6':[np.nan, 8, np.nan, 6, np.nan]})
col_5 = df['Col_5'].copy()
df.loc[pd.isnull(df['Col_6']), 'Col_5'] = np.nan
df.loc[pd.isnull(df['Col_6']), 'Col_6'] = col_5
Output:
# Original Dataframe:
Col_5 Col_6
0 1.0 NaN
1 NaN 8.0
2 3.0 NaN
3 4.0 6.0
4 NaN NaN
# Fill Col_5 with NaN where Col_6 is NaN:
Col_5 Col_6
0 NaN NaN
1 NaN 8.0
2 NaN NaN
3 4.0 6.0
4 NaN NaN
# Assign the original col_5 values to Col_6:
Col_5 Col_6
0 NaN 1.0
1 NaN 8.0
2 NaN 3.0
3 4.0 6.0
4 NaN NaN

Setup (using the setup from @cosmic_inquiry)
df = pd.DataFrame({'Col_5':[1, np.nan, 3, 4, np.nan],
'Col_6':[np.nan, 8, np.nan, 6, np.nan]})
You can look at this problem like a basic swap operation with a mask
numpy.flip + numpy.isnan
a = df[['Col_5', 'Col_6']].values
m = np.isnan(a[:, 1])
a[m] = np.flip(a[m], axis=1)
df[['Col_5', 'Col_6']] = a
np.isnan + loc:
m = np.isnan(df['Col_6'])
df.loc[m, ['Col_5', 'Col_6']] = df.loc[m, ['Col_6', 'Col_5']].values
Col_5 Col_6
0 NaN 1.0
1 NaN 8.0
2 NaN 3.0
3 4.0 6.0
4 NaN NaN
Performance
test_df = \
pd.DataFrame(np.random.choice([1, np.nan], (1_000_000, 2)), columns=['Col_5', 'Col_6'])
In [167]: %timeit chris(test_df)
68.3 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [191]: %timeit chris2(test_df)
43.9 ms ± 296 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [168]: %timeit jpp(test_df)
86.7 ms ± 394 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [169]: %timeit cosmic(test_df)
130 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
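The benchmarked functions chris, chris2, jpp and cosmic are not shown in the thread. A plausible reconstruction, assuming each name simply wraps the corresponding answer above (that mapping is an assumption), might look like the sketch below. Unlike the original snippets, these versions return a copy instead of modifying the input, so absolute timings would differ.

```python
import numpy as np
import pandas as pd

def jpp(df):
    # fillna + mask
    out = df.copy()
    null_mask = out['Col_6'].isnull()
    out['Col_6'] = out['Col_6'].fillna(out['Col_5'])
    out['Col_5'] = out['Col_5'].mask(null_mask)
    return out

def cosmic(df):
    # two .loc assignments with a saved copy of Col_5
    out = df.copy()
    col_5 = out['Col_5'].copy()
    m = out['Col_6'].isnull()
    out.loc[m, 'Col_5'] = np.nan
    out.loc[m, 'Col_6'] = col_5
    return out

def chris(df):
    # numpy.flip on the masked rows of the underlying array
    a = df[['Col_5', 'Col_6']].to_numpy(copy=True)
    m = np.isnan(a[:, 1])
    a[m] = np.flip(a[m], axis=1)
    return pd.DataFrame(a, index=df.index, columns=['Col_5', 'Col_6'])

def chris2(df):
    # swap the two columns through .loc on the masked rows
    out = df.copy()
    m = np.isnan(out['Col_6'])
    out.loc[m, ['Col_5', 'Col_6']] = out.loc[m, ['Col_6', 'Col_5']].values
    return out

sample = pd.DataFrame({'Col_5': [1, np.nan, 3, 4, np.nan],
                       'Col_6': [np.nan, 8, np.nan, 6, np.nan]})
results = [fn(sample) for fn in (jpp, cosmic, chris, chris2)]
# All four approaches should agree on the sample frame.
print(all(r.equals(results[0]) for r in results))
```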

Related

Pandas convert boolean column to column name when true

I have a df with boolean values (well int values that are either 0 or 1, but that's not important right now):
A B C D
0 0 1 0
1 0 0 0
0 1 1 1
1 0 0 1
And I want to convert it so that "1" (True) values are converted to the header name of the column and 0 values to NaN. The resulting df needs not have a header.
Expected output:
NaN NaN C NaN
A NaN NaN NaN
NaN B C D
A NaN NaN D
Iterating over the rows and assigning those values with a check could work, but is there no faster/more pandas-idiomatic way?
Maybe something with DataFrame.apply:
df.apply(lambda s: [s.name if v == 1 else np.nan for v in s])
With numpy where
np.where(df == 1, df.columns, np.nan)
array([[nan, nan, 'C', nan],
['A', nan, nan, nan],
[nan, 'B', 'C', 'D'],
['A', nan, nan, 'D']], dtype=object)
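This works because np.where broadcasts the 1-D array of column labels across every row of the 2-D condition. A self-contained sketch using the sample data from the question (the variable names labels and result are illustrative only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 0, 1],
                   'B': [0, 0, 1, 0],
                   'C': [1, 0, 1, 0],
                   'D': [0, 0, 1, 1]})

# The condition has shape (4, 4); the column labels have shape (4,),
# so each label is lined up with its own column in every row.
labels = np.where(df.to_numpy() == 1, df.columns.to_numpy(), np.nan)

# Wrap the object array back into a DataFrame, keeping the original row index.
result = pd.DataFrame(labels, index=df.index)
print(result)
```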
How to convert np.array to pd.DataFrame (added by @jezrael)
df = pd.DataFrame(np.where(df == 1, df.columns, np.nan), columns=df.columns)
print (df)
A B C D
0 NaN NaN C NaN
1 A NaN NaN NaN
2 NaN B C D
3 A NaN NaN D
Use numpy.where with DataFrame constructor and no columns parameter if performance is important:
df = pd.DataFrame(np.where(df == 1, df.columns, np.nan))
print (df)
0 1 2 3
0 NaN NaN C NaN
1 A NaN NaN NaN
2 NaN B C D
3 A NaN NaN D
If you need to write the output to a file without column names and index values, add index=False and header=None to DataFrame.to_csv:
df.to_csv('file.csv', index=False, header=None)
EDIT:
If performance is important, avoid apply, because it loops under the hood. The most vectorized and fastest solution here is np.where:
#[40000 rows x 40 columns]
df = pd.concat([df] * 10000, ignore_index=True)
df = pd.concat([df] * 10, ignore_index=True, axis=1)
In [180]: %%timeit
...: for i in df.columns:
...:     df[i] = df[i].apply(lambda x: i if x==1 else np.nan)
...:
690 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [181]: %%timeit
...: df.apply(lambda s: [s.name if v == 1 else np.nan for v in s])
...:
680 ms ± 23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [182]: %%timeit
...: pd.DataFrame(np.where(df == 1, df.columns, np.nan))
...:
42.7 ms ± 3.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [183]: %%timeit
...: df.T.where(df.T != 1, df.columns).T.where(df != 0, np.nan)
...:
17 s ± 644 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can use this:
for i in df.columns:
    df[i] = df[i].apply(lambda x: i if x==1 else np.nan)
df.columns = [''] * len(df.columns)
You can use np.where or DataFrame.mask, like below:
np.where(df.values==1, df.columns, np.nan)
## or
df.mask(df==1,df.columns)
You can also use where from pandas (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html).
Note that the transpose (T) is needed to get the correct result.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [0,1,0,1],
'B': [0,0,1,0],
'C': [1,0,1,0],
'D': [0,0,1,1]
})
df = df.T.where(df.T != 1, df.columns).T.where(df != 0, np.nan)
Output:
A B C D
0 NaN NaN C NaN
1 A NaN NaN NaN
2 NaN B C D
3 A NaN NaN D

Inserting "missing" multiindex rows into a Pandas Dataframe

I have a pandas DataFrame with a two-level multiindex. The second level is numeric and supposed to be sorted and sequential for each unique value of the first-level index, but has gaps. How do I insert the "missing" rows? Sample input:
import pandas as pd
df = pd.DataFrame(list(range(5)),
                  index=pd.MultiIndex.from_tuples([('A',1), ('A',3),
                                                   ('B',2), ('B',3), ('B',6)]),
                  columns=['value'])  # columns must be list-like, not a bare string
# value
#A 1 0
# 3 1
#B 2 2
# 3 3
# 6 4
Expected output:
# value
#A 1 0
# 2 NaN
# 3 1
#B 2 2
# 3 3
# 4 NaN
# 5 NaN
# 6 4
I suspect I could have used resample, but I am having trouble converting the numbers to anything date-like.
If there is a will, there is a way. I am not proud of this, but I think it works.
Try:
def f(x):
    levels = x.index.remove_unused_levels().levels
    x = x.reindex(pd.MultiIndex.from_product([levels[0], np.arange(levels[1][0], levels[1][-1]+1)]))
    return x
df.groupby(level=0, as_index=False, group_keys=False).apply(f)
Output:
value
A 1 0.0
2 NaN
3 1.0
B 2 2.0
3 3.0
4 NaN
5 NaN
6 4.0
After much deliberation, I was able to come up with a solution myself. Judging by how clumsy it is, the problem I am facing is not a very typical one.
new_index = d.index.to_frame()\
.groupby(0)[1]\
.apply(lambda x:
pd.Series(1, index=range(x.min(), x.max() + 1))).index
d.reindex(new_index)
Depending on which index values are missing, you can simply use:
result.unstack(1).stack(0, dropna=False).fillna(0)
When you unstack, pandas expands the frame into rows and columns; in the example above, the level 1 index values become the column names. Stacking again returns the frame to its original shape, but this time you need to pass dropna=False so that NaN values are kept for the missing index entries. The final .fillna(0) is optional, depending on what you want to do with the NaN values.
There's no accounting for taste, but I think falling back to list comprehension leads to slightly more readable code:
df.reindex(
pd.MultiIndex.from_tuples([
(level_0, level_1)
for level_0 in df.reset_index(0).level_0.unique()
for level_1 in range(
df.reset_index(1).loc[level_0, "level_1"].min(),
df.reset_index(1).loc[level_0, "level_1"].max()+1
)
]))
# Output:
#value
#A 1 0.0
# 2 NaN
# 3 1.0
#B 2 2.0
# 3 3.0
# 4 NaN
# 5 NaN
# 6 4.0
Although this is of course slower than going down the apply route:
list-comprehension: 2.57 ms ± 19 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
DYZ apply: 1.25 ms ± 8.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Scott's apply: 2.19 ms ± 9.84 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
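The answers in this thread share one idea: build the complete (level 0, level 1) index for each group's min-to-max range, then reindex. A plain-Python sketch of that idea, using the sample data from the question (the names spans and full_index are illustrative, and this is not one of the benchmarked solutions):

```python
import pandas as pd

df = pd.DataFrame({'value': list(range(5))},
                  index=pd.MultiIndex.from_tuples(
                      [('A', 1), ('A', 3), ('B', 2), ('B', 3), ('B', 6)]))

# Track the smallest and largest second-level value seen for each first-level key.
spans = {}
for key, pos in df.index:
    lo, hi = spans.get(key, (pos, pos))
    spans[key] = (min(lo, pos), max(hi, pos))

# Enumerate every integer in each key's range, then reindex onto that full index.
full_index = pd.MultiIndex.from_tuples(
    [(key, pos) for key, (lo, hi) in spans.items() for pos in range(lo, hi + 1)])
result = df.reindex(full_index)
print(result)
```

Rows that did not exist in the original frame come back as NaN, which is why the value column becomes float.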

Concatenate column values in a pandas DataFrame while ignoring NaNs

I have the following pandas table
df:
EVNT_ID col1 col2 col3 col4
123454 1 Nan 4 5
628392 Nan 3 Nan 7
293899 2 Nan Nan 6
127820 9 11 12 19
Now I am trying to concatenate all the columns except the first, and I want my data frame to look like this:
new_df:
EVNT_ID col1 col2 col3 col4 new_col
123454 1 Nan 4 5 1|4|5
628392 Nan 3 Nan 7 3|7
293899 2 Nan Nan 6 2|6
127820 9 11 12 19 9|11|12|19
I am using the following code
df['new_column'] = df[~df.EVNT_ID].apply(lambda x: '|'.join(x.dropna().astype(str).values), axis=1)
but it is giving me the following error
ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
I would really appreciate it if anyone could point out where I am going wrong.
Try the following code:
df['new_col'] = df.iloc[:, 1:].apply(lambda x:
'|'.join(str(el) for el in x if str(el) != 'nan'), axis=1)
Initially I thought about x.dropna() instead of x if str(el) != 'nan',
but %timeit showed that dropna() works much slower.
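As for why the code in the question fails: `~df.EVNT_ID` applies an elementwise invert to the values of EVNT_ID; it does not mean "every column except EVNT_ID", and the error presumably appears because the column's dtype does not support that operation. A minimal sketch of excluding the column by name instead, with the question's table rebuilt by hand (the variable name value_cols is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'EVNT_ID': [123454, 628392, 293899, 127820],
                   'col1': [1, np.nan, 2, 9],
                   'col2': [np.nan, 3, np.nan, 11],
                   'col3': [4, np.nan, np.nan, 12],
                   'col4': [5, 7, 6, 19]})

# Select the value columns by dropping the ID label from the column index.
value_cols = df.columns.drop('EVNT_ID')

# Join the non-missing values of each row, casting to int to avoid '1.0'-style text.
df['new_col'] = [
    '|'.join(str(int(v)) for v in row if pd.notna(v))
    for row in df[value_cols].to_numpy()
]
print(df)
```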
You can do this with filter and agg:
df.filter(like='col').agg(
lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)
0 1|4|5
1 3|7
2 2|6
3 9|11|12|19
dtype: object
Or,
df.drop(columns='EVNT_ID').agg(
lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)
0 1|4|5
1 3|7
2 2|6
3 9|11|12|19
dtype: object
If performance is important, you can use a list comprehension:
joined = [
'|'.join([str(int(x)) for x in r if pd.notna(x)])
for r in df.iloc[:,1:].values.tolist()
]
joined
# ['1|4|5', '3|7', '2|6', '9|11|12|19']
df.assign(new_col=joined)
EVNT_ID col1 col2 col3 col4 new_col
0 123454 1.0 NaN 4.0 5 1|4|5
1 628392 NaN 3.0 NaN 7 3|7
2 293899 2.0 NaN NaN 6 2|6
3 127820 9.0 11.0 12.0 19 9|11|12|19
If you can forgive the overhead of assignment to a DataFrame, here's timings for the two fastest solutions here.
df = pd.concat([df] * 1000, ignore_index=True)
# In this post.
%%timeit
[
'|'.join([str(int(x)) for x in r if pd.notna(x)])
for r in df.iloc[:,1:].values.tolist()
]
# RafaelC's answer.
%%timeit
[
'|'.join([k for k in a if k])
for a in zip(*df.fillna('').astype(str).iloc[:, 1:].values.tolist())
]
31.9 ms ± 800 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
23.7 ms ± 409 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Note, though, that the answers aren't identical: @RafaelC's code produces floats, and because of zip(*...) it joins down each column rather than across each row: ['1.0|2.0|9.0', '3.0|11.0', ...]. If this is fine, then great. Otherwise you'll need to convert to int, which adds more overhead.
Using list comprehension and zip
>>> [['|'.join([k for k in a if k])] for a in zip(*df.fillna('').astype(str).iloc[:, 1:].values)]
Timing seems alright
df = pd.concat([df]*1000)
%timeit [['|'.join([k for k in a if k])] for a in zip(*df.fillna('').astype(str).iloc[:, 1:].values)]
10.8 ms ± 568 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.filter(like='col').agg(lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)
1.68 s ± 91.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.iloc[:, 1:].apply(lambda x: '|'.join(str(el) for el in x if str(el) != 'nan'), axis=1)
87.8 ms ± 5.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.assign(new_col=['|'.join([str(int(x)) for x in r if ~np.isnan(x)]) for r in df.iloc[:,1:].values])
45.1 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
import numpy as np
import pandas as pd
from datetime import datetime
df = pd.DataFrame({
'date' : ['05/9/2023', '07/10/2023', '08/11/2023', '06/12/2023'],
'A' : [1, np.nan,4, 7],
'B' : [2, np.nan, 5, 8],
'C' : [3, 6, 9, np.nan]
}).set_index('date')
print(df)
print('.........')
start_time = datetime.now()
df['ColumnA'] = df[df.columns].agg(
lambda x: ','.join(x.dropna().astype(str)),
axis=1
)
print(df['ColumnA'])
end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))
"""
A B C
date
05/9/2023 1.0 2.0 3.0
07/10/2023 NaN NaN 6.0
08/11/2023 4.0 5.0 9.0
06/12/2023 7.0 8.0 NaN
...........................
OUTPUT:
date
05/9/2023 1.0,2.0,3.0
07/10/2023 6.0
08/11/2023 4.0,5.0,9.0
06/12/2023 7.0,8.0
Name: ColumnA, dtype: object
Duration: 0:00:00.002998
"""

Find Minimum without Zero and NaN in Pandas Dataframe

I have a pandas DataFrame and I want to find the minimum, ignoring zeros and NaNs.
I was trying to combine numpy's nonzero and nanmin, but it does not work.
Does anyone have an idea?
If you want the minimum over the whole DataFrame, you can try this:
m = np.nanmin(df.replace(0, np.nan).values)
Use numpy.where with numpy.nanmin:
df = pd.DataFrame({'B':[4,0,4,5,5,np.nan],
'C':[7,8,9,np.nan,2,3],
'D':[1,np.nan,5,7,1,0],
'E':[5,3,0,9,2,4]})
print (df)
B C D E
0 4.0 7.0 1.0 5
1 0.0 8.0 NaN 3
2 4.0 9.0 5.0 0
3 5.0 NaN 7.0 9
4 5.0 2.0 1.0 2
5 NaN 3.0 0.0 4
Numpy solution:
arr = df.values
a = np.nanmin(np.where(arr == 0, np.nan, arr))
print (a)
1.0
Pandas solution - NaNs are removed by default:
a = df.mask(df==0).min().min()
print (a)
1.0
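The double .min() in the pandas solution works in two steps: the first call reduces each column to its own minimum, and the second reduces those column minima to a single value. A short sketch on the same sample frame (the names per_column and overall are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'B': [4, 0, 4, 5, 5, np.nan],
                   'C': [7, 8, 9, np.nan, 2, 3],
                   'D': [1, np.nan, 5, 7, 1, 0],
                   'E': [5, 3, 0, 9, 2, 4]})

# Replace zeros with NaN; min() then skips NaN by default.
per_column = df.mask(df == 0).min()   # one minimum per column
overall = per_column.min()            # minimum over the whole frame

print(per_column)
print(overall)  # 1.0
```

Stopping after the first .min() is useful when the per-column minima are what you actually need.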
Performance - one NaN value is added to each row:
np.random.seed(123)
df = pd.DataFrame(np.random.rand(1000,1000))
np.fill_diagonal(df.values, np.nan)
print (df)
#joe answer
In [399]: %timeit np.nanmin(df.replace(0, np.nan).values)
15.3 ms ± 425 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [400]: %%timeit
...: arr = df.values
...: a = np.nanmin(np.where(arr == 0, np.nan, arr))
...:
6.41 ms ± 427 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [401]: %%timeit
...: df.mask(df==0).min().min()
...:
23.9 ms ± 727 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Pandas Dataframe: Replacing NaN with row average

I am trying to learn pandas, but I have been puzzled by the following. I want to replace NaNs in a DataFrame with the row average. Hence something like df.fillna(df.mean(axis=1)) should work, but for some reason it fails for me. Am I missing something, or is there something wrong with what I'm doing? Is it because it's not implemented? See link here
import pandas as pd
import numpy as np
pd.__version__
Out[44]:
'0.15.2'
In [45]:
df = pd.DataFrame()
df['c1'] = [1, 2, 3]
df['c2'] = [4, 5, 6]
df['c3'] = [7, np.nan, 9]
df
Out[45]:
c1 c2 c3
0 1 4 7
1 2 5 NaN
2 3 6 9
In [46]:
df.fillna(df.mean(axis=1))
Out[46]:
c1 c2 c3
0 1 4 7
1 2 5 NaN
2 3 6 9
However, something like this seems to work fine:
df.fillna(df.mean(axis=0))
Out[47]:
c1 c2 c3
0 1 4 7
1 2 5 8
2 3 6 9
As commented the axis argument to fillna is NotImplemented.
df.fillna(df.mean(axis=1), axis=1)
Note: this would be critical here as you don't want to fill in your nth columns with the nth row average.
For now you'll need to iterate through:
m = df.mean(axis=1)
for i, col in enumerate(df):
    # using i allows for duplicate columns
    # inplace *may* not always work here, so IMO the next line is preferred
    # df.iloc[:, i].fillna(m, inplace=True)
    df.iloc[:, i] = df.iloc[:, i].fillna(m)
print(df)
c1 c2 c3
0 1 4 7.0
1 2 5 3.5
2 3 6 9.0
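One way to see why the plain df.fillna(df.mean(axis=1)) call from the question leaves the NaN in place: when fillna on a DataFrame is given a Series, the Series index is matched against the column labels. The row means are indexed by 0, 1, 2, which match none of c1, c2, c3, whereas the column means are indexed by the column names. A small sketch (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': [1, 2, 3], 'c2': [4, 5, 6], 'c3': [7, np.nan, 9]})

row_means = df.mean(axis=1)   # indexed by row labels 0, 1, 2
col_means = df.mean(axis=0)   # indexed by column labels c1, c2, c3

unchanged = df.fillna(row_means)   # no column is named 0, 1 or 2, so nothing is filled
filled = df.fillna(col_means)      # the c3 mean fills the hole in column c3

print(unchanged)
print(filled)
```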
An alternative is to fillna the transpose and then transpose, which may be more efficient...
df.T.fillna(df.mean(axis=1)).T
As an alternative, you could also use an apply with a lambda expression like this:
df.apply(lambda row: row.fillna(row.mean()), axis=1)
yielding also
c1 c2 c3
0 1.0 4.0 7.0
1 2.0 5.0 3.5
2 3.0 6.0 9.0
For an efficient solution, use DataFrame.where:
We could use where on axis=0:
df.where(df.notna(), df.mean(axis=1), axis=0)
or mask on axis=0:
df.mask(df.isna(), df.mean(axis=1), axis=0)
By using axis=0, we can fill in the missing values in each column with the row averages.
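A minimal sketch of the where variant on the small frame from the question, checked against the row-wise apply answer (the names filled and by_apply are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': [1, 2, 3], 'c2': [4, 5, 6], 'c3': [7, np.nan, 9]})

row_means = df.mean(axis=1)

# Keep existing values; where a cell is NaN, take the row mean instead.
# axis=0 aligns row_means with the row index rather than the columns.
filled = df.where(df.notna(), row_means, axis=0)

# Reference result from the slower row-wise apply.
by_apply = df.apply(lambda row: row.fillna(row.mean()), axis=1)

print(filled)
print(np.allclose(filled.to_numpy(dtype=float), by_apply.to_numpy(dtype=float)))
```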
These two methods perform very similarly (where does slightly better on a large DataFrame of shape (300_000, 20)). They are ~35-50% faster than the numpy methods posted here and about 110x faster than the double-transpose method.
Some benchmarks:
df = creator()
>>> %timeit df.where(df.notna(), df.mean(axis=1), axis=0)
542 ms ± 3.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df.mask(df.isna(), df.mean(axis=1), axis=0)
555 ms ± 21.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df.fillna(0) + df.isna().values * df.mean(axis=1).values.reshape(-1,1)
751 ms ± 22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit fill = pd.DataFrame(np.broadcast_to(df.mean(1).to_numpy()[:, None], df.shape), columns=df.columns, index=df.index); df.update(fill, overwrite=False)
848 ms ± 22.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df.apply(lambda row: row.fillna(row.mean()), axis=1)
1min 4s ± 5.32 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit df.T.fillna(df.mean(axis=1)).T
1min 5s ± 2.4 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
def creator():
    A = np.random.rand(300_000, 20)
    A.ravel()[np.random.choice(A.size, 300_000, replace=False)] = np.nan
    return pd.DataFrame(A)
I'll propose an alternative that involves casting into numpy arrays. Performance wise, I think this is more efficient and probably scales better than the other proposed solutions so far.
The idea being to use an indicator matrix (df.isna().values which is 1 if the element is N/A, 0 otherwise) and broadcast-multiplying that to the row averages.
Thus, we end up with a matrix (exactly the same shape as the original df), which contains the row-average value if the original element was N/A, and 0 otherwise.
We add this matrix to the original df, making sure to fillna with 0 so that, in effect, we have filled the N/A's with the respective row averages.
# setup code
df = pd.DataFrame()
df['c1'] = [1, 2, 3]
df['c2'] = [4, 5, 6]
df['c3'] = [7, np.nan, 9]
# fillna row-wise
row_avgs = df.mean(axis=1).values.reshape(-1,1)
df = df.fillna(0) + df.isna().values * row_avgs
df
giving
c1 c2 c3
0 1.0 4.0 7.0
1 2.0 5.0 3.5
2 3.0 6.0 9.0
You can broadcast the mean to a DataFrame with the same index as the original and then use update with overwrite=False to get the behavior of .fillna. Unlike .fillna, update allows for filling when the Indices have duplicated labels. Should be faster than the looping .fillna for smaller than 50,000 rows or so.
fill = pd.DataFrame(np.broadcast_to(df.mean(1).to_numpy()[:, None], df.shape),
columns=df.columns,
index=df.index)
df.update(fill, overwrite=False)
print(df)
1 1 1
0 1.0 4.0 7.0
0 2.0 5.0 3.5
0 3.0 6.0 9.0
Just had the same problem. I found this workaround to be working:
df.transpose().fillna(df.mean(axis=1)).transpose()
I'm not sure though about the efficiency of this solution.
