Concatenate column values in a pandas DataFrame while ignoring NaNs - python

I have the following pandas DataFrame
df:
EVNT_ID col1 col2 col3 col4
123454  1    NaN  4    5
628392  NaN  3    NaN  7
293899  2    NaN  NaN  6
127820  9    11   12   19
Now I am trying to concatenate all the columns except the first one, and I want my DataFrame to look like this:
new_df:
EVNT_ID col1 col2 col3 col4 new_col
123454  1    NaN  4    5    1|4|5
628392  NaN  3    NaN  7    3|7
293899  2    NaN  NaN  6    2|6
127820  9    11   12   19   9|11|12|19
I am using the following code
df['new_column'] = df[~df.EVNT_ID].apply(lambda x: '|'.join(x.dropna().astype(str).values), axis=1)
but it is giving me the following error
ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
I would really appreciate it if anyone could point out where I am going wrong.
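For reference, a minimal sketch of the intended selection (an assumption about the goal, not code from the question or the answers below): the ~ operator inverts the values of EVNT_ID, which is what the 'invert' error is about; it does not exclude the column. Dropping the column explicitly avoids that:
# Sketch: select everything except EVNT_ID, then join the non-NaN values per row.
df['new_column'] = df.drop(columns='EVNT_ID').apply(
    lambda x: '|'.join(x.dropna().astype(int).astype(str)), axis=1)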

Try the following code:
df['new_col'] = df.iloc[:, 1:].apply(
    lambda x: '|'.join(str(el) for el in x if str(el) != 'nan'), axis=1)
Initially I considered x.dropna() instead of the str(el) != 'nan' check,
but %timeit showed that dropna() is much slower.
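For reference, the slower dropna-based variant mentioned above would look roughly like this (a sketch of the alternative, not code from the original answer):
# Same idea, but letting pandas drop the NaNs instead of the str(el) != 'nan' check.
df['new_col'] = df.iloc[:, 1:].apply(
    lambda x: '|'.join(str(el) for el in x.dropna()), axis=1)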

You can do this with filter and agg:
df.filter(like='col').agg(
    lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)
0 1|4|5
1 3|7
2 2|6
3 9|11|12|19
dtype: object
Or,
df.drop('EVNT_ID', axis=1).agg(
    lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)
0 1|4|5
1 3|7
2 2|6
3 9|11|12|19
dtype: object
If performance is important, you can use a list comprehension:
joined = [
'|'.join([str(int(x)) for x in r if pd.notna(x)])
for r in df.iloc[:,1:].values.tolist()
]
joined
# ['1|4|5', '3|7', '2|6', '9|11|12|19']
df.assign(new_col=joined)
EVNT_ID col1 col2 col3 col4 new_col
0 123454 1.0 NaN 4.0 5 1|4|5
1 628392 NaN 3.0 NaN 7 3|7
2 293899 2.0 NaN NaN 6 2|6
3 127820 9.0 11.0 12.0 19 9|11|12|19
If you can forgive the overhead of assignment to a DataFrame, here are timings for the two fastest solutions in this post.
df = pd.concat([df] * 1000, ignore_index=True)
# In this post.
%%timeit
[
'|'.join([str(int(x)) for x in r if pd.notna(x)])
for r in df.iloc[:,1:].values.tolist()
]
# RafaelC's answer.
%%timeit
[
'|'.join([k for k in a if k])
for a in zip(*df.fillna('').astype(str).iloc[:, 1:].values.tolist())
]
31.9 ms ± 800 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
23.7 ms ± 409 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Although note the answers aren't identical, because RafaelC's code produces floats: ['1.0|2.0|9.0', '3.0|11.0', ...]. If this is fine, then great. Otherwise you'll need to convert to int, which adds more overhead.
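A rough sketch of that extra conversion step (illustrative only; floats below is a hypothetical sample of joined strings, not output reproduced from the benchmark):
# Strip the float formatting by round-tripping each piece through int.
floats = ['1.0|4.0|5.0', '3.0|11.0']
ints = ['|'.join(str(int(float(v))) for v in s.split('|')) for s in floats]
# ints -> ['1|4|5', '3|11']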

Using list comprehension and zip
>>> [['|'.join([k for k in a if k])] for a in zip(*df.fillna('').astype(str).iloc[:, 1:].values)]
Timing seems alright
df = pd.concat([df]*1000)
%timeit [['|'.join([k for k in a if k])] for a in zip(*df.fillna('').astype(str).iloc[:, 1:].values)]
10.8 ms ± 568 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.filter(like='col').agg(lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)
1.68 s ± 91.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.iloc[:, 1:].apply(lambda x: '|'.join(str(el) for el in x if str(el) != 'nan'), axis=1)
87.8 ms ± 5.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.assign(new_col=['|'.join([str(int(x)) for x in r if ~np.isnan(x)]) for r in df.iloc[:,1:].values])
45.1 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

import numpy as np
import pandas as pd
from datetime import datetime

df = pd.DataFrame({
    'date': ['05/9/2023', '07/10/2023', '08/11/2023', '06/12/2023'],
    'A': [1, np.nan, 4, 7],
    'B': [2, np.nan, 5, 8],
    'C': [3, 6, 9, np.nan]
}).set_index('date')
print(df)
print('.........')
start_time = datetime.now()
df['ColumnA'] = df[df.columns].agg(
    lambda x: ','.join(x.dropna().astype(str)),
    axis=1
)
print(df['ColumnA'])
end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))
"""
A B C
date
05/9/2023 1.0 2.0 3.0
07/10/2023 NaN NaN 6.0
08/11/2023 4.0 5.0 9.0
06/12/2023 7.0 8.0 NaN
...........................
OUTPUT:
date
05/9/2023 1.0,2.0,3.0
07/10/2023 6.0
08/11/2023 4.0,5.0,9.0
06/12/2023 7.0,8.0
Name: ColumnA, dtype: object
Duration: 0:00:00.002998
"""

Related

Remap values in pandas column with a dict, None if KeyError

I'd like to modify the col1 of the following dataframe df:
col1 col2
0 Black 7
1 Death 2
2 Hardcore 6
3 Grindcore 1
4 Deathcore 4
...
I want to use a dict named cat_dic={'Black':'B', 'Death':'D', 'Hardcore':'H'} to get the following dataframe:
col1 col2
0 B 7
1 D 2
2 H 6
3 None 1
4 None 4
...
I know I can use df.map or df.replace, for example like this:
df.replace({"col1":cat_dic})
but I want keys missing from the dictionary to map to None, and with the previous line, I got this result instead:
col1 col2
0 B 7
1 D 2
2 H 6
3 Grindcore 1
4 Deathcore 4
...
Given that Grindcore and Deathcore are not the only two values in col1 that I want set to None, have you got any idea how to do it?
Use dict.get:
df['col1'] = df['col1'].map(lambda x: cat_dic.get(x, None))
#default value is None
df['col1'] = df['col1'].map(cat_dic.get)
print (df)
col1 col2
0 B 7
1 D 2
2 H 6
3 None 1
4 None 4
Performance comparison in 50k rows:
df = pd.concat([df] * 10000, ignore_index=True)
cat_dic={'Black':'B', 'Death':'D', 'Hardcore':'H'}
In [93]: %timeit df['col1'].map(cat_dic.get)
3.22 ms ± 16.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [94]: %timeit df.col1.apply(lambda x: None if x not in cat_dic.keys() else cat_dic[x])
15 ms ± 293 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [95]: %timeit df['col1'].replace(dict(dict.fromkeys(df['col1'].unique(), None), **cat_dic))
12.3 ms ± 409 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [96]: %timeit df.col1.apply(lambda x: None if x not in cat_dic.keys() else x)
13.8 ms ± 837 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [97]: %timeit df['col1'].map(cat_dic).replace(dict({np.nan: None}))
9.97 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
You may use Series.apply first:
df.col1 = df.col1.apply(lambda x: None if x not in cat_dic.keys() else x)
Then, you can safely use DataFrame.replace:
df.replace({"col1":cat_dic})
This can be done in one line:
df1['col1'] = df1.col1.apply(lambda x: None if x not in cat_dic.keys() else cat_dic[x])
Output is:
col1 col2
0 B 7
1 D 2
2 H 6
3 None 1
4 None 4
Here is an easy one-liner which gives the expected output.
df['col1'] = df['col1'].map(cat_dic).replace(dict({np.nan: None}))
Output :
col1 col2
0 B 7
1 D 2
2 H 6
3 None 1
4 None 4
Series.map already maps missing keys to NaN:
print(df['col1'].map(cat_dic))
0 B
1 D
2 H
3 NaN
4 NaN
Name: col1, dtype: object
Anyway, you can update your cat_dic with the missing keys from the col1 column:
cat_dic = dict(dict.fromkeys(df['col1'].unique(), None), **cat_dic)
df['col1'] = df['col1'].replace(cat_dic)
print(cat_dic)
{'Black': 'B', 'Death': 'D', 'Hardcore': 'H', 'Grindcore': None, 'Deathcore': None}
print(df)
col1 col2
0 B 7
1 D 2
2 H 6
3 None 1
4 None 4
In [6]: df.col1.map(cat_dic.get)
Out[6]:
0 B
1 D
2 H
3 None
4 None
dtype: object
You could also use apply; both work. When working on a Series, I think map is faster.
Explanation:
You can get a default value for missing keys by using dict.get instead of the [..] operator. By default, that default value is None, so simply passing the dict.get method to apply/map just works.
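A minimal, standalone illustration of that default behaviour (independent of the DataFrame above):
cat_dic = {'Black': 'B', 'Death': 'D', 'Hardcore': 'H'}
print(cat_dic.get('Black'))       # 'B'
print(cat_dic.get('Grindcore'))   # None - missing keys fall back to the default
print(cat_dic.get('Grindcore', 'unknown'))  # an explicit default can also be supplied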

How to convert a particular dataframe into a series by combining columns

Given the following data, where 3 means yes and 2 means no
t = pd.DataFrame({"v_1": [2, 2, 3], "v_2": [2, 3, 2], "v_3": [3, 2, 2],})
which looks as
v_1 v_2 v_3
0 2 2 3
1 2 3 2
2 3 2 2
I would like to create the following series
0 v_3
1 v_2
2 v_1
All I can think of is the following:
t['V'] = t.sum().reset_index(drop=True)
which gives
v_1 v_2 v_3 V
0 v_3 v_1
1 v_2 v_2
2 v_1 v_3
I'm wondering if there's a nicer approach than this, or perhaps more general.
Perhaps this is what you need, to keep the 3s and concat them in a series?
(
t.apply(lambda x: np.where(x.eq(3), x.name, None))
.stack().reset_index(drop=True)
)
0 v_3
1 v_2
2 v_1
dtype: object
Give this a whirl:
(t
.stack()
.droplevel(0)
.loc[lambda x: x.eq(3)]
.reset_index(name='temp')
.drop('temp',axis=1)
)
index
0 v_3
1 v_2
2 v_1
Use DataFrame.where to replace non-3 values with missing values, then reshape with DataFrame.stack, remove the first level of the MultiIndex, and finally create a Series from the index if performance is important:
s = pd.Series(t.where(t.eq(3)).stack().droplevel(0).index)
#alternative
#s = pd.Series(t.where(t.eq(3)).stack().reset_index(0, drop=True).index)
print (s)
0 v_3
1 v_2
2 v_1
dtype: object
Details:
print (t.where(t.eq(3)))
v_1 v_2 v_3
0 NaN NaN 3.0
1 NaN 3.0 NaN
2 3.0 NaN NaN
print (t.where(t.eq(3)).stack())
0 v_3 3.0
1 v_2 3.0
2 v_1 3.0
dtype: float64
print (t.where(t.eq(3)).stack().droplevel(0))
v_3 3.0
v_2 3.0
v_1 3.0
dtype: float64
Performance for 1k rows and 10 columns:
np.random.seed(123)
t = pd.DataFrame(np.random.choice([2,3], (1000, 10))).add_prefix('v_')
#print (t)
In [25]: %timeit pd.Series(t.where(t.eq(3)).stack().droplevel(0).index)
2.66 ms ± 93.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [26]: %timeit pd.Series(t.where(t.eq(3)).stack().reset_index(0, drop=True).index)
2.61 ms ± 41.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [27]: %timeit t.apply(lambda x: np.where(x.eq(3), x.name, None)).stack().reset_index(drop=True)
5.98 ms ± 46.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [28]: %timeit t.stack().droplevel(0).loc[lambda x: x.eq(3)].reset_index(name='temp').drop('temp',axis=1)
3.48 ms ± 36.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Performance for 100k rows and 10 columns:
t = pd.DataFrame(np.random.choice([2,3], (100000, 10))).add_prefix('v_')
print (t)
In [30]: %timeit pd.Series(t.where(t.eq(3)).stack().droplevel(0).index)
84.7 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [31]: %timeit pd.Series(t.where(t.eq(3)).stack().reset_index(0, drop=True).index)
84.1 ms ± 459 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [32]: %timeit t.apply(lambda x: np.where(x.eq(3), x.name, None)).stack().reset_index(drop=True)
147 ms ± 688 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [33]: %timeit t.stack().droplevel(0).loc[lambda x: x.eq(3)].reset_index(name='temp').drop('temp',axis=1)
101 ms ± 635 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
You can create a new index that has the location of 3 for each column. Then you apply that index to your column names.
import pandas as pd
t = pd.DataFrame({"v_1": [2, 2, 3], "v_2": [2, 3, 2], "v_3": [3, 2, 2],})
index_list = [t[t[col]==3].index[0] for col in t.columns] # create new index
series = pd.Series(t.columns) # series of column names
series.index = index_list # apply index to column names
print(series.sort_index())

How to leave NaN behind after shifting over

I have a function that shifts the values of one column (Col_5) into another column (Col_6) if that column (Col_6) is blank, like this:
def shift(row):
    return row['Col_6'] if not pd.isnull(row['Col_6']) else row['Col_5']
I then apply this function to my columns like this:
df[['Col_6', 'Col_5']].apply(shift, axis=1)
This works fine, but instead of leaving the original value in Col_5, I need it to shift to Col_6 and, in its place, leave a np.nan (so I can apply the same function to the preceding column). Thoughts?
fillna + mask: vectorise, not row-wise
With Pandas, you should try to avoid row-wise operations via apply, as these are processed via Python-level loops. In this case, you can use:
null_mask = df['Col_6'].isnull()
df['Col_6'] = df['Col_6'].fillna(df['Col_5'])
df['Col_5'] = df['Col_5'].mask(null_mask)
Notice we first calculate and store a Boolean series marking where Col_6 is null, then use it afterwards to null out the Col_5 values that have been moved across via fillna.
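A quick check on a small frame (the same data used elsewhere in this thread; this worked example is a sketch, not part of the original answer):
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col_5': [1, np.nan, 3, 4, np.nan],
                   'Col_6': [np.nan, 8, np.nan, 6, np.nan]})
null_mask = df['Col_6'].isnull()
df['Col_6'] = df['Col_6'].fillna(df['Col_5'])
df['Col_5'] = df['Col_5'].mask(null_mask)
print(df)
#    Col_5  Col_6
# 0    NaN    1.0
# 1    NaN    8.0
# 2    NaN    3.0
# 3    4.0    6.0
# 4    NaN    NaN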
import pandas as pd
import numpy as np
df = pd.DataFrame({'Col_5':[1, np.nan, 3, 4, np.nan],
'Col_6':[np.nan, 8, np.nan, 6, np.nan]})
col_5 = df['Col_5'].copy()
df.loc[pd.isnull(df['Col_6']), 'Col_5'] = np.nan
df.loc[pd.isnull(df['Col_6']), 'Col_6'] = col_5
Output:
# Original Dataframe:
Col_5 Col_6
0 1.0 NaN
1 NaN 8.0
2 3.0 NaN
3 4.0 6.0
4 NaN NaN
# Fill Col_5 with NaN where Col_6 is NaN:
Col_5 Col_6
0 NaN NaN
1 NaN 8.0
2 NaN NaN
3 4.0 6.0
4 NaN NaN
# Assign the original col_5 values to Col_6:
Col_5 Col_6
0 NaN 1.0
1 NaN 8.0
2 NaN 3.0
3 4.0 6.0
4 NaN NaN
Setup (using the setup from cosmic_inquiry's answer)
df = pd.DataFrame({'Col_5':[1, np.nan, 3, 4, np.nan],
'Col_6':[np.nan, 8, np.nan, 6, np.nan]})
You can look at this problem as a basic swap operation with a mask.
numpy.flip + numpy.isnan
a = df[['Col_5', 'Col_6']].values
m = np.isnan(a[:, 1])
a[m] = np.flip(a[m], axis=1)
df[['Col_5', 'Col_6']] = a
np.isnan + loc:
m = np.isnan(df['Col_6'])
df.loc[m, ['Col_5', 'Col_6']] = df.loc[m, ['Col_6', 'Col_5']].values
Col_5 Col_6
0 NaN 1.0
1 NaN 8.0
2 NaN 3.0
3 4.0 6.0
4 NaN NaN
Performance
test_df = \
pd.DataFrame(np.random.choice([1, np.nan], (1_000_000, 2)), columns=['Col_5', 'Col_6'])
In [167]: %timeit chris(test_df)
68.3 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [191]: %timeit chris2(test_df)
43.9 ms ± 296 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [168]: %timeit jpp(test_df)
86.7 ms ± 394 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [169]: %timeit cosmic(test_df)
130 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Find Minimum without Zero and NaN in Pandas Dataframe

I have a pandas DataFrame and I want to find the minimum without zeros and NaNs.
I was trying to combine numpy's nonzero and nanmin, but it does not work.
Does someone have an idea?
If you want the minimum over the whole DataFrame, you can try:
m = np.nanmin(df.replace(0, np.nan).values)
Use numpy.where with numpy.nanmin:
df = pd.DataFrame({'B':[4,0,4,5,5,np.nan],
'C':[7,8,9,np.nan,2,3],
'D':[1,np.nan,5,7,1,0],
'E':[5,3,0,9,2,4]})
print (df)
B C D E
0 4.0 7.0 1.0 5
1 0.0 8.0 NaN 3
2 4.0 9.0 5.0 0
3 5.0 NaN 7.0 9
4 5.0 2.0 1.0 2
5 NaN 3.0 0.0 4
Numpy solution:
arr = df.values
a = np.nanmin(np.where(arr == 0, np.nan, arr))
print (a)
1.0
Pandas solution - NaNs are removed by default:
a = df.mask(df==0).min().min()
print (a)
1.0
Performance - one NaN value is added to each row:
np.random.seed(123)
df = pd.DataFrame(np.random.rand(1000,1000))
np.fill_diagonal(df.values, np.nan)
print (df)
#joe answer
In [399]: %timeit np.nanmin(df.replace(0, np.nan).values)
15.3 ms ± 425 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [400]: %%timeit
...: arr = df.values
...: a = np.nanmin(np.where(arr == 0, np.nan, arr))
...:
6.41 ms ± 427 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [401]: %%timeit
...: df.mask(df==0).min().min()
...:
23.9 ms ± 727 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Pandas Dataframe: Replacing NaN with row average

I am trying to learn pandas, but I have been puzzled by the following. I want to replace NaNs in a DataFrame with the row average. Hence something like df.fillna(df.mean(axis=1)) should work, but for some reason it fails for me. Am I missing anything? Is there something wrong with what I'm doing? Is it because it's not implemented? see link here
import pandas as pd
import numpy as np
pd.__version__
Out[44]:
'0.15.2'
In [45]:
df = pd.DataFrame()
df['c1'] = [1, 2, 3]
df['c2'] = [4, 5, 6]
df['c3'] = [7, np.nan, 9]
df
Out[45]:
c1 c2 c3
0 1 4 7
1 2 5 NaN
2 3 6 9
In [46]:
df.fillna(df.mean(axis=1))
Out[46]:
c1 c2 c3
0 1 4 7
1 2 5 NaN
2 3 6 9
However something like this looks to work fine
df.fillna(df.mean(axis=0))
Out[47]:
c1 c2 c3
0 1 4 7
1 2 5 8
2 3 6 9
As commented, the axis argument to fillna is NotImplemented:
df.fillna(df.mean(axis=1), axis=1)
Note: this would be critical here, as you don't want to fill your nth column with the nth row average.
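A minimal sketch of why the original call appears to do nothing: when fillna is given a Series, the Series index is matched against the column labels, and the row-mean Series here is indexed 0, 1, 2 while the columns are c1, c2, c3, so nothing lines up:
import numpy as np
import pandas as pd
df = pd.DataFrame({'c1': [1, 2, 3], 'c2': [4, 5, 6], 'c3': [7, np.nan, 9]})
row_means = df.mean(axis=1)   # indexed 0, 1, 2 - not by column label
print(df.fillna(row_means))   # the NaN survives because no index matches a column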
For now you'll need to iterate through:
m = df.mean(axis=1)
for i, col in enumerate(df):
    # using i allows for duplicate columns
    # inplace *may* not always work here, so IMO the next line is preferred
    # df.iloc[:, i].fillna(m, inplace=True)
    df.iloc[:, i] = df.iloc[:, i].fillna(m)
print(df)
c1 c2 c3
0 1 4 7.0
1 2 5 3.5
2 3 6 9.0
An alternative is to fillna the transpose and then transpose, which may be more efficient...
df.T.fillna(df.mean(axis=1)).T
As an alternative, you could also use an apply with a lambda expression like this:
df.apply(lambda row: row.fillna(row.mean()), axis=1)
yielding also
c1 c2 c3
0 1.0 4.0 7.0
1 2.0 5.0 3.5
2 3.0 6.0 9.0
For an efficient solution, use DataFrame.where:
We could use where on axis=0:
df.where(df.notna(), df.mean(axis=1), axis=0)
or mask on axis=0:
df.mask(df.isna(), df.mean(axis=1), axis=0)
By using axis=0, we can fill in the missing values in each column with the row averages.
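On the small frame from the question, a quick sanity check of the where-based fill looks like this (a worked sketch using the same data):
import numpy as np
import pandas as pd
df = pd.DataFrame({'c1': [1, 2, 3], 'c2': [4, 5, 6], 'c3': [7, np.nan, 9]})
print(df.where(df.notna(), df.mean(axis=1), axis=0))
# the NaN at row 1, column c3 is filled with that row's mean, (2 + 5) / 2 = 3.5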
These methods perform very similarly (where does slightly better on large DataFrames of shape (300_000, 20)), are ~35-50% faster than the numpy methods posted here, and are ~110x faster than the double-transpose method.
Some benchmarks:
df = creator()
>>> %timeit df.where(df.notna(), df.mean(axis=1), axis=0)
542 ms ± 3.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df.mask(df.isna(), df.mean(axis=1), axis=0)
555 ms ± 21.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df.fillna(0) + df.isna().values * df.mean(axis=1).values.reshape(-1,1)
751 ms ± 22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit fill = pd.DataFrame(np.broadcast_to(df.mean(1).to_numpy()[:, None], df.shape), columns=df.columns, index=df.index); df.update(fill, overwrite=False)
848 ms ± 22.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df.apply(lambda row: row.fillna(row.mean()), axis=1)
1min 4s ± 5.32 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit df.T.fillna(df.mean(axis=1)).T
1min 5s ± 2.4 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
def creator():
    A = np.random.rand(300_000, 20)
    A.ravel()[np.random.choice(A.size, 300_000, replace=False)] = np.nan
    return pd.DataFrame(A)
I'll propose an alternative that involves casting into numpy arrays. Performance-wise, I think this is more efficient and probably scales better than the other proposed solutions so far.
The idea is to use an indicator matrix (df.isna().values, which is 1 if the element is N/A and 0 otherwise) and broadcast-multiply it with the row averages.
Thus, we end up with a matrix (exactly the same shape as the original df), which contains the row-average value if the original element was N/A, and 0 otherwise.
We add this matrix to the original df, making sure to fillna with 0 so that, in effect, we have filled the N/A's with the respective row averages.
# setup code
df = pd.DataFrame()
df['c1'] = [1, 2, 3]
df['c2'] = [4, 5, 6]
df['c3'] = [7, np.nan, 9]
# fillna row-wise
row_avgs = df.mean(axis=1).values.reshape(-1,1)
df = df.fillna(0) + df.isna().values * row_avgs
df
giving
c1 c2 c3
0 1.0 4.0 7.0
1 2.0 5.0 3.5
2 3.0 6.0 9.0
You can broadcast the mean to a DataFrame with the same index as the original and then use update with overwrite=False to get the behavior of .fillna. Unlike .fillna, update allows for filling when the Indices have duplicated labels. Should be faster than the looping .fillna for fewer than 50,000 rows or so.
fill = pd.DataFrame(np.broadcast_to(df.mean(1).to_numpy()[:, None], df.shape),
columns=df.columns,
index=df.index)
df.update(fill, overwrite=False)
print(df)
1 1 1
0 1.0 4.0 7.0
0 2.0 5.0 3.5
0 3.0 6.0 9.0
Just had the same problem. I found this workaround to be working:
df.transpose().fillna(df.mean(axis=1)).transpose()
I'm not sure though about the efficiency of this solution.
