I have a df with many variables and I need to concatenate only 3 float variables from it:
v1 v2 v3
0 2.0 NaN 1.0
1 1.0 1.0 1.0
2 NaN NaN 2.0
3 NaN NaN NaN
4 NaN NaN 2.0
df.dtypes
v1 float64
v2 float64
v3 float64
dtype: object
I need to concatenate all 3 variables into df['concatenated'] to get this result:
v1 v2 v3 concatenated
0 2.0 NaN 1.0 2.0_NaN_1.0
1 1.0 1.0 1.0 1.0_1.0_1.0
2 NaN NaN 2.0 NaN_NaN_2.0
3 NaN NaN NaN NaN_NaN_NaN
4 NaN NaN 2.0 NaN_NaN_2.0
If the capitalization of your NaNs doesn't matter, this would be sufficient:
df['concatenated'] = df.astype(str).apply('_'.join, axis=1)
>>> df
v1 v2 v3 concatenated
0 2.0 NaN 1.0 2.0_nan_1.0
1 1.0 1.0 1.0 1.0_1.0_1.0
2 NaN NaN 2.0 nan_nan_2.0
3 NaN NaN NaN nan_nan_nan
4 NaN NaN 2.0 nan_nan_2.0
If the capitalization matters, then you have to use replace beforehand:
df['concatenated'] = df.astype(str).replace('nan', 'NaN').apply('_'.join, axis=1)
>>> df
v1 v2 v3 concatenated
0 2.0 NaN 1.0 2.0_NaN_1.0
1 1.0 1.0 1.0 1.0_1.0_1.0
2 NaN NaN 2.0 NaN_NaN_2.0
3 NaN NaN NaN NaN_NaN_NaN
4 NaN NaN 2.0 NaN_NaN_2.0
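Since the question says df has many variables beyond these three, it is safer to restrict the concatenation to just those columns. A self-contained sketch (small frame rebuilt for illustration; the 'other' column is a hypothetical extra variable):

```python
import numpy as np
import pandas as pd

# Rebuilt sample; 'other' stands in for the unrelated columns
df = pd.DataFrame({'v1': [2.0, 1.0, np.nan],
                   'v2': [np.nan, 1.0, np.nan],
                   'v3': [1.0, 1.0, 2.0],
                   'other': ['a', 'b', 'c']})

cols = ['v1', 'v2', 'v3']
# astype(str) renders missing values as 'nan'; fix the capitalization, then join
df['concatenated'] = (df[cols].astype(str)
                              .replace('nan', 'NaN')
                              .apply('_'.join, axis=1))
```

This leaves the other columns untouched and produces the same `2.0_NaN_1.0`-style strings as above.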
I have two dataframes shown below:
df_1 =
Lon Lat N
0 2 1 1
1 2 2 3
2 2 3 1
3 3 2 2
and
df_2 =
Lon Lat N
0 1.0 1.0 NaN
1 2.0 1.0 NaN
2 3.0 1.0 NaN
3 4.0 1.0 NaN
4 1.0 2.0 NaN
5 2.0 2.0 NaN
6 3.0 2.0 NaN
7 4.0 2.0 NaN
8 1.0 3.0 NaN
9 2.0 3.0 NaN
10 3.0 3.0 NaN
11 4.0 3.0 NaN
What I want to do is to compare these two dfs and merge them according to Lon and Lat. That is to say, a NaN in df_2 will be replaced with the value from df_1 where the corresponding Lon and Lat are identical. The ideal output should be:
Lon Lat N
0 1.0 1.0 NaN
1 2.0 1.0 1
2 3.0 1.0 NaN
3 4.0 1.0 NaN
4 1.0 2.0 NaN
5 2.0 2.0 3
6 3.0 2.0 2
7 4.0 2.0 NaN
8 1.0 3.0 NaN
9 2.0 3.0 1
10 3.0 3.0 NaN
11 4.0 3.0 NaN
The reason I want to do this is that df_1's coordinates Lat and Lon form a non-rectangular (unstructured) grid, and I need to fill some NaN values so as to get a rectangular meshgrid and make contourf applicable. It would be highly appreciated if you can provide better ways to make the contour plot.
I have tried df_2.combine_first(df_1), but it doesn't work.
Thanks!
df_2.drop(columns = 'N').merge(df_1, on = ['Lon', 'Lat'], how = 'left')
Lon Lat N
0 1.0 1.0 NaN
1 2.0 1.0 1.0
2 3.0 1.0 NaN
3 4.0 1.0 NaN
4 1.0 2.0 NaN
5 2.0 2.0 3.0
6 3.0 2.0 2.0
7 4.0 2.0 NaN
8 1.0 3.0 NaN
9 2.0 3.0 1.0
10 3.0 3.0 NaN
11 4.0 3.0 NaN
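For reference, a self-contained run of the merge (both frames rebuilt from the question; df_1 written with float keys so the join keys match df_2's dtype):

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({'Lon': [2.0, 2.0, 2.0, 3.0],
                     'Lat': [1.0, 2.0, 3.0, 2.0],
                     'N':   [1, 3, 1, 2]})
df_2 = pd.DataFrame({'Lon': np.tile([1.0, 2.0, 3.0, 4.0], 3),
                     'Lat': np.repeat([1.0, 2.0, 3.0], 4),
                     'N':   np.nan})

# Drop the all-NaN N, then left-join df_1's N back on the coordinates
out = df_2.drop(columns='N').merge(df_1, on=['Lon', 'Lat'], how='left')
```

The left join keeps df_2's 12-row rectangular grid and its row order; coordinates absent from df_1 simply stay NaN. Note N comes back as float64 (1.0, not 1) because NaN forces a float column.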
If you first create df_2 with all needed values, you can update it with the second DataFrame by using pandas.DataFrame.update.
For this you first need to set the correct index by using pandas.DataFrame.set_index.
Have a look at this post for more information.
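A minimal sketch of that update approach, with both frames rebuilt from the question:

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({'Lon': [2.0, 2.0, 2.0, 3.0],
                     'Lat': [1.0, 2.0, 3.0, 2.0],
                     'N':   [1, 3, 1, 2]})
df_2 = pd.DataFrame({'Lon': np.tile([1.0, 2.0, 3.0, 4.0], 3),
                     'Lat': np.repeat([1.0, 2.0, 3.0], 4),
                     'N':   np.nan})

# Align both frames on the coordinate pair, then update in place
df_2 = df_2.set_index(['Lon', 'Lat'])
df_2.update(df_1.set_index(['Lon', 'Lat']))
df_2 = df_2.reset_index()
```

update() modifies df_2 in place (it returns None), writing df_1's non-NaN values wherever the (Lon, Lat) index matches; the rest of the grid keeps its NaNs.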
I want to convert below dataframe,
ID TYPE A B
0 1 MISSING 0.0 0.0
1 2 1T 1.0 2.0
2 2 2T 3.0 4.0
3 3 MISSING 0.0 0.0
4 4 2T 10.0 4.0
5 5 CBN 15.0 20.0
6 5 DSV 25.0 35.0
to:
ID MISSING_A MISSING_B 1T_A 1T_B 2T_A 2T_B CBN_A CBN_B DSV_A DSV_B
0 1 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
1 2 NaN NaN 1.0 2.0 3.0 4.0 NaN NaN NaN NaN
3 3 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
4 4 NaN NaN NaN NaN 10.0 4.0 NaN NaN NaN NaN
5 5 NaN NaN NaN NaN NaN NaN 15.0 20.0 25.0 35.0
For IDs with multiple types, multiple rows for A and B to merge into one row as shown above.
You are looking for a pivot, which will end up giving you a multi-index. You'll need to join those columns to get the suffix you are looking for.
df = df.pivot(index='ID', columns='TYPE', values=['A', 'B'])
df.columns = ['_'.join(reversed(col)) for col in df.columns]
df = df.reset_index()
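A self-contained run of the pivot (data rebuilt from the question):

```python
import pandas as pd

df = pd.DataFrame({'ID':   [1, 2, 2, 3, 4, 5, 5],
                   'TYPE': ['MISSING', '1T', '2T', 'MISSING', '2T', 'CBN', 'DSV'],
                   'A':    [0.0, 1.0, 3.0, 0.0, 10.0, 15.0, 25.0],
                   'B':    [0.0, 2.0, 4.0, 0.0, 4.0, 20.0, 35.0]})

out = df.pivot(index='ID', columns='TYPE', values=['A', 'B'])
# Columns are now (value, TYPE) tuples, e.g. ('A', '1T'); reverse and join
# each tuple to get the 'TYPE_value' names, e.g. '1T_A'
out.columns = ['_'.join(reversed(col)) for col in out.columns]
out = out.reset_index()
```

pivot sorts the TYPE level alphabetically, so the column order differs from the order shown in the question, but every TYPE_A/TYPE_B pair is there and can be reordered afterwards if needed.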
I have a Pandas dataframe that I want to forward fill HORIZONTALLY, but I don't want to forward fill past the last entry in each row. This is time-series pricing data on products, some of which have been discontinued, so I don't want the last recorded value to be forward filled to current.
FWDFILL.apply(lambda series: series.loc[:series.last_valid_index()].ffill())
^The code I have included does what I want but it does it VERTICALLY. This could maybe help people as a starting point.
>>> print(FWDFILL)
1 1 NaN NaN 2 NaN
2 NaN 1 NaN 5 NaN
3 NaN 3 1 NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN 5 NaN NaN 1
Desired Output:
1 1 1 1 2 NaN
2 NaN 1 1 5 NaN
3 NaN 3 1 NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN 5 5 5 1
IIUC, you need to apply with axis=1, so you are applying to dataframe rows instead of dataframe columns.
df.apply(lambda x: x[:x.last_valid_index()].ffill(), axis=1)
Output:
1 2 3 4 5
0
1 1.0 1.0 1.0 2.0 NaN
2 NaN 1.0 1.0 5.0 NaN
3 NaN 3.0 1.0 NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN 5.0 5.0 5.0 1.0
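A runnable version of this answer, using .loc to make the label slice explicit (sample data rebuilt from the question, with hypothetical column names a-e):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, np.nan, 2,      np.nan],
                   [np.nan, 1,      np.nan, 5,      np.nan],
                   [np.nan, 3,      1,      np.nan, np.nan],
                   [np.nan] * 5,
                   [np.nan, 5,      np.nan, np.nan, 1]],
                  columns=list('abcde'))

# For each row, slice up to the last valid label (inclusive) and ffill
# only that part; the trailing NaNs are never touched
out = df.apply(lambda x: x.loc[:x.last_valid_index()].ffill(), axis=1)
```

For an all-NaN row, last_valid_index() returns None, so the slice covers the whole row and it stays all-NaN. Cells past each row's last value come back as NaN because the returned per-row Series are shorter and get NaN-padded on alignment.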
Usage of bfill and ffill
s1 = df.ffill(axis=1)
s2 = df.bfill(axis=1)
df = df.mask(s1.notnull() & s2.notnull(), s1)
df
Out[222]:
1 2 3 4 5
1 1.0 1.0 1.0 2.0 NaN
2 NaN 1.0 1.0 5.0 NaN
3 NaN 3.0 1.0 NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN 5.0 5.0 5.0 1.0
Or just using interpolate
df.mask(df.interpolate(axis=1, limit_area='inside').notnull(), df.ffill(axis=1))
Out[226]:
1 2 3 4 5
1 1.0 1.0 1.0 2.0 NaN
2 NaN 1.0 1.0 5.0 NaN
3 NaN 3.0 1.0 NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN 5.0 5.0 5.0 1.0
You can use numpy to find the last valid index in each row and mask your ffill. This allows you to use the vectorized ffill and then a vectorized mask.
u = df.values
m = (~np.isnan(u)).cumsum(1).argmax(1)  # last valid column position per row
df.ffill(axis=1).mask(np.arange(df.shape[1]) > m[:, None])
0 1 2 3 4
0 1.0 1.0 1.0 2.0 NaN
1 NaN 1.0 1.0 5.0 NaN
2 NaN 3.0 1.0 NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN 5.0 5.0 5.0 1.0
Info
>>> np.arange(df.shape[1]) > m[:, None]
array([[False, False, False, False, True],
[False, False, False, False, True],
[False, False, False, True, True],
[False, True, True, True, True],
[False, False, False, False, False]])
A small modification to the solution from "Most efficient way to forward-fill NaN values in numpy array" solves it here:
def ffillrows_stoplast(arr):
# Identical to earlier solution of forward-filling
mask = np.isnan(arr)
idx = np.where(~mask,np.arange(mask.shape[1]),0)
idx_acc = np.maximum.accumulate(idx,axis=1)
out = arr[np.arange(idx.shape[0])[:,None], idx_acc]
# Perform flipped index accumulation to get trailing NaNs mask and
# accordingly assign NaNs there
out[np.maximum.accumulate(idx[:,::-1],axis=1)[:,::-1]==0] = np.nan
return out
Sample run -
In [121]: df
Out[121]:
A B C D E
1 1.0 NaN NaN 2.0 NaN
2 NaN 1.0 NaN 5.0 NaN
3 NaN 3.0 1.0 NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN 5.0 NaN NaN 1.0
In [122]: out = ffillrows_stoplast(df.to_numpy())
In [123]: pd.DataFrame(out,columns=df.columns,index=df.index)
Out[123]:
A B C D E
1 1.0 1.0 1.0 2.0 NaN
2 NaN 1.0 1.0 5.0 NaN
3 NaN 3.0 1.0 NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN 5.0 5.0 5.0 1.0
Use where on the ffill result to flip back to NaN the cells that bfill leaves empty (i.e. the trailing NaNs):
df.ffill(axis=1).where(df.bfill(axis=1).notna())
Out[1623]:
a b c d e
1 1.0 1.0 1.0 2.0 NaN
2 NaN 1.0 1.0 5.0 NaN
3 NaN 3.0 1.0 NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN 5.0 5.0 5.0 1.0
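A self-contained demo of this one-liner (sample data rebuilt from the question, with hypothetical column names a-e):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, np.nan, 2,      np.nan],
                   [np.nan, 1,      np.nan, 5,      np.nan],
                   [np.nan, 3,      1,      np.nan, np.nan],
                   [np.nan] * 5,
                   [np.nan, 5,      np.nan, np.nan, 1]],
                  columns=list('abcde'))

# bfill leaves NaN exactly where there is no later value in the row,
# so where() reverts precisely those cells after the full ffill
out = df.ffill(axis=1).where(df.bfill(axis=1).notna())
```

The appeal of this variant is that it is two fully vectorized passes with no per-row apply.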
I have 3 columns, A, B and C, in a pandas dataframe. What I want to do is: wherever A is not null AND either B or C is not null, A in that row should be set to null.
if(dffinal['A'].loc[dffinal['A'].notnull()] &
(dffinal['B'].loc[dffinal['B'].notnull()] |
dffinal['C'].loc[dffinal['C'].notnull()])):
dffinal['A'] = np.nan
this is the error I'm getting: cannot do a non-empty take from an empty axes.
Use df.loc[]:
df.loc[df.A.notna() & (df.B.notna()|df.C.notna()),'A']=np.nan
The first condition is not necessary here, so the solution can be simplified:
dffinal = pd.DataFrame({
'A':[np.nan,np.nan,4,5,5,np.nan],
'B':[7,np.nan,np.nan,4,np.nan,np.nan],
'C':[1,3,5,7,np.nan,np.nan],
})
print (dffinal)
A B C
0 NaN 7.0 1.0
1 NaN NaN 3.0
2 4.0 NaN 5.0
3 5.0 4.0 7.0
4 5.0 NaN NaN
5 NaN NaN NaN
mask = (dffinal['B'].notnull() | dffinal['C'].notnull())
dffinal.loc[mask, 'A'] = np.nan
print (dffinal)
A B C
0 NaN 7.0 1.0
1 NaN NaN 3.0
2 NaN NaN 5.0
3 NaN 4.0 7.0
4 5.0 NaN NaN
5 NaN NaN NaN
Same output as with the first condition:
mask = dffinal['A'].notnull() & (dffinal['B'].notnull() | dffinal['C'].notnull())
dffinal.loc[mask, 'A'] = np.nan
print (dffinal)
A B C
0 NaN 7.0 1.0
1 NaN NaN 3.0
2 NaN NaN 5.0
3 NaN 4.0 7.0
4 5.0 NaN NaN
5 NaN NaN NaN
given the dataframe df
df = pd.DataFrame(data=[[np.nan,1],
[np.nan,np.nan],
[1,2],
[2,3],
[np.nan,np.nan],
[np.nan,np.nan],
[3,4],
[4,5],
[np.nan,np.nan],
[np.nan,np.nan]],columns=['A','B'])
df
Out[16]:
A B
0 NaN 1.0
1 NaN NaN
2 1.0 2.0
3 2.0 3.0
4 NaN NaN
5 NaN NaN
6 3.0 4.0
7 4.0 5.0
8 NaN NaN
9 NaN NaN
I would need to replace the NaNs using the following rules:
1) if the NaN is at the beginning, replace it with the first value after the NaN
2) if the NaN (or run of NaNs) is between two values, replace it with the average of those two values
3) if the NaN is at the end, replace it with the last value
df
Out[16]:
A B
0 1.0 1.0
1 1.0 1.5
2 1.0 2.0
3 2.0 3.0
4 2.5 3.5
5 2.5 3.5
6 3.0 4.0
7 4.0 5.0
8 4.0 5.0
9 4.0 5.0
Add the forward-filled and backfilled values, divide by 2, and finally fill the leading and trailing NaNs:
df = df.bfill().add(df.ffill()).div(2).ffill().bfill()
print (df)
A B
0 1.0 1.0
1 1.0 1.5
2 1.0 2.0
3 2.0 3.0
4 2.5 3.5
5 2.5 3.5
6 3.0 4.0
7 4.0 5.0
8 4.0 5.0
9 4.0 5.0
Detail:
print (df.bfill().add(df.ffill()))
A B
0 NaN 2.0
1 NaN 3.0
2 2.0 4.0
3 4.0 6.0
4 5.0 7.0
5 5.0 7.0
6 6.0 8.0
7 8.0 10.0
8 NaN NaN
9 NaN NaN
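A quick self-contained check of the whole chain against the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, np.nan, 1, 2, np.nan, np.nan, 3, 4, np.nan, np.nan],
                   'B': [1, np.nan, 2, 3, np.nan, np.nan, 4, 5, np.nan, np.nan]})

# Interior NaNs get (previous + next) / 2; the final ffill/bfill
# handles the trailing and leading NaNs, where the sum was still NaN
out = df.bfill().add(df.ffill()).div(2).ffill().bfill()
```

As the Detail block shows, the intermediate sum is NaN at the edges because either the ffill (leading) or the bfill (trailing) contributes a NaN there, which is exactly why the closing .ffill().bfill() is needed.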