How to fill NaN between two values? - python

I have a df that looks like the one below. I want to fill NaN with some value, but only where the NaN falls between two non-NaN values.
col1 col2 col3 col4 col5 col6 col7 col8
0 NaN 12 12.0 4.0 NaN NaN NaN NaN
1 54.0 54 32.0 11.0 21.0 NaN NaN NaN
2 3.0 34 34.0 NaN NaN 43.0 NaN NaN
3 34.0 34 NaN NaN 34.0 34.0 34.0 34.0
4 NaN 34 34.0 NaN 34.0 34.0 34.0 34.0
For example, I don't want to fillna in the first and second rows, because those NaNs don't occur between values. But I do want to fillna in the third row at col4 and col5, because those two columns contain NaN between two values (col3 and col6).
How can I do this?
Expected Output:
col1 col2 col3 col4 col5 col6 col7 col8
0 NaN 12 12.0 4.0 NaN NaN NaN NaN
1 54.0 54 32.0 11.0 21.0 NaN NaN NaN
2 3.0 34 34.0 -100 -100 43.0 NaN NaN
3 34.0 34 -100 -100 34.0 34.0 34.0 34.0
4 NaN 34 34.0 -100 34.0 34.0 34.0 34.0
For this problem I can't simply use fillna, because it would fill everything; similarly, I can't use ffill or bfill, because they would overwrite the leading or trailing NaNs. I'm clueless at this stage; any help would be appreciated.
Note: I searched before raising this question and didn't find any duplicates. If you find one, feel free to mark it as a duplicate.

You need a boolean mask of positions that have a non-missing value on both sides, i.e. one that excludes each row's leading and trailing NaNs. You can build it in one of two ways: forward fill and back fill the missing values and check for non-missing, or compare cumulative sums (taken in both directions) against 0:
m = df.ffill(axis=1).notnull() & df.bfill(axis=1).notnull()
#alternative mask
a = df.notnull()
m = a.cumsum(axis=1).gt(0) & a.iloc[:, ::-1].cumsum(axis=1).gt(0)
df = df.mask(m, df.fillna(-100))
print (df)
col1 col2 col3 col4 col5 col6 col7 col8
0 NaN 12 12.0 4.0 NaN NaN NaN NaN
1 54.0 54 32.0 11.0 21.0 NaN NaN NaN
2 3.0 34 34.0 -100.0 -100.0 43.0 NaN NaN
3 34.0 34 -100.0 -100.0 34.0 34.0 34.0 34.0
4 NaN 34 34.0 -100.0 34.0 34.0 34.0 34.0
Detail:
print (m)
col1 col2 col3 col4 col5 col6 col7 col8
0 False True True True False False False False
1 True True True True True False False False
2 True True True True True True False False
3 True True True True True True True True
4 False True True True True True True True
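For reference, the masking approach can be reproduced end to end on a small frame built to match the question's layout (the values below are copied from the example above):

```python
import numpy as np
import pandas as pd

# Frame mirroring the question's example
df = pd.DataFrame({
    "col1": [np.nan, 54.0, 3.0, 34.0, np.nan],
    "col2": [12, 54, 34, 34, 34],
    "col3": [12.0, 32.0, 34.0, np.nan, 34.0],
    "col4": [4.0, 11.0, np.nan, np.nan, np.nan],
    "col5": [np.nan, 21.0, np.nan, 34.0, 34.0],
    "col6": [np.nan, np.nan, 43.0, 34.0, 34.0],
    "col7": [np.nan, np.nan, np.nan, 34.0, 34.0],
    "col8": [np.nan, np.nan, np.nan, 34.0, 34.0],
})

# True only where a non-missing value exists both to the left and to the right
m = df.ffill(axis=1).notna() & df.bfill(axis=1).notna()

# Fill only the interior NaNs with -100; leading/trailing NaNs stay untouched
filled = df.mask(m, df.fillna(-100))
```

Non-NaN cells are "replaced" with themselves (fillna leaves them unchanged), so only the interior NaNs actually change.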

Related

Shift all NaN values in pandas to the left

I have a (250, 33866) dataframe. As you can see in the picture, all the NaN values are at the end of each row. I would like to shift those NaN values to the left of the dataframe, while keeping column 0 (which holds the Id) in place as the first column.
I was trying to define a function that loops over all rows and columns, but figured it would be very inefficient for large data. Any other options? Thanks
You could reverse the columns of df, drop the NaNs per row, build a DataFrame, and reverse it back:
out = pd.DataFrame(df.iloc[:, ::-1].apply(lambda x: x.dropna().tolist(), axis=1).tolist(),
                   columns=df.columns[::-1]).iloc[:, ::-1]
For example, for a DataFrame that looks like below:
col0 col1 col2 col3 col4
1 1.0 2.0 3.0 10.0 20.0
2 1.0 2.0 3.0 NaN NaN
3 1.0 2.0 NaN NaN NaN
the above code produces:
col0 col1 col2 col3 col4
0 1.0 2.0 3.0 10.0 20.0
1 NaN NaN 1.0 2.0 3.0
2 NaN NaN NaN 1.0 2.0
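As a runnable sketch, assuming a small frame with the same trailing-NaN pattern as the example above:

```python
import numpy as np
import pandas as pd

# Small frame with all NaNs trailing in each row
df = pd.DataFrame({
    "col0": [1.0, 1.0, 1.0],
    "col1": [2.0, 2.0, 2.0],
    "col2": [3.0, 3.0, np.nan],
    "col3": [10.0, np.nan, np.nan],
    "col4": [20.0, np.nan, np.nan],
})

# Reverse the columns, drop NaNs per row, rebuild, and reverse back;
# the constructor pads short rows with NaN on the right, which after the
# final reversal puts the NaNs on the left
out = pd.DataFrame(
    df.iloc[:, ::-1].apply(lambda x: x.dropna().tolist(), axis=1).tolist(),
    columns=df.columns[::-1],
).iloc[:, ::-1]
```

Note that this drops the original dtypes and index; to keep the Id column fixed, as the question asks, you would exclude it before the reversal and concat it back afterwards.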

pandas: Perform multiple rolling computations efficiently?

Say I have a dataset that is indexed on date:
id, date, col1, col2
1, 4, 1, 12
1, 5, 2, 13
1, 6, 6, 14
2, 4, 20, 16
2, 5, 8, 17
2, 6, 11, 18
...
and I wish to compute the rolling mean, sum, min, and max for col1 and col2 grouped by id, with window sizes 2 and 3. I can do that in a loop like so:
def multi_rolling(df, winsize, column):
    return [df.groupby("id")[column].rolling(winsize).mean(),
            df.groupby("id")[column].rolling(winsize).sum(),
            df.groupby("id")[column].rolling(winsize).min(),
            df.groupby("id")[column].rolling(winsize).max(),
            df.groupby("id")[column].rolling(winsize).count()]
Then I just have to call the above in a loop, but this feels inefficient. Is there a way to call it on all combinations of functions, columns, and window sizes more efficiently? E.g. run them in parallel?
Use pandas.DataFrame.agg:
new_df = df.groupby("id").rolling(2)[["col1","col2"]].agg(['mean','sum','min','max','count'])
print(new_df)
Output:
col1 col2 \
mean sum min max count mean
col1 col2 col1 col2 col1 col2 col1 col2 col1 col2 col1 col2
id
1 0 NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 NaN NaN
1 1.5 12.5 3.0 25.0 1.0 12.0 2.0 13.0 2.0 2.0 1.5 12.5
2 4.0 13.5 8.0 27.0 2.0 13.0 6.0 14.0 2.0 2.0 4.0 13.5
2 3 NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 NaN NaN
4 14.0 16.5 28.0 33.0 8.0 16.0 20.0 17.0 2.0 2.0 14.0 16.5
5 9.5 17.5 19.0 35.0 8.0 17.0 11.0 18.0 2.0 2.0 9.5 17.5
sum min max count
col1 col2 col1 col2 col1 col2 col1 col2
id
1 0 NaN NaN NaN NaN NaN NaN 1.0 1.0
1 3.0 25.0 1.0 12.0 2.0 13.0 2.0 2.0
2 8.0 27.0 2.0 13.0 6.0 14.0 2.0 2.0
2 3 NaN NaN NaN NaN NaN NaN 1.0 1.0
4 28.0 33.0 8.0 16.0 20.0 17.0 2.0 2.0
5 19.0 35.0 8.0 17.0 11.0 18.0 2.0 2.0
Your question is a little ambiguous, so I'm not sure what you need the output to look like. But see if this one-liner helps:
df.groupby("id")[column].rolling(winsize).agg(['mean','sum','min','max','count'])
Because your loop groups the data repeatedly, it is bound to be very inefficient.
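Putting the agg idea together, here is a self-contained sketch on a small frame shaped like the question's data (window size 2 only; repeat with rolling(3) for the other window size):

```python
import pandas as pd

# Small frame in the shape the question describes
df = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2, 2],
    "date": [4, 5, 6, 4, 5, 6],
    "col1": [1, 2, 6, 20, 8, 11],
    "col2": [12, 13, 14, 16, 17, 18],
})

# One pass: all statistics for both columns, window size 2.
# Result has MultiIndex rows (id, original index) and
# MultiIndex columns (column, statistic).
stats = (
    df.groupby("id")[["col1", "col2"]]
      .rolling(2)
      .agg(["mean", "sum", "min", "max", "count"])
)
```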

Pandas bfill and ffill: how to use for numeric and non-numeric columns

Some of my NaNs are strings and some are real numeric missing values. How do I use bfill and ffill in both cases?
df
Criteria Col1 Col2 Col3 Col4
Jan10Sales 12 13 NAN NAN
Feb10Sales 1 3 4 ABC
Mar10Sales NAN 13 14 XY
Apr10Sales 5 NAN 12 V
May10Sales 6 18 19 AB
If the NaNs are real missing values, you can pass the column names as a list:
cols = ['Col1','Col2','Col3']
df[cols]=df[cols].bfill()
If the NaNs are strings, first convert them to numeric, coercing non-numbers to missing values:
cols = ['Col1','Col2','Col3']
df[cols]=df[cols].apply(lambda x: pd.to_numeric(x, errors='coerce')).bfill()
If you want to use your loop-based approach:
for col in ['Col1','Col2','Col3']:
    df[col] = pd.to_numeric(df[col], errors='coerce').bfill()
print (df)
Criteria Col1 Col2 Col3
0 Jan10Sales 12.0 13.0 4.0
1 Feb10Sales 1.0 3.0 4.0
2 Mar10Sales 5.0 13.0 14.0
3 Apr10Sales 5.0 18.0 12.0
4 May10Sales 6.0 18.0 19.0
But if the last rows have missing values, back filling does not replace them, because no next non-missing value exists:
print (df)
Criteria Col1 Col2 Col3
0 Jan10Sales 12 13 NAN
1 Feb10Sales 1 3 4
2 Mar10Sales NAN 13 14
3 Apr10Sales 5 NAN 12
4 May10Sales 6 18 NaN
cols = ['Col1','Col2','Col3']
df[cols]=df[cols].apply(lambda x: pd.to_numeric(x, errors='coerce')).bfill()
print (df)
Criteria Col1 Col2 Col3
0 Jan10Sales 12.0 13.0 4.0
1 Feb10Sales 1.0 3.0 4.0
2 Mar10Sales 5.0 13.0 14.0
3 Apr10Sales 5.0 18.0 12.0
4 May10Sales 6.0 18.0 NaN
Then it is possible to chain bfill and ffill:
df[cols]=df[cols].apply(lambda x: pd.to_numeric(x, errors='coerce')).bfill().ffill()
print (df)
Criteria Col1 Col2 Col3
0 Jan10Sales 12.0 13.0 4.0
1 Feb10Sales 1.0 3.0 4.0
2 Mar10Sales 5.0 13.0 14.0
3 Apr10Sales 5.0 18.0 12.0
4 May10Sales 6.0 18.0 12.0
You may try this:
for col in ['Col1','Col2','Col3']:
    df[col] = df[col].bfill()
pandas.DataFrame.fillna
I guess the string 'NAN' here does not mean a real NaN value. You already got a solution, but you can check my code too:
df = df[df.ne('NAN')].bfill()
Criteria Col1 Col2 Col3
0 Jan10Sales 12 13 4
1 Feb10Sales 1 3 4
2 Mar10Sales 5 13 14
3 Apr10Sales 5 18 12
4 May10Sales 6 18 19
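The to_numeric-plus-chained-fill approach above can be run end to end; the frame below is a shortened, hypothetical version of the question's data where some "NAN" entries are literal strings:

```python
import pandas as pd

# Hypothetical frame where some NaNs are the literal string 'NAN'
df = pd.DataFrame({
    "Criteria": ["Jan10Sales", "Feb10Sales", "Mar10Sales"],
    "Col1": [12, 1, "NAN"],
    "Col2": ["NAN", 3, 13],
})

cols = ["Col1", "Col2"]
# Coerce the 'NAN' strings to real NaN, then back fill and forward fill,
# so both leading and trailing gaps get covered
df[cols] = (
    df[cols].apply(lambda x: pd.to_numeric(x, errors="coerce"))
            .bfill()
            .ffill()
)
```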

how to concat between columns keeping sequence unchanged in 2 pandas dataframes

I have 2 dataframes and I want to concat them as follows:
df1:
index 394 min FIC-2000 398 min FFC
0 Recycle Gas min 20K20 Compressor min 20k
1 TT date kg/h AT date ..
2 nan 2011-03-02 -20.7 2011-03-02
08:00:00 08:00:00
3 nan 2011-03-02 -27.5 ...
08:00:10
df2:
index Unnamed:0 0 1 .. 394 395 .....
0 Service Prop Prop1 Recycle Gas RecG
the output df3 should be like this:
df3
index Unnamed:0 0 .. 394 395..
0 Service Prop Recycle Gas RecG
1 Recycle Gas min FIC-2000
2 min 20K20
3 TT date kg/h
4 nan 2011-03-02 -20.7
08:00:00
5 nan 2011-03-02 -27.5
08:00:10
I've tried this code:
df3 = pd.concat([df1, df2], axis=1)
but this just concats index 394, and the rest of df1 is appended to the end of df2.
Any idea how to do this?
Just change to axis=0.
Consider this:
Input:
>>> df
col1 col2 col3
0 1 4 2
1 2 1 5
2 3 6 319
>>> df_1
col4 col5 col6
0 1 4 12
1 32 12 3
2 3 2 319
>>> df_2
col1 col3 col6
0 12 14 2
1 4 132 3
2 23 22 9
Concat mismatched (per column name)
>>> pd.concat([df, df_1], axis=0)
col1 col2 col3 col4 col5 col6
0 1.0 4.0 2.0 NaN NaN NaN
1 2.0 1.0 5.0 NaN NaN NaN
2 3.0 6.0 319.0 NaN NaN NaN
0 NaN NaN NaN 1.0 4.0 12.0
1 NaN NaN NaN 32.0 12.0 3.0
2 NaN NaN NaN 3.0 2.0 319.0
Concat matching:
>>> pd.concat([df, df_1, df_2], axis=0)
col1 col2 col3 col4 col5 col6
0 1.0 4.0 2.0 NaN NaN NaN
1 2.0 1.0 5.0 NaN NaN NaN
2 3.0 6.0 319.0 NaN NaN NaN
0 NaN NaN NaN 1.0 4.0 12.0
1 NaN NaN NaN 32.0 12.0 3.0
2 NaN NaN NaN 3.0 2.0 319.0
0 12.0 NaN 14.0 NaN NaN 2.0
1 4.0 NaN 132.0 NaN NaN 3.0
2 23.0 NaN 22.0 NaN NaN 9.0
Concat matched, filling NaNs (you can fill Nones analogously):
>>> pd.concat([df, df_1, df_2], axis=0).fillna(0) #in case you wish to prettify it, maybe in case of strings do .fillna('')
col1 col2 col3 col4 col5 col6
0 1.0 4.0 2.0 0.0 0.0 0.0
1 2.0 1.0 5.0 0.0 0.0 0.0
2 3.0 6.0 319.0 0.0 0.0 0.0
0 0.0 0.0 0.0 1.0 4.0 12.0
1 0.0 0.0 0.0 32.0 12.0 3.0
2 0.0 0.0 0.0 3.0 2.0 319.0
0 12.0 0.0 14.0 0.0 0.0 2.0
1 4.0 0.0 132.0 0.0 0.0 3.0
2 23.0 0.0 22.0 0.0 0.0 9.0
EDIT
Triggered by the conversation with the OP in the comments below.
So you do:
(1) To concat dataframes
df3=pd.concat([df1,df2], axis=0)
(2) To join another dataframe onto them:
df5=pd.merge(df3, df4[["FIC", "min"]], on="FIC", how="outer")
(you may want to consider the suffixes parameter if you think it's relevant)
REF: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
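As a minimal sketch of step (2), assuming two small hypothetical frames that share the FIC key column (the names FIC and min are taken from the answer's merge line, not from the real data):

```python
import pandas as pd

# Hypothetical frames sharing the key column "FIC"
df3 = pd.DataFrame({"FIC": ["A", "B"], "val": [1, 2]})
df4 = pd.DataFrame({"FIC": ["A", "C"], "min": [10, 30]})

# Outer merge keeps rows from both sides; unmatched keys get NaN
df5 = pd.merge(df3, df4[["FIC", "min"]], on="FIC", how="outer")
```

With how="outer", keys present on only one side ("B" and "C" here) survive with NaN in the other side's columns; how="inner" would keep only "A".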

pandas shift rows NaNs

Say we have a dataframe set up as follows:
x = pd.DataFrame(np.random.randint(1, 10, 30).reshape(5,6),
columns=[f'col{i}' for i in range(6)])
x['col6'] = np.nan
x['col7'] = np.nan
col0 col1 col2 col3 col4 col5 col6 col7
0 6 5 1 5 2 4 NaN NaN
1 8 8 9 6 7 2 NaN NaN
2 8 3 9 6 6 6 NaN NaN
3 8 4 4 4 8 9 NaN NaN
4 5 3 4 3 8 7 NaN NaN
When calling x.shift(2, axis=1), col2 through col5 shift correctly, but col6 and col7 stay NaN.
How can I overwrite the NaN in col6 and col7 with col4 and col5's values? Is this a bug, or intended?
col0 col1 col2 col3 col4 col5 col6 col7
0 NaN NaN 6.0 5.0 1.0 5.0 NaN NaN
1 NaN NaN 8.0 8.0 9.0 6.0 NaN NaN
2 NaN NaN 8.0 3.0 9.0 6.0 NaN NaN
3 NaN NaN 8.0 4.0 4.0 4.0 NaN NaN
4 NaN NaN 5.0 3.0 4.0 3.0 NaN NaN
It's possible this is a bug; you can use np.roll to achieve this:
In[11]:
x.apply(lambda x: np.roll(x, 2), axis=1)
Out[11]:
col0 col1 col2 col3 col4 col5 col6 col7
0 NaN NaN 6.0 5.0 1.0 5.0 2.0 4.0
1 NaN NaN 8.0 8.0 9.0 6.0 7.0 2.0
2 NaN NaN 8.0 3.0 9.0 6.0 6.0 6.0
3 NaN NaN 8.0 4.0 4.0 4.0 8.0 9.0
4 NaN NaN 5.0 3.0 4.0 3.0 8.0 7.0
Speed-wise, it's probably quicker to construct a new df, reusing the existing columns and passing the result of np.roll as the data argument to the DataFrame constructor:
In[12]:
x = pd.DataFrame(np.roll(x, 2, axis=1), columns = x.columns)
x
Out[12]:
col0 col1 col2 col3 col4 col5 col6 col7
0 NaN NaN 6.0 5.0 1.0 5.0 2.0 4.0
1 NaN NaN 8.0 8.0 9.0 6.0 7.0 2.0
2 NaN NaN 8.0 3.0 9.0 6.0 6.0 6.0
3 NaN NaN 8.0 4.0 4.0 4.0 8.0 9.0
4 NaN NaN 5.0 3.0 4.0 3.0 8.0 7.0
timings
In[13]:
%timeit pd.DataFrame(np.roll(x, 2, axis=1), columns = x.columns)
%timeit x.fillna(0).astype(int).shift(2, axis=1)
10000 loops, best of 3: 117 µs per loop
1000 loops, best of 3: 418 µs per loop
So constructing a new df from the result of np.roll is quicker than first filling the NaN values, casting to int, and then shifting.
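For completeness, the constructor-based variant can be run end to end; the frame below rebuilds the question's setup with a seeded generator in place of the original random values:

```python
import numpy as np
import pandas as pd

# Rebuild the question's frame: 5x6 random ints plus two all-NaN columns
rng = np.random.default_rng(0)
x = pd.DataFrame(rng.integers(1, 10, (5, 6)).astype(float),
                 columns=[f"col{i}" for i in range(6)])
x["col6"] = np.nan
x["col7"] = np.nan

# Roll every row two positions to the right; unlike shift, roll wraps,
# so the trailing NaNs of col6/col7 end up in col0/col1
rolled = pd.DataFrame(np.roll(x, 2, axis=1), columns=x.columns)
```

Note the wrap-around is what makes this work here: because the last two columns are all NaN, rolling them to the front reproduces exactly what a non-wrapping shift would have produced, while col6/col7 now hold the old col4/col5 values.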
