Pandas backfill specific value - python

I have a dataframe as such:
import numpy as np
import pandas as pd

df = pd.DataFrame({'val': [np.nan, np.nan, np.nan, np.nan, 15, 1, 5, 2,
                           np.nan, np.nan, np.nan, np.nan, np.nan, np.nan,
                           2, 23, 5, 12, np.nan, np.nan, 3, 4, 5]})
df['name'] = ['a']*8 + ['b']*15
df
>>>
val name
0 NaN a
1 NaN a
2 NaN a
3 NaN a
4 15.0 a
5 1.0 a
6 5.0 a
7 2.0 a
8 NaN b
9 NaN b
10 NaN b
11 NaN b
12 NaN b
13 NaN b
14 2.0 b
15 23.0 b
16 5.0 b
17 12.0 b
18 NaN b
19 NaN b
20 3.0 b
21 4.0 b
22 5.0 b
For each name I want to backfill the prior 3 NaN spots with -1, so that I end up with:
>>>
val name
0 NaN a
1 -1.0 a
2 -1.0 a
3 -1.0 a
4 15.0 a
5 1.0 a
6 5.0 a
7 2.0 a
8 NaN b
9 NaN b
10 NaN b
11 -1.0 b
12 -1.0 b
13 -1.0 b
14 2.0 b
15 23.0 b
16 5.0 b
17 12.0 b
18 -1.0 b
19 -1.0 b
20 3.0 b
21 4.0 b
22 5.0 b
Note there can be multiple sections of NaN. If a section has fewer than 3 NaNs, all of them should be filled (it backfills up to 3 values per section).

You can use first_valid_index to get the index of the first non-null value in each group, then assign -1 to the three rows before it with loc (note that loc slicing is label-based and inclusive on both ends):
idx = df.groupby('name').val.apply(lambda x: x.first_valid_index())
for x in idx:
    df.loc[x - 3:x - 1, 'val'] = -1
df
Out[51]:
val name
0 NaN a
1 -1.0 a
2 -1.0 a
3 -1.0 a
4 15.0 a
5 1.0 a
6 5.0 a
7 2.0 a
8 NaN b
9 NaN b
10 NaN b
11 -1.0 b
12 -1.0 b
13 -1.0 b
14 2.0 b
15 23.0 b
16 5.0 b
17 12.0 b
18 NaN b
19 NaN b
20 3.0 b
21 4.0 b
22 5.0 b
Update: the above only fills the NaNs before the first valid value of each group. To handle multiple NaN sections per group, backfill with limit=3 and then overwrite the newly filled positions with -1:
s = df.groupby('name').val.bfill(limit=3)
s.loc[s.notnull() & df.val.isnull()] = -1
s
Out[59]:
0 NaN
1 -1.0
2 -1.0
3 -1.0
4 15.0
5 1.0
6 5.0
7 2.0
8 NaN
9 NaN
10 NaN
11 -1.0
12 -1.0
13 -1.0
14 2.0
15 23.0
16 5.0
17 12.0
18 -1.0
19 -1.0
20 3.0
21 4.0
22 5.0
Name: val, dtype: float64
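If you want the filled values written back into the original frame rather than kept as a separate Series, a minimal follow-up (assuming the df and s defined above):
# assign the filled Series back into the original column
df['val'] = s
df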

Insert empty row after every Nth row in pandas dataframe

I have a dataframe:
pd.DataFrame(columns=['a','b'],
             data=[[3,4],[5,5],[9,3],[1,2],[9,9],[6,5],[6,5],[6,5],[6,5],
                   [6,5],[6,5],[6,5],[6,5],[6,5],[6,5],[6,5],[6,5]])
I want to insert two empty rows after every third row, so the resulting output looks like this:
a b
0 3.0 4.0
1 5.0 5.0
2 9.0 3.0
3 NaN NaN
4 NaN NaN
5 1.0 2.0
6 9.0 9.0
7 6.0 5.0
8 NaN NaN
9 NaN NaN
10 6.0 5.0
11 6.0 5.0
12 6.0 5.0
13 NaN NaN
14 NaN NaN
15 6.0 5.0
16 6.0 5.0
17 6.0 5.0
18 NaN NaN
19 NaN NaN
20 6.0 5.0
21 6.0 5.0
22 6.0 5.0
23 NaN NaN
24 NaN NaN
25 6.0 5.0
26 6.0 5.0
I tried a number of things but didn't get any closer to the desired output.
The following should scale well with the size of the DataFrame since it doesn't iterate over the rows and doesn't create intermediate DataFrames.
import pandas as pd
df = pd.DataFrame(columns=['a','b'],
                  data=[[3,4],[5,5],[9,3],[1,2],[9,9],[6,5],[6,5],[6,5],[6,5],
                        [6,5],[6,5],[6,5],[6,5],[6,5],[6,5],[6,5],[6,5]])
def add_empty_rows(df, n_empty, period):
    """Adds 'n_empty' empty rows every 'period' rows to 'df'.
    Returns a new DataFrame."""
    # make sure that the DataFrame index is a RangeIndex(start=0, stop=len(df))
    # and that the original df object is not mutated
    df = df.reset_index(drop=True)
    # length of the new DataFrame containing the NaN rows
    len_new_index = len(df) + n_empty*(len(df) // period)
    # index of the new DataFrame
    new_index = pd.RangeIndex(len_new_index)
    # add an offset (= number of NaN rows up to that row)
    # to the current df.index to align with new_index
    df.index += n_empty * (df.index
                           .to_series()
                           .groupby(df.index // period)
                           .ngroup())
    # reindex by aligning df.index with new_index.
    # Values of new_index not present in df.index are filled with NaN.
    new_df = df.reindex(new_index)
    return new_df
Tests:
# original df
>>> df
a b
0 3 4
1 5 5
2 9 3
3 1 2
4 9 9
5 6 5
6 6 5
7 6 5
8 6 5
9 6 5
10 6 5
11 6 5
12 6 5
13 6 5
14 6 5
15 6 5
16 6 5
# add 2 empty rows every 3 rows
>>> add_empty_rows(df, 2, 3)
a b
0 3.0 4.0
1 5.0 5.0
2 9.0 3.0
3 NaN NaN
4 NaN NaN
5 1.0 2.0
6 9.0 9.0
7 6.0 5.0
8 NaN NaN
9 NaN NaN
10 6.0 5.0
11 6.0 5.0
12 6.0 5.0
13 NaN NaN
14 NaN NaN
15 6.0 5.0
16 6.0 5.0
17 6.0 5.0
18 NaN NaN
19 NaN NaN
20 6.0 5.0
21 6.0 5.0
22 6.0 5.0
23 NaN NaN
24 NaN NaN
25 6.0 5.0
26 6.0 5.0
# add 5 empty rows every 4 rows
>>> add_empty_rows(df, 5, 4)
a b
0 3.0 4.0
1 5.0 5.0
2 9.0 3.0
3 1.0 2.0
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 9.0 9.0
10 6.0 5.0
11 6.0 5.0
12 6.0 5.0
13 NaN NaN
14 NaN NaN
15 NaN NaN
16 NaN NaN
17 NaN NaN
18 6.0 5.0
19 6.0 5.0
20 6.0 5.0
21 6.0 5.0
22 NaN NaN
23 NaN NaN
24 NaN NaN
25 NaN NaN
26 NaN NaN
27 6.0 5.0
28 6.0 5.0
29 6.0 5.0
30 6.0 5.0
31 NaN NaN
32 NaN NaN
33 NaN NaN
34 NaN NaN
35 NaN NaN
36 6.0 5.0
Try this:
import numpy as np

(pd.concat([df, pd.DataFrame([[np.nan]*2],
                             index=[i for i in df.index if i % 3 == 2]*2,
                             columns=list('ab'))])
   .sort_index()
   .reset_index(drop=True))
Output:
a b
0 3.0 4.0
1 5.0 5.0
2 9.0 3.0
3 NaN NaN
4 NaN NaN
5 1.0 2.0
6 9.0 9.0
7 6.0 5.0
8 NaN NaN
9 NaN NaN
10 6.0 5.0
11 6.0 5.0
12 6.0 5.0
13 NaN NaN
14 NaN NaN
15 6.0 5.0
16 6.0 5.0
17 6.0 5.0
18 NaN NaN
19 NaN NaN
20 6.0 5.0
21 6.0 5.0
22 6.0 5.0
23 NaN NaN
24 NaN NaN
25 6.0 5.0
26 6.0 5.0
You can iterate over the rows and add two empty rows after every third row:
data = [[row.tolist(), [pd.NA]*len(row), [pd.NA]*len(row)]
        if (idx+1) % 3 == 0 else [row.tolist()]
        for idx, row in df.iterrows()]
out = pd.DataFrame([i for lst in data for i in lst], columns=df.columns)
print(data)
[[[3, 4]],
[[5, 5]],
[[9, 3], [<NA>, <NA>], [<NA>, <NA>]],
[[1, 2]],
[[9, 9]],
[[6, 5], [<NA>, <NA>], [<NA>, <NA>]],
[[6, 5]],
[[6, 5]],
[[6, 5], [<NA>, <NA>], [<NA>, <NA>]],
[[6, 5]],
[[6, 5]],
[[6, 5], [<NA>, <NA>], [<NA>, <NA>]],
[[6, 5]],
[[6, 5]],
[[6, 5], [<NA>, <NA>], [<NA>, <NA>]],
[[6, 5]],
[[6, 5]]]
print(out)
a b
0 3 4
1 5 5
2 9 3
3 <NA> <NA>
4 <NA> <NA>
5 1 2
6 9 9
7 6 5
8 <NA> <NA>
9 <NA> <NA>
10 6 5
11 6 5
12 6 5
13 <NA> <NA>
14 <NA> <NA>
15 6 5
16 6 5
17 6 5
18 <NA> <NA>
19 <NA> <NA>
20 6 5
21 6 5
22 6 5
23 <NA> <NA>
24 <NA> <NA>
25 6 5
26 6 5

Subtraction between two dataframes' columns

I have two different datasets: total product data and selling data. I need to find the remaining products by comparing the product data with the selling data. I have done some general preprocessing to get both dataframes ready to use, but I can't figure out how to compare them.
DataFrame 1:
Item Qty
0 BUDS2 1.0
1 C100 4.0
2 CK1 5.0
3 DM10 10.0
4 DM7 2.0
5 DM9 9.0
6 HM12 6.0
7 HM13 4.0
8 HOCOX25(CTYPE) 1.0
9 HOCOX30USB 1.0
10 RM510 8.0
11 RM512 8.0
12 RM569 1.0
13 RM711 2.0
14 T2C 1.0
and
DataFrame 2 :
Item Name Quantity
0 BUDS2 2.0
1 C100 5.0
2 C101CABLE 1.0
3 CK1 8.0
4 DM10 12.0
5 DM7 5.0
6 DM9 10.0
7 G20CTYPE 1.0
8 G20NORMAL 1.0
9 HM12 9.0
10 HM13 8.0
11 HM9 3.0
12 HOCOX25CTYPE 3.0
13 HOCOX30USB 3.0
14 M45 1.0
15 REMAXRC080M 2.0
16 RM510 11.0
17 RM512 10.0
18 RM569 2.0
19 RM711 3.0
20 T2C 1.0
21 Y1 3.0
22 ZIRCON 1.0
I want to see the available quantity for each item, i.e. an output shaped like DataFrame 2 but with the Quantity column holding the result of the subtraction. How can I do that?
Expected Output:
Item Name Quantity
0 BUDS2 1.0
1 C100 1.0
2 C101CABLE 1.0
3 CK1 3.0
4 DM10 2.0
5 DM7 3.0
6 DM9 1.0
7 G20CTYPE 1.0
8 G20NORMAL 1.0
9 HM12 3.0
10 HM13 4.0
11 HM9 3.0
12 HOCOX25CTYPE 2.0
13 HOCOX30USB 2.0
14 M45 1.0
15 REMAXRC080M 2.0
16 RM510 3.0
17 RM512 2.0
18 RM569 1.0
19 RM711 1.0
20 T2C 0.0
21 Y1 3.0
22 ZIRCON 1.0
This can be done by merging the two dataframes:
df_new = df_2.merge(df_1, how='left', left_on='Item Name', right_on='Item').fillna(0)
df_new.Quantity = df_new.Quantity - df_new.Qty
df_new = df_new.drop(['Item', 'Qty'], axis=1)
df_new output:
Item Name Quantity
0 BUDS2 1.0
1 C100 1.0
2 C101CABLE 1.0
3 CK1 3.0
4 DM10 2.0
5 DM7 3.0
6 DM9 1.0
7 G20CTYPE 1.0
8 G20NORMAL 1.0
9 HM12 3.0
10 HM13 4.0
11 HM9 3.0
12 HOCOX25CTYPE 3.0
13 HOCOX30USB 2.0
14 M45 1.0
15 REMAXRC080M 2.0
16 RM510 3.0
17 RM512 2.0
18 RM569 1.0
19 RM711 1.0
20 T2C 0.0
21 Y1 3.0
22 ZIRCON 1.0
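A variant of the same idea, shown only as a sketch (assuming the df_1/df_2 names used above): set the item columns as the index and let Series.sub align on it, with fill_value=0 treating items that were never sold (missing from df_1) as zero. Note this produces the union of both item lists, so an item that appears only in df_1 would come out negative.
# subtract sold quantities from stock, aligning on the item name index
remaining = (df_2.set_index('Item Name')['Quantity']
             .sub(df_1.set_index('Item')['Qty'], fill_value=0))
print(remaining)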

combine multiple dataframes in a csv file separating each with an empty row

How can I separate each dataframe with an empty row? I've combined them using this snippet:
frames1 = [df4, df5, df6]
Summary = pd.concat(frames1)
So how can I split them with an empty row?
You can use the example below, which works:
Create test dfs:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(0,20,20).reshape(5,4), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.randint(0,20,20).reshape(5,4), columns=list('ABCD'))
df3 = pd.DataFrame(np.random.randint(0,20,20).reshape(5,4), columns=list('ABCD'))
dfs = [df1, df2, df3]
Solution:
pd.concat([df.append(pd.Series(), ignore_index=True) for df in dfs])
A B C D
0 17.0 16.0 15.0 7.0
1 13.0 6.0 12.0 18.0
2 0.0 2.0 10.0 17.0
3 8.0 13.0 10.0 17.0
4 4.0 18.0 8.0 19.0
5 NaN NaN NaN NaN
0 14.0 0.0 13.0 12.0
1 10.0 3.0 6.0 3.0
2 15.0 10.0 15.0 3.0
3 9.0 16.0 11.0 4.0
4 5.0 7.0 6.0 2.0
5 NaN NaN NaN NaN
0 10.0 18.0 13.0 12.0
1 1.0 6.0 10.0 0.0
2 2.0 19.0 4.0 18.0
3 4.0 3.0 9.0 16.0
4 16.0 6.0 5.0 6.0
5 NaN NaN NaN NaN
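Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on newer versions the same layout can be built with pd.concat and an explicit one-row NaN frame (a sketch assuming the dfs list above):
# one all-NaN row with the same columns, appended to each frame before stacking
blank = pd.DataFrame([[np.nan] * len(dfs[0].columns)], columns=dfs[0].columns)
pd.concat([pd.concat([df, blank], ignore_index=True) for df in dfs])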
For horizontal stacking:
pd.concat([df.assign(test=np.nan) for df in dfs],axis=1)
A B C D test A B C D test A B C D test
0 17 16 15 7 NaN 14 0 13 12 NaN 10 18 13 12 NaN
1 13 6 12 18 NaN 10 3 6 3 NaN 1 6 10 0 NaN
2 0 2 10 17 NaN 15 10 15 3 NaN 2 19 4 18 NaN
3 8 13 10 17 NaN 9 16 11 4 NaN 4 3 9 16 NaN
4 4 18 8 19 NaN 5 7 6 2 NaN 16 6 5 6 NaN
Is this what you want?:
fname = 'test2.csv'
frames1 = [df4, df5, df6]
with open(fname, mode='a+') as f:
    for df in frames1:
        df.to_csv(fname, mode='a', header=f.tell() == 0)
        f.write('\n')
test2.csv:
,a,b,c
0,0,1,2
1,3,4,5
2,6,7,8
0,0,1,2
1,3,4,5
2,6,7,8
0,0,1,2
1,3,4,5
2,6,7,8
f.tell() == 0 checks whether the file handle is at the beginning of the file, i.e. the file is empty; if so, the header is written, otherwise it is skipped.
NOTE: I have used the same values for all the dfs, which is why all the blocks look the same.
For columns:
fname = 'test3.csv'
frames1 = [df1, df2, df3]
Summary = pd.concat([df.assign(**{' ':' '}) for df in frames1], axis=1)
Summary.to_csv(fname)
test3.csv:
,a,b,c, ,a,b,c, ,a,b,c,
0,0,1,2, ,0,1,2, ,0,1,2,
1,3,4,5, ,3,4,5, ,3,4,5,
2,6,7,8, ,6,7,8, ,6,7,8,
But the columns will not be equally spaced. If you save with header=False:
test3.csv:
0,0,1,2, ,0,1,2, ,0,1,2,
1,3,4,5, ,3,4,5, ,3,4,5,
2,6,7,8, ,6,7,8, ,6,7,8,

pandas filling nans by mean of before and after non-nan values

I would like to fill df's nan with an average of adjacent elements.
Consider a dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'val': [1, np.nan, 4, 5, np.nan, 10, 1, 2, 5, np.nan, np.nan, 9]})
val
0 1.0
1 NaN
2 4.0
3 5.0
4 NaN
5 10.0
6 1.0
7 2.0
8 5.0
9 NaN
10 NaN
11 9.0
My desired output is:
val
0 1.0
1 2.5
2 4.0
3 5.0
4 7.5
5 10.0
6 1.0
7 2.0
8 5.0
9 7.0 <<< deadend
10 7.0 <<< deadend
11 9.0
I've looked into other solutions such as "Fill cell containing NaN with average of value before and after", but this won't work in the case of two or more consecutive np.nans.
Any help is greatly appreciated!
Use ffill + bfill and divide by 2:
df = (df.ffill()+df.bfill())/2
print(df)
val
0 1.0
1 2.5
2 4.0
3 5.0
4 7.5
5 10.0
6 1.0
7 2.0
8 5.0
9 7.0
10 7.0
11 9.0
EDIT: If the first or last element contains NaN, then use (Dark's suggestion):
df = pd.DataFrame({'val': [np.nan, 1, np.nan, 4, 5, np.nan,
                           10, 1, 2, 5, np.nan, np.nan, 9, np.nan]})
df = (df.ffill()+df.bfill())/2
df = df.bfill().ffill()
print(df)
val
0 1.0
1 1.0
2 2.5
3 4.0
4 5.0
5 7.5
6 10.0
7 1.0
8 2.0
9 5.0
10 7.0
11 7.0
12 9.0
13 9.0
Although in the case of multiple NaNs in a row it doesn't produce the exact output you specified, other users reaching this page may actually prefer the effect of the interpolate() method:
df = df.interpolate()
print(df)
val
0 1.0
1 2.5
2 4.0
3 5.0
4 7.5
5 10.0
6 1.0
7 2.0
8 5.0
9 6.3
10 7.7
11 9.0
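If the series can also start or end with NaN (as in the EDIT above), interpolate accepts a limit_direction argument; as far as I know, limit_direction='both' also fills leading and trailing NaNs by carrying the nearest valid value outward. A sketch, re-creating the edge-NaN frame rather than reusing the df above:
# frame with NaN at both ends, as in the EDIT
df = pd.DataFrame({'val': [np.nan, 1, np.nan, 4, 5, np.nan,
                           10, 1, 2, 5, np.nan, np.nan, 9, np.nan]})
df.interpolate(limit_direction='both')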

pandas backfill NaN by incrementing the last value

I have a data frame:
A B C
Timestamp
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN 5
4 NaN NaN 4
5 NaN 3 3
6 NaN 2 NaN
7 3 1 NaN
8 2 NaN NaN
9 1 NaN NaN
I would like to backfill it by incrementing the last available value in each column so it looks like this:
A B C
Timestamp
1 9 7 7
2 8 6 6
3 7 5 5
4 6 4 4
5 5 3 3
6 4 2 NaN
7 3 1 NaN
8 2 NaN NaN
9 1 NaN NaN
Let's try this:
df1 = df1[::-1].fillna(method='ffill')
(df1 + (df1 == df1.shift()).cumsum()).sort_index()
Output:
A B C
Timestamp
1 9.0 7.0 7.0
2 8.0 6.0 6.0
3 7.0 5.0 5.0
4 6.0 4.0 4.0
5 5.0 3.0 3.0
6 4.0 2.0 NaN
7 3.0 1.0 NaN
8 2.0 NaN NaN
9 1.0 NaN NaN
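Note that fillna(method='ffill') is deprecated on recent pandas releases; the same chain can be written with .ffill() directly (same logic, just the modern spelling, starting again from the original frame):
df1 = df1[::-1].ffill()
(df1 + (df1 == df1.shift()).cumsum()).sort_index()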
You can try this:
def bfill_increment(col):
    col_null = col.isnull()[::-1]
    groups = col_null.diff().fillna(0).cumsum()
    return col_null.groupby(groups).cumsum()[::-1] + col.bfill()

df.apply(bfill_increment)
