I am working with time series data, and I am having trouble removing runs of consecutive NaNs that are shorter than or equal to a threshold from a DataFrame column. I tried looking at some related questions:
Identifying consecutive NaN's with pandas : identifies where consecutive NaNs are present and what their count is.
Pandas: run length of NaN holes : outputs a run-length encoding of the NaNs.
There are many more along these lines, but none of them actually explains how to remove the runs after identifying them.
I found one similar solution, but it is in R:
How to remove more than 2 consecutive NA's in a column?
I want a solution in Python.
So here is the example. This is my dataframe column:
a
0 36.45
1 35.45
2 NaN
3 NaN
4 NaN
5 37.21
6 35.63
7 36.45
8 34.65
9 31.45
10 NaN
11 NaN
12 36.71
13 35.55
14 NaN
15 NaN
16 NaN
17 NaN
18 37.71
If k = 3, my output should be:
a
0 36.45
1 35.45
2 37.21
3 35.63
4 36.45
5 34.65
6 31.45
7 36.71
8 35.55
9 NaN
10 NaN
11 NaN
12 NaN
13 37.71
How can I go about removing the consecutive NaN runs that are less than or equal to some threshold (k) in length?
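For anyone who wants to reproduce this, the example column can be reconstructed with the following sketch (values copied from the printout above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [36.45, 35.45, np.nan, np.nan, np.nan, 37.21, 35.63,
                         36.45, 34.65, 31.45, np.nan, np.nan, 36.71, 35.55,
                         np.nan, np.nan, np.nan, np.nan, 37.71]})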
There are a few ways, but this is how I've done it:
Determine groups of consecutive values using a neat cumsum trick
Use groupby + transform to determine the size of each group
Identify groups of NaNs that are within the threshold
Filter them out with boolean indexing.
k = 3
i = df.a.isnull()
# label each consecutive run of NaN/non-NaN values, compute each run's size,
# and keep a row unless it is a NaN sitting in a run of length <= k
m = ~(df.groupby(i.ne(i.shift()).cumsum().values).a.transform('size').le(k) & i)
df[m]
a
0 36.45
1 35.45
5 37.21
6 35.63
7 36.45
8 34.65
9 31.45
12 36.71
13 35.55
14 NaN
15 NaN
16 NaN
17 NaN
18 37.71
You can perform df = df[m].reset_index(drop=True) at the end if you want a monotonically increasing integer index.
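To see why the cumsum trick labels consecutive runs, here is a sketch of the intermediate series for the first eight rows:
i = df.a.isnull()
# i.ne(i.shift()) is True exactly where a new run of NaN/non-NaN starts,
# so the cumulative sum assigns one integer label per consecutive run
i.ne(i.shift()).cumsum().head(8).tolist()
# [1, 1, 2, 2, 2, 3, 3, 3] -> rows 0-1, the NaN run at 2-4, and rows 5-7
Each label then becomes a group for groupby, so transform('size') puts the run length on every row.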
You can create an indicator column that counts the consecutive NaNs and group on it:
k = 3
(
    df.groupby(pd.notna(df.a).cumsum())
      .apply(lambda x: x.dropna() if pd.isna(x.a).sum() <= k else x)
      .reset_index(drop=True)
)
Output:
a
0 36.45
1 35.45
2 37.21
3 35.63
4 36.45
5 34.65
6 31.45
7 36.71
8 35.55
9 NaN
10 NaN
11 NaN
12 NaN
13 37.71
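If you need this in more than one place, the first (vectorized) approach can be wrapped in a small helper; the function name drop_short_nan_runs below is just illustrative:
def drop_short_nan_runs(df, col, k):
    # label consecutive NaN/non-NaN runs, then drop NaN runs of length <= k
    i = df[col].isnull()
    run_id = i.ne(i.shift()).cumsum()
    keep = ~(i.groupby(run_id).transform('size').le(k) & i)
    return df[keep].reset_index(drop=True)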
Related
I have the following dataframe, in which the value should be increasing. Originally the dataframe has some unknown values.
index  value
0      1
1
2
3      2
4
5
6
7      4
8
9
10     3
11     3
12
13
14
15     5
Based on the assumption that the value should be increasing, I would like to remove the rows at index 10 and 11. This would be the desired dataframe:
index  value
0      1
1
2
3      2
4
5
6
7      4
8
9
12
13
14
15     5
Thank you very much
Assuming NaN in the empty cells (if not, temporarily replace them with NaN), use boolean indexing:
# if not NaNs uncomment below
# and use s in place of df['value'] afterwards
# s = pd.to_numeric(df['value'], errors='coerce')
# is the cell empty?
m1 = df['value'].isna()
# are the values strictly increasing?
m2 = df['value'].ge(df['value'].cummax())
out = df[m1 | m2]
Output:
index value
1 1 NaN
2 2 NaN
3 3 2.0
4 4 NaN
5 5 NaN
6 6 NaN
7 7 4.0
8 8 NaN
9 9 NaN
12 12 NaN
13 13 NaN
14 14 NaN
15 15 5.0
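The reason rows 10 and 11 are dropped is visible in the running maximum; a quick check, assuming the same df:
cm = df['value'].cummax()  # running maximum; NaNs are ignored
# cm is 4.0 at indexes 10 and 11 (set at index 7), while the values there
# are 3.0, so df['value'].ge(cm) is False for those two rows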
Try this:
def del_df(df):
    df_no_na = df.dropna().reset_index(drop=True)
    num_tmp = df_no_na['value'][0]  # first value that is not NaN
    del_index_list = []  # indices to delete
    for row_index in range(1, len(df_no_na)):
        if df_no_na['value'][row_index] > num_tmp:  # increasing
            num_tmp = df_no_na['value'][row_index]  # compare the following values against this one
        else:  # not increasing (same or decreasing)
            del_index_list.append(df_no_na['index'][row_index])  # index to delete
    df_goal = df.drop([df.index[i] for i in del_index_list])
    return df_goal
Output:
index value
0 0 1.0
1 1 NaN
2 2 NaN
3 3 2.0
4 4 NaN
5 5 NaN
6 6 NaN
7 7 4.0
8 8 NaN
9 9 NaN
12 12 NaN
13 13 NaN
14 14 NaN
15 15 5.0
I have a data frame like this:
df:
number score
12 NaN
13 NaN
14 NaN
15 NaN
16 10
17 NaN
18 NaN
I want to filter this data frame from the start to the row where it finds a number in the score column.
So, after filtering the data frame should look like this:
new_df:
number score
12 NaN
13 NaN
14 NaN
15 NaN
16 10
Separately, I want to filter the data frame from the row where it finds a number in the score column to the end of the data frame.
So, after filtering the data frame should look like this:
new_df:
number score
16 10
17 NaN
18 NaN
How do I filter this data frame?
Kindly help
You can use pd.Series.first_valid_index and pd.Series.last_valid_index like this:
df.loc[df['score'].first_valid_index():]
Output:
number score
4 16 10.0
5 17 NaN
6 18 NaN
And,
df.loc[:df['score'].last_valid_index()]
Output:
number score
0 12 NaN
1 13 NaN
2 14 NaN
3 15 NaN
4 16 10.0
And if you want to clip both the leading and trailing NaNs, you can combine the two:
df.loc[df['score'].first_valid_index():df['score'].last_valid_index()]
Output:
number score
4 16 10.0
You can use a reverse cummax and boolean indexing:
new_df = df[df['score'].notna()[::-1].cummax()]
Output:
number score
0 12 NaN
1 13 NaN
2 14 NaN
3 15 NaN
4 16 10.0
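To see why the reversal works, a small sketch: cummax over the reversed mask turns every position at or before the last valid value into True once it is aligned back by index:
mask = df['score'].notna()  # [False, False, False, False, True, False, False]
rev = mask[::-1].cummax()   # in reversed order, everything from the True onward becomes True
# aligned back to the original index this is [True, True, True, True, True, False, False],
# which keeps rows 0-4 and drops the trailing NaNs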
For the second one, a simple cummax:
new_df = df[df['score'].notna().cummax()]
Output:
number score
4 16 10.0
5 17 NaN
6 18 NaN
I have a dataframe, and I want to select (with loc) the 5 rows before and the 5 rows after each row where the flag value is 1.
df = pd.DataFrame({'A': [2, 1, 3, 4, 7, 8, 11, 1, 15, 20, 15, 16, 87],
                   'flag': [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0]})
Expected output:
df1_before = pd.DataFrame({'A': [1, 3, 4, 7, 8],
                           'flag': [0, 0, 0, 0, 1]})
df1_after = pd.DataFrame({'A': [8, 11, 1, 15, 20],
                          'flag': [1, 1, 1, 0, 0]})
The same process should be repeated for each of the three rows where flag is 1.
I think one easy way is to loop over the index where the flag is 1 and select the rows you want with loc:
l = len(df)
for idx in df[df.flag.astype(bool)].index:
    dfb = df.loc[max(idx - 4, 0):idx]
    dfa = df.loc[idx:min(idx + 4, l)]
    # do stuff
The min and max functions ensure the boundaries are not overrun when a flag=1 falls within the first or last 5 rows. Note also that because loc slicing is inclusive on both ends, if you want 5 rows you need to use +/-4 on idx to get the right segment.
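For the sample dataframe above, the loop yields these slices; here is a sketch that collects them in lists in place of the # do stuff placeholder (reusing l from the snippet):
before, after = [], []
for idx in df[df.flag.astype(bool)].index:       # idx takes the values 5, 6, 7
    before.append(df.loc[max(idx - 4, 0):idx])   # 5 rows ending at the flag
    after.append(df.loc[idx:min(idx + 4, l)])    # 5 rows starting at the flag
# before[0] is rows 1-5 (A = 1, 3, 4, 7, 8), i.e. df1_before;
# after[0] is rows 5-9 (A = 8, 11, 1, 15, 20), i.e. df1_after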
That said, depending on what your actual # do stuff is, you might want to change tactics. Say, for example, you want to calculate the difference between the sum of A over the 5 rows after and the sum over the 5 rows before; you could use rolling and shift:
df['roll'] = df.rolling(5)['A'].sum()
df.loc[df.flag.astype(bool), 'diff_roll'] = df['roll'].shift(-4) - df['roll']
print(df)
A flag roll diff_roll
0 2 0 NaN NaN
1 1 0 NaN NaN
2 3 0 NaN NaN
3 4 0 NaN NaN
4 7 0 17.0 NaN
5 8 1 23.0 32.0  # = 55 - 23; 55 is the sum of A over df1_after, 23 over df1_before
6 11 1 33.0 29.0
7 1 1 31.0 36.0
8 15 0 42.0 NaN
9 20 0 55.0 NaN
10 15 0 62.0 NaN
11 16 0 67.0 NaN
12 87 0 153.0 NaN
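The alignment works because a trailing window of length 5 that ends at row i + 4 covers rows i through i + 4, so shift(-4) places the "after" sum on the flag row itself. A quick consistency check against the printout:
# roll at row 9 sums rows 5-9 (8 + 11 + 1 + 15 + 20 = 55); shift(-4) moves it
# to row 5, where roll = 23 (rows 1-5), giving diff_roll = 55 - 23 = 32
assert df.loc[9, 'roll'] == 55.0 and df.loc[5, 'roll'] == 23.0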
I have a one-column dataframe which looks like this:
Neive Bayes
0 8.322087e-07
1 3.213342e-24
2 4.474122e-28
3 2.230054e-16
4 3.957606e-29
5 9.999992e-01
6 3.254807e-13
7 8.836033e-18
8 1.222642e-09
9 6.825381e-03
10 5.275194e-07
11 2.224289e-06
12 2.259303e-09
13 2.014053e-09
14 1.755933e-05
15 1.889681e-04
16 9.929193e-01
17 4.599619e-05
18 6.944654e-01
19 5.377576e-05
I want to pivot it to wide format, but with specific intervals: the first 9 rows should make up the 9 columns of the first row, and the pattern should continue so that the final table has 9 columns and 9 times fewer rows than now. How would I achieve this?
Using pivot_table:
df.pivot_table(columns=df.index % 9, index=df.index // 9, values='Neive Bayes')
0 1 2 3 4 \
0 8.322087e-07 3.213342e-24 4.474122e-28 2.230054e-16 3.957606e-29
1 6.825381e-03 5.275194e-07 2.224289e-06 2.259303e-09 2.014053e-09
2 6.944654e-01 5.377576e-05 NaN NaN NaN
5 6 7 8
0 0.999999 3.254807e-13 8.836033e-18 1.222642e-09
1 0.000018 1.889681e-04 9.929193e-01 4.599619e-05
2 NaN NaN NaN NaN
Construct a MultiIndex, then set_index and unstack:
import numpy as np

iix = pd.MultiIndex.from_arrays([np.arange(df.shape[0]) // 9,
                                 np.arange(df.shape[0]) % 9])
df_wide = df.set_index(iix)['Neive Bayes'].unstack()
Output:
0 1 2 3 4 \
0 8.322087e-07 3.213342e-24 4.474122e-28 2.230054e-16 3.957606e-29
1 6.825381e-03 5.275194e-07 2.224289e-06 2.259303e-09 2.014053e-09
2 6.944654e-01 5.377576e-05 NaN NaN NaN
5 6 7 8
0 0.999999 3.254807e-13 8.836033e-18 1.222642e-09
1 0.000018 1.889681e-04 9.929193e-01 4.599619e-05
2 NaN NaN NaN NaN
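If the index is a plain RangeIndex, the same reshape can also be done directly in NumPy by padding the values out to a multiple of 9; this is an alternative sketch, not part of either answer above:
import numpy as np

vals = df['Neive Bayes'].to_numpy()
n = -(-len(vals) // 9) * 9  # length rounded up to a multiple of 9 (here 27)
padded = np.pad(vals, (0, n - len(vals)), constant_values=np.nan)
df_wide = pd.DataFrame(padded.reshape(-1, 9))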
I have multiple datasets with different numbers of rows and the same number of columns.
I would like to find the NaN values in each column. For example, consider these two datasets:
dataset1:          dataset2:
a    b             a    b
1    10            2    11
2    9             3    12
3    8             4    13
4    nan           nan  14
5    nan           nan  15
6    nan           nan  16
I want to handle the NaN values in columns a and b as follows:
if a NaN occurs in column b, remove that entire row; if it occurs in column a, fill that value with 0.
This is my code snippet:
a = pd.notnull(data['a'].values.any())
b = pd.notnull((data['b'].values.any()))
if a:
    data = data.dropna(subset=['a'])
if b:
    data[['a']] = data[['a']].fillna(value=0)
which does not work properly.
You just need fillna and dropna, without any control flow:
data = data.dropna(subset=['b']).fillna(0)
Pass your condition to a dict:
df = df.fillna({'a': 0, 'b': np.nan}).dropna()
You do not need 'b' here:
df = df.fillna({'a': 0}).dropna()
EDIT:
df.fillna({'a': 0}).dropna()
Output:
a b
0 2.0 11
1 3.0 12
2 4.0 13
3 0.0 14
4 0.0 15
5 0.0 16
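Since the question mentions multiple datasets, the one-liner can simply be applied in a loop; a sketch, assuming the frames are collected in a list (dataset1 and dataset2 are the names from the question):
datasets = [dataset1, dataset2]
cleaned = [d.fillna({'a': 0}).dropna() for d in datasets]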