Why does pandas.interpolate() interpolate single values surrounded by NaNs? - python

I have a problem with pandas interpolate(). I only want to interpolate when there are no more than 2 successive NaNs.
But interpolate() also fills values inside gaps of more than 2 NaNs!?
import numpy as np
import pandas as pd

s = pd.Series(data=[np.nan, 10, np.nan, np.nan, np.nan, 5, np.nan, 6, np.nan, np.nan, 30])
a = s.interpolate(limit=2, limit_area='inside')
print(a)
The output I get is:
0 NaN
1 10.00
2 8.75
3 7.50
4 NaN
5 5.00
6 5.50
7 6.00
8 14.00
9 22.00
10 30.00
dtype: float64
I do not want the interpolated results at index 2 and 3.
What I want is:
0 NaN
1 10.00
2 NaN
3 NaN
4 NaN
5 5.00
6 5.50
7 6.00
8 14.00
9 22.00
10 30.00
dtype: float64
Can anybody please help?

Groupby.transform with Series.where
Note that limit=2 only caps how many consecutive NaNs interpolate() fills; inside a longer gap it still fills the first 2 values, which is exactly what you see at index 2 and 3. To skip long gaps entirely, build a mask of the positions that may be filled and apply it with Series.where:
s_notna = s.notna()
# each group holds one non-NaN value plus the run of NaNs that follows it,
# so a group of size <= 3 means at most 2 consecutive NaNs
m = s.groupby(s_notna.cumsum()).transform('size').le(3) | s_notna
s = s.interpolate(limit_area='inside').where(m)
print(s)
Output
0 NaN
1 10.0
2 NaN
3 NaN
4 NaN
5 5.0
6 5.5
7 6.0
8 14.0
9 22.0
10 30.0
dtype: float64
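To see why the mask works, here is the same logic unrolled step by step (a sketch using the asker's series; the variable names are only for illustration):
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 10, np.nan, np.nan, np.nan, 5, np.nan, 6, np.nan, np.nan, 30])

s_notna = s.notna()
# Each non-NaN value starts a new group that also contains the run of
# NaNs following it, so a group of size <= 3 has at most 2 NaNs.
group_id = s_notna.cumsum()
run_is_short = s.groupby(group_id).transform('size').le(3)
mask = run_is_short | s_notna  # keep real values and short gaps
print(s.interpolate(limit_area='inside').where(mask))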

Related

How to detect when a price crosses above a previous high

I am trying to find when a price crosses above a previous high. I can find the highs, but when I compare them to the current price it gives me all 1s.
My code:
peak = df[(df['price'] > df['price'].shift(-1)) & (df['price'] > df['price'].shift(1))]
df['peak'] = peak
df['breakout'] = df['price'] > df['peak']
print(df)
Out:
    price   peak  breakout
1       2    NaN         1
2       2    NaN         1
3       4    NaN         1
4       5    NaN         1
5       6    6.0         1
6       5    NaN         1
7       4    NaN         1
8       3    NaN         1
9      12   12.0         1
10     10    NaN         1
11     50    NaN         1
12    100    NaN         1
13    110  110.0         1
14     84    NaN         1
Expected:
    price   peak  high  breakout
1       2    NaN     0         0
2       2    NaN     0         0
3       4    NaN     0         0
4       5    NaN     0         0
5       6    6.0     1         1
6       5    NaN     0         0
7       4    NaN     0         0
8       3    NaN     0         0
9      12   12.0     1         1
10     10    NaN     0         0
11     50    NaN     0         1
12    100    NaN     0         1
13    110  110.0     1         1
14     84    NaN     0         0
With fillna:
price peak look breakout
0 2 NaN NaN False
1 4 NaN NaN False
2 5 NaN NaN False
3 6 6.0 6.0 False
4 5 NaN 6.0 False
5 4 NaN 6.0 False
6 3 NaN 6.0 False
7 12 12.0 12.0 False ----> this should be True because it is higher than 6 and it is also the high for shift(-1) and shift(1)
8 10 NaN 12.0 False
9 50 NaN 12.0 True
10 100 100.0 100.0 False
11 40 NaN 100.0 False
12 45 45.0 45.0 False
13 30 NaN 45.0 False
14 200 NaN 45.0 True
Try with pandas.DataFrame.fillna:
df["breakout"] = df["price"] >= df["peak"].fillna(method = "ffill")
If you want it with 1s and 0s add the line:
df["breakout"] = df["breakout"].replace([True, False],[1,0])
Note that df["peak"].fillna(method = "ffill") returns:
0 NaN
1 NaN
2 NaN
3 NaN
4 6.0
5 6.0
6 6.0
7 6.0
8 12.0
9 12.0
10 12.0
11 12.0
12 110.0
13 110.0
Name: peak, dtype: float64
So you can compare it easily with the price column.
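Putting it together, a minimal runnable sketch of the whole approach (using the asker's prices; .ffill() is the modern spelling of fillna(method='ffill'), and the boolean mask avoids assigning a filtered frame to a column):
import pandas as pd

df = pd.DataFrame({'price': [2, 2, 4, 5, 6, 5, 4, 3, 12, 10, 50, 100, 110, 84]})
# a local peak is strictly higher than both of its neighbours
is_peak = (df['price'] > df['price'].shift(-1)) & (df['price'] > df['price'].shift(1))
df['peak'] = df['price'].where(is_peak)
# breakout: price at or above the most recent peak
df['breakout'] = (df['price'] >= df['peak'].ffill()).astype(int)
print(df)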

Rolling moving average and std dev by multiple columns dynamically

I have a dataframe like this
import pandas as pd
import numpy as np
raw_data = {'Country':['UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','US','US','US','US','US','US'],
'Product':['A','A','A','A','B','B','B','B','B','B','B','B','C','C','C','D','D','D','D','D','D'],
'Week': [1,2,3,4,1,2,3,4,5,6,7,8,1,2,3,1,2,3,4,5,6],
'val': [5,4,3,1,5,6,7,8,9,10,11,12,5,5,5,5,6,7,8,9,10]
}
df2 = pd.DataFrame(raw_data, columns = ['Country','Product','Week', 'val'])
print(df2)
I want to calculate a moving average and std dev of the val column by Country and Product, for several window sizes: 3 weeks, 5 weeks, 7 weeks, etc.
Wanted dataframe:
'Country', 'Product', 'Week', 'val', '3wks_avg', '3wks_std', '5wks_avg', '5wks_std', ...etc
Like WenYoBen suggested, we can create a list of all the window sizes you want, and then dynamically create your wanted columns with GroupBy.rolling:
weeks = [3, 5, 7]
for week in weeks:
    df2[[f'{week}wks_avg', f'{week}wks_std']] = (
        df2.groupby(['Country', 'Product']).rolling(window=week, on='Week')['val']
           .agg(['mean', 'std']).reset_index(drop=True)
    )
Country Product Week val 3wks_avg 3wks_std 5wks_avg 5wks_std 7wks_avg 7wks_std
0 UK A 1 5 nan nan nan nan nan nan
1 UK A 2 4 nan nan nan nan nan nan
2 UK A 3 3 4.00 1.00 nan nan nan nan
3 UK A 4 1 2.67 1.53 nan nan nan nan
4 UK B 1 5 nan nan nan nan nan nan
5 UK B 2 6 nan nan nan nan nan nan
6 UK B 3 7 6.00 1.00 nan nan nan nan
7 UK B 4 8 7.00 1.00 nan nan nan nan
8 UK B 5 9 8.00 1.00 7.00 1.58 nan nan
9 UK B 6 10 9.00 1.00 8.00 1.58 nan nan
10 UK B 7 11 10.00 1.00 9.00 1.58 8.00 2.16
11 UK B 8 12 11.00 1.00 10.00 1.58 9.00 2.16
12 UK C 1 5 nan nan nan nan nan nan
13 UK C 2 5 nan nan nan nan nan nan
14 UK C 3 5 5.00 0.00 nan nan nan nan
15 US D 1 5 nan nan nan nan nan nan
16 US D 2 6 nan nan nan nan nan nan
17 US D 3 7 6.00 1.00 nan nan nan nan
18 US D 4 8 7.00 1.00 nan nan nan nan
19 US D 5 9 8.00 1.00 7.00 1.58 nan nan
20 US D 6 10 9.00 1.00 8.00 1.58 nan nan
This is how you would get the moving average for 3 weeks:
df2['3weeks_avg'] = list(df2.groupby(['Country', 'Product']).rolling(3).mean()['val'])
Apply the same principle for the other columns you want to compute.
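For example, the 3-week rolling standard deviation can be added the same way (a sketch; wrapping in list() drops the group keys so the values align positionally, which assumes df2 is already ordered by Country, Product and Week, as the sample data is):
df2['3weeks_std'] = list(df2.groupby(['Country', 'Product']).rolling(3).std()['val'])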
IIUC, you may try this:
wks = ['Week_3', 'Week_5', 'Week_7']
df_calc = (df2.groupby(['Country', 'Product']).expanding().val
              .agg(['mean', 'std']).rename(lambda x: f'Week_{x+1}', level=-1)
              .query('ilevel_2 in @wks').unstack())
Out[246]:
mean std
Week_3 Week_5 Week_7 Week_3 Week_5 Week_7
Country Product
UK A 4.0 NaN NaN 1.0 NaN NaN
B NaN 5.0 6.0 NaN NaN 1.0
You will want to use a groupby-transform to get the rolling moments of your data. The following should compute what you are looking for:
weeks = [3, 5, 7]  # define week windows
df2 = df2.sort_values('Week')  # order by time
for i in weeks:  # loop through the window sizes you want to compute
    df2['{}wks_avg'.format(i)] = df2.groupby(['Country', 'Product'])['val'].transform(lambda x: x.rolling(i).mean())  # i-week rolling mean
    df2['{}wks_std'.format(i)] = df2.groupby(['Country', 'Product'])['val'].transform(lambda x: x.rolling(i).std())  # i-week rolling std
Here is what the resulting dataframe will look like. After dropna(), only the UK/B rows with enough history for the 7-week window remain:
print(df2.dropna().to_string())
   Country Product  Week  val  3wks_avg  3wks_std  5wks_avg  5wks_std  7wks_avg  7wks_std
10      UK       B     7   11      10.0       1.0       9.0  1.581139       8.0  2.160247
11      UK       B     8   12      11.0       1.0      10.0  1.581139       9.0  2.160247
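Note that sort_values('Week') leaves the frame ordered by week across groups; if you want the original row order back afterwards, one line restores it:
df2 = df2.sort_index()  # restore the original Country/Product row ordering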

How to remove clustered/unclustered values less than a certain length from pandas dataframe?

If I have a pandas data frame like this:
A
1 1
2 1
3 NaN
4 1
5 NaN
6 1
7 1
8 1
9 1
10 NaN
11 1
12 1
13 1
How do I remove values that occur in consecutive runs shorter than some length (in this case four)? Such that I get an array like this:
A
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 1
7 1
8 1
9 1
10 NaN
11 NaN
12 NaN
13 NaN
Using groupby and np.where
# count the non-NaN values in each group; each NaN starts a new group,
# so a group is one NaN (except possibly the first) plus the run of values after it
s = df.groupby(df.A.isnull().cumsum()).transform(lambda s: pd.notnull(s).sum())
# keep only values that belong to runs of length 4 or more
df['B'] = np.where(s.A >= 4, df.A, np.nan)
Outputs
A B
1 1.0 NaN
2 1.0 NaN
3 NaN NaN
4 1.0 NaN
5 NaN NaN
6 1.0 1.0
7 1.0 1.0
8 1.0 1.0
9 1.0 1.0
10 NaN NaN
11 1.0 NaN
12 1.0 NaN
13 1.0 NaN
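If you want to overwrite A itself instead of adding a helper column B, assign the same expression back (a minimal variant of the answer above, reusing its s):
df['A'] = np.where(s.A >= 4, df.A, np.nan)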

Pandas Count frequency of values by column

I am attempting to apply several operations that I usually do easily in R to the sample dataset below, using Python/Pandas.
S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
QUER.MAC 9 8 3 5 6 0 5 0 0 0
QUER.VEL 8 9 8 7 0 0 0 0 0 0
CARY.OVA 6 6 2 7 0 2 0 0 0 0
PRUN.SER 3 5 6 6 6 4 5 0 4 1
QUER.ALB 5 4 9 9 7 7 4 6 0 2
JUGL.NIG 2 0 0 0 3 5 6 4 3 0
QUER.RUB 3 4 0 6 9 8 7 6 4 3
JUGL.CIN 0 0 5 0 2 0 0 2 0 2
ULMU.AME 2 2 4 5 6 0 5 0 2 5
TILI.AME 0 0 0 0 2 7 6 6 7 6
ULMU.RUB 4 0 2 2 5 7 8 8 8 7
CARY.COR 0 0 0 0 0 5 6 4 0 3
OSTR.VIR 0 0 0 0 0 0 7 4 6 5
ACER.SAC 0 0 0 0 0 5 4 8 8 9
After reading the data from a text file with
import numpy as np
import pandas as pd
df = pd.read_csv("sample.txt", header=0, index_col=0, delimiter=' ')
I want to: (1) get the frequency of values larger than zero for each column; (2) get the sum of values in each column; (3) find the maximum value in each column.
I managed to obtain (2) using
N = df.apply(lambda x: np.sum(x))
But could not figure out how to achieve (1) and (3).
I need generic solutions that do not depend on the column names, because I want to apply these operations to any number of similar matrices (which of course will have different labels and numbers of columns/rows).
Thanks in advance for any hints and suggestions.
Your 1st:
df.gt(0).sum()
Your 2nd:
df.sum()
Your 3rd:
df.max()
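If you want all three answers side by side in one frame, a small sketch (the column names freq_gt0/sum/max are just illustrative; df is the frame read in the question):
import pandas as pd

summary = pd.DataFrame({
    'freq_gt0': df.gt(0).sum(),  # (1) count of values > 0 per column
    'sum': df.sum(),             # (2) column sums
    'max': df.max(),             # (3) column maxima
})
print(summary)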
You can use mask and describe to get a bunch of stats by column.
df.mask(df <= 0).describe().T
Output:
count mean std min 25% 50% 75% max
S1 9.0 4.666667 2.549510 2.0 3.00 4.0 6.00 9.0
S2 7.0 5.428571 2.439750 2.0 4.00 5.0 7.00 9.0
S3 8.0 4.875000 2.642374 2.0 2.75 4.5 6.50 9.0
S4 8.0 5.875000 2.031010 2.0 5.00 6.0 7.00 9.0
S5 9.0 5.111111 2.368778 2.0 3.00 6.0 6.00 9.0
S6 9.0 5.555556 1.878238 2.0 5.00 5.0 7.00 8.0
S7 11.0 5.727273 1.272078 4.0 5.00 6.0 6.50 8.0
S8 9.0 5.333333 2.000000 2.0 4.00 6.0 6.00 8.0
S9 8.0 5.250000 2.314550 2.0 3.75 5.0 7.25 8.0
S10 10.0 4.300000 2.540779 1.0 2.25 4.0 5.75 9.0
The reason to use mask is that count counts all non-NaN values, so masking anything that is less than or equal to 0 turns those entries into NaN and excludes them from count.
And, finally, we can add "sum" too, using assign:
df.mask(df<=0).describe().T.assign(sum=df.sum())
Output:
count mean std min 25% 50% 75% max sum
S1 9.0 4.666667 2.549510 2.0 3.00 4.0 6.00 9.0 42
S2 7.0 5.428571 2.439750 2.0 4.00 5.0 7.00 9.0 38
S3 8.0 4.875000 2.642374 2.0 2.75 4.5 6.50 9.0 39
S4 8.0 5.875000 2.031010 2.0 5.00 6.0 7.00 9.0 47
S5 9.0 5.111111 2.368778 2.0 3.00 6.0 6.00 9.0 46
S6 9.0 5.555556 1.878238 2.0 5.00 5.0 7.00 8.0 50
S7 11.0 5.727273 1.272078 4.0 5.00 6.0 6.50 8.0 63
S8 9.0 5.333333 2.000000 2.0 4.00 6.0 6.00 8.0 48
S9 8.0 5.250000 2.314550 2.0 3.75 5.0 7.25 8.0 42
S10 10.0 4.300000 2.540779 1.0 2.25 4.0 5.75 9.0 43

Computing the difference between first and last values in a rolling window

I am using the Pandas rolling window tool on a one-column dataframe whose index is in datetime form.
I would like to compute, for each window, the difference between the first and the last value of that window. How do I refer to relative positions within the window when writing the lambda function (in place of the ... below)?
df2 = df.rolling('3s').apply(...)
IIUC:
In [93]: df = pd.DataFrame(np.random.randint(10,size=(9, 3)))
In [94]: df
Out[94]:
0 1 2
0 7 4 5
1 9 9 3
2 1 7 6
3 0 9 2
4 2 3 7
5 6 7 1
6 1 0 1
7 8 4 7
8 0 0 9
In [95]: df.rolling(window=3).apply(lambda x: x[0]-x[-1])
Out[95]:
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
2 6.0 -3.0 -1.0
3 9.0 0.0 1.0
4 -1.0 4.0 -1.0
5 -6.0 2.0 1.0
6 1.0 3.0 6.0
7 -2.0 3.0 -6.0
8 1.0 0.0 -8.0
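A caveat for newer pandas versions: by default rolling().apply() passes each window to the function as a Series, where x[0] is a label-based lookup and can raise a KeyError. Passing raw=True hands the lambda a plain NumPy array, so the positional indexing above keeps working:
df.rolling(window=3).apply(lambda x: x[0] - x[-1], raw=True)
With the time-based window from the question, the same idea reads:
df2 = df.rolling('3s').apply(lambda x: x[0] - x[-1], raw=True)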
