I'm having trouble writing code to fill in the in-between values in a pandas DataFrame.
The dataframe:
value
30
NaN
NaN
25
NaN
20
NaN
NaN
NaN
NaN
15
...
The formula is:
value_before_nan - (value_before_nan - value_after_nan) / (number of NaNs remaining in between)
For example, the expected values should be computed like this:
30 - (30 - 25)/2 = 27.5
27.5 - (27.5 - 25)/1 = 25
so the expected dataframe will look like this:
value  expected value
30     30
NaN    27.5
NaN    25
25     25
NaN    20
20     20
NaN    18.75
NaN    17.5
NaN    16.25
NaN    15
15     15
...    ...
IIUC, you can generalize your formula into two parts:
Any NaN right before a non-NaN is just the same as that number:
{value_before_nan} - ({value_before_nan} - {value_after_nan})/1 = {value_after_nan}
The rest of the NaNs are linear interpolation.
So you can use bfill with interpolate:
df.bfill(limit=1).interpolate()
Output:
value
0 30.00
1 27.50
2 25.00
3 25.00
4 20.00
5 20.00
6 18.75
7 17.50
8 16.25
9 15.00
10 15.00
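For reference, a minimal reproduction of the whole example, using the values from the question:
import numpy as np
import pandas as pd

# Back-fill only the single NaN directly before each non-NaN,
# then linearly interpolate the remaining gaps
df = pd.DataFrame({'value': [30, np.nan, np.nan, 25, np.nan, 20,
                             np.nan, np.nan, np.nan, np.nan, 15]})
print(df.bfill(limit=1).interpolate())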
I have a CSV file that I can read and print:
reference radius diameter length sfcefin pltol mitol sfcetrement
0 jnl1 15 30.0 35 Rz2 0.0 -0.03 Stellite Spray
1 jnl2 28 56.0 50 NaN NaN NaN NaN
2 jnl3 10 20.0 25 NaN NaN NaN NaN
3 jnlfce1 15 NaN 15 NaN NaN NaN NaN
4 jnlfce2 28 NaN 13 NaN NaN NaN NaN
5 jnlfce3 28 NaN 18 NaN NaN NaN NaN
6 jnlfce4 10 NaN 10 NaN NaN NaN NaN
I have managed to isolate and print a specific row using
df1 = df[df['reference'].str.contains(feature)]
reference radius diameter length sfcefin pltol mitol sfcetrement
1 jnl2 28 56.0 50 NaN NaN NaN NaN
I now want to select the radius column and put its value into a variable.
I have tried a similar technique on the output df1, but with no success:
value = df1[df1['radius']]
print(value)
Has anyone any more suggestions?
You can use .loc and simply do:
value = df.loc[df['reference'].str.contains(feature), 'radius']
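Note that .loc with a boolean mask returns a Series, not a scalar. A small sketch of pulling out the number itself, assuming the mask matches exactly one row (as with feature = 'jnl2' in the example):
# Take the first (and only) matching value as a plain scalar
value = df.loc[df['reference'].str.contains(feature), 'radius'].iloc[0]
print(value)  # 28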
Coming from R, where I have mostly worked with the tidyverse, I wonder how pandas groupby and aggregation work. I have this code, and the results are heartbreaking to me.
import pandas as pd
df = pd.read_csv('https://gist.githubusercontent.com/ZeccaLehn/4e06d2575eb9589dbe8c365d61cb056c/raw/64f1660f38ef523b2a1a13be77b002b98665cdfe/mtcars.csv')
df.rename(columns={'Unnamed: 0':'brand'}, inplace=True)
Now I would like to calculate the average displacement (disp) by cylinder count, like this:
df['avg_disp'] = df.groupby('cyl').disp.mean()
Which results in something like:
cyl disp avg_disp
31 4 121.0 NaN
2 4 108.0 NaN
27 4 95.1 NaN
26 4 120.3 NaN
25 4 79.0 NaN
20 4 120.1 NaN
7 4 146.7 NaN
8 4 140.8 353.100000
19 4 71.1 NaN
18 4 75.7 NaN
17 4 78.7 NaN
29 6 145.0 NaN
0 6 160.0 NaN
1 6 160.0 NaN
3 6 258.0 NaN
10 6 167.6 NaN
9 6 167.6 NaN
5 6 225.0 NaN
13 8 275.8 NaN
28 8 351.0 NaN
4 8 360.0 105.136364
24 8 400.0 NaN
23 8 350.0 NaN
22 8 304.0 NaN
21 8 318.0 NaN
6 8 360.0 183.314286
11 8 275.8 NaN
16 8 440.0 NaN
30 8 301.0 NaN
14 8 472.0 NaN
12 8 275.8 NaN
15 8 460.0 NaN
After searching for a while, I discovered the transform function, which leads to the correct avg_disp value by assigning the group mean to each row according to the grouping variable cyl.
My point is: why can't this be done simply with the mean function, instead of using .transform('mean') on the grouped data frame?
If you want to add the results back to the ungrouped dataframe, use .transform, which (per the docs) will:
...return a DataFrame having the same indexes as the original object filled with the transformed values.
Plain .mean(), by contrast, returns one value per group, indexed by the group keys. Assigning that Series back aligns the keys 4, 6 and 8 with the row labels 4, 6 and 8 of df, which is exactly why only those three rows in your output got a value and everything else is NaN.
df['avg_disp'] = df.groupby('cyl').disp.transform('mean')
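A quick sketch of the difference, with the group means as they appear in your own output above:
means = df.groupby('cyl').disp.mean()
print(means)
# cyl
# 4    105.136364
# 6    183.314286
# 8    353.100000
# Name: disp, dtype: float64

# transform broadcasts each group's mean back to that group's original rows
df['avg_disp'] = df.groupby('cyl').disp.transform('mean')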
I have a Pandas dataset with a monthly Date-time index and a column of outstanding orders (like below):
Date        orders
1991-01-01  NaN
1991-02-01  NaN
1991-03-01  24
1991-04-01  NaN
1991-05-01  NaN
1991-06-01  NaN
1991-07-01  NaN
1991-08-01  34
1991-09-01  NaN
1991-10-01  NaN
1991-11-01  22
1991-12-01  NaN
I want to linearly interpolate the values to fill the NaNs. However, it has to be applied within 6-month blocks (non-rolling). For example, one 6-month block would be all the rows between 1991-01-01 and 1991-06-01. Within each block I want both forward and backward linear imputation, treating the value just outside the block edge as 0, so that leading NaNs ramp up from 0 and trailing NaNs descend toward 0. So for the same dataset above, here is how I would like the end result to look:
Date        orders
1991-01-01  8
1991-02-01  16
1991-03-01  24
1991-04-01  18
1991-05-01  12
1991-06-01  6
1991-07-01  17
1991-08-01  34
1991-09-01  30
1991-10-01  26
1991-11-01  22
1991-12-01  11
However, I am lost on how to do this in Pandas. Any ideas?
The idea is to group per 6 months, prepend and append a 0 value to each group, interpolate, and then drop the first and last 0 per group:
df['Date'] = pd.to_datetime(df['Date'])
# Pad each 6-month block with a leading and trailing 0, interpolate
# linearly, then drop the padding again
f = lambda x: pd.Series([0] + x.tolist() + [0]).interpolate().iloc[1:-1]
df['orders'] = (df.groupby(pd.Grouper(freq='6MS', key='Date'))['orders']
                  .transform(f))
print(df)
Date orders
0 1991-01-01 8.0
1 1991-02-01 16.0
2 1991-03-01 24.0
3 1991-04-01 18.0
4 1991-05-01 12.0
5 1991-06-01 6.0
6 1991-07-01 17.0
7 1991-08-01 34.0
8 1991-09-01 30.0
9 1991-10-01 26.0
10 1991-11-01 22.0
11 1991-12-01 11.0
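To see what the lambda does to a single block, here is a small walk-through on the January-to-June slice, with values taken from the example above:
import pandas as pd

block = pd.Series([float('nan'), float('nan'), 24,
                   float('nan'), float('nan'), float('nan')])
# Pad with 0 on both ends so the edges ramp toward 0, then interpolate
padded = pd.Series([0] + block.tolist() + [0]).interpolate()
print(padded.iloc[1:-1].tolist())  # [8.0, 16.0, 24.0, 18.0, 12.0, 6.0]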
I have a data frame like:
A B datetime
10 NaN 12-03-2020 04:43:11
NaN 20 13-03-2020 04:43:11
NaN NaN 14-03-2020 04:43:11
NaN NaN 15-03-2020 04:43:11
NaN NaN 16-03-2020 04:43:11
NaN 50 17-03-2020 04:43:11
20 NaN 18-03-2020 04:43:11
NaN 30 19-03-2020 04:43:11
NaN NaN 20-03-2020 04:43:11
30 30 21-03-2020 04:43:11
40 NaN 22-03-2020 04:43:11
NaN 10 23-03-2020 04:43:11
The code which I'm using is:
df['timegap_in_min'] = np.where( ((df['A'].notna()) &(df[['B','c']].shift(-1).notna())),df['Datetime'].shift(-1) - df['timestamp'], np.nan)
df['timegap_in_min'] = df['timegap_in_min'].astype('timedelta64[h]')
The required output is:
A B datetime prev_timegap next_timegap
10 NaN 12-03-2020 04:43:11 NaN 24
NaN 20 13-03-2020 04:43:11 NaN NaN
NaN NaN 14-03-2020 04:43:11 NaN NaN
NaN NaN 15-03-2020 04:43:11 NaN NaN
NaN NaN 16-03-2020 04:43:11 NaN NaN
NaN 50 17-03-2020 04:43:11 NaN NaN
20 NaN 18-03-2020 04:43:11 24 24
NaN 30 19-03-2020 04:43:11 NaN NaN
NaN NaN 20-03-2020 04:43:11 NaN NaN
30 30 21-03-2020 04:43:11 24 24
40 NaN 22-03-2020 04:43:11 24 24
NaN 10 23-03-2020 04:43:11 NaN NaN
Can someone help me correct my logic?
Just slightly adjust your code as follows:
df['prev_timegap'] = np.where( ((df['A'].notna()) & (df['B'].shift(1).notna())), abs(df['datetime'].shift(1) - df['datetime']), np.nan)
df['prev_timegap'] = df['prev_timegap'].astype('timedelta64[h]')
df['next_timegap'] = np.where( ((df['A'].notna()) & (df['B'].shift(-1).notna())), abs(df['datetime'].shift(-1) - df['datetime']), np.nan)
df['next_timegap'] = df['next_timegap'].astype('timedelta64[h]')
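One caveat, assuming your datetime column is still plain strings: parse it first so the subtractions yield Timedeltas (the sample dates are day-first). Also, .astype('timedelta64[h]') no longer converts to float hours in pandas 2.x; dividing total_seconds() by 3600 is a portable alternative:
# Assumption: 'datetime' has not been parsed yet (day-first format as in the sample)
df['datetime'] = pd.to_datetime(df['datetime'], dayfirst=True)

# Portable hour conversion that also works on pandas 2.x and later
df['prev_timegap'] = pd.to_timedelta(df['prev_timegap']).dt.total_seconds() / 3600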
The code above should give you the desired result, based on the description in your question title. However, the test result differs slightly from your tabulated required output at the following row:
A B datetime prev_timegap next_timegap
...
30 30 21-03-2020 04:43:11 24 24
...
...
instead, the result is:
A B datetime prev_timegap next_timegap
...
30 30 21-03-2020 04:43:11 NaN NaN
...
...
This result follows from your phrase "previous time gap and next time gap" (assuming "previous" and "next" refer to a different timeframe, i.e. a different 'datetime').
Note that in your sample df, for the row where column A has value 30, the corresponding values in B on the previous date and the next date are both NaN. Hence, shouldn't we show NaN instead?
If your requirement also includes showing values for the two time gaps when both columns A and B are non-NaN on the current datetime, the code above would need to be enhanced further.
For example, if I have a data frame like this:
20 40 60 80 100 120 140
1 1 1 1 NaN NaN NaN NaN
2 1 1 1 1 1 NaN NaN
3 1 1 1 1 NaN NaN NaN
4 1 1 NaN NaN 1 1 1
How do I find the last index in each row and then count the difference in columns elapsed so I get something like this?
20 40 60 80 100 120 140
1 40 20 0 NaN NaN NaN NaN
2 80 60 40 20 0 NaN NaN
3 60 40 20 0 NaN NaN NaN
4 20 0 NaN NaN 40 20 0
You can try reversing each row, counting the cumulative not-null values within each run, and then scaling the count by the column step (20):
import numpy as np

def fill_values(row):
    # Reverse the row so we count from the last column backwards
    row = row[::-1]
    a = row == 1
    b = a.cumsum()
    # Cumulative count of consecutive 1s, restarting after each NaN
    runs = b - b.mask(a).ffill().fillna(0).astype(int)
    return runs[::-1] * 20

df.apply(fill_values, axis=1).replace(0, np.nan) - 20
Out:
20 40 60 80 100 120 140
1 40.0 20.0 0.0 NaN NaN NaN NaN
2 80.0 60.0 40.0 20.0 0.0 NaN NaN
3 60.0 40.0 20.0 0.0 NaN NaN NaN
4 20.0 0.0 NaN NaN 40.0 20.0 0.0
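To make the trick concrete, here is a walk-through on row 4 of the example; the mask/cumsum pair measures how far each cell sits from the end of its run of 1s:
import pandas as pd

row = pd.Series([1, 1, None, None, 1, 1, 1],
                index=[20, 40, 60, 80, 100, 120, 140])
a = row[::-1] == 1          # flag the 1s, scanning right to left
b = a.cumsum()              # 1 2 3 3 3 4 5
runs = b - b.mask(a).ffill().fillna(0).astype(int)  # 1 2 3 0 0 1 2
print((runs[::-1] * 20).tolist())  # [40, 20, 0, 0, 60, 40, 20]
The full solution then turns the zeros into NaN and subtracts 20, which yields row 4 of the output above.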