I've read all the related pages on Google and Stack Overflow, and I still can't find a solution.
Given this df fragment:
key_br_acc_posid lot_in price
ix
1 1_885020_76141036 0.03 1.30004
2 1_885020_76236801 0.02 1.15297
5 1_885020_76502318 0.50 2752.08000
8 1_885020_76502318 4.50 2753.93000
9 1_885020_76502318 0.50 2753.93000
... ... ...
1042 1_896967_123068980 0.01 1.17657
1044 1_896967_110335293 0.01 28.07100
1047 1_896967_110335293 0.01 24.14000
1053 1_896967_146913299 25.00 38.55000
1054 1_896967_147039856 2.00 121450.00000
How can I create a new column w_avg_price computing the moving weighted average price by key_br_acc_posid? The lot_in is the weight and the price is the value.
I tried many approaches with groupby() + np.average(), but I have to avoid aggregating the data. I need this value in each row.
Use groupby() and then perform the calculation for each group using cumsum()s:
(df.groupby('key_br_acc_posid', as_index = False)
.apply(lambda g: g.assign(w_avg_price = (g['lot_in']*g['price']).cumsum()/g['lot_in'].cumsum()))
.reset_index(level = 0, drop = True)
)
result:
key_br_acc_posid lot_in price w_avg_price
---- ------------------ -------- ------------ -------------
1 1_885020_76141036 0.03 1.30004 1.30004
2 1_885020_76236801 0.02 1.15297 1.15297
5 1_885020_76502318 0.5 2752.08 2752.08
8 1_885020_76502318 4.5 2753.93 2753.74
9 1_885020_76502318 0.5 2753.93 2753.76
1044 1_896967_110335293 0.01 28.071 28.071
1047 1_896967_110335293 0.01 24.14 26.1055
1042 1_896967_123068980 0.01 1.17657 1.17657
1053 1_896967_146913299 25 38.55 38.55
1054 1_896967_147039856 2 121450 121450
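If you want to skip the apply/lambda, the same running weighted average can be built from two grouped cumulative sums; a sketch against the same frame (it gives the same values as above and keeps the original row order):

# cumulative (weight * price) per key, divided by cumulative weight per key
num = (df['lot_in'] * df['price']).groupby(df['key_br_acc_posid']).cumsum()
den = df.groupby('key_br_acc_posid')['lot_in'].cumsum()
df['w_avg_price'] = num / den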
I don't think I'm calculating it exactly right, but what you want is cumsum():
df = pd.DataFrame({'lot_in':[.1,.2,.3],'price':[1.0,1.25,1.3]})
df['mvg_avg'] = (df['lot_in'] * df['price']).cumsum()
print(df)
lot_in price mvg_avg
0 0.1 1.00 0.10
1 0.2 1.25 0.35
2 0.3 1.30 0.74
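To turn that running sum into the weighted average the question asks for, you would also accumulate the weights and divide; a sketch on the same toy frame (still without the per-key grouping shown in the other answer):

import pandas as pd

df = pd.DataFrame({'lot_in': [.1, .2, .3], 'price': [1.0, 1.25, 1.3]})
# running weighted average = cumulative (weight * price) / cumulative weight
df['w_avg_price'] = (df['lot_in'] * df['price']).cumsum() / df['lot_in'].cumsum()
print(df)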
I'm working with a dataframe containing environmental values (Sentinel-2 satellite: NDVI) like:
Date ID_151894 ID_109386 ID_111656 ID_110006 ID_112281 ID_132408
0 2015-07-06 0.82 0.61 0.85 0.86 0.76 nan
1 2015-07-16 0.83 0.81 0.77 0.83 0.84 0.82
2 2015-08-02 0.88 0.89 0.89 0.89 0.86 0.84
3 2015-08-05 nan nan 0.85 nan 0.83 0.77
4 2015-08-12 0.82 0.77 nan 0.65 nan 0.42
5 2015-08-22 0.85 0.85 0.88 0.87 0.83 0.83
The columns correspond to different places, and the nan values are due to cloudy conditions (which happen often in Belgium). There are obviously a lot more values. To remove outliers, I use the method described in the TIMESAT manual (Jönsson & Eklundh, 2015). A value is treated as an outlier if
it deviates more than a maximum deviation (here called cutoff) from the median,
its value is lower than the mean value of its immediate neighbors minus the cutoff,
or it is larger than the highest value of its immediate neighbors plus the cutoff.
So I wrote the code below to do this:
import pandas as pd

NDVI = pd.read_excel("C:/Python_files/Cartofor/NDVI_frene_5ha.xlsx")
NDVIF = NDVI.copy()  # work on a copy; this is the frame the loops below modify
date = NDVI["Date"]
MED = NDVI.median(axis = 0, skipna = True, numeric_only=True)
SD = NDVI.std(axis = 0, skipna = True, numeric_only=True)
cutoff = 1.5 * SD
for j in range(1, 21):  # columns (column 0 is Date, so the numeric-only stats are offset by one)
    for i in range(1, 480):  # rows
        if NDVIF.iloc[i, j] < ((NDVIF.iloc[i-1, j] + NDVIF.iloc[i+1, j]) / 2) - cutoff.iloc[j-1]:  # below the neighbor mean minus cutoff
            NDVIF.iloc[i, j] = float('NaN')
        elif NDVIF.iloc[i, j] > max(NDVIF.iloc[i-1, j], NDVIF.iloc[i+1, j]) + cutoff.iloc[j-1]:  # 2) above the highest neighbor plus cutoff
            NDVIF.iloc[i, j] = float('NaN')
        elif abs(MED.iloc[j-1] - cutoff.iloc[j-1]) <= NDVIF.iloc[i, j] <= abs(MED.iloc[j-1] + cutoff.iloc[j-1]):  # 1) within the median band
            pass  # value is kept
        else:
            NDVIF.iloc[i, j] = float('NaN')
The problem is that I need to omit the 'NaN' values in the calculations. The goal is to have a dataframe like the one above, without the outliers.
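Something vectorized along these lines is roughly what I am aiming for (a rough sketch of my reading of the same three rules; shift() supplies the immediate neighbors, the skipna statistics ignore the cloudy gaps, and comparisons against NaN simply come out False):

import numpy as np

values = NDVIF.select_dtypes('number')          # every column except Date
med = values.median(skipna=True)
cutoff = 1.5 * values.std(skipna=True)

prev_, next_ = values.shift(1), values.shift(-1)
below_neighbors = values < (prev_ + next_) / 2 - cutoff        # lower than the neighbor mean minus cutoff
above_neighbors = values > np.maximum(prev_, next_) + cutoff   # higher than the highest neighbor plus cutoff
outside_median = (values - med).abs() > cutoff                 # outside the median +/- cutoff band
NDVIF[values.columns] = values.mask(below_neighbors | above_neighbors | outside_median)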
Once this is done, I have to interpolate the values onto a new chosen time index (e.g. one value per day, or one value every five days, from 2016 to 2020) and write each interpolated column to a txt file to feed it into the TIMESAT software.
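For that interpolation step I imagine something roughly like this (the date range, frequency and file names are only placeholders):

import pandas as pd

ts = NDVIF.set_index('Date').sort_index()       # assumes Date was parsed as datetime by read_excel
new_index = pd.date_range('2016-01-01', '2020-12-31', freq='5D')   # or freq='D' for daily values
interpolated = (ts.reindex(ts.index.union(new_index))               # keep the real observations for interpolation
                  .interpolate(method='time')
                  .reindex(new_index))                               # then keep only the chosen grid
for col in interpolated.columns:
    interpolated[col].to_csv(col + '.txt', sep='\t', header=False)   # one file per place for TIMESAT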
I hope my English is not too bad, and thank you for your answers! :)
There are a number of answers that each provide me with a portion of my desired result, but I am struggling to put them all together. My core Pandas data frame looks like this, where I am trying to estimate volume_step_1:
date volume_step_0 volume_step_1
2018-01-01 100 a
2018-01-02 101 b
2018-01-03 105 c
2018-01-04 123 d
2018-01-05 121 e
I then have a reference table with the conversion rates, for e.g.
step conversion
0 0.60
1 0.81
2 0.18
3 0.99
4 0.75
I have another table containing point estimates of a Poisson distribution:
days_to_complete step_no pc_cases
0 0 0.50
1 0 0.40
2 0 0.07
Using these data, I now want to estimate
volume_step_1 =
(volume_step_0(today) * days_to_complete(step0, day0) * conversion(step0)) +
(volume_step_0(yesterday) * days_to_complete(step0,day1) * conversion(step0))
and so forth.
How do I write some Python code to do so?
Calling your dataframes (from top to bottom as df1, df2, and df3):
df1['volume_step_1'] = (
    # today's volume * pc_cases(step 0, day 0) * conversion(step 0)
    df1['volume_step_0'] *
    df3.loc[(df3['days_to_complete'] == 0) & (df3['step_no'] == 0), 'pc_cases'].iloc[0] *
    df2.loc[df2['step'] == 0, 'conversion'].iloc[0] +
    # yesterday's volume * pc_cases(step 0, day 1) * conversion(step 0)
    df1['volume_step_0'].shift(1) *
    df3.loc[(df3['days_to_complete'] == 1) & (df3['step_no'] == 0), 'pc_cases'].iloc[0] *
    df2.loc[df2['step'] == 0, 'conversion'].iloc[0]
)
EDIT:
IIUC, you are trying to get a 'dot product' of sorts between the volume_step_0 column and the product of pc_cases and conversion for a particular step_no. You can merge df2 and df3 to match the steps:
df_merged = df2.merge(df3, how = 'left', left_on = 'step', right_on = 'step_no')
df_merged.head(3)
step conversion days_to_complete step_no pc_cases
0 0.0 0.6 0.0 0.0 0.50
1 0.0 0.6 1.0 0.0 0.40
2 0.0 0.6 2.0 0.0 0.07
I'm guessing you're only using step k to get volume_step_(k+1), and you want to iterate the sum over the days. The following code generates a vector of days_to_complete(step0, dayk) and conversion(step0) for all values of k that are available in days_to_complete, and finds their product:
df_fin = df_merged[df_merged['step'] == 0][['conversion', 'pc_cases']].product(axis = 1)
0 0.300
1 0.240
2 0.042
df_fin = df_fin[::-1].reset_index(drop = True)
Finally, you want to take the dot product of the days_to_complete * conversion vector with the volume_step_0 vector, over a rolling window (with as many values as exist in days_to_complete):
vol_step_1 = pd.Series([df1['volume_step_0'][i:i+len(df3)].reset_index(drop = True).dot(df_fin) for i in range(0,len(df3))])
df1['volume_step_1'] = vol_step_1[::-1].reset_index(drop = True)
Output:
df1
date volume_step_0 volume_step_1
0 2018-01-01 100 NaN
1 2018-01-02 101 NaN
2 2018-01-03 105 70.230
3 2018-01-04 123 66.342
4 2018-01-05 121 59.940
While this is by no means a comprehensive solution, the code is meant to provide the logic to "sum multiple products", as you had asked.
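The same "sum of products" can also be written as a rolling dot product, which keeps the alignment explicit: each row gets the window that ends on that day, matching the today/yesterday formula in the question (note this places the values in the opposite order from the output above). A sketch reusing df_fin from the steps above:

import numpy as np

weights = df_fin.to_numpy()            # conversion * pc_cases for day 2, day 1, day 0 (oldest first)
df1['volume_step_1'] = (df1['volume_step_0']
                          .rolling(window=len(weights))
                          .apply(lambda v: np.dot(v, weights), raw=True))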
I have the following dataset (replication):
ordinal_var fraction error_on_fraction
1 1.2 0.1
2 0.87 0.23
4 1.12 0.11
5 0.75 0.06
5 0.66 0.15
6 0.98 0.08
7 1.34 0.05
7 2.86 0.12
Now I want to do a linear regression analysis (preferably in R, but Python is also fine) where I pass the error in y for each point within the formula. In R this would be something like (for better understanding of the question):
lm(fraction +-error_on_fraction ~ ordinal_var, data = dataset)
Of course I tried to find out how to do it myself first, but I can't find an answer.
For a previous analysis with errors on both x and y I just used the scipy.odr library, but I can't find how to do it with an error in only the y (response) variable.
Any help would be much appreciated!
We can use a simple weighted least squares model.
Sample data
Let's read in your sample data.
df <- read.table(text =
"ordinal_var fraction error_on_fraction
1 1.2 0.1
2 0.87 0.23
4 1.12 0.11
5 0.75 0.06
5 0.66 0.15
6 0.98 0.08
7 1.34 0.05
7 2.86 0.12", header = T)
Weighted least squares model
We fit a weighted linear model of the form fraction ~ ordered(ordinal_var), where the weights are given by 1 / error_on_fraction.
fit <- lm(
fraction ~ ordered(ordinal_var),
weights = 1 / error_on_fraction,
data = df)
summary(fit)
#
#Call:
#lm(formula = fraction ~ ordered(ordinal_var), data = df, weights = 1/error_on_fraction)
#
#Weighted Residuals:
# 1 2 3 4 5 6 7
# 2.220e-16 -1.851e-16 -1.753e-17 1.050e-01 -1.660e-01 1.810e-17 -1.999e+00
# 8
# 3.097e+00
#
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 1.1136 0.3365 3.309 0.0804 .
#ordered(ordinal_var).L 0.3430 0.7847 0.437 0.7047
#ordered(ordinal_var).Q 0.6228 0.7057 0.883 0.4706
#ordered(ordinal_var).C 0.2794 0.8920 0.313 0.7838
#ordered(ordinal_var)^4 0.2127 0.9278 0.229 0.8400
#ordered(ordinal_var)^5 -0.2469 0.7916 -0.312 0.7846
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 2.61 on 2 degrees of freedom
#Multiple R-squared: 0.5427, Adjusted R-squared: -0.6004
#F-statistic: 0.4748 on 5 and 2 DF, p-value: 0.783
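Since the question says Python is also fine: a rough statsmodels equivalent of the same weighted fit, treating ordinal_var as numeric rather than as an ordered factor, and using the same 1 / error_on_fraction weights as above (1 / error_on_fraction**2 is the other common convention). Here df is assumed to be a pandas frame holding the same three columns:

import statsmodels.api as sm

X = sm.add_constant(df['ordinal_var'])
fit = sm.WLS(df['fraction'], X, weights=1 / df['error_on_fraction']).fit()
print(fit.summary())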
As a part of a data comparison in Python, I have a dataframe's output. As you can see, PROD_ and PROJ_ data are compared.
Sample:
print (df)
PROD_Label PROJ_Label Diff_Label PROD_OAD PROJ_OAD \
0 Energy Energy True 1.94 1.94
1 Food and Beverage Food and Beverage True 1.97 1.97
2 Healthcare Healthcare True 8.23 8.23
3 Consumer Products Consumer Products True 3.67 NaN
4 Retailers Retailers True 5.88 NaN
Diff_OAD PROD_OAD_Tin PROJ_OAD_Tin Diff_OAD_Tin
0 True 0.02 0.02 True
1 True 0.54 0.01 False
2 True 0.05 0.05 True
3 False 0.02 0.02 True
4 False 0.06 0.06 True
String columns like PROD_Label and PROJ_Label are "non-null objects". Here the comparison results in true/false, as expected.
Numeric columns like PROD_OAD, PROJ_OAD, PROD_OAD_Tin, PROJ_OAD_Tin are "non-null float64". Currently my output shows the comparison as true and false (as above), but I expect it to show the actual differences, and only for the numeric columns.
Is there a way to specify the particular column names and get the differences dumped into the corresponding Diff_ columns?
Please note I don't want to compare all the PROD_ and PROJ_ columns. The string differences are already correct as true/false; I'm just looking at the specific columns that are numeric.
If only the numeric columns share this structure, it is possible to select just the numeric columns, extract the unique suffixes of their names, and use them in a for loop with sub:
import numpy as np

a = df.select_dtypes([np.number]).columns.str.split('_', n=1).str[1].unique()
print (a)
Index(['OAD', 'OAD_Tin'], dtype='object')
for x in a:
df['Diff_' + x] = df['PROJ_' + x].sub(df['PROD_' + x], fill_value=0)
print (df)
PROD_Label PROJ_Label Diff_Label PROD_OAD PROJ_OAD \
0 Energy Energy True 1.94 1.94
1 Food and Beverage Food and Beverage True 1.97 1.97
2 Healthcare Healthcare True 8.23 8.23
3 Consumer Products Consumer Products True 3.67 NaN
4 Retailers Retailers True 5.88 NaN
Diff_OAD PROD_OAD_Tin PROJ_OAD_Tin Diff_OAD_Tin
0 0.00 0.02 0.02 0.00
1 0.00 0.54 0.01 -0.53
2 0.00 0.05 0.05 0.00
3 -3.67 0.02 0.02 0.00
4 -5.88 0.06 0.06 0.00
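Note that fill_value=0 treats the missing PROJ_OAD values as zero, which is why Diff_OAD comes out as -3.67 and -5.88 in the last two rows. If you would rather keep those differences as NaN, drop the fill_value in the same loop:

for x in a:
    df['Diff_' + x] = df['PROJ_' + x].sub(df['PROD_' + x])   # NaN on either side stays NaN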
A quick question as I'm currently changing from R to pandas for some projects:
I get the following print output from metrics.classification_report from scikit-learn:
precision recall f1-score support
0 0.67 0.67 0.67 3
1 0.50 1.00 0.67 1
2 1.00 0.80 0.89 5
avg / total 0.83 0.78 0.79 9
I want to use this (and similar ones) as a matrix/dataframe so that I could subset it to extract, say, the precision of class 0.
In R, I'd give the first "column" a name like 'outcome_class' and then subset it:
my_dataframe[my_dataframe$class_outcome == 1, 'precision']
And I can do this in pandas, but the report I want to use as a dataframe is simply a string (see scikit-learn's docs).
How can I turn this table output into a usable dataframe in pandas?
Assign it to a variable, s:
s = classification_report(y_true, y_pred, target_names=target_names)
Or directly:
s = '''
precision recall f1-score support
class 0 0.50 1.00 0.67 1
class 1 0.00 0.00 0.00 1
class 2 1.00 0.67 0.80 3
avg / total 0.70 0.60 0.61 5
'''
Use that as the string input for StringIO:
import io # For Python 2.x use import StringIO
df = pd.read_table(io.StringIO(s), sep=r'\s{2,}', engine='python') # For Python 2.x use StringIO.StringIO(s)
df
Out:
precision recall f1-score support
class 0 0.5 1.00 0.67 1
class 1 0.0 0.00 0.00 1
class 2 1.0 0.67 0.80 3
avg / total 0.7 0.60 0.61 5
Now you can slice it like an R data.frame:
df.loc['class 2']['f1-score']
Out: 0.80000000000000004
Here, classes are the index of the DataFrame. You can use reset_index() if you want to use it as a regular column:
df = df.reset_index().rename(columns={'index': 'outcome_class'})
df.loc[df['outcome_class']=='class 1', 'support']
Out:
1 1
Name: support, dtype: int64
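On scikit-learn 0.20 and later you can also skip the string parsing entirely: output_dict=True returns a nested dict that converts straight into a DataFrame (same y_true, y_pred and target_names as in the snippet above):

import pandas as pd
from sklearn.metrics import classification_report

report = classification_report(y_true, y_pred, target_names=target_names, output_dict=True)
df = pd.DataFrame(report).transpose()   # classes (and the average rows) become the index
df.loc['class 2', 'f1-score']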