Express pandas operations as pipeline

df = df.loc[:, dict_lup.values()].rename(columns={v: k for k, v in dict_lup.items()})
df['cover'] = df.loc[:, 'cover'] * 100.
df['id'] = df['condition'].map(constants.dict_c)
df['temperature'] = (df['min_t'] + df['max_t'])/2.
Is there a way to express the code above as a pandas pipeline? I am stuck at the first step where I rename some columns in the dataframe and select a subset of the columns.
-- EDIT:
Data is here:
max_t col_a min_t cover condition pressure
0 38.02 1523106000 19.62 0.48 269.76 1006.64
1 39.02 1523196000 20.07 0.29 266.77 1008.03
2 39 1523282400 19.48 0.78 264.29 1008.29
3 39.11 1523368800 20.01 0.7 263.68 1008.29
4 38.59 1523455200 20.88 0.83 262.35 1007.36
5 39.33 1523541600 22 0.65 261.87 1006.82
6 38.96 1523628000 24.05 0.57 259.27 1006.96
7 39.09 1523714400 22.53 0.88 256.49 1007.94

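I think you need assign. Wrap the chain in parentheses so it can span several lines, and pass lambdas so that each new column is computed from the renamed frame rather than from the original df:
df = (df.loc[:, list(dict_lup.values())]
        .rename(columns={v: k for k, v in dict_lup.items()})
        .assign(cover = lambda d: d['cover'] * 100.,
                id = lambda d: d['condition'].map(constants.dict_c),
                temperature = lambda d: (d['min_t'] + d['max_t'])/2.))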
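To make the pipeline idea concrete, here is a minimal, self-contained sketch. The raw column names, the contents of dict_lup and dict_c, and the helper select_and_rename are all hypothetical stand-ins, since the question does not show them. The select-and-rename step is moved into a small function so it can be chained with pipe.

```python
import pandas as pd

# Hypothetical raw data and lookups (not shown in the question)
raw = pd.DataFrame({
    'tmax_raw': [38.02, 39.02],
    'tmin_raw': [19.62, 20.07],
    'cloud': [0.48, 0.29],
    'cond_code': ['clear', 'rain'],
    'pressure': [1006.64, 1008.03],
})
dict_lup = {'max_t': 'tmax_raw', 'min_t': 'tmin_raw',
            'cover': 'cloud', 'condition': 'cond_code'}
dict_c = {'clear': 0, 'rain': 1}


def select_and_rename(df, lookup):
    """Keep only the columns named in lookup.values() and rename them to the keys."""
    return (df.loc[:, list(lookup.values())]
              .rename(columns={v: k for k, v in lookup.items()}))


out = (raw
       .pipe(select_and_rename, dict_lup)
       # lambdas receive the intermediate (already renamed) frame
       .assign(cover=lambda d: d['cover'] * 100.,
               id=lambda d: d['condition'].map(dict_c),
               temperature=lambda d: (d['min_t'] + d['max_t']) / 2.))
print(out)
```

Because every step returns a new DataFrame, the original raw frame is left untouched, which is one reason this style is often preferred over repeated in-place assignments.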

Related

How to display `.value_counts()` in interval in pandas dataframe

I need to display .value_counts() per interval in a pandas DataFrame. Here's my code:
prob['bucket'] = pd.qcut(prob['prob good'], 20)
grouped = prob.groupby('bucket', as_index = False)
kstable = pd.DataFrame()
kstable['min_prob'] = grouped.min()['prob good']
kstable['max_prob'] = grouped.max()['prob good']
kstable['counts'] = prob['bucket'].value_counts()
My Output
min_prob max_prob counts
0 0.26 0.48 NaN
1 0.49 0.52 NaN
2 0.53 0.54 NaN
3 0.55 0.56 NaN
4 0.57 0.58 NaN
I know that I have a problem in the kstable['counts'] syntax, but how do I solve this?

Use named aggregation to simplify your code. For the counts, GroupBy.size is applied to the bucket column and written to the new column counts:
prob['bucket'] = pd.qcut(prob['prob good'], 20)
kstable = prob.groupby('bucket', as_index = False).agg(min_prob=('prob good','min'),
max_prob=('prob good','max'),
counts=('bucket','size'))
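Your original approach can also work with DataFrame.assign, provided the counts are passed as a plain array in bucket order. A Series would be aligned on its interval index, which does not match the integer index of kstable and gives NaN:
kstable = kstable.assign(counts = prob['bucket'].value_counts().sort_index().to_numpy())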
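A small runnable sketch of the named-aggregation idea follows. The random probabilities, the choice of 5 buckets instead of 20, and counting via the 'prob good' column are assumptions made only to keep the example short and self-contained.

```python
import numpy as np
import pandas as pd

# Hypothetical scores standing in for the question's 'prob good' column
rng = np.random.default_rng(0)
prob = pd.DataFrame({'prob good': rng.uniform(0.2, 0.9, size=200)})

# Quantile buckets (5 here, 20 in the question)
prob['bucket'] = pd.qcut(prob['prob good'], 5)

# One row per bucket with min, max and row count computed in a single pass
kstable = (prob.groupby('bucket', observed=True)
               .agg(min_prob=('prob good', 'min'),
                    max_prob=('prob good', 'max'),
                    counts=('prob good', 'size'))
               .reset_index())
print(kstable)
```

Computing the counts inside the same agg call avoids the index alignment issue entirely, because all three columns come from the same grouped object.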

How can I compute the cumulative weighted average in new column?

I have read all the related pages on Google and Stack Overflow, and I still can't find a solution.
Given this df fragment:
key_br_acc_posid lot_in price
ix
1 1_885020_76141036 0.03 1.30004
2 1_885020_76236801 0.02 1.15297
5 1_885020_76502318 0.50 2752.08000
8 1_885020_76502318 4.50 2753.93000
9 1_885020_76502318 0.50 2753.93000
... ... ...
1042 1_896967_123068980 0.01 1.17657
1044 1_896967_110335293 0.01 28.07100
1047 1_896967_110335293 0.01 24.14000
1053 1_896967_146913299 25.00 38.55000
1054 1_896967_147039856 2.00 121450.00000
How can I create a new column w_avg_price computing the moving weighted average price by key_br_acc_posid? The lot_in is the weight and the price is the value.
I tried many approaches with groupby() + np.average(), but I have to avoid aggregating the data. I need this value in each row.

Use groupby and then perform the calculation for each group using cumsum():
(df.groupby('key_br_acc_posid', as_index = False)
.apply(lambda g: g.assign(w_avg_price = (g['lot_in']*g['price']).cumsum()/g['lot_in'].cumsum()))
.reset_index(level = 0, drop = True)
)
result:
key_br_acc_posid lot_in price w_avg_price
---- ------------------ -------- ------------ -------------
1 1_885020_76141036 0.03 1.30004 1.30004
2 1_885020_76236801 0.02 1.15297 1.15297
5 1_885020_76502318 0.5 2752.08 2752.08
8 1_885020_76502318 4.5 2753.93 2753.74
9 1_885020_76502318 0.5 2753.93 2753.76
1044 1_896967_110335293 0.01 28.071 28.071
1047 1_896967_110335293 0.01 24.14 26.1055
1042 1_896967_123068980 0.01 1.17657 1.17657
1053 1_896967_146913299 25 38.55 38.55
1054 1_896967_147039856 2 121450 121450
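As an alternative sketch, the same running weighted average can be computed without apply by taking two grouped cumulative sums and dividing them. The frame below copies only the first five rows of the question's fragment; the intermediate names weighted and weights are illustrative.

```python
import pandas as pd

# First five rows of the fragment shown in the question
df = pd.DataFrame(
    {
        'key_br_acc_posid': ['1_885020_76141036', '1_885020_76236801',
                             '1_885020_76502318', '1_885020_76502318',
                             '1_885020_76502318'],
        'lot_in': [0.03, 0.02, 0.50, 4.50, 0.50],
        'price': [1.30004, 1.15297, 2752.08, 2753.93, 2753.93],
    },
    index=pd.Index([1, 2, 5, 8, 9], name='ix'),
)

# Running sum of weight * value within each key
weighted = (df['lot_in'] * df['price']).groupby(df['key_br_acc_posid']).cumsum()
# Running sum of the weights within each key
weights = df.groupby('key_br_acc_posid')['lot_in'].cumsum()

# Row-wise cumulative weighted average; the original row order and index are kept
df['w_avg_price'] = weighted / weights
print(df)
```

Both cumulative sums keep the original index, so no reset_index step is needed and the result lines up with the input rows directly.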

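I don't think I'm calculating it right, but what you want is cumsum(). The code below only gives the running weighted sum; dividing it by df['lot_in'].cumsum() would turn it into the running weighted average.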
df = pd.DataFrame({'lot_in':[.1,.2,.3],'price':[1.0,1.25,1.3]})
df['mvg_avg'] = (df['lot_in'] * df['price']).cumsum()
print(df)
lot_in price mvg_avg
0 0.1 1.00 0.10
1 0.2 1.25 0.35
2 0.3 1.30 0.74

An efficient way to calculate deltas in the DataFrame?

I need to calculate the delta and I did it, but I'm using itertuples and I want to avoid it.
Is there an efficient way to do that? Here is how I did it:
from numpy import append, around, array, float64
from numpy.random import uniform
from pandas import DataFrame
matrix = around(a=uniform(low=1.0, high=50.0, size=(10, 2)), decimals=2)
points = DataFrame(data=matrix, columns=['x', 'y'], dtype='float64')
x_column = points.columns.get_loc('x')
y_column = points.columns.get_loc('y')
x_delta = array(object=[], dtype=float64)
y_delta = array(object=[], dtype=float64)
for row, iterator in enumerate(iterable=points.itertuples(index=False, name='Point')):
    if row == 0:
        x_delta = append(arr=x_delta, values=0.0)
        y_delta = append(arr=y_delta, values=0.0)
    else:
        x_delta = append(arr=x_delta, values=iterator.x / points.iat[row - 1, x_column] - 1)
        y_delta = append(arr=y_delta, values=iterator.y / points.iat[row - 1, y_column] - 1)
x_delta = around(a=x_delta, decimals=2)
y_delta = around(a=y_delta, decimals=2)
points.insert(loc=points.shape[1], column='x_delta', value=x_delta)
points.insert(loc=points.shape[1], column='y_delta', value=y_delta)
print(points)
x y x_delta y_delta
0 26.08 1.37 0.00 0.00
1 8.34 6.82 -0.68 3.98
2 38.42 45.20 3.61 5.63
3 3.59 33.12 -0.91 -0.27
4 42.94 11.06 10.96 -0.67
5 31.99 17.38 -0.26 0.57
6 4.29 17.46 -0.87 0.00
7 19.68 22.28 3.59 0.28
8 27.55 12.98 0.40 -0.42
9 40.23 9.60 0.46 -0.26
Thanks a lot!

Pandas has a pct_change() function, which compares the current and prior element. You can achieve the same result with one line:
points[['x_delta', 'y_delta']] = points[['x', 'y']].pct_change().fillna(0).round(2)
The fillna(0) is to fix the first row which would otherwise return as NaN.
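As a quick check, the sketch below applies pct_change() to the first three rows of the sample output in the question and reproduces the same deltas. Using join with add_suffix to attach the new columns is just one convenient choice here.

```python
import pandas as pd

# First three rows of the sample data printed in the question
points = pd.DataFrame({'x': [26.08, 8.34, 38.42],
                       'y': [1.37, 6.82, 45.20]})

# pct_change() computes current / previous - 1 for every column at once;
# the first row has no predecessor, so its NaN is replaced by 0
deltas = points.pct_change().fillna(0).round(2).add_suffix('_delta')

result = points.join(deltas)
print(result)
```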

Pandas has the built-in .diff() function. It calculates the difference between a DataFrame element and another element in the DataFrame (by default, the element in the previous row).
delta_dataframe = original_dataframe.diff()
In this case delta_dataframe will give you the absolute change between rows of original_dataframe, not the relative change (current / previous - 1) computed in the question.
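To show how diff() relates to the ratio computed in the question, here is a short sketch on the first three sample rows. Dividing the row-to-row difference by the previous row (obtained with shift()) gives the relative change; the variable names are illustrative.

```python
import pandas as pd

# First three rows of the sample data printed in the question
points = pd.DataFrame({'x': [26.08, 8.34, 38.42],
                       'y': [1.37, 6.82, 45.20]})

# Absolute change: current row minus previous row
abs_delta = points.diff()

# Relative change: (current - previous) / previous, i.e. current / previous - 1
rel_delta = points.diff() / points.shift()

print(abs_delta)
print(rel_delta.round(2))
```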

Pandas: rolling windows with a sum product

There are a number of answers that each provide me with a portion of my desired result, but I am challenged putting them all together. My core Pandas data frame looks like this, where I am trying to estimate volume_step_1:
date volume_step_0 volume_step_1
2018-01-01 100 a
2018-01-02 101 b
2018-01-03 105 c
2018-01-04 123 d
2018-01-05 121 e
I then have a reference table with the conversion rates, e.g.
step conversion
0 0.60
1 0.81
2 0.18
3 0.99
4 0.75
I have another table containing point estimates of a Poisson distribution:
days_to_complete step_no pc_cases
0 0 0.50
1 0 0.40
2 0 0.07
Using these data, I now want to estimate
volume_step_1 =
(volume_step_0(today) * days_to_complete(step0, day0) * conversion(step0)) +
(volume_step_0(yesterday) * days_to_complete(step0,day1) * conversion(step0))
and so forth.
How do I write some Python code to do so?

Calling your dataframes df1, df2, and df3 (from top to bottom):
# scalar conversion rate and completion shares for step 0
conv_0 = df2.loc[df2['step'] == 0, 'conversion'].iloc[0]
pc_day_0 = df3.loc[(df3['days_to_complete'] == 0) & (df3['step_no'] == 0), 'pc_cases'].iloc[0]
pc_day_1 = df3.loc[(df3['days_to_complete'] == 1) & (df3['step_no'] == 0), 'pc_cases'].iloc[0]
df1['volume_step_1'] = (
    df1['volume_step_0'] * pc_day_0 * conv_0 +
    df1['volume_step_0'].shift(1) * pc_day_1 * conv_0)
EDIT:
IIUC, you are trying to get a 'dot product' of sorts between the volume_step_0 column and the product of pc_cases and conversion for a particular step_no. You can merge df2 and df3 to match steps:
df_merged = df2.merge(df3, how = 'left', left_on = 'step', right_on = 'step_no')
df_merged.head(3)
step conversion days_to_complete step_no pc_cases
0 0.0 0.6 0.0 0.0 0.50
1 0.0 0.6 1.0 0.0 0.40
2 0.0 0.6 2.0 0.0 0.07
I'm guessing you're only using step k to get volume_step_k+1, and you want to iterate the sum over the days. The following code generates a vector of days_to_complete(step0, dayk) and conversion(step0) for all values of k that are available in days_to_complete, and finds their product:
df_fin = df_merged[df_merged['step'] == 0][['conversion', 'pc_cases']].product(axis = 1)
0 0.300
1 0.240
2 0.042
df_fin = df_fin[::-1].reset_index(drop = True)
Finally, you want to take the dot product of the days_to_complete * conversion vector by the volume_step_0 vector, for a rolling window (as many values exist in days_to_complete):
n = len(df_fin)
vol_step_1 = [df1['volume_step_0'].iloc[i:i+n].reset_index(drop = True).dot(df_fin) for i in range(len(df1) - n + 1)]
df1['volume_step_1'] = pd.Series(vol_step_1, index = df1.index[n-1:])
Output:
df1
date volume_step_0 volume_step_1
0 2018-01-01 100 NaN
1 2018-01-02 101 NaN
2 2018-01-03 105 59.940
3 2018-01-04 123 66.342
4 2018-01-05 121 70.230
While this is by no means a comprehensive solution, the code is meant to provide the logic to "sum multiple products", as you had asked.
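The "sum of multiple products" over a sliding window is a discrete convolution, so one possible sketch uses np.convolve. This swaps in a different technique from the answer above, uses only the step 0 numbers given in the question, and assumes that a date without a full history should stay NaN.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'date': pd.date_range('2018-01-01', periods=5),
    'volume_step_0': [100, 101, 105, 123, 121],
})

conversion_step0 = 0.60                  # conversion rate for step 0
pc_cases = np.array([0.50, 0.40, 0.07])  # share completing after 0, 1, 2 days

# weights[k] applies to the volume observed k days before "today"
weights = pc_cases * conversion_step0

# mode='valid' returns one value per date that has a full history:
# volume[t] * weights[0] + volume[t-1] * weights[1] + volume[t-2] * weights[2]
valid = np.convolve(df1['volume_step_0'].to_numpy(), weights, mode='valid')

# Align each result with the last date of its window; earlier dates become NaN
df1['volume_step_1'] = pd.Series(valid, index=df1.index[len(weights) - 1:])
print(df1)
```

For example, the value for 2018-01-03 is 105 * 0.30 + 101 * 0.24 + 100 * 0.042, which follows the formula written in the question.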

Pandas - get last n values from a group with an offset.

I have a data frame (pandas, Python 3.5) with the date as index.
The electricity_use is the label I should predict.
e.g.
City Country electricity_use
DATE
7/1/2014 X A 1.02
7/1/2014 Y A 0.25
7/2/2014 X A 1.21
7/2/2014 Y A 0.27
7/3/2014 X A 1.25
7/3/2014 Y A 0.20
7/4/2014 X A 0.97
7/4/2014 Y A 0.43
7/5/2014 X A 0.54
7/5/2014 Y A 0.45
7/6/2014 X A 1.33
7/6/2014 Y A 0.55
7/7/2014 X A 2.01
7/7/2014 Y A 0.21
7/8/2014 X A 1.11
7/8/2014 Y A 0.34
7/9/2014 X A 1.35
7/9/2014 Y A 0.18
7/10/2014 X A 1.22
7/10/2014 Y A 0.27
Of course the data is larger.
My goal is to add to each row the last 3 electricity_use values of its group ('City', 'Country'), with a gap of 5 days (i.e. to take the 3 most recent values dated 5 or more days back). The dates can be non-consecutive, but they are ordered.
For example, for the last two rows the result should be:
City Country electricity_use prev_1 prev_2 prev_3
DATE
7/10/2014 X A 1.22 0.54 0.97 1.25
7/10/2014 Y A 0.27 0.45 0.43 0.20
because the date is 7/10/2014 and the gap is 5 days, so we start looking from 7/5/2014, and those are the last 3 values up to that date for each group (in this case, the groups are (X, A) and (Y, A)).
I implemented it with a loop that goes over each group, but I have a feeling it could be done in a much more efficient way.

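A naive approach would be to shift the dates of a copy of the dataframe and merge it back in n times. This only finds values recorded exactly 5, 6 and 7 days earlier.
from datetime import timedelta
import pandas as pd
# make sure index is in datetime format
df.index = pd.to_datetime(df.index)
df['date'] = df.index
result = df.copy()
for i in range(3):
    # re-date each observation 5+i days forward so it lines up with the row that needs it
    prev = df[['City','Country','date','electricity_use']].copy()
    prev['date'] = prev['date'] + timedelta(days=5+i)
    result = result.merge(prev,on=['City','Country','date'],how='left',suffixes=('','_'+str(i)))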
A faster approach would be to use shift and then remove the bogus values
df['date'] = df.index
df.sort_values(by=['City','Country','date'],inplace=True)
temp = df[['City','Country','date']].groupby(['City','Country']).first()
# To pick the oldest date of every city + country group
df = df.merge(temp,left_on=['City','Country'],right_index=True,suffixes=('','_first'))
# Whole days elapsed since the first date of the group
df['diff_date'] = (df['date'] - df['date_first']).dt.days
# Do a shift by 5
for i in range(5,8):
    df['days_prior_'+str(i)] = df['electricity_use'].shift(i)
    # Top i values for every City + Country group would be bogus values as they would be values of the group prior to it
    df.loc[df['diff_date'] < i,'days_prior_'+str(i)] = 0
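Since the question allows non-consecutive dates, a position lookup by date may be safer than shifting by a fixed number of rows. The sketch below is one possible way to do that, not the method of the answers above: the helper add_prev_values and its parameters are hypothetical, and it assumes a DatetimeIndex with at most one row per date in each group.

```python
import numpy as np
import pandas as pd


def add_prev_values(group, gap_days=5, n=3, value_col='electricity_use'):
    """Hypothetical helper: add prev_1..prev_n to one (City, Country) group."""
    group = group.sort_index()
    dates = group.index.to_numpy()
    values = group[value_col].to_numpy(dtype=float)
    cutoff = dates - np.timedelta64(gap_days, 'D')
    # Position of the latest row dated on or before the cutoff (-1 if none)
    pos = np.searchsorted(dates, cutoff, side='right') - 1
    new_cols = {}
    for k in range(n):
        idx = pos - k
        # Clip before indexing so rows without enough history get NaN, not a wrap-around
        new_cols[f'prev_{k + 1}'] = np.where(idx >= 0, values[np.clip(idx, 0, None)], np.nan)
    return group.assign(**new_cols)


# Sample data from the question
dates = pd.date_range('2014-07-01', '2014-07-10', name='DATE')
use_x = [1.02, 1.21, 1.25, 0.97, 0.54, 1.33, 2.01, 1.11, 1.35, 1.22]
use_y = [0.25, 0.27, 0.20, 0.43, 0.45, 0.55, 0.21, 0.34, 0.18, 0.27]
df = pd.concat([
    pd.DataFrame({'City': 'X', 'Country': 'A', 'electricity_use': use_x}, index=dates),
    pd.DataFrame({'City': 'Y', 'Country': 'A', 'electricity_use': use_y}, index=dates),
]).sort_index(kind='stable')

result = pd.concat(
    [add_prev_values(g) for _, g in df.groupby(['City', 'Country'])]
).sort_index(kind='stable')
print(result.tail(2))
```

The Python loop runs only over the groups and over n, while the per-row lookup is done by np.searchsorted, so the cost per group stays close to that of a sort.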
