How to display `.value_counts()` in interval in pandas dataframe - python

I need to display .value_counts() per interval in a pandas DataFrame. Here's my code:
prob['bucket'] = pd.qcut(prob['prob good'], 20)
grouped = prob.groupby('bucket', as_index = False)
kstable = pd.DataFrame()
kstable['min_prob'] = grouped.min()['prob good']
kstable['max_prob'] = grouped.max()['prob good']
kstable['counts'] = prob['bucket'].value_counts()
My Output
min_prob max_prob counts
0 0.26 0.48 NaN
1 0.49 0.52 NaN
2 0.53 0.54 NaN
3 0.55 0.56 NaN
4 0.57 0.58 NaN
I know that I have a problem in the kstable['counts'] syntax, but how do I solve it?

Use named aggregation to simplify your code. The new counts column is computed with GroupBy.size, applied to the bucket column:
prob['bucket'] = pd.qcut(prob['prob good'], 20)
kstable = prob.groupby('bucket', as_index = False).agg(min_prob=('prob good','min'),
max_prob=('prob good','max'),
counts=('bucket','size'))
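Your original code returns NaN because value_counts() is indexed by the bucket intervals, while kstable has a default integer index, so the assignment cannot align the two. Taking the counts from the same grouped object keeps the rows aligned, for example with DataFrame.assign:
kstable = kstable.assign(counts = grouped.size()['size'])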
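A minimal, self-contained sketch of the named-aggregation approach follows. The random data and the choice of 200 rows are assumptions made only for illustration, and the size is computed on the 'prob good' column (rather than on the grouping column) to keep the sketch independent of pandas version details.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 200 random probabilities (not from the original question)
rng = np.random.default_rng(0)
prob = pd.DataFrame({"prob good": rng.random(200)})

# Cut the probabilities into 20 quantile buckets
prob["bucket"] = pd.qcut(prob["prob good"], 20)

# One pass over the groups: min, max and row count per bucket
kstable = (
    prob.groupby("bucket", observed=True)
        .agg(min_prob=("prob good", "min"),
             max_prob=("prob good", "max"),
             counts=("prob good", "size"))
        .reset_index()
)
print(kstable.head())
```

Because all three columns come from the same groupby, they share one index and no NaN values appear in counts.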

Related

How to rearrange a more complex Excel data set?

I am new to asking questions on Stack Overflow, so please let me know if I have missed something.
I am trying to rearrange some data from an Excel layout like the one below
Excel Data
To like:
Rearranged
I already tried the answer from the Stack Overflow question How to Rearrange Data
I just need to add one more column next to the result of that answer, but with my limited Python knowledge I couldn't find a way to do it.
Could anyone suggest a way to handle a rearrangement that is a little more complex than the one in the link above?
You will have to transform your data a little to get the result you want, but here is my solution:
1. Imports
import pandas as pd
import numpy as np
2. Remove the merged title ("Budget and Actual") from your data. You may want to rename your columns to something like 1/31/2020 Actual and 1/31/2020 Budget. Otherwise, if two columns share the same name, pandas appends a differentiator such as '.1' to the duplicate. Sample data below, with only a couple of columns for demonstration purposes.
Item 1/31/2020 2/29/2020 1/31/2020.1 2/29/2020.1
0 A 0.01 0.02 0.03 0.04
1 B 0.20 0.30 0.40 0.50
2 C 0.33 0.34 0.35 0.36
3. Create two separate datasets for Actuals and Budget
# item name and all budget columns from your dataset
df_budget = df.iloc[:, 0:12]
# item name and the actuals columns (adjust the positions to your sheet)
df_actuals = df.iloc[:, [0,13,14,15,16,17,18,19,20,21,22,23,24,25]]
4. Correct the names of the columns to remove the differentiator '.1' and reflect your dates
# strip the trailing '.1' that pandas added to the duplicated date columns
df_actuals.columns = df_actuals.columns.str.replace(r'\.1$', '', regex=True)
5. Transform the date columns into rows
df_actuals = df_actuals.melt(id_vars=['Item'], value_vars=['1/31/2020', '2/29/2020'], var_name = 'Date', value_name='Actual')
df_budget = df_budget.melt(id_vars=['Item'], value_vars=['1/31/2020', '2/29/2020'], var_name = 'Date', value_name='Budget')
You should see something like this at this point
Item Date Actual
0 A 1/31/2020 0.01
1 B 1/31/2020 0.20
Item Date Budget
0 A 1/31/2020 0.03
1 B 1/31/2020 0.40
6. Merge both datasets
pd.merge(df_actuals, df_budget, on=['Item', 'Date'], sort=True)
Result:
Item Date Actual Budget
0 A 1/31/2020 0.01 0.03
1 A 2/29/2020 0.02 0.04
2 B 1/31/2020 0.20 0.40
3 B 2/29/2020 0.30 0.50
4 C 1/31/2020 0.33 0.35
5 C 2/29/2020 0.34 0.36
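The steps above can be sketched end to end on the small sample table. This is only an illustration under one assumption: following the result table, the first pair of date columns is treated as Actual and the pair with the '.1' suffix as Budget. Adjust the column positions to your own sheet.

```python
import pandas as pd

# Sample data as pandas would load it: duplicated date headers get a '.1' suffix
df = pd.DataFrame({
    "Item": ["A", "B", "C"],
    "1/31/2020": [0.01, 0.20, 0.33],
    "2/29/2020": [0.02, 0.30, 0.34],
    "1/31/2020.1": [0.03, 0.40, 0.35],
    "2/29/2020.1": [0.04, 0.50, 0.36],
})

# Assumption: first block of dates = Actual, '.1' block = Budget
df_actuals = df.iloc[:, [0, 1, 2]]
df_budget = df.iloc[:, [0, 3, 4]].copy()

# Strip the '.1' suffix so both frames use the same date labels
df_budget.columns = df_budget.columns.str.replace(r"\.1$", "", regex=True)

# Wide -> long: one row per (Item, Date)
actual_long = df_actuals.melt(id_vars="Item", var_name="Date", value_name="Actual")
budget_long = df_budget.melt(id_vars="Item", var_name="Date", value_name="Budget")

# Join the two long tables on the shared keys
result = pd.merge(actual_long, budget_long, on=["Item", "Date"], sort=True)
print(result)
```

Leaving out value_vars makes melt use every column other than Item, which avoids typing all the dates by hand.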

Pandas: rolling windows with a sum product

There are a number of answers that each provide me with a portion of my desired result, but I am challenged putting them all together. My core Pandas data frame looks like this, where I am trying to estimate volume_step_1:
date volume_step_0 volume_step_1
2018-01-01 100 a
2018-01-02 101 b
2018-01-03 105 c
2018-01-04 123 d
2018-01-05 121 e
I then have a reference table with the conversion rates, e.g.
step conversion
0 0.60
1 0.81
2 0.18
3 0.99
4 0.75
I have another table containing point estimates of a Poisson distribution:
days_to_complete step_no pc_cases
0 0 0.50
1 0 0.40
2 0 0.07
Using these data, I now want to estimate
volume_step_1 =
(volume_step_0(today) * days_to_complete(step0, day0) * conversion(step0)) +
(volume_step_0(yesterday) * days_to_complete(step0,day1) * conversion(step0))
and so forth.
How do I write some Python code to do so?
Calling your dataframes df1, df2, and df3 (from top to bottom):
# extract the step-0 factors as scalars so they multiply every row of df1
pc_day0 = df3.loc[(df3['days_to_complete'] == 0) & (df3['step_no'] == 0), 'pc_cases'].iloc[0]
pc_day1 = df3.loc[(df3['days_to_complete'] == 1) & (df3['step_no'] == 0), 'pc_cases'].iloc[0]
conv0 = df2.loc[df2['step'] == 0, 'conversion'].iloc[0]
df1['volume_step_1'] = (
df1['volume_step_0'] * pc_day0 * conv0 +
df1['volume_step_0'].shift(1) * pc_day1 * conv0)
EDIT:
IIUC, you are trying to get a 'dot product' of sorts between the volume_step_0 column and the product of pc_cases and conversion for a particular step_no. You can merge df2 and df3 to match steps:
df_merged = df2.merge(df3, how = 'left', left_on = 'step', right_on = 'step_no')
df_merged.head(3)
step conversion days_to_complete step_no pc_cases
0 0.0 0.6 0.0 0.0 0.50
1 0.0 0.6 1.0 0.0 0.40
2 0.0 0.6 2.0 0.0 0.07
I'm guessing you're only using step k to get volume_step_k+1, and you want to iterate the sum over the days. The following code generates a vector of days_to_complete(step0, dayk) and conversion(step0) for all values of k that are available in days_to_complete, and finds their product:
df_fin = df_merged[df_merged['step'] == 0][['conversion', 'pc_cases']].product(axis = 1)
0 0.300
1 0.240
2 0.042
df_fin = df_fin[::-1].reset_index(drop = True)
Finally, you want to take the dot product of the days_to_complete * conversion vector by the volume_step_0 vector, for a rolling window (as many values exist in days_to_complete):
n = len(df_fin)
# each window of n rows is aligned with its last (most recent) row
vol_step_1 = pd.Series([df1['volume_step_0'].iloc[i:i+n].reset_index(drop = True).dot(df_fin) for i in range(len(df1) - n + 1)], index = df1.index[n - 1:])
df1['volume_step_1'] = vol_step_1
Output:
df1
date volume_step_0 volume_step_1
0 2018-01-01 100 NaN
1 2018-01-02 101 NaN
2 2018-01-03 105 59.940
3 2018-01-04 123 66.342
4 2018-01-05 121 70.230
While this is by no means a comprehensive solution, the code is meant to provide the logic to "sum multiple products", as you had asked.
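As a rough, self-contained sketch of the same "sum of products" idea, the lagged terms can also be added up with shift instead of explicit windows. The function name next_step_volume and the frame names conversion and lags are hypothetical, introduced here only for illustration; the data is the sample from the question.

```python
import pandas as pd

# Sample tables from the question
df1 = pd.DataFrame({
    "date": pd.date_range("2018-01-01", periods=5),
    "volume_step_0": [100, 101, 105, 123, 121],
})
conversion = pd.DataFrame({"step": [0, 1, 2, 3, 4],
                           "conversion": [0.60, 0.81, 0.18, 0.99, 0.75]})
lags = pd.DataFrame({"days_to_complete": [0, 1, 2],
                     "step_no": [0, 0, 0],
                     "pc_cases": [0.50, 0.40, 0.07]})


def next_step_volume(volume, lags, conversion, step):
    """Sum volume(t - d) * pc_cases(step, d) * conversion(step) over all lags d."""
    rate = conversion.loc[conversion["step"] == step, "conversion"].iloc[0]
    step_lags = lags[lags["step_no"] == step]
    # One shifted copy of the volume per lag; rows without enough history become NaN
    terms = (
        volume.shift(int(d)) * p * rate
        for d, p in zip(step_lags["days_to_complete"], step_lags["pc_cases"])
    )
    return sum(terms)


df1["volume_step_1"] = next_step_volume(df1["volume_step_0"], lags, conversion, step=0)
print(df1)
```

For 2018-01-03 this gives 105 * 0.50 * 0.60 + 101 * 0.40 * 0.60 + 100 * 0.07 * 0.60 = 59.94, and the first two rows stay NaN because they lack two days of history.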

Express pandas operations as pipeline

df = df.loc[:, dict_lup.values()].rename(columns={v: k for k, v in dict_lup.items()})
df['cover'] = df.loc[:, 'cover'] * 100.
df['id'] = df['condition'].map(constants.dict_c)
df['temperature'] = (df['min_t'] + df['max_t'])/2.
Is there a way to express the code above as a pandas pipeline? I am stuck at the first step where I rename some columns in the dataframe and select a subset of the columns.
-- EDIT:
Data is here:
max_t col_a min_t cover condition pressure
0 38.02 1523106000 19.62 0.48 269.76 1006.64
1 39.02 1523196000 20.07 0.29 266.77 1008.03
2 39 1523282400 19.48 0.78 264.29 1008.29
3 39.11 1523368800 20.01 0.7 263.68 1008.29
4 38.59 1523455200 20.88 0.83 262.35 1007.36
5 39.33 1523541600 22 0.65 261.87 1006.82
6 38.96 1523628000 24.05 0.57 259.27 1006.96
7 39.09 1523714400 22.53 0.88 256.49 1007.94
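I think you need assign, with lambdas so that each new column is computed from the renamed frame, and with the whole chain wrapped in parentheses:
df = (df.loc[:, list(dict_lup.values())]
        .rename(columns={v: k for k, v in dict_lup.items()})
        .assign(cover = lambda x: x['cover'] * 100.,
                id = lambda x: x['condition'].map(constants.dict_c),
                temperature = lambda x: (x['min_t'] + x['max_t'])/2.))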
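A small runnable sketch of such a pipeline is shown here. The raw column names, the contents of dict_lup and dict_c, and the helper select_and_rename are all hypothetical stand-ins, since the question does not show them; the sketch assumes dict_lup maps the new column name to the raw column name.

```python
import pandas as pd

# Hypothetical lookup: new column name -> raw column name
dict_lup = {"max_t": "tmax", "min_t": "tmin", "cover": "cloud", "condition": "cond"}
# Hypothetical mapping from condition code to label
dict_c = {1: "clear", 2: "rain"}

raw = pd.DataFrame({
    "tmax": [38.02, 39.02],
    "tmin": [19.62, 20.07],
    "cloud": [0.48, 0.29],
    "cond": [1, 2],
    "pressure": [1006.64, 1008.03],  # not in dict_lup, so it is dropped
})


def select_and_rename(frame, lookup):
    """Keep only the looked-up columns and give them their new names."""
    return (frame.loc[:, list(lookup.values())]
                 .rename(columns={v: k for k, v in lookup.items()}))


out = (
    raw.pipe(select_and_rename, dict_lup)
       .assign(cover=lambda x: x["cover"] * 100.0,
               id=lambda x: x["condition"].map(dict_c),
               temperature=lambda x: (x["min_t"] + x["max_t"]) / 2.0)
)
print(out)
```

The lambdas receive the intermediate frame, so they see the renamed columns rather than the columns of the original variable.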

get means and SEM in one df with pandas groupby

I'd like to find an efficient way to use the df.groupby() function in pandas to return both the means and the standard errors of the mean (SEM) of a data frame - preferably in one shot!
import pandas as pd
df = pd.DataFrame({'case':[1, 1, 2, 2, 3, 3],
'condition':[1,2,1,2,1,2],
'var_a':[0.92, 0.88, 0.90, 0.79, 0.94, 0.85],
'var_b':[0.21, 0.15, 0.1, 0.16, 0.17, 0.23]})
with that data, I'd like an easier way (if there is one!) to perform the following:
grp_means = df.groupby('case', as_index=False).mean()
grp_sems = df.groupby('case', as_index=False).sem()
grp_means.rename(columns={'var_a':'var_a_mean', 'var_b':'var_b_mean'},
inplace=True)
grp_sems.rename(columns={'var_a':'var_a_SEM', 'var_b':'var_b_SEM'},
inplace=True)
grouped = pd.concat([grp_means, grp_sems[['var_a_SEM', 'var_b_SEM']]], axis=1)
grouped
Out[1]:
case condition var_a_mean var_b_mean var_a_SEM var_b_SEM
0 1 1.5 0.900 0.18 0.900 0.18
1 2 1.5 0.845 0.13 0.845 0.13
2 3 1.5 0.895 0.20 0.895 0.20
I also recently learned of the .agg() function, and tried df.groupby('grouper column') agg('var':'mean', 'var':sem') but this just returns a SyntaxError.
I think you need DataFrameGroupBy.agg, then drop the ('condition','sem') column and use map to flatten the MultiIndex columns:
df = df.groupby('case').agg(['mean','sem']).drop(('condition','sem'), axis=1)
df.columns = df.columns.map('_'.join)
df = df.reset_index()
print (df)
case condition_mean var_a_mean var_a_sem var_b_mean var_b_sem
0 1 1.5 0.900 0.020 0.18 0.03
1 2 1.5 0.845 0.055 0.13 0.03
2 3 1.5 0.895 0.045 0.20 0.03
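As an alternative sketch (a different technique from the MultiIndex flattening above), named aggregation can produce the flat column names directly and only for the columns of interest. The output column names are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'case': [1, 1, 2, 2, 3, 3],
                   'condition': [1, 2, 1, 2, 1, 2],
                   'var_a': [0.92, 0.88, 0.90, 0.79, 0.94, 0.85],
                   'var_b': [0.21, 0.15, 0.1, 0.16, 0.17, 0.23]})

# Each keyword is an output column: (input column, aggregation name)
summary = df.groupby('case', as_index=False).agg(
    var_a_mean=('var_a', 'mean'),
    var_a_sem=('var_a', 'sem'),
    var_b_mean=('var_b', 'mean'),
    var_b_sem=('var_b', 'sem'),
)
print(summary)
```

This skips the condition column entirely, so nothing has to be dropped afterwards.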

Pandas - get last n values from a group with an offset.

I have a data frame (pandas, Python 3.5) with the date as index.
The electricity_use is the label I should predict.
e.g.
City Country electricity_use
DATE
7/1/2014 X A 1.02
7/1/2014 Y A 0.25
7/2/2014 X A 1.21
7/2/2014 Y A 0.27
7/3/2014 X A 1.25
7/3/2014 Y A 0.20
7/4/2014 X A 0.97
7/4/2014 Y A 0.43
7/5/2014 X A 0.54
7/5/2014 Y A 0.45
7/6/2014 X A 1.33
7/6/2014 Y A 0.55
7/7/2014 X A 2.01
7/7/2014 Y A 0.21
7/8/2014 X A 1.11
7/8/2014 Y A 0.34
7/9/2014 X A 1.35
7/9/2014 Y A 0.18
7/10/2014 X A 1.22
7/10/2014 Y A 0.27
Of course the data is larger.
My goal is to add to each row the last 3 electricity_use values of its group ('City', 'Country'), with a gap of 5 days (i.e. to take the 3 most recent values starting from 5 days back). The dates can be non-consecutive, but they are ordered.
For example, to the two last rows the result should be:
City Country electricity_use prev_1 prev_2 prev_3
DATE
7/10/2014 X A 1.22 0.54 0.97 1.25
7/10/2014 Y A 0.27 0.45 0.43 0.20
because the date is 7/10/2014 and the gap is 5 days, so we start looking from 7/5/2014, and those are the 3 last values from this date for each group (in this case, the groups are (X, A) and (Y, A)).
I implemented it with a loop that goes over each group, but I have a feeling it could be done in a much more efficient way.
A naive approach would be to copy the date index into a column and iteratively merge n times:
from datetime import timedelta
# make sure the index is in datetime format
df['date'] = pd.to_datetime(df.index)
result = df.copy()
for i in range(3):
    # rows dated 5 + i days earlier, re-dated so they line up with the current row
    prev = df[['City','Country','date','electricity_use']].copy()
    prev['date'] = prev['date'] + timedelta(days=5+i)
    prev = prev.rename(columns={'electricity_use': 'prev_'+str(i+1)})
    result = result.merge(prev,on=['City','Country','date'],how='left')
A faster approach would be to use shift and then remove the bogus values
df['date'] = pd.to_datetime(df.index)
df = df.sort_values(by=['City','Country','date'])
# Pick the oldest date of every City + Country group
temp = df[['City','Country','date']].groupby(['City','Country']).first()
df = df.merge(temp,left_on=['City','Country'],right_index=True,suffixes=('','_first'))
# Number of days since the first date of the group
df['diff_date'] = (df['date'] - df['date_first']).dt.days
# Shift by 5, 6 and 7 rows
for i in range(5,8):
    df['days_prior_'+str(i)] = df['electricity_use'].shift(i)
    # The first i values of every City + Country group are bogus, as they come from the group before it
    df.loc[df['diff_date'] < i,'days_prior_'+str(i)] = 0
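A related sketch, swapping in a per-group shift so that no values leak across groups and no cleanup is needed: it assumes exactly one row per day for each (City, Country) group, so shifting by k rows equals going back k days. With non-consecutive dates this assumption breaks and a date-based merge would be needed instead. The column names prev_1 to prev_3 follow the question.

```python
import pandas as pd

# Sample data from the question: two cities, ten consecutive days
use_x = [1.02, 1.21, 1.25, 0.97, 0.54, 1.33, 2.01, 1.11, 1.35, 1.22]
use_y = [0.25, 0.27, 0.20, 0.43, 0.45, 0.55, 0.21, 0.34, 0.18, 0.27]
dates = pd.date_range("2014-07-01", periods=10, freq="D")

df = pd.DataFrame(
    {
        "City": ["X", "Y"] * 10,
        "Country": "A",
        "electricity_use": [v for pair in zip(use_x, use_y) for v in pair],
    },
    index=dates.repeat(2),
)
df.index.name = "DATE"

gap, n_prev = 5, 3  # start 5 days back, take 3 values
grouped_use = df.groupby(["City", "Country"])["electricity_use"]
for k in range(n_prev):
    # shift within each group only; rows without enough history become NaN
    df[f"prev_{k + 1}"] = grouped_use.shift(gap + k).to_numpy()

print(df.tail(2))
```

For the two rows dated 7/10/2014 this reproduces the expected values from the question (0.54, 0.97, 1.25 for X and 0.45, 0.43, 0.20 for Y).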
