I have a data frame (pandas, Python 3.5) with the date as the index.
The electricity_use column is the label I should predict.
e.g.
City Country electricity_use
DATE
7/1/2014 X A 1.02
7/1/2014 Y A 0.25
7/2/2014 X A 1.21
7/2/2014 Y A 0.27
7/3/2014 X A 1.25
7/3/2014 Y A 0.20
7/4/2014 X A 0.97
7/4/2014 Y A 0.43
7/5/2014 X A 0.54
7/5/2014 Y A 0.45
7/6/2014 X A 1.33
7/6/2014 Y A 0.55
7/7/2014 X A 2.01
7/7/2014 Y A 0.21
7/8/2014 X A 1.11
7/8/2014 Y A 0.34
7/9/2014 X A 1.35
7/9/2014 Y A 0.18
7/10/2014 X A 1.22
7/10/2014 Y A 0.27
Of course the data is larger.
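For reference, the example frame can be reconstructed with a snippet like this (a minimal reconstruction of the listing above, assuming the dates parse into a DatetimeIndex):
import pandas as pd

# Rebuild the example frame shown above (dates become the DATE index).
dates = pd.to_datetime(['2014-07-%02d' % d for d in range(1, 11)]).repeat(2)
df = pd.DataFrame(
    {
        'City': ['X', 'Y'] * 10,
        'Country': ['A'] * 20,
        'electricity_use': [1.02, 0.25, 1.21, 0.27, 1.25, 0.20, 0.97, 0.43, 0.54, 0.45,
                            1.33, 0.55, 2.01, 0.21, 1.11, 0.34, 1.35, 0.18, 1.22, 0.27],
    },
    index=pd.Index(dates, name='DATE'),
)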
My goal is to add, for each row, the last 3 electricity_use values within its ('City', 'Country') group, with a gap of 5 days (i.e. take the 3 most recent values from 5 days back or earlier). The dates can be non-consecutive, but they are ordered.
For example, for the last two rows the result should be:
City Country electricity_use prev_1 prev_2 prev_3
DATE
7/10/2014 X A 1.22 0.54 0.97 1.25
7/10/2014 Y A 0.27 0.45 0.43 0.20
because the date is 7/10/2014 and the gap is 5 days, so we start looking from 7/5/2014 backwards, and those are the 3 most recent values from that date for each group (in this case, the groups are (X, A) and (Y, A)).
I implemented it with a loop that goes over each group, but I have a feeling it could be done in a much more efficient way.
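Something like the following is a sketch of that loop-based baseline (my reconstruction of the approach described, assuming DATE is a DatetimeIndex and the column names from the example):
import pandas as pd

def add_lags(group, gap=5, n_lags=3):
    # for each date, take the newest n_lags values dated at least `gap` days earlier
    group = group.sort_index().copy()
    for date in group.index:
        window = group.loc[:date - pd.Timedelta(days=gap), 'electricity_use']
        for k, value in enumerate(window.tail(n_lags).iloc[::-1], start=1):
            group.loc[date, 'prev_%d' % k] = value
    return group

result = (df.groupby(['City', 'Country'], group_keys=False)
            .apply(add_lags)
            .sort_index(kind='mergesort'))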
A naive approach to do this would be to put the date into a column, shift it, and iteratively merge n times:
from datetime import timedelta
import pandas as pd

# make sure the index is in datetime format
df['date'] = pd.to_datetime(df.index)
df1 = df.copy()
for i in range(3):
    # a row dated (5 + i) days earlier gets the same key as the current row
    df1['date'] = df['date'] + timedelta(days=5 + i)
    # note: this matches exact calendar dates, so gaps in the data produce NaNs
    df = df.merge(df1[['City','Country','date','electricity_use']],
                  on=['City','Country','date'],how='left',
                  suffixes=('','_prev_'+str(i + 1)))
A faster approach would be to use shift and then remove the bogus values:
df['date'] = df.index
df.sort_values(by=['City','Country','date'],inplace=True)
# pick the oldest date of every City + Country group
temp = df[['City','Country','date']].groupby(['City','Country']).first()
df = df.merge(temp,left_on=['City','Country'],right_index=True,suffixes=('','_first'))
df['diff_date'] = (df['date'] - df['date_first']).dt.days
# do a shift by 5, 6 and 7 rows
for i in range(5,8):
    df['days_prior_'+str(i)] = df['electricity_use'].shift(i)
    # the top i rows of every City + Country group would be bogus values
    # carried over from the group before it, so zero them out
    df.loc[df['diff_date'] < i,'days_prior_'+str(i)] = 0
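If the dates are not strictly daily within a group, a fixed row shift will not respect the 5-day calendar gap. A calendar-aware sketch using pd.merge_asof (this assumes the 'date' column created above and the column names from the question; it is an alternative, not part of either approach):
import pandas as pd

right = df.sort_values('date').copy()
grp = right.groupby(['City', 'Country'])['electricity_use']
right['prev_1'] = right['electricity_use']      # the matched observation itself
right['prev_2'] = grp.shift(1)                  # the observation before the match
right['prev_3'] = grp.shift(2)

left = df.sort_values('date').copy()
left['cutoff'] = left['date'] - pd.Timedelta(days=5)

result = pd.merge_asof(
    left.sort_values('cutoff'),
    right[['City', 'Country', 'date', 'prev_1', 'prev_2', 'prev_3']].sort_values('date'),
    left_on='cutoff', right_on='date',
    by=['City', 'Country'], direction='backward',
    suffixes=('', '_matched'),
)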
I'm trying to create a list of timestamps from a column in a dataframe, that resets after a certain time to zero. So, if the limit was 4, I want the count to add up the values of the column up to position 4, and then reset to zero, and continue adding the values of the column, from position 5, and so forth until it reaches the length of the column. I used itertools.islice earlier in the script to create a counter, so I was wondering if I could use a combination of this and itertools.count to do something similar? So far, this is my code:
cycle_time = list(itertools.islice(itertools.count(0,raw_data['Total Time (s)'][lens]),range(0, block_cycles),lens))
Where raw_data['Total Time (s)'] contains the values I wish to add up, block_cycles is the number I want to add up to in the dataframe column before resetting, and lens is the length of the column in the dataframe. Ideally, the output from my list would look like this:
print(cycle_time)
0
0.24
0.36
0.57
0
0.13
0.32
0.57
Which is calculated from this input:
print(raw_data['Total Time (s)'])
0
0.24
0.36
0.57
0.7
0.89
1.14
Which I would then append to a new column in a dataframe, interim_data_output['Cycle time (s)'], which details the time elapsed at that point in the 'cycle'. block_cycles is the number of iterations in each large 'cycle'. This is what I would do with the list:
interim_data_output['Cycle time (s)'] = cycle_time
I'm a bit lost here, is this even possible using these methods? I'd like to use itertools for performance reasons. Any help would be greatly appreciated!
Given the discussion in the comments, here is an example:
df = pd.DataFrame({'Total Time (s)':[0, 0.24, 0.36, 0.57, 0.7, 0.89, 1.14]})
Total Time (s)
0 0.00
1 0.24
2 0.36
3 0.57
4 0.70
5 0.89
6 1.14
You can do:
block_cycles = 4
# Calculate cycle times.
cycle_times = df['Total Time (s)'].diff().fillna(0).groupby(df.index // block_cycles).cumsum()
# Insert the desired zeros after all cycles.
for idx in range(block_cycles, cycle_times.index.max(), block_cycles):
    cycle_times.loc[idx-0.5] = 0
cycle_times = cycle_times.sort_index().reset_index(drop=True)
print(cycle_times)
Which gives:
0 0.00
1 0.24
2 0.36
3 0.57
4 0.00
5 0.13
6 0.32
7 0.57
Name: Total Time (s), dtype: float64
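As a quick sanity check (a throwaway snippet, not part of the answer), the result matches the expected list from the question:
# compare against the expected output listed in the question
expected = [0, 0.24, 0.36, 0.57, 0, 0.13, 0.32, 0.57]
assert cycle_times.round(2).tolist() == expected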
I need to display .value_counts() per interval in a pandas dataframe. Here's my code:
prob['bucket'] = pd.qcut(prob['prob good'], 20)
grouped = prob.groupby('bucket', as_index = False)
kstable = pd.DataFrame()
kstable['min_prob'] = grouped.min()['prob good']
kstable['max_prob'] = grouped.max()['prob good']
kstable['counts'] = prob['bucket'].value_counts()
My Output
min_prob max_prob counts
0 0.26 0.48 NaN
1 0.49 0.52 NaN
2 0.53 0.54 NaN
3 0.55 0.56 NaN
4 0.57 0.58 NaN
I know that I have a problem in the kstable['counts'] line, but how do I solve it?
Use named aggregation to simplify your code; for counts, GroupBy.size is applied to the bucket column to create the new counts column:
prob['bucket'] = pd.qcut(prob['prob good'], 20)
kstable = prob.groupby('bucket', as_index = False).agg(min_prob=('prob good','min'),
max_prob=('prob good','max'),
counts=('bucket','size'))
In your solution, DataFrame.assign should also work, as long as the counts are reordered into bucket order and converted to an array; otherwise they align on the mismatched index and you get NaN again:
kstable = kstable.assign(counts = prob['bucket'].value_counts().sort_index().values)
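A self-contained demo with made-up probabilities (the real prob frame is not shown in the question) shows the named aggregation producing one row per bucket:
import numpy as np
import pandas as pd

prob = pd.DataFrame({'prob good': np.random.rand(1000)})   # hypothetical data
prob['bucket'] = pd.qcut(prob['prob good'], 20)

kstable = prob.groupby('bucket', as_index=False).agg(min_prob=('prob good','min'),
                                                     max_prob=('prob good','max'),
                                                     counts=('bucket','size'))
print(kstable.head())   # 20 rows, one per bucket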
I need to calculate the delta and I did it, but I'm using itertuples and I want to avoid using it.
Is there a more efficient way to do that? Take a look at how I did it:
from numpy import append, around, array, float64
from numpy.random import uniform
from pandas import DataFrame
matrix = around(a=uniform(low=1.0, high=50.0, size=(10, 2)), decimals=2)
points = DataFrame(data=matrix, columns=['x', 'y'], dtype='float64')
x_column = points.columns.get_loc('x')
y_column = points.columns.get_loc('y')
x_delta = array(object=[], dtype=float64)
y_delta = array(object=[], dtype=float64)
for row, iterator in enumerate(iterable=points.itertuples(index=False, name='Point')):
if row == 0:
x_delta = append(arr=x_delta, values=0.0)
y_delta = append(arr=y_delta, values=0.0)
else:
x_delta = append(arr=x_delta, values=iterator.x / points.iat[row - 1, x_column] - 1)
y_delta = append(arr=y_delta, values=iterator.y / points.iat[row - 1, y_column] - 1)
x_delta = around(a=x_delta, decimals=2)
y_delta = around(a=y_delta, decimals=2)
points.insert(loc=points.shape[1], column='x_delta', value=x_delta)
points.insert(loc=points.shape[1], column='y_delta', value=y_delta)
print(points)
x y x_delta y_delta
0 26.08 1.37 0.00 0.00
1 8.34 6.82 -0.68 3.98
2 38.42 45.20 3.61 5.63
3 3.59 33.12 -0.91 -0.27
4 42.94 11.06 10.96 -0.67
5 31.99 17.38 -0.26 0.57
6 4.29 17.46 -0.87 0.00
7 19.68 22.28 3.59 0.28
8 27.55 12.98 0.40 -0.42
9 40.23 9.60 0.46 -0.26
Thanks a lot!
Pandas has a pct_change() function which compares the current and prior element. You can achieve the same result with one line:
points[['x_delta', 'y_delta']] = points[['x', 'y']].pct_change().fillna(0).round(2)
The fillna(0) is to fix the first row which would otherwise return as NaN.
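Reproducing the first few rows of the question's printout shows the one-liner gives the same deltas (values copied from the output above):
import pandas as pd

points = pd.DataFrame({'x': [26.08, 8.34, 38.42], 'y': [1.37, 6.82, 45.20]})
points[['x_delta', 'y_delta']] = points[['x', 'y']].pct_change().fillna(0).round(2)
print(points)
#        x      y  x_delta  y_delta
# 0  26.08   1.37     0.00     0.00
# 1   8.34   6.82    -0.68     3.98
# 2  38.42  45.20     3.61     5.63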
Pandas has the built-in .diff() function.
Calculates the difference of a Dataframe element compared with
another element in the Dataframe (default is element in previous row).
delta_dataframe = original_dataframe.diff()
In this case delta_dataframe will give you the change between rows of the original_dataframe.
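Note that .diff() gives absolute differences, while the question's x_delta/y_delta are relative changes; to reproduce those with diff you would divide by the shifted values (which is equivalent to pct_change), a sketch using the question's points frame:
# relative change via diff/shift; .to_numpy() sidesteps column-name alignment
deltas = (points[['x', 'y']].diff() / points[['x', 'y']].shift()).fillna(0).round(2)
points[['x_delta', 'y_delta']] = deltas.to_numpy()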
I am very new to asking questions to stack overflow. Please let me know if I have missed something.
I am trying to rearrange some data from Excel, like below:
Excel Data
To like:
Rearranged
I already tried an answer on Stack Overflow: How to Rearrange Data.
I just need to add one more column on top of that answer, but couldn't work it out with my limited Python knowledge.
Could anyone suggest a way to rearrange data a little more complex than in the above link?
You will have to transform your data a little bit in order to get to the result you want, but here is my solution:
1.Imports
import pandas as pd
import numpy as np
2.Remove the merged title from your data ("Budget and Actual"). You may want to rename your columns as 1/31/2020 Actual and 1/31/2020 Budget. Otherwise, if you have the same column name, pandas will bring in the columns with a differentiator like '.1'. Sample data below with only a couple of columns for demonstration purposes.
Item 1/31/2020 2/29/2020 1/31/2020.1 2/29/2020.1
0 A 0.01 0.02 0.03 0.04
1 B 0.20 0.30 0.40 0.50
2 C 0.33 0.34 0.35 0.36
3.Create two separate datasets for Actuals and Budget
#item name and all budget columns from your dataset
df_budget = df.iloc[:, 0:12]
# item name and the actuals columns
df_actuals = df.iloc[:, [0,13,14,15,16,17,18,19,20,21,22,23,24,25]]
4.Correct the names of the columns to remove the differentiator '.1' and reflect your dates
df_actuals.columns = ['Item','1/31/2020','2/29/2020', ...]  # and so on for the remaining dates
5.Transform the Date columns in rows
df_actuals = df_actuals.melt(id_vars=['Item'], value_vars=['1/31/2020', '2/29/2020'], var_name = 'Date', value_name='Actual')
df_budget = df_budget.melt(id_vars=['Item'], value_vars=['1/31/2020', '2/29/2020'], var_name = 'Date', value_name='Budget')
You should see something like this at this point
Item Date Actual
0 A 1/31/2020 0.01
1 B 1/31/2020 0.20
Item Date Budget
0 A 1/31/2020 0.03
1 B 1/31/2020 0.40
6.Merge Both datasets
pd.merge(df_actuals, df_budget, on=['Item', 'Date'], sort=True)
Result:
Item Date Actual Budget
0 A 1/31/2020 0.01 0.03
1 A 2/29/2020 0.02 0.04
2 B 1/31/2020 0.20 0.40
3 B 2/29/2020 0.30 0.50
4 C 1/31/2020 0.33 0.35
5 C 2/29/2020 0.34 0.36
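As a side note on steps 3 and 4: the two column blocks can also be picked programmatically instead of hard-coding every iloc position. A sketch, assuming the first column is Item followed by one block of budget dates and one equally sized block of actual dates (n_months = 12 is an assumption about the sheet):
n_months = 12  # assumed number of date columns per block
budget_cols = list(df.columns[1:1 + n_months])
actual_cols = list(df.columns[1 + n_months:1 + 2 * n_months])

df_budget = df[['Item'] + budget_cols]
df_actuals = df[['Item'] + actual_cols]
# drop the '.1' suffix pandas adds to duplicated headers (removesuffix needs Python 3.9+)
df_actuals.columns = ['Item'] + [c.removesuffix('.1') for c in actual_cols]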
df = df.loc[:, dict_lup.values()].rename(columns={v: k for k, v in dict_lup.items()})
df['cover'] = df.loc[:, 'cover'] * 100.
df['id'] = df['condition'].map(constants.dict_c)
df['temperature'] = (df['min_t'] + df['max_t'])/2.
Is there a way to express the code above as a pandas pipeline? I am stuck at the first step where I rename some columns in the dataframe and select a subset of the columns.
-- EDIT:
Data is here:
max_t col_a min_t cover condition pressure
0 38.02 1523106000 19.62 0.48 269.76 1006.64
1 39.02 1523196000 20.07 0.29 266.77 1008.03
2 39 1523282400 19.48 0.78 264.29 1008.29
3 39.11 1523368800 20.01 0.7 263.68 1008.29
4 38.59 1523455200 20.88 0.83 262.35 1007.36
5 39.33 1523541600 22 0.65 261.87 1006.82
6 38.96 1523628000 24.05 0.57 259.27 1006.96
7 39.09 1523714400 22.53 0.88 256.49 1007.94
I think you need assign:
df = (df.loc[:, dict_lup.values()]
        .rename(columns={v: k for k, v in dict_lup.items()})
        .assign(cover=df['cover'] * 100.,
                id=df['condition'].map(constants.dict_c),
                temperature=(df['min_t'] + df['max_t'])/2.))
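If the goal is one fully chained pipeline where each step sees the output of the previous one, the same thing can be written with lambdas inside assign (a sketch using the question's own dict_lup and constants.dict_c):
df = (df.loc[:, list(dict_lup.values())]
        .rename(columns={v: k for k, v in dict_lup.items()})
        .assign(cover=lambda d: d['cover'] * 100.,
                id=lambda d: d['condition'].map(constants.dict_c),
                temperature=lambda d: (d['min_t'] + d['max_t']) / 2.))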