Pandas - Adding values from DataFrame for different rows


I have a pandas df and I would like to add the value in each row of the "total_load" column to the previous row's "Battery capacity", starting from 5200. For example, 5200 + (-445) = 4755, then 4755 + (-380) = 4375, and so on.
What I am doing right now computes 5200 + the value from the "total_load" column for every row, which is not what I want. Any ideas how I can write that? Should I use a for loop?
df["Battery capacity"] = 5200 + df["total_load"]
Output should be something like:
time total_load battery capacity
2016-06-01 00:00:00 -445 4755
2016-06-01 01:00:00 -380 4375
2016-06-01 02:00:00 -350 4025
Thanks!

IIUC, use cumsum to get a "running total" of total_load:
df['Battery capacity'] = df['total_load'].cumsum() + 5200
Output:
Battery capacity total_load
time
2016-01-01 00:00:00 4755.0 -445.0
2016-01-01 01:00:00 4375.0 -380.0
2016-01-01 02:00:00 4025.0 -350.0
2016-01-01 03:00:00 3685.0 -340.0
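For anyone who wants to run this end to end, here is a minimal sketch of the same cumsum idea. The frame is rebuilt by hand from the values in the question, and the name `initial_capacity` is only an illustrative assumption.

```python
import pandas as pd

# Rebuild the sample data from the question (hourly index, loads as given)
df = pd.DataFrame(
    {"total_load": [-445, -380, -350, -340]},
    index=pd.date_range("2016-06-01", periods=4, freq="h", name="time"),
)

initial_capacity = 5200  # hypothetical name for the starting capacity

# cumsum() gives the running total of the load, so no explicit loop is needed
df["Battery capacity"] = initial_capacity + df["total_load"].cumsum()
print(df)
```

Each row then holds the starting capacity plus all loads up to and including that row, which matches the 4755, 4375, 4025 sequence from the question.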

Related

Python: Working with columns inside a pandas Dataframe

Good evening,
is it possible to calculate with - let's say - two columns inside a dataframe and add a third column with the fitting result?
Dataframe (original):
name time_a time_b
name_a 08:00:00 09:00:00
name_b 07:45:00 08:15:00
name_c 07:00:00 08:10:00
name_d 06:00:00 10:00:00
Or to be specific...is it possible to obtain the difference of two times (time_b - time_a) and create a
new column (time_c) at the end of the dataframe?
Dataframe (new):
name time_a time_b time_c
name_a 08:00:00 09:00:00 01:00:00
name_b 07:45:00 08:15:00 00:30:00
name_c 07:00:00 08:10:00 01:10:00
name_d 06:00:00 10:00:00 04:00:00
Thanks and a good night!

If your columns are in datetime or timedelta format:
# New column is a timedelta object
df["time_c"] = (df["time_b"] - df["time_a"])
If your columns are in datetime.time format (which it appears they are):
import datetime

def time_diff(time_1, time_2):
    """Return the difference between time_1 and time_2 (time_2 - time_1)."""
    # the date part is arbitrary; it only anchors both times to the same day
    now = datetime.datetime.now()
    time_1 = datetime.datetime.combine(now, time_1)
    time_2 = datetime.datetime.combine(now, time_2)
    return time_2 - time_1
# Apply the function row by row
df["time_c"] = df[["time_a", "time_b"]].apply(lambda arr: time_diff(*arr), axis=1)
Alternatively, you can convert to a timedelta by first converting to a string:
df["time_a"]=pd.to_timedelta(df["time_a"].astype(str))
df["time_b"]=pd.to_timedelta(df["time_b"].astype(str))
df["time_c"] = df["time_b"] - df["time_a"]
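As a quick check of the string-conversion route, here is a small self-contained sketch. It assumes the columns hold datetime.time objects, as guessed above, and rebuilds only the first two rows of the question's table.

```python
import datetime
import pandas as pd

# Two rows from the question, stored as datetime.time objects (an assumption)
df = pd.DataFrame({
    "name": ["name_a", "name_b"],
    "time_a": [datetime.time(8, 0), datetime.time(7, 45)],
    "time_b": [datetime.time(9, 0), datetime.time(8, 15)],
})

# str(datetime.time) looks like "08:00:00", which pd.to_timedelta can parse
delta_a = pd.to_timedelta(df["time_a"].astype(str))
delta_b = pd.to_timedelta(df["time_b"].astype(str))

# Subtracting two timedelta columns yields a timedelta column
df["time_c"] = delta_b - delta_a
print(df)
```

This variant leaves time_a and time_b untouched and only adds time_c, which may be preferable if the original columns are used elsewhere.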

Plot each column mean grouped by specific date range

I have 7 columns of data, indexed by datetime (30-minute frequency), starting at 2017-05-31 and ending at 2018-05-25. I want to plot the mean over specific date ranges (seasons). I have been trying groupby, but I can't get it to group by a specific range. I get wrong results if I do df.groupby(df.date.dt.month).mean().
A few lines from the dataset (date range is from 2017-05-31 to 2018-05-25)
50 51 56 58
date
2017-05-31 00:00:00 200.213542 276.929198 242.879051 NaN
2017-05-31 00:30:00 200.215478 276.928229 242.879051 NaN
2017-05-31 01:00:00 200.215478 276.925324 242.878083 NaN
2017-06-01 01:00:00 200.221288 276.944691 242.827729 NaN
2017-06-01 01:30:00 200.221288 276.944691 242.827729 NaN
2017-08-31 09:00:00 206.961886 283.374453 245.041349 184.358250
2017-08-31 09:30:00 206.966727 283.377358 245.042317 184.360187
2017-12-31 09:00:00 212.925877 287.198416 247.455413 187.175144
2017-12-31 09:30:00 212.926846 287.196480 247.465097 187.179987
2018-03-31 23:00:00 213.304498 286.933093 246.469647 186.887548
2018-03-31 23:30:00 213.308369 286.938902 246.468678 186.891422
2018-04-30 23:00:00 215.496812 288.342024 247.522230 188.104749
2018-04-30 23:30:00 215.497781 288.340086 247.520294 188.103780
I have created these variables (These are the ranges I need)
increment_rates_winter = df['2017-08-30'].mean() - df['2017-06-01'].mean()
increment_rates_spring = df['2017-11-30'].mean() - df['2017-09-01'].mean()
increment_rates_summer = df['2018-02-28'].mean() - df['2017-12-01'].mean()
increment_rates_fall = df['2018-05-24'].mean() - df['2018-03-01'].mean()
Concatenated them:
df_seasons =pd.concat([increment_rates_winter,increment_rates_spring,increment_rates_summer,increment_rates_fall],axis=1)
and after plotting, I got this:
However, I've been trying to get this:
df_seasons
Out[664]:
Winter Spring Summer Fall
50 6.697123 6.948447 -1.961549 7.662622
51 6.428329 4.760650 -2.188402 5.927087
52 5.580953 6.667529 1.136889 12.939295
53 6.406259 2.506279 -2.105125 6.964549
54 4.332826 3.678492 -2.574769 6.569398
56 2.222032 3.359607 -2.694863 5.348258
58 NaN 1.388535 -0.035889 4.213046
The seasons in x and the means plotted for each column.
Winter = df['2017-06-01':'2017-08-30']
Spring = df['2017-09-01':'2017-11-30']
Summer = df['2017-12-01':'2018-02-28']
Fall = df['2018-03-01':'2018-05-30']
Thank you in advance!

We can get a specific date range in the following way, and then you can define it however you want and take the mean
import pandas as pd
df = pd.read_csv('test.csv')
df['date'] = pd.to_datetime(df['date'])
start_date = "2017-12-31 09:00:00"
end_date = "2018-04-30 23:00:00"
mask = (df['date'] > start_date) & (df['date'] <= end_date)
f_df = df.loc[mask]
This gives the output
date 50 ... 58
8 2017-12-31 09:30:00 212.926846 ... 187.179987 NaN
9 2018-03-31 23:00:00 213.304498 ... 186.887548 NaN
10 2018-03-31 23:30:00 213.308369 ... 186.891422 NaN
11 2018-04-30 23:00:00 215.496812 ... 188.104749 NaN
Hope this helps
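To connect this to the seasons in the question, here is a rough sketch that takes the mean of every column over each season's date range. It assumes the frame has a DatetimeIndex, uses made-up numbers in place of the real measurements, and the names `seasons` and `season_means` are purely illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: 30-minute index, two made-up columns
index = pd.date_range("2017-05-31", "2018-05-25 23:30", freq="30min")
steps = np.arange(len(index), dtype=float)
df = pd.DataFrame({"50": steps, "51": 2.0 * steps}, index=index)

# Season boundaries copied from the question
seasons = {
    "Winter": ("2017-06-01", "2017-08-30"),
    "Spring": ("2017-09-01", "2017-11-30"),
    "Summer": ("2017-12-01", "2018-02-28"),
    "Fall": ("2018-03-01", "2018-05-30"),
}

# Label-based slicing on a DatetimeIndex includes the whole end day
season_means = pd.DataFrame(
    {name: df.loc[start:end].mean() for name, (start, end) in seasons.items()}
).T  # transpose so that seasons are rows and data columns are columns

print(season_means)
```

Calling season_means.plot() on this result should put the seasons on the x-axis with one line per column, since pandas plots the index along x.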

How about transposing it, so that the seasons end up on the x-axis:
df_seasons.T.plot()
Output:

How to find the min value between two different datetime using Pandas?

(Not duplicate / my question is entirely different)
My dataframe looks like this:
# [df2] is day based
time time2
2017-01-01, 2017-01-01 00:12:00
2017-01-02, 2017-01-02 03:15:00
2017-01-03, 2017-01-03 01:25:00
2017-01-04, 2017-01-04 04:12:00
2017-01-05, 2017-01-05 00:45:00
....
# [df] is minute based
time value
2017-01-01 00:01:00, 0.1232
2017-01-01 00:02:00, 0.1232
2017-01-01 00:03:00, 0.1232
2017-01-01 00:04:00, 0.1232
2017-01-01 00:05:00, 0.1232
....
I want to create a new column called time_val_min in [df2] that holds the minimum value from [df] within the time range given by df2['time'] and df2['time2'].
What did I do?
I did df2['time_val_min'] = df[df['time'].dt.hour.between(df2['time'], df2['time'])].min() but it does not work.
Could you please let me know how to fix it?

You can merge the two data frames on date and then filter on the time:
# create the date from the time column
df['date'] = df['time'].dt.normalize()
# merge
new_df = (df.merge(df2, left_on='date',    # left on date
                   right_on='time',        # right on time, if time is purely the beginning of each day
                   how='right',
                   suffixes=['', '_y'])
            .query('time < time2')         # keep rows before the end of the range
            .groupby('date')
            ['time'].min()
            .to_frame(name='time_val_min')
            .merge(df2, right_on='time', left_index=True)
         )
Output:
time_val_min time time2
0 2017-01-01 00:01:00 2017-01-01 2017-01-01 00:12:00
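Note that the output above holds the earliest timestamp in each range. If the goal is instead the smallest entry of the value column, a minimal sketch could look like the following. The sample numbers are made up, and the helper name `min_in_window` is hypothetical.

```python
import pandas as pd

# Made-up minute-level data: 00:01 through 00:20 on 2017-01-01
df = pd.DataFrame({
    "time": pd.date_range("2017-01-01 00:01", periods=20, freq="min"),
    "value": [0.5, 0.4, 0.3, 0.2, 0.1, 0.6, 0.7, 0.8, 0.9, 1.0,
              1.1, 1.2, 0.05, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0],
})

# One day-level row, as in the question
df2 = pd.DataFrame({
    "time": pd.to_datetime(["2017-01-01"]),
    "time2": pd.to_datetime(["2017-01-01 00:12:00"]),
})

def min_in_window(row):
    """Smallest df['value'] whose timestamp lies in [row.time, row.time2]."""
    in_window = df["time"].between(row["time"], row["time2"])
    return df.loc[in_window, "value"].min()

# Row-wise apply is simple but slow; fine for a day-level frame
df2["time_val_min"] = df2.apply(min_in_window, axis=1)
print(df2)
```

Here the 0.05 reading at 00:13 falls outside the range, so the result for the single row is 0.1.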

Calculate the sum between the fixed time range using Pandas

My dataset looks like this:
time Open
2017-01-01 00:00:00 1.219690
2017-01-01 01:00:00 1.688490
2017-01-01 02:00:00 1.015285
2017-01-01 03:00:00 1.357672
2017-01-01 04:00:00 1.293786
2017-01-01 05:00:00 1.040048
2017-01-01 06:00:00 1.225080
2017-01-01 07:00:00 1.145402
...., ....
2017-12-31 23:00:00 1.145402
I want to find the sum over a specified time range and save it to a new dataframe.
For example,
I want the sum between 2017-01-01 22:00:00 and 2017-01-02 04:00:00, a 6-hour window that spans two days. More generally, I want to sum the data from 10 PM to 4 AM the next day and put the results in a different data frame, for example df_timerange_sum. Please note that each sum covers times on two different dates.
What did I do?
I used sum() to calculate the time range like this: df[~df['time'].dt.hour.between(10, 4)].sum(), but it gives me the sum of the whole df rather than of the time range I specified.
I also tried resample, but I cannot find a way to apply it to a specific time of day.

df['time'].dt.hour.between(10, 4) is always False because no number is at least 10 and at most 4 at the same time (and 10 PM is hour 22, not 10). What you want is to mark between(4, 21), which is inclusive on both ends, and then negate that to get the other hours.
Here's what I would do:
# mark the hours from 4 AM up to (but not including) 10 PM
# the data we want is where s == False, i.e. ~s
s = df['time'].dt.hour.between(4, 21)
# s.cumsum() stays constant across each consecutive run of False,
# so it labels the blocks on which we will take the sum
blocks = s.cumsum()
# again we only care about ~s
(df[~s].groupby(blocks[~s], as_index=False)   # we don't need the blocks as index
       .agg({'time': 'min', 'Open': 'sum'})   # time : min -- select the beginning of each block
)                                             # Open : sum -- compute the sum of Open
Output for random data:
time Open
0 2017-01-01 00:00:00 1.282701
1 2017-01-01 22:00:00 2.766324
2 2017-01-02 22:00:00 2.838216
3 2017-01-03 22:00:00 4.151461
4 2017-01-04 22:00:00 2.151626
5 2017-01-05 22:00:00 2.525190
6 2017-01-06 22:00:00 0.798234
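To see why the first block starts at midnight and the later ones at 22:00, here is a small runnable sketch of the same block-labelling trick. It uses three days of hourly data with Open fixed at 1.0 (made-up values), so each sum simply counts the hours in the block. Naming the grouping key "block" and resetting the index afterwards is my own variation, chosen to avoid a name clash with the time column.

```python
import pandas as pd

# Three days of hourly timestamps, Open fixed at 1.0 (hypothetical data)
df = pd.DataFrame({"time": pd.date_range("2017-01-01", periods=72, freq="h")})
df["Open"] = 1.0

# True for hours 4..21 inclusive; the night hours are where s is False
s = df["time"].dt.hour.between(4, 21)

# cumsum only increases on True rows, so each run of False shares one label
blocks = s.cumsum()

night = df[~s]
result = (
    night.groupby(blocks[~s].rename("block"))
         .agg({"time": "min", "Open": "sum"})   # block start and block total
         .reset_index(drop=True)
)
print(result)
```

The first and last blocks are shorter (4 and 2 hours) because the data starts at midnight and ends at 11 PM, while the complete overnight blocks cover 6 hours each (22:00 through 03:00).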

An alternative (in my opinion more straightforward) approach that accomplishes the same thing. There are definitely ways to shorten the code, but I am also relatively new to pandas.
df.set_index('time', inplace=True)  # make time the index (not 100% necessary)
# new df that stores the desired output plus the start and end times if you need them
df2 = pd.DataFrame({'start_time': df.index[df.index.hour == 22]})  # all start datetimes (10 PM)
# derive each end from its start; pairing the hour == 4 rows positionally would match
# each 10 PM with the 4 AM of the same day, because the data begins at midnight
df2['end_time'] = df2['start_time'] + pd.Timedelta(hours=6)  # 4 AM the next day
df2['sum_Open'] = 0.0
for i, row in df2.iterrows():
    window = df[(df.index >= row['start_time']) & (df.index <= row['end_time'])]
    df2.at[i, 'sum_Open'] = window['Open'].sum()  # .at replaces set_value, which was removed in pandas 1.0
The last window is cut short because the data ends at 11 PM on the final day, so you'd have to add an if statement or something similar to handle it.

Pandas Merge on Specific Attributes of DateTimeIndex

I currently have two pandas data frames which are both indexed using the pandas DateTimeIndex format.
df1
datetimeindex value
2014-01-01 00:00:00 204.501667
2014-01-01 01:00:00 125.345000
2014-01-01 02:00:00 119.660000
df2 (where the year 1900 is a filler year I added during import. Actual year does not matter)
datetimeindex temperature
1900-01-01 00:00:00 48.2
1900-01-01 01:00:00 30.2
1900-01-01 02:00:00 42.8
I would like to use pd.merge to combine the data frames based on the left index, however, I would like to ignore the year altogether to yield this:
merged_df
datetimeindex value temperature
2014-01-01 00:00:00 204.501667 48.2
2014-01-01 01:00:00 125.345000 30.2
2014-01-01 02:00:00 119.660000 42.8
so far I have tried:
merged_df = pd.merge(df1,df2,left_on =
['df1.index.month','df1.index.day','df1,index.hour'],right_on =
['df2.index.month','df2.index.day','df2.index.hour'],how = 'left')
which gave me the error KeyError: 'df2.index.month'
Is there a way to perform this merge as I have outlined it?
Thanks

You have to lose the quotes:
In [11]: pd.merge(df1, df2, left_on=[df1.index.month, df1.index.day, df1.index.hour],
right_on=[df2.index.month, df2.index.day, df2.index.hour])
Out[11]:
key_0 key_1 key_2 value temperature
0 1 1 0 204.501667 48.2
1 1 1 1 125.345000 30.2
2 1 1 2 119.660000 42.8
Here "df2.index.month" is a string whereas df2.index.month is the array of months.
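The merge above returns key_0, key_1, key_2 columns and a fresh integer index. If the result should look like the merged_df in the question, one possible sketch is to add the month, day and hour as temporary columns, merge on those, and then restore df1's index. The column names `month`, `day` and `hour` are assumptions made for this example.

```python
import pandas as pd

# Sample frames rebuilt from the question
df1 = pd.DataFrame(
    {"value": [204.501667, 125.345, 119.66]},
    index=pd.date_range("2014-01-01", periods=3, freq="h"),
)
df2 = pd.DataFrame(
    {"temperature": [48.2, 30.2, 42.8]},
    index=pd.date_range("1900-01-01", periods=3, freq="h"),
)

keys = ["month", "day", "hour"]  # temporary merge keys (hypothetical names)
left = df1.assign(month=df1.index.month, day=df1.index.day, hour=df1.index.hour)
right = df2.assign(month=df2.index.month, day=df2.index.day, hour=df2.index.hour)

# A left merge keeps the row order of df1, so its index can be put back afterwards;
# this relies on each (month, day, hour) being unique in df2
merged_df = left.merge(right, on=keys, how="left").drop(columns=keys)
merged_df.index = df1.index
print(merged_df)
```

If df2 contained duplicate (month, day, hour) combinations, the merge would produce extra rows and the index assignment would fail, so it is worth checking that first.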

Probably not as efficient because pd.to_datetime can be slow:
df2['NewIndex'] = pd.to_datetime(df2.index)
df2['NewIndex'] = df2['NewIndex'].apply(lambda x: x.replace(year=2014))
df2.set_index('NewIndex',inplace=True)
Then just do a merge on the whole index.
