Having a terrible time finding information on this. I am tracking several completion times every day to measure them against goal completion times.
I am reading the completion date and time into a pandas DataFrame and using .map to map a dictionary of goal times to create a "Goal Time" column in the dataframe.
Sample Data:
Date Process
1/2/2020 10:20:00 AM Test 1
1/2/2020 10:25:00 AM Test 2
1/3/2020 10:15:00 AM Test 1
1/3/2020 10:00:00 AM Test 2
Using df.map() to create a column with the goal time:
import datetime as dt

goalmap = {
    'Test 1': dt.datetime.strptime('10:15', '%H:%M'),
    'Test 2': dt.datetime.strptime('10:30', '%H:%M')}
df['Goal Time'] = df['Process'].map(goalmap)
I am then trying to create a new "Delta" column that calculates the time difference between the two in minutes. Most of the issues I am running into relate to the data types. I got it to calculate a time difference by converting the first column (Date) using pd.to_datetime, but because my 'Goal Time' column does not store a date, it calculates a massive delta (back to 1900). I've also tried parsing the time out of the Date column, to no avail.
What is the best way to calculate the difference between timestamps using only the time of day?
I recommend timedelta over datetime:
goalmap = {
    'Test 1': pd.to_timedelta('10:15:00'),
    'Test 2': pd.to_timedelta('10:30:00')}
df['Goal Time'] = df['Process'].map(goalmap)
# anchor each goal time to that row's date (midnight plus the goal offset)
df['Goal_Timestamp'] = df['Date'].dt.normalize() + df['Goal Time']
df['Meet_Goal'] = df['Date'] <= df['Goal_Timestamp']
Output:
Date Process Goal Time Goal_Timestamp Meet_Goal
0 2020-01-02 10:20:00 Test 1 10:15:00 2020-01-02 10:15:00 False
1 2020-01-02 10:25:00 Test 2 10:30:00 2020-01-02 10:30:00 True
2 2020-01-03 10:15:00 Test 1 10:15:00 2020-01-03 10:15:00 True
3 2020-01-03 10:00:00 Test 2 10:30:00 2020-01-03 10:30:00 True
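For the "Delta" in minutes that the question asks about, a minimal sketch (assuming Date has already been converted with pd.to_datetime): subtracting the two timestamp columns gives a Timedelta series that can be expressed in minutes.
# difference between actual completion and goal, in minutes
# (negative values mean the process finished ahead of its goal)
df['Delta'] = (df['Date'] - df['Goal_Timestamp']).dt.total_seconds() / 60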
I have one year's worth of data at four-minute time series intervals. I need to load 24 hours of data at a time and run a function on that dataframe, with the window start advancing in eight-hour steps. I need to repeat this process for all the data between the start and end dates of 2021.
For example:
Load year_df containing ranges between 2021-01-01 00:00:00 and 2021-01-01 23:56:00 and run a function on this.
Load year_df containing ranges between 2021-01-01 08:00:00 and 2021-01-02 07:56:00 and run a function on this.
Load year_df containing ranges between 2021-01-01 16:00:00 and 2021-01-02 15:56:00 and run a function on this.
# Proxy DataFrame: one row every 4 minutes for the whole of 2021
start = pd.to_datetime('2021-01-01 00:00:00')
end = pd.to_datetime('2021-12-31 23:56:00')
myIndex = pd.date_range(start, end, freq='4T')
year_df = pd.DataFrame(index=myIndex).reset_index().rename(columns={'index': 'Timestamp'})
year_df.head()
Timestamp
0 2021-01-01 00:00:00
1 2021-01-01 00:04:00
2 2021-01-01 00:08:00
3 2021-01-01 00:12:00
4 2021-01-01 00:16:00
This approach avoids explicit for loops, but the apply method is essentially a for loop under the hood, so it's not that efficient. Until more functionality based on rolling datetime windows is introduced to pandas, though, this might be the only option.
The example uses the mean of the timestamps; knowing exactly what function you want to apply may help with a better answer.
# series of timestamps, indexed by themselves for fast range lookups
s = pd.Series(myIndex, index=myIndex)

def myfunc(e):
    # all timestamps within 24 hours of the window start e
    temp = s[s.between(e, e + pd.Timedelta("24h"))]
    return temp.mean()

s.apply(myfunc)
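If, as in the example, the windows should start only every eight hours rather than at every four-minute row, a hedged variant is to apply the same function to an eight-hourly index of window starts (this reuses start, end and myfunc from above and assumes windows begin at 00:00, 08:00 and 16:00):
# window starts every 8 hours across the year
window_starts = pd.Series(pd.date_range(start, end, freq='8H'))
window_starts.apply(myfunc)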
I'm having trouble finding an efficient way to update some column values in a large pandas DataFrame.
The code below creates a DataFrame in a similar format to what I'm working with. A summary of the data: the DataFrame contains three days of consumption data, with each day split into 10 measurement periods. Each measurement period is also recorded during four separate processes: a preliminary reading, an end-of-day reading and two later revisions, with the date of each update recorded in the Last_Update column.
import numpy as np
import pandas as pd

dates = ['2022-01-01']*40 + ['2022-01-02']*40 + ['2022-01-03']*40
periods = list(range(1,11))*12
versions = (['PRELIM'] * 10 + ['DAILY'] * 10 + ['REVISE'] * 20) * 3
data = {'Date': dates,
'Period' : periods,
'Version': versions,
'Consumption': np.random.randint(1, 30, 120)}
df = pd.DataFrame(data)
df.Date = pd.to_datetime(df.Date)
## Add random times to the REVISE Last_Update values
df['Last_Update'] = df['Date'].apply(lambda x: x + pd.Timedelta(hours=np.random.randint(1,23), minutes=np.random.randint(1,59)))
df['Last_Update'] = df['Last_Update'].where(df.Version == 'REVISE', df['Date'])
The problem is that the two revision categories are both specified by the same value: "REVISE". One of these "REVISE" values must be changed to something like "REVISE_2". If you group the data in the following way df.groupby(['Date', 'Period', 'Version', 'Last_Update'])['Consumption'].sum() you can see there are two Last_Update dates for each period in each day for REVISE. So we need to set the REVISE with the largest date to REVISE_2.
The only way I've managed to find a solution is a very convoluted function used with the apply method to test which date is larger, store its index, and then change the value using loc. This ended up taking a huge amount of time for small segments of the data (the full dataset is millions of rows).
I feel like there is an easy solution using groupby functions, but I'm having difficulties navigating the multi-index output.
Any help would be appreciated, cheers.
We figure out the index of the max REVISE date using idxmax after some grouping, and then change the labels:
last_revised_date_idx = df[df['Version'] == 'REVISE'].groupby(['Date', 'Period'], group_keys = False)['Last_Update'].idxmax()
df.loc[last_revised_date_idx, 'Version'] = 'REVISE_2'
check the output:
df.groupby(['Date', 'Period', 'Version', 'Last_Update'])['Consumption'].count().head(20)
produces
Date Period Version Last_Update
2022-01-01 1 DAILY 2022-01-01 00:00:00 1
PRELIM 2022-01-01 00:00:00 1
REVISE 2022-01-01 03:50:00 1
REVISE_2 2022-01-01 12:10:00 1
2 DAILY 2022-01-01 00:00:00 1
PRELIM 2022-01-01 00:00:00 1
REVISE 2022-01-01 10:45:00 1
REVISE_2 2022-01-01 22:05:00 1
3 DAILY 2022-01-01 00:00:00 1
PRELIM 2022-01-01 00:00:00 1
REVISE 2022-01-01 17:03:00 1
REVISE_2 2022-01-01 19:10:00 1
4 DAILY 2022-01-01 00:00:00 1
PRELIM 2022-01-01 00:00:00 1
REVISE 2022-01-01 15:23:00 1
REVISE_2 2022-01-01 18:08:00 1
5 DAILY 2022-01-01 00:00:00 1
PRELIM 2022-01-01 00:00:00 1
REVISE 2022-01-01 12:19:00 1
REVISE_2 2022-01-01 18:04:00 1
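A hedged alternative that skips the index lookup: compare each REVISE row's Last_Update against its group maximum with transform (this assumes Last_Update values are unique within each Date/Period group, as in the generated data):
revise = df[df['Version'] == 'REVISE']
# per (Date, Period) group, the latest revision timestamp
grp_max = revise.groupby(['Date', 'Period'])['Last_Update'].transform('max')
df.loc[revise.index[revise['Last_Update'] == grp_max], 'Version'] = 'REVISE_2'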
I have a pandas dataframe that has datetime in multiple columns and looks similar to below but with hundreds of columns, almost pushing 1k.
datetime, battery, datetime, temperature, datetime, pressure
2020-01-01 01:01:01, 13.8, 2020-01-01 01:01:02, 97, 2020-01-01 01:01:03, 10
2020-01-01 01:01:04, 13.8, 2020-01-01 01:01:05, 97, 2020-01-01 01:01:06, 11
What I have done is imported it and then converted every datetime column using pd.to_datetime. This reduces the memory usage by more than half (2.4GB to 1.0GB), but I'm wondering if this is still inefficient and maybe a better way.
Would I benefit from converting this down to 3 columns where I have datetime, data name, data measurement? If so, what is the best method of doing this? I've tried this but end up with a lot of empty spaces.
Would there be another way to handle this data that I'm just not seeing?
Or does what I'm doing make sense, and is it efficient enough?
I eventually want to plot some of this data by selecting specific data names.
I ran a small experiment with the above data and converting the data to date / type / value columns reduces the overall memory consumption:
print(df)
datetime battery datetime.1 temperature datetime.2 pressure
0 2020-01-01 01:01:01 13.8 2020-01-01 01:01:02 97 2020-01-01 01:01:03 10
1 2020-01-01 01:01:04 13.8 2020-01-01 01:01:05 97 2020-01-01 01:01:06 11
print(df.memory_usage().sum())
==> 224
After converting the dataframe:
dfs = []
# split the wide frame into (datetime, value) pairs, one pair per measurement
for i in range(0, 6, 2):
    d = df.iloc[:, i:i+2].copy()   # .copy() avoids SettingWithCopyWarning
    d["type"] = d.columns[1]       # remember which measurement this pair holds
    d.columns = ["datetime", "value", "type"]
    dfs.append(d)
new_df = pd.concat(dfs)
print(new_df)
==>
datetime value type
0 2020-01-01 01:01:01 13.8 battery
1 2020-01-01 01:01:04 13.8 battery
0 2020-01-01 01:01:02 97.0 temperature
1 2020-01-01 01:01:05 97.0 temperature
0 2020-01-01 01:01:03 10.0 pressure
1 2020-01-01 01:01:06 11.0 pressure
print(new_df.memory_usage().sum())
==> 192
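If memory is still a concern on the real (much larger) frame, a further hedged tweak: storing the repeated labels in type as a pandas categorical keeps one copy of each distinct string plus compact integer codes per row, which usually shrinks that column considerably.
# each distinct label stored once; rows hold small integer codes
new_df['type'] = new_df['type'].astype('category')
print(new_df.memory_usage(deep=True).sum())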
Rookie here so please excuse my question format:
I got an event time series dataset for two months (columns for "date/time" and "# of events", each row representing an hour).
I would like to highlight the 10 hours with the lowest numbers of events for each week. Is there a specific Pandas function for that? Thanks!
Let's say you have a dataframe df with column col as well as a datetime column.
You can simply sort by that column and take the first 10 rows:
import pandas as pd
df = pd.DataFrame({'col' : [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
'datetime' : ['2019-01-01 00:00:00','2015-02-01 00:00:00','2015-03-01 00:00:00','2015-04-01 00:00:00',
'2018-05-01 00:00:00','2016-06-01 00:00:00','2017-07-01 00:00:00','2013-08-01 00:00:00',
'2015-09-01 00:00:00','2015-10-01 00:00:00','2015-11-01 00:00:00','2015-12-01 00:00:00',
'2014-01-01 00:00:00','2020-01-01 00:00:00','2014-01-01 00:00:00']})
df = df.sort_values('col')
df = df.iloc[0:10,:]
df
Output:
col datetime
0 1 2019-01-01 00:00:00
1 2 2015-02-01 00:00:00
2 3 2015-03-01 00:00:00
3 4 2015-04-01 00:00:00
4 5 2018-05-01 00:00:00
5 6 2016-06-01 00:00:00
6 7 2017-07-01 00:00:00
7 8 2013-08-01 00:00:00
8 9 2015-09-01 00:00:00
9 10 2015-10-01 00:00:00
I know there's a function called nlargest, and it has an nsmallest counterpart: pandas.DataFrame.nsmallest.
df.nsmallest(n=10, columns=['col'])
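Since the question asks for the ten lowest hours per week, a hedged sketch that combines nsmallest with a weekly grouper (assuming the datetime column has been parsed with pd.to_datetime and the hourly event counts live in col):
df['datetime'] = pd.to_datetime(df['datetime'])
# the 10 smallest event counts within each calendar week
df.groupby(pd.Grouper(key='datetime', freq='W'))['col'].nsmallest(10)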
My bad, so your DatetimeIndex is an hourly sampling, and you need the hours with the fewest events weekly.
...
Date n_events
2020-06-06 08:00:00 3
2020-06-06 09:00:00 3
2020-06-06 10:00:00 2
...
Well, I'd start by converting each hour into columns.
1. Create an hour column that holds the hour of the day:
df['hour'] = df['date'].dt.hour
2. Pivot the hour values into columns, with n_events as the values (pandas.DataFrame.pivot_table).
You'll then have one datetime index and 24 hour columns, with values denoting the number of events.
...
Date hour0 ... hour8 hour9 hour10 ... hour23
2020-06-06 0 3 3 2 0
...
Then you can resample it to a weekly level, aggregating with sum.
df.resample('w').sum()
The last part is a bit tricky to do on the dataframe, but fairly simple if you just need the output.
for row in df.itertuples():
    print(sorted(row[1:]))
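Putting the steps above together as a hedged sketch (the column names date and n_events are assumed from the sample, date is assumed already parsed as datetime, and the data is one row per hour); the itertuples loop above can then be run on weekly:
df['hour'] = df['date'].dt.hour
# one row per day, one column per hour of day, values = number of events
pivoted = df.pivot_table(index=df['date'].dt.normalize(), columns='hour',
                         values='n_events', aggfunc='sum')
# weekly totals for each hour-of-day column
weekly = pivoted.resample('W').sum()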
My dataset looks like this:
time Open
2017-01-01 00:00:00 1.219690
2017-01-01 01:00:00 1.688490
2017-01-01 02:00:00 1.015285
2017-01-01 03:00:00 1.357672
2017-01-01 04:00:00 1.293786
2017-01-01 05:00:00 1.040048
2017-01-01 06:00:00 1.225080
2017-01-01 07:00:00 1.145402
...., ....
2017-12-31 23:00:00 1.145402
I want to find the sum between the time range specified and save it to a new dataframe.
Let's say
I want to find the sum between 2017-01-01 22:00:00 and 2017-01-02 04:00:00. This is the sum over 6 hours spanning 2 days. I want to find the sum of the data in a time range such as 10 PM to 4 AM the next day and put it in a different dataframe, for example df_timerange_sum. Note that the sum covers times on two different dates.
What did I do?
I used sum() to calculate over the time range like this: df[~df['time'].dt.hour.between(10, 4)].sum(), but it gives me the sum of the whole df, not of the time range I specified.
I also tried resample, but I cannot find a way to make it time-specific.
df['time'].dt.hour.between(10, 4) is always False because no number is larger than 10 and smaller than 4 at the same time. What you want is to mark between(4,21) and then negate that to get the other hours.
Here's what I would do:
# mark those between 4AM and 10PM
# data we want is where s==False, i.e. ~s
s = df['time'].dt.hour.between(4, 21)
# s.cumsum() stays constant over each consecutive False block,
# giving every overnight block its own label to group and sum on
blocks = s.cumsum()
# again we only care for ~s
(df[~s].groupby(blocks[~s], as_index=False) # we don't need the blocks as index
.agg({'time':'min', 'Open':'sum'}) # time : min -- select the beginning of blocks
) # Open : sum -- compute sum of Open
Output for random data:
time Open
0 2017-01-01 00:00:00 1.282701
1 2017-01-01 22:00:00 2.766324
2 2017-01-02 22:00:00 2.838216
3 2017-01-03 22:00:00 4.151461
4 2017-01-04 22:00:00 2.151626
5 2017-01-05 22:00:00 2.525190
6 2017-01-06 22:00:00 0.798234
An alternative (in my opinion more straightforward) approach that accomplishes the same thing. There are definitely ways to reduce the code, but I am also relatively new to pandas.
df.set_index(['time'], inplace=True)  # make time the index col (not 100% necessary)
df2 = pd.DataFrame(columns=['start_time', 'end_time', 'sum_Open'])  # new df that stores your desired output plus start and end times if you need them
df2['start_time'] = df[df.index.hour == 22].index  # gets/stores all start datetimes
df2['end_time'] = df[df.index.hour == 4].index  # gets/stores all end datetimes
for i, row in df2.iterrows():
    # set_value was removed from pandas; .at does the same single-cell assignment
    df2.at[i, 'sum_Open'] = df[(df.index >= row['start_time']) & (df.index <= row['end_time'])]['Open'].sum()
You'd have to add an if statement or something to handle the last day, which ends at 11 PM.
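One hedged way to handle those edge cases: build the start/end pairs explicitly, dropping any 04:00 that has no preceding 22:00 and any trailing 22:00 with no following 04:00, then sum each slice (this reuses the time-indexed df from above):
starts = df[df.index.hour == 22].index
ends = df[df.index.hour == 4].index
ends = ends[ends > starts[0]]    # drop the 04:00 on the first day
starts = starts[:len(ends)]      # drop a trailing start with no matching end
sums = [df.loc[s:e, 'Open'].sum() for s, e in zip(starts, ends)]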