I would like to subtract datetimes in pandas (Python), but pairwise over every two rows. I don't know which function to use.
Timestamp
2020-11-26 20:00:00
2020-11-26 21:00:00
2020-11-26 22:00:00
2020-11-26 23:30:00
Explanation:
(2020-11-26 21:00:00) - (2020-11-26 20:00:00)
(2020-11-26 23:30:00) - (2020-11-26 22:00:00)
The result must be:
01:00:00
01:30:00
First, check that the column has a datetime dtype.
If it does not, convert it with pd.to_datetime():
import pandas as pd

demo = pd.DataFrame(columns=['Timestamps'])
demotime = ['20:00:00','21:00:00','22:00:00','23:30:00']
demo['Timestamps'] = demotime
demo['Timestamps'] = pd.to_datetime(demo['Timestamps'])
Your dataframe would look like:
Timestamps
0 2020-11-29 20:00:00
1 2020-11-29 21:00:00
2 2020-11-29 22:00:00
3 2020-11-29 23:30:00
After that you can use a for or while loop that steps through the rows two at a time (i = 0, 2, ...) and, for each i, computes:
demo.iloc[i+1,0]-demo.iloc[i,0]
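A minimal sketch of that loop, assuming the four timestamps from the question and a step of two so that each pair is used only once (the name diffs is just illustrative):

```python
import pandas as pd

demo = pd.DataFrame({
    "Timestamps": pd.to_datetime([
        "2020-11-26 20:00:00",
        "2020-11-26 21:00:00",
        "2020-11-26 22:00:00",
        "2020-11-26 23:30:00",
    ])
})

# Step through the rows two at a time: (0, 1), (2, 3), ...
diffs = []
for i in range(0, len(demo) - 1, 2):
    diffs.append(demo.iloc[i + 1, 0] - demo.iloc[i, 0])

print(diffs)
```

Each element of diffs is a pandas Timedelta, so it can be compared or summed directly.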
IIUC, you want to iterate over chunks of two rows and take the difference within each chunk. One approach (with numpy imported as np) is:
res = df.groupby(np.arange(len(df)) // 2).diff().dropna()
print(res)
Output
Timestamp
1 0 days 01:00:00
3 0 days 01:30:00
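For reference, here is a self-contained version of the same idea, assuming the question's data sits in a column named Timestamp. The integer division np.arange(len(df)) // 2 produces the labels 0, 0, 1, 1, so diff() is taken inside each pair only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2020-11-26 20:00:00",
        "2020-11-26 21:00:00",
        "2020-11-26 22:00:00",
        "2020-11-26 23:30:00",
    ])
})

# Rows 0-1 form group 0, rows 2-3 form group 1.
pair_id = np.arange(len(df)) // 2

# diff() leaves NaT on the first row of each pair; dropna() removes those rows.
res = df.groupby(pair_id).diff().dropna()
print(res)
```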
I have a Pandas DataFrame where the index is datetimes for every 12 minutes in a day (120 rows total). I went ahead and resampled the data to every 30 minutes.
Time Rain_Rate
1 2014-04-02 00:00:00 0.50
2 2014-04-02 00:30:00 1.10
3 2014-04-02 01:00:00 0.48
4 2014-04-02 01:30:00 2.30
5 2014-04-02 02:00:00 4.10
6 2014-04-02 02:30:00 5.00
7 2014-04-02 03:00:00 3.20
I want to take 3-hour means centered on hours 00, 03, 06, 09, 12, 15, 18, and 21. Each mean should cover 1.5 hours before the center and 1.5 hours after it, so the 03:00:00 mean runs from 01:30:00 to 04:30:00. The 06:00:00 mean would overlap with the 03:00:00 mean (they would both use 04:30:00).
Is there a way to do this using pandas? I've tried a few things but they haven't worked.
Method 1
I'd suggest changing your resample from the get-go so that it produces the chunks you want. Here's some fake data resembling yours, before any resampling:
dr = pd.date_range('04-02-2014 00:00:00', '04-03-2014 00:00:00', freq='12T', closed='left')
data = np.random.rand(120)
df = pd.DataFrame(data, index=dr, columns=['Rain_Rate'])
df.index.name = 'Time'
#df.head()
Rain_Rate
Time
2014-04-02 00:00:00 0.616588
2014-04-02 00:12:00 0.201390
2014-04-02 00:24:00 0.802754
2014-04-02 00:36:00 0.712743
2014-04-02 00:48:00 0.711766
Averaging by 3-hour chunks directly gives roughly the same result as averaging by 30-minute chunks first and then by 3-hour chunks (exactly the same only when every 30-minute chunk holds the same number of points). You just have to tweak a couple of things to get the bins you want. First, add the point where the first bin starts (i.e. 10:30 pm on the previous day, even though there's no data there; the first bin runs from 10:30 pm to 1:30 am), then resample starting from that point:
before = df.index[0] - pd.Timedelta(minutes=90) #only if the first index is at midnight!!!
df.loc[before] = np.nan
df = df.sort_index()
output = df.resample('3H', base=22.5, loffset='90min').mean()
The base parameter here means the bins start at the 22.5th hour (10:30 pm), and loffset shifts the bin labels forward by 90 minutes so that each bin is labelled by its center. You get the following output:
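Note that base and loffset were deprecated in pandas 1.1 and removed in pandas 2.0. On newer versions, a rough equivalent is to shift the bin edges with the offset argument and relabel the bins by hand. This is only a sketch: the seed and variable names are my own, and since the data is random the numbers will not match the output shown in this answer.

```python
import numpy as np
import pandas as pd

# Synthetic 12-minute data for one day (120 rows), similar to the fake data above.
idx = pd.date_range("2014-04-02", periods=120, freq="12min")
df = pd.DataFrame({"Rain_Rate": np.random.default_rng(0).random(120)}, index=idx)
df.index.name = "Time"

# offset="90min" moves the 3-hour bin edges to ..., 22:30, 01:30, 04:30, ...
binned = df.resample("3h", offset="90min").mean()

# Label each bin by its centre instead of its left edge (what loffset used to do).
binned.index = binned.index + pd.Timedelta(minutes=90)
print(binned)
```

With offset there should be no need to add the NaN row at 10:30 pm first, because the bin grid is anchored independently of the first timestamp.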
Rain_Rate
Time
2014-04-02 00:00:00 0.555515
2014-04-02 03:00:00 0.546571
2014-04-02 06:00:00 0.439953
2014-04-02 09:00:00 0.460898
2014-04-02 12:00:00 0.506690
2014-04-02 15:00:00 0.605775
2014-04-02 18:00:00 0.448838
2014-04-02 21:00:00 0.387380
2014-04-03 00:00:00 0.604204 #this is the bin at midnight on the following day
You could also start with the data binned at 30 minutes and use this method, and should get the same answer.*
Method 2
Another approach would be to find the locations of the indexes you want to create averages for, and then calculate the averages for entries in the 3 hours surrounding:
resampled = df.resample('30T').mean() #like your data in the post
centers = [0,3,6,9,12,15,18,21]
mask = np.where(df.index.hour.isin(centers) & (df.index.minute==0), True, False)
df_centers = df.index[mask]
output = []
for center in df_centers:
    cond1 = (df.index >= (center - pd.Timedelta(hours=1.5)))
    cond2 = (df.index <= (center + pd.Timedelta(hours=1.5)))
    output.append(df[cond1 & cond2].values.mean())
Output here is the same, but the answers are in a list (and the last point of "24 hours" is not included):
[0.5555146139562004,
0.5465709237162698,
0.43995277270996735,
0.46089800625663596,
0.5066902552121085,
0.6057747262752732,
0.44883794039466535,
0.3873795731806939]
*You mentioned you wanted points on the edge of a bin to be included in both bins. resample doesn't do this (and generally I don't think most people want it), but the second method is explicit about doing so (by using >= and <= in cond1 and cond2). However, the two methods give the same result here, presumably because using resample at different stages causes data points to land in different bins. It's hard for me to wrap my head around that, but one could do a little manual binning to verify what is going on. The point is, I would recommend spot-checking the output of these methods (or any resample-based method) against your raw data to make sure things look correct. For these examples, I did so using Excel.
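One such spot check, as a sketch on synthetic 12-minute data (the seed and the helper name window_mean are my own). The window edges fall on hh:30, and 12-minute timestamps never land there, so including or excluding the right edge selects the same rows. On 30-minute data the edges do exist as rows, so there the choice would matter.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2014-04-02", periods=120, freq="12min")
df = pd.DataFrame({"Rain_Rate": np.random.default_rng(0).random(120)}, index=idx)

centers = pd.date_range("2014-04-02", periods=8, freq="3h")
half = pd.Timedelta(minutes=90)

# Window edges sit at hh:30; check whether any raw timestamp lands on one.
edges = (centers - half).append(centers + half)
edge_hits_raw = edges.isin(df.index).any()

def window_mean(frame, center, include_right):
    """Mean of Rain_Rate within 1.5 hours of center, right edge optional."""
    left = frame.index >= center - half
    if include_right:
        right = frame.index <= center + half
    else:
        right = frame.index < center + half
    return frame.loc[left & right, "Rain_Rate"].mean()

closed = [window_mean(df, c, True) for c in centers]
half_open = [window_mean(df, c, False) for c in centers]

# On 30-minute data the edges (e.g. 01:30) are actual rows.
df30 = df.resample("30min").mean()
edge_hits_30 = edges.isin(df30.index).any()

print(edge_hits_raw, edge_hits_30, np.allclose(closed, half_open))
```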
My dataset looks like this:
time Open
2017-01-01 00:00:00 1.219690
2017-01-01 01:00:00 1.688490
2017-01-01 02:00:00 1.015285
2017-01-01 03:00:00 1.357672
2017-01-01 04:00:00 1.293786
2017-01-01 05:00:00 1.040048
2017-01-01 06:00:00 1.225080
2017-01-01 07:00:00 1.145402
...., ....
2017-12-31 23:00:00 1.145402
I want to compute the sum over a specified time range and save it to a new DataFrame.
For example,
I want the sum between 2017-01-01 22:00:00 and 2017-01-02 04:00:00, i.e. over the 6 hours from 10 PM to 4 AM the next day, and I want to put the result in a different DataFrame, for example df_timerange_sum. Please note that each range spans two different dates.
What did I do?
I used sum() with a time-range filter like this: df[~df['time'].dt.hour.between(10, 4)].sum(), but it gives me the sum over the whole df rather than over the time range I specified.
I also tried resample, but I could not find a way to restrict it to specific times.
df['time'].dt.hour.between(10, 4) is always False because no number is larger than 10 and smaller than 4 at the same time. What you want is to mark between(4,21) and then negate that to get the other hours.
Here's what I would do:
# mark those between 4AM and 10PM
# data we want is where s==False, i.e. ~s
s = df['time'].dt.hour.between(4, 21)
# s.cumsum() stays constant across each consecutive run of False,
# so it labels the blocks over which we will take the sum
blocks = s.cumsum()
# again we only care for ~s
(df[~s].groupby(blocks[~s], as_index=False) # we don't need the blocks as index
.agg({'time':'min', 'Open':'sum'}) # time : min -- select the beginning of blocks
) # Open : sum -- compute sum of Open
Output for random data:
time Open
0 2017-01-01 00:00:00 1.282701
1 2017-01-01 22:00:00 2.766324
2 2017-01-02 22:00:00 2.838216
3 2017-01-03 22:00:00 4.151461
4 2017-01-04 22:00:00 2.151626
5 2017-01-05 22:00:00 2.525190
6 2017-01-06 22:00:00 0.798234
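To see what the blocks look like, here is a small sketch of the same technique on made-up data: three days of hourly rows with Open set to 1.0, so each sum is simply the number of rows in the block. Renaming the block labels and using reset_index instead of as_index=False is my own variation, not part of the answer above.

```python
import pandas as pd

# Hourly rows for three days; Open is 1.0 everywhere, so each sum is a row count.
df = pd.DataFrame({"time": pd.date_range("2017-01-01", periods=72, freq="h")})
df["Open"] = 1.0

# True for hours 4 through 21; the rows we want are where s is False.
s = df["time"].dt.hour.between(4, 21)

# The running count of True values is constant across each run of False.
blocks = s.cumsum().rename("block")

out = (df[~s].groupby(blocks[~s])
             .agg({"time": "min", "Open": "sum"})
             .reset_index(drop=True))
print(out)
```

The first and last blocks are shorter because the data starts at midnight and stops at 23:00. Note also that between(4, 21) is inclusive, so the 04:00 row itself is left out of each sum.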
Here is an alternative (in my opinion more straightforward) approach that accomplishes the same thing. There are definitely ways to shorten the code, but I am also relatively new to pandas.
df.set_index(['time'],inplace=True) #make time the index col (not 100% necessary)
df2=pd.DataFrame(columns=['start_time','end_time','sum_Open']) #new df that stores your desired output + start and end times if you need them
df2['start_time']=df[df.index.hour == 22].index #gets/stores all start datetimes
df2['end_time']=df[df.index.hour == 4].index #gets/stores all end datetimes
for i,row in df2.iterrows():
    df2.at[i,'sum_Open'] = df[(df.index >= row['start_time']) & (df.index <= row['end_time'])]['Open'].sum()
You'd have to add an if statement or something similar to handle the last day, which ends at 11 pm.
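One possible way to sidestep the last-day problem, sketched on made-up hourly data: derive each end time from its start time (start + 6 hours) rather than looking up the 04:00 rows separately, and let label slicing truncate the final window. The column names follow this answer; everything else is an assumption for illustration.

```python
import pandas as pd

# Three days of hourly rows, Open = 1.0, indexed by time.
df = pd.DataFrame(
    {"Open": 1.0},
    index=pd.date_range("2017-01-01", periods=72, freq="h", name="time"),
)

rows = []
for start in df.index[df.index.hour == 22]:
    end = start + pd.Timedelta(hours=6)      # nominal 04:00 on the next day
    window = df.loc[start:end, "Open"]       # label slice, inclusive at both ends
    rows.append({"start_time": start, "end_time": end, "sum_Open": window.sum()})

df2 = pd.DataFrame(rows)
print(df2)
```

The last window only contains 22:00 and 23:00, so its end_time is nominal rather than an actual row. Like the loop above, this includes the 04:00 row in each sum.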
I have the following dataframe in pandas
code time
1 003002
1 053003
1 060002
1 073001
1 073003
I want to generate the following dataframe in pandas
code time new_time
1 003002 00:30:00
1 053003 05:30:00
1 060002 06:00:00
1 073001 07:30:00
1 073003 07:30:00
I am doing it with the following code, but it keeps the seconds:
df['new_time'] = pd.to_datetime(df['time'] ,format='%H%M%S').dt.time
How can I do it in pandas?
Use Series.dt.floor:
df['time'] = pd.to_datetime(df['time'], format='%H%M%S').dt.floor('T').dt.time
Or remove the last 2 characters by slicing, then change the format to %H%M:
df['time'] = pd.to_datetime(df['time'].str[:-2], format='%H%M').dt.time
print (df)
code time
0 1 00:30:00
1 1 05:30:00
2 1 06:00:00
3 1 07:30:00
4 1 07:30:00
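A self-contained check of both variants on the question's data, assuming the time column holds strings (so the leading zeros survive). I write the result to new_time here to keep the original column, and use the 'min' alias, which newer pandas versions prefer over 'T':

```python
import pandas as pd

df = pd.DataFrame({
    "code": [1, 1, 1, 1, 1],
    "time": ["003002", "053003", "060002", "073001", "073003"],
})

# Variant 1: parse the full HHMMSS string, then floor to the minute.
floored = pd.to_datetime(df["time"], format="%H%M%S").dt.floor("min").dt.time

# Variant 2: drop the two seconds characters and parse only HHMM.
sliced = pd.to_datetime(df["time"].str[:-2], format="%H%M").dt.time

df["new_time"] = floored
print(df)
```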
An option using astype:
pd.to_datetime(df_oclh.Time).astype('datetime64[m]').dt.time
'datetime64[m]' is the target dtype: a datetime whose finest unit is the minute, so seconds and anything smaller are dropped. Alternatively you could use [s] for seconds (dropping milliseconds) or [h] for hours (dropping minutes, seconds and milliseconds).
I have a huge dataset of login logs covering several days. The timestamps are in UNIX format.
The code is supposed to group the logs by start and end time and provide log counts and unique ID counts.
I am trying to get some stats like:
total log counts per hour and unique login IDs per hour.
log counts for a choice of window, e.g. 24 hrs, 12 hrs, 6 hrs, 1 hr, and per day of the week.
I am able to split the data by start and end hours, but I am not able to get the counts of logs and unique IDs.
Code:
from datetime import datetime,time
# This splits data from start to end time
start = time(8,0,0)
end = time(20,0,0)
with open('input', 'r') as infile, open('output','w') as outfile:
    for row in infile:
        col = row.split()
        t1 = datetime.fromtimestamp(float(col[2])).time()
        t2 = datetime.fromtimestamp(float(col[3])).time()
        print (t1 >= start and t2 <= end)
Input data format: The data has no headers but the fields are given below. The number of days is not known in input.
UserID, StartTime, StopTime, GPS1, GPS2
00022d9064bc,1073260801,1073260803,819251,440006
00022d9064bc,1073260803,1073260810,819213,439954
00904b4557d3,1073260803,1073261920,817526,439458
00022de73863,1073260804,1073265410,817558,439525
00904b14b494,1073260804,1073262625,817558,439525
00022d1406df,1073260807,1073260809,820428,438735
00022d9064bc,1073260801,1073260803,819251,440006
00022dba8f51,1073260801,1073260803,819251,440006
00022de1c6c1,1073260801,1073260803,819251,440006
003065f30f37,1073260801,1073260803,819251,440006
00904b48a3b6,1073260801,1073260803,819251,440006
00904b83a0ea,1073260803,1073260810,819213,439954
00904b85d3cf,1073260803,1073261920,817526,439458
00904b14b494,1073260804,1073265410,817558,439525
00904b99499c,1073260804,1073262625,817558,439525
00904bb96e83,1073260804,1073265163,817558,439525
00904bf91b75,1073260804,1073263786,817558,439525
Expected Output: Example Output
StartTime, EndTime, Day, LogCount, UniqueIDCount
00:00:00, 01:00:00, Mon, 349, 30
StartTime and EndTime = human-readable format
Separating the data by a time range is already done, but I am trying to round the times to the window boundaries and calculate the counts of logs and unique IDs. A solution with pandas is also welcome.
Edit one: more details
StartTime --> EndTime
1/5/2004, 5:30:01 --> 1/5/2004, 5:30:03
That falls between 5:00:00 --> 6:00:00, so it should be counted in that hour. The count of all the logs in each time range is what I am trying to find. Similarly for other ranges, like:
5:00:00 --> 6:00:00 Hourly Count
00:00:00 --> 6:00:00 Every 6 hours
00:00:00 --> 12:00:00 Every 12 hours
5 Jan 2004, Mon --> count
6 Jan 2004, Tue --> Count
And so on. I am looking for a generic program where I can change the time/hours range as needed.
Unfortunately I couldn't find an elegant solution.
Here is my attempt:
import numpy as np
import pandas as pd

fn = r'D:\temp\.data\dart_small.csv'
cols = ['UserID','StartTime','StopTime','GPS1','GPS2']
df = pd.read_csv(fn, header=None, names=cols)
df['m'] = df.StopTime + df.StartTime
df['d'] = df.StopTime - df.StartTime
# 'start' and 'end' for the reporting DF: `r`
# which will contain equal intervals (1 hour in this case)
start = pd.to_datetime(df.StartTime.min(), unit='s').date()
end = pd.to_datetime(df.StopTime.max(), unit='s').date() + pd.Timedelta(days=1)
# building reporting DF: `r`
freq = '1H' # 1 Hour frequency
idx = pd.date_range(start, end, freq=freq)
r = pd.DataFrame(index=idx)
r['start'] = (r.index - pd.Timestamp(1970,1,1)).total_seconds().astype(np.int64)
# 1 hour in seconds, minus one second (so that we will not count it twice)
interval = 60*60 - 1
r['LogCount'] = 0
r['UniqueIDCount'] = 0
for i, row in r.iterrows():
    # intervals overlap test
    # https://en.wikipedia.org/wiki/Interval_tree#Overlap_test
    # I've slightly simplified the calculations of m and d
    # by getting rid of division by 2,
    # because it can be done eliminating common terms
    u = df[np.abs(df.m - 2*row.start - interval) < df.d + interval].UserID
    r.loc[i, ['LogCount', 'UniqueIDCount']] = [len(u), u.nunique()]
r['Day'] = pd.to_datetime(r.start, unit='s').dt.day_name().str[:3]
r['StartTime'] = pd.to_datetime(r.start, unit='s').dt.time
r['EndTime'] = pd.to_datetime(r.start + interval + 1, unit='s').dt.time
print(r[r.LogCount > 0])
PS: the fewer periods you have in the report DF r, the faster the counting will be. So you may want to drop rows (times) if you know beforehand that those timeframes won't contain any data (for example weekends, holidays, etc.).
Result:
start LogCount UniqueIDCount Day StartTime EndTime
2004-01-05 00:00:00 1073260800 24 15 Mon 00:00:00 01:00:00
2004-01-05 01:00:00 1073264400 5 5 Mon 01:00:00 02:00:00
2004-01-05 02:00:00 1073268000 3 3 Mon 02:00:00 03:00:00
2004-01-05 03:00:00 1073271600 3 3 Mon 03:00:00 04:00:00
2004-01-05 04:00:00 1073275200 2 2 Mon 04:00:00 05:00:00
2004-01-06 12:00:00 1073390400 22 12 Tue 12:00:00 13:00:00
2004-01-06 13:00:00 1073394000 3 2 Tue 13:00:00 14:00:00
2004-01-06 14:00:00 1073397600 3 2 Tue 14:00:00 15:00:00
2004-01-06 15:00:00 1073401200 3 2 Tue 15:00:00 16:00:00
2004-01-10 16:00:00 1073750400 20 11 Sat 16:00:00 17:00:00
2004-01-14 23:00:00 1074121200 218 69 Wed 23:00:00 00:00:00
2004-01-15 00:00:00 1074124800 12 11 Thu 00:00:00 01:00:00
2004-01-15 01:00:00 1074128400 1 1 Thu 01:00:00 02:00:00
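To convince yourself that the simplified overlap test in the loop does what it should, here is a small sketch that compares it with the direct comparison (log starts before the window ends and stops after the window starts) on the first six rows of the question's sample. The helper name hour_stats is made up for this illustration.

```python
import pandas as pd

cols = ["UserID", "StartTime", "StopTime"]
logs = pd.DataFrame([
    ("00022d9064bc", 1073260801, 1073260803),
    ("00022d9064bc", 1073260803, 1073260810),
    ("00904b4557d3", 1073260803, 1073261920),
    ("00022de73863", 1073260804, 1073265410),
    ("00904b14b494", 1073260804, 1073262625),
    ("00022d1406df", 1073260807, 1073260809),
], columns=cols)

def hour_stats(logs, start, interval=60 * 60 - 1):
    """Return (log count, unique ID count) for the window [start, start + interval]."""
    m = logs.StopTime + logs.StartTime   # twice the midpoint of each log
    d = logs.StopTime - logs.StartTime   # twice the half-length of each log
    simplified = (m - 2 * start - interval).abs() < d + interval
    # Direct form of the same condition, for comparison.
    direct = (logs.StartTime < start + interval) & (logs.StopTime > start)
    assert (simplified == direct).all()
    ids = logs.loc[simplified, "UserID"]
    return len(ids), ids.nunique()

first = hour_stats(logs, 1073260800)    # window starting 2004-01-05 00:00:00
second = hour_stats(logs, 1073264400)   # window starting 2004-01-05 01:00:00
print(first, second)
```

All six logs start in the first hour, and only one of them is still running in the second hour, so it is counted in both windows.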