I have a file with a million tweets. The first tweet occurred 2013-04-15 20:17:18 UTC. I want to add a minsSince column that holds, for each tweet row, the number of minutes elapsed since that first tweet.
I have found help with datetime here, and converting time here, but when I put the two together I don't get the right times. It could be something with the UTC string at the end of each published_at value.
The error it throws is:
tweets['minsSince'] = tweets.apply(timesince,axis=1)
...
TypeError: ('string indices must be integers, not str', u'occurred at index 0')
Thanks for any help.
#Import stuff
from datetime import datetime
import time
import pandas as pd
from pandas import DataFrame
#Read the csv file
tweets = pd.read_csv('BostonTWEETS.csv')
tweets.head()
#The first tweet's published_at time
starttime = datetime (2013, 04, 15, 20, 17, 18)
#Run through the document and calculate the minutes since the first tweet
def timesince(row):
    minsSince = int()
    tweetTime = row['published_at']
    ts = time.strftime('%Y-%m-%d %H:%M:%S', time.strptime(tweetTime['published_at'], '%Y-%m-%d %H:%M:%S %UTC'))
    timediff = (tweetTime - starttime)
    minsSince.append("timediff")
    return ",".join(minsSince)
tweets['minsSince'] = tweets.apply(timesince,axis=1)
df = DataFrame(tweets)
print(df)
Sample csv file of first 5 rows.
#Import stuff
from datetime import datetime
import time
import pandas as pd
from pandas import DataFrame
#Read the csv file
tweets = pd.read_csv('sample.csv')
tweets.head()
#The first tweet's published_at time
starttime = tweets.published_at.values[0]
starttime = datetime.strptime(starttime, '%Y-%m-%d %H:%M:%S UTC')
#Run through the document and calculate the minutes since the first tweet
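def timesince(row):
    # row is a single published_at string, because the function is applied with Series.map
    ts = datetime.strptime(row, '%Y-%m-%d %H:%M:%S UTC')
    timediff = (ts - starttime)
    # divmod returns (whole minutes, leftover seconds)
    timediff = divmod(timediff.days * 86400 + timediff.seconds, 60)
    return timediff[0]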
tweets['minSince'] = 0
tweets.minSince = tweets.published_at.map(timesince)
df = DataFrame(tweets)
print(df)
I hope this is what you are looking for.
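For a file with a million rows, a vectorized version may be noticeably faster than calling strptime once per row. Here is a minimal sketch, assuming the column is named published_at as in the question, every value ends in a literal " UTC" suffix, and the first row is the earliest tweet. The helper name add_mins_since and the sample values are made up for illustration.

```python
import pandas as pd


def add_mins_since(tweets):
    """Return a copy of tweets with a minsSince column (hypothetical helper)."""
    out = tweets.copy()
    # Assumption: every value looks like '2013-04-15 20:17:18 UTC'.
    stamps = out['published_at'].str.replace(' UTC', '', regex=False)
    ts = pd.to_datetime(stamps, format='%Y-%m-%d %H:%M:%S')
    # Whole minutes elapsed since the first row (assumed to be the first tweet).
    elapsed = ts - ts.iloc[0]
    out['minsSince'] = (elapsed.dt.total_seconds() // 60).astype(int)
    return out


# Made-up sample rows standing in for the csv file.
sample = pd.DataFrame({'published_at': [
    '2013-04-15 20:17:18 UTC',
    '2013-04-15 20:18:20 UTC',
    '2013-04-16 20:17:18 UTC',
]})
print(add_mins_since(sample))
```

If the file is not sorted by time, ts.min() could be used instead of ts.iloc[0] as the reference point.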
Related
I have a dataframe with dates and would like to get the time between the first date and the last date. When I run the code below
df.sort_values('timestamp', inplace=True)
firstDay = df.iloc[0]['timestamp']
lastDay = df.iloc[len(df)-1]['timestamp']
print(firstDay)
print(lastDay)
it prints the dates in the following format:
2016-09-24 17:42:27.839496
2017-01-18 10:24:08.629327
I'm trying to get the difference between them, but they're in str format, and I've been having trouble converting them to a form where I can compute the difference.
here you go :o)
from datetime import datetime
date_format_str = '%Y-%m-%d %H:%M:%S.%f'
date_1 = '2016-09-24 17:42:27.839496'
date_2 = '2017-01-18 10:24:08.629327'
start = datetime.strptime(date_1, date_format_str)
end = datetime.strptime(date_2, date_format_str)
# Get the interval between the two timestamps as a timedelta object
diff = end - start
diff_in_hours = diff.total_seconds() / 3600
print(diff_in_hours)
# Get the difference between the two calendar dates only (time of day is ignored)
diff = end.date() - start.date()
print(diff.days)
Pandas
import pandas as pd
date_1 = '2016-09-24 17:42:27.839496'
date_2 = '2017-01-18 10:24:08.629327'
start = pd.to_datetime(date_1, format='%Y-%m-%d %H:%M:%S.%f')
end = pd.to_datetime(date_2, format='%Y-%m-%d %H:%M:%S.%f')
# get the difference between two datetimes as timedelta object
diff = end - start
print(diff.days)
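Since the question starts from a dataframe column rather than two standalone strings, the same idea can be applied to the whole column at once. Here is a rough sketch, assuming the column is called timestamp and holds strings in the format shown in the question. The helper name time_span is hypothetical.

```python
import pandas as pd


def time_span(df, column='timestamp'):
    """Return last minus first timestamp of a string column (hypothetical helper)."""
    stamps = pd.to_datetime(df[column], format='%Y-%m-%d %H:%M:%S.%f')
    # min/max make sorting the dataframe first unnecessary.
    return stamps.max() - stamps.min()


# The two values printed in the question, deliberately out of order.
df = pd.DataFrame({'timestamp': [
    '2017-01-18 10:24:08.629327',
    '2016-09-24 17:42:27.839496',
]})
span = time_span(df)
print(span.days)                    # whole days in the interval
print(span.total_seconds() / 3600)  # the same interval in hours
```

Note that span.days counts complete 24-hour periods, so it can be one less than the difference between the two calendar dates.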
It is common for a GTFS time to exceed 23:59:59 because of the timetable cycle, i.e. the last time may be 25:20:00 (01:20:00 the next day). Converting such times to datetime raises an error.
Is there a way to convert the GTFS time values into standard datetime format without splitting out the hour, converting back to a correctly formatted string, and then converting that string to a datetime?
t = ['24:22:00', '24:30:00', '25:40:00', '26:27:00']
'0'+str(pd.to_numeric(t[0].split(':')[0])%24)+':'+':'.join(t[0].split(':')[1:])
For the above examples, I would expect to just see
['00:22:00', '00:30:00', '01:40:00', '02:27:00']
from datetime import datetime, timedelta
def gtfs_time_to_datetime(gtfs_date, gtfs_time):
    hours, minutes, seconds = tuple(
        int(token) for token in gtfs_time.split(":")
    )
    return (
        datetime.strptime(gtfs_date, "%Y%m%d") + timedelta(
            hours=hours, minutes=minutes, seconds=seconds
        )
    )
gives the following result
>>> gtfs_time_to_datetime("20191031", "24:22:00")
datetime.datetime(2019, 11, 1, 0, 22)
>>> gtfs_time_to_datetime("20191031", "24:22:00").time().isoformat()
'00:22:00'
>>> t = ['24:22:00', '24:30:00', '25:40:00', '26:27:00']
>>> [ gtfs_time_to_datetime("20191031", tt).time().isoformat() for tt in t]
['00:22:00', '00:30:00', '01:40:00', '02:27:00']
I didn't find an easy way, so I just wrote a function to do it.
If anyone else wants the solution, here is mine:
from datetime import timedelta
import pandas as pd
def list_to_real_datetime(time_list, date_exists=False):
    '''
    Convert a list of GTFS times to real datetime list
    :param time_list: GTFS times
    :param date_exists: Flag indicating if the date exists in the list elements
    :return: An adjusted list of time to conform with real date times
    '''
    # new list of times to be returned
    new_time = []
    for time in time_list:
        plus_day = False
        hour = int(time[0:2])
        if hour >= 24:
            hour -= 24
            plus_day = True
        # reset the time to a real format
        time = '{:02d}'.format(hour)+time[2:]
        # Convert the time to a datetime
        if not date_exists:
            # the format has to cover the time part as well as the placeholder date
            time = pd.to_datetime('1970-01-01 '+time, format='%Y-%m-%d %H:%M:%S')
            if plus_day:
                time = time + timedelta(days=1)
        new_time.append(time)
    return new_time
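Another option that avoids editing the hour field by hand is to treat each GTFS time as an offset from the start of the service day. Here is a sketch, assuming pandas parses 'HH:MM:SS' strings with hours above 23 as timedeltas (worth checking on your pandas version). The function name gtfs_times_to_timestamps and the service date are illustrative, not part of the original answers.

```python
import pandas as pd


def gtfs_times_to_timestamps(service_date, gtfs_times):
    """Add GTFS clock times to a service date given as YYYYMMDD (hypothetical helper)."""
    base = pd.to_datetime(service_date, format='%Y%m%d')
    # Each GTFS time is read as a duration, so 25:40:00 means 1 day 01:40:00.
    offsets = pd.to_timedelta(pd.Series(gtfs_times))
    return base + offsets


t = ['24:22:00', '24:30:00', '25:40:00', '26:27:00']
stamps = gtfs_times_to_timestamps('20191031', t)
print(stamps)
print(stamps.dt.strftime('%H:%M:%S').tolist())
```

Keeping the full timestamp rather than only the wall-clock time preserves the fact that these trips run on the following calendar day.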
I am inexperienced with Python and am trying to parse all timestamps of the following csv as datetime objects in order to then perform functions on them (e.g. find timestamp differences etc.).
However, I can parse single lines but not the whole timestamp column. I am getting KeyError: '2010-12-30 14:32:00' for the first date of the timestamp column when reaching the line below my 'not working' comment.
Thanks in advance.
from datetime import datetime, timedelta
import pandas as pd
from dateutil.parser import parse
csvFile = pd.read_csv('runningComplete.csv')
column = csvFile['timestamp']
column = column.str.slice(0, 19, 1)
print(column)
dt1 = datetime.strptime(column[1], '%Y-%m-%d %H:%M:%S')
print(dt1)
dt2 = datetime.strptime(column[2], '%Y-%m-%d %H:%M:%S')
print(dt2)
dt3 = dt1 - dt2
print(dt3)
for row in column:
    print(row)
Not working:
for row in column:
    timestamp = datetime.strptime(column[row], '%Y-%m-%d %H:%M:%S')
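The KeyError comes from the loop itself: iterating over a Series yields its values, so row is already the timestamp string, and column[row] then tries to look that string up as an index label. Calling datetime.strptime(row, ...) inside the loop would work, but the whole column can also be parsed in one call. Here is a minimal sketch, assuming the first 19 characters of every value have the form YYYY-MM-DD HH:MM:SS as in the question. The name parse_timestamps and the sample values are hypothetical.

```python
import pandas as pd


def parse_timestamps(column):
    """Parse a Series of timestamp strings into datetimes (hypothetical helper)."""
    # Keep only 'YYYY-MM-DD HH:MM:SS', dropping any trailing offset or fraction.
    trimmed = column.str.slice(0, 19)
    return pd.to_datetime(trimmed, format='%Y-%m-%d %H:%M:%S')


# Made-up values standing in for csvFile['timestamp'].
column = pd.Series(['2010-12-30 14:32:00+00:00', '2010-12-30 14:35:30+00:00'])
parsed = parse_timestamps(column)
# Difference between each timestamp and the previous one (first entry is NaT).
print(parsed.diff())
```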
I am taking time as input from the user in the HH:MM format. Let's say 00:00 and now I want to keep adding a minute to the time and make it 00:01, 00:02 and so on.
Also, I am taking two inputs from the user, start_time and end_time, as strings. How can I calculate the difference between the two times in minutes as well?
I am new to Python and any help would be appreciated!
I am using the below code:
#to calculate difference in time
time_diff = datetime.strptime(end_time, '%H:%M') - datetime.strptime(start_time, '%H:%M')
minutes = int(time_diff.total_seconds()/60)
print minutes
#to convert string to time format HH:MM
start_time = datetime.strptime(start_time, '%H:%M').time()
#to increment time by 1 minute
start_time = start_time + datetime.timedelta(minutes=1)
I am not able to increment the start_time using timedelta.
import datetime
time_diff = datetime.datetime.strptime(end_time, '%H:%M') - datetime.datetime.strptime(start_time, '%H:%M')
minutes = int(time_diff.total_seconds()/60)
print(minutes)
datetime is a class of the datetime module that has a classmethod called strptime. The nomenclature is a bit confusing, but this should work as you intend it to.
As for adding a time delta, you'll need to store the start time as a datetime object (not a time object) in order to get that to work:
start_datetime = datetime.datetime.strptime(start_time, '%H:%M')
start_datetime = start_datetime + datetime.timedelta(minutes=1)
print(start_datetime)
First part of your question, you can use the datetime module:
from datetime import datetime as dt
from datetime import timedelta as td
UsrInput = '00:00'
fmtString = '%H:%M'
myTime = dt.strptime(UsrInput, fmtString)
# timedelta's positional arguments are (days, seconds), so name the unit explicitly
increment = td(minutes=1)
for count in range(10):
    myTime += increment
    print(dt.strftime(myTime, fmtString))
Second part will also use datetime, as such:
from datetime import datetime as dt
from datetime import timedelta as td
start_time = '00:01'
end_time = '00:23'
fmtString = '%H:%M'
myStart = dt.strptime(start_time, fmtString)
myEnd = dt.strptime(end_time, fmtString)
difference = myEnd - myStart
# timedelta has no strftime, so convert the total seconds to whole minutes
print(int(difference.total_seconds() // 60))
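Both parts of the question can be wrapped in two small functions. This is only a sketch: the names minutes_between and minute_steps are invented here, and it assumes both inputs are valid HH:MM strings on the same day, with end_time not earlier than start_time.

```python
from datetime import datetime, timedelta

FMT = '%H:%M'


def minutes_between(start_time, end_time):
    """Whole minutes from start_time to end_time, both 'HH:MM' strings."""
    start = datetime.strptime(start_time, FMT)
    end = datetime.strptime(end_time, FMT)
    return int((end - start).total_seconds() // 60)


def minute_steps(start_time, count):
    """Return the next `count` minutes after start_time as 'HH:MM' strings."""
    # A full datetime is needed here: a bare time object cannot be added to a timedelta.
    current = datetime.strptime(start_time, FMT)
    steps = []
    for _ in range(count):
        current += timedelta(minutes=1)
        steps.append(current.strftime(FMT))
    return steps


print(minutes_between('00:01', '00:23'))
print(minute_steps('23:58', 3))
```

Because the increments are applied to a datetime, stepping past 23:59 rolls over to 00:00 instead of raising an error.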
I want to get the date time object for last hour.
Let's say the sys time is "2011-9-28 06:11:30"
I want to get the output as "2011-9-28 05" #{06 - 1 hour}
I used:
lastHourDateTime = date.today() - timedelta(hours = 1)
print lastHourDateTime.strftime('%Y-%m-%d %H:%M:%S')
However, my output is not showing the time part at all. Where am I going wrong?
Date doesn't have the hour - use datetime:
from datetime import datetime, timedelta
last_hour_date_time = datetime.now() - timedelta(hours = 1)
print(last_hour_date_time.strftime('%Y-%m-%d %H:%M:%S'))
This works for me:
import datetime
lastHourDateTime = datetime.datetime.now() - datetime.timedelta(hours = 1)
print(lastHourDateTime.strftime('%Y-%m-%d %H'))
# prints "2011-09-28 12" which is the time one hour ago in Central Europe
You can achieve the same goal using pandas:
import pandas as pd
pd.Timestamp.now() - pd.Timedelta('1 hours')
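If a datetime object truncated to the hour is wanted rather than only a formatted string, the minutes and seconds can be zeroed out after subtracting the hour. Here is a small sketch using only the standard library. The function name start_of_last_hour and its optional now parameter are assumptions added so the example can be run against the time given in the question.

```python
from datetime import datetime, timedelta


def start_of_last_hour(now=None):
    """Return the previous hour as a datetime with minutes and seconds set to zero."""
    if now is None:
        now = datetime.now()
    one_hour_ago = now - timedelta(hours=1)
    return one_hour_ago.replace(minute=0, second=0, microsecond=0)


# Using the system time from the question instead of the real clock.
result = start_of_last_hour(datetime(2011, 9, 28, 6, 11, 30))
print(result.strftime('%Y-%m-%d %H'))  # prints 2011-09-28 05
```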