How to efficiently parse Time/Date string into datetime object? - python

I'm scraping data from a news site and want to store the time and date these articles were posted. The good thing is that I can pull these timestamps right from the page of the articles.
When the articles I scrape were posted today, the output looks like this:
17:22 ET
02:41 ET
06:14 ET
When the articles were posted earlier than today, the output looks like this:
Mar 10, 2021, 16:05 ET
Mar 08, 2021, 08:00 ET
Feb 26, 2021, 11:23 ET
Current problem: I can't order my database by the time the articles were posted, because whenever I run the program, articles that were posted today are stored only with a time. Over multiple days, this will create a lot of articles with a stamp that looks as if they were posted on the day you look at the database - since there is only a time.
What I want: Add the current month/day/year in front of the time stamp on the basis of the already given format.
My idea: I have a hard time understanding how regex works. My idea would be to check the length of the imported string. If it is exactly 8, I want to add the month, day and year in front. But I don't know whether this is a) the most efficient approach and b) most importantly, how to code this seemingly easy idea.
I would gladly appreciate it if someone could help me code this. The current line which grabs the time looks like this:
article_time = item.select_one('h3 small').text

Try this out and others can correct me if I overlooked something:
from datetime import datetime, timedelta
def get_datetime_from_time(time):
    time, timezone = time.rsplit(' ', 1)
    if ',' in time:
        article_time = datetime.strptime(time, r"%b %d, %Y, %H:%M")
    else:
        article_time = datetime.strptime(time, r"%H:%M")
        hour, minute = article_time.hour, article_time.minute
        if timezone == 'ET':
            hours = -4
        else:
            hours = -5
        article_time = (datetime.utcnow() + timedelta(hours=hours)).replace(hour=hour, minute=minute) # Adjust for timezone
    return article_time
article_time = item.select_one('h3 small').text
article_time = get_datetime_from_time(article_time)
What I'm doing here is I'm checking if a comma is in your time string. If it is, then it's with date, else it's without. Then I'm checking for timezone since Daylight time is different than Standard time. So I have a statement to adjust timezone by 4 or 5. Then I'm getting the UTC time (regardless of your timezone) and adjust for timezone. strptime is a function that parses time depending on a format you give it.
Note that this does not take into account an empty time string.
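To cover the empty-string case mentioned above, here is a minimal sketch of the same comma check. It is only an illustration: parse_stamp and its now parameter are hypothetical names, the zone label is simply dropped (so the result is a naive datetime), and it assumes every non-empty string ends with a zone label such as "ET".

```python
from datetime import datetime
from typing import Optional


def parse_stamp(text: str, now: Optional[datetime] = None) -> Optional[datetime]:
    """Parse 'HH:MM ET' or 'Mon DD, YYYY, HH:MM ET'; return None for empty input."""
    text = text.strip()
    if not text:
        return None
    # Split off the trailing zone label ("ET"); this sketch ignores it.
    body, _, _zone = text.rpartition(" ")
    if "," in body:
        # Full stamp: the date is already part of the string.
        return datetime.strptime(body, "%b %d, %Y, %H:%M")
    # Time-only stamp: borrow year, month and day from `now`.
    clock = datetime.strptime(body, "%H:%M")
    now = now or datetime.now()
    return now.replace(hour=clock.hour, minute=clock.minute, second=0, microsecond=0)
```

Passing now explicitly makes the function easy to test; in the scraper you would leave it out so the current date is used.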

Handling timezones properly can get fairly involved since the standard library barely supports them (and recommends using the third-party pytz module to do so). This would be especially true if you need it
So, one "quick and dirty" way to deal with them would be to just ignore that information and add the current day, month, and year to any timestamps encountered that don't include that. The code below demonstrates how to do that.
from datetime import datetime
scrapped = '''
17:22 ET
02:41 ET
06:14 ET
Mar 10, 2021, 16:05 ET
Mar 08, 2021, 08:00 ET
Feb 26, 2021, 11:23 ET
'''
def get_datetime(string):
    string = string[:-3] # Remove timezone.
    try:
        r = datetime.strptime(string, "%b %d, %Y, %H:%M")
    except ValueError:
        try:
            today = datetime.today()
            daytime = datetime.strptime(string, "%H:%M")
            r = today.replace(hour=daytime.hour, minute=daytime.minute, second=0, microsecond=0)
        except ValueError:
            r = None
    return r
for line in scrapped.splitlines():
    if line:
        r = get_datetime(line)
        print(f'{line=}, {r=}')

"I can't order my database" - to be able to do so, you'll either have to convert the strings to datetime objects or to an ordered format (low to high resolution, so year-month-day- etc.) which would allow you to sort strings correctly.
"I have a hard time to understand how regex works" - while you can use regular expressions here to somehow parse and modify the strings you have, you don't need to.
#1 If you want a convenient option that leaves you with datetime objects, here's one using dateutil:
import dateutil.parser # a bare "import dateutil" does not reliably load the submodules
import dateutil.tz
times = ["17:22 ET", "02:41 ET", "06:14 ET",
         "Mar 10, 2021, 16:05 ET", "Mar 08, 2021, 08:00 ET", "Feb 26, 2021, 11:23 ET"]
tzmapping = {'ET': dateutil.tz.gettz('US/Eastern')}
for t in times:
    print(f"{t:>22} -> {dateutil.parser.parse(t, tzinfos=tzmapping)}")
              17:22 ET -> 2021-03-13 17:22:00-05:00
              02:41 ET -> 2021-03-13 02:41:00-05:00
              06:14 ET -> 2021-03-13 06:14:00-05:00
Mar 10, 2021, 16:05 ET -> 2021-03-10 16:05:00-05:00
Mar 08, 2021, 08:00 ET -> 2021-03-08 08:00:00-05:00
Feb 26, 2021, 11:23 ET -> 2021-02-26 11:23:00-05:00
Note that you can easily tell dateutil's parser to use a certain time zone (e.g. to convert 'ET' to US/Eastern) and it also automatically adds today's date if the date is not present in the input.
#2 If you want to do more of the parsing yourself (probably more efficient), you can do so by extracting the time zone first, then parsing the rest and adding a date where needed:
from datetime import datetime
from zoneinfo import ZoneInfo # Python < 3.9: you can use backports.zoneinfo
# add more if you not only have ET...
tzmapping = {'ET': ZoneInfo('US/Eastern')}
# get tuples of the input string with tz stripped off and timezone object
times_zones = [(t[:t.rfind(' ')], tzmapping[t.split(' ')[-1]]) for t in times]
# parse to datetime
dt = []
for t, z in times_zones:
    if len(t)>5: # time and date...
        dt.append(datetime.strptime(t, '%b %d, %Y, %H:%M').replace(tzinfo=z))
    else: # time only...
        dt.append(datetime.combine(datetime.now(z).date(),
                                   datetime.strptime(t, '%H:%M').time()).replace(tzinfo=z))
for t, dtobj in zip(times, dt):
    print(f"{t:>22} -> {dtobj}")
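The other option mentioned at the top of this answer, an "ordered format" string, can be sketched as follows. This is only an illustration with made-up sample values; to_sortable is a hypothetical helper name. Because the fields go from year down to minute and are zero-padded, plain string sorting gives chronological order.

```python
from datetime import datetime


def to_sortable(dt: datetime) -> str:
    """Render a datetime as 'YYYY-MM-DD HH:MM' (low to high resolution)."""
    return dt.strftime("%Y-%m-%d %H:%M")


# Sample values for illustration only.
stamps = [
    datetime(2021, 3, 10, 16, 5),
    datetime(2021, 2, 26, 11, 23),
    datetime(2021, 3, 13, 2, 41),
]

# Sorting the strings alphabetically also sorts them by time.
ordered = sorted(to_sortable(s) for s in stamps)
print(ordered)
```

If your database has a native datetime column type, storing the datetime objects directly is likely the simpler route.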

Related

How to compare date in python? I have a month date year format and want to compare to the current date (within 7 days)

This is the data that is being returned from my API:
"Jun 02, 2021, 2 PMEST"
I want to check whether this date is within 7 days of the current date, which I'm getting like this:
from datetime import date
today = date.today()
print("Today's date:", today)
I just need to convert Jun to a number, along with 02, and compare to see whether the date is within 7 days in the future of the current date, then return True.
APPROACH 0:
Given the format of your example data, you should be able to convert it to a datetime using this code:
datetime.strptime("Jun 02, 2021, 2 PMEST", "%b %d, %Y, %I %p%Z")
The details about this format string are here: https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior
However, when I tested this locally, it worked for this input:
"Jun 02, 2021, 2 PMUTC"
but not for your input (which has different timezone):
"Jun 02, 2021, 2 PMEST"
I have investigated this some more and "read the docs" (https://docs.python.org/3/library/time.html).
To get EST parsing to work, you would have to change your OS timezone and reset the time module's timezones like this:
from datetime import datetime
import os
import time
os.environ["TZ"] = "US/Eastern" # change timezone
time.tzset() # reset time.tzname tuple
datetime.strptime("Jun 02, 2021, 2 PMEST", "%b %d, %Y, %I %p%Z")
When you're done, be safe and delete the "hacked" environment variable:
del os.environ["TZ"]
Note - Since your system timezone is presumably still UTC, it can still parse UTC timezone too.
See this thread for detailed discussion: https://bugs.python.org/issue22377
Also note that the time zone is not actually captured. The result you get with EST and UTC is a naive datetime object.
APPROACH 1
So, it seems like there is a better way to approach this.
First, you need to pip install python-dateutil if you don't already have it.
Then do something like this:
from dateutil import parser
from dateutil.tz import gettz
tzinfos = {"EST": gettz("US/Eastern")}
my_datetime = parser.parse("Jun 02, 2021, 2 PM EST", tzinfos=tzinfos)
What's happening here is we use gettz to get timezone information from the timezones listed in /usr/share/zoneinfo. Then the parse function can (fuzzy) parse your string (no format needs to be specified!) and return my_datetime, which has timezone information on it. Here are the parser docs: https://dateutil.readthedocs.io/en/stable/parser.html
I don't know how many different timezones you need to deal with so the rest is up to you. Good luck.
Convert the date to a datetime structure and take the direct difference. Note that today must be a datetime, too.
import datetime
date_string = "Jun 02, 2021, 2 PMEST"
today = datetime.datetime.today()
date = datetime.datetime.strptime(date_string,
                                  "%b %d, %Y, %I %p%Z") # Corrected
(date - today).days
#340
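To turn that difference into the True/False answer the question asks for, something like the sketch below could work. It is only an illustration under stated assumptions: within_next_week is a hypothetical name, the last three characters of the string are assumed to be a zone abbreviation such as "EST" and are ignored, and the comparison is done on calendar dates only.

```python
from datetime import date, datetime, timedelta
from typing import Optional


def within_next_week(stamp: str, today: Optional[date] = None) -> bool:
    """Return True if the date in `stamp` is today or at most 7 days ahead."""
    # Assumption: the string ends in a 3-letter zone label ("EST"), which is dropped.
    parsed = datetime.strptime(stamp[:-3].strip(), "%b %d, %Y, %I %p")
    today = today or date.today()
    delta = parsed.date() - today
    return timedelta(0) <= delta <= timedelta(days=7)


print(within_next_week("Jun 02, 2021, 2 PMEST", today=date(2021, 5, 28)))
```

The today parameter is there so the check can be tried against a fixed date; leave it out to compare against the real current date.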

Date/Time formatting

I'm creating a python script that will display busy, no-answer and failed calls for a specific date but I'm stuck on the formatting of the date that's displayed. The start_time and end_time "variables" from Twilio print something like this: "Mon, 25 Jul 2016 16:03:53 +0000". I want to get rid of the day name and the comma since I'm saving the results into a csv file (script_name.py > some_file.csv) and the comma after the day name kind of screws up the csv structure.
In the settings.py file the time_zone variable is set to the right one (America/Chicago) and the USE_TZ variable is set to true. But anyway the output is still in UTC.
I don't know anything about Python and the things I've tried to parse call.start_time to a datetime have failed . . . I would know how to do it if it was a given value like start_time = '2016-07-26', but I don't know how to do it when the value comes from for call in client.calls.list . . .
Any guidance will be greatly appreciated!
Thanks!
from twilio.rest import TwilioRestClient
from datetime import datetime
from pytz import timezone
from dateutil import tz
# To find these visit https://www.twilio.com/user/account
account_sid = "**********************************"
auth_token = "**********************************"
client = TwilioRestClient(account_sid, auth_token)
for call in client.calls.list(
    start_time="2016-07-25",
    end_time="2016-07-25",
    status='failed',
):
    print(datetime.datetime.strptime(call.start_time, "%Y-%m-%d %H:%M:%S"))
The code I've provided does simple date and time format.
from datetime import datetime
from time import sleep
print('The Time is shown below!')
while True:
    time = str(datetime.now())
    time = list(time)
    for i in range(10):
        time.pop(len(time)-1)
    time = ('').join(time)
    time = time.split()
    date = time[0]
    time = time[1]
    print('Time: '+time+', Date: '+date, end='\r')
    sleep(1)
However, if you are just looking to format "Mon, 25 Jul 2016 16:03:53 +0000" as you said and remove the day, consider something like this:
day = "Mon, 25 Jul 2016 16:03:53 +0000"
# Convert to an array
day = list(day)
# Remove first 5 characters
for i in range(5):
    day.pop(0)
day = ('').join(day)
print(day)
# You can use if statements to determine which day it is to decide how many characters to remove.
>>> "25 Jul 2016 16:03:53 +0000"
The format you need to parse is dictated by the timestamp provided by Twilio. You will likely need the following format string to properly parse the timestamp (since your script uses from datetime import datetime, the call is datetime.strptime, not datetime.datetime.strptime):
print(datetime.strptime(call.start_time, "%a, %d %b %Y %H:%M:%S %z"))
A great guide for the formatting string is http://strftime.org/.
Another good library for lazily converting dates from strings is the python-dateutil library found at https://dateutil.readthedocs.io/.
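Putting the format string to work, here is a minimal sketch that parses the stamp, optionally converts it to another zone, and prints it without the day name or any comma. It assumes call.start_time arrives as a string like the one in the question; reformat_stamp and its tz parameter are hypothetical names for illustration.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def reformat_stamp(stamp: str, tz: tzinfo = timezone.utc) -> str:
    """Parse 'Mon, 25 Jul 2016 16:03:53 +0000' and return a comma-free string."""
    # %z consumes the "+0000" offset, so the parsed datetime is timezone-aware.
    aware = datetime.strptime(stamp, "%a, %d %b %Y %H:%M:%S %z")
    # Convert to the target zone, then format without day name or commas.
    return aware.astimezone(tz).strftime("%Y-%m-%d %H:%M:%S %z")


print(reformat_stamp("Mon, 25 Jul 2016 16:03:53 +0000"))
```

To get Chicago time on Python 3.9+, you could pass zoneinfo.ZoneInfo("America/Chicago") as tz; the fixed UTC default is just the simplest choice for the sketch.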

Convert Time to printable or localtime format

In my Python code I get the start and end times like this:
end = int(time.time())
start = end - 1800
Now the start and end variables hold values like 1460420758 and 1460422558.
I am trying to convert them to a meaningful format like:
Mon Apr 11 17:50:25 PDT 2016
But am unable to do so, I tried:
time.strftime("%a %b %d %H:%M:%S %Y", time.gmtime(start))
Gives me
Tue Apr 12 00:25:58 2016
But not only the timezone is missing, the H:M:S are wrong too,
as date returns the information below:
$ date
Mon Apr 11 18:06:27 PDT 2016
How to correct it?
This one involves utilizing datetime to create the format you wish with the strftime method.
What's important is that the time information you get 'MUST' be UTC in order to do this. Otherwise, you're doomed D:
I'm using timedelta to shift the hours (PDT is 7 hours behind UTC). It will also adjust the date where needed. I still would recommend using the module I shared above to handle time zones.
import time
# import datetime so you could play with time
import datetime
print(int(time.time()))
date = time.gmtime(1460420758)
# Transform time into datetime
new_date = datetime.datetime(*date[:6])
# PDT is UTC-7, so subtract 7 hours from the UTC value
new_date = new_date - datetime.timedelta(hours=7)
# Utilize datetime's strftime and manipulate it to what you want
print(new_date.strftime('%a %b %d %X PDT %Y'))
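For Python 3, a timezone-aware variant of the same idea might look like the sketch below. It is an illustration only: the fixed UTC-7 offset labelled "PDT" is an assumption standing in for a real zone database (it does not follow DST changes), and format_epoch is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-7 offset named "PDT" stands in for a real time zone.
PDT = timezone(timedelta(hours=-7), "PDT")


def format_epoch(seconds, tz=PDT):
    """Format epoch seconds in the layout of the `date` command, in zone `tz`."""
    # fromtimestamp with a tz argument converts from UTC-based epoch seconds.
    return datetime.fromtimestamp(seconds, tz).strftime("%a %b %d %H:%M:%S %Z %Y")


print(format_epoch(1460420758))  # Mon Apr 11 17:25:58 PDT 2016
```

If the machine's own zone is what you want, time.strftime("%a %b %d %H:%M:%S %Z %Y", time.localtime(start)) uses time.localtime instead of time.gmtime and needs no manual offset.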

Python - Sort Lines in File by Regex Date Match

I have a file (based on a class project) of scraped Tweets. At this point lines in the file look like:
#soandso something something Permalink 1:40 PM - 17 Feb 2016<br><br>
#soandso something something Permalink 1:32 PM - 16 Feb 2016<br><br>
I'm trying to sort the lines in the file by date. This is what I've cobbled together so far.
import re
from datetime import datetime
when = re.compile(r".+</a>(.+)<br><br>")
with open('tweets.txt','r+') as outfile:
    sortme = outfile.read()
    for match in re.finditer(when, sortme):
        tweet = match.group(0)
        when = match.group(1)
        when = datetime.strptime(when, " %I:%M %p - %d %b %Y")
        print(when)
Which will print out all the dates in the lines, having converted the format from 1:40 PM - 17 Feb 2016 to 2016-02-17 13:40:00, which I believe is a datetime. I have searched high and low over the last few days for clues about how I'd then sort all the lines in the file by datetime. Thanks for your help!
"I have searched high and low over the last few days for clues about how I'd then sort all the lines in the file by datetime."
def get_time(line):
    match = re.search(r"</a>\s*(.+?)\s*<br><br>", line)
    if match:
        return datetime.strptime(match.group(1), "%I:%M %p - %d %b %Y")
    return datetime.min
lines.sort(key=get_time)
It assumes that the time is monotonic in the given time period (e.g., no DST transitions); otherwise you should convert the input time to UTC (or a POSIX timestamp) first.
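A self-contained version of the same key-function approach is sketched below, using the two sample lines from the question. Note the swap: instead of anchoring on the HTML tags, this sketch matches the stamp itself with a hypothetical pattern (STAMP), and the names tweet_time and sort_tweets are made up for illustration.

```python
import re
from datetime import datetime

# Hypothetical pattern: matches stamps such as "1:40 PM - 17 Feb 2016".
STAMP = re.compile(r"(\d{1,2}:\d{2} [AP]M - \d{1,2} \w{3} \d{4})")


def tweet_time(line):
    """Return the datetime found in a line, or datetime.min if there is none."""
    m = STAMP.search(line)
    if m:
        return datetime.strptime(m.group(1), "%I:%M %p - %d %b %Y")
    return datetime.min


def sort_tweets(lines):
    """Return the lines ordered by their embedded timestamp, oldest first."""
    return sorted(lines, key=tweet_time)


sample = [
    "#soandso something something Permalink 1:40 PM - 17 Feb 2016<br><br>",
    "#soandso something something Permalink 1:32 PM - 16 Feb 2016<br><br>",
]
for line in sort_tweets(sample):
    print(line)
```

Lines without a recognizable stamp sort first because they fall back to datetime.min; whether that is the right place for them depends on your data.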
It seems you have already solved the regex problem, so to convert your datetime into a measurable quantity, convert it to seconds like so:
import time
time.mktime(when.timetuple())
Then for sorting you can take a lot of different routes. The simplest example is:
import operator
s = [("ab",50),("cd",100),("ef",15)]
print(sorted(s,key=operator.itemgetter(1)))
## [('ef', 15), ('ab', 50), ('cd', 100)]

Python date string mm/dd/yyyy to datetime

I have to write a program where I take stocks from Yahoo Finance and print out certain information from the site. One of the pieces of data is the date. I need to take a date such as 3/21/2012 and convert it to the following format: Mar 21, 2012.
Here is my code for the entire project.
def getStockData(company="GOOG"):
    baseurl ="http://quote.yahoo.com/d/quotes.csv?s={0}&f=sl1d1t1c1ohgvj1pp2owern&e=.csv"
    url = baseurl.format(company)
    conn = u.urlopen(url)
    content = conn.readlines()
    data = content[0].decode("utf-8")
    data = data.split(",")
    date = data[2][1:-1]
    date_new = datetime.strptime(date, "%m/%d/%Y").strftime("%B[0:3] %d, %Y")
    print("The last trade for",company, "was", data[1],"and the change was", data[4],"on", date_new)
company = input("What company would you like to look up?")
getStockData(company)
co = ["VOD.L", "AAPL", "YHOO", "S", "T"]
for company in co:
    getStockData(company)
You should really specify what about your code is not working (i.e., what output are you getting that you don't expect? What error message are you getting, if any?). However, I suspect your problem is with this part:
strftime('%B[0:3] %d, %Y')
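Python won't do what you think with that attempt to slice '%B': inside the format string, [0:3] is just literal text, so it is copied into the output after the full month name. You should instead use '%b', which, as noted in the documentation for strftime(), corresponds to the locale's abbreviated month name.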
EDIT
Here is a fully functional script based on what you posted above with my suggested modifications:
import urllib2 as u
from datetime import datetime
def getStockData(company="GOOG"):
    baseurl ="http://quote.yahoo.com/d/quotes.csv?s={0}&f=sl1d1t1c1ohgvj1pp2owern&e=.csv"
    url = baseurl.format(company)
    conn = u.urlopen(url)
    content = conn.readlines()
    data = content[0].decode("utf-8")
    data = data.split(",")
    date = data[2][1:-1]
    date_new = datetime.strptime(date, "%m/%d/%Y").strftime("%b %d, %Y")
    print("The last trade for",company, "was", data[1],"and the change was", data[4],"on", date_new)
for company in ["VOD.L", "AAPL", "YHOO", "S", "T"]:
    getStockData(company)
The output of this script is:
The last trade for VOD.L was 170.00 and the change was -1.05 on Mar 06, 2012
The last trade for AAPL was 530.26 and the change was -2.90 on Mar 06, 2012
The last trade for YHOO was 14.415 and the change was -0.205 on Mar 06, 2012
The last trade for S was 2.39 and the change was -0.04 on Mar 06, 2012
The last trade for T was 30.725 and the change was -0.265 on Mar 06, 2012
For what it's worth, I'm running this on Python 2.7.1. I also had the line from __future__ import print_function to make this compatible with the Python3 print function you appear to be using.
Check out Dateutil. You can use it to parse a string into python datetime object and then print that object using strftime.
I've since come to a conclusion that auto detection of datetime value is not always a good idea. It's much better to use strptime and specify what format you want.
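For reference, the explicit strptime/strftime route for just the date conversion can be sketched like this, without the network call. reformat_date is a hypothetical name, and the month abbreviation assumes the default (English) locale.

```python
from datetime import datetime


def reformat_date(text: str) -> str:
    """Convert 'm/d/YYYY' (e.g. '3/21/2012') to 'Mon DD, YYYY'."""
    # Parse with an explicit format rather than guessing the layout.
    parsed = datetime.strptime(text, "%m/%d/%Y")
    # %b is the abbreviated month name, so no slicing of %B is needed.
    return parsed.strftime("%b %d, %Y")


print(reformat_date("3/21/2012"))  # Mar 21, 2012
```

If you did want to slice, the slice has to be applied to the result, as in parsed.strftime("%B")[:3], not written inside the format string.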
