I have to write a program that takes stock data from Yahoo Finance and prints certain information from the site. One of the pieces of data is the date. I need to take a date such as 3/21/2012 and convert it to the following format: Mar 21, 2012.
Here is my code for the entire project.
def getStockData(company="GOOG"):
    baseurl ="http://quote.yahoo.com/d/quotes.csv?s={0}&f=sl1d1t1c1ohgvj1pp2owern&e=.csv"
    url = baseurl.format(company)
    conn = u.urlopen(url)
    content = conn.readlines()
    data = content[0].decode("utf-8")
    data = data.split(",")
    date = data[2][1:-1]
    date_new = datetime.strptime(date, "%m/%d/%Y").strftime("%B[0:3] %d, %Y")
    print("The last trade for",company, "was", data[1],"and the change was", data[4],"on", date_new)

company = input("What company would you like to look up?")
getStockData(company)

co = ["VOD.L", "AAPL", "YHOO", "S", "T"]
for company in co:
    getStockData(company)
You should really specify what about your code is not working (i.e., what output are you getting that you don't expect? What error message are you getting, if any?). However, I suspect your problem is with this part:
strftime('%B[0:3] %d, %Y')
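Python won't do what you think with that attempt to slice '%B': the [0:3] sits inside the format string, so it is copied to the output literally instead of slicing the month name. You should instead use '%b', which, as noted in the documentation for strftime(), corresponds to the locale's abbreviated month name.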
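A minimal sketch of the '%b' suggestion in isolation, with no network access. The function name reformat_date is only illustrative, and the "Mar" result assumes an English (or default C) locale:

```python
from datetime import datetime


def reformat_date(text):
    """Convert a date like '3/21/2012' to 'Mar 21, 2012'."""
    # %m and %d accept values with or without a leading zero when parsing.
    parsed = datetime.strptime(text, "%m/%d/%Y")
    # %b is the locale's abbreviated month name; no slicing is needed.
    return parsed.strftime("%b %d, %Y")


print(reformat_date("3/21/2012"))
```

Note that %d always pads the day to two digits, so 3/6/2012 comes out as "Mar 06, 2012".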
EDIT
Here is a fully functional script based on what you posted above with my suggested modifications:
# Needed on Python 2 so that print() behaves like the Python 3 function
from __future__ import print_function
import urllib2 as u
from datetime import datetime

def getStockData(company="GOOG"):
    baseurl ="http://quote.yahoo.com/d/quotes.csv?s={0}&f=sl1d1t1c1ohgvj1pp2owern&e=.csv"
    url = baseurl.format(company)
    conn = u.urlopen(url)
    content = conn.readlines()
    data = content[0].decode("utf-8")
    data = data.split(",")
    date = data[2][1:-1]
    date_new = datetime.strptime(date, "%m/%d/%Y").strftime("%b %d, %Y")
    print("The last trade for",company, "was", data[1],"and the change was", data[4],"on", date_new)

for company in ["VOD.L", "AAPL", "YHOO", "S", "T"]:
    getStockData(company)
The output of this script is:
The last trade for VOD.L was 170.00 and the change was -1.05 on Mar 06, 2012
The last trade for AAPL was 530.26 and the change was -2.90 on Mar 06, 2012
The last trade for YHOO was 14.415 and the change was -0.205 on Mar 06, 2012
The last trade for S was 2.39 and the change was -0.04 on Mar 06, 2012
The last trade for T was 30.725 and the change was -0.265 on Mar 06, 2012
For what it's worth, I'm running this on Python 2.7.1. I also had the line from __future__ import print_function to make this compatible with the Python3 print function you appear to be using.
Check out dateutil. You can use it to parse a string into a Python datetime object and then print that object using strftime.
I've since come to the conclusion that auto-detecting the datetime format is not always a good idea. It's much better to use strptime and specify the format you expect.
I'm scraping data from a news site and want to store the time and date these articles were posted. The good thing is that I can pull these timestamps right from the page of the articles.
When the articles I scrape were posted today, the output looks like this:
17:22 ET
02:41 ET
06:14 ET
When the articles were posted earlier than today, the output looks like this:
Mar 10, 2021, 16:05 ET
Mar 08, 2021, 08:00 ET
Feb 26, 2021, 11:23 ET
Current problem: I can't order my database by the time the articles were posted, because whenever I run the program, articles that were posted today are stored only with a time. Over multiple days, this will create a lot of articles with a stamp that looks as if they were posted on the day you look at the database - since there is only a time.
What I want: Add the current month/day/year in front of the time stamp on the basis of the already given format.
My idea: I have a hard time understanding how regex works. My idea would be to check the length of the imported string. If it is exactly 8, I want to add the month, day and year in front. But I don't know whether this is a) the most efficient approach and b) most importantly, how to code this seemingly easy idea.
I would gladly appreciate it if someone could help me code this. The current line which grabs the time looks like this:
article_time = item.select_one('h3 small').text
Try this out and others can correct me if I overlooked something,
from datetime import datetime, timedelta

def get_datetime_from_time(time):
    time, timezone = time.rsplit(' ', 1)
    if ',' in time:
        article_time = datetime.strptime(time, r"%b %d, %Y, %H:%M")
    else:
        article_time = datetime.strptime(time, r"%H:%M")
        hour, minute = article_time.hour, article_time.minute
        if timezone == 'ET':
            hours = -4
        else:
            hours = -5
        article_time = (datetime.utcnow() + timedelta(hours=hours)).replace(hour=hour, minute=minute) # Adjust for timezone
    return article_time

article_time = item.select_one('h3 small').text
article_time = get_datetime_from_time(article_time)
What I'm doing here is I'm checking if a comma is in your time string. If it is, then it's with date, else it's without. Then I'm checking for timezone since Daylight time is different than Standard time. So I have a statement to adjust timezone by 4 or 5. Then I'm getting the UTC time (regardless of your timezone) and adjust for timezone. strptime is a function that parses time depending on a format you give it.
Note that this does not take into account an empty time string.
Handling timezones properly can get fairly involved, since the standard library barely supports them (and recommends using the third-party pytz module to do so). This would be especially true if you need it
So, one "quick and dirty" way to deal with them would be to just ignore that information and add the current day, month, and year to any timestamps encountered that don't include them. The code below demonstrates how to do that.
from datetime import datetime

scrapped = '''
17:22 ET
02:41 ET
06:14 ET
Mar 10, 2021, 16:05 ET
Mar 08, 2021, 08:00 ET
Feb 26, 2021, 11:23 ET
'''

def get_datetime(string):
    string = string[:-3] # Remove timezone.
    try:
        r = datetime.strptime(string, "%b %d, %Y, %H:%M")
    except ValueError:
        try:
            today = datetime.today()
            daytime = datetime.strptime(string, "%H:%M")
            r = today.replace(hour=daytime.hour, minute=daytime.minute, second=0, microsecond=0)
        except ValueError:
            r = None
    return r

for line in scrapped.splitlines():
    if line:
        r = get_datetime(line)
        print(f'{line=}, {r=}')
"I can't order my database" - to be able to do so, you'll either have to convert the strings to datetime objects or to an ordered format (low to high resolution, so year-month-day- etc.) which would allow you to sort strings correctly.
"I have a hard time to understand how regex works" - while you can use regular expressions here to somehow parse and modify the strings you have, you don't need to.
#1 If you want a convenient option that leaves you with datetime objects, here's one using dateutil:
# "import dateutil" alone does not load the submodules, so import them explicitly
import dateutil.parser
import dateutil.tz

times = ["17:22 ET", "02:41 ET", "06:14 ET",
         "Mar 10, 2021, 16:05 ET", "Mar 08, 2021, 08:00 ET", "Feb 26, 2021, 11:23 ET"]
tzmapping = {'ET': dateutil.tz.gettz('US/Eastern')}
for t in times:
    print(f"{t:>22} -> {dateutil.parser.parse(t, tzinfos=tzmapping)}")

              17:22 ET -> 2021-03-13 17:22:00-05:00
              02:41 ET -> 2021-03-13 02:41:00-05:00
              06:14 ET -> 2021-03-13 06:14:00-05:00
Mar 10, 2021, 16:05 ET -> 2021-03-10 16:05:00-05:00
Mar 08, 2021, 08:00 ET -> 2021-03-08 08:00:00-05:00
Feb 26, 2021, 11:23 ET -> 2021-02-26 11:23:00-05:00
Note that you can easily tell dateutil's parser to use a certain time zone (e.g. to convert 'ET' to US/Eastern) and it also automatically adds today's date if the date is not present in the input.
#2 If you want to do more of the parsing yourself (probably more efficient), you can do so by extracting the time zone first, then parsing the rest and adding a date where needed:
from datetime import datetime
from zoneinfo import ZoneInfo # Python < 3.9: you can use backports.zoneinfo

# add more if you not only have ET...
tzmapping = {'ET': ZoneInfo('US/Eastern')}

# get tuples of the input string with tz stripped off and timezone object
times_zones = [(t[:t.rfind(' ')], tzmapping[t.split(' ')[-1]]) for t in times]

# parse to datetime
dt = []
for t, z in times_zones:
    if len(t)>5: # time and date...
        dt.append(datetime.strptime(t, '%b %d, %Y, %H:%M').replace(tzinfo=z))
    else: # time only...
        dt.append(datetime.combine(datetime.now(z).date(),
                                   datetime.strptime(t, '%H:%M').time()).replace(tzinfo=z))

for t, dtobj in zip(times, dt):
    print(f"{t:>22} -> {dtobj}")
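To tie this back to the ordering problem: once a timestamp has a full date, an ISO 8601 string (year first, then month, day and time) sorts chronologically even as plain text. Here is a small sketch using only the standard library and ignoring the time zone; the helper name to_sortable is made up for illustration:

```python
from datetime import datetime


def to_sortable(stamp):
    """Turn 'Mar 10, 2021, 16:05' into an ISO 8601 string such as '2021-03-10T16:05:00'."""
    return datetime.strptime(stamp, "%b %d, %Y, %H:%M").isoformat()


stamps = ["Mar 10, 2021, 16:05", "Feb 26, 2021, 11:23", "Mar 08, 2021, 08:00"]

# Plain string sorting now matches chronological order.
ordered = sorted(to_sortable(s) for s in stamps)
for item in ordered:
    print(item)
```

Storing either the datetime objects themselves or strings in this format lets the database order the rows correctly.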
I've created a script in Python to parse two fields from a webpage: total revenue and its corresponding date. The fields I'm after are generated by JavaScript. They are available in the page source within a JSON array. The following script can parse those two fields accordingly.
However, the problem is the date visible in that page is different from the one available in page source.
Webpage link
The date in that webpage is like this
The date in page source is like this
There is clearly a variation of one day.
After visiting that webpage when you click on this tab Quarterly you can see the results there:
I've tried with:
import re
import json
import requests
url = 'https://finance.yahoo.com/quote/GTX/financials?p=GTX'
res = requests.get(url)
data = re.findall(r'root.App.main[^{]+(.*);',res.text)[0]
jsoncontent = json.loads(data)
container = jsoncontent['context']['dispatcher']['stores']['QuoteSummaryStore']['incomeStatementHistoryQuarterly']['incomeStatementHistory']
total_revenue = container[0]['totalRevenue']['raw']
concerning_date = container[0]['endDate']['fmt']
print(total_revenue,concerning_date)
Result I get (revenue in million):
802000000 2019-06-30
Result I wish to get:
802000000 2019-06-29
When I try with the ticker AAPL, I get the exact date, so subtracting or adding a day is not an option.
How can I get the exact date from that site?
Btw, I know how to get them using selenium, so I would only like to stick to requests.
As mentioned in the comments, you need to convert the date to the appropriate timezone (EST), which can be done with datetime and dateutil.
Here is a working example:
import re
import json
import requests
from datetime import datetime, timezone
from dateutil import tz
url = 'https://finance.yahoo.com/quote/GTX/financials?p=GTX'
res = requests.get(url)
data = re.findall(r'root.App.main[^{]+(.*);',res.text)[0]
jsoncontent = json.loads(data)
container = jsoncontent['context']['dispatcher']['stores']['QuoteSummaryStore']['incomeStatementHistoryQuarterly']['incomeStatementHistory']
total_revenue = container[0]['totalRevenue']['raw']
EST = tz.gettz('EST')
raw_date = datetime.fromtimestamp(container[0]['endDate']['raw'], tz=EST)
concerning_date = raw_date.date().strftime('%d-%m-%Y')
print(total_revenue, concerning_date)
The updated section of this answer outlines the root cause of the date differences.
ORIGINAL ANSWER
Some of the raw values in your JSON are UNIX timestamps.
Reference from your code with modifications:
concerning_date_fmt = container[0]['endDate']['fmt']
concerning_date_raw = container[0]['endDate']['raw']
print(f'{concerning_date_fmt} -- {concerning_date_raw}')
# output
2019-07-28 -- 1564272000
'endDate': {'fmt': '2019-07-28', 'raw': 1564272000}
1564272000 is the number of seconds elapsed since January 01 1970. This date was the start of the Unix Epoch and the time is in Coordinated Universal Time (UTC). 1564272000 is equivalent to 07/28/2019 12:00am (UTC).
You can convert these timestamps to a standard datetime format by using built-in Python functions
from datetime import datetime
unix_timestamp = int('1564272000')
converted_timestamp = datetime.utcfromtimestamp(unix_timestamp).strftime('%Y-%m-%dT%H:%M:%SZ')
print (converted_timestamp)
# output Coordinated Universal Time (or UTC)
2019-07-28T00:00:00Z
reformatted_timestamp = datetime.strptime(converted_timestamp, '%Y-%m-%dT%H:%M:%SZ').strftime('%d-%m-%Y')
print (reformatted_timestamp)
# output
28-07-2019
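As a side note, the same conversion can be written so that the result carries its UTC offset explicitly, which makes later time zone conversions less error-prone. This is only a sketch; the function name utc_datetime is illustrative:

```python
from datetime import datetime, timezone


def utc_datetime(unix_timestamp):
    """Return a timezone-aware datetime in UTC for a Unix timestamp."""
    return datetime.fromtimestamp(unix_timestamp, tz=timezone.utc)


dt = utc_datetime(1564272000)
print(dt.isoformat())           # ISO 8601 with an explicit +00:00 offset
print(dt.strftime("%d-%m-%Y"))  # same day-month-year layout as above
```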
This still does not solve your original problem related to JSON dates and column dates being different at times. But here is my current hypothesis related to the date disparities that are occurring.
The json date (fmt and raw) that are being extracted from root.App.main are in Coordinated Universal Time (UTC). This is clear because of the UNIX timestamp in raw.
The dates being displayed in the table columns seem to be in the US Eastern time zone. Eastern Daylight Time (EDT), which is currently in effect, is UTC-4, which means that 2019-07-28 22:00 (10pm) EDT would be 2019-07-29 02:00 (2am) UTC. The server hosting finance.yahoo.com looks to be in the United States, based on the traceroute results. These values are also in the JSON file:
'exchangeTimezoneName': 'America/New_York'
'exchangeTimezoneShortName': 'EDT'
There is also the possibility that some of the date differences are linked to the underlying React code, which the site uses. This issue is harder to diagnose, because the code isn't visible.
At this time I believe that the best solution would be to use the UNIX timestamp as your ground truth time reference. This reference could be used to replace the table column's date.
There is definitely some type of conversion happening between the JSON file and the columns.
NVIDIA JSON FILE: 'endDate': {'raw': 1561766400, 'fmt': '2019-06-29'}
NVIDIA Associated Total Revenue column: 6/30/2019
BUT the Total Revenue column date should be 6/28/2019 (EDT), because the UNIX time stamp for 1561766400 is 06/29/2019 12:00am (UTC).
The disparity with DELL is greater than a basic UNIX timestamp and a EDT timestamp conversion.
DELL JSON FILE:{"raw":1564704000,"fmt":"2019-08-02"}
DELL Associated Total Revenue column: 7/31/2019
If we convert the UNIX timestamp to an EDT timestamp, the result would be 8/1/2019, but that is not the case in the DELL example, which is 7/31/2019. Something within the Yahoo code base has to be causing this difference.
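The two conversions above can be checked with a short sketch. It assumes a fixed UTC-4 offset for EDT rather than a full time zone database, and the name to_edt_date is illustrative:

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-4 offset stands in for Eastern Daylight Time.
EDT = timezone(timedelta(hours=-4), "EDT")


def to_edt_date(unix_timestamp):
    """Return the calendar date of a Unix timestamp as seen in EDT."""
    return datetime.fromtimestamp(unix_timestamp, tz=EDT).date()


print(to_edt_date(1561766400))  # NVIDIA raw endDate
print(to_edt_date(1564704000))  # DELL raw endDate
```

Both timestamps are midnight UTC, so in EDT they fall on the previous evening: 2019-06-28 and 2019-08-01. Neither matches the column dates shown on the site.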
I'm starting to believe that React might be the culprit with these date differences, but I cannot be sure without doing more research.
If React is the root cause then the best option would be to use the date elements from the JSON data.
UPDATED ANSWER 10-17-2019
This problem is very interesting, because it seems that these column dates are linked to a company's official end of fiscal quarter and are not a date conversion issue.
Here are several examples for
Apple Inc. (AAPL)
Atlassian Corporation Plc (TEAM)
Arrowhead Pharmaceuticals, Inc. (ARWR):
Their column dates are:
6/30/2019
3/31/2019
12/31/2018
9/30/2018
These dates match to these fiscal quarters.
Quarter 1 (Q1): January 1 - March 31.
Quarter 2 (Q2): April 1 - June 30.
Quarter 3 (Q3): July 1 - September 30.
Quarter 4 (Q4): October 1 - December 31
These fiscal quarter end dates can vary greatly as this DELL example shows.
DELL (posted in NASDAQ)
End of fiscal quarter: July 2019
Yahoo Finance
Column date: 7/31/2019
JSON date: 2019-08-02
From the company's website:
When does Dell Technologies’ fiscal year end?
Our fiscal year is the 52- or 53-week period ending on the Friday nearest January 31. Our 2020 fiscal year will end on January 31, 2020. For prior fiscal years, see list below: Our 2019 fiscal year ended on February 1, 2019 Our 2018 fiscal year ended on February 2, 2018 Our 2017 fiscal year ended on February 3, 2017 Our 2016 fiscal year ended on January 29, 2016 Our 2015 fiscal year ended on January 30, 2015 Our 2014 fiscal year ended on January 31, 2014 Our 2013 fiscal year ended on February 1, 2013
NOTE: The 05-03-19 and 08-02-19 dates.
These are from the JSON quarter data for DELL:
{'raw': 1564704000, 'fmt': '2019-08-02'}
{'raw': 1556841600, 'fmt': '2019-05-03'}
It seems that these column dates are linked to a company's fiscal quarter end dates. So I would recommend that you use either the JSON date or the corresponding column date as your primary reference element.
P.S. There is some type of date voodoo occurring at Yahoo, because they seem to move these column quarter dates based on holidays, weekends and end of month.
Instead of getting the fmt of the concerning_date, it's better to get the timestamp.
concerning_date = container[0]['endDate']['raw']
In the example above you will get the result 1561852800, which you can convert into a date in a given timezone (hint: use datetime and pytz). This timestamp yields the following results depending on the timezone:
Date in Los Angeles: 29/06/2019, 17:00:00
Date in Berlin: 30/06/2019, 02:00:00
Date in Beijing: 30/06/2019, 08:00:00
Date in New York: 29/06/2019, 20:00:00
I am using Python 2.7.
I have an Adobe PDF form doc that has a date field. I extract the values using pdfminer. The problem I need to solve is that the user in Adobe Acrobat Reader is allowed to type in strings like april 3rd 2017 or 3rd April 2017 or Apr 3rd 2017 or 04/04/2017 as well as 4 3 2017. The date field in Adobe is set to mm/dd/yyyy format, so when a user types in one of the values above, Adobe will display it as 04/03/2017, but the typed string is the actual value that pdfminer pulls, and when you click on the field it shows you that actual value. Adobe allows this and then, I think, does its own conversion to display the date as mm/dd/yyyy. It is possible to use JavaScript with Adobe for more control, but I can't do that: the users can only have and use the PDF form without any accompanying JavaScript file.
So I was looking for a method with datetime in Python that would be able to accept a written date such as the examples above from a string and then convert it into a true mm/dd/yyyy format. I saw methods for converting long and short month names, but nothing that would handle ordinal day numbers like 1st, 2nd, 3rd, 4th.
You could just try each possible format in turn. First remove any st, nd, rd or th suffix that follows a digit to make the testing easier:
import re
from datetime import datetime

formats = ["%B %d %Y", "%d %B %Y", "%b %d %Y", "%m/%d/%Y", "%m %d %Y"]
dates = ["april 3rd 2017", "3rd April 2017", "Apr 3rd 2017", "04/04/2017", "4 3 2017"]

for date in dates:
    # Strip ordinal suffixes only after a digit, so month names such as "august" stay intact
    date = re.sub(r"(\d)(st|nd|rd|th)", r"\1", date.lower())
    for format in formats:
        try:
            print(datetime.strptime(date, format).strftime("%m/%d/%Y"))
            break  # stop at the first format that matches
        except ValueError:
            pass
Which would display:
04/03/2017
04/03/2017
04/03/2017
04/04/2017
04/03/2017
This approach has the benefit of validating each date. For example a month greater than 12. You could flag any dates that failed all allowed formats.
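The flagging idea could be sketched as a function that returns None when no format matches. This version is written for Python 3, and the names normalize_date and FORMATS are illustrative rather than part of any library:

```python
import re
from datetime import datetime

FORMATS = ["%B %d %Y", "%d %B %Y", "%b %d %Y", "%m/%d/%Y", "%m %d %Y"]


def normalize_date(text):
    """Return the date as mm/dd/yyyy, or None if no known format matches."""
    # Drop ordinal suffixes (1st, 2nd, 3rd, 4th) only when they follow a digit.
    cleaned = re.sub(r"(\d)(st|nd|rd|th)\b", r"\1", text.strip(), flags=re.IGNORECASE)
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue  # try the next format
    return None


for raw in ["april 3rd 2017", "3rd April 2017", "August 1st 2017", "13/45/2017"]:
    result = normalize_date(raw)
    print(raw, "->", result if result else "INVALID")
```

Returning None instead of printing inside the loop leaves it to the caller to decide how to report bad input.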
Just write a regular expression to get the number out of the string.
import re
s = '30Apr'
n = s[:re.match(r'[0-9]+', s).span()[1]]
print(n) # Will print 30
The other things should be easy.
Based on @MartinEvans's answer, but using the arrow library (because it handles more cases than datetime, so you don't have to use replace() or lower()):
First install arrow:
pip install arrow
Then try each possible format:
import arrow

dates = ['april 3rd 2017', '3rd April 2017', 'Apr 3rd 2017', '04/04/2017', '4 3 2017']
formats = ['MMMM Do YYYY', 'Do MMMM YYYY', 'MMM Do YYYY', 'MM/DD/YYYY', 'M D YYYY']

def convert_datetime(date):
    for format in formats:
        try:
            print(arrow.get(date, format).format('MM/DD/YYYY'))
        except arrow.parser.ParserError:
            pass

for date in dates:
    convert_datetime(date)
Will output:
04/03/2017
04/03/2017
04/03/2017
04/04/2017
04/03/2017
If you are unsure of what could be wrong in your date format, you can also output a nice error message if none of the date matches the format:
def convert_datetime(date):
    last_error = None
    for format in formats:
        try:
            print(arrow.get(date, format).format('MM/DD/YYYY'))
            break
        except (arrow.parser.ParserError, ValueError) as e:
            last_error = e  # keep the error; Python 3 clears `e` after the except block
    else:
        print('For date: "{0}", {1}'.format(date, last_error))

convert_datetime('124 5 2017') # test invalid date
Will output the following error message:
'For date: "124 5 2017", month must be in 1..12'
I'm creating a python script that will display busy, no-answer and failed calls for a specific date but I'm stuck on the formatting of the date that's displayed. The start_time and end_time "variables" from Twilio print something like this: "Mon, 25 Jul 2016 16:03:53 +0000". I want to get rid of the day name and the comma since I'm saving the results into a csv file (script_name.py > some_file.csv) and the comma after the day name kind of screws up the csv structure.
In the settings.py file the time_zone variable is set to the right one (America/Chicago) and the USE_TZ variable is set to true. But anyway the output is still in UTC.
I don't know anything about Python and the things I've tried to parse call.start_time to a datetime have failed . . . I would know how to do it if it was a given value like start_time = '2016-07-26', but I don't know how to do it when the value comes from for call in client.calls.list . . .
Any guidance will be greatly appreciated!
Thanks!
from twilio.rest import TwilioRestClient
from datetime import datetime
from pytz import timezone
from dateutil import tz

# To find these visit https://www.twilio.com/user/account
account_sid = "**********************************"
auth_token = "**********************************"
client = TwilioRestClient(account_sid, auth_token)

for call in client.calls.list(
    start_time="2016-07-25",
    end_time="2016-07-25",
    status='failed',
    ):
    print(datetime.datetime.strptime(call.start_time, "%Y-%m-%d %H:%M:%S"))
The code below does simple date and time formatting.
from datetime import datetime
from time import sleep

print('The Time is shown below!')
while True:
    time = str(datetime.now())
    time = list(time)
    # Drop the last 10 characters (seconds and microseconds)
    for i in range(10):
        time.pop(len(time)-1)
    time = ('').join(time)
    time = time.split()
    date = time[0]
    time = time[1]
    print('Time: '+time+', Date: '+date, end='\r')
    sleep(1)
However, if you are just looking to format "Mon, 25 Jul 2016 16:03:53 +0000" as you said and remove the day name, consider something like this:
day = "Mon, 25 Jul 2016 16:03:53 +0000"
# Convert to a list of characters
day = list(day)
# Remove first 5 characters
for i in range(5):
    day.pop(0)
day = ('').join(day)
print(day)
# You can use if statements to determine which day it is to decide how many characters to remove.
>>> "25 Jul 2016 16:03:53 +0000"
The format you need to parse is dictated by the timestamp provided by Twilio. You will likely need the following format string to properly parse the timestamp (with from datetime import datetime, as in your script, the call is datetime.strptime rather than datetime.datetime.strptime):
print(datetime.strptime(call.start_time, "%a, %d %b %Y %H:%M:%S %z"))
A great guide for the formatting string is http://strftime.org/.
Another good library for lazily converting dates from strings is the python-dateutil library found at https://dateutil.readthedocs.io/.
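Putting the pieces together, here is a rough Python 3 sketch that parses the Twilio-style string, shifts it to another zone and prints it without the day name or comma. The function name reformat_twilio_time is illustrative, and the fixed UTC-5 offset is an assumption standing in for America/Chicago in July; on Python 3.9+ a zoneinfo.ZoneInfo("America/Chicago") object could be passed as tz instead:

```python
from datetime import datetime, timedelta, timezone


def reformat_twilio_time(stamp, tz=timezone.utc):
    """Parse 'Mon, 25 Jul 2016 16:03:53 +0000' and format it in tz without commas."""
    parsed = datetime.strptime(stamp, "%a, %d %b %Y %H:%M:%S %z")
    return parsed.astimezone(tz).strftime("%Y-%m-%d %H:%M:%S %z")


# Assumption: fixed UTC-5 offset (Central Daylight Time) for illustration only.
cdt = timezone(timedelta(hours=-5))
print(reformat_twilio_time("Mon, 25 Jul 2016 16:03:53 +0000", cdt))
```

The resulting string contains no commas, so it can be written to a CSV column as is.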
I'm using Python 3.3. I'm getting an email from an IMAP server, then converting it to an instance of an email from the standard email library.
I do this:
message.get("date")
Which gives me this for example:
Wed, 23 Jan 2011 12:03:11 -0700
I want to convert this to something I can put into time.strftime() so I can format it nicely. I want the result in local time, not UTC.
There are so many functions, deprecated approaches and side cases, not sure what is the modern route to take?
Something like this?
>>> import time
>>> s = "Wed, 23 Jan 2011 12:03:11 -0700"
>>> newtime = time.strptime(s, '%a, %d %b %Y %H:%M:%S -0700')
>>> print(time.strftime('Two years ago was %Y', newtime))
Two years ago was 2011 # Or whatever output you wish to receive.
I use python-dateutil for parsing datetime strings. The parse function from this library is very handy for this kind of task.
Do this:
import email, email.utils, datetime, time

def dtFormat(s):
    dt = email.utils.parsedate_tz(s)
    dt = email.utils.mktime_tz(dt)
    dt = datetime.datetime.fromtimestamp(dt)
    dt = dt.timetuple()
    return dt
then this:
s = message.get("date") # e.g. "Wed, 23 Jan 2011 12:03:11 -0700"
print(time.strftime("%Y-%m-%d-%H-%M-%S", dtFormat(s)))
gives this:
2011-01-23-21-03-11
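On Python 3.3 and later, a shorter route might be email.utils.parsedate_to_datetime, which returns a timezone-aware datetime directly. The sketch below is one possible variant, not the only way; the wrapper name parse_email_date is illustrative, and the local-time output depends on the machine's time zone:

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def parse_email_date(header):
    """Parse an RFC 2822 date header into a timezone-aware datetime."""
    return parsedate_to_datetime(header)


dt = parse_email_date("Wed, 23 Jan 2011 12:03:11 -0700")

# astimezone() with no argument converts to the system's local time zone.
local = dt.astimezone()
print(local.strftime("%Y-%m-%d %H:%M:%S"))

# The same instant expressed in UTC, which is independent of the machine.
print(dt.astimezone(timezone.utc).isoformat())
```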