Okay, so I am relatively new to programming and this has me absolutely stumped. I'm scraping data from a website and the data changes every week. I want to run my scraping process once for each weekly update, starting back on 09-09-2015 and running to the present.
I know how to do this easily by running through every date, like 0909, then 0910, then 0911, but that is not what I need, as it would send far too many pointless requests to the server.
Here is the format of the URL
http://www.myexamplesite.com/?date=09092015
I know the simple:
for i in range(startDate, endDate):
    url = 'http://www.myexamplesite.com/?date={}'.format(i)
    driver.get(url)
But one thing I've never been able to figure out is how to manipulate Python's datetime to accurately reflect the format the website uses.
i.e:
09092015
09162015
09232015
09302015
10072015
...
09272017
If all else fails, I only need to do this once, so it wouldn't take too long to skip the loop altogether, manually enter each date I wish to scrape, and then append all of my dataframes together. I'm mainly curious about how to manipulate datetime in this way for future projects that may require more data.
A good place to start is the docs for the datetime, date and timedelta objects.
First, let's construct our starting date and ending date (today):
>>> from datetime import date, timedelta
>>> start = date(2015, 9, 9)
>>> end = date.today()
>>> start, end
(datetime.date(2015, 9, 9), datetime.date(2017, 9, 27))
Now let's define the unit of increment -- one day:
>>> day = timedelta(days=1)
>>> day
datetime.timedelta(1)
A nice thing about dates (date/datetime) and time deltas (timedelta) is that they can be added:
>>> start + day
datetime.date(2015, 9, 10)
We can also use format() to get that date in the form the site expects (month, day, year):
>>> "{date.month:02}{date.day:02}{date.year}".format(date=start+day)
'09102015'
So, when we put all this together:
from datetime import date, timedelta
start = date(2015, 9, 9)
end = date.today()
week = timedelta(days=7)
mydate = start
# <= so that a run on an update day still includes today's date
while mydate <= end:
    print("{date.month:02}{date.day:02}{date.year}".format(date=mydate))
    mydate += week
we get a simple iteration over dates starting with 2015-09-09 and ending with today, incremented by 7 days (a week):
09092015
09162015
09232015
09302015
10072015
...
Take a look here
https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior
You can see the table pictured here for formatting dates and times and the usage.
Of course, if the format of the dates changes in the future or you are parsing different strings, you will have to make code changes. There really is no way around that.
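For the weekly scrape in the question, the pieces above can be combined into a small generator. This is only a sketch: the helper names weekly_dates and build_url are made up for illustration, the URL is the example one from the question, and it assumes the site wants month, day, year (%m%d%Y).

```python
from datetime import date, timedelta

BASE_URL = 'http://www.myexamplesite.com/?date={}'  # example URL from the question

def weekly_dates(start, end, step_days=7):
    """Yield dates from start to end (inclusive), step_days apart."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=step_days)

def build_url(d):
    """Format a date as MMDDYYYY and insert it into the URL."""
    return BASE_URL.format(d.strftime('%m%d%Y'))

# Short range here just to keep the output small.
urls = [build_url(d) for d in weekly_dates(date(2015, 9, 9), date(2015, 10, 7))]
for url in urls:
    print(url)
```

In the scraper, each url would be passed to driver.get(url) in place of the print call.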
For an NLP project in Python I need to generate random dates for model training purposes. In particular, the date format must be random and coherent with a set of language locales. The formats include those with only numbers as well as formats with (partially) written-out day and month names, and various common punctuation.
My best solution so far is the following algorithm:
1) generate a datetime() object with random values (nice solution here)
2) randomly select a locale, i.e. pick one of ['en_US','fr_FR','it_IT','de_DE'], where in this case the list is well known and short, so not a problem.
3) randomly select a format string for strftime(), i.e. ['%Y-%m-%d','%d %B %Y',...]. In my case the list should reflect the date formats likely to occur in the documents that will be exposed to the NLP model in the future.
4) generate a string with strftime()
Especially for 3) I do not know a better way than to hardcode the list of what I saw manually within the training documents. I have not yet found a function that would turn OCR'd dates into a format string, so that I could extend the list when yet-unseen date formats come by.
Do you have any suggestions on how to come up with better randomly formatted dates, or how to improve this approach?
Use random.randrange() and datetime.timedelta() to generate a random date between two dates
Call datetime.date(year, month, day) to get a date object for the given year, month, and day. Call this twice to define the start and end date. Subtract the start date from the end date to get a datetime.timedelta holding the time between the two dates. Read its days attribute to get the number of days. Call random.randrange(days) to get a random integer n that is less than that number of days. Call datetime.timedelta(days=n) to get a datetime.timedelta representing n days. Add this result to the start date.
import datetime
import random
start_date = datetime.date(2020, 1, 1)
end_date = datetime.date(2020, 2, 1)
time_between_dates = end_date - start_date
days_between_dates = time_between_dates.days
random_number_of_days = random.randrange(days_between_dates)
random_date = start_date + datetime.timedelta(days=random_number_of_days)
print(random_date)
Here is my solution. Concerning the locales, all of them need to be installed on your computer to avoid an error.
import random
from datetime import datetime, timedelta
import locale
LOCALE = ['en_US','fr_FR','it_IT','de_DE'] # all need to be available on your computer to avoid error
DATE_FORMAT = ['%Y-%m-%d','%d %B %Y']
def gen_datetime(min_year=1900, max_year=datetime.now().year):
    # generate a datetime
    start = datetime(min_year, 1, 1)
    years = max_year - min_year + 1
    end = start + timedelta(days=365 * years)
    format_date = random.choice(DATE_FORMAT)
    locale_date = random.choice(LOCALE)
    locale.setlocale(locale.LC_ALL, locale_date) # raises an error if the locale is not available on your computer
    return (start + (end - start) * random.random()).strftime(format_date)
date = gen_datetime()
print(date)
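One detail in the code above: 365 * years ignores leap days, so the end of the range drifts a little. A possible alternative, sketched here under my own assumptions (the names random_date and FORMATS are illustrative, the format list is just an example, and locales are left out entirely), is to draw a random ordinal between the two dates:

```python
import random
from datetime import date

def random_date(start, end, rng=random):
    """Return a date drawn uniformly from start..end, both inclusive."""
    return date.fromordinal(rng.randint(start.toordinal(), end.toordinal()))

FORMATS = ['%Y-%m-%d', '%d.%m.%Y', '%m/%d/%Y']  # example numeric formats only

rng = random.Random(42)  # seeded so a training run can be reproduced
d = random_date(date(1900, 1, 1), date(2020, 12, 31), rng)
print(d.strftime(rng.choice(FORMATS)))
```

Passing a seeded random.Random instance keeps the generated training data repeatable; leaving rng at its default uses the module-level generator.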
So I am a beginner with Python and have been working with the datetime, time, and timedelta libraries a little bit. I am trying to create a piece of code that gives me the date approximately two months ago (exact_two_months_date) from today (whatever today happens to be). The catch is, I want to find that date approx. two months ago AND begin the actual start_date on the Monday of that week. So in theory, the actual start date will not be exactly two months ago. It will be the Monday of the week that falls two months before today.
Example pseudocode:
today = '20150425' ## '%Y%m%d' ... Saturday
exact_two_months_date = '20150225' ## EXACTLY two months ago ... Wednesday
start_date = '20150223' ## this is the Monday of that week two months ago
So how do I find the 'start_date' above? If the day exactly two months ago begins on a Saturday or Sunday, then I would just want to go to the next Monday. Hope this is clear and makes sense... Once I find the start date, I would like to increment day by day(only week days) up to 'today'.
Appreciate any feedback, thanks.
Calculating with dates using python-dateutil
If a dependency on a third-party package is an option, then ☞ python-dateutil provides a convenient method to calculate with dates.
Browse the docs for ☞ relativedelta to see the wealth of supported parameters. The more calculations a package needs to do with dates, the more a helper module like dateutil justifies its dependency. For more inspiration on what it has to offer see the ☞ examples page.
Quick run-through:
>>> import datetime
>>> from dateutil.relativedelta import relativedelta
>>> today = datetime.date.today()
>>> two_m_ago = today - relativedelta(months=2)
>>> # print two_m_ago ~> datetime.date(2015, 2, 25)
>>> monday = two_m_ago - datetime.timedelta(days=two_m_ago.weekday())
>>> # print monday ~> datetime.date(2015, 2, 23)
Getting the Monday with weekday()
Once we have the date from two months ago in the variable two_m_ago, we subtract the index of the weekday() from it. This index is 0 for Monday and goes all the way to 6 for Sunday. If two_m_ago already is a Monday, then subtracting by 0 will not cause any changes.
Does something like this work for you?
import datetime
today = datetime.date.today()
delta = datetime.timedelta(days=60) # ~ 2 months
thatDay = today - delta
# subtract weekdays to get monday
thatMonday = thatDay - datetime.timedelta(days=thatDay.weekday())
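The question also asks for two things the snippets above do not cover: jumping forward to the next Monday when the date lands on a weekend, and then stepping through weekdays only. A rough sketch of both, with function names (start_monday, weekdays_between) that I made up for illustration:

```python
from datetime import date, timedelta

def start_monday(d):
    """Monday of d's week; for a Saturday or Sunday, the following Monday."""
    if d.weekday() >= 5:                        # 5 = Saturday, 6 = Sunday
        return d + timedelta(days=7 - d.weekday())
    return d - timedelta(days=d.weekday())      # 0 = Monday

def weekdays_between(start, end):
    """Yield every Monday-to-Friday date from start to end, inclusive."""
    current = start
    while current <= end:
        if current.weekday() < 5:
            yield current
        current += timedelta(days=1)

start_date = start_monday(date(2015, 2, 25))    # the Wednesday from the question
for d in weekdays_between(start_date, date(2015, 3, 3)):
    print(d.strftime('%Y%m%d'))
```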
Honestly, I find working with datetimes to be the hardest thing I ever have to regularly do and I make a lot of mistakes, so I'm going to work through this one and show some of the failures I regularly have with it. Here goes.
Two constraints: 1) Date two months ago, 2) Monday of that week
Date Two Months Ago
Okay, so Python's datetime library has a useful method called replace, which seems like it might help here:
>>> import datetime
>>> today = datetime.date.today()
>>> today
datetime.date(2015, 4, 25)
>>> today.month
4
>>> two_months_ago = today.replace(month=today.month-2)
>>> two_months_ago
datetime.date(2015, 2, 25)
>>> two_months_ago.month
2
But wait: what about negative numbers? That won't work:
>>> older = datetime.date(2015, 1, 1)
>>> older.replace(month=older.month-2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: month must be in 1..12
So there are two solutions:
1) I can build a 1-12 range that cycles forwards or back, or
2) To find two months previous, I can merely replace the day part of my date with the 1st day of the month I'm in and then go back 1 day to the previous month and then replace that day in the previous month with the day I want.
(If you think about it, you'll find that either of these may present bugs if I land on day 31 in a month with fewer days than that, for instance. This is part of what makes datetimes difficult.)
def previous_month(date):
    current_day = date.day
    first_day = date.replace(day=1)
    last_day_prev_month = first_day - datetime.timedelta(days=1)
    prev_month_day = last_day_prev_month.replace(day=current_day)
    return prev_month_day
>>> today = datetime.date.today()
>>> older = previous_month(today)
>>> older
datetime.date(2015, 3, 25)
Well, let's say we're getting close, though, and we need to include some error-checking to make sure the day we want is a valid date inside the month we land in. Ultimately, the problem is that "two months ago" means a lot more than we think it means when we say it out loud.
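One way to add that error-checking with only the standard library is to do the month arithmetic on a running month index and clamp the day to the length of the target month. This is a sketch, not the only reasonable definition of "two months ago"; months_ago is a hypothetical helper name, and clamping (rather than raising or rolling over) is my assumption about the desired behavior.

```python
import calendar
from datetime import date

def months_ago(d, months):
    """Return the date `months` months before d, clamping the day if needed."""
    # Count months from year 0 so that going past January wraps the year.
    index = d.year * 12 + (d.month - 1) - months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]   # days in the target month
    return date(year, month, min(d.day, last_day))

print(months_ago(date(2015, 4, 25), 2))   # 2015-02-25
print(months_ago(date(2015, 1, 1), 2))    # 2014-11-01
print(months_ago(date(2015, 4, 30), 2))   # 2015-02-28 (clamped)
```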
Next, we'll take a crack at problem number two: How to get to the Monday of that week?
Well, datetime objects have a weekday method, so this part shouldn't be too hard and here's a nice SO answer on how to do that.
Simple version is: use the difference in weekday integers to figure out how many days to go back and do that using datetime.timedelta(days=days_difference).
Takeaway: Working with datetimes can be tough.
Date manipulation in Python is horribly convoluted. You will save a lot of time by using the arrow package which greatly simplifies these operations.
First install it
pip install arrow
Now your question:
import arrow
# get local current time
now = arrow.now('local')
# move 2 months back (shift() takes plural, relative units; older arrow releases accepted these in replace())
old = now.shift(months=-2)
# what day of the week was that? (Monday is 0, Sunday is 6)
dow = old.weekday()
# reset old to Monday, for instance at 9:32 in the morning (this is just an example, just to showcase)
old = old.shift(days=-dow).replace(hour=9, minute=32, second=0)
print('now is {now}, we went back to {old}'.format(now=now.isoformat(), old=old.isoformat()))
The output:
now is 2015-04-25T20:37:38.174000+02:00, we went back to 2015-02-23T09:32:00.174000+01:00
Note that the various formats, timezones etc. are now transparent and you just need to rely on one package.
Right now I am using the following functions to calculate a date and time int like this (ymd), (hms). I believe it is easier to do this for comparison.
def getDayAsInt():
    time = datetime.datetime.now()
    year = time.strftime("%Y")
    month = makeTimeTwoDigit(time.strftime("%m"))
    day = makeTimeTwoDigit(time.strftime("%d"))
    return year + month + day
def getTimeOfDay():
    time = datetime.datetime.now()
    hour = makeTimeTwoDigit(time.strftime("%H"))
    minute = makeTimeTwoDigit(time.strftime("%M"))
    second = makeTimeTwoDigit(time.strftime("%S"))
    return hour + minute + second
I initially tried something like this:
'date': str(datetime.now()),
However, I ran into trouble generating a date range to query against. For example, if today is 20140616, I can simply query dates between 20140601 and 20140616, whereas generating all of the possible date-time strings is harder. Does that make sense?
For example, I want to find events that happened today, but a full date-time string stored in DynamoDB is harder to match (there are more things to match on).
I'm wondering if there is an easier or more efficient way? Is breaking the date and time down like that done? Should I take this:
year = time.strftime("%Y")
month=makeTimeTwoDigit(time.strftime("%m"))
day=makeTimeTwoDigit(time.strftime("%d"))
And do it in one line? Like should I do time.strftime("%Y%m%d")?
If you are doing the comparisons in python, an easier solution would be to use builtin datetime objects and the normal comparison operators, like < and >.
from datetime import datetime
dt_object = datetime.strptime('Jun 1 2005 1:33PM', '%b %d %Y %I:%M%p')
if datetime(2006, 6, 5, 0, 0, 0) <= dt_object < datetime(2006, 6, 6, 0, 0, 0):
    # do something when date is anytime on June 5th, 2006
    pass
If you must do the comparison in the query, you can use regular string comparison as long as your dates are stored in ISO-8601 format. The advantage of ISO-8601 is that chronological sorting is equivalent to lexicographic sorting, i.e. you can treat them as normal strings.
The equivalent comparison using ISO-8601 format:
'2006-06-05T00:00:00Z' <= dt < '2006-06-06T00:00:00Z'
I think breaking the day (year/month/date) from the time (hour/minute/second) is the cleanest solution for you, since you want to query by day.
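On the one-line question: strftime already zero-pads %m, %d, %H, %M and %S, so a single call per key should be enough. A small sketch, where day_key and time_key are names I picked for illustration and nothing here is specific to DynamoDB:

```python
from datetime import datetime

def day_key(dt):
    """Sortable day string such as '20140616'."""
    return dt.strftime('%Y%m%d')

def time_key(dt):
    """Sortable time-of-day string such as '090503'."""
    return dt.strftime('%H%M%S')

event = datetime(2014, 6, 16, 9, 5, 3)
print(day_key(event), time_key(event))   # 20140616 090503

# Fixed-width, zero-padded keys compare correctly as plain strings.
in_range = '20140601' <= day_key(event) <= '20140616'
print(in_range)                          # True
```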
I was working on code to generate the time for an entire day with 30 second intervals. I tried using DT.datetime and DT.time but I always end up with either a datetime value or a timedelta value like (0,2970). Can someone please tell me how to do this.
So I need a list that has data like:
[00:00:00]
[00:00:01]
[00:00:02]
till [23:59:59] and needed to compare it against a datetime value like 6/23/2011 6:38:00 AM.
Thanks!
Is there a reason you want to use datetimes instead of just 3 for loops counting up? Similarly, do you want to do something fancy or do you want to just compare against the time? If you don't need to account for leap seconds or anything like that, just do it the easy way.
import datetime
now = datetime.datetime.now()
for h in range(24):
    for m in range(60):
        for s in range(60):
            time_string = '%02d:%02d:%02d' % (h, m, s)
            # %M is minutes; %m would be the month
            if time_string == now.strftime('%H:%M:%S'):
                print('you found it! %s' % time_string)
Can you give any more info about why you are doing this? It seems like you would be much much better off parsing the datetimes or using strftime to get what you need instead of looping through 60*60*24 times.
There's a great answer on how to get a list of incremental values for seconds for a 24-hour day. I reused a part of it.
Note 1. I'm not sure how you're thinking of comparing time with datetime. Assuming that you're just going to compare the time part and extracting that.
Note 2. The time.strptime call expects a 12-hour AM/PM-based time, as in your example. Its result is then passed to time.strftime that returns a 24-hour-based time.
Here's what I think you're looking for:
my_time = '6/23/2011 6:38:00 AM' # time you defined
from datetime import datetime, timedelta
from time import strftime, strptime
now = datetime(2013, 1, 1, 0, 0, 0)
last = datetime(2013, 1, 1, 23, 59, 59)
delta = timedelta(seconds=1)
times = []
while now <= last:
    times.append(now.strftime('%H:%M:%S'))
    now += delta
twenty_four_hour_based_time = strftime('%H:%M:%S', strptime(my_time, '%m/%d/%Y %I:%M:%S %p'))
twenty_four_hour_based_time in times # returns True
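Since the question mentions 30-second intervals, here is a variation with a configurable step that parses the timestamp with datetime.strptime directly. Treat it as a sketch: times_of_day is a made-up helper name and the anchor date inside it is arbitrary, since only the time part is kept.

```python
from datetime import datetime, timedelta

def times_of_day(step_seconds=30):
    """Return 'HH:MM:SS' strings covering one day at the given step."""
    start = datetime(2000, 1, 1)                 # arbitrary day, only the time is used
    count = 24 * 60 * 60 // step_seconds
    return [(start + timedelta(seconds=i * step_seconds)).strftime('%H:%M:%S')
            for i in range(count)]

times = times_of_day(30)
stamp = datetime.strptime('6/23/2011 6:38:00 AM', '%m/%d/%Y %I:%M:%S %p')
print(len(times), times[0], times[-1])
print(stamp.strftime('%H:%M:%S') in times)
```

If the list is only needed for membership checks, a set built from the same strings makes the lookup cheaper.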
I have a large data set with a variety of Date information in the following formats:
DAYS since Jan 1, 1900 - ex: 41213 - I believe these are from Excel http://www.kirix.com/stratablog/jd-edwards-date-conversions-cyyddd
YYDayofyear - ex 2012265
I am familiar with Python's time module, the strptime() method, and the strftime() method. However, I am not sure what the date formats above are called, or whether there is a Python module I can use to convert these unusual date formats.
Any idea how to get the %Y%M%D format from these unusual date formats without writing my own calculator?
Thanks.
You can try something like the following:
In [1]: import datetime
In [2]: s = '2012265'
In [3]: datetime.datetime.strptime(s, '%Y%j')
Out[3]: datetime.datetime(2012, 9, 21, 0, 0)
In [4]: d = '41213'
In [5]: datetime.date(1900, 1, 1) + datetime.timedelta(int(d))
Out[5]: datetime.date(2012, 11, 2)
The first one is the trickier one, but it uses the %j parameter to interpret the day of the year you provide (after a four-digit year, represented by %Y). The second one is simply the number of days since January 1, 1900.
This is the general conversion - not sure of your input format but hopefully this can be tweaked to suit it.
On the Excel integer to Python datetime bit:
Note that there are two Excel date systems (one 1-Jan-1900 based and another 1-Jan 1904 based); see https://support.microsoft.com/en-us/help/214330/differences-between-the-1900-and-the-1904-date-system-in-excel for more information.
Also note that the system is NOT zero-based. So, in the 1900 system, 1-Jan-1900 is day 1 (not day 0).
import datetime
EXCEL_DATE_SYSTEM_PC=1900
EXCEL_DATE_SYSTEM_MAC=1904
i = 42129 # Excel number for 5-May-2015
d = datetime.date(EXCEL_DATE_SYSTEM_PC, 1, 1) + datetime.timedelta(i-2)
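The i-2 above accounts for day 1 being 1-Jan-1900 plus the nonexistent 29-Feb-1900 that Excel's 1900 system counts. A common way to fold both into one step is to use 30-Dec-1899 as the base date. The sketch below does that; excel_serial_to_date is a name I made up, it only handles whole-day serials, and in the 1900 system it is only meaningful for serials of 61 and above (1-Mar-1900 onward).

```python
from datetime import date, timedelta

# Base dates chosen so that base + serial gives the calendar date.
EXCEL_BASE = {
    1900: date(1899, 12, 30),   # absorbs the 1-based count and the phantom 29-Feb-1900
    1904: date(1904, 1, 1),     # 1904 system: serial 0 is 1-Jan-1904
}

def excel_serial_to_date(serial, system=1900):
    """Convert a whole-day Excel serial number to a date."""
    if system == 1900 and serial < 61:
        raise ValueError('serials below 61 are ambiguous in the 1900 system')
    return EXCEL_BASE[system] + timedelta(days=int(serial))

print(excel_serial_to_date(42129))   # 2015-05-05
print(excel_serial_to_date(41213))   # 2012-10-31
```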
Both of these formats seem pretty straightforward to work with. The first one, in fact, is just an integer, so why don't you just do something like this?
import datetime
def days_since_jan_1_1900_to_datetime(d):
    return datetime.datetime(1900,1,1) + \
        datetime.timedelta(days=d)
For the second one, the details depend on exactly how the format is defined (e.g. can you always expect 3 digits after the year even when the number of days is less than 100, or is it possible that there are 2 or 1 – and if so, is the year always 4 digits?) but once you've got that part down it can be done very similarly.
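Assuming the second format is always a four-digit year followed by a zero-padded three-digit day of the year (an assumption, given the open questions above), a small sketch could validate the width and then let strptime do the work. parse_year_day is a hypothetical helper name.

```python
from datetime import datetime

def parse_year_day(value):
    """Parse YYYYDDD (year plus day of year) into a date."""
    text = str(value)
    if len(text) != 7 or not text.isdigit():
        raise ValueError('expected 7 digits (YYYYDDD), got %r' % text)
    return datetime.strptime(text, '%Y%j').date()

print(parse_year_day('2012265').strftime('%Y%m%d'))   # 20120921
```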
According to http://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior, day of the year is "%j", whereas the first case can be solved by toordinal() and fromordinal(): date.fromordinal(date(1900, 1, 1).toordinal() + x)
I'd think timedelta.
import datetime
d = datetime.timedelta(days=41213)
start = datetime.datetime(year=1900, month=1, day=1)
the_date = start + d
For the second one, you can slice the string, e.g. '2012265'[:4], to get the year and use the same method with the remaining digits as the day count.
edit: See the answer with %j for the second.
import pandas as pd
from datetime import datetime
df['timeelapsed'] = (pd.to_datetime(df['timeelapsed'], format='%H:%M:%S') - datetime(1900, 1, 1)).dt.total_seconds()