Generate randomly formatted date strings for machine learning - python

For an NLP project in Python I need to generate random dates for model training. In particular, the date format must be random and consistent with a set of language locales. The formats include purely numeric ones as well as formats with (partially) written-out day and month names, and various common punctuation marks.
My best solution so far is the following algorithm:
1) generate a datetime object with random values (nice solution here)
2) randomly select a locale, i.e. pick one of ['en_US','fr_FR','it_IT','de_DE']; in my case this list is well known and short, so that is not a problem.
3) randomly select a format string for strftime(), e.g. from ['%Y-%m-%d','%d %B %Y',...]. In my case the list should reflect the date formats likely to occur in the documents the NLP model will see in the future.
4) generate a string with strftime()
Especially for 3), I do not know a better approach than hardcoding the list of formats I saw manually in the training documents. I have not yet found a function that turns OCR'd dates into a format string, so that I could extend the list when yet-unseen date formats come up.
Do you have any suggestions on how to come up with better randomly formatted dates, or how to improve this approach?
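One possible way to extend the format list from observed dates is to test a new date string against a set of candidate format strings with datetime.strptime() and keep the ones that parse. The following is only a minimal sketch under stated assumptions: the function name guess_formats and the list CANDIDATE_FORMATS are hypothetical, the candidates still have to be written by hand, and written-out month names are only recognized for the locale that is currently active.

```python
from datetime import datetime

# Hypothetical candidate list; extend it with formats you expect to encounter.
CANDIDATE_FORMATS = [
    '%Y-%m-%d',
    '%d.%m.%Y',
    '%d/%m/%Y',
    '%m/%d/%Y',
    '%d %B %Y',
    '%B %d, %Y',
]

def guess_formats(date_string, candidates=CANDIDATE_FORMATS):
    """Return every candidate format that parses date_string without error."""
    matches = []
    for fmt in candidates:
        try:
            datetime.strptime(date_string, fmt)
        except ValueError:
            continue  # this format does not fit the string
        matches.append(fmt)
    return matches

print(guess_formats('2021-03-01'))   # one matching format
print(guess_formats('03/04/2021'))   # ambiguous: day-first and month-first both parse
```

A string such as 03/04/2021 matches more than one format, so the result is a list rather than a single answer; deciding between the candidates needs extra context such as the document's locale.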

USE random.randrange() AND datetime.timedelta() TO GENERATE A RANDOM DATE BETWEEN TWO DATES
Call datetime.date(year, month, day) to create a datetime.date object for the given year, month, and day. Do this twice to define the start and end dates. Subtract the start date from the end date to get a datetime.timedelta spanning the two dates, and read its days attribute to get the number of days between them. Call random.randrange(days) to get a random integer from 0 up to, but not including, days. Call datetime.timedelta(days=n) to turn that integer n into a datetime.timedelta, and add it to the start date. Note that the end date itself is never returned.
import datetime
import random
start_date = datetime.date(2020, 1, 1)
end_date = datetime.date(2020, 2, 1)
time_between_dates = end_date - start_date
days_between_dates = time_between_dates.days
# randrange(n) returns an integer in [0, n), so end_date is excluded
random_number_of_days = random.randrange(days_between_dates)
random_date = start_date + datetime.timedelta(days=random_number_of_days)
print(random_date)

Here is my solution. Note that all of the locales must be installed on your computer, otherwise setlocale() raises an error.
import random
from datetime import datetime
import locale
LOCALE = ['en_US','fr_FR','it_IT','de_DE'] # all need to be available on your computer to avoid an error
DATE_FORMAT = ['%Y-%m-%d','%d %B %Y']
def gen_datetime(min_year=1900, max_year=None):
    # resolve the default at call time, not once at function definition time
    if max_year is None:
        max_year = datetime.now().year
    # generate a datetime between Jan 1 of min_year and Jan 1 of max_year + 1
    start = datetime(min_year, 1, 1)
    end = datetime(max_year + 1, 1, 1)  # exact end, leap years included
    format_date = random.choice(DATE_FORMAT)
    locale_date = random.choice(LOCALE)
    locale.setlocale(locale.LC_ALL, locale_date) # raises locale.Error if the locale is not available on your computer
    return (start + (end - start) * random.random()).strftime(format_date)
date = gen_datetime()
print(date)

Related

date_range won't include last date of interval for custom frequency

I want to create a date vector with a given fixed spacing depending on the frequency I choose. So far, this is what I got:
import pandas as pd
import datetime as dt
from datetime import date, timedelta
def getDates(sD, eD, f):
    # creating the datetime object to later on make the date vector
    sD = dt.datetime.strptime(sD, '%m/%d/%Y')
    eD = dt.datetime.strptime(eD, '%m/%d/%Y')
    sd_t = date(sD.year,sD.month,sD.day) # start date
    ed_t = date(eD.year,eD.month,eD.day) # end date
    # we hardcode a frequency dictionary for the frequencies and spacing that
    # date vectors are going to have.
    freqDict = {'1h':'40D', '4h':'162D', '1d':'1000D'}
    dateVector = pd.date_range(sd_t, ed_t, freq = freqDict[f])
    return dateVector
As you can see, I am only interested in 3 frequencies. The spacing between them works well: I have to work around API limitations, and the limit I set for requests is 1000. This is why I chose these custom spacings between dates, to get as many data points as possible for each frequency within the API request limit these dates are meant for.
Unfortunately, in some cases I can't get the final date into the dateVector. If I run this function with these inputs:
getDates('01/01/2020', '01/01/2021', '4h')
I get this outcome, which is missing the final date on the array ('01/01/2021'):
0 2020-01-01
1 2020-06-11
2 2020-11-20
I thought of using the closed = parameter, but it didn't get me where I wanted.
A workaround I thought of consists of using periods instead of freq, and dynamically computing the periods according to the distance (in terms of days) between the start date and the end date. But I would like to know if I can make date_range work in my favor without having to write such a routine.
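The idea of always closing the interval on the end date can be illustrated without pandas. The sketch below is not a date_range option but a plain-datetime stand-in, assuming start is not later than end; the function name spaced_dates and its parameters are hypothetical.

```python
from datetime import date, timedelta

def spaced_dates(start, end, step_days):
    """Return dates from start in step_days increments, always ending on end."""
    dates = []
    current = start
    while current < end:
        dates.append(current)
        current += timedelta(days=step_days)
    # Close the interval even when end does not fall on the step grid.
    dates.append(end)
    return dates

for d in spaced_dates(date(2020, 1, 1), date(2021, 1, 1), 162):
    print(d)
```

With the inputs from the question (162-day spacing over 2020) this yields the three dates shown above plus 2021-01-01 as the final entry. The last gap is shorter than the others, which is acceptable when the spacing is only an upper bound imposed by a request limit.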

How to iterate over range between two datetime objects in Python? [duplicate]

Okay, so I am relatively new to programming and this has me absolutely stumped. I'm scraping data from a website and the data changes every week. I want to run my scraping process for each time the data changed, starting back on 09-09-2015 and running up to the present.
I know how to do this easily by running through every number, like 0909, then 0910, then 0911, but that is not what I need, as it would send far too many pointless requests to the server.
Here is the format of the URL
http://www.myexamplesite.com/?date=09092015
I know the simple:
for i in range(startDate, endDate):
    url = 'http://www.myexamplesite.com/?date={}'.format(i)
    driver.get(url)
But one thing I've never been able to figure out is how to manipulate Python's datetime to accurately reflect the format the website uses.
i.e:
09092015
09162015
09232015
09302015
10072015
...
09272017
If all else fails, I only need to do this once, so it wouldn't take too long to skip the loop altogether, manually enter each date I wish to scrape, and then append all of my dataframes together. I'm mainly curious about how to manipulate datetime in this way for future projects that may require more data.
A good place to start is the docs for the datetime, date and timedelta objects.
First, let's construct our starting date and ending date (today):
>>> from datetime import date, timedelta
>>> start = date(2015, 9, 9)
>>> end = date.today()
>>> start, end
(datetime.date(2015, 9, 9), datetime.date(2017, 9, 27))
Now let's define the unit of increment -- one day:
>>> day = timedelta(days=1)
>>> day
datetime.timedelta(1)
A nice thing about dates (date/datetime) and time deltas (timedelta) is that they can be added:
>>> start + day
datetime.date(2015, 9, 10)
We can also use format() to get that date in a human-readable form:
>>> "{date.day:02}{date.month:02}{date.year}".format(date=start+day)
'10092015'
So, when we put all this together:
from datetime import date, timedelta
start = date(2015, 9, 9)
end = date.today()
week = timedelta(days=7)
mydate = start
while mydate < end:
    print("{date.day:02}{date.month:02}{date.year}".format(date=mydate))
    mydate += week
we get a simple iteration over dates starting with 2015-09-09 and ending with today, incremented by 7 days (a week):
09092015
16092015
23092015
30092015
07102015
...
Take a look at the strftime() and strptime() documentation:
https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior
The table there lists the format codes for dates and times and how to use them.
Of course, if the format of the dates changes in the future or you are parsing different strings, you will have to make code changes. There really is no way around that.
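The dates listed in the question (09092015, 09162015, ...) put the month before the day, so a format string of '%m%d%Y' matches the site's URL scheme. As a minimal sketch combining the weekly loop with strftime(), assuming the end date should be included when it falls on the weekly grid (the generator name weekly_date_strings is hypothetical):

```python
from datetime import date, timedelta

def weekly_date_strings(start, end, fmt='%m%d%Y'):
    """Yield dates from start to end (inclusive) in 7-day steps, formatted with fmt."""
    current = start
    while current <= end:
        yield current.strftime(fmt)
        current += timedelta(weeks=1)

stamps = list(weekly_date_strings(date(2015, 9, 9), date(2015, 10, 7)))
print(stamps)
# ['09092015', '09162015', '09232015', '09302015', '10072015']
```

Each string can then be substituted into the URL, for example with 'http://www.myexamplesite.com/?date={}'.format(stamp).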

Converting days since epoch to date

How can one convert a serial date number, representing the number of days since epoch (1970), to the corresponding date string? I have seen multiple posts showing how to go from string to date number, but I haven't been able to find any posts on how to do the reverse.
For example, 15951 corresponds to "2013-09-02".
>>> import datetime
>>> (datetime.datetime(2013, 9, 2) - datetime.datetime(1970,1,1)).days + 1
15951
(The + 1 because whatever generated these date numbers followed the convention that Jan 1, 1970 = 1.)
TL;DR: Looking for something to do the following:
>>> serial_date_to_string(15951) # arg is number of days since 1970
"2013-09-02"
This is different from "Python: Converting Epoch time into the datetime" because I am starting with days since 1970. I am not sure whether you can just multiply by 86,400, because of leap seconds, etc.
Use the datetime package as follows:
import datetime
def serial_date_to_string(srl_no):
    new_date = datetime.datetime(1970,1,1,0,0) + datetime.timedelta(srl_no - 1)
    return new_date.strftime("%Y-%m-%d")
This is a function which returns the string as required.
So:
serial_date_to_string(15951)
Returns
>> "2013-09-02"
And for a Pandas Dataframe:
df["date"] = pd.to_datetime(df["date"], unit="d")
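... assuming that the "date" column contains values like 18687, which is the number of days from the Unix epoch of 1970-01-01 to 2021-03-01. Note that this treats 0 as 1970-01-01, so subtract 1 first if your serial numbers count Jan 1, 1970 as day 1, as in the question.
This also handles seconds and milliseconds since the Unix epoch; use unit="s" and unit="ms" respectively.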
Also see my other answer with the exact reverse.
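On the leap-second concern raised in the question: Unix time treats every day as exactly 86,400 seconds, so multiplying the day count by 86,400 and converting in UTC gives the same calendar date as adding a timedelta. The sketch below compares both routes; the function names and the first_day parameter (the serial number assigned to 1970-01-01) are hypothetical.

```python
from datetime import date, datetime, timedelta, timezone

def serial_to_date(serial, first_day=1):
    """Convert a day count to a date, where first_day is the serial of 1970-01-01."""
    return date(1970, 1, 1) + timedelta(days=serial - first_day)

def serial_to_date_via_seconds(serial, first_day=1):
    """Same conversion by way of a Unix timestamp, interpreted in UTC."""
    seconds = (serial - first_day) * 86400
    return datetime.fromtimestamp(seconds, tz=timezone.utc).date()

print(serial_to_date(15951).isoformat())              # 2013-09-02
print(serial_to_date_via_seconds(15951).isoformat())  # 2013-09-02
```

Passing first_day=0 gives the zero-based convention, under which 18687 maps to 2021-03-01.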

How can I get the maximum length of datetime.strftime?

Currently I'm working on a command line program and there I print out dates.
I do this with datetime.datetime.strftime:
import datetime
d = datetime.datetime(2012,12,12)
date_str = d.strftime(config.output_str)
Where config.output_str is a format string that can be set by the user.
Is there a way to tell how long the string date_str will be at maximum?
Especially if a format string like u'%d %B %Y' is used, where the length of the month (%B) depends on the language of the user?
If you are not setting the locale with the locale module, then Python uses the C locale and you can predict the maximum length produced. All strings will be in English and the maximum length per format character is known.
Parse the string yourself, count the non-format characters and map format characters to the maximum length for that field.
If you were to use locale, you'll need to calculate the max length per language. You can automate the locale-dependent fields by looping over the months, weekdays, and AM/PM and measuring the max length for the %a, %A, %b, %B, %c, %p, %x and %X formats. I'd do that on the fly as needed.
The rest of the formats do not vary by locale and have a documented maximum length (the examples in the strptime table are typical, you can rely on those documenting the field length).
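The measuring step for the name-based fields could look roughly like the sketch below. It is only an illustration under assumptions: it covers %a, %A, %b and %B but not %c, %p, %x or %X, it measures whatever locale is active when it is called, and the function name max_field_lengths is hypothetical.

```python
from datetime import date

def max_field_lengths():
    """Measure the longest month and weekday names in the currently active locale."""
    # One date per month; the year is arbitrary.
    months = [date(2012, m, 1) for m in range(1, 13)]
    # Seven consecutive days cover every weekday once.
    weekdays = [date(2012, 1, d) for d in range(1, 8)]
    lengths = {}
    for code, samples in (('%b', months), ('%B', months),
                          ('%a', weekdays), ('%A', weekdays)):
        lengths[code] = max(len(d.strftime(code)) for d in samples)
    return lengths

print(max_field_lengths())
```

The resulting numbers can be combined with the fixed widths of the numeric fields and the count of literal characters in the format string to get an upper bound on the output length.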
Here is the solution I wrote to solve this, for those who are interested.
I use the given format string format_str to figure out how long the output could get.
For this I assume that only the month and the day can vary in length.
The function loops over the months to see which has the longest form, and then loops over the days with the previously found month.
import datetime
def max_date_len(format_str):
    def date_len(date):
        return len(date.strftime(format_str))
    def find_max_index(lst):
        return max(range(len(lst)), key=lst.__getitem__)
    # run through all months and add 1 to the index since we need a month
    # between 1 and 12
    max_month = 1 + find_max_index([date_len(datetime.datetime(2012, month, 12, 12, 12)) for month in range(1, 13)])
    # run through the days 10 to 16 of that month since
    # this covers all weekdays and double digit days
    return max([date_len(datetime.datetime(2012, max_month, day, 12, 12)) for day in range(10, 17)])

Attempting to insert an integer from a list into datetime object

What I am trying to accomplish is very simple: creating a loop from a range (pretty self explanatory below) that will insert the month into the datetime object. I know %d requires an integer, and I know that 'month' type is int...so I'm kind of stuck as to why I can't substitute my month variable. Here is my code:
all_months=range(1,13)
for month in all_months:
    month_start = (datetime.date(2010,'%d',1))%month
    next_month_begin= datetime.date(2010,'%d',1)%(month+1)
    month_end=next_month_begin - timedelta(days=1)
    print month_start
    print month_end
What am I doing wrong?
All help appreciated! Thanks
There are a few things that you need to fix here.
EDIT: First, be careful with your range. Since you are using month+1 to create next_month_begin, you do not want this to be greater than 12 or you will get an error.
Next, when you try to create the date object you are passing the month in as a string when you use (datetime.date(2010,'%d',1))%month. Your code is probably throwing this error: TypeError: an integer is required.
You need to give it the integer representing the month, not a string of the integer (there is a difference between 1 and '1'). This is also a simple fix: since you have a variable named month that is already an integer, just use that instead of making a string. So your code should be something like:
month_start = datetime.date(2010,month,1)
I think you can figure out how to apply this to your next_month_begin assignment.
The last problem is that you need to use datetime.timedelta to tell Python to look in the datetime module for the timedelta() function -- your program would currently give you an error saying that timedelta is not defined.
Let me know if you have any problems applying these fixes. Be sure to include any error you are getting as well.
You've got other answers, but here's a way to get the last day of the month. Adding 31 days will get you into the next month regardless of the number of days in the current month, then moving back to the first and subtracting a day will give the ending date.
import datetime
for month in range(1,13):
    month_start = datetime.date(2010,month,1)
    # 31 days after the 1st always lands in the following month
    into_next_month = month_start + datetime.timedelta(days=31)
    month_end = into_next_month.replace(day=1) - datetime.timedelta(days=1)
    print(month_start, month_end)
month is a variable and you can use it to create the datetime object. I think you want to do the following:
month_start = datetime.date(2010, month, 1)
next_month_begin = datetime.date(2010, month+1, 1)
That will work for months 1 to 11, because datetime.date() requires 3 integer arguments; for month 12, month+1 is 13, which raises a ValueError, so December needs special handling. '%d' % month would instead format the integer month as a string. '%04d' % 3, for example, would format the number 3 with 4 digits and leading zeros. But it's important to know that even the (nearly unformatted) string "3" is different from the number 3 in Python.
And you can't write datetime(...) % 3, because the % operator only does formatting when used on a format string, as in the previous '%04d' % 3 example, and not on a datetime object.
But other types might also accept the % operator (not including datetime objects). For example, integers accept the % operator to get the remainder of a division: 3 % 2 # returns 1. But there, the meaning of % is completely different, because the meaning of the operator depends on the types involved. For example, try 3 + 2 and "3" + "2". There, the meaning of + differs (integer addition vs. string concatenation), because the types are different too.
Check out the calendar module (http://docs.python.org/library/calendar.html).
It has batteries included for this sort of thing...
You could just do:
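from calendar import Calendar
def start_and_end_days(year, month):
    cal = Calendar()
    # itermonthdates() yields datetime.date objects, including padding days
    # from the neighbouring months, which the filter removes
    month_days = [day for day in cal.itermonthdates(year, month) if day.month == month]
    first_day = month_days[0]
    last_day = month_days[-1]
    return (first_day, last_day)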
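The calendar module also offers monthrange(year, month), which returns the weekday of the first day and the number of days in the month, so the last day can be built directly. A minimal sketch along those lines (the function name month_bounds is hypothetical):

```python
import calendar
from datetime import date

def month_bounds(year, month):
    """Return the first and last date of the given month."""
    # monthrange() returns (weekday of the 1st, number of days in the month).
    _, days_in_month = calendar.monthrange(year, month)
    return date(year, month, 1), date(year, month, days_in_month)

for month in range(1, 13):
    month_start, month_end = month_bounds(2010, month)
    print(month_start, month_end)
```

This also works for December, since no date in the following month has to be constructed.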
