Python find when list of dates becomes non-consecutive - python

I have a list of dates which are mostly consecutive, for example:
['01-Jan-10', '02-Jan-10', '03-Jan-10', '04-Jan-10', '08-Jan-10', '09-Jan-10', '10-Jan-10', '11-Jan-10', '13-Jan-10']
This is just an illustration as the full list contains thousands of dates.
This list can have a couple of spots where the consecutiveness breaks. In the example shown above, the missing days run from 05-Jan-10 to 07-Jan-10, and then 12-Jan-10. I am looking for the minimal and maximal day in each gap. Is there any way to do this efficiently in Python?

The datetime module from the standard library can be useful.
Work out the right date format and apply it with strptime to every item in the list, then loop through consecutive pairs and check the difference between them (in days) using timedelta arithmetic. To keep the same (non-standard) format in the output, apply strftime.
from datetime import datetime, timedelta

dates = ['01-Jan-10', '02-Jan-10', '03-Jan-10', '04-Jan-10', '08-Jan-10', '09-Jan-10', '10-Jan-10', '11-Jan-10', '13-Jan-10']
# date format code
date_format = '%d-%b-%y'
# cast to date objects
days = list(map(lambda d: datetime.strptime(d, date_format).date(), dates))
# check consecutive days
for d1, d2 in zip(days, days[1:]):
    date_gap = (d2-d1).days
    # check consecutiveness
    if date_gap > 1:
        # compute day boundary of the gap
        min_day_gap, max_day_gap = d1 + timedelta(days=1), d2 - timedelta(days=1)
        # apply format
        min_day_gap = min_day_gap.strftime(date_format)
        max_day_gap = max_day_gap.strftime(date_format)
        # check
        print(min_day_gap, max_day_gap)
#05-Jan-10 07-Jan-10
#12-Jan-10 12-Jan-10
Remark: it is not clear what should happen when two listed dates are 2 days apart, i.e. exactly one day is missing. In that case the min and max day of the gap are identical. If that needs special handling, add a conditional check date_gap == 2 and adjust the behavior...
if date_gap == 2: ... elif date_gap > 1: ...
or add a comment/edit the question with a proper description.
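If the gaps are needed as data rather than printed output, the same pairwise comparison can be wrapped in a function. This is a minimal sketch, not part of the original answer: the name find_gaps is hypothetical, and sorting the input first is an added assumption so that unordered lists also work. A one-day gap is reported with identical first and last day.

```python
from datetime import datetime, timedelta

DATE_FORMAT = '%d-%b-%y'  # same format as in the question


def find_gaps(date_strings, date_format=DATE_FORMAT):
    """Return one (first_missing, last_missing) pair of strings per gap."""
    # Assumption: sort first so the pairwise comparison is valid for any input order.
    days = sorted(datetime.strptime(s, date_format).date() for s in date_strings)
    one_day = timedelta(days=1)
    gaps = []
    for previous, current in zip(days, days[1:]):
        if (current - previous).days > 1:
            first_missing = previous + one_day
            last_missing = current - one_day
            gaps.append((first_missing.strftime(date_format),
                         last_missing.strftime(date_format)))
    return gaps


dates = ['01-Jan-10', '02-Jan-10', '03-Jan-10', '04-Jan-10', '08-Jan-10',
         '09-Jan-10', '10-Jan-10', '11-Jan-10', '13-Jan-10']
print(find_gaps(dates))
```

The work is one parse per string plus one pass over the sorted list, so a few thousand dates should not be a problem.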

Related

Generate randomly formatted date strings for machine learning

For an NLP project in Python I need to generate random dates for model training. In particular, the date format must be random and consistent with a set of language locales. The formats include purely numeric ones as well as ones with (partially) written-out day and month names and various common punctuation.
My best solution so far is the following algorithm:
1) generate a datetime() object with random values (nice solution here)
2) randomly select a locale, i.e. pick one of ['en_US','fr_FR','it_IT','de_DE'], where in this case the list is well known and short, so not a problem.
3) randomly select a format string for strftime(), e.g. ['%Y-%m-%d','%d %B %Y',...]. In my case the list should reflect the date formats likely to occur in the documents that the NLP model will be exposed to in the future.
4) generate a string with strftime()
Especially for 3) I do not know a better approach than hardcoding the list of formats I saw manually in the training documents. I have not yet found a function that would turn OCR'd dates into a format string, so that I could extend the list when yet-unseen date formats come up.
Do you have any suggestions on how to come up with better randomly formatted dates, or how to improve this approach?
Use random.randrange() and datetime.timedelta() to generate a random date between two dates
Call datetime.date(year, month, day) to get a date object for the given year, month, and day. Do this twice to define the start and end date. Subtract the start date from the end date to get a datetime.timedelta, and read its days attribute to get the number of days between the two dates. Call random.randrange(days) to get a random integer from 0 up to, but not including, that number. Call datetime.timedelta(days=n) to turn that integer n into a timedelta, and add it to the start date.
import datetime
import random

start_date = datetime.date(2020, 1, 1)
end_date = datetime.date(2020, 2, 1)
time_between_dates = end_date - start_date
days_between_dates = time_between_dates.days
random_number_of_days = random.randrange(days_between_dates)
random_date = start_date + datetime.timedelta(days=random_number_of_days)
print(random_date)
Here is my solution. Concerning the locales, all of them need to be installed on your computer to avoid an error.
import random
from datetime import datetime, timedelta
import locale

LOCALE = ['en_US', 'fr_FR', 'it_IT', 'de_DE']  # all need to be installed on your computer to avoid an error
DATE_FORMAT = ['%Y-%m-%d', '%d %B %Y']

def gen_datetime(min_year=1900, max_year=datetime.now().year):
    # generate a random datetime between min_year and max_year
    start = datetime(min_year, 1, 1)
    years = max_year - min_year + 1
    end = start + timedelta(days=365 * years)
    format_date = random.choice(DATE_FORMAT)
    locale_date = random.choice(LOCALE)
    locale.setlocale(locale.LC_ALL, locale_date)  # raises locale.Error if the locale is not installed
    return (start + (end - start) * random.random()).strftime(format_date)

date = gen_datetime()
print(date)
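One possible refinement, sketched under assumptions rather than taken from the answers: switch only LC_TIME, fall back to the current locale when the chosen one is not installed, and restore the previous setting afterwards so the rest of the program is unaffected. The function name random_formatted_date, its parameters, the default date range, and the '.UTF-8' locale names are all hypothetical choices for illustration.

```python
import locale
import random
from datetime import date, timedelta


def random_formatted_date(rng, formats, locales,
                          start=date(1900, 1, 1), end=date(2020, 12, 31)):
    """Format a random day in [start, end] with a random format and locale."""
    day = start + timedelta(days=rng.randrange((end - start).days + 1))
    fmt = rng.choice(formats)
    previous = locale.setlocale(locale.LC_TIME)  # query the current setting
    try:
        try:
            locale.setlocale(locale.LC_TIME, rng.choice(locales))
        except locale.Error:
            pass  # locale not installed: keep the current one
        return day.strftime(fmt)
    finally:
        locale.setlocale(locale.LC_TIME, previous)  # always restore


rng = random.Random(0)  # seeded so that runs are reproducible
print(random_formatted_date(rng, ['%Y-%m-%d', '%d %B %Y'],
                            ['en_US.UTF-8', 'fr_FR.UTF-8']))
```

Passing a random.Random instance instead of using the module-level functions makes the generated training data reproducible from a seed.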

Faster way to check if a date is within a 5000+ element long list / numpy array?

I have the following function that is called multiple times:
def next_valid_date(self, date_object):
    """Returns next valid date based on valid_dates.
    If argument date_object is valid, original date_object will be returned."""
    while date_object not in self.valid_dates.tolist():
        date_object += datetime.timedelta(days=1)
    return date_object
For reference, valid_dates is a numpy array that holds all recorded dates for a given stock pulled from yfinance. In the case of the example I've been working with, NVDA (nvidia stock), the valid_dates array has 5395 elements (dates).
I have another function, and its purpose is to create a series of start dates and end dates. In this example self.interval is a timedelta with a length of 365 days, and self.sub_interval is a timedelta with a length of 1 day:
def get_date_range_series(self):
    """Retrieves a series containing lists of start dates and corresponding end dates over a given interval."""
    interval_start = self.valid_dates[0]
    interval_end = self.next_valid_date(self.valid_dates[0] + self.interval)
    dates = [[interval_start, interval_end]]
    while interval_end < datetime.date.today():
        interval_start = self.next_valid_date(interval_start + self.sub_interval)
        interval_end = self.next_valid_date(interval_start + self.interval)
        dates.append([interval_start, interval_end])
    return pd.Series(dates)
My main issue is that it takes a lengthy period of time to execute (about 2 minutes), and I'm sure there's a far better way of doing this... Any thoughts?
I just created an alternate next_valid_date() method that uses .loc[] on a pandas dataframe (the dataframe's index holds the valid dates, which is where valid_dates comes from in the first place):
def next_valid_date_alt(self, date_object):
    while True:
        try:
            self.stock_yf_df.loc[date_object]
            break
        except KeyError:
            date_object += datetime.timedelta(days=1)
    return date_object
Checking for the next valid date when 6/28/20 is passed in (which isn't valid, it is a weekend, and the stock market is closed) resulted in the original method taking 0.0099754 seconds to complete and the alternate method taking .0019944 seconds to complete.
What this means is that get_date_range_series() takes just over 1 second to complete when using the next_valid_date_alt() method as opposed to 70 seconds when using the next_valid_date() method. I'll definitely look into the other optimizations mentioned as well. I appreciate everyone else's responses!
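Another option, sketched here with the standard library only and assuming the valid dates are kept as a sorted Python list of datetime.date objects: a binary search finds the next valid date directly, without stepping one day at a time and without rebuilding a list on every call. The standalone function below is illustrative, not the original class method, and the sample dates are made up.

```python
import bisect
import datetime


def next_valid_date(sorted_valid_dates, date_object):
    """Return the first date in sorted_valid_dates that is >= date_object.

    Returns None when date_object lies after the last valid date.
    """
    index = bisect.bisect_left(sorted_valid_dates, date_object)
    if index == len(sorted_valid_dates):
        return None
    return sorted_valid_dates[index]


# Illustrative data: a Thursday, a Friday and the following Monday.
valid_dates = [datetime.date(2020, 6, 25),
               datetime.date(2020, 6, 26),
               datetime.date(2020, 6, 29)]
print(next_valid_date(valid_dates, datetime.date(2020, 6, 28)))  # 2020-06-29
```

Each lookup is O(log n) in the number of valid dates. Returning None for a date past the end of the list is a design choice of this sketch, so the caller has to handle that case explicitly.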

Find No.of days between two dates (Tuples)

Suppose I have a tuple A. It contains two nested tuples. The nested tuples are dates of the form DD,MM,YYYY.
A = ((DD,MM,YYYY), (DD,MM,YYYY))
Now, I want to find the number of days between the two dates. I've already tried fiddling with datetime module and it only helps when objects are integers and not tuples. My problem constraint is that I cannot change the structure in which dates are represented. I suppose I can use slicing but that would be way too much work. I'm pretty new at this and I hope someone can shed some light my way.
You can use datetime.strptime to create a datetime object from the given string. Then you can subtract date1 and date2 which gives you a timedelta object and this timedelta object has a nice attribute days which gives you the number of days between two dates.
Use:
from datetime import datetime
date1 = datetime.strptime("-".join(A[0]), "%d-%m-%Y")
date2 = datetime.strptime("-".join(A[1]), "%d-%m-%Y")
diff_days = (date1 - date2).days
print(diff_days)
For example consider,
A = (("24","05","2020"), ("25","04","2020")) then the above code will print diff_days as 29.
Why is slicing too much work?
import datetime
# A = ((DD,MM,YYYY), (DD,MM,YYYY))
A = ((1,1,2020), (20,4,2020))
delta = (
    datetime.date(A[1][2], A[1][1], A[1][0])
    - datetime.date(A[0][2], A[0][1], A[0][0])
)
print(delta.days)  # 110
Code written on my smartphone. The basic idea: build a string from each tuple with an f-string and parse it into a datetime object inside a list comprehension. Then build the timedelta and get the result in different formats.
A = ((3,4,2000), (4,4,2000))
from datetime import datetime
dt = [datetime.strptime(f'{a[0]}.{a[1]}.{a[2]}', '%d.%m.%Y') for a in A]
td = dt[1] - dt[0]
# when you want a tuple with only days (note the trailing comma)
difference_tuple = (td.days,)
# days, hours, minutes
days, hours, minutes = td.days, td.seconds // 3600, td.seconds // 60 % 60
difference_tuple2 = (days, hours, minutes)
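A more compact variant of the same idea, as a sketch that assumes the tuples hold integers in (DD, MM, YYYY) order: reversing each tuple yields (YYYY, MM, DD), which is the argument order datetime.date expects, so no string formatting or parsing is needed. The helper name days_between is hypothetical.

```python
from datetime import date


def days_between(pair):
    """Days from the first (DD, MM, YYYY) tuple to the second."""
    # reversed((DD, MM, YYYY)) yields YYYY, MM, DD, matching date(year, month, day)
    first, second = (date(*reversed(dmy)) for dmy in pair)
    return (second - first).days


A = ((1, 1, 2020), (20, 4, 2020))
print(days_between(A))  # 110
```

The result is negative when the second date is earlier than the first; wrap it in abs() if only the distance matters.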

How to check if a certain date is present in a dictionary and if not, return the closest date available?

I have a dictionary with many sorted dates. How could I write a loop in Python that checks whether a certain date is in the dictionary and, if not, returns the closest date available? I want it to subtract one day from the date, check again whether the new date exists in the dictionary, and keep subtracting until it finds an existing date.
Thanks in advance
from datetime import timedelta
def function(date):
    if date not in dictio:
        date -= timedelta(days=1)
    return date
I've made a recursive function to solve your problem:
import datetime
def find_date(date, date_dict):
    if date not in date_dict.keys():
        return find_date(date-datetime.timedelta(days=1), date_dict)
    else:
        return date
I don't know what your dictionary contains, but the following example should show you how this works:
import numpy as np
# create a dictionary of random dates
months = np.random.randint(3,5,20)
days = np.random.randint(1,30,20)
dates = {
    datetime.date(2019,m,d): '{}_{:02}_{:02}'.format(2019,m,d)
    for m,d in zip(months,days)}
# select the date to find
target_date = datetime.date(2019, np.random.randint(3,5), np.random.randint(1,30))
# print the result
print("The date I wanted: {}".format(target_date))
print("The date I got: {}".format(find_date(target_date, dates)))
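What you are looking for is possibly a while loop, but beware: if the dictionary holds no date on or before the one you pass in, the loop never finds a match and keeps stepping back until the date arithmetic eventually overflows. Perhaps you want to define a limit of attempts after which the script should give up?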
from datetime import timedelta, date

d1 = {
    date(2019, 4, 1): None
}

def function(date, dictio):
    while date not in dictio:
        date -= timedelta(days=1)
    return date

res_date = function(date.today(), d1)
print(res_date)
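The attempt limit mentioned above could look like the following sketch. The function name closest_earlier_date and the default of 366 days are assumptions chosen for illustration; returning None when nothing is found is one possible convention, and raising an exception would work as well.

```python
from datetime import date, timedelta


def closest_earlier_date(day, date_dict, max_attempts=366):
    """Return day, or the nearest earlier key of date_dict, or None.

    Gives up after stepping back max_attempts days without a match.
    """
    for _ in range(max_attempts + 1):
        if day in date_dict:
            return day
        day -= timedelta(days=1)
    return None


d1 = {date(2019, 4, 1): None}
print(closest_earlier_date(date(2019, 4, 5), d1))  # 2019-04-01
print(closest_earlier_date(date(2019, 3, 1), d1))  # None
```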

Python calculate the number of year in date column

I've recently started coding with Python, and I'm struggling to calculate the number of years between the current date and a given date.
Dataframe
I would like to calculate the number of years for each date column.
I tried this but it's not working:
def Number_of_years(d1,d2):
    if d1 is not None:
        return relativedelta(d2,d1).years

for col in df.select_dtypes(include=['datetime64[ns]']):
    df[col]=Number_of_years(df[col],date.today())
Can anyone help me find a solution to this?
I see that the format of the dates is day/month/year.
Given that this format is the same for all cells, you can parse the dates using the datetime module like so:
from datetime import datetime  # import module
import numpy as np

def numberOfYears(element):
    # parse the date string according to the fixed format
    date = datetime.strptime(element, '%d/%m/%Y')
    # return the difference in the years
    return datetime.today().year - date.year

# make things more interesting by vectorizing this function
function = np.vectorize(numberOfYears)
# This returns a numpy array containing difference between years.
# call this for each column, and you should be good
difference = function(df.Date_creation)
Your code is basically right, but you're operating on a pandas Series, so you can't just call relativedelta on it directly. Apply it element by element instead:
from datetime import date
from dateutil.relativedelta import relativedelta

def number_of_years(d1,d2):
    return relativedelta(d2,d1).years

for col in df.select_dtypes(include=['datetime64[ns]']):
    df[col] = df[col].apply(lambda d: number_of_years(d, date.today()))
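Note that the two answers measure different things: subtracting the year fields gives the difference in calendar years, while relativedelta(...).years counts whole years elapsed. The standard-library sketch below illustrates the second notion for ordinary dates. The helper name full_years_between is hypothetical, and its handling of 29 February is a simplification that may not match relativedelta in every edge case.

```python
from datetime import date


def full_years_between(earlier, later):
    """Whole years elapsed from earlier to later (assumes earlier <= later)."""
    years = later.year - earlier.year
    # If the anniversary has not been reached yet in later's year, drop one year.
    if (later.month, later.day) < (earlier.month, earlier.day):
        years -= 1
    return years


# One day apart across New Year: the calendar years differ by 1, but 0 full years elapsed.
print(full_years_between(date(2019, 12, 31), date(2020, 1, 1)))  # 0
print(full_years_between(date(2010, 6, 15), date(2020, 6, 15)))  # 10
```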
