I am getting my data and some dates from an unconventional source, and because of this there are some minor differences in the string dates. The main differences are that the day is not always padded with a zero, there can be whitespace after the day (as in the date 2/9 /2018), and the months are not padded with zeroes either. I was getting the error "time data does not match format '%m %d %Y'" when trying datetime.strptime. How can I convert a column of dates with subtle differences like this? Please see the code and sample data below.
d_o = datetime.datetime.strptime(df['start'][1], '%m %d %Y')
1/26/2018
1/26/2018
2/2/2018
2/2/2018
2/9 /2018
2/9 /2018
1/19/2018
1/19/2018
1/26/2018
1/26/2018
2/2/2018
2/2/2018
2/9 /2018
You should use a third-party library such as dateutil. It accepts a wide variety of date formats, at the cost of some performance.
from dateutil import parser
lst = ['1/26/2018', '1/26/2018', '2/2/2018', '2/2/2018', '2/9 /2018', '2/9 /2018',
'1/19/2018', '1/19/2018', '1/26/2018', '1/26/2018', '2/2/2018', '2/2/2018',
'2/9 /2018']
res = [parser.parse(i) for i in lst]
Result:
[datetime.datetime(2018, 1, 26, 0, 0),
datetime.datetime(2018, 1, 26, 0, 0),
datetime.datetime(2018, 2, 2, 0, 0),
datetime.datetime(2018, 2, 2, 0, 0),
datetime.datetime(2018, 2, 9, 0, 0),
datetime.datetime(2018, 2, 9, 0, 0),
datetime.datetime(2018, 1, 19, 0, 0),
datetime.datetime(2018, 1, 19, 0, 0),
datetime.datetime(2018, 1, 26, 0, 0),
datetime.datetime(2018, 1, 26, 0, 0),
datetime.datetime(2018, 2, 2, 0, 0),
datetime.datetime(2018, 2, 2, 0, 0),
datetime.datetime(2018, 2, 9, 0, 0)]
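Since the original question is about a DataFrame column, here is a minimal sketch of applying the same parser to a column (assuming the column is named 'start', as in the question's snippet):
import pandas as pd
from dateutil import parser

df = pd.DataFrame({'start': lst})  # stand-in for your real DataFrame
df['start'] = df['start'].apply(parser.parse)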
You can use re.split and str.zfill:
import re
dates = ['1/26/2018', '1/26/2018', '2/2/2018', '2/2/2018', '2/9 /2018', '2/9 /2018', '1/19/2018', '1/19/2018', '1/26/2018', '1/26/2018', '2/2/2018', '2/2/2018', '2/9 /2018']
# split each date on '/' or whitespace, then zero-pad the month
new_dates = ['{}/{}/{}'.format(a.zfill(2), *b) for a, *b in map(lambda x: re.split(r'[/\s]+', x), dates)]
Output:
['01/26/2018', '01/26/2018', '02/2/2018', '02/2/2018', '02/9/2018', '02/9/2018', '01/19/2018', '01/19/2018', '01/26/2018', '01/26/2018', '02/2/2018', '02/2/2018', '02/9/2018']
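Once the strings are normalized, they parse cleanly with a fixed format. A small follow-up sketch using the new_dates list from above:
import datetime

parsed = [datetime.datetime.strptime(d, '%m/%d/%Y') for d in new_dates]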
I have an rruleset with a daily recurrence rule, and now I am trying to combine an RDATE with an EXRULE.
from datetime import datetime
from dateutil.rrule import rruleset, rrule, DAILY, FR
rules = rruleset()
daily = rrule(freq=DAILY, dtstart=datetime(2022, 10, 12))
rules.rrule(daily)
not_on_friday = rrule(freq=DAILY, byweekday=FR, dtstart=datetime(2022, 10, 12))
but_on_friday_21th = datetime(2022, 10, 21)
rules.exrule(not_on_friday)
rules.rdate(but_on_friday_21th)
rules.between(datetime(2022,10,12), datetime(2022,10,24))
>>
[datetime.datetime(2022, 10, 13, 0, 0),
datetime.datetime(2022, 10, 15, 0, 0), # the 14th is excluded as expected
datetime.datetime(2022, 10, 16, 0, 0),
datetime.datetime(2022, 10, 17, 0, 0),
datetime.datetime(2022, 10, 18, 0, 0),
datetime.datetime(2022, 10, 19, 0, 0),
datetime.datetime(2022, 10, 20, 0, 0),
datetime.datetime(2022, 10, 22, 0, 0), # but the 21st is also excluded
datetime.datetime(2022, 10, 23, 0, 0)]
Now, confusingly, when I combine my EXRULE with an EXDATE it works:
rules = rruleset()
daily = rrule(freq=DAILY, dtstart=datetime(2022, 10, 12))
rules.rrule(daily)
not_on_friday = rrule(freq=DAILY, byweekday=FR, dtstart=datetime(2022, 10, 12))
but_also_not_on_the_22th_a_saturday = datetime(2022, 10, 22)
rules.exrule(not_on_friday)
rules.exdate(but_also_not_on_the_22th_a_saturday)
rules.between(datetime(2022,10,12), datetime(2022,10,24))
>>
[datetime.datetime(2022, 10, 13, 0, 0),
datetime.datetime(2022, 10, 15, 0, 0), # the 14th still excluded
datetime.datetime(2022, 10, 16, 0, 0),
datetime.datetime(2022, 10, 17, 0, 0),
datetime.datetime(2022, 10, 18, 0, 0),
datetime.datetime(2022, 10, 19, 0, 0),
datetime.datetime(2022, 10, 20, 0, 0), # the 22nd is also excluded as expected
datetime.datetime(2022, 10, 23, 0, 0)]
So, if possible at all, how to combine RDATE and EXRULE in my rruleset?
In your answer you note that exrule is applied last, after all inclusive rules, which does indeed appear to be what the RFC says. However, at least in dateutil, you can use an rruleset as the argument to exrule. So, to accomplish what you want, you can filter the date you want included out of the rule set that gets passed to exrule, like so:
from datetime import datetime
from dateutil.rrule import rruleset, rrule, DAILY, WEEKLY, FR
# Create an rruleset that defaults to every day
rules = rruleset()
daily = rrule(freq=DAILY, dtstart=datetime(2022, 10, 12))
rules.rrule(daily)
# Create an rruleset corresponding to the days we want to *exclude*: every
# Friday, except 2022-10-21
ex_set = rruleset()
ex_set.rrule(rrule(freq=WEEKLY, byweekday=FR, dtstart=datetime(2022, 10, 14)))
ex_set.exdate(datetime(2022, 10, 21))
# Use our second rule set as an exrule
rules.exrule(ex_set)
rules.between(datetime(2022,10,12), datetime(2022,10,24))
Since the date you want to include never appears in the exrule, it is not filtered out:
>>> print("\n".join(map(str,
... map(datetime.date,
... rules.between(datetime(2022, 10, 12),
... datetime(2022, 10, 24))))))
2022-10-13
2022-10-15
2022-10-16
2022-10-17
2022-10-18
2022-10-19
2022-10-20
2022-10-21
2022-10-22
2022-10-23
So apparently there is no such thing as an EXRULE in the iCalendar specs; it's just RRULEs. And dateutil's exrule method states in its docstring:
def exrule(self, exrule):
""" Include the given rrule instance in the recurrence set exclusion
list. Dates which are part of the given recurrence rules will not
be generated, even if some inclusive rrule or rdate matches them.
"""
So even if I add an RDATE, if it is excluded by a rule added via exrule, it will not show up in my occurrences. The same goes for exdate, hence my working second example.
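A minimal sketch of that precedence in isolation: an rdate that is also matched by an exrule never makes it into the set.
from datetime import datetime
from dateutil.rrule import rruleset, rrule, DAILY, FR

rs = rruleset()
rs.rdate(datetime(2022, 10, 21))  # a Friday, added explicitly
rs.exrule(rrule(freq=DAILY, byweekday=FR, dtstart=datetime(2022, 10, 12)))  # excludes every Friday
print(list(rs))  # [] -- the exrule wins over the rdate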
I have data stored in an S3 bucket which uses the "yyyy/MM/dd" format to store files per date, as in this sample S3a path: s3a://mybucket/data/2018/07/03. The files in these buckets are in json.gz format, and I would like to import all the files for each day into a Spark dataframe. After that I want to feed these Spark dfs to some existing code via a for loop:
for date in date_range:
s3a = 's3a://mybucket/data/{}/{}/{}/*.json.gz'.format(date.year, date.month, date.day)
df = spark.read.format('json').option("header", "true").load(s3a)
# Execute code here
In order to read the data, I tried to format the date_range like below:
from datetime import datetime
import pandas as pd
def return_date_range(start_date, end_date):
return pd.date_range(start=start_date, end=end_date).to_pydatetime().tolist()
date_range = return_date_range(start_date='2018-03-06', end_date='2018-03-12')
date_range
[datetime.datetime(2018, 3, 6, 0, 0),
datetime.datetime(2018, 3, 7, 0, 0),
datetime.datetime(2018, 3, 8, 0, 0),
datetime.datetime(2018, 3, 9, 0, 0),
datetime.datetime(2018, 3, 10, 0, 0),
datetime.datetime(2018, 3, 11, 0, 0),
datetime.datetime(2018, 3, 12, 0, 0)]
The problem is that to_pydatetime() returns the days and months without a leading '0'. How do I make sure that my code returns a list of values with leading '0's, like below:
[datetime.datetime(2018, 03, 06, 0, 0),
datetime.datetime(2018, 03, 07, 0, 0),
datetime.datetime(2018, 03, 08, 0, 0),
datetime.datetime(2018, 03, 09, 0, 0),
datetime.datetime(2018, 03, 10, 0, 0),
datetime.datetime(2018, 03, 11, 0, 0),
datetime.datetime(2018, 03, 12, 0, 0)]
This is one approach using .strftime("%Y/%m/%d")
Ex:
from datetime import datetime
import pandas as pd
def return_date_range(start_date, end_date):
return pd.date_range(start=start_date, end=end_date).strftime("%Y/%m/%d").tolist()
date_range = return_date_range(start_date='2018-03-06', end_date='2018-03-12')
print(date_range)
Output:
['2018/03/06',
'2018/03/07',
'2018/03/08',
'2018/03/09',
'2018/03/10',
'2018/03/11',
'2018/03/12']
for date in date_range:
s3a = 's3a://mybucket/data/{}/*.json.gz'.format(date)
print(s3a)
s3a://mybucket/data/2018/03/06/*.json.gz
s3a://mybucket/data/2018/03/07/*.json.gz
s3a://mybucket/data/2018/03/08/*.json.gz
s3a://mybucket/data/2018/03/09/*.json.gz
s3a://mybucket/data/2018/03/10/*.json.gz
s3a://mybucket/data/2018/03/11/*.json.gz
s3a://mybucket/data/2018/03/12/*.json.gz
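Alternatively, since zero-padding is purely a formatting concern, you could keep the datetime objects and only pad when building the path. A sketch assuming date_range is the original list of datetime objects from your return_date_range():
for date in date_range:
    s3a = 's3a://mybucket/data/{:%Y/%m/%d}/*.json.gz'.format(date)
    print(s3a)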
I am trying to code a function called days15(). The function will be passed an argument called 'myDateStr'. myDateStr is a string representation of a date in the form 20170817 (that is, YearMonthDay). The code in the function will create a datetime object from the string; it will then create a timedelta object with a length of 1 day. Then it will use a list comprehension to produce a list of 15 datetime objects, starting with the date that is passed to the function.
the function should return the following list.
[datetime.datetime(2017, 8, 17, 0, 0), datetime.datetime(2017, 8, 18, 0, 0), datetime.datetime(2017, 8, 19, 0, 0), datetime.datetime(2017, 8, 20, 0, 0), datetime.datetime(2017, 8, 21, 0, 0), datetime.datetime(2017, 8, 22, 0, 0), datetime.datetime(2017, 8, 23, 0, 0), datetime.datetime(2017, 8, 24, 0, 0), datetime.datetime(2017, 8, 25, 0, 0), datetime.datetime(2017, 8, 26, 0, 0), datetime.datetime(2017, 8, 27, 0, 0), datetime.datetime(2017, 8, 28, 0, 0), datetime.datetime(2017, 8, 29, 0, 0), datetime.datetime(2017, 8, 30, 0, 0), datetime.datetime(2017, 8, 31, 0, 0)]
I am stuck on the code. I have started with the below. Please help. Thanks
from datetime import datetime, timedelta
myDateStr = '20170817'
def days15(myDateStr):
Pandas will help you in converting strings to datetime, so first you need to import it:
from datetime import datetime, timedelta
import pandas as pd
myDateStr = '20170817'
Then you can initialize an empty list that you'll later append:
datelist = []
And then you write a function:
def days15(myDateStr):
    # convert the string to a plain datetime object
    date = pd.to_datetime(myDateStr).to_pydatetime()
    # loop to create 15 datetimes, one day apart
    for i in range(15):
        newdate = date + timedelta(days=i)
        # add each new date to the list
        datelist.append(newdate)
    return datelist
and then you can call your function and get a list of 15 datetimes:
days15(myDateStr)
As you said, there will be two steps to implement: firstly, convert the string date to a datetime object and secondly, iterate over the next 15 days using timedelta, with a list comprehension or a simple loop.
from datetime import datetime, timedelta
myDateStr = '20170817'
# Parse the string and return a datetime object
def getDateTime(date):
return datetime(int(date[:4]),int(date[4:6]),int(date[6:]))
# Iterate over the timedelta added to the starting date
def days15(myDateStr):
return [getDateTime(myDateStr) + timedelta(days=x) for x in range(15)]
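If you prefer the standard library's parser over manual slicing, datetime.strptime handles the '%Y%m%d' format directly. A small sketch of the same idea:
from datetime import datetime, timedelta

def days15(myDateStr):
    start = datetime.strptime(myDateStr, '%Y%m%d')
    return [start + timedelta(days=x) for x in range(15)]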
When creating a pandas dataframe object (python 2.7.9, pandas 0.16.2), the first datetime field gets automatically converted into a pandas timestamp. Why? Is it possible to prevent this so as to keep the field in the original type?
Please see code below:
import numpy as np
import datetime
import pandas
create a dict:
x = {'cusip': np.array(['10553M10', '67085120', '67085140'], dtype='|S8'),
'vstart':np.array([datetime.datetime(2001, 11, 16, 0, 0),
datetime.datetime(2012, 2, 28, 0, 0), datetime.datetime(2014, 12, 22, 0, 0)],
dtype=object),
'vstop': np.array([datetime.datetime(2012, 2, 28, 0, 0),
datetime.datetime(2014, 12, 22, 0, 0), datetime.datetime(9999, 12, 31, 0, 0)],
dtype=object),
'id': np.array(['EQ0000000000041095', 'EQ0000000000041095', 'EQ0000000000041095'],
dtype='|S18')}
So, the vstart and vstop keys are datetime so far. However, after:
df = pandas.DataFrame(data = x)
the vstart becomes a pandas Timestamp automatically while vstop remains a datetime
type(df.vstart[0])
#class 'pandas.tslib.Timestamp'
type(df.vstop[0])
#type 'datetime.datetime'
I don't understand why the first datetime column that the constructor comes across gets converted to Timestamp by pandas, or how to tell pandas to keep the data types as they are. Can you help? Thank you.
Actually, I've noticed something in your data. It has nothing to do with it being the first or second date column: in your vstop column there is a datetime with the value dt.datetime(9999, 12, 31, 0, 0). If you change the year on this date to a normal year, 2020 for example, both columns will be treated the same.
Just note that I'm importing the datetime module as dt:
import datetime as dt
import numpy as np
import pandas as pd

x = {'cusip': np.array(['10553M10', '67085120', '67085140'], dtype='|S8'),
'vstop': np.array([dt.datetime(2012, 2, 28, 0, 0), dt.datetime(2014, 12, 22, 0, 0), dt.datetime(2020, 12, 31, 0, 0)], dtype=object),
'vstart': np.array([dt.datetime(2001, 11, 16, 0, 0),dt.datetime(2012, 2, 28, 0, 0), dt.datetime(2014, 12, 22, 0, 0)], dtype=object),
'id': np.array(['EQ0000000000041095', 'EQ0000000000041095', 'EQ0000000000041095'], dtype='|S18')}
In [27]:
df = pd.DataFrame(x)
df
Out[27]:
cusip id vstart vstop
10553M10 EQ0000000000041095 2001-11-16 2012-02-28
67085120 EQ0000000000041095 2012-02-28 2014-12-22
67085140 EQ0000000000041095 2014-12-22 2020-12-31
In [25]:
type(df.vstart[0])
Out[25]:
pandas.tslib.Timestamp
In [26]:
type(df.vstop[0])
Out[26]:
pandas.tslib.Timestamp
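The underlying reason is that pandas only converts a column of datetimes to its native datetime64[ns] dtype when every value fits inside the nanosecond-resolution Timestamp range, and datetime(9999, 12, 31) is far outside it. You can check the bounds yourself:
import pandas as pd

print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807
# datetime(9999, 12, 31) is outside this range, so that column stays dtype=object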
I have a list of birthdays stored in datetime objects. How would one go about sorting these in Python using only the month and day arguments?
For example,
[
datetime.datetime(1983, 1, 1, 0, 0),
datetime.datetime(1996, 1, 13, 0, 0),
datetime.datetime(1976, 2, 6, 0, 0),
...
]
Thanks! :)
You can use month and day to create a value that can be used for sorting:
birthdays.sort(key = lambda d: (d.month, d.day))
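If you would rather keep the original list untouched, sorted() returns a new list using the same key:
sorted_birthdays = sorted(birthdays, key=lambda d: (d.month, d.day))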
l.sort(key=lambda x: x.timetuple()[1:3])  # timetuple()[1:3] is (month, day)
If the dates are stored as strings—you say they aren't, although it looks like they are—you might use dateutil's parser:
>>> from dateutil.parser import parse
>>> from pprint import pprint
>>> bd = ['February 6, 1976','January 13, 1996','January 1, 1983']
>>> bd = [parse(i) for i in bd]
>>> pprint(bd)
[datetime.datetime(1976, 2, 6, 0, 0),
datetime.datetime(1996, 1, 13, 0, 0),
datetime.datetime(1983, 1, 1, 0, 0)]
>>> bd.sort(key = lambda d: (d.month, d.day)) # from sth's answer
>>> pprint(bd)
[datetime.datetime(1983, 1, 1, 0, 0),
datetime.datetime(1996, 1, 13, 0, 0),
datetime.datetime(1976, 2, 6, 0, 0)]
If your dates are in different formats, you might give fuzzy parsing a shot:
>>> bd = [parse(i,fuzzy=True) for i in bd] # replace line 4 above with this line
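With fuzzy=True the parser skips tokens it does not recognize, so strings with extra words around the date still parse:
>>> parse('born on February 6, 1976', fuzzy=True)
datetime.datetime(1976, 2, 6, 0, 0)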