I have data stored in an S3 bucket that uses a "yyyy/MM/dd" layout to store the files per date, as in this sample S3a path: s3a://mybucket/data/2018/07/03. The files in these folders are in json.gz format, and I would like to load each day's files into a Spark DataFrame. After that I want to feed these DataFrames to some existing code via a for loop:
for date in date_range:
    s3a = 's3a://mybucket/data/{}/{}/{}/*.json.gz'.format(date.year, date.month, date.day)
    df = spark.read.format('json').option("header", "true").load(s3a)
    # Execute code here
In order to read the data, I tried to format the date_range like below:
from datetime import datetime
import pandas as pd
def return_date_range(start_date, end_date):
    return pd.date_range(start=start_date, end=end_date).to_pydatetime().tolist()
date_range = return_date_range(start_date='2018-03-06', end_date='2018-03-12')
date_range
[datetime.datetime(2018, 3, 6, 0, 0),
datetime.datetime(2018, 3, 7, 0, 0),
datetime.datetime(2018, 3, 8, 0, 0),
datetime.datetime(2018, 3, 9, 0, 0),
datetime.datetime(2018, 3, 10, 0, 0),
datetime.datetime(2018, 3, 11, 0, 0),
datetime.datetime(2018, 3, 12, 0, 0)]
The problem is that to_pydatetime() returns the days and months without a leading '0'. How do I make sure that my code returns a list of zero-padded values, like below:
[datetime.datetime(2018, 03, 06, 0, 0),
datetime.datetime(2018, 03, 07, 0, 0),
datetime.datetime(2018, 03, 08, 0, 0),
datetime.datetime(2018, 03, 09, 0, 0),
datetime.datetime(2018, 03, 10, 0, 0),
datetime.datetime(2018, 03, 11, 0, 0),
datetime.datetime(2018, 03, 12, 0, 0)]
One approach is to format the dates with .strftime("%Y/%m/%d"), which gives zero-padded strings instead of datetime objects.
Ex:
from datetime import datetime
import pandas as pd
def return_date_range(start_date, end_date):
    return pd.date_range(start=start_date, end=end_date).strftime("%Y/%m/%d").tolist()
date_range = return_date_range(start_date='2018-03-06', end_date='2018-03-12')
print(date_range)
Output:
['2018/03/06',
'2018/03/07',
'2018/03/08',
'2018/03/09',
'2018/03/10',
'2018/03/11',
'2018/03/12']
for date in date_range:
    s3a = 's3a://mybucket/data/{}/*.json.gz'.format(date)
    print(s3a)
s3a://mybucket/data/2018/03/06/*.json.gz
s3a://mybucket/data/2018/03/07/*.json.gz
s3a://mybucket/data/2018/03/08/*.json.gz
s3a://mybucket/data/2018/03/09/*.json.gz
s3a://mybucket/data/2018/03/10/*.json.gz
s3a://mybucket/data/2018/03/11/*.json.gz
s3a://mybucket/data/2018/03/12/*.json.gz
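If you would rather keep real datetime objects in date_range (for example to do date arithmetic later), another option is to apply the zero padding only when the path is built. A minimal sketch, assuming the bucket layout from the question; build_s3_path is a hypothetical helper name, and the standard library stands in for pandas to keep the example self-contained:

```python
from datetime import datetime, timedelta


def build_s3_path(date):
    # datetime objects accept strftime codes inside a format spec,
    # so %m and %d are zero-padded here (hypothetical helper).
    return 's3a://mybucket/data/{:%Y/%m/%d}/*.json.gz'.format(date)


# Same range as in the question, built with the standard library only.
start = datetime(2018, 3, 6)
date_range = [start + timedelta(days=i) for i in range(7)]
paths = [build_s3_path(d) for d in date_range]

for path in paths:
    print(path)
```

The list still holds datetime objects, so the loop body can use both the date and the formatted path.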
I have a rruleset with a daily recurrence rule and now I am trying to combine an RDATE with an EXRULE.
from datetime import datetime
from dateutil.rrule import rruleset, rrule, DAILY, FR
rules = rruleset()
daily = rrule(freq=DAILY, dtstart=datetime(2022, 10, 12))
rules.rrule(daily)
not_on_friday = rrule(freq=DAILY, byweekday=FR, dtstart=datetime(2022, 10, 12))
but_on_friday_21th = datetime(2022, 10, 21)
rules.exrule(not_on_friday)
rules.rdate(but_on_friday_21th)
rules.between(datetime(2022,10,12), datetime(2022,10,24))
>>
[datetime.datetime(2022, 10, 13, 0, 0),
datetime.datetime(2022, 10, 15, 0, 0), # the 14th is excluded as expected
datetime.datetime(2022, 10, 16, 0, 0),
datetime.datetime(2022, 10, 17, 0, 0),
datetime.datetime(2022, 10, 18, 0, 0),
datetime.datetime(2022, 10, 19, 0, 0),
datetime.datetime(2022, 10, 20, 0, 0),
datetime.datetime(2022, 10, 22, 0, 0), # but the 21th is also excluded
datetime.datetime(2022, 10, 23, 0, 0)]
Now, confusingly, when I combine my EXRULE with an EXDATE it works:
rules = rruleset()
daily = rrule(freq=DAILY, dtstart=datetime(2022, 10, 12))
rules.rrule(daily)
not_on_friday = rrule(freq=DAILY, byweekday=FR, dtstart=datetime(2022, 10, 12))
but_also_not_on_the_22th_a_saturday = datetime(2022, 10, 22)
rules.exrule(not_on_friday)
rules.exdate(but_also_not_on_the_22th_a_saturday)
rules.between(datetime(2022,10,12), datetime(2022,10,24))
>>
[datetime.datetime(2022, 10, 13, 0, 0),
datetime.datetime(2022, 10, 15, 0, 0), # the 14th still excluded
datetime.datetime(2022, 10, 16, 0, 0),
datetime.datetime(2022, 10, 17, 0, 0),
datetime.datetime(2022, 10, 18, 0, 0),
datetime.datetime(2022, 10, 19, 0, 0),
datetime.datetime(2022, 10, 20, 0, 0), # the 22th also excluded as expected
datetime.datetime(2022, 10, 23, 0, 0)]
So, if possible at all, how to combine RDATE and EXRULE in my rruleset?
In your answer you note that exrule is applied last, after all inclusive rules, which does appear to match the RFC. However, at least in dateutil, you can pass an rruleset as the argument to exrule. So to get what you want, you can filter the date that you want included out of the rule set that gets passed to exrule, like so:
from datetime import datetime
from dateutil.rrule import rruleset, rrule, DAILY, WEEKLY, FR
# Create an rruleset that defaults to every day
rules = rruleset()
daily = rrule(freq=DAILY, dtstart=datetime(2022, 10, 12))
rules.rrule(daily)
# Create an rruleset corresponding to the days we want to *exclude*: every
# Friday, except 2022-10-21
ex_set = rruleset()
ex_set.rrule(rrule(freq=WEEKLY, byweekday=FR, dtstart=datetime(2022, 10, 14)))
ex_set.exdate(datetime(2022, 10, 21))
# Use our second rule set as an exrule
rules.exrule(ex_set)
rules.between(datetime(2022,10,12), datetime(2022,10,24))
Since the date you want to include never appears in the exrule, it is not filtered out:
>>> print("\n".join(map(str,
... map(datetime.date,
... rules.between(datetime(2022, 10, 12),
... datetime(2022, 10, 24))))))
2022-10-13
2022-10-15
2022-10-16
2022-10-17
2022-10-18
2022-10-19
2022-10-20
2022-10-21
2022-10-22
2022-10-23
So apparently there is no such thing as an EXRULE in the iCalendar specs. It's just RRULEs. And dateutil's exrule method states in its docstring:
def exrule(self, exrule):
    """ Include the given rrule instance in the recurrence set exclusion
        list. Dates which are part of the given recurrence rules will not
        be generated, even if some inclusive rrule or rdate matches them.
    """
So, even if I add an RDATE, if it is excluded by a rule added via exrule it will not show up in my occurrences. The same goes for the exdate function, hence my working second example.
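The precedence described in that docstring can be modelled in a few lines of plain Python. This is a simplified sketch of the behaviour, not dateutil's implementation; the occurrences function and its parameters are hypothetical names, and each rule is represented as a ready-made list of datetimes:

```python
from datetime import datetime, timedelta


def occurrences(rrule_dates, rdates, exrule_dates, exdates):
    # Inclusions are collected first, exclusions are removed afterwards,
    # so an excluded date never survives, even if an rdate names it.
    included = set(rrule_dates) | set(rdates)
    excluded = set(exrule_dates) | set(exdates)
    return sorted(included - excluded)


start = datetime(2022, 10, 12)
daily = [start + timedelta(days=i) for i in range(12)]   # Oct 12 .. Oct 23
fridays = [d for d in daily if d.weekday() == 4]         # Oct 14 and Oct 21
friday_21 = datetime(2022, 10, 21)

# An rdate cannot rescue the 21st while the Friday exclusion covers it.
blocked = occurrences(daily, [friday_21], fridays, [])

# Taking the 21st out of the exclusion side is what brings it back.
allowed = occurrences(daily, [], [d for d in fridays if d != friday_21], [])

print(friday_21 in blocked, friday_21 in allowed)
```

Seen this way, the only way to keep a date is to make sure it never enters the exclusion side in the first place.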
I would like to filter a pandas DataFrame by timestamp. This works fine for all hours except 0. If I filter for dt.hour == 0, only the date is displayed and not the time. How can I have the time displayed too?
import datetime
import pandas as pd
df = pd.DataFrame({'datetime': [datetime.datetime(2005, 7, 14, 12, 30),
                                datetime.datetime(2005, 7, 14, 0, 0),
                                datetime.datetime(2005, 7, 14, 10, 30),
                                datetime.datetime(2005, 7, 14, 15, 30)]})
print(df[df['datetime'].dt.hour == 10])
print(df[df['datetime'].dt.hour == 0])
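When every displayed value falls exactly on midnight, pandas prints only the date part. Use strftime to format the values explicitly: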
print(df[df['datetime'].dt.hour == 0].datetime.dt.strftime("%Y-%m-%d %H:%M:%S"))
The result is:
1 2005-07-14 00:00:00
Name: datetime, dtype: object
I am trying to code a function called days15(). The function will be passed an argument called ‘myDateStr’. myDateStr is a string representation of a date in the form 20170817 (that is, YearMonthDay). The code in the function will create a datetime object from the string, then create a timedelta object with a length of 1 day. Then it will use a list comprehension to produce a list of 15 datetime objects, starting with the date that is passed to the function.
The function should return the following list.
[datetime.datetime(2017, 8, 17, 0, 0), datetime.datetime(2017, 8, 18, 0, 0), datetime.datetime(2017, 8, 19, 0, 0), datetime.datetime(2017, 8, 20, 0, 0), datetime.datetime(2017, 8, 21, 0, 0), datetime.datetime(2017, 8, 22, 0, 0), datetime.datetime(2017, 8, 23, 0, 0), datetime.datetime(2017, 8, 24, 0, 0), datetime.datetime(2017, 8, 25, 0, 0), datetime.datetime(2017, 8, 26, 0, 0), datetime.datetime(2017, 8, 27, 0, 0), datetime.datetime(2017, 8, 28, 0, 0), datetime.datetime(2017, 8, 29, 0, 0), datetime.datetime(2017, 8, 30, 0, 0), datetime.datetime(2017, 8, 31, 0, 0)]
I am stuck on the code. I have started with the code below. Please help. Thanks
from datetime import datetime, timedelta
myDateStr = '20170817'
def days15(myDateStr):
Pandas will help you in converting strings to datetime, so first you need to import it:
from datetime import datetime, timedelta
import pandas as pd
myDateStr = '20170817'
Then you write a function that initializes an empty list, appends the dates to it, and returns it:
def days15(myDateStr):
    # the list is local, so repeated calls do not accumulate dates
    datelist = []
    # converting to a plain datetime.datetime (to_datetime returns a pandas Timestamp)
    date = pd.to_datetime(myDateStr).to_pydatetime()
    # loop to create 15 datetimes
    for i in range(15):
        newdate = date + timedelta(days=i)
        # adding new dates to the list
        datelist.append(newdate)
    return datelist
And then you can call your function and get a list of 15 datetimes:
datelist = days15(myDateStr)
As you said, there will be two steps to implement: firstly, convert the string date to a datetime object and secondly, iterate over the next 15 days using timedelta, with a list comprehension or a simple loop.
from datetime import datetime, timedelta
myDateStr = '20170817'
# Parse the string and return a datetime object
def getDateTime(date):
    return datetime(int(date[:4]), int(date[4:6]), int(date[6:]))
# Iterate over the timedelta added to the starting date
def days15(myDateStr):
    return [getDateTime(myDateStr) + timedelta(days=x) for x in range(15)]
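The manual slicing in getDateTime can also be replaced by datetime.strptime, which parses the whole string against a format. A minimal sketch of that variant, under the assumption that the input always has the form YYYYMMDD:

```python
from datetime import datetime, timedelta


def days15(myDateStr):
    # %Y%m%d matches strings such as '20170817'; invalid input raises ValueError.
    start = datetime.strptime(myDateStr, '%Y%m%d')
    one_day = timedelta(days=1)
    # 15 consecutive days, beginning with the start date itself.
    return [start + one_day * i for i in range(15)]


result = days15('20170817')
print(result[0], result[-1])
```

This keeps the explicit one-day timedelta that the exercise asks for and lets strptime reject malformed dates instead of silently slicing them.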
I'm using rrule from python dateutil and don't know how to create an rruleset for the following example:
Monday, three weeks in a row. Then a week not, then again three weeks in a row, one not, and so on.
Any advice on creating an rrule(set) for this?
One way to do this is to use an rruleset with a WEEKLY rrule and a corresponding exrule for every 4th week:
from dateutil.rrule import rrule, rruleset
from dateutil.rrule import WEEKLY
from dateutil.relativedelta import relativedelta
from datetime import datetime, timedelta
dtstart = datetime(2011, 1, 1)
rrset = rruleset()
weekly_rule = rrule(freq=WEEKLY, dtstart=dtstart)
every_4_weeks = rrule(freq=WEEKLY, interval=4,
                      dtstart=dtstart + relativedelta(weeks=4))
rrset.rrule(weekly_rule)
rrset.exrule(every_4_weeks)
rrset.between(dtstart, dtstart + timedelta(days=65))
The result:
[datetime.datetime(2011, 1, 8, 0, 0),
datetime.datetime(2011, 1, 15, 0, 0),
datetime.datetime(2011, 1, 22, 0, 0),
datetime.datetime(2011, 2, 5, 0, 0),
datetime.datetime(2011, 2, 12, 0, 0),
datetime.datetime(2011, 2, 19, 0, 0),
datetime.datetime(2011, 3, 5, 0, 0)]
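The way it works is that weekly_rule generates one date per week, and every_4_weeks, used as an exclusion, removes every 4th week, starting with the 4th week after dtstart. That gives you a 3-on, 1-off schedule. Note that between() leaves out its endpoints by default, so dtstart itself (2011-01-01) is not listed in the result; counting it, the first block has four dates.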
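For comparison, the same three-on, one-off idea can be written without dateutil by counting weeks and skipping every fourth one. This is a plain-Python sketch rather than an rruleset; three_on_one_off is a hypothetical helper, and the Monday start date is an assumption chosen to match the question:

```python
from datetime import datetime, timedelta


def three_on_one_off(first_monday, weeks):
    # Week indices 0, 1, 2 are kept and index 3 is skipped; then the cycle repeats.
    return [first_monday + timedelta(weeks=i)
            for i in range(weeks)
            if i % 4 != 3]


# 2011-01-03 is a Monday; look at the first eight weeks.
mondays = three_on_one_off(datetime(2011, 1, 3), weeks=8)

for day in mondays:
    print(day.date())
```

Unlike the rruleset, this produces a finite list, so it only suits cases where the number of weeks is known up front.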
When creating a pandas dataframe object (python 2.7.9, pandas 0.16.2), the first datetime field gets automatically converted into a pandas timestamp. Why? Is it possible to prevent this so as to keep the field in the original type?
Please see code below:
import numpy as np
import datetime
import pandas
Create a dict:
x = {'cusip': np.array(['10553M10', '67085120', '67085140'], dtype='|S8'),
     'vstart': np.array([datetime.datetime(2001, 11, 16, 0, 0),
                         datetime.datetime(2012, 2, 28, 0, 0),
                         datetime.datetime(2014, 12, 22, 0, 0)],
                        dtype=object),
     'vstop': np.array([datetime.datetime(2012, 2, 28, 0, 0),
                        datetime.datetime(2014, 12, 22, 0, 0),
                        datetime.datetime(9999, 12, 31, 0, 0)],
                       dtype=object),
     'id': np.array(['EQ0000000000041095', 'EQ0000000000041095', 'EQ0000000000041095'],
                    dtype='|S18')}
So, the vstart and vstop keys are datetime so far. However, after:
df = pandas.DataFrame(data = x)
the vstart becomes a pandas Timestamp automatically while vstop remains a datetime
type(df.vstart[0])
#class 'pandas.tslib.Timestamp'
type(df.vstop[0])
#type 'datetime.datetime'
I don't understand why the first datetime column that the constructor comes across gets converted to Timestamp by pandas. And how to tell pandas to keep the data types as they are. Can you help? Thank you.
Actually, I noticed something in your data: this has nothing to do with which date column comes first. Your vstop column contains the value dt.datetime(9999, 12, 31, 0, 0), which lies outside the range a pandas Timestamp can represent (it ends in the year 2262), so pandas leaves that column as plain datetime objects. If you change the year on this date to a normal year, 2020 for example, both columns are treated the same.
Note that I'm importing the datetime module as dt.
x = {'cusip': np.array(['10553M10', '67085120', '67085140'], dtype='|S8'),
'vstop': np.array([dt.datetime(2012, 2, 28, 0, 0), dt.datetime(2014, 12, 22, 0, 0), dt.datetime(2020, 12, 31, 0, 0)], dtype=object),
'vstart': np.array([dt.datetime(2001, 11, 16, 0, 0),dt.datetime(2012, 2, 28, 0, 0), dt.datetime(2014, 12, 22, 0, 0)], dtype=object),
'id': np.array(['EQ0000000000041095', 'EQ0000000000041095', 'EQ0000000000041095'], dtype='|S18')}
In [27]:
df = pd.DataFrame(x)
df
Out[27]:
      cusip                  id     vstart      vstop
0  10553M10  EQ0000000000041095 2001-11-16 2012-02-28
1  67085120  EQ0000000000041095 2012-02-28 2014-12-22
2  67085140  EQ0000000000041095 2014-12-22 2020-12-31
In [25]:
type(df.vstart[0])
Out[25]:
pandas.tslib.Timestamp
In [26]:
type(df.vstop[0])
Out[26]:
pandas.tslib.Timestamp
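The year 9999 matters because, in the pandas versions discussed here, a Timestamp is stored as a signed 64-bit count of nanoseconds since 1970-01-01. The arithmetic can be checked with the standard library alone; fits_in_ns_int64 is a hypothetical helper written for this illustration, not a pandas function:

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
INT64_MAX = 2**63 - 1


def fits_in_ns_int64(value):
    # Exact integer arithmetic: whole microseconds since the epoch, times 1000.
    microseconds = (value - EPOCH) // timedelta(microseconds=1)
    return abs(microseconds * 1000) <= INT64_MAX


print(fits_in_ns_int64(datetime(2020, 12, 31)))  # inside the range
print(fits_in_ns_int64(datetime(9999, 12, 31)))  # far outside the range

# Latest instant that still fits, truncated to whole microseconds.
latest = EPOCH + timedelta(microseconds=INT64_MAX // 1000)
print(latest.year)
```

A sentinel date such as 9999-12-31 therefore cannot be converted, which is why a column containing it stays as Python datetime objects.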