Hi there. I'm trying to write a web scraper using Python and first need to create a column of dates. I've built the list I need, but its type keeps coming out as NoneType. Any ideas on how to get this to work as a dataframe?
Relevant part of code:
import datetime
from datetime import date
date1 = '2019-01-01'
date2 = '2019-01-30'
start = datetime.datetime.strptime(date1,'%Y-%m-%d')
end = datetime.datetime.strptime(date2,'%Y-%m-%d')
step = datetime.timedelta(days=1)
while start <= end:
    daterange = print(start.strftime('%Y%m%d'))
    start += step
type(daterange)
Thanks in advance!
Here
daterange = print(start.strftime('%Y%m%d'))
should be
daterange = start.strftime('%Y%m%d')
EXTRA:
if you want to save the daterange:
import datetime
from datetime import date
date1 = '2019-01-01'
date2 = '2019-01-30'
daterange_list = []
start = datetime.datetime.strptime(date1,'%Y-%m-%d')
end = datetime.datetime.strptime(date2,'%Y-%m-%d')
step = datetime.timedelta(days=1)
while start <= end:
    daterange = start.strftime('%Y%m%d')
    daterange_list.append(daterange)
    start += step
type(daterange)
str
type(daterange_list)
list
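To see why the original version ended up with NoneType, here is a minimal sketch (the variable names printed and formatted are only illustrative): print() writes to the screen and returns None, while strftime() returns the string itself.

```python
from datetime import datetime

stamp = datetime(2019, 1, 1)

# print() shows the text but hands back None, so this variable holds None
printed = print(stamp.strftime('%Y%m%d'))

# strftime() returns the formatted string, which is what we want to keep
formatted = stamp.strftime('%Y%m%d')

print(type(printed).__name__, type(formatted).__name__)
```

If a dataframe is the goal, the finished list can then be handed to pandas, for example pd.DataFrame({'date': daterange_list}), where the column name 'date' is just a placeholder.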
I'm not a Python developer but have to fix an existing code.
In this code, a method (extract) is called providing an interval of dates:
extract(start_date, end_date)
The parameters can have, for example, the values:
start_date : 2020-10-01
end_date : 2022-01-03
The problem
The issue with this call is that the extract method only supports date intervals of at most one year. If the interval is longer, it must be split, for example as follows:
extract('2020-10-01', '2020-12-31')
extract('2021-01-01', '2021-12-31')
extract('2022-01-01', '2022-01-03')
So I'm trying to create a loop where the start_date and end_date are computed dynamically. But being new to Python, I have no idea yet how this can be done. Any help would be greatly appreciated.
EDIT
Answer to some comments here
What I have tried so far, starting from code like this:
from datetime import datetime
from dateutil import relativedelta
from_date = datetime.strptime('2020-10-01', "%Y-%m-%d")
end_date = datetime.strptime('2022-01-03', "%Y-%m-%d")
# Get the interval between the two dates
diff = relativedelta.relativedelta(end_date, from_date)
Then I thought of iterating across the years using diff.years and adding some logic to build the start_date and end_date from there, but I thought there might be a much simpler approach.
I also saw other possibilities like here, but have not found a simple final result yet.
from_str = '2020-10-01'
end_str = '2022-01-03'
from_year = int(from_str[:4])
end_year = int(end_str[:4])
if from_year != end_year:
    # from_date to end of first year
    extract(from_str, f"{from_year}-12-31")
    # full years
    for y in range(from_year + 1, end_year):
        extract(f"{y}-01-01", f"{y}-12-31")
    # rest
    extract(f"{end_year}-01-01", end_str)
else:
    extract(from_str, end_str)
As mentioned in the comments, you can either use the datetime library or, if you prefer, pandas. The pandas version is the following (admittedly not the prettiest, but it does the job):
import pandas as pd
import datetime
start = datetime.datetime(2020,10,1)
end = datetime.datetime(2022,1,3)
def extract(from_dt, to_dt):
    print(f'Extracting from {from_dt} to {to_dt}')
prev_end = pd.to_datetime(start)
# freq='y' yields year-end dates; newer pandas versions prefer the alias 'YE'
for next_end in pd.date_range(datetime.datetime(start.year, 12, 31), end, freq='y'):
    if next_end < end:
        extract(prev_end.strftime('%Y-%m-%d'), next_end.strftime('%Y-%m-%d'))
    else:
        extract(prev_end.strftime('%Y-%m-%d'), end.strftime('%Y-%m-%d'))
    prev_end = next_end + datetime.timedelta(days=1)
# <= so that an end date falling exactly on January 1 is still extracted
if prev_end <= end:
    extract(prev_end.strftime('%Y-%m-%d'), end.strftime('%Y-%m-%d'))
If you need to parse the original dates from strings, check out datetime.strptime
This kind of problem is a nice one to solve with recursion:
from datetime import datetime
start_date = '2020-10-01'
end_date = '2022-01-03'
def intervalcalc(datestart,dateend):
    newdate=dateend[:4] + '-01-01'
    startd = datetime.strptime(datestart, "%Y-%m-%d")
    endd = datetime.strptime(newdate, "%Y-%m-%d")
    # <= so that a start date of January 1 also ends the recursion
    if endd <= startd:
        print(datestart, dateend)
        return True
    else:
        print(newdate, dateend)
        previousyear=str(int(newdate[:4])-1) + '-12-31'
        intervalcalc(datestart,previousyear)
intervalcalc(start_date, end_date)
output:
2022-01-01 2022-01-03
2021-01-01 2021-12-31
2020-10-01 2020-12-31
You just need to replace the prints with calls to the extract function.
As mentioned by @Wups, the conversion to date is not really necessary; a plain string comparison works, because YYYY-MM-DD strings sort in date order.
Also, this can be done the other way around: calculate the start date's year + '-12-31' and then compare it with end_date to determine the anchor for the recursion.
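The forward direction mentioned in the last remark can also be written as a plain loop instead of recursion. This is only a sketch under assumptions: split_by_year is a made-up helper name, it works on date objects rather than strings, and the print stands in for the real extract call.

```python
from datetime import date, timedelta

def split_by_year(start, end):
    """Yield (start, end) pairs that never cross a calendar-year boundary."""
    current = start
    while current <= end:
        # each chunk stops at December 31 of its year or at the overall end
        chunk_end = min(date(current.year, 12, 31), end)
        yield current, chunk_end
        current = chunk_end + timedelta(days=1)

for chunk_start, chunk_end in split_by_year(date(2020, 10, 1), date(2022, 1, 3)):
    # replace this print with extract(chunk_start.isoformat(), chunk_end.isoformat())
    print(chunk_start.isoformat(), chunk_end.isoformat())
```

Because every chunk ends on December 31 at the latest, no interval is longer than one year.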
I'm new to Python. After a couple days researching and trying things out, I've landed on a decent solution for creating a list of timestamps, for each hour, between two dates.
Example:
import datetime
from datetime import datetime, timedelta
timestamp_format = '%Y-%m-%dT%H:%M:%S%z'
earliest_ts_str = '2020-10-01T15:00:00Z'
earliest_ts_obj = datetime.strptime(earliest_ts_str, timestamp_format)
latest_ts_str = '2020-10-02T00:00:00Z'
latest_ts_obj = datetime.strptime(latest_ts_str, timestamp_format)
num_days = latest_ts_obj - earliest_ts_obj
num_hours = int(round(num_days.total_seconds() / 3600,0))
ts_raw = []
for ts in range(num_hours):
    ts_raw.append(latest_ts_obj - timedelta(hours = ts + 1))
dates_formatted = [d.strftime('%Y-%m-%dT%H:%M:%SZ') for d in ts_raw]
# Need timestamps in ascending order
dates_formatted.reverse()
dates_formatted
Which results in:
['2020-10-01T00:00:00Z',
'2020-10-01T01:00:00Z',
'2020-10-01T02:00:00Z',
'2020-10-01T03:00:00Z',
'2020-10-01T04:00:00Z',
'2020-10-01T05:00:00Z',
'2020-10-01T06:00:00Z',
'2020-10-01T07:00:00Z',
'2020-10-01T08:00:00Z',
'2020-10-01T09:00:00Z',
'2020-10-01T10:00:00Z',
'2020-10-01T11:00:00Z',
'2020-10-01T12:00:00Z',
'2020-10-01T13:00:00Z',
'2020-10-01T14:00:00Z',
'2020-10-01T15:00:00Z',
'2020-10-01T16:00:00Z',
'2020-10-01T17:00:00Z',
'2020-10-01T18:00:00Z',
'2020-10-01T19:00:00Z',
'2020-10-01T20:00:00Z',
'2020-10-01T21:00:00Z',
'2020-10-01T22:00:00Z',
'2020-10-01T23:00:00Z']
Problem:
If I change earliest_ts_str to include minutes, say earliest_ts_str = '2020-10-01T19:45:00Z', the resulting list does not increment the minute intervals accordingly.
Results:
['2020-10-01T20:00:00Z',
'2020-10-01T21:00:00Z',
'2020-10-01T22:00:00Z',
'2020-10-01T23:00:00Z']
I need it to be:
['2020-10-01T20:45:00Z',
'2020-10-01T21:45:00Z',
'2020-10-01T22:45:00Z',
'2020-10-01T23:45:00Z']
Feels like the problem is in the num_days and num_hours calculation, but I can't see how to fix it.
Ideas?
If you don't mind using a third-party package, have a look at pandas.date_range:
import pandas as pd
earliest, latest = '2020-10-01T15:45:00Z', '2020-10-02T00:00:00Z'
dti = pd.date_range(earliest, latest, freq='H') # just specify hourly frequency (newer pandas versions prefer 'h')
l = dti.strftime('%Y-%m-%dT%H:%M:%SZ').to_list()
print(l)
# ['2020-10-01T15:45:00Z', '2020-10-01T16:45:00Z', '2020-10-01T17:45:00Z', '2020-10-01T18:45:00Z', '2020-10-01T19:45:00Z', '2020-10-01T20:45:00Z', '2020-10-01T21:45:00Z', '2020-10-01T22:45:00Z', '2020-10-01T23:45:00Z']
import datetime
from datetime import datetime, timedelta
timestamp_format = '%Y-%m-%dT%H:%M:%S%z'
earliest_ts_str = '2020-10-01T00:00:00Z'
ts_obj = datetime.strptime(earliest_ts_str, timestamp_format)
latest_ts_str = '2020-10-02T00:00:00Z'
latest_ts_obj = datetime.strptime(latest_ts_str, timestamp_format)
ts_raw = []
while ts_obj <= latest_ts_obj:
    ts_raw.append(ts_obj)
    ts_obj += timedelta(hours=1)
dates_formatted = [d.strftime('%Y-%m-%dT%H:%M:%SZ') for d in ts_raw]
print(dates_formatted)
EDIT:
Here is an example with Maya:
import maya
earliest_ts_str = '2020-10-01T00:00:00Z'
latest_ts_str = '2020-10-02T00:00:00Z'
start = maya.MayaDT.from_iso8601(earliest_ts_str)
end = maya.MayaDT.from_iso8601(latest_ts_str)
# end is not included, so we add 1 second
my_range = maya.intervals(start=start, end=end.add(seconds=1), interval=60*60)
dates_formatted = [d.iso8601() for d in my_range]
print(dates_formatted)
Both produce the following output:
['2020-10-01T00:00:00Z',
'2020-10-01T01:00:00Z',
... some left out ...
'2020-10-01T23:00:00Z',
'2020-10-02T00:00:00Z']
Just change the num_hours line to:
num_hours = num_days.days*24 + num_days.seconds//3600
The problem is that num_days.days is an integer, so if the span is not a multiple of 24 hours you get the floor value (for your example, 0). To compute the hours you therefore need to use both days and seconds.
Also, you can create the list directly in the right order; I am not sure whether you are building it in reverse for some reason.
ts_raw.append(earliest_ts_obj + timedelta(hours = ts + 1))
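To make the days/seconds point concrete, here is a small sketch using the 19:45 example from the question (time zones are left out for brevity, and the variable names are only illustrative). It shows what each timedelta attribute holds for a span of 4 hours 15 minutes.

```python
from datetime import datetime

earliest = datetime(2020, 10, 1, 19, 45)
latest = datetime(2020, 10, 2, 0, 0)
span = latest - earliest  # a timedelta of 4 hours 15 minutes

whole_days = span.days            # full days only, so 0 here
leftover_seconds = span.seconds   # the part of the span beyond whole days
hours_from_parts = span.days * 24 + span.seconds // 3600
hours_exact = span.total_seconds() / 3600

print(whole_days, leftover_seconds, hours_from_parts, hours_exact)
# prints: 0 15300 4 4.25
```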
Interestingly, I have searched a lot of questions but I cannot find just a simple answer to this question. Or I do find an answer but it won't allow me the flexibility to alter the format of the dates I require.
If I have a specified start and end date like this:
start = '2015-08-01' #YYYY-MM-DD
end = '2020-07-06'
Is there a simple way using datetime in Python to create a list of dates between these two dates that adheres to this YYYY-MM-DD format? And if so, how can I subsequently reverse this list so that list[0] is equal to today?
Here's a way using a list comprehension, which is typically faster than the explicit loop examples and doesn't require any external libraries.
from datetime import date, timedelta
start = '2015-08-01'
end = '2020-07-06'
start_date = date.fromisoformat(start)
end_date = date.fromisoformat(end)
date_range = [
    # end_date - timedelta(days=i) # For date objects
    (end_date - timedelta(days=i)).isoformat() # For ISO-8601 strings
    for i
    in range((end_date - start_date).days + 1) # + 1 so that start_date is included
]
reverse_range = list(reversed(date_range))
print(date_range[0])
print(reverse_range[0])
Output
2020-07-06
2015-08-01
You can also use pandas
import pandas as pd
start = '2015-08-01' #YYYY-MM-DD
end = '2020-07-06'
pd.date_range(start, end)
# to start from today
pd.date_range(pd.Timestamp.today(), end)
You can also create a range with your desired frequency
pd.date_range(start, end, freq='14d') # every 14 days
pd.date_range(start, end, freq='H') # hourly, etc. (newer pandas versions prefer 'h')
The datetime.timedelta class will help here. Try this:
import datetime
dates = []
d = datetime.date(2015,8,1)
while d <= datetime.date(2020,7,6):
    dates.append(datetime.datetime.strftime(d,'%Y-%m-%d'))
    d += datetime.timedelta(days=1)
This will populate the list dates, which will look like this:
['2015-08-01', '2015-08-02', '2015-08-03', .... , '2020-07-04', '2020-07-05', '2020-07-06']
EDIT:
Just use dates.append(d) instead of dates.append(datetime.datetime.strftime(d,'%Y-%m-%d')) to get a list of datetime.date objects instead of strings.
Reversing a list is pretty straight-forward in Python:
dates = dates[::-1]
After the above, dates[0] will be '2020-07-06'.
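Since the question also asks for flexibility in the output format, the pieces above could be combined into one helper. This is a rough sketch only: date_strings, fmt and newest_first are made-up names, and it assumes both inputs are ISO strings and that both endpoints should be included.

```python
from datetime import date, timedelta

def date_strings(start, end, fmt='%Y-%m-%d', newest_first=False):
    """Return every day from start to end (both inclusive) formatted with fmt."""
    first = date.fromisoformat(start)
    last = date.fromisoformat(end)
    n_days = (last - first).days + 1  # + 1 so the end date is included
    days = [first + timedelta(days=i) for i in range(n_days)]
    if newest_first:
        days.reverse()
    return [d.strftime(fmt) for d in days]

dates = date_strings('2015-08-01', '2020-07-06', newest_first=True)
print(dates[0], dates[-1])
# prints: 2020-07-06 2015-08-01
```

Passing a different fmt, such as '%d/%m/%Y', changes only the string formatting and not the range logic.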
Something like this?
import datetime
def date_range(start, end):
    r = (end+datetime.timedelta(days=1)-start).days
    return [start+datetime.timedelta(days=i) for i in range(r)]
start = datetime.date(2015, 1, 1)
end = datetime.date(2020, 7, 6)
dateList = date_range(start, end)
print('\n'.join([str(date) for date in dateList]))
For a current project, I am planning to filter a JSON file by timeranges by running several loops, each time with a slightly shifted range. The code below however yields the error TypeError: Invalid comparison between dtype=datetime64[ns] and date for line after_start_date = df["Date"] >= start_date.
I have already tried to modify the formatting of the dates both within the Python code as well as the corresponding JSON file. Is there any smart tweak to align the date types/formats?
The JSON file has the following format:
[
{"No":"121","Stock Symbol":"A","Date":"05/11/2017","Text Main":"Sample text"}
]
And the corresponding code looks like this:
import string
import json
import pandas as pd
import datetime
from dateutil.relativedelta import *
# Loading and reading dataset
file = open("Glassdoor_A.json", "r")
data = json.load(file)
df = pd.json_normalize(data)
df['Date'] = pd.to_datetime(df['Date'])
# Create an empty dictionary
d = dict()
# Filtering by date
start_date = datetime.date.fromisoformat('2017-01-01')
end_date = datetime.date.fromisoformat('2017-01-31')
for i in df.iterrows():
    start_date += relativedelta(months=+3)
    end_date += relativedelta(months=+3)
    print(start_date)
    print(end_date)
    after_start_date = df["Date"] >= start_date
    before_end_date = df["Date"] <= end_date
    between_two_dates = after_start_date & before_end_date
    filtered_dates = df.loc[between_two_dates]
    print(filtered_dates)
You can use pd.to_datetime('2017-01-31') instead of datetime.date.fromisoformat('2017-01-31').
I hope this helps!
My general advice is not to use the datetime module.
Rather, use built-in pandas methods and classes like pd.to_datetime
and pd.DateOffset.
You should also close the input file as soon as it is no longer needed, e.g.:
with open('Glassdoor_A.json', 'r') as file:
    data = json.load(file)
Other weird points in your code are that:
you wrote a loop iterating over rows, for i in df.iterrows():, but never use i (the control variable of this loop),
your loop works in a time-step mode rather than "row by row", so it should rather be something like "while end_date <= last_end_date:",
the difference between start_date and end_date is just 1 month (they are actually the start and end dates of a month), but in the loop you increase both dates by 3 months.
Below is example code that looks for rows in consecutive three-month windows,
up to some final date, and prints the rows from the current window, if any:
start_date = pd.to_datetime('2017-01-01')
end_date = pd.to_datetime('2017-03-31')
last_end_date = pd.to_datetime('2017-12-31')
mnthBeg = pd.offsets.MonthBegin(3)
mnthEnd = pd.offsets.MonthEnd(3)
while end_date <= last_end_date:
    filtered_rows = df[df.Date.between(start_date, end_date)]
    n = len(filtered_rows.index)
    print(f'Date range: {start_date.strftime("%Y-%m-%d")} - {end_date.strftime("%Y-%m-%d")}, {n} rows.')
    if n > 0:
        print(filtered_rows)
    start_date += mnthBeg
    end_date += mnthEnd
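For readers who want to see the window arithmetic itself, the same stepping can be sketched with the standard library only. This is not the pandas approach recommended above, just an illustration; month_windows and months_per_window are made-up names, and the first window is assumed to start on the first day of a month.

```python
import calendar
from datetime import date

def month_windows(first_start, last_end, months_per_window=3):
    """Yield (start, end) date pairs covering consecutive blocks of whole months."""
    year, month = first_start.year, first_start.month
    while True:
        # count months from year 0 so that year rollovers need no special case
        end_index = year * 12 + (month - 1) + months_per_window - 1
        end_year, end_month = divmod(end_index, 12)
        end_month += 1
        last_day = calendar.monthrange(end_year, end_month)[1]
        window_end = date(end_year, end_month, last_day)
        if window_end > last_end:
            break
        yield date(year, month, 1), window_end
        # the next window starts in the month after this one ends
        year, month = divmod(end_index + 1, 12)
        month += 1

for window_start, window_end in month_windows(date(2017, 1, 1), date(2017, 12, 31)):
    print(window_start, window_end)
```

Each pair could then be passed through pd.to_datetime before comparing it with the Date column, so that both sides of the comparison are pandas timestamps.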
You can compare your dates using the following method
from datetime import datetime
df_subset = df.loc[(df['Start_Date'] > datetime.strptime('2018-12-31', '%Y-%m-%d'))]
I'm making a call to an API which allows me to enter the start and end date in the request URL; each request needs its date range to be within 31 days.
Is it possible to loop through a date range like the example below?
max_start_date = '2017-01-01'
max_end_date = '2017-12-31'
The first loop will return '2017-01-01' to '2017-01-31' and the 2nd loop should return '2017-02-01' to '2017-03-03'
url = "https://api.awin.com/publishers/{}/transactions/?startDate={}T00%3A00%3A00&endDate={}0T01%3A59%3A59&timezone=UTC&accessToken={}".format(publisher_id, start_date, end_date,token)
Use the datetime module:
from datetime import date, timedelta
start = date(2017, 1, 1)
end = date(2017, 12, 31)
while start < end:
    print(start, start + timedelta(days=31))
    start += timedelta(days=31)
If you need to loop through calendar months, consider using relativedelta from dateutil:
from dateutil.relativedelta import relativedelta
while start < end:
    print(start, start + relativedelta(months=1))
    start += relativedelta(months=1)
(you need to install it by running pip install python-dateutil)
Docs:
https://docs.python.org/3/library/datetime.html
https://github.com/dateutil/dateutil
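If the ranges must not overlap and the last one must not run past the end date, the loop above can be adapted as in the following sketch. The names chunk_dates and max_days are hypothetical, and it assumes the API counts both the start and the end day towards the 31-day limit.

```python
from datetime import date, timedelta

def chunk_dates(start, end, max_days=31):
    """Yield (chunk_start, chunk_end) pairs covering start..end without overlap."""
    current = start
    while current <= end:
        # a chunk spans at most max_days calendar days, both endpoints included
        chunk_end = min(current + timedelta(days=max_days - 1), end)
        yield current, chunk_end
        current = chunk_end + timedelta(days=1)

chunks = list(chunk_dates(date(2017, 1, 1), date(2017, 12, 31)))
print(chunks[0][0], chunks[0][1])
print(chunks[1][0], chunks[1][1])
# prints:
# 2017-01-01 2017-01-31
# 2017-02-01 2017-03-03
```

Each pair can then be formatted with isoformat() and substituted into the request URL.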
You can try the following for a list of dates:
import datetime as dt
max_start_date = '2017-01-01'
max_end_date = '2017-12-31'
def dt_range(*args):
    """
    Args are positional, though you could probably
    toss in an evaluation part to determine which is the min date.
    """
    dt_start = dt.datetime.strptime(args[0], '%Y-%m-%d')
    dt_end = dt.datetime.strptime(args[1], '%Y-%m-%d')
    for i in range(int((dt_end - dt_start).days)+1):
        yield dt_start + dt.timedelta(i)
for i in dt_range(max_start_date, max_end_date):
    print(i.strftime('%Y-%m-%d'))