I am getting my data and some dates from an unconventional source, so the date strings have minor inconsistencies. The main ones are that the day and month are not zero-padded, and there can be whitespace after the day (as in 2/9 /2018). I get the error "time data does not match format '%m %d %Y'" when calling datetime.strptime. How can I convert a column of dates with subtle differences like this? Please see the code and sample data below.
d_o = datetime.datetime.strptime(df['start'][1], '%m %d %Y')
1/26/2018
1/26/2018
2/2/2018
2/2/2018
2/9 /2018
2/9 /2018
1/19/2018
1/19/2018
1/26/2018
1/26/2018
2/2/2018
2/2/2018
2/9 /2018
You should use a 3rd party library such as dateutil. This library accepts a wide variety of date formats at the cost of performance.
from dateutil import parser
lst = ['1/26/2018', '1/26/2018', '2/2/2018', '2/2/2018', '2/9 /2018', '2/9 /2018',
'1/19/2018', '1/19/2018', '1/26/2018', '1/26/2018', '2/2/2018', '2/2/2018',
'2/9 /2018']
res = [parser.parse(i) for i in lst]
Result:
[datetime.datetime(2018, 1, 26, 0, 0),
datetime.datetime(2018, 1, 26, 0, 0),
datetime.datetime(2018, 2, 2, 0, 0),
datetime.datetime(2018, 2, 2, 0, 0),
datetime.datetime(2018, 2, 9, 0, 0),
datetime.datetime(2018, 2, 9, 0, 0),
datetime.datetime(2018, 1, 19, 0, 0),
datetime.datetime(2018, 1, 19, 0, 0),
datetime.datetime(2018, 1, 26, 0, 0),
datetime.datetime(2018, 1, 26, 0, 0),
datetime.datetime(2018, 2, 2, 0, 0),
datetime.datetime(2018, 2, 2, 0, 0),
datetime.datetime(2018, 2, 9, 0, 0)]
You can use re.split and str.zfill:
import re
dates = ['1/26/2018', '1/26/2018', '2/2/2018', '2/2/2018', '2/9 /2018', '2/9 /2018', '1/19/2018', '1/19/2018', '1/26/2018', '1/26/2018', '2/2/2018', '2/2/2018', '2/9 /2018']
# Split on slashes and stray whitespace, then zero-pad both month and day
new_dates = ['{}/{}/{}'.format(m.zfill(2), d.zfill(2), y) for m, d, y in (re.split(r'[/\s]+', x) for x in dates)]
Output:
['01/26/2018', '01/26/2018', '02/02/2018', '02/02/2018', '02/09/2018', '02/09/2018', '01/19/2018', '01/19/2018', '01/26/2018', '01/26/2018', '02/02/2018', '02/02/2018', '02/09/2018']
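Zero-padding is not strictly required for parsing, because strptime accepts one-digit months and days for %m and %d. A minimal sketch using only the standard library, assuming the only irregularity is stray whitespace and that the separator is always a slash (the helper name parse_loose_date is made up for illustration):

```python
from datetime import datetime

def parse_loose_date(text):
    """Parse an M/D/YYYY string that may contain stray whitespace."""
    # Drop every whitespace character so '2/9 /2018' becomes '2/9/2018'.
    cleaned = "".join(text.split())
    # %m and %d accept one- or two-digit values when parsing.
    return datetime.strptime(cleaned, "%m/%d/%Y")

samples = ["1/26/2018", "2/2/2018", "2/9 /2018"]
parsed = [parse_loose_date(s) for s in samples]
print(parsed)
```

For a pandas column, the same idea would probably be to strip the whitespace with df['start'].str.replace(r'\s+', '', regex=True) and pass the result to pd.to_datetime with format='%m/%d/%Y'.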
I am working in python pandas and I am doing the following:
StDt = datetime(2018, 1, 1, 1, 0)
EnDt = datetime(2020, 1, 1, 1, 0)
allHours = pd.date_range(StDt, EnDt, freq='H').to_pydatetime()
The midnight hours are represented as:
datetime(2018, 1, 3, 0, 0)
datetime(2018, 1, 5, 0, 0)
Is it possible to create the series so that midnight is represented as hour 24 of the previous day?
That is, the two cases above would look like:
datetime(2018, 1, 2, 24, 0)
datetime(2018, 1, 4, 24, 0)
In other words, I am looking for the following:
datetime(2018, 1, 3, 0, 0) = datetime(2018, 1, 2, 24, 0)
datetime(2018, 1, 5, 0, 0) = datetime(2018, 1, 4, 24, 0)
Edit:
My particular situation requires working in an hour-ending world, because that is the convention in the field I am working in.
Using datetimes, this is not possible. Python simply doesn't accept datetime(2018, 1, 2, 24, 0) as a valid time.
There was a request in 2010 to allow this time to be accepted
Issue 10427: 24:00 Hour in DateTime
which was rejected.
My only suggestion would be to consider whether you really need this time depicted as you outlined. For actual data manipulation, it should not make any difference as any operations you'd like to do in Pandas with datetimes will conform to this same restriction anyways.
I was working with similar data, and found it useful to consider that Hour Ending data labeled 1-24 is the equivalent of Hour Beginning data labeled 0-23.
So you'll have to change your rule set notation, but it should be a straightforward change.
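A minimal sketch of that relabelling, assuming every timestamp sits exactly on the hour and marks the end of its interval. Since datetime itself cannot hold hour 24, the label is kept as a (date, hour) pair or a string; the function names here are made up for illustration:

```python
from datetime import datetime, timedelta

def hour_ending_label(ts):
    """Return (date, hour_ending) with hour_ending in 1..24."""
    # The hour ending at `ts` began one hour earlier; that start decides the date.
    start = ts - timedelta(hours=1)
    return start.date(), start.hour + 1

def format_hour_ending(ts):
    """Render the hour-ending label as text, e.g. '2018-01-02 24:00'."""
    day, hour = hour_ending_label(ts)
    return f"{day.isoformat()} {hour:02d}:00"

print(format_hour_ending(datetime(2018, 1, 3, 0, 0)))  # 2018-01-02 24:00
print(format_hour_ending(datetime(2018, 1, 3, 1, 0)))  # 2018-01-03 01:00
```

The underlying datetimes stay valid for arithmetic and indexing; only the displayed label changes.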
Why do these two lines produce different results?
>>> import pytz
>>> from datetime import datetime
>>> local_tz = pytz.timezone("America/Los_Angeles")
>>> d1 = local_tz.localize(datetime(2015, 8, 1, 0, 0, 0, 0)) # line 1
>>> d2 = datetime(2015, 8, 1, 0, 0, 0, 0, local_tz) # line 2
>>> d1 == d2
False
What's the reason for the difference, and which should I use to localize a datetime?
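When you create d2 = datetime(2015, 8, 1, 0, 0, 0, 0, local_tz), the pytz timezone is attached as-is, so daylight saving time (DST) is not handled correctly: pytz falls back to the zone's earliest recorded offset, local mean time (LMT). local_tz.localize() picks the offset that actually applies on that date.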
d1 is
>>> local_tz.localize(datetime(2015, 8, 1, 0, 0, 0, 0))
datetime.datetime(
2015, 8, 1, 0, 0,
tzinfo=<DstTzInfo 'America/Los_Angeles' PDT-1 day, 17:00:00 DST>
)
d2 is
>>> datetime(2015, 8, 1, 0, 0, 0, 0, local_tz)
datetime.datetime(
2015, 8, 1, 0, 0,
tzinfo=<DstTzInfo 'America/Los_Angeles' LMT-1 day, 16:07:00 STD>
)
You can see that they do not represent the same time.
The d2 approach is fine only if you are going to work with UTC, which has no daylight saving time (DST) transitions to deal with.
The correct way to attach a pytz timezone is local_tz.localize(), which supports daylight saving time (DST).
More information and additional examples can be found here:
http://pytz.sourceforge.net/#localized-times-and-date-arithmetic
I've got a sorted list of datetimes: (with day gaps)
list_of_dts = [
datetime.datetime(2012,1,1,0,0,0),
datetime.datetime(2012,1,1,1,0,0),
datetime.datetime(2012,1,2,0,0,0),
datetime.datetime(2012,1,3,0,0,0),
datetime.datetime(2012,1,5,0,0,0),
]
And I'd like to split them into a list for each day:
result = [
[datetime.datetime(2012,1,1,0,0,0), datetime.datetime(2012,1,1,1,0,0)],
[datetime.datetime(2012,1,2,0,0,0)],
[datetime.datetime(2012,1,3,0,0,0)],
[], # Empty list for no datetimes on day
[datetime.datetime(2012,1,5,0,0,0)]
]
Algorithmically, it should be possible to achieve at least O(n).
Perhaps something like the following:
(This obviously doesn't handle missed days, and drops the last dt, but it's a start)
def dt_to_d(list_of_dts):
    result = []
    start_dt = list_of_dts[0]
    day = [start_dt]
    for i, dt in enumerate(list_of_dts[1:]):
        previous = start_dt if i == 0 else list_of_dts[i-1]
        if dt.day > previous.day or dt.month > previous.month or dt.year > previous.year:
            # split to new sub-list
            result.append(day)
            day = []
            # Loop for each day gap?
        day.append(dt)
    return result
Thoughts?
The easiest way to go is to use dict.setdefault to group entries falling on the same day and then loop over the lowest day to the highest:
>>> import datetime
>>> list_of_dts = [
datetime.datetime(2012,1,1,0,0,0),
datetime.datetime(2012,1,1,1,0,0),
datetime.datetime(2012,1,2,0,0,0),
datetime.datetime(2012,1,3,0,0,0),
datetime.datetime(2012,1,5,0,0,0),
]
>>> days = {}
>>> for dt in list_of_dts:
        days.setdefault(dt.toordinal(), []).append(dt)
>>> [days.get(day, []) for day in range(min(days), max(days)+1)]
[[datetime.datetime(2012, 1, 1, 0, 0), datetime.datetime(2012, 1, 1, 1, 0)],
[datetime.datetime(2012, 1, 2, 0, 0)],
[datetime.datetime(2012, 1, 3, 0, 0)],
[],
[datetime.datetime(2012, 1, 5, 0, 0)]]
Another approach for making such groupings is itertools.groupby. It is designed for this kind of work, but it doesn't provide a way to fill-in an empty list for missing days:
>>> import itertools
>>> [list(group) for k, group in itertools.groupby(list_of_dts,
key=datetime.datetime.toordinal)]
[[datetime.datetime(2012, 1, 1, 0, 0), datetime.datetime(2012, 1, 1, 1, 0)],
[datetime.datetime(2012, 1, 2, 0, 0)],
[datetime.datetime(2012, 1, 3, 0, 0)],
[datetime.datetime(2012, 1, 5, 0, 0)]]
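If you want to keep groupby and still get an empty list for each missing day, one option is to track the ordinal of the next expected day while iterating. This is only a sketch, assuming the input is already sorted; the function name split_by_day is made up for illustration:

```python
import datetime
import itertools

def split_by_day(sorted_dts):
    """Group sorted datetimes per calendar day, with [] for missing days."""
    result = []
    expected = None  # ordinal of the next day we expect to see
    for ordinal, group in itertools.groupby(
            sorted_dts, key=datetime.datetime.toordinal):
        if expected is not None:
            # Emit one empty list for every day skipped since the last group.
            result.extend([] for _ in range(expected, ordinal))
        result.append(list(group))
        expected = ordinal + 1
    return result

list_of_dts = [
    datetime.datetime(2012, 1, 1, 0, 0, 0),
    datetime.datetime(2012, 1, 1, 1, 0, 0),
    datetime.datetime(2012, 1, 2, 0, 0, 0),
    datetime.datetime(2012, 1, 3, 0, 0, 0),
    datetime.datetime(2012, 1, 5, 0, 0, 0),
]
result = split_by_day(list_of_dts)
print(result)
```

This makes a single pass over the list, so the work is linear in the number of datetimes plus the number of days spanned.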
You can use itertools.groupby to easily handle this kind of problem:
import datetime
import itertools
list_of_dts = [
datetime.datetime(2012,1,1,0,0,0),
datetime.datetime(2012,1,1,1,0,0),
datetime.datetime(2012,1,2,0,0,0),
datetime.datetime(2012,1,3,0,0,0),
datetime.datetime(2012,1,5,0,0,0),
]
print([list(g) for k, g in itertools.groupby(list_of_dts, key=lambda d: d.date())])
Filling the gaps:
date_dict = {}
for date_value in list_of_dts:
    # Group each datetime under its calendar date
    date_dict.setdefault(date_value.date(), []).append(date_value)
sorted_dates = sorted(date_dict)
date = sorted_dates[0]
while date <= sorted_dates[-1]:
    # Days with no entries print an empty list
    print(date_dict.get(date, []))
    date += datetime.timedelta(days=1)
Results:
[datetime.datetime(2012, 1, 1, 0, 0), datetime.datetime(2012, 1, 1, 1, 0)]
[datetime.datetime(2012, 1, 2, 0, 0)]
[datetime.datetime(2012, 1, 3, 0, 0)]
[]
[datetime.datetime(2012, 1, 5, 0, 0)]
This solution does not require the original datetime list to be sorted.
list_of_dts = [
datetime.datetime(2012,1,1,0,0,0),
datetime.datetime(2012,1,1,1,0,0),
datetime.datetime(2012,1,2,0,0,0),
datetime.datetime(2012,1,3,0,0,0),
datetime.datetime(2012,1,5,0,0,0),
]
groupedByDay = {}
for date in list_of_dts:
    if date.date() in groupedByDay:
        groupedByDay[date.date()].append(date)
    else:
        groupedByDay[date.date()] = [date]
Now you have a dictionary where each key is a date and each value is the list of datetimes that fall on that date.
If you are set on having a list instead:
result = sorted(groupedByDay.values())
Now result is a list of lists, where all the datetimes on the same day are grouped together.
I'm importing records from a CSV file using the Python csv module. The date/time field expects the date to be in a specific format, but different spreadsheet programs default to different formats, and I don't want users to have to change their own format. I want to find a way to either detect the format the string is in, or only allow several specified formats.
How can I read the date/time field from the CSV file and plot a graph accordingly?
dateutil can parse date strings in a variety of formats, without you having to specify in advance what format the date string is in:
In [8]: import dateutil.parser as parser
In [9]: parser.parse('Jan 1')
Out[9]: datetime.datetime(2011, 1, 1, 0, 0)
In [10]: parser.parse('1 Jan')
Out[10]: datetime.datetime(2011, 1, 1, 0, 0)
In [11]: parser.parse('1-Jan')
Out[11]: datetime.datetime(2011, 1, 1, 0, 0)
In [12]: parser.parse('Jan-1')
Out[12]: datetime.datetime(2011, 1, 1, 0, 0)
In [13]: parser.parse('Jan 2,1999')
Out[13]: datetime.datetime(1999, 1, 2, 0, 0)
In [14]: parser.parse('2 Jan 1999')
Out[14]: datetime.datetime(1999, 1, 2, 0, 0)
In [15]: parser.parse('1999-1-2')
Out[15]: datetime.datetime(1999, 1, 2, 0, 0)
In [16]: parser.parse('1999/1/2')
Out[16]: datetime.datetime(1999, 1, 2, 0, 0)
In [17]: parser.parse('2/1/1999')
Out[17]: datetime.datetime(1999, 2, 1, 0, 0)
In [18]: parser.parse("10-09-2003", dayfirst=True)
Out[18]: datetime.datetime(2003, 9, 10, 0, 0)
In [19]: parser.parse("10-09-03", yearfirst=True)
Out[19]: datetime.datetime(2010, 9, 3, 0, 0)
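If you would rather only allow several specified formats than guess, a small standard-library alternative is to try each permitted format in turn. This is a sketch, not part of dateutil; the format list and the function name are assumptions for illustration:

```python
from datetime import datetime

# Hypothetical whitelist; list the formats your spreadsheets actually produce.
ALLOWED_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%m/%d/%Y")

def parse_with_allowed_formats(text, formats=ALLOWED_FORMATS):
    """Return a datetime for the first format that matches, else raise ValueError."""
    cleaned = text.strip()
    for fmt in formats:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # this format did not match; try the next one
    raise ValueError(f"unrecognised date format: {text!r}")

print(parse_with_allowed_formats("1999-01-02"))
print(parse_with_allowed_formats("2/1/1999"))
```

Order matters when formats overlap (for example month-first versus day-first), so put the one you trust most first.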
Once you've collected the dates and values into lists, you can plot them with plt.plot. For example:
import matplotlib.pyplot as plt
import datetime as dt
import numpy as np
n=20
now=dt.datetime.now()
dates=[now+dt.timedelta(days=i) for i in range(n)]
values=[np.sin(np.pi*i/n) for i in range(n)]
plt.plot(dates,values)
plt.show()
Per Joe Kington's comment, a graph similar to the one above could also be made using matplotlib.dates.datestr2num instead of using dateutil.parser explicitly:
import matplotlib.pyplot as plt
import matplotlib.dates as md
import datetime as dt
import numpy as np
n=20
dates=['2011-Feb-{i}'.format(i=i) for i in range(1,n)]
dates=md.datestr2num(dates)
values=[np.sin(np.pi*i/n) for i in range(1,n)]
plt.plot_date(dates,values,linestyle='solid',marker='None')
plt.show()
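The step of collecting the dates and values into lists from the CSV file might look like the sketch below. The column names timestamp and value, the date format, and the sample rows are all assumptions; io.StringIO stands in for the real file:

```python
import csv
import io
from datetime import datetime

# Stand-in for an open CSV file; replace with open('yourfile.csv', newline='').
sample_csv = io.StringIO(
    "timestamp,value\n"
    "2011-02-01 00:00,0.5\n"
    "2011-02-02 00:00,0.8\n"
    "2011-02-03 00:00,0.3\n"
)

dates, values = [], []
for row in csv.DictReader(sample_csv):
    # Convert each field from text into a type that can be plotted.
    dates.append(datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M"))
    values.append(float(row["value"]))

print(dates[0], values[0])
```

The resulting dates and values lists can then be passed straight to plt.plot(dates, values).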