Inconsistency when parsing year-weeknum string to date - python

When parsing year-weeknum strings, I came across an inconsistency when comparing the results from %W and %U (docs):
What works:
from datetime import datetime
print("\nISO:") # for reference...
for i in range(1, 8):  # %u is 1-based
    print(datetime.strptime(f"2019-01-{i}", "%G-%V-%u"))
# ISO:
# 2018-12-31 00:00:00
# 2019-01-01 00:00:00
# 2019-01-02 00:00:00
# ...
# %U -> week start = Sun
# first Sunday 2019 was 2019-01-06
print("\n %U:")
for i in range(0, 7):
    print(datetime.strptime(f"2019-01-{i}", "%Y-%U-%w"))
# %U:
# 2019-01-06 00:00:00
# 2019-01-07 00:00:00
# 2019-01-08 00:00:00
# ...
What is unexpected:
# %W -> week start = Mon
# first Monday 2019 was 2019-01-07
print("\n %W:")
for i in range(0, 7):
    print(datetime.strptime(f"2019-01-{i}", "%Y-%W-%w"))
# %W:
# 2019-01-13 00:00:00 ## <-- ?! expected 2019-01-06
# 2019-01-07 00:00:00
# 2019-01-08 00:00:00
# 2019-01-09 00:00:00
# 2019-01-10 00:00:00
# 2019-01-11 00:00:00
# 2019-01-12 00:00:00
Why does the date jump backwards from 2019-01-13 to 2019-01-07? What's going on here? I don't see any ambiguities in the calendar for 2019... I also tried to parse the same dates in Rust with chrono, and it fails for the %W directive -> playground example. A jump backwards in Python and an error in Rust, what am I missing here?

With %W, week 01 of 2019 runs from Monday, January 7 to Sunday, January 13.
%w is documented as "Weekday as a decimal number, where 0 is Sunday and 6 is Saturday." So 0 means Sunday (= January 13) and 1 means Monday (= January 7); the apparent jump backwards is just Sunday being counted at the end of a Monday-based week.
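To make the mapping concrete, here is a minimal check (standard library only):

from datetime import datetime

# In %W numbering, week 01 of 2019 runs Mon 2019-01-07 .. Sun 2019-01-13,
# and %w counts Sunday as 0, so weekday 0 lands at the *end* of that week:
print(datetime.strptime("2019-01-0", "%Y-%W-%w"))  # 2019-01-13 (Sunday)
print(datetime.strptime("2019-01-1", "%Y-%W-%w"))  # 2019-01-07 (Monday)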

In your code, the trailing 0 in "2019-01-0" is the %w weekday field, and %w treats 0 as Sunday, the last day of a Monday-based %W week. That's why you're encountering an unexpected result on the first iteration with the %W format code.
If you want the parsed dates to come out in calendar order, start the weekday values at 1 (Monday) rather than 0.
It might also help to keep the string format consistent by zero-padding the day in the f-string:
f"2019-01-{i:02d}"
which adds the leading 0 when necessary, like the following:
2019-01-00
2019-01-01
2019-01-02
2019-01-03
2019-01-04
2019-01-05
2019-01-06
Here is your modified code:
for i in range(0, 7):
    print(datetime.strptime(f"2019-01-{i:02d}", "%Y-%W-%w"))

Related

Rounding up the value to the nearest hour

This is my first time posting a question here; if I don't explain it very clearly, please give me a chance to improve the way of asking. Thank you!
I have a dataset that contains dates and times like this:
TIME COL1 COL2 COL3 ...
2018/12/31 23:50:23 34 DC 23
2018/12/31 23:50:23 32 NC 23
2018/12/31 23:50:19 12 AL 33
2018/12/31 23:50:19 56 CA 23
2018/12/31 23:50:19 98 CA 33
I want to create a new column where the format is like '2018-12-31 11:00:00 PM' instead of '2018/12/31 23:10:23', with the time rounded to the hour (e.g. 17:40 rounded up to 6:00).
I have tried using .dt.strftime("%Y-%m-%d %H:%M:%S") to change the format, but then, trying to convert the time between 24h and 12h representations, I got stuck here:
Name: TIME, Length: 3195450, dtype: datetime64[ns]
I found out the type of df['TIME'] is pandas.core.series.Series
Now I have no idea about how to continue. Please give me some ideas, hints or any instructions. Thank you very much!
From your example it seems you want to floor to the hour, instead of round? In any case, first make sure your TIME column is of datetime dtype.
df['TIME'] = pd.to_datetime(df['TIME'])
Now floor (or round) using the dt accessor and an offset alias:
df['newTIME'] = df['TIME'].dt.floor('H') # could use round instead of floor here
# df['newTIME']
# 0 2018-12-31 23:00:00
# 1 2018-12-31 23:00:00
# 2 2018-12-31 23:00:00
# 3 2018-12-31 23:00:00
# 4 2018-12-31 23:00:00
# Name: newTIME, dtype: datetime64[ns]
After that, you can format to string in the desired format, again using the dt accessor to access properties of a datetime series:
df['timestring'] = df['newTIME'].dt.strftime("%Y-%m-%d %I:%M:%S %p")
# df['timestring']
# 0 2018-12-31 11:00:00 PM
# 1 2018-12-31 11:00:00 PM
# 2 2018-12-31 11:00:00 PM
# 3 2018-12-31 11:00:00 PM
# 4 2018-12-31 11:00:00 PM
# Name: timestring, dtype: object
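If you do want rounding rather than flooring, dt.round works the same way; a minimal sketch (on made-up sample values) showing the difference:

import pandas as pd

s = pd.to_datetime(pd.Series(["2018/12/31 23:50:23", "2018/12/31 17:40:00"]))
print(s.dt.floor("H"))  # 2018-12-31 23:00:00, 2018-12-31 17:00:00
print(s.dt.round("H"))  # 2019-01-01 00:00:00, 2018-12-31 18:00:00 (past the half hour rounds up)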

Time difference in pandas (from string format to datetime)

I have the following column
Time
2:00
00:13
1:00
00:24
in object format (strings). These times refer to hours and minutes before a start time that I need to use: 8:00 (it might change; in this example it is 8:00).
Since the times in the Time column refer to hours/minutes ago, what I would expect is
Time
6:00
07:47
7:00
07:36
calculated as time difference (e.g. 8:00 - 2:00).
However, I am having difficulties doing this calculation and transforming the result into a datetime (keeping only hours and minutes).
I hope you can help me.
Since the Time column contains only hour:minute values, I suggest using timedelta instead of datetime:
df['Time'] = pd.to_timedelta(df.Time+':00')
df['Start_Time'] = pd.to_timedelta('8:00:00') - df['Time']
Output:
Time Start_Time
0 02:00:00 06:00:00
1 00:13:00 07:47:00
2 01:00:00 07:00:00
3 00:24:00 07:36:00
You can do it using pd.to_datetime:
ref = pd.to_datetime('08:00')  # define the reference hour here
s = ref - pd.to_datetime(df['Time'])
print (s)
0 06:00:00
1 07:47:00
2 07:00:00
3 07:36:00
Name: Time, dtype: timedelta64[ns]
This returns a Series, which can be converted to a DataFrame with s.to_frame(), for example.
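If you need the result "keeping only hours and minutes" as strings, one option is to build HH:MM text from the timedelta components; a minimal sketch, where Start_Time_str is just an illustrative column name:

import pandas as pd

df = pd.DataFrame({"Time": ["2:00", "00:13", "1:00", "00:24"]})
df["Start_Time"] = pd.to_timedelta("8:00:00") - pd.to_timedelta(df["Time"] + ":00")

# dt.components exposes the hours/minutes parts of each timedelta
comp = df["Start_Time"].dt.components
df["Start_Time_str"] = (comp.hours.map("{:02d}".format) + ":"
                        + comp.minutes.map("{:02d}".format))
print(df["Start_Time_str"])  # 06:00, 07:47, 07:00, 07:36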

Hours, Date, Day Count Calculation

I have a huge dataset which has dates and timestamps for several days, in UNIX (epoch seconds) format. The dataset consists of login logs.
The code is supposed to group start and end time logs and provide log counts and unique id counts.
I am trying to get some stats like:
total log counts per hour & unique login IDs per hour;
log counts for a chosen window, e.g. 24 hrs, 12 hrs, 6 hrs, 1 hr, etc., plus day of the week and similar options.
I am able to split the data with start and end hours but I am not able to get the stats of counts of logs and unique ids.
Code:
from datetime import datetime, time

# This splits data from start to end time
start = time(8, 0, 0)
end = time(20, 0, 0)
with open('input', 'r') as infile, open('output', 'w') as outfile:
    for row in infile:
        col = row.split()
        t1 = datetime.fromtimestamp(float(col[2])).time()
        t2 = datetime.fromtimestamp(float(col[3])).time()
        print(t1 >= start and t2 <= end)
Input data format: The data has no headers but the fields are given below. The number of days is not known in input.
UserID, StartTime, StopTime, GPS1, GPS2
00022d9064bc,1073260801,1073260803,819251,440006
00022d9064bc,1073260803,1073260810,819213,439954
00904b4557d3,1073260803,1073261920,817526,439458
00022de73863,1073260804,1073265410,817558,439525
00904b14b494,1073260804,1073262625,817558,439525
00022d1406df,1073260807,1073260809,820428,438735
00022d9064bc,1073260801,1073260803,819251,440006
00022dba8f51,1073260801,1073260803,819251,440006
00022de1c6c1,1073260801,1073260803,819251,440006
003065f30f37,1073260801,1073260803,819251,440006
00904b48a3b6,1073260801,1073260803,819251,440006
00904b83a0ea,1073260803,1073260810,819213,439954
00904b85d3cf,1073260803,1073261920,817526,439458
00904b14b494,1073260804,1073265410,817558,439525
00904b99499c,1073260804,1073262625,817558,439525
00904bb96e83,1073260804,1073265163,817558,439525
00904bf91b75,1073260804,1073263786,817558,439525
Expected output (example):
StartTime, EndTime, Day, LogCount, UniqueIDCount
00:00:00, 01:00:00, Mon, 349, 30
StartTime and EndTime are in human-readable format.
Separating the data by a time range is already achieved; I am now trying to round off the times and calculate the counts of logs and unique IDs. A solution with pandas is also welcome.
Edit One: more details
StartTime --> EndTIime
1/5/2004, 5:30:01 --> 1/5/2004, 5:30:03
But that falls between 5:00:00 --> 6:00:00. So the count of all logs in each such time range is what I am trying to find. Similarly for other ranges, like
5:00:00 --> 6:00:00 Hourly Count
00:00:00 --> 6:00:00 Every 6 hours
00:00:00 --> 12:00:00 Every 12 hours
5 Jan 2004, Mon --> count
6 Jan 2004, Tue --> Count
And so on. I am looking for a generic program where I can change the time/hour range as needed.
Unfortunately, I couldn't find an elegant solution.
Here is my attempt:
import numpy as np
import pandas as pd

fn = r'D:\temp\.data\dart_small.csv'
cols = ['UserID', 'StartTime', 'StopTime', 'GPS1', 'GPS2']
df = pd.read_csv(fn, header=None, names=cols)
df['m'] = df.StopTime + df.StartTime
df['d'] = df.StopTime - df.StartTime
# 'start' and 'end' for the reporting DF: `r`
# which will contain equal intervals (1 hour in this case)
start = pd.to_datetime(df.StartTime.min(), unit='s').date()
end = pd.to_datetime(df.StopTime.max(), unit='s').date() + pd.Timedelta(days=1)
# building reporting DF: `r`
freq = '1H'  # 1 hour frequency
idx = pd.date_range(start, end, freq=freq)
r = pd.DataFrame(index=idx)
r['start'] = (r.index - pd.Timestamp('1970-01-01')).total_seconds().astype(np.int64)
# 1 hour in seconds, minus one second (so that we will not count it twice)
interval = 60*60 - 1
r['LogCount'] = 0
r['UniqueIDCount'] = 0
for i, row in r.iterrows():
    # intervals overlap test
    # https://en.wikipedia.org/wiki/Interval_tree#Overlap_test
    # I've slightly simplified the calculations of m and d
    # by getting rid of the division by 2,
    # because it can be done by eliminating common terms
    u = df[np.abs(df.m - 2*row.start - interval) < df.d + interval].UserID
    r.loc[i, ['LogCount', 'UniqueIDCount']] = [len(u), u.nunique()]
r['Day'] = pd.to_datetime(r.start, unit='s').dt.day_name().str[:3]
r['StartTime'] = pd.to_datetime(r.start, unit='s').dt.time
r['EndTime'] = pd.to_datetime(r.start + interval + 1, unit='s').dt.time
print(r[r.LogCount > 0])
PS: the fewer periods you have in the reporting DF r, the faster it will run. So you may want to drop rows (times) if you know beforehand that those timeframes won't contain any data (for example during weekends, holidays, etc.)
Result:
start LogCount UniqueIDCount Day StartTime EndTime
2004-01-05 00:00:00 1073260800 24 15 Mon 00:00:00 01:00:00
2004-01-05 01:00:00 1073264400 5 5 Mon 01:00:00 02:00:00
2004-01-05 02:00:00 1073268000 3 3 Mon 02:00:00 03:00:00
2004-01-05 03:00:00 1073271600 3 3 Mon 03:00:00 04:00:00
2004-01-05 04:00:00 1073275200 2 2 Mon 04:00:00 05:00:00
2004-01-06 12:00:00 1073390400 22 12 Tue 12:00:00 13:00:00
2004-01-06 13:00:00 1073394000 3 2 Tue 13:00:00 14:00:00
2004-01-06 14:00:00 1073397600 3 2 Tue 14:00:00 15:00:00
2004-01-06 15:00:00 1073401200 3 2 Tue 15:00:00 16:00:00
2004-01-10 16:00:00 1073750400 20 11 Sat 16:00:00 17:00:00
2004-01-14 23:00:00 1074121200 218 69 Wed 23:00:00 00:00:00
2004-01-15 00:00:00 1074124800 12 11 Thu 00:00:00 01:00:00
2004-01-15 01:00:00 1074128400 1 1 Thu 01:00:00 02:00:00
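If it is acceptable to count each log only in the hour it starts (rather than in every hour it overlaps, as the interval-test approach above does), a much shorter groupby-based sketch is possible; the column names come from the question, the rest is an assumption:

import pandas as pd

cols = ['UserID', 'StartTime', 'StopTime', 'GPS1', 'GPS2']
df = pd.read_csv('input', header=None, names=cols)

# Bucket each row by the hour its StartTime falls into
# (swap '1H' for '6H', '12H', '1D', ... to change the window)
bucket = pd.to_datetime(df['StartTime'], unit='s').dt.floor('1H')
out = df.groupby(bucket).agg(LogCount=('UserID', 'size'),
                             UniqueIDCount=('UserID', 'nunique'))
print(out)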

formatting really inconsistent dates with python

I have some really messed-up dates that I'm trying to get into a consistent %Y-%m-%d format where it applies. Some of the dates lack the day; some are in the future or just plain impossible, and those I'll simply flag as incorrect. How might I tackle such inconsistencies with Python?
sample dates:
4-Jul-97
8/31/02
20-May-95
5/12/92
Jun-13
8/4/98
90/1/90
3/10/77
7-Dec
nan
4/3/98
Aug-76
Mar-90
Sep, 2020
Apr-74
10/10/03
Dec-00
You can use the dateutil parser if you want:
from dateutil.parser import parse

bad_dates = [...]
for d in bad_dates:
    try:
        print(parse(d))
    except Exception as err:
        print("couldn't parse", d, err)
outputs
1997-07-04 00:00:00
2002-08-31 00:00:00
1995-05-20 00:00:00
1992-05-12 00:00:00
2015-06-13 00:00:00
1998-08-04 00:00:00
couldn't parse 90/1/90 day is out of range for month
1977-03-10 00:00:00
2015-12-07 00:00:00
couldn't parse nan unknown string format
1998-04-03 00:00:00
1976-08-30 00:00:00
1990-03-30 00:00:00
2020-09-30 00:00:00
1974-04-30 00:00:00
2003-10-10 00:00:00
couldn't parse Dec-00 day is out of range for month
If you would like to flag any that aren't an easy parse, you can check whether they have 3 parts; if they do, try to parse them, or else flag them, like so:
flagged, good = [], []
splitters = ['-', ',', '/']
for d in bad_dates:
    try:
        a = None
        for s in splitters:
            if len(d.split(s)) == 3:
                a = parse(d)
                good.append(a)
                break
        if not a:
            raise Exception
    except Exception:
        flagged.append(d)
Some of the values are ambiguous; you can get different results depending on your priorities. For example, if you want all dates to be treated consistently, you could specify a list of formats to try:
#!/usr/bin/env python
import re
import sys
from datetime import datetime

for line in sys.stdin:
    date_string = " ".join(re.findall(r'\w+', line))  # normalize delimiters
    for date_format in ["%d %b %y", "%m %d %y", "%b %y", "%d %b", "%b %Y"]:
        try:
            print(datetime.strptime(date_string, date_format).date())
            break
        except ValueError:
            pass
    else:  # no break
        sys.stderr.write("failed to parse " + line)
Example:
$ python . <input.txt
1997-07-04
2002-08-31
1995-05-20
1992-05-12
2013-06-01
1998-08-04
failed to parse 90/1/90
1977-03-10
1900-12-07
failed to parse nan
1998-04-03
1976-08-01
1990-03-01
2020-09-01
1974-04-01
2003-10-10
2000-12-01
You could use other criteria e.g., you could maximize number of dates that are parsed successfully even if some dates are treated inconsistently instead (dateutil, pandas solution might give solutions in this category).
pd.datetools.to_datetime (plain pd.to_datetime in modern pandas) will have a go at guessing for you. It seems to do OK with most of your dates, although you might want to put in some additional rules.
df['sample'].map(lambda x: pd.datetools.to_datetime(x))
Out[52]:
0 1997-07-04 00:00:00
1 2002-08-31 00:00:00
2 1995-05-20 00:00:00
3 1992-05-12 00:00:00
4 2015-06-13 00:00:00
5 1998-08-04 00:00:00
6 90/1/90
7 1977-03-10 00:00:00
8 2015-12-07 00:00:00
9 NaN
10 1998-04-03 00:00:00
11 1976-08-01 00:00:00
12 1990-03-01 00:00:00
13 2015-09-01 00:00:00
14 1974-04-01 00:00:00
15 2003-10-10 00:00:00
16 Dec-00
Name: sample, dtype: object
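In modern pandas, pd.datetools is gone; a rough equivalent that also flags failures is to coerce unparseable values to NaT (the 'sample' column name follows the answer above):

import pandas as pd

# errors='coerce' turns anything unparseable into NaT instead of raising
parsed = df['sample'].map(lambda x: pd.to_datetime(x, errors='coerce'))
flagged = df['sample'][parsed.isna()]  # e.g. '90/1/90', 'nan', ...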

start/end date for each month in the past 12 months

This is my current code
from datetime import datetime
from dateutil.relativedelta import relativedelta
from dateutil.rrule import rrule, MONTHLY

class TimeSeries():
    def year(self):
        today = datetime.now()
        start_date = today + relativedelta(years=-1)
        mint, maxt = datetime.min.time(), datetime.max.time()
        for st in rrule(MONTHLY, count=24, bymonthday=(1, -1,), dtstart=start_date):
            yield st.combine(st, mint)
And this is output from this:
for y in TimeSeries().year():
print(y)
2013-01-31 00:00:00
2013-02-01 00:00:00
2013-02-28 00:00:00
2013-03-01 00:00:00
2013-03-31 00:00:00
2013-04-01 00:00:00
2013-04-30 00:00:00
2013-05-01 00:00:00
2013-05-31 00:00:00
2013-06-01 00:00:00
2013-06-30 00:00:00
2013-07-01 00:00:00
2013-07-31 00:00:00
2013-08-01 00:00:00
2013-08-31 00:00:00
2013-09-01 00:00:00
2013-09-30 00:00:00
2013-10-01 00:00:00
2013-10-31 00:00:00
2013-11-01 00:00:00
2013-11-30 00:00:00
2013-12-01 00:00:00
2013-12-31 00:00:00
2014-01-01 00:00:00
The question is how I can force the counting to start from 2013-01-01 00:00:00, make each month end like 2013-01-31 23:59:59, and so on, and have the loop end on 2014-01-31 23:59:59 instead of 2014-01-01 00:00:00.
I would also like the start date and end date on one line:
2013-03-01 00:00:00 2013-03-31 23:59:59
2013-04-01 00:00:00 2013-04-30 23:59:59
...
...
2014-01-01 00:00:00 2014-01-31 23:59:59
Any suggestion?
First, are you really sure that you want 2013-03-31 23:59:59? Date intervals are traditionally specified as half-open intervals, just like ranges in Python, and the reason for this is that 23:59:59 is not actually the end of a day.
Most obviously, 23:59:59.001 is later than that but on the same day. Python datetime objects include microseconds, so this isn't just a "meh, whatever" problem: if you, e.g., call now(), you can get a time that's incorrectly later than your "end of the day" on the same day.
Less obviously, on a day with a leap second, 23:59:60 is also later but on the same day.
But if you really want this, there are two obvious ways to get it:
You're already iterating dates instead of datetimes and combining in the times manually. And it's obvious when you're dealing with day 1 vs. day -1, because the date's day member will be 1 or it won't be. So:
class TimeSeries():
    def year(self):
        today = datetime.now()
        start_date = today + relativedelta(years=-1)
        mint, maxt = datetime.min.time(), datetime.max.time()
        for st in rrule(MONTHLY, count=24, bymonthday=(1, -1,), dtstart=start_date):
            yield st.combine(st, mint if st.day == 1 else maxt)
Alternatively, instead of iterating both first and last days, just iterate first days, and subtract a second to get the last second of the previous month:
from datetime import timedelta

class TimeSeries():
    def year(self):
        today = datetime.now()
        start_date = today + relativedelta(years=-1)
        mint, maxt = datetime.min.time(), datetime.max.time()
        for st in rrule(MONTHLY, count=24, bymonthday=(1,), dtstart=start_date):
            dt = st.combine(st, mint)
            yield dt - timedelta(seconds=1)
            yield dt
As far as printing these in pairs… well, as written, that's an underspecified problem. The first value in your list is the second value in a pair, except when you run this on the 1st of a month. And likewise, the last date is the first value in a pair, except when you run this on the 31st. So, what do you want to do with them?
If this isn't obvious, look at your example. Your first value is 2013-01-31 00:00:00, but your first pair doesn't start with 2013-01-31.
There are many things you could want here:
Start with the first of the month a year ago, rather than the first first-or-last of the month that happened within the last year. And likewise for the end. So you would have 2013-01-01 in your list, and there would always be pairs.
Start with the first month that started within the last year, and likewise for the end. So you wouldn't get 2013-01-31 in your list, and there would always be pairs.
Use your current rule, and where there's not a pair, use None for the missing value.
etc.
Whatever rule you actually want can be coded up pretty easily. And then you'll probably want to yield in (start, end) tuples, so the print loop can just do this:
for start, end in TimeSeries().year():
    print(start, end)
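For the last-second-of-the-month flavor, here is a minimal sketch of such a tuple-yielding variant, using the question's dtstart convention (so the same first/last-pair caveats discussed above apply); month_pairs is a hypothetical free function, not part of the original class:

from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta
from dateutil.rrule import rrule, MONTHLY

def month_pairs(months=13):
    start_date = datetime.now() + relativedelta(years=-1)
    mint = datetime.min.time()
    # iterate month starts only; each end is one second before the next start
    firsts = [datetime.combine(st, mint)
              for st in rrule(MONTHLY, count=months + 1, bymonthday=1,
                              dtstart=start_date)]
    for first, next_first in zip(firsts, firsts[1:]):
        yield first, next_first - timedelta(seconds=1)

for start, end in month_pairs():
    print(start, end)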
