Infer year from day of week and date with python datetime - python

I have data which is of the form Thu Jun 22 09:43:06 and I would like to infer the year from this to use datetime to calculate the time between two dates. Is there a way to use datetime to infer the year given the above data?

No, but if you know the range (for example 2010..2017), you can just iterate over years to see if Jun 22 falls on Thursday:
import datetime

def find_year(start_year, end_year, month, day, week_day):
    for y in range(start_year, end_year + 1):
        if datetime.datetime(y, month, day, 0, 0).weekday() == week_day:
            yield y
# weekday is 0..6 starting from Monday, so 3 stands for Thursday
print(list(find_year(2010, 2017, 6, 22, 3)))
[2017]
For longer ranges, though, there might be more than one result:
print(list(find_year(2000,2017, 6, 22, 3)))
[2000, 2006, 2017]
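When several years match, one possible tie-breaker is to take the most recent matching year that is not later than some known upper bound. The following is a minimal sketch of that idea using only the standard library. The helper name infer_datetime, its latest_year parameter, and the 28-year look-back window are assumptions made for illustration and are not part of the answer above. The weekday and month abbreviations are parsed by hand so the result does not depend on the locale.

```python
import datetime

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def infer_datetime(stamp, latest_year, max_lookback=28):
    """Return the most recent datetime (year <= latest_year) whose weekday
    matches the one in stamp, e.g. 'Thu Jun 22 09:43:06', or None."""
    dow, mon, day, clock = stamp.split()
    week_day = WEEKDAYS.index(dow)  # Monday is 0, as in datetime.weekday()
    month = MONTHS.index(mon) + 1
    hour, minute, second = (int(part) for part in clock.split(":"))
    # Walk backwards from latest_year (an upper bound supplied by the caller).
    for year in range(latest_year, latest_year - max_lookback, -1):
        try:
            candidate = datetime.datetime(year, month, int(day), hour, minute, second)
        except ValueError:
            continue  # e.g. Feb 29 in a non-leap year
        if candidate.weekday() == week_day:
            return candidate
    return None


first = infer_datetime("Thu Jun 22 09:43:06", latest_year=2017)
second = infer_datetime("Mon Jun 26 10:43:06", latest_year=2017)
print(second - first)  # 4 days, 1:00:00
```

With both timestamps resolved to full datetime objects, the difference is an ordinary timedelta.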

You could also use pd.date_range to generate a lookup table:
import pandas as pd
calendar = pd.date_range('2017-01-01', '2020-12-31')
dow = {i: d for i, d in enumerate(('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'))}
moy = {i: d for i, d in enumerate(('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'), 1)}
lup = {'{} {} {:>2d}'.format(dow[d.weekday()], moy[d.month], d.day): str(d.year) for d in calendar}
date = 'Tue Jun 25'
print(lup[date])
# 2019
print(pd.Timestamp(date + ' ' + lup[date]))
# 2019-06-25 00:00:00
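Benchmarking it in IPython shows a decent speedup once the table is generated, but the overhead of generating the table may not be worth it unless you have a lot of dates to look up. The timings below use a gen_lookup helper whose definition is not shown here. Note that In [35] times only the creation of the generator returned by find_year, not the search itself, so it is not directly comparable to the list(find_year(...)) call in In [32].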
In [28]: lup = gen_lookup('1-1-2010', '12-31-2017')
In [29]: date = 'Thu Jun 22'
In [30]: lup[date]
Out[30]: ['2017']
In [32]: list(find_year(2010, 2017, 6, 22, 3))
Out[32]: [2017]
In [33]: %timeit lup = gen_lookup('1-1-2010', '12-31-2017')
13.8 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [34]: %timeit yr = lup[date]
54.1 ns ± 0.547 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [35]: %timeit yr = find_year(2010, 2017, 6, 22, 3)
248 ns ± 3.61 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
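The dictionary comprehension above keeps only one year per key, so over a longer range an ambiguous date silently resolves to the last matching year. Since gen_lookup itself is not shown, here is a hedged standard-library sketch of a lookup table that keeps every matching year. The name build_lookup and its parameters are hypothetical and not taken from the answer.

```python
import datetime
from collections import defaultdict

DOW = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
MOY = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
       "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def build_lookup(start_year, end_year):
    """Map keys such as 'Thu Jun 22' to a list of every matching year."""
    table = defaultdict(list)
    day = datetime.date(start_year, 1, 1)
    last = datetime.date(end_year, 12, 31)
    one_day = datetime.timedelta(days=1)
    while day <= last:
        # Same key layout as the answer above: the day is right-aligned to width 2.
        key = "{} {} {:>2d}".format(DOW[day.weekday()], MOY[day.month - 1], day.day)
        table[key].append(day.year)
        day += one_day
    return dict(table)


lup = build_lookup(2000, 2017)
print(lup["Thu Jun 22"])  # [2000, 2006, 2017]
```

Returning a list makes the ambiguity visible to the caller, who can then choose a year using whatever extra knowledge is available.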

Related

How to extract components from a pandas datetime column and assign them

The following code for getting the week number and year works:
import pandas as pd
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='w', periods=10), columns=['date'])
df['weekNo'] = df['date'].dt.isocalendar().week
df['year'] = df['date'].dt.year
date weekNo year
0 2021-12-05 48 2021
1 2021-12-12 49 2021
2 2021-12-19 50 2021
3 2021-12-26 51 2021
4 2022-01-02 52 2022
5 2022-01-09 1 2022
6 2022-01-16 2 2022
7 2022-01-23 3 2022
8 2022-01-30 4 2022
9 2022-02-06 5 2022
but,
df['weekYear'] = "%d/%d" % (df['date'].dt.isocalendar().week, df['date'].dt.year)
Gives the error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_26440/999845293.py in <module>
----> 1 df['weekYear'] = "%d/%d" % (df['date'].dt.isocalendar().week, df['date'].dt.year)
TypeError: %d format: a number is required, not Series
I am accessing the week and year in a way that accesses the series of values, as shown by the first code snippet. Why doesn't that work when I want a formatted string? How do I re-write the code in snippet 2, to make it work? I don't want to make intermediate columns.
Why doesn't that work when I want a formatted string? The error message says why: '%d' expects a single number, not a pandas.Series.
Provided there is a format code for the value to be extracted, dt.strftime can be used.
This requires the 'date' column to be a datetime dtype, which can be done with pd.to_datetime. The column in the following example is already the correct dtype.
'%V': ISO 8601 week as a decimal number with Monday as the first day of the week. Week 01 is the week containing Jan 4.
'%Y': Year with century as a decimal number.
import pandas as pd
# sample data
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='w', periods=10), columns=['date'])
# add week number and year
df['weekYear'] = df.date.dt.strftime('%V/%Y')
# display(df)
date weekYear
0 2021-12-05 48/2021
1 2021-12-12 49/2021
2 2021-12-19 50/2021
3 2021-12-26 51/2021
4 2022-01-02 52/2022
5 2022-01-09 01/2022
6 2022-01-16 02/2022
7 2022-01-23 03/2022
8 2022-01-30 04/2022
9 2022-02-06 05/2022
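One thing to watch in the output above: '%V' is the ISO 8601 week number while '%Y' is the calendar year, so 2022-01-02 is labelled 52/2022 even though that week belongs to ISO year 2021. If the label should identify the ISO week unambiguously, the week can be paired with the ISO year instead. A minimal sketch using only the standard library (the helper name iso_week_label is made up for illustration):

```python
import datetime


def iso_week_label(d):
    """Format a date as 'WW/YYYY' using the ISO week and the ISO year."""
    iso_year, iso_week, _weekday = d.isocalendar()
    return f"{iso_week:02d}/{iso_year}"


print(iso_week_label(datetime.date(2022, 1, 2)))  # 52/2021
print(iso_week_label(datetime.date(2022, 1, 9)))  # 01/2022
```

In pandas, the same pairing is available from the year and week columns of df['date'].dt.isocalendar().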
Timing for 1M rows
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='h', periods=1000000), columns=['date'])
%%timeit
df.date.dt.strftime('%V/%Y')
[out]: 3.74 s ± 19.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can just use:
df['weekYear'] = df['date'].dt.isocalendar().week.astype(str) + '/' + df['date'].dt.year.astype(str)
Or using pandas.Series.str.cat
df['weekYear'] = df['date'].dt.isocalendar().week.astype(str).str.cat(df['date'].dt.year.astype(str), sep='/')
Or using list comprehension
df['weekYear'] = [f"{week}/{year}" for week, year in zip(df['date'].dt.isocalendar().week, df['date'].dt.year)]
Timing for 1M rows
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='h', periods=1000000), columns=['date'])
%%timeit
df['date'].dt.isocalendar().week.astype(str) + '/' + df['date'].dt.year.astype(str)
[out]: 886 ms ± 9.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['date'].dt.isocalendar().week.astype(str).str.cat(df['date'].dt.year.astype(str), sep='/')
[out]: 965 ms ± 8.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
[f"{week}/{year}" for week, year in zip(df['date'].dt.isocalendar().week, df['date'].dt.year)]
[out]: 587 ms ± 7.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If you want to keep the "%d/%d" formatting, you can use map to apply it to every row. The .dt accessor is not needed, because the function receives each date itself rather than a Series of dates. Also, isocalendar() returns a tuple whose second element is the week number:
df["date"] = pd.to_datetime(df["date"])
df['weekYear'] = df['date'].map(lambda x: "%d/%d" % (x.isocalendar()[1], x.year))
Timing for 1M rows
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='h', periods=1000000), columns=['date'])
%%timeit
df['date'].map(lambda x: "%d/%d" % (x.isocalendar()[1], x.year))
[out]: 2.03 s ± 4.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
There are clearly a number of ways this can be solved, so a timing comparison is the best way to determine which is the "best" answer.
Here's a single implementation for anyone to run a timing analysis in Jupyter of all the current answers.
See this answer to modify the code to create a timing analysis plot with a varying number of rows.
See IPython: %timeit for the option descriptions.
import pandas as pd
# sample data with 60M rows
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='s', periods=60000000), columns=['date'])
# functions
def test1(d):
    return d.date.dt.strftime('%V/%Y')
def test2(d):
    return d['date'].dt.isocalendar().week.astype(str) + '/' + d['date'].dt.year.astype(str)
def test3(d):
    return d['date'].dt.isocalendar().week.astype(str).str.cat(d['date'].dt.year.astype(str), sep='/')
def test4(d):
    return [f"{week}/{year}" for week, year in zip(d['date'].dt.isocalendar().week, d['date'].dt.year)]
def test5(d):
    return d['date'].map(lambda x: "%d/%d" % (x.isocalendar()[1], x.year))
t1 = %timeit -r2 -n1 -q -o test1(df)
t2 = %timeit -r2 -n1 -q -o test2(df)
t3 = %timeit -r2 -n1 -q -o test3(df)
t4 = %timeit -r2 -n1 -q -o test4(df)
t5 = %timeit -r2 -n1 -q -o test5(df)
print(f'test1 result: {t1}')
print(f'test2 result: {t2}')
print(f'test3 result: {t3}')
print(f'test4 result: {t4}')
print(f'test5 result: {t5}')
test1 result: 3min 45s ± 653 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
test2 result: 53.4 s ± 459 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
test3 result: 59.7 s ± 164 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
test4 result: 35.5 s ± 409 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
test5 result: 2min 2s ± 29.1 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
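The %timeit magic is only available in IPython/Jupyter. Outside of it, a similar comparison can be sketched with the standard library's timeit module. The example below is an illustration only: it uses a small list of plain datetime objects instead of a DataFrame, and the function names with_fstring and with_percent are hypothetical stand-ins, not the test1..test5 functions above.

```python
import timeit
from datetime import datetime, timedelta

# Small stand-in data set: 10,000 hourly timestamps (far fewer than the rows timed above).
dates = [datetime(2021, 11, 29) + timedelta(hours=i) for i in range(10_000)]


def with_fstring(ds):
    return [f"{d.isocalendar()[1]}/{d.year}" for d in ds]


def with_percent(ds):
    return ["%d/%d" % (d.isocalendar()[1], d.year) for d in ds]


results = {}
for fn in (with_fstring, with_percent):
    # Best of 3 repeats, each repeat calling the function 5 times.
    results[fn.__name__] = min(timeit.repeat(lambda: fn(dates), number=5, repeat=3))

for name, seconds in results.items():
    print(f"{name}: {seconds:.4f} s for 5 calls")
```

Taking the minimum over the repeats reduces the influence of background noise on the comparison.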

How can I parse '2020-07-30T20:40:33.1000000Z' using datetime.strptime

I am trying to parse and convert "2020-07-30T20:40:33.1000000Z" in Python:
from datetime import datetime
Data = [{'id': 'XXXXXXXXXXXXX', 'number': 3, 'externalId': '0000', 'dateCreated': '2020-07-30T20:40:33.1005865Z', 'dateUpdated': '2020-07-30T20:40:33.36Z', 'tags': []}]
for i in Data:
    creationtime = datetime.strptime(i["dateCreated"], "%Y-%m-%dT%H:%M:%S")
Error:
raise ValueError("unconverted data remains: %s" %
ValueError: unconverted data remains: .1005865Z
I tried:
%Y-%m-%dT%H:%M:%S.%fZ
Can anyone please suggest the correct format that I am missing?
If you really have 7 decimal places of fractional seconds and don't care about the tenths of a microsecond, you could use re.sub and datetime.fromisoformat (%f accepts at most six digits, which is why the format above fails on this input):
import re
from datetime import datetime
s = "2020-07-30T20:40:33.1000000Z"
dt = datetime.fromisoformat(re.sub('[0-9]Z', '+00:00', s))
print(dt)
print(repr(dt))
2020-07-30 20:40:33.100000+00:00
datetime.datetime(2020, 7, 30, 20, 40, 33, 100000, tzinfo=datetime.timezone.utc)
...or use dateutil's parser:
from dateutil import parser
dt = parser.parse(s)
print(dt)
print(repr(dt))
2020-07-30 20:40:33.100000+00:00
datetime.datetime(2020, 7, 30, 20, 40, 33, 100000, tzinfo=tzutc())
...or even pandas' to_datetime, if you are working with that library anyway:
import pandas as pd
dt = pd.to_datetime(s)
print(dt)
print(repr(dt))
2020-07-30 20:40:33.100000+00:00
Timestamp('2020-07-30 20:40:33.100000+0000', tz='UTC')
Often irrelevant (depending on the use case), but note that the convenience costs you some more time:
%timeit datetime.fromisoformat(re.sub('[0-9]Z', '+00:00', s))
1.92 µs ± 151 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit parser.parse(s)
79.8 µs ± 3.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.to_datetime(s)
62.4 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
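The re.sub pattern above removes exactly one digit before the Z, which suits inputs with seven fractional digits. The question's data also contains a value with two fractional digits ('2020-07-30T20:40:33.36Z'), so a variant that truncates to at most six digits may be more robust. This is a hedged sketch using only the standard library; the helper name parse_utc is hypothetical, and it assumes every input ends in 'Z' (UTC).

```python
import re
from datetime import datetime, timezone


def parse_utc(text):
    """Parse an ISO-like UTC timestamp ending in 'Z' with any number of fractional digits."""
    # Keep at most six fractional digits; %f cannot parse more than that.
    trimmed = re.sub(r"(\.\d{1,6})\d*Z$", r"\1Z", text)
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ" if "." in trimmed else "%Y-%m-%dT%H:%M:%SZ"
    # The trailing 'Z' is matched literally, so attach UTC explicitly.
    return datetime.strptime(trimmed, fmt).replace(tzinfo=timezone.utc)


print(parse_utc("2020-07-30T20:40:33.1005865Z"))  # 2020-07-30 20:40:33.100586+00:00
print(parse_utc("2020-07-30T20:40:33.36Z"))       # 2020-07-30 20:40:33.360000+00:00
```

Extra digits are truncated rather than rounded, which matches the "don't care about the tenths of a microsecond" assumption in the answer.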

Pandas DatetimeIndex: Number of periods in a frequency string?

How can I get a count of the number of periods in a Pandas DatetimeIndex using a frequency string (offset alias)? For example, let's say I have the following DatetimeIndex:
idx = pd.date_range("2019-03-01", periods=10000, freq='5T')
I would like to know how many 5 minute periods are in a week, or '7D'. I can calculate this "manually":
periods = (7*24*60)//5
Or I can get the length of a dummy index:
len(pd.timedelta_range(start='1 day', end='8 days', freq='5T'))
Neither approach seems very efficient. Is there a better way using Pandas date functionality?
Try using numpy:
import numpy as np
import pandas as pd
from datetime import timedelta
len(np.arange(pd.Timedelta('1 days'), pd.Timedelta('8 days'), timedelta(minutes=5)))
out:
2016
My testing, first import time:
import time
the OP solution:
start_time = time.time()
len(pd.timedelta_range(start='1 day', end='8 days', freq='5T'))
print((time.time() - start_time))
out:
0.0011057853698730469
using numpy
start_time = time.time()
len(np.arange(pd.Timedelta('1 day'), pd.Timedelta('8 days'), timedelta(minutes=5)))
print((time.time() - start_time))
out:
0.0001723766326904297
Following the suggestion of @meW, here is the performance test using timeit
using timedelta_range:
%timeit len(pd.timedelta_range(start='1 day', end='8 days', freq='5T'))
out:
91.1 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
using numpy:
%timeit len(np.arange(pd.Timedelta('1 day'), pd.Timedelta('8 days'), timedelta(minutes=5)))
out:
16.3 µs ± 196 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I finally figured out a reasonable solution:
pd.to_timedelta('7D')//idx.freq
This has the advantage that I can specify a range using a frequency string (offset alias), and the period is taken from the frequency of the index (idx.freq). The numpy solution suggested by @Terry is still the fastest where speed is important.
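The same floor-division idea also works with plain datetime.timedelta objects from the standard library (pd.Timedelta is a subclass of datetime.timedelta). The sketch below is not the author's pandas expression; it only illustrates the underlying arithmetic with hard-coded durations.

```python
from datetime import timedelta

# Number of whole 5-minute periods in 7 days.
periods = timedelta(days=7) // timedelta(minutes=5)
print(periods)  # 2016

# divmod also reports what is left over when the step does not divide evenly.
whole, remainder = divmod(timedelta(days=1), timedelta(hours=7))
print(whole, remainder)  # 3 3:00:00
```

Dividing one timedelta by another with // yields an integer, so no dummy index or range needs to be materialised.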

Python Pandas: getting the rows with highest value

Hello! I have a dataframe with year (1910 ~ 2014), name, count (number of occurrence of each name) as columns. I want to create a new dataframe that shows the name with highest occurrence per year, and I'm not entirely sure about how to do this. Thanks!
Vectorized way
group = df.groupby('year')
df.loc[group['count'].agg('idxmax')]
Try this:
import pandas as pd
d = {'year': [1910, 1910, 1910, 1920, 1920, 1920], 'name': ["Virginia", "Mary", "Elizabeth", "Virginia", "Mary", "Elizabeth"], 'count': [848, 420, 747, 1048, 221, 147]}
df = pd.DataFrame(data=d)
rows = []
years = pd.unique(df['year'])
for year in years:
    tmp_df = df.loc[df['year'] == year]
    # sort by count (not year), highest first, so row 0 is the most frequent name
    tmp_df = tmp_df.sort_values(by='count', ascending=False)
    rows.append(tmp_df.iloc[0])
# DataFrame.append was removed in pandas 2.0, so build the result from the collected rows
df_results = pd.DataFrame(rows)
I suppose groupby & apply is a good approach:
df = pd.DataFrame({
'Year': ['1910', '1910', '1911', '1911', '1911', '2014', '2014'],
'Name': ['Mary', 'Virginia', 'Elizabeth', 'Mary', 'Ann', 'Virginia', 'Elizabeth'],
'Count': [848, 270, 254, 360, 451, 81, 380]
})
df
Out:
Year Name Count
0 1910 Mary 848
1 1910 Virginia 270
2 1911 Elizabeth 254
3 1911 Mary 360
4 1911 Ann 451
5 2014 Virginia 81
6 2014 Elizabeth 380
df.groupby(['Year']).apply(lambda x: x.sort_values('Count', ascending=False).head(1))
Out:
Year Name Count
Year
1910 0 1910 Mary 848
1911 4 1911 Ann 451
2014 6 2014 Elizabeth 380
You can also replace head(1) with head(n) to get the n most frequent names per year:
df.groupby(['Year']).apply(lambda x: x.sort_values('Count', ascending=False).head(2))
Out:
Year Name Count
Year
1910 0 1910 Mary 848
1 1910 Virginia 270
1911 4 1911 Ann 451
3 1911 Mary 360
2014 6 2014 Elizabeth 380
5 2014 Virginia 81
If you don't like the additional index level, drop it via .reset_index(level=0, drop=True):
top_names = df.groupby(['Year']).apply(lambda x: x.sort_values('Count', ascending=False).head(1))
top_names.reset_index(level=0, drop=True)
Out:
Year Name Count
0 1910 Mary 848
4 1911 Ann 451
6 2014 Elizabeth 380
Another way of doing this is to sort by Count and de-duplicate the Year column (faster, too):
df.sort_values('Count', ascending=False).drop_duplicates(['Year'])
Timing results are below; you can time each method on your own data and choose accordingly:
%timeit df.sort_values('Count', ascending=False).drop_duplicates(['Year'])
result: 917 µs ± 13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.loc[df.groupby('Year')['Count'].agg('idxmax')]
result: 1.06 ms ± 5.97 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.loc[df.groupby('Year')['Count'].idxmax(), :]
result: 1.13 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
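To check that the sort-and-de-duplicate approach and the idxmax approach select the same rows, the two can be run side by side on the small sample frame used earlier. This is a sketch for illustration; the variable names by_sort and by_idxmax are made up, and the extra sort_values("Year") is added only so the two results line up in the same order.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": ["1910", "1910", "1911", "1911", "1911", "2014", "2014"],
    "Name": ["Mary", "Virginia", "Elizabeth", "Mary", "Ann", "Virginia", "Elizabeth"],
    "Count": [848, 270, 254, 360, 451, 81, 380],
})

# Sort by Count (highest first), keep the first row seen for each Year.
by_sort = (
    df.sort_values("Count", ascending=False)
      .drop_duplicates(["Year"])
      .sort_values("Year")
)

# Look up the row label of the maximum Count within each Year.
by_idxmax = df.loc[df.groupby("Year")["Count"].idxmax()]

print(by_sort["Name"].tolist())    # ['Mary', 'Ann', 'Elizabeth']
print(by_idxmax["Name"].tolist())  # ['Mary', 'Ann', 'Elizabeth']
```

The sample data has no tied counts; with ties, the two approaches are not guaranteed to pick the same row.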

Python date string to date object

How do I convert a string to a date object in python?
The string would be: "24052010" (corresponding to the format: "%d%m%Y")
I don't want a datetime.datetime object, but rather a datetime.date.
You can use strptime in the datetime package of Python:
>>> import datetime
>>> datetime.datetime.strptime('24052010', "%d%m%Y").date()
datetime.date(2010, 5, 24)
Directly related question:
What if you have
datetime.datetime.strptime("2015-02-24T13:00:00-08:00", "%Y-%B-%dT%H:%M:%S-%H:%M").date()
and you get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/_strptime.py", line 308, in _strptime
format_regex = _TimeRE_cache.compile(format)
File "/usr/local/lib/python2.7/_strptime.py", line 265, in compile
return re_compile(self.pattern(format), IGNORECASE)
File "/usr/local/lib/python2.7/re.py", line 194, in compile
return _compile(pattern, flags)
File "/usr/local/lib/python2.7/re.py", line 251, in _compile
raise error, v # invalid expression
sre_constants.error: redefinition of group name 'H' as group 7; was group 4
and you tried:
<-24T13:00:00-08:00", "%Y-%B-%dT%HH:%MM:%SS-%HH:%MM").date()
but you still get the traceback above.
Answer:
>>> from dateutil.parser import parse
>>> from datetime import datetime
>>> parse("2015-02-24T13:00:00-08:00")
datetime.datetime(2015, 2, 24, 13, 0, tzinfo=tzoffset(None, -28800))
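The traceback above comes from using %H and %M twice in one format string, and %B expects a full month name rather than a number. On Python 3.7 or later the same string can also be parsed with the standard library alone. A minimal sketch, assuming the input always carries a numeric UTC offset:

```python
from datetime import datetime, timedelta

text = "2015-02-24T13:00:00-08:00"

# %m is the numeric month, and %z consumes the whole "-08:00" offset
# (the colon form is accepted from Python 3.7).
via_strptime = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S%z")

# datetime.fromisoformat (Python 3.7+) handles this layout directly.
via_iso = datetime.fromisoformat(text)

print(via_strptime.date())       # 2015-02-24
print(via_strptime.utcoffset())  # -1 day, 16:00:00
```

Both calls return a timezone-aware datetime; .date() then gives the datetime.date the original question asked for.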
If you are lazy and don't want to fight with string literals, you can just go with the parser module.
from dateutil import parser
dt = parser.parse("Jun 1 2005 1:33PM")
print(dt.year, dt.month, dt.day,dt.hour, dt.minute, dt.second)
>2005 6 1 13 33 0
Just a side note, as we are trying to match any string representation, it is 10x slower than strptime
If you have a date string like "24052010" and want a date object for it:
from datetime import datetime
cus_date = datetime.strptime("24052010", "%d%m%Y").date()
cus_date is now a date object.
You can get the date string back from the date object with:
cus_date.strftime("%d%m%Y")
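strptime raises ValueError when the string is not a valid date, which matters if the input comes from users or files. The following sketch wraps the conversion and the round trip in a small helper; the name parse_ddmmyyyy and the choice to return None on failure are assumptions for illustration.

```python
from datetime import date, datetime


def parse_ddmmyyyy(text):
    """Return a datetime.date for a 'DDMMYYYY' string, or None if it is not a valid date."""
    try:
        return datetime.strptime(text, "%d%m%Y").date()
    except ValueError:
        return None


d = parse_ddmmyyyy("24052010")
print(d)                           # 2010-05-24
print(d.strftime("%d%m%Y"))        # 24052010
print(parse_ddmmyyyy("31022010"))  # None (there is no 31 February)
```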
For single value the datetime.strptime method is the fastest
import arrow
from datetime import datetime
import pandas as pd
l = ['24052010']
%timeit _ = list(map(lambda x: datetime.strptime(x, '%d%m%Y').date(), l))
6.86 µs ± 56.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit _ = list(map(lambda x: x.date(), pd.to_datetime(l, format='%d%m%Y')))
305 µs ± 6.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit _ = list(map(lambda x: arrow.get(x, 'DMYYYY').date(), l))
46 µs ± 978 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
For a list of values the pandas pd.to_datetime is the fastest
l = ['24052010'] * 1000
%timeit _ = list(map(lambda x: datetime.strptime(x, '%d%m%Y').date(), l))
6.32 ms ± 89.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit _ = list(map(lambda x: x.date(), pd.to_datetime(l, format='%d%m%Y')))
1.76 ms ± 27.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit _ = list(map(lambda x: arrow.get(x, 'DMYYYY').date(), l))
44.5 ms ± 522 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
For ISO8601 datetime format the ciso8601 is a rocket
import ciso8601
l = ['2010-05-24'] * 1000
%timeit _ = list(map(lambda x: ciso8601.parse_datetime(x).date(), l))
241 µs ± 3.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Another library, arrow, is also great for manipulating dates in Python.
import arrow
import datetime
a = arrow.get('24052010', 'DMYYYY').date()
print(isinstance(a, datetime.date)) # True
For the string "24052010", you could also do it in a very manual way:
First slice the string into year, month, and day to get a tuple such as (2010, 5, 24), then convert that tuple to a date such as 2010-05-24.
You can run the same code on a list of strings, converting each resulting tuple to a date by unpacking it (*tuple). See the code below.
import datetime
# for a single string simply use:
my_str = "24052010"
date_tup = (int(my_str[4:]), int(my_str[2:4]), int(my_str[:2]))
print(datetime.datetime(*date_tup))
output: 2010-05-24 00:00:00
# for a list of date strings you could use the code below
date_list = []
str_date = ["24052010", "25082011", "25122011", "01012012"]
for items in str_date:
    date_list.append((int(items[4:]), int(items[2:4]), int(items[:2])))
for dates in date_list:
    # unpack each tuple and convert it to a datetime
    print(datetime.datetime(*dates))
output:
2010-05-24 00:00:00
2011-08-25 00:00:00
2011-12-25 00:00:00
2012-01-01 00:00:00
Use the time module to convert the data.
Code snippet:
import time
tring = '20150103040500'
var = int(time.mktime(time.strptime(tring, '%Y%m%d%H%M%S')))
print(var)
