Python date string to date object

How do I convert a string to a date object in Python?
The string would be: "24052010" (corresponding to the format: "%d%m%Y")
I don't want a datetime.datetime object, but rather a datetime.date.

You can use strptime from the datetime module:
>>> import datetime
>>> datetime.datetime.strptime('24052010', "%d%m%Y").date()
datetime.date(2010, 5, 24)

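If the strings come from user input, it can help to wrap the call so bad input surfaces cleanly; a minimal sketch (the helper name parse_ddmmyyyy is hypothetical):

```python
from datetime import datetime, date

def parse_ddmmyyyy(s: str) -> date:
    """Parse a 'ddmmyyyy' string into a datetime.date; raises ValueError on bad input."""
    return datetime.strptime(s, "%d%m%Y").date()

print(parse_ddmmyyyy("24052010"))  # 2010-05-24
```

strptime validates the calendar for you, so an impossible day like "99992010" raises ValueError rather than producing a bogus date.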

Directly related question:
What if you have
datetime.datetime.strptime("2015-02-24T13:00:00-08:00", "%Y-%B-%dT%H:%M:%S-%H:%M").date()
and you get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/_strptime.py", line 308, in _strptime
format_regex = _TimeRE_cache.compile(format)
File "/usr/local/lib/python2.7/_strptime.py", line 265, in compile
return re_compile(self.pattern(format), IGNORECASE)
File "/usr/local/lib/python2.7/re.py", line 194, in compile
return _compile(pattern, flags)
File "/usr/local/lib/python2.7/re.py", line 251, in _compile
raise error, v # invalid expression
sre_constants.error: redefinition of group name 'H' as group 7; was group 4
and you tried:
datetime.datetime.strptime("2015-02-24T13:00:00-08:00", "%Y-%B-%dT%HH:%MM:%SS-%HH:%MM").date()
but you still get the traceback above.
Answer:
>>> from dateutil.parser import parse
>>> from datetime import datetime
>>> parse("2015-02-24T13:00:00-08:00")
datetime.datetime(2015, 2, 24, 13, 0, tzinfo=tzoffset(None, -28800))
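Since Python 3.7, the standard library can also handle this particular string without dateutil: %z accepts offsets written with a colon, and datetime.fromisoformat parses this layout directly. A minimal sketch:

```python
from datetime import datetime

s = "2015-02-24T13:00:00-08:00"

# strptime: one %H/%M per component, %z for the UTC offset ("-08:00" form needs Python 3.7+)
d1 = datetime.strptime(s, "%Y-%m-%dT%H:%M:%S%z").date()

# fromisoformat handles this layout directly (also Python 3.7+)
d2 = datetime.fromisoformat(s).date()

print(d1, d2)  # 2015-02-24 2015-02-24
```

The original traceback came from repeating %H and %M in one format string; each directive may appear only once.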

If you are lazy and don't want to fight with format strings, you can just go with the parser module.
from dateutil import parser
dt = parser.parse("Jun 1 2005 1:33PM")
print(dt.year, dt.month, dt.day, dt.hour, dt.minute, dt.second)
>2005 6 1 13 33 0
Just a side note: since it tries to match any string representation, it is about 10x slower than strptime.

If you have a date string like "24052010" and you want a date object for it:
from datetime import datetime
cus_date = datetime.strptime("24052010", "%d%m%Y").date()
cus_date is now a datetime.date object.
You can get the date string back from your date object using:
cus_date.strftime("%d%m%Y")

For a single value, the datetime.strptime method is the fastest:
import arrow
from datetime import datetime
import pandas as pd
l = ['24052010']
%timeit _ = list(map(lambda x: datetime.strptime(x, '%d%m%Y').date(), l))
6.86 µs ± 56.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit _ = list(map(lambda x: x.date(), pd.to_datetime(l, format='%d%m%Y')))
305 µs ± 6.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit _ = list(map(lambda x: arrow.get(x, 'DMYYYY').date(), l))
46 µs ± 978 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
For a list of values, pandas' pd.to_datetime is the fastest:
l = ['24052010'] * 1000
%timeit _ = list(map(lambda x: datetime.strptime(x, '%d%m%Y').date(), l))
6.32 ms ± 89.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit _ = list(map(lambda x: x.date(), pd.to_datetime(l, format='%d%m%Y')))
1.76 ms ± 27.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit _ = list(map(lambda x: arrow.get(x, 'DMYYYY').date(), l))
44.5 ms ± 522 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
For the ISO 8601 datetime format, ciso8601 is a rocket:
import ciso8601
l = ['2010-05-24'] * 1000
%timeit _ = list(map(lambda x: ciso8601.parse_datetime(x).date(), l))
241 µs ± 3.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
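For plain ISO dates there is also a fast path in the standard library itself, with no third-party dependency: date.fromisoformat (Python 3.7+), which is typically much faster than strptime for this exact shape. A quick sketch:

```python
from datetime import date

# fromisoformat only accepts the ISO 'YYYY-MM-DD' layout, but parses it very quickly
d = date.fromisoformat('2010-05-24')
print(repr(d))  # datetime.date(2010, 5, 24)
```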

There is another library called arrow that is really great for manipulating Python dates.
import arrow
import datetime
a = arrow.get('24052010', 'DMYYYY').date()
print(isinstance(a, datetime.date)) # True

For the string "24052010", in a very manual way you could just go like this:
First split the string into a (year, month, day) tuple, e.g. (2010, 5, 24), then simply convert that tuple to a datetime, giving something like 2010-05-24.
You can run this on a list of similar strings and convert the entire list of tuples to datetime objects by simply unpacking (*tuple); check the code below.
import datetime
# for a single string simply use:
my_str = "24052010"
date_tup = (int(my_str[4:]), int(my_str[2:4]), int(my_str[:2]))
print(datetime.datetime(*date_tup))
output: 2010-05-24 00:00:00
# for a list of date strings you could use the code below.
date_list = []
str_date = ["24052010", "25082011", "25122011", "01012012"]
for items in str_date:
    date_list.append((int(items[4:]), int(items[2:4]), int(items[:2])))
for dates in date_list:
    # unpack each tuple and convert to datetime
    print(datetime.datetime(*dates))
output:
2010-05-24 00:00:00
2011-08-25 00:00:00
2011-12-25 00:00:00
2012-01-01 00:00:00
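Since the question asks for a date rather than a datetime, the same unpacking works with datetime.date and drops the spurious 00:00:00:

```python
import datetime

my_str = "24052010"
# slice out year, month, day and build a date directly
d = datetime.date(int(my_str[4:]), int(my_str[2:4]), int(my_str[:2]))
print(d)  # 2010-05-24
```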

Use the time module to convert the date.
Code snippet:
import time
time_string = '20150103040500'
var = int(time.mktime(time.strptime(time_string, '%Y%m%d%H%M%S')))
print(var)
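One caveat worth noting: time.mktime interprets the struct_time in the local timezone, so the integer above depends on where the code runs. If the string is meant as UTC, calendar.timegm gives a machine-independent result; a sketch:

```python
import calendar
import time

time_string = '20150103040500'
# timegm treats the struct_time as UTC, unlike mktime which assumes local time
epoch_utc = calendar.timegm(time.strptime(time_string, '%Y%m%d%H%M%S'))
print(epoch_utc)  # 1420257900
```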

Related

How to extract components from a pandas datetime column and assign them

The following code for getting the week number and year works:
import pandas as pd
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='w', periods=10), columns=['date'])
df['weekNo'] = df['date'].dt.isocalendar().week
df['year'] = df['date'].dt.year
date weekNo year
0 2021-12-05 48 2021
1 2021-12-12 49 2021
2 2021-12-19 50 2021
3 2021-12-26 51 2021
4 2022-01-02 52 2022
5 2022-01-09 1 2022
6 2022-01-16 2 2022
7 2022-01-23 3 2022
8 2022-01-30 4 2022
9 2022-02-06 5 2022
but,
df['weekYear'] = "%d/%d" % (df['date'].dt.isocalendar().week, df['date'].dt.year)
Gives the error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_26440/999845293.py in <module>
----> 1 df['weekYear'] = "%d/%d" % (df['date'].dt.isocalendar().week, df['date'].dt.year)
TypeError: %d format: a number is required, not Series
I am accessing the week and year in a way that accesses the series of values, as shown by the first code snippet. Why doesn't that work when I want a formatted string? How do I re-write the code in snippet 2, to make it work? I don't want to make intermediate columns.
Why doesn't that work when I want a formatted string? The error is clear: '%d' expects a single decimal value, not a pandas.Series.
Provided there is a format code for the value to be extracted, dt.strftime can be used.
This requires the 'date' column to be a datetime dtype, which can be done with pd.to_datetime. The column in the following example is already the correct dtype.
'%V': ISO 8601 week as a decimal number with Monday as the first day of the week. Week 01 is the week containing Jan 4.
'%Y': Year with century as a decimal number.
import pandas as pd
# sample data
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='w', periods=10), columns=['date'])
# add week number and year
df['weekYear'] = df.date.dt.strftime('%V/%Y')
# display(df)
date weekYear
0 2021-12-05 48/2021
1 2021-12-12 49/2021
2 2021-12-19 50/2021
3 2021-12-26 51/2021
4 2022-01-02 52/2022
5 2022-01-09 01/2022
6 2022-01-16 02/2022
7 2022-01-23 03/2022
8 2022-01-30 04/2022
9 2022-02-06 05/2022
Timing for 1M rows
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='h', periods=1000000), columns=['date'])
%%timeit
df.date.dt.strftime('%V/%Y')
[out]: 3.74 s ± 19.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can just use:
df['weekYear'] = df['date'].dt.isocalendar().week.astype(str) + '/' + df['date'].dt.year.astype(str)
Or using pandas.Series.str.cat
df['weekYear'] = df['date'].dt.isocalendar().week.astype(str).str.cat(df['date'].dt.year.astype(str), sep='/')
Or using list comprehension
df['weekYear'] = [f"{week}/{year}" for week, year in zip(df['date'].dt.isocalendar().week, df['date'].dt.year)]
Timing for 1M rows
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='h', periods=1000000), columns=['date'])
%%timeit
df['date'].dt.isocalendar().week.astype(str) + '/' + df['date'].dt.year.astype(str)
[out]: 886 ms ± 9.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['date'].dt.isocalendar().week.astype(str).str.cat(df['date'].dt.year.astype(str), sep='/')
[out]: 965 ms ± 8.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
[f"{week}/{year}" for week, year in zip(df['date'].dt.isocalendar().week, df['date'].dt.year)]
[out]: 587 ms ± 7.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
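Note a subtle difference between the variants above: strftime('%V/%Y') zero-pads the week ('01/2022'), while isocalendar().week.astype(str) does not ('1/2022'). If the padded form is wanted without paying the strftime cost, str.zfill closes the gap; a sketch:

```python
import pandas as pd

# two Sundays falling in ISO weeks 1 and 2 of 2022
df = pd.DataFrame(data=pd.date_range('2022-01-03', freq='W', periods=2), columns=['date'])
# zfill(2) pads the week number to match strftime's '%V' output
df['weekYear'] = (df['date'].dt.isocalendar().week.astype(str).str.zfill(2)
                  + '/' + df['date'].dt.year.astype(str))
print(df['weekYear'].tolist())  # ['01/2022', '02/2022']
```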
If you want to keep that formatting, you can use map to apply it to every row. The .dt accessor is not needed since you will be working with the dates themselves, not a Series of dates. Also, isocalendar() on a single date returns a tuple whose second element is the week number:
df["date"] = pd.to_datetime(df["date"])
df['weekYear'] = df['date'].map(lambda x: "%d/%d" % (x.isocalendar()[1], x.year))
Timing for 1M rows
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='h', periods=1000000), columns=['date'])
%%timeit
df['date'].map(lambda x: "%d/%d" % (x.isocalendar()[1], x.year))
[out]: 2.03 s ± 4.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
There are clearly a number of ways this can be solved, so a timing comparison is the best way to determine which is the "best" answer.
Here's a single implementation for anyone to run a timing analysis in Jupyter of all the current answers.
See this answer to modify the code to create a timing analysis plot with a varying number of rows.
See IPython: %timeit for the option descriptions.
import pandas as pd
# sample data with 60M rows
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='s', periods=60000000), columns=['date'])
# functions
def test1(d):
    return d.date.dt.strftime('%V/%Y')
def test2(d):
    return d['date'].dt.isocalendar().week.astype(str) + '/' + d['date'].dt.year.astype(str)
def test3(d):
    return d['date'].dt.isocalendar().week.astype(str).str.cat(d['date'].dt.year.astype(str), sep='/')
def test4(d):
    return [f"{week}/{year}" for week, year in zip(d['date'].dt.isocalendar().week, d['date'].dt.year)]
def test5(d):
    return d['date'].map(lambda x: "%d/%d" % (x.isocalendar()[1], x.year))
t1 = %timeit -r2 -n1 -q -o test1(df)
t2 = %timeit -r2 -n1 -q -o test2(df)
t3 = %timeit -r2 -n1 -q -o test3(df)
t4 = %timeit -r2 -n1 -q -o test4(df)
t5 = %timeit -r2 -n1 -q -o test5(df)
print(f'test1 result: {t1}')
print(f'test2 result: {t2}')
print(f'test3 result: {t3}')
print(f'test4 result: {t4}')
print(f'test5 result: {t5}')
test1 result: 3min 45s ± 653 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
test2 result: 53.4 s ± 459 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
test3 result: 59.7 s ± 164 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
test4 result: 35.5 s ± 409 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
test5 result: 2min 2s ± 29.1 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)

How can I parse '2020-07-30T20:40:33.1000000Z' using datetime.strptime

I am trying to parse and convert "2020-07-30T20:40:33.1000000Z" in Python:
from datetime import datetime
Data = [{'id': 'XXXXXXXXXXXXX', 'number': 3, 'externalId': '0000', 'dateCreated': '2020-07-30T20:40:33.1005865Z', 'dateUpdated': '2020-07-30T20:40:33.36Z', 'tags': []}]
for i in Data:
    creationtime = datetime.strptime(i["dateCreated"], "%Y-%m-%dT%H:%M:%S")
Error:
raise ValueError("unconverted data remains: %s" %
ValueError: unconverted data remains: .1005865Z
I tried :
%Y-%m-%dT%H:%M:%S.%fZ
Can anyone please suggest the correct format that I am missing?
If you really have seven decimal places of fractional seconds and don't care about the tenth-of-a-microsecond digit, you could use re.sub and datetime.fromisoformat:
import re
from datetime import datetime
s = "2020-07-30T20:40:33.1000000Z"
dt = datetime.fromisoformat(re.sub('[0-9]Z', '+00:00', s))
print(dt)
print(repr(dt))
2020-07-30 20:40:33.100000+00:00
datetime.datetime(2020, 7, 30, 20, 40, 33, 100000, tzinfo=datetime.timezone.utc)
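If you'd rather stay with strptime, as the question asks, you can trim the fraction to the six digits that %f supports and rewrite the trailing Z as a numeric offset for %z; a sketch:

```python
import re
from datetime import datetime

s = "2020-07-30T20:40:33.1000000Z"
# keep the first six fractional digits, drop the rest, and turn 'Z' into '+0000'
trimmed = re.sub(r'\.(\d{6})\d*Z$', r'.\1+0000', s)
dt = datetime.strptime(trimmed, "%Y-%m-%dT%H:%M:%S.%f%z")
print(dt)  # 2020-07-30 20:40:33.100000+00:00
```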
...or use dateutil's parser:
from dateutil import parser
dt = parser.parse(s)
print(dt)
print(repr(dt))
2020-07-30 20:40:33.100000+00:00
datetime.datetime(2020, 7, 30, 20, 40, 33, 100000, tzinfo=tzutc())
...or even pandas's to_datetime, if you maybe work with that lib anyway:
import pandas as pd
dt = pd.to_datetime(s)
print(dt)
print(repr(dt))
2020-07-30 20:40:33.100000+00:00
Timestamp('2020-07-30 20:40:33.100000+0000', tz='UTC')
often irrelevant (depending on use-case) but note that convenience costs you some more time:
%timeit datetime.fromisoformat(re.sub('[0-9]Z', '+00:00', s))
1.92 µs ± 151 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit parser.parse(s)
79.8 µs ± 3.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.to_datetime(s)
62.4 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Pandas DatetimeIndex: Number of periods in a frequency string?

How can I get a count of the number of periods in a Pandas DatetimeIndex using a frequency string (offset alias)? For example, let's say I have the following DatetimeIndex:
idx = pd.date_range("2019-03-01", periods=10000, freq='5T')
I would like to know how many 5 minute periods are in a week, or '7D'. I can calculate this "manually":
periods = (7*24*60)//5
Or I can get the length of a dummy index:
len(pd.timedelta_range(start='1 day', end='8 days', freq='5T'))
Neither approach seems very efficient. Is there a better way using Pandas date functionality?
Try using numpy:
import numpy as np
from datetime import timedelta
len(np.arange(pd.Timedelta('1 days'), pd.Timedelta('8 days'), timedelta(minutes=5)))
out:
2016
My testing, first import time:
import time
the OP solution:
start_time = time.time()
len(pd.timedelta_range(start='1 day', end='8 days', freq='5T'))
print((time.time() - start_time))
out:
0.0011057853698730469
using numpy
start_time = time.time()
len(np.arange(pd.Timedelta('1 day'), pd.Timedelta('8 days'), timedelta(minutes=5)))
print((time.time() - start_time))
out:
0.0001723766326904297
Following the suggestion of @meW, here is the performance test using timeit.
using timedelta_range:
%timeit len(pd.timedelta_range(start='1 day', end='8 days', freq='5T'))
out:
91.1 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
using numpy:
%timeit len(np.arange(pd.Timedelta('1 day'), pd.Timedelta('8 days'), timedelta(minutes=5)))
out:
16.3 µs ± 196 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I finally figured out a reasonable solution:
pd.to_timedelta('7D')//idx.freq
This has the advantage that I can specify a range using a frequency string (offset alias), and the period or frequency is inferred from the DataFrame. The numpy solution suggested by @Terry is still the fastest where speed is important.
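The same idea works without any index at all: dividing one Timedelta by another gives the period count directly (note that newer pandas versions prefer '5min' over the deprecated '5T' alias):

```python
import pandas as pd

# how many 5-minute periods fit in 7 days: floor division of two Timedeltas
n = pd.to_timedelta('7D') // pd.to_timedelta('5min')
print(n)  # 2016
```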

Python- Convert Dict hidden in List to DataFrame

I'm having trouble using the normalize in JSON for a dictionary that is being recognized as a list. The goal is to create a data frame from yahoo_finance.
from yahoofinancials import YahooFinancials
import pandas as pd
from pandas.io.json import json_normalize
ticker = 'AAPL'
yahoo_financials = YahooFinancials(ticker)
balance_sheet_data_qt = yahoo_financials.get_financial_stmts('quarterly', 'balance')
#The return is a bit messy, I've simplified it with:
user_dict=balance_sheet_data_qt.get('balanceSheetHistoryQuarterly').get(ticker)
df=pd.DataFrame(user_dict)
But Still having trouble taking the data across the finish line, the goal would be for each quarterly date as the index for each row, and the key finances listed as columns.
You can use ChainMap from collections.
from collections import ChainMap
df = pd.DataFrame.from_dict(ChainMap(*user_dict), orient='index')
If you don't want to use ChainMap, you can iterate through the dicts in user_dict (a list) and append each resulting DataFrame to the main df (note that DataFrame.append was deprecated and removed in pandas 2.0; use pd.concat there):
df = pd.DataFrame()
for d in user_dict:
    df = df.append(pd.DataFrame.from_dict(d, orient='index'))
ChainMap runs significantly faster for me
1.43 ms ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
vs
7 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
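For readers without a yahoofinancials setup, here is a self-contained reproduction with a hand-made user_dict mimicking the payload shape; the dates and figures are purely illustrative:

```python
from collections import ChainMap

import pandas as pd

# illustrative stand-in for balance_sheet_data_qt['balanceSheetHistoryQuarterly']['AAPL']:
# a list of single-key dicts, one per quarter
user_dict = [
    {'2018-12-29': {'accountsPayable': 44293000000, 'treasuryStock': -3588000000}},
    {'2018-09-29': {'accountsPayable': 55888000000, 'treasuryStock': -3454000000}},
]

# ChainMap merges the single-key dicts into one mapping keyed by date
df = pd.DataFrame.from_dict(ChainMap(*user_dict), orient='index')
print(df)
```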
I think this is what you are looking for.
The following has the date as index and the financial data in the columns:
dataframe_entries = list()
for result in balance_sheet_data_qt.get('balanceSheetHistoryQuarterly').get('AAPL'):
    extracted_date = list(result)[0]
    dataframe_row = list(result.values())[0]
    dataframe_row['date'] = extracted_date
    dataframe_entries.append(dataframe_row)
df = pd.DataFrame(dataframe_entries).set_index('date')
Outputs:
date accountsPayable treasuryStock
2018-12-29 44293000000 ... -3588000000
2018-09-29 55888000000 ... -3454000000
2018-06-30 38489000000 ... -3111000000
2018-03-31 34311000000 ... -3064000000

What is the difference between bson.son.SON and collections.OrderedDict?

bson.son.SON is used mainly in pymongo to get an ordered mapping (dict).
But Python already has an ordered dict in collections.OrderedDict.
I have read the docs of bson.son.SON. They say SON is similar to OrderedDict but do not mention the difference.
So what is the difference? When should we use SON, and when should we use OrderedDict?
Currently, the slight difference between the two is that bson.son.SON remains backward compatible with Python 2.7 and older versions.
Also the argument that SON serializes faster than OrderedDict is no longer correct in 2018.
The son module was added on Jan 8, 2009; collections.OrderedDict (PEP 372) was added to Python on Mar 2, 2009.
While the difference in dates doesn't tell which was released first, it is interesting that MongoDB had already implemented an ordered map for their use case. I guess they decided to keep maintaining it for backward compatibility instead of switching all SON references in their codebase to collections.OrderedDict.
In small experiments with both, you can easily see that collections.OrderedDict performs better than bson.son.SON.
In [1]: from bson.son import SON
   ...: from collections import OrderedDict
   ...: import copy, json
   ...: print(set(dir(SON)) - set(dir(OrderedDict)))
{'__weakref__', 'iteritems', 'iterkeys', 'itervalues', '__module__', 'has_key', '__deepcopy__', 'to_dict'}
In [2]: %timeit s = SON([('a',2)]); z = copy.deepcopy(s)
14.3 µs ± 758 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [3]: %timeit s = OrderedDict([('a',2)]); z = copy.deepcopy(s)
7.54 µs ± 209 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [4]: %timeit s = SON(data=[('a',2)]); z = json.dumps(s)
11.5 µs ± 350 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [5]: %timeit s = OrderedDict([('a',2)]); z = json.dumps(s)
5.35 µs ± 192 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In answer to your question about when to use SON: use SON if running your software on Python versions older than 2.7. If you can help it, use OrderedDict from the collections module.
You can also use dict, they are ordered now in Python 3.7
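That last point is easy to verify: since Python 3.7, insertion order of plain dicts is guaranteed by the language specification, so iteration reflects the order keys were added:

```python
# plain dicts preserve insertion order in Python 3.7+
d = {'b': 1, 'a': 2, 'c': 3}
d['z'] = 4
print(list(d))  # ['b', 'a', 'c', 'z']
```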
