Parse Date field in multiple formats in Python

I have a dataframe field with 480k date records.
The date field contains dates in multiple formats: Jan 8, 01-01-2017, dec-08, dec 08, Dec 00, 01/01/2017
Is there a way to parse the field more quickly in Python?
I tried the following, but it takes forever because it loops over every line:
from dateutil.parser import parse

for i in range(len(df['Date'])):
    try:
        df['Date'][i] = parse(df['Date'][i])
    except:
        pass
I even tried the .apply method, but I keep getting an error because of the Dec 00 date format:
df['Date'] = df['Date'].apply(dateutil.parser.parse)
Help anyone?
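One way to speed this up (a minimal sketch, not the asker's code, assuming df['Date'] is a pandas Series of strings): wrap dateutil in a helper that returns NaT for unparseable values such as Dec 00, and use .apply instead of writing back row by row, which avoids the slow chained indexing in the loop above.

import pandas as pd
from dateutil.parser import parse

def safe_parse(value):
    # Unparseable entries such as 'Dec 00' become NaT instead of raising
    try:
        return parse(value)
    except (ValueError, TypeError, OverflowError):
        return pd.NaT

df['Date'] = df['Date'].apply(safe_parse)

If most of the 480k rows share one format, a single vectorised pd.to_datetime(df['Date'], errors='coerce') pass first, with safe_parse applied only to the rows that come back as NaT, should be faster still.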

Related

Trying to solve a date discrepancy issue with timezone-aware data filtered from a csv between two dates

My problem is in finding the best practice for including all requested timezone-aware dates in a selected range. Depending on how I approach it, either the enddate or the startdate fails to return its records from the csv. I cannot get both to work at the same time without tricking the code (which seems like bad coding practice, and might produce issues if the user is entering dates while browsing from a different timezone locale).
In its simplest form, I am trying to return the complete year of 2021 transaction records from a csv file. The URL call to FastAPI to run the code is this:
http://127.0.0.1:8000/v1/portfolio/tastytx?year=2021
When the user enters the year parameter in the URL, the enddate and startdate get set, then the csv is read into a dataframe, the dataframe is filtered, and a json is returned. In the following code snippet, 'frame' is the dataframe from the pd.read_csv code that is not shown:
import datetime as dt
import pandas as pd
from pytz import timezone
import pytz
import json
from fastapi import FastAPI, Response
startdate = year +'-01-01'
enddate = year + '-12-31'
startdate = dt.datetime.strptime(startdate, '%Y-%m-%d') #convert string to a date
startdate = startdate.replace(tzinfo=pytz.utc) #convert date to UTC and make it timezone-aware
enddate = dt.datetime.strptime(enddate, '%Y-%m-%d')
enddate = enddate.replace(tzinfo=pytz.utc)
frame['Date'] = pd.to_datetime(frame['Date'], format='%Y-%m-%d') #turns 'Date' from object into datetime64
frame['Date'] = frame['Date'].dt.tz_convert('UTC') #converts csv 'Date' column to UTC from AEST(+10)
selectdf = frame.loc[frame['Date'].between(startdate, enddate, inclusive=True)] #filters time period
return Response(selectdf.to_json(orient="records"), media_type="application/json")
The csv file containing the data has 'Date' column in AEST timezone (i.e. UTC+10).
In my code I convert everything to UTC and then do the comparison, but with the above code the date for 1st Jan 2021 is not returned; the 2nd Jan is. What is the right way to solve this? I have tried every configuration of timezone changes, and so far nothing has returned both 1st Jan 2021 and 31st Dec 2021 at the same time.
Sample data:
The 'Date' column from the csv file:
Date
2021-12-31T01:35:59+1000
2021-12-31T01:35:59+1000
2021-12-31T01:09:57+1000
2021-12-31T01:09:57+1000
2021-12-30T03:02:25+1000
2021-12-30T01:52:58+1000
...
2021-01-02T00:48:29+1000
2021-01-01T02:40:03+1000
2021-01-01T00:30:00+1000
2021-01-01T00:30:00+1000
There is some confusion about the issue:
To clarify the problem, the above code returns the following epoch time as the first record (startdate): 1609512509000, and the following epoch time as the last record (enddate): 1640878559000.
For me those translate to 2nd Jan (I am in the AEST timezone in my browser) and 31st Dec respectively, so the json returned from the above csv data runs from 2nd Jan to 31st Dec, which is incorrect.
If you run it in your browser it will likely render those epoch records in your own timezone. This is my problem.
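One way to fix this (a sketch, not from the question; the timezone name and the half-open interval are assumptions): build the year boundaries in the timezone the calendar year is meant in, here the fixed UTC+10 offset that appears in the csv, rather than in UTC, and compare against the tz-aware parsed column. With UTC boundaries, 2021-01-01T00:30:00+1000 is 2020-12-31T14:30:00Z and falls before the start of the range, which is why 1st Jan goes missing.

import pandas as pd

year = 2021  # from the ?year= query parameter
tz = 'Australia/Brisbane'  # fixed UTC+10, matching the +1000 offsets in the sample data
startdate = pd.Timestamp(f'{year}-01-01 00:00:00', tz=tz)
enddate = pd.Timestamp(f'{year + 1}-01-01 00:00:00', tz=tz)  # exclusive upper bound

frame['Date'] = pd.to_datetime(frame['Date'])  # keeps the +10:00 offset, so values stay tz-aware
selectdf = frame[(frame['Date'] >= startdate) & (frame['Date'] < enddate)]

The half-open interval keeps every record from 00:00:00 on 1st Jan up to, but not including, midnight on 1st Jan of the next year, so both the 1st Jan and the late 31st Dec rows survive the filter regardless of the browser's timezone.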

normalize different date formats

I am trying to work with an XML data that has different kinds of (string) date values, like:
'Sun, 04 Apr 2021 13:32:26 +0200'
'Sun, 04 Apr 2021 11:52:29 GMT'
I want to save these in a Django object that has a datetime field.
The script that I have written to convert a str datetime is as below:
from datetime import datetime

def normalise(val):
    val = datetime.strptime(val, '%a, %d %b %Y %H:%M:%S %z')
    return val
However, this does not work for every datetime value I scrape. For the two examples above, the script works for the first one but crashes on the second.
What would be an ideal way of normalising all the datetime values?
The dateutil module parses many different types of formats. You can find the doc here.
This is a simple example:
if __name__ == '__main__':
    from dateutil.parser import parse
    date_strs = ['Sun, 04 Apr 2021 13:32:26 +0200', 'Sun, 04 Apr 2021 11:52:29 GMT']
    for d in date_strs:
        print(parse(d))
output:
2021-04-04 13:32:26+02:00
2021-04-04 11:52:29+00:00
If there are other date formats that this doesn't cover, you may need to store specific Python format strings keyed by the xml element name.
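Applied to the question's normalise helper, a minimal sketch (assuming the scraped values are always plain strings) would simply delegate to dateutil:

from dateutil.parser import parse

def normalise(val):
    # dateutil handles both the '+0200' offset and the 'GMT' suffix
    return parse(val)

Both example strings then parse to timezone-aware datetime objects that can be assigned to a Django DateTimeField.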

Need to convert word month into number from a table

I would like to convert the column df['Date'] to a numeric date format,
i.e. the current format Oct 9, 2019 --> 10-09-2019.
Here is my code; I did not get an error until printing it. Thanks for your support!
I made some changes.
I want to convert the current time format to a numeric time format, i.e. Oct 9, 2019 --> 10-09-2019, in a column of a table:
from time import strptime
strptime('Feb', '%b').tm_mon

Date_list = df['Date'].tolist()
Date_num = []
for i in Date_list:
    num_i = strptime('[i[0:3]]', '%b').tm_mon
    Date_num.append(num_i)
df['Date'] = Date_num
print(df['Date'])
I got the error message as follows:
KeyError
ValueError: time data '[i[0:3]]' does not match format '%b'
Date
Oct 09, 2019
Oct 08, 2019
Oct 07, 2019
Oct 04, 2019
Oct 03, 2019
Assuming the Date column in df is of str/object type
(this can be validated by running df.dtypes),
you can convert the column directly to datetime type with
df['Date'] = df['Date'].astype('datetime64[ns]')
which will show the dates in the default format of 2019-10-09. If you want, you can convert this to any other date format very easily by doing something like
df['Date'].dt.strftime("%d-%m-%Y")
please go through https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html for more info related to pandas datetime functions/operations
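The immediate error in the question comes from passing the literal string '[i[0:3]]' to strptime instead of the slice i[0:3]. Beyond that fix, a minimal end-to-end sketch of the pandas route above (the sample frame is made up to match the question's data):

import pandas as pd

df = pd.DataFrame({'Date': ['Oct 09, 2019', 'Oct 08, 2019', 'Oct 07, 2019']})

df['Date'] = pd.to_datetime(df['Date'])          # -> 2019-10-09, 2019-10-08, ...
df['Date'] = df['Date'].dt.strftime('%m-%d-%Y')  # -> '10-09-2019', the requested format
print(df['Date'])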

Custom date string to Python date object

I am using Scrapy to parse data and I am getting dates in the Jun 14, 2016
format. I have tried to parse it with datetime.strftime, but
what approach should I use to convert custom date strings, and what should I do in my case?
UPDATE
I want to parse the date into a UNIX timestamp to save in the database.
Something like this should work:
import time
import datetime
datestring = "September 2, 2016"
unixdatetime = int(time.mktime(datetime.datetime.strptime(datestring, "%B %d, %Y").timetuple()))
print(unixdatetime)
Returns: 1472792400
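For the abbreviated month names in the question ('Jun 14, 2016'), the format string is '%b %d, %Y' rather than '%B %d, %Y'; a sketch along the same lines (note that time.mktime interprets the value in the local timezone, so the exact number varies by machine):

import time
import datetime

datestring = "Jun 14, 2016"
# '%b' matches abbreviated month names such as 'Jun'
unixdatetime = int(time.mktime(datetime.datetime.strptime(datestring, "%b %d, %Y").timetuple()))
print(unixdatetime)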

How to handle "/Date(1262476800000)/" [duplicate]

This question already has answers here:
Convert weird Python date format to readable date
(2 answers)
Closed 7 years ago.
Recently I received this output from an API (I think it is .NET driven). What kind of date format is this, and how do I convert it to a Python date object?
{
Id: 10900001,
ExpirationDate: "/Date(1262476800000)/",
}
It seems to be a timestamp, but I get parse errors from fromtimestamp():
>>> from datetime import datetime
>>> datetime.fromtimestamp(float('1262476800000'))
ValueError: 'year is out of range'
You are getting the error because the timestamp is in milliseconds. Just remove the last 3 digits and it will work:
>>> from datetime import datetime, timedelta
>>> s = '1262476800540'
>>> d = datetime.fromtimestamp(float(s[:-3]))
>>> d = d + timedelta(milliseconds=int(s[-3:]))
>>> print(d)
2010-01-03 05:30:00.540000
Alternatively, dividing by 1000 keeps the milliseconds as microseconds:
>>> from datetime import datetime
>>> a = datetime.fromtimestamp(float('1262476800002')/1000)
>>> a.microsecond
2000
As the API's output appears to be JSON, I would assume it is a JavaScript timestamp. To convert it to Python, remove the milliseconds and you should be fine.
From an online conversion tool, http://www.epochconverter.com/:
GMT: Sun, 03 Jan 2010 00:00:00 GMT
It seems like a UNIX timestamp in milliseconds. In other words, 1262476800(000) = 2010-01-03T00:00:00+00:00.
Could that be correct?
You can use fromtimestamp to convert it to a date object.
Cheers,
Anders
You can also use the time module, but divide by 1000 first, because time.ctime expects seconds and this value is in milliseconds:
In [13]: import time
In [14]: time.ctime(1262476800000 / 1000)
Out[14]: 'Sun Jan  3 02:00:00 2010'
(ctime renders the result in local time; 1262476800 seconds is 2010-01-03 00:00:00 UTC, so the hour shown depends on your timezone.)
I think you'll have to peruse the documentation of that API. Failing which, can you reverse-engineer it? If you can create entities in that system, then do so. If you create 3 entities expiring on 1st Jan 2016, 11th Jan 2016 and 21st Jan 2016, you'll be able to see if it's (probably) a linearly increasing sequence of time-units and deduce what are the units and the base-date. Even if you can't create entities, can you get the dates in human-readable format through a human-orientated web interface?
Once you know what the number represents, you can decode it either into year, month, day ... fields, or into a number of seconds since the canonical base date (timestamp).
Or maybe you'll get lucky and another reader will recognise the format!
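Putting the earlier answers together, a minimal sketch for turning the "/Date(1262476800000)/" string into a timezone-aware Python datetime (the regular expression is an assumption about the wrapper format, which the API documentation should confirm):

import re
from datetime import datetime, timezone

def parse_dotnet_date(value):
    # Extract the millisecond count from a "/Date(1262476800000)/" wrapper
    match = re.search(r'/Date\((-?\d+)\)/', value)
    if match is None:
        raise ValueError(f'not a /Date(...)/ value: {value!r}')
    millis = int(match.group(1))
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)

print(parse_dotnet_date("/Date(1262476800000)/"))  # 2010-01-03 00:00:00+00:00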
