Using dateutil.parser to parse a date in another language

Dateutil is a great tool for parsing dates in string format. For example,
from dateutil.parser import parse
parse("Tue, 01 Oct 2013 14:26:00 -0300")
returns
datetime.datetime(2013, 10, 1, 14, 26, tzinfo=tzoffset(None, -10800))
However,
parse("Ter, 01 Out 2013 14:26:00 -0300")  # in Portuguese
yields this error:
ValueError: unknown string format
Does anybody know how to make dateutil aware of the locale?

As far as I can see, dateutil is not locale aware (yet!).
I can think of three alternative suggestions:
1. The day and month names are hardcoded in dateutil.parser (as part of the parserinfo class). You could subclass parserinfo and replace these names with the appropriate names for Portuguese.
2. Modify dateutil to get day and month names based on the user’s locale, so you could do something like:
import locale
locale.setlocale(locale.LC_ALL, "pt_PT")
from dateutil.parser import parse
parse("Ter, 01 Out 2013 14:26:00 -0300")
I’ve started a fork which gets the names from the calendar module (which is locale-aware) to work on this: https://github.com/alexwlchan/dateutil
Right now it works for Portuguese (or seems to), but I want to think about it a bit more before I submit a patch to the main branch. In particular, weirdness may happen if it encounters characters which aren’t used in Western European languages. I haven’t tested this yet. (See https://stackoverflow.com/a/8917539/1558022)
3. If you’re not tied to the dateutil module, you could use datetime instead, which is already locale-aware:
from datetime import datetime, date
import locale
locale.setlocale(locale.LC_ALL, "pt_PT")
datetime.strptime("Ter, 01 Out 2013 14:26:00 -0300",
                  "%a, %d %b %Y %H:%M:%S %z")
(Note that the %z token is not consistently supported in datetime.)
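If neither changing the locale nor patching dateutil is an option, another workaround (not one of the three suggestions above, just a sketch using only the standard library) is to swap the Portuguese abbreviations for their English equivalents and hand the result to email.utils.parsedate_to_datetime, which exists from Python 3.3. The PT_TO_EN table and the parse_pt_rfc2822 name are made up for illustration, and the table only covers abbreviated names:

```python
import re
from email.utils import parsedate_to_datetime

# Hypothetical lookup table: Portuguese abbreviations -> English ones.
PT_TO_EN = {
    "seg": "Mon", "ter": "Tue", "qua": "Wed", "qui": "Thu",
    "sex": "Fri", "sáb": "Sat", "sab": "Sat", "dom": "Sun",
    "jan": "Jan", "fev": "Feb", "mar": "Mar", "abr": "Apr",
    "mai": "May", "jun": "Jun", "jul": "Jul", "ago": "Aug",
    "set": "Sep", "out": "Oct", "nov": "Nov", "dez": "Dec",
}


def parse_pt_rfc2822(text):
    """Swap Portuguese names for English ones, then parse as RFC 2822."""
    def swap(match):
        word = match.group(0)
        # Unknown words are left untouched
        return PT_TO_EN.get(word.lower(), word)

    # [^\W\d_]+ matches runs of letters, including accented ones
    english = re.sub(r"[^\W\d_]+", swap, text)
    return parsedate_to_datetime(english)


dt = parse_pt_rfc2822("Ter, 01 Out 2013 14:26:00 -0300")
print(dt.isoformat())
```

This leaves the process locale alone, at the cost of maintaining one table per language.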

You could use PyICU to parse a localized date/time string in a given format:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from datetime import datetime
import icu # PyICU
df = icu.SimpleDateFormat(
    'EEE, dd MMM yyyy HH:mm:ss zzz', icu.Locale('pt_BR'))
ts = df.parse(u'Ter, 01 Out 2013 14:26:00 -0300')
print(datetime.utcfromtimestamp(ts))
# -> 2013-10-01 17:26:00 (UTC)
It works on Python 2/3. It does not modify global state (locale).
If your actual input time string does not contain an explicit UTC offset, you should specify the timezone for ICU to use explicitly; otherwise you can get a wrong result (ICU and datetime may use different timezone definitions).
If you only need to support Python 3 and you don't mind setting the locale, then you could use datetime.strptime() as @alexwlchan suggested:
#!/usr/bin/env python3
import locale
from datetime import datetime
locale.setlocale(locale.LC_TIME, "pt_PT.UTF-8")
print(datetime.strptime("Ter, 01 Out 2013 14:26:00 -0300",
                        "%a, %d %b %Y %H:%M:%S %z"))  # works on Python 3.2+
# -> 2013-10-01 14:26:00-03:00

The calendar module already has constants for a lot of languages. I think the best solution is to customize the parser from dateutil using these constants. This is a simple solution and will work for a lot of languages. I didn't test it much, so use with caution.
Create a module localeparserinfo.py and subclass parser.parserinfo:
import calendar
from dateutil import parser
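class LocaleParserInfo(parser.parserinfo):
    # Evaluated when the module is imported, so set the locale first.
    # list() keeps the pairs reusable on Python 3, where zip is a one-shot iterator.
    WEEKDAYS = list(zip(calendar.day_abbr, calendar.day_name))
    MONTHS = list(zip(calendar.month_abbr, calendar.month_name))[1:]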
Now you can use your new parserinfo object as a parameter to dateutil.parser.
In [1]: import locale;locale.setlocale(locale.LC_ALL, "pt_BR.utf8")
In [2]: from localeparserinfo import LocaleParserInfo
In [3]: from dateutil.parser import parse
In [4]: parse("Ter, 01 Out 2013 14:26:00 -0300", parserinfo=LocaleParserInfo())
Out[4]: datetime.datetime(2013, 10, 1, 14, 26, tzinfo=tzoffset(None, -10800))
It solved my problem, but note that this is an incomplete solution that does not cover all possible dates and times. Take a look at dateutil's parser.py, especially the parserinfo class variables such as HMS. You'll probably be able to use other constants from the calendar module.
You can even pass the locale string as an argument to your parserinfo class.
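One way to read "pass the locale string as an argument" is to collect the name tables under a temporarily switched LC_TIME and then restore the previous setting. The helper below is only a sketch using the standard library: names_for_locale is a hypothetical name, and the returned lists would still have to be assigned to WEEKDAYS and MONTHS on a parserinfo subclass. It is demonstrated with the "C" locale because that one is always available; something like "pt_BR.utf8" only works where that locale is installed.

```python
import calendar
import locale


def names_for_locale(localename):
    """Return (weekdays, months) name pairs as rendered under `localename`."""
    # Calling setlocale without a value only queries the current setting
    saved = locale.setlocale(locale.LC_TIME)
    try:
        locale.setlocale(locale.LC_TIME, localename)
        weekdays = list(zip(calendar.day_abbr, calendar.day_name))
        # month_abbr[0] and month_name[0] are empty strings, so drop them
        months = list(zip(calendar.month_abbr, calendar.month_name))[1:]
    finally:
        # Restore whatever was active before
        locale.setlocale(locale.LC_TIME, saved)
    return weekdays, months


weekdays, months = names_for_locale("C")
print(weekdays[1], months[9])
```

Keep in mind that setlocale changes process-wide state, so this is best done once at start-up rather than from several threads.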

from dateutil.parser import parse
parse("Ter, 01 Out 2013 14:26:00 -0300",fuzzy=True)
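Result (note that fuzzy parsing only skips the tokens it does not recognize, so the month and day here are wrong):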
datetime.datetime(2013, 1, 28, 14, 26, tzinfo=tzoffset(None, -10800))

One could use a context manager to temporarily set the locale and return a custom parserinfo object.
Context manager definition:
import calendar
import contextlib
import locale
from dateutil import parser
@contextlib.contextmanager
def locale_parser_info(localename):
    old_locale = locale.getlocale(locale.LC_TIME)
    locale.setlocale(locale.LC_TIME, localename)

    class InnerParserInfo(parser.parserinfo):
        WEEKDAYS = list(zip(calendar.day_abbr, calendar.day_name))
        # dots in abbreviations make dateutil raise a parser error
        MONTHS = list(zip([abr.replace(".", "") for abr in calendar.month_abbr], calendar.month_name))[1:]

    try:
        yield InnerParserInfo()
    finally:
        # Restore the original locale
        locale.setlocale(locale.LC_TIME, old_locale)
The actual function just wraps the call to dateutil.parser.parse in the context manager we just defined, and uses the returned parserinfo object.
def parse_localized(datestr, date_locale="pt_PT"):
    with locale_parser_info(date_locale) as parserinfo:
        return parser.parse(datestr, parserinfo=parserinfo)

Related

Python timezone '%z' directive for datetime.strptime() not available

Using '%z' pattern of datetime.strptime()
I have a string that represents a date, and I can parse it into a clean datetime object without trouble:
date = "[24/Aug/2014:17:57:26"
dt = datetime.strptime(date, "[%d/%b/%Y:%H:%M:%S")
Except that I can't parse the entire date string, including the timezone, using the %z pattern as specified here:
date_tz = "[24/Aug/2014:17:57:26 +0200]"
dt = datetime.strptime(date_tz, "[%d/%b/%Y:%H:%M:%S %z]")
>>> ValueError: 'z' is a bad directive in format '[%d/%b/%Y:%H:%M:%S %z]'
Because as this bug report says
strftime() is implemented per platform
I should point out that there is no such problem with the naive tzinfo directive '%Z'.
Workaround: converting the tzinfo string into a tzinfo object
I can work around this by transforming the GST time format string into a tzinfo object, as suggested here, using the dateutil module,
and then inserting the tzinfo into the datetime object.
Question: how do I make %z available on my platform?
But as I will obviously need the %z pattern in further projects, I would like to find a solution that avoids this workaround and the use of an external module for such a simple task.
Can you suggest some reading on it? I suppose that newer versions of Python (I'm on 2.7) can handle it, but I'd rather not change my version now for this small but crucial detail.
[EDIT]
Well, the comments made me reformulate my question: how to parse Email time zone indicator using strptime() without being aware of locale time?
strptime() is implemented in pure Python. Unlike strftime(), the set of directives it supports doesn't depend on the platform. %z is supported since Python 3.2:
>>> from datetime import datetime
>>> datetime.strptime('24/Aug/2014:17:57:26 +0200', '%d/%b/%Y:%H:%M:%S %z')
datetime.datetime(2014, 8, 24, 17, 57, 26, tzinfo=datetime.timezone(datetime.timedelta(0, 7200)))
how to parse Email time zone indicator using strptime() without being aware of locale time?
There is no concrete timezone implementation in Python 2.7. You could easily implement the UTC offset parsing, see How to parse dates with -0400 timezone string in python?
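As a rough sketch of what "implement the UTC offset parsing" could look like, the code below defines a minimal fixed-offset tzinfo subclass and attaches it to the parsed value. FixedOffset and parse_with_offset are hypothetical names, the offset is assumed to be the last space-separated token in the form +HHMM or -HHMM, and the code avoids Python 3-only syntax:

```python
from datetime import datetime, timedelta, tzinfo


class FixedOffset(tzinfo):
    """A time zone with a constant offset from UTC, given in minutes."""

    def __init__(self, minutes):
        self._offset = timedelta(minutes=minutes)
        sign = "-" if minutes < 0 else "+"
        hours, mins = divmod(abs(minutes), 60)
        self._name = "%s%02d%02d" % (sign, hours, mins)

    def utcoffset(self, dt):
        return self._offset

    def tzname(self, dt):
        return self._name

    def dst(self, dt):
        return timedelta(0)


def parse_with_offset(text, fmt):
    """Parse `text`, whose last token is assumed to be a +HHMM/-HHMM offset."""
    stamp, offset = text.rsplit(" ", 1)
    sign = -1 if offset[0] == "-" else 1
    minutes = sign * (int(offset[1:3]) * 60 + int(offset[3:5]))
    naive = datetime.strptime(stamp, fmt)
    return naive.replace(tzinfo=FixedOffset(minutes))


dt = parse_with_offset("24/Aug/2014:17:57:26 +0200", "%d/%b/%Y:%H:%M:%S")
print(dt.isoformat())
```

Unlike the variants that shift the time by the offset, this keeps the original wall-clock time and records the offset on the datetime itself.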
Following on from @j-f-sebastian's answer, here is a fix for Python 2.7.
Instead of using:
datetime.strptime(t,'%Y-%m-%dT%H:%M %z')
use the timedelta to account for the timezone, like this:
from datetime import datetime, timedelta

def dt_parse(t):
    # Parse the naive part, then shift by the offset to get UTC
    ret = datetime.strptime(t[0:16], '%Y-%m-%dT%H:%M')
    if t[17] == '+':
        ret -= timedelta(hours=int(t[18:20]), minutes=int(t[20:]))
    elif t[17] == '-':
        ret += timedelta(hours=int(t[18:20]), minutes=int(t[20:]))
    return ret

print(dt_parse('2017-01-12T14:12 -0530'))
Uri's answer is great and saved my life, but when you have
USE_TZ = True you need to be careful with the time. To avoid the warning "RuntimeWarning: DateTimeField", it is better to attach UTC to the returned value.
import pytz
from datetime import datetime, timedelta
def dt_parse(t):
    # Inferred from the indices used: input like '2017-01-12T14:12:06.000-05:30'
    ret = datetime.strptime(t[0:19], '%Y-%m-%dT%H:%M:%S')
    if t[23] == '+':
        ret -= timedelta(hours=int(t[24:26]), minutes=int(t[27:]))
    elif t[23] == '-':
        ret += timedelta(hours=int(t[24:26]), minutes=int(t[27:]))
    return ret.replace(tzinfo=pytz.UTC)

Convert date formats to another with Python

I download RSS content from different countries with Python, but each source uses its own datetime format or time zone. For instance,
Wed, 23 Oct 2013 17:44:13 GMT
23 Oct 2013 18:21:04 +0100
23 Oct 2013 13:12:41 EDT
10-23-2013 00:12:24
At the moment, my solution is to create a different function for each RSS source and change the date to a format I will decide. But is there any way to do this automatically?
Not really. But take a look at the feedparser lib.
Different feed types and versions use wildly different date formats.
Universal Feed Parser will attempt to auto-detect the date format used
in any date element, and parse it into a standard Python 9-tuple, as
documented in the Python time module.
From the list of Recognized Date Formats, it seems to me that the library could help you out some of the way :)
Best of luck
You can try using the dateutil module to parse the datetime.
It provides the functionality to parse most of the known datetime formats. Here is an example from the docs:
>>> from dateutil.parser import *
>>> parse("Thu Sep 25 10:36:28 2003")
datetime.datetime(2003, 9, 25, 10, 36, 28)
It returns a datetime object which can be directly used for manipulation. You can then also use strftime to convert it to the required format string.
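To make the "parse, then reformat" idea concrete for the four sample dates in the question, here is a sketch that uses only the standard library (Python 3.3+) instead of dateutil. The to_utc_iso name and FALLBACK_FORMAT are illustrative, and two assumptions are baked in: the purely numeric sample is month-day-year, and dates without any zone information are treated as UTC.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Assumption: the numeric sample "10-23-2013 00:12:24" is month-day-year
FALLBACK_FORMAT = "%m-%d-%Y %H:%M:%S"


def to_utc_iso(text):
    """Parse one feed date and return it as a UTC ISO 8601 string."""
    try:
        dt = parsedate_to_datetime(text)
    except (TypeError, ValueError):
        # Not an RFC 2822 style date: try the numeric fallback format
        dt = datetime.strptime(text, FALLBACK_FORMAT)
    if dt.tzinfo is None:
        # Assumption: dates without zone information are treated as UTC
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


samples = [
    "Wed, 23 Oct 2013 17:44:13 GMT",
    "23 Oct 2013 18:21:04 +0100",
    "23 Oct 2013 13:12:41 EDT",
    "10-23-2013 00:12:24",
]
for raw in samples:
    print(to_utc_iso(raw))
```

Each additional source format would need its own fallback, which is where dateutil or feedparser save work.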

Timezone not available in python, but the system timezone is properly set

As specified in the documentation:
%Z -> Time zone name (no characters if no time zone exists).
According to date, my system has the time zone properly set:
gonvaled@pegasus ~ » date
Sat Sep 28 09:14:29 CEST 2013
But this test:
def test_timezone():
    from datetime import datetime
    dt = datetime.now()
    print(dt.strftime('%Y-%m-%d %H:%M:%S%Z'))

test_timezone()
Produces:
gonvaled@pegasus ~ » python test_timezone.py
2013-09-28 09:19:10
Without time zone information. Why is that? How can I force python to output time zone info?
I have also tried reconfiguring the time zone with tzselect, but that has not helped.
Standard Python datetime.datetime() objects do not have a timezone object attached to them. The system time is taken as is.
You'll need to install Python timezone support in the form of the pytz package; timezone definitions change too frequently to be bundled with Python itself.
pytz does not tell you what timezone your machine has been configured with. You can use the python-dateutil module for that; it has a dateutil.tz.gettz() function that returns the timezone currently in use. This is much more reliable than what Python can get from the limited C API:
>>> import datetime
>>> from dateutil.tz import gettz
>>> datetime.datetime.now(gettz())
datetime.datetime(2013, 9, 28, 8, 34, 14, 680998, tzinfo=tzfile('/etc/localtime'))
>>> datetime.datetime.now(gettz()).strftime('%Y-%m-%d %H:%M:%S%Z')
'2013-09-28 08:36:01BST'
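On Python 3.3 or later, the standard library can also attach the system's local zone without pytz or dateutil. This is a different technique from the gettz() call above, shown only as a sketch; the zone name printed depends on how the machine is configured:

```python
from datetime import datetime, timezone

# Take the current instant in UTC, then convert it to the system's local zone.
# astimezone() with no argument uses the local time zone (Python 3.3+).
local_now = datetime.now(timezone.utc).astimezone()

formatted = local_now.strftime('%Y-%m-%d %H:%M:%S%Z')
print(formatted)
print(local_now.utcoffset())
```

The resulting tzinfo is a fixed offset valid for that instant, so it does not know about daylight saving changes at other dates.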

converting a date to string for manipulation in python

I am new to Python and I have written a script that converts an incoming date string to a datetime object. My problem is that I cannot convert the datetime object back to a string for manipulation. I have a date, e.g. 2011-08-10 14:50:10; all I need to do is add a T between the date and time and a Z at the end. Unfortunately I'm using Python 2.3, as my application will only accept that.
my code is as follows:
fromValue= ''
fromValue = document.Get(self._generic3)
fromValue = fromValue[:fromValue.rindex(" ")]
fromValue = datetime.datetime.fromtimestamp(time.mktime(time.strptime(fromValue,"%a, %d %b %Y %H:%M:%S")))
toValue = fromValue.strftime("%Y-%m-%dT %H:%M:%SZ")
This should work fine. datetime.strftime was available on Python 2.3.
You'd certainly be better off upgrading to at least Python 2.5 if at all possible. Python 2.3 hasn't even received security patches in years.
Edit: Also, you don't need to initialize or declare variables in Python; the fromValue= '' has no effect on your program.
Edit 2: In a comment, you seem to have said you have it in a string already in nearly the right format:
"2011-08-08 14:15:21"
so just do
'T'.join("2011-08-08 14:15:21".split()) + 'Z'
If you want to add the letters while it's a string.
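For comparison, here is the same result obtained by going through a datetime object rather than string splitting. This is a sketch that assumes Python 2.5 or newer, where datetime.strptime is available; it would not run on 2.3 as written:

```python
from datetime import datetime

# Parse the plain "date time" string into a datetime object
parsed = datetime.strptime("2011-08-10 14:50:10", "%Y-%m-%d %H:%M:%S")

# Literal T and Z go straight into the format string, with no space around them
stamp = parsed.strftime("%Y-%m-%dT%H:%M:%SZ")
print(stamp)
```

The Z is just a literal character here; nothing converts the time to UTC.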
It looks like you are trying to format the datetime in ISO-8601 format.
For this purpose, use the isoformat method.
import datetime as dt
try:
import email.utils as eu
except ImportError:
import email.Utils as eu # for Python 2.3
date_string="Fri, 08 Aug 2011 14:15:10 -0400"
ttuple=eu.parsedate(date_string)
date=dt.datetime(*ttuple[:6])
print(date.isoformat()+'Z')
yields
2011-08-08T14:15:10Z
Here are links to isoformat and parsedate in the Python 2.3 docs.
fromValue.strftime('%Y-%m-%d T %H:%M:%S Z')
http://docs.python.org/release/2.3/lib/module-time.html
http://docs.python.org/release/2.3/lib/node208.html

How to parse a RFC 2822 date/time into a Python datetime?

I have a date of the form specified by RFC 2822 -- say Fri, 15 May 2009 17:58:28 +0000, as a string. Is there a quick and/or standard way to get it as a datetime object in Python 2.5? I tried to produce a strptime format string, but the +0000 timezone specifier confuses the parser.
The problem is that parsedate will ignore the offset.
Do this instead:
from email.utils import parsedate_tz
print(parsedate_tz('Fri, 15 May 2009 17:58:28 +0700'))
I'd like to elaborate on previous answers. email.utils.parsedate and email.utils.parsedate_tz both return tuples. Since the OP needs a datetime.datetime object, I'm adding these examples for completeness:
from email.utils import parsedate
from datetime import datetime
import time
t = parsedate('Sun, 14 Jul 2013 20:14:30 -0000')
d1 = datetime.fromtimestamp(time.mktime(t))
Or:
d2 = datetime(*t[:6])
Note that d1 and d2 are both naive datetime objects; no timezone information is stored. If you need aware datetime objects, check the tzinfo datetime() arg.
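A sketch of the aware variant, assuming Python 3.2 or later so that datetime.timezone is available. The parse_aware name is made up, and treating a missing offset as UTC is an assumption of this example, not something parsedate_tz decides for you:

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_tz


def parse_aware(text):
    """Return an aware datetime that keeps the offset found in `text`."""
    tt = parsedate_tz(text)
    if tt is None:
        return None
    # tt[9] is the UTC offset in seconds, or None when none was given.
    # Assumption: a missing offset is treated as UTC.
    offset = tt[9] or 0
    return datetime(*tt[:6], tzinfo=timezone(timedelta(seconds=offset)))


d3 = parse_aware('Sun, 14 Jul 2013 20:14:30 +0200')
print(d3.isoformat())
```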
Alternatively you could use the dateutil module
from email.utils import parsedate
print(parsedate('Fri, 15 May 2009 17:58:28 +0000'))
Documentation.
It looks like Python 3.3 going forward has a new method parsedate_to_datetime in email.utils that takes care of the intermediate steps:
email.utils.parsedate_to_datetime(date)
The inverse of format_datetime(). Performs the same function as parsedate(), but on
success returns a datetime. If the input date has a timezone of -0000,
the datetime will be a naive datetime, and if the date is conforming
to the RFCs it will represent a time in UTC but with no indication of
the actual source timezone of the message the date comes from. If the
input date has any other valid timezone offset, the datetime will be
an aware datetime with the corresponding a timezone tzinfo.
New in version 3.3.
http://python.readthedocs.org/en/latest/library/email.util.html#email.utils.parsedate_to_datetime
There is a parsedate function in email.utils.
It parses all valid RFC 2822 dates and some special cases.
email.utils.parsedate_tz(date) is the function to use. Following are some variations.
Email date/time string (RFC 5322, RFC 2822, RFC 1123) to unix timestamp in float seconds:
import email.utils
import calendar
def email_time_to_timestamp(s):
    tt = email.utils.parsedate_tz(s)
    if tt is None:
        return None
    return calendar.timegm(tt) - tt[9]

import time
print(time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(email_time_to_timestamp("Wed, 04 Jan 2017 09:55:45 -0800"))))
# 2017-01-04T17:55:45Z
Make sure you do not use mktime (which interprets the struct_time in your computer’s local time, not UTC); use timegm or mktime_tz instead (but beware the caveat for mktime_tz in the next paragraph).
If you are sure that you have python version 2.7.4, 3.2.4, 3.3, or newer, then you can use email.utils.mktime_tz(tt) instead of calendar.timegm(tt) - tt[9]. Before that, mktime_tz gave incorrect times when invoked during the local time zone’s fall daylight savings transition (bug 14653).
Thanks to @j-f-sebastian for caveats about mktime and mktime_tz.
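A quick way to convince yourself that the two spellings agree on a fixed Python version is to compute both for the same input. This is only a sanity-check sketch using the sample date from above:

```python
import calendar
import email.utils

tt = email.utils.parsedate_tz("Wed, 04 Jan 2017 09:55:45 -0800")

# Manual version: read the fields as UTC, then subtract the offset in seconds
manual = calendar.timegm(tt) - tt[9]

# Helper version from email.utils
via_helper = email.utils.mktime_tz(tt)

print(manual, via_helper)
```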
Email date/time string (RFC 5322, RFC 2822, RFC 1123) to “aware” datetime on python 3.3:
On python 3.3 and above, use email.utils.parsedate_to_datetime, which returns an aware datetime with the original zone offset:
import email.utils
email.utils.parsedate_to_datetime(s)
print(email.utils.parsedate_to_datetime("Wed, 04 Jan 2017 09:55:45 -0800").isoformat())
# 2017-01-04T09:55:45-08:00
Caveat: this will throw ValueError if the time falls on a leap second e.g. email.utils.parsedate_to_datetime("Sat, 31 Dec 2016 15:59:60 -0800").
Email date/time string (RFC 5322, RFC 2822, RFC 1123) to datetime in UTC zone:
This just converts to a timestamp and then to a UTC datetime (note that utcfromtimestamp returns a naive datetime, with no tzinfo attached):
import email.utils
import calendar
import datetime
def email_time_to_utc_datetime(s):
    tt = email.utils.parsedate_tz(s)
    if tt is None:
        return None
    timestamp = calendar.timegm(tt) - tt[9]
    return datetime.datetime.utcfromtimestamp(timestamp)

print(email_time_to_utc_datetime("Wed, 04 Jan 2017 09:55:45 -0800").isoformat())
# 2017-01-04T17:55:45
Email date/time string (RFC 5322, RFC 2822, RFC 1123) to python “aware” datetime with original offset:
Prior to Python 3.2, Python did not come with tzinfo implementations, so here is an example using dateutil.tz.tzoffset (pip install python-dateutil):
import email.utils
import datetime
import dateutil.tz
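def email_time_to_datetime(s):
    tt = email.utils.parsedate_tz(s)
    if tt is None:
        return None
    # Build a name such as "UTC-0800" from the offset in seconds
    sign = "-" if tt[9] < 0 else "+"
    hours, minutes = divmod(abs(tt[9]) // 60, 60)
    tz = dateutil.tz.tzoffset("UTC%s%02d%02d" % (sign, hours, minutes), tt[9])
    # Clamp a leap second (60) to 59, which datetime cannot represent
    return datetime.datetime(*tt[:5]+(min(tt[5], 59),), tzinfo=tz)

print(email_time_to_datetime("Wed, 04 Jan 2017 09:55:45 -0800").isoformat())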
# 2017-01-04T09:55:45-08:00
If you are using Python 3.2 or later, you can use the builtin tzinfo implementation datetime.timezone: tz = datetime.timezone(datetime.timedelta(seconds=tt[9])) instead of the third-party dateutil.tz.tzoffset.
Thanks to @j-f-sebastian again for the note on clamping the leap second.
