How to Parse 0 hour with dateutil - python

I'm trying to merge my dataframe columns which contain time info (UTC) into a single column containing datetime object/string. The columns of my df are like this:
YY MM DD HH
98 12 05 11
98 12 05 10
So, I would like a single column containing that time information.
What I've tried so far:
I've merged into a string so that I can parse them into a datetime object by
from dateutil.parser import parse
d_test = (list(df[0].map(str) + " " + df[1].map(str) + " " + df[2].map(str)
+ " " + df[3].map(str)))
Now I just have to parse the list of date strings
parse_d = []
for d in d_test:
    parse_d.append(parse(d))
But this raises an "unknown string error". I looked into it, and it arises because some of the dates are like:
d_test[5] = '98 12 5 0'
I've tried reading the detailed documentation of dateutil (https://labix.org/python-dateutil) and what I understood is that I've to make a dictionary specifying the timezone as key (UTC in my case) and that might solve the error.
tzinfo ={}
parse(d_test[5], tzinfo=tzinfo)
Maybe, I'm missing something very basic but I'm not able to understand how to create this dictionary.

In general, if you know the format of a string, you don't need to use dateutil.parser.parse to parse it, because you can use datetime.strptime with a specified format string.
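As a minimal sketch of the strptime approach, assuming the strings are space-separated "YY MM DD HH" values like the ones built in the question (the list `raw` here is illustrative, not the asker's data):

```python
from datetime import datetime, timezone

# Space-separated "YY MM DD HH" strings, shaped like those in the question.
raw = ["98 12 05 11", "98 12 5 0"]

# %y = two-digit year, %m = month, %d = day, %H = hour (0-23).
# strptime accepts fields without a leading zero, so "5" and "0" parse fine.
parsed = [datetime.strptime(s, "%y %m %d %H") for s in raw]

# Mark the naive results as UTC (the question states the times are UTC).
parsed_utc = [dt.replace(tzinfo=timezone.utc) for dt in parsed]

print(parsed_utc[1].isoformat())  # 1998-12-05T00:00:00+00:00
```

Note that %y maps 69-99 to 1969-1999 and 0-68 to 2000-2068, so it only suits data whose years fall inside that window.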
In this case, the only slightly unfortunate thing is that you have 2-digit years, some of which are from before 2000. In this case, I'd probably do something like this:
cent_21_mask = df['YY'] < 50
df.loc[:, 'YY'] = df.loc[:, 'YY'] + 1900
df.loc[cent_21_mask, 'YY'] = df.loc[cent_21_mask, 'YY'] + 100
Once you've done that, you can use one of the solutions from this question (specifically this one) to convert your individual datetime columns into pandas Timestamps / datetimes.
If these are in UTC, you then use pandas.Series.tz_localize with 'UTC' to get timezone-aware datetimes.
Putting it all together:
import pandas as pd
df = pd.DataFrame(
[[98, 12, 5, 11],
[98, 12, 5, 10],
[4, 12, 5, 00]],
columns=['YY', 'MM', 'DD', 'HH'])
# Convert 2-digit years to 4-digit years
cent_21_mask = df['YY'] < 50
df.loc[:, 'YY'] = df.loc[:, 'YY'] + 1900
df.loc[cent_21_mask, 'YY'] = df.loc[cent_21_mask, 'YY'] + 100
# Retrieve the date columns and rename them
col_renames = {'YY': 'year', 'MM': 'month', 'DD': 'day', 'HH': 'hour'}
dt_subset = df.loc[:, list(col_renames.keys())].rename(columns=col_renames)
dt_series = pd.to_datetime(dt_subset)
# Convert to UTC
dt_series = dt_series.dt.tz_localize('UTC')
# Result:
# 0 1998-12-05 11:00:00+00:00
# 1 1998-12-05 10:00:00+00:00
# 2 2004-12-05 00:00:00+00:00
# dtype: datetime64[ns, UTC]
Also, to clarify two things about this statement:
I've tried reading the detailed documentation of dateutil (https://labix.org/python-dateutil) and what I understood is that I've to make a dictionary specifying the timezone as key (UTC in my case) and that might solve the error.
The correct documentation for python-dateutil is now https://dateutil.readthedocs.io.
If you are using parse, in your situation there is no reason to add UTC into a dictionary and pass it to tzinfos. If you know that your datetimes are going to be naive but that they represent times in UTC, parse them as normal to get naive datetimes, then use datetime.replace(tzinfo=dateutil.tz.tzutc()) to get aware datetimes. The tzinfos dictionary is for when the timezone information is actually represented in the string.
An example of what to do when you have strings representing UTC that don't contain timezone information:
from dateutil.parser import parse
from dateutil import tz
dt = parse('1998-12-05 11:00')
dt = dt.replace(tzinfo=tz.tzutc())
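For contrast, here is a small sketch of the case tzinfos is meant for, where the timezone abbreviation is part of the string itself. The abbreviation "BRST" and its offset of -7200 seconds are illustrative values, not something from the question:

```python
from datetime import timedelta
from dateutil.parser import parse

# Map the abbreviation that appears in the string to a UTC offset in seconds.
# "BRST" and -7200 are illustrative; a tzinfo object can be used instead of an int.
tzinfos = {"BRST": -7200}

dt = parse("2012-01-19 17:21:00 BRST", tzinfos=tzinfos)

# The result is timezone-aware, with the offset taken from the mapping.
print(dt.utcoffset() == timedelta(hours=-2))  # True
```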

How about if you parse the date in this format?
parse("98/12/05 00h")

Related

Concatenate and convert string to date in Python

I have a string ‘2022.10.31’ and I want to convert it to ‘2022-10-31’ and then to date.
This is the R code:
pr1$dotvoranje <- paste(substr(pr1$dotvoranje, 1, 4), substr(pr1$dotvoranje, 6, 7), substr(pr1$dotvoranje, 9, 10), sep = "-")
pr1$dotvoranje <- as.Date(pr1$dotvoranje)
I need to do the same in Python. I found that I need to use .join(), but I have a column of strings that I need to convert to dates.
I started with this code (but I do not know how to use .join here). However, this line only extracted the first four rows of that column, and I need to take that column and replace "." with "-".
depoziti['ddospevanje'] = depoziti['ddospevanje'].loc[0:4] + depoziti['ddospevanje'].loc[5:7] + depoziti['ddospevanje'].loc[8:10]
I'm going to assume that by "then to date" you mean you want a datetime object.
It is not necessary to do string.replace() in order to create the datetime object.
The datetime object uses syntax like %Y to refer to a year and so we will be using that with the function strptime() to ingest the date.
from datetime import datetime
date_as_string = "2022.10.31"
date_format = "%Y.%m.%d"
date = datetime.strptime(date_as_string, date_format)
Now you have a date time object.
If you want to then print out the date as a string, you can use strftime()
new_format = "%Y-%m-%d"
date_as_new_string = date.strftime(new_format)
print(date_as_new_string)
Further reference:
https://docs.python.org/3/library/datetime.html#datetime.datetime.strptime
https://www.w3schools.com/python/python_datetime.asp -(This also contains a list of all the syntax for use with strftime() or strptime())
If you are getting your dates (as strings) in a list, then just put the above code into a for loop:
from datetime import datetime
date_format = "%Y.%m.%d"
date_as_string_list = ['2022.12.31', '2022.11.30']
date_list = []
for date_as_string in date_as_string_list:
    date_list.append(datetime.strptime(date_as_string, date_format))
Now you have a list of datetime objects, and you can loop through them in order to get them as strings as already shown.
Just putting it together
from datetime import datetime
dt1 = '2022.10.31'
dt2 = datetime.strptime(dt1, '%Y.%m.%d').date()
print ("type(dt2) :", type(dt2))
print ("dt2 :", dt2)
Output
type(dt2) : <class 'datetime.date'>
dt2 : 2022-10-31
Update : dt1 should be a series of string dates, not only one string... dt = ['2022.12.31', '2022.11.30'.....]
If it's a list, use list comprehension
dt1 = ['2022.10.31', '2022.10.1', '2022.9.1']
dt2 = [datetime.strptime(i, '%Y.%m.%d').date() for i in dt1]
dt2
If it's a column in pandas dataframe, this is one way of doing it
import pandas as pd
df = pd.DataFrame({'event': ['x', 'y', 'z'],
                   'date': ['2022.10.31', '2022.10.1', '2022.9.1']})
df['date1'] = df['date'].apply(lambda x : datetime.strptime(x, '%Y.%m.%d').date())
df

Pandas to datetime

I have a date that is formatted like this:
01-19-71
Here 71 means 1971, but whenever to_datetime is used it converts it to 2071! How can I solve this problem? I am told that this would need regex, but I can't imagine how, since there are many cases in this data.
my current code:
re_1 = r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"
re_2 = r"(?:\d{1,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[ \-\.,]+(?:\d{1,2}[\w]*[ \-,]+)?[1|2]\d{3}"
re_3 = r"(?:\d{1,2}/)?[1|2]\d{3}"
# Correct misspellings
df = df.str.replace("Janaury", "January")
df = df.str.replace("Decemeber", "December")
# Extract dates
regex = "((%s)|(%s)|(%s))"%(re_1, re_2, re_3)
dates = df.str.extract(regex)
# Sort the Series
dates = pd.Series(pd.to_datetime(dates.iloc[:,0]))
dates.sort_values(ascending=True, inplace=True)
Considering that one has a string as follows
date = '01-19-71'
In order to convert to datetime object where 71 is converted to 1971 and not 2071, one can use datetime.strptime as follows
import datetime as dt
date = dt.datetime.strptime(date, '%m-%d-%y')
[Out]:
1971-01-19 00:00:00
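This works because %y follows the POSIX convention: 69-99 map to 1969-1999 and 0-68 map to 2000-2068. For two-digit years outside that window, one possible sketch is to move any result that lands after a cutoff back by a century. The helper name `parse_two_digit_year` and the `latest` cutoff are hypothetical, and the approach assumes the data contains no dates after that cutoff:

```python
from datetime import datetime

def parse_two_digit_year(text, fmt="%m-%d-%y", latest=datetime(2024, 1, 1)):
    """Parse text; if the result lies after `latest`, move it back 100 years.

    Hypothetical helper: `latest` is an assumed upper bound on the data.
    """
    parsed = datetime.strptime(text, fmt)
    if parsed > latest:
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed

print(parse_two_digit_year("01-19-71"))  # 1971-01-19 00:00:00
print(parse_two_digit_year("03-02-55"))  # 1955-03-02 00:00:00
```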

Can't convert string object to time

I have a dataframe containing different recorded times as string objects, such as 1:02:45, 51:11, 54:24.
I can't convert to time objects, this is the error I am getting:
"time data '49:49' does not match format '%H:%M:%S"
This is the code I am using:
df_plot2 = df[['year', 'gun_time', 'chip_time']]
df_plot2['gun_time'] = pd.to_datetime(df_plot2['gun_time'], format = '%H:%M:%S')
df_plot2['chip_time'] = pd.to_datetime(df_plot2['chip_time'], format = '%H:%M:%S')
Thanks in advance for your help!
You can create a common format in the time Series by checking the string length and prepending zero hours ('00:') where there are only minutes and seconds. Then parse to datetime. Example:
import pandas as pd
s = pd.Series(["1:02:45", "51:11", "54:24"])
m = s.str.len() <= 5
s.loc[m] = '00:' + s.loc[m]
dts = pd.to_datetime(s)
print(dts)
0 2021-12-01 01:02:45
1 2021-12-01 00:51:11
2 2021-12-01 00:54:24
dtype: datetime64[ns]
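If the values are really durations (race times) rather than times of day, an alternative is to convert them to timedelta objects instead, which avoids attaching an arbitrary date. This is a different technique from the one above, sketched with the standard library only; `parse_duration` is a hypothetical helper name:

```python
from datetime import timedelta

def parse_duration(text):
    """Turn 'H:MM:SS' or 'MM:SS' into a timedelta (hypothetical helper)."""
    parts = [int(p) for p in text.split(":")]
    # Left-pad with zeros so that 'MM:SS' becomes [0, MM, SS].
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

for raw in ["1:02:45", "51:11", "54:24"]:
    print(raw, "->", parse_duration(raw).total_seconds())
# 1:02:45 -> 3765.0
# 51:11 -> 3071.0
# 54:24 -> 3264.0
```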
I believe it may be because for %H Python expects to see 01, 02, 03 etc. instead of 1, 2, 3. To use your specific example, 1:02:45 may have to be in the 01:02:45 format for Python to be able to convert it to a datetime with %H:%M:%S.

ValueError: time data '10/11/2006 24:00' does not match format '%d/%m/%Y %H:%M'

I tried:
df["datetime_obj"] = df["datetime"].apply(lambda dt: datetime.strptime(dt, "%d/%m/%Y %H:%M"))
but got this error:
ValueError: time data '10/11/2006 24:00' does not match format
'%d/%m/%Y %H:%M'
How to solve it correctly?
This does not work because the %H directive only accepts values in the range 00 to 23 (both inclusive). That means 24:00 is, as the error says, not a valid time string.
I therefore think we have few options other than converting the string to a valid format. We can do this by first replacing 24:00 with 00:00, and then incrementing the day for these timestamps.
Like:
from datetime import timedelta
import pandas as pd
df['datetime_zero'] = df['datetime'].str.replace('24:00', '0:00')
df['datetime_er'] = pd.to_datetime(df['datetime_zero'], format='%d/%m/%Y %H:%M')
selrow = df['datetime'].str.contains('24:00')
df['datetime_obj'] = df['datetime_er'] + selrow * timedelta(days=1)
The last line thus adds one day to the rows that contain 24:00, such that '10/11/2006 24:00' gets converted to '11/11/2006 00:00'. Note, however, that the above is rather unsafe, since whether it works depends on the format of the timestamp. For the above it will (probably) work, since there is only one colon. But if, for example, the datetimes have seconds as well, the filter could get triggered for 00:24:00, so it might require some extra work to get it working.
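One way to reduce that risk is to compare the whole time field rather than search for a substring. The following is a standard-library sketch, assuming the date and time are separated by a single space; `parse_with_hour_24` is a hypothetical helper name:

```python
from datetime import datetime, timedelta

def parse_with_hour_24(text, fmt="%d/%m/%Y %H:%M"):
    """Parse text, treating a time field of exactly '24:00' as the next midnight."""
    date_part, time_part = text.rsplit(" ", 1)
    if time_part == "24:00":
        # Parse as 00:00 on the same day, then roll over to the next day.
        midnight = datetime.strptime(f"{date_part} 00:00", fmt)
        return midnight + timedelta(days=1)
    return datetime.strptime(text, fmt)

print(parse_with_hour_24("10/11/2006 24:00"))  # 2006-11-11 00:00:00
print(parse_with_hour_24("12/11/2006 15:00"))  # 2006-11-12 15:00:00
```

Because the comparison is against the full time field, a value such as 00:24 is parsed normally and does not trigger the day increment.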
Your data doesn't follow the conventions used by Python / Pandas datetime objects. There should be only one way of storing a particular datetime, i.e. '10/11/2006 24:00' should be rewritten as '11/11/2006 00:00'.
Here's one way to approach the problem:
# find datetimes which have '24:00' and rewrite
twenty_fours = df['strings'].str[-5:] == '24:00'
df.loc[twenty_fours, 'strings'] = df['strings'].str[:-5] + '00:00'
# construct datetime series
df['datetime'] = pd.to_datetime(df['strings'], format='%d/%m/%Y %H:%M')
# add one day where applicable
df.loc[twenty_fours, 'datetime'] += pd.DateOffset(1)
Here's some data to test:
dateList = ['10/11/2006 24:00', '11/11/2006 00:00', '12/11/2006 15:00']
df = pd.DataFrame({'strings': dateList})
Result after transformations described above:
print(df['datetime'])
0 2006-11-11 00:00:00
1 2006-11-11 00:00:00
2 2006-11-12 15:00:00
Name: datetime, dtype: datetime64[ns]
As indicated in the documentation (https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior), hours go from 00 to 23. 24:00 is then an error.

Converting date formats python - Unusual date formats - Extract %Y%M%D

I have a large data set with a variety of Date information in the following formats:
DAYS since Jan 1, 1900 - ex: 41213 - I believe these are from Excel http://www.kirix.com/stratablog/jd-edwards-date-conversions-cyyddd
YYDayofyear - ex 2012265
I am familiar with Python's time module, the strptime() method, and the strftime() method. However, I am not sure what these date formats are called or whether there is a Python module I can use to convert them.
Any idea how to get the %Y%M%D format from these unusual date formats without writing my own calculator?
Thanks.
You can try something like the following:
In [1]: import datetime
In [2]: s = '2012265'
In [3]: datetime.datetime.strptime(s, '%Y%j')
Out[3]: datetime.datetime(2012, 9, 21, 0, 0)
In [4]: d = '41213'
In [5]: datetime.date(1900, 1, 1) + datetime.timedelta(int(d))
Out[5]: datetime.date(2012, 11, 2)
The first one is the trickier one, but it uses the %j parameter to interpret the day of the year you provide (after a four-digit year, represented by %Y). The second one is simply the number of days since January 1, 1900.
This is the general conversion - not sure of your input format but hopefully this can be tweaked to suit it.
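To get from the parsed value to the compact year-month-day string the question asks for, strftime can be applied to the result. A short sketch, assuming the desired output is of the form YYYYMMDD (in Python's directives that is %Y%m%d, since %M means minutes):

```python
from datetime import datetime

# "2012265" is a four-digit year followed by the day of the year (%j).
ordinal_day = datetime.strptime("2012265", "%Y%j")

# Format as year, zero-padded month, zero-padded day.
print(ordinal_day.strftime("%Y%m%d"))  # 20120921
```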
On the Excel integer to Python datetime bit:
Note that there are two Excel date systems (one 1-Jan-1900 based and another 1-Jan 1904 based); see https://support.microsoft.com/en-us/help/214330/differences-between-the-1900-and-the-1904-date-system-in-excel for more information.
Also note that the system is NOT zero-based. So, in the 1900 system, 1-Jan-1900 is day 1 (not day 0).
import datetime
EXCEL_DATE_SYSTEM_PC=1900
EXCEL_DATE_SYSTEM_MAC=1904
i = 42129 # Excel number for 5-May-2015
d = datetime.date(EXCEL_DATE_SYSTEM_PC, 1, 1) + datetime.timedelta(i-2)
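The offset of 2 comes from two things: serial 1 (not 0) is 1-Jan-1900, and the 1900 system also counts a 29-Feb-1900 that never existed. An equivalent sketch uses 30-Dec-1899 as the epoch, which is valid for serials from 61 (1-Mar-1900) onward; the names `EXCEL_1900_EPOCH` and `excel_serial_to_date` are hypothetical:

```python
from datetime import date, timedelta

# Epoch that absorbs both the 1-based count and the phantom 29-Feb-1900.
EXCEL_1900_EPOCH = date(1899, 12, 30)

def excel_serial_to_date(serial):
    """Convert a 1900-system Excel serial (>= 61) to a date (hypothetical helper)."""
    if serial < 61:
        raise ValueError("serials before 1-Mar-1900 need special handling")
    return EXCEL_1900_EPOCH + timedelta(days=serial)

print(excel_serial_to_date(42129))  # 2015-05-05
print(excel_serial_to_date(41213))  # 2012-10-31
```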
Both of these formats seem pretty straightforward to work with. The first one, in fact, is just an integer, so why don't you just do something like this?
import datetime
def days_since_jan_1_1900_to_datetime(d):
    return datetime.datetime(1900, 1, 1) + \
        datetime.timedelta(days=d)
For the second one, the details depend on exactly how the format is defined (e.g. can you always expect 3 digits after the year even when the number of days is less than 100, or is it possible that there are 2 or 1 – and if so, is the year always 4 digits?) but once you've got that part down it can be done very similarly.
According to http://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior, day of the year is "%j", whereas the first case can be solved by toordinal() and fromordinal(): date.fromordinal(date(1900, 1, 1).toordinal() + x)
I'd think timedelta.
import datetime
d = datetime.timedelta(days=41213)
start = datetime.datetime(year=1900, month=1, day=1)
the_date = start + d
For the second one, you can use '2012265'[:4] to get the year and use the same method.
edit: See the answer with %j for the second.
import pandas as pd
from datetime import datetime
df['timeelapsed'] = (pd.to_datetime(df['timeelapsed'], format='%H:%M:%S') - datetime(1900, 1, 1)).dt.total_seconds()
