I have a column called 'created_at' in dataframe df, its value is like '2/3/15 2:00' in UTC. Now I want to convert it to unix time, how can I do that?
I tried the script like:
time.mktime(datetime.datetime.strptime(df['created_at'], "%m/%d/%Y, %H:%MM").timetuple())
It returns an error. I guess the tricky part is that the year is '15' instead of '2015'.
Is there any efficient way that I am able to deal with it?
Thanks!
Since you mention that you're working with a pandas DataFrame, you can simplify this by using pd.to_datetime:
import pandas as pd
import numpy as np
df = pd.DataFrame({'times': ['2/3/15 2:00']})
# to datetime, format is inferred correctly
df['datetime'] = pd.to_datetime(df['times'])
# df['datetime']
# 0 2015-02-03 02:00:00
# Name: datetime, dtype: datetime64[ns]
# to Unix time / seconds since 1970-1-1 Z
# .astype(np.int64) on datetime Series gives you nanoseconds, so divide by 1e9 to get seconds
df['unix'] = df['datetime'].astype(np.int64) / 1e9
# df['unix']
# 0 1.422929e+09
# Name: unix, dtype: float64
%Y is for 4-digit years.
Since you have 2-digit years (assuming they mean 20##), you can use the %y specifier instead (notice the lower-case y).
You should use lowercase %y (year without century) rather than uppercase %Y (year with century)
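Putting the answers above together, here is a minimal sketch that parses the column with an explicit `%y` format and converts it to integer Unix seconds. It assumes the strings are month/day/year and already in UTC, as the question states; the second sample row is made up for illustration.

```python
import pandas as pd

# Sample data; the second row is invented for illustration only
df = pd.DataFrame({'created_at': ['2/3/15 2:00', '11/20/15 14:30']})

# %y = 2-digit year; utc=True marks the parsed values as UTC
parsed = pd.to_datetime(df['created_at'], format='%m/%d/%y %H:%M', utc=True)

# Whole seconds since the epoch, independent of the underlying datetime resolution
epoch = pd.Timestamp('1970-01-01', tz='UTC')
df['unix'] = (parsed - epoch) // pd.Timedelta(seconds=1)

print(df)
```

Subtracting the epoch and floor-dividing by one second yields integers rather than floats, which avoids rounding surprises when the values are later compared or joined.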
Related
I have a column of timestamps that I would like to convert to datetime in my pandas dataframe. The format of the dates is %Y-%m-%d-%H-%M-%S which pd.to_datetime does not recognize. I have manually entered the format as below:
df['TIME'] = pd.to_datetime(df['TIME'], format = '%Y-%m-%d-%H-%M-%S')
My problem is some of the times do not have seconds so they are shorter
(format = %Y-%m-%d-%H-%M).
How can I get all of these strings to datetimes?
I was thinking I could add zero seconds (-0) to the end of my shorter dates but I don't know how to do that.
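Try strftime. If pandas can't recognize your custom datetime format, you should provide it explicitly. Here the dashes are stripped so pandas can parse the strings, the result is written back out with strftime in the full format (which fills in the missing seconds), and that string is parsed again with the explicit format: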
from functools import partial
df1 = pd.DataFrame({'Date': ['2018-07-02-06-05-23','2018-07-02-06-05']})
newdatetime_fmt = partial(pd.to_datetime, format='%Y-%m-%d-%H-%M-%S')
df1['Clean_Date'] = (df1.Date.str.replace('-','').apply(lambda x: pd.to_datetime(x).strftime('%Y-%m-%d-%H-%M-%S'))
.apply(newdatetime_fmt))
print(df1,df1.dtypes)
output:
Date Clean_Date
0 2018-07-02-06-05-23 2018-07-02 06:05:23
1 2018-07-02-06-05 2018-07-02 06:05:00
Date object
Clean_Date datetime64[ns]
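The idea from the question, appending zero seconds to the shorter strings, can also be done directly. This is a minimal sketch, assuming every value is either the six-field or the five-field form; the column name `TIME_parsed` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'TIME': ['2018-07-02-06-05-23', '2018-07-02-06-05']})

# Strings without seconds contain only four dashes
short = df['TIME'].str.count('-') == 4

# Append '-00' to the short strings, leave the others unchanged
padded = df['TIME'].where(~short, df['TIME'] + '-00')

df['TIME_parsed'] = pd.to_datetime(padded, format='%Y-%m-%d-%H-%M-%S')
print(df)
```

This parses each string once with a single explicit format, instead of parsing, re-formatting, and parsing again.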
I tried:
df["datetime_obj"] = df["datetime"].apply(lambda dt: datetime.strptime(dt, "%d/%m/%Y %H:%M"))
but got this error:
ValueError: time data '10/11/2006 24:00' does not match format
'%d/%m/%Y %H:%M'
How to solve it correctly?
The reason why this does not work is because the %H parameter only accepts values in the range of 00 to 23 (both inclusive). This thus means that 24:00 is - like the error says - not a valid time string.
I therefore think we have few options other than converting the string to a valid format. We can do this by first replacing 24:00 with 00:00, and then incrementing the day for these timestamps.
Like:
from datetime import timedelta
import pandas as pd
df['datetime_zero'] = df['datetime'].str.replace('24:00', '0:00')
df['datetime_er'] = pd.to_datetime(df['datetime_zero'], format='%d/%m/%Y %H:%M')
selrow = df['datetime'].str.contains('24:00')
# add one day (as a timedelta) only where the original string contained 24:00
df['datetime_obj'] = df['datetime_er'] + pd.to_timedelta(selrow.astype(int), unit='D')
The last line thus adds one day to the rows that contain 24:00, such that '10/11/2006 24:00' gets converted to '11/11/2006 00:00'. Note however that the above is rather unsafe, since whether it works depends on the format of the timestamp. For the above it will (probably) work, since there is only one colon. But if, for example, the datetimes have seconds as well, the filter could get triggered for 00:24:00, so it might require some extra work to get it working.
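For the case with seconds mentioned above, one way to avoid the false match on 00:24:00 is to anchor the pattern to the end of the string. This is a minimal sketch with made-up sample data, assuming the format is '%d/%m/%Y %H:%M:%S':

```python
import pandas as pd

# Made-up data: only the first row is a real "24:00:00" case
s = pd.Series(['10/11/2006 24:00:00', '11/11/2006 00:24:00', '12/11/2006 15:00:00'])

# Match ' 24:00:00' only at the very end of the string
pattern = r' 24:00:00$'
is_24 = s.str.contains(pattern, regex=True)

# Rewrite to midnight, parse, then add one day where needed
cleaned = s.str.replace(pattern, ' 00:00:00', regex=True)
parsed = pd.to_datetime(cleaned, format='%d/%m/%Y %H:%M:%S')
parsed = parsed + pd.to_timedelta(is_24.astype(int), unit='D')

print(parsed)
```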
Your data doesn't follow the conventions used by Python / Pandas datetime objects. There should be only one way of storing a particular datetime, i.e. '10/11/2006 24:00' should be rewritten as '11/11/2006 00:00'.
Here's one way to approach the problem:
# find datetimes which have '24:00' and rewrite
twenty_fours = df['strings'].str[-5:] == '24:00'
df.loc[twenty_fours, 'strings'] = df['strings'].str[:-5] + '00:00'
# construct datetime series
df['datetime'] = pd.to_datetime(df['strings'], format='%d/%m/%Y %H:%M')
# add one day where applicable
df.loc[twenty_fours, 'datetime'] += pd.DateOffset(1)
Here's some data to test:
dateList = ['10/11/2006 24:00', '11/11/2006 00:00', '12/11/2006 15:00']
df = pd.DataFrame({'strings': dateList})
Result after transformations described above:
print(df['datetime'])
0 2006-11-11 00:00:00
1 2006-11-11 00:00:00
2 2006-11-12 15:00:00
Name: datetime, dtype: datetime64[ns]
As indicated in the documentation (https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior), hours go from 00 to 23. 24:00 is then an error.
Let's assume that I have the following data:
25/01/2000 05:50
When I convert it using datetime.toordinal, it returns this value:
730144
That's nice, but this value just considers the date itself. I also want it to consider the hour and minutes (05:50). How can I do it using datetime?
EDIT:
I want to convert a whole Pandas Series.
An ordinal date is by definition only considering the year and day of year, i.e. its resolution is 1 day.
You can get the seconds since the epoch (as a float) using the timestamp() method. Note that for a naive datetime, timestamp() assumes the value is in local time:
datetime.datetime.strptime('25/01/2000 05:50', '%d/%m/%Y %H:%M').timestamp()
for a pandas series you can do
s = pd.Series(['25/01/2000 05:50', '25/01/2000 05:50', '25/01/2000 05:50'])
s = pd.to_datetime(s, format='%d/%m/%Y %H:%M') # make sure you're dealing with datetime instances; explicit day-first format
s.apply(lambda v: v.timestamp())
If you use Python 3.x, you can get the date and time as seconds since 1/1/1970 00:00:
from datetime import datetime
dt = datetime.today() # Get timezone naive now
seconds = dt.timestamp()
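If an ordinal-style number is still wanted, the time of day can be added as a fraction of a day. This is a minimal sketch; `fractional_ordinal` is a hypothetical helper, not part of datetime or pandas, and the second sample value is made up.

```python
from datetime import datetime
import pandas as pd

def fractional_ordinal(dt):
    """Ordinal day number plus the elapsed fraction of that day."""
    midnight = dt.replace(hour=0, minute=0, second=0, microsecond=0)
    return dt.toordinal() + (dt - midnight).total_seconds() / 86400

# Single value
dt = datetime.strptime('25/01/2000 05:50', '%d/%m/%Y %H:%M')
value = fractional_ordinal(dt)  # 730144 plus 5h50m as a fraction of a day

# Whole pandas Series (elements are pd.Timestamp, which supports the same methods)
s = pd.to_datetime(pd.Series(['25/01/2000 05:50', '26/01/2000 18:00']),
                   format='%d/%m/%Y %H:%M')
ordinals = s.apply(fractional_ordinal)
print(value, ordinals.tolist())
```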
I have two different time series. One is a series of timestamps in ms-format from the CET timezone delivered as strings. The other are unix-timestamps in s-format in the UTC timezone.
Each of them is in a column in a larger dataframe, none of them is a DatetimeIndex and should not be one.
I need to convert the CET time to UTC and then calculate the difference between both columns and I'm lost between the Datetime functionalities of Python and Pandas, and the variety of different datatypes.
Here's an example:
import pandas as pd
import pytz
germany = pytz.timezone('Europe/Berlin')
D1 = ["2016-08-22 00:23:58.254","2016-08-22 00:23:58.254",
"2016-08-22 00:23:58.254","2016-08-22 00:40:33.260",
"2016-08-22 00:40:33.260","2016-08-22 00:40:33.260"]
D2 = [1470031195, 1470031195, 1470031195, 1471772027, 1471765890, 1471765890]
S1 = pd.to_datetime(pd.Series(D1))
S2 = pd.to_datetime(pd.Series(D2),unit='s')
First problem
is with the use of tz_localize. I need the program to understand, that the data in S1 is not in UTC, but in CET. However using tz_localize like this seems to interpret the given datetime as CET assuming it's UTC to begin with:
F1 = S1.apply(lambda x: x.tz_localize(germany)).to_frame()
Trying tz_convert always throws something like:
TypeError: index is not a valid DatetimeIndex or PeriodIndex
Second problem
is that even with both of them having the same format I'm stuck because I can't calculate the difference between the two columns now:
F1 = S1.apply(lambda x: x.tz_localize(germany)).to_frame()
F1.columns = ["CET"]
F2 = S2.apply(lambda x: x.tz_localize('UTC')).to_frame()
F2.columns = ["UTC"]
FF = pd.merge(F1,F2,left_index=True,right_index=True)
FF.CET-FF.UTC
ValueError: Incompatbile tz's on datetime subtraction ops
I need a way to do these calculation with tz-aware datetime objects that are no DatetimeIndex objects.
Alternatively I need a way to make my CET-column to just look like this:
2016-08-21 22:23:58.254
2016-08-21 22:23:58.254
2016-08-21 22:23:58.254
2016-08-21 22:40:33.260
2016-08-21 22:40:33.260
2016-08-21 22:40:33.260
That is, I don't need my datetime to be tz-aware, I just want to convert it automatically by adding/subtracting the necessary amount of time with an awareness for daylight saving times.
If it weren't for DST I could just do a simple subtraction on two integers.
First you need to convert the CET timestamps to datetime and specify the timezone:
S1 = pd.to_datetime(pd.Series(D1))
T1_cet = pd.DatetimeIndex(S1).tz_localize('Europe/Berlin')
Then convert the UTC timestamps to datetime and specify the timezone to avoid confusion:
S2 = pd.to_datetime(pd.Series(D2), unit='s')
T2_utc = pd.DatetimeIndex(S2).tz_localize('UTC')
Now convert the CET timestamps to UTC:
T1_utc = T1_cet.tz_convert('UTC')
And finally calculate the difference between the timestamps:
diff = pd.Series(T1_utc) - pd.Series(T2_utc)
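Since the question asks to keep everything as ordinary columns, the same steps can be sketched with the Series `.dt` accessor instead of a DatetimeIndex. The column names here are hypothetical and the data is shortened to two rows from the question. Timestamps that fall into a DST transition may additionally need the `ambiguous` or `nonexistent` arguments of `tz_localize`.

```python
import pandas as pd

D1 = ["2016-08-22 00:23:58.254", "2016-08-22 00:40:33.260"]  # local German time
D2 = [1470031195, 1471772027]                                # Unix seconds, UTC

df = pd.DataFrame({"cet_str": D1, "utc_s": D2})

# Interpret the strings as Europe/Berlin wall-clock time, then convert to UTC
cet = pd.to_datetime(df["cet_str"]).dt.tz_localize("Europe/Berlin")
df["cet_as_utc"] = cet.dt.tz_convert("UTC")

# Parse the Unix seconds directly as tz-aware UTC
df["utc"] = pd.to_datetime(df["utc_s"], unit="s", utc=True)

# Both columns now share one timezone, so subtraction works
df["diff"] = df["cet_as_utc"] - df["utc"]

# Optional: drop the timezone to get plain UTC wall-clock values
df["cet_as_utc_naive"] = df["cet_as_utc"].dt.tz_localize(None)
print(df[["cet_as_utc_naive", "diff"]])
```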
I am trying to process data with a timestamp field. The timestamp looks like this:
'20151229180504511' (year, month, day, hour, minute, second, millisecond)
and is a python string. I am attempting to convert it to a python datetime object. Here is what I have tried (using pandas):
data['TIMESTAMP'] = data['TIMESTAMP'].apply(lambda x:datetime.strptime(x,"%Y%b%d%H%M%S"))
# returns error time data '20151229180504511' does not match format '%Y%b%d%H%M%S'
So I add milliseconds:
data['TIMESTAMP'] = data['TIMESTAMP'].apply(lambda x:datetime.strptime(x,"%Y%b%d%H%M%S%f"))
# also tried with .%f all result in a format error
So tried using the dateutil.parser:
data['TIMESTAMP'] = data['TIMESTAMP'].apply(lambda s: dateutil.parser.parse(s).strftime(DateFormat))
# results in OverflowError: 'signed integer is greater than maximum'
Also tried converting these entries using the pandas function:
data['TIMESTAMP'] = pd.to_datetime(data['TIMESTAMP'], unit='ms', errors='coerce')
# coerce does not show entries as NaT
I've made sure that whitespace is gone. Converting to Strings, to integers and floats. No luck so far - pretty stuck.
Any ideas?
p.s. Background info: The data is generated in an Android app as a java.util.Calendar object, then converted to a string in Java, written to a csv and then sent to the python server where I read it in using pandas read_csv.
Just try:
datetime.strptime(x,"%Y%m%d%H%M%S%f")
You used %b where %m is needed:
%b : Month as locale’s abbreviated name.
%m : Month as a zero-padded decimal number.
%b is for locale-based month name abbreviations like Jan, Feb, etc.
Use %m for 2-digit months:
In [36]: df = pd.DataFrame({'Timestamp':['20151229180504511','20151229180504511']})
In [37]: df
Out[37]:
Timestamp
0 20151229180504511
1 20151229180504511
In [38]: pd.to_datetime(df['Timestamp'], format='%Y%m%d%H%M%S%f')
Out[38]:
0 2015-12-29 18:05:04.511
1 2015-12-29 18:05:04.511
Name: Timestamp, dtype: datetime64[ns]
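The same format also works with the standard library alone. A short sketch: `%f` accepts one to six digits and pads on the right, so the trailing '511' is read as 511000 microseconds, i.e. 511 milliseconds.

```python
from datetime import datetime

raw = '20151229180504511'  # year, month, day, hour, minute, second, millisecond

# %m (numeric month) instead of %b (abbreviated month name); %f takes the trailing digits
ts = datetime.strptime(raw, '%Y%m%d%H%M%S%f')

print(ts)              # full datetime including the fractional second
print(ts.microsecond)  # the three millisecond digits, stored as microseconds
```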