Convert object (numeric value) to datetime using pandas - Python

I have data with a Time column as shown below. How do I convert it to a date and time using pandas?

Consider df:
In [208]: df
Out[208]:
         Time
0  2384798300
1  1500353475
2  7006557825
3  1239779541
4  1237529231
Use datetime.fromtimestamp with df.apply:
In [200]: from datetime import datetime
In [209]: df['Time'] = df['Time'].apply(lambda x: datetime.fromtimestamp(x))
In [210]: df
Out[210]:
                 Time
0 2045-07-28 01:28:20
1 2017-07-18 10:21:15
2 2192-01-11 15:33:45
3 2009-04-15 12:42:21
4 2009-03-20 11:37:11
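A vectorized alternative, assuming these integers are Unix timestamps in seconds, is pd.to_datetime with unit='s'. Note that datetime.fromtimestamp converts in your local timezone, while to_datetime with unit='s' returns naive UTC timestamps, so the two can differ by your UTC offset:
import pandas as pd

df = pd.DataFrame({'Time': [2384798300, 1500353475, 7006557825,
                            1239779541, 1237529231]})

# Interpret the integers as seconds since the Unix epoch (UTC).
df['Time'] = pd.to_datetime(df['Time'], unit='s')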

Related

I have a datetime column (hh:mm:ss format) in a dataframe. I want to pivot the dataframe using aggfunc on the time column

I am trying to pivot the dataframe given below. I have a datetime column in (hh:mm:ss) format, and I want to pivot the dataframe using aggfunc on that column.
import pandas as pd
data = {'Type':['A', 'B', 'C', 'C'],'Name':['ab', 'ef','gh', 'ij'],'Time':['02:00:00', '03:02:00', '04:00:30','01:02:20']}
df = pd.DataFrame(data)
print (df)
pivot = (
    df.pivot_table(index=['Type'], values=['Time'], aggfunc='sum')
)
print(df) gives:
  Type Name      Time
0    A   ab  02:00:00
1    B   ef  03:02:00
2    C   gh  04:00:30
3    C   ij  01:02:20
and the pivot gives:
                  Time
Type
C     04:00:3001:02:20
A             02:00:00
B             03:02:00
I want the C row to be the sum of the two times: 05:02:50.
This looks more like a groupby sum than a pivot_table.
Convert with to_timedelta to get a proper duration dtype (this makes mathematical operations behave as expected), then groupby sum on Type to get the total duration per Type.
# Convert to TimeDelta (appropriate dtype)
df['Time'] = pd.to_timedelta(df['Time'])
new_df = df.groupby('Type')['Time'].sum().reset_index()
new_df:
  Type            Time
0    A 0 days 02:00:00
1    B 0 days 03:02:00
2    C 0 days 05:02:50
Optionally, convert back to strings:
new_df['Time'] = new_df['Time'].dt.to_pytimedelta().astype(str)
new_df:
  Type     Time
0    A  2:00:00
1    B  3:02:00
2    C  5:02:50
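If you prefer a zero-padded HH:MM:SS format instead, here is a minimal sketch that formats the timedelta column directly (apply it in place of the astype(str) conversion above; fmt is an illustrative helper name):
# Format each timedelta as zero-padded HH:MM:SS (apply before any astype(str)).
def fmt(td):
    total = int(td.total_seconds())
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f'{h:02d}:{m:02d}:{s:02d}'

new_df['Time'] = new_df['Time'].apply(fmt)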

How can I sort time in a dataframe in terms of hours, minutes, seconds, and milliseconds?

I have a problem sorting times that include hours, minutes, seconds, and milliseconds in ascending order.
In my dataframe, the Time column is defined as shown below (df['Time'].unique()):
array(['2:13:23.600', '3:13:18.700', '2:02:53.700', ..., '1:33:55.653',
       '1:33:14.678', '1:34:05.715'], dtype=object)
The Time column also includes values such as 42:53.700, 5:30.622, and 10.111.
How can I sort this column?
Convert the values to timedeltas with to_timedelta and then sort with DataFrame.sort_values:
import numpy as np
import pandas as pd

a = np.array(['2:13:23.600', '3:13:18.700', '2:02:53.700', '1:33:55.653',
              '1:33:14.678', '1:34:05.715'])
df = pd.DataFrame({'Time': a})
df['Time'] = pd.to_timedelta(df['Time'])
df = df.sort_values('Time')
print (df)
              Time
4  01:33:14.678000
3  01:33:55.653000
5  01:34:05.715000
2  02:02:53.700000
0  02:13:23.600000
1  03:13:18.700000
Another idea: use Series.argsort to get the positions of the sorted order and pass them to DataFrame.iloc to reorder the rows. Because there are multiple formats, create one parsed Series per format and join them with Series.fillna, which fills in the values that a given format failed to match (missing values):
a = np.array(['2:13:23.600', '3:13:18.700', '2:02:53.700', '1:33:55.653',
'1:33:14.678', '1:34:05.715', '42:53.700' , '5:30.622' , '10.111'])
df = pd.DataFrame({'Time':a})
d1 = pd.to_datetime(df['Time'], format='%H:%M:%S.%f', errors='coerce')
d2 = pd.to_datetime(df['Time'], format='%M:%S.%f', errors='coerce')
d3 = pd.to_datetime(df['Time'], format='%S.%f', errors='coerce')
d = d1.fillna(d2).fillna(d3)
print (d)
0 1900-01-01 02:13:23.600
1 1900-01-01 03:13:18.700
2 1900-01-01 02:02:53.700
3 1900-01-01 01:33:55.653
4 1900-01-01 01:33:14.678
5 1900-01-01 01:34:05.715
6 1900-01-01 00:42:53.700
7 1900-01-01 00:05:30.622
8 1900-01-01 00:00:10.111
Name: Time, dtype: datetime64[ns]
Check that all values were converted; the Series of unparsed values should be empty:
print (d[d.isna()])
Series([], Name: Time, dtype: datetime64[ns])
Finally, reorder the rows:
df = df.iloc[d.argsort()]
print (df)
          Time
8       10.111
7     5:30.622
6    42:53.700
4  1:33:14.678
3  1:33:55.653
5  1:34:05.715
2  2:02:53.700
0  2:13:23.600
1  3:13:18.700
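On pandas 1.1 or newer (an assumption about your version), the same multi-format parsing can be plugged directly into sorting via the key argument of DataFrame.sort_values; parse_mixed is just an illustrative name:
# Sort by a derived key: parse mixed H:M:S.f / M:S.f / S.f strings to datetimes.
def parse_mixed(s):
    d = pd.to_datetime(s, format='%H:%M:%S.%f', errors='coerce')
    d = d.fillna(pd.to_datetime(s, format='%M:%S.%f', errors='coerce'))
    return d.fillna(pd.to_datetime(s, format='%S.%f', errors='coerce'))

df = df.sort_values('Time', key=parse_mixed)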

Pandas isin with empty dataframe produces epoch value on datetime type instead of boolean

I've noticed that calling isin on a DataFrame containing datetime values, where the operand is an empty DataFrame, produces epoch datetime values (i.e. 1970-01-01) instead of False. It seems unlikely that this is correct.
The following code demonstrates this:
(pandas = 0.19.2, numpy = 1.12.0)
import pandas as pd
data = {'date': ['2014-05-01 18:47:05.069722', '2014-05-01 18:47:05.119994', '2014-05-02 18:47:05.178768']}
data2 = {'date': ['2014-05-01 18:47:05.069722', '2014-05-01 18:47:05.119994']}
df = pd.DataFrame(data, columns = ['date'])
df['date'] = pd.to_datetime(df['date'])
df2 = pd.DataFrame(data2, columns = ['date'])
df2['date'] = pd.to_datetime(df2['date'])
df3 = pd.DataFrame([], columns = ['date'])
df4 = pd.DataFrame()
print(df.isin(df2))
print(df.isin(df3))
print(df.isin(df4))
This outputs:
date
0 True
1 True
2 False
date
0 1970-01-01
1 1970-01-01
2 1970-01-01
date
0 1970-01-01
1 1970-01-01
2 1970-01-01
I would normally expect a list of False values instead of '1970-01-01'. I notice that with pandas = 0.16.2 and numpy = 1.9.2, df.isin(df3) produces the more expected:
date
0 False
1 False
2 False
But df.isin(df4) behaves as before.
This definitely looks like a bug to me. isin() calls DataFrame.eq as seen in the source code, and the odd behavior is reproducible with DataFrame.eq itself.
>>> df
date
0 2014-05-01 18:47:05.069722
1 2014-05-01 18:47:05.119994
2 2014-05-02 18:47:05.178768
>>> import numpy as np
>>> df.eq(pd.DataFrame(dict(date=[np.nan]*3)))
date
0 1970-01-01
1 1970-01-01
2 1970-01-01
I see you've now raised it as an open issue,
Pandas isin with empty dataframe produces epoch value on datetime type instead of boolean #15473
and it should be resolved in an upcoming release.
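Until you are on a fixed version, a minimal workaround sketch (my own, not from the issue) is to short-circuit the empty case before calling isin:
# An empty operand matches nothing, so build an all-False mask directly.
other = df4  # or df3; any possibly-empty frame
if other.empty:
    mask = pd.DataFrame(False, index=df.index, columns=df.columns)
else:
    mask = df.isin(other)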

How to convert datetime object to milliseconds

I am parsing datetime values as follows:
df['actualDateTime'] = pd.to_datetime(df['actualDateTime'])
How can I convert this datetime objects to milliseconds?
I didn't see mention of milliseconds in the doc of to_datetime.
Update (Based on feedback):
This is the current version of the code, which raises TypeError: Cannot convert input to Timestamp. The column Date3 must contain milliseconds (as a numeric equivalent of a datetime object).
import pandas as pd
import time
s1 = {'Date' : ['2015-10-20T07:21:00.000','2015-10-19T07:18:00.000','2015-10-19T07:15:00.000']}
df = pd.DataFrame(s1)
df['Date2'] = pd.to_datetime(df['Date'])
t = pd.Timestamp(df['Date2'])
df['Date3'] = time.mktime(t.timetuple())
print(df)
You can try pd.to_datetime(df['actualDateTime'], unit='ms').
The to_datetime documentation (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html) says unit denotes the epoch unit, with variants 's', 'ms', 'ns', and so on.
Update
If you want an epoch timestamp of the form 14567899..:
import pandas as pd
import time
t = pd.Timestamp('2015-10-19 07:22:00')
time.mktime(t.timetuple())
>> 1445219520.0
Latest update
import numpy as np

df = pd.DataFrame(s1)
df1 = pd.to_datetime(df['Date'])
pd.DatetimeIndex(df1)
>>>DatetimeIndex(['2015-10-20 07:21:00', '2015-10-19 07:18:00',
'2015-10-19 07:15:00'],
dtype='datetime64[ns]', freq=None)
df1.astype(np.int64)
>>>0 1445325660000000000
1 1445239080000000000
2 1445238900000000000
df1.astype(np.int64) // 10**9
>>>0 1445325660
1 1445239080
2 1445238900
Name: Date, dtype: int64
Timestamps in pandas are always in nanoseconds.
This gives you milliseconds since the epoch (1970-01-01):
df['actualDateTime'] = df['actualDateTime'].astype(np.int64) / int(1e6)
This will return milliseconds from epoch
timestamp_object.timestamp() * 1000
pandas.to_datetime converts strings (and a few other datatypes) to pandas datetime64[ns].
In your case, the initial 'actualDateTime' values have no milliseconds, so there is nothing to extract. If you parse a column that does contain milliseconds, you will get the data.
For example,
df
Out[60]:
a b
0 2015-11-02 18:04:32.926 0
1 2015-11-02 18:04:32.928 1
2 2015-11-02 18:04:32.927 2
df.a
Out[61]:
0 2015-11-02 18:04:32.926
1 2015-11-02 18:04:32.928
2 2015-11-02 18:04:32.927
Name: a, dtype: object
df.a = pd.to_datetime(df.a)
df.a
Out[63]:
0 2015-11-02 18:04:32.926
1 2015-11-02 18:04:32.928
2 2015-11-02 18:04:32.927
Name: a, dtype: datetime64[ns]
df.a.dt.nanosecond
Out[64]:
0 0
1 0
2 0
dtype: int64
df.a.dt.microsecond
Out[65]:
0 926000
1 928000
2 927000
dtype: int64
For what it's worth, to convert a single Pandas timestamp object to milliseconds, I had to do:
import time
time.mktime(<timestamp_object>.timetuple())*1000
For Python >= 3.8, for example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'temp': [1, 2, 3]}, index=[pd.Timestamp.utcnow()]*3)
Convert to milliseconds:
times = df.index.view(np.int64) // int(1e6)
print(times[0])
gives:
1666925409051
Note: to convert to seconds, similarly e.g.:
times = df.index.view(np.int64) // int(1e9)
print(times[0])
1666925409
from datetime import datetime
print(datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3])
>>>> OUTPUT >>>>
2015-11-02 18:04:32.926
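Putting the main answers together, a minimal round-trip sketch (reusing the Date column from the question's example): parse the strings to datetime64[ns], take the integer nanoseconds, and floor-divide down to milliseconds; to_datetime with unit='ms' reverses the conversion:
import pandas as pd

df = pd.DataFrame({'Date': ['2015-10-20T07:21:00.000',
                            '2015-10-19T07:18:00.000',
                            '2015-10-19T07:15:00.000']})
df['Date'] = pd.to_datetime(df['Date'])

# datetime64[ns] -> integer nanoseconds -> milliseconds since the epoch
ms = df['Date'].astype('int64') // 10**6

# and back: milliseconds since the epoch -> datetime64[ns]
back = pd.to_datetime(ms, unit='ms')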

Calculating date_range over GroupBy object in pandas

I have a massive dataframe with four columns, two of which are 'date' (in datetime format) and 'page' (a location saved as a string). I have grouped the dataframe by 'page' and called it pagegroup, and want to know the range of time over which each page is accessed (e.g. if the first access was on 1-1-13 and the last on 1-5-13, the max minus min is 4 days).
I know in pandas I can use date_range to compare two datetimes, but trying something like:
pagegroup['date'].agg(np.date_range)
returns
AttributeError: 'module' object has no attribute 'date_range'
while trying the simple (non date-specific) numpy function ptp gives me an integer answer:
daterange = pagegroup['date'].agg([np.ptp])
daterange.head()
                          ptp
page
%2F                         0
/           13325984000000000
/-509606456   297697000000000
/-511484155                 0
/-511616154                 0
Can anyone think of a way to calculate the range of dates and have it return in a recognizable date format?
Thank you
Assuming you have indexed by datetime, you can use groupby apply:
In [11]: df = pd.DataFrame([[1, 2], [1, 3], [2, 4]],
                           columns=list('ab'),
                           index=pd.date_range('2013-08-22', freq='H', periods=3))
In [12]: df
Out[12]:
a b
2013-08-22 00:00:00 1 2
2013-08-22 01:00:00 1 3
2013-08-22 02:00:00 2 4
In [13]: g = df.groupby('a')
In [14]: g.apply(lambda x: x.iloc[-1].name - x.iloc[0].name)
Out[14]:
a
1 01:00:00
2 00:00:00
dtype: timedelta64[ns]
Here iloc[-1] grabs the last row in the group and iloc[0] gets the first. The name attribute is the index of the row.
@Elyase points out that this only works if the original DatetimeIndex is in order; if not, you can use max/min (which actually reads better, but may be less efficient):
In [15]: g.apply(lambda x: x.index.max() - x.index.min())
Out[15]:
a
1 01:00:00
2 00:00:00
dtype: timedelta64[ns]
Note: to get the timedelta between two Timestamps we have just subtracted (-).
If date is a column rather than an index, then use the column name:
g.apply(lambda x: x['date'].iloc[-1] - x['date'].iloc[0])
g.apply(lambda x: x['date'].max() - x['date'].min())
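To return to the original ptp output: those large integers are nanosecond counts, so (a minimal sketch, assuming the pagegroup grouping from the question and the integer output shown above) you can cast the aggregated result back to a readable duration with pd.to_timedelta:
# np.ptp on datetime64 values yields integer nanoseconds; make them readable.
daterange = pagegroup['date'].agg(np.ptp)
daterange = pd.to_timedelta(daterange, unit='ns')  # e.g. 13325984000000000 -> '154 days 05:39:44'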
