Reverse Pandas to_datetime() method to get timestamps - python

I have a DataFrame with one column sensorTime representing timestamps in nanoseconds (currentTimeMillis() * 1000000). An example is the following:
sensorTime
1597199687312000000
1597199687315434496
1597199687320437760
1597199687325465856
1597199687330448640
1597199687335456512
1597199687340429824
1597199687345459456
1597199687350439168
1597199687355445504
I'm transforming this column to datetime using
res['sensorTime'] = pd.to_datetime(res['sensorTime'])
How can I transform the datetime back to timestamps in nanoseconds so that I get the exact values as in the example above?

The obvious way to do this would be to convert the datetime series to Unix time (nanoseconds since the epoch) using astype:
import pandas as pd
res = pd.DataFrame({'sensorTime': [1597199687312000000, 1597199687315434496, 1597199687320437760]})
res['sensorTime'] = pd.to_datetime(res['sensorTime'])
# back to nanoseconds / Unix time - .astype gives a pandas.Series
s = res['sensorTime'].astype('int64')
print(type(s), s)
<class 'pandas.core.series.Series'>
0 1597199687312000000
1 1597199687315434496
2 1597199687320437760
Name: sensorTime, dtype: int64
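A quick round-trip check (a sketch based on the sample values above) confirms the conversion is lossless:
import pandas as pd

orig = [1597199687312000000, 1597199687315434496, 1597199687320437760]
res = pd.DataFrame({'sensorTime': orig})
res['sensorTime'] = pd.to_datetime(res['sensorTime'])

# datetime64[ns] -> int64 recovers the original epoch nanoseconds exactly
assert res['sensorTime'].astype('int64').tolist() == orig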
Another option is to use a view:
# .view to get a numpy.ndarray
arr = res['sensorTime'].values.view('int64')
print(type(arr), arr)
<class 'numpy.ndarray'>
[1597199687312000000 1597199687315434496 1597199687320437760]
But be careful: it's just a view, so it shares the underlying data. If you change a value in the DataFrame's Series, that change will also be visible in the view array, and vice versa.
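A minimal sketch of that aliasing (this assumes copy-on-write is not enabled; with pandas' copy-on-write mode the array returned by .values may be read-only):
import pandas as pd

res = pd.DataFrame({'sensorTime': pd.to_datetime([1597199687312000000])})
arr = res['sensorTime'].values.view('int64')

# writing through the view mutates the DataFrame's underlying buffer
arr[0] = 0
print(res['sensorTime'].iloc[0])  # 1970-01-01 00:00:00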

Related

Convert object-type hours:minutes:seconds column to datetime type in Pandas

I have a column called Time in a dataframe that looks like this:
599359 12:32:25
326816 17:55:22
326815 17:55:22
358789 12:48:25
361553 12:06:45
...
814512 21:22:07
268266 18:57:31
659699 14:28:20
659698 14:28:20
268179 17:48:53
Name: Time, Length: 546967, dtype: object
And right now it is an object dtype. I've tried the following to convert it to a datetime:
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S', errors='coerce', utc=True).dt.time
And I understand that the .dt.time methods are needed to prevent the Year and Month from being added, but I believe this is causing the dtype to revert to an object.
Any workarounds? I know I could do
df['Time'] = df['Time'].apply(pd.to_datetime, format='%H:%M:%S', errors='coerce', utc=True)
but I have over 500,000 rows and this is taking forever.
When you do this bit: df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S', errors='coerce', utc=True).dt.time, you're converting the 'Time' column to the pandas dtype object... and each "object" is the Python type datetime.time.
The pandas datetime64 dtype is a different type from Python's datetime.datetime objects, and it does not support bare time-of-day values (i.e. you can't have pandas consider the column a datetime without providing a date). That is why the dtype changes to object.
In the case of your second approach, df['Time'] = df['Time'].apply(pd.to_datetime, format='%H:%M:%S', errors='coerce', utc=True), something slightly different is happening. Here you're applying pd.to_datetime to each scalar element of the 'Time' series. Take a look at the return types of the function in the docs, but basically the time values in your df are being converted to datetimes on the 1st of January 1900 (i.e. a default date is added).
So: pandas is behaving correctly. If you only want the times, it's fine to keep datetime.time objects in the column, but operating on them will probably rely on many [slow] df.apply calls. Alternatively, keep the default date of 1900-01-01; then you can add/subtract the datetime columns and get the speed advantage of pandas, and just strip the date off when you're done with it.
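A minimal sketch of that second approach (the column name is taken from the question, the sample times are made up):
import pandas as pd

df = pd.DataFrame({'Time': ['12:32:25', '17:55:22', '12:48:25']})

# vectorised parse; pandas attaches the default date 1900-01-01
t = pd.to_datetime(df['Time'], format='%H:%M:%S', errors='coerce')
print(t.dtype)  # datetime64[ns]

# fast vectorised arithmetic works while the filler date is attached
spread = t.max() - t.min()  # Timedelta('0 days 05:22:57')

# strip the date back off when you're done
df['Time'] = t.dt.time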

Separating Date and Time in Pandas

I have a data file with timestamps (the sample data was shown as an image in the original question).
It gets loaded into pandas with a column name of "Time". I am trying to create two new datetime64 type columns, one with the date and one with the time (hour). I have explored a few solutions to this problem on StackOverflow but am still having issues. Quick note, I need the final columns to not be objects so I can use pandas and numpy functionality.
I load the dataframe and create two new columns like so:
df = pd.read_csv('C:\\Users\\...\\xyz.csv')
df['Date'] = pd.to_datetime(df['Time']).dt.date
df['Hour'] = pd.to_datetime(df['Time']).dt.time
This works but the Date and Hour columns are now objects.
I run this to convert the date to my desired datetime64 data type and it works:
df['Date'] = pd.to_datetime(df['Date'])
However, when I try to use this same code on the Hour column, I get an error:
TypeError: <class 'datetime.time'> is not convertible to datetime
I did some digging and found the following code which runs:
df['Hour'] = pd.to_datetime(df['Hour'], format='%H:%M:%S')
However, the actual output includes a generic date of 1/1/1900 attached to each time.
When I try to run code referencing the Hour column like so:
HourVarb = '15:00:00'
df['Test'] = np.where(df['Hour']==HourVarb,1,np.nan)
It runs but doesn't produce the result I want.
Perhaps my HourVarb variable is the wrong format for the numpy code? Alternatively, the 1/1/1900 is causing problems and the format '%H:%M:%S' needs to change? My end goal is to be able to reference the hour and the date to filter out specific date/hour combinations. Please help.
One note, when I change the HourVarb to '1/1/1900 15:00:00' the code above works as intended, but I'd still like to understand if there is a cleaner way that removes the date. Thanks
I'm not sure I understand the problem with the 'object' datatypes of these columns.
I loaded the data you provided this way:
df = pd.read_csv('xyz.csv')
df['Time'] = pd.to_datetime(df['Time'])
df['Date'] = df['Time'].dt.date
df['Hour'] = df['Time'].dt.time
print(df.dtypes)
And I get these data types:
Time datetime64[ns]
Date object
Hour object
The fact that Date and Hour are object types should not be a problem. The underlying data is a datetime type:
print(type(df.Date.iloc[0]))
print(type(df.Hour.iloc[0]))
<class 'datetime.date'>
<class 'datetime.time'>
This means you can use these columns as such. For example:
print(df['Date'] + pd.Timedelta('1D'))
What are you trying to do that is requiring the column dtype to be a Pandas dtype?
UPDATE
Here is how you achieve the last part of your question:
from datetime import datetime, time
hourVarb = datetime.strptime("15:00:00", '%H:%M:%S').time()
# or hourVarb = time(15, 0)
df['Test'] = df['Hour'] == hourVarb
print(df['Test'])
0 True
1 False
2 False
3 False
Name: Test, dtype: bool
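If the Hour column instead keeps the parsed datetimes with the filler 1900-01-01 date, the dt accessor gives a comparison that ignores the date part (a hypothetical variant of the setup above):
import pandas as pd
from datetime import time

df = pd.DataFrame({'Hour': pd.to_datetime(['15:00:00', '12:30:00'], format='%H:%M:%S')})

# compare on the time-of-day component and ignore the filler date
df['Test'] = df['Hour'].dt.time == time(15, 0)
print(df['Test'])  # 0 True, 1 False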

Pandas returns different datetime formats with iloc vs. with condition

I am struggling to get consistent datetime formatting out of a pandas array. I have a df with a calendar date column of dtype datetime64[ns].
When I access the calendar date directly by iloc, say index 966, I get a Timestamp type
df.iloc[966]['Calendar Day']
Output 1:
Timestamp('1998-09-26 00:00:00')
However, when I access the same row with a conditional statement, I get a different output format:
a = df[(df['colA'] == condA) & (df['colB'] == condB)]['Calendar Day']
a
results in output 2:
966 1998-09-26
Name: Calendar Day, dtype: datetime64[ns]
My condition is designed such that it returns at most one row (or nothing).
I am puzzled: the two statements look equivalent to me, given that both access the same column of the same row of the same df, once by index and once by condition.
How do I make them equal? I would like to calculate a difference, such as
abs( (a-b).days )
to get the relative time difference. That results in AttributeError: 'Series' object has no attribute 'days'
Thanks!
A scalar index value passed to iloc gives you a scalar value in return (a Timestamp in this case), while boolean masking gives you a pd.Series in return. So you have neither different formatting nor different datatypes.
scalar index:
import pandas as pd
df = pd.DataFrame({'date': pd.to_datetime(['1998-09-25', '1998-09-26', '1998-09-27']),
                   'v0': [1, 2, 3], 'v1': [4, 5, 6]})
# a scalar index value gives you a specific value from one row/column:
df.iloc[1]['date']
# Timestamp('1998-09-26 00:00:00')
boolean masking:
# this gives you a series with one element:
# mask:
m = (df['v0'] == 2) & (df['v1'] == 5)
# note that m is also a series, not a scalar value:
# 0 False
# 1 True
# 2 False
# dtype: bool
df[m]['date']
# ...and so is the result if you apply the mask:
# 1 1998-09-26
# Name: date, dtype: datetime64[ns]
same dtype actually...
# note that the element you get back is also a Timestamp:
df[m]['date'].iloc[0]
# Timestamp('1998-09-26 00:00:00')
vs. index slice:
df.iloc[1:2]['date'] # ...also gives you pd.Series
# 1 1998-09-26
# Name: date, dtype: datetime64[ns]
...to get the arithmetic working use the dt accessor (since you're working with a Series):
a, b = df.iloc[1]['date'], df[m]['date']
abs( (a-b).dt.days )
# 1 0
# Name: date, dtype: int64
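Alternatively, continuing the same example, pull the scalar out of the one-row Series first; then the plain .days attribute from the question works directly:
# extract the scalar Timestamp from the one-row Series
b_scalar = df[m]['date'].iloc[0]
abs((a - b_scalar).days)
# 0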
It is probably because df.iloc[966] creates a Series object when given only a single index, and the row's values are all cast to a common dtype, whereas with the conditional statement the column keeps its datetime64 dtype because no row Series is created along the way.
You can confirm this by replacing df.iloc[966] with df.iloc[966:967] and checking that you get the same type as with the conditional statement.
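A small sketch of that upcasting (column names are made up):
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['1998-09-26']), 'v0': [2]})

row = df.iloc[0]         # a full row pulled out as a Series
print(row.dtype)         # object - the values are upcast to a common dtype
print(df['date'].dtype)  # datetime64[ns] - the column itself is unchanged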
See Pandas DataFrame iloc spoils the data type

pandas timestamp series to string?

I am new to Python (coming from R), and I am trying to understand how I can convert a timestamp series in a pandas dataframe (in my case this is called df['timestamp']) into what I would call a string vector in R. Is this possible? How would this be done?
I tried df['timestamp'].apply('str'), but this seems to simply put the entire column df['timestamp'] into one long string. I'm looking to convert each element into a string and preserve the structure, so that it's still a vector (or maybe this is called an array?).
Consider the dataframe df
df = pd.DataFrame(dict(timestamp=pd.to_datetime(['2000-01-01'])))
df
timestamp
0 2000-01-01
Use the datetime accessor dt to access the strftime method. You can pass a format string to strftime and it will return a formatted string. When used with the dt accessor you will get a series of strings.
df.timestamp.dt.strftime('%Y-%m-%d')
0 2000-01-01
Name: timestamp, dtype: object
Visit strftime.org for a handy set of format strings.
Use astype
>>> import pandas as pd
>>> df = pd.to_datetime(pd.Series(['Jul 31, 2009', '2010-01-10', None]))
>>> df.astype(str)
0 2009-07-31
1 2010-01-10
2 NaT
dtype: object
This returns a Series of strings. Note that the missing value comes back as the literal string 'NaT', not a true missing value.
Following on from VinceP's answer, to convert a datetime Series in-place do the following:
df['Column_name']=df['Column_name'].astype(str)

Convert all elements in float Series to integer

I have a column with float values in a dataframe (so I am calling this column a float Series). I want to convert all the values to integer or just round it up so that there are no decimals.
Let us say the dataframe is df and the column is a. I tried this:
df['a'] = round(df['a'])
I got an error saying this method can't be applied to a Series, only applicable to individual values.
Next I tried this :
for obj in df['a']:
    obj = int(round(obj))
After this I printed df but there was no change.
Where am I going wrong?
The built-in round won't work here as it's being called on a pandas Series, which is array-like rather than a scalar value. There is the built-in method pd.Series.round to operate on the whole Series array, after which you can change the dtype using astype:
In [43]:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(5)})
df['a'] = df['a'] * 100
df
Out[43]:
a
0 -4.489462
1 -133.556951
2 -136.397189
3 -106.993288
4 -89.820355
In [45]:
df['a'] = df['a'].round(0).astype(int)
df
Out[45]:
a
0 -4
1 -134
2 -136
3 -107
4 -90
Also it's unnecessary to iterate over the rows when there are vectorised methods available
Also this:
for obj in df['a']:
    obj = int(round(obj))
does not mutate the individual cells in the Series; it operates on a copy of each value, which is why the df is not mutated.
The code in your loop:
obj = int(round(obj))
Only changes which object the name obj refers to. It does not modify the data stored in the series. If you want to do this you need to know where in the series the data is stored and update it there.
E.g.
for i, num in enumerate(df['a']):
    # write back via .loc to avoid chained-assignment problems
    df.loc[df.index[i], 'a'] = int(round(num))
When converting a float to an integer, I found out using df.dtypes that the column I was trying to round off was an object, not a float. The round command won't work on objects, so to do the conversion I did:
df['a'] = pd.to_numeric(df['a'])
df['a'] = df['a'].round(0).astype(int)
or as one line:
df['a'] = pd.to_numeric(df['a']).round(0).astype(int)
If you specifically want to round up as your question states, you can use np.ceil:
import numpy as np
df['a'] = np.ceil(df['a'])
See also Floor or ceiling of a pandas series in python?
Not sure there's much advantage to type converting to int; pandas and numpy love floats.
