How to convert string to datetime with nulls - python, pandas?

I have a series with some datetimes (as strings) and some nulls as 'nan':
import pandas as pd, numpy as np, datetime as dt
df = pd.DataFrame({'Date':['2014-10-20 10:44:31', '2014-10-23 09:33:46', 'nan', '2014-10-01 09:38:45']})
I'm trying to convert these to datetime:
df['Date'] = df['Date'].apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
but I get the error:
time data 'nan' does not match format '%Y-%m-%d %H:%M:%S'
So I try to turn these into actual nulls:
df.loc[df['Date'] == 'nan', 'Date'] = np.nan
and repeat:
df['Date'] = df['Date'].apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
but then I get the error:
must be string, not float
What is the quickest way to solve this problem?

Just use to_datetime and set errors='coerce'; any value that cannot be parsed becomes NaT:
In [321]:
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df
Out[321]:
Date
0 2014-10-20 10:44:31
1 2014-10-23 09:33:46
2 NaT
3 2014-10-01 09:38:45
In [322]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 1 columns):
Date 3 non-null datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 64.0 bytes
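As a self-contained variant of the session above, here is a minimal sketch that also passes an explicit format (the one from the question) and counts how many values were coerced. The names converted and n_missing are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2014-10-20 10:44:31', '2014-10-23 09:33:46',
                            'nan', '2014-10-01 09:38:45']})

# errors='coerce' turns anything that cannot be parsed into NaT
converted = pd.to_datetime(df['Date'], format='%Y-%m-%d %H:%M:%S', errors='coerce')

# Counting the NaT values shows how many rows failed to parse
n_missing = int(converted.isna().sum())
print(converted)
print(n_missing)
```

Checking the NaT count after coercing is a cheap way to notice when more rows were dropped than expected.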
The problem with calling strptime directly is that it raises an error if the string does not match the format, or if the value is not a string at all (for example a float NaN).
If you wrapped it in a try/except then it would work:
In [324]:
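def func(x):
    try:
        return dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
    except (ValueError, TypeError):
        # ValueError: string does not match the format
        # TypeError: value is not a string (e.g. float NaN)
        return pd.NaT
df['Date'].apply(func)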
Out[324]:
0 2014-10-20 10:44:31
1 2014-10-23 09:33:46
2 NaT
3 2014-10-01 09:38:45
Name: Date, dtype: datetime64[ns]
But it will be faster to use the built-in to_datetime than to call apply, which essentially just loops over your series in Python.
Timings:
In [326]:
%timeit pd.to_datetime(df['Date'], errors='coerce')
%timeit df['Date'].apply(func)
10000 loops, best of 3: 65.8 µs per loop
10000 loops, best of 3: 186 µs per loop
We see here that to_datetime is roughly 3x faster, even on this four-row frame.
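To repeat the comparison outside IPython and on more rows, something like the following sketch could be used. The 10,000-row size and the names big, parse_one, t_vec and t_loop are assumptions for illustration; the timings will differ by machine and pandas version, so none are quoted here.

```python
import datetime as dt
import timeit

import pandas as pd

FMT = '%Y-%m-%d %H:%M:%S'

def parse_one(x):
    """Per-element parser, equivalent to func in the answer above."""
    try:
        return dt.datetime.strptime(x, FMT)
    except (ValueError, TypeError):
        return pd.NaT

base = ['2014-10-20 10:44:31', '2014-10-23 09:33:46', 'nan', '2014-10-01 09:38:45']
big = pd.Series(base * 2500)  # 10,000 rows, an arbitrary illustrative size

# Both approaches should agree on the parsed values
vectorised = pd.to_datetime(big, format=FMT, errors='coerce')
looped = pd.to_datetime(big.apply(parse_one))

# Time five runs of each approach
t_vec = timeit.timeit(lambda: pd.to_datetime(big, format=FMT, errors='coerce'), number=5)
t_loop = timeit.timeit(lambda: big.apply(parse_one), number=5)
print(f'to_datetime: {t_vec:.4f}s  apply: {t_loop:.4f}s')
```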

I find letting pandas do the work too slow on large DataFrames. In another post I learned a technique that speeds this up dramatically when the number of unique values is much smaller than the number of rows. (My data is usually stock price or trade blotter data.) It first builds a dict that maps each distinct text value to its datetime object, then applies the dict to convert the whole column.
def str2time(val):
    try:
        return dt.datetime.strptime(val, '%H:%M:%S.%f')
    except (ValueError, TypeError):
        return pd.NaT
def TextTime2Time(s):
    # Parse each distinct string once, then look the results up per row
    times = {t: str2time(t) for t in s.unique()}
    return s.apply(lambda v: times[v])
df.date = TextTime2Time(df.date)
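The same idea can be combined with to_datetime, so the unique values are parsed in one vectorised call and then mapped back. This is only a sketch: to_datetime_via_uniques is a hypothetical helper name, and the format string is taken from the original question rather than from this answer. Note also that pd.to_datetime has a cache parameter that reuses conversions of duplicate strings, so it may be worth timing the plain call first.

```python
import pandas as pd

def to_datetime_via_uniques(s, fmt):
    """Parse each distinct string once, then map the results onto every row."""
    uniques = pd.Series(s.unique())
    parsed = pd.to_datetime(uniques, format=fmt, errors='coerce')
    lookup = dict(zip(uniques, parsed))  # text value -> Timestamp (or NaT)
    return s.map(lookup)

s = pd.Series(['2014-10-20 10:44:31', 'nan',
               '2014-10-20 10:44:31', '2014-10-01 09:38:45'])
result = to_datetime_via_uniques(s, '%Y-%m-%d %H:%M:%S')
print(result)
```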

Related

Convert Unix time in a DataFrame to Seconds [duplicate]

I have a dataframe with unix times and prices in it. I want to convert the index column so that it shows in human readable dates.
So for instance I have date as 1349633705 in the index column but I'd want it to show as 10/07/2012 (or at least 10/07/2012 18:15).
For some context, here is the code I'm working with and what I've tried already:
import json
import urllib2
from datetime import datetime
response = urllib2.urlopen('http://blockchain.info/charts/market-price?&format=json')
data = json.load(response)
df = DataFrame(data['values'])
df.columns = ["date","price"]
#convert dates
df.date = df.date.apply(lambda d: datetime.strptime(d, "%Y-%m-%d"))
df.index = df.date
As you can see I'm using
df.date = df.date.apply(lambda d: datetime.strptime(d, "%Y-%m-%d")) here which doesn't work since I'm working with integers, not strings. I think I need to use datetime.date.fromtimestamp but I'm not quite sure how to apply this to the whole of df.date.
Thanks.
These appear to be seconds since epoch.
In [20]: df = DataFrame(data['values'])
In [21]: df.columns = ["date","price"]
In [22]: df
Out[22]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 358 entries, 0 to 357
Data columns (total 2 columns):
date 358 non-null values
price 358 non-null values
dtypes: float64(1), int64(1)
In [23]: df.head()
Out[23]:
date price
0 1349720105 12.08
1 1349806505 12.35
2 1349892905 12.15
3 1349979305 12.19
4 1350065705 12.15
In [25]: df['date'] = pd.to_datetime(df['date'],unit='s')
In [26]: df.head()
Out[26]:
date price
0 2012-10-08 18:15:05 12.08
1 2012-10-09 18:15:05 12.35
2 2012-10-10 18:15:05 12.15
3 2012-10-11 18:15:05 12.19
4 2012-10-12 18:15:05 12.15
In [27]: df.dtypes
Out[27]:
date datetime64[ns]
price float64
dtype: object
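The question also asked for the dates to sit in the index and to display as MM/DD/YYYY. A minimal sketch, using three of the rows shown above in place of the downloaded data (labels is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({'date': [1349720105, 1349806505, 1349892905],
                   'price': [12.08, 12.35, 12.15]})

# Convert epoch seconds to datetimes, then move them into the index
df['date'] = pd.to_datetime(df['date'], unit='s')
df = df.set_index('date')

# strftime on a DatetimeIndex returns strings, useful for display only
labels = df.index.strftime('%m/%d/%Y %H:%M')
print(labels.tolist())
```

Keeping the index as datetimes and formatting only for display preserves sorting, slicing and resampling by date.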
If you try:
df[DATE_FIELD] = pd.to_datetime(df[DATE_FIELD], unit='s')
and receive the error:
"pandas.tslib.OutOfBoundsDatetime: cannot convert input with unit 's'"
it means DATE_FIELD is not expressed in seconds.
In my case the values were milliseconds since the epoch.
The conversion worked with:
df[DATE_FIELD] = pd.to_datetime(df[DATE_FIELD], unit='ms')
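When it is unclear which unit a column uses, the magnitude of the values is usually a good hint. The sketch below is a rough heuristic, not part of pandas: guess_epoch_unit and its thresholds are assumptions, and it will misjudge timestamps very close to 1970.

```python
import pandas as pd

def guess_epoch_unit(values):
    """Guess the epoch unit from the largest absolute value (rough heuristic)."""
    biggest = int(values.abs().max())
    if biggest < 10**11:   # seconds stay below this until roughly the year 5138
        return 's'
    if biggest < 10**14:
        return 'ms'
    if biggest < 10**17:
        return 'us'
    return 'ns'

seconds = pd.Series([1349720105, 1349806505])
millis = seconds * 1000

unit_s = guess_epoch_unit(seconds)
unit_ms = guess_epoch_unit(millis)

# Both columns describe the same instants once the right unit is used
from_s = pd.to_datetime(seconds, unit=unit_s)
from_ms = pd.to_datetime(millis, unit=unit_ms)
print(unit_s, unit_ms, from_ms[0])
```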
Assuming we imported pandas as pd and df is our dataframe
pd.to_datetime(df['date'], unit='s')
works for me.
The pandas documentation gives this and other examples, which were not included in any of the previous answers. Link:
https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html
Code
pd.to_datetime(1490195805, unit='s')
Timestamp('2017-03-22 15:16:45')
pd.to_datetime(1490195805433502912, unit='ns')
Timestamp('2017-03-22 15:16:45.433502912')
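Alternatively, change one line of the code in the question. Because the question uses from datetime import datetime, call datetime.fromtimestamp directly. Note that this produces formatted strings in the machine's local time zone rather than datetime values:
# df.date = df.date.apply(lambda d: datetime.strptime(d, "%Y-%m-%d"))
df.date = df.date.apply(lambda d: datetime.fromtimestamp(int(d)).strftime('%Y-%m-%d'))
It should also work.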
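Since fromtimestamp without a tz argument converts to local time, the result depends on where the code runs. A small sketch that pins the conversion to UTC, using the example value from the question (as_utc and label are illustrative names):

```python
from datetime import datetime, timezone

ts = 1349633705  # example value from the question

# Passing tz makes the result independent of the machine's local time zone
as_utc = datetime.fromtimestamp(ts, tz=timezone.utc)
label = as_utc.strftime('%m/%d/%Y %H:%M')
print(label)
```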

python dataframe converting multiple datetime formats

I have a pandas.dataframe like this ('col' column has two formats):
col val
'12/1/2013' value1
'1/22/2014 12:00:01 AM' value2
'12/10/2013' value3
'12/31/2013' value4
I want to convert them into datetime, and I am considering using:
test_df['col']= test_df['col'].map(lambda x: datetime.strptime(x, '%m/%d/%Y'))
test_df['col']= test_df['col'].map(lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M %p'))
Obviously neither of them works for the whole df. I thought about using try and except but had no luck. Any suggestions?
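Just use to_datetime; it can handle both of those formats. (In pandas 2.0 and later, a column with mixed formats may need format='mixed', because the format is otherwise inferred from the first value.)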
In [4]:
df['col'] = pd.to_datetime(df['col'])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 2 columns):
col 4 non-null datetime64[ns]
val 4 non-null object
dtypes: datetime64[ns](1), object(1)
memory usage: 96.0+ bytes
The df now looks like this:
In [5]:
df
Out[5]:
col val
0 2013-12-01 00:00:00 value1
1 2014-01-22 00:00:01 value2
2 2013-12-10 00:00:00 value3
3 2013-12-31 00:00:00 value4
I had two different date formats in the same column, Temps, similar to the OP. They look like this:
01.03.2017 00:00:00.000
01/03/2017 00:13
The timings for the two approaches were as follows:
v['Timestamp1'] = pd.to_datetime(v.Temps)
Took 25.5408718585968 seconds
v['Timestamp'] = pd.to_datetime(v.Temps, format='%d/%m/%Y %H:%M', errors='coerce')
mask = v.Timestamp.isnull()
v.loc[mask, 'Timestamp'] = pd.to_datetime(v.loc[mask, 'Temps'], format='%d.%m.%Y %H:%M:%S.%f', errors='coerce')
Took 0.2923243045806885 seconds
In other words, if you have a small number of known formats for your datetimes, don't use to_datetime without a format!
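The two-pass pattern above generalises to any short list of known formats. Below is a minimal sketch applied to the data from the question; parse_with_formats is a hypothetical helper, and the second format string ('%m/%d/%Y %I:%M:%S %p') is my assumption about what matches '1/22/2014 12:00:01 AM', not something stated in the question.

```python
import pandas as pd

def parse_with_formats(series, formats):
    """Try each format in turn; rows that no format matches end up as NaT."""
    parsed = pd.to_datetime(series, format=formats[0], errors='coerce')
    for fmt in formats[1:]:
        # Only fill the rows that earlier formats failed to parse
        parsed = parsed.fillna(pd.to_datetime(series, format=fmt, errors='coerce'))
    return parsed

col = pd.Series(['12/1/2013', '1/22/2014 12:00:01 AM', '12/10/2013', '12/31/2013'])
result = parse_with_formats(col, ['%m/%d/%Y', '%m/%d/%Y %I:%M:%S %p'])
print(result)
```

Putting the most common format first keeps the fallback parses cheap, since each later pass only contributes values for rows that are still NaT.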
You can create a new column:
test_df['col1'] = pd.to_datetime(test_df['col'])
and then drop col and rename col1.
It works for me.
I had two formats in my column 'fecha_hechos'. The formats were:
2015/03/02
10/02/2010
What I did was:
carpetas_cdmx['Timestamp'] = pd.to_datetime(carpetas_cdmx.fecha_hechos, format='%Y/%m/%d %H:%M:%S', errors='coerce')
mask = carpetas_cdmx.Timestamp.isnull()
carpetas_cdmx.loc[mask, 'Timestamp'] = pd.to_datetime(carpetas_cdmx.loc[mask, 'fecha_hechos'], format='%d/%m/%Y %H:%M', errors='coerce')
where carpetas_cdmx is my DataFrame and fecha_hechos is the column with the mixed formats.
