Unknown string format on pd.to_datetime - python

I have a dataset with a date column like this:
cod date value
0 1O8 2015-01-01 00:00:00 2.1
1 1O8 2015-01-01 01:00:00 2.3
2 1O8 2015-01-01 02:00:00 3.5
3 1O8 2015-01-01 03:00:00 4.5
4 1O8 2015-01-01 04:00:00 4.4
5 1O8 2015-01-01 05:00:00 3.2
6 1O9 2015-01-01 00:00:00 1.4
7 1O9 2015-01-01 01:00:00 8.6
8 1O9 2015-01-01 02:00:00 3.3
10 1O9 2015-01-01 03:00:00 1.5
11 1O9 2015-01-01 04:00:00 2.4
12 1O9 2015-01-01 05:00:00 7.2
The dtype of the date column is object, and I need to convert it to datetime in order to apply some functions afterwards. I tried different solutions like:
pd.to_datetime(df['date'], errors='raise', format ='%Y-%m-%d HH:mm:ss')
pd.to_datetime(df['date'], errors='coerce', format ='%Y-%m-%d HH:mm:ss')
df['date'].apply(pd.to_datetime, format ='%Y-%m-%d HH:mm:ss')
But I always get the same errors:
TypeError: Unrecognized value type: <class 'str'>
ValueError: Unknown string format
The strange thing is that if I apply the function to a sample of the dataset it works correctly, but if I apply it to the whole dataset I get the error. There are no missing values in the data and the dtype is the same for every value.
How can I fix this error?

There are three issues:
pd.to_datetime and pd.Series.apply don't work in place, so your solutions won't modify your series. Assign back after conversion.
Your third solution needs errors='coerce' to guarantee no errors.
For the time component you need to use strftime directives beginning with %, i.e. %H:%M:%S rather than HH:mm:ss.
So you can use:
df = pd.DataFrame({'date': ['2015-01-01 00:00:00', '2016-12-20 15:00:20',
'2017-08-05 00:05:00', '2018-05-11 00:10:00']})
df['date'] = pd.to_datetime(df['date'], errors='coerce', format='%Y-%m-%d %H:%M:%S')
print(df)
date
0 2015-01-01 00:00:00
1 2016-12-20 15:00:20
2 2017-08-05 00:05:00
3 2018-05-11 00:10:00
In this particular instance, the format is standard and can be omitted:
df['date'] = pd.to_datetime(df['date'], errors='coerce')
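Since the error only appears on the full dataset, a few rows most likely contain strings that cannot be parsed. A minimal sketch (assuming df is the DataFrame from the question) to locate them after an errors='coerce' conversion:
import pandas as pd

# convert with coerce so unparseable values become NaT instead of raising
parsed = pd.to_datetime(df['date'], errors='coerce', format='%Y-%m-%d %H:%M:%S')

# show the original values that failed to parse
print(df.loc[parsed.isna() & df['date'].notna(), 'date'])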

I assume you are reading this data, for example, from a CSV file:
df=pd.read_csv('c:/1/comptagevelo2012.csv', index_col=0, parse_dates=True)
To check:
print(df.index)
Using parse_dates in read_csv works better here than calling pd.to_datetime afterwards; I checked it:
> DatetimeIndex(['2012-01-01', '2012-02-01', '2012-03-01', '2012-04-01',
> '2012-05-01', '2012-06-01', '2012-07-01', '2012-08-01',
> '2012-09-01', '2012-10-01',
> ...
> '2012-12-22', '2012-12-23', '2012-12-24', '2012-12-25',
> '2012-12-26', '2012-12-27', '2012-12-28', '2012-12-29',
> '2012-12-30', '2012-12-31'],
> dtype='datetime64[ns]', length=366, freq=None)
The other method does not appear to work for this file, but only because the result of pd.to_datetime is never assigned back, so the index remains as strings:
df=pd.read_csv('c:/1/comptagevelo2012.csv',index_col=0)
pd.to_datetime(df['Date'], errors='coerce', format ='%d/%m/%Y')
print(df.index)
Index(['01/01/2012', '02/01/2012', '03/01/2012', '04/01/2012', '05/01/2012',
'06/01/2012', '07/01/2012', '08/01/2012', '09/01/2012', '10/01/2012',
...
'22/12/2012', '23/12/2012', '24/12/2012', '25/12/2012', '26/12/2012',
'27/12/2012', '28/12/2012', '29/12/2012', '30/12/2012', '31/12/2012'],
dtype='object', length=366)
source: https://keyrus-gitlab.ml/gfeuillen/keyrus-training/blob/5f0076e3c61ad64336efc9bc3fd862bfed53125c/docker/data/python/Exercises/02%20pandas/comptagevelo2012.csv
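For completeness, a minimal sketch (assuming the same file, with the Date column as the first column in DD/MM/YYYY format) that makes the pd.to_datetime approach work by assigning the result back:
import pandas as pd

df = pd.read_csv('c:/1/comptagevelo2012.csv', index_col=0)
# assign the converted index back, otherwise the original strings are kept
df.index = pd.to_datetime(df.index, format='%d/%m/%Y', errors='coerce')
print(df.index)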

Related

I need help in editing the date-time format using pandas in python

I had data where the time was in Unix (epoch seconds) format. I used the following code to convert the time column in my dataframe from Unix time to a date format.
import pandas as pd
df = pd.read_csv(r'C:\Users\My Computer\Desktop\Data Analysis\BATS_SPY, 1D.csv')
df['time'] = pd.to_datetime(df['time'],unit='s')
print(df.head())
I get the result as
time
0 1993-01-29 14:30:00
1 1993-02-01 14:30:00
2 1993-02-02 14:30:00
3 1993-02-03 14:30:00
4 1993-02-04 14:30:00
What should I do if I only want the dates (that is, I want to exclude 14:30:00 from the time)?
My data was as follows
time
0 728317800
1 728577000
2 728663400
3 728749800
4 728836200
Take your datetime series and floor it to the day:
df['date'] = df['time'].dt.floor('D')
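If you would rather have plain date objects than midnight timestamps, .dt.date is an alternative; note that it produces a column of object dtype instead of datetime64:
# alternative: Python date objects (object dtype) rather than datetime64 timestamps
df['date'] = df['time'].dt.date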

Python dataframe converting time date 'SylmiSeb' (2018-12-31 23:43:02+00:00) to datetime

I'm trying to convert a column of the style 2018-12-31 23:43:02+00:00 to 2018-12-31 by using pd.to_datetime. I got this database by using the snscrape library (https://github.com/JustAnotherArchivist/snscrape).
However when I try this:
database_2018['date_created'] = pd.to_datetime(
    database_2018['date_created'], infer_datetime_format=True)
I get the following error: ParserError: Unknown string format: SylmiSeb
When I check the dtype of this date column it shows as object. Any ideas on how to solve this?
I also tried:
database_2018['date_created'] = pd.Timestamp(
    database_2018['date_created']).to_datetime()
But I get the following error:
TypeError: Cannot convert input [0 2018-12-31 23:43:02+00:00
1 2018-12-31 23:30:20+00:00
2 2018-12-31 23:30:00+00:00
3 2018-12-31 23:28:09+00:00
4 2018-12-31 23:28:08+00:00
...
105037 2018-01-01 00:29:18+00:00
105038 2018-01-01 00:25:04+00:00
105039 2018-01-01 00:10:03+00:00
105040 2018-01-01 00:03:28+00:00
105041 2018-01-01 00:00:44+00:00
Name: date_created, Length: 105042, dtype: object] of type <class 'pandas.core.series.Series'> to Timestamp
Thanks for the help !
Try:
database_2018['date_created'] = database_2018['date_created'].apply(
    lambda x: x[:x.rfind(':')] + x[x.rfind(':')+1:]
)
database_2018['date_created'] = pd.to_datetime(
    database_2018['date_created'], format='%Y-%m-%d %H:%M:%S%z')
This is the format of your dates, where %z represents the UTC offset. See the datetime documentation for more information. In Python versions before 3.7 the UTC offset must not contain a colon character, so the first part of the code above removes that colon.
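Note also that the original error complains about the value 'SylmiSeb' in the column, not about the offset format. A sketch that lets pandas parse the offset directly and turns stray non-date values into NaT (assuming a recent pandas version):
# utc=True handles the +00:00 offset; errors='coerce' turns values
# such as 'SylmiSeb' into NaT instead of raising
database_2018['date_created'] = pd.to_datetime(
    database_2018['date_created'], utc=True, errors='coerce')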
IIUC, you are trying to fetch only the date from a timezone-aware datetime column.
Setup
d="""date_created
2018-12-31 23:30:20+00:00
2018-12-31 23:30:00+00:00
2018-12-31 23:28:09+00:00
2018-12-31 23:28:08+00:00"""
df=pd.read_csv(StringIO(d))
df
date_created
0 2018-12-31 23:30:20+00:00
1 2018-12-31 23:30:00+00:00
2 2018-12-31 23:28:09+00:00
3 2018-12-31 23:28:08+00:00
Code
Option 1
df['date_created'] = pd.to_datetime(df.date_created,errors='coerce').dt.date
df
Output
date_created
0 2018-12-31
1 2018-12-31
2 2018-12-31
3 2018-12-31
Option 2, if we want to remove the timezone
If you just want to drop the timezone information while keeping the full timestamp:
df['date_created'] = pd.to_datetime(df.date_created,errors='coerce').dt.tz_localize(None)
df
Output
date_created
0 2018-12-31 23:30:20
1 2018-12-31 23:30:00
2 2018-12-31 23:28:09
3 2018-12-31 23:28:08
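If you prefer to keep the datetime64 dtype (and the timezone) instead of the object dtype that .dt.date produces, .dt.normalize() is another option; it keeps the timestamps but sets the time to midnight:
# keeps datetime64 dtype, with the time component set to 00:00:00
df['date_created'] = pd.to_datetime(df.date_created, errors='coerce').dt.normalize()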

How to match time series in python?

I have two high-frequency time series with 3 months' worth of data.
The problem is that one goes from 15:30 to 23:00, the other from 01:00 to 00:00.
Is there any way to match the two time series, discarding the extra data, in order to run some regression analysis?
You can use the combine_first method of pandas Series. It keeps the values of the calling series and fills in values from the other series at index labels where the calling series has none.
The following code shows a minimal example:
import pandas as pd

idx1 = pd.date_range('2018-01-01', periods=5, freq='H')
idx2 = pd.date_range('2018-01-01 01:00', periods=5, freq='H')
ts1 = pd.Series(range(len(idx1)), index=idx1)
ts2 = pd.Series(range(len(idx2)), index=idx2)
print(ts1.combine_first(ts2))
This gives a Series with the content:
2018-01-01 00:00:00 0.0
2018-01-01 01:00:00 1.0
2018-01-01 02:00:00 2.0
2018-01-01 03:00:00 3.0
2018-01-01 04:00:00 4.0
2018-01-01 05:00:00 4.0
For more complex combinations you can use combine.
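Since the question asks to discard the non-overlapping data rather than merge it, aligning both series on their common timestamps may be closer to what you need; a minimal sketch using align with an inner join (same ts1 and ts2 as above):
# keep only the timestamps present in both series, e.g. for a regression
ts1_matched, ts2_matched = ts1.align(ts2, join='inner')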

python pandas parse date without delimiters 'time data '060116' does not match format '%dd%mm%YY' (match)'

I am trying to parse a date column that looks like the one below,
date
061116
061216
061316
061416
However I cannot get pandas to accept the date format as there is no delimiter (eg '/'). I have tried this below but receive the error:
ValueError: time data '060116' does not match format '%dd%mm%YY' (match)
pd.to_datetime(df['Date'], format='%dd%mm%YY')
You need to add the parameter errors='coerce' to to_datetime, because months 13 and 14 do not exist, so those dates are converted to NaT:
print (pd.to_datetime(df['Date'], format='%d%m%y', errors='coerce'))
0 2016-11-06
1 2016-12-06
2 NaT
3 NaT
Name: Date, dtype: datetime64[ns]
Or maybe you need swap months with days:
print (pd.to_datetime(df['Date'], format='%m%d%y'))
0 2016-06-11
1 2016-06-12
2 2016-06-13
3 2016-06-14
Name: Date, dtype: datetime64[ns]
EDIT:
print (df)
Date
0 0611160130
1 0612160130
2 0613160130
3 0614160130
print (pd.to_datetime(df['Date'], format='%m%d%y%H%M', errors='coerce'))
0 2016-06-11 01:30:00
1 2016-06-12 01:30:00
2 2016-06-13 01:30:00
3 2016-06-14 01:30:00
Name: Date, dtype: datetime64[ns]
See Python's strftime directives for the full list of format codes.
Your date format is wrong. You have days and months reversed, and the two-digit year needs %y rather than %Y. It should be:
%m%d%y

How to resample data in a single dataframe within 3 distinct groups

I've got a dataframe and want to resample certain columns (as hourly sums and means from 10-minutely data) WITHIN the 3 different 'users' that exist in the dataset.
A normal resample would use code like:
import pandas as pd
import numpy as np
df = pd.read_csv('example.csv')
df['Datetime'] = pd.to_datetime(df['date_datetime/_source'] + ' ' + df['time']) #create datetime stamp
df.set_index(df['Datetime'], inplace = True)
df = df.resample('1H', how={'energy_kwh': np.sum, 'average_w': np.mean, 'norm_average_kw/kw': np.mean, 'temperature_degc': np.mean, 'voltage_v': np.mean})
df
To get a result like (please forgive the column formatting, I have no idea how to paste this properly to make it look nice):
energy_kwh norm_average_kw/kw voltage_v temperature_degc average_w
Datetime
2013-04-30 06:00:00 0.027 0.007333 266.333333 4.366667 30.000000
2013-04-30 07:00:00 1.250 0.052333 298.666667 5.300000 192.500000
2013-04-30 08:00:00 5.287 0.121417 302.333333 7.516667 444.000000
2013-04-30 09:00:00 12.449 0.201000 297.500000 9.683333 726.000000
2013-04-30 10:00:00 26.101 0.396417 288.166667 11.150000 1450.000000
2013-04-30 11:00:00 45.396 0.460250 282.333333 12.183333 1672.500000
2013-04-30 12:00:00 64.731 0.440833 276.166667 13.550000 1541.000000
2013-04-30 13:00:00 87.095 0.562750 284.833333 13.733333 2084.500000
However, in the original CSV, there is a column containing URLs - in the dataset of 100,000 rows, there are 3 different URLs (effectively IDs). I want each one resampled individually rather than having a 'lump' resample from all (e.g. 9.00 AM on 2014-01-01 would have data for all 3 users, but each should have its own hourly sums and means).
I hope this makes sense - please let me know if I need to clarify anything.
FYI, I tried using the advice in the following 2 posts but to no avail:
Resampling a multi-index DataFrame
Resampling Within a Pandas MultiIndex
Thanks in advance
You can resample a groupby object, grouped by the URLs, as in this minimal example:
In [157]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Val': np.random.random(100)})  # create random dataset
df['Datetime'] = pd.date_range('2001-01-01', periods=100, freq='5H')
df = df.set_index('Datetime')
df['Location'] = np.tile(['l0', 'l1', 'l2', 'l3', 'l4'], 20)
In [158]:
print(df.groupby('Location').resample('10D', how={'Val': np.mean}))
Val
Location Datetime
l0 2001-01-01 00:00:00 0.334183
2001-01-11 00:00:00 0.584260
l1 2001-01-01 05:00:00 0.288290
2001-01-11 05:00:00 0.470140
l2 2001-01-01 10:00:00 0.381273
2001-01-11 10:00:00 0.461684
l3 2001-01-01 15:00:00 0.703523
2001-01-11 15:00:00 0.386858
l4 2001-01-01 20:00:00 0.448857
2001-01-11 20:00:00 0.310914
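In recent pandas versions the how= argument of resample has been removed, so here is a sketch of the equivalent using .agg() (same df as above; for your real data, substitute your URL column name and the hourly aggregations from your question):
# modern equivalent: group, resample, then aggregate each column explicitly
print(df.groupby('Location').resample('10D').agg({'Val': 'mean'}))

# applied to the question's data (column names assumed from the question):
# df.groupby('url').resample('1H').agg({'energy_kwh': 'sum', 'average_w': 'mean'})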
