Python - Pandas, issue with datetime format - python

I have a two-column data frame with departure and arrival times (see example below).
In order to perform operations on those times, I want to convert the strings into datetime format, keeping only the hour/minute/second information.
Example of input data - file name = table
departure_time,arrival_time
07:00:00,07:30:00
07:00:00,07:15:00
07:05:00,07:22:00
07:10:00,07:45:00
07:15:00,07:50:00
07:10:00,07:26:00
07:40:00,08:10:00
I ran this code to import the table file and then convert the column to datetime format:
import pandas as pd
from datetime import datetime
df= pd.read_excel("table.xlsx")
df['arrival_time']= pd.to_datetime(df['arrival_time'], format= '%H:%M:%S')
but get this error:
ValueError: time data ' 07:30:00' does not match format '%H:%M:%S' (match)
What mistake am I making?

This looks like an import issue: the value ' 07:30:00' in the error message has a space in front. If you are importing a CSV, you can pass skipinitialspace=True to read_csv.
If I import your CSV file, and use your code, it works fine:
CSV:
departure_time,arrival_time
07:00:00,07:30:00
07:00:00,07:15:00
07:05:00,07:22:00
07:10:00,07:45:00
07:15:00,07:50:00
07:10:00,07:26:00
07:40:00,08:10:00
df = pd.read_csv('test.csv', skipinitialspace=True)
df['arrival_time']= pd.to_datetime(df['arrival_time'], format='%H:%M:%S').dt.time
print(df)
departure_time arrival_time
0 07:00:00 07:30:00
1 07:00:00 07:15:00
2 07:05:00 07:22:00
3 07:10:00 07:45:00
4 07:15:00 07:50:00
5 07:10:00 07:26:00
6 07:40:00 08:10:00
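Since the question reads the data with read_excel, where skipinitialspace is not available, another option is to strip the whitespace from the strings before converting. A minimal sketch, assuming the columns hold strings with a stray leading space as in the error message (the small frame below is a stand-in for the Excel file). It keeps the full datetime values instead of calling .dt.time, because datetime.time objects cannot be subtracted from each other:

```python
import pandas as pd

# Stand-in for the Excel data, with the stray leading space from the error message.
df = pd.DataFrame({
    "departure_time": [" 07:00:00", " 07:05:00"],
    "arrival_time": [" 07:30:00", " 07:22:00"],
})

for col in ["departure_time", "arrival_time"]:
    # Strip surrounding whitespace, then parse with the explicit format.
    df[col] = pd.to_datetime(df[col].str.strip(), format="%H:%M:%S")

# Subtracting two datetime columns gives a timedelta column.
df["duration"] = df["arrival_time"] - df["departure_time"]
print(df["duration"])
```

The parsed values carry a placeholder date of 1900-01-01, which cancels out in the subtraction.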

Related

Convert a column to a specific time format which contains different types of time formats in python

This is my data frame
df = pd.DataFrame({
'Time': ['10:00PM', '15:45:00', '13:40:00AM','5:00']
})
Time
0 10:00PM
1 15:45:00
2 13:40:00AM
3 5:00
I need to convert the times to one specific format. My expected output is given below.
Time
0 22:00:00
1 15:45:00
2 01:40:00
3 05:00:00
I tried using the split and endswith functions of str, but that makes for a complicated solution. Is there a better way to achieve this?
Thanks in advance!
Here you go. One thing to mention, though: 13:40:00AM will result in an error, because a) 13 is not a valid hour in 12-hour AM/PM notation, which only runs from 1 to 12, and b) 13:40 is in the afternoon, so it cannot be AM at the same time :)
Cheers
import pandas as pd
df = pd.DataFrame({'Time': ['10:00PM', '15:45:00', '01:40:00AM', '5:00']})
df['Time'] = pd.to_datetime(df['Time'])
print(df['Time'].dt.time)
<<< 22:00:00
<<< 15:45:00
<<< 01:40:00
<<< 05:00:00
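If relying on automatic format detection is a concern (how pandas handles mixed formats in one column has changed between versions), one alternative is to try a fixed list of formats per value. This is only a sketch: the FORMATS list and the parse_time helper are hypothetical names, and the list assumes the column contains just the four shapes shown in the question. Values that match none of the formats, such as 13:40:00AM, come back as None instead of raising:

```python
from datetime import datetime

import pandas as pd

# Candidate formats, tried in order (an assumption about what the column contains).
FORMATS = ["%I:%M:%S%p", "%I:%M%p", "%H:%M:%S", "%H:%M"]


def parse_time(text):
    """Return a datetime.time for the first matching format, else None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).time()
        except ValueError:
            continue
    return None


df = pd.DataFrame({"Time": ["10:00PM", "15:45:00", "01:40:00AM", "5:00"]})
df["Time"] = df["Time"].map(parse_time)
print(df)
```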

How to read a csv and aggregate data by groups?

We have a CSV file and wrote the code below to group the rows, take the max value per group, and create an output file. But when we read the final output file with pandas read_csv, the DataFrame shows as empty.
Input file:
Manoj,2020-01-01 01:00:00
Manoj,2020-02-01 01:00:00
Manoj,2020-03-01 01:00:00
Rajesh,2020-01-01 01:00:00
Rajesh,2020-05-01 01:00:00
Suresh,2020-04-01 01:00:00
Final output file:
Manoj,2020-03-01 01:00:00
Rajesh,2020-05-01 01:00:00
Suresh,2020-04-01 01:00:00
When I then try to read the final output file above using pd.read_csv, it shows an empty DataFrame.
import os
import re
import pandas as pd
z=open('outfile.csv','w')
fin=[]
k=open('j.csv','r')
for m in k:
    d=m.split(',')[0]
    if d not in fin:
        fin.append(d.strip())
for p in fin:
    gg=[]
    g=re.compile(r'{0}'.format(p))
    y=open('j.csv','r')
    for b in y:
        if re.search(g,b):
            gg.append(b)
    z.write(gg[-1].strip())
    z.write('\n')
df = pd.read_csv("outfile.csv", delimiter=',', names=['Col1','Col2'], header=0)
print(df)
final output: Empty DataFrame , Index: []
Is there anything I missed? Any suggestions would be appreciated.
It's not necessary to use the for-loop to process the file. The data aggregation is more easily completed in pandas.
Your CSV is shown without headers, so read the file with pandas.read_csv and header=None, and use parse_dates to parse the datetime column.
The datetime column is at column index 1, therefore parse_dates=[1].
This assumes the data begins on row 0 of the file and has no headers, as shown in the OP.
Create headers for the columns
As per a comment, the date component of 'datetime' can be accessed with the .dt accessor.
.groupby on name and aggregate .max()
import pandas as pd
# read the file j.csv
df = pd.read_csv('j.csv', header=None, parse_dates=[1])
# add headers
df.columns = ['name', 'datetime']
# select only the date component of datetime
df.datetime = df.datetime.dt.date
# display(df)
name datetime
0 Manoj 2020-01-01
1 Manoj 2020-02-01
2 Manoj 2020-03-01
3 Rajesh 2020-01-01
4 Rajesh 2020-05-01
5 Suresh 2020-04-01
# groupby
dfg = df.groupby('name')['datetime'].max().reset_index()
# display(dfg)
name datetime
0 Manoj 2020-03-01
1 Rajesh 2020-05-01
2 Suresh 2020-04-01
# save the file. If the headers aren't wanted, use `header=False`
dfg.to_csv('outfile.csv', index=False)
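As for why the original script printed an empty DataFrame, a likely cause is that outfile.csv is never closed or flushed before read_csv is called, so the buffered writes have not reached the disk yet. In addition, passing header=0 together with names makes pandas treat the first row as a header to be replaced, which drops one data row. A minimal sketch of writing and reading back a headerless file, using a temporary path and hard-coded rows as stand-ins:

```python
import os
import tempfile

import pandas as pd

rows = [
    "Manoj,2020-03-01 01:00:00",
    "Rajesh,2020-05-01 01:00:00",
    "Suresh,2020-04-01 01:00:00",
]

# Temporary location used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "outfile.csv")

# The with-block closes the file on exit, which flushes buffered writes to disk.
with open(path, "w") as out:
    for row in rows:
        out.write(row + "\n")

# header=None: the file has no header row, so no data row is consumed as one.
df = pd.read_csv(path, header=None, names=["Col1", "Col2"])
print(df)
```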
Create dataframe
import pandas as pd
df=pd.DataFrame(zip(
['Manoj','Manoj','Manoj','Rajesh','Rajesh','Suresh'],
['2020-01-01','2020-02-01','2020-03-01','2020-01-01','2020-05-01','2020-04-01'],
['01:00:00','01:00:00','01:00:00','01:00:00','01:00:00','01:00:00']),
columns=['name','date','time'])
Convert date and time from string to date and time object
# infer_datetime_format is deprecated since pandas 2.0; an explicit format does the same job here
df['date']=pd.to_datetime(df['date'], format='%Y-%m-%d').dt.date
df['time']=pd.to_datetime(df['time'],format='%H:%M:%S').dt.time
Take groupby
out=df.groupby(by=['name','time']).max().reset_index()
You can save and load it again
out.to_csv('out.csv',index=False)
df1=pd.read_csv('out.csv')
result
name time date
0 Manoj 01:00:00 2020-03-01
1 Rajesh 01:00:00 2020-05-01
2 Suresh 01:00:00 2020-04-01
Sorry, I created two separate columns for date and time, but I hope you can understand it

Date column coming in in different formats when loading csv Pandas

I'm having trouble loading some dates from a CSV file. The dates are essentially correct, but they seem to flip between YYYY-DD-MM and YYYY-MM-DD, and it appears to depend on whether the day within the date is below the 10th or not. When I look at the CSV in Excel, however, the dates are all in the same format, which weirdly is 01/04/19 (DD/MM/YY), so completely different from what pandas is loading.
Here is how the Date column is coming in:
2019-01-04
2019-08-04
2019-04-15
2019-04-22
2019-04-29
2019-06-05
2019-05-13
2019-05-20
2019-05-27
2019-03-06
2019-10-06
2019-06-17
2019-06-24
2019-01-07
2019-08-07
I've tried parsing the date when loading the csv at the beginning and tried things like df['Date'] = pd.to_datetime(df['Date']) but nothing appears to work. Has anyone seen anything like this before?
You may use df['Date'] = pd.to_datetime(df['DateS'], format='%Y-%m-%d'),
as follows:
df = pd.DataFrame({
'DateS' : ['2019-01-04', '2019-08-04']
})
df['Date'] = pd.to_datetime(df['DateS'], format='%Y-%m-%d')
df
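The question says Excel shows the dates as 01/04/19 (DD/MM/YY). If the raw text in the CSV really is in that form (an assumption, since Excel may display dates differently from how they are stored), the flipping can be avoided by reading the column as plain strings and parsing it with an explicit day-first format. A sketch with made-up sample rows:

```python
import io

import pandas as pd

# Made-up sample in the DD/MM/YY form that the question describes.
csv_text = "Date\n01/04/19\n08/04/19\n15/04/19\n"

# Read the column as plain strings; no date parsing at load time.
df = pd.read_csv(io.StringIO(csv_text))

# An explicit format removes the day/month ambiguity for every row.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%y")
print(df)
```

With the format fixed, 01/04/19 and 15/04/19 are both read as April dates, instead of the first one being interpreted month-first.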

Pandas: datetime conversion from dtype object

I am working on a timeseries dataset which looks like this:
DateTime SomeVariable
0 01/01 01:00:00 0.24244
1 01/01 02:00:00 0.84141
2 01/01 03:00:00 0.14144
3 01/01 04:00:00 0.74443
4 01/01 05:00:00 0.99999
The date is without year. Initially, the dtype of the DateTime is object and I am trying to change it to pandas datetime format. Since the date in my data is without year, on using:
df['DateTime'] = pd.to_datetime(df.DateTime)
I am getting the error OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1-01-01 01:00:00
I understand why I am getting the error (as it's not according to the pandas acceptable format), but what I want to know is how I can change the dtype from object to pandas datetime format without having year in my date. I would appreciate the hints.
EDIT 1:
Since I learned that I can't do it without a year in the data, this is how I am trying to change the dtype:
df = pd.read_csv(some file location)
df['DateTime'] = pd.to_datetime('2018/'+df['DateTime'], format='%y%d/%m %H:%M:%S')
df.head()
On doing that, I am getting:
ValueError: time data '2018/ 01/01 01:00:00' doesn't match format specified.
EDIT 2:
Changing the format to '%Y/%m/%d %H:%M:%S'.
My data is hourly data, so it goes till 24h. I have only provided the demo data till 5h.
I was getting the space on adding the year to the DateTime. In order to remove that, this is what I did:
df['DateTime'] = pd.to_datetime('2018/'+df['DateTime'][1:], format='%Y/%m/%d %H:%M:%S')
I am getting the following error for that:
ValueError: time data '2018/ 01/01 02:00:00' doesn't match format specified
On changing the format to '%y/%m/%d %H:%M:%S' with the same code, this is the error I get:
ValueError: time data '2018/ 01/01 02:00:00' does not match format '%y/%m/%d %H:%M:%S' (match)
The problem is because of the gap after the year but I am not able to get rid of it.
EDIT 3:
I am able to get rid of the space after adding the year, however I am still not able to change the dtype.
df['DateTime'] = pd.to_datetime('2018/'+df['DateTime'].str.strip(), format='%Y/%m/%d %H:%M:%S')
ValueError: time data '2018/01/01 01:00:00' doesn't match format specified
I noticed that there are 2 spaces between the date and the time in the error, however adding 2 spaces in the format doesn't help.
EDIT 4 (Solution):
I removed all the repeated whitespace, but the format still did not match. The problem was the time format: the hours in my data run from 1 to 24, while pandas supports 0 to 23. I simply changed the time 24:00:00 to 00:00:00 and it works perfectly now.
This is not possible. A datetime object must have a year.
What you can do is ensure all years are aligned for your data.
For example, to convert to datetime while setting year to 2018:
df = pd.DataFrame({'DateTime': ['01/01 01:00:00', '01/01 02:00:00', '01/01 03:00:00',
'01/01 04:00:00', '01/01 05:00:00']})
df['DateTime'] = pd.to_datetime('2018/'+df['DateTime'], format='%Y/%m/%d %H:%M:%S')
print(df)
DateTime
0 2018-01-01 01:00:00
1 2018-01-01 02:00:00
2 2018-01-01 03:00:00
3 2018-01-01 04:00:00
4 2018-01-01 05:00:00
# Remove spaces. Have in mind this will remove all spaces.
df['DateTime'] = df['DateTime'].str.replace(" ", "")
# I'm assuming year does not matter and that 01/01 is in the format day/month.
df['DateTime'] = pd.to_datetime(df['DateTime'], format='%d/%m%H:%M:%S')
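A note on EDIT 4 in the question: replacing 24:00:00 with 00:00:00 makes the strings parseable, but it leaves those timestamps at the start of the same day. If 24:00:00 is meant as midnight at the end of the day (an assumption about the data), a sketch like the following moves those rows to the next day. The sample rows and the year 2018 are illustrative, and the month/day order follows the format used in the answer above:

```python
import pandas as pd

df = pd.DataFrame({"DateTime": [" 01/01  23:00:00", " 01/01  24:00:00", " 01/02  01:00:00"]})

# Collapse repeated whitespace and trim both ends.
clean = df["DateTime"].str.split().str.join(" ")

# Rows that use hour 24, which the %H directive does not accept.
is_24 = clean.str.contains(" 24:", regex=False)

# Parse with 24 replaced by 00 ...
parsed = pd.to_datetime(
    "2018/" + clean.str.replace(" 24:", " 00:", regex=False),
    format="%Y/%m/%d %H:%M:%S",
)

# ... then move only the former 24:00:00 rows forward by one day.
df["DateTime"] = parsed.where(~is_24, parsed + pd.Timedelta(days=1))
print(df)
```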

How to convert normal string to datetime in Pandas

I have plain strings, several million data points, from a .csv file in the format below:
Datetime
22/12/2015 17:00:00
22/12/2015 18:00:00
I loaded it into pandas and tried to convert it to datetime format using pandas.to_datetime(df['Datetime']). However, the resulting time series is not correct: some new datetimes are produced during the conversion. For example, 2016-12-11 23:30:00 is not contained in the original data.
It has been a while since I worked with pandas, but the date in your example has a different format from the example lines in your CSV:
yyyy-mm-dd hh:mm:ss
instead of
dd/mm/yyyy hh:mm:ss
The to_datetime function takes a "format" parameter; this should help if that is the cause.
You want to use the option dayfirst=True
pd.to_datetime(df.Datetime, dayfirst=True)
This:
Datetime
22/12/2015 17:00:00
22/12/2015 18:00:00
11/12/2015 23:30:00
Gets converted to
0 2015-12-22 17:00:00
1 2015-12-22 18:00:00
2 2015-12-11 23:30:00
Name: Datetime, dtype: datetime64[ns]
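Combining the two answers, the format parameter can also be spelled out explicitly. With a fixed format, a value that does not match raises an error instead of being silently reinterpreted. A small sketch using the same three sample rows:

```python
import pandas as pd

df = pd.DataFrame({
    "Datetime": ["22/12/2015 17:00:00", "22/12/2015 18:00:00", "11/12/2015 23:30:00"]
})

# Day first, then month, then four-digit year, matching the CSV layout.
df["Datetime"] = pd.to_datetime(df["Datetime"], format="%d/%m/%Y %H:%M:%S")
print(df["Datetime"])
```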
