Sum timedeltas in Python

I have multiple sets of dates/times I'm trying to manipulate in python, imported from a csv file using the pandas module. I've converted each entry from a string to datetime, and I can manipulate the data with + and -, but I get an error when trying to use 'sum()'. Specifically: "TypeError: 'Timedelta' object is not iterable".
Here is the code I'm using:
import pandas as pd
import numpy as np
from datetime import datetime
A = pd.read_csv('filename')
B = A['Start Time (UTCG)']
C = A['Stop Time (UTCG)']
DT_B = pd.to_datetime(B) #converting from string
DT_C = pd.to_datetime(C)
timediff = DT_C - DT_B
diffsum = sum(timediff)
where 'Start time' and 'Stop time' are in the format "11 Mar 2017 10:37:12.330" and B and C are lists.
I'm pretty new to python, so apologies if I'm overlooking something simple. If there's an easier way to manipulate strings of dates/times without datetime, that would be good too. Any help in getting "sum" to work would be appreciated. Thanks!

You might try using the sum method that ships with a pandas Series, which handles this correctly:
>>> import pandas as pd
>>> from datetime import timedelta
>>> data = [timedelta(i) for i in range(10)]
>>> a = pd.Series(data)
>>> a.sum()
Timedelta('45 days 00:00:00')
Note that I say Series, not DataFrame. When you pull a single column out of the DataFrame, as you did with C = A['Stop Time (UTCG)'], the type of C is a Series.
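Applied to the question's own variables, the fix is one character change: timediff.sum() instead of sum(timediff). A minimal sketch with made-up timestamps in the question's "11 Mar 2017 ..." format (the real data comes from the CSV):

```python
import pandas as pd

# hypothetical start/stop values standing in for the CSV columns
start = pd.to_datetime(pd.Series(["11 Mar 2017 10:00:00", "11 Mar 2017 11:00:00"]))
stop = pd.to_datetime(pd.Series(["11 Mar 2017 10:30:00", "11 Mar 2017 12:15:00"]))
timediff = stop - start        # a timedelta64[ns] Series
diffsum = timediff.sum()       # Series.sum(), not the builtin sum()
print(diffsum)                 # Timedelta('0 days 01:45:00')
```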
It might be cleaner to just create a new column from the other two in the first data frame, and then just aggregate or call the sum method on that column. Something like this:
>>> import pandas as pd
>>> from datetime import datetime, timedelta
>>> data1 = [datetime.now() for i in range(5)]
>>> data2 = [datetime.now() for i in range(5)]
>>> data = {'start': data1, 'stop': data2}
>>> df = pd.DataFrame(data)
>>> df
start stop
0 2017-03-11 22:38:11.606500 2017-03-11 22:38:37.474962
1 2017-03-11 22:38:11.606509 2017-03-11 22:38:37.474971
2 2017-03-11 22:38:11.606510 2017-03-11 22:38:37.474973
3 2017-03-11 22:38:11.606511 2017-03-11 22:38:37.474974
4 2017-03-11 22:38:11.606512 2017-03-11 22:38:37.474975
>>> df.dtypes  # use dtypes to make sure the types are what you think they are
start datetime64[ns]
stop datetime64[ns]
dtype: object
>>> df['diff'] = df['stop'] - df['start']
>>> df['diff'].sum()
Timedelta('0 days 00:02:09.342313')

Related

Changing date column of csv with Python Pandas

I have a csv file like this:
Tarih, Şimdi, Açılış, Yüksek, Düşük, Hac., Fark %
31.05.2022, 8,28, 8,25, 8,38, 8,23, 108,84M, 0,61%
(more than a thousand lines)
I want to change it like this:
Tarih, Şimdi, Açılış, Yüksek, Düşük, Hac., Fark %
5/31/2022, 8.28, 8.25, 8.38, 8.23, 108.84M, 0.61%
Especially "Date" format is Day.Month.Year and I need to put it in Month/Day/Year format.
I wrote the code like this:
import pandas as pd
import numpy as np
import datetime
df = pd.read_csv("qwe.csv", encoding='utf-8')
df.Tarih=df.Tarih.str.replace(".","/")
df.Şimdi=df.Şimdi.str.replace(",",".")
df.Açılış=df.Açılış.str.replace(",",".")
df.Yüksek=df.Yüksek.str.replace(",",".")
df.Düşük=df.Düşük.str.replace(",",".")
for i in df['Tarih']:
    q = 1
    datetime_obj = datetime.datetime.strptime(i, "%d/%m/%Y")
    df['Tarih'].loc[df['Tarih'].values == q] = datetime_obj
But the "for" loop in my code doesn't work. I need help on this. Thank you
Just looking at converting the date, you can import to a datetime object with arguments for pd.read_csv, then convert to your desired format by applying strftime to each entry.
If I have the following tmp.csv:
date, value
30.05.2022, 4.2
31.05.2022, 42
01.06.2022, 420
import pandas as pd
df = pd.read_csv('tmp.csv', parse_dates=['date'], dayfirst=True)
df['date'] = df['date'].dt.strftime('%m/%d/%Y')
print(df)
output:
date value
0 05/30/2022 4.2
1 05/31/2022 42.0
2 06/01/2022 420.0
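The question also asks about the comma-decimal numeric columns (Şimdi, Açılış, etc.). Those can be converted with the same str.replace idea the question used, followed by a cast to float; a sketch with hypothetical values mirroring the question's data:

```python
import pandas as pd

# values standing in for the question's comma-decimal columns
df = pd.DataFrame({'Şimdi': ['8,28', '8,25'], 'Fark %': ['0,61%', '1,05%']})
df['Şimdi'] = df['Şimdi'].str.replace(',', '.', regex=False).astype(float)
df['Fark %'] = df['Fark %'].str.replace(',', '.', regex=False)
print(df['Şimdi'].tolist())  # [8.28, 8.25]
```

Note regex=False, so the comma is treated literally rather than as a regular expression.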

Is it possible for a pandas dataframe column to have datetime.date type?

I am using cx_oracle to fetch date from databases. I would like to put the fetched data into a pandas dataframe. My problem is that the dates are converted to numpy.datetime64 objects which I absolutely don't need.
I would like to have them as datetime.date objects. I have seen the dt.date method but it still gives back numpy datetypes.
Edit: It appears that with pandas 0.21.0 or newer, there is no problem holding python datetime.dates in a DataFrame. date-like columns are not automatically converted to datetime64[ns] dtype.
import numpy as np
import pandas as pd
import datetime as DT
print(pd.__version__)
# 0.21.0.dev+25.g50e95e0
dates = [DT.date(2017,1,1)+DT.timedelta(days=2*i) for i in range(3)]
df = pd.DataFrame({'dates': dates, 'foo': np.arange(len(dates))})
print(all([isinstance(item, DT.date) for item in df['dates']]))
# True
df['dates'] = (df['dates'] + pd.Timedelta(days=1))
print(all([isinstance(item, DT.date) for item in df['dates']]))
# True
For older versions of Pandas:
There is a way to prevent a Pandas DataFrame from automatically converting
datelike values to datetime64[ns] by assigning an additional value such as an
empty string which is not datelike to the column. After the DataFrame is
formed, you can remove the non-datelike value:
import pandas as pd
import datetime as DT
dates = [DT.date(2017,1,1)+DT.timedelta(days=i) for i in range(10)]
df = pd.DataFrame({'dates':['']+dates})
df = df.iloc[1:]
print(all([isinstance(item, DT.date) for item in df['dates']]))
# True
Clearly, programming this kind of shenanigan into serious code feels entirely wrong since we're subverting the intent of the developers.
There are also computational speed advantages to using datetime64[ns]s over lists or object arrays of datetime.dates.
Moreover, if df[col] has dtype datetime64[ns] then df[col].dt.date.values returns an object NumPy array of python datetime.dates:
import pandas as pd
import datetime as DT
dates = [DT.datetime(2017,1,1)+DT.timedelta(days=2*i) for i in range(3)]
df = pd.DataFrame({'dates': dates})
print(repr(df['dates'].dt.date.values))
# array([datetime.date(2017, 1, 1), datetime.date(2017, 1, 3),
# datetime.date(2017, 1, 5)], dtype=object)
So you could perhaps enjoy the best of both worlds by keeping the column as datetime64[ns] and using df[col].dt.date.values to obtain datetime.dates when necessary.
On the other hand, datetime64[ns] and Python datetime.date have different ranges of representable dates:
datetime64[ns] can represent datetimes from 1677 AD to 2262 AD.
datetime.date can represent dates from DT.date(1, 1, 1) to DT.date(9999, 12, 31).
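These limits can be checked directly from the library constants, a quick sketch:

```python
import datetime as DT
import pandas as pd

# the bounds of the datetime64[ns]-backed Timestamp vs plain datetime.date
print(pd.Timestamp.min, pd.Timestamp.max)  # 1677-09-21 ... 2262-04-11 ...
print(DT.date.min, DT.date.max)            # 0001-01-01 9999-12-31
```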
If the reason why you want to use datetime.dates instead of datetime64[ns]s is to overcome the limited range of representable dates, then perhaps a better alternative is to use a pd.PeriodIndex:
import pandas as pd
import datetime as DT
dates = [DT.date(2017,1,1)+DT.timedelta(days=2*i) for i in range(10)]
df = pd.DataFrame({'dates':pd.PeriodIndex(dates, freq='D')})
print(df)
# dates
# 0 2017-01-01
# 1 2017-01-03
# 2 2017-01-05
# 3 2017-01-07
# 4 2017-01-09
# 5 2017-01-11
# 6 2017-01-13
# 7 2017-01-15
# 8 2017-01-17
# 9 2017-01-19

Date difference in hours (Excel data import)?

I need to calculate the hour difference between two dates (format: year-month-dayTHH:MM:SS; I could also transform the data to year-month-day HH:MM:SS) from a huge Excel file. What is the most efficient way to do it in Python? I have tried datetime/time objects (TypeError: expected string or buffer), Timestamp (ValueError), and a DataFrame (does not give an hour result).
Excel File:
Order_Date Received_Customer Column3
2000-10-06T13:00:58 2000-11-06T13:00:58 1
2000-10-21T15:40:15 2000-12-27T10:09:29 2
2000-10-23T10:09:29 2000-10-26T10:09:29 3
..... ....
Datatime/Time object code (TypeError: expected string or buffer):
import pandas as pd
import time as t
data=pd.read_excel('/path/file.xlsx')
s1 = (data,['Order_Date'])
s2 = (data,['Received_Customer'])
s1Time = t.strptime(s1, "%Y:%m:%d:%H:%M:%S")
s2Time = t.strptime(s2, "%Y:%m:%d:%H:%M:%S")
deltaInHours = (t.mktime(s2Time) - t.mktime(s1Time))
print deltaInHours, "hours"
Timestamp (ValueError) code:
import pandas as pd
import datetime as dt
data=pd.read_excel('/path/file.xlsx')
df = pd.DataFrame(data,columns=['Order_Date','Received_Customer'])
df.to = [pd.Timestamp('Order_Date')]
df.fr = [pd.Timestamp('Received_Customer')]
(df.fr-df.to).astype('timedelta64[h]')
DataFrame (does not return the desired result)
import pandas as pd
data=pd.read_excel('/path/file.xlsx')
df = pd.DataFrame(data,columns=['Order_Date','Received_Customer'])
df['Order_Date'] = pd.to_datetime(df['Order_Date'])
df['Received_Customer'] = pd.to_datetime(df['Received_Customer'])
answer = df.dropna()['Order_Date'] - df.dropna()['Received_Customer']
answer.astype('timedelta64[h]')
print(answer)
Output:
0 24 days 16:38:07
1 0 days 00:00:00
2 20 days 12:39:52
dtype: timedelta64[ns]
Should be something like this:
0 592 hour
1 0 hour
2 492 hour
Is there another way to convert timedelta64[ns] into hours than answer.astype('timedelta64[h]')?
In each of your attempts you mixed up datatypes and methods. While I don't have the time to explain each mistake explicitly, I want to help by providing a (probably non-optimal) solution.
I built the solution out of your previous tries and I combined it with knowledge from other questions such as:
Convert a timedelta to days, hours and minutes
Get total number of hours from a Pandas Timedelta?
Note that I used Python 3. I hope that my solution guides your way. Here it is:
import pandas as pd
import numpy as np
data = pd.read_excel('C:\\Users\\nrieble\\Desktop\\check.xlsx', header=0)
start = [pd.to_datetime(e) for e in data['Order_Date'] if len(str(e)) > 4]
end = [pd.to_datetime(e) for e in data['Received_Customer'] if len(str(e)) > 4]
delta = np.asarray(end) - np.asarray(start)  # array of timedelta64
deltainhours = [e / np.timedelta64(1, 'h') for e in delta]
print(deltainhours, "hours")
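A fully vectorized alternative, staying inside pandas, is to subtract the converted columns and use the .dt.total_seconds() accessor; a sketch using two sample rows from the question:

```python
import pandas as pd

# sample rows from the question; in practice these come from read_excel
df = pd.DataFrame({'Order_Date': ['2000-10-06T13:00:58', '2000-10-23T10:09:29'],
                   'Received_Customer': ['2000-11-06T13:00:58', '2000-10-26T10:09:29']})
delta = pd.to_datetime(df['Received_Customer']) - pd.to_datetime(df['Order_Date'])
hours = delta.dt.total_seconds() / 3600  # timedelta64 Series -> float hours
print(hours.tolist())  # [744.0, 72.0]
```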

In python pandas, how can I convert this formatted date string to datetime

I have tried several ways of using to_datetime, but so far I can only get it to return the dtype as "object"
pd.to_datetime(pd.Series(['28Dec2013 19:23:15']),dayfirst=True)
The return from this command is:
0 28Dec2013 19:23:15
dtype: object
You can pass a format parameter to the to_datetime function.
>>> import pandas as pd
>>> df = pd.to_datetime(pd.Series(['28Dec2013 19:23:15']),format="%d%b%Y %H:%M:%S",dayfirst=True)
>>> df
0 2013-12-28 19:23:15
dtype: datetime64[ns]
In case you need to convert existing columns in a DataFrame, here is a solution using a helper function conv and the apply method:
import datetime
import pandas as pd
def conv(x):
    return datetime.datetime.strptime(x, '%d%b%Y %H:%M:%S')

series = pd.Series(['28Dec2013 19:23:15'])
converted = series.apply(conv)
print(converted)
# 0   2013-12-28 19:23:15
# dtype: datetime64[ns]
Pandas does not recognize that datetime format.
>>> pd.to_datetime(pd.Series(['28Dec2013 19:23:15']))
0    28Dec2013 19:23:15
dtype: object
>>> pd.to_datetime(pd.Series(['28 Dec 2013 19:23:15']))
0   2013-12-28 19:23:15
dtype: datetime64[ns]
You will need to parse the strings you are feeding into the Series. Regular expressions will likely be a good solution for this.

pandas: convert datetime to end-of-month

I have written a function to convert pandas datetime dates to month-end:
import pandas
import numpy
import datetime
from pandas.tseries.offsets import Day, MonthEnd
def get_month_end(d):
    month_end = d - Day() + MonthEnd()  # 31/March + MonthEnd() returns 30/April
    if month_end.month == d.month:
        return month_end
    else:
        print "Something went wrong while converting dates to EOM: " + d + " was converted to " + month_end
        raise
This function seems to be quite slow, and I was wondering if there is any faster alternative? The reason I noticed it's slow is that I am running this on a dataframe column with 50'000 dates, and I can see that the code is much slower since introducing that function (before I was converting dates to end-of-month).
df = pandas.read_csv(inpath, na_values = nas, converters = {open_date: read_as_date})
df[open_date] = df[open_date].apply(get_month_end)
I am not sure if that's relevant, but I am reading the dates in as follows:
def read_as_date(x):
    return datetime.datetime.strptime(x, fmt)
Revised, converting to period and then back to timestamp does the trick
In [104]: df = DataFrame(dict(date = [Timestamp('20130101'),Timestamp('20130131'),Timestamp('20130331'),Timestamp('20130330')],value=randn(4))).set_index('date')
In [105]: df
Out[105]:
value
date
2013-01-01 -0.346980
2013-01-31 1.954909
2013-03-31 -0.505037
2013-03-30 2.545073
In [106]: df.index = df.index.to_period('M').to_timestamp('M')
In [107]: df
Out[107]:
value
2013-01-31 -0.346980
2013-01-31 1.954909
2013-03-31 -0.505037
2013-03-31 2.545073
Note that this type of conversion can also be done like this, the above would be slightly faster, though.
In [85]: df.index + pd.offsets.MonthEnd(0)
Out[85]: DatetimeIndex(['2013-01-31', '2013-01-31', '2013-03-31', '2013-03-31'], dtype='datetime64[ns]', name=u'date', freq=None, tz=None)
If the date column is in datetime format, adding MonthEnd(0) rolls each date forward to the end of its month (dates already at month end are left unchanged):
df['date1'] = df['date'] + pd.offsets.MonthEnd(0)
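The 0 in MonthEnd(0) matters: MonthEnd(1) always moves at least one step, so a date already at month end jumps to the next month. A small sketch of the difference:

```python
import pandas as pd

eom = pd.Timestamp('2013-01-31')  # already a month end
mid = pd.Timestamp('2013-01-15')  # mid-month
print(eom + pd.offsets.MonthEnd(0))  # 2013-01-31: stays put
print(eom + pd.offsets.MonthEnd(1))  # 2013-02-28: rolls to the *next* month end
print(mid + pd.offsets.MonthEnd(0))  # 2013-01-31
```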
import pandas as pd
import numpy as np
import datetime as dt
df0['Calendar day'] = pd.to_datetime(df0['Calendar day'], format='%m/%d/%Y')
df0['Calendar day'] = df0['Calendar day'].dt.normalize()  # replaces the removed pd.datetools.normalize_date
df0['Month Start Date'] = df0['Calendar day'].dt.to_period('M').apply(lambda r: r.start_time)
This code should work. Calendar day is a column in which the date is given in the format %m/%d/%Y; for example, 12/28/2014 is 28 December 2014. The output comes out as 2014-12-01, as a pandas Timestamp.
You can also use NumPy to do it faster:
import numpy as np
date_array = np.array(['2013-01-01', '2013-01-15', '2013-01-30']).astype('datetime64[ns]')
month_start_date = date_array.astype('datetime64[M]')
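That snippet gives the month start; since the question asks for month end, the same datetime64 casting trick can be extended: go to the first day of the next month, then step back one day. A sketch under that assumption:

```python
import numpy as np

date_array = np.array(['2013-01-01', '2013-01-15', '2013-01-30'],
                      dtype='datetime64[D]')
# truncate to month, add one month, cast back to days, subtract one day
month_end = ((date_array.astype('datetime64[M]') + np.timedelta64(1, 'M'))
             .astype('datetime64[D]') - np.timedelta64(1, 'D'))
print(month_end)  # every entry becomes 2013-01-31
```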
In case the date is not in the index but in another column (works for Pandas 0.25.0):
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(date = [pd.Timestamp('20130101'),
pd.Timestamp('20130201'),
pd.Timestamp('20130301'),
pd.Timestamp('20130401')],
value = np.random.rand(4)))
print(df.to_string())
df.date = df.date.dt.to_period('M').dt.to_timestamp('M')
print(df.to_string())
Output:
date value
0 2013-01-01 0.295791
1 2013-02-01 0.278883
2 2013-03-01 0.708943
3 2013-04-01 0.483467
date value
0 2013-01-31 0.295791
1 2013-02-28 0.278883
2 2013-03-31 0.708943
3 2013-04-30 0.483467
What you are looking for might be:
df.resample('M').last()
The other method as said earlier by #Jeff:
df.index = df.index.to_period('M').to_timestamp('M')
