I am trying to use Pandas to sum the time (hours, minutes) of a series. The data comes from a TimeField:
class PhoneRecord(models.Model):
    et = models.TimeField(null=True, blank=True)
In Python I fetch the records and convert them to a DataFrame.
phone = PhoneRecord.objects.all()
df = read_frame ( phone )
df.et = df.et.fillna ( '00:00:00' ) # some records are blank
df [ "time" ] = pd.to_datetime(df.et, format = '%H:%M:%S', errors = 'coerce')
this gives me the following output.
0 00:00:35
1 00:00:29
2 00:00:00
3 00:00:00
4 00:00:37
......
When I try to sum
df.time.sum ()
I get errors like: unsupported operand type(s) for +: 'datetime.time' and 'datetime.time'
What do I need to do to be able to sum and average the data?
Thank you.
You just need to run a custom 1-liner here to combine time objects into timedelta objects which can then be summed together. (see the "print" line)
from datetime import datetime, time, timedelta
import pandas as pd
phone = PhoneRecord.objects.all()
df = pd.DataFrame(list([i.__dict__ for i in phone])) # create pd.df from model query
df.et = df.et.fillna(time(0, 0)) # some records are blank; fill with a time object, not a string
print(df.et)
# combine each time with a dummy date, then subtract that date to get a timedelta
print("SUM:", sum([datetime.combine(datetime.min, t) - datetime.min for t in df.et.tolist()], timedelta()))
You should get something like this:
0 00:00:20
1 00:00:20
2 00:00:50
3 00:00:30
4 00:00:20
SUM: 0:02:20
I had to change things a bit to get them to work on my end so hopefully, it is the same with you and your version of Pandas and Django. Hope this helps!
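A pandas-only variant of the same idea, as a minimal sketch. It assumes the column holds datetime.time objects with missing values for blank records; the sample values below are made up. Converting each value to its string form lets pd.to_timedelta build a timedelta column, which supports both sum() and mean().

```python
from datetime import time

import pandas as pd

# Hypothetical stand-in for the TimeField column; None marks a blank record.
et = pd.Series([time(0, 0, 20), time(0, 0, 20), None, time(0, 0, 50), time(0, 0, 30)])

# Blank records become zero-length durations; str(time) yields "HH:MM:SS".
durations = pd.to_timedelta(et.map(lambda t: "00:00:00" if pd.isna(t) else str(t)))

total = durations.sum()
average = durations.mean()
print("SUM:", total)
print("AVG:", average)
```

Whether blanks should count as zero or be dropped before averaging depends on what the average is meant to describe.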
All computations, such as averages and counts, should be performed by the database engine where possible. I don't know the underlying problem, but using pandas on the server side just to get an average is definitely overkill. Take a look at Django's aggregation facilities.
Also, you probably need to restructure the model. If you need to store the duration of a phone conversation, you can use a FloatField instead, e.g.
class PhoneRecord(models.Model):
    duration = models.FloatField(blank=True, default=0.0, help_text=_('duration in seconds'))
    # other fields...
    # also, you can set up the duration field with `editable=False`, and
    # calculate its value each time the record is created
In this case you can use Avg:
from django.db.models import Avg
PhoneRecord.objects.all().aggregate(Avg('duration'))
and get something like this:
{'duration__avg': 12.3}
I am trying to get a series of ages from a list of persons. The age changes with each query because it is the age at a specific event, so I can accomplish this with a simple loop, extracting the timedelta from the difference:
[ (event_date - user.birth_date).days/365.25 for user in User.objects.all() ]
event_date is always a datetime.date object, and so is user.birth_date. I consider event_date a "static" field or constant because it is outside the database.
This gives me the correct results, but since I am doing this many times and I have other calculations to do, I wanted to generate the ages in the database using an F() expression.
from django.db.models import ExpressionWrapper, F, fields
diference = ExpressionWrapper(
    event_date - F('birth_date'),
    output_field=fields.DurationField())
qs_ages = self.annotate(age_dist=diference)
This should give me a field named age_dist, a timedelta containing the total days between the two dates, so the following should give the same result as above:
[ user.age_dist.days/365.25 for user in User.objects.all() ]
But this does not work; the result is a timedelta in microseconds.
What am I doing wrong, and how should I include the static value of event_date in the expression?
And, going further: is there any way to get the days of the resulting timedelta from the ExpressionWrapper?
Since you're doing the operation on a datetime object, it returns a datetime object, which is converted to a timedelta in microseconds when you use DurationField as the output field.
You can work around this by doing everything in the database:
from django.db.models.functions import Cast
age = ExpressionWrapper(
    Cast((event_date.date() - F('birth_date')) / 365.25, output_field=fields.PositiveIntegerField()),
    output_field=fields.PositiveIntegerField()
)
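On the Python side, the days of a timedelta are available whatever unit it was built from. A small sketch, where age_in_years is a hypothetical helper and the dates are made up for illustration:

```python
from datetime import date, timedelta


def age_in_years(event_date: date, birth_date: date) -> float:
    """Age at event_date in years, using the 365.25-day year from the question."""
    return (event_date - birth_date) / timedelta(days=365.25)


# A timedelta built from microseconds is normalised into days, seconds and microseconds.
delta = timedelta(microseconds=3 * 86_400_000_000)
days = delta.days
age = age_in_years(date(2020, 1, 1), date(2000, 1, 1))
print(days, age)
```

Dividing one timedelta by another returns a float, which avoids handling the units by hand.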
I am trying to import a dataframe from a spreadsheet using pandas and then carry out numpy operations with its columns. The problem is that I obtain the error specified in the title: TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value.
The reason for this is that my dataframe contains a column with dates, like:
ID Date
519457 25/02/2020 10:03
519462 25/02/2020 10:07
519468 25/02/2020 10:12
... ...
And Numpy requires the format to be floating point numbers, as so:
ID Date
519457 43886.41875
519462 43886.42153
519468 43886.425
... ...
How can I make this change without having to modify the spreadsheet itself?
I have seen a lot of posts on the forum asking the opposite, and asking about the error, and read the docs on xlrd.xldate, but have not managed to do this, which seems very simple.
I am sure this kind of problem has been dealt with before, but have not been able to find a similar post.
The code I am using is the following
xls=pd.ExcelFile(r'/home/.../TwoData.xlsx')
xls.sheet_names
df=pd.read_excel(xls,"Hoja 1")
df["E_t"]=df["Date"].diff()
Any help or pointers would be really appreciated!
PS. I have seen solutions that require computing the exact number that wants to be obtained, but this is not possible in this case due to the size of the dataframes.
You can convert the date into a Unix timestamp. In Python, if you have a datetime object in UTC, you can call timestamp() to get a UTC timestamp. This method returns the time in seconds since the epoch for that datetime object.
Please see the example below:
from datetime import datetime, timezone
dt = datetime(2015, 10, 19)
timestamp = dt.replace(tzinfo=timezone.utc).timestamp()
print(timestamp)
1445212800.0
Please check the datetime module for more info.
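For a whole pandas column, the same conversion can be done without a loop. This is a sketch on made-up sample data, and it assumes the values are naive datetimes that should be read as UTC:

```python
import pandas as pd

# Hypothetical sample column of naive datetimes, interpreted as UTC.
dates = pd.Series(pd.to_datetime(["2015-10-19 00:00:00", "2015-10-19 00:01:00"]))

# Subtract the epoch, then divide by one second to get float seconds.
epoch_seconds = (dates - pd.Timestamp("1970-01-01")) / pd.Timedelta(seconds=1)
print(epoch_seconds.tolist())
```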
I think you need:
#https://stackoverflow.com/a/9574948/2901002
#rewritten to vectorized solution
def excel_date(date1):
    temp = pd.Timestamp(1899, 12, 30) # Note, not 31st Dec but 30th!
    delta = date1 - temp
    return (delta.dt.days) + (delta.dt.seconds) / 86400
df["Date"] = pd.to_datetime(df["Date"]).pipe(excel_date)
print (df)
ID Date
0 519457 43886.418750
1 519462 43886.421528
2 519468 43886.425000
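An equivalent form divides the timedelta by one day, which keeps the fractional part in a single step. The sketch below rebuilds the sample rows from the question by hand and assumes the dates arrive as day-first strings; if read_excel already returned datetimes, the parsing step can be skipped.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [519457, 519462, 519468],
    "Date": ["25/02/2020 10:03", "25/02/2020 10:07", "25/02/2020 10:12"],
})

# Parse the day-first strings with an explicit format to avoid ambiguity.
parsed = pd.to_datetime(df["Date"], format="%d/%m/%Y %H:%M")

# Same 1899-12-30 origin as above; dividing by one day gives a float serial number.
df["Date"] = (parsed - pd.Timestamp(1899, 12, 30)) / pd.Timedelta(days=1)
df["E_t"] = df["Date"].diff()  # elapsed time between consecutive rows, in days
print(df)
```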
I am trying to format my time data to be displayed in hours:minutes:seconds (e.g. 36:30:30). The main goal is to be able to aggregate the times so that totals can be displayed in number of hours. I do not want to have totals in number of days.
My time data start as strings, in the format "HH:MM:SS". With pandas, I convert these to timedelta values using:
df["date column"] = pd.to_timedelta(df["date column"])
There is one record that is "24:00:00", but the above line of code gives that as "1 day".
Is there a way to display this time as 24:00:00?
IIUC, we can divide by np.timedelta64 to turn your timedelta into a numerical representation of itself (here, hours as a float).
import numpy as np
df = pd.DataFrame({'hours' : ['34:00:00','23:45:22','11:00:11'] })
hours = pd.to_timedelta(df['hours']) / np.timedelta64(1,'h')
print(hours)
0 34.000000
1 23.756111
2 11.003056
Name: hours, dtype: float64
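If the display itself has to read HH:MM:SS with hours running past 24, one option is to format the total seconds by hand. A minimal sketch, where format_hms is a hypothetical helper and negative durations are not handled:

```python
import pandas as pd


def format_hms(td: pd.Timedelta) -> str:
    """Render a non-negative timedelta as HH:MM:SS, letting hours exceed 24."""
    total_seconds = int(td.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


durations = pd.to_timedelta(pd.Series(["24:00:00", "12:30:30"]))
formatted = durations.map(format_hms).tolist()
total = format_hms(durations.sum())
print(formatted, total)
```

Keeping the column as timedelta and formatting only for display means the values can still be aggregated.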
I want to generate the sum of distance and seconds traveled by day. I want to use a groupby function to calculate the sum of the orders per day.
I have the following code:
import pandas as pd
orders = pd.read_csv('complete.csv', delimiter=',', encoding='ISO-8859-1')
orders['datetime'] = pd.to_datetime(orders['datetime'])
orders.groupby(orders.datetime.dt.date).sum()
print(orders)
The complete csv file looks as follow:
datetime,restaurant,customer_address,amount,restaurant_address,meters,seconds
2018-01-01 15:41:37,Name,9711AR,50.5,9722AC,2268.3,606.0
2018-08-13 16:57:52,Name,9711AR,22.3,9722AC,2268.3,606.0
2018-09-21 17:38:53,Name,9711AR,66.89,9722AC,2268.3,606.0
2018-11-09 18:37:26,Name,9711AR,42.66,9722AC,2268.3,606.0
2018-01-01 18:28:04,Name,9711AJ,70.75,9746RD,4090.4,1039.5
I want to generate a sum of meters and seconds for each day.
I think I have some trouble with the 'datetime' object that it does not recognize it as a date or something.
Any ideas?
I think your code is fine; the only issue is that orders.groupby(orders.datetime.dt.date).sum() does not update orders. You can write
orders = orders.groupby(orders.datetime.dt.date).sum()
if you want to do so.
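Since only meters and seconds are wanted, it may be cleaner to select those columns before summing; depending on the pandas version, summing the string and datetime columns can either concatenate text or raise an error. A sketch using a few of the rows from the question, inlined so it runs on its own:

```python
import io

import pandas as pd

csv_text = """datetime,restaurant,customer_address,amount,restaurant_address,meters,seconds
2018-01-01 15:41:37,Name,9711AR,50.5,9722AC,2268.3,606.0
2018-08-13 16:57:52,Name,9711AR,22.3,9722AC,2268.3,606.0
2018-01-01 18:28:04,Name,9711AJ,70.75,9746RD,4090.4,1039.5
"""
orders = pd.read_csv(io.StringIO(csv_text), parse_dates=["datetime"])

# Group on the calendar date and sum only the two numeric columns of interest.
daily = orders.groupby(orders["datetime"].dt.date)[["meters", "seconds"]].sum()
print(daily)
```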
Python 3.6.0
I am importing a file with Unix timestamps.
I’m converting them to Pandas datetime and rounding to 10 minutes (12:00, 12:10, 12:20,…)
The data is collected from within a specified time period, but from different dates.
For our analysis, we want to change all dates to the same dates before doing a resampling.
At present we have a reduce_to_date that is the target for all dates.
current_date = pd.to_datetime('2017-04-05') #This will later be dynamic
reduce_to_date = current_date - pd.DateOffset(days=7)
I’ve tried to find an easy way to change the date in a series without changing the time.
I was trying to avoid lengthy conversions with .strftime().
One method that I've almost settled on is to add the difference between reduce_to_date and df['Timestamp'] to df['Timestamp']. However, I was trying to use the .date() function, and that only works on a single element, not on the series.
GOOD!
passed_df['Timestamp'][0] = passed_df['Timestamp'][0] + (reduce_to_date.date() - passed_df['Timestamp'][0].date())
NOT GOOD
passed_df['Timestamp'][:] = passed_df['Timestamp'][:] + (reduce_to_date.date() - passed_df['Timestamp'][:].date())
AttributeError: 'Series' object has no attribute 'date'
I can use a loop:
x=1
for line in passed_df['Timestamp']:
    passed_df['Timestamp'][x] = line + (reduce_to_date.date() - line.date())
    x+=1
But this throws a warning:
C:\Users\elx65i5\Documents\Lightweight Logging\newmain.py:60: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
The goal is to have all dates the same, but leave the original time.
If we can simply specify the replacement date, that’s great.
If we can use mathematics and change each date according to a time delta, equally as great.
Can we accomplish this in a vectorized fashion without using .strftime() or a lengthy procedure?
If I understand correctly, you can simply subtract an offset
passed_df['Timestamp'] -= pd.offsets.Day(7)
demo
passed_df = pd.DataFrame(dict(
    Timestamp=pd.to_datetime(['2017-04-05 15:21:03', '2017-04-05 19:10:52'])
))
# Make sure your `Timestamp` column is datetime.
# Mine is because I constructed it that way.
# Use
# passed_df['Timestamp'] = pd.to_datetime(passed_df['Timestamp'])
passed_df['Timestamp'] -= pd.offsets.Day(7)
print(passed_df)
Timestamp
0 2017-03-29 15:21:03
1 2017-03-29 19:10:52
using strftime
Though this is not ideal, I wanted to make a point that you absolutely can use strftime. When your column is datetime, you can use strftime via the dt date accessor with dt.strftime. You can create a dynamic column where you specify the target date like this:
pd.to_datetime(passed_df.Timestamp.dt.strftime('{} %H:%M:%S'.format('2017-03-29')))
0 2017-03-29 15:21:03
1 2017-03-29 19:10:52
Name: Timestamp, dtype: datetime64[ns]
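Subtracting a fixed offset only lines the dates up when every row starts on the same date. If the rows come from different dates, one vectorized option is to keep the time of day and attach it to the target date. A sketch with made-up data, where ts and target are hypothetical names:

```python
import pandas as pd

# Hypothetical column whose rows fall on different dates.
ts = pd.Series(pd.to_datetime(["2017-04-03 15:21:03", "2017-04-05 19:10:52"]))
target = pd.Timestamp("2017-03-29")

# ts - ts.dt.normalize() is the time of day as a timedelta; add it to the target date.
moved = target + (ts - ts.dt.normalize())
print(moved)
```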
I think you need to convert df['Timestamp'].dt.date with to_datetime, because the output of dt.date is a Python date object, not a pandas datetime object:
df=pd.DataFrame({'Timestamp':pd.to_datetime(['2017-04-05 15:21:03','2017-04-05 19:10:52'])})
print (df)
Timestamp
0 2017-04-05 15:21:03
1 2017-04-05 19:10:52
current_date = pd.to_datetime('2017-04-05')
reduce_to_date = current_date - pd.DateOffset(days=7)
# time of day (Timestamp minus its own midnight) added to the target date
df['Timestamp'] = df['Timestamp'] - pd.to_datetime(df['Timestamp'].dt.date) + reduce_to_date
print (df)
Timestamp
0 2017-03-29 15:21:03
1 2017-03-29 19:10:52