I want to generate the sum of distance and seconds traveled per day, using a groupby to calculate those sums by date.
I have the following code:
import pandas as pd
orders = pd.read_csv('complete.csv', delimiter=',', encoding='ISO-8859-1')
orders['datetime'] = pd.to_datetime(orders['datetime'])
orders.groupby(orders.datetime.dt.date).sum()
print(orders)
The complete.csv file looks as follows:
datetime,restaurant,customer_address,amount,restaurant_address,meters,seconds
2018-01-01 15:41:37,Name,9711AR,50.5,9722AC,2268.3,606.0
2018-08-13 16:57:52,Name,9711AR,22.3,9722AC,2268.3,606.0
2018-09-21 17:38:53,Name,9711AR,66.89,9722AC,2268.3,606.0
2018-11-09 18:37:26,Name,9711AR,42.66,9722AC,2268.3,606.0
2018-01-01 18:28:04,Name,9711AJ,70.75,9746RD,4090.4,1039.5
I want to generate a sum of meters and seconds for each day.
I think the trouble is with the 'datetime' column: it does not seem to be recognized as a date.
Any ideas?
Your code is fine; the only issue is that orders.groupby(orders.datetime.dt.date).sum() does not update orders in place. Assign the result back if you want to keep it:
orders = orders.groupby(orders.datetime.dt.date).sum()
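For example, a minimal sketch of the full pipeline (summing only the meters and seconds columns from the sample CSV above):
import pandas as pd

orders = pd.read_csv('complete.csv', delimiter=',', encoding='ISO-8859-1')
orders['datetime'] = pd.to_datetime(orders['datetime'])

# Group by calendar date and sum only the two columns of interest
daily = orders.groupby(orders['datetime'].dt.date)[['meters', 'seconds']].sum()
print(daily)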
I currently have a very large .csv with 2 million rows. I've read in the csv and only have 2 columns, number and timestamp (in unix). My goal is to grab the last and largest number for each day (e.g. 1/1/2021, 1/2/2021, etc.)
I have converted unix to datetime and used df.groupby('timestamp').tail(1) but am still not able to return the last row per day. Am I using the groupby wrong?
import pandas as pd

def main():
    df = pd.read_csv('blocks.csv', usecols=['number', 'timestamp'])
    print(df.head())
    df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
    x = df.groupby('timestamp').tail(1)
    print(x)

if __name__ == '__main__':
    main()
Desired Output:
number timestamp
11,509,218 2021-01-01
11,629,315 2021-01-02
11,782,116 2021-01-03
12,321,123 2021-01-04
...
The "problem" lies in the grouper, use .dt.date for correct grouping (assuming your data is already sorted):
x = df.groupby(df['timestamp'].dt.date).tail(1)
print(x)
It doesn't seem like you're specifying the aggregation function or the aggregation frequency (hour, day, minute?).
My take would be something along the lines of
df.resample("D", on="timestamp").max()
There's a couple of ways to group by time, alternatively
df.groupby(pd.Grouper(key='timestamp', freq='D', sort=True)).max()
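A quick sketch on made-up data (a hypothetical three-row frame) showing that either approach yields one row per day:
import pandas as pd

df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2021-01-01 10:00', '2021-01-01 23:59',
                                 '2021-01-02 08:30']),
    'number': [11509000, 11509218, 11629315],
})
# One row per calendar day, keeping the largest number in each
print(df.resample('D', on='timestamp').max())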
Regards
I am trying to format my time data to be displayed in hours:minutes:seconds (e.g. 36:30:30). The main goal is to be able to aggregate the times so that totals can be displayed in number of hours. I do not want to have totals in number of days.
My time data starts as strings, in the format "HH:MM:SS". With pandas, I convert these to timedelta values using:
df["date column"] = pd.to_timedelta(df["date column"])
There is one record that is "24:00:00", but the above line of code gives that as "1 day".
Is there a way to display this time as 24:00:00?
IIUC, we can use np.timedelta64 to change your timedelta object into a numerical representation of itself.
import pandas as pd
import numpy as np

df = pd.DataFrame({'hours': ['34:00:00', '23:45:22', '11:00:11']})
# Dividing a timedelta by one hour yields fractional hours as floats
hours = pd.to_timedelta(df['hours']) / np.timedelta64(1, 'h')
print(hours)
0 34.000000
1 23.756111
2 11.003056
Name: hours, dtype: float64
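If you'd rather display the totals as HH:MM:SS with hours allowed to run past 24, a small sketch (hypothetical fmt_hours helper, not part of pandas) can format the timedelta manually:
import pandas as pd

def fmt_hours(td):
    # Split total seconds into hours, minutes, seconds; hours may exceed 24
    total = int(td.total_seconds())
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"

print(fmt_hours(pd.to_timedelta("24:00:00")))  # prints 24:00:00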
What I need to do is basically count the responses received per day, i.e.:
07/07/2019 | 6
08/07/2019 | 7
And plot the above to a graph.
But the current data is in the below format:
07/07/2019 17:33:07
07/07/2019 12:00:03
08/07/2019 21:10:05
08/07/2019 20:06:09
So far I have:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('survey_results_public.csv')
df.head()
df['Timestamp'].value_counts().plot(kind="bar")
plt.show()
But the above doesn't look good.
You are counting all values in the timestamp column so you will have 1 response per timestamp.
You should parse the timestamp column, check the unique dates and then count the number of timestamps that belong to each date.
Only then should you plot the data.
So do something like this:
import pandas as pd
import datetime

def parse_timestamps(timestamp):
    # Parse "DD/MM/YYYY HH:MM:SS" strings into datetime objects
    return datetime.datetime.strptime(timestamp, '%d/%m/%Y %H:%M:%S')

df = pd.read_csv('survey_results_public.csv')
df["Date"] = df["Timestamp"].map(lambda t: parse_timestamps(t).date())
df["Date"].value_counts().plot(kind="bar")
I am currently writing a "Split - Apply - Combine" pipeline for my data analysis, which also involves dates. Here's some sample data:
In [1]:
import pandas as pd
import numpy as np
import datetime as dt
startdate = np.datetime64("2018-01-01")
randdates = np.random.randint(1, 365, 100) + startdate
df = pd.DataFrame({'Type': np.random.choice(['A', 'B', 'C'], 100),
                   'Metric': np.random.rand(100),
                   'Date': randdates})
df.head()
Out[1]:
Type Metric Date
0 A 0.442970 2018-08-02
1 A 0.611648 2018-02-11
2 B 0.202763 2018-03-16
3 A 0.295577 2018-01-09
4 A 0.895391 2018-11-11
Now I want to aggregate by 'Type' and get summary statistics for the respective variables. This is easy for numerical variables like 'Metric':
df.groupby('Type')['Metric'].agg(('mean', 'std'))
For datetime objects however, calculating a mean, standard deviation, or other statistics doesn't really make sense and throws an error. The context I need this operation for, is that I am modelling a Date based on some distance metric. When I repeat this modelling with random sampling (monte-carlo simulation), I later want to reassign a mean and confidence interval to the modeled dates.
So my Question is: What useful statistics can be built with datetime data? How do you represent the statistical distribution of modelled dates? And how do you implement the aggregation operation?
My Ideal output would be to get a Date_mean and Date_stdev column representing a range for my modeled dates.
You can use Unix timestamps.
Epoch time, also known as a Unix timestamp, is the number of seconds (not milliseconds!) that have elapsed since January 1, 1970 at 00:00:00 GMT (1970-01-01 00:00:00 GMT).
You can convert all your dates to timestamps like this:
import time
import datetime
d = "2018-08-02"
time.mktime(datetime.datetime.strptime(d, "%Y-%m-%d").timetuple()) #1533160800
And from there you can calculate what you need.
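For instance, a sketch of a mean date computed this way (sample dates borrowed from the question's output):
import time
import datetime
import statistics

dates = ["2018-08-02", "2018-02-11", "2018-03-16"]
ts = [time.mktime(datetime.datetime.strptime(d, "%Y-%m-%d").timetuple())
      for d in dates]

# Average the epoch seconds, then convert back to a calendar date
mean_date = datetime.datetime.fromtimestamp(statistics.mean(ts)).date()
print(mean_date)  # roughly 2018-04-20, depending on local timezone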
You can compute min, max, and mean using the built-in operations of the datetime:
date = dt.datetime.date
df.groupby('Type')['Date'].agg(lambda x:(date(x.mean()), date(x.min()), date(x.max())))
Out[490]:
Type
A (2018-06-10, 2018-01-11, 2018-11-08)
B (2018-05-20, 2018-01-20, 2018-12-31)
C (2018-06-22, 2018-01-04, 2018-12-05)
Name: Date, dtype: object
I used date(x) to make sure the output fits here, it's not really needed.
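To get the Date_mean and Date_stdev columns you describe, one sketch (assuming the df built above) converts the datetimes to integer nanoseconds, aggregates numerically, and converts back:
import pandas as pd

# datetime64[ns] -> integer nanoseconds since the epoch
stats = (df.assign(ns=df['Date'].astype('int64'))
           .groupby('Type')['ns'].agg(['mean', 'std']))

stats['Date_mean'] = pd.to_datetime(stats['mean'])    # centre as a date
stats['Date_stdev'] = pd.to_timedelta(stats['std'])   # spread as a timedelta
print(stats[['Date_mean', 'Date_stdev']])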
I have a pandas dataframe that records times of events that occur from today's 08:00 AM to tomorrow's 07:00 AM each day (therefore, I don't want to add date values, to save storage and keep the data simple to maintain). So, it looks like this:
>>> df.Time[63010:]
63010 23:59:59.431256 # HH:MM:SS.ffffff
63011 23:59:59.431256
63012 23:59:59.431256
63013 23:59:59.431256
63014 23:59:59.431256
63015 23:59:59.618764
63016 23:59:59.821756
63017 23:59:59.821756
63018 23:59:59.821756
63019 23:59:59.821756
63020 00:00:00.025058 # date changes here
63021 00:00:00.025058
63022 00:00:00.025058
63023 00:00:00.228202
63024 00:00:00.228202
63025 00:00:00.228202
63026 00:00:00.228202
.....
I want to make a new dataframe that records time intervals between each event, so I tried:
>>> TimeDiff = df.Time.diff(periods=1)
But it gets a value that I don't intend to get, which is:
63018 00:00:00
63019 00:00:00
63020 -1 days +00:00:00.203302 <-- -1 days?
63021 00:00:00
63022 00:00:00
I know that it happens because I don't have date values. How can I fix this problem without adding dates?
If you know that your error is due to missing date values, then you should try pandas' built-in function to_datetime:
Example: df['date_col'] = pd.to_datetime(df['date_col'])
You can also adjust the format of the date by adding a format argument, like so:
Example: df['date_col'] = pd.to_datetime(df['date_col'], format="%m/%d/%Y")
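If you want to keep the times date-free and only repair the negative interval at the midnight rollover, a sketch (assuming the Time column from the question) wraps any negative difference forward by a day:
import pandas as pd

# HH:MM:SS.ffffff strings -> timedeltas, then consecutive differences
TimeDiff = pd.to_timedelta(df['Time']).diff()

# A negative diff means the clock wrapped past midnight; add 24 hours there
TimeDiff = TimeDiff.mask(TimeDiff < pd.Timedelta(0), TimeDiff + pd.Timedelta(days=1))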