Python pandas groupby manipulation?

Python pandas groupby manipulation? - python

Examples of how the df looks like:
customer order_datetime
a 01-03-2017 12:00:00 PM
b 01-04-2017 12:00:00 PM
c 01-07-2017 12:00:00 PM
a 01-08-2017 12:00:00 PM
b 01-09-2017 12:00:00 PM
a 01-11-2017 12:00:00 PM
There's 2 thing that I wanted to achieve but I'm still in the learning process, really appreciate any help to guide me in the right direction.
Create a list of "time between orders" where I can find the min, mean, max
Find out if "time between order" gets faster/slower, i.e. time between order_3 and order_2 vs time between order_2 and order_1

This example should set you in the right direction for your assignment.
First I'm creating a DataFrame similar to the one you show in the question:
import pandas as pd
import numpy as np
import datetime as dt
orders = pd.DataFrame({
'client': np.random.randint(65, 70, size=15),
'date': np.random.randint(0, 30, size=15)})
orders.client = orders.client.apply(chr)
orders.date = orders.date.apply(
pd.to_datetime, unit='d', origin=dt.date(2017, 1, 1), box=False)
# Sorting here is not necessary, just for visualization
orders.sort_values(['client', 'date'], inplace=True)
orders.reset_index(inplace=True, drop=True)
orders.head()
>>>>
client date
0 A 2017-01-27
1 A 2017-01-29
2 A 2017-01-30
3 B 2017-01-03
4 B 2017-01-13
The key to the solution is in the line orders.groupby('client').date.apply(pd.Series.sort_values).diff().
First we use groupby to group the orders using client as a key, then we select the date column only and sort the dates in each group with pd.Series.sort_values, finally we use diff to compute the difference of each record with the following one (here's why the dates in each group must be sorted).
The rest of the code is just to visualize the result, i.e. renaming the Series you obtain and concatenating it with the initial DataFrame.
diff_df = pd.concat([
orders,
orders.groupby('client').date.diff().rename('diff')], axis=1)
diff_df.head(10)
>>>>
client date diff
0 A 2017-01-27 NaT
1 A 2017-01-29 2 days
2 A 2017-01-30 1 days
3 B 2017-01-03 NaT
4 B 2017-01-13 10 days
5 B 2017-01-18 5 days
6 B 2017-01-24 6 days
7 C 2017-01-01 NaT
8 C 2017-01-02 1 days
9 C 2017-01-03 1 days
Once you have the time differences you can compute all kinds of in-group metrics you need.
First you can try pd.Series.describe:
diff_df.groupby('client').diff.describe()
>>>>
count mean std min \
client
A 1 5 days 00:00:00 NaT 5 days 00:00:00
B 1 12 days 00:00:00 NaT 12 days 00:00:00
C 3 4 days 00:00:00 1 days 17:34:09.189773 2 days 00:00:00
D 1 4 days 00:00:00 NaT 4 days 00:00:00
E 4 5 days 00:00:00 3 days 03:53:40.789838 2 days 00:00:00
25% 50% 75% max
client
A 5 days 00:00:00 5 days 00:00:00 5 days 00:00:00 5 days 00:00:00
B 12 days 00:00:00 12 days 00:00:00 12 days 00:00:00 12 days 00:00:00
C 3 days 12:00:00 5 days 00:00:00 5 days 00:00:00 5 days 00:00:00
D 4 days 00:00:00 4 days 00:00:00 4 days 00:00:00 4 days 00:00:00
E 2 days 18:00:00 4 days 12:00:00 6 days 18:00:00 9 days 00:00:00
If that is not enough you can define your own aggregations.
You will need a list of functions if you work on a single Series:
metrics = [pd.Series.count, pd.Series.min, pd.Series.max, pd.Series.mean]
diff_df.groupby('client').diff.aggregate(metrics)
>>>>
count nunique min max mean
client
A 1 1 5 days 5 days 5 days
B 1 1 12 days 12 days 12 days
C 3 2 2 days 5 days 4 days
D 1 1 4 days 4 days 4 days
E 4 4 2 days 9 days 5 days
Or a dictionary of of {column -> function, column -> function_list} if you work on the whole DataFrame:
metrics = {
'date': [pd.Series.count, pd.Series.nunique],
'diff': [pd.Series.min, pd.Series.max, pd.Series.mean],
}
diff_df.groupby('client').aggregate(metrics)
>>>>
diff date
min max mean count nunique
client
A 5 days 5 days 5 days 2 2
B 12 days 12 days 12 days 2 2
C 2 days 5 days 4 days 4 4
D 4 days 4 days 4 days 2 2
E 2 days 9 days 5 days 5 5

Related

int64 to HHMM string

I have the following data frame where the column hour shows hours of the day in int64 form. I'm trying to convert that into a time format; so that hour 1 would show up as '01:00'. I then want to add this to the date column and convert it into a timestamp index.
Using the datetime function in pandas resulted in the column "hr2", which is not what I need. I'm not sure I can even apply datetime directly, as the original data (i.e. in column "hr") is not really a date time format to begin with. Google searches so far have been unproductive.

While I am still in the dark concerning the format of your date column. I will assume the Date column is a string object and the hr column is an int64 object. To create the column TimeStamp in pandas tmestamp format this is how I would proceed>
Given df:
Date Hr
0 12/01/2010 1
1 12/01/2010 2
2 12/01/2010 3
3 12/01/2010 4
4 12/02/2010 1
5 12/02/2010 2
6 12/02/2010 3
7 12/02/2010 4
df['TimeStamp'] = df.apply(lambda row: pd.to_datetime(row['Date']) + pd.to_timedelta(row['Hr'], unit='H'), axis = 1)
yields:
Date Hr TimeStamp
0 12/01/2010 1 2010-12-01 01:00:00
1 12/01/2010 2 2010-12-01 02:00:00
2 12/01/2010 3 2010-12-01 03:00:00
3 12/01/2010 4 2010-12-01 04:00:00
4 12/02/2010 1 2010-12-02 01:00:00
5 12/02/2010 2 2010-12-02 02:00:00
6 12/02/2010 3 2010-12-02 03:00:00
7 12/02/2010 4 2010-12-02 04:00:00
The timestamp column can then be used as your index.

How can I join columns by DatetimeIndex, matching day, month and hour from data from different years?

I have a dataset with meteorological features for 2019, to which I want to join two columns of power consumption datasets for 2017, 2018. I want to match them by hour, day and month, but the data belongs to different years. How can I do that?
The meteo dataset is a 6 column similar dataframe with datetimeindexes belonging to 2019.

You can from the index 3 additional columns that represent the hour, day and month and use them for a later join. DatetimeIndex has attribtues for different parts of the timestamp:
import pandas as pd
ind = pd.date_range(start='2020-01-01', end='2020-01-20', periods=10)
df = pd.DataFrame({'number' : range(10)}, index = ind)
df['hour'] = df.index.hour
df['day'] = df.index.day
df['month'] = df.index.month
print(df)
number hour day month
2020-01-01 00:00:00 0 0 1 1
2020-01-03 02:40:00 1 2 3 1
2020-01-05 05:20:00 2 5 5 1
2020-01-07 08:00:00 3 8 7 1
2020-01-09 10:40:00 4 10 9 1
2020-01-11 13:20:00 5 13 11 1
2020-01-13 16:00:00 6 16 13 1
2020-01-15 18:40:00 7 18 15 1
2020-01-17 21:20:00 8 21 17 1
2020-01-20 00:00:00 9 0 20 1

Finding historical seasonal average for given month in a monthly series in a dataframe time-series

I have a dataframe (snippet below) with index in format YYYYMM and several columns of values, including one called "month" in which I've extracted the MM data from the index column.
index st us stu px month
0 202001 2616757.0 3287969.0 0.795858 2.036 01
1 201912 3188693.0 3137911.0 1.016183 2.283 12
2 201911 3610052.0 2752828.0 1.311398 2.625 11
3 201910 3762043.0 2327289.0 1.616492 2.339 10
4 201909 3414939.0 2216155.0 1.540930 2.508 09
What I want to do is make a new column called 'stavg' which takes the 5-year average of the 'st' column for the given month. For example, since the top row refers to 202001, the stavg for that row should be the average of the January values from 2019, 2018, 2017, 2016, and 2015. Going back in time by each additional year should pull the moving average back as well, such that stavg for the row for, say, 201205 should show the average of the May values from 2011, 2010, 2009, 2008, and 2007.
index st us stu px month stavg
0 202001 2616757.0 3287969.0 0.795858 2.036 01 xxx
1 201912 3188693.0 3137911.0 1.016183 2.283 12 xxx
2 201911 3610052.0 2752828.0 1.311398 2.625 11 xxx
3 201910 3762043.0 2327289.0 1.616492 2.339 10 xxx
4 201909 3414939.0 2216155.0 1.540930 2.508 09 xxx
I know how to generate new columns of data based on operations on other columns on the same row (such as dividing 'st' by 'us' to get 'stu' and extracting digits from index to get 'month') but this notion of creating a column of data based on previous values is really stumping me.
Any clues on how to approach this would be greatly appreciated!! I know that for the first five years of data, I won't be able to populate the 'stavg' column with anything, which is fine--I could use NaN there.

Try defining a function and using apply method
df['year'] = (df['index'].astype(int)/100).astype(int)
def get_stavg(df, year, month):
# get year from index
df_year_month = df.query('#year - 5 <= year < #year and month == #month')
return df_year_month.st.mean()
df['stavg'] = df.apply(lambda x: get_stavg(df, x['year'], x['month']), axis=1)

If you are looking for a pandas only solution you could do something like
Dummy Data
Here we create a dummy datasets with 10 years of data with only two months (Jan and Feb).
import pandas as pd
df1 = pd.DataFrame({"date":pd.date_range("2010-01-01", periods=10, freq="AS-JAN")})
df2 = pd.DataFrame({"date":pd.date_range("2010-01-01", periods=10, freq="AS-FEB")})
df1["n"] = df1.index*2
df2["n"] = df2.index*3
df = pd.concat([df1, df2]).sort_values("date").reset_index(drop=True)
df.head(10)
date n
0 2010-01-01 0
1 2010-02-01 0
2 2011-01-01 2
3 2011-02-01 3
4 2012-01-01 4
5 2012-02-01 6
6 2013-01-01 6
7 2013-02-01 9
8 2014-01-01 8
9 2014-02-01 12
Groupby + rolling mean
df["n_mean"] = df.groupby(df["date"].dt.month)["n"]\
.rolling(5).mean()\
.reset_index(0,drop=True)
date n n_mean
0 2010-01-01 0 NaN
1 2010-02-01 0 NaN
2 2011-01-01 2 NaN
3 2011-02-01 3 NaN
4 2012-01-01 4 NaN
5 2012-02-01 6 NaN
6 2013-01-01 6 NaN
7 2013-02-01 9 NaN
8 2014-01-01 8 4.0
9 2014-02-01 12 6.0
10 2015-01-01 10 6.0
11 2015-02-01 15 9.0
12 2016-01-01 12 8.0
13 2016-02-01 18 12.0
14 2017-01-01 14 10.0
15 2017-02-01 21 15.0
16 2018-01-01 16 12.0
17 2018-02-01 24 18.0
18 2019-01-01 18 14.0
19 2019-02-01 27 21.0
By definition for the first 4 years the result is NaN.
Update
For your particular case
import pandas as pd
index = [f"{y}01" for y in range(2010, 2020)] +\
[f"{y}02" for y in range(2010, 2020)]
df = pd.DataFrame({"index":index})
df["st"] = df.index + 1
# dates/ index should be sorted
df = df.sort_values("index").reset_index(drop=True)
# extract month
df["month"] = df["index"].str[-2:]
df["st_mean"] = df.groupby("month")["st"]\
.rolling(5).mean()\
.reset_index(0,drop=True)

How to get difference of time between two datetime columns pandas?

I have two different columns in my dataset,
start end
0 2015-01-01 2017-01-01
1 2015-01-02 2015-06-02
2 2015-01-03 2015-12-03
3 2015-01-04 2020-11-25
4 2015-01-05 2025-07-27
I want the difference between start and end in a specific way, here's my desired output.
year_diff month_diff
2 1
0 6
0 12
5 11
10 7
Here the day is not important to me, only month and year. I've tried to period to get diff but it returns just different in months only. how can I achieve my desired output?
df['end'].dt.to_period('M') - df['start'].dt.to_period('M'))

Try:
df["year_diff"]=df["end"].dt.year.sub(df["start"].df.year)
df["month_diff"]=df["end"].dt.month.sub(df["start"].df.month)

This solution assumes that the number of days that make up a year (365) and a month (30) are constant. If the datetimes are strings, convert them into a datetime object. In a Pandas DataFrame this can be done like so
def to_datetime(dataframe):
new_dataframe = pd.DataFrame()
new_dataframe[0] = pd.to_datetime(dataframe[0], format="%Y-%m-%d")
new_dataframe[1] = pd.to_datetime(dataframe[1], format="%Y-%m-%d")
return new_dataframe
Next, column 1 can be subtracted from column 0 to give the difference in days. We can divide this number by 365 using the // operator to get the number of whole years. We can get the number of remaining days using the % operator and divide this by 30 using the // operator the get the number of whole months.
def get_time_diff(dataframe):
dataframe[2] = dataframe[1] - dataframe[0]
diff_dataframe = pd.DataFrame(columns=["year_diff", "month_diff"])
for i in range(0, dataframe.index.stop):
year_diff = dataframe[2][i].days // 365
month_diff = (dataframe[2][i].days % 365) // 30
diff_dataframe.loc[i] = [year_diff, month_diff]
return diff_dataframe
An example output from using these functions would be
start end days_diff year_diff month_diff
0 2019-10-15 2021-08-11 666 days 1 10
1 2020-02-11 2022-10-13 975 days 2 8
2 2018-12-17 2020-09-16 639 days 1 9
3 2017-01-03 2017-01-28 25 days 0 0
4 2019-12-21 2022-03-10 810 days 2 2
5 2018-08-08 2019-05-07 272 days 0 9
6 2017-06-18 2020-08-01 1140 days 3 1
7 2017-11-14 2020-04-17 885 days 2 5
8 2019-08-19 2020-05-10 265 days 0 8
9 2018-05-05 2020-09-08 857 days 2 4
Note: This will give the number of whole years and months. Hence, if there is a remainder of 29 days, one day short from a month, this will not be counted.

split, groupby, combine in Pandas to find a difference in dates

I have a simple dataframe that looks like this:
I would like to use groupby to group by id, then find some way to difference the dates, and then column bind them back to the dataframe, so I end up with this:
The groupby is straightforward,
grouped = DF.groupby('id')
and finding the earliest date is straightforward,
maxdates = grouped['date'].min()
But I'm not sure how to proceed. How do I apply the date subtraction operation, then combine?
There is a similar question here.
Thanks for reading this far.
My dataframe is:
dates=pd.to_datetime(['2015-01-01', '2015-02-01', '2015-03-01', '2015-04-01', '2015-05-01', '2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04', '2015-01-05'])
DF = DataFrame({'id':[1,1,1,1,1,2,2,2,2,2], 'date':dates})
cols = ['id', 'date']
DF=DF[cols]
EDIT:
Both answers below are awesome. I wish I could accept them both.

You can use apply like this:
earliest_by_id = DF.groupby('id')['date'].min()
def since_earliest(row):
return row.date - earliest_by_id[row.id]
DF['days_since_earliest'] = DF.apply(since_earliest, axis=1)
print(DF)
id date days_since_earliest
0 1 2015-01-01 0 days
1 1 2015-02-01 31 days
2 1 2015-03-01 59 days
3 1 2015-04-01 90 days
4 1 2015-05-01 120 days
5 2 2015-01-01 0 days
6 2 2015-01-02 1 days
7 2 2015-01-03 2 days
8 2 2015-01-04 3 days
9 2 2015-01-05 4 days
edit:
DF['days_since_earliest'] = DF.apply(since_earliest, axis=1).astype('timedelta64[D]')
print(DF)
id date days_since_earliest
0 1 2015-01-01 0
1 1 2015-02-01 31
2 1 2015-03-01 59
3 1 2015-04-01 90
4 1 2015-05-01 120
5 2 2015-01-01 0
6 2 2015-01-02 1
7 2 2015-01-03 2
8 2 2015-01-04 3
9 2 2015-01-05 4

FWIW, using transform can often be simpler (and usually faster) than apply. transform takes the results of a groupby operation and broadcasts it up to the original index:
>>> df["dse"] = df["date"] - df.groupby("id")["date"].transform(min)
>>> df
id date dse
0 1 2015-01-01 0 days
1 1 2015-02-01 31 days
2 1 2015-03-01 59 days
3 1 2015-04-01 90 days
4 1 2015-05-01 120 days
5 2 2015-01-01 0 days
6 2 2015-01-02 1 days
7 2 2015-01-03 2 days
8 2 2015-01-04 3 days
9 2 2015-01-05 4 days
If you'd prefer integer days instead of timedelta objects, you can use the dt.days accessor:
>>> df["dse"] = df["dse"].dt.days
>>> df
id date dse
0 1 2015-01-01 0
1 1 2015-02-01 31
2 1 2015-03-01 59
3 1 2015-04-01 90
4 1 2015-05-01 120
5 2 2015-01-01 0
6 2 2015-01-02 1
7 2 2015-01-03 2
8 2 2015-01-04 3
9 2 2015-01-05 4

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python pandas groupby manipulation? - python

Related

int64 to HHMM string

How can I join columns by DatetimeIndex, matching day, month and hour from data from different years?

Finding historical seasonal average for given month in a monthly series in a dataframe time-series

How to get difference of time between two datetime columns pandas?

split, groupby, combine in Pandas to find a difference in dates

Categories

Resources