Python Pandas - Difference between groupby keys with repeated values

I have some data with dates of sales to my clients.
The data looks like this:
   Cod client  Items        Date
0         100      1  2022/01/01
1         100      7  2022/01/01
2         100      2  2022/02/01
3         101      5  2022/01/01
4         101      8  2022/02/01
5         101     10  2022/02/01
6         101      2  2022/04/01
7         101      2  2022/04/01
8         102      4  2022/02/01
9         102     10  2022/03/01
What I'm trying to accomplish is to calculate the differences between dates for each client: grouped first by "Cod client" and then by "Date" (because of the duplicates).
The expected result looks like this:
   Cod client  Items        Date Date diff  Explain
0         100      1  2022/01/01       NaT  First date for client 100
1         100      7  2022/01/01       NaT  ...repeat above
2         100      2  2022/02/01        31  Diff from first date 2022/01/01
3         101      5  2022/01/01       NaT  First date for client 101
4         101      8  2022/02/01        31  Diff from first date 2022/01/01
5         101     10  2022/02/01        31  ...repeat above
6         101      2  2022/04/01        59  Diff from previous date 2022/02/01
7         101      2  2022/04/01        59  ...repeat above
8         102      4  2022/02/01       NaT  First date for client 102
9         102     10  2022/03/01        28  Diff from first date 2022/02/01
I already tried df["Date diff"] = df.groupby("Cod client")["Date"].diff(), but it treats the repeated dates as separate rows and returns zeroes for them.
I appreciate any help!

IIUC, you can combine several groupby operations:
import pandas as pd
# ensure datetime
df['Date'] = pd.to_datetime(df['Date'])
# set up group
g = df.groupby('Cod client')
# identify dates already seen for the same client; DataFrame.duplicated keeps the original row index
m = df.duplicated(['Cod client', 'Date'])
# compute the diff, mask the duplicated rows and forward-fill within each client
df['Date diff'] = g['Date'].diff().mask(m).groupby(df['Cod client']).ffill()
output:
Cod client Items Date Date diff
0 100 1 2022-01-01 NaT
1 100 7 2022-01-01 NaT
2 100 2 2022-02-01 31 days
3 101 5 2022-01-01 NaT
4 101 8 2022-02-01 31 days
5 101 10 2022-02-01 31 days
6 101 2 2022-04-01 59 days
7 101 2 2022-04-01 59 days
8 102 4 2022-02-01 NaT
9 102 10 2022-03-01 28 days
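
For comparison, the same result can be reached by taking the diff over the distinct (client, date) pairs and merging it back onto every row. This is only a minimal sketch, not the answerer's code: it rebuilds the question's sample data inline, and the names unique_dates and result are illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Cod client": [100, 100, 100, 101, 101, 101, 101, 101, 102, 102],
    "Items": [1, 7, 2, 5, 8, 10, 2, 2, 4, 10],
    "Date": ["2022/01/01", "2022/01/01", "2022/02/01", "2022/01/01",
             "2022/02/01", "2022/02/01", "2022/04/01", "2022/04/01",
             "2022/02/01", "2022/03/01"],
})
df["Date"] = pd.to_datetime(df["Date"], format="%Y/%m/%d")

# Keep one row per (client, date) pair, sorted so diff() sees dates in order
unique_dates = (
    df[["Cod client", "Date"]]
    .drop_duplicates()
    .sort_values(["Cod client", "Date"])
)
# Difference to the previous distinct date of the same client
unique_dates["Date diff"] = unique_dates.groupby("Cod client")["Date"].diff()

# Broadcast the per-date diff back to every original row (left merge keeps row order)
result = df.merge(unique_dates, on=["Cod client", "Date"], how="left")
print(result)
```

Because the diff is computed on de-duplicated dates, no masking or forward-filling is needed; the cost is an extra merge.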

Another way to do this, with transform:
import pandas as pd
# data saved as .csv
df = pd.read_csv("Data.csv", header=0, parse_dates=True)
# convert the Date column to datetime; the sample dates are year/month/day
df["Date"] = pd.to_datetime(df["Date"], format="%Y/%m/%d")
# new column: per-client diff, zero gaps (repeated dates) become NaT, then forward-fill
df["Date diff"] = df.sort_values("Date").groupby("Cod client")["Date"].transform(lambda x: x.diff().replace(pd.Timedelta(0), pd.NaT).ffill())

Related

Issues with date format in pandas

I am working with a dataset that contains dates in day-first (D/M/Y) format.
When I load the dataset into a pandas DataFrame and convert the column to datetime, the dates get mixed up.
Example: the first date in the dataset is written as 11/04/2015, which means the 11th of April 2015. But when I convert the column to datetime and sort the data frame by date, the first date is 01/08/2015, which is incorrect. How can I convert the column to datetime without the dates getting mixed up?
dataset example :
IDX_CUSTOMER_ITEM_CODE IDX_COMPANY QtySold TotalOnHand Date
0 131 1 3 26 11/04/2015
1 134 1 3 17 11/04/2015
2 137 1 3 114 11/04/2015
3 140 1 3 18 11/04/2015
4 179 1 1 21 11/04/2015
... ... ... ... ... ...
1048570 1059 10 0 23 04/03/2017
1048571 1075 10 3 14 04/03/2017
1048572 2135 10 2 4 04/03/2017
1048573 1035 10 2 3 04/03/2017
1048574 1038 10 0 5 04/03/2017
The first date is the 11th of April 2015 and the last is the 4th of March 2017.
When I do:
transactions['Date'] = pd.to_datetime(transactions['Date'])
The oldest date becomes 01/08/2015 and the latest 31/12/2016, which is incorrect. So I tried:
transactions['Date'] = pd.to_datetime(transactions['Date'], format = '%dd-%mm-%yy')
Got the following error:
time data '11/04/2015' does not match format '%dd-%mm-%yy' (match)

You can also use the dayfirst parameter:
pd.to_datetime(df['Date'], dayfirst=True)
Output:
0 2015-04-11
1 2015-04-11
2 2015-04-11
3 2015-04-11
4 2015-04-11
5 2017-03-04
6 2017-03-04
7 2017-03-04
8 2017-03-04
9 2017-03-04

Your format is wrong. You can refer to the Python strftime reference for the meaning of each % code.
transactions['Date'] = pd.to_datetime(transactions['Date'], format = '%d/%m/%Y')
print(transactions['Date'])
0 2015-04-11
1 2015-04-11
2 2015-04-11
3 2015-04-11
4 2015-04-11
5 2017-03-04
6 2017-03-04
7 2017-03-04
8 2017-03-04
9 2017-03-04
Name: Date, dtype: datetime64[ns]
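
To see why the format string matters, the sketch below parses a few date strings under both interpretations. It is only an illustration: the first two strings come from the question's sample, while "31/12/2016" is added here as an example of a date that cannot be read month-first.

```python
import pandas as pd

raw = pd.Series(["11/04/2015", "04/03/2017", "31/12/2016"])

# Day-first reading: every string is a valid date
day_first = pd.to_datetime(raw, format="%d/%m/%Y")

# Month-first reading: the first two strings parse to different dates,
# and "31/12/2016" has no month 31, so errors="coerce" turns it into NaT
month_first = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

print(pd.DataFrame({"raw": raw, "day_first": day_first, "month_first": month_first}))
```

Passing an explicit format avoids relying on how ambiguous strings such as 11/04/2015 are guessed.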

How to divide a pandas dataframe into several dataframes by month and year

I have a dataframe with different columns (like price, id, product and date) and I need to divide this dataframe into several dataframes based on the current date of the system (current_date = np.datetime64(date.today())).
For example, if today is 2020-02-07 I want to divide my main dataframe into three different ones where df1 would be the data of the last month (data of 2020-01-07 to 2020-02-07), df2 would be the data of the last three months (excluding the month already in df1 so it would be more accurate to say from 2019-10-07 to 2020-01-07) and df3 would be the data left on the original dataframe.
Is there an easy way to do this? Also, I've been trying to use Grouper but I keep getting this error over and over again: NameError: name 'Grouper' is not defined (my Pandas version is 0.24.2)

You can use offsets.DateOffset to get the datetimes one month and three months back, then filter by boolean indexing:
rng = pd.date_range('2019-10-10', periods=20, freq='5d')
df = pd.DataFrame({'date': rng, 'id': range(20)})
print (df)
date id
0 2019-10-10 0
1 2019-10-15 1
2 2019-10-20 2
3 2019-10-25 3
4 2019-10-30 4
5 2019-11-04 5
6 2019-11-09 6
7 2019-11-14 7
8 2019-11-19 8
9 2019-11-24 9
10 2019-11-29 10
11 2019-12-04 11
12 2019-12-09 12
13 2019-12-14 13
14 2019-12-19 14
15 2019-12-24 15
16 2019-12-29 16
17 2020-01-03 17
18 2020-01-08 18
19 2020-01-13 19
current_date = pd.to_datetime('now').floor('d')
print (current_date)
2020-02-07 00:00:00
last1m = current_date - pd.DateOffset(months=1)
last3m = current_date - pd.DateOffset(months=3)
m1 = (df['date'] > last1m) & (df['date'] <= current_date)
m2 = (df['date'] > last3m) & (df['date'] <= last1m)
#filter non match m1 or m2 masks
m3 = ~(m1 | m2)
df1 = df[m1]
df2 = df[m2]
df3 = df[m3]
print (df1)
date id
18 2020-01-08 18
19 2020-01-13 19
print (df2)
date id
6 2019-11-09 6
7 2019-11-14 7
8 2019-11-19 8
9 2019-11-24 9
10 2019-11-29 10
11 2019-12-04 11
12 2019-12-09 12
13 2019-12-14 13
14 2019-12-19 14
15 2019-12-24 15
16 2019-12-29 16
17 2020-01-03 17
print (df3)
date id
0 2019-10-10 0
1 2019-10-15 1
2 2019-10-20 2
3 2019-10-25 3
4 2019-10-30 4
5 2019-11-04 5
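
The masks above can be wrapped in a small helper so the split is reproducible for any reference date. This is a sketch under assumptions: split_by_recency and its date_col parameter are hypothetical names, and the reference date is fixed to 2020-02-07 instead of being read from the system clock.

```python
import pandas as pd

def split_by_recency(df, current_date, date_col="date"):
    """Return (last month, the two months before that, everything else)."""
    last1m = current_date - pd.DateOffset(months=1)
    last3m = current_date - pd.DateOffset(months=3)
    in_last_month = (df[date_col] > last1m) & (df[date_col] <= current_date)
    in_prior_months = (df[date_col] > last3m) & (df[date_col] <= last1m)
    # Rows matching neither window (older data or future dates)
    rest = ~(in_last_month | in_prior_months)
    return df[in_last_month], df[in_prior_months], df[rest]

# Same sample frame as in the answer above
rng = pd.date_range("2019-10-10", periods=20, freq="5D")
df = pd.DataFrame({"date": rng, "id": range(20)})

df1, df2, df3 = split_by_recency(df, pd.Timestamp("2020-02-07"))
print(len(df1), len(df2), len(df3))
```

As for the NameError in the question, it most likely means Grouper was referenced without the pandas namespace; it is available as pd.Grouper.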

Python Pandas replace all values if index is larger than a date

I am looking for some help on a pandas data frame.
I have a data frame with the following structure
Date(indexed) Total Clients Sales Headcount Total Products
2019-11-01 1005 5 4
2019-12-01 1033 5 5
2020-01-01 1045 10 6
2020-02-01 1124 10 10
2020-03-01 1199 10 11
How can I fill in the column total products with 0's if the date is after 2020-01-01?
Expected outcome:
Date(indexed) Total Clients Sales Headcount Total Products
2019-11-01 1005 5 4
2019-12-01 1033 5 5
2020-01-01 1045 10 6
2020-02-01 1124 10 0
2020-03-01 1199 10 0

Make sure that your date column contains timestamps.
# Assuming `Date(indexed)` means that this column is the index of the dataframe.
df.index = pd.to_datetime(df.index)
Then use .loc to set all values from and including 2020 to zero.
df.loc['2020':, 'Total Products'] = 0
>>> df
Total Clients Sales Headcount Total Products
Date
2019-11-01 1005 5 4
2019-12-01 1033 5 5
2020-01-01 1045 10 0
2020-02-01 1124 10 0
2020-03-01 1199 10 0
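
Note that the slice '2020': also zeroes the 2020-01-01 row, whereas the expected outcome in the question keeps it. If only dates strictly after 2020-01-01 should change, a boolean mask on the index is one option. A minimal sketch, assuming the dates are the index and rebuilding the question's data inline (cutoff is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Total Clients": [1005, 1033, 1045, 1124, 1199],
        "Sales Headcount": [5, 5, 10, 10, 10],
        "Total Products": [4, 5, 6, 10, 11],
    },
    index=pd.to_datetime(
        ["2019-11-01", "2019-12-01", "2020-01-01", "2020-02-01", "2020-03-01"]
    ),
)
df.index.name = "Date"

cutoff = pd.Timestamp("2020-01-01")
# Strict comparison: the cutoff row itself is left untouched
df.loc[df.index > cutoff, "Total Products"] = 0
print(df)
```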

Use .loc to assign values based on a boolean mask:
# df['Date(indexed)'] = pd.to_datetime(df['Date(indexed)'])
df.loc[df['Date(indexed)'] > '2020-01-01','Total Products'] = 0
print(df)
Date(indexed) Total Clients Sales Headcount Total Products
0 2019-11-01 1005 5 4
1 2019-12-01 1033 5 5
2 2020-01-01 1045 10 6
3 2020-02-01 1124 10 0
4 2020-03-01 1199 10 0

Sort pandas csv list with string date

I have read a couple of similar posts about this issue, but none of the solutions worked for me. I have the following csv:
Score date term
0 72 3 Feb ·   1
1 47 1 Feb ·   1
2 119 6 Feb ·   1
8 101 7 hrs ·   1
9 536 11 min ·   1
10 53 2 hrs ·   1
11 20 11 Feb ·   3
3 15 1 hrs ·   2
4 33 7 Feb ·   1
5 153 4 Feb ·   3
6 34 3 min ·   2
7 26 3 Feb ·   3
I want to sort the csv by date. What's the easiest way to do that ?

You can create two helper columns: one with datetimes built by to_datetime and one with timedeltas built by to_timedelta. to_timedelta needs the HH:MM:SS format, so the min and hrs strings are first rewritten with Series.replace and regexes. Finally, sort by both columns with DataFrame.sort_values:
df['date1'] = pd.to_datetime(df['date'], format='%d %b', errors='coerce')
# raw strings keep the regex escapes (\d, \s, \1) intact
times = df['date'].replace({r'(\d+)\s+min': r'00:\1:00',
                            r'\s+hrs': ':00:00'}, regex=True)
df['times'] = pd.to_timedelta(times, errors='coerce')
df = df.sort_values(['times','date1'])
print (df)
Score date term date1 times
6 34 3 min 2 NaT 00:03:00
9 536 11 min 1 NaT 00:11:00
3 15 1 hrs 2 NaT 01:00:00
10 53 2 hrs 1 NaT 02:00:00
8 101 7 hrs 1 NaT 07:00:00
1 47 1 Feb 1 1900-02-01 NaT
0 72 3 Feb 1 1900-02-03 NaT
7 26 3 Feb 3 1900-02-03 NaT
5 153 4 Feb 3 1900-02-04 NaT
2 119 6 Feb 1 1900-02-06 NaT
4 33 7 Feb 1 1900-02-07 NaT
11 20 11 Feb 3 1900-02-11 NaT
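
If a single sort key is preferred over two helper columns, the relative strings can be turned into absolute timestamps first. This is a sketch under loud assumptions: now is a hypothetical reference time (the moment the data was scraped is not given in the question), the "3 Feb" style dates are assumed to belong to the same year as now, and only a four-row subset of the data is used.

```python
import pandas as pd

df = pd.DataFrame({
    "Score": [72, 101, 536, 20],
    "date": ["3 Feb", "7 hrs", "11 min", "11 Feb"],
})

# Hypothetical reference time; replace with the real scrape time if known
now = pd.Timestamp("2020-02-12 12:00")

# "11 min" -> "00:11:00", "7 hrs" -> "7:00:00"; day/month strings become NaT
as_delta = pd.to_timedelta(
    df["date"].replace({r"(\d+)\s+min": r"00:\1:00", r"\s+hrs": ":00:00"}, regex=True),
    errors="coerce",
)
# "3 Feb" -> 2020-02-03 (year assumed from `now`); relative strings become NaT
as_date = pd.to_datetime(df["date"] + f" {now.year}", format="%d %b %Y", errors="coerce")

# Relative entries become now - delta, the others fall back to the parsed date
df["when"] = (now - as_delta).fillna(as_date)
newest_first = df.sort_values("when", ascending=False)
print(newest_first)
```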

Cumulative Sum by date (Month)

I have a pandas dataframe and I need to work out the cumulative sum for each month.
Date Amount
2017/01/12 50
2017/01/12 30
2017/01/15 70
2017/01/23 80
2017/02/01 90
2017/02/01 10
2017/02/02 10
2017/02/03 10
2017/02/03 20
2017/02/04 60
2017/02/04 90
2017/02/04 100
The cumulative sum is the running total over the days of each month (01-31). However, some days are missing. The data frame should look like this:
Date Sum_Amount
2017/01/12 80
2017/01/15 150
2017/01/23 230
2017/02/01 100
2017/02/02 110
2017/02/03 140
2017/02/04 390

If you only need the cumulative sum per month, first aggregate with groupby and sum, then group by the month of the index and take cumsum:
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(df.index.month).cumsum().reset_index()
print (df)
Date Amount
0 2017-01-12 80
1 2017-01-15 150
2 2017-01-23 230
3 2017-02-01 100
4 2017-02-02 110
5 2017-02-03 140
6 2017-02-04 390
But if you need to distinguish both months and years, convert the index to a monthly period with to_period:
df = df.groupby(df.index.to_period('M')).cumsum().reset_index()
The difference is easier to see in a modified df with a different year added:
print (df)
Date Amount
0 2017/01/12 50
1 2017/01/12 30
2 2017/01/15 70
3 2017/01/23 80
4 2017/02/01 90
5 2017/02/01 10
6 2017/02/02 10
7 2017/02/03 10
8 2018/02/03 20
9 2018/02/04 60
10 2018/02/04 90
11 2018/02/04 100
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(df.index.month).cumsum().reset_index()
print (df)
Date Amount
0 2017-01-12 80
1 2017-01-15 150
2 2017-01-23 230
3 2017-02-01 100
4 2017-02-02 110
5 2017-02-03 120
6 2018-02-03 140
7 2018-02-04 390
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(df.index.to_period('M')).cumsum().reset_index()
print (df)
Date Amount
0 2017-01-12 80
1 2017-01-15 150
2 2017-01-23 230
3 2017-02-01 100
4 2017-02-02 110
5 2017-02-03 120
6 2018-02-03 20
7 2018-02-04 270
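
Put together on the question's original data, the period-based variant can be run end to end as follows. A minimal sketch: the frame is rebuilt inline, and daily and monthly_cumsum are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2017/01/12", "2017/01/12", "2017/01/15", "2017/01/23",
             "2017/02/01", "2017/02/01", "2017/02/02", "2017/02/03",
             "2017/02/03", "2017/02/04", "2017/02/04", "2017/02/04"],
    "Amount": [50, 30, 70, 80, 90, 10, 10, 10, 20, 60, 90, 100],
})
df["Date"] = pd.to_datetime(df["Date"], format="%Y/%m/%d")

# Step 1: one total per calendar day
daily = df.groupby("Date")["Amount"].sum()

# Step 2: running total that restarts in every (year, month) period
monthly_cumsum = (
    daily.groupby(daily.index.to_period("M"))
    .cumsum()
    .reset_index(name="Sum_Amount")
)
print(monthly_cumsum)
```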
