Python Pandas replace all values if index is larger than a date - python

I am looking for some help on a pandas data frame.
I have a data frame with the following structure
Date(indexed) Total Clients Sales Headcount Total Products
2019-11-01 1005 5 4
2019-12-01 1033 5 5
2020-01-01 1045 10 6
2020-02-01 1124 10 10
2020-03-01 1199 10 11
How can I fill in the column total products with 0's if the date is after 2020-01-01?
Expected outcome:
Date(indexed) Total Clients Sales Headcount Total Products
2019-11-01 1005 5 4
2019-12-01 1033 5 5
2020-01-01 1045 10 6
2020-02-01 1124 10 0
2020-03-01 1199 10 0

Make sure that your date column contains timestamps.
# Assuming `Date(indexed)` means that this column is the index of the dataframe.
df.index = pd.to_datetime(df.index)
Then use .loc with a boolean mask to zero out every row strictly after 2020-01-01. (A partial-string slice like df.loc['2020':, ...] would also zero out 2020-01-01 itself, which the expected output keeps at 6.)
df.loc[df.index > '2020-01-01', 'Total Products'] = 0
>>> df
Total Clients Sales Headcount Total Products
Date
2019-11-01 1005 5 4
2019-12-01 1033 5 5
2020-01-01 1045 10 6
2020-02-01 1124 10 0
2020-03-01 1199 10 0

Using .loc to assign values based on a boolean mask (here assuming Date(indexed) is a regular column; convert it to datetime first if needed):
# df['Date(indexed)'] = pd.to_datetime(df['Date(indexed)'])
df.loc[df['Date(indexed)'] > '2020-01-01','Total Products'] = 0
print(df)
Date(indexed) Total Clients Sales Headcount Total Products
0 2019-11-01 1005 5 4
1 2019-12-01 1033 5 5
2 2020-01-01 1045 10 6
3 2020-02-01 1124 10 0
4 2020-03-01 1199 10 0
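For a self-contained version of the first approach, here is a minimal sketch that rebuilds the sample frame from the question:
import pandas as pd
df = pd.DataFrame(
    {'Total Clients': [1005, 1033, 1045, 1124, 1199],
     'Sales Headcount': [5, 5, 10, 10, 10],
     'Total Products': [4, 5, 6, 10, 11]},
    index=pd.to_datetime(['2019-11-01', '2019-12-01', '2020-01-01',
                          '2020-02-01', '2020-03-01']))
df.index.name = 'Date'
# zero out everything strictly after 2020-01-01
df.loc[df.index > '2020-01-01', 'Total Products'] = 0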

Related

Python Pandas - Difference between groupby keys with repeated values

I have some data with dates of sales to my clients.
The data looks like this:
Cod client Items Date
0 100 1 2022/01/01
1 100 7 2022/01/01
2 100 2 2022/02/01
3 101 5 2022/01/01
4 101 8 2022/02/01
5 101 10 2022/02/01
6 101 2 2022/04/01
7 101 2 2022/04/01
8 102 4 2022/02/01
9 102 10 2022/03/01
What I'm trying to accomplish is to calculate the difference between dates for each client: grouped first by "Cod client" and then by "Date" (because of the duplicates).
The expected result is like:
Cod client Items Date Date diff Explain
0 100 1 2022/01/01 NaT First date for client 100
1 100 7 2022/01/01 NaT ...repeat above
2 100 2 2022/02/01 31 Diff from first date 2022/01/01
3 101 5 2022/01/01 NaT First date for client 101
4 101 8 2022/02/01 31 Diff from first date 2022/01/01
5 101 10 2022/02/01 31 ...repeat above
6 101 2 2022/04/01 59 Diff from previous date 2022/02/01
7 101 2 2022/04/01 59 ...repeat above
8 102 4 2022/02/01 NaT First date for client 102
9 102 10 2022/03/01 28 Diff from first date 2022/02/01
I already tried df["Date diff"] = df.groupby("Cod client")["Date"].diff(), but it considers the repeated dates and returns zeroes for them.
I'd appreciate any help!
IIUC you can combine several groupby operations:
# ensure datetime
df['Date'] = pd.to_datetime(df['Date'])
# set up group
g = df.groupby('Cod client')
# identify duplicated dates per group
m = g['Date'].apply(pd.Series.duplicated)
# compute the diff, mask and ffill
df['Date diff'] = g['Date'].diff().mask(m).groupby(df['Cod client']).ffill()
output:
Cod client Items Date Date diff
0 100 1 2022-01-01 NaT
1 100 7 2022-01-01 NaT
2 100 2 2022-02-01 31 days
3 101 5 2022-01-01 NaT
4 101 8 2022-02-01 31 days
5 101 10 2022-02-01 31 days
6 101 2 2022-04-01 59 days
7 101 2 2022-04-01 59 days
8 102 4 2022-02-01 NaT
9 102 10 2022-03-01 28 days
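If you prefer plain integer days instead of Timedelta values, a small follow-up sketch (note that .dt.days turns the NaT rows into NaN, so the column becomes float):
df['Date diff'] = df['Date diff'].dt.days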
Another way to do this, with transform:
import pandas as pd
# data saved as .csv
df = pd.read_csv("Data.csv", header=0, parse_dates=True)
# convert the Date column to datetime (the sample dates are year-first)
df["Date"] = pd.to_datetime(df["Date"], format="%Y/%m/%d")
# new column!
df["Date diff"] = df.sort_values("Date").groupby("Cod client")["Date"].transform(lambda x: x.diff().replace("0 days", pd.NaT).ffill())

Issues with date format in pandas

I am working with a dataset that contains dates in the day-first D/M/Y format, which pandas misreads as the American M/D/Y format.
When I load the dataset into a Pandas data frame and convert the column to the date type, the dates get mixed up.
Example: In the data set the first date is written as (11/04/2015), which means the 11th of April 2015. But when I convert to DateTime and sort the data frame by date, the first date is (01/08/2015), which is incorrect. How can I convert the column to DateTime without this mix-up?
dataset example :
IDX_CUSTOMER_ITEM_CODE IDX_COMPANY QtySold TotalOnHand Date
0 131 1 3 26 11/04/2015
1 134 1 3 17 11/04/2015
2 137 1 3 114 11/04/2015
3 140 1 3 18 11/04/2015
4 179 1 1 21 11/04/2015
... ... ... ... ... ...
1048570 1059 10 0 23 04/03/2017
1048571 1075 10 3 14 04/03/2017
1048572 2135 10 2 4 04/03/2017
1048573 1035 10 2 3 04/03/2017
1048574 1038 10 0 5 04/03/2017
The first date is the 11th of April 2015 and the last the 4th of March 2017.
When I do:
transactions['Date'] = pd.to_datetime(transactions['Date'])
The oldest date becomes 01/08/2015 and the latest 31/12/2016, which is incorrect. So I tried:
transactions['Date'] = pd.to_datetime(transactions['Date'], format = '%dd-%mm-%yy')
Got the following error:
time data '11/04/2015' does not match format '%dd-%mm-%yy' (match)
You can also use the dayfirst parameter:
pd.to_datetime(df['Date'], dayfirst=True)
Output:
0 2015-04-11
1 2015-04-11
2 2015-04-11
3 2015-04-11
4 2015-04-11
5 2017-03-04
6 2017-03-04
7 2017-03-04
8 2017-03-04
9 2017-03-04
Your format is wrong. You can refer to the Python strftime reference for the meaning of each % code.
transactions['Date'] = pd.to_datetime(transactions['Date'], format = '%d/%m/%Y')
print(transactions['Date'])
0 2015-04-11
1 2015-04-11
2 2015-04-11
3 2015-04-11
4 2015-04-11
5 2017-03-04
6 2017-03-04
7 2017-03-04
8 2017-03-04
9 2017-03-04
Name: Date, dtype: datetime64[ns]
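As a quick sanity check (a sketch based on the dates quoted in the question), the parsed minimum and maximum should come out as 2015-04-11 and 2017-03-04:
transactions['Date'] = pd.to_datetime(transactions['Date'], format='%d/%m/%Y')
print(transactions['Date'].min(), transactions['Date'].max())
# 2015-04-11 00:00:00 2017-03-04 00:00:00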

Get the Minimum and Maximum value within specific date range in DataFrame

I have a DataFrame with the columns 'From' (datetime) and 'To' (datetime). The ranges of different rows of the table overlap.
Here is a simplified version of the criteria dataframe (the date ranges vary and overlap with each other):
df1= pd.DataFrame({'From': pd.date_range(start='2020-01-01', end='2020-01-31',freq='2D'), 'To': pd.date_range(start='2020-01-05', end='2020-02-04',freq='2D')})
From To
0 2020-01-01 2020-01-05
1 2020-01-03 2020-01-07
2 2020-01-05 2020-01-09
3 2020-01-07 2020-01-11
4 2020-01-09 2020-01-13
5 2020-01-11 2020-01-15
6 2020-01-13 2020-01-17
7 2020-01-15 2020-01-19
8 2020-01-17 2020-01-21
9 2020-01-19 2020-01-23
10 2020-01-21 2020-01-25
11 2020-01-23 2020-01-27
12 2020-01-25 2020-01-29
13 2020-01-27 2020-01-31
14 2020-01-29 2020-02-02
15 2020-01-31 2020-02-04
And I have a dataframe which keeps the daily high and low values, like this:
import random
random.seed(0)
df2= pd.DataFrame({'Date': pd.date_range(start='2020-01-01', end='2020-01-31'), 'High': [random.randint(7,15)+5 for i in range(31)], 'Low': [random.randint(0,7)-1 for i in range(31)]})
Date High Low
0 2020-01-01 18 6
1 2020-01-02 18 6
2 2020-01-03 12 3
3 2020-01-04 16 -1
4 2020-01-05 20 -1
5 2020-01-06 19 0
6 2020-01-07 18 5
7 2020-01-08 16 -1
8 2020-01-09 19 6
9 2020-01-10 17 4
10 2020-01-11 15 2
11 2020-01-12 20 4
12 2020-01-13 14 0
13 2020-01-14 16 2
14 2020-01-15 14 2
15 2020-01-16 13 2
16 2020-01-17 16 1
17 2020-01-18 20 6
18 2020-01-19 14 0
19 2020-01-20 16 0
20 2020-01-21 13 4
21 2020-01-22 13 6
22 2020-01-23 17 0
23 2020-01-24 19 3
24 2020-01-25 20 3
25 2020-01-26 13 0
26 2020-01-27 17 4
27 2020-01-28 18 2
28 2020-01-29 17 3
29 2020-01-30 15 6
30 2020-01-31 20 0
Then I hope to get the maximum and minimum values based on the From and To dates in df1. Here is the expected result:
result = pd.DataFrame({'From': pd.date_range(start='2020-01-01', end='2020-01-31',freq='2D'), 'To': pd.date_range(start='2020-01-05', end='2020-02-04',freq='2D'), 'High':[20,20,20,19,20,20,16,20,20,17,20,20,20,20,20,20], 'Low':[-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0,0,0]})
From To High Low
0 2020-01-01 2020-01-05 20 -1
1 2020-01-03 2020-01-07 20 -1
2 2020-01-05 2020-01-09 20 -1
3 2020-01-07 2020-01-11 19 -1
4 2020-01-09 2020-01-13 20 0
5 2020-01-11 2020-01-15 20 0
6 2020-01-13 2020-01-17 16 0
7 2020-01-15 2020-01-19 20 0
8 2020-01-17 2020-01-21 20 0
9 2020-01-19 2020-01-23 17 0
10 2020-01-21 2020-01-25 20 0
11 2020-01-23 2020-01-27 20 0
12 2020-01-25 2020-01-29 20 0
13 2020-01-27 2020-01-31 20 0
14 2020-01-29 2020-02-02 20 0
15 2020-01-31 2020-02-04 20 0
I have tried the resampling method, but it doesn't seem to support custom date ranges. I'm looking for a reasonably efficient and elegant way of doing this. Thank you very much.
With the size of the data, I think you should consider another approach: the idea is to vectorize, chunk by chunk over df1, the comparison between its dates and df2. It is a lot more lines than the other solutions, but it will be way faster for large dataframes.
import numpy as np

# this is a parameter you can play with,
# but if your df1 fits in memory, this value should work
nb_split = int((len(df1)*len(df2))//4e6)+1
# work with arrays of float (datetimes become nanosecond integers)
arr1 = df1[['From','To']].astype('int64').to_numpy().astype(float)
arr2 = df2.astype('int64').to_numpy().astype(float)
# create the result array
arr_out = np.zeros((len(arr1), 2), dtype=float)
i = 0  # index position
for arr1_sp in np.array_split(arr1, nb_split, axis=0):
    # get the length of the chunk
    lft = len(arr1_sp)
    # get the min datetime in From and the max in To
    min_from = arr1_sp[:, 0].min()
    max_to = arr1_sp[:, 1].max()
    # select the rows of arr2 that are within the min and max dates of the split
    arr2_sp = arr2[(arr2[:,0]>=min_from)&(arr2[:,0]<=max_to), :]
    # create a bool array with True when the date in arr2_sp is on or after From
    # and on or before To; each row is the result for one row of arr1_sp
    m = np.less_equal.outer(arr1_sp[:,0], arr2_sp[:, 0])\
        &np.greater_equal.outer(arr1_sp[:,1], arr2_sp[:, 0])
    # use this mask to get the High and Low values within the range row-wise,
    # replacing positions where the mask is False by np.nan
    arr_high = arr2_sp[:,1]*m
    arr_high[~m] = np.nan
    arr_low = arr2_sp[:,2]*m
    arr_low[~m] = np.nan
    # put the result in the result array
    arr_out[i:i+lft, 0] = np.nanmax(arr_high, axis=1)
    arr_out[i:i+lft, 1] = np.nanmin(arr_low, axis=1)
    i += lft  # update the first index position for the next loop
# create the columns in df1
df1['High'] = arr_out[:, 0]
df1['Low'] = arr_out[:, 1]
I tried with a df1 of 10000 rows and a df2 of 5000 rows: this method takes about 102 ms while the apply method with getHighLow2 takes about 8 s, so roughly 80 times faster this way. And the results were the same.
Here is a function which does this:
Checks the dates which are in the From/To interval
Gets the maximum and minimum values of the High and Low columns respectively
def get_high_low(d1):
    high = df2.loc[df2["Date"].isin(pd.date_range(d1["From"], d1["To"])), "High"].max()
    low = df2.loc[df2["Date"].isin(pd.date_range(d1["From"], d1["To"])), "Low"].min()
    return pd.Series([high, low], index=["High", "Low"])
Then we can just apply this function and concatenate the result with the dates.
pd.concat([df1, df1.apply(get_high_low, axis=1)], axis=1)
The result (using the seeded df2 from the question):
From To High Low
0 2020-01-01 2020-01-05 20 -1
1 2020-01-03 2020-01-07 20 -1
2 2020-01-05 2020-01-09 20 -1
3 2020-01-07 2020-01-11 19 -1
4 2020-01-09 2020-01-13 20 0
5 2020-01-11 2020-01-15 20 0
6 2020-01-13 2020-01-17 16 0
7 2020-01-15 2020-01-19 20 0
8 2020-01-17 2020-01-21 20 0
9 2020-01-19 2020-01-23 17 0
10 2020-01-21 2020-01-25 20 0
11 2020-01-23 2020-01-27 20 0
12 2020-01-25 2020-01-29 20 0
13 2020-01-27 2020-01-31 20 0
14 2020-01-29 2020-02-02 20 0
15 2020-01-31 2020-02-04 20 0
I would do a cross merge and query, then groupby:
(df1.assign(dummy=1)
.merge(df2.assign(dummy=1), on='dummy') # this is cross merge
.drop('dummy', axis=1) # remove the `dummy` column
.query('From<=Date<=To') # only choose valid data
.groupby(['From','To']) # groupby `From` and `To`
.agg({'High':'max','Low':'min'}) # aggregation
.reset_index()
)
Output:
From To High Low
0 2020-01-01 2020-01-05 20 -1
1 2020-01-03 2020-01-07 20 -1
2 2020-01-05 2020-01-09 20 -1
3 2020-01-07 2020-01-11 19 -1
4 2020-01-09 2020-01-13 20 0
5 2020-01-11 2020-01-15 20 0
6 2020-01-13 2020-01-17 16 0
7 2020-01-15 2020-01-19 20 0
8 2020-01-17 2020-01-21 20 0
9 2020-01-19 2020-01-23 17 0
10 2020-01-21 2020-01-25 20 0
11 2020-01-23 2020-01-27 20 0
12 2020-01-25 2020-01-29 20 0
13 2020-01-27 2020-01-31 20 0
14 2020-01-29 2020-02-02 20 0
15 2020-01-31 2020-02-04 20 0
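As a side note (an addition assuming pandas >= 1.2), merge supports how='cross' directly, so the dummy column trick can be dropped:
(df1.merge(df2, how='cross')           # direct cross merge
 .query('From<=Date<=To')              # only choose valid data
 .groupby(['From','To'], as_index=False)
 .agg({'High':'max','Low':'min'}))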
You can create a simple function that gets the min and max within a given date range, then use apply to add the columns.
def MaxMin(row):
    dfRange = df2[(df2['Date']>=row['From'])&(df2['Date']<=row['To'])]  # df2 rows within the given date range
    row['High'] = dfRange['High'].max()
    row['Low'] = dfRange['Low'].min()
    return row
df1 = df1.apply(MaxMin, axis=1)
Define the following function:
def getHighLow(row):
    wrk = df2[df2.Date.between(row.From, row.To)]
    return pd.Series([wrk.High.max(), wrk.Low.min()], index=['High', 'Low'])
Then run:
df1.join(df1.apply(getHighLow, axis=1))
According to the DRY rule, it is better to find wrk (the set of rows between the given dates) once and then extract the maximal High and minimal Low from it.
Another advantage over the other solution: my code runs quicker by about 30 % (at least on my computer, measurements performed using %timeit).
Edit
A yet quicker solution is possible when the search in df2 can be performed via the index instead of a regular column.
As a preparatory step run:
df2a = df2.set_index('Date')
Then define another variant of getHighLow function:
def getHighLow2(row):
    wrk = df2a.loc[row.From : row.To]
    return pd.Series([wrk.High.max(), wrk.Low.min()], index=['High', 'Low'])
To get the result, run:
df1.join(df1.apply(getHighLow2, axis=1))
For your data, the execution time is about half that of the other solution (not including the time to create df2a, but it can be created in just this form from the start, with Date as the index).
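One caveat worth adding (an editor's note, not part of the original answer): slicing a DatetimeIndex with .loc requires the index to be sorted, so if df2 is not already in date order, sort while building df2a:
df2a = df2.set_index('Date').sort_index()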

Cumulative Sum by date (Month)

I have a pandas dataframe and I need to work out the cumulative sum for each month.
Date Amount
2017/01/12 50
2017/01/12 30
2017/01/15 70
2017/01/23 80
2017/02/01 90
2017/02/01 10
2017/02/02 10
2017/02/03 10
2017/02/03 20
2017/02/04 60
2017/02/04 90
2017/02/04 100
The cumulative sum is the running total within each month (days 01-31). However, some days are missing. The data frame should look like:
Date Sum_Amount
2017/01/12 80
2017/01/15 150
2017/01/23 230
2017/02/01 100
2017/02/02 110
2017/02/03 140
2017/02/04 390
If you only need the cumulative sum by month, group by 'Date' with sum and then group by the month of the index and take the cumulative sum:
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(df.index.month).cumsum().reset_index()
print (df)
Date Amount
0 2017-01-12 80
1 2017-01-15 150
2 2017-01-23 230
3 2017-02-01 100
4 2017-02-02 110
5 2017-02-03 140
6 2017-02-04 390
But if you need to distinguish both months and years, convert the index to a month period with to_period:
df = df.groupby(df.index.to_period('m')).cumsum().reset_index()
The difference is easier to see in a changed df with a different year added:
print (df)
Date Amount
0 2017/01/12 50
1 2017/01/12 30
2 2017/01/15 70
3 2017/01/23 80
4 2017/02/01 90
5 2017/02/01 10
6 2017/02/02 10
7 2017/02/03 10
8 2018/02/03 20
9 2018/02/04 60
10 2018/02/04 90
11 2018/02/04 100
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(df.index.month).cumsum().reset_index()
print (df)
Date Amount
0 2017-01-12 80
1 2017-01-15 150
2 2017-01-23 230
3 2017-02-01 100
4 2017-02-02 110
5 2017-02-03 120
6 2018-02-03 140
7 2018-02-04 390
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(df.index.to_period('m')).cumsum().reset_index()
print (df)
Date Amount
0 2017-01-12 80
1 2017-01-15 150
2 2017-01-23 230
3 2017-02-01 100
4 2017-02-02 110
5 2017-02-03 120
6 2018-02-03 20
7 2018-02-04 270
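Equivalently (a sketch; pd.Grouper with a monthly frequency gives the same year-aware grouping, and in pandas 2.2+ the alias is 'ME' rather than 'M'):
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(pd.Grouper(freq='M')).cumsum().reset_index()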

How to do calculation on pandas dataframe that require processing multiple rows?

I have a dataframe from which I need to calculate a number of features. The dataframe df looks something like this for an object and an event:
id event_id event_date age money_spent rank
1 100 2016-10-01 4 150 2
2 100 2016-09-30 5 10 4
1 101 2015-12-28 3 350 3
2 102 2015-10-25 5 400 5
3 102 2015-10-25 7 500 2
1 103 2014-04-15 2 1000 1
2 103 2014-04-15 3 180 6
From this I need to know, for each id and event_id (basically each row), the number of days since the last event date, the total money spent up to that date, the average money spent up to that date, the rank in the last 3 events, etc.
What is the best way to work with this kind of problem in pandas, where for each row I need information from all rows with the same id before that row's date, and then perform the calculations? I want to return a new dataframe with the corresponding calculated features, like:
id event_id event_date days_last_event avg_money_spent total_money_spent
1 100 2016-10-01 278 500 1500
2 100 2016-09-30 361 196.67 590
1 101 2015-12-28 622 675 1350
2 102 2015-10-25 558 290 580
3 102 2015-10-25 0 500 500
1 103 2014-04-15 0 1000 1000
2 103 2014-04-15 0 180 180
I came up with the following solution:
df1 = df.sort_values(by="event_date")  # oldest first, so the cumulative stats run up to each row's date
g = df1.groupby("id")["money_spent"]
df1["total_money_spent"] = g.cumsum()
df1["count"] = g.cumcount()
df1["avg_money_spent"] = df1["total_money_spent"] / (df1["count"] + 1)
