Create multiple columns within groupby for calculation? - python

I have a dataset like this:
SKU,Date,Inventory,Sales,Incoming
2010,2017-01-01 0:00:00,85,126,252
2010,2017-02-01 0:00:00,382,143,252
2010,2017-03-01 0:00:00,414,139,216
2010,2017-04-01 0:00:00,468,120,216
7770,2017-01-01 0:00:00,7,45,108
7770,2017-02-01 0:00:00,234,64,216
7770,2017-03-01 0:00:00,160,69,36
7770,2017-04-01 0:00:00,150,50,72
7870,2017-01-01 0:00:00,41,29,36
7870,2017-02-01 0:00:00,95,18,36
7870,2017-03-01 0:00:00,112,16,36
7870,2017-04-01 0:00:00,88,19,0
Inventory is the "actual" recorded quantity, which may differ from the hypothetical remaining quantity I am trying to calculate.
The Sales column actually extends much further into the future; in those rows, the other two columns will have NA.
I want to do the following:
1. Take only the first Inventory value of each SKU.
2. Use that first value to calculate the hypothetical remaining quantity via the running formula [Earliest inventory] - [Sales for that month] - [Incoming qty for that month] (note: Earliest inventory is a fixed quantity for each SKU). Store the output in a column called "End of Month Part 1".
3. Create another column called "Buy Quantity": if the remaining quantity is below 50, the buy amount is 30 for all 3 SKUs (i.e. increase the quantity by 30); if the remaining quantity is above 50, the buy amount is zero.
4. Create another column called "End of Month Part 2" that adds "End of Month Part 1" and "Buy Quantity".
I am able to obtain the first quantity of each SKU using the following code, merging it back into the dataset as a column called 'Earliest inventory':
first_qty_series = dataset.groupby(by=['SKU']).nth(0)['Inventory']
first_qty_series = pd.DataFrame(first_qty_series).reset_index().rename(columns={'Inventory': 'Earliest inventory'})
dataset = pd.merge(dataset, first_qty_series, on='SKU')
As for the remaining quantity, I thought of using cumsum() on dataset['Sales'] and dataset['Incoming'], but I think it won't work because cumsum() will sum across ALL SKUs.
That's why I think I need to perform the calculation within a groupby, but I don't know what else to do.
(Edit:) Expected output is: (table omitted)
Thank you guys!

Here is a way to create the 4 columns you want.
1 - An alternative for the first column: use loc and drop_duplicates to fill the first row of each 'SKU' with the value from 'Inventory', then use ffill to fill the following rows (your method works too).
dataset.loc[dataset.drop_duplicates(['SKU']).index,'Earliest inventory'] = dataset['Inventory']
dataset['Earliest inventory'] = dataset['Earliest inventory'].ffill().astype(int)
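A third one-liner option (not in the original answer, and assuming rows are already date-sorted within each SKU): transform('first') broadcasts each SKU's first 'Inventory' value to all of its rows.
dataset['Earliest inventory'] = dataset.groupby('SKU')['Inventory'].transform('first')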
2 - Indeed, you need cumsum and groupby to create the column 'End of Month Part 1', though not on 'Earliest inventory', as that value is the same on every row of a given 'SKU'. Note: according to your expected result (and the logic), I changed the - to a + before the column 'Incoming'; if I misunderstood the problem, just flip the sign.
dataset['End of Month Part 1'] = (dataset['Earliest inventory']
                                  - dataset.groupby('SKU')['Sales'].cumsum()
                                  + dataset.groupby('SKU')['Incoming'].cumsum())
3 - The column 'Buy Quantity' can be created with loc again, using the condition that 'End of Month Part 1' is at most 50, then fillna with 0:
dataset.loc[dataset['End of Month Part 1'] <= 50, 'Buy Quantity'] = 30
dataset['Buy Quantity'] = dataset['Buy Quantity'].fillna(0).astype(int)
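Equivalently, np.where avoids the intermediate NaN column (a small variation, not from the original answer):
import numpy as np
dataset['Buy Quantity'] = np.where(dataset['End of Month Part 1'] <= 50, 30, 0)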
4 - Finally, the last column is just the sum of the two created previously:
dataset['End of Month Part 2'] = dataset['End of Month Part 1'] + dataset['Buy Quantity']
If I understood the 4 points correctly, you should get the dataset with the new columns.
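Putting the four steps together, here is a minimal runnable sketch on a subset of the sample data (note the + sign before 'Incoming', per the remark in step 2):
import io
import numpy as np
import pandas as pd

csv = """SKU,Date,Inventory,Sales,Incoming
2010,2017-01-01,85,126,252
2010,2017-02-01,382,143,252
7770,2017-01-01,7,45,108
7770,2017-02-01,234,64,216
7870,2017-01-01,41,29,36
7870,2017-02-01,95,18,36"""
dataset = pd.read_csv(io.StringIO(csv), parse_dates=['Date'])

# step 1: first Inventory value per SKU
dataset['Earliest inventory'] = dataset.groupby('SKU')['Inventory'].transform('first')
# step 2: hypothetical remaining quantity
dataset['End of Month Part 1'] = (dataset['Earliest inventory']
                                  - dataset.groupby('SKU')['Sales'].cumsum()
                                  + dataset.groupby('SKU')['Incoming'].cumsum())
# steps 3 and 4: buy quantity and adjusted end-of-month quantity
dataset['Buy Quantity'] = np.where(dataset['End of Month Part 1'] <= 50, 30, 0)
dataset['End of Month Part 2'] = dataset['End of Month Part 1'] + dataset['Buy Quantity']
print(dataset)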

Related

Using a Rolling Function in Pandas based on Date and a Categorical Column

I'm currently working on a dataset where I am using the pandas rolling function to create features.
The features rely on three columns: a numeric DaysLate column from which the mean is calculated, an InvoiceDate column from which the date is derived, and a customerID column which denotes the customer of a row.
I'm trying to get a rolling mean of DaysLate over the last 30 days, limited to invoices raised to a specific customerID.
The following two functions are working.
Mean of DaysLate for the last five invoices raised for the row's customer
df["CustomerDaysLate_lastfiveinvoices"] = df.groupby("customerID").rolling(window = 5,min_periods = 1).\
DaysLate.mean().reset_index().set_index("level_1").\
sort_index()["DaysLate"]
Mean of DaysLate for all invoices raised in the last 30 days
df = df.sort_values('InvoiceDate')
df["GlobalDaysLate_30days"] = df.rolling(window = '30d', on = "InvoiceDate").DaysLate.mean()
I just can't seem to find the code to get the mean of the last 30 days by customerID. Any help on the above is greatly appreciated.
Set the date column as the index, sort it to ensure ascending order, then group the sorted dataframe by customer id and calculate the 30d rolling mean for each group.
mean_30d = (
    df
    .set_index('InvoiceDate')  # !important
    .sort_index()
    .groupby('customerID')
    .rolling('30d')['DaysLate'].mean()
    .reset_index(name='GlobalDaysLate_30days')
)
# merge the rolling mean back to original dataframe
result = df.merge(mean_30d)
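As a quick sanity check, here is the same approach on made-up data (column names as in the question):
import pandas as pd

df = pd.DataFrame({
    'customerID': ['A', 'A', 'A', 'B'],
    'InvoiceDate': pd.to_datetime(['2021-01-01', '2021-01-15', '2021-03-01', '2021-01-10']),
    'DaysLate': [5, 7, 3, 10],
})
mean_30d = (
    df
    .set_index('InvoiceDate')
    .sort_index()
    .groupby('customerID')
    .rolling('30d')['DaysLate'].mean()
    .reset_index(name='GlobalDaysLate_30days')
)
result = df.merge(mean_30d, on=['customerID', 'InvoiceDate'])
# customer A's 2021-01-15 row averages the two invoices inside its 30-day window: (5 + 7) / 2 = 6.0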

What better way to implement this idea on a Pandas dataframe?

I have a pandas dataframe called df which contains 26,000 rows. It includes 10 columns called "first price", "second price", ..., "tenth price".
I want to add a new column called "y" to this dataframe, defined like this:
Row 26 of the "y" column holds the name of the column whose value in row 26 is closest to the "first price" value of row 27 (i.e. row 26+1), compared with the values of the other columns in row 26.
I implemented this code with an algorithm, but this algorithm works very slowly for a sample of 1000, let alone 26,000!
y = []
for i in range(1000):
    y.append((abs(df[df.index == i] - df["first price"][i+1])).idxmin(axis=1)[i])
for i in range(1000, len(df)):
    y.append(0)
df["y"] = y
Do you know a better way?
You want to reshape the data to make it tidy. It's not good to have a bunch of columns that all hold the same type of value (first price, second price, etc.); better to have the price number in its own column with the price beside it. Since you are comparing everything to the first price, you can leave it in its own column and melt the rest of the columns into pairs of 'price_number' and 'price' before finding the minimum for each 'item' (what you had as rows in your example):
import numpy as np
import pandas as pd

# example data:
np.random.seed(11)
df = (pd.DataFrame(np.random.choice(range(100), (6, 4)),
                   columns=['first', 'second', 'third', 'fourth'])
        .rename_axis('item_id')
        .reset_index())

# reshape the data to make it easier to work with
df = df.melt(id_vars=['item_id', 'first'], var_name='price_number', value_name='price')

# calculate price differences
df['price_diff'] = (df.price - df['first']).abs()

# find the minimum price difference for each item
df_closest = df.loc[df.groupby('item_id')['price_diff'].idxmin()]
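One caveat: the question compares each row to the next row's 'first price', not the same row's. If you need that exact behavior, shift 'first' up by one row before the melt (a sketch; the last row then has no next-row price and its diffs become NaN):
# run right after building the example data, before the melt:
df['first'] = df['first'].shift(-1)  # align each row with the NEXT row's first price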

Difference of 2 columns in pandas dataframe with some given conditions

I have a sheet like this. I need to calculate the absolute value of "CURRENT HIGH" - "PREVIOUS DAY CLOSE PRICE" per "INSTRUMENT" and "SYMBOL".
So I used the pandas .shift(1) function to create a lagged close column and subtracted it from the current HIGH, but that also subtracts across 2 different "INSTRUMENT" and "SYMBOL" values. When a new SYMBOL or INSTRUMENT appears, I want the first row to be NULL instead of subtracting the previous group's lagged close.
What should I do?
I believe you need the following, assuming all days are consecutive within each group:
df['new'] = df['HIGH'].sub(df.groupby(['INSTRUMENT','SYMBOL'])['CLOSE'].shift())
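A quick illustration with made-up data: the first row of each (INSTRUMENT, SYMBOL) group gets NaN because shift() has nothing to pull from; add .abs() at the end if you want the absolute difference as in the question.
import pandas as pd

df = pd.DataFrame({
    'INSTRUMENT': ['FUT', 'FUT', 'FUT', 'OPT'],
    'SYMBOL': ['ABC', 'ABC', 'XYZ', 'ABC'],
    'HIGH': [105, 110, 50, 12],
    'CLOSE': [100, 108, 48, 11],
})
df['new'] = df['HIGH'].sub(df.groupby(['INSTRUMENT', 'SYMBOL'])['CLOSE'].shift())
# rows 0, 2, 3 each start a new group -> NaN; row 1 -> 110 - 100 = 10.0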

How can I create a multi-indexed pandas dataframe within a for loop?

I have a weekly time series of multiple variables, and I am trying to see at what percentile rank the latest 26-week correlation sits versus all previous 26-week correlations.
I can generate a correlation matrix for the first 26-week period using pandas' .corr() method, but I don't know how to loop through all previous periods to find the values of these correlations so I can rank them.
I hope there is a better way to achieve this; if so, please let me know.
I have tried calculating parallel dataframes, but I couldn't write a formula to rank the most recent one, so I believe the solution lies with multi-indexing.
import numpy as np
import pandas as pd

daterange = pd.date_range('20160701', periods=100, freq='1w')
np.random.seed(120)
df_corr = pd.DataFrame(np.random.rand(100, 5), index=daterange, columns=list('abcde'))
df_corr_chg = df_corr.diff()
df_corr_chg = df_corr_chg[1:]
df_corr_chg = df_corr_chg.replace(0, 0.01)
d = df_corr_chg.shape[0]
df_CCC = df_corr_chg[::-1]
for s in range(0, d - 26):
    i = df_CCC.iloc[s:26 + s]
I am looking for a multi-indexed table showing the correlations at different times. Example of output:
            a          b
a  1   1.000000  -0.101713
   2   1.000000  -0.031109
   n   1.000000   0.471764
b  1  -0.101713   1.000000
   2  -0.031109   1.000000
   n   0.471764   1.000000
Here is a recipe for how you could approach the problem.
I assume you have one price per week (otherwise just pre-aggregate your dataframe).
# in case your weeks are not numbered, see the pre-aggregation hint below
# Sort your dataframe by symbol (EUR, SPX, ...) and date, descending.
df.sort_values(['symbol', 'date'], ascending=False, inplace=True)
# Now build a boolean indexer selecting the 26 most recent rows per symbol
indexer = df.groupby('symbol').cumcount() < 26
# pivot to one column per symbol, then compute the correlation matrix
df.loc[indexer].pivot(index='date', columns='symbol', values='pricecolumn').corr()
One more hint, in case you need to pre-aggregate your dataframe: you could add an auxiliary column with the week number, like:
df['week_number'] = df['datefield'].dt.isocalendar().week
Then I guess you would like the last price of each week. You could get that as follows:
df_last = (df.sort_values(['symbol', 'week_number', 'date'], ascending=True)
             .groupby(['symbol', 'week_number'])
             .aggregate('last'))
df_last.reset_index(inplace=True)
Then use df_last in place of the df above. Please check/change the field names; I assumed them.
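To build the multi-indexed table of rolling correlations asked for, one option (a sketch reusing the question's example frame) is to collect each window's correlation matrix in a dict keyed by the window's end date, then concat with named levels:
import numpy as np
import pandas as pd

daterange = pd.date_range('20160701', periods=100, freq='W')
np.random.seed(120)
df_corr = pd.DataFrame(np.random.rand(100, 5), index=daterange, columns=list('abcde'))

corrs = {}
for s in range(26, len(df_corr) + 1):
    window = df_corr.iloc[s - 26:s]              # trailing 26-week window
    corrs[df_corr.index[s - 1]] = window.corr()  # keyed by window end date

# MultiIndex rows: (window_end, variable); one column per variable
corr_panel = pd.concat(corrs, names=['window_end', 'variable'])

# percentile rank of the latest a-b correlation vs. all earlier windows
ab = corr_panel.xs('a', level='variable')['b']
latest_pct = (ab.iloc[:-1] < ab.iloc[-1]).mean()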

Create a new column in a second dataframe by passing column values of a different dataframe and a scalar to a function in Pandas Python?

My dataframe1 contains a day column which holds numeric data from 1 to 7 for each day of the week: 1 - Monday, 2 - Tuesday, etc.
This day column is the day of departure of a flight.
I need to create a new column dayOfBooking in a second dataframe2 that finds the day of the week based on the number of days before departure that a person books the flight and the day of departure of the flight.
For that I've written this function:
def findDay(dayOfDeparture, beforeDay):
    beforeDay = int(beforeDay)
    beforeDay = beforeDay % 7
    if (dayOfDeparture - beforeDay) > 0:
        dayAns = dayOfDeparture - beforeDay
    else:
        dayAns = 7 - abs(dayOfDeparture - beforeDay)
    return dayAns
I want something like:
dataframe2["dayOfBooking"] = findDay(dataframe1["day"], i)
where i is the scalar value.
I can see that findDay takes the entire column day of dataframe1 instead of taking a single value for each row.
Is there an easy way to accomplish this like when we want a third column to be the sum of two other columns for each row, we can just write this:
dataframe["sum"] = dataframe2["val1"] + dataframe2["val2"]
EDIT: Figured it out. Answer and explanation below.
df2["colname"] = df.apply(lambda row: findDay(row['col'], i), axis = 1)
We have to use apply when we want to pass each row's value of a particular column into a user-defined function.
axis=1 means the function is applied row by row, receiving one row at a time.
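Since findDay is pure arithmetic, apply can also be skipped entirely; a vectorized sketch (assuming dataframe1 and dataframe2 share an index, and i is the scalar):
# diff > 0 -> diff; diff <= 0 -> 7 + diff; both cases equal ((diff - 1) % 7) + 1
diff = dataframe1["day"] - (i % 7)
dataframe2["dayOfBooking"] = ((diff - 1) % 7) + 1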
