I have a transaction sales dataframe:
print(df)
dt_op quantity cod_id
20/01/18 1 100
20/01/18 8 102
21/01/18 1 100
21/01/18 10 102
...
And I would like to define a new variable "speed" as "cumulative_sales / days_elapsed_since_the_launch_of_that_product", computed separately for every different item in "cod_id".
I tried with:
start = min(df["dt_op"])
df["running_days"] = (df["dt_op"] - start).astype('timedelta64[D]')
df["csum"] = df.quantity.cumsum()
df["speed"] = df["csum"] / df["running_days"]
But it does not compute it separately for each item, and I would like to avoid for-loops because of the slow running time.
Try saving the first launch date for every 'cod_id' in a new column with groupby:
df2 = df.groupby('cod_id')['dt_op'].min().reset_index(name='launch_date')
and merge it back into your dataframe:
df = pd.merge(df, df2, on='cod_id', how='left')
Then create a new column with the date difference between each row's date and its launch date. You can then compute the cumulative sum per 'cod_id' (groupby plus cumsum) and divide it by that date difference.
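Putting that together, a minimal sketch of the whole computation (assuming dt_op still needs to be parsed as a day-first date):
import pandas as pd

df['dt_op'] = pd.to_datetime(df['dt_op'], dayfirst=True)

# launch date and days elapsed since launch, per cod_id
df['launch_date'] = df.groupby('cod_id')['dt_op'].transform('min')
df['running_days'] = (df['dt_op'] - df['launch_date']).dt.days

# cumulative sales per cod_id, then speed
df['csum'] = df.groupby('cod_id')['quantity'].cumsum()
df['speed'] = df['csum'] / df['running_days']
Using transform('min') keeps the launch date aligned with the original rows, so the merge becomes optional; note that running_days is 0 on the launch day itself, so speed is inf there unless you handle that case.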
I have a table of SKUs that need to be placed in locations. The volume of a SKU determines how many locations it needs. There are a limited number of locations, so I need to prioritize based on how much volume will be in a location and then, once in order, assign the locations. When a location is full, the volume for that location should be the standard location volume; the last location gets the remainder volume.
Current table setup
So the end result should be one row per location: each full location shows the location volume, and the last location for each SKU shows the remainder volume.
I was hoping to iterate based on the number of locations needed, creating a row in a new table and reducing the remaining location count as I go.
Something like this.
new_locs = []
for _, row in df.iterrows():
    remaining = int(row['locations_needed'])
    # full locations get the standard location volume
    while remaining > 1:
        new_locs.append((row['SKU'], row['location_amount']))
        remaining -= 1
    # the last location gets the remainder volume
    new_locs.append((row['SKU'], row['remainder_volume']))
Use the repeat method from pd.Index:
out = (df.reindex(df.index.repeat(df['locations_needed'].fillna(0).astype(int)))
.reset_index(drop=True))
print(out)
# Output
SKU location_amount locations_needed
0 FAKESKU 2300 3.0
1 FAKESKU 2300 3.0
2 FAKESKU 2300 3.0
3 FAKESKU2 2100 2.0
4 FAKESKU2 2100 2.0
Building off of repeat as suggested by Corralien, you then set the value of the last row in each SKU group to the remainder volume, then reorder and reset the index again. So,
# create a row for each potential location by SKU
df = df.loc[df.index.repeat(df.locations_needed)]
# reset index
df = df.reset_index(drop=True)
# fill the last row in each group (SKU) with the remainder volume
df2 = df['SKU'].duplicated(keep='last')
df.loc[~df2, 'location_amount'] = df['remainder_volume']
# reorder and reset index
df = df.sort_values(by=['location_amount'], ascending=False)
df['locations_needed'] = 1
df = df.reset_index(drop=True)
I have a database that includes monthly time series data on around 15 different indicators. The data is all in the same format, year-to-date values and year-to-date growth. January data is missing, with data for each indicator starting with the year-to-date total as of February.
For each indicator I want to turn the year-to-date data into monthly values. The code below does that.
But I want to be able to run this as a loop over all 15 indicators, and then automatically rename each resulting dataframe to include a reference to the category it belongs to. For example, one category of data is sales in value terms, so when I apply the code to that category, I want the output df_m to be renamed to sales_m and df_yoy to sales_yoy.
I thought I could do this by defining a list of the 15 indicators to start with, and then somehow assigning that list to the dataframes produced by the loop, but I can't make that work.
category = ['sales', 'construction']
df_m = df.loc[:, df.columns.str.contains('Monthly')]
df_ytd = df.drop(df.filter(regex='Monthly').columns, axis=1)
df_ytd = df_ytd.fillna(method='bfill', limit=1)
df_ytd.loc[df_ytd.index.month.isin([1,2]), :] = df_ytd / 2
df_ytd.columns = df_ytd.columns.str.replace(', YTD', '')
df_m.columns = df_m.columns.str.replace('YTD, ', '').str.replace(', Monthly', '')
df_m = df_m.fillna(df_ytd)
df_yoy = df_m.pct_change(periods=12) * 100
sales_m = df_m
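One way to avoid dynamically named variables is to wrap the transformation in a function and collect the results in dictionaries keyed by category. A minimal sketch, assuming each category's raw dataframe is available in a (hypothetical) dict called raw_frames and the index is a monthly DatetimeIndex:
import pandas as pd

def monthly_from_ytd(df):
    # same transformation as above, for one category's dataframe
    df_m = df.loc[:, df.columns.str.contains('Monthly')].copy()
    df_ytd = df.drop(df.filter(regex='Monthly').columns, axis=1)
    df_ytd = df_ytd.bfill(limit=1)
    df_ytd.loc[df_ytd.index.month.isin([1, 2]), :] = df_ytd / 2
    df_ytd.columns = df_ytd.columns.str.replace(', YTD', '')
    df_m.columns = df_m.columns.str.replace('YTD, ', '').str.replace(', Monthly', '')
    df_m = df_m.fillna(df_ytd)
    df_yoy = df_m.pct_change(periods=12) * 100
    return df_m, df_yoy

categories = ['sales', 'construction']  # extend to all 15 indicators
monthly, yoy = {}, {}
for cat in categories:
    # raw_frames is a hypothetical dict mapping each category name to its raw dataframe
    monthly[cat], yoy[cat] = monthly_from_ytd(raw_frames[cat])

sales_m, sales_yoy = monthly['sales'], yoy['sales']  # same effect as the manual renaming
You can then access, for example, monthly['construction'] without creating a separate variable name for every indicator.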
Given a data frame with the start time of each new time period (a new work shift), sum all sales that occur before the next time period (work shift) starts.
import pandas as pd
df_checkpoints = pd.DataFrame({'time':[1,5,10], 'shift':['Adam','Ben','Carl']})
df_sales = pd.DataFrame({'time':[2,6,7,9,15], 'soldCount':[1,2,3,4,5]})
# This is the wanted output...
df_output = pd.DataFrame({'time':[1,5,10], 'shift':['Adam','Ben','Carl'], 'totSold':[1,9,5]})
So pd.merge_asof does what I want except it only does 1:1 merge. Best would be to get a multiIndex dataframe with index[0] being the checkpoints and index[1] being the sales rows, such that I can aggregate freely afterwards. Last resort would be an ugly O(n) loop.
Number of rows in each df is a couple of millions.
Any idea?
You can use pd.cut.
For instance, if you want to group by range, you can use it like this.
As you can see, I added 24 to mark the end of the last range:
pd.cut(df_sales["time"], [1,5,10,24])
If you want to automate this, you can do it like this:
get your checkpoints, append 24 as the final bin edge, group by the resulting bins, sum the sales, and reset the index for the concat:
group_and_sum = df_sales.groupby(pd.cut(df_sales["time"], pd.concat([df_checkpoints['time'], pd.Series([24])])), as_index=False).sum().drop('time', axis=1)
Concatenate the two dataframes to attach the shift names:
pd.concat([group_and_sum,df_checkpoints],axis=1)
output
soldCount time shift
0 1 1 Adam
1 9 5 Ben
2 5 10 Carl
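Since the question mentions pd.merge_asof: another possible approach (just a sketch, not part of the answer above) is to tag each sale with the most recent checkpoint at or before it and then aggregate; shift_start is an illustrative column name introduced here.
import pandas as pd

# df_sales and df_checkpoints as defined in the question; both are already sorted by time
tagged = pd.merge_asof(
    df_sales,
    df_checkpoints.rename(columns={'time': 'shift_start'}),
    left_on='time',
    right_on='shift_start',
    direction='backward',
)

# aggregate the tagged sales per shift
df_output = (tagged.groupby(['shift_start', 'shift'], as_index=False)['soldCount']
                   .sum()
                   .rename(columns={'shift_start': 'time', 'soldCount': 'totSold'}))
This avoids constructing an artificial upper bin edge such as 24, and since both merge_asof and groupby work on sorted data it should also cope with the millions of rows mentioned in the question.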
I am trying to pivot a pandas dataframe, but the data is following a strange format that I cannot seem to pivot. The data is structured as below:
Date, Location, Action1, Quantity1, Action2, Quantity2, ... ActionN, QuantityN
<date> 1 Lights 10 CFloor 1 ... Null Null
<date2> 2 CFloor 2 CWalls 4 ... CBasement 15
<date3> 2 CWalls 7 CBasement 4 ... NUll Null
Essentially, each action will always have a quantity attached to it (which may be 0), but null actions will never have a quantity (the quantity will just be null). The format I am trying to achieve is the following:
Lights CFloor CBasement CWalls
1 10 1 0 0
2 0 2 19 11
The index of the rows becomes the location, while the columns become any unique action found across the multiple activity columns. When pulling the data together, the value of each row/column is the sum of each quantity associated with the action (i.e. Action1 corresponds to Quantity1). Is there a way to do this with the native pandas pivot function?
My current code performs a ravel across all the activity columns to get a list of all unique activities. It will also grab all the unique locations from the Location column. Once I have the unique columns, I create an empty dataframe and fill it with zeros:
Lights CFloor CBasement CWalls
1 0 0 0 0
2 0 0 0 0
I then iterate back over the old data frame with the itertuples() method (I was told it was significantly faster than iterrows()) and populate the new dataframe. This empty dataframe acts as a template that is stored in memory and filled later.
#Creates a template from the dataframe
def create_template(df):
act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
activities = df[act_cols]
flat_acts = activities.values.ravel('K')
unique_locations = pd.unique(df['Location'])
unique_acts = pd.unique(flat_acts)
pivot_template = pd.DataFrame(index=unique_locations, columns=unique_acts).fillna(0)
return pivot_template
#Fills the template from the dataframe
def create_pivot(df, pivot_frmt):
act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']
for row in df.itertuples():
for act, quantity in zip(act_cols, quant_cols):
act_val = getattr(row, act)
if pd.notna(act_val):
quantity_val = getattr(row, quantity)
location = getattr(row, 'Location')
pivot_frmt.loc[location, act_val] += quantity_val
return pivot_frmt
While my solution works, it is incredibly slow when dealing with a large dataset and has taken 10 seconds or more to complete this type of operation. Any help would be greatly appreciated!
After experimenting with various pandas functions, such as melt and pivot on multiple columns simultaneously, I found a solution that worked for me:
For every quantity-activity pair, I build a partial frame of the final dataset and store it in a list. Once every pair has been addressed, I end up with multiple dataframes that all have the same row counts but potentially different column counts. I solved this by concatenating the columns and, where any columns are repeated, summing them to get the final result.
import numpy as np
import pandas as pd

def test_pivot(df):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']
    dfs = []
    for act, quant in zip(act_cols, quant_cols):
        # one partial pivot per (activity, quantity) pair
        partial = pd.crosstab(index=df['Location'], columns=df[act],
                              values=df[quant], aggfunc=np.sum).fillna(0)
        dfs.append(partial)
    # concatenate the partials, then sum any repeated columns
    finalDf = pd.concat(dfs, axis=1)
    finalDf = finalDf.groupby(finalDf.columns, axis=1).sum()
    return finalDf
There are two assumptions that I make during this approach:
The indexes maintain their order across all partial dataframes
There are an equivalent number of indexes across all partial dataframes
While this is probably not the most elegant solution, it achieves the desired result and reduced the processing time by a very significant margin (from about 10 s to about 0.2 s on ~4k rows). If anybody has a better way to deal with this type of scenario and do the process outlined above in one shot, then I would love to see your response!
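For reference, here is one way such a one-shot reshape could look: a sketch (not the answer's code) that stacks the activity/quantity pairs into long form and then does a single pivot_table, assuming the same Location/ActivityNN/QuantityNN column names as above:
import pandas as pd

act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']

# one row per (Location, Activity, Quantity) pair, dropping null actions
long = pd.concat(
    [df[['Location', a, q]].rename(columns={a: 'Activity', q: 'Quantity'})
     for a, q in zip(act_cols, quant_cols)],
    ignore_index=True,
).dropna(subset=['Activity'])

# a single pivot_table then sums the quantities per location/activity
result = long.pivot_table(index='Location', columns='Activity',
                          values='Quantity', aggfunc='sum', fill_value=0)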
Here is a sample df:
data = {"Brand":{"0":"BrandA","1":"BrandA","2":"BrandB","3":"BrandB","4":"BrandC","5":"BrandC"},"Cost":{"0":18.5,"1":19.5,"2":6,"3":6,"4":17.69,"5":18.19},"IN STOCK":{"0":10,"1":15,"2":5,"3":1,"4":12,"5":12},"Inventory Number":{"0":1,"1":1,"2":2,"3":2,"4":3,"5":3},"Labels":{"0":"Black","1":"Black","2":"White","3":"White","4":"Blue","5":"Blue"},"Maximum Price":{"0":30.0,"1":35.0,"2":50,"3":45.12,"4":76.78,"5":76.78},"Minimum Price":{"0":23.96,"1":25.96,"2":12.12,"3":17.54,"4":33.12,"5":28.29},"Product Name":{"0":"Product A","1":"Product A","2":"ProductB","3":"ProductB","4":"ProductC","5":"ProductC"}}
df = pd.DataFrame(data=data)
My actual data set is much larger, but maintains the same pattern of there being 2 rows that share the same Inventory Number throughout.
My goal is to create a new data frame that contains only the inventory numbers where a cell value is not duplicated across both rows, and for those inventory numbers, only contains the data from the row with the lower index that is different from the other row.
For this example the resulting data frame would need to look like:
data = {"Inventory Number":{"0":1,"1":2,"2":3},"Cost":{"0":18.50,"1":"","2":17.69},"IN STOCK":{"0":10,"1":5,"2":""},"Maximum Price":{"0":30,"1":50,"2":""},"Minimum Price":{"0":23.96,"1":12.12,"2":33.12}}
df = pd.DataFrame(data=data)
The next time this runs, perhaps nothing will have changed in "Maximum Price", so that column should not be included at all.
I was hoping someone would have a clean solution using groupby, but if not, I imagine the solution would involve dropping all duplicates, then looping through the remaining inventory numbers and evaluating each column for duplicates.
icol = 'Inventory Number'
# drop rows that are complete duplicates of each other
d0 = df.drop_duplicates(keep=False)
# number the rows within each inventory number (0 for the first, 1 for the second)
i = d0.groupby(icol).cumcount()
# reshape so each column/inventory-number pair has the two row values side by side
d1 = d0.set_index([icol, i]).unstack(icol).T
# keep only the values that differ between the two rows
d1[1][d1[1] != d1[0]].unstack(0)
Cost IN STOCK Maximum Price Minimum Price
Inventory Number
1 19.5 15 35 25.96
2 None 1 45.12 17.54
3 18.19 None None 28.29
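If you instead want the values from the lower-index row, and want columns with no differences dropped entirely (as the question describes), a small variation on the same idea (a sketch, reusing d0/d1 from above) would be:
# take the first row's values wherever the two rows differ
first_diff = d1[0][d1[0] != d1[1]].unstack(0)
# drop columns where nothing changed for any inventory number
first_diff = first_diff.dropna(axis=1, how='all')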
Try this:
In [68]: cols = ['Cost','IN STOCK','Inventory Number','Maximum Price','Minimum Price']
In [69]: df[cols].drop_duplicates(subset=['Inventory Number'])
Out[69]:
Cost IN STOCK Inventory Number Maximum Price Minimum Price
0 18.5 10 100566 30.0 23.96