I'm a newbie to Python, and I have done my best to exhaust all resources before posting here for assistance. I have spent all weekend and all of today trying to code what I feel ought to be a straightforward scenario using two dataframes, but, for the life of me, I am spinning my wheels and not making any significant progress.
The situation is there is one dataframe with Sales Data:
CUSTOMER ORDER SALES_DATE SALES_ITEM_NUMBER UNIT_PRICE SALES_QTY
001871 225404 01/31/2018 03266465555 1 200
001871 225643 02/02/2018 03266465555 2 600
001871 225655 02/02/2018 03266465555 3 1000
001956 228901 05/29/2018 03266461234 2.2658 20
and a second dataframe with Purchasing Data:
PO_DATE PO_ITEM_NUMBER PO_QTY PO_PRICE
01/15/2017 03266465555 1000 1.55
01/25/2017 03266465555 500 5.55
02/01/2017 03266461234 700 4.44
02/01/2017 03266461234 700 2.22
All I'm trying to do is figure out the maximum PO_PRICE that could be claimed for each of the lines on the Sales Order dataframe, because I'm trying to maximize the difference between what I bought an item for and what I sold it for.
When I first looked at this, I figured a straightforward nested for loop would do the trick, incrementing counters as I go. The issue, though, is that I'm not well versed enough in dataframes, so I keep getting hung up trying to access the elements within them. The thing to keep in mind as well is that I've sold 1,800 of the first item but only bought 1,500 of them. So, as I iterate through this:
For the first Sales Order row, I sold 200. The Max_PO_PRICE = $5.55 (for 500 of them). So, I need to deduct 200 from the PO_QTY dataframe, because I've now accounted for them.
For the second Sales Order row, I sold 600. There are still 300 I can claim that I bought for $5.55, but then I've exhausted all 500 of those, and so the best I can now do is dip into the other row, which has Max_PO_PRICE = $1.55 (for 1,000 of them). So for this one, I'd claim 300 at $5.55 and the other 300 at $1.55. I can't claim more than I bought.
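What this describes is a greedy match: for each sale, draw from the remaining purchase lots of that item, most expensive first, and deduct as you go. Before worrying about dataframes, the logic itself can be sketched with plain lists and dicts (the function and variable names here are made up for illustration):

```python
# Greedy matching: for each sale, consume remaining purchase lots of that
# item, highest PO_PRICE first, deducting quantities as they are claimed.
def max_po_value(sales, lots):
    """sales: list of (item, qty) tuples; lots: dict item -> list of [qty, price]."""
    total = 0.0
    for item, qty in sales:
        for lot in sorted(lots.get(item, []), key=lambda l: -l[1]):  # priciest first
            if qty == 0:
                break
            take = min(qty, lot[0])   # can't claim more than was bought
            lot[0] -= take            # deduct what we've now accounted for
            qty -= take
            total += take * lot[1]
    return total

sales = [("03266465555", 200), ("03266465555", 600)]
lots = {"03266465555": [[1000, 1.55], [500, 5.55]]}
print(max_po_value(sales, lots))  # 3240.0 = 500*5.55 + 300*1.55
```

The same idea transfers to pandas by sorting the purchases by price descending per item and tracking remaining quantities.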
Here's the code I've come up with. I think I may have gone about this all wrong, but some guidance and advice would be beyond incredibly appreciated and helpful.
I'm not asking anyone to write my code for me, but simply to advise what approach you would have taken, and whether there is a better way. I figure there has to be...
Thanks in advance for your feedback and assistance.
-Clare
for index1, row1 in sales.iterrows():
    SalesQty = sales.loc[index1]["SALES_QTY"]
    for index2, row2 in purchases.iterrows():
        if (row1['SALES_ITEM_NUMBER'] == row2['PO_ITEM_NUMBER']) and (row2['PO_QTY'] > 0):
            # Find the maximum PO price in the result set
            max_PO_Price = abc["PO_PRICE"].max()
            xyz = purchases.loc[index2]
            abc = abc.append(xyz)
            if SalesQty <= Purchase_Qty:
                print("Before decrement, PO_QTY = ", ???????)  # <==== this is where I'm struggle-bussing
                print()
        +index2
    # Drop the data from the xyz DataFrame
    xyz = xyz.iloc[0:0]
    # Drop the data from the abc DataFrame
    abc = abc.iloc[0:0]
    +index1
This looks like something SQL would elegantly handle through analytical functions. Fortunately Pandas comes with most (but not all) of this functionality and it's a lot faster than doing nested iterrows. I'm not a Pandas expert by any means but I'll give it a whizz. Apologies if I've misinterpreted the question.
Makes sense to group the SALES_QTY, we'll use this to track how much QTY we have:
sales_grouped = sales.groupby(["SALES_ITEM_NUMBER"], as_index = False).agg({"SALES_QTY":"sum"})
Let's combine the tables into one so we can iterate over one table instead of two. We can use a JOIN on the common columns "PO_ITEM_NUMBER" and "SALES_ITEM_NUMBER", or what Pandas calls a "merge". While we're at it, let's sort the table by "PO_ITEM_NUMBER" with the most expensive "PO_PRICE" on top; this and the next code block are the equivalent of a FN OVER (PARTITION BY ... ORDER BY ...) SQL analytical function.
sorted_table = purchases.merge(sales_grouped,
                               how = "left",
                               left_on = "PO_ITEM_NUMBER",
                               right_on = "SALES_ITEM_NUMBER").sort_values(by = ["PO_ITEM_NUMBER", "PO_PRICE"],
                                                                          ascending = False)
Let's create a column CUM_PO_QTY with the cumulative sum of the PO_QTY (partitioned/grouped by PO_ITEM_NUMBER). We'll use this to mark when we go over the max SALES_QTY.
sorted_table["CUM_PO_QTY"] = sorted_table.groupby(["PO_ITEM_NUMBER"], as_index = False)["PO_QTY"].cumsum()
This is where the custom part comes in: we can apply custom functions row by row (or even column by column) along the dataframe using apply(). We're creating two columns: TRACKED_QTY, which is simply SALES_QTY minus CUM_PO_QTY so we know when we have run into the negative, and PRICE_SUM, which will eventually hold the maximum value gained or spent. For now: if TRACKED_QTY is at least 0 we multiply PO_PRICE by PO_QTY, otherwise by SALES_QTY, so we never claim more than was bought.
sorted_table[["TRACKED_QTY", "PRICE_SUM"]] = sorted_table.apply(lambda x: pd.Series([x["SALES_QTY"] - x["CUM_PO_QTY"],
                                                                                     x["PO_QTY"] * x["PO_PRICE"]
                                                                                     if x["SALES_QTY"] - x["CUM_PO_QTY"] >= 0
                                                                                     else x["SALES_QTY"] * x["PO_PRICE"]]), axis = 1)
To handle the trailing negative TRACKED_QTY rows, we can keep the non-negative rows with a conditional mask, and group the negative rows by item so that only the maximum PRICE_SUM value survives.
Then simply append these two tables and sum them.
evaluated_table = sorted_table[sorted_table["TRACKED_QTY"] >= 0]
evaluated_table = pd.concat([evaluated_table,
                             sorted_table[sorted_table["TRACKED_QTY"] < 0].groupby(["PO_ITEM_NUMBER"], as_index = False).max()])
evaluated_table = evaluated_table.groupby(["PO_ITEM_NUMBER"], as_index = False).agg({"PRICE_SUM":"sum"})
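For reference, here is a self-contained sketch of the whole pipeline run on the sample data from the question (pd.concat stands in for the now-removed DataFrame.append; on this data it reproduces the hand-worked result of 500 × $5.55 + 1000 × $1.55 = 4325 for the first item, and 20 × $4.44 = 88.8 for the second):

```python
import pandas as pd

purchases = pd.DataFrame({
    "PO_ITEM_NUMBER": ["03266465555", "03266465555", "03266461234", "03266461234"],
    "PO_QTY": [1000, 500, 700, 700],
    "PO_PRICE": [1.55, 5.55, 4.44, 2.22],
})
sales = pd.DataFrame({
    "SALES_ITEM_NUMBER": ["03266465555"] * 3 + ["03266461234"],
    "SALES_QTY": [200, 600, 1000, 20],
})

# total sold per item
sales_grouped = sales.groupby("SALES_ITEM_NUMBER", as_index=False).agg({"SALES_QTY": "sum"})

# join, then sort each item's POs with the most expensive on top
sorted_table = purchases.merge(sales_grouped, how="left",
                               left_on="PO_ITEM_NUMBER", right_on="SALES_ITEM_NUMBER"
                               ).sort_values(by=["PO_ITEM_NUMBER", "PO_PRICE"], ascending=False)

# running total of purchased quantity within each item
sorted_table["CUM_PO_QTY"] = sorted_table.groupby("PO_ITEM_NUMBER")["PO_QTY"].cumsum()

# TRACKED_QTY flags when purchases exceed sales; PRICE_SUM values each PO line
sorted_table[["TRACKED_QTY", "PRICE_SUM"]] = sorted_table.apply(
    lambda x: pd.Series([x["SALES_QTY"] - x["CUM_PO_QTY"],
                         x["PO_QTY"] * x["PO_PRICE"]
                         if x["SALES_QTY"] - x["CUM_PO_QTY"] >= 0
                         else x["SALES_QTY"] * x["PO_PRICE"]]), axis=1)

# keep non-negative rows; for negative rows keep only the best-priced one per item
evaluated = pd.concat([
    sorted_table[sorted_table["TRACKED_QTY"] >= 0],
    sorted_table[sorted_table["TRACKED_QTY"] < 0]
        .groupby("PO_ITEM_NUMBER", as_index=False).max(),
])
result = evaluated.groupby("PO_ITEM_NUMBER", as_index=False).agg({"PRICE_SUM": "sum"})
print(result)
```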
Hope this works for you.
Background: I am having a list of several hundred departments that I would like to allocate budget as follow:
Each DEPT has an AMT_TOTAL budget within given number of months. They also have a monthly limit LIMIT_MONTH that they cannot exceed.
As each DEPT plans to spend their budget as fast as possible, we assume they will spend up to their monthly limit until AMT_TOTAL runs out. The forecast amount they will spend, given this assumption, is in AMT_ALLOC_MONTH.
My objective is to calculate the AMT_ALLOC_MONTH column, given the LIMIT_MONTH and AMT_TOTAL column. Based on what I've read and searched, I believe a combination of fillna and cumsum() can do the job. So far, the Python dataframe I've managed to generate is as followed:
I planned to fill the NaN using the following line:
table['AMT_ALLOC_MONTH'] = min((table['AMT_TOTAL'] - table.groupby('DEPT')['AMT_ALLOC_MONTH'].cumsum()).ffill(), table['LIMIT_MONTH'])
My objective is to have the AMT_TOTAL minus the cumulative sum of AMT_ALLOC_MONTH (excluding the NaN values), grouped by DEPT; the result is then compared with value in column LIMIT_MONTH, and the smaller value is filled in the NaN cells. The process is repeated till all NaN cells of each DEPT is filled.
Needless to say, the result did not come out as I expected; the line only works for the first NaN after a cell with a value, and subsequent NaN cells just copy the value above them. If there is a way to fix the issue, or a new and more intuitive way to do this, please help. Truly appreciated!
Try this:
for department in table['DEPT'].unique():
    subset = table[table['DEPT'] == department]
    for index, row in subset.iterrows():
        subset = table[table['DEPT'] == department]  # re-read to pick up the values written below
        cumsum = subset.loc[:index - 1, 'AMT_ALLOC_MONTH'].sum()  # assumes a default integer index
        limit = row['LIMIT_MONTH']
        remaining = row['AMT_TOTAL'] - cumsum
        table.at[index, 'AMT_ALLOC_MONTH'] = min(remaining, limit)
It's not very elegant, I guess, but it seems to work.
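If the assumption really is that every DEPT spends exactly its monthly limit until the total runs out (i.e. no months are pre-filled), the column can also be computed without a loop: each month's allocation is min(limit, max(0, total − sum of earlier limits)). A sketch with made-up numbers, using the column names from the question:

```python
import pandas as pd

table = pd.DataFrame({
    "DEPT": ["A"] * 3 + ["B"] * 2,
    "LIMIT_MONTH": [40, 40, 40, 50, 50],
    "AMT_TOTAL": [100, 100, 100, 60, 60],
})

# limits already spent in earlier months of the same DEPT
prior = table.groupby("DEPT")["LIMIT_MONTH"].cumsum() - table["LIMIT_MONTH"]
# budget left this month, floored at zero, capped at the monthly limit
remaining = (table["AMT_TOTAL"] - prior).clip(lower=0)
table["AMT_ALLOC_MONTH"] = remaining.clip(upper=table["LIMIT_MONTH"])
print(table["AMT_ALLOC_MONTH"].tolist())  # [40, 40, 20, 50, 10]
```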
Here's my data:
https://docs.google.com/spreadsheets/d/1Nyvx2GXUFLxrJdRTIKNAqIVvGP7FyiQ9NrjKiHoX3kE/edit?usp=sharing
Dataset
It's a small part of dataset with 100s of order_id.
I want to find the duration in the #timestamp column with respect to order_id. For example, for order_id 3300400, the duration will be from index 6 to index 0. Similarly for all other order ids.
I also want the sum of items.quantity and items.price with respect to order ids. For example, for order_id 3300400, the sum of items.quantity = 2 and the sum of items.price = 499 + 549 = 1048. Similarly for other order_ids.
I am new to python but I think it will need the use of loops. Any help will be highly appreciated.
Thanks and Regards,
Shantanu Jain
You have figured out how to use the groupby() method, which is good. Working out the diff in timestamps is a little more work.
# Function to get the first and last rows within each group
def get_index(df):
    return df.iloc[[0, -1]]

# apply the function, then use the diff method on ['#timestamp']
df['time_diff'] = df.groupby('order_id').apply(get_index)['#timestamp'].diff()
I haven't tested any of this code, and it will only work if your timestamps are pandas Timestamps (pd.Timestamp). It should at least give you an idea of where to start.
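An alternative that sidesteps positional indexing entirely is to aggregate per order: the duration is the max minus the min of the timestamps, and the quantity and price are plain sums. A sketch on made-up rows (column names taken from the question):

```python
import pandas as pd

# made-up rows; column names follow the question
df = pd.DataFrame({
    "order_id": [3300400, 3300400, 3300500],
    "#timestamp": pd.to_datetime(["2019-01-01 10:00", "2019-01-01 10:30",
                                  "2019-01-02 09:00"]),
    "items.quantity": [1, 1, 3],
    "items.price": [499, 549, 200],
})

summary = df.groupby("order_id").agg(
    duration=("#timestamp", lambda s: s.max() - s.min()),  # first-to-last span
    total_qty=("items.quantity", "sum"),
    total_price=("items.price", "sum"),
)
print(summary)
```

With these rows, order 3300400 gets a 30-minute duration, a quantity of 2, and a price total of 1048.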
Let's say we have several dataframes that contain relevant information that need to be compiled into one single dataframe. There are several conditions involved in choosing which pieces of data can be brought over to the results dataframe.
Here are 3 dataframes (columns only) that we need to pull and compile data from:
df1 = ["Date","Order#","Line#","ProductID","Quantity","Sale Amount"]
df2 = ["Date","PurchaseOrderID","ProductID","Quantity","Cost"]
df3 = ["ProductID","Quantity","Location","Cost"]
df3 is the only table in this set that actually contains a unique, non-repeating key, ProductID. The other two dataframes have keys, but they can repeat; the only way to establish uniqueness is to refer to the date together with the other foreign keys.
Now, we'd like the result to show all products, grouped by product, where df1.Date is after some date x, df2.Quantity < 5, and df3.Quantity > 0. Ideally the results would show df3.Quantity and df3.Cost (summing both within the grouping), the most recent purchase date from df2.Date, and the total number of sales per part from df1, where all of the above criteria are met.
This is the quickest example I could come up with on this issue. I'm able to accomplish this in VBA with only one problem... it's EXCRUCIATINGLY slow. I understand how list comprehension and perhaps other means of completing this task would be faster than VBA (maybe?), but it would still take a while with all of the logic and decision making that happens behind the scenes.
This example doesn't exactly show the complexities but any advice or direction you have to offer may help me and others understand how to treat these kinds of problems in Python. Any expert opinion, advice, direction is very much appreciated.
If I understand correctly:
You simply need to apply the conditions as filters on each dataframe, then group by ProductID and put it together.
df1 = df1[df1.Date > x].groupby('ProductID').agg({'Quantity':'sum','Sale Amount':'sum'})
df2 = df2.groupby('ProductID').agg({'Date':'max','Quantity':'sum','Cost':'sum'})
df2 = df2[df2.Quantity < 5].copy()
df3 = df3[df3.Quantity > 0].copy()
Once you have all of those, probably something like:
g = [i for i in list(df3.index) if i in list(df2.index) and i in list(df1.index)]
df = df3.loc[g] #use df3 as a frame, with only needed indexes
I am not sure what you want to pull from df1 and df2 - but it will look something like:
df = df.join(df2['col_needed'])
You may need to rename columns to avoid overlap.
This avoids inefficient looping and should be orders of magnitude faster than a loop in VBA.
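To make the shape of this concrete, here is a toy end-to-end run of the filter, group, and join steps (the data values, the cutoff date x, and the suffix names are all made up for illustration):

```python
import pandas as pd

# toy frames with the column layouts from the question
df1 = pd.DataFrame({"Date": pd.to_datetime(["2021-01-05", "2021-02-01", "2020-12-01"]),
                    "ProductID": ["P1", "P1", "P2"],
                    "Quantity": [2, 3, 1],
                    "Sale Amount": [20.0, 30.0, 10.0]})
df2 = pd.DataFrame({"Date": pd.to_datetime(["2021-01-02", "2021-01-20"]),
                    "ProductID": ["P1", "P2"],
                    "Quantity": [3, 10],
                    "Cost": [5.0, 8.0]})
df3 = pd.DataFrame({"ProductID": ["P1", "P2"], "Quantity": [4, 0],
                    "Location": ["A", "B"], "Cost": [2.0, 3.0]}).set_index("ProductID")

x = pd.Timestamp("2021-01-01")
s1 = df1[df1["Date"] > x].groupby("ProductID").agg({"Quantity": "sum", "Sale Amount": "sum"})
s2 = df2.groupby("ProductID").agg({"Date": "max", "Quantity": "sum", "Cost": "sum"})
s2 = s2[s2["Quantity"] < 5]
s3 = df3[df3["Quantity"] > 0]

# only products that survive all three filters
keep = s3.index.intersection(s2.index).intersection(s1.index)
result = s3.loc[keep].join(s2, rsuffix="_po").join(s1, rsuffix="_sales")
print(result)
```

With this data only P1 passes every filter: P2 is knocked out by the df2 quantity cap, the df3 stock check, and the date cutoff alike.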
Here's a pretty basic question, but my brain is giving up on me and I would really appreciate some help.
I have a dataset with 10000 rows.
I have an area name column with 100 unique area names.
I have a type column with types ranging from 1 to 10.
And I have a spend column.
I would like to group it by area name, and add a new column with an average spend per name (or even in the old spend column).
However:
I only want the average of the types from 1-7. So I want to exclude any types 8, 9 or 10 that are in the area.
Except, if an area contains only types 8, 9 or 10. In that case, I want the average of that spend.
What I've played with, but haven't managed to actually do it:
Approach 1:
Create 2 datasets, one with types1-7, another where there's only types 8, 9 or 10 in an area:
main=['1.','2.', '3.','4.', '5.', '6.', '7.']
eight_to_ten=['8.', '9.', '10.']
df_main = df[df['Type'].isin(main)]
df_main['avg_sales'] = df_main.groupby(['Area Name'])['Sales'].mean()
Approach 2:
df_new['avg_sales'] = df[df['Type'].isin(main)].groupby('Area Name')['Sales'].mean()
I assume there is a really short way of doing this, most likely without having to split the dataset into 2 and then concat it back.
Is it easier to do it with a for loop?
Any help would be appreciated
I believe you need to filter the rows by both lists first; then, if you need a new column per group, use GroupBy.transform:
m1 = df['Type'].isin(main)
m2 = df['Type'].isin(eight_to_ten)
df = df[m1 | m2].copy()
m1 = m1[m1 | m2]   # realign the mask with the filtered rows
df['avg_sales'] = df.groupby(['Area Name', m1])['Sales'].transform('mean')
Or, for a new DataFrame with aggregation, add an array to distinguish the two groups (this assumes numpy is imported as np):
arr = np.where(m1, 'first', 'second')
df1 = df.groupby(['Area Name', arr])['Sales'].mean().reset_index()
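A toy run of the transform idea (made-up data) shows what each row receives: rows of types 1–7 get the 1–7 average for their area, while rows of types 8–10 get the 8–10 average, which also covers areas containing only those types:

```python
import pandas as pd

# made-up data; "Type" values mimic the question's labels
df = pd.DataFrame({
    "Area Name": ["X", "X", "X", "Y", "Y"],
    "Type": ["1.", "2.", "8.", "9.", "10."],
    "Sales": [10, 20, 100, 40, 60],
})
main = ["1.", "2.", "3.", "4.", "5.", "6.", "7."]

m1 = df["Type"].isin(main)
# group by area AND whether the row is a "main" type
df["avg_sales"] = df.groupby(["Area Name", m1])["Sales"].transform("mean")
print(df["avg_sales"].tolist())  # [15.0, 15.0, 100.0, 50.0, 50.0]
```

Note that in a mixed area the 8–10 rows keep their own average; if you instead want them overwritten with the 1–7 average, you'd need an extra masking step afterwards.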
I would like to add a column at the end of a dataframe containing the moving average (EWM) for a specific value.
Currently, I am using 2 for loops:
for country in Country_Names:
    for i in i_Codes:
        EMA = df[(df['COUNTRY_NAME']==country) & (df['I_CODE']==i)].KRI_VALUE.ewm(span=6, adjust=False).mean()
        df.loc[(df['COUNTRY_NAME']==country) & (df['I_CODE']==i), 'EMA'] = EMA
This is really quite slow (takes a few minutes - I have more than 50,000 rows...): does anyone have a better idea?
Many thanks!
ODO22
I'm gonna guess how it might work without seeing the data — group by the two columns, select KRI_VALUE, and transform:
df['EMA'] = (df.groupby(['COUNTRY_NAME', 'I_CODE'])['KRI_VALUE']
               .transform(lambda x: x.ewm(span=6, adjust=False).mean()))
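With the column names from the question, a self-contained version of that idea might look like this (made-up values; a single pass over the frame instead of the nested loops):

```python
import pandas as pd

# made-up values; column names from the question
df = pd.DataFrame({
    "COUNTRY_NAME": ["FR", "FR", "FR", "DE", "DE"],
    "I_CODE": ["A", "A", "A", "B", "B"],
    "KRI_VALUE": [1.0, 2.0, 3.0, 10.0, 20.0],
})

# EWM computed independently within each (country, code) group
df["EMA"] = (df.groupby(["COUNTRY_NAME", "I_CODE"])["KRI_VALUE"]
               .transform(lambda s: s.ewm(span=6, adjust=False).mean()))
print(df)
```

With adjust=False and span=6, the smoothing factor is 2/7, so the first value of each group is unchanged and each subsequent value blends in 2/7 of the new observation.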