The database I'm using shows the total number of debtors for every town for every quarter.
Since there are 43 towns listed, there are 43 'total debtors' values per quarter (30-Sep-17, etc.).
My goal is to find the total number of debtors for every quarter (so, in theory, summing the 43 'total debtors' values for each quarter), but I'm not quite sure how.
I've tried using the sum() function, but I'm not sure how to make it add the totals quarter by quarter.
Here's what the database looks like and my attempt (I printed the first 50 rows just to provide an idea of what it looks like)
https://i.imgur.com/h1y43j8.png
Sorry in advance if the explanation was a bit unclear.
You should use groupby. It's a nice pandas function to do exactly what you are trying to do. It groups the df according to whatever column you pick.
total_debtors_pq = df.groupby('Quarter end date')['Total number of debtors'].sum()
You can then extract the total for each quarter from total_debtors_pq.
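For example, to pull out a single quarter's total (the exact date label below is just an assumption based on the screenshot, so adjust it to however the dates are stored in your data):
# Total debtors for one quarter, looked up by index label
sep_2017_total = total_debtors_pq.loc['30-Sep-17']
# Or reset the index to get a small DataFrame of quarter / total pairs
total_debtors_pq.reset_index()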
Background: I have a list of several hundred departments to which I would like to allocate budget as follows:
Each DEPT has an AMT_TOTAL budget over a given number of months. It also has a monthly limit LIMIT_MONTH that it cannot exceed.
As each DEPT plans to spend its budget as fast as possible, we assume it will spend up to its monthly limit until AMT_TOTAL runs out. The amount we forecast it will spend, given this assumption, is in AMT_ALLOC_MONTH.
My objective is to calculate the AMT_ALLOC_MONTH column, given the LIMIT_MONTH and AMT_TOTAL columns. Based on what I've read and searched, I believe a combination of fillna and cumsum() can do the job. So far, the Python dataframe I've managed to generate is as follows:
I planned to fill the NaN using the following line:
table['AMT_ALLOC_MONTH'] = min((table['AMT_TOTAL'] - table.groupby('DEPT')['AMT_ALLOC_MONTH'].cumsum()).ffill, table['LIMIT_MONTH'])
My objective is to take AMT_TOTAL minus the cumulative sum of AMT_ALLOC_MONTH (excluding the NaN values), grouped by DEPT; the result is then compared with the value in the LIMIT_MONTH column, and the smaller of the two is filled into the NaN cell. The process is repeated until all NaN cells of each DEPT are filled.
Needless to say, the result did not come out as I expected; the line only works for the first NaN after a cell with a value, and subsequent NaN cells just copy the value above them. If there is a way to fix the issue, or a new and more intuitive way to do this, please help. Truly appreciated!
Try this:
for department in table['DEPT'].unique():
    subset = table[table['DEPT'] == department]
    for index, row in subset.iterrows():
        # Re-read the subset so allocations written in earlier iterations are included
        subset = table[table['DEPT'] == department]
        # Sum everything already allocated to this DEPT in the rows above the current one
        cumsum = subset.loc[:index-1, 'AMT_ALLOC_MONTH'].sum()
        limit = row['LIMIT_MONTH']
        remaining = row['AMT_TOTAL'] - cumsum
        # Allocate whatever is left of the budget, capped at the monthly limit
        table.at[index, 'AMT_ALLOC_MONTH'] = min(remaining, limit)
It's not very elegant I guess, but it seems to work.
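If the rows are already sorted by month within each DEPT, a vectorized sketch is also possible. This is only a sketch under the question's own assumption that a department always spends exactly its monthly limit until the budget runs out (so any already-filled months follow that pattern too):
import numpy as np

# Cumulative limit of all *previous* months within each DEPT
prev_limits = table.groupby('DEPT')['LIMIT_MONTH'].cumsum() - table['LIMIT_MONTH']

# Whatever budget is left after the previous months, capped at this month's limit
table['AMT_ALLOC_MONTH'] = np.minimum(
    (table['AMT_TOTAL'] - prev_limits).clip(lower=0),
    table['LIMIT_MONTH'],
)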
I have a dataframe that looks like this
I need to adjust the time_in_weeks column for the entry with the value 34. When there is a duplicate uniqueid with a different rma_created_date, that means some failure occurred. The 34 needs to be changed to the number of weeks between the most recent rma_created_date (2020-10-15 in this case) and the rma_processed_date of the row above (2020-06-28).
I hope that makes sense in terms of what I am trying to do.
So far I did this
def clean_df(df):
    '''
    This function will fix the time_in_weeks column to calculate the correct number of weeks
    when there are multiple failures for an item.
    '''
    # Sort by rma_created_date
    df = df.sort_values(by=['rma_created_date'])
Now I need to perform what I described above but I am a little confused on how to do this. Especially considering we could have multiple failures and not just 2.
I should get something like this returned as output
As you can see, the 34 got changed to the number of weeks between 2020-10-15 and 2020-06-26.
Here is another example with more rows
Using the expression suggested
df['time_in_weeks'] = np.where(df.uniqueid.duplicated(keep='first'),
                               df.rma_processed_date.dt.isocalendar().week.sub(df.rma_processed_date.dt.isocalendar().week.shift(1)),
                               df.time_in_weeks)
I get this
Final note: if there is a date of 1/1/1900 then don't perform any calculation.
The question isn't very clear, so I'm happy to correct this if I've interpreted it wrongly.
Try np.where(condition, choice if condition is True, choice if condition is False):
#Coerce dates into datetime
df['rma_processed_date']=pd.to_datetime(df['rma_processed_date'])
df['rma_created_date']=pd.to_datetime(df['rma_created_date'])
#Solution
df['time_in_weeks'] = np.where(df.uniqueid.duplicated(keep='first'),
                               df.rma_created_date.sub(df.rma_processed_date),
                               df.time_in_weeks)
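The subtraction above yields a Timedelta rather than a number of weeks. If whole weeks are wanted, and assuming (per the question) that the relevant date is the rma_processed_date of the previous row for that uniqueid and that the 1/1/1900 placeholder should be skipped, a sketch along these lines could follow; the groupby/shift choice is my assumption, not part of the answer above:
import numpy as np
import pandas as pd

# Previous row's processed date within each uniqueid (assumed reading of the question)
prev_processed = df.groupby('uniqueid')['rma_processed_date'].shift(1)

# Only recalculate duplicated uniqueids, and skip the 1/1/1900 placeholder dates
mask = df['uniqueid'].duplicated(keep='first') & (prev_processed != pd.Timestamp('1900-01-01'))

df['time_in_weeks'] = np.where(
    mask,
    (df['rma_created_date'] - prev_processed).dt.days // 7,
    df['time_in_weeks'],
)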
I'm a newbie to Python, and I've done my absolute best to exhaust all resources before posting here for assistance. I have spent all weekend and all day today trying to code what I feel ought to be a straightforward scenario using two dataframes, but, for the life of me, I am spinning my wheels and not making any significant progress.
The situation is there is one dataframe with Sales Data:
CUSTOMER ORDER SALES_DATE SALES_ITEM_NUMBER UNIT_PRICE SALES_QTY
001871 225404 01/31/2018 03266465555 1 200
001871 225643 02/02/2018 03266465555 2 600
001871 225655 02/02/2018 03266465555 3 1000
001956 228901 05/29/2018 03266461234 2.2658 20
and a second dataframe with Purchasing Data:
PO_DATE PO_ITEM_NUMBER PO_QTY PO_PRICE
01/15/2017 03266465555 1000 1.55
01/25/2017 03266465555 500 5.55
02/01/2017 03266461234 700 4.44
02/01/2017 03266461234 700 2.22
All I'm trying to do is to figure out what the maximum PO_PRICE could be for each of the lines on the Sales Order dataframe, because I'm trying to maximize the difference between what I bought it for, and what I sold it for.
When I first looked at this, I figured a straightforward nested for loop would do the trick, and increment the counters. The issue though is that I'm not well versed enough in dataframes, and so I keep getting hung up trying to access the elements within them. The thing to keep in mind as well is that I've sold 1800 of the first item, but, only bought 1500 of them. So, as I iterate through this:
For the first Sales Order row, I sold 200. The Max_PO_PRICE = $5.55 [for 500 of them]. So, I need to deduct 200 from the PO_QTY dataframe, because I've now accounted for them.
For the second Sales Order row, I sold 600. There are still 300 I can claim that I bought for $5.55, but then I've exhausted all of those 500, so the best I can now do is dip into the other row, which has Max_PO_PRICE = $1.55 (for 1,000 of them). So for this one, I'd be able to claim 300 at $5.55 and the other 300 at $1.55. I can't claim more than I bought.
Here's the code I've come up with, and, I think I may have gone about this all wrong, but, some guidance and advice would be beyond incredibly appreciated and helpful.
I'm not asking anyone to write my code for me, but, simply to advise what approach you would have taken, and if there is a better way. I figure there has to be....
Thanks in advance for your feedback and assistance.
-Clare
for index1, row1 in sales.iterrows():
    SalesQty = sales.loc[index1]["SALES_QTY"]
    for index2, row2 in purchases.iterrows():
        if (row1['SALES_ITEM_NUMBER'] == row2['PO_ITEM_NUMBER']) and (row2['PO_QTY'] > 0):
            # Find the Maximum PO Price in the result set
            max_PO_Price = abc["PO_PRICE"].max()
            xyz = purchases.loc[index2]
            abc = abc.append(xyz)
            if (SalesQty <= Purchase_Qty):
                print("Before decrement, PO_QTY = ", ???????)  # <==== this is where I'm struggle busing
                print()
        +index2
    # Drop the data from the xyz DataFrame
    xyz = xyz.iloc[0:0]
    # Drop the data from the abc DataFrame
    abc = abc.iloc[0:0]
    +index1
This looks like something SQL would elegantly handle through analytical functions. Fortunately Pandas comes with most (but not all) of this functionality and it's a lot faster than doing nested iterrows. I'm not a Pandas expert by any means but I'll give it a whizz. Apologies if I've misinterpreted the question.
It makes sense to group the SALES_QTY first; we'll use this to track how much QTY we have:
sales_grouped = sales.groupby(["SALES_ITEM_NUMBER"], as_index = False).agg({"SALES_QTY":"sum"})
Let's combine the two tables into one so we can iterate over one table instead of two. We can use a JOIN on the common columns "PO_ITEM_NUMBER" and "SALES_ITEM_NUMBER", or what Pandas calls a "merge". While we're at it, let's sort the table by "PO_ITEM_NUMBER" with the most expensive "PO_PRICE" on top; this and the next code block are the equivalent of a FN() OVER (PARTITION BY ... ORDER BY ...) SQL analytical function.
sorted_table = purchases.merge(sales_grouped,
                               how = "left",
                               left_on = "PO_ITEM_NUMBER",
                               right_on = "SALES_ITEM_NUMBER").sort_values(by = ["PO_ITEM_NUMBER", "PO_PRICE"],
                                                                           ascending = False)
Let's create a column CUM_PO_QTY with the cumulative sum of the PO_QTY (partitioned/grouped by PO_ITEM_NUMBER). We'll use this to mark when we go over the max SALES_QTY.
sorted_table["CUM_PO_QTY"] = sorted_table.groupby(["PO_ITEM_NUMBER"], as_index = False)["PO_QTY"].cumsum()
This is where the custom part comes in: we can apply custom functions row by row (or even column by column) along the dataframe using apply(). We're creating two columns: TRACKED_QTY, which is simply SALES_QTY minus CUM_PO_QTY so we know when we have run into the negative, and PRICE_SUM, which will eventually be the maximum value gained or spent. For now: if TRACKED_QTY is at least 0 we multiply PO_PRICE by PO_QTY, otherwise by SALES_QTY, so we never claim more than we sold.
sorted_table[["TRACKED_QTY", "PRICE_SUM"]] = sorted_table.apply(lambda x: pd.Series([x["SALES_QTY"] - x["CUM_PO_QTY"],
x["PO_QTY"] * x["PO_PRICE"]
if x["SALES_QTY"] - x["CUM_PO_QTY"] >= 0
else x["SALES_QTY"] * x["PO_PRICE"]]), axis = 1)
To handle the trailing negative TRACKED_QTY rows, we can keep the non-negative rows using a conditional mask, and group the negative rows by item, keeping only the maximum PRICE_SUM value.
Then simply append these two tables and sum them.
evaluated_table = sorted_table[sorted_table["TRACKED_QTY"] >= 0]
evaluated_table = evaluated_table.append(sorted_table[sorted_table["TRACKED_QTY"] < 0].groupby(["PO_ITEM_NUMBER"], as_index = False).max())
evaluated_table = evaluated_table.groupby(["PO_ITEM_NUMBER"], as_index = False).agg({"PRICE_SUM":"sum"})
Hope this works for you.
I'm new to python, and I have this assignment I have to deliver soon.
I have a .xlsx file that I've imported with pandas. It's a file from my workplace which tells us the day (Mon - Sat), the time (from 10 am to 8 pm), the sales per hour, the visiting customers, and the customers that actually bought from the store (5 rows, 65 col). How can I get the total sales for each of the days? I tried to get the sum for Monday by slicing the rows for that day, but it wasn't accurate.
monday = (data['Sales per hour'][1:12].sum())
Is there a better way to sum the data for Monday without having to hard-code the slice [1:12].sum()?
Here is a pic of the file I'm using. I want to get the total sum for each of the days and plot them in a histogram. I'd also like to plot a comparison histogram between visiting customers and buying customers.
The file
You can try Pandas's groupby to resolve your issue.
First, rename the column for easier use by removing the blank spaces from the name:
data.rename(columns = {'Sales per hour':'Sales_per_hour'}, inplace = True)
Daywise_Data=data.groupby('Day').Sales_per_hour.sum().reset_index()
This will give you the day-wise data in a separate data frame, which can then be used to plot the histogram.
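A minimal plotting sketch (assuming matplotlib is available and the day column is named 'Day' as above; a bar chart of the daily totals is usually what's meant here rather than a true histogram):
import matplotlib.pyplot as plt

# Bar chart of total sales per day
Daywise_Data.plot(kind='bar', x='Day', y='Sales_per_hour', legend=False)
plt.ylabel('Total sales')
plt.tight_layout()
plt.show()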
Probably a naive question, but I'm new to this:
I have a column with 100000 entries containing dates from Jan 1, 2018 to August 1, 2019 (with repeated entries as well). I want to create a new column in which a number, let's say 3500, is divided up in such a way that sum(new_column) for a particular day is less than or equal to 3500.
For example, let's say 01-01-2018 has 40 entries in the dataset; then 3500 is to be distributed randomly between those 40 entries in such a way that the total of these 40 rows is less than or equal to 3500, and this needs to be done for all the dates in the dataset.
Can anyone advise me on how to achieve that?
EDIT : The excel file is Here
Thanks
My answer is not the best, but it may work for you. Because you have 100000 entries it will probably slow down performance, so apply it and then paste values: the solution uses the RANDBETWEEN function, which recalculates every time you make a change in a cell.
So I made some test data like this:
The first column, ID, would be the dates, and the second column would be the random numbers.
The bottom right corner shows the totals, so as you can see, the totals for each ID sum up to 3500.
The formula I've used is:
=IF(COUNTIF($A$2:$A$7;A2)=1;3500;IF(COUNTIF($A$2:A2;A2)=COUNTIF($A$2:$A$7;A2);3500-SUMIF($A$1:A1;A2;$B$1:B1);IF(COUNTIF($A$2:A2;A2)=1;RANDBETWEEN(1;3500);RANDBETWEEN(1;3500-SUMIF($A$1:A1;A2;$B$1:B1)))))
And it works pretty well. Pressing F9 to recalculate the worksheet gives new random numbers, but they always sum up to 3500.
Hope you can adapt this to your needs.
UPDATE: You need to know that my solution will always force the numbers to sum up to exactly 3500; in no case will the sum of all values be less than 3500. If you need that, you'll have to adapt that part. As I said, not my best answer...
UPDATE 2: Uploaded a sample file to my Gdrive in case you want to check how it works. https://drive.google.com/open?id=1ivW2b0b05WV32HxcLc11gP2JWvdYTa84
You will need 2 columns:
one to count the number of dates and one for the values.
Formula in B2 is =COUNTIF($A$2:$A$51,A2)
Formula in C2 is =RANDBETWEEN(1,3500/B2)
Column B gives the repetition count for each date.
Column C gives a random number whose sum per date will be at most 3500.
The range in the column B formula is $A$2:$A$51, which you can change according to your data.
EDIT
For each date in your list you can apply a formula like below
The formula in D2 is =SUMIF(B:B,B2,C:C)
For the difference value for each unique date, you can use a pivot and apply the formula to the sum for each date, like below.
Formula in J2 is =3500-I2
Sorry - a little late to the party but this looked like a fun challenge!
The simplest way I could think of is to add a rand() column (then hard code, if required) and then another column which calculates the 3500 split per date, based on the rand() column.
Here's the function:
=ROUNDDOWN(3500*B2/SUMIF($A$2:$A$100000,A2,$B$2:$B$100000),0)
Illustrated here: