I have a huge dataset with a lot of different client names, bills etc.
Now I want to show the 4 clients with the highest cumulative total bill.
So far I have used the groupby function:
data.groupby(by = ["CustomerName","Bill"], as_index=False).sum()
I tried to group by the customer name and the bill, but instead of the total sum of each customer's orders I only get the individual orders.
Can someone tell me how to get customer X (with the highest accumulated bill) and the sum of all their orders in first position, the customer with the second highest accumulated bill in second position, and so on?
Big thanks!
Since I don't know the full structure of your data frame, I recommend subsetting the relevant columns first:
data = data[["CustomerName", "Bill"]]
Then, you just need to group by CustomerName and sum over all columns (Bill in that case):
data.groupby(by=["CustomerName"]).sum()
Finally, sort by the Bill column in descending order and print the top 4 rows:
data = data.sort_values(by='Bill', ascending=False)
print(data.head(4))
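For reference, the same result can also be obtained in one chained expression, starting from the original, un-grouped data frame (same column names as assumed above):
top4 = (
    data.groupby("CustomerName", as_index=False)["Bill"]
        .sum()
        .nlargest(4, "Bill")
)
print(top4)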
QUESTION
What is the quickest/best (most pythonic) way of repeatedly (1000 times) adding a set of rows (e.g. 35) to a df (e.g. 40000 rows), where the criterion for each row is that its ID does not already appear twice in the df, and the ID should be chosen following a certain strategy (e.g. the lowest possible number)?
CONTEXT
A compartment in a warehouse can hold a maximum of two different items. The location is given by a locationID (A115.14604A0) that I can split up into Tower (A), floor (11), aisle (5), column (146), shelf (04) and, lastly, compartment (A0). I only get a row when an item exists. Example:
Date item ID location_id ean quantity volume total v
0 2020-08-17 9200000074604211 A115.14604A0 7710958034710 1 820000 820000.0
1 2020-08-17 9200000122486821 A123.12702E1 2407047440917 2 450000 900000.0
So there is never a row when the quantity is 0. Hence not all locations can be 'looped over' from the DF itself (although I don't think that would be a good solution anyway).
I need to 'fill' this df with new items, but I never have to add/change numbers in the original stock file, only append new rows. So each item always forms a new row, but the same location ID can appear at most twice. This format needs to stay the same.
STRATEGIES I HAVE THOUGHT OF:
Original plan
Group by stock on unique Item_ID, create column (nr_unique)
Join with an empty template of the DF, to get all the locations (if no match, nr_unique = 0)
For loop over Tower (A), floor (11), aisle (5), column (146), shelf (04) and lastly, compartment, following a certain logic (for instance start with lowest tower).
If quantity of a location is < 1, append a new line to the original stock file. Repeat this (1000 times) for each new set (35) of items.
NOT pythonic, bit smarter
For loop over Tower (A), floor (11), aisle (5), column (146), shelf (04) and lastly, compartment, following a certain logic (for instance start with the lowest tower). If the location is not in the DF, create (concatenate) & append; if it appears only once in the DF, create (concatenate) & append; if it appears twice or more, continue in the loop.
Repeat (1000 times) for each new set (35) of items.
Dumb vectorized method
Group by stock on unique Item_ID, create column (nr_unique)
Join with an empty template of the DF, to get all the locations (if no match, nr_unique = 0)
Get all rows where nr_unique < 2
Select x rows based on strategy (for instance where aisle is lowest)
For my original DF, select those rows.
Add 1 to all those (or loop over them if you want everything to be fully filled to 2)
Repeat (1000 times) for every set (35) of items (a rough sketch of this idea follows below).
????? there must be a smarter way to do this?
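For what it's worth, a rough sketch of what the 'dumb vectorized' idea above could look like (column names taken from the example rows; the 'lowest location first' rule and the helper name are my own assumptions, not code from the question):
import pandas as pd

def place_items(stock, all_locations, new_items):
    # count how many items each location currently holds (0 for empty locations)
    counts = (stock.groupby('location_id').size()
                   .reindex(all_locations, fill_value=0))
    # locations that still have space, lowest location_id first (assumed strategy)
    free = counts[counts < 2].sort_index()
    # one slot per remaining capacity; assumes there are enough free slots
    slots = free.index.repeat((2 - free).to_numpy())
    new_rows = new_items.copy()
    new_rows['location_id'] = slots[:len(new_rows)]
    # append the new rows without touching the original ones
    return pd.concat([stock, new_rows], ignore_index=True)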
EVALUATION
I really lean towards 1, but everywhere I look people warn me against looping over dataframes. This needs to loop over all items every time, so it will be slow. What I am looking for is a way to add rows to my df in a pythonic manner.
Regards,
Charles
I ended up creating a class that produced separate compartments, filling each one with items and volumes.
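A minimal sketch of that idea (the class, method names and the 'first free compartment' fill rule are my own assumptions, not the original code):
import pandas as pd

class Compartment:
    # one warehouse compartment; holds at most two different items
    def __init__(self, location_id):
        self.location_id = location_id
        self.items = []                      # list of (item_id, volume) tuples

    def has_space(self):
        return len(self.items) < 2

    def add(self, item_id, volume):
        if not self.has_space():
            raise ValueError(f"{self.location_id} is already full")
        self.items.append((item_id, volume))

def fill(compartments, new_items):
    # place each new item in the first compartment that still has space
    rows = []
    for item_id, volume in new_items:
        for comp in compartments:
            if comp.has_space():
                comp.add(item_id, volume)
                rows.append({'item ID': item_id,
                             'location_id': comp.location_id,
                             'volume': volume})
                break
    return pd.DataFrame(rows)   # rows to append to the original stock file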
I am new to pandas and Stack Overflow, so I will try my best to explain what my problem is.
I have a dataframe like the one below, and I would like to aggregate rows with the same Customer id and Date (so each Customer id-Date combination appears only once) using several rules:
Sum of Quantity for that date-customer id (how many pieces in total the customer bought each purchase day)
Count of Sales id for that date-customer id (how many sales orders the customer placed each purchase day)
Distinct count of Shop id for that date-customer id (from how many shops the customer placed orders each purchase day)
Lastly, the product category contains only 2 products, which I identified as 0 or 1. I would like to add 2 columns that count the number of sales orders of product category 0 and of product category 1.
I tried using the below code to solve the first 3 points but without success.
df = df.groupby('customer id','date').sum('Quantity').count('Sales id').nunique('Shop id')
I'm really struggling with the fourth point.
Hope you can help me out here.
Dataframe
Desired Output
I found the solution for the first 3 points using the agg() method:
df.groupby(['Customer id','Date'],as_index=False).agg({'Quantity' : ['sum'], 'Sale id' : ['count'], 'shop id' : ['nunique']})
Ideally I would add within agg() two additional aggregations that count 'Product category' when it is 0 and when it is 1.
Any ideas?
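One possibility (a sketch, assuming the category column is literally named 'Product category', holds the values 0 and 1, and the other column names are as in the snippet above) is pandas named aggregation with small lambdas:
df.groupby(['Customer id', 'Date'], as_index=False).agg(
    quantity=('Quantity', 'sum'),
    sales_orders=('Sale id', 'count'),
    shops=('shop id', 'nunique'),
    cat0_orders=('Product category', lambda s: (s == 0).sum()),
    cat1_orders=('Product category', lambda s: (s == 1).sum()),
)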
I have the following dataframe (this is a sample; there are many rows):
Student ID avg
0 205842 68.333333
1 280642 74.166667
I want to sort by decreasing average percentage grade and, if equal, by increasing Student ID.
I have been able to sort by one column as below, but I'm unsure how to sort by two as I want:
df_pct_scores.sort_values(by='avg', ascending=False)
Please see if this works:
df_pct_scores.sort_values(by=['avg', 'Student ID'], ascending=[False, True])
I have been working with a dataset which contains information about houses that have been sold on a particular market. There are two columns, 'price', and 'date'.
I would like to make a line plot to show how the prices of this market have changed over time.
The problem is, I see that some houses have been sold on the same date but with a different price.
So ideally I would need to get the mean/average price of the houses sold on each date before plotting.
So for example, if I had something like this:
DATE / PRICE
02/05/2015 / $100
02/05/2015 / $200
I would need to get a new row with the following average:
DATE / PRICE
02/05/2015 / $150
I just haven't been able to figure it out yet. I would appreciate anyone who could guide me in this matter. Thanks in advance.
Assuming you're using pandas and your DataFrame is called df:
df.groupby('DATE')['PRICE'].mean()
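A sketch of the full workflow up to the plot, assuming PRICE is already numeric, the dates are day-first strings, and the frame is called df as above:
import pandas as pd
import matplotlib.pyplot as plt

df['DATE'] = pd.to_datetime(df['DATE'], dayfirst=True)   # assumption: dd/mm/yyyy strings
daily_mean = df.groupby('DATE')['PRICE'].mean()           # average sale price per date

daily_mean.plot()                                          # line plot over time
plt.ylabel('mean price')
plt.show()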
Background: I am trying to use data from a csv file to ask questions and draw conclusions based on the data. The data is a log of patient visits from a clinic in Brazil, including additional patient data and whether the patient was a no-show or not. I have chosen to examine correlations between the patient's age and the no-show data.
Problem: Given visit number, patient ID, age, and no-show data, how do I compile an array of ages that corresponds to each unique patient ID (so that I can evaluate the mean age of the unique patients visiting the clinic)?
My code:
import pandas as pd

# data set of no shows at a clinic in Brazil
noshow_data = pd.read_csv('noshowappointments-kagglev2-may-2016.csv')
noshow_df = pd.DataFrame(noshow_data)
Here is the beginning of the code, with the head of the whole dataframe from the csv shown.
# Next I construct a dataframe with only the data I'm interested in:
ptid = noshow_df['PatientId']
ages = noshow_df['Age']
noshow = noshow_df['No-show']
ptid_ages_noshow = pd.DataFrame({'PatientId': ptid, 'Ages': ages,
                                 'No_show': noshow})
ptid_ages_noshow
Here I have sorted the data to show the multiple visits of a unique patient
# Now, I know how to determine the total number of unique patients:
# total number of unique patients
num_unique_pts = noshow_df.PatientId.unique()
len(num_unique_pts)
If I want to find the mean age of all the patients during the course of all visits I would use:
# mean age of all vists
ages = noshow_data['Age']
ages.mean()
So my question is this, how could I find the mean age of all the unique patients?
You can simply use the groupby function available in pandas, restricted to the relevant columns:
ptid_ages_noshow[['PatientId','Ages']].groupby('PatientId').mean()
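If you then want a single number (the mean age across unique patients), one way, building on the line above, is to take the mean of that per-patient result:
per_patient = ptid_ages_noshow[['PatientId', 'Ages']].groupby('PatientId').mean()
print(per_patient['Ages'].mean())   # mean age over unique patients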
So you only want to keep one appointment per patient for the calculation? This is how to do it:
noshow_df.drop_duplicates('PatientId')['Age'].mean()
Keep in mind that the age of people changes over time. You need to decide how you want to handle this.
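For example, if you want each patient's age at their most recent visit, one option (assuming the appointment date column in the csv is called 'AppointmentDay'; check your file) is:
latest = noshow_df.sort_values('AppointmentDay', ascending=False)
latest.drop_duplicates('PatientId')['Age'].mean()   # mean age at each patient's latest visit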