Ranking values based on two columns - python

I'm trying to devise a way to rank accounts from best to worst based on their telephone duration and margin.
The data looks like this;
ID  TIME_ON_PHONE  MARGIN
1   1235           1256
2   12             124
3   1635           0
4   124            652
5   0              4566
Any suggestions on how to rank them from best to worst?
ID 5 = best as we have spent no time on the phone but their margin is the most.
ID 3 = worst as we've spent ages on the phone but got no orders.
I've put it into excel to try and devise a solution but I can't get the ranking correct.

I would suggest creating a new metric like
New Metric = Margin / Time on phone
to compare each row.
To create a column with this metric just use:
dataframe["new_metric"] = dataframe["MARGIN"]/dataframe["TIME_ON_PHONE"]
Having 0 values in the TIME_ON_PHONE column will produce infinite values (rather than a clean error), so I recommend replacing those zeros with a very small value, like 0.001 or something, before dividing.
After that you can simply use this line of code to sort your rows:
dataframe = dataframe.sort_values("new_metric", ascending = False)
That way you would end up with the first ID being the best one, the second ID the second best one... etc.
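For a quick check, the two lines above can be combined on the question's sample data (using replace for the zero durations; just a sketch):

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "TIME_ON_PHONE": [1235, 12, 1635, 124, 0],
    "MARGIN": [1256, 124, 0, 652, 4566],
})

# Replace zero durations before dividing, so the zero-time account
# gets a huge ratio instead of a division-by-zero inf.
df["new_metric"] = df["MARGIN"] / df["TIME_ON_PHONE"].replace(0, 0.001)
ranked = df.sort_values("new_metric", ascending=False)
```

This puts ID 5 first and ID 3 last, matching the best/worst examples in the question.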
Hope it helps.

Related

Pandas: How do I find the average number of days between visits, grouped by customer?

In Python/Pandas, I want to create a column in my dataframe that shows the average number of days between customer visits at a venue. That is, for each customer, what are the average number of days between that customer's visits?
Data looks like
Image of My Data
Sorry I'm really inexperienced and don't know how to type the data up other than this. I am following the solution in this StackOverflow answer, except that that person wanted the average number of days between visits in general, and I want days between visits for each customer. Thank you.
For each customer, you would need to have date_of_first_visit and date_of_most_recent_visit as well as the number_of_visits. Then the equation would be something like
days_since_first = date_of_most_recent_visit - date_of_first_visit
average_days_between = days_since_first / number_of_visits
I think this might work:
avg_days_btw_visit_p_customer = df.groupby('CustomerID', as_index=False)['Days_Btw_Visit'].agg('mean')
This will return the following:
CustomerID Days_Btw_Visit
0 1 9.5
1 2 29.0
2 3 NaN
3 4 3.0
Also, if you want to get rid of the NaN, you can use:
avg_days_btw_visit_p_customer = avg_days_btw_visit_p_customer.dropna()
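Since the question's data is only shown as an image, here's a sketch with made-up column names (CustomerID, VisitDate) showing how the Days_Btw_Visit column itself can be built before the groupby-mean step:

```python
import pandas as pd

# Hypothetical visit log; the real column names may differ.
df = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "VisitDate": pd.to_datetime(
        ["2020-01-01", "2020-01-10", "2020-01-20", "2020-02-01", "2020-03-01"]),
})

# Per-customer gaps: sort, then diff within each group.
df = df.sort_values(["CustomerID", "VisitDate"])
df["Days_Btw_Visit"] = df.groupby("CustomerID")["VisitDate"].diff().dt.days

avg = df.groupby("CustomerID", as_index=False)["Days_Btw_Visit"].agg("mean")
```

For customer 1 the gaps are 9 and 10 days (mean 9.5), and for customer 2 a single 29-day gap, matching the shape of the output table above.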

Struggling with iterating through dataframes

I'm a newbie to Python, and have done my absolute best to exhaust all resources before posting here looking for assistance. I have spent all weekend and all day today trying to come up with what I feel ought to be a straightforward scenario to code for using two dataframes, but, for the life of me, I am spinning my wheels and not making any significant progress.
The situation is there is one dataframe with Sales Data:
CUSTOMER ORDER SALES_DATE SALES_ITEM_NUMBER UNIT_PRICE SALES_QTY
001871 225404 01/31/2018 03266465555 1 200
001871 225643 02/02/2018 03266465555 2 600
001871 225655 02/02/2018 03266465555 3 1000
001956 228901 05/29/2018 03266461234 2.2658 20
and a second dataframe with Purchasing Data:
PO_DATE PO_ITEM_NUMBER PO_QTY PO_PRICE
01/15/2017 03266465555 1000 1.55
01/25/2017 03266465555 500 5.55
02/01/2017 03266461234 700 4.44
02/01/2017 03266461234 700 2.22
All I'm trying to do is to figure out what the maximum PO_PRICE could be for each of the lines on the Sales Order dataframe, because I'm trying to maximize the difference between what I bought it for, and what I sold it for.
When I first looked at this, I figured a straightforward nested for loop would do the trick, and increment the counters. The issue though is that I'm not well versed enough in dataframes, and so I keep getting hung up trying to access the elements within them. The thing to keep in mind as well is that I've sold 1800 of the first item, but, only bought 1500 of them. So, as I iterate through this:
For the first Sales Order row, I sold 200. The Max_PO_PRICE = $5.55 [for 500 of them]. So, I need to deduct 200 from the PO_QTY dataframe, because I've now accounted for them.
For the second Sales Order row, I sold 600. There are still 300 I can claim that I bought for $5.55, but then I've exhausted all of those 500, and so the best I can now do is dip into the other row, which has the Max_PO_PRICE = $1.55 (for 1,000 of them). So for this one, I'd be able to claim 300 at $5.55, and the other 300 at $1.55. I can't claim for more than I bought.
Here's the code I've come up with, and, I think I may have gone about this all wrong, but, some guidance and advice would be beyond incredibly appreciated and helpful.
I'm not asking anyone to write my code for me, but, simply to advise what approach you would have taken, and if there is a better way. I figure there has to be....
Thanks in advance for your feedback and assistance.
-Clare
for index1, row1 in sales.iterrows():
    SalesQty = sales.loc[index1]["SALES_QTY"]
    for index2, row2 in purchases.iterrows():
        if (row1['SALES_ITEM_NUMBER'] == row2['PO_ITEM_NUMBER']) and (row2['PO_QTY'] > 0):
            # Find the Maximum PO Price in the result set
            max_PO_Price = abc["PO_PRICE"].max()
            xyz = purchases.loc[index2]
            abc = abc.append(xyz)
            if SalesQty <= Purchase_Qty:
                print("Before decrement, PO_QTY = ", ???????)  # <==== this is where I'm struggling
                print()
        +index2
    # Drop the data from the xyz DataFrame
    xyz = xyz.iloc[0:0]
    # Drop the data from the abc DataFrame
    abc = abc.iloc[0:0]
    +index1
This looks like something SQL would elegantly handle through analytical functions. Fortunately Pandas comes with most (but not all) of this functionality and it's a lot faster than doing nested iterrows. I'm not a Pandas expert by any means but I'll give it a whizz. Apologies if I've misinterpreted the question.
Makes sense to group the SALES_QTY, we'll use this to track how much QTY we have:
sales_grouped = sales.groupby(["SALES_ITEM_NUMBER"], as_index = False).agg({"SALES_QTY":"sum"})
Let's merge the tables into one so we can iterate over one table instead of two. We can use a JOIN on the common columns "PO_ITEM_NUMBER" and "SALES_ITEM_NUMBER", or what Pandas calls a "merge". While we're at it, let's sort the table by "PO_ITEM_NUMBER" with the most expensive "PO_PRICE" on top; this and the next code block are the equivalent of an FN OVER (PARTITION BY ... ORDER BY ...) SQL analytical function.
sorted_table = purchases.merge(sales_grouped,
how = "left",
left_on = "PO_ITEM_NUMBER",
right_on = "SALES_ITEM_NUMBER").sort_values(by = ["PO_ITEM_NUMBER", "PO_PRICE"],
ascending = False)
Let's create a column CUM_PO_QTY with the cumulative sum of the PO_QTY (partitioned/grouped by PO_ITEM_NUMBER). We'll use this to mark when we go over the max SALES_QTY.
sorted_table["CUM_PO_QTY"] = sorted_table.groupby(["PO_ITEM_NUMBER"], as_index = False)["PO_QTY"].cumsum()
This is where the custom part comes in: we can apply custom functions row-by-row (or even by column) along the dataframe using apply(). We're creating two columns: TRACKED_QTY, which is simply SALES_QTY minus CUM_PO_QTY so we know when we have run into the negative, and PRICE_SUM, which will eventually be the maximum value gained or spent. For now: if TRACKED_QTY is 0 or more we multiply the price by PO_QTY, else by SALES_QTY, so we never count more than was sold.
sorted_table[["TRACKED_QTY", "PRICE_SUM"]] = sorted_table.apply(lambda x: pd.Series([x["SALES_QTY"] - x["CUM_PO_QTY"],
x["PO_QTY"] * x["PO_PRICE"]
if x["SALES_QTY"] - x["CUM_PO_QTY"] >= 0
else x["SALES_QTY"] * x["PO_PRICE"]]), axis = 1)
To handle the trailing negative TRACKED_QTY rows, we can filter the positives with a conditional mask, and groupby the negatives keeping only the maximum PRICE_SUM value.
Then simply stack these two tables together and sum.
evaluated_table = sorted_table[sorted_table["TRACKED_QTY"] >= 0]
evaluated_table = pd.concat([evaluated_table,
                             sorted_table[sorted_table["TRACKED_QTY"] < 0]
                                 .groupby(["PO_ITEM_NUMBER"], as_index = False).max()])
evaluated_table = evaluated_table.groupby(["PO_ITEM_NUMBER"], as_index = False).agg({"PRICE_SUM":"sum"})
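As a sanity check on the pipeline above, here is a deliberately plain greedy version of the same allocation (not the vectorised method, just a cross-check) on the thread's sample data: for each item, cover the sold quantity from the most expensive POs first.

```python
import pandas as pd

purchases = pd.DataFrame({
    "PO_ITEM_NUMBER": ["03266465555", "03266465555", "03266461234", "03266461234"],
    "PO_QTY": [1000, 500, 700, 700],
    "PO_PRICE": [1.55, 5.55, 4.44, 2.22],
})
sales = pd.DataFrame({
    "SALES_ITEM_NUMBER": ["03266465555"] * 3 + ["03266461234"],
    "SALES_QTY": [200, 600, 1000, 20],
})

def max_po_value(sales, purchases):
    """For each item, cover the total sold quantity from the priciest POs first."""
    out = {}
    sold = sales.groupby("SALES_ITEM_NUMBER")["SALES_QTY"].sum()
    pos = purchases.sort_values("PO_PRICE", ascending=False)
    for item, remaining in sold.items():
        total = 0.0
        for _, po in pos[pos["PO_ITEM_NUMBER"] == item].iterrows():
            take = min(remaining, po["PO_QTY"])   # can't claim more than bought
            total += take * po["PO_PRICE"]
            remaining -= take
            if remaining <= 0:
                break
        out[item] = total
    return out

result = max_po_value(sales, purchases)
```

The iterrows here is fine because it only loops over a handful of PO rows per item; the Pandas pipeline above is the way to go at scale.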
Hope this works for you.

Group by ids, sort by date and get values as list on big data python

I have a big dataset (30 million rows).
The table has id, date, value.
I need to go over each id and, per id, get a list of its values sorted by date, so the first value in the list is the oldest date.
Example:
ID DATE VALUE
1 02/03/2020 300
1 04/03/2020 200
2 04/03/2020 456
2 01/03/2020 300
2 05/03/2020 78
Desire table:
ID VALUE_LIST_ORDERED
1 [300,200]
2 [300,456,78]
I can do it with a for loop or with apply, but it's not efficient, and with millions of users it's not feasible.
I thought about using groupby and sorting the dates, but I don't know how to make a list that way, and if so, is groupby on a pandas df the best approach?
I would love to get some suggestions on how to do it and which kind of df/technology to use.
Thank you!
what you need to do is to order your data using pandas.DataFrame.sort_values and then apply the groupby method
I don't have a huge data set to test this code on, but I believe this would do the trick (convert DATE to a real datetime first, otherwise dd/mm/yyyy strings sort lexicographically):
data['DATE'] = pd.to_datetime(data['DATE'], format='%d/%m/%Y')
sorted_data = data.sort_values('DATE')
result = sorted_data.groupby('ID')['VALUE'].apply(list)
and since it's Python you can always put everything in one statement
print(data.sort_values('DATE').groupby('ID')['VALUE'].apply(list))
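On the question's own sample rows, the sort-then-group approach produces exactly the desired table:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 2],
    "DATE": ["02/03/2020", "04/03/2020", "04/03/2020", "01/03/2020", "05/03/2020"],
    "VALUE": [300, 200, 456, 300, 78],
})

# Parse dd/mm/yyyy strings so the sort is chronological, not lexicographic.
df["DATE"] = pd.to_datetime(df["DATE"], format="%d/%m/%Y")
result = df.sort_values("DATE").groupby("ID")["VALUE"].apply(list)
```

result is a Series indexed by ID, with ID 1 mapping to [300, 200] and ID 2 to [300, 456, 78], as in the desired table.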

Grouping values based on another column and summing those values together

I'm currently working on a mock analysis of a mock MMORPG's microtransaction data. This is an example of a few lines of the CSV file:
PID Username Age Gender ItemID Item Name Price
0 Jack78 20 Male 108 Spikelord 3.53
1 Aisovyak 40 Male 143 Blood Scimitar 1.56
2 Glue42 24 Male 92 Final Critic 4.88
Here's where things get dicey- I successfully use the groupby function to get a result where purchases are grouped by the gender of their buyers.
test = purchase_data.groupby(['Gender', "Username"])["Price"].mean().reset_index()
gets me the result (truncated for readability)
Gender Username Price
0 Female Adastirin33 $4.48
1 Female Aerithllora36 $4.32
2 Female Aethedru70 $3.54
...
29 Female Heudai45 $3.47
.. ... ... ...
546 Male Yadanu52 $2.38
547 Male Yadaphos40 $2.68
548 Male Yalae81 $3.34
What I'm aiming for currently is to find the average amount of money spent by each gender as a whole. How I imagine this would be done is by creating a method that checks for the male/female/other tag in front of a username, and then adds the average spent by that person to a running total which I can then manipulate later. Unfortunately, I'm very new to Python- I have no clue where to even begin, or if I'm even on the right track.
Addendum: jezrael misunderstood the intent of this question. While he provided me with a method to clean up my output series, he did not provide me a method or even a hint towards my main goal, which is to group together the money spent by gender (Females are shown in all but my first snippet, but there are males further down the csv file and I don't want to clog the page with too much pasta) and put them towards a single variable.
Addendum2: Another solution suggested by jezrael,
purchase_data.groupby(['Gender'])["Price"].sum().reset_index()
creates
Gender Price
0 Female $361.94
1 Male $1,967.64
2 Other / Non-Disclosed $50.19
Sadly, using figures from this new series (which would yield the average price per purchase recorded in this csv) isn't quite what I'm looking for, due to the fact that certain users have purchased multiple items in the file. I'm hunting for a solution that lets me pull from my test frame the average amount of money spent per user, separated and grouped by gender.
It sounds to me like you think in terms of database tables. groupby() does not return one by default: the group label(s) are presented as row indices, not as a column. But you can make it behave that way instead (note the as_index argument to groupby()):
mean = purchase_data.groupby(['Gender', "Username"], as_index=False).mean()
gender = mean.groupby(['Gender'], as_index=False).mean()
Then what you want is probably gender[['Gender','Price']]
Basically, sum up per user, then average (mean) up per gender.
In one line
print(df.groupby(['Gender','Username']).sum()['Price'].reset_index()[['Gender','Price']].groupby('Gender').mean())
Or in some lines
df1 = df.groupby(['Gender','Username']).sum()['Price'].reset_index()
df2 = df1[['Gender','Price']].groupby('Gender').mean()
print(df2)
Some notes,
I read your example from the clipboard
import pandas as pd
df = pd.read_clipboard()
which required a separator or the item names to be without spaces.
I put an extra space into "Spikelord" for the test. Normally, you should provide an example file good enough to do the test, so you'd need one with at least one female in it.
To get the average spent per person, first you need to find the mean per username.
Then to get the average amount of average spent per user per gender, do groupby again:
df1 = df.groupby(by=['Gender', 'Username']).mean().groupby(by='Gender').mean()
df1['Gender'] = df1.index
df1.reset_index(drop=True, inplace=True)
df1[['Gender', 'Price']]
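Since the full CSV isn't shown, here's the sum-per-user-then-mean-per-gender idea on a tiny made-up sample:

```python
import pandas as pd

# Small made-up sample; the real file has one row per purchase.
df = pd.DataFrame({
    "Gender":   ["Female", "Female", "Female", "Male", "Male"],
    "Username": ["Ada", "Ada", "Bea", "Cal", "Cal"],
    "Price":    [4.0, 2.0, 3.0, 5.0, 1.0],
})

# Total spent per user, then the mean of those totals per gender.
per_user = df.groupby(["Gender", "Username"], as_index=False)["Price"].sum()
per_gender = per_user.groupby("Gender", as_index=False)["Price"].mean()
```

Here Ada spends 6 in total and Bea 3, so the Female average is 4.5; Cal spends 6, so the Male average is 6.0.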

Adding the quantities of products in a dataframe column in Python

I'm trying to calculate the sum of weights in a column of an excel sheet that contains the product title with the help of Numpy/Pandas. I've already managed to load the sheet into a dataframe, and isolate the rows that contain the particular product that I'm looking for:
dframe = xlsfile.parse('Sheet1')
dfFent = dframe[dframe['Product:'].str.contains("ABC") == True]
But, I can't seem to find a way to sum up its weights, due to the obvious complexity of the problem (as shown below). For eg. if the column 'Product Title' contains values like -
1 gm ABC
98% pure 12 grams ABC
0.25 kg ABC Powder
ABC 5gr
where ABC is the product whose weight I'm looking to add up. Is there any way that I can add these weights all up to get a total of 268 gm? Any help or resources pointing to the solution would be highly appreciated. Thanks! :)
You can use extractall for values with units or percentage:
(?P<a>\d+\.\d+|\d+) means extract float or int to column a
\s* - is zero or more spaces between number and unit
(?P<b>[a-z%]+) is extract lowercase unit or percentage after number to b
#add all possible units to dictonary
d = {'gm':1,'gr':1,'grams':1,'kg':1000,'%':.01}
df1 = df['Product:'].str.extractall(r'(?P<a>\d+\.\d+|\d+)\s*(?P<b>[a-z%]+)')
print (df1)
a b
match
0 0 1 gm
1 0 98 %
1 12 grams
2 0 0.25 kg
3 0 5 gr
Then convert the first column to numeric and map the second through the dictionary of units. Then reshape by unstack, multiply the columns together with prod, and finally sum:
a = df1['a'].astype(float).mul(df1['b'].map(d)).unstack().prod(axis=1).sum()
print (a)
267.76
Similar solution (prod(level=0) in older pandas):
a = df1['a'].astype(float).mul(df1['b'].map(d)).groupby(level=0).prod().sum()
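Run end-to-end on the four sample strings (using groupby(level=0).prod(), since the level argument of prod has been removed in current pandas), the extraction gives the expected total:

```python
import pandas as pd

df = pd.DataFrame({"Product:": [
    "1 gm ABC",
    "98% pure 12 grams ABC",
    "0.25 kg ABC Powder",
    "ABC 5gr",
]})

# All possible units, mapped to a multiplier in grams (and % to a fraction).
d = {'gm': 1, 'gr': 1, 'grams': 1, 'kg': 1000, '%': .01}
df1 = df['Product:'].str.extractall(r'(?P<a>\d+\.\d+|\d+)\s*(?P<b>[a-z%]+)')

# Multiply the matches within each row (e.g. 98% * 12 grams), then sum the rows.
total = df1['a'].astype(float).mul(df1['b'].map(d)).groupby(level=0).prod().sum()
```

That is 1 + 0.98*12 + 250 + 5 = 267.76 grams, i.e. the ~268 gm from the question.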
You need to do some data wrangling to get the column into a consistent format. You may do some matching and try to get the Product column aligned and consistent, similar to date-time formatting.
Like you may do the following things.
Make a separate column with only values(float)
Change % value to decimal and multiply by quantity
Replace value with kg to grams
Without any string, only float column to get total.
Pandas can work well with this problem.
Note: There is no shortcut to this problem, you need to get rid of strings mixed with decimal values for calculation of sum.
