How to apply percentile ranking on the pivot table? - python

Dummy Dataset
import pandas as pd

df = pd.DataFrame({
    "Business": ["Hotel", "Hotel", "Transport", "Agri", "Tele", "Hotel", "Transport", "Agri", "Tele"],
    "Location": ["101", "101", "101", "101", "103", "105", "102", "103", "106"],
    "Area": ["A", "A", "A", "A", "B", "C", "D", "B", "F"],
})
activity_cat_countby_subarea = df.groupby(["Area", "Location", "Business"]).size().reset_index(name="counts")
activity_cat_countby_subarea = activity_cat_countby_subarea.reset_index().sort_values(["counts"], ascending=False)
After converting to the pivot table, I apply the ranking at the overall count level:
activity_cat_countby_subarea['overll_pct_rank'] = activity_cat_countby_subarea['counts'].rank(pct=True)
But my requirement is to apply the ranking based on each business's count, i.e. I need to find the ranking within each business (e.g. "Hotel") and its count.
Kindly assist, and let me know if you need more information.

Instead of doing this:
activity_cat_countby_subarea['overll_pct_rank'] = activity_cat_countby_subarea['counts'].rank(pct=True)
Do this:
activity_cat_countby_subarea['overll_pct_rank'] = activity_cat_countby_subarea.groupby(['Business', 'counts']).rank(pct=True)
activity_cat_countby_subarea.sort_index(inplace=True)
#Output
index Area Location Business counts overll_pct_rank
0 0 A 101 Agri 1 0.5
1 1 A 101 Hotel 2 1.0
2 2 A 101 Transport 1 0.5
3 3 B 103 Agri 1 1.0
4 4 B 103 Tele 1 0.5
5 5 C 105 Hotel 1 1.0
6 6 D 102 Transport 1 1.0
7 7 F 106 Tele 1 1.0
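If what you need is literally a percentile rank of the counts within each Business (rather than within each (Business, counts) pair), here is a minimal sketch of that variant, assuming the activity_cat_countby_subarea frame built above; the business_pct_rank column name is just illustrative, and tied counts share a percentile:
# percentile rank of 'counts' within each Business group
activity_cat_countby_subarea['business_pct_rank'] = (
    activity_cat_countby_subarea.groupby('Business')['counts'].rank(pct=True)
)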

Related

Apply a softmax function on groupby in the same pandas dataframe

I have been looking to apply the following softmax function from https://machinelearningmastery.com/softmax-activation-function-with-python/
from scipy.special import softmax
# define data
data = [1, 3, 2]
# calculate softmax
result = softmax(data)
# report the probabilities
print(result)
[0.09003057 0.66524096 0.24472847]
I am trying to apply this to a dataframe which is split into groups, and return the probabilities row by row for each group.
My dataframe is:
import pandas as pd

# Create DF
d = {
    'EventNo': ['10', '10', '12', '12', '12'],
    'Name': ['Joe', 'Jack', 'John', 'James', 'Jim'],
    'Rating': [30, 32, 2.5, 3, 4],
}
df = pd.DataFrame(data=d)
df
EventNo Name Rating
0 10 Joe 30.0
1 10 Jack 32.0
2 12 John 2.5
3 12 James 3.0
4 12 Jim 4
In this instance there are two different events (10 and 12): for event 10 the values are data = [30, 32], and for event 12 they are data = [2.5, 3, 4].
My expected result would be a new column probabilities with the results:
EventNo Name Rating Probabilities
0 10 Joe 30.0 0.1192
1 10 Jack 32.0 0.8807
2 12 John 2.5 0.1402
3 12 James 3.0 0.2312
4 12 Jim 4 0.6285
Any help on how to do this on all groups in the dataframe would be much appreciated! Thanks!
You can use groupby followed by transform, which returns results indexed by the original dataframe. A simple way to do it would be:
df["Probabilities"] = df.groupby('EventNo')["Rating"].transform(softmax)
The result is
EventNo Name Rating Probabilities
0 10 Joe 30.0 0.119203
1 10 Jack 32.0 0.880797
2 12 John 2.5 0.140244
3 12 James 3.0 0.231224
4 12 Jim 4.0 0.628532
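If you prefer to avoid the scipy dependency, the same transform pattern works with a hand-rolled, numerically stable softmax. A sketch assuming the df above (softmax_stable is just an illustrative helper name):
import numpy as np

def softmax_stable(s):
    # subtract the group max before exponentiating for numerical stability
    e = np.exp(s - s.max())
    return e / e.sum()

df["Probabilities"] = df.groupby("EventNo")["Rating"].transform(softmax_stable)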

Pandas reshape a multicolumn dataframe long to wide with conditional check

I have a pandas data frame as follows:
id   group  type  action   cost
101  A      1              10
101  A      1     repair   3
102  B      1              5
102  B      1     repair   7
102  B      1     grease   2
102  B      1     inflate  1
103  A      2              12
104  B      2              9
I need to reshape it from long to wide, but depending on the value of the action column, as follows:
id   group  type  action_std  action_extra
101  A      1     10          3
102  B      1     5           10
103  A      2     12          0
104  B      2     9           0
In other words, for the rows with an empty action field the cost value should go under the action_std column, while for the rows with a non-empty action field the cost values should be summed under the action_extra column.
I've attempted several combinations of groupby / agg / pivot but I cannot find a fully working solution...
I would suggest you simply split the cost column into a cost and a cost_extra column. Something like the following:
import numpy as np

result = df.assign(
    cost_extra=lambda df: np.where(df['action'].notnull(), df['cost'], np.nan)
).assign(
    cost=lambda df: np.where(df['action'].isnull(), df['cost'], np.nan)
).groupby(
    ["id", "group", "type"]
)[["cost", "cost_extra"]].agg("sum")
result looks like:
cost cost_extra
id group type
101 A 1 10.0 3.0
102 B 1 5.0 10.0
103 A 2 12.0 0.0
104 B 2 9.0 0.0
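If you want the question's column names on that result, a small follow-up (just a rename, nothing else changes):
result = result.rename(columns={"cost": "action_std", "cost_extra": "action_extra"}).reset_index()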
Check groupby with unstack
df.cost.groupby([df.id,df.group,df.type,df.action.eq('')]).sum().unstack(fill_value=0)
action False True
id group type
101 A 1 3 10
102 B 1 10 5
103 A 2 0 12
104 B 2 0 9
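If you want the question's column names instead of the boolean labels, a hedged follow-up on that unstacked frame (assuming the blank action cells are empty strings; the False column holds rows that have an action, True the blank ones):
out = df.cost.groupby([df.id, df.group, df.type, df.action.eq('')]).sum().unstack(fill_value=0)
out.columns = ['action_extra', 'action_std']   # False -> extra cost, True -> standard cost
out = out.reset_index()[['id', 'group', 'type', 'action_std', 'action_extra']]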
Thanks for your hints, this is the solution that I finally like the most (also for its simplicity):
df["action_std"] = df["cost"].where(df["action"] == "")
df["action_extra"] = df["cost"].where(df["action"] != "")
df = df.groupby(["id", "group", "type"])["action_std", "action_extra"].sum().reset_index()
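For reference, a self-contained version of that last approach, assuming the blank action cells are empty strings (use isna()/notna() instead if they are NaN):
import pandas as pd

df = pd.DataFrame({
    "id":     [101, 101, 102, 102, 102, 102, 103, 104],
    "group":  ["A", "A", "B", "B", "B", "B", "A", "B"],
    "type":   [1, 1, 1, 1, 1, 1, 2, 2],
    "action": ["", "repair", "", "repair", "grease", "inflate", "", ""],
    "cost":   [10, 3, 5, 7, 2, 1, 12, 9],
})
df["action_std"] = df["cost"].where(df["action"] == "")
df["action_extra"] = df["cost"].where(df["action"] != "")
out = df.groupby(["id", "group", "type"])[["action_std", "action_extra"]].sum().reset_index()
print(out)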

Count votes of a survey by the answer

I'm working with a survey relative to income. I have my data like this:
form Survey1 Survey2 Country
0 1 1 1 1
1 2 1 2 5
2 3 2 2 4
3 4 2 1 1
4 5 2 2 4
I want to group by the answer and by the Country. For example, if Survey2 refers to the number of cars the respondent owns, I want to know the number of people that own one car in a given country.
The expected output is as follows:
Country Survey1_1 Survey1_2 Survey2_1 Survey2_2
0 1 1 1 2 0
1 4 0 2 0 2
2 5 1 0 0 1
Here I added '_#' where # is the answer to count.
So far I've written code to find the distinct answers for each column and to count the responses for a given answer, say 1, but I haven't found a way to count the answers for a specific country.
# j + ci indexes the survey column being processed in the wider loop
number_unic = df.iloc[:, j + ci].nunique()              # count unique answers
val_unic = list(df.iloc[:, j + ci].unique())            # unique answers
for i in range(len(val_unic)):
    names = str(df.columns[j + ci]) + '_' + str(val_unic[i])   # name of the new column
    count = (df.iloc[:, j + ci] == val_unic[i]).sum()          # count values equal to this answer
    df.insert(len(df.columns.values), names, count)            # insert the new column
I would do this with a pivot_table:
In [11]: df.pivot_table(["Survey1", "Survey2"], ["Country"], df.groupby("Country").cumcount())
Out[11]:
Survey1 Survey2
0 1 0 1
Country
1 1.0 2.0 1.0 1.0
4 2.0 2.0 2.0 2.0
5 1.0 NaN 2.0 NaN
To get the output you wanted you could do something like:
In [21]: res = df.pivot_table(["Survey1", "Survey2"], ["Country"], df.groupby("Country").cumcount())
In [22]: res.columns = [s + "_" + str(n + 1) for s, n in res.columns.values]
In [23]: res
Out[23]:
Survey1_1 Survey1_2 Survey2_1 Survey2_2
Country
1 1.0 2.0 1.0 1.0
4 2.0 2.0 2.0 2.0
5 1.0 NaN 2.0 NaN
But, generally it's better to use the MultiIndex here...
To count the number of each response you can do a somewhat more complicated groupby and value_counts:
In [31]: df1 = df.set_index("Country")[["Survey1", "Survey2"]] # more columns work fine here
In [32]: df1.unstack().groupby(level=[0, 1]).value_counts().unstack(level=0, fill_value=0).unstack(fill_value=0)
Out[32]:
Survey1 Survey2
1 2 1 2
Country
1 1 1 2 0
4 0 2 0 2
5 1 0 0 1
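Another route to the wide counts-per-answer table is to melt the two survey columns into long form and use pd.crosstab; a sketch assuming a df with the columns shown in the question (the long/counts names are just illustrative):
long = df.melt(id_vars="Country", value_vars=["Survey1", "Survey2"],
               var_name="survey", value_name="answer")
counts = pd.crosstab(long["Country"], [long["survey"], long["answer"]])
counts.columns = [f"{s}_{a}" for s, a in counts.columns]   # e.g. Survey1_1, Survey2_2
counts = counts.reset_index()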

Python - Pivot and create histograms from Pandas column, with missing values

Having the following Data Frame:
name value count total_count
0 A 0 1 20
1 A 1 2 20
2 A 2 2 20
3 A 3 2 20
4 A 4 3 20
5 A 5 3 20
6 A 6 2 20
7 A 7 2 20
8 A 8 2 20
9 A 9 1 20
----------------------------------
10 B 0 10 75
11 B 5 30 75
12 B 6 20 75
13 B 8 10 75
14 B 9 5 75
I would like to pivot the data, grouping each row by the name value, then creating columns based on the value & count columns aggregated into bins.
Explanation: I have 10 possible values, in the range 0-9, and not all of them are present in each group. In the above example group B is missing values 1, 2, 3, 4, 7. I would like to create a histogram with 5 bins, ignoring missing values, and calculate the percentage of count for each bin. So the result will look like this:
name 0-1 2-3 4-5 6-7 8-9
0 A 0.150000 0.2 0.3 0.2 0.150000
1 B 0.133333 0.0 0.4 0.4 0.066667
For example, for bin 0-1 of group A the calculation is the sum of count for the values 0 and 1 (1+2) divided by the total_count of group A:
name 0-1
0 A (1+2)/20 = 0.15
I was looking into the hist method and this StackOverflow question, but I'm still struggling to figure out the right approach.
Use pd.cut to bin your feature, then use df.groupby().count() and the .unstack() method to get the dataframe you are looking for. Within the groupby you can use any aggregation function (.sum(), .count(), etc.). The code below is a worked example.
import pandas as pd
import numpy as np

df = pd.DataFrame(
    data={'name': ['Group A', 'Group B'] * 5,
          'number': np.arange(0, 10),
          'value': np.arange(30, 40)})
df['number_bin'] = pd.cut(df['number'], bins=np.arange(0, 10))

# Option 1: sums
df.groupby(['number_bin', 'name'])['value'].sum().unstack(0)

# Option 2: counts
df.groupby(['number_bin', 'name'])['value'].count().unstack(0)
The null values in the original data will not affect the result.
To get the exact result you could try this.
bins = range(10)
res = df.groupby('name')['count'].sum()
intervals = pd.cut(df.value, bins=bins, include_lowest=True)
df1 = (df.groupby([intervals, "name"])['count'].sum() / res).unstack(0)
df1.columns = df1.columns.astype(str)                          # convert the cols to string
df1.columns = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']    # rename the cols
cols = ['a', 'b', 'd', 'f', 'h']
df1 = df1.add(df1.iloc[:, 1:].shift(-1, axis=1), fill_value=0)[cols]
print(df1)
You can manually rename the cols later.
# Output:
a b d f h
name
A 0.150000 0.2 0.3 0.200000 0.15
B 0.133333 NaN 0.4 0.266667 0.20
You can replace the NaN values using df1.fillna(0).
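A more direct route is to cut value into the five labelled bins up front and pivot; a sketch that reconstructs the question's frame (the bin labels and column names are just illustrative) and gives the same numbers as above, with the NaN already filled with 0 and the bins labelled:
import pandas as pd

df = pd.DataFrame({
    "name": ["A"] * 10 + ["B"] * 5,
    "value": list(range(10)) + [0, 5, 6, 8, 9],
    "count": [1, 2, 2, 2, 3, 3, 2, 2, 2, 1, 10, 30, 20, 10, 5],
    "total_count": [20] * 10 + [75] * 5,
})

labels = ["0-1", "2-3", "4-5", "6-7", "8-9"]
df["bin"] = pd.cut(df["value"], bins=[0, 2, 4, 6, 8, 10], right=False, labels=labels)

out = (df.pivot_table(index="name", columns="bin", values="count",
                      aggfunc="sum", fill_value=0)
         .div(df.groupby("name")["total_count"].first(), axis=0)
         .reset_index())
print(out)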

Loop that counts unique values in a pandas df

I am trying to create a loop or a more efficient process that can count the current values in a pandas df. At the moment I'm selecting the value I want to perform the function on.
So for the df below, I'm trying to determine two counts.
1) ['u'] returns the count of remaining occurrences of the same ['Code', 'Area'] value, i.e. how many more times the same value occurs later on.
2) ['On'] returns the number of ['Area'] values that are currently occurring. It does this by scanning ahead through the df to see whether those values occur again, so it essentially looks into the future.
import pandas as pd

d = {
    'Code': ['A', 'A', 'A', 'A', 'B', 'A', 'B', 'A', 'A', 'A'],
    'Area': ['Home', 'Work', 'Shops', 'Park', 'Cafe', 'Home', 'Cafe', 'Work', 'Home', 'Park'],
}
df = pd.DataFrame(data=d)
#Select value
df1 = df[df.Code == 'A'].copy()
df1['u'] = df1[::-1].groupby('Area').Area.cumcount()
ids = [1]
seen = set([df1.iloc[0].Area])
dec = False
for val, u in zip(df1.Area[1:], df1.u[1:]):
ids.append(ids[-1] + (val not in seen) - dec)
seen.add(val)
dec = u == 0
df1['On'] = ids
df1 = df1.reindex(df.index).fillna(df1)
The problem is that I want to run this script on all values in Code instead of selecting one at a time. For instance, to do the same thing for Code 'B', I would have to change the selection to df1 = df[df.Code == 'B'].copy() and then run the script again.
If I have numerous values in Code this becomes very inefficient. I need a loop that handles all unique values in 'Code'. Ideally, the script would look like:
df1 = df[df.Code == 'All unique values'].copy()
Intended Output:
Code Area u On
0 A Home 2.0 1.0
1 A Work 1.0 2.0
2 A Shops 0.0 3.0
3 A Park 1.0 3.0
4 B Cafe 1.0 1.0
5 A Home 1.0 3.0
6 B Cafe 0.0 1.0
7 A Work 0.0 3.0
8 A Home 0.0 2.0
9 A Park 0.0 1.0
I find your "On" logic very confusing. That said, I think I can reproduce it:
df["u"] = df.groupby(["Code", "Area"]).cumcount(ascending=False)
df["nunique"] = pd.get_dummies(df.Area).groupby(df.Code).cummax().sum(axis=1)
df["On"] = (df["nunique"] -
(df["u"] == 0).groupby(df.Code).cumsum().groupby(df.Code).shift().fillna(0)
which gives me
In [212]: df
Out[212]:
Code Area u nunique On
0 A Home 2 1 1.0
1 A Work 1 2 2.0
2 A Shops 0 3 3.0
3 A Park 1 4 3.0
4 B Cafe 1 1 1.0
5 A Home 1 4 3.0
6 B Cafe 0 1 1.0
7 A Work 0 4 3.0
8 A Home 0 4 2.0
9 A Park 0 4 1.0
In this, u is the number of matching (Code, Area) pairs after that row. nunique is the number of unique Area values seen so far in that Code.
On is the number of unique Areas seen so far, except that once we "run out" of an Area -- once it's not used any more -- we start subtracting it from nunique.
Using GroupBy with size and cumcount, you can construct your u series.
Your logic for On isn't clear: this requires clarification.
g = df.groupby(['Code', 'Area'])
df['u'] = g['Code'].transform('size') - (g.cumcount() + 1)
print(df)
  Code   Area  u
0    A   Home  2
1    A   Work  1
2    A  Shops  0
3    A   Park  1
4    B   Cafe  1
5    A   Home  1
6    B   Cafe  0
7    A   Work  0
8    A   Home  0
9    A   Park  0
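Alternatively, staying closer to the question's own logic, you can wrap the per-Code steps in a function and apply it to every Code at once. A sketch assuming the df built in the question (annotate is just an illustrative name):
def annotate(group):
    group = group.copy()
    # 'u': how many more times the same Area occurs later within this Code
    group["u"] = group[::-1].groupby("Area").cumcount()
    # 'On': the question's running count of Areas still in play for this Code
    ids = [1]
    seen = {group["Area"].iloc[0]}
    dec = False
    for val, u in zip(group["Area"].iloc[1:], group["u"].iloc[1:]):
        ids.append(ids[-1] + (val not in seen) - dec)
        seen.add(val)
        dec = u == 0
    group["On"] = ids
    return group

out = df.groupby("Code", group_keys=False).apply(annotate).sort_index()
print(out)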
