pandas grouping and visualization - python

I have to do some analysis using Python 3 and pandas. A toy example of the dataset is shown below:
data
'''
location importance agent count
0 London Low chatbot 2
1 NYC Medium chatbot 1
2 London High human 3
3 London Low human 4
4 NYC High human 1
5 NYC Medium chatbot 2
6 Melbourne Low chatbot 3
7 Melbourne Low human 4
8 Melbourne High human 5
9 NYC High chatbot 5
'''
My aim is to group by location and then count the number of Low, Medium, and High 'importance' values for each location. So far, the code I have come up with is:
data.groupby(['location', 'importance']).aggregate(np.size)
'''
agent count
location importance
London High 1 1
Low 2 2
Melbourne High 1 1
Low 2 2
NYC High 2 2
Medium 2 2
'''
This grouping and count aggregation puts the grouping columns in the index:
data.groupby(['location', 'importance']).aggregate(np.size).index
I don't know how to proceed from here. Also, how can I visualize this?
Help?

I think you need DataFrame.pivot_table with aggfunc='sum' to aggregate any duplicates, and then DataFrame.plot:
df = data.pivot_table(index='location', columns='importance', values='count', aggfunc='sum')
df.plot()
If you need counts of the location and importance pairs, use crosstab:
df = pd.crosstab(data['location'], data['importance'])
df.plot()
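Put together as a runnable sketch (the bar chart kind is my suggestion; this reconstructs the toy data from the question):

import pandas as pd

data = pd.DataFrame({
    'location':   ['London', 'NYC', 'London', 'London', 'NYC',
                   'NYC', 'Melbourne', 'Melbourne', 'Melbourne', 'NYC'],
    'importance': ['Low', 'Medium', 'High', 'Low', 'High',
                   'Medium', 'Low', 'Low', 'High', 'High'],
    'agent':      ['chatbot', 'chatbot', 'human', 'human', 'human',
                   'chatbot', 'chatbot', 'human', 'human', 'chatbot'],
    'count':      [2, 1, 3, 4, 1, 2, 3, 4, 5, 5],
})

# counts of (location, importance) pairs in wide form:
# one row per location, one column per importance level
df = pd.crosstab(data['location'], data['importance'])

# grouped bar chart: one group of bars per location
df.plot(kind='bar')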

Related

Counting non-filtered value_counts along with filtered values in pandas

Assuming that I have a dataframe of pastries
Pastry Flavor Qty
0 Cupcake Cheese 3
1 Cakeslice Chocolate 2
2 Tart Honey 2
3 Croissant Raspberry 1
And I get the value count of a specific flavor per pastry
df[df['Flavor'] == 'Cheese']['Pastry'].value_counts()
Cupcake 4
Tart 4
Cakeslice 3
Turnover 3
Creampie 2
Danish 2
Bear Claw 2
Then to get the percentile of that flavor qty, I could do this
df[df['Flavor'] == 'Cheese']['Pastry'].value_counts().describe(percentiles=[.75, .85, .95])
And I'd get something like this (from full dataframe)
count 35.00000
mean 1.485714
std 0.853072
min 1.000000
50% 1.000000
75% 2.000000
85% 2.000000
95% 3.300000
max 4.000000
The total number of different cheese-flavored pastries is 35, so the total cheese qty is distributed among those 35 pastries. The mean qty is 1.48, the max qty is 4 (cupcake and tart), etc.
What I want to do is bring that 95th percentile down by also counting all the values in the flavor column which are not 'Cheese'; however, value_counts() only counts the 'Cheese' rows because I filtered the dataframe. How can I also count the non-Cheese rows, so that my percentiles go down and represent the distribution of the Cheese totals across the entire dataframe?
This is an example output:
Cupcake 4
Tart 4
Cakeslice 3
Turnover 3
Creampie 2
Danish 2
Bear Claw 2
Swiss Roll 1
Baklava 0
Cannoli 0
Here the non-cheese-flavored pastries are included with a qty of 0; from there I can just take the percentiles, and they will be reduced since the 0 values now dilute them.
I decided to try the long way to solve this, and my result gave me the same answer as this question.
Here is the long way, in case anyone is curious.
pastries = {}
for p in df['Pastry'].unique():
    # count the Cheese rows for this pastry (0 if there are none)
    pastries[p] = df[(df['Flavor'] == 'Cheese') & (df['Pastry'] == p)]['Pastry'].count()

newdf = pd.DataFrame(list(pastries.items()), columns=['Pastry', 'Qty'])
newdf.describe(percentiles=[.75, .85, .95])
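For what it's worth, the same zero-padded counts can be had in one step by reindexing the filtered value_counts against every unique pastry; reindex with fill_value supplies the 0 rows. A sketch using the toy frame from the top of the question:

import pandas as pd

df = pd.DataFrame({
    'Pastry': ['Cupcake', 'Cakeslice', 'Tart', 'Croissant'],
    'Flavor': ['Cheese', 'Chocolate', 'Honey', 'Raspberry'],
    'Qty':    [3, 2, 2, 1],
})

# counts of Cheese rows per pastry, padded with 0 for every pastry
# that never appears with the Cheese flavor
counts = (
    df[df['Flavor'] == 'Cheese']['Pastry']
    .value_counts()
    .reindex(df['Pastry'].unique(), fill_value=0)
)
counts.describe(percentiles=[.75, .85, .95])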

How is DIFF calculated on customer demographics in featuretools?

I have two tables, one of customer information and one of transaction info.
Customer information includes each person's quality of health (from 0 to 100)
e.g. if I extract just the Name and HealthQuality columns:
John: 70
Mary: 20
Paul: 40
etc etc.
After applying featuretools I noticed a new DIFF(HealthQuality) variable.
According to the docs, this is what DIFF does:
"Compute the difference between the value in a list and the previous value in that list."
Is featuretools calculating the difference between Mary and John's health quality in this instance?
I don't think this kind of feature synthesis really works for customer records e.g. CUM_SUM(emails_sent) for John. John's record is one row, and he has one value for the amount of emails we sent him.
For now I am using the ignore_variables=[all_customer_info] option to remove all of the customer data (except for the transactions table, of course).
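For context, in the pre-1.0 featuretools API (which matches the DIFF/variables terminology here), ignore_variables takes a dict mapping an entity name to the columns to skip. A minimal sketch, assuming toy tables like the ones above; the entity and column names are just this example's:

import featuretools as ft
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'Name': ['John', 'Mary', 'Paul'],
    'HealthQuality': [70, 20, 40],
})
transactions = pd.DataFrame({
    'transaction_id': [1, 2, 3, 4],
    'customer_id': [1, 1, 2, 3],
    'amount': [20.0, 23.9, 20.0, 35.6],
})

es = ft.EntitySet(id='data')
es = es.entity_from_dataframe(entity_id='customers', dataframe=customers, index='customer_id')
es = es.entity_from_dataframe(entity_id='transactions', dataframe=transactions, index='transaction_id')
es = es.add_relationship(ft.Relationship(es['customers']['customer_id'],
                                         es['transactions']['customer_id']))

# skip the customer-level columns when synthesizing features
feature_matrix, features = ft.dfs(
    entityset=es,
    target_entity='customers',
    ignore_variables={'customers': ['Name', 'HealthQuality']},
)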
This also leads me into another question.
Using data from the transactions table, John now has a DIFF(MEAN(transactions.amount)). What is the DIFF measured in this instance?
id MEAN(transactions.amount) DIFF(MEAN(transactions.amount))
0 1 21.950000 NaN
1 2 20.000000 -1.950000
2 3 35.604581 15.604581
3 4 NaN NaN
4 5 22.782682 NaN
5 6 35.616306 12.833624
6 7 24.560536 -11.055771
7 8 331.316552 306.756016
8 9 60.565852 -270.750700
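The numbers are consistent with a plain row-to-row difference, i.e. pandas Series.diff() applied to the MEAN(...) column in id order. A sketch reproducing the column above (not featuretools' actual implementation):

import pandas as pd

# the MEAN(transactions.amount) column from the feature matrix above
means = pd.Series([21.950000, 20.000000, 35.604581, None,
                   22.782682, 35.616306, 24.560536, 331.316552, 60.565852])

# row-to-row difference reproduces DIFF(MEAN(transactions.amount)):
# 20.0 - 21.95 = -1.95, 35.604581 - 20.0 = 15.604581, and the NaNs
# appear wherever the current or previous value is missing
print(means.diff())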

Sum based on grouping in pandas dataframe?

I have a pandas dataframe df which contains:
major men women rank
Art 5 4 1
Art 3 5 3
Art 2 4 2
Engineer 7 8 3
Engineer 7 4 4
Business 5 5 4
Business 3 4 2
Basically I need to find the total number of students, men and women combined, per major, regardless of the rank column. So for Art, for example, the total should be all men + women, i.e. 23; Engineer 26; Business 17.
I have tried
df.groupby(['major']).sum()
But this separately sums the men and women rather than combining their totals.
Just add both columns and then groupby:
(df.men+df.women).groupby(df.major).sum()
major
Art 23
Business 17
Engineer 26
dtype: int64
melt() then groupby():
df.drop(columns='rank').melt('major').groupby('major', as_index=False)['value'].sum()
major value
0 Art 23
1 Business 17
2 Engineer 26
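Both one-liners reproduce the expected totals; a self-contained sketch of the first, for reference:

import pandas as pd

df = pd.DataFrame({
    'major': ['Art', 'Art', 'Art', 'Engineer', 'Engineer', 'Business', 'Business'],
    'men':   [5, 3, 2, 7, 7, 5, 3],
    'women': [4, 5, 4, 8, 4, 5, 4],
    'rank':  [1, 3, 2, 3, 4, 4, 2],
})

# add the two columns row-wise first, then group the resulting
# Series by the major column of the original frame
totals = (df.men + df.women).groupby(df.major).sum()
print(totals)  # Art 23, Business 17, Engineer 26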

Get unique values from pandas series of lists

I have a column in a DataFrame containing lists of categories. For example:
0 [Pizza]
1 [Mexican, Bars, Nightlife]
2 [American, New, Barbeque]
3 [Thai]
4 [Desserts, Asian, Fusion, Mexican, Hawaiian, F...
6 [Thai, Barbeque]
7 [Asian, Fusion, Korean, Mexican]
8 [Barbeque, Bars, Pubs, American, Traditional, ...
9 [Diners, Burgers, Breakfast, Brunch]
11 [Pakistani, Halal, Indian]
I am attempting to do two things:
1) Get unique categories. My approach is to have a set, iterate through the series, and union in each list.
My code:
unique_categories = {'Pizza'}
for lst in restaurant_review_df['categories_arr']:
    unique_categories = unique_categories | set(lst)
This gives me a set of the unique categories contained in all the lists in the column.
2) Generate a pie plot of the category counts, where each restaurant can belong to multiple categories. For example, restaurant 11 belongs to the Pakistani, Indian, and Halal categories. My approach is again to iterate through the categories, with one more iteration through the series to get the counts.
Are there simpler or elegant ways of doing this?
Thanks in advance.
Update using pandas 0.25.0+ with explode
df['category'].explode().value_counts()
Output:
Barbeque 3
Mexican 3
Fusion 2
Thai 2
American 2
Bars 2
Asian 2
Hawaiian 1
New 1
Brunch 1
Pizza 1
Traditional 1
Pubs 1
Korean 1
Pakistani 1
Burgers 1
Diners 1
Indian 1
Desserts 1
Halal 1
Nightlife 1
Breakfast 1
Name: Places, dtype: int64
And with plotting:
df['category'].explode().value_counts().plot.pie(figsize=(8,8))
Output: [pie chart of the category counts]
For older versions of pandas, before 0.25.0
Try:
df['category'].apply(pd.Series).stack().value_counts()
Output:
Mexican 3
Barbeque 3
Thai 2
Fusion 2
American 2
Bars 2
Asian 2
Pubs 1
Burgers 1
Traditional 1
Brunch 1
Indian 1
Korean 1
Halal 1
Pakistani 1
Hawaiian 1
Diners 1
Pizza 1
Nightlife 1
New 1
Desserts 1
Breakfast 1
dtype: int64
With plotting:
df['category'].apply(pd.Series).stack().value_counts().plot.pie()
Output: [pie chart of the category counts]
Per @coldspeed's comments:
from itertools import chain
from collections import Counter
pd.DataFrame.from_dict(Counter(chain(*df['category'])), orient='index').sort_values(0, ascending=False)
Output:
Barbeque 3
Mexican 3
Bars 2
American 2
Thai 2
Asian 2
Fusion 2
Pizza 1
Diners 1
Halal 1
Pakistani 1
Brunch 1
Breakfast 1
Burgers 1
Hawaiian 1
Traditional 1
Pubs 1
Korean 1
Desserts 1
New 1
Nightlife 1
Indian 1
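Part 1) of the question falls out of the same idea: the exploded series gives the unique categories directly, and a plain set union covers older pandas. A short sketch against the same df:

# unique categories via explode (pandas 0.25.0+)
unique_categories = set(df['category'].explode().unique())

# equivalent without explode: union every list into one set
unique_categories = set().union(*df['category'])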

How do I find the highest count of the same values with Pandas Python?

I am trying to find the most popular major for each university.
Here is a sample of the table:
Institution Major_Name Count Major
School 1 Art 2 First
School 1 English 12 First
School 1 Math 7 First
School 1 Art 6 Second
School 1 English 4 Second
School 1 Math 3 Second
School 2 Art 9
School 2 English 4
School 2 Math 13
I want the final outcome to look like this, where the rest of the rows disappear:
Institution Major_Name Count Major
School 1 English 12 First
School 1 Art 6 Second
School 2 Math 13
Thanks in advance. Very new to using Pandas!
You can do a groupby on Institution and then apply the max function:
In [547]: df.groupby('Institution', as_index=False).max()
Out[547]:
Institution Major Count
0 School 1 Math 12
1 School 2 Math 13
The as_index=False argument prevents the result from using Institution as the new index.
Based on your edit: To group by Institution as well as Major, you can specify multiple columns to group by:
In [563]: df.fillna('').groupby(['Institution', 'Major'], as_index=False).max()
Out[563]:
Institution Major Major_Name Count
0 School1 First Math 12
1 School1 Second Math 6
2 School2 Math 13
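One caveat worth flagging: groupby().max() takes the maximum of each column independently, so the Major_Name and Count in a result row can come from different source rows (note Math 12 above, while the desired output pairs English with 12). A sketch that keeps whole rows together by locating the index of the maximum Count per group:

import pandas as pd

df = pd.DataFrame({
    'Institution': ['School 1'] * 6 + ['School 2'] * 3,
    'Major_Name':  ['Art', 'English', 'Math'] * 3,
    'Count':       [2, 12, 7, 6, 4, 3, 9, 4, 13],
    'Major':       ['First'] * 3 + ['Second'] * 3 + [None] * 3,
})

# idxmax returns the row label of the largest Count within each group;
# df.loc then pulls back the complete rows
idx = df.fillna('').groupby(['Institution', 'Major'])['Count'].idxmax()
print(df.loc[idx])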
