Rename the less frequent categories to "OTHER" - python

In my dataframe I have some categorical columns with over 100 different categories. I want to rank the categories by frequency, keep the 9 most frequent ones, and automatically rename the less frequent ones to "OTHER".
Example:
Here is my df:
print(df)
Employee_number Jobrol
0 1 Sales Executive
1 2 Research Scientist
2 3 Laboratory Technician
3 4 Sales Executive
4 5 Research Scientist
5 6 Laboratory Technician
6 7 Sales Executive
7 8 Research Scientist
8 9 Laboratory Technician
9 10 Sales Executive
10 11 Research Scientist
11 12 Laboratory Technician
12 13 Sales Executive
13 14 Research Scientist
14 15 Laboratory Technician
15 16 Sales Executive
16 17 Research Scientist
17 18 Research Scientist
18 19 Manager
19 20 Human Resources
20 21 Sales Executive
valCount = df['Jobrol'].value_counts()
valCount
Sales Executive 7
Research Scientist 7
Laboratory Technician 5
Manager 1
Human Resources 1
I want to keep the first 3 categories and rename the rest to "OTHER"; how should I proceed?
Thanks.

Convert your series to categorical, extract the categories whose counts are not in the top 3, add a new category (e.g. 'Other'), then replace the previously extracted categories:
df['Jobrol'] = df['Jobrol'].astype('category')
others = df['Jobrol'].value_counts().index[3:]  # categories outside the top 3
label = 'Other'
df['Jobrol'] = df['Jobrol'].cat.add_categories([label])  # register the new category first
df['Jobrol'] = df['Jobrol'].replace(others, label)
Note: It's tempting to combine categories by renaming them via df['Jobrol'].cat.rename_categories(dict.fromkeys(others, label)), but this won't work as this will imply multiple identically labeled categories, which isn't possible.
The above solution can be adapted to filter by count. For example, to include only categories with a count of 1 you can define others as so:
counts = df['Jobrol'].value_counts()
others = counts[counts == 1].index
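Putting this together, here is a small reusable helper, a sketch of my own (the name lump_top_n is not from the original answer), that keeps the n most frequent values of a column and relabels everything else; it assumes a plain, non-categorical column:
import pandas as pd

def lump_top_n(s: pd.Series, n: int = 3, other: str = 'OTHER') -> pd.Series:
    """Keep the n most frequent values of s and replace the rest with `other`."""
    top = s.value_counts().index[:n]
    return s.where(s.isin(top), other)

# usage on the example frame:
# df['Jobrol'] = lump_top_n(df['Jobrol'], n=3)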

Use value_counts with numpy.where:
import numpy as np

need = df['Jobrol'].value_counts().index[:3]  # the 3 most frequent categories
df['Jobrol'] = np.where(df['Jobrol'].isin(need), df['Jobrol'], 'OTHER')
valCount = df['Jobrol'].value_counts()
print (valCount)
Research Scientist 7
Sales Executive 7
Laboratory Technician 5
OTHER 2
Name: Jobrol, dtype: int64
Another solution, if you only need the aggregated counts rather than a relabelled column:
N = 3
s = df['Jobrol'].value_counts()
valCount = pd.concat([s.iloc[:N], pd.Series(s.iloc[N:].sum(), index=['OTHER'])])
print (valCount)
Research Scientist 7
Sales Executive 7
Laboratory Technician 5
OTHER 2
dtype: int64

One-line solution, keeping only categories whose count exceeds a threshold:
# map each category to itself if its count exceeds `limit`, otherwise to 'other'
limit = 500
df['Jobrol'] = df['Jobrol'].map(
    {k: k if v > limit else 'other' for k, v in df['Jobrol'].value_counts().items()})


How to recode the columns based on condition

I have a big dataset to analyze that includes many rows and columns.
I would like to make a new column ('Recode_Brand') that copies the 'Brand' column but keeps only the top 10 brands and labels everything else 'Others'.
How can I build that condition and logic?
It would be perfect if I could use a condition based on the list below:
Brand_list = ['Google', 'Apple', 'Amazon', 'Microsoft', 'Tencent', 'Facebook', 'Visa', 'McDonald's', 'Alibaba', 'AT&T']
I am quite new to pandas and need your support. Thanks in advance.
Just use the 2018 column, for example:
df['Recode_Brand'] = df.apply(lambda row: row['Brand'] if row['2018'] <= 10 else 'Other', axis=1)
Or, if you need that brand list, you can do:
Brand_list = ["Google", "Apple", "Amazon", "Microsoft", "Tencent", "Facebook", "Visa", "McDonald's", "Alibaba", "AT&T"]
df['Recode_Brand'] = df.apply(lambda row: row['Brand'] if row['Brand'] in Brand_list else 'Other', axis=1)
NB: if your string contains a ' character, as in McDonald's, you have to either wrap the string in double quotes " or escape that character with \'.
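To illustrate, both of these are valid ways to write the same string in Python:
brand = "McDonald's"   # wrap in double quotes
brand = 'McDonald\'s'  # or escape the apostrophe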
Use numpy.where to check whether Brand is in the top 10 and add a new column:
import numpy as np
import pandas as pd
df = pd.DataFrame({'2018': [7, 8, 3, 12, 15, 16, 10, 9, 4, 5, 11, 1, 14, 2, 13, 6],
                   'Brand': ['Google', 'Apple', 'Amazon', 'Microsoft', 'Tencent', 'Facebook', 'Visa', 'McDonalds', 'Alibaba', 'AT&T',
                             'IBM', 'Verizon', 'Marlboro', 'Coca-Cola', 'MasterCard', 'UPS']})
Create a new dataframe with the top 10 brands (the smallest values in the 2018 rank column):
top10 = df.nsmallest(10, '2018')
Then add a new column, Recode_Brand, set to the brand if it is in top10, else 'Others':
df['Recode_Brand'] = np.where(df['Brand'].eq(top10['Brand']), df['Brand'], 'Others')
print(df)
2018 Brand Recode_Brand
0 7 Google Google
1 8 Apple Apple
2 3 Amazon Amazon
3 12 Microsoft Others
4 15 Tencent Others
5 16 Facebook Others
6 10 Visa Visa
7 9 McDonalds McDonalds
8 4 Alibaba Alibaba
9 5 AT&T AT&T
10 11 IBM Others
11 1 Verizon Verizon
12 14 Marlboro Others
13 2 Coca-Cola Coca-Cola
14 13 MasterCard Others
15 6 UPS UPS
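Note that df['Brand'].eq(top10['Brand']) works here only because the two Series share an index. A slightly more robust variant of the same idea (a sketch, not part of the original answer) avoids relying on alignment by using isin:
top10 = df.nsmallest(10, '2018')
df['Recode_Brand'] = np.where(df['Brand'].isin(top10['Brand']), df['Brand'], 'Others')
# or, since the 2018 column appears to hold the rank:
# df['Recode_Brand'] = np.where(df['2018'] <= 10, df['Brand'], 'Others')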

How to select the row with same value in between two data frames?

I have the following huge dataset, containing a number of different app names and sentiments under the attribute Sent:
App Sent
0 10 Best Foods for You Positive
1 10 Best Foods for You Positive
2 10 Best Foods for You NaN
3 10 Best Foods for You Positive
4 10 Best Foods for You Positive
5 10 Best Foods for You Positive
6 10 Best Foods for You Positive
7 10 Best Foods for You NaN
8 10 Best Foods for You Neutral
9 10 Best Foods for You Neutral
10 10 Best Foods for You Positive
11 10 Best Foods for You Positive
12 10 Best Foods for You Positive
13 10 Best Foods for You Positive
... ...
64289 Houzz Interior Design Ideas NaN
64290 Houzz Interior Design Ideas NaN
64291 Houzz Interior Design Ideas NaN
64292 Houzz Interior Design Ideas NaN
64293 Houzz Interior Design Ideas NaN
64294 Houzz Interior Design Ideas NaN
I want to find the apps that have generated approximately the same number of positive and negative sentiments, i.e. the apps where the positive and negative counts are equal, or as close as possible.
I tried separating the above data frame into two data frames, one with the positive values and one with the negative values, and then grouping them by app with a count.
For example:
Positive dataframe p:
Sent
App
10 Best Foods for You 162
104 找工作 - 找工作 找打工 找兼職 履歷健檢 履歷診療室 31
11st 23
1800 Contacts - Lens Store 64
1LINE – One Line with One Touch 27
2018Emoji Keyboard 😂 Emoticons Lite -sticker&gif 25
21-Day Meditation Experience 68
2Date Dating App, Love and matching 26
2GIS: directory & navigator 23
2RedBeans 31
2ndLine - Second Phone Number 17
30 Day Fitness Challenge - Workout at Home 27
365Scores - Live Scores 5
3D Live Neon Weed Launcher 2
4 in a Row 17
7 Day Food Journal Challenge 9
7 Minute Workout 10
7 Weeks - Habit & Goal Tracker 10
8 Ball Pool 104
850 Sports News Digest 38
8fit Workouts & Meal Planner 137
95Live -SG#1 Live Streaming App 34
A Call From Santa Claus! 20
A Word A Day 3
A&E - Watch Full Episodes of TV Shows 30
A+ Gallery - Photos & Videos 24
...
HipChat - Chat Built for Teams 19
Hipmunk Hotels & Flights 30
Hitwe - meet people and chat 2
Hole19: Golf GPS App, Rangefinder & Scorecard 18
Home Decor Showpiece Art making: Medium Difficulty 16
Home Scouting® MLS Mobile 13
Home Security Camera WardenCam - reuse old phones 18
Home Street – Home Design Game 42
Home Workout - No Equipment 24
Home Workout for Men - Bodybuilding 22
Home workouts - fat burning, abs, legs, arms,chest 8
HomeWork 1
Homes.com 🏠 For Sale, Rent 11
Homescapes 27
Homesnap Real Estate & Rentals 25
Homestyler Interior Design & Decorating Ideas 19
Homework Planner 33
Honkai Impact 3rd 54
Hopper - Watch & Book Flights 54
Horoscopes – Daily Zodiac Horoscope and Astrology 21
Horses Live Wallpaper 22
Hostelworld: Hostels & Cheap Hotels Travel App 42
Hot Wheels: Race Off 14
HotelTonight: Book amazing deals at great hotels 93
Hotels Combined - Cheap deals 15
Hotels.com: Book Hotel Rooms & Find Vacation Deals 39
Hotspot Shield Free VPN Proxy & Wi-Fi Security 17
Hotstar 14
Hotwire Hotel & Car Rental App 16
Housing-Real Estate & Property 8
[853 rows x 1 columns]
And the negative dataframe n:
Sent
App
10 Best Foods for You 10
104 找工作 - 找工作 找打工 找兼職 履歷健檢 履歷診療室 1
11st 7
1800 Contacts - Lens Store 6
1LINE – One Line with One Touch 8
2018Emoji Keyboard 😂 Emoticons Lite -sticker&gif 1
21-Day Meditation Experience 10
2Date Dating App, Love and matching 7
2GIS: directory & navigator 6
2RedBeans 2
2ndLine - Second Phone Number 7
30 Day Fitness Challenge - Workout at Home 2
4 in a Row 3
7 Minute Workout 1
7 Weeks - Habit & Goal Tracker 4
8 Ball Pool 106
850 Sports News Digest 1
8fit Workouts & Meal Planner 19
95Live -SG#1 Live Streaming App 20
A Call From Santa Claus! 14
A&E - Watch Full Episodes of TV Shows 3
A+ Gallery - Photos & Videos 7
A+ Mobile 9
ABC Kids - Tracing & Phonics 1
ABC News - US & World News 29
ABC Preschool Free 8
...
Hill Climb Racing 13
Hill Climb Racing 2 11
Hily: Dating, Chat, Match, Meet & Hook up 29
Hinge: Dating & Relationships 11
HipChat - Chat Built for Teams 26
Hipmunk Hotels & Flights 1
Hitwe - meet people and chat 7
Home Decor Showpiece Art making: Medium Difficulty 5
Home Scouting® MLS Mobile 12
Home Security Camera WardenCam - reuse old phones 4
Home Street – Home Design Game 13
Home Workout - No Equipment 1
Homes.com 🏠 For Sale, Rent 3
Homescapes 25
Homesnap Real Estate & Rentals 6
Homestyler Interior Design & Decorating Ideas 7
Homework Planner 4
Honkai Impact 3rd 22
Hopper - Watch & Book Flights 18
Horoscopes – Daily Zodiac Horoscope and Astrology 1
Horses Live Wallpaper 2
Hostelworld: Hostels & Cheap Hotels Travel App 12
Hot Wheels: Race Off 6
HotelTonight: Book amazing deals at great hotels 17
Hotels Combined - Cheap deals 7
Hotels.com: Book Hotel Rooms & Find Vacation Deals 21
Hotspot Shield Free VPN Proxy & Wi-Fi Security 3
Hotstar 14
Hotwire Hotel & Car Rental App 6
Housing-Real Estate & Property 10
[782 rows x 1 columns]
Doing this, I tried to find the app names where the counts in p["Sent"] and n["Sent"] are equal:
df=p.loc[p["Sent"]==n["Sent"]]
print(df)
Output:
ValueError: Can only compare identically-labeled Series objects
You are comparing DataFrames with different rows.
I would do it like this. Consider this situation:
name p n
app1 5 5
app2 8 6
app3 7 7
app4 10 8
app5 3 NaN
This code prints the app name and the count where the 'p' and 'n' numbers are the same.
import pandas as pd

# make dataframes p and n
p = pd.DataFrame([5, 8, 7, 10, 3], index=['app1', 'app2', 'app3', 'app4', 'app5'], columns=['p'])
n = pd.DataFrame([5, 6, 7, 8, None], index=['app1', 'app2', 'app3', 'app4', 'app5'], columns=['n'])
# combine p and n with concat
df = pd.concat([p, n], axis=1)
# check equality row by row
for i in range(len(df)):
    if df.iloc[i]['p'] == df.iloc[i]['n']:
        print(df.index[i], df.iloc[i]['p'])
# Outputs are
# app1 5.0
# app3 7.0
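For a large frame, the same check can be done without an explicit loop, and the "closest counts" part of the question can be handled with an absolute difference. A sketch building on the toy p/n frames above:
df = pd.concat([p, n], axis=1)

# apps whose positive and negative counts are exactly equal
print(df[df['p'] == df['n']])

# apps ranked by how close the two counts are (smallest absolute difference first)
print((df['p'] - df['n']).abs().sort_values().head())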

Sum based on grouping in pandas dataframe?

I have a pandas dataframe df which contains:
major men women rank
Art 5 4 1
Art 3 5 3
Art 2 4 2
Engineer 7 8 3
Engineer 7 4 4
Business 5 5 4
Business 3 4 2
Basically I need to find the total number of students, men and women combined, per major, regardless of the rank column. So for Art, for example, the total should be all men + women, i.e. 23; Engineer 26; Business 17.
I have tried
df.groupby(['major_category']).sum()
But this separately sums the men and women rather than combining their totals.
Just add both columns and then groupby:
(df.men+df.women).groupby(df.major).sum()
major
Art 23
Business 17
Engineer 26
dtype: int64
melt() then groupby():
df.drop(columns='rank').melt('major').groupby('major', as_index=False).sum(numeric_only=True)
major value
0 Art 23
1 Business 17
2 Engineer 26
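An equivalent way to get the same totals, as a sketch: group first, sum the two columns per major, then add them across the columns:
totals = df.groupby('major')[['men', 'women']].sum().sum(axis=1)
print(totals)
# major
# Art         23
# Business    17
# Engineer    26
# dtype: int64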

How to check whether a column of text contains a specific string or not in pandas

I have the following dataframe in pandas:
job_desig salary
senior analyst 12
junior researcher 5
scientist 20
sr analyst 12
Now I want to generate a column with a flag set as below:
sr = ['senior','sr']
job_desig salary senior_profile
senior analyst 12 1
junior researcher 5 0
scientist 20 0
sr analyst 12 1
I am doing the following in pandas:
df['senior_profile'] = [1 if x.str.contains(sr) else 0 for x in df['job_desig']]
You can join all values of the list with | for a regex OR, pass the pattern to Series.str.contains, and finally cast the boolean mask to integer to map True/False to 1/0:
df['senior_profile'] = df['job_desig'].str.contains('|'.join(sr)).astype(int)
If necessary, use word boundaries:
pat = '|'.join(r"\b{}\b".format(x) for x in sr)
df['senior_profile'] = df['job_desig'].str.contains(pat).astype(int)
print (df)
job_desig salary senior_profile
0 senior analyst 12 1
1 junior researcher 5 0
2 scientist 20 0
3 sr analyst 12 1
Solution with sets, if the list contains only single-word values:
df['senior_profile'] = [int(bool(set(sr).intersection(x.split()))) for x in df['job_desig']]
You can also do it simply with str.contains (this gives a boolean column; add .astype(int) if you need a 1/0 flag):
df['senior_profile'] = df['job_desig'].str.contains('senior') | df['job_desig'].str.contains('sr')
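Whichever variant you use, Series.str.contains also accepts case and na parameters, which can matter on real data. A sketch using the column names above:
# case-insensitive match, and treat missing job titles as non-senior instead of NaN
pat = '|'.join(r"\b{}\b".format(x) for x in sr)
df['senior_profile'] = df['job_desig'].str.contains(pat, case=False, na=False).astype(int)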

Selecting data based on number of occurences using Python / Pandas

My dataset is based on the results of Food Inspections in the City of Chicago.
import pandas as pd
df = pd.read_csv("C:/~/Food_Inspections.csv")
df.head()
Out[1]:
Inspection ID DBA Name \
0 1609238 JR'SJAMAICAN TROPICAL CAFE,INC
1 1609245 BURGER KING
2 1609237 DUNKIN DONUTS / BASKIN ROBINS
3 1609258 CHIPOTLE MEXICAN GRILL
4 1609244 ATARDECER ACAPULQUENO INC.
AKA Name License # Facility Type Risk \
0 NaN 2442496.0 Restaurant Risk 1 (High)
1 BURGER KING 2411124.0 Restaurant Risk 2 (Medium)
2 DUNKIN DONUTS / BASKIN ROBINS 1717126.0 Restaurant Risk 2 (Medium)
3 CHIPOTLE MEXICAN GRILL 1335044.0 Restaurant Risk 1 (High)
4 ATARDECER ACAPULQUENO INC. 1910118.0 Restaurant Risk 1 (High)
Here is how often each facility type appears in the dataset:
df['Facility Type'].value_counts()
Out[3]:
Restaurant 14304
Grocery Store 2647
School 1155
Daycare (2 - 6 Years) 367
Bakery 316
Children's Services Facility 262
Daycare Above and Under 2 Years 248
Long Term Care 169
Daycare Combo 1586 142
Catering 123
Liquor 78
Hospital 68
Mobile Food Preparer 67
Golden Diner 65
Mobile Food Dispenser 51
Special Event 25
Shared Kitchen User (Long Term) 22
Daycare (Under 2 Years) 18
I am trying to create a new dataset containing only those rows whose Facility Type has over 50 occurrences in the dataset. How would I approach this?
Please note the list of facility counts is MUCH LARGER; I have cut out most of it because it did not contribute to the question at hand (so simply removing occurrences of "Special Event", "Shared Kitchen User", and "Daycare" is not what I'm looking for).
IIUC then you want to filter:
df.groupby('Facility Type').filter(lambda x: len(x) > 50)
Example:
In [9]:
df = pd.DataFrame({'type':list('aabcddddee'), 'value':np.random.randn(10)})
df
Out[9]:
type value
0 a -0.160041
1 a -0.042310
2 b 0.530609
3 c 1.238046
4 d -0.754779
5 d -0.197309
6 d 1.704829
7 d -0.706467
8 e -1.039818
9 e 0.511638
In [10]:
df.groupby('type').filter(lambda x: len(x) > 1)
Out[10]:
type value
0 a -0.160041
1 a -0.042310
4 d -0.754779
5 d -0.197309
6 d 1.704829
7 d -0.706467
8 e -1.039818
9 e 0.511638
Not tested, but should work.
FT = df['Facility Type'].value_counts()
df[df['Facility Type'].isin(FT.index[FT > 50])]
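A third option along the same lines (a sketch, not from the original answers) avoids building the intermediate value_counts by using groupby.transform:
# keep rows whose Facility Type occurs more than 50 times
df[df.groupby('Facility Type')['Facility Type'].transform('size') > 50]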
