Selecting data based on number of occurrences using Python / Pandas

My dataset is based on the results of Food Inspections in the City of Chicago.
import pandas as pd
df = pd.read_csv("C:/~/Food_Inspections.csv")
df.head()
Out[1]:
Inspection ID DBA Name \
0 1609238 JR'SJAMAICAN TROPICAL CAFE,INC
1 1609245 BURGER KING
2 1609237 DUNKIN DONUTS / BASKIN ROBINS
3 1609258 CHIPOTLE MEXICAN GRILL
4 1609244 ATARDECER ACAPULQUENO INC.
AKA Name License # Facility Type Risk \
0 NaN 2442496.0 Restaurant Risk 1 (High)
1 BURGER KING 2411124.0 Restaurant Risk 2 (Medium)
2 DUNKIN DONUTS / BASKIN ROBINS 1717126.0 Restaurant Risk 2 (Medium)
3 CHIPOTLE MEXICAN GRILL 1335044.0 Restaurant Risk 1 (High)
4 ATARDECER ACAPULQUENO INC. 1910118.0 Restaurant Risk 1 (High)
Here is how often each facility type appears in the dataset:
df['Facility Type'].value_counts()
Out[3]:
Restaurant 14304
Grocery Store 2647
School 1155
Daycare (2 - 6 Years) 367
Bakery 316
Children's Services Facility 262
Daycare Above and Under 2 Years 248
Long Term Care 169
Daycare Combo 1586 142
Catering 123
Liquor 78
Hospital 68
Mobile Food Preparer 67
Golden Diner 65
Mobile Food Dispenser 51
Special Event 25
Shared Kitchen User (Long Term) 22
Daycare (Under 2 Years) 18
I am trying to create a new set of data containing only those rows whose Facility Type has over 50 occurrences in the dataset. How would I approach this?
Please note that the real list of facility counts is MUCH LARGER; I have cut most of it out because it did not contribute to the question at hand (so simply removing the occurrences of "Special Event", "Shared Kitchen User", and "Daycare" is not what I'm looking for).

IIUC then you want to filter:
df.groupby('Facility Type').filter(lambda x: len(x) > 50)
Example:
In [9]:
df = pd.DataFrame({'type':list('aabcddddee'), 'value':np.random.randn(10)})
df
Out[9]:
type value
0 a -0.160041
1 a -0.042310
2 b 0.530609
3 c 1.238046
4 d -0.754779
5 d -0.197309
6 d 1.704829
7 d -0.706467
8 e -1.039818
9 e 0.511638
In [10]:
df.groupby('type').filter(lambda x: len(x) > 1)
Out[10]:
type value
0 a -0.160041
1 a -0.042310
4 d -0.754779
5 d -0.197309
6 d 1.704829
7 d -0.706467
8 e -1.039818
9 e 0.511638
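An equivalent mask-based approach (a sketch, not from the original answer; df_filtered is just an illustrative name) computes each row's group size with transform on the real frame and keeps the rows where that size exceeds 50, which avoids materializing the groups:
counts = df.groupby('Facility Type')['Facility Type'].transform('size')
df_filtered = df[counts > 50]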

Not tested, but should work.
FT = df['Facility Type'].value_counts()
df[df['Facility Type'].isin(FT.index[FT > 50])]
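As a quick sanity check of either approach (assuming the result is assigned to a frame named df_filtered), every facility type that survives should have more than 50 rows:
df_filtered = df[df['Facility Type'].isin(FT.index[FT > 50])]
assert df_filtered['Facility Type'].value_counts().min() > 50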

Related

Dataframe Insert Labels if filename starts with a 'b'

I want to create a dataframe and give a label to each file, based on the first letter of the filename:
This is where I created the dataframe, which works out fine:
[IN]
df = pd.read_csv('data.txt', sep="\t", names=['file', 'text', 'label'], header=None, engine='python')
texts = df['text'].values.astype("U")
print(df)
[OUT]
file text label
0 b_001.txt Ad sales boost Time Warner profitQuarterly pro... NaN
1 b_002.txt Dollar gains on Greenspan speechThe dollar has... NaN
2 b_003.txt Yukos unit buyer faces loan claimThe owners of... NaN
3 b_004.txt High fuel prices hit BA's profitsBritish Airwa... NaN
4 b_005.txt Pernod takeover talk lifts DomecqShares in UK ... NaN
... ... ... ...
2220 t_397.txt BT program to beat dialler scamsBT is introduc... NaN
2221 t_398.txt Spam e-mails tempt net shoppersComputer users ... NaN
2222 t_399.txt Be careful how you codeA new European directiv... NaN
2223 t_400.txt US cyber security chief resignsThe man making ... NaN
2224 t_401.txt Losing yourself in online gamingOnline role pl... NaN
Now I want to insert labels based on the filename
for index, row in df.iterrows():
    if row['file'].startswith('b'):
        row['label'] = 0
    elif row['file'].startswith('e'):
        row['label'] = 1
    elif row['file'].startswith('p'):
        row['label'] = 2
    elif row['file'].startswith('s'):
        row['label'] = 3
    else:
        row['label'] = 4
print(df)
[OUT]
file text label
0 b_001.txt Ad sales boost Time Warner profitQuarterly pro... 4
1 b_002.txt Dollar gains on Greenspan speechThe dollar has... 4
2 b_003.txt Yukos unit buyer faces loan claimThe owners of... 4
3 b_004.txt High fuel prices hit BA's profitsBritish Airwa... 4
4 b_005.txt Pernod takeover talk lifts DomecqShares in UK ... 4
... ... ... ...
2220 t_397.txt BT program to beat dialler scamsBT is introduc... 4
2221 t_398.txt Spam e-mails tempt net shoppersComputer users ... 4
2222 t_399.txt Be careful how you codeA new European directiv... 4
2223 t_400.txt US cyber security chief resignsThe man making ... 4
2224 t_401.txt Losing yourself in online gamingOnline role pl... 4
As you can see, every row got the label 4. What did I do wrong?
Here is one way to do it. Instead of a for loop, you can use map to assign the values to the label:
# create a dictionary of key: value mappings
d = {'b': 0, 'e': 1, 'p': 2, 's': 3}
else_val = 4
# take the first character of the filename and map it using the dictionary;
# null values (the else condition) become 4
df['label'] = df['file'].str[:1].map(d).fillna(else_val).astype(int)
df
file text label
0 b_001.txt Ad sales boost Time Warner profitQuarterly pro... 0
1 b_002.txt Dollar gains on Greenspan speechThe dollar has... 0
2 b_003.txt Yukos unit buyer faces loan claimThe owners of... 0
3 b_004.txt High fuel prices hit BA's profitsBritish Airwa... 0
4 b_005.txt Pernod takeover talk lifts DomecqShares in UK ... 0
2220 t_397.txt BT program to beat dialler scamsBT is introduc... 4
2221 t_398.txt Spam e-mails tempt net shoppersComputer users ... 4
2222 t_399.txt Be careful how you codeA new European directiv... 4
2223 t_400.txt US cyber security chief resignsThe man making ... 4
2224 t_401.txt Losing yourself in online gamingOnline role pl... 4
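A minor variant (not from the original answer): Series.map also accepts a callable, so the default can live in the same expression via dict.get, using the d and else_val defined above:
df['label'] = df['file'].str[0].map(lambda c: d.get(c, else_val))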
According to the documentation, using iterrows() to modify a data frame is not guaranteed to work in all cases, because it does not preserve dtypes across rows:
You should never modify something you are iterating over. This is not
guaranteed to work in all cases. Depending on the data types, the
iterator returns a copy and not a view, and writing to it will have no
effect.
Therefore, do the following instead.
def label(filename):
    if filename.startswith('b'):
        return 0
    elif filename.startswith('e'):
        return 1
    elif filename.startswith('p'):
        return 2
    elif filename.startswith('s'):
        return 3
    else:
        return 4

df['label'] = df['file'].apply(label)
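A fully vectorized alternative (a sketch, assuming the same four prefixes and default value) is numpy.select, which avoids both iterrows and a row-wise apply:
import numpy as np

# first matching condition wins; anything else gets the default
conditions = [df['file'].str.startswith(p) for p in ('b', 'e', 'p', 's')]
df['label'] = np.select(conditions, [0, 1, 2, 3], default=4)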

partial match to compare 2 columns from different dataframes using fuzzy wuzzy

I want to compare this dataframe df1:
Product Price
0 Waterproof Liner 40
1 Phone Tripod 50
2 Waterproof Pants 0
3 baby Kids play Mat 985
4 Hiking BACKPACKS 34
5 security Camera 160
with df2 as shown below:
Product Id
0 Home Security IP Camera 508760
1 Hiking Backpacks – Spring Products 287950
2 Waterproof Eyebrow Liner 678897
3 Waterproof Pants – Winter Product 987340
4 Baby Kids Water Play Mat – Summer Product 111500
I want to compare the Product column in df1 with the Product column in df2 in order to find the correct Id for each product. If the best similarity is < 80, it should put 'Remove' in the ID field.
NB: the text of the Product columns in df1 and df2 does not match 100%.
Can anyone help me with this, or show how I can use fuzzywuzzy to get the correct Id?
Here is my code
import pandas as pd
from fuzzywuzzy import process
data1 = {'Product1': ['Waterproof Liner', 'Phone Tripod', 'Waterproof Pants',
                      'baby Kids play Mat', 'Hiking BACKPACKS', 'security Camera'],
         'Price': [40, 50, 0, 985, 34, 160]}
data2 = {'Product2': ['Home Security IP Camera', 'Hiking Backpacks – Spring Products',
                      'Waterproof Eyebrow Liner', 'Waterproof Pants – Winter Product',
                      'Baby Kids Water Play Mat – Summer Product'],
         'Id': [508760, 287950, 678897, 987340, 111500]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
dfm = pd.DataFrame(df1["Product1"].apply(lambda x: process.extractOne(x, df2["Product2"]))
                   .tolist(), columns=['Product1', "match_comp", "Id"])
What I got:
Product1 match_comp Id
0 Waterproof Eyebrow Liner 86 2
1 Waterproof Eyebrow Liner 50 2
2 Waterproof Pants – Winter Product 90 3
3 Baby Kids Water Play Mat – Summer Product 86 4
4 Hiking Backpacks – Spring Products 90 1
5 Home Security IP Camera 86 0
What is expected:
Product Price ID
0 Waterproof Liner 40 678897
1 Phone Tripod 50 Remove
2 Waterproof Pants 0 987340
3 baby Kids play Mat 985 111500
4 Hiking BACKPACKS 34 287950
5 security Camera 160 508760
You can make a wrapper function:
def extract(s):
    name, score, _ = process.extractOne(s, df2["Product2"], score_cutoff=0)
    if score < 80:
        return 'Remove'
    return df2.set_index('Product2').loc[name, 'Id']

df1['ID'] = df1["Product1"].apply(extract)
output:
Product1 Price ID
0 Waterproof Liner 40 678897
1 Phone Tripod 50 Remove
2 Waterproof Pants 0 987340
3 baby Kids play Mat 985 111500
4 Hiking BACKPACKS 34 287950
5 security Camera 160 508760
NB: the output is not exactly what you expect; you would have to explain why rows 4/5 should be dropped.
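An untested variant: since process.extractOne with a Series of choices also returns the matching index label as the third element of the tuple, the set_index lookup can be skipped:
def extract(s):
    _, score, idx = process.extractOne(s, df2["Product2"])
    return df2.loc[idx, 'Id'] if score >= 80 else 'Remove'

df1['ID'] = df1["Product1"].apply(extract)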

Sum based on grouping in pandas dataframe?

I have a pandas dataframe df which contains:
major men women rank
Art 5 4 1
Art 3 5 3
Art 2 4 2
Engineer 7 8 3
Engineer 7 4 4
Business 5 5 4
Business 3 4 2
Basically I need to find the total number of students, men and women combined, per major, regardless of the rank column. So for Art, for example, the total should be all men + women, totaling 23; Engineer 26; Business 17.
I have tried
df.groupby(['major_category']).sum()
But this separately sums the men and women rather than combining their totals.
Just add both columns and then groupby:
(df.men+df.women).groupby(df.major).sum()
major
Art 23
Business 17
Engineer 26
dtype: int64
melt() then groupby():
df.drop(columns='rank').melt('major').groupby('major', as_index=False)['value'].sum()
major value
0 Art 23
1 Business 17
2 Engineer 26
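Grouping first and then adding the two per-major totals gives the same numbers (a third option, not from the original answers):
df.groupby('major')[['men', 'women']].sum().sum(axis=1)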

Get unique values from pandas series of lists

I have a column in DataFrame containing list of categories. For example:
0 [Pizza]
1 [Mexican, Bars, Nightlife]
2 [American, New, Barbeque]
3 [Thai]
4 [Desserts, Asian, Fusion, Mexican, Hawaiian, F...
6 [Thai, Barbeque]
7 [Asian, Fusion, Korean, Mexican]
8 [Barbeque, Bars, Pubs, American, Traditional, ...
9 [Diners, Burgers, Breakfast, Brunch]
11 [Pakistani, Halal, Indian]
I am attempting to do two things:
1) Get unique categories - my approach is to have an empty set, iterate through the series, and union in each list.
my code:
unique_categories = {'Pizza'}
for lst in restaurant_review_df['categories_arr']:
    unique_categories = unique_categories | set(lst)
This give me a set of unique categories contained in all the lists in the column.
2) Generate a pie plot of category counts, where each restaurant can belong to multiple categories. For example, restaurant 11 belongs to the Pakistani, Indian, and Halal categories. My approach is again to iterate through the categories, with one more iteration through the series to get the counts.
Are there simpler or elegant ways of doing this?
Thanks in advance.
Update: for pandas 0.25.0+, use explode:
df['category'].explode().value_counts()
Output:
Barbeque 3
Mexican 3
Fusion 2
Thai 2
American 2
Bars 2
Asian 2
Hawaiian 1
New 1
Brunch 1
Pizza 1
Traditional 1
Pubs 1
Korean 1
Pakistani 1
Burgers 1
Diners 1
Indian 1
Desserts 1
Halal 1
Nightlife 1
Breakfast 1
Name: category, dtype: int64
And with plotting:
df['category'].explode().value_counts().plot.pie(figsize=(8,8))
Output: (pie chart of the category counts)
For older versions of pandas, before 0.25.0, try:
df['category'].apply(pd.Series).stack().value_counts()
Output:
Mexican 3
Barbeque 3
Thai 2
Fusion 2
American 2
Bars 2
Asian 2
Pubs 1
Burgers 1
Traditional 1
Brunch 1
Indian 1
Korean 1
Halal 1
Pakistani 1
Hawaiian 1
Diners 1
Pizza 1
Nightlife 1
New 1
Desserts 1
Breakfast 1
dtype: int64
With plotting:
df['category'].apply(pd.Series).stack().value_counts().plot.pie()
Output: (pie chart of the category counts)
Per @coldspeed's comments:
from itertools import chain
from collections import Counter
pd.DataFrame.from_dict(Counter(chain(*df['category'])), orient='index').sort_values(0, ascending=False)
Output:
Barbeque 3
Mexican 3
Bars 2
American 2
Thai 2
Asian 2
Fusion 2
Pizza 1
Diners 1
Halal 1
Pakistani 1
Brunch 1
Breakfast 1
Burgers 1
Hawaiian 1
Traditional 1
Pubs 1
Korean 1
Desserts 1
New 1
Nightlife 1
Indian 1
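For part 1) of the question, the unique categories themselves, explode gives a one-liner as well (assuming the column is named 'category' as above):
unique_categories = set(df['category'].explode())
# or keep it as an array: df['category'].explode().unique()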

Delete part of a row in pandas / shift up part of a row? Align Column Headings

So I have a data frame where the headings I want do not currently line up:
In [1]: df = pd.read_excel('example.xlsx')
print (df.head(10))
Out [1]: Portfolio Asset Country Quantity
Unique Identifier Number of fund B24 B65 B35 B44
456 2 General Type A UNITED KINGDOM 1
123 3 General Type B US 2
789 2 General Type C UNITED KINGDOM 4
4852 4 General Type C UNITED KINGDOM 4
654 1 General Type A FRANCE 3
987 5 General Type B UNITED KINGDOM 2
321 1 General Type B GERMANY 1
951 3 General Type A UNITED KINGDOM 2
357 4 General Type C UNITED KINGDOM 3
As we can see, above the first 2 column headings there are 2 blank cells, and below the next 4 column headings are "B" numbers which I don't care about.
So, 2 questions: how can I shift up the first 2 columns without having a column heading to identify them with (due to the blank cells above)?
And how can I delete just row 2 of the remaining columns and have the data below move up to take the place of the "B" numbers?
I found some similar questions already asked (e.g. "python: shift column in pandas dataframe up by one"), but I don't think anything solves the particular intricacies above.
Also I'm quite new to Python and Pandas so if this is really basic I apologise!
IIUC you can use:
#create df from multiindex in columns
df1 = pd.DataFrame([x for x in df.columns.values])
print(df1)
0 1
0 Unique Identifier
1 Number of fund
2 Portfolio B24
3 Asset B65
4 Country B35
5 Quantity B44
#if len of string < 4, give value from column 0 to column 1
df1.loc[df1.iloc[:,1].str.len() < 4, 1] = df1.iloc[:,0]
print(df1)
0 1
0 Unique Identifier
1 Number of fund
2 Portfolio Portfolio
3 Asset Asset
4 Country Country
5 Quantity Quantity
#set columns by first columns of df1
df.columns = df1.iloc[:,1]
print(df)
0 Unique Identifier Number of fund Portfolio Asset Country \
0 456 2 General Type A UNITED KINGDOM
1 123 3 General Type B US
2 789 2 General Type C UNITED KINGDOM
3 4852 4 General Type C UNITED KINGDOM
4 654 1 General Type A FRANCE
5 987 5 General Type B UNITED KINGDOM
6 321 1 General Type B GERMANY
7 951 3 General Type A UNITED KINGDOM
8 357 4 General Type C UNITED KINGDOM
0 Quantity
0 1
1 2
2 4
3 4
4 3
5 2
6 1
7 2
8 3
EDIT by comments:
print(df.columns)
Index([u'Portfolio', u'Asset', u'Country', u'Quantity'], dtype='object')
#set first row by columns names
df.iloc[0,:] = df.columns
#reset_index
df = df.reset_index()
#set columns from first row
df.columns = df.iloc[0,:]
df.columns.name= None
#remove first row
print(df.iloc[1:,:])
Unique Identifier Number of fund Portfolio Asset Country Quantity
1 456 2 General Type A UNITED KINGDOM 1
2 123 3 General Type B US 2
3 789 2 General Type C UNITED KINGDOM 4
4 4852 4 General Type C UNITED KINGDOM 4
5 654 1 General Type A FRANCE 3
6 987 5 General Type B UNITED KINGDOM 2
7 321 1 General Type B GERMANY 1
8 951 3 General Type A UNITED KINGDOM 2
9 357 4 General Type C UNITED KINGDOM 3
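A shorter route (a sketch, assuming the same example.xlsx layout) is to read both header rows as a MultiIndex and keep the meaningful name from each pair: the second level where the first is blank, and the first level where the second is a throwaway "B" code:
import pandas as pd

df = pd.read_excel('example.xlsx', header=[0, 1])
# keep the top-level name when the sub-level is a 'B' code, else the sub-level
df.columns = [top if str(sub).startswith('B') else sub
              for top, sub in df.columns]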
