Create dummy variables if value is in list - python

I'm working with the Zomato Bangalore Restaurant data set found here. One of my pre-processing steps is to create dummy variables for the types of cuisine each restaurant serves. I've used pandas' explode to split the cuisines, and I've created lists for the top 30 cuisines and the non-top-30 cuisines. I've created a sample dataframe below.
import pandas as pd

sample_df = pd.DataFrame({
    'name': ['Jalsa', 'Spice Elephant', 'San Churro Cafe'],
    'cuisines_lst': [
        ['North Indian', 'Chinese'],
        ['Chinese', 'North Indian', 'Thai'],
        ['Cafe', 'Mexican', 'Italian']
    ]
})
I've created the top and not top lists. In the actual data I'm using the top 30 but for the sake of the example, it's the top 2 and not top 2.
top2 = sample_df.explode('cuisines_lst')['cuisines_lst'].value_counts().index[0:2].tolist()
not_top2 = sample_df.explode('cuisines_lst')['cuisines_lst'].value_counts().index[2:].tolist()
What I would like is to create a dummy variable for each cuisine in the top list, with the suffix _bin, and a final dummy variable Other that flags restaurants with any cuisine from the not-top list. The desired output looks like this:
name             cuisines_lst                   Chinese_bin  North Indian_bin  Other
Jalsa            [North Indian, Chinese]        1            1                 0
Spice Elephant   [Chinese, North Indian, Thai]  1            1                 1
San Churro Cafe  [Cafe, Mexican, Italian]       0            0                 1

Create the dummies, then reduce by duplicated indices to get your columns for the top 2:
a = pd.get_dummies(sample_df['cuisines_lst'].explode()) \
      .reset_index().groupby('index')[top2].sum().add_suffix('_bin')
If you want it in alphabetical order (in this case, Chinese followed by North Indian), add an intermediate step to sort columns with a.sort_index(axis=1).
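For instance, a minimal sketch of the same chain with the sort step added, reusing the names from above:
a = pd.get_dummies(sample_df['cuisines_lst'].explode()) \
      .reset_index().groupby('index')[top2].sum() \
      .sort_index(axis=1).add_suffix('_bin')   # sort columns alphabetically before suffixing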
Do the same for the other values, but reducing columns as well by passing axis=1 to any:
b = pd.get_dummies(sample_df['cuisines_lst'].explode()) \
      .reset_index().groupby('index')[not_top2].sum() \
      .any(axis=1).astype(int).rename('Other')
Concatenating on indices:
>>> print(pd.concat([sample_df, a, b], axis=1).to_string())
              name                   cuisines_lst  North Indian_bin  Chinese_bin  Other
0            Jalsa        [North Indian, Chinese]                 1            1      0
1   Spice Elephant  [Chinese, North Indian, Thai]                 1            1      1
2  San Churro Cafe       [Cafe, Mexican, Italian]                 0            0      1
If you are operating on a lot of data, it may be worth creating an intermediate data frame containing the exploded dummies, on which both group-by reductions can then be performed.
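For example, a minimal sketch of that intermediate step (dummies is just an illustrative name; a and b are the same results as above):
# build the exploded dummies once, then derive both reductions from it
dummies = (pd.get_dummies(sample_df['cuisines_lst'].explode())
             .reset_index().groupby('index').sum())
a = dummies[top2].add_suffix('_bin')
b = dummies[not_top2].any(axis=1).astype(int).rename('Other')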

Related

In pandas, how to groupby and apply/transform on each whole group (NOT aggregation)?

I've looked into agg/apply/transform after groupby, but none of them seem to meet my need.
Here is an example DF:
df_seq = pd.DataFrame({
    'person': ['Tom', 'Tom', 'Tom', 'Lucy', 'Lucy', 'Lucy'],
    'day': [1, 2, 3, 1, 4, 6],
    'food': ['beef', 'lamb', 'chicken', 'fish', 'pork', 'venison']
})
person,day,food
Tom,1,beef
Tom,2,lamb
Tom,3,chicken
Lucy,1,fish
Lucy,4,pork
Lucy,6,venison
The day column shows that, for each person, he/she consumes food in sequential order.
Now I would like to group by the person column and create a DataFrame containing the food pairs for neighbouring days/times (as shown below).
Note the day column is only for example purposes here, so its values should not be used; it only indicates that the food column is in sequential order. In my real data it's a datetime column.
person,day,food,food_next
Tom,1,beef,lamb
Tom,2,lamb,chicken
Lucy,1,fish,pork
Lucy,4,pork,venison
At the moment, I can only do this with a for-loop to iterate through all users. It's very slow.
Is it possible to use a groupby and apply/transform to achieve this, or any vectorized operations?
Create the new column with DataFrameGroupBy.shift and then remove rows with missing values in food_next with DataFrame.dropna:
df = (df_seq.assign(food_next = df_seq.groupby('person')['food'].shift(-1))
            .dropna(subset=['food_next']))
print (df)
person day food food_next
0 Tom 1 beef lamb
1 Tom 2 lamb chicken
3 Lucy 1 fish pork
4 Lucy 4 pork venison
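Since the real data has a datetime column rather than day, it may be worth sorting within each person first so the shift follows chronological order. A sketch, assuming that column is still called day:
df = (df_seq.sort_values(['person', 'day'])   # sort so shift(-1) picks the next meal in time
            .assign(food_next=lambda d: d.groupby('person')['food'].shift(-1))
            .dropna(subset=['food_next']))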
This might be a slightly patchy answer, and it doesn't perform an aggregation in the standard sense.
First, a small querying function that, given a name and a day, will return the first result (assuming the data is pre-sorted) that matches the parameters, and failing that, returns some default value:
def get_next_food(df, person, day):
    results = df.query(f"`person` == '{person}' and `day` > {day}")
    if len(results) > 0:
        return results.iloc[0]['food']
    else:
        return "Mystery"
You can use this as follows:
get_next_food(df_seq, "Tom", 1)
> 'lamb'
Now, we can use this in an apply statement, to populate a new column with the results of this function applied row-wise:
df_seq['next_food'] = df_seq.apply(lambda x: get_next_food(df_seq, x['person'], x['day']), axis=1)
>
person day food next_food
0 Tom 1 beef lamb
1 Tom 2 lamb chicken
2 Tom 3 chicken Mystery
3 Lucy 1 fish pork
4 Lucy 4 pork venison
5 Lucy 6 venison Mystery
Give it a try, I'm not convinced you'll see a vast performance improvement, but it'd be interesting to find out.

Looping through 2 columns in Pandas DataFrame with and without a specific condition

I have a DataFrame with more than 11,000 observations. I want to loop through 2 of its several columns. Below are the 2 columns that I want to loop through:
df1:
            EWZ sales_territory
0       90164.0           north
1      246794.0           north
2      216530.0           north
3       80196.0           north
4       12380.0           north
11002     224.0            east
11003    1746.0            east
11004    7256.0            east
11005     439.0            east
11006   13724.0            east
The data shown here is the first 5 and last 5 observations of the columns.
The sales_territory column has north, south, east and west observations. EWZ is the population size.
I want to select all east rows that have the same population value, and likewise for north, south and west, and append them to a variable. I.e., if there are 20 north rows with a population of 20,000, I want to select them. Same for the others.
I tried using nested ifs but, frankly speaking, I don't know how to specify the condition for the EWZ column. I tried iterrows(), but I could not find my way out.
How do I figure this out?
You don't have to use a for loop. You can use:
df1.groupby(['EWZ', 'sales_territory']).apply(your_function)
and achieve the desired result. If you want to get a list of unique values then you can just drop duplicates using df1[['EWZ', 'sales_territory']].drop_duplicates()
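For instance, if the goal is to keep only the rows that share an EWZ value with another row in the same territory, one possible groupby-based sketch (my reading of the question; same_pop is just an illustrative name) is:
# keep only the (sales_territory, EWZ) groups that contain more than one row
same_pop = df1.groupby(['sales_territory', 'EWZ']).filter(lambda g: len(g) > 1)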
If you don't care about the EWZ values while selecting, you can use 4 loc statements like:
df1.loc[df1['sales_territory'] == 'north', 'new_column'] = value
df1.loc[df1['sales_territory'] == 'east', 'new_column'] = value
df1.loc[df1['sales_territory'] == 'south', 'new_column'] = value
df1.loc[df1['sales_territory'] == 'west', 'new_column'] = value

Select multiple rows that contain a specific value in multiple columns

I have a df, like
Person 1st 2nd 3rd
0 A Park Gym Supermarket
1 B Tea Restaurant Park
2 C Coco Gym Beer
... ... ... ... ...
I want to select and get a new df whose rows contain 'Park'.
Desired result:
Person 1st 2nd 3rd
0 A Park Gym Supermarket
1 B Tea Restaurant Park
...
And another new df whose rows contain 'Gym'.
Desired results:
Person 1st 2nd 3rd
0 A Park Gym Supermarket
2 C Coco Gym Beer
...
How could I do it?
There is no problem selecting 'Park' in a single column, df[df['1st'] == 'Park'], but I have problems selecting from multiple columns (1st, 2nd, 3rd, etc.).
You can perform "or" operations in pandas by using the pipe |, so in this specific case you could try:
df_filtered = df[(df['1st'] == 'Park') | (df['2nd'] == 'Park') | (df['3rd'] == 'Park')]
Alternatively, you could use .any() with the argument axis=1, which returns the rows where there is a match in any of the columns:
df_filtered = df[df[['1st', '2nd', '3rd']].isin(['Park']).any(axis=1)]
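The same pattern works for any value; for example, a frame for the 'Gym' rows (df_gym is just an illustrative name):
df_gym = df[df[['1st', '2nd', '3rd']].eq('Gym').any(axis=1)]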

Filter Series/DataFrame by another DataFrame

Let's suppose I have a Series (or DataFrame) s1, for example a list of all universities and colleges in the USA:
University
0 Searcy Harding University
1 Angwin Pacific Union College
2 Fairbanks University of Alaska Fairbanks
3 Ann Arbor University of Michigan
And another Series (or DataFrame) s2, for example a list of all cities in the USA:
City
0 Searcy
1 Angwin
2 New York
3 Ann Arbor
And my desired output (basically an intersection of s1 and s2):
Uni City
0 Searcy
1 Angwin
2 Fairbanks
3 Ann Arbor
The thing is: I'd like to create a Series that consists of only those cities that have a university/college. My very first thought was to remove the "University" or "College" parts from s1, but it turns out that is not enough, as in the case of Angwin Pacific Union College. Then I thought of keeping only the first word, but that excludes Ann Arbor.
Finally, I got a series of all the cities, s2, and now I'm trying to use it as a filter (something similar to .contains() or .isin()), so that if a string in s1 (the uni name) contains any of the elements of s2 (a city name), only the city name is returned.
My question is: how to do it in a neat way?
I would try to build a list comprehension of cities that are contained in at least one university name:
pd.Series([i for i in s2 if s1.str.contains(i).any()], name='Uni City')
With your example data it gives:
0 Searcy
1 Angwin
2 Ann Arbor
Name: Uni City, dtype: object
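Note that str.contains treats its argument as a regular expression by default; if a city name could contain regex metacharacters, a literal match can be requested with regex=False, for example:
pd.Series([c for c in s2 if s1.str.contains(c, regex=False).any()], name='Uni City')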
Data Used
import pandas as pd
import numpy as np

s = pd.DataFrame({'University': ['Searcy Harding University', 'Angwin Pacific Union College', 'Fairbanks University of Alaska Fairbanks', 'Ann Arbor University of Michigan']})
s2 = pd.DataFrame({'City': ['Searcy', 'Angwin', 'Fairbanks', 'Ann Arbor']})
Convert s2.City to a set of unique city names:
st=set(s2.City.unique().tolist())
Calculate s['Uni City'] using next() to pull the first city from the set that appears in the university name, with np.nan as the default when there is no match:
s['Uni City'] = s['University'].apply(lambda x: next((i for i in st if i in x), np.nan))
Outcome
                                 University   Uni City
0                 Searcy Harding University     Searcy
1              Angwin Pacific Union College     Angwin
2  Fairbanks University of Alaska Fairbanks  Fairbanks
3          Ann Arbor University of Michigan  Ann Arbor

Removing duplicate elements within a pandas cell and counting the number of elements

I have a dataframe like this:
Destinations
Paris,Oslo, Paris,Milan, Athens,Amsterdam
Boston,New York, Boston,London, Paris,New York
Nice,Paris, Milan,Paris, Nice,Milan
I want to get the following dataframe (without space between the cities):
Destinations_2 no_destinations
Paris,Oslo,Milan,Athens,Amsterdam 5
Boston,New York,London,Paris 4
Nice,Paris,Milan 3
How to remove duplicates within a cell?
You can use a list comprehension, which is faster than using apply() (replace Col with the original column name):
df['no_destinations']=[len(set([a.strip() for a in i.split(',')])) for i in df['Col']]
print(df)
Col no_destinations
0 Paris,Oslo, Paris,Milan, Athens,Amsterdam 5
1 Boston,New York, Boston,London, Paris,New York 4
2 Nice,Paris, Milan,Paris, Nice,Milan 3
df['no_destinations'] = df.Destinations.str.split(',').apply(set).apply(len)
If there are spaces in between, use:
df.Destinations.str.split(',').apply(lambda x: list(map(str.strip,x))).apply(set).apply(len)
Output
                                     Destinations  no_destinations
0       Paris,Oslo, Paris,Milan, Athens,Amsterdam                5
1  Boston,New York, Boston,London, Paris,New York                4
2             Nice,Paris, Milan,Paris, Nice,Milan                3
# your data:
import pandas as pd

data = {'Destinations': ['Paris,Oslo, Paris,Milan, Athens,Amsterdam',
                         'Boston,New York, Boston,London, Paris,New York',
                         'Nice,Paris, Milan,Paris, Nice,Milan']}
df = pd.DataFrame(data)
>>>
Destinations
0 Paris,Oslo, Paris,Milan, Athens,Amsterdam
1 Boston,New York, Boston,London, Paris,New York
2 Nice,Paris, Milan,Paris, Nice,Milan
First: make every row of your column a list.
df.Destinations = df.Destinations.apply(lambda x: x.replace(', ', ',').split(','))
>>>
Destinations
0 [Paris, Oslo, Paris, Milan, Athens, Amsterdam]
1 [Boston, New York, Boston, London, Paris, New York]
2 [Nice, Paris, Milan, Paris, Nice, Milan]
Second: remove dups from the lists.
df.Destinations = df.Destinations.apply(lambda x: list(dict.fromkeys(x)))
# or: df.Destinations = df.Destinations.apply(lambda x: list(set(x)))
>>>
Destinations
0 [Paris, Oslo, Milan, Athens, Amsterdam]
1 [Boston, New York, London, Paris]
2 [Nice, Paris, Milan]
Finally, create the desired columns:
df['no_destinations'] = df.Destinations.apply(lambda x: len(x))
df['Destinations_2'] = df.Destinations.apply(lambda x: ','.join(x))
All steps use apply and lambda functions; you can chain or nest them together if you want, as shown below.
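For example, a compact sketch of the same idea starting again from the raw comma-separated strings (cleaned is just an illustrative name; dict.fromkeys is used so the original order is kept):
cleaned = df['Destinations'].apply(lambda x: list(dict.fromkeys(x.replace(', ', ',').split(','))))
df['Destinations_2'] = cleaned.apply(','.join)    # unique cities, comma-separated, no spaces
df['no_destinations'] = cleaned.apply(len)        # count of unique cities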
All the previous answers have addressed only one part of your problem, i.e. showing the unique count (no_destinations). Let me try to answer both of your queries.
The idea below is to apply a function to the Destinations column which returns 2 Series named Destinations_2 and no_destinations, containing the unique elements separated by commas with no spaces, and the count of unique elements, respectively.
import pandas as pd

data = {'Destinations': ['Paris,Oslo, Paris,Milan, Athens,Amsterdam',
                         'Boston,New York, Boston,London, Paris,New York',
                         'Nice,Paris, Milan,Paris, Nice,Milan']}

def remove_dups(x):
    data = set(x.replace(" ", "").split(','))
    return pd.Series([','.join(data), len(data)], index=['Destinations_2', 'no_destinations'])

df = pd.DataFrame.from_dict(data)
df[['Destinations_2', 'no_destinations']] = df['Destinations'].apply(remove_dups)
print(df.head())
Note: As you are not concerned with the order, I have used set above. If you need to maintain the order, you will have to replace set with some other logic to remove duplicates.
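For example, a sketch of the same function with set swapped for dict.fromkeys (remove_dups_ordered is just an illustrative name):
def remove_dups_ordered(x):
    # dict.fromkeys keeps first-seen order (Python 3.7+) while dropping duplicates
    items = list(dict.fromkeys(x.replace(" ", "").split(',')))
    return pd.Series([','.join(items), len(items)],
                     index=['Destinations_2', 'no_destinations'])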
