I have a df, like
Person 1st 2nd 3rd
0 A Park Gym Supermarket
1 B Tea Restaurant Park
2 C Coco Gym Beer
... ... ... ... ...
I want to select a new df containing only the rows that contain 'Park'.
Desired result:
Person 1st 2nd 3rd
0 A Park Gym Supermarket
1 B Tea Restaurant Park
...
I also want another new df containing only the rows that contain 'Gym'.
Desired result:
Person 1st 2nd 3rd
0 A Park Gym Supermarket
2 C Coco Gym Beer
...
How could I do it?
Selecting on a single column is no problem, e.g. df[df['1st'] == 'Park'], but I have trouble selecting across multiple columns (1st, 2nd, 3rd, etc.).
You can perform "or" operations in pandas by using the pipe |, so in this specific case, you could try:
df_filtered = df[(df['1st'] == 'Park') | (df['2nd'] == 'Park') | (df['3rd'] == 'Park')]
Alternatively, you could use .any() with the argument axis=1, which returns True for each row that has a match in any of the listed columns:
df_filtered = df[df[['1st', '2nd', '3rd']].isin(['Park']).any(axis=1)]
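Both approaches can be tried end to end; the frame below is a quick reconstruction of the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'Person': ['A', 'B', 'C'],
    '1st': ['Park', 'Tea', 'Coco'],
    '2nd': ['Gym', 'Restaurant', 'Gym'],
    '3rd': ['Supermarket', 'Park', 'Beer'],
})

# Chained "or" conditions: a row qualifies if any listed column equals 'Park'.
park_df = df[(df['1st'] == 'Park') | (df['2nd'] == 'Park') | (df['3rd'] == 'Park')]

# Equivalent, and it scales to any number of columns: build a boolean frame
# with isin and collapse it row-wise with any(axis=1).
gym_df = df[df[['1st', '2nd', '3rd']].isin(['Gym']).any(axis=1)]
```

Note the comparison is case-sensitive, so the value must be spelled exactly as in the data ('Park', not 'park').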
Related
I have a string such as String = 'Oil - this company'.
In my dataframe df1:
id CompanyName
1 Oil - this company
2 oil
3 oily
4 comp
I want to keep the rows whose CompanyName matches part of the string.
My final df should be: df1
id CompanyName
1 Oil - this company
2 oil
I tried:
df = df[df['CompanyName'].str.contains(String)]
but it dropped the second row (2, oil): str.contains(String) checks whether String appears inside each CompanyName, not whether the CompanyName appears in String.
Is there any way to keep a CompanyName that matches part of the string?
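One reading of the desired output is that a row should be kept when its CompanyName matches the whole string or one of its words, case-insensitively; that keeps 'oil' (matching the word 'Oil') but drops 'oily' and 'comp'. A sketch under that assumption, with the data rebuilt from the question:

```python
import pandas as pd

String = 'Oil - this company'
df1 = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'CompanyName': ['Oil - this company', 'oil', 'oily', 'comp'],
})

# Accept an exact (case-insensitive) match against the whole string or
# against any single word of it; 'oily' and 'comp' match neither.
tokens = {w.lower() for w in String.split()} | {String.lower()}
result = df1[df1['CompanyName'].str.lower().isin(tokens)]
```

A plain substring test (x.lower() in String.lower()) would also keep 'comp', since 'company' contains it, which is why the word-level check is used here.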
I'm working with the Zomato Bangalore Restaurant data set found here. One of my pre-processing steps is to create dummy variables for the types of cuisine each restaurant serves. I've used pandas' explode to split the cuisines, and I've created lists for the top 30 cuisines and the cuisines outside the top 30. I've created a sample dataframe below.
sample_df = pd.DataFrame({
'name': ['Jalsa', 'Spice Elephant', 'San Churro Cafe'],
'cuisines_lst': [
['North Indian', 'Chinese'],
['Chinese', 'North Indian', 'Thai'],
['Cafe', 'Mexican', 'Italian']
]
})
I've created the top and not top lists. In the actual data I'm using the top 30 but for the sake of the example, it's the top 2 and not top 2.
top2 = sample_df.explode('cuisines_lst')['cuisines_lst'].value_counts().index[0:2].tolist()
not_top2 = sample_df.explode('cuisines_lst')['cuisines_lst'].value_counts().index[2:].tolist()
What I would like is to create a dummy variable for all cuisines in the top list with the suffix _bin and create a final dummy variable other if a restaurant has one of the cuisines from the not top list. The desired output looks like this:
name             cuisines_lst                   Chinese_bin  North Indian_bin  Other
Jalsa            [North Indian, Chinese]        1            1                 0
Spice Elephant   [Chinese, North Indian, Thai]  1            1                 1
San Churro Cafe  [Cafe, Mexican, Italian]       0            0                 1
Create the dummies, then reduce by duplicated indices to get your columns for the top 2:
a = pd.get_dummies(sample_df['cuisines_lst'].explode()) \
.reset_index().groupby('index')[top2].sum().add_suffix('_bin')
If you want it in alphabetical order (in this case, Chinese followed by North Indian), add an intermediate step to sort columns with a.sort_index(axis=1).
Do the same for the other values, but reducing columns as well by passing axis=1 to any:
b = pd.get_dummies(sample_df['cuisines_lst'].explode()) \
.reset_index().groupby('index')[not_top2].sum() \
.any(axis=1).astype(int).rename('Other')
Concatenating on indices:
>>> print(pd.concat([sample_df, a, b], axis=1).to_string())
name cuisines_lst North Indian_bin Chinese_bin Other
0 Jalsa [North Indian, Chinese] 1 1 0
1 Spice Elephant [Chinese, North Indian, Thai] 1 1 1
2 San Churro Cafe [Cafe, Mexican, Italian] 0 0 1
It may be strategic if you are operating on lots of data to create an intermediate data frame containing the exploded dummies on which the group-by operation can be performed.
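Putting the pieces together on the sample frame (using groupby(level=0), which is equivalent to reset_index().groupby('index') since explode repeats the original index):

```python
import pandas as pd

sample_df = pd.DataFrame({
    'name': ['Jalsa', 'Spice Elephant', 'San Churro Cafe'],
    'cuisines_lst': [
        ['North Indian', 'Chinese'],
        ['Chinese', 'North Indian', 'Thai'],
        ['Cafe', 'Mexican', 'Italian'],
    ],
})
counts = sample_df.explode('cuisines_lst')['cuisines_lst'].value_counts()
top2, not_top2 = counts.index[:2].tolist(), counts.index[2:].tolist()

# One dummy column per cuisine, one row per (restaurant, cuisine) pair.
dummies = pd.get_dummies(sample_df['cuisines_lst'].explode())

# Collapse the exploded rows back to one row per restaurant.
a = dummies.groupby(level=0)[top2].sum().add_suffix('_bin')
b = dummies.groupby(level=0)[not_top2].sum().any(axis=1).astype(int).rename('Other')

out = pd.concat([sample_df, a, b], axis=1)
```

Computing the dummies once and reusing the frame for both group-bys is the intermediate-dataframe strategy mentioned above.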
So this is a common question, but I can't find an answer that fits this particular scenario.
I have a dataframe with a column for genres, e.g. "Drama, Western", plus one-hot encoded versions of the genres: for "Drama, Western" there is a 1 in both columns, while for just "Western" there is a 1 in the Western column and a 0 in Drama.
I want a filtered dataframe containing rows with only Western and no other genre. I'm oversampling for a model since it's a minority class, but I don't want to increase other genre counts as a byproduct.
There are multiple rows so I can't use the index, and there are multiple genres so I can't use a condition like df[(df['Western']==1) & (df['Drama']==0)] without having to account for all 24 genres.
Index | Genre           | Drama | Western | Action | genre 4
0     | Drama, Western  | 1     | 1       | 0      | 0
1     | Western         | 0     | 1       | 0      | 0
3     | Action, Western | 0     | 1       | 1      | 0
If I understand your question correctly, you want those rows where only 'Western' is 1, i.e. the genre is only Western, nothing else.
Why do you have to use the encoded columns then? Just use the original 'Genre' column where the data is in string format. No need to overcomplicate things.
new_df = df[df['Genre']=='Western']
Make a column_list of the genre columns, like column_list = ['Western', 'Drama', 'Action', ...], and take the row-wise sum; if the sum equals 1 and the 'Western' column equals 1, then Western is the only genre set. Note the element-wise & (Python's `and` raises an error on Series). This returns the Index of the rows where only 'Western' is 1:
column_list = ['Western', 'Drama', 'Action', ...]
df.loc[(df[column_list].sum(axis=1) == 1) & (df['Western'] == 1), 'Index']
If you haven't got the Genre column (and every remaining column is a 0/1 genre dummy), you could do
df[
(df['Western']==1)
&
(df[df.columns.difference(['Western'])]==0).all(axis=1)
]
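The sum-based approach can be checked on a small reconstruction of the table above; on this data it selects the same row as the string comparison df[df['Genre'] == 'Western']:

```python
import pandas as pd

df = pd.DataFrame({
    'Genre': ['Drama, Western', 'Western', 'Action, Western'],
    'Drama': [1, 0, 0],
    'Western': [1, 1, 1],
    'Action': [0, 0, 1],
}, index=[0, 1, 3])

genre_cols = ['Drama', 'Western', 'Action']  # in practice, all 24 dummy columns

# A row is "Western only" when Western is 1 and the dummies sum to exactly 1.
western_only = df[(df['Western'] == 1) & (df[genre_cols].sum(axis=1) == 1)]
```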
I've read through the pandas documentation on merging but am still quite confused about how to apply it to my case. I have 2 dataframes that I'd like to merge: on the common column 'Town', and also on the values in columns of the first df, which appear as column names in the 2nd df.
The first df summarizes the top 5 most common venues in each town:
The second df summarizes the frequencies of all the venue categories in each town:
The output I want:
Ang Mo Kio | Food Court | Coffee Shop | Dessert Shop | Chinese Restaurant | Japanese Restaurant | Freq of Food Court | Freq of Coffee Shop |...
What I've tried with merge:
newdf = pd.merge(sg_onehot_grouped, sg_venues_sorted, left_on=['Town'], right_on=['1st Most Common Venue'])
#only trying the 1st column because wanted to scale down my code
but I got an empty dataframe whose columns were all the columns from both dataframes.
Appreciate any help. Thanks.
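The merge comes back empty because Town values are being matched against venue names, which never coincide. One possible approach is to merge on 'Town' alone and then look up each ranked venue's frequency row by row. The frames below are hypothetical stand-ins for sg_venues_sorted and sg_onehot_grouped (the real structures were only shown as screenshots), and the new columns are named by rank rather than by venue, since the venue differs per town:

```python
import pandas as pd

# Assumed shape: one row per town, ranked venue columns.
sg_venues_sorted = pd.DataFrame({
    'Town': ['Ang Mo Kio'],
    '1st Most Common Venue': ['Food Court'],
    '2nd Most Common Venue': ['Coffee Shop'],
})
# Assumed shape: one row per town, one frequency column per venue category.
sg_onehot_grouped = pd.DataFrame({
    'Town': ['Ang Mo Kio'],
    'Food Court': [0.12],
    'Coffee Shop': [0.10],
})

# Join on the shared key, then use each ranked venue name to pick the
# matching frequency column within the same row.
merged = sg_venues_sorted.merge(sg_onehot_grouped, on='Town')
for col in ['1st Most Common Venue', '2nd Most Common Venue']:
    merged[f'Freq of {col}'] = merged.apply(lambda r: r[r[col]], axis=1)
```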
I have an Excel file with product names. The first row holds the category and the cells below it hold that category's products (A1: Water, A2: Sparkling, A3: Still; B1: Soft Drinks, B2: Coca Cola, B3: Orange Juice, B4: Lemonade, etc.). I want to keep this list in a viewable format (not comma-separated, etc.), since that makes it easy for anybody to update the product names (a second person runs the script without understanding it).
If it helps I can also have the excel file in a CSV format and I can also move the categories from the top row to the first column
I would like to replace the cells of a dataframe (df) with the product categories. For example, Coca Cola would become Soft Drinks. If the product is not in the excel it would not be replaced (ex. Cookie).
print(df)
Product Quantity
0 Coca Cola 1234
1 Cookie 4
2 Still 333
3 Chips 88
Expected Outcome:
print (df1)
Product Quantity
0 Soft Drinks 1234
1 Cookie 4
2 Water 333
3 Snacks 88
Use DataFrame.melt with DataFrame.dropna (or DataFrame.stack) to build a helper Series, then use Series.replace:
s = df1.melt().dropna().set_index('value')['variable']
Alternative:
s = df1.stack().reset_index(name='v').set_index('v')['level_1']
df['Product'] = df['Product'].replace(s)
#if performance is important
#df['Product'] = df['Product'].map(s).fillna(df['Product'])
print (df)
Product Quantity
0 Soft Drinks 1234
1 Cookie 4
2 Water 333
3 Snacks 88
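A self-contained run of the melt-based variant, with df1 reconstructed as the category table described in the question (a 'Snacks' column containing 'Chips' is assumed, since the expected outcome maps Chips to Snacks):

```python
import pandas as pd

# Category table as read from the Excel file: each column header is a
# category, the cells below are its products (None pads shorter columns).
df1 = pd.DataFrame({
    'Water': ['Sparkling', 'Still', None],
    'Soft Drinks': ['Coca Cola', 'Orange Juice', 'Lemonade'],
    'Snacks': ['Chips', None, None],
})
df = pd.DataFrame({
    'Product': ['Coca Cola', 'Cookie', 'Still', 'Chips'],
    'Quantity': [1234, 4, 333, 88],
})

# Long-form helper Series mapping each product to its category.
s = df1.melt().dropna().set_index('value')['variable']

# map + fillna leaves products that aren't in the Excel file untouched.
df['Product'] = df['Product'].map(s).fillna(df['Product'])
```

This keeps the Excel file in its easy-to-edit columnar layout while the script derives the flat product-to-category mapping on the fly.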