Python: Multiple Filters of Grouped Data

I am trying to filter the list below to show only the line items that have the same Supplier and the same Quality (Quality can take any number of ratings), but a different Type (Type only ever takes two values).
For example, I could use Supplier ABC's A-rated wood or steel interchangeably, but could not make the same switch with Supplier DEF (since its wood and steel have different Quality ratings). The desired output would be a table showing only ABC's A-rated steel and wood and GHI's A-rated steel and wood.
I figured out how to show only suppliers that offer both wood and steel (i.e. this eliminates JKL), but cannot figure out how to further filter down to suppliers with a different Type but equal Quality:
df.groupby('Supplier').filter(lambda x: x['Type'].nunique() > 1)
Any help would be greatly appreciated!
Input Data:
  Supplier Quality   Type
0      ABC       A   Wood
1      ABC       B  Steel
2      ABC       A  Steel
3      DEF       B  Steel
4      DEF       A   Wood
5      GHI       C   Wood
6      GHI       A   Wood
7      GHI       A  Steel
8      JKL       A   Wood
9      JKL       A   Wood

Just group by on both Supplier and Quality:
df.groupby(['Supplier', 'Quality']).filter(lambda x: x['Type'].nunique() > 1)
  Supplier Quality   Type
0      ABC       A   Wood
2      ABC       A  Steel
6      GHI       A   Wood
7      GHI       A  Steel

Based on what you have tried, you are looking for groups with more than one unique Type, so you can do something like this:
df2 = df.groupby(['Supplier', 'Quality'])['Type'].unique().to_frame()
df2[df2['Type'].str.len() > 1]
                           Type
Supplier Quality
ABC      A        [Wood, Steel]
GHI      A        [Wood, Steel]
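If you then want one row per Type again, DataFrame.explode (pandas >= 0.25) can unpack the arrays; a minimal sketch using the df2 above:
df2[df2['Type'].str.len() > 1].explode('Type').reset_index()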

One way is to use drop_duplicates and duplicated with keep=False:
key_cols = ['Supplier', 'Quality']

# drop exact duplicate rows first, so a repeated (Supplier, Quality, Type)
# row (e.g. JKL / A / Wood) cannot masquerade as two different Types
res = df.drop_duplicates().loc[:, key_cols]

# any (Supplier, Quality) pair still appearing more than once must have
# more than one Type; keep one row per such pair
res = res.loc[res.duplicated(keep=False)].drop_duplicates()
print(res)
  Supplier Quality
0      ABC       A
6      GHI       A
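Note that this returns only the qualifying (Supplier, Quality) pairs rather than the original line items. To recover the full rows, a minimal sketch (assuming the df, res and key_cols above) is an inner merge back on the key columns:
full = df.merge(res, on=key_cols, how='inner')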

Related

How to divide a list to allocate it to another dataframe based on sum of values?

I have two dataframes, for example:
The first dataframe contains each person's name and the kind of chocolate they want:
Name    Chocolate
Kirti   Nutella
Rahul   Lindt
Sam     Lindt
Joy     Lindt
Mrinal  Kit Kat
Sai     Lindt
The second dataframe contains each shop and the count of each chocolate available there:
Shop    Chocolate  Count
Shop 1  Lindt      2
Shop 2  Lindt      3
Shop 1  Nutella    5
The end result that I'm looking for should return a dataframe which indicates which shop each person can go to.
Rahul, Sam, Joy and Sai are the 4 people who want Lindt. 2 of them can go to Shop 1 and the other 2 can go to Shop 2, to ensure everyone can get Lindt chocolate.
Now we can randomly assign 2 of them to Shop 1 and 2 of them to Shop 2.
Similarly with the other chocolates; the resulting dataframe will be:
Name    Chocolate  Shop
Kirti   Nutella    Shop 1
Rahul   Lindt      Shop 1
Sam     Lindt      Shop 1
Joy     Lindt      Shop 2
Mrinal  Kit Kat    NA
Sai     Lindt      Shop 2
In the above case, Mrinal doesn't get assigned any shop because no shop has Kit Kat available.
I've been trying to do a vlookup in Python using map, but all people who want Lindt get assigned Shop 2. I want to assign them in such a way that the quantity available in each shop is divided up, so that everyone possible can get chocolate.
Here's the code that I wrote as of now:
df_demand = pd.DataFrame({'Name': ['Kirti', 'Rahul', 'Sam', 'Joy', 'Mrinal', 'Sai'],
                          'Chocolate': ['Nutella', 'Lindt', 'Lindt', 'Lindt', 'Kit-Kat', 'Lindt']})
df_inventory = pd.DataFrame({'Shop': ['Shop1', 'Shop2', 'Shop1'],
                             'Chocolate': ['Lindt', 'Lindt', 'Nutella'],
                             'Count': [2, 3, 5]})
df_inventory = df_inventory.sort_values(by=['Count'], ascending=False, kind='mergesort')
df_inventory = df_inventory.drop_duplicates(subset='Chocolate')
df_inv1 = df_inventory.set_index('Chocolate').to_dict()['Shop']
df_demand['Shop'] = df_demand['Chocolate'].map(df_inv1)
Output of the above code (everyone who wants Lindt is mapped to Shop2, and Kit-Kat gets NaN):
     Name Chocolate   Shop
0   Kirti   Nutella  Shop1
1   Rahul     Lindt  Shop2
2     Sam     Lindt  Shop2
3     Joy     Lindt  Shop2
4  Mrinal   Kit-Kat    NaN
5     Sai     Lindt  Shop2
A way to do this is to count up the cumulative supply of each chocolate across shops, and then use that running count to merge each person's request with the shop that can satisfy it.
df = pd.DataFrame(
    [['Shop1', 'Lindt', 1],
     ['Shop1', 'Milka', 1],
     ['Shop2', 'Lindt', 3],
     ['Shop3', 'Lindt', 3],
     ['Shop3', 'Milka', 3]],
    columns=['Shop', 'Chocolate', 'Count'])
dk = pd.DataFrame(
    [['Alfred', 'Milka'],
     ['Berta', 'Milka'],
     ['Charlie', 'Milka'],
     ['Darius', 'Milka'],
     ['Emil', 'Milka'],
     ['George', 'Lindt'],
     ['Francois', 'Milka']],
    columns=['Name', 'Chocolate'])

# running total of supply per chocolate: each shop row can satisfy
# request numbers [min_satisfaction, max_satisfaction)
df['max_satisfaction'] = df.groupby('Chocolate')['Count'].cumsum()
df['min_satisfaction'] = df['max_satisfaction'] - df['Count']
df['satisfies'] = df.apply(
    lambda x: list(range(x['min_satisfaction'], x['max_satisfaction'])), axis=1)
df = df.explode('satisfies')

# number each request per chocolate, then match it to the shop row covering it
dk['request_number'] = dk.groupby('Chocolate').cumcount()
dk = dk.merge(df, how='left',
              left_on=['Chocolate', 'request_number'],
              right_on=['Chocolate', 'satisfies'])
dk[['Name', 'Chocolate', 'Shop']]
Note that this solution will be quite expensive if the shops have far more supply than demand. A limit to prevent the explosion of df could, however, easily be implemented; see the sketch below.
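A minimal sketch of that limit, assuming the df and dk above and run in place of the original satisfies/explode step: cap each shop's satisfies range at the total number of requests for that chocolate, so supply nobody asked for is never exploded.
demand = dk.groupby('Chocolate').size()  # total requests per chocolate

df['cap'] = df['Chocolate'].map(demand).fillna(0).astype(int)
df['satisfies'] = df.apply(
    lambda x: list(range(x['min_satisfaction'],
                         min(x['max_satisfaction'], x['cap']))), axis=1)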

Find how often products are sold together in Python DataFrame

I have a dataframe that is structured like the one below, but with 300 different products and about 20,000 orders.
Order    Avocado  Mango  Chili
1546     500      20     0
861153   200      500    5
1657446  500      20     0
79854    200      500    1
4654     500      20     0
74654    0        500    800
I found out which combinations often appear together with this code (abbreviated here to 3 products).
size = df.groupby(['AVOCADO', 'MANGO', 'CHILI'], as_index=False).size().sort_values(by=['size'], ascending=False)
Now I want to know per product how often it is bought solo and how often with other products.
Something like this would be my ideal output (fictional numbers), where each percentage shows what share of the total orders containing the row product also contained the column product:
Product  Avocado  Mango  Chili
AVOCADO  100%     20%    1%
MANGO    20%      100%   3%
CHILI    20%      30%    100%
First we replace actual quantities by 1s and 0s to indicate if the products were in the order or not:
df2 = 1*(df.set_index('Order') > 0)
Then I think the easiest is just to use matrix algebra wrapped into a dataframe. Also, given the size of your data, it is a good idea to go directly to numpy rather than try to manipulate the dataframe.
For actual numbers of orders that contain (product1,product2), we can do
df3 = pd.DataFrame(data=df2.values.T @ df2.values, columns=df2.columns, index=df2.columns)
df3 looks like this:
         Avocado  Mango  Chili
Avocado        5      5      2
Mango          5      6      3
Chili          2      3      3
e.g. there are 2 orders that contain both Avocado and Chili.
If you want percentages as in your question, we need to divide by the total number of orders containing the given product. Again, I think going to numpy directly is best:
import numpy as np

df4 = pd.DataFrame(data=(df2.values / np.sum(df2.values, axis=0)).T @ df2.values,
                   columns=df2.columns, index=df2.columns)
df4 is:
          Avocado  Mango  Chili
Avocado  1.000000    1.0    0.4
Mango    0.833333    1.0    0.5
Chili    0.666667    1.0    1.0
The 'main' product is in the index and its companion in the columns; so, for example, of the orders containing Mango, 0.833333 also contained Avocado and 0.5 contained Chili.
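To reproduce the percent layout from the question, one option (a minimal sketch; rounding to whole percents is just an assumption) is to scale and format df4:
pct = (df4 * 100).round(0).astype(int).astype(str) + '%'
print(pct)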

How to get unique values with many different URLs

I have a dataframe that looks something like the one below.
Product URLs Company
0 shoes www.walmart.com/12va15a walmart
1 shoes www.costco.com/1apsd-dfasx costco
2 pants www.amazon.com/adsffa1 NaN
3 shirt www.Amazon.com/fas19axl Amazon
4 shoes www.walmart.com/ywsg141q NaN
I'm not sure if pandas can extract the company names from the URLs column and fill them into the NaNs in the Company column.
The dataframe that I will like looks like that below
Product URLs Company
0 shoes www.walmart.com/12va15a walmart
1 shoes www.costco.com/1apsd-dfasx costco
2 pants www.amazon.com/adsffa1 amazon
3 shirt www.Amazon.com/fas19axl amazon
4 shoes www.walmart.com/ywsg141q walmart
Edit: I have lowercased all the URLs, but I'm not sure how to extract just the keywords like amazon, costco, etc. Thanks.
Use Series.str.extract to grab the value between the first and second .:
df.Company = df.URLs.str.lower().str.extract(r'\.(.+)\.', expand=False)
print(df)
Product URLs Company
0 shoes www.walmart.com/12va15a walmart
1 shoes www.costco.com/1apsd-dfasx costco
2 pants www.amazon.com/adsffa1 amazon
3 shirt www.Amazon.com/fas19axl amazon
4 shoes www.walmart.com/ywsg141q walmart
If you want to replace only the missing values, also use Series.fillna:
df.Company = df.Company.fillna(df.URLs.str.lower().str.extract(r'\.(.+)\.', expand=False))
Lowercase your URLs before processing:
df.URLs = df.URLs.str.lower()
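An alternative sketch, assuming every URL follows the www.domain.tld/... pattern, is to split on the dots instead of using a regex:
df.Company = df.URLs.str.lower().str.split('.').str[1]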

How to 'explode' Pandas Column Value to unique row

So, what I mean by 'explode' is this: I want to transform a dataframe like
ID  Name     Food           Drink
1   John     Apple, Orange  Tea, Water
2   Shawn                   Milk
3   Patrick  Chichken
4   Halley   Fish Nugget
into this dataframe:
ID  Name     Order Type  Items
1   John     Food        Apple
2   John     Food        Orange
3   John     Drink       Tea
4   John     Drink       Water
5   Shawn    Drink       Milk
6   Patrick  Food        Chichken
I don't know how to make this happen. Any help would be appreciated!
IIUC, this is stack with an unnest process. Here I would not change the ID; I think keeping the original one is better:
s = df.set_index(['ID', 'Name']).stack()

# split each cell on ',' and repeat the (ID, Name, level) index once per item
pd.DataFrame(data=s.str.split(',').sum(),
             index=s.index.repeat(s.str.split(',').str.len())).reset_index()
Out[289]:
   ID     Name level_2            0
0   1     John    Food        Apple
1   1     John    Food       Orange
2   1     John   Drink          Tea
3   1     John   Drink        Water
4   2    Shawn   Drink         Milk
5   3  Patrick    Food     Chichken
6   4   Halley    Food  Fish Nugget
# if you need to rename the value column to Item, try:
# pd.DataFrame(data=s.str.split(',').sum(),
#              index=s.index.repeat(s.str.split(',').str.len())).rename(columns={0: 'Item'}).reset_index()
You can use pd.melt to convert the data from wide to long format. I think this will be easier to understand step by step.
# first split into separate columns
df[['Food1','Food2']] = df.Food.str.split(',', expand=True)
df[['Drink1','Drink2']] = df.Drink.str.split(',', expand=True)
# now melt the df into long format
df = pd.melt(df, id_vars=['Name'], value_vars=['Food1','Food2','Drink1','Drink2'])
# remove unwanted rows and filter data
df = df[df['value'].notnull()].sort_values('Name').reset_index(drop=True)
# rename the column names and values
df.rename(columns={'variable':'Order Type', 'value':'Items'}, inplace=True)
df['Order Type'] = df['Order Type'].str.replace(r'\d', '', regex=True)
# output
print(df)
      Name Order Type        Items
0   Halley       Food  Fish Nugget
1     John       Food        Apple
2     John       Food       Orange
3     John      Drink          Tea
4     John      Drink        Water
5  Patrick       Food     Chichken
6    Shawn      Drink         Milk
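On pandas >= 0.25 you can also lean on the built-in DataFrame.explode; a minimal sketch of the same transformation, assuming the df from the question:
out = (df.melt(id_vars=['ID', 'Name'], var_name='Order Type', value_name='Items')
         .dropna(subset=['Items'])
         .assign(Items=lambda d: d['Items'].str.split(','))
         .explode('Items'))
out['Items'] = out['Items'].str.strip()  # trim stray spaces around each item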

Sort text in second column based on values in first column

In Python I would like to combine the text into separate rows based on the value in the first column, so that:
Harry went to School 100
Mary sold goods 50
Sick man
using the provided information below:
number  text
1       Harry
1       Went
1       to
1       School
1       100
2       Mary
2       sold
2       goods
2       50
3       Sick
3       Man
for i in range(0, len(df['number']) - 1):
    if df['number'][i + 1] == df['number'][i]:
        pass  # append text (e.g. Harry went to School 100)
    else:
        pass  # start a new row (Mary sold goods 50)
You can use groupby:
for name, group in df.groupby('number'):
    print(' '.join(group['text']))
Result
Harry Went to School 100
Mary sold goods 50
Sick Man
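If you would rather keep the result as a Series than print it, a minimal sketch with the same df:
sentences = df.groupby('number')['text'].apply(' '.join)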
