How to get unique values with many different URLs - python

I have a dataframe that looks something like the one below.
Product URLs Company
0 shoes www.walmart.com/12va15a walmart
1 shoes www.costco.com/1apsd-dfasx costco
2 pants www.amazon.com/adsffa1 NaN
3 shirt www.Amazon.com/fas19axl Amazon
4 shoes www.walmart.com/ywsg141q NaN
I'm not sure whether Pandas can extract the company names from the URL column and use them to fill the NaNs in the Company column.
The dataframe I would like looks like the one below:
Product URLs Company
0 shoes www.walmart.com/12va15a walmart
1 shoes www.costco.com/1apsd-dfasx costco
2 pants www.amazon.com/adsffa1 amazon
3 shirt www.Amazon.com/fas19axl amazon
4 shoes www.walmart.com/ywsg141q walmart
Edit: I have lowercased all the URLs, but I'm not sure how to extract just the keywords like amazon, costco, etc. Thanks.

Use Series.str.extract to get the value between the first and last . (the raw string avoids an invalid-escape warning):
df.Company = df.URLs.str.lower().str.extract(r'\.(.+)\.', expand=False)
print (df)
Product URLs Company
0 shoes www.walmart.com/12va15a walmart
1 shoes www.costco.com/1apsd-dfasx costco
2 pants www.amazon.com/adsffa1 amazon
3 shirt www.Amazon.com/fas19axl amazon
4 shoes www.walmart.com/ywsg141q walmart
If you want to replace only the missing values, also use Series.fillna:
df.Company = df.Company.fillna(df.URLs.str.lower().str.extract(r'\.(.+)\.', expand=False))
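Putting it together, a minimal runnable sketch, assuming the frame from the question is rebuilt by hand (fillna keeps existing values, so row 3 stays Amazon; use the overwrite variant above if everything should be lowercase):
import pandas as pd

df = pd.DataFrame({
    'Product': ['shoes', 'shoes', 'pants', 'shirt', 'shoes'],
    'URLs': ['www.walmart.com/12va15a', 'www.costco.com/1apsd-dfasx',
             'www.amazon.com/adsffa1', 'www.Amazon.com/fas19axl',
             'www.walmart.com/ywsg141q'],
    'Company': ['walmart', 'costco', None, 'Amazon', None]})

# extract the domain name from the lowercased URL and fill only the gaps
extracted = df['URLs'].str.lower().str.extract(r'\.(.+)\.', expand=False)
df['Company'] = df['Company'].fillna(extracted)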

Lowercase your URLs before processing:
df.URLs = df.URLs.str.lower()

Related

How to divide a list to allocate it to another dataframe based on sum of values?

I have two dataframes for example:
First dataframe contains the name and kind of chocolate they want:
Name    Chocolate
Kirti   Nutella
Rahul   Lindt
Sam     Lindt
Joy     Lindt
Mrinal  Kit Kat
Sai     Lindt
The second dataframe contains shop and availability of each item in shop:
Shop    Chocolate  Count
Shop 1  Lindt      2
Shop 2  Lindt      3
Shop 1  Nutella    5
The end result that I'm looking for should return a dataframe which indicates which shop the people can go to.
Rahul, Sam, Joy and Sai are the 4 people who want Lindt. 2 of them can go to Shop 1 and the other 2 can go to Shop 2, so everyone can get Lindt chocolate.
Now we can randomly assign 2 of them to Shop 1 and 2 of them to Shop 2.
Similarly with other chocolates and resulting dataframe will be
Name    Chocolate  Shop
Kirti   Nutella    Shop 1
Rahul   Lindt      Shop 1
Sam     Lindt      Shop 1
Joy     Lindt      Shop 2
Mrinal  Kit Kat    NA
Sai     Lindt      Shop 2
In the above case, Mrinal doesn't get assigned any shop because no shop has Kit Kat available.
I've been trying to do a vlookup in Python using map but all people who want Lindt get assigned Shop 2. I want to assign them in such a way that divides the qty available in each shop so that everyone possible can get chocolate.
Here's the code that I wrote as of now:
import pandas as pd

df_demand = pd.DataFrame({'Name': ['Kirti', 'Rahul', 'Sam', 'Joy', 'Mrinal', 'Sai'],
                          'Chocolate': ['Nutella', 'Lindt', 'Lindt', 'Lindt', 'Kit-Kat', 'Lindt']})
df_inventory = pd.DataFrame({'Shop': ['Shop1', 'Shop2', 'Shop1'],
                             'Chocolate': ['Lindt', 'Lindt', 'Nutella'],
                             'Count': [2, 3, 5]})
# keep only the best-stocked shop per chocolate, then map demand onto it
df_inventory = df_inventory.sort_values(by=['Count'], ascending=False, kind='mergesort')
df_inventory = df_inventory.drop_duplicates(subset='Chocolate')
df_inv1 = df_inventory.set_index('Chocolate').to_dict()['Shop']
df_demand['Shop'] = df_demand['Chocolate'].map(df_inv1)
Output of the above code (every Lindt request maps to Shop2, because the dict keeps only one shop per chocolate):
     Name Chocolate   Shop
0   Kirti   Nutella  Shop1
1   Rahul     Lindt  Shop2
2     Sam     Lindt  Shop2
3     Joy     Lindt  Shop2
4  Mrinal   Kit-Kat    NaN
5     Sai     Lindt  Shop2
One way to do this is to count up the cumulative supply (sale opportunities) per chocolate and then use that running number to merge each person's request with the shop that covers it.
import pandas as pd

df = pd.DataFrame(
    [['Shop1', 'Lindt', 1],
     ['Shop1', 'Milka', 1],
     ['Shop2', 'Lindt', 3],
     ['Shop3', 'Lindt', 3],
     ['Shop3', 'Milka', 3]],
    columns=['Shop', 'Chocolate', 'Count'])
dk = pd.DataFrame(
    [['Alfred', 'Milka'],
     ['Berta', 'Milka'],
     ['Charlie', 'Milka'],
     ['Darius', 'Milka'],
     ['Emil', 'Milka'],
     ['George', 'Lindt'],
     ['Francois', 'Milka']],
    columns=['Name', 'Chocolate'])

# each shop row satisfies request numbers [min_satisfaction, max_satisfaction)
df['max_satisfaction'] = df.groupby('Chocolate')['Count'].cumsum()
df['min_satisfaction'] = df['max_satisfaction'] - df['Count']
df['satisfies'] = df.apply(
    lambda x: list(range(x['min_satisfaction'], x['max_satisfaction'])), axis=1)
df = df.explode('satisfies')
# explode leaves an object column; cast so both merge keys share a dtype
df['satisfies'] = df['satisfies'].astype(int)

# number each request per chocolate and match it to the shop that covers it
dk['request_number'] = dk.groupby('Chocolate').cumcount()
dk = dk.merge(df, how='left',
              left_on=['Chocolate', 'request_number'],
              right_on=['Chocolate', 'satisfies'])
dk[['Name', 'Chocolate', 'Shop']]
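Running this on the sample data gives the following (Emil and Francois stay unmatched because Milka demand exceeds supply):
       Name Chocolate   Shop
0    Alfred     Milka  Shop1
1     Berta     Milka  Shop3
2   Charlie     Milka  Shop3
3    Darius     Milka  Shop3
4      Emil     Milka    NaN
5    George     Lindt  Shop1
6  Francois     Milka    NaN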
Note that this solution can get quite expensive if the shops have far more supply than demand. However, a limit to prevent the explosion of df could easily be implemented.
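For instance, a sketch of such a limit (demand and cap are helper names introduced here, not part of the original answer), inserted after the min_satisfaction line:
# cap the running supply at the total number of requests per chocolate,
# so ranges past the last request come out empty
demand = dk.groupby('Chocolate').size()
cap = df['Chocolate'].map(demand).fillna(0).astype(int)
df['max_satisfaction'] = df['max_satisfaction'].clip(upper=cap)
Rows whose range is now empty explode to NaN, so drop them with df = df.dropna(subset=['satisfies']) before the cast to int.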

How to remove duplicates when upon editing an entity the originals are not replaced?

Consider that we have a dataset that represents some purchases. Products that have been bought together have the same basket ID.
When a purchased product is edited (e.g. the wrong price was inserted at first) it does not replace the original record. Instead, a new record is made for EVERY product of that basket ID and a new Basket ID is assigned to the purchase.
For example consider a purchase of a bottle of milk and a chocolate:
Product Price BasketID PreviousBasketID
0 Milk 2 1234 Null
1 Chocolate 3 1234 Null
Let's say that we'd like to edit the price of chocolate. Then the dataset would be:
Product Price BasketID PreviousBasketID
0 Milk 2 1234 Null
1 Chocolate 3 1234 Null
2 Milk 2 5678 1234
3 Chocolate 4 5678 1234
Is there a way to keep only the latest version of the basket (i.e. BasketID = 5678) and get rid of any previous versions?
Can you remove any rows that have a BasketID that appears in PreviousBasketID?
Something like:
df = df[~df["BasketID"].isin(df["PreviousBasketID"])]
Here the ~ is element-wise negation (logical not).
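A quick check on the example data, as a self-contained sketch:
import pandas as pd

df = pd.DataFrame({
    'Product': ['Milk', 'Chocolate', 'Milk', 'Chocolate'],
    'Price': [2, 3, 2, 4],
    'BasketID': [1234, 1234, 5678, 5678],
    'PreviousBasketID': [None, None, 1234, 1234]})

# keep only baskets that are never referenced as a previous version
df = df[~df['BasketID'].isin(df['PreviousBasketID'])]
print(df)
     Product  Price  BasketID PreviousBasketID
2       Milk      2      5678             1234
3  Chocolate      4      5678             1234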

Use Excel sheet to create dictionary in order to replace values

I have an excel file with product names. The first row holds the categories (A1: Water, B1: Soft Drinks) and each cell below is a product in that category (A2: Sparkling, A3: Still, B2: Coca Cola, B3: Orange Juice, B4: Lemonade, etc.). I want to keep this list in a viewable format (not comma-separated etc.), as that makes it very easy for anybody to update the product names (a second person runs the script without understanding it).
If it helps I can also have the excel file in a CSV format and I can also move the categories from the top row to the first column
I would like to replace the cells of a dataframe (df) with the product categories. For example, Coca Cola would become Soft Drinks. If a product is not in the excel file, it should not be replaced (e.g. Cookie).
print(df)
Product Quantity
0 Coca Cola 1234
1 Cookie 4
2 Still 333
3 Chips 88
Expected Outcome:
print (df1)
Product Quantity
0 Soft Drinks 1234
1 Cookie 4
2 Water 333
3 Snacks 88
Use DataFrame.melt with DataFrame.dropna (or DataFrame.stack) to build a helper Series, then use Series.replace:
s = df1.melt().dropna().set_index('value')['variable']
Alternative:
s = df1.stack().reset_index(name='v').set_index('v')['level_1']
df['Product'] = df['Product'].replace(s)
#if performance is important
#df['Product'] = df['Product'].map(s).fillna(df['Product'])
print (df)
Product Quantity
0 Soft Drinks 1234
1 Cookie 4
2 Water 333
3 Snacks 88
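A self-contained sketch of the whole flow, with a small frame standing in for the sheet that would come from pd.read_excel (layout assumed from the question: categories in the first row, products below):
import pandas as pd

df1 = pd.DataFrame({'Water': ['Sparkling', 'Still'],
                    'Soft Drinks': ['Coca Cola', 'Orange Juice'],
                    'Snacks': ['Chips', None]})
df = pd.DataFrame({'Product': ['Coca Cola', 'Cookie', 'Still', 'Chips'],
                   'Quantity': [1234, 4, 333, 88]})

# product -> category lookup built from the sheet
s = df1.melt().dropna().set_index('value')['variable']
df['Product'] = df['Product'].map(s).fillna(df['Product'])
print(df)
       Product  Quantity
0  Soft Drinks      1234
1       Cookie         4
2        Water       333
3       Snacks        88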

Python: Multiple Filters of Grouped Data

I am trying to filter out the list below to show only the line items that have the same Supplier, same Quality (can be an infinite amount of ratings), but different Type (would only be two different values).
For example, I could use Supplier ABC A rated wood or steel but would not be able to do the same switch with Supplier DEF (given wood and steel have different Quality). The desired output would be a table only showing ABC's A rated steel and wood and GHI's B rated steel and wood.
I figured out how to show only suppliers that offer both wood and steel (i.e. it eliminates JKL) but cannot figure out how to further filter down to suppliers with different Type but equal Quality.
df.groupby('Supplier').filter(lambda x:x['Type'].nunique()>1)
Any help would be greatly appreciated!
Input Data:
Supplier Quality Type
0 ABC A Wood
1 ABC B Steel
2 ABC A Steel
3 DEF B Steel
4 DEF A Wood
5 GHI C Wood
6 GHI A Wood
7 GHI A Steel
8 JKL A Wood
9 JKL A Wood
Just group by on both Supplier and Quality:
df.groupby(['Supplier', 'Quality']).filter(lambda x: x['Type'].nunique() > 1)
Supplier Quality Type
0 ABC A Wood
2 ABC A Steel
6 GHI A Wood
7 GHI A Steel
Based on what you have tried, you are looking for groups with more than one unique type, so you can do something like this:
df2 = df.groupby(['Supplier', 'Quality'])['Type'].unique().to_frame()
df2[df2['Type'].str.len() >1]
Type
Supplier Quality
ABC A [Wood, Steel]
GHI A [Wood, Steel]
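If you need that result back as one row per type, exploding the filtered frame should do it (a sketch building on df2 from above):
df2[df2['Type'].str.len() > 1].explode('Type').reset_index()
  Supplier Quality   Type
0      ABC       A   Wood
1      ABC       A  Steel
2      GHI       A   Wood
3      GHI       A  Steel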
One way is to use drop_duplicates and duplicated with keep=False:
key_cols = ['Supplier', 'Quality']
res = df.drop_duplicates().loc[:, key_cols]
res = res.loc[res.duplicated(keep=False)]\
.drop_duplicates()
print(res)
Supplier Quality
0 ABC A
6 GHI A
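If the full rows are needed rather than just the key pairs, merging the keys back onto the original frame should recover them:
print(df.merge(res, on=key_cols))
  Supplier Quality   Type
0      ABC       A   Wood
1      ABC       A  Steel
2      GHI       A   Wood
3      GHI       A  Steel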

How to 'explode' Pandas Column Value to unique row

So, what I mean by explode is this: I want to transform a dataframe like
ID  Name     Food           Drink
1   John     Apple, Orange  Tea , Water
2   Shawn                   Milk
3   Patrick  Chichken
4   Halley   Fish Nugget
into this dataframe:
ID  Name     Order Type  Items
1   John     Food        Apple
2   John     Food        Orange
3   John     Drink       Tea
4   John     Drink       Water
5   Shawn    Drink       Milk
6   Patrick  Food        Chichken
I don't know how to make this happen. Any help would be appreciated!
IIUC, stack with an unnest step; here I would not change the ID, I think keeping the original one is better:
s = df.set_index(['ID', 'Name']).stack()
pd.DataFrame(data=s.str.split(',').sum(),
             index=s.index.repeat(s.str.split(',').str.len())).reset_index()
Out[289]:
ID Name level_2 0
0 1 John Food Apple
1 1 John Food Orange
2 1 John Drink Tea
3 1 John Drink Water
4 2 Shawn Drink Milk
5 3 Patrick Food Chichken
6 4 Halley Food Fish Nugget
# if you need rename the column to item try below
#pd.DataFrame(data=s.str.split(',').sum(),index=s.index.repeat(s.str.split(',').str.len())).rename(columns={0:'Item'}).reset_index()
You can use pd.melt to convert the data from wide to long format. I think this will be easier to understand step by step.
# first split into separate columns
df[['Food1','Food2']] = df.Food.str.split(',', expand=True)
df[['Drink1','Drink2']] = df.Drink.str.split(',', expand=True)
# now melt the df into long format
df = pd.melt(df, id_vars=['Name'], value_vars=['Food1','Food2','Drink1','Drink2'])
# remove unwanted rows and filter data
df = df[df['value'].notnull()].sort_values('Name').reset_index(drop=True)
# rename the column names and values
df.rename(columns={'variable':'Order Type', 'value':'Items'}, inplace=True)
df['Order Type'] = df['Order Type'].str.replace(r'\d', '', regex=True)
# output
print(df)
Name Order Type Items
0 Halley Food Fish Nugget
1 John Food Apple
2 John Food Orange
3 John Drink Tea
4 John Drink Water
5 Patrick Food Chichken
6 Shawn Drink Milk
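On pandas 0.25 or newer, a shorter route (not from either answer above, just a sketch) is to melt and then explode the split lists directly:
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'Name': ['John', 'Shawn', 'Patrick', 'Halley'],
                   'Food': ['Apple, Orange', None, 'Chichken', 'Fish Nugget'],
                   'Drink': ['Tea , Water', 'Milk', None, None]})

out = (df.melt(id_vars=['ID', 'Name'], var_name='Order Type', value_name='Items')
         .dropna(subset=['Items'])
         .assign(Items=lambda d: d['Items'].str.split(','))
         .explode('Items')
         .assign(Items=lambda d: d['Items'].str.strip())
         .reset_index(drop=True))
print(out)
   ID     Name Order Type        Items
0   1     John       Food        Apple
1   1     John       Food       Orange
2   3  Patrick       Food     Chichken
3   4   Halley       Food  Fish Nugget
4   1     John      Drink          Tea
5   1     John      Drink        Water
6   2    Shawn      Drink         Milk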
