Use a different dataframe to replace value of text in dataframe - python

I have a simple dataframe (df1) where I am replacing values with the replace function (see below). Instead of always having to change the names of the items I want to replace in the code, I would like this to be done from an Excel sheet, where either the columns or rows give the different names that should be replaced. I would import the Excel file as a dataframe (df2). All I am missing is the script that would turn the info from df2 into the replace function.
df1 = pd.DataFrame({'Product': ['Tart', 'Cookie', 'Black'],
                    'Quantity': [1234, 4, 333]})
print(df1)

  Product  Quantity
0    Tart      1234
1  Cookie         4
2   Black       333
This is what I used so far
sales = sales.replace(["Tart", "Tart2", "Cookie", "Cookie2"], "Tartlet")
sales = sales.replace(["Ham and cheese Sandwich", "Chicken focaccia"], "Sandwich")
After replacement
print(df1)

   Product  Quantity
0  Tartlet      1234
1  Tartlet         4
2    Black       333
This is how my dataframe 2 could look (I am flexible on how to design it) after importing it from an Excel file:
df2 = pd.read_excel(setup_folder / "Product Replacements.xlsx", index_col=0)
print(df2)

   Tartlet                 Sandwich
0     Tart  Ham and cheese Sandwich
1    Tart2         Chicken Focaccia
2  Cookie2                      NaN

Use:
df2 = pd.DataFrame({'Tartlet': ['Tart', 'Tart2', 'Cookie'],
                    'Sandwich': ['Ham and Cheese Sandwich', 'Chicken Focaccia', 'another']})
#swap key values in dict
#http://stackoverflow.com/a/31674731/2901002
d1 = {k: oldk for oldk, oldv in df2.items() for k in oldv}
print(d1)

{'Tart': 'Tartlet', 'Tart2': 'Tartlet', 'Cookie': 'Tartlet',
 'Ham and Cheese Sandwich': 'Sandwich', 'Chicken Focaccia': 'Sandwich',
 'another': 'Sandwich'}
df1['Product'] = df1['Product'].replace(d1)
# for improved performance:
# df1['Product'] = df1['Product'].map(d1).fillna(df1['Product'])
print(df1)

   Product  Quantity
0  Tartlet      1234
1  Tartlet         4
2    Black       333
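If df2 comes straight from the Excel file, shorter columns are padded with NaN (the Cookie2 row above), and those cells should be skipped when building the mapping. A minimal sketch under that assumption, reusing the read_excel call from the question (setup_folder below is a placeholder):

from pathlib import Path
import pandas as pd

setup_folder = Path(".")  # placeholder; point this at the real setup folder

# Column headers are the replacement targets, cells hold the values to replace.
df2 = pd.read_excel(setup_folder / "Product Replacements.xlsx", index_col=0)

# Build {old value: new value}, skipping the NaN padding in shorter columns.
d1 = {k: col for col, values in df2.items() for k in values.dropna()}
df1['Product'] = df1['Product'].map(d1).fillna(df1['Product'])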

Related

how to add multiple values and make the row repeat for the number of values

I have a list of objects for each name, and a dataframe like this:
Jimmy = ['chair','table','pencil']
Charles = ['smartphone','cake']
John = ['clock','paper']
id  name
1   Jimmy
2   Charles
3   John
I would like to use a loop that allows me to obtain the following result.
id  name     picks
1   Jimmy    chair
1   Jimmy    table
1   Jimmy    pencil
2   Charles  smartphone
2   Charles  cake
3   John     clock
3   John     paper
You can assign and explode:
values = {'Jimmy': Jimmy, 'Charles': Charles, 'John': John}
out = df.assign(picks=df['name'].map(values)).explode('picks')
Or set up a DataFrame, stack and merge:
values = {'Jimmy': Jimmy, 'Charles': Charles, 'John': John}
out = df.merge(
    pd.DataFrame.from_dict(values, orient='index')
      .stack().droplevel(1).rename('picks'),
    left_on='name', right_index=True
)
output:

   id     name       picks
0   1    Jimmy       chair
0   1    Jimmy       table
0   1    Jimmy      pencil
1   2  Charles  smartphone
1   2  Charles        cake
2   3     John       clock
2   3     John       paper
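A side note on the output above: the repeated index values (0, 0, 0, 1, ...) come from explode and the index-preserving merge; if a clean RangeIndex is wanted, it can be reset afterwards:

out = out.reset_index(drop=True)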
We can make a dataframe relating names to picks, then join them together with merge:
import pandas as pd
#dataframe from question
df = pd.DataFrame()
df["id"] = [1, 2, 3]
df["name"] = ["Jimmy", "Charles", "John"]
#dataframe relating names to picks.
picks_df = pd.DataFrame()
picks_df["name"] = ["Jimmy", "Jimmy", "Jimmy", "Charles", "Charles", "John", "John"]
picks_df["picks"] = ["chair", "table", "pencil", "smartphone", "cake", "clock", "paper"]
#Merge and print
print(pd.merge(df, picks_df, on="name"))
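For reference, this merge should print something like the following (output reconstructed, not from the original post):

   id     name       picks
0   1    Jimmy       chair
1   1    Jimmy       table
2   1    Jimmy      pencil
3   2  Charles  smartphone
4   2  Charles        cake
5   3     John       clock
6   3     John       paper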

How to calculate a column using the most common words calculated from another dataframe in Python?

Example of the dataframe:
cup = {'Description': ['strawberry cupcake', 'blueberry cupcake', 'strawberry cookie', 'grape organic cookie', 'blueberry organic cookie', 'lemon organic cupcake'],
'Days_Sold': [3, 4, 1, 2, 2, 1]}
cake = pd.DataFrame(data=cup)
cake
I calculated the most common words in the dataframe (stop words had already been removed):
from collections import Counter
Counter(" ".join(cake['Description']).split()).most_common()
I put this into a new dataframe and reset the index
count = pd.DataFrame(Counter(" ".join(cake['Description']).split()).most_common())
count.columns = ['Words', 'Values']
count.index = np.arange(1, len(count)+1)
count.head()
The Values column is in the 'count' dataframe; Days_Sold is in the 'cake' dataframe. What I would like to do now: whenever a common word from the 'count' dataframe shows up, like cupcake, work out how long it takes me to sell the product according to the 'cake' dataframe, and run this through every common word in the 'count' dataframe until it's done. The answer for cupcake should come out to (3 + 4 + 1) = 8.
My actual dataframe is over 3000 lines (and not exactly about cakes). The descriptions are longer, and I need over 40 common words, adjustable to my needs.
This is why I can't type in each word. I believe this requires a nested for loop, but I am stuck on it:
for day in cake:
    for top in count:
        top = count.Words
        day = cake.loc[cake['CleanDescr'] == count, ['Days_Sold']]
The error says: 'int' object is not iterable
Thank you!
Update:
Thank you so much to everyone helping me on this large project. I am posting my solution to #3, adjusted from the answer by Mark Moretto.
# Split and explode Description
df = cake.iloc[:, 0].str.lower().str.split(r"\W+").explode().reset_index()
df
# Merge counts to main DataFrame
df_freq = pd.merge(df, count.rename(columns={'Words': 'Description'}), on="Description")
df_freq
# Left join cake DataFrame onto df_freq by index values.
df_freq = (pd.merge(df_freq, cake, left_on="index", right_index=True)
             .loc[:, ["Description_x", "Values", "Days_Sold"]]
             .rename(columns={"Description_x": "Description"})
          )
df_freq
# Group by Description and return the mean for the value fields
df_metrics = df_freq.groupby("Description").mean().round(4)
df_metrics
df_metrics.head(5).sort_values(by='Values', ascending=False)
#print(df_metrics)
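One note on the last line above: head(5) runs before sort_values, so only the first five rows get sorted. To see the five largest Values, sort first:

df_metrics.sort_values(by='Values', ascending=False).head(5)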
Another way, though I didn't really remove any words by frequency or anything.
# <...your starter code for dataframe creation...>
# Split and explode Description
df = cake.iloc[:, 0].str.lower().str.split(r"\W+").explode().reset_index()
# Get count of words
df_counts = (df.groupby("Description")
               .size()
               .reset_index()
               .rename(columns={0: "word_count"})
            )
# Merge counts to main DataFrame
df_freq = pd.merge(df, df_counts, on="Description")
# Left join cake DataFrame onto df_freq by index values.
df_freq = (pd.merge(df_freq, cake, left_on="index", right_index=True)
             .loc[:, ["Description_x", "word_count", "Days_Sold"]]
             .rename(columns={"Description_x": "Description"})
          )
# Group by Description and return max result for value fields
df_metrics = df_freq.groupby("Description").max()
print(df_metrics)
Output:
             word_count  Days_Sold
Description
blueberry             2          4
cookie                3          2
cupcake               3          4
grape                 1          2
lemon                 1          1
organic               3          2
strawberry            2          3
Given
cup = {'Description': ['strawberry cupcake', 'blueberry cupcake', 'strawberry cookie', 'grape organic cookie', 'blueberry organic cookie', 'lemon organic cupcake'],
'Days_Sold': [3, 4, 1, 2, 2, 1]}
cake = pd.DataFrame(data=cup)
count = pd.DataFrame(Counter(" ".join(cake['Description']).split()).most_common())
count.columns = ['Words', 'Values']
count.index = np.arange(1, len(count)+1)
Your final count dataframe looks like:
        Words  Values
1     cupcake       3
2      cookie       3
3     organic       3
4  strawberry       2
5   blueberry       2
You can:
Convert the index to be a column, see How to convert index of a pandas dataframe into a column
Then, reindex your count dataframe by Words
Finally, you can use .loc[<key>]['Values'] to look up the value for a given word
count_by_words = count.set_index('Words')
count_by_words.loc['cupcake']['Values']
The count_by_words DataFrame will look like this:
            index  Values
Words
cupcake         1       3
cookie          2       3
organic         3       3
strawberry      4       2
blueberry       5       2
grape           6       1
lemon           7       1
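Note that the asker's target for cupcake, (3 + 4 + 1) = 8, is a sum of Days_Sold over every row containing the word. A minimal sketch of that sum, using explode instead of a nested loop:

import pandas as pd

cup = {'Description': ['strawberry cupcake', 'blueberry cupcake',
                       'strawberry cookie', 'grape organic cookie',
                       'blueberry organic cookie', 'lemon organic cupcake'],
       'Days_Sold': [3, 4, 1, 2, 2, 1]}
cake = pd.DataFrame(data=cup)

# One row per (word, Days_Sold) pair, then sum the days per word.
words = cake.assign(word=cake['Description'].str.split()).explode('word')
days_per_word = words.groupby('word')['Days_Sold'].sum()
print(days_per_word['cupcake'])  # 8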
If the goal is to estimate max days sold based on the words in the description, you can try:
import pandas as pd
import numpy as np
from collections import Counter, defaultdict
cup = {'Description': ['strawberry cupcake', 'blueberry cupcake', 'strawberry cookie', 'grape organic cookie', 'blueberry organic cookie', 'lemon organic cupcake'],
'Days_Sold': [3, 4, 1, 2, 2, 1]}
df = pd.DataFrame(data=cup)
word_counter = Counter()       # Keeps track of the word counts
word_days = defaultdict(list)  # Keeps track of the days sold per word
max_days = {}

# Iterate one row at a time.
for _, s in df.iterrows():
    words = s['Description'].split()
    word_counter += Counter(words)
    for word in words:
        # If the max days seen so far for this word is lower than the row's days_sold
        if max(word_days[word], default=0) < s['Days_Sold']:
            # Set the max_days for the word to the current days_sold
            max_days[word] = s['Days_Sold']
        # Keep track of the different days_sold values for this word.
        word_days[word].append(s['Days_Sold'])

df2 = pd.DataFrame({'max_days_sold': max_days, 'word_count': word_counter})
df2.loc['strawberry']['max_days_sold']
[out]:

            max_days_sold  word_count
strawberry              3           2
cupcake                 4           3
blueberry               4           2
cookie                  2           3
grape                   2           1
organic                 2           3
lemon                   1           1
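A small follow-up on the lookup line df2.loc['strawberry']['max_days_sold']: chained indexing works for reading, but a single indexer is the more idiomatic lookup:

df2.loc['strawberry', 'max_days_sold']  # 3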

How to iterate through pandas columns and rows simultaneously?

I have two dataframes, A and B. I want to iterate through certain columns of df B, check the values of all its rows to see if they exist in one of the columns of A, and if so, fill B's null values with values from A's other columns.
df A:

country  region  product
USA      NY      apple
USA      NY      orange
UK       LON     banana
UK       LON     chocolate
CANADA   TOR     syrup
CANADA   TOR     fish

df B:

country  ID   product1     product2     product3     product4     region
USA      123  other stuff  other stuff  apple        NA           NA
USA      456  orange       other stuff  other stuff  NA           NA
UK       234  banana       other stuff  other stuff  NA           NA
UK       766  other stuff  other stuff  chocolate    NA           NA
CANADA   877  other stuff  other stuff  syrup        NA           NA
CANADA   109  NA           fish         NA           other stuff  NA
So I want to iterate through df B and, for example, see if dfA.product (apple) is in df B's columns product1 through product4. If it is, as the first row of df B indicates, I want to add the region value from dfA.region into df B's region column, which is currently NA.
Here is the code I have; I am not sure if it is right:
import pandas as pd
from tqdm import tqdm

def fill_null_value(dfA, dfB):
    for i, row in tqdm(dfA.iterrows()):
        for index, row in tqdm(dfB.iterrows()):
            if dfB['product1'][index] == dfA['product'][i]:
                dfB['region'] = dfA['region'][i]
            elif dfB['product2'][index] == dfA['product'][i]:
                dfB['region'] = dfA['region'][i]
            elif dfB['product3'][index] == dfA['product'][i]:
                dfB['region'] = dfA['region'][i]
            elif dfB['product4'][index] == dfA['product'][i]:
                dfB['region'] = dfA['region'][i]
            else:
                dfB['region'] = "not found"
    print('outputting data')
    return dfB.to_excel('test.xlsx')
If I were you, I would create some joins, then concat them and drop duplicates:
df_1 = df_A.merge(df_B, left_on=['country', 'product'], right_on=['country', 'product1'], how='right')
df_2 = df_A.merge(df_B, left_on=['country', 'product'], right_on=['country', 'product2'], how='right')
df_3 = df_A.merge(df_B, left_on=['country', 'product'], right_on=['country', 'product3'], how='right')
df_4 = df_A.merge(df_B, left_on=['country', 'product'], right_on=['country', 'product4'], how='right')
df = pd.concat([df_1, df_2, df_3, df_4]).drop_duplicates()
The main issue here seems to be finding a single column for products in your second data set that you can do your join on. It's not clear how exactly you are deciding what values in the various product columns in df_b are meant to be used as keys to lookup vs. the ones that are ignored.
Assuming, though, that your df_a contains an exhaustive list of product values and each of those values only ever occurs in a row once, you could do something like this (simplifying your example):
import pandas as pd
df_a = pd.DataFrame({'Region':['USA', 'Canada'], 'Product': ['apple', 'banana']})
df_b = pd.DataFrame({'product1': ['apple', 'xyz'], 'product2': ['xyz', 'banana']})
product_cols = ['product1', 'product2']
df_b['Product'] = df_b[product_cols].apply(lambda x: x[x.isin(df_a.Product)].iloc[0], axis=1)
df_b = df_b.merge(df_a, on='Product')
The big thing here is generating a column that you can join on for your lookup.
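Another option, sketched under the assumption that each row of df B matches at most one product from df A (the frames below are trimmed stand-ins for the question's data): melt the product columns into one long column, merge once against df A, and map the found regions back by ID.

import pandas as pd

# Trimmed stand-ins for df A and df B from the question.
df_a = pd.DataFrame({'country': ['USA', 'UK'],
                     'region': ['NY', 'LON'],
                     'product': ['apple', 'banana']})
df_b = pd.DataFrame({'country': ['USA', 'UK'],
                     'ID': [123, 234],
                     'product1': ['other stuff', 'banana'],
                     'product2': ['apple', 'other stuff']})

# One row per (ID, product column) pair.
long = df_b.melt(id_vars=['country', 'ID'],
                 value_vars=['product1', 'product2'],
                 value_name='product')

# Keep only the rows whose product exists in df A, then map region back by ID.
lookup = long.merge(df_a, on=['country', 'product'])
df_b['region'] = df_b['ID'].map(lookup.set_index('ID')['region'])
print(df_b)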

Use Excel sheet to create dictionary in order to replace values

I have an Excel file with product names. The first row holds the categories (A1: Water, B1: Soft Drinks), and each cell below is a different product (A2: Sparkling, A3: Still, B2: Coca Cola, B3: Orange Juice, B4: Lemonade, etc.). I want to keep this list in a viewable format (not comma separated, etc.) since this makes it very easy for anybody to update the product names (a second person runs the script without understanding it).
If it helps, I can also have the Excel file in CSV format, and I can also move the categories from the top row to the first column.
I would like to replace the cells of a dataframe (df) with the product categories. For example, Coca Cola would become Soft Drinks. If a product is not in the Excel file it would not be replaced (e.g. Cookie).
print(df)
     Product  Quantity
0  Coca Cola      1234
1     Cookie         4
2      Still       333
3      Chips        88
3 Chips 88
Expected Outcome:
print(df1)

       Product  Quantity
0  Soft Drinks      1234
1       Cookie         4
2        Water       333
3       Snacks        88
Use DataFrame.melt with DataFrame.dropna (or DataFrame.stack) to build a helper Series, and then use Series.replace:
s = df1.melt().dropna().set_index('value')['variable']
Alternative:
s = df1.stack().reset_index(name='v').set_index('v')['level_1']
df['Product'] = df['Product'].replace(s)
#if performance is important
#df['Product'] = df['Product'].map(s).fillna(df['Product'])
print(df)

       Product  Quantity
0  Soft Drinks      1234
1       Cookie         4
2        Water       333
3       Snacks        88
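Putting the pieces together with the Excel file itself, a minimal end-to-end sketch (the filename is an assumption):

import pandas as pd

df = pd.DataFrame({'Product': ['Coca Cola', 'Cookie', 'Still', 'Chips'],
                   'Quantity': [1234, 4, 333, 88]})

# Hypothetical file: one category per column header, products in the cells below.
df1 = pd.read_excel('Product Categories.xlsx')

# Build a product -> category Series; dropna skips the padding in shorter columns.
s = df1.melt().dropna().set_index('value')['variable']

# Replace known products with their category, leave everything else untouched.
df['Product'] = df['Product'].map(s).fillna(df['Product'])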

Problematic values

I have a dataframe with a 'Trousers' column containing many different types of trousers. Most of the trousers start with their type, for instance: Jeans- Replay-blue, Chino- Uniqlo-~, or Smart-Next-~. Others have a type followed by a longer name (2 or 3 words).
What I want is to loop through that column and change the values to just Jeans if Jeans is in the cell, or Chino if Chino is in the cell, and so on, so I can easily group them.
How can I achieve that with a for loop?
It seems you need split and then to select the first value of each list with str[0]:
df['type'] = df['Trousers'].str.split('-').str[0]
Sample:
df = pd.DataFrame({'Trousers':['Jeans- Replay-blue','Chino- Uniqlo-~','Smart-Next-~']})
print(df)

             Trousers
0  Jeans- Replay-blue
1     Chino- Uniqlo-~
2        Smart-Next-~
df['type'] = df['Trousers'].str.split('-').str[0]
print(df)

             Trousers   type
0  Jeans- Replay-blue  Jeans
1     Chino- Uniqlo-~  Chino
2        Smart-Next-~  Smart
df['Trousers'] = df['Trousers'].str.split('-').str[0]
print(df)

  Trousers
0    Jeans
1    Chino
2    Smart
Another solution with extract:
df['Trousers'] = df['Trousers'].str.extract('([a-zA-Z]+)-', expand=False)
print(df)

  Trousers
0    Jeans
1    Chino
2    Smart
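If the type is not always the first token, an extract against a fixed keyword list (the list below is an assumption) is a contains-style alternative:

# Hypothetical list of known types; first match wins, non-matches become NaN.
types = ['Jeans', 'Chino', 'Smart']
df['type'] = df['Trousers'].str.extract('(' + '|'.join(types) + ')', expand=False)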
