I have two dataframes. The first one:
df1:
product price
0 apples 1.99
1 bananas 1.20
2 oranges 1.49
3 lemons 0.5
4 Olive Oil 8.99
df2:
product product.1 product.2
0 apples bananas Olive Oil
1 bananas lemons oranges
2 Olive Oil bananas oranges
3 lemons apples bananas
I want a column in the second dataframe that is the sum of the prices, based on the price of each item in the first dataframe. So the desired outcome would be:
product product.1 product.2 total_price
0 apples bananas Olive Oil 12.18
1 bananas lemons oranges 3.19
2 Olive Oil bananas oranges 11.68
3 lemons apples bananas 3.69
What is the best way to accomplish this? I have tried merging the dataframes on the product name for each of the columns in df2, but this seems time-consuming, especially as df1 gets more rows and df2 gets more columns.
df = pd.merge(df1, df2, how='right', left_on='product', right_on='product')
df = pd.merge(df1, df2, how='right', left_on='product', right_on='product.1')
df = pd.merge(df1, df2, how='right', left_on='product', right_on='product.2')
df['Total_Price'] = df['price']+df['price.1']+df['price.2']
You can try something like the following:
First, convert df1 to a dictionary mapping each product to its price.
Then use that dictionary with applymap, followed by a row-wise sum.
The snippet below does exactly that:
price_map = {product: price for product, price in df1.values}
# map every cell to its price, then sum across each row
# (note: this adds the column to the existing df2 rather than creating a new dataframe)
df2['Total_Price'] = df2.applymap(lambda value: price_map[value]).sum(axis=1)
The result is then df2:
product product.1 product.2 Total_Price
0 apples bananas Olive Oil 12.18
1 bananas lemons oranges 3.19
2 Olive Oil bananas oranges 11.68
3 lemons apples bananas 3.69
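For reference, here is a self-contained version of the same idea. Note that DataFrame.applymap was renamed to DataFrame.map in pandas 2.1, so on newer versions you may prefer the latter. A minimal sketch, using the data from the question:
import pandas as pd

df1 = pd.DataFrame({'product': ['apples', 'bananas', 'oranges', 'lemons', 'Olive Oil'],
                    'price': [1.99, 1.20, 1.49, 0.5, 8.99]})
df2 = pd.DataFrame({'product': ['apples', 'bananas', 'Olive Oil', 'lemons'],
                    'product.1': ['bananas', 'lemons', 'bananas', 'apples'],
                    'product.2': ['Olive Oil', 'oranges', 'oranges', 'bananas']})

price_map = dict(zip(df1['product'], df1['price']))
# replace every product name with its price, then sum across each row
df2['Total_Price'] = df2.apply(lambda col: col.map(price_map)).sum(axis=1)
print(df2)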
I have a dataframe with information where the rows are not related to each other:
Fruits Vegetables Protein
1 Apple Spinach Beef
2 Banana Cucumber Chicken
3 Pear Carrot Pork
I essentially just want to create a pandas Series with all of that information. I want it to look like this:
All Foods
1 Apple
2 Banana
3 Pear
4 Spinach
5 Cucumber
6 Carrot
7 Beef
8 Chicken
9 Pork
How can I do this in pandas?
Dump into numpy and create a new dataframe:
# ravel in Fortran (column-major) order so each column's values stay grouped together
out = df.to_numpy().ravel(order='F')
pd.DataFrame({'All Foods': out})
All Foods
0 Apple
1 Banana
2 Pear
3 Spinach
4 Cucumber
5 Carrot
6 Beef
7 Chicken
8 Pork
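If you want a Series with the 1-based index shown in the question rather than a DataFrame, a small follow-up sketch reusing the out array from above:
all_foods = pd.Series(out, index=range(1, len(out) + 1), name='All Foods')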
Just pd.concat them together (and reset the index):
all_foods = pd.concat([foods[col] for col in foods.columns]).reset_index(drop=True)
You can unstack the dataframe to get the values and then create a df/series:
df = pd.DataFrame({'Fruits': ['Apple', 'Banana', 'Pear'], 'Vegetables': ['Spinach', 'Cucumber', 'Carrot'], 'Protein': ['Beef', 'Chicken', 'Pork']})
pd.DataFrame({'All Foods' : df.unstack().values})
This should help:
import pandas as pd
# Loading file with fruits, vegetables and protein
dataset = pd.read_csv('/fruit.csv')
# This is where you should apply your code
# Unpivoting (creating one column out of 3 columns)
df_unpivot = pd.melt(dataset, value_vars=['Fruits', 'Vegetables', 'Protein'])
# Renaming column from value to All Foods
df_finalized = df_unpivot.rename(columns={'value': 'All Foods'})
# Printing out "All Foods" column
print(df_finalized["All Foods"])
I have a table like this
Name Type Food Variant and Price
A Cake {'Choco': 100, 'Cheese': 100, 'Mix': 125}
B Drinks {'Cola': 25, 'Milk': 35}
C Side dish {'French Fries': 20}
D Bread {None: 10}
I want to split the keys and values of the dictionaries in the Variant and Price column into two separate columns, but I am still confused. Here is the output that I want:
Name Type Food Variant Price
A Cake Choco 100
A Cake Cheese 100
A Cake Mix 125
B Drinks Cola 25
B Drinks Milk 35
C Side dish French Fries 20
D Bread NaN 10
Can anyone help me to figure it out?
Create a list of tuples, then use DataFrame.explode, and finally split into 2 columns:
df['Variant and Price'] = df['Variant and Price'].apply(lambda x: list(x.items()))
df = df.explode('Variant and Price').reset_index(drop=True)
df[['Variant','Price']] = df.pop('Variant and Price').to_numpy().tolist()
print(df)
Name Type Food Variant Price
0 A Cake Choco 100
1 A Cake Cheese 100
2 A Cake Mix 125
3 B Drinks Cola 25
4 B Drinks Milk 35
5 C Side dish French Fries 20
6 D Bread None 10
Or create 2 columns and then use DataFrame.explode:
df['Variant'] = df['Variant and Price'].apply(lambda x: list(x.keys()))
df['Price'] = df.pop('Variant and Price').apply(lambda x: list(x.values()))
df = df.explode(['Variant', 'Price']).reset_index(drop=True)
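For completeness, a self-contained version of this second approach (column names taken from the question; note that passing a list of columns to DataFrame.explode requires pandas 1.3 or newer):
import pandas as pd

df = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Type Food': ['Cake', 'Drinks', 'Side dish', 'Bread'],
    'Variant and Price': [{'Choco': 100, 'Cheese': 100, 'Mix': 125},
                          {'Cola': 25, 'Milk': 35},
                          {'French Fries': 20},
                          {None: 10}],
})

# one list of keys and one list of values per row, then explode both in lockstep
df['Variant'] = df['Variant and Price'].apply(lambda x: list(x.keys()))
df['Price'] = df.pop('Variant and Price').apply(lambda x: list(x.values()))
df = df.explode(['Variant', 'Price']).reset_index(drop=True)
print(df)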
In a pandas Dataframe df I have columns likes this:
NAME KEYWORD AMOUNT INFO
0 orange fruit 13 from italy
1 potato veggie 7 from germany
2 potato veggie 9 from germany
3 orange fruit 8 from italy
4 potato veggie 6 from germany
With a groupby on KEYWORD, I want to build the sum of the AMOUNT values per group and always keep the first value from the other columns, so that the result reads:
NAME KEYWORD AMOUNT INFO
0 orange fruit 21 from italy
1 potato veggie 22 from germany
I tried
df.groupby('KEYWORD').sum()
but this "summarises" over all columns, i.e. I get
NAME KEYWORD AMOUNT INFO
0 orangeorange fruit 21 from italyfrom italy
1 potatopotatopotato veggie 22 from germanyfrom germanyfrom germany
Then I tried to use different functions for different columns:
df.groupby('KEYWORD').agg({'AMOUNT': sum, 'NAME': first, ....})
with
def first(f_arg, *args):
    return f_arg
But unfortunately this gives me a "ValueError: function does not reduce" error.
So I am a bit at a loss. How can I apply sum only to the AMOUNT column, while keeping the others?
Use groupby + agg with a custom aggfunc dict.
f = dict.fromkeys(df.columns.difference(['KEYWORD']), 'first')
f['AMOUNT'] = sum
df = df.groupby('KEYWORD', as_index=False).agg(f)
df
KEYWORD NAME AMOUNT INFO
0 fruit orange 21 from italy
1 veggie potato 22 from germany
dict.fromkeys gives me a nice way of generalising this to any number of columns. If column order matters, add a reindex operation at the end:
df = df.groupby('KEYWORD', as_index=False).agg(f).reindex(columns=df.columns)
df
NAME KEYWORD AMOUNT INFO
0 orange fruit 21 from italy
1 potato veggie 22 from germany
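On pandas 0.25 and later, named aggregation expresses the same thing a bit more explicitly (a sketch, assuming the column names above):
out = df.groupby('KEYWORD', as_index=False).agg(
    NAME=('NAME', 'first'),
    AMOUNT=('AMOUNT', 'sum'),
    INFO=('INFO', 'first'),
)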
Use drop_duplicates on column KEYWORD and then assign the aggregated values:
df = df.drop_duplicates('KEYWORD').assign(AMOUNT=df.groupby('KEYWORD')['AMOUNT'].sum().values)
print(df)
NAME KEYWORD AMOUNT INFO
0 orange fruit 21 from italy
1 potato veggie 22 from germany
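One caveat: the .values assignment aligns the group sums with the de-duplicated rows purely by position, which works here only because the sorted group order ('fruit', 'veggie') happens to match the first-occurrence order of KEYWORD. Mapping by key is a safer sketch:
sums = df.groupby('KEYWORD')['AMOUNT'].sum()
out = df.drop_duplicates('KEYWORD').assign(AMOUNT=lambda d: d['KEYWORD'].map(sums))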
This question is related to another question I had posted.
Pandas - check if a string column in one dataframe contains a pair of strings from another dataframe
My goal is to check whether two different columns of a dataframe contain a pair of string values and, if the condition is met, extract one of the values.
I have two dataframes like this:
df1 = pd.DataFrame({'consumption':['squirrelate apple', 'monkey likesapple',
'monkey banana gets', 'badger/getsbanana', 'giraffe eats grass', 'badger apple.loves', 'elephant is huge', 'elephant/eats/', 'squirrel.digsingrass'],
'name': ['apple', 'appleisred', 'banana is tropical', 'banana is soft', 'lemon is sour', 'washington apples', 'kiwi', 'bananas', 'apples']})
df2 = pd.DataFrame({'food':['apple', 'apple', 'banana', 'banana'], 'creature':['squirrel', 'badger', 'monkey', 'elephant']})
In [187]:df1
Out[187]:
consumption name
0 squirrelate apple apple
1 monkey likesapple appleisred
2 monkey banana gets banana is tropical
3 badger/getsbanana banana is soft
4 giraffe eats grass lemon is sour
5 badger apple.loves washington apples
6 elephant is huge kiwi
7 elephant/eats/ bananas
8 squirrel.digsingrass apples
In[188]: df2
Out[188]:
creature food
0 squirrel apple
1 badger apple
2 monkey banana
3 elephant banana
What I want to do is test if 'apple' occurs in df1['name'] and 'squirrel' occurs in df1['consumption'], and if both conditions are met, extract 'squirrel' from df1['consumption'] into a new column df1['creature']. The result should look like:
Out[189]:
consumption creature name
0 squirrelate apple squirrel apple
1 monkey likesapple NaN appleisred
2 monkey banana gets monkey banana is tropical
3 badger/getsbanana NaN banana is soft
4 giraffe eats grass NaN lemon is sour
5 badger apple.loves badger washington apples
6 elephant is huge NaN kiwi
7 elephant/eats/ elephant bananas
8 squirrel.digsingrass NaN apples
If there were no paired-value constraint, I could have done something simple like:
np.where((df1['consumption'].str.contains(<creature_string>, case = False)) & (df1['name'].str.contains(<food_string>, case = False)), df['consumption'].str.extract(<creature_string>), np.nan)
but I must check for pairs, so I tried to make a dictionary with foods as keys and creatures as values, then build a string of all the creatures for a given food key and look for those using str.contains:
unique_food = df2.food.unique()
food_dict = {elem: pd.DataFrame for elem in unique_food}
for key in food_dict.keys():
    food_dict[key] = df2[:][df2.food == key]
# create key:value pairs of food key and creature strings
food_strings = {}
for key, values in food_dict.items():
    food_strings.update({key: '|'.join(map(str, list(food_dict[key]['creature'].unique())))})
In[199]: food_strings
Out[199]: {'apple': 'squirrel|badger', 'banana': 'monkey|elephant'}
The problem is when I try to now apply str.contains:
for key, value in food_strings.items():
    np.where((df1['name'].str.contains('(' + food_strings[key] + ')', case=False)) &
             (df1['consumption'].str.contains('(' + food_strings[value] + ')', case=False)),
             df1['consumption'].str.extract('(' + food_strings[value] + ')'), np.nan)
I get a KeyError:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-62-7ab718066040> in <module>()
1 for key, value in food_strings.items():
2 np.where((df1['name'].str.contains('('+food_strings[key]+')', case = False)) &
----> 3 (df1['consumption'].str.contains('('+food_strings[value]+')', case = False)), df1['consumption'].str.extract('('+food_strings[value]+')'), np.nan)
KeyError: 'squirrel|badger'
When I just try for only the value and not the key, it works for the first key:value pair but not the second:
for key in food_strings.keys():
    df1['test'] = np.where(df1['consumption'].str.contains('(' + food_strings[key] + ')', case=False),
                           df1['consumption'].str.extract('(' + food_strings[key] + ')', expand=False),
                           np.nan)
df1
Out[196]:
consumption name test
0 squirrelate apple apple squirrel
1 monkey likesapple appleisred NaN
2 monkey banana gets banana is tropical NaN
3 badger/getsbanana banana is soft badger
4 giraffe eats grass lemon is sour NaN
5 badger apple.loves washington apples badger
6 elephant is huge kiwi NaN
7 elephant/eats/ bananas NaN
8 squirrel.digsingrass apples squirrel
I got the ones matching apple and squirrel|badger but missed banana:monkey|elephant.
Can someone please help?
d1 = df1.dropna()
d2 = df2.dropna()
sump = d1.consumption.values.tolist()
name = d1.name.values.tolist()
cret = d2.creature.values.tolist()
food = d2.food.values.tolist()
# check[i, j] is True when creature j appears in consumption i
# and its paired food j appears in name i
check = np.array(
    [
        [c in s and f in n for c, f in zip(cret, food)]
        for s, n in zip(sump, name)
    ]
)
# create a new series with the index of `d1` where we dropped na
# then reindex with `df1.index` prior to `assign`
# matrix-multiply the boolean mask with the creature names: a row with a
# single True picks out that creature; a row with none produces ''
test = pd.Series(check.dot(d2[['creature']].values).ravel(), d1.index)
test = test.reindex(df1.index, fill_value='')
df1.assign(test=test)
consumption name test
0 squirrelate apple apple squirrel
1 monkey likesapple appleisred
2 monkey banana gets banana is tropical monkey
3 badger/getsbanana banana is soft
4 giraffe eats grass lemon is sour
5 badger apple.loves washington apples badger
6 elephant is huge kiwi
7 elephant/eats/ bananas elephant
8 squirrel.digsingrass apples squirrel
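The dot-product trick is compact but a little opaque. A more explicit (if slower) row-wise sketch of the same pairing logic, treating df2 as a list of (creature, food) pairs:
def find_creature(consumption, name):
    # return the first creature whose paired food also appears in the name, else ''
    for creature, food in zip(df2['creature'], df2['food']):
        if creature in consumption and food in name:
            return creature
    return ''

df1['test'] = [find_creature(c, n) for c, n in zip(df1['consumption'], df1['name'])]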
I have 2 pandas DataFrames, this one:
item inStock description
Apples 10 a juicy treat
Oranges 34 mediocre at best
Bananas 21 can be used as phone prop
<...many other fruits...>
Kiwi 0 too fuzzy
and a lookup table with only a subset of the items above:
item Price
Apples 1.99
Oranges 6.99
I would like to scan through the first table and fill in a price column for the DataFrame when the fruit in the first DataFrame matches the fruit in the second:
item inStock description Price
Apples 10 a juicy treat 1.99
Oranges 34 mediocre at best 6.99
Bananas 21 can be used as phone prop
<...many other fruits...>
Kiwi 0 too fuzzy
I've looked at examples with the built-in lookup function, as well as using a where-in type function, but I cannot seem to get the syntax to work. Can someone help me out?
import pandas as pd
df_item = pd.read_csv('Item.txt')
df_price = pd.read_csv('Price.txt')
# a left join keeps every row of df_item; Price is NaN where there is no match
df_final = pd.merge(df_item, df_price, on='item', how='left')
print(df_final)
Output:
item inStock description Price
0 Apples 10 a juicy treat 1.99
1 Oranges 34 mediocre at best 6.99
2 Bananas 21 can be used as phone prop NaN
3 Kiwi 0 too fuzzy NaN
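An alternative sketch that avoids the merge, mapping prices directly onto the existing frame (assumes the column names shown above):
# build a Series indexed by item, then map it onto the item column
price_map = df_price.set_index('item')['Price']
df_item['Price'] = df_item['item'].map(price_map)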