Pandas - check if dataframe columns contain key:value pairs from a dictionary - python

This question is related to another question I had posted.
Pandas - check if a string column in one dataframe contains a pair of strings from another dataframe
My goal is to check if two different columns of a dataframe contain a pair of string values and if the condition is met, then extract one of the values.
I have two dataframes like this:
df1 = pd.DataFrame({'consumption':['squirrelate apple', 'monkey likesapple',
'monkey banana gets', 'badger/getsbanana', 'giraffe eats grass', 'badger apple.loves', 'elephant is huge', 'elephant/eats/', 'squirrel.digsingrass'],
'name': ['apple', 'appleisred', 'banana is tropical', 'banana is soft', 'lemon is sour', 'washington apples', 'kiwi', 'bananas', 'apples']})
df2 = pd.DataFrame({'food':['apple', 'apple', 'banana', 'banana'], 'creature':['squirrel', 'badger', 'monkey', 'elephant']})
In [187]:df1
Out[187]:
consumption name
0 squirrelate apple apple
1 monkey likesapple appleisred
2 monkey banana gets banana is tropical
3 badger/getsbanana banana is soft
4 giraffe eats grass lemon is sour
5 badger apple.loves washington apples
6 elephant is huge kiwi
7 elephant/eats/ bananas
8 squirrel.digsingrass apples
In[188]: df2
Out[188]:
creature food
0 squirrel apple
1 badger apple
2 monkey banana
3 elephant banana
What I want to do is test if 'apple' occurs in df1['name'] and 'squirrel' occurs in df1['consumption'] and if both conditions are met then extract 'squirrel' from df1['consumption'] into a new column df['creature']. The result should look like:
Out[189]:
consumption creature name
0 squirrelate apple squirrel apple
1 monkey likesapple NaN appleisred
2 monkey banana gets monkey banana is tropical
3 badger/getsbanana NaN banana is soft
4 giraffe eats grass NaN lemon is sour
5 badger apple.loves badger washington apples
6 elephant is huge NaN kiwi
7 elephant/eats/ elephant bananas
8 squirrel.digsingrass NaN apples
If there was no paired value constraint, I could have done something simple like :
np.where((df1['consumption'].str.contains(<creature_string>, case = False)) & (df1['name'].str.contains(<food_string>, case = False)), df['consumption'].str.extract(<creature_string>), np.nan)
but I must check for pairs so I tried to make a dictionary of food as keys and creatures as values , then make a string var of all the creatures for a given food key and look for those using str.contains :
unique_food = df2.food.unique()
food_dict = {elem : pd.DataFrame for elem in unique_food}
for key in food_dict.keys():
food_dict[key] = df2[:][df2.food == key]
# create key:value pairs of food key and creature strings
food_strings = {}
for key, values in food_dict.items():
food_strings.update({key: '|'.join(map(str, list(food_dict[key]['creature'].unique())))})
In[199]: food_strings
Out[199]: {'apple': 'squirrel|badger', 'banana': 'monkey|elephant'}
The problem is when I try to now apply str.contains:
for key, value in food_strings.items():
np.where((df1['name'].str.contains('('+food_strings[key]+')', case = False)) &
(df1['consumption'].str.contains('('+food_strings[value]+')', case = False)), df1['consumptions'].str.extract('('+food_strings[value]+')'), np.nan)
I get a KeyError: .
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-62-7ab718066040> in <module>()
1 for key, value in food_strings.items():
2 np.where((df1['name'].str.contains('('+food_strings[key]+')', case = False)) &
----> 3 (df1['consumption'].str.contains('('+food_strings[value]+')', case = False)), df1['consumption'].str.extract('('+food_strings[value]+')'), np.nan)
KeyError: 'squirrel|badger'
When I just try for only the value and not the key, it works for the first key:value pair but not the second:
for key in food_strings.keys():
df1['test'] = np.where(df1['consumption'].str.contains('('+food_strings[key]+')', case =False),
df1['consumption'].str.extract('('+food_strings[key]+')', expand=False),
np.nan)
df1
Out[196]:
consumption name test
0 squirrelate apple apple squirrel
1 monkey likesapple appleisred NaN
2 monkey banana gets banana is tropical NaN
3 badger/getsbanana banana is soft badger
4 giraffe eats grass lemon is sour NaN
5 badger apple.loves washington apples badger
6 elephant is huge kiwi NaN
7 elephant/eats/ bananas NaN
8 squirrel.digsingrass apples squirrel
I got the ones matching apple and squirrel|badger but missed banana:monkey|elephant.
can someone please help?

d1 = df1.dropna()
d2 = df2.dropna()
sump = d1.consumption.values.tolist()
name = d1.name.values.tolist()
cret = d2.creature.values.tolist()
food = d2.food.values.tolist()
check = np.array(
[
[c in s and f in n for c, f in zip(cret, food)]
for s, n in zip(sump, name)
]
)
# create a new series with the index of `d1` where we dropped na
# then reindex with `df1.index` prior to `assign`
test = pd.Series(check.dot(d2[['creature']].values).ravel(), d1.index)
test = test.reindex(df1.index, fill_value='')
df1.assign(test=test)
consumption name test
0 squirrelate apple apple squirrel
1 monkey likesapple appleisred
2 monkey banana gets banana is tropical monkey
3 badger/getsbanana banana is soft
4 giraffe eats grass lemon is sour
5 badger apple.loves washington apples badger
6 elephant is huge kiwi
7 elephant/eats/ bananas elephant
8 squirrel.digsingrass apples squirrel

Related

how to combine everything in a pandas dataframe into another dataframe

I have a dataframe with information, where the rows are not related to eachother:
Fruits Vegetables Protein
1 Apple Spinach Beef
2 Banana Cucumber Chicken
3 Pear Carrot Pork
I essentially just want to create a pandas series with all of that information, I want it to look like this:
All Foods
1 Apple
2 Banana
3 Pear
4 Spinach
5 Cucumber
6 Carrot
7 Beef
8 Chicken
9 Pork
How can I do this in pandas?
Dump into numpy and create a new dataframe:
out = df.to_numpy().ravel(order='F')
pd.DataFrame({'All Foods' : out})
All Foods
0 Apple
1 Banana
2 Pear
3 Spinach
4 Cucumber
5 Carrot
6 Beef
7 Chicken
8 Pork
Just pd.concat them together (and reset the index).
all_foods = pd.concat([foods[col] for col in foods.columns])
You can unstack the dataframe to get the values and then create a df/series:
df = pd.DataFrame({'Fruits':['Apple','Banana', 'Pear'], 'Vegetables':['Spinach', 'Carrot', 'Cucumber'], 'Protein':['Beef', 'Chicken', 'Pork']})
pd.DataFrame({'All Foods' : df.unstack().values})
This should help:
import pandas as pd
# Loading file with fruits, vegetables and protein
dataset = pd.read_csv('/fruit.csv')
# This is where you should apply your code
# Unpivoting (creating one column out of 3 columns)
df_unpivot = pd.melt(dataset, value_vars=['Fruits', 'Vegetables', 'Protein'])
# Renaming column from value to All Foods
df_finalized = df_unpivot.rename(columns={'value': 'All Foods'})
# Printing out "All Foods" column
print(df_finalized["All Foods"])

python Pandas: VLOOKUP multiple cells on column

I'm struggling with next task: I would like to identify using pandas (or any other tool on python) if any of multiple cells (Fruit 1 through Fruit 3) in each row from Table 2 contains in column Fruits of Table1. And at the end obtain "Contains Fruits Table 2?" table.
Fruits
apple
orange
grape
melon
Name
Fruit 1
Fruit 2
Fruit 3
Contains Fruits Table 2?
Mike
apple
Yes
Bob
peach
pear
orange
Yes
Jack
banana
No
Rob
peach
banana
No
Rita
apple
orange
banana
Yes
Fruits in Table 2 can be up to 40 columns. Number of rows in Table1 is about 300.
I hope it is understandable, and someone can help me resolve this.
I really appreciate the support in advance!
Try:
filter DataFrame to include columns that contain the word "Fruit"
Use isin to check if the values are in table1["Fruits"]
Return True if any of fruits are found
map True/False to "Yes"/"No"
table2["Contains Fruits Table 2"] = table2.filter(like="Fruit")
.isin(table1["Fruits"].tolist())
.any(axis=1)
.map({True: "Yes", False: "No"})
>>> table2
Name Fruit 1 Fruit 2 Fruit 3 Contains Fruits Table 2
0 Mike apple None None Yes
1 Bob peach pear orange Yes
2 Jack banana None None No
3 Rob peach banana None No
4 Rita apple orange banana Yes
​~~~

Using pandas, how can I sort a table on all values that contains a string element from a list of string elements?

I have a list of strings looking like this:
strings = ['apple', 'pear', 'grapefruit']
and I have a data frame containing id and text values like this:
id
value
1
The grapefruit is delicious! But the pear tastes awful.
2
I am a big fan og apple products
3
The quick brown fox jumps over the lazy dog
4
An apple a day keeps the doctor away
Using pandas I would like to create a filter which will give me only the id and values for those rows, which contain one or more of the values together with a column, showing which values are contained in the string, like this:
id
value
value contains substrings:
1
The grapefruit is delicious! But the pear tastes awful.
grapefruit, pear
2
I am a big fan og apple products
apple
4
An apple a day keeps the doctor away
apple
How would I write this using pandas?
Use .str.findall:
df['fruits'] = df['value'].str.findall('|'.join(strings)).str.join(', ')
df[df.fruits != '']
id value fruits
0 1 The grapefruit is delicious! But the pear tast... grapefruit, pear
1 2 I am a big fan og apple products apple
3 4 An apple a day keeps the doctor away apple

Eliminate Column Repetition in Pandas Dataframe

I have a data frames where I am trying to find all possible combinations of itself and a fraction of itself. The following data frames is a much scaled down version of the one I am running. The first data frame (fruit1) is a fraction of the second data frame (fruit2).
FruitSubDF FruitFullDF
apple apple
cherry cherry
banana banana
peach
plum
By running the following code
df1 = pd.DataFrame(list(product(fruitDF.iloc[0:3,0], fruitDF.iloc[0:5,0])), columns=['fruit1', 'fruit2'])
the output is
Fruit1 Fruit2
0 apple banana
1 apple apple
2 apple cherry
3 apple peach
4 apple plum
5 cherry banana
6 cherry apple
7 cherry cherry
.
.
18 banana banana
19 banana peach
20 banana plum
My problem is I want to remove elements with the same two fruits regardless of which fruit is in which column as below. So I am considering (apple,cherry) and (cherry,apple) as the same but I am unsure of an efficient way instead of iterRows to weed out the unwanted data as most pandas functions I find will remove based on the order.
Fruit1 Fruit2
0 apple banana
1 apple cherry
2 apple apple
3 apple peach
4 apple plum
5 cherry banana
6 cherry cherry
.
.
15 banana plum
First, I created a piece of code to replicate your DataFrame. I took my code here :stack overflow
import pandas as pd
Fruit1=['apple', 'cherry', 'banana']
Fruit2=['banana', 'apple', 'cherry']
index = pd.MultiIndex.from_product([Fruit1, Fruit2], names = ["Fruit1", "Fruit2"])
df = pd.DataFrame(index = index).reset_index()
Then, you can use the lexicographial order to filter the dataframe.
df[df['Fruit1']<=df['Fruit2']]
I have the result you wanted to obtain.
EDIT : you edited your post but it seems to still do the job.
You can use itertools to achieve it -
import itertools
fruits = ['banana', 'cherry', 'apple']
pd.DataFrame((itertools.permutations(fruits, 2)), columns=['fruit1', 'fruit2'])

Can you create a flag field for tokens based off of a dataframe column of tokens?

I have a dataframe that looks like this:
Beverage Ingredients Ingredients_Tokens
Orange Juice Orange Juice Concentrate, Orange Pulp [orange, juice, concentrate, orange, pulp]
Root Beer Sugar, Water, Caramel Color [sugar, water, caramel, color]
... ... ...
Apple Juice INGREDIENTS: CONTAINS PURE FILTERED WATER, CONCENTRATED APPLE JUICE [pure, filtered, water, concentrated, apple, juice]
I want to take the ingredients_tokens field and create flag fields for each token that appears more than 20 times in the whole dataframe so that my final dataframe has all of the Beverages and whether they contain the tokens listed, like
Beverage Token_Orange Token_Sugar Token_Water ... Token_Apple
Orange_Juice 1 0 0 0
Root Beer 0 1 1 0
...
Apple Juice 0 0 1 1
I tried a loop that tried to create the Token variable and then store it, something like (47 is total number of tokens):
df=pd.DataFrame()
for i in range (0,47):
T['Token'] = T['Ingredients_Tokens'][i]
df = df.append([Q])
df = pd.DataFrame(df)
But am not sure where to go
One option if you're on one of the more recent versions of pandas is to use .explode:
In [167]: df
Out[167]:
thing ingredients
0 oj [orange, juice, pulp]
1 root beer [roots, beer]
In [168]: df.explode("ingredients").set_index("ingredients", append=True).unstack().notnull()
Out[168]:
thing
ingredients beer juice orange pulp roots
0 False True True True False
1 True False False False True

Categories

Resources