Eliminate Column Repetition in Pandas Dataframe - python

I have a data frame and I am trying to find all possible combinations of itself and a fraction of itself. The following data frames are a much scaled-down version of the ones I am running. The first data frame (fruit1) is a fraction of the second data frame (fruit2).
FruitSubDF  FruitFullDF
apple       apple
cherry      cherry
banana      banana
            peach
            plum
By running the following code
df1 = pd.DataFrame(list(product(fruitDF.iloc[0:3,0], fruitDF.iloc[0:5,0])), columns=['fruit1', 'fruit2'])
the output is
Fruit1 Fruit2
0 apple banana
1 apple apple
2 apple cherry
3 apple peach
4 apple plum
5 cherry banana
6 cherry apple
7 cherry cherry
.
.
18 banana banana
19 banana peach
20 banana plum
My problem is that I want to remove rows that contain the same two fruits regardless of which fruit is in which column, as shown below. That is, I consider (apple, cherry) and (cherry, apple) to be the same. I am unsure of an efficient way to weed out the unwanted rows without iterrows, as most pandas functions I find remove duplicates based on the order.
Fruit1 Fruit2
0 apple banana
1 apple cherry
2 apple apple
3 apple peach
4 apple plum
5 cherry banana
6 cherry cherry
.
.
15 banana plum

First, I wrote a piece of code to replicate your DataFrame, adapted from an answer on Stack Overflow.
import pandas as pd
Fruit1=['apple', 'cherry', 'banana']
Fruit2=['banana', 'apple', 'cherry']
index = pd.MultiIndex.from_product([Fruit1, Fruit2], names = ["Fruit1", "Fruit2"])
df = pd.DataFrame(index = index).reset_index()
Then you can use lexicographical order to filter the DataFrame.
df[df['Fruit1']<=df['Fruit2']]
This gives the result you wanted to obtain.
EDIT: you edited your post, but this still seems to do the job.
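The lexicographic filter relies on every dropped pair having a mirrored twin in the frame. That holds for the sample data, but not necessarily when Fruit1 is only a subset of Fruit2. A minimal sketch of an order-insensitive alternative, assuming the two lists from the question (the names `sub`, `full`, `key` and `unique_pairs` are illustrative): sort the two values within each row, then keep the first occurrence of each sorted pair.

```python
from itertools import product

import numpy as np
import pandas as pd

# Assumed inputs, copied from the question's scaled-down example
sub = ["apple", "cherry", "banana"]
full = ["apple", "cherry", "banana", "peach", "plum"]

df = pd.DataFrame(list(product(sub, full)), columns=["Fruit1", "Fruit2"])

# Sort the two values inside each row so that (cherry, apple) and
# (apple, cherry) produce the same key
key = pd.DataFrame(
    np.sort(df[["Fruit1", "Fruit2"]].to_numpy(), axis=1), index=df.index
)

# Keep only the first row for each unordered pair
unique_pairs = df[~key.duplicated()].reset_index(drop=True)
print(unique_pairs)
```

Because the key is built from the row's own values, a pair is dropped only when its mirror image really appeared earlier in the frame.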

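You can use itertools to achieve it. combinations_with_replacement (rather than permutations) treats (apple, cherry) and (cherry, apple) as the same pair and keeps same-fruit pairs such as (apple, apple):
import itertools
import pandas as pd
fruits = ['banana', 'cherry', 'apple']
pd.DataFrame(itertools.combinations_with_replacement(fruits, 2), columns=['fruit1', 'fruit2'])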

Related

how to combine everything in a pandas dataframe into another dataframe

I have a dataframe with information, where the rows are not related to each other:
Fruits Vegetables Protein
1 Apple Spinach Beef
2 Banana Cucumber Chicken
3 Pear Carrot Pork
I essentially just want to create a pandas series with all of that information. I want it to look like this:
All Foods
1 Apple
2 Banana
3 Pear
4 Spinach
5 Cucumber
6 Carrot
7 Beef
8 Chicken
9 Pork
How can I do this in pandas?
Dump into numpy and create a new dataframe:
out = df.to_numpy().ravel(order='F')
pd.DataFrame({'All Foods' : out})
All Foods
0 Apple
1 Banana
2 Pear
3 Spinach
4 Cucumber
5 Carrot
6 Beef
7 Chicken
8 Pork
Just pd.concat the columns together and reset the index (foods is the dataframe here):
all_foods = pd.concat([foods[col] for col in foods.columns], ignore_index=True)
You can unstack the dataframe to get the values and then create a df/series:
df = pd.DataFrame({'Fruits':['Apple','Banana', 'Pear'], 'Vegetables':['Spinach', 'Cucumber', 'Carrot'], 'Protein':['Beef', 'Chicken', 'Pork']})
pd.DataFrame({'All Foods' : df.unstack().values})
This should help:
import pandas as pd
# Loading file with fruits, vegetables and protein
dataset = pd.read_csv('/fruit.csv')
# This is where you should apply your code
# Unpivoting (creating one column out of 3 columns)
df_unpivot = pd.melt(dataset, value_vars=['Fruits', 'Vegetables', 'Protein'])
# Renaming column from value to All Foods
df_finalized = df_unpivot.rename(columns={'value': 'All Foods'})
# Printing out "All Foods" column
print(df_finalized["All Foods"])
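For reference, a self-contained sketch of the same unpivoting idea that does not need a CSV file. It assumes the frame from the question is built in memory, and the column name `Category` is an illustrative choice. Passing `value_name` to `melt` avoids the separate rename step.

```python
import pandas as pd

# Assumed input: the table from the question, built in memory
df = pd.DataFrame({
    "Fruits": ["Apple", "Banana", "Pear"],
    "Vegetables": ["Spinach", "Cucumber", "Carrot"],
    "Protein": ["Beef", "Chicken", "Pork"],
})

# Stack all columns into one; "Category" records the source column
long_df = df.melt(var_name="Category", value_name="All Foods")

all_foods = long_df["All Foods"]
print(all_foods)
```

Keeping the `Category` column is optional, but it preserves which original column each food came from.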

python Pandas: VLOOKUP multiple cells on column

I'm struggling with the following task: using pandas (or any other Python tool), I would like to identify whether any of several cells (Fruit 1 through Fruit 3) in each row of Table 2 is contained in the Fruits column of Table 1, and in the end obtain the "Contains Fruits Table 2?" column.
Table 1:
Fruits
apple
orange
grape
melon
Table 2:
Name  Fruit 1  Fruit 2  Fruit 3  Contains Fruits Table 2?
Mike  apple                      Yes
Bob   peach    pear     orange   Yes
Jack  banana                     No
Rob   peach    banana            No
Rita  apple    orange   banana   Yes
Fruits in Table 2 can be up to 40 columns. Number of rows in Table1 is about 300.
I hope it is understandable, and someone can help me resolve this.
I really appreciate the support in advance!
Try:
Filter the DataFrame to include only the columns whose names contain "Fruit"
Use isin to check whether the values are in table1["Fruits"]
Return True if any of the fruits are found
Map True/False to "Yes"/"No"
table2["Contains Fruits Table 2"] = (
    table2.filter(like="Fruit")
    .isin(table1["Fruits"].tolist())
    .any(axis=1)
    .map({True: "Yes", False: "No"})
)
>>> table2
Name Fruit 1 Fruit 2 Fruit 3 Contains Fruits Table 2
0 Mike apple None None Yes
1 Bob peach pear orange Yes
2 Jack banana None None No
3 Rob peach banana None No
4 Rita apple orange banana Yes
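A self-contained sketch of the steps above, assuming the two tables are rebuilt by hand from the question (the intermediate names `fruit_cols` and `found` are illustrative):

```python
import pandas as pd

# Assumed inputs, reconstructed from the tables in the question
table1 = pd.DataFrame({"Fruits": ["apple", "orange", "grape", "melon"]})
table2 = pd.DataFrame({
    "Name": ["Mike", "Bob", "Jack", "Rob", "Rita"],
    "Fruit 1": ["apple", "peach", "banana", "peach", "apple"],
    "Fruit 2": [None, "pear", None, "banana", "orange"],
    "Fruit 3": [None, "orange", None, None, "banana"],
})

# Select every column whose name contains "Fruit"
fruit_cols = table2.filter(like="Fruit")

# True for rows where at least one cell appears in table1["Fruits"]
found = fruit_cols.isin(table1["Fruits"].tolist()).any(axis=1)

table2["Contains Fruits Table 2"] = found.map({True: "Yes", False: "No"})
print(table2)
```

Note that the result column's own name contains "Fruit", so `filter(like="Fruit")` would also pick it up if the code is run a second time on the same frame. Selecting the columns before adding the result, as here, sidesteps that.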

Using pandas, how can I sort a table on all values that contains a string element from a list of string elements?

I have a list of strings looking like this:
strings = ['apple', 'pear', 'grapefruit']
and I have a data frame containing id and text values like this:
id  value
1   The grapefruit is delicious! But the pear tastes awful.
2   I am a big fan og apple products
3   The quick brown fox jumps over the lazy dog
4   An apple a day keeps the doctor away
Using pandas, I would like to create a filter that returns only the id and value of those rows that contain one or more of the strings, together with a column showing which strings are contained in the value, like this:
id  value                                                    value contains substrings:
1   The grapefruit is delicious! But the pear tastes awful.  grapefruit, pear
2   I am a big fan og apple products                         apple
4   An apple a day keeps the doctor away                     apple
How would I write this using pandas?
Use .str.findall:
df['fruits'] = df['value'].str.findall('|'.join(strings)).str.join(', ')
df[df.fruits != '']
id value fruits
0 1 The grapefruit is delicious! But the pear tast... grapefruit, pear
1 2 I am a big fan og apple products apple
3 4 An apple a day keeps the doctor away apple
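One caveat with joining the raw strings into a pattern: regex metacharacters in a search term are interpreted, and "apple" would also match inside "pineapple". A hedged sketch that escapes each term and adds word boundaries, assuming whole-word matching is what is wanted (the variable names are illustrative):

```python
import re

import pandas as pd

strings = ["apple", "pear", "grapefruit"]
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "value": [
        "The grapefruit is delicious! But the pear tastes awful.",
        "I am a big fan og apple products",
        "The quick brown fox jumps over the lazy dog",
        "An apple a day keeps the doctor away",
    ],
})

# Escape each term and require word boundaries on both sides
pattern = r"\b(?:" + "|".join(re.escape(s) for s in strings) + r")\b"

# findall returns a list of matches per row; join turns it into text
df["fruits"] = df["value"].str.findall(pattern).str.join(", ")

# Rows without any match end up with an empty string
result = df[df["fruits"] != ""]
print(result)
```

If substring matches are actually desired, dropping the two `\b` anchors restores the original behaviour while keeping the escaping.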

Find name(s) of highest-value columns in each pandas dataframe row--Including tied values

I have a dataframe that records the number and type of fruits owned by various people. I'd like to add a column that indicates the top fruit(s) for each person. If a person has 2+ top-ranking fruits (aka, a tie), I want a list (or tuple) of them all.
Input
For example, let's say my input is this dataframe:
# Create all the fruit data
data = [{'fruit0':'strawberry','fruit0_count':23,'fruit1':'orange','fruit1_count':4,'fruit2':'grape','fruit2_count':27},
{'fruit0':'apple','fruit0_count':45,'fruit1':'mango','fruit1_count':45,'fruit2':'orange','fruit2_count':12},
{'fruit0':'blueberry','fruit0_count':30,'fruit1':'grapefruit','fruit1_count':32,'fruit2':'cherry','fruit2_count':94},
{'fruit0':'pineapple','fruit0_count':4,'fruit1':'grape','fruit1_count':4,'fruit2':'lemon','fruit2_count':67}]
# Add people's names as an index
df = pd.DataFrame(data, index=['Shawn', 'Monica','Jamal','Tracy'])
# Print the dataframe
df
. . . which creates the input dataframe:
fruit0 fruit0_count fruit1 fruit1_count fruit2 fruit2_count
Shawn strawberry 23 orange 4 grape 27
Monica apple 45 mango 45 orange 12
Jamal blueberry 30 grapefruit 32 cherry 94
Tracy pineapple 4 grape 4 lemon 67
Target output
What I'd like to get is a new column that gives the name of the top fruit for each person. If the person has two (or more) fruits that tied for first, I'd like a list or a tuple of those fruits:
fruit0 fruit0_count fruit1 fruit1_count fruit2 fruit2_count top_fruit
Shawn strawberry 23 orange 4 grape 27 grape
Monica apple 45 mango 45 orange 12 (apple,mango)
Jamal blueberry 30 grapefruit 32 cherry 94 cherry
Tracy pineapple 4 grape 4 lemon 67 lemon
My attempt so far
The closest I've gotten is based on https://stackoverflow.com/a/38955365/6480859.
Problems:
If there is a tie for top fruit, it only captures one top fruit (Monica's top fruit is only apple.)
It's really complicated. Not really a problem, but if there is a more straightforward path, I'd like to learn it.
# List the columns that contain count numbers
cols = ['fruit0_count', 'fruit1_count', 'fruit2_count']
# Make a new dataframe with just those columns.
only_counts_df=pd.DataFrame()
only_counts_df[cols]=df[cols].copy()
# Indicate how many results you want. Note: If you increase
# this from 1, it gives you the #2, #3, etc. ranking -- it
# doesn't represent tied results.
nlargest = 1
# The next two lines are suggested from
# https://stackoverflow.com/a/38955365/6480859. I don't totally
# follow along . . .
order = np.argsort(-only_counts_df.values, axis=1)[:, :nlargest]
result = pd.DataFrame(only_counts_df.columns[order],
                      columns=['top{}'.format(i) for i in range(1, nlargest+1)],
                      index=only_counts_df.index)
# Join the results back to our original dataframe
result = df.join(result).copy()
# The dataframe now reports the name of the column that
# contains the top fruit. Convert this to the fruit name.
def id_fruit(row):
    if row['top1'] == 'fruit0_count':
        return row['fruit0']
    elif row['top1'] == 'fruit1_count':
        return row['fruit1']
    elif row['top1'] == 'fruit2_count':
        return row['fruit2']
    else:
        return "Failed"
result['top_fruit'] = result.apply(id_fruit,axis=1)
result = result.drop(['top1'], axis=1).copy()
result
. . . which outputs:
fruit0 fruit0_count fruit1 fruit1_count fruit2 fruit2_count top_fruit
Shawn strawberry 23 orange 4 grape 27 grape
Monica apple 45 mango 45 orange 12 apple
Jamal blueberry 30 grapefruit 32 cherry 94 cherry
Tracy pineapple 4 grape 4 lemon 67 lemon
Monica's top fruit should be apple and mango.
Any tips are welcome, thanks!
The idea is to split the even-positioned (name) and odd-positioned (count) columns into df1 and df2, compare df2 against its row-wise maximum to build a mask, filter df1 with DataFrame.where, and finally collect the non-missing values with apply:
df1 = df.iloc[:, ::2]
df2 = df.iloc[:, 1::2]
mask = df2.eq(df2.max(axis=1), axis=0)
df['top'] = df1.where(mask.to_numpy()).apply(lambda x: x.dropna().tolist(), axis=1)
print (df)
fruit0 fruit0_count fruit1 fruit1_count fruit2 \
Shawn strawberry 23 orange 4 grape
Monica apple 45 mango 45 orange
Jamal blueberry 30 grapefruit 32 cherry
Tracy pineapple 4 grape 4 lemon
fruit2_count top
Shawn 27 [grape]
Monica 12 [apple, mango]
Jamal 94 [cherry]
Tracy 67 [lemon]
Here's what I've come up with:
maxes = df[[f"fruit{i}_count" for i in range(3)]].max(axis=1)
# Compare each row with its own maximum (isin would match any row's maximum)
mask = df[[f"fruit{i}_count" for i in range(3)]].eq(maxes, axis=0)
df_masked = df[[f"fruit{i}" for i in range(3)]][
    mask.rename(lambda x: x.replace("_count", ""), axis=1)
]
df["top_fruit"] = df_masked.apply(lambda x: x.dropna().tolist(), axis=1)
This will return
fruit0 fruit0_count ... fruit2_count top_fruit
Shawn strawberry 23 ... 27 [grape]
Monica apple 45 ... 12 [apple, mango]
Jamal blueberry 30 ... 94 [cherry]
Tracy pineapple 4 ... 67 [lemon]
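The question asked for a tuple when fruits tie. A minimal sketch that builds the row-wise mask once and returns tuples, assuming the three fruit{i} / fruit{i}_count column pairs from the question (`name_cols`, `count_cols` and `is_top` are illustrative names):

```python
import pandas as pd

data = [
    {'fruit0': 'strawberry', 'fruit0_count': 23, 'fruit1': 'orange', 'fruit1_count': 4, 'fruit2': 'grape', 'fruit2_count': 27},
    {'fruit0': 'apple', 'fruit0_count': 45, 'fruit1': 'mango', 'fruit1_count': 45, 'fruit2': 'orange', 'fruit2_count': 12},
    {'fruit0': 'blueberry', 'fruit0_count': 30, 'fruit1': 'grapefruit', 'fruit1_count': 32, 'fruit2': 'cherry', 'fruit2_count': 94},
    {'fruit0': 'pineapple', 'fruit0_count': 4, 'fruit1': 'grape', 'fruit1_count': 4, 'fruit2': 'lemon', 'fruit2_count': 67},
]
df = pd.DataFrame(data, index=['Shawn', 'Monica', 'Jamal', 'Tracy'])

name_cols = [f"fruit{i}" for i in range(3)]
count_cols = [f"{c}_count" for c in name_cols]

# True where a count equals the maximum of its own row
counts = df[count_cols]
is_top = counts.eq(counts.max(axis=1), axis=0).to_numpy()

# Pick the fruit names at the positions flagged in each row
names = df[name_cols].to_numpy()
top = [tuple(row[mask]) for row, mask in zip(names, is_top)]

df["top_fruit"] = pd.Series(top, index=df.index)
print(df["top_fruit"])
```

Every entry is a tuple here, including single winners such as `('grape',)`. That keeps the column's type uniform, at the cost of differing slightly from the mixed string/tuple layout shown in the target output.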

Pandas - check if dataframe columns contain key:value pairs from a dictionary

This question is related to another question I had posted.
Pandas - check if a string column in one dataframe contains a pair of strings from another dataframe
My goal is to check if two different columns of a dataframe contain a pair of string values and if the condition is met, then extract one of the values.
I have two dataframes like this:
df1 = pd.DataFrame({'consumption':['squirrelate apple', 'monkey likesapple',
'monkey banana gets', 'badger/getsbanana', 'giraffe eats grass', 'badger apple.loves', 'elephant is huge', 'elephant/eats/', 'squirrel.digsingrass'],
'name': ['apple', 'appleisred', 'banana is tropical', 'banana is soft', 'lemon is sour', 'washington apples', 'kiwi', 'bananas', 'apples']})
df2 = pd.DataFrame({'food':['apple', 'apple', 'banana', 'banana'], 'creature':['squirrel', 'badger', 'monkey', 'elephant']})
In [187]:df1
Out[187]:
consumption name
0 squirrelate apple apple
1 monkey likesapple appleisred
2 monkey banana gets banana is tropical
3 badger/getsbanana banana is soft
4 giraffe eats grass lemon is sour
5 badger apple.loves washington apples
6 elephant is huge kiwi
7 elephant/eats/ bananas
8 squirrel.digsingrass apples
In[188]: df2
Out[188]:
creature food
0 squirrel apple
1 badger apple
2 monkey banana
3 elephant banana
What I want to do is test if 'apple' occurs in df1['name'] and 'squirrel' occurs in df1['consumption'] and if both conditions are met then extract 'squirrel' from df1['consumption'] into a new column df['creature']. The result should look like:
Out[189]:
consumption creature name
0 squirrelate apple squirrel apple
1 monkey likesapple NaN appleisred
2 monkey banana gets monkey banana is tropical
3 badger/getsbanana NaN banana is soft
4 giraffe eats grass NaN lemon is sour
5 badger apple.loves badger washington apples
6 elephant is huge NaN kiwi
7 elephant/eats/ elephant bananas
8 squirrel.digsingrass NaN apples
If there was no paired value constraint, I could have done something simple like :
np.where((df1['consumption'].str.contains(<creature_string>, case = False)) & (df1['name'].str.contains(<food_string>, case = False)), df['consumption'].str.extract(<creature_string>), np.nan)
but I must check for pairs so I tried to make a dictionary of food as keys and creatures as values , then make a string var of all the creatures for a given food key and look for those using str.contains :
unique_food = df2.food.unique()
food_dict = {elem : pd.DataFrame for elem in unique_food}
for key in food_dict.keys():
    food_dict[key] = df2[:][df2.food == key]
# create key:value pairs of food key and creature strings
food_strings = {}
for key, values in food_dict.items():
    food_strings.update({key: '|'.join(map(str, list(food_dict[key]['creature'].unique())))})
In[199]: food_strings
Out[199]: {'apple': 'squirrel|badger', 'banana': 'monkey|elephant'}
The problem is when I try to now apply str.contains:
for key, value in food_strings.items():
    np.where((df1['name'].str.contains('('+food_strings[key]+')', case = False)) &
             (df1['consumption'].str.contains('('+food_strings[value]+')', case = False)), df1['consumption'].str.extract('('+food_strings[value]+')'), np.nan)
I get a KeyError:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-62-7ab718066040> in <module>()
1 for key, value in food_strings.items():
2 np.where((df1['name'].str.contains('('+food_strings[key]+')', case = False)) &
----> 3 (df1['consumption'].str.contains('('+food_strings[value]+')', case = False)), df1['consumption'].str.extract('('+food_strings[value]+')'), np.nan)
KeyError: 'squirrel|badger'
When I just try for only the value and not the key, it works for the first key:value pair but not the second:
for key in food_strings.keys():
    df1['test'] = np.where(df1['consumption'].str.contains('('+food_strings[key]+')', case =False),
                           df1['consumption'].str.extract('('+food_strings[key]+')', expand=False),
                           np.nan)
df1
Out[196]:
consumption name test
0 squirrelate apple apple squirrel
1 monkey likesapple appleisred NaN
2 monkey banana gets banana is tropical NaN
3 badger/getsbanana banana is soft badger
4 giraffe eats grass lemon is sour NaN
5 badger apple.loves washington apples badger
6 elephant is huge kiwi NaN
7 elephant/eats/ bananas NaN
8 squirrel.digsingrass apples squirrel
I got the ones matching apple and squirrel|badger but missed banana:monkey|elephant.
Can someone please help?
d1 = df1.dropna()
d2 = df2.dropna()
sump = d1.consumption.values.tolist()
name = d1.name.values.tolist()
cret = d2.creature.values.tolist()
food = d2.food.values.tolist()
check = np.array(
    [
        [c in s and f in n for c, f in zip(cret, food)]
        for s, n in zip(sump, name)
    ]
)
# create a new series with the index of `d1` where we dropped na
# then reindex with `df1.index` prior to `assign`
test = pd.Series(check.dot(d2[['creature']].values).ravel(), d1.index)
test = test.reindex(df1.index, fill_value='')
df1.assign(test=test)
consumption name test
0 squirrelate apple apple squirrel
1 monkey likesapple appleisred
2 monkey banana gets banana is tropical monkey
3 badger/getsbanana banana is soft
4 giraffe eats grass lemon is sour
5 badger apple.loves washington apples badger
6 elephant is huge kiwi
7 elephant/eats/ bananas elephant
8 squirrel.digsingrass apples squirrel
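The same pairwise substring check can also be written as a plain row-wise function, which may be easier to follow than the boolean matrix product. This is a sketch under the assumption that plain substring matching is acceptable, as in the code above. `find_creature` and `pairs` are hypothetical names, and rows without a match get None rather than an empty string.

```python
import pandas as pd

df1 = pd.DataFrame({
    'consumption': ['squirrelate apple', 'monkey likesapple', 'monkey banana gets',
                    'badger/getsbanana', 'giraffe eats grass', 'badger apple.loves',
                    'elephant is huge', 'elephant/eats/', 'squirrel.digsingrass'],
    'name': ['apple', 'appleisred', 'banana is tropical', 'banana is soft',
             'lemon is sour', 'washington apples', 'kiwi', 'bananas', 'apples'],
})
df2 = pd.DataFrame({'food': ['apple', 'apple', 'banana', 'banana'],
                    'creature': ['squirrel', 'badger', 'monkey', 'elephant']})


def find_creature(row, pairs):
    """Return the first creature whose (creature, food) pair matches the row."""
    for creature, food in pairs:
        if creature in row['consumption'] and food in row['name']:
            return creature
    return None


# Each pair ties one creature to the food it must appear with
pairs = list(zip(df2['creature'], df2['food']))

df1['creature'] = df1.apply(find_creature, axis=1, pairs=pairs)
print(df1)
```

Like the matrix version, this reports squirrel for the last row, because 'apple' is a substring of 'apples'. Stricter matching would need word boundaries or tokenised text.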
