Pandas: Eliminate duplicate values based on the priority of entries

I have a pandas DataFrame like this:
FRUITS COLOURS
0 apple red
1 berry black
2 apple green
3 grapes green
4 apple black
5 grapes red
6 tomato black
7 tomato green
Keeping in mind the priority order of COLOURS (red > green > black), I want to eliminate all duplicate entries in FRUITS, keeping the row with the highest-priority colour for each fruit.
The desired output is:
FRUITS COLOURS
0 apple red
1 berry black
2 grapes red
3 tomato green

You can set the order by converting COLOURS to an ordered categorical, then sorting by it and dropping the duplicate FRUITS (the first row kept for each fruit is the one with the highest-priority colour):
df['COLOURS'] = pd.Categorical(df['COLOURS'], categories=['red','green','black'],ordered=True)
df.sort_values('COLOURS').drop_duplicates('FRUITS').sort_index()
FRUITS COLOURS
0 apple red
1 berry black
5 grapes red
7 tomato green
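If you would rather leave the COLOURS column as plain strings, a minimal sketch of an alternative is to map each colour to a numeric rank and pick the best-ranked row per fruit. This is a swapped-in technique, not the categorical approach above; the names priority, rank, best and result are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "FRUITS": ["apple", "berry", "apple", "grapes",
               "apple", "grapes", "tomato", "tomato"],
    "COLOURS": ["red", "black", "green", "green",
                "black", "red", "black", "green"],
})

# Lower rank means higher priority: red=0, green=1, black=2.
priority = ["red", "green", "black"]
rank = df["COLOURS"].map({colour: i for i, colour in enumerate(priority)})

# For each fruit, find the index label of its best-ranked row.
best = rank.groupby(df["FRUITS"]).idxmin()

# Select those rows in their original order and renumber from 0.
result = df.loc[best.sort_values()].reset_index(drop=True)
print(result)
```

The final reset_index(drop=True) renumbers the rows 0 to 3, matching the index shown in the desired output.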

Related

Finding combinations of values across multiple rows and columns in pandas dataframe

I am trying to create all possible sets of values from a pandas dataframe, taking 3 values from each row.
For example, if I consider the dataframe below:
0 1 2 3 4 5 6
Table Black Brown Red Blue Green Amber
Chair Magenta Turquoise White Orange Violet Pink
Window Indigo Yellow Cerulean Grey Peach Aqua
I want to generate all possible solution sets by taking 3 values from each of the first, second and third rows.
This is what I tried:
from itertools import product
uniques = [df[i].unique().tolist() for i in df.columns ]
pd.DataFrame(product(*uniques), columns = df.columns)
But this generates all combinations with all 6 columns like this:
0 1 2 3 4 5 6
Table Black Brown Red Blue Green Aqua
Table Black Brown Red Blue Green Violet
Here, all values of Row 1 remain the same except for the last value, and all combinations are generated like this.
But what I need is this:
0 1 2 3 4 5 6 7 8 9
Table Black Red Blue Magenta White Orange Yellow Peach Aqua
Here, the first three values are from Row 1, the second 3 values are from Row 2, and the last 3 values are from Row 3.
Similarly, I want to display all such sets, and store them in a new dataframe.
Any help will be appreciated.

df
###
0 1 2 3 4 5 6
0 Table Black Brown Red Blue Green Amber
1 Chair Magenta Turquoise White Orange Violet Pink
2 Window Indigo Yellow Cerulean Grey Peach Aqua
from itertools import product
import random
uniques = [df[i].unique().tolist() for i in df.columns]
rl = list(product(*uniques))
pd.DataFrame(random.choices(rl))
product() generates all the combinations from sets, but you want a random result from the combinations list.
0 1 2 3 4 5 6
0 Table Black Brown White Orange Peach Aqua
Supplement: counting combinations
Suppose there are 3 sets and you select one element from each set. How many combinations are there?
2 * 2 * 3 = 12
Let's check whether the total number of combinations really is 12.
list_of_lists = [['Yellow','Blue'],['Cat','Dog'],['Swim','Run','Sleep']]
combination = product(*list_of_lists)
combination_list = list(combination)
pd.DataFrame(combination_list)
###
0 1 2
0 Yellow Cat Swim
1 Yellow Cat Run
2 Yellow Cat Sleep
3 Yellow Dog Swim
4 Yellow Dog Run
5 Yellow Dog Sleep
6 Blue Cat Swim
7 Blue Cat Run
8 Blue Cat Sleep
9 Blue Dog Swim
10 Blue Dog Run
11 Blue Dog Sleep
Choosing one row at random from the table above then gives a single set drawn from the combinations.
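If the goal is instead to enumerate every set built from 3 values per row, one possible sketch combines itertools.combinations (within a row) with itertools.product (across rows). It assumes column 0 is only a label and that the 3 values are chosen from the remaining 6 columns of each row; the question does not state this explicitly, so treat it as an assumption. The names per_row, records and result are illustrative.

```python
from itertools import combinations, product

import pandas as pd

df = pd.DataFrame([
    ["Table", "Black", "Brown", "Red", "Blue", "Green", "Amber"],
    ["Chair", "Magenta", "Turquoise", "White", "Orange", "Violet", "Pink"],
    ["Window", "Indigo", "Yellow", "Cerulean", "Grey", "Peach", "Aqua"],
])

# Assumption: skip the label in column 0 and choose 3 of the other 6 values.
per_row = [list(combinations(row[1:], 3)) for row in df.values.tolist()]

# Take one 3-value pick from each row and flatten it into a 9-value record.
records = [sum(picks, ()) for picks in product(*per_row)]

result = pd.DataFrame(records)
print(result.shape)
```

Each row contributes C(6, 3) = 20 picks, so the result has 20 * 20 * 20 = 8000 rows of 9 values each.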

Pandas get group, how to get group by keywords out of a phrase?

So let's say I have the following:
Item Quantity
Blue Banana 3
Red Banana 4
Green Banana 1
Blue Apple 2
Orange Apple 6
I would like to grab all of the bananas and add them, no matter the color.
Or I would like to grab all Blue items, no matter the fruit type, and add them.

You can use a dictionary comprehension and str.contains:
words = ['banana', 'blue']
pd.Series({w: df.loc[df['Item'].str.contains(w, case=False), 'Quantity'].sum()
           for w in words})
output:
banana 8
blue 5
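For reference, here is a self-contained sketch of the same idea using the sample data from the question. The helper name total_for and the extra keyword "apple" are illustrative additions, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Item": ["Blue Banana", "Red Banana", "Green Banana",
             "Blue Apple", "Orange Apple"],
    "Quantity": [3, 4, 1, 2, 6],
})

def total_for(keyword):
    """Sum Quantity over rows whose Item contains keyword (case-insensitive)."""
    mask = df["Item"].str.contains(keyword, case=False)
    return int(df.loc[mask, "Quantity"].sum())

totals = {word: total_for(word) for word in ["banana", "blue", "apple"]}
print(totals)
```

Note that str.contains treats the keyword as a regular expression by default, so keywords containing characters such as "." or "+" would need escaping.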

How to create a DataFrame with random sample from list?

I have a list named list1
list1 = ['Banana','Apple','Pear','Strawberry','Muskmelon','Apricot','Peach','Plum','Cherry','Blackberry','Raspberry','Cranberry','Grapes','Greenapple','Kiwi','Watermelon','Orange','Lychee','Custardapples','Jackfruit','Pineapple','Mango']
I want to form a df with specific columns and random data from list1
Eg:
a b c d e f
0 Banana Orange Lychee Custardapples Jackfruit Pineapple
1 Apple Pear Strawberry Muskmelon Apricot Peach
2 Raspberry Cherry Plum Kiwi Mango Blackberry
A structure something like this but with random data from list1?
There can't be any duplicate/repeated values present.

If every item from the list can end up everywhere in the DataFrame you could write:
pd.DataFrame(np.random.choice(list1, 3*6, replace=False).reshape(3, 6), columns=list("abcdef"))
Out:
a b c d e f
0 Lychee Peach Apricot Pear Plum Grapes
1 Cherry Jackfruit Blackberry Cranberry Kiwi Apple
2 Orange Greenapple Watermelon Banana Custardapples Raspberry
The replace-parameter in np.random.choice() is True by default, so for unique values you need to set it to False.
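The same sampling can be sketched with NumPy's newer Generator interface, which also makes the draw reproducible through a seed. The seed value 0 and the names rng, values and sample are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

list1 = ['Banana', 'Apple', 'Pear', 'Strawberry', 'Muskmelon', 'Apricot',
         'Peach', 'Plum', 'Cherry', 'Blackberry', 'Raspberry', 'Cranberry',
         'Grapes', 'Greenapple', 'Kiwi', 'Watermelon', 'Orange', 'Lychee',
         'Custardapples', 'Jackfruit', 'Pineapple', 'Mango']

# A seeded generator gives the same sample on every run.
rng = np.random.default_rng(0)

# Draw 18 distinct items; replace=False rules out repeated values.
values = rng.choice(list1, size=3 * 6, replace=False)

sample = pd.DataFrame(values.reshape(3, 6), columns=list("abcdef"))
print(sample)
```

Sampling without replacement requires the list to hold at least as many items as the frame has cells (here 22 items for 18 cells); otherwise choice raises a ValueError.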

Split (Explode) String into Multiple columns and rows - Python

Good afternoon, I am trying to split the text in a column into a specific format.
Here is my table:
UserId Application
1 Grey Blue::Black Orange;White:Green
2 Yellow Purple::Orange Grey;Blue Pink::Red
I would like it to read the following:
UserId Application Role
1 Grey Blue Black Orange
1 White Green
2 Yellow Purple Orange Grey
2 Blue Pink Red
So far my code is
def unnesting(df, explode):
    # Repeat each index label once per element of the list-valued column
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    return df1.join(df.drop(explode, axis=1), how='left')
df['Application'] = df.Roles.str.split(';|::|:').map(lambda x: x[0::2])
unnesting(df.drop('Roles', axis=1), ['Application'])
The resulting output reads:
UserId Application
1 Grey Blue
1 White
2 Yellow Purple
2 Blue Pink
I do not know how to add the second column (Role) in the code, for the second split after "::".

Given this dataframe:
UserId Application
0 1 Grey Blue::Black Orange;White::Green
1 2 Yellow Purple::Orange Grey;Blue Pink::Red
you could at least achieve the last two columns directly via
df.Application.str.split(';', expand=True).stack().str.split('::', expand=True).reset_index().drop(columns=['level_0', 'level_1'])
which results in
0 1
0 Grey Blue Black Orange
1 White Green
2 Yellow Purple Orange Grey
3 Blue Pink Red
However, setting UserId as the index first also provides the proper UserId column:
result = df.set_index('UserId').Application.str.split(';', expand=True).stack().str.split('::', expand=True).reset_index().drop(columns=['level_1'])
result.columns = ['UserId', 'Application', 'Role']
UserId Application Role
0 1 Grey Blue Black Orange
1 1 White Green
2 2 Yellow Purple Orange Grey
3 2 Blue Pink Red
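On pandas 0.25 or later, a similar result can be sketched with Series.explode, which avoids the stack and level_N bookkeeping. This is an alternative to the chain above, not the original answer's method, and it assumes every entry uses "::" between application and role, as in the dataframe given in this answer. The names pairs, parts and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "UserId": [1, 2],
    "Application": ["Grey Blue::Black Orange;White::Green",
                    "Yellow Purple::Orange Grey;Blue Pink::Red"],
})

# One row per "application::role" pair; the original index is repeated.
pairs = df["Application"].str.split(";").explode()

# Split each pair into its two parts.
parts = pairs.str.split("::", expand=True)
parts.columns = ["Application", "Role"]

# Join on the repeated index to bring UserId back, then renumber the rows.
result = df[["UserId"]].join(parts).reset_index(drop=True)
print(result)
```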

Python & pandas: how to group in a nonstandard way

I know basic pandas functions, but I'm not clear how to group in this case.
I have a dataframe with a list of various fruits and their characteristics:
fruit x1 x2
apple red sweet
apple yellow sweet
apple green tart
apple red sweet
apple red sweet
How could I sum each fruit (there are more after apples) like this?
3 apples: red and sweet
1 apple: yellow and sweet
1 apple: green and tart
I've looked at groupby, tried an apply function, and looked over pandas documentation, but this escapes me.
Any ideas? Thank you so much.

With Counter
import pandas as pd
from collections import Counter
pd.Series(Counter(map(tuple, df.values)))
apple green tart 1
red sweet 3
yellow sweet 1
dtype: int64
pd.factorize and np.bincount
import numpy as np
i, r = pd.factorize(list(map(tuple, df.values)))
pd.Series(dict(zip(r, np.bincount(i))))
apple green tart 1
red sweet 3
yellow sweet 1
dtype: int64

You can try below:
df['count']=0
group_df = df.groupby(["fruit","x1","x2"])['count'].count().reset_index()
The output will be as below:
fruit x1 x2 count
0 apple green tart 1
1 apple red sweet 3
2 apple yellow sweet 1
Of course, you can concatenate the columns after this to match your required output exactly.
And if you want the counts sorted in descending order:
group_df = df.groupby(["fruit","x1","x2"])['count'].count().reset_index().sort_values(by=['count'],ascending=False)
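As a rough sketch of that last concatenation step, the counts can also be taken with groupby(...).size(), which needs no helper column, and then formatted into the requested sentences. The column name n and the simple "add an s" pluralisation are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple"] * 5,
    "x1": ["red", "yellow", "green", "red", "red"],
    "x2": ["sweet", "sweet", "tart", "sweet", "sweet"],
})

# Count identical (fruit, x1, x2) rows, largest groups first.
counts = (
    df.groupby(["fruit", "x1", "x2"])
      .size()
      .reset_index(name="n")
      .sort_values("n", ascending=False)
)

# Naive pluralisation: append "s" when the count is not 1.
lines = [
    f"{r.n} {r.fruit}{'' if r.n == 1 else 's'}: {r.x1} and {r.x2}"
    for r in counts.itertuples(index=False)
]
print("\n".join(lines))
```

The count column is named n rather than count because itertuples returns namedtuples, where count is already a tuple method.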
