Loop through pandas grouped data index - python

Struggling newbie here; we are trying to list the items of a grouped dataframe. To highlight the problem, please see the simplified example below.
First group the items:
import pandas as pd

data = {'colour': ['red','purple','green','purple','blue','red'], 'item': ['hat','scarf','belt','belt','hat','scarf'], 'material': ['felt','wool','leather','wool','plastic','wool']}
df = pd.DataFrame(data=data)
grpd_df = df.groupby(df['item']).apply(lambda g: g.reset_index(drop=True))
grpd_df
colour item material
item
belt 0 green belt leather
1 purple belt wool
hat 0 red hat felt
1 blue hat plastic
scarf 0 purple scarf wool
1 red scarf wool
Then find all the rows that have a red item:
df = grpd_df[grpd_df['colour'].eq('red').groupby(level=0).transform('any')]
print (df)
colour item material
item
hat 0 red hat felt
1 blue hat plastic
scarf 0 purple scarf wool
1 red scarf wool
We would like to then loop over a list of the items in grpd_df, i.e. hat and scarf. We've tried df.index.levels, but this outputs all the items, including belt.

You can use IndexSlice and get_level_values to achieve this:
grpd_df.loc[pd.IndexSlice[list(set(df.index.get_level_values(0).tolist())),:]]
Out[302]:
colour item material
item
hat 0 red hat felt
1 blue hat plastic
scarf 0 purple scarf wool
1 red scarf wool
If you need the distinct level-0 values of the index from df:
set(df.index.get_level_values(0).tolist())
Out[303]: {'hat', 'scarf'}
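From there you can loop over just the remaining items, for example (a minimal sketch, assuming the filtered df above):
# Iterate over the unique first-level index labels (hat, scarf)
for item in df.index.get_level_values(0).unique():
    print(item)
    print(df.loc[item])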

Related

Splitting a column into two in dataframe

Its solution is definitely out there, but I couldn't find it, so I'm posting it here.
I have a dataframe which is like
object_Id object_detail
0 obj00 red mug
1 obj01 red bowl
2 obj02 green mug
3 obj03 white candle holder
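For reference, the frame can be rebuilt from the table above (a minimal sketch):
import pandas as pd

df = pd.DataFrame({
    'object_Id': ['obj00', 'obj01', 'obj02', 'obj03'],
    'object_detail': ['red mug', 'red bowl', 'green mug', 'white candle holder'],
})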
I want to split the column object_detail into two columns, object_color and name, based on a list that contains the color names:
COLOR = ['red', 'green', 'blue', 'white']
print(df)
# want to perform some operation so that I'll get the output below
object_Id object_detail object_color name
0 obj00 red mug red mug
1 obj01 red bowl red bowl
2 obj02 green mug green mug
3 obj03 white candle holder white candle holder
This is my first time using dataframes, so I am not sure how to achieve this with pandas. I can achieve it by converting the column to a list and then applying a filter, but I think there are easier ways out there that I might have missed. Thanks
Use Series.str.extract with the list values joined by | (regex OR); the first capture group grabs the color and the second grabs everything after the whitespace:
pat = "|".join(COLOR)
df[['object_color','name']] = df['object_detail'].str.extract(f'({pat})\s+(.*)',expand=True)
print (df)
object_Id object_detail object_color name
0 obj00 red mug red mug
1 obj01 red bowl red bowl
2 obj02 green mug green mug
3 obj03 white candle holder white candle holder
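If any of the COLOR entries could ever contain regex metacharacters, it may be safer to escape them before joining (a minimal sketch using re.escape):
import re

# Escape entries defensively so metacharacters are matched literally
pat = "|".join(map(re.escape, COLOR))
df[['object_color', 'name']] = df['object_detail'].str.extract(rf'({pat})\s+(.*)', expand=True)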

Multi indexing using pandas

df
ITEM CATEGORY COLOR
48684 CAR RED
54519 BIKE BLACK
14582 CAR BLACK
45685 JEEP WHITE
23661 BIKE BLUE
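(For reference, the frame can be rebuilt from the table above; a minimal sketch:)
import pandas as pd

df = pd.DataFrame({
    'ITEM': [48684, 54519, 14582, 45685, 23661],
    'CATEGORY': ['CAR', 'BIKE', 'CAR', 'JEEP', 'BIKE'],
    'COLOR': ['RED', 'BLACK', 'BLACK', 'WHITE', 'BLUE'],
})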
I tried using the below code
df.groupby(['CATEGORY', 'COLOR']).size().unstack(fill_value=0)
which does not give me what I need. The output I am expecting is
CATEGORY COLOR ITEM
CAR RED 48684
BLACK 14582
BIKE BLACK 54519
BLUE 23661
JEEP WHITE 45685
I believe you need DataFrame.set_index with DataFrame.sort_index:
df1 = df.set_index(['CATEGORY', 'COLOR']).sort_index()
print (df1)
ITEM
CATEGORY COLOR
BIKE BLACK 54519
BLUE 23661
CAR BLACK 14582
RED 48684
JEEP WHITE 45685
If order is important, convert both columns to ordered categoricals:
cols = ['CATEGORY', 'COLOR']
for c in cols:
    df[c] = pd.Categorical(df[c], categories=df[c].drop_duplicates(), ordered=True)

df1 = df.set_index(cols).sort_index()
print (df1)
ITEM
CATEGORY COLOR
CAR RED 48684
BLACK 14582
BIKE BLACK 54519
BLUE 23661
JEEP WHITE 45685
I would use Categorical first, since this keeps the original CATEGORY order:
df.CATEGORY = pd.Categorical(df.CATEGORY, df.CATEGORY.unique())
s = df.sort_values(['CATEGORY', 'COLOR']).set_index(['CATEGORY', 'COLOR'])
print(s)
ITEM
CATEGORY COLOR
CAR BLACK 14582
RED 48684
BIKE BLACK 54519
BLUE 23661
JEEP WHITE 45685
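Note that in this version COLOR is still sorted alphabetically within each CATEGORY (BLACK before RED under CAR, unlike the ordered-categorical output above). If the original COLOR order matters too, the same trick extends to that column (a minimal sketch):
# Make COLOR categorical in its first-seen order as well
df.COLOR = pd.Categorical(df.COLOR, df.COLOR.unique())
s = df.sort_values(['CATEGORY', 'COLOR']).set_index(['CATEGORY', 'COLOR'])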

Replace duplicates with first values in dataframe

df
ITEM CATEGORY COLOR
48684 CAR RED
54519 BIKE BLACK
14582 CAR BLACK
45685 JEEP WHITE
23661 BIKE BLUE
23226 BIKE BLUE
54252 BIKE BLACK
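For reference, the frame can be rebuilt from the table above (a minimal sketch):
import pandas as pd

df = pd.DataFrame({
    'ITEM': [48684, 54519, 14582, 45685, 23661, 23226, 54252],
    'CATEGORY': ['CAR', 'BIKE', 'CAR', 'JEEP', 'BIKE', 'BIKE', 'BIKE'],
    'COLOR': ['RED', 'BLACK', 'BLACK', 'WHITE', 'BLUE', 'BLUE', 'BLACK'],
})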
I tried:
df.loc[df.duplicated(['CATEGORY', 'COLOR', 'ITEM']), 'ITEM'] = 'ITEM'
but this does not give me the required output. I need the output below:
ITEM CATEGORY COLOR
48684 CAR RED
54519 BIKE BLACK
14582 CAR BLACK
45685 JEEP WHITE
23661 BIKE BLUE
23661 BIKE BLUE
54519 BIKE BLACK
If the CATEGORY and COLOR are the same, the ITEM number should be replaced with the first value.
Use GroupBy.transform with 'first' across all rows:
df['ITEM'] = df.groupby(['CATEGORY', 'COLOR'])['ITEM'].transform('first')
print (df)
ITEM CATEGORY COLOR
0 48684 CAR RED
1 54519 BIKE BLACK
2 14582 CAR BLACK
3 45685 JEEP WHITE
4 23661 BIKE BLUE
5 23661 BIKE BLUE
6 54519 BIKE BLACK
If you want to filter only the duplicated rows to improve performance (when there are many unique rows and few duplicates), build a mask with DataFrame.duplicated on the two columns with keep=False, run the groupby only on those rows via boolean indexing, and assign the result back to the filtered ITEM column:
m = df.duplicated(['CATEGORY', 'COLOR'], keep=False)
df.loc[m, 'ITEM'] = df[m].groupby(['CATEGORY', 'COLOR'])['ITEM'].transform('first')
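To see which rows the mask selects on the sample frame (a quick check):
# The mask flags every row whose (CATEGORY, COLOR) pair occurs more than once
m = df.duplicated(['CATEGORY', 'COLOR'], keep=False)
print(m.tolist())  # [False, True, False, False, True, True, True]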

Python Pandas dataframe cross-referencing and new column generation

I want to generate a dataframe that contains lists of a person's potential favorite crayon colors, based on their favorite color. I have two dataframes that contain the necessary information:
df1 = pd.DataFrame({'person':['Jeff','Marie','Jenna','Mike'], 'color':['blue', 'purple', 'brown', 'green']}, columns=['person','color'])
df2 = pd.DataFrame({'possible_crayons':['christmas red','infra red','scarlet','sunset orange', 'neon carrot','lemon','forest green','pine','navy','aqua','periwinkle','royal purple'],'color':['red','red','red','orange','orange','yellow','green','green','blue','blue','purple','purple']}, columns=['possible_crayons','color'])
I want to reference one database against the other by matching the df1 color entry to the df2 color entry, and returning the corresponding possible_crayons values as a list in a new column in df1. Any terms that did not find a match would be labeled N/A. So the desired output would be:
person favorite_color possible_crayons_list
Jeff blue [navy, aqua]
Marie purple [periwinkle, royal purple]
Jenna brown NaN
Mike green [forest green, pine]
I've tried:
mergedDF = pd.merge(df1, df2, how='left')
However, this results in the following:
person color possible_crayons
0 Jeff blue navy
1 Jeff blue aqua
2 Marie purple periwinkle
3 Marie purple royal purple
4 Jenna brown NaN
5 Mike green forest green
6 Mike green pine
Is there any way to achieve my desired output of lists?
We can use DataFrame.merge with how='left' and then GroupBy.agg with as_index=False:
new_df = (df1.merge(df2, how='left', on='color')
              .groupby(['color', 'person'], as_index=False)
              .agg(list))
Output
print(new_df)
color person possible_crayons
0 blue Jeff [navy, aqua]
1 brown Jenna [nan]
2 green Mike [forest green, pine]
3 purple Marie [periwinkle, royal purple]
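Note that the unmatched row comes back as [nan] rather than the scalar NaN in the desired output; if that matters, the lists can be cleaned up afterwards (a minimal sketch, assuming the new_df above):
import numpy as np
import pandas as pd

# Collapse all-NaN lists (no crayon matches) back to a scalar NaN
new_df['possible_crayons'] = new_df['possible_crayons'].apply(
    lambda lst: np.nan if all(pd.isna(x) for x in lst) else lst
)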
Use this:
df1 = pd.DataFrame({'person':['Jeff','Marie','Jenna','Mike'], 'color':['blue', 'purple', 'brown', 'green']}, columns=['person','color'])
df2 = pd.DataFrame({'possible_crayons':['christmas red','infra red','scarlet','sunset orange', 'neon carrot','lemon','forest green','pine','navy','aqua','periwinkle','royal purple'],'color':['red','red','red','orange','orange','yellow','green','green','blue','blue','purple','purple']}, columns=['possible_crayons','color'])
# Map each color to the list of matching crayons
tmp = df2.groupby('color')['possible_crayons'].apply(list)
# Left-join the lists onto df1 by color; colors with no match get NaN
mergedDF = df1.merge(tmp, how='left', left_on='color', right_index=True)
print(mergedDF)
# Further aggregation per color, collecting each person's list into a list of lists
mergedDF2 = mergedDF.groupby('color')['possible_crayons'].apply(list).reset_index(name='new_possible_crayons')

How to use python to search multiple dataframe columns for a string and copy it to a new column if found?

I am trying to use Python to find matches to a substring in multiple columns of a dataframe, and copy the entire string, if substring is found, to a new column.
The data strings are extracted from comma-separated strings in another df, so there are varying numbers of strings across each row. The string in column A may or may not be the one I want to copy; if it isn't, the string in column B will be. Some rows include data in columns D and E, but we don't have to use those. (In the real world, these are website urls and I'm trying to gather only the ones from a specific domain, which might be the first one or the second one on the row. I used simpler strings for the example.)

I am trying to use np.where, but I am not getting consistent results, particularly if the correct string is in column A but not repeated in column B. np.where appears to only apply the "y" and never the "x". I've also tried variations on if/where in loops without good results.
import pandas as pd
df = pd.DataFrame({"A": ["blue lorry", "yellow cycle", "red car", "blue lorry", "red truck", "red bike", "blue jeep", "yellow skate", "red bus"], "B": ["red train", "red cart", "red car", "red moto",'', "red bike", "red diesel", "red carriage",''], "C": ['','','', "red moto",'', "red bike", "red diesel", "red carriage",''], "D": ['','','', "red moto",'', "red bike", '','','']})
This produces df:
A B C D
0 blue lorry red train
1 yellow cycle red cart
2 red car red car
3 blue lorry red moto red moto red moto
4 red truck
5 red bike red bike red bike red bike
6 blue jeep red diesel red diesel
7 yellow skate red carriage red carriage
8 red bus
When I run:
df['Red'] = np.where("red" in df['A'], df['A'], df['B'])
It returns:
A B C D Red
0 blue lorry red train red train
1 yellow cycle red cart red cart
2 red car red car red car
3 blue lorry red moto red moto red moto red moto
4 red truck
5 red bike red bike red bike red bike red bike
6 blue jeep red diesel red diesel red diesel
7 yellow skate red carriage red carriage red carriage
8 red bus
The column Red values for lines 4 and 8 are missing, when I expected it to copy the (correct) strings from column A.
I understand the basic structure is: numpy.where(condition, x, y)
I tried to apply code so the condition is to look for "red" and copy the string in column A if "red" is found, or the string in column B if it isn't. But it seems I'm only getting the column B string. Any help is appreciated.
Obviously I'm new here. I gleaned some help for np.where from these topics, but I think there are some differences between using numeric values and strings, and my multiple columns:
np.where Not Working in my Pandas
Efficiently replace values from a column to another column Pandas DataFrame
Update Value in one column, if string in other column contains something in list
str.contains works where the "in" condition did not. The correct code is:
df['Red'] = np.where(df['A'].str.contains('red'), df['A'], df['B'])
Thanks to Terry!
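For context on why the original attempt misbehaved: in on a pandas Series tests membership in the index (the labels 0-8 here), not the values, so "red" in df['A'] evaluates to a single False and np.where returns column B for every row. A quick illustration (minimal sketch):
# "red" in series checks the index labels, not the values
print("red" in df['A'])             # False -> np.where broadcasts the else-branch
print(df['A'].str.contains('red'))  # element-wise boolean Series, one value per row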
