Explode multiple columns in CSV with varying/unmatching element counts using Pandas - python

I'm trying to use the explode function in pandas on 2 columns in a CSV that have varying element counts. I understand that one of the limitations of a multi-explode currently is that you can't have nonmatching element counts in the target columns, so I'm wondering what you can do to get around this or if there's something completely different besides explode?
Input:
Fruit    Color          Origin
Apple    Red, Green     USA; Canada
Plum     Purple         USA
Mango    Red, Yellow    Mexico; USA
Pepper   Red, Green     Mexico
Desired Output:
Fruit    Color    Origin
Apple    Red      USA
Apple    Green    Canada
Plum     Purple   USA
Mango    Red      Mexico
Mango    Yellow   USA
Pepper   Red      Mexico
Pepper   Green    Mexico
There is never more than 1 Origin value for rows with only 1 Color value.
Color values are always separated by ", " and Origin values are always separated by "; "
My code so far:
import pandas as pd
df = pd.read_csv('fruits.csv')
df['Color'] = df['Color'].str.split(', ')
df['Origin'] = df['Origin'].str.split('; ')
df = df.explode(['Color','Origin'])
df.to_csv('explode_fruit.csv', encoding='utf-8')
I get this error when running: "ValueError: columns must have matching element counts"

The error is most likely due to the unequal number of Color and Origin values in the last row. Since you mention that there is never more than one Origin value for rows with only one Color value, you can pad the shorter Origin list to match and then explode:
import pandas as pd

df = pd.DataFrame({'Fruit': ['Apple', 'Plum', 'Mango', 'Pepper'],
                   'Color': ['Red, Green', 'Purple', 'Red, Yellow', 'Red, Green'],
                   'Origin': ['USA; Canada', 'USA', 'Mexico; USA', 'Mexico']})

df['Color'] = df['Color'].str.split(', ')
df['Origin'] = df['Origin'].str.split('; ')

# ensure an equal number of Color and Origin values in each cell
df['Origin'] = df.apply(lambda x: x['Origin'] * len(x['Color'])
                        if len(x['Color']) > len(x['Origin']) else x['Origin'], axis=1)

df = df.explode(['Color', 'Origin']).reset_index(drop=True)
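Plugging the same padding step back into the CSV workflow from the question, a minimal end-to-end sketch (assuming 'fruits.csv' contains the Fruit/Color/Origin columns and the separators shown above) could look like this:

import pandas as pd

df = pd.read_csv('fruits.csv')
df['Color'] = df['Color'].str.split(', ')
df['Origin'] = df['Origin'].str.split('; ')

# repeat the single Origin value so both columns have matching element counts
df['Origin'] = df.apply(
    lambda x: x['Origin'] * len(x['Color']) if len(x['Color']) > len(x['Origin']) else x['Origin'],
    axis=1,
)

df = df.explode(['Color', 'Origin']).reset_index(drop=True)
df.to_csv('explode_fruit.csv', index=False, encoding='utf-8')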

Related

Splitting a column into two in dataframe

Its solution is probably out there, but I couldn't find it, so I'm posting it here.
I have a dataframe which is like
object_Id object_detail
0 obj00 red mug
1 obj01 red bowl
2 obj02 green mug
3 obj03 white candle holder
I want to split the column object_detail into two columns, object_color and name, based on a list that contains the colour names:
COLOR = ['red', 'green', 'blue', 'white']
print(df)
# want to perform some operation so that It'll get output
object_Id object_detail object_color name
0 obj00 red mug red mug
1 obj01 red bowl red bowl
2 obj02 green mug green mug
3 obj03 white candle holder white candle holder
This is my first time using dataframe so I am not sure how to achieve it using pandas. I can achieve it by converting it into a list and then apply a filter. But I think there are easier ways out there that I might miss. Thanks
Use Series.str.extract with the list values joined by | (regex OR) to capture the colour, and capture everything after the whitespace into a second column:
pat = "|".join(COLOR)
df[['object_color','name']] = df['object_detail'].str.extract(fr'({pat})\s+(.*)', expand=True)
print (df)
object_Id object_detail object_color name
0 obj00 red mug red mug
1 obj01 red bowl red bowl
2 obj02 green mug green mug
3 obj03 white candle holder white candle holder
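One hedged caveat, not part of the original answer: str.extract returns NaN for any row whose object_detail does not start with one of the listed colours. If such rows can occur, you could fall back to the full string as the name, for example:

# keep the untouched detail as the name where no listed colour matched (assumption)
df['name'] = df['name'].fillna(df['object_detail'])
df['object_color'] = df['object_color'].fillna('')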

Removing right part of string from pandas column if equal to another pandas column

I am getting NaN values when trying to take the left part of a string in a pandas dataframe, where the slice boundary depends on the length of the cell in another column of the dataframe:
Example of df :
Phrase             Color
Paul like red      red
Mike like green    green
John like blue     blue
My objective is to obtain a series with the first part of each phrase, i.e. everything before " like {Color}".
Here it would be:
|First Name|
| Paul |
| Mike |
| John |
I tried the following:
df["First Name"] = df["Phrase"].str[:- df["Color"].str.len() - 6]
But I keep getting NaN results. It seems the per-row length of the colours cannot be passed into the str[:-x] slice.
Can someone help me understand what is happening here and find a solution?
Thanks a lot. Have a nice day.
Consider below df:
In [128]: df = pd.DataFrame({'Phrase':['Paul like red', 'Mike like green', 'John like blue', 'Mark like black'], 'Color':['red', 'green', 'blue', 'brown']})
In [129]: df
Out[129]:
Phrase Color
0 Paul like red red
1 Mike like green green
2 John like blue blue
3 Mark like black brown
Use numpy.where:
In [134]: import numpy as np
In [132]: df['First Name'] = np.where(df.apply(lambda x: x['Color'] in x['Phrase'], axis=1), df.Phrase.str.split().str[0], np.nan)
In [133]: df
Out[133]:
Phrase Color First Name
0 Paul like red red Paul
1 Mike like green green Mike
2 John like blue blue John
3 Mark like black brown NaN
Let's break this down and try to understand what's going on. .str returns a pandas.Series.str accessor (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.html) object, and you want to slice it using a vector.
So basically you are trying to do pandas.Series.str[: <some_vector>] where <some_vector> is - df["Color"].str.len() - 6
Unfortunately, pandas offers no way to slice using a vector, check all methods here: https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html
So we are restricted to using pandas.Series.str[: some_value]
Now since this some_value changes for every row, you can use the .apply method over each row as follows:
df = pd.DataFrame({
    'Phrase': ['Paul like red', 'Mike like green', 'John like blue'],
    'Color': ['red', 'green', 'blue']
})
>>>
Phrase Color
0 Paul like red red
1 Mike like green green
2 John like blue blue
def func(x):
    return x['Phrase'][:-len(x['Color']) - 6]
df['First Name'] = df.apply(func, axis=1)
>>>
print (df)
Phrase Color First Name
0 Paul like red red Paul
1 Mike like green green Mike
2 John like blue blue John
Here I have used the same logic but passed the value as a scalar using .apply
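If you would rather avoid .apply, a hedged alternative under the same assumption (every Phrase ends with " like <Color>", i.e. the colour plus 6 extra characters) is a plain comprehension, or simply splitting on the " like " separator:

# slice each phrase by its per-row length, mirroring the .apply logic above
df['First Name'] = [phrase[:-(len(color) + 6)]
                    for phrase, color in zip(df['Phrase'], df['Color'])]

# or rely only on the " like " separator
df['First Name'] = df['Phrase'].str.split(' like ').str[0]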

How do I assign the results of a pandas.series.str.contains method in pandas to a new column

I have the following code:
import pandas
dict1 = {
    "Country": ['USA', 'France', 'Spain', 'Italy', 'Germany', 'South Africa', 'Portugal', 'Brazil'],
    "Variety": ['Pinot Gris', 'Pinot Blanc', 'White Blend', 'Sauvignon Blanc', 'Frappato', 'Portuguese Red', 'Red Blend', 'Pinot Noir'],
    "Grade": [80, 85, 83, 87, 88, 89, 84, 86],
}
df = pandas.DataFrame(dict1)
df['Type'] = ''
What I am trying to do is go through each row and, if the Variety value contains Red or Noir, write "Red" into the Type column at that index.
I used the pandas string contains method, but it only returns Boolean values, and when I try to loop over them I can't do what I want (of course, because they are Boolean values). Does anyone know how to resolve this?
str.contains is supposed to return a boolean array: a string either contains your substring or it doesn't. If you then want to assign values wherever that boolean array is True or False, you'll need to combine str.contains with numpy.where:
import numpy as np
df["Type"] = np.where(df["Variety"].str.contains(r"Red|Noir"), "Red", "NOT RED")
print(df)
Country Variety Grade Type
0 USA Pinot Gris 80 NOT RED
1 France Pinot Blanc 85 NOT RED
2 Spain White Blend 83 NOT RED
3 Italy Sauvignon Blanc 87 NOT RED
4 Germany Frappato 88 NOT RED
5 South Africa Portuguese Red 89 Red
6 Portugal Red Blend 84 Red
7 Brazil Pinot Noir 86 Red
np.where takes a boolean array, and assigns values to wherever it is True or False. In this case I assigned "Red" to wherever our boolean array was True and "NOT RED" wherever the array was False.
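If you only want to fill the matching rows and keep the empty string set earlier by df['Type'] = '' everywhere else, a hedged variant using boolean indexing is:

# assign "Red" only where Variety matches; other rows keep their existing Type value
df.loc[df["Variety"].str.contains(r"Red|Noir"), "Type"] = "Red"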

Python Pandas dataframe cross-referencing and new column generation

I want to generate a dataframe that contains lists of a person's potential favorite crayon colors, based on their favorite color. I have two dataframes that contain the necessary information:
df1 = pd.DataFrame({'person':['Jeff','Marie','Jenna','Mike'], 'color':['blue', 'purple', 'brown', 'green']}, columns=['person','color'])
df2 = pd.DataFrame({'possible_crayons':['christmas red','infra red','scarlet','sunset orange', 'neon carrot','lemon','forest green','pine','navy','aqua','periwinkle','royal purple'],'color':['red','red','red','orange','orange','yellow','green','green','blue','blue','purple','purple']}, columns=['possible_crayons','color'])
I want to reference one database against the other by matching the df1 color entry to the df2 color entry, and returning the corresponding possible_crayons values as a list in a new column in df1. Any terms that did not find a match would be labeled N/A. So the desired output would be:
person favorite_color possible_crayons_list
Jeff blue [navy, aqua]
Marie purple [periwinkle, royal purple]
Jenna brown NaN
Mike green [forest green, pine]
I've tried:
mergedDF = pd.merge(df1, df2, how='left')
However, this results in the following:
person color possible_crayons
0 Jeff blue navy
1 Jeff blue aqua
2 Marie purple periwinkle
3 Marie purple royal purple
4 Jenna brown NaN
5 Mike green forest green
6 Mike green pine
Is there any way to achieve my desired output of lists?
We can use DataFrame.merge with how='left' and then GroupBy.agg with as_index=False:
new_df = (df1.merge(df2, how='left', on='color')
             .groupby(['color', 'person'], as_index=False)
             .agg(list))
Output
print(new_df)
color person possible_crayons
0 blue Jeff [navy, aqua]
1 brown Jenna [nan]
2 green Mike [forest green, pine]
3 purple Marie [periwinkle, royal purple]
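Note that this groupby output is ordered by colour rather than by df1's original rows, and the unmatched brown row ends up as [nan]. If the original row order and a plain NaN matter, a hedged variant of the same idea maps pre-grouped lists back onto df1:

# group df2's crayons into lists per colour, then look them up per row of df1
lists = df2.groupby('color')['possible_crayons'].agg(list)
df1['possible_crayons_list'] = df1['color'].map(lists)  # brown has no match, so NaN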
Use this:
df1 = pd.DataFrame({'person':['Jeff','Marie','Jenna','Mike'], 'color':['blue', 'purple', 'brown', 'green']}, columns=['person','color'])
df2 = pd.DataFrame({'possible_crayons':['christmas red','infra red','scarlet','sunset orange', 'neon carrot','lemon','forest green','pine','navy','aqua','periwinkle','royal purple'],'color':['red','red','red','orange','orange','yellow','green','green','blue','blue','purple','purple']}, columns=['possible_crayons','color'])
tmp = df2.groupby('color')['possible_crayons'].apply(list)
mergedDF = df1.merge(tmp, how='left', left_on='color', right_index=True)
print(mergedDF)
mergedDF2 = mergedDF.groupby('color')['possible_crayons'].apply(list).reset_index(name='new_possible_crayons')
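For reference, mergedDF from this approach should already match the desired shape; with the sample frames above, print(mergedDF) should give roughly:

  person   color            possible_crayons
0   Jeff    blue                [navy, aqua]
1  Marie  purple  [periwinkle, royal purple]
2  Jenna   brown                         NaN
3   Mike   green        [forest green, pine]

The trailing mergedDF2 groupby wraps each of these lists in another list, so it is only needed if that extra nesting is intended.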

Get Colour Formatted table in Pandas

I have an two pandas dataframe like this:
Df1:
Name  Age  Hobby
ABC   23   Reading
GHI   25   Playing
Df2:
Name   Age     Hobby
Green  Yellow  Green
Green  NaN     Red
What I am looking for is a third, styled dataframe such that:
ABC and GHI are coloured in green,
23 is in yellow and 25 stays white as its cell in Df2 is NaN,
Reading is in green and Playing is in red.
Any help on the same?
Use styles with custom function:
def color(x):
    # build a DataFrame of CSS strings, same shape as Df1, from Df2's colour names
    c = 'background-color: '
    return Df2.apply(lambda col: col.str.lower()).radd(c).fillna('')

Df1.style.apply(color, axis=None).to_excel('styled.xlsx', engine='openpyxl', index=False)
Output: the styled workbook styled.xlsx with the requested cell colours (shown as a screenshot in the original answer).
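For context (a hedged note, not from the original answer): Styler.apply with axis=None passes the whole DataFrame to the function, which must return a DataFrame of CSS strings with the same shape as Df1. To preview the styling in a notebook rather than exporting to Excel, something like this should work:

styled = Df1.style.apply(color, axis=None)
html = styled.to_html()  # Styler.to_html is available in pandas >= 1.3 (version assumption)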
