Splitting a column into two in dataframe


The solution is probably out there, but I couldn't find it, so I'm posting here.
I have a dataframe like this:
object_Id object_detail
0 obj00 red mug
1 obj01 red bowl
2 obj02 green mug
3 obj03 white candle holder
I want to split the column object_detail into two columns, name and object_color, based on a list that contains the color names:
COLOR = ['red', 'green', 'blue', 'white']
print(df)
# want to perform some operation that produces this output:
object_Id object_detail object_color name
0 obj00 red mug red mug
1 obj01 red bowl red bowl
2 obj02 green mug green mug
3 obj03 white candle holder white candle holder
This is my first time using dataframes, so I am not sure how to achieve this with pandas. I could convert the column to a list and then apply a filter, but I think there are easier ways that I am missing. Thanks

Use Series.str.extract with the list values joined by | (regex OR) to capture the color, and a second group that captures everything after the following whitespace as the name:
pat = "|".join(COLOR)
df[['object_color','name']] = df['object_detail'].str.extract(rf'({pat})\s+(.*)', expand=True)
print (df)
object_Id object_detail object_color name
0 obj00 red mug red mug
1 obj01 red bowl red bowl
2 obj02 green mug green mug
3 obj03 white candle holder white candle holder
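For readers who want to run this end to end, here is a minimal self-contained sketch of the same idea. It rebuilds the sample frame from the question. The `re.escape` call, the `^`/`$` anchors, and the two separate column assignments are additions for illustration, not part of the answer above.

```python
import re

import pandas as pd

COLOR = ["red", "green", "blue", "white"]
df = pd.DataFrame({
    "object_Id": ["obj00", "obj01", "obj02", "obj03"],
    "object_detail": ["red mug", "red bowl", "green mug", "white candle holder"],
})

# Escape each color so that any regex metacharacters are matched literally.
pat = "|".join(re.escape(c) for c in COLOR)

# Group 1 captures the leading color, group 2 captures the rest of the string.
extracted = df["object_detail"].str.extract(rf"^({pat})\s+(.*)$")
df["object_color"] = extracted[0]
df["name"] = extracted[1]
print(df)
```

Rows whose detail does not start with one of the listed colors would get NaN in both new columns.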

Related

Removing right part of string from pandas column if equal to another pandas column

I get NaN values when trying to take the left part of a string in a pandas dataframe, where the cut-off position depends on the length of the cell in another column of the dataframe:
Example df:
Phrase Color
Paul like red red
Mike like green green
John like blue blue
My objective is to obtain a series with the first part of the phrase, i.e. everything before "like {Color}".
Here it would be:
|First Name|
| Paul |
| Mike |
| John |
I tried the following:
df["First Name"] = df["Phrase"].str[:- df["Color"].str.len() - 6]
But I keep getting NaN results. It seems the computed color lengths are not passed through to the str[:-x] slice.
Can someone help me understand what is happening here and find a solution?
Thanks a lot. Have a nice day.

Consider below df:
In [128]: df = pd.DataFrame({'Phrase':['Paul like red', 'Mike like green', 'John like blue', 'Mark like black'], 'Color':['red', 'green', 'blue', 'brown']})
In [129]: df
Out[129]:
Phrase Color
0 Paul like red red
1 Mike like green green
2 John like blue blue
3 Mark like black brown
Use numpy.where:
In [134]: import numpy as np
In [132]: df['First Name'] = np.where(df.apply(lambda x: x['Color'] in x['Phrase'], 1), df.Phrase.str.split().str[0], np.nan)
In [133]: df
Out[133]:
Phrase Color First Name
0 Paul like red red Paul
1 Mike like green green Mike
2 John like blue blue John
3 Mark like black brown NaN

Let's break this down and try to understand what's going on. .str returns a pandas.Series.str (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.html) accessor object, and you want to slice it using a vector.
So basically you are trying to do pandas.Series.str[: <some_vector>] where <some_vector> is - df["Color"].str.len() - 6
Unfortunately, pandas offers no way to slice using a vector; see all the available methods here: https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html
So we are restricted to using pandas.Series.str[: some_value]
Now since this some_value changes for every row, you can use the .apply method over each row as follows:
df = pd.DataFrame({
'Phrase': ['Paul like red', 'Mike like green', 'John like blue'],
'Color': ['red', 'green', 'blue']
})
>>>
Phrase Color
0 Paul like red red
1 Mike like green green
2 John like blue blue
def func(x):
    return x['Phrase'][:-len(x['Color'])-6]
df['First Name'] = df.apply(func, axis=1)
>>>
print (df)
Phrase Color First Name
0 Paul like red red Paul
1 Mike like green green Mike
2 John like blue blue John
Here I have used the same logic but passed the value as a scalar using .apply
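As a sketch of an alternative to .apply, a plain list comprehension over the two columns performs the same per-row slicing. The guard for rows where the phrase does not end in " like <Color>" is an added assumption about how mismatches should be handled, not something the question specifies.

```python
import pandas as pd

df = pd.DataFrame({
    "Phrase": ["Paul like red", "Mike like green", "John like blue"],
    "Color": ["red", "green", "blue"],
})

# Build the per-row suffix to strip, e.g. " like red".
suffix = " like " + df["Color"]

# Slice each phrase with its own scalar length; leave None when the
# phrase does not actually end with the expected suffix.
df["First Name"] = [
    phrase[: -len(suf)] if phrase.endswith(suf) else None
    for phrase, suf in zip(df["Phrase"], suffix)
]
print(df)
```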

Replace duplicates with first values in dataframe

df
ITEM CATEGORY COLOR
48684 CAR RED
54519 BIKE BLACK
14582 CAR BLACK
45685 JEEP WHITE
23661 BIKE BLUE
23226 BIKE BLUE
54252 BIKE BLACK
I tried
df.loc[df.duplicated(['CATEGORY', 'COLOR','ITEM']), 'ITEM'] = 'ITEM'
but it does not give me the required output. I need the output below.
ITEM CATEGORY COLOR
48684 CAR RED
54519 BIKE BLACK
14582 CAR BLACK
45685 JEEP WHITE
23661 BIKE BLUE
23661 BIKE BLUE
54519 BIKE BLACK
If CATEGORY and COLOR are the same, the ITEM number should be replaced with the first value.

Use GroupBy.transform with GroupBy.first to broadcast each group's first ITEM to all of its rows:
df['ITEM'] = df.groupby(['CATEGORY', 'COLOR'])['ITEM'].transform('first')
print (df)
ITEM CATEGORY COLOR
0 48684 CAR RED
1 54519 BIKE BLACK
2 14582 CAR BLACK
3 45685 JEEP WHITE
4 23661 BIKE BLUE
5 23661 BIKE BLUE
6 54519 BIKE BLACK
If you want to process only the duplicated rows to improve performance (when most rows are unique and few are duplicates), build a mask with DataFrame.duplicated on the 2 columns with keep=False, apply groupby only to the rows selected by boolean indexing, and assign the result back to the filtered ITEM column:
m = df.duplicated(['CATEGORY', 'COLOR'], keep=False)
df.loc[m, 'ITEM'] = df[m].groupby(['CATEGORY', 'COLOR'])['ITEM'].transform('first')
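The two variants can be checked side by side with a small runnable sketch built from the question's data. It also shows why the original attempt changed nothing: including ITEM in the duplicate check means no row counts as a duplicate, because the ITEM values differ. The variable names `full` and `masked` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "ITEM": [48684, 54519, 14582, 45685, 23661, 23226, 54252],
    "CATEGORY": ["CAR", "BIKE", "CAR", "JEEP", "BIKE", "BIKE", "BIKE"],
    "COLOR": ["RED", "BLACK", "BLACK", "WHITE", "BLUE", "BLUE", "BLACK"],
})

# With ITEM in the subset, every row is unique, so the mask is all False.
attempt_mask = df.duplicated(["CATEGORY", "COLOR", "ITEM"])

# Variant 1: transform over all rows.
full = df.groupby(["CATEGORY", "COLOR"])["ITEM"].transform("first")

# Variant 2: transform only the rows that share CATEGORY and COLOR.
m = df.duplicated(["CATEGORY", "COLOR"], keep=False)
masked = df["ITEM"].copy()
masked[m] = df[m].groupby(["CATEGORY", "COLOR"])["ITEM"].transform("first")

print(full.tolist())
print(masked.tolist())
```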

Python Pandas dataframe cross-referencing and new column generation

I want to generate a dataframe that contains lists of a person's potential favorite crayon colors, based on their favorite color. I have two dataframes that contain the necessary information:
df1 = pd.DataFrame({'person':['Jeff','Marie','Jenna','Mike'], 'color':['blue', 'purple', 'brown', 'green']}, columns=['person','color'])
df2 = pd.DataFrame({'possible_crayons':['christmas red','infra red','scarlet','sunset orange', 'neon carrot','lemon','forest green','pine','navy','aqua','periwinkle','royal purple'],'color':['red','red','red','orange','orange','yellow','green','green','blue','blue','purple','purple']}, columns=['possible_crayons','color'])
I want to cross-reference one dataframe against the other by matching the df1 color entry to the df2 color entry, and return the corresponding possible_crayons values as a list in a new column in df1. Any color without a match would be labeled NaN. So the desired output would be:
person favorite_color possible_crayons_list
Jeff blue [navy, aqua]
Marie purple [periwinkle, royal purple]
Jenna brown NaN
Mike green [forest green, pine]
I've tried:
mergedDF = pd.merge(df1, df2, how='left')
However, this results in the following:
person color possible_crayons
0 Jeff blue navy
1 Jeff blue aqua
2 Marie purple periwinkle
3 Marie purple royal purple
4 Jenna brown NaN
5 Mike green forest green
6 Mike green pine
Is there any way to achieve my desired output of lists?

We can use DataFrame.merge with how='left' and then GroupBy.agg with as_index=False:
new_df= ( df1.merge(df2,how='left',on='color')
.groupby(['color','person'],as_index=False).agg(list) )
Output
print(new_df)
color person possible_crayons
0 blue Jeff [navy, aqua]
1 brown Jenna [nan]
2 green Mike [forest green, pine]
3 purple Marie [periwinkle, royal purple]
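Note that in the output above Jenna's entry is the list [nan] rather than a bare NaN, and the row order follows the group keys. If a plain NaN and the original df1 order are preferred, one possible sketch is to aggregate df2 first and then look each color up with Series.map. The column names favorite_color and possible_crayons_list are taken from the desired output in the question.

```python
import pandas as pd

df1 = pd.DataFrame({
    "person": ["Jeff", "Marie", "Jenna", "Mike"],
    "color": ["blue", "purple", "brown", "green"],
})
df2 = pd.DataFrame({
    "possible_crayons": ["christmas red", "infra red", "scarlet", "sunset orange",
                         "neon carrot", "lemon", "forest green", "pine",
                         "navy", "aqua", "periwinkle", "royal purple"],
    "color": ["red", "red", "red", "orange", "orange", "yellow",
              "green", "green", "blue", "blue", "purple", "purple"],
})

# One list of crayons per color, indexed by color.
crayons_by_color = df2.groupby("color")["possible_crayons"].agg(list)

# Look up each person's color; colors missing from df2 become NaN.
result = df1.rename(columns={"color": "favorite_color"})
result["possible_crayons_list"] = result["favorite_color"].map(crayons_by_color)
print(result)
```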

Use this:
df1 = pd.DataFrame({'person':['Jeff','Marie','Jenna','Mike'], 'color':['blue', 'purple', 'brown', 'green']}, columns=['person','color'])
df2 = pd.DataFrame({'possible_crayons':['christmas red','infra red','scarlet','sunset orange', 'neon carrot','lemon','forest green','pine','navy','aqua','periwinkle','royal purple'],'color':['red','red','red','orange','orange','yellow','green','green','blue','blue','purple','purple']}, columns=['possible_crayons','color'])
tmp = df2.groupby('color')['possible_crayons'].apply(list)
mergedDF = df1.merge(tmp, how='left', left_on='color', right_index=True)
print(mergedDF)

mergedDF2 = mergedDF.groupby('color')['possible_crayons'].apply(list).reset_index(name='new_possible_crayons')

How to use python to search multiple datafile columns for string and copy to a new column if found?

I am trying to use Python to find matches to a substring in multiple columns of a dataframe, and copy the entire string, if substring is found, to a new column.
The data strings are extracted from comma-separated strings in another df, so there are varying numbers of strings across each row. The string in column A may or may not be the one I want to copy. If it isn't, the string in column B will be. Some rows include data in columns C and D, but we don't have to use those. (In the real world, these are website urls and I'm trying to gather only the ones from a specific domain, which might be the first one or the second one on the row. I used simpler strings for the example.) I am trying to use np.where, but I am not getting consistent results, particularly if the correct string is in column A but not repeated in column B. np.where appears to only apply the "y" and never the "x". I've also tried variations on if/where in loops without good results.
import numpy as np
import pandas as pd
df = pd.DataFrame({"A": ["blue lorry", "yellow cycle", "red car", "blue lorry", "red truck", "red bike", "blue jeep", "yellow skate", "red bus"], "B": ["red train", "red cart", "red car", "red moto",'', "red bike", "red diesel", "red carriage",''], "C": ['','','', "red moto",'', "red bike", "red diesel", "red carriage",''], "D": ['','','', "red moto",'', "red bike", '','','']})
This produces df:
A B C D
0 blue lorry red train
1 yellow cycle red cart
2 red car red car
3 blue lorry red moto red moto red moto
4 red truck
5 red bike red bike red bike red bike
6 blue jeep red diesel red diesel
7 yellow skate red carriage red carriage
8 red bus
When I run:
df['Red'] = np.where("red" in df['A'], df['A'], df['B'])
It returns:
A B C D Red
0 blue lorry red train red train
1 yellow cycle red cart red cart
2 red car red car red car
3 blue lorry red moto red moto red moto red moto
4 red truck
5 red bike red bike red bike red bike red bike
6 blue jeep red diesel red diesel red diesel
7 yellow skate red carriage red carriage red carriage
8 red bus
The column Red values for lines 4 and 8 are missing, when I expected it to copy the (correct) strings from column A.
I understand the basic structure is: numpy.where(condition, x, y)
I tried to apply code so the condition is to look for "red" and copy the string in column A if "red" is found, or the string in column B if it isn't. But it seems I'm only getting the column B string. Any help is appreciated.
Obviously I'm new here. I gleaned some help for np.where from these topics, but I think there are some differences between using numeric values and strings, and my multiple columns:
np.where Not Working in my Pandas
Efficiently replace values from a column to another column Pandas DataFrame
Update Value in one column, if string in other column contains something in list

str.contains works where the "in" condition did not. The correct code is:
df['Red'] = np.where(df['A'].str.contains('red'), df['A'], df['B'])
Thanks to Terry!
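The reason the original attempt always picked column B is that the Python `in` operator on a Series tests the index labels, not the values, so `"red" in df['A']` evaluates to a single False and np.where then selects the "y" argument for every row. A small sketch on a trimmed-down version of the question's data illustrates this. The `na=False` argument is an added precaution for missing values, not part of the accepted fix.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["blue lorry", "red truck", "red bus"],
    "B": ["red train", "", ""],
})

# `in` checks the index labels (0, 1, 2), not the strings in the column.
label_check = "red" in df["A"]   # False: "red" is not an index label
index_check = 0 in df["A"]       # True: 0 is an index label

# Element-wise substring test, one boolean per row.
mask = df["A"].str.contains("red", na=False)
df["Red"] = np.where(mask, df["A"], df["B"])
print(df)
```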

Partial string matching on large data set

So I have been working on this for a while and haven't gotten anywhere, and I'm not sure what to do. I'm fairly new to pandas and Python.
The data set is actually 15,000 product names, all in different formats: some have multiple dashes (up to 6), some have hyphens, and they have different lengths. The rows are product names with variants.
The code I'm using keeps returning only the first letter, as opposed to the partial string, when I use it on a large data set.
It works just fine on the small data set I was using to test it.
I'm assuming this is happening because:
I haven't created a stop condition for when it matches a full partial string
because it's trying to match up words as opposed to individual characters and stopping when it finds a difference.
What is the best way to overcome this on a large data set? What am I missing? Or am I going to have to do this manually?
Original test data set
`1.star t-shirt-large-red
2.star t-shirt-large-blue
3.star t-shirt-small-red
4.beautiful rainbow skirt small
5.long maxwell logan jeans- light blue -32L-28W
6.long maxwell logan jeans- Dark blue -32L-28W`
Desired data set/output:
`COL1 COL2 COL3 COL4
1[star t-shirt] [large] [red] NONE
2[star t-shirt] [large] [blue] NONE
3[star t-shirt] [small] [red] NONE
4[beautiful rainbow skirt small] [small] NONE NONE
5[long maxwell logan jeans] [light blue] [32L] [28W]
6[long maxwell logan jeans] [Dark blue] [32L] [28W]`
Here is the code I was helped with in a previous question I asked:
`df['onkey'] = 1
df1 = pd.merge(df[['name','onkey']],df[['name','onkey']], on='onkey')
df1['list'] = df1.apply(lambda x:[x.name_x,x.name_y],axis=1)
from os.path import commonprefix
df1['COL1'] = df1['list'].apply(lambda x:commonprefix(x))
df1['COL1_num'] = df1['COL1'].apply(lambda x:len(x))
df1 = df1[(df1['COL1_num']!=0)]
df1 = df1.loc[df1.groupby('name_x')['COL1_num'].idxmin()]
df = df.rename(columns ={'name':'name_x'})
df = pd.merge(df,df1[['name_x','COL1']],on='name_x',how ='left')`
`df['len'] = df['COL1'].apply(lambda x: len(x))
df['other'] = df.apply(lambda x: x.name_x[x.len:],axis=1)
df['COL1'] = df['COL1'].apply(lambda x: x.strip())
df['COL1'] = df['COL1'].apply(lambda x: x[:-1] if x[-1]=='-' else x)
df['other'] = df['other'].apply(lambda x:x.split('-'))
df = df[['COL1','other']]
df = pd.concat([df['COL1'],df['other'].apply(pd.Series)],axis=1)`
` COL1 0 1 2
0 star t-shirt large red NaN
1 star t-shirt large blue NaN
2 star t-shirt small red NaN
3 beautiful rainbow skirt small NaN NaN
4 long maxwell logan jeans light blue 32L 28W
5 long maxwell logan jeans Dark blue 32L 28W`
***************update*****************
This is the input list of products; some have variants and some don't.
When searching for duplicate strings to determine which products have variants and which don't, nothing comes up, because they are all seen as unique values due to the variants being appended to the end of the string.
So what I would like to do is group the partial or similar strings together (the longest match), extract the longest matching string within the group, and then put the differences into other columns.
If the product/string is unique, just print it into the column with the extracted longest string.
star t-shirt-large-red
star t-shirt-large-blue
star t-shirt-small-red
beautiful rainbow skirt small
long maxwell logan jeans- light blue -32L-28W
long maxwell logan jeans- Dark blue -32L-28W
Organic and natural candy - 3 Pack - Mint
Organic and natural candy - 3 Pack - Vanilla
Organic and natural candy - 3 Pack - Strawberry
Organic and natural candy - 3 Pack - Chocolate
Organic and natural candy - 3 Pack - Banana
Organic and natural candy - 3 Pack - Cola
Organic and natural candy - 12 Pack Assorted
Morgan T-shirt Company - Small/Medium-Blue
Morgan T-shirt Company - Medium/Large-Blue
Morgan T-shirt Company - Medium/Large-red
Morgan T-shirt Company - Small/Medium-Red
Morgan T-shirt Company - Small/Medium-Green
Morgan T-shirt Company - Medium/Large-Green
Nelly dress leopard small
col1 col2 col3 col4
star t-shirt large red
star t-shirt large blue
star t-shirt small red
beautiful rainbow skirt small
Long maxwell logan jeans light blue 32L 28W
Long maxwell logan jeans Dark blue 32L 28W
Organic and natural candy 3 Pack Mint
Organic and natural candy 3 Pack Vanilla
Organic and natural candy 3 Pack Strawberry
Organic and natural candy 3 Pack Chocolate
Organic and natural candy 3 Pack Banana
Organic and natural candy 3 Pack Cola
Organic and natural candy 12 Pack Assorted
Morgan T-shirt Company Small/Medium Blue
Morgan T-shirt Company Medium/Large Blue
Morgan T-shirt Company Medium/Large Red
Morgan T-shirt Company Small/Medium Red
Morgan T-shirt Company Small/Medium Green
Morgan T-shirt Company Medium/Large Green
Nelly dress Leopard Small
Bijoux
Princess PJ-set
Lemon tank top Yellow Medium

Construct a DataFrame df as follows:
import pandas as pd
df = pd.DataFrame({'Product': [
    '1.star t-shirt-large-red',
    '2.star t-shirt-large-blue',
    '4.beautiful rainbow skirt small',
    '5.long maxwell logan jeans- light blue -32L-28W',
    '6.long maxwell logan jeans- Dark blue -32L-28W',
]})
The following code
(a) strips any whitespace,
(b) splits by the period ('.') and grabs what follows,
(c) replaces 't-shirt' with 'tshirt' so that its hyphen is not treated as a separator (change this back after the operation if you want),
(d) splits again by '-' and expands the parts into columns to give your dataframe.
df['Product'].str.strip().str.split('.').str.get(1).str.replace('t-shirt', 'tshirt').str.split('-', expand = True)
Output:
0 1 2 3
0 star tshirt large red None
1 star tshirt large blue None
2 beautiful rainbow skirt small None None None
3 long maxwell logan jeans light blue 32L 28W
4 long maxwell logan jeans Dark blue 32L 28W
Given the inconsistency in nomenclature for your products, there will be edge cases that are missed (e.g. beautiful rainbow skirt small). You may have to fish them out again.

A solution which is quite simple to understand, debug and flexibly extend is the following:
Consider that your initial product names are held in a list called strings.
Then the solution is the following line:
mydf = pd.concat([pd.DataFrame([make_row(row, 4)], columns=['COL1', 'COL2', 'COL3', 'COL4']) for row in strings], ignore_index=True)
where we have defined the parsing function make_row to be:
def make_row(string, num_cols):
    cols = [item.strip() for item in string[2:].split('-')]  # ignore numbering, split on hyphen and strip whitespace
    if len(cols) < num_cols:
        cols += [np.nan]*(num_cols - len(cols))  # fill with NaN missing values
    return cols
The first line defining cols could also be simply cols = string.split('-'), in which case you could do the formatting afterwards with:
mydf.applymap(lambda x: x if pd.isnull(x) else str.strip(x))
Now in your case, I see that there is a hyphen in some of your product names, in which case you might want to 'sanitize' them in advance (or inside make_row, as you wish), with something like:
strings = [item.replace('t-shirt', 'tshirt') for item in strings]
Example input:
strings = ['1.one-two-three', '2. one-two', '3.one-two-three-four', '4.one - two -three -four ']
Output:
COL1 COL2 COL3 COL4
0 one two three NaN
1 one two NaN NaN
2 one two three four
3 one two three four
Output for question's data (after correcting typo for item 4):
COL1 COL2 COL3 COL4
0 star tshirt large red NaN
1 star tshirt large blue NaN
2 star tshirt small red NaN
3 beautiful rainbow skirt small NaN NaN
4 long maxwell logan jeans light blue 32L 28W
5 long maxwell logan jeans Dark blue 32L 28W
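Putting the pieces of this answer together, a minimal runnable sketch could look like the following. It departs from the code above in three labeled ways, all of which are illustrative assumptions: the numbering is stripped with a regex instead of string[2:] (so multi-digit prefixes also work), the 't-shirt' sanitizing happens inside make_row, and rows with more parts than num_cols are truncated.

```python
import re

import numpy as np
import pandas as pd

strings = [
    "1.star t-shirt-large-red",
    "2.star t-shirt-large-blue",
    "3.star t-shirt-small-red",
    "5.long maxwell logan jeans- light blue -32L-28W",
]

def make_row(string, num_cols):
    # Drop a leading "N." numbering of any length (assumption: digits then a dot).
    name = re.sub(r"^\d+\.\s*", "", string)
    # Protect the hyphen inside the product name before splitting.
    name = name.replace("t-shirt", "tshirt")
    cols = [item.strip() for item in name.split("-")]
    # Pad short rows with NaN; truncate long rows (an illustrative choice).
    cols += [np.nan] * (num_cols - len(cols))
    return cols[:num_cols]

# Build the frame in one call instead of concatenating one-row frames.
mydf = pd.DataFrame(
    [make_row(s, 4) for s in strings],
    columns=["COL1", "COL2", "COL3", "COL4"],
)
print(mydf)
```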
Edit:
If you additionally want to "group" the items together, then you can:
a) Use sort_values (pandas doc) on the column COL1 after you get a dataframe as described above, to simply display the rows corresponding to the same product one after the other, or
b) use groupby to actually get a grouped dataframe like this:
grouped_df = mydf.groupby("COL1")
This will allow you to get each group like this:
grouped_df.get_group("star tshirt")
This produces the following output:
COL1 COL2 COL3 COL4
0 star tshirt large red NaN
1 star tshirt large blue NaN
2 star tshirt small red NaN
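Beyond get_group, the grouped object can also summarize each product directly. The sketch below, on a small hand-made frame (the data and the names `sizes` and `variants` are illustrative), counts the rows per product and collects the COL2 variants of each product into a list.

```python
import pandas as pd

mydf = pd.DataFrame({
    "COL1": ["star tshirt", "star tshirt", "long maxwell logan jeans"],
    "COL2": ["large", "small", "light blue"],
})

grouped_df = mydf.groupby("COL1")

# Number of rows (variants) per product.
sizes = grouped_df.size()

# All COL2 values per product, gathered into one list each.
variants = grouped_df["COL2"].agg(list)

print(sizes)
print(variants)
```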
