I have a dataframe with 3 columns. Column A contains titles of a lot of products, Column B contains all brand names, and Column C contains models/series of all products. Column A has 2000+ rows, Column B about 50 rows, and Column C about 200 rows. I want to create a new Column D that categorizes whether the title in Column A includes a brand, a model, or is generic.
Example of my dataframe and the desired result in Column D:
Column A        Column B    Column C      Column D
Running shoes   Nike        Airmax 2      Generic
Nike airmax 2   Adidas      All stars     Model/series
Airmax 2        Converse    Ultraboost    Model/series
Nike Shoes      Puma        Questar       Brand
If a row in Column A contains both a brand and a model, I want Column D to categorize the row as Model/series. All rows in Column A that cannot be matched with a brand or model/series should be categorized as Generic.
I began trying with this:
df['Column D'] = df.apply(lambda x: x.Column_b in x.Column_a, axis=1)
Here I got an error because Column B has far fewer rows than Column A.
Then I wondered whether looping is even the right way to do it, or whether I need a regex.
Any help on how to get the desired Column D would be highly appreciated.
Use Series.str.contains to create a boolean mask m1 whose truthy values correspond to the condition where Column A contains a value from Column B; in a similar manner create the boolean mask m2. Then use np.select to pick values from the choices based on the conditions m1 and m2:
# dropna() skips the NaN padding from the shorter columns B and C
m1 = df['Column A'].str.contains('|'.join(df['Column B'].dropna()), case=False)
m2 = df['Column A'].str.contains('|'.join(df['Column C'].dropna()), case=False)
df['Column D'] = np.select(
    [m1 & m2, m1, m2], ['Model/series', 'Brand', 'Model/series'], 'Generic')
# print(df)
Column A Column B Column C Column D
0 Running shoes Nike Airmax 2 Generic
1 Nike airmax 2 Adidas All stars Model/series
2 Airmax 2 Converse Ultraboost Model/series
3 Nike Shoes Puma Questar Brand
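One caveat worth adding: str.contains treats the joined string as a regular expression, so if any brand or model name contains regex metacharacters (e.g. '+' or '('), escape the values first. A small sketch:
import re
# re.escape neutralises metacharacters so names match literally
pattern_b = '|'.join(map(re.escape, df['Column B'].dropna()))
m1 = df['Column A'].str.contains(pattern_b, case=False)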
Maybe something like:
df['D'] = ['Brand' if x in df['B'].values
           else 'Model/Series' if x in df['C'].values
           else 'Generic'
           for x in df['A']]
I'm not 100% sure whether your data can contain both a Column B and a Column C instance inside a Column A instance, but if so it's trivial to add another branch inside the list comprehension to catch both.
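Note this checks exact equality, while the example titles only contain the brand/model as a substring ('Nike airmax 2' contains but does not equal 'Airmax 2'). A case-insensitive substring variant of the same idea, as a sketch using the question's column names:
brands = [b.lower() for b in df['Column B'].dropna()]
models = [m.lower() for m in df['Column C'].dropna()]

def categorize(title):
    t = title.lower()
    if any(m in t for m in models):
        return 'Model/series'  # a model match wins even when a brand also appears
    if any(b in t for b in brands):
        return 'Brand'
    return 'Generic'

df['Column D'] = df['Column A'].map(categorize)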
DF1 = [
    Column A   Column B
    Cell 1     Cell 2
    Cell 3     Cell 4
    ,
    Column A   Column B
    Cell 1     Cell 2
    Cell 3     Cell 4
]
DF2 = [ NY, FL ]
In this case DF1 and DF2 each have two entries.
The result I am looking for is the following:
Main_DF = [
    Column A   Column B   Column C
    Cell 1     Cell 2     NY
    Cell 3     Cell 4     NY
    ,
    Column A   Column B   Column C
    Cell 1     Cell 2     FL
    Cell 3     Cell 4     FL
]
I tried to use pd.concat, assign, and insert, but none of them gives me the result I'm looking for.
Lists hold references to the dataframes, so you can amend the dataframes without needing to amend the list at all.
So I'd do something like...
for df, val in zip(DF1, DF2):
    df['Column C'] = val
Using zip allows you to iterate through the two lists in sync with each other:
the 1st element of DF1 goes into df, and the 1st element of DF2 goes into val
the 2nd element of DF1 goes into df, and the 2nd element of DF2 goes into val
and so on.
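A minimal self-contained sketch of the idea (the dataframe contents are assumed from the question):
import pandas as pd

DF1 = [
    pd.DataFrame({'Column A': ['Cell 1', 'Cell 3'], 'Column B': ['Cell 2', 'Cell 4']}),
    pd.DataFrame({'Column A': ['Cell 1', 'Cell 3'], 'Column B': ['Cell 2', 'Cell 4']}),
]
DF2 = ['NY', 'FL']

# each dataframe in DF1 gets the matching value from DF2 as its Column C
for df, val in zip(DF1, DF2):
    df['Column C'] = val

print(DF1[0])  # every row of the first dataframe now carries 'NY' in Column C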
I want to extract whatever comes after '-'.
My data in Column A looks like:
Column A   Column B
001-3      5
002-14     6
What I want is:
Column A   Column B   Column C
001-3      3          5
002-14     14         6
Is there any function in Python like SAS's scan, so I can extract what comes after the '-' in Column A and place it in Column B, after moving B to C?
# reassign Column B to Column C
df['Column C'] = df['Column B']
# split from the right, limited to a single split, then take the right-hand piece as the new Column B
df['Column B'] = df['Column A'].str.rsplit('-', n=1, expand=True)[1]
df
Column A Column B Column C
0 001-3 3 5
1 002-14 14 6
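An alternative with a regex, in case you prefer str.extract (a sketch; it assumes each value contains exactly one '-'):
df['Column C'] = df['Column B']
df['Column B'] = df['Column A'].str.extract(r'-(.+)$', expand=False)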
According to:
How to use pandas to rename rows when they are the same in column A?
My dataframe is:
I want to use pandas to rename Hospital when rows with the same value in the Hospital column have different values in the GeneralRepresentation column. When rows with the same Hospital value also share the same GeneralRepresentation value, Hospital should not be renamed. And for hospitals without a GeneralRepresentation, the hospital name should stay the same.
The effect I want is shown below:
When I use Beny's code in How to use pandas to rename rows when they are the same in column A?:
g = df.groupby('Hospital')['GeneralRepresentation']
s1 = g.transform(lambda x: x.factorize()[0] + 1).astype(str)
s2 = g.transform('nunique')
df['Hospital'] = np.where(s2 == 1, df['Hospital'], df['Hospital'] + '_' + s1)
The effect is shown below:
But what I want is for the name of the hospital to remain the same when a hospital does not have a GeneralRepresentation (the effect in the second picture). How do I modify this code to fulfil my requirement?
The problem is the missing values: factorize codes missing values as -1, so adding 1 yields 0 for the last two rows. In my solution NaN is replaced with an empty string before the groupby to prevent this.
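A quick demonstration of that factorize behaviour (plain pandas, nothing assumed beyond the data above):
import pandas as pd
import numpy as np

codes, uniques = pd.factorize(pd.Series(['b', 'c', np.nan]))
print(codes)  # [ 0  1 -1] -- the missing value is coded as -1, so +1 yields 0
With the NaN values replaced by empty strings first, the original approach works: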
g = df.fillna({'GeneralRepresentation': ''}).groupby('Hospital')['GeneralRepresentation']
s1 = g.transform(lambda x: x.factorize()[0] + 1).astype(str)
s2 = g.transform('nunique')
df['Hospital'] = np.where(s2 == 1, df['Hospital'], df['Hospital'] + '_' + s1)
print (df)
Hospital GeneralRepresentation
0 a a
1 b_1 b
2 b_2 c
3 c_1 d
4 c_2 e
5 d NaN
6 t NaN
Use np.select(list of conditions, list of choices, alternative):
a = ~df['GeneralRepresentation'].str.contains(r'\w', na=False)
b = (df['GeneralRepresentation'].str.contains(r'\w', na=False)
     & df['Hospital'].duplicated(keep=False)
     & df['GeneralRepresentation'].duplicated(keep=False))
df['Hospital'] = np.select(
    [a, b],
    [df['Hospital'] + '_' + (df.groupby('Hospital').cumcount() + 1).astype(str), ''],
    df['Hospital'])
Hi I have 2 data sets:
Data A:
Column A Column B Column C
Hello NaN John
Bye NaN Mike
Data B:
Column A Column B
Hello 123
Raw data:
a = pd.DataFrame([['Hello', np.nan, 'John'], ['Bye', np.nan, 'Mike']], columns=['Column A', 'Column B', 'Column C'])
b = pd.DataFrame([['Hello', 123]], columns=['Column A', 'Column B'])
I want to merge Data A & B using a left join (Data A should be the main data, only bringing in rows of Data B with a matching Column A), and I want to bring Data B's Column B values into Data A's Column B.
The columns match, but my script below results in two Column B's.
df = a.merge(b, on='Column A', how='left')
df:
  Column A  Column B_x Column C  Column B_y
0    Hello         NaN     John       123.0
1      Bye         NaN     Mike         NaN
I want the following result:
Column A Column B Column C
Hello 123 John
Bye NaN Mike
Please note I need to insert Column B's data by matching on Column A, not just push Data B into Data A in exact row order. The code needs to find the Column A match regardless of which row it's in and insert the values appropriately.
You don't need a merge for this, as a merge brings the columns of the two dataframes together. Since your dataframes follow the same structure, use fillna or update:
a.fillna(b, inplace=True)  # fillna is not in place unless you specify inplace=True
a.update(b)                # modifies NaN in place using non-NA values from another DataFrame
print(a)
Column A Column B Column C
0 Hello 123.0 John
1 Bye NaN Mike
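One caveat: both fillna and update align on the index, so this works here only because 'Hello' happens to be row 0 in both frames. To match on Column A values regardless of row order (as the question asks), set Column A as the index first; a minimal sketch, assuming Column A values are unique in both frames:
a = a.set_index('Column A')
a.update(b.set_index('Column A'))  # aligns on Column A values, not row position
a = a.reset_index()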
I am working in Google Colaboratory and I have two columns in a pandas dataframe where some of the rows have the same value:
A B
Syd Syd
Aus Del
Mir Ard
Dol Dol
I want the values in column B that duplicate the value in column A to be deleted, like below:
A B
Syd
Aus Del
Mir Ard
Dol
I tried to use drop_duplicates() as in Remove duplicates from dataframe, based on two columns A,B, keeping row with max value in another column C, but it deletes the entire column B. Any suggestions for smarter ways to solve this problem?
Thanks in advance!
There is no need to use drop_duplicates; you can simply compare column A with B, then mask the values in B where they are equal to A:
df['B'] = df['B'].mask(df['A'].eq(df['B']))
Alternatively, you can use boolean indexing with loc to mask the duplicated values:
df.loc[df['A'].eq(df['B']), 'B'] = np.nan
A B
0 Syd NaN
1 Aus Del
2 Mir Ard
3 Dol NaN
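Series.mask replaces the matched values with NaN by default; if you prefer blanks, as in the desired output, pass a replacement value:
df['B'] = df['B'].mask(df['A'].eq(df['B']), '')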