Merging 2 dataframes using similar columns - python

I have 2 dataframes, listed as follows (the first is referred to as df1 in the answers below):
df1
Type     Breed   Common Color  Other Color  Behaviour
Golden   Big     Gold          White        Fun
Corgi    Small   Brown         White        Crazy
Bulldog  Medium  Black         Grey         Strong
df2
Type            Breed  Behaviour   Bark Sound
Pug             Small  Sleepy      Ak
German Shepard  Big    Cool        Woof
Puddle          Small  Aggressive  Ek
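For reproducibility, a minimal sketch that builds these two frames:
import pandas as pd

df1 = pd.DataFrame({'Type': ['Golden', 'Corgi', 'Bulldog'],
                    'Breed': ['Big', 'Small', 'Medium'],
                    'Common Color': ['Gold', 'Brown', 'Black'],
                    'Other Color': ['White', 'White', 'Grey'],
                    'Behaviour': ['Fun', 'Crazy', 'Strong']})
df2 = pd.DataFrame({'Type': ['Pug', 'German Shepard', 'Puddle'],
                    'Breed': ['Small', 'Big', 'Small'],
                    'Behaviour': ['Sleepy', 'Cool', 'Aggressive'],
                    'Bark Sound': ['Ak', 'Woof', 'Ek']})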
I want to merge the 2 dataframes on the columns Type, Breed and Behaviour.
Therefore, my desired output would be:
Type            Breed   Behaviour
Golden          Big     Fun
Corgi           Small   Crazy
Bulldog         Medium  Strong
Pug             Small   Sleepy
German Shepard  Big     Cool
Puddle          Small   Aggressive

You need concat:
print (pd.concat([df1[['Type','Breed','Behaviour']],
                  df2[['Type','Breed','Behaviour']]], ignore_index=True))
Type Breed Behaviour
0 Golden Big Fun
1 Corgi Small Crazy
2 Bulldog Medium Strong
3 Pug Small Sleepy
4 German Shepard Big Cool
5 Puddle Small Aggressive
More general is to use the intersection of the columns of both DataFrames:
cols = df1.columns.intersection(df2.columns)
print (cols)
Index(['Type', 'Breed', 'Behaviour'], dtype='object')
print (pd.concat([df1[cols], df2[cols]], ignore_index=True))
Type Breed Behaviour
0 Golden Big Fun
1 Corgi Small Crazy
2 Bulldog Medium Strong
3 Pug Small Sleepy
4 German Shepard Big Cool
5 Puddle Small Aggressive
More generally, if df1 and df2 themselves have no NaN values, you can concat everything and then use dropna to remove the columns that picked up NaN:
print (pd.concat([df1, df2], ignore_index=True))
  Bark Sound   Behaviour   Breed Common Color Other Color            Type
0        NaN         Fun     Big         Gold       White          Golden
1        NaN       Crazy   Small        Brown       White           Corgi
2        NaN      Strong  Medium        Black        Grey         Bulldog
3         Ak      Sleepy   Small          NaN         NaN             Pug
4       Woof        Cool     Big          NaN         NaN  German Shepard
5         Ek  Aggressive   Small          NaN         NaN          Puddle
print (pd.concat([df1, df2], ignore_index=True).dropna(axis=1))
Behaviour Breed Type
0 Fun Big Golden
1 Crazy Small Corgi
2 Strong Medium Bulldog
3 Sleepy Small Pug
4 Cool Big German Shepard
5 Aggressive Small Puddle

Using join, dropping the columns that don't overlap:
df1.T.join(df2.T, lsuffix='_').dropna().T.reset_index(drop=True)
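This works because transposing turns column names into index labels: join aligns the two transposed frames on those labels, rows coming from non-shared columns pick up NaN, and dropna() removes them before transposing back. Spelled out step by step (a sketch, using df1/df2 as above):
t = df1.T.join(df2.T, lsuffix='_')   # index = df1's columns; lsuffix avoids the 0/1/2 column clash
t = t.dropna()                       # keep only the labels present in both frames
result = t.T.reset_index(drop=True)  # transpose back to rows with a fresh RangeIndex
print(result)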

Related

How to collapse pandas rows for select column values to minimal combinations and map back to original rows

Context:
I have a pandas dataframe with 7 columns (taste, color, temperature, texture, shape, age_of_participant, name_of_participant).
Of the 7 columns, taste, color, temperature, texture and shape can have overlapping values across multiple rows (i.e. taste could be sour for more than one row).
I'm trying to collapse all the rows into the lowest number of combinations given the taste, color, temperature, texture and shape values, while ignoring NAs (in other words, overwriting them). The next part is to map each of these rows back to the original rows.
Mock data set:
data_set = [
    {'color': 'brown', 'age_of_participant': 23, 'name_of_participant': 'feb'},
    {'taste': 'sour', 'color': 'green', 'temperature': 'hot', 'age_of_participant': 16, 'name_of_participant': 'joe'},
    {'taste': 'sour', 'color': 'green', 'texture': 'soft', 'age_of_participant': 17, 'name_of_participant': 'jane'},
    {'color': 'green', 'age_of_participant': 18, 'name_of_participant': 'jeff'},
    {'taste': 'sweet', 'color': 'red', 'age_of_participant': 19, 'name_of_participant': 'joke'},
    {'taste': 'sweet', 'temperature': 'cold', 'age_of_participant': 20, 'name_of_participant': 'jolly'},
    {'taste': 'salty', 'color': 'purple', 'texture': 'soft', 'age_of_participant': 21, 'name_of_participant': 'jupyter'},
    {'taste': 'salty', 'color': 'brown', 'age_of_participant': 22, 'name_of_participant': 'january'}
]
import pandas as pd
import random
data_set = random.sample(data_set, k=len(data_set))
data_frame = pd.DataFrame(data_set)
print(data_frame)
age_of_participant color name_of_participant taste temperature texture
0 16 green joe sour hot NaN
1 17 green jane sour NaN soft
2 18 green jeff NaN NaN NaN
3 19 red joke sweet NaN NaN
4 20 NaN jolly sweet cold NaN
5 21 purple jupyter salty NaN soft
6 22 brown january salty NaN NaN
What I've attempted:
# These columns are used to do the grouping since age_of_participant and name_of_participant are unique per row
values_that_can_be_grouped = ['taste', 'color','temperature','texture']
sub_set = data_frame[values_that_can_be_grouped].drop_duplicates().reset_index(drop=False)
my_unique_set = sub_set.groupby('taste', as_index=False).first()
print(my_unique_set)
   taste  index  color temperature texture
0             2  green
1  salty      6  brown
2   sour      1  green                soft
3  sweet      4               cold
At this point I'm not quite sure how I can map the rows above back to all the original rows, except for indices 2, 6, 1, 4. I checked the pandas code and it doesn't look like the other indices are preserved anywhere?
What I'm trying to achieve:
age_of_participant color name_of_participant taste temperature texture
0 16 green joe sour hot soft
1 17 green jane sour hot soft
2 18 green jeff sour hot soft
3 19 red joke sweet cold NaN
4 20 red jolly sweet cold NaN
5 21 purple jupyter salty NaN soft
6 22 brown january salty NaN NaN
One approach: seed the one missing color with a global forward fill (so 'jolly' inherits 'red'), then forward/back fill every other column within each color group:
data_frame.assign(color=data_frame.color.ffill()).groupby('color').apply(lambda x: x.ffill().bfill())
Out[1089]:
age_of_participant color name_of_participant taste temperature texture
0 16 green joe sour hot soft
1 17 green jane sour hot soft
2 18 green jeff sour hot soft
3 19 red joke sweet cold NaN
4 20 red jolly sweet cold NaN
5 21 purple jupyter salty NaN soft
6 22 brown january salty NaN NaN
IIUC, I feel it is safer here to first fill taste and color from each other with ffill and bfill, then groupby both:
df.taste.fillna(df.groupby('color').taste.apply(lambda x: x.ffill().bfill()), inplace=True)
df.color.fillna(df.groupby('taste').color.apply(lambda x: x.ffill().bfill()), inplace=True)
df = df.groupby(['color','taste']).apply(lambda x: x.ffill().bfill())
df
age_of_participant color ... temperature texture
0 16 green ... hot soft
1 17 green ... hot soft
2 18 green ... hot soft
3 19 red ... cold NaN
4 20 red ... cold NaN
5 21 purple ... NaN soft
6 22 brown ... NaN NaN
[7 rows x 6 columns]
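A variant of the same idea, as a hedged sketch (assuming the data_frame built in the question): transform keeps the original row order and index, which makes the per-column fill explicit.
filled = data_frame.copy()
filled['color'] = filled['color'].ffill()  # seed the one missing color ('jolly' inherits 'red')
for col in ['taste', 'temperature', 'texture']:
    # fill each attribute within its color group, preserving row order
    filled[col] = filled.groupby('color')[col].transform(lambda s: s.ffill().bfill())
print(filled)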

Fill column with conditional mode of another column

Given the below list, I'd like to fill in the 'Color Guess' column with the mode of the 'Color' column conditional on 'Type' and 'Size' and ignoring NULL, #N/A, etc.
For example, what's the most common color for SMALL CATS, what's the most common color for MEDIUM DOGS, etc.
Type  Size    Color  Color Guess
Cat   small   brown
Dog   small   black
Dog   large   black
Cat   medium  white
Cat   medium  #N/A
Dog   large   brown
Cat   large   white
Cat   large   #N/A
Dog   large   brown
Dog   medium  #N/A
Cat   small   #N/A
Dog   small   white
Dog   small   black
Dog   small   brown
Dog   medium  white
Dog   medium  #N/A
Cat   large   brown
Dog   small   white
Dog   large   #N/A
As BarMar already stated in the comments, we can use pd.Series.mode here, from the linked answer. The only trick is that we have to use groupby.transform, since we want the data back in the same shape as your dataframe:
df['Color Guess'] = df.groupby(['Type', 'Size'])['Color'].transform(lambda x: pd.Series.mode(x)[0])
Type Size Color Color Guess
0 Cat small brown brown
1 Dog small black black
2 Dog large black brown
3 Cat medium white white
4 Cat medium NaN white
5 Dog large brown brown
6 Cat large white brown
7 Cat large NaN brown
8 Dog large brown brown
9 Dog medium NaN white
10 Cat small NaN brown
11 Dog small white black
12 Dog small black black
13 Dog small brown black
14 Dog medium white white
15 Dog medium NaN white
16 Cat large brown brown
17 Dog small white black
18 Dog large NaN brown
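One caveat worth hedging: if an entire Type/Size group has no non-null Color, pd.Series.mode returns an empty Series and indexing [0] raises. A guarded sketch (safe_mode is a hypothetical helper, not part of the answer above):
import pandas as pd

def safe_mode(s):
    m = s.mode()                                 # mode() ignores NaN by default
    return m.iloc[0] if not m.empty else pd.NA   # fall back to NA for all-null groups

df['Color Guess'] = df.groupby(['Type', 'Size'])['Color'].transform(safe_mode)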

Pandas combine multiple columns (with NoneType)

My apologies if this has been asked/answered before, but I couldn't find an answer to my problem after some time searching.
Very simply put, I would like to combine multiple columns into one, separated with a ','.
The problem is that some cells are empty (NoneType), and when combining them I get either:
TypeError: ('sequence item 3: expected str instance, NoneType found', 'occurred at index 0')
or,
when .map(str) is added, it literally adds 'None' for every NoneType value (as kinda expected).
Let's say I have a production dataframe looking like
       0      1     2
1   Rice
2  Beans   Rice
3   Milk  Beans  Rice
4  Sugar   Rice
What I would like is a single column with the values
Production
1 Rice
2 Beans, Rice
3 Milk, Beans, Rice
4 Sugar, Rice
With some searching and tweaking I added this code:
testColumn = productionFrame.iloc[::].apply(lambda x: ', '.join(x), axis=1)
Which produces problem 1
or changed it like this:
testColumn = productionFrame.iloc[::].apply(lambda x: ', '.join(x.map(str)), axis=1)
Which produces problem 2
Maybe it's good to add that I'm very new and kinda exploring Pandas/Python right now. So any help or push in the right direction is much appreciated!
pd.Series.str.cat should work here; called with just sep, it skips NaN values by default:
df
Out[43]:
0 1 2
1 Rice NaN NaN
2 Beans Rice NaN
3 Milk Beans Rice
4 Sugar Rice NaN
df.apply(lambda x: x.str.cat(sep=', '), axis=1)
Out[44]:
1 Rice
2 Beans, Rice
3 Milk, Beans, Rice
4 Sugar, Rice
dtype: object
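For completeness, a hedged aside: if the missing entries should stay visible instead of being dropped, str.cat takes na_rep:
# Represent missing cells explicitly instead of skipping them
df.apply(lambda x: x.str.cat(sep=', ', na_rep='-'), axis=1)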
You can use str.join after replacing NaN values with empty strings, filtering out the empties so they don't produce stray commas:
res = df.fillna('').apply(lambda x: ', '.join(filter(None, x)), axis=1)
print(res)
0 Rice
1 Beans, Rice
2 Milk, Beans, Rice
3 Sugar, Rice
dtype: object
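Another way to sidestep the None/NaN problem entirely, sketched against the same df: drop the missing values per row before joining.
# Each row: keep only the present values, then join them
res = df.apply(lambda row: ', '.join(row.dropna()), axis=1)
print(res)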

Find phrases stored in a dataframe in sentences found in another dataframe

assuming I have 2 dataframes:
sub = pd.DataFrame(['Little Red', 'Grow Your', 'James Bond', 'Tom Brady'])
text = pd.DataFrame(['Little Red Corvette must Grow Your ego', 'Grow Your Beans', 'James Dean and his Little Red coat', 'I love pasta'])
One contains various subjects and the other the text from which I should be able to extract the subjects.
I want the output of text dataframe to be:
Text | Subjects
Little Red Corvette must Grow Your ego | Little Red, Grow Your
Grow Your Beans | Grow Your
James Dean and his Little Red coat | Little Red
I love pasta | NaN
Any idea how I can achieve this?
I was looking at this question: Check if words in one dataframe appear in another (python 3, pandas)
but it is not exactly my desired output. Thank you.
Use str.findall with a pattern built by joining all values of sub with |, each wrapped in regex word boundaries:
pat = '|'.join(r"\b{}\b".format(x) for x in sub[0])
text['new'] = text[0].str.findall(pat).str.join(', ')
print (text)
0 new
0 Little Red Corvette must Grow Your ego Little Red, Grow Your
1 Grow Your Beans Grow Your
2 James Dean and his Little Red coat Little Red
3 I love pasta
If you want NaN for unmatched values, use loc:
pat = '|'.join(r"\b{}\b".format(x) for x in sub[0])
lists = text[0].str.findall(pat)
m = lists.astype(bool)
text.loc[m, 'new'] = lists.loc[m].str.join(',')
print (text)
0 new
0 Little Red Corvette must Grow Your ego Little Red,Grow Your
1 Grow Your Beans Grow Your
2 James Dean and his Little Red coat Little Red
3 I love pasta NaN
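A hedged caveat: the subject phrases go into the pattern verbatim, so any regex metacharacter in them (., +, parentheses) would change its meaning; escaping keeps the match literal.
import re

# Same pattern as above, but with each phrase escaped so it matches literally
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in sub[0])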

Partial string matching on large data set

So I have been working on this for a while and just haven't gotten anywhere, and I'm not sure what to do. Fairly new to pandas and python.
The data set is actually 15,000 product names, all in different formats: some have multiple dashes (up to 6), some hyphens, and different lengths. The rows are product names with variants.
The code I'm using keeps returning only the first letter, as opposed to the partial string, when I use it on a large data set.
It works just fine on the small data set I was using to test it.
I'm assuming this is happening because:
I haven't created a stop section for when it matches a full partial string,
because it's trying to match up words as opposed to individual characters and stopping when it finds a difference.
What is the best way to overcome this on a large data set? What am I missing? Or am I going to have to do this manually?
Original test data set
1.star t-shirt-large-red
2.star t-shirt-large-blue
3.star t-shirt-small-red
4.beautiful rainbow skirt small
5.long maxwell logan jeans- light blue -32L-28W
6.long maxwell logan jeans- Dark blue -32L-28W
Desired data set/output:
COL1 COL2 COL3 COL4
1[star t-shirt] [large] [red] NONE
2[star t-shirt] [large] [blue] NONE
3[star t-shirt] [small] [red] NONE
4[beautiful rainbow skirt small] [small] NONE NONE
5[long maxwell logan jeans] [light blue] [32L] [28W]
6[long maxwell logan jeans] [Dark blue] [32L] [28W]
Here is the code I was helped with in a previous question I asked:
df['onkey'] = 1
# cross join every name against every other name
df1 = pd.merge(df[['name','onkey']], df[['name','onkey']], on='onkey')
df1['list'] = df1.apply(lambda x: [x.name_x, x.name_y], axis=1)
from os.path import commonprefix
# longest common prefix of each pair of names
df1['COL1'] = df1['list'].apply(lambda x: commonprefix(x))
df1['COL1_num'] = df1['COL1'].apply(lambda x: len(x))
df1 = df1[(df1['COL1_num'] != 0)]
# for each name, keep its shortest non-empty common prefix
df1 = df1.loc[df1.groupby('name_x')['COL1_num'].idxmin()]
df = df.rename(columns={'name': 'name_x'})
df = pd.merge(df, df1[['name_x', 'COL1']], on='name_x', how='left')

df['len'] = df['COL1'].apply(lambda x: len(x))
# everything after the common prefix holds the variant parts
df['other'] = df.apply(lambda x: x.name_x[x.len:], axis=1)
df['COL1'] = df['COL1'].apply(lambda x: x.strip())
df['COL1'] = df['COL1'].apply(lambda x: x[:-1] if x[-1] == '-' else x)
df['other'] = df['other'].apply(lambda x: x.split('-'))
df = df[['COL1', 'other']]
# expand the variant list into separate columns
df = pd.concat([df['COL1'], df['other'].apply(pd.Series)], axis=1)
  COL1                           0           1     2
0 star t-shirt                   large       red   NaN
1 star t-shirt                   large       blue  NaN
2 star t-shirt                   small       red   NaN
3 beautiful rainbow skirt small  NaN         NaN   NaN
4 long maxwell logan jeans       light blue  32L   28W
5 long maxwell logan jeans       Dark blue   32L   28W
Update:
This is the input list of products; some have variants and some don't.
When searching for duplicate strings to determine which products have variants and which don't, nothing comes up, because they are all seen as unique values due to the variants being appended at the end of the string.
So what I would like to do is group the partial or similar strings together (the longest match), extract the longest matching string within the group, and then put the differences into other columns.
If the product/string is unique, just print it into the column with the extracted longest string.
star t-shirt-large-red
star t-shirt-large-blue
star t-shirt-small-red
beautiful rainbow skirt small
long maxwell logan jeans- light blue -32L-28W
long maxwell logan jeans- Dark blue -32L-28W
Organic and natural candy - 3 Pack - Mint
Organic and natural candy - 3 Pack - Vanilla
Organic and natural candy - 3 Pack - Strawberry
Organic and natural candy - 3 Pack - Chocolate
Organic and natural candy - 3 Pack - Banana
Organic and natural candy - 3 Pack - Cola
Organic and natural candy - 12 Pack Assorted
Morgan T-shirt Company - Small/Medium-Blue
Morgan T-shirt Company - Medium/Large-Blue
Morgan T-shirt Company - Medium/Large-red
Morgan T-shirt Company - Small/Medium-Red
Morgan T-shirt Company - Small/Medium-Green
Morgan T-shirt Company - Medium/Large-Green
Nelly dress leopard small
col1 col2 col3 col4
star t-shirt large red
star t-shirt large blue
star t-shirt small red
beautiful rainbow skirt small
Long maxwell logan jeans light blue 32L 28W
Long maxwell logan jeans Dark blue 32L 28W
Organic and natural candy 3 Pack Mint
Organic and natural candy 3 Pack Vanilla
Organic and natural candy 3 Pack Strawberry
Organic and natural candy 3 Pack Chocolate
Organic and natural candy 3 Pack Banana
Organic and natural candy 3 Pack Cola
Organic and natural candy 12 Pack Assorted
Morgan T-shirt Company Small/Medium Blue
Morgan T-shirt Company Medium/Large Blue
Morgan T-shirt Company Medium/Large Red
Morgan T-shirt Company Small/Medium Red
Morgan T-shirt Company Small/Medium Green
Morgan T-shirt Company Medium/Large Green
Nelly dress Leopard Small
Bijoux
Princess PJ-set
Lemon tank top Yellow Medium
Constructing a DataFrame df as follows:
df = pd.DataFrame()
df = df.append(['1.star t-shirt-large-red'])
df = df.append(['2.star t-shirt-large-blue'])
df = df.append(['4.beautiful rainbow skirt small'])
df = df.append(['5.long maxwell logan jeans- light blue -32L-28W'])
df = df.append(['6.long maxwell logan jeans- Dark blue -32L-28W'])
df.columns = ['Product']
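A hedged side note: DataFrame.append was removed in pandas 2.0, so on current versions the same frame is more simply built in one call:
import pandas as pd

df = pd.DataFrame({'Product': [
    '1.star t-shirt-large-red',
    '2.star t-shirt-large-blue',
    '4.beautiful rainbow skirt small',
    '5.long maxwell logan jeans- light blue -32L-28W',
    '6.long maxwell logan jeans- Dark blue -32L-28W',
]})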
The following code
(a) strips any whitespace,
(b) splits by the period ('.') and grabs what follows,
(c) replaces 't-shirt' with 'tshirt' because of further operations (change this back if you want after the operation)
(d) splits again by '-' and expands to give your dataframe.
df['Product'].str.strip().str.split('.').str.get(1).str.replace('t-shirt', 'tshirt').str.split('-', expand = True)
Output:
0 1 2 3
0 star tshirt large red None
0 star tshirt large blue None
0 beautiful rainbow skirt small None None None
0 long maxwell logan jeans light blue 32L 28W
0 long maxwell logan jeans Dark blue 32L 28W
Given the inconsistency in nomenclature for your products, there will be edge cases that are missed (e.g. beautiful rainbow skirt small). You may have to fish them out again.
A solution which is quite simple to understand, debug and flexibly extend is the following:
Consider that your initial product names are held in a list called strings.
Then the solution is the following line:
mydf = pd.concat([pd.DataFrame([make_row(row, 4)], columns=['COL1', 'COL2', 'COL3', 'COL4']) for row in strings], ignore_index=True)
where we have defined the parsing function make_row to be:
import numpy as np

def make_row(string, num_cols):
    # ignore the leading numbering, split on hyphen and strip whitespace
    cols = [item.strip() for item in string[2:].split('-')]
    if len(cols) < num_cols:
        cols += [np.nan] * (num_cols - len(cols))  # pad missing values with NaN
    return cols
The first line defining cols could also be simply cols = string.split('-'), in which case you could do the whitespace stripping afterwards with:
mydf.applymap(lambda x: x if pd.isnull(x) else str.strip(x))
Now in your case, I see that there is a hyphen in some of your product names, in which case you might want to 'sanitize' them in advance (or inside make_row, as you wish), with something like:
strings = [item.replace('t-shirt', 'tshirt') for item in strings]
Example input:
strings = ['1.one-two-three', '2. one-two', '3.one-two-three-four', '4.one - two -three -four ']
Output:
COL1 COL2 COL3 COL4
0 one two three NaN
1 one two NaN NaN
2 one two three four
3 one two three four
Output for question's data (after correcting typo for item 4):
COL1 COL2 COL3 COL4
0 star tshirt large red NaN
1 star tshirt large blue NaN
2 star tshirt small red NaN
3 beautiful rainbow skirt small NaN NaN
4 long maxwell logan jeans light blue 32L 28W
5 long maxwell logan jeans Dark blue 32L 28W
Edit:
If you additionally want to "group" the items together, then you can:
a) use sort_values (pandas doc) on the column COL1, after you get a dataframe as described above, to simply display the rows corresponding to the same product one after the other, or
b) use groupby to actually get a grouped dataframe like this:
grouped_df = mydf.groupby("COL1")
This will allow you to get each group like this:
grouped_df.get_group("star tshirt")
Producing the following output:
COL1 COL2 COL3 COL4
0 star tshirt large red NaN
1 star tshirt large blue NaN
2 star tshirt small red NaN
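And to visit every product group in turn, a small usage sketch (grouped_df as defined above):
# Iterate over (group key, sub-frame) pairs in key order
for name, group in grouped_df:
    print(name)
    print(group[['COL2', 'COL3', 'COL4']])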
