merging data frames without deleting unique values (Python)

I have what seems like a simple problem, but I can't figure out how to do it...
I have 3 DataFrames:
df1: 1 column, Product SKU
df2: 2 columns, Product SKU, Price (Supplier 1)
df3: 2 columns, Product SKU, Price (Supplier 2)
I need to create a df4.
df4: 3 columns, Product SKU, Supplier 1 Price, Supplier 2 Price
Suppliers 1 and 2 have some matching SKUs.
df4 needs to contain all SKUs and the price from each supplier. When a supplier doesn't have a price for a SKU, it should be 0 or NaN.
Any help would be great. I've tried merge(), join(), concat(), and dropping duplicates, but I can't achieve the result I'm looking for.
Many thanks in advance.

Use: df4 = df2.merge(df3, on="Product_SKU", how='outer')
Code:
I created the sample DataFrames below; the merged df4 contains every unique Product_SKU, and some rows contain NaN values where a price is not present in df2 or df3.
import pandas as pd

# initialize list of lists
data2 = [['clutch', 10], ['brake', 15], ['tyre', 50]]
# create the pandas DataFrame
df2 = pd.DataFrame(data2, columns=['Product_SKU', 'Price_Supplier_1'])

# initialize list of lists
data3 = [['tyre', 30], ['brake', 25], ['gear', 100]]
# create the pandas DataFrame
df3 = pd.DataFrame(data3, columns=['Product_SKU', 'Price_Supplier_2'])

# an outer merge keeps SKUs that appear in either frame
df4 = df2.merge(df3, on="Product_SKU", how='outer')
df4
Output:
Product_SKU Price_Supplier_1 Price_Supplier_2
0 clutch 10.0 NaN
1 brake 15.0 25.0
2 tyre 50.0 30.0
3 gear NaN 100.0
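The question allows either 0 or NaN for a missing price. If you prefer 0, a one-line follow-up on the df4 built above does it:
df4 = df4.fillna(0)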

Related

Python: explode column that contains dictionary

I have a DataFrame that looks like this:
df:
amount info
12 {'id': '1231232', 'type': 'trade', 'amount': 12}
14 {'id': '4124124', 'info': {'operation_type': 'deposit'}}
What I want to achieve is this:
df:
amount type operation_type
12 trade NaN
14 NaN deposit
I have tried the df.explode('info') method but with no luck. Are there any other ways to do this?
We could do it in two steps: (i) build a DataFrame df from the raw data; (ii) use json_normalize on the "info" column and join the result back to df:
data = {'amount': [12, 14],
        'info': [{'id': '1231232', 'type': 'trade', 'amount': 12}, {'id': '4124124', 'info': {'operation_type': 'deposit'}}]}
df = pd.DataFrame(data)
out = df.join(pd.json_normalize(df['info'].tolist())[['type', 'info.operation_type']]).drop(columns='info')
# strip the "info." prefix that json_normalize leaves on nested keys
out.columns = out.columns.map(lambda x: x.split('.')[-1])
Output:
amount type operation_type
0 12 trade NaN
1 14 NaN deposit
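The 'info.operation_type' column name comes from how json_normalize flattens nested dictionaries into dot-separated paths; a minimal illustration of that behaviour:
import pandas as pd
pd.json_normalize([{'a': 1, 'b': {'c': 2}}])
# produces a one-row frame with columns ['a', 'b.c']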

How to write vectorized functions that pull arguments from two dataframes of different size

I am putting together a new, formatted dataframe that aggregates data from a different dataframe. I need to create a column in this new dataframe that filters and aggregates data from a secondary dataframe. I wrote a function that filters the secondary dataframe based on the new column's title and the values from each row of another column in the new dataframe, and then sums a column of the secondary dataframe.
As an example.
df2 = pd.DataFrame({'name': ['alan','sky','liam','liam','alan','liam','alan','sky','bryan','alan','sky'],
                    'age': [1,5,10,15,20,25,30,35,40,45,50],
                    'values': [564,65,4,44,8,60,4,684,51,3,14]})
df1 = pd.DataFrame({'name': ['alan','sky','liam','bryan']})

def get_cumsum_values(person, data, col):
    value = data[data.apply(lambda x: x.age < col and x.name == person, axis=1)].values.sum()
    return value

df1['10'] = df1.apply(lambda x: get_cumsum_values(person=x.name, data=df2, col=10), axis=1)
I'm dealing with a ton of data, and this code takes forever. The culprit seems to be the apply call at the end that creates the new column. Is there a way to use vectorization to get this done?
Why don't you do something like the following (without any .applys):
def get_cumsum_values(names, age, data):
    return (
        data[data.name.isin(names) & (data.age < age)]
        .groupby("name")["values"]
        .sum()
        .rename(str(age))
        .reset_index(drop=False)
    )

df1 = df1.merge(
    get_cumsum_values(df1.name.unique(), 10, df2), on="name", how="left"
)
Result:
name 10
0 alan 564.0
1 sky 65.0
2 liam NaN
3 bryan NaN
Or, you set the name column of df1 as the index, and then do:
df1 = df1.set_index("name")

def get_cumsum_values(names, age, data):
    return (
        data[data.name.isin(names) & (data.age < age)]
        .groupby("name")["values"]
        .sum()
    )

df1['10'] = get_cumsum_values(df1.index.unique(), 10, df2)
df1['35'] = get_cumsum_values(df1.index.unique(), 35, df2)
Result:
10 35
name
alan 564.0 576.0
sky 65.0 65.0
liam NaN 108.0
bryan NaN NaN
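If you need many thresholds, you could build all the cumulative sums in one pass and merge once. A sketch along the same lines, assuming df1 still has its original 'name' column (the thresholds list is illustrative):
import pandas as pd

thresholds = [10, 35]
# one cumulative sum per threshold, assembled column-wise
sums = pd.concat(
    {str(t): df2.loc[df2.age < t].groupby('name')['values'].sum() for t in thresholds},
    axis=1,
)
df1 = df1.merge(sums, left_on='name', right_index=True, how='left')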

How do I merge two datasets on BusinessID and get the final dataset?

I have two datasets, a business file and a review file. How do I group the multiple reviews on business_id so that all the review text for a business ends up in one field?
How do I merge the datasets on BusinessID to get the final dataset shown below?
How can I do this with the pandas library?
You can merge df1 (top-left) with a .groupby version of df2 (top-right):
df3 = df1.merge(df2.groupby('Business_id')['Review_text'].apply(list).reset_index(),
                how='left', on='Business_id').rename({'Review_text': 'All_reviews'}, axis=1)
Out[1]:
Business_id category star Review_count All_reviews
0 1 shopping 3.5 3 [Text_1, Text_2, Text_4]
1 2 restaurant 5.0 1 [Text_3, Text_5]
2 3 Home services 4.0 6 NaN
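If "one text" means a single string rather than a list, the same groupby can aggregate with a string join instead; a small variation on the above, assuming Review_text holds strings:
df3 = df1.merge(df2.groupby('Business_id')['Review_text'].agg(' '.join).reset_index(),
                how='left', on='Business_id').rename({'Review_text': 'All_reviews'}, axis=1)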

Copy values only to a new empty dataframe with column names - Pandas

I have two dataframes.
df1= pd.DataFrame({'person_id':[1,2,3],'gender': ['Male','Female','Not disclosed'],'ethnicity': ['Chinese','Indian','European']})
df2 = pd.DataFrame(columns=['pid','gen','ethn'])
As you can see, the second dataframe (df2) is empty, but it may also contain a few rows of data at times.
What I would like to do is copy the values (only) from df1 to df2, with the column names of df2 remaining unchanged.
I tried the below, but neither worked:
df2 = df1.copy(deep=False)
df2 = df1.copy(deep=True)
How can I achieve my output to be like this? Note that I don't want the column names of df1. I only want the data
Do:
df1.columns = df2.columns.tolist()
df2 = df2.append(df1)  # note: DataFrame.append was removed in pandas 2.0
## OR
df2 = pd.concat([df1, df2])
Output:
pid gen ethn
0 1 Male Chinese
1 2 Female Indian
2 3 Not disclosed European
Edit based on the OP's comment describing the nature of the dataframes:
df1 = pd.DataFrame({'person_id': [1,2,3], 'gender': ['Male','Female','Not disclosed'], 'ethn': ['Chinese','Indian','European']})
df2 = pd.DataFrame({'pers_id': [4,5,6], 'gen': ['Male','Female','Not disclosed'], 'ethnicity': ['Chinese','Indian','European']})
df3 = pd.DataFrame({'son_id': [7,8,9], 'sex': ['Male','Female','Not disclosed'], 'ethnici': ['Chinese','Indian','European']})
final_df = pd.DataFrame(columns=['pid','gen','ethn'])
Now do:
frames = [df1, df2, df3]
for i in range(len(frames)):
    frames[i].columns = final_df.columns.tolist()
    final_df = final_df.append(frames[i])
print(final_df)
Output:
pid gen ethn
0 1 Male Chinese
1 2 Female Indian
2 3 Not disclosed European
0 4 Male Chinese
1 5 Female Indian
2 6 Not disclosed European
0 7 Male Chinese
1 8 Female Indian
2 9 Not disclosed European
The cleanest solution, I think, is to just append df1 after its column names have been set properly:
df2 = df2.append(pd.DataFrame(df1.values, columns=df2.columns))
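On pandas 2.0 and later, where DataFrame.append no longer exists, a concat-based equivalent of the same idea would be:
df2 = pd.concat([df2, df1.set_axis(df2.columns, axis=1)], ignore_index=True)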

Pandas: combine data frames of different sizes

I have 2 data frames:
df1 has ID and count of white products
product_id, count_white
12345,4
23456,7
34567,1
df2 has IDs and counts of all products
product_id,total_count
0009878,14
7862345,20
12345,10
456346,40
23456,30
0987352,10
34567,90
df2 has more products than df1. I need to search df2 for the products that are in df1 and add the total_count column to df1:
product_id,count_white,total_count
12345,4,10
23456,7,30
34567,1,90
I could do a left merge, but I would end up with a huge file. Is there any way to add specific rows from df2 to df1 using merge?
Just perform a left merge on the 'product_id' column:
In [12]:
df1.merge(df2, on='product_id', how='left')
Out[12]:
product_id count_white total_count
0 12345 4 10
1 23456 7 30
2 34567 1 90
Alternatively, perform a left join/merge and assign the result back:
df1 = df1.merge(df2, on='product_id', how='left')
This produces the same three-row output shown above.
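If the worry is materialising a huge merged frame, a Series.map lookup avoids the merge entirely; a minimal sketch, assuming product_id is unique in df2:
df1['total_count'] = df1['product_id'].map(df2.set_index('product_id')['total_count'])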
