How to merge two pandas dataframes [duplicate] - python

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 1 year ago.
I have 2 dataframes that I want to be merged together.
df1: sales dataframe, containing only sold products (if an article was not sold, it is not there). Covers ALL weeks 1 to 53 for 2019/2020/2021.
Year Week Store Article Sales Volume
2019 11 SF sku1 500
2021 16 NY sku2 20
2020 53 PA sku1 500
2021 01 NY sku3 200
2019 11 SF sku1 455
2021 16 NY sku2 20
df2: a stock dataframe covering the entire product range (an article appears even if it was never sold), but only with stock at week 16 of each year 2019/2020/2021, for ALL products.
Year Week Store Article Stock Volume
2019 16 SF sku1 500
2021 16 NY sku2 20
2020 16 PA sku4 500
2021 16 NY sku5 200
2019 16 SF sku65 455
2021 16 NY sku2000 20
...
I have tried to merge both dfs by doing this (I wanted to get all Articles, but the drawback is that I lose the other weeks):
merged = pd.merge(df1, df2, how="right", right_on=["Article ID", "Year ID", "Week ID", "Store ID"], left_on=["Article", "Year", "Week", "Store"])
But I only get the sales values associated with the week 16 stock, and I lose all the other weeks.
So I tried a left join
merged = pd.merge(df1, df2, how="left", right_on=["Article ID", "Year ID", "Week ID", "Store ID"], left_on=["Article", "Year", "Week", "Store"])
Now I have all the weeks, but I am missing some products' stocks.
I need to keep ALL PRODUCTS of df2 while also keeping the sales weeks of df1.
Is there a way to merge both dfs keeping the entire stock depth and all the sales weeks?
Thanks for your help!

You can try this
merged = pd.merge(df1, df2, on='Year')
Source: how to merge two data frames based on particular column in pandas python?

You need a full outer join in order to not lose any Sales from df1 or Product from df2:
merged = pd.merge(df1,df2, how = "outer", right_on=["Article ID", "Year ID", "Week ID", "Store ID"], left_on=["Article", "Year", "Week", "Store"])
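A minimal runnable sketch of the outer join, using made-up sample rows and assuming the key columns have been renamed to match on both sides (the question's df2 uses names like "Article ID"). The optional indicator=True flag adds a _merge column showing which frame each row came from, which makes it easy to verify nothing was dropped:

```python
import pandas as pd

# Tiny stand-ins for the sales and stock frames (sample values from the question)
df1 = pd.DataFrame({"Year": [2019, 2021], "Week": [11, 16],
                    "Store": ["SF", "NY"], "Article": ["sku1", "sku2"],
                    "Sales Volume": [500, 20]})
df2 = pd.DataFrame({"Year": [2021, 2019], "Week": [16, 16],
                    "Store": ["NY", "SF"], "Article": ["sku2", "sku65"],
                    "Stock Volume": [20, 455]})

# Full outer join: keeps every sales week of df1 AND every stocked article of df2
merged = pd.merge(df1, df2, how="outer",
                  on=["Article", "Year", "Week", "Store"],
                  indicator=True)  # _merge: left_only / right_only / both
```

Rows present only in df1 get NaN for Stock Volume, and vice versa, so no week and no product is lost.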


Pandas: Combining data items on multiple criteria

I have a database of all customer transactions within the company I work at.
ID Payment Amount Month Year
A  Inward  100    2     2005
A  Outward 200    2     2005
B  Inward  100    7     2017
I am struggling to combine the Sum/Count of the Amount of those transactions per customer ID per Month/Year.
The only thing I have succeeded at is combining the Sum/Count of the Amount per customer ID:
Combined = data.groupby("ID")["Amount"].sum().rename("Sum").reset_index()
Can you please let me know what are the alternative solutions?
Thank you in advance!
You can use a list of columns in groupby like:
>>> df.groupby(['ID', 'Year', 'Month', 'Payment'])['Amount'].agg(['sum', 'count'])
sum count
ID Year Month Payment
A 2005 2 Inward 100 1
Outward 200 1
B 2017 7 Inward 100 1
To go further, treating outward payments as negative amounts:
>>> import numpy as np
>>> df.assign(Amount=np.where(df['Payment'].eq('Outward'),
...                           -df['Amount'], df['Amount'])) \
...   .groupby(['ID', 'Year', 'Month'])['Amount'].agg(['sum', 'count'])
sum count
ID Year Month
A 2005 2 -100 2
B 2017 7 100 1
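If you are on pandas 0.25 or later, the same result can also be written with named aggregation, which gives the output columns friendlier names than the default sum/count. A sketch using the sample data above:

```python
import pandas as pd

df = pd.DataFrame({"ID": ["A", "A", "B"],
                   "Payment": ["Inward", "Outward", "Inward"],
                   "Amount": [100, 200, 100],
                   "Month": [2, 2, 7],
                   "Year": [2005, 2005, 2017]})

# Sign the amounts (outward payments counted as negative), then aggregate
# to one row per customer/year/month with named output columns.
signed = df["Amount"].where(df["Payment"].eq("Inward"), -df["Amount"])
out = (df.assign(Amount=signed)
         .groupby(["ID", "Year", "Month"], as_index=False)
         .agg(Sum=("Amount", "sum"), Count=("Amount", "count")))
```

With as_index=False the group keys stay as ordinary columns, which is often easier to work with downstream than a MultiIndex.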

Add a value to a new column on Data frame that depends on the value on another Data frame [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
I have two data frames, df1 and df2. df1 has entries of amounts spent by users, and each user can have several entries with different amount values.
The second data frame just holds the information of every user (each user is unique in this data frame).
I want to create a new column on df1 that includes the country value of each unique user from df2.
Any help will be appreciated
df1
name_id Dept amt_spent
0 Alex-01 Engineering 5
1 Bob-01 Finance 5
2 Charles-01 HR 10
3 David-01 HR 6
4 Alex-01 Engineering 50
df2
name_id Country
0 Alex-01 UK
1 Bob-01 USA
2 Charles-01 GHANA
3 David-01 BRAZIL
Result
name_id Dept amt_spent Country
0 Alex-01 Engineering 5 UK
1 Bob-01 Finance 5 USA
2 Charles-01 HR 10 GHANA
3 David-01 HR 6 BRAZIL
4 Alex-01 Engineering 50 UK
This should work (pandas merges on the shared name_id column by default):
df = pd.merge(df1, df2)
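Since pandas merges on all shared columns by default, it can be safer to spell out the key and the join type. A sketch with the question's data, where how='left' keeps every spending row of df1 even if a user were missing from df2:

```python
import pandas as pd

df1 = pd.DataFrame({"name_id": ["Alex-01", "Bob-01", "Alex-01"],
                    "Dept": ["Engineering", "Finance", "Engineering"],
                    "amt_spent": [5, 5, 50]})
df2 = pd.DataFrame({"name_id": ["Alex-01", "Bob-01"],
                    "Country": ["UK", "USA"]})

# Left join on the explicit key: repeated name_ids in df1 each receive
# the same Country value from df2.
result = df1.merge(df2, on="name_id", how="left")
```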

multiple conditions for lookup in pandas

I have 2 dataframes. One with the City, dates and sales
sales = [['20101113','Miami',35],['20101114','New York',70],['20101114','Los Angeles',4],['20101115','Chicago',36],['20101114','Miami',12]]
df2 = pd.DataFrame(sales,columns=['Date','City','Sales'])
print (df2)
Date City Sales
0 20101113 Miami 35
1 20101114 New York 70
2 20101114 Los Angeles 4
3 20101115 Chicago 36
4 20101114 Miami 12
The second has some dates and cities.
date = [['20101114','New York'],['20101114','Los Angeles'],['20101114','Chicago']]
df = pd.DataFrame(date,columns=['Date','City'])
print (df)
I want to extract the sales from the first dataframe that match the city and date in the second dataframe, and add those sales to the second dataframe. If the date is missing in the first table, then the next highest date's sales should be retrieved.
The new dataframe should look like this
Date City Sales
0 20101114 New York 70
1 20101114 Los Angeles 4
2 20101114 Chicago 36
I am having trouble extracting and merging tables. Any suggestions?
This is pd.merge_asof, which allows you to join on a combination of exact matches and then a "close" match for some column.
import pandas as pd
df['Date'] = pd.to_datetime(df.Date)
df2['Date'] = pd.to_datetime(df2.Date)
pd.merge_asof(df.sort_values('Date'),
              df2.sort_values('Date'),
              by='City', on='Date',
              direction='forward')
Output:
Date City Sales
0 2010-11-14 New York 70
1 2010-11-14 Los Angeles 4
2 2010-11-14 Chicago 36

Join dataframes based on partial string-match between columns

I have a dataframe of movies which I want to check against another df to see if they are present there.
after_h.sample(10, random_state=1)
movie year ratings
108 Mechanic: Resurrection 2016 4.0
206 Warcraft 2016 4.0
106 Max Steel 2016 3.5
107 Me Before You 2016 4.5
I want to compare if the above movies are present in another df.
FILM Votes
0 Avengers: Age of Ultron (2015) 4170
1 Cinderella (2015) 950
2 Ant-Man (2015) 3000
3 Do You Believe? (2015) 350
4 Max Steel (2016) 560
I want something like this as my final output:
FILM votes
0 Max Steel 560
There are two ways:
get the row-indices for partial-matches: FILM.startswith(title) or FILM.contains(title). Either of:
df1[df1['movie'].apply(lambda title: df2['FILM'].str.startswith(title)).any(axis=1)]
df1[df1['movie'].apply(lambda title: df2['FILM'].str.contains(title)).any(axis=1)]
movie year ratings
106 Max Steel 2016 3.5
Alternatively, you can use merge() if you convert the compound string column df2['FILM'] into its two component columns movie_title (year).
# see code at bottom to recreate your dataframes
df2[['movie','year']] = df2.FILM.str.extract(r'([^\(]*) \(([0-9]*)\)')
# reorder columns and drop 'FILM' now we have its subfields 'movie','year'
df2 = df2[['movie','year','Votes']]
df2['year'] = df2['year'].astype(int)
df2.merge(df1)
movie year Votes ratings
0 Max Steel 2016 560 3.5
(Acknowledging much help from #user3483203 here and in Python chat room)
Code to recreate dataframes:
import pandas as pd
from io import StringIO
dat1 = """movie year ratings
108 Mechanic: Resurrection 2016 4.0
206 Warcraft 2016 4.0
106 Max Steel 2016 3.5
107 Me Before You 2016 4.5"""
dat2 = """FILM Votes
0 Avengers: Age of Ultron (2015) 4170
1 Cinderella (2015) 950
2 Ant-Man (2015) 3000
3 Do You Believe? (2015) 350
4 Max Steel (2016) 560"""
df1 = pd.read_csv(StringIO(dat1), sep=r'\s{2,}', engine='python', index_col=0)
df2 = pd.read_csv(StringIO(dat2), sep=r'\s{2,}', engine='python')
Given input dataframes df1 and df2, you can use Boolean indexing via pd.Series.isin. To align the format of the movie strings you need to first concatenate movie and year from df1:
s = df1['movie'] + ' (' + df1['year'].astype(str) + ')'
res = df2[df2['FILM'].isin(s)]
print(res)
FILM Votes
4 Max Steel (2016) 560
smci's option 1 is nearly there; the following worked for me:
df1['Votes'] = df1['movie'].apply(
    lambda title: df2.loc[df2['FILM'].str.startswith(title), 'Votes'].max())
Explanation:
Apply a lambda to every movie string in df1
The lambda looks up df2, selecting all rows in df2 where FILM starts with the movie title
Select the Votes column of the resulting subset of df2
Take the matching value with max() (which yields NaN when there is no match)
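For reference, here is a self-contained sketch of this lookup pattern (tiny made-up frames) that takes the first match explicitly and falls back to None/NaN when a movie has no match in df2:

```python
import pandas as pd

df1 = pd.DataFrame({"movie": ["Max Steel", "Warcraft"],
                    "year": [2016, 2016]})
df2 = pd.DataFrame({"FILM": ["Max Steel (2016)", "Cinderella (2015)"],
                    "Votes": [560, 950]})

def lookup_votes(title):
    # All rows in df2 whose FILM string starts with this movie title
    matches = df2.loc[df2["FILM"].str.startswith(title), "Votes"]
    return matches.iloc[0] if len(matches) else None  # None -> NaN in the column

df1["Votes"] = df1["movie"].apply(lookup_votes)
```

Note this is O(len(df1) * len(df2)); for large frames, extracting the title/year into columns and merging (as in the answer above) scales better.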

PYTHON: Changing Column names

I am reading an Excel sheet using pandas (pd.read_excel()), then putting it in a list and appending it to the final dataframe.
The sheet I am reading has a column named Sales, and the final dataframe has a column named Item.
AllFields is the dataframe with the full list of columns.
My question: while appending the list to the final dataframe, how do I get the records of the Sales column to land under the column name Item?
Example of data which i am reading from sheet
Sales 2013 2014 2015 2016 2017 2018 2019
Units Sold 0 0 0 0 0 0 0
Unit Sale Price $900 $900 $900 $900 $900 $900 $900
Unit Profit $500 $500 $500 $500 $500 $500 $500
and then appending to the data-frame which have columns
Full Project Item Market Project Project Step Round Sponsor Subproduct 2013 2014 2015 2016 2017 2018 2019
reading_book1 = pd.read_excel(file, sheet_name="1-Rollout", skiprows=restvalue).iloc[:10]
EmptyList1 = [reading_book1]
RestDataframe = RestDataframe.append(AllFields).append(EmptyList1)
RestDataframe['Project'] = read_ProjectNumber
RestDataframe['Full Project'] = read_fullProject
RestDataframe['Sponsor'] = read_Sponsor
RestDataframe['Round'] = read_round
RestDataframe['Project Step'] = read_projectstep
RestDataframe['Market'] = "Rest of the World Market"
FinalDataframe = FinalDataframe.append(CADataframe).append(RestDataframe)
You need to use pd.concat (note that it takes dataframes directly, not a list wrapped inside another list):
RestDataframe = pd.concat([AllFields, reading_book1])
Then change the name of the Sales column to Item with
RestDataframe.rename(columns={'Sales': 'Item'}, inplace=True)
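A minimal sketch of the rename-then-concat approach, using tiny hypothetical stand-ins for AllFields and the sheet just read (renaming before concatenating ensures the sheet's Sales values land under Item):

```python
import pandas as pd

# Hypothetical stand-ins: the sheet just read, and the master column frame
reading_book1 = pd.DataFrame({"Sales": ["Units Sold"], "2013": [0]})
AllFields = pd.DataFrame(columns=["Item", "2013"])

# Rename first, so the columns line up when concatenating
reading_book1 = reading_book1.rename(columns={"Sales": "Item"})
RestDataframe = pd.concat([AllFields, reading_book1], ignore_index=True)
```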
