I want to merge three data frames in Python; the code I have now gives me some wrong outputs.
This is the first data frame
df_1
Year Month X_1 Y_1
0 2021 January $90 $100
1 2021 February NaN $120
2 2021 March $100 $130
3 2021 April $110 $140
4 2021 May NaN $150
5 2021 June $120 $160
This is the second data frame
df_2
Year Month X_2 Y_2
0 2021 January NaN $120
1 2021 February NaN $130
2 2021 March $80 $140
3 2021 April $90 $150
4 2021 May NaN $150
5 2021 June $120 $170
This is the third data frame
df_3
Year Month X_3 Y_3
0 2021 January $110 $150
1 2021 February $140 $160
2 2021 March $97 $170
3 2021 April $90 $180
4 2021 May NaN $190
5 2021 June $120 $200
The idea is to combine them into one data frame like this:
df_combined
Year Month X_1 Y_1 X_2 Y_2 X_3 Y_3
0 2021 January $90 $100 NaN $120 $110 $150
1 2021 February NaN $120 NaN $130 $140 $160
2 2021 March $100 $130 $80 $140 $97 $170
3 2021 April $110 $140 $90 $150 $90 $180
4 2021 May NaN $150 NaN $150 NaN $190
5 2021 June $120 $160 $120 $170 $120 $200
The code I have for now does not give me the correct outcome; only df_3 has the correct numbers.
# compile the list of data frames you want to merge
import pandas as pd
from functools import reduce
data_frames = [df_1, df_2, df_3]
df_merged = reduce(lambda cross, right: pd.merge(cross, right, on=['Year'],
                                                 how='outer'), data_frames)
#remove superfluous columns
df_merged.drop(['Month_x', 'Month_y'], axis=1, inplace=True)
You can try with
df_1.merge(df_2, how='left', on=['Year', 'Month']).merge(df_3, how='left', on=['Year', 'Month'])
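As a sketch of that chained left merge, with tiny made-up stand-ins for df_1, df_2 and df_3 (not the asker's full data), merging on both Year and Month keeps the rows aligned and only adds the new columns:

```python
import pandas as pd

# Hypothetical miniature frames standing in for the question's data
df_1 = pd.DataFrame({"Year": [2021, 2021], "Month": ["January", "February"],
                     "X_1": ["$90", None]})
df_2 = pd.DataFrame({"Year": [2021], "Month": ["January"], "X_2": ["$70"]})
df_3 = pd.DataFrame({"Year": [2021, 2021], "Month": ["January", "February"],
                     "X_3": ["$110", "$140"]})

# Chain two left merges keyed on both columns; df_1's rows are preserved
combined = (df_1.merge(df_2, how="left", on=["Year", "Month"])
                .merge(df_3, how="left", on=["Year", "Month"]))
print(combined.columns.tolist())
# ['Year', 'Month', 'X_1', 'X_2', 'X_3']
```

February has no row in df_2, so its X_2 value comes out as NaN rather than misaligning the frame.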
One option, of probably many, is:
from functools import reduce
import pandas as pd
dataframes = [df_1, df_2, df_3]
idx = ["Year", "Month"]
new_df = reduce(pd.DataFrame.join, (i.set_index(idx) for i in dataframes)).reset_index()
or
reduce(lambda x, y: pd.merge(x, y, how="outer", on=["Year", "Month"]), dataframes)
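A minimal runnable sketch of the reduce-based outer merge, again with made-up miniature frames: the key point versus the asker's code is that the merge keys include Month as well as Year.

```python
from functools import reduce
import pandas as pd

# Hypothetical stand-ins for the question's three frames
df_1 = pd.DataFrame({"Year": [2021, 2021], "Month": ["January", "February"],
                     "X_1": ["$90", None]})
df_2 = pd.DataFrame({"Year": [2021, 2021], "Month": ["January", "February"],
                     "X_2": [None, None]})
df_3 = pd.DataFrame({"Year": [2021, 2021], "Month": ["January", "February"],
                     "X_3": ["$110", "$140"]})

dataframes = [df_1, df_2, df_3]

# Fold the list pairwise; each step outer-merges on both key columns
df_merged = reduce(
    lambda x, y: pd.merge(x, y, how="outer", on=["Year", "Month"]),
    dataframes,
)
print(df_merged.columns.tolist())
# ['Year', 'Month', 'X_1', 'X_2', 'X_3']
```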
I have dataframe in the following format:
course_id year month student_id
'Design' 2016 1 a123
'Design' 2016 1 a124
'Design' 2016 2 a125
'Design' 2016 3 a126
'Marketing' 2016 1 b123
'Marketing' 2016 2 b124
'Marketing' 2016 3 b125
'Marketing' 2016 3 b126
'Marketing' 2016 3 b127
'Marketing' 2016 4 b128
How can I calculate the growth of every course in every month, i.e. have the table in the following format:
Year Month 'Design' 'Marketing'
2016 1 2 1
2016 2 1 1
2016 3 1 3
2016 4 0 1
You can use the pivot_table function:
df.pivot_table(index=['year', 'month'], columns='course_id', values='student_id', aggfunc=len).fillna(0).reset_index()
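Here is that one-liner on a small hypothetical subset of the question's data, so the shape of the result is visible:

```python
import pandas as pd

# Hypothetical sample mirroring the question's layout
df = pd.DataFrame({
    "course_id": ["Design", "Design", "Design", "Marketing", "Marketing"],
    "year": [2016, 2016, 2016, 2016, 2016],
    "month": [1, 1, 2, 1, 2],
    "student_id": ["a123", "a124", "a125", "b123", "b124"],
})

# Count students per (year, month, course); missing combinations become 0
out = (df.pivot_table(index=["year", "month"], columns="course_id",
                      values="student_id", aggfunc=len)
         .fillna(0)
         .reset_index())
print(out)
```

For this sample, January 2016 shows 2 Design students and 1 Marketing student, and February shows 1 of each.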
I have a dataframe which consists of departments, year, the month of invoice, the invoice date and the value.
I have offset the invoice dates by business days, and now I am trying to combine all the months that have the same number of working days (the 'count' of each month by year) and average the value for each day.
The data I have is as follows:
Department Year Month Invoice Date Value
0 Sales 2019 March 2019-03-25 1000.00
1 Sales 2019 March 2019-03-26 2000.00
2 Sales 2019 March 2019-03-27 3000.00
3 Sales 2019 March 2019-03-28 4000.00
4 Sales 2019 March 2019-03-29 5000.00
... ... ... ... ... ...
2435 Specialist 2020 August 2020-08-27 6000.00
2436 Specialist 2020 August 2020-08-28 7000.00
2437 Specialist 2020 September 2020-09-01 8000.00
2438 Specialist 2020 September 2020-09-02 9000.00
2439 Specialist 2020 September 2020-09-07 1000.00
The count of each month is as follows:
Year Month
2019 April 21
August 21
December 20
July 23
June 20
March 5
May 21
November 21
October 23
September 21
2020 April 21
August 20
February 20
January 22
July 23
June 22
March 22
May 19
September 5
My hope is that using this count I could aggregate the data from the original df and average for example April, August, May, November, September (2019) along with April (2020) as they all have 21 working days in the month.
This would produce one dataframe where each day of the month is an average across all the months with the same number of working days.
I hope that makes sense.
Note: Please ignore the 5 days length, just incomplete data for those months...
Thank you
EDIT: I just realised that the days won't line up for each month, so my plan is to aggregate based on whether it's the first business day of the month, then the second, the third, etc., regardless of the actual date.
ALSO (SORRY): I was hoping it could be by department!
Department Month Length Day Number Average Value
0 Sales 21 1 20000
1 Sales 21 2 5541
2 Sales 21 3 87485
3 Sales 21 4 1863
4 Sales 21 5 48687
5 Sales 21 6 486996
6 Sales 21 7 892
7 Sales 21 8 985
8 Sales 21 9 14169
9 Sales 21 10 20000
10 Sales 21 11 5541
11 Sales 21 12 87485
12 Sales 21 13 1863
13 Sales 21 14 48687
14 Sales 21 15 486996
15 Sales 21 16 892
16 Sales 21 17 985
17 Sales 21 18 14169
......
So, to explain it a bit better, let's take Sales and all the months which have 21 working days in them; for each day in those 21-day months I am hoping to get the average of the value, giving a table like the one above.
So 'day 1' is an average of all 'day 1's in the 21-day months (as seen in the count df). This is to allow me to plot a line chart showing the average revenue on each given day in a 21-day month. I hope this is a bit of a better explanation; apologies.
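The day-numbering idea described above could be sketched like this (a hypothetical miniature frame, and it assumes the rows are already sorted by invoice date within each month): number each row within its (Department, Year, Month) group, tag each month with its working-day count, then average across months of the same length.

```python
import pandas as pd

# Hypothetical data: two 2-day months for the Sales department
df = pd.DataFrame({
    "Department": ["Sales", "Sales", "Sales", "Sales"],
    "Year": [2019, 2019, 2020, 2020],
    "Month": ["March", "March", "April", "April"],
    "Value": [1000.0, 2000.0, 3000.0, 4000.0],
})

# Business-day position within the month (1-based), assuming rows are sorted
df["Day Number"] = df.groupby(["Department", "Year", "Month"]).cumcount() + 1
# Working-day count of the month each row belongs to
df["Month Length"] = df.groupby(["Department", "Year", "Month"])["Value"].transform("size")

# Average 'day 1' across all 2-day months, 'day 2' across all 2-day months, ...
avg = (df.groupby(["Department", "Month Length", "Day Number"])["Value"]
         .mean()
         .reset_index(name="Average Value"))
print(avg)
```

With this toy data, day 1 of the 2-day months averages to 2000.0 and day 2 to 3000.0.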
I am not really sure whether I understand your question. Maybe you could add an expected df to your question?
In the meantime, would this point you in the direction you are looking for:
import pandas as pd
from random import randint
from calendar import month_name
df = pd.DataFrame({'years': [randint(1998, 2020) for x in range(10000)],
                   'months': [month_name[randint(1, 12)] for x in range(10000)],
                   'days': [randint(1, 30) for x in range(10000)],
                   'revenue': [randint(0, 1000) for x in range(10000)]})
print(df.groupby(['months', 'days'])['revenue'].mean())
Output is:
months days
April 1 475.529412
2 542.870968
3 296.045455
4 392.416667
5 475.571429
...
September 26 516.888889
27 539.583333
28 513.500000
29 480.724138
30 456.500000
Name: revenue, Length: 360, dtype: float64
I have a time series data with number of days for each month for several years and trying to create a new dataframe which would have months as rows and years as columns.
I have this
DateTime Days Month Year
2004-11-30 3 November 2004
2004-12-31 16 December 2004
2005-01-31 12 January 2005
2005-02-28 11 February 2005
2005-03-31 11 March 2005
... ... ... ...
2019-06-30 0 June 2019
2019-07-31 2 July 2019
2019-08-31 5 August 2019
2019-09-30 5 September 2019
2019-10-31 3 October 2019
And I'm trying to get this
Month 2004 2005 ... 2019
January nan 12 7
February nan 11 9
...
November 17 17 nan
December 14 15 nan
I created a new dataframe whose first column holds the months and tried to iterate through the first dataframe (days) to add the new columns (years) and fill in the cells. However, the condition which checks whether the month in days matches the month in the new dataframe (output) is never True, so output never gets updated. I guess this is because the month in days is never the same as the month in output within the same iteration.
for index, row in days.iterrows():
    print(days.loc[index, 'Days'])  # this prints out as expected
    for month in output.items():
        print(index.month_name())  # this prints out as expected
        if index.month_name() == month:
            output.at[month, index.year] = days.loc[index, 'Days']  # I wanted to use this to fill up the cells, is this right?
            print(days.loc[index, 'Days'])  # this never gets printed out
Could you tell me how to fix this? Or maybe there's a better way to accomplish the result rather than iteration?
It's my first attempt to use libraries in python, so I would appreciate some help.
Use pivot if your input dataframe has a single value per month and year:
df.pivot(index='Month', columns='Year', values='Days')
Output:
Year 2004 2005 2019
Month
August NaN NaN 5
December 16 NaN NaN
February NaN 11 NaN
January NaN 12 NaN
July NaN NaN 2
June NaN NaN 0
March NaN 11 NaN
November 3 NaN NaN
October NaN NaN 3
September NaN NaN 5
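A small self-contained sketch of that pivot, using just three rows of the question's data, shows how missing (Month, Year) pairs come out as NaN:

```python
import pandas as pd

# Three rows lifted from the question's sample
df = pd.DataFrame({
    "Days": [3, 16, 12],
    "Month": ["November", "December", "January"],
    "Year": [2004, 2004, 2005],
})

# One value per (Month, Year) pair, so pivot applies directly
wide = df.pivot(index="Month", columns="Year", values="Days")
print(wide)
```

November 2004 holds 3, while November 2005 is NaN because no such row exists in the input.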
I have two dataframes (A and B). I want to remove all the rows in B where the values for columns Month, Year, Type, Name are an exact match.
Dataframe A
Name Type Month Year country Amount Expiration Paid
0 EXTRON GOLD March 2019 CA 20000 2019-09-07 yes
0 LEAF SILVER March 2019 PL 4893 2019-02-02 yes
0 JMC GOLD March 2019 IN 7000 2020-01-16 no
Dataframe B
Name Type Month Year country Amount Expiration Paid
0 JONS GOLD March 2018 PL 500 2019-10-17 yes
0 ABBY BRONZE March 2019 AU 60000 2019-02-02 yes
0 BUYT GOLD March 2018 BR 50 2018-03-22 no
0 EXTRON GOLD March 2019 CA 90000 2019-09-07 yes
0 JAYB PURPLE March 2019 PL 9.90 2018-04-20 yes
0 JMC GOLD March 2019 IN 6000 2020-01-16 no
0 JMC GOLD April 2019 IN 1000 2020-01-16 no
Desired Output:
Dataframe B
Name Type Month Year country Amount Expiration Paid
0 JONS GOLD March 2018 PL 500 2019-10-17 yes
0 ABBY BRONZE March 2019 AU 60000 2019-02-02 yes
0 BUYT GOLD March 2018 BR 50 2018-03-22 no
0 JAYB PURPLE March 2019 PL 9.90 2018-04-20 yes
0 JMC GOLD April 2019 IN 1000 2020-01-16 no
We can use merge here:
l = ['Month', 'Year', 'Type', 'Name']
B = B.merge(A[l], on=l, indicator=True, how='outer').loc[lambda x: x['_merge'] == 'left_only'].copy()
# you can drop the helper column here with B = B.drop(columns='_merge')
Name Type Month Year country Amount Expiration Paid _merge
0 JONS GOLD March 2018 PL 500.0 2019-10-17 yes left_only
1 ABBY BRONZE March 2019 AU 60000.0 2019-02-02 yes left_only
2 BUYT GOLD March 2018 BR 50.0 2018-03-22 no left_only
4 JAYB PURPLE March 2019 PL 9.9 2018-04-20 yes left_only
6 JMC GOLD April 2019 IN 1000.0 2020-01-16 no left_only
I tried using a MultiIndex for the same:
cols =['Month', 'Year','Type', 'Name']
index1 = pd.MultiIndex.from_arrays([df1[col] for col in cols])
index2 = pd.MultiIndex.from_arrays([df2[col] for col in cols])
df2 = df2.loc[~index2.isin(index1)]
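To make the MultiIndex approach concrete, here is a runnable sketch with tiny hypothetical frames (df1 playing the role of A, df2 of B): rows of df2 whose key tuple also appears in df1 are dropped.

```python
import pandas as pd

# Hypothetical miniature versions of dataframes A and B
df1 = pd.DataFrame({"Name": ["EXTRON"], "Type": ["GOLD"],
                    "Month": ["March"], "Year": [2019]})
df2 = pd.DataFrame({"Name": ["EXTRON", "JONS"], "Type": ["GOLD", "GOLD"],
                    "Month": ["March", "March"], "Year": [2019, 2018]})

cols = ["Month", "Year", "Type", "Name"]
# Build one key tuple per row from the four match columns
index1 = pd.MultiIndex.from_arrays([df1[c] for c in cols])
index2 = pd.MultiIndex.from_arrays([df2[c] for c in cols])

# Keep only df2 rows whose key tuple is absent from df1
filtered = df2.loc[~index2.isin(index1)]
print(filtered["Name"].tolist())
# ['JONS']
```

The EXTRON row matches on all four columns and is removed; the JONS row survives because its Year differs.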
I have seasonal snow data which I want to group by snow year (July 1, 1954 - June 30, 1955) rather than having one winter's data split over two years (January 1, 1954 - December 31, 1954 and January 1, 1955 - December 31, 1955).
example data
I modified the code from this question:
Using pandas to select specific seasons from a dataframe whose values are over a defined threshold (thanks Pad)
def get_season(row):
    if row['date'].month <= 7:
        return row['date'].year
    else:
        return row['date'].year + 1
df['Seasonal_Year'] = df.apply(get_season, axis=1)
results of method call
Is there a better way to do this than I have done?
I think yes, with numpy.where:
years = df['date'].dt.year
df['Seasonal_Year'] = np.where(df['date'].dt.month <= 7, years, years + 1)
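Putting that together as a self-contained sketch (with a few hypothetical dates chosen around the July boundary):

```python
import numpy as np
import pandas as pd

# Dates straddling the snow-year boundary: month <= 7 keeps its year,
# later months roll forward into the next season's label
df = pd.DataFrame({"date": pd.to_datetime(["1954-07-01", "1954-08-15", "1955-06-30"])})

years = df["date"].dt.year
df["Seasonal_Year"] = np.where(df["date"].dt.month <= 7, years, years + 1)
print(df["Seasonal_Year"].tolist())
# [1954, 1955, 1955]
```

This is vectorised, so it avoids the row-by-row apply entirely.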
You can use pd.offsets.MonthBegin.
Consider this dataframe of dates, df:
df = pd.DataFrame(dict(Date=pd.date_range('2010-01-30', periods=24, freq='M')))
We can offset the Date and grab the year
df.assign(Season=(df.Date - pd.offsets.MonthBegin(7)).dt.year + 1)
Date Season
0 2010-01-31 2010
1 2010-02-28 2010
2 2010-03-31 2010
3 2010-04-30 2010
4 2010-05-31 2010
5 2010-06-30 2010
6 2010-07-31 2011
7 2010-08-31 2011
8 2010-09-30 2011
9 2010-10-31 2011
10 2010-11-30 2011
11 2010-12-31 2011
12 2011-01-31 2011
13 2011-02-28 2011
14 2011-03-31 2011
15 2011-04-30 2011
16 2011-05-31 2011
17 2011-06-30 2011
18 2011-07-31 2012
19 2011-08-31 2012
20 2011-09-30 2012
21 2011-10-31 2012
22 2011-11-30 2012
23 2011-12-31 2012