Merge two datasets in Pandas - python

I have previously worked with Stata and am now trying to get the same done with Python. However, I am having trouble with the merge command; somehow I must be missing something. The two dataframes I want to merge look like this:
df1:
Date id Market_Cap
2000 1 400
2000 2 200
2001 1 410
2001 2 220
df2:
id Ticker
1 Shell
2 ExxonMobil
My aim now is to get the following dataset:
Date id Market_Cap Ticker
2000 1 400 Shell
2000 2 200 ExxonMobil
2001 1 410 Shell
2001 2 220 ExxonMobil
I tried the following command:
merged = pd.merge(df1, df2, how="left", on="id")
This merges the datasets, but gives me only NaNs in the Ticker column.
I looked at several sources and maybe I am mistaken, but isn't how="left" the right option for my purpose? I also tried "right" and "outer"; they don't give the result I want, and "inner" does not seem to work here at all.
Am I missing something crucial?

The problem is that your id column is object (i.e., string) in one dataframe and int in the other, so no keys match and you get NaN.
If the dtypes are the same:
print (df1['id'].dtypes)
int64
print (df2['id'].dtypes)
int64
merged = pd.merge(df1, df2, how="left", on="id")
print (merged)
Date id Market_Cap Ticker
0 2000 1 400 Shell
1 2000 2 200 ExxonMobil
2 2001 1 410 Shell
3 2001 2 220 ExxonMobil
Another solution, if you only need to add a single new column, is map:
df1['Ticker'] = df1['id'].map(df2.set_index('id')['Ticker'])
print (df1)
Date id Market_Cap Ticker
0 2000 1 400 Shell
1 2000 2 200 ExxonMobil
2 2001 1 410 Shell
3 2001 2 220 ExxonMobil
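One caveat with the map approach: the lookup Series' index must be unique, otherwise pandas raises an InvalidIndexError. If df2 could contain duplicated ids, a minimal sketch that de-duplicates first:
# Keep one row per id (the first occurrence) before building the lookup Series
ticker_by_id = df2.drop_duplicates('id').set_index('id')['Ticker']
df1['Ticker'] = df1['id'].map(ticker_by_id)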
Simulate your problem:
print (df1['id'].dtypes)
object
print (df2['id'].dtypes)
int64
df1['Ticker'] = df1['id'].map(df2.set_index('id')['Ticker'])
print (df1)
Date id Market_Cap Ticker
0 2000 1 400 NaN
1 2000 2 200 NaN
2 2001 1 410 NaN
3 2001 2 220 NaN
The solution is to convert the id column of df1 to int with astype (or, alternatively, convert the id column of df2 to str):
df1['id'] = df1['id'].astype(int)
#alternatively
#df2['id'] = df2['id'].astype(str)
df1['Ticker'] = df1['id'].map(df2.set_index('id')['Ticker'])
print (df1)
Date id Market_Cap Ticker
0 2000 1 400 Shell
1 2000 2 200 ExxonMobil
2 2001 1 410 Shell
3 2001 2 220 ExxonMobil
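For reference, a self-contained sketch reproducing the mismatch end to end; the string ids in df1 are an assumption about how the data was loaded. Note that recent pandas versions raise a ValueError when merging object and int64 keys instead of silently returning NaN, but the fix is the same:
import pandas as pd

# Assumed: df1's id column arrived as strings (e.g. read from a CSV as text)
df1 = pd.DataFrame({'Date': [2000, 2000, 2001, 2001],
                    'id': ['1', '2', '1', '2'],
                    'Market_Cap': [400, 200, 410, 220]})
df2 = pd.DataFrame({'id': [1, 2], 'Ticker': ['Shell', 'ExxonMobil']})

print(df1['id'].dtype, df2['id'].dtype)  # object vs int64 -> keys never match

df1['id'] = df1['id'].astype(int)        # align the key dtypes
print(pd.merge(df1, df2, how='left', on='id'))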

Related

How to create new column from another dataframe based on conditions

I am trying to join two datasets, but they do not have the same structure or matching criteria.
Currently I have the dataset below, which contains the number of fires by month and year, but the months are part of the header and the years are a column.
I would like to add this data into a new column (let's hypothetically call it nr_total_queimadas) of another dataset, using that dataset's data_medicao column as the matching key.
The date format is YYYY-MM-DD, but the day doesn't really matter here.
I tried to write a loop for this, but I think I'm doing something wrong and I'm not sure how to proceed.
Below is an example of how I would like the output after joining the two datasets:
I used as an example the case where some dates repeat (which should happen), so the number from the fires dataset should repeat as well.
First, I assume that the first dataframe is in variable a and the second is in variable b.
To make looking up simpler, we set the index of a to year:
a = a.set_index('year')
Then we take the years from the data_medicao column in dataframe b:
years = b['data_medicao'].dt.year
To get the month name from dataframe b, we use strftime. We then need to lowercase the month name so that it matches the column names in a; for that we use .str.lower():
month_name_lowercase = b['data_medicao'].dt.strftime("%B").str.lower()
Then, using lookup, we can pull all the values from dataframe a at the positions given by years and month_name_lowercase:
num_fires = a.lookup(years.values, month_name_lowercase.values)
Finally, add the new values as a new column in b:
b['nr_total_queimadas'] = num_fires
So the complete code is like this:
a = a.set_index('year')
years = b['data_medicao'].dt.year
month_name_lowercase = b['data_medicao'].dt.strftime("%B").str.lower()
num_fires = a.lookup(years.values, month_name_lowercase.values)
b['nr_total_queimadas'] = num_fires
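Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On current versions, the same positional lookup can be sketched with get_indexer and NumPy-style fancy indexing (same a, years, and month_name_lowercase as above):
# Positional index of each row's year in a's index and month in a's columns
row_pos = a.index.get_indexer(years)
col_pos = a.columns.get_indexer(month_name_lowercase)

# Pick one cell per row via fancy indexing on the underlying array
b['nr_total_queimadas'] = a.to_numpy()[row_pos, col_pos]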
Assume the following data for year vs. month. Convert the month names to numbers:
columns = ["year","jan","feb","mar"]
data = [
(2001,110,120,130),
(2002,210,220,230),
(2003,310,320,330)
]
df = pd.DataFrame(data=data, columns=columns)
month_map = {"jan":"1", "feb":"2", "mar":"3"}
df = df.rename(columns=month_map)
[Out]:
year 1 2 3
0 2001 110 120 130
1 2002 210 220 230
2 2003 310 320 330
Assume the following data for date-wise transactions. Extract year and month from the date:
columns2 = ["date"]
data2 = [
("2001-02-15"),
("2001-03-15"),
("2002-01-15"),
("2002-03-15"),
("2003-01-15"),
("2003-02-15"),
]
df2 = pd.DataFrame(data=data2, columns=columns2)
df2["date"] = pd.to_datetime(df2["date"])
df2["year"] = df2["date"].dt.year
df2["month"] = df2["date"].dt.month
[Out]:
date year month
0 2001-02-15 2001 2
1 2001-03-15 2001 3
2 2002-01-15 2002 1
3 2002-03-15 2002 3
4 2003-01-15 2003 1
5 2003-02-15 2003 2
Join on year:
df2 = df2.merge(df, left_on="year", right_on="year", how="left")
[Out]:
date year month 1 2 3
0 2001-02-15 2001 2 110 120 130
1 2001-03-15 2001 3 110 120 130
2 2002-01-15 2002 1 210 220 230
3 2002-03-15 2002 3 210 220 230
4 2003-01-15 2003 1 310 320 330
5 2003-02-15 2003 2 310 320 330
Compute row-wise sum of months:
df2["nr_total_queimadas"] = df2[list(month_map.values())].apply(pd.Series.sum, axis=1)
df2[["date", "nr_total_queimadas"]]
[Out]:
date nr_total_queimadas
0 2001-02-15 360
1 2001-03-15 360
2 2002-01-15 660
3 2002-03-15 660
4 2003-01-15 960
5 2003-02-15 960
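If the per-month value is wanted instead of the yearly sum (matching the semantics of the first answer's lookup), melting the wide table and merging on year and month is another route on current pandas. A sketch, reusing df as renamed above, where transactions stands for the date frame as it was before the "Join on year" step:
# Wide -> long: one row per (year, month) pair with its fire count
long_df = df.melt(id_vars="year", var_name="month", value_name="nr_fires")
long_df["month"] = long_df["month"].astype(int)  # column labels were the strings "1", "2", "3"

# Attach the matching count to each date row by (year, month)
out = transactions.merge(long_df, on=["year", "month"], how="left")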

Create composite variable from multiple variables and add to dataframe

I have a dataframe with three median rent variables. The dataframe looks like this:
region_id year 1bed_med_rent 2bed_med_rent 3bed_med_rent
1 2010 800 1000 1200
1 2011 850 1050 1250
2 2010 900 1000 1100
2 2011 950 1050 1150
I would like to combine all rent variables into one variable using common elements of region and year like so:
region_id year med_rent
1 2010 1000
1 2011 1050
2 2010 1000
2 2011 1050
Using the agg() function in pandas, I have been able to apply functions to multiple variables, but I have not been able to combine the variables and insert the result into the dataframe. I have attempted to use the assign() function in combination with the code below, without success.
#Creating the group list of common IDs
group_list = ['region_id', 'year']
#Grouping by common ID and taking median values of each group
new_df = df.groupby(group_list).agg({'1bed_med_rent': ['median'],
                                     '2bed_med_rent': ['median'],
                                     '3bed_med_rent': ['median']}).reset_index()
What other method might there be for this?
Here set_index combined with a row-wise apply ought to do it:
(df.set_index(['region_id','year'])
.apply(lambda r:r.median(), axis=1)
.reset_index()
.rename(columns = {0:'med_rent'})
)
produces
region_id year med_rent
0 1 2010 1000.0
1 1 2011 1050.0
2 2 2010 1000.0
3 2 2011 1050.0
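A vectorized alternative, in case a Python-level apply per row is slow on a large frame: DataFrame.median with axis=1 computes the same row medians directly. A minimal sketch using the question's column names:
rent_cols = ['1bed_med_rent', '2bed_med_rent', '3bed_med_rent']
# Row-wise median across the three rent columns, keeping the grouping columns
out = df[['region_id', 'year']].assign(med_rent=df[rent_cols].median(axis=1))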

Select rows in df based on a column1 of df and latest date in column2

I have a DataFrame in pandas with following layout:
title date value1
0 ABC 6/2/2018 1900
1 ABC 6/1/2018 1000
2 ABC 5/29/2018 405
3 ABC 3/18/2018 300
4 ABC 3/17/2018 50
5 LMO 6/1/2018 100
6 LMO 5/30/2018 10
7 LMO 5/29/2018 1
I want to create df2, which should contain only the row with the latest date for each title. I am new to Python and pandas, so I have to ask for help.
df2:
title date value1
0 ABC 6/2/2018 1900
1 LMO 6/1/2018 100
Try:
df[df.groupby('title').date.transform('max') == df['date']]
df_new:
title date value1
0 ABC 2018-06-02 1900
5 LMO 2018-06-01 100
Alternatively, sort the data by date in descending order, group by title, and take the first row of each group:
df.sort_values('date', ascending=False).groupby('title').first()
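A caveat that applies to both approaches: if date is stored as strings like 6/2/2018, max compares them lexicographically (e.g. "5/29/2018" sorts above "10/1/2018"), so it is safer to parse the column first. A sketch:
# Parse month/day/year strings into real datetimes before comparing
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')
df2 = df[df.groupby('title')['date'].transform('max') == df['date']]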

Python - get the mean of value between 2 dates

I'd like to get the mean of sell between 2 dates, grouped by shop.
I have a first xlsx file with the sells by shop and date:
shop sell date
a 100 2000
a 122 2001
a 300 2002
b 55 2000
b 245 2001
b 1156 2002
And I have another file with the start and end date for each shop:
shop start stop
a 2000 2002
a 2000 2001
b 2000 2000
So I'd like to get the mean of sell between each start/stop pair from the 2nd file.
I tried something like this, but I get a list of DataFrames and it is not well optimized:
dfend = []
for i in df2.values:
filt1 = df.shop == i[0]
filt2 = df.date >= i[1]
filt3 = df.date <= i[2]
dfgrouped = df.where(filt1 & filt2 & filt3).groupby('shop').agg(mean = ('sell','mean'), begin = ('date','min'), end = ('date', 'max'))
dfend.append(dfgrouped)
Can someone help me?
Thanks a lot.
Merge the two DataFrames on 'shop'. Then you can check the date condition using between to filter down to the rows that count. Finally, groupby + mean. (This assumes your second df is unique.)
m = df2.merge(df1, how='left')
(m[m['date'].between(m['start'], m['stop'])]
.groupby(['shop', 'start', 'stop'])['sell'].mean()
.reset_index())
# shop start stop sell
#0 a 2000 2001 111
#1 a 2000 2002 174
#2 b 2000 2000 55
If there are some rows in df2 that have no qualifying rows in df1, then instead use where, so that they still get a row after the groupby (this is also the reason why df2 is the left DataFrame in the merge). Here I added an extra row:
print(df2)
# shop start stop
#0 a 2000 2002
#1 a 2000 2001
#2 b 2000 2000
#3 e 1999 2011
m = df2.merge(df1, how='left')
(m.where(m['date'].between(m['start'], m['stop']))
.groupby([m.shop, m.start, m.stop])['sell'].mean()
.reset_index())
# shop start stop sell
#0 a 2000 2001 111.0
#1 a 2000 2002 174.0
#2 b 2000 2000 55.0
#3 e 1999 2011 NaN
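For completeness, a sketch of the inputs assumed above (df1 holding the sells, df2 the start/stop pairs), so the snippets run end to end:
import pandas as pd

df1 = pd.DataFrame({'shop': ['a', 'a', 'a', 'b', 'b', 'b'],
                    'sell': [100, 122, 300, 55, 245, 1156],
                    'date': [2000, 2001, 2002, 2000, 2001, 2002]})
df2 = pd.DataFrame({'shop': ['a', 'a', 'b'],
                    'start': [2000, 2000, 2000],
                    'stop': [2002, 2001, 2000]})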
