Populating new column in dataframe from existing two with different lengths - python

How would I go about adding a new column to an existing dataframe by comparing it to another dataframe that is shorter and has a different index?
For example, if I have:
df1 = country code year
0 Armenia a 2016
1 Brazil b 2017
2 Turkey c 2016
3 Armenia d 2017
df2 = geoCountry 2016_gdp 2017_gdp
0 Armenia 10.499 10.74
1 Brazil 1,798.62 2,140.94
2 Turkey 857.429 793.698
and I want to end up with:
df1 = country code year gdp
0 Armenia a 2016 10.499
1 Brazil b 2017 2,140.94
2 Turkey c 2016 857.429
3 Armenia d 2017 10.74
How would I go about this? I attempted to use answers outlined here and here to no avail. I also tried the following, which takes too long on a 90000-row dataframe:
for index, row in df1.iterrows():
    if row['country'] in list(df2.geoCountry):
        if row['year'] == 2016:
            df1['gdp'].append(df2[df2.geoCountry == str(row['country'])]['2016'])
        else:
            df1['gdp'].append(df2[df2.geoCountry == str(row['country'])]['2017'])

I guess this is what you're looking for:
df2 = df2.melt(id_vars='geoCountry', value_vars=['2016_gdp', '2017_gdp'], var_name='year')
df1['year'] = df1['year'].astype('int')
df2['year'] = df2['year'].str.slice(0,4).astype('int')
df1.merge(df2, left_on = ['country','year'], right_on = ['geoCountry','year'])[['country', 'code', 'year', 'value']]
Output:
country code year value
0 Armenia a 2016 10.499
1 Brazil b 2017 2,140.94
2 Turkey c 2016 857.429
3 Armenia d 2017 10.74

You mainly need the melt function:
import pandas as pd
# Strip the "_gdp" suffix so the year columns are named "2016" and "2017"
df2.columns = df2.columns.str.split("_").str.get(0)
df2 = df2.rename(columns={"geoCountry": "country"})
df3 = pd.melt(df2, id_vars=['country'], value_vars=['2016', '2017'],
              var_name='year', value_name='gdp')
# melt leaves the years as strings; this assumes df1['year'] holds integers
df3['year'] = df3['year'].astype(int)
After this, simply merge df1 with df3:
result = pd.merge(df1, df3, on=['country', 'year'])
Output:
country code year gdp
0 Armenia a 2016 10.499
1 Brazil b 2017 2140.940
2 Turkey c 2016 857.429
3 Armenia d 2017 10.740
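For anyone who wants to try this end to end, here is a minimal self-contained sketch of the melt-then-merge approach. It assumes the GDP figures arrive as strings with thousands separators (as the "1,798.62" in the question suggests) and that df1's year column holds integers; the name long_gdp and the how="left" choice are illustrative, not part of the answers above.

```python
import pandas as pd

df1 = pd.DataFrame({
    "country": ["Armenia", "Brazil", "Turkey", "Armenia"],
    "code": ["a", "b", "c", "d"],
    "year": [2016, 2017, 2016, 2017],
})
df2 = pd.DataFrame({
    "geoCountry": ["Armenia", "Brazil", "Turkey"],
    "2016_gdp": ["10.499", "1,798.62", "857.429"],
    "2017_gdp": ["10.74", "2,140.94", "793.698"],
})

# Wide -> long: one row per (country, year) pair
long_gdp = df2.melt(id_vars="geoCountry", var_name="year", value_name="gdp")

# "2016_gdp" -> 2016, so the key matches the integer years in df1
long_gdp["year"] = long_gdp["year"].str.slice(0, 4).astype(int)

# Assumed cleanup step: drop thousands separators and convert to float
long_gdp["gdp"] = long_gdp["gdp"].str.replace(",", "", regex=False).astype(float)

long_gdp = long_gdp.rename(columns={"geoCountry": "country"})

# A left merge keeps every row of df1 in its original order;
# countries missing from df2 would get NaN in the gdp column
result = df1.merge(long_gdp, on=["country", "year"], how="left")
print(result)
```

Because the lookup is a single merge rather than a Python loop over rows, it should scale far better than iterrows on a 90000-row frame.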

Related

Replace values in specific rows from one DataFrame to another when certain columns have the same values

Unlike in the other questions, I don't want to create a new column for the new values; I want to keep the same column and replace the old values with the new ones where they exist.
For a new column I would have:
import pandas as pd
df1 = pd.DataFrame(data = {'Name' : ['Carl','Steave','Julius','Marcus'],
'Work' : ['Home','Street','Car','Airplane'],
'Year' : ['2022','2021','2020','2019'],
'Days' : ['',5,'','']})
df2 = pd.DataFrame(data = {'Name' : ['Carl','Julius'],
'Work' : ['Home','Car'],
'Days' : [1,2]})
df_merge = pd.merge(df1, df2, how='left', on=['Name','Work'], suffixes=('','_'))
print(df_merge)
Name Work Year Days Days_
0 Carl Home 2022 1.0
1 Steave Street 2021 5 NaN
2 Julius Car 2020 2.0
3 Marcus Airplane 2019 NaN
But what I really want is exactly like this:
Name Work Year Days
0 Carl Home 2022 1
1 Steave Street 2021 5
2 Julius Car 2020 2
3 Marcus Airplane 2019
How can I make such a union?
You can use combine_first, setting the empty strings to NaNs beforehand (the indexing at the end is to rearrange the columns to match the desired output):
df1.loc[df1["Days"] == "", "Days"] = float("NaN")
df1.combine_first(df1[["Name", "Work"]].merge(df2, "left"))[df1.columns.values]
This outputs:
Name Work Year Days
0 Carl Home 2022 1.0
1 Steave Street 2021 5
2 Julius Car 2020 2.0
3 Marcus Airplane 2019 NaN
You can use the update method of Series:
df1.Days.update(pd.merge(df1, df2, how='left', on=['Name','Work']).Days_y)
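One caveat: calling update on df1.Days modifies a column accessed through the frame, and this may not write back to df1 when pandas copy-on-write is enabled. A minimal sketch that sidesteps this by updating a standalone copy of the column and assigning it back, using the question's data (the names merged and days and the "_new" suffix are only illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Carl', 'Steave', 'Julius', 'Marcus'],
                    'Work': ['Home', 'Street', 'Car', 'Airplane'],
                    'Year': ['2022', '2021', '2020', '2019'],
                    'Days': ['', 5, '', '']})
df2 = pd.DataFrame({'Name': ['Carl', 'Julius'],
                    'Work': ['Home', 'Car'],
                    'Days': [1, 2]})

# Left merge keeps df1's row order; unmatched rows get NaN in Days_new
merged = df1.merge(df2, how='left', on=['Name', 'Work'], suffixes=('', '_new'))

# Series.update overwrites only where the other Series is non-NaN
days = df1['Days'].copy()
days.update(merged['Days_new'])
df1['Days'] = days

print(df1)
```

The replaced entries come back as floats (1.0 and 2.0) because the merged column contains NaN; cast them afterwards if integers are required.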

Creating a set of columns from rows using pandas

I have a dataframe that currently looks like this:
Year Country Subject Descriptor GDP
0 2015 Austria r 344.2
1 2015 Austria n 344.2
2 2015 Austria d 100
3 2015 Austria u 5.742
4 2015 Belgium r 416.7
5 2015 Belgium n 416.7
6 2015 Belgium d 100
7 2015 Belgium u 8.483
I want to transform it to look something along these lines:
Year Country GDP_R GDP_N GDP_D GDP_U
2015 Austria 344.2 344.2 100 5.742
2015 Belgium 416.7 416.7 100 8.483
So far I have tried melt and stack, but I feel like I'm missing something. Any help would be much appreciated.
Thank you!
You can first use groupby.agg() and put all values of GDP column in a list. Then, you can convert the object to a new DataFrame, using as columns the prefix 'GDP_' and all the values of the Subject Descriptor column.
Finally, putting the two together using pd.concat() will give your final output.
Please see below an example:
one = df.groupby(['Year','Country'])['GDP'].agg(list).reset_index()
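# unique() keeps first-appearance order (r, n, d, u), unlike set(), whose order is arbitrary;
# this assumes every group lists its descriptors in that same order
two = pd.DataFrame(one['GDP'].to_list(), columns=['GDP_' + s.upper() for s in df['Subject Descriptor'].unique()])
new = pd.concat([one,two],axis=1).drop('GDP',axis=1)
new prints back:
Year Country GDP_R GDP_N GDP_D GDP_U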
0 2015 Austria 344.2 344.2 100.0 5.742
1 2015 Belgium 416.7 416.7 100.0 8.483
First you can use groupby on ['Year', 'Country'] and next you can convert the GDPs for each group to a list and then transpose them to columns. Last few steps are to rename columns, reset index and remove column axis name.
(
df.groupby(['Year', 'Country'])
.apply(lambda x: pd.Series(x.GDP.tolist(), index=x['Subject Descriptor']))
.rename(columns = lambda x: f'GDP_{x.upper()}')
.reset_index()
.rename_axis('', axis=1)
)
You can use a pivot in this case :
(df.pivot(index=['Year', 'Country'], columns='Subject Descriptor', values='GDP')
.rename(columns = lambda col: f"GDP_{col.upper()}")
.rename_axis(columns=None).reset_index()
)
Year Country GDP_D GDP_N GDP_R GDP_U
0 2015 Austria 100.0 344.2 344.2 5.742
1 2015 Belgium 100.0 416.7 416.7 8.483
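If the column order from the question (R, N, D, U) matters, the reshaped columns can be reindexed by order of first appearance. The following is a sketch using set_index and unstack rather than pivot, with the question's data typed in by hand; the names order and wide are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [2015] * 8,
    "Country": ["Austria"] * 4 + ["Belgium"] * 4,
    "Subject Descriptor": list("rndu") * 2,
    "GDP": [344.2, 344.2, 100, 5.742, 416.7, 416.7, 100, 8.483],
})

# Descriptors in order of first appearance: r, n, d, u
order = df["Subject Descriptor"].unique()

wide = (
    df.set_index(["Year", "Country", "Subject Descriptor"])["GDP"]
      .unstack("Subject Descriptor")            # descriptors become columns (sorted)
      .reindex(columns=order)                   # restore the original order
      .rename(columns=lambda c: f"GDP_{c.upper()}")
      .rename_axis(columns=None)                # drop the "Subject Descriptor" label
      .reset_index()
)
print(wide)
```

Unlike the list-based approach, this matches values to columns by descriptor label, so it does not depend on the rows of each group being in the same order.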

Replacing some values of a column conditional on another column in pandas

I have a data-frame df:
ID county state sales year
a king oh 10 2010
b 0 al 20 2011
c kent oh 10 2010
d 0 wa 30 2012
I want to replace the zero value of the county with county name conditional on ID, such that, if ID equals 'b', county will be 'anchorage' and for 'd' id county will be 'whitman'.
ID county state sales year
a king oh 10 2010
b anchorage al 20 2011
c kent oh 10 2010
d whitman wa 30 2012
I applied the following code:
conditions = [(df['id'] == 'b'),(df['id'] == 'd')]
values = ['anchorage', 'whitman']
df['county'] = np.select(conditions, values)
The above code replaces the zero value with new value, but at the same time it turns existing nonzero county value to zero.
Try setting the default value of np.select to the 'county' column:
conditions = [(df['id'] == 'b'), (df['id'] == 'd')]
values = ['anchorage', 'whitman']
df['county'] = np.select(conditions, values, default=df['county'])
df:
id county state sales year
0 a king oh 10 2010
1 b anchorage al 20 2011
2 c kent oh 10 2010
3 d whitman wa 30 2012
First create the condition, then simply use loc (no loop is needed):
is_zero = (df['county'] == 0)
df.loc[is_zero & (df['ID'] == 'b'), 'county'] = 'anchorage'
df.loc[is_zero & (df['ID'] == 'd'), 'county'] = 'whitman'
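When there are many IDs to patch, a dictionary lookup may be easier to maintain than a growing list of conditions. Here is a sketch using Series.map, which is a different technique from np.select and loc; the sample frame is typed in from the question, county is assumed to store the placeholder as the string '0', and the name replacements is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['a', 'b', 'c', 'd'],
    'county': ['king', '0', 'kent', '0'],
    'state': ['oh', 'al', 'oh', 'wa'],
    'sales': [10, 20, 10, 30],
    'year': [2010, 2011, 2010, 2012],
})

# Hypothetical lookup table: ID -> county name to fill in
replacements = {'b': 'anchorage', 'd': 'whitman'}

# map() yields NaN for IDs without an entry; fillna() then
# falls back to the existing county value for those rows
df['county'] = df['ID'].map(replacements).fillna(df['county'])

print(df)
```

Note that this overwrites county for every ID in the dictionary, whether or not the current value is zero.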

keep order while using python pandas pivot

df = {'Region':['France','France','France','France'],'total':[1,2,3,4],'date':['12/30/19','12/31/19','01/01/20','01/02/20']}
df=pd.DataFrame.from_dict(df)
print(df)
Region total date
0 France 1 12/30/19
1 France 2 12/31/19
2 France 3 01/01/20
3 France 4 01/02/20
The dates are ordered. Now if I am using pivot
pandas_temp = df.pivot(index='Region',values='total', columns='date')
print(pandas_temp)
date 01/01/20 01/02/20 12/30/19 12/31/19
Region
France 3 4 1 2
I am losing the order. How can I keep it ?
Convert values to datetimes before pivot and then if necessary convert to your custom format:
df['date'] = pd.to_datetime(df['date'])
pandas_temp = df.pivot(index='Region',values='total', columns='date')
pandas_temp = pandas_temp.rename(columns=lambda x: x.strftime('%m/%d/%y'))
#alternative
#pandas_temp.columns = pandas_temp.columns.strftime('%m/%d/%y')
print (pandas_temp)
date 12/30/19 12/31/19 01/01/20 01/02/20
Region
France 1 2 3 4
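If the dates are already in the desired order in the source frame and you would rather not parse them, another option is to reindex the pivoted columns by order of first appearance. This is a sketch of an alternative, not the approach from the answer above; the name wide is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Region': ['France', 'France', 'France', 'France'],
    'total': [1, 2, 3, 4],
    'date': ['12/30/19', '12/31/19', '01/01/20', '01/02/20'],
})

# pivot sorts the column labels as plain strings
wide = df.pivot(index='Region', columns='date', values='total')

# Restore the order in which the dates first appear in df
wide = wide.reindex(columns=df['date'].unique())

print(wide)
```

This only preserves the input order; if the rows themselves are not chronological, converting to datetimes is still the more reliable route.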

Pandas dataframe vertical merge

I have a query regarding merging two dataframes.
For example, I have 2 dataframes as below:
print(df1)
Year Location
0 2013 america
1 2008 usa
2 2011 asia
print(df2)
Year Location
0 2008 usa
1 2008 usa
2 2009 asia
My expected output :
Year Location
2013 america
2008 usa
2011 asia
Year Location
2008 usa
2008 usa
2009 asia
Output i am getting right now :
Year Location Year Location
2013 america 2008 usa
2008 usa 2008 usa
2011 asia 2009 asia
I have tried using pd.concat and pd.merge with no luck.
Please help me with the above.
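Specify the axis along which to concatenate in pd.concat. axis=0 (the default) stacks the rows of df2 below df1, whereas axis=1 places the frames side by side, which is the output you are getting now:
df_merged=pd.concat([df1,df2],axis=0)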
pd.concat([df1, df2]) should work. If all the column headings are the same, it will bind the second dataframe's rows below the first. The pandas cheat sheet (https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) illustrates this with a graphic.
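The columns are the same and in the same order, so on pandas versions before 2.0 you can use df1.append(df2). DataFrame.append was removed in pandas 2.0, so on current versions use pd.concat instead:
output_df = pd.concat([df1, df2], ignore_index=False)
If you set ignore_index=True, you lose the original indexes and get 0..n-1 instead.
This works for a MultiIndex too.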
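To make the difference between the variants concrete, here is a small sketch with the frames from the question typed in by hand; the result names stacked, renumbered and side_by_side are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'Year': [2013, 2008, 2011],
                    'Location': ['america', 'usa', 'asia']})
df2 = pd.DataFrame({'Year': [2008, 2008, 2009],
                    'Location': ['usa', 'usa', 'asia']})

# Default axis=0: rows of df2 go below df1, original index labels are kept
stacked = pd.concat([df1, df2])

# Same vertical stack, but with a fresh 0..n-1 index
renumbered = pd.concat([df1, df2], ignore_index=True)

# axis=1 puts the frames next to each other (the layout the question wants to avoid)
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked)
print(renumbered)
print(side_by_side)
```

The duplicate index labels in stacked are harmless for printing, but ignore_index=True is usually the safer choice if you plan to select rows by label afterwards.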
