Subtract columns from two data frames based on a common column - python

df1:
           A   B
0       Asia  34
1    America  74
2  Australia  92
3     Africa  44

df2:
           A   B
0       Asia  24
1  Australia  90
2     Africa  30
I want the output of df1 - df2 to be
Asia 10
America 74
Australia 2
Africa 14
I'm stuck on this; I'm a newbie to pandas. Please help out.

Use Series.sub with mapped second Series by Series.map:
df1['B'] = df1['B'].sub(df1['A'].map(df2.set_index('A')['B']), fill_value=0)
print(df1)
A B
0 Asia 10.0
1 America 74.0
2 Australia 2.0
3 Africa 14.0
If a changed ordering of the first column is acceptable, convert both first columns to an index with DataFrame.set_index and subtract:
df2 = df1.set_index('A')['B'].sub(df2.set_index('A')['B'], fill_value=0).reset_index()
print(df2)
A B
0 Africa 14.0
1 America 74.0
2 Asia 10.0
3 Australia 2.0
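For reference, the first answer can be run end-to-end on frames reconstructed from the question (the column names A and B are assumed from the answer's code):

```python
import pandas as pd

# Frames reconstructed from the question, with assumed column names A and B
df1 = pd.DataFrame({'A': ['Asia', 'America', 'Australia', 'Africa'],
                    'B': [34, 74, 92, 44]})
df2 = pd.DataFrame({'A': ['Asia', 'Australia', 'Africa'],
                    'B': [24, 90, 30]})

# Map df2's values onto df1's keys, then subtract; fill_value=0 keeps
# rows (like America) that have no match in df2 instead of producing NaN.
df1['B'] = df1['B'].sub(df1['A'].map(df2.set_index('A')['B']), fill_value=0)
print(df1)
```

Without fill_value=0, the America row would become NaN, since its mapped value is missing.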

Related

Converting multiple variables into one column and creating a matching values column using pandas

I have a Table with the following format:
Country  GDP  LifeExp
USA      6.5  75
UK       9.5  78
Italy    5.5  80
I need to change the Table above to the format of the Table below. This is just a small part of the actual table so hard coding is not going to cut it unfortunately.
Country  Indicator name  Value
USA      GDP             6.5
USA      LifeExp         75
UK       GDP             9.5
UK       LifeExp         78
Italy    GDP             5.5
Italy    LifeExp         80
Here is the code to create the first Table:
import pandas as pd
df = pd.DataFrame({'Country': ["USA", "UK", "Italy"],
                   'GDP': [6.5, 9.5, 5.5],
                   'LifeExp': [75, 78, 80]})
I've never posted on Stack Overflow before, so I hope I've provided sufficient info for someone to help me with this problem.
Thanks in advance!
You can use .melt() with .sort_values(), as follows:
(df.melt(id_vars='Country', var_name='Indicator name', value_name='Value')
.sort_values('Country', ascending=False)
).reset_index(drop=True)
# Result
Country Indicator name Value
0 USA GDP 6.5
1 USA LifeExp 75.0
2 UK GDP 9.5
3 UK LifeExp 78.0
4 Italy GDP 5.5
5 Italy LifeExp 80.0
You can choose sorting order of Country column. If you want it in ascending order, you can simply remove the parameter ascending=False in the .sort_values() function.
Use .stack() and .reset_index():
print(
df.set_index("Country")
.stack()
.reset_index()
.rename(columns={"level_1": "Indicator Name", 0: "Value"})
)
Prints:
Country Indicator Name Value
0 USA GDP 6.5
1 USA LifeExp 75.0
2 UK GDP 9.5
3 UK LifeExp 78.0
4 Italy GDP 5.5
5 Italy LifeExp 80.0

Cumsum with groupby

I have a dataframe containing:
State Country Date Cases
0 NaN Afghanistan 2020-01-22 0
271 NaN Afghanistan 2020-01-23 0
... ... ... ... ...
85093 NaN Zimbabwe 2020-11-30 9950
85364 NaN Zimbabwe 2020-12-01 10129
I'm trying to create a new column of cumulative cases but grouped by Country AND State.
State Country Date Cases Total Cases
231 California USA 2020-01-22 5 5
342 California USA 2020-01-23 10 15
233 Texas USA 2020-01-22 4 4
322 Texas USA 2020-01-23 12 16
I have been trying to follow Pandas groupby cumulative sum and have tried things such as:
df['Total'] = df.groupby(['State','Country'])['Cases'].cumsum()
Returns a series of -1's
df['Total'] = df.groupby(['State', 'Country']).sum() \
.groupby(level=0).cumsum().reset_index()
Returns the sum.
df['Total'] = df.groupby(['Country'])['Cases'].apply(lambda x: x.cumsum())
Doesn't separate sums by state.
df_f['Total'] = df_f.groupby(['Region','State'])['Cases'].apply(lambda x: x.cumsum())
This one works, except when 'State' is NaN, 'Total' is also NaN.
arrays = [['California', 'California', 'Texas', 'Texas'],
          ['USA', 'USA', 'USA', 'USA'],
          ['2020-01-22', '2020-01-23', '2020-01-22', '2020-01-23'],
          [5, 10, 4, 12]]
df = pd.DataFrame(list(zip(*arrays)), columns=['State', 'Country', 'Date', 'Cases'])
df
State Country Date Cases
0 California USA 2020-01-22 5
1 California USA 2020-01-23 10
2 Texas USA 2020-01-22 4
3 Texas USA 2020-01-23 12
temp = df.set_index(['State', 'Country', 'Date'], drop=True).sort_index()
df['Total Cases'] = temp.groupby(['State', 'Country']).cumsum().reset_index()['Cases']
df
State Country Date Cases Total Cases
0 California USA 2020-01-22 5 5
1 California USA 2020-01-23 10 15
2 Texas USA 2020-01-22 4 4
3 Texas USA 2020-01-23 12 16
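The attempts in the question fail on rows where State is NaN because groupby drops NaN keys by default. Since pandas 1.1, dropna=False keeps them, which addresses that case directly; a minimal sketch with hypothetical country-level rows:

```python
import pandas as pd
import numpy as np

# Mix of state-level and country-level (State is NaN) records
df = pd.DataFrame({'State': ['California', 'California', np.nan, np.nan],
                   'Country': ['USA', 'USA', 'Afghanistan', 'Afghanistan'],
                   'Date': ['2020-01-22', '2020-01-23', '2020-01-22', '2020-01-23'],
                   'Cases': [5, 10, 0, 7]})

# dropna=False keeps the NaN-State rows as their own group, so the
# running total is not NaN for country-level records.
df['Total Cases'] = df.groupby(['State', 'Country'], dropna=False)['Cases'].cumsum()
print(df)
```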

Summing on all previous values of a dataframe in Python

I have data that looks like:
Year Month Region Value
1978 1 South 1
1990 1 North 22
1990 2 South 33
1990 2 Mid W 12
1998 1 South 1
1998 1 North 12
1998 2 South 2
1998 3 South 4
1998 1 Mid W 2
...
up to 2010
My end date is 2010 but I want to sum all Values by the Region and Month by adding all previous year values together.
I don't want just a regular cumulative sum but a Monthly Cumulative Sum by Region where Month 1 of Region South is the cumulative Month 1 of Region South of all previous Month 1s before it, etc....
Desired output is something like:
Month Region Cum_Value
1 South 2
2 South 34
3 South 4
.
.
1 North 34
2 North 10
.
.
1 MidW 2
2 MidW 12
Use pd.DataFrame.groupby with pd.DataFrame.cumsum
df1['cumsum'] = df1.groupby(['Month', 'Region'])['Value'].cumsum()
Result:
Year Month Region Value cumsum
0 1978 1 South 1.0 1.0
1 1990 1 North 22.0 22.0
2 1990 2 South 33.0 33.0
3 1990 2 Mid W 12.0 12.0
4 1998 1 South 1.0 2.0
5 1998 1 North 12.0 34.0
6 1998 2 South 2.0 35.0
7 1998 3 South 4.0 4.0
8 1998 1 Mid W 2.0 2.0
Here's another solution that corresponds more closely with your expected output.
df = pd.DataFrame({'Year': [1978, 1990, 1990, 1990, 1998, 1998, 1998, 1998, 1998],
                   'Month': [1, 1, 2, 2, 1, 1, 2, 3, 1],
                   'Region': ['South', 'North', 'South', 'Mid West', 'South',
                              'North', 'South', 'South', 'Mid West'],
                   'Value': [1, 22, 33, 12, 1, 12, 2, 4, 2]})
#DataFrame Result
Year Month Region Value
0 1978 1 South 1
1 1990 1 North 22
2 1990 2 South 33
3 1990 2 Mid West 12
4 1998 1 South 1
5 1998 1 North 12
6 1998 2 South 2
7 1998 3 South 4
8 1998 1 Mid West 2
Code to run:
df1 = df.groupby(['Month','Region']).sum()
df1 = df1.drop('Year',axis=1)
df1 = df1.sort_values(['Month','Region'])
#Final Result
Month Region Value
1 Mid West 2
1 North 34
1 South 2
2 Mid West 12
2 South 35
3 South 4
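The first answer can be verified end-to-end on the sample frame constructed in the second answer:

```python
import pandas as pd

# Sample frame from the second answer above
df1 = pd.DataFrame({'Year': [1978, 1990, 1990, 1990, 1998, 1998, 1998, 1998, 1998],
                    'Month': [1, 1, 2, 2, 1, 1, 2, 3, 1],
                    'Region': ['South', 'North', 'South', 'Mid West', 'South',
                               'North', 'South', 'South', 'Mid West'],
                    'Value': [1, 22, 33, 12, 1, 12, 2, 4, 2]})

# Running total per (Month, Region) pair, in the original row order
df1['cumsum'] = df1.groupby(['Month', 'Region'])['Value'].cumsum()
print(df1)
```

Each row's cumsum accumulates only the Values of earlier rows sharing its Month and Region, which is the "Month 1 of South sums all previous Month 1s of South" behavior the question asks for.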

Create grouped bar graph from dataframe

Sample DataFrame
continent avg_count_country avg_age
Male 0 Asia 55 5
1 Africa 65 10
2 Europe 75 8
Female 0 Asia 50 7
1 Africa 60 12
2 Europe 70 0
Transgender 0 Asia 30 6
1 Africa 40 11
2 America 80 10
For the grouped bar graph:
X axis will have Male , Female , Transgender
Y axis will have total counts
3 bars in each Male , Female and Transgender
Male and Female will have 3 bars grouped - Asia , Africa , Europe
Transgender will have 3 bars grouped -- Asia , Africa , America
4 unique colors or legends [Asia , Africa ,Europe , America]
I can do it manually, defining every bar:
bars1 = ...  # manually given values
bars2 = ...  # manually given values
...  # bars3, bars4
plt.bar(r1, bars1, color='#7f6d5f', width=barWidth, edgecolor='white', label='var1')
and plotting each bar like this, but I want a more optimized or dynamic way.
You can reshape your dataframe and use pandas plot:
df_out = df.reset_index(level=1, drop=True)\
.set_index(['continent'], append=True)['avg_count_country']\
.unstack()
df_out.plot.bar()
Output: a grouped bar chart, one group per gender with one colored bar per continent (figure omitted).
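A runnable version of the reshape, with the sample frame reconstructed as a two-level index (gender, row number) as it appears in the question:

```python
import pandas as pd

# Reconstruction of the sample frame: rows indexed by (gender, n)
idx = pd.MultiIndex.from_tuples(
    [('Male', 0), ('Male', 1), ('Male', 2),
     ('Female', 0), ('Female', 1), ('Female', 2),
     ('Transgender', 0), ('Transgender', 1), ('Transgender', 2)])
df = pd.DataFrame({'continent': ['Asia', 'Africa', 'Europe',
                                 'Asia', 'Africa', 'Europe',
                                 'Asia', 'Africa', 'America'],
                   'avg_count_country': [55, 65, 75, 50, 60, 70, 30, 40, 80],
                   'avg_age': [5, 10, 8, 7, 12, 0, 6, 11, 10]}, index=idx)

# Drop the inner counter, move continent into the index, and pivot
# continents into columns: one row per gender, one column per continent
df_out = df.reset_index(level=1, drop=True) \
           .set_index(['continent'], append=True)['avg_count_country'] \
           .unstack()
print(df_out)
```

Calling `df_out.plot.bar()` then draws one group of bars per gender and one legend color per continent (matplotlib required); continents absent for a gender, such as America for Male, come out as NaN and simply get no bar.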

pandas: if intersection then update dataframe

I have two dataframes:
countries:
Country or Area Name ISO-2 ISO-3
0 Afghanistan AF AFG
1 Philippines PH PHL
2 Albania AL ALB
3 Norway NO NOR
4 American Samoa AS ASM
contracts:
Country Name Jurisdiction Signature year
0 Yemen KY;NO;CA;NO 1999.0
1 Yemen BM;TC;YE 2007.0
2 Congo, CD;CD 2015.0
3 Philippines PH 2009.0
4 Philippines PH;PH 2007.0
5 Philippines PH 2001.0
6 Philippines PH;PH 1997.0
7 Bolivia, Plurinational State of BO;BO 2006.0
I want to:
check whether the column Jurisdiction in contracts contains at least one two-letter code from the countries ISO-2 column.
I have tried numerous ways of testing whether there is an intersection, but none of them works. My last try was:
i1 = pd.Index(contracts['Jurisdiction of Incorporation'].str.split(';'))
i2 = pd.Index(countries['ISO-2'])
print(i1, i2)
i1.intersection(i2)
Which gives me TypeError: unhashable type: 'list'
If at least one of the codes is present, I want to update the contracts dataframe with a new column that will contain just boolean values:
contracts['new column'] = np.where("piece of code that will actually work", 1, 0)
So the desired output would be
Country Name Jurisdiction Signature year new column
0 Yemen KY;NO;CA;NO 1999.0 1
1 Yemen BM;TC;YE 2007.0 0
2 Congo, CD;CD 2015.0 0
3 Philippines PH 2009.0 1
4 Philippines PH;PH 2007.0 1
5 Philippines PH 2001.0 1
6 Philippines PH;PH 1997.0 1
7 Bolivia, Plurinational State of BO;BO 2006.0 0
How can I achieve this?
A bit of a mouthful, but try this:
occuring_iso_2_codes = set(countries['ISO-2'])
contracts['new column'] = contracts.Jurisdiction.apply(
lambda s: int(bool(set(s.split(';')).intersection(occuring_iso_2_codes))))
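The same approach, run on a trimmed reconstruction of the sample frames:

```python
import pandas as pd

# Trimmed versions of the frames from the question
countries = pd.DataFrame({'Country or Area Name': ['Afghanistan', 'Philippines',
                                                   'Albania', 'Norway', 'American Samoa'],
                          'ISO-2': ['AF', 'PH', 'AL', 'NO', 'AS']})
contracts = pd.DataFrame({'Country Name': ['Yemen', 'Yemen', 'Congo,', 'Philippines'],
                          'Jurisdiction': ['KY;NO;CA;NO', 'BM;TC;YE', 'CD;CD', 'PH']})

occuring_iso_2_codes = set(countries['ISO-2'])

# 1 if any ';'-separated code appears in the ISO-2 set, else 0
contracts['new column'] = contracts.Jurisdiction.apply(
    lambda s: int(bool(set(s.split(';')) & occuring_iso_2_codes)))
print(contracts)
```

Splitting each Jurisdiction string into a set and intersecting avoids the unhashable-list error from the Index-based attempt: sets of strings are hashable element-wise and intersect cheaply.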
