pandas: if intersection then update dataframe - python

I have two dataframes:
countries:
Country or Area Name ISO-2 ISO-3
0 Afghanistan AF AFG
1 Philippines PH PHL
2 Albania AL ALB
3 Norway NO NOR
4 American Samoa AS ASM
contracts:
Country Name Jurisdiction Signature year
0 Yemen KY;NO;CA;NO 1999.0
1 Yemen BM;TC;YE 2007.0
2 Congo, CD;CD 2015.0
3 Philippines PH 2009.0
4 Philippines PH;PH 2007.0
5 Philippines PH 2001.0
6 Philippines PH;PH 1997.0
7 Bolivia, Plurinational State of BO;BO 2006.0
I want to:
check whether the column Jurisdiction in contracts contains at least one two-letter code from the countries ISO-2 column.
I have tried numerous ways of testing whether there is an intersection, but none of them worked. My last try was:
i1 = pd.Index(contracts['Jurisdiction of Incorporation'].str.split(';'))
i2 = pd.Index(countries['ISO-2'])
print i1, i2
i1.intersection(i2)
Which gives me TypeError: unhashable type: 'list'.
If at least one of the codes is present, I want to update the contracts dataframe with a new column that contains just boolean values:
contracts['new column'] = np.where("piece of code that will actually work", 1, 0)
So the desired output would be
Country Name Jurisdiction Signature year new column
0 Yemen KY;NO;CA;NO 1999.0 1
1 Yemen BM;TC;YE 2007.0 0
2 Congo, CD;CD 2015.0 0
3 Philippines PH 2009.0 1
4 Philippines PH;PH 2007.0 1
5 Philippines PH 2001.0 1
6 Philippines PH;PH 1997.0 1
7 Bolivia, Plurinational State of BO;BO 2006.0 0
How can I achieve this?

A bit of a mouthful, but try this:
occurring_iso_2_codes = set(countries['ISO-2'])
contracts['new column'] = contracts.Jurisdiction.apply(
    lambda s: int(bool(set(s.split(';')).intersection(occurring_iso_2_codes))))
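The same idea can be written out as a self-contained sketch. The sample rows are taken from the question; the helper name has_known_code is hypothetical, and the sketch assumes the Jurisdiction column has no missing values (a NaN would make split fail).

```python
import pandas as pd

# Sample rows copied from the question.
countries = pd.DataFrame({
    "Country or Area Name": ["Afghanistan", "Philippines", "Albania",
                             "Norway", "American Samoa"],
    "ISO-2": ["AF", "PH", "AL", "NO", "AS"],
})
contracts = pd.DataFrame({
    "Country Name": ["Yemen", "Yemen", "Philippines"],
    "Jurisdiction": ["KY;NO;CA;NO", "BM;TC;YE", "PH;PH"],
})

# Build the lookup set once so each row only does cheap membership tests.
known_codes = set(countries["ISO-2"])


def has_known_code(jurisdiction):
    """Hypothetical helper: True if any ';'-separated code is a known ISO-2 code."""
    return not known_codes.isdisjoint(jurisdiction.split(";"))


contracts["new column"] = contracts["Jurisdiction"].apply(has_known_code).astype(int)
flags = contracts["new column"].tolist()
print(contracts)
```

Using isdisjoint avoids building a new set for every row, and it stops as soon as one matching code is found.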


How to replace the values of a column to other columns only in NaN values?

How to fill the values of column ["state"] with another column ["country"] only in NaN values?
Like in this Pandas DataFrame:
state country sum
0 NaN China 1
1 Assam India 2
2 Odisa India 3
3 Bihar India 4
4 NaN India 5
5 NaN Srilanka 6
6 NaN Malaysia 7
7 NaN Bhutan 8
8 California US 9
9 Texas US 10
10 Newyork US 11
11 NaN US 12
12 NaN Canada 13
What code should I use to fill the state column with the country column only where state is NaN, like this:
state country sum
0 China China 1
1 Assam India 2
2 Odisa India 3
3 Bihar India 4
4 India India 5
5 Srilanka Srilanka 6
6 Malaysia Malaysia 7
7 Bhutan Bhutan 8
8 California US 9
9 Texas US 10
10 Newyork US 11
11 US US 12
12 Canada Canada 13
I can use this code:
df.loc[df['state'].isnull(), 'state'] = df[df['state'].isnull()]['country'].replace(df['country'])
But on a very large dataset with 300K rows, it computes for 5-6 minutes and crashes every time, because it replaces one value at a time.
Can anyone help me with efficient code for this?
Please!
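Perhaps use fillna directly, without checking isnull() or calling replace(). Assigning the result back avoids the chained inplace call, which newer pandas versions warn about:
df['state'] = df['state'].fillna(df['country'])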
print(df)
Output
state country sum
0 China China 1
1 Assam India 2
2 Odisa India 3
3 Bihar India 4
4 India India 5
5 Srilanka Srilanka 6
6 Malaysia Malaysia 7
7 Bhutan Bhutan 8
8 California US 9
9 Texas US 10
10 Newyork US 11
11 US US 12
12 Canada Canada 13
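As a small self-contained sketch of why this is fast: fillna with a Series argument aligns on the index and fills every missing value in one vectorized pass. The rows below are a subset of the question's data, and the where-based variant is shown only as an equivalent alternative.

```python
import numpy as np
import pandas as pd

# A few rows from the question, with NaN marking the missing states.
df = pd.DataFrame({
    "state": [np.nan, "Assam", np.nan, "Texas"],
    "country": ["China", "India", "Srilanka", "US"],
    "sum": [1, 2, 6, 10],
})

# Equivalent formulation: keep state where it is present, else take country.
via_where = df["state"].where(df["state"].notna(), df["country"])

# Fill missing states from the country column, aligned row by row on the index.
df["state"] = df["state"].fillna(df["country"])
states = df["state"].tolist()
print(df)
```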

Python Pivot: Can I get the count of columns per row(id/index) and store it in a new columns?

Hope you can help me with this.
The df looks like this.
region AMER
country Brazil Canada Columbia Mexico United States
metro Rio de Janeiro Sao Paulo Toronto Bogota Mexico City Monterrey Atlanta Boston Chicago Culpeper Dallas Denver Houston Los Angeles Miami New York Philadelphia Seattle Silicon Valley Washington D.C.
ID
321321 2 1 1 13 15 29 1 2 1 11 6 15 3 2 14 3
23213 3
231 2 2 3 1 5 6 3 3 4 3 3 4
23213 4 1 1 1 4 1 2 27 1
21321 4 2 2 1 14 3 2 4 2
12321 1 2 1 1 1 1 10
123213 2 45 5 1
12321 1
123 1 3 2
I want to get the count of columns that have data, per metro and country per region, for every row (ID/index), and store that count in a new column.
Regards,
RJ
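You may want to try the following, which adds up the values in each row per region. Note that the level argument of sum was removed in pandas 2.0, so this only runs on older versions, and it returns sums rather than counts:
df['new'] = df.sum(level=0, axis=1)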
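If a count of filled cells is what is needed, one possible sketch is to count the non-null entries per row. This assumes the blank cells in the pivot are NaN (not empty strings) and that the columns form a MultiIndex with levels named region, country and metro; the small frame below is made up to mirror that layout and is not the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the pivot: three column levels, NaN for blanks.
columns = pd.MultiIndex.from_tuples(
    [("AMER", "Brazil", "Rio de Janeiro"),
     ("AMER", "Brazil", "Sao Paulo"),
     ("AMER", "Canada", "Toronto"),
     ("AMER", "Mexico", "Monterrey")],
    names=["region", "country", "metro"],
)
df = pd.DataFrame(
    [[2, 1, np.nan, 13],
     [np.nan, np.nan, 3, np.nan],
     [np.nan, np.nan, np.nan, np.nan]],
    index=pd.Index([321321, 23213, 231], name="ID"),
    columns=columns,
)

# Number of metro columns that hold data in each row.
filled_per_row = df.notna().sum(axis=1)

# The same count broken down by country: transpose, group the column level, transpose back.
filled_per_country = df.notna().astype(int).T.groupby(level="country").sum().T

# Store the overall count in a new column (the key must match the three column levels).
df[("count", "", "")] = filled_per_row
print(df)
```

Grouping on level="region" instead of "country" gives the per-region count in the same way.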

Subtract columns from two data frames based on a common column

df1:
Asia 34
America 74
Australia 92
Africa 44
df2 :
Asia 24
Australia 90
Africa 30
I want the output of df1 - df2 to be
Asia 10
America 74
Australia 2
Africa 14
I am having trouble with this; I am new to pandas. Please help.
Use Series.sub with mapped second Series by Series.map:
df1['B'] = df1['B'].sub(df1['A'].map(df2.set_index('A')['B']), fill_value=0)
print (df1)
A B
0 Asia 10.0
1 America 74.0
2 Australia 2.0
3 Africa 14.0
If changing the order of the first column is acceptable, convert both first columns to the index with DataFrame.set_index and subtract:
df2 = df1.set_index('A')['B'].sub(df2.set_index('A')['B'], fill_value=0).reset_index()
print(df2)
A B
0 Africa 14.0
1 America 74.0
2 Asia 10.0
3 Australia 2.0
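For reference, here is a runnable sketch that builds both frames and keeps df1's row order. The column names A and B follow the answer above and are an assumption, since the question shows no headers. Reindexing the second Series to the first one's labels is a variation on the set_index approach that avoids the alphabetical reordering.

```python
import pandas as pd

# Frames from the question; column names A and B are assumed.
df1 = pd.DataFrame({"A": ["Asia", "America", "Australia", "Africa"],
                    "B": [34, 74, 92, 44]})
df2 = pd.DataFrame({"A": ["Asia", "Australia", "Africa"],
                    "B": [24, 90, 30]})

s1 = df1.set_index("A")["B"]
s2 = df2.set_index("A")["B"]

# Reindex s2 to s1's labels so the result keeps df1's order;
# fill_value=0 treats regions missing from df2 (America) as zero.
diff = s1.sub(s2.reindex(s1.index), fill_value=0).astype(int)
result = diff.reset_index()
print(result)
```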

Summing on all previous values of a dataframe in Python

I have data that looks like:
Year Month Region Value
1978 1 South 1
1990 1 North 22
1990 2 South 33
1990 2 Mid W 12
1998 1 South 1
1998 1 North 12
1998 2 South 2
1998 3 South 4
1998 1 Mid W 2
.
.
up to
2010
2010
My end date is 2010 but I want to sum all Values by the Region and Month by adding all previous year values together.
I don't want just a regular cumulative sum but a Monthly Cumulative Sum by Region where Month 1 of Region South is the cumulative Month 1 of Region South of all previous Month 1s before it, etc....
Desired output is something like:
Month Region Cum_Value
1 South 2
2 South 34
3 South 4
.
.
1 North 34
2 North 10
.
.
1 MidW 2
2 MidW 12
Use pd.DataFrame.groupby with cumsum on the grouped Value column:
df1['cumsum'] = df1.groupby(['Month', 'Region'])['Value'].cumsum()
Result:
Year Month Region Value cumsum
0 1978 1 South 1.0 1.0
1 1990 1 North 22.0 22.0
2 1990 2 South 33.0 33.0
3 1990 2 Mid W 12.0 12.0
4 1998 1 South 1.0 2.0
5 1998 1 North 12.0 34.0
6 1998 2 South 2.0 35.0
7 1998 3 South 4.0 4.0
8 1998 1 Mid W 2.0 2.0
Here's another solution that corresponds more closely to your expected output.
df = pd.DataFrame({'Year': [1978,1990,1990,1990,1998,1998,1998,1998,1998],
                   'Month': [1,1,2,2,1,1,2,3,1],
                   'Region': ['South','North','South','Mid West','South','North','South','South','Mid West'],
                   'Value' : [1,22,33,12,1,12,2,4,2]})
#DataFrame Result
Year Month Region Value
0 1978 1 South 1
1 1990 1 North 22
2 1990 2 South 33
3 1990 2 Mid West 12
4 1998 1 South 1
5 1998 1 North 12
6 1998 2 South 2
7 1998 3 South 4
8 1998 1 Mid West 2
Code to run:
df1 = df.groupby(['Month','Region']).sum()
df1 = df1.drop('Year',axis=1)
df1 = df1.sort_values(['Month','Region'])
#Final Result
Month Region Value
1 Mid West 2
1 North 34
1 South 2
2 Mid West 12
2 South 35
3 South 4

Simple way to convert a Pandas Series for integer comparison

I have the following very simple code, and would like to select all the teams that have a highest_rank of 1.
import pandas as pd
table = pd.read_table('team_rankings.dat')
table.head()
rank team rating highest_rank highest_rating
0 1 Germany 2097 1 2205
1 2 Brazil 2086 1 2161
2 3 Spain 2011 1 2147
3 4 Portugal 1968 2 1991
4 5 Argentina 1967 1 2128
type((table['highest_rank']))
pandas.core.series.Series
table.loc[(table['highest_rank']) < 2]
then gives me a
TypeError: unorderable types: str() < int()
since some highest_rank entries are '-'. Urgh. What's a simple way to perform this (integer) selection?
You can parse "-" as NaN when reading the file. That may also help with future tasks.
table = pd.read_table('team_rankings.dat', na_values="-")
See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
Use pd.to_numeric with errors='coerce', i.e.:
table.loc[pd.to_numeric(table['highest_rank'], errors='coerce') < 2]
Output:
rank team rating highest_rank highest_rating
0 1 Germany 2097 1 2205
1 2 Brazil 2086 1 2161
2 3 Spain 2011 1 2147
4 5 Argentina 1967 1 2128
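A minimal sketch of the coercion approach without the data file. The first two rows come from the question; the third row, with '-' as its highest_rank, is a hypothetical example added to show what happens to non-numeric entries.

```python
import pandas as pd

# Two rows from the question plus one made-up row whose rank is '-'.
table = pd.DataFrame({
    "team": ["Germany", "Portugal", "Hypothetical FC"],
    "highest_rank": ["1", "2", "-"],
})

# errors='coerce' turns anything that cannot be parsed into NaN.
rank = pd.to_numeric(table["highest_rank"], errors="coerce")

# Comparisons against NaN are False, so the '-' row is dropped from the selection.
top = table.loc[rank == 1]
print(top)
```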
