I'm trying to combine two pandas DataFrames to update the first one based on criteria from the second. Here is a sample of the two dataframes:
df1

                 state                candidate
year
2016        CALIFORNIA         CLINTON, HILLARY
2016        CALIFORNIA         TRUMP, DONALD J.
2016        CALIFORNIA            JOHNSON, GARY
2016        CALIFORNIA              STEIN, JILL
2016        CALIFORNIA                 WRITE-IN
2016        CALIFORNIA  LA RIVA, GLORIA ESTELLA
2016             TEXAS         TRUMP, DONALD J.
2016             TEXAS         CLINTON, HILLARY
2016             TEXAS            JOHNSON, GARY
2016             TEXAS              STEIN, JILL
...
1988        CALIFORNIA        BUSH, GEORGE H.W.
1988        CALIFORNIA         DUKAKIS, MICHAEL
1988        CALIFORNIA       PAUL, RONALD "RON"
1988        CALIFORNIA           FULANI, LENORA
1988             TEXAS        BUSH, GEORGE H.W.
1988             TEXAS         DUKAKIS, MICHAEL
1988             TEXAS       PAUL, RONALD "RON"
1988             TEXAS           FULANI, LENORA
df2

           state  electoral_votes
year
1988  CALIFORNIA               47
1988       TEXAS               29
...
2016  CALIFORNIA               55
2016       TEXAS               38
There are values for every election year from 1972 to 2020, covering all candidates and all states in a similar format. There are other columns in df1, but they aren't relevant to what I'm trying to do.
My expected result is:

                 state                candidate  electoral_votes
year
2016        CALIFORNIA         CLINTON, HILLARY               55
2016        CALIFORNIA         TRUMP, DONALD J.               55
2016        CALIFORNIA            JOHNSON, GARY               55
2016        CALIFORNIA              STEIN, JILL               55
2016        CALIFORNIA                 WRITE-IN               55
2016        CALIFORNIA  LA RIVA, GLORIA ESTELLA               55
2016             TEXAS         TRUMP, DONALD J.               38
2016             TEXAS         CLINTON, HILLARY               38
2016             TEXAS            JOHNSON, GARY               38
2016             TEXAS              STEIN, JILL               38
...
1988        CALIFORNIA        BUSH, GEORGE H.W.               47
1988        CALIFORNIA         DUKAKIS, MICHAEL               47
1988        CALIFORNIA       PAUL, RONALD "RON"               47
1988        CALIFORNIA           FULANI, LENORA               47
1988             TEXAS        BUSH, GEORGE H.W.               29
1988             TEXAS         DUKAKIS, MICHAEL               29
1988             TEXAS       PAUL, RONALD "RON"               29
1988             TEXAS           FULANI, LENORA               29
I want to match the electoral_votes column in df2 to df1 using the year and state columns, so each row gets the correct value. I got some assistance and was able to match on a single column (you can see the question and answer here), but I am having trouble matching on the two keys (year and state). Using the linked code as-is returns the error:

pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects

I have tried apply, map, applymap, merge, etc., and haven't been able to figure it out. Thanks in advance for the help!
I believe what you are looking for is a left merge. You should specify the common key columns the merge should be based on in on=[...].
# Imports
import pandas as pd

# Specify both key columns in "on".
pd.merge(df1,
         df2,
         how='left',
         on=['year', 'state'])
Out[1821]:
    year       state                candidate  electoral_votes
0   2016  CALIFORNIA         CLINTON, HILLARY               55
1   2016  CALIFORNIA         TRUMP, DONALD J.               55
2   2016  CALIFORNIA            JOHNSON, GARY               55
3   2016  CALIFORNIA              STEIN, JILL               55
4   2016  CALIFORNIA                 WRITE-IN               55
5   2016  CALIFORNIA  LA RIVA, GLORIA ESTELLA               55
6   2016       TEXAS         TRUMP, DONALD J.               38
7   2016       TEXAS         CLINTON, HILLARY               38
8   2016       TEXAS            JOHNSON, GARY               38
9   2016       TEXAS              STEIN, JILL               38
10  1988  CALIFORNIA        BUSH, GEORGE H.W.               47
11  1988  CALIFORNIA         DUKAKIS, MICHAEL               47
12  1988  CALIFORNIA       PAUL, RONALD "RON"               47
13  1988  CALIFORNIA           FULANI, LENORA               47
14  1988       TEXAS        BUSH, GEORGE H.W.               29
15  1988       TEXAS         DUKAKIS, MICHAEL               29
16  1988       TEXAS       PAUL, RONALD "RON"               29
17  1988       TEXAS           FULANI, LENORA               29
The above code could also be written as:

pd.merge(df1,
         df2,
         how='left',
         left_on=['year', 'state'],
         right_on=['year', 'state'])

but since the key columns have the same names in both DataFrames, we can simply use on=['year', 'state'].
An alternative way to write this:
merged_df = df1.merge(df2, on=['year', 'state'], how='left')
If you only want to load three columns of df1:
df1 = pd.read_csv('<name_of_the_CSV_file>', usecols=['year', 'state', 'candidate'])
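One more note, inferred from your printout: year appears to be the index of both frames rather than a column, and a non-unique index is exactly what triggers the InvalidIndexError you saw with index-based approaches like map. Calling reset_index() first turns the key back into an ordinary column. A minimal sketch, with a few sample rows reconstructed by hand (column names assumed):

import pandas as pd

# Hypothetical reconstruction of a few sample rows; 'year' is the index,
# as in the question's printout.
df1 = pd.DataFrame({
    'year': [2016, 2016, 1988, 1988],
    'state': ['CALIFORNIA', 'TEXAS', 'CALIFORNIA', 'TEXAS'],
    'candidate': ['CLINTON, HILLARY', 'TRUMP, DONALD J.',
                  'BUSH, GEORGE H.W.', 'DUKAKIS, MICHAEL'],
}).set_index('year')

df2 = pd.DataFrame({
    'year': [2016, 2016, 1988, 1988],
    'state': ['CALIFORNIA', 'TEXAS', 'CALIFORNIA', 'TEXAS'],
    'electoral_votes': [55, 38, 47, 29],
}).set_index('year')

# reset_index() moves the non-unique 'year' index back into a column,
# so it can serve as a merge key.
merged = df1.reset_index().merge(df2.reset_index(),
                                 on=['year', 'state'],
                                 how='left')
print(merged)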
I am somewhat new to coding in Pandas and I have what I think is a simple problem that I can't find an answer to. I have a list of students, the college they went to, and the year they entered college.
     Name    College  Year
     Mary  Princeton  2017
      Joe    Harvard  2018
     Bill  Princeton  2016
   Louise  Princeton  2020
  Michael    Harvard  2019
    Penny       Yale  2018
    Harry       Yale  2015
I need the data ordered by year but grouped by college. If I sort by year alone, the years are in order but the colleges aren't kept together; if I sort by college alone, the colleges are together but in alphabetical order, with the years out of order. Sorting by year then college still doesn't keep the colleges together, and sorting by college then year can't guarantee that the college with the most recent year comes first. What I want the table to look like is:
     Name    College  Year
   Louise  Princeton  2020
     Mary  Princeton  2017
     Bill  Princeton  2016
  Michael    Harvard  2019
      Joe    Harvard  2018
    Penny       Yale  2018
    Harry       Yale  2015
So Princeton is first because it has the most recent year (2020), and all the Princeton rows stay together. Harvard is next because its most recent year (2019) beats Yale's most recent year (2018), so the two Harvard rows follow. Yale comes last, since 2020 > 2019 > 2018. I appreciate all your ideas and help! Thank you!
Add a temporary extra column with the max year per group and sort on multiple columns:
out = (df
       .assign(max_year=df.groupby('College')['Year'].transform('max'))
       .sort_values(by=['max_year', 'College', 'Year'], ascending=[False, True, False])
       .drop(columns='max_year')
      )
output:
Name College Year
3 Louise Princeton 2020
0 Mary Princeton 2017
2 Bill Princeton 2016
4 Michael Harvard 2019
1 Joe Harvard 2018
5 Penny Yale 2018
6 Harry Yale 2015
with temporary column:
Name College Year max_year
3 Louise Princeton 2020 2020
0 Mary Princeton 2017 2020
2 Bill Princeton 2016 2020
4 Michael Harvard 2019 2019
1 Joe Harvard 2018 2019
5 Penny Yale 2018 2018
6 Harry Yale 2015 2018
You first want to sort by "College" then "Year" in descending order, then keep the "College" values together by using .groupby:
import pandas as pd

data = [
    ["Mary", "Princeton", 2017],
    ["Joe", "Harvard", 2018],
    ["Bill", "Princeton", 2016],
    ["Louise", "Princeton", 2020],
    ["Michael", "Harvard", 2019],
    ["Penny", "Yale", 2018],
    ["Harry", "Yale", 2015],
]
df = pd.DataFrame(data, columns=["Name", "College", "Year"])

# head() with no argument keeps up to 5 rows per group (all of them here)
df.sort_values(["College", "Year"], ascending=False).groupby("College").head()
You'd get this output:
Name College Year
Penny Yale 2018
Harry Yale 2015
Louise Princeton 2020
Mary Princeton 2017
Bill Princeton 2016
Michael Harvard 2019
Joe Harvard 2018
You first have to find the maximum Year within each group and set that as a column.
You can then sort by the max and Year columns.
df = pd.read_table('./table.txt')
df["max"] = df.groupby("College")["Year"].transform("max")  # max Year per College
df.sort_values(by=["max", "Year"], ascending=False).drop(columns="max").reset_index(drop=True)
Output:
Out[60]:
Name College Year
0 Louise Princeton 2020
1 Mary Princeton 2017
2 Bill Princeton 2016
3 Michael Harvard 2019
4 Joe Harvard 2018
5 Penny Yale 2018
6 Harry Yale 2015
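One caveat with this version: if two colleges ever share the same maximum year, sorting on ["max", "Year"] alone can interleave their rows. A hedged tweak, reusing the same temporary column, is to add "College" as a tie-breaking sort key so each group stays contiguous:

# Adding "College" between "max" and "Year" keeps each college's rows together
# even when two colleges share the same maximum year.
df.sort_values(by=["max", "College", "Year"], ascending=[False, True, False]).drop(columns="max").reset_index(drop=True)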
I have the following 2 pandas dataframes:
city Population
0 New York City 20153634
1 Los Angeles 13310447
2 San Francisco Bay Area 6657982
3 Chicago 9512999
4 Dallas–Fort Worth 7233323
5 Washington, D.C. 6131977
6 Philadelphia 6070500
7 Boston 4794447
8 Minneapolis–Saint Paul 3551036
9 Denver 2853077
10 Miami–Fort Lauderdale 6066387
11 Phoenix 4661537
12 Detroit 4297617
13 Toronto 5928040
14 Houston 6772470
15 Atlanta 5789700
16 Tampa Bay Area 3032171
17 Pittsburgh 2342299
18 Cleveland 2055612
19 Seattle 3798902
20 Cincinnati 2165139
21 Kansas City 2104509
22 St. Louis 2807002
23 Baltimore 2798886
24 Charlotte 2474314
25 Indianapolis 2004230
26 Nashville 1865298
27 Milwaukee 1572482
28 New Orleans 1268883
29 Buffalo 1132804
30 Montreal 4098927
31 Vancouver 2463431
32 Orlando 2441257
33 Portland 2424955
34 Columbus 2041520
35 Calgary 1392609
36 Ottawa 1323783
37 Edmonton 1321426
38 Salt Lake City 1186187
39 Winnipeg 778489
40 San Diego 3317749
41 San Antonio 2429609
42 Sacramento 2296418
43 Las Vegas 2155664
44 Jacksonville 1478212
45 Oklahoma City 1373211
46 Memphis 1342842
47 Raleigh 1302946
48 Green Bay 318236
49 Hamilton 747545
50 Regina 236481
city W/L Ratio
0 Boston 2.500000
1 Buffalo 0.555556
2 Calgary 1.057143
3 Chicago 0.846154
4 Columbus 1.500000
5 Dallas–Fort Worth 1.312500
6 Denver 1.433333
7 Detroit 0.769231
8 Edmonton 0.900000
9 Las Vegas 2.125000
10 Los Angeles 1.655862
11 Miami–Fort Lauderdale 1.466667
12 Minneapolis-Saint Paul 1.730769
13 Montreal 0.725000
14 Nashville 2.944444
15 New York 1.517241
16 New York City 0.908870
17 Ottawa 0.651163
18 Philadelphia 1.615385
19 Phoenix 0.707317
20 Pittsburgh 1.620690
21 Raleigh 1.028571
22 San Francisco Bay Area 1.666667
23 St. Louis 1.375000
24 Tampa Bay 2.347826
25 Toronto 1.884615
26 Vancouver 0.775000
27 Washington, D.C. 1.884615
28 Winnipeg 2.600000
And I do a join like this:
result = pd.merge(df, nhl_df, on="city")
The result should have 28 rows; instead I have 24.
One of the missing rows, for example, is Miami–Fort Lauderdale.
I have double-checked both dataframes and there are NO typographical errors. So why isn't it in the final dataframe?
city Population W/L Ratio
0 New York City 20153634 0.908870
1 Los Angeles 13310447 1.655862
2 San Francisco Bay Area 6657982 1.666667
3 Chicago 9512999 0.846154
4 Dallas–Fort Worth 7233323 1.312500
5 Washington, D.C. 6131977 1.884615
6 Philadelphia 6070500 1.615385
7 Boston 4794447 2.500000
8 Denver 2853077 1.433333
9 Phoenix 4661537 0.707317
10 Detroit 4297617 0.769231
11 Toronto 5928040 1.884615
12 Pittsburgh 2342299 1.620690
13 St. Louis 2807002 1.375000
14 Nashville 1865298 2.944444
15 Buffalo 1132804 0.555556
16 Montreal 4098927 0.725000
17 Vancouver 2463431 0.775000
18 Columbus 2041520 1.500000
19 Calgary 1392609 1.057143
20 Ottawa 1323783 0.651163
21 Edmonton 1321426 0.900000
22 Winnipeg 778489 2.600000
23 Las Vegas 2155664 2.125000
24 Raleigh 1302946 1.028571
I think you can check whether the characters really match by inspecting each character's integer code point with the ord function. Here the dashes are different, one with code 150 and the other with code 8211, which is why the values don't match:
a = df1.loc[10, 'city']
print (a)
Miami–Fort Lauderdale
print ([ord(x) for x in a])
[77, 105, 97, 109, 105, 150, 70, 111, 114, 116, 32, 76, 97, 117, 100, 101, 114, 100, 97, 108, 101]
b = df2.loc[11, 'city']
print (b)
Miami–Fort Lauderdale
print ([ord(x) for x in b])
[77, 105, 97, 109, 105, 8211, 70, 111, 114, 116, 32, 76, 97, 117, 100, 101, 114, 100, 97, 108, 101]
To fix it, you can replace one dash with the other; copy each dash character from the actual data so you replace the correct one:

# first dash (the one being replaced) is copied from b, second (the replacement) from a
df2['city'] = df2['city'].replace('–','–', regex=True)
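A more defensive variant (my own sketch, not part of the original answer): normalize every dash-like character in both city columns to a plain ASCII hyphen before merging, so visually identical keys compare equal. Here \u2013 is the en dash (code 8211), \u2014 the em dash, and \u0096 the stray control character that byte 150 decodes to:

import pandas as pd

# Map en dash, em dash, and the chr(150) stray onto a plain ASCII hyphen
# in both frames, then merge as before.
dash_pattern = r'[\u0096\u2013\u2014]'
df1['city'] = df1['city'].str.replace(dash_pattern, '-', regex=True)
df2['city'] = df2['city'].str.replace(dash_pattern, '-', regex=True)

result = pd.merge(df1, df2, on='city')

Note this only fixes dash mismatches; rows like 'Tampa Bay' vs 'Tampa Bay Area' or 'New York' vs 'New York City' differ in wording and still need separate handling.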
here is my problem:
You will find below a Pandas DataFrame. I would like to group by Date and then filter within the subgroups, but I am having a lot of difficulty doing it (I've spent 3 hours on this and haven't found a solution).
This is what I am looking for:
I first have to group everything by date, then sort the scores from highest to lowest within each subgroup, and then select the two best scores, but they have to be from different countries.
(For example, if the two best scores are from the same country, we keep the higher one and then take the best score from a different country.)
This is the DataFrame :
Date Name Score Country
2012 Paul 65 France
2012 Silvia 81 Italy
2012 David 80 UK
2012 Alphonse 46 France
2012 Giovanni 82 Italy
2012 Britney 53 UK
2013 Paul 32 France
2013 Silvia 59 Italy
2013 David 92 UK
2013 Alphonse 68 France
2013 Giovanni 23 Italy
2013 Britney 78 UK
2014 Paul 46 France
2014 Silvia 87 Italy
2014 David 89 UK
2014 Alphonse 76 France
2014 Giovanni 53 Italy
2014 Britney 90 UK
The Result I am looking for is something like this :
Date Name Score Country
2012 Giovanni 82 Italy
2012 David 80 UK
2013 David 92 UK
2013 Alphonse 68 France
2014 Britney 90 UK
2014 Silvia 87 Italy
Here is the code that I started :
df = pd.DataFrame(
    {'Date': ["2012"] * 6 + ["2013"] * 6 + ["2014"] * 6,
     'Name': ["Paul", "Silvia", "David", "Alphonse", "Giovanni", "Britney"] * 3,
     'Score': [65, 81, 80, 46, 82, 53, 32, 59, 92, 68, 23, 78, 46, 87, 89, 76, 53, 90],
     'Country': ["France", "Italy", "UK", "France", "Italy", "UK"] * 3})
df = df.set_index('Name').groupby('Date')[["Score", "Country"]].apply(lambda _df: _df.sort_values(["Score"], ascending=False))
This gives me the scores sorted in descending order within each year, but as you can see, in 2012 for example the two best scores are from the same country (Italy), so what I still have to do is:
1. Select the max per country for each year
2. Select only two best scores (and the countries have to be different).
I would be really grateful for any help, because I really don't know how to do this.
If somebody has ideas, please share them :)
PS: please don't hesitate to tell me if this isn't clear enough.
Use DataFrame.sort_values on the two columns first, then remove duplicates per Date/Country pair with DataFrame.drop_duplicates, and finally select the top rows per group with GroupBy.head:
df1 = (df.sort_values(['Date', 'Score'], ascending=[True, False])  # best scores first within each Date
         .drop_duplicates(['Date', 'Country'])                     # keep only each country's best row per Date
         .groupby('Date')
         .head(2))                                                 # top two (distinct-country) rows per Date
print (df1)
Date Name Score Country
4 2012 Giovanni 82 Italy
2 2012 David 80 UK
8 2013 David 92 UK
9 2013 Alphonse 68 France
17 2014 Britney 90 UK
13 2014 Silvia 87 Italy
I have a pandas dataframe of the form
Start Date End Date President Party
0 04 March 1921 02 August 1923 Warren G Harding Republican
1 03 August 1923 04 March 1929 Calvin Coolidge Republican
2 05 March 1929 04 March 1933 Herbert Hoover Republican
3 05 March 1933 12 April 1945 Franklin D Roosevelt Democratic
4 13 April 1945 20 January 1953 Harry S Truman Democratic
5 21 January 1953 20 January 1961 Dwight D Eisenhower Republican
6 21 January 1961 22 November 1963 John F Kennedy Democratic
7 23 November 1963 20 January 1969 Lyndon B Johnson Democratic
8 21 January 1969 09 August 1974 Richard Nixon Republican
9 10 August 1974 20 January 1977 Gerald Ford Republican
10 21 January 1977 20 January 1981 Jimmy Carter Democratic
11 21 January 1981 20 January 1989 Ronald Reagan Republican
12 21 January 1989 20 January 1993 George H W Bush Republican
13 21 January 1993 20 January 2001 Bill Clinton Democratic
14 21 January 2001 20 January 2009 George W Bush Republican
15 21 January 2009 20 January 2017 Barack Obama Democratic
16 21 January 2017 20 May 2017 Donald Trump Republican
I want to extract the index values for Party=Republican and store them in a list.
Is there a Pandas function to do this quickly?
df.index[df.Party == 'Republican']
You can call .tolist() on the result if you want.
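For completeness, a tiny runnable sketch (three made-up rows) showing the boolean-mask lookup together with .tolist():

import pandas as pd

df = pd.DataFrame({
    'President': ['Warren G Harding', 'Franklin D Roosevelt', 'Richard Nixon'],
    'Party': ['Republican', 'Democratic', 'Republican'],
})

# The boolean mask selects matching rows; .index returns their labels.
republican_rows = df.index[df.Party == 'Republican'].tolist()
print(republican_rows)  # [0, 2]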
I am having trouble counting the number of counties using the well-known census.csv data.
Task: count the number of counties in each state.
I think the problem I'm facing is with the comparison; please read below.
I've tried this:

df = pd.read_csv('census.csv')
dfd = df['STNAME'].unique()   # gives the names of the states as an array
serr = pd.Series(dfd)         # converting the array to a Series

After this, I've tried two approaches:

1:

df[df['STNAME'] == serr]   # ERROR: Series lengths must match

2:

i = 0
for name in serr:          # this raises an error at 'Alabama'
    df['STNAME'] == name
    for i in serr:
        serr[i] == serr[name]
        print(serr[name].count)
        i += 1
Please guide me; it has been three days with this stuff.
Use groupby and aggregate COUNTY using nunique:
In [1]: import pandas as pd
In [2]: df = pd.read_csv('census.csv')
In [3]: unique_counties = df.groupby('STNAME')['COUNTY'].nunique()
Now the results
In [4]: unique_counties
Out[4]:
STNAME
Alabama 68
Alaska 30
Arizona 16
Arkansas 76
California 59
Colorado 65
Connecticut 9
Delaware 4
District of Columbia 2
Florida 68
Georgia 160
Hawaii 6
Idaho 45
Illinois 103
Indiana 93
Iowa 100
Kansas 106
Kentucky 121
Louisiana 65
Maine 17
Maryland 25
Massachusetts 15
Michigan 84
Minnesota 88
Mississippi 83
Missouri 116
Montana 57
Nebraska 94
Nevada 18
New Hampshire 11
New Jersey 22
New Mexico 34
New York 63
North Carolina 101
North Dakota 54
Ohio 89
Oklahoma 78
Oregon 37
Pennsylvania 68
Rhode Island 6
South Carolina 47
South Dakota 67
Tennessee 96
Texas 255
Utah 30
Vermont 15
Virginia 134
Washington 40
West Virginia 56
Wisconsin 73
Wyoming 24
Name: COUNTY, dtype: int64
juanpa.arrivillaga has a great solution. However, the code needs a minor modification.
The rows with 'SUMLEV' == 40 (equivalently, 'COUNTY' == 0) are state-level summaries, not counties, and should be filtered out. Otherwise every state's county count is too big by one.
So the correct answer should be:
unique_counties = census_df[census_df['SUMLEV'] == 50].groupby('STNAME')['COUNTY'].nunique()
with the following result:
STNAME
Alabama 67
Alaska 29
Arizona 15
Arkansas 75
California 58
Colorado 64
Connecticut 8
Delaware 3
District of Columbia 1
Florida 67
Georgia 159
Hawaii 5
Idaho 44
Illinois 102
Indiana 92
Iowa 99
Kansas 105
Kentucky 120
Louisiana 64
Maine 16
Maryland 24
Massachusetts 14
Michigan 83
Minnesota 87
Mississippi 82
Missouri 115
Montana 56
Nebraska 93
Nevada 17
New Hampshire 10
New Jersey 21
New Mexico 33
New York 62
North Carolina 100
North Dakota 53
Ohio 88
Oklahoma 77
Oregon 36
Pennsylvania 67
Rhode Island 5
South Carolina 46
South Dakota 66
Tennessee 95
Texas 254
Utah 29
Vermont 14
Virginia 133
Washington 39
West Virginia 55
Wisconsin 72
Wyoming 23
Name: COUNTY, dtype: int64
@Bakhtawar - this is a very simple way:

df.groupby('STNAME')['COUNTY'].count()
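Note that a plain row count includes the state-level summary row that the previous answer filters out, so each state comes out one too high. Combining the simple count with the same SUMLEV filter (a sketch, assuming the census.csv layout described above) gives matching numbers:

# Keep only county-level rows (SUMLEV == 50) before counting per state.
df[df['SUMLEV'] == 50].groupby('STNAME')['COUNTY'].count()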