Swap df1 column with df2 column, based on value - python

Goal: swap out `df_hsa.stateabbr` with `df_state.state`, matching on `df_state.abbr`.
Is there a function for this, where I specify the source column, the destination column, and the column to match on?
Do I need to order both DataFrames similarly?
df_hsa:
   hsa stateabbr    county
0  259        AL    Butler
1  177        AL   Calhoun
2  177        AL  Cleburne
3  172        AL  Chambers
4  172        AL  Randolph
df_state:
  abbr       state
0   AL     Alabama
1   AK      Alaska
2   AZ     Arizona
3   AR    Arkansas
4   CA  California
Desired Output:
df_hsa, with a state column instead of stateabbr:
   hsa    state    county
0  259  Alabama    Butler
1  177  Alabama   Calhoun
2  177  Alabama  Cleburne
3  172  Alabama  Chambers
4  172  Alabama  Randolph

You can simply join after setting the index to "stateabbr":
df_hsa.set_index("stateabbr").join(df_state.set_index("abbr"))
Output:
    hsa    county    state
AL  259    Butler  Alabama
AL  177   Calhoun  Alabama
AL  177  Cleburne  Alabama
AL  172  Chambers  Alabama
AL  172  Randolph  Alabama
If you also want to keep the original index, you can add .set_index(df_hsa.index) at the end of the line.
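As an alternative, here is a minimal self-contained sketch using map instead of join (the frames are rebuilt from the sample rows above; map looks up each abbreviation and leaves the index untouched):

import pandas as pd

df_hsa = pd.DataFrame({'hsa': [259, 177, 177, 172, 172],
                       'stateabbr': ['AL'] * 5,
                       'county': ['Butler', 'Calhoun', 'Cleburne', 'Chambers', 'Randolph']})
df_state = pd.DataFrame({'abbr': ['AL', 'AK', 'AZ', 'AR', 'CA'],
                         'state': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California']})

# look up the full state name for each abbreviation
df_hsa['state'] = df_hsa['stateabbr'].map(df_state.set_index('abbr')['state'])

# drop the abbreviation and restore the desired column order
result = df_hsa.drop(columns='stateabbr')[['hsa', 'state', 'county']]
print(result)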

How to transform combinations of values in columns into individual columns?

I have a dataset (df) that looks like this:
   Date     ID  County Name  State  State Name  Product Name  Type of Transaction   QTY
 202105  10001  Los Angeles     CA  California         Shoes                Entry   630
 202012  10002      Houston     TX       Texas      Keyboard                 Exit  5493
 202001  11684      Chicago     IL    Illionis         Phone             Disposal   220
 202107  12005     New York     NY    New York         Phone                Entry   302
    ...    ...          ...    ...         ...           ...                  ...   ...
 202111  14990      Orlando     FL     Florida         Shoes                 Exit   201
For every county there are multiple entries, for different products, transaction types, and dates, but not all counties have the same number of entries, and they don't cover the same dates.
I want to recreate this dataset, such that:
1 - All counties share the same start and end dates, and for the dates where a county records no entries, I want the entry recorded as NaN.
2 - The product names and their types are their own columns.
Essentially, this is how the dataset needs to look:
   Date     ID  County Name  State  State Name  Shoes,Entry  Shoes,Exit  Shoes,Disposal  Phones,Entry  Phones,Exit  Phones,Disposal  Keyboard,Entry  Keyboard,Exit  Keyboard,Disposal
 202105  10001  Los Angeles     CA  California          594         694            5660         33299         1110             5659            4559           3223              56889
 202012  10002      Houston     TX       Texas         3420        4439             549          2110         5669             2245           39294           3345                556
 202001  11684      Chicago     IL    Illionis        55432        4439             329         21190         4320              455           34059          44556               5677
 202107  12005     New York     NY    New York        34556        2204            4329         11193        22345            43221            1544           3467              22450
    ...    ...          ...    ...         ...          ...         ...             ...           ...          ...              ...             ...            ...                ...
 202111  14990      Orlando     FL     Florida        54543       23059            3290         21394        34335            59660             NaN            NaN                NaN
In the example above, you can see that Florida does not record certain transactions; I would like those missing entries to appear as NaN, as shown. I appreciate all the help!
This is essentially a pivot, with flattening of the MultiIndex:
(df
 .pivot(index=['Date', 'ID', 'County Name', 'State', 'State Name'],
        columns=['Product Name', 'Type of Transaction'],
        values='QTY')
 .pipe(lambda d: d.set_axis(d.columns.map(','.join), axis=1))
 .reset_index()
)
Output:
     Date     ID  County Name State  State Name  Shoes,Entry  Keyboard,Exit  \
0  202001  11684      Chicago    IL    Illionis          NaN            NaN
1  202012  10002      Houston    TX       Texas          NaN         5493.0
2  202105  10001  Los Angeles    CA  California        630.0            NaN
3  202107  12005     New York    NY    New York          NaN            NaN

   Phone,Disposal  Phone,Entry
0           220.0          NaN
1             NaN          NaN
2             NaN          NaN
3             NaN        302.0
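For reference, here is a self-contained sketch of the same pivot on the four sample rows above (it assumes pandas >= 1.1, where pivot accepts lists for index and columns):

import pandas as pd

df = pd.DataFrame({
    'Date': [202105, 202012, 202001, 202107],
    'ID': [10001, 10002, 11684, 12005],
    'County Name': ['Los Angeles', 'Houston', 'Chicago', 'New York'],
    'State': ['CA', 'TX', 'IL', 'NY'],
    'State Name': ['California', 'Texas', 'Illionis', 'New York'],
    'Product Name': ['Shoes', 'Keyboard', 'Phone', 'Phone'],
    'Type of Transaction': ['Entry', 'Exit', 'Disposal', 'Entry'],
    'QTY': [630, 5493, 220, 302],
})

out = (df
       .pivot(index=['Date', 'ID', 'County Name', 'State', 'State Name'],
              columns=['Product Name', 'Type of Transaction'],
              values='QTY')
       .pipe(lambda d: d.set_axis([','.join(c) for c in d.columns], axis=1))
       .reset_index())
print(out)  # missing product/transaction combinations come out as NaN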

I have the number of matches, but only for home matches for a given team. How can I sum the values for duplicate pairs?

My dataframe contains the number of matches for given fixtures, but only for home matches for a given team (for example, the number of matches for Argentina-Uruguay is 97, but for Uruguay-Argentina it is 80). In short, I want to sum both home-match counts for the teams concerned, so that I have the total number of matches between them. The dataframe's top 30 rows look like this:
most_often = (mc.groupby(["home_team", "away_team"])
                .size()
                .reset_index(name="how_many")
                .sort_values(by=['how_many'], ascending=False))
most_often = most_often.reset_index(drop=True)
most_often.head(30)
home_team away_team how_many
0 Argentina Uruguay 97
1 Uruguay Argentina 80
2 Austria Hungary 69
3 Hungary Austria 68
4 Kenya Uganda 65
5 Argentina Paraguay 64
6 Belgium Netherlands 63
7 Netherlands Belgium 62
8 England Scotland 59
9 Argentina Brazil 58
10 Brazil Paraguay 58
11 Scotland England 58
12 Norway Sweden 56
13 England Wales 54
14 Sweden Denmark 54
15 Wales Scotland 54
16 Denmark Sweden 53
17 Argentina Chile 53
18 Scotland Wales 52
19 Scotland Northern Ireland 52
20 Sweden Norway 51
21 Wales England 50
22 England Northern Ireland 50
23 Wales Northern Ireland 50
24 Chile Uruguay 49
25 Northern Ireland England 49
26 Brazil Argentina 48
27 Brazil Chile 48
28 Brazil Uruguay 47
29 Chile Peru 46
In other words, I want something like this:
0 Argentina Uruguay 177
1 Uruguay Argentina 177
2 Austria Hungary 137
3 Hungary Austria 137
4 Kenya Uganda 107
5 Uganda Kenya 107
6 Belgium Netherlands 105
7 Netherlands Belgium 105
But this is only an example; I want to apply it to every pair of teams in the dataframe.
What should I do?
OK, you can follow the steps below.
Here is the initial df.
home_team away_team how_many
0 Argentina Uruguay 97
1 Uruguay Argentina 80
2 Austria Hungary 69
3 Hungary Austria 68
4 Kenya Uganda 65
First, create a sorted team list that will serve as the aggregation key:
import numpy as np

df1['sorted_list_team'] = list(zip(df1['home_team'], df1['away_team']))
df1['sorted_list_team'] = df1['sorted_list_team'].apply(lambda x: np.sort(np.unique(x)))
home_team away_team how_many sorted_list_team
0 Argentina Uruguay 97 [Argentina, Uruguay]
1 Uruguay Argentina 80 [Argentina, Uruguay]
2 Austria Hungary 69 [Austria, Hungary]
3 Hungary Austria 68 [Austria, Hungary]
4 Kenya Uganda 65 [Kenya, Uganda]
Next, convert the sorted array to a tuple so it is hashable and can be used as a groupby key:
def converter(team_list):
    return tuple(team_list)

df1['sorted_list_team'] = df1['sorted_list_team'].apply(converter)
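As an aside, the built-in tuple does the same conversion without a helper function:

df1['sorted_list_team'] = df1['sorted_list_team'].apply(tuple)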
Then aggregate, summing the 'how_many' values into another dataframe that I call 'df_sum':
df_sum = df1.groupby(['sorted_list_team']).agg({'how_many':'sum'}).reset_index()
sorted_list_team how_many
0 (Argentina, Brazil) 106
1 (Argentina, Chile) 53
2 (Argentina, Paraguay) 64
3 (Argentina, Uruguay) 177
4 (Austria, Hungary) 137
Then merge with 'df1' to attach the sums. The column 'how_many' exists in both dataframes, so pandas renames the one coming from df_sum to 'how_many_y':
df1 = pd.merge(df1, df_sum[['sorted_list_team', 'how_many']],
               on='sorted_list_team', how='left').drop_duplicates()
As a final step, select only the columns you need from the resulting df:
df1 = df1[['home_team','away_team','how_many_y']]
df1 = df1.drop_duplicates()
df1.head()
home_team away_team how_many_y
0 Argentina Uruguay 177
1 Uruguay Argentina 177
2 Austria Hungary 137
3 Hungary Austria 137
4 Kenya Uganda 65
I found a relatively straightforward approach that hopefully does what you want, though it differs slightly from your desired output: it drops the repetition, since we no longer care about home vs. away and just want the game counts (if we can get away with that...).
If we make a new column that combines the values from home_team and away_team in the same order each time, we can just sum how_many wherever that new column matches:
import numpy as np

# this creates values like 'Argentina-Brazil' and 'Chile-Peru'
df['teams'] = pd.Series(map('-'.join, np.sort(df[['home_team', 'away_team']], axis=1)))
df[['how_many', 'teams']].groupby('teams').sum()
This code gave me the following:
how_many
teams
Argentina-Brazil 106
Argentina-Chile 53
Argentina-Paraguay 64
Argentina-Uruguay 177
Austria-Hungary 137
Belgium-Netherlands 125
Brazil-Chile 48
Brazil-Paraguay 58
Brazil-Uruguay 47
Chile-Peru 46
Chile-Uruguay 49
Denmark-Sweden 107
England-Northern Ireland 99
England-Scotland 117
England-Wales 104
Kenya-Uganda 65
Northern Ireland-Scotland 52
Northern Ireland-Wales 50
Norway-Sweden 107
Scotland-Wales 106
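If you also want to keep both row orderings with the total attached, as in your desired output, the same teams column supports a groupby-transform (a minimal sketch, assuming df and the teams column from above):

# attach the pair total to every row, preserving both home/away orderings
df['total'] = df.groupby('teams')['how_many'].transform('sum')
df[['home_team', 'away_team', 'total']].head()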

DataFrame from Dictionary with variable length keys

So for this assignment I managed to create a dictionary where the keys are state names (e.g. Alabama, Alaska, Arizona) and the values are lists of regions for each state. The problem is that the lists of regions have different lengths, so each state can have a different number of regions associated with it.
Example:
{'Alabama': ['Auburn', 'Florence', 'Jacksonville', 'Livingston',
             'Montevallo', 'Troy', 'Tuscaloosa', 'Tuskegee'],
 'Alaska': ['Fairbanks'],
 'Arizona': ['Flagstaff', 'Tempe', 'Tucson']}
How can I unload this into a pandas DataFrame? What I want is basically 2 columns - "State" and "Region" - something similar to what you would get if you did a "GroupBy" on state for the regions.
If you are on pandas 0.25+, you can use explode:
pd.Series(states).explode()
Output:
Alabama Auburn
Alabama Florence
Alabama Jacksonville
Alabama Livingston
Alabama Montevallo
Alabama Troy
Alabama Tuscaloosa
Alabama Tuskegee
Alaska Fairbanks
Arizona Flagstaff
Arizona Tempe
Arizona Tucson
dtype: object
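If you want the two named columns from the question, promote the index to a column (a small follow-up sketch, assuming the dictionary is named states):

(pd.Series(states)
   .explode()
   .rename_axis('State')
   .reset_index(name='Region'))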
You can also use concat, which works for most pandas versions:
pd.concat(pd.DataFrame({'state': k, 'Region': v}) for k, v in states.items())
Output:
state Region
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
5 Alabama Troy
6 Alabama Tuscaloosa
7 Alabama Tuskegee
0 Alaska Fairbanks
0 Arizona Flagstaff
1 Arizona Tempe
2 Arizona Tucson
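Note the repeated index values above (Alaska and Arizona restart at 0); if that matters, passing ignore_index=True to concat produces a clean 0..n-1 index:

pd.concat((pd.DataFrame({'state': k, 'Region': v}) for k, v in states.items()),
          ignore_index=True)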
You can also do this by unrolling the dictionary into parallel lists, although that is a slightly longer approach. For example:
Example = {'Alabama': ['Auburn', 'Florence', 'Jacksonville', 'Livingston',
                       'Montevallo', 'Troy', 'Tuscaloosa', 'Tuskegee'],
           'Alaska': ['Fairbanks'],
           'Arizona': ['Flagstaff', 'Tempe', 'Tucson']}

new_list_of_keys = []
new_list_of_values = []
keys = list(Example.keys())
values = list(Example.values())

for i in range(len(keys)):
    for j in range(len(values[i])):
        new_list_of_values.append(values[i][j])
        new_list_of_keys.append(keys[i])

df = pd.DataFrame(zip(new_list_of_keys, new_list_of_values),
                  columns=['State', 'Region'])
This gives the following output:
State Region
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
5 Alabama Troy
6 Alabama Tuscaloosa
7 Alabama Tuskegee
8 Alaska Fairbanks
9 Arizona Flagstaff
10 Arizona Tempe
11 Arizona Tucson
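The same unrolling also fits in a single list comprehension (a sketch using the Example dict above):

df = pd.DataFrame([(state, region) for state, regions in Example.items() for region in regions],
                  columns=['State', 'Region'])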

Iterating through a list of identical elements

I have the following function, which prints the states with their associated counties:
def answer():
    census_df.set_index(['STNAME', 'CTYNAME'])  # note: the result is not assigned, so this has no effect
    for name, state, cname in zip(census_df['STNAME'], census_df['STATE'], census_df['CTYNAME']):
        print(name, state, cname)
Alabama 1 Tallapoosa County
Alabama 1 Tuscaloosa County
Alabama 1 Walker County
Alabama 1 Washington County
Alabama 1 Wilcox County
Alabama 1 Winston County
Alaska 2 Alaska
Alaska 2 Aleutians East Borough
Alaska 2 Aleutians West Census Area
Alaska 2 Anchorage Municipality
Alaska 2 Bethel Census Area
Alaska 2 Bristol Bay Borough
Alaska 2 Denali Borough
Alaska 2 Dillingham Census Area
Alaska 2 Fairbanks North Star Borough
I would like to know the state with the most counties in it. I can iterate through each state like this:
counter = 0
counter2 = 0
for name, state, cname in zip(census_df['STNAME'], census_df['STATE'], census_df['CTYNAME']):
    if state == 1:
        counter += 1
    if state == 2:
        counter2 += 1
print(counter)
print(counter2)
and so on. I can range over the number of states (rng = range(1, 56)) and iterate through it, but creating 56 lists is a nightmare. Is there an easier way of doing so?
Pandas allows us to do such operations without loops/iterating:
In [21]: df.STNAME.value_counts()
Out[21]:
Alaska 9
Alabama 6
Name: STNAME, dtype: int64
In [24]: df.STNAME.value_counts().head(1)
Out[24]:
Alaska 9
Name: STNAME, dtype: int64
or
In [18]: df.groupby('STNAME')['CTYNAME'].count()
Out[18]:
STNAME
Alabama 6
Alaska 9
Name: CTYNAME, dtype: int64
In [19]: df.groupby('STNAME')['CTYNAME'].count().idxmax()
Out[19]: 'Alaska'
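The two ideas also combine into a one-liner:

# value_counts sorts descending; idxmax returns the top label
df['STNAME'].value_counts().idxmax()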

Count number of counties per state using python {census}

I am having trouble counting the number of counties using the well-known census.csv data.
Task: Count number of counties in each state.
I think I am facing a comparison problem; please read below.
I've tried this:
df = pd.read_csv('census.csv')
dfd = df[:]['STNAME'].unique()  # gives the names of the states
serr = pd.Series(dfd)           # converting to Series (from array)
After this, I've tried two approaches:
1:
df[df['STNAME'] == serr]  # ERROR: series length must match
2:
i = 0
for name in serr:  # this raises an error at 'Alabama'
    df['STNAME'] == name
    for i in serr:
        serr[i] == serr[name]
        print(serr[name].count)
    i += 1
Please guide me; it has been three days with this stuff.
Use groupby and aggregate COUNTY using nunique:
In [1]: import pandas as pd
In [2]: df = pd.read_csv('census.csv')
In [3]: unique_counties = df.groupby('STNAME')['COUNTY'].nunique()
Now the results:
In [4]: unique_counties
Out[4]:
STNAME
Alabama 68
Alaska 30
Arizona 16
Arkansas 76
California 59
Colorado 65
Connecticut 9
Delaware 4
District of Columbia 2
Florida 68
Georgia 160
Hawaii 6
Idaho 45
Illinois 103
Indiana 93
Iowa 100
Kansas 106
Kentucky 121
Louisiana 65
Maine 17
Maryland 25
Massachusetts 15
Michigan 84
Minnesota 88
Mississippi 83
Missouri 116
Montana 57
Nebraska 94
Nevada 18
New Hampshire 11
New Jersey 22
New Mexico 34
New York 63
North Carolina 101
North Dakota 54
Ohio 89
Oklahoma 78
Oregon 37
Pennsylvania 68
Rhode Island 6
South Carolina 47
South Dakota 67
Tennessee 96
Texas 255
Utah 30
Vermont 15
Virginia 134
Washington 40
West Virginia 56
Wisconsin 73
Wyoming 24
Name: COUNTY, dtype: int64
juanpa.arrivillaga has a great solution. However, the code needs a minor modification: the "counties" with 'SUMLEV' == 40 (equivalently, 'COUNTY' == 0, the state-level summary rows) should be filtered out. Otherwise, every state's county count is too large by one.
So the correct answer should be:
unique_counties = census_df[census_df['SUMLEV'] == 50].groupby('STNAME')['COUNTY'].nunique()
with the following result:
STNAME
Alabama 67
Alaska 29
Arizona 15
Arkansas 75
California 58
Colorado 64
Connecticut 8
Delaware 3
District of Columbia 1
Florida 67
Georgia 159
Hawaii 5
Idaho 44
Illinois 102
Indiana 92
Iowa 99
Kansas 105
Kentucky 120
Louisiana 64
Maine 16
Maryland 24
Massachusetts 14
Michigan 83
Minnesota 87
Mississippi 82
Missouri 115
Montana 56
Nebraska 93
Nevada 17
New Hampshire 10
New Jersey 21
New Mexico 33
New York 62
North Carolina 100
North Dakota 53
Ohio 88
Oklahoma 77
Oregon 36
Pennsylvania 67
Rhode Island 5
South Carolina 46
South Dakota 66
Tennessee 95
Texas 254
Utah 29
Vermont 14
Virginia 133
Washington 39
West Virginia 55
Wisconsin 72
Wyoming 23
Name: COUNTY, dtype: int64
@Bakhtawar - this is a very simple way:
df.groupby(df['STNAME']).count().COUNTY
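Note that count() counts rows rather than distinct counties, and it still includes each state's own summary row; combining it with the SUMLEV filter shown above gives the corrected totals (a sketch):

df[df['SUMLEV'] == 50].groupby('STNAME')['COUNTY'].count()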
