Pandas groupby using function variable - python

I have this dataframe:
  iata               airport              city state country        lat
0  00M               Thigpen       Bay Springs    MS     USA  31.953765
1  00R  Livingston Municipal        Livingston    TX     USA  30.685861
2  00V           Meadow Lake  Colorado Springs    CO     USA  38.945749
3  01G          Perry-Warsaw             Perry    NY     USA  42.741347
4  01J      Hilliard Airpark          Hilliard    FL     USA  30.688012
I am trying to get the number of airports per state. For example, suppose I have a function:
def f(dataframe, state):
    ...
    return result
where state would be a state abbreviation, such as 'MA'. I am trying to group the dataframe by the input variable, such as state ('MA'), to then get the number of airports per state.
When I use:
df.groupby(state)['airport'].value_counts()
or
df.groupby(state)['airport'].value_counts()/df['airport'].count()
df.groupby(['state'] == state)['airport'].value_counts()/df['airport'].count()
The last two relate to the conditional probability that a selected airport will be in that state.
It throws a KeyError: 'MA', which I think is because the input variable is not recognized as a column, but as a value in the column.
Is there a way to get the number of airports per state?
Is there a way to get the number of airports per state?
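As a minimal sketch of the usual fix (assuming each row is one airport): pass the column name 'state' to groupby, and use the abbreviation only to index the result:
# group by the column name, then look up the abbreviation
counts = df.groupby('state')['airport'].count()
print(counts['MS'])   # 1 in the sample above
The answers below cover this and the conditional-probability variant in more detail.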

I would use pandas' nunique to get the number of airports per state. The code is easier to read and remember.
To illustrate my point, I modified the dataset as follows, such that Florida has three more fictional airports:
   iata               airport              city state country        lat
0   00M               Thigpen       Bay Springs    MS     USA  31.953765
1   00R  Livingston Municipal        Livingston    TX     USA  30.685861
2   00V           Meadow Lake  Colorado Springs    CO     USA  38.945749
3   01G          Perry-Warsaw             Perry    NY     USA  42.741347
4   01J      Hilliard Airpark          Hilliard    FL     USA  30.688012
5  f234           Weirdviller         Chilliard    FL     USA  30.788012
6  23r2               Johnson          Billiard    FL     USA  30.888012
Then, we write:
df.groupby('state').iata.nunique()
to get the following results:
state
CO    1
FL    3
MS    1
NY    1
TX    1
Name: iata, dtype: int64
Hope this helps.

Assuming each record is an airport throughout, you can just count the records for each state / country combination:
df.groupby(['country','state']).size()
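A small follow-on sketch (the variable names are mine, not from the answer): the per-state counts can then be indexed safely with Series.get, which sidesteps the KeyError from the question:
per_state = df.groupby('state').size()   # rows (airports) per state
n = per_state.get('MA', 0)               # 0 if no airport in that state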

You can rewrite this as an explicit groupby apply:
In [11]: df.groupby("state")["airport"].apply(lambda x: x.value_counts() / len(x))
Out[11]:
state
CO Meadow Lake 1.0
FL Hilliard Airpark 1.0
MS Thigpen 1.0
NY Perry-Warsaw 1.0
TX Livingston Municipal 1.0
Name: airport, dtype: float64
or store the groupby and reuse it (probably this is faster):
In [21]: g = df.groupby("state")["airport"]
In [22]: g.value_counts() / g.size()
Out[22]:
state airport
CO Meadow Lake 1.0
FL Hilliard Airpark 1.0
MS Thigpen 1.0
NY Perry-Warsaw 1.0
TX Livingston Municipal 1.0
Name: airport, dtype: float64
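To pull a single probability out of this result, a minimal sketch (the lookup key comes from the sample data above):
probs = g.value_counts() / g.size()
probs[('TX', 'Livingston Municipal')]    # 1.0, since TX has one airport here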

This seemed to work the way I intended, with all your help. Here state is an input in the form of a state abbreviation ('MA'), and the function returns the probability of a randomly selected airport belonging to that state.
def f(df, state):
    a = df.groupby('state').iata.nunique()
    s = a.sum()
    result = a[state] / s
    return result
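Assuming each row is exactly one airport (iata is unique), value_counts with normalize gives, as far as I can tell, the same probability in one step:
# fraction of all airports that fall in the given state
result = df['state'].value_counts(normalize=True)[state]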

Related

How to take N groups from pandas dataframe when grouped by multiple columns

I have this kind of code (the sample is a recreation of production code):
import pandas as pd
df_nba = pd.read_csv('https://media.geeksforgeeks.org/wp-content/uploads/nba.csv')
df_nba['custom'] = 'abc'
df_gpby_team_clg = df_nba.groupby(['custom', 'College', 'Team']).agg({'Salary': sum})
print(df_gpby_team_clg)
The output looks something like this (image omitted).
Now I want the stats for the first N colleges. So if I give n=2, I will have a df with Alabama and Arizona and their respective Team and Salary stats.
You can use .reset_index() to convert the MultiIndex produced by groupby() back to a normal range index, which makes subsequent operations easier.
Then extract the first n colleges into a list by calling .unique() on the College column.
Finally, filter the expanded dataframe with .loc, using .isin to check whether College is among the first n colleges just extracted:
n = 2
df_gpby_team_clg_expand = df_gpby_team_clg.reset_index()
first_N_college = df_gpby_team_clg_expand['College'].unique()[:n]
df_gpby_team_clg_expand.loc[df_gpby_team_clg_expand['College'].isin(first_N_college)]
Result:
custom College Team Salary
0 abc Alabama Cleveland Cavaliers 2100000.0
1 abc Alabama Memphis Grizzlies 845059.0
2 abc Alabama New Orleans Pelicans 1320000.0
3 abc Arizona Brooklyn Nets 1335480.0
4 abc Arizona Cleveland Cavaliers 9140305.0
5 abc Arizona Detroit Pistons 2841960.0
6 abc Arizona Golden State Warriors 11710456.0
7 abc Arizona Houston Rockets 947276.0
8 abc Arizona Indiana Pacers 5358880.0
9 abc Arizona Milwaukee Bucks 3000000.0
10 abc Arizona New York Knicks 4000000.0
11 abc Arizona Orlando Magic 4171680.0
12 abc Arizona Philadelphia 76ers 525093.0
13 abc Arizona Phoenix Suns 206192.0
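Wrapped up as a reusable helper, a sketch along the same lines (the function name is mine, not from the answer):
def first_n_groups(grouped, level, n):
    # flatten the MultiIndex, then keep rows whose level value is
    # among the first n distinct values of that level
    flat = grouped.reset_index()
    keep = flat[level].unique()[:n]
    return flat.loc[flat[level].isin(keep)]

first_n_groups(df_gpby_team_clg, 'College', 2)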
Use get_level_values() to get the first n colleges:
n = 2
colleges = df_gpby_team_clg.index.get_level_values('College').unique()[:n]
# Index(['Alabama', 'Arizona'], dtype='object', name='College')
Then extract those colleges with IndexSlice:
index = pd.IndexSlice[:, colleges]
df_gpby_team_clg.loc[index, :]
# Salary
# custom College Team
# abc Alabama Cleveland Cavaliers 2100000.0
# Memphis Grizzlies 845059.0
# New Orleans Pelicans 1320000.0
# Arizona Brooklyn Nets 1335480.0
# Cleveland Cavaliers 9140305.0
# Detroit Pistons 2841960.0
# Golden State Warriors 11710456.0
# Houston Rockets 947276.0
# Indiana Pacers 5358880.0
# Milwaukee Bucks 3000000.0
# New York Knicks 4000000.0
# Orlando Magic 4171680.0
# Philadelphia 76ers 525093.0
# Phoenix Suns 206192.0

Removing everything after a char in a dataframe

If I have the following dataframe 'countries':
country info
england london-europe
scotland edinburgh-europe
china beijing-asia
unitedstates washington-north_america
I would like to take the info field and remove everything after the '-', to become:
country info
england london
scotland edinburgh
china beijing
unitedstates washington
How do I do this?
Try:
countries['info'] = countries['info'].str.split('-').str[0]
Output:
country info
0 england london
1 scotland edinburgh
2 china beijing
3 unitedstates washington
You just need to keep the first part of the string after a split on the dash character:
countries['info'] = countries['info'].str.split('-').str[0]
Or, equivalently, you can use
countries['info'] = countries['info'].str.split('-').map(lambda x: x[0])
You can also use str.extract with pattern r"(\w+)(?=\-)"
Ex:
print(df['info'].str.extract(r"(\w+)(?=\-)"))
Output:
info
0 london
1 edinburgh
2 beijing
3 washington
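Yet another option, offered here as a sketch rather than taken from the thread, is str.replace with a regex that drops the first dash and everything after it:
# remove the first '-' and everything that follows
countries['info'] = countries['info'].str.replace(r'-.*$', '', regex=True)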

Constructing a dataframe with multiple columns based on str conditions using a loop - python

I have a webscraped Twitter DataFrame that includes user location. The location variable looks like this:
2 Crockett, Houston County, Texas, 75835, USA
3 NYC, New York, USA
4 Warszawa, mazowieckie, RP
5 Texas, USA
6 Virginia Beach, Virginia, 23451, USA
7 Louisville, Jefferson County, Kentucky, USA
I would like to construct state dummies for all USA states by using a loop.
I have managed to extract users from the USA using
location_usa = location_df['location'].str.contains('usa', case = False)
However, the code would be too bulky if I wrote this for every single state. I have a list of the states as strings.
Also, I am unable to use
pd.Series.str.get_dummies()
as there are different locations within the same state and each entry is a whole sentence.
I would like the output to look something like this:
   Alabama  Alaska  Arizona
1        0       0        1
2        0       1        0
3        1       0        0
4        0       0        0
Or the same with Boolean values.
Use .str.extract to get a Series of the states, and then use pd.get_dummies on that Series. You will need to define a list of all 50 states:
import pandas as pd
states = ['Texas', 'New York', 'Kentucky', 'Virginia']
pd.get_dummies(df.col1.str.extract('(' + '|'.join(x+',' for x in states)+ ')')[0].str.strip(','))
   Kentucky  New York  Texas  Virginia
0         0         0      1         0
1         0         1      0         0
2         0         0      0         0
3         0         0      1         0
4         0         0      0         1
5         1         0      0         0
Note that I matched on states followed by a ',', since that seems to be the pattern and it avoids false matches like 'Virginia' against 'Virginia Beach', or more problematic ones like 'Washington County, Minnesota'.
If you expect multiple states to match on a single line, then this becomes .extractall, summing across the 0th level:
pd.get_dummies(df.col1.str.extractall('(' + '|'.join(x+',' for x in states)+ ')')[0].str.strip(',')).sum(level=0).clip(upper=1)
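Note that sum(level=0) was deprecated in pandas 1.3 and removed in 2.0; on recent versions the equivalent, to the best of my knowledge, is a groupby on the index level:
pd.get_dummies(df.col1.str.extractall('(' + '|'.join(x+',' for x in states)+ ')')[0].str.strip(',')).groupby(level=0).sum().clip(upper=1)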
Edit:
Perhaps there are better ways, but this can be a bit safer, as suggested by @BradSolomon, allowing matches on 'State,( optional 5-digit Zip,) USA':
states = ['Texas', 'New York', 'Kentucky', 'Virginia', 'California', 'Pennsylvania']
pat = '(' + '|'.join(x+',?(\s\d{5},)?\sUSA' for x in states)+ ')'
s = df.col1.str.extract(pat)[0].str.split(',').str[0]
Output (s):
0 Texas
1 New York
2 NaN
3 Texas
4 Virginia
5 Kentucky
6 Pennsylvania
Name: 0, dtype: object
from the input:
col1
0 Crockett, Houston County, Texas, 75835, USA
1 NYC, New York, USA
2 Warszawa, mazowieckie, RP
3 Texas, USA
4 Virginia Beach, Virginia, 23451, USA
5 Louisville, Jefferson County, Kentucky, USA
6 California, Pennsylvania, USA
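To finish the dummy construction from s, presumably one feeds it back into get_dummies (a sketch, not from the original edit):
dummies = pd.get_dummies(s)   # non-USA rows (NaN in s) become all-zero rows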

Appending or Concatenating DataFrame via for loop to existing DataFrame

In the output below you will see that this code takes the Location column (a Series) and places it in a dataframe. The three assignments in the nested for loop then take the first index of each column and create a dataframe to add to the first dataframe. What I have been trying to do is loop through, going up one index each iteration, and append a new dataframe of repetitive data. However, when I try to print it, only the first dataframe and the last repetitive dataframe it looped through are printed. I am trying to build one huge dataframe that attaches a repetitive-index dataframe for each index from 0 to 17. I have updated this to show the repetitiveness I am looking for, but in a truncated way. I hope this helps. Thanks!
Here is the input
for j in range(0, 18, 1):
    for i in range(0, 18, 1):
        df['Rep Loc'] = str(df['Location'][j:j+1])
        df['Rep Lat'] = float(df['Latitude'][j:j+1])
        df['Rep Long'] = float(df['Longitude'][j:j+1])
        break
    print(df)
Here is the output
                                             Location   Latitude  Longitude
0   Letsholathebe II Rd, Maun, North-West District... -19.989491  23.397709
1                       North-West District, Botswana -19.389353  23.267951
2       Silobela, Kwekwe, Midlands Province, Zimbabwe -18.993930  29.147992
3   Mosi-Oa-Tunya, Livingstone, Southern Province,... -17.910147  25.861904
4   Parkway Drive, Victoria Falls, Matabeleland No... -17.909231  25.827019
5          A33, Kasane, North-West District, Botswana -17.795057  25.197270
6                       T1, Southern Province, Zambia -17.040664  26.608454
7   Sikoongo Road, Siavonga, Southern Province, Za... -16.536204  28.708753
8                 New Kasama, Lusaka Province, Zambia -15.471934  28.398588
9   Simon Mwansa Kapwepwe Avenue, Avondale, Lusaka... -15.386244  28.397111
10              Lusaka, Lusaka Province, 1010, Zambia -15.416697  28.281381
11  Chigwirizano Road, Rhodes Park, Lusaka, Lusaka... -15.401848  28.302248
12                T2, Kabwe, Central Province, Zambia -14.420744  28.462169
13   Kabushi Road, Ndola, Copperbelt Province, Zambia -12.997968  28.608536
14  Dr Aggrey Avenue, Mishenshi, Kitwe, Copperbelt... -12.797684  28.199061
15  President Avenue, Kalulushi, Copperbelt Provin... -12.833375  28.108370
16  Eglise Methodiste Unie, Avenue Mantola, Mawawa... -11.699407  27.500234
17  Avenue Babemba, Kolwezi, Lwalaba, Katanga, Lua... -10.698109  25.503816

                                           Rep Loc      Rep Lat   Rep Long
0  0 Letsholathebe II Rd, Maun, North-West Dis...  -19.989491  23.397709
1  0 Letsholathebe II Rd, Maun, North-West Dis...  -19.989491  23.397709
2  0 Letsholathebe II Rd, Maun, North-West Dis...  -19.989491  23.397709

                                           Rep Loc      Rep Lat   Rep Long
0  1 North-West District, Botswana\nName: Loca...  -19.389353  23.267951
1  1 North-West District, Botswana\nName: Loca...  -19.389353  23.267951
2  1 North-West District, Botswana\nName: Loca...  -19.389353  23.267951

                                           Rep Loc     Rep Lat   Rep Long
0  2 Silobela, Kwekwe, Midlands Province, Zimb...  -18.99393  29.147992
1  2 Silobela, Kwekwe, Midlands Province, Zimb...  -18.99393  29.147992

                                           Rep Loc      Rep Lat   Rep Long
0  3 Mosi-Oa-Tunya, Livingstone, Southern Prov...  -17.910147  25.861904
1  3 Mosi-Oa-Tunya, Livingstone, Southern Prov...  -17.910147  25.861904
2  3 Mosi-Oa-Tunya, Livingstone, Southern Prov...  -17.910147  25.861904

                                           Rep Loc      Rep Lat   Rep Long
0  4 Parkway Drive, Victoria Falls, Matabelela...  -17.909231  25.827019
1  4 Parkway Drive, Victoria Falls, Matabelela...  -17.909231  25.827019
2  4 Parkway Drive, Victoria Falls, Matabelela...  -17.909231  25.827019

                                           Rep Loc      Rep Lat  Rep Long
0  5 A33, Kasane, North-West District, Botswan...  -17.795057  25.19727
1  5 A33, Kasane, North-West District, Botswan...  -17.795057  25.19727
2  5 A33, Kasane, North-West District, Botswan...  -17.795057  25.19727
Good practice when asking questions is to provide an example of what you want your output to look like. However, this is my best guess at what you want.
pd.concat({i: d.shift(i) for i in range(18)}, axis=1)
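If the goal is instead the "huge dataframe" the question describes, with row j's values repeated against every row, one reading of that (the variable names are mine) is a concat over copies:
import pandas as pd

frames = []
for j in range(len(df)):
    block = df[['Location', 'Latitude', 'Longitude']].copy()
    block['Rep Loc'] = df['Location'].iloc[j]    # repeat row j's values
    block['Rep Lat'] = df['Latitude'].iloc[j]
    block['Rep Long'] = df['Longitude'].iloc[j]
    frames.append(block)
big = pd.concat(frames, ignore_index=True)       # 18 blocks of 18 rows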

Pandas .groupby automatically selecting column

From the following dataset (image omitted):
I'm trying to use .groupby to get the average Status Count per User Location. I've already done this for Follower Count by using
groupLoc = df.groupby('User Location')
groupCount = groupLoc.mean()
groupCount
Which automatically selected User Location vs Follower Count. Now I'm trying to do the same for User Location vs Status Count, but it's automatically including Follower Count again.
Anyone know how to fix this? Thanks in advance!
I think you need groupby with mean:
print(df.groupby('User Location', as_index=False)['Follower Count'].mean())
User Location Follower Count
0 Canada 1654.500000
1 Chicago 9021.000000
2 Indonesia 1352.666667
3 London 990.000000
4 Los Angeles CA 86.000000
5 New York 214.000000
6 Singapore 106.500000
7 Texas 181.000000
8 UK 2431.000000
9 indonesia 316.000000
10 null 295.750000
print(df.groupby('User Location', as_index=False)['Status Count'].mean())
User Location Status Count
0 Canada 39299.000000
1 Chicago 6402.000000
2 Indonesia 12826.000000
3 London 4864.666667
4 Los Angeles CA 3230.000000
5 New York 2947.000000
6 Singapore 6785.500000
7 Texas 901.000000
8 UK 81440.000000
9 indonesia 17662.000000
10 null 29610.875000
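Alternatively, reusing the stored groupby from the question and selecting the single column before aggregating avoids recomputing the groups (a sketch using the question's own variable names):
groupLoc = df.groupby('User Location')
statusCount = groupLoc['Status Count'].mean()   # Status Count only, no Follower Count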
