Pandas .groupby automatically selecting column - python

I'm trying to use .groupby to get the average Status Count per User Location from my dataset. I've already done this for Follower Count by using
groupLoc = df.groupby('User Location')
groupCount = groupLoc.mean()
groupCount
This automatically selected User Location vs Follower Count. Now I'm trying to do the same for User Location vs Status Count, but Follower Count is automatically included again.
Anyone know how to fix this? Thanks in advance!

I think you need groupby with mean:
print(df.groupby('User Location', as_index=False)['Follower Count'].mean())
User Location Follower Count
0 Canada 1654.500000
1 Chicago 9021.000000
2 Indonesia 1352.666667
3 London 990.000000
4 Los Angeles CA 86.000000
5 New York 214.000000
6 Singapore 106.500000
7 Texas 181.000000
8 UK 2431.000000
9 indonesia 316.000000
10 null 295.750000
print(df.groupby('User Location', as_index=False)['Status Count'].mean())
User Location Status Count
0 Canada 39299.000000
1 Chicago 6402.000000
2 Indonesia 12826.000000
3 London 4864.666667
4 Los Angeles CA 3230.000000
5 New York 2947.000000
6 Singapore 6785.500000
7 Texas 901.000000
8 UK 81440.000000
9 indonesia 17662.000000
10 null 29610.875000
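If you want both averages at once, a minimal sketch (same idea, selecting a list of columns instead of a single column) would be:
print(df.groupby('User Location', as_index=False)[['Follower Count', 'Status Count']].mean())
Selecting only the columns you need this way also stops the mean from being computed over every other numeric column in the frame.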

Related

Set LIMIT 3 on a SQL query with CASE and LEFT JOIN?

I have working SQL code that retrieves all the scores of a hockey team. I would like to set LIMIT 3 or <= 3:
my = cursor_test.execute('''SELECT Next.ClubHome,
    CASE
        WHEN Next.ClubHome = Result.ClubHome THEN Result.ScoreHome
        WHEN Next.ClubHome = Result.ClubAway THEN Result.ScoreAway
    END AS score
    FROM NextMatch Next
    LEFT JOIN ResultMatch Result ON Next.ClubHome IN (Result.ClubHome, Result.ClubAway)
''')
for row in my.fetchall():
    print(row)
Let me explain the question better: observe the next Chicago, New York and Dallas hockey matches in the NextMatch table; these three clubs appear in ClubHome.
NEXTMATCH
ClubHome    ClubAway       Tournament
Chicago     Minnesota      NHL
New York    Los Angeles    NHL
Dallas      Vegas Gold     NHL
In the ResultMatch table, I would like to retrieve the last 3 overall scores of Chicago, New York and Dallas (ScoreHome or ScoreAway). So I would like this output:
Chicago: 2
Chicago: 0
Chicago: 1
New York: 2
New York: 3
New York: 2
Dallas: 4
Dallas: 3
Dallas: 1
RESULTMATCH
ClubHome    ClubAway       Tournament    Round    ScoreHome    ScoreAway
Toronto     CHICAGO        NHL           8        1            2
New York    Vegas          NHL           8        2            3
CHICAGO     Dallas         NHL           7        0            4
Ottawa      New York       NHL           7        3            3
CHICAGO     Buffalo Sab    NHL           6        1            0
Vegas       CHICAGO        NHL           6        4            2
New York    Dallas         NHL           5        2            3
Dallas      Buffalo Sab    NHL           5        1            2
Code that may be useful for the solution is the following. However, it only retrieves the last 3 ScoreHome results (and not the ScoreAway):
x = cursor2.execute('''SELECT ClubHome, ScoreHome
    FROM (SELECT NextMatch.ClubHome, NextMatch.ClubAway, ResultMatch.ScoreHome,
                 ROW_NUMBER() OVER (PARTITION BY NextMatch.ClubHome ORDER BY ResultMatch.ScoreHome DESC) AS rn
          FROM NextMatch
          INNER JOIN ResultMatch ON NextMatch.ClubHome = ResultMatch.ClubHome) t
    WHERE rn <= 3
    ORDER BY ClubHome ASC''')
How can I modify my first query and add LIMIT 3 or <= 3 to get what I ask for in the example output? Thank you
If you want to do it in SQL only, and not filter the results in Python, you could use the windowing function ROW_NUMBER:
SELECT clubHome, score FROM (
    SELECT Next.clubhome,
           CASE
               WHEN Next.ClubHome = Result.ClubHome THEN Result.ScoreHome
               WHEN Next.ClubHome = Result.ClubAway THEN Result.ScoreAway
           END AS score,
           ROW_NUMBER() OVER (PARTITION BY Next.clubHome ORDER BY round DESC) rowNum
    FROM nextmatch Next
    JOIN resultmatch Result ON Next.clubhome IN (Result.clubhome, Result.clubaway)
) WHERE rowNum <= 3;
DB Fiddle: https://www.db-fiddle.com/f/xrLpLwSu783AQHrwD8Fq4t/0
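If you want to sanity-check the query locally, here is a minimal, self-contained Python sketch against an in-memory SQLite database (assuming SQLite 3.25+ for window-function support; club names are normalized to a single spelling so the equality comparisons match):
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
CREATE TABLE NextMatch (ClubHome TEXT, ClubAway TEXT, Tournament TEXT);
CREATE TABLE ResultMatch (ClubHome TEXT, ClubAway TEXT, Tournament TEXT,
                          Round INTEGER, ScoreHome INTEGER, ScoreAway INTEGER);
INSERT INTO NextMatch VALUES
    ('Chicago', 'Minnesota', 'NHL'),
    ('New York', 'Los Angeles', 'NHL'),
    ('Dallas', 'Vegas Gold', 'NHL');
INSERT INTO ResultMatch VALUES
    ('Toronto', 'Chicago', 'NHL', 8, 1, 2),
    ('New York', 'Vegas', 'NHL', 8, 2, 3),
    ('Chicago', 'Dallas', 'NHL', 7, 0, 4),
    ('Ottawa', 'New York', 'NHL', 7, 3, 3),
    ('Chicago', 'Buffalo Sab', 'NHL', 6, 1, 0),
    ('Vegas', 'Chicago', 'NHL', 6, 4, 2),
    ('New York', 'Dallas', 'NHL', 5, 2, 3),
    ('Dallas', 'Buffalo Sab', 'NHL', 5, 1, 2);
''')

# Same ROW_NUMBER approach: keep the 3 most recent rounds per upcoming club.
rows = conn.execute('''
    SELECT clubHome, score FROM (
        SELECT Next.ClubHome AS clubHome,
               CASE
                   WHEN Next.ClubHome = Result.ClubHome THEN Result.ScoreHome
                   WHEN Next.ClubHome = Result.ClubAway THEN Result.ScoreAway
               END AS score,
               ROW_NUMBER() OVER (PARTITION BY Next.ClubHome
                                  ORDER BY Result.Round DESC) AS rowNum
        FROM NextMatch Next
        JOIN ResultMatch Result
            ON Next.ClubHome IN (Result.ClubHome, Result.ClubAway)
    ) WHERE rowNum <= 3
''').fetchall()
for club, score in rows:
    print(f'{club}: {score}')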

Get the number of IDs that have the same combination of distinct values in the 'locations' column

I have a table with ids and locations they have been to.
id    Location
1     Maryland
1     Iowa
2     Maryland
2     Texas
3     Georgia
3     Iowa
4     Maryland
4     Iowa
5     Maryland
5     Iowa
5     Texas
I'd like to perform a query that would allow me to get the number of ids per combination.
In this example table, the output would be -
Maryland, Iowa - 2
Maryland, Texas - 1
Georgia, Iowa - 1
Maryland, Iowa, Texas - 1
My original thought was to add the ASCII values of the distinct locations of each id, then see how many ids have each value and which combinations correspond to it. I was not able to do that, as SQL Server would not let me cast an nvarchar to a numeric data type. Is there any other way I could use SQL to get the number of ids per combination? Using Python is also acceptable; however, SQL is preferred.
If you want to solve this in SQL and you are running SQL Server 2017 or later, you can use a CTE to aggregate the locations for each id using STRING_AGG, and then count the occurrences of each aggregated string:
WITH all_locations AS (
    SELECT STRING_AGG(Location, ', ') WITHIN GROUP (ORDER BY Location) AS aloc
    FROM locations
    GROUP BY id
)
SELECT aloc, COUNT(*) AS cnt
FROM all_locations
GROUP BY aloc
ORDER BY cnt, aloc
Output:
aloc cnt
Georgia, Iowa 1
Iowa, Maryland, Texas 1
Maryland, Texas 1
Iowa, Maryland 2
Note that I have applied an ordering to the STRING_AGG to ensure that someone who visits Maryland and then Iowa is treated the same way as someone who visits Iowa and then Maryland. If this is not the desired behaviour, simply delete the WITHIN GROUP clause.
Demo on dbfiddle
Use groupby + agg + value_counts:
new_df = df.groupby('id')['Location'].agg(list).str.join(', ').value_counts().reset_index()
Output:
>>> new_df
index Location
0 Maryland, Iowa 2
1 Maryland, Texas 1
2 Georgia, Iowa 1
3 Maryland, Iowa, Texas 1
Let us do groupby with join, then value_counts:
df.groupby('id')['Location'].agg(', '.join).value_counts()
Out[938]:
join
Maryland, Iowa 2
Georgia, Iowa 1
Maryland, Iowa, Texas 1
Maryland, Texas 1
dtype: int64
Use a frozenset to aggregate, ensuring unique groups regardless of visit order:
df.groupby('id')['Location'].agg(frozenset).value_counts()
Output:
(Maryland, Iowa) 2
(Texas, Maryland) 1
(Georgia, Iowa) 1
(Texas, Maryland, Iowa) 1
Name: Location, dtype: int64
Or a sorted string join:
df.groupby('id')['Location'].agg(lambda x: ', '.join(sorted(x))).value_counts()
Output:
Iowa, Maryland 2
Maryland, Texas 1
Georgia, Iowa 1
Iowa, Maryland, Texas 1
Name: Location, dtype: int64

Formatting multiple city names into a universal name for each city all at once using pandas

I want to change all city-name variants into one universal name.
City b c
0 New york 1 1
1 New York 2 2
2 N.Y. 3 3
3 NY 4 4
They all refer to the city New York; however, Python sees them as separate entities, so I've changed them all into one:
df["City"] = df["City"].replace({"N.Y.":"New york", "New York": "New york", "NY": "New york"})
After this I need to check that every variation of New York is covered. To do that, I've created a function:
def universal_ok(universal_name):
    count = 0
    for c in df.City:
        if c == universal_name:
            count += 1
    # This only works when the column consists of only one type of city
    if count == len(df.City):
        return "Yes all names are formatted correctly"
    else:
        return f"there are {len(df.City) - count} names that need to be changed"
universal_ok("New york")
But the problem is: what about when there is more than one city in the column?
City b c
0 New york 1 1
1 New York 2 2
2 N.Y. 3 3
3 NY 4 4
4 Toronto 3 2
5 TO 3 2
6 toronto 3 2
Is there a way to change each city to its universal name?
Convert to Lower, Unique Values, Map and Count:
Data:
City b c
New york 1 1
New York 2 2
N.Y. 3 3
NY 4 4
Toronto 3 2
TO 3 2
toronto 3 2
Convert to Lower:
This will reduce the variations of the city names.
pandas.Series.str.lower
df.City = df.City.str.lower()
City b c
new york 1 1
new york 2 2
n.y. 3 3
ny 4 4
toronto 3 2
to 3 2
toronto 3 2
Unique Values:
This will give you all the unique values in the column.
pandas.Series.unique
df.City.unique()
array(['new york', 'n.y.', 'ny', 'toronto', 'to'], dtype=object)
Mapping the City Names:
Use the unique-values list to map the values to the preferred form.
I created a tuple, then used a dict comprehension to create the dictionary. I did this so I wouldn't have to repeatedly type the preferred city name, because I'm lazy / efficient that way.
Tuples
Python Dictionary Comprehension Tutorial
pandas.Series.map
cities_tup = (('New York', ['ny', 'n.y.', 'new york']),
              ('Toronto', ['toronto', 'to']))
cities_map = {y: x[0] for x in cities_tup for y in x[1]}
{'ny': 'New York',
'n.y.': 'New York',
'new york': 'New York',
'toronto': 'Toronto',
'to': 'Toronto'}
df.City = df.City.map(cities_map)
City b c
New York 1 1
New York 2 2
New York 3 3
New York 4 4
Toronto 3 2
Toronto 3 2
Toronto 3 2
Unique Counts to verify:
Verify city names have been updated and count them
pandas.Series.value_counts
df.City.value_counts()
New York 4
Toronto 3
Name: City, dtype: int64
Remarks
Undoubtedly, there are alternate methods to accomplish this task, but I think this is straightforward and easy to follow.
Someone will probably come along and offer a one-liner.
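In that spirit, a minimal one-liner sketch that chains the lower-casing and mapping steps, assuming the cities_map built above:
df.City = df.City.str.lower().map(cities_map)
Note that map leaves any spelling missing from cities_map as NaN, which is itself a handy way to spot variants you haven't covered yet.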
You need a specific column with some sort of city id, otherwise you won’t be able to distinguish between Paris, France and Paris, Texas, nor will you be able to group Istanbul and Constantinople.

Constructing a dataframe with multiple columns based on str conditions using a loop - python

I have a webscraped Twitter DataFrame that includes user location. The location variable looks like this:
2 Crockett, Houston County, Texas, 75835, USA
3 NYC, New York, USA
4 Warszawa, mazowieckie, RP
5 Texas, USA
6 Virginia Beach, Virginia, 23451, USA
7 Louisville, Jefferson County, Kentucky, USA
I would like to construct state dummies for all USA states by using a loop.
I have managed to extract users from the USA using
location_usa = location_df['location'].str.contains('usa', case = False)
However, the code would be too bulky if I wrote this for every single state. I have a list of the states as strings.
Also I am unable to use
pd.Series.str.get_dummies()
as there are different locations within the same state and each entry is a whole sentence.
I would like the output to look something like this:
Alabama Alaska Arizona
1 0 0 1
2 0 1 0
3 1 0 0
4 0 0 0
Or the same with Boolean values.
Use .str.extract to get a Series of the states, and then use pd.get_dummies on that Series. You will need to define a list of all 50 states:
import pandas as pd
states = ['Texas', 'New York', 'Kentucky', 'Virginia']
pd.get_dummies(df.col1.str.extract('(' + '|'.join(x+',' for x in states)+ ')')[0].str.strip(','))
Kentucky New York Texas Virginia
0 0 0 1 0
1 0 1 0 0
2 0 0 0 0
3 0 0 1 0
4 0 0 0 1
5 1 0 0 0
Note I matched on states followed by a ',', as that seems to be the pattern, and it allows you to avoid false matches like 'Virginia' with 'Virginia Beach', or more problematic things like 'Washington County, Minnesota'.
If you expect multiple states to match on a single line, then this becomes .extractall, summing across the 0th level:
pd.get_dummies(df.col1.str.extractall('(' + '|'.join(x+',' for x in states)+ ')')[0].str.strip(',')).sum(level=0).clip(upper=1)
Edit:
Perhaps there are better ways, but this can be a bit safer, as suggested by @BradSolomon, allowing matches on 'State,( optional 5 digit Zip,) USA':
states = ['Texas', 'New York', 'Kentucky', 'Virginia', 'California', 'Pennsylvania']
pat = '(' + '|'.join(x + r',?(\s\d{5},)?\sUSA' for x in states) + ')'  # raw string for the regex escapes
s = df.col1.str.extract(pat)[0].str.split(',').str[0]
Output of s:
0 Texas
1 New York
2 NaN
3 Texas
4 Virginia
5 Kentucky
6 Pennsylvania
Name: 0, dtype: object
from the input:
col1
0 Crockett, Houston County, Texas, 75835, USA
1 NYC, New York, USA
2 Warszawa, mazowieckie, RP
3 Texas, USA
4 Virginia Beach, Virginia, 23451, USA
5 Louisville, Jefferson County, Kentucky, USA
6 California, Pennsylvania, USA
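To turn s into dummy columns, the same pd.get_dummies call from the first snippet applies. A self-contained sketch of the whole pipeline, assuming the sample input above:
import pandas as pd

df = pd.DataFrame({'col1': [
    'Crockett, Houston County, Texas, 75835, USA',
    'NYC, New York, USA',
    'Warszawa, mazowieckie, RP',
    'Texas, USA',
    'Virginia Beach, Virginia, 23451, USA',
    'Louisville, Jefferson County, Kentucky, USA',
    'California, Pennsylvania, USA']})

states = ['Texas', 'New York', 'Kentucky', 'Virginia', 'California', 'Pennsylvania']
pat = '(' + '|'.join(x + r',?(\s\d{5},)?\sUSA' for x in states) + ')'

# Extract the 'State(, Zip), USA' tail, then keep just the state name.
s = df.col1.str.extract(pat)[0].str.split(',').str[0]

# One indicator column per matched state; unmatched rows (NaN) become all zeros.
print(pd.get_dummies(s))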

Pandas groupby using function variable

I have this dataframe:
iata airport city state country lat
0 00M Thigpen Bay Springs MS USA 31.953765
1 00R Livingston Municipal Livingston TX USA 30.685861
2 00V Meadow Lake Colorado Springs CO USA 38.945749
3 01G Perry-Warsaw Perry NY USA 42.741347
4 01J Hilliard Airpark Hilliard FL USA 30.688012
I am trying to get the number of airports per state. For example, say I have a function:
def f(dataframe, state):
    return result
where state would be a state abbreviation, such as 'MA'. I am trying to group the dataframe by the input variable, such as state ('MA'), to then get the number of airports per state.
When I use:
df.groupby(state)['airport'].value_counts()
or
df.groupby(state)['airport'].value_counts()/df['airport'].count()
df.groupby(['state'] == state)['airport'].value_counts()/df['airport'].count()
The last two are for the conditional probability that a selected airport will be in that state.
It throws a KeyError: 'MA', which I think is because the input variable is treated not as a column name but as a value in the column.
Is there a way to get the number of airports per state?
I would use pandas' nunique to get the number of airports per state. The code is easier to read and remember.
To illustrate my point, I modified the dataset as follows, such that Florida now has three airports, two of them fictional:
iata airport city state country lat
0 00M Thigpen Bay Springs MS USA 31.953765
1 00R Livingston Municipal Livingston TX USA 30.685861
2 00V Meadow Lake Springs CO USA 38.945749
3 01G Perry-Warsaw Perry NY USA 42.741347
4 01J Hilliard Airpark Hilliard FL USA 30.688012
5 f234 Weirdviller Chilliard FL USA 30.788012
6 23r2 Johnson Billiard FL USA 30.888012
Then, we write:
df.groupby('state').iata.nunique()
to get the following results:
state
CO 1
FL 3
MS 1
NY 1
TX 1
Name: iata, dtype: int64
Hope this helps.
Assuming each record is an airport throughout, you can just count the records for each state / country combination:
df.groupby(['country','state']).size()
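For instance, on the five-row sample at the top of the question this should yield:
country  state
USA      CO       1
         FL       1
         MS       1
         NY       1
         TX       1
dtype: int64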
You can rewrite this as an explicit groupby apply:
In [11]: df.groupby("state")["airport"].apply(lambda x: x.value_counts() / len(x))
Out[11]:
state
CO Meadow Lake 1.0
FL Hilliard Airpark 1.0
MS Thigpen 1.0
NY Perry-Warsaw 1.0
TX Livingston Municipal 1.0
Name: airport, dtype: float64
or store the groupby and reuse it (probably this is faster):
In [21]: g = df.groupby("state")["airport"]
In [22]: g.value_counts() / g.size()
Out[22]:
state airport
CO Meadow Lake 1.0
FL Hilliard Airpark 1.0
MS Thigpen 1.0
NY Perry-Warsaw 1.0
TX Livingston Municipal 1.0
Name: airport, dtype: float64
This seemed to work the way I intended, with all your help. a[state] looks up an input in the form of a state abbreviation ('MA'), and this returns the probability that a randomly selected airport belongs to that state.
def f(dataframe, state):
    a = dataframe.groupby('state').iata.nunique()
    s = a.sum()
    result = a[state] / s
    return result
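As a quick usage sketch, on the modified seven-airport sample from the first answer (three of the seven unique iata codes are in FL):
print(f(df, 'FL'))  # 3/7 ≈ 0.4286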
