Set Limit 3 to a SQL with Case and Left Join? - python

I have a working SQL query that retrieves all the scores of each hockey team. I would like to limit it to 3 rows per team (LIMIT 3 or <= 3):
my = cursor_test.execute('''SELECT Next.ClubHome,
    CASE
        WHEN Next.ClubHome = Result.ClubHome then Result.ScoreHome
        WHEN Next.ClubHome = Result.ClubAway then Result.ScoreAway
    END as score
FROM NextMatch Next
LEFT JOIN ResultMatch Result ON Next.ClubHome in (Result.ClubHome, Result.ClubAway)
''')
for row in my.fetchall():
    print(row)
Let me explain the question better:
The NextMatch table lists the upcoming hockey matches of Chicago, New York and Dallas, which appear in the ClubHome column:
NEXTMATCH
ClubHome   ClubAway     Tournament
Chicago    Minnesota    NHL
New York   Los Angeles  NHL
Dallas     Vegas Gold   NHL
In the ResultMatch table, I would like to retrieve the last 3 overall scores of Chicago, New York and Dallas (ScoreHome or ScoreAway). So I would like this output:
Chicago: 2
Chicago: 0
Chicago: 1
New York: 2
New York: 3
New York: 2
Dallas: 4
Dallas: 3
Dallas: 1
RESULTMATCH
ClubHome  ClubAway     Tournament  Round  ScoreHome  ScoreAway
Toronto   CHICAGO      NHL         8      1          2
New York  Vegas        NHL         8      2          3
CHICAGO   Dallas       NHL         7      0          4
Ottawa    New York     NHL         7      3          3
CHICAGO   Buffalo Sab  NHL         6      1          0
Vegas     CHICAGO      NHL         6      4          2
New York  Dallas       NHL         5      2          3
Dallas    Buffalo Sab  NHL         5      1          2
A query that might be useful for the solution is the following. However, it only retrieves the last 3 ScoreHome results (and not the ScoreAway ones):
x = cursor2.execute('''SELECT ClubHome, ScoreHome
FROM (SELECT NextMatch.ClubHome, NextMatch.ClubAway, ResultMatch.ScoreHome,
             ROW_NUMBER() OVER (PARTITION BY NextMatch.ClubHome ORDER BY ResultMatch.ScoreHome DESC) AS rn
      FROM NextMatch
      INNER JOIN ResultMatch ON NextMatch.ClubHome = ResultMatch.ClubHome) t
WHERE rn <= 3
ORDER BY ClubHome ASC''')
How can I modify my first query to add LIMIT 3 (or rn <= 3) and get the output shown in the example above? Thank you.

If you want to do it in SQL only, rather than filtering the results in Python, you can use the window function ROW_NUMBER:
SELECT clubHome, score FROM (
    SELECT Next.clubhome,
        CASE
            WHEN Next.ClubHome = Result.ClubHome then Result.ScoreHome
            WHEN Next.ClubHome = Result.ClubAway then Result.ScoreAway
        END as score,
        ROW_NUMBER() OVER (PARTITION BY next.clubHome ORDER BY round DESC) rowNum
    FROM nextmatch Next
    JOIN resultmatch Result ON Next.clubhome in (Result.clubhome, Result.clubaway)
) WHERE rowNum <= 3;
SQLFiddle: https://www.db-fiddle.com/f/xrLpLwSu783AQHrwD8Fq4t/0
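To try the ROW_NUMBER query without a fiddle, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory database. The rows are copied from the question, except that club names are written with consistent capitalisation (an assumption made here, since string comparison in SQLite is case-sensitive by default). Window functions need SQLite 3.25 or newer. The subquery alias t, the outer ORDER BY and the name last_three are additions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE NextMatch (ClubHome TEXT, ClubAway TEXT, Tournament TEXT)")
cur.execute(
    "CREATE TABLE ResultMatch (ClubHome TEXT, ClubAway TEXT, Tournament TEXT, "
    "Round INTEGER, ScoreHome INTEGER, ScoreAway INTEGER)"
)
cur.executemany("INSERT INTO NextMatch VALUES (?, ?, ?)", [
    ("Chicago", "Minnesota", "NHL"),
    ("New York", "Los Angeles", "NHL"),
    ("Dallas", "Vegas Gold", "NHL"),
])
# Sample results from the question (capitalisation normalised, see lead-in)
cur.executemany("INSERT INTO ResultMatch VALUES (?, ?, ?, ?, ?, ?)", [
    ("Toronto", "Chicago", "NHL", 8, 1, 2),
    ("New York", "Vegas", "NHL", 8, 2, 3),
    ("Chicago", "Dallas", "NHL", 7, 0, 4),
    ("Ottawa", "New York", "NHL", 7, 3, 3),
    ("Chicago", "Buffalo Sab", "NHL", 6, 1, 0),
    ("Vegas", "Chicago", "NHL", 6, 4, 2),
    ("New York", "Dallas", "NHL", 5, 2, 3),
    ("Dallas", "Buffalo Sab", "NHL", 5, 1, 2),
])

query = """
SELECT ClubHome, score FROM (
    SELECT Next.ClubHome,
           CASE
               WHEN Next.ClubHome = Result.ClubHome THEN Result.ScoreHome
               WHEN Next.ClubHome = Result.ClubAway THEN Result.ScoreAway
           END AS score,
           ROW_NUMBER() OVER (PARTITION BY Next.ClubHome
                              ORDER BY Result.Round DESC) AS rowNum
    FROM NextMatch Next
    JOIN ResultMatch Result ON Next.ClubHome IN (Result.ClubHome, Result.ClubAway)
) t
WHERE rowNum <= 3
ORDER BY ClubHome, rowNum
"""

# Collect the (at most) three most recent scores per club
last_three = {}
for club, score in cur.execute(query):
    last_three.setdefault(club, []).append(score)

for club, scores in last_three.items():
    print(club, scores)
conn.close()
```

Chicago has two results in round 6, so which of them ends up as the third row depends on how the database breaks the tie; adding a second column to the window's ORDER BY would make that choice deterministic.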

Related

Update Pandas DataFrame for each string in a list

I am analyzing a dataset containing NFL game results over the past 20 years and am trying to create a column denoting for each team whether or not the game was a home game or away game (home game = 1, away game = 0).
The code I have so far is:
home_list = list(df.home_team.unique())
def home_or_away(team_name, dataf):
    dataf['home_or_away'] = np.where(dataf['home_team'] == team_name, 1, 0)
    return dataf
for i in home_list:
    home_update_all = home_or_away(i, df)
    df.update(home_update_all)
This doesn't seem to yield the correct results as each team is just overwritten when iterating over them. Any ideas on how to solve this?
Thanks!
I'm not really sure what your expected output is. Do you want one column per team? You currently keep creating a column with the same name, so only the result of the last iteration is kept and the earlier ones are overwritten. Or do you want multiple DataFrames?
If you want multiple columns, one per team:
import pandas as pd
df = pd.DataFrame({'game': [1, 2, 3, 4], 'home_team': ['a', 'b', 'c', 'a']})
> game home_team
0 1 a
1 2 b
2 3 c
3 4 a
First collect unique teams as you did:
home_list = list(df.home_team.unique())
Create a column for each team:
for team in home_list:
    df[f'home_or_away_{team}'] = [int(ht==team) for ht in df['home_team']]
Which results in:
> game home_team home_or_away_a home_or_away_b home_or_away_c
0 1 a 1 0 0
1 2 b 0 1 0
2 3 c 0 0 1
3 4 a 1 0 0
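As a possible alternative to the loop, pd.get_dummies can build the same one-column-per-team layout in a single call. This is a minimal sketch on the toy frame above; the prefix home_or_away and the variable name indicators are choices made here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'game': [1, 2, 3, 4], 'home_team': ['a', 'b', 'c', 'a']})

# One 0/1 indicator column per distinct home team
indicators = pd.get_dummies(df['home_team'], prefix='home_or_away', dtype=int)

# Attach the indicator columns to the original frame
df = df.join(indicators)
print(df)
```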
You're overcomplicating it. There is no need to iterate when using np.where(): apply it directly to the two columns, without a separate function.
The expression basically says "where home_team equals team_name, put a 1, else put a 0".
import pandas as pd
import numpy as np
df = pd.DataFrame([['Chicago Bears','Chicago Bears', 'Green Bay Packers'],
['Chicago Bears','Green Bay Packers', 'Chicago Bears'],
['Detriot Lions','Detriot Lions', 'Los Angeles Chargers'],
['New England Patriots','New York Jets', 'New England Patriots'],
['Houston Texans','Los Angeles Rams', 'Houston Texans']],
columns = ['team_name','home_team','away_team'])
df['home_or_away'] = np.where(df['home_team'] == df['team_name'], 1, 0)
Output:
print(df)
team_name home_team away_team home_or_away
0 Chicago Bears Chicago Bears Green Bay Packers 1
1 Chicago Bears Green Bay Packers Chicago Bears 0
2 Detriot Lions Detriot Lions Los Angeles Chargers 1
3 New England Patriots New York Jets New England Patriots 0
4 Houston Texans Los Angeles Rams Houston Texans 0

Get the number of IDs that have the same combination of distinct values in the 'locations' column

I have a table with ids and locations they have been to.
id  Location
1   Maryland
1   Iowa
2   Maryland
2   Texas
3   Georgia
3   Iowa
4   Maryland
4   Iowa
5   Maryland
5   Iowa
5   Texas
I'd like to perform a query that would allow me to get the number of ids per combination.
In this example table, the output would be -
Maryland, Iowa - 2
Maryland, Texas - 1
Georgia, Iowa - 1
Maryland, Iowa, Texas - 1
My original thought was to add up the ASCII values of the distinct locations of each id, count how many ids share each sum, and work out which combination corresponds to each sum. I could not do that because SQL Server would not let me cast an nvarchar to a numeric data type. Is there another way to get the number of ids per combination in SQL? Using Python is also acceptable, but SQL is preferred.
If you want to solve this in SQL and you are running SQL Server 2017 or later, you can use a CTE to aggregate the locations for each id using STRING_AGG, and then count the occurrences of each aggregated string:
WITH all_locations AS (
    SELECT STRING_AGG(Location, ', ') WITHIN GROUP (ORDER BY Location) AS aloc
    FROM locations
    GROUP BY id
)
SELECT aloc, COUNT(*) AS cnt
FROM all_locations
GROUP BY aloc
ORDER BY cnt, aloc
Output:
aloc cnt
Georgia, Iowa 1
Iowa, Maryland, Texas 1
Maryland, Texas 1
Iowa, Maryland 2
Note that I have applied an ordering to the STRING_AGG to ensure that someone who visits Maryland and then Iowa is treated the same way as someone who visits Iowa and then Maryland. If this is not the desired behaviour, simply delete the WITHIN GROUP clause.
Demo on dbfiddle
Use groupby + agg + value_counts:
new_df = df.groupby('id')['Location'].agg(list).str.join(', ').value_counts().reset_index()
Output:
>>> new_df
index Location
0 Maryland, Iowa 2
1 Maryland, Texas 1
2 Georgia, Iowa 1
3 Maryland, Iowa, Texas 1
Let us do groupby with join then value_counts
df.groupby('id')['Location'].agg(', '.join).value_counts()
Out[938]:
join
Maryland, Iowa 2
Georgia, Iowa 1
Maryland, Iowa, Texas 1
Maryland, Texas 1
dtype: int64
Use a frozenset as the aggregation so that the order of the locations does not matter and each combination forms a unique group:
df.groupby('id')['Location'].agg(frozenset).value_counts()
Output:
(Maryland, Iowa) 2
(Texas, Maryland) 1
(Georgia, Iowa) 1
(Texas, Maryland, Iowa) 1
Name: Location, dtype: int64
Or a sorted string join:
df.groupby('id')['Location'].agg(lambda x: ', '.join(sorted(x))).value_counts()
output:
Iowa, Maryland 2
Maryland, Texas 1
Georgia, Iowa 1
Iowa, Maryland, Texas 1
Name: Location, dtype: int64
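For a self-contained run, the sorted-join idea can be sketched on the table from the question as follows. The drop_duplicates step is an addition made here on the assumption that an id may list the same location more than once; the names combos and counts are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 5],
    'Location': ['Maryland', 'Iowa', 'Maryland', 'Texas', 'Georgia', 'Iowa',
                 'Maryland', 'Iowa', 'Maryland', 'Iowa', 'Texas'],
})

# Drop repeated (id, Location) rows, then build one order-independent key per id
combos = (df.drop_duplicates()
            .groupby('id')['Location']
            .agg(lambda s: ', '.join(sorted(s))))

# Count how many ids share each combination
counts = combos.value_counts()
print(counts)
```

Sorting before joining means that "Maryland, Iowa" and "Iowa, Maryland" end up under the same key, which is the same effect the WITHIN GROUP (ORDER BY Location) clause has in the SQL answer.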

Validate a dataframe based on another dataframe?

I have two dataframes:
Table1:
Table2:
How to find:
The country-city combinations that are present only in Table2 but not Table1.
Here [India-Mumbai] is the output.
For each country-city combination, that's present in both the tables, find the "Initiatives" that are present in Table2 but not Table1.
Here {"India-Bangalore": [Textile, Irrigation], "USA-Texas": [Irrigation]}
To answer the first question, we can use the merge method and keep only the NaN rows :
>>> df_merged = pd.merge(df_1, df_2, on=['Country', 'City'], how='left', suffixes = ['_1', '_2'])
>>> df_merged[df_merged['Initiative_2'].isnull()][['Country', 'City']]
Country City
13 India Mumbai
For the next question, we first need to remove the NaN rows from the previously merged DataFrame :
>>> df_both_table = df_merged[~df_merged['Initiative_2'].isnull()]
>>> df_both_table
Country City Initiative_1 Initiative_2
0 India Bangalore Plants Plants
1 India Bangalore Plants Textile
2 India Bangalore Plants Irrigation
3 India Bangalore Industries Plants
4 India Bangalore Industries Textile
5 India Bangalore Industries Irrigation
6 India Bangalore Roads Plants
7 India Bangalore Roads Textile
8 India Bangalore Roads Irrigation
9 USA Texas Plants Plants
10 USA Texas Plants Irrigation
11 USA Texas Roads Plants
12 USA Texas Roads Irrigation
Then, we can keep only the rows whose Initiative_1 and Initiative_2 values differ, and use a groupby to get the list of Initiative_2 values:
>>> df_unique_initiative_2 = df_both_table[~(df_both_table['Initiative_1'] == df_both_table['Initiative_2'])]
>>> df_list_initiative_2 = df_unique_initiative_2.groupby(['Country', 'City'])['Initiative_2'].unique().reset_index()
>>> df_list_initiative_2
Country City Initiative_2
0 India Bangalore [Textile, Irrigation, Plants]
1 USA Texas [Irrigation, Plants]
We do the same but this time on Initiative_1 to get the list as well :
>>> df_list_initiative_1 = df_unique_initiative_2.groupby(['Country', 'City'])['Initiative_1'].unique().reset_index()
>>> df_list_initiative_1
Country City Initiative_1
0 India Bangalore [Plants, Industries, Roads]
1 USA Texas [Plants, Roads]
To finish, we use a set difference to remove the redundant Initiative_1 elements and get the expected result:
>>> df_list_initiative_2['Initiative'] = (df_list_initiative_2['Initiative_2'].map(set)-df_list_initiative_1['Initiative_1'].map(set)).map(list)
>>> df_list_initiative_2[['Country', 'City', 'Initiative']]
Country City Initiative
0 India Bangalore [Textile, Irrigation]
1 USA Texas [Irrigation]
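Both questions can also be approached as anti-joins using the indicator option of merge. The sketch below is only an illustration: the original tables are not shown, so df1 and df2 are hypothetical stand-ins built to match the expected output in the question, and all variable names are made up here.

```python
import pandas as pd

# Hypothetical stand-ins for Table1 and Table2 (the real tables are not shown)
df1 = pd.DataFrame({
    'Country': ['India', 'India', 'India', 'USA', 'USA'],
    'City': ['Bangalore', 'Bangalore', 'Bangalore', 'Texas', 'Texas'],
    'Initiative': ['Plants', 'Industries', 'Roads', 'Plants', 'Roads'],
})
df2 = pd.DataFrame({
    'Country': ['India', 'India', 'India', 'USA', 'USA', 'India'],
    'City': ['Bangalore', 'Bangalore', 'Bangalore', 'Texas', 'Texas', 'Mumbai'],
    'Initiative': ['Plants', 'Textile', 'Irrigation', 'Plants', 'Irrigation', 'Plants'],
})
keys = ['Country', 'City']
keys1 = df1[keys].drop_duplicates()

# 1) Country-city pairs that occur only in df2
pairs = df2[keys].drop_duplicates().merge(keys1, on=keys, how='left', indicator=True)
only_in_2 = pairs.loc[pairs['_merge'] == 'left_only', keys]

# 2) Initiatives in df2 but not in df1, restricted to pairs present in both tables
rows = df2.merge(df1, on=keys + ['Initiative'], how='left', indicator=True)
new_rows = rows[rows['_merge'] == 'left_only'].merge(keys1, on=keys)
new_initiatives = new_rows.groupby(keys)['Initiative'].agg(list).to_dict()

print(only_in_2.values.tolist())
print(new_initiatives)
```

Matching on all three columns in the second step avoids the cross product of initiatives that a merge on Country and City alone produces.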
Alternative approach (df1 your Table1, df2 your Table2):
combos_1, combos_2 = set(zip(df1.Country, df1.City)), set(zip(df2.Country, df2.City))
in_2_but_not_in_1 = [f"{country}-{city}" for country, city in combos_2 - combos_1]
initiatives = {
f"{country}-{city}": (
set(df2.Initiative[df2.Country.eq(country) & df2.City.eq(city)])
- set(df1.Initiative[df1.Country.eq(country) & df1.City.eq(city)])
)
for country, city in combos_1 & combos_2
}
Results:
['India-Delhi']
{'India-Bangalore': {'Irrigation', 'Textile'}, 'USA-Texas': {'Irrigation'}}
I think you got this part wrong: "The country-city combinations that are present only in Table2 but not Table1. Here [India-Mumbai] is the output". Isn't the combination India-Mumbai absent from Table2?

Constructing a dataframe with multiple columns based on str conditions using a loop - python

I have a webscraped Twitter DataFrame that includes user location. The location variable looks like this:
2 Crockett, Houston County, Texas, 75835, USA
3 NYC, New York, USA
4 Warszawa, mazowieckie, RP
5 Texas, USA
6 Virginia Beach, Virginia, 23451, USA
7 Louisville, Jefferson County, Kentucky, USA
I would like to construct state dummies for all USA states by using a loop.
I have managed to extract users from the USA using
location_usa = location_df['location'].str.contains('usa', case = False)
However, the code would be too bulky if I wrote this for every single state. I have a list of the states as strings.
Also, I am unable to use
pd.Series.str.get_dummies()
as there are different locations within the same state and each entry is a whole sentence.
I would like the output to look something like this:
Alabama Alaska Arizona
1 0 0 1
2 0 1 0
3 1 0 0
4 0 0 0
Or the same with Boolean values.
Use .str.extract to get a Series of the states, and then use pd.get_dummies on that Series. You will need to define a list of all 50 states:
import pandas as pd
states = ['Texas', 'New York', 'Kentucky', 'Virginia']
pd.get_dummies(df.col1.str.extract('(' + '|'.join(x+',' for x in states)+ ')')[0].str.strip(','))
Kentucky New York Texas Virginia
0 0 0 1 0
1 0 1 0 0
2 0 0 0 0
3 0 0 1 0
4 0 0 0 1
5 1 0 0 0
Note I matched on States followed by a ',' as that seems to be the pattern and allows you to avoid false matches like 'Virginia' with 'Virginia Beach', or more problematic things like 'Washington County, Minnesota'
If you expect multiple states to match on a single line, use .extractall instead and aggregate over the 0th index level:
pd.get_dummies(df.col1.str.extractall('(' + '|'.join(x+',' for x in states)+ ')')[0].str.strip(',')).groupby(level=0).sum().clip(upper=1)
Edit:
Perhaps there are better ways, but the following can be a bit safer, as suggested by @BradSolomon: it only allows matches on 'State,( optional 5 digit Zip,) USA'
states = ['Texas', 'New York', 'Kentucky', 'Virginia', 'California', 'Pennsylvania']
pat = '(' + '|'.join(x + r',?(\s\d{5},)?\sUSA' for x in states)+ ')'
s = df.col1.str.extract(pat)[0].str.split(',').str[0]
Output: s
0 Texas
1 New York
2 NaN
3 Texas
4 Virginia
5 Kentucky
6 Pennsylvania
Name: 0, dtype: object
from Input
col1
0 Crockett, Houston County, Texas, 75835, USA
1 NYC, New York, USA
2 Warszawa, mazowieckie, RP
3 Texas, USA
4 Virginia Beach, Virginia, 23451, USA
5 Louisville, Jefferson County, Kentucky, USA
6 California, Pennsylvania, USA
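A different way to avoid false matches such as 'Virginia Beach' is to split each location on ', ' and compare whole tokens against the state list. This is a rough sketch rather than the approach from the answer above; the Series name locations and the short states list are placeholders, and a row like 'California, Pennsylvania, USA' would be flagged for both states.

```python
import pandas as pd

locations = pd.Series([
    'Crockett, Houston County, Texas, 75835, USA',
    'NYC, New York, USA',
    'Warszawa, mazowieckie, RP',
    'Texas, USA',
    'Virginia Beach, Virginia, 23451, USA',
    'Louisville, Jefferson County, Kentucky, USA',
])
states = ['Texas', 'New York', 'Kentucky', 'Virginia']  # extend to all 50 states

# One row per comma-separated token, keeping the original row index
tokens = locations.str.split(', ').explode()

# Keep only tokens that are exactly a state name
state_tokens = tokens[tokens.isin(states)]

# Build 0/1 dummies, collapse back to one row per location, and
# restore rows and states that had no match at all
dummies = (pd.get_dummies(state_tokens, dtype=int)
             .groupby(level=0).max()
             .reindex(index=locations.index, columns=states, fill_value=0))
print(dummies)
```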

Pandas .groupby automatically selecting column

From the following dataset:
I'm trying to use .groupby to create a set where I get the average Status Count per User Location. I've already done this for Follower Count by using
groupLoc = df.groupby('User Location')
groupCount = groupLoc.mean()
groupCount
Which automatically selected User Location vs Follower Count. Now I'm trying to do the same for User Location vs Status Count, but it's automatically including Follower Count again.
Anyone know how to fix this? Thanks in advance!
I think you need to select the column after groupby and then take the mean:
print(df.groupby('User Location', as_index=False)['Follower Count'].mean())
User Location Follower Count
0 Canada 1654.500000
1 Chicago 9021.000000
2 Indonesia 1352.666667
3 London 990.000000
4 Los Angeles CA 86.000000
5 New York 214.000000
6 Singapore 106.500000
7 Texas 181.000000
8 UK 2431.000000
9 indonesia 316.000000
10 null 295.750000
print(df.groupby('User Location', as_index=False)['Status Count'].mean())
User Location Status Count
0 Canada 39299.000000
1 Chicago 6402.000000
2 Indonesia 12826.000000
3 London 4864.666667
4 Los Angeles CA 3230.000000
5 New York 2947.000000
6 Singapore 6785.500000
7 Texas 901.000000
8 UK 81440.000000
9 indonesia 17662.000000
10 null 29610.875000
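The column selection can be sketched on a small made-up frame, since the original dataset is not shown. All values and the extra User Name column below are hypothetical; the point is that indexing the groupby object with a column name (or a list of names) controls which columns are averaged.

```python
import pandas as pd

# Hypothetical data standing in for the dataset in the question
df = pd.DataFrame({
    'User Location': ['Canada', 'Canada', 'London', 'Texas'],
    'Follower Count': [100, 300, 990, 181],
    'Status Count': [10, 30, 4864, 901],
    'User Name': ['a', 'b', 'c', 'd'],  # non-numeric, left out by the selections below
})

# A single column name gives a Series of per-location means
status_only = df.groupby('User Location')['Status Count'].mean()

# A list of column names gives a DataFrame with one mean per selected column
means = df.groupby('User Location')[['Follower Count', 'Status Count']].mean()

print(status_only)
print(means)
```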
