I am analyzing a dataset containing NFL game results from the past 20 years and am trying to create a column denoting, for each team, whether the game was a home game (1) or an away game (0).
The code I have so far is:
import numpy as np

home_list = list(df.home_team.unique())

def home_or_away(team_name, dataf):
    dataf['home_or_away'] = np.where(dataf['home_team'] == team_name, 1, 0)
    return dataf

for i in home_list:
    home_update_all = home_or_away(i, df)
    df.update(home_update_all)
This doesn't yield the correct results: each iteration overwrites the column written for the previous team. Any ideas on how to solve this?
Thanks!
I'm not really sure what your expected output is. Do you mean you want one column per team? You currently keep creating a column with the same name, so only the one from the last iteration is kept and the rest are overwritten. Or do you want multiple DataFrames?
If you want multiple columns, one per team:
import pandas as pd
df = pd.DataFrame({'game': [1, 2, 3, 4], 'home_team': ['a', 'b', 'c', 'a']})
> game home_team
0 1 a
1 2 b
2 3 c
3 4 a
First collect unique teams as you did:
home_list = list(df.home_team.unique())
Create a column for each team:
for team in home_list:
    df[f'home_or_away_{team}'] = [int(ht == team) for ht in df['home_team']]
Which results in:
> game home_team home_or_away_a home_or_away_b home_or_away_c
0 1 a 1 0 0
1 2 b 0 1 0
2 3 c 0 0 1
3 4 a 1 0 0
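If you instead want one DataFrame per team (the other reading of your question), a minimal sketch is to build a dict keyed by team name; frames is just an illustrative variable name:
# One DataFrame per team, each with its own 0/1 home indicator
# (sketch; reuses df and home_list from above).
frames = {
    team: df.assign(home_or_away=(df['home_team'] == team).astype(int))
    for team in home_list
}
print(frames['a'])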
You're overcomplicating it. There is no need to iterate or to wrap np.where() in a separate function; just apply np.where() to the two columns directly.
It basically says "where home_team equals team_name, put a 1, else put a 0":
import pandas as pd
import numpy as np
df = pd.DataFrame([['Chicago Bears', 'Chicago Bears', 'Green Bay Packers'],
                   ['Chicago Bears', 'Green Bay Packers', 'Chicago Bears'],
                   ['Detroit Lions', 'Detroit Lions', 'Los Angeles Chargers'],
                   ['New England Patriots', 'New York Jets', 'New England Patriots'],
                   ['Houston Texans', 'Los Angeles Rams', 'Houston Texans']],
                  columns=['team_name', 'home_team', 'away_team'])
df['home_or_away'] = np.where(df['home_team'] == df['team_name'], 1, 0)
Output:
print(df)
              team_name          home_team             away_team  home_or_away
0         Chicago Bears      Chicago Bears     Green Bay Packers             1
1         Chicago Bears  Green Bay Packers         Chicago Bears             0
2         Detroit Lions      Detroit Lions  Los Angeles Chargers             1
3  New England Patriots      New York Jets  New England Patriots             0
4        Houston Texans   Los Angeles Rams        Houston Texans             0
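As a small side note, the same column can be built without np.where by casting the boolean comparison to int (a sketch reusing the df built above):
# Same result without numpy: compare the two columns and cast the booleans
# to int (sketch; reuses the df built above).
df['home_or_away'] = (df['home_team'] == df['team_name']).astype(int)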
I have a working SQL query that retrieves all the scores of a hockey team. I would like to limit it to the last 3 scores (LIMIT 3 or <= 3):
my = cursor_test.execute('''SELECT Next.ClubHome,
CASE
WHEN Next.ClubHome = Result.ClubHome then Result.ScoreHome
WHEN Next.ClubHome = Result.ClubAway then Result.ScoreAway
END as score
FROM NextMatch Next
LEFT JOIN ResultMatch Result ON Next.ClubHome in (Result.ClubHome, Result.ClubAway)
''')
for row in my.fetchall():
    print(row)
Let me explain the question better:
Look at the next Chicago, New York and Dallas hockey matches in the NextMatch table: they appear in the ClubHome column.
NEXTMATCH

ClubHome   ClubAway     Tournament
Chicago    Minnesota    NHL
New York   Los Angeles  NHL
Dallas     Vegas Gold   NHL
In the ResultMatch table, I would like to retrieve the last 3 overall scores of Chicago, New York and Dallas (ScoreHome or ScoreAway). So I would like this output:
Chicago: 2
Chicago: 0
Chicago: 1
New York: 2
New York: 3
New York: 2
Dallas: 4
Dallas: 3
Dallas: 1
RESULTMATCH

ClubHome   ClubAway     Tournament  Round  ScoreHome  ScoreAway
Toronto    CHICAGO      NHL         8      1          2
New York   Vegas        NHL         8      2          3
CHICAGO    Dallas       NHL         7      0          4
Ottawa     New York     NHL         7      3          3
CHICAGO    Buffalo Sab  NHL         6      1          0
Vegas      CHICAGO      NHL         6      4          2
New York   Dallas       NHL         5      2          3
Dallas     Buffalo Sab  NHL         5      1          2
The following query may be useful for the solution. However, it only retrieves the last 3 ScoreHome results (not the ScoreAway):
x = cursor2.execute('''SELECT ClubHome, ScoreHome
FROM (SELECT NextMatch.ClubHome, NextMatch.ClubAway, ResultMatch.ScoreHome,
ROW_NUMBER() OVER (PARTITION BY NextMatch.ClubHome ORDER BY ResultMatch.ScoreHome DESC) AS rn
FROM NextMatch
INNER JOIN ResultMatch ON NextMatch.ClubHome = ResultMatch.ClubHome) t
WHERE rn <= 3
ORDER BY ClubHome ASC''')
How can I modify my first query and add LIMIT 3 or <= 3 to get the output shown in the example above? Thank you.
If you want to do it in SQL only, without filtering the results in Python, you could use the window function ROW_NUMBER:
SELECT clubHome, score FROM (
    SELECT Next.ClubHome,
           CASE
               WHEN Next.ClubHome = Result.ClubHome THEN Result.ScoreHome
               WHEN Next.ClubHome = Result.ClubAway THEN Result.ScoreAway
           END AS score,
           ROW_NUMBER() OVER (PARTITION BY Next.ClubHome ORDER BY Round DESC) AS rowNum
    FROM NextMatch Next
    JOIN ResultMatch Result ON Next.ClubHome IN (Result.ClubHome, Result.ClubAway)
) WHERE rowNum <= 3;
SQLFiddle: https://www.db-fiddle.com/f/xrLpLwSu783AQHrwD8Fq4t/0
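If you are running this from Python with sqlite3, as in the question's snippets, a minimal usage sketch might look like the following; the database file name hockey.db is just an assumption, and the ROW_NUMBER window function needs SQLite 3.25 or newer:
import sqlite3

# Sketch only: assumes an existing SQLite database (3.25+ for window
# functions) containing the NextMatch and ResultMatch tables shown above.
conn = sqlite3.connect('hockey.db')

rows = conn.execute('''
    SELECT clubHome, score FROM (
        SELECT Next.ClubHome AS clubHome,
               CASE
                   WHEN Next.ClubHome = Result.ClubHome THEN Result.ScoreHome
                   WHEN Next.ClubHome = Result.ClubAway THEN Result.ScoreAway
               END AS score,
               ROW_NUMBER() OVER (PARTITION BY Next.ClubHome
                                  ORDER BY Result.Round DESC) AS rowNum
        FROM NextMatch Next
        JOIN ResultMatch Result
          ON Next.ClubHome IN (Result.ClubHome, Result.ClubAway)
    )
    WHERE rowNum <= 3
    ORDER BY clubHome''').fetchall()

for club, score in rows:
    print(f'{club}: {score}')   # e.g. "Chicago: 2"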
I've been searching around for a while now, but I can't seem to find the answer to this small problem.
I have this code to replace values:
import pandas as pd

df = {'Name': ['al', 'el', 'naila', 'dori', 'jlo'],
      'living': ['Alvando', 'Georgia GG', 'Newyork NY', 'Indiana IN', 'Florida FL'],
      'sample2': ['malang', 'kaltim', 'ambon', 'jepara', 'sragen'],
      'output': ['KOTA', 'KAB', 'WILAYAH', 'KAB', 'DAERAH']
      }
df = pd.DataFrame(df)

df = df.replace(['KOTA', 'WILAYAH', 'DAERAH'], 0)
df = df.replace('KAB', 1)
I'm expecting the same output, but from simpler code that doesn't repeat the replace call:
Name living sample2 output
0 al Alvando malang 0
1 el Georgia GG kaltim 1
2 naila Newyork NY ambon 0
3 dori Indiana IN jepara 1
4 jlo Florida FL sragen 0
I've tried using np.where, but it doesn't give the desired result: every row shows 0, even the rows whose original value should map to 1.
df['output'] = pd.DataFrame({'output':np.where(df == "KAB", 1, 0).reshape(-1, )})
This code should work for you:
df = df.replace(['KOTA', 'WILAYAH', 'DAERAH'], 0).replace('KAB', 1)
Output:
>>> df
Name living sample2 output
0 al Alvando malang 0
1 el Georgia GG kaltim 1
2 naila Newyork NY ambon 0
3 dori Indiana IN jepara 1
4 jlo Florida FL sragen 0
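As a side note on the np.where attempt from the question: comparing only the output column, rather than the whole DataFrame, keeps the 0/1 result aligned row by row. A minimal sketch, applied to the original df from the question before any replace:
import numpy as np

# Compare the single column so the result lines up row by row with 'output'
# (sketch; df as first constructed in the question).
df['output'] = np.where(df['output'] == 'KAB', 1, 0)
A single replace call with a dict also avoids the repetition: df = df.replace({'KOTA': 0, 'WILAYAH': 0, 'DAERAH': 0, 'KAB': 1}).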
I'm using a data frame like this:
home_team away_team home_score away_score
Scotland England 0 0
England Scotland 4 2
Scotland England 2 1
England Scotland 2 2
Scotland England 3 0
Here's what I would like to accomplish: group each country together, whether it played home or away, and get the total of its scores.
Team      total goal
Scotland  9
England   7
Try this and let me know if you run into any issue:
df.groupby("home_team").home_score.sum() + df.groupby("away_team").away_score.sum()
This should do the trick (assuming your original DataFrame is called df):
nations = ["England", "Scotland"]
tot = pd.DataFrame([(nation, 0) for nation in nations], columns=["team","total goal"])
for nation in nations:
    home_goal = sum(df[df["home_team"] == nation]["home_score"])
    away_goal = sum(df[df["away_team"] == nation]["away_score"])
    tot.loc[tot.team == nation, "total goal"] = home_goal + away_goal
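A loop-free variant of the one-liner in the first answer, as a small sketch: sum the home and away groupbys and add them with fill_value=0, which guards against a team that only ever appears on one side (with this sample data the result is the same, Scotland 9 and England 7):
home = df.groupby('home_team')['home_score'].sum()
away = df.groupby('away_team')['away_score'].sum()

# Add the two Series by team name; fill_value=0 keeps a team that appears
# only as home or only as away from becoming NaN.
total = home.add(away, fill_value=0).rename('total goal')
print(total)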
I have a webscraped Twitter DataFrame that includes user location. The location variable looks like this:
2 Crockett, Houston County, Texas, 75835, USA
3 NYC, New York, USA
4 Warszawa, mazowieckie, RP
5 Texas, USA
6 Virginia Beach, Virginia, 23451, USA
7 Louisville, Jefferson County, Kentucky, USA
I would like to construct state dummies for all USA states by using a loop.
I have managed to extract users from the USA using
location_usa = location_df['location'].str.contains('usa', case = False)
However, the code would be too bulky if I wrote this for every single state. I have a list of the states as strings.
Also, I am unable to use
pd.Series.str.get_dummies()
as there are different locations within the same state and each entry is a whole sentence.
I would like the output to look something like this:
   Alabama  Alaska  Arizona
1        0       0        1
2        0       1        0
3        1       0        0
4        0       0        0
Or the same with Boolean values.
Use .str.extract to get a Series of the states, and then use pd.get_dummies on that Series. You will need to define a list of all 50 states; here I use a short subset:
import pandas as pd
states = ['Texas', 'New York', 'Kentucky', 'Virginia']
pd.get_dummies(df.col1.str.extract('(' + '|'.join(x+',' for x in states)+ ')')[0].str.strip(','))
   Kentucky  New York  Texas  Virginia
0         0         0      1         0
1         0         1      0         0
2         0         0      0         0
3         0         0      1         0
4         0         0      0         1
5         1         0      0         0
Note that I matched on states followed by a ',', as that seems to be the pattern; it avoids false matches like 'Virginia' with 'Virginia Beach', or more problematic things like 'Washington County, Minnesota'.
If you expect multiple states to match on a single line, then this becomes .extractall, summing across the 0th level:
pd.get_dummies(df.col1.str.extractall('(' + '|'.join(x + ',' for x in states) + ')')[0].str.strip(',')).groupby(level=0).sum().clip(upper=1)
Edit:
Perhaps there are better ways, but this can be a bit safer, as suggested by @BradSolomon, allowing matches on 'State,( optional 5-digit ZIP,) USA':
states = ['Texas', 'New York', 'Kentucky', 'Virginia', 'California', 'Pennsylvania']
pat = '(' + '|'.join(x + r',?(\s\d{5},)?\sUSA' for x in states) + ')'
s = df.col1.str.extract(pat)[0].str.split(',').str[0]
Output of s:
0 Texas
1 New York
2 NaN
3 Texas
4 Virginia
5 Kentucky
6 Pennsylvania
Name: 0, dtype: object
from the input:
col1
0 Crockett, Houston County, Texas, 75835, USA
1 NYC, New York, USA
2 Warszawa, mazowieckie, RP
3 Texas, USA
4 Virginia Beach, Virginia, 23451, USA
5 Louisville, Jefferson County, Kentucky, USA
6 California, Pennsylvania, USA
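To finish that edited approach, s can then be fed to pd.get_dummies just as before; a small sketch (rows with no matched state, such as the Warszawa row, simply stay all zero):
# Dummy columns from the extracted state Series; NaN rows become all zeros
# (sketch building on s from above; the column set is only the states that
# actually matched).
state_dummies = pd.get_dummies(s)
print(state_dummies.astype(int))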
I have 2 dataframes. One has the city, date and sales:
import pandas as pd

sales = [['20101113', 'Miami', 35], ['20101114', 'New York', 70],
         ['20101114', 'Los Angeles', 4], ['20101115', 'Chicago', 36],
         ['20101114', 'Miami', 12]]
df2 = pd.DataFrame(sales, columns=['Date', 'City', 'Sales'])
print (df2)
Date City Sales
0 20101113 Miami 35
1 20101114 New York 70
2 20101114 Los Angeles 4
3 20101115 Chicago 36
4 20101114 Miami 12
The second has some dates and cities.
date = [['20101114','New York'],['20101114','Los Angeles'],['20101114','Chicago']]
df = pd.DataFrame(date,columns=['Date','City'])
print (df)
I want to extract the sales from the first dataframe that match the city and date in the second dataframe, and add those sales to the second dataframe. If the date is missing in the first table, then the next highest date's sales should be retrieved.
The new dataframe should look like this:
Date City Sales
0 20101114 New York 70
1 20101114 Los Angeles 4
2 20101114 Chicago 36
I am having trouble extracting and merging tables. Any suggestions?
This is pd.merge_asof, which lets you join on an exact match for some columns (here, City) combined with a "close" match on another (here, Date). With direction='forward', a row without an exact date match picks up the nearest later date, which is how Chicago on 20101114 gets the sales from 20101115.
import pandas as pd
df['Date'] = pd.to_datetime(df.Date)
df2['Date'] = pd.to_datetime(df2.Date)
pd.merge_asof(df.sort_values('Date'),
              df2.sort_values('Date'),
              by='City', on='Date',
              direction='forward')
Output:
Date City Sales
0 2010-11-14 New York 70
1 2010-11-14 Los Angeles 4
2 2010-11-14 Chicago 36