Python aggregation function

I need to modify my code:
db_profit_platform=db[['Source','Device','Country','Profit']]
db_profit_final=db_profit_platform.groupby(['Source','Device','Country'])['Profit'].apply(sum).reset_index()
Now I need to add Bid and get the average bid after the group by (different aggregations for different columns):
to get: Source Device Country SumProfit Average Bid
How can I do it? (and maybe I will need more aggregations) Thanks

You can use the agg function; here is a minimal working example:
import numpy as np
import pandas as pd
size = 10
db = pd.DataFrame({
    'Source': np.random.randint(1, 3, size=size),
    'Device': np.random.randint(1, 3, size=size),
    'Country': np.random.randint(1, 3, size=size),
    'Profit': np.random.randn(size),
    'Bid': np.random.randn(size)
})
db.groupby(["Source", "Device", "Country"]).agg(
    sum_profit=("Profit", "sum"),
    avg_bid=("Bid", "mean")
)
See the official documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html
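If you later need more aggregations, named aggregation scales by just adding keyword arguments. Here is a small sketch with fixed (made-up) numbers so the result is reproducible; the extra columns `max_bid` and `n_rows` are examples, not something from the question:

```python
import pandas as pd

# Small fixed dataset so the result is reproducible
db = pd.DataFrame({
    'Source': [1, 1, 2, 2],
    'Device': [1, 1, 1, 2],
    'Country': [1, 1, 1, 1],
    'Profit': [10.0, 20.0, 5.0, 7.0],
    'Bid': [1.0, 3.0, 2.0, 4.0],
})

# Named aggregation: one output column per (input column, function) pair,
# so adding more aggregations is just adding more keyword arguments
out = db.groupby(['Source', 'Device', 'Country']).agg(
    sum_profit=('Profit', 'sum'),
    avg_bid=('Bid', 'mean'),
    max_bid=('Bid', 'max'),
    n_rows=('Profit', 'size'),
).reset_index()

print(out)
```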

Related

Python Trouble for ESPN FF

First time posting here, but having trouble with some code that I'm using to pull fantasy football data from ESPN. I pulled this from Steven Morse's blog (https://stmorse.github.io/journal/espn-fantasy-v3.html) and it appears to work EXCEPT for one error that I'm getting. The error is:
File "<ipython-input-65-56a5896c1c3c>", line 3, in <listcomp>
game['away']['teamId'], game['away']['totalPoints'],
KeyError: 'away'
I've looked in the dictionary and found that 'away' is in there. What I can't figure out is why 'home' works but not 'away'. Here is the code I'm using. Any help is appreciated:
import requests
import pandas as pd
url = 'https://fantasy.espn.com/apis/v3/games/ffl/seasons/2020/segments/0/leagues/721579?view=mMatchupScore'
r = requests.get(url,
cookies={"swid": "{1E653FDE-DA4A-4CC6-A53F-DEDA4A6CC663}",
"espn_s2": "AECpfE9Zsvwwsl7N%2BRt%2BAPhSAKmSs%2F2ZmQVuHJeKG8LGgLBDfRl0j88CvzRFsrRjLmjzASAdIUA9CyKpQJYBfn6avgXoPHJgDiCqfDPspruYqHNENjoeGuGfVqtPewVJGv3rBJPFMp1ugWiqlEzKiT9IXTFAIx3V%2Fp2GBuYjid2N%2FFcSUlRlr9idIL66tz2UevuH4F%2FP6ytdM7ABRCTEnrGXoqvbBPCVbtt6%2Fu69uBs6ut08ApLRQc4mffSYCONOqW1BKbAMPPMbwgCn1d5Ruubl"})
d = r.json()
df = [[
    game['matchupPeriodId'],
    game['away']['teamId'], game['away']['totalPoints'],
    game['home']['teamId'], game['home']['totalPoints']
] for game in d['schedule']]
df = pd.DataFrame(df, columns=['Week', 'Team1', 'Score1', 'Team2', 'Score2'])
df['Type'] = ['Regular' if w<=14 else 'Playoff' for w in df['Week']]
Seems like some of the games in the schedule don't have an away team:
{'home': {'adjustment': 0.0,
'cumulativeScore': {'losses': 0, 'statBySlot': None, 'ties': 0, 'wins': 0},
'pointsByScoringPeriod': {'14': 102.7},
'teamId': 1,
'tiebreak': 0.0,
'totalPoints': 102.7},
'id': 78,
'matchupPeriodId': 14,
'playoffTierType': 'WINNERS_BRACKET',
'winner': 'UNDECIDED'}
For nested json data like this, it's often easier to use pandas.json_normalize, which flattens the data structure and gives you a data frame with lots of columns with names like home.cumulativeScore.losses etc.
df = pd.json_normalize(r.json()['schedule'])
Then you can reshape the dataframe by dropping columns you don't care about and so on.
df = pd.json_normalize(r.json()['schedule'])
column_names = {
    'matchupPeriodId': 'Week',
    'away.teamId': 'Team1',
    'away.totalPoints': 'Score1',
    'home.teamId': 'Team2',
    'home.totalPoints': 'Score2',
}
df = df.reindex(columns=column_names).rename(columns=column_names)
df['Type'] = ['Regular' if w<=14 else 'Playoff' for w in df['Week']]
For the games where there's no away team, pandas will populate those columns with NaN values.
df[df.Team1.isna()]
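To see the whole flow without the live API, here is a self-contained sketch on a hypothetical miniature payload that mimics the shape of the ESPN schedule data (the team ids and scores are made up); it shows how the bye games surface as NaN and how to split them off:

```python
import pandas as pd

# Hypothetical miniature payload mimicking the shape of the ESPN 'schedule'
# list; the second matchup is a playoff bye with no 'away' entry at all.
schedule = [
    {'matchupPeriodId': 1,
     'away': {'teamId': 3, 'totalPoints': 98.5},
     'home': {'teamId': 1, 'totalPoints': 102.7}},
    {'matchupPeriodId': 14,
     'home': {'teamId': 1, 'totalPoints': 102.7}},
]

df = pd.json_normalize(schedule)
column_names = {
    'matchupPeriodId': 'Week',
    'away.teamId': 'Team1',
    'away.totalPoints': 'Score1',
    'home.teamId': 'Team2',
    'home.totalPoints': 'Score2',
}
df = df.reindex(columns=list(column_names)).rename(columns=column_names)

byes = df[df.Team1.isna()]            # matchups with no away team
games = df.dropna(subset=['Team1'])   # keep only real matchups
```

Whether you keep the NaN rows or drop them depends on whether you want bye weeks in your final table.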

Python ValueError from np.where create flag based on one condition

If the city has been mentioned in cities_specific I would like to create a flag in the cities_all data. It's just a minimal example and in reality I would like to create multiple of these flags based on multiple data frames. That's why I tried to solve it with isin instead of a join.
However, I am running into ValueError: Length of values (3) does not match length of index (7).
# import packages
import pandas as pd
import numpy as np
# create minimal data
cities_specific = pd.DataFrame({'city': ['Melbourne', 'Cairns', 'Sydney'],
                                'n': [10, 4, 8]})
cities_all = pd.DataFrame({'city': ['Vancouver', 'Melbourne', 'Athen', 'Vienna', 'Cairns',
                                    'Berlin', 'Sydney'],
                           'inhabitants': [675218, 5000000, 664046, 1897000, 150041, 3769000, 5312000]})
# get value error
# how can this be solved differently?
cities_all.assign(in_cities_specific=np.where(cities_specific.city.isin(cities_all.city), '1', '0'))
# that's the solution I would like to get
expected_solution = pd.DataFrame({'city': ['Vancouver', 'Melbourne', 'Athen', 'Vienna', 'Cairns',
                                           'Berlin', 'Sydney'],
                                  'inhabitants': [675218, 5000000, 664046, 1897000, 150041, 3769000, 5312000],
                                  'in_cities': [0, 1, 0, 0, 1, 0, 1]})
I think you have swapped the operands in the condition.
Here you have some alternatives:
cities_all.assign(
in_cities_specific=np.where(cities_all.city.isin(cities_specific.city), '1', '0')
)
or
cities_all["in_cities_specific"] = cities_all["city"].isin(cities_specific["city"]).astype(int).astype(str)
or
condlist = [cities_all["city"].isin(cities_specific["city"])]
choicelist = ["1"]
cities_all["in_cities_specific"] = np.select(condlist, choicelist, default="0")
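Since the question mentions wanting several such flags from several data frames, one way to avoid repeating yourself is a small loop over (flag name, reference frame) pairs. A sketch, where `cities_european` is a made-up second reference frame purely for illustration:

```python
import pandas as pd

cities_all = pd.DataFrame({'city': ['Vancouver', 'Melbourne', 'Athen', 'Vienna', 'Cairns',
                                    'Berlin', 'Sydney'],
                           'inhabitants': [675218, 5000000, 664046, 1897000, 150041, 3769000, 5312000]})

# Reference frames; cities_european is invented here purely for illustration
cities_specific = pd.DataFrame({'city': ['Melbourne', 'Cairns', 'Sydney']})
cities_european = pd.DataFrame({'city': ['Athen', 'Vienna', 'Berlin']})

# One flag column per reference frame, built the same way each time
for flag, ref in [('in_cities_specific', cities_specific),
                  ('in_cities_european', cities_european)]:
    cities_all[flag] = cities_all['city'].isin(ref['city']).astype(int)
```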

Making a barchart in pandas with filtered data

I have a csv file that has a bunch of different columns. The columns that I am interested in are 'Item', 'OrderDate' and 'Units'.
In my IDE I am trying to generate a bar chart of the number of 'Pencil's sold on each individual 'OrderDate'. What I am trying to do is look down through the 'Item' column using pandas, check whether the item is a pencil, and add it to the graph if it is; otherwise do nothing.
I think I have made it a bit long-winded with the code.
I have the code going down through the 'Item' column and checking whether each entry is a pencil, but I can't figure out what to do next.
import pandas as pd
import matplotlib.pyplot as plt
d = {'item': pd.Series(['Pencil', 'Marker', 'Pencil', 'Headphones', 'Pencil', 'The moon', 'Wish you were here album']),
     'OrderDate': pd.Series(['5/15/2020', '5/16/2020', '5/16/2020', '5/15/2020', '5/16/2020', '5/17/2020', '5/16/2020']),
     'Units': pd.Series([4, 3, 2, 1, 3, 2, 4])}
df = pd.DataFrame.from_dict(d)
df.plot(kind='bar', x='OrderDate', y='Units')
item_col = df['item']
pencil_binary = item_col.str.count('Pencil')
for entry in item_col:
if entry == 'Pencil':
print("i am a pencil")
else:
print("i am not a pencil")
print(df)
plt.plot()
plt.show()
If I understood correctly you want to plot the number of pencils sold per day. For that, you can just filter the dataframe and keep only rows about pencils, and then use a barchart.
Here's a reproducible example:
import pandas as pd
import matplotlib.pyplot as plt
d = {'item': pd.Series(['Pencil', 'Marker', 'Pencil', 'Headphones', 'Pencil', 'The moon', 'Wish you were here album']),
     'OrderDate': pd.Series(['5/15/2020', '5/16/2020', '5/16/2020', '5/15/2020', '5/16/2020', '5/17/2020', '5/16/2020']),
     'Units': pd.Series([4, 3, 2, 1, 3, 2, 4])}
df = pd.DataFrame.from_dict(d)
# This dataframe only has pencils
df_pencils = df[df.item == 'Pencil']
df_pencils.groupby('OrderDate')['Units'].sum().plot(kind='bar')
The groupby gathers all rows with the same date into a group and, for each group, adds up the Units sold.
In fact, when you do this:
df_pencils.groupby('OrderDate')['Units'].sum()
this is the output:
OrderDate
5/15/2020 4
5/16/2020 5
Name: Units, dtype: int64
If you want a one liner, it's:
df[df.item == 'Pencil'].groupby('OrderDate')['Units'].sum().plot(kind='bar')

Function not returning the correct amount of observations

I am trying to create a function to show the n number of movies most rated by a user in a given dataframe. I have been able to extract the movies the user provided rating for but I cannot return the correct amount of rows - instead it prints all the movies with rating from the user.
I have tried this way as shown in the code with .head(n_rows) but it does not work:
def top_movies(data_, usr, n_rows=10):
    user = data_[data_['user_id'] == usr]
    movies = data_.loc[user.index].groupby('title')['title', 'rating']
    final = movies.head(n_rows).sort_values(by='rating', ascending=False)
    return final
def ex9():
    return top_movies(data, 1, 30)
ex9()
I expect to print the first 30 rows for example here.
I'm not sure what exactly you want to achieve, but check this:
import pandas as pd
df = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'title': ['t1', 't2', 't3', 't1', 't5'],
    'rating': [25, 25, 35, 25, 30],
})
df.sort_values(by='rating', ascending=False).groupby('user_id')[['user_id', 'title', 'rating']].nth(list(range(30)))
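Applied back to the original function, a corrected sketch could look like this. The column names are taken from the question, but without the real data this is only a guess at the intent (top n rows for one user, sorted by rating):

```python
import pandas as pd

def top_movies(data_, usr, n_rows=10):
    # Keep only this user's rows, sort by rating, and return the top n
    user = data_[data_['user_id'] == usr]
    return (user[['title', 'rating']]
            .sort_values(by='rating', ascending=False)
            .head(n_rows))

# Toy data with the assumed columns
data = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'title': ['t1', 't2', 't3', 't1', 't5'],
    'rating': [25, 25, 35, 25, 30],
})
top = top_movies(data, 1, 2)
```

The key difference from the question's version is that `head(n_rows)` is applied once, after sorting, rather than per group.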

Operations within DataFrameGroupBy

I am trying to understand how to apply a function within 'groupby', i.e. to each of the groups of a dataframe.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Stock': ['apple', 'ford', 'google', 'samsung', 'walmart', 'kroger'],
                   'Sector': ['tech', 'auto', 'tech', 'tech', 'retail', 'retail'],
                   'Price': np.random.randn(6),
                   'Signal': np.random.randn(6)}, columns=['Stock', 'Sector', 'Price', 'Signal'])
dfg = df.groupby(['Sector'], as_index=False)
type(dfg)
pandas.core.groupby.DataFrameGroupBy
I want to get the sum( Price * (1/Signal) ) grouped by 'Sector'.
i.e. The resulting output should look like
Sector | Value
auto | 0.744944
retail |-0.572164053
tech | -1.454632
I can get the results by creating separate data frames, but was looking for a way to operate within each of the grouped (sector) frames.
I can find mean or sum of Price
dfg.agg({'Price' : [np.mean, np.sum] }).head(2)
but not get sum ( Price * (1/Signal) ), which is what I need.
Thanks,
You provided random data, so there is no way we can get the exact number that you got. But based on what you just described, I think the following will do:
In [121]:
(df.Price/df.Signal).groupby(df.Sector).sum()
Out[121]:
Sector
auto -1.693373
retail -5.137694
tech -0.984826
dtype: float64
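With fixed numbers in place of the random ones, the same expression can be checked end to end (the prices and signals below are made up for the sake of a verifiable example):

```python
import pandas as pd

df = pd.DataFrame({'Stock': ['apple', 'ford', 'google', 'samsung', 'walmart', 'kroger'],
                   'Sector': ['tech', 'auto', 'tech', 'tech', 'retail', 'retail'],
                   'Price': [2.0, 3.0, 4.0, 6.0, 1.0, 5.0],
                   'Signal': [1.0, 2.0, 2.0, 3.0, 0.5, 1.0]})

# sum(Price * (1 / Signal)) within each sector: the element-wise ratio is
# computed first, then grouped by the Sector column and summed
result = (df.Price / df.Signal).groupby(df.Sector).sum()
```

Note that you can group any Series by another Series of the same length; the groupby does not have to start from the whole DataFrame.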
