I need to modify my code:
db_profit_platform=db[['Source','Device','Country','Profit']]
db_profit_final=db_profit_platform.groupby(['Source','Device','Country'])['Profit'].apply(sum).reset_index()
Now I need to add the Bid column and compute its average within the same group by (different aggregations for different columns), to get:
Source | Device | Country | SumProfit | AvgBid
How can I do it? (And maybe I will need more aggregations later.) Thanks
You can use the agg function with named aggregation; here is a minimal working example:
import numpy as np
import pandas as pd
size = 10
db = pd.DataFrame({
'Source': np.random.randint(1, 3, size=size),
'Device': np.random.randint(1, 3, size=size),
'Country': np.random.randint(1, 3, size=size),
'Profit': np.random.randn(size),
'Bid': np.random.randn(size)
})
db.groupby(["Source", "Device", "Country"]).agg(
sum_profit=("Profit", "sum"),
avg_bid=("Bid", "mean")
)
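Since you mention you may need more aggregations later: with named aggregation you can simply add more (column, function) pairs in the same call. A sketch (the extra aggregations here are just illustrative, not from your question):
db.groupby(["Source", "Device", "Country"]).agg(
    sum_profit=("Profit", "sum"),
    avg_bid=("Bid", "mean"),
    max_bid=("Bid", "max"),        # any further aggregations go here
    n_rows=("Profit", "count"),
).reset_index()  # flatten the group keys back into columns, as in your original code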
See the official documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html
First time posting here, but having trouble with some code that I'm using to pull fantasy football data from ESPN. I pulled this from Steven Morse's blog (https://stmorse.github.io/journal/espn-fantasy-v3.html) and it appears to work EXCEPT for one error that I'm getting. The error is:
File "<ipython-input-65-56a5896c1c3c>", line 3, in <listcomp>
game['away']['teamId'], game['away']['totalPoints'],
KeyError: 'away'
I've looked in the dictionary and found that 'away' is in there. What I can't figure out is why 'home' works but not 'away'. Here is the code I'm using. Any help is appreciated:
import requests
import pandas as pd
url = 'https://fantasy.espn.com/apis/v3/games/ffl/seasons/2020/segments/0/leagues/721579?view=mMatchupScore'
r = requests.get(url,
cookies={"swid": "{1E653FDE-DA4A-4CC6-A53F-DEDA4A6CC663}",
"espn_s2": "AECpfE9Zsvwwsl7N%2BRt%2BAPhSAKmSs%2F2ZmQVuHJeKG8LGgLBDfRl0j88CvzRFsrRjLmjzASAdIUA9CyKpQJYBfn6avgXoPHJgDiCqfDPspruYqHNENjoeGuGfVqtPewVJGv3rBJPFMp1ugWiqlEzKiT9IXTFAIx3V%2Fp2GBuYjid2N%2FFcSUlRlr9idIL66tz2UevuH4F%2FP6ytdM7ABRCTEnrGXoqvbBPCVbtt6%2Fu69uBs6ut08ApLRQc4mffSYCONOqW1BKbAMPPMbwgCn1d5Ruubl"})
d = r.json()
df = [[
game['matchupPeriodId'],
game['away']['teamId'], game['away']['totalPoints'],
game['home']['teamId'], game['home']['totalPoints']
] for game in d['schedule']]
df = pd.DataFrame(df, columns=['Week', 'Team1', 'Score1', 'Team2', 'Score2'])
df['Type'] = ['Regular' if w<=14 else 'Playoff' for w in df['Week']]
Seems like some of the games in the schedule don't have an away team:
{'home': {'adjustment': 0.0,
'cumulativeScore': {'losses': 0, 'statBySlot': None, 'ties': 0, 'wins': 0},
'pointsByScoringPeriod': {'14': 102.7},
'teamId': 1,
'tiebreak': 0.0,
'totalPoints': 102.7},
'id': 78,
'matchupPeriodId': 14,
'playoffTierType': 'WINNERS_BRACKET',
'winner': 'UNDECIDED'}
For nested JSON data like this, it's often easier to use pandas.json_normalize, which flattens the data structure and gives you a dataframe with lots of columns with names like home.cumulativeScore.losses.
df = pd.json_normalize(r.json()['schedule'])
Then you can reshape the dataframe by dropping columns you don't care about and so on.
df = pd.json_normalize(r.json()['schedule'])
column_names = {
'matchupPeriodId':'Week',
'away.teamId':'Team1',
'away.totalPoints':'Score1',
'home.teamId':'Team2',
'home.totalPoints':'Score2',
}
df = df.reindex(columns=column_names).rename(columns=column_names)
df['Type'] = ['Regular' if w<=14 else 'Playoff' for w in df['Week']]
For the games where there's no away team, pandas will populate those columns with NaN values.
df[df.Team1.isna()]
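Alternatively, if you prefer to keep the original list-comprehension style, here is a sketch (assuming you simply want to skip matchups that have no away team) that guards the lookup with dict.get:
rows = []
for game in d['schedule']:
    away = game.get('away')  # missing for some playoff/bye matchups
    if away is None:
        continue  # skip games with no away team
    rows.append([
        game['matchupPeriodId'],
        away['teamId'], away['totalPoints'],
        game['home']['teamId'], game['home']['totalPoints'],
    ])
df = pd.DataFrame(rows, columns=['Week', 'Team1', 'Score1', 'Team2', 'Score2'])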
If a city is mentioned in cities_specific, I would like to create a flag in the cities_all data. This is just a minimal example; in reality I would like to create several such flags based on multiple dataframes, which is why I tried to solve it with isin instead of a join.
However, I am running into ValueError: Length of values (3) does not match length of index (7).
# import packages
import pandas as pd
import numpy as np
# create minimal data
cities_specific = pd.DataFrame({'city': ['Melbourne', 'Cairns', 'Sydney'],
'n': [10, 4, 8]})
cities_all = pd.DataFrame({'city': ['Vancouver', 'Melbourne', 'Athen', 'Vienna', 'Cairns',
'Berlin', 'Sydney'],
'inhabitants': [675218, 5000000, 664046, 1897000, 150041, 3769000, 5312000]})
# get value error
# how can this be solved differently?
cities_all.assign(in_cities_specific=np.where(cities_specific.city.isin(cities_all.city), '1', '0'))
# that's the solution I would like to get
expected_solution = pd.DataFrame({'city': ['Vancouver', 'Melbourne', 'Athen', 'Vienna', 'Cairns',
'Berlin', 'Sydney'],
'inhabitants': [675218, 5000000, 664046, 1897000, 150041, 3769000, 5312000],
'in_cities': [0, 1, 0, 0, 1, 0, 1]})
I think you have the operands swapped in the condition: you need to check which rows of cities_all appear in cities_specific, not the other way around.
Here you have some alternatives:
cities_all.assign(
in_cities_specific=np.where(cities_all.city.isin(cities_specific.city), '1', '0')
)
or
cities_all["in_cities_specific"] =
cities_all["city"].isin(cities_specific["city"]).astype(int).astype(str)
or
condlist = [cities_all["city"].isin(cities_specific["city"])]
choicelist = ["1"]
cities_all["in_cities_specific"] = np.select(condlist, choicelist,default="0")
I have a CSV file with a bunch of different columns; the ones I am interested in are 'Item', 'OrderDate' and 'Units'.
In my IDE I am trying to generate a bar chart of the number of pencils sold on each individual 'OrderDate'. What I am trying to do is look down through the 'Item' column using pandas, check whether the item is a pencil, and add it to the graph if it is; if it is not, do nothing.
I think I have made it a bit long-winded with the code.
I have the code going down through the 'Item' column and checking whether each entry is a pencil, but I can't figure out what to do next.
import pandas as pd
import matplotlib.pyplot as plt
d = {'item' : pd.Series(['Pencil', 'Marker', 'Pencil', 'Headphones', 'Pencil', 'The moon', 'Wish you were here album']),
'OrderDate' : pd.Series(['5/15/2020', '5/16/2020', '5/16/2020','5/15/2020', \
'5/16/2020', '5/17/2020','5/16/2020','5/16/2020','5/17/2020']),
'Units' : pd.Series([4, 3, 2, 1, 3, 2, 4, 2, 3])}
df = pd.DataFrame.from_dict(d)
df.plot(kind='bar', x='OrderDate', y='Units')
item_col = df['item']  # note: the column is named 'item' (lowercase) in the dict above
pencil_binary = item_col.str.count('Pencil')
for entry in item_col:
if entry == 'Pencil':
print("i am a pencil")
else:
print("i am not a pencil")
print(df)
plt.plot()
plt.show()
If I understood correctly, you want to plot the number of pencils sold per day. For that, you can filter the dataframe to keep only the rows about pencils, then group by date and draw a bar chart.
Here's a reproducible example:
import pandas as pd
import matplotlib.pyplot as plt
d = {'item' : pd.Series(['Pencil', 'Marker', 'Pencil', 'Headphones', 'Pencil', 'The moon', 'Wish you were here album']),
'OrderDate' : pd.Series(['5/15/2020', '5/16/2020', '5/16/2020','5/15/2020', \
'5/16/2020', '5/17/2020','5/16/2020','5/16/2020','5/17/2020']),
'Units' : pd.Series([4, 3, 2, 1, 3, 2, 4, 2, 3])}
df = pd.DataFrame.from_dict(d)
#This dataframe only has pencils
df_pencils = df[df.item == 'Pencil']
# group the pencil rows by date and add up the units sold per day
df_pencils.groupby('OrderDate')['Units'].sum().plot(kind='bar')
plt.show()
The groupby groups all rows with the same date and, for each group, adds up the Units sold.
In fact, when you do this:
df_pencils.groupby('OrderDate')['Units'].sum()
this is the output:
OrderDate
5/15/2020 4
5/16/2020 5
Name: Units, dtype: int64
If you want a one-liner:
df[df.item == 'Pencil'].groupby('OrderDate')['Units'].sum().plot(kind='bar')
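As an optional refinement (the labels are my own suggestion, not part of the question), you can label the chart; the Series plot call returns a matplotlib Axes you can work with:
ax = df[df.item == 'Pencil'].groupby('OrderDate')['Units'].sum().plot(kind='bar')
ax.set_xlabel('Order date')  # the group keys become the x axis
ax.set_ylabel('Units sold')
plt.show()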
I am trying to create a function that shows the n movies most rated by a user in a given dataframe. I have been able to extract the movies the user provided ratings for, but I cannot return the correct number of rows: instead it prints all the movies the user has rated.
I have tried .head(n_rows) as shown in the code, but it does not work:
def top_movies(data_,usr,n_rows = 10):
user = data_[data_['user_id']== usr]
movies = data_.loc[user.index].groupby('title')['title','rating']
final = movies.head(n_rows).sort_values(by = 'rating' ,ascending = False)
return final
def ex9():
return top_movies(data,1,30)
ex9()
I expect it to return the first 30 rows here, for example.
I'm not sure what exactly you want to achieve, but check this:
import pandas as pd
df = pd.DataFrame(
{
'user_id': [1, 1, 1, 2, 2, ],
'title': ['t1', 't2', 't3', 't1', 't5'],
'rating': [25, 25, 35, 25, 30,],
})
(
    df.sort_values(by='rating', ascending=False)
      .groupby('user_id')[['user_id', 'title', 'rating']]
      .nth(list(range(30)))
)
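nth(list(range(30))) keeps up to the first 30 rows of each group, and because the frame was sorted by rating first, those are each user's 30 highest-rated titles. An equivalent, arguably more readable spelling (a sketch under the same assumptions about your columns):
top_n = 30
df.sort_values(by='rating', ascending=False).groupby('user_id').head(top_n)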
I am trying to understand how to apply a function within each group of a 'groupby' on a dataframe.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Stock' : ['apple', 'ford', 'google', 'samsung','walmart', 'kroger'],
'Sector' : ['tech', 'auto', 'tech', 'tech','retail', 'retail'],
'Price': np.random.randn(6),
'Signal' : np.random.randn(6)}, columns= ['Stock','Sector','Price','Signal'])
dfg = df.groupby(['Sector'],as_index=False)
type(dfg)
pandas.core.groupby.DataFrameGroupBy
I want to get sum(Price * (1/Signal)) grouped by 'Sector'.
i.e. The resulting output should look like
Sector | Value
auto   |  0.744944
retail | -0.572164053
tech   | -1.454632
I can get the results by creating separate dataframes, but I was looking for a way to operate within each of the grouped (Sector) frames.
I can find the mean or sum of Price:
dfg.agg({'Price' : [np.mean, np.sum] }).head(2)
but not sum(Price * (1/Signal)), which is what I need.
Thanks,
You provided random data, so there is no way we can get the exact numbers you got, but based on what you described, I think the following will do:
In [121]:
(df.Price/df.Signal).groupby(df.Sector).sum()
Out[121]:
Sector
auto -1.693373
retail -5.137694
tech -0.984826
dtype: float64
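If you specifically want to operate within each group, as you asked, an equivalent sketch uses apply on the groupby object (same numbers, computed group by group; the vectorized division above is usually faster):
df.groupby('Sector').apply(lambda g: (g.Price / g.Signal).sum())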