Pandas averaging selected rows and columns - python

I am working with some EPL stats. I have a CSV with all matches from one season in the following format.
D H A H_SC A_SC H_ODDS D_ODDS A_ODDS...
11.05.2014 Norwich Arsenal 0 2 5.00 4.00 1.73
11.05.2014 Chelsea Swansea 0 0 1.50 3.00 5.00
What I would like to do is, for each match, calculate the average stats of both teams from their N previous matches. The result should look something like this.
D H A H_SC A_SC H_ODDS D_ODDS A_ODDS...
11.05.2014 Norwich Arsenal avgNorwichSC avgArsenalSC 5.00 4.00 1.73
11.05.2014 Chelsea Swansea avgChelseaSC avgSwanseaSC 1.50 3.00 5.00
So the date, teams and odds remain untouched and the other stats are replaced with averages from the N previous matches. EDIT: The matches from the first N rounds should not be in the final table because there is not enough data to calculate the averages.
The trickiest part for me is that the stats I am averaging have a different prefix (H_ or A_) depending on where the match was played.
All I have managed to do so far is to create a dictionary, where the key is the club name and the value is a DataFrame containing all matches played by that club.
D H A H_SC A_SC H_ODDS D_ODDS A_ODDS...
11.05.2014 Norwich Arsenal 0 2 5.00 4.00 1.73
04.05.2014 Arsenal West Brom 1 0 1.40 5.25 8.00
I have also previously coded this without pandas, but I was not satisfied with the code and I would like to learn pandas :).

You say you want to learn pandas, so I've given a few examples (tested with similar data) to get you going along the right track. It's a bit of an opinion, but I think finding the last N games is hard, so I'll initially assume you want averages over the whole table. If finding the "last N" is really important, I can add to the answer. This should get you going with pandas and groupby - I've left prints in so you can understand what's going on.
import pandas
EPL_df = pandas.read_csv('D:\\EPLstats.csv')
#Convert the date column, then find the most recent date for each team
EPL_df['D'] = pandas.to_datetime(EPL_df['D'])
homeGroup = EPL_df.groupby('H')
awayGroup = EPL_df.groupby('A')
#Following will give you Series indexed by team: each team's last game, home and away
homeLastGame = homeGroup['D'].max()
awayLastGame = awayGroup['D'].max()
teamLastGame = pandas.concat([homeLastGame, awayLastGame]).reset_index().groupby('index')['D'].max()
print(teamLastGame)
homeAveScore = homeGroup['H_SC'].mean()
awayAveScore = awayGroup['A_SC'].mean()
teamAveScore = (homeGroup['H_SC'].sum() + awayGroup['A_SC'].sum()) / (homeGroup['H_SC'].count() + awayGroup['A_SC'].count())
print(teamAveScore)
You now have average scores for each team along with their most recent match dates. All you have to do now is select the relevant rows of the original dataframe using the most recent dates (i.e. everything apart from the score columns) and then select from the average score dataframes using the team names from that row.
e.g.
recentRows = EPL_df.loc[EPL_df['D'] > pandas.to_datetime("2015/01/10")]
print(recentRows)

def insertAverages(s):
    a = teamAveScore[s['H']]
    b = teamAveScore[s['A']]
    print(a, b)
    return pandas.Series(dict(H_AVSC=a, A_AVSC=b))

finalTable = pandas.concat([recentRows, recentRows.apply(insertAverages, axis=1)], axis=1)
print(finalTable)
finalTable has your original odds etc. for the most recent games, with two extra columns (H_AVSC and A_AVSC) for the average scores of the home and away teams involved in those matches.
Edit
A couple of gotchas:
I just noticed I didn't put a format string in to_datetime(). Your dates look like UK format with dots, so you should do
EPL_df['D'] = pandas.to_datetime(EPL_df['D'], format='%d.%m.%Y')
You could use the minimum of the dates in teamLastGame instead of the hard coded 2015/01/10 in my example.
If you really need to replace column H_SC with H_AVSC in your finalTable, rather than add on the averages:
newCols = recentRows.apply(insertAverages, axis = 1)
recentRows['H_SC'] = newCols['H_AVSC']
recentRows['A_SC'] = newCols['A_AVSC']
print(recentRows)
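Since the question does ask for averages over only the last N matches (and for dropping the first rounds), here is a minimal sketch of one way to extend the above using groupby, shift and rolling. It reuses EPL_df from above and the column names from the sample; N = 5 is just a placeholder.
N = 5  # number of previous matches to average over - placeholder assumption

EPL_df = EPL_df.sort_values('D')

# One row per team per match, with the goals that team scored
home = EPL_df[['D', 'H', 'H_SC']].rename(columns={'H': 'team', 'H_SC': 'scored'})
away = EPL_df[['D', 'A', 'A_SC']].rename(columns={'A': 'team', 'A_SC': 'scored'})
long_df = pandas.concat([home, away]).sort_values('D')

# Rolling mean of the previous N games; shift(1) keeps the current match out of its own average
long_df['avg_scored'] = (long_df.groupby('team')['scored']
                                .transform(lambda s: s.shift(1).rolling(N).mean()))

# Attach the averages for the home and away team of each original row
lookup = long_df[['D', 'team', 'avg_scored']]
result = (EPL_df
          .merge(lookup, left_on=['D', 'H'], right_on=['D', 'team'])
          .rename(columns={'avg_scored': 'H_AVSC'}).drop(columns='team')
          .merge(lookup, left_on=['D', 'A'], right_on=['D', 'team'])
          .rename(columns={'avg_scored': 'A_AVSC'}).drop(columns='team')
          .dropna(subset=['H_AVSC', 'A_AVSC']))
print(result)
The dropna at the end takes care of the EDIT in the question: rows from the first rounds, where a team does not yet have N previous matches, are removed.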

How to find the maximum date value with conditions in python?

I have a three-column dataframe as follows. I want to calculate the three-month return per day for every fund, so I need to get the date with recorded NAV data three months earlier. Should I use the max() function with the filter() function to deal with this problem? If so, how? If not, could you please help me figure out a better way to do this?
fund code   date         NAV
fund 1      2021-01-04   1.0000
fund 1      2021-01-05   1.0001
fund 1      2021-01-06   1.0023
...         ...          ...
fund 2      2020-02-08   1.0000
fund 2      2020-02-09   0.9998
fund 2      2020-02-10   1.0001
...         ...          ...
fund 3      2022-05-04   2.0021
fund 3      2022-05-05   2.0044
fund 3      2022-05-06   2.0305
I tried to combined the max() function with filter() as follows:
max(filter(lambda x: x<=df['date']-timedelta(days=91)))
But it didn't work.
Were this in excel, I know I could use the following functions to solve this problem:
{max(if(B:B<=B2-91,B:B))}
{max(if(B:B<=B3-91,B:B))}
{max(if(B:B<=B4-91,B:B))}
....
But with python, I don't know what I could do. I just learnt it three days ago. Please help me.
This picture shows what I want if it were in Excel. The yellow area is the original data. The white part is the procedure I need for the calculation and the red part is the result I want. To get this result, I need to divide the 3rd column by the 5th column.
I know that I could use the pct_change(periods=7) function to get the same results as in this picture. But here is the tricky part: the row 7 rows before is not necessarily the data from 7 days before, and not all the funds are recorded daily. Some funds are recorded weekly, some monthly. So I need to check whether the data used for the division exists first.
What you need is an implementation of a sliding-window maximum (for your example, 1 week = 7 days).
I could recreate your example as follows (to create the data frame you have):
import pandas as pd
import datetime
from random import randint

# Note: DataFrame.append was removed in pandas 2.0; on newer versions build the rows
# in a list and use pd.DataFrame / pd.concat instead.
df = pd.DataFrame(columns=["fund code", "date", "NAV"])
date = datetime.datetime.strptime("2021-01-04", '%Y-%m-%d')
for i in range(10):
    df = df.append({"fund code": 'fund 1', "date": date + datetime.timedelta(i), "NAV": randint(0, 10)}, ignore_index=True)
for i in range(20, 25):
    df = df.append({"fund code": 'fund 1', "date": date + datetime.timedelta(i), "NAV": randint(0, 10)}, ignore_index=True)
for i in range(20, 25):
    df = df.append({"fund code": 'fund 2', "date": date + datetime.timedelta(i), "NAV": randint(0, 10)}, ignore_index=True)
This will look like your example, with non-continuous dates and two different funds.
The sliding-window maximum (for a variable window length in days) looks like this:
from collections import deque

class max_queue:
    def __init__(self, win=7):
        self.win = win
        self.queue = deque()
        self.date = None

    def append(self, date, value):
        # Drop smaller values from the back: they can no longer be the window maximum
        while self.queue and value > self.queue[-1][1]:
            self.queue.pop()
        # Drop entries from the front that have fallen out of the window
        while self.queue and date - self.queue[0][0] >= datetime.timedelta(self.win):
            self.queue.popleft()
        self.queue.append((date, value))
        self.date = date

    def get_max(self):
        return self.queue[0][1]
Now you can simply iterate over the rows and get the max value in the timeframe you are interested in.
mq = max_queue(7)
pre_code = ''
for idx, row in df.iterrows():
    code, date, nav, *_ = row
    if code != pre_code:
        mq = max_queue(7)
        pre_code = code
    mq.append(date, nav)
    df.at[idx, 'max'] = mq.get_max()
Results will look like this, with an added max column. This assumes that the rows for each fund are contiguous, but you could also modify it to keep a separate max_queue per fund.
Using a max queue to keep track of only the maximum in the window gives the correct O(n) complexity for a solution, which is important if you are dealing with huge datasets, and especially with bigger date ranges (instead of a week).
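Coming back to the original question (finding the latest NAV at least three months old for each fund and using it to compute the return), here is a minimal sketch using pandas.merge_asof. It assumes the 'fund code', 'date' and 'NAV' columns from the sample and a 91-day offset; merge_asof needs both sides sorted on the matching keys.
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values('date')

# Right side: every known (fund, date, NAV), to be matched backwards in time
lookup = df[['fund code', 'date', 'NAV']].rename(
    columns={'date': 'ref_date', 'NAV': 'NAV_3m_ago'})

# Left side: each row with its cut-off date 91 days earlier
target = df.assign(cutoff=df['date'] - pd.Timedelta(days=91)).sort_values('cutoff')

# For each row, the most recent NAV recorded on or before the cut-off, per fund
merged = pd.merge_asof(target, lookup,
                       left_on='cutoff', right_on='ref_date',
                       by='fund code', direction='backward')

# Three-month return; NaN where the fund has no record that old
merged['ret_3m'] = merged['NAV'] / merged['NAV_3m_ago'] - 1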

Update Python dictionary with appended list value

I have a dataframe with price quotes for a variety of parts and makers. ~10k parts and 10 makers, so my dataset contains up to 100k rows, looking roughly like this:
Part   Maker     Price
1      Alpha     1.00
2      Alpha     1.30
3      Alpha     1.25
1      Bravo     1.10
2      Bravo     1.02
3      Bravo     1.15
4      Bravo     1.19
1      Charlie   0.99
2      Charlie   1.10
3      Charlie   1.12
4      Charlie   1.19
I am wanting to return two dictionaries based on the best price, Part/Maker and Part/Price. My main issue is when two makers have the same best price.
I want my result to end up like this:
1: 0.99
2: 1.1
3: 1.02
4: 1.19
and the second one to be:
1: Charlie
2: Charlie
3: Bravo
4: [Bravo, Charlie]
The first dictionary is easy. Second one is what I'm stuck on. Here's what I have so far:
winning_price_dict={}
winning_mfg_dict={}
for index, row in quote_df.iterrows():
    if row['Part'] not in winning_price_dict:
        winning_price_dict[row['Part']] = row['Proposed Quote']
        winning_mfg_dict[row['Part']] = list(row['Maker'])
    if winning_price_dict[row['Part']]>row['Proposed Quote']:
        winning_price_dict[row['Part']] = row['Proposed Quote']
        winning_mfg_dict[row['Part']] = row['Maker']
    if winning_price_dict[row['Part']]==row['Proposed Quote']:
        winning_price_dict[row['Part']] = row['Proposed Quote']
        winning_mfg_dict[row['Part']] = winning_mfg_dict[row['Part']].append(row['Maker']) #this is the only line that I don't believe works
When I run it as is, it says 'str' object has no attribute 'append'. However, I thought that it should be a list because of the list(row['Maker']) command.
When I change the relevant lines to this:
for index, row in quote_df.iterrows():
    if row['Part'] not in winning_price_dict:
        winning_mfg_dict[row['Part']] = list(row['Mfg'])
    if winning_price_dict[row['Part']]>row['Proposed Quote']:
        winning_mfg_dict[row['Part']] = list(row[['Mfg']])
    if winning_price_dict[row['Part']]==row['Proposed Quote']:
        winning_mfg_dict[row['Part']] = list(winning_mfg_dict[row['Part']]).append(row['Mfg'])
The winning_mfg_dict is all the part numbers and NoneType values, not the maker names.
What do I need to change to get it to return the list of suitable makers?
Thanks!
In your original code, the actual problem was on line 9 of the first fragment: you set the value to a string, not to a list. Also, calling list(some_string) does not do what you expect: it creates a list of single characters, not [some_string].
I took the liberty to improve the overall readability by extracting common keys to variables, and joined two branches with same bodies. Something like this should work:
winning_price_dict = {}
winning_mfg_dict = {}
for index, row in quote_df.iterrows():
    # Extract variables, saving a few accesses and reducing line lengths
    part = row['Part']
    quote = row['Proposed Quote']
    maker = row['Maker']
    if part not in winning_price_dict or winning_price_dict[part] > quote:
        # First time here, or a lower quote was found - reset to initial
        winning_price_dict[part] = quote
        winning_mfg_dict[part] = [maker]
    elif winning_price_dict[part] == quote:
        # Add one more maker with the same price
        # Not updating winning_price_dict - we already know it's proper
        winning_mfg_dict[part].append(maker)
You can use groupby to get all quotes for one part
best_quotes = quote_df.groupby("Part").apply(lambda g: g[g["Price"] == g["Price"].min()])
Then you get a dataframe with the part number and the previous index as a MultiIndex. The lambda function selects only the quotes with the minimum price.
You can get the first dictionary with
winning_price_dict = {part: price for (part, _), price in best_quotes["Price"].items()}
and the second one with
winning_mfg_dict = {part: list(best_quotes.loc[part]["Maker"]) for part in best_quotes.index.get_level_values("Part")}
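As a fully vectorized alternative (a sketch, assuming the Part / Maker / Price column names from the sample table), you can also build both dictionaries without iterrows:
# Lowest price per part
winning_price_dict = quote_df.groupby("Part")["Price"].min().to_dict()

# Every maker that matches the lowest price of its part
is_best = quote_df["Price"] == quote_df.groupby("Part")["Price"].transform("min")
winning_mfg_dict = quote_df[is_best].groupby("Part")["Maker"].apply(list).to_dict()
Note that this always stores a list, even when there is a single winner (e.g. 1: ['Charlie']).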

A way to make the [apply + lambda + loc] efficient

The dataframe (containing data on the 2016 elections), loaded into pandas from a .csv, has the following structure:
In [2]: df
Out[2]:
county candidate votes ...
0 Ada Trump 10000 ...
1 Ada Clinton 900 ...
2 Adams Trump 12345 ...
.
.
n Total ... ... ...
The idea would be to calculate the first X counties with the highest percentage of votes in favor of candidate X (removing the Total rows).
For example, suppose we want 100 counties and the candidate is Trump; the operation to be carried out per county is: 100 * (votes for Trump in the county) / (total votes in the county).
I have implemented the following code, getting correct results:
In [3]: (df.groupby(by="county")
           .apply(lambda x: 100 * x.loc[(x.candidate == "Trump")
                                        & (x.county != "Total"), "votes"].sum() / x.votes.sum())
           .nlargest(100)
           .reset_index(name='percentage'))
Out[3]:
county percentage
0 Hayes 91.82
1 WALLACE 90.35
2 Arthur 89.37
.
.
99 GRANT 79.10
Using %%time I realized that it is quite slow:
Out[3]:
CPU times: user 964 ms, sys: 24 ms, total: 988 ms
Wall time: 943 ms
Is there a way to make it faster?
You can try to amend your code to use only vectorized operations to speed up the process, like below:
df1 = df.loc[(df.county != "Total")] # exclude the Total row(s)
df2 = 100 * df1.groupby(['county', 'candidate'])['votes'].sum() / df1.groupby('county')['votes'].sum() # calculate percentage for each candidate
df3 = df2.nlargest(100).reset_index(name='percentage') # get the largest 100
df3.loc[df3.candidate == "Trump"] # Finally, filter by candidate
Edit:
If you want the top 100 counties with the highest percentages, you can slightly change the code as below:
df1 = df.loc[(df.county != "Total")] # exclude the Total row(s)
df2 = 100 * df1.groupby(['county', 'candidate'])['votes'].sum() / df1.groupby('county')['votes'].sum() # calculate percentage for each candidate
df3a = df2.reset_index(name='percentage') # get the percentage
df3a.loc[df3a.candidate == "Trump"].nlargest(100, 'percentage') # Finally, filter by candidate and get the top 100 counties with highest percentages for the candidate
You can try:
Supposing you don't have a 'Total' row with the sum of all votes:
(df[df['candidate'] == 'Trump'].groupby(['county']).sum()/df['votes'].sum()*100).nlargest(100, 'votes')
Supposing you have a 'Total' row with the sum of all votes:
(df[df['candidate'] == 'Trump'].groupby(['county']).sum()/df.loc[df['candidate'] != 'Total', 'votes'].sum()*100).nlargest(100, 'votes')
I could not test it because I don't have the data, but it doesn't use any apply, which should improve performance.
For renaming the columns you can use .rename(columns={'votes':'percentage'}) at the end.
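If you want to keep the question's per-county metric (Trump votes as a share of that county's own total), a fully vectorized sketch, assuming the county / candidate / votes columns from the question, could be:
valid = df[df["county"] != "Total"]                      # drop the Total row(s)
county_totals = valid.groupby("county")["votes"].sum()   # all votes cast per county
trump_votes = (valid[valid["candidate"] == "Trump"]
               .groupby("county")["votes"].sum())        # Trump votes per county
percentage = (100 * trump_votes / county_totals).nlargest(100)
print(percentage)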

Pandas group sum divided by unique items in group

I have data in Excel of employees and the number of hours worked in a week. I tagged each employee with the project he/she is working on. I can get the sum of hours worked on each project by doing a groupby as below:
util_breakup_sum = df[["Tag", "Bill. Hours"]].groupby("Tag").sum()
Bill. Hours
Tag
A61H 92.00
A63B 139.75
An 27.00
B32B 33.50
H 37.00
Manager 8.00
PP 23.00
RP0117 38.50
Se 37.50
However, when I try to calculate the average time spent on each project per person, it gives me (sum / total number of entries in the group), whereas the correct average should be (sum / unique employees in the group).
Example of mean is given below:
util_breakup_mean = df[["Tag", "Bill. Hours"]].groupby("Tag").mean()
Bill. Hours
Tag
A61H 2.243902
A63B 1.486702
An 1.000000
B32B 0.712766
H 2.055556
Manager 0.296296
PP 1.095238
RP0117 1.425926
Se 3.750000
For example, group A61H has just two employees, so their average should be (92 / 2) = 46. However, the code is dividing by the total number of entries by these employees and hence gives an average of 2.24.
How to get the average from unique employee names in the group?
Try:
df.groupby("Tag")["Bill. Hours"].sum().div(df.groupby("Tag")["Employee"].nunique()
Where Employee is column identifying employees.
You can try nunique
util_breakup_mean = util_breakup_sum/df.groupby("Tag")['employee'].nunique()
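Equivalently, you can do both aggregations in a single groupby with named aggregation (a sketch; it assumes the employee column is called Employee):
agg = df.groupby("Tag").agg(total_hours=("Bill. Hours", "sum"),
                            n_employees=("Employee", "nunique"))
avg_per_employee = agg["total_hours"] / agg["n_employees"]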

Pandas: calculate mean for each row by cycle number

I have a CSV file (Mspec Data) which looks like this:
#Header
#
"Cycle";"Time";"ms";"mass amu";"SEM c/s"
0000000001;00:00:01;0000001452; 1,00; 620
0000000001;00:00:01;0000001452; 1,20; 4730
0000000001;00:00:01;0000001452; 1,40; 4610
... ;..:..:..;..........;.........;...........
I read it via:
df = pd.read_csv(Filename, header=30, delimiter=';', decimal=',')
the result looks like this:
Cycle Time ms mass amu SEM c/s
0 1 00:00:01 1452 1.0 620
1 1 00:00:01 1452 1.2 4730
2 1 00:00:01 1452 1.4 4610
... ... ... ... ... ...
3872 4 00:06:30 390971 1.0 32290
3873 4 00:06:30 390971 1.2 31510
This data contains several Mass spec scans with identical parameters. Cycle number 1 means scan 1 and so forth. I would like to calculate the mean of the last column, SEM c/s, for each corresponding identical mass. In the end I would like to have a new dataframe containing only:
ms "mass amu" "SEM c/s(mean over all cycles)"
Obviously the mean of the mass does not need to be calculated. I would like to avoid reading each cycle into a new dataframe, as this would mean I have to look up the length of each mass spectrum. The "mass range" and "resolution" are obviously different for different measurements (solutions).
I guess doing the calculation in numpy directly would be best, but I am stuck.
Thank you in advance
You can use groupby(), something like this:
df.groupby(['ms', 'mass amu'])['SEM c/s'].mean()
You have different ms over all the cycles, and you want to calculate the mean of SEM over each group of same ms. I will show you a step-by-step example.
You should invoke each group and then put the mean in a dictionary to convert into a DataFrame.
ms_uni = df['ms'].unique()  # calculate the unique ms values
new_df_dict = {"ma": [], "SEM": []}  # later you will rename them
for un in range(len(ms_uni)):
    cms = ms_uni[un]
    new_df_dict['ma'].append(cms)
    new_df_dict['SEM'].append(df[df['ms'] == cms]['SEM c/s'].mean())  # advice: change the column name to a safer SEM-c_s
new_df = pd.DataFrame(new_df_dict)  # end of the dirty work
new_df = new_df.rename(index=str, columns={'ma': "mass amu", "SEM": "SEM c/s(mean over all cycles)"})
Hope it will be helpful
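For reference, the groupby approach can also produce the final frame directly; a short sketch (it averages over all cycles per mass value and leaves out ms, since ms differs from cycle to cycle):
result = (df.groupby("mass amu", as_index=False)["SEM c/s"].mean()
            .rename(columns={"SEM c/s": "SEM c/s(mean over all cycles)"}))
print(result)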
