The dataframe (containing data on the 2016 elections), loaded into pandas from a .csv, has the following structure:
In [2]: df
Out[2]:
county candidate votes ...
0 Ada Trump 10000 ...
1 Ada Clinton 900 ...
2 Adams Trump 12345 ...
.
.
n Total ... ... ...
The idea is to find the first N counties with the highest percentage of votes in favor of a given candidate (removing the Total rows).
For example, suppose we want 100 counties and the candidate is Trump; the operation to carry out for each county is: 100 * sum of votes for Trump / total votes in the county. For Ada in the sample above that would be 100 * 10000 / (10000 + 900) ≈ 91.7.
I have implemented the following code, getting correct results:
In [3]: (df.groupby(by="county")
           .apply(lambda x: 100 * x.loc[(x.candidate == "Trump")
                                        & (x.county != "Total"), "votes"].sum() / x.votes.sum())
           .nlargest(100)
           .reset_index(name='percentage'))
Out[3]:
county percentage
0 Hayes 91.82
1 WALLACE 90.35
2 Arthur 89.37
.
.
99 GRANT 79.10
Using %%time I realized that it is quite slow:
Out[3]:
CPU times: user 964 ms, sys: 24 ms, total: 988 ms
Wall time: 943 ms
Is there a way to make it faster?
You can try to amend your code to use only vectorized operations to speed up the process, like below:
df1 = df.loc[(df.county != "Total")] # exclude the Total row(s)
df2 = 100 * df1.groupby(['county', 'candidate'])['votes'].sum() / df1.groupby('county')['votes'].sum() # calculate percentage for each candidate
df3 = df2.nlargest(100).reset_index(name='percentage') # get the largest 100
df3.loc[df3.candidate == "Trump"] # Finally, filter by candidate
Edit:
If you want the top 100 counties with the highest percentages, you can slightly change the code as below:
df1 = df.loc[(df.county != "Total")] # exclude the Total row(s)
df2 = 100 * df1.groupby(['county', 'candidate'])['votes'].sum() / df1.groupby('county')['votes'].sum() # calculate percentage for each candidate
df3a = df2.reset_index(name='percentage') # get the percentage
df3a.loc[df3a.candidate == "Trump"].nlargest(100, 'percentage') # Finally, filter by candidate and get the top 100 counties with highest percentages for the candidate
You can try the following.
Supposing you don't have a 'Total' row with the sum of all votes:
(df[df['candidate'] == 'Trump'].groupby(['county']).sum()/df['votes'].sum()*100).nlargest(100, 'votes')
Supposing you have a 'Total' row with the sum of all votes:
(df[df['candidate'] == 'Trump'].groupby(['county']).sum()/df.loc[df['candidate'] != 'Total', 'votes'].sum()*100).nlargest(100, 'votes')
I could not test it because I don't have the data, but it doesn't use any apply, which should improve performance.
To rename the column you can use .rename(columns={'votes':'percentage'}) at the end.
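Putting it together, a minimal sketch (assuming the same column names as in the question and a 'Total' row that has to be excluded; 'df_top' is just an illustrative name):
df_top = (df[df['candidate'] == 'Trump']
            .groupby('county')['votes'].sum()
            .div(df.loc[df['candidate'] != 'Total', 'votes'].sum())
            .mul(100)
            .nlargest(100)
            .reset_index()
            .rename(columns={'votes': 'percentage'}))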
I've got two dataframes, one with 54k rows and 1 column and another with 139k rows and 3 columns. I need to check whether the values of a column in the first dataframe lie between the values of two columns in the second dataframe, and if they do, replace that particular value in the first dataframe with the corresponding string value from the second dataframe.
I tried doing it with simple for loops and if/else statements, but the number of iterations is huge and my cell is taking forever to run. I've attached some snippets below; if there is any better way to rewrite that particular part of the code, it would be a great help. Thanks in advance.
First DataFrame:
ip_address_to_clean
IP_Address_clean
0 815237196
1 1577685417
2 979279225
3 3250268602
4 2103448748
... ...
54208 4145673247
54209 1344187002
54210 3156712153
54211 1947493810
54212 2872038579
54213 rows × 1 columns
Second DataFrame:
ip_boundaries_file
country lower_bound_ip_address_clean upper_bound_ip_address_clean
0 Australia 16777216 16777471
1 China 16777472 16777727
2 China 16777728 16778239
3 Australia 16778240 16779263
4 China 16779264 16781311
... ... ... ...
138841 Hong Kong 3758092288 3758093311
138842 India 3758093312 3758094335
138843 China 3758095360 3758095871
138844 Singapore 3758095872 3758096127
138845 Australia 3758096128 3758096383
138846 rows × 3 columns
Code I've written:
ip_address_to_clean_copy = ip_address_to_clean.copy()
o_ip = ip_address_to_clean['IP_Address_clean'].values
l_b = ip_boundaries_file['lower_bound_ip_address_clean'].values
for i in range(len(o_ip)):
    for j in range(len(l_b)):
        if (ip_address_to_clean['IP_Address_clean'][i] > ip_boundaries_file['lower_bound_ip_address_clean'][j]) and (ip_address_to_clean['IP_Address_clean'][i] < ip_boundaries_file['upper_bound_ip_address_clean'][j]):
            ip_address_to_clean_copy['IP_Address_clean'][i] = ip_boundaries_file['country'][j]
            #print(ip_address_to_clean_copy['IP_Address_clean'][i])
            #print(i)
This works (I tested it on small tables).
replacement1 = [None] * 3758096384
replacement2 = []
for _, row in ip_boundaries_file.iterrows():
    a, b, c = row['lower_bound_ip_address_clean'], row['upper_bound_ip_address_clean'], row['country']
    replacement1[a+1:b] = [len(replacement2)] * (b - a - 1)
    replacement2.append(c)

ip_address_to_clean_copy['IP_Address_clean'] = ip_address_to_clean_copy['IP_Address_clean'].apply(
    lambda x: replacement2[replacement1[x]] if (x < len(replacement1) and replacement1[x] is not None) else x)
I tweaked the lambda function to keep the original ip if it's not in the replacement table.
Notes:
Compared to my comment, I added the replacement2 table to hold the actual strings, and put the indexes in replacement1 to make it more memory efficient.
This is based on one of the methods to sort a list in O(n) when you know the contained values are bounded.
Example:
Inputs:
ip_address_to_clean = pd.DataFrame([10, 33, 2, 179, 2345, 123], columns=['IP_Address_clean'])
ip_boundaries_file = pd.DataFrame([['China', 1, 12],
                                   ['Australia', 20, 40],
                                   ['China', 2000, 3000],
                                   ['France', 100, 150]],
                                  columns=['country', 'lower_bound_ip_address_clean',
                                           'upper_bound_ip_address_clean'])
Output:
ip_address_to_clean_copy
# Out[13]:
# IP_Address_clean
# 0 China
# 1 Australia
# 2 China
# 3 179
# 4 China
# 5 France
As I mentioned in another comment, here's another script that performs a dichotomy (binary) search on the 2nd DataFrame; it works in O(n log(p)), which is slower than the above script but consumes far less memory!
def replace(n, df):
    # binary search over the boundary rows (assumes they are sorted by lower bound)
    if len(df) == 0:
        return n
    i = len(df) // 2
    if df.iloc[i]['lower_bound_ip_address_clean'] < n < df.iloc[i]['upper_bound_ip_address_clean']:
        return df.iloc[i]['country']
    elif len(df) == 1:
        return n
    else:
        if n <= df.iloc[i]['lower_bound_ip_address_clean']:
            return replace(n, df.iloc[:i])
        else:
            return replace(n, df.iloc[i+1:])
ip_address_to_clean_copy['IP_Address_clean'] = ip_address_to_clean['IP_Address_clean'].apply(lambda x: replace(x,ip_boundaries_file))
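Not part of either script above, but if the boundary rows are sorted by lower bound and the ranges don't overlap (as they appear to in ip_boundaries_file), a fully vectorized lookup with numpy.searchsorted should also work. A rough sketch, untested against the real data:
import numpy as np
import pandas as pd

# sort boundaries by lower bound so searchsorted can be used
bounds = ip_boundaries_file.sort_values('lower_bound_ip_address_clean').reset_index(drop=True)
lower = bounds['lower_bound_ip_address_clean'].to_numpy()
upper = bounds['upper_bound_ip_address_clean'].to_numpy()
countries = bounds['country'].to_numpy()

ips = ip_address_to_clean['IP_Address_clean'].to_numpy()

# for each ip, index of the candidate range: the last row whose lower bound lies below it
idx = np.clip(np.searchsorted(lower, ips, side='left') - 1, 0, len(bounds) - 1)

# keep the match only where the ip really falls strictly inside that range
inside = (lower[idx] < ips) & (ips < upper[idx])

result = pd.Series(ips, index=ip_address_to_clean.index, dtype=object)
result[inside] = countries[idx[inside]]
ip_address_to_clean_copy['IP_Address_clean'] = result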
Shape of my dataframe is (129880, 23). 129880 passengers and 23 inputs.
I'm just concerned with the calculation of the percentage proportion of the passengers I selected using groupby.
Looking at the table, the calculation of the long-haul passengers is way off. Why only the long-haul passengers? Is there a cleaner way of calculating this that I can try to see if it makes a difference?
It's almost like the calculation is doing 352 out of 4105 not 129880. But even that would be 8.5749 not 6.54%.
6.54% of 129880 = 8494, but the count is only 352.
4105 passengers out of 129880 = 3.16% (correct)
352 passengers out of 129880 is actually 0.2710%.
age_seg = (df['AGE']>30) & (df['AGE']<40)
ptc_seg = (df['PTC'] == 'NEW')
route_age_csat_ptc = df.groupby(['CSAT', 'ROUTE', age_seg, ptc_seg], sort=False)
vc = route_age_csat_ptc.ngroup().value_counts(normalize=True, sort=False)
vc.index = route_age_csat_ptc.groups.keys()
out = route_age_csat_ptc.size().to_frame('Count').assign(Percentage=vc.mul(100).round(2)).reset_index()
out.loc[(out['AGE'] == True) & (out['PTC'] == True) & (out['CSAT']=='DSAT')]
CSAT ROUTE AGE PTC Count Percentage
7 DSAT SHORTHAUL True True 4105 3.16
14 DSAT LONGHAUL True True 352 6.54
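One thing you could try, to see whether it makes a difference, is to compute the percentage directly from the group sizes instead of going through ngroup()/value_counts() and re-assigning the index. A sketch, reusing the same df, age_seg and ptc_seg as above, where each percentage is that group's share of all len(df) rows:
counts = df.groupby(['CSAT', 'ROUTE', age_seg, ptc_seg], sort=False).size()
out = (counts.to_frame('Count')
             .assign(Percentage=lambda t: (t['Count'] / len(df) * 100).round(2))
             .reset_index())
out.loc[out['AGE'] & out['PTC'] & (out['CSAT'] == 'DSAT')]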
I have a dataframe similar to the one below;
Price  return  indicator
5      0.05    1
6      0.20    -1
5      -0.16   1
Where the indicator is based upon the forecasted return on the following day.
What I would like to achieve is a strategy where, when the indicator is positive 1, I buy the stock at the price on that date/row. Then, if the indicator is negative, we sell at that price. I would then like to create a new column that represents the value of the portfolio on each day. Assuming I have $1000 to invest, the value of the portfolio should equal the holdings plus the cash amount. I'm assuming that any fraction of a stock can be purchased.
I'm unsure where to start with this one. I tried calculating the buy-and-hold strategy using:
df['Holding'] = df['return'].add(1).cumprod() * 5000
This worked for a buy-and-hold strategy, but modifying it for the new strategy seems difficult.
I tried:
df['HOLDING'] = (df['return'].add(1).cumprod() * 5000 * df['Indicator'])
# to get the value of the buy or the sell
# then using
df['HOLDING'] = np.where(df['HOLDING'] > 0, df['HOLDING'], df['HON HOLDING 2'] * -1)
# my logic was: if it's positive it's the value of the stock holding, and if it's negative it is a cash inflow, therefore I made it positive as it would be cash.
The issue is that my logic is massively flawed: if the holding is cash, the return shouldn't apply to it. Further, I don't think using cumprod is correct for this strategy.
Has anyone used this strategy before and can offer tips on how to make it work?
Thank you.
I'm not sure the returns and prices are in the correct place (they shouldn't really be in the same row if they represent the buying price, presumably yesterday's close, and the daily return, assuming the position was held for the whole day). But anyway...
import pandas as pd
# the data you provided
df = pd.read_csv("Data.csv", header=0)
# an initial starting row (explanation provided)
starting = pd.DataFrame({'Price': [0], 'return': [0], 'indicator': [0]})
# concatenate so starting is first row
df = pd.concat([starting, df]).reset_index(drop=True)
# setting holding to 0 at start (no shares), and cash at 1000 (therefore portfolio = 1000)
df[["Holding", "Cash", "Portfolio"]] = [0, 1000, 1000]
# buy/sell is the difference (explanation provided)
df["BuySell"] = df["indicator"].diff()
# simulating every day
for i in range(1, len(df)):
    # buying
    if df["BuySell"].iloc[i] > 0:
        df.loc[i, "Holding"] += df["Cash"].iloc[i-1] / df["Price"].iloc[i]
        df.loc[i, "Cash"] = 0
    # selling
    elif df["BuySell"].iloc[i] < 0:
        df.loc[i, "Cash"] = df["Holding"].iloc[i-1] * df["Price"].iloc[i]
        df.loc[i, "Holding"] = 0
    # holding position
    else:
        df.loc[i, "Cash"] = df["Cash"].iloc[i-1]
        df.loc[i, "Holding"] = df["Holding"].iloc[i-1]
    # multiply holding by return (assuming all-in, so holding=0 not affected)
    df.loc[i, "Holding"] *= (1 + df["return"].iloc[i])
    df.loc[i, "Portfolio"] = df["Holding"].iloc[i] * df["Price"].iloc[i] + df["Cash"].iloc[i]
Explanations:
Starting row:
This is needed so that the loop can refer to the previous holdings and cash (it would be more of an inconvenience to add an if statement in the loop for the i=0 case).
Buy/Sell:
The difference is necessary here: if the position changes from buy to sell, the shares are obviously sold (and vice versa). However, if the previous row carries the same buy/sell signal as the current row, there is no change (diff=0) and no shares are bought or sold.
Portfolio:
This is an "equivalent" amount (the amount you would hold if you converted all shares to cash at the time).
Holding:
This is the number of shares held.
NOTE: from what I understood of your question, this is an all-in strategy; there is no partial (percentage) exposure, which makes the strategy more simplistic but easier to code.
Output:
#Out:
# Price return indicator Holding Cash Portfolio BuySell
#0 0 0.00 0 0.00 1000 1000.0 NaN
#1 5 0.05 1 210.00 0 1050.0 1.0
#2 6 0.20 -1 0.00 1260 1260.0 -2.0
#3 5 -0.16 1 211.68 0 1058.4 2.0
Hopefully this will give you a good starting point to create something more to your specification and more advanced, such as with multiple shares, or being a certain percentage exposed, etc.
I need help with conditionals for a pandas dataframe. Apologies in advance for the basic question or if it's covered elsewhere.
Here's the example dataframe:
employee sales revenue salary
12345 20 10000 100000
I have a few conditions based on data which will result in salary changing.
scenarios:
if sales >10 and revenue > $5,000, increase salary by 20%
if sales <5 and revenue > $5,000, increase salary by 10%
otherwise, do nothing.
variables:
high_sales = 10
low_sales = 5
high_revenue = 5000
big_increase = 1.2
small_increase = 1.1
I know this requires some nesting but it's not clear to me how to do it.
I want the outcome to be a dataframe with only the salary column adjusted.
Here's the code:
df['salary'] = np.where((df['sales'] >= high_sales & df['revenue'] >= high_revenue),
                        df['salary'] * big_increase,
                        (df['sales'] <= low_sales & df['revenue'] >= high_revenue),
                        df['salary'] * small_increase,
                        df['sales'])
Is this right?
With multiple conditions, it's nicer to use np.select rather than np.where:
conds = [(df.sales > 10) & (df.revenue > 5000),
(df.sales < 5) & (df.revenue > 5000)]
choices = [df.salary * 1.2, df.salary * 1.1]
df['salary'] = np.select(conds, choices, default=df.salary)  # "do nothing" keeps the existing salary
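For reference, the same thing written with the variables defined in the question (just a sketch; it assumes the strict comparisons from the scenarios):
conds = [(df.sales > high_sales) & (df.revenue > high_revenue),
         (df.sales < low_sales) & (df.revenue > high_revenue)]
choices = [df.salary * big_increase, df.salary * small_increase]
df['salary'] = np.select(conds, choices, default=df.salary)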
I am working with some EPL stats. I have csv with all matches from one season in following format.
D H A H_SC A_SC H_ODDS D_ODDS A_ODDS...
11.05.2014 Norwich Arsenal 0 2 5.00 4.00 1.73
11.05.2014 Chelsea Swansea 0 0 1.50 3.00 5.00
What I would like to do is for each match calculate average stats of teams from N previous matches. The result should look something like this.
D H A H_SC A_SC H_ODDS D_ODDS A_ODDS...
11.05.2014 Norwich Arsenal avgNorwichSC avgArsenalSC 5.00 4.00 1.73
11.05.2014 Chelsea Swansea avgChelseaSC avgSwanseaSC 1.50 3.00 5.00
So the date, teams and odds remain untouched, and the other stats are replaced with averages from the N previous matches. EDIT: Matches from the first N rounds should not be in the final table because there is not enough data to calculate the averages.
The trickiest part for me is that the stats I am averaging have different prefixes (H_ or A_) depending on where the match was played.
All I managed to do so far is create a dictionary where the key is the club name and the value is a DataFrame containing all matches played by that club.
D H A H_SC A_SC H_ODDS D_ODDS A_ODDS...
11.05.2014 Norwich Arsenal 0 2 5.00 4.00 1.73
04.05.2014 Arsenal West Brom 1 0 1.40 5.25 8.00
I have also previously coded this without pandas, but I was not satisfied with the code and I would like to learn pandas :).
You say you want to learn pandas, so I've given a few examples (tested with similar data) to get you going along the right track. It's a bit of an opinion, but I think finding the last N games is hard, so I'll initially assume/pretend you want averages over the whole table. If finding the "last N" is really important, I can add to the answer. This should get you going with pandas and groupby; I've left prints in so you can understand what's going on.
import pandas
EPL_df = pandas.read_csv('D:\\EPLstats.csv')
#Find most recent date for each team
EPL_df['D'] = pandas.to_datetime(EPL_df['D'])
homeGroup = EPL_df.groupby('H')
awayGroup = EPL_df.groupby('A')
#Following will give you dataframes, team against last game, home and away
homeLastGame = homeGroup['D'].max()
awayLastGame = awayGroup['D'].max()
teamLastGame = pandas.concat([homeLastGame, awayLastGame]).reset_index().groupby('index')['D'].max()
print(teamLastGame)
homeAveScore = homeGroup['H_SC'].mean()
awayAveScore = awayGroup['A_SC'].mean()
teamAveScore = (homeGroup['H_SC'].sum() + awayGroup['A_SC'].sum()) / (homeGroup['H_SC'].count() + awayGroup['A_SC'].count())
print(teamAveScore)
You now have average scores for each team along with their most recent match dates. All you have to do now is select the relevant rows of the original dataframe using the most recent dates (i.e. everything apart from the score columns) and then select from the average score dataframes using the team names from those rows.
e.g.
recentRows = EPL_df.loc[EPL_df['D'] > pandas.to_datetime("2015/01/10")]
print(recentRows)

def insertAverages(s):
    a = teamAveScore[s['H']]
    b = teamAveScore[s['A']]
    print(a, b)
    return pandas.Series(dict(H_AVSC=a, A_AVSC=b))

finalTable = pandas.concat([recentRows, recentRows.apply(insertAverages, axis=1)], axis=1)
print(finalTable)
finalTable has your original odds etc. for the most recent games, with two extra columns (H_AVSC and A_AVSC) for the average scores of the home and away teams involved in those matches.
Edit
A couple of gotchas:
I just noticed I didn't put a format string in to_datetime(). Your dates look like UK format with dots, so you should do:
EPL_df['D'] = pandas.to_datetime(EPL_df['D'], format='%d.%m.%Y')
You could use the minimum of the dates in teamLastGame instead of the hard coded 2015/01/10 in my example.
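For example (a sketch, reusing teamLastGame from above):
recentRows = EPL_df.loc[EPL_df['D'] >= teamLastGame.min()]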
If you really need to replace column H_SC with H_AVSC in your finalTable, rather than add on the averages:
newCols = recentRows.apply(insertAverages, axis = 1)
recentRows['H_SC'] = newCols['H_AVSC']
recentRows['A_SC'] = newCols['A_AVSC']
print(recentRows)