Optimizing over different pandas groupby's - python

Intent
Calculating weighted averages for different combinations of columns in a DataFrame df, and identifying anomalies in these averages based on how far they deviate from the mean of the weighted averages.
The code iterates through all combinations of the category columns (i.e., columns that are not "weight" or "score") using the combinations function from the itertools module. For each combination, it groups the DataFrame by these columns, calculates the weighted average using the weighted_average function, and stores the resulting DataFrame as df_wa.
Next, the code calculates the mean and standard deviation of the weighted averages in df_wa and stores them as new columns in df_wa. It then identifies anomalies by selecting rows in which the absolute difference between the weighted average and the mean is greater than three times the standard deviation. These anomalous rows, along with the combination of columns used to group the data, are stored in the results dictionary as a list of dictionaries. Each dictionary contains information about the reference group (i.e., the combination of columns used to group the data), a description of the anomaly, and the score and reference score for the anomaly.
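In short: for each combination of category columns, each group g gets the weighted average wa_g = sum(weight * score) / sum(weight), and a group is flagged as anomalous when |wa_g - mean(wa)| > 3 * std(wa), where the mean and standard deviation are taken over all groups of that combination.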
Problem
This code seems to take too long. On my laptop it takes about 90 seconds on a dataset of 1000 rows and 9 category columns.
However, when deployed to AWS Lambda, it takes about 6 minutes.
I'm looking for optimizations to improve the speed.
from itertools import combinations

import pandas as pd


def weighted_average(df: pd.DataFrame):
    if "weight" in df.columns:
        return sum(df["weight"] * df["score"]) / df["weight"].sum()
    return sum(df["score"]) / df["score"].count()


def main() -> dict:
    # read all the data
    df = pd.read_csv("./review.csv")
    # Create a dictionary to store the results
    results = {"anomalies": []}
    # create list of the category columns
    categories = [item for item in df.columns if item not in ["weight", "score"]]
    # Iterate through each unique combination of columns
    for i in range(0, len(categories)):
        for cols in combinations(categories, i + 1):
            # group by the combination of columns, calculate the weighted average
            df_wa = (
                df.groupby(by=list(cols))
                .apply(lambda x: weighted_average(x))
                .reset_index()
                .rename(columns={0: "weighted_average"})
            )
            # print(df_wa.shape)
            if len(df_wa) < 30:
                continue
            # calculate the mean and std from the weighted average
            df_wa["weighted_average_mean"] = df_wa["weighted_average"].mean()
            df_wa["weighted_average_std"] = df_wa["weighted_average"].std()
            # calculate the anomalies
            anomalies = df_wa.query(
                "abs(weighted_average - weighted_average_mean) > (3 * weighted_average_std)"
            )
            # store the anomalous rows and the column combination in the results dictionary
            ref_group = " and ".join(cols)
            for _, row in anomalies.iterrows():
                perc = abs(100 - (row["weighted_average"] / row["weighted_average_mean"] * 100))
                aob = "above" if (perc > 100) else "below"
                group = " and ".join(
                    row.drop(
                        labels=[
                            "weighted_average",
                            "weighted_average_mean",
                            "weighted_average_std",
                        ]
                    ).values
                )
                srs = " and ".join(
                    [str(x) for x in row[["weighted_average", "weighted_average_mean"]].values]
                )
                anomaly = {
                    "reference_group": ref_group,
                    "description": f"{group} : score is {perc:.2f}% {aob} others.",
                    "Score_and_reference_score": srs,
                }
                results["anomalies"].append(anomaly)
    return results
So far I've chosen to do an early exit from the loop if df_wa has fewer than 30 rows and move on to the next combination.
But this hasn't reduced the time much, as most of the groupby DataFrames have more than 30 rows...
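A minimal sketch of one possible optimization (not part of the original question): the per-group weighted average can be computed without groupby().apply() by summing a precomputed weight * score column. The column names match the question; the helper name weighted_average_fast is hypothetical.

import pandas as pd

def weighted_average_fast(df: pd.DataFrame, cols) -> pd.DataFrame:
    # precompute weight * score once, then aggregate both columns with a single groupby sum
    tmp = df.assign(weighted_score=df["weight"] * df["score"])
    grouped = tmp.groupby(list(cols))[["weighted_score", "weight"]].sum()
    wa = (grouped["weighted_score"] / grouped["weight"]).rename("weighted_average")
    return wa.reset_index()

# usage inside the combinations loop, e.g. df_wa = weighted_average_fast(df, cols)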

Related

How to compute occurrences of a specific value and its percentage for each column based on a condition in a pandas dataframe?

I have the following dataframe df, in which I highlighted in green the cells with values of interest (dataframe screenshot not reproduced here),
and I would like to obtain for each column (therefore by considering the whole dataframe) the following statistics: the count of values less than or equal to 0.5 (the green cells in the dataframe), with NaN values excluded, and its percentage within the column, in order to use, say, 50% as a benchmark.
For this I tried value_counts, like (df['A'].value_counts()/df['A'].count())*100, but this returns a partial result, not in the form I want, and only for specific columns; I was also thinking about using filter or a lambda function like df.loc[lambda x: x <= 0.5], but clearly that is not the result I wanted.
The goal/output will be a dataframe (output screenshot not reproduced here) that displays just the columns that "beat" the benchmark (recall: at least half (50%) of their values <= 0.5).
For example, in column A the count would be 2 and the percentage 2/3 * 100 = 66%, while in column B the count would be 4 and the percentage 4/8 * 100 = 50% (the same goes for columns X, Y and Z). On the other hand, column C, where 2/8 * 100 = 25%, doesn't beat the benchmark and is therefore not included in the output.
Is there a suitable way to achieve this, in your opinion? Apologies in advance if this is somewhat of a duplicated question, but I found no other questions able to help me out, and thanks to any saviour.
I believe I have understood your ask in the below code...
It would be good if you could provide an expected output in your question so that it is easier to follow.
Anyway, the first part of the code below is just set-up and can be ignored, since you already have your data set up.
Basically, I have created a quick function for you that returns the percentage of values that are under a threshold you can define.
This function is called in a loop over all the columns in your dataframe, and if that percentage is at least the output threshold (again, you can define it), the column is kept for the actual output.
import pandas as pd
import numpy as np
import random
import datetime

### SET UP ###
base = datetime.datetime.today()
date_list = [base - datetime.timedelta(days=x) for x in range(10)]

def rand_num_list(length):
    peak = [round(random.uniform(0,1),1) for i in range(length)] + [0] * (10-length)
    random.shuffle(peak)
    return peak

df = pd.DataFrame(
    {
        'A':rand_num_list(3),
        'B':rand_num_list(5),
        'C':rand_num_list(7),
        'D':rand_num_list(2),
        'E':rand_num_list(6),
        'F':rand_num_list(4)
    },
    index=date_list
)

df = df.replace({0:np.nan})
##############

print(df)

def less_than_threshold(thresh_df, thresh_col, threshold):
    if len(thresh_df[thresh_col].dropna()) == 0:
        return 0
    return len(thresh_df.loc[thresh_df[thresh_col]<=threshold]) / len(thresh_df[thresh_col].dropna())

output_dict = {'cols':[]}
col_threshold = 0.5
output_threshold = 0.5

for col in df.columns:
    if less_than_threshold(df, col, col_threshold) >= output_threshold:
        output_dict['cols'].append(col)

df_output = df.loc[:,output_dict.get('cols')]
print(df_output)
Hope this achieves your goal!
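For comparison, here is a fully vectorized sketch of the same idea, reusing the df, col_threshold and output_threshold variables defined above (an alternative illustration, not part of the original answer):

# fraction of non-NaN values per column that are <= the threshold
share = df.le(col_threshold).sum() / df.notna().sum()
# keep only the columns whose share meets the output threshold
df_output = df.loc[:, share[share >= output_threshold].index]
print(df_output)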

What is making this Python code so slow? How can I modify it to run faster?

I am writing a program in Python for a data analytics project involving advertisement performance data matched to advertisement characteristics aimed at identifying high performing groups of ads that share n similar characteristics. The dataset I am using has individual ads as rows, and characteristic, summary, and performance data as columns. Below is my current code - the actual dataset I am using has 51 columns, 4 are excluded, so it is running with 47 C 4, or 178365 iterations in the outer loop.
Currently, this code takes ~2 hours to execute. I know that nested for loops can be the source of such a problem, but I do not know why it is taking so long to run, and am not sure how I can modify the inner/outer for loops to improve performance. Any feedback on either of these topics would be greatly appreciated.
import itertools
import pandas as pd
import numpy as np

# Identify Clusters of Rows (Ads) that have a KPI value above a certain threshold
def set_groups(df, n):
    """This function takes a dataframe and a number n, and returns a list of lists. Each list is a group of n columns.
    The list of lists will hold all size n combinations of the columns in the dataframe.
    """
    # Create a list of all relevant column names
    columns = list(df.columns[4:]) # exclude first 4 summary columns
    # Create a list of lists, where each list is a group of n columns
    groups = []
    vals_lst = list(map(list, itertools.product([True, False], repeat=n))) # Create a list of all possible combinations of True/False values
    for comb in itertools.combinations(columns, n): # itertools.combinations returns a list of tuples
        groups.append([comb, vals_lst])
    groups = np.array(groups, dtype=object)
    return groups # len(groups) = len(columns(df)) choose n

def identify_clusters(df, KPI, KPI_threshhold, max_size, min_size, groups):
    """
    This function takes in a dataframe, a KPI, a threshhold value, a max and min size, and a list of lists of groupings.
    The function will identify groups of rows in the dataframe that have the same values for each column in each list of groupings.
    The function will return a list of lists with each list of groups, the values list, and the ad_ids in the cluster.
    """
    # Create a list to hold the results
    output = []
    # Iterate through each list of groups
    for group in groups:
        for vals_lst in group[1]: # for each pair of groups and associated value matrices
            # Create a temporary dataframe to hold the group of rows with matching values for columns in group
            temp_df = df
            for i in range(len(group[0])):
                temp_df = temp_df[(temp_df[group[0][i]] == vals_lst[i])] # reduce the temp_df to only rows that match the values in vals_lst for each combination of values
            if temp_df[KPI].mean() > KPI_threshhold: # if the mean of the KPI for the temp_df is above the threshhold
                output.append([group, vals_lst, temp_df['ad_id'].values]) # append the group, vals_lst, and ad_ids to the output list
    print(output)
    return output

## Main
df = pd.read_excel('data.xlsx', sheet_name='name')
groups = set_groups(df, 4)
print(len(groups))
identify_clusters(df, 'KPI_var', 0.0015, 6, 4, groups)
Any insight into why the code is taking such a long time to run, and/or any advice on improving the performance of this code would be extremely helpful.
I think your biggest issue is the lines:
temp_df = df
for i in range(len(group[0])):
    temp_df = temp_df[(temp_df[group[0][i]] == vals_lst[i])]
You're filtering the entire dataframe, while I think you're only actually interested in the KPI and ad_id columns. You could instead build up a boolean mask, something like:
mask = pd.Series(True, index=df.index)
for i in range(len(group[0])):
    mask = mask & (df[group[0][i]] == vals_lst[i])
You can then access your subsets with something like df[mask][KPI].mean() and df[mask]['ad_id'].values. If you do this, you will avoid copying a huge amount of data on every iteration.
I would also be tempted to simplify the code a little. For example, I believe vals_lst = list(map(list, itertools.product([True, False], repeat=n))) is the same for each group, so I would calculate it once and hold it as a stand-alone variable rather than add it to every group; this would clean up the group[0], group[1] and group[0][i] references, which were a little hard to track on first reading the code.
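A sketch of that restructuring (illustrative only; the columns list below is a placeholder for list(df.columns[4:]) from the question):

import itertools

columns = ["char_a", "char_b", "char_c", "char_d", "char_e"]  # placeholder for list(df.columns[4:])
n = 4

vals_lst = list(itertools.product([True, False], repeat=n))  # computed once, shared by every group
groups = list(itertools.combinations(columns, n))            # just tuples of column names

for cols in groups:
    for vals in vals_lst:
        pass  # build a boolean mask for (cols, vals) as in the timing example below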
Looking at the change from iterative filtering to tracking a mask, the mask approach always seems to perform better, and the gap increases with data size. With 10000 rows the gaps are:

| Method     | Time               | Relative           |
|------------|--------------------|--------------------|
| Original   | 2.900383699918166  | 2.8098094911581533 |
| Using Mask | 1.03223499993328   | 1.0                |
with the following test code:
import random, timeit
import pandas as pd

random.seed(1)
iterations = 1000

data = {hex(i): [random.randint(0, 1) for i in range(10000)] for i in range(52)}
df = pd.DataFrame(data)
kpi_col = hex(1)

# test group of columns with desired values
group = (
    (hex(5), 1),
    (hex(6), 1),
    (hex(7), 1),
    (hex(8), 1)
)

def method0():
    tmp = df
    for column, value in group:
        tmp = tmp[tmp[column] == value]
    return tmp[kpi_col].mean()

def method1():
    mask = pd.Series(True, df.index)
    for column, value in group:
        mask = mask & (df[column] == value)
    return df[mask][kpi_col].mean()

assert method0() == method1()

t0 = timeit.timeit(lambda: method0(), number=iterations)
t1 = timeit.timeit(lambda: method1(), number=iterations)
tmin = min((t0, t1))

print(f'| Method     | Time | Relative |')
print(f'|------------|------|----------|')
print(f'| Original   | {t0} | {t0 / tmin} |')
print(f'| Using Mask | {t1} | {t1 / tmin} |')

Welles Wilder's moving average with pandas

I'm trying to calculate Welles Wilder's type of moving average (also called a smoothed moving average) in a pandas dataframe.
The method to calculate Wilder's moving average for 'n' periods of series 'A' is:
Calculate the mean of the first 'n' values in 'A' and set it as the value at the 'n'-th position.
For the following values, use the previous mean weighted by (n-1) plus the current value of the series weighted by 1, and divide the total by 'n'.
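Written as a recurrence, with B denoting the Wilder average of series A over period n:

B[n-1] = mean(A[0], ..., A[n-1])
B[i] = ((n-1) * B[i-1] + A[i]) / n, for i >= n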
My question is: how to implement this in a vectorized way?
I tried to do it by iterating over the dataframe (which, from what I read, isn't recommended because it is slow). It works and the values are correct, but I get the warning
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
and it's probably not the most efficient way of doing it.
My code so far:
import pandas as pd
import numpy as np

# Building random sample:
datas = pd.date_range('2020-01-01', '2020-01-31')
np.random.seed(693)
A = np.random.randint(40, 60, size=(31, 1))
df = pd.DataFrame(A, index=datas, columns=['A'])

period = 12 # Main parameter
initial_mean = A[0:period].mean() # Equation for the first value.
size = len(df.index)

df['B'] = np.full(size, np.nan)
df.B[period-1] = initial_mean

for x in range(period, size):
    df.B[x] = ((df.A[x] + (period-1)*df.B[x-1]) / period) # Equation for the following values.

print(df)
You can use the Pandas ewm() method, which behaves exactly as you described when adjust=False:
When adjust is False, weighted averages are calculated recursively as:
weighted_average[0] = arg[0];
weighted_average[i] = (1-alpha)*weighted_average[i-1] + alpha*arg[i]
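Substituting alpha = 1/period into that recurrence gives weighted_average[i] = (1 - 1/period) * weighted_average[i-1] + (1/period) * arg[i] = ((period - 1) * weighted_average[i-1] + arg[i]) / period, which is exactly the Wilder formula used in the question's loop.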
If you want to do the simple average of the first period items, you can do that first and apply ewm() to the result.
You can calculate a series with the average of the first period items, followed by the other items repeated verbatim, with the formula:
pd.Series(
    data=[df['A'].iloc[:period].mean()],
    index=[df['A'].index[period-1]],
).append(
    df['A'].iloc[period:]
)
So in order to calculate the Wilder moving average and store it in a new column 'C', you can use:
df['C'] = pd.Series(
    data=[df['A'].iloc[:period].mean()],
    index=[df['A'].index[period-1]],
).append(
    df['A'].iloc[period:]
).ewm(
    alpha=1.0 / period,
    adjust=False,
).mean()
At this point, you can calculate df['B'] - df['C'] and you'll see that the difference is almost zero (there's some rounding error with float numbers.) So this is equivalent to your calculation using a loop.
You might want to consider skipping the direct average of the first period items and simply applying ewm() from the start, which treats the first row as the previous average in the first calculation. The results will be slightly different, but once you've gone through a couple of periods those initial values will hardly influence the results.
That would be a much simpler calculation:
df['D'] = df['A'].ewm(
    alpha=1.0 / period,
    adjust=False,
).mean()

How can I group rows based on those that have the same value in a column, and then run my code on each subset?

I have a csv file, which I created through some lines of code, that includes the following:
A 'BatchID' column, in the format DEFGH12-01, specifying which batch each unit is in, and a 'UnitID' column with the units and their full ID numbers, in the format DEFGH12-01_x01_y01. Each unit (UnitID) falls under a specific batch, and thus the UnitID number corresponds to the BatchID it is under.
I have a certain algorithm that I have been running on the entire dataset of unit IDs. I want to group the units based on having the same BatchID value (as there are many unique units that fall under each batch), and then run the algorithm on each of these subsets of unit batches.
How can I do this?
The simplest way is to use pandas grouping.
Here is an example.
Creating data:
import pandas as pd

df = pd.DataFrame({"A": [1,2,3,4,5], "B": [1,2,3,4,5], "C": ['GROUP_A', 'GROUP_A', 'GROUP_A', 'GROUP_B', 'GROUP_B']})
Applying your function:
groups_list = []
for group_name, group_values in df.groupby("C"):
    # applying a function on a column based on group
    group_values = group_values.assign(A=group_values.A.apply(lambda x: x ** 2))
    # for re-creating the df
    groups_list.append(group_values)

# if there is only 1 group, the else branch is needed
mod_df = pd.concat(groups_list, axis=0) if len(groups_list) > 1 else groups_list[0]
print(mod_df)
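Applied to the question's columns, the loop might look like the sketch below (run_algorithm is a hypothetical stand-in for the asker's algorithm, and the sample rows only illustrate the ID formats described in the question):

import pandas as pd

def run_algorithm(batch_df: pd.DataFrame):
    # hypothetical placeholder for the actual per-batch algorithm
    return len(batch_df)

df = pd.DataFrame({
    "BatchID": ["DEFGH12-01", "DEFGH12-01", "DEFGH12-02"],
    "UnitID": ["DEFGH12-01_x01_y01", "DEFGH12-01_x01_y02", "DEFGH12-02_x01_y01"],
})

results = {}
for batch_id, batch_df in df.groupby("BatchID"):
    # batch_df contains only the rows (UnitIDs) belonging to this batch
    results[batch_id] = run_algorithm(batch_df)

print(results)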

Pandas: Calculate the percentage between two rows and add the value as a column

I have a dataset structured like this:
"Date","Time","Open","High","Low","Close","Volume"
This time series represent the values of a generic stock market.
I want to calculate the difference in percentage between two rows of the column "Close" (in fact, I want to know how much the value of the stock increased or decreased; each row represents a day).
I've done this with a for loop (which is a terrible approach for a big-data problem in pandas), and it produces the right results, but in a different DataFrame:
rows_number = df_stock.shape[0]

# The first row is set to 1 because the change is calculated as a percentage;
# if there is no yesterday, the value must be 1
percentage_df = percentage_df.append({'Date': df_stock.iloc[0]['Date'], 'Percentage': 1}, ignore_index=True)

# For each day, calculate the market trend in percentage
for index in range(1, rows_number):
    # n_yesterday : 100 = (n_today - n_yesterday) : x
    n_today = df_stock.iloc[index]['Close']
    n_yesterday = df_stock.iloc[index-1]['Close']
    difference = n_today - n_yesterday
    percentage = (100 * difference) / n_yesterday
    percentage_df = percentage_df.append({'Date': df_stock.iloc[index]['Date'], 'Percentage': percentage}, ignore_index=True)
How could I refactor this to take advantage of the DataFrame API, removing the for loop and creating the new column in place?
df['Change'] = df['Close'].pct_change()
or if you want to calculate the change in reverse order:
df['Change'] = df['Close'].pct_change(-1)
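Note that pct_change() returns a fractional change; to express it as a percentage like the formula in the question, multiply by 100:
df['Change'] = df['Close'].pct_change() * 100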
I would suggest first setting the Date column as a DateTime index; for this you can use
df_stock = df_stock.set_index(['Date'])
df_stock.index = pd.to_datetime(df_stock.index, dayfirst=True)
Then you can access any row and column using datetime indexing and perform whatever operations you want. For example, to calculate the difference in percentage between two rows of the column "Close":
df_stock['percentage'] = ((df_stock.loc['15-07-2019', 'Close'] - df_stock.loc['14-07-2019', 'Close']) / df_stock.loc['14-07-2019', 'Close']) * 100
You can also use a for loop to do the operations for each date or row:
for Dt in df_stock.index:
    ...  # apply the per-date calculation here
Using diff:
(-df['Close'].diff()) / df['Close'].shift()
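For reference, df['Close'].diff() / df['Close'].shift() reproduces pct_change(); the leading minus sign above flips the sign of the result.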
