PROBLEM: I have a dataframe showing which assignments students chose to do and what grades they got on them. I am trying to determine which subsets of assignments were done by the most students and the total points earned on them. The method I'm using is very slow, so I'm wondering what the fastest way is.
My data has this structure:
STUDENT     | ASSIGNMENT1 | ASSIGNMENT2 | ASSIGNMENT3 | ... | ASSIGNMENT20
Student1    | 50          | 75          | 100         | ... | 50
Student2    | 75          | 25          | NaN         | ... | NaN
...
Student2000 | 100         | 50          | NaN         | ... | 50
TARGET OUTPUT:
For every possible combination of assignments, I'm trying to get the number of completions and the sum of total points earned on each individual assignment by the subset of students who completed that exact assignment combo:
ASSIGNMENT_COMBO                         | NUMBER_OF_STUDENTS_WHO_DID_THIS_COMBO | ASSIGNMENT1 TOTAL POINTS | ASSIGNMENT2 TOTAL POINTS | ASSIGNMENT3 TOTAL POINTS | ... | ASSIGNMENT20 TOTAL POINTS
Assignment 1, Assignment 2               | 900                                   | 5000                     | 400                      | NaN                      | ... | NaN
Assignment 1, Assignment 2, Assignment 3 | 100                                   | 3000                     | 500                      | ...                      | ... | NaN
Assignment 2, Assignment 3               | 750                                   | NaN                      | 7000                     | 750                      | ... | NaN
... (all possible combos, including any number of assignments)
WHAT I'VE TRIED: First, I'm using itertools to make my assignment combos and then iterating through the dataframe to classify each student by what combos of assignments they completed:
for combo in itertools.product(list_of_assignment_names, repeat=20):
    for i, row in starting_data.iterrows():
        ifor = str(combo)
        ifor_val = 'no'
        for item in combo:
            if row[str(item)] > 0:
                ifor_val = 'yes'
        starting_data.at[i, ifor] = ifor_val
Then, I make a second dataframe (assignmentcombostats) that has each combo as a row to count up the number of students who did each combo:
numberofstudents = []
for combo in assignmentcombostats['combo']:
    column = str(combo)
    number = len(starting_data[starting_data[column] == 'yes'])
    numberofstudents.append(number)
assignmentcombostats['numberofstudents'] = numberofstudents
This works, but it is very slow.
RESOURCES: I've looked at a few resources -
This post is what I based my current method on
This page has ideas for faster iterating, but I'm not sure of the best way to solve my problem using vectorization.
One approach to speed up your code is to avoid using for loops and instead use pandas built-in functions to apply transformations on your data. Here's an example implementation that should accomplish your desired output:
import itertools
import pandas as pd
# sample data
data = {
'STUDENT': ['Student1', 'Student2', 'Student3', 'Student4'],
'ASSIGNMENT1': [50, 75, 100, 100],
'ASSIGNMENT2': [75, 25, 50, 75],
'ASSIGNMENT3': [100, None, 75, 50],
'ASSIGNMENT4': [50, None, None, 100]
}
df = pd.DataFrame(data)
# create a list of all possible assignment combinations
assignments = df.columns[1:].tolist()
combinations = []
for r in range(1, len(assignments)+1):
    combinations += list(itertools.combinations(assignments, r))

# create a dictionary to hold the results
results = {'ASSIGNMENT_COMBO': [],
           'NUMBER_OF_STUDENTS_WHO_DID_THIS_COMBO': [],
           'ASSIGNMENT_TOTAL_POINTS': []}

# iterate over the combinations and compute the results
for combo in combinations:
    # filter the dataframe for students who have completed this combo
    combo_df = df.loc[df[list(combo)].notnull().all(axis=1)]
    num_students = len(combo_df)
    # compute the total points for each assignment in the combo,
    # reindexed over all assignments so every stored list has the same length
    points = combo_df[list(combo)].sum().reindex(assignments)
    # append the results to the dictionary
    results['ASSIGNMENT_COMBO'].append(combo)
    results['NUMBER_OF_STUDENTS_WHO_DID_THIS_COMBO'].append(num_students)
    results['ASSIGNMENT_TOTAL_POINTS'].append(points.tolist())

# create a new dataframe from the results dictionary
combo_stats_df = pd.DataFrame(results)

# explode the ASSIGNMENT_COMBO column into separate rows for each assignment in the combo
combo_stats_df = combo_stats_df.explode('ASSIGNMENT_COMBO')

# create separate columns for each assignment in the combo
for i, assignment in enumerate(assignments):
    combo_stats_df[f'{assignment} TOTAL POINTS'] = \
        combo_stats_df['ASSIGNMENT_TOTAL_POINTS'].apply(lambda x: x[i])

# drop the ASSIGNMENT_TOTAL_POINTS column
combo_stats_df = combo_stats_df.drop('ASSIGNMENT_TOTAL_POINTS', axis=1)

print(combo_stats_df)
This code first creates a list of all possible assignment combinations using itertools.combinations. Then, it iterates over each combo and filters the dataframe to include only students who have completed the combo. It computes the number of students and the total points for each assignment in the combo using built-in pandas functions like notnull, all, and sum. Finally, it creates a new dataframe from the results dictionary and explodes the ASSIGNMENT_COMBO column into separate rows for each assignment in the combo. It then creates separate columns for each assignment and drops the ASSIGNMENT_TOTAL_POINTS column. This approach should be much faster than using for loops, especially for large dataframes.
I had a go at tidying up Bryan's Answer
Make a list of all possible combinations
Iterate over each combination to find the totals and number of students
Combine the results in to a dataframe
Setup: (Makes a dataset of 20,000 students and 10 assignments)
import itertools
import pandas as pd
import numpy as np
# Bigger random sample data
def make_data(rows, cols, nans, non_nans):
    df = pd.DataFrame()
    df["student"] = list(range(rows))
    for i in range(1, cols+1):
        a = np.random.randint(low=1-nans, high=non_nans, size=(rows)).clip(0).astype(float)
        a[a <= 0] = np.nan
        df[f"a{i:02}"] = a
    return df
rows = 20000
cols = 10
df = make_data(rows, cols, 50, 50)
# dummy columns, makes aggregates easier
df["students"] = 1
df["combo"] = ""
Transformation:
# create a list of all possible assignment combinations (ignore the first column and the last two dummy columns)
assignments = df.columns[1:-2].tolist()
combos = []
for r in range(1, len(assignments)+1):
    new_combos = list(itertools.combinations(assignments, r))
    combos += new_combos

# create a list to hold the results
results = list(range(len(combos)))

# ignore the student identifier column
df_source = df.iloc[:, 1:]

# iterate over the combinations and compute the results
for ix, combo in enumerate(combos):
    # filter the dataframe for students who have completed this combo
    df_filter = df_source.loc[df_source[list(combo)].notnull().all(axis=1)]
    # aggregate the results to a single row (sum of the dummy students column counts the rows)
    df_agg = df_filter.groupby("combo", as_index=False).sum().reset_index(drop=True)
    # store the assignment combination in the results
    df_agg["combo"] = ",".join(combo)
    # add the results to the list
    results[ix] = df_agg

# create a new dataframe from the results list
combo_stats_df = pd.concat(results).reset_index(drop=True)
In this demo it takes ~6 seconds to return ~1,000 rows of results. For 20 assignments that's ~1,000,000 rows of results, so roughly 6,000 seconds (over 1.5 hours) at that rate.
Even on my desktop, which processes ~1,000 combinations in ~2 seconds, the ~1,000,000 combinations from 20 assignments would still take about half an hour.
I initially tried to write it without the loop, but the process was killed for using too much memory. I like the puzzle, it helps me learn, so I'll ponder if there's a way to avoid the loop while staying within memory.
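One idea, as an untested sketch building on the setup above: if the goal is the exact combo each student completed, every student falls into exactly one group, so you can label each row by its non-NaN pattern and aggregate with a single groupby instead of looping over every possible combination. Note this answers a slightly different question than the loop above, which counts every student who completed at least the given combo.
import pandas as pd

# Untested sketch: label each student by the exact set of assignments they
# completed (their non-NaN pattern), then aggregate once per observed pattern.
assignment_cols = df.columns[1:-2].tolist()   # same assignment columns as above
pattern = df[assignment_cols].notna()

# e.g. "a01,a03,a07"; this overwrites the dummy "combo" column from the setup
df["combo"] = pattern.apply(lambda r: ",".join(r.index[r.values]), axis=1)

grouped = df.groupby("combo")
combo_stats = grouped[assignment_cols].sum()                           # points per assignment
combo_stats.insert(0, "numberofstudents", grouped["students"].sum())   # row counts per combo
combo_stats = combo_stats.reset_index()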
Intent
Calculating weighted averages for different combinations of columns in a DataFrame df, and identifying anomalies in these averages based on how far they deviate from the mean of the weighted averages.
The code iterates through all combinations of the category columns (i.e., columns that are not "weight" or "score") using the combinations function from the itertools module. For each combination, it groups the DataFrame by these columns, calculates the weighted average using the weighted_average function, and stores the resulting DataFrame as df_wa.
Next, the code calculates the mean and standard deviation of the weighted averages in df_wa and stores them as new columns in df_wa. It then identifies anomalies by selecting rows in which the absolute difference between the weighted average and the mean is greater than three times the standard deviation. These anomalous rows, along with the combination of columns used to group the data, are stored in the results dictionary as a list of dictionaries. Each dictionary contains information about the reference group (i.e., the combination of columns used to group the data), a description of the anomaly, and the score and reference score for the anomaly.
Problem
This code seems to take too long. On my laptop it takes about 90 seconds to run on a dataset of 1,000 rows and 9 category columns.
However, when deployed to AWS Lambda, it takes about 6 minutes.
I'm looking for optimizations to improve the speed.
from itertools import combinations
import pandas as pd

def weighted_average(df: pd.DataFrame):
    if "weight" in df.columns:
        return sum(df["weight"] * df["score"]) / df["weight"].sum()
    return sum(df["score"]) / df["score"].count()

def main() -> dict:
    # read all the data
    df = pd.read_csv("./review.csv")

    # Create a dictionary to store the results
    results = {"anomalies": []}

    # create list of the category columns
    categories = [item for item in df.columns if item not in ["weight", "score"]]

    # Iterate through each unique combination of columns
    for i in range(0, len(categories)):
        for cols in combinations(categories, i + 1):
            # group by combination of columns, calculate the weighted average
            df_wa = (
                df.groupby(by=list(cols))
                .apply(lambda x: weighted_average(x))
                .reset_index()
                .rename(columns={0: "weighted_average"})
            )
            # print(df_wa.shape)
            if len(df_wa) < 30:
                continue
            # calculate the mean and std from the weighted average
            df_wa["weighted_average_mean"] = df_wa["weighted_average"].mean()
            df_wa["weighted_average_std"] = df_wa["weighted_average"].std()
            # calculate the anomalies
            anomalies = df_wa.query(
                "abs(weighted_average - weighted_average_mean) > (3 * weighted_average_std)"
            )
            # store the anomalous rows and the column combination in the results dictionary
            ref_group = " and ".join(cols)
            for _, row in anomalies.iterrows():
                perc = abs(100 - (row["weighted_average"] / row["weighted_average_mean"] * 100))
                aob = "above" if (perc > 100) else "below"
                group = " and ".join(
                    row.drop(
                        labels=[
                            "weighted_average",
                            "weighted_average_mean",
                            "weighted_average_std",
                        ]
                    ).values
                )
                srs = " and ".join(
                    [str(x) for x in row[["weighted_average", "weighted_average_mean"]].values]
                )
                anomaly = {
                    "reference_group": ref_group,
                    "description": f"{group} : score is {perc:.2f}% {aob} others.",
                    "Score_and_reference_score": srs,
                }
                results["anomalies"].append(anomaly)

    return results
So far I've chosen to do an early exit from the loop if df_wa has fewer than 30 rows and move on to the next combination.
But this hasn't reduced the time much, as most of the groupby dataframes have more than 30 rows...
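One idea I'm considering, as an untested sketch that assumes a "weight" column is always present (the unweighted branch would need a separate fallback), is to replace the per-group Python call to weighted_average with column-wise arithmetic, so each groupby does vectorized sums instead of calling a Python function once per group:
import pandas as pd

def grouped_weighted_average(df: pd.DataFrame, cols) -> pd.DataFrame:
    # sum(weight * score) and sum(weight) per group, then divide once
    tmp = df.assign(wx=df["weight"] * df["score"])
    agg = tmp.groupby(list(cols))[["wx", "weight"]].sum()
    return (agg["wx"] / agg["weight"]).rename("weighted_average").reset_index()
The rest of the loop would stay the same; df_wa = grouped_weighted_average(df, cols) would stand in for the groupby().apply(...) block.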
I am writing a program in Python for a data analytics project involving advertisement performance data matched to advertisement characteristics, aimed at identifying high-performing groups of ads that share n similar characteristics. The dataset I am using has individual ads as rows, and characteristic, summary, and performance data as columns. Below is my current code - the actual dataset I am using has 51 columns, 4 of which are excluded, so it is running with 47 choose 4, or 178,365, iterations in the outer loop.
Currently, this code takes ~2 hours to execute. I know that nested for loops can be the source of such a problem, but I do not know why it is taking so long to run, and am not sure how I can modify the inner/outer for loops to improve performance. Any feedback on either of these topics would be greatly appreciated.
import itertools
import pandas as pd
import numpy as np
# Identify Clusters of Rows (Ads) that have a KPI value above a certain threshold
def set_groups(df, n):
    """This function takes a dataframe and a number n, and returns a list of lists.
    Each list is a group of n columns. The list of lists will hold all size n
    combinations of the columns in the dataframe.
    """
    # Create a list of all relevant column names
    columns = list(df.columns[4:])  # exclude first 4 summary columns
    # Create a list of lists, where each list is a group of n columns
    groups = []
    # Create a list of all possible combinations of True/False values
    vals_lst = list(map(list, itertools.product([True, False], repeat=n)))
    for comb in itertools.combinations(columns, n):  # itertools.combinations yields tuples
        groups.append([comb, vals_lst])
    groups = np.array(groups, dtype=object)
    return groups  # len(groups) = (number of non-summary columns) choose n
def identify_clusters(df, KPI, KPI_threshhold, max_size, min_size, groups):
    """
    This function takes in a dataframe, a KPI, a threshhold value, a max and min size,
    and a list of lists of groupings. The function will identify groups of rows in the
    dataframe that have the same values for each column in each list of groupings.
    The function will return a list of lists with each list of groups, the values list,
    and the ad_ids in the cluster.
    """
    # Create a list to hold the results
    output = []
    # Iterate through each list of groups
    for group in groups:
        for vals_lst in group[1]:  # for each pair of groups and associated value matrices
            # Create a temporary dataframe to hold the group of rows with matching values for columns in group
            temp_df = df
            for i in range(len(group[0])):
                # reduce temp_df to only rows that match the values in vals_lst for each column
                temp_df = temp_df[(temp_df[group[0][i]] == vals_lst[i])]
            if temp_df[KPI].mean() > KPI_threshhold:  # if the mean of the KPI for temp_df is above the threshhold
                output.append([group, vals_lst, temp_df['ad_id'].values])  # append the group, vals_lst, and ad_ids
    print(output)
    return output
## Main
df = pd.read_excel('data.xlsx', sheet_name='name')
groups = set_groups(df, 4)
print(len(groups))
identify_clusters(df, 'KPI_var', 0.0015, 6, 4, groups)
Any insight into why the code is taking such a long time to run, and/or any advice on improving the performance of this code would be extremely helpful.
I think your biggest issue is the lines:
temp_df = df
for i in range(len(group[0])):
    temp_df = temp_df[(temp_df[group[0][i]] == vals_lst[i])]
You're filtering the entire dataframe while I think you're only actually interested in the KPI and ad_id columns. You could instead create a rolling mask, something like
mask = pd.Series(True, index=df.index)
for i in range(len(group[0])):
    mask = mask & (df[group[0][i]] == vals_lst[i])
You can then access your subsets something like df[mask][KPI].mean() and df[mask]['ad_id'].values. If you do this, you will avoid copying a huge amount of data on every iteration.
I would also be tempted to simplify the code a little. For example, I believe vals_lst = list(map(list, itertools.product([True, False], repeat=n))) is the same for each group, so I would calculate it once and hold it as a standalone variable rather than add it to every group; this would clean up the group[0], group[1] and group[0][i] references, which were a little hard to track on first reading the code.
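For illustration, a rough sketch of how those two suggestions might fit together (a hypothetical restructuring, not a drop-in replacement for the original functions):
import itertools
import pandas as pd

def identify_clusters(df, KPI, KPI_threshhold, n):
    # Sketch only: combines the shared vals_lst with boolean-mask filtering.
    columns = list(df.columns[4:])                                # skip the summary columns
    vals_lst = list(itertools.product([True, False], repeat=n))   # computed once, not per group
    output = []
    for cols in itertools.combinations(columns, n):
        for vals in vals_lst:
            mask = pd.Series(True, index=df.index)
            for col, val in zip(cols, vals):
                mask &= (df[col] == val)                          # narrow the mask, never copy rows
            if df.loc[mask, KPI].mean() > KPI_threshhold:
                output.append([cols, vals, df.loc[mask, 'ad_id'].values])
    return output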
Comparing the change from iterative filtering to tracking a mask, the mask approach consistently performs better, and the gap increases with data size. With 10,000 rows the results are:
| Method     | Time               | Relative           |
|------------|--------------------|--------------------|
| Original   | 2.900383699918166  | 2.8098094911581533 |
| Using Mask | 1.03223499993328   | 1.0                |
with the following test code:
import random, timeit
import pandas as pd
random.seed(1)
iterations = 1000
data = {hex(i): [random.randint(0, 1) for i in range(10000)] for i in range(52)}
df = pd.DataFrame(data)
kpi_col = hex(1)
# test group of columns with desired values
group = (
(hex(5), 1),
(hex(6), 1),
(hex(7), 1),
(hex(8), 1)
)
def method0():
    tmp = df
    for column, value in group:
        tmp = tmp[tmp[column] == value]
    return tmp[kpi_col].mean()

def method1():
    mask = pd.Series(True, df.index)
    for column, value in group:
        mask = mask & (df[column] == value)
    return df[mask][kpi_col].mean()
assert method0() == method1()
t0 = timeit.timeit(lambda: method0(), number=iterations)
t1 = timeit.timeit(lambda: method1(), number=iterations)
tmin = min((t0, t1))
print(f'| Method | Time | Relative |')
print(f'|------------------ |----------------------|')
print(f'| Original | {t0} | {t0 / tmin} |')
print(f'| Using Mask | {t1} | {t1 / tmin} |')
TL;DR: I understand that .apply() is slow in pandas. However, I have a function that acts on indexes, which I cannot figure out how to vectorize. I want this function to act on two sets of parameters (500,000 and 1,500 long, respectively). I want to produce a dataframe with the first parameter as the row index, the second parameter as column names, and the cells containing the function's output for that particular row and column. As it stands it looks like the code will take several days to run. More details and minimal reproducible example below:
INPUT DATA:
I have a series of unique student IDs, which is 500,000 students long. I have a df (exam_score_df) that is indexed by these student IDs, containing each student's corresponding scores in math, language, history, and science.
I also have a series of school codes (each school code corresponds to a school), which is 1,500 schools long. I have a df (school_weight_df) that is indexed by school codes, containing the school's weights in math, language, history, and science that it uses to calculate a student's scores. Each row also contains a 'Y' or 'N' indexed 'Alternative_Score' because some schools allow you to take the best subject score between history and science to calculate your overall score.
FUNCTION I WROTE TO BE VECTORIZED:
def calc_score(student_ID, program_code):
    '''
    For a given student and program, returns the student's score for that program.
    '''
    if school_weight_df.loc[program_code]['Alternative_Score'] == 'N':
        return np.dot(np.array(exam_score_df.loc[student_ID][['LANG', 'MAT', 'HIST', 'SCI']]),
                      np.array(school_weight_df.loc[program_code][['%LANG', '%MAT', '%HIST', '%SCI']]))
    elif school_weight_df.loc[program_code]['Alternative_Score'] == 'Y':
        history_score = np.dot(np.array(exam_score_df.loc[student_ID][['LANG', 'MAT', 'HIST']]),
                               np.array(school_weight_df.loc[program_code][['%LANG', '%MAT', '%HIST']]))
        science_score = np.dot(np.array(exam_score_df.loc[student_ID][['LANG', 'MAT', 'SCI']]),
                               np.array(school_weight_df.loc[program_code][['%LANG', '%MAT', '%SCI']]))
        return max(history_score, science_score)
EXAMPLE DFs:
Here are example dfs for exam_score_df and school_weight_df:
student_data = [[3, 620, 688, 638, 688], [5, 534, 602, 606, 700], [9, 487, 611, 477, 578]]
exam_score_df = pd.DataFrame(student_data, columns = ['student_ID', 'LANG', 'MAT', 'HIST', 'SCI'])
exam_score_df.set_index('student_ID', inplace = True)
program_data = [[101, 20, 30, 25, 25, 'N'], [102, 40, 10, 50, 50, 'Y']]
school_weight_df = pd.DataFrame(program_data, columns = ['program_code', '%LANG','%MAT','%HIST','%SCI', 'Alternative_Score'])
school_weight_df.set_index('program_code', inplace = True)
Here are the series which are used to index the code below:
series_student_IDs = pd.Series(exam_score_df.index)
series_program_codes = pd.Series(school_weight_df.index)
CODE TO CREATE DF USING FUNCTION:
To create the df of all of the students' scores at each program, I used nested .apply()'s:
new_df = pd.DataFrame(series_student_IDs.apply(lambda x: series_program_codes.apply(lambda y: calc_score(x, y))))
I've already read several primers on optimizing code in pandas, including the very well-written Guide by Sofia Heisler. My primary concern, and the reason I can't figure out how to vectorize this code, is that my function needs to act on indexes. I also have a secondary concern that, even if I do vectorize, there is the problem with np.dot on large matrices, for which I would want to loop anyway.
Thanks for all the help! I have only been coding for a few months, so all the helpful comments are really appreciated.
Apply = bad, Double Apply = very bad
If you are going Numpy in the function, why not go Numpy all the way? You would still prefer a batch-wise approach since the overall matrix would take tons of memory. Check the following approach.
Each iteration took me 2.05 seconds on a batch of 5000 students on a low-end macbook pro. This means for 500,000 students, you can expect 200 seconds approx, which is not half bad.
I ran the following on 100000 students and 1500 schools which took me a total of 30-40 seconds approx.
First I created a dummy data set: exam scores (100,000 students, 4 scores), school weights (1,500 schools, 4 weights), AND a boolean flag for which schools have the alternative as Y or N (Y == True, N == False).
Next, for a batch of 5000 students, I simply calculate the element-wise product of each of the 4 subjects between the 2 matrices using np.einsum. This gives me (5000,4) * (1500,4) -> (1500,5000,4). Consider this as the first part of the dot product (without the sum).
The reason I do this is because this is a necessary step for both your conditions N or Y.
Next, FOR N: I simply filter the above matrix based on alt_flag, reduce it (sum) over the last axis and transpose to get (5000, 766), where 766 is the number of schools with alternative == N.
FOR Y: I filter based on alt_flag, then calculate the sum of the first 2 subjects (because they are common), add that to the 3rd and 4th subject separately, take the max and return that as the final score, then transpose. This gives me (5000, 734).
I do this for each batch of 5000 until I have appended all the batches, and then simply np.vstack to get the final tables (100000, 766) and (100000, 734).
Now I could simply stack these over axis=1 to get (100000, 1500), but if I want to map them to the IDs (students, schools), it is easier to do it separately using pd.DataFrame(data, columns=list_of_schools_alt_Y, index=list_of_student_ids) and then combine them.
The last step is left for you to perform, since I don't have the complete dataset. Since the order of the indexes is retained through batch-wise vectorization, you can simply map the 766 school IDs with N, the 734 school IDs with Y, and the 100000 student IDs, in the order they occur in your main dataset, then append the 2 data frames to create a final (massive) dataframe (a sketch of this follows the code below).
NOTE: you will have to change the 100000 to 500000 in the for loop, don't forget!!
import numpy as np
import pandas as pd
from tqdm import notebook
exam_scores = np.random.randint(300,800,(100000,4))
school_weights = np.random.randint(10,50,(1500,4))
alt_flag = np.random.randint(0,2,(1500,), dtype=bool) #0 for N, 1 for Y
batch = 5000
n_alts = []
y_alts = []
for i in notebook.tqdm(range(0,100000,batch)):
    scores = np.einsum('ij,kj->kij', exam_scores[i:i+batch], school_weights)  # (1500, 5000, 4)
    # Alternative == N
    n_alt = scores[~alt_flag].sum(-1).T  # (766, 5000, 4) -> (5000, 766)
    # Alternative == Y
    lm = scores[alt_flag, :, :2].sum(-1)  # (734, 5000, 2) -> (734, 5000); lang + math
    h = scores[alt_flag, :, 2]            # (734, 5000); history
    s = scores[alt_flag, :, 3]            # (734, 5000); science
    y_alt = np.maximum(lm + h, lm + s).T  # (5000, 734)
    n_alts.append(n_alt)
    y_alts.append(y_alt)
final_n_alts = np.vstack(n_alts)
final_y_alts = np.vstack(y_alts)
print(final_n_alts.shape)
print(final_y_alts.shape)
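For the mapping step described above, something along these lines might work (illustrative only: student_ids and school_codes are placeholders for your real identifiers, in the same order as the data used to build the arrays above):
import numpy as np
import pandas as pd

# Placeholder identifiers standing in for the real student IDs and school codes.
student_ids = np.arange(100000)
school_codes = np.arange(1500)

schools_alt_n = school_codes[~alt_flag]   # schools with Alternative_Score == 'N'
schools_alt_y = school_codes[alt_flag]    # schools with Alternative_Score == 'Y'

df_n = pd.DataFrame(final_n_alts, index=student_ids, columns=schools_alt_n)
df_y = pd.DataFrame(final_y_alts, index=student_ids, columns=schools_alt_y)

# one (100000, 1500) score table, columns restored to school-code order
scores_df = pd.concat([df_n, df_y], axis=1).sort_index(axis=1)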
Starting with a CSV file with the columns ['race_number', 'number_of_horses_bet_on','odds']
I would like to add/calculate an extra column called 'desired_output'.
The 'desired_output' column is computed as follows: for 'race_number' 1, 'number_of_horses_bet_on' = 2, so in the 'desired_output' column only the first 2 'odds' values are kept. The remaining values for 'race_number' 1 are 0. Then we move to 'race_number' 2 and the cycle repeats.
Code I have tried includes:
import pandas as pd
df=pd.read_csv('test.csv')
desired_output=[]
count=0
for i in df.number_of_horses_bet_on:
    for j in df.odds:
        if count < i:
            desired_output.append(j)
            count += 1
        else:
            desired_output.append(0)
print(desired_output)
and also
df['desired_output']=df.odds.apply(lambda x: x if count<number_of_horses_bet_on else 0)
Neither of these gives the correct 'desired_output' column.
I realise the 'count' in the lambda above is misplaced - but hopefully you can see what I am after.
Thanks.
I'm gonna do it a bit differently; this is what I'm gonna do:
get a list of all race_number values
for each race_number, extract the number_of_horses_bet_on
create a list of 1s and 0s, with number_of_horses_bet_on 1s and the rest 0s
multiply this list with the odds column
import pandas as pd
df=pd.read_csv('test.csv')
mask = []
races = df['race_number'].unique().tolist() # unique list of all races
for race in races:
    # filter the dataframe by the race number
    df_race = df[df['race_number'] == race]
    # assuming the number of horses is unique for every race, we extract it here
    number_of_horses = df_race['number_of_horses_bet_on'].iloc[0]
    # this mask will contain a list of 1s and 0s, for example for race 1 it'll be [1,1,0,0,0]
    mask = mask + [1] * number_of_horses + [0] * (len(df_race) - number_of_horses)
df['mask'] = mask
df['desired_output'] = df['mask'] * df['odds']
del df['mask']
print(df)
This assumes that for each race number_of_horses_bet_on is equal to or less than the number of rows for that race; otherwise you might need to use min/max to get proper results.
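If the rows are already ordered within each race, a fully vectorized alternative is also possible (a sketch, assuming the same column names as in the question):
import pandas as pd

df = pd.read_csv('test.csv')

# position of each row within its race: 0, 1, 2, ... per race_number
position_in_race = df.groupby('race_number').cumcount()

# keep the odds for the first number_of_horses_bet_on rows of each race, else 0
df['desired_output'] = df['odds'].where(
    position_in_race < df['number_of_horses_bet_on'], 0
)
print(df)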
I am working on a python project which iterates through all the possible combinations of entries in a row of excel data to find which combination produces the correct output.
To achieve this, I am iterating through different combinations of 0 and 1 to choose whether that entry is required for the combination. 1 meaning data point is included in the calculation and 0 meaning the data point is not included.
The number of combinations would thus be equal to 2 ^ (Number of excel columns)
Example Excel Data:
1, 22, 7, 11, 2, 4
Example Iteration:
(1, 0, 0, 0, 1, 0)
I could be looking for what combination of the excel data would result in an output of 3, the only correct combination of the excel data being the above iteration.
However, I would know that any value greater than 3 would not be included in a possible combination that would equal 3. As such I would like to choose and set the values of these columns to 0 and iterate the other columns only. This would in turn reduce the number of combinations.
Combination = 2 ^ (Number of excel columns - Fixed Entry Columns)
At the moment I am using itertools.product to get all the combinations I need:
Numbers = ["0","1"]
for item in itertools.product(Numbers, repeat=len(df.columns)):
Iteration = pd.DataFrame(item) # Iteration e.g (0,1,1,1,0,0,1)
Data = df.iloc[0] # Excel data row
Data = Data.to_numpy()
Iteration = Iteration.astype(float)
Answer = np.dot(Data, Iteration) # Get the result of (Iteration * Data) to check if answer is correct
This results in iterating through combinations which I know will not work.
Is there a way to only iterate 0's and 1's in certain positions of the combination while keeping the known entries a fixed value (either 0 or 1) to reduce the possible combinations?
Some Excel files have over 25 columns, which would result in 33,554,432 combinations. As such, I am trying to reduce the number of columns I need to iterate over by fixing the values of the columns that I do know.
If you would need further clarification please let me know. I am novice programmer so I may be overlooking or over complicating a simple solution.
Find which columns meet your criteria for exclusion. Then just get the product combinations for the other columns.
One possible method:
from itertools import product
LIMIT=10
column_data = [1, 22, 7, 11, 2, 4]
changeable_indexes = [i for i,x in enumerate(column_data) if x <= LIMIT]
for item in product([0,1], repeat=len(changeable_indexes)):
    row_iteration = [0] * len(column_data)
    for index, value in zip(changeable_indexes, item):
        row_iteration[index] = value
    print(row_iteration)
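To tie this back to checking the target, a hedged example (TARGET and the dot-product check are my assumptions based on the question's description, not part of the original code):
import numpy as np
from itertools import product

TARGET = 3
column_data = [1, 22, 7, 11, 2, 4]
# any column larger than the target can never be part of a valid combination
changeable_indexes = [i for i, x in enumerate(column_data) if x <= TARGET]

for item in product([0, 1], repeat=len(changeable_indexes)):
    row_iteration = [0] * len(column_data)
    for index, value in zip(changeable_indexes, item):
        row_iteration[index] = value
    # keep only the combinations whose selected values sum to the target
    if np.dot(column_data, row_iteration) == TARGET:
        print("match:", row_iteration)  # e.g. [1, 0, 0, 0, 1, 0]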