Running Functions with Multiple Arguments Concurrently and Aggregating Complex Results - python

Set Up
This is part two of a question that I posted regarding accessing results from multiple processes.
For part one click Here: Link to Part One
I have a complex set of data that I need to compare to various sets of constraints concurrently, but I'm running into multiple issues. The first issue is getting results out of my multiple processes, and the second is getting anything beyond an extremely simple function to run concurrently.
Example
I have multiple sets of constraints that I need to compare against some data, and I would like to do this concurrently because I have a lot of sets of constraints. In this example I'll just be using two sets of constraints.
Jupyter Notebook
Create Some Sample Constraints & Data
import pandas as pd
# Create a set of constraints
constraints = pd.DataFrame([['2x2x2', 2,2,2],['5x5x5',5,5,5],['7x7x7',7,7,7]],
                           columns=['Name','First', 'Second', 'Third'])
constraints.set_index('Name', inplace=True)
# Create a second set of constraints
constraints2 = pd.DataFrame([['4x4x4', 4,4,4],['6x6x6',6,6,6],['7x7x7',7,7,7]],
                            columns=['Name','First', 'Second', 'Third'])
constraints2.set_index('Name', inplace=True)
# Create some sample data
items = pd.DataFrame([['a', 2,8,2],['b',5,3,5],['c',7,4,7]], columns=['Name','First', 'Second', 'Third'])
items.set_index('Name', inplace=True)
Running Sequentially
If I run this sequentially I can get my desired results, but with the data that I am actually dealing with it can take over 12 hours. Here is what it looks like when run sequentially, so that you know what my desired result is.
# Function
def seq_check_constraint(df_constraints_input, df_items_input):
    df_constraints = df_constraints_input.copy()
    df_items = df_items_input.copy()
    df_items['Product'] = df_items.product(axis=1)
    df_constraints['Product'] = df_constraints.product(axis=1)
    for constraint in df_constraints.index:
        df_items[constraint+'Product'] = df_constraints.loc[constraint,'Product']
    for constraint in df_constraints.index:
        for item in df_items.index:
            col_name = constraint+'_fits'
            df_items[col_name] = False
            df_items.loc[df_items['Product'] < df_items[constraint+'Product'], col_name] = True
    df_res = df_items.iloc[:, 7:]
    return df_res
constraint_sets = [constraints, constraints2, ...]
results = {}
counter = 0
for df in constraint_sets:
    res = seq_check_constraint(df, items)
    results['constraints'+str(counter)] = res
    counter += 1
or uglier:
df_res1 = seq_check_constraint(constraints, items)
df_res2 = seq_check_constraint(constraints2, items)
results = {'constraints0':df_res1, 'constraints1': df_res2}
As a result of running these sequentially I end up with DataFrames like the ones shown here:
I'd ultimately like to end up with a dictionary or list of the DataFrames, or be able to append the DataFrames all together. The order in which I get the results doesn't matter to me; I just want to have them all together and need to be able to do further analysis on them.
What I've Tried
This brings me to my attempts at multiprocessing. From what I understand, you can use either Queues or Managers to handle shared data and memory, but I haven't been able to get either to work. I am also struggling to get my function, which takes two arguments, to execute within the Pool at all.
Here is my code as it stands right now using the same sample data from above:
Function
def check_constraint(df_constraints_input, df_items_input):
    df_constraints = df_constraints_input.copy()
    df_items = df_items_input.copy()
    df_items['Product'] = df_items.product(axis=1) # Mathematical Product
    df_constraints['Product'] = df_constraints.product(axis=1)
    for constraint in df_constraints.index:
        df_items[constraint+'Product'] = df_constraints.loc[constraint,'Product']
    for constraint in df_constraints.index:
        for item in df_items.index:
            col_name = constraint+'_fits'
            df_items[col_name] = False
            df_items.loc[df_items['Product'] < df_items[constraint+'Product'], col_name] = True
    df_res = df_items.iloc[:, 7:]
    return df_res
Jupyter Notebook
df_manager = mp.Manager()
df_ns = df_manager.Namespace()
df_ns.constraint_sets = [constraints, constraints2]
print('---Starting pool---')
if __name__ == '__main__':
    with mp.Pool() as p:
        print('--In the pool--')
        res = p.map_async(mpf.check_constraint, (df_ns.constraint_sets, itertools.repeat(items)))
        print(res.get())
and my current error:
TypeError: check_constraint() missing 1 required positional argument: 'df_items_input'

The easiest way is to create a list of tuples (where each tuple represents one set of arguments to the function) and pass it to starmap. Note that starmap returns the list of results directly, so there is no .get() call as there is with map_async.
df_manager = mp.Manager()
df_ns = df_manager.Namespace()
df_ns.constraint_sets = [constraints, constraints2]
print('---Starting pool---')
if __name__ == '__main__':
    with mp.Pool() as p:
        print('--In the pool--')
        # One (constraints, items) tuple per call to check_constraint
        check_constraint_args = []
        for constraint in df_ns.constraint_sets:
            check_constraint_args.append((constraint, items))
        res = p.starmap(mpf.check_constraint, check_constraint_args)
        print(res)
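To see why map fails where starmap works, here is a minimal sketch under some loud assumptions: the check_constraint below is a hypothetical stand-in that compares plain integers rather than DataFrames, and multiprocessing.pool.ThreadPool is used only because it exposes the same map/starmap interface as mp.Pool while running anywhere without pickling concerns. In a script, mp.Pool under the `__main__` guard would take its place.

```python
from multiprocessing.pool import ThreadPool

def check_constraint(limit, item):
    """Hypothetical stand-in: does the item fit under the limit?"""
    return item < limit

# Named constraint sets (plain numbers here instead of DataFrames)
constraint_sets = {"constraints0": 8, "constraints1": 125}
item = 32

# One (limit, item) tuple per call
args = [(limit, item) for limit in constraint_sets.values()]

with ThreadPool(2) as pool:
    # starmap unpacks each tuple into two positional arguments
    fits = pool.starmap(check_constraint, args)
    # map passes each tuple as ONE argument, reproducing the TypeError
    try:
        pool.map(check_constraint, args)
        map_error = None
    except TypeError as exc:
        map_error = str(exc)

# starmap preserves input order, so names can be zipped back on
results = dict(zip(constraint_sets, fits))
print(results)
print(map_error)
```

Because starmap returns results in input order, zipping them with the names gives the dictionary of results that the question asks for.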

Related

Grouping a dataframe and performing operations on the resulting matrix in a parallelized manner using Python/Dask/multiprocessing?

I am working on a project where I need to group molecules in a database by their ID and perform operations on the resulting matrix. I am using Python and I want to improve performance by parallelizing the process.
I am currently loading the molecules from an SDF file and storing them in a Pandas dataframe. Each molecule has an ID, a unique Pose ID, and a unique Structure. My goal is to group the dataframe by ID and create a matrix for each ID group. The rows and columns of the matrix would correspond to the unique Pose IDs of the molecules in that ID group. Then, I can calculate values for each cell in the matrix, such as the similarity between the molecules that define that cell. However, the specific operations on the molecules are not important for this question. I am primarily asking for advice on how to set up such a system for parallelized computing using Dask or Multiprocessing, or if there are other better options.
Here is a gist of the version without any parallelisation (please note I have heavily modified it to make my question clearer; the code below outputs the desired things, but I am looking to calculate the cells on molecules, not the Pose ID): https://gist.github.com/Tonylac77/abfd54b1ceef39f0d161fb6b21950edb
#Generate sample dataframe
import pandas as pd
df = pd.DataFrame(columns=['ID', 'Pose ID'])
ids = ['ID' + str(i) for i in range(1, 6)]
pose_ids = ['Pose ' + str(i) for i in range(1, 11)]
# For each ID, add 10 rows to the dataframe with the corresponding Pose IDs
df_list = []
for i in ids:
    temp_df = pd.DataFrame({'ID': [i] * 10, 'Pose ID': pose_ids})
    df_list.append(temp_df)
df = pd.concat(df_list)
print(df)
################
from tqdm import tqdm
import itertools
import functools
import numpy as np
from IPython.display import display
def full_matrix_calculation(df):
    #Here I am using just string concatenation as an example calculation, in reality i am calling external functions
    def matrix_calculation(df, id_list):
        matrices = {}
        calc_dataframes = []
        for id in tqdm(id_list):
            df_name = df[df['ID']==id]
            df_name.index = range(len(df_name['Pose ID']))
            matrix = pd.DataFrame(0.0, index=[df_name['Pose ID']], columns=df_name['Pose ID'])
            for subset in itertools.combinations(df_name['Pose ID'], 2):
                result = subset[0]+subset[1]
                matrix.iloc[df_name[df_name['Pose ID']==subset[0]].index.values, df_name[df_name['Pose ID']==subset[1]].index.values] = result
                matrix.iloc[df_name[df_name['Pose ID']==subset[1]].index.values, df_name[df_name['Pose ID']==subset[0]].index.values] = result
            matrices[id] = matrix
        return matrices
    id_list = np.unique(np.array(df['ID']))
    calculated_dfs = matrix_calculation(df, id_list)
    return calculated_dfs
calculated_dfs = full_matrix_calculation(df)
display(calculated_dfs)
I have tried using multiprocessing, however, my implementation seems to be slower than the non-parallelized version : https://gist.github.com/Tonylac77/b4bbada97ee2bab7c37d4a29079af574
import multiprocessing
def function(tuple):
    return tuple[0]+tuple[1]
def full_matrix_calculation(df):
    #Here I am using just string concatenation as an example calculation, in reality i am calling external functions
    def matrix_calculation(df, id_list):
        matrices = {}
        calc_dataframes = []
        for id in tqdm(id_list):
            df_name = df[df['ID']==id]
            df_name.index = range(len(df_name['Pose ID']))
            matrix = pd.DataFrame(0.0, index=[df_name['Pose ID']], columns=df_name['Pose ID'])
            # A new pool is created for every ID
            with multiprocessing.Pool() as p:
                try:
                    results = p.map(function, itertools.combinations(df_name['Pose ID'], 2))
                except KeyError:
                    print('Incorrect clustering method selected')
                    return
            results_list = list(zip(itertools.combinations(df_name['Pose ID'], 2), results))
            for subset, result in results_list:
                matrix.iloc[df_name[df_name['Pose ID']==subset[0]].index.values, df_name[df_name['Pose ID']==subset[1]].index.values] = result
                matrix.iloc[df_name[df_name['Pose ID']==subset[1]].index.values, df_name[df_name['Pose ID']==subset[0]].index.values] = result
            matrices[id] = matrix
            for subset in itertools.combinations(df_name['Pose ID'], 2):
                result = subset[0]+subset[1]
                matrix.iloc[df_name[df_name['Pose ID']==subset[0]].index.values, df_name[df_name['Pose ID']==subset[1]].index.values] = result
                matrix.iloc[df_name[df_name['Pose ID']==subset[1]].index.values, df_name[df_name['Pose ID']==subset[0]].index.values] = result
            matrices[id] = matrix
        return matrices
    id_list = np.unique(np.array(df['ID']))
    calculated_dfs = matrix_calculation(df, id_list)
    return calculated_dfs
calculated_dfs = full_matrix_calculation(df)
display(calculated_dfs)
I have also started playing around with Dask, however the main issue I'm facing is that I need all of the values of one ID to be in the same dask partition, otherwise I will have incomplete matrices (if I understand correctly at least). I have tried to find a solution to this (like chunking in x partitions etc) but so far to no avail. Will update this thread if something changes.
Any advice welcome to speed these calculations up. For reference, the actual datasets I'm working contain ~10000 unique IDs and ~300000 Pose IDs. With the calculations I'm running on the molecules, some of these are taking 40h to complete.
This should be pretty straightforward using Dask DataFrame and groupby:
ddf = your_dataframe_as_dask
def matrix_calculation(df):
    # df holds all rows of one ID group; reset the index so it can be used positionally with iloc
    df = df.reset_index(drop=True)
    matrix = pd.DataFrame(0.0, index=[df['Pose ID']], columns=df['Pose ID'])
    for subset in itertools.combinations(df['Pose ID'], 2):
        result = subset[0]+subset[1]
        matrix.iloc[df[df['Pose ID']==subset[0]].index.values, df[df['Pose ID']==subset[1]].index.values] = result
        matrix.iloc[df[df['Pose ID']==subset[1]].index.values, df[df['Pose ID']==subset[0]].index.values] = result
    return matrix
ddf.groupby('ID').apply(matrix_calculation).compute()
See https://examples.dask.org/dataframes/02-groupby.html#Groupby-Apply.
This will parallelize the work for each ID.
You might then want to look at https://docs.dask.org/en/stable/scheduling.html to choose the scheduler that suits your needs (the default for DataFrame is threads, which might not be efficient depending on your code).
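For comparison, the same "one task per ID" split can be sketched with the standard library alone. This is an illustration under assumptions, not the asker's code: pair_matrix and the sample rows are hypothetical, the matrix is a plain dict of dicts instead of a DataFrame, and ThreadPoolExecutor is used so the sketch runs anywhere. For CPU-bound work in a script, ProcessPoolExecutor under a `__main__` guard would be the likely substitute. Dispatching whole groups rather than single pairs keeps the number of tasks small, which may help with the overhead seen in the multiprocessing attempt.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

def pair_matrix(pose_ids):
    """Build a symmetric dict-of-dicts matrix for one ID group."""
    matrix = {row: {col: None for col in pose_ids} for row in pose_ids}
    for first, second in itertools.combinations(pose_ids, 2):
        value = first + second  # placeholder for the real calculation
        matrix[first][second] = value
        matrix[second][first] = value
    return matrix

# Hypothetical (ID, Pose ID) rows
rows = [
    ("ID1", "Pose 1"), ("ID1", "Pose 2"), ("ID1", "Pose 3"),
    ("ID2", "Pose 1"), ("ID2", "Pose 2"),
]

# Collect all Pose IDs of the same ID into one group
groups = {}
for mol_id, pose_id in rows:
    groups.setdefault(mol_id, []).append(pose_id)

# One task per ID; executor.map returns results in input order
with ThreadPoolExecutor() as executor:
    matrices = dict(zip(groups, executor.map(pair_matrix, groups.values())))

print(matrices["ID1"]["Pose 1"]["Pose 3"])
```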

How to loop through a list of BQ tables in Python?

I've a list of BQ tables that I'd like to use one at a time. The purpose is to process each table individually, perform some action (in my example, score the dataset for a previously fitted model), then compute, append, and save the probabilities in the all scores list.
Here's a screenshot of the entire code snippet.
# List of BQ Table
scoring_tables = ["`Customer_Analytics.HS_AF_PROPERTY_ALL_DATA_SCORE_PV_CLEANED_01`",
"`Customer_Analytics.HS_AF_PROPERTY_ALL_DATA_SCORE_PV_CLEANED_02`",
"`Customer_Analytics.HS_AF_PROPERTY_ALL_DATA_SCORE_PV_CLEANED_03`",
"`Customer_Analytics.HS_AF_PROPERTY_ALL_DATA_SCORE_PV_CLEANED_04`",
"`Customer_Analytics.HS_AF_PROPERTY_ALL_DATA_SCORE_PV_CLEANED_05`"]
# List to store probabilities/scores
all_scores = []
# Loop through each BQ tables, calculate, append and store the probabilities in the all_score = []
for t in scoring_tables:
%%bigquery property_data_score_00
SELECT * FROM t
score_data = property_data_score_00.copy()
score_data.set_index('HH_ID',inplace = True)
# Fixing for HOUSE_INCOME = 0 & AGE = 0 based on means
score_data['HOUSE_INCOME'] = np.where(score_data['HOUSE_INCOME']==0,107,score_data['HOUSE_INCOME'])
score_data['AGE'] = np.where(score_data['AGE']==0,54,score_data['AGE'])
# recategorize PROP_EXTR_WALL_TYPE | PROP_GRG_TYPE | PROP_ROOF_TYPE
condition = [(score_data['PROP_EXTR_WALL_TYPE'].str.contains("BRICK")),
(score_data['PROP_EXTR_WALL_TYPE'].str.contains("WOOD")),
(score_data['PROP_EXTR_WALL_TYPE'].str.contains("CONCRETE")),
(score_data['PROP_EXTR_WALL_TYPE'].str.contains("METAL")),
(score_data['PROP_EXTR_WALL_TYPE'].str.contains("STEEL"))]
choice = ["BRICK","WOOD","CONCRETE","METAL","METAL"]
score_data['PROP_EXTR_WALL_TYPE_MOD'] = np.select(condition,choice,default="OTHERS")
condition = [(score_data['PROP_GRG_TYPE'].str.contains("ATTACHED")),
(score_data['PROP_GRG_TYPE'].str.contains("DETACHED")),
(score_data['PROP_GRG_TYPE'].str.contains("CARPORT")),
(score_data['PROP_GRG_TYPE'].str.contains("BASEMENT"))]
choice = ["ATTACHED","DETACHED","CARPORT","BASEMENT"]
score_data['PROP_GRG_TYPE_MOD'] = np.select(condition,choice,default="OTHERS")
condition = [(score_data['PROP_ROOF_TYPE'].str.contains("GABLE")),
(score_data['PROP_ROOF_TYPE'].str.contains("HIP")),
(score_data['PROP_ROOF_TYPE'].str.contains("GAMBREL"))]
choice = ["GABLE","HIP","GAMBREL"]
score_data['PROP_ROOF_TYPE_MOD'] = np.select(condition,choice,default="OTHERS")
# one-hot encoding
to_encode = ["IND_ETHNICITY","IND_GENDER","IND_MOVERS_FLAG","IND_OCCUPATION","IND_REGION","PROP_EXTR_WALL_TYPE","PROP_GRG_TYPE","PROP_ROOF_TYPE","TSI"]
score_data_dm = pd.get_dummies(data = score_data, columns = to_encode, drop_first = False)
columns_not_in_score_data_dm = [c for c in train_X.columns if c not in score_data_dm.columns] #columns which might not get produced during pd.get_dummies(data = score_data....), if categories are not available in score data
score_data_dm[columns_not_in_score_data_dm] = 0 #initializing above columns as 0
score_data_dm_filt = score_data_dm[select_columns] # making sure to select only the columns which are in the train_X
y_pred = xgb_prop_PV.predict_proba(score_data_dm_filt)[:,1] #final scoring
all_scores = all_scores + y_pred
Inside the looping, I'm having trouble with the SELECT * FROM t step. The error is shown below. I believe the indent within the loop is causing the %% bigquery step to fail. I looked at itertools here, however it appears that it is only useful when conditional looping is there, which is not the case in my situation.
Also, this appears to be a complex approach; is there a more elegant solution? Because the table was too large (600GB), it needed to be split into smaller datasets, so we tried this method. PS: It works without the loop if run for one table at a time. But its quite a manual effort.
Thanks,
Piyush
To address the error message: if you are going to pass a variable into a query using the BigQuery magics command, you need to use the --params flag. Note that query parameters substitute values, not identifiers, so @t cannot stand in for a table name in a FROM clause.
Your code should end up looking something like this:
t="table_name"
my_params = {"t": t}
%%bigquery --params $my_params
select @t
You may consider trying the approach below.
Instead of using BigQuery magics, you may use the BigQuery Client Library for Python. From there, you can loop over your list of tables by using an f-string, as shown in the sample code below.
from google.cloud import bigquery
bqclient = bigquery.Client()
scoring_tables = ["`your-project-id.your-dataset.test_table1`",
                  "`your-project-id.your-dataset.test_table2`","`your-project-id.your-dataset.test_table3`"]
for t in scoring_tables:
    # Download query results.
    query_string = f"""
    SELECT * FROM {t}
    """
    property_data_score_00 = (
        bqclient.query(query_string)
        .result()
        .to_dataframe(
            # Optionally, explicitly request to use the BigQuery Storage API. As of
            # google-cloud-bigquery version 1.26.0 and above, the BigQuery Storage
            # API is used by default.
            create_bqstorage_client=True,
        )
    )
    score_data = property_data_score_00.copy()
    score_data.set_index('HH_ID',inplace = True)
    print(score_data)
Sample output of above code:

Using Multiprocessing on a Collection of DataFrames

Setup
I have multiple datasets, each with its own DataFrame. I'm running calculations within them before comparing my results to a separate DataFrame, which we can think of as constraints.
For example, let's say there are 2 sets of data in a dictionary:
df_data_1 = pd.DataFrame(np.random.randint(0,50,size=(10, 4)), columns=list('ABCD'))
df_data_2 = pd.DataFrame(np.random.randint(0,50,size=(10, 4)), columns=list('ABCD'))
data_sets = {'data_1': df_data_1, 'data_2': df_data_2}
and one set of constraints:
df_constraints = pd.DataFrame([['a', 10, 20, 10000000],
                               ['b', 100, 200, 20000000],
                               ['c', 1000, 2000, 30000000]])
df_constraints.columns = ['index', 'sumMin', 'sumMax', 'productMax']
df_constraints.set_index('index', inplace=True)
Visually:
data_set_1
data_set_2
constraints
Function
I'm making calculations within each set of data and then comparing them to a set of constraints. For the sake of simplifying my question I am only comparing the data to the first row of constraints here, but in reality I have to compare the results of my calculations within each data-set to up to 20 sets of constraints.
Here is a simplified version of the function that I am trying to have run in parallel:
def test_func(df_data, df_constraints):
    # Run some calculations
    df = df_data.copy()
    df['sum'] = df.sum(axis=1)
    df['product'] = df.product(axis=1)
    # Compare results to constraints
    df['sumFit'] = ((df['sum'] > df_constraints.loc['a', 'sumMin']) &
                    (df['sum'] < df_constraints.loc['a', 'sumMax']))
    df['productFit'] = df['product'] < df_constraints.loc['a', 'productMax']
    # Analyze results
    count_sumFits = df['sumFit'].sum()
    count_productFits = df['productFit'].sum()
    df_results = pd.DataFrame([['data_set_1', count_sumFits, count_productFits]],
                              columns=['DataSet', 'FittingSums', 'FittingProducts'])
    df_results.set_index('DataSet', inplace=True)
    return df_results
Sequential Version
I can run this function sequentially through each set of data; iterating through the dictionary with a while loop and then append the results as shown here, but with increased complexity this is taking way longer than I would like. (It's ugly but it works)
n=0
while n < len(data_sets):
    data_set_names = list(data_sets.keys())
    df_temp = test_func(data_sets[data_set_names[n]], df_constraints)
    df_all_results.loc[n, 'FittingSums'] = df_temp.loc[0, 'FittingSums']
    df_all_results.loc[n, 'FittingProducts'] = df_temp.loc[0, 'FittingProducts']
    n+=1
The Problem
When I have 25 data-sets and I'm running more complex analysis with more calculations, the run time ends up being minutes long, which leads me to pursue concurrency/multiprocessing. I'm hoping to make this significantly faster, as it is one step of many that I'm trying to optimize and then run a few thousand times.
So, Multiprocessing...
Because the function needs two arguments, I've been looking at mp.Pool.starmap and pool.map(partial(test_func, b=df_constraints), data_sets), but I haven't been able to get either method to work.
ex.1) mp.Pool.starmap
if __name__ == '__main__':
    pool = mp.Pool(processes = 8)
    output = pool.starmap(test_file.test_func, zip(data_sets, itertools.repeat(df_constraints)))
This is as far as I've been able to get. Is it possible to process data concurrently like this and then append results to a dataframe? I don't need them to be in any particular order I just want to get the data into the right format.
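I don't fully understand your code and your logic, but replace data_sets with data_sets.values(). Iterating over a dictionary yields its keys, so zip(data_sets, ...) passes the name strings to test_func instead of the DataFrames:
if __name__ == '__main__':
    pool = mp.Pool(processes = 8)
    output = pool.starmap(test_file.test_func, zip(data_sets.values(), itertools.repeat(df_constraints)))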
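The functools.partial route mentioned in the question can also work, provided the keyword given to partial matches the function's actual parameter name. The sketch below illustrates that under assumptions: summarize, limit, and the small lists are hypothetical stand-ins for test_func and the DataFrames, and ThreadPool is used only because it offers the same map interface as mp.Pool while running anywhere. In a script, mp.Pool under the `__main__` guard would take its place.

```python
from functools import partial
from multiprocessing.pool import ThreadPool

def summarize(data, limit):
    """Hypothetical stand-in: count the values below the limit."""
    return sum(1 for value in data if value < limit)

data_sets = {"data_1": [3, 12, 7], "data_2": [20, 1, 15]}
limit = 10

# Iterating a dict yields its keys, not its values
keys_seen = list(data_sets)

with ThreadPool(2) as pool:
    # partial fixes the second argument by its real parameter name
    counts = pool.map(partial(summarize, limit=limit), data_sets.values())

# map preserves input order, so the keys can be zipped back on
results = dict(zip(data_sets, counts))
print(keys_seen)
print(results)
```

Collecting the per-dataset results into a dictionary first and building the final DataFrame once afterwards avoids appending row by row.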

Parallelize pandas with multiple class instances

I am trying to figure out how to run a large problem on multiple cores. I am struggling with splitting a dataframe to the different processes.
I have a class as follows:
class Pergroup():
    def __init__(self, groupid):
        ...
    def process_datapoint(self, df_in, group):
        ...
My data is a time-series, and contains events that can be grouped using the groupid column. I create an instance of the class for each group as so:
for groupname in df_in['groupid'].unique():
    instance_names.append(groupname)
holder = {name: Pergroup(name) for name in instance_names}
Now, for each timestamp in the dataframe, I want to call the corresponding instance (based on the group), and pass to it the dataframe at that timestamp.
I have tried the following, which does not seem to parallelize as I expect:
for val in range(0, len(df_in)):
    current_group = df_in['groupid'][val]
    current_df = df_in.ix[val]
    with concurrent.futures.ProcessPoolExecutor() as executor:
        executor.map(holder[current_group].process_datapoint, current_df, current_group)
I have also tried using this, which splits the df into its columns, when calling the instances:
Parallel(n_jobs=-1)(map(delayed(holder[current_group].process_datapoint), current_df, current_group))
How should I break up the dataframe such that I can still call the right instance with the right data? Basically, I am attempting to run a loop as below, with the last line running in parallel:
for val in range(0, len(df_in)):
    current_group = df_in['groupid'][val]
    current_df = df_in.ix[val]
    holder[current_group].process_datapoint(current_df, current_group) #This call should be initiated in as many cores as possible.
A slightly different approach using Pool:
import pandas as pd
from multiprocessing import Pool
# First make sure each process has its own data
groups = df_in['groupid'].unique()
data = [(group_id, holder[group_id], df_in[df_in['groupid'] == group_id])
        for group_id in groups]
# Prepare a function that can take this data as input
def help_func(current_group, holder, current_df):
    return holder.process_datapoint(current_df, current_group)
# Run in parallel; starmap unpacks each tuple into help_func's three arguments
with Pool(processes=4) as p:
    p.starmap(help_func, data)
I had a similar problem at some point. As I can't completely adapt it to your question, I hope you can transpose it and make it fit your problem:
import math
import multiprocessing
import pandas as pd
from joblib import Parallel, delayed
maxbatchsize = 10000 #limit the amount of data dispatched to each core
ncores = -1 #number of cores to use
data = pd.DataFrame() #<<<- your dataframe
class DFconvoluter():
    def __init__(self, myparam):
        self.myparam = myparam
    def __call__(self, df):
        return df.apply(lambda row: row['somecolumn']*self.myparam, axis=1)
nbatches = max(math.ceil(len(data)/maxbatchsize), ncores)
g = GenStrategicGroups( data['Key'].values, nbatches ) #a vector telling which row should be dispatched to which batch (GenStrategicGroups is defined below)
#-- parallel part
def applyParallel(dfGrouped, func):
    retLst = Parallel(n_jobs=ncores)(delayed(func)(group) for _, group in dfGrouped)
    return pd.concat(retLst)
out = applyParallel(data.groupby(g), DFconvoluter(42))
What is left is to write how you'd like to group the batches together. For me this had to be done so that rows with similar values in the 'Key' column stay together:
def GenStrategicGroups(stratify, ngroups):
    ''' Generate a list of integers in a grouped sequence,
    where grouped levels in stratify are preserved.
    '''
    g = []
    nelpg = float(len(stratify)) / ngroups
    prev_ = None
    grouped_idx = 0
    for i,s in enumerate(stratify):
        if i > (grouped_idx+1)*nelpg:
            if s != prev_:
                grouped_idx += 1
        g.append(grouped_idx)
        prev_ = s
    return g
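A small worked run may make the batching rule easier to follow. The sketch below restates the function above under a hypothetical snake_case name, gen_strategic_groups, so that it is self-contained, and applies it to a made-up, pre-sorted key column. It assumes, as the original appears to, that equal keys are already adjacent.

```python
def gen_strategic_groups(stratify, ngroups):
    """Assign a batch index to each row, never splitting a run of equal keys."""
    g = []
    nelpg = float(len(stratify)) / ngroups  # target number of rows per batch
    prev_ = None
    grouped_idx = 0
    for i, s in enumerate(stratify):
        # Move to the next batch only once the target size is exceeded
        # AND the key changes, so equal adjacent keys stay together.
        if i > (grouped_idx + 1) * nelpg:
            if s != prev_:
                grouped_idx += 1
        g.append(grouped_idx)
        prev_ = s
    return g

# Hypothetical key column, already sorted so equal keys are adjacent
keys = ["a", "a", "a", "b", "b", "c", "c", "c", "c", "d"]
batches = gen_strategic_groups(keys, 3)
print(batches)

# Each key ends up in exactly one batch
batches_per_key = {}
for key, batch in zip(keys, batches):
    batches_per_key.setdefault(key, set()).add(batch)
print(batches_per_key)
```

The batches are not equal in size: the first one grows past the target of about 3.3 rows because the run of "b" keys is not allowed to split.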

Automating Python Task

I would like to automate the below python code to be applied to different dataframes.
df_twitter = pd.read_csv('merged_watsonTwitter.csv')
df_original = pd.read_csv('merged_watsonOriginal.csv')
sample_1_twitter = df_twitter['ID_A'] == "08b56ebc-8eae-41b3-9c86-c79e3be542fd"
sample_1_twitter = df_twitter[sample_1_twitter]
sample_1_original = df_original['ID_B'] == "08b56ebc-8eae-41b3-9c86-c79e3be542fd"
sample_1_original = df_original[sample_1_original]
sample_1_twit_trunc = sample_1_twitter[['raw_score_parent_A','raw_score_child_A']]
sample_1_ori_trunc = sample_1_original[['raw_score_parent_B','raw_score_child_B']]
sample_1_twit_trunc.reset_index(drop=True, inplace=True)
sample_1_ori_trunc.reset_index(drop=True, inplace=True)
sample_1 = pd.concat([sample_1_twit_trunc, sample_1_ori_trunc], axis=1)
sample_1['ID'] = '08b56ebc-8eae-41b3-9c86-c79e3be542fd'
stats.ttest_rel(sample_1['raw_score_child_B'], sample_1['raw_score_child_A'])
For example, the above code that indicates the ID "08b56ebc-8eae-41b3-9c86-c79e3be542fd" is of a specific individual. If I am to calculate the T-test for all the samples I have, then I'll need to keep replacing the different ID's for everyone by copying and pasting it on the code above.
Is there a method to automate this process whereby these sections;
df_twitter['ID_A'] == "08b56ebc-8eae-41b3-9c86-c79e3be542fd"
df_original['ID_B'] == "08b56ebc-8eae-41b3-9c86-c79e3be542fd"
sample_1['ID'] = '08b56ebc-8eae-41b3-9c86-c79e3be542fd'
could accept all the ID's I have and automate the entire process.
At the end, save each result output generated by this function as well:
stats.ttest_rel(sample_1['raw_score_child_B'], sample_1['raw_score_child_A'])
As Klaus mentioned, you need a function that takes arguments. You can simply place your code in a function. You may want to store your IDs in a list or any other iterable collection. You can also store the t-test results in a list.
ids = ["08b56ebc-8eae-41b3-9c86-c79e3be542fd","08b56ebc-8eae-41b3-9c86-c79e3be542f4"]
def runTTest (id,df_twitter,df_original):
    sample_1_twitter = df_twitter['ID_A'] == id
    sample_1_twitter = df_twitter[sample_1_twitter]
    sample_1_original = df_original['ID_B'] == id
    sample_1_original = df_original[sample_1_original]
    sample_1_twit_trunc = sample_1_twitter[['raw_score_parent_A','raw_score_child_A']]
    sample_1_ori_trunc = sample_1_original[['raw_score_parent_B','raw_score_child_B']]
    sample_1_twit_trunc.reset_index(drop=True, inplace=True)
    sample_1_ori_trunc.reset_index(drop=True, inplace=True)
    sample_1 = pd.concat([sample_1_twit_trunc, sample_1_ori_trunc], axis=1)
    sample_1['ID'] = id
    return stats.ttest_rel(sample_1['raw_score_child_B'], sample_1['raw_score_child_A'])
t_test_results=[]
for id in ids:
    t_test_results.append(runTTest(id,df_twitter ,df_original))
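If the results need to be looked up by ID afterwards, a dictionary keyed by ID may be more convenient than a list. The sketch below shows that pattern using only the standard library. It rests on assumptions: the rows are made-up data, and paired_t_statistic is a hypothetical stand-in that computes only the t statistic of a paired test (mean of the differences divided by their standard error), without the p-value that stats.ttest_rel also returns.

```python
import math
import statistics
from collections import defaultdict

def paired_t_statistic(first, second):
    """t statistic of a paired t-test on first - second (no p-value)."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Hypothetical rows: (ID, raw_score_child_A, raw_score_child_B)
rows = [
    ("id-1", 1.0, 2.0), ("id-1", 2.0, 4.0), ("id-1", 3.0, 6.0),
    ("id-2", 5.0, 4.0), ("id-2", 7.0, 5.0), ("id-2", 9.0, 6.0),
]

# Group the paired scores by ID
scores_by_id = defaultdict(lambda: ([], []))
for sample_id, score_a, score_b in rows:
    scores_by_id[sample_id][0].append(score_a)
    scores_by_id[sample_id][1].append(score_b)

# One result per ID, with B first and A second as in the question
t_by_id = {
    sample_id: paired_t_statistic(b_scores, a_scores)
    for sample_id, (a_scores, b_scores) in scores_by_id.items()
}
print(t_by_id)
```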
