My data comes from BigQuery and is exported to a GCS bucket as CSV files. If the export is large, BigQuery automatically splits the data into several chunks, so a single time series may be scattered across different files. I have a custom function that I want to apply to each TimeseriesID.
Here are some constraints on the data:
The data is sorted by TimeseriesID and TimeID.
The number of rows in each file may vary, but is at minimum 1 row (which is very unlikely).
The starting TimeID is not always 0.
The length of each time series may vary, but a series is scattered across at most 2 files. No time series is scattered across 3 different files.
Here's the initial setup to illustrate the problem:
# Please take note this is just for simplicity. The actual goal is not to calculate mean for all group, but to apply a custom_func to each Timeseries ID
def custom_func(x):
    return np.mean(x)
# Please take note this is just for simplicity. In actual, I read the file one by one since reading all the data is not possible
df1 = pd.DataFrame({"TimeseriesID":['A','A','A','B'],"TimeID":[0,1,2,4],"value":[10,20,5,30]})
df2 = pd.DataFrame({"TimeseriesID":['B','B','B','C'],"TimeID":[5,6,7,8],"value":[10,20,5,30]})
df3 = pd.DataFrame({"TimeseriesID":['C','D','D','D'],"TimeID":[9,1,2,3],"value":[10,20,5,30]})
This would be pretty trivial if I could just concat all the files, but the concatenated dataframe won't fit in memory.
The output I want should be similar to this, but without concatenating all the files:
pd.concat([df1,df2,df3],axis=0).groupby('TimeseriesID').agg({"value":custom_func})
I'm also aware of vaex and dask, but I want to stick with plain pandas for the time being.
I'm also open to solutions that involve modifying the BigQuery export to split the files better.
The approach presented by the OP, using concat with millions of records, would be too demanding on memory and other resources.
I tested the OP's code in a Google Colab notebook, and it is not a good approach:
import pandas as pd
import numpy as np
import time
# Please take note this is just for simplicity. The actual goal is not to calculate mean for all group, but to apply a custom_func to each Timeseries ID
def custom_func(x):
    return np.mean(x)
# Please take note this is just for simplicity. In actual, I read the file one by one since reading all the data is not possible
df1 = pd.DataFrame({"TimeseriesID":['A','A','A','B'],"TimeID":[0,1,2,4],"value":[10,20,5,30]})
df2 = pd.DataFrame({"TimeseriesID":['B','B','B','C'],"TimeID":[5,6,7,8],"value":[10,20,5,30]})
df3 = pd.DataFrame({"TimeseriesID":['C','D','D','D'],"TimeID":[9,1,2,3],"value":[10,20,5,30]})
start = time.time()
df = pd.concat([df1,df2,df3]).groupby('TimeseriesID').agg({"value":custom_func})
elapsed = (time.time() - start)
print(elapsed)
print(df.head())
output will be:
0.023952960968017578
                  value
TimeseriesID
A             11.666667
B             16.250000
C             20.000000
D             18.333333
As you can see, 'concat' takes time to process. With so few records the cost is barely noticeable, but it grows with the amount of data.
The approach should be as follows:
Get the files with the data that you are going to process, i.e. only the workable columns.
Create a dictionary of keys and values from the processed files. If necessary, collect the values of each key into its own file. You can store the results in a 'results' directory as JSON/CSV:
A.csv will have all key 'A' values
...
n.csv will have all key 'n' values
Iterate through the results directory and start building your final output inside a dictionary.
{'A': [10, 20, 5], 'B': [30, 10, 20, 5], 'C': [30, 10], 'D': [20, 5, 30]}
Apply the custom function to each key's value list.
{'A': 11.666666666666666, 'B': 16.25, 'C': 20.0, 'D': 18.333333333333332}
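The steps above can be sketched in memory as follows. This is only a minimal illustration: `merge_groups` is a hypothetical helper name, the three small frames stand in for the exported CSV files, and the dictionary is assumed to fit in memory (for larger data it would be replaced by the per-key files described above).

```python
from collections import defaultdict

import numpy as np
import pandas as pd


def custom_func(x):
    # Placeholder for the real per-series function.
    return np.mean(x)


def merge_groups(frames):
    # Hypothetical helper: collect the 'value' lists of each TimeseriesID
    # across files, looking at one file at a time.
    values = defaultdict(list)
    for df in frames:
        for key, group in df.groupby("TimeseriesID", sort=False)["value"]:
            values[key].extend(group.tolist())
    return dict(values)


# Stand-ins for the files that would be read one by one.
df1 = pd.DataFrame({"TimeseriesID": ["A", "A", "A", "B"], "TimeID": [0, 1, 2, 4], "value": [10, 20, 5, 30]})
df2 = pd.DataFrame({"TimeseriesID": ["B", "B", "B", "C"], "TimeID": [5, 6, 7, 8], "value": [10, 20, 5, 30]})
df3 = pd.DataFrame({"TimeseriesID": ["C", "D", "D", "D"], "TimeID": [9, 1, 2, 3], "value": [10, 20, 5, 30]})

merged = merge_groups([df1, df2, df3])
results = {key: float(custom_func(vals)) for key, vals in merged.items()}
print(merged)   # {'A': [10, 20, 5], 'B': [30, 10, 20, 5], 'C': [30, 10], 'D': [20, 5, 30]}
print(results)
```

In practice the list passed to `merge_groups` could be a generator that reads one file per iteration, so that only one file is loaded at a time.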
You can check the logic using the code below; I use JSON to store the data:
from google.colab import files
import json
import pandas as pd
#initial dataset
df1 = pd.DataFrame({"TimeseriesID":['A','A','A','B'],"TimeID":[0,1,2,4],"value":[10,20,5,30]})
df2 = pd.DataFrame({"TimeseriesID":['B','B','B','C'],"TimeID":[5,6,7,8],"value":[10,20,5,30]})
df3 = pd.DataFrame({"TimeseriesID":['C','D','D','D'],"TimeID":[9,1,2,3],"value":[10,20,5,30]})
#get unique keys and its values
df1.groupby('TimeseriesID')['value'].apply(list).to_json('df1.json')
df2.groupby('TimeseriesID')['value'].apply(list).to_json('df2.json')
df3.groupby('TimeseriesID')['value'].apply(list).to_json('df3.json')
#as this is an example you can download the output as jsons
files.download('df1.json')
files.download('df2.json')
files.download('df3.json')
Update 06/10/2021
I have tuned the code for the OP's needs. This part creates the refined files.
from google.colab import files
import json
#you should use your own function to get the data from the file
def retrieve_data(uploaded,file):
    return json.loads(uploaded[file].decode('utf-8'))
#you should use your own function to get a list of files to process
def retrieve_files():
    return files.upload()
key_list =[]
#call a function that gets a list of files to process
file_to_process = retrieve_files()
#read every raw file:
for file in file_to_process:
    file_data = retrieve_data(file_to_process,file)
    for key,value in file_data.items():
        if key not in key_list:
            #first time this key is seen: create its file
            key_list.append(key)
            with open(f'{key}.json','w') as new_key_file:
                new_json = json.dumps({key:value})
                new_key_file.write(new_json)
        else:
            #key already has a file: append the new values to the stored list
            with open(f'{key}.json','r+') as key_file:
                raw_json = key_file.read()
                old_json = json.loads(raw_json)
                new_json = json.dumps({key:old_json[key]+value})
                key_file.seek(0)
                key_file.write(new_json)
                #drop any leftover bytes from the previous content
                key_file.truncate()
#download one refined file per key
for key in key_list:
    files.download(f'{key}.json')
print(key_list)
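Outside Colab, the same per-key accumulation could be written against a local directory. This is a minimal sketch, assuming each raw file holds a JSON object that maps a key to a list of values and that the output directory starts empty; `append_to_key_files` and the `refined` directory name are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def append_to_key_files(raw_files, out_dir):
    # Hypothetical helper: for each raw JSON file ({key: [values...]}),
    # append the values to out_dir/<key>.json, creating it when needed.
    # Assumes out_dir starts empty (stale files would be appended to).
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    seen = []
    for raw_file in raw_files:
        file_data = json.loads(Path(raw_file).read_text(encoding="utf-8"))
        for key, values in file_data.items():
            key_path = out_dir / f"{key}.json"
            old = []
            if key_path.exists():
                old = json.loads(key_path.read_text(encoding="utf-8"))[key]
            if key not in seen:
                seen.append(key)
            # Rewriting the whole file leaves no bytes from the old content.
            key_path.write_text(json.dumps({key: old + values}), encoding="utf-8")
    return seen


# Demo with throwaway files standing in for df1.json, df2.json and df3.json.
with tempfile.TemporaryDirectory() as tmp:
    raw = {
        "df1.json": {"A": [10, 20, 5], "B": [30]},
        "df2.json": {"B": [10, 20, 5], "C": [30]},
        "df3.json": {"C": [10], "D": [20, 5, 30]},
    }
    paths = []
    for name, content in raw.items():
        path = Path(tmp) / name
        path.write_text(json.dumps(content), encoding="utf-8")
        paths.append(path)
    keys = append_to_key_files(paths, Path(tmp) / "refined")
    b_values = json.loads((Path(tmp) / "refined" / "B.json").read_text(encoding="utf-8"))

print(keys)      # ['A', 'B', 'C', 'D']
print(b_values)  # {'B': [30, 10, 20, 5]}
```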
Update 07/10/2021
I have updated the code to avoid confusion. This part processes the refined files.
import time
import numpy as np
import pandas as pd
#retrieve_files and retrieve_data are the functions defined in the previous part
#Once we get the refined values we can use it to apply custom functions
def custom_func(x):
    return np.mean(x)
#Get key and data content from single json
def get_data(file_data):
    content = file_data.popitem()
    return content[0],content[1]
#load key list and build our refined dictionary
refined_values = []
#call a function that gets a list of files to process
file_to_process = retrieve_files()
start = time.time()
#read every refined file:
for file in file_to_process:
    #read content of file n
    file_data = retrieve_data(file_to_process,file)
    #parse and apply function per file read
    key,data = get_data(file_data)
    func_output = custom_func(data)
    #start building refined list
    refined_values.append([key,func_output])
elapsed = (time.time() - start)
print(elapsed)
df = pd.DataFrame.from_records(refined_values,columns=['TimerSeriesID','value']).sort_values(by=['TimerSeriesID'])
df = df.reset_index(drop=True)
print(df.head())
output will be:
0.00045609474182128906
  TimerSeriesID      value
0             A  11.666667
1             B  16.250000
2             C  20.000000
3             D  18.333333
To summarize:
When handling large datasets, focus on the data that you are actually going to use and keep it minimal, using only the workable values.
In this test, processing was faster when the operations were performed with basic operators or Python native libraries.
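Given the constraints stated in the question (rows sorted by TimeseriesID, and a series continuing at most into the next file), another way to keep memory use minimal is to process the files in order and hold back only the last, possibly incomplete, series of each file. This is a different technique from the per-key files above, sketched under the assumption that the files are read in export order; `iter_results` is a hypothetical name, and the small frames stand in for reading each CSV file.

```python
import numpy as np
import pandas as pd


def custom_func(x):
    # Placeholder for the real per-series function.
    return np.mean(x)


def iter_results(frames):
    # Yield (TimeseriesID, result) pairs while keeping only the current
    # file plus the carried-over rows of one series in memory.
    carry = None  # rows of the last series seen in the previous file
    for df in frames:
        if carry is not None:
            df = pd.concat([carry, df], ignore_index=True)
        if df.empty:
            continue
        last_id = df["TimeseriesID"].iloc[-1]
        is_last = df["TimeseriesID"] == last_id
        # The last series may continue in the next file, so hold it back.
        carry = df[is_last]
        for key, group in df[~is_last].groupby("TimeseriesID", sort=False):
            yield key, custom_func(group["value"])
    # After the final file, the held-back series is complete.
    if carry is not None and not carry.empty:
        yield carry["TimeseriesID"].iloc[0], custom_func(carry["value"])


# Stand-ins for the files; a generator calling pd.read_csv per file would also work.
df1 = pd.DataFrame({"TimeseriesID": ["A", "A", "A", "B"], "TimeID": [0, 1, 2, 4], "value": [10, 20, 5, 30]})
df2 = pd.DataFrame({"TimeseriesID": ["B", "B", "B", "C"], "TimeID": [5, 6, 7, 8], "value": [10, 20, 5, 30]})
df3 = pd.DataFrame({"TimeseriesID": ["C", "D", "D", "D"], "TimeID": [9, 1, 2, 3], "value": [10, 20, 5, 30]})

results = dict(iter_results([df1, df2, df3]))
print(results)
```

On the example frames this should give the same per-series means as the concat version (A 11.67, B 16.25, C 20.0, D 18.33), without ever holding more than one file and one carried-over series.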
I am new to multithreading in Python, so I am not sure how to set this up. I am trying to produce a large output dataframe populated with calculations based on another input dataframe. The output dataframe is like an adjacency matrix of the columns of the input dataframe.
The following non-multithreaded version works perfectly:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
import json
import os
import time
def build_adjacency_matrix(DATA_MATRIX, OUT):
    # READS DATA: data must be a csv with a header and an index column
    my_data = pd.read_csv(DATA_MATRIX, index_col=0)
    # INITIALIZE EMPTY DF WITH COLSNAMES FROM INPUT AS COLUMNS AND INDEX (rownames)
    AM = pd.DataFrame(columns=my_data.columns, index = my_data.columns)
    y=0
    w=2
    for c1 in my_data.columns:
        print (c1)
        y+=1
        if y > w:
            time.sleep(1) # GIVE THE PROCESSOR A REST AFTER EACH 10 COLUMNS
            print(y) #KEEP TRACK OF HOW MANY COLS HAVE BEEN PROCESSED
            w+=10
        for c2 in my_data.columns:
            if c1==c2: AM.loc[c1,c2]=0; continue
            sample_df = pd.DataFrame(my_data, columns=[c1,c2])
            # KEEP ONLY ROWS WITH 1s and 0s
            sample_df = sample_df[sample_df[c1] != 0.5]
            sample_df = sample_df[sample_df[c2] != 0.5]
            sample_df = sample_df.dropna()
            # CALCULATE ChiX
            # Contingency table.
            contingency = pd.crosstab(sample_df[c1], sample_df[c2])
            # Chi-square test of independence.
            try:
                chi2, p, ddof, expected = chi2_contingency(contingency)
                AM.loc[c1,c2] = p
            except ValueError:
                # ASSIGN AS NOT SIGNIFICANT IF THERE IS A PROBLEM
                AM.loc[c1,c2] = 1
    AM.to_csv(OUT, sep=',')
    return
# FILES
data_matrix='input_test.csv'
out='output_mt_test.csv'
# FUNCTION CALL
build_adjacency_matrix(data_matrix, out)
Here are the first few rows of the input file:
,VAR1,VAR2,VAR3,VAR4,VAR5,VAR6,VAR7,VAR8,VAR9,VAR10,VAR11,VAR12,VAR13,VAR14,VAR15,VAR16,VAR17,VAR18,VAR19
SAMPLE1,1,0,0.5,1,1,0.5,0.5,1,0.5,0.5,0.5,0.5,0,0.5,0,0.5,0,0.5,0.5
SAMPLE2,0.5,0.5,0.5,1,1,0.5,0.5,1,0.5,0.5,0,1,0,0.5,0,0.5,0.5,0.5,0.5
SAMPLE3,0.5,0,0.5,1,1,0.5,0.5,1,0.5,0.5,1,0.5,0.5,0.5,0,1,0,0.5,0.5
SAMPLE4,1,0.5,0.5,1,1,0.5,0.5,0,0.5,0.5,0.5,0.5,0.5,0.5,1,1,0.5,0.5,1
And here are the first few rows of the output file:
,VAR1,VAR2,VAR3,VAR4,VAR5,VAR6,VAR7,VAR8,VAR9,VAR10,VAR11,VAR12,VAR13,VAR14,VAR15,VAR16,VAR17,VAR18,VAR19
VAR1,0,0.00326965769624,0.67328997966,0.573642138098,0.573642138098,0.923724918398,0.556975806531,0.665485722686,1.0,0.545971722677,0.125786424639,0.665005542102,0.914326585297,0.843324894877,0.10024407707,0.37367830795,0.894229755473,0.711877649185,0.920167313802
VAR2,0.00326965769624,0,0.67328997966,0.714393037634,0.714393037634,0.829638099719,1.0,0.881545828869,1.0,1.0,0.504985075094,0.665005542102,0.672603817442,0.75946286538,0.365088814029,1.0,0.478520976544,0.698535358303,0.700311372937
VAR3,0.67328997966,0.67328997966,0,1.0,1.0,0.665005542102,1.0,0.672603817442,1.0,1.0,1.0,1.0,0.819476976778,1.0,0.324126587758,1.0,1.0,0.665005542102,0.608407800233
The code works well and produces the expected output for the test file. However, the real input file (exactly the same structure, but with hundreds of rows and thousands of columns) is considerably larger and takes ~48 hours to run, so I need to make it faster.
I tried the following attempt to implement multithreading:
import pandas as pd
from scipy.stats import chi2_contingency
from threading import Thread
def build_adjacency_matrix(DATA_MATRIX, OUT, THREADS):
    # READS DATA: data must be a csv with a header and an index column
    my_data = pd.read_csv(DATA_MATRIX, index_col=0)
    # INITIALIZE EMPTY DF WITH COLSNAMES FROM INPUT AS COLUMNS AND INDEX (rownames)
    AM = pd.DataFrame(columns=my_data.columns, index = my_data.columns)
    print(len(my_data.columns))
    print(len(my_data.index))
    # BUILD THREAD GROUPS
    thread_groups={}
    chunk=int(len(AM.columns)/THREADS)
    i=0; j=chunk
    for t in range(THREADS): thread_groups[t]=list(range(i,j)); i+=chunk; j+=chunk;
    # DELEGATE REMAINING COLS TO THE LAST THREAD
    if thread_groups[THREADS-1][-1] != len(AM.columns):
        thread_groups[THREADS-1] = thread_groups[THREADS-1] + \
            list(range((thread_groups[THREADS-1][-1]),len(AM.columns)))
    print(thread_groups)
    def populate_DF(section):
        for c1 in AM.columns[section]:
            for c2 in AM.columns:
                if c1==c2: AM.loc[c1,c2]=0; continue
                sample_df = pd.DataFrame(my_data, columns=[c1,c2])
                # KEEP ONLY ROWS WITH 1s and 0s
                sample_df = sample_df[sample_df[c1] != 0.5]
                sample_df = sample_df[sample_df[c2] != 0.5]
                sample_df = sample_df.dropna()
                # CALCULATE ChiX
                # Contingency table.
                contingency = pd.crosstab(sample_df[c1], sample_df[c2])
                #Chi-square test of independence.
                try:
                    # POPULATE AM WITH CHI-SQ p-value
                    chi2, p, ddof, expected = chi2_contingency(contingency)
                    AM.loc[c1,c2] = p
                except ValueError:
                    # ASSIGN A p-value OF 1.0 IF THERE IS A PROBLEM
                    AM.loc[c1,c2] = 1
    for tg in thread_groups:
        t = Thread(target=populate_DF, args=(thread_groups[tg],))
        print(tg)
        print(thread_groups[tg])
        t.start()
    AM.to_csv(OUT, sep=',')
    return
data_matrix='input_test.csv'
out='output_mt_test.csv'
build_adjacency_matrix(data_matrix, out, 4)
I'm not sure whether I should make the output dataframe a global variable, or how to do that. The aim of the section on 'building thread groups' is to delegate groups of columns from the input file to separate threads, with each thread's output added to the final dataframe. I have up to 16 cores available, so I thought a multithreading solution would help here. The code as it is produces an unexpected, partially complete output:
,VAR1,VAR2,VAR3,VAR4,VAR5,VAR6,VAR7,VAR8,VAR9,VAR10,VAR11,VAR12,VAR13,VAR14,VAR15,VAR16,VAR17,VAR18,VAR19
VAR1,0,0.00326965769624,0.67328997966,0.573642138098,0.573642138098,0.923724918398,0.556975806531,0.665485722686,1.0,0.545971722677,0.125786424639,0.665005542102,0.914326585297,0.843324894877,0.10024407707,0.37367830795,0.894229755473,0.711877649185,
VAR2,,,,,,,,,,,,,,,,,,,
VAR3,,,,,,,,,,,,,,,,,,,
VAR4,,,,,,,,,,,,,,,,,,,
VAR5,0.573642138098,0.714393037634,1.0,5.61531250139e-06,0,1.0,1.0,0.859350808026,0.819476976778,0.819476976778,1.0,1.0,0.805020272634,,,,,,
VAR6,,,,,,,,,,,,,,,,,,,
VAR7,,,,,,,,,,,,,,,,,,,
VAR8,,,,,,,,,,,,,,,,,,,
VAR9,1.0,1.0,1.0,0.819476976778,,,,,,,,,,,,,,,
VAR10,,,,,,,,,,,,,,,,,,,
VAR11,,,,,,,,,,,,,,,,,,,
VAR12,,,,,,,,,,,,,,,,,,,
VAR13,0.914326585297,,,,,,,,,,,,,,,,,,
VAR14,,,,,,,,,,,,,,,,,,,
VAR15,,,,,,,,,,,,,,,,,,,
VAR16,,,,,,,,,,,,,,,,,,,
VAR17,,,,,,,,,,,,,,,,,,,
VAR18,,,,,,,,,,,,,,,,,,,
VAR19,,,,,,,,,,,,,,,,,,,
I'm not sure whether this is an issue with multiple threads trying to write to the same variable, or a problem with how I have spread the workload. I would really appreciate any help with how to fix this, or any other ways to optimize the code. Thanks in advance!
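One likely cause of the partial output: the threaded version calls AM.to_csv right after starting the threads and never waits for them (there is no join), so the file can be written while the workers are still running. Below is a minimal sketch of one way to wait for all workers before assembling the result. It is not a drop-in replacement: `build_matrix` is a hypothetical name, and `pair_score` is a hypothetical stand-in for the chi-square p-value computation. Each worker returns its own row instead of writing into a shared dataframe.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def pair_score(data, c1, c2):
    # Hypothetical stand-in: the crosstab + chi2_contingency step would go here.
    pair = data[[c1, c2]].dropna()
    return float((pair[c1] == pair[c2]).mean())


def build_matrix(data, n_workers=4):
    cols = list(data.columns)

    def compute_row(c1):
        # Each worker builds and returns one full row; nothing shared is mutated.
        return c1, {c2: 0.0 if c1 == c2 else pair_score(data, c1, c2) for c2 in cols}

    # Leaving the "with" block waits until every submitted task has finished.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        rows = dict(pool.map(compute_row, cols))

    # Assemble the output only after all workers are done.
    return pd.DataFrame.from_dict(rows, orient="index").reindex(index=cols, columns=cols)


# Tiny demo frame standing in for the real input file.
data = pd.DataFrame({"VAR1": [1, 0, 1, 0], "VAR2": [1, 0, 0, 0], "VAR3": [0, 1, 0, 1]})
am = build_matrix(data, n_workers=2)
print(am)
```

Because of CPython's global interpreter lock, threads may not speed up CPU-bound work like this very much. Swapping in `concurrent.futures.ProcessPoolExecutor` is an option worth trying, though the worker function would then have to be defined at module level so that it can be pickled.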