How to run the same function multiple times simultaneously? - python

I have a function that takes dataframe as an input and returns a dataframe. Like:
def process(df):
    # <all the code for processing>
    return df

# input df has 250K rows and 30 columns
# saving the output in a variable
result = process(df)
# process() transforms the input df into 10,000K rows and over 50 columns
It does a lot of processing and thus takes a long time to return the output. I am using a Jupyter notebook.
I have come up with a new function that filters the original dataframe into 5 chunks of unequal size (between 30K and 100K rows), based on a category filter on a column of the original df, passes them separately as process(df1), process(df2), etc., saves the outputs as result1, result2, etc., and then merges the results together into one single final dataframe.
But I want them to run simultaneously and have the results combined automatically: code that runs the 5 process calls together and, once all are completed, joins them into the same "result" as before, with a lot of run time saved.
Even better would be splitting the original dataframe into equal parts and running each part through process(df) simultaneously, i.e. randomly splitting those 250K rows into 5 dataframes of 50K each, sending them to process(df) five times in parallel, and getting the same final output I would get right now without any of this optimization.
I was reading a lot about multi-threading and found some useful answers on Stack Overflow, but I wasn't able to really get it to work. I am very new to this concept of multi-threading.

You can use the multiprocessing library for this, which allows you to run a function on different cores of the CPU.
The following is an example:
from multiprocessing import Pool

def f(df):
    # process the dataframe
    return df

if __name__ == '__main__':
    dataframes = [df1, df2, df3]
    with Pool(len(dataframes)) as p:
        processed_dfs = p.map(f, dataframes)
    print(processed_dfs)
    # You would join them here
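To get the exact workflow described in the question (split the 250K-row dataframe into equal parts, process them in parallel, and merge the results), a minimal sketch could look like the following; process and df are the names from the question, and np.array_split simply cuts the frame into 5 positional pieces:
import numpy as np
import pandas as pd
from multiprocessing import Pool

if __name__ == '__main__':
    # split the original dataframe into 5 roughly equal parts
    parts = np.array_split(df, 5)
    # run process() on each part in a separate worker process
    with Pool(5) as p:
        results = p.map(process, parts)
    # merge the partial results back into one final dataframe
    result = pd.concat(results, ignore_index=True)
Note that with the "spawn" start method (Windows, and many Jupyter setups), process usually has to live in an importable module rather than being defined inside the notebook, otherwise the worker processes cannot pickle it.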

You should check out Dask (https://dask.org/), since your workload is mostly operations on dataframes. A big advantage is that you won't have to worry about the details of manually splitting your dataframe and recombining the results.
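A minimal sketch of that idea, assuming the existing process function can be applied to each partition independently (map_partitions runs it on every chunk and compute() triggers the parallel execution):
import dask.dataframe as dd

# split the pandas dataframe into 5 partitions
ddf = dd.from_pandas(df, npartitions=5)

# apply the existing function to every partition in parallel and collect a pandas result
result = ddf.map_partitions(process).compute()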

Related

using pandas.read_csv() for malformed csv data

This is a conceptual question, so no code or reproducible example.
I am processing data pulled from a database which contains records from automated processes. The regular record contains 14 fields, with a unique ID, and 13 fields containing metrics, such as the date of creation, the time of execution, the customer ID, the type of job, and so on. The database accumulates records at the rate of dozens a day, and a couple of thousand per month.
Sometimes, the processes result in errors, which result in malformed rows. Here is an example:
id1,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,m13 /*regular record, no error, 14 fields*/
id2,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,"DELETE error, failed" /*error in column 14*/
id3,m01,m02,"NO SUCH JOB error, failed" /*error in column 4*/
id4,m01,m02,m03,m04,m05,m06,"JOB failed, no time recorded" /*error in column 7*/
The requirements are to (1) populate a dashboard from the metrics, and (2) catalog the types of errors. The ideal solution uses read_csv with on_bad_lines set to some function that returns a dataframe. My hacky solution is to munge the data by hand, row by row, and create two data frames from the output. The presence of the bad lines can be reliably detected by the use of the keyword "failed." I have written the logic that collects the "failed" messages and produces a stacked bar chart by date. It works, but I'd rather use a total Pandas solution.
Is it possible to use pd.read_csv() to return 2 dataframes? If so, how would this be done? Can you point me to any example code? Or am I totally off base? Thanks.
You can load your CSV file into a DataFrame and apply a filter:
df = pd.read_csv("your_file.csv", header=None)
df_filter = df.apply(lambda row: row.astype(str).str.contains('failed').any(), axis=1)
df[df_filter.values]   # this gives a dataframe of "failed" rows
df[~df_filter.values]  # this gives a dataframe of "non-failed" rows
You need to make sure that the keyword does not appear anywhere else in your data.
PS: There might be more optimized ways to do it.
This approach reads the entire CSV into a single column, then uses a mask that identifies the failed rows to split the data into good and failed dataframes.
Read the entire CSV into a single column:
import io
import pandas as pd

# a single very wide fixed-width field keeps each raw line intact in one column
dfs = pd.read_fwf(sim_csv, widths=[999999], header=None)
Build a mask identifying the failed rows:
fail_msk = dfs[0].str.contains('failed')
Use that mask to split out and build separate dataframes:
df_good = pd.read_csv(io.StringIO('\n'.join(dfs[~fail_msk].squeeze())), header=None)
df_fail = pd.read_csv(io.StringIO('\n'.join(dfs[fail_msk].squeeze())), header=None)

What is the most Pythonic way to relate 2 pandas dataframes based on a key value?

So, I work at a place where I use A LOT of Python (pandas) and the data keeps getting bigger and bigger: last month I was working with a few hundred thousand rows, weeks after that with a few million rows, and now with 42 million rows. Most of my work is taking a dataframe and, for each row, looking up its "equivalent" in another dataframe and processing the data; sometimes it's just a merge, but more often I need to run a function over the equivalent data. Back in the days of a few hundred thousand rows it was OK to just use apply and a simple filter, but now it is EXTREMELY SLOW. Recently I've switched to vaex, which is way faster than pandas in every aspect except apply, and after some searching I found that apply is the last resort and should be used only if you have no other option. So, is there another option? I really don't know.
Some code to explain how I was doing this entire time:
def get_secondary(row: pd.Series):
    cnae = row["cnae_fiscal"]
    cnpj = row["cnpj"]
    # cnaes is another dataframe
    secondary = cnaes[cnaes.cnpj == cnpj]
    return [cnae] + list(secondary["cnae"].values)

empresas["cnae_secundarios"] = empresas.apply(get_secondary, axis=1)
This isn't the only use case, as I said.
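For the specific example above, a sketch of one vectorized alternative (my suggestion, assuming cnpj is the join key in both frames and the column names from the snippet): pre-group cnaes once into a cnpj-to-list lookup and map it, instead of filtering cnaes for every row:
# build the lookup once instead of filtering cnaes per row
secondary_by_cnpj = cnaes.groupby("cnpj")["cnae"].apply(list)

mapped = empresas["cnpj"].map(secondary_by_cnpj)
empresas["cnae_secundarios"] = [
    [cnae] + (sec if isinstance(sec, list) else [])
    for cnae, sec in zip(empresas["cnae_fiscal"], mapped)
]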

using melt function in groupby for large data sets in python

I have one data frame with 1,782,568 distinct groups.
So, when I melt that data at the grouping level my kernel gets stuck.
So I decided to melt the data group by group and then combine all of the pieces sequentially.
For that I wrote the following function.
def split(df, key):
    df2 = pd.DataFrame()
    for i in range(df[key].drop_duplicates().shape[0]):
        grp_key = tuple(df[key].drop_duplicates().iloc[i, :])
        df1 = (df.groupby(key, as_index=False)
                 .get_group(grp_key)
                 .reset_index().drop('index', axis=1))
        df2 = df2.append(df1.groupby(key, as_index=False)
                            .apply(pd.melt, id_vars=key)
                            .reset_index()).dropna()
    df2 = df2.drop(grep('level', df2.columns), axis=1)
    return df2
Here grep is a user-defined function of mine; it works like the grep function in R.
For df I would pass the data frame, and for key I would pass the grouping keys as a list.
But this function also takes a very long time to complete.
Can anyone help me improve the performance?
Thanks in advance.
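One possible simplification (a sketch of my own suggestion, not taken from the question): since every group is melted with the same id_vars=key, melting the whole frame in a single call should produce the same long format without the per-group loop:
import pandas as pd

# melt everything at once; id_vars=key keeps the grouping columns as identifiers
melted = pd.melt(df, id_vars=key).dropna()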

pandas chunksize - concat inside and outside the loops

I have to read massive CSV files (500 million lines), and I tried to read them with pandas using the chunksize parameter in order to reduce memory consumption. But I don't understand the behaviour of the concat method, or whether reading the whole file back in and concatenating it actually reduces memory. I'm adding some pseudocode in order to explain what I did so far.
Let's say I'm reading and then concatenate a file with n lines with:
iter_csv = pd.read_csv("file.csv", chunksize=n // 2)
df = pd.concat([chunk for chunk in iter_csv])
Then I have to apply a function to the dataframe to create a new column based on some values:
df['newcl'] = df.apply(function)
Everything goes fine.
But now I wonder what's the difference between the above procedure and the following:
iter_csv = pd.read_csv("file.csv", chunksize=n // 2)
for chunk in iter_csv:
    chunk['newcl'] = chunk.apply(function)
    df = pd.concat([chunk])
In terms of RAM consumption, I thought that the second method should be better because it applies the function only to the chunk and not to the whole dataframe. But the following issues occur:
putting the df = pd.concat([chunk]) inside the loop returns me a dataframe with a size of n/2 (the size of the chunk), and not the full one;
putting the df = pd.concat([chunk]) outside, after the loop returns the same n/2 dataframe length.
So my doubt is whether the first method (concatenate the dataframe just after the read_csv function) is the best one, balancing speed and RAM consumption. And I'm also wondering how may I concat the chunks using the for loop.
Thanks for your support.
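The usual pattern (a sketch of my own suggestion; function and the file name are placeholders carried over from the pseudocode above) is to apply the function to each chunk, collect the processed chunks in a list, and concatenate once after the loop:
import pandas as pd

chunks = []
for chunk in pd.read_csv("file.csv", chunksize=1_000_000):
    chunk['newcl'] = chunk.apply(function)
    chunks.append(chunk)

# one concat at the end instead of rebuilding df on every iteration
df = pd.concat(chunks, ignore_index=True)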

Running update of Pandas dataframe

I'd like to use a DataFrame to manage data from many trials of an experiment I'm controlling with Python code. Ideally I will have one master dataframe with a row for each trial that lives in the main function namespace, and then a separate dict (or dataframe) returned from the function that I call to execute the important bits of code for each trial.
What is the best way to do a running update of the master dataframe with this returned set of data? So far I've come up with:
df = df.append(df_trial, ignore_index=True)
or
df = pd.concat([df, df_trial])
But neither seem ideal (and both take a relatively long time according to %timeit). Is there a more Pandonic way?
You should build a list of the pieces and concatenate them all in one shot at the end.
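A minimal sketch of that pattern (run_trial and n_trials are illustrative placeholders; df_trial is the per-trial piece from the question):
import pandas as pd

pieces = []
for trial in range(n_trials):
    df_trial = run_trial(trial)   # a one-row DataFrame for this trial
    pieces.append(df_trial)

# one concatenation at the very end instead of growing the master df inside the loop
df = pd.concat(pieces, ignore_index=True)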
