Python Multiprocessing write to csv data for huge volume files - python

I am trying to do a calculation and write the result to another txt file using a multiprocessing program. I am getting a record-count mismatch in the output txt file; every time I execute it I get a different output count.
I am new to Python, could someone please help?
import pandas as pd
import multiprocessing as mp

source = "\\share\usr\data.txt"
target = "\\share\usr\data_masked.txt"
Chunk = 10000

def process_calc(df):
    '''
    get source df, do calc and return newdf
    ...
    '''
    return(newdf)

def calc_frame(df):
    output_df = process_calc(df)
    output_df.to_csv(target, index=None, sep='|', mode='a', header=False)

if __name__ == '__main__':
    reader = pd.read_table(source, sep='|', chunksize=chunk, encoding='ANSI')
    pool = mp.Pool(mp.cpu_count())
    jobs = []
    for each_df in reader:
        process = mp.Process(target=calc_frame, args=(each_df)
        jobs.append(process)
        process.start()
    for j in jobs:
        j.join()

You have several issues in your source as posted that would prevent it from even compiling, let alone running. I have attempted to correct those while also solving your main problem, but do check the code below thoroughly to make sure the corrections make sense.
First, the args argument to the Process constructor should be specified as a tuple. You have specified args=(each_df), but (each_df) is not a tuple, it is a simple parenthesized expression; you need (each_df,) to make it a tuple (the statement is also missing a closing parenthesis).
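For example, the corrected Process construction, using the names from the question, would look like this:

    # args must be a one-element tuple: note the trailing comma and the closing parenthesis
    process = mp.Process(target=calc_frame, args=(each_df,))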
Beyond making no provision against multiple processes simultaneously appending to the same file, the other problem is that you cannot be assured of the order in which the processes complete, and thus you have no real control over the order in which the dataframes are appended to the csv file.
The solution is to use a processing pool with the imap method. The iterable to pass to this method is just the reader, which, when iterated, returns the next dataframe to process. The return value of imap is an iterable that, when iterated, yields the next return value from calc_frame in task-submission order, i.e. the same order in which the dataframes were submitted. So as these new, modified dataframes are returned, the main process can simply append them to the output file one by one:
import pandas as pd
import multiprocessing as mp

source = r"\\share\usr\data.txt"
target = r"\\share\usr\data_masked.txt"
Chunk = 10000

def process_calc(df):
    '''
    get source df, do calc and return newdf
    ...
    '''
    return(newdf)

def calc_frame(df):
    output_df = process_calc(df)
    return output_df

if __name__ == '__main__':
    with mp.Pool() as pool:
        reader = pd.read_table(source, sep='|', chunksize=Chunk, encoding='ANSI')
        for output_df in pool.imap(calc_frame, reader):
            output_df.to_csv(target, index=None, sep='|', mode='a', header=False)
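As a quick aside, here is a small self-contained sketch (a toy function, not from the question) illustrating the ordering claim above: Pool.imap yields results in submission order even when later-submitted tasks finish first.

    import multiprocessing as mp
    import time

    def slow_square(x):
        # later inputs sleep less, so they finish earlier
        time.sleep(0.1 * (5 - x))
        return x * x

    if __name__ == '__main__':
        with mp.Pool(4) as pool:
            # prints [0, 1, 4, 9, 16] -- submission order, not completion order
            print(list(pool.imap(slow_square, range(5))))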

Related

How to write multiprocessing results to csv within multiprocessing?

Hi, I am trying to retrieve my results into a csv with the column names as states. This is taking very long; how can I include this in my multiprocessing function?
My previous question here has the entire code (how to add multiprocessing to loops?), so I have only included a snippet of the part I am trying to change:
with Pool(initializer=init_pool_processes, initargs=(lookup1, matrix_data, end_dates)) as pool:
    results = {cust_id: arr for cust_id, arr in pool.map(calc, data1.itertuples(name=None, index=False))}
    s = pd.Series(results).explode()
    df = pd.DataFrame(
        s.to_list(),
        index=s.index,
        columns=['A', 'B', 'C', 'D', 'E'],
    )
    df.to_csv

if __name__ == '__main__':
    main()

Is there any way to boost the speed of pandas program for large dataset?

So I have a dataset of 600K rows with multiple columns from which I used a small subset to check the code first.
I wrote this code, which generates a new column of PMIDs retrieved from an online repository:
from pandas import read_csv
from Bio import Entrez
from timeit import default_timer as timer  # import assumed for the timer() call below

Entrez.email = "myemail#mail.com"
sample = read_csv("rsid_subset.csv", dtype=str)
start = timer()

def retrieve_Pmids(sample):
    res = []
    if sample["Rs_ID"].startswith("rs"):
        rs_id = sample.Rs_ID
        handle = Entrez.esearch(db="pubmed", term=rs_id, rettype="uilist", retmode="text")
        res = Entrez.read(handle)
        ids = ""
        if res:
            ids = ",".join(str(res['IdList']))
            return ids
        else:
            return "."
    else:
        return "."

sample["PMIDS"] = sample.apply(retrieve_Pmids, axis=1)
This gives me the desired result.
The problem I am facing is that for a dataset of only 173 rows it takes about 133 seconds.
Is there any way I can make this program faster for a large dataset?
So far I have tried Dask (no improvement), tried static typing wherever possible without much difference, and then tried the Modin module, but my main pandas version differs from Modin's pandas version.
I also tried multiprocessing by chunking the dataset, but somehow it does not produce any result.
Below is the multiprocessing code I adapted from somewhere.
The problem is that I cannot see the result; the script executes, but no result is produced.
from Bio import Entrez
import multiprocessing as mp
import pandas as pd

SAMPLE = "D:\\Trial\\rsid_subset.csv"
CHUNKSIZE = int(20)
Entrez.email = "myemail#mail.com"

def retrieve_Pmids(sample):
    res = []
    if sample["Rs_ID"].startswith("rs"):
        rs_id = sample.Rs_ID
        handle = Entrez.esearch(db="pubmed", term=rs_id, rettype="uilist", retmode="text")
        res = Entrez.read(handle)
        ids = ""
        if res:
            ids = ",".join(res['IdList'])
            return ids
        else:
            return "."
    else:
        return "."

if __name__ == '__main__':
    reader = pd.read_csv(SAMPLE, chunksize=CHUNKSIZE)
    pool = mp.Pool(4)  # use 4 processes
    funclist = []
    for df in reader:
        f = pool.apply_async(retrieve_Pmids, [df])
        funclist.append(f)
Is there any way I can speed up the process for a large dataset, as I will be working with such large datasets in the future?
Can anyone help me improve the above multiprocessing code in order to get the desired dataframe result?
Thank you in advance.
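One likely reason no result appears is that pool.apply_async only returns AsyncResult handles, and their .get() method has to be called to retrieve the values; the worker also receives a whole chunk rather than a single row, so the per-row lookup needs to be applied inside it. A minimal sketch along those lines, reusing retrieve_Pmids, SAMPLE and CHUNKSIZE from the code above (process_chunk is a hypothetical helper, not from the original post):

    import multiprocessing as mp
    import pandas as pd

    def process_chunk(chunk):
        # hypothetical helper: apply the per-row lookup to one chunk inside a worker
        return chunk.apply(retrieve_Pmids, axis=1)

    if __name__ == '__main__':
        reader = pd.read_csv(SAMPLE, chunksize=CHUNKSIZE)
        pool = mp.Pool(4)
        funclist = [pool.apply_async(process_chunk, [df]) for df in reader]
        pool.close()
        pool.join()
        # AsyncResult.get() blocks until the corresponding worker finishes
        pmids = pd.concat(f.get() for f in funclist)
        print(pmids)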

python cbpro api problems appending data

This script should record every order, every second, for a two-minute duration, but the csv file just has the same row repeated. Sample data from the csv is below.
import cbpro
import time
import pandas as pd
import os
import json

public_client = cbpro.PublicClient()
res = json.dumps(public_client.get_product_ticker(product_id='BTC-USD'))
csv_file = "cbpro-test-1.csv"
df = pd.DataFrame()
timeout = time.time() + 60*2

while True:
    converted = json.loads(res)
    df = df.append(pd.DataFrame.from_dict(pd.json_normalize(converted), orient='columns'))
    if time.time() > timeout:
        break

df.to_csv(csv_file, index=False, encoding='utf-8')
here is some sample output of the csv:
trade_id,price,size,time,bid,ask,volume
127344793,32750.24,0.00113286,2021-01-29T06:18:58.637859Z,32750.24,32755.06,41795.68551358
127344793,32750.24,0.00113286,2021-01-29T06:18:58.637859Z,32750.24,32755.06,41795.68551358
127344793,32750.24,0.00113286,2021-01-29T06:18:58.637859Z,32750.24,32755.06,41795.68551358
127344793,32750.24,0.00113286,2021-01-29T06:18:58.637859Z,32750.24,32755.06,41795.68551358
Edit: I moved the public client and the res variable inside the loop and it works somewhat, but it skips a second; the data looks like this now:
127347670,32620.2,0.00307689,2021-01-29T06:33:50.16111Z,32610,32620.12,41966.5764529
127347670,32620.2,0.00307689,2021-01-29T06:33:50.16111Z,32610,32620.12,41966.5764529
127347671,32614.11,0.00146359,2021-01-29T06:33:52.491186Z,32610,32610.01,41966.5764529
127347671,32614.11,0.00146359,2021-01-29T06:33:52.491186Z,32610,32610.01,41966.5764529
It goes from 06:33:50 to 06:33:52, and the rest of the file follows the same format. I tried it with this while loop:
while True:
    public_client = cbpro.PublicClient()
    res = json.dumps(public_client.get_product_ticker(product_id='BTC-USD'))
    converted = json.loads(res)
    df = df.append(pd.DataFrame.from_dict(pd.json_normalize(converted), orient='columns'))
    if time.time() > timeout:
        break
You fetch only one quote, before you enter the loop. Then you repeatedly process that same data. You never change res; you simply keep appending the same values to your DF, one iteration after another. You need to fetch repeatedly using get_product_ticker.
After OP update:
Yes, that's how to get the quotations rapidly. You can do it better if you move that first line above the loop: you don't need to re-create the client object on every iteration.
Several lines are identical because you're fetching real-time quotations. If nobody changes the current best bid or ask price, the quotation remains static. If you want only the changes, use pandas' drop_duplicates to remove the duplicate rows.
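Putting those two suggestions together, a sketch of what the loop could look like (the one-second pause is an assumption, added to match the stated goal of roughly one row per second):

    import cbpro
    import time
    import json
    import pandas as pd

    public_client = cbpro.PublicClient()   # create the client once, outside the loop
    csv_file = "cbpro-test-1.csv"
    df = pd.DataFrame()
    timeout = time.time() + 60 * 2

    while time.time() <= timeout:
        # fetch a fresh quote on every iteration
        res = json.dumps(public_client.get_product_ticker(product_id='BTC-USD'))
        converted = json.loads(res)
        df = df.append(pd.json_normalize(converted))
        time.sleep(1)                       # roughly one row per second

    df = df.drop_duplicates()               # keep only changed quotations, if desired
    df.to_csv(csv_file, index=False, encoding='utf-8')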

Multiprocessing for Pandas Dataframe write to excel sheets

I have working code that writes a large dataframe to separate sheets in an excel file, but it takes a long time, about 30-40 minutes. I would like to find a way to make it run faster using multiprocessing.
I tried to rewrite it using multiprocessing so that writing to each excel tab could be done in parallel with multiple processors. The revised code runs without errors, but it is not writing to the excel file properly either. Any suggestions would be helpful.
Original working section of code:
import os
from excel_writer import append_df_to_excel
import pandas as pd

path = os.path.dirname(
    os.path.abspath(__file__)) + '\\fund_data.xlsx'  # get path to current directory and excel filename for data

data_cols = df_all.columns.values.tolist()  # Create a list of the columns in the final dataframe
# print(data_cols)

for column in data_cols:  # For each column in the dataframe
    df_col = df_all[column].unstack(level = -1)  # unstack so Dates are across the top oldest to newest
    df_col = df_col[df_col.columns[::-1]]  # reorder so dates are newest to oldest
    # print(df_col)
    append_df_to_excel(path, df_col, sheet_name = column, truncate_sheet = True,
                       startrow = 0)  # Add data to excel file
Revised code trying multiprocessing:
import os
from excel_writer import append_df_to_excel
import pandas as pd
import multiprocessing

def data_to_excel(col, excel_fn, data):
    data_fr = pd.DataFrame(data)  # switch list back to dataframe for putting into excel file sheets
    append_df_to_excel(excel_fn, data_fr, sheet_name = col, truncate_sheet = True, startrow = 0)  # Add data to sheet in excel file

if __name__ == "__main__":
    path = os.path.dirname(
        os.path.abspath(__file__)) + '\\fund_data.xlsx'  # get path to current directory and excel filename for data

    data_cols = df_all.columns.values.tolist()  # Create a list of the columns in the final dataframe
    # print(data_cols)

    pool = multiprocessing.Pool(processes = multiprocessing.cpu_count())
    for column in data_cols:  # For each column in the dataframe
        df_col = df_all[column].unstack(level = -1)  # unstack so Dates are across the top oldest to newest
        df_col = df_col[df_col.columns[::-1]]  # reorder so dates are newest to oldest
        # print(df_col)
        data_col = df_col.values.tolist()  # convert dataframe column to a list to use in pool
        pool.apply_async(data_to_excel, args = (column, path, data_col))
    pool.close()
    pool.join()
I do not know of a proper way to write to a single file from multiple processes. I needed to solve a similar problem, and I solved it by creating a writer process that receives data through a Queue. You can see my solution here (sorry, it is not documented).
Simplified version (draft):
import queue
import logging
from multiprocessing import Process, Queue

input_queue = Queue()
res_queue = Queue()
process_list = []

def do_calculation(in_queue, out_queue, calculate_function):
    try:
        while True:
            data = in_queue.get(False)
            try:
                res = calculate_function(**data)
                out_queue.put(res)
            except ValueError as e:
                out_queue.put("fail")
                logging.error(f"fail on {data}")
    except queue.Empty:
        return

# put data in input queue

def save_process(out_queue, file_path, count):
    for i in range(count):
        data = out_queue.get()
        if data == "fail":
            continue
        # write to excel here

# process_num, calculate_function, path_to_excel and data_size are placeholders in this draft
for i in range(process_num):
    p = Process(target=do_calculation, args=(input_queue, res_queue, calculate_function))
    p.start()
    process_list.append(p)

p2 = Process(target=save_process, args=(res_queue, path_to_excel, data_size))
p2.start()

p2.join()
for p in process_list:
    p.join()
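For concreteness, here is one hypothetical way to fill in the draft's placeholders (calculate_function, process_num, data_size) and load the input queue; this glue would go where the "# put data in input queue" comment sits in the draft, and the names and sample data are illustrative only:

    import pandas as pd

    def calculate_function(sheet_name, values):
        # toy calculation: turn the raw values into a dataframe for one sheet
        return sheet_name, pd.DataFrame(values)

    work_items = [
        {"sheet_name": "fund_a", "values": [[1, 2], [3, 4]]},
        {"sheet_name": "fund_b", "values": [[5, 6], [7, 8]]},
    ]
    for item in work_items:
        input_queue.put(item)      # each dict is unpacked as **data in do_calculation

    process_num = 2                # number of calculation processes
    data_size = len(work_items)    # how many results save_process should expect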

Python multiprocessing with dataframe and multiple arguments

According to this answer, starmap should be used when multiprocessing with multiple arguments. The problem I am having is that one of my arguments is a constant dataframe. When I create a list of arguments to be used by my function and starmap, the dataframe gets stored over and over. I thought I could get around this problem using a Namespace, but can't seem to figure it out. My code below hasn't thrown an error, but after 30 minutes no files have been written. The code runs in under 10 minutes without multiprocessing, just calling write_file directly.
import pandas as pd
import numpy as np
import multiprocessing as mp

def write_file(df, colIndex, splitter, outpath):
    with open(outpath + splitter + ".txt", 'a') as oFile:
        data = df[df.iloc[:, colIndex] == splitter]
        data.to_csv(oFile, sep='|', index=False, header=False)

mgr = mp.Manager()
ns = mgr.Namespace()
df = pd.read_table(file_, delimiter='|', header=None)
ns.df = df.iloc[:, 1] = df.iloc[:, 1].astype(str)
fileList = list(df.iloc[:, 1].astype('str').unique())

for item in fileList:
    with mp.Pool(processes=3) as pool:
        pool.starmap(write_file, np.array((ns, 1, item, outpath)).tolist())
To anyone else struggling with this issue: my solution was to split the dataframe into chunksize-length pieces and build an iterable of (chunk, args) tuples out of it via:

    from itertools import product

    iterable = product(np.array_split(data, 15), [args])
Then, pass this iterable to the starmap:
pool.starmap(func, iterable)
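A small self-contained sketch of that pattern (the toy dataframe, handle_chunk and the output path are illustrative, not from the question):

    from itertools import product
    import multiprocessing as mp
    import numpy as np
    import pandas as pd

    def handle_chunk(chunk, outdir):
        # each worker receives one chunk of the dataframe plus the constant argument
        return len(chunk), outdir

    if __name__ == '__main__':
        data = pd.DataFrame({'a': range(100), 'b': range(100)})
        iterable = product(np.array_split(data, 15), ['/tmp/out'])
        with mp.Pool(3) as pool:
            results = pool.starmap(handle_chunk, iterable)
        print(results)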
I had the same issue: I needed to pass two existing dataframes to the function using starmap. It turns out that there is no need to declare the dataframe as an argument of the function at all; you can simply refer to the dataframe as a global, as described in the accepted answer here: Pandas: local vs global dataframe in functions
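One way to realize that is a sketch like the following, which uses a Pool initializer to set the dataframe as a module-level global in each worker (so it also works where processes are spawned rather than forked); the function and column names are illustrative:

    import multiprocessing as mp
    import pandas as pd

    shared_df = None  # set in each worker by the initializer

    def init_worker(df):
        # store the constant dataframe as a module-level global in each worker
        global shared_df
        shared_df = df

    def count_matches(col_index, value):
        # only the small arguments travel through starmap; the dataframe is global
        return int((shared_df.iloc[:, col_index] == value).sum())

    if __name__ == '__main__':
        df = pd.DataFrame({'key': ['a', 'b', 'a', 'c'], 'val': [1, 2, 3, 4]})
        args = [(0, 'a'), (0, 'b'), (0, 'c')]
        with mp.Pool(2, initializer=init_worker, initargs=(df,)) as pool:
            print(pool.starmap(count_matches, args))  # [2, 1, 1]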
