How to write multiprocessing results to csv within multiprocessing? - python

Hi, I am trying to write my results to a csv with the column names as states, but this is taking very long. How can I include the csv writing in my multiprocessing function?
My previous question (how to add multiprocessing to loops?) has the entire code, so I have only included a snippet of the part I am trying to change:
    # inside main(), per the full code in the linked question
    with Pool(initializer=init_pool_processes, initargs=(lookup1, matrix_data, end_dates)) as pool:
        results = {cust_id: arr for cust_id, arr in pool.map(calc, data1.itertuples(name=None, index=False))}
    s = pd.Series(results).explode()   # pd.Series, not pd.series
    df = pd.DataFrame(
        s.to_list(),
        index=s.index,
        columns=['A', 'B', 'C', 'D', 'E'],
    )
    df.to_csv('results.csv')           # to_csv must be called with a file path ('results.csv' is illustrative)

if __name__ == '__main__':
    main()
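A hedged sketch of one way to do this (not from the original post): it assumes calc returns a (cust_id, arr) pair, as in the snippet above, and that each arr is a sequence of five-value rows matching columns A-E; each result is streamed to the csv as the pool produces it instead of being collected first:

import csv

with Pool(initializer=init_pool_processes, initargs=(lookup1, matrix_data, end_dates)) as pool, \
        open('results.csv', 'w', newline='') as f:     # file name is illustrative
    writer = csv.writer(f)
    writer.writerow(['cust_id', 'A', 'B', 'C', 'D', 'E'])
    for cust_id, arr in pool.imap_unordered(calc, data1.itertuples(name=None, index=False)):
        for row in arr:                                 # one csv row per element of arr (mirrors the explode step)
            writer.writerow([cust_id, *row])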

Related

Python Multiprocessing write to csv data for huge volume files

I am trying to do a calculation and write the result to another txt file using a multiprocessing program, but I am getting a count mismatch in the output txt file; every time I execute it I get a different output count.
I am new to Python; could someone please help?
import pandas as pd
import multiprocessing as mp

source = "\\share\usr\data.txt"
target = "\\share\usr\data_masked.txt"
Chunk = 10000

def process_calc(df):
    '''
    get source df do calc and return newdf
    ...
    '''
    return(newdf)

def calc_frame(df):
    output_df = process_calc(df)
    output_df.to_csv(target,index=None,sep='|',mode='a',header=False)

if __name__ == '__main__':
    reader = pd.read_table(source,sep='|',chunksize = chunk,encoding='ANSI')
    pool = mp.Pool(mp.cpu_count())
    jobs = []
    for each_df in reader:
        process = mp.Process(target=calc_frame,args=(each_df)
        jobs.append(process)
        process.start()
    for j in jobs:
        j.join()
You have several issues in your source as posted that would prevent it from even compiling, let alone running. I have attempted to correct them in an effort to also solve your main problem. But do check the code below thoroughly just to make sure the corrections make sense.
First, the args argument to the Process constructor should be specified as a tuple. You have specified args=(each_df), but (each_df) is not a tuple, it is just a parenthesized expression; you need (each_df,) to make it a tuple (the statement is also missing a closing parenthesis).
The other problem, in addition to there being no provision against multiple processes simultaneously appending to the same file, is that you cannot be assured of the order in which the processes complete, and thus you have no real control over the order in which the dataframes are appended to the csv file.
The solution is to use a processing pool with the imap method. The iterable passed to this method is just the reader, which when iterated returns the next dataframe to process. The return value from imap is an iterable that, when iterated, returns the next return value from calc_frame in task-submission order, i.e. the same order in which the dataframes were submitted. So as these new, modified dataframes are returned, the main process can simply append them to the output file one by one:
import pandas as pd
import multiprocessing as mp

source = r"\\share\usr\data.txt"
target = r"\\share\usr\data_masked.txt"
Chunk = 10000

def process_calc(df):
    '''
    get source df do calc and return newdf
    ...
    '''
    return(newdf)

def calc_frame(df):
    output_df = process_calc(df)
    return output_df

if __name__ == '__main__':
    with mp.Pool() as pool:
        reader = pd.read_table(source, sep='|', chunksize=Chunk, encoding='ANSI')
        for output_df in pool.imap(calc_frame, reader):   # results come back in submission order
            output_df.to_csv(target, index=None, sep='|', mode='a', header=False)
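An equivalent variant (just a sketch of the same approach) opens the target once in the main process and streams each returned chunk into that handle, rather than re-opening the file in append mode for every chunk; pandas' to_csv accepts an open file object:

if __name__ == '__main__':
    reader = pd.read_table(source, sep='|', chunksize=Chunk, encoding='ANSI')
    with mp.Pool() as pool, open(target, 'w', newline='') as f:
        for output_df in pool.imap(calc_frame, reader):
            output_df.to_csv(f, index=None, sep='|', header=False)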

python cbpro api problems appending data

This script should print an order every second for a two-minute duration, but the csv file just has the same row repeated. Sample data from the csv is below.
import cbpro
import time
import pandas as pd
import os
import json

public_client = cbpro.PublicClient()
res = json.dumps(public_client.get_product_ticker(product_id='BTC-USD'))
csv_file = "cbpro-test-1.csv"
df = pd.DataFrame()
timeout = time.time() + 60*2

while True:
    converted = json.loads(res)
    df = df.append(pd.DataFrame.from_dict(pd.json_normalize(converted), orient='columns'))
    if time.time() > timeout:
        break

df.to_csv(csv_file, index=False, encoding='utf-8')
here is some sample output of the csv:
trade_id,price,size,time,bid,ask,volume
127344793,32750.24,0.00113286,2021-01-29T06:18:58.637859Z,32750.24,32755.06,41795.68551358
127344793,32750.24,0.00113286,2021-01-29T06:18:58.637859Z,32750.24,32755.06,41795.68551358
127344793,32750.24,0.00113286,2021-01-29T06:18:58.637859Z,32750.24,32755.06,41795.68551358
127344793,32750.24,0.00113286,2021-01-29T06:18:58.637859Z,32750.24,32755.06,41795.68551358
Edit: I moved the public client and the res variable inside the loop and it works somewhat, but it skips a second here and there; the data looks like this now:
127347670,32620.2,0.00307689,2021-01-29T06:33:50.16111Z,32610,32620.12,41966.5764529
127347670,32620.2,0.00307689,2021-01-29T06:33:50.16111Z,32610,32620.12,41966.5764529
127347671,32614.11,0.00146359,2021-01-29T06:33:52.491186Z,32610,32610.01,41966.5764529
127347671,32614.11,0.00146359,2021-01-29T06:33:52.491186Z,32610,32610.01,41966.5764529
It goes from 06:33:50 to 06:33:52, and the rest of the file follows the same format. I tried it with this while loop:
while True:
    public_client = cbpro.PublicClient()
    res = json.dumps(public_client.get_product_ticker(product_id='BTC-USD'))
    converted = json.loads(res)
    df = df.append(pd.DataFrame.from_dict(pd.json_normalize(converted), orient='columns'))
    if time.time() > timeout:
        break
You fetch only one quote, before you enter the loop. Then you repeatedly process that same data. You never change res; you simply keep appending the same values to your DF, one iteration after another. You need to fetch repeatedly using get_product_ticker.
After OP update:
Yes, that's how to get the quotations rapidly. You can do it better if you move that first line above the loop: you don't need to re-create the client object on every iteration.
Several lines are identical because you're fetching real-time quotations. If nobody changes the current best bid or ask price, then the quotation remains static. If you want only the changes, then use pandas (e.g. DataFrame.drop_duplicates) to remove the duplicates.
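A minimal sketch of the structure described above (client created once outside the loop, a fresh quote fetched on every iteration, duplicates dropped at the end); the one-second sleep is an assumption taken from the question's "every second" requirement:

import time
import json
import pandas as pd
import cbpro

public_client = cbpro.PublicClient()          # create the client once, outside the loop
rows = []
timeout = time.time() + 60 * 2                # two-minute window, as in the question

while time.time() < timeout:
    res = json.dumps(public_client.get_product_ticker(product_id='BTC-USD'))
    rows.append(json.loads(res))              # fetch a fresh quote on every iteration
    time.sleep(1)                             # assumption: roughly one quote per second

df = pd.DataFrame(rows).drop_duplicates()     # remove rows where the quote did not change
df.to_csv("cbpro-test-1.csv", index=False, encoding="utf-8")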

Multiprocessing for Pandas Dataframe write to excel sheets

I have working code to write from a large dataframe to separate sheets in an excel file, but it takes a long time, about 30-40 minutes. I would like to find a way to make it run faster using multiprocessing.
I tried to rewrite it using multiprocessing so that writing to each excel tab could be done in parallel with multiple processors. The revised code runs without errors, but it is not writing to the excel file properly either. Any suggestions would be helpful.
Original working section of code:
import os
from excel_writer import append_df_to_excel
import pandas as pd

path = os.path.dirname(
    os.path.abspath(__file__)) + '\\fund_data.xlsx'  # get path to current directory and excel filename for data

data_cols = df_all.columns.values.tolist()  # Create a list of the columns in the final dataframe
# print(data_cols)

for column in data_cols:  # For each column in the dataframe
    df_col = df_all[column].unstack(level = -1)  # unstack so Dates are across the top oldest to newest
    df_col = df_col[df_col.columns[::-1]]  # reorder so dates are newest to oldest
    # print(df_col)
    append_df_to_excel(path, df_col, sheet_name = column, truncate_sheet = True,
                       startrow = 0)  # Add data to excel file
Revised code trying multiprocessing:
import os
from excel_writer import append_df_to_excel
import pandas as pd
import multiprocessing

def data_to_excel(col, excel_fn, data):
    data_fr = pd.DataFrame(data)  # switch list back to dataframe for putting into excel file sheets
    append_df_to_excel(excel_fn, data_fr, sheet_name = col, truncate_sheet = True, startrow = 0)  # Add data to sheet in excel file

if __name__ == "__main__":
    path = os.path.dirname(
        os.path.abspath(__file__)) + '\\fund_data.xlsx'  # get path to current directory and excel filename for data

    data_cols = df_all.columns.values.tolist()  # Create a list of the columns in the final dataframe
    # print(data_cols)

    pool = multiprocessing.Pool(processes = multiprocessing.cpu_count())

    for column in data_cols:  # For each column in the dataframe
        df_col = df_all[column].unstack(level = -1)  # unstack so Dates are across the top oldest to newest
        df_col = df_col[df_col.columns[::-1]]  # reorder so dates are newest to oldest
        # print(df_col)
        data_col = df_col.values.tolist()  # convert dataframe column to a list to use in pool
        pool.apply_async(data_to_excel, args = (column, path, data_col))

    pool.close()
    pool.join()
I do not know the proper way to write to a single file from multiple processes. I needed to solve a similar problem, and I solved it by creating a writer process which gets data through a Queue. You can see my solution here (sorry, it is not documented).
Simplified version (draft):
import queue
import logging
from multiprocessing import Process, Queue

input_queue = Queue()
res_queue = Queue()
process_list = []

def do_calculation(in_queue, out_queue, calculate_function):
    try:
        while True:
            data = in_queue.get(False)   # non-blocking get; raises queue.Empty when drained
            try:
                res = calculate_function(**data)
                out_queue.put(res)
            except ValueError as e:
                out_queue.put("fail")
                logging.error(f" fail on {data}")
    except queue.Empty:
        return

# put data in input queue

def save_process(out_queue, file_path, count):
    for i in range(count):
        data = out_queue.get()
        if data == "fail":
            continue
        # write to excel here

for i in range(process_num):
    p = Process(target=do_calculation, args=(input_queue, res_queue, calculate_function))
    p.start()
    process_list.append(p)

p2 = Process(target=save_process, args=(res_queue, path_to_excel, data_size))
p2.start()

p2.join()
for p in process_list:
    p.join()
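For the "# write to excel here" step, a hedged sketch of what the single writer process could do, assuming each worker puts a (sheet_name, dataframe) tuple on the result queue (that tuple format and the openpyxl engine are assumptions, not part of the original draft):

import pandas as pd

def save_process(out_queue, file_path, count):
    # one process owns the workbook, so sheets are written without any file contention
    with pd.ExcelWriter(file_path, engine='openpyxl') as writer:
        for i in range(count):
            data = out_queue.get()
            if data == "fail":
                continue
            sheet_name, frame = data          # assumed result format: (sheet_name, dataframe)
            frame.to_excel(writer, sheet_name=sheet_name, index=False)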

Writing to csv from dataframe using multiprocessing and not messing up the output

import numpy as np
import pandas as pd
from multiprocessing import Pool
from concurrent import futures  # needed for futures.ProcessPoolExecutor below
import threading

# Load the data
df = pd.read_csv('crsp_short.csv', low_memory=False)

def funk(date):
    ...
    # for each date in df.date.unique() do stuff which gives sample dataframe
    # as an output
    # then write it to file
    sample.to_csv('crsp_full.csv', mode='a')

def evaluation(f_list):
    with futures.ProcessPoolExecutor() as pool:
        return pool.map(funk, f_list)

# list_s is a list of dates I want to calculate function funk for
evaluation(list_s)
I get a csv file as output with some of the lines messed up, because Python is writing pieces from different processes at the same time. I guess I need to use Queues, but I was not able to modify the code so that it worked. Any ideas how to do it? Otherwise it takes ages to get the results.
This solved the problem (Pool does the queuing for you):
Python: Writing to a single file with queue while using multiprocessing Pool
My version of the code that didn't mess up the output csv file:
import numpy as np
import pandas as pd
from multiprocessing import Pool
import threading

# Load the data
df = pd.read_csv('crsp_short.csv', low_memory=False)

def funk(date):
    ...
    # for each date in df.date.unique() do stuff which gives sample dataframe
    # as an output
    return sample

# list_s is a list of dates I want to calculate function funk for
def mp_handler():
    # 28 is the number of processes I want to run
    p = Pool(28)  # Pool is imported directly, so no multiprocessing. prefix is needed
    for result in p.imap(funk, list_s):
        result.to_csv('crsp_full.csv', mode='a')

if __name__ == '__main__':
    mp_handler()
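One detail worth noting with mode='a': to_csv writes the column header on every append, so the combined file ends up with repeated header rows. A small illustrative tweak to the loop above writes the header only for the first chunk:

first = True
for result in p.imap(funk, list_s):
    result.to_csv('crsp_full.csv', mode='a', header=first)
    first = False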

Ironpython code taking too long to execute?

I have a piece of code written in IronPython that reads data from a table in Spotfire and serializes it into a JSON object. It is taking too long to execute. Please suggest alternatives.
import clr
import sys
clr.AddReference('System.Web.Extensions')
from System.Web.Script.Serialization import JavaScriptSerializer
from Spotfire.Dxp.Data import IndexSet
from Spotfire.Dxp.Data import DataValueCursor

rowCount = MyTable.RowCount
rows = IndexSet(rowCount, True)
cols = MyTable.Columns

MyTableData = []
for r in rows:
    list = {}
    item = {}
    for c in cols:
        item[c.Name] = c.RowValues.GetFormattedValue(r)
        list['MyData'] = item
    MyTableData.append(list)

json = JavaScriptSerializer(MaxJsonLength=sys.maxint).Serialize(MyTableData)
Your code will be faster if you don't call list['MyData']=item for every column. You only need to call it once.
You could also use list and dictionary comprehensions, instead of appending, or looking up keys for every value.
MyTableData = [{'MyData': {column.Name: column.RowValues.GetFormattedValue(row)
                           for column in cols}}
               for row in rows]
If column.RowValues is an expensive operation, you may be better off looping over columns in the outer loop instead, which isn't as neat.
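A rough sketch of that column-wise variant, using the DataValueCursor import that is already in the question; the GetRows/CurrentValue pattern is the usual Spotfire idiom, but treat the exact calls as an assumption and verify them against your Spotfire API version:

from Spotfire.Dxp.Data import DataValueCursor

columns_data = {}
for c in cols:
    cursor = DataValueCursor.CreateFormatted(c)   # one cursor per column
    values = []
    for row in MyTable.GetRows(cursor):           # walk the rows once for this column
        values.append(cursor.CurrentValue)
    columns_data[c.Name] = values

# reassemble per-row dicts from the per-column lists
MyTableData = [{'MyData': {name: columns_data[name][i] for name in columns_data}}
               for i in range(rowCount)]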
