I have a process that reads URIs of CSV files located in cloud storage, serializes the data (one file becomes an "example" in TensorFlow speak), and writes the examples to the same TFRecord file.
The process is very slow and I would like to parallelize the writing using Python multiprocessing. I've searched high and low and tried multiple implementations, to no avail. This question is very similar to mine, but it never really got an answer.
This is the closest I've come (unfortunately, I can't really provide a replicable example due to the read from cloud storage):
import pandas as pd
import multiprocessing
import tensorflow as tf
TFR_PATH = "./tfr.tfrecord"
BANDS = ["B2", "B3","B4","B5","B6","B7","B8","B8A","B11","B12"]
def write_tfrecord(tfr_path, df_list, bands):
    with tf.io.TFRecordWriter(tfr_path) as writer:
        for _, grp in df_list:
            band_data = {b: [] for b in bands}
            for i, row in grp.iterrows():
                try:
                    df = pd.read_csv(row['uri'])
                except FileNotFoundError:
                    continue
                df = prepare_df(df, bands)
                label = row['FS_crop'].encode()
                for b in bands:
                    band_data[b].append(list(df[b].astype('Int64')))
            # pad to same length and flatten
            mlen = max([len(j) for j in band_data[list(band_data.keys())[0]]])
            npx = len(band_data[list(band_data.keys())[0]])
            flat_band_data = {k: [] for k in band_data}
            for k, v in band_data.items():  # for each band
                for b in v:
                    flat_band_data[k].extend(b + [0] * int(mlen - len(b)))
            example_proto = serialize_example(npx, flat_band_data, label)
            writer.write(example_proto)
# List of grouped DF object, may be 1000's long
gqdf = list(qdf.groupby("field_centroid_str"))
n = 100 #Groups of files to write
processes = [multiprocessing.Process(target=write_tfrecord, args=(TFR_PATH, gqdf[i:i+n], BANDS)) for i in range(0, len(gqdf), n)]
for p in processes:
    p.start()
for p in processes:
    p.join()
    p.close()
These processes finish, but when I go to read a record, like so:
raw_dataset = tf.data.TFRecordDataset(TFR_PATH)
for raw_record in raw_dataset.take(10):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    print(example)
I always end up with a corrupted data error: DataLossError: corrupted record at 7462 [Op:IteratorGetNext]
Any ideas on the correct approach for doing something like this? I've tried using Pool instead of Process, but the tf.io.TFRecordWriter can't be pickled, so it doesn't work.
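For reference, the pattern that usually sidesteps both problems (concurrent writers corrupting a single file, and the unpicklable writer) is to give every process its own shard file and read the shards back together. A minimal sketch, reusing the names from the snippet above; the shard paths and chunking are illustrative and the serialization body is assumed to be the same as in write_tfrecord:

import multiprocessing
import tensorflow as tf

def write_tfrecord_shard(shard_path, df_list, bands):
    # same body as write_tfrecord above, but every call gets a unique output path,
    # so no two processes ever write to the same TFRecord file
    with tf.io.TFRecordWriter(shard_path) as writer:
        ...  # build and write example_proto exactly as in write_tfrecord

if __name__ == "__main__":
    n = 100
    chunks = [gqdf[i:i + n] for i in range(0, len(gqdf), n)]
    shard_paths = ["./tfr-{:05d}-of-{:05d}.tfrecord".format(i, len(chunks))
                   for i in range(len(chunks))]
    procs = [multiprocessing.Process(target=write_tfrecord_shard, args=(path, chunk, BANDS))
             for path, chunk in zip(shard_paths, chunks)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    # TFRecordDataset accepts a list of filenames, so the shards read like one file
    raw_dataset = tf.data.TFRecordDataset(shard_paths)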
I'm trying to revisit this slightly older question and see if there's a better answer these days.
I'm using python3 and I'm trying to share a large dataframe with the workers in a pool. My function reads the dataframe, generates a new array using data from the dataframe, and returns that array. Example code below (note: in the example below I do not actually use the dataframe, but in my code I do).
import numpy as np
import pandas as pd
from multiprocessing import Pool

def func(i):
    return i * 2

def par_func_dict(mydict):
    values = mydict['values']
    df = mydict['df']
    return pd.Series([func(i) for i in values])

N = 10000
cores = 3  # number of worker processes
arr = list(range(N))
data_split = np.array_split(arr, 3)
df = pd.DataFrame(np.random.randn(10, 10))

pool = Pool(cores)
gen = ({'values': i, 'df': df}
       for i in data_split)
data = pd.concat(pool.map(par_func_dict, gen), axis=0)
pool.close()
pool.join()
I'm wondering if there's a way to avoid feeding the generator copies of the dataframe, so that it doesn't take up so much memory.
The answer to the linked question suggests using multiprocessing.Process(), but from what I can tell it's difficult to use that with functions that return things (you need to incorporate signals / events), and the comments indicate that each process still ends up using a large amount of memory.
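One pattern that avoids shipping the dataframe inside every task is to hand it to the workers once via the pool initializer and read it from a worker-level global. The sketch below is an assumption about how that could look for this example: each worker gets one copy at startup instead of one copy per chunk, and under fork on Linux that copy is shared copy-on-write.

import numpy as np
import pandas as pd
from multiprocessing import Pool

_shared_df = None  # set once per worker by the initializer

def _init_worker(df):
    global _shared_df
    _shared_df = df

def func(i):
    return i * 2

def par_func(values):
    # _shared_df is available here; like the toy example above, func doesn't
    # actually use it, but real code would read from it instead of receiving
    # the dataframe with every task
    return pd.Series([func(i) for i in values])

if __name__ == "__main__":
    N = 10000
    df = pd.DataFrame(np.random.randn(10, 10))
    data_split = np.array_split(list(range(N)), 3)
    with Pool(processes=3, initializer=_init_worker, initargs=(df,)) as pool:
        data = pd.concat(pool.map(par_func, data_split), axis=0)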
I need to process over 10 million spectroscopic data sets. The data is structured like this: there are around 1000 .fits files (.fits is a data storage format), each containing around 600-1000 spectra, with around 4500 elements in each spectrum (so each file returns a ~1000*4500 matrix). That means each spectrum would be read around 10 times (and each file around 10,000 times) if I simply looped over the 10 million entries. Although the same spectrum is read around 10 times, the work is not duplicated, because each time I extract a different segment of that spectrum. With the help of @Paul Panzer, I already avoid reading the same file multiple times.
I have a catalog file which contains all the information I need, like the coordinates x, y, the radius r, the strength s, etc. The catalog also contains the information to target which file I am going to read (identified by n1, n2) and which spectrum in that file I am going to use (identified by n3).
The code I have now is:
import numpy as np
from itertools import izip
import itertools
import fitsio

x = []
y = []
r = []
s = []
n1 = []
n2 = []
n3 = []
with open('spectra_ID.dat') as file_ID, open('catalog.txt') as file_c:
    for line1, line2 in izip(file_ID, file_c):
        parts1 = line1.split()
        parts2 = line2.split()
        n1.append(int(parts1[0]))
        n2.append(int(parts1[1]))
        n3.append(int(parts1[2]))
        x.append(float(parts2[0]))
        y.append(float(parts2[1]))
        r.append(float(parts2[2]))
        s.append(float(parts2[3]))

def data_analysis(n_galaxies):
    n_num = 0
    data = np.zeros((n_galaxies), dtype=[('spec', 'f4', (200)), ('x', 'f8'), ('y', 'f8'), ('r', 'f8'), ('s', 'f8')])

    idx = np.lexsort((n3, n2, n1))
    for kk, gg in itertools.groupby(zip(idx, n1[idx], n2[idx]), lambda x: x[1:]):
        filename = "../../data/" + str(kk[0]) + "/spPlate-" + str(kk[0]) + "-" + str(kk[1]) + ".fits"
        fits_spectra = fitsio.FITS(filename)
        fluxx = fits_spectra[0].read()
        n_element = fluxx.shape[1]
        hdu = fits_spectra[0].read_header()
        wave_start = hdu['CRVAL1']
        logwave = wave_start + 0.0001 * np.arange(n_element)
        wavegrid = np.power(10, logwave)

        for ss, plate1, mjd1 in gg:
            if n_num % 1000000 == 0:
                print n_num
            n3new = n3[ss] - 1
            flux = fluxx[n3new]
            ### following is my data reduction of individual spectra, I will skip it here
            ### After all my analysis, I store the data as below:
            data['spec'][n_num] = flux_intplt
            data['x'][n_num] = x[ss]
            data['y'][n_num] = y[ss]
            data['r'][n_num] = r[ss]
            data['s'][n_num] = s[ss]
            n_num += 1

    print n_num
    data_output = fitsio.FITS('./analyzedDATA/data_ALL.fits', 'rw')
    data_output.write(data)
I understand that multiprocessing usually means removing one loop and passing its index to the worker function instead. However, my function has two loops that are highly correlated, so I do not know how to approach this. Since the most time-consuming part of this code is reading files from disk, the multiprocessing needs to take full advantage of the cores to read multiple files at a time. Could anyone shed some light on this?
Get rid of the global vars; state built at module level is not shared between processes.
Merge your multiple global vars into one container class or dict, assigning the different segments of the same spectrum into one data set.
Move your global with open(... into a def ....
Separate data_output into its own def ....
Try this concept first, without multiprocessing:

for line1, line2 in izip(file_ID, file_c):
    data_set = create data set from (line1, line2)
    result = data_analysis(data_set)
    data_output.write(result)

Consider using two extra processes, one for file reading and one for file writing.
Use multiprocessing.Pool(processes=n) for data_analysis.
Communicate between processes using multiprocessing.Manager().Queue(); a rough sketch of this layout follows below.
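A rough sketch of that layout (the names are placeholders, not the original analysis code): a pool runs the analysis, a single writer process drains a queue, and a Manager queue carries results between them, so only one process ever touches the output file.

import multiprocessing as mp

def analyse_one(data_set):
    # stand-in for data_analysis() applied to a single (line1, line2) data set
    return {"result": data_set}

def writer(queue, out_path):
    with open(out_path, "w") as fout:
        while True:
            item = queue.get()
            if item is None:      # sentinel: no more results coming
                break
            fout.write(str(item) + "\n")

if __name__ == "__main__":
    manager = mp.Manager()
    results_q = manager.Queue()

    w = mp.Process(target=writer, args=(results_q, "data_ALL.txt"))
    w.start()

    data_sets = []                # build from spectra_ID.dat / catalog.txt as in the question
    pool = mp.Pool(processes=4)
    for res in pool.imap_unordered(analyse_one, data_sets):
        results_q.put(res)
    pool.close()
    pool.join()

    results_q.put(None)           # tell the writer to stop
    w.join()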
This is similar to this question: How to share a variable in 'joblib' Python library.
I want to share a variable in joblib. However, my problem is somewhat different: I have a huge variable (2-3 GB of RAM) and I want all my workers to read from it. They will never write to it; something like:
def func(varThatChange, varToRead):
    # Do something over varToRead depending on varThatChange
    return results

def main():
    results = Parallel(n_jobs=100)(delayed(func)(varThatChange, varToRead) for varThatChange in listVars)
I cannot pass it in the normal way because copying the variable takes a lot of time and, moreover, I run out of memory.
How can I share it?
If your data/variable can be indexed, you can use an approach like this:
from joblib import Parallel, delayed
import numpy as np
# dummy data
big_data = np.arange(1000)
# size of the data
data_size = len(big_data)
# number of chunks the data should be divided in for multiprocessing
num_chunks = 12
# size of one chunk
chunk_size = int(data_size / num_chunks)
# get the indices of the chunks
chunk_ind = [[i, i + chunk_size] for i in range(0, data_size, chunk_size)]
# function that does the data processing
def processing_func(segment):
    # do the data processing
    x = big_data[segment[0]:segment[-1]] * 1
    return x
# results of the parallel processing - one list per chunk
parallel_results = Parallel(n_jobs=10)(delayed(processing_func)(i) for i in chunk_ind)
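Since Parallel returns the results in the same order as the chunks were submitted, the processed pieces can simply be stitched back together afterwards (a small usage note, assuming each chunk result is a NumPy array as in the example above):

# reassemble the per-chunk results into one array, in the original order
full_result = np.concatenate(parallel_results)
assert full_result.shape == big_data.shape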
My program first clusters a big dataset into 100 clusters, then runs a model on each cluster of the dataset using multiprocessing. My goal is to concatenate all the output values into one big CSV file, which is the concatenation of all the output data from the 100 fitted models.
For now, I am just creating 100 CSV files, then looping over the folder containing these files and copying them one by one, line by line, into a big file.
My question: is there a smarter way to get this big output file without exporting 100 files? I use pandas and scikit-learn for data processing, and multiprocessing for parallelization.
Have your worker processes return the dataset to the main process rather than writing the CSV files themselves; then, as they hand data back to the main process, have it write them into one continuous CSV.
from multiprocessing import Process, Manager

def worker_func(proc_id, results):
    # Do your thing
    results[proc_id] = ["your dataset from %s" % proc_id]

def convert_dataset_to_csv(dataset):
    # Placeholder example. I realize what it's doing is ridiculous
    converted_dataset = [','.join(data.split()) for data in dataset]
    return converted_dataset

m = Manager()
d_results = m.dict()

worker_count = 100

jobs = [Process(target=worker_func,
                args=(proc_id, d_results))
        for proc_id in range(worker_count)]

for j in jobs:
    j.start()
for j in jobs:
    j.join()

with open('somecsv.csv', 'w') as f:
    for d in d_results.values():
        # if the actual conversion function benefits from multiprocessing,
        # you can do that there too instead of here
        for r in convert_dataset_to_csv(d):
            f.write(r + '\n')
If all of your partial csv files have no headers and share column number and order, you can concatenate them like this:
with open("unified.csv", "w") as unified_csv_file:
for partial_csv_name in partial_csv_names:
with open(partial_csv_name) as partial_csv_file:
unified_csv_file.write(partial_csv_file.read())
Pinched the guts of this from http://computer-programming-forum.com/56-python/b7650ebd401d958c.htm; it's a gem.
#!/usr/bin/python
# -*- coding: utf-8 -*-
from glob import glob
n=1
file_list = glob('/home/rolf/*.csv')
concat_file = open('concatenated.csv','w')
files = map(lambda f: open(f, 'r').read, file_list)
print "There are {x} files to be concatenated".format(x=len(files))
for f in files:
    print "files added {n}".format(n=n)
    concat_file.write(f())
    n += 1
concat_file.close()
I am using JModelica to optimize a model using IPOPT in the background.
I would like to run many optimizations in parallel. At the moment I am doing
this using the multiprocessing module.
Right now, the code is as follows. It performs a parameter sweep over the
variables T and So and writes the results to output files named for these
parameters. The output files also contain a list of the parameters used in the
model along with the run results.
#!/usr/local/jmodelica/bin/jm_python.sh
import itertools
import multiprocessing
import numpy as np
import time
import sys
import signal
import traceback
import StringIO
import random
import cPickle as pickle

def PrintResToFile(filename, result):
    def StripMX(x):
        return str(x).replace('MX(', '').replace(')', '')

    varstr = '#Variable Name={name: <10}, Unit={unit: <7}, Val={val: <10}, Col={col:< 5}, Comment="{comment}"\n'

    with open(filename, 'w') as fout:
        #Print all variables at the top of the file, along with relevant information
        #about them.
        for var in result.model.getAllVariables():
            if not result.is_variable(var.getName()):
                val = result.initial(var.getName())
                col = -1
            else:
                val = "Varies"
                col = result.get_column(var.getName())

            unit = StripMX(var.getUnit())
            if not unit:
                unit = "X"

            fout.write(varstr.format(
                name    = var.getName(),
                unit    = unit,
                val     = val,
                col     = col,
                comment = StripMX(var.getAttribute('comment'))
            ))

        #Ensure that time variable is printed
        fout.write(varstr.format(
            name    = 'time',
            unit    = 's',
            val     = 'Varies',
            col     = 0,
            comment = 'None'
        ))

        #The data matrix contains only time-varying variables. So fetch all of
        #these, couple them in tuples with their column number, sort by column
        #number, and then extract the name of the variable again. This results in a
        #list of variable names which are guaranteed to be in the same order as the
        #data matrix.
        vkeys_in_order = [(result.get_column(x), x) for x in result.keys() if result.is_variable(x)]
        vkeys_in_order = map(lambda x: x[1], sorted(vkeys_in_order))

        for vk in vkeys_in_order:
            fout.write("{0:>13},".format(vk))
        fout.write("\n")

        sio = StringIO.StringIO()
        np.savetxt(sio, result.data_matrix, delimiter=',', fmt='%13.5f')
        fout.write(sio.getvalue())

def RunModel(params):
    T = params[0]
    So = params[1]

    try:
        import pyjmi
        signal.signal(signal.SIGINT, signal.SIG_IGN)

        #For testing what happens if an error occurs
        # import random
        # if random.randint(0,100)<50:
        #     raise "Test Exception"

        op = pyjmi.transfer_optimization_problem("ModelClass", "model.mop")
        op.set('a', 0.20)
        op.set('b', 1.00)
        op.set('f', 0.05)
        op.set('h', 0.05)
        op.set('S0', So)
        op.set('finalTime', T)

        # Set options, see: http://www.jmodelica.org/api-docs/usersguide/1.13.0/ch07s06.html
        opt_opts = op.optimize_options()
        opt_opts['n_e'] = 40
        opt_opts['IPOPT_options']['tol'] = 1e-10
        opt_opts['IPOPT_options']['output_file'] = '/z/err_'+str(T)+'_'+str(So)+'_info.dat'
        opt_opts['IPOPT_options']['linear_solver'] = 'ma27' #See: http://www.coin-or.org/Ipopt/documentation/node50.html

        res = op.optimize(options=opt_opts)

        result_file_name = 'out_'+str(T)+'_'+str(So)+'.dat'
        PrintResToFile(result_file_name, res)

        return (True, (T, So))
    except:
        ex_type, ex, tb = sys.exc_info()
        return (False, (T, So), traceback.extract_tb(tb))

try:
    fstatus = open('status', 'w')
except:
    print("Could not open status file!")
    sys.exit(-1)

T = map(float, [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140])
So = np.arange(0.1, 30.1, 0.1)
tspairs = list(itertools.product(T, So))
random.shuffle(tspairs)

pool = multiprocessing.Pool()
mapit = pool.imap_unordered(RunModel, tspairs)
pool.close()

completed = 0

while True:
    try:
        res = mapit.next(timeout=2)
        pickle.dump(res, fstatus)
        fstatus.flush()
        completed += 1
        print(res)
        print "{0: >4} of {1: >4} ({2: >4} left)".format(completed, len(tspairs), len(tspairs)-completed)
    except KeyboardInterrupt:
        pool.terminate()
        pool.join()
        sys.exit(0)
    except multiprocessing.TimeoutError:
        print "{0: >4} of {1: >4} ({2: >4} left)".format(completed, len(tspairs), len(tspairs)-completed)
    except StopIteration:
        break
Using the model:
optimization ModelClass(objective=-S(finalTime), startTime=0, finalTime=100)
  parameter Real S0 = 2;
  parameter Real F0 = 0;
  parameter Real a = 0.2;
  parameter Real b = 1;
  parameter Real f = 0.05;
  parameter Real h = 0.05;
  output Real F(start=F0, fixed=true, min=0, max=100, unit="kg");
  output Real S(start=S0, fixed=true, min=0, max=100, unit="kg");
  input Real u(min=0, max=1);
equation
  der(F) = u*(a*F+b);
  der(S) = f*F/(1+h*F)-u*(a*F+b);
end ModelClass;
Is this safe?
No, it is not safe. op.optimize() stores the optimization result under a file name derived from the model name and then loads that file back in to return the data, so when you run several optimizations at once you get a race condition. To circumvent this, you can provide distinct result file names via opt_opts['result_file_name'].
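For example, inside RunModel the result file name could be derived from the sweep parameters so that no two workers ever share it. This is only a sketch based on the option named above; the exact option name and file extension should be checked against your JModelica version:

opt_opts = op.optimize_options()
opt_opts['n_e'] = 40
# unique per (T, So) combination, so parallel runs cannot clobber each other's results
opt_opts['result_file_name'] = 'res_' + str(T) + '_' + str(So) + '.mat'
res = op.optimize(options=opt_opts)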
No. It does not seem to be safe as of 02015-11-09.
The code above names output files according to the input parameters. The output files also contain the input parameters used to run the model.
With 4 cores two situations arise:
Occasionally the error "Inconsistent number of lines in the result data." is raised in the file /usr/local/jmodelica/Python/pyjmi/common/io.py.
Output files show one set of parameters internally but are named for a different set of parameters, which indicates disagreement between the parameters the script thinks it is processing and the parameters that are actually being processed.
With 24 cores:
The error "The result does not seem to be of a supported format." is repeatedly raised by /usr/local/jmodelica/Python/pyjmi/common/io.py.
Together, this information suggests that intermediate files are being used by JModelica, but that there is overlap in the names of the intermediate files resulting in errors in the best case and incorrect results in the worst case.
One might hypothesize that this is the result of bad random number generation in a tempfile function somewhere, but a bug relating to that was resolved on 02011-11-25. Perhaps the PRNGs are being seeded based on a system clock or a constant and therefore progress in sync?
However, this does not seem to be the case since the following does not produce collisions:
#!/usr/bin/env python
import time
import tempfile
import os
import collections
from multiprocessing import Pool
def f(x):
    tf = tempfile.NamedTemporaryFile(delete=False)
    print(tf.name)
    return tf.name
p = Pool(24)
ret = p.map(f, range(2000))
counts = collections.Counter(ret)
print(counts)