I have extremely large files, each almost 2GB, so I would like to process multiple files in parallel. All of the files share a similar format, so reading them can be done independently. I know I should use the multiprocessing library, but I am confused about how to use it with my code.
My code for file reading is:
def file_reading(file, num_of_sample, segsites, positions, snp_matrix):
    with open(file, buffering=2000009999) as f:
        ### I read the file here. I am not putting that code here.
        try:
            assert len(snp_matrix) == len(positions)
            return positions, snp_matrix  ## return statement
        except:
            print('length of snp matrix and length of position vector not the same.')
            sys.exit(1)
My main function is:
if __name__ == "__main__":
    segsites = []
    positions = []
    snp_matrix = []
    path_to_directory = '/dataset/example/'
    extension = '*.msOut'
    num_of_samples = 162
    filename = glob.glob(path_to_directory + extension)
    ### How can I use multiprocessing with the function file_reading?
    number_of_workers = 10
    x, y, z = [], [], []
    array_of_number_tuple = [(filename[file], segsites, positions, snp_matrix) for file in range(len(filename))]
    with multiprocessing.Pool(number_of_workers) as p:
        pos, snp = p.map(file_reading, array_of_number_tuple)
        x.extend(pos)
        y.extend(snp)
So my inputs to the function are as follows:
file - a single filename (taken from the list of filenames)
num_of_samples - an int value
segsites - initially an empty list, to which I want to append as I read the file.
positions - initially an empty list, to which I want to append as I read the file.
snp_matrix - initially an empty list, to which I want to append as I read the file.
The function returns the positions list and the snp_matrix list at the end. How can I use multiprocessing for this, where my arguments are lists and an integer? The way I've used multiprocessing gives me the following error:
TypeError: file_reading() missing 3 required positional arguments: 'segsites', 'positions', and 'snp_matrix'
The elements of the list passed to Pool.map are not automatically unpacked: the function you map can generally take only one argument.
Of course, this argument can be a tuple, so it is no problem to unpack it yourself:
import sys

def file_reading(args):
    file, num_of_sample, segsites, positions, snp_matrix = args
    with open(file, buffering=2000009999) as f:
        ### I read the file here. I am not putting that code here.
        try:
            assert len(snp_matrix) == len(positions)
            return positions, snp_matrix  ## return statement
        except:
            print('length of snp matrix and length of position vector not the same.')
            sys.exit(1)
import glob
import multiprocessing

if __name__ == "__main__":
    segsites = []
    positions = []
    snp_matrix = []
    path_to_directory = '/dataset/example/'
    extension = '*.msOut'
    num_of_samples = 162
    filename = glob.glob(path_to_directory + extension)
    number_of_workers = 10
    x, y, z = [], [], []
    array_of_number_tuple = [(filename[file], num_of_samples, segsites, positions, snp_matrix) for file in range(len(filename))]
    with multiprocessing.Pool(number_of_workers) as p:
        # map returns one (positions, snp_matrix) tuple per input file,
        # so collect them all and unpack per result.
        results = p.map(file_reading, array_of_number_tuple)
    for pos, snp in results:
        x.extend(pos)
        y.extend(snp)
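As an aside, Pool.starmap (available since Python 3.3, and shown in another answer further down in this thread) can do the tuple unpacking for you; a minimal sketch, assuming file_reading keeps its original five-parameter signature instead of taking a single args tuple:

with multiprocessing.Pool(number_of_workers) as p:
    # starmap unpacks each argument tuple into separate positional
    # arguments, so file_reading(file, num_of_sample, segsites,
    # positions, snp_matrix) can be used unchanged.
    results = p.starmap(file_reading, array_of_number_tuple)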
Related
I am trying to write a function that reads an array of parquet files as dataframes and concatenates them sequentially into one file.
# Get the list of all files and directories
path = "C:**********\\main_data_1"
dir_list = os.listdir(path)
dir_list

Total_df = []
all_ais_msg_df = []
for j in np.arange(1, len(dir_list) + 1):
    filename = 'ais_comb_' + str(j) + '.parquet'
    j += 1
    Total_df.append(filename)

for i in Total_df:
    df = pd.read_parquet('main_data_1/' + str(i))  # having problem here
    i += 1
    all_ais_msg_df = pd.DataFrame(all_ais_msg_df.append(df))
I intend to construct strings like 'main_data_1/ais_comb_1.parquet', using i in the second loop to advance the ais_comb_1, ais_comb_2, ... part of the string up to the last filename, in order to read and combine all files into one dataframe.
The error I get is:
TypeError: can only concatenate str (not "int") to str
I crave a very clear explanation, please; I am still a rookie.
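For reference, a minimal sketch of the intended loop (assuming, as in the snippet above, that the files are named ais_comb_1.parquet through ais_comb_N.parquet under main_data_1/, and that dir_list holds one entry per file). The counter j stays an int and is only converted with str() when the path is built, which avoids the "can only concatenate str (not int) to str" error; the per-file frames are then concatenated once at the end:

import pandas as pd

frames = []
for j in range(1, len(dir_list) + 1):
    # str(j) formats the int counter into the path; nothing is
    # incremented by hand inside the loop.
    frames.append(pd.read_parquet('main_data_1/ais_comb_' + str(j) + '.parquet'))

# Concatenate all per-file dataframes sequentially into one.
all_ais_msg_df = pd.concat(frames, ignore_index=True)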
I'm new to Python and having trouble passing an object into a function.
Basically, I'm trying to read a large file of over 1.4B lines.
I am passing in an object that contains information on the file. One of its attributes is a very large array containing the offset of the start of each line in the file.
This is a large array, and by passing just the object reference I hope to have a single instance of the array shared by the multiple processes, although I don't know if this is actually happening.
The array is empty when it arrives in the process_line function, which leads to errors. This is the problem.
Here is where the function is called (see the p.starmap):
with open(file_name, 'r') as f:
    line_start = file_inf.start_offset
    # Iterate over all lines and construct arguments for `process_line`
    while line_start < file_inf.file_size:
        line_end = min(file_inf.file_size, line_start + line_size)  # end = minimum of file size and line start + line size
        # Save `process_line` arguments
        args = [file_name, line_start, line_end, file_inf.line_offset]  # arguments for process_line
        line_args.append(args)
        # Move to the next line
        line_start = line_end

print(line_args[1])

with multiprocessing.Pool(cpu_count) as p:  # run the process_line function on each line
    # Run lines in parallel. starmap() is like map(), except that we have
    # multiple arguments in a list, so we use starmap.
    line_result = p.starmap(process_line, line_args)  # maps the process_line function to each line
This is the function:
def process_line(file_name, line_start, line_end, file_obj):
    line_results = register()
    c2 = register()
    c1 = register()
    with open(file_name, 'r') as f:
        # Moving stream position to `line_start`
        f.seek(file_obj[line_start])
        i = 0
        if line_start == 63400:
            print("hello")
        # Read and process lines until `line_end`
        for line in f:
            line_start += 1
            if line_start > line_end:
                line_results.__append__(c2)
                c2.clear()
                break
            c1 = func(line)
            c2.__add__(c1)
            i = i + 1
    return line_results.countf
where file_obj contains line_offset, which is the array in question.
Now, if I remove the multiprocessing and just use:
line_result = starmap(process_line, line_args)
the array is passed in just fine, although without multiprocessing.
Also, if I pass in just the array instead of the entire object, it works too, but then for some reason only 2 processes do any work (on Linux; on Windows, task manager shows only 1 working while the rest just use memory but no CPU), instead of the expected 20, which is critical for this task.
Is there any solution to this? Please help.
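One pattern worth trying (a sketch under assumptions, not the poster's actual code: it presumes the offsets array can be built once in the parent, and that line_args then carries only the small per-task fields) is to hand the big array to each worker once via the Pool initializer, instead of shipping it inside every starmap tuple:

import multiprocessing

line_offset = None  # set once per worker by init_worker

def init_worker(offsets):
    # Runs once in each worker process; stores the big offsets array in a
    # module-level global so it is not pickled into every task.
    global line_offset
    line_offset = offsets

def process_line(file_name, line_start, line_end):
    with open(file_name, 'r') as f:
        f.seek(line_offset[line_start])  # use the worker-global array
        ...  # read and process lines as before

if __name__ == '__main__':
    offsets = build_offsets()  # hypothetical: however the array is produced
    with multiprocessing.Pool(20, initializer=init_worker, initargs=(offsets,)) as p:
        line_result = p.starmap(process_line, line_args)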
I have two Python files (using PyCharm). In Python file#2, I want to call a function in Python file#1.
from main import load_data_from_file
delay, wavelength, measured_trace = load_data_from_file("Sweep_0.txt")
print(delay.shape)
where main is the name of Python file#1. However, when I run Python file#2 (the code posted at the top), I can see that the whole of Python file#1 runs as well.
Any suggestion on how I can just run print(delay.shape) without running the entire Python file#1?
Here is my code:
class import_trace:
    def __init__(self, delays, wavelengths, spectra):
        self.delays = delays
        self.wavelengths = wavelengths
        self.spectra = spectra

    def load_data_from_file(self, filename):
        # logging.info("Entered load_data_from_file")
        with open(filename, 'r') as data_file:
            wavelengths = []
            spectra = []
            delays = []
            for num, line in enumerate(data_file):
                if num == 0:
                    # Get the 1st line, drop the 1st element, and convert it
                    # to a float array.
                    delays = np.array([float(stri) for stri in line.split()[1:]])
                else:
                    data = [float(stri) for stri in line.split()]
                    # The first column contains wavelengths.
                    wavelengths.append(data[0])
                    # All other columns contain intensity at that wavelength
                    # vs time.
                    spectra.append(np.array(data[1:]))
        logging.info("Data loaded from file has sizes (%dx%d)" %
                     (delays.size, len(wavelengths)))
        return delays, np.array(wavelengths), np.vstack(spectra)
and below I use this to get the values; however, it does not work:
frog = import_trace(delays, wavelengths, spectra)
delay, wavelength, measured_trace = frog.load_data_from_file("Sweep_0.txt")
I got this error:
frog = import_trace(delays,wavelengths,spectra)
NameError: name 'delays' is not defined
You can wrap up the functions of file 1 in a class, and in file 2 you can create an object and call the specific function. However, if you can share file 1's code, it would be clearer. For your reference, see below:
File-1
class A:
    def func1():
        ...
File-2
import file1 as f

# function call
f.A.func1()
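Note that the behaviour the question describes (all of file 1 running on import) usually comes from statements at module level in file 1. The usual fix, which also comes up in a later thread here, is to guard them with if __name__ == "__main__":. A minimal sketch of what main.py (file #1) could look like:

# main.py (file #1)
def load_data_from_file(filename):
    ...  # parsing code as in the question

if __name__ == "__main__":
    # This block runs only when main.py is executed directly, not when
    # file #2 does `from main import load_data_from_file`.
    delay, wavelength, measured_trace = load_data_from_file("Sweep_0.txt")
    print(delay.shape)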
I have 3 types of files, each of the same size (around 500 files of each type). I have to give these files to a function. How can I use multiprocessing for this?
The files are rgb_image: 15.png, 16.png, 17.png ...; depth_img: 15.png, 16.png, 17.png ...; and mat: 15.mat, 16.mat, 17.mat ... I have to pass the three files 15.png, 15.png, and 15.mat as arguments to the function. The starting names of the files can vary, but they always follow this format.
The code is as follows:
def depth_rgb_registration(rgb, depth, mat):
    # required operation is performed here and
    # gait_list (a list) is the output of this function
def display_fun(mat, selected_depth, selected_color, excel):
    for idx, color_img in enumerate(color_lists):
        for i in range(len(depth_lists)):
            if color_img.split('.')[0] == depth_lists[i].split('.')[0]:
                rgb = os.path.join(selected_color, color_img)
                depth = os.path.join(selected_depth, sorted(depth_lists)[i])
                m = sorted(mat_lists)[idx]
                mat2 = os.path.join(mat, m)
                abc = color_img.split('.')[0]
                gait_list1 = []
                fnum = int("".join([str(i) for i in re.findall("(\d+)", abc)]))
                gait_list1.append(fnum)
                depth_rgb_registration(rgb, depth, mat2)
                gait_list2.append(gait_list1)  # output gait_list1 from the function above
    data1 = pd.DataFrame(gait_list2)
    data1.to_excel(writer, index=False)
    wb.save(excel)
In the above code, display_fun is the main function, which is called from other code.
In this function, color_img, depth_img, and mat are three different types of files from the folders. These three files are given as arguments to the depth_rgb_registration function, where some required values are stored in gait_list1, which is then written to an Excel file for every set of files.
The loop above works, but it takes around 20-30 minutes to run depending on the number of files.
So I wanted to use multiprocessing to reduce the overall time.
I tried multiprocessing after looking at some examples, but I cannot understand how to pass these 3 files as arguments. I know that using a dictionary as I have done below is not correct, but what is the alternative?
Asynchronous multiprocessing would also be fine. I even thought of using a GPU to run the function, but as I read, the extra time would go into loading the data onto the GPU. Any suggestions?
def display_fun2(mat, selected_depth, selected_color, results, excel):
    path3 = selected_depth
    path4 = selected_color
    path5 = mat
    rgb_depth_pairs = defaultdict(list)
    for rgb in path4.iterdir():
        rgb_depth_pairs[rgb.stem].append(rgb)
    included_extensions = ['png']
    images = [fn for ext in included_extensions for fn in path3.glob(f'*.{ext}')]
    for image in images:
        rgb_depth_pairs[image.stem].append(image)
    for mat in path5.iterdir():
        rgb_depth_pairs[mat.stem].append(mat)
    rgb_depth_pairs = [item for item in rgb_depth_pairs.items() if len(item) == 3]
    with Pool() as p:
        p.starmap_async(process_pairs, rgb_depth_pairs)
    gait_list2.append(gait_list1)
    data1 = pd.DataFrame(gait_list2)
    data1.to_excel(writer, index=False)
    wb.save(excel)
def depth_rgb_registration(rgb, depth, mat):
    # required operation for one set of files
I did not look at the code in detail (it is rather long), but provided that the combinations of arguments to be sent to your three-argument function can be evaluated independently (outside of the function itself), you can simply use Pool.starmap.
For example:
from multiprocessing import Pool

def myfunc(a, b, c):
    return 100*a + 10*b + c

myargs = [(2,3,1), (1,2,4), (5,3,2), (4,6,1), (1,3,8), (3,4,1)]

p = Pool(2)
print(p.starmap(myfunc, myargs))
returns:
[231, 124, 532, 461, 138, 341]
Alternatively, if your function can be recast as a function which accepts a single argument (the tuple) and expands from this into the separate variables that it needs, then you can use Pool.map:
def myfunc(t):
    a, b, c = t  # unpack the tuple and carry on
    return 100*a + 10*b + c

...

print(p.map(myfunc, myargs))
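Applied to the question (a sketch only: the stem-matching helper below is hypothetical, and it assumes depth_rgb_registration returns its gait_list so the results can be collected), the argument tuples can be built up front and then handed to starmap:

import os
from multiprocessing import Pool

def build_args(selected_color, selected_depth, mat_dir):
    # Hypothetical helper: pair 15.png / 15.png / 15.mat by shared stem.
    args = []
    for name in sorted(os.listdir(selected_color)):
        stem = name.split('.')[0]
        rgb = os.path.join(selected_color, name)
        depth = os.path.join(selected_depth, stem + '.png')
        mat2 = os.path.join(mat_dir, stem + '.mat')
        if os.path.exists(depth) and os.path.exists(mat2):
            args.append((rgb, depth, mat2))
    return args

if __name__ == '__main__':
    with Pool() as p:
        # One (rgb, depth, mat) tuple per file set; starmap unpacks each
        # tuple into the three parameters of depth_rgb_registration.
        gait_list2 = p.starmap(depth_rgb_registration, build_args(selected_color, selected_depth, mat))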
I'm writing a piece of Python code to parse a lot of ASCII files using multiprocessing.
For each file I have to perform the operations of this function:
def parse_file(file_name):
    record = False
    path_include = []
    buffer_include = []
    include_file_filters = {}
    include_keylines = {}
    grids_lines = []
    mat_name_lines = []
    pids_name_lines = []
    pids_shell_lines = []
    pids_weld_lines = []
    shells_lines = []
    welds_lines = []
    with open(file_name, 'rb') as in_file:
        for lineID, line in enumerate(in_file):
            if record:
                path_include += line
            if record and re.search(r'[\'|\"]$', line.strip()):
                buffer_include.append(re_path_include.search(
                    path_include).group(1).replace('\n', ''))
                record = False
            if 'INCLUDE' in line and '$' not in line:
                if re_path_include.search(line):
                    buffer_include.append(
                        re_path_include.search(line).group(1))
                else:
                    path_include = line
                    record = True
            if line.startswith('GRID'):
                grids_lines += [lineID]
            if line.startswith('$HMNAME MAT'):
                mat_name_lines += [lineID]
            if line.startswith('$HMNAME PROP'):
                pids_name_lines += [lineID]
            if line.startswith('PSHELL'):
                pids_shell_lines += [lineID]
            if line.startswith('PWELD'):
                pids_weld_lines += [lineID]
            if line.startswith(('CTRIA3', 'CQUAD4')):
                shells_lines += [lineID]
            if line.startswith('CWELD'):
                welds_lines += [lineID]
    include_keylines = {'grid': grids_lines, 'mat_name': mat_name_lines, 'pid_name': pids_name_lines,
                        'pid_shell': pids_shell_lines, 'pid_weld': pids_weld_lines, 'shell': shells_lines, 'weld': welds_lines}
    include_file_filters = {file_name: include_keylines}
    return buffer_include, include_file_filters
This function is used in a loop over the list of files, in this way (each process parses one entire file):
import multiprocessing as mp

p = mp.Pool(mp.cpu_count())
buffer_include = []
include_file_filters = {}
for include in grouper([list_of_file_path]):
    current = mp.current_process()
    print 'Running: ', current.name, current._identity
    results = p.map(parse_file, include)
    buffer_include += results[0]
    include_file_filters.update(results[1])
p.close()
The grouper function used above is defined as
def grouper(iterable, padvalue=None):
return itertools.izip_longest(*[iter(iterable)]*mp.cpu_count(), fillvalue=padvalue)
I'm using Python 2.7.15 on a CPU with 4 cores (Intel Core i3-6006U).
When I run my code, I see all the CPUs engaged at 100% and the output Running: MainProcess () in the Python console, but nothing else happens. It seems that my code is blocked at the instruction results = p.map(parse_file, include) and can't go ahead (the code works well when I parse the files one at a time, without parallelization).
What is wrong?
How can I deal with the results given by the parse_file function during parallel execution? Is my approach correct or not?
Thanks in advance for your support.
EDIT
Thanks darc for your reply. I've tried your suggestion, but the issue is the same. The problem seems to be overcome if I put the code under an if statement, like so:
if __name__ == '__main__':
Maybe this is due to the manner in which Python IDLE handles processes. I'm using the IDLE environment for development and debugging.
According to the Python docs:
map(func, iterable[, chunksize])
A parallel equivalent of the map() built-in function (it supports only one iterable argument though). It blocks until the result is ready.
This method chops the iterable into a number of chunks which it submits to the process pool as separate tasks. The (approximate) size of these chunks can be specified by setting chunksize to a positive integer.
Since map blocks, your process waits until parse_file is done.
Since map already chunks the iterable, you can try sending all of the includes together as one large iterable:
import multiprocessing as mp

p = mp.Pool(mp.cpu_count())
buffer_include = []
include_file_filters = {}
results = p.map(parse_file, list_of_file_path, 1)
# parse_file returns a (buffer_include, include_file_filters) tuple per
# file, so aggregate across all results rather than indexing results[0].
for buf, filters in results:
    buffer_include += buf
    include_file_filters.update(filters)
p.close()
If you want to keep the original loop, use apply_async; or, if you are using Python 3, you can use the ProcessPoolExecutor submit() function and read the results.
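A minimal Python 3 sketch of that last option, assuming parse_file and list_of_file_path as defined above:

from concurrent.futures import ProcessPoolExecutor, as_completed

if __name__ == '__main__':
    buffer_include = []
    include_file_filters = {}
    with ProcessPoolExecutor() as executor:
        # Submit one task per file and collect the results as they finish.
        futures = [executor.submit(parse_file, path) for path in list_of_file_path]
        for future in as_completed(futures):
            buf, filters = future.result()
            buffer_include += buf
            include_file_filters.update(filters)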