Multiprocessing vs Threading in Python

I am learning multiprocessing and threading in Python to process and create a large number of files (the workflow is shown in a diagram).
Each output file depends on the analysis of all input files.
Single-process execution of the program takes quite a long time, so I tried the following code:
(a) multiprocessing

import time
from multiprocessing import Pool, cpu_count

# my_read_process_and_write_func and w are defined elsewhere
start = time.time()
process_count = cpu_count()
p = Pool(process_count)
for i in range(process_count):
    p.apply_async(my_read_process_and_write_func, args=(i, w))
p.close()
p.join()
end = time.time()
(b) threading

import time
import threading
from multiprocessing import cpu_count

start = time.time()
thread_count = cpu_count()
thread_list = []
for i in range(0, thread_count):
    t = threading.Thread(target=my_read_process_and_write_func, args=(i,))
    thread_list.append(t)
for t in thread_list:
    t.start()
for t in thread_list:
    t.join()
end = time.time()
I am running this code with Python 3.6 on a Windows PC with 8 cores. However, the multiprocessing version takes about the same time as the single-process version, and the threading version takes about 75% of the single-process time.
My questions are:
Is my code correct?
Is there a better way to improve the efficiency?
Thanks!

Your processing is I/O bound, not CPU bound. As a result, having multiple processes helps little. Each Python process in multiprocessing is stuck waiting for input or output while the CPU does nothing. Increasing the Pool size in multiprocessing should improve performance.
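To illustrate the point about pool size, here is a minimal sketch, not the asker's actual code: the sleep stands in for time spent blocked on disk, and the task count is hypothetical. For I/O-bound work, a pool larger than cpu_count() can keep more operations in flight because most workers are blocked at any given moment.

import time
from multiprocessing import Pool, cpu_count

def read_and_process(i):
    # stand-in for the real read/analyse/write work; the sleep simulates
    # time spent blocked on disk I/O
    time.sleep(0.5)
    return i

if __name__ == '__main__':
    n_tasks = 32  # hypothetical number of files
    for pool_size in (cpu_count(), 4 * cpu_count()):
        start = time.time()
        with Pool(pool_size) as p:
            p.map(read_and_process, range(n_tasks))
        print(pool_size, time.time() - start)

With purely waiting tasks, the larger pool finishes roughly four times sooner, which is the effect being described.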

Following Tarik's answer, since my processing is I/O bound, I made several copies of the input files so that each process reads and processes a different copy of those files.
Now my code runs 8 times faster.

Now my processing diagram looks like this.
My input files include one index file (about 400 MB) and 100 other files (each about 330 MB, which can be considered a file pool).
To generate one output file, the index file and all files in the file pool need to be read. (For example, if the first line of the index file is 15, then line 15 of each file in the file pool must be read to generate output file 1.)
Previously I tried multiprocessing and threading without making copies, and the code was very slow. Then I optimized it by making a copy of only the index file for each process, so each process reads its own copy of the index file and then reads the file pool to generate the output files.
Currently, with 8 CPU cores, multiprocessing with a pool size of 8 takes the least time.
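A minimal sketch of that per-process index-copy arrangement might look like the following; the copy-making helper, worker function, and file names are hypothetical, and the real read/analyse/write logic is omitted.

import shutil
from multiprocessing import Pool, cpu_count

def make_index_copies(index_path, n):
    # give each worker its own copy of the index file (hypothetical layout)
    copies = []
    for i in range(n):
        copy_path = '{}.copy{}'.format(index_path, i)
        shutil.copyfile(index_path, copy_path)
        copies.append(copy_path)
    return copies

def worker(args):
    worker_id, index_copy = args
    # read index_copy, read the file pool, and write the output file(s) here
    return worker_id

if __name__ == '__main__':
    n_workers = cpu_count()
    copies = make_index_copies('index.dat', n_workers)  # 'index.dat' is a placeholder
    with Pool(n_workers) as p:
        p.map(worker, list(enumerate(copies)))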

Related

Multiprocessing is not executing parallel in Python

I have edited the code and it currently works, but I think it is not executing in parallel. Can anyone please look into it?
Code:
import time
from functools import partial
from multiprocessing import Pool, freeze_support

def folderStatistic(t):
    j, dir_name = t
    row = []
    for content in dir_name.split(","):
        row.append(content)
    print(row)

def get_directories():
    import csv
    with open('CONFIG.csv', 'r') as file:
        reader = csv.reader(file, delimiter='\t')
        return [col for row in reader for col in row]

def folderstatsMain():
    freeze_support()
    start = time.time()
    pool = Pool()
    worker = partial(folderStatistic)
    pool.map(worker, enumerate(get_directories()))

def datatobechecked():
    try:
        folderstatsMain()
    except Exception as e:
        # pass
        print(e)

if __name__ == '__main__':
    datatobechecked()
Config.CSV
C:\USERS, .CSV
C:\WINDOWS , .PDF
etc.
There may be around 200 folder paths in config.csv
Welcome to Stack Overflow and to the Python programming world!
Moving on to the question.
Inside the get_directories() function you open the file in a with context and get the reader object; the file is closed the moment you leave that context, so by the time the reader object is used the file is already closed.
I don't want to discourage you, but if you are very new to programming, do not dive into parallel programming yet. The difficulty of handling multiple threads simultaneously grows quickly with every thread you add (pools greatly simplify this, though). Processes are even worse, as they don't share memory and can't communicate with each other easily.
My advice: try to write it as a single-threaded program first. If you have it working and still need to parallelize it, isolate a single function that takes an input file path as a parameter and does all the work, and then use a thread/process pool on that function.
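As a minimal sketch of that isolation idea (the function name and the example inputs here are hypothetical, not part of the original code):

from multiprocessing import Pool

def process_one_path(path):
    # all the per-path work goes here; returning something makes it easy
    # to collect results from the pool later
    return path.upper()

if __name__ == '__main__':
    paths = ['C:\\USERS', 'C:\\WINDOWS']   # hypothetical inputs
    with Pool() as pool:
        results = pool.map(process_one_path, paths)
    print(results)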
EDIT:
From what I can understand from your code, you get directory names from the CSV file and then run folderStatistic in parallel for each "cell" in the file. This part seems correct. The problem may lie in dir_name.split(","): notice that you pass individual "cells" to folderStatistic, not rows. What makes you think it's not running in parallel?
There is a certain amount of overhead in creating a multiprocessing pool because creating processes is, unlike creating threads, a fairly costly operation. Then those submitted tasks, represented by each element of the iterable being passed to the map method, are gathered up in "chunks" and written to a multiprocessing queue of tasks that are read by the pool processes. This data has to move from one address space to another and that has a cost associated with it. Finally when your worker function, folderStatistic, returns its result (which is None in this case), that data has to be moved from one process's address space back to the main process's address space and that too has a cost associated with it.
All of those added costs become worthwhile when your worker function is sufficiently CPU-intensive that these additional costs are small compared to the savings gained by having the tasks run in parallel. But your worker function's CPU requirements are too small to reap any benefit from multiprocessing.
Here is a demo comparing single-processing time vs. multiprocessing time for invoking a worker function, fn, twice: the first time it only performs its internal loop 10 times (low CPU requirements), while the second time it performs its internal loop 1,000,000 times (higher CPU requirements). You can see that in the first case the multiprocessing version runs considerably slower (the single-processing run is too fast to even measure). But when we make fn more CPU-intensive, multiprocessing achieves gains over the single-processing case.
from multiprocessing import Pool
from functools import partial
import time

def fn(iterations, x):
    the_sum = x
    for _ in range(iterations):
        the_sum += x
    return the_sum

# required for Windows:
if __name__ == '__main__':
    for n_iterations in (10, 1_000_000):
        # single processing time:
        t1 = time.time()
        for x in range(1, 20):
            fn(n_iterations, x)
        t2 = time.time()
        # multiprocessing time:
        worker = partial(fn, n_iterations)
        t3 = time.time()
        with Pool() as p:
            results = p.map(worker, range(1, 20))
        t4 = time.time()
        print(f'#iterations = {n_iterations}, single processing time = {t2 - t1}, multiprocessing time = {t4 - t3}')
Prints:
#iterations = 10, single processing time = 0.0, multiprocessing time = 0.35399389266967773
#iterations = 1000000, single processing time = 1.182999849319458, multiprocessing time = 0.5530076026916504
But even with a pool size of 8, the running time is not reduced by a factor of 8 (it's more like a factor of 2) due to the fixed multiprocessing overhead. When I change the number of iterations for the second case to be 100,000,000 (even more CPU-intensive), we get ...
#iterations = 100000000, single processing time = 109.3077495098114, multiprocessing time = 27.202054023742676
... which is a reduction in running time by a factor of 4 (I have many other processes running in my computer, so there is competition for the CPU).

Multiprocessing and multithreading in Python

I have a Python program which 1) reads from a very large file on disk (~95% of the time) and then 2) processes it and produces a relatively small output (~5% of the time). This program is to be run on terabytes of files.
Now I am looking to optimize this program by utilizing multiprocessing and multithreading. The platform I am running on is a virtual machine with 4 processors.
I plan to have a scheduler process which will execute 4 processes (same as the number of processors), and each process should have some threads, as most of the work is I/O. Each thread will process one file and report its result to the main thread, which in turn will report it back to the scheduler process via IPC. The scheduler can queue these and eventually write them to disk in an ordered manner.
So I am wondering: how does one decide the number of processes and threads to create for such a scenario? Is there a mathematical way to figure out the best mix?
Thank you
I think I would arrange it the inverse of what you are doing. That is, I would create a thread pool of a certain size that would be responsible for producing the results. Each task submitted to this pool would be passed, as an argument, a processor pool that the worker thread could use for submitting the CPU-bound portions of the work. In other words, the thread pool workers would primarily do all the disk-related operations, handing off any CPU-intensive work to the processor pool.
The size of the processor pool should just be the number of processors you have in your environment. It's difficult to give a precise size for the thread pool; it depends on how many concurrent disk operations it can handle before the law of diminishing returns comes into play. It also depends on your memory: the larger the pool, the greater the memory resources that will be taken, especially if entire files have to be read into memory for processing. So, you may have to experiment with this value. The code below outlines these ideas. What you gain from the thread pool is a greater overlapping of I/O operations than you would achieve with just a small processor pool:
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from functools import partial
import os

def cpu_bound_function(arg1, arg2):
    ...
    return some_result

def io_bound_function(process_pool_executor, file_name):
    with open(file_name, 'r') as f:
        # Do disk related operations:
        ...  # code omitted
        # Now we have to do a CPU-intensive operation:
        future = process_pool_executor.submit(cpu_bound_function, arg1, arg2)
        result = future.result()  # get result
        return result

file_list = [file_1, file_2, file_n]
N_FILES = len(file_list)
MAX_THREADS = 50  # depends on your configuration on how well the I/O can be overlapped
N_THREADS = min(N_FILES, MAX_THREADS)  # no point in creating more threads than required
N_PROCESSES = os.cpu_count()  # use the number of processors you have

with ThreadPoolExecutor(N_THREADS) as thread_pool_executor:
    with ProcessPoolExecutor(N_PROCESSES) as process_pool_executor:
        results = thread_pool_executor.map(partial(io_bound_function, process_pool_executor), file_list)
Important Note:
Another far simpler approach is to just have a single processor pool whose size is greater than the number of CPU processors you have, for example, 25. The worker processes will do both I/O and CPU operations. Even though you have more processes than CPUs, many of the processes will be in a wait state waiting for I/O to complete, allowing CPU-intensive work to run.
The downside to this approach is that the overhead of creating N processes is far greater than the overhead of creating N threads plus a small number of processes. However, as the running time of the tasks submitted to the pool grows, this fixed overhead becomes a smaller percentage of the total run time. So, if your tasks are not trivial, this could be a reasonably performant simplification.
Update: Benchmarks of Both Approaches
I did some benchmarks against the two approaches processing 24 files whose sizes were approximately 10,000KB (actually, these were just 3 different files processed 8 times each, so there might have been some caching done):
Method 1 (thread pool + processor pool)
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from functools import partial
import os
from math import sqrt
import timeit

def cpu_bound_function(b):
    sum = 0.0
    for x in b:
        sum += sqrt(float(x))
    return sum

def io_bound_function(process_pool_executor, file_name):
    with open(file_name, 'rb') as f:
        b = f.read()
    future = process_pool_executor.submit(cpu_bound_function, b)
    result = future.result()  # get result
    return result

def main():
    file_list = ['/download/httpd-2.4.16-win32-VC14.zip'] * 8 + ['/download/curlmanager-1.0.6-x64.exe'] * 8 + ['/download/Element_v2.8.0_UserManual_RevA.pdf'] * 8
    N_FILES = len(file_list)
    MAX_THREADS = 50  # depends on your configuration on how well the I/O can be overlapped
    N_THREADS = min(N_FILES, MAX_THREADS)  # no point in creating more threads than required
    N_PROCESSES = os.cpu_count()  # use the number of processors you have
    with ThreadPoolExecutor(N_THREADS) as thread_pool_executor:
        with ProcessPoolExecutor(N_PROCESSES) as process_pool_executor:
            results = list(thread_pool_executor.map(partial(io_bound_function, process_pool_executor), file_list))
    print(results)

if __name__ == '__main__':
    print(timeit.timeit(stmt='main()', number=1, globals=globals()))
Method 2 (processor pool only)
from concurrent.futures import ProcessPoolExecutor
from math import sqrt
import timeit

def cpu_bound_function(b):
    sum = 0.0
    for x in b:
        sum += sqrt(float(x))
    return sum

def io_bound_function(file_name):
    with open(file_name, 'rb') as f:
        b = f.read()
    result = cpu_bound_function(b)
    return result

def main():
    file_list = ['/download/httpd-2.4.16-win32-VC14.zip'] * 8 + ['/download/curlmanager-1.0.6-x64.exe'] * 8 + ['/download/Element_v2.8.0_UserManual_RevA.pdf'] * 8
    N_FILES = len(file_list)
    MAX_PROCESSES = 50  # depends on your configuration on how well the I/O can be overlapped
    N_PROCESSES = min(N_FILES, MAX_PROCESSES)  # no point in creating more processes than required
    with ProcessPoolExecutor(N_PROCESSES) as process_pool_executor:
        results = list(process_pool_executor.map(io_bound_function, file_list))
    print(results)

if __name__ == '__main__':
    print(timeit.timeit(stmt='main()', number=1, globals=globals()))
Results:
(I have 8 cores)
Thread Pool + Processor Pool: 13.5 seconds
Processor Pool Alone: 13.3 seconds
Conclusion: I would first try the simpler approach of just using a processor pool for everything. Now the tricky bit is deciding the maximum number of processes to create, which was part of your original question and had a simple answer when all the pool was doing was CPU-intensive computation. If the number of files you are reading is not too large, the point is moot; you can have one process per file. But if you have hundreds of files, you will not want hundreds of processes in your pool (there is also an upper limit on how many processes you can create, and again there are those nasty memory constraints). There is just no way I can give you an exact number. If you do have a large number of files, start with a smallish pool size and keep incrementing until you get no further benefit (of course, you probably do not want to process more than some maximum number of files in these tests, or you will be running forever just deciding on a good pool size for the real run).
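One way to run that incremental experiment is a small timing loop over candidate pool sizes; a rough sketch (the worker body and the file names are placeholders, not the real workload) might look like this:

import time
from concurrent.futures import ProcessPoolExecutor

def process_file(path):
    # placeholder for the real per-file work
    with open(path, 'rb') as f:
        return len(f.read())

if __name__ == '__main__':
    file_list = ['a.bin', 'b.bin', 'c.bin']   # hypothetical subset of the real files
    for pool_size in (2, 4, 8, 16):
        start = time.time()
        with ProcessPoolExecutor(pool_size) as ex:
            list(ex.map(process_file, file_list))
        print(f'pool size {pool_size}: {time.time() - start:.2f}s')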
For parallel processing:
I saw this question, and quoting the accepted answer:
In practice, it can be difficult to find the optimal number of threads and even that number will likely vary each time you run the program. So, theoretically, the optimal number of threads will be the number of cores you have on your machine. If your cores are "hyper threaded" (as Intel calls it) it can run 2 threads on each core. Then, in that case, the optimal number of threads is double the number of cores on your machine.
For multiprocessing:
Someone asked a similar question here, and the accepted answer said this:
If all of your threads/processes are indeed CPU-bound, you should run as many processes as the CPU reports cores. Due to HyperThreading, each physical CPU core may be able to present multiple virtual cores. Call multiprocessing.cpu_count() to get the number of virtual cores.
If only a fraction p of your threads' work is CPU-bound, you can adjust that number by multiplying by p. For example, if half your processes are CPU-bound (p = 0.5) and you have two CPUs with 4 cores each and 2x HyperThreading, you should start 0.5 * 2 * 4 * 2 = 8 processes.
The key here is understanding what machine you are using; from that, you can choose a nearly optimal number of threads/processes to split the execution of your code. I say nearly optimal because it will vary a little every time you run your script, so it is difficult to predict this optimal number from a mathematical point of view.
For your specific situation, if your machine has 4 cores, I would recommend creating at most 4 threads and then splitting them:
1 for the main thread.
3 for file reading and processing.
Using multiple processes to speed up I/O performance may not be a good idea; check this and the sample code below it to see whether it is helpful.
One idea is to have a thread that only reads the file (if I understood correctly, there is only one file) and pushes the independent parts (for example, rows) into a queue as messages.
The messages can then be processed by 4 threads. In this way, you can balance the load between the processors.
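A minimal sketch of that reader-plus-workers arrangement, assuming the per-row processing and the input file name are stand-ins, could look like this:

import queue
import threading

SENTINEL = None
N_WORKERS = 4

def reader(path, q):
    # single producer: read the file and push rows into the queue
    with open(path, 'r') as f:
        for line in f:
            q.put(line)
    for _ in range(N_WORKERS):
        q.put(SENTINEL)          # tell each worker to stop

def worker(q):
    while True:
        row = q.get()
        if row is SENTINEL:
            break
        _ = row.upper()          # placeholder for the real per-row processing

if __name__ == '__main__':
    q = queue.Queue(maxsize=1000)    # bounded so memory stays under control
    workers = [threading.Thread(target=worker, args=(q,)) for _ in range(N_WORKERS)]
    for w in workers:
        w.start()
    reader('big_input.txt', q)       # hypothetical file name
    for w in workers:
        w.join()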
On a strongly I/O-bound process (like what you are describing), you do not necessarily need multithreading nor multiprocessing: you could also use more advanced I/O primitives from your OS.
For example, on Linux you can submit read requests to the kernel along with a suitably sized mutable buffer and be notified when the buffer is filled. This can be done using the AIO API, for which I've written a pure-Python binding, python-libaio (libaio on PyPI), or with the more recent io_uring API, for which there seems to be a CFFI Python binding (liburing on PyPI); I have used neither io_uring nor this Python binding.
This removes the complexity of parallel processing at your level, may reduce the number of OS/userland context switches (reducing the cpu time even further), and lets the OS know more about what you are trying to do, giving it the opportunity of scheduling the IO more efficiently (in a virtualised environment I would not be surprised if it reduced the number of data copies, although I have not tried it myself).
Of course, the downside is that your program will be more tightly bound to the OS you are executing it on, requiring more effort to get it to run on another one.

Python's multiprocessing is not creating tasks in parallel

I am learning about multithreading in Python using the multiprocessing library. For that purpose, I tried to create a program to divide a big file into several smaller chunks. So, first I read all the data from the file, and then I create worker tasks that each take a segment of the data from that input file and write that segment into a file. I expect to have as many parallel threads running as the number of segments, but that does not happen. I see a maximum of two tasks, and the program terminates after that. What mistake am I making? The code is given below.
import multiprocessing

def worker(segment, x):
    fname = getFileName(x)
    writeToFile(segment, fname)

if __name__ == '__main__':
    with open(fname) as f:
        lines = f.readlines()
    jobs = []
    for x in range(0, numberOfSegments):
        segment = getSegment(x, lines)
        jobs.append(multiprocessing.Process(target=worker, args=(segment, x)))
        jobs[len(jobs)-1].start()
    for p in jobs:
        p.join  # note: missing parentheses, so join() is never actually called
Process gives you one additional process (which, with your main process, gives you two processes). The call to join at the end of each loop will wait for that process to finish before starting the next loop. If you insist on using Process, you'll need to store the returned processes (probably in a list) and join every process in a loop after your current loop.
You want the Pool class from multiprocessing (https://docs.python.org/2/library/multiprocessing.html#module-multiprocessing.pool)
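As a rough sketch of how the same per-segment work could be driven by a Pool (getSegment, getFileName, writeToFile, numberOfSegments, and inputFileName are the question's undefined names, assumed to exist; this is not a complete program):

import multiprocessing

def worker(args):
    segment, x = args
    fname = getFileName(x)       # helper from the question, assumed to exist
    writeToFile(segment, fname)  # helper from the question, assumed to exist

if __name__ == '__main__':
    with open(inputFileName) as f:   # inputFileName is a placeholder
        lines = f.readlines()
    tasks = [(getSegment(x, lines), x) for x in range(numberOfSegments)]
    with multiprocessing.Pool() as pool:
        pool.map(worker, tasks)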

Python multithreading without a queue working with large data sets

I am running through a CSV file of about 800k rows. I need a threading solution that runs through each row and spawns 32 threads at a time into a worker. I want to do this without a queue. It looks like the current Python threading solution with a queue is eating up a lot of memory.
Basically, I want to read a CSV file row and hand it to a worker thread, with only 32 threads running at a time.
This is the current script. It appears that it is reading the entire CSV file into a queue and doing a queue.join(). Is it correct that it is loading the entire CSV into a queue and then spawning the threads?
import csv
import subprocess
import threading
import time
import urllib
import Queue  # Python 2 module; the code also uses reader.next() and urllib.urlencode

# docRoot, csvFile, lock, stats and snippets are defined elsewhere in the
# original script and are not shown here.

queue = Queue.Queue()

def worker():
    while True:
        task = queue.get()
        try:
            subprocess.call(['php {docRoot}/cli.php -u "api/email/ses" -r "{task}"'.format(
                docRoot=docRoot,
                task=task
            )], shell=True)
        except:
            pass
        with lock:
            stats['done'] += 1
            if int(time.time()) != stats.get('now'):
                stats.update(
                    now=int(time.time()),
                    percent=(stats.get('done')/stats.get('total'))*100,
                    ps=(stats.get('done')/(time.time()-stats.get('start')))
                )
                print("\r {percent:.1f}% [{progress:24}] {persec:.3f}/s ({done}/{total}) ETA {eta:<12}".format(
                    percent=stats.get('percent'),
                    progress=('='*int((23*stats.get('percent'))/100))+'>',
                    persec=stats.get('ps'),
                    done=int(stats.get('done')),
                    total=stats.get('total'),
                    eta=snippets.duration.time(int((stats.get('total')-stats.get('done'))/stats.get('ps')))
                ), end='')
        queue.task_done()

for i in range(32):
    workers = threading.Thread(target=worker)
    workers.daemon = True
    workers.start()

try:  # the matching except/finally clause is not shown in the post
    with open(csvFile, 'rb') as fh:
        try:
            dialect = csv.Sniffer().sniff(fh.readline(), [',', ';'])
            fh.seek(0)
            reader = csv.reader(fh, dialect)
            headers = reader.next()
        except csv.Error as e:
            print("\rERROR[CSV] {error}\n".format(error=e))
        else:
            while True:
                try:
                    data = reader.next()
                except csv.Error as e:
                    print("\rERROR[CSV] - Line {line}: {error}\n".format(line=reader.line_num, error=e))
                except StopIteration:
                    break
                else:
                    stats['total'] += 1
                    # note: 'row' is not defined in the posted code
                    queue.put(urllib.urlencode(dict(zip(headers, data) + dict(campaign=row.get('Campaign')).items())))
    queue.join()
32 threads is probably overkill unless you have some humongous hardware available.
The rule of thumb for the optimum number of threads or processes is: (no. of cores * 2) - 1
which comes to either 7 or 15 on most hardware.
The simplest way would be to start 7 threads, passing each thread an "offset" as a parameter,
i.e. a number from 0 to 6.
Each thread would then skip rows until it reached the "offset" number and process that row. Having processed the row, it can skip 6 rows and process the 7th -- repeat until no more rows.
This setup works for threads and multiple processes, and is very efficient in I/O on most machines, as all the threads should be reading roughly the same part of the file at any given time.
I should add that this method is particularly good for Python, as each thread is more or less independent once started and avoids the dreaded Python Global Interpreter Lock common to other methods.
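A minimal sketch of that offset/stride idea (the file name and the per-row handler are placeholders, not the asker's PHP call):

import threading

N_THREADS = 7

def process_row(row):
    # placeholder for the real per-row work
    pass

def worker(offset, path):
    # each thread handles rows offset, offset + N_THREADS, offset + 2*N_THREADS, ...
    with open(path, 'r') as f:
        for i, row in enumerate(f):
            if i % N_THREADS == offset:
                process_row(row)

if __name__ == '__main__':
    threads = [threading.Thread(target=worker, args=(offset, 'rows.csv'))
               for offset in range(N_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()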
I don't understand why you want to spawn 32 threads per row. However, processing data in parallel is a fairly common, embarrassingly parallel thing to do and is easily achievable with Python's multiprocessing library.
Example:
from multiprocessing import Pool

def job(args):
    # do some work
    ...

inputs = [...]  # define your inputs

Pool().map(job, inputs)
I leave it up to you to fill in the blanks to meet your specific requirements.
See https://bitbucket.org/ccaih/ccav/src/tip/bin/ for many examples of this pattern.
Other answers have explained how to use Pool without having to manage queues (it manages them for you) and that you do not want to set the number of processes to 32, but to your CPU count - 1. I would add two things. First, you may want to look at the pandas package, which can easily import your csv file into Python. The second is that the examples of using Pool in the other answers only pass it a function that takes a single argument. Unfortunately, you can only pass Pool a single object with all the inputs for your function, which makes it difficult to use functions that take multiple arguments. Here is code that allows you to call a previously defined function with multiple arguments using pool:
import multiprocessing
from multiprocessing import Pool

def multiplyxy(x, y):
    return x*y

def funkytuple(t):
    """
    Breaks a tuple into a function to be called and a tuple
    of arguments for that function. Changes that new tuple into
    a series of arguments and passes those arguments to the
    function.
    """
    f = t[0]
    t = t[1]
    return f(*t)

def processparallel(func, arglist):
    """
    Takes a function and a list of arguments for that function
    and processes them in parallel.
    """
    parallelarglist = []
    for entry in arglist:
        parallelarglist.append((func, tuple(entry)))
    cpu_count = int(multiprocessing.cpu_count() - 1)
    pool = Pool(processes=cpu_count)
    database = pool.map(funkytuple, parallelarglist)
    pool.close()
    return database

# Necessary on Windows
if __name__ == '__main__':
    x = [23, 23, 42, 3254, 32]
    y = [324, 234, 12, 425, 13]
    i = 0
    arglist = []
    while i < len(x):
        arglist.append([x[i], y[i]])
        i += 1
    database = processparallel(multiplyxy, arglist)
    print(database)
Your question is pretty unclear. Have you tried initializing your Queue to have a maximum size of, say, 64?
myq = Queue.Queue(maxsize=64)
Then a producer (one or more) trying to .put() new items on myq will block until consumers reduce the queue size to less than 64. This will correspondingly limit the amount of memory consumed by the queue. By default, queues are unbounded: if the producer(s) add items faster than consumers take them off, the queue can grow to consume all the RAM you have.
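A tiny sketch of that bounded-queue behaviour (using the Python 3 queue module here; the question's code uses the Python 2 Queue module, but the maxsize argument works the same way):

import queue
import threading
import time

myq = queue.Queue(maxsize=64)

def consumer():
    while True:
        item = myq.get()
        time.sleep(0.01)       # simulate slow work
        myq.task_done()

threading.Thread(target=consumer, daemon=True).start()

for i in range(1000):
    myq.put(i)                 # blocks whenever the queue already holds 64 items
myq.join()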
EDIT
This is the current script. It appears that it is reading the entire CSV file into a queue and doing a queue.join(). Is it correct that it is loading the entire CSV into a queue and then spawning the threads?
The indentation is messed up in your post, so I have to guess some, but:
The code obviously starts 32 threads before it opens the CSV file.
You didn't show the code that creates the queue. As already explained above, if it's a Queue.Queue, by default it's unbounded and can grow to any size if your main loop puts items on it faster than your threads remove them. Since you haven't said anything about what worker() does (or shown its code), we don't have enough information to guess whether that's the case. But the fact that memory use is out of hand suggests that it is.
And, as also explained, you can stop that easily by specifying a maximum size when you create the queue.
To get better answers, supply better info ;-)
ANOTHER EDIT
Well, the indentation is still messed up in spots, but it's better. Have you tried any of the suggestions? Looks like your worker threads each spawn a new process, so they'll take very much longer than it takes just to read another line from the CSV file. So it's indeed very likely that you put items on the queue far faster than they're taken off. So, for the umpteenth time ;-), TRY initializing the queue with (say) maxsize=64. Then reveal what happens.
BTW, the bare except: clause in worker() is a Really Bad Idea. If anything goes wrong, you'll never know. If you have to ignore every possible exception (including even KeyboardInterrupt and SystemExit), at least log the exception info.
And note what @JamesAnderson said: unless you have extraordinary hardware resources, trying to run 32 processes at a time is almost certainly slower than running a number of processes that's no more than twice the number of available cores. Then again, that also depends a lot on what your PHP program does. If, for example, the PHP program uses disk I/O heavily, any multiprocessing may be slower than none.

No increase in speed when multithreading python hdf5 parsing function

I have a function that:
1) reads in an HDF5 dataset as integer ASCII codes
2) converts the ASCII integers to characters using the chr() function
3) joins the characters into a single string
Upon profiling, I found that the vast majority of the calculation is spent on step #2, the conversion of the ASCII integers to characters. I have somewhat optimized this call by using:
''.join([chr(x) for x in file[dataSetName].value])
As my parsing function seems to be CPU bound (the conversion of integers to characters) and not I/O bound, I expected to obtain a more or less linear speedup with the number of cores devoted to parsing. Parsing one file serially takes ~15 seconds; parsing 10 files (on my 12-core machine) takes ~150 seconds using 10 threads. That is, there seems to be no speedup at all.
I have used the following code to launch my threads:
threads = []
timer = []
threadNumber = 10
for i, d in enumerate(sortedDirSet):
    timer.append(time.time())
    # self.loadFile(d,i)
    threads.append(Thread(target=self.loadFile, args=(d, i)))
    threads[-1].start()
    if (i % threadNumber == 0):
        for i2, t in enumerate(threads):
            t.join()
            print(time.time() - timer[i2])
        timer = []
        threads = []
for t in threads:
    t.join()
Any help would be greatly appreciated.
Python cannot use multiple cores (due to the GIL) unless you spawn subprocesses (with multiprocessing, for example). Thus you won't get any performance boost from spawning threads for CPU-bound tasks.
Here's an example of a script using multiprocessing and queue:
from Queue import Empty  # <-- only needed to catch the exception
from multiprocessing import Process, Queue, cpu_count

def loadFile(d, i, queue):
    # some other stuff
    queue.put(result)

if __name__ == "__main__":
    queue = Queue()
    no = cpu_count()
    processes = []
    for i, d in enumerate(sortedDirSet):
        p = Process(target=loadFile, args=(d, i, queue))
        p.start()
        processes.append(p)
        if i % no == 0:
            for p in processes:
                p.join()
            processes = []
    for p in processes:
        p.join()
    results = []
    while True:
        try:
            # False means "don't wait when Empty, throw an exception instead"
            data = queue.get(False)
            results.append(data)
        except Empty:
            break
    # You have all the data, do something with it
The other (more complicated) way would be to use a pipe instead of a queue.
It would also be more efficient to spawn the processes once, then create a job queue and send jobs (via the pipe) to the subprocesses, so you won't have to create a process each time. But this would be even more complicated, so let's leave it at that.
Freakish is correct with his answer: it will be the GIL thwarting your efforts.
If you were to use Python 3, you could do this very nicely using concurrent.futures. I believe PyPy has also backported this feature.
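For example, a minimal concurrent.futures version of the per-file loop might look like the following sketch. loadFile and sortedDirSet come from the question and are assumed to exist; note that a process pool sidesteps the GIL, whereas a thread pool would not.

from concurrent.futures import ProcessPoolExecutor, as_completed

def parse_file(d, i):
    # stand-in for the question's self.loadFile(d, i)
    return loadFile(d, i)

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=10) as executor:
        futures = [executor.submit(parse_file, d, i)
                   for i, d in enumerate(sortedDirSet)]
        for future in as_completed(futures):
            result = future.result()
            # do something with each result as it arrives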
Also, you could eke a little bit more speed out of your code by replacing your list comprehension:
''.join([chr(x) for x in file[dataSetName].value])
With a map:
''.join(map(chr, file[dataSetName].value))
My tests (on a massive random list) using the above code showed 15.73 s using the list comprehension and 12.44 s using map.
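A quick way to reproduce that kind of comparison on your own machine is a timeit micro-benchmark; here is a sketch with synthetic ASCII codes standing in for the HDF5 data:

import random
import timeit

values = [random.randint(32, 126) for _ in range(1_000_000)]  # synthetic ASCII codes

t_listcomp = timeit.timeit(lambda: ''.join([chr(x) for x in values]), number=3)
t_map = timeit.timeit(lambda: ''.join(map(chr, values)), number=3)
print(f'list comprehension: {t_listcomp:.2f}s, map: {t_map:.2f}s')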
