Python multithreading without a queue working with large data sets

Python multithreading without a queue working with large data sets - python

I am running through a csv file of about 800k rows. I need a threading solution that runs through each row and spawns 32 threads at a time into a worker. I want to do this without a queue. It looks like current python threading solution with a queue is eating up alot of memory.
Basically want to read a csv file row and put into a worker thread. And only want 32 threads running at a time.
This is current script. It appears that it is reading the entire csv file into queue and doing a queue.join(). Is it correct that it is loading the entire csv into a queue then spawning the threads?
queue=Queue.Queue()
def worker():
while True:
task=queue.get()
try:
subprocess.call(['php {docRoot}/cli.php -u "api/email/ses" -r "{task}"'.format(
docRoot=docRoot,
task=task
)],shell=True)
except:
pass
with lock:
stats['done']+=1
if int(time.time())!=stats.get('now'):
stats.update(
now=int(time.time()),
percent=(stats.get('done')/stats.get('total'))*100,
ps=(stats.get('done')/(time.time()-stats.get('start')))
)
print("\r {percent:.1f}% [{progress:24}] {persec:.3f}/s ({done}/{total}) ETA {eta:<12}".format(
percent=stats.get('percent'),
progress=('='*int((23*stats.get('percent'))/100))+'>',
persec=stats.get('ps'),
done=int(stats.get('done')),
total=stats.get('total'),
eta=snippets.duration.time(int((stats.get('total')-stats.get('done'))/stats.get('ps')))
),end='')
queue.task_done()
for i in range(32):
workers=threading.Thread(target=worker)
workers.daemon=True
workers.start()
try:
with open(csvFile,'rb') as fh:
try:
dialect=csv.Sniffer().sniff(fh.readline(),[',',';'])
fh.seek(0)
reader=csv.reader(fh,dialect)
headers=reader.next()
except csv.Error as e:
print("\rERROR[CSV] {error}\n".format(error=e))
else:
while True:
try:
data=reader.next()
except csv.Error as e:
print("\rERROR[CSV] - Line {line}: {error}\n".format( line=reader.line_num, error=e))
except StopIteration:
break
else:
stats['total']+=1
queue.put(urllib.urlencode(dict(zip(headers,data)+dict(campaign=row.get('Campaign')).items())))
queue.join()

32 threads is probably overkill unless you have some humungous hardware available.
The rule of thumb for optimum number of threads or processes is: (no. of cores * 2) - 1
which comes to either 7 or 15 on most hardware.
The simplest way would be to start 7 threads passing each thread an "offset" as a parameter.
i.e. a number from 0 to 7.
Each thread would then skip rows until it reached the "offset" number and process that row. Having processed the row it can skip 6 rows and process the 7th -- repeat until no more rows.
This setup works for threads and multiple processes and is very efficient in I/O on most machines as all the threads should be reading roughly the same part of the file at any given time.
I should add that this method is particularly good for python as each thread is more or less independent once started and avoids the dreaded python global lock common to other methods.

I don't understand why you want to spawn 32 threads per row. However data processing in parallel in a fairly common embarassingly paralell thing to do and easily achievable with Python's multiprocessing library.
Example:
from multiprocessing import Pool
def job(args):
# do some work
inputs = [...] # define your inputs
Pool().map(job, inputs)
I leave it up to you to fill in the blanks to meet your specific requirements.
See: https://bitbucket.org/ccaih/ccav/src/tip/bin/ for many examples of this pattenr.

Other answers have explained how to use Pool without having to manage queues (it manages them for you) and that you do not want to set the number of processes to 32, but to your CPU count - 1. I would add two things. First, you may want to look at the pandas package, which can easily import your csv file into Python. The second is that the examples of using Pool in the other answers only pass it a function that takes a single argument. Unfortunately, you can only pass Pool a single object with all the inputs for your function, which makes it difficult to use functions that take multiple arguments. Here is code that allows you to call a previously defined function with multiple arguments using pool:
import multiprocessing
from multiprocessing import Pool
def multiplyxy(x,y):
return x*y
def funkytuple(t):
"""
Breaks a tuple into a function to be called and a tuple
of arguments for that function. Changes that new tuple into
a series of arguments and passes those arguments to the
function.
"""
f = t[0]
t = t[1]
return f(*t)
def processparallel(func, arglist):
"""
Takes a function and a list of arguments for that function
and proccesses in parallel.
"""
parallelarglist = []
for entry in arglist:
parallelarglist.append((func, tuple(entry)))
cpu_count = int(multiprocessing.cpu_count() - 1)
pool = Pool(processes = cpu_count)
database = pool.map(funkytuple, parallelarglist)
pool.close()
return database
#Necessary on Windows
if __name__ == '__main__':
x = [23, 23, 42, 3254, 32]
y = [324, 234, 12, 425, 13]
i = 0
arglist = []
while i < len(x):
arglist.append([x[i],y[i]])
i += 1
database = processparallel(multiplyxy, arglist)
print(database)

Your question is pretty unclear. Have you tried initializing your Queue to have a maximum size of, say, 64?
myq = Queue.Queue(maxsize=64)
Then a producer (one or more) trying to .put() new items on myq will block until consumers reduce the queue size to less than 64. This will correspondingly limit the amount of memory consumed by the queue. By default, queues are unbounded: if the producer(s) add items faster than consumers take them off, the queue can grow to consume all the RAM you have.
EDIT
This is current script. It appears that it is reading the
entire csv file into queue and doing a queue.join(). Is
it correct that it is loading the entire csv into a queue
then spawning the threads?
The indentation is messed up in your post, so have to guess some, but:
The code obviously starts 32 threads before it opens the CSV file.
You didn't show the code that creates the queue. As already explained above, if it's a Queue.Queue, by default it's unbounded, and can grow to any size if your main loop puts items on it faster than your threads remove items from it. Since you haven't said anything about what worker() does (or shown its code), we don't have enough information to guess whether that's the case. But that memory use is out of hand suggests that's the case.
And, as also explained, you can stop that easily by specifying a maximum size when you create the queue.
To get better answers, supply better info ;-)
ANOTHER EDIT
Well, the indentation is still messed up in spots, but it's better. Have you tried any suggestions? Looks like your worker threads each spawn a new process, so they'll take very much longer than it takes just to read another line from the csv file. So it's indeed very likely that you put items on the queue far faster than they're taken off. So, for the umpteenth time ;-), TRY initializing the queue with (say) maxsize=64. Then reveal what happens.
BTW, the bare except: clause in worker() is a Really Bad Idea. If anything goes wrong, you'll never know. If you have to ignore every possible exception (including even KeyboardInterrupt and SystemExit), at least log the exception info.
And note what #JamesAnderson said: unless you have extraordinary hardware resources, trying to run 32 processes at a time is almost certainly slower than running a number of processes that's no more than twice the number of available cores. Then again, that depends too a lot on what your PHP program does. If, for example, the PHP program uses disk I/O heavily, any multiprocessing may be slower than none.

Related

How to parallelize "for" loops? [duplicate]

Say I have a very large list and I'm performing an operation like so:
for item in items:
try:
api.my_operation(item)
except:
print 'error with item'
My issue is two fold:
There are a lot of items
api.my_operation takes forever to return
I'd like to use multi-threading to spin up a bunch of api.my_operations at once so I can process maybe 5 or 10 or even 100 items at once.
If my_operation() returns an exception (because maybe I already processed that item) - that's OK. It won't break anything. The loop can continue to the next item.
Note: this is for Python 2.7.3

First, in Python, if your code is CPU-bound, multithreading won't help, because only one thread can hold the Global Interpreter Lock, and therefore run Python code, at a time. So, you need to use processes, not threads.
This is not true if your operation "takes forever to return" because it's IO-bound—that is, waiting on the network or disk copies or the like. I'll come back to that later.
Next, the way to process 5 or 10 or 100 items at once is to create a pool of 5 or 10 or 100 workers, and put the items into a queue that the workers service. Fortunately, the stdlib multiprocessing and concurrent.futures libraries both wraps up most of the details for you.
The former is more powerful and flexible for traditional programming; the latter is simpler if you need to compose future-waiting; for trivial cases, it really doesn't matter which you choose. (In this case, the most obvious implementation with each takes 3 lines with futures, 4 lines with multiprocessing.)
If you're using 2.6-2.7 or 3.0-3.1, futures isn't built in, but you can install it from PyPI (pip install futures).
Finally, it's usually a lot simpler to parallelize things if you can turn the entire loop iteration into a function call (something you could, e.g., pass to map), so let's do that first:
def try_my_operation(item):
try:
api.my_operation(item)
except:
print('error with item')
Putting it all together:
executor = concurrent.futures.ProcessPoolExecutor(10)
futures = [executor.submit(try_my_operation, item) for item in items]
concurrent.futures.wait(futures)
If you have lots of relatively small jobs, the overhead of multiprocessing might swamp the gains. The way to solve that is to batch up the work into larger jobs. For example (using grouper from the itertools recipes, which you can copy and paste into your code, or get from the more-itertools project on PyPI):
def try_multiple_operations(items):
for item in items:
try:
api.my_operation(item)
except:
print('error with item')
executor = concurrent.futures.ProcessPoolExecutor(10)
futures = [executor.submit(try_multiple_operations, group)
for group in grouper(5, items)]
concurrent.futures.wait(futures)
Finally, what if your code is IO bound? Then threads are just as good as processes, and with less overhead (and fewer limitations, but those limitations usually won't affect you in cases like this). Sometimes that "less overhead" is enough to mean you don't need batching with threads, but you do with processes, which is a nice win.
So, how do you use threads instead of processes? Just change ProcessPoolExecutor to ThreadPoolExecutor.
If you're not sure whether your code is CPU-bound or IO-bound, just try it both ways.
Can I do this for multiple functions in my python script? For example, if I had another for loop elsewhere in the code that I wanted to parallelize. Is it possible to do two multi threaded functions in the same script?
Yes. In fact, there are two different ways to do it.
First, you can share the same (thread or process) executor and use it from multiple places with no problem. The whole point of tasks and futures is that they're self-contained; you don't care where they run, just that you queue them up and eventually get the answer back.
Alternatively, you can have two executors in the same program with no problem. This has a performance cost—if you're using both executors at the same time, you'll end up trying to run (for example) 16 busy threads on 8 cores, which means there's going to be some context switching. But sometimes it's worth doing because, say, the two executors are rarely busy at the same time, and it makes your code a lot simpler. Or maybe one executor is running very large tasks that can take a while to complete, and the other is running very small tasks that need to complete as quickly as possible, because responsiveness is more important than throughput for part of your program.
If you don't know which is appropriate for your program, usually it's the first.

There's multiprocesing.pool, and the following sample illustrates how to use one of them:
from multiprocessing.pool import ThreadPool as Pool
# from multiprocessing import Pool
pool_size = 5 # your "parallelness"
# define worker function before a Pool is instantiated
def worker(item):
try:
api.my_operation(item)
except:
print('error with item')
pool = Pool(pool_size)
for item in items:
pool.apply_async(worker, (item,))
pool.close()
pool.join()
Now if you indeed identify that your process is CPU bound as #abarnert mentioned, change ThreadPool to the process pool implementation (commented under ThreadPool import). You can find more details here: http://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-workers

You can split the processing into a specified number of threads using an approach like this:
import threading
def process(items, start, end):
for item in items[start:end]:
try:
api.my_operation(item)
except Exception:
print('error with item')
def split_processing(items, num_splits=4):
split_size = len(items) // num_splits
threads = []
for i in range(num_splits):
# determine the indices of the list this thread will handle
start = i * split_size
# special case on the last chunk to account for uneven splits
end = None if i+1 == num_splits else (i+1) * split_size
# create the thread
threads.append(
threading.Thread(target=process, args=(items, start, end)))
threads[-1].start() # start the thread we just created
# wait for all threads to finish
for t in threads:
t.join()
split_processing(items)

import numpy as np
import threading
def threaded_process(items_chunk):
""" Your main process which runs in thread for each chunk"""
for item in items_chunk:
try:
api.my_operation(item)
except Exception:
print('error with item')
n_threads = 20
# Splitting the items into chunks equal to number of threads
array_chunk = np.array_split(input_image_list, n_threads)
thread_list = []
for thr in range(n_threads):
thread = threading.Thread(target=threaded_process, args=(array_chunk[thr]),)
thread_list.append(thread)
thread_list[thr].start()
for thread in thread_list:
thread.join()

Multiprocessing -- Thread Pool Memory Leak?

I am observing memory usage that I cannot explain to myself. Below I provide a stripped down version of my actual code that still exhibits this behavior. The code is intended to accomplish the following:
Read a text file in chunks of 1000 lines. Each line is a sentence. Split these 1000 sentences into 4 generators. Pass these generators to a thread pool and run feature extraction in parallel on 250 sentences.
In my actual code I accumulate features and labels from all sentences of the entire file.
Now here comes the weird thing: Memory gets allocated but not freed again even when not accumulating these values! And it has something to do with the thread pool I think. The amount of memory taken in total is dependent on how many features are extracted for any given word. I simulate this here with range(100). Have a look:
from sys import argv
from itertools import chain, islice
from multiprocessing import Pool
from math import ceil
# dummyfied feature extraction function
# the lengt of the range determines howmuch mamory is used up in total,
# eventhough the objects are never stored
def features_from_sentence(sentence):
return [{'some feature' 'some value'} for i in range(100)], ['some label' for i in range(100)]
# split iterable into generator of generators of length `size`
def chunks(iterable, size=10):
iterator = iter(iterable)
for first in iterator:
yield chain([first], islice(iterator, size - 1))
def features_from_sentence_meta(l):
return list(map (features_from_sentence, l))
def make_X_and_Y_sets(sentences, i):
print(f'start: {i}')
pool = Pool()
# split sentences into a generator of 4 generators
sentence_chunks = chunks(sentences, ceil(50000/4))
# results is a list containing the lists of pairs of X and Y of all chunks
results = map(lambda x : x[0], pool.map(features_from_sentence_meta, sentence_chunks))
X, Y = zip(*results)
print(f'end: {i}')
return X, Y
# reads file in chunks of `lines_per_chunk` lines
def line_chunks(textfile, lines_per_chunk=1000):
chunk = []
i = 0
with open(textfile, 'r') as textfile:
for line in textfile:
if not line.split(): continue
i+=1
chunk.append(line.strip())
if i == lines_per_chunk:
yield chunk
i = 0
chunk = []
yield chunk
textfile = argv[1]
for i, line_chunk in enumerate(line_chunks(textfile)):
# stop processing file after 10 chunks to demonstrate
# that memory stays occupied (check your system monitor)
if i == 10:
while True:
pass
X_chunk, Y_chunk = make_X_and_Y_sets(line_chunk, i)
The file I am using to debug this has 50000 nonempty lines, which is why I use the hardcoded 50000 at one place. If you want to use the same file, he is a link for your convenience:
https://www.dropbox.com/s/v7nxb7vrrjim349/de_wiki_50000_lines?dl=0
Now when you run this script and open your system monitor you will observe that memory gets used up and the usage keeps going until the 10th chunk, where I artificially go into an endless loop to demonstrate that the memory stays in use, even though I never store anything.
Can you explain to me why this happens? I seem to be missing something about how multiprocessing pools are supposed to be used.

First, let's clear up some misunderstandings—although, as it turns out, this wasn't actually the right avenue to explore in the first place.
When you allocate memory in Python, of course it has to go get that memory from the OS.
When you release memory, however, it rarely gets returned to the OS, until you finally exit. Instead, it goes into a "free list"—or, actually, multiple levels of free lists for different purposes. This means that the next time you need memory, Python already has it lying around, and can find it immediately, without needing to talk to the OS to allocate more. This usually makes memory-intensive programs much faster.
But this also means that—especially on modern 64-bit operating systems—trying to understand whether you really do have any memory pressure issues by looking at your Activity Monitor/Task Manager/etc. is next to useless.
The tracemalloc module in the standard library provides low-level tools to see what actually is going on with your memory usage. At a higher level, you can use something like memory_profiler, which (if you enable tracemalloc support—this is important) can put that information together with OS-level information from sources like psutil to figure out where things are going.
However, if you aren't seeing any actual problems—your system isn't going into swap hell, you aren't getting any MemoryError exceptions, your performance isn't hitting some weird cliff where it scales linearly up to N and then suddenly goes all to hell at N+1, etc.—you usually don't need to bother with any of this in the first place.
If you do discover a problem, then, fortunately, you're already half-way to solving it. As I mentioned at the top, most memory that you allocated doesn't get returned to the OS until you finally exit. But if all of your memory usage is happening in child processes, and those child processes have no state, you can make them exit and restart whenever you want.
Of course there's a performance cost to doing so—process teardown and startup time, and page maps and caches that have to start over, and asking the OS to allocate the memory again, and so on. And there's also a complexity cost—you can't just run a pool and let it do its thing; you have to get involved in its thing and make it recycle processes for you.
There's no builtin support in the multiprocessing.Pool class for doing this.
You can, of course, build your own Pool. If you want to get fancy, you can look at the source to multiprocessing and do what it does. Or you can build a trivial pool out of a list of Process objects and a pair of Queues. Or you can just directly use Process objects without the abstraction of a pool.
Another reason you can have memory problems is that your individual processes are fine, but you just have too many of them.
And, in fact, that seems to be the case here.
You create a Pool of 4 workers in this function:
def make_X_and_Y_sets(sentences, i):
print(f'start: {i}')
pool = Pool()
# ...
… and you call this function for every chunk:
for i, line_chunk in enumerate(line_chunks(textfile)):
# ...
X_chunk, Y_chunk = make_X_and_Y_sets(line_chunk, i)
So, you end up with 4 new processes for every chunk. Even if each one has pretty low memory usage, having hundreds of them at once is going to add up.
Not to mention that you're probably severely hurting your time performance by having hundreds of processes competing over 4 cores, so you waste time in context switching and OS scheduling instead of doing real work.
As you pointed out in a comment, the fix for this is trivial: just make a single global pool instead of a new one for each call.
Sorry for getting all Columbo here, but… just one more thing… This code runs at the top level of your module:
for i, line_chunk in enumerate(line_chunks(textfile)):
# ...
X_chunk, Y_chunk = make_X_and_Y_sets(line_chunk, i)
… and that's the code that tries to spin up the pool and all the child tasks. But each child process in that pool needs to import this module, which means they're all going to end up running the same code, and spinning up another pool and a whole extra set of child tasks.
You're presumably running this on Linux or macOS, where the default startmethod is fork, which means multiprocessing can avoid this import, so you don't have a problem. But with the other startmethods, this code would basically be a forkbomb that eats up all of your system resources. And that includes spawn, which is the default startmethod on Windows. So, if there's ever any chance anyone might run this code on Windows, you should put all of that top-level code in a if __name__ == '__main__': guard.

Memory usage keep growing with Python's multiprocessing.pool

Here's the program:
#!/usr/bin/python
import multiprocessing
def dummy_func(r):
pass
def worker():
pass
if __name__ == '__main__':
pool = multiprocessing.Pool(processes=16)
for index in range(0,100000):
pool.apply_async(worker, callback=dummy_func)
# clean up
pool.close()
pool.join()
I found memory usage (both VIRT and RES) kept growing up till close()/join(), is there any solution to get rid of this? I tried maxtasksperchild with 2.7 but it didn't help either.
I have a more complicated program that calles apply_async() ~6M times, and at ~1.5M point I've already got 6G+ RES, to avoid all other factors, I simplified the program to above version.
EDIT:
Turned out this version works better, thanks for everyone's input:
#!/usr/bin/python
import multiprocessing
ready_list = []
def dummy_func(index):
global ready_list
ready_list.append(index)
def worker(index):
return index
if __name__ == '__main__':
pool = multiprocessing.Pool(processes=16)
result = {}
for index in range(0,1000000):
result[index] = (pool.apply_async(worker, (index,), callback=dummy_func))
for ready in ready_list:
result[ready].wait()
del result[ready]
ready_list = []
# clean up
pool.close()
pool.join()
I didn't put any lock there as I believe main process is single threaded (callback is more or less like a event-driven thing per docs I read).
I changed v1's index range to 1,000,000, same as v2 and did some tests - it's weird to me v2 is even ~10% faster than v1 (33s vs 37s), maybe v1 was doing too many internal list maintenance jobs. v2 is definitely a winner on memory usage, it never went over 300M (VIRT) and 50M (RES), while v1 used to be 370M/120M, the best was 330M/85M. All numbers were just 3~4 times testing, reference only.

I had memory issues recently, since I was using multiple times the multiprocessing function, so it keep spawning processes, and leaving them in memory.
Here's the solution I'm using now:
def myParallelProcess(ahugearray):
from multiprocessing import Pool
from contextlib import closing
with closing(Pool(15)) as p:
res = p.imap_unordered(simple_matching, ahugearray, 100)
return res

Simply create the pool within your loop and close it at the end of the loop with
pool.close().

Use map_async instead of apply_async to avoid excessive memory usage.
For your first example, change the following two lines:
for index in range(0,100000):
pool.apply_async(worker, callback=dummy_func)
to
pool.map_async(worker, range(100000), callback=dummy_func)
It will finish in a blink before you can see its memory usage in top. Change the list to a bigger one to see the difference. But note map_async will first convert the iterable you pass to it to a list to calculate its length if it doesn't have __len__ method. If you have an iterator of a huge number of elements, you can use itertools.islice to process them in smaller chunks.
I had a memory problem in a real-life program with much more data and finally found the culprit was apply_async.
P.S., in respect of memory usage, your two examples have no obvious difference.

I have a very large 3d point cloud data set I'm processing. I tried using the multiprocessing module to speed up the processing, but I started getting out of memory errors. After some research and testing I determined that I was filling the queue of tasks to be processed much quicker than the subprocesses could empty it. I'm sure by chunking, or using map_async or something I could have adjusted the load, but I didn't want to make major changes to the surrounding logic.
The dumb solution I hit on is to check the pool._cache length intermittently, and if the cache is too large then wait for the queue to empty.
In my mainloop I already had a counter and a status ticker:
# Update status
count += 1
if count%10000 == 0:
sys.stdout.write('.')
if len(pool._cache) > 1e6:
print "waiting for cache to clear..."
last.wait() # Where last is assigned the latest ApplyResult
So every 10k insertion into the pool I check if there are more than 1 million operations queued (about 1G of memory used in the main process). When the queue is full I just wait for the last inserted job to finish.
Now my program can run for hours without running out of memory. The main process just pauses occasionally while the workers continue processing the data.
BTW the _cache member is documented the the multiprocessing module pool example:
#
# Check there are no outstanding tasks
#
assert not pool._cache, 'cache = %r' % pool._cache

You can limit the number of task per child process
multiprocessing.Pool(maxtasksperchild=1)
maxtasksperchild is the number of tasks a worker process can complete before it will exit and be replaced with a fresh worker process, to enable unused resources to be freed. The default maxtasksperchild is None, which means worker processes will live as long as the pool. link

I think this is similar to the question I posted, but I'm not sure you have the same delay. My problem was that I was producing results from the multiprocessing pool faster than I was consuming them, so they built up in memory. To avoid that, I used a semaphore to throttle the inputs into the pool so they didn't get too far ahead of the outputs I was consuming.

Python Threading stdin/stdout

I have a file which contains a lot of data. Each row is a record. And I am trying to do some ETL work against the whole file. Right now I am using standard input to read the data line by line. The cool thing about this is your script could be very flexible to integrate with other script and shell commands. I write the result to standard output. For example.
$ cat input_file
line1
line2
line3
line4
...
My current python code looks like this - parse.py
import sys
for line in sys.stdin:
result = ETL(line) # ETL is some self defined function which takes a while to execute.
print result
The code below is how it is working right now:
cat input_file | python parse.py > output_file
I have looked at the Threading module of Python and I am wondering if the performance would be dramatically improved if I use that module.
Question1: How should I plan the quotas for each thread, why?
...
counter = 0
buffer = []
for line in sys.stdin:
buffer.append(line)
if counter % 5 == 0: # maybe assign 5 rows to each thread? if not, is there a rule of thumb to determine
counter = 0
thread = parser(buffer)
buffer = []
thread.start()
Question2: Multiple Threads might print the result back to stdout at the same time, how to organize them and avoid the situation below?
import threading
import time
class parser(threading.Thread):
def __init__ (self, data_input):
threading.Thread.__init__(self)
self.data_input = data_input
def run(self):
for elem in self.data_input:
time.sleep(3)
print elem + 'Finished'
work = ['a', 'b', 'c', 'd', 'e', 'f']
thread1 = parser(['a', 'b'])
thread2 = parser(['c', 'd'])
thread3 = parser(['e', 'f'])
thread1.start()
thread2.start()
thread3.start()
The output is really ugly, where one row contains the outputs from two threads.
aFinished
cFinishedeFinished
bFinished
fFinished
dFinished

Taking your second question first, this is what mutexes are for. You can get the cleaner output that you want by using a lock to coordinate among the parsers and ensure that only one thread has access to the output stream during a given period of time:
class parser(threading.Thread):
output_lock = threading.Lock()
def __init__ (self, data_input):
threading.Thread.__init__(self)
self.data_input = data_input
def run(self):
for elem in self.data_input:
time.sleep(3)
with self.output_lock:
print elem + 'Finished'
As regards your first question, note that it's probably the case that multi-threading will provide no benefit for your particular workload. It largely depends on whether the work you do with each input line (your ETL function) is primarily CPU-bound or IO-bound. If the former (which I suspect is likely), threads will be of no help, because of the global interpreter lock. In that case, you would want to use the multiprocessing module to distribute work among multiple processes instead of multiple threads.
But you can get the same results with an easier to implement workflow: Split the input file into n pieces (using, e.g., the split command); invoke the extract-and-transform script separately on each subfile; then concatenate the resulting output files.
One nitpick: "using standard input to read the data line by line because it won't load the whole file into memory" involves a misconception. You can read a file line by line from within Python by, e.g., replacing sys.stdin with a file object in a construct like:
for line in sys.stdin:
See also the readline() method of file objects, and note that read() can take as parameter the maximum number of bytes to read.

Whether threading will be helpful you is highly dependent on on your situation. In particular, if your ETL() function involves a lot of disk access, then threading would likely give you pretty significant speed improvement.
In response to your first question, I've always found that it just depends. There are a lot of factors at play when determining the ideal number of threads, and many of them are program-dependent. If you're doing a lot of disk access (which is pretty slow), for example, then you'll want more threads to take advantage of the downtime while waiting for disk access. If the program is CPU-bound, though, tons of threads may not be super helpful. So, while it may be possible to analyze all the factors to come up with an ideal number of threads, it's usually a lot faster to make an initial guess and then adjust from there.
More specifically, though, assigning a certain number of lines to each thread probably isn't the best way to go about divvying up the work. Consider, for example, if one line takes a particularly long time to process. It would be best if one thread could work away at that one line and the other threads could each do a few more lines in the meantime. The best way to handle this is to use a Queue. If you push each line into a Queue, then each thread can pull a line off the Queue, handle it, and repeat until the Queue is empty. This way, the work gets distributed such that no thread is ever without work to do (until the end, of course).
Now, the second question. You're definitely right that writing to stdout from multiple threads at once isn't an ideal solution. Ideally, you would arrange things so that the writing to stdout happens in only one place. One great way to do that is to use a Queue. If you have each thread write its output to a shared Queue, then you can spawn an additional thread whose sole task is to pull items out of that Queue and print them to stdout. By restricting the printing to just one threading, you'll avoid the issues inherent in multiple threads trying to print at once.

How to parse a large file taking advantage of threading in Python?

I have a huge file and need to read it and process.
with open(source_filename) as source, open(target_filename) as target:
for line in source:
target.write(do_something(line))
do_something_else()
Can this be accelerated with threads? If I spawn a thread per line, will this have a huge overhead cost?
edit: To make this question not a discussion, How should the code look like?
with open(source_filename) as source, open(target_filename) as target:
?
#Nicoretti: In an iteration I need to read a line of several KB of data.
update 2: the file may be a bz2, so Python may have to wait for unpacking:
$ bzip2 -d country.osm.bz2 | ./my_script.py

You could use three threads: for reading, processing and writing. The possible advantage is that the processing can take place while waiting for I/O, but you need to take some timings yourself to see if there is an actual benefit in your situation.
import threading
import Queue
QUEUE_SIZE = 1000
sentinel = object()
def read_file(name, queue):
with open(name) as f:
for line in f:
queue.put(line)
queue.put(sentinel)
def process(inqueue, outqueue):
for line in iter(inqueue.get, sentinel):
outqueue.put(do_something(line))
outqueue.put(sentinel)
def write_file(name, queue):
with open(name, "w") as f:
for line in iter(queue.get, sentinel):
f.write(line)
inq = Queue.Queue(maxsize=QUEUE_SIZE)
outq = Queue.Queue(maxsize=QUEUE_SIZE)
threading.Thread(target=read_file, args=(source_filename, inq)).start()
threading.Thread(target=process, args=(inq, outq)).start()
write_file(target_filename, outq)
It is a good idea to set a maxsize for the queues to prevent ever-increasing memory consumption. The value of 1000 is an arbitrary choice on my part.

Does the processing stage take relatively long time, ie, is it cpu-intenstive? If not, then no, you dont win much by threading or multiprocessing it. If your processing is expensive, then yes. So, you need to profile to know for sure.
If you spend relatively more time reading the file, ie it is big, than processing it, then you can't win in performance by using threads, the bottleneck is just the IO which threads dont improve.

This is the exact sort of thing which you should not try to analyse a priori, but instead should profile.
Bear in mind that threading will only help if the per-line processing is heavy. An alternative strategy would be to slurp the whole file into memory, and process it in memory, which may well obviate threading.
Whether you have a thread per line is, once again, something for fine-tuning, but my guess is that unless parsing the lines is pretty heavy, you may want to use a fixed number of worker threads.
There is another alternative: spawn sub-processes, and have them do the reading, and the processing. Given your description of the problem, I would expect this to give you the greatest speed-up. You could even use some sort of in-memory caching system to speed up the reading, such as memcached (or any of the similar-ish systems out there, or even a relational database).

In CPython, threading is limited by the global interpreter lock — only one thread at a time can actually be executing Python code. So threading only benefits you if either:
you are doing processing that doesn't require the global interpreter lock; or
you are spending time blocked on I/O.
Examples of (1) include applying a filter to an image in the Python Imaging Library, or finding the eigenvalues of a matrix in numpy. Examples of (2) include waiting for user input, or waiting for a network connection to finish sending data.
So whether your code can be accelerated using threads in CPython depends on what exactly you are doing in the do_something call. (But if you are parsing the line in Python then it very unlikely that you can speed this up by launching threads.) You should also note that if you do start launching threads then you will face a synchronization problem when you are writing the results to the target file. There is no guarantee that threads will complete in the same order that they were started, so you will have to take care to ensure that the output comes out in the right order.
Here's a maximally threaded implementation that has threads for reading the input, writing the output, and one thread for processing each line. Only testing will tell you if this faster or slower than the single-threaded version (or Janne's version with only three threads).
from threading import Thread
from Queue import Queue
def process_file(f, source_filename, target_filename):
"""
Apply the function `f` to each line of `source_filename` and write
the results to `target_filename`. Each call to `f` is evaluated in
a separate thread.
"""
worker_queue = Queue()
finished = object()
def process(queue, line):
"Process `line` and put the result on `queue`."
queue.put(f(line))
def read():
"""
Read `source_filename`, create an output queue and a worker
thread for every line, and put that worker's output queue onto
`worker_queue`.
"""
with open(source_filename) as source:
for line in source:
queue = Queue()
Thread(target = process, args=(queue, line)).start()
worker_queue.put(queue)
worker_queue.put(finished)
Thread(target = read).start()
with open(target_filename, 'w') as target:
for output in iter(worker_queue.get, finished):
target.write(output.get())

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.