Parsing a large number of files in parallel and summarizing - Python

I have a tool that generates a large number of files (ranging from hundreds of thousands to a few million) whenever it runs. All these files can be read independently of each other. I need to parse them and summarize the information.
Dummy example of generated files:
File1:
NAME=John AGE=25 ADDRESS=123 Fake St
NAME=Jane AGE=25 ADDRESS=234 Fake St
File2:
NAME=Dan AGE=30 ADDRESS=123 Fake St
NAME=Lisa AGE=30 ADDRESS=234 Fake St
Summary - counts how many times an address appeared across all files:
123 Fake St - 2
234 Fake St - 2
I want to use parallelization to read them, so multiprocessing or asyncio come to mind (I/O intensive operations). I plan to do the following operations in a single unit/function that will be called in parallel for each file:
Open the file, go line by line
Populate a unique dict containing the information provided by this file specifically
Close the file
Once I am done reading all the files in parallel, and have one dict per file, I can now loop over each dict and summarize as needed.
The reason I think I need this two-step process is that I can't have multiple parallel calls to that function directly summarize and write to a common summary dict; that would mess things up.
But that means I will consume a large amount of memory (due to holding hundreds of thousands to millions of dicts in memory).
What would be a good way to get the best of both worlds - runtime and memory consumption - to meet this objective?

Based on the comments, here's an example using multiprocessing.Pool.
Each process reads one file line by line and sends the result back to the main process to collect.
import re
import multiprocessing
from collections import Counter

pat = re.compile(r"([A-Z]{2,})=(.+?)(?=[A-Z]{2,}=|$)")

def process_file(filename):
    c = Counter()
    with open(filename, "r") as f_in:
        for line in f_in:
            d = dict(pat.findall(line))
            if "ADDRESS" in d:
                c[d["ADDRESS"]] += 1
    # send partial result back to main process:
    return c

if __name__ == "__main__":
    # you can get the file list for example from the `glob` module:
    files = ["file1.txt", "file2.txt"]

    final_counter = Counter()
    with multiprocessing.Pool(processes=4) as pool:
        # iterate over files and update the final Counter:
        for result in pool.imap_unordered(process_file, files):
            final_counter.update(result)

    # print the final Counter:
    for k, v in final_counter.items():
        print("{:<20} {}".format(k, v))
Prints:
123 Fake St 2
234 Fake St 2
Note: you can use the tqdm module to get a nice progress bar, for example:
...
from tqdm import tqdm
...
for result in pool.imap_unordered(process_file, tqdm(files)):
...
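A design note on the pool call: with hundreds of thousands of files, dispatching one task at a time adds noticeable inter-process overhead. imap_unordered accepts a chunksize argument that batches tasks per worker; a minimal sketch (the value 64 is just an illustrative guess to be tuned):

# same worker and file list as above; only the pool call changes
with multiprocessing.Pool(processes=4) as pool:
    # chunksize batches tasks sent to each worker, reducing IPC overhead;
    # memory stays bounded because only per-file Counters travel back
    for result in pool.imap_unordered(process_file, files, chunksize=64):
        final_counter.update(result)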

Solution
Redis Queue (RQ) is likely the most appropriate tool for this use case. It will allow you to process the files in parallel while managing resources, without needing to use multiprocessing (though you still could if you really needed to).
The benefit of this approach, in contrast with solely using multiprocessing or asyncio, is that it scales horizontally when used in conjunction with Redis Cluster.
Pseudocode
from redis import Redis
from rq import Queue, Retry

def do_work(filename):
    if maximum_filesize_in_queue_exceeds_resources():
        raise ResourcesUnavailableError
    with open(filename, "r") as f:
        f.read()
        ...

def run(files_to_process):
    q = Queue(connection=Redis())
    for filename in files_to_process:
        # enqueue the filename, not an open file handle, so the job is serializable
        q.enqueue(do_work, filename, retry=Retry(max=3, interval=[10, 30, 60]))

if __name__ == "__main__":
    files_to_process = ("foo", "bar", "baz")
    run(files_to_process)
...
Results
You could also use Redis to store the results from each worker, and use rq's success hook to update the summary, which would also be stored in Redis. Once all jobs have completed, you could then print a report with the result and summary information stored in Redis.
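As a rough sketch of that idea (the hash name address_counts and helper names are illustrative assumptions, not part of rq itself), each worker could fold its per-file counts into a shared Redis hash; HINCRBY is atomic on the Redis side, so no Python-level locking is needed:

from redis import Redis

r = Redis()

def record_counts(per_file_counter):
    # atomically add this file's counts to the shared summary hash
    for address, count in per_file_counter.items():
        r.hincrby("address_counts", address, count)

def print_summary():
    # read back the summary once all jobs have completed
    for address, count in r.hgetall("address_counts").items():
        print(address.decode(), int(count))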
References
https://redis.com
https://python-rq.org
https://redis.io/docs/manual/scaling/

Related

How to read large sas file with pandas and export to parquet using multiprocessing?

I'm trying to make a script that reads a sas7bdat file and exports it to parquet using pandas, but I'm struggling to improve performance with large files (>1 GB and more than 1 million rows). Doing some research, I found that using multiprocessing could help me, but I can't make it work. The code runs with no errors, but no parquet files are created.
Here is what I've got so far:
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

arq = path_to_my_file

def sas_mult_process(data):
    for i, df in enumerate(data):
        df.to_parquet(f"{'hist_dif_base_pt'+str(i)}.parquet")

file_reader = pd.read_sas(arq, chunksize=100000, encoding='ISO-8859-1', format='sas7bdat')

with ThreadPoolExecutor(max_workers=10) as executor:
    executor.map(sas_mult_process, file_reader)
Can anyone see where my mistake is?
You use the term multiprocessing all over the place, yet your code is not using multiprocessing but rather multithreading. It appears that you are trying to break the input file into dataframe chunks and have each chunk become a separate output file. If so, you would want to pass each chunk to your worker sas_mult_process, which would then process that single chunk. I am assuming that converting the input to parquet involves more than just I/O and entails some CPU processing; therefore, multiprocessing would be a better choice.
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

arq = path_to_my_file

def sas_mult_process(tpl):
    """
    This worker function is passed an index and a single chunk
    as a tuple.
    """
    i, df = tpl  # Unpack
    # The following f-string can be simplified:
    df.to_parquet(f"hist_dif_base_pt{i}.parquet")

# Required for Windows:
if __name__ == '__main__':
    file_reader = pd.read_sas(arq, chunksize=100000, encoding='ISO-8859-1', format='sas7bdat')
    with ProcessPoolExecutor(max_workers=10) as executor:
        executor.map(sas_mult_process, enumerate(file_reader))

Parallel computing for loop with no last function

I'm trying to parallelize reading the contents of 16 gzip files with this script:
import gzip
import glob
from dask import delayed
from dask.distributed import Client, LocalCluster

@delayed
def get_gzip_delayed(gzip_file):
    with gzip.open(gzip_file) as f:
        reads = f.readlines()
        reads = [read.decode("utf-8") for read in reads]
    return reads

if __name__ == "__main__":
    cluster = LocalCluster()
    client = Client(cluster)

    read_files = glob.glob("*.txt.gz")
    all_files = []
    for file in read_files:
        reads = get_gzip_delayed(file)
        all_files.extend(reads)

    with open("all_reads.txt", "w") as f:
        w = delayed(all_files.writelines)(f)
    w.compute()
However, I get the following error:
> TypeError: Delayed objects of unspecified length are not iterable
How do I parallelize a for loop with extend/append and write the result to a file? All the dask examples always include some final function performed on the for-loop product.
List all_files consists of delayed values, and calling delayed(f.writelines)(all_files) (note the different arguments relative to the code in the question) is not going to work for several reasons, the main one being that you prepare lazy instructions for writing but execute them only after closing the file.
There are different ways to solve this problem, at least two are:
if the data from the files fits into memory, then it's easiest to compute it and write to the file:
import dask

(all_files,) = dask.compute(all_files)  # compute returns a tuple; unpack it
with open("all_reads.txt", "w") as f:
    for reads in all_files:  # each element is one file's list of lines
        f.writelines(reads)
if the data cannot fit into memory, then another option is to put the writing inside the get_gzip_delayed function, so data doesn't travel between worker and client:
from dask.distributed import Lock

@delayed
def get_gzip_delayed(gzip_file):
    with gzip.open(gzip_file) as f:
        reads = f.readlines()
    # create a lock to prevent others from writing at the same time
    with Lock("all_reads.txt"):
        with open("all_reads.txt", "a") as f:  # need to be careful here, since the file is appended to
            f.writelines([read.decode("utf-8") for read in reads])
Note that if memory is a severe constraint, then the above can also be refactored to process the files line-by-line (at the cost of slower IO).
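One possible shape for that line-by-line variant, as a sketch (it assumes the same all_reads.txt target and distributed Lock as above, and a running Client as in the question):

import gzip
from dask import delayed
from dask.distributed import Lock

@delayed
def gzip_to_output_linewise(gzip_file):
    # stream one line at a time instead of holding a whole file's lines in memory
    with gzip.open(gzip_file, "rt", encoding="utf-8") as f_in:
        with Lock("all_reads.txt"):
            with open("all_reads.txt", "a") as f_out:
                for line in f_in:
                    f_out.write(line)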

How can I write each process from multiprocessing to a separate csv file using pandas?

I have several large txt files. Call them mytext01.txt, mytext02.txt, mytext03.txt (in reality there are many more than three). I want to create a separate dataframe for each file that counts occurrences of certain keywords and then write each dataframe to its own csv file. I'd like each txt file to be handled in one process using the multiprocessing library.
I have written code that I thought would do what I wanted, but the csv file never appeared (the code doesn't seem to be doing much of anything; the entire thing runs more quickly than it would normally take to just load a single file). Here is a simplified version of what I tried:
import pandas as pd
from multiprocessing import Pool

keywords = ['dog', 'cat', 'fish']

def count_words(file_number):
    file = path + 'mytext{}.txt'.format(file_number)
    with open(file, 'r', encoding='utf-8') as f:
        text = f.read()
    text = text.split(' ')
    words_dict = dict(zip(keywords, [0 for _ in keywords]))
    for word in words_dict.keys():
        words_dict[word] = text.count(word)
    words_df = pd.DataFrame.from_dict(words_dict, orient='index')
    words_df.to_csv('word_counts{}.csv'.format(file_number))

if __name__ == '__main__':
    pool = Pool()
    pool.map(count_words, ['01', '02', '03'])
I'm not super familiar with using multiprocessing, so any idea of what I have done wrong would be much appreciated. Thanks!
In my experience it's better to have a dedicated function for parallelization, such as:
import multiprocessing as mp

def parallelize(fun, vec, cores):
    with mp.Pool(cores) as p:
        res = p.map(fun, vec)
    return res
Now you just have to check that your function count_words works for a single file_number, and then you can use parallelize, as sketched below.
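A minimal usage sketch, assuming count_words from the question and parallelize from above are defined in the same module (the file-number suffixes are just the question's example values):

if __name__ == '__main__':
    # run count_words over the three files on 3 worker processes
    results = parallelize(count_words, ['01', '02', '03'], 3)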

Multiprocessing vs multithreading for parsing large files, while maintaining a counter for the number of lines processed altogether

I have a lot of large files containing plain text (comma separated).
I need to perform some action on every line and keep a running count of the total number of lines processed, irrespective of the number of files processed.
I'm not sure which approach would be better (multiprocessing or multithreading).
I tried implementing multiprocessing, but the run time barely differed from sequential processing.
It might be that I didn't apply multiprocessing correctly.
global_counter = 0

def process_file(file):
    with open(file, 'r') as f, open(file + '.done', 'w') as df, open(file + '.err', 'w') as ef:
        # some operations on the file
        # based on some operation logs, write to df or ef
        # increment global_counter with every line
        ...

records = process_file(file)

t = Pool(processes=8)
for i in records:
    t.map(processing, (i,))
t.close()
The execution time remained the same.
I want to implement multiprocessing/multithreading to reduce the time spent processing multiple large files.
Kindly help me decide which of the two approaches would be better for my case.
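A minimal sketch of one common pattern for this, mirroring the per-file Counter approach from the first answer above (the worker body and file names are illustrative assumptions): each process returns its own line count and the parent sums them, so no shared global counter is needed.

from multiprocessing import Pool

def process_file(filename):
    lines_done = 0
    with open(filename, 'r') as f:
        for line in f:
            # ... perform the per-line action here ...
            lines_done += 1
    return lines_done

if __name__ == '__main__':
    files = ['a.csv', 'b.csv', 'c.csv']  # illustrative file names
    total_lines = 0
    with Pool(processes=8) as pool:
        for count in pool.imap_unordered(process_file, files):
            total_lines += count  # aggregate in the parent, no shared state needed
    print(total_lines)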

MPI in Python: load data from a file by line concurrently

I'm new to Python as well as MPI.
I have a huge data file, 10 GB, and I want to load it into, e.g., a list or whatever is more efficient; please suggest.
Here is the way I load the file content into a list:
def load(source, size):
    data = [[] for _ in range(size)]
    ln = 0
    with open(source, 'r') as input:
        for line in input:
            ln += 1
            data[ln % size].sanitize(line)
    return data
Note:
source is the file name.
size is the number of concurrent processes; I divide the data into [size] sublists for parallel computing using MPI in Python.
Please advise how to load the data more efficiently and faster. I've been searching for days but I couldn't find anything that matches my purpose; if something exists, please comment with a link here.
Regards
If I have understood the question, your bottleneck is not Python data structures. It is the I/O speed that limits the efficiency of your program.
If the file is written in contiguous blocks on the HDD, then I don't know of a way to read it faster than reading the file from the first byte to the end.
But if the file is fragmented, create multiple threads, each reading a part of the file. That might seem to slow down reading, but modern HDDs implement a technique named NCQ (Native Command Queueing): it gives high priority to read operations on sectors with addresses near the current position of the HDD head, hence improving the overall speed of reads issued by multiple threads.
To suggest an efficient data structure in Python for your program, you need to say which operations you will perform on the data (delete, add, insert, search, append and so on) and how often.
By the way, if you use commodity hardware, 10 GB of RAM is expensive. Try reducing the need for this amount of RAM by loading only the data necessary for a computation, then replacing it with new data for the next operation. You can overlap the computation with the I/O operations to improve performance.
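As a rough sketch of splitting the per-line work across MPI ranks (assuming mpi4py; the interleaved line assignment and file name are just illustrative), each rank can read the file itself and keep only every size-th line, so no single process has to hold all 10 GB:

from mpi4py import MPI

def load_my_share(source):
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()   # this process's id
    size = comm.Get_size()   # number of MPI processes
    my_lines = []
    with open(source, 'r') as f:
        for i, line in enumerate(f):
            if i % size == rank:  # keep only this rank's slice of lines
                my_lines.append(line)
    return my_lines

if __name__ == "__main__":
    # run with, e.g.: mpiexec -n 4 python load_my_share.py
    data = load_my_share("huge_file.txt")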
(original) Solution using pickling
The strategy for your task can go this way:
split the large file into smaller ones, making sure they are divided on line boundaries
have Python code which can convert a smaller file into the resulting list of records and save it as a pickled file
run the Python code for all the smaller files in parallel (using Python or other means)
run integrating code, taking the pickled files one by one, loading the list from each and appending it to the final result.
To gain anything, you have to be careful, as the overhead can overcome all possible gains from parallel runs:
as Python uses the Global Interpreter Lock (GIL), do not use threads for parallel processing; use processes. As processes cannot simply pass data around, you have to pickle the data and let the other (final integrating) part read the result from it.
try to minimize the number of loops. For this reason it is better to:
not split the large file into too many smaller parts. To use the power of your cores, it is best to fit the number of parts to the number of cores (or possibly twice as many, but going higher will spend too much time on switching between processes).
pickling allows saving individual items, but it is better to create a list of items (records) and pickle the list as one item. Pickling one list of 1000 items will be faster than pickling 1000 small items one by one.
some tasks (splitting the file, calling the conversion task in parallel) can often be done faster by existing tools in the system. If you have this option, use it.
In my small test, I created a file with 100 thousand lines with content "98-BBBBBBBBBBBBBB", "99-BBBBBBBBBBB" etc. and tested converting it to a list of numbers [..., 98, 99, ...].
For splitting I used the Linux command split, asking it to create 4 parts preserving line boundaries:
$ split -n l/4 long.txt
This created smaller files xaa, xab, xac, xad.
To convert each smaller file I used the following script, which converts the content into a file with the extension .pickled containing the pickled list.
# chunk2pickle.py
import pickle
import sys

def process_line(line):
    return int(line.split("-", 1)[0])

def main(fname, pick_fname):
    with open(pick_fname, "wb") as fo:
        with open(fname) as f:
            pickle.dump([process_line(line) for line in f], fo)

if __name__ == "__main__":
    fname = sys.argv[1]
    pick_fname = fname + ".pickled"
    main(fname, pick_fname)
To convert one chunk of lines into a pickled list of records:
$ python chunk2pickle.py xaa
and it creates the file xaa.pickled.
But as we need to do this in parallel, I used the parallel tool (which has to be installed on the system):
$ parallel -j 4 python chunk2pickle.py {} ::: xaa xab xac xad
and I found new files with extension .pickled on the disk.
-j 4 asks to run 4 processes in parallel; adjust it to your system, or leave it out and it will default to the number of cores you have.
parallel can also take the list of parameters (the input file names in our case) by other means, like the ls command:
$ ls x?? |parallel -j 4 python chunk2pickle.py {}
To integrate the results, use script integrate.py:
# integrate.py
import pickle

def main(file_names):
    res = []
    for fname in file_names:
        with open(fname, "rb") as f:
            res.extend(pickle.load(f))
    return res

if __name__ == "__main__":
    file_names = ["xaa.pickled", "xab.pickled", "xac.pickled", "xad.pickled"]
    # here you have the list of records you asked for
    records = main(file_names)
    print(records)
In my answer I have used a couple of external tools (split and parallel). You can do a similar task with Python too. My answer focuses only on giving you an option to keep the Python code for converting lines into the required data structures. A complete pure-Python answer is not covered here (it would get much longer and probably slower).
Solution using process Pool (no explicit pickling needed)
The following solution uses multiprocessing from Python. In this case there is no need to pickle results explicitly (the multiprocessing library pickles arguments and return values automatically when passing them between processes).
# direct_integrate.py
from multiprocessing import Pool

def process_line(line):
    return int(line.split("-", 1)[0])

def process_chunkfile(fname):
    with open(fname) as f:
        return [process_line(line) for line in f]

def main(file_names, cores=4):
    p = Pool(cores)
    return p.map(process_chunkfile, file_names)

if __name__ == "__main__":
    file_names = ["xaa", "xab", "xac", "xad"]
    # here you have the list of records you asked for
    # warning: the records come in groups, one per chunk file
    record_groups = main(file_names)
    for rec_group in record_groups:
        print(rec_group)
This updated solution still assumes that the large file is available in the form of four smaller files; a pure-Python way to produce such chunks is sketched below.
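If you would rather not rely on the external split command, here is a minimal pure-Python sketch (the chunk-name prefix and part count are illustrative) that splits a large file into N parts on line boundaries:

def split_on_lines(source, n_parts, prefix="chunk"):
    # count lines first, then write roughly equal-sized parts,
    # never breaking a line across two output files
    with open(source) as f:
        total = sum(1 for _ in f)
    per_part = (total + n_parts - 1) // n_parts  # ceiling division
    names = []
    with open(source) as f:
        for part in range(n_parts):
            name = "{}{:02d}".format(prefix, part)
            names.append(name)
            with open(name, "w") as out:
                for _ in range(per_part):
                    line = f.readline()
                    if not line:
                        break
                    out.write(line)
    return names

# e.g. split_on_lines("long.txt", 4) -> ["chunk00", "chunk01", "chunk02", "chunk03"]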
