I'm trying to parallelize reading the contents of 16 gzip files with this script:
import gzip
import glob
from dask import delayed
from dask.distributed import Client, LocalCluster

@delayed
def get_gzip_delayed(gzip_file):
    with gzip.open(gzip_file) as f:
        reads = f.readlines()
        reads = [read.decode("utf-8") for read in reads]
        return reads
if __name__ == "__main__":
    cluster = LocalCluster()
    client = Client(cluster)

    read_files = glob.glob("*.txt.gz")
    all_files = []
    for file in read_files:
        reads = get_gzip_delayed(file)
        all_files.extend(reads)

    with open("all_reads.txt", "w") as f:
        w = delayed(all_files.writelines)(f)
        w.compute()
However, I get the following error:
> TypeError: Delayed objects of unspecified length are not iterable
How do I parallelize a for loop that uses extend/append and then write the result out to a file? All the dask examples I have found apply some final function to the product of the for loop.
The list all_files consists of delayed values, and calling delayed(f.writelines)(all_files) (note the different arguments relative to the code in the question) is not going to work for several reasons, the main one being that you prepare lazy instructions for writing but execute them only after closing the file.
There are different ways to solve this problem; here are at least two:
if the data from the files fits into memory, then it's easiest to compute it and write to the file:
import dask
# each element of all_files is one file's (delayed) list of lines
(all_files,) = dask.compute(all_files)
with open("all_reads.txt", "w") as f:
    for reads in all_files:
        f.writelines(reads)
if the data cannot fit into memory, then another option is to put the writing inside the get_gzip_delayed function, so data doesn't travel between worker and client:
from dask.distributed import Lock

@delayed
def get_gzip_delayed(gzip_file):
    with gzip.open(gzip_file) as f:
        reads = f.readlines()

    # create a lock to prevent others from writing at the same time
    with Lock("all_reads.txt"):
        with open("all_reads.txt", "a") as f:  # need to be careful here, since the file is opened in append mode
            f.writelines([read.decode("utf-8") for read in reads])
Note that if memory is a severe constraint, then the above can also be refactored to process the files line-by-line (at the cost of slower IO).
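For illustration, a minimal sketch of that line-by-line variant, using the same Lock-based approach; the function name and the gzip text-mode arguments below are assumptions and not part of the original answer:

import gzip
from dask import delayed
from dask.distributed import Lock

@delayed
def append_gzip_line_by_line(gzip_file):
    # stream one decoded line at a time instead of holding the whole
    # decompressed file in memory
    with Lock("all_reads.txt"):
        with gzip.open(gzip_file, "rt", encoding="utf-8") as f_in:
            with open("all_reads.txt", "a") as f_out:
                for line in f_in:
                    f_out.write(line)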
I'm trying to make a script that reads a sas7bdat file and exports it to parquet using pandas, but I'm struggling with performance on large files (>1 GB and more than 1 million rows). Doing some research, I found that multiprocessing could help me, but I can't make it work. The code runs with no errors, but no parquet files are created.
Here is what I have so far:
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

arq = path_to_my_file

def sas_mult_process(data):
    for i, df in enumerate(data):
        df.to_parquet(f"{'hist_dif_base_pt'+str(i)}.parquet")

file_reader = pd.read_sas(arq, chunksize=100000, encoding='ISO-8859-1', format='sas7bdat')

with ThreadPoolExecutor(max_workers=10) as executor:
    executor.map(sas_mult_process, file_reader)
Can anyone see where my mistake is?
You use the term multiprocessing all over the place, yet your code is not using multiprocessing but rather multithreading. It appears that you are trying to break the input file up into dataframe chunks and have each chunk become a separate output file. If that is so, you would want to pass each chunk to your worker sas_mult_process, which would then process that single chunk. I am assuming that converting the input to parquet involves more than just I/O and entails some CPU processing, so multiprocessing would be the better choice.
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

arq = path_to_my_file

def sas_mult_process(tpl):
    """
    This worker function is passed an index and a single chunk
    as a tuple.
    """
    i, df = tpl  # Unpack
    # The f-string from the question can be simplified to:
    df.to_parquet(f"hist_dif_base_pt{i}.parquet")

# Required for Windows:
if __name__ == '__main__':
    file_reader = pd.read_sas(arq, chunksize=100000, encoding='ISO-8859-1', format='sas7bdat')
    with ProcessPoolExecutor(max_workers=10) as executor:
        executor.map(sas_mult_process, enumerate(file_reader))
I have a tool that generates a large number of files (ranging from hundreds of thousands to a few million) whenever it runs. All these files can be read independently of each other. I need to parse them and summarize the information.
Dummy example of generated files:
File1:
NAME=John AGE=25 ADDRESS=123 Fake St
NAME=Jane AGE=25 ADDRESS=234 Fake St
File2:
NAME=Dan AGE=30 ADDRESS=123 Fake St
NAME=Lisa AGE=30 ADDRESS=234 Fake St
Summary - counts how many times an address appeared across all files:
123 Fake St - 2
234 Fake St - 2
I want to use parallelization to read them, so multiprocessing or asyncio come to mind (I/O intensive operations). I plan to do the following operations in a single unit/function that will be called in parallel for each file:
Open the file, go line by line
Populate a unique dict containing the information provided by this file specifically
Close the file
Once I am done reading all the files in parallel, and have one dict per file, I can now loop over each dict and summarize as needed.
The reason I think I need this two-step process is that I can't have multiple parallel calls to that function directly summarizing and writing to a common summary dict; that would mess things up.
But that means I will consume a large amount of memory (due to holding those hundreds of thousands to millions of dicts in memory).
What would be a good way to get the best of both worlds - runtime and memory consumption - to meet this objective?
Based on the comments, here's an example using multiprocessing.Pool.
Each process reads one file line by line and sends the result back to the main process to be collected.
import re
import multiprocessing
from collections import Counter

pat = re.compile(r"([A-Z]{2,})=(.+?)(?=[A-Z]{2,}=|$)")

def process_file(filename):
    c = Counter()
    with open(filename, "r") as f_in:
        for line in f_in:
            d = dict(pat.findall(line))
            if "ADDRESS" in d:
                c[d["ADDRESS"]] += 1
    # send partial result back to main process:
    return c

if __name__ == "__main__":
    # you can get the file list for example from the `glob` module:
    files = ["file1.txt", "file2.txt"]

    final_counter = Counter()
    with multiprocessing.Pool(processes=4) as pool:
        # iterate over files and update the final Counter:
        for result in pool.imap_unordered(process_file, files):
            final_counter.update(result)

    # print final Counter:
    for k, v in final_counter.items():
        print("{:<20} {}".format(k, v))
Prints:
123 Fake St 2
234 Fake St 2
Note: You can use the tqdm module to get a nice progress bar, for example:
...
from tqdm import tqdm
...
for result in pool.imap_unordered(process_file, tqdm(files)):
...
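Wrapping files only advances the bar as tasks are handed out to the pool. If you want it to tick as results actually complete, one option is to wrap the result iterator instead and pass the total explicitly; a small variation on the loop above, reusing the same names:

from tqdm import tqdm

with multiprocessing.Pool(processes=4) as pool:
    # the bar advances each time a file's partial Counter comes back
    for result in tqdm(pool.imap_unordered(process_file, files), total=len(files)):
        final_counter.update(result)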
Solution
Redis Queue (RQ) is likely the most appropriate tool for this use case. It will allow you to process the files in parallel while managing resources, without needing to use multiprocessing directly (though you still could if you really needed to).
The benefit of this approach, in contrast with solely using multiprocessing or asyncio, is that it scales horizontally when used in conjunction with Redis Cluster.
Pseudocode
from redis import Redis
from rq import Queue, Retry

def do_work(path):
    if maximum_filesize_in_queue_exceeds_resources():
        raise ResourcesUnavailableError
    with open(path, "r") as f:
        f.read()
        ...

def run(files_to_process):
    q = Queue(connection=Redis())
    for path in files_to_process:
        # enqueue the path, not an open file handle, so the job is serializable
        q.enqueue(do_work, path, retry=Retry(max=3, interval=[10, 30, 60]))

if __name__ == "__main__":
    files_to_process = ("foo", "bar", "baz")
    run(files_to_process)
    ...
Results
You could also use Redis to store the results from each worker, and use rq's success hook to update the summary, which would likewise be stored in Redis. Once all jobs have completed, you could then print a report from the result and summary information stored in Redis.
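As a rough sketch of that idea: the on_success callback below uses rq's documented hook signature, while merge_summary, the address_summary hash key, and the assumption that do_work is changed to return a dict of per-file address counts are all hypothetical:

from redis import Redis
from rq import Queue

def merge_summary(job, connection, result, *args, **kwargs):
    # hypothetical success hook: fold one worker's partial counts into a
    # Redis hash holding the running summary
    for address, count in result.items():
        connection.hincrby("address_summary", address, count)

def enqueue_with_summary(q, path):
    # assumes do_work returns {address: count} for its file
    return q.enqueue(do_work, path, on_success=merge_summary)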
References
https://redis.com
https://python-rq.org
https://redis.io/docs/manual/scaling/
How can I construct a Ray setup where each process writes its results to a common file? What I'm currently trying is:
import ray
import time
import pickle
import filelock

ray.init()
filename = 'data/db.pkl'

@ray.remote
def f(i):
    try:
        with filelock.FileLock(filename):
            with open(filename, 'rb') as file:
                data = pickle.load(file)
    except FileNotFoundError:
        data = {}
    if i not in data.keys():
        # The actual computation that takes time and needs to be parallel: here just a square.
        new_key = i
        new_item = i**2
        with filelock.FileLock(filename):
            with open(filename, 'rb') as file:
                data = pickle.load(file)
            data[new_key] = new_item
            with open(filename, 'wb') as file:
                pickle.dump(data, file)
    return None

numbers = [0,1,2,3,4,5,6,7,8,9,10]
rez = [f.remote(i) for i in numbers]
But I get an error.
How can I achieve this behavior? I want each process to:
1° Check the database to see if its work is needed
2° Do the work
3° Write its result to the database.
Without locking the file this works, but not all results are saved. How can I achieve the wanted behavior? Note that later I'll need this to work in a distributed setup.
First of all, you should use 'ab' (append mode) instead of 'wb' (which overwrites the file). With append mode you shouldn't need locking, since appends are thread-safe on a POSIX system.
What error did you get when using the lock on the file?
Given that you will eventually make the program distributed, I think the easiest thing to do is to use ray.put() in your f(i) to store the data in Ray's shared memory, and then write the objects out from the main program.
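For illustration, a minimal sketch of the "write from the main program" idea, assuming each per-item result is small enough to return from the task directly (returning a value from a task stores it in Ray's object store, playing the same role as an explicit ray.put(); the exact structure below is an assumption, not the original code):

import pickle
import ray

ray.init()

@ray.remote
def f(i):
    # the expensive computation; return the result instead of writing
    # to the shared file from inside the worker
    return i, i**2

if __name__ == "__main__":
    numbers = list(range(11))
    results = ray.get([f.remote(i) for i in numbers])  # gather in the driver
    data = dict(results)
    with open('data/db.pkl', 'wb') as file:
        pickle.dump(data, file)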
I have several large txt files. Call them mytext01.txt, mytext02.txt, mytext03.txt (in reality there are many more than three). I want to create a separate dataframe for each file that counts occurrences of certain keywords and then write each dataframe to its own csv file. I'd like each txt file to be handled in one process using the multiprocessing library.
I have written code that I thought would do what I wanted, but the csv files never appeared (the code doesn't seem to be doing much of anything; the entire thing runs more quickly than it would normally take to just load a single file). Here is a simplified version of what I tried:
import pandas as pd
from multiprocessing import Pool

keywords = ['dog', 'cat', 'fish']

def count_words(file_number):
    file = path + 'mytext{}.txt'.format(file_number)
    with open(file, 'r', encoding='utf-8') as f:
        text = f.read()
    text = text.split(' ')
    words_dict = dict(zip(keywords, [0 for i in keywords]))
    for word in words_dict.keys():
        words_dict[word] = text.count(word)
    words_df = pd.DataFrame.from_dict(words_dict, orient='index')
    words_df.to_csv('word_counts{}.csv'.format(file_number))

if __name__ == '__main__':
    pool = Pool()
    pool.map(count_words, ['01','02','03'])
I'm not super familiar with using multiprocessing, so any idea of what I have done wrong would be much appreciated. Thanks!
In my experience it's better to have a dedicated function for parallelization, such as:
import multiprocessing as mp

def parallelize(fun, vec, cores):
    with mp.Pool(cores) as p:
        res = p.map(fun, vec)
    return res
Now you just have to check that your function count_words works for a single file_number, and then you can use parallelize.
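For example, a possible usage mirroring the file numbers from your question (just a sketch, not tested against your data):

if __name__ == '__main__':
    # run count_words over the three files on 3 worker processes
    parallelize(count_words, ['01', '02', '03'], cores=3)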
I have a lot of large files containing plain text (comma separated).
I need to perform some action on every line and keep a running count of the total number of lines processed, irrespective of the number of files processed.
I'm not sure which approach would be better (multiprocessing or multithreading)?
I tried implementing multiprocessing, but the times for sequential processing and multiprocessing didn't differ much.
- It might be that I didn't apply multiprocessing correctly.
global_counter

def process_file():
    with open(file, 'r') as f, open(file.done, 'w') as df, open(file.err, 'w') as ef:
        # some operations on file
        # based on some operation logs, work on df or ef.
        # increment global_counter with every line

records = process_file()

t = Pool(processes=8)
for i in records:
    t.map(processing, (i,))
t.close()
Time of execution remained the same.
I want to implement multiprocessing/multithreading to reduce the time spent processing multiple large files.
Kindly help me decide which of the two approaches would be better for my case.
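For reference, a rough sketch of how the multiprocessing variant is often structured: each worker handles one whole file and returns its own line count, which the parent sums, instead of sharing a global counter across processes. The file list and the placeholder per-line action below are hypothetical:

from multiprocessing import Pool

def process_file(path):
    lines_processed = 0
    with open(path, 'r') as f, open(path + '.done', 'w') as df, open(path + '.err', 'w') as ef:
        for line in f:
            # placeholder for the real per-line action; write to df or ef as needed
            df.write(line)
            lines_processed += 1
    return lines_processed

if __name__ == '__main__':
    files = ['file1.txt', 'file2.txt']  # hypothetical input list
    with Pool(processes=8) as pool:
        total_lines = sum(pool.map(process_file, files))
    print("Total lines processed:", total_lines)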