P4Python: use multiple threads that request perforce information at the same time

I've been working on a "crawler" of sorts that goes through our repository, and lists directories and files as it goes. For every directory it encounters, it creates a thread that does the same for that directory, and so on, recursively. Effectively this creates a very short-lived thread for every directory encountered in the repo. (It doesn't take very long to request information on just one path; there are just tens of thousands of them.)
The logic looks as follows:
import threading
import perforce as Perforce  # custom Perforce class
from pathlib import Path

p4 = Perforce()
p4.connect()

class Dir():
    def __init__(self, path):
        self.dirs = []
        self.files = []
        self.path = path
        self.crawlers = []

    def build_crawler(self):
        worker = Crawler(self)
        # keep a reference so the thread isn't garbage collected
        self.crawlers.append(worker)
        worker.start()

class Crawler(threading.Thread):
    def __init__(self, dir):
        threading.Thread.__init__(self)
        self.dir = dir

    def run(self):
        depotdirs = p4.getdepotdirs(self.dir.path)
        depotfiles = p4.getdepotfiles(self.dir.path)
        for p in depotdirs:
            if Path(p).is_dir():
                _d = Dir(p)
                self.dir.dirs.append(_d)
        for p in depotfiles:
            if Path(p).is_file():
                f = File(p)  # File is like Dir, but with less stuff, just a path.
                self.dir.files.append(f)
        for dir in self.dir.dirs:
            dir.build_crawler()
        for worker in self.dir.crawlers:
            worker.join()
Obviously this is not complete code, but it represents what I'm doing.
My question really is whether I can create an instance of this Perforce class in the __init__ method of the Crawler class, so that requests can be done separately. Right now, I have to call join() on the created threads so that they wait for completion, to avoid concurrent perforce calls.
I've tried it out, but it seems like there is a limit to how many connections you can create: I don't have a solid number, but somewhere along the line Perforce just started straight up refusing connections, which I presume is due to the number of concurrent requests.
Really what I'm asking, I suppose, is two-fold: is there a better way of creating a data model representing a repo with tens of thousands of files than the one I'm using, and is what I'm trying to do possible, and if so, how?
Any help would be greatly appreciated :)

I found out how to do this (it's infuriatingly simple, as with all simple solutions to overly complicated problems):
To build a data model that contains Dir and File classes representing a whole depot with thousands of files, just call p4.run("files", "-e", path + "\\...").
This will return a list of every file under path, recursively. From there, all you need to do is iterate over the returned paths and construct your data model.
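For illustration, here is a minimal sketch of turning that flat list into the Dir/File model. It assumes forward-slash depot paths and that p4.run("files", ...) yields dicts with a "depotFile" key (typical P4Python output); adjust to your own wrapper as needed:

def build_model(p4, root_path):
    root = Dir(root_path)
    dirs = {root_path: root}

    for entry in p4.run("files", "-e", root_path + "/..."):
        depot_path = entry["depotFile"]
        parent_path, _, name = depot_path.rpartition("/")

        # create any intermediate Dir nodes that haven't been seen yet
        current = root_path
        parent = root
        for part in parent_path[len(root_path):].strip("/").split("/"):
            if not part:
                continue
            current = current + "/" + part
            if current not in dirs:
                child = Dir(current)
                parent.dirs.append(child)
                dirs[current] = child
            parent = dirs[current]

        parent.files.append(File(depot_path))

    return root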
Hope this helps someone at some point.

Related

Struggling with multiprocessing Queue

My structure (massively simplified) is depicted below:
import multiprocessing

def creator():
    # creates files
    return

def relocator():
    # moves created files
    return

create = multiprocessing.Process(target=creator)
relocate = multiprocessing.Process(target=relocator)
create.start()
relocate.start()
What I am trying to do is have a bunch of files created by creator and as soon as they get created have them moved to another directory by relocator.
The reason I want to use multiprocessing here is:
I do not want creator to wait for the moving to finish first, because moving takes time I don't want to waste.
Creating all the files first before starting to copy is not an option either because there is not enough space in the drive for all of them.
I want both the creator and relocator processes to be serial (one file at a time each) but run in parallel. A "log" of the actions should look like this:
# creating file 1
# creating file 2 and relocating file 1
# creating file 3 and relocating file 2
# ...
# relocating last file
Based on what I have read, Queue is the way to go here.
Strategy: (maybe not the best one?!)
After a file gets created, it enters the queue; after it has finished being relocated, it is removed from the queue.
I am however having issues coding it: multiple files being created at the same time (multiple instances of creator running in parallel), and so on...
I would be very grateful for any ideas, hints, explanations, etc
Let's take your idea and split it into these features:
Creator should create files (100, for example)
Relocator should move 1 file at a time till there are no more files to move
Creator may end before Relocator, so it can also transform itself into a Relocator
Both have to know when to finish
So, we have 2 main functionalities:
import os
import shutil

def create(i):
    # creates a file and returns its path
    return os.path.join("some/path/based/on/stuff", "{}.ext".format(i))

def relocate(src, dst):
    # moves a created file
    shutil.move(src, dst)
Now let's create our processes:
from multiprocessing import Process, Queue

comm_queue = Queue()

# process that creates the files and pushes the data into the queue
def creator(comm_q):
    for i in range(100):
        comm_q.put(create(i))
    comm_q.put("STOP_FLAG")  # tell the workers when to stop; we push just one since we only have one other worker

# the relocator works till it gets a stop flag
def relocator(comm_q):
    data = comm_q.get()
    while data != "STOP_FLAG":
        if data:
            relocate(data, to_path_you_may_want)
        data = comm_q.get()

creator_process = Process(target=creator, args=(comm_queue,))
relocator_process = Process(target=relocator, args=(comm_queue,))
creator_process.start()
relocator_process.start()
This way we now have a creator and a relocator. But let's say we want the Creator to start relocating once its creation job is done: we can just call relocator from it, but we would need to push one more "STOP_FLAG", since we would have 2 processes relocating.
def creator(comm_q):
    for i in range(100):
        comm_q.put(create(i))
    for _ in range(2):
        comm_q.put("STOP_FLAG")
    relocator(comm_q)
Let's say we now want an arbitrary number of relocator processes. We need to adapt our code a bit to handle this: the creator method has to know how many stop flags to push so it can notify the other processes when to stop. The resulting code would look like this:
from multiprocessing import Process, Queue, cpu_count

comm_queue = Queue()

# process that creates the files and pushes the data into the queue
def creator(comm_q, number_of_subprocesses):
    for i in range(100):
        comm_q.put(create(i))
    for _ in range(number_of_subprocesses + 1):  # we need to count ourselves, since the creator also relocates
        comm_q.put("STOP_FLAG")
    relocator(comm_q)

# the relocator works till it gets a stop flag
def relocator(comm_q):
    data = comm_q.get()
    while data != "STOP_FLAG":
        if data:
            relocate(data, to_path_you_may_want)
        data = comm_q.get()

num_of_cpus = cpu_count()  # we will spawn as many processes as we have CPU cores
creator_process = Process(target=creator, args=(comm_queue, num_of_cpus))
relocators = [Process(target=relocator, args=(comm_queue,)) for _ in range(num_of_cpus)]
creator_process.start()
for rp in relocators:
    rp.start()
Then you will have to WAIT for them to finish:
creator_process.join()
for rp in relocators:
    rp.join()
You may want to check the multiprocessing.Queue documentation, especially the get method (it is a blocking call by default):
Remove and return an item from the queue. If optional args block is
True (the default) and timeout is None (the default), block if
necessary until an item is available.
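As a small illustration of that blocking behaviour, here is a sketch (the queue name and the 5-second timeout are arbitrary choices of mine) showing how a worker can wait with a timeout instead of blocking forever:

from multiprocessing import Queue
from queue import Empty  # the exception raised by Queue.get on timeout

comm_queue = Queue()

def drain_one(comm_q):
    try:
        # wait at most 5 seconds instead of blocking indefinitely
        data = comm_q.get(block=True, timeout=5)
    except Empty:
        return None  # nothing arrived in time; the caller decides what to do
    return data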

Python multiprocess/multithreading to speed up file copying

I have a program which copies large numbers of files from one location to another - I'm talking 100,000+ files (I'm copying 314 GB in image sequences at this moment). They're both on huge, VERY fast network storage, RAID'd in the extreme. I'm using shutil to copy the files over sequentially and it is taking some time, so I'm trying to find the best way to optimize this. I've noticed some software I use effectively multi-threads reading files off of the network, with huge gains in load times, so I'd like to try doing this in python.
I have no experience with programming multithreading/multiprocessing - does this seem like the right area to proceed? If so, what's the best way to do this? I've looked around a few other SO posts regarding threading file copying in python and they all seemed to say that you get no speed gain, but I do not think this will be the case considering my hardware. I'm nowhere near my IO cap at the moment and resources are sitting around 1% (I have 40 cores and 64 GB of RAM locally).
EDIT
Been getting some up-votes on this question (now a few years old) so I thought I'd point out one more thing to speed up file copies. In addition to the fact that you can easily 8x-10x copy speeds using some of the answers below (seriously!) I have also since found that shutil.copy2 is excruciatingly slow for no good reason. Yes, even in python 3+. It is beyond the scope of this question so I won't dive into it here (it's also highly OS and hardware/network dependent), beyond just mentioning that by tweaking the copy buffer size in the copy2 function you can increase copy speeds by yet another factor of 10! (however note that you will start running into bandwidth limits and the gains are not linear when multi-threading AND tweaking buffer sizes. At some point it does flat line).
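For anyone curious, here is a minimal sketch of what "tweaking the copy buffer size" can look like; it is an illustration rather than the poster's actual code, and the 16 MB buffer is an arbitrary assumption to tune for your own storage and network:

import shutil

BUFFER_SIZE = 16 * 1024 * 1024  # 16 MB; the default chunk size is much smaller

def copy_with_big_buffer(src, dst):
    # like shutil.copyfile, but with a much larger chunk size
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        shutil.copyfileobj(fsrc, fdst, length=BUFFER_SIZE)
    shutil.copystat(src, dst)  # copy2 also preserves metadata; keep that behaviour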
UPDATE:
I never did get Gevent working (first answer) because I couldn't install the module without an internet connection, which I don't have on my workstation. However, I was able to decrease file copy times by a factor of 8 just using the built-in threading in python (which I have since learned how to use), and I wanted to post it up as an additional answer for anyone interested! Here's my code below; it is probably important to note that my 8x improvement will most likely differ from environment to environment due to your hardware/network set-up.
import Queue, threading, os, time
import shutil

fileQueue = Queue.Queue()
destPath = 'path/to/cop'

class ThreadedCopy:
    totalFiles = 0
    copyCount = 0
    lock = threading.Lock()

    def __init__(self):
        with open("filelist.txt", "r") as txt:  # txt with a file per line
            fileList = txt.read().splitlines()

        if not os.path.exists(destPath):
            os.mkdir(destPath)

        self.totalFiles = len(fileList)
        print str(self.totalFiles) + " files to copy."
        self.threadWorkerCopy(fileList)

    def CopyWorker(self):
        while True:
            fileName = fileQueue.get()
            shutil.copy(fileName, destPath)
            fileQueue.task_done()
            with self.lock:
                self.copyCount += 1
                percent = (self.copyCount * 100) / self.totalFiles
                print str(percent) + " percent copied."

    def threadWorkerCopy(self, fileNameList):
        for i in range(16):
            t = threading.Thread(target=self.CopyWorker)
            t.daemon = True
            t.start()
        for fileName in fileNameList:
            fileQueue.put(fileName)
        fileQueue.join()

ThreadedCopy()
How about using a ThreadPool?
import os
import glob
import shutil
from functools import partial
from multiprocessing.pool import ThreadPool
DST_DIR = '../path/to/new/dir'
SRC_DIR = '../path/to/files/to/copy'
# copy_to_mydir will copy any file you give it to DST_DIR
copy_to_mydir = partial(shutil.copy, dst=DST_DIR)
# list of files we want to copy
to_copy = glob.glob(os.path.join(SRC_DIR, '*'))
with ThreadPool(4) as p:
    p.map(copy_to_mydir, to_copy)
This can be parallelized by using gevent in Python.
I would recommend the following logic to speed up copying 100k+ files:
Put the names of all the 100K+ files which need to be copied into a csv file, e.g. 'input.csv'.
Then create chunks from that csv file. The number of chunks should be decided based on the number of processors/cores in your machine.
Pass each of those chunks to separate threads.
Each thread sequentially reads the filenames in its chunk and copies each file from one location to another.
Here goes the python code snippet:
import sys
import os
import multiprocessing

from gevent import monkey
monkey.patch_all()

from gevent.pool import Pool

def _copyFile(file):
    # over here, you can put your own logic of copying a file from source to destination
    pass

def _worker(csv_file, chunk):
    f = open(csv_file)
    f.seek(chunk[0])
    for file in f.read(chunk[1]).splitlines():
        _copyFile(file)

def _getChunks(file, size):
    f = open(file)
    while 1:
        start = f.tell()
        f.seek(size, 1)
        s = f.readline()
        yield start, f.tell() - start
        if not s:
            f.close()
            break

if __name__ == "__main__":
    if len(sys.argv) > 1:
        csv_file_name = sys.argv[1]
    else:
        print "Please provide a csv file as an argument."
        sys.exit()

    no_of_procs = multiprocessing.cpu_count() * 4
    file_size = os.stat(csv_file_name).st_size
    file_size_per_chunk = file_size / no_of_procs

    pool = Pool(no_of_procs)
    for chunk in _getChunks(csv_file_name, file_size_per_chunk):
        pool.apply_async(_worker, (csv_file_name, chunk))
    pool.join()
Save the file as file_copier.py.
Open terminal and run:
$ ./file_copier.py input.csv
While re-implementing the code posted by @Spencer, I ran into the same error as mentioned in the comments below the post (to be more specific: OSError: [Errno 24] Too many open files).
I solved this issue by moving away from the daemonic threads and using concurrent.futures.ThreadPoolExecutor instead. This seems to handle the opening and closing of the files to copy in a better way. With this change, all the code stayed the same except for the threadWorkerCopy(self, filename_list: List[str]) method, which now looks like this:
def threadWorkerCopy(self, filename_list: List[str]):
    """
    This function initializes the workers to enable the multi-threaded process. The workers are handled automatically with
    ThreadPoolExecutor. More info about multi-threading can be found here: https://realpython.com/intro-to-python-threading/.

    A recurrent problem with the threading here was "OSError: [Errno 24] Too many open files". This was coming from the fact
    that daemon threads were not killed before the end of the script. Therefore, everything opened by them was never closed.

    Args:
        filename_list (List[str]): List containing the names of the files to copy.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=cores) as executor:
        executor.submit(self.CopyWorker)
        for filename in filename_list:
            self.file_queue.put(filename)
        self.file_queue.join()  # program waits for this process to be done.
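For readers who just want the idea without the surrounding class, here is a self-contained sketch of the same ThreadPoolExecutor approach; the function and variable names are my own, not from the answer above:

import concurrent.futures
import os
import shutil

def copy_one(src_path, dest_dir):
    # copy a single file; any exception is re-raised when the future's result is read
    shutil.copy(src_path, dest_dir)
    return src_path

def threaded_copy(file_list, dest_dir, max_workers=16):
    os.makedirs(dest_dir, exist_ok=True)
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(copy_one, f, dest_dir) for f in file_list]
        for future in concurrent.futures.as_completed(futures):
            future.result()  # surfaces any copy errors as they happen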
If you just want to copy a directory tree from one path to another, here's my solution that's a little simpler than the previous solutions. It leverages multiprocessing.pool.ThreadPool and uses a custom copy function for shutil.copytree:
import shutil
from multiprocessing.pool import ThreadPool

class MultithreadedCopier:
    def __init__(self, max_threads):
        self.pool = ThreadPool(max_threads)

    def copy(self, source, dest):
        self.pool.apply_async(shutil.copy2, args=(source, dest))

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.pool.close()
        self.pool.join()

src_dir = "/path/to/src/dir"
dest_dir = "/path/to/dest/dir"

with MultithreadedCopier(max_threads=16) as copier:
    shutil.copytree(src_dir, dest_dir, copy_function=copier.copy)

How to continuously update target file using Luigi?

I have recently started playing around with Luigi, and I would like to find out how to use it to continuously append new data into an existing target file.
Imagine I am pinging an api every minute to retrieve new data. Because a Task only runs if the Target is not already present, a naive approach would be to parameterize the output file by the current datetime. Here's a bare bones example:
import luigi
import datetime

class data_download(luigi.Task):
    date = luigi.DateParameter(default=datetime.datetime.now())

    def requires(self):
        return []

    def output(self):
        return luigi.LocalTarget("data_test_%s.json" % self.date.strftime("%Y-%m-%d_%H:%M"))

    def run(self):
        data = download_data()
        with self.output().open('w') as out_file:
            out_file.write(data + '\n')

if __name__ == '__main__':
    luigi.run()
If I schedule this task to run every minute, it will execute because the target file for the current time does not exist yet. But it creates 60 files an hour. What I'd like to do instead is make sure that all the new data ends up in the same file eventually. What would be a scalable approach to accomplish that? Any ideas or suggestions are welcome!
You cannot. As the doc for LocalTarget says:
Parameters: mode (str) – the mode r opens the FileSystemTarget in read-only mode, whereas w will open the FileSystemTarget in write mode. Subclasses can implement additional options.
I.e. only r or w modes are allowed. Additional options such as a would require an extension of the LocalTarget class, though that breaks the desired idempotency of Luigi task executions.
def output(self):
    return luigi.LocalTarget("data_test_%s.json" % self.date.strftime("%Y-%m-%d_%H:%M"))
It's not the 'luigi way', but it does the job. In the end those targets are just file objects.
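To make that concrete, here is a minimal sketch of one way to read that advice; the per-minute marker target and the download_data call are assumptions carried over from the question, not code from the answer:

import datetime
import luigi

DATA_FILE = "data_test.json"  # single file that accumulates all downloads

class data_download(luigi.Task):
    date = luigi.DateMinuteParameter(default=datetime.datetime.now())

    def output(self):
        # a tiny per-minute marker keeps the task idempotent for the scheduler
        return luigi.LocalTarget("markers/done_%s" % self.date.strftime("%Y-%m-%d_%H%M"))

    def run(self):
        data = download_data()  # as in the question
        # the real output is appended to an ordinary file; targets are just file objects
        with open(DATA_FILE, "a") as out_file:
            out_file.write(data + "\n")
        with self.output().open("w") as marker:
            marker.write("ok\n")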

Communicate between two Python apps

I have written two apps:
The first one is a spider that extracts all links from a website.
The second one does some checks on each link sent by the first app.
When the first app finds a link, how can I send a notification or something else to the second app?
The second app must listen continuously for data sent by the first app.
I found a few posts about Queue, but I don't really understand how that works.
Can someone explain to me, with a simple example, how to communicate between the two apps?
Thanks!
There are all sorts of ways to accomplish inter-process communication, but by far the simplest is to use the filesystem. Have your spider write its output to a temp file. When it's finished, move it into a folder that your second process polls periodically; when it finds work, it processes it.
The spider could look something like this:
import tempfile, os

tmpname = ''
with tempfile.NamedTemporaryFile(mode='w', delete=False) as tmp:
    tmpname = tmp.name
    tmp.write("spider output....\n")

tgt = os.path.join('incoming', os.path.basename(tmpname))
os.rename(tmpname, tgt)
The second process could look something like this:
import time, os
while 1:
time.sleep(5)
for item in os.listdir('incoming'):
work_item = os.path.join('incoming', item)
with open(work_item) as fin:
# do something with item
os.unlink(work_item)
You could save one file as a "module" to be imported by the other file. This can be done with the import keyword. For example, if you name the second part of your application listener.py, you can type import listener in your other file (remember to put them in the same folder!) and call any method from the second file. You can read more on Python modules here.
A Queue is just a container into which items may be put and retrieved, often in FIFO order. The Queue module in Python 2 is just an implementation of one that supports synchronized access, meaning that it supports multiple threads using it (putting and getting things) at the same time.
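To make the Queue idea concrete, here is a small sketch (two threads within one program rather than two separate apps; the link-checking logic is just a placeholder):

import threading
from queue import Queue  # named Queue.Queue in Python 2

links = Queue()

def spider():
    # the producer: push each discovered link onto the queue
    for url in ["http://example.com/a", "http://example.com/b"]:
        links.put(url)
    links.put(None)  # sentinel telling the checker we are done

def checker():
    # the consumer: block until a link is available, stop at the sentinel
    while True:
        url = links.get()
        if url is None:
            break
        print("checking", url)

t1 = threading.Thread(target=spider)
t2 = threading.Thread(target=checker)
t1.start(); t2.start()
t1.join(); t2.join()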

correctly checking file existence while multiprocessing

I have a function, myfunc, which is called in parallel processing. When I make several processes share the same destination folder, all of them call myfunc in parallel and check the existence of the destination folder. If it already existed, no problem. However, if the folder did not exist prior to launching the script, then the first process will enter the if block and create the folder. On the other hand, there will be another process that will enter the very same if block "almost" at the same time, find the folder does not exist, and try to create it, while the first process is actually creating it or has already done so. So at some point there will be an OSError saying the folder already exists.
Is there a clean way to deal with this issue while multiprocessing? I am thinking of taking care of the destination folder out of the function myfunc, before launching my processes. It would be nice however to find a solution using multiprocessing, for the sake of knowledge.
import os, sys

def myfunc(file_names, destination=None, file_permission=None, verbose=False):
    absPath = os.path.abspath(file_names[0])
    baseName = os.path.basename(absPath)
    dirName = os.path.dirname(absPath)

    destination_folder = "/default/destination" if destination is None \
        else os.path.abspath(destination)

    if not os.path.isdir(destination_folder):
        os.mkdir(destination_folder)
        os.chmod(destination_folder, file_permission)
        if verbose:
            print "Created directory", destination_folder
Checking for existence of a file/folder and taking action based on the result is fundamentally wrong in most cases, because even if you are not multiprocessing, you don't know what else is running on the computer. It is also difficult to guarantee that someone else won't later run multiple copies of your process, even if you didn't originally intend that.
The most robust method is to always try to create the folder, and silently ignore the "but that already exists" error. (Do not ignore other errors, such as "but you don't have that permission"!) This would still be the best way to do things even if you did a single check prior to starting the multiprocessing.
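A small sketch of that "try and ignore 'already exists'" approach (the helper name and error-handling details are mine, not from the answer):

import errno
import os

def ensure_dir(path, mode=0o777):
    try:
        os.mkdir(path)
        os.chmod(path, mode)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise  # permission errors and the like should still blow up

On Python 3 you can get the same effect with os.makedirs(destination_folder, exist_ok=True).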
Use Lock from the threading module for solving concurrency issues:
import os, sys
from threading import Lock

def myfunc(lock, ...):
    ... do stuff as usual ...
    with lock:
        destination_folder = "/default/destination" if destination is None \
            else os.path.abspath(destination)
        ... do everything else as usual ...

if __name__ == "__main__":
    my_lock = Lock()
    myfunc(my_lock, ...)
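Note that a threading.Lock only coordinates threads within one process; if the workers are separate processes (as the question title suggests), the same pattern needs a multiprocessing.Lock passed to each worker. A possible sketch, with simplified worker arguments and an illustrative destination path of my own choosing:

import os
from multiprocessing import Pool, Lock

lock = None  # set in each worker by the initializer

def init_worker(shared_lock):
    global lock
    lock = shared_lock

def myfunc(destination_folder):
    with lock:
        # only one process at a time checks and creates the folder
        if not os.path.isdir(destination_folder):
            os.mkdir(destination_folder)

if __name__ == "__main__":
    shared_lock = Lock()
    with Pool(4, initializer=init_worker, initargs=(shared_lock,)) as pool:
        pool.map(myfunc, ["/tmp/dest"] * 4)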
