Run a Python (Unix) command with very low priority

I have a Python script which reads data from a DB and loads it into a CSV file as below:
with open(filePath, 'a') as myfile:
    myfile.write(myline)
Once the file is ready, my script initiates SQLLDR from subprocess.call(...) and loads all of the rows.
I have to do all of the above steps with very low priority. I have set os.nice(19) and passed it to all my subprocess.call(...) invocations; this looks fine for step 2 above.
How can I set "niceness" while creating and writing files on the server? Is it possible to pass os (after setting os.nice(19)) while writing/reading files, so that "nice" makes sure they run with low priority?

By default, on Linux > 2.6, the I/O priority is computed from the nice value, meaning that a low-priority process will automatically have a lower I/O impact; but you can set the I/O priority of a process to a different value with the ionice command if you want.

In general, the nice value affects the scheduling of your process on the CPU (i.e. the share of CPU cycles you get and your priority), so it will hardly have any impact on a process that just writes data to a file. Why not run the whole process from the get-go with an appropriately low nice value?

As it says in the manual for setpriority(2), upon which nice is built:
A child created by fork(2) inherits its parent's nice value. The nice
value is preserved across execve(2).
so any subprocesses will inherit the current nice setting. Although, as Alexander notes, you won't see much difference for a process that spends most of its time waiting for disk writes to complete.
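
For completeness, a minimal sketch combining both knobs is below: it lowers the CPU priority with os.nice and wraps the loader in the ionice command (class 3 means "idle"); the sqlldr arguments are placeholders, not from the question:

import os
import subprocess

os.nice(19)  # lowest CPU priority; children inherit it and it survives exec

# File writes from this process are already deprioritized via the nice value;
# for explicit I/O scheduling, wrap the loader in ionice (class 3 = idle).
subprocess.call(['ionice', '-c', '3', 'sqlldr', 'control=loader.ctl'])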

Wait till a file copy/upload completes

I have to wait until a file copy/upload finishes completely, using Python (preferred approach); bash/shell is also fine (I will call it from Python).
I have a shared NFS directory, /data/files_in/. If somebody copies/uploads a file to the /data/files_in/ directory, I should notify another application, but only after the file copy/upload is completely done.
My current code to check whether the file has finished copying:
while True:
    current_size = Path(file_path).stat().st_size
    time.sleep(5)
    result_size = Path(file_path).stat().st_size
    if result_size == current_size:
        break
# Notify your application
It works only with small files; for large files, like 100 GB, it does not work properly.
I have increased the timer, but it still sometimes fails, and a timer-based approach does not seem like a good thing to rely on.
Is there any other way I can implement this to fix the issue?
OS: Linux, CentOS
Python version: 3.9
I can't comment so I will ask here. Shouldn't the resulting size be larger (or at least different) from the current one in order for a file to be done uploading and therefore stop the loop?
I assume you cannot establish any kind of direct communications with the other process, i.e. the one which is copying/uploading the file.
One common approach in these cases is to have the other process to write/erase a "semaphore" file. It may be that it creates the semaphore just before beginning copying and erases it just after finishing, so the semaphore means "don't do anything, I'm still running", or the other way round, it creates the semaphore just after finishing and erases it just before starting next time, so the semaphore means "your data are ready to use".
That said, I'm amazed your approach doesn't work if you allow enough time; 5 seconds should be more than enough on any network.
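
Below is a minimal sketch of the semaphore-file idea, assuming the uploader can be changed to drop a marker file next to the data; the .lock naming convention and the example filename are assumptions, not from the question:

import time
from pathlib import Path

def wait_for_upload(file_path, poll_interval=5.0):
    path = Path(file_path)
    # Assumed convention: the uploader creates <name>.lock before copying
    # and removes it only after the copy has completely finished.
    lock = path.with_name(path.name + '.lock')
    while not path.exists() or lock.exists():
        time.sleep(poll_interval)
    # Safe to notify the other application here.

wait_for_upload('/data/files_in/big_dump.csv')  # hypothetical filename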

Pros and cons of using shared Value/Array vs Queue/Pipe in Python multiprocessing

I've been slowly learning to use the multiprocessing library in Python these last few days, and I've come to a point where I'm asking myself this question, and I can't find an answer to this.
I understand that the answer might vary depending on the application, so I'll explain what my application is.
I've created a scheduler in my main process that controls when multiple processes execute (these processes are spawned earlier, loop continuously, and execute code when my scheduler raises a flag in a shared Value). Using counters in my scheduler, I can have multiple processes executing code at different frequencies (in the 100-400 Hz range), and they are all perfectly synchronized.
For example, one process executes a dynamic model of a quadcopter (ode) at 400 Hz and updates the quadcopter's state. My other processes (command generation and trajectory generation) run at lower frequencies (200 Hz and 100 Hz), but require the updated state. I see currently 2 methods of doing this:
With Pipes: This requires separate Pipes for the dynamics/control and dynamics/trajectory connections. Furthermore, I need the control and trajectory processes to use the latest calculated quadcopter's state, so I need to flush the Pipes until the last value in them. This works, but doesn't look very clean.
With a shared Value/Array: I would only need one Array for the state; my dynamics process would write to it, while my other processes would read from it. I would probably have to implement locks (can I read a shared Value/Array from 2 processes at the same time without a lock?). This hasn't been tested yet, but would probably be cleaner.
I've read around that it is bad practice to use shared memory too much (why is that?). Yes, I'll be updating it at 400 Hz and reading it at 200 and 100 Hz, but it's not going to be such a large array (10-ish floats or doubles). However, I've also read that shared memory is faster than Pipes/Queues, and I would like to prioritize speed in my code, if it's not too much of an issue to use shared memory.
Mind you, I'll have to send generated commands to my dynamics process (another 5-ish floats), and generated desired states to my control process (another 10-ish floats), so that's either more shared Arrays, or more Pipes.
So I was wondering, for my application, what are the pros and cons of both methods. Thanks!
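
Since the question leans toward the shared Array option, here is a minimal sketch of that pattern; the 10-element state vector matches the sizes mentioned above, and the explicit Lock is there because a multi-element read concurrent with a write is not safe without one:

import multiprocessing as mp

def dynamics(state, lock):
    # Placeholder for one 400 Hz ODE step producing a new state vector.
    new_state = [0.0] * 10
    with lock:                 # hold the lock only for the copy itself
        state[:] = new_state

def controller(state, lock):
    with lock:                 # readers lock too: a 10-double slice read
        current = state[:]     # is not atomic on its own
    # ... compute commands from `current` at 200 Hz ...

if __name__ == '__main__':
    lock = mp.Lock()
    state = mp.Array('d', 10, lock=False)  # raw shared array; locking is manual
    p1 = mp.Process(target=dynamics, args=(state, lock))
    p2 = mp.Process(target=controller, args=(state, lock))
    p1.start(); p2.start(); p1.join(); p2.join()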

Good practice for parallel tasks in Python

I have one Python script which generates data and one which trains a neural network on this data with TensorFlow and Keras. Both need an instance of the neural network.
Since I haven't set the "allow growth" flag, each process takes the full GPU memory. Therefore I simply give each process its own GPU. (Maybe not a good solution for people with only one GPU... yet another unsolved problem.)
The actual problem is as follows: both instances need access to the network's weights file. I recently had a bunch of crashes because both processes tried to access the weights at the same time. A flag or something similar should stop each process from accessing the file while the other one is accessing it. Hopefully this doesn't create a bottleneck.
I tried to come up with a solution like semaphores in C, but today I found this post on Stack Exchange.
The idea with renaming seems quite simple and effective to me. Is this good practice in my case? I'll just create the weights file with
self.model.save_weights(filepath='weights.h5$$$')
in the learning process, rename it after saving with
os.rename('weights.h5$$$', 'weights.h5')
and load it in my data-generating process with
self.model.load_weights(filepath='weights.h5')
?
Will this renaming overwrite the old file? And what happens if the other process is currently loading? I would appreciate other ideas for how I could multithread/multiprocess my script. I just realized that generating data, learning, generating data, ... in a sequential script is not really performant.
EDIT 1: Forgot to mention that the weights are stored in an .h5 file by Keras' save function.
The multiprocessing module has an RLock class that you can use to regulate access to a shared resource. This also works for files if you remember to acquire the lock before reading or writing and to release it afterwards. Using a lock implies that some of the time one of the processes cannot read or write the file; how much of a problem this is depends on how much both processes have to access the file.
Note that for this to work, one of the scripts has to start the other script as a Process after creating the lock.
If the weights are a Python data structure, you could put that under control of a multiprocessing.Manager. That will manage access to the objects under its control for you. Note that a Manager is not meant for use with files, just in-memory objects.
Additionally, on UNIX-like operating systems Python has os.lockf to lock (part of) a file. Note that this is an advisory lock only. That is, if another process calls lockf, the return value indicates that the file is already locked; it does not actually prevent that process from reading the file.
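A minimal sketch of the advisory-lock route, assuming both scripts agree to take the lock before touching weights.h5:

import os

fd = os.open('weights.h5', os.O_RDWR | os.O_CREAT)
try:
    os.lockf(fd, os.F_LOCK, 0)   # block until we hold an exclusive lock on the whole file
    # ... read or write the weights here ...
finally:
    os.lockf(fd, os.F_ULOCK, 0)  # release before closing
    os.close(fd)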
Note:
Files can be read and written. Only when two processes are reading the same file (read/read) does this work well. Every other combination (read/write, write/read, write/write) can and eventually will result in undefined behavior and data corruption.
Note2:
Another possible solution involves inter process communication.
Process 1 writes a new .h5 file (with a random filename), closes it, and then sends a message (using a Pipe or Queue) to Process 2: "I've written a new parameter file \path\to\file".
Process 2 then reads the file and deletes it. This can work both ways but requires that both processes check for and process messages every so often. This prevents file corruption because the writing process only notifies the reading process after it has finished the file.
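A small runnable sketch of that hand-off, with plain files standing in for the Keras .h5 weights (the filenames and round count here are illustrative only):

import os
import uuid
import multiprocessing as mp

def writer(q, rounds=3):
    for i in range(rounds):
        tmp_name = 'weights-%s.h5' % uuid.uuid4().hex
        with open(tmp_name, 'wb') as f:   # real script: model.save_weights(tmp_name)
            f.write(b'weights for round %d' % i)
        q.put(tmp_name)                   # notify only after the file is closed
    q.put(None)                           # sentinel: no more files coming

def reader(q):
    while True:
        name = q.get()                    # blocks until a new file is announced
        if name is None:
            break
        with open(name, 'rb') as f:       # real script: model.load_weights(name)
            data = f.read()
        os.remove(name)                   # the reading side cleans up
        print('loaded', name, len(data), 'bytes')

if __name__ == '__main__':
    q = mp.Queue()
    procs = [mp.Process(target=writer, args=(q,)),
             mp.Process(target=reader, args=(q,))]
    for p in procs: p.start()
    for p in procs: p.join()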

Sharing a resource (file) across different python processes using HDFS

So I have some code that attempts to find a resource on HDFS... if it is not there, it will calculate the contents of that file, then write it. The next time it is accessed, the reader can just look at the file. This is to prevent expensive recalculation of certain functions.
However... I have several processes running at the same time on different machines in the same cluster. I SUSPECT that they are trying to access the same resource, and that I'm hitting a race condition that leads to a lot of errors where I either can't open a file, or a file exists but can't be read.
Hopefully this timeline will demonstrate what I believe my issue to be
Process A goes to access resource X
Process A finds resource X does not exist and begins writing it
Process B goes to access resource X
Process A finishes writing resource X
...and so on
Obviously I would want Process B to wait for Process A to be done with Resource X and simply read it when A is done.
Something like semaphores comes to mind, but I am unaware of how to use these across different Python processes on separate machines looking at the same HDFS location. Any help would be greatly appreciated.
UPDATE: To be clear... process A and process B will end up calculating the exact same output (i.e. the same filename, with the same contents, to the same location). Ideally, B shouldn't have to calculate it; B would wait for A to calculate it, then read the output once A is done. Essentially this whole process works like a "long-term cache" using HDFS, where a given function has an output signature. Any process that wants the output of a function will first determine the output signature (this is basically a hash of some function parameters, inputs, etc.). It will then check HDFS to see if it is there. If it's not... it will calculate it and write it to HDFS so that other processes can also read it.
(Setting aside that it sounds like HDFS might not be the right solution for your use case, I'll assume you can't switch to something else. If you can, take a look at Redis, or memcached.)
It seems like this is the kind of thing where you should have a single service that's responsible for computing/caching these results. That way all your processes will have to do is request that the resource be created if it's not already. If it's not already computed, the service will compute it; once it's been computed (or if it already was), either a signal saying the resource is available, or even just the resource itself, is returned to your process.
If for some reason you can't do that, you could try using HDFS for synchronization. For example, you could try creating the resource with a sentinel value inside which signals that process A is currently building this file. Meanwhile process A could be computing the value and writing it to a temporary resource; once it's finished, it could just move the temporary resource over the sentinel resource. It's clunky and hackish, and you should try to avoid it, but it's an option.
You say you want to avoid expensive recalculations, but if process B is waiting for process A to compute the resource, why can't process B (and C and D) be computing it as well for itself/themselves? If this is okay with you, then in the event that a resource doesn't already exist, you could just have each process start computing and writing to a temporary file, then move the file to the resource location. Hopefully moves are atomic, so one of them will cleanly win; it doesn't matter which if they're all identical. Once it's there, it'll be available in the future. This does involve the possibility of multiple processes sending the same data to the HDFS cluster at the same time, so it's not the most efficient, but how bad it is depends on your use case. You can lessen the inefficiency by, for example, checking after computation and before upload to the HDFS whether someone else has created the resource since you last looked; if so, there's no need to even create the temporary resource.
TLDR: You can do it with just HDFS, but it would be better to have a service that manages it for you, and it would probably be even better not to use HDFS for this (though you still would possibly want a service to handle it for you, even if you're using Redis or memcached; it depends, once again, on your particular use case).
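
If you do go the HDFS-only route, a hedged sketch of the compute-then-rename pattern follows, using the third-party HdfsCLI package (pip install hdfs); the namenode URL, paths, and compute callable are placeholders, and you should verify the rename semantics on your cluster (an HDFS rename onto an existing path fails, which is what makes "exactly one writer cleanly wins" work):

import uuid
from hdfs import InsecureClient

client = InsecureClient('http://namenode:9870')  # assumed WebHDFS endpoint

def get_or_compute(path, compute):
    if client.status(path, strict=False) is None:    # resource missing?
        data = compute()                             # every process may compute
        tmp = '%s.%s.tmp' % (path, uuid.uuid4().hex)
        client.write(tmp, data=data)                 # write to a private temp file
        try:
            client.rename(tmp, path)                 # exactly one rename wins
        except Exception:
            client.delete(tmp)                       # lost the race; discard ours
    with client.read(path) as reader:
        return reader.read()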

Python logging from multiple processes

I have a possibly long-running program that currently has 4 processes, but could be configured to have more. I have researched logging from multiple processes using Python's logging module and am using the SocketHandler approach discussed here. I never had any problems with a single logger (no sockets), but from what I read I was told it would fail eventually and unexpectedly. As far as I understand, it's unknown what will happen when you try to write to the same file at the same time. My code essentially does the following:
import logging

log = logging.getLogger(__name__)

def monitor(...):
    # Spawn child processes with os.fork()
    # os.wait() and act accordingly

def main():
    log_server_pid = os.fork()
    if log_server_pid == 0:
        # Create a LogRecordSocketServer (daemon)
        ...
        sys.exit(0)
    # Add SocketHandler to root logger
    ...
    monitor(<configuration stuff>)

if __name__ == "__main__":
    main()
So my questions are: Do I need to create a new log object after each os.fork()? What happens to the existing global log object?
With doing things the way I am, am I even getting around the problem that I'm trying to avoid (multiple open files/sockets)? Will this fail and why will it fail (I'd like to be able to tell if future similar implementations will fail)?
Also, in what way does the "normal" (one log= expression) method of logging to one file from multiple processes fail? Does it raise an IOError/OSError? Or does it just not completely write data to the file?
If someone could provide an answer or links to help me out, that would be great. Thanks.
FYI:
I am testing on Mac OS X Lion and the code will probably end up running on a CentOS 6 VM on a Windows machine (if that matters). Whatever solution I use does not need to work on Windows, but should work on a Unix based system.
UPDATE: This question has started to move away from logging-specific behavior and more into the realm of what Linux does with file descriptors during forks. I pulled out one of my college textbooks, and it seems that if you open a file in append mode from two processes (not before a fork), they will both be able to write to the file properly, as long as your writes don't exceed the actual kernel buffer (although line buffering might need to be used; I'm still not sure on that one). This creates 2 file table entries and one v-node table entry. Opening a file and then forking isn't supposed to work, but it seems to, as long as you don't exceed the kernel buffer as before (I've done it in a previous program).
So I guess, if you want platform-independent multiprocessing logging, you use sockets and create a new SocketHandler after each fork to be safe, as Vinay suggested below (that should work everywhere). For me, since I have strong control over which OS my software runs on, I think I'm going to go with one global log object with a FileHandler (which opens in append mode by default and is line-buffered on most OSes). The documentation for open says: "A negative buffering means to use the system default, which is usually line buffered for tty devices and fully buffered for other files. If omitted, the system default is used." Or I could just create my own logging stream to be sure of line buffering. And just to be clear, I'm OK with:
# Process A
a_file.write("A\n")
a_file.write("A\n")
# Process B
a_file.write("B\n")
producing...
A\n
B\n
A\n
as long as it doesn't produce...
AB\n
\n
A\n
Vinay(or anyone else), how wrong am I? Let me know. Thanks for any more clarity/sureness you can provide.
Do I need to create a new log object after each os.fork()? What happens to the existing global log object?
AFAIK the global log object remains pointing to the same logger in parent and child processes, so you shouldn't need to create a new one. However, I think you should create and add the SocketHandler after the fork() in monitor(), so that the socket server has four distinct connections, one to each child process. If you don't do this, the child processes forked off in monitor() will inherit the SocketHandler and its socket handle from their parent, but I don't know for sure whether it will misbehave. The behaviour is likely to be OS-dependent, and you may be lucky on OS X.
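A minimal sketch of that suggestion is below; it assumes the LogRecordSocketServer from the question is already listening on the default port, and each child attaches its own SocketHandler after the fork so every process gets a distinct connection:

import logging
import logging.handlers
import os

def monitor():
    for _ in range(4):
        if os.fork() == 0:
            # Child: attach its own SocketHandler so the connection is not shared.
            root = logging.getLogger()
            root.setLevel(logging.INFO)
            root.addHandler(logging.handlers.SocketHandler(
                'localhost', logging.handlers.DEFAULT_TCP_LOGGING_PORT))
            logging.getLogger(__name__).info('child %d up', os.getpid())
            # ... the child's real work goes here ...
            os._exit(0)
    for _ in range(4):
        os.wait()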
With doing things the way I am, am I even getting around the problem that I'm trying to avoid (multiple open files/sockets)? Will this fail and why will it fail (I'd like to be able to tell if future similar implementations will fail)?
I wouldn't expect failure if you create the socket connection to the socket server after the last fork() as I suggested above, but I am not sure the behaviour is well-defined in any other case. You refer to multiple open files but I see no reference to opening files in your pseudocode snippet, just opening sockets.
Also, in what way does the "normal" (one log= expression) method of logging to one file from multiple processes fail? Does it raise an IOError/OSError? Or does it just not completely write data to the file?
I think the behaviour is not well-defined, but one would expect failure modes to present as interspersed log messages from different processes in the file, e.g.
Process A writes first part of its message
Process B writes its message
Process A writes second part of its message
Update: If you use a FileHandler in the way you described in your comment, things will not be so good, due to the scenario I've described above: processes A and B both begin by pointing at the end of the file (because of append mode), but thereafter things can get out of sync, because (e.g. on a multiprocessor, but even potentially on a uniprocessor) one process can preempt another and write to the shared file handle before the other process has finished doing so.
