So I have some code that attempts to find a resource on HDFS... if it is not there, it will calculate the contents of that file, then write it. The next time the resource is accessed, the reader can just look at the file. This is to prevent expensive recalculation of certain functions.
However... I have several processes running at the same time on different machines on the same cluster. I SUSPECT that they are trying to access the same resource and I'm hitting a race condition that leads to a lot of errors where I either can't open a file or a file exists but can't be read.
Hopefully this timeline will demonstrate what I believe my issue to be
Process A goes to access resource X
Process A finds resource X does not exist and begins writing it
Process B goes to access resource X
Process A finishes writing resource X
...and so on
Obviously I would want Process B to wait for Process A to be done with Resource X and simply read it when A is done.
Something like semaphores comes to mind, but I am unaware of how to use these across different Python processes on separate machines looking at the same HDFS location. Any help would be greatly appreciated.
UPDATE: To be clear... process A and process B will end up calculating the exact same output (i.e. the same filename, with the same contents, to the same location). Ideally, B shouldn't have to calculate it. B would wait for A to calculate it, then read the output once A is done. Essentially this whole process is working like a "long-term cache" using HDFS, where a given function will have an output signature. Any process that wants the output of a function will first determine the output signature (this is basically a hash of some function parameters, inputs, etc.). It will then check HDFS to see if it is there. If it's not... it will calculate it and write it to HDFS so that other processes can also read it.
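For concreteness, the signature computation is roughly along these lines (a simplified sketch; the function name and parameters shown are hypothetical stand-ins for whatever identifies the call):

import hashlib
import json

def output_signature(func_name, params):
    # Build a deterministic cache key from a function name and its parameters.
    # (Hypothetical helper; assumes params is JSON-serializable.)
    payload = json.dumps({"func": func_name, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# The HDFS path would then be something like /cache/<signature>
sig = output_signature("aggregate_daily", {"day": "2017-01-01", "window": 7})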
(Setting aside that it sounds like HDFS might not be the right solution for your use case, I'll assume you can't switch to something else. If you can, take a look at Redis, or memcached.)
It seems like this is the kind of thing where you should have a single service that's responsible for computing/caching these results. That way all your processes will have to do is request that the resource be created if it's not already. If it's not already computed, the service will compute it; once it's been computed (or if it already was), either a signal saying the resource is available, or even just the resource itself, is returned to your process.
If for some reason you can't do that, you could try using HDFS for synchronization. For example, you could try creating the resource with a sentinel value inside which signals that process A is currently building this file. Meanwhile process A could be computing the value and writing it to a temporary resource; once it's finished, it could just move the temporary resource over the sentinel resource. It's clunky and hackish, and you should try to avoid it, but it's an option.
You say you want to avoid expensive recalculations, but if process B is waiting for process A to compute the resource, why can't process B (and C and D) be computing it as well for itself/themselves? If this is okay with you, then in the event that a resource doesn't already exist, you could just have each process start computing and writing to a temporary file, then move the file to the resource location. Hopefully moves are atomic, so one of them will cleanly win; it doesn't matter which if they're all identical. Once it's there, it'll be available in the future. This does involve the possibility of multiple processes sending the same data to the HDFS cluster at the same time, so it's not the most efficient, but how bad it is depends on your use case. You can lessen the inefficiency by, for example, checking after computation and before upload to the HDFS whether someone else has created the resource since you last looked; if so, there's no need to even create the temporary resource.
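A rough sketch of that compute-to-temp-then-rename pattern, assuming the hdfs (WebHDFS) Python package; the namenode URL, paths, and compute() callable are placeholders:

import uuid
from hdfs import InsecureClient  # pip install hdfs

client = InsecureClient("http://namenode:50070")  # placeholder namenode URL

def get_or_create(resource_path, compute):
    # Return the bytes of resource_path, computing and publishing it if missing.
    if client.status(resource_path, strict=False) is None:
        data = compute()  # the expensive calculation
        # Re-check before uploading: someone may have finished while we computed.
        if client.status(resource_path, strict=False) is None:
            tmp_path = resource_path + ".tmp." + uuid.uuid4().hex
            client.write(tmp_path, data, overwrite=True)
            try:
                client.rename(tmp_path, resource_path)  # rename is atomic in HDFS
            except Exception:
                client.delete(tmp_path)  # another process won the race; drop our copy
    with client.read(resource_path) as reader:
        return reader.read()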
TLDR: You can do it with just HDFS, but it would be better to have a service that manages it for you, and it would probably be even better not to use HDFS for this (though you still would possibly want a service to handle it for you, even if you're using Redis or memcached; it depends, once again, on your particular use case).
Background
I have some DAGs that pull data from a third-party API.
The accounts we need to pull can change over time. To determine which accounts to pull, depending on the process we may need to query a database or make an HTTP request.
Before Airflow, we would just get the account list at the start of the Python script. Then we would iterate through the account list and pull each account to file or whatever it was we needed to do.
But now, using Airflow, it makes sense to define tasks at the account level and let Airflow handle retry functionality, date ranges, parallel execution, etc.
Thus my DAG might look something like this:
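(A rough sketch; the real account-fetching call and operator details are placeholders.)

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def get_account_list():
    # Placeholder: in reality this queries a database or makes an HTTP request.
    return ["account_a", "account_b", "account_c"]

def pull_account(account):
    pass  # placeholder: pull this account from the third-party API

dag = DAG("pull_accounts", start_date=datetime(2017, 1, 1), schedule_interval="@daily")

for account in get_account_list():  # runs on every DAG parse
    PythonOperator(
        task_id="pull_{}".format(account),
        python_callable=pull_account,
        op_kwargs={"account": account},
        dag=dag,
    )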
Problem
Since each account is a task, the account list needs to be accessed with every DAG parse. But since DAG files are parsed frequently, you don't necessarily want to query the database or wait for a REST call with every parse, from every machine, all day long. This could be resource-intensive, and it could cost money.
Question
Is there a good way to cache this type of config information in a local file, ideally with a specified time-to-live?
Thoughts
I have thought about a couple different approaches:
Write to a CSV or pickle file and use mtime to expire (see the sketch after this list).
The concern with this is that I might get collisions if two processes try to expire the file at the same time. I don't know how likely this is or what the consequences would be, but probably nothing terrible.
Create a common SQLite DB for all such processes. It should be auto-created the first time a variable is accessed. Each config variable gets a row in a table; use a last_modified_datetime column to tell when to expire.
Requires more elaborate code and dependencies.
Use Airflow Variables.
The nice thing about this would be that it uses the existing DB, so there would be no cost per query and reasonable network lag, but it still requires a network round trip.
Has the benefit of being identical across all nodes in a multi-node setup.
Determining when to expire would probably be problematic, so I would probably create a config-manager DAG to update the config variables periodically.
But then this would add complexity to the deployment and development process -- the variables need to be populated in order to define the DAGs properly -- and all developers would need to manage this locally too, as opposed to a more create-on-read caching approach.
Subdags?
Never used them, but I have a suspicion they could be used here. The community seems to discourage their use anyway, though...
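For option 1, a minimal sketch of what I have in mind (helper names are hypothetical; the atomic replace is meant to sidestep the collision concern):

import json
import os
import tempfile
import time

CACHE_PATH = "/tmp/account_list_cache.json"  # placeholder location
TTL_SECONDS = 15 * 60

def get_account_list_cached(fetch_accounts):
    # Return the cached account list, refreshing it if older than TTL_SECONDS.
    try:
        if time.time() - os.path.getmtime(CACHE_PATH) < TTL_SECONDS:
            with open(CACHE_PATH) as f:
                return json.load(f)
    except (OSError, ValueError):
        pass  # missing or unreadable cache: fall through and refresh

    accounts = fetch_accounts()  # the expensive DB query / REST call
    # Write to a temp file and atomically replace, so concurrent refreshers
    # never leave a half-written file behind.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(CACHE_PATH))
    with os.fdopen(fd, "w") as f:
        json.dump(accounts, f)
    os.replace(tmp, CACHE_PATH)
    return accounts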
Have you dealt with this problem? Did you arrive at a good solution? None of these seems very good.
Airflow default DAG parsing interval is pretty forgiving: 5 minutes. But even that is quite a lot for most people, so it's quite reasonable to increase that if your deployment isn't too close to the due times for the new DAGs.
In general, I'd say it's not that bad to make a REST request at every DAG parse heartbeat. Also, nowadays the scheduling process is decoupled from the parsing process, so that won't affect how fast your tasks are scheduled. Airflow caches the DAG definition for you.
If you think you still have reasons to put your own cache on top of that, my suggestion is to cache at the definitions server, not on the Airflow side. For example, using cache headers on the REST endpoint and handling cache invalidation yourself when you need it. But that could be some premature optimization, so my advice is to start without it and implement it only if you measure convincing evidence that you need it.
EDIT: regarding Webserver and Worker
It's true that the Webserver will trigger DAG parses as well; I'm not sure how frequently. It probably follows the gunicorn worker refresh interval (which is 30 seconds by default). Workers will do it also, by default at the start of every task, but that can be saved if you activate pickling DAGs. I'm not sure that's a good idea though; I've heard this is something destined to be deprecated.
One other thing you can try to do is cache that in the Airflow process itself, memoizing the function that makes the expensive request. Python has a built-in tool for that in functools (lru_cache), and together with pickling it might be enough and very much easier than the other options.
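A minimal sketch of that memoization idea (lru_cache has no TTL of its own, so this passes a coarse time bucket as an argument to force periodic refreshes; the fetch function is a placeholder):

import time
from functools import lru_cache

TTL_SECONDS = 15 * 60

@lru_cache(maxsize=1)
def _account_list(time_bucket):
    # time_bucket exists only to invalidate the cache every TTL_SECONDS.
    return fetch_accounts_from_api()  # placeholder for the expensive call

def get_account_list():
    return _account_list(int(time.time() // TTL_SECONDS))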
I have the exact same scenario.
I have an API call for multiple accounts. Initially I created a Python script to iterate through the list.
When I started using Airflow, I thought about doing what you are planning to do. I tried two of the alternatives you listed. After some experimentation, I decided to handle retry logic within Python with simple try/except blocks if HTTP calls fail. The reasons are:
One script to maintain
Fewer Airflow objects
Restartability is easier with one script in place.
(restarting failed job in Airflow is not a breeze (no pun intended))
In the end it's up to you; that was my experience.
I have one Python script which generates data and one which trains a neural network on this data with TensorFlow and Keras. Both need an instance of the neural network.
Since I haven't set the "allow growth" flag, each process takes the full GPU memory. Therefore I simply give each process its own GPU. (Maybe not a good solution for people with only one GPU... yet another unsolved problem.)
The actual problem is as follows: both instances need access to the network's weights file. I recently had a bunch of crashes because both processes tried to access the weights at the same time. A flag or something similar should stop each process from accessing the file while the other process is accessing it. Hopefully this doesn't create a bottleneck.
I tried to come up with a solution like semaphores in C, but today I found this post on Stack Exchange.
The idea with renaming seems quite simple and effective to me. Is this good practice in my case? I'll just create the weight file with my own function
self.model.save_weights(filepath='weights.h5$$$')
in the learning process, rename them after saving with
os.rename('weights.h5$$$', 'weights.h5')
and load them in my data generating process with function
self.model.load_weights(filepath='weights.h5')
?
Will this renaming overwrite the old file? And what happens if the other process is currently loading? I would appreciate other ideas for how I could multithread/multiprocess my script. I just realized that generating data, learning, generating data, ... in a sequential script is not really performant.
EDIT 1: Forgot to mention that the weights are stored in an .h5 file by Keras's save function.
The multiprocessing module has an RLock class that you can use to regulate access to a shared resource. This also works for files if you remember to acquire the lock before reading or writing and release it afterwards. Using a lock implies that some of the time one of the processes cannot read or write the file. How much of a problem this is depends on how often both processes have to access the file.
Note that for this to work, one of the scripts has to start the other script as a Process after creating the lock.
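A minimal sketch of that setup, assuming the training script spawns the data generator as a Process and hands it the lock (the save/load calls are placeholders for the Keras functions):

import multiprocessing as mp

WEIGHTS_FILE = "weights.h5"

def generate_data(lock):
    # Runs in the child process.
    with lock:                        # block the trainer from writing while we read
        load_weights(WEIGHTS_FILE)    # placeholder, e.g. model.load_weights(...)
    # ... generate data without holding the lock ...

if __name__ == "__main__":
    lock = mp.RLock()
    worker = mp.Process(target=generate_data, args=(lock,))
    worker.start()

    # ... training happens here ...
    with lock:                        # block the reader while we write
        save_weights(WEIGHTS_FILE)    # placeholder, e.g. model.save_weights(...)

    worker.join()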
If the weights are a Python data structure, you could put that under control of a multiprocessing.Manager. That will manage access to the objects under its control for you. Note that a Manager is not meant for use with files, just in-memory objects.
Additionally, on UNIX-like operating systems Python has os.lockf to lock (part of) a file. Note that this is an advisory lock only. That is, if another process calls lockf, it will be told that the file is already locked, but the lock does not actually prevent it from reading the file.
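A rough sketch of such advisory locking with os.lockf (Unix only; the lock-file path is a placeholder, and both processes would have to run the same locking code for it to help):

import os

fd = os.open("weights.h5.lock", os.O_RDWR | os.O_CREAT)  # placeholder lock file
try:
    os.lockf(fd, os.F_LOCK, 0)   # blocks until the other process releases the lock
    # ... read or write weights.h5 here ...
finally:
    os.lockf(fd, os.F_ULOCK, 0)  # release the lock
    os.close(fd)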
Note:
Files can be read and written by multiple processes, but only when both processes are reading the same file (read/read) does this work well. Every other combination (read/write, write/read, write/write) can and eventually will result in undefined behavior and data corruption.
Note 2:
Another possible solution involves inter-process communication.
Process 1 writes a new .h5 file (with a random filename), closes it, and then sends a message to Process 2 (using a Pipe or Queue): "I've written a new parameter file \path\to\file".
Process 2 then reads the file and deletes it. This can work both ways, but it requires that both processes check for and process messages every so often. This prevents file corruption because the writing process only notifies the reading process after it has finished the file.
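A minimal sketch of that hand-off with a multiprocessing.Queue (the save/load calls are placeholders for the Keras functions):

import os
import uuid
from multiprocessing import Process, Queue

def trainer(queue):
    # ... train ...
    path = "weights_{}.h5".format(uuid.uuid4().hex)
    save_weights(path)      # placeholder: only announce *after* the file is complete
    queue.put(path)

def generator(queue):
    path = queue.get()      # blocks until the trainer announces a finished file
    load_weights(path)      # placeholder
    os.remove(path)

if __name__ == "__main__":
    q = Queue()
    p = Process(target=generator, args=(q,))
    p.start()
    trainer(q)
    p.join()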
I have a Python script which periodically scans directories, processing new files. Each file takes a long time to process (many hours). I currently have the script running on a single computer, writing the names of processed files to a local file. Not fancy or robust, but it more or less works. I would like to use multiple worker machines to improve throughput (and robustness). My goal is to keep it as simple as possible. A ZooKeeper cluster is readily available.
My plan is to have in ZooKeeper a directory "started_files" with ephemeral nodes named after the filename, which is known to be unique. I would have another directory "completed_files" with regular nodes named after the filename. In pseudocode:
if filename does not exist in completed_files:
    try:
        create ephemeral node filename in started_files
        process(filename)
        create node filename in completed_files
    except node exists error:
        do nothing, another worker is processing it
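With the kazoo client, I imagine it would look roughly like this (a sketch; hosts and paths are placeholders):

from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # placeholder hosts
zk.start()
zk.ensure_path("/started_files")
zk.ensure_path("/completed_files")

def maybe_process(filename):
    if zk.exists("/completed_files/" + filename):
        return
    try:
        # Ephemeral: the claim disappears if this worker dies or disconnects.
        zk.create("/started_files/" + filename, ephemeral=True)
    except NodeExistsError:
        return  # another worker is processing it
    process(filename)  # the long-running work
    zk.create("/completed_files/" + filename)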
My first question is whether or not this is safe. Under any circumstances, can two different machines each create the same node successfully? I don't fully understand the docs. Having a file processed twice won't cause anything all that bad, but I would prefer it to be correct out of principle.
Secondly, is this a decent approach? Is there another approach which is clearly better? I will be processing tens of files per DAY, so performance of this part of the application doesn't really matter to me (I sure wish processing the files were faster). Alternatively, I could have another script with just a single instance (or elect a leader) to scan for files and put them in a queue. I could modify the code which is causing these files to magically appear in the first place. I could use Celery or Storm. However, all of those alternatives grow the scope, which I am trying to keep small and simple.
In general, your approach should work. It is possible to configure the writing of znodes to ZooKeeper in such a way that creating a path that already exists will fail.
For the ephemeral znodes, you already found out that these vanish automatically if a client closes its connection to ZooKeeper, which can be especially useful in the case of failing compute nodes.
Other nodes can actually monitor the path with the ephemeral znodes in order to figure out when it would be a good idea to scan for new tasks.
It would even be possible to implement a queue on top of ZooKeeper, for instance using sequential znodes; there are possibly better ways.
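If you go that route, kazoo already ships a queue recipe built on sequential znodes, so a sketch could be as small as this (the host and path are placeholders):

from kazoo.client import KazooClient
from kazoo.recipe.queue import Queue

zk = KazooClient(hosts="zk1:2181")  # placeholder
zk.start()
queue = Queue(zk, "/file_tasks")    # placeholder path

queue.put(b"some_filename")         # producer side
item = queue.get()                  # consumer side; returns None if the queue is empty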
In general, I believe that a message queue system with a publish/subscribe pattern would scale a bit better. In that case you would only need to think about how to reschedule the jobs of failed compute nodes.
I am creating an application (app A) in Python that listens on a port, receives NetFlow records, encapsulates them and securely sends them to another application (app B). App A also checks whether the record was successfully sent. If not, it has to be saved. App A waits a few seconds and then tries to send it again, etc. This is the important part: if the sending was unsuccessful, records must be stored, but meanwhile many more records can arrive and they need to be stored too. The ideal way to do that is a queue. However, I need this queue to be in a file (on disk). I found, for example, this code: http://code.activestate.com/recipes/576642/ but it says "On open, loads full file into memory", and that's exactly what I want to avoid. I must assume that this file of records could grow to a couple of GB.
So my question is: what would you recommend for storing these records? It needs to handle a lot of data; on the other hand, it would be great if it weren't too slow, because during normal activity only one record is saved at a time and it's read and removed immediately. So the basic state is an empty queue. And it should be thread-safe.
Should I use a database (dbm, sqlite3, ...) or something like pickle, shelve, or something else?
I am a little confused about this... thank you.
You can use Redis as a database for this. It is very, very fast, does queuing amazingly well, and it can save its state to disk in a few ways, depending on the fault-tolerance level you want. Being an external process, you might not need to have it use a very strict saving policy, since if your program crashes, everything is still saved externally.
See here: http://redis.io/documentation , and if you want more detailed info on how to do this in Redis, I'd be glad to elaborate.
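For a concrete starting point, the queue itself can just be a Redis list via redis-py (a sketch; the connection details and key name are placeholders):

import redis

r = redis.Redis(host="localhost", port=6379)  # placeholder connection

def enqueue(record):
    r.rpush("netflow:retry", record)          # append to the tail of the list

def dequeue(timeout=5):
    # Blocks up to `timeout` seconds waiting for a record; returns None if empty.
    item = r.blpop("netflow:retry", timeout=timeout)
    return item[1] if item else None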
What's the fastest way to get a large number of files (relatively small, 10-50 kB each) from Amazon S3 using Python? (On the order of 200,000 to a million files.)
At the moment I am using boto to generate signed URLs, and using PyCURL to get the files one by one.
Would some type of concurrency help? The pycurl.CurlMulti object?
I am open to all suggestions. Thanks!
I don't know anything about Python, but in general you would want to break the task down into smaller chunks so that they can be run concurrently. You could break it down by file type, or alphabetically, or something, and then run a separate script for each portion of the breakdown.
In the case of Python, as this is I/O-bound, multiple threads will make use of the CPU, but they will probably use up only one core. If you have multiple cores, you might want to consider the multiprocessing module. Even then you may want to have each process use multiple threads. You would have to do some tweaking of the number of processes and threads.
If you do use multiple threads, this is a good candidate for the Queue class.
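A small sketch of that thread-plus-Queue pattern using the standard library (the signed URLs are assumed to come from boto, as in the question; the file naming is a placeholder):

import queue
import threading
import urllib.request

NUM_WORKERS = 20

def worker(q):
    while True:
        item = q.get()
        if item is None:                           # sentinel: no more work
            break
        url, dest = item
        urllib.request.urlretrieve(url, dest)      # download one signed URL to disk
        q.task_done()

q = queue.Queue()
threads = [threading.Thread(target=worker, args=(q,)) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

for i, url in enumerate(signed_urls):              # signed_urls assumed from boto
    q.put((url, "file_{}.dat".format(i)))          # placeholder naming scheme

for _ in threads:
    q.put(None)                                    # stop the workers
for t in threads:
    t.join()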
You might consider using s3fs, and just running concurrent file system commands from Python.
I've been using txAWS with Twisted for S3 work, though what you'd probably want is just to get the authenticated URL and use twisted.web.client.downloadPage (by default it will happily go from stream to file without much interaction).
Twisted makes it easy to run at whatever concurrency you want. For something on the order of 200,000, I'd probably make a generator and use a cooperator to set my concurrency and just let the generator generate every required download request.
If you're not familiar with Twisted, you'll find the model takes a bit of time to get used to, but it's oh so worth it. In this case, I'd expect it to take minimal CPU and memory overhead, but you'd have to worry about file descriptors. It's quite easy to mix in Perspective Broker and farm the work out to multiple machines should you find yourself needing more file descriptors, or if you have multiple connections over which you'd like it to pull the files down.
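A rough sketch of that generator-plus-cooperator pattern (Twisted assumed; downloadPage is the helper mentioned above, and the URL list is a placeholder):

from twisted.internet import defer, reactor, task
from twisted.web.client import downloadPage

CONCURRENCY = 50

def download_all(urls):
    # One shared generator of Deferreds; each coiterate() call pulls from it,
    # so at most CONCURRENCY downloads are in flight at a time.
    work = (downloadPage(url.encode("ascii"), "file_%d" % i)
            for i, url in enumerate(urls))
    coop = task.Cooperator()
    return defer.DeferredList([coop.coiterate(work) for _ in range(CONCURRENCY)])

if __name__ == "__main__":
    urls = []  # placeholder: the signed/authenticated S3 URLs
    d = download_all(urls)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()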
What about threads + a queue? I love this article: Practical threaded programming with Python
Each job can be done with the appropriate tool :)
You want to use Python for stress-testing S3 :), so I suggest finding a large-volume downloader program and passing links to it.
On Windows I have experience installing the ReGet program (shareware, from http://reget.com) and creating download tasks via its COM interface.
Of course, other programs with a usable interface may exist.
Regards!