I'm writing a set of programs that have to operate on a common database, possibly concurrently. For the sake of simplicity (for the user), I didn't want to require the setup of a database server. Therefore I setteled on Berkeley DB, where one can just fire up a program and let it create the DB if it doesn't exist.
In order to let programs work concurrently on a database, one has to use the transactional features present in the 5.x release (here I use python3-bsddb3 6.1.0-1+b2 with libdb5.3 5.3.28-12): the documentation clearly says that it can be done. However I quickly ran in trouble, even with some basic tasks :
Program 1 initializes records in a table
Program 2 has to scan the records previously added by program 1 and updates them with additional data.
To speed things up, there is an index for said additional data. When program 1 creates the records, the additional data isn't present, so the pointer to that record is added to the index under an empty key. Program 2 can then just quickly seek to the not-yet-updated records.
Even when not run concurrently, the record updating program crashes after a few updates. First it complained about insufficient space in the mutex area. I had to resolve this with an obscure DB_CONFIG file and then run db_recover.
Next, again after a few updates it complained 'Cannot allocate memory -- BDB3017 unable to allocate space from the buffer cache'. db_recover and relaunching the program did the trick, only for it to crash again with the same error a few records later.
I'm not even mentioning concurrent use: when one of the programs is launched while the other is running, they almost instantly crash with deadlock, panic about corrupted segments and ask to run recover. I made many changes so I went throug a wide spectrum of errors which often yield irrelevant matches when searched for. I even rewrote the db calls to use lmdb, which in fact works quite well and is really quick, which tends to indicate my program logic isn't at fault. Unfortunately it seems the datafile produced by lmdb is quite sparse, and quickly grew to unacceptable sizes.
From what I said, it seems that maybe some resources are being leaked somewhere. I'm hesitant to rewrite all this directly in C to check if the problem can come from the Python binding.
I can and I will update the question with code, but for the moment ti is long enough. I'm looking for people who have used the transactional stuff in BDB, for similar uses, which could point me to some of the gotchas.
Thanks
RPM (see http://rpm5.org) uses Berkeley DB in transactional mode. There's a fair number of gotchas, depending on what you are attempting.
You have already found DB_CONFIG: you MUST configure the sizes for mutexes and locks, the defaults are invariably too small.
Needing to run db_recover while developing is quite painful too. The best fix (imho) is to automate recovery while opening by checking the return code for DB_RUNRECOVERY, and then reopening the dbenv with DB_RECOVER.
Deadlocks are usually design/coding errors: run db_stat -CA to see what is deadlocked (or what locks are held) and adjust your program. "Works with lmdv" isn't sufficient to claim working code ;-)
Leaks can be seen with either valgrind and/or BDB compilation with -fsanitize:address. Note that valgrind will report false uninitializations unless you use overrides and/or compile BDB to initialize.
Related
I need to get several processes to communicate with each others on a macOS system. These processes will be spawned several times per day at different times, and I cannot predict when they will be up at the same time (if ever). These programs are in python or swift.
How can I safely allow these programs to all write to the same file?
I have explored a few different options:
I thought of using sqlite3, however I couldn't find an answer in the documentation on whether it was safe to write concurrently across processes. This question is not very definitive, old, and I would ideally like to get a more authoritative answer.
I thought of using multiprocesing as it supports locks. However, as far as I could see in the documentation, you need a meta-process that spawns the children and stays up for the duration of the longest child process. I am am fine having a meta-spawner process, but it feels wasteful to have a meta-process basically staying up all day long, just to resolve conflicting access ?
Along the lines of this, I thought of having a process that stays up all day long, and receive messages from all other processes, and is the sole responsible for writing to file. It feels a little wasteful, how worried should I be about the resource cost of having a program up all day, and doing little? Are the only thing to worry about memory footprint and CPU usage (as shown in activity monitor), or could there be other significant costs, e.g. context switching?
I have come across flock on linux, that seems to be a locking mechanism to access files, provided by the OS. This seems like a good solution, but this does not seem to exist on macOS?
Any idea to solve this requirement in a robust manner (so that I don't have to debug every other day - I now concurrency can be a pain), is most welcome!
While you are in control of the source code of all such processes, you could use flock. It will put the advisory lock on file, so the other writer will be blocked only in case he is also access the file the same way. This is OK for you, if only your processes will ever need to write to the shared file.
I've tested flock on BigSur, it is still implemented and works fine.
You can also do it in any other common manner: create temporary .lock file in the known location(this is what git does), and remove it after current writer is done with the main file; use semaphores; etc
I'm trying to optimize my Django 1.8 project's memory usage, and I have noticed that even from initialization, it's using 80+MB, which seems excessive. I observe this when I simply run ./manage shell --plain. By contrast, an empty Django project on startup uses only 30MB. I want to know what's consuming so much memory.
I have tried various things to reduce memory consumption, including a fair amount of stripping down of the project by removing its apps and URLs. I tried poking around gc.get_objects but it's not comprehensible. I got excited about tracemalloc so I build a custom Python 2.7.8 with tracemalloc included, only to realize that it won't start tracking until I call start() from the prompt, at which point the memory has already been consumed.
Questions:
What kinds of things could be causing this high memory floor?
What process can I use to determine the consumer?
And yes, I realize the versions badly need upgrades. Thanks!
UPDATE #1
I did manage to get some use of out tracemalloc. I inserted the import and start at the beginning of manage.py.
/Users/john/.venv/proj-tracemalloc/lib/python2.7/site-packages/zinnia/comparison.py:19: size=25.0 MiB (+25.0 MiB), count=27168 (+27168), average=993 B
/usr/local/tracemalloc-py2.7.8/py27/lib/python2.7/importlib/__init__.py:37: size=3936 KiB (+3936 KiB), count=9581 (+9581), average=420 B
...
It did reveal one interesting thing - this first line s a blogging app. This loop seems to use a lot of memory, although I guess it's temporary. By line 3 everything is 1.5MB and smaller.
PUNCTUATION = dict.fromkeys(
i for i in range(sys.maxunicode)
if unicodedata.category(six.unichr(i)).startswith('P')
)
UPDATE #2
I went through the tedious process of uninstalling packages, apps, and fixing broken code. In the end, it looks like maybe 10MB was due to my internal apps, and 25MB was due to the Django debugging toolbar. This got me down to 45MB. It doesn't seem unreasonable for my core app to be using up 15MB, taking it down to the 30MB core. I don't use the toolbar in production, but it definitely needs additional memory for actually working.
In the end, I didn't learn much, but at least nothing seems insanely wrong. I was disappointed with tracemalloc, but hopefully it will work better as integrated into Python 3.
I have a python script, which is used to perform a lab measurement using several devices. The whole setup is rather involved, including communication over serial devices, API calls as well as the use of self-written and commercial drivers. In the end, however, everything boils down to two nested loops, which vary some parameters, collect data and write it to a file.
My problem is that I observe random occurences of a MemoryError, typically after 10 hours, equivalent to ~15k runs of the loops. At the moment, I don't have an idea, where it comes from or how I can trace it further. So I would be happy for suggestions, how to work on my problem. My observations up to this moment are as follows.
The error occurs at random states of the program. Different runs will throw the MemoryError at different lines of my script.
There is never any helpful error message. Python only says MemoryError without any error string. The traceback leads me to some point in the script, where memory is needed (e.g. when building a list), but it appears to be no specific instruction, which is the problem.
My RAM is far from full. The python process in question typically consumes some ten MB of RAM when viewed in the task manager. In addition, the RAM usage appears to be stable for hours. Usually, it increases slowly for some time, just to drop to down to the previous level quickly, which I interpret as the garbage collector kicking in periodically.
So far I did not find any indications for a memory leak. I used memory_profiler to trace the memory usage of my functions and found it to be stable. In addition, I followed this blog entry to observe what the garbage collector does in detail. Again, I could not find any hints for undeleted objects.
I am stuck to Win7 x86 due to a driver, which will only work on a 32bit system. So I cannot follow suggestions like this to go to a 64 bit version of Windows. Anyway, I do not see, how this would help in my situation.
The iPython console, from which the script is being launched, often behaves strange after the error occurred. Sometimes, a new MemoryError is thrown even for very simple operations. Often, the console is marked by Windows as "not responding" after some time. A menu pops up, where besides the usual options to wait for the process or to terminate it, there is a third option to "restore" the program (whatever that means). Doing so usually causes the console to work normal again.
At this point, I am somewhat out of ideas on how to proceed. The general receipe to comment out parts of the script until it works is highly undesirable in my case. As stated above, each test run will take several hours, meaning a potential downtime of weeks for my lab equipment. Going that direction, appears unfeasable to me. Is there any more direct approach to learn, what is crashing behind the scenes? How can I understand that python apparently fails to malloc?
I have a desktop app that has 65 modules, about half of which read from or write to an SQLite database. I've found that there are 3 ways that the database can throw an SQliteDatabaseError:
SQL logic error or missing database (happens unpredictably every now and then)
Database is locked (if it's being edited by another program, like SQLite Database Browser)
Disk I/O error (also happens unpredictably)
Although these errors don't happen often, when they do they lock up my application entirely, and so I can't just let them stand.
And so I've started re-writing every single access of the database to be a pointer to a common "database-access function" in its own module. That function then can catch these three errors as exceptions and thereby not crash, and also alert the user accordingly. For example, if it is a "database is locked error", it will announce this and ask the user to close any program that is also using the database and then try again. (If it's the other errors, perhaps it will tell the user to try again later...not sure yet). Updating all the database accesses to do this is mostly a matter of copy/pasting the redirect to the common function--easy work.
The problem is: it is not sufficient to just provide this database-access function and its announcements, because at all of the points of database access in the 65 modules there is code that follows the access that assumes the database will successfully return data or complete a write--and when it doesn't, that code has to have a condition for that. But writing those conditionals requires carefully going into each access point and seeing how best to handle it. This is laborious and difficult for the couple of hundred database accesses I'll need to patch in this way.
I'm willing to do that, but I thought I'd inquire if there were a more efficient/clever way or at least heuristics that would help in finishing this fix efficiently and well.
(I should state that there is no particular "architecture" of this application...it's mostly what could be called "ravioli code", where the GUI and database calls and logic are all together in units that "go together". I am not willing to re-write the architecture of the whole project in MVC or something like this at this point, though I'd consider it for future projects.)
Your gut feeling is right. There is no way to add robustness to the application without reviewing each database access point separately.
You still have a lot of important choice at how the application should react on errors that depends on factors like,
Is it attended, or sometimes completely unattended?
Is delay OK, or is it important to report database errors promptly?
What are relative frequencies of the three types of failure that you describe?
Now that you have a single wrapper, you can use it to do some common configuration and error handling, especially:
set reasonable connect timeouts
set reasonable busy timeouts
enforce command timeouts on client side
retry automatically on errors, especially on SQLITE_BUSY (insert large delays between retries, fail after a few retries)
use exceptions to reduce the number of application level handlers. You may be able to restart the whole application on database errors. However, do that only if you have confidence as to in which state you are aborting the application; consistent use of transactions may ensure that the restart method does not leave inconsistent data behind.
ask a human for help when you detect a locking error
...but there comes a moment where you need to bite the bullet and let the error out into the application, and see what all the particular callers are likely to do with it.
The title could have probably been put better, but anyway. I was wondering if there are any functions for writing to files that are like what the ACID properties are for databases. Reason is, I would like to make sure that the file writes I am doin won't mess up and corrupt the file if the power goes out.
Depending on what exactly you're doing with your files and the platform there are a couple options:
If you're serializing a blob from memory to disk repeatedly to maintain state (example: dhcp leases file),
if you're on a Posix system you can write your data to a temporary file and 'rename' the temporary file to your target. On Posix compliant systems this is guaranteed to be an atomic operation, shouldn't even matter if the filesystem is journaled or not. If you're on a Windows system, there's a native function named MoveFileTransacted that you might be able to utilize via bindings. But the key concept here is, the temporary file protects your data, if the system reboots the worst case is that your file contains the last good refresh of data. This option requires that you write the entire file out every time you want a change to be recorded. In the case of dhcp.leases file this isn't a big performance hit, larger files might prove to be more cumbersome.
If you're reading and writing bits of data constantly, sqlite3 is the way to go -- it supports atomic commits for groups of queries and has it's own internal journal. One thing to watch out for here is that atomic commits will be slower due to the overhead of locking the database, waiting for the data to flush, etc.
A couple other things to consider -- if your filesystem is mounted async, writes will appear to be complete because the write() returns, but it might not be flushed to disk yet. Rename protects you in this case, sqlite3 does as well.
If your filesystem is mounted async, it might be possible to write data and move it before the data is written. So if you're on a unix system it might be safest to mount sync. That's on the level of 'people might die if this fails' paranoia though. But if it's an embedded system and it dies 'I might lose my job if this fails' is also a good rationalization for the extra protection.
The ZODB is an ACID compliant database storage written in (mostly) python, so in a sense the answer is yes. But I can imagine this is a bit overkill :)
Either the OS has to provide this for you, or you'll need to implement your own ACID compliancy. For example, by defining 'records' in the file you write and, when opening/reading, verifying which records have been written (which may mean you need to throw away some non-fully written data). ZODB, for example, implements this by ending a record by writing the size of the record itself; if you can read this size and it matches, you know the record has been fully written.
And, of course, you always need to append records and not rewrite the entire file.
It looks to me that your main goal is to ensure the integrity of written files in case of power failures and system crashes. There a couple of things to be considered when doing this:
Ensure that your data is written to disk when you close a file. Even if you close it, some of the data may be in OS cache for several seconds waiting to be written to the disk. You can force writing to disk with f.flush(), followed with os.fsync(f.fileno()).
Don't modify existing data before you are certain that the updated data is safely on the disk. This part can be quite tricky (and OS/filesystem dependent).
Use file format that helps you to verify the integrity of data (e.g. use checksums).
Another alternative is to use sqlite3.
EDIT: Regarding my second point, I highly recommend this presentation: http://www.flamingspork.com/talks/2007/06/eat_my_data.odp. This also covers issues with "atomic rename".