How to increase Berkeley DB cache size - Python

I am trying to set the cache size in gigabytes for a BDB file. I am using the Python interface for BDB. I see the underlying C API for BDB has this option:
int DB->set_cachesize(DB *db, u_int32_t gbytes, u_int32_t bytes, int ncache);
But I am able to pass only one cachesize argument to the btopen function, and it is interpreted as the cache size in bytes. This restricts the maximum cache size to 2 GB. I would like to be able to set the cache size to at least 4 GB.
Any help to be able to set/increase the cache size would be greatly appreciated, thanks in advance!
Below is the Python call I am using to set the cache size.
cache_size = (2*1024*1024*1024) - 1
db = bsddb.btopen(self._bdbFileName, cachesize=cache_size, flag='n')

I am not familiar with the Python interface to BDB, but when I use the C API I find it very useful to open the database inside an environment and put most of the configuration in that environment.
For example, open the BDB file with an environment whose home directory is ~/env/, then put a file named DB_CONFIG in ~/env/ with the content:
set_cachesize 4 0 1
The cache size will be set to 4 GB (4 gigabytes, 0 extra bytes, 1 cache region). No code is needed.
Check the DB_CONFIG documentation for all the available configuration directives.
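If you prefer to stay in Python, the lower-level bsddb.db API (also shipped as the bsddb3 package on PyPI) mirrors the C API and exposes DBEnv.set_cachesize(gbytes, bytes, ncache) directly. A minimal sketch under that assumption, with the paths and file names as placeholders:

from bsddb3 import db  # or "from bsddb import db" on older Python 2 installs

env = db.DBEnv()
# 4 GB cache: gbytes=4, bytes=0, ncache=1 -- same arguments as DB->set_cachesize()
env.set_cachesize(4, 0, 1)
env.open('/path/to/env', db.DB_CREATE | db.DB_INIT_MPOOL)

d = db.DB(env)
d.open('data.bdb', dbtype=db.DB_BTREE, flags=db.DB_CREATE)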

Related

Is there a way to open hdf5 files with the POSIX_FADV_DONTNEED flag?

We are working with large (1.2 TB) uncompressed, unchunked HDF5 files with h5py in Python for a machine learning application, which requires us to work through the full dataset repeatedly, loading slices of ~15 MB individually in randomized order. We are working on a Linux (Ubuntu 18.04) machine with 192 GB of RAM. We noticed that the program slowly fills the page cache. When the total size of the cache reaches a size comparable to the machine's RAM (free memory in top almost 0, but plenty of 'available' memory), swapping occurs, slowing down all other applications. In order to pinpoint the source of the problem, we wrote a separate minimal example to isolate our data-loading procedure, but found that the problem was independent of each part of our method.
We tried:
Building a numpy memmap and accessing the requested slice:
#on init:
f = h5py.File(tv_path, 'r')
hdf5_event_data = f["event_data"]
self.event_data = np.memmap(tv_path, mode="r", shape=hdf5_event_data.shape,
                            offset=hdf5_event_data.id.get_offset(), dtype=hdf5_event_data.dtype)
self.e = np.ones((512,40,40,19))
#on __getitem__:
self.e = self.event_data[index,:,:,:19]
return self.e
Reopening the memmap on each call to __getitem__:
#on __getitem__:
self.event_data = np.memmap(self.path, mode="r", shape=self.shape,
                            offset=self.offset, dtype=self.dtype)
self.e = self.event_data[index,:,:,:19]
return self.e
Addressing the h5 file directly and converting to a numpy array:
#on init:
f = h5py.File(tv_path, 'r')
hdf5_event_data = f["event_data"]
self.event_data = hdf5_event_data
self.e = np.ones((512,40,40,19))
#on __getitem__:
self.e = self.event_data[index,:,:,:19]
return self.e
We also tried the above approaches within the PyTorch Dataset/DataLoader framework, but it made no difference.
We observe high memory fragmentation, as evidenced by /proc/buddyinfo. Dropping the cache via sync; echo 3 > /proc/sys/vm/drop_caches does not help while the application is running. Cleaning the cache before the application starts removes the swapping behaviour until the cache eats up the memory again, and swapping starts again.
Our working hypothesis is that the system is trying to hold on to cached file data, which leads to memory fragmentation. Eventually, when new memory is requested, swapping is performed even though most memory is still 'available'.
As such, we turned to ways of changing the Linux environment's behaviour around file caching and found this post. Is there a way to apply the POSIX_FADV_DONTNEED flag when opening an h5 file in Python, or to a portion of it accessed via numpy memmap, so that this accumulation of cache does not occur? In our use case we will not revisit that particular file location for a long time (until we access all the other remaining 'slices' of the file).
You can use os.posix_fadvise to tell the OS how the regions you plan to load will be used. This naturally requires a bit of low-level tweaking to determine your file descriptor and to get an idea of the regions you plan on reading.
The easiest way to get the file descriptor is to supply it yourself:
pf = open(tv_path, 'rb')
f = h5py.File(pf, 'r')
You can now set the advice. For the entire file:
os.posix_fadvise(os.fileno(pf), 0, f.id.get_filesize(), os.POSIX_FADV_DONTNEED)
Or for a particular dataset:
os.posix_fadvise(os.fileno(pf), hdf5_event_data.id.get_offset(),
                 hdf5_event_data.id.get_storage_size(), os.POSIX_FADV_DONTNEED)
Other things to look at
H5py does its own chunk caching. You may want to try turning this off:
f = h5py.File(..., rdcc_nbytes=0)
As an alternative, you may want to try using one of the other drivers provided in h5py, like 'sec2':
f = h5py.File(..., driver='sec2')
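Putting the pieces together, a hypothetical per-slice pattern could look like the sketch below. The offset arithmetic assumes an uncompressed, contiguous (unchunked) dataset as described in the question, and tv_path, index and load_slice are placeholder names:

import os
import h5py
import numpy as np

pf = open(tv_path, 'rb')
f = h5py.File(pf, 'r')
dset = f["event_data"]
row_bytes = int(np.prod(dset.shape[1:])) * dset.dtype.itemsize  # on-disk bytes per outer index

def load_slice(index):
    data = dset[index, :, :, :19]
    # advise the kernel that the pages backing this slice won't be needed again soon
    os.posix_fadvise(os.fileno(pf), dset.id.get_offset() + index * row_bytes,
                     row_bytes, os.POSIX_FADV_DONTNEED)
    return data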

Result pointer in function call

I want to use functions from DLLs via ctypes. I can call the function without errors, and even the error code of the function is 0, meaning the function finished successfully. But when I try to access the result variable it is empty.
I implemented the lookup in Free Pascal several years ago and would like to port it to Python now. The interface allows access via the cdecl convention, and I tried to reimplement it in Python 3.7.4 with ctypes.
The last working Pascal prototype was:
PROCEDURE pGetCallInfo(DriveInfo: pointer; ACall: pointer; AInfo: pointer;
var AErrorCode: SmallInt); pascal; external 'raccd32a.dll';
My best attempt in Python was the following:
from ctypes import *
callBookDLL = CDLL('raccd32a')
AInfo = create_string_buffer(400)
err = callBookDLL.cGetCallInfo("self.txt_CallBookPath.text()","DG1ATN",AInfo)
The result is:
err
0
AInfo.value
b''
AInfo should contain a string buffer of at most 400 characters with a result containing name, address and so on.
As I have a second library I have to access the same way, I searched for my mistake but was not able to find it. I think my problem is the handling of the pointer and the type conversion.
I have already checked the ctypes howto but I cannot solve this problem.
Thanks a lot so far ...
Check [Python 3.Docs]: ctypes - A foreign function library for Python. It contains (almost) every piece of info that you need.
There are a number of problems:
ctypes doesn't support pascal calling convention, only cdecl and stdcall (applies to 32bit only). That means (after reading the manual) that you shouldn't use the p* functions, but the c* (or s*)
You didn't specify argtypes (and restype) for your function. This results in UB. Some effects of this:
[SO]: Python ctypes cdll.LoadLibrary, instantiate an object, execute its method, private variable address truncated (#CristiFati's answer)
[SO]: python ctypes issue on different OSes (#CristiFati's answer)
It is a procedure (a function returning void). Anyway, this is a minor one.
Here's some sample code (of course it's blind, as I didn't test it):
#!/usr/bin/env python3
import sys
import ctypes
dll = ctypes.CDLL("raccd32a.dll")
cGetCallInfo = dll.cGetCallInfo
cGetCallInfo.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p, ctypes.POINTER(ctypes.c_short)]
cGetCallInfo.restype = None
ADriveInfo = self.txt_CallBookPath.text().encode()
#ADriveInfo = b"C:\\callbook2019\\"  # Notice the double backslashes
ACall = b"DG1ATN"
AInfo = ctypes.create_string_buffer(400)
result = ctypes.c_short(0)
cGetCallInfo(ADriveInfo, ACall, AInfo, ctypes.byref(result))
#EDIT0:
From the beginning, I wanted to say that the 1st argument passed to the function doesn't make much sense. Then, there are problems regarding the 2nd one as well. According to the manual ([AMT-I]: TECHNICAL INFORMATION about RACCD32a.DLL (emphasis is mine)):
ADriveInfo, ACall and AInfo are pointers to zero-terminated strings. These
strings has to exist at the moment of calling xGetCallInfo. The calling
program is responsible for creating them. AInfo must be long enough to
comfort xGetCallInfo (at least 400 characters).
Note: "Length of AInfo" refers to the length of the string AInfo points at.
ADriveInfo and ACall are treated in the same manner for short.
In ADriveInfo the procedure expects the path to the CD ROM drive. Use
"G:\"
if "G:" designates the CD ROM drive with the callbook CD ROM.
Keep in mind that this information is a *must* and the calling program
has to know it.
Note: If the active directory on drive G: is not the root, ADriveInfo = "G:"
will lead to an error 3. So always use "G:\".
The calling program has to ensure that the length of ADriveInfo does not
exceed 80 characters.
ACall contains the call you are looking for, all letters in lower case,
no additional spaces etc. The calling program has to ensure that ACall is
not longer than 15 characters. However, there is no call longer than 6
characters in the database.
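As a hypothetical follow-up to the sample code above, the call would then pass a proper drive path and check the returned error code; the drive letter is a placeholder and the error handling is an assumption, only the formatting rules come from the quoted manual:

ADriveInfo = b"G:\\"   # CD-ROM drive root, trailing backslash required
ACall = b"dg1atn"      # the manual asks for all lower case
AInfo = ctypes.create_string_buffer(400)
result = ctypes.c_short(0)

cGetCallInfo(ADriveInfo, ACall, AInfo, ctypes.byref(result))
if result.value == 0:
    print(AInfo.value.decode(errors="replace"))
else:
    print("xGetCallInfo failed with error", result.value)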

How to cause Errno 23 ENFILE on purpose

Is there a way I can cause errno 23 (ENFILE: file table overflow) on purpose?
I am doing socket programming and I want to check whether creating too many sockets can cause this error. As I understand it, a created socket is treated as a file descriptor, so it should count towards the system limit on open files.
Here is the part of my Python script which creates the sockets:
import resource
import socket

def enfile():
    nofile_soft_limit = 10000
    nofile_hard_limit = 20000
    resource.setrlimit(resource.RLIMIT_NOFILE, (nofile_soft_limit, nofile_hard_limit))
    sock_table = []
    for i in range(0, 10000):
        print "Creating socket number {0}".format(i)
        try:
            temp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.SOL_UDP)
        except socket.error as msg:
            print 'Failed to create socket. Error code: ' + str(msg[0]) + ' , Error message : ' + msg[1]
            print msg[0]
        sock_table.append(temp)
With setrlimit() I raise the process's limit on open files to a high value, so that I don't get errno 24 (EMFILE).
I have tried two approaches:
1) Per-user limit
by changing /etc/security/limits.conf
root hard nofile 5000
root soft nofile 5000
(logged in with a new session after that)
2) System-wide limit
by changing /etc/sysctl.conf
fs.file-max = 5000
and then run sysctl -p to apply the changes.
My script easily creates 10k sockets despite the per-user and system-wide limits, and it ends with errno 24 (EMFILE).
Is it possible to achieve my goal? I am using two OSes - CentOS 6.7 and Fedora 20. Maybe there are some other settings to change on these systems?
Thanks!
ENFILE will only happen if the system-wide limit is reached, whereas the settings you've tried so far are per-process, so only related to EMFILE. For more details including which system-wide settings to change to trigger ENFILE, see this answer: https://stackoverflow.com/a/24862823/4323 as well as https://serverfault.com/questions/122679/how-do-ulimit-n-and-proc-sys-fs-file-max-differ
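As a quick sanity check (a Linux-only sketch, not from the original posts), you can compare the number of allocated file handles against fs.file-max via /proc/sys/fs/file-nr before expecting ENFILE:

# /proc/sys/fs/file-nr holds: allocated handles, unused handles, fs.file-max
with open('/proc/sys/fs/file-nr') as f:
    allocated, unused, maximum = (int(x) for x in f.read().split())
print('Allocated file handles: %d of fs.file-max %d' % (allocated, maximum))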
You should look for an answer in kernel sources.
Socket call returns ENFILE in __sock_create() when sock_alloc() returns NULL. This can happen only if it can't allocate a new inode.
You can use:
df -i
to check for your inodes usage.
Unfortunately the inode limit can't be changed dynamically.
Generally the total number of inodes and the space reserved for these inodes is set when the filesystem is first created.
Solution?
Modern filesystems like Btrfs and XFS use dynamic inodes to avoid inode limits - if you have one of them, it could be impossible to do that.
If you have an LVM disk, decreasing the size of the volume could help.
But if you want to be sure of simulating the situation from your post, you should create a googol of files, 1 byte each, and you will run out of inodes long before you run out of disk. Then you can try to create a socket.
If I am wrong, please correct me.

Python interface to C++ COM dll

So I'm trying to interface with a COM object from Python and having some difficulty, as I'm not much of a programmer. I've interfaced Python with C DLLs before, but not with COM DLLs. That may not necessarily be the source of the problem, though. Any help or suggestions would be very much appreciated.
I've been able to load the library with Python with:
CiGenUsb = pythoncom.MakeIID(CiGenUsb_string)
win32com.client.pythoncom.CoInitialize()
disp = win32com.client.gencache.EnsureDispatch(CiGenUsb)
I've been able to call some functions okay but not the following one. The function is defined in C++ as:
CIUsb_SendFrame([in] LONG nDevId, [in] BYTE* pFrameData, [in] LONG nSize, [out] LONG* pStatus);
The data I want to send with CIUsb_SendFrame - the pFrameData array - is first read in as an array of 160 integers in Python. I then put that into a byte array (of 320 bytes):
frame_bytes_type = ctypes.c_ubyte * 320
frame_bytes = frame_bytes_type()
j = 0
for i in range(0, 320, 2):
    frame_bytes[i] = intData[j] & 0xff
    frame_bytes[i+1] = (intData[j] >> 8) & 0xff
    j = j + 1
disp.CIUsb_SendFrame(0, ctypes.addressof(frame_bytes), ctypes.sizeof(frame_bytes),0)
The code runs, but the frame that is output to the hardware the code controls looks very wrong, and it seems to change from run to run. So I assume the data being sent is not what's in frame_bytes but something random.
The COM library also has variant versions of all the functions for use in ActiveX Automation environments, but I have even less of a clue how to use those.
Thanks.
Edit: I have called CoInitialize, which I've now included in the snippet above. nDevId set to 0 is correct. I'll try using something more correct for *pStatus. The hardware being controlled consists of 160 pixels, each of which can take a 2-byte value. But since the CIUsb_SendFrame function takes a BYTE array, I create the 320-element array and pass that.
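For what it's worth, a hypothetical equivalent of the packing loop above uses struct to build the 320-byte little-endian frame in one call (assuming each of the 160 values in intData fits in 16 bits):

import struct
import ctypes

raw = struct.pack('<160H', *intData)  # 320 bytes, explicitly little-endian
frame_bytes = (ctypes.c_ubyte * 320).from_buffer_copy(raw)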

Seeming discrepancy in shutil.disk_usage()

I am using the shutil.disk_usage() function to find the current disk usage of a particular path (amount available, used, etc.). As far as I can tell, this is a wrapper around os.statvfs(). I'm finding that it does not give the answers I'd expect, compared to the output of du on Linux.
I have obscured some of the paths below for company privacy reasons, but the output and code are otherwise undoctored. I am using Python 3.3.2 64-bit version.
#!/apps/python/3.3.2_64bit/bin/python3
# test of shutil.disk_usage
import shutil
BytesPerGB = 1024 * 1024 * 1024
(total, used, free) = shutil.disk_usage("/data/foo/")
print ("Total: %.2fGB" % (float(total)/BytesPerGB))
print ("Used: %.2fGB" % (float(used)/BytesPerGB))
(total1, used1, free1) = shutil.disk_usage("/data/foo/utils/")
print ("Total: %.2fGB" % (float(total1)/BytesPerGB))
print ("Used: %.2fGB" % (float(used1)/BytesPerGB))
Which outputs:
/data/foo/drivecode/me % disk_usage_test.py
Total: 609.60GB
Used: 291.58GB
Total: 609.60GB
Used: 291.58GB
As you can see, the main problem is I would expect the second amount for "Used" to be much smaller, as it is a subset of the first directory.
/data/foo/drivecode/me % du -sh /data/foo/utils
2.0G /data/foo/utils
As much as I trust du, I find it hard to believe the Python module would be incorrect either. So perhaps it is just my understanding of Linux filesystems that is the issue. :)
I wrote a module (based heavily on someone's code here at SO) which recursively gets the disk usage, and which I was using until now. It appears to match the du output but is much, much slower than shutil.disk_usage(), so I'm hoping I can make the latter work.
Thanks much in advance.
The problem is that shutil uses the statvfs system call underneath to determine the space used. This system call has no file-path granularity as far as I'm aware, only file-system granularity. What this means is that the path you provide only serves to identify the file system you want to query, not the usage of that particular path.
In other words, you gave it the path /data/foo/utils and it determined which file system backs that path. Then it queried the file system. This becomes apparent when you consider how the used value is defined in shutil:
used = (st.f_blocks - st.f_bfree) * st.f_frsize
Where:
fsblkcnt_t f_blocks; /* size of fs in f_frsize units */
fsblkcnt_t f_bfree; /* # free blocks */
unsigned long f_frsize; /* fragment size */
This is why it's giving you the total space used on the entire file system.
Indeed, it seems to me that the du command itself also traverses the file structure and adds up the file sizes. Here is the GNU coreutils du command's source code.
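For comparison, a minimal du-style walk (an illustration, not the asker's module) that sums apparent file sizes under a path; note that du reports block usage, so the numbers will not match exactly:

import os

def dir_usage(path):
    # recursively sum apparent file sizes, skipping symlinks
    total = 0
    for root, dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total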
shutil.disk_usage returns the usage of the disk (i.e. of the mount point which backs the path) and not the actual file usage under that path. It is the equivalent of running df /path/to/mount, not du /path/to/files. Notice that for both directories you got exactly the same usage.
From the docs: "Return disk usage statistics about the given path as a named tuple with the attributes total, used and free, which are the amount of total, used and free space, in bytes."
Update for anyone stumbling upon this after 2013:
Depending on your Python version and OS, shutil.disk_usage might support files and directories for the path variable. Here's the breakdown:
Windows:
3.3 - 3.5: only supports mountpoint/filesystem
3.6 - 3.7: directory support
3.8+: file & directory support
Unix:
3.3 - 3.5: only supports mountpoint/filesystem
3.6+: file & directory support
