I am using the shutil.disk_usage() function to find the current disk usage of a particular path (amount available, used, etc.). As far as I can find, this is a wrapper around os.statvfs() calls. I'm finding that it is not giving the answers I'd expect when compared to the output of "du" on Linux.
I have obscured some of the paths below for company privacy reasons, but the output and code are otherwise undoctored. I am using Python 3.3.2 64-bit version.
#!/apps/python/3.3.2_64bit/bin/python3
# test of the shutil.disk_usage function
import shutil
BytesPerGB = 1024 * 1024 * 1024
(total, used, free) = shutil.disk_usage("/data/foo/")
print ("Total: %.2fGB" % (float(total)/BytesPerGB))
print ("Used: %.2fGB" % (float(used)/BytesPerGB))
(total1, used1, free1) = shutil.disk_usage("/data/foo/utils/")
print ("Total: %.2fGB" % (float(total1)/BytesPerGB))
print ("Used: %.2fGB" % (float(used1)/BytesPerGB))
Which outputs:
/data/foo/drivecode/me % disk_usage_test.py
Total: 609.60GB
Used: 291.58GB
Total: 609.60GB
Used: 291.58GB
As you can see, the main problem is I would expect the second amount for "Used" to be much smaller, as it is a subset of the first directory.
/data/foo/drivecode/me % du -sh /data/foo/utils
2.0G /data/foo/utils
As much as I trust "du," I find it hard to believe the Python module would be incorrect either. So perhaps it is just my understanding of Linux filesystems that could be the issue. :)
I wrote a module (based heavily on someone's code here at SO) which recursively gets the disk_usage, which I was using until now. It appears to match the "du" output but is MUCH, much slower than the shutil.disk_usage() function, so I'm hoping I can make that one work.
Thanks much in advance.
The problem is that shutil uses the statvfs system call underneath to determine the space used. This system call has no file-path granularity as far as I'm aware, only file-system granularity. What this means is that the path you provide only helps identify the file system you want to query, not the specific path within it.
In other words, you gave it the path /data/foo/utils and then it determined which file system backs this file path. Then it queried the file system. This becomes apparent when you consider how the used parameter is defined in shutil:
used = (st.f_blocks - st.f_bfree) * st.f_frsize
Where:
fsblkcnt_t f_blocks; /* size of fs in f_frsize units */
fsblkcnt_t f_bfree; /* # free blocks */
unsigned long f_frsize; /* fragment size */
This is why it's giving you the total space used on the entire file system.
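For reference, the same numbers can be read directly from os.statvfs, which is what shutil.disk_usage uses on Unix; a minimal sketch using the question's path:
import os

st = os.statvfs("/data/foo/utils/")
total = st.f_blocks * st.f_frsize                 # size of the whole filesystem
used = (st.f_blocks - st.f_bfree) * st.f_frsize   # same formula shutil uses
free = st.f_bavail * st.f_frsize                  # space available to unprivileged users
print(total, used, free)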
Indeed, it seems to me like the du command itself also traverses the file structure and adds up the file sizes (see the GNU coreutils source for du).
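A rough Python sketch of that kind of traversal, for comparison (it sums apparent file sizes and ignores hard links and block rounding, so it will not match du byte for byte):
import os

def du(path):
    # Walk the tree and sum regular file sizes, roughly what du does
    total = 0
    for dirpath, dirnames, filenames in os.walk(path):
        for name in filenames:
            fp = os.path.join(dirpath, name)
            if not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total

print("%.2fGB" % (du("/data/foo/utils/") / float(1024 ** 3)))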
shutil.disk_usage returns the usage of the whole filesystem (i.e. of the mount point which backs the path), not the actual file usage under that path. It is the equivalent of running df /path/to/mount, not du /path/to/files. Notice that you got exactly the same usage for both directories.
From the docs: "Return disk usage statistics about the given path as a named tuple with the attributes total, used and free, which are the amount of total, used and free space, in bytes."
Update for anyone stumbling upon this after 2013:
Depending on your Python version and OS, shutil.disk_usage might support files and directories for the path variable. Here's the breakdown:
Windows:
3.3 - 3.5: only supports mountpoint/filesystem
3.6 - 3.7: directory support
3.8+: file & directory support
Unix:
3.3 - 3.5: only supports mountpoint/filesystem
3.6+: file & directory support
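For example, on a version with file support, a plain file path is accepted, but the numbers still describe the filesystem that contains the file, not the file itself:
import shutil

# Python 3.6+ on Unix (3.8+ on Windows); still reports filesystem totals
print(shutil.disk_usage(__file__))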
I'm looking at using Go to write a small program that's mostly handling text. I'm pretty sure, based on what I've heard about Go and Python that Go will be substantially faster. I don't actually have a specific need for insane speeds, but I'd like to get to know Go.
The "Go is going to be faster" idea was supported by a trivial test:
# test.py
print("Hello world")
$ time python test.py
Hello world
real 0m0.029s
user 0m0.019s
sys 0m0.010s
// test.go
package main
import "fmt"
func main() {
    fmt.Println("hello world")
}
$ time ./test
hello world
real 0m0.001s
user 0m0.001s
sys 0m0.000s
Looks good in terms of raw startup speed (which is entirely expected). Highly non-scientific justification:
$ strace python test.py 2>&1 | wc -l
1223
$ strace ./test 2>&1 | wc -l
174
However, my next contrived test was how fast Go is when faffing with strings, and I was expecting to be similarly blown away by Go's raw speed. So, this was surprising:
# test2.py
s = ""
for i in range(1000000):
    s += "a"
$ time python test2.py
real 0m0.179s
user 0m0.145s
sys 0m0.013s
// test2.go
package main
func main() {
s := ""
for i:= 0; i < 1000000; i++ {
s += "a";
}
}
$ time ./test2
real 0m56.840s
user 1m50.836s
sys 0m17.653s
So Go is hundreds of times slower than Python.
Now, I know this is probably due to Schlemiel the Painter's algorithm, which explains why the Go implementation is quadratic in i (making i 10 times bigger leads to a 100-times slowdown).
However, the Python implementation seems much faster: 10 times more loops only slows it down by twice. The same effect persists if you concatenate str(i), so I doubt there's some kind of magical JIT optimization to s = 100000 * 'a' going on. And it's not much slower if I print(s) at the end, so the variable isn't being optimised out.
Naivety of the concatenation methods aside (there are surely more idiomatic ways in each language), is there something here that I have misunderstood, or is it simply easier in Go than in Python to run into cases where you have to deal with C/C++-style algorithmic issues when handling strings (in which case a straight Go port might not be as uh-may-zing as I might hope without having to, ya'know, think about things and do my homework)?
Or have I run into a case where Python happens to work well, but falls apart under more complex use?
Versions used: Python 3.8.2, Go 1.14.2
TL;DR summary: basically you're testing the two implementations' allocators / garbage collectors and heavily weighting the scale on the Python side (by chance, as it were, but this is something the Python folks optimized at some point).
To expand my comments into a real answer:
Both Go and Python have counted strings, i.e., strings are implemented as a two-element header thingy containing a length (byte count or, for Python 3 strings, Unicode character count) and a data pointer.
Both Go and Python are garbage-collected (GCed) languages. That is, in both languages, you can allocate memory without having to worry about freeing it yourself: the system takes care of that automatically.
But the underlying implementations differ, quite a bit, in one particularly important way: the version of Python you are using has a reference-counting GC. The Go system you are using does not.
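As a small aside, the reference count itself is easy to observe in CPython (sys.getrefcount includes one extra reference for its own temporary argument):
import sys

s = "a" * 100000           # a non-interned string
print(sys.getrefcount(s))  # 2: the variable s plus the temporary argument
t = s
print(sys.getrefcount(s))  # 3: now t refers to the same object as well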
With a reference count, the inner bits of the Python string handler can do this. I'll express it as Go (or at least pseudo-Go) although the actual Python implementation is in C and I have not made all the details line up properly:
// add (append) new string t to existing string s
func add_to_string(s, t string_header) string_header {
    need := s.len + t.len
    if s.refcount == 1 { // can modify string in-place
        data := s.data
        if cap(data) >= need {
            copy_into(data + s.len, t.data, t.len)
            return s
        }
    }
    // s is shared or s.cap < need
    new_s := make_new_string(roundup(need))
    // important: new_s has extra space for the next call to add_to_string
    copy_into(new_s.data, s.data, s.len)
    copy_into(new_s.data + s.len, t.data, t.len)
    s.refcount--
    if s.refcount == 0 {
        gc_release_string(s)
    }
    return new_s
}
By over-allocating—rounding up the need value so that cap(new_s) is large—we get about log2(n) calls to the allocator, where n is the number of times you do s += "a". With n being 1000000 (one million), that's about 20 times that we actually have to invoke the make_new_string function and release (for gc purposes because the collector uses refcounts as a first pass) the old string s.
[Edit: your source archaeology led to commit 2c9c7a5f33d, which suggests less than doubling but still a multiplicative increase. To other readers, see comment.]
The current Go implementation allocates strings without a separate capacity header field (see reflect.StringHeader and note the big caveat that says "don't depend on this, it might be different in future implementations"). Between the lack of a refcount—we can't tell in the runtime routine that adds two strings, that the target has only one reference—and the inability to observe the equivalent of cap(s) (or cap(s.data)), the Go runtime has to create a new string every time. That's one million memory allocations.
To show that the Python code really does use the refcount, take your original Python:
s = ""
for i in range(1000000):
    s += "a"
and add a second variable t like this:
s = ""
t = s
for i in range(1000000):
    s += "a"
    t = s
The difference in execution time is impressive:
$ time python test2.py
0.68 real 0.65 user 0.03 sys
$ time python test3.py
34.60 real 34.08 user 0.51 sys
The modified Python program still beats Go (1.13.5) on this same system:
$ time ./test2
67.32 real 103.27 user 13.60 sys
and I have not poked any further into the details, but I suspect the Go GC is running more aggressively than the Python one. The Go GC is very different internally, requiring write barriers and occasional "stop the world" behavior (of all goroutines that are not doing the GC work). The refcounting nature of the Python GC allows it to never stop: even with a refcount of 2, the refcount on t drops to 1 and then next assignment to t drops it to zero, releasing the memory block for re-use in the next trip through the main loop. So it's probably picking up the same memory block over and over again.
(If my memory is correct, Python's "over-allocate strings and check the refcount to allow expand-in-place" trick was not in all versions of Python. It may have first been added around Python 2.4 or so. This memory is extremely vague and a quick Google search did not turn up any evidence one way or the other. [Edit: Python 2.7.4, apparently.])
Well. You should never, ever use string concatenation in this way :-)
In Go, try strings.Builder:
package main

import (
    "strings"
)

func main() {
    var b1 strings.Builder
    for i := 0; i < 1000000; i++ {
        b1.WriteString("a")
    }
}
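The idiomatic Python counterparts avoid repeated concatenation in the same spirit; two hedged equivalents:
# Build the string in one go from an iterable
s = "".join("a" for _ in range(1000000))

# Or accumulate through an in-memory text buffer, closest to strings.Builder
import io
buf = io.StringIO()
for _ in range(1000000):
    buf.write("a")
s = buf.getvalue()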
I want to use functions in DLLs via ctypes. I can call the function without errors, and even the error code of the function is 0, meaning the function finished successfully. But when I try to access the result variable, it is empty.
I implemented the lookup in Free Pascal several years ago and would like to port it to Python now. The interface allows access via the cdecl convention, and I tried to reimplement it in Python 3.7.4 with ctypes.
The last working Pascal prototype was:
PROCEDURE pGetCallInfo(DriveInfo: pointer; ACall: pointer; AInfo: pointer;
    var AErrorCode: SmallInt); pascal; external 'raccd32a.dll';
My best version in Python so far has been the following:
from ctypes import *
callBookDLL = CDLL('raccd32a')
AInfo = create_string_buffer(400)
err = callBookDLL.cGetCallInfo("self.txt_CallBookPath.text()","DG1ATN",AInfo)
The result is:
err
0
AInfo.value
b''
AInfo should contain a string buffer of up to 400 characters with a result holding the name, address and so on.
Since I have a second library I have to access the same way, I searched for my mistake but was not able to find it. I think my problem is the handling of pointers and the type conversion.
I already checked the ctypes howto but I cannot solve this problem.
Thanks a lot so far ...
Check [Python 3.Docs]: ctypes - A foreign function library for Python. It contains (almost) every piece of info that you need.
There are a number of problems:
ctypes doesn't support pascal calling convention, only cdecl and stdcall (applies to 32bit only). That means (after reading the manual) that you shouldn't use the p* functions, but the c* (or s*)
You didn't specify argtypes (and restype) for your function. This results in UB. Some effects of this:
[SO]: Python ctypes cdll.LoadLibrary, instantiate an object, execute its method, private variable address truncated (#CristiFati's answer)
[SO]: python ctypes issue on different OSes (#CristiFati's answer)
It is a procedure (a function that returns void). Anyway this is a minor one
Here's some sample code (of course it's blind, as I didn't test it):
#!/usr/bin/env python3
import sys
import ctypes
dll = ctypes.CDLL("raccd32a.dll")
cGetCallInfo = dll.cGetCallInfo
cGetCallInfo.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p, ctypes.POINTER(ctypes.c_short)]
cGetCallInfo.restype = None
ADriveInfo = self.txt_CallBookPath.text().encode()
#ADriveInfo = b"C:\\callbook2019\\" # Notice the double backslashes
ACall = b"DG1ATN"
AInfo = ctypes.create_string_buffer(400)
result = ctypes.c_short(0)
cGetCallInfo(ADriveInfo, ACall, AInfo, ctypes.byref(result))
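If the call succeeds, the error code and the buffer can then be inspected; a small follow-up sketch (the text encoding is an assumption, the manual does not state which one the DLL uses):
print(result.value)                                    # 0 should mean success
print(AInfo.value.decode("ascii", errors="replace"))   # encoding is a guess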
#EDIT0:
From the beginning, I wanted to say that the 1st argument passed to the function doesn't make much sense. Then, there are problems regarding the 2nd one as well. According to the manual ([AMT-I]: TECHNICAL INFORMATION about RACCD32a.DLL (emphasis is mine)):
ADriveInfo, ACall and AInfo are pointers to zero-terminated strings. These
strings has to exist at the moment of calling xGetCallInfo. The calling
program is responsible for creating them. AInfo must be long enough to
comfort xGetCallInfo (at least 400 characters).
Note: "Length of AInfo" refers to the length of the string AInfo points at.
ADriveInfo and ACall are treated in the same manner for short.
In ADriveInfo the procedure expects the path to the CD ROM drive. Use
"G:\"
if "G:" designates the CD ROM drive with the callbook CD ROM.
Keep in mind that this information is a *must* and the calling program
has to know it.
Note: If the active directory on drive G: is not the root, ADriveInfo = "G:"
will lead to an error 3. So always use "G:\".
The calling program has to ensure that the length of ADriveInfo does not
exceed 80 characters.
ACall contains the call you are looking for, all letters in lower case,
no additional spaces etc. The calling program has to ensure that ACall is
not longer than 15 characters. However, there is no call longer than 6
characters in the database.
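Based on that description, the first two arguments would look more like this (the drive letter is hypothetical, and the call sign is lower-cased as the manual requires):
ADriveInfo = b"G:\\"    # drive holding the callbook data, per the manual
ACall = b"dg1atn"       # all letters lower case, no extra spaces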
I have implemented a FUSE filesystem, and now I'm trying to optimize performance, as the wait time of directories with many thousands of files runs into seconds.
Some logging has shown me that after I return the list of file names, getattr is then called once for each file (10,000 times for my test directory of 10,000 files), with the result that a 5 second wait time is quadrupled into a 20 second wait time.
The version of FUSE I'm using [1] appears to support returning a list of tuples (name, stat, offset) as well as a list of names, so I tried returning
(
    '<file_name>',
    <stat with st_atime, st_mtime, st_mode, st_uid, st_gid, st_size>,
    0,
)
for each file, but getattr is still being called once for each file.
Anybody know what I am missing in the stat that is causing the O/S to still call getattr, or is there nothing I can do to change this behavior?
[1]
Copyright (c) 2008 Giorgos Verigakis
__version__ = '1.1'
I've found the number of times getattr is called is beyond your control. Depending on your implementation, you can improve performance by caching the result within getattr.
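A minimal sketch of that caching idea, assuming a fusepy-style Operations subclass whose getattr returns a dict of st_* fields; the _list_backend helper and the class name are illustrative, not from the question's code:
import errno

class CachedOperations(object):  # in real code, subclass the binding's Operations class
    def __init__(self):
        self.attr_cache = {}  # full path -> dict of st_* fields

    def readdir(self, path, fh=None):
        names = ['.', '..']
        for name, st in self._list_backend(path):  # hypothetical backend call
            self.attr_cache[path.rstrip('/') + '/' + name] = st
            names.append(name)
        return names

    def getattr(self, path, fh=None):
        st = self.attr_cache.get(path)
        if st is None:
            raise OSError(errno.ENOENT, '')  # or fall back to a real lookup here
        return st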
I'm opening a 3 GB file in Python to read strings. I then store this data in a dictionary. My next goal is to build a graph using this dictionary so I'm closely monitoring memory usage.
It seems to me that Python loads the whole 3 GB file into memory and I can't get rid of it. My code looks like this:
with open(filename) as data:
    accounts = dict()
    for line in data:
        username = line.split()[1]
        IP = line.split()[0]
        try:
            accounts[username].add(IP)
        except KeyError:
            accounts[username] = set()
            accounts[username].add(IP)

print "The accounts will be deleted from memory in 5 seconds"
time.sleep(5)
accounts.clear()
print "The accounts have been deleted from memory"
time.sleep(5)
print "End of script"
The last lines are there so that I could monitor memory usage.
The script uses a bit more than 3 GB in memory. Clearing the dictionary frees around 300 MB. When the script ends, the rest of the memory is freed.
I'm using Ubuntu and I've monitored memory usage using both "System Monitor" and the "free" command in terminal.
What I don't understand is why Python needs so much memory after I've cleared the dictionary. Is the file still stored in memory? If so, how can I get rid of it? Is it a problem with my OS not seeing freed memory?
EDIT: I've tried to force a gc.collect() after clearing the dictionary, to no avail.
EDIT2: I'm running Python 2.7.3 on Ubuntu 12.04 LTS.
EDIT3: I realize I forgot to mention something quite important. My real problem is not that my OS does not "get back" the memory used by Python. It's that later on, Python does not seem to reuse that memory (it just asks the OS for more memory).
This really does make no sense to me either, and I wanted to figure out how/why this happens. (I thought that's how this should work too!) I replicated it on my machine - though with a smaller file.
I saw two discrete problems here:
why is Python reading the file into memory (with lazy line reading, it shouldn't - right?)
why isn't Python freeing up memory to the system
I'm not knowledgeable at all on the Python internals, so I just did a lot of web searching. All of this could be completely off the mark. (I barely develop anymore, having been on the biz side of tech for the past few years.)
Lazy line reading...
I looked around and found this post -
http://www.peterbe.com/plog/blogitem-040312-1
It's from a much earlier version of Python, but this line resonated with me:
readlines() reads in the whole file at once and splits it by line.
Then I saw this, also old, effbot post:
http://effbot.org/zone/readline-performance.htm
The key takeaway was this:
For example, if you have enough memory, you can slurp the entire file into memory, using the readlines method.
and this:
In Python 2.2 and later, you can loop over the file object itself. This works pretty much like readlines(N) under the covers, but looks much better
Looking at Python's docs for xreadlines [ http://docs.python.org/library/stdtypes.html?highlight=readline#file.xreadlines ]:
This method returns the same thing as iter(f)
Deprecated since version 2.3: Use for line in file instead.
It made me think that perhaps some slurping is going on.
So if we look at readlines [ http://docs.python.org/library/stdtypes.html?highlight=readline#file.readlines ]...
Read until EOF using readline() and return a list containing the lines thus read.
And it sort of seems like that's what's happening here.
readline, however, looked like what we wanted [ http://docs.python.org/library/stdtypes.html?highlight=readline#file.readline ]:
Read one entire line from the file
So I tried switching this to readline, and the process never grew above 40MB (it was growing to 200MB, the size of the log file, before):
accounts = dict()
data = open(filename)
for line in data.readline():
    info = line.split("LOG:")
    if len(info) == 2:
        (a, b) = info
        try:
            accounts[a].add(True)
        except KeyError:
            accounts[a] = set()
            accounts[a].add(True)
My guess is that we're not really lazy-reading the file with the for x in data construct -- although all the docs and StackOverflow comments suggest that we are. readline() consumed significantly less memory for me, and readlines consumed approximately the same amount of memory as for line in data.
Freeing memory
In terms of freeing up memory, I'm not very familiar with Python's internals, but I recall from when I worked with mod_perl: if I opened up a file that was 500MB, that Apache child grew to that size. If I freed up the memory, it would only be free within that child -- garbage-collected memory was never returned to the OS until the process exited.
So I poked around on that idea, and found a few links that suggest this might be happening:
http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm
If you create a large object and delete it again, Python has probably released the memory, but the memory allocators involved don’t necessarily return the memory to the operating system, so it may look as if the Python process uses a lot more virtual memory than it actually uses.
That was sort of old, and I found a bunch of random (accepted) patches into Python afterwards which suggested the behavior was changed and that you could now return memory to the OS (as of 2005, when most of those patches were submitted and apparently approved).
Then I found this posting http://objectmix.com/python/17293-python-memory-handling.html -- and note the comment #4:
"""- Patch #1123430: Python's small-object allocator now returns an arena to the system free() when all memory within an arena becomes unused again. Prior to Python 2.5, arenas (256KB chunks of memory) were never freed. Some applications will see a drop in virtual memory size now, especially long-running applications that, from time to time, temporarily use a large number of small objects. Note that when Python returns an arena to the platform C's free(), there's no guarantee that the platform C library will in turn return that memory to the operating system. The effect of the patch is to stop making that impossible, and in tests it appears to be effective at least on Microsoft C and gcc-based systems. Thanks to Evan Jones for hard work and patience.
So with 2.4 under linux (as you tested) you will indeed not always get
the used memory back, with respect to lots of small objects being
collected.
The difference therefore (I think) you see between doing an f.read() and
an f.readlines() is that the former reads in the whole file as one large
string object (i.e. not a small object), while the latter returns a list
of lines where each line is a python object.
If the 'for line in data:' construct is essentially wrapping readlines and not readline, maybe this has something to do with it? Perhaps it's not a problem of having a single 3GB object, but instead having millions of 30k objects.
Which version of Python are you trying this on?
I did a test on Python 2.7/Win7, and it worked as expected, the memory was released.
Here I generate sample data like yours:
import random
fn = random.randint
with open('ips.txt', 'w') as f:
    for i in xrange(9000000):
        f.write('{0}.{1}.{2}.{3} username-{4}\n'.format(
            fn(0,255),
            fn(0,255),
            fn(0,255),
            fn(0,255),
            fn(0, 9000000),
        ))
And then your script. I replaced dict by defaultdict because throwing exceptions makes the code slower:
import time
from collections import defaultdict

def read_file(filename):
    with open(filename) as data:
        accounts = defaultdict(set)
        for line in data:
            IP, username = line.split()[:2]
            accounts[username].add(IP)

    print "The accounts will be deleted from memory in 5 seconds"
    time.sleep(5)
    accounts.clear()
    print "The accounts have been deleted from memory"
    time.sleep(5)
    print "End of script"

if __name__ == '__main__':
    read_file('ips.txt')
As you can see, memory reached 1.4 GB and was then released, leaving 36 MB.
Using your original script I got the same results, but a bit slower.
There is a difference between when Python releases memory for reuse by Python and when it releases memory back to the OS. Python has internal pools for some kinds of objects, and it will reuse these itself but doesn't give them back to the OS.
The gc module may be useful, particularly the collect function. I have never used it myself, but from the documentation, it looks like it may be useful. I would try running gc.collect() before you run accounts.clear().
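A minimal sketch of that suggestion (the question's edit notes that collecting after clear() did not help, so this may well make no difference):
import gc

gc.collect()       # run the collector before dropping the dictionary, as suggested
accounts.clear()
gc.collect()       # and once more afterwards, which the question already tried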
Most of this is background, skip the next 3 paragraphs for the question:
I have developed a tool that calls some installers, changes registry items, and moves files around to help me test a product which has a fairly fast update cycle. So far so good, I have a GUI which runs in a separate process to the business logic to prevent it locking due to the GIL, everything works etc, however I have concerns with a section of my code where I make calls to msiexec.
Specifically it's the uninstall part which gives me concerns. Currently the GUID does not change so I am able to uninstall the product using an os.system('msiexec /x "{GUID}" /passive') sort of thing. It's actually a bit more complicated as I'm using subprocess.Popen and polling it until it finished from within an event loop to allow for concurrency with other steps.
My concern is that should the GUID change, obviously this will not work. I don't want to point msiexec directly at the installation source, as this would mean that it wouldn't work if I were to 'lose' the msi file, which I store in a temporary directory.
What I am looking for, is a way of querying by program name to get the GUID, or even a wrapper for msiexec that would do all of this, including the uninstall, for me. I thought of scanning through the registry, but the _winreg module seems very slow, so I'd prefer to avoid this if at all possible. If there's a better way to scan the registry, I'm all ears, as this would speed up other parts of the tool also.
Update0
Performance on this is critical as one of the design goals is to make the process which the tool follows faster than any other method, manual or otherwise, in order to gain wholesale adoption.
Update1
I have tried a slight variation of the registry version below; however, it consistently returns None. I'm not quite sure how this is happening - it seems like it is failing to open the appropriate key, as I have inserted a breakpoint after the with statement which is never reached...
def get_guid_by_name(name):
    from _winreg import (OpenKey,
                         QueryInfoKey,
                         EnumKey,
                         QueryValueEx,
                         HKEY_LOCAL_MACHINE,
                         )
    with OpenKey(HKEY_LOCAL_MACHINE,
                 r'SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall') as key:
        subkeys, _0, _1 = QueryInfoKey(key)  # The breakpoint here is never reached
        del _0, _1
        for i in range(subkeys):
            subkey = EnumKey(key, i)
            if subkey[0] != '{' or subkey[-1] != '}':
                continue
            with OpenKey(key, subkey) as _subkey:
                if name in QueryValueEx(_subkey, 'DisplayName')[0]:
                    return subkey
    return None

print get_guid_by_name('Microsoft Visual Studio')
Update2
Strike that - I'm a fool who doesn't check his indentation thoroughly enough - print get_guid_by_name('Microsoft Visual Studio') was within get_guid_by_name...
I'm not sure about the _winreg module being all that slow. I suppose if you were trying to enumerate the entire registry to find all instances of a string that might take a while, but with a decently targeted query it seems reasonably fast.
Here's an example:
from _winreg import *

def get_guid_by_name(name):
    # Open the uninstaller key
    with OpenKey(HKEY_LOCAL_MACHINE, r'Software\Microsoft\Windows\CurrentVersion\Uninstall') as key:
        # We only care about subkeys of the installer key
        subkeys, _, _ = QueryInfoKey(key)
        for i in range(subkeys):
            subkey = EnumKey(key, i)
            # Since we're looking for uninstallers for MSI products,
            # the key name will always be the GUID. We assume that any
            # key starting with '{' and ending with '}' is a GUID, but
            # if not the name won't match.
            if subkey[0] != '{' or subkey[-1] != '}':
                continue
            # Query the display name or other property of the key to
            # see if it's the one we want
            with OpenKey(key, subkey) as _subkey:
                if QueryValueEx(_subkey, 'DisplayName')[0] == name:
                    return subkey
    return None
On my machine, querying for ActiveState's Komodo Edit (I actually used a regular expression rather than a straight value comparison), 1000 iterations of this took 8.18 seconds (timed using timeit), which seems like a negligible amount of time to me. Better yet, you can pull the UninstallString value from the registry and pass that straight to your subprocess (though you may want to add the /passive switch to the end).
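A hedged sketch of that UninstallString route (the value name follows the standard Uninstall key layout; not tested against any particular product):
import subprocess
from _winreg import OpenKey, QueryValueEx, HKEY_LOCAL_MACHINE

def uninstall_by_guid(guid):
    key_path = r'Software\Microsoft\Windows\CurrentVersion\Uninstall' + '\\' + guid
    with OpenKey(HKEY_LOCAL_MACHINE, key_path) as key:
        cmd = QueryValueEx(key, 'UninstallString')[0]
    # For MSI products this is typically something like 'MsiExec.exe /X{GUID}'
    return subprocess.call(cmd + ' /passive')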
Edit
Microsoft does, of course, provide a WMI class (Win32_Product) that provides a rather convenient interface to do all of this. Using Tim Golden's excellent WMI wrapper, one could initiate an uninstall like this:
import wmi
c = wmi.WMI()
c.Win32_Product(Name = 'ProductName')[0].Uninstall()
However, as noted in this blog post, the Win32_Product class is extremely, painfully slow to use.