How do I get the actual filesize on disk in python? (the actual size it takes on the harddrive).
UNIX only:
import os
from collections import namedtuple
_ntuple_diskusage = namedtuple('usage', 'total used free')
def disk_usage(path):
    """Return disk usage statistics about the given path.
    The returned value is a named tuple with attributes 'total', 'used' and
    'free', which are the amount of total, used and free space, in bytes.
    """
    st = os.statvfs(path)
    free = st.f_bavail * st.f_frsize
    total = st.f_blocks * st.f_frsize
    used = (st.f_blocks - st.f_bfree) * st.f_frsize
    return _ntuple_diskusage(total, used, free)
Usage:
>>> disk_usage('/')
usage(total=21378641920, used=7650934784, free=12641718272)
>>>
Edit 1 - also for Windows: https://code.activestate.com/recipes/577972-disk-usage/?in=user-4178764
Edit 2 - this is also available in Python 3.3+: https://docs.python.org/3/library/shutil.html#shutil.disk_usage
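For the Python 3.3+ route, a minimal sketch might look like the following. The helper name describe_disk_usage and the choice of the current directory as the path are assumptions made only for illustration; any existing path should work.

```python
import shutil

def describe_disk_usage(path="."):
    """Return (total, used, free) in bytes for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple with total, used and free
    return usage.total, usage.used, usage.free

total, used, free = describe_disk_usage(".")
print("total={} used={} free={}".format(total, used, free))
```

Note that this reports statistics for the whole filesystem that contains the path, not the space taken by a single file.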
Here is the correct way to get a file's size on disk, on platforms where st_blocks is set:
import os
def size_on_disk(path):
    st = os.stat(path)
    return st.st_blocks * 512
Other answers that say to multiply by os.stat(path).st_blksize or os.statvfs(path).f_bsize are simply incorrect.
The Python documentation for os.stat_result.st_blocks very clearly states:
st_blocks
Number of 512-byte blocks allocated for file. This may be smaller than st_size/512 when the file has holes.
Furthermore, the stat(2) man page says the same thing:
blkcnt_t st_blocks; /* Number of 512B blocks allocated */
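To see the difference between the logical size and the allocated size, the following sketch writes a single byte far into a temporary file and compares st_size with st_blocks * 512. The helper name size_report and the 10 MiB offset are assumptions for illustration. Whether the gap is actually stored as a hole depends on the filesystem, so no particular allocated size is claimed here.

```python
import os
import tempfile

def size_report(path):
    """Return (logical_size, allocated_size) in bytes; allocated is None if unknown."""
    st = os.stat(path)
    blocks = getattr(st, "st_blocks", None)  # not set on every platform (e.g. Windows)
    allocated = blocks * 512 if blocks is not None else None
    return st.st_size, allocated

# Seek 10 MiB ahead and write one byte, which may leave a hole in the file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.seek(10 * 1024 * 1024)
    tmp.write(b"x")
    demo_path = tmp.name

logical, allocated = size_report(demo_path)
print("logical:", logical, "allocated:", allocated)
os.remove(demo_path)
```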
Update 2021-03-26: Previously, my answer rounded the logical size of the file up to an integer multiple of the block size. That approach only works if the file is stored in a contiguous sequence of blocks on disk (or if all the blocks are full except for one). Since this is a special case (though common for small files), I have updated my answer to make it more generally correct. Note, however, that the statvfs method and the st_blocks value may not be available on some systems (e.g., Windows 10).
Call os.stat(filename).st_blocks to get the number of blocks in the file.
Call os.statvfs(filename).f_bsize to get the filesystem block size.
Then compute the correct size on disk, as follows:
num_blocks = os.stat(filename).st_blocks
block_size = os.statvfs(filename).f_bsize
sizeOnDisk = num_blocks*block_size
st = os.stat(…)
du = st.st_blocks * st.st_blksize
Practically 12 years and no answer on how to do this on Windows...
Here's how to find the 'Size on disk' on Windows via ctypes:
import ctypes
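def GetSizeOnDisk(path):
    '''https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-getcompressedfilesizew'''
    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    kernel32.GetCompressedFileSizeW.restype = ctypes.c_ulong  # DWORD, not a signed int
    filesizehigh = ctypes.c_ulong(0)  # receives the high-order 32 bits (files > 4 GB)
    low = kernel32.GetCompressedFileSizeW(ctypes.c_wchar_p(path), ctypes.byref(filesizehigh))
    if low == 0xFFFFFFFF and ctypes.get_last_error() != 0:
        raise ctypes.WinError(ctypes.get_last_error())
    return (filesizehigh.value << 32) | low  # combine high and low halves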
'''
>>> os.stat(somecompressedorofflinefile).st_size
943141
>>> GetSizeOnDisk(somecompressedorofflinefile)
671744
>>>
'''
I'm not certain if this is size on disk, or the logical size:
import os
filename = "/home/tzhx/stuff.wev"
size = os.path.getsize(filename)
If it's not the droid you're looking for, you can round it up by dividing by the cluster size (as a float), then using ceil, then multiplying by the cluster size again.
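The rounding described above could be sketched as follows. The function name and the 4096-byte default cluster size are assumptions; the real cluster size depends on the filesystem. The sketch uses integer ceiling division rather than float division plus ceil, which gives the same result without float precision concerns for very large files.

```python
def round_up_to_cluster(logical_size, cluster_size=4096):
    """Round logical_size up to the next multiple of cluster_size (both in bytes)."""
    if cluster_size <= 0:
        raise ValueError("cluster_size must be positive")
    clusters = -(-logical_size // cluster_size)  # ceiling division on integers
    return clusters * cluster_size

print(round_up_to_cluster(4097))
```

This is only an estimate: it ignores sparse files, compression, and any filesystem-specific packing of small files.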
To get the disk usage for a given file/folder, you can do the following:
import os
def disk_usage(path):
    """Return cumulative number of bytes for a given path."""
    # get total usage of current path
    total = os.path.getsize(path)
    # if path is dir, collect children
    if os.path.isdir(path):
        for file_name in os.listdir(path):
            child = os.path.join(path, file_name)
            # recursively get byte use for children
            total += disk_usage(child)
    return total
The function recursively collects byte usage for files nested within a given path, and returns the cumulative use for the entire path.
You could also add print("{path}: {bytes}".format(path=path, bytes=total)) in there if you want the information for each file to be printed.
I have two Byte objects.
One comes from using the Wave module to read a "chunk" of data:
def get_wave_from_file(filename):
    import wave
    original_wave = wave.open(filename, 'rb')
    return original_wave
The other uses MIDI information and a Synthesizer module (fluidsynth)
def create_wave_from_midi_info(sound_font_path, notes):
    import fluidsynth
    import numpy as np
    s = []
    fl = fluidsynth.Synth()
    sfid = fl.sfload(sound_font_path) # Loads a soundfont
    fl.program_select(track=0, soundfontid=sfid, banknum=0, presetnum=0) # Selects the soundfont
    for n in notes:
        fl.noteon(0, n['midi_num'], n['velocity'])
        s = np.append(s, fl.get_samples(int(44100 * n['duration']))) # Gives the note the correct duration, based on a sample rate of 44.1 kHz
        fl.noteoff(0, n['midi_num'])
    fl.delete()
    samps = fluidsynth.raw_audio_string(s)
    return samps
The two files are of different length.
I want to combine the two waves, so that both are heard simultaneously.
Specifically, I would like to do this "one chunk at a time".
Here is my setup:
def get_a_chunk_from_each(wave_object, bytes_from_midi, chunk_size=1024, starting_sample=0):
    from_wav_data = wave_object.readframes(chunk_size)
    from_midi_data = bytes_from_midi[starting_sample:starting_sample + chunk_size]
    return from_wav_data, from_midi_data
Info about the return from get_a_chunk_from_each():
type(from_wav_data), type(from_midi_data)
len(from_wav_data), len(from_midi_data)
4096 1024
Firstly, I'm confused as to why the lengths are different (the one generated from wave_object.readframes(1024) is exactly 4 times longer than the one generated by manually slicing bytes_from_midi[0:1024]). This may be part of the reason I have been unsuccessful.
Secondly, I want to create the function which combines the two chunks. The following "pseudocode" illustrates what I want to happen:
def combine_chunks(chunk1, chunk2):
    mixed = chunk1 + chunk2
    # OR, probably more like:
    mixed = (chunk1 + chunk2) / 2
    # To prevent clipping?
    return mixed
It turns out there is a very, very simple solution.
I simply used the library audioop:
https://docs.python.org/3/library/audioop.html
and used their "add" function ("width" is the sample width in bytes. Since this is 16 bit audio, that's 16 / 8 = 2 bytes):
audioop.add(chunk1, chunk2, 2)  # width is passed positionally; audioop does not accept it as a keyword
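Since audioop was deprecated in Python 3.11 and removed in 3.13, here is a rough standard-library sketch of the same idea for 16-bit audio. The function name mix_16bit is hypothetical. It assumes both chunks hold signed 16-bit little-endian samples with the same channel layout; unlike audioop.add, it pads the shorter chunk with silence instead of requiring equal lengths. It also helps to remember that readframes(n) counts frames rather than bytes, so 1024 frames of 16-bit stereo audio would occupy 1024 × 2 × 2 = 4096 bytes, which may explain the length mismatch in the question if the WAV file is stereo.

```python
import array
import struct
import sys

def mix_16bit(chunk1, chunk2):
    """Add two chunks of signed 16-bit little-endian samples, clipping on overflow."""
    a = array.array("h")
    a.frombytes(chunk1)
    b = array.array("h")
    b.frombytes(chunk2)
    if sys.byteorder == "big":  # array uses native order; the data is little-endian
        a.byteswap()
        b.byteswap()
    # Pad the shorter chunk with silence so both have the same number of samples.
    length = max(len(a), len(b))
    a.extend([0] * (length - len(a)))
    b.extend([0] * (length - len(b)))
    # Sum sample by sample and clamp to the 16-bit range to avoid wrap-around.
    mixed = array.array("h", (max(-32768, min(32767, x + y)) for x, y in zip(a, b)))
    if sys.byteorder == "big":
        mixed.byteswap()
    return mixed.tobytes()

first = struct.pack("<3h", 1000, -2000, 30000)
second = struct.pack("<2h", 500, 500)
print(struct.unpack("<3h", mix_16bit(first, second)))
```

Clamping keeps loud passages from wrapping around; averaging the two samples instead, as in the question's pseudocode, would avoid clipping at the cost of halving the volume.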
I am attempting to read a binary file using Python. Someone else has read in the data with R using the following code:
x <- readBin(webpage, numeric(), n=6e8, size = 4, endian = "little")
myPoints <- data.frame("tmax" = x[1:(length(x)/4)],
"nmax" = x[(length(x)/4 + 1):(2*(length(x)/4))],
"tmin" = x[(2*length(x)/4 + 1):(3*(length(x)/4))],
"nmin" = x[(3*length(x)/4 + 1):(length(x))])
With Python, I am trying the following code:
import struct
with open('file','rb') as f:
    val = f.read(16)
    while val != b'':  # read() returns bytes, so compare against b''
        print(struct.unpack('4f', val))
        val = f.read(16)
I am coming to slightly different results. For example, the first row in R returns 4 columns as -999.9, 0, -999.0, 0. Python returns -999.0 for all four columns (images below).
Python output:
R output:
I know that they are slicing by the length of the file with some of the [] code, but I do not know how exactly to do this in Python, nor do I understand quite why they do this. Basically, I want to recreate what R is doing in Python.
I can provide more of either code base if needed. I did not want to overwhelm with code that was not necessary.
Deducing from the R code, the binary file first contains a certain number of tmax's, then the same number of nmax's, then tmin's and nmin's. The code reads the entire file and then chops it up into the 4 parts (tmax's, nmax's, etc.) using slicing.
To do the same in python:
import struct
# Read entire file into memory first. This is done so we can count
# number of bytes before parsing the bytes. It is not a very memory
# efficient way, but it's the easiest. The R-code as posted wastes even
# more memory: it always takes 6e8 * 4 bytes (~ 2.2Gb) of memory no
# matter how small the file may be.
#
data = open('data.bin','rb').read()
# Calculate number of points in the file. This is
# file-size / 16, because there are 4 numeric()'s per
# point, and they are 4 bytes each.
#
num = int(len(data) / 16)
# Now we know how much there are, we take all tmax numbers first, then
# all nmax's, tmin's and lastly all nmin's.
# First generate a format string because it depends on the number points
# there are in the file. It will look like: "fffff"
#
format_string = 'f' * num
# Then, for cleaner code, calculate chunk size of the bytes we need to
# slice off each time.
#
n = num * 4 # 4-byte floats
# Note that python has different interpretation of slicing indices
# than R, so no "+1" is needed here as it is in the R code.
#
tmax = struct.unpack(format_string, data[:n])
nmax = struct.unpack(format_string, data[n:2*n])
tmin = struct.unpack(format_string, data[2*n:3*n])
nmin = struct.unpack(format_string, data[3*n:])
print("tmax", tmax)
print("nmax", nmax)
print("tmin", tmin)
print("nmin", nmin)
If the goal is to have this data structured as a list of points(?) like (tmax,nmax,tmin,nmin), then append this to the code:
print()
print("Points:")
# Combine ("zip") all 4 lists into a list of (tmax,nmax,tmin,nmin) points.
# Python has a function to do this at once: zip()
#
for i, point in enumerate(zip(tmax, nmax, tmin, nmin)):
    print(i, ":", point)
Here's a less memory-hungry way to do the same. It is possibly a bit faster too (but that is difficult for me to check).
My computer did not have sufficient memory to run the first program with those huge files. This one does run, but I still needed to create a list of only the tmax's first (the first 1/4 of the file), then save it, and then delete the list in order to have enough memory for the nmax's, tmin's and nmin's.
But this one too says the nmin's inside the 2018 file are all -999.0. If that doesn't make sense, could you check what the R-code makes of it then? I suspect that it is just what's in the file. The other possibility is of course, that I got it all wrong (which I doubt). However, I tried the 2017 file too, and that one does not have such problem: all of tmax, nmax, tmin, nmin have around 37% -999.0 's.
Anyway, here's the second code:
import os
import struct
# load_data()
# data_store : object to append() data items (floats) to
# num : number of floats to read and store
# datafile : opened binary file object to read float data from
#
def load_data(data_store, num, datafile):
    for i in range(num):
        data = datafile.read(4) # process one float (=4 bytes) at a time
        item = struct.unpack("<f", data)[0] # '<' means little endian
        data_store.append(item)
# save_list() saves a list of float's as strings to a file
#
def save_list(filename, datalist):
    output = open(filename, "wt")
    for item in datalist:
        output.write(str(item) + '\n')
    output.close()
#### MAIN ####
datafile = open('data.bin','rb')
# Get file size so we can calculate number of points without reading
# the (large) file entirely into memory.
#
file_info = os.stat(datafile.fileno())
# Calculate number of points, i.e. number of each tmax's, nmax's,
# tmin's, nmin's. A point is 4 floats of 4 bytes each, hence number
# of points = file-size / (4*4)
#
num = int(file_info.st_size / 16)
tmax_list = list()
load_data(tmax_list, num, datafile)
save_list("tmax.txt", tmax_list)
del tmax_list # huge list, save memory
nmax_list = list()
load_data(nmax_list, num, datafile)
save_list("nmax.txt", nmax_list)
del nmax_list # huge list, save memory
tmin_list = list()
load_data(tmin_list, num, datafile)
save_list("tmin.txt", tmin_list)
del tmin_list # huge list, save memory
nmin_list = list()
load_data(nmin_list, num, datafile)
save_list("nmin.txt", nmin_list)
del nmin_list # huge list, save memory
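As a further variation, the array module can read each quarter of the file in one call, which avoids both the per-float struct.unpack loop and the large intermediate Python lists. This is only a sketch under the same layout assumption as above (four equal blocks of little-endian 4-byte floats); the function name read_columns is hypothetical.

```python
import array
import sys

def read_columns(path, names=("tmax", "nmax", "tmin", "nmin")):
    """Read equal-length blocks of little-endian float32 values, one block per name."""
    columns = {}
    with open(path, "rb") as f:
        f.seek(0, 2)                           # jump to the end to learn the file size
        count = f.tell() // (4 * len(names))   # number of values in each block
        f.seek(0)
        for name in names:
            values = array.array("f")          # compact storage, 4 bytes per value
            values.fromfile(f, count)
            if sys.byteorder == "big":         # the file is little-endian
                values.byteswap()
            columns[name] = values
    return columns
```

Each returned column is an array of floats that can be indexed or iterated like a list, for example zip(columns["tmax"], columns["nmax"], columns["tmin"], columns["nmin"]) to rebuild the points.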
I'm working on the "Longest Absolute filepath" problem on LeetCode. This is a simple problem that asks "What is the length of the longest absolute file path in a given directory". And my working solution is as follows. The file directory is given as a string.
def lengthLongestPath(self, input):
    """
    :type input: str, the file directory
    :rtype: int
    """
    current_folder_path = [""] * 40
    longest_file_path_size = 0
    for item in input.split("\n"):
        num_tabs = item.count("\t")
        print num_tabs
        if "." not in item:
            current_folder_path[num_tabs] = item.lstrip("\t")
        else:
            absolute_file_path = "/".join(current_folder_path[:num_tabs] + [item.lstrip("\t")])
            print item
            print num_tabs, absolute_file_path, current_folder_path
            longest_file_path_size = max(len(absolute_file_path), longest_file_path_size)
    return longest_file_path_size
This works. However, the line current_folder_path = [""] * 40 is very inelegant. It exists only to remember the current file path, and I wonder if there is a way to remove it.
The problem statement does not address some fine points. It is very unclear what path may correspond to the string
a\n\t\tb
Is it a//b or plain illegal? If the former, do we need to normalize it?
I guess it is safe to assume that such paths are illegal. In other words, the path depth only grows by 1, and the current_folder_path in fact functions like a stack. You don't need to preinitialize it, but just push the name when num_tabs exceeds its size, and pop as necessary.
As a side note, since join is linear in the current accumulated length, the entire algorithm seems quadratic, which violates the time complexity requirement.
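A sketch of the stack idea, under the assumption stated above that the depth never grows by more than 1 between consecutive lines. The function name is hypothetical. Instead of storing names and joining them, the stack stores the length of each directory prefix (including its trailing slash), which also sidesteps the repeated join:

```python
def length_longest_path(listing):
    """Return the length of the longest absolute file path in a tab-indented listing."""
    prefix_lengths = []  # prefix_lengths[d] = length of the path down to depth d, plus "/"
    longest = 0
    for line in listing.split("\n"):
        name = line.lstrip("\t")
        depth = len(line) - len(name)      # number of leading tabs
        del prefix_lengths[depth:]         # pop everything at this depth or deeper
        parent_length = prefix_lengths[-1] if prefix_lengths else 0
        if "." in name:                    # same file test as the original code
            longest = max(longest, parent_length + len(name))
        else:
            prefix_lengths.append(parent_length + len(name) + 1)  # +1 for the "/"
    return longest

print(length_longest_path("dir\n\tsubdir1\n\tsubdir2\n\t\tfile.ext"))
```

Each line is pushed at most once and popped at most once, so the work per line no longer depends on the accumulated path length.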
I made a pickle file, storing a grayscale value of each pixel in 100,000 80x80 sized images.
(Plus an array of 100,000 integers whose values are one-digit).
My approximation for the total size of the pickle is,
4 bytes x 80 x 80 x 100000 = 2.56 GB
plus the array of integers, which shouldn't be that large.
The generated pickle file however is over 16GB, so it's taking hours just to unpickle it and load it, and it eventually freezes, after it takes full memory resources.
Is there something wrong with my calculation or is it the way I pickled it?
I pickled the file in the following way.
from PIL import Image
import pickle
import os
import numpy
import time
trainpixels = numpy.empty([80000,6400])
trainlabels = numpy.empty(80000)
validpixels = numpy.empty([10000,6400])
validlabels = numpy.empty(10000)
testpixels = numpy.empty([10408,6400])
testlabels = numpy.empty(10408)
i=0
tr=0
va=0
te=0
for (root, dirs, filenames) in os.walk(indir1):
    print 'hello'
    for f in filenames:
        try:
            im = Image.open(os.path.join(root,f))
            Imv=im.load()
            x,y=im.size
            pixelv = numpy.empty(6400)
            ind=0
            for ii in range(x):
                for j in range(y):
                    temp=float(Imv[j,ii])
                    temp=float(temp/255.0)
                    pixelv[ind]=temp
                    ind+=1
            if i<40000:
                trainpixels[tr]=pixelv
                tr+=1
            elif i<45000:
                validpixels[va]=pixelv
                va+=1
            else:
                testpixels[te]=pixelv
                te+=1
            print str(i)+'\t'+str(f)
            i+=1
        except IOError:
            continue
trainimage=(trainpixels,trainlabels)
validimage=(validpixels,validlabels)
testimage=(testpixels,testlabels)
output=open('data.pkl','wb')
pickle.dump(trainimage,output)
pickle.dump(validimage,output)
pickle.dump(testimage,output)
Please let me know if you see something wrong with either my calculation or my code!
Python Pickles are not a thrifty mechanism for storing data as you're storing objects instead of "just the data."
The following test case produces a 24 KB pickle file on my system, and this is for a small, sparsely populated numpy array:
import os
import sys
import numpy
import pickle
testlabels = numpy.empty(1000)
testlabels[0] = 1
testlabels[99] = 0
test_labels_size = sys.getsizeof(testlabels) #80
output = open('/tmp/pickle', 'wb')
test_labels_pickle = pickle.dump(testlabels, output)
print os.path.getsize('/tmp/pickle')
Further, I'm not sure why you believe 4 bytes to be the size of a number in Python: non-numpy ints are 24 bytes (sys.getsizeof(1)) and numpy arrays are a minimum of 80 bytes (sys.getsizeof(numpy.array([0], float))).
As you stated as a response to my comment, you have reasons for staying with Pickle, so I won't try to convince you further to not store objects, but be aware of the overhead of storing objects.
As an option: reduce the size of your training data/Pickle fewer objects.
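One way to get a feel for the overhead before writing a multi-gigabyte file is to pickle a small sample and measure it. The sketch below is an illustration under assumptions: a plain list of Python floats stands in for one image's pixel values, the helper name pickled_sizes is hypothetical, and it compares the text-based protocol 0 (the default in Python 2) with the highest binary protocol available.

```python
import pickle

def pickled_sizes(obj):
    """Return (protocol_0_size, highest_protocol_size) in bytes for obj."""
    text_size = len(pickle.dumps(obj, protocol=0))
    binary_size = len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))
    return text_size, binary_size

# One 80x80 image's worth of normalised pixel values, as plain Python floats.
pixels = [i / 255.0 for i in range(80 * 80)]
text_size, binary_size = pickled_sizes(pixels)
print("protocol 0:", text_size, "bytes; highest protocol:", binary_size, "bytes")
```

Multiplying the per-image figure by the number of images gives a rough estimate of the final file size for each protocol.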
Example dump from the list of a directory:
hello:3.1 GB
world:1.2 MB
foo:956.2 KB
The above list is in the format of FILE:VALUE UNIT. How would one go about ordering each line above according to file size?
I thought perhaps to parse each line for the unit via the pattern ":VALUE UNIT" (or somehow use the delimiter) then run it through the ConvertAll engine, receive the size off each value in bytes, hash it with the rest of the line (filenames), then order the resulting dictionary pairs via size.
Trouble is, I have no idea about pattern matching. But I see that you can sort a dictionary
If there is a better direction in which to solve this problem, please let me know.
EDIT:
The list that I had was actually in a file. Taking inspiration from answer of the (awesome) Alex Martelli, I've written up the following code that extracts from one file, orders it and writes to another.
#!/usr/bin/env python
sourceFile = open("SOURCE_FILE_HERE", "r")
allLines = sourceFile.readlines()
sourceFile.close()
print "Reading the entire file into a list."
cleanLines = []
for line in allLines:
    cleanLines.append(line.rstrip())
mult = dict(KB=2**10, MB=2**20, GB=2**30)
def getsize(aline):
    fn, size = aline.split(':', 1)
    value, unit = size.split(' ')
    multiplier = mult[unit]
    return float(value) * multiplier
print "Writing sorted list to file."
cleanLines.sort(key=getsize)
writeLines = open("WRITE_OUT_FILE_HERE",'a')
for line in cleanLines:
    writeLines.write(line+"\n")
writeLines.close()
thelines = ['hello:3.1 GB', 'world:1.2 MB', 'foo:956.2 KB']
mult = dict(KB=2**10, MB=2**20, GB=2**30)
def getsize(aline):
    fn, size = aline.split(':', 1)
    value, unit = size.split(' ')
    multiplier = mult[unit]
    return float(value) * multiplier
thelines.sort(key=getsize)
print thelines
emits ['foo:956.2 KB', 'world:1.2 MB', 'hello:3.1 GB'], as desired. You may have to add some entries to mult if KB, MB and GB don't exhaust your set of units of interest of course.
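Since the question mentions pattern matching, here is a sketch of the same sort key written with a regular expression. The names UNIT_MULTIPLIERS, LINE_PATTERN and size_in_bytes are made up for this example, and the extra B and TB units are assumptions about what else might appear in such a listing. The pattern splits on the last colon, so file names that themselves contain colons are tolerated.

```python
import re

# Binary multipliers, matching the 2**10 convention used above.
UNIT_MULTIPLIERS = {"B": 1, "KB": 2**10, "MB": 2**20, "GB": 2**30, "TB": 2**40}
LINE_PATTERN = re.compile(
    r"^(?P<name>.*):(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[A-Za-z]+)\s*$"
)

def size_in_bytes(line):
    """Parse 'FILE:VALUE UNIT' and return the size in bytes as a float."""
    match = LINE_PATTERN.match(line)
    if match is None:
        raise ValueError("unrecognised line: {!r}".format(line))
    unit = match.group("unit").upper()
    return float(match.group("value")) * UNIT_MULTIPLIERS[unit]

lines = ["hello:3.1 GB", "world:1.2 MB", "foo:956.2 KB"]
print(sorted(lines, key=size_in_bytes))
```

An unknown unit raises KeyError, which makes it obvious when the table of multipliers needs another entry.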