When running psutil.virtual_memory() I'm getting output like this:
>>> psutil.virtual_memory()
vmem(total=8374149120L, available=1247768576L)
But what unit of measurement are these values in? The documentation simply says it is the "total physical memory available", but nothing more. I'm trying to translate it into values the user can actually relate to (i.e. GB).
Thanks in advance.
Why not use the bit shift operator? If you want to display it in a human-readable way, just do this:
values = psutil.virtual_memory()

# display in MB
total = values.total >> 20

# display in GB
total = values.total >> 30   # 1024^3 = Byte to Gigabyte
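Note that shifting only yields whole units (it truncates toward zero). With the total from the question, for example:

total = 8374149120            # bytes, from the question's output
print(total >> 30)            # 7, whole GiB with the fraction discarded
print(total / (1024.0 ** 3))  # ~7.8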
So I think this works:
import psutil
memory = psutil.virtual_memory().total / (1024.0 ** 3)
print(memory)
The unit of measurement is bytes. You can use this code to convert it into GB.
When you use the value it will have a trailing "L" (a Python 2 long), but that doesn't affect the calculations.
import psutil

values = psutil.virtual_memory()

def get_human_readable_size(num):
    exp_str = [(0, 'B'), (10, 'KB'), (20, 'MB'), (30, 'GB'), (40, 'TB'), (50, 'PB')]
    i = 0
    while i + 1 < len(exp_str) and num >= (2 ** exp_str[i + 1][0]):
        i += 1
    rounded_val = round(float(num) / 2 ** exp_str[i][0], 2)
    return '%s %s' % (rounded_val, exp_str[i][1])

total_size = get_human_readable_size(values.total)
It is in bytes. To convert it to a more readable format, simply use bytes2human:
import psutil
from psutil._common import bytes2human
mem_usage = psutil.virtual_memory()
total_in_human_format = bytes2human(mem_usage[0])
print(total_in_human_format)
Output:
15.6G
Can't comment, so I'm using an answer.
Regarding "1024^3 = Byte to Gigabyte":
This is incorrect, strictly speaking. The SI prefix giga means 10^9, so a gigabyte is 1000^3 bytes; 1024^3 bytes is a gibibyte (GiB). You can forget the extra 24.
Therefore: 1000^3 = byte to gigabyte.
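For illustration, a minimal sketch showing both conventions side by side, using psutil's total attribute:

import psutil

total_bytes = psutil.virtual_memory().total
print(total_bytes / 1000.0 ** 3)  # gigabytes (GB, decimal SI prefix)
print(total_bytes / 1024.0 ** 3)  # gibibytes (GiB, binary prefix)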
If you don't mind a third-party dependency and you want to avoid magic numbers in your code, try pint.
This lets you work with units symbolically, and you can convert to whatever you want and get the "magnitude" at the end of your computation.
I like this because the code "self-documents" that the info from psutil is in bytes and then self-documents that we are converting to gigabytes.
import psutil
import pint
reg = pint.UnitRegistry()
vmem_info = psutil.virtual_memory()
total_gb = (vmem_info.total * reg.byte).to(reg.gigabyte).m
avail_gb = (vmem_info.available * reg.byte).to(reg.gigabyte).m
print('total_gb = {!r}'.format(total_gb))
print('avail_gb = {!r}'.format(avail_gb))
Assume this is a sample of my data: dataframe
The entire dataframe is stored in a CSV file (dataframe.csv) that is 40 GB, so I can't open all of it at once.
I am hoping to find the 25 most dominant names for all genders. My instinct is to create a for loop that runs through the file (because I can't open it all at once) and a Python dictionary that holds a counter for each name, which I increment as I go through the data (a rough sketch of that loop is at the end of this question).
To be honest, I'm confused about where to even start with this (how to create the dictionary, since to_dict() does not appear to do what I'm looking for). Also, is this even a good solution? Is there a more efficient approach someone can think of?
SUMMARY -- sorry if the question is a bit long:
The CSV file storing the data is very big and I can't open it all at once, but I'd like to find the top 25 dominant names in the data. Any ideas on what to do and how to do it?
I'd appreciate any help I can get! :)
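Here is the rough, untested sketch of the loop I have in mind, in case it clarifies the question (it assumes the columns are called 'First Name' and 'Gender', as in my sample):

import csv
from collections import Counter

counts = {}  # one Counter of names per gender

with open('dataframe.csv', newline='') as f:
    for row in csv.DictReader(f):
        counts.setdefault(row['Gender'], Counter())[row['First Name']] += 1

for gender, counter in counts.items():
    print(gender, counter.most_common(25))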
Thanks for your interesting task! I've implemented a pure numpy + pandas solution. It uses a sorted array to keep names and counts, so the algorithm should be around O(n log n) complexity.
I didn't find a hash table in numpy; a hash table would definitely be faster (O(n)). Hence I used numpy's existing sorting/inserting routines.
I also used pandas' .read_csv() with the iterator = True, chunksize = 1 << 24 params, which allows reading the file in chunks and producing a fixed-size pandas dataframe from each chunk.
Note! For the first runs (until the program is debugged), set limit_chunks (the number of chunks to process) in the code to a small value (like 5). This is to check that the whole program runs correctly on partial data.
The program needs the one-time command python -m pip install pandas numpy to install these 2 packages if you don't have them.
Progress is printed once in a while: total megabytes done plus speed.
The result will be printed to the console and also saved to the fname_res file; all the constants configuring the script are placed at the beginning of the script. The topk constant controls how many top names will be output to the file/console.
I'm curious how fast my solution is. If it is too slow, maybe I'll devote some time to writing a nice HashTable class using pure numpy.
You can also try running the code online here.
import os, math, time, sys
# Needs: python -m pip install pandas numpy
import pandas as pd, numpy as np

fname = 'test.csv'
fname_res = 'test.res'
chunk_size = 1 << 24
limit_chunks = None # Number of chunks to process, set to None to process the whole file
all_genders = ['Male', 'Female']
topk = 1000 # How many top names to output
progress_step = 1 << 23 # in bytes

fsize = os.path.getsize(fname)
#el_man = enlighten.get_manager() as el_man
#el_ctr = el_man.counter(color = 'green', total = math.ceil(fsize / 2 ** 20), unit = 'MiB', leave = False)

# Per gender, a sorted array of names ('vals') and a parallel array of counts ('cnts').
# chr(0x10FFFF) is a sentinel that sorts after any real name.
tables = {g : {
    'vals': np.full([1], chr(0x10FFFF), dtype = np.str_),
    'cnts': np.zeros([1], dtype = np.int64),
} for g in all_genders}

tb = time.time()

def Progress(
    done, total = min([fsize] + ([chunk_size * limit_chunks] if limit_chunks is not None else [])),
    cfg = {'progressed': 0, 'done': False},
):
    # Prints how many MiB have been processed so far, plus the current speed.
    if not cfg['done'] and (done - cfg['progressed'] >= progress_step or done >= total):
        if done < total:
            while cfg['progressed'] + progress_step <= done:
                cfg['progressed'] += progress_step
        else:
            cfg['progressed'] = total
        sys.stdout.write(
            f'{str(round(cfg["progressed"] / 2 ** 20)).rjust(5)} MiB of ' +
            f'{str(round(total / 2 ** 20)).rjust(5)} MiB ' +
            f'speed {round(cfg["progressed"] / 2 ** 20 / (time.time() - tb), 4)} MiB/sec\n'
        )
        sys.stdout.flush()
        if done >= total:
            cfg['done'] = True

with open(fname, 'rb', buffering = 1 << 26) as f:
    # Read the CSV in chunks so the whole file never has to fit in memory.
    for i, df in enumerate(pd.read_csv(f, iterator = True, chunksize = chunk_size)):
        if limit_chunks is not None and i >= limit_chunks:
            break
        if i == 0:
            name_col = df.columns.get_loc('First Name')
            gender_col = df.columns.get_loc('Gender')
        names = np.array(df.iloc[:, name_col]).astype('str')
        genders = np.array(df.iloc[:, gender_col]).astype('str')
        for g in all_genders:
            ctab = tables[g]
            gnames = names[genders == g]
            # Count this chunk's names, then merge them into the sorted table.
            vals, cnts = np.unique(gnames, return_counts = True)
            if vals.size == 0:
                continue
            if ctab['vals'].dtype.itemsize < names.dtype.itemsize:
                ctab['vals'] = ctab['vals'].astype(names.dtype)
            poss = np.searchsorted(ctab['vals'], vals)
            exist = ctab['vals'][poss] == vals
            ctab['cnts'][poss[exist]] += cnts[exist]
            nexist = np.flatnonzero(exist == False)
            ctab['vals'] = np.insert(ctab['vals'], poss[nexist], vals[nexist])
            ctab['cnts'] = np.insert(ctab['cnts'], poss[nexist], cnts[nexist])
        Progress(f.tell())
    Progress(fsize)

# Write the topk most frequent names per gender to the result file and the console.
with open(fname_res, 'w', encoding = 'utf-8') as f:
    for g in all_genders:
        f.write(f'{g}:\n\n')
        print(g, '\n')
        order = np.flip(np.argsort(tables[g]['cnts']))[:topk]
        snames, scnts = tables[g]['vals'][order], tables[g]['cnts'][order]
        if snames.size > 0:
            for n, c in zip(np.nditer(snames), np.nditer(scnts)):
                n, c = str(n), int(c)
                if c == 0:
                    continue
                f.write(f'{c} {n}\n')
                print(c, n.encode('ascii', 'replace').decode('ascii'))
        f.write(f'\n')
        print()
import pandas as pd
df = pd.read_csv("sample_data.csv")
print(df['First Name'].value_counts())
The second line will convert your CSV into a pandas dataframe, and the third line should print the occurrences of each name.
https://dfrieds.com/data-analysis/value-counts-python-pandas.html
This doesn't seem to be a case where pandas is really going to be an advantage. But if you're committed to going down that route, change the read_csv chunksize parameter, then filter out the useless columns.
Perhaps consider using a different set of tooling, such as a database, or even vanilla Python using a generator to populate a dict in the form of name: count.
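A rough, untested sketch of the chunked-pandas route (the column names 'First Name' and 'Gender' and the file name dataframe.csv are taken from the question; the chunk size is an arbitrary choice):

import pandas as pd
from collections import Counter

counts = {}  # per-gender Counter of first names

# Stream the file in manageable chunks instead of loading 40 GB at once.
for chunk in pd.read_csv('dataframe.csv', usecols=['First Name', 'Gender'], chunksize=1_000_000):
    for gender, names in chunk.groupby('Gender')['First Name']:
        counts.setdefault(gender, Counter()).update(names.value_counts().to_dict())

for gender, counter in counts.items():
    print(gender, counter.most_common(25))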
I have two bytes objects.
One comes from using the wave module to read a "chunk" of data:
def get_wave_from_file(filename):
    import wave
    original_wave = wave.open(filename, 'rb')
    return original_wave
The other uses MIDI information and a Synthesizer module (fluidsynth)
def create_wave_from_midi_info(sound_font_path, notes):
    import fluidsynth
    import numpy as np
    s = []
    fl = fluidsynth.Synth()
    sfid = fl.sfload(sound_font_path) # Loads a soundfont
    fl.program_select(track=0, soundfontid=sfid, banknum=0, presetnum=0) # Selects the soundfont
    for n in notes:
        fl.noteon(0, n['midi_num'], n['velocity'])
        s = np.append(s, fl.get_samples(int(44100 * n['duration']))) # Gives the note the correct duration, based on a sample rate of 44.1 kHz
        fl.noteoff(0, n['midi_num'])
    fl.delete()
    samps = fluidsynth.raw_audio_string(s)
    return samps
The two files are of different length.
I want to combine the two waves, so that both are heard simultaneously.
Specifically, I would like to do this "one chunk at a time".
Here is my setup:
def get_a_chunk_from_each(wave_object, bytes_from_midi, chunk_size=1024, starting_sample=0):
    from_wav_data = wave_object.readframes(chunk_size)
    from_midi_data = bytes_from_midi[starting_sample:starting_sample + chunk_size]
    return from_wav_data, from_midi_data
Info about the return from get_a_chunk_from_each():
type(from_wav_data), type(from_midi_data)
len(from_wav_data), len(from_midi_data)
4096 1024
Firstly, I'm confused as to why the lengths are different (the one generated from wave_object.readframes(1024) is exactly 4 times longer than the one generated by manually slicing bytes_from_midi[0:1024]). This may be part of the reason I have been unsuccessful.
Secondly, I want to create the function which combines the two chunks. The following "pseudocode" illustrates what I want to happen:
def combine_chunks(chunk1, chunk2):
    mixed = chunk1 + chunk2
    # OR, probably more like:
    mixed = (chunk1 + chunk2) / 2
    # To prevent clipping?
    return mixed
It turns out there is a very, very simple solution.
I simply used the library audioop:
https://docs.python.org/3/library/audioop.html
and used its "add" function (the third argument is the sample width in bytes; since this is 16-bit audio, that's 16 / 8 = 2 bytes):
audioop.add(chunk1, chunk2, 2)
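For reference, here is a rough sketch of how this might fit into the chunk loop; the zero-padding is my own addition, since audioop.add requires both fragments to have the same length:

import audioop

def combine_chunks(chunk1, chunk2, width=2):  # width=2 bytes for 16-bit samples
    # Pad the shorter chunk with silence so both byte strings are equally long.
    if len(chunk1) < len(chunk2):
        chunk1 += b'\x00' * (len(chunk2) - len(chunk1))
    elif len(chunk2) < len(chunk1):
        chunk2 += b'\x00' * (len(chunk1) - len(chunk2))
    return audioop.add(chunk1, chunk2, width)

As for the length difference in the question: readframes(1024) returns 1024 frames, and each frame is sample width times channel count bytes (4 bytes for 16-bit stereo), while bytes_from_midi[0:1024] is a slice of 1024 raw bytes, which is why the wav chunk comes out exactly 4 times longer.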
In Python, I have a huge list of floating point values (nearly 30 million values). I have to convert each of them to a 4-byte value in little-endian format and write all of them to a binary file in order.
For a list with some thousands or even 100k values, my code works fine. But as the data grows, it takes more time to process and write to the file. What optimization techniques can I use to write to the file more efficiently?
As suggested in this blog, I am replacing all the small writes to the file with a bytearray. But still, the performance is not satisfactory.
I have also tried multiprocessing (concurrent.futures.ProcessPoolExecutor()) to utilize all the cores in the system instead of a single CPU core, but it still takes longer to complete the execution.
Can anyone give me more suggestions on how to improve the performance (in terms of time and memory) for this problem?
Here is my code:
import struct

def process_value(value):
    # Reinterpret the float's bits as an unsigned int, then rebuild the 4 bytes by hand.
    hex_value = hex(struct.unpack('<I', struct.pack('<f', value))[0])
    if len(hex_value.split('x')[1]) < 8:
        hex_value = hex_value[:2] + ('0' * (8 - len(hex_value.split('x')[1]))) + hex_value[2:]
    dec1 = int(hex_value.split('x')[1][0] + hex_value.split('x')[1][1], 16)
    dec2 = int(hex_value.split('x')[1][2] + hex_value.split('x')[1][3], 16)
    dec3 = int(hex_value.split('x')[1][4] + hex_value.split('x')[1][5], 16)
    dec4 = int(hex_value.split('x')[1][6] + hex_value.split('x')[1][7], 16)
    msg = bytearray([dec4, dec3, dec2, dec1])
    return msg

def main_function(fp, values):
    msg = bytearray()
    for val in values:
        msg.extend(process_value(val))
    fp.write(msg)
You could try converting all the floats before writing them, and then write the resulting data in one go:
import struct
my_floats = [1.111, 1.222, 1.333, 1.444]
with open('floats.bin', 'wb') as f_output:
    f_output.write(struct.pack('<{}f'.format(len(my_floats)), *my_floats))
For the number of values you have, you might need to do this in large blocks:
import struct
def blocks(data, n):
    for i in xrange(0, len(data), n):
        yield data[i:i+n]

my_floats = [1.111, 1.222, 1.333, 1.444]

with open('floats.bin', 'wb') as f_output:
    for block in blocks(my_floats, 10000):
        f_output.write(struct.pack('<{}f'.format(len(block)), *block))
The output from struct.pack() is already in the correct binary format for writing directly to the file. The file must be opened in binary mode, i.e. with 'wb'.
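If pulling in numpy is acceptable (my suggestion, not part of the answer above), the conversion to little-endian 4-byte floats and the write can both happen in C:

import numpy as np

my_floats = [1.111, 1.222, 1.333, 1.444]
# '<f4' is a little-endian 4-byte float, the same layout as struct's '<f'
np.asarray(my_floats, dtype='<f4').tofile('floats.bin')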
I've been trying over and over again to use libreplaygain.so (ReplayGain is an algorithm for calculating the loudness of audio) from Python, passing it data from an audio file. Here is the header file of libreplaygain. I don't understand much about ctypes or C in general, so I'm hoping it could be a problem of me being stupid and very obvious to somebody else! Here is the script I am using:
import numpy as np
from scipy.io import wavfile
import ctypes
replaygain = ctypes.CDLL('libreplaygain.so')
def calculate_replaygain(samples, frame_rate=44100):
    """
    inspired from https://github.com/vontrapp/replaygain
    """
    replaygain.gain_init_analysis(frame_rate)
    block_size = 10000
    channel_count = samples.shape[1]
    i = 0
    samples = samples.astype(np.float64)
    while i * block_size < samples.shape[0]:
        channel_left = samples[i*block_size:(i+1)*block_size, 0]
        channel_right = samples[i*block_size:(i+1)*block_size, 1]
        samples_p_left = channel_left.ctypes.data_as(ctypes.POINTER(ctypes.c_double))
        samples_p_right = channel_right.ctypes.data_as(ctypes.POINTER(ctypes.c_double))
        replaygain.gain_analyze_samples(samples_p_left, samples_p_right, channel_left.shape[0], channel_count)
        i += 1
    return replaygain.gain_get_chapter()

if __name__ == '__main__':
    frame_rate, samples = wavfile.read('directions.wav')
    samples = samples.astype(np.float64) / 2**15
    gain = calculate_replaygain(samples, frame_rate=frame_rate)
    print "Recommended gain: %f dB" % gain
    gain = calculate_replaygain(np.random.random((441000, 2)) * 2 - 1, frame_rate=44100)
    print "Recommended gain: %f dB" % gain
The script runs, but I cannot get the same value as with the command-line tool replaygain. In fact I always get 80.0. To try it, you can replace 'directions.wav' with any sound file ... and compare the result with the output of the command replaygain <soundfile.wav>.
gain_get_chapter() returns a double, but the ctypes docs say "By default functions are assumed to return the C int type." You should do something like
replaygain.gain_get_chapter.restype = ctypes.c_double
You should also check the return values of gain_init_analysis and gain_analyze_samples; if those aren't both 1, something else is going wrong. (Those actually are ints, so you shouldn't have to do anything else there.)
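A minimal sketch of both fixes, reusing the names from the question's calculate_replaygain (the success-equals-1 convention is taken from the answer above, not verified against the header by me):

import ctypes

replaygain = ctypes.CDLL('libreplaygain.so')

# Declare the return type before the first call; ctypes assumes C int by default.
replaygain.gain_get_chapter.restype = ctypes.c_double

# Inside calculate_replaygain from the question, also check the return codes:
#     if replaygain.gain_init_analysis(frame_rate) != 1:
#         raise RuntimeError('gain_init_analysis failed')
#     if replaygain.gain_analyze_samples(samples_p_left, samples_p_right,
#                                        channel_left.shape[0], channel_count) != 1:
#         raise RuntimeError('gain_analyze_samples failed')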
How do I get the actual file size on disk in Python (the actual size it takes up on the hard drive)?
UNIX only:
import os
from collections import namedtuple
_ntuple_diskusage = namedtuple('usage', 'total used free')
def disk_usage(path):
    """Return disk usage statistics about the given path.

    Returned value is a named tuple with attributes 'total', 'used' and
    'free', which are the amount of total, used and free space, in bytes.
    """
    st = os.statvfs(path)
    free = st.f_bavail * st.f_frsize
    total = st.f_blocks * st.f_frsize
    used = (st.f_blocks - st.f_bfree) * st.f_frsize
    return _ntuple_diskusage(total, used, free)
Usage:
>>> disk_usage('/')
usage(total=21378641920, used=7650934784, free=12641718272)
>>>
Edit 1 - also for Windows: https://code.activestate.com/recipes/577972-disk-usage/?in=user-4178764
Edit 2 - this is also available in Python 3.3+: https://docs.python.org/3/library/shutil.html#shutil.disk_usage
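For reference, the Python 3.3+ stdlib version is a one-liner:

import shutil

total, used, free = shutil.disk_usage('/')  # a named tuple of byte counts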
Here is the correct way to get a file's size on disk, on platforms where st_blocks is set:
import os
def size_on_disk(path):
    st = os.stat(path)
    return st.st_blocks * 512
Other answers that indicate to multiply by os.stat(path).st_blksize or os.statvfs(path).f_bsize are simply incorrect.
The Python documentation for os.stat_result.st_blocks very clearly states:
st_blocks
Number of 512-byte blocks allocated for file. This may be smaller than st_size/512 when the file has holes.
Furthermore, the stat(2) man page says the same thing:
blkcnt_t st_blocks; /* Number of 512B blocks allocated */
Update 2021-03-26: Previously, my answer rounded the logical size of the file up to an integer multiple of the block size. This approach only works if the file is stored in a continuous sequence of blocks on disk (or if all the blocks are full except for one). Since this is a special case (though common for small files), I have updated my answer to make it more generally correct. However, note that unfortunately the statvfs method and the st_blocks value may not be available on some systems (e.g., Windows 10).
Call os.stat(filename).st_blocks to get the number of blocks in the file.
Call os.statvfs(filename).f_bsize to get the filesystem block size.
Then compute the correct size on disk, as follows:
num_blocks = os.stat(filename).st_blocks
block_size = os.statvfs(filename).f_bsize
sizeOnDisk = num_blocks*block_size
st = os.stat(…)
du = st.st_blocks * st.st_blksize
Practically 12 years and no answer on how to do this on Windows...
Here's how to find the 'Size on disk' on Windows via ctypes:
import ctypes

def GetSizeOnDisk(path):
    '''https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-getcompressedfilesizew'''
    GetCompressedFileSizeW = ctypes.windll.kernel32.GetCompressedFileSizeW
    GetCompressedFileSizeW.restype = ctypes.c_ulong  # the call returns the low-order DWORD
    filesizehigh = ctypes.c_ulong(0)  # receives the high-order DWORD, for files > 4 GB
    low = GetCompressedFileSizeW(ctypes.c_wchar_p(path), ctypes.pointer(filesizehigh))
    return (filesizehigh.value << 32) + low
'''
>>> os.stat(somecompressedorofflinefile).st_size
943141
>>> GetSizeOnDisk(somecompressedorofflinefile)
671744
>>>
'''
I'm not certain if this is size on disk, or the logical size:
import os
filename = "/home/tzhx/stuff.wev"
size = os.path.getsize(filename)
If it's not the droid you're looking for, you can round it up by dividing by the cluster size (as a float), then using ceil, then multiplying.
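A small sketch of that rounding; the cluster size here is an assumed placeholder, you would need to query your filesystem for the real value:

import math
import os

cluster_size = 4096  # assumed; look up your filesystem's actual cluster size
logical_size = os.path.getsize("/home/tzhx/stuff.wev")
size_on_disk = int(math.ceil(logical_size / float(cluster_size))) * cluster_size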
To get the disk usage for a given file/folder, you can do the following:
import os
def disk_usage(path):
    """Return cumulative number of bytes for a given path."""
    # get total usage of current path
    total = os.path.getsize(path)
    # if path is dir, collect children
    if os.path.isdir(path):
        for file_name in os.listdir(path):
            child = os.path.join(path, file_name)
            # recursively get byte use for children
            total += disk_usage(child)
    return total
The function recursively collects byte usage for files nested within a given path, and returns the cumulative use for the entire path.
You could also add a print("{}: {}".format(path, total)) in there if you want the information for each file to be printed.