How to get BytesIO allocated memory length in Python?

This is the code I am using to test the memory allocation:
import pycurl
import io
url = "http://www.stackoverflow.com"
buf = io.BytesIO()
print(len(buf.getvalue()))  # here I am getting 0 as the length
c = pycurl.Curl()
c.setopt(c.URL, url)
c.setopt(c.CONNECTTIMEOUT, 10)
c.setopt(c.TIMEOUT, 10)
c.setopt(c.ENCODING, 'gzip')
c.setopt(c.FOLLOWLOCATION, True)
c.setopt(c.IPRESOLVE, c.IPRESOLVE_V4)
c.setopt(c.USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0) Gecko/20100101 Firefox/8.0')
c.setopt(c.WRITEFUNCTION, buf.write)
c.perform()
c.close()
print(len(buf.getvalue()))  # here, the length of the downloaded file
print(buf.getvalue())
buf.close()
How do I get the buffer/memory length allocated by BytesIO?
What am I doing wrong here? Doesn't Python allocate a fixed buffer length?

I am not sure what you mean by allocated buffer/memory length, but if you want the length of the user data stored in the BytesIO object, you can do:
>>> bio = io.BytesIO()
>>> bio.getbuffer().nbytes
0
>>> bio.write(b'here is some data')
17
>>> bio.getbuffer().nbytes
17
But this seems equivalent to the len(buf.getvalue()) that you are currently using (though getbuffer().nbytes avoids the copy that getvalue() makes).
The actual size of the BytesIO object can be found using sys.getsizeof():
>>> import sys
>>> bio = io.BytesIO()
>>> sys.getsizeof(bio)
104
Or you could be nasty and call __sizeof__() directly (which is like sys.getsizeof(), but without the garbage collector overhead applicable to the object):
>>> bio = io.BytesIO()
>>> bio.__sizeof__()
72
Memory for BytesIO is allocated as required, and some over-allocation does take place:
>>> bio = io.BytesIO()
>>> for i in range(20):
...     _ = bio.write(b'a')
...     print(bio.getbuffer().nbytes, sys.getsizeof(bio), bio.__sizeof__())
...
1 106 74
2 106 74
3 108 76
4 108 76
5 110 78
6 110 78
7 112 80
8 112 80
9 120 88
10 120 88
11 120 88
12 120 88
13 120 88
14 120 88
15 120 88
16 120 88
17 129 97
18 129 97
19 129 97
20 129 97

io.BytesIO() returns a standard file object with a tell() method. It reports the current position in the stream and, unlike len(bio.getvalue()), does not copy the whole buffer out to compute the total size (bio.getbuffer().nbytes avoids the copy too, but has to create a memoryview first). It is a very fast and simple way to get the exact size of the used memory in the buffer object.
However, if you preset your buffer, tell() will point at the beginning of the buffer and return 0, even though the buffer size is not zero. In that case, you can move the pointer to the end of the buffer with seek(0, 2), which returns the total buffer size without copying the whole buffer into another chunk of memory.
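A minimal sketch of that approach, assuming Python 3 (where seek() returns the new absolute position):
import io

buf = io.BytesIO(b'preset data')  # buffer preset with 11 bytes of data
print(buf.tell())                 # 0 -- the position starts at the beginning

size = buf.seek(0, 2)             # jump to the end; returns the new position
print(size)                       # 11 -- the total size of the buffer
buf.seek(0)                       # restore the position before further reads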
I posted and recently updated an example code and a more detailed answer here

You can also use tracemalloc to get indirect information about the size of objects, by sampling traced memory before and after an allocation with tracemalloc.get_traced_memory().
Do note that active threads (if any) and other side effects of your program will affect the output, but the result may also be more representative of the real memory cost if many samples are taken, as shown below.
>>> import tracemalloc
>>> from io import BytesIO
>>> tracemalloc.start()
>>>
>>> memory_traces = []
>>>
>>> with BytesIO() as bytes_fh:
...     # get_traced_memory() returns (current memory usage, peak memory usage)
...     # ...but only since calling .start()
...     memory_traces.append(tracemalloc.get_traced_memory())
...     bytes_fh.write(b'a' * (1024**2))  # create 1MB of 'a'
...     memory_traces.append(tracemalloc.get_traced_memory())
...
1048576
>>> print("used_memory = {}b".format(memory_traces[1][0] - memory_traces[0][0]))
used_memory = 1048870b
>>> 1048870 - 1024**2 # show small overhead
294

Related

Python threads memory leak

I have a class like this:
class Detector:
    ...
    def detect(self):
        sniff(iface='eth6', filter='vlan or not vlan and udp port 53',
              prn=self.spawnThread, store=0)

    def spawnThread(self, pkt):
        t = threading.Thread(target=self.predict, args=(pkt,))
        t.start()

    def predict(self, pkt):
        # do something
        # write log file with logging module
where sniff() is a method from scapy; for every packet it captures, it passes the packet to spawnThread, and in spawnThread I want to create a separate thread to run the predict method.
But there seems to be a memory leak. I checked with Heapy and got this output:
Partition of a set of 623561 objects. Total size = 87355208 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 236145 38 26871176 31 26871176 31 str
1 139658 22 13805832 16 40677008 47 tuple
2 6565 1 7366648 8 48043656 55 dict (no owner)
3 1408 0 6592768 8 54636424 63 dict of module
4 25764 4 3297792 4 57934216 66 types.CodeType
5 17737 3 3223240 4 61157456 70 list
6 24878 4 2985360 3 64142816 73 function
7 14367 2 2577384 3 66720200 76 unicode
8 2445 0 2206320 3 68926520 79 type
9 2445 0 2173752 2 71100272 81 dict of type
The count and size of tuple objects keep growing; I think that's what causes the memory leak, but I don't know where or why. Thanks for any feedback!
Update: if I directly call predict from sniff without using threads, there is no memory leak. Also, there are no other tuple objects anywhere else in the class; in __init__ I just initialized some strings like paths and names.
class Detector:
    ...
    def detect(self):
        sniff(iface='eth6', filter='vlan or not vlan and udp port 53',
              prn=self.predict, store=0)

    def predict(self, pkt):
        # do something with pkt
        # write log file with logging module

Getting the number of characters through its memory size in Python

[ Python ]
I have a string and I know its size in memory. I want to get a rough estimate of the number of characters it contains.
Actual case: I want to send a report through mail, and the content of the mail exceeds the size permitted for it. I want to split the mail into multiple messages according to the maximum size, but I don't have a way to correlate the maximum size with the number of characters in the string.
import smtplib
smtp = smtplib.SMTP('server.name')
smtp.ehlo()
max_size = smtp.esmtp_features['size']
message_data = <some string data exceeding the max_size>
# Now, how can I get the number of characters in message_data which exceeds the max_size?
Thanks,
The number of chars in a string is its size in memory in bytes, minus 37 bytes of fixed object overhead (Python 2.7 / Mac OS):
import sys

def estimate_chars():
    "size in bytes"
    s = ""
    for idx in range(100):
        print idx * 10, sys.getsizeof(s), len(s)
        s += '1234567890'

estimate_chars()
result: (chars | bytes | len)
0 37 0
10 47 10
20 57 20
30 67 30
40 77 40
50 87 50
...
960 997 960
970 1007 970
980 1017 980
990 1027 990
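If the goal is just to split the report so that each piece stays under the server's limit, it may be simpler to work from the character count directly rather than from sys.getsizeof(). A rough sketch with a hypothetical split_message() helper, assuming ASCII report text (one character = one byte on the wire):
def split_message(message_data, max_size):
    # Split the string into pieces of at most max_size characters.
    # Assumes ASCII text; for non-ASCII content you would need to
    # encode first and split on character boundaries.
    return [message_data[i:i + max_size]
            for i in range(0, len(message_data), max_size)]

parts = split_message('some long report ' * 1000, 4096)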

Python Encoding Issues

I have data that I would like to decode, and it's in Windows-1252. Basically, I send code to a socket and it sends it back; then I have to decode the message and use IEEE-754 to get a certain value from it, but I can't seem to figure out all this encoding stuff. Here is my code.
import struct
import binascii

def byt1Hex(bytStr):
    return ' '.join(["%02X" % ord(x) for x in bytStr])

def printKinds():
    test = "x40\x39\x19\x99\x99\x99\x99\x9A"
    print (byt1Hex(test))
    test = byt1Hex(test).replace(' ', '')
    struct.unpack('<d', binascii.unhexlify(test))
    print (test)

printKinds()
So I use that and then I have to get the value from it, but it's not working and I cannot figure out why.
The current output I am getting is:
struct.unpack('<d', binascii.unhexlify(data))
struct.error: unpack requires a bytes object of length 8
That's the error; the expected output I am looking for is 25.1.
But when I encode it, it actually changes the string into the wrong values, so when I do this:
print (byt1Hex(data))
I expect to get this.
40 39 19 99 99 99 99 9A
But I actually get this instead
78 34 30 39 19 99 99 99 99 9A
>>> import struct
>>> struct.pack('!d', 25.1)
b'@9\x19\x99\x99\x99\x99\x9a'
>>> struct.unpack('!d', _)  # NOTE: no need to call byt1Hex or unhexlify
(25.1,)
You send and receive bytes over the network. There is no need to hexlify/unhexlify them unless the protocol requires it (in which case you should mention the protocol in the question).
You have:
test = "x40\x39\x19\x99\x99\x99\x99\x9A"
You need:
test = "\x40\x39\x19\x99\x99\x99\x99\x9A"

How to read part of binary file with numpy?

I'm converting a Matlab script to NumPy, but have some problems with reading data from a binary file. Is there an equivalent to fseek when using fromfile to skip the beginning of the file? This is the type of extraction I need to do:
fid = fopen(fname);
fseek(fid, 8, 'bof');
second = fread(fid, 1, 'schar');
fseek(fid, 100, 'bof');
total_cycles = fread(fid, 1, 'uint32', 0, 'l');
start_cycle = fread(fid, 1, 'uint32', 0, 'l');
Thanks!
You can use seek with a file object in the normal way, and then use this file object in fromfile. Here's a full example:
import numpy as np
import os
data = np.arange(100, dtype=np.int32)  # 4-byte ints, matching the output below
data.tofile("temp")  # save the data
f = open("temp", "rb")  # reopen the file
f.seek(256, os.SEEK_SET)  # seek past the first 64 ints (256 bytes)
x = np.fromfile(f, dtype=np.int32)  # read the remaining data into numpy
print x
# [64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
# 89 90 91 92 93 94 95 96 97 98 99]
There probably is a better answer… but when I was faced with this problem, I had a file whose parts I already wanted to access separately, which gave me an easy solution.
For example, say chunkyfoo.bin is a file consisting of a 6-byte header, a 1024-byte numpy array, and another 1024-byte numpy array. You can't just open the file and seek 6 bytes (because the first thing numpy.fromfile does is lseek back to 0). But you can just mmap the file and use fromstring instead:
import mmap
import numpy as np
from contextlib import closing

with open('chunkyfoo.bin', 'rb') as f:
    with closing(mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ)) as m:
        a1 = np.fromstring(m[6:1030])
        a2 = np.fromstring(m[1030:])
This sounds like exactly what you want to do. Except, of course, that in real life the offset and length of a1 and a2 probably depend on the header, rather than being fixed numbers.
The header is just m[:6], and you can parse that by explicitly pulling it apart, using the struct module, or whatever else you'd do once you read the data. But, if you'd prefer, you can explicitly seek and read from f before constructing m, or after, or even make the same calls on m, and it will work, without affecting a1 and a2.
An alternative, which I've done for a different non-numpy-related project, is to create a wrapper file object, like this:
class SeekedFileWrapper(object):
    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.offset = fileobj.tell()

    def seek(self, offset, whence=0):
        if whence == 0:
            offset += self.offset
        return self.fileobj.seek(offset, whence)

    # ... delegate everything else unchanged
I did the "delegate everything else unchanged" by generating a list of attributes at construction time and using that in __getattr__, but you probably want something less hacky. numpy only relies on a handful of methods of the file-like object, and I think they're properly documented, so just explicitly delegate those. But I think the mmap solution makes more sense here, unless you're trying to mechanically port over a bunch of explicit seek-based code. (You'd think mmap would also give you the option of leaving it as a numpy.memmap instead of a numpy.array, which lets numpy have more control over/feedback from the paging, etc. But it's actually pretty tricky to get a numpy.memmap and an mmap to work together.)
This is what I do when I have to read arbitrary data in a heterogeneous binary file.
NumPy allows you to interpret a bit pattern in arbitrary ways by changing the dtype of the array.
The Matlab code in the question reads a char and two uint32 values.
Read this paper (easy reading at the user level, not for scientists) on what one can achieve by changing the dtype, stride, and dimensionality of an array.
import numpy as np
data = np.arange(10, dtype=np.int32)  # 4-byte ints, matching the sizes below
data.tofile('f')
x = np.fromfile('f', dtype='u1')
print x.size
# 40
second = x[8]
print 'second', second
# second 2
total_cycles = x[8:12]
print 'total_cycles', total_cycles
total_cycles.dtype = np.dtype('u4')
print 'total_cycles', total_cycles
# total_cycles [2 0 0 0] !endianness
# total_cycles [2]
start_cycle = x[12:16]
start_cycle.dtype = np.dtype('u4')
print 'start_cycle', start_cycle
# start_cycle [3]
x.dtype = np.dtype('u4')
print 'x', x
# x [0 1 2 3 4 5 6 7 8 9]
x[3] = 423
print 'start_cycle', start_cycle
# start_cycle [423]
There is a quite new feature of numpy.fromfile():
offset : int
    The offset (in bytes) from the file's current position. Defaults to 0. Only permitted for binary files.
    New in version 1.17.0.
import numpy as np
import os
data = np.arange(100, dtype=np.int32)
data.tofile("temp") # save the data
x = np.fromfile("temp", dtype=np.int32, offset=256) # use the offset
print (x)
# [64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
# 89 90 91 92 93 94 95 96 97 98 99]

Calculating computational time and memory for a code in python

Can somebody help me find out how much time and how much memory a piece of code takes in Python?
Use this for calculating time (note that time.clock() was removed in Python 3.8; time.perf_counter(), used in a later answer, is the modern replacement):
import time
time_start = time.clock()
#run your code
time_elapsed = (time.clock() - time_start)
As referenced by the Python documentation:
time.clock()
On Unix, return the current processor time as a floating
point number expressed in seconds. The precision, and in fact the very
definition of the meaning of “processor time”, depends on that of the
C function of the same name, but in any case, this is the function to
use for benchmarking Python or timing algorithms.
On Windows, this function returns wall-clock seconds elapsed since the
first call to this function, as a floating point number, based on the
Win32 function QueryPerformanceCounter(). The resolution is typically
better than one microsecond.
Reference: http://docs.python.org/library/time.html
Use this for calculating memory:
import resource
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
Reference: http://docs.python.org/library/resource.html
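One portability caveat: the units of ru_maxrss differ by platform (kilobytes on Linux, bytes on macOS), so a platform-aware reading might look like this sketch:
import resource
import sys

usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# ru_maxrss is reported in bytes on macOS ('darwin') but in kilobytes on Linux
scale = 1 if sys.platform == 'darwin' else 1024  # bytes per ru_maxrss unit
print('peak RSS: %.1f MiB' % (usage * scale / (1024.0 * 1024.0)))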
Based on #Daniel Li's answer for cut&paste convenience and Python 3.x compatibility:
import time
import resource

time_start = time.perf_counter()
# insert code here ...
time_elapsed = (time.perf_counter() - time_start)
memMb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0 / 1024.0
print("%5.1f secs %5.1f MByte" % (time_elapsed, memMb))
Example:
2.3 secs 140.8 MByte
There is a really good library called jackedCodeTimerPy for timing your code. You should then use the resource package that Daniel Li suggested.
jackedCodeTimerPy gives really good reports like
label min max mean total run count
------- ----------- ----------- ----------- ----------- -----------
imports 0.00283813 0.00283813 0.00283813 0.00283813 1
loop 5.96046e-06 1.50204e-05 6.71864e-06 0.000335932 50
I like how it gives you statistics on it and the number of times the timer is run.
It's simple to use. If I want to measure the time code takes in a for loop, I just do the following:
from jackedCodeTimerPY import JackedTiming

JTimer = JackedTiming()
for i in range(50):
    JTimer.start('loop')  # 'loop' is the name of the timer
    doSomethingHere = 'This is really useful!'
    JTimer.stop('loop')

print(JTimer.report())  # prints the timing report
You can also have multiple timers running at the same time:
JTimer.start('first timer')
JTimer.start('second timer')
do_something = 'amazing'
JTimer.stop('first timer')
do_something = 'else'
JTimer.stop('second timer')
print(JTimer.report()) # prints the timing report
There are more usage examples in the repo. Hope this helps.
https://github.com/BebeSparkelSparkel/jackedCodeTimerPY
Use a memory profiler like guppy
>>> from guppy import hpy; h=hpy()
>>> h.heap()
Partition of a set of 48477 objects. Total size = 3265516 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 25773 53 1612820 49 1612820 49 str
1 11699 24 483960 15 2096780 64 tuple
2 174 0 241584 7 2338364 72 dict of module
3 3478 7 222592 7 2560956 78 types.CodeType
4 3296 7 184576 6 2745532 84 function
5 401 1 175112 5 2920644 89 dict of class
6 108 0 81888 3 3002532 92 dict (no owner)
7 114 0 79632 2 3082164 94 dict of type
8 117 0 51336 2 3133500 96 type
9 667 1 24012 1 3157512 97 __builtin__.wrapper_descriptor
<76 more rows. Type e.g. '_.more' to view.>
>>> h.iso(1,[],{})
Partition of a set of 3 objects. Total size = 176 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 1 33 136 77 136 77 dict (no owner)
1 1 33 28 16 164 93 list
2 1 33 12 7 176 100 int
>>> x=[]
>>> h.iso(x).sp
0: h.Root.i0_modules['__main__'].__dict__['x']
