Right now, I am buffering bytes using strings, StringIO, or cStringIO. But I often need to remove bytes from the left side of the buffer. A naive approach would rebuild the entire buffer. Is there an optimal way to do this when left-truncation is a very common operation? Python's garbage collector should actually free the truncated bytes.
Any sort of algorithm for this (keep the buffer in small pieces?), or an existing implementation, would really help.
Edit:
I tried to use Python 2.7's memoryview for this, but sadly, the data outside the "view" isn't GCed when the original reference is deleted:
# (This will use ~2GB of memory, not 50MB)
memoryview  # Requires Python 2.7+
smalls = []
for i in xrange(10):
    big = memoryview('z' * (200 * 1000 * 1000))
    small = big[195 * 1000 * 1000:]
    del big
    smalls.append(small)
    print '.',
A deque will be efficient if left-removal operations are frequent (unlike a list, string or buffer, it offers amortised O(1) removal at either end). However, it will be more costly memory-wise than a string, as you'll be storing each character as its own string object rather than as a packed sequence.
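For illustration (not part of the original answer), a minimal sketch of consuming from the left of a character deque:

from collections import deque

buf = deque('some buffered bytes')   # one small str object per character
buf.popleft()                        # O(1) removal from the left end
buf.popleft()
print ''.join(buf)                   # join back into a single str only when needed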
Alternatively, you could create your own implementation (e.g. a linked list of string/buffer objects of fixed size), which may store the data more compactly.
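A rough sketch of that idea, using a hypothetical ChunkBuffer class (names and chunk size are illustrative, not from the answer) that keeps fixed-size string pieces in a deque; whole pieces become unreferenced, and therefore collectable, as the left side is consumed:

from collections import deque

class ChunkBuffer(object):
    """Hypothetical byte buffer with cheap left-truncation."""

    def __init__(self, chunk_size=64 * 1024):
        self.chunk_size = chunk_size
        self.chunks = deque()   # each entry is a plain str of up to chunk_size bytes
        self.start = 0          # offset of the logical start inside chunks[0]

    def append(self, data):
        # split incoming data into fixed-size pieces
        for i in xrange(0, len(data), self.chunk_size):
            self.chunks.append(data[i:i + self.chunk_size])

    def popleft(self, n):
        """Remove and return the first n bytes; fully consumed chunks are dropped."""
        out = []
        while n > 0 and self.chunks:
            head = self.chunks[0]
            take = min(n, len(head) - self.start)
            out.append(head[self.start:self.start + take])
            self.start += take
            n -= take
            if self.start >= len(head):
                self.chunks.popleft()   # chunk no longer referenced -> GC can free it
                self.start = 0
        return ''.join(out)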
Build your buffer as a list of characters or lines and slice the list. Only join as string on output. This is pretty efficient for most types of 'mutable string' behaviour.
The GC will collect the truncated bytes because they are no longer referenced in the list.
UPDATE: To modify the list head you can simply reverse the list. This sounds like an inefficient thing to do, however Python's list implementation handles reversal very quickly internally.
From http://effbot.org/zone/python-list.htm:
Reversing is fast, so temporarily reversing the list can often speed things up if you need to remove and insert a bunch of items at the beginning of the list:
L.reverse()
# append/insert/pop/delete at far end
L.reverse()
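A small illustrative example of the trick, removing the first two items via a double reverse:

L = list('abcdef')
L.reverse()
L.pop(); L.pop()        # popping from the far end == removing from the original front
L.reverse()
print ''.join(L)        # prints 'cdef'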
Related
I'm trying to do some work on a file; the file has various data in it, and I'm pulling it in as raw strings and then working on those strings.
I'm trying to make the process multithreaded so I can work on several chunks at once, but of course the files are quite large (several gigabytes), so memory is an issue.
The processes don't need to modify the input data, so they don't need their own copies. However, I don't know how to make an array of strings as a ctype in Python 2.7.
Currently I have:
import multiprocessing, ctypes
from multiprocessing.sharedctypes import Value, Array

with open('test.txt', 'r') as fin:
    rawdata = Array('c', fin.readlines(), lock=False)
But this doesn't work as I'd hoped: it sees the whole thing as one massive char buffer array and fails because it wants a single string object. I need to be able to pull out the original lines and work with them with existing Python code that examines the contents of the lines and does some operations, which vary from substring matching to pulling out integer and float values from the strings for mathematical operations. Is there any sensible way I can achieve this that I'm missing? Perhaps I'm using the wrong item (Array) to push the data into a shared C format?
Do you want your strings to end up as Python strings, or as C-style strings, a.k.a. null-terminated character arrays? If you're working with Python string processing, then simply reading the file into a non-ctypes Python string and using that everywhere is the way to go; Python doesn't copy strings by default, since they're immutable anyway. If you want to use C-style strings, then you will want to allocate a character buffer using ctypes and use fin.readinto(buffer).
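A rough sketch of the readinto approach, assuming a file named test.txt (the file name and the final print are illustrative, not from the answer):

import os
import ctypes

size = os.path.getsize('test.txt')
buf = ctypes.create_string_buffer(size)   # writable char array of the right size
with open('test.txt', 'rb') as fin:
    fin.readinto(buf)                     # fill the ctypes buffer in place, no extra copy
print buf.raw[:80]                        # first bytes, as a plain str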
Since python does slice-by-copy, slicing strings can be very costly.
I have a recursive algorithm that is operating on strings. Specifically, if a function is passed a string a, the function calls itself on a[1:] of the passed string. The hangup is that the strings are so long, the slice-by-copy mechanism is becoming a very costly way to remove the first character.
Is there a way to get around this, or do I need to rewrite the algorithm entirely?
The only way to get around this in general is to make your algorithm use bytes-like types, either Py2 str or Py3 bytes; views of Py2 unicode/Py3 str are not supported. I provided details on how to do this in my answer to a related question, but the short version is: if you can assume bytes-like arguments (or convert to them), wrapping the argument in a memoryview and slicing that is a reasonable solution. Once converted to a memoryview, slicing produces new memoryviews at O(1) cost (in both time and memory), rather than the O(n) time/memory cost of text slicing.
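A minimal sketch of the idea, with hypothetical function names: wrap the bytes-like input once at the top level, then recurse on memoryview slices (deep recursion will still hit Python's recursion limit on very long inputs; this only removes the copying cost):

def count_chars(data):
    # data is assumed to be bytes-like (Py2 str / Py3 bytes)
    return _count(memoryview(data))

def _count(view):
    if len(view) == 0:
        return 0
    # slicing a memoryview is O(1): no bytes are copied
    return 1 + _count(view[1:])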
Goal
Reading in a massive binary file approx size 1.3GB and change certain bits and then writing it back to a separate file (cannot modify original file).
Method
When I read in the binary file, it gets stored as a massive hex-encoded string, which is immutable since I am using Python.
My algorithm loops through the entire file and stores in a list all the indexes of the string that need to be modified. The catch is that all of those indexes need to be set to the same value. I cannot do this in place because of the string's immutability. I cannot convert it into a list of chars, because that blows past my memory constraints and takes a very long time. The viable thing to do is to build the result in a separate string, but due to immutability I would have to make a ton of string objects and keep concatenating to them.
I used some ideas from https://waymoot.org/home/python_string/, however it doesn't give me good performance. Any ideas? The goal is to copy an existing very long string into another exactly, except at certain placeholders determined by the values in the index list.
So, to be honest, you shouldn't be reading your file into a string, and you especially shouldn't be writing anything but the bytes you actually change.
That is just a waste of resources, since you only seem to be reading linearly through the file, noting down the places that need to be modified.
On all OSes with some level of mmap support (that is, Unixes such as Linux, OS X and the *BSDs, as well as other OSes like Windows), you can use Python's mmap module to map the input file, scan through it, and write the edited bytes into a mapping of the output file, without ever loading the whole thing into RAM and writing it back out. A simple example, replacing every byte with value 12 by something position-dependent:
import mmap

with open("infilename", "rb") as in_f:
    in_view = mmap.mmap(in_f.fileno(), 0, access=mmap.ACCESS_READ)  # length 0: map the complete file
    length = in_view.size()
    with open("outfilename", "w+b") as out_f:
        out_f.truncate(length)          # the output file must have the right size before mapping
        out_view = mmap.mmap(out_f.fileno(), length)
        for i in range(length):         # Python 3: indexing an mmap yields ints
            if in_view[i] == 12:
                out_view[i] = in_view[i] + i % 10
            else:
                out_view[i] = in_view[i]
What about slicing the string, modifying each slice, and writing it back to disk before moving on to the next slice? Too intensive for the disk?
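A rough sketch of that chunked approach, with hypothetical file names and a placeholder single-byte edit (a multi-byte pattern would need extra care at chunk boundaries):

CHUNK = 64 * 1024 * 1024          # assumed 64 MB slices

with open('infile.bin', 'rb') as fin, open('outfile.bin', 'wb') as fout:
    while True:
        chunk = fin.read(CHUNK)
        if not chunk:
            break
        fixed = chunk.replace('\x0c', '\x00')   # hypothetical per-chunk edit
        fout.write(fixed)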
Strings in Python are immutable and support the buffer interface. It could be efficient to return, instead of new strings, buffers pointing into parts of the old string when slicing or using the .split() method. However, a new string object is constructed each time. Why? The single reason I see is that it can make garbage collection a bit more difficult.
True: in regular situations the memory overhead is linear and isn't noticeable. Copying is fast, and so is allocation. But there is already too much done in Python, so maybe such buffers are worth the effort?
EDIT:
It seems that forming substrings this way would make memory management much more complicated. A simple example is the case where only 20% of an arbitrary string is used and the rest cannot be deallocated. We could improve the memory allocator so it could deallocate strings partially, but that would probably be a net loss in most cases. All the standard functions can anyway be emulated with buffer or memoryview if memory becomes critical. The code wouldn't be as concise, but one has to give up something in order to get something.
The underlying string representation is null-terminated, even though it keeps track of the length; hence you cannot have a string object that references a sub-string that isn't a suffix. This already limits the usefulness of your proposal, since it would add a lot of complications to deal differently with suffixes and non-suffixes (and giving up null-terminated strings brings other consequences).
Allowing references to sub-strings of a string would greatly complicate garbage collection and string handling. For every string you'd have to keep track of how many objects refer to each character, or to each range of indices. This means complicating the struct of string objects and every operation that deals with them considerably, which means a probably big slowdown.
Add the fact that, starting with Python 3, strings have three different internal representations, and things are going to be too messy to be maintainable; your proposal probably doesn't give enough benefit to be accepted.
Another problem with this kind of "optimization" is when you want to deallocate "big strings":
a = "Some string" * 10 ** 7
b = a[10000]
del a
After these operations you have the substring b that prevents a, a huge string, from being deallocated. Surely you could copy small strings, but what if b = a[:10000] (or some other big number)? 10000 characters looks like a big string which ought to use the optimization to avoid copying, yet it would prevent releasing megabytes of data.
The garbage collector would have to keep checking whether it is worth deallocating a big string object and making copies or not, and all these operations must be as fast as possible, otherwise you end up hurting time performance.
99% of the time the strings used in programs are "small" (at most ~10k characters), hence copying is really fast, while the optimization you propose only starts to pay off with really big strings (e.g. taking substrings of size 100k from huge texts) and is much slower with really small strings, which is the common case, i.e. the case that should be optimized.
If you think this is important, you are free to propose a PEP, show an implementation, and the resulting changes in speed/memory usage of your proposal. If it is really worth the effort it may be included in a future version of Python.
That's how slices work. Slices always perform a shallow copy, allowing you to do things like
>>> x = [1,2,3]
>>> y = x[:]
Now it would be possible to make an exception for strings, but is it really worth it? Eric Lippert blogged about his decision not to do that for .NET; I guess his argument is valid for Python as well.
See also this question.
If you are worried about memory (in the case of really large strings), use a buffer():
>>> a = "12345"
>>> b = buffer(a, 2, 2)
>>> b
<read-only buffer for 0xb734d120, size 2, offset 2 at 0xb734d4a0>
>>> print b
34
>>> print b[:]
34
Knowing about this gives you alternatives to string methods such as split().
If you want to split() a string, but keep the original string object (as you maybe need it), you could do:
def split_buf(s, needle):
    start = 0
    add = len(needle)
    res = []
    while True:
        index = s.find(needle, start)
        if index < 0:
            break
        res.append(buffer(s, start, index - start))
        start = index + add
    res.append(buffer(s, start, len(s) - start))  # trailing piece, as str.split() would return
    return res
or, using .index():
def split_buf(s, needle):
    start = 0
    add = len(needle)
    res = []
    try:
        while True:
            index = s.index(needle, start)
            res.append(buffer(s, start, index - start))
            start = index + add
    except ValueError:
        pass
    res.append(buffer(s, start, len(s) - start))  # trailing piece
    return res
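An illustrative usage example (output assumes the trailing-piece handling shown above); each element is a buffer into the original string, so str() is used to materialise copies only where needed:
>>> s = "alpha--beta--gamma"
>>> [str(part) for part in split_buf(s, "--")]
['alpha', 'beta', 'gamma']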
I'm quite new to Python and trying to port a simple exploit I've written for a stack overflow (just a NOP sled, shellcode and return address). This isn't for nefarious purposes but rather for a security lecture at a university.
Given a hex string (deadbeef), what are the best ways to:
represent it as a series of bytes
add or subtract a value
reverse the order (for x86 memory layout, i.e. efbeadde)
Any tips and tricks regarding common tasks in exploit writing in python are also greatly appreciated.
In Python 2.6 and above, you can use the built-in bytearray class.
To create your bytearray object:
b = bytearray.fromhex('deadbeef')
To alter a byte, you can reference it using array notation:
b[2] += 7
To reverse the bytearray in place, use b.reverse(). To create an iterator that iterates over it in reverse order, you can use the reversed function: reversed(b).
You may also be interested in the new bytes class in Python 3, which is like bytearray but immutable.
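Putting the pieces together, a small Python 2 sketch (the hexlify output shown is simply what these particular edits produce):

import binascii

b = bytearray.fromhex('deadbeef')
b[2] += 7                          # 0xbe -> 0xc5
b.reverse()                        # in-place reversal, little-endian order for x86
print binascii.hexlify(str(b))     # prints 'efc5adde'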
Not sure if this is the best way...
hex_str = "deadbeef"
bytes = "".join(chr(int(hex_str[i:i+2],16)) for i in xrange(0,len(hex_str),2))
rev_bytes = bytes[::-1]
Or might be simpler:
bytes = "\xde\xad\xbe\xef"
rev_bytes = bytes[::-1]
In Python 2.x, regular str values are binary-safe. You can use the binascii module's b2a_hex and a2b_hex functions to convert to and from hexadecimal.
You can use ordinary string methods to reverse or otherwise rearrange your bytes. However, doing any kind of arithmetic would require you to use the ord function to get numeric values for individual bytes, then chr to convert the result back, followed by concatenation to reassemble the modified string.
For mutable sequences with easier arithmetic, use the array module with type code 'B'. These can be initialized from the results of a2b_hex if you're starting from hexadecimal.
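A short sketch of that combination, assuming Python 2 (where a2b_hex returns a plain str):

import binascii
from array import array

raw = binascii.a2b_hex('deadbeef')       # -> '\xde\xad\xbe\xef'
print binascii.b2a_hex(raw[::-1])        # reversed byte order: 'efbeadde'

arr = array('B', raw)                    # mutable sequence of unsigned bytes
arr[0] = (arr[0] + 1) % 256              # simple per-byte arithmetic
print binascii.b2a_hex(arr.tostring())   # 'dfadbeef'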