How strings are stored in python memory model

I come from a C background and am a beginner in Python. I want to know how strings are actually stored in memory in Python.
I tried something like this:
s="foo"
id(s)=140542718184424
id(s[0])= 140542719027040
id(s[1])= 140542718832152
id(s[2])= 140542718832152
I did not understand how each character is stored in memory, why the id of s is not equal to the id of s[0] (as it would be in C), and why the ids of s[1] and s[2] are the same.

Python has no character type. Indexing into a string creates a new string, which (like every other object) promptly vanishes if you don't keep a reference to it around. So the id()s in your example can't be compared with each other: an object's id is only unique as long as the object lives. In particular, id(s[0]) != id(s) because the former is a new (temporary) object, and id(s[1]) == id(s[2]) because once the first call has been evaluated, the first temporary string is destroyed and the second temporary string is allocated in the memory that was just freed. The latter is an implementation detail and a coincidence, and cannot be relied on.
Reasoning about string memory is further complicated by implementation details like small strings (along with integers, some tuples, and more) being interned, so some_str is other_str may be true for equal strings that come from different sources (e.g. from indexing into a string with different indices).
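To make the lifetime point concrete, here is a minimal sketch (the names first, second and third are just illustrative). It keeps a reference to each indexed string so the objects stay alive while their ids are compared:

```python
s = "foo"

# Bind each one-character string to a name so it stays alive.
first, second, third = s[0], s[1], s[2]

# s and first are two distinct live objects, so their ids must differ.
print(id(s) != id(first))

# Equal characters always compare equal, whatever their identity.
print(second == third)

# Whether second and third are the *same* object is an implementation
# detail (CPython happens to cache one-character strings), so do not
# build logic on this result.
print(second is third)
```

Only the first two results are guaranteed by the language; the last one depends on the interpreter.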

This article is a good read that explains how strings are stored. Briefly:
When working with empty strings or ASCII strings of one character Python uses string interning. Interned strings act as singletons, that is, if you have two identical strings that are interned, there is only one copy of them in the memory.
Python does not use UTF-8 internally, so that it can provide constant-time access to substrings:
s = 'hello world'
s[0]
s[7]
Neither lookup requires scanning the string from the initial char (or, more correctly, the first substring of length 1) to the i-th position.
This is why Python uses the three kinds of internal representations for Unicode strings with 1, 2 or 4 byte(s) per char (Latin-1, UCS-2, UCS-4 encoding) and does not use the space-optimised UTF-8.
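One rough way to observe the 1, 2 or 4 bytes per character is to measure how much a string grows when one more character is appended. This is a sketch that assumes CPython 3.3 or later; bytes_per_char is a hypothetical helper, and sys.getsizeof reports implementation-specific sizes:

```python
import sys

def bytes_per_char(ch, n=1000):
    """Extra bytes needed when a string made of `ch` grows by one character."""
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

# ASCII, Latin-1, a BMP character, and a character outside the BMP.
for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}: {bytes_per_char(ch)} byte(s) per character")
```

The widest character in a string decides the width used for the whole string, which is what keeps indexing a constant-time operation.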

This is implementation dependent, but some implementations (not only of Python, other languages too) may keep a moderate-size set of constant values around for expected frequent use. In Python's case those might be values like True, None, 'o', 1, 2, etc. This way, when one of those common values is needed, there is no overhead to create it--just refer to the existing value.

Related

Tuple vs String vs frozenset. Immutable objects and the number of copies in memory

a = "haha"
b = "haha"
print a is b # this is True
The above code prints true. I've read that one of the reasons for this is because strings are immutable, so one copy in memory will be enough. But in the case of a tuple:
a = (1, 2, 3)
b = (1, 2, 3)
print a is b # this is False
This will print False despite the fact that tuples are also immutable in python. After doing some more research, I discovered that tuples can contain mutable elements, so I guess it makes sense to have multiple copies of tuples in memory if it's too expensive to figure out whether a tuple contains mutable objects or not. But when I tried it on frozenset
a = frozenset([1,2])
b = frozenset([1,2])
print a is b # False
This will also print false. As far as I know frozenset are themselves immutable and can only contain immutable objects (I tried to create a frozenset which contains a tuple which contains a mutable list but it's not allowed), and that we can use == to check if two frozensets are identical in value, so why does python create two copies of them in memory?

Your sentence "I've read that one of the reasons for this is because strings are immutable, so one copy in memory will be enough." is correct, but it does not hold all the time.
For example, if you do the same with the string
"dgjudfigur89tyur9egjr9ivr89egre8frejf9reimfkldsmgoifsgjurt89igjkmrt0ivmkrt8g,rt89gjtrt"
it won't be the same object (at least on my version of Python).
The same phenomenon can be reproduced with integers, where 256 will be the same object but 257 won't.
It has to do with the way Python caches objects: it keeps "simple" objects around. Each type has its own criteria: for strings it is containing only certain characters, for integers it is their range.
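A small sketch of the integer case, assuming CPython (other implementations may cache differently or not at all). The values are built at runtime with int() so that the compiler cannot simply share one constant:

```python
# Build the integers at runtime so no compile-time constant is shared.
small_a, small_b = int("256"), int("256")
large_a, large_b = int("257"), int("257")

# Identity depends on the interpreter's cache; treat it as a curiosity.
print(small_a is small_b)
print(large_a is large_b)

# Equality is what the language guarantees, so compare values with ==.
print(small_a == small_b, large_a == large_b)
```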

It's because of the way Python compiles code to bytecode. When your program is run the first time, the source is compiled into bytecode. When the compiler sees string (or some integer) literals in the code, it creates one object per literal value and uses a reference to that object wherever you typed that literal. But in the case of a tuple it's difficult (in some cases impossible) to determine that the tuples are the same, so it doesn't take the extra time to perform this optimization. It is for this reason that you should not generally use is for comparing values.
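You can look at the compiled constants yourself. The following is a sketch for CPython; the source string and the "<demo>" filename are made up for illustration, and co_consts is a CPython code-object attribute:

```python
# Compile two assignments that use the same string literal.
source = 'a = "haha"\nb = "haha"\n'
code = compile(source, "<demo>", "exec")

# The literal appears once in the constants table of the code object.
haha_count = code.co_consts.count("haha")

# Run the compiled code and check that both names refer to that one object.
namespace = {}
exec(code, namespace)
print(haha_count, namespace["a"] is namespace["b"])
```

Strings that are built at runtime do not go through this table, which is one reason is gives inconsistent answers for equal strings.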

What could affect Python string comparison performance for strings over 64 characters?

I'm trying to evaluate whether comparing two strings gets slower as their length increases. My calculations suggest comparing strings should take amortized constant time, but my Python experiments yield strange results:
Here is a plot of string length (1 to 400) versus time in milliseconds. Automatic garbage collection is disabled, and gc.collect is run between every iteration.
I'm comparing 1 million random strings each time, counting matches as follows. The process is repeated 50 times before taking the min of all measured times.
for index in range(COUNT):
    if v1[index] == v2[index]:
        matches += 1
    else:
        non_matches += 1
What might account for the sudden increase around length 64?
Note: The following snippet can be used to try to reproduce the problem assuming v1 and v2 are two lists of random strings of length n and COUNT is their length.
timeit.timeit("for i in range(COUNT): v1[i] == v2[i]",
"from __main__ import COUNT, v1, v2", number=50)
Further note: I've made two extra tests: comparing string with is instead of == suppresses the problem completely, and the performance is about 210ms/1M comparisons.
Since interning has been mentioned, I made sure to add a white space after each string, which should prevent interning; that doesn't change anything. Is it something else than interning then?
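For anyone who wants to try this, here is one possible self-contained setup for the snippet in the note above. It is only a sketch: random_strings, LENGTH, the alphabet and the small COUNT are assumptions chosen so it runs quickly, not the original benchmark, and it uses Python 3.

```python
import random
import string
import timeit

COUNT = 10_000   # much smaller than the 1 million in the question
LENGTH = 80      # try values on both sides of 64

def random_strings(count, length, rng):
    """Return `count` random alphanumeric strings of the given length."""
    alphabet = string.ascii_letters + string.digits
    return ["".join(rng.choices(alphabet, k=length)) for _ in range(count)]

rng = random.Random(0)  # fixed seed so runs are repeatable
v1 = random_strings(COUNT, LENGTH, rng)
v2 = random_strings(COUNT, LENGTH, rng)

elapsed = timeit.timeit(
    "for i in range(COUNT): v1[i] == v2[i]",
    globals={"COUNT": COUNT, "v1": v1, "v2": v2},
    number=5,
)
print(f"{elapsed:.4f} s for {5 * COUNT} comparisons at length {LENGTH}")
```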

Python can 'intern' short strings: it stores them in a special cache and re-uses string objects from that cache.
When comparing strings, it first tests whether both operands are the same pointer (e.g. an interned string):
if (a == b) {
    switch (op) {
    case Py_EQ:case Py_LE:case Py_GE:
        result = Py_True;
        goto out;
    // ...
Only if that pointer comparison fails does it use a size check and memcmp to compare the strings.
Interning normally only takes place for identifiers (function names, arguments, attributes, etc.) however, not for string values created at runtime.
Another possible culprit is string constants; string literals used in code are stored as constants at compile time and reused throughout; again only one object is created and identity tests are faster on those.
For string objects that are not the same, Python tests for equal length, equal first characters then uses the memcmp() function on the internal C strings. If your strings are not interned or otherwise are reusing the same objects, all other speed characteristics come down to the memcmp() function.
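To see the pointer shortcut from Python, you can time a comparison of a string with itself against a comparison with an equal string that was built separately. This is a rough sketch (the names text and copy are illustrative, and absolute timings depend on the machine):

```python
import timeit

# A long string built at runtime, and an equal string built separately.
text = "".join(chr(ord("a") + i % 26) for i in range(100_000))
copy = "".join(list(text))

# Same object on both sides: the pointer check can answer immediately.
same_object = timeit.timeit(
    "text == text", globals={"text": text}, number=10_000
)

# Equal contents in a different object: the contents must be compared.
equal_copy = timeit.timeit(
    "text == copy", globals={"text": text, "copy": copy}, number=10_000
)

print(f"same object: {same_object:.4f} s, equal copy: {equal_copy:.4f} s")
```

On CPython the second timing is typically much larger, because it is the only one that has to look at the characters.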

I am just making wild guesses but you asked "what might" rather than what does so here are some possibilities:
The CPU cache line size is 64 bytes and longer strings cause a cache miss.
Python might store strings of 64 bytes in one kind of structure and longer strings in a more complicated structure.
Related to the last one: it might zero-pad strings into a 64-byte array and is able to use very fast SSE2 vector instructions to match two strings.

Python Remove last char from string and return it

While I know that there is the possibility:
>>> a = "abc"
>>> result = a[-1]
>>> a = a[:-1]
Now I also know that strings are immutable and therefore something like this:
>>> a.pop()
c
is not possible.
But is this really the preferred way?

Strings are "immutable" for good reason: It really saves a lot of headaches, more often than you'd think. It also allows python to be very smart about optimizing their use. If you want to process your string in increments, you can pull out part of it with split() or separate it into two parts using indices:
a = "abc"
a, result = a[:-1], a[-1]
This shows that you're splitting your string in two. If you'll be examining every byte of the string, you can iterate over it (in reverse, if you wish):
for result in reversed(a):
    ...
I should add this seems a little contrived: Your string is more likely to have some separator, and then you'll use split:
ans = "foo,blah,etc."
for a in ans.split(","):
    ...

Not only is it the preferred way, it's the only reasonable way. Because strings are immutable, in order to "remove" a char from a string you have to create a new string whenever you want a different string value.
You may be wondering why strings are immutable, given that you have to make a whole new string every time you change a character. After all, C strings are just arrays of characters and are thus mutable, and some languages that support strings more cleanly than C allow mutable strings as well. There are two reasons to have immutable strings: security/safety and performance.
Security is probably the most important reason for strings to be immutable. When strings are immutable, you can't pass a string into some library and then have that string change from under your feet when you don't expect it. You may wonder which library would change string parameters, but if you're shipping code to clients you can't control their versions of the standard library, and malicious clients may change out their standard libraries in order to break your program and find out more about its internals. Immutable objects are also easier to reason about, which is really important when you try to prove that your system is secure against particular threats. This ease of reasoning is especially important for thread safety, since immutable objects are automatically thread-safe.
Performance is surprisingly often better for immutable strings. Because a string can never change, the runtime can safely share one string object between many references, cache its hash, and use it as a dictionary key, so you get copy semantics without actually copying, which is a real performance win. (Note that in CPython, taking a slice of a str does copy the selected characters into a new string object; it does not place a view over the original.)
Eric Lippert explains more about the rationale behind the immutability of strings (in C#, not Python) here.

The precise wording of the question makes me think it's impossible.
return to me means you have a function, which you have passed a string as a parameter.
You cannot change this parameter. Assigning to it will only change the value of the parameter within the function, not the passed in string. E.g.
>>> def removeAndReturnLastCharacter(a):
...     c = a[-1]
...     a = a[:-1]
...     return c
>>> b = "Hello, Gaukler!"
>>> removeAndReturnLastCharacter(b)
!
>>> b # b has not been changed
Hello, Gaukler!
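Since the function cannot change the caller's string, one workaround is to return both pieces and let the caller rebind its own name. A minimal sketch, where pop_last is a hypothetical helper and not a built-in:

```python
def pop_last(text):
    """Return (text without its last character, the last character)."""
    if not text:
        raise ValueError("cannot pop from an empty string")
    return text[:-1], text[-1]

b = "Hello, Gaukler!"
b, last = pop_last(b)  # the caller rebinds b to the shortened string
print(repr(b), repr(last))
```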

Yes, python strings are immutable and any modification will result in creating a new string. This is how it's mostly done.
So, go ahead with it.

I decided to go with a for loop and just avoid the item in question, is it an acceptable alternative?
new = ''
for item in str:
    if item == str[n]:
        continue
    else:
        new += item
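One thing to watch with the loop above: it compares by value, so it skips every character equal to the one at index n, not just that one position. A short sketch of the difference (remove_at and remove_matching are illustrative names):

```python
def remove_at(text, n):
    """Drop only the character at index n, using slicing."""
    return text[:n] + text[n + 1:]

def remove_matching(text, n):
    """What the loop does: drop every character equal to text[n]."""
    return "".join(ch for ch in text if ch != text[n])

word = "banana"
print(remove_at(word, 1))        # removes just the first 'a'
print(remove_matching(word, 1))  # removes all three 'a's
```

If you only want to drop one position, slicing is shorter and does what you mean.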

Exploits in Python - manipulating hex strings

I'm quite new to python and trying to port a simple exploit I've written for a stack overflow (just a nop sled, shell code and return address). This isn't for nefarious purposes but rather for a security lecture at a university.
Given a hex string (deadbeef), what are the best ways to:
represent it as a series of bytes
add or subtract a value
reverse the order (for x86 memory layout, i.e. efbeadde)
Any tips and tricks regarding common tasks in exploit writing in python are also greatly appreciated.

In Python 2.6 and above, you can use the built-in bytearray class.
To create your bytearray object:
b = bytearray.fromhex('deadbeef')
To alter a byte, you can reference it using array notation:
b[2] += 7
To reverse the bytearray in place, use b.reverse(). To create an iterator that iterates over it in reverse order, you can use the reversed function: reversed(b).
You may also be interested in the new bytes class in Python 3, which is like bytearray but immutable.
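Putting the three tasks from the question together, here is a sketch in Python 3. The variable names are illustrative, and treating the four bytes as one big-endian integer for the arithmetic is an assumption about what "add a value" means:

```python
# 1. Represent the hex string as bytes.
raw = bytes.fromhex("deadbeef")

# 2a. Change a single byte in a mutable copy (0xbe + 7 == 0xc5).
bumped = bytearray(raw)
bumped[2] += 7

# 2b. Or treat all four bytes as one number and add to that.
value = int.from_bytes(raw, "big") + 1  # 0xdeadbef0

# 3. Reverse the byte order for a little-endian (x86) layout.
reversed_hex = raw[::-1].hex()
little = value.to_bytes(4, "little")

print(bumped.hex(), reversed_hex, little.hex())
```

Note that a bytearray element must stay in the range 0 to 255, so adding past 255 raises ValueError rather than wrapping around.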

Not sure if this is the best way...
hex_str = "deadbeef"
bytes = "".join(chr(int(hex_str[i:i+2],16)) for i in xrange(0,len(hex_str),2))
rev_bytes = bytes[::-1]
Or might be simpler:
bytes = "\xde\xad\xbe\xef"
rev_bytes = bytes[::-1]

In Python 2.x, regular str values are binary-safe. You can use the binascii module's b2a_hex and a2b_hex functions to convert to and from hexadecimal.
You can use ordinary string methods to reverse or otherwise rearrange your bytes. However, doing any kind of arithmetic would require you to use the ord function to get numeric values for individual bytes, then chr to convert the result back, followed by concatenation to reassemble the modified string.
For mutable sequences with easier arithmetic, use the array module with type code 'B'. These can be initialized from the results of a2b_hex if you're starting from hexadecimal.
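A sketch of the array approach, written for Python 3 (where binascii works on bytes rather than on str); the variable names are illustrative:

```python
import array
import binascii

# Decode the hex digits into raw bytes, then into a mutable array of
# unsigned bytes (type code 'B').
data = array.array("B", binascii.a2b_hex(b"deadbeef"))

data[0] -= 1    # plain integer arithmetic on one byte: 0xde -> 0xdd
data.reverse()  # in-place reversal of the byte order

# Convert back to a hex string for display.
result_hex = binascii.b2a_hex(data.tobytes()).decode("ascii")
print(result_hex)
```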

Python: Why does ("hello" is "hello") evaluate as True? [duplicate]

Why does "hello" is "hello" produce True in Python?
I read the following here:
If two string literals are equal, they have been put to same
memory location. A string is an immutable entity. No harm can
be done.
So there is one and only one place in memory for every Python string? Sounds pretty strange. What's going on here?

Python (like Java, C, C++, .NET) uses string pooling / interning. The interpreter realises that "hello" is the same as "hello", so it optimizes and uses the same location in memory.
Another goodie: "hell" + "o" is "hello" ==> True

So there is one and only one place in memory for every Python string?
No, only ones the interpreter has decided to optimise, which is a decision based on a policy that isn't part of the language specification and which may change in different CPython versions.
eg. on my install (2.6.2 Linux):
>>> 'X'*10 is 'X'*10
True
>>> 'X'*30 is 'X'*30
False
similarly for ints:
>>> 2**8 is 2**8
True
>>> 2**9 is 2**9
False
So don't rely on 'string' is 'string': even just looking at the C implementation it isn't safe.

Literal strings are probably grouped based on their hash or something similar. Two of the same literal strings will be stored in the same memory, and any references both refer to that.
Memory Code
-------
| myLine = "hello"
| /
|hello <
| \
| myLine = "hello"
-------

The is operator returns true if both arguments are the same object. Your result is a consequence of this, and the quoted bit.
In the case of string literals, these are interned, meaning they are compared to known strings. If an identical string is already known, the literal takes that value, instead of an alternative one. Thus, they become the same object, and the expression is true.
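Interning can also be requested explicitly with sys.intern, which makes the effect easy to see for strings built at runtime. A small sketch (the names a, b, ia and ib are illustrative):

```python
import sys

# Two equal strings built at runtime; they are not compile-time literals.
a = "".join(["hello", " ", "world"])
b = "".join(["hello", " ", "world"])
print(a == b)   # equal by value
print(a is b)   # identity here is an implementation detail

# sys.intern returns the canonical object for a given string value.
ia, ib = sys.intern(a), sys.intern(b)
print(ia is ib)  # both names now refer to the one interned object
```

Only the result after sys.intern is something you can rely on; for ordinary code, compare strings with ==.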

The Python interpreter/compiler parses the string literals, i.e. the quoted list of characters. When it does this, it can detect "I've seen this string before", and use the same representation as last time. It can do this since it knows that strings defined in this way cannot be changed.

Why is it strange? If the string is immutable, it makes a lot of sense to store it only once. .NET has the same behavior.

I think if any two variables (not just strings) contain the same value, the value will be stored only once not twice and both the variables will point to the same location. This saves memory.
