We can defeat the small-integer cache this way (a runtime calculation lets us avoid the caching layer):
>>> n = 674039
>>> one1 = 1
>>> one2 = (n ** 9 + 1) % (n ** 9)
>>> one1 == one2
True
>>> one1 is one2
False
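For contrast, here is the cached path that the calculation sidesteps: CPython keeps small integers (the range [-5, 256] is an implementation detail) in a shared pool, so equal values are the same object even when computed at runtime:
>>> a = 257 - 256
>>> b = 1
>>> a is b
True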
How can you defeat the small-string intern, i.e. obtain the following result:
>>> one1 = "1"
>>> one2 = <???>
>>> type(one2) is str and one1 == one2
True
>>> one1 is one2
False
sys.intern mentions that "Interned strings are not immortal", but there's no context about how a string could be kicked out of the intern table, or how to create a str instance that avoids the caching layer.
Since interning is a CPython implementation detail, answers relying on undocumented implementation details are OK/expected.
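For reference, sys.intern returns the canonical interned copy of a string, so manual interning restores identity (a CPython-specific sketch):
>>> import sys
>>> s = "".join(["he", "llo"])  # built at runtime, not interned automatically
>>> sys.intern(s) is sys.intern("hello")
True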
Unicode strings consisting of a single character (with a value smaller than 128, or more precisely from the Latin-1 range) are the most complicated case, because those strings aren't really interned. Instead (similar to the small-integer pool, and identical to the behavior for bytes) they are created at interpreter start-up and are kept in an array for as long as the interpreter is alive:
struct _Py_unicode_state {
    ...
    /* Single character Unicode strings in the Latin-1 range are being
       shared as well. */
    PyObject *latin1[256];
    ...
    /* This dictionary holds all interned unicode strings...
     */
    PyObject *interned;
    ...
};
So every time a unicode object of length 1 is created, the character value is checked, and if it is in the Latin-1 range the shared object from the latin1 array is returned. E.g. in unicode_decode_utf8:
/* ASCII is equivalent to the first 128 ordinals in Unicode. */
if (size == 1 && (unsigned char)s[0] < 128) {
    if (consumed) {
        *consumed = 1;
    }
    return get_latin1_char((unsigned char)s[0]);
}
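The effect of the shared array is easy to observe from Python: however the string is produced, a single Latin-1 character is always the same object (the identities are CPython implementation details):
>>> a = chr(49)        # '1' computed at runtime
>>> b = "x1"[1]        # '1' sliced out of a longer string
>>> c = b"1".decode()  # '1' through unicode_decode_utf8
>>> a is b, b is c
(True, True)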
One could even argue that if there were a way to circumvent this in the interpreter, it would be a (performance) bug.
One possibility is to populate the unicode data ourselves using the C-API. I use Cython for the proof of concept, but ctypes could be used to the same effect:
%%cython
cdef extern from *:
    """
    PyObject* create_new_unicode(char *ch)
    {
        PyUnicodeObject *ob = (PyUnicodeObject *)PyUnicode_New(1, 127);
        Py_UCS1 *data = PyUnicode_1BYTE_DATA(ob);
        data[0] = ch[0]; // fill data without going through unicode_decode_utf8
        return (PyObject*)ob;
    }
    """
    object create_new_unicode(char *ch)

def gen1():
    return create_new_unicode(b"1")
Noteworthy details:
PyUnicode_New does not look into the latin1 array, because the characters aren't set yet.
For simplicity, the above works only for ASCII characters, thus we pass 127 as maxchar to PyUnicode_New. As a result, we can access the data via PyUnicode_1BYTE_DATA, which makes it easy to manipulate manually without much ado.
And now:
a, b = gen1(), gen1()
a is b, a == b
# yields (False, True)
as wanted.
Here is a similar idea, but implemented with ctypes:
from ctypes import POINTER, py_object, c_ssize_t, byref, pythonapi

PyUnicode_New = pythonapi.PyUnicode_New
PyUnicode_New.argtypes = (c_ssize_t, c_ssize_t)
PyUnicode_New.restype = py_object

PyUnicode_CopyCharacters = pythonapi._PyUnicode_FastCopyCharacters
PyUnicode_CopyCharacters.argtypes = (py_object, c_ssize_t, py_object, c_ssize_t, c_ssize_t)
PyUnicode_CopyCharacters.restype = c_ssize_t

def clone(orig):
    cloned = PyUnicode_New(1, 127)
    PyUnicode_CopyCharacters(cloned, 0, orig, 0, 1)
    return cloned
Noteworthy details:
It is not possible to use PyUnicode_1BYTE_DATA with ctypes, because it is a macro. An alternative would be to calculate the offset of the data member and access that memory directly (but this depends on the platform and doesn't feel very portable).
As a workaround, PyUnicode_CopyCharacters is used (there are probably other ways to achieve the same), which is more abstract and portable than calculating/accessing the memory directly.
Actually, _PyUnicode_FastCopyCharacters is used, because PyUnicode_CopyCharacters would detect that the target unicode has multiple references and raise an error. _PyUnicode_FastCopyCharacters doesn't perform those checks and does as asked.
And now:
a="1"
b=clone(a)
a is b, a==b
# yields (False, True)
For strings longer than 1 character, it is a lot easier to avoid interning, e.g.:
a="12"
b="123"[0:2]
a is b, a == b
#yields (False, True)
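Any runtime construction does the trick, as long as the result isn't a compile-time constant that the peephole optimizer could fold (again a CPython-specific sketch):
a = "12"
b = "".join(["1", "2"])  # built at runtime, so not interned
a is b, a == b
# yields (False, True)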
Related
I am sending strings to my BPF C code and I am not sure if the strings passed in are null-terminated. If they are not, is there a way to make them null-terminated? I am sending in my lines of code to BPF so I can count them manually using my stringCounter function, but I keep hitting an infinite loop, sadly. Here is what my Python code looks like:
b = BPF(src_file="hello.c")
lookupTable = b["lookupTable"]

# add hello.csv to the lookupTable array
f = open("hello copy.csv", "r")
contents = f.readlines()
for i in range(0, len(contents)):
    string = contents[i].encode('utf-8')
    lookupTable[ctypes.c_int(i)] = ctypes.create_string_buffer(string, len(string))
And here is the code I found for my null-terminated string counter:
int stringLength(char* txt)
{
    int i = 0, count = 0;
    while (txt[i++] != '\0') {
        count += 1;
    }
    return count;
}
ctypes.create_string_buffer(string, len(string)) is not zero-terminated. But ctypes.create_string_buffer(string) is. It's easy to see that, since ctypes.create_string_buffer(string)[-1] is b'\x00', whereas ctypes.create_string_buffer(string, len(string))[-1] is the last byte in string.
In other words, if you want a zero-terminated buffer, let create_string_buffer figure out the length. (It uses the actual length from the Python bytes object, so it doesn't get fooled by internal NUL bytes, if you were worried about that.)
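A quick check of both variants (a sketch with a short literal; indexing a ctypes char array yields bytes):
import ctypes

s = b"hello"
buf_auto = ctypes.create_string_buffer(s)           # len(s) + 1 bytes, NUL appended
buf_exact = ctypes.create_string_buffer(s, len(s))  # exactly len(s) bytes, no NUL
print(buf_auto[-1])   # b'\x00'
print(buf_exact[-1])  # b'o' - the last byte of the string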
I'm unfamiliar with BPF, but for ctypes: if your string isn't modified by the C code, you don't need create_string_buffer, which exists to create mutable buffers. Python Unicode and byte strings are always passed to C code as nul-terminated wchar_t* or char*, respectively. Assuming your function is in test.dll or test.so:
import ctypes as ct
dll = ct.CDLL('./test')
dll.stringLength.argtypes = ct.c_char_p,
dll.stringLength.restype = ct.c_int
print(dll.stringLength('somestring'.encode())) # If string is Unicode
print(dll.stringLength(b'someotherstring')) # If already a byte string
Output:
10
15
Note this doesn't preclude having a nul in the string itself, but your count function will return a shorter value in that case:
print(dll.stringLength(b'some\0string')) # Output: 4
Your code could probably be written as follows, assuming there isn't some requirement that a BPF object have hard-coded ctypes types as indexes and values:
with open("hello copy.csv") as file:
    for i, line in enumerate(file):
        lookupTable[i] = line.encode()
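If the BPF side does need an explicit terminator (I can't verify BPF's table semantics here), a hedged variant is to append one yourself:
with open("hello copy.csv") as file:
    for i, line in enumerate(file):
        # explicit NUL so a C-side strlen-style loop terminates
        lookupTable[i] = line.encode() + b"\0"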
I am new to both C and ctypes, but I cannot seem to find an answer on how to do this, particularly with a numpy array.
C Code
// Import/Export Macros
#define DllImport __declspec( dllimport )
#define DllExport __declspec( dllexport )

// Test function for receiving and transmitting arrays
extern "C"
DllExport void c_fun(char **string_array)
{
    string_array[0] = "foo";
    string_array[1] = "bar";
    string_array[2] = "baz";
}
Python Code
import numpy as np
import ctypes

# Load the DLL library...

# Define function argtypes
lib.c_fun.argtypes = [np.ctypeslib.ndpointer(ctypes.c_char, ndim=2, flags="C_CONTIGUOUS")]

# Initialize, call, and print
string_array = np.empty((3, 10), dtype=ctypes.c_char)
lib.c_fun(string_array)
print(string_array)
I am sure there is some encoding/decoding that needs to happen as well, but I am not sure how/which. Thanks!
Addressing the C code part of the question only...
As noted in comments, C does not allow assignments in this way if the three variables shown are defined as char arrays:
string_array[0] = "foo";
string_array[1] = "bar";
string_array[2] = "baz";
Use the following:
strcpy(string_array[0], "foo");
strcpy(string_array[1], "bar");
strcpy(string_array[2], "baz");
And as long as the caller to this function is pre-allocating and freeing memory for the buffers, this part of the solution is now at least syntactically correct.
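For illustration, the strcpy variant could be driven from ctypes roughly like this (a sketch: the library name ./test and the char** signature are assumptions, and the Python side pre-allocates the buffers):
import ctypes

lib = ctypes.CDLL("./test")  # hypothetical library containing c_fun
lib.c_fun.argtypes = [ctypes.POINTER(ctypes.c_char_p)]
lib.c_fun.restype = None

# three writable 10-byte buffers plus an array of pointers to them,
# so the C side can strcpy into memory owned by Python
bufs = [ctypes.create_string_buffer(10) for _ in range(3)]
arr = (ctypes.c_char_p * 3)(*(ctypes.cast(b, ctypes.c_char_p) for b in bufs))
lib.c_fun(arr)
print([b.value for b in bufs])  # expect [b'foo', b'bar', b'baz']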
But if the strings do indeed need to be immutable to be compatible with Python, then in the caller function allocate memory to create char **string_array such that you can pass an array of 3 pointers as the argument. For example:
char **string_array = malloc(3 * sizeof(*string_array)); // creates an array of 3 pointers
Then call it as:
c_fun(string_array);
This allows use of the DLL call just as shown in your original post:
DllExport void c_fun(char **string_array)
{
    // array of pointers being assigned the addresses of 3 string literals
    string_array[0] = "foo"; // these will now be immutable strings
    string_array[1] = "bar";
    string_array[2] = "baz";
}
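On the Python side, a matching call for this pointer version might look like the following sketch (it assumes the function above is compiled into test.dll / test.so; no numpy needed):
import ctypes

lib = ctypes.CDLL("./test")  # hypothetical
lib.c_fun.argtypes = [ctypes.POINTER(ctypes.c_char_p)]
lib.c_fun.restype = None

# an array of 3 char* slots; c_fun fills them with the addresses of its
# string literals, so nothing has to be allocated or freed for the strings
string_array = (ctypes.c_char_p * 3)()
lib.c_fun(string_array)
print([s.decode() for s in string_array])  # expect ['foo', 'bar', 'baz']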
There appears to be a calling-convention mismatch: the positions and contents of arguments are incorrect when calling a small function loaded with Python's ctypes module.
In the example I built up while trying to get something working, one positional argument gets another's value while the other gets garbage.
The ctypes docs state that cdll.LoadLibrary expects the cdecl convention. The resulting standard boilerplate:
# Tell Rustc to output a dynamically linked library
crate-type = ["cdylib"]
// Specify clean symbol and cdecl calling convention
#[no_mangle]
pub extern "cdecl" fn boring_function(
    n: *mut size_t,
    in_data: *mut [c_ulong],
    out_data: *mut [c_double],
    garbage: *mut [c_double],
) -> c_int {
    //...
Loading our library after build...
lib = ctypes.CDLL("nothing/lib/playtoys.so")
lib.boring_function.restype = ctypes.c_int
Load the result into Python and call it with some initialized data:
data_len = 8
in_array_t = ctypes.c_ulong * data_len
out_array_t = ctypes.c_double * data_len
in_array = in_array_t(7, 7, 7, 7, 7, 8, 7, 7)
out_array = out_array_t(10000.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9)
val = ctypes.c_size_t(data_len)
in_array_p = ctypes.byref(in_array)
out_array_p = ctypes.byref(out_array)
n_p = ctypes.byref(val)
garbage = n_p
res = boring_function(n_p,
                      in_array_p,
                      # garbage cannot be observed in any callee arg
                      ctypes.cast(garbage, ctypes.POINTER(out_array_t)),
                      out_array_p)
Notice the garbage parameter. It is so-named because it winds up containing a garbage address. Note that its position is swapped with out_array_p in the Python call and the Rust declaration.
[src/hello.rs:29] n = 0x00007f56dbce5bc0
[src/hello.rs:30] in_data = 0x00007f56f81e3270
[src/hello.rs:31] out_data = 0x00007f56f81e3230
[src/hello.rs:32] garbage = 0x000000000000000a
in_data, out_data, and n print the correct values in this configuration. The positional swap between garbage and out_data makes this possible.
Other examples using more or fewer arguments reveal similar patterns of intermediate ordered variables containing odd values that resemble addresses from earlier in the program, or unrelated garbage.
Either I'm missing something in how I set up the calling convention, or some special magic in argtypes must be missing. So far I have had no luck with changing the declared calling conventions or with explicit argtypes. Are there any other knobs I should try turning?
in_data: *mut [c_ulong],
A slice is not an FFI-safe data type. Namely, Rust's slices use fat pointers, which take up two pointer-sized values.
You need to pass the data pointer and length as two separate arguments.
See also:
Why can comparing two seemingly equal pointers with == return false?
Rust functions with slice arguments in The Rust FFI Omnibus
The complete example from the Omnibus:
extern crate libc;
use libc::{uint32_t, size_t};
use std::slice;

#[no_mangle]
pub extern fn sum_of_even(n: *const uint32_t, len: size_t) -> uint32_t {
    let numbers = unsafe {
        assert!(!n.is_null());
        slice::from_raw_parts(n, len as usize)
    };

    let sum = numbers.iter()
        .filter(|&v| v % 2 == 0)
        .fold(0, |acc, v| acc + v);

    sum as uint32_t
}
#!/usr/bin/env python3
import sys, ctypes
from ctypes import POINTER, c_uint32, c_size_t

prefix = {'win32': ''}.get(sys.platform, 'lib')
extension = {'darwin': '.dylib', 'win32': '.dll'}.get(sys.platform, '.so')
lib = ctypes.cdll.LoadLibrary(prefix + "slice_arguments" + extension)

lib.sum_of_even.argtypes = (POINTER(c_uint32), c_size_t)
lib.sum_of_even.restype = ctypes.c_uint32

def sum_of_even(numbers):
    buf_type = c_uint32 * len(numbers)
    buf = buf_type(*numbers)
    return lib.sum_of_even(buf, len(numbers))

print(sum_of_even([1, 2, 3, 4, 5, 6]))
Disclaimer: I am the primary author of the Omnibus
I wrote a tree object in Cython that has many nodes, each containing a single unicode character. I wanted to test whether the character gets interned if I use Py_UNICODE or str as the variable type. I'm trying to test this by creating multiple instances of the node class and getting the memory address of the character for each, but somehow I end up with the same memory address even when the different instances contain different characters. Here is my code:
from libc.stdint cimport uintptr_t

cdef class Node():
    cdef:
        public str character
        public unsigned int count
        public Node lo, eq, hi

    def __init__(self, str character):
        self.character = character

    def memory(self):
        return <uintptr_t>&self.character[0]
I am trying to compare the memory locations like so, from Python:
a = Node("a")
a2 = Node("a")
b = Node("b")
print(a.memory(), a2.memory(), b.memory())
But the memory addresses that print out are all the same. What am I doing wrong?
What you are doing is not what you think you are doing.
self.character[0] doesn't return the address/reference of the first character (as it would for an array, for example), but a Py_UCS4 value (i.e. an unsigned 32-bit integer), which is copied to a (local, temporary) variable on the stack.
In your function, <uintptr_t>&self.character[0] gets you the address of that local variable on the stack, which by chance is always the same, because the stack layout is the same every time memory is called.
To make this clearer, here is the difference from a char *c_string, where &c_string[0] gives you the address of the first character in c_string.
Compare:
%%cython
from libc.stdint cimport uintptr_t

cdef char *c_string = "name"

def get_addresses_from_chars():
    for i in range(4):
        print(<uintptr_t>&c_string[i])

cdef str py_string = "name"

def get_addresses_from_pystr():
    for i in range(4):
        print(<uintptr_t>&py_string[i])
And now:
>>> get_addresses_from_chars() # works - different addresses every time
# ...7752
# ...7753
# ...7754
# ...7755
>>> get_addresses_from_pystr() # works differently - the same address.
# ...0672
# ...0672
# ...0672
# ...0672
You can see it this way: c_string[...] is a cdef functionality, but py_string[...] is a Python functionality and thus by construction cannot return an address.
To influence the stack layout, you could use a recursive function:
def memory(self, level):
    if level == 0:
        return <uintptr_t>&self.character[0]
    else:
        return self.memory(level-1)
Now calling it with a.memory(0), a.memory(1) and so on will give you different addresses (unless tail-call optimization kicks in; I don't believe it will here, but you could disable optimization (-O0) just to be sure), because depending on the level/recursion depth, the local variable whose address is returned sits in a different place on the stack.
To see whether unicode objects are interned, it is enough to use id, which yields the address of the object (this is a CPython implementation detail), so you don't need Cython at all:
>>> id(a.character) == id(a2.character)
# True
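With the Node class from the question compiled as before, the contrast between equal and different characters shows up directly:
>>> a, a2, b = Node("a"), Node("a"), Node("b")
>>> id(a.character) == id(a2.character)
# True - both "a"s are the shared single-character object
>>> id(a.character) == id(b.character)
# False - "b" is a different object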
or, in Cython, doing the same as id does (a little bit faster):
%%cython
from libc.stdint cimport uintptr_t
from cpython cimport PyObject
...
    def memory(self):
        # cast from object to PyObject*, so the address can be taken
        return <uintptr_t>(<PyObject*>self.character)
You need to cast the object to PyObject *, so that Cython will allow taking the address of the variable.
And now:
>>> ...
>>> print(a.memory(), a2.memory(), b.memory())
# ...5800 ...5800 ...5000
If you want to get the address of the first code point in the unicode object (which is not the same as the address of the string object), you can use <Py_UNICODE *>self.character, which Cython replaces by a call to PyUnicode_AsUnicode, e.g.:
%%cython
...
    def memory(self):
        return <uintptr_t>(<Py_UNICODE*>self.character), id(self.character)
and now
>>> ...
>>> print(a.memory(), a2.memory(), b.memory())
# (...768, ...800) (...768, ...800) (...144, ...000)
i.e. "a" is interned and has different address than "b" and code-points bufffer has a different address than the objects containing it (as one would expect).
I'm trying to figure out why this works, after lots and lots of messing about. obo.library_version is a C function which requires char ** as the input and does a strcpy to the passed-in buffer.
from ctypes import *

_OBO_C_DLL = 'obo.dll'
STRING = c_char_p

OBO_VERSION = _stdcall_libraries[_OBO_C_DLL].OBO_VERSION
OBO_VERSION.restype = c_int
OBO_VERSION.argtypes = [POINTER(STRING)]

def library_version():
    s = create_string_buffer('\000' * 32)
    t = cast(s, c_char_p)
    res = obo.library_version(byref(t))
    if res != 0:
        raise Error("OBO error %r" % res)
    return t.value, s.raw, s.value

library_version()
The above code returns
('OBO Version 1.0.1', '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', '')
What I don't understand is why 's' does not have any value. Anyone have any ideas?
When you cast s to c_char_p you store a new object in t, not a reference. So when you pass t to your function by reference, s doesn't get updated.
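The effect can be simulated in pure Python, without the DLL (a sketch; assigning to a c_char_p's value re-aims the pointer, much as the DLL does through the char** it receives):
import ctypes

s = ctypes.create_string_buffer(b"\0" * 32)
t = ctypes.cast(s, ctypes.c_char_p)  # a new object holding s's address

t.value = b"OBO Version 1.0.1"  # re-aims t; s is untouched
print(t.value)  # b'OBO Version 1.0.1'
print(s.value)  # b'' - the original buffer never changed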
UPDATE:
You are indeed correct:
cast takes two parameters, a ctypes object that is or can be converted to a pointer of some kind, and a ctypes pointer type. It returns an instance of the second argument, which references the same memory block as the first argument.
In order to get a reference to your string buffer, you need to use the following for your cast:
t = cast(s, POINTER(c_char*33))
I have no idea why c_char_p doesn't create a reference where this does, but there you go.
Because library_version requires a char**, they don't want you to allocate the characters (as you're doing with create_string_buffer). Instead, they just want you to pass in a reference to a pointer so they can return the address of where to find the version string.
So all you need to do is allocate the pointer, and then pass in a reference to that pointer.
The following code should work, although I don't have obo.dll (or know of a suitable replacement) to test it.
from ctypes import *

_OBO_C_DLL = 'obo.dll'
STRING = c_char_p

_stdcall_libraries = dict()
_stdcall_libraries[_OBO_C_DLL] = WinDLL(_OBO_C_DLL)

OBO_VERSION = _stdcall_libraries[_OBO_C_DLL].OBO_VERSION
OBO_VERSION.restype = c_int
OBO_VERSION.argtypes = [POINTER(STRING)]

def library_version():
    s_res = c_char_p()
    res = OBO_VERSION(byref(s_res))
    if res != 0:
        raise Error("OBO error %r" % res)
    return s_res.value

library_version()
[Edit]
I've gone a step further and written my own DLL with a possible implementation of OBO_VERSION that does not require an allocated character buffer and is not subject to any memory leaks.
int OBO_VERSION(char **pp_version)
{
    static char result[] = "Version 2.0";
    *pp_version = result;
    return 0; // success
}
As you can see, OBO_VERSION simply sets the value of *pp_version to a pointer to a null-terminated character array. This is likely how the real OBO_VERSION works. I've tested this against my originally suggested technique above, and it works as prescribed.