Why does a space affect the identity comparison of equal strings? [duplicate] - python

This question already has answers here:
'is' operator behaves differently when comparing strings with spaces
(6 answers)
Closed 8 years ago.
I've noticed that adding a space to identical strings makes them compare unequal using is, while the non-space versions compare equal.
a = 'abc'
b = 'abc'
a is b
#outputs: True
a = 'abc abc'
b = 'abc abc'
a is b
#outputs: False
I have read this question about comparing strings with == and is. I think this is a different question because the space character is changing the behavior, not the length of the string. See:
a = 'abc'
b = 'abc'
a is b # True
a = 'gfhfghssrtjyhgjdagtaerjkdhhgffdhfdah'
b = 'gfhfghssrtjyhgjdagtaerjkdhhgffdhfdah'
a is b # True
Why does adding a space to the string change the result of this comparison?

The python interpreter caches some strings based on certain criteria, the first abc string is cached and used for both but the second is not. It is the same for small ints from -5 to 256.
Because the strings are interned/cached assigning a and b to "abc" makes a and b point to the same objects in memory so using is, which checks if two objects are actually the same object, returns True.
The second string abc abc is not cached so they are two entirely different object in memory so out identity check using is returns False. This time a is not b. They are both pointing to different objects in memory.
In [43]: a = "abc" # python caches abc
In [44]: b = "abc" # it reuses the object when assigning to b
In [45]: id(a)
Out[45]: 139806825858808 # same id's, same object in memory
In [46]: id(b)
Out[46]: 139806825858808
In [47]: a = 'abc abc' # not cached
In [48]: id(a)
Out[48]: 139806688800984
In [49]: b = 'abc abc'
In [50]: id(b) # different id's different objects
Out[50]: 139806688801208
The criteria for caching strings is if the string only has letters, underscores and numbers in the string so in your case the space does not meet the criteria.
Using the interpreter there is one case where you can end up pointing to the same object even when the string does not meet the above criteria, multiple assignments.
In [51]: a,b = 'abc abc','abc abc'
In [52]: id(a)
Out[52]: 139806688801768
In [53]: id(b)
Out[53]: 139806688801768
In [54]: a is b
Out[54]: True
Looking codeobject.c source for deciding the criteria we see NAME_CHARS decides what can be interned:
#define NAME_CHARS \
"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"
/* all_name_chars(s): true iff all chars in s are valid NAME_CHARS */
static int
all_name_chars(unsigned char *s)
{
static char ok_name_char[256];
static unsigned char *name_chars = (unsigned char *)NAME_CHARS;
if (ok_name_char[*name_chars] == 0) {
unsigned char *p;
for (p = name_chars; *p; p++)
ok_name_char[*p] = 1;
}
while (*s) {
if (ok_name_char[*s++] == 0)
return 0;
}
return 1;
}
A string of length 0 or 1 will always be shared as we can see in the PyString_FromStringAndSize function in the stringobject.c source.
/* share short strings */
if (size == 0) {
PyObject *t = (PyObject *)op;
PyString_InternInPlace(&t);
op = (PyStringObject *)t;
nullstring = op;
Py_INCREF(op);
} else if (size == 1 && str != NULL) {
PyObject *t = (PyObject *)op;
PyString_InternInPlace(&t);
op = (PyStringObject *)t;
characters[*str & UCHAR_MAX] = op;
Py_INCREF(op);
}
return (PyObject *) op;
}
Not directly related to the question but for those interested PyCode_New also from the codeobject.c source shows how more strings are interned when building a codeobject once the strings meet the criteria in all_name_chars.
PyCodeObject *
PyCode_New(int argcount, int nlocals, int stacksize, int flags,
PyObject *code, PyObject *consts, PyObject *names,
PyObject *varnames, PyObject *freevars, PyObject *cellvars,
PyObject *filename, PyObject *name, int firstlineno,
PyObject *lnotab)
{
PyCodeObject *co;
Py_ssize_t i;
/* Check argument types */
if (argcount < 0 || nlocals < 0 ||
code == NULL ||
consts == NULL || !PyTuple_Check(consts) ||
names == NULL || !PyTuple_Check(names) ||
varnames == NULL || !PyTuple_Check(varnames) ||
freevars == NULL || !PyTuple_Check(freevars) ||
cellvars == NULL || !PyTuple_Check(cellvars) ||
name == NULL || !PyString_Check(name) ||
filename == NULL || !PyString_Check(filename) ||
lnotab == NULL || !PyString_Check(lnotab) ||
!PyObject_CheckReadBuffer(code)) {
PyErr_BadInternalCall();
return NULL;
}
intern_strings(names);
intern_strings(varnames);
intern_strings(freevars);
intern_strings(cellvars);
/* Intern selected string constants */
for (i = PyTuple_Size(consts); --i >= 0; ) {
PyObject *v = PyTuple_GetItem(consts, i);
if (!PyString_Check(v))
continue;
if (!all_name_chars((unsigned char *)PyString_AS_STRING(v)))
continue;
PyString_InternInPlace(&PyTuple_GET_ITEM(consts, i));
}
This answer is based on simple assignments using the cpython interpreter, as far as interning in relation to functions or any other functionality outside of simple assignments, that was not asked nor answered.
If anyone with a greater understanding of c code has anything to add feel free to edit.
There is a much more thorough explanation here of the whole string interning.

Related

How would I go about creating another int() function in Python so I can understand it?

I need to know how I would go about recreating a version of the int() function in Python so that I can fully understand it and create my own version with multiple bases that go past base 36. I can convert from a decimal to my own base (base 54, altered) just fine, but I need to figure out how to go from a string in my base's format to an integer (base 10).
Basically, I want to know how to go from my base, which I call base 54, to base 10. I don't need specifics, because if I have an example, I can work it out on my own. Unfortunately, I can't find anything on the int() function, though I know it has to be somewhere, since Python is open-source.
This is the closest I can find to it, but it doesn't provide source code for the function itself. Python int() test.
If you can help, thanks. If not, well, thanks for reading this, I guess?
You're not going to like this answer, but int(num, base) is defined in C (it's a builtin)
I went searching around and found it:
https://github.com/python/cpython/blob/e42b705188271da108de42b55d9344642170aa2b/Objects/longobject.c
/* Parses an int from a bytestring. Leading and trailing whitespace will be
* ignored.
*
* If successful, a PyLong object will be returned and 'pend' will be pointing
* to the first unused byte unless it's NULL.
*
* If unsuccessful, NULL will be returned.
*/
PyObject *
PyLong_FromString(const char *str, char **pend, int base)
{
int sign = 1, error_if_nonzero = 0;
const char *start, *orig_str = str;
PyLongObject *z = NULL;
PyObject *strobj;
Py_ssize_t slen;
if ((base != 0 && base < 2) || base > 36) {
PyErr_SetString(PyExc_ValueError,
"int() arg 2 must be >= 2 and <= 36");
return NULL;
}
while (*str != '\0' && Py_ISSPACE(Py_CHARMASK(*str))) {
str++;
}
if (*str == '+') {
++str;
}
else if (*str == '-') {
++str;
sign = -1;
}
if (base == 0) {
if (str[0] != '0') {
base = 10;
}
else if (str[1] == 'x' || str[1] == 'X') {
base = 16;
}
else if (str[1] == 'o' || str[1] == 'O') {
base = 8;
}
else if (str[1] == 'b' || str[1] == 'B') {
base = 2;
}
else {
/* "old" (C-style) octal literal, now invalid.
it might still be zero though */
error_if_nonzero = 1;
base = 10;
}
}
if (str[0] == '0' &&
((base == 16 && (str[1] == 'x' || str[1] == 'X')) ||
(base == 8 && (str[1] == 'o' || str[1] == 'O')) ||
(base == 2 && (str[1] == 'b' || str[1] == 'B')))) {
str += 2;
/* One underscore allowed here. */
if (*str == '_') {
++str;
}
}
if (str[0] == '_') {
/* May not start with underscores. */
goto onError;
}
start = str;
if ((base & (base - 1)) == 0) {
int res = long_from_binary_base(&str, base, &z);
if (res < 0) {
/* Syntax error. */
goto onError;
}
}
else {
/***
Binary bases can be converted in time linear in the number of digits, because
Python's representation base is binary. Other bases (including decimal!) use
the simple quadratic-time algorithm below, complicated by some speed tricks.
First some math: the largest integer that can be expressed in N base-B digits
is B**N-1. Consequently, if we have an N-digit input in base B, the worst-
case number of Python digits needed to hold it is the smallest integer n s.t.
BASE**n-1 >= B**N-1 [or, adding 1 to both sides]
BASE**n >= B**N [taking logs to base BASE]
n >= log(B**N)/log(BASE) = N * log(B)/log(BASE)
The static array log_base_BASE[base] == log(base)/log(BASE) so we can compute
this quickly. A Python int with that much space is reserved near the start,
and the result is computed into it.
The input string is actually treated as being in base base**i (i.e., i digits
are processed at a time), where two more static arrays hold:
convwidth_base[base] = the largest integer i such that base**i <= BASE
convmultmax_base[base] = base ** convwidth_base[base]
The first of these is the largest i such that i consecutive input digits
must fit in a single Python digit. The second is effectively the input
base we're really using.
Viewing the input as a sequence <c0, c1, ..., c_n-1> of digits in base
convmultmax_base[base], the result is "simply"
(((c0*B + c1)*B + c2)*B + c3)*B + ... ))) + c_n-1
where B = convmultmax_base[base].
Error analysis: as above, the number of Python digits `n` needed is worst-
case
n >= N * log(B)/log(BASE)
where `N` is the number of input digits in base `B`. This is computed via
size_z = (Py_ssize_t)((scan - str) * log_base_BASE[base]) + 1;
below. Two numeric concerns are how much space this can waste, and whether
the computed result can be too small. To be concrete, assume BASE = 2**15,
which is the default (and it's unlikely anyone changes that).
Waste isn't a problem: provided the first input digit isn't 0, the difference
between the worst-case input with N digits and the smallest input with N
digits is about a factor of B, but B is small compared to BASE so at most
one allocated Python digit can remain unused on that count. If
N*log(B)/log(BASE) is mathematically an exact integer, then truncating that
and adding 1 returns a result 1 larger than necessary. However, that can't
happen: whenever B is a power of 2, long_from_binary_base() is called
instead, and it's impossible for B**i to be an integer power of 2**15 when
B is not a power of 2 (i.e., it's impossible for N*log(B)/log(BASE) to be
an exact integer when B is not a power of 2, since B**i has a prime factor
other than 2 in that case, but (2**15)**j's only prime factor is 2).
The computed result can be too small if the true value of N*log(B)/log(BASE)
is a little bit larger than an exact integer, but due to roundoff errors (in
computing log(B), log(BASE), their quotient, and/or multiplying that by N)
yields a numeric result a little less than that integer. Unfortunately, "how
close can a transcendental function get to an integer over some range?"
questions are generally theoretically intractable. Computer analysis via
continued fractions is practical: expand log(B)/log(BASE) via continued
fractions, giving a sequence i/j of "the best" rational approximations. Then
j*log(B)/log(BASE) is approximately equal to (the integer) i. This shows that
we can get very close to being in trouble, but very rarely. For example,
76573 is a denominator in one of the continued-fraction approximations to
log(10)/log(2**15), and indeed:
>>> log(10)/log(2**15)*76573
16958.000000654003
is very close to an integer. If we were working with IEEE single-precision,
rounding errors could kill us. Finding worst cases in IEEE double-precision
requires better-than-double-precision log() functions, and Tim didn't bother.
Instead the code checks to see whether the allocated space is enough as each
new Python digit is added, and copies the whole thing to a larger int if not.
This should happen extremely rarely, and in fact I don't have a test case
that triggers it(!). Instead the code was tested by artificially allocating
just 1 digit at the start, so that the copying code was exercised for every
digit beyond the first.
***/
twodigits c; /* current input character */
Py_ssize_t size_z;
Py_ssize_t digits = 0;
int i;
int convwidth;
twodigits convmultmax, convmult;
digit *pz, *pzstop;
const char *scan, *lastdigit;
char prev = 0;
static double log_base_BASE[37] = {0.0e0,};
static int convwidth_base[37] = {0,};
static twodigits convmultmax_base[37] = {0,};
if (log_base_BASE[base] == 0.0) {
twodigits convmax = base;
int i = 1;
log_base_BASE[base] = (log((double)base) /
log((double)PyLong_BASE));
for (;;) {
twodigits next = convmax * base;
if (next > PyLong_BASE) {
break;
}
convmax = next;
++i;
}
convmultmax_base[base] = convmax;
assert(i > 0);
convwidth_base[base] = i;
}
/* Find length of the string of numeric characters. */
scan = str;
lastdigit = str;
while (_PyLong_DigitValue[Py_CHARMASK(*scan)] < base || *scan == '_') {
if (*scan == '_') {
if (prev == '_') {
/* Only one underscore allowed. */
str = lastdigit + 1;
goto onError;
}
}
else {
++digits;
lastdigit = scan;
}
prev = *scan;
++scan;
}
if (prev == '_') {
/* Trailing underscore not allowed. */
/* Set error pointer to first underscore. */
str = lastdigit + 1;
goto onError;
}
/* Create an int object that can contain the largest possible
* integer with this base and length. Note that there's no
* need to initialize z->ob_digit -- no slot is read up before
* being stored into.
*/
double fsize_z = (double)digits * log_base_BASE[base] + 1.0;
if (fsize_z > (double)MAX_LONG_DIGITS) {
/* The same exception as in _PyLong_New(). */
PyErr_SetString(PyExc_OverflowError,
"too many digits in integer");
return NULL;
}
size_z = (Py_ssize_t)fsize_z;
/* Uncomment next line to test exceedingly rare copy code */
/* size_z = 1; */
assert(size_z > 0);
z = _PyLong_New(size_z);
if (z == NULL) {
return NULL;
}
Py_SIZE(z) = 0;
/* `convwidth` consecutive input digits are treated as a single
* digit in base `convmultmax`.
*/
convwidth = convwidth_base[base];
convmultmax = convmultmax_base[base];
/* Work ;-) */
while (str < scan) {
if (*str == '_') {
str++;
continue;
}
/* grab up to convwidth digits from the input string */
c = (digit)_PyLong_DigitValue[Py_CHARMASK(*str++)];
for (i = 1; i < convwidth && str != scan; ++str) {
if (*str == '_') {
continue;
}
i++;
c = (twodigits)(c * base +
(int)_PyLong_DigitValue[Py_CHARMASK(*str)]);
assert(c < PyLong_BASE);
}
convmult = convmultmax;
/* Calculate the shift only if we couldn't get
* convwidth digits.
*/
if (i != convwidth) {
convmult = base;
for ( ; i > 1; --i) {
convmult *= base;
}
}
/* Multiply z by convmult, and add c. */
pz = z->ob_digit;
pzstop = pz + Py_SIZE(z);
for (; pz < pzstop; ++pz) {
c += (twodigits)*pz * convmult;
*pz = (digit)(c & PyLong_MASK);
c >>= PyLong_SHIFT;
}
/* carry off the current end? */
if (c) {
assert(c < PyLong_BASE);
if (Py_SIZE(z) < size_z) {
*pz = (digit)c;
++Py_SIZE(z);
}
else {
PyLongObject *tmp;
/* Extremely rare. Get more space. */
assert(Py_SIZE(z) == size_z);
tmp = _PyLong_New(size_z + 1);
if (tmp == NULL) {
Py_DECREF(z);
return NULL;
}
memcpy(tmp->ob_digit,
z->ob_digit,
sizeof(digit) * size_z);
Py_DECREF(z);
z = tmp;
z->ob_digit[size_z] = (digit)c;
++size_z;
}
}
}
}
if (z == NULL) {
return NULL;
}
if (error_if_nonzero) {
/* reset the base to 0, else the exception message
doesn't make too much sense */
base = 0;
if (Py_SIZE(z) != 0) {
goto onError;
}
/* there might still be other problems, therefore base
remains zero here for the same reason */
}
if (str == start) {
goto onError;
}
if (sign < 0) {
Py_SIZE(z) = -(Py_SIZE(z));
}
while (*str && Py_ISSPACE(Py_CHARMASK(*str))) {
str++;
}
if (*str != '\0') {
goto onError;
}
long_normalize(z);
z = maybe_small_long(z);
if (z == NULL) {
return NULL;
}
if (pend != NULL) {
*pend = (char *)str;
}
return (PyObject *) z;
onError:
if (pend != NULL) {
*pend = (char *)str;
}
Py_XDECREF(z);
slen = strlen(orig_str) < 200 ? strlen(orig_str) : 200;
strobj = PyUnicode_FromStringAndSize(orig_str, slen);
if (strobj == NULL) {
return NULL;
}
PyErr_Format(PyExc_ValueError,
"invalid literal for int() with base %d: %.200R",
base, strobj);
Py_DECREF(strobj);
return NULL;
}
If you want to defined it in C, go ahead and try using this- if not, you'll have to write it yourself

Python Generated String in C

I need to generate the following string in C:
$(python -c "print('\x90' * a + 'blablabla' + '\x90' * b + 'h\xef\xff\xbf')")
where a and b are arbitrary integers and blablabla represents an arbitrary string. I am attempting to do this by first creating
char str1[size];
and then doing:
for (int i = 0; i < a; i+=1) {
strcat(str1, "\x90");
}
Next I use strcat again:
strcat(str1, "blablabla");
and I run the loop again, this time b times, to concatenate the next b x90 characters. Finally, I use strcat once more as follows:
strcat(str1, "h\xef\xff\xbf");
However, these two strings do not match. Is there a more efficient way of replicating the behaviour of python's * in C? Or am I missing something?
char str1[size];
Even assuming you calculated size correctly, I recommend using
char * str = malloc(size);
Either way, after you get the needed memory for the string one way or the other, you gonna have to initialize it by first doing
str[0]=0;
if you intend in using strcat.
for (int i = 0; i < a; i+=1) {
strcat(str1, "\x90");
}
This is useful, if "\x90" actually is a string (i.e. something composed of more than one character) and that string is short (hard to give a hard border, but something about 16 bytes would be tops) and a is rather small[1]. Here, as John Coleman already suggested, memset is a better way to do it.
memset(str, '\x90', a);
Because you know the location, where "blablabla" shall be stored, just store it there using strcpy instead of strcat
// strcat(str1, "blablabla");
strcpy(str + a, "blablabla");
However, you need the address of the character after "blablabla" (one way or the other). So I would not even do it that way but instead like this:
const char * add_str = "blablabla";
size_t sl = strlen(add_str);
memcpy(str + a, add_str, sl);
Then, instead of your second loop, use another memset:
memset(str + a + sl, '\x90', b);
Last but not least, instead of strcat again strcpy is better (here, memcpy doesn't help):
strcpy(str + a + sl + b, "h\xef\xff\xbf");
But you need it's size for the size calculation at the beginning, so better do it like the blablabla string anyway (and remember the tailing '\0').
Finally, I would put all this code into a function like this:
char * gen_string(int a, int b) {
const char * add_str_1 = "blablabla";
size_t sl_1 = strlen(add_str_1);
const char * add_str_2 = "h\xef\xff\xbf";
size_t sl_2 = strlen(add_str_2);
size_t size = a + sl_1 + b + sl_2 + 1;
// The + 1 is important for the '\0' at the end
char * str = malloc(size);
if (!str) {
return NULL;
}
memset(str, '\x90', a);
memcpy(str + a, add_str_1, sl_1);
memset(str + a + sl_1, '\x90', b);
memcpy(str + a + sl_1 + b, add_str_2, sl_2);
str[a + sl_1 + b + sl_2] = 0; // 0 is the same as '\0'
return str;
}
Remember to free() the retval of gen_string at some point.
If the list of memset and memcpy calls get longer, then I'd suggest to do it like this:
char * ptr = str;
memset(ptr, '\x90', a ); ptr += a;
memcpy(ptr, add_str_1, sl_1); ptr += sl_1;
memset(ptr, '\x90', b ); ptr += b;
memcpy(ptr, add_str_2, sl_2); ptr += sl_2;
*ptr = 0; // 0 is the same as '\0'
maybe even creating a macro for memset and memcpy:
#define MEMSET(c, l) do { memset(ptr, c, l); ptr += l; } while (0)
#define MEMCPY(s, l) do { memcpy(ptr, s, l); ptr += l; } while (0)
char * ptr = str;
MEMSET('\x90', a );
MEMCPY(add_str_1, sl_1);
MEMSET('\x90', b );
MEMCPY(add_str_2, sl_2);
*ptr = 0; // 0 is the same as '\0'
#undef MEMSET
#undef MEMCPY
For the justifications why to do it the way I recommend it, I suggest you read the blog post Back to Basics (by one of the founders of Stack Overflow) which happens not only to be John Coleman's favorite blog post but mine also. There you will learn, that using strcat in a loop like the way you tried it first has quadratic run time and hence, why not use it the way you did it.
[1] If a is big and/or the string that needs to be repeated is long, a better solution would be something like this:
const char * str_a = "\x90";
size_t sl_a = strlen(str_a);
char * ptr = str;
for (size_t i = 0; i < a; ++i) {
strcpy(ptr, str_a);
ptr += sl_a;
}
// then go on at address str + a * sl_a
For individual 1-byte chars you can use memset to partially replicate the behavior of Python's *:
#include<stdio.h>
#include<string.h>
int main(void){
char buffer[100];
memset(buffer,'#',10);
buffer[10] = '\0';
printf("%s\n",buffer);
memset(buffer, '*', 5);
buffer[5] = '\0';
printf("%s\n",buffer);
return 0;
}
Output:
##########
*****
For a more robust solution, see this.

Why don't cython compile logic or to `||` expression?

For example, here is an or expression:
c = f1 == 0 or f1 - f0 > th
Here is the compiled C code:
__pyx_t_24 = (__pyx_v_f1 == 0);
if (!__pyx_t_24) {
} else {
__pyx_t_23 = __pyx_t_24;
goto __pyx_L5_bool_binop_done;
}
__pyx_t_24 = ((__pyx_v_f1 - __pyx_v_f0) > __pyx_v_th);
__pyx_t_23 = __pyx_t_24;
__pyx_L5_bool_binop_done:;
__pyx_v_c = __pyx_t_23;
Why not output this?
__pyx_v_c = (__pyx_v_f1 == 0) || ((__pyx_v_f1 - __pyx_v_f0) > __pyx_v_th)
is the goto version faster than ||?
Using the following two files:
test.c:
int main(int argc, char** argv) {
int c, f0, f1, th;
int hold, hold1;
f0 = (int) argv[1];
f1 = (int) argv[2];
th = (int) argv[3];
hold1 = (f0 == 0);
if (!hold1) {
} else {
hold = hold1;
goto done;
}
hold1 = (f1 - f0 > th);
hold = hold1;
done: c = hold;
return c;
}
test2.c:
int main(int argc, char** argv) {
int c, f0, f1, th;
f0 = (int) argv[1];
f1 = (int) argv[2];
th = (int) argv[3];
c = (f1 == 0) || (f1 - f0 > th);
return c;
}
I had to assign f0, f1 and th to something so that the compiler would not just return 1 as the C specification states that ints are initialized to 0 and f1 == 0 would yield true, and thus the whole boolean statement would yield true and the assembly would be:
main:
.LFB0:
.cfi_startproc
.L2:
movl $1, %eax
ret
.cfi_endproc
Compiling using GCC with -S -O2 flags (optimization enabled), both test.s and test2.s become:
main:
.LFB0:
.cfi_startproc
movl 8(%rsi), %edx
movq 16(%rsi), %rdi
movl $1, %eax
movq 24(%rsi), %rcx
testl %edx, %edx
je .L2
subl %edx, %edi
xorl %eax, %eax
cmpl %ecx, %edi
setg %al
.L2:
rep
ret
.cfi_endproc
So it unless you disable optimizations, in which the one with goto would have about 50% more instructions, the result would be the same.
The reason the output C code is ugly is because of the way the interpreter visits the nodes in the AST. When an or node is visited, the interpreter first evaluates the first parameter and then the second. If the boolean expression was much more complex, it would be much more easier to parse. Imagine calling a lambda function that returns a boolean (I'm not sure if Cython supports this); the interpreter would have to follow the structure:
hold = ... evaluate the lambda expression...
if (hold) {
result = hold;
goto done; // short circuit
}
hold = ... evaluate the second boolean expression...
done:
...
It would be a daunting task to optimize during the interpreting phase and thus Cython won't even bother.
If my understanding of C is correct (and I last used C many years ago , so it may be rusty) the '||' ( OR ) operator in C only return Boolean values ( that is 0 for False or 1 for True ) . If this is correct , then it's not about if goto is faster or slower .
The || would give different results than the goto code . This is because of how 'or' in Python works , let's take an example -
c = a or b
In the above statement first as value is evaluated , If it is a true like value that value is returned from the or expression (not true or 1 , but as value) , if the value is false like ( false like values in Python are 0 , empty string , empty lists, False , etc) then b's value is evaluated and it is returned . Please note 'or' returns the last evaluated value , and not True(1) or False(0).
This is basically useful when you want to set default values like -
s = d or 'default value'

Parsing unsigned ints (uint32_t) in python's C api

If I write a function accepting a single unsigned integer (0 - 0xFFFFFFFF), I can use:
uint32_t myInt;
if(!PyArg_ParseTuple(args, "I", &myInt))
return NULL;
And then from python, I can pass an int or long.
But what if I get passed a list of integers?
uint32_t* myInts;
PyObject* pyMyInts;
PyArg_ParseTuple(args, "O", &pyMyInts);
if (PyList_Check(intsObj)) {
size_t n = PyList_Size(v);
myInts = calloc(n, sizeof(*myInts));
for(size_t i = 0; i < n; i++) {
PyObject* item = PyList_GetItem(pyMyInts, i);
// What function do I want here?
if(!GetAUInt(item, &myInts[i]))
return NULL;
}
}
// cleanup calloc'd array on exit, etc
Specifically, my issue is with dealing with:
Lists containing a mixture of ints and longs
detecting overflow when assigning to the the uint32
You could create a tuple and use the same method you used for a single argument. On the C side, the tuple objects are not really immutable, so it wouldn't be to much trouble.
Also PyLong_AsUnsignedLong could work for you. It accepts int and long objects and raises an error otherwise. But if sizeof(long) is bigger than 4, you might need to check for an upper-bound overflow yourself.
static int
GetAUInt(PyObject *pylong, uint32_t *myint) {
static unsigned long MAX = 0xffffffff;
unsigned long l = PyLong_AsUnsignedLong(pylong);
if (l == -1 && PyErr_Occurred() || l > MAX) {
PyErr_SetString(PyExc_OverflowError, "can't convert to uint32_t");
return false;
}
*myint = (uint32_t) l;
return true;
}

How does python extend work?

If I have a list:
a = [1,2,3,4]
and then add 4 elements using extend
a.extend(range(5,10))
I get
a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
How does python do this? does it create a new list and copy the elements across or does it make 'a' bigger? just concerned that using extend will gobble up memory. I'am also asking as there is a comment in some code I'm revising that extending by 10000 x 100 is quicker than doing it in one block of 1000000.
Python's documentation on it says:
Extend the list by appending all the
items in the given list; equivalent to
a[len(a):] = L.
As to "how" it does it behind the scene, you really needn't concern yourself about it.
L.extend(M) is amortized O(n) where n=len(m), so excessive copying is not usually a problem. The times it can be a problem is when there is not enough space to extend into, so a copy is performed. This is a problem when the list is large and you have limits on how much time is acceptable for an individual extend call.
That is the point when you should look for a more efficient datastructure for your problem. I find it is rarely a problem in practice.
Here is the relevant code from CPython, you can see that extra space is allocated when the list is extended to avoid excessive copying
static PyObject *
listextend(PyListObject *self, PyObject *b)
{
PyObject *it; /* iter(v) */
Py_ssize_t m; /* size of self */
Py_ssize_t n; /* guess for size of b */
Py_ssize_t mn; /* m + n */
Py_ssize_t i;
PyObject *(*iternext)(PyObject *);
/* Special cases:
1) lists and tuples which can use PySequence_Fast ops
2) extending self to self requires making a copy first
*/
if (PyList_CheckExact(b) || PyTuple_CheckExact(b) || (PyObject *)self == b) {
PyObject **src, **dest;
b = PySequence_Fast(b, "argument must be iterable");
if (!b)
return NULL;
n = PySequence_Fast_GET_SIZE(b);
if (n == 0) {
/* short circuit when b is empty */
Py_DECREF(b);
Py_RETURN_NONE;
}
m = Py_SIZE(self);
if (list_resize(self, m + n) == -1) {
Py_DECREF(b);
return NULL;
}
/* note that we may still have self == b here for the
* situation a.extend(a), but the following code works
* in that case too. Just make sure to resize self
* before calling PySequence_Fast_ITEMS.
*/
/* populate the end of self with b's items */
src = PySequence_Fast_ITEMS(b);
dest = self->ob_item + m;
for (i = 0; i < n; i++) {
PyObject *o = src[i];
Py_INCREF(o);
dest[i] = o;
}
Py_DECREF(b);
Py_RETURN_NONE;
}
it = PyObject_GetIter(b);
if (it == NULL)
return NULL;
iternext = *it->ob_type->tp_iternext;
/* Guess a result list size. */
n = _PyObject_LengthHint(b, 8);
if (n == -1) {
Py_DECREF(it);
return NULL;
}
m = Py_SIZE(self);
mn = m + n;
if (mn >= m) {
/* Make room. */
if (list_resize(self, mn) == -1)
goto error;
/* Make the list sane again. */
Py_SIZE(self) = m;
}
/* Else m + n overflowed; on the chance that n lied, and there really
* is enough room, ignore it. If n was telling the truth, we'll
* eventually run out of memory during the loop.
*/
/* Run iterator to exhaustion. */
for (;;) {
PyObject *item = iternext(it);
if (item == NULL) {
if (PyErr_Occurred()) {
if (PyErr_ExceptionMatches(PyExc_StopIteration))
PyErr_Clear();
else
goto error;
}
break;
}
if (Py_SIZE(self) < self->allocated) {
/* steals ref */
PyList_SET_ITEM(self, Py_SIZE(self), item);
++Py_SIZE(self);
}
else {
int status = app1(self, item);
Py_DECREF(item); /* append creates a new ref */
if (status < 0)
goto error;
}
}
/* Cut back result list if initial guess was too large. */
if (Py_SIZE(self) < self->allocated)
list_resize(self, Py_SIZE(self)); /* shrinking can't fail */
Py_DECREF(it);
Py_RETURN_NONE;
error:
Py_DECREF(it);
return NULL;
}
PyObject *
_PyList_Extend(PyListObject *self, PyObject *b)
{
return listextend(self, b);
}
It works as if it were defined like this
def extend(lst, iterable):
for x in iterable:
lst.append(x)
This mutates the list, it does not create a copy of it.
Depending on the underlying implementation, append and extend may trigger the list to copy its own data structures but this is normal and nothing to worry about. For example array-based implementations typically grow the underlying array exponentially and need to copy the list of elements when they do so.
How does python do this? does it create a new list and copy the elements across or does it make 'a' bigger?
>>> a = ['apples', 'bananas']
>>> b = a
>>> a is b
True
>>> c = ['apples', 'bananas']
>>> a is c
False
>>> a.extend(b)
>>> a
['apples', 'bananas', 'apples', 'bananas']
>>> b
['apples', 'bananas', 'apples', 'bananas']
>>> a is b
True
>>>
It does not create a new list object, it extends a. This is self-evident from the fact that you don't make an assigment. Python will not magically replace your objects with other objects. :-)
How the memory allocation happens inside the list object is implementation dependent.

Categories

Resources