In my Python C Extension I am performing actions on an iterable of strings. So in a first step I call PySequence_Fast to convert it to a list and then iterate over the elements. For each string I use PyUnicode_DATA and then compare the strings using some criteria. So I only read from PyObjects, but never modify them.
Now I would like to process the list in parallel, which would require me to release the GIL. However I do not know which effects this has on my use case. Here are my current thoughts:
I can still use those APIs, since they are only macros, that directly read from the PyObjects without modifying them.
I have to use the APIs beforehand and store a array of structs that hold kind, length and data pointer of the strings
I have to use the APis beforehand and have to store a copy of the strings in a array
Case 1 would be the most performant and memory efficient. However it is stated, that without acquiring the GIL it is not allowed to perform on Python objects (does this include reading access) or use Python/C API functions.
Case 2 would be the next most efficient, since at least I do not have to copy all strings. However when I am not allowed to read from Python objects while the GIL is released, I wonder whether I would even be allowed to use a pointer to the data inside the PyObject.
Case 3 would require me to copy all strings. In my case this might make the multithreaded solution slower than a sequential solutions.
I hope someone can help me understand what I am allowed to do while the GIL is released.
I think the official answer is that you should not do method 1 and should use methods 2 and 3. And that while it might work now it could change in the future and break. This is especially important if you want to support things like PyPy's C-API wrapper (which might well use a different representation that Python internally). There are increasing moves to try to hide implementation details that you slightly risk getting caught out by.
Practically I think method 1 would work fine provided you only use the macro forms with no error checking - the GIL is mainly about stopping simultaneous writes putting Python objects in an undefined state, and you aren't doing this. Where I'd be slightly careful is if you ever have (deprecated) "non-canonical" unicode objects - things that look "macro-y" like PyUnicode_READY can cause them to be modified to the canonical state. Again, be especially wary of alternative (non-CPython) implementations of the C-API.
One alternative to consider would be to use the buffer protocol instead. Although I can't find it explicitly stated in the docs, the idea is that PyObject_GetBuffer and PyBuffer_Release require the GIL but reading/writing to the buffer doesn't. Here I have two sub-suggestions:
can you have a single object like a Numpy array that exposes all your strings as a buffer?
you can also get a buffer from a unicode object (as a utf-8 C-string) - the thing to do would be to create all the buffers with the GIL, do your parallel processing without, and them free them with the GIL. It's possible that the overhead for this might be inefficient. This is basically an "official" version of method 2.
I short, you'd probably get away with it, but if it ever breaks I doubt that a bug report to Python would be well-received (since it's technically wrong)
Related
I want to implicitly extend the int, float, str, list, dict, set, and module classes with custom built substitutions (extensions).
When I say 'implicitly', what I mean is that when I declare 'a = 1', and object of the type Custom_Int (as an example) is produced, as opposed to a normal integer object.
Now, I understand and respect the reasons not to do this. Firstly- messing with built-ins is like messing with the laws of physics. No good can come from it. That said- I do understand the gravity of what I'm trying to do and what can happen if I do it wrong.
Second- I understand that modifying a base case will effect not just the current run-time but all running python processes. I feel that by overriding the __new__ method of these base classes, such that it returns Custom_Object_Whatever if and ONLY IF certain environmental factors are true, other run times will remain largely unaffected.
So, getting back to the issue at hand- how can I override the __new__ method of these various types?
Pythons forbiddenfruit package seems to be promising. I havn't had a chance to reeeeeeally investigate it though, and if someone who understands it could summarize what it does, that would save me a lot of time.
Beyond that, I've observed something strange.
Every answer to monkeypatching that doesn't eventually circle back to forbiddenfruit or how forbiddenfruit works has to do with modifying what I will refer to as the 'absolute_dictionary' of the class. Because everything in Python is essentially a mapping (or dictionary) of functions/values to names, if you change the name __new__ within the right mapping, you change the nature of the object.
Problem is- every near-success I've had has it that if I call 'str( 'a' ).__new__( *args )' it works fine {in some cases}, but the calling of varOne = 'a' does not seem to actually call str.__new__().
My guess- this has something to do with either python's parsing of a program prior to launch, or else the caching of the various classes during/post launch. Or maybe I'm totally off the mark. Either python pre-reads and applies some regex to it's modules prior to launch or else the machine code, when it attempts to implicitly create an object, it reaches for something other than the class located in moduleObject.builtins[ __classname__ ]
Any ideas?
If you want to do this, your best option is probably to modify the CPython source code and build your own custom Python build with your extensions baked into the actual built-in types. The result will integrate a lot better with all the low-level mechanisms you don't yet understand, and you'll learn a lot in the process.
Right now, you're getting blocked by a lot of factors. Here are the ones that have come to my mind.
The first is that most ways of creating built-in objects don't go through a __new__ method at all. They go through C-level calls like PyLong_FromLong or PyList_New. These calls are hardwired to use the actual built-in types, allocating memory sized for the real built-in types, fetching the type object by the address of its statically-allocated C struct, and stuff like that. It's basically impossible to change any of this without building your own custom Python.
The second factor is that messing with __new__ isn't even enough to correctly affect things that theoretically should go through __new__, like int("5"). Python has reasons for stopping you from setting attributes on built-in classes, and two of those reasons are slots and the type attribute cache.
Slots are a public part of the C API that you'll probably learn about if you try to modify the CPython source code. They're function pointers in the C structs that make up type objects at C level, and most of them correspond to Python-level magic methods. For example, the __new__ method has a corresponding tp_new slot. Most C code accesses slots instead of methods, and there's code to ensure the slots and methods are in sync, but if you bypass Python's protections, that breaks and everything goes to heck.
The type attribute cache isn't a public part of anything even at C level. It's a cache that saves the results of type object attribute lookups, to make Python go faster. Its memory safety relies on all type object attribute modification going through type.__setattr__ (and all built-in type object attribute modification getting rejected by type.__setattr__), but if you bypass the protection, memory safety goes out the window and arbitrarily weird results can occur.
The third factor is that there's a bunch of caching for immutable objects. The small int cache, the interned string dict, constants getting saved in bytecode objects, compile-time constant folding... there's a lot. Objects aren't going to be created when you expect. (There's also stuff like, say, zip saving the last output tuple and reusing it if it sees you didn't keep a reference, for even more ways object creation will mess with your assumptions.)
There's more. Stuff like, what argument would int.__new__ even take if you tried to use int.__new__ to evaluate the expression 5? Stuff like all the low-level code that knows exactly how to work with the types it expects and will get very confused if it gets a MyCustomTuple with a completely different memory layout from a real tuple. Screwing with built-ins has a lot of issues.
Incidentally, one of the things you expected to be a problem is mostly not a problem. Screwing with one Python process's built-ins won't affect other Python processes' built-ins... unless those other processes are created by forking the first process, such as with multiprocessing in fork mode.
This question already has an answer here:
Prevent RAM from paging to swap area (mlock)
(1 answer)
Closed 3 years ago.
I'm working on a password manager application for linux and I'm using Python for it.
Because of the security reasons I want to call mlock system call in order to avoid swapping password variable on hard drive.
I noticed that python itself didn't wrap this function.
so is there any way so can I avoid swapping?
Thanks
For CPython, there is no good answer for this that doesn't involve writing a Python C extension, since mlock works on pages, not objects. The internals of the str object differ from version to version (in Py3.3 and higher, a str may actually have several copies of the data in memory in different encodings, some inlined after the object structure, some dynamically allocated separately and linked by pointer), and even if you used ctypes to retrieve the necessary addresses and mlock-ed them all through ctypes mlock calls, you'll have a hell of a time determining when to mlock and when to munlock. Since mlock works on pages, you'd have to carefully track how many strings are currently in any given page (because if you just mlock and munlock blindly, and there are more than one things to lock in a page, the first munlock would unlock all of them; mlock/munlock is a boolean flag, it doesn't count the number of locks and unlocks).
Even if you manage that, you still would have a race between password acquisition and mlock during which the data could be written to swap, and those cached alternate encodings are computed lazily, so mlocking the non-NULL pointers at any given time doesn't necessarily mean those pointers might not be populated later.
You could partially avoid these problems through careful use of the mmap module and memoryviews (mmap gives you pages of memory, memoryview references said memory without copying it, so ctypes could be used to mlock the page), but you'd have to build it all from scratch (can't use the getpass module because it would store as a str for a moment).
In short, Python doesn't care about swapping or memory protection in the way you want; it trusts the swap file to be configured to your desired security (e.g. disabled or encrypted), neither providing additional protection nor providing the information you'd need to add it in.
AFAIK, the use of shared_ptr is often discouraged because of potential bugs caused by careless usage of them (unless you have a really good explanation for significant benefit and carefully checked design).
On the other hand, Python objects seem to be essentially shared_ptrs (ref_count and garbage collection).
I am wondering what makes them work nicely in Python but potentially dangerous in C++. In other words, what are the differences between Python and C++ in dealing with shared_ptr that makes their usage discouraged in C++ but not causing similar problems in Python?
I know e.g. Python automatically detects cycles between objects which prevents memory leaks that dangling cyclic shared_ptrs can cause in C++.
"I know e.g. Python automatically detects cycles" -- that's what makes them work nicely, at least so far as the "potential bugs" relate to memory leaks.
Besides which, C++ programs are more commonly written under tight performance constraints than Python programs (which IMO is a combination of different genuine requirements with some fairly bogus differences in rules-of-thumb, but that's another story). A fairly high proportion of the Python objects I use don't strictly need reference counting, they have exactly one owner and a unique_ptr would be fine (or for that matter a data member of class type). In C++ it's considered (by the people writing the advice you're reading) worth taking the performance advantage and the explicitly simplified design. In Python it's usually not considered a problem, you pay the performance and you keep the flexibility to decide later that it's shared after all without any code change required (other than to take additional references that outlive the original, I mean).
Btw in any language, shared mutable objects have "potential bugs" associated with them, if you lose track of what objects will or won't change when you're not looking at them. I don't just mean race conditions: even in a single-threaded program you need to be aware that C++ Predicates shouldn't change anything and that you (often) can't mutate a container while iterating over it. I don't see this as a difference between C++ and Python, though. Rather, to some extent you should be slightly wary of shared objects in Python too, and when you proliferate references to an object at least understand why you're doing it.
So, on to the list of issues in the question you link to:
cyclic references -- as mentioned, Python rolls its sleeves up, finds them and frees them. For reasons to do with the design and specific uses of the languages, cycle-breaking garbage collection is rather difficult to implement in C++, although not impossible.
creating multiple unrelated shared_ptrs to the same object -- no analog is possible in Python, since the reference-counter isn't open to the user to mess up.
Constructing an anonymous temporary shared pointer -- doesn't arise in Python, there's no risk of a memory leak that way in Python since there's no "gap" in which the object exists but is not yet subject to collection if it becomes unreferenced.
Calling the get() function to get the raw pointer and use it after the pointed-to object goes out of scope -- well, you can mess this up if you're writing Python/C, but not in pure Python.
Passing a reference of or a raw pointer to a shared_ptr should be dangerous too, since it won't increment the internal count -- there's no means in Python to add a reference without the language taking care of the refcount.
we passed 'this' to some thread workers instead of 'shared_from_this' -- in other words, forgot to create a shared_ptr when needed. Can't do this in Python.
most of the predicates you know and love from <functional> don't play nicely with shared_ptr -- Python refcounting is so built in to the runtime (or I suppose to be precise I should say: garbage collection is so built in to the language design) that there are no libraries that fail to cope with it.
Using shared_ptr for really small objects (like char short) could be an overhead -- issue exists in Python, and Python programmers generally don't sweat it. If you need an array of "primitive type" then you can use numpy to reduce overhead. Sometimes Python programs run out of memory and you need to do something about it, that's life ;-)
Giving out a shared_ptr< T > to this inside a class definition is also dangerous. Use enabled_shared_from_this instead -- it may not be obvious, but this is "don't create multiple unrelated shared_ptr to the same object" again.
You need to be careful when you use shared_ptr in multithread code -- it's possible to create race conditions in Python too, this is part of "shared mutable objects are tricksy".
Most of this is to do with the fact that in C++ you have to explicitly do something to get refcounting, and you don't get it if you don't ask for it. This provides several opportunities for error that Python doesn't make available to the programmer because it just does it for you. If you use shared_ptr correctly then apart from the existence of libraries that don't co-operate with it, none of these problems comes up in C++ either. Those who are cautious of using it for these reasons are basically saying they're afraid they'll use it incorrectly, or at any rate more afraid than that they'll misuse some alternative. Much of C++ programming is trading different potential bugs off against each other until you come up with a design that you consider yourself competent to execute. Furthermore it has "don't pay for what you don't need" as a design philosophy. Between these two factors, you don't do anything without a really good explanation, a significant benefit, and a carefully checked design. shared_ptr is no different ;-)
AFAIK, the use of shared_ptr is often discouraged because of potential bugs caused by careless usage of them (unless you have a really good explanation for significant benefit and carefully checked design).
I wouldn't agree. The tendency goes towards generally using these smart pointers unless you have a very good reasons not to do so.
shared_ptr that makes their usage discouraged in C++ but not causing similar problems in Python?
Well, I don't know about your favourite largish signal processing framework ecosystem, but GNU Radio uses shared_ptrs for all their blocks, which are the core elements of the GNU Radio architecture. In fact, blocks are classes, with private constructors, which are only accessible by a friend make function, which returns a shared_ptr. We haven't had problems with this -- and GNU Radio had good reason to adopt such a model. Now, we don't have a single place where users try to use deallocated block objects, not a single block is leaked. Nice!
Also, we use SWIG and a gateway class for a few C++ types that can't just be represented well as Python types. All this works very well on both sides, C++ and Python. In fact, it works so very well, that we can use Python classes as blocks in the C++ runtime, wrapped in shared_ptr.
Also, we never had performance problems. GNU Radio is a high rate, highly optimized, heavily multithreaded framework.
Are there any security exploits that could occur in this scenario:
eval(repr(unsanitized_user_input), {"__builtins__": None}, {"True":True, "False":False})
where unsanitized_user_input is a str object. The string is user-generated and could be nasty. Assuming our web framework hasn't failed us, it's a real honest-to-god str instance from the Python builtins.
If this is dangerous, can we do anything to the input to make it safe?
We definitely don't want to execute anything contained in the string.
See also:
Funny blog post about eval safety
Previous Question
Blog: Fast deserialization in Python
The larger context which is (I believe) not essential to the question is that we have thousands of these:
repr([unsanitized_user_input_1,
unsanitized_user_input_2,
unsanitized_user_input_3,
unsanitized_user_input_4,
...])
in some cases nested:
repr([[unsanitized_user_input_1,
unsanitized_user_input_2],
[unsanitized_user_input_3,
unsanitized_user_input_4],
...])
which are themselves converted to strings with repr(), put in persistent storage, and eventually read back into memory with eval.
Eval deserialized the strings from persistent storage much faster than pickle and simplejson. The interpreter is Python 2.5 so json and ast aren't available. No C modules are allowed and cPickle is not allowed.
It is indeed dangerous and the safest alternative is ast.literal_eval (see the ast module in the standard library). You can of course build and alter an ast to provide e.g. evaluation of variables and the like before you eval the resulting AST (when it's down to literals).
The possible exploit of eval starts with any object it can get its hands on (say True here) and going via .__class_ to its type object, etc. up to object, then gets its subclasses... basically it can get to ANY object type and wreck havoc. I can be more specific but I'd rather not do it in a public forum (the exploit is well known, but considering how many people still ignore it, revealing it to wannabe script kiddies could make things worse... just avoid eval on unsanitized user input and live happily ever after!-).
If you can prove beyond doubt that unsanitized_user_input is a str instance from the Python built-ins with nothing tampered, then this is always safe. In fact, it'll be safe even without all those extra arguments since eval(repr(astr)) = astr for all such string objects. You put in a string, you get back out a string. All you did was escape and unescape it.
This all leads me to think that eval(repr(x)) isn't what you want--no code will ever be executed unless someone gives you an unsanitized_user_input object that looks like a string but isn't, but that's a different question--unless you're trying to copy a string instance in the slowest way possible :D.
With everything as you describe, it is technically safe to eval repred strings, however, I'd avoid doing it anyway as it's asking for trouble:
There could be some weird corner-case where your assumption that only repred strings are stored (eg. a bug / different pathway into the storage that doesn't repr instantly becmes a code injection exploit where it might otherwise be unexploitable)
Even if everything is OK now, assumptions might change at some point, and unsanitised data may get stored in that field by someone unaware of the eval code.
Your code may get reused (or worse, copy+pasted) into a situation you didn't consider.
As Alex Martelli pointed out, in python2.6 and higher, there is ast.literal_eval which will safely handle both strings and other simple datatypes like tuples. This is probably the safest and most complete solution.
Another possibility however is to use the string-escape codec. This is much faster than eval (about 10 times according to timeit), available in earlier versions than literal_eval, and should do what you want:
>>> s = 'he\nllo\' wo"rld\0\x03\r\n\tabc'
>>> repr(s)[1:-1].decode('string-escape') == s
True
(The [1:-1] is to strip the outer quotes repr adds.)
Generally, you should never allow anyone to post code.
So called "paid professional programmers" have a hard-enough time writing code that actually works.
Accepting code from the anonymous public -- without benefit of formal QA -- is the worst of all possible scenarios.
Professional programmers -- without good, solid formal QA -- will make a hash of almost any web site. Indeed, I'm reverse engineering some unbelievably bad code from paid professionals.
The idea of allowing a non-professional -- unencumbered by QA -- to post code is truly terrifying.
repr([unsanitized_user_input_1,
unsanitized_user_input_2,
...
... unsanitized_user_input is a str object
You shouldn't have to serialise strings to store them in a database..
If these are all strings, as you mentioned - why can't you just store the strings in a db.StringListProperty?
The nested entries might be a bit more complicated, but why is this the case? When you have to resort to eval to get data from the database, you're probably doing something wrong..
Couldn't you store each unsanitized_user_input_x as it's own db.StringProperty row, and have group them by an reference field?
Either of those may not be applicable, since I've no idea what you're trying to achieve, but my point is - can you not structure the data in a way you where don't have to rely on eval (and also rely on it not being a security issue)?
I've got a few questions about best practices in Python. Not too long ago I would do something like this with my code:
...
junk_block = "".join(open("foo.txt","rb").read().split())
...
I don't do this anymore because I can see that it makes code harder to read, but would the code run slower if I split the statements up like so:
f_obj = open("foo.txt", "rb")
f_data = f_obj.read()
f_data_list = f_data.split()
junk_block = "".join(f_data_list)
I also noticed that there's nothing keeping you from doing an 'import' within a function block, is there any reason why I should do that?
As long as you're inside a function (not at module top level), assigning intermediate results to local barenames has an essentially-negligible cost (at module top level, assigning to the "local" barenames implies churning on a dict -- the module's __dict__ -- and is measurably costlier than it would be within a function; the remedy is never to have "substantial" code at module top level... always stash substantial code within a function!-).
Python's general philosophy includes "flat is better than nested" -- and that includes highly "nested" expressions. Looking at your original example...:
junk_block = "".join(open("foo.txt","rb").read().split())
presents another important issues: when is that file getting closed? In CPython today, you need not worry -- reference counting in practice does ensure timely closure. But most other Python implementations (Jython on the JVM, IronPython on .NET, PyPy on all sorts of backends, pynie on Parrot, Unladen Swallow on LLVM if and when it matures per its published roadmap, ...) do not guarantee the use of reference counting -- many garbage collection strategies may be involved, with all sort of other advantages.
Without any guarantee of reference counting (and even in CPython it's always been deemed an implementation artifact, not part of the language semantics!), you might be exhausting resources, by executing such "open but no close" code in a tight loop -- garbage collection is triggered by scarcity of memory, and does not consider other limited resources such as file descriptors. Since 2.6 (and 2.5, with an "import from the future"), Python has a great solution via the RAII ("resource acquisition is initialization") approach supported by the with statement:
with open("foo.txt","rb") as f:
junk_block = "".join(f.read().split())
is the least-"unnested" way that will ensure timely closure of the file across all compliant versions of Python. The stronger semantics make it preferable.
Beyond ensuring the correct, and prudent;-), semantics, there's not that much to choose between nested and flattened versions of an expression such as this. Given the task "remove all runs of whitespace from the file's contents", I would be tempted to benchmark alternative approaches based on re and on the .translate method of strings (the latter, esp. in Python 2.*, is often the fastest way to delete all characters from a certain set!), before settling on the "split and rejoin" approach if it proves to be faster -- but that's really a rather different issue;-).
First of all, there's not really a reason you shouldn't use the first example - it'd quite readable in that it's concise about what it does. No reason to break it up since it's just a linear combination of calls.
Second, import within a function block is useful if there's a particular library function that you only need within that function - since the scope of an imported symbol is only the block within which it is imported, if you only ever use something once, you can just import it where you need it and not have to worry about name conflicts in other functions. This is especially handy with from X import Y statements, since Y won't be qualified by its containing module name and thus might conflict with a similarly named function in a different module being used elsewhere.
from PEP 8 (which is worth reading anyway)
Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants
That line has the same result as this:
junk_block = open("foo.txt","rb").read().replace(' ', '')
In your example you are splitting the words of the text into a list of words, and then you are joining them back together with no spaces. The above example instead uses the str.replace() method.
The differences:
Yours builds a file object into memory, builds a string into memory by reading it, builds a list into memory by splitting the string, builds a new string by joining the list.
Mine builds a file object into memory, builds a string into memory by reading it, builds a new string into memory by replacing spaces.
You can see a bit less RAM is used in the new variation but more processor is used. RAM is more valuable in some cases and so memory waste is frowned upon when it can be avoided.
Most of the memory will be garbage collected immediately but multiple users at the same time will hog RAM.
If you want to know if your second code fragment is slower, the quick way to find out would be to just use timeit. I wouldn't expect there to be that much difference though, since they seem pretty equivalent.
You should also ask if a performance difference actually matters in the code in question. Often readability is of more value than performance.
I can't think of any good reasons for importing a module in a function, but sometimes you just don't know you'll need to do something until you see the problem. I'll have to leave it to others to point out a constructive example of that, if it exists.
I think the two codes are readable. I (and that's just a question of personal style) will probably use the first, adding a coment line, something like: "Open the file and convert the data inside into a list"
Also, there are times when I use the second, maybe not so separated, but something like
f_data = open("foo.txt", "rb").read()
f_data_list = f_data.split()
junk_block = "".join(f_data_list)
But then I'm giving more entity to each operation, which could be important in the flow of the code. I think it's important you are confortable and don't think that the code is difficult to understand in the future.
Definitly, the code will not be (at least, much) slower, as the only "overload" you're making is to asing the results to values.