Zipkin for profiling the internals of a traditional progamm

Zipkin for profiling the internals of a traditional progamm - python

I want to use zipkin to profile the internals of a traditional program.
I use the term "traditional", since AFAIK zipkin is for tracing in a microservice environment where one request gets computed by N sub-requests.
I would like to analyse the performance of my python program.
I would like to trace all python method calls and all linux syscalls which gets done.
How to trace the python method calls and linux syscalls to get the spans into zipkin?
Even if it is not feasible, I am interesting how this could be done. I would like to learn how zipkin works.

In zipkin lingo, what you are asking about is often called "local spans" or "local tracing", basically an operation that neither originated, nor resulted in a remote call.
I'm not aware of anything at the syscall level, but many tracers support explicit instrumentation of function calls.
For example, using py_zipkin
#zipkin_span(service_name='my_service', span_name='some_function')
def some_function(a, b):
return do_stuff(a, b)
Besides explicit instrumentation like this, one could also export data to zipkin. For example, one could convert trace data that is made in another tool to zipkin's json format.
This probably doesn't answer your question, but I hope it helps.

Related

Python multiprocessing.Pool(): am I limited in what I can return?

I am using Python's multi-processing pool. I have been told, although not experienced this myself so I cannot post the code, that one cannot just "return" anything from within the multiprocessing.Pool()-worker back to the multiprocessing.Pool()'s main process. Words like "pickling" and "lock" were being thrown around but I am not sure.
Is this correct, and if so, what are these limitations?
In my case, I have a function which generates a mutable class object and then returns it after it has done some work with it. I'd like to have 8 processes run this function, generate their own classes, and return each of them after they're done. Full code is NOT written yet, so I cannot post it.
Any issues I may run into?
My code is: res = pool.map(foo, list_of_parameters)

Q : "Is this correct, and if so, what are these limitations?"
It depends. It is correct, but the SER/DES processing is the problem here, as a pair of disjoint processes tries to "send" something ( there: a task specification with parameters and back: ... Yessss, the so long waited for result* )
Initial versions of the Python standard library of modules piece, responsible for doing this, the pickle-module, was not able to SER-ialise some more complex types of objects, Class-instances being one such example.
There are newer and newer versions evolving, sure, yet this SER/DES step is one of the SPoFs that may avoid a smooth code-execution for some such cases.
Next are the cases, that finish by throwing a Memory Error as they request as much memory allocations, that the O/S simply rejects any new request for such an allocation, and the whole process attempt to produce and send pickle.dumps( ... ) un-resolvably crashes.
Do we have any remedies available?
Well, maybe yes, maybe no - Mike McKearn's dill may help in some cases to better handle complex objects in SER/DES-processing.
May try to use import dill as pickle; pickle.dumps(...) and test your hot-candidates for Class()-instances to get SER/DES-ed, if they get a chance to pass through. If not, no way using this low-hanging fruit first trick.
Next, a less easy way would be to avoid your dependence on hardwired multiprocessing.Pool()-instantiations and their (above)-limited SER/comms/DES-methods, and design your processing strategy as a distributed-computing system, based on a communicating agents paradigm.
That way you benefit from a right-sized, just-enough designed communication interchange between intelligent-enough agents, that know (as you've designed them to know it) what to tell one to the others, even without sending any mastodon-sized BLOB(s), that accidentally crash the processing in any of the SPoF(s) you cannot both prevent and salvage ex-post.
There seem no better ways forward I know about or can foresee in 2020-Q4 for doing this safe and smart.

Storing and running user defined functions from a database in Django [duplicate]

I'm developing a web game in pure Python, and want some simple scripting available to allow for more dynamic game content. Game content can be added live by privileged users.
It would be nice if the scripting language could be Python. However, it can't run with access to the environment the game runs on since a malicious user could wreak havoc which would be bad. Is it possible to run sandboxed Python in pure Python?
Update: In fact, since true Python support would be way overkill, a simple scripting language with Pythonic syntax would be perfect.
If there aren't any Pythonic script interpreters, are there any other open source script interpreters written in pure Python that I could use? The requirements are support for variables, basic conditionals and function calls (not definitions).

This is really non-trivial.
There are two ways to sandbox Python. One is to create a restricted environment (i.e., very few globals etc.) and exec your code inside this environment. This is what Messa is suggesting. It's nice but there are lots of ways to break out of the sandbox and create trouble. There was a thread about this on Python-dev a year ago or so in which people did things from catching exceptions and poking at internal state to break out to byte code manipulation. This is the way to go if you want a complete language.
The other way is to parse the code and then use the ast module to kick out constructs you don't want (e.g. import statements, function calls etc.) and then to compile the rest. This is the way to go if you want to use Python as a config language etc.
Another way (which might not work for you since you're using GAE), is the PyPy sandbox. While I haven't used it myself, word on the intertubes is that it's the only real sandboxed Python out there.
Based on your description of the requirements (The requirements are support for variables, basic conditionals and function calls (not definitions)) , you might want to evaluate approach 2 and kick out everything else from the code. It's a little tricky but doable.

Roughly ten years after the original question, Python 3.8.0 comes with auditing. Can it help? Let's limit the discussion to hard-drive writing for simplicity - and see:
from sys import addaudithook
def block_mischief(event,arg):
if 'WRITE_LOCK' in globals() and ((event=='open' and arg[1]!='r')
or event.split('.')[0] in ['subprocess', 'os', 'shutil', 'winreg']): raise IOError('file write forbidden')
addaudithook(block_mischief)
So far exec could easily write to disk:
exec("open('/tmp/FILE','w').write('pwned by l33t h4xx0rz')", dict(locals()))
But we can forbid it at will, so that no wicked user can access the disk from the code supplied to exec(). Pythonic modules like numpy or pickle eventually use the Python's file access, so they are banned from disk write, too. External program calls have been explicitly disabled, too.
WRITE_LOCK = True
exec("open('/tmp/FILE','w').write('pwned by l33t h4xx0rz')", dict(locals()))
exec("open('/tmp/FILE','a').write('pwned by l33t h4xx0rz')", dict(locals()))
exec("numpy.savetxt('/tmp/FILE', numpy.eye(3))", dict(locals()))
exec("import subprocess; subprocess.call('echo PWNED >> /tmp/FILE', shell=True)", dict(locals()))
An attempt of removing the lock from within exec() seems to be futile, since the auditing hook uses a different copy of locals that is not accessible for the code ran by exec. Please prove me wrong.
exec("print('muhehehe'); del WRITE_LOCK; open('/tmp/FILE','w')", dict(locals()))
...
OSError: file write forbidden
Of course, the top-level code can enable file I/O again.
del WRITE_LOCK
exec("open('/tmp/FILE','w')", dict(locals()))
Sandboxing within Cpython has proven extremely hard and many previous attempts have failed. This approach is also not entirely secure e.g. for public web access:
perhaps hypothetical compiled modules that use direct OS calls cannot be audited by Cpython - whitelisting the safe pure pythonic modules is recommended.
Definitely there is still the possibility of crashing or overloading the Cpython interpreter.
Maybe there remain even some loopholes to write the files on the harddrive, too. But I could not use any of the usual sandbox-evasion tricks to write a single byte. We can say the "attack surface" of Python ecosystem reduces to rather a narrow list of events to be (dis)allowed: https://docs.python.org/3/library/audit_events.html
I would be thankful to anybody pointing me to the flaws of this approach.
EDIT: So this is not safe either! I am very thankful to #Emu for his clever hack using exception catching and introspection:
#!/usr/bin/python3.8
from sys import addaudithook
def block_mischief(event,arg):
if 'WRITE_LOCK' in globals() and ((event=='open' and arg[1]!='r') or event.split('.')[0] in ['subprocess', 'os', 'shutil', 'winreg']):
raise IOError('file write forbidden')
addaudithook(block_mischief)
WRITE_LOCK = True
exec("""
import sys
def r(a, b):
try:
raise Exception()
except:
del sys.exc_info()[2].tb_frame.f_back.f_globals['WRITE_LOCK']
import sys
w = type('evil',(object,),{'__ne__':r})()
sys.audit('open', None, w)
open('/tmp/FILE','w').write('pwned by l33t h4xx0rz')""", dict(locals()))
I guess that auditing+subprocessing is the way to go, but do not use it on production machines:
https://bitbucket.org/fdominec/experimental_sandbox_in_cpython38/src/master/sandbox_experiment.py

AFAIK it is possible to run a code in a completely isolated environment:
exec somePythonCode in {'__builtins__': {}}, {}
But in such environment you can do almost nothing :) (you can not even import a module; but still a malicious user can run an infinite recursion or cause running out of memory.) Probably you would want to add some modules that will be the interface to you game engine.

I'm not sure why nobody mentions this, but Zope 2 has a thing called Python Script, which is exactly that - restricted Python executed in a sandbox, without any access to filesystem, with access to other Zope objects controlled by Zope security machinery, with imports limited to a safe subset.
Zope in general is pretty safe, so I would imagine there are no known or obvious ways to break out of the sandbox.
I'm not sure how exactly Python Scripts are implemented, but the feature was around since like year 2000.
And here's the magic behind PythonScripts, with detailed documentation: http://pypi.python.org/pypi/RestrictedPython - it even looks like it doesn't have any dependencies on Zope, so can be used standalone.
Note that this is not for safely running arbitrary python code (most of the random scripts will fail on first import or file access), but rather for using Python for limited scripting within a Python application.
This answer is from my comment to a question closed as a duplicate of this one: Python from Python: restricting functionality?

I would look into a two server approach. The first server is the privileged web server where your code lives. The second server is a very tightly controlled server that only provides a web service or RPC service and runs the untrusted code. You provide your content creator with your custom interface. For example you if you allowed the end user to create items, you would have a look up that called the server with the code to execute and the set of parameters.
Here's and abstract example for a healing potion.
{function_id='healing potion', action='use', target='self', inventory_id='1234'}
The response might be something like
{hp='+5' action={destroy_inventory_item, inventory_id='1234'}}

Hmm. This is a thought experiment, I don't know of it being done:
You could use the compiler package to parse the script. You can then walk this tree, prefixing all identifiers - variables, method names e.t.c. (also has|get|setattr invocations and so on) - with a unique preamble so that they cannot possibly refer to your variables. You could also ensure that the compiler package itself was not invoked, and perhaps other blacklisted things such as opening files. You then emit the python code for this, and compiler.compile it.
The docs note that the compiler package is not in Python 3.0, but does not mention what the 3.0 alternative is.
In general, this is parallel to how forum software and such try to whitelist 'safe' Javascript or HTML e.t.c. And they historically have a bad record of stomping all the escapes. But you might have more luck with Python :)

I think your best bet is going to be a combination of the replies thus far.
You'll want to parse and sanitise the input - removing any import statements for example.
You can then use Messa's exec sample (or something similar) to allow the code execution against only the builtin variables of your choosing - most likely some sort of API defined by yourself that provides the programmer access to the functionality you deem relevant.

Can switching in-and-out PyFrameObjects be a good implementation of continuations?

I'm interested in continuations, specifically in Python's C-API. From what i understand, the nature of continuations requires un-abstracting low-level calling conventions in order to manipulate the call stack as needed. I was fortunate enough to come across a few examples of these scattered here and there. In the few examples i've come across, this un-abstraction is done using either clever C (with assumptions about the environment), or custom assembly.
However, what's cool about Python is that it has its own interpreter stack made up of PyFrameObjects. Assuming single-threaded applications for now, shouldn't it be enough to just switch in-and-out PyFrameObjectss to implement continuations in Python's C-API? Why do these authors even bother with the low-level stuff?

Generators work by manipulating the stack (actually linked list) of frame objects. But that will only help for pure Python code. It won't help you if your code has any C code running. For example, if you are in C code inside an I/O routine, you can't change the Python frame object to get execution to go somewhere else. You have to be able to change the C stack in order to do that. That's what packages like greenlets do for you.

Save Workspace - save all variables to a file. Python doesn't have it)

I cannot understand it. Very simple, and obvious functionality:
You have a code in any programming language, You run it. In this code You generate variables, than You save them (the values, names, namely everything) to a file, with one command. When it's saved You may open such a file in Your code also with simple command.
It works perfect in matlab (save Workspace , load Workspace ) - in python there's some weird "pickle" protocol, which produces errors all the time, while all I want to do is save variable, and load it again in another session (?????)
f.e. You cannot save class with variables (in Matlab there's no problem)
You cannot load arrays in cPickle (but YOu can save them (?????) )
Why don't make it easier?
Is there a way to save the current variables with values, and then load them?

What you are describing is Matlab environment feature not a programming language.
What you need is a way to store serialized state of some object which could be easily done in almost any programming language. In python world pickle is the easiest way to achieve it and if you could provide more details about the errors it produces for you people would probably be able to give you more details on that.
In general for object oriented languages (including python) it is always a good approach to incapsulate a your state into single object that could be serialized and de-serialized and then store/load an instance of such class. Pickling and unpickling of such objects works perfectly for many developers so this must be something specific to your implementation.

Since you're talking about Matlab, you probably want to try out IPython, which is a shell for Python offering much more functionality than the standard interpreter shell you get when executing Python.
Among this functionality is the ability to load/save workspace sessions, create macros out of session input etc., which is probably more like what you are used to in Matlab (I actually use both and find IPython to be much more elegant, but YMMV):
http://ipython.scipy.org

PiCloud has implemented a fancier pickle, but I can't find the code. I saw a poster session.
Generally in Python instantiated objects don't have any one way to recreate them, and in some cases its particularly difficult (like an open file) as it takes several steps to recreate.

I take issue with the statement that the saving of variables in Matlab is an environment function. the "save" statement in matlab is a function and part of the matlab language not just a command. It is a very useful function as you don't have to worry about the trivial minutia of file i/o and it handles all sorts of variables from scalar, matrix, objects, structures.

It's an old thread, but thought I should throw it out there anyway - Spyder the Scientific Python development environment allows you to do just this through the Variable explorer. There's a button there Save data that packs your whole workspace up in a .spydata file that you can later reload. Works like a charm when you're switching between projects!

How do I dump an entire Python process for later debugging inspection?

I have a Python application in a strange state. I don't want to do live debugging of the process. Can I dump it to a file and examine its state later? I know I've restored corefiles of C programs in gdb later, but I don't know how to examine a Python application in a useful way from gdb.
(This is a variation on my question about debugging memleaks in a production system.)

There is no builtin way other than aborting (with os.abort(), causing the coredump if resource limits allow it) -- although you can certainly build your own 'dump' function that dumps relevant information about the data you care about. There are no ready-made tools for it.
As for handling the corefile of a Python process, the Python source has a gdbinit file that contains useful macros. It's still a lot more painful than somehow getting into the process itself (with pdb or the interactive interpreter) but it makes life a little easier.

If you only care about storing the traceback object (which is all you need to start a debugging session), you can use debuglater (a fork of pydump). It works with recent versions of Python and has a IPython/Jupyter integration.
If you want to store the entire session, look at dill. It has a dump_session, and load_session functions.
Here are two other relevant projects:
python-checkpointing2
pycrunch-trace
If you're looking for a language agnostic solution, you want to create a core dump file. Here's an example with Python.

Someone above said that there is no builtin way to perform this, but that's not entirely true. For an example, you could take a look at the pylons debugging tools. Whene there is an exception, the exception handler saves the stack trace and prints a URL on the console that can be used to retrieve the debugging session over HTTP.
While they're probably keeping these sessions in memory, they're just python objects, so there's nothing to stop you from pickling a stack dump and restoring it later for inspection. It would mean some changes to the app, but it should be possible...
After some research, it turns out the relevant code is actually coming from Paste's EvalException module. You should be able to look there to figure out what you need.

It's also possible to write something that would dump all the data from the process, e.g.
Pickler that ignores the objects it can't pickle (replacing them with something else) (e.g. Python: Pickling a dict with some unpicklable items)
Method that recursively converts everything into serializable stuff (e.g. this, except it needs a check for infinitely recursing objects and do something with those; also it could try dir() and getattr() to process some of the unknown objects, e.g. extension classes).
But leaving a running process with manhole or pylons or something like that certainly seems more convenient when possible.
(also, I wonder if something more convenient was written since this question was first asked)

This answer suggests making your program core dump and then continuing execution on another sufficiently similar box.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.