opencensus exporter - one global or per thread? - python

I am using OpenCensus to do some monitoring on a gRPC server with 10 workers. My question is whether, when making a Tracer, the exporter for the tracer should be local or global, i.e.
this is the server:
server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
Do I do:
tracer_module.Tracer(sampler=always_on.AlwaysOnSampler(), exporter=GLOBAL_EXPORTER)
where:
GLOBAL_EXPORTER = stackdriver_exporter.StackdriverExporter(transport=BackgroundThreadTransport)
OR do I do:
tracer_module.Tracer(sampler=always_on.AlwaysOnSampler(), exporter=stackdriver_exporter.StackdriverExporter(transport=BackgroundThreadTransport))
I have tried both and they work. The former uses a global exporter, which I would expect to be more efficient, but the aggregation seems a bit odd (one call gets 'aggregated' with another). The latter creates a new exporter for each call (which is short-lived, since it exists only for that call) and does seem to export correctly. The question is more about what is correct from a system perspective, i.e. for the second option, does creating stackdriver_exporter.StackdriverExporter(transport=BackgroundThreadTransport) invalidate a different exporter that was created in a different thread?

You should use a global exporter. It was not intended for a new export thread to be created for every Tracer. There should be one background thread running which handles all exporting to StackDriver.
As for the aggregation, it shouldn't be aggregating all the spans together. That may be a bug in the StackDriver UI (there are a number of known issues).
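For illustration, a minimal sketch of the global-exporter pattern (import paths are assumed from the names used in the question and may differ between opencensus versions):

from opencensus.trace import tracer as tracer_module
from opencensus.trace.samplers import always_on
from opencensus.trace.exporters import stackdriver_exporter
from opencensus.trace.exporters.transports.background_thread import BackgroundThreadTransport

# Created once at module import: a single background thread flushes spans to Stackdriver.
GLOBAL_EXPORTER = stackdriver_exporter.StackdriverExporter(transport=BackgroundThreadTransport)

def handle_request():
    # A new Tracer per request/worker thread is fine; they all feed the same exporter.
    tracer = tracer_module.Tracer(sampler=always_on.AlwaysOnSampler(),
                                  exporter=GLOBAL_EXPORTER)
    with tracer.span(name='handle_request'):
        pass  # actual request handling here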

bokeh multiple live streaming graphs in different objects / register update routine located in other class

I use Python and Bokeh to implement streamed live graphing. I want to include several live graphs in a gridplot and run into a kind of "deadlock".
The graphs (there are a lot of them) are created by different classes and the figure objects are returned and then used as input to the gridplot() function.
For live graphing, curdoc().add_periodic_callback(update1, 300) references the update routine. I call the update routines of the other graphs directly from update1(). This works but continuously gives me the following error:
raise RuntimeError("_pending_writes should be non-None when we have a document lock, and we should have the lock when the document changes")
This is expected behavior, since data of the other graphs is altered from 'outside' of their object and from an 'unregistered update routine'. I want to get rid of this error.
In my main object (where the layout is pieced together and curdoc().add_root() is called) I intend to register the other graphs' update routines (which have to be regular object methods, so that they can be referenced) via curdoc().add_periodic_callback(). The problem with this approach is that the objects' update functions take the self parameter and Bokeh does not accept that.
Yet I cannot do it without self, because update() needs to reference the source.stream object.
I have no clue how to solve this or do it the 'correct' way. Suggestions are appreciated.
thanks
for clarification:
main object:
def graph(self):
.... bokeh code
@count()
def update(t):
.... update code
curdoc().add_root(gridplot([[l[0]], [l[1]]], toolbar_location="left", plot_width=1000))
curdoc().add_periodic_callback(update, 300)
this works
generic other object
def graph(self):
.... bokeh code
def update(self,t): ....
main object:
curdoc().add_periodic_callback(other_object.update, 300)
this does NOT work.
"_pending_writes should be non-None when we have a document lock, and we should have the lock when the document changes"
Disclaimer: I've been dealing with this error in my own work for two weeks now, and finally resolved the issue today. (: It's easy when every sample you see comes with a csv file that's read and pushed to the doc, all in the same thread, but when things get real and you have a streaming client doing the same thing, suddenly everything stops working.
The general issue is, Bokeh server needs to keep its version of the document model in sync with what Bokeh client has. This happens through a series of events and communication happening between the client (Javascript running in the browser) and the server (inside an event loop that we will get to later on).
So every time you need to change the document, which essentially affects the model, the document needs to be locked (the simplest reason I could think of is concurrency). The simplest way to get around this issue is to tell the Bokeh document instance you hold that you need to make a change and to request a callback, so Bokeh manages when to call your delegate and allows you to update the document.
Now, with that said, there are a few methods on bokeh.document.Document that help you request a callback.
The method you would probably want to use for your use case, for example if you need a callback as soon as possible, is add_next_tick_callback.
One thing to remember is that the reference/pointer to your doc must be correct.
To make sure of that, I wrapped my whole charting application in a class and kept an instance of the doc internally, so I can access its add_next_tick_callback when new data is received. The way I could point at the right instance was to initialize my Bokeh app using bokeh.server.server.Server - when it initializes the app, you receive a doc variable that is created before the server starts - that is the right reference to the doc you present in the app. One benefit of having this "chart initializer" in a class is that you can instantiate it as many times as you need to construct more charts/documents.
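A minimal sketch of that pattern (class and variable names here are illustrative, not taken from the question's code): the class keeps the doc it was given and funnels every source.stream() call through add_next_tick_callback.

from functools import partial
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource
from bokeh.server.server import Server

class LiveChart:
    def __init__(self, doc):
        self.doc = doc  # the doc handed to us by the Bokeh server
        self.source = ColumnDataSource(data=dict(x=[], y=[]))
        fig = figure(plot_width=400, plot_height=300)
        fig.line('x', 'y', source=self.source)
        doc.add_root(fig)

    def push(self, x, y):
        # Safe to call from any thread (e.g. a streaming client); the actual
        # document change is deferred until Bokeh holds the document lock.
        self.doc.add_next_tick_callback(partial(self._update, x, y))

    def _update(self, x, y):
        self.source.stream(dict(x=[x], y=[y]))

def make_app(doc):
    chart = LiveChart(doc)
    # hand chart.push to whatever produces the data ...

server = Server({'/': make_app})
server.start()
server.io_loop.start()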
Now, if you are a fan of data pipelines, and streaming, and use something like StreamZ to stream the data to the Pipe or Buffer instance you may have, you must remember one thing:
Be aware of what happens asynchronously, in a thread, or outside of it. Bokeh relies on tornado.ioloop.IOLoop for the most part, and if you are anywhere near running things asynchronously, you must have come across asyncio.
The event loops on these two modules can conflict, and will affect how/when you can change the document.
If you are running your streaming in a thread (as the streaming client I wrote did), make sure that thread has a current event loop, otherwise you will face other similar issues. Threads can cause conflicts with internally created loops and affect how they interact with each other.
You can give the thread its own loop with something like the following:
asyncio.set_event_loop(asyncio.new_event_loop())
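For example (a sketch only - streaming_worker, my_data_stream and chart.push are hypothetical names tying back to the class sketch above):

import asyncio
import threading

def streaming_worker(chart):
    # Give this worker thread its own asyncio event loop before it starts streaming.
    asyncio.set_event_loop(asyncio.new_event_loop())
    for x, y in my_data_stream():
        chart.push(x, y)

threading.Thread(target=streaming_worker, args=(chart,), daemon=True).start()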
Finally, be aware of what @gen.coroutine does in tornado. Your callbacks for the streaming, the way I understood it, must be decorated with @gen.coroutine if you are doing things asynchronously.

How to specify dask client via environment variable

How can I instruct dask to use a distributed Client as the scheduler, externally from the code, e.g. via an environment variable?
The motivation is to take advantage of one of the key features of dask - namely the transparency of going from a single machine to a distributed cluster. However, there seems to be one little thing obscuring this transparency - the need to register a Client via code.
I can set the named schedulers (e.g. "synchronous" and "processes") via the config (file/env var) as instructed here, but how do I use the same mechanism with a distributed one?
Ideally, I would like to set something like:
DASK_SCHEDULER=distributed(scheduler_file=...)
as an environment variable which would be equivalent of running client = Client(scheduler_file=...) within python code.
This would then mean the EXACT same code can be run in different environments (local and distributed).
One way to do it would be to pass the scheduler as an argument, say using argparse.
Thus you could run python my_script.py <ip:port>, where you specify either the distributed scheduler's address or <127.0.0.1:port> for local.
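A minimal sketch of that approach (argument name and file names are illustrative): only create a distributed Client when an address is given, otherwise fall back to the default local scheduler, so the rest of the code stays identical in both environments.

import argparse
import dask.dataframe as dd

parser = argparse.ArgumentParser()
parser.add_argument('--scheduler', default=None,
                    help='address of a dask scheduler, e.g. 127.0.0.1:8786')
args = parser.parse_args()

if args.scheduler:
    from dask.distributed import Client
    client = Client(args.scheduler)  # registers itself as the default scheduler

df = dd.read_csv('data-*.csv')  # hypothetical input
print(df.x.mean().compute())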

Python API implementing a simple log file

I have a Python script that will regularly check an API for data updates. Since it runs without supervision, I would like to be able to monitor what the script does to make sure it works properly.
My initial thought is just to write every communication attempt with the API to a text file with date, time and whether data was pulled or not, a new line for every input. My question to you is whether you would recommend doing it another way - writing to Excel, for example, to be able to sort the columns? Or are there any other options worth considering?
I would say it really depends on two factors
How often you update
How much interaction do you want with the monitoring data (i.e. notification, reporting etc)
I have had projects where we've updated Google Sheets (using the API) to be able to collaboratively extract reports from update data.
However, note that this means a web call at every update, so if your updates are close together, this will affect performance. Also, if your app is interactive, there may be a delay while the data gets updated.
The upside is you can build things like graphs and timelines really easily (and collaboratively) where needed.
Also - yes, definitely the logging module as answered below. I sort of assumed you were using the logging module already for the local file for some reason!
Take a look at the logging documentation.
A new line for every input is a good start. You can configure the logging module to print date and time automatically.
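A minimal sketch of that (the file name and the pull_from_api helper are made up for illustration):

import logging

logging.basicConfig(filename='api_monitor.log',
                    level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

def check_api():
    try:
        data = pull_from_api()  # hypothetical API call
        logging.info('update pulled: %d records', len(data))
    except Exception:
        logging.exception('API call failed')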

Python program on server - control via browser

I have to set up a program which reads in some parameters from a widget/GUI, calculates some stuff based on database values and the input, and finally sends some ASCII files via FTP to remote servers.
In general, I would suggest a Python program to do the tasks: write a Qt widget as a GUI (interactively changing views, putting numbers into tables, setting up check boxes, switching between various layers - I've never done something as complex in Python, but have some experience in IDL with event handling etc.), and set up data classes that have functions both to create the ASCII files with the given convention and to send the files via FTP to some remote server.
However, since my company is a bunch of Windows users, each sitting at their personal desktop, installing python and all necessary libraries on each individual machine would be a pain in the ass.
In addition, in a future version the program is supposed to become smart and do some optimization 24/7. Therefore, it makes sense to put it to a server. As I personally rather use Linux, the server is already set up using Ubuntu server.
The idea is now to run my application on the server. But how can the users access and control the program?
The easiest way for everybody to access something like a common control panel would be a browser, I guess. I have to make sure only one person at a time is sending signals to the same units, but that should be doable via flags in the database.
After some googling, next to QtWebKit, Django seems to be the first choice for such a task. But...
Can I run a full fledged python program underneath my web application? Is django the right tool to do so?
As mentioned previously, in the (intermediate) future (~1 year), we might have to implement some computationally expensive tasks. Is it then also possible to utilize C, as one can within normal Python?
Another question I have is on the development. In order to become productive, we have to advance in small steps. Can I first create regular Python classes, which later on can be imported into my web application? (Same question applies for widgets / Qt.)
Finally: Is there a better way to go? Any standards, any references?
Django is a good candidate for the website, however:
It is not a good idea to run heavy functionality from a website; it should happen in a separate process.
All functions should be asynchronous, i.e. you should never wait for something to complete.
I would personally recommend writing a separate process with a message queue; the website would only ask that process for statuses and always display a result immediately to the user.
You can use ajax so that the browser will always have the latest result.
ZeroMQ or Celery are useful for implementing the functionality.
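A minimal sketch of that split using Celery (broker URL, task and module names are illustrative, not a full implementation):

# tasks.py - runs in its own worker process, started with: celery -A tasks worker
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0',
             backend='redis://localhost:6379/0')

@app.task
def build_and_send_files(params):
    # heavy work: query the database, build the ASCII files, FTP them out
    return 'done'

# views.py - the Django side only enqueues work and reports status via Ajax
from django.http import JsonResponse
from tasks import build_and_send_files

def start_job(request):
    result = build_and_send_files.delay(dict(request.GET))
    return JsonResponse({'task_id': result.id})

def job_status(request, task_id):
    result = build_and_send_files.AsyncResult(task_id)
    return JsonResponse({'state': result.state})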
You can implement functionality in C pretty easily. I recommend, however, that you write that functionality as pure C with a SWIG wrapper rather than writing it as an extension module for Python. That way the functionality will be portable and not dependent on the Python website.

Python + MySQLDB Batch Insert/Update command for two of the same databases

I'm working with two databases, a local version and the version on the server. The server is the most up-to-date version, and instead of recopying all values in all tables from the server to my local version,
I would like to go through each table and only insert/update the values that have changed on the server, copying those values to my local version.
Is there some simple method for handling such a case? Some sort of batch insert/update? Googling for the answer isn't working, and I've tried my hand at coding one but am starting to get tied up in error handling...
I'm using Python and MySQLdb... Thanks for any insight
Steve
If all of your tables' records had timestamps, you could identify "the values that have changed in the server" -- otherwise, it's not clear how you plan to do that part (which has nothing to do with insert or update, it's a question of "selecting things right").
Once you have all the important values, somecursor.executemany will let you apply them all as a batch. Depending on your indexing it may be faster to put them into a non-indexed auxiliary temporary table, then insert/update from all of that table into the real one (before dropping the aux/temp one), the latter of course being a single somecursor.execute.
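A minimal sketch of the executemany batch (table and column names are made up for illustration; it uses MySQL's INSERT ... ON DUPLICATE KEY UPDATE so one statement covers both insert and update):

import MySQLdb

local = MySQLdb.connect(host='localhost', user='me', passwd='secret', db='mydb')
cur = local.cursor()

# (id, name, updated_at) tuples for the rows identified as changed - illustrative data
changed_rows = [(1, 'alice', '2010-01-01'),
                (2, 'bob',   '2010-01-02')]

cur.executemany("""INSERT INTO customers (id, name, updated_at)
                   VALUES (%s, %s, %s)
                   ON DUPLICATE KEY UPDATE name=VALUES(name),
                                           updated_at=VALUES(updated_at)""",
                changed_rows)
local.commit()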
You can reduce wall-clock time for the whole job by using one (or a few) threads to do the selects and put the results onto a Queue.Queue, and a few worker threads to apply results plucked from the queue to the internal/local server. (The best balance of reading vs writing threads is found by trying a few combinations and measuring - writing per se is slower than reading, but your bandwidth to your local server may be higher than to the other one, so it's difficult to predict.)
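For illustration, a sketch of that reader/writer split (Python 2-era Queue and MySQLdb, as in the question; last_sync and the table/columns are hypothetical, and each writer thread should be given its own local connection):

import threading
import Queue

work = Queue.Queue(maxsize=10)

def reader(remote_conn):
    cur = remote_conn.cursor()
    cur.execute("SELECT id, name, updated_at FROM customers WHERE updated_at > %s",
                (last_sync,))
    while True:
        batch = cur.fetchmany(500)
        if not batch:
            break
        work.put(batch)
    work.put(None)  # sentinel: no more batches

def writer(local_conn):
    cur = local_conn.cursor()
    while True:
        batch = work.get()
        if batch is None:
            work.put(None)  # pass the sentinel on to any other writer threads
            break
        cur.executemany("""INSERT INTO customers (id, name, updated_at)
                           VALUES (%s, %s, %s)
                           ON DUPLICATE KEY UPDATE name=VALUES(name),
                                                   updated_at=VALUES(updated_at)""",
                        batch)
        local_conn.commit()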
However, all of this is moot unless you do have a strategy to identify "the values that have changed in the server", so it's not necessarily very useful to enter into more discussion about details "downstream" from that identification.
