I am working on a simulation engine in Python where I collect a lot of metrics. The simulation runs at high speed and generates around 100K events/second (I can do some processing by consolidating these events on a per-second basis). I am looking for a mechanism to record these metrics as a time series.
My requirements are:
I would like to have this logging mechanism in the same process as the simulation, as opposed to an external process such as Graphite
The mechanism must be able to handle 100K events/second without slowing down the simulation.
I would like to store data as follows: each metric should be stored with 1-second granularity for 60 minutes, 1-minute granularity for 1 day, 5-minute granularity for 2 days, 1-hour granularity for 6 months, and 1-day granularity for 3 years. I would like this mechanism to handle the consolidation of data across these ranges.
Ideally, I want to maintain one file that holds the metrics information for one simulation run. For another run of the simulation a separate file would have to be created.
It would be nice to have a well-tested library/module that is readily available :)
BTW, I took a cursory look at RRDTool but from what I understand it seems like the Python library is a thin wrapper around the RRDTool binary. I'm looking for a tighter integration if possible.
TIA
The functionality provided by RRDTool fits my requirements. Initially I found the Python library https://pypi.python.org/pypi/python-rrdtool/ and misunderstood the nature of the integration: I thought it executed the RRDTool binary as a separate process, but the documentation says it is a proper Python wrapper that invokes the functionality in the same process space.
Later on I found this Python library (https://pypi.python.org/pypi/PyRRD), which wraps the RRDTool functionality in a more Pythonic, object-oriented fashion that I found comfortable to work with. The documentation on the linked page was good, so I faced no roadblocks in using it.
This link (http://www.vandenbogaerdt.nl/rrdtool/tutorial/rrdcreate.php) was helpful in figuring out how to configure the RRD database during creation.
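For reference, here is a minimal sketch of how the retention policy from the question could map to an RRD definition, assuming the python-rrdtool bindings (import rrdtool); the data-source name "events" and the GAUGE type are my own illustrative choices, not part of the original answer:

    # Minimal sketch assuming the python-rrdtool bindings.
    # The data-source name "events" and GAUGE type are illustrative assumptions.
    import rrdtool

    rrdtool.create(
        "simulation_run_1.rrd",
        "--step", "1",                  # base resolution: 1 second
        "DS:events:GAUGE:2:U:U",        # one data source, 2 s heartbeat, no min/max limits
        "RRA:AVERAGE:0.5:1:3600",       # 1-second granularity kept for 60 minutes
        "RRA:AVERAGE:0.5:60:1440",      # 1-minute granularity kept for 1 day
        "RRA:AVERAGE:0.5:300:576",      # 5-minute granularity kept for 2 days
        "RRA:AVERAGE:0.5:3600:4392",    # 1-hour granularity kept for ~6 months
        "RRA:AVERAGE:0.5:86400:1095",   # 1-day granularity kept for 3 years
    )

    # Feed one consolidated value per second from the simulation loop:
    rrdtool.update("simulation_run_1.rrd", "N:100000")

RRDTool then handles the consolidation between the archives automatically.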
I am currently working on a project where I need to synchronise data between Python and C#. I need to label data from C# using a Python machine-learning program. To label the data, I am using timestamps from both applications and, based on the common timestamp, I am labelling the data.
The Python program runs every 0.5 to 1.5 seconds and the C# program runs 10 times per second. Since the two processes run at different rates, I know there is some time lag, so labelling the data using the timestamp is not very accurate. I want to analyse the time lag properly. For this I am looking at options for real-time synchronization between the two programs. I have looked into sockets, but I think there is a better way using IPC; I do not know much about this.
I am thinking of creating a shared variable between Python and C#. Since Python is slower, I would update that variable from Python and read it from the C# program, so seeing the same value in both programs would tell us that they are synchronized. I could then use this variable instead of the timestamp for labelling the data. I think this might solve the issue. Please let me know what the best solution would be to minimize the time lag between the two programs.
Since these are complex projects, I cannot implement them in a single program; I need to find a way to synchronize the two programs.
Any suggestions would be appreciated. Thank you.
I tried working with socket programming, but it was not that good and a bit complex, so I am now thinking about IPC, though I am still not sure which is the best way.
First of all, I implemented a socket in the C# program so that I can get data from it. Then I implemented multiprocessing in Python: one process requests data from the socket and another process runs the ML model. I was able to achieve the synchronization using the multiprocessing module; I used multiprocessing.Event() to wait for an event from the other process. You can also look into shared state in Python using multiprocessing.Value, multiprocessing.Array, and multiprocessing.Event (see the sketch below).
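A minimal sketch of the shared-state approach on the Python side; the producer/consumer split and the names (tick, ready) are my own assumptions, not the actual programs:

    # Minimal sketch: share a counter and an event between two Python processes.
    # Names (producer/consumer, "tick") are illustrative assumptions.
    import multiprocessing as mp
    import time

    def producer(tick, ready):
        for i in range(1, 6):
            with tick.get_lock():
                tick.value = i      # update the shared counter
            ready.set()             # signal the other process that new data exists
            time.sleep(1.0)

    def consumer(tick, ready):
        for _ in range(5):
            ready.wait()            # block until the producer signals
            ready.clear()
            print("consumer saw tick", tick.value)

    if __name__ == "__main__":
        tick = mp.Value("i", 0)     # shared integer
        ready = mp.Event()
        p = mp.Process(target=producer, args=(tick, ready))
        c = mp.Process(target=consumer, args=(tick, ready))
        p.start(); c.start()
        p.join(); c.join()

The same pattern works when one process is the socket reader and the other runs the ML model.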
I have a Python script that will regularly check an API for data updates. Since it runs without supervision, I would like to be able to monitor what the script does to make sure it works properly.
My initial thought is just to write every communication attempt with the API to a text file with date, time and whether data was pulled or not, with a new line for every attempt. My question to you is whether you would recommend doing it another way. Write to Excel, for example, to be able to sort the columns? Or are there any other options worth considering?
I would say it really depends on two factors:
How often you update
How much interaction you want with the monitoring data (i.e. notifications, reporting, etc.)
I have had projects where we've updated Google Sheets (using the API) to be able to collaboratively extract reports from the update data.
However, note that this means a web call at every update, so if your updates are close together, this will affect performance. Also, if your app is interactive, there may be a delay while the data gets updated.
The upside is you can build things like graphs and timelines really easily (and collaboratively) where needed.
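If you go the Google Sheets route, a rough sketch with the gspread library could look like the following; the sheet name, credentials file, and columns are placeholders, not something from an actual project:

    # Rough sketch assuming the gspread library and a Google service account.
    # Sheet name, credentials path, and columns are illustrative placeholders.
    import datetime
    import gspread

    gc = gspread.service_account(filename="service_account.json")
    ws = gc.open("api-monitoring").sheet1

    def log_attempt(pulled_data):
        # Note: this is one web call per update, so frequent updates add latency.
        status = "pulled" if pulled_data else "no data"
        ws.append_row([datetime.datetime.now().isoformat(), status])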
Also - yes, definitely the logging module as answered below. I sort of assumed you were using the logging module already for the local file for some reason!
Take a look at the logging documentation.
A new line for every input is a good start. You can configure the logging module to print date and time automatically.
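For example, something along these lines (the file name and format string are just placeholders):

    # Minimal logging setup; file name and format string are placeholders.
    import logging

    logging.basicConfig(
        filename="api_monitor.log",
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",   # date/time added automatically
    )

    logging.info("Checked API: data pulled")
    logging.warning("Checked API: no new data")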
I've been poring over everything I can find for an answer to this, but can't seem to find anything:
I've got a batch update to a MySQL database that happens every few minutes, with Python handling the ETL work (I'm pulling data from web APIs into the MySQL system).
I'm trying to get a sense of what kinds of potential impact (be it positive or negative) I'd see by using either multithreading or multiprocessing to do multiple connections & inserts of the data simultaneously. Each worker (be it thread or process) would be updating a different table from any other worker.
At the moment I'm only updating a half-dozen tables with a few thousand records each, but this needs to be scalable to dozens of tables and hundreds of thousands of records each.
Every other resource I can find addresses multithreading/multiprocessing against the same table, not a distinct table per worker. I get the impression I would definitely want to use multithreading/multiprocessing, but everyone seems to be addressing the one-table use case.
Thoughts?
I think your question is too broad to answer concisely. It seems you're asking about two separate subjects: will writing to separate MySQL tables speed it up, and is Python multithreading the way to go. For the Python part, since you're probably doing mostly I/O, you should look at gevent and ultramysql (a plain-threads sketch is below). As for the MySQL part, you'll have to wait for more answers.
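To illustrate the one-worker-per-table idea with plain threads, here is a sketch assuming the PyMySQL driver; the table names, columns, and connection details are placeholders:

    # Sketch: one thread per table, each with its own connection.
    # Assumes the PyMySQL driver; table/column names and credentials are placeholders.
    from concurrent.futures import ThreadPoolExecutor
    import pymysql

    TABLES = {
        "metrics_a": [(1, "foo"), (2, "bar")],
        "metrics_b": [(3, "baz")],
    }

    def load_table(table, rows):
        conn = pymysql.connect(host="localhost", user="etl",
                               password="secret", database="warehouse")
        try:
            with conn.cursor() as cur:
                cur.executemany(
                    f"INSERT INTO {table} (id, value) VALUES (%s, %s)", rows)
            conn.commit()
        finally:
            conn.close()

    with ThreadPoolExecutor(max_workers=len(TABLES)) as pool:
        futures = [pool.submit(load_table, t, r) for t, r in TABLES.items()]
        for f in futures:
            f.result()   # re-raise any worker exception

Since each worker touches a different table and the work is mostly I/O, threads (or gevent) should let the network waits overlap rather than serialize.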
For one I wrote in C#, I decided the best work partitioning was each "source" having a thread for extraction, one for each transform "type", and one to load the transformed data to each target.
In my case, I found multiple threads per source just ended up saturating the source server too much; it became less responsive overall (to even non-ETL queries) and the extractions didn't really finish any faster since they ended up competing with each other on the source. Since retrieving the remote extract was more time consuming than the local (in memory) transform, I was able to pipeline the extract results from all sources through one transformer thread/queue (per transform "type"). Similarly, I only had a single target to load the data to, so having multiple threads there would have just monopolized the target.
(Some details omitted/simplified for brevity, and due to poor memory.)
...but I'd think we'd need more details about what your ETL process does.
My web app asks users 3 questions and simply writes the answers to a file as a1,a2,a3. I also have a real-time visualization of the average of the data (it reads from the file in real time).
Must I use a database to ensure that no (or minimal) information is lost? Is it possible to produce a queue of reads/writes? (Since the files are small, I am not too worried about the execution time of each call.) Does Python/Flask already take care of this?
I am quite experienced in Python itself, but not in this area (with Flask).
I see a few solutions:
read /dev/urandom a few times, calculate the SHA-256 of the bytes and use it as a file name; a collision is extremely improbable
use Redis and a command like LPUSH; using it from Python is very easy. Then RPOP from the right end of the list, and there's your queue (see the sketch after this list)
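A rough sketch of both ideas; the key name and payload are placeholders, and the Redis part assumes the redis-py client:

    # Rough sketch of both ideas; key names and payload are placeholders.
    import hashlib
    import json
    import os
    import redis   # assumes the redis-py client

    # 1) Collision-resistant file name from random bytes
    name = hashlib.sha256(os.urandom(32)).hexdigest() + ".txt"
    with open(name, "w") as f:
        f.write("a1,a2,a3\n")

    # 2) Simple queue in Redis: LPUSH on one end, RPOP on the other
    r = redis.Redis()
    r.lpush("answers", json.dumps({"a1": 1, "a2": 2, "a3": 3}))
    item = r.rpop("answers")        # bytes, or None if the queue is empty
    if item is not None:
        print(json.loads(item))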
I'm using feedparser to print the top 5 Google news titles. I get all the information from the URL the same way as always.
import feedparser as fp

x = 'https://news.google.com/news/feeds?pz=1&cf=all&ned=us&hl=en&topic=t&output=rss'
feed = fp.parse(x)
My problem is that I'm running this script when I start a shell, so that ~2 second lag gets quite annoying. Is this time delay primarily from communications through the network, or is it from parsing the file?
If it's from parsing the file, is there a way to only take what I need (since that is very minimal in this case)?
If it's from the former possibility, is there any way to speed this process up?
I suppose that a few delays are adding up:
The Python interpreter needs a while to start and import the module
Network communication takes a bit
Parsing probably consumes only a little time, but it is not free
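To see how the time splits between the network and the parsing, you could time the two steps separately; a sketch (feedparser also accepts raw feed text, so the download can be done by hand):

    # Sketch: time the network fetch and the parse separately.
    import time
    import urllib.request
    import feedparser

    url = "https://news.google.com/news/feeds?pz=1&cf=all&ned=us&hl=en&topic=t&output=rss"

    t0 = time.perf_counter()
    raw = urllib.request.urlopen(url).read()
    t1 = time.perf_counter()
    feed = feedparser.parse(raw)          # feedparser also accepts raw XML bytes
    t2 = time.perf_counter()

    print(f"download: {t1 - t0:.2f}s, parse: {t2 - t1:.2f}s")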
I think there is no straightforward way of speeding things up, especially not the first point. My suggestion is to download your feeds on a regular basis (you could set up a cron job or write a Python daemon) and store them somewhere on disk (e.g. a plain text file), so you only need to display them at your terminal's startup (echo would probably be the easiest and fastest).
I have personally had good experiences with feedparser; I use it to download ~100 feeds every half hour with a Python daemon.
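A minimal sketch of the cache-to-disk idea; the cache path is a placeholder, and the script would be run from cron or a daemon rather than at shell startup:

    # Sketch: fetch the feed on a schedule (e.g. via cron) and cache the titles to disk.
    # The cache path is a placeholder.
    import feedparser

    URL = "https://news.google.com/news/feeds?pz=1&cf=all&ned=us&hl=en&topic=t&output=rss"
    CACHE = "/tmp/google_news_titles.txt"

    feed = feedparser.parse(URL)
    with open(CACHE, "w") as f:
        for entry in feed.entries[:5]:
            f.write(entry.title + "\n")

Your shell startup then only needs something like cat /tmp/google_news_titles.txt, which is effectively instant.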
Parsing in real time is not the best approach if you want a faster result.
You can try doing it asynchronously with Celery or other similar solutions. I like Celery; it offers many capabilities, such as scheduled (cron-like) tasks, asynchronous tasks, and more.
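A rough sketch of what a periodic Celery task could look like; the broker URL, schedule, and task name are assumptions:

    # Rough sketch of a periodic Celery task; broker URL and schedule are assumptions.
    from celery import Celery
    import feedparser

    app = Celery("feeds", broker="redis://localhost:6379/0")

    app.conf.beat_schedule = {
        "refresh-google-news": {
            "task": "tasks.refresh_feed",
            "schedule": 1800.0,   # every 30 minutes
        },
    }

    @app.task(name="tasks.refresh_feed")
    def refresh_feed():
        url = "https://news.google.com/news/feeds?pz=1&cf=all&ned=us&hl=en&topic=t&output=rss"
        return [entry.title for entry in feedparser.parse(url).entries[:5]]

The task result (or a file it writes) can then be read at shell startup without paying the download cost there.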