Dask load JSON (for realtime plots) - python

I am trying to load JSON from an HTTP address using dask and then put it into a dataframe in order to plot some experiment data with dash. The goal is to fetch the data in realtime and show realtime plots of the machines (example data can be found here: http://aav.rz-berlin.mpg.de:17668/retrieval/data/getData.json?pv=FHIMP%3AHeDrop%3AForepressure_Droplet_Src)
This is what I tried:
import json
import dask.bag as db
mybag = db.read_text("http://aav.rz-berlin.mpg.de:17668/retrieval/data/getData.json?pv=FHIMP%3AHeDrop%3AForepressure_Droplet_Src").map(json.loads)
mybag.to_dataframe()
but mybag.to_dataframe() freezes my code.
I also tried:
import dask.dataframe as dd
dd.read_json('url')
which returned "ValueError: Expected object or value". So according to the error message, there's no JSON at all. Does the problem derive from the JSON consisting of a meta and a data field?
Side question: Does my system even make sense like this if I want to provide a web app for monitoring? It's my first time working with Dash and Dask. Dask basically does the work of a backend here, if I understood it right, and there's no need for it to stand on its own if I have an API that's sending me JSON data.

Dask is not, generally, a realtime/streaming analysis engine. Mostly, things are expected to be functional: running the same task with the same arguments is guaranteed to produce the same output - clearly not the case here.
Realtime analysis can be produced with the client.submit API, which creates arbitrary tasks at the time of invocation. However, it still requires that each task be finite so that other tasks can take its result and operate on it. Reading from the given URL never ends.
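For instance, a minimal sketch of the submit pattern, assuming there is some finite snapshot endpoint you can poll (fetch_once and the URL here are placeholders):
from dask.distributed import Client
import requests

def fetch_once(url):
    # a finite task: one HTTP request returning one JSON payload
    return requests.get(url, timeout=10).json()

client = Client()
# pure=False stops dask from caching repeated calls with identical arguments,
# so each submit really refetches the data
future = client.submit(fetch_once, "http://example.com/getData.json", pure=False)
snapshot = future.result()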
If you want to use dask in conjunction with streaming data, or generally want to work on streaming data in python, you might want to try streamz. The sources listed are mostly polling (repeat some action on a timer to check for new events) or driven by inbound events (like a server waiting for connections). You could easily make a source for the HTTP endpoint, though:
from streamz import Source, Stream
import aiohttp

@Stream.register_api(staticmethod)
class from_http(Source):
    def __init__(self, url, chunk_size=1024, **kwargs):
        self.url = url
        self.chunk_size = chunk_size
        super().__init__(**kwargs)

    async def run(self):
        async with aiohttp.ClientSession() as session:
            async with session.get(self.url) as resp:
                async for chunk in resp.content.iter_chunked(self.chunk_size):
                    await self.emit(chunk, asynchronous=True)
The output of this streaming node is chunks of binary data - it would be up to you to write downstream nodes which can parse this into JSON (since the chunk boundaries won't respect the JSON record terminators).
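One possible sketch of such a downstream node accumulates chunks and splits on newlines; this assumes the stream is newline-delimited JSON, which you would need to verify for the archiver endpoint above:
import json

def split_records(tail, chunk):
    # keep the unterminated remainder as state, emit the complete lines
    *complete, rest = (tail + chunk).split(b"\n")
    return rest, complete

source = from_http("http://example.com/stream.json")   # placeholder endpoint
records = (source
           .accumulate(split_records, start=b"", returns_state=True)
           .flatten()
           .filter(len)
           .map(json.loads))
records.sink(print)
source.start()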

A small note on the error in the second snippet in the question: the literal string 'url' is passed to dd.read_json rather than an actual URL, which is what triggers the error:
import dask.dataframe as dd
dd.read_json('url') # note that a string is passed here
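For completeness, a quick way to inspect the payload without dask might look like the sketch below, assuming the endpoint returns a top-level list whose entries hold 'meta' and 'data' fields as the question suggests (adjust the keys to whatever the archiver actually returns):
import requests
import pandas as pd

url = ("http://aav.rz-berlin.mpg.de:17668/retrieval/data/getData.json"
       "?pv=FHIMP%3AHeDrop%3AForepressure_Droplet_Src")
payload = requests.get(url, timeout=10).json()

# assumption: payload looks like [{"meta": {...}, "data": [{...}, ...]}]
df = pd.json_normalize(payload[0]["data"])
print(df.head())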

Related

How to receive real-time data in excel using excel RTD and python?

I want to use excel as a front-end, which will continuously update multiple tables in real-time (every 2 seconds). I have a python program that prepares all the data tables, but it is running on some other server. I am storing python data in a Redis cache as a key-value pair.
E.g.
'bitcoin':'bitcoin,2021-04-23 14:23:23,49788,dollars,4068890,INR,100000'
'doge':'doge,2021-04-23 14:23:23,0.2334,dollars,21,INR,1000'
But now I also want to use the same data in excel. Furthermore, I found that I can use excel RTD functions to update data in excel in real-time. But I have no idea how python will send data to the excel RTD function.
As per my understanding, I need to set up some RTD server in python that will inject data into the excel RTD function. But how? I am not quite sure. Please help me with the required infrastructure or any code examples in python.
Note: I cannot use xlwings and pyxll(paid) for some reasons.
Thanking you in advance.
You can do this with xlOil which is free (disclaimer: I wrote it). It allows you to write an async generator function in python which is presented to Excel as an RTD function.
As an example, the following code defines an RTD worksheet function pyGetUrl which will fetch a URL every N seconds. I'm not familiar with Redis, but I can see several async python client libraries which should be able to replace aiohttp in the code below to access your data.
import asyncio
import aiohttp
import ssl
import xloil as xlo

# This is the implementation: it pulls the URL and returns the response as text
async def _getUrlImpl(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url, ssl=ssl.SSLContext()) as response:
            return await response.text()

#
# We declare an async gen function which calls the implementation either once,
# or at regular intervals
#
@xlo.func(local=False, rtd=True)
async def pyGetUrl(url, seconds=0):
    yield await _getUrlImpl(url)
    while seconds > 0:
        await asyncio.sleep(seconds)
        yield await _getUrlImpl(url)
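Since the question is ultimately about Redis rather than HTTP, the same pattern could poll Redis instead. This is only a sketch, assuming redis-py's asyncio client (redis >= 4.2) and a local server; pyGetRedis and the key are placeholders:
import asyncio
import redis.asyncio as aioredis   # asyncio client bundled with redis-py >= 4.2
import xloil as xlo

@xlo.func(local=False, rtd=True)
async def pyGetRedis(key, seconds=2):
    r = aioredis.Redis(host="localhost", port=6379, decode_responses=True)
    try:
        while True:
            # e.g. key = 'bitcoin' as stored by the producer
            yield await r.get(key)
            await asyncio.sleep(seconds)
    finally:
        await r.aclose()   # redis-py < 5 uses close() instead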
Rolling your own RTD server using Excel's COM interface and pywin32 may also be viable; you can look at this python example to see it done. You'll need to add a ProgId and CLSID to the Windows registry so Excel can find your server; the example shows you how to do this. Fair warning: a recent questioner was unable to make the example work. I also tried the example and had even less luck, so some debugging may be required.

Dask scheduler empty / graph not showing

I have a setup as follows:
# etl.py
from dask.distributed import Client
import dask
from tasks import task1, task2, task3
def runall(**kwargs):
print("done")
def etl():
client = Client()
tasks = {}
tasks['task1'] = dask.delayed(task)(*args)
tasks['task2'] = dask.delayed(task)(*args)
tasks['task3'] = dask.delayed(task)(*args)
out = dask.delayed(runall)(**tasks)
out.compute()
This logic was borrowed from luigi and works nicely with if statements to control what tasks to run.
However, some of the tasks load large amounts of data from SQL and cause GIL freeze warnings (at least this is my suspicion, as it is hard to diagnose which line exactly causes the issue). Sometimes the graph/monitoring shown on 8787 does not show anything, just "scheduler empty"; I suspect these are caused by the app freezing dask. What is the best way to load large amounts of data from SQL in dask (MSSQL and Oracle)? At the moment this is done with sqlalchemy with tuned settings. Would adding async and await help?
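A sketch of letting dask itself pull a table in partitioned chunks looks like this (table name, connection URI and index column are placeholders; whether this avoids the GIL warnings depends on the driver):
import dask.dataframe as dd

ddf = dd.read_sql_table(
    "measurements",                      # placeholder table name
    "mssql+pyodbc://user:pass@my_dsn",   # placeholder SQLAlchemy URI
    index_col="id",                      # indexed column used to partition the reads
    npartitions=16,
)
result = ddf.compute()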
However, some of the tasks are a bit slow and I'd like to use stuff like dask.dataframe or bag internally. The docs advise against calling delayed inside delayed. Does this also hold for dataframe and bag? The entire script is run on a single 40-core machine.
Using bag.starmap I get a graph like this:
where the upper straight lines are added/discovered once the computation reaches that task and compute is called inside it.
There appears to be no rhyme or reason other than the machine was/is very busy and struggling to show the state updates and bokeh plots as desired.

Difference between IOLoop.current().run_in_executor() and ThreadPoolExecutor().submit()

I'm quite new to Python Tornado and have been trying to start a new thread to run some IO blocking code whilst allowing the server to continue to handle new requests. I've been doing some reading but still can't seem to figure out what the difference is between these two functions?
For example calling a method like this:
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(1) as executor:
    future = executor.submit(report.write_gresb_workbook)
    print(future.result())
compared to:
from concurrent.futures import ThreadPoolExecutor
from tornado import ioloop

with ThreadPoolExecutor(1) as executor:
    my_success = await ioloop.IOLoop.current().run_in_executor(executor, report.write_gresb_workbook)
    print(my_success)
write_gresb_workbook takes some information from the object report and writes it to an excel spreadsheet (however I'm using openpyxl which takes ~20s to load an appropriately formatted workbook and another ~20s to save it which stops the server from handling new requests!)
The function simply returns True or False (which is what my_success is) as the report object has the path of the output file attached to it.
I haven't quite gotten either of these methods to work yet so they might be incorrect but was just looking for some background information.
Cheers!
IOLoop.run_in_executor and Executor.submit do essentially the same thing, but return different types. IOLoop.run_in_executor returns an asyncio.Future, while Executor.submit returns a concurrent.futures.Future.
The two Future types have nearly identical interfaces, with one important difference: Only asyncio.Future can be used with await in a coroutine. The purpose of run_in_executor is to provide this conversion.
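A minimal sketch of how this typically looks inside a Tornado handler (the Report class below is a stand-in for the question's report object):
import time
from concurrent.futures import ThreadPoolExecutor
from tornado import ioloop, web

class Report:                       # stand-in for the question's report object
    def write_gresb_workbook(self):
        time.sleep(20)              # simulates the slow openpyxl load/save
        return True

report = Report()
executor = ThreadPoolExecutor(max_workers=1)

class ReportHandler(web.RequestHandler):
    async def get(self):
        # the blocking call runs on the executor thread; the IOLoop keeps
        # serving other requests while this coroutine awaits the result
        ok = await ioloop.IOLoop.current().run_in_executor(
            executor, report.write_gresb_workbook)
        self.write({"success": ok})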

Run Multiple BigQuery Jobs via Python API

I've been working off of Google Cloud Platform's Python API library. I've had much success with these API samples out-of-the-box, but I'd like to streamline it a bit further by combining the three queries I need to run (and subsequent tables that will be created) into a single file. Although the documentation mentions being able to run multiple jobs asynchronously, I've been having trouble figuring out the best way to accomplish that.
Thanks in advance!
The idea of running multiple jobs asynchronously is to create/prepare as many jobs as you need and kick them off using the jobs.insert API (important: you should either collect all the respective job IDs or set your own - they just need to be unique). Those API calls return immediately, so you can kick them all off "very quickly" in one loop.
In the meantime, you need to check repeatedly for the status of those jobs (in a loop), and as soon as a job is done you can kick off processing of its result as needed.
You can check for details in Running asynchronous queries
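A rough sketch of that pattern with the Python client library, where client.query inserts the job and returns immediately and job.done() is polled before collecting results (the statements are placeholders):
import time
from google.cloud import bigquery

client = bigquery.Client()
statements = {"job_a": "SELECT 1 AS n", "job_b": "SELECT 2 AS n"}   # placeholders

# kick everything off; each call returns as soon as the job is inserted
jobs = {name: client.query(sql) for name, sql in statements.items()}

# poll until every job reports done, then collect its result
pending = dict(jobs)
while pending:
    for name, job in list(pending.items()):
        if job.done():
            print(name, [dict(row) for row in job.result()])
            del pending[name]
    time.sleep(1)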
BigQuery jobs are always async by default; this being said, requesting the result of the operation isn't. As of Q4 2021, the Python API does not support a proper async way to collect results. Each call to job.result() blocks the thread, making it impossible to use with a single-threaded event loop like asyncio. Thus, the best way to collect multiple job results is by using multithreading:
from typing import Dict
from concurrent.futures import ThreadPoolExecutor
from google.cloud import bigquery

client: bigquery.Client = bigquery.Client()

def run(name, statement):
    return name, client.query(statement).result()  # blocks the thread

def run_all(statements: Dict[str, str]):
    with ThreadPoolExecutor() as executor:
        jobs = []
        for name, statement in statements.items():
            jobs.append(executor.submit(run, name, statement))
        result = dict([job.result() for job in jobs])
    return result
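Usage would look something like this (placeholder queries):
results = run_all({
    "daily_totals": "SELECT 1 AS n",
    "user_counts": "SELECT 2 AS n",
})
# results maps each name to the RowIterator returned by its job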
P.S.: Some credits are due to @Fredrik Håård for this answer :)

High level / Abstract Canbus Interface in Python

I am working with CAN bus in python (the PCAN-Basic API) and would like to make it easier to use.
Via the bus a lot of devices/modules are connected. They are all allowed to send data; if a collision were to happen, the lowest ID wins.
The data is organized in frames with ID, SubID, and hex values.
To illustrate the problem I am trying to address, imagine the amplitude of a signal.
To read the value, a frame is sent to
QuestionID QuestionSUBID QuestionData
If there is no message with higher priority (= lower ID), the answer is written to the bus:
AnswerID AnswerSubID AnswerData
Since any module/device is allowed to write to the bus, you don't know in advance which answer you will get next. Setting a value works the same way, just with different IDs. So for the above example the amplitude would have:
4 IDs and SubIDs associated with read/write question/answer
Additionally, the length of the data (0-8) has to be specified/stored.
Since the data is all hex values, a parser has to be specified to obtain the human-readable value (e.g. voltage in decimal representation).
To store this information I use nested dicts:
parameters = {'Parameter_1': {'Read': {'question_ID': ID,
                                       'question_SUBID': SubID,
                                       'question_Data': hex_value_list,
                                       'answer_ID': ...,
                                       'answer_subID': ...,
                                       'answer_parser': function},
                              'Write': {'ID': ...,
                                        'SubID': ...,
                                        'parser': ...,
                                        'answer_ID': ...,
                                        'answer_subID': ...}},
              'Parameter_2': ...}
There are a lot of tools to show which value was set when, but for hardware control the order in which parameters are read is not relevant as long as they are up to date. Thus one part of a possible solution would be storing the whole traffic in a dict of dicts:
busdata = {'firstID': {'first_subID': {'data': data,
                                       'timestamp': timestamp},
                       'second_subID': {'data': data,
                                        'timestamp': timestamp},
                       },
           'secondID': ...}
Due to the nature of the bus, I get a lot of answers to questions that other devices asked - the bus is quite full - and these should not be dismissed, since they might be the values I need next and there is no need to create additional traffic. I might use the timestamp with an expiry date, but I haven't thought a lot about that so far.
This works, but is horrible to work with. In general I guess I will have about 300 parameters. The final goal is to control the devices via a (PyQt) GUI, read some values like serial numbers, and also run measurement tasks.
So the big question is how to define a better data structure that is easily accessible and understandable. I am looking forward to any suggestions on a clean design.
The main goal would be something like getting rid of the whole message-based approach.
EDIT: My goal is to get rid of the whole CAN-specific message-based approach:
I assume I will need one thread for the communication; it should:
Read the buffer and update my variables
Send requests (messages) to obtain other values/variables
Send some values periodically
So from the GUI I would like to be able to:
get parameter by name --> send a string with the parameter name
set parameter signal --> str(name), value (as displayed in the GUI)
get values periodically --> name, interval, duration (10 s or infinite)
The thread would have to:
Log all data on the bus for internal storage
Process requests by generating messages from name, value and read until result is obtained
Send periodical signals
I would like to have this design independent of the actual hardware:
The solution I thought of is the above parameters dict
For internal storage I thought about the bus_data_dict
Still, I am not sure how to:
Pass data from the bus thread to the gui (all values vs. new/requested value)
How to implement it with signals and slots in pyqt
Store data internally (dict of dicts or some new better idea)
Whether this design is a good choice
Using the python-can library will get you the networking thread - giving you a buffered queue of incoming messages. The library supports the PCAN interface among others.
Then you would create a middle-ware layer that converts and routes these can.Message types into pyqt signals. Think of this as a one to many source of events/signals.
I'd use another controller to be in charge of sending messages to the bus. It could have tasks like requesting periodic measurements from the bus, as well as on demand requests driven by the gui.
Regarding storing the data internally, it really depends on your programming style and the complexity. I have seen projects where each CAN message would have its own class.
Finally, queues are your friend!
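A minimal sketch of that receive path with python-can (channel and bitrate are placeholders for your PCAN setup; interface= was called bustype= in older python-can releases):
import can

bus = can.Bus(interface="pcan", channel="PCAN_USBBUS1", bitrate=500000)
reader = can.BufferedReader()              # thread-safe queue of can.Message
notifier = can.Notifier(bus, [reader])     # background thread that fills the reader

try:
    while True:
        msg = reader.get_message(timeout=1.0)   # returns None on timeout
        if msg is not None:
            # middle-ware layer: decode msg.arbitration_id / msg.data here and
            # emit a pyqt signal or put the result on a queue for the GUI
            print(hex(msg.arbitration_id), msg.data.hex())
finally:
    notifier.stop()
    bus.shutdown()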
Agree with @Hardbyte on the use of python-can. It's excellent.
As far as messaging between app layers, I've had a lot of luck with ZeroMQ - you can set up your modules as event-based, from the canbus message event all the way through to updating a UI or whatever.
For data storage / persistence, I'm dropping messages into SQLite, and in parallel (using ZMQ Pub/Sub pattern) passing the data to an IoT hub (via MQTT).
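A sketch of the pub/sub idea with pyzmq (addresses and payload are placeholders; note that SUB sockets only receive messages published after they connect):
import zmq

ctx = zmq.Context.instance()

# publisher side (in the bus-reading process): push decoded readings out
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://127.0.0.1:5556")
pub.send_json({"id": "0x123", "value": 42.0, "timestamp": 1619179403.2})

# subscriber side (e.g. the UI process or the SQLite writer)
sub = ctx.socket(zmq.SUB)
sub.connect("tcp://127.0.0.1:5556")
sub.setsockopt_string(zmq.SUBSCRIBE, "")   # subscribe to every topic
reading = sub.recv_json()                  # blocks until a message arrives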
class MySimpleCanBus:
    def parse_message(self, raw_message):
        return MyMessageClass._create(*struct.unpack(FRAME_FORMAT, raw_message))

    def receive_message(self, filter_data):
        # code to receive and parse a message (filtered by id)
        raw = canbus.recv(FRAME_SIZE)
        return self.parse_message(raw)

    def send_message(self, msg_data):
        # code to make sure the message can be sent and send the message
        return self.receive_message()


class MySpecificCanBus(MySimpleCanBus):
    def get_measurement_reading(self):
        msg_data = {}  # code to request a measurement
        return self.send_message(msg_data)

    def get_device_id(self):
        msg_data = {}  # code to get device_id
        return self.send_message(msg_data)
I probably don't understand your question properly ... maybe you could update it with additional details.
