I am trying to run a Python UDF directly on Druid. Running the Python function directly on the Druid machines has many advantages, not the least of which is avoiding huge data transfers to and from the remote database server.
For simplicity's sake, let's say I have a simple Python function that I would like to run directly inside the Druid system. Here's a sample function:
# Calculates the inverse of a matrix
import numpy

def matrix_inverse(A):
    return numpy.linalg.inv(A)
I would like to run this function remotely and directly in Druid (not on the client side). The data for the parameter (A) would be obtained from the database.
How could that be done?
No. Python UDFs are not available...yet.
There are JavaScript user defined functions:
https://druid.apache.org/docs/latest/development/javascript.html
Also consider creating a new feature request at: https://github.com/apache/druid/issues
and/or comment on this one: https://github.com/apache/druid/issues/10180
I've got a few small Python functions that post to twitter running on AWS. I'm a novice when it comes to Lambda, knowing only enough to get the functions running.
The functions have environment variables set in Lambda with various bits of configuration, such as post frequency and the secret data for the twitter application. These are read into the python script directly.
It's all triggered by an Event Bridge cron job that runs every hour.
I want to create a test event that will allow me to invoke the function manually, but I would like to be able to change the post-frequency variable when it is run like this.
Is there a simple way to change environment variables when running a test event?
That is very much possible, and there are multiple ways to do it. One is to use the AWS CLI's aws lambda update-function-configuration command: https://docs.aws.amazon.com/cli/latest/reference/lambda/update-function-configuration.html
Alternatively, depending on the programming language you prefer, you can use an AWS SDK, which has a similar method; you can find an example with the JS SDK in this doc: https://docs.aws.amazon.com/sdk-for-javascript/v3/developer-guide/javascript_lambda_code_examples.html
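If you'd rather do it from Python, here is a minimal sketch with boto3 (the function name and the POST_FREQUENCY variable are placeholders for your own setup). Note that the Environment argument replaces the whole variable set, so you have to read the current values first:

import boto3

lambda_client = boto3.client("lambda")

def set_post_frequency(function_name, new_frequency):
    # Read the current environment first, because update_function_configuration
    # replaces the entire set of variables, not just the one you change.
    config = lambda_client.get_function_configuration(FunctionName=function_name)
    env_vars = config.get("Environment", {}).get("Variables", {})
    env_vars["POST_FREQUENCY"] = str(new_frequency)  # hypothetical variable name
    lambda_client.update_function_configuration(
        FunctionName=function_name,
        Environment={"Variables": env_vars},
    )

set_post_frequency("my-twitter-function", 2)  # hypothetical function name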
How can I instruct dask to use a distributed Client as the scheduler, externally from the code, e.g. via an environment variable?
The motivation is to take advantage of one of the key features of dask - namely the transparency of going from a single machine to a distributed cluster. However, there seems to be one little thing obscuring this transparency - the need to register a Client via code.
I can set the named schedulers (e.g. "synchronous" and "processes") via the config (file/env var) as instructed here, but how do I use the same mechanism with a distributed one?
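For reference, what already works for the named schedulers is roughly this (a small sketch):

import dask

dask.config.set(scheduler="processes")  # or, equivalently, DASK_SCHEDULER=processes in the environment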
Ideally, I would like to set something like:
DASK_SCHEDULER=distributed(scheduler_file=...)
as an environment variable, which would be the equivalent of running client = Client(scheduler_file=...) within the Python code.
This would then mean the EXACT same code can be run in different environments (local and distributed).
One way to do it would be to pass the scheduler as an argument, say using argparse.
Thus you could run python my_script.py <ip:port>, where you specify either the distributed scheduler's address or <127.0.0.1:port> for a local one.
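A minimal sketch of that idea (the argument name, the fallback environment variable and the toy computation are placeholders, not anything dask-specific):

import argparse
import os

import dask.bag as db
from dask.distributed import Client

parser = argparse.ArgumentParser()
parser.add_argument("--scheduler",
                    default=os.environ.get("MY_DASK_SCHEDULER"),
                    help="address of a distributed scheduler, e.g. 127.0.0.1:8786")
args = parser.parse_args()

if args.scheduler:
    client = Client(args.scheduler)  # attach to the distributed scheduler
# otherwise dask falls back to its default local scheduler

result = db.from_sequence(range(10)).map(lambda x: x * x).sum().compute()
print(result)

This keeps the computation itself identical; only the command line (or the environment) decides whether it runs locally or on the cluster.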
I want to execute a snippet of python code based on some trigger using Microsoft-Flow. Is there a way to do this?
Basically I am exploring Powerapps and Microsoft-Flow. I have data in a powerapp and I can do basic operations there. But I want to execute a Python script whenever a user presses a button in the powerapp and display the result in the powerapp again.
In theory you can do this with Azure Functions. The steps you need are the following:
Create an Azure function
Create the API definition using Python as the language
Export the definition to PowerApps/Flow
Add the function to your app as a data source OR
Add the function to Flow
It is still a little bit experimental, but you should be able to make it work.
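As a rough illustration, an HTTP-triggered Python Azure Function is essentially a handler like the sketch below (this uses the current Python worker model, which may differ from the experimental setup referenced above; the processing logic is just a placeholder):

import azure.functions as func

def main(req: func.HttpRequest) -> func.HttpResponse:
    name = req.params.get("name", "world")
    # ... run whatever Python processing you need here ...
    return func.HttpResponse(f"Hello, {name}", status_code=200)

Once that is deployed and its definition exported, PowerApps/Flow can call it like any other connector and show the response in the app.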
We are trying to create an Azure ML web-service that will receive a (.csv) data file, do some processing, and return two similar files. The Python support recently added to the azure ML platform was very helpful and we were able to successfully port our code, run it in experiment mode and publish the web-service.
Using the "batch processing" API, we are now able to direct a file from blob-storage to the service and get the desired output. However, run-time for small files (a few KB) is significantly slower than on a local machine, and more importantly, the process seems to never return for slightly larger input data files (40MB). Processing time on my local machine for the same file is under 1 minute.
My question is whether you can see anything we are doing wrong, or whether there is a way to speed this up. Here is the DAG representation of the experiment:
Is this the way the experiment should be set up?
It looks like the problem was with processing of a timestamp column in the input table. The successful workaround was to explicitly force the column to be processed as string values, using the "Metadata Editor" block. The final model now looks like this:
I've written a simple k-means clustering code for Hadoop (two separate programs - mapper and reducer). The code is working over a small dataset of 2d points on my local box. It's written in Python and I plan to use Streaming API.
I would like suggestions on how best to run this program on Hadoop.
After each run of mapper and reducer, new centres are generated. These centres are input for the next iteration.
From what I can see, each mapreduce iteration will have to be a separate mapreduce job. And it looks like I'll have to write another script (python/bash) to extract the new centres from HDFS after each reduce phase and feed them back to the mapper.
Any other easier, less messy way? And if the cluster happens to use a fair scheduler, won't it take a very long time before this computation completes?
You needn't write another job. You can put the same job in a loop (a while loop) and just keep changing its parameters, so that when the mapper and reducer complete their processing, control moves on to creating a new configuration whose input is automatically the output of the previous phase.
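Translated to the Streaming setup in the question, that loop can live in a small Python driver script; the paths, the streaming jar location and the fixed iteration count below are assumptions you would adapt to your cluster:

import subprocess

STREAMING_JAR = "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar"  # adjust to your install

for i in range(10):  # or loop until the centres stop moving
    out_dir = "kmeans/iter_%d" % i
    subprocess.check_call([
        "hadoop", "jar", STREAMING_JAR,
        "-input", "kmeans/points",
        "-output", out_dir,
        "-mapper", "mapper.py",
        "-reducer", "reducer.py",
        "-file", "mapper.py", "-file", "reducer.py",
        "-file", "centres.txt",  # ship the current centres to every mapper
    ])
    # Pull the new centres out of HDFS so the next iteration can ship them.
    subprocess.check_call(["hadoop", "fs", "-getmerge", out_dir, "centres.txt"])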
The Java interface of Hadoop has the concept of chaining several jobs:
http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
However, since you're using Hadoop Streaming you don't have any support for chaining jobs and managing workflows.
You should check out Oozie, which should do the job for you:
http://yahoo.github.com/oozie/
Here are a few ways to do it: github.com/bwhite/hadoop_vision/tree/master/kmeans
Also check this out (has oozie support): http://bwhite.github.com/hadoopy/
Feels funny to be answering my own question. I used Pig 0.9 (not released yet, but available in the trunk). It has support for modularity and flow control by allowing Pig statements to be embedded inside scripting languages like Python.
So I wrote a main Python script that had a loop and, inside that, called my Pig scripts. The Pig scripts in turn made calls to the UDFs. So I had to write three different programs, but it worked out fine.
You can check the example here - http://www.mail-archive.com/user#pig.apache.org/msg00672.html
For the record, my UDFs were also written in Python, using this new feature that allows writing UDFs in scripting languages.
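For what it's worth, the embedded-Pig loop looks roughly like this sketch (run with "pig script.py"; the Pig script name, parameter names and fixed iteration count are placeholders, not my actual code):

from org.apache.pig.scripting import Pig

P = Pig.compileFromFile("kmeans.pig")  # the Pig script that calls the UDFs
centres = "initial_centres.txt"

for i in range(10):  # or loop until the centres converge
    stats = P.bind({"CENTRES": centres, "OUTPUT": "out_%d" % i}).runSingle()
    if not stats.isSuccessful():
        raise RuntimeError("Pig iteration %d failed" % i)
    centres = "out_%d" % i  # feed the new centres into the next pass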