pySpark Kafka Direct Streaming update Zookeeper / Kafka Offset - python

Currently I'm working with Kafka / Zookeeper and pySpark (1.6.0).
I have successfully created a Kafka consumer, which uses KafkaUtils.createDirectStream().
Streaming itself works fine, but I noticed that my Kafka topics are not updated to the current offset after I have consumed some messages.
Since we need the topics updated in order to have monitoring in place here, this is somewhat odd.
In the documentation of Spark I found this comment:
offsetRanges = []

def storeOffsetRanges(rdd):
    global offsetRanges
    offsetRanges = rdd.offsetRanges()
    return rdd

def printOffsetRanges(rdd):
    for o in offsetRanges:
        print "%s %s %s %s" % (o.topic, o.partition, o.fromOffset, o.untilOffset)

directKafkaStream \
    .transform(storeOffsetRanges) \
    .foreachRDD(printOffsetRanges)
You can use this to update Zookeeper yourself if you want Zookeeper-based Kafka monitoring tools to show progress of the streaming application.
Here is the documentation:
http://spark.apache.org/docs/1.6.0/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers
I found a solution in Scala, but I can't find an equivalent for python.
Here is the Scala example: http://geeks.aretotally.in/spark-streaming-kafka-direct-api-store-offsets-in-zk/
Question
But the question is: how am I able to update Zookeeper from that point on?

I wrote some functions to save and read Kafka offsets with the python kazoo library.
First, a function to get a singleton Kazoo client:
ZOOKEEPER_SERVERS = "127.0.0.1:2181"

def get_zookeeper_instance():
    from kazoo.client import KazooClient

    if 'KazooSingletonInstance' not in globals():
        globals()['KazooSingletonInstance'] = KazooClient(ZOOKEEPER_SERVERS)
        globals()['KazooSingletonInstance'].start()
    return globals()['KazooSingletonInstance']
Then functions to read and write offsets:
def read_offsets(zk, topics):
    from pyspark.streaming.kafka import TopicAndPartition

    from_offsets = {}
    for topic in topics:
        for partition in zk.get_children(f'/consumers/{topic}'):
            topic_partition = TopicAndPartition(topic, int(partition))
            offset = int(zk.get(f'/consumers/{topic}/{partition}')[0])
            from_offsets[topic_partition] = offset
    return from_offsets

def save_offsets(rdd):
    zk = get_zookeeper_instance()
    for offset in rdd.offsetRanges():
        path = f"/consumers/{offset.topic}/{offset.partition}"
        zk.ensure_path(path)
        zk.set(path, str(offset.untilOffset).encode())
Then, before starting streaming, you can read the offsets from Zookeeper and pass them to createDirectStream as the fromOffsets argument:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def main(brokers="127.0.0.1:9092", topics=['test1', 'test2']):
    sc = SparkContext(appName="PythonStreamingSaveOffsets")
    ssc = StreamingContext(sc, 2)

    zk = get_zookeeper_instance()
    from_offsets = read_offsets(zk, topics)

    directKafkaStream = KafkaUtils.createDirectStream(
        ssc, topics, {"metadata.broker.list": brokers},
        fromOffsets=from_offsets)

    directKafkaStream.foreachRDD(save_offsets)

    ssc.start()             # start the streaming computation
    ssc.awaitTermination()  # wait for it to finish

if __name__ == "__main__":
    main()

I encountered a similar question.
You are right: using createDirectStream means using the Kafka low-level API directly, which doesn't update the reader offset.
There are a couple of examples for Scala/Java around, but not for Python.
But it's easy to do it yourself; what you need to do is:
read from the saved offset at the beginning
save the offset at the end
For example, I save the offset for each partition in Redis by doing:
stream.foreachRDD(lambda rdd: save_offset(rdd))

def save_offset(rdd):
    ranges = rdd.offsetRanges()
    for rng in ranges:
        rng.untilOffset  # save offset somewhere
Then at the beginning, you can use:
fromoffset = {}
topic_partition = TopicAndPartition(topic, partition)
fromoffset[topic_partition] = int(value)  # the value read from wherever you stored it previously
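A minimal sketch of that Redis round trip (my own illustration, not from the original answer; it assumes a Redis server reachable from the driver and the redis-py client):
import redis
from pyspark.streaming.kafka import TopicAndPartition

r = redis.Redis(host='127.0.0.1')  # hypothetical Redis host

def save_offset(rdd):
    # persist the until-offset of every partition of this micro-batch
    for rng in rdd.offsetRanges():
        r.set("offset:%s:%d" % (rng.topic, rng.partition), rng.untilOffset)

def load_offsets(topic, partitions):
    # rebuild the fromOffsets dict before creating the direct stream
    fromoffset = {}
    for partition in partitions:
        value = r.get("offset:%s:%d" % (topic, partition)) or 0
        fromoffset[TopicAndPartition(topic, int(partition))] = int(value)
    return fromoffset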
For tools that use ZK to track offsets, it's better to save the offset in Zookeeper.
This page:
https://community.hortonworks.com/articles/81357/manually-resetting-offset-for-a-kafka-topic.html
describes how to set the offset; basically, the ZK node is:
/consumers/[consumer_name]/offsets/[topic name]/[partition id]
Since we are using createDirectStream, you have to make up a consumer name.
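A rough sketch of writing to that node layout with kazoo (my own illustration; the consumer name below is made up, as suggested above):
from kazoo.client import KazooClient

zk = KazooClient("127.0.0.1:2181")  # hypothetical Zookeeper address
zk.start()

def save_offsets_to_zk(rdd, consumer_name="my-direct-stream"):
    # mirror the layout /consumers/[consumer_name]/offsets/[topic]/[partition]
    for rng in rdd.offsetRanges():
        path = "/consumers/%s/offsets/%s/%d" % (consumer_name, rng.topic, rng.partition)
        zk.ensure_path(path)
        zk.set(path, str(rng.untilOffset).encode())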

Related

How to get all resources with details from Azure subscription via Python

I am trying to get all resources and providers from an Azure subscription using the Python SDK.
Here is what my code does:
1. get all resources by "resource group"
2. extract the id of each resource within the "resource group"
3. call for details about a particular resource by its id
The problem is that each call from point 3 requires a correct "API version", and it differs from object to object. So obviously my code keeps failing when trying to find some common API version that fits everything.
Is there a way to retrieve a suitable API version per resource in a resource group? (similarly to retrieving id, name, ...)
# Import specific methods and models from other libraries
from azure.mgmt.resource import SubscriptionClient
from azure.identity import AzureCliCredential
from azure.mgmt.resource import ResourceManagementClient

credential = AzureCliCredential()
client = ResourceManagementClient(credential, "<subscription_id>")

rg = [i for i in client.resource_groups.list()]

# Retrieve the list of resources in "myResourceGroup" (change to any name desired).
# The expand argument includes additional properties in the output.
rg_resources = {}
for i in range(0, len(rg)):
    rg_resources[rg[i].as_dict()["name"]] = client.resources.list_by_resource_group(
        rg[i].as_dict()["name"],
        expand="properties,created_time,changed_time")

data = {}
for i in rg_resources.keys():
    details = []
    for _data in iter(rg_resources[i]):
        a = _data
        details.append(client.resources.get_by_id(vars(_data)['id'], 'latest'))
    data[i] = details
print(data)
error:
azure.core.exceptions.HttpResponseError: (NoRegisteredProviderFound) No registered resource provider found for location 'westeurope' and API version 'latest' for type 'workspaces'. The supported api-versions are '2015-03-20, 2015-11-01-preview, 2017-01-01-preview, 2017-03-03-preview, 2017-03-15-preview, 2017-04-26-preview, 2020-03-01-preview, 2020-08-01, 2020-10-01, 2021-06-01, 2021-03-01-privatepreview'. The supported locations are 'eastus, westeurope, southeastasia, australiasoutheast, westcentralus, japaneast, uksouth, centralindia, canadacentral, westus2, australiacentral, australiaeast, francecentral, koreacentral, northeurope, centralus, eastasia, eastus2, southcentralus, northcentralus, westus, ukwest, southafricanorth, brazilsouth, switzerlandnorth, switzerlandwest, germanywestcentral, australiacentral2, uaecentral, uaenorth, japanwest, brazilsoutheast, norwayeast, norwaywest, francesouth, southindia, jioindiawest, canadaeast, westus3
What information exactly do you want to retrieve from the resources?
In most cases, I would recommend using the Graph API to query over all resources. This is very powerful, as you can query the whole platform using a simple query language - Kusto Query Language (KQL).
You can try the queries directly in the Azure Resource Graph Explorer service in the Portal.
A query that summarizes all types of resources would be:
resources
| project resourceGroup, type
| summarize count() by type, resourceGroup
| order by count_
A simple python-codeblock can be seen on the linked documentation above.
The sample below uses DefaultAzureCredential for authentication and lists, in detail, the first resource that is in a resource group whose name starts with "rg".
# Import Azure Resource Graph library
import azure.mgmt.resourcegraph as arg

# Import specific methods and models from other libraries
from azure.mgmt.resource import SubscriptionClient
from azure.identity import DefaultAzureCredential

# Wrap all the work in a function
def getresources(strQuery):
    # Get your credentials from environment (CLI, MSI, ...)
    credential = DefaultAzureCredential()
    subsClient = SubscriptionClient(credential)
    subsRaw = []
    for sub in subsClient.subscriptions.list():
        subsRaw.append(sub.as_dict())
    subsList = []
    for sub in subsRaw:
        subsList.append(sub.get('subscription_id'))

    # Create Azure Resource Graph client and set options
    argClient = arg.ResourceGraphClient(credential)
    argQueryOptions = arg.models.QueryRequestOptions(result_format="objectArray")

    # Create query
    argQuery = arg.models.QueryRequest(subscriptions=subsList, query=strQuery, options=argQueryOptions)

    # Run query
    argResults = argClient.resources(argQuery)

    # Show Python object
    print(argResults)

getresources("Resources | where resourceGroup startswith 'rg' | limit 1")
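If you want the matched rows themselves rather than the printed response object, a small variation (my own sketch, assuming result_format="objectArray" as above, where the response's data attribute is a list of dicts) could be:
import azure.mgmt.resourcegraph as arg
from azure.mgmt.resource import SubscriptionClient
from azure.identity import DefaultAzureCredential

def getresources_data(strQuery):
    credential = DefaultAzureCredential()
    subs = [s.subscription_id for s in SubscriptionClient(credential).subscriptions.list()]
    argClient = arg.ResourceGraphClient(credential)
    request = arg.models.QueryRequest(
        subscriptions=subs,
        query=strQuery,
        options=arg.models.QueryRequestOptions(result_format="objectArray"))
    response = argClient.resources(request)
    # with result_format="objectArray", response.data is a list of dicts, one per resource
    return response.data

for row in getresources_data("Resources | where resourceGroup startswith 'rg' | limit 5"):
    print(row.get('name'), row.get('type'), row.get('resourceGroup'))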

Examining file on different node (different IP address), same network - possible?

I have a small group of Raspberry Pis, all on the same local network (192.168.1.2xx). All are running Python 3.7.3; one (an R Pi CM3) is on Raspbian Buster, the other (an R Pi 4B 8 GB) on Raspberry Pi OS 64.
I have a file on one device (the Pi 4B), located at /tmp/speech.wav, that is generated on the fly, real-time:
192.168.1.201 - /tmp/speech.wav
I have a script that works well on that device, that tells me the play duration time of the .wav file in seconds:
import wave
import contextlib

def getPlayTime():
    fname = '/tmp/speech.wav'
    with contextlib.closing(wave.open(fname, 'r')) as f:
        frames = f.getnframes()
        rate = f.getframerate()
        duration = round(frames / float(rate), 2)
        return duration
However - the node that needs to operate on that duration information is running on another node at 192.168.1.210. I cannot simply move the various files all to the same node as there is a LOT going on, things are where they are for a reason.
So what I need to know is how to alter my approach such that I can change the script reference to something like this pseudocode:
fname = '/tmp/speech.wav # 192.168.1.201'
Is such a thing possible? Searching the web it seems that I am up against millions of people looking for how to obtain IP addresses, fix multiple IP address issues, fix duplicate ip address issues... but I can't seem yet to find how to simply examine a file on a different ip address as I have described here. I have no network security restrictions, so any setting is up for consideration. Help would be much appreciated.
There are lots of possibilities, and it probably comes down to how often you need to check the duration, from how many clients, and how often the file changes and whether you have other information that you want to share between the nodes.
Here are some options:
set up an SMB (Samba) server on the Pi that has the WAV file and let the other nodes mount the filesystem and access the file as if it was local
set up an NFS server on the Pi that has the WAV file and let the other nodes mount the filesystem and access the file as if it was local
let other nodes use ssh to login and extract the duration, or scp to retrieve the file - see paramiko in Python (a minimal sketch follows this list)
set up Redis on one node and throw the WAV file in there so anyone can get it - this is potentially attractive if you have lots of lists, arrays, strings, integers, hashes, queues or sets that you want to share between Raspberry Pis very fast. Example here.
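For the ssh/scp route mentioned above, here is a minimal paramiko sketch (my own illustration; the host, username and existing key-based SSH setup are assumptions):
import contextlib
import wave

import paramiko

def get_remote_duration(host="192.168.1.201", user="pi"):
    # copy the remote WAV locally over SFTP, then measure it as before
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=user)   # assumes key-based auth is already configured
    try:
        sftp = client.open_sftp()
        sftp.get('/tmp/speech.wav', '/tmp/speech_copy.wav')
        sftp.close()
    finally:
        client.close()
    with contextlib.closing(wave.open('/tmp/speech_copy.wav', 'r')) as f:
        return round(f.getnframes() / float(f.getframerate()), 2)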
Here is a very simple example of writing a sound track into Redis from one node (say Redis is on 192.168.1.200) and reading it back from any other. Of course, you may just want the writing node to write the duration in there rather than the whole track - which would be more efficient. Or you may want to store loads of other shared data or settings.
This is the writer:
#!/usr/bin/env python3
import redis
from pathlib import Path
host='192.168.1.200'
# Connect to Redis
r = redis.Redis(host)
# Load some music, or otherwise create it
music = Path('song.wav').read_bytes()
# Put music into Redis where others can see it
r.set("music",music)
And this is the reader:
#!/usr/bin/env python3
import redis
from pathlib import Path
host='192.168.1.200'
# Connect to Redis
r = redis.Redis(host)
# Retrieve music track from Redis
music = r.get("music")
print(f'{len(music)} bytes read from Redis')
Then, during testing, you may want to manually push a track into Redis from the Terminal:
redis-cli -x -h 192.168.1.200 set music < OtherTrack.wav
Or manually retrieve the track from Redis to a file:
redis-cli -h 192.168.1.200 get music > RetrievedFromRedis.wav
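And, per the note above about storing just the duration rather than the whole track, a lighter-weight sketch (my own illustration, same assumed Redis host):
import contextlib
import wave

import redis

r = redis.Redis(host='192.168.1.200')

# writer node: store only the play duration
with contextlib.closing(wave.open('/tmp/speech.wav', 'r')) as f:
    duration = round(f.getnframes() / float(f.getframerate()), 2)
r.set("speech_duration", duration)

# reader node: fetch the duration whenever it is needed
duration = float(r.get("speech_duration"))
print(f'duration: {duration} s')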
OK, this is what I finally settled on, and it works great. Using ZeroMQ for message passing, I have one function that gets the play time of the wav and another that gathers data about the speech about to be spoken; all of that is sent to the motor core prior to sending the speech, and the motor core handles the timing issues to sync the jaw to the speech. So I'm not actually putting the code that generates the wav (and also returns the length of the wav playback time) onto the node that ultimately makes use of it, but it turns out that message passing is fast enough, so there is plenty of time to receive, process and implement the motion control to match the speech perfectly. Posting this here in case it's helpful for folks working on similar issues in the future.
import time
import zmq
import os
import re
import wave
import contextlib

context = zmq.Context()
socket = context.socket(zmq.REP)
socket.bind("tcp://*:5555")                    # Listens for speech to output

print("Connecting to Motor Control")
jawCmd = context.socket(zmq.PUB)
jawCmd.connect("tcp://192.168.1.210:5554")     # Sends to MotorFunctions for Jaw Movement

def getPlayTime():                             # Checks to see if current file duration has changed
    fname = '/tmp/speech.wav'                  # and if yes, sends new duration
    with contextlib.closing(wave.open(fname, 'r')) as f:
        frames = f.getnframes()
        rate = f.getframerate()
        duration = round(frames / float(rate), 3)
        speakTime = str(duration)
    return speakTime

def set_voice(V, T):
    T2 = '"' + T + '"'
    audioFile = "/tmp/speech.wav"              # /tmp set as tmpfs, or RAMDISK, to reduce SD Card write ops
    if V == "A":
        voice = "Allison"
    elif V == "B":
        voice = "Belle"
    elif V == "C":
        voice = "Callie"
    elif V == "D":
        voice = "Dallas"
    elif V == "V":
        voice = "David"
    else:
        voice = "Belle"
    os.system("swift -n " + voice + " -o " + audioFile + " " + T2)  # Record audio
    tailTrim = .5                                 # Calculate Jaw Timing
    speakTime = eval(getPlayTime())               # Start by getting playlength
    speakTime = round((speakTime - tailTrim), 2)  # Chop .5 s for trailing silence
    wordList = T.split()
    jawString = []
    for index in range(len(wordList)):
        wordLen = len(wordList[index])
        jawString.append(wordLen)
    jawString = str(jawString)
    speakTime = str(speakTime)
    jawString = speakTime + "|" + jawString       # 3.456|[4, 2, 7, 4, 2, 9, 3, 4, 3, 6] - will split on "|"
    jawCmd.send_string(jawString)                 # Send Jaw Operating Sequence
    os.system("aplay " + audioFile)               # Play audio

pronunciationDict = {'teh': 'the', 'process': 'prawcess', 'Maeve': 'Mayve', 'Mariposa': 'May-reeposah', 'Lila': 'Lala', 'Trump': 'Ass hole'}

def adjustResponse(response):                     # Adjusts spellings in output string to create better speech output.
    for key, value in pronunciationDict.items():
        if key in response or key.lower() in response:
            response = re.sub(key, value, response, flags=re.I)
    return response
SpeakText="Speech center connected and online."
set_voice(V,SpeakText) # Cepstral Voices: A = Allison; B = Belle; C = Callie; D = Dallas; V = David;
while True:
SpeakText = socket.recv().decode('utf-8') # .decode gets rid of the b' in front of the string
SpeakTextX = adjustResponse(SpeakText) # Run the string through the pronunciation dictionary
print("SpeakText = ",SpeakTextX)
set_voice(V,SpeakTextX)
print("Received request: %s" % SpeakTextX)
socket.send_string(str(SpeakTextX)) # Send data back to source for confirmation
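For reference, the receiving side on the motor node could look roughly like this (my own sketch; the actual motor-control calls are placeholders):
import zmq

context = zmq.Context()
jawCmd = context.socket(zmq.SUB)
jawCmd.bind("tcp://*:5554")                  # matches jawCmd.connect("tcp://192.168.1.210:5554") above
jawCmd.setsockopt_string(zmq.SUBSCRIBE, "")  # subscribe to everything

while True:
    msg = jawCmd.recv_string()               # e.g. "3.456|[4, 2, 7, 4, 2, 9, 3, 4, 3, 6]"
    speakTime, wordLengths = msg.split("|", 1)
    speakTime = float(speakTime)
    # ... drive the jaw servo here for speakTime seconds, paced by wordLengths ...
    print("jaw sequence:", speakTime, wordLengths)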

How to export GCP's Security Center Assets to a Cloud Storage via cloud Function?

I have a cloud function calling SCC's list_assets and converting the paginated output to a list (to fetch all the results). However, since I have quite a lot of assets in the organization tree, it takes a lot of time to fetch and the cloud function times out (540 seconds max timeout).
asset_iterator = security_client.list_assets(org_name)
asset_fetch_all=list(asset_iterator)
I tried to export via WebUI and it works fine (took about 5 minutes). Is there a way to export the assets from SCC directly to a Cloud Storage bucket using the API?
I developed the same thing, in Python, for exporting to BQ. Searching in BigQuery is easier than in a file. The code is very similar for GCS storage. Here is my working code with BQ:
import os
from google.cloud import asset_v1
from google.cloud.asset_v1.proto import asset_service_pb2
from google.cloud.asset_v1 import enums

def GCF_ASSET_TO_BQ(request):
    client = asset_v1.AssetServiceClient()
    parent = 'organizations/{}'.format(os.getenv('ORGANIZATION_ID'))
    output_config = asset_service_pb2.OutputConfig()
    output_config.bigquery_destination.dataset = 'projects/{}/datasets/{}'.format(os.getenv('PROJECT_ID'), os.getenv('DATASET'))
    content_type = enums.ContentType.RESOURCE
    output_config.bigquery_destination.table = 'asset_export'
    output_config.bigquery_destination.force = True
    response = client.export_assets(parent, output_config, content_type=content_type)
    # For waiting the finish
    # response.result()
    # Do stuff after export
    return "done", 200

if __name__ == "__main__":
    GCF_ASSET_TO_BQ('')
As you can see, there are some values in env vars (ORGANIZATION_ID, PROJECT_ID and DATASET). For exporting to Cloud Storage, you have to change the definition of the output_config like this:
output_config = asset_service_pb2.OutputConfig()
output_config.gcs_destination.uri = 'gs://path/to/file'
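Putting that together, a minimal sketch of the Cloud Storage variant (my own illustration reusing the same calls as the BQ function above; the bucket path is a placeholder):
import os
from google.cloud import asset_v1
from google.cloud.asset_v1.proto import asset_service_pb2

def GCF_ASSET_TO_GCS(request):
    client = asset_v1.AssetServiceClient()
    parent = 'organizations/{}'.format(os.getenv('ORGANIZATION_ID'))
    output_config = asset_service_pb2.OutputConfig()
    # placeholder bucket/object path
    output_config.gcs_destination.uri = 'gs://your-bucket/asset_export.json'
    response = client.export_assets(parent, output_config)
    # response.result()  # uncomment to block until the export finishes
    return "done", 200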
You have examples in other languages here.
Try something like this:
We use it to upload findings into a bucket. Make sure to give the SP that the function is running as the right permissions on the bucket.
def test_list_medium_findings(source_name):
    # [START list_findings_at_a_time]
    from google.cloud import securitycenter
    from google.cloud import storage

    # Create a new client.
    client = securitycenter.SecurityCenterClient()

    # Set query parameters
    organization_id = "11112222333344444"
    org_name = "organizations/{org_id}".format(org_id=organization_id)
    all_sources = "{org_name}/sources/-".format(org_name=org_name)

    # Query Security Command Center
    finding_result_iterator = client.list_findings(all_sources, filter_=YourFilter)

    # Set output file settings
    bucket = "YourBucketName"
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket)
    output_file_name = "YourFileName"
    my_file = bucket.blob(output_file_name)

    with open('/tmp/data.txt', 'w') as file:
        for i, finding_result in enumerate(finding_result_iterator):
            file.write(
                "{}: name: {} resource: {}".format(
                    i, finding_result.finding.name, finding_result.finding.resource_name
                )
            )

    # Upload to bucket
    my_file.upload_from_filename("/tmp/data.txt")

How to describe a topic using kafka client in Python

I'm a beginner with the Kafka client in Python; I need some help describing the topics using the client.
I was able to list all my Kafka topics using the following code:
consumer = kafka.KafkaConsumer(group_id='test', bootstrap_servers=['kafka1'])
topicList = consumer.topics()
After referring to multiple articles and code samples, I was able to do this through describe_configs using confluent_kafka.
Link 1 [Confluent-kafka-python]
Link 2 Git Sample
Below is my sample code!!
from confluent_kafka.admin import AdminClient, NewTopic, NewPartitions, ConfigResource
import confluent_kafka
import concurrent.futures

# Creation of config
conf = {'bootstrap.servers': 'kafka1', 'session.timeout.ms': 6000}
adminClient = AdminClient(conf)
topic_configResource = adminClient.describe_configs([ConfigResource(confluent_kafka.admin.RESOURCE_TOPIC, "myTopic")])
for j in concurrent.futures.as_completed(iter(topic_configResource.values())):
    config_response = j.result(timeout=1)
I have found how to do it with kafka-python:
from kafka.admin import KafkaAdminClient, ConfigResource, ConfigResourceType
KAFKA_URL = "localhost:9092" # kafka broker
KAFKA_TOPIC = "test" # topic name
admin_client = KafkaAdminClient(bootstrap_servers=[KAFKA_URL])
configs = admin_client.describe_configs(config_resources=[ConfigResource(ConfigResourceType.TOPIC, KAFKA_TOPIC)])
config_list = configs.resources[0][4]
In config_list (list of tuples) you have all the configs for the topic.
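To inspect it, a rough sketch (my assumption, based on the kafka-python DescribeConfigs response layout, is that each tuple starts with the config name and its value):
for entry in config_list:
    config_name, config_value = entry[0], entry[1]
    print(f"{config_name} = {config_value}")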
Refer: https://docs.confluent.io/current/clients/confluent-kafka-python/
list_topics provides confluent_kafka.admin.TopicMetadata (topic, partitions)
TopicMetadata.partitions provides confluent_kafka.admin.PartitionMetadata (partition id, leader, replicas, isrs)
from confluent_kafka.admin import AdminClient

kafka_admin = AdminClient({"bootstrap.servers": bootstrap_servers})

for topic in topics:
    x = kafka_admin.list_topics(topic=topic)
    print x.topics, '\n'
    for key, value in x.topics.items():
        for keyy, valuey in value.partitions.items():
            print keyy, ' Partition id : ', valuey, 'leader : ', valuey.leader, ' replica: ', valuey.replicas
Interestingly, for Java this functionality (describeTopics()) sits within KafkaAdminClient.java.
So, I was trying to look for the Python equivalent of the same, and I discovered the code repository of kafka-python.
The documentation (in-line comments) in the admin-client equivalent in the kafka-python package says the following:
describe topics functionality is in ClusterMetadata
Note: if implemented here, send the request to the controller
I then switched to the cluster.py file in the same repository. This contains the topics() function that you've used to retrieve the list of topics, and the following 2 functions that could help you achieve the describe functionality:
partitions_for_topic() - Return set of all partitions for topic (whether available or not)
available_partitions_for_topic() - Return set of partitions with known leaders
Note: I haven't tried this myself, so I'm not entirely sure if the behaviour would be identical to what you would see in the result of the kafka-topics --describe ... command, but it is worth a try.
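As a quick illustration (my own sketch, also untested against kafka-topics --describe), those helpers are reachable through a plain KafkaConsumer, which carries the same cluster metadata:
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers=['kafka1'])
for topic in consumer.topics():
    # set of partition ids for the topic, whether currently available or not
    print(topic, '->', consumer.partitions_for_topic(topic))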
I hope this helps!

How to publish/subscribe a python “list of list” as topic in ROS

I am new to ROS and rospy, and I am not familiar with non-simple data types as topics.
I want to build a ROS node that is both a subscriber and a publisher: it receives a topic (a list of two float64), uses a function (say my_function) which returns a list of lists of float64, then publishes this list of lists as a topic.
To do this, I built a node as follows:
from pymongo import MongoClient
from myfile import my_function
import rospy
import numpy as np

pub = None
sub = None

def callback(req):
    client = MongoClient()
    db = client.block
    lon = np.float64(req.b)
    lat = np.float64(req.a)
    point_list = my_function(lon, lat, db)
    pub.publish(point_list)

def calculator():
    global sub, pub
    rospy.init_node('calculator', anonymous=True)
    pub = rospy.Publisher('output_data', list)
    # Listen
    sub = rospy.Subscriber('input_data', list, callback)
    print "Calculation finished. \n"
    rospy.spin()

if __name__ == '__main__':
    try:
        calculator()
    except rospy.ROSInterruptException:
        pass
I know that list in Subscriber and Publisher is not a message type, but I cannot figure out how to fix it, since the data is not an integer nor a list of integers.
This post on the ROS forums gives you most of what you need. This is also useful. You can define a new message type FloatList.msg with the following specification:
float64[] elements
And then a second message type FloatArray.msg defined as:
FloatList[] lists
Then your function could look like:
def callback(req):
client = MongoClient()
db = client.block
lon = np.float64(req.b)
lat = np.float64(req.a)
point_list = my_function(lon, lat, db)
float_array = FloatArray()
for i in range(len(point_list)):
float_list = FloatList()
float_list.elements = point_list[i]
float_array.lists[i] = float_list
pub.publish(float_array)
And then you can unpack it with:
def unpack_callback(float_array_msg):
    for lst in float_array_msg.lists:
        for e in lst.elements:
            print "Here is an element: %f" % e
In general, it is recommended you put ROS related questions on the ROS Forums since you are way more likely to get an answer to your question there.
You can complicate things for yourself by defining a new ROS type in msg, OR use the default, easy-to-implement std_msgs types; it may be useful to use the json module, so you serialize the data before publishing and deserialize it back on the other side after receiving.
The rest (Pub/Sub, topics and handlers) remains the same :)
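A small sketch of that std_msgs + json approach (my own illustration; my_function and db are the ones from the question's code, and the topic names mirror the question's):
import json

import rospy
from std_msgs.msg import String

rospy.init_node('calculator', anonymous=True)
pub = rospy.Publisher('output_data', String, queue_size=10)

def callback(msg):
    # incoming payload: a JSON-encoded list of two floats
    lon, lat = json.loads(msg.data)
    point_list = my_function(lon, lat, db)              # list of lists of float64 (from the question)
    pub.publish(String(data=json.dumps(point_list)))    # serialize before publishing

sub = rospy.Subscriber('input_data', String, callback)
rospy.spin()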
I agree with the solution; I just thought about organizing it a bit more:
Create a file FloatArray.msg in your catkin_ws, in the src folder where you have all your other message files.
Build your environment using catkin_make or catkin build.
In your script (e.g. a Python script), import the message type and use it in the publisher, e.g.
joint_state_publisher_Unity = rospy.Publisher("/joint_state_unity", FloatArray, queue_size=10)
Specific case (bonus :)): if you are using Unity and ROS#, build the message in Unity.
