I'm beginner to kafka client in python, i need some help to describe the topics using the client.
I was able to list all my kafka topics using the following code:-
consumer = kafka.KafkaConsumer(group_id='test', bootstrap_servers=['kafka1'])
topicList = consumer.topics()
After referring multiple articles and code samples, I was able to do this through describe_configs using confluent_kafka.
Link 1 [Confluent-kafka-python]
Link 2 Git Sample
Below is my sample code!!
from confluent_kafka.admin import AdminClient, NewTopic, NewPartitions, ConfigResource
import confluent_kafka
import concurrent.futures
#Creation of config
conf = {'bootstrap.servers': 'kafka1','session.timeout.ms': 6000}
adminClient = AdminClient(conf)
topic_configResource = adminClient.describe_configs([ConfigResource(confluent_kafka.admin.RESOURCE_TOPIC, "myTopic")])
for j in concurrent.futures.as_completed(iter(topic_configResource.values())):
config_response = j.result(timeout=1)
I have found how to do it with kafka-python:
from kafka.admin import KafkaAdminClient, ConfigResource, ConfigResourceType
KAFKA_URL = "localhost:9092" # kafka broker
KAFKA_TOPIC = "test" # topic name
admin_client = KafkaAdminClient(bootstrap_servers=[KAFKA_URL])
configs = admin_client.describe_configs(config_resources=[ConfigResource(ConfigResourceType.TOPIC, KAFKA_TOPIC)])
config_list = configs.resources[0][4]
In config_list (list of tuples) you have all the configs for the topic.
Refer: https://docs.confluent.io/current/clients/confluent-kafka-python/
list_topics provide confluent_kafka.admin.TopicMetadata (topic,
partitions)
kafka.admin.TopicMetadata.partitions provide: confluent_kafka.admin.PartitionMetadata (Partition id, leader, replicas, isrs)
from confluent_kafka.admin import AdminClient
kafka_admin = AdminClient({"bootstrap.servers": bootstrap_servers})
for topic in topics:
x = kafka_admin.list_topics(topic=topic)
print x.topics, '\n'
for key, value in x.topics.items():
for keyy, valuey in value.partitions.items():
print keyy, ' Partition id : ', valuey, 'leader : ', valuey.leader,' replica: ', valuey.replicas
Interestingly, for Java this functionality (describeTopics()) sits within the KafkaAdminCLient.java.
So, I was trying to look for the python equivalent of the same and I discovered the code repository of kafka-python.
The documentation (in-line comments) in admin-client equivalent in kafka-python package says the following:
describe topics functionality is in ClusterMetadata
Note: if implemented here, send the request to the controller
I then switched to cluster.py file in the same repository. This contains the topics() function that you've used to retrieve the list of topics and the following 2 functions that could help you achieve the describe functionality:
partitions_for_topic() - Return set of all partitions for topic (whether available or not)
available_partitions_for_topic() - Return set of partitions with known leaders
Note: I haven't tried this myself so I'm not entierly sure if the behaviour would be identical to what you would see in the result for kafka-topics --describe ... command but worth a try.
I hope this helps!
Related
I am trying to get all resources and providers from Azure subscription by using Python SDK.
Here is my code:
get all resources by "resource group"
extract id of each resource within "resource group"
then calling details about particular resource by its id
The problem is that each call from point 3. requires a correct "API version" and it differs from object to object. So obviously my code keeps failing when trying to find some common API version that fits to everything.
Is there a way to retrieve suitable API version per resource in resource group ??? (similarly as retrieving id, name, ...)
# Import specific methods and models from other libraries
from azure.mgmt.resource import SubscriptionClient
from azure.identity import AzureCliCredential
from azure.mgmt.resource import ResourceManagementClient
credential = AzureCliCredential()
client = ResourceManagementClient(credential, "<subscription_id>")
rg = [i for i in client.resource_groups.list()]
# Retrieve the list of resources in "myResourceGroup" (change to any name desired).
# The expand argument includes additional properties in the output.
rg_resources = {}
for i in range(0, len(rg)):
rg_resources[rg[i].as_dict()
["name"]] = client.resources.list_by_resource_group(
rg[i].as_dict()["name"],
expand="properties,created_time,changed_time")
data = {}
for i in rg_resources.keys():
details = []
for _data in iter(rg_resources[i]):
a = _data
details.append(client.resources.get_by_id(vars(_data)['id'], 'latest'))
data[i] = details
print(data)
error:
azure.core.exceptions.HttpResponseError: (NoRegisteredProviderFound) No registered resource provider found for location 'westeurope' and API version 'latest' for type 'workspaces'. The supported api-versions are '2015-03-20, 2015-11-01-preview, 2017-01-01-preview, 2017-03-03-preview, 2017-03-15-preview, 2017-04-26-preview, 2020-03-01-preview, 2020-08-01, 2020-10-01, 2021-06-01, 2021-03-01-privatepreview'. The supported locations are 'eastus, westeurope, southeastasia, australiasoutheast, westcentralus, japaneast, uksouth, centralindia, canadacentral, westus2, australiacentral, australiaeast, francecentral, koreacentral, northeurope, centralus, eastasia, eastus2, southcentralus, northcentralus, westus, ukwest, southafricanorth, brazilsouth, switzerlandnorth, switzerlandwest, germanywestcentral, australiacentral2, uaecentral, uaenorth, japanwest, brazilsoutheast, norwayeast, norwaywest, francesouth, southindia, jioindiawest, canadaeast, westus3
What information exactly do you want to retrieve from the resources?
In most cases, I would recommend to use the Graph API to query over all resources. This is very powerful, as you can query the whole platform using a simple Query language - Kusto Query Lanaguage (KQL)
You can try the queries directly in the Azure service Azure Resource Graph Explorer in the Portal
A query that summarizes all types of resources would be:
resources
| project resourceGroup, type
| summarize count() by type, resourceGroup
| order by count_
A simple python-codeblock can be seen on the linked documentation above.
The below sample uses DefaultAzureCredential for authentication and lists the first resource in detail, that is in a resource group, where its name starts with "rg".
# Import Azure Resource Graph library
import azure.mgmt.resourcegraph as arg
# Import specific methods and models from other libraries
from azure.mgmt.resource import SubscriptionClient
from azure.identity import DefaultAzureCredential
# Wrap all the work in a function
def getresources( strQuery ):
# Get your credentials from environment (CLI, MSI,..)
credential = DefaultAzureCredential()
subsClient = SubscriptionClient(credential)
subsRaw = []
for sub in subsClient.subscriptions.list():
subsRaw.append(sub.as_dict())
subsList = []
for sub in subsRaw:
subsList.append(sub.get('subscription_id'))
# Create Azure Resource Graph client and set options
argClient = arg.ResourceGraphClient(credential)
argQueryOptions = arg.models.QueryRequestOptions(result_format="objectArray")
# Create query
argQuery = arg.models.QueryRequest(subscriptions=subsList, query=strQuery, options=argQueryOptions)
# Run query
argResults = argClient.resources(argQuery)
# Show Python object
print(argResults)
getresources("Resources | where resourceGroup startswith 'rg' | limit 1")
Similar to this: How to export GCP's Security Center Assets to a Cloud Storage via cloud Function?
I need to export the Findings as seen in the Security Command Center to BigQuery so we can easily filter the data we need and generate custom reports.
Using this documentation as an example (https://cloud.google.com/security-command-center/docs/how-to-api-list-findings#python), I wrote the following:
from google.cloud import securitycenter
from google.cloud import bigquery
JSONPath = "Path to JSON File For Service Account"
client = securitycenter.SecurityCenterClient().from_service_account_json(JSONPath)
BQclient = bigquery.Client().from_service_account_json(JSONPath)
table_id = "project.security_center.assets"
org_name = "organizations/1234567891011"
all_sources = "{org_name}/sources/-".format(org_name=org_name)
finding_result_iterator = client.list_findings(request={"parent": all_sources})
for i, finding_result in enumerate(finding_result_iterator):
errors = BQclient.insert_rows_json(table_id, finding_result)
if errors == []:
print("New rows have been added.")
else:
print("Encountered errors while inserting rows: {}".format(errors))
However, that then gave me the error:
"json_rows argument should be a sequence of dicts".
Any help with this would be greatly appreciated :)
Not sure if this existed back then in Q2 of 2021, but now there is documentation telling how to do this:
https://cloud.google.com/security-command-center/docs/how-to-analyze-findings-in-big-query
You can create exports of SCC findings to bigquery using this command:
gcloud scc bqexports create BIG_QUERY_EXPORT \
--dataset=DATASET_NAME \
--folder=FOLDER_ID | --organization=ORGANIZATION_ID | --project=PROJECT_ID \
[--description=DESCRIPTION] \
[--filter=FILTER]
Filter will allow to filter out unwanted findings (they will be in SCC, but won't be copied to the BigQuery).
It's useful if you want to export findings from one project or selected categories only. (Use -category:CATEGORY to exclude categories, works the same on different parameters as well).
I managed to sort this by writing:
for i, finding_result in enumerate(finding_result_iterator):
rows_to_insert = [
{u"category": finding_result.finding.category, u"name": finding_result.finding.name, u"project": finding_result.resource.project_display_name, u"external_uri": finding_result.finding.external_uri},
]
My Bash script using kubectl create/apply -f ... to deploy lots of Kubernetes resources has grown too large for Bash. I'm converting it to Python using the PyPI kubernetes package.
Is there a generic way to create resources given the YAML manifest? Otherwise, the only way I can see to do it would be to create and maintain a mapping from Kind to API method create_namespaced_<kind>. That seems tedious and error prone to me.
Update: I'm deploying many (10-20) resources to many (10+) GKE clusters.
Update in the year 2020, for anyone still interested in this (since the docs for the python library is mostly empty).
At the end of 2018 this pull request has been merged,
so it's now possible to do:
from kubernetes import client, config
from kubernetes import utils
config.load_kube_config()
api = client.ApiClient()
file_path = ... # A path to a deployment file
namespace = 'default'
utils.create_from_yaml(api, file_path, namespace=namespace)
EDIT: from a request in a comment, a snippet for skipping the python error if the deployment already exists
from kubernetes import client, config
from kubernetes import utils
config.load_kube_config()
api = client.ApiClient()
def skip_if_already_exists(e):
import json
# found in https://github.com/kubernetes-client/python/blob/master/kubernetes/utils/create_from_yaml.py#L165
info = json.loads(e.api_exceptions[0].body)
if info.get('reason').lower() == 'alreadyexists':
pass
else
raise e
file_path = ... # A path to a deployment file
namespace = 'default'
try:
utils.create_from_yaml(api, file_path, namespace=namespace)
except utils.FailToCreateError as e:
skip_if_already_exists(e)
I have written a following piece of code to achieve the functionality of creating k8s resources from its json/yaml file:
def create_from_yaml(yaml_file):
"""
:param yaml_file:
:return:
"""
yaml_object = yaml.loads(common.load_file(yaml_file))
group, _, version = yaml_object["apiVersion"].partition("/")
if version == "":
version = group
group = "core"
group = "".join(group.split(".k8s.io,1"))
func_to_call = "{0}{1}Api".format(group.capitalize(), version.capitalize())
k8s_api = getattr(client, func_to_call)()
kind = yaml_object["kind"]
kind = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', kind)
kind = re.sub('([a-z0-9])([A-Z])', r'\1_\2', kind).lower()
if "namespace" in yaml_object["metadata"]:
namespace = yaml_object["metadata"]["namespace"]
else:
namespace = "default"
try:
if hasattr(k8s_api, "create_namespaced_{0}".format(kind)):
resp = getattr(k8s_api, "create_namespaced_{0}".format(kind))(
body=yaml_object, namespace=namespace)
else:
resp = getattr(k8s_api, "create_{0}".format(kind))(
body=yaml_object)
except Exception as e:
raise e
print("{0} created. status='{1}'".format(kind, str(resp.status)))
return k8s_api
In above function, If you provide any object yaml/json file, it will automatically pick up the API type and object type and create the object like statefulset, deployment, service etc.
PS: The above code doesn't handler multiple kubernetes resources in one file, so you should have only one object per yaml file.
I see what you are looking for. This is possible with other k8s clients available in other languages. Here is an example in java. Unfortunately the python client library does not support that functionality yet. I opened a new feature request requesting the same and you can either choose to track it or contribute yourself :). Here is the link for the issue on GitHub.
The other way to still do what you are trying to do is to use java/golang client and put your code in a docker container.
currently I'm working with Kafka / Zookeeper and pySpark (1.6.0).
I have successfully created a kafka consumer, which is using the KafkaUtils.createDirectStream().
There is no problem with all the streaming, but I recognized, that my Kafka Topics are not updated to the current offset, after I have consumed some messages.
Since we need the topics updated to have a monitoring here in place this is somehow weird.
In the documentation of Spark I found this comment:
offsetRanges = []
def storeOffsetRanges(rdd):
global offsetRanges
offsetRanges = rdd.offsetRanges()
return rdd
def printOffsetRanges(rdd):
for o in offsetRanges:
print "%s %s %s %s" % (o.topic, o.partition, o.fromOffset, o.untilOffset)
directKafkaStream\
.transform(storeOffsetRanges)\
.foreachRDD(printOffsetRanges)
You can use this to update Zookeeper yourself if you want Zookeeper-based Kafka monitoring tools to show progress of the streaming application.
Here is the documentation:
http://spark.apache.org/docs/1.6.0/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers
I found a solution in Scala, but I can't find an equivalent for python.
Here is the Scala example: http://geeks.aretotally.in/spark-streaming-kafka-direct-api-store-offsets-in-zk/
Question
But the question is, how I'm able to update the zookeeper from that point on?
I write some functions to save and read Kafka offsets with python kazoo library.
First function to get singleton of Kazoo Client:
ZOOKEEPER_SERVERS = "127.0.0.1:2181"
def get_zookeeper_instance():
from kazoo.client import KazooClient
if 'KazooSingletonInstance' not in globals():
globals()['KazooSingletonInstance'] = KazooClient(ZOOKEEPER_SERVERS)
globals()['KazooSingletonInstance'].start()
return globals()['KazooSingletonInstance']
Then functions to read and write offsets:
def read_offsets(zk, topics):
from pyspark.streaming.kafka import TopicAndPartition
from_offsets = {}
for topic in topics:
for partition in zk.get_children(f'/consumers/{topic}'):
topic_partion = TopicAndPartition(topic, int(partition))
offset = int(zk.get(f'/consumers/{topic}/{partition}')[0])
from_offsets[topic_partion] = offset
return from_offsets
def save_offsets(rdd):
zk = get_zookeeper_instance()
for offset in rdd.offsetRanges():
path = f"/consumers/{offset.topic}/{offset.partition}"
zk.ensure_path(path)
zk.set(path, str(offset.untilOffset).encode())
Then before starting streaming you could read offsets from zookeeper and pass them to createDirectStream
for fromOffsets argument.:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
def main(brokers="127.0.0.1:9092", topics=['test1', 'test2']):
sc = SparkContext(appName="PythonStreamingSaveOffsets")
ssc = StreamingContext(sc, 2)
zk = get_zookeeper_instance()
from_offsets = read_offsets(zk, topics)
directKafkaStream = KafkaUtils.createDirectStream(
ssc, topics, {"metadata.broker.list": brokers},
fromOffsets=from_offsets)
directKafkaStream.foreachRDD(save_offsets)
if __name__ == "__main__":
main()
I encountered similar question.
You are right, by using directStream, means using kafka low-level API directly, which didn't update reader offset.
there are couple of examples for scala/java around, but not for python.
but it's easy to do it by yourself, what you need to do are:
read from the offset at the beginning
save the offset at the end
for example, I save the offset for each partition in redis by doing:
stream.foreachRDD(lambda rdd: save_offset(rdd))
def save_offset(rdd):
ranges = rdd.offsetRanges()
for rng in ranges:
rng.untilOffset # save offset somewhere
then at the begin, you can use:
fromoffset = {}
topic_partition = TopicAndPartition(topic, partition)
fromoffset[topic_partition]= int(value) #the value of int read from where you store previously.
for some tools that use zk to track offset, it's better to save the offset in zookeeper.
this page:
https://community.hortonworks.com/articles/81357/manually-resetting-offset-for-a-kafka-topic.html
describe how to set the offset, basically, the zk node is:
/consumers/[consumer_name]/offsets/[topic name]/[partition id]
as we are using directStream, so you have to make up a consumer name.
I am new to ROS and rospy, and I am not familiar with non-simple data type as topic.
I want to build a ROS node as both a subscriber and publisher: it receives a topic (a list of two float64), and uses a function (say my_function) which returns a list of lists of float64, then publish this list of list as a topic.
To do this, I built a node as following:
from pymongo import MongoClient
from myfile import my_function
import rospy
import numpy as np
pub = None
sub = None
def callback(req):
client = MongoClient()
db = client.block
lon = np.float64(req.b)
lat = np.float64(req.a)
point_list = my_function(lon, lat, db)
pub.publish(point_list)
def calculator():
global sub, pub
rospy.init_node('calculator', anonymous=True)
pub = rospy.Publisher('output_data', list)
# Listen
sub = rospy.Subscriber('input_data', list, callback)
print "Calculation finished. \n"
ros.spin()
if __name__ == '__main__':
try:
calculator()
except rospy.ROSInterruptException:
pass
I know clearly that list in Subscriber and Publisher is not a message data, but I cannot figure out how to fix it since it is not an integer nor list of integer.
This post on the ROS forums gives you most of what you need. This is also useful. You can define a new message type FloatList.msg with the following specification:
float64[] elements
And then a second message type FloatArray.msg defined as:
FloatList[] lists
Then your function could look like:
def callback(req):
client = MongoClient()
db = client.block
lon = np.float64(req.b)
lat = np.float64(req.a)
point_list = my_function(lon, lat, db)
float_array = FloatArray()
for i in range(len(point_list)):
float_list = FloatList()
float_list.elements = point_list[i]
float_array.lists[i] = float_list
pub.publish(float_array)
And then you can unpack it with:
def unpack_callback(float_array_msg):
for lst in float_array_msg.lists:
for e in lst.elements:
print "Here is an element: %f" % e
In general, it is recommended you put ROS related questions on the ROS Forums since you are way more likely to get an answer to your question there.
You can complicate yourself by defining a new ros Type in the msg OR use the default and easy to implement std_msgs Type, maybe will be useful to use a json module, so you serialize the data before publishing and deserialize it back on the other side after receiving...
the rest Pub/Sub , topic and handlers remain the same :)
I agree with the solution, I thought about organizing it a bit more
Create a file FloatArray.msg in your catkin_ws in the src folder where you have all your other message files.
Build your env using catkin_make or catkin build.
In your script (e.g. Python script) import the message type and use it in the subscriber e.g.
joint_state_publisher_Unity = rospy.Publisher("/joint_state_unity", FloatArray , queue_size = 10)
specific case (bonus :)): if you are using Unity and ROS# build the message in Unity