Setting Topic log retention in confluent-kafka-python

I could not find in the documentation how to set the retention time when creating a producer using confluent-kafka.
If I just specify 'bootstrap.servers', the default retention time is 1 day. I would like to be able to change that.
(I want to do this in the python API not on the command line.)

Retention time is set when you create a topic, not in the producer configuration.
If your server.properties allows auto-topic creation, then you will get the defaults set in there.
Otherwise, you can use the AdminClient API to send a NewTopic request, which supports a config attribute of type dict<str, str>:
from confluent_kafka.admin import AdminClient, NewTopic

a = AdminClient({'bootstrap.servers': 'localhost:9092'})  # adjust to your brokers

topics = list()
# Per-topic retention is set with the topic config key 'retention.ms'
# ('log.retention.hours' is a broker-level setting). 604800000 ms = 168 hours.
t = NewTopic("my-topic", num_partitions=3, replication_factor=1,
             config={'retention.ms': '604800000'})
topics.append(t)

# Call create_topics to asynchronously create topics; a dict
# of <topic, future> is returned.
fs = a.create_topics(topics)

# Wait for each operation to finish.
# Timeouts are preferably controlled by passing request_timeout=15.0
# to the create_topics() call.
# All futures will finish at the same time.
for topic, f in fs.items():
    try:
        f.result()  # The result itself is None
        print("Topic {} created".format(topic))
    except Exception as e:
        print("Failed to create topic {}: {}".format(topic, e))
In the same docs, you can also find an alter-configs request for changing the retention of an existing topic.
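For example, here is a minimal sketch using AdminClient.alter_configs; the broker address, topic name and retention value are placeholders, and note that the non-incremental alter_configs replaces the topic's full non-default configuration (newer client versions also offer incremental_alter_configs):
from confluent_kafka.admin import AdminClient, ConfigResource

a = AdminClient({'bootstrap.servers': 'localhost:9092'})  # placeholder brokers

# Topic-level retention is controlled by 'retention.ms' (here: 168 hours).
resource = ConfigResource(
    ConfigResource.Type.TOPIC,
    "my-topic",                                  # placeholder topic name
    set_config={'retention.ms': '604800000'},
)

# alter_configs() is asynchronous and returns a dict of <resource, future>.
fs = a.alter_configs([resource])
for res, f in fs.items():
    try:
        f.result()  # None on success
        print("Configuration of {} updated".format(res))
    except Exception as e:
        print("Failed to update {}: {}".format(res, e))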

The retention time is not a property of the producer.
The default retention time is set in the broker config file server.properties, via properties like log.retention.hours, e.g. in /etc/kafka/server.properties, depending on your installation.
You can alter the retention time on a per-topic basis, e.g.
$ <path-to-kafka>/bin/kafka-topics.sh --zookeeper <zookeeper-quorum> --alter --topic <topic-name> --config retention.ms=<your-desired-retention-in-ms>
HTH....

Related

Facebook Marketing API - how to handle rate limit for retrieving *all* ad sets through campaign ids?

I've recently started working with the Facebook Marketing API, using the facebook_business SDK for Python (running v3.9 on Ubuntu 20.04). I think I've mostly wrapped my head around how it works, however, I'm still kind of at a loss as to how I can handle the arbitrary way in which the API is rate-limited.
Specifically, what I'm attempting to do is to retrieve all Ad Sets from all the campaigns that have ever run on my ad account, regardless of whether their effective_status is ACTIVE, PAUSED, DELETED or ARCHIVED.
Hence, I pulled all the campaigns for my ad account. These are stored in a dict called output, where the key indicates the effective_status, like so:
{'ACTIVE': ['******************',
            '******************',
            '******************'],
 'PAUSED': ['******************',
            '******************',
            '******************']}
Then, I'm trying to pull the Ad Set ids, like so:
import pandas as pd
import json
import re
import time
from random import uniform

from facebook_business.api import FacebookAdsApi
from facebook_business.adobjects.adaccount import AdAccount  # account-level info
from facebook_business.adobjects.campaign import Campaign    # campaign-level info
from facebook_business.adobjects.adset import AdSet          # ad-set level info
from facebook_business.adobjects.ad import Ad                # ad-level info

# auth init
app_id = open(APP_ID_PATH, 'r').read().splitlines()[0]
app_secret = open(APP_SECRET_PATH, 'r').read().splitlines()[0]
token = open(APP_ACCESS_TOKEN, 'r').read().splitlines()[0]

# init the connection
FacebookAdsApi.init(app_id, app_secret, token)

campaign_types = list(output.keys())

ad_sets = {}
for status in campaign_types:
    ad_sets_for_status = []
    for campaign_id in output[status]:
        # sleep and wait for a random time
        sleepy_time = uniform(1, 3)
        time.sleep(sleepy_time)
        # pull the ad_sets for this particular campaign
        campaign_ad_sets = Campaign(campaign_id).get_ad_sets()
        for entry in campaign_ad_sets:
            ad_sets_for_status.append(entry['id'])
    ad_sets[status] = ad_sets_for_status
Now, this crashes at different times whenever I run it, with the following error:
FacebookRequestError:
  Message: Call was not successful
  Method:  GET
  Path:    https://graph.facebook.com/v11.0/23846914220310083/adsets
  Params:  {'summary': 'true'}
  Status:  400
  Response:
    {
      "error": {
        "message": "(#17) User request limit reached",
        "type": "OAuthException",
        "is_transient": true,
        "code": 17,
        "error_subcode": 2446079,
        "fbtrace_id": "***************"
      }
    }
I can't reproduce the time at which it crashes, however, it certainly doesn't take ~600 calls (see here: https://stackoverflow.com/a/29690316/5080858), and as you can see, I'm sleeping ahead of every API call. You might suggest that I should just call the get_ad_sets method on the AdAccount endpoint, however, this pulls fewer ad sets than the above code does, even before it crashes. For my use-case, it's important to pull ads that are long over as well as ads that are ongoing, hence it's important that I get as much data as possible.
I'm kind of annoyed with this -- seeing as we are paying for these ads to run, you'd think FB would make it as easy as possible to retrieve info on them via API, and not introduce API rate limits similar to those for valuable data one doesn't necessarily own.
Anyway, I'd appreciate any kind of advice or insights - perhaps there's also a much better way of doing this that I haven't considered.
Many thanks in advance!
The error with 'code': 17 means that you have hit the call rate limit, and in order to fetch more nodes you have to wait.
First, I would handle the error this way:
from facebook_business.exceptions import FacebookRequestError
...
for status in campaign_types:
    ad_sets_for_status = []
    for campaign_id in output[status]:
        # keep trying until the request is ok
        while True:
            try:
                campaign_ad_sets = Campaign(campaign_id).get_ad_sets()
                break
            except FacebookRequestError as error:
                if error.api_error_code() in [17, 80000]:
                    time.sleep(sleepy_time)  # back off before retrying
                else:
                    raise  # not a rate-limit error, do not retry forever
        for entry in campaign_ad_sets:
            ad_sets_for_status.append(entry['id'])
    ad_sets[status] = ad_sets_for_status
Moreover, I'd suggest fetching the list of nodes from the account (by using the 'level': node param in params) and using batch calls: I can assure you that this will help a lot and will decrease the program's run time.
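For illustration, here is a minimal sketch of the account-level approach, fetching every ad set of the account in one paged request instead of one request per campaign; the account ID is a placeholder, and the field and filter names should be checked against your SDK version:
from facebook_business.api import FacebookAdsApi
from facebook_business.adobjects.adaccount import AdAccount
from facebook_business.adobjects.adset import AdSet

# FacebookAdsApi.init(app_id, app_secret, token)  # as in the question

account = AdAccount('act_<AD_ACCOUNT_ID>')  # placeholder account id

# One paged request per account instead of one request per campaign.
ad_set_cursor = account.get_ad_sets(
    fields=[AdSet.Field.id, AdSet.Field.campaign_id, AdSet.Field.effective_status],
    params={
        'limit': 200,
        # filter to the statuses you care about
        'effective_status': ['ACTIVE', 'PAUSED', 'DELETED', 'ARCHIVED'],
    },
)

# The cursor pages transparently as you iterate over it.
ad_set_ids = [ad_set['id'] for ad_set in ad_set_cursor]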
I hope I was helpful.

Google Cloud Functions randomly retrying on success

I have a Google Cloud Function triggered by Pub/Sub. The documentation states that messages are acknowledged when the function ends with success (see the linked docs).
But randomly, the function retries (same execution ID) exactly 10 minutes after execution, which is the Pub/Sub ack max timeout.
I also tried to get the message ID and acknowledge it programmatically in the Function code, but the Pub/Sub API responds that there is no message to ack with that ID.
In Stackdriver monitoring, I see some messages not being acknowledged.
Here is my code: main.py
import base64
import logging
import traceback

from google.api_core import exceptions
from google.cloud import bigquery, error_reporting, firestore, pubsub

from sql_runner.runner import orchestrator

logging.getLogger().setLevel(logging.INFO)


def main(event, context):
    bigquery_client = bigquery.Client()
    firestore_client = firestore.Client()
    publisher_client = pubsub.PublisherClient()
    subscriber_client = pubsub.SubscriberClient()

    logging.info('event=%s', event)
    logging.info('context=%s', context)

    try:
        query_id = base64.b64decode(event.get('data', b'')).decode('utf-8')
        logging.info('query_id=%s', query_id)

        # inject dependencies
        orchestrator(
            query_id,
            bigquery_client,
            firestore_client,
            publisher_client
        )

        sub_path = (context.resource['name']
                    .replace('topics', 'subscriptions')
                    .replace('function-sql-runner', 'gcf-sql-runner-europe-west1-function-sql-runner')
                    )
        # explicitly ack message to avoid duplicate invocations
        try:
            subscriber_client.acknowledge(
                sub_path,
                [context.event_id]  # message_id to ack
            )
            logging.warning('message_id %s acknowledged (FORCED)', context.event_id)
        except exceptions.InvalidArgument as err:
            # google.api_core.exceptions.InvalidArgument: 400 You have passed an invalid ack ID to the service (ack_id=982967258971474).
            logging.info('message_id %s already acknowledged', context.event_id)
            logging.debug(err)

    except Exception as err:
        # catch all exceptions and log to prevent cold boot
        # report with error_reporting
        error_reporting.Client().report_exception()
        logging.critical(
            'Internal error : %s -> %s',
            str(err),
            traceback.format_exc()
        )


if __name__ == '__main__':  # for testing
    from collections import namedtuple  # use namedtuple to avoid Class creation

    Context = namedtuple('Context', 'event_id resource')
    context = Context('666', {'name': 'projects/my-dev/topics/function-sql-runner'})

    script_to_start = b' '  # launch the 1st script
    script_to_start = b'060-cartes.sql'

    main(
        event={"data": base64.b64encode(script_to_start)},
        context=context
    )
Here is my code: runner.py
import logging
import os

from retry import retry

PROJECT_ID = os.getenv('GCLOUD_PROJECT') or 'my-dev'


def orchestrator(query_id, bigquery_client, firestore_client, publisher_client):
    """
    If query_id is empty, start the first sql script;
    else, run the given query_id. Either way, call the next script.
    If the sql script is the last one, do not trigger another call.

    Retrieve SQL queries from Firestore and run them on BigQuery.
    """
    docs_refs = [
        doc_ref.get() for doc_ref in
        firestore_client.collection(u'sql_scripts').list_documents()
    ]
    sorted_queries = sorted(docs_refs, key=lambda x: x.id)

    if not bool(query_id.strip()):  # first execution
        current_index = 0
    else:
        # find the query to run
        query_ids = [query_doc.id for query_doc in sorted_queries]
        current_index = query_ids.index(query_id)

    query_doc = sorted_queries[current_index]

    bigquery_client.query(
        query_doc.to_dict()['request'],  # sql query
    ).result()

    logging.info('Query %s executed', query_doc.id)

    # exit if the current query is the last one
    if len(sorted_queries) == current_index + 1:
        logging.info('All scripts were executed.')
        return

    next_query_id = sorted_queries[current_index + 1].id.encode('utf-8')
    publish(publisher_client, next_query_id)


@retry(tries=5)
def publish(publisher_client, next_query_id):
    """
    Send a message on Pub/Sub to call the next query.
    This mechanism allows running one sql script per Function instance,
    so as not to exceed the 9 min deadline limit.
    """
    logging.info('Calling next query %s', next_query_id)
    future = publisher_client.publish(
        topic='projects/{}/topics/function-sql-runner'.format(PROJECT_ID),
        data=next_query_id
    )
    # ensure publish is successful
    message_id = future.result()
    logging.info('Published message_id = %s', message_id)
It looks like the Pub/Sub message is not acked on success.
I do not think I have background activity in my code.
My question: why is my Function randomly retrying even on success?
Cloud Functions does not guarantee that your functions will run exactly once. According to the documentation, background functions, including pubsub functions, are given an at-least-once guarantee:
Background functions are invoked at least once. This is because of the
asynchronous nature of handling events, in which there is no caller
that waits for the response. The system might, in rare circumstances,
invoke a background function more than once in order to ensure
delivery of the event. If a background function invocation fails with
an error, it will not be invoked again unless retries on failure are
enabled for that function.
Your code will need to expect that it could possibly receive an event more than once. As such, your code should be idempotent:
To make sure that your function behaves correctly on retried execution
attempts, you should make it idempotent by implementing it so that an
event results in the desired results (and side effects) even if it is
delivered multiple times. In the case of HTTP functions, this also
means returning the desired value even if the caller retries calls to
the HTTP function endpoint. See Retrying Background Functions for more
information on how to make your function idempotent.
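As an illustration only (not part of the asker's code), here is a minimal sketch of one way to make the handler idempotent, using the delivered context.event_id as a deduplication key in Firestore; the collection name is a made-up placeholder:
import logging
from google.cloud import firestore


def main(event, context):
    firestore_client = firestore.Client()

    # Use the event ID as a deduplication key: a retried delivery of the
    # same Pub/Sub message arrives with the same event_id.
    dedup_ref = firestore_client.collection(u'processed_events').document(context.event_id)

    if dedup_ref.get().exists:
        logging.info('event_id %s already processed, skipping', context.event_id)
        return

    # ... run orchestrator() as in the original code ...

    # Record that this event was handled so a redelivery becomes a no-op.
    dedup_ref.set({'processed_at': firestore.SERVER_TIMESTAMP})
This check-then-write is not transactional, so a Firestore transaction would close the remaining race, but even this simple guard turns a redelivered event into a near no-op.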

Pulling historical channel messages python

I am attempting to create a small dataset by pulling messages/responses from a Slack channel I am a part of. I would like to use Python to pull the data from the channel; however, I am having trouble figuring out my API key. I have created an app on Slack, but I am not sure how to find my API key. I see my client secret, signing secret, and verification token, but can't find my API key.
Here is a basic example of what I believe I am trying to accomplish:
import slack

sc = slack.SlackClient("api key")
sc.api_call(
    "channels.history",
    channel="C0XXXXXX"
)
I am willing to just download the data manually if that is possible as well. Any help is greatly appreciated.
Messages
Below is example code for how to pull messages from a channel in Python.
It uses the official Python Slack library and calls conversations_history with paging. It will therefore work with any type of channel and can fetch large amounts of messages if needed.
The result is written to a file as a JSON array.
You can specify the channel and the maximum number of messages to be retrieved.
Threads
Note that the conversations.history endpoint will not return thread messages. Those have to be retrieved additionally with one call to conversations.replies for every thread you want to retrieve messages for.
Threads can be identified in the messages of each channel by checking for the thread_ts property in the message. If it exists, there is a thread attached to it. See this page for more details on how threads work.
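For example, a minimal sketch of fetching the replies of a single thread (the channel ID and thread timestamp are placeholders, and rate-limit handling is omitted):
import os
import slack

client = slack.WebClient(token=os.environ['SLACK_TOKEN'])

# A parent message with a thread carries a 'thread_ts' property;
# pass that timestamp to conversations_replies to get the thread.
response = client.conversations_replies(
    channel="C12345678",      # placeholder channel ID
    ts="1612345678.000200",   # placeholder thread_ts of the parent message
    limit=200,
)
assert response["ok"]

# The first element is the parent message, the rest are the replies.
thread_messages = response['messages']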
IDs
This script will not replace IDs with names though. If you need that, here are some pointers on how to implement it (a sketch follows after these pointers):
You need to replace IDs for users, channels, bots, and usergroups (if on a paid plan).
You can fetch the lists for users, channels and usergroups from the API with users_list, conversations_list and usergroups_list respectively; bots need to be fetched one by one with bots_info (if needed).
IDs occur in many places in messages:
the user top-level property
the bot_id top-level property
as links in any property that allows text, e.g. <@U12345678> for users or <#C1234567> for channels. These can occur in the top-level text property, but also in attachments and blocks.
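As an illustration, a minimal sketch that replaces user mentions in the top-level text property with real names; it only covers the <@U...> form, and channels, bots, usergroups, attachments and blocks would need the same treatment:
import os
import re
import slack

client = slack.WebClient(token=os.environ['SLACK_TOKEN'])

# Build a lookup of user ID -> display name from users.list
user_names = {}
response = client.users_list()
assert response["ok"]
for member in response['members']:
    user_names[member['id']] = member.get('real_name') or member['name']

def replace_user_ids(text):
    """Replace <@U12345678>-style mentions with @<real name>."""
    return re.sub(
        r'<@(U[A-Z0-9]+)>',
        lambda m: '@' + user_names.get(m.group(1), m.group(1)),
        text,
    )

# e.g. for each message fetched earlier:
# message['text'] = replace_user_ids(message.get('text', ''))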
Example code
import os
import slack
import json
from time import sleep

CHANNEL = "C12345678"
MESSAGES_PER_PAGE = 200
MAX_MESSAGES = 1000

# init web client
client = slack.WebClient(token=os.environ['SLACK_TOKEN'])

# get first page
page = 1
print("Retrieving page {}".format(page))
response = client.conversations_history(
    channel=CHANNEL,
    limit=MESSAGES_PER_PAGE,
)
assert response["ok"]
messages_all = response['messages']

# get additional pages if below max messages and if there are any
while len(messages_all) + MESSAGES_PER_PAGE <= MAX_MESSAGES and response['has_more']:
    page += 1
    print("Retrieving page {}".format(page))
    sleep(1)  # need to wait 1 sec before next call due to rate limits
    response = client.conversations_history(
        channel=CHANNEL,
        limit=MESSAGES_PER_PAGE,
        cursor=response['response_metadata']['next_cursor']
    )
    assert response["ok"]
    messages = response['messages']
    messages_all = messages_all + messages

print(
    "Fetched a total of {} messages from channel {}".format(
        len(messages_all),
        CHANNEL
    ))

# write the result to a file
with open('messages.json', 'w', encoding='utf-8') as f:
    json.dump(
        messages_all,
        f,
        sort_keys=True,
        indent=4,
        ensure_ascii=False
    )
This uses the Slack Web API, so you will need to install the requests package. It should grab all the messages in a channel. You need a token, which can be grabbed from the apps management page, and you can use the getChannels() function below to look up channel IDs. Once you have all the messages, you will need to see who wrote which message by mapping IDs to usernames, which you can do with the getUsers() function. Follow https://api.slack.com/custom-integrations/legacy-tokens to generate a legacy token if you do not want to use a token from your app.
import requests


def getMessages(token, channelId):
    print("Getting Messages")
    # this function gets all the messages from the given channel
    slack_url = "https://slack.com/api/conversations.history?token=" + token + "&channel=" + channelId
    messages = requests.get(slack_url).json()
    return messages


def getChannels(token):
    '''
    returns a dictionary mapping the names of all the channels
    in a given workspace to their channel IDs
    '''
    channelsURL = "https://slack.com/api/conversations.list?token=%s" % token
    channelList = requests.get(channelsURL).json()["channels"]  # an array of channels
    channels = {}
    # putting the channels and their ids into a dictionary
    for channel in channelList:
        channels[channel["name"]] = channel["id"]
    return {"channels": channels}


def getUsers(token):
    # this function gets a list of users in the workspace, including bots
    usersURL = "https://slack.com/api/users.list?token=%s&pretty=1" % token
    members = requests.get(usersURL).json()["members"]
    return members

Confluent kafka python pause-resume functionality example

I was trying to use the Confluent Kafka consumer's pause and resume functionality, but couldn't find any examples on the internet except the main link.
https://docs.confluent.io/5.0.0/clients/confluent-kafka-python/index.html
I couldn't understand the parameters to be passed to them. Is it a list of partitions, or topic names, or what?
As OneCricketeer mentioned, pause() and resume() take a list of TopicPartition,
and to initialize the TopicPartition class you need the topic, partition, and offset, which you can get from the message object.
This is how you can achieve it through Confluent-Kafka-Python:
import time
from confluent_kafka import Consumer, TopicPartition

conf = {
    'bootstrap.servers': "localhost:9092",
    'group.id': "test-consumer-group",
    'auto.offset.reset': 'earliest',
    'enable.auto.commit': False
}

topics = ['topic1']

consumer = Consumer(conf)
consumer.subscribe(topics)

while True:
    try:
        msg = consumer.poll(1.0)
        if msg is None:
            print("Waiting for message or event/error in poll()...")
            continue
        if msg.error():
            print("Error: {}".format(msg.error()))
            continue
        else:
            # Call your processing function and pause the consumer
            consumer.pause([TopicPartition(msg.topic(), msg.partition(), msg.offset())])
            time.sleep(60)  # think of it as processing time
            # Once the processing is done, resume the consumer and commit the message
            consumer.resume([TopicPartition(msg.topic(), msg.partition(), msg.offset())])
            consumer.commit()
    except Exception as e:
        print(e)
This is just an example and you might want to modify it based on your use case.
Pause and resume take a list of TopicPartition; a short example follows the quoted docs below.
class confluent_kafka.TopicPartition
TopicPartition is a generic type to hold a single partition and various information about it.
It is typically used to provide a list of topics or partitions for various operations, such as Consumer.assign().
TopicPartition(topic[, partition][, offset])
Instantiate a TopicPartition object.
Parameters:
topic (string) – Topic name
partition (int) – Partition id
offset (int) – Initial partition offset
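For completeness, here is a minimal sketch that pauses and resumes whatever the consumer is currently assigned, via Consumer.assignment(), instead of building TopicPartition objects by hand; the broker, group and topic names are placeholders:
from confluent_kafka import Consumer

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',   # placeholder broker
    'group.id': 'test-consumer-group',
    'auto.offset.reset': 'earliest',
})
consumer.subscribe(['topic1'])               # placeholder topic

msg = consumer.poll(5.0)
if msg is not None and not msg.error():
    # assignment() returns the list of TopicPartition currently assigned,
    # so you can pause/resume all of them without tracking offsets yourself.
    partitions = consumer.assignment()
    consumer.pause(partitions)
    # ... do slow processing of msg here ...
    consumer.resume(partitions)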

Read/write values using Ethernet/IP

I recently have acquired an ACS Linear Actuator (Tolomatic Stepper) that I am attempting to send data to from a Python application. The device itself communicates using Ethernet/IP protocol.
I have installed the library cpppo via pip. When I issue a command in an attempt to read the status of the device, I get None back. Examining the communication with Wireshark, I see that it appears to proceed correctly; however, I notice a response from the device indicating:
Service not supported.
Example of the code I am using to test reading an "Input Assembly":
from cpppo.server.enip import client

HOST = "192.168.1.100"
TAGS = ["#4/100/3"]

with client.connector(host=HOST) as conn:
    for index, descr, op, reply, status, value in conn.synchronous(
            operations=client.parse_operations(TAGS)):
        print(": %20s: %s" % (descr, value))
I am expecting to get an "Input Assembly" read, but it does not appear to be working that way. I imagine that I am missing something, as this is the first time I have attempted EtherNet/IP communication.
I am not sure how to proceed or what I am missing about EtherNet/IP that may make this work correctly.
clutton -- I'm the author of the cpppo module.
Sorry for the delayed response. We only recently implemented the ability to communicate with simple (non-routing) CIP devices. The ControlLogix/CompactLogix controllers implement an expanded set of EtherNet/IP CIP capability, something that most simple CIP devices do not. Furthermore, they typically also do not implement the *Logix "Read Tag" request; you have to struggle by with the basic "Get Attribute Single/All" requests -- which just return raw, 8-bit data. It is up to you to turn that back into a CIP REAL, INT, DINT, etc.
In order to communicate with your linear actuator, you will need to disable these enhanced encapsulations, and use "Get Attribute Single" requests. This is done by specifying an empty route_path=[] and send_path='', when you parse your operations, and to use cpppo.server.enip.getattr's attribute_operations (instead of cpppo.server.enip.client's parse_operations):
from cpppo.server.enip import client
from cpppo.server.enip.getattr import attribute_operations

HOST = "192.168.1.100"
TAGS = ["#4/100/3"]

with client.connector(host=HOST) as conn:
    for index, descr, op, reply, status, value in conn.synchronous(
            operations=attribute_operations(
                TAGS, route_path=[], send_path='')):
        print(": %20s: %s" % (descr, value))
That should do the trick!
We are in the process of rolling out a major update to the cpppo module, so clone the https://github.com/pjkundert/cpppo.git Git repo, and checkout the feature-list-identity branch, to get early access to much better APIs for accessing raw data from these simple devices, for testing. You'll be able to use cpppo to convert the raw data into CIP REALs, instead of having to do it yourself...
...
With Cpppo >= 3.9.0, you can now use much more powerful cpppo.server.enip.get_attribute 'proxy' and 'proxy_simple' interfaces to routing CIP devices (eg. ControlLogix, Compactlogix), and non-routing "simple" CIP devices (eg. MicroLogix, PowerFlex, etc.):
$ python
>>> from cpppo.server.enip.get_attribute import proxy_simple
>>> product_name, = proxy_simple( '10.0.1.2' ).read( [('#1/1/7','SSTRING')] )
>>> product_name
[u'1756-L61/C LOGIX5561']
If you want regular updates, use cpppo.server.enip.poll:
import logging
import sys
import time
import threading

from cpppo.server.enip import poll
from cpppo.server.enip.get_attribute import proxy_simple as device

params = [('#1/1/1', 'INT'), ('#1/1/7', 'SSTRING')]
# If you have an A-B PowerFlex, try:
# from cpppo.server.enip.ab import powerflex_750_series as device
# params = ["Motor Velocity", "Output Current"]

hostname = '10.0.1.2'
values = {}  # { <parameter>: <value>, ... }

poller = threading.Thread(
    target=poll.poll, args=(device,), kwargs={
        'address': (hostname, 44818),
        'cycle': 1.0,
        'timeout': 0.5,
        'process': lambda par, val: values.update({par: val}),
        'params': params,
    })
poller.daemon = True
poller.start()

# Monitor the values dict (updated in another Thread)
while True:
    while values:
        logging.warning("%16s == %r", *values.popitem())
    time.sleep(.1)
And, Voila! You now have regularly updating parameter names and values in your 'values' dict. See the examples in cpppo/server/enip/poll_example*.py for further details, such as how to report failures, control exponential back-off of connection retries, etc.
Version 3.9.5 has recently been released, which has support for writing to CIP Tags and Attributes, using the cpppo.server.enip.get_attribute proxy and proxy_simple APIs. See cpppo/server/enip/poll_example_many_with_write.py
Hope this is obvious, but accessing HOST = "192.168.1.100" will only be possible from a system located on the subnet 192.168.1.*.
