connect to a heroku kafka instance with kafka-python from outside heroku - python

I have set up a Heroku Kafka instance and I am trying to connect to it using the Python consumer. I saved the Heroku environment in a file called .env by running heroku config -s > .env, and I load and export it before running this Python program:
import os
from kafka import KafkaConsumer

for variable in ['KAFKA_TRUSTED_CERT', 'KAFKA_CLIENT_CERT', 'KAFKA_CLIENT_CERT_KEY']:
    with open(f'{variable}.txt', "w") as text_file:
        print(os.environ[variable], file=text_file)

consumer = KafkaConsumer('test-topic',
                         bootstrap_servers=os.environ['KAFKA_URL'],
                         security_protocol="SSL",
                         ssl_certfile="KAFKA_CLIENT_CERT.txt",
                         ssl_keyfile="KAFKA_CLIENT_CERT_KEY.txt")

for msg in consumer:
    print(msg)
I couldn't find any options that looked like they could load the certificates from a variable, so I write them all to files when the program starts.
When I run the program, it creates the temp files and doesn't complain, but it never prints any messages.
When I write to the topic using the heroku cli like this
heroku kafka:topics:write test-topic "this is a test"
the Python client doesn't print the message, but I can see it by running
heroku kafka:topics:tail test-topic
Does anybody know what I am missing in the python consumer configuration?

The official Heroku Kafka documentation:
https://devcenter.heroku.com/articles/kafka-on-heroku#using-kafka-in-python-applications
recommends using the kafka-helper package. If you look at its source code:
https://github.com/heroku/kafka-helper/blob/master/kafka_helper.py
you can see that it writes the Kafka config variables to files and creates an ssl_context for the consumer and producer.
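For reference, here is a minimal sketch of that approach adapted to the consumer above. The file names are arbitrary, and the KAFKA_URL parsing assumes the usual comma-separated kafka+ssl:// broker URLs that Heroku provides:
import os
import ssl
from urllib.parse import urlparse
from kafka import KafkaConsumer

def write_env_to_file(env_name, file_name):
    # Dump the PEM string stored in a Heroku config var into a local file.
    with open(file_name, 'w') as f:
        f.write(os.environ[env_name])
    return file_name

cert_file = write_env_to_file('KAFKA_CLIENT_CERT', 'client_cert.pem')
key_file = write_env_to_file('KAFKA_CLIENT_CERT_KEY', 'client_cert_key.pem')
trust_file = write_env_to_file('KAFKA_TRUSTED_CERT', 'trusted_cert.pem')

# Build an SSL context that trusts the broker CA and presents the client certificate.
ssl_context = ssl.create_default_context()
ssl_context.load_verify_locations(trust_file)
ssl_context.load_cert_chain(cert_file, keyfile=key_file)
ssl_context.check_hostname = False  # broker certificates are not issued for the broker hostnames

# KAFKA_URL looks like "kafka+ssl://host1:9096,kafka+ssl://host2:9096,..."
bootstrap_servers = ['{}:{}'.format(u.hostname, u.port)
                     for u in (urlparse(s) for s in os.environ['KAFKA_URL'].split(','))]

consumer = KafkaConsumer('test-topic',
                         bootstrap_servers=bootstrap_servers,
                         security_protocol='SSL',
                         ssl_context=ssl_context)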

Related

Apache beam DataFlow runner throwing setup error

We are building a data pipeline using the Beam Python SDK and trying to run it on Dataflow, but we are getting the error below:
A setup error was detected in beamapp-xxxxyyyy-0322102737-03220329-8a74-harness-lm6v. Please refer to the worker-startup log for detailed information.
However, we could not find detailed worker-startup logs.
We tried increasing the memory size, worker count, etc., but we still get the same error.
Here is the command we use:
python run.py \
--project=xyz \
--runner=DataflowRunner \
--staging_location=gs://xyz/staging \
--temp_location=gs://xyz/temp \
--requirements_file=requirements.txt \
--worker_machine_type n1-standard-8 \
--num_workers 2
Pipeline snippet:
data = pipeline | "load data" >> beam.io.Read(
    beam.io.BigQuerySource(query="SELECT * FROM abc_table LIMIT 100")
)
data | "filter data" >> beam.Filter(lambda x: x.get('column_name') == value)
The above pipeline just loads the data from BigQuery and filters it on a column value. It works like a charm with DirectRunner but fails on Dataflow.
Are we making any obvious setup mistake? Is anyone else getting the same error? We could use some help resolving this issue.
Update:
Our pipeline code is spread across multiple files, so we created a Python package. We solved the setup error by passing the --setup_file argument instead of --requirements_file.
We resolved this setup error by sending a different set of arguments to Dataflow. Our code is spread across multiple files, so we had to create a package for it. If we use --requirements_file, the job will start but eventually fail, because it cannot find the package on the workers. The Beam Python SDK sometimes does not throw an explicit error message for this; instead, it retries the job and fails. To get your code running as a package, pass the --setup_file argument pointing at a setup.py that lists your dependencies. Make sure the package created by the python setup.py sdist command includes all the files required by your pipeline code.
If you have a privately hosted Python package dependency, pass --extra_package with the path to the package .tar.gz file. A better way is to store it in a GCS bucket and pass that path here.
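For example, the submission command above would change to something like this (the bucket and package paths are placeholders):
python run.py \
  --project=xyz \
  --runner=DataflowRunner \
  --staging_location=gs://xyz/staging \
  --temp_location=gs://xyz/temp \
  --setup_file=./setup.py \
  --extra_package=gs://xyz/packages/private_dependency-1.0.tar.gz \
  --worker_machine_type n1-standard-8 \
  --num_workers 2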
I have written an example project to get started with Apache Beam Python SDK on Dataflow - https://github.com/RajeshHegde/apache-beam-example
Read about it here - https://medium.com/@rajeshhegde/data-pipeline-using-apache-beam-python-sdk-on-dataflow-6bb8550bf366
I'm building a prediction pipeline using Apache Beam/Dataflow. I need to include the model files inside the dependencies available to the remote workers. The Dataflow job failed with the same error log:
Error message from worker: A setup error was detected in beamapp-xxx-xxxxxxxxxx-xxxxxxxx-xxxx-harness-xxxx. Please refer to the worker-startup log for detailed information.
However, this error message didn't give any details about the worker-startup log. Finally, I found a way to get at the worker log and solve the problem.
As is known, Dataflow creates Compute Engine instances to run jobs and saves logs on them, so that we can access the VMs to see the logs. We can connect to the VM used by our Dataflow job from the GCP console via SSH. Then we can check the boot-json.log file located in /var/log/dataflow/taskrunner/harness:
$ cd /var/log/dataflow/taskrunner/harness
$ cat boot-json.log
Here we should pay attention: when running in batch mode, the VM created by Dataflow is ephemeral and is shut down when the job fails. Once the VM is shut down, we can't access it anymore. But a failing work item is retried 4 times, so normally there is enough time to open boot-json.log and see what is going on.
At last, I put my Python setup solution here that may help someone else:
main.py
...
model_path = os.path.dirname(os.path.abspath(__file__)) + '/models/net.pd'
# pipeline code
...
MANIFEST.in
include models/*.*
setup.py complete example
REQUIRED_PACKAGES = [...]

setuptools.setup(
    ...
    include_package_data=True,
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
    package_data={"models": ["models/*"]},
    ...
)
Run Dataflow pipelines
$ python main.py --setup_file=/absolute/path/to/setup.py ...

Pypiserver logging in API

I'm trying to create a private Python package index using pypiserver via its API.
According to the documentation, I can specify the verbosity and the log file in the pypiserver app setup.
This is what I have:
import pypiserver
from pypiserver import bottle

app = pypiserver.app(root='./packages', password_file='.htpasswd', verbosity=4,
                     log_file=r'F:\repo\logfile.txt')  # raw string so the backslash is not treated as an escape
bottle.run(app=app, host='itdevws09', port=8181, server='auto')
However, when I start it using python mypyserver.py, the index starts up and works normally, but no log file is created. If I create one manually, the log file isn't actually written to.
If I start the pypiserver using the command line using:
pypi-server -p 8080 -P .htpasswd -vvvv --log-file F:/repo/logfile.txt ./packages
The log file is created and written to as normal.
I have tried putting the log file and verbosity options in the bottle.run() method, but that doesn't work either. How can I get the logging to work?
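One thing worth checking: the pypi-server command line configures Python's standard logging for you, while the API path may not. A possible workaround (an assumption on my part, not something taken from the pypiserver docs) is to configure the root logger yourself before creating the app:
import logging

import pypiserver
from pypiserver import bottle

# Route log records to a file ourselves; the path and level below are placeholders.
logging.basicConfig(
    filename=r'F:\repo\logfile.txt',
    level=logging.DEBUG,
    format='%(asctime)s %(levelname)s %(name)s: %(message)s',
)

app = pypiserver.app(root='./packages', password_file='.htpasswd')
bottle.run(app=app, host='itdevws09', port=8181, server='auto')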

starting a service in windows through nsis

I have compiled my script to myscripts.exe using pyinstaller --onefile.
myscripts.py contains:
import os
os.popen("python celery -A tasks worker --loglevel=info -P solo -c 1")
Once I got the .exe file:
SimpleSC::InstallService "ERP" "ERP Data Cloud" "16" "2" "$INSTDIR\myscript1.exe" "" "" ""
SimpleSC::StartService "ERP" "" 30
I compiled this using NSIS and got my setup.exe.
Now when I look at the Services window I can see that the service has been added, but the status is blank. Even when I try to start the service manually, I get an error:
The service did not respond to the start or control request in a timely fashion.
After install, with
!define MUI_FINISHPAGE_RUN "$INSTDIR\dist\myscripts.exe"
I am able to run myscripts.exe, which starts celery with no problem, but I want it to run as a service.
Now the questions:
Am I doing this in a completely wrong way, or do I need to add something?
What am I missing?
A service has to call specific service functions. If you are unable to write a proper service you could try a helper utility like Srvany...
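To illustrate the first point, here is a rough sketch of what a proper service wrapping that celery command could look like, using the pywin32 package (pywin32 is not mentioned in the answer above; the service name and celery arguments are simply copied from the question):
import subprocess

import servicemanager
import win32event
import win32service
import win32serviceutil


class ErpService(win32serviceutil.ServiceFramework):
    # Names taken from the SimpleSC::InstallService call in the question.
    _svc_name_ = "ERP"
    _svc_display_name_ = "ERP Data Cloud"

    def __init__(self, args):
        win32serviceutil.ServiceFramework.__init__(self, args)
        self.stop_event = win32event.CreateEvent(None, 0, 0, None)
        self.process = None

    def SvcDoRun(self):
        servicemanager.LogInfoMsg("ERP service starting celery worker")
        # Start the celery worker as a child process.
        self.process = subprocess.Popen(
            ["celery", "-A", "tasks", "worker", "--loglevel=info", "-P", "solo", "-c", "1"]
        )
        # Block here so the service control manager sees the service as running.
        win32event.WaitForSingleObject(self.stop_event, win32event.INFINITE)

    def SvcStop(self):
        self.ReportServiceStatus(win32service.SERVICE_STOP_PENDING)
        if self.process:
            self.process.terminate()
        win32event.SetEvent(self.stop_event)


if __name__ == "__main__":
    # Supports install/start/stop/remove from the command line, e.g. "myscripts.exe install".
    win32serviceutil.HandleCommandLine(ErpService)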

How to run kafka producer that is written in Python?

I am new to Kafka but have watched a few tutorials, so I know how Kafka works. I am trying to run a producer that I have written in Python, but I don't know how to run this file after I have started my ZooKeeper server and Kafka server. If anyone can tell me the structure of the command to type at the command prompt, I would really appreciate it.
Thanks!
Kafka Producer:
import json
import logging
import time

from kafka import KafkaProducer
from kafka.errors import KafkaError

log = logging.getLogger(__name__)

if __name__ == "__main__":
    producer = KafkaProducer(bootstrap_servers='localhost:9092')

    # Send one message and block until it is acknowledged (or times out).
    future = producer.send('my-topic', b"test")
    try:
        record_metadata = future.get(timeout=10)
    except KafkaError:
        log.exception("failed to send message")
    else:
        print(record_metadata.topic)
        print(record_metadata.partition)
        print(record_metadata.offset)

    # A producer that serializes values as JSON.
    producer = KafkaProducer(value_serializer=lambda m: json.dumps(m).encode('ascii'))
    producer.send('json-topic', {'key': 'value'})

    for _ in range(100):
        producer.send('my-topic', b"test")
        producer.send('my-topic', b"\xc2Hola, mundo!")
        time.sleep(1)
So your question is how to run a Python script? Simply save it, make it executable and execute it:
chmod +x ./kproducer.py
python ./kproducer.py
More details are here: How to Run a Python Script via a File or the Shell
Add a shebang line at the top of your script:
#!/usr/bin/env python-version
Replace python-version with python2 for 2.x or python3 for 3.x.
To check the version of Python, use the command:
python -V
The shebang line determines the script's ability to run standalone, which helps when you want to double-click the script and execute it rather than launching it from the terminal. Or simply run:
python scriptname.py

Running Python scripts on local machine

I am trying to run a simple hello world example in Python that runs against MongoDB. I've set up mongo, bottle and pymongo and have the following script inside C:\Python27\Scripts:
import bottle
import pymongo

@bottle.route('/')
def index()
    from pymongo import Connection
    connection = Connection('localhost', 27017)
    db = connection.test
    names = db.names
    item = names.find_one()
    return '<b>Hello %s!</b>' % item['name']

bottle.run(host='localhost', port=8082)
I want to run this locally, so I go to http://localhost:8082 but I get "not found". How can I run that code so I can test it locally in the browser? I am running Windows 7 and have WAMP installed.
1) Add : after the function name:
def index():
2) WAMP does not include MongoDB. You need to install MongoDB locally as well.
3) If something doesn't work, you should generally check the console for errors.
This script will run standalone (bottle.run() starts its own Python web server), so you do not need WAMP - just run the script. Run it from the command line so you can see any errors.
You also need a running MongoDB instance to connect to. You can start it from the command line too if you do not have MongoDB configured to start automatically with Windows.
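Putting those fixes together, a corrected version of the script could look like this (using MongoClient rather than Connection, since Connection has been removed from recent pymongo releases; that substitution is mine, not from the answers above):
import bottle
from pymongo import MongoClient


@bottle.route('/')
def index():  # note the colon that was missing in the original
    client = MongoClient('localhost', 27017)
    db = client.test
    item = db.names.find_one()
    return '<b>Hello %s!</b>' % item['name']


bottle.run(host='localhost', port=8082)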
