Failed to find data source: kafka - python

I was reading through this post, https://nycdatascience.com/blog/student-works/yelp-recommender-part-2/, and followed basically everything they showed. However, after reading this post, Spark 2.1 Structured Streaming - Using Kakfa as source with Python (pyspark), when I run
SPARK_HOME/bin/spark-submit read_stream_spark.py --master local[4] --jars spark-sql-kafka-0.10_2.11-2.1.0.jar
I still get the error 'Failed to find data source: kafka'.
I also read through this: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html. The official docs ask for two hosts and two ports, while I only use one. Should I specify another host and port besides my cloud server and the Kafka port? Thanks.
Could you please let me know what I am missing? Or should I not be running the script on its own?

The official docs ask for two hosts and two ports
That's not related to your error. A minimum of one bootstrap server is required.
You need to move your Python file to the end of the command; otherwise, all of the options you provided are passed as command line arguments to the Python script rather than to spark-submit, so it is using the default master with no external jars.
It's also recommended that you use --packages rather than --jars, since this should ensure that transitive dependencies are included with the submission.
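For example, a corrected submission could look roughly like this (a sketch based on the jar you listed; adjust the Scala and Spark versions to match your installation):

SPARK_HOME/bin/spark-submit \
  --master local[4] \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0 \
  read_stream_spark.py

And inside the script a single bootstrap server is enough (the host and topic names below are placeholders; spark is your SparkSession):

df = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "your-cloud-server:9092")
        .option("subscribe", "your_topic")
        .load())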

Related

What is the best way to work with whitelisted IP's on a MySQL DB when using Docker?

I have a server which contains a Python file that connects to two external MySQL DBs. One of those DBs can be reached easily, while the other requires that IPs be whitelisted in order to gain access. The server's IP is already whitelisted, and everything works as intended when the script is run directly on the server.
The problem arises, however, when I attempt to run the docker-ized variation of the application. The first DB works just as it did before, but the second DB no longer does. From inside the container I can ping the second DB, but whenever I try to access it via the code hosted on the server, none of the functions that use it return data. I noticed that the container has a separate IP, which may be causing the issue, since the container's IP would not have been whitelisted; that may be where the problem begins. I am fairly new to Docker, so any documentation links that would assist me would be extremely helpful.
So for anyone who is dealing with this situation in the future, I added the line
network_mode: "host"
to my docker-compose.yaml file.
Here are the docs related to this: https://docs.docker.com/network/host/
Essentially what was happening is that the container could not be recognized by the whitelist and was not being allowed access to the second DB. With this change, the container shares the same network as the server it is hosted on, and since that server was whitelisted beforehand, it all worked out of the gate.
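For reference, a minimal docker-compose.yaml sketch of that change (the service and image names are placeholders):

version: "3"
services:
  app:
    image: my-python-app   # placeholder image
    network_mode: "host"

Note that published port mappings are ignored when host networking is used, since the container shares the host's network stack directly.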
If you are using docker, then use
--net=host
within your run command. Here is a SO link about what this addition does:
What does --net=host option in Docker command really do?
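For example (the image name is a placeholder):
docker run --net=host my-python-app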

How to monitor a node with Cacti - collect data with a script (shell, etc.), no SNMP MIB

I would like to set up the Cacti monitoring tool, but some of the nodes I want to monitor do not support SNMP.
I thought about writing shell scripts on the nodes themselves to extract the required output, but this is where I am stuck: how do I send these measurements to Cacti and get them displayed?
Do you know a way to do it? Any URL or guide? Thanks for guiding me.
The Cacti manual covers this topic.
Note that the scripts run on the Cacti server. In order to execute scripts on your targets, you could create a server-side script which uses SSH to connect to the target (I'm assuming that if the targets have a shell environment, they're hopefully accessible via SSH). After connecting, the server-side script can execute the client-side script(s) and relay the output back.
Additionally, there's a whole directory of example scripts bundled with the source code. In particular, there's an example shell script which demonstrates output in the multiple-field format described in the manual.
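As a rough sketch of the relay idea (the SSH target, the remote command, and the field names below are placeholders), a server-side data input script could run the client-side script over SSH and print its values in the multiple-field name:value format described in the manual:

import subprocess

TARGET = "cactiuser@target-host"                # placeholder SSH target
REMOTE_CMD = "/usr/local/bin/collect_stats.sh"  # hypothetical client-side script

# Run the client-side script on the target and capture its output,
# which we assume prints two numbers: "<load> <memfree>".
out = subprocess.check_output(["ssh", TARGET, REMOTE_CMD]).decode().strip()
load, memfree = out.split()

# Cacti's script data input method reads space-separated name:value pairs.
print("load:%s memfree:%s" % (load, memfree))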

Is it possible to use python to establish a putty ssh session and send some input?

First of all, due to company policy, Paramiko, or installing anything that requires administrative access to the local machine, is right out; otherwise I would have just done that.
All I have to work with is Python with its standard libraries and PuTTY.
I am attempting to automate some tedious work that involves logging into a network device (usually Cisco, occasionally Alcatel-Lucent, or Juniper), running some show commands, and saving the data. (I am planning on using some other scripts to pull data from this file, parse it, and do other things, but that should be irrelevant to the task of retrieving the data.) I know this can be done with telnet, however I need to do this via ssh.
My thought is to use putty's logging ability to record output from a session to a file. I would like to use Python to establish a putty session, send scripted log-in and show commands, and then close the session. Before I set out on this crusade, does anyone know of any way to do this? The closest answers I have found to this all suggest to use Paramiko, or other python ssh library; I am looking for a way to do this given the constraints I am under.
The end result would ideally be usable as a function, so that I can iterate through hundreds of devices from a list of IP addresses.
Thank you for your time and consideration.
If you can't use Paramiko and PuTTY is all you have, then the correct tool is actually not PuTTY itself - it's its little brother Plink, which you can download here.
Plink is the command line tool for PuTTY, and you can have your Python script call it using os.system("plink.exe [options] username@server.com [command]").
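As a rough sketch of that idea (using subprocess instead of os.system so the output can be captured; the host, credentials, and commands below are placeholders, and you may need to accept each device's host key once manually since -batch disables interactive prompts):

import subprocess

def run_show_commands(ip, username, password, commands, log_path):
    # -ssh forces the SSH protocol; -batch disables interactive prompts.
    proc = subprocess.Popen(
        ["plink.exe", "-ssh", "-batch", "-l", username, "-pw", password, ip],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    # Feed the show commands to the interactive session, one per line.
    output, _ = proc.communicate("\n".join(commands) + "\nexit\n")
    with open(log_path, "w") as f:
        f.write(output)
    return output

# Iterate over a list of device IPs (all values here are placeholders).
for ip in ["10.0.0.1", "10.0.0.2"]:
    run_show_commands(ip, "admin", "secret", ["show version"], ip + ".log")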
See MAN Page here
Hope it will help,
Liron

ssh - getting metadata of many remote files

There's a remote file system which I can access using SSH.
I need to:
1. scan this file system to find all the files newer than a given datetime
2. retrieve a list of those files' names, sizes, and modified timestamps
Some restrictions:
I can't upload a script to this remote server; I can only run commands through SSH.
There could be well over 100k files on the remote server, and this process should happen at least once a minute, so the number of SSH calls should be minimal, and preferably equal to 1.
I've already managed to get (1) using this:
touch -am -t {timestamp} /tmp/some_filename; find {path} -newer /tmp/some_filename; rm /tmp/some_filename
and I thought I could move in the direction of piping the results into "xargs ls -l" and then parsing the output to extract the size and timestamp from there, but then I found this article...
Also, I'm running the command from Python (i.e. it's not just a command line), so it's OK to do some post-processing on the results coming from the SSH command.
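For reference, the single-call direction I was considering could look roughly like this (a sketch assuming GNU find on the remote host, which supports -printf and so avoids parsing ls output entirely; the host is a placeholder):

import subprocess

# One SSH call: path, size in bytes, and mtime (epoch seconds), tab-separated.
cmd = (
    "touch -am -t {timestamp} /tmp/some_filename; "
    "find {path} -newer /tmp/some_filename -type f -printf '%p\\t%s\\t%T@\\n'; "
    "rm /tmp/some_filename"
)
out = subprocess.check_output(["ssh", "user@remote-host", cmd]).decode()

files = []
for line in out.splitlines():
    name, size, mtime = line.split("\t")
    files.append((name, int(size), float(mtime)))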
I suggest writing or modifying your Python script on the server side as follows:
1. When no data has been acquired in a while, acquire initial data using the touch/find script you provided, making calls on the found files to get the needed properties.
2. Then, in the Python script on the server, subscribe to inotify events to get updates.
3. When a remote client connects and needs all this data, provide the latest update by combining 1 and 2.
inotify is a system call supported in Linux that allows you to monitor file system events on a directory in real time.
See:
https://serverfault.com/questions/30292/automatic-notification-of-new-or-changed-files-in-a-folder-or-share
http://linux.die.net/man/7/inotify
https://github.com/seb-m/pyinotify
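A minimal pyinotify sketch of the subscription step (the watched path is a placeholder):

import os
import pyinotify

WATCH_PATH = "/data/to/watch"   # placeholder path on the server

class Handler(pyinotify.ProcessEvent):
    def process_IN_CLOSE_WRITE(self, event):
        # A file was written and closed; record its name, size, and mtime.
        st = os.stat(event.pathname)
        print(event.pathname, st.st_size, st.st_mtime)

wm = pyinotify.WatchManager()
notifier = pyinotify.Notifier(wm, Handler())
wm.add_watch(WATCH_PATH, pyinotify.IN_CLOSE_WRITE, rec=True, auto_add=True)
notifier.loop()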

Access CVS through Apache service using SSPI

I'm running an Apache server (v2.2.10) with mod_python, Python 2.5 and Django. I have a small web app that will show the current projects we have in CVS and allow users to make a build of the different projects (the build checks out the project, and copies certain files over with the source stripped out).
On the Django dev server, everything works fine. I can see the list of projects in cvs, check out, etc. On the production server (the Apache one) I get the following error:
[8009030d] The credentials supplied to the package were not recognized
I'm trying to log in to the CVS server using SSPI. Entering the same command into a shell will execute properly.
This is the code I'm using:
import subprocess

def __execute(self, command=''):
    command = 'cvs.exe -d :sspi:user:password@cvs-serv.example.com:/Projects ls'
    p = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, shell=True)
    return p.communicate()
I've tried a number of different variations of things, and I can't seem to get it to work. Right now I believe that Apache is the culprit.
Any help would be appreciated
Usage of SSPI makes me think you are using CVSNT, and thus a Windows system. What user are you running Apache as? The default user for services is SYSTEM, which does not share the same registry as your current user.
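As a quick check of that (a diagnostic sketch only; the output path is a placeholder), you can log which account the web process actually runs under from inside your Django code:

import getpass
import os

# Account the Apache/mod_python worker runs as; for a Windows service this is
# often SYSTEM (or the configured service account), not your own login.
running_as = getpass.getuser()
user_env = os.environ.get("USERNAME")
open(r"C:\temp\whoami.txt", "w").write("%s %s" % (running_as, user_env))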
