I am running the docker image for snappydata v0.9. From inside that image, I can run queries against the database. However, I cannot do so from a second server on my machine.
I copied the python files from snappydata to the installed pyspark (editing snappysession to SnappySession in the imports) and (based on the answer to Unable to connect to snappydata store with spark-shell command), I wrote the following script (it is a bit of cargo-cult programming as I was copying from the python code in the docker image -- suggestions to improve it are welcome):
import pyspark
from pyspark.context import SparkContext
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.snappy import SnappyContext
from pyspark.storagelevel import StorageLevel
SparkContext._ensure_initialized()
spark = SparkSession.builder.appName("test") \
.master("local[*]") \
.config("snappydata.store.locators", "localhost:10034") \
.getOrCreate()
spark.sql("SELECT col1, min(col2) from TABLE1")
However, I get a traceback with:
pyspark.sql.utils.AnalysisException: u'Table or view not found: TABLE1
I have verified with wireshark that my program is communicating with the docker image (TCP follow stream shows the traceback message and a scala traceback). My assumption is that the permissions in the snappydata cluster is set wrong, but grepping through the logs and configuration did not show anything obvious.
How can I proceed?
-------- Edit 1 ------------
The new code that I am running (still getting the same error), incorporating the suggestions for the change in the config and ensuring that I get a SnappySession is:
from pyspark.sql.snappy import SnappySession
snappy = SnappySession.builder.appName("test") \
.master("local[*]") \
.config("spark.snappydata.connection", "localhost:1527") \
.getOrCreate()
snappy.sql("SELECT col1, min(col2) from TABLE1")
Can you change your config to the following -
.config("spark.snappydata.connection", "localhost:1527")
The 'snappydata.store.locators' property is no more there in 0.9.
You can refer the docs here - https://github.com/SnappyDataInc/snappydata/blob/master/docs/deployment.md#connectormode
Also, you need to create a SnappySession to access the Snappy managed Tables.
Something like this ....
spark = SparkSession.builder.appName("test") \
.master("local[*]") \
.config("spark.snappydata.connection", "localhost:1527") \
.getOrCreate()
snappy = SnappySession(spark)
snappy.sql("SELECT col1, min(col2) from TABLE1")
Related
Helllo guys,
Im running jupyter on serveur. I configure the serveur and did jupyter notebook --ip 0.0.0.0 --no-browser --allow-root
I did this when creating my container to use only 7GO:
sudo docker run -it --memory="7g" jupyter
I created a notebook, containing only these rows
import os
import sys
import pandas as pd
from src.connection import create_postgres_engine
postgres_connection = create_postgres_engine('prod')
query1 = """
select
id, col1, date, col2, col3
from s inner join c on s.id = c.id_1
and s.date BETWEEN c.date_beg AND date_end where date >='2020-01-04' and due_date <'2020-04-08';
"""
query1 = pd.read_sql_query(query1, postgres_connection[0], coerce_float=False)
After 3, 4 minutes i got this error
kernel seems to crash, it will restart automatically.
Please who can help me?
I spend 2 days on it, and didn't got what happens.
I'm using pyspark to make some sql queries to a parquet file. I need to use multiple cores, but i didn't find any useful information. Here's the code i'm using. As you can see i set to 3 the number of cores, but when i run the script, i can see on htop that there's only 1 core in use. How can i solve this?
from pyspark.sql import SparkSession
from pyspark.sql.types import *
spark = SparkSession \
.builder \
.appName("Python Spark SQL tests") \
.config("spark.executor.cores", 3) \
.getOrCreate()
# Check conf
for item in spark.sparkContext.getConf().getAll():
print(item)
# Open file and create dataframe
filename = "gs://path/to/file.parquet"
df = spark.read.parquet(filename)
# Create table
df.createOrReplaceTempView("myTable")
# Query
sqlDF = spark.sql("SELECT * FROM myTable")
sqlDF.show()
From my understanding you use spark standalone(only in your machine not a cluster).
try:
from pyspark import sql
spark = (
sql.SparkSession.builder.master("local[*]")
.config("spark.executor.memory", "32g")
.config("spark.driver.memory", "32g")
...
.getOrCreate()
)
* in local[*] means use all avaliable cores. You can give a number like local[3].
I'm trying to connect to hive using Python. I installed all of the dependencies required (sasl, thrift_sasl, etc..)
Here is how I try to connect:
configuration = {"hive.server2.authentication.kerberos.principal" : "hive/_HOST#REALM_HOST", "hive.server2.authentication.kerberos.keytab" : "/etc/security/keytabs/hive.service.keytab"}
connection = hive.Connection(configuration = configuration, host="host", port=port, auth="KERBEROS", kerberos_service_name = "hiveserver2")
But I get this error:
Minor code may provide more information (Cannot find KDC for realm "REALM_DOMAIN")
Whay I'm missing? Does someone has an example of an pyHive connection using kerberos?
Thank you for your help.
Thank you #Kishore.
Actually in PySpark, the code looks like this :
import pyspark
from pyspark import SparkContext
from pyspark.sql import Row
from pyspark import SparkConf
from pyspark.sql import HiveContext
from pyspark.sql import functions as F
import pyspark.sql.types as T
def connection(self):
conf = pyspark.SparkConf()
conf.setMaster('yarn-client')
sc = pyspark.SparkContext(conf=conf)
self.cursor = HiveContext(sc)
self.cursor.setConf("hive.exec.dynamic.partition", "true")
self.cursor.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
self.cursor.setConf("hive.warehouse.subdir.inherit.perms", "true")
self.cursor.setConf('spark.scheduler.mode', 'FAIR')
and you can request using :
rows = self.cursor.sql("SELECT someone FROM something")
for row in rows.collect():
print row
I'm actually running the code via the command :
spark-submit --master yarn MyProgram.py
I guess you could using basically run the python with pyspark installed like :
python MyProgram.py
but I didn't tried so I won't assure that it's working
I don't know in pyspark, but I am using below scala code and it is working since last one year. If you can change this code in python. Replace the value of properties based on your kerberos.
System.setProperty("hive.metastore.uris", "add hive.metastore.uris url");
System.setProperty("hive.metastore.sasl.enabled", "true")
System.setProperty("hive.metastore.kerberos.keytab.file", "add keytab")
System.setProperty("hive.security.authorization.enabled", "false")
System.setProperty("hive.metastore.kerberos.principal", "replace hive.metastore.kerberos.principal value")
System.setProperty("hive.metastore.execute.setugi", "true")
val hiveContext = new HiveContext(sparkContext)
I have a python script which is dependent on another file, which is also essential for other scripts, so i have zipped it and shipped it to run as a spark-submit job, but unfortunately it seems not to be working, here is my code snippet and the error i'm getting all the time
from pyspark import SparkConf, SparkContext
from pyspark.sql.session import SparkSession
def main(spark):
employee = spark.read.json("/storage/hadoop/hadoop-3.0.0/bin/employees.json")
# employee = spark.read.json("/storage/hadoop/hadoop-3.0.0/bin/employee.json")
employee.printSchema()
employee.show()
people = spark.read.json("/storage/hadoop/hadoop-3.0.0/bin/people.json")
people.printSchema()
people.show()
employee.createOrReplaceTempView("employee")
people.createOrReplaceTempView("people")
newDataFrame = employee.join(people,(employee.name==people.name),how="inner")
newDataFrame.distinct().show()
return "Hello I'm Done Processing the Operation"
which is the external dependencies called by other modules as well, and here is another script which is trying to execute the file
from pyspark import SparkConf, SparkContext
from pyspark.sql.session import SparkSession
def sampleTest(output):
print output
if __name__ == "__main__":
#Application Name for the Spark RDD using Python
# APP_NAME = "Spark Application"
spark = SparkSession \
.builder \
.appName("Spark Application") \
.config("spark.master", "spark://192.168.2.3:7077") \
.getOrCreate()
# .config() \
import SparkFileMerge
abc = SparkFileMerge.main(spark)
sampleTest(abc)
now when i'm executing the command
./spark-submit --py-files /home/varun/SparkPythonJob.zip /home/varun/main.py
it is giving me the following error.
Traceback (most recent call last):
File "/home/varun/main.py", line 18, in <module>
from SparkFileMerge import SparkFileMerge
ImportError: No module named SparkFileMerge
any help will be highly appreciated.
What composes SparkPythonJob.zip ?
First, you should check that the first code snippet is actually in a file called SparkFileMerge.py.
I wrote a simple Flask app to pass some data to Spark. The script works in IPython Notebook, but not when I try to run it in it's own server. I don't think that the Spark context is running within the script. How do I get Spark working in the following example?
from flask import Flask, request
from pyspark import SparkConf, SparkContext
app = Flask(__name__)
conf = SparkConf()
conf.setMaster("local")
conf.setAppName("SparkContext1")
conf.set("spark.executor.memory", "1g")
sc = SparkContext(conf=conf)
#app.route('/accessFunction', methods=['POST'])
def toyFunction():
posted_data = sc.parallelize([request.get_data()])
return str(posted_data.collect()[0])
if __name__ == '__main_':
app.run(port=8080)
In IPython Notebook I don't define the SparkContext because it is automatically configured. I don't remember how I did this, I followed some blogs.
On the Linux server I have set the .py to always be running and installed the latest Spark by following up to step 5 of this guide.
Edit:
Following the advice by davidism I have now instead resorted to simple programs with increasing complexity to localise the error.
Firstly I created .py with just the script from the answer below (after appropriately adjusting the links):
import sys
try:
sys.path.append("your/spark/home/python")
from pyspark import context
print ("Successfully imported Spark Modules")
except ImportError as e:
print ("Can not import Spark Modules", e)
This returns "Successfully imported Spark Modules". However, the next .py file I made returns an exception:
from pyspark import SparkContext
sc = SparkContext('local')
rdd = sc.parallelize([0])
print rdd.count()
This returns exception:
"Java gateway process exited before sending the driver its port number"
Searching around for similar problems I found this page but when I run this code nothing happens, no print on the console and no error messages. Similarly, this did not help either, I get the same Java gateway exception as above. I have also installed anaconda as I heard this may help unite python and java, again no success...
Any suggestions about what to try next? I am at a loss.
Okay, so I'm going to answer my own question in the hope that someone out there won't suffer the same days of frustration! It turns out it was a combination of missing code and bad set up.
Editing the code:
I did indeed need to initialise a Spark Context by appending the following in the preamble of my code:
from pyspark import SparkContext
sc = SparkContext('local')
So the full code will be:
from pyspark import SparkContext
sc = SparkContext('local')
from flask import Flask, request
app = Flask(__name__)
#app.route('/whateverYouWant', methods=['POST']) #can set first param to '/'
def toyFunction():
posted_data = sc.parallelize([request.get_data()])
return str(posted_data.collect()[0])
if __name__ == '__main_':
app.run(port=8080) #note set to 8080!
Editing the setup:
It is essential that the file (yourrfilename.py) is in the correct directory, namely it must be saved to the folder /home/ubuntu/spark-1.5.0-bin-hadoop2.6.
Then issue the following command within the directory:
./bin/spark-submit yourfilename.py
which initiates the service at 10.0.0.XX:8080/accessFunction/ .
Note that the port must be set to 8080 or 8081: Spark only allows web UI for these ports by default for master and worker respectively
You can test out the service with a restful service or by opening up a new terminal and sending POST requests with cURL commands:
curl --data "DATA YOU WANT TO POST" http://10.0.0.XX/8080/accessFunction/
I was able to fix this problem by adding the location of PySpark and py4j to the path in my flaskapp.wsgi file. Here's the full content:
import sys
sys.path.insert(0, '/var/www/html/flaskapp')
sys.path.insert(1, '/usr/local/spark-2.0.2-bin-hadoop2.7/python')
sys.path.insert(2, '/usr/local/spark-2.0.2-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip')
from flaskapp import app as application
Modify your .py file as it is shown in the linked guide 'Using IPython Notebook with Spark' part second point. Insted sys.path.insert use sys.path.append. Try insert this snippet:
import sys
try:
sys.path.append("your/spark/home/python")
from pyspark import context
print ("Successfully imported Spark Modules")
except ImportError as e:
print ("Can not import Spark Modules", e)