I wrote a simple Flask app to pass some data to Spark. The script works in IPython Notebook, but not when I try to run it on its own server. I don't think that the Spark context is running within the script. How do I get Spark working in the following example?
from flask import Flask, request
from pyspark import SparkConf, SparkContext
app = Flask(__name__)
conf = SparkConf()
conf.setMaster("local")
conf.setAppName("SparkContext1")
conf.set("spark.executor.memory", "1g")
sc = SparkContext(conf=conf)
@app.route('/accessFunction', methods=['POST'])
def toyFunction():
    posted_data = sc.parallelize([request.get_data()])
    return str(posted_data.collect()[0])

if __name__ == '__main__':
    app.run(port=8080)
In IPython Notebook I don't define the SparkContext because it is automatically configured. I don't remember exactly how I did this; I followed some blogs.
On the Linux server I have set the .py to always be running and installed the latest Spark by following up to step 5 of this guide.
Edit:
Following the advice by davidism I have now instead resorted to simple programs with increasing complexity to localise the error.
Firstly I created a .py file with just the script from the answer below (after appropriately adjusting the links):
import sys
try:
    sys.path.append("your/spark/home/python")
    from pyspark import context
    print("Successfully imported Spark Modules")
except ImportError as e:
    print("Can not import Spark Modules", e)
This returns "Successfully imported Spark Modules". However, the next .py file I made returns an exception:
from pyspark import SparkContext
sc = SparkContext('local')
rdd = sc.parallelize([0])
print(rdd.count())
This returns the exception:
"Java gateway process exited before sending the driver its port number"
Searching around for similar problems I found this page, but when I run that code nothing happens: no print on the console and no error messages. Similarly, this did not help either; I get the same Java gateway exception as above. I have also installed Anaconda, as I heard this may help unite Python and Java, but again no success...
Any suggestions about what to try next? I am at a loss.
Okay, so I'm going to answer my own question in the hope that someone out there won't suffer the same days of frustration! It turns out it was a combination of missing code and a bad setup.
Editing the code:
I did indeed need to initialise a Spark Context by adding the following to the preamble of my code:
from pyspark import SparkContext
sc = SparkContext('local')
So the full code will be:
from pyspark import SparkContext
sc = SparkContext('local')

from flask import Flask, request
app = Flask(__name__)

@app.route('/whateverYouWant', methods=['POST'])  # can set first param to '/'
def toyFunction():
    posted_data = sc.parallelize([request.get_data()])
    return str(posted_data.collect()[0])

if __name__ == '__main__':
    app.run(port=8080)  # note: set to 8080!
Editing the setup:
It is essential that the file (yourfilename.py) is in the correct directory, namely it must be saved to the folder /home/ubuntu/spark-1.5.0-bin-hadoop2.6.
Then issue the following command within the directory:
./bin/spark-submit yourfilename.py
which initiates the service at 10.0.0.XX:8080/accessFunction/.
Note that the port must be set to 8080 or 8081: Spark only allows a web UI on these ports by default, for the master and worker respectively.
You can test out the service with a REST client, or by opening up a new terminal and sending POST requests with cURL:
curl --data "DATA YOU WANT TO POST" http://10.0.0.XX:8080/accessFunction/
I was able to fix this problem by adding the location of PySpark and py4j to the path in my flaskapp.wsgi file. Here's the full content:
import sys
sys.path.insert(0, '/var/www/html/flaskapp')
sys.path.insert(1, '/usr/local/spark-2.0.2-bin-hadoop2.7/python')
sys.path.insert(2, '/usr/local/spark-2.0.2-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip')
from flaskapp import app as application
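As a quick sanity check (the paths below are copied from the .wsgi above, so adjust them to your own Spark install and py4j version), you can confirm the interpreter actually finds pyspark from those locations before wiring them into the .wsgi:
import sys

# same paths as in flaskapp.wsgi -- adjust to your Spark install and py4j version
sys.path.insert(0, '/usr/local/spark-2.0.2-bin-hadoop2.7/python')
sys.path.insert(1, '/usr/local/spark-2.0.2-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip')

from pyspark import SparkContext  # raises ImportError if the paths are wrong
print("pyspark import OK")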
Modify your .py file as shown in the second point of the linked guide 'Using IPython Notebook with Spark'. Instead of sys.path.insert, use sys.path.append. Try inserting this snippet:
import sys
try:
    sys.path.append("your/spark/home/python")
    from pyspark import context
    print("Successfully imported Spark Modules")
except ImportError as e:
    print("Can not import Spark Modules", e)
I am following the testdriven.io Test-Driven Development with FastAPI and Docker tutorial and I am stuck on the Pytest setup step. I have checked over and over again to see what I am missing, and keep coming up short.
The code sample from the tutorial shows that, in conftest.py, you are to have the following import statements:
from app import main
from app.config import get_settings, Settings
For starters, PyCharm is telling me that it is unable to import anything from the above.
My Folder Structure:
main.py:
import os

from fastapi import FastAPI, Depends
from tortoise.contrib.fastapi import register_tortoise

from .config import get_settings, Settings

app = FastAPI()

register_tortoise(
    app,
    db_url=os.environ.get("DATABASE_URL"),
    modules={"models": ["app.models.tortoise"]},
    generate_schemas=False,
    add_exception_handlers=True,
)

@app.get("/ping")
async def pong(settings: Settings = Depends(get_settings)):
    return {"ping": "pong", "environment": settings.environment, "testing": settings.testing}
conftest.py:
import os

import pytest
from starlette.testclient import TestClient

from app import main
from app.config import get_settings, Settings

def get_settings_override():
    return Settings(testing=1, database_url=os.environ.get("DATABASE_TEST_URL"))

@pytest.fixture(scope="module")
def test_app():
    # set up
    main.app.dependency_overrides[get_settings] = get_settings_override
    with TestClient(main.app) as test_client:
        # testing
        yield test_client
    # tear down
The tutorial has you run the tests using docker-compose exec web python -m pytest
This is the output I get when running the tests:
Any help would be appreciated. I feel like this is entry level stuff that is causing an extreme headache.
Thanks to @MatsLindh for the help. As he mentioned in his comments above, the tutorial has you running pytest on the entire project instead of just the tests folder. Running pytest directly on the tests folder solved my issue with it failing. He also gave good advice on getting imports to work correctly in an IDE by suggesting to look at the pytest documentation for further integration steps.
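For reference, the invocation that ended up working for me was roughly this (the tests path comes from the tutorial's project layout, so adjust it if yours differs):
docker-compose exec web python -m pytest tests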
I am trying to set up a Flask web app using Elastic Beanstalk on AWS. I have followed the tutorial here and that works fine. I am now looking to expand the Flask web app, and this works fine until I import scipy.spatial as spatial; when this is part of my import statements, running eb open just times out. I receive
>>>> HTTP ERROR 504
Running the web app locally works absolutely fine even with the scipy import; it is only when I try to deploy to Beanstalk that it doesn't want to work. Below is my code:
import os
import requests
from bs4 import BeautifulSoup
import pandas as pd
import scipy.spatial as spatial ##### Removing this and everything works!
from flask import Flask
from flask_cors import CORS
from flask_restful import Resource, Api
from flask_jsonpify import jsonify
# print a nice greeting.
def say_hello(username = "World"):
    df = pd.DataFrame({"a":[1,2,3]})
    return '<p>Hello %s!</p>\n' % username
# some bits of text for the page.
header_text = '''
<html>\n<head> <title>EB Flask Test</title> </head>\n<body>'''
instructions = '''
<p><em>Hint</em>: This is a RESTful web service! Append a username
to the URL (for example: <code>/Thelonious</code>) to say hello to
someone specific.</p>\n'''
home_link = '<p><a href="/">Back</a></p>\n'
footer_text = '</body>\n</html>'
# EB looks for an 'application' callable by default.
application = Flask(__name__)
# add a rule for the index page.
application.add_url_rule('/', 'index', (lambda: header_text +
    say_hello() + instructions + footer_text))

# add a rule when the page is accessed with a name appended to the site
# URL.
application.add_url_rule('/<username>', 'hello', (lambda username:
    header_text + say_hello(username) + home_link + footer_text))
# run the app.
if __name__ == "__main__":
    # Setting debug to True enables debug output. This line should be
    # removed before deploying a production app.
    application.debug = True
    application.run()
I have tried increasing the command timeout for the environment from 600 to 900, although the timeout error occurs well before 600 seconds have elapsed.
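For reference, one way to set that command timeout is via an .ebextensions config file along these lines; this is a sketch rather than my exact setup, and the file name is arbitrary:
# .ebextensions/increase-timeout.config  (illustrative file name)
option_settings:
  aws:elasticbeanstalk:command:
    Timeout: 900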
Right, I am not sure why this is the case, but I updated the version of scipy in my requirements.txt and the app is working as expected!
Originally I had
scipy==1.4.1
Now I have
scipy==1.2.3
I have no idea why this has fixed the deployment issue, especially given that 1.4.1 works perfectly locally. If anyone has an idea, or if this is a bug I should be reporting, it would be good to know!
I'm trying to connect to Hive using Python. I installed all of the dependencies required (sasl, thrift_sasl, etc.).
Here is how I try to connect:
configuration = {"hive.server2.authentication.kerberos.principal" : "hive/_HOST#REALM_HOST", "hive.server2.authentication.kerberos.keytab" : "/etc/security/keytabs/hive.service.keytab"}
connection = hive.Connection(configuration = configuration, host="host", port=port, auth="KERBEROS", kerberos_service_name = "hiveserver2")
But I get this error:
Minor code may provide more information (Cannot find KDC for realm "REALM_DOMAIN")
What am I missing? Does someone have an example of a PyHive connection using Kerberos?
Thank you for your help.
Thank you @Kishore.
Actually in PySpark, the code looks like this:
import pyspark
from pyspark import SparkContext
from pyspark.sql import Row
from pyspark import SparkConf
from pyspark.sql import HiveContext
from pyspark.sql import functions as F
import pyspark.sql.types as T
def connection(self):
    conf = pyspark.SparkConf()
    conf.setMaster('yarn-client')
    sc = pyspark.SparkContext(conf=conf)
    self.cursor = HiveContext(sc)
    self.cursor.setConf("hive.exec.dynamic.partition", "true")
    self.cursor.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
    self.cursor.setConf("hive.warehouse.subdir.inherit.perms", "true")
    self.cursor.setConf('spark.scheduler.mode', 'FAIR')
and you can run a query using:
rows = self.cursor.sql("SELECT someone FROM something")
for row in rows.collect():
    print(row)
I'm actually running the code via the command:
spark-submit --master yarn MyProgram.py
I guess you could basically run it with plain Python, with pyspark installed, like:
python MyProgram.py
but I haven't tried it, so I can't promise that it works.
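A hedged sketch of that idea, assuming the findspark package is installed and SPARK_HOME points at your Spark installation (I haven't verified this on my setup):
# sketch only: findspark adds the pyspark libraries to sys.path so a plain `python MyProgram.py` can import them
import findspark
findspark.init()  # assumes SPARK_HOME is set

import pyspark
sc = pyspark.SparkContext(conf=pyspark.SparkConf().setMaster('yarn-client'))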
I don't know about pyspark, but I am using the Scala code below and it has been working for the last year. You should be able to change this code to Python. Replace the values of the properties based on your Kerberos setup.
System.setProperty("hive.metastore.uris", "add hive.metastore.uris url");
System.setProperty("hive.metastore.sasl.enabled", "true")
System.setProperty("hive.metastore.kerberos.keytab.file", "add keytab")
System.setProperty("hive.security.authorization.enabled", "false")
System.setProperty("hive.metastore.kerberos.principal", "replace hive.metastore.kerberos.principal value")
System.setProperty("hive.metastore.execute.setugi", "true")
val hiveContext = new HiveContext(sparkContext)
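If you want to mirror those properties from PySpark, one possible (untested) translation is to reuse the HiveContext.setConf pattern from the earlier answer; the values below are placeholders:
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(conf=SparkConf().setMaster('yarn-client'))
hive_ctx = HiveContext(sc)

# placeholder values -- replace with your own metastore/Kerberos settings
hive_ctx.setConf("hive.metastore.uris", "thrift://your-metastore-host:9083")
hive_ctx.setConf("hive.metastore.sasl.enabled", "true")
hive_ctx.setConf("hive.metastore.kerberos.keytab.file", "/path/to/your.keytab")
hive_ctx.setConf("hive.metastore.kerberos.principal", "hive/_HOST@YOUR_REALM")
hive_ctx.setConf("hive.security.authorization.enabled", "false")
hive_ctx.setConf("hive.metastore.execute.setugi", "true")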
In my main.py I have the below code:
app.config.from_object('config.DevelopmentConfig')
In another module I used import main and then used main.app.config['KEY'] to get a parameter, but the Python interpreter says that it couldn't load the module in main.py because of the import part. How can I access config parameters in another module in Flask?
Your structure is not really clear, but from what I can tell, import your configuration object and just pass it to app.config.from_object():
from flask import Flask
from <path_to_config_module>.config import DevelopmentConfig

app = Flask('Project')
app.config.from_object(DevelopmentConfig)

if __name__ == "__main__":
    app.run(host="0.0.0.0")
If your config module is in the same directory as your application module, you can just use:
from .config import DevelopmentConfig
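For completeness, a minimal sketch of what such a config module might contain (the class and attribute names here are illustrative, not taken from your project):
# config.py -- illustrative sketch
class Config:
    DEBUG = False
    TESTING = False

class DevelopmentConfig(Config):
    DEBUG = True

class ProductionConfig(Config):
    pass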
The solution was to put the app initialization in another file (e.g. myapp_init_file.py) in the root:
from flask import Flask
app = Flask(__name__)
# Change this on production environment to: config.ProductionConfig
app.config.from_object('config.DevelopmentConfig')
Now to access config parameters I just need to import this module in different files:
from myapp_init_file import app
Now I have access to my config parameters as below:
app.config['url']
The problem was that I had an import loop and could not run my Python app. With this solution everything works like a charm. ;-)
While I am working at localhost:8080, when I open the interactive console (address: http://localhost:8080/_ah/admin/interactive) and do some operations, like getting a list of a Kind etc., it gives me this error:
<class 'google.appengine.dist._library.UnacceptableVersionError'>: django 1.2 was requested, but 0.96.4.None is already in use
This error has happened several times, in similar cases. It stays stuck until I restart localhost via dev_appserver.py.
Is this a bug, or am I doing something wrong?
An example of what I did at the interactive console:
from myapp.models import *
for room in Room.all():
    room.update_time = room.create_time
    room.put()
Note:
This is my django_bootstrap:
import os
import sys
import logging
import __builtin__
import pickle

from google.appengine.ext.webapp import util

sys.modules['cPickle'] = pickle

logging.getLogger().setLevel(logging.INFO)
sys.path.insert(0, os.path.abspath((os.path.dirname(__file__))))

os.environ['DJANGO_SETTINGS_MODULE'] = 'settings'

from google.appengine.dist import use_library
use_library('django', '1.2')

import django.core.handlers.wsgi

def main():
    application = django.core.handlers.wsgi.WSGIHandler()
    util.run_wsgi_app(application)

if __name__ == '__main__':
    main()
my index.yaml in the root folder says:
# AUTOGENERATED
# This index.yaml is automatically updated whenever the dev_appserver
# detects that a new type of query is run. If you want to manage the
# index.yaml file manually, remove the above marker line (the line
# saying "# AUTOGENERATED"). If you want to manage some indexes
# manually, move them above the marker line. The index.yaml file is
# automatically uploaded to the admin console when you next deploy
# your application using appcfg.py.
Thus each time I open http://localhost:8080/_ah/admin/datastore, this file is updated: it still has the same content, but the file's timestamp on the operating system says it was updated.
I think that here, as http://localhost:8080 sees that models.py is not the same, it reloads it and then cannot start django_bootstrap.
However, if I first open http://localhost:8080/_ah/admin/datastore and then http://localhost:8080, it works. So this is why I sometimes get the error and sometimes not: it depends on the order in which the URLs are opened.