First, let me describe my scenario:
Ubuntu 14.04
Spark 1.6.3
Python 3.5
I'm trying to execute my Python scripts through spark-submit. I need to create a context and then use SQLContext as well.
First, I tested a very simple case in my pyspark console.
Then I created my Python script:
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local")
        .setAppName("My app")
        .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)

numbers = [1, 2, 3, 4, 5, 6]
numbersRDD = sc.parallelize(numbers)
numbersRDD.take(2)
But when I run this through spark-submit, it doesn't seem to work. I never get the results :(
There is no reason you should get any "results". Your script doesn't perform any obvious side effects (printing to stdout, writing to a file) other than standard Spark logging (visible in the output). numbersRDD.take(2) will execute just fine.
If you want some form of output, print it:
print(numbersRDD.take(2))
You should also stop the context before exiting:
sc.stop()
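Putting both suggestions together, a minimal sketch of the adjusted script (same configuration as in the question) would look like this:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local")
        .setAppName("My app")
        .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)

numbers = [1, 2, 3, 4, 5, 6]
numbersRDD = sc.parallelize(numbers)
print(numbersRDD.take(2))   # writes [1, 2] to stdout so spark-submit shows it

sc.stop()                   # stop the context before exiting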
Related
I'm trying to learn Spark together with Python on a Win10 virtual machine. For that, I'm trying to read data from a CSV file with PySpark, but it stops at the following:
C:\Users\israel\AppData\Local\Programs\Python\Python37\python.exe
C:/Users/israel/Desktop/airbnb_python/src/main/python/spark_python/airbnb.py
hello world1
The system cannot find the path specified
I have read How to link PyCharm with PySpark?, PySpark, Win10 - The system cannot find the path specified,
The system cannot find the path specified error while running pyspark, and PySpark - The system cannot find the path specified, but haven't had any luck implementing the solutions.
I'm using IntelliJ and Python 3.7. The code is as follows:
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.types import *

if __name__ == "__main__":
    print("hello world1")
    spark = SparkSession \
        .builder \
        .appName("spark_python") \
        .master("local") \
        .getOrCreate()
    print("hello world2")
    path = "C:\\Users\\israel\\Desktop\\data\\listings.csv"
    df = spark.read \
        .format("csv") \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .load(path)
    df.show()
    spark.stop()
It seems like the error is in the SparkSession, but I don't see how the reported error is related to that line. It is worth mentioning that the execution never ends; I have to manually stop it in order to rerun. Can anyone shed some light on what I'm doing wrong? Please.
I'm sure this is not the best solution, but one approach would be to launch your Python interpreter directly from the pyspark binary.
This can be located in:
$SPARK_HOME\bin\pyspark
Additionally, if you modify your environment variables while any terminals are active, the variables are not refreshed until the next launch. This applies to PyCharm and IntelliJ too. If you haven't tried it, a restart of the IDE may also help.
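This Windows error usually traces back to Spark/Hadoop-related environment variables. As a sketch only (the paths below are placeholders for your own installation, and this assumes pyspark was installed via pip), the relevant variables can also be set from inside the script before the session is built:

import os, sys

# Placeholder paths; adjust to your own Spark/Hadoop installation.
os.environ.setdefault("SPARK_HOME", r"C:\spark\spark-2.4.0-bin-hadoop2.7")
os.environ.setdefault("HADOOP_HOME", r"C:\hadoop")   # its bin\ should contain winutils.exe
os.environ["PYSPARK_PYTHON"] = sys.executable        # use the same Python 3.7 interpreter

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark_python").master("local").getOrCreate()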
If the error message is written via sys.stderr:
This is not a direct answer to the question, but I noticed what you said: "but I don't see how the announced error is related to that line".
So I want to show you a way to debug and find the location of the code that generated this message.
According to the image of your airbnb run (the first one), the error message is El sistema no puede encontrar la ruta especificada ("The system cannot find the path specified").
It looks like this was written via sys.stderr.
So my method is to redirect sys.stderr, like the following:
import sys

def the_process():
    ...
    sys.stderr.write('error message')

class RedirectStdErr:
    def write(self, msg: str):
        if msg == 'error message':
            set_debug_point_at_here = 1   # put a breakpoint on this line
        original.write(msg)
        original.flush()

original = sys.stderr            # keep a handle on the real stderr
sys.stderr = RedirectStdErr()    # everything written to stderr now passes through write()
the_process()
As long as you set a breakpoint on the line set_debug_point_at_here = 1, you can find out where this message is actually being written from.
The following is my PySpark startup snippet, which is pretty reliable (I've been using it a long time). Today I added the two Maven coordinates shown in the spark.jars.packages option (effectively "plugging in" Kafka support). Normally, that triggers dependency downloads (performed by Spark automatically):
import sys, os, multiprocessing
from pyspark.sql import DataFrame, DataFrameStatFunctions, DataFrameNaFunctions
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import functions as sFn
from pyspark.sql.types import *
from pyspark.sql.types import Row
# ------------------------------------------
# Note: Row() in .../pyspark/sql/types.py
# isn't included in '__all__' list(), so
# we must import it by name here.
# ------------------------------------------

num_cpus = multiprocessing.cpu_count()        # Number of CPUs for SPARK Local mode.
os.environ.pop('SPARK_MASTER_HOST', None)     # Since we're using pip/pySpark these three ENVs
os.environ.pop('SPARK_MASTER_PORT', None)     # aren't needed; and we ensure pySpark doesn't
os.environ.pop('SPARK_HOME', None)            # get confused by them, should they be set.
os.environ.pop('PYTHONSTARTUP', None)         # Just in case pySpark 2.x attempts to read this.
os.environ['PYSPARK_PYTHON'] = sys.executable # Make SPARK Workers use same Python as Master.
os.environ['JAVA_HOME'] = '/usr/lib/jvm/jre'  # Oracle JAVA for our pip/python3/pySpark 2.4 (CDH's JRE won't work).

JARS_IVY_REPO = '/home/jdoe/SPARK.JARS.REPO.d/'

# ======================================================================
# Maven Coordinates for JARs (and their dependencies) needed to plug
# extra functionality into Spark 2.x (e.g. Kafka SQL and Streaming).
# A one-time internet connection is necessary for Spark to automatically
# download JARs specified by the coordinates (and their dependencies).
# ======================================================================
spark_jars_packages = ','.join(['org.apache.spark:spark-streaming-kafka-0-10_2.11:2.4.0',
                                'org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0'])
# ======================================================================

spark_conf = SparkConf()
spark_conf.setAll([('spark.master', 'local[{}]'.format(num_cpus)),
                   ('spark.app.name', 'myApp'),
                   ('spark.submit.deployMode', 'client'),
                   ('spark.ui.showConsoleProgress', 'true'),
                   ('spark.eventLog.enabled', 'false'),
                   ('spark.logConf', 'false'),
                   ('spark.jars.repositories', 'file:/' + JARS_IVY_REPO),
                   ('spark.jars.ivy', JARS_IVY_REPO),
                   ('spark.jars.packages', spark_jars_packages)])

spark_sesn         = SparkSession.builder.config(conf=spark_conf).getOrCreate()
spark_ctxt         = spark_sesn.sparkContext
spark_reader       = spark_sesn.read
spark_streamReader = spark_sesn.readStream
spark_ctxt.setLogLevel("WARN")
However, the packages aren't downloaded and/or loaded when I run the snippet (e.g. ./python -i init_spark.py), as they should be.
This mechanism used to work, but then stopped. What am I missing?
Thank you in advance!
This is the kind of post where the QUESTION will be worth more than the ANSWER, because the code above works but isn't anywhere to be found in Spark 2.x documentation or examples.
The above is how I've programmatically added functionality to Spark 2.x by way of Maven Coordinates. I had this working but then it stopped working. Why?
When I ran the above code in a Jupyter notebook, the notebook had already (behind the scenes) run that identical code snippet by way of my PYTHONSTARTUP script. That PYTHONSTARTUP script has the same code as the above, but intentionally omits the Maven coordinates.
Here, then, is how this subtle problem emerges:
spark_sesn = SparkSession.builder.config(conf = spark_conf).getOrCreate()
Because a Spark Session already existed, the above statement simply reused that existing session (.getOrCreate()), which did not have the jars/libraries loaded (again, because my PYTHONSTARTUP script intentionally omits them). This is why it is a good idea to put print statements in PYTHONSTARTUP scripts (which are otherwise silent).
In the end, I simply forgot to do this: $ unset PYTHONSTARTUP before starting the JupyterLab / Notebook daemon.
I hope the Question helps others because that's how to programmatically add functionality to Spark 2.x (in this case Kafka). Note that you'll need an internet connection for the one-time download of the specified jars and recursive dependencies from Maven Central.
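As a quick sanity check (a sketch, not part of the original snippet), you can verify after getOrCreate() whether the packages setting actually reached the live SparkContext; if a pre-existing session was reused, the value comes back empty:

from pyspark.sql import SparkSession

# spark_conf is the SparkConf built in the snippet above.
spark_sesn = SparkSession.builder.config(conf=spark_conf).getOrCreate()

# If this prints None (or not the coordinates set above), getOrCreate()
# reused an already-running session and the new config was ignored.
print(spark_sesn.sparkContext.getConf().get('spark.jars.packages'))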
I am able to create, drop, and modify tables using pyspark and HiveContext. I load a list with the commands I want to send, in string format, and pass them into this function:
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

def hiveCommands(commands, database):
    conf = SparkConf().setAppName(database + 'project').setMaster('local')
    sc = SparkContext(conf=conf)
    df = HiveContext(sc)
    f = df.sql('use ' + database)
    for command in commands:
        f = df.sql(command)
        f.collect()
It works fine for maintenance, but I'm trying to dip my toes into analysis, and I don't see any output when I try to send a command like "describe table."
I just see that it takes in the command and executes it without any errors, but I don't see what the actual output of the query is. I may need to mess with my .profile or .bashrc, not really sure. Something of a Linux newbie here. Any help would be appreciated.
Call the show method to see the results:

for command in commands:
    df.sql(command).show()
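Putting it together, a minimal sketch of the adjusted function (the database and table names in the example call are placeholders):

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

def hiveCommands(commands, database):
    conf = SparkConf().setAppName(database + 'project').setMaster('local')
    sc = SparkContext(conf=conf)
    df = HiveContext(sc)
    df.sql('use ' + database)
    for command in commands:
        df.sql(command).show()   # print the query result instead of discarding it

hiveCommands(['describe my_table'], 'my_db')   # example call; placeholder names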
I have a metrics.py which calculates a graph.
I can call it in the terminal command line (python ./metrics.py -i [input] [output]).
I want to write a function in Spark that calls the metrics.py script on a provided file path and collects the values that metrics.py prints out.
How can I do that?
In order to run metrics.py, you essentially need to ship it to all the executor nodes that run your Spark job.
To do this, you either pass it via SparkContext -
sc = SparkContext(conf=conf, pyFiles=['path_to_metrics.py'])
or pass it later using the Spark Context's addPyFile method -
sc.addPyFile('path_to_metrics.py')
In either case, after that, don't forget to import metrics and then just call the function that gives the needed output.
import metrics
metrics.relevant_function()
Also make sure all the Python libraries imported inside metrics.py are installed on all executor nodes. Otherwise, take care of them using the --py-files and --jars options when spark-submitting your job.
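A minimal sketch of how this could look end to end, assuming metrics.py exposes a function such as compute(path) that returns the values it would otherwise print (the function name and the input paths below are hypothetical):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('metrics-runner').setMaster('local[*]')
sc = SparkContext(conf=conf)
sc.addPyFile('path_to_metrics.py')              # ship the module to every executor

def run_metrics(path):
    import metrics                              # imported on the executor
    return metrics.compute(path)                # hypothetical function inside metrics.py

paths = ['/data/graph1.txt', '/data/graph2.txt']   # example input paths
results = sc.parallelize(paths).map(run_metrics).collect()
print(results)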
I have composed an ArcPy script which is run via the Windows scheduler. The same script is loaded into a script tool so a user can run the process manually. I've used GetParameterAsText, with or's and not's, to hard-wire the standard variables if they are not specified:
ReportFolder = arcpy.GetParameterAsText(0)
if ReportFolder == '#' or not ReportFolder:
    ReportFolder = "C:\\Data\\GIS"
The process runs and, while doing so, writes to a text file log, for example:
txtFile.write("= For ArcGIS 10.3.1: Date: " + str(timed))
txtFile.write('\n')
I'd like to record which method was used to execute the script: was it via the Windows scheduler, by the script tool via ArcGIS, or by a Python client like PyScripter?
Is anyone aware of some form of OS environment value that can be queried from Python for this?