I'm trying to learn Spark together with Python on a Win10 virtual machine. To do that, I'm trying to read data from a CSV file with PySpark, but it stops at the following:
C:\Users\israel\AppData\Local\Programs\Python\Python37\python.exe
C:/Users/israel/Desktop/airbnb_python/src/main/python/spark_python/airbnb.py
hello world1
The system cannot find the path specified
I have read How to link PyCharm with PySpark?, PySpark, Win10 - The system cannot find the path specified,
The system cannot find the path specified error while running pyspark, and PySpark - The system cannot find the path specified, but I haven't had any luck implementing the solutions.
I'm using IntelliJ with Python 3.7. This is the run configuration. The code is as follows:
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.types import *

if __name__ == "__main__":
    print("hello world1")

    spark = SparkSession \
        .builder \
        .appName("spark_python") \
        .master("local") \
        .getOrCreate()

    print("hello world2")

    path = "C:\\Users\\israel\\Desktop\\data\\listings.csv"
    df = spark.read \
        .format("csv") \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .load(path)

    df.show()
    spark.stop()
It seems like the error comes from the SparkSession, but I don't see how the reported error is related to that line. It is worth mentioning that the execution never ends; I have to stop it manually before I can rerun it. Can anyone shed some light on what I'm doing wrong? Please.
I'm sure this is not the best solution, but one approach would be to launch your Python interpreter directly from the pyspark binary.
This can be located in:
$SPARK_HOME\bin\pyspark
Additionally, if you modify your environment variables while any terminals are open, the variables are not refreshed until the next launch. This applies to PyCharm too, so if you haven't tried it yet, a restart of PyCharm may also help.
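If you want to confirm what the interpreter actually sees, a quick sanity check like the following can help (a minimal sketch; which variables matter depends on your setup, SPARK_HOME and HADOOP_HOME being the usual suspects):
import os

# Print the environment variables Spark typically depends on; adjust the list as needed.
for var in ("SPARK_HOME", "HADOOP_HOME", "JAVA_HOME", "PYSPARK_PYTHON"):
    print(var, "=", os.environ.get(var, "<not set>"))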
If the error message is written with sys.stderr
This is not a direct answer to your question, but I noticed what you said: but I don't see how the announced error is related to that line...
So I want to show you a debugging technique for finding the location of the code that generated this message.
According to the image of your airbnb run (the first one), the error message is El sistema no puede encontrar la ruta especificada ("The system cannot find the path specified").
It looks like this was written via sys.stderr, so my method is to redirect sys.stderr, like the following:
import sys

def the_process():
    ...
    sys.stderr.write('error message')

class RedirectStdErr:
    def write(self, msg: str):
        if msg == 'error message':
            set_debug_point_at_here = 1  # put a breakpoint on this line
        original.write(msg)
        original.flush()

    def flush(self):  # delegate flush so callers that flush stderr don't break
        original.flush()

original = sys.stderr
sys.stderr = RedirectStdErr()
the_process()
As long as you set a breakpoint on the set_debug_point_at_here = 1 line, you can walk up the call stack when it is hit and find out where this message is really being written from.
I want to run Python code as a COM server. Eventually I want to run an RTD server available here. But first I want to know exactly what you have to do to get any COM server running. So let's focus on this example.
class HelloWorld:
    _reg_clsid_ = "{7CC9F362-486D-11D1-BB48-0000E838A65F}"
    _reg_desc_ = "Python Test COM Server"
    _reg_progid_ = "Python.TestServer"
    _public_methods_ = ['Hello']
    _public_attrs_ = ['softspace', 'noCalls']
    _readonly_attrs_ = ['noCalls']

    def __init__(self):
        self.softspace = 1
        self.noCalls = 0

    def Hello(self, who):
        self.noCalls = self.noCalls + 1
        # insert "softspace" number of spaces
        return "Hello" + " " * self.softspace + who

if __name__ == '__main__':
    import win32com.server.register
    win32com.server.register.UseCommandLine(HelloWorld)
Ok, this works in the sense that there were no errors and the server is registered, hence it is available in the HKEY_CLASSES_ROOT registry. But what can I do with this? Some say you have to compile an instance and have a .dll or .exe file. What else do I have to do?
Well, I ran your example. The registry key for the server is at:
HKEY_CLASSES_ROOT\WOW6432Node\CLSID\{7CC9F362-486D-11D1-BB48-0000E838A65F}
It has two subkeys... one for LocalServer32 and one for InProcServer32
I created a simple VBA macro in Excel:
Sub d()
    Set obj = CreateObject("Python.TestServer")
    MsgBox obj.Hello("joe")
End Sub
Macro ran just fine. My version of Excel is 64-bit. I ran the macro and then fired up Task Manager while the message box was being displayed. I could see pythonw.exe running in the background.
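The same call also works from a plain Python client via pywin32, in case you'd rather test it without Excel (a minimal sketch; it assumes pywin32 is installed and the server above has already been registered):
import win32com.client

# "Python.TestServer" is the ProgID from _reg_progid_ in the server class above.
obj = win32com.client.Dispatch("Python.TestServer")
print(obj.Hello("joe"))  # should print "Hello joe"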
The only difference between my Python script and yours is probably the name, and also that I added a print line to make sure I was executing the function:
if __name__ == '__main__':
    import win32com.server.register
    print("Going to register...")
    win32com.server.register.UseCommandLine(HelloWorld)
When I ran the 64-bit cscript.exe test, it worked... as expected... when I ran the 32-bit version it failed.
I know why...sort of...
The registry entry for InProcServer32 is pythoncom36.dll
That's no good. It is an incomplete path. I tried modifying the PATH variable in my shell to add one of the 3 places where the DLL existed on my system, but it didn't work. I also tried hard-coding the full path in the InProcServer32 key. That didn't work either; it kept saying it couldn't find the file.
I ran procmon, and then I observed that it couldn't load vcruntime140.dll. I found the directory under Python where those files were and added it to my PATH. It got further along. If I cared enough, I might try more. Eventually, using procmon, I could find all the problems. But you can do that.
My simple solution was to rename the key InProcServer32 for the CLSID to be _InProcServer32. How does that work? Well, the system can't find InProcServer32 so it always uses LocalServer32--for 32-bit and 64-bit processes. If you need the speed of in process then you'd need to fix the problem by using procmon and being relentless until you solved all the File Not Found errors and such. But, if you don't need the speed of in process, then just using the LocalServer32 might solve the problem.
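If you want to see what those keys currently point at before renaming anything, something like this works from Python (a sketch using the CLSID from the question; winreg is in the standard library):
import winreg

# CLSID of the HelloWorld server, as shown in the registry path above (WOW6432Node view).
clsid = r"WOW6432Node\CLSID\{7CC9F362-486D-11D1-BB48-0000E838A65F}"

for server in ("LocalServer32", "InProcServer32"):
    try:
        with winreg.OpenKey(winreg.HKEY_CLASSES_ROOT, clsid + "\\" + server) as key:
            print(server, "=", winreg.QueryValueEx(key, "")[0])  # default value of the key
    except FileNotFoundError:
        print(server, "is not present")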
Caveats: I'm using an Anaconda distro that my employer limits access to, and I can only install it from the employee store. YMMV.
This question is focused on Windows + LibreOffice + Python 3.
I've installed LibreOffice (6.3.4.2), and also ran
pip install unoconv and pip install unotools (pip install uno is a different, unrelated library), but I still get this error after import uno:
ModuleNotFoundError: No module named 'uno'
More generally, and as an example of use of UNO, how to open a .docx document with LibreOffice UNO and export it to PDF?
I've searched extensively on this for a few days, but I haven't found reproducible sample code that works on Windows:
headless use of soffice.exe: see my question+answer Headless LibreOffice very slow to export to PDF on Windows (6 times slower than on Linux) and the notes on the answer: it "works" with soffice.exe --headless ..., but something closer to a COM interaction (Component Object Model) would be useful for many applications, hence this question here
Related forum post, and LibreOffice: Programming with Python Scripts, but the way uno should be installed on Windows, with Python, is not detailed; also Detailed tutorial regarding LibreOffice to Python macro writing, especially for Calc
I've also tried this (unsuccessfully): Getting python to import uno / pyuno:
import os
os.environ["URE_BOOTSTRAP"] = r"vnd.sun.star.pathname:C:\Program Files\LibreOffice\program\fundamental.ini"
os.environ["PATH"] += r";C:\Program Files\LibreOffice\program"
import uno
In order to interact with LibreOffice, start an instance listening on a socket. I don't use COM much, but I think this is the equivalent of the COM interaction you asked about. This can be done most easily on the command line or using a shell script, but it can also work with a system call using a time delay and subprocess.
chdir "%ProgramFiles%\LibreOffice\program\"
start soffice -accept=socket,host=localhost,port=2002;urp;
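For the subprocess variant mentioned above, a rough sketch would be the following (the install path and the 5-second delay are assumptions; adjust for your machine):
import subprocess
import time

# Assumed install path; adjust as needed.
soffice = r"C:\Program Files\LibreOffice\program\soffice.exe"

# Start the listening instance (same -accept string as the command above),
# then give it a few seconds to open the socket before resolving the UNO bridge.
subprocess.Popen([soffice, "-accept=socket,host=localhost,port=2002;urp;"])
time.sleep(5)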
Next, run the installation of python that comes with LibreOffice, which has uno installed by default.
"C:\Program Files\LibreOffice\program\python.exe"
>> import uno
If instead you are using an installation of Python on Windows that was not shipped with LibreOffice, then getting it to work with UNO is much more difficult, and I would not recommend it unless you enjoy hacking.
Now, here is all the code. In a real project, it's probably best to organize into classes, but this is a simplified version.
import os
import uno
from com.sun.star.beans import PropertyValue
from com.sun.star.task import ErrorCodeIOException    # exception types used below
from com.sun.star.util import CloseVetoException

def createProp(name, value):
    prop = PropertyValue()
    prop.Name = name
    prop.Value = value
    return prop

localContext = uno.getComponentContext()
resolver = localContext.ServiceManager.createInstanceWithContext(
    "com.sun.star.bridge.UnoUrlResolver", localContext)
ctx = resolver.resolve(
    "uno:socket,host=localhost,port=2002;urp;"
    "StarOffice.ComponentContext")
smgr = ctx.ServiceManager
desktop = smgr.createInstanceWithContext(
    "com.sun.star.frame.Desktop", ctx)
dispatcher = smgr.createInstanceWithContext(
    "com.sun.star.frame.DispatchHelper", ctx)

filepath = r"C:\Users\JimStandard\Desktop\Untitled 1.docx"
fileUrl = uno.systemPathToFileUrl(os.path.realpath(filepath))

uno_args = (
    createProp("Minimized", True),
)
document = desktop.loadComponentFromURL(
    fileUrl, "_default", 0, uno_args)

uno_args = (
    createProp("FilterName", "writer_pdf_Export"),
    createProp("Overwrite", False),
)
newpath = filepath[:-len("docx")] + "pdf"
fileUrl = uno.systemPathToFileUrl(os.path.realpath(newpath))

try:
    document.storeToURL(fileUrl, uno_args)  # Export
except ErrorCodeIOException:
    raise

try:
    document.close(True)
except CloseVetoException:
    raise
Finally, since speed is a concern, using a listening instance of LibreOffice can be slow. To do this faster, move the code into a macro. APSO provides a menu to organize Python macros. Then call the macro like this:
soffice "vnd.sun.star.script:myscript.py$name_of_maindef?language=Python&location=user"
In macros, obtain the document objects from XSCRIPTCONTEXT rather than the resolver.
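As a rough idea of what that looks like (a sketch; name_of_maindef matches the placeholder in the URL above, and XSCRIPTCONTEXT is injected by LibreOffice's scripting framework):
def name_of_maindef(*args):
    # No resolver needed inside a macro: the scripting framework provides the context.
    ctx = XSCRIPTCONTEXT.getComponentContext()
    doc = XSCRIPTCONTEXT.getDocument()  # the document the macro is invoked on
    # ... reuse the same storeToURL() export logic as in the script above ...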
The following is my PySpark startup snippet, which is pretty reliable (I've been using it a long time). Today I added the two Maven coordinates shown in the spark.jars.packages option (effectively "plugging in" Kafka support). Now, that normally triggers dependency downloads (performed by Spark automatically):
import sys, os, multiprocessing
from pyspark.sql import DataFrame, DataFrameStatFunctions, DataFrameNaFunctions
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import functions as sFn
from pyspark.sql.types import *
from pyspark.sql.types import Row
# ------------------------------------------
# Note: Row() in .../pyspark/sql/types.py
# isn't included in '__all__' list(), so
# we must import it by name here.
# ------------------------------------------

num_cpus = multiprocessing.cpu_count()         # Number of CPUs for SPARK Local mode.
os.environ.pop('SPARK_MASTER_HOST', None)      # Since we're using pip/pySpark these three ENVs
os.environ.pop('SPARK_MASTER_PORT', None)      # aren't needed; and we ensure pySpark doesn't
os.environ.pop('SPARK_HOME', None)             # get confused by them, should they be set.
os.environ.pop('PYTHONSTARTUP', None)          # Just in case pySpark 2.x attempts to read this.
os.environ['PYSPARK_PYTHON'] = sys.executable  # Make SPARK Workers use same Python as Master.
os.environ['JAVA_HOME'] = '/usr/lib/jvm/jre'   # Oracle JAVA for our pip/python3/pySpark 2.4 (CDH's JRE won't work).

JARS_IVY_REPO = '/home/jdoe/SPARK.JARS.REPO.d/'

# ======================================================================
# Maven Coordinates for JARs (and their dependencies) needed to plug
# extra functionality into Spark 2.x (e.g. Kafka SQL and Streaming).
# A one-time internet connection is necessary for Spark to automatically
# download JARs specified by the coordinates (and dependencies).
# ======================================================================
spark_jars_packages = ','.join(['org.apache.spark:spark-streaming-kafka-0-10_2.11:2.4.0',
                                'org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0', ])
# ======================================================================

spark_conf = SparkConf()
spark_conf.setAll([('spark.master', 'local[{}]'.format(num_cpus)),
                   ('spark.app.name', 'myApp'),
                   ('spark.submit.deployMode', 'client'),
                   ('spark.ui.showConsoleProgress', 'true'),
                   ('spark.eventLog.enabled', 'false'),
                   ('spark.logConf', 'false'),
                   ('spark.jars.repositories', 'file:/' + JARS_IVY_REPO),
                   ('spark.jars.ivy', JARS_IVY_REPO),
                   ('spark.jars.packages', spark_jars_packages), ])

spark_sesn         = SparkSession.builder.config(conf=spark_conf).getOrCreate()
spark_ctxt         = spark_sesn.sparkContext
spark_reader       = spark_sesn.read
spark_streamReader = spark_sesn.readStream
spark_ctxt.setLogLevel("WARN")
However, the packages aren't being downloaded and/or loaded when I run the snippet (e.g. ./python -i init_spark.py), as they should be.
This mechanism used to work, but then stopped. What am I missing?
Thank you in advance!
This is the kind of post where the QUESTION will be worth more than the ANSWER, because the code above works but isn't anywhere to be found in Spark 2.x documentation or examples.
The above is how I've programmatically added functionality to Spark 2.x by way of Maven Coordinates. I had this working but then it stopped working. Why?
When I ran the above code in a Jupyter notebook, the notebook had -- behind the scenes -- already run a nearly identical code snippet by way of my PYTHONSTARTUP script. That PYTHONSTARTUP script has the same code as the above, but omits the Maven coordinates (by intent).
Here, then, is how this subtle problem emerges:
spark_sesn = SparkSession.builder.config(conf = spark_conf).getOrCreate()
Because a Spark Session already existed, the above statement simply reused that existing session (.getOrCreate()), which did not have the jars/libraries loaded (again, because my PYTHONSTARTUP script intentionally omits them). This is why it is a good idea to put print statements in PYTHONSTARTUP scripts (which are otherwise silent).
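A quick way to see which configuration the (possibly reused) session really ended up with is to read it back from the SparkContext (a small sketch; run it right after getOrCreate()):
# Read back the effective configuration of whatever session getOrCreate() returned.
effective_conf = spark_sesn.sparkContext.getConf()
print("spark.jars.packages =", effective_conf.get("spark.jars.packages", "<unset>"))
print("spark.app.name      =", effective_conf.get("spark.app.name"))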
In the end, I simply forgot to do this: $ unset PYTHONSTARTUP before starting the JupyterLab / Notebook daemon.
I hope the Question helps others because that's how to programmatically add functionality to Spark 2.x (in this case Kafka). Note that you'll need an internet connection for the one-time download of the specified jars and recursive dependencies from Maven Central.
Firstly, I'll describe my scenario:
Ubuntu 14.04
Spark 1.6.3
Python 3.5
I'm trying to execute my Python scripts through spark-submit. I need to create a context and then apply SQLContext as well.
First, I tested a very easy case in my pyspark console.
Then I created my Python script:
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local")
        .setAppName("My app")
        .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)

numbers = [1, 2, 3, 4, 5, 6]
numbersRDD = sc.parallelize(numbers)
numbersRDD.take(2)
But when I run this with spark-submit, it doesn't go through. I never get the results :(
There is no reason you should get any "results". Your script doesn't perform any obvious side effects (printing to stdout, writing to a file) other than standard Spark logging (visible in the output). numbersRDD.take(2) will execute just fine.
If you want to get some form of output print:
print(numbersRDD.take(2))
You should also stop the context before exiting:
sc.stop()
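Putting both suggestions together, the amended script would look roughly like this:
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local")
        .setAppName("My app")
        .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)

numbers = [1, 2, 3, 4, 5, 6]
numbersRDD = sc.parallelize(numbers)
print(numbersRDD.take(2))  # now something actually shows up in the driver output

sc.stop()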
I am able to create, drop, and modify tables using pyspark and HiveContext. I load a list with the commands I want to send, in string format, and pass them into this function:
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

def hiveCommands(commands, database):
    conf = SparkConf().setAppName(database + 'project').setMaster('local')
    sc = SparkContext(conf=conf)
    df = HiveContext(sc)
    f = df.sql('use ' + database)
    for command in commands:
        f = df.sql(command)
        f.collect()
It works fine for maintenance, but I'm trying to dip my toes into analysis, and I don't see any output when I try to send a command like "describe table."
I just see that it takes in the command and executes it without any errors, but I don't see what the actual output of the query is. I may need to mess with my .profile or .bashrc, not really sure. Something of a Linux newbie. Any help would be appreciated.
Call the show method to see the results:
for command in commands:
    df.sql(command).show()
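If you'd rather get the rows back into Python instead of printing them to the console, collect works too (a small sketch; my_table is just a placeholder, and for Spark SQL a describe query returns col_name, data_type, and comment columns):
# Grab the DESCRIBE output as Row objects instead of printing it.
rows = df.sql('describe my_table').collect()
for row in rows:
    print(row.col_name, row.data_type)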