Running a pyspark program on python3 kernel in jupyter notebook

I used pip install pyspark to install PySpark. I didn't set any path etc.; however, I found that everything was downloaded and copied into C:/Users/Admin/anaconda3/scripts. I opened jupyter notebook in a Python3 kernel and I tried to run a SystemML script but it was giving me an error. I realized that I needed to place winutils.exe in C:/Users/Admin/anaconda3/scripts as well, so I did that and the script ran as expected.
Now, my program includes GridSearch and when I run it on my personal laptop, it is markedly slower than how it is on a Cloud data platform where I can initiate a kernel with Spark (such as IBM Watson Studio).
So my questions are:
(i) How do I add PySpark to the Python3 kernel? Or is it already working in the background when I import pyspark?
(ii) When I run the same code on the same dataset using pandas and scikit-learn, there is not much difference in performance. When is PySpark preferred/beneficial over pandas and scikit-learn?
Another thing is, even though PySpark seems to be working fine and I'm able to import its libraries, when I try to run
import findspark
findspark.init()
it throws an error (on line 2), saying the list index is out of range. I googled a bit and found advice saying that I had to explicitly set SPARK_HOME='C:/Users/Admin/anaconda3/Scripts'; but when I do that, pyspark stops working (and findspark.init() still does not work).
If anyone can explain what is going on, I'd be very grateful. Thank you.

"How do I add PySpark to the Python3 kernel?"
With pip install, as you've said you have already done.
"there is not much difference in performance"
You're only using one machine, so there wouldn't be.
"When is PySpark preferred/beneficial over pandas and scikit-learn?"
When you want to deploy the same code onto an actual Spark cluster and your dataset is stored in distributed storage.
You don't necessarily need findspark if your environment variables are already set up.
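To see what the notebook kernel actually has available before reaching for findspark, a small check like the one below may help. It is only a sketch: the function name spark_environment_report is hypothetical, and reading HADOOP_HOME is an assumption based on the usual Windows convention that winutils.exe lives under %HADOOP_HOME%\bin.

```python
import importlib.util
import os


def spark_environment_report(env=None):
    """Report whether pyspark is importable and which Spark-related variables are set."""
    env = os.environ if env is None else env
    return {
        # True when the running interpreter can find the pip-installed pyspark package
        "pyspark_importable": importlib.util.find_spec("pyspark") is not None,
        # None means the variable is not set for this kernel
        "SPARK_HOME": env.get("SPARK_HOME"),
        # Assumption: winutils.exe is conventionally expected under %HADOOP_HOME%\bin
        "HADOOP_HOME": env.get("HADOOP_HOME"),
    }


print(spark_environment_report())
```

If pyspark_importable is True, importing pyspark in that kernel should work without findspark. Pointing SPARK_HOME at a folder that is not a Spark installation is one plausible reason the import stopped working in the question.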

Related

Can't use ipynb files on jupyter/vsc

It's my first post, so do tell me if you require more specifics.
Some details may not be too relevant here, but I want to be as detailed as possible with the timeline:
I have been using Jupyter for my ipynb files for quite some time. Then I discovered TensorFlow. At first it went OK after installing the module, but ever since I tried to get TensorFlow to detect and use my GPU, everything went south. I tried things like downloading some NVIDIA software that my laptop does in fact support, and eventually got TensorFlow to detect my GPU. But the moment I tried to train my model with a CNN, however simple the layers were, my kernel would crash. Eventually I used Kaggle/Colab as a temporary solution, but now I want to fix it.
After trying, to no avail, to fix the TensorFlow issue or revert to when TensorFlow ran just fine with only my CPU, I eventually decided to do a hard reset and deleted Python/Anaconda entirely from my computer.
After reinstalling Anaconda, I booted up Jupyter and saw a python3 ipykernel that was most likely preinstalled with Anaconda, and I could run a simple hello world just fine. However, I realised that after pip installing TensorFlow, my 'old' TensorFlow settings were still there and it could still detect my GPU, and hence the kernel crashed yet again.
So I thought, why not make a completely new environment so I can install a completely fresh version of TensorFlow? Then I realised that Jupyter couldn't detect the new environment that I made (I don't know if it's because of ipykernel, but I did do a pip install ipykernel in the correct environment and it's still not detected).
My next attempt was to use VS Code (vsc), which managed to detect the new environment, but when running print('hello world') I was told: 'The kernel failed to start as a dll could not be loaded. View Jupyter log for further details.'
I'm really lost as to what to do now. All I want at this point is to use TensorFlow (whether on CPU or GPU, I really don't care anymore) in either vsc or Jupyter. As long as my files are .py, I should be able to run them with any environment just fine (though I didn't test the tensorflow module in .py files, because I don't see a point in training a model in a .py file).
I use Windows 10, if that helps.
I'm sorry if I gave unnecessary details. I would appreciate any advice on anything I'm doing wrong or misunderstanding, or any solutions, and please do try to dumb it down for me with appropriate explanations if possible. Thanks. I can also be contacted on a voice call in Discord if you think typing it is too much of a hassle.
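On the point that Jupyter does not list the new environment: installing ipykernel with pip is usually not enough by itself, because the environment also has to be registered as a kernel. The sketch below builds the registration command for whichever interpreter runs it. The helper name kernel_install_command and the environment name "tf-env" are hypothetical; the sketch assumes it is executed with the new environment activated, so that sys.executable points at that environment's Python.

```python
import sys


def kernel_install_command(env_name, display_name=None):
    """Build the command that registers the current interpreter as a Jupyter kernel."""
    command = [sys.executable, "-m", "ipykernel", "install", "--user", "--name", env_name]
    if display_name:
        # Optional label shown in Jupyter's kernel picker
        command += ["--display-name", display_name]
    return command


# "tf-env" is a hypothetical environment name; substitute your own
print(kernel_install_command("tf-env", "Python (tf-env)"))
```

The printed list can be passed to subprocess.run, or typed as a single command in the activated environment's prompt. After that, the kernel should appear in Jupyter's kernel menu.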

Using Matlab.engine and installing tensorflow at the same time

Currently I am working on a project in Jupyter Notebook in which I need to run a MATLAB script (.m) containing a function that provides me with data, which I then try to fit with a TensorFlow model. I can set up an environment that runs the MATLAB code and gives me the data, and I can set up an environment that does the TensorFlow part, but my problem is that I can't do both in the same environment.
Here is the setup and the problems. I am using matlab.engine, which I installed as described here: https://de.mathworks.com/help/matlab/matlab_external/install-the-matlab-engine-for-python.html
To run my Jupyter Notebook I first navigate to the location where my python.exe and the MATLAB files are located ("C:\Users\Philipp\AppData\Local\Programs\Python\Python37-32\Scripts"). If I run pip install tensorflow (in Anaconda Prompt), I get a lot of different errors. conda install works, but even when it is installed I can't import it. The errors include:
ImportError: No module named 'tensorflow.core'
ERROR: Could not find a version that satisfies the requirement tensorflow
No module named 'tensorflow'
I searched for all those problems but nothing helped me. I think this has something to do with the directory I am working in and I know it is bad but I have no idea how to change that. The error also occurs in different environments.
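One thing worth checking here is whether the interpreter is 32-bit. The folder name Python37-32 suggests it may be (that is an inference from the path, not something the question states), and TensorFlow's pip packages are built for 64-bit Python, which would explain the "Could not find a version" error. A minimal sketch, with a hypothetical helper name:

```python
import struct
import sys


def python_bitness():
    """Return 32 or 64 depending on the pointer size of the running interpreter."""
    return struct.calcsize("P") * 8


print(sys.executable)
print(python_bitness(), "bit")
```

Running this in the same notebook kernel that fails to import tensorflow shows which python.exe is actually in use and whether it is a 32-bit build.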

Have you tried running !pip install tensorflow directly in Jupyter Notebook? It's a temporary workaround, but I am having the same problems and this helped. Remember to comment it out after installation, so you won't re-run it by accident.

I found a solution to my problem. For this I needed a Jupyter Notebook and an external .py script that I designed as a Flask app. Luckily, I can run those in different environments. I pass data to the server and request it back using "get" and "post".
If someone still has another idea for doing all this in one Jupyter Notebook, I would still be happy to hear it.
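The get/post exchange described above can be sketched as follows. To keep the example self-contained it uses Python's standard-library http.server and urllib instead of Flask, so it illustrates the pattern rather than the poster's actual code. All names here (DataHandler, start_server, post_json, get_json) are hypothetical.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DataHandler(BaseHTTPRequestHandler):
    store = {"data": None}  # in-memory slot shared between requests

    def do_POST(self):
        # One process posts its data as JSON
        length = int(self.headers.get("Content-Length", 0))
        DataHandler.store["data"] = json.loads(self.rfile.read(length))
        self._reply({"status": "stored"})

    def do_GET(self):
        # The other process fetches whatever was stored last
        self._reply(DataHandler.store)

    def _reply(self, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the notebook output quiet


def start_server(port=0):
    """Start the server on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), DataHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Ignore any system proxy settings for these local calls
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def post_json(url, payload):
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with _opener.open(request) as response:
        return json.loads(response.read())


def get_json(url):
    with _opener.open(url) as response:
        return json.loads(response.read())
```

In the setup from the answer, the server would run in one environment (for example the one with matlab.engine) and the client calls in the other (the one with TensorFlow), so neither environment needs both packages.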

Python - Can't Get Pandas and Numpy Working in Visual Studio Code or Eclipse

I'm fairly new to IDEs and I'm trying to take courses in Python. No matter what I try, I cannot successfully run a Python script that has import pandas and import numpy in it in either Visual Studio Code or Eclipse (running on Windows 10). I have Python 3.8 installed, and when I try running those commands in the shell they work fine. I suspect that when I execute an actual Python script instead of using the console, it might be using a different interpreter; only then do I get errors saying numpy is not defined. I also get the error "cannot import name 'numpy' from partially initialized module 'pandas' (most likely due to a circular import)" when I specify "from pandas import numpy" rather than "from pandas import *".
I am very frustrated and don't know how to fix this problem. I've tried searching for help but not having a programming background, I don't know where to go to resolve this or how.
I also cannot get pip or pip3 to work at all to install packages. Those commands don't get recognized.
Please help!

I recommend using Jupyter Notebooks or PyCharm (an IDE). Both are very useful for learning Python and for working with data, data manipulation, and data visualization.
PyCharm analyzes your code and offers intelligent code completion, on-the-fly error checking with quick fixes, and easy project navigation.
Jupyter Notebooks, by contrast, can run code cell by cell and rerun specific cells after making changes, and their inline output is very useful for debugging and visualization. You can get it from https://jupyter.org.
Zeppelin Notebooks can also serve as an alternative.
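Whichever editor is used, the asker's suspicion about a different interpreter can be checked directly by running a short script in each tool and comparing the output. This is a sketch with a hypothetical helper name. The "partially initialized module" error is often caused by a local file named pandas.py or numpy.py shadowing the real package, and the reported path would reveal that.

```python
import importlib.util
import sys


def interpreter_report(modules=("numpy", "pandas")):
    """Show which Python is running and where it would load each module from."""
    report = {"executable": sys.executable, "version": sys.version.split()[0]}
    for name in modules:
        spec = importlib.util.find_spec(name)
        # None means this interpreter cannot find the module at all
        report[name] = spec.origin if spec is not None else None
    return report


for key, value in interpreter_report().items():
    print(key, "->", value)
```

If the executable differs between the shell and the IDE, select the working interpreter in the IDE's settings. When pip is not recognized as a command, running it as a module of that interpreter (python -m pip install numpy pandas) avoids relying on PATH.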

Adding python modules to AzureML workspace

I've recently been working on deploying a machine learning model as a web service. I used Azure Machine Learning Studio to create my own Workspace ID and Authorization Token. Then I trained a LogisticRegressionCV model from sklearn.linear_model locally on my machine (using Python 2.7.13), and I wanted to publish the model as a web service using the code snippet below:
import numpy as np
from azureml import services
# `model` is the classifier trained earlier in the session
@services.publish('workspaceID', 'authorization_token')
@services.types(var_1=float, var_2=float)
@services.returns(int)
def predicting(var_1, var_2):
    # reshape the two features into a single-row 2D array for scikit-learn
    input = np.array([var_1, var_2]).reshape(1, -1)
    return model.predict_proba(input)[0][1]
where input variable is a list with data to be scored and model variable contains trained classifier. Then after defining above function I want to make a prediction on sample input vector:
predicting.service(1.21, 1.34)
However, the following error occurs:
RuntimeError: Error 0085: The following error occurred during script
evaluation, please view the output log for more information:
And the most important message in log is:
AttributeError: 'module' object has no attribute 'LogisticRegressionCV'
The error is strange to me because when I was using normal sklearn.linear_model.LogisticRegression everything was fine. I was able to make predictions sending POST requests to created endpoint, so I guess sklearn worked correctly.
After changing to LogisticRegressionCV it does not.
Therefore I wanted to update sklearn on my workspace.
Do you have any ideas how to do it? Or, as a more general question: how can I install any Python module on Azure Machine Learning Studio so that I can use the predict functions of any model I developed locally?
Thanks
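One way to confirm that the workspace's scikit-learn is simply too old to contain LogisticRegressionCV is to run a small check in both places and compare the results. The helper below is a hypothetical sketch. It imports whatever module it is given, so running it against sklearn.linear_model assumes scikit-learn is installed in that environment.

```python
import importlib


def check_attribute(module_name, attribute):
    """Return (package version or None, whether the module exposes the attribute)."""
    module = importlib.import_module(module_name)
    top_level = importlib.import_module(module_name.split(".")[0])
    return getattr(top_level, "__version__", None), hasattr(module, attribute)


# Demonstrated on the standard library; for the question's case one would call
# check_attribute("sklearn.linear_model", "LogisticRegressionCV") locally and in the service.
print(check_attribute("json", "loads"))
```

A different version string, together with False for the attribute on the service side, would point to a version mismatch rather than a problem in the publishing code.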

For anyone who came across this question like I did in hopes of installing modules in AzureML notebooks; it seems the current environments sit on Conda on the compute so it's now as simple as executing
!conda env list
# conda environments:
#
base * /anaconda
azureml_py36 /anaconda/envs/azureml_py36
!conda install -n azureml_py36 -y <packages>
from within the notebook environment or doing pretty much the same without the ! in the terminal environment

To install a Python module on Azure ML Studio, see the Technical Notes section of the official documentation for Execute Python Script, which describes the procedure.
The general steps are as follows:
1. Create a Python project via virtualenv and activate it.
2. Install all the packages you want via pip in the virtual Python environment.
3. Package all files and directories under the path Lib\site-packages of your project as a zip file.
4. Upload the zip package into your Azure ML workspace as a dataset.
5. Follow the official documentation to import the Python module in your Execute Python Script.
For more details, you can refer to the similar SO thread Updating pandas to version 0.19 in Azure ML Studio, which also explains how to update the version of Python packages installed by Azure.
Hope it helps.
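The packaging step (zipping everything under Lib\site-packages) can be done by hand or with a short script. Below is a minimal sketch using only the standard library. The function name zip_directory_contents and the example paths are hypothetical.

```python
import os
import zipfile


def zip_directory_contents(source_dir, zip_path):
    """Zip every file under source_dir, storing paths relative to source_dir."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(source_dir):
            for file_name in files:
                full_path = os.path.join(root, file_name)
                # Relative names put the packages at the top level of the archive
                archive.write(full_path, os.path.relpath(full_path, source_dir))
    return zip_path


# Hypothetical usage; keep the zip outside the folder being archived:
# zip_directory_contents(r"myproject\Lib\site-packages", r"myproject\packages.zip")
```

Storing paths relative to site-packages means each package sits at the root of the zip, which is the layout the steps above describe.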

I struggled with the same issue: error 0085
I was able to resolve it by using Azure ML code example available from their library:
Deployment of AzureML Web Services from Python Notebooks
can be found at https://gallery.cortanaintelligence.com/Notebook/Deployment-of-AzureML-Web-Services-from-Python-Notebooks-4
I won't copy the whole code here, but I used it exactly as is and it worked with Boston dataset. Then I used it with my dataset, and I no longer got error 0085. I haven't tracked down the error yet but it's most likely due to some misbehaving character or indent. Hope this helps.

Trouble setting up 'Bachbot': Python gives no such command-error while installed correctly?

I am trying to set up Bachbot (https://github.com/feynmanliang/bachbot) on my Windows 10 system with Python 3.5.1 and Anaconda 4.0.0. Despite several attempts, I keep failing to get it to work. I downloaded the source code from GitHub (I didn't use Docker) and got to work.
The first thing that's good to know is that I changed all print statements by adding parentheses. Furthermore, I changed every import of cPickle to
import _pickle as cPickle
since I'm using a newer version of Python. By doing this, I cleared all compile errors, but now I'm stuck at the first few steps of getting the program to work. When calling
bachbot chorales prepare_poly
I get an error
Usage: bachbot-script.py [OPTIONS] COMMAND [ARGS]
Error: no such command "chorales"
I figured the chorales script is part of the music21-module, which I installed on my computer using pip.
As far as I know I followed the installation steps more or less correctly (see github Getting Started and Workflow):
1. run activate script
2. run pip install --editable .
2.5 (installed the missing module music21)
3. run bachbot chorales prepare_poly
I suspect it has something to do with the entry point but I can't put a finger on what's wrong. I tried several re-installs but that does not seem to do the trick.
I would be grateful if someone could help me with this. Thanks in advance!
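As an aside on the cPickle change described in the question: instead of importing the private _pickle module, a version-tolerant import is a common alternative. This is a sketch, not Bachbot's code. On Python 3 the standard pickle module already uses its C implementation when one is available.

```python
try:
    import cPickle as pickle_module  # Python 2
except ImportError:
    import pickle as pickle_module  # Python 3

# Round-trip a small object to show the alias behaves like pickle
restored = pickle_module.loads(pickle_module.dumps({"key": [1, 2, 3]}))
print(restored)
```

This does not address the "no such command" error itself, which concerns the CLI's available subcommands.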

My apologies, I was rushing to get the thesis in on time so the documentation is not the best!
The commands for building the polyphonic dataset and training the model are:
bachbot datasets prepare
bachbot datasets concatenate_corpus scratch/BWV-*.utf
bachbot make_h5
bachbot train
To use the model trained for $ITER iterations to generate samples with a sampling temperature of $TMP:
bachbot sample ~/bachbot/scratch/checkpoints/*/checkpoint_$ITER.t7 -t $TMP
bachbot decode sampled_stream ~/bachbot/scratch/sampled_$TMP.utf
The first and last sections of a recent presentation I made summarize this workflow.
By the way, I would recommend using the Docker image described in the presentation I linked. While the CLI is in Python, the actual LSTM has additional dependencies (e.g. Lua, Torch, CUDA if you plan on using a GPU).
