Hello guys,
I'm running Jupyter on a server. I configured the server and launched it with: jupyter notebook --ip 0.0.0.0 --no-browser --allow-root
I created the container with a 7 GB memory limit:
sudo docker run -it --memory="7g" jupyter
I created a notebook containing only these lines:
import os
import sys
import pandas as pd
from src.connection import create_postgres_engine
postgres_connection = create_postgres_engine('prod')
query1 = """
select
id, col1, date, col2, col3
from s inner join c on s.id = c.id_1
and s.date BETWEEN c.date_beg AND date_end where date >='2020-01-04' and due_date <'2020-04-08';
"""
query1 = pd.read_sql_query(query1, postgres_connection[0], coerce_float=False)
After 3-4 minutes I got this error:
The kernel appears to have crashed; it will restart automatically.
Can anyone help?
I've spent two days on this and still can't figure out what is happening.
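In case it helps narrow things down: a kernel that dies with no Python traceback while loading a large query usually means the process ran out of memory, and the 7 GB container limit makes an OOM kill plausible here (that is an assumption, not something the error message confirms). Below is a minimal sketch of streaming the result in chunks instead of loading it all at once, reusing the question's own create_postgres_engine helper; the chunk size is just a guess:

import pandas as pd
from src.connection import create_postgres_engine  # the question's own helper

postgres_connection = create_postgres_engine('prod')

query1 = """
select id, col1, date, col2, col3
from s inner join c on s.id = c.id_1
and s.date BETWEEN c.date_beg AND date_end
where date >= '2020-01-04' and due_date < '2020-04-08';
"""

# chunksize makes read_sql_query return an iterator of DataFrames,
# so the whole result set never has to sit in memory at once.
chunks = pd.read_sql_query(query1, postgres_connection[0],
                           coerce_float=False, chunksize=50_000)

frames = []
for chunk in chunks:
    # Ideally filter or aggregate each chunk here so only reduced data
    # is kept; appending everything still needs memory for the full
    # result at the end.
    frames.append(chunk)

df = pd.concat(frames, ignore_index=True)

If even the filtered result does not fit in 7 GB, pushing more of the filtering or aggregation into the SQL itself is the other obvious lever.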
Related
System: Raspberry Pi 4 Model B, 32-bit Linux, running Python.
This may be a dumb question. I was planning to read data from MongoDB into Excel and also read Excel back into MongoDB. Overall the .py script/code is fine and working (the code is below).
I do know that if the code does "import pandas as pd", then I need to pip install it on the Raspberry Pi.
My main question:
A Raspberry Pi's memory is not as big as a laptop's, so is there another way to use these packages without pip installing all of them on the Pi? Besides, pip installing just pandas took about 15 minutes on the Raspberry Pi versus about 30 seconds on a laptop, and a factory might have more than a hundred Raspberry Pis recording things like temperature and product data on the production line.
There should be an efficient way to do this (use pandas and pymongo without manually pip installing them on every Raspberry Pi).
The memory left:
joy@raspberrypi:/ $ free
              total        used        free
Mem:        3834332      223876     2844436
The working code.py script (MongoDB to Excel):
import pandas as pd
from pymongo import MongoClient
import pymongo
from json2excel import Json2Excel
import json
from bson.objectid import ObjectId
from bson import json_util
client = pymongo.MongoClient("mongodb://localhost:27017/")
# Database Name
db = client["(practice_10_14)-0002"]
# Collection Name
col = db["(practice_10_24)read_MongoDB_to_Excel"]
# Find All: It works like Select * query of SQL.
x = col.find()
list_01 = []
for data in x:
    list_01.append(data)
    print(data)
    print("= = = = = ")

df = pd.DataFrame(data, index=[0])
# select two columns
for y in df:
    print(y)
    print("= = = = = ")
print(type(list_01))
print(list_01)
df = pd.DataFrame(list_01)
writer = pd.ExcelWriter('test10.24.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='welcome', index=False)
writer.save()
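The question also mentions the other direction (Excel back into MongoDB). A minimal sketch of that could look like the following; the target collection name is a placeholder, and reading .xlsx files with pandas needs an Excel engine such as openpyxl installed:

import pandas as pd
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["(practice_10_14)-0002"]      # same database as the script above
col = db["excel_to_mongodb"]              # hypothetical target collection

# Read the spreadsheet written above and insert each row as one document.
df = pd.read_excel("test10.24.xlsx", sheet_name="welcome")
records = df.to_dict(orient="records")
if records:
    col.insert_many(records)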
When I run the script from that folder through Windows PowerShell it works, but through Ubuntu it doesn't.
import matplotlib.pyplot as plt
import psycopg2
import os
import sys
cur.execute(f"SELECT date as date, revenue_rates_usd ->> '{desired_currency}' AS {desired_currency} FROM usd_rates WHERE date BETWEEN '{start_date}' AND '{end_date}';", conn)
dates = []
values = []
for row in cur.fetchall():
# print(row[1])
dates.append(row[0])
values.append(row[1])
plt.plot_date(dates, values, "-")
plt.title(f'Exchange from USD to {desired_currency}')
plt.show()
That is how I run it:
/mnt/c/Users/owner/Desktop/Tamatem/.venv/bin/python /mnt/c/Users/owner/Desktop/Tamatem/report.py JOD 2021-07-1 2021-07-22
And when I run it, there are no errors.
You might have to change the "backend".
import matplotlib
matplotlib.use('Agg')
Do you call the show() method inside a terminal or application that has access to a graphical environment?
Also try to use other GUI backends (TkAgg, wxAgg, Qt5Agg, Qt4Agg).
Further information on how this can be done is here: How can I set the 'backend' in matplotlib in Python?
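Note that with the non-interactive Agg backend, plt.show() will not open a window, so one option when running headless (for example under WSL without an X server) is to save the figure to a file instead. This is only a sketch, reusing the dates, values and desired_currency built in the question's script; the output file name is just a placeholder:

import matplotlib
matplotlib.use('Agg')              # non-interactive backend, no display needed
import matplotlib.pyplot as plt

# dates, values and desired_currency are built exactly as in the question.
plt.plot_date(dates, values, "-")
plt.title(f'Exchange from USD to {desired_currency}')
plt.savefig('exchange_rate.png')   # write the chart to a file instead of plt.show()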
I'm trying to connect to Hive using Python. I installed all of the required dependencies (sasl, thrift_sasl, etc.).
Here is how I try to connect:
configuration = {"hive.server2.authentication.kerberos.principal" : "hive/_HOST@REALM_HOST", "hive.server2.authentication.kerberos.keytab" : "/etc/security/keytabs/hive.service.keytab"}
connection = hive.Connection(configuration = configuration, host="host", port=port, auth="KERBEROS", kerberos_service_name = "hiveserver2")
But I get this error:
Minor code may provide more information (Cannot find KDC for realm "REALM_DOMAIN")
What am I missing? Does someone have an example of a PyHive connection using Kerberos?
Thank you for your help.
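For what it's worth, the "Cannot find KDC for realm" message comes from the Kerberos client itself: it cannot find KDC information for that realm, typically because the realm is not defined in /etc/krb5.conf on the machine running the script. Once the realm is configured and a ticket exists in the ticket cache (for example obtained with kinit), a minimal PyHive connection sketch might look like this; the host, port and service name are placeholders:

from pyhive import hive

# Assumes a valid Kerberos ticket is already in the ticket cache,
# e.g. obtained with kinit and the appropriate keytab/principal.
conn = hive.Connection(
    host="hiveserver2.example.com",    # placeholder host
    port=10000,                        # default HiveServer2 port
    auth="KERBEROS",
    kerberos_service_name="hive",      # service part of HiveServer2's Kerberos principal
)

cursor = conn.cursor()
cursor.execute("SELECT 1")
print(cursor.fetchall())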
Thank you @Kishore.
Actually in PySpark, the code looks like this:
import pyspark
from pyspark import SparkContext
from pyspark.sql import Row
from pyspark import SparkConf
from pyspark.sql import HiveContext
from pyspark.sql import functions as F
import pyspark.sql.types as T
def connection(self):
    conf = pyspark.SparkConf()
    conf.setMaster('yarn-client')
    sc = pyspark.SparkContext(conf=conf)
    self.cursor = HiveContext(sc)
    self.cursor.setConf("hive.exec.dynamic.partition", "true")
    self.cursor.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
    self.cursor.setConf("hive.warehouse.subdir.inherit.perms", "true")
    self.cursor.setConf('spark.scheduler.mode', 'FAIR')
and you can query it using:
rows = self.cursor.sql("SELECT someone FROM something")
for row in rows.collect():
    print(row)
I'm actually running the code via the command:
spark-submit --master yarn MyProgram.py
I guess you could basically run it with plain Python (with pyspark installed), like:
python MyProgram.py
but I haven't tried it, so I can't promise that it works.
I don't know about PySpark, but I have been using the Scala code below and it has been working for the last year. If you can, adapt this code to Python. Replace the property values based on your Kerberos setup.
System.setProperty("hive.metastore.uris", "add hive.metastore.uris url");
System.setProperty("hive.metastore.sasl.enabled", "true")
System.setProperty("hive.metastore.kerberos.keytab.file", "add keytab")
System.setProperty("hive.security.authorization.enabled", "false")
System.setProperty("hive.metastore.kerberos.principal", "replace hive.metastore.kerberos.principal value")
System.setProperty("hive.metastore.execute.setugi", "true")
val hiveContext = new HiveContext(sparkContext)
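For reference, a rough PySpark translation of the Scala snippet above might look like the following. This is only a sketch: it assumes Spark 2.x (SparkSession with Hive support enabled), and the property values are placeholders, exactly as in the Scala version:

from pyspark.sql import SparkSession

# Property values below are placeholders, as in the Scala snippet above.
spark = (
    SparkSession.builder
    .appName("HiveKerberosExample")
    .config("hive.metastore.uris", "thrift://your-metastore-host:9083")
    .config("hive.metastore.sasl.enabled", "true")
    .config("hive.metastore.kerberos.keytab.file", "/path/to/hive.service.keytab")
    .config("hive.metastore.kerberos.principal", "hive/_HOST@YOUR_REALM")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SELECT someone FROM something").show()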
I am running the docker image for snappydata v0.9. From inside that image, I can run queries against the database. However, I cannot do so from a second server on my machine.
I copied the python files from snappydata to the installed pyspark (editing snappysession to SnappySession in the imports) and (based on the answer to Unable to connect to snappydata store with spark-shell command), I wrote the following script (it is a bit of cargo-cult programming as I was copying from the python code in the docker image -- suggestions to improve it are welcome):
import pyspark
from pyspark.context import SparkContext
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.snappy import SnappyContext
from pyspark.storagelevel import StorageLevel
SparkContext._ensure_initialized()
spark = SparkSession.builder.appName("test") \
.master("local[*]") \
.config("snappydata.store.locators", "localhost:10034") \
.getOrCreate()
spark.sql("SELECT col1, min(col2) from TABLE1")
However, I get a traceback with:
pyspark.sql.utils.AnalysisException: u'Table or view not found: TABLE1
I have verified with Wireshark that my program is communicating with the docker image (following the TCP stream shows the traceback message and a Scala traceback). My assumption is that the permissions in the snappydata cluster are set wrong, but grepping through the logs and configuration did not show anything obvious.
How can I proceed?
-------- Edit 1 ------------
The new code that I am running (still getting the same error), incorporating the suggested config change and ensuring that I get a SnappySession, is:
from pyspark.sql.snappy import SnappySession
snappy = SnappySession.builder.appName("test") \
.master("local[*]") \
.config("spark.snappydata.connection", "localhost:1527") \
.getOrCreate()
snappy.sql("SELECT col1, min(col2) from TABLE1")
Can you change your config to the following?
.config("spark.snappydata.connection", "localhost:1527")
The 'snappydata.store.locators' property no longer exists in 0.9.
You can refer to the docs here: https://github.com/SnappyDataInc/snappydata/blob/master/docs/deployment.md#connectormode
Also, you need to create a SnappySession to access the Snappy-managed tables.
Something like this:
spark = SparkSession.builder.appName("test") \
.master("local[*]") \
.config("spark.snappydata.connection", "localhost:1527") \
.getOrCreate()
snappy = SnappySession(spark)
snappy.sql("SELECT col1, min(col2) from TABLE1")
I wrote a simple Flask app to pass some data to Spark. The script works in IPython Notebook, but not when I try to run it in its own server. I don't think that the Spark context is running within the script. How do I get Spark working in the following example?
from flask import Flask, request
from pyspark import SparkConf, SparkContext
app = Flask(__name__)
conf = SparkConf()
conf.setMaster("local")
conf.setAppName("SparkContext1")
conf.set("spark.executor.memory", "1g")
sc = SparkContext(conf=conf)
@app.route('/accessFunction', methods=['POST'])
def toyFunction():
    posted_data = sc.parallelize([request.get_data()])
    return str(posted_data.collect()[0])

if __name__ == '__main__':
    app.run(port=8080)
In IPython Notebook I don't define the SparkContext because it is configured automatically. I don't remember how I set that up; I followed some blog posts.
On the Linux server I have set the .py to always be running and installed the latest Spark by following up to step 5 of this guide.
Edit:
Following the advice by davidism I have now instead resorted to simple programs with increasing complexity to localise the error.
Firstly I created a .py file with just the script from the answer below (after appropriately adjusting the links):
import sys
try:
    sys.path.append("your/spark/home/python")
    from pyspark import context
    print("Successfully imported Spark Modules")
except ImportError as e:
    print("Can not import Spark Modules", e)
This returns "Successfully imported Spark Modules". However, the next .py file I made returns an exception:
from pyspark import SparkContext
sc = SparkContext('local')
rdd = sc.parallelize([0])
print(rdd.count())
This returns exception:
"Java gateway process exited before sending the driver its port number"
Searching around for similar problems I found this page, but when I run that code nothing happens: no print on the console and no error messages. Similarly, this did not help either; I get the same Java gateway exception as above. I have also installed Anaconda, as I heard this may help unite Python and Java, but again no success...
Any suggestions about what to try next? I am at a loss.
Okay, so I'm going to answer my own question in the hope that someone out there won't suffer the same days of frustration! It turns out it was a combination of missing code and bad set up.
Editing the code:
I did indeed need to initialise a Spark Context by appending the following in the preamble of my code:
from pyspark import SparkContext
sc = SparkContext('local')
So the full code will be:
from pyspark import SparkContext
sc = SparkContext('local')
from flask import Flask, request
app = Flask(__name__)
@app.route('/whateverYouWant', methods=['POST'])  # can set first param to '/'
def toyFunction():
    posted_data = sc.parallelize([request.get_data()])
    return str(posted_data.collect()[0])

if __name__ == '__main__':
    app.run(port=8080)  # note set to 8080!
Editing the setup:
It is essential that the file (yourrfilename.py) is in the correct directory, namely it must be saved to the folder /home/ubuntu/spark-1.5.0-bin-hadoop2.6.
Then issue the following command within the directory:
./bin/spark-submit yourfilename.py
which initiates the service at 10.0.0.XX:8080/accessFunction/ .
Note that the port must be set to 8080 or 8081: Spark only allows web UI for these ports by default for master and worker respectively
You can test out the service with a restful service or by opening up a new terminal and sending POST requests with cURL commands:
curl --data "DATA YOU WANT TO POST" http://10.0.0.XX:8080/accessFunction/
I was able to fix this problem by adding the location of PySpark and py4j to the path in my flaskapp.wsgi file. Here's the full content:
import sys
sys.path.insert(0, '/var/www/html/flaskapp')
sys.path.insert(1, '/usr/local/spark-2.0.2-bin-hadoop2.7/python')
sys.path.insert(2, '/usr/local/spark-2.0.2-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip')
from flaskapp import app as application
Modify your .py file as shown in the linked guide 'Using IPython Notebook with Spark', second point of that section. Instead of sys.path.insert, use sys.path.append. Try inserting this snippet:
import sys
try:
    sys.path.append("your/spark/home/python")
    from pyspark import context
    print("Successfully imported Spark Modules")
except ImportError as e:
    print("Can not import Spark Modules", e)