This is my script. I have edited it to write a log file, based on some information I found on the internet, but I am still unable to get a log file with the errors. Can anyone help me solve this? Here is the script:
from pyspark.sql.functions import udf
from datetime import datetime
from math import floor
from pyspark.context import SparkContext
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import expr
import os
import sys
import logging
import logging.handlers
log = logging.getLogger('log_file')
handler = logging.FileHandler("spam.log")
formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
handler.setFormatter(formatter)
log.addHandler(handler)
sys.stderr.write = log.error
sys.stdout.write = log.info
sc = SparkContext('local')
spark = SparkSession(sc)
sc.addPyFile("udf3.py")
from udf3 import BW_diag
bw = udf(BW_diag)
My PySpark shell used to show me the errors in the shell itself, but after adding this logging code that I copied from somewhere, I am no longer getting errors in the shell. I also want a log file, because I have to run my script in Oozie and need the error logs to check my script.
Put more simply: first I have to run the script in PySpark on sample data, and then place it in Oozie for the main data run.
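For reference, this is roughly what I am aiming for, as a minimal untested sketch (the explicit level and the console handler are additions, not part of the copied script): a logger that writes to spam.log and still shows messages in the shell, without overriding sys.stderr/sys.stdout.
import logging

log = logging.getLogger("log_file")
log.setLevel(logging.INFO)  # without an explicit level, only WARNING and above are emitted

formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")

file_handler = logging.FileHandler("spam.log")
file_handler.setFormatter(formatter)
log.addHandler(file_handler)

console_handler = logging.StreamHandler()  # keeps messages visible in the shell
console_handler.setFormatter(formatter)
log.addHandler(console_handler)

log.info("example message")  # goes to both spam.log and the console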
Thank you experts!
Related
When I run pytest, I am not seeing any log messages. How can I fix this? I tried to search for a pytest.ini file, but it is not present locally. I am new to pytest and need some help.
import test_todo_lib
import logging
logging.basicConfig(format='%(asctime)s - %(message)s', level=logging.CRITICAL)
# NOTE: Run pytest with the --capture=tee-sys option in order to display standard output
def test_basic_unit_tests(browser):
    test_page = test_todo_lib.ToDoAppPage(browser)
    test_page.load()
    logging.info("Check the basic functions")
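Two things probably matter here: the basicConfig level is CRITICAL, so logging.info calls are filtered out before they reach any handler, and pytest captures log output by default (it can be shown live with --log-cli-level=INFO or log_cli = true in a pytest.ini). A minimal sketch using pytest's caplog fixture instead (the test name and message are illustrative, not from the original suite):
import logging

def test_logging_is_visible(caplog):
    # Lower the level for the duration of the test so INFO records are captured.
    with caplog.at_level(logging.INFO):
        logging.info("Check the basic functions")
    assert "Check the basic functions" in caplog.text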
I'm struggling to get Python to log the way I want. I've added some context to my log statements by adding a filter and updating the format string to print what's in the filter. That works as expected as long as all the code is mine.
But if a 3rd party module logs something, it throws an error because that module doesn't know about my filter.
How do I get context into my logs without blowing up 3rd party module logging?
This code works fine in my modules. But if a 3rd party module wants to log something, it doesn't know about my ContextFilter, which supplies the nstid I want in my log messages.
import logging
import sys
import boto3
from ContextFilter import ContextFilter
logging.basicConfig(
    format='%(asctime)s %(levelname)-8s nstid:%(nstid)8s %(message)s',
    handlers=[logging.StreamHandler(sys.stdout)],
    level=logging.INFO,
    datefmt='%Y-%m-%d %H:%M:%S'
)
log = logging.getLogger(__name__)
log.addFilter(ContextFilter())
log.info("important information")
I was able to get what I needed using a CustomAdapter instead of a ContextFilter, but I'm still interested in other solutions:
CustomAdapter.py:
import logging
import os
class CustomAdapter(logging.LoggerAdapter):
    def process(self, msg, kwargs):
        nstid = os.environ['NSTID'] if 'NSTID' in os.environ else None
        return '[NSTID: %s] %s' % (nstid, msg), kwargs
import logging
import os
import sys
import boto3
from CustomAdapter import CustomAdapter
logging.basicConfig(
    format='%(asctime)s %(levelname)-8s %(message)s',
    handlers=[logging.StreamHandler(sys.stdout)],
    level=logging.INFO,
    datefmt='%Y-%m-%d %H:%M:%S'
)
log = CustomAdapter(logging.getLogger(__name__), {})
log.info("important information")
sqs = boto3.resource('sqs', region_name=os.environ['AWS_REGION'])
outputs:
2020-04-13 13:24:38 INFO [NSTID: 24533] important information
2020-04-13 13:24:38 INFO Found credentials in shared credentials file: ~/.aws/credentials
The first line is from my code, the second from boto3.
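Another option that should also work, sketched here only (I have not deployed it): attach a filter to the handler rather than to my logger, and have it supply a default nstid, so records from boto3 and other third-party loggers format cleanly with the original nstid format string:
import logging
import os
import sys

class DefaultNstidFilter(logging.Filter):
    # Inject an nstid attribute on every record so third-party records don't break the formatter.
    def filter(self, record):
        if not hasattr(record, 'nstid'):
            record.nstid = os.environ.get('NSTID', '-')
        return True

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter(
    '%(asctime)s %(levelname)-8s nstid:%(nstid)8s %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S'))
handler.addFilter(DefaultNstidFilter())  # handler-level filters see every record routed to this handler

logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger(__name__).info("important information")
Because the filter sits on the handler, boto3's records pick up the default nstid instead of tripping over the missing field.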
Write your code with a nice logger
import logging
import os

def init_logging():
    logFormatter = logging.Formatter("[%(asctime)s] %(levelname)s::%(module)s::%(funcName)s() %(message)s")
    rootLogger = logging.getLogger()

    LOG_DIR = os.getcwd() + '/' + 'logs'
    if not os.path.exists(LOG_DIR):
        os.makedirs(LOG_DIR)
    fileHandler = logging.FileHandler("{0}/{1}.log".format(LOG_DIR, "g2"))
    fileHandler.setFormatter(logFormatter)
    rootLogger.addHandler(fileHandler)

    rootLogger.setLevel(logging.DEBUG)

    consoleHandler = logging.StreamHandler()
    consoleHandler.setFormatter(logFormatter)
    rootLogger.addHandler(consoleHandler)

    return rootLogger

logger = init_logging()
This works as expected: logging with logger.debug("Hello! :)") goes to both the file and the console.
In a second step, you import an external module which also logs using the logging module:
Install it using pip3 install pymisp (or any other external module)
Import it using from pymisp import PyMISP (or any other external module)
Create an object of it using self.pymisp = PyMISP(self.ds_model.api_url, self.ds_model.api_key, False, 'json') (or any other...)
What happens now is that every debug log message from the imported module is logged to the log file and the console. The question is: how do I set a different (higher) log level for the imported module?
As Meet Sinoja and anishtain4 pointed out in the comments, the best and most generic method is to retrieve the logger by the name of the imported module as follows:
import logging
import some_module_with_logging
logging.getLogger("some_module_with_logging").setLevel(logging.WARNING)
Another option (though not recommended if the generic method above works) is to extract the module's logger variable and customize it to your needs. Most third-party modules store it in a module-level variable called logger or _log. In your case:
import logging
import pymisp
pymisp.logger.setLevel(logging.INFO)
# code of module goes here
A colleague of mine helped with this question:
Get a named logger: yourLogger = logging.getLogger('your_logger')
Add a filter to each handler to prevent it from printing/saving logs other than yours:
for handler in logging.root.handlers:
    handler.addFilter(logging.Filter('your_logger'))
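Put together, a minimal sketch of how those two steps combine (the logger name 'your_logger' and the third-party module name here are placeholders):
import logging

logging.basicConfig(level=logging.DEBUG)  # root logger with a console handler
yourLogger = logging.getLogger('your_logger')

# Only records originating from 'your_logger' (or its children) pass the name filter.
for handler in logging.root.handlers:
    handler.addFilter(logging.Filter('your_logger'))

yourLogger.debug("this is printed")
logging.getLogger('some_module_with_logging').debug("this is filtered out")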
I wrote a simple Flask app to pass some data to Spark. The script works in IPython Notebook, but not when I try to run it on its own server. I don't think the Spark context is running within the script. How do I get Spark working in the following example?
from flask import Flask, request
from pyspark import SparkConf, SparkContext
app = Flask(__name__)
conf = SparkConf()
conf.setMaster("local")
conf.setAppName("SparkContext1")
conf.set("spark.executor.memory", "1g")
sc = SparkContext(conf=conf)
@app.route('/accessFunction', methods=['POST'])
def toyFunction():
    posted_data = sc.parallelize([request.get_data()])
    return str(posted_data.collect()[0])

if __name__ == '__main__':
    app.run(port=8080)
In IPython Notebook I don't define the SparkContext because it is automatically configured. I don't remember exactly how I did this; I followed some blogs.
On the Linux server I have set the .py to always be running and installed the latest Spark by following up to step 5 of this guide.
Edit:
Following the advice by davidism I have now instead resorted to simple programs with increasing complexity to localise the error.
Firstly, I created a .py file with just the script from the answer below (after appropriately adjusting the paths):
import sys
try:
    sys.path.append("your/spark/home/python")
    from pyspark import context
    print("Successfully imported Spark Modules")
except ImportError as e:
    print("Can not import Spark Modules", e)
This returns "Successfully imported Spark Modules". However, the next .py file I made returns an exception:
from pyspark import SparkContext
sc = SparkContext('local')
rdd = sc.parallelize([0])
print(rdd.count())
This returns the exception:
"Java gateway process exited before sending the driver its port number"
Searching around for similar problems, I found this page, but when I run that code nothing happens: no print on the console and no error messages. Similarly, this did not help either; I get the same Java gateway exception as above. I have also installed Anaconda, as I heard this may help unite Python and Java, but again no success...
Any suggestions about what to try next? I am at a loss.
Okay, so I'm going to answer my own question in the hope that someone out there won't suffer the same days of frustration! It turns out it was a combination of missing code and a bad setup.
Editing the code:
I did indeed need to initialise a Spark Context by appending the following in the preamble of my code:
from pyspark import SparkContext
sc = SparkContext('local')
So the full code will be:
from pyspark import SparkContext
sc = SparkContext('local')
from flask import Flask, request
app = Flask(__name__)
@app.route('/whateverYouWant', methods=['POST'])  # can set first param to '/'
def toyFunction():
    posted_data = sc.parallelize([request.get_data()])
    return str(posted_data.collect()[0])

if __name__ == '__main__':
    app.run(port=8080)  # note: set to 8080!
Editing the setup:
It is essential that the file (yourfilename.py) is in the correct directory, namely it must be saved to the folder /home/ubuntu/spark-1.5.0-bin-hadoop2.6.
Then issue the following command within the directory:
./bin/spark-submit yourfilename.py
which initiates the service at 10.0.0.XX:8080/accessFunction/.
Note that the port must be set to 8080 or 8081: by default Spark only allows the web UI on these ports, for the master and worker respectively.
You can test the service with a REST client or by opening a new terminal and sending POST requests with cURL:
curl --data "DATA YOU WANT TO POST" http://10.0.0.XX:8080/accessFunction/
I was able to fix this problem by adding the location of PySpark and py4j to the path in my flaskapp.wsgi file. Here's the full content:
import sys
sys.path.insert(0, '/var/www/html/flaskapp')
sys.path.insert(1, '/usr/local/spark-2.0.2-bin-hadoop2.7/python')
sys.path.insert(2, '/usr/local/spark-2.0.2-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip')
from flaskapp import app as application
Modify your .py file as shown in the linked guide 'Using IPython Notebook with Spark', second point. Instead of sys.path.insert, use sys.path.append. Try inserting this snippet:
import sys
try:
    sys.path.append("your/spark/home/python")
    from pyspark import context
    print("Successfully imported Spark Modules")
except ImportError as e:
    print("Can not import Spark Modules", e)
When I give a hardcoded value in the following code, it works fine.
#!/usr/bin/env python3
import time
from kazoo.client import KazooClient
import os
nodes = ["172.22.105.53"]
zk = KazooClient(hosts='172.22.105.53:2181')
Output: no error.
But the following lines give an error like No handlers could be found for logger "kazoo.client":
#!/usr/bin/env python3
import time
from kazoo.client import KazooClient
import os
nodes = ["172.22.105.53"]
lead = "172.22.105.53"
zk = KazooClient(hosts='lead:2181')
Any help in this regard is greatly appreciated.
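A possible explanation, offered as a hedged sketch rather than a confirmed fix: in the second snippet, hosts='lead:2181' is a literal string, so the variable lead is never used and the client tries to reach a host literally named lead; the "No handlers could be found for logger" message is the logging module's warning when a library (here kazoo, reporting its connection trouble) emits a record before any handlers are configured. A minimal sketch of what was probably intended (the basicConfig level and the zk.start() call are my assumptions):
#!/usr/bin/env python3
import logging
from kazoo.client import KazooClient

# Configure logging so kazoo's own messages are visible instead of the
# "No handlers could be found" warning.
logging.basicConfig(level=logging.INFO)

lead = "172.22.105.53"

# Build the host string from the variable; 'lead:2181' would be taken literally.
zk = KazooClient(hosts='{}:2181'.format(lead))
zk.start()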