What is the best PySpark practice to load config from external file - python

I would like to initialize config once, and then use it in many modules of my PySpark project.
I see two ways to do it.
The first is to load it in the entry point and pass it as an argument to each function.
main.py:
with open(sys.argv[1]) as f:
    config = json.load(f)
df = load_df(config)
df = parse(df, config)
df = validate(df, config, strict=True)
dump(df, config)
But it seems inelegant to pass the same external argument to every function.
The second is to load the config in config.py and import this object in each module.
config.py
import sys
import json
with open(sys.argv[1]) as f:
    config = json.load(f)
main.py
from config import config
df = load_df()
df = parse(df)
df = validate(df, strict=True)
dump(df)
and add this line to each module:
from config import config
This seems more elegant because config is not, strictly speaking, an argument of the functions. It is the general context in which they execute.
Unfortunately, PySpark pickles config.py and tries to execute it on the server, but doesn't pass sys.argv to it!
So I get this error when I run it:
File "/PycharmProjects/spark_test/config.py", line 6, in <module>
CONFIG_PATH = sys.argv[1]
IndexError: list index out of range
What is the best practice for working with a general config, loaded from a file, in PySpark?

Your program starts executing on the master and passes the bulk of its work to executors by invoking functions on them. The executors are separate processes that typically run on different physical machines.
Thus anything that the master wants to reference on the executors needs to be either a standard library function (to which the executors have access) or a picklable object that can be sent over.
You typically don't want to load and parse external resources on the executors, since you would always have to copy them over and make sure they are loaded properly. Passing a picklable object as an argument of the function (e.g. for a UDF) works much better, since there is only one place in your code where you need to load it.
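As a minimal sketch of that idea, the config can be read once on the driver and bound to the function that Spark will ship to the executors. The names load_config, validate_row and make_validator, and the min_value key, are hypothetical and not part of the question's code.

```python
import json
import pickle
from functools import partial


def load_config(path):
    """Read the JSON config once, on the driver."""
    with open(path) as f:
        return json.load(f)


def validate_row(row, config):
    """Hypothetical per-row check that only needs plain config values."""
    return row.get("value", 0) >= config["min_value"]


def make_validator(config):
    """Bind the already-loaded config to the per-row function."""
    # Round-trip through pickle to fail fast if the config is not picklable.
    pickle.loads(pickle.dumps(config))
    return partial(validate_row, config=config)
```

With an RDD of dicts this could be used as rdd.filter(make_validator(config)). Only the plain dict travels with the function, so nothing on the executors needs sys.argv or the config file itself.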
I would suggest creating a config.py file and adding it as an argument to your spark-submit command:
spark-submit --py-files /path/to/config.py main_program.py
Then you can create the Spark context like this:
spark_context = SparkContext(pyFiles=['/path/to/config.py'])
and simply use import config wherever you need it.
You can even ship whole Python packages as a single zip file instead of just a single config.py file, but then be sure to include __init__.py in every folder that needs to be referenced as a Python module.

Related

How and when to initialise configuration in Python?

I'm getting pretty confused as to how and where to initialise application configuration in Python 3.
I have configuration that consists of application specific config (db connection strings, url endpoints etc.) and logging configuration.
Before my application performs its intended function I want to initialise the application and logging config.
After a few different attempts, I eventually ended up with something like the code below in my main entry module. It has the nice effect of all imports being grouped at the top of the file (https://www.python.org/dev/peps/pep-0008/#imports), but it doesn't feel right since the config modules are being imported for side effects alone which is pretty non-intuitive.
import config.app_config # sets up the app config
import config.logging_config # sets up the logging config
...
if __name__ == "__main__":
    ...
config.app_config looks something like the following:
_config = {
    'DB_URL': None
}
def _get_db_url():
    # somehow get the db url
    ...
def db_url():
    return _config['DB_URL']
# populate the value only after the helper above has been defined
_config['DB_URL'] = _get_db_url()
and config.logging_config looks like:
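import json
import logging
import logging.config
import os
path = 'logging_config.json'
if not os.path.isdir('logs'):
    os.makedirs('logs')
if os.path.exists(path):
    with open(path, 'rt') as f:
        config = json.load(f)
    logging.config.dictConfig(config)
else:
    # log_level is assumed to be defined elsewhere, e.g. logging.INFO
    logging.basicConfig(level=log_level)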
What is the common way to set up application configuration in Python? Bear in mind that I will have multiple applications, each using the config.app_config and config.logging_config modules, but with different connection strings, possibly read from a file.
I ended up with a cut-down version of the Django approach: https://github.com/django/django/blob/master/django/conf/__init__.py
It seems pretty elegant and has the nice benefit of working regardless of which module imports settings first.
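The following is only a rough sketch of what such a cut-down, Django-inspired settings object might look like; it is not Django's code and not the poster's actual solution. The class name LazySettings and the environment variable APP_CONFIG_FILE are assumptions made for illustration. The point it shows is that nothing is loaded at import time, so import order stops mattering.

```python
import json
import os


class LazySettings:
    """Loads settings on first attribute access instead of at import time."""

    def __init__(self, env_var="APP_CONFIG_FILE"):  # hypothetical variable name
        self._env_var = env_var
        self._values = None

    def configure(self, **values):
        """Set values explicitly, e.g. from the entry point or from tests."""
        self._values = dict(values)

    def _load(self):
        path = os.environ.get(self._env_var)
        if path is None:
            raise RuntimeError(
                "Set %s or call settings.configure()" % self._env_var
            )
        with open(path) as f:
            self._values = json.load(f)

    def __getattr__(self, name):
        # Only reached for names not found normally, i.e. setting names.
        if name.startswith("_"):
            raise AttributeError(name)
        if self._values is None:
            self._load()
        try:
            return self._values[name]
        except KeyError:
            raise AttributeError(name)


# Every module imports this one object; loading is deferred until first use.
settings = LazySettings()
```

Any module can then do from config import settings and read settings.DB_URL; the file is read the first time a setting is actually needed.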

Call external matlab function in Spark

I am trying to distribute some programs to a local cluster I built using Spark. The aim of this project is to pass some data to each worker, have an external MATLAB function process it, and collect the results back on the master node. I ran into the problem of how to call the MATLAB function. Is it possible for Spark to call an external function? In other words, can each function parallelized in Spark search a local path on its node and execute an external function from there?
Here is a small test code:
run.py
import sys
from operator import add
from pyspark import SparkContext
import callmatlab
def run(a):
    # print '__a'
    callmatlab.sparktest()
if __name__ == "__main__":
    sc = SparkContext(appName="PythonWordCount")
    output = sc.parallelize(range(1,2)).map(run)
    print output
    sc.stop()
sparktest.py
import matlab.engine as eng
import numpy as np
eng = eng.start_matlab()
def sparktest():
    print "-----------------------------------------------"
    data = eng.sparktest()
    print "----the return data:\n", type(data), data
if __name__ == "__main__":
    sparktest()
submit spark
#!/bin/bash
path=/home/zzz/ProgramFiles/spark
$path/bin/spark-submit \
--verbose \
--py-files $path/hpc/callmatlab.py $path/hpc/sparktest.m \
--master local[4] \
$path/hpc/run.py \
README.md
It seems Spark expects all attached .py files to be given as parameters of --py-files; however, Spark does not recognize sparktest.m.
I do not know how to continue. Could anyone give me some advice? Does Spark allow this? Or can you recommend another distributed Python framework?
Thanks
Thanks for trying to answer my question. I solved this problem a different way. I uploaded the MATLAB files and the data they need to a path in each node's file system. The Python code just adds that path and calls the function using the matlab.engine module.
So my callmatlab.py becomes
import matlab.engine as eng
import numpy as np
import os
eng = eng.start_matlab()
def sparktest():
    print "-----------------------------------------------"
    eng.addpath(os.path.join(os.getenv("HOME"), 'zzz/hpc/'),nargout=0)
    data = eng.sparktest([12, 1, 2])
    print data
Firstly, I do not see any reason to pass sparktest.m.
Secondly, the recommended way is to put the files in a .zip file. From the documentation:
For Python, you can use the --py-files argument of spark-submit to add
.py, .zip or .egg files to be distributed with your application. If
you depend on multiple Python files we recommend packaging them into a
.zip or .egg.
Finally, remember that your function will be executed by an executor on a remote machine, so the Spark framework ships the function, its closure, and any additional files as part of the job. Hope that helps.
Add the --files option before sparktest.m. That tells Spark to ship the sparktest.m file to all workers.
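To make the option layout concrete, here is a small sketch that assembles the spark-submit argument list in Python, assuming the file names from the question. The helper build_submit_command is hypothetical; only the --master, --py-files and --files options themselves come from spark-submit. It illustrates that Python dependencies and other files go to separate options, each as a comma-separated list, and that all options precede the application script.

```python
def build_submit_command(app, app_args=(), py_files=(), files=(),
                         master="local[4]"):
    """Return a spark-submit argument list; options precede the app script."""
    cmd = ["spark-submit", "--master", master]
    if py_files:
        # Python dependencies (.py, .zip, .egg), comma-separated.
        cmd += ["--py-files", ",".join(py_files)]
    if files:
        # Any other files to ship to the workers, e.g. a MATLAB .m file.
        cmd += ["--files", ",".join(files)]
    cmd.append(app)
    cmd.extend(app_args)
    return cmd


command = build_submit_command(
    "run.py",
    app_args=["README.md"],
    py_files=["callmatlab.py"],
    files=["sparktest.m"],
)
print(" ".join(command))
```

The resulting list can be passed to subprocess.run, or simply printed and copied into a shell script.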

Creating a function to load data into a list? Passing filename to import through function

I am new to writing functions and modules, and I need some help creating a function within my new module that will make my currently repetitive process of loading data much more efficient.
I would like this function to reside in a larger overall module with other functions that I can keep stored in my home directory and not have to copy into my working directory every time I want to call one of my functions.
The data that I have is just some JSON Twitter data from the streaming API, and I would like to use a function to load the data (list of dicts) into a list that I can access after the function runs by using something like data = my_module.my_function('file.json').
I've created a folder in my home directory for my python modules and I have two files in that directory: __init__.py and my_module.py.
I also went ahead and added the python module folder to sys.path by using sys.path.append('C:\python')
Within python module folder, the file __init__.py has nothing in it, it's just an empty file.
Do I need to put anything in the __init__.py file?
my_module.py has the following code:
import json
def my_function(parameter1):
    tweets = []
    for line in open(parameter1):
        try:
            tweets.append(json.loads(line))
        except:
            pass
I would like to call the function as such:
import my_module
data = my_module.my_function('tweets.json')
What else do I need to do to create this function to make loading my data more efficient?
To import the module from the package, for example:
import my_package.my_module as my_module
would do what you want. In this case it's fine to leave __init__.py empty, and the module will be found just by being in the package folder "my_package". There are many alternative ways to define a package/module structure and to import from it; I encourage you to read up on them, as otherwise you will get confused at some point.
I would like this function to reside in a larger overall module with
other functions that I can keep stored in my home directory and not
have to copy into my working directory every time I want to call one
of my functions.
For this to work, you need to create a .pth file in C:\Python\site-packages\ (or wherever you have installed Python). This is a simple text file, and inside it you'd put the path to your module:
C:/Users/You/Some/Path/
Call it custom.pth and make sure it's in the site-packages directory, otherwise it won't work.
I've also went ahead and added the python module folder to sys.path by
using sys.path.append('C:\python')
You don't need to do this. The site-packages directory is checked by default for modules.
In order to use your function as you intend, you need to make sure:
The file is called my_module.py
It is in the directory you added to the .pth file as explained earlier
That's it, nothing else needs to be done.
As for your code itself, the first thing is that you are missing a return statement.
If each line in the file is a JSON object, then use this:
from __future__ import with_statement
import json
def my_function(parameter1):
    tweets = []
    with open(parameter1) as inf:
        for line in inf:
            try:
                tweets.append(json.loads(line))
            except ValueError:
                # skip lines that are not valid JSON
                pass
    return tweets
If the entire file is a single JSON object, then you can do this:
def my_function(parameter1):
    with open(parameter1) as inf:
        tweets = json.load(inf)
    return tweets

Adding Command Line Arguments to Python Twisted

I am still new to Python so keep that in mind when reading this.
I have been hacking away at an existing Python script that was originally "put" together by a few different people.
The script was originally designed to load its 'configuration' using a module named "conf/config.py", which is basically Python code.
SETTING_NAME='setting value'
I've modified this to instead read its settings from a configuration file using ConfigParser:
import ConfigParser
config_file_parser = ConfigParser.ConfigParser()
CONFIG_FILE_NAME = "/etc/service_settings.conf"
config_file_parser.readfp(open(r'' + CONFIG_FILE_NAME))
SETTING_NAME = config_file_parser.get('Basic', 'SETTING_NAME')
The problem I am having is how to specify the configuration file to use. Currently I have managed to get it working (somewhat) by having multiple TAC files and setting the "CONFIG_FILE_NAME" variable there, using another module to hold the variable value. For example, I have a module "conf/ConfigLoader.py":
global CONFIG_FILE_NAME
Then my TAC file has:
import conf.ConfigLoader as ConfigLoader
ConfigLoader.CONFIG_FILE_NAME = '/etc/service_settings.conf'
So the conf/config.py module now looks like:
import ConfigLoader
config_file_parser = ConfigParser.ConfigParser()
config_file_parser.readfp(open(r'' + ConfigLoader.CONFIG_FILE_NAME))
It works, but it requires managing two files instead of a single conf file. I attempted to use the "usage.Options" feature as described on http://twistedmatrix.com/documents/current/core/howto/options.html. So I have twisted/plugins/Options.py
from twisted.python import usage
global CONFIG_FILE_NAME
class Options(usage.Options):
    optParameters = [['conf', 'c', 'tidepool.conf', 'Configuration File']]
# Get config
config = Options()
config.parseOptions()
CONFIG_FILE_NAME = config.opts['conf']
That does not work at all. Any tips?
I don't know if I understood your problem correctly.
If you want to load the configuration from multiple locations, you could pass a list of filenames to the config parser: https://docs.python.org/2/library/configparser.html#ConfigParser.RawConfigParser.read
If you are trying to make a generic configuration manager, you could create a class or function that receives the filename, or you could set the configuration file name in an environment variable and read that variable in your script using something like os.environ.get('CONFIG_FILE_NAME').
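A minimal sketch of the environment-variable variant follows, written for Python 3 (where the module is named configparser rather than ConfigParser). The function name load_settings is hypothetical; the default path and the 'Basic' section are taken from the question.

```python
import configparser
import os

DEFAULT_CONFIG_FILE = "/etc/service_settings.conf"  # path used in the question


def load_settings(environ=os.environ):
    """Read settings from the file named in CONFIG_FILE_NAME, else the default."""
    path = environ.get("CONFIG_FILE_NAME", DEFAULT_CONFIG_FILE)
    parser = configparser.ConfigParser()
    # read() accepts a list of filenames and returns those it could parse.
    loaded = parser.read([path])
    if not loaded:
        raise FileNotFoundError("No configuration file found at %s" % path)
    return parser
```

The service could then be started with CONFIG_FILE_NAME set in its environment, and the config module would call load_settings().get('Basic', 'SETTING_NAME') without needing a second file to hold the path.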

Is it possible to 'import * from DIRECTORY', then somehow (anyhow) iterate over the loaded modules?

Let me explain the use case...
In a simple python web application framework designed for Google App Engine, I'd like to have my models loaded automatically from a 'models' directory, so all that's needed to add a model to the application is place a file user.py (for example), which contains a class called 'User', in the 'models/' directory.
Being GAE, I can't read from the file system so I can't just read the filenames that way, but it seems to me that I must be able to 'import * from models' (or some equivalent), and retrieve at least a list of module names that were loaded, so I can subject them to further processing logic.
To be clear, I want this to be done WITHOUT having to maintain a separate list of these module names for the application to read from.
You can read from the filesystem in GAE just fine; you just can't write to the filesystem.
from models import * will only import modules listed in __all__ in models/__init__.py; there's no automatic way to import all modules in a package if they're not declared to be part of the package. You just need to read the directory (which you can do) and __import__() everything in it.
As explained in the Python tutorial, you cannot load all .py files from a directory unless you list them manually in the list named __all__ in the file __init__.py. One of the reasons why this is impossible is that it would not work well on case-insensitive file systems -- Python would not know in which case the module names should be used.
Let me start by saying that I'm not familiar with Google App Engine, but the following code demonstrates how to import all python files from a directory. In this case, I am importing files from my 'example' directory, which contains one file, 'dynamic_file.py'.
import os
import imp
import glob
def extract_module_names(python_files):
    module_names = []
    for py_file in python_files:
        module_name = (os.path.basename(py_file))[:-3]
        module_names.append(module_name)
    return module_names
def load_modules(modules, py_files):
    module_count = len(modules)
    for i in range(0, module_count):
        globals()[modules[i]] = imp.load_source(modules[i], py_files[i])
if __name__ == "__main__":
    python_files = glob.glob('example/*.py')
    module_names = extract_module_names(python_files)
    load_modules(module_names, python_files)
    dynamic_file.my_func()
Also, if you wish to iterate over these modules, you could modify the load_modules function to return a list of the loaded module objects by appending the 'imp.load_source(..)' call to a list.
Hope it helps.
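The imp module used above is deprecated and was removed in Python 3.12. As an alternative sketch, the same idea can be expressed with pkgutil and importlib from the standard library. The function name load_package_modules is hypothetical, and this assumes the directory is a regular package (it has an __init__.py) that is importable from sys.path; whether it behaves the same on Google App Engine is not something this sketch verifies.

```python
import importlib
import pkgutil


def load_package_modules(package_name):
    """Import every module directly inside a package; return {name: module}."""
    package = importlib.import_module(package_name)
    modules = {}
    # iter_modules lists the submodules found on the package's search path.
    for info in pkgutil.iter_modules(package.__path__):
        full_name = "%s.%s" % (package_name, info.name)
        modules[info.name] = importlib.import_module(full_name)
    return modules
```

Calling load_package_modules('models') would return a dict that can be iterated for further processing, for example to look up a User class in the user module, without maintaining a separate list of module names.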
