import local packages inside PyFlink - python

I'm trying to write a local package in a PyFlink project, but I can only import it via a relative path, like:
from .package import func
Can I use absolute imports in packages inside a PyFlink project that is added with env.add_python_file('/path_to_project')?

An answer on using absolute paths, from the Flink user mailing list (https://lists.apache.org/list.html?user#flink.apache.org):
For a directory structure such as:
flink_app/
    data_service/
        filesystem.py
    validator/
        validator.py
    common/
        constants.py
    main.py <- entry job
When submitting the PyFlink job, you can specify the Python files and the entry main module with the options --pyFiles and --pyModule, like:
$ ./bin/flink run --pyModule flink_app.main --pyFiles ${WORKSPACE}/flink_app
In this way, all files under the directory will be added to the PYTHONPATH of both the local client and the remote Python UDF worker.
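The effect on imports can be tried out without a Flink cluster. The following is a minimal sketch, not PyFlink code: it rebuilds the layout above in a temporary directory and uses sys.path as a stand-in for the PYTHONPATH entry, assuming the directory that contains flink_app/ ends up as a search root. The __init__.py files, BATCH_SIZE, and is_valid are hypothetical additions for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway copy of the layout above (file contents are made up).
workspace = Path(tempfile.mkdtemp())
app = workspace / "flink_app"
for sub in ("data_service", "validator", "common"):
    (app / sub).mkdir(parents=True)
    (app / sub / "__init__.py").write_text("")
(app / "__init__.py").write_text("")
(app / "common" / "constants.py").write_text("BATCH_SIZE = 100\n")
(app / "validator" / "validator.py").write_text(
    "from flink_app.common.constants import BATCH_SIZE\n"
    "\n"
    "def is_valid(n):\n"
    "    return 0 < n <= BATCH_SIZE\n"
)

# Stand-in for the PYTHONPATH entry: the directory that contains
# flink_app/ becomes a search root for imports.
sys.path.insert(0, str(workspace))
importlib.invalidate_caches()

# Absolute imports rooted at flink_app now resolve, including the one
# inside validator.py itself.
from flink_app.validator.validator import is_valid

print(is_valid(50))
```

With this arrangement, every module in the project can use the same from flink_app.<subpackage> import ... form, regardless of which folder it lives in.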

Related

import issues in monorepo setting

I am struggling with import issues with Python. I am working in a mono repo setting, where other directories are not related to python. Following is the structure of the directory.
monorepo
    services
        app1
        app2
            __init__.py
            src
                __init__.py
                api
                    __init__.py
                    foo1.py
                    foo2.py
        app3
I want to use the import structure of from app2.api.foo1 import Foo1 in foo2.py script.
In both cases, it fails. I see the path to app2 in the sys.path but still python does not see this as a module.
To export the path, I tried
PYTHONPATH="${PYTHONPATH}:$(realpath $(pwd))" in the terminal and
sys.path.append(full_path_app2) in the console. I still get the import error.
Any help on how to solve this?
Python imports can be tricky.
Python will look for importable modules inside the PythonPath (sys.path).
Importable modules are subfolders of any location in the path which contain a __init__.py file.
If you call a script file directly (python script.py), the directory containing that file is added to the path automatically. If you use the module syntax (python -m mymodule), it will search the PYTHONPATH and execute the first module with the name mymodule.
In order to import app2, you need to add monorepo/services to the PythonPath. This should also allow you to directly import files in app2, i.e. from app2 import xyz.
If you are working with monorepos, you need to add all folders which may contain python modules to the path. I would recommend doing it using the environment variable PYTHONPATH instead of modifying sys.path programmatically.
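The difference between adding app2 itself and adding its parent can be checked with a small experiment. This is a sketch under assumptions: it recreates the tree in a temporary directory, assumes api sits under src (so the module path is app2.src.api.foo1 rather than app2.api.foo1), and uses sys.path only to keep the demonstration in one process. The can_import helper and the body of Foo1 are hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Recreate monorepo/services/app2/src/api with package markers.
services = Path(tempfile.mkdtemp()) / "monorepo" / "services"
api = services / "app2" / "src" / "api"
api.mkdir(parents=True)
for pkg in (services / "app2", services / "app2" / "src", api):
    (pkg / "__init__.py").write_text("")
(api / "foo1.py").write_text("class Foo1:\n    name = 'foo1'\n")


def can_import(module_name):
    """Return True if module_name is importable with the current sys.path."""
    importlib.invalidate_caches()
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False


# Adding app2 itself does not help: Python looks for a folder named
# app2 *inside* each path entry, and app2 does not contain app2.
sys.path.insert(0, str(services / "app2"))
added_package_dir = can_import("app2.src.api.foo1")
sys.path.pop(0)

# Adding the parent directory (monorepo/services) does.
sys.path.insert(0, str(services))
added_parent_dir = can_import("app2.src.api.foo1")

print(added_package_dir, added_parent_dir)
```

The same rule applies to PYTHONPATH: the entry should be the folder that contains the top-level package, not the package folder itself.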

How to get project's root path inside my package?

I am trying to create a distributable package in Python. In my package, I need to access a file from the root of the project in which my package is installed. I am trying to do that like this:
Path(__file__).parents[1] / Path(self.file_name)
But it just returns '.', which points to my package's root folder.
For example my package directory structure looks like this,
mypackage
- lib
- myfile.py
- __init__.py
When I convert that package to an installable wheel, I can install it using pip. Now suppose I create a project like this:
myproject
- __init__.py
- config.json
- main.py
- venv
I want to install and use mypackage in myproject. All the packages are installed inside the venv folder. My question is how I can access the config.json file from inside mypackage; to do that, I need to access the root folder of myproject. I hope this is clear.
Any help would be appreciated.
you can use
import os
ROOT_DIR = os.path.dirname(os.path.abspath(__file__)) # This is your Project Root
See also this answer for reference.
If my understanding is correct, a python script in the root directory will import your package. If this is correct, the following code may help:
import os
import __main__ as main
if hasattr(main, '__file__'):
    root_dir = os.path.dirname(main.__file__)
Use importlib.resources to access file-like resources in packages. This uses the import machinery to locate the resource, ensuring that you get the resource in the package used by your program.
For example, if you want the resource as a file:
import importlib.resources
import json
# create a pathlib.Path pointing to the resource
with importlib.resources.path("myproject", "config.json") as json_path:
    with json_path.open() as json_file:
        data = json.load(json_file)
Using importlib will work for any importable package, including those not backed by the filesystem (e.g. a zipped package). Note that if you are just interested in the contents, then importlib.resources.read_text is more convenient instead of creating a file just to discard it.
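For the contents-only case, a minimal sketch follows. It uses importlib.resources.files (available from Python 3.9) rather than read_text, and it builds a throwaway package named demo_project in a temporary directory so that it runs on its own; that package name and the config values are hypothetical.

```python
import importlib
import importlib.resources
import json
import sys
import tempfile
from pathlib import Path

# Create a tiny importable package that ships a config.json resource.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_project"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
(pkg / "config.json").write_text(json.dumps({"debug": True}))

sys.path.insert(0, str(root))
importlib.invalidate_caches()

# files() locates the package through the import machinery;
# read_text() returns the resource contents without an explicit open().
resource = importlib.resources.files("demo_project").joinpath("config.json")
config = json.loads(resource.read_text())

print(config)
```

Note that this only works when the project holding config.json is itself importable as a package, which is the same precondition as for the path-based example above.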

Importing module from a directory above current working directory

First of all, there are a bunch of solutions on stackoverflow regarding this but from the ones I tried none of them is working. I am working on a remote machine (linux). I am prototyping within the dir-2/module_2.py file using an ipython interpreter. Also I am trying to avoid using absolute paths as the absolute path in this remote machine is long and ugly, and I want my code to run on other machines upon download.
My directory structure is as follows:
/project-dir/
    -/dir-1/
        -/__init__.py
        -/module_1.py
    -/dir-2/
        -/__init__.py
        -/module_2.py
        -/module_3.py
Now I want to import module_1 from module_2. However, the solution mentioned in another Stack Overflow post, using
sys.path.append('../..')
import module_2
does not work. I get the error: ModuleNotFoundError: No module named 'module_1'
Moreover, within the ipython interpreter things like import .module_3 within module_2 throws error:
import .module_3
^ SyntaxError: invalid syntax
Isn't the dot operator supposed to work within the same directory as well? Overall, I am quite confused by the importing mechanism. Any help with the initial problem is greatly appreciated! Thanks a lot!
Why it didn't work?
If you run the module_1.py file and want to import module_2, then you need something like
sys.path.append("../dir-2")
If you use sys.path.append("../..") then the folder you added to the path is the folder containing project-dir, and there is no module_2.py file inside it.
import .module_3 is a syntax error in any context; a relative import has to be written as from . import module_3. Even in that form, it does not work if you execute module_2.py directly, because you are then using module_2.py as a script. To use relative imports you need to treat both module_2.py and module_3.py as modules. That is, some other file imports module_2, and module_2 imports something from module_3 using this syntax.
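The script-versus-module distinction can be reproduced with a short experiment. This sketch makes some assumptions: it uses a folder named dir_2 instead of dir-2, because a hyphen is not valid in a Python package name, and the file contents are made up. It runs the same file twice in a subprocess, once as a script and once with -m.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# A package dir_2 whose module_2 uses an explicit relative import.
project = Path(tempfile.mkdtemp())
pkg = project / "dir_2"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
(pkg / "module_3.py").write_text("VALUE = 3\n")
(pkg / "module_2.py").write_text(
    "from . import module_3\n"
    "print(module_3.VALUE)\n"
)

# Run as a script: module_2 has no parent package, so the relative
# import fails with an ImportError.
as_script = subprocess.run(
    [sys.executable, str(pkg / "module_2.py")],
    capture_output=True, text=True,
)

# Run as a module from the project directory: dir_2 is the parent
# package, so the relative import resolves.
as_module = subprocess.run(
    [sys.executable, "-m", "dir_2.module_2"],
    cwd=project, capture_output=True, text=True,
)

print(as_script.returncode, as_module.returncode)
```

The file is identical in both runs; only the way Python is told to load it changes whether the relative import has a package to resolve against.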
Suggestion on how you can proceed
One possible solution that solves both problems is properly organizing the project and (optionally, but a good idea) packaging your library (that is, making your code "installable"). Then, once your library is installed (in the virtual environment you are working in), you don't need hacky sys.path solutions. You will be able to import your library from any folder.
Furthermore, don't treat your modules as scripts (don't run your modules). Use a separate python file as your "executable" (or entry point) and import everything you need from there. With this, relative imports in your module*.py files will work correctly and you don't get confused.
A possible directory structure could be
/project-dir/
    - apps/
        - main.py
    - yourlib/
        -/__init__.py
        -/dir-1/
            -/__init__.py
            -/module_1.py
        -/dir-2/
            -/__init__.py
            -/module_2.py
            -/module_3.py
Notice that the yourlib folder as well as its subfolders contain an __init__.py file. With this structure, you only run main.py (the name does not need to be main.py).
Case 1: You don't want to package your library
If you don't want to package your library, then you can add sys.path.append("../") in main.py to add the project-dir/ folder to the path (the relative path is resolved against the current working directory, so this assumes you run main.py from inside apps/). With that, your yourlib library will be "importable" in main.py. You can do something like from yourlib import module_2 and it will work correctly (and module_2 can use relative imports). Alternatively, you can put main.py directly in the project-dir/ folder, and then you don't need to change sys.path at all, since Python automatically adds the directory containing the executed script to sys.path.
Note that you can also have a tests folder inside project-dir and to run a test file you can do the same as you did to run main.py.
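For Case 1, one way to avoid depending on the directory you launch from is to derive the project folder from the script's own location. The following is a sketch, assuming the apps/main.py layout shown above; project_dir_for and add_project_dir_to_path are hypothetical helper names.

```python
import sys
from pathlib import Path


def project_dir_for(script_file):
    """Return the folder two levels above script_file (apps/main.py -> project-dir)."""
    return Path(script_file).resolve().parents[1]


def add_project_dir_to_path(script_file):
    """Append the project folder to sys.path once and return it as a string."""
    project_dir = str(project_dir_for(script_file))
    if project_dir not in sys.path:
        sys.path.append(project_dir)
    return project_dir


# In apps/main.py this would be called as
#     add_project_dir_to_path(__file__)
# before any "from yourlib import ..." line.
```

Because the path is computed from __file__, the import keeps working whether main.py is started from apps/, from project-dir/, or from anywhere else.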
Case 2: You want to package your library
The previous solution already solves your problems, but going the extra mile adds some benefits, such as dependency management and no need to change sys.path no matter where you are. There are several options to package your library and I will show one option using poetry due to its simplicity.
After installing poetry, you can run the command below in a terminal to create a new project
poetry new mylib
This creates the following folder structure
mylib/
- README.rst
- mylib/
- __init__.py
- pyproject.toml
- tests
You can then add the apps folder if you want, as well as subfolders inside mylib/ (each with a __init__.py file).
The pyproject.toml file specifies the dependencies and project metadata. You can edit it by hand and/or use poetry to add new dependencies, such as
poetry add pandas
poetry add --dev mypy
to add pandas as a dependency and mypy as a development dependency, for instance. After that, you can run
poetry install
to create a virtual environment and install your library in it (poetry build, by contrast, only produces the sdist and wheel archives for distribution). You can activate the virtual environment with poetry shell and you will be able to import your library from anywhere. Note that you can change your library files without the need to run poetry install again.
Finally, if you want to publish your library on PyPI for everyone to see, you can build and upload it with
poetry publish --build --username your_pypi_username --password your_pypi_password
TL; DR
Use an organized project structure with a clear place for the scripts you execute. Particularly, it is better if the script you execute is outside the folder with your modules. Also, don't run a module as a script (otherwise you can't use relative imports).

How to import external scripts in a Airflow DAG with Python?

I have the following structure:
And I try to import the script inside some files of the inbound_layer like so:
import calc
However I get the following error message on Airflow web:
Any idea?
For an Airflow DAG, when you import your own module, you need to make sure of two things:
Where is the module? You need to find the root path of your Airflow folder. For example, on my dev box, the folders are:
~/projects/data/airflow/teams/team_name/projects/default/dags/dag_names/dag_files.py
The root is airflow, so if I put my module my_module in
~/projects/data/airflow/teams/team_name/common
Then I need to use
from teams.team_name.common import my_module
In your case, if the root is the upper folder of bi, and you put the scripts of calc in bi/inbound_layer/test.py then you can use:
from bi.inbound_layer.test import calc
You must also make sure you have __init__.py files in the directory structure for the imports to function properly. You should have an empty __init__.py file in each folder in the path; it indicates that the directory is part of the Airflow packages. In your case, you can run touch __init__.py (CLI) under the bi and inbound_layer folders to create the empty __init__.py files.
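The touch step can also be scripted. This is a small sketch in plain Python with no Airflow dependency; ensure_init_files is a hypothetical helper, and the temporary directory stands in for the import root (the folder above bi).

```python
import tempfile
from pathlib import Path


def ensure_init_files(root, package_path):
    """Create an empty __init__.py in every folder from root down to package_path.

    root itself is treated as the import root and is left untouched.
    Returns the list of __init__.py files that were created.
    """
    created = []
    current = Path(root)
    for part in Path(package_path).parts:
        current = current / part
        current.mkdir(exist_ok=True)
        init_file = current / "__init__.py"
        if not init_file.exists():
            init_file.touch()
            created.append(init_file)
    return created


# Stand-in for the folder above bi/ (an assumption for this demo).
dags_root = Path(tempfile.mkdtemp())
created = ensure_init_files(dags_root, "bi/inbound_layer")
print([p.relative_to(dags_root).as_posix() for p in created])
```

Running it a second time creates nothing, so it is safe to call repeatedly.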
Airflow adds the dags/, plugins/, and config/ directories in the Airflow home to PYTHONPATH by default, so you can, for example, create a folder common under the dags folder and create a file there (scriptFileName). Assuming that script has some class (GetJobDoneClass) you want to import in your DAG, you can do it like this:
from common.scriptFileName import GetJobDoneClass
I needed to insert the following script at the top of ren.py:
import sys, os
from airflow.models import Variable
DAGBAGS_DIR = Variable.get('DAGBAGS_DIR')
sys.path.append(DAGBAGS_DIR + '/bi/inbound_layer/')
This makes the packages in the current folder available.

python path during development of a package

Let's say my project structure looks like this:
app/
main.py
modules/
__init__.py
validation.py
configuration.py
modules package contains reusable code.
main.py executes main application logic.
When I try this in main.py
from modules import validation
I get an error which says that import inside of the validation failed. Validation tries to import configuration and I get 'no module named configuration'
I am using Anaconda distribution on windows.
What is the best way of handling PYTHONPATH during development of the package ?
Is there a way to use virtualenv (or a conda env) to get a package that is in development onto the PYTHONPATH without changing sys.path from the code?
What is the preferred practice when developing a package ?
I've also tried adding modules (folder) package to the lib/site-packages but it still didn't work.
Change your import in validation.py to:
from . import configuration
This is needed for Python 3 but also works with Python 2.
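The difference between the two import styles can be seen side by side. This is a sketch under assumptions: the package is named modules_demo to avoid clashing with anything already installed, validation_old.py is a hypothetical copy of validation.py that still uses the bare import, and TIMEOUT is a made-up setting.

```python
import importlib
import sys
import tempfile
from pathlib import Path

app_dir = Path(tempfile.mkdtemp())
pkg = app_dir / "modules_demo"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
(pkg / "configuration.py").write_text("TIMEOUT = 30\n")
# Implicit relative import: Python 3 searches sys.path, not the package.
(pkg / "validation_old.py").write_text("import configuration\n")
# Explicit relative import: resolved against the containing package.
(pkg / "validation.py").write_text(
    "from . import configuration\n"
    "TIMEOUT = configuration.TIMEOUT\n"
)

# Mimic running main.py from app/: its folder is on sys.path.
sys.path.insert(0, str(app_dir))
importlib.invalidate_caches()

try:
    importlib.import_module("modules_demo.validation_old")
    old_style_works = True
except ModuleNotFoundError:
    old_style_works = False

validation = importlib.import_module("modules_demo.validation")
print(old_style_works, validation.TIMEOUT)
```

An absolute import, from modules_demo import configuration in this sketch, would work as well, since the folder containing the package is on the path.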
