How to use Kedro with Great Expectations? - python

I am using Kedro to create an ETL pipeline, and column-specific validations are done using Great Expectations. There is a hooks.py file listed in the Kedro documentation here, and this hook is registered as per the instructions in the Kedro docs.
This is my current workflow:
Created a Kedro project using kedro new; the project name is ecom_analytics
Stored the datasets dataset_raw.csv and dataset_validate.csv in the data/01_raw folder
Initialized a Great Expectations project using great_expectations init
Created a new datasource using great_expectations datasource new; the name I added was main_datasource
Created a new expectation suite using great_expectations suite new; this suite is called data.raw and was built with the data assistant
Edited the suite using great_expectations suite edit data.raw
Created the catalog entries for the datasets in data/01_raw
Added the Great Expectations hooks.py given in the Kedro documentation and registered the hook in the settings.py file
Tried kedro viz --autoreload; this works and shows the visualisation
When using kedro run, it gives the following error:
│ /opt/conda/lib/python3.9/site-packages/great_expectations/data_context/data_context/abstract_dat │
│ a_context.py:758 in get_batch │
│ │
│ 755 │ │ else: │
│ 756 │ │ │ data_asset_type = arg3 │
│ 757 │ │ batch_parameters = kwargs.get("batch_parameters") │
│ ❱ 758 │ │ return self._get_batch_v2( │
│ 759 │ │ │ batch_kwargs=batch_kwargs, │
│ 760 │ │ │ expectation_suite_name=expectation_suite_name, │
│ 761 │ │ │ data_asset_type=data_asset_type, │
│ │
│ /opt/conda/lib/python3.9/site-packages/great_expectations/data_context/data_context/abstract_dat │
│ a_context.py:867 in _get_batch_v2 │
│ │
│ 864 │ │ │ expectation_suite = self.get_expectation_suite(expectation_suite_name) │
│ 865 │ │ │
│ 866 │ │ datasource = self.get_datasource(batch_kwargs.get("datasource")) # type: ignore │
│ ❱ 867 │ │ batch = datasource.get_batch( # type: ignore[union-attr] │
│ 868 │ │ │ batch_kwargs=batch_kwargs, batch_parameters=batch_parameters │
│ 869 │ │ ) │
│ 870 │ │ if data_asset_type is None: │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'Datasource' object has no attribute 'get_batch'
Please use the latest develop branch of the following project to look into the issue: https://github.com/DhavalThkkar/ecom-analytics
This is extremely difficult to work with. I have placed the dataset I want to validate inside the data/01_raw folder. If someone can help me with an end-to-end example for this repo, it would really be appreciated.

You can check a minimal example here: https://github.com/erwinpaillacan/kedro-great-expectations-example
Basically, you need to define:
a memory dataset, which is already defined in the example
your expectation suites
a checkpoint linked to each expectation suite
a mapper from datasets to checkpoints: conf/base/parameters/great_expectations_hook.yml
A heavily simplified sketch of the resulting hook is shown after this list.
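For orientation, here is a heavily simplified sketch of what such a hook can look like. The mapping, the checkpoint names, and the bare run_checkpoint call are my own illustrative assumptions rather than the linked example's exact code (the example additionally routes the in-memory data into the checkpoint through a runtime batch request):

# src/<package>/hooks.py -- illustrative sketch, not the example repo's exact code
import great_expectations as ge
from kedro.framework.hooks import hook_impl


class GreatExpectationsHooks:
    # Illustrative mapping: Kedro catalog dataset name -> GE checkpoint name,
    # mirroring what conf/base/parameters/great_expectations_hook.yml encodes.
    DATASET_TO_CHECKPOINT = {
        "dataset_raw": "raw_checkpoint",
        "dataset_validate": "validate_checkpoint",
    }

    @hook_impl
    def before_node_run(self, inputs):
        """Validate every node input that has a checkpoint registered."""
        context = ge.data_context.DataContext()  # loads great_expectations.yml
        for dataset_name in inputs:
            checkpoint_name = self.DATASET_TO_CHECKPOINT.get(dataset_name)
            if checkpoint_name is not None:
                context.run_checkpoint(checkpoint_name=checkpoint_name)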

You need to create the datasource. For more information (and example code to resolve a very similar issue), see https://github.com/great-expectations/great_expectations/issues/1389#issuecomment-624955813.
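As a concrete illustration of the version mismatch behind the AttributeError: the hook from the Kedro docs calls context.get_batch, which belongs to the legacy v2 (batch_kwargs) API, while recent versions of great_expectations datasource new create a v3-style Datasource that has no get_batch method. A minimal sketch of registering a legacy pandas datasource instead (whether this fits your GE version is an assumption on my part):

# one_off_setup.py -- sketch, assuming a GE release that still ships the
# legacy PandasDatasource exposing the get_batch method the hook relies on
import great_expectations as ge

context = ge.data_context.DataContext()  # loads great_expectations.yml
context.add_datasource(
    "main_datasource",                    # the name used in the question
    class_name="PandasDatasource",        # legacy datasource: has get_batch
    module_name="great_expectations.datasource",
)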

Related

Coverage unittest fails in Python 3.8

I am struggling with implementing unittest for subdirectories. I have the following project structure:
project
│ README.md
│ __init__.py
│
└───common
│ __init__.py
│ └───my_func
│ │ __init__.py
│ │ func1.py
│ │ func2.py
│
└───scripts
│ __init__.py
│ └───folder_1
│ │ __init__.py
│ │ code1.py
│ │ code2.py
│
│ └───folder_2
│ │ __init__.py
│ │ code3.py
│ │ code4.py
│ │
│ └───tests
│ │ └───test1
│ │ │ __init__.py
│ │ │ test_code3.py
│ │ │ test_code4.py
│ │ │
│ │ └───test2
│ │ │ __init__.py
│ │ │ test_code3.py
│ │ │ test_code4.py
I set the working directory to be ./project. In my code1.py file I import common.my_func.func1 and it runs normally.
Now, I am trying to implement some unittest functions in ./project/scripts/tests. To do so, I am using the coverage package and running the command: coverage run --source=scripts -m unittest discover scripts/tests. However, when doing so, I get the following error:
ModuleNotFoundError: No module named 'common.my_func'
Oddly, the tests work perfectly when I run them for a single test folder instead of the whole tree: coverage run --source=scripts -m unittest discover scripts/tests/test1.
I tried multiple combinations of removing the --source option, being more specific with the folder, and so on. Have any of you faced similar problems with Python 3.8?
Thank you very much in advance!
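For what it is worth, a sketch of one common workaround: drive discovery from a small runner placed in the project root, with an explicit top-level directory so that common and scripts are importable as top-level packages while tests are collected (the file name run_tests.py is my own, and this assumes scripts/tests is itself a package with an __init__.py):

# run_tests.py, placed in ./project -- discovery with an explicit top level,
# so `import common.my_func` resolves while test modules are collected
import unittest

loader = unittest.TestLoader()
suite = loader.discover(start_dir="scripts/tests", top_level_dir=".")
unittest.TextTestRunner(verbosity=2).run(suite)

It can then run under coverage as coverage run --source=scripts run_tests.py; the equivalent command-line form is coverage run --source=scripts -m unittest discover -s scripts/tests -t . (the -t flag sets the top-level directory).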

Own module import not working and no solution in other questions found

This is a common question, but I have not found a solution after reading more than ten question threads.
I have the following directory:
ParentFolder
├───notebooks
│       .gitkeep
│       1raw_eda.ipynb
│       2raw_feature_engineering.ipynb
│       3advanced_eda.ipynb
│       4advanced_feature_engineering.ipynb
│       5modelling.ipynb
│       __init__.py
│
├───src
│   │   __init__.py
│   └───features
│           .gitkeep
│           build_features.py
│           __init__.py
│
└───__init__.py
I am in 1raw_eda.ipynb and my os.getcwd() is the notebooks folder. If I now try import ParentFolder.src.features.build_features, I get a No module named 'ParentFolder' error, and I don't know why, as I have an __init__.py everywhere.
Additional information: ParentFolder is capitalized like this, and I just renamed it from Parent-Folder to ParentFolder in case that is relevant. I don't know whether the import worked before, as I never tried it (in fact, I renamed the folder specifically to allow for the import).
Thanks already!
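For reference, a minimal sketch of what usually makes such an import resolve from a notebook: put the directory that contains ParentFolder (not ParentFolder itself) on sys.path, because that containing directory is where import ParentFolder.… is searched. The parents[1] index below assumes the notebook really lives in ParentFolder/notebooks:

# first notebook cell -- sketch; Path.cwd() is ParentFolder/notebooks,
# so parents[1] is the directory that *contains* ParentFolder
import sys
from pathlib import Path

sys.path.insert(0, str(Path.cwd().parents[1]))

from ParentFolder.src.features import build_features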

Python Setuptools including data with a specific folder structure

I've done a bit of searching and can't find any examples that match my scenario.
I'm trying to create an x-lang module to return some data that we use internally.
My git repo would contain both the PowerShell and Python versions of the module, and each module would reference the static data.
I'm struggling to package it in a way that doesn't have a weird folder structure and works from a Python packaging point of view.
The folder structure looks like this (anonymised):
<Root of Repo>
│ .gitignore
│ MANIFEST.in
│ pyproject.toml
│ README.md
│ requirements.txt
│ setup.py
│
├───data
│ 1.json
│ 2.json
│ 3.json
│ 4.json
│
├───src
│ ├───powershell
│ │ │ readme.md
│ │ │ test.ps1
│ │ │ xlangTest.psd1
│ │ │
│ │ ├───classes
│ │ ├───private
│ │ │ get-something1.ps1
│ │ │
│ │ └───public
│ │ Get-DataExample.ps1
│ │
│ └───python
│ │ 1.py
│ │ 2.py
│ │ 3.py
│ │ 4.py
│ │ 5.py
│ │ 6.py
│ │ 7.py
│ │ 8.py
│ │ 9.py
│ │ __init__.py
│ │
│ └───ff.ff.ff.egg-info
│ dependency_links.txt
│ PKG-INFO
│ SOURCES.txt
│ top_level.txt
│
├───tests
│ ├───powershell
│ │ Get-DataExample.tests.ps1
│ │ get-something1.tests.ps1
│ │
│ └───python
and my setup.py looks like this:
import setuptools

with open("README.md", "r", encoding="utf-8") as fh:
    long_description = fh.read()

setuptools.setup(
    name="ff.ff.ff",
    version="0.0.1",
    author="ffff",
    author_email="david.wallis#fff.fff",
    description="An xlang module test",
    long_description=long_description,
    long_description_content_type="text/markdown",
    url="https://ffff/ffff",
    project_urls={
        "Bug Tracker": "https://ffff/ffff",
    },
    classifiers=[
        "Programming Language :: Python :: 3",
        "Operating System :: OS Independent",
    ],
    package_dir={"": "src"},
    # data_files={
    #     "data": ['data/*.json'],
    # },
    # include_package_data=True,
    # package_data={
    #     "data": ["../data/*.json"]
    # },
    packages=setuptools.find_packages(where="src"),
    # install_requires=[
    #     'readContent',
    #     'json',
    # ],
    python_requires=">=3.6",
)
MANIFEST.in contains:
graft data
When I package it, I end up with the following in the tar file within the tar.gz:
C:.
│ ff.ff.ff-0.0.1.tar
│
└───ff.ff.ff-0.0.1
│ #PaxHeader
│
└───ff.ff.ff-0.0.1
│ MANIFEST.in
│ PKG-INFO
│ pyproject.toml
│ README.md
│ setup.cfg
│ setup.py
│
├───data
│ 1.json
│ 2.json
│ 3.json
│ 4.json
│
└───src
├───python
│ 1.py
│ 2.py
│ 3.py
│ 4.py
│ 5.py
│ 6.py
│ 7.py
│ 8.py
│ 9.py
│ __init__.py
│
└───ff.ff.ff.egg-info
dependency_links.txt
PKG-INFO
SOURCES.txt
top_level.txt
I guess what I'm looking for is a sanity check: is this OK, or can I somehow flatten src/python in the distribution, or even rename it to the module name?
And I suppose the key point here is that I want the data to be common across the two modules; ultimately I want to do the same for the test cases next.
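On the rename question, a sketch of one option: package_dir can map a proper package name onto src/python, so the distribution ships a normally named package instead of a folder called python (the name xlangtest below is purely illustrative, not taken from the question):

# setup.py fragment -- sketch; "xlangtest" is an illustrative package name
import setuptools

setuptools.setup(
    name="ff.ff.ff",
    version="0.0.1",
    package_dir={"xlangtest": "src/python"},  # ship src/python as `xlangtest`
    packages=["xlangtest"],
    python_requires=">=3.6",
)

Note that package_data can only include files that live inside a package, so the shared data folder at the repo root would either have to move into the Python package (and be read with importlib.resources) or remain sdist-only via the MANIFEST.in graft.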

Docstrings not populating Sphinx documentation

I am trying to generate Sphinx documentation for my Python application. Originally I had a complex structure as follows...
venv
docs
├───source
│   ├───_static
│   ├───_templates
│   ├───conf.py
│   ├───index.rst
│   ├───modules.rst
│   └───...
├───build
├───make.bat
├───Makefile
└───MyCode
    ├───Utilities
    │   └───class1.py
    ├───Configurations
    │   ├───Archive
    │   ├───API1_Configurations
    │   │   └───Config1.ini
    │   ├───API2_Configurations
    │   │   └───Config2.ini
    │   ├───API3_Configurations
    │   │   └───Config3.ini
    │   └───API4_Configurations
    │       └───Config4.ini
    └───APIs
        ├───API1
        │   ├───Class1.py
        │   └───Class2.py
        ├───API2
        │   ├───Class1.py
        │   ├───Class2.py
        │   └───Supporting
        │       └───Class1.py
        └───API3
            ├───Support
            │   ├───SupportPackage1
            │   ├───Support Package2
            │   │   └───Class1.py
            │   └───__pycache__
            └───Class1.py
In this case, my source code exists in ./docs/MyCode.
I am using...
Python 3.8
Sphinx 4.2 (although I've tried with many versions)
NumPy docstrings
I have...
Added the following extensions
sphinx.ext.autodoc
sphinx.ext.apidoc
sphinx.ext.napoleon
Pointed conf.py to my code using both a relative path and absolute path (relative path being ../MyCode).
For some reason, the closest I can get to actually populating the HTML pages with my documentation is simply having the classes in the index toctree. They link out to blank HTML pages without my Python docstrings.
Does anyone have any idea why it won't pick up my docstrings?
OK... feeling pretty dumb about this, but it was just because of the warnings.
Sphinx requires that the package be importable: when Sphinx imports your modules to read their docstrings, they must load without errors. I had some absolute references in my import statements which caused the imports to fail, so the actual docstrings could not be pulled even though the class names still showed in my Sphinx HTML.
Basically, make sure your code imports cleanly before pointing Sphinx at it.
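For completeness, a minimal conf.py path-setup sketch matching the layout above (code in ./docs/MyCode, conf.py in ./docs/source, hence the ../MyCode relative path mentioned in the question):

# docs/source/conf.py -- sketch; autodoc can only pull docstrings if every
# module under MyCode imports without errors
import os
import sys

sys.path.insert(0, os.path.abspath("../MyCode"))

extensions = [
    "sphinx.ext.autodoc",
    "sphinx.ext.napoleon",  # NumPy-style docstrings
]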

Converting my project to a package with modules

I am converting my Python project into packages, with sub-packages, with modules. The tree structure is shown below (sorry for the long tree). When I run run.py, it imports MyEnvironment.environment, which in turn has statements like from station import Station. It is at this stage that I get an error saying ModuleNotFoundError: No module named 'station'. I am not able to figure out why this is happening. Note that it works when I do sys.path.append('.\Environment'), but it should work without that since I already added __init__.py. Could someone help here?
C:.
├───.vscode
│ settings.json
│
└───DRLOD-Package
│ run.py
│ __init__.py
│
├───MyEnvironment
│ │ agv.py
│ │ environment.py
│ │ job.py
│ │ job_queue.py
│ │ parent_area.py
│ │ path_finding.py
│ │ pdnode.py
│ │ policy.py
│ │ request.py
│ │ station.py
│ │ stats.py
│ │ test.py
│ │ __init__.py
In environment.py, use from .station import Station.
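The reason, briefly: inside a package, a bare from station import Station is treated as a top-level import, which only resolves if station.py's own folder is on sys.path; the relative form resolves within the package instead. A sketch:

# MyEnvironment/environment.py -- the two spellings that resolve once run.py
# imports this module as part of the MyEnvironment package
from .station import Station            # explicit relative import

# equivalent package-qualified absolute import:
# from MyEnvironment.station import Station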
