This is my first time using Google's Vertex AI Pipelines. I checked this codelab, as well as this post and this post, on top of some links derived from the official documentation. I decided to put all that knowledge to work in a toy example: a pipeline consisting of two components, "get-data" (which reads some .csv file stored in Cloud Storage) and "report-data" (which basically returns the shape of the .csv data read in the previous component). Furthermore, I was careful to include some suggestions provided in this forum. The code I currently have goes as follows:
from kfp.v2 import compiler
from kfp.v2.dsl import pipeline, component, Dataset, Input, Output
from google.cloud import aiplatform

# Components section
@component(
    packages_to_install=[
        "google-cloud-storage",
        "pandas",
    ],
    base_image="python:3.9",
    output_component_file="get_data.yaml"
)
def get_data(
    bucket: str,
    url: str,
    dataset: Output[Dataset],
):
    import pandas as pd
    from google.cloud import storage
    storage_client = storage.Client("my-project")
    bucket = storage_client.get_bucket(bucket)
    blob = bucket.blob(url)
    blob.download_to_filename('localdf.csv')
    # path = "gs://my-bucket/program_grouping_data.zip"
    df = pd.read_csv('localdf.csv', compression='zip')
    df['new_skills'] = df['new_skills'].apply(ast.literal_eval)
    df.to_csv(dataset.path + ".csv", index=False, encoding='utf-8-sig')

@component(
    packages_to_install=["pandas"],
    base_image="python:3.9",
    output_component_file="report_data.yaml"
)
def report_data(
    inputd: Input[Dataset],
):
    import pandas as pd
    df = pd.read_csv(inputd.path)
    return df.shape

# Pipeline section
@pipeline(
    # Default pipeline root. You can override it when submitting the pipeline.
    pipeline_root=PIPELINE_ROOT,
    # A name for the pipeline.
    name="my-pipeline",
)
def my_pipeline(
    url: str = "test_vertex/pipeline_root/program_grouping_data.zip",
    bucket: str = "my-bucket"
):
    dataset_task = get_data(bucket, url)
    dimensions = report_data(
        dataset_task.output
    )

# Compilation section
compiler.Compiler().compile(
    pipeline_func=my_pipeline, package_path="pipeline_job.json"
)

# Running and submitting job
from datetime import datetime
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
run1 = aiplatform.PipelineJob(
    display_name="my-pipeline",
    template_path="pipeline_job.json",
    job_id="mlmd-pipeline-small-{0}".format(TIMESTAMP),
    parameter_values={"url": "test_vertex/pipeline_root/program_grouping_data.zip", "bucket": "my-bucket"},
    enable_caching=True,
)
run1.submit()
I was happy to see that the pipeline compiled with no errors, and I managed to submit the job. However, my happiness was short-lived: when I went to Vertex AI Pipelines, I stumbled upon the following error:
The DAG failed because some tasks failed. The failed tasks are: [get-data].; Job (project_id = my-project, job_id = 4290278978419163136) is failed due to the above error.; Failed to handle the job: {project_number = xxxxxxxx, job_id = 4290278978419163136}
I did not find any related info on the web, nor could I find any log or anything similar, and I feel a bit overwhelmed that the solution to this (seemingly) easy example is still eluding me.
Quite obviously, I don't know what I am doing wrong, or where. Any suggestions?
With some suggestions provided in the comments, I think I managed to make my demo pipeline work. I will first include the updated code:
from kfp.v2 import compiler
from kfp.v2.dsl import pipeline, component, Dataset, Input, Output
from datetime import datetime
from google.cloud import aiplatform
from typing import NamedTuple

# Importing 'COMPONENTS' of the 'PIPELINE'
@component(
    packages_to_install=[
        "google-cloud-storage",
        "pandas",
    ],
    base_image="python:3.9",
    output_component_file="get_data.yaml"
)
def get_data(
    bucket: str,
    url: str,
    dataset: Output[Dataset],
):
    """Reads a csv file, from some location in Cloud Storage"""
    import ast
    import pandas as pd
    from google.cloud import storage

    # 'Pulling' demo .csv data from a known location in GCS
    storage_client = storage.Client("my-project")
    bucket = storage_client.get_bucket(bucket)
    blob = bucket.blob(url)
    blob.download_to_filename('localdf.csv')

    # Reading the pulled demo .csv data
    df = pd.read_csv('localdf.csv', compression='zip')
    df['new_skills'] = df['new_skills'].apply(ast.literal_eval)
    df.to_csv(dataset.path + ".csv", index=False, encoding='utf-8-sig')

@component(
    packages_to_install=["pandas"],
    base_image="python:3.9",
    output_component_file="report_data.yaml"
)
def report_data(
    inputd: Input[Dataset],
) -> NamedTuple("output", [("rows", int), ("columns", int)]):
    """From a passed csv file existing in Cloud Storage, returns its dimensions"""
    import pandas as pd
    df = pd.read_csv(inputd.path + ".csv")
    return df.shape

# Building the 'PIPELINE'
@pipeline(
    # i.e. in my case: PIPELINE_ROOT = 'gs://my-bucket/test_vertex/pipeline_root/'
    # Can be overridden when submitting the pipeline
    pipeline_root=PIPELINE_ROOT,
    name="readcsv-pipeline",  # Your own naming for the pipeline.
)
def my_pipeline(
    url: str = "test_vertex/pipeline_root/program_grouping_data.zip",
    bucket: str = "my-bucket"
):
    dataset_task = get_data(bucket, url)
    dimensions = report_data(
        dataset_task.output
    )

# Compiling the 'PIPELINE'
compiler.Compiler().compile(
    pipeline_func=my_pipeline, package_path="pipeline_job.json"
)

# Running the 'PIPELINE'
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
run1 = aiplatform.PipelineJob(
    display_name="my-pipeline",
    template_path="pipeline_job.json",
    job_id="mlmd-pipeline-small-{0}".format(TIMESTAMP),
    parameter_values={
        "url": "test_vertex/pipeline_root/program_grouping_data.zip",
        "bucket": "my-bucket"
    },
    enable_caching=True,
)

# Submitting the 'PIPELINE'
run1.submit()
Now, I will add some complementary comments which, in sum, managed to solve my problem:
First, having the "Logs Viewer" role (roles/logging.viewer) enabled for your user will greatly help in troubleshooting any existing error in your pipeline (note: that role worked for me, but you might want to look for a better-matching role for your own purposes here). Those errors appear as logs, which can be accessed from the pipeline run's "Logs" tab in the console.
NOTE: When the logs are displayed, it might be helpful to carefully check each entry (close to the time when you created your pipeline), as generally each of them corresponds to a single warning or error line.
Second, the output of my pipeline was a tuple. In my original approach, I just returned the plain tuple, but it is advised to return a NamedTuple instead. In general, if you need to input or output one or more "small values" (int or str, for whatever reason), pick a NamedTuple to do so.
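For illustration, here is a minimal sketch of a component returning small values as a NamedTuple; the component name and the returned numbers are made up for the example:
from typing import NamedTuple
from kfp.v2.dsl import component

@component(base_image="python:3.9")
def get_dimensions() -> NamedTuple("output", [("rows", int), ("columns", int)]):
    # Hypothetical small values; a real component would compute these.
    return (100, 5)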
Third, when the connection between your components is an Input[Dataset] or Output[Dataset], adding the file extension is needed (and quite easy to forget). Take for instance the output of the get_data component, and notice how the data is recorded by specifically adding the file extension, i.e. dataset.path + ".csv".
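In other words, the extension must match on both sides of the hand-off; both lines below are taken from the components above:
# In get_data (the producer): write with an explicit extension
df.to_csv(dataset.path + ".csv", index=False, encoding='utf-8-sig')
# In report_data (the consumer): read back with the same extension
df = pd.read_csv(inputd.path + ".csv")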
Of course, this is a very tiny example, and it can easily scale into a huge project; however, as some sort of "Hello, Vertex AI Pipelines!" it will work well.
Thank you.
Thanks for your write-up. Very helpful! I had the same error, but it turned out to be for a different reason, so I am noting it here...
In my pipeline definition step I have the following parameters...
def my_pipeline(bq_source_project: str = BQ_SOURCE_PROJECT,
                bq_source_dataset: str = BQ_SOURCE_DATASET,
                bq_source_table: str = BQ_SOURCE_TABLE,
                output_data_path: str = "crime_data.csv"):
My error was that when I ran the pipeline, I did not pass these same parameters. Below is the fixed version...
job = pipeline_jobs.PipelineJob(
    project=PROJECT_ID,
    location=LOCATION,
    display_name=PIPELINE_NAME,
    job_id=JOB_ID,
    template_path=FILENAME,
    pipeline_root=PIPELINE_ROOT,
    parameter_values={'bq_source_project': BQ_SOURCE_PROJECT,
                      'bq_source_dataset': BQ_SOURCE_DATASET,
                      'bq_source_table': BQ_SOURCE_TABLE}
)
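Once the keys in parameter_values mirror the parameters declared in the pipeline function's signature, the job can be submitted as usual (just a sketch; run() would work as well):
# Submit the now correctly parameterized job
job.submit()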
I have been using pytest for some time now to write simple tests (like the ones you find in tutorials and YouTube videos), and I thought it was now time to start writing actual tests for our Python scripts. The scripts are way more advanced than any shown in tutorials, so I am getting a bit stuck. I do not want the entire correct answer, but rather a nudge in the right direction, if possible. Here is my issue:
We have a script that reads a .md text file and converts it to a PDF file based on an external template. Part of the script is below (I removed most of it because I first just want to have one running test):
class DocumentationEngine:
    def __init__(self, title, subtitle, series, style='TIIStyle_Digital_Aug_2020', templateFile='template.docet', tableOfContents=True, listOfFigures=False, listOfTables=False):
        self.title = title
        self.subtitle = subtitle
        self.series = series
        self.style = style
        self.template = {}
        self.hasTOC = tableOfContents
        self.hasLOF = listOfFigures
        self.hasLOT = listOfTables
        self.loadTemplate(templateFile)

    def loadTemplate(self, file='template.docet'):
        with open(file, "r") as templatefile:
            lines = templatefile.readlines()
            key = "dummy"
            value = ""
            for line in lines:
                line = line.strip()
                if line.startswith('[') and line.endswith(']'):
                    self.template[key] = value
                    key = line[1:-1]
                    value = ""
                else:
                    value += line + '\n'

    def build(self, versions=[], content='', filename='Documenter\\_Autogenerated'):
        document = self.template["doc"]
        document = document.replace("%%style%%", self.style)
        document = document.replace("%%body%%",
                                    self.buildFirstPage() +
                                    self.buildTableOfContents() +
                                    self.buildListOfFigures() +
                                    self.buildListOfTables() +
                                    self.buildVersionTable(versions, filename) +
                                    self.buildContentPages(content=content) +
                                    self.buildLastPage()
                                    )
        return document

    def buildLastPage(self):
        return self.template["last_page"]
I am trying to write a simple unit test for the buildLastPage method and have been stuck for several days now.
I am not sure whether I need to mock the template file, use a fixture, and/or whether I can actually test only that method with all its dependencies.
I started with the following:
from doceng import DocumentationEngine
import pytest

class Test:
    def test_buildLastPage(self):
        build_last_page = DocumentationEngine()
        assert build_last_page.template(1) == 1
which gives me an error regarding 3 required arguments. When adding the arguments like this:
from doceng import DocumentationEngine
import pytest

class Test:
    def test_buildLastPage(self, title, subtitle, series):
        build_last_page = DocumentationEngine()
        assert build_last_page.template(1) == 1
which gives me an error that the fixture is not found.
I added a fixture in the conftest.py file like this:
import pytest
from doceng import DocumentationEngine

@pytest.fixture
def title(title):
    return title("test")
which will get me another error: "recursive dependency involving fixture 'title' detected".
I'm quite stuck, so any nudge in the right direction for a newbie would be highly appreciated.
The fixture error concerns your test function test_buildLastPage. The way you are using it, it only needs the self argument.
A test function in pytest without any decorators always expects to find fixtures that have the same names as its arguments. You did not define any such fixtures and also do not use the arguments in your function. Therefore, you can remove them safely.
The actual error points to DocumentationEngine(). The class expects 3 arguments when initializing the object, but you passed none. Check your __init__ function again to find the proper arguments.
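As a nudge in that direction, a minimal sketch of a passing test might look like the following; the constructor arguments and the tmp_path-based template file are illustrative, not taken from the original project:
import pytest
from doceng import DocumentationEngine

@pytest.fixture
def engine(tmp_path):
    # Write a tiny template file. Note the trailing [end] header: loadTemplate
    # only stores a section when it reaches the *next* header, so without it
    # the last_page section would never land in self.template.
    template = tmp_path / "template.docet"
    template.write_text("[doc]\n%%body%%\n[last_page]\nThe end.\n[end]\n")
    return DocumentationEngine("a title", "a subtitle", "a series",
                               templateFile=str(template))

class Test:
    def test_buildLastPage(self, engine):
        assert engine.buildLastPage() == "The end.\n"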
I am having issues properly patching an imported function in pytest. The function I want to patch does a large SQL fetch, so for speed I would like to replace it with reading a CSV file. Here is the code I currently have:
import os

import pandas as pd
import pytest

from data import postgres_fetch

@pytest.fixture
def data_patch_market(monkeypatch):
    test_data_path = os.path.join(os.path.dirname(__file__), 'test_data')
    if os.path.exists(test_data_path):
        mock_data_path = os.path.join(test_data_path, 'test_data_market.csv')
        mock_data = pd.read_csv(mock_data_path)
        monkeypatch.setattr(postgres_fetch, 'get_data_for_market', mock_data)

def test_mase(data_patch_market):
    data = postgres_fetch.get_data_for_market(market_name=market,
                                              market_level=market_level,
                                              backtest_log_ids=log_ids,
                                              connection=conn)
    test_result = build_features.MASE(data)
However, when I run this test I get a type error about calling a DataFrame:
TypeError: 'DataFrame' object is not callable
I know the csv can be read properly, as I've tested that separately, so I assume something is wrong with how I am implementing the patch fixture, but I can't seem to work it out.
Here, your call to monkeypatch.setattr replaces any call to postgres_fetch.get_data_for_market with a call to mock_data.
This can't work, since mock_data is not a function; it's a DataFrame object.
Instead, in your call to monkeypatch.setattr, you need to pass in a function that returns the mocked data (i.e. the DataFrame object).
Hence, something like this should work:
@pytest.fixture
def data_patch_market(monkeypatch):
    test_data_path = os.path.join(os.path.dirname(__file__), 'test_data')
    if os.path.exists(test_data_path):
        mock_data_path = os.path.join(test_data_path, 'test_data_market.csv')
        mock_data = pd.read_csv(mock_data_path)

        # The lines below are new - here, we define a function that will
        # return the data we have mocked
        def return_mocked(*args, **kwargs):
            return mock_data

        monkeypatch.setattr(postgres_fetch, 'get_data_for_market', return_mocked)
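Equivalently, the replacement can be a lambda; this is just a stylistic variant of the same fix:
monkeypatch.setattr(postgres_fetch, 'get_data_for_market',
                    lambda *args, **kwargs: mock_data)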
I asked the same question on GitHub.
I learned about pytest-helpers-namespace from s0undt3ch in his very helpful answer. However, I found a use case I can't seem to find an obvious workaround for. Here is the paste of my original question on GitHub.
How can I use the fixtures already declared in my conftest within my helper functions?
I have a large, memory-heavy configuration object (for simplicity, a dictionary) used in all tests, but I don't want to tear down and rebuild this object, so it is scoped as session and reused. Often I want to grab values from the configuration object within my tests.
I know that when reusing fixtures within fixtures, you have to pass a reference:
# fixtures
@pytest.fixture(scope="session")
def return_dictionary():
    return {
        "test_key": "test_value"
    }

@pytest.fixture(scope="session")
def add_random(return_dictionary):
    _temp = return_dictionary
    _temp["test_key_random"] = "test_random_value"
    return _temp
Is it because pytest collects similar decorators and analyzes them together? I would appreciate someone's input on this. Thanks!
Here are a few files I created to demonstrate what I was looking for, and the error I am seeing.
# conftest.py
import pytest
from pprint import pprint

pytest_plugins = ["helpers_namespace"]

# fixtures
@pytest.fixture(scope="session")
def return_dictionary():
    return {
        "test_key": "test_value"
    }

# helpers
@pytest.helpers.register
def super_print(_dict):
    pprint(_dict)

@pytest.helpers.register
def super_print_always(key, _dict=return_dictionary):
    pprint(_dict[key])
# test_check.py
import pytest

def test_option_1(return_dictionary):
    print(return_dictionary)

def test_option_2(return_dictionary):
    return_dictionary["test_key_2"] = "test_value_2"
    pytest.helpers.super_print(return_dictionary)

def test_option_3():
    pytest.helpers.super_print_always('test_key')
key = 'test_key', _dict = <function return_dictionary at 0x039B6C48>

    @pytest.helpers.register
    def super_print_always(key, _dict=return_dictionary):
>       pprint(_dict[key])
E       TypeError: 'function' object is not subscriptable

conftest.py:30: TypeError
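The traceback already points at the cause: the default _dict=return_dictionary binds the fixture function object itself, not the value pytest would inject. One workaround, sketched below, is to pass the fixture's value into the helper at call time from a test that requests the fixture:
# test_check.py (sketch)
def test_option_4(return_dictionary):
    # Passing the value explicitly overrides the broken default
    pytest.helpers.super_print_always('test_key', return_dictionary)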
I am in the process of learning unit testing; however, I am struggling to understand how to mock functions for unit testing. I have reviewed many how-tos and examples, but the concept is not transferring enough for me to use it on my own code. I am hoping that getting this to work on an actual code example of mine will help.
In this case I am trying to mock isTokenValid.
Here is example code of what I want to mock.
# in library file
import socket
import xmlrpc.client as xmlrpclib

class Library(object):
    def function(self):
        # ...
        AuthURL = 'https://example.com/xmlrpc/Auth'
        auth_server = xmlrpclib.ServerProxy(AuthURL)
        socket.setdefaulttimeout(20)
        try:
            if pull == 0:
                valid = auth_server.isTokenValid(token)
        # ...
In my unit test file I have:
import unittest
from unittest.mock import patch

import library

class Tester(unittest.TestCase):
    @patch('library.xmlrpclib.ServerProxy')
    def test_xmlrpclib(self, fake_xmlrpclib):
        assert 'something'
How would I mock the code listed in function? The token can be any number as a string, and valid would be an int(1).
First of all, you can and should mock xmlrpc.client.ServerProxy; your library imports xmlrpc.client under a new name, but it is still the same module object, so both xmlrpclib.ServerProxy in your library and xmlrpc.client.ServerProxy lead to the same object.
Next, look at how the object is used, and look for calls, the (...) syntax. Your library uses the server proxy like this:
# a call to create an instance
auth_server = xmlrpclib.ServerProxy(AuthURL)

# on the instance, a call to another method
valid = auth_server.isTokenValid(token)
So there is a chain here, where the mock is called, and the return value is then used to find another attribute that is also called. When mocking, you need to look for that same chain; use the Mock.return_value attribute for this. By default a new mock instance is returned when you call a mock, but you can also set test values.
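To make that chain concrete, here is a tiny standalone sketch (the URL and token are illustrative) showing how return_value mirrors each call:
from unittest.mock import Mock

mock_serverproxy = Mock()  # stands in for the patched ServerProxy class
# Configure the chain: ServerProxy(...) -> instance -> isTokenValid(...) -> 1
mock_serverproxy.return_value.isTokenValid.return_value = 1

auth_server = mock_serverproxy('https://example.com/xmlrpc/Auth')
assert auth_server.isTokenValid('12345') == 1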
So to test your code, you'd want to influence what auth_server.isTokenValid(token) returns, and test if your code works correctly. You may also want to assert that the correct URL is passed to the ServerProxy instance.
Create separate tests for different outcomes. Perhaps the token is valid in one case, not valid in another, and you'd want to test both cases:
class Tester(unittest.TestCase):
    @patch('xmlrpc.client.ServerProxy')
    def test_valid_token(self, mock_serverproxy):
        # the ServerProxy(AuthURL) return value
        mock_auth_server = mock_serverproxy.return_value
        # configure a response for a valid token
        mock_auth_server.isTokenValid.return_value = 1
        # now run your library code
        return_value = library.Library().function()
        # and make test assertions
        # about the server proxy
        mock_serverproxy.assert_called_with('some_url')
        # and about the auth_server.isTokenValid call
        mock_auth_server.isTokenValid.assert_called_once()
        # and if the result of the function is expected
        self.assertEqual(return_value, 'expected return value')

    @patch('xmlrpc.client.ServerProxy')
    def test_invalid_token(self, mock_serverproxy):
        # the ServerProxy(AuthURL) return value
        mock_auth_server = mock_serverproxy.return_value
        # configure a response; now testing for an invalid token instead
        mock_auth_server.isTokenValid.return_value = 0
        # now run your library code
        return_value = library.Library().function()
        # and make test assertions
        # about the server proxy
        mock_serverproxy.assert_called_with('some_url')
        # and about the auth_server.isTokenValid call
        mock_auth_server.isTokenValid.assert_called_once()
        # and if the result of the function is expected
        self.assertEqual(return_value, 'expected return value')
There are many mock attributes to use, and you can change your patch decorator usage a little as follows:
class Tester(unittest.TestCase):
    def test_xmlrpclib(self):
        with patch('library.xmlrpclib.ServerProxy.isTokenValid') as isTokenValid:
            self.assertEqual(isTokenValid.call_count, 0)
            # your test code calling xmlrpclib
            self.assertEqual(isTokenValid.call_count, 1)
            token = isTokenValid.call_args[0]  # assume this token is valid
            self.assertEqual(isTokenValid.return_value, 1)
You can adjust the code above to satisfy your requirements.