This is my first time using Google's Vertex AI Pipelines. I checked this codelab as well as this post and this post, on top of some links derived from the official documentation. I decided to put all that knowledge to work in a toy example: a pipeline consisting of 2 components, "get-data" (which reads a .csv file stored in Cloud Storage) and "report-data" (which basically returns the shape of the .csv data read in the previous component). Furthermore, I was careful to include some suggestions provided in this forum. The code I currently have goes as follows:
from kfp.v2 import compiler
from kfp.v2.dsl import pipeline, component, Dataset, Input, Output
from google.cloud import aiplatform
# Components section
@component(
packages_to_install=[
"google-cloud-storage",
"pandas",
],
base_image="python:3.9",
output_component_file="get_data.yaml"
)
def get_data(
bucket: str,
url: str,
dataset: Output[Dataset],
):
import pandas as pd
from google.cloud import storage
storage_client = storage.Client("my-project")
bucket = storage_client.get_bucket(bucket)
blob = bucket.blob(url)
blob.download_to_filename('localdf.csv')
# path = "gs://my-bucket/program_grouping_data.zip"
df = pd.read_csv('localdf.csv', compression='zip')
df['new_skills'] = df['new_skills'].apply(ast.literal_eval)
df.to_csv(dataset.path + ".csv" , index=False, encoding='utf-8-sig')
@component(
packages_to_install=["pandas"],
base_image="python:3.9",
output_component_file="report_data.yaml"
)
def report_data(
inputd: Input[Dataset],
):
import pandas as pd
df = pd.read_csv(inputd.path)
return df.shape
# Pipeline section
@pipeline(
# Default pipeline root. You can override it when submitting the pipeline.
pipeline_root=PIPELINE_ROOT,
# A name for the pipeline.
name="my-pipeline",
)
def my_pipeline(
url: str = "test_vertex/pipeline_root/program_grouping_data.zip",
bucket: str = "my-bucket"
):
dataset_task = get_data(bucket, url)
dimensions = report_data(
dataset_task.output
)
# Compilation section
compiler.Compiler().compile(
pipeline_func=my_pipeline, package_path="pipeline_job.json"
)
# Running and submitting job
from datetime import datetime
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
run1 = aiplatform.PipelineJob(
display_name="my-pipeline",
template_path="pipeline_job.json",
job_id="mlmd-pipeline-small-{0}".format(TIMESTAMP),
parameter_values={"url": "test_vertex/pipeline_root/program_grouping_data.zip", "bucket": "my-bucket"},
enable_caching=True,
)
run1.submit()
I was happy to see that the pipeline compiled with no errors, and I managed to submit the job. However, my happiness was short-lived: when I went to Vertex AI Pipelines, I stumbled upon an error that goes like this:
The DAG failed because some tasks failed. The failed tasks are: [get-data].; Job (project_id = my-project, job_id = 4290278978419163136) is failed due to the above error.; Failed to handle the job: {project_number = xxxxxxxx, job_id = 4290278978419163136}
I did not find any related info on the web, nor could I find any log or anything similar, and I feel a bit overwhelmed that the solution to this (seemingly) easy example is still eluding me.
Quite obviously, I don't know what I am doing wrong or where. Any suggestions?
With some suggestions provided in the comments, I think I managed to make my demo pipeline work. I will first include the updated code:
from kfp.v2 import compiler
from kfp.v2.dsl import pipeline, component, Dataset, Input, Output
from datetime import datetime
from google.cloud import aiplatform
from typing import NamedTuple
# Importing 'COMPONENTS' of the 'PIPELINE'
@component(
packages_to_install=[
"google-cloud-storage",
"pandas",
],
base_image="python:3.9",
output_component_file="get_data.yaml"
)
def get_data(
bucket: str,
url: str,
dataset: Output[Dataset],
):
"""Reads a csv file, from some location in Cloud Storage"""
import ast
import pandas as pd
from google.cloud import storage
# 'Pulling' demo .csv data from a known location in GCS
storage_client = storage.Client("my-project")
bucket = storage_client.get_bucket(bucket)
blob = bucket.blob(url)
blob.download_to_filename('localdf.csv')
# Reading the pulled demo .csv data
df = pd.read_csv('localdf.csv', compression='zip')
df['new_skills'] = df['new_skills'].apply(ast.literal_eval)
df.to_csv(dataset.path + ".csv" , index=False, encoding='utf-8-sig')
@component(
packages_to_install=["pandas"],
base_image="python:3.9",
output_component_file="report_data.yaml"
)
def report_data(
inputd: Input[Dataset],
) -> NamedTuple("output", [("rows", int), ("columns", int)]):
"""From a passed csv file existing in Cloud Storage, returns its dimensions"""
import pandas as pd
df = pd.read_csv(inputd.path+".csv")
return df.shape
# Building the 'PIPELINE'
@pipeline(
# i.e. in my case: PIPELINE_ROOT = 'gs://my-bucket/test_vertex/pipeline_root/'
# Can be overridden when submitting the pipeline
pipeline_root=PIPELINE_ROOT,
name="readcsv-pipeline", # Your own naming for the pipeline.
)
def my_pipeline(
url: str = "test_vertex/pipeline_root/program_grouping_data.zip",
bucket: str = "my-bucket"
):
dataset_task = get_data(bucket, url)
dimensions = report_data(
dataset_task.output
)
# Compiling the 'PIPELINE'
compiler.Compiler().compile(
pipeline_func=my_pipeline, package_path="pipeline_job.json"
)
# Running the 'PIPELINE'
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
run1 = aiplatform.PipelineJob(
display_name="my-pipeline",
template_path="pipeline_job.json",
job_id="mlmd-pipeline-small-{0}".format(TIMESTAMP),
parameter_values={
"url": "test_vertex/pipeline_root/program_grouping_data.zip",
"bucket": "my-bucket"
},
enable_caching=True,
)
# Submitting the 'PIPELINE'
run1.submit()
Now, I will add some complementary comments, which in sum, managed to solve my problem:
First, having the "Logs Viewer" role (roles/logging.viewer) enabled for your user will greatly help to troubleshoot any existing error in your pipeline (note: that role worked for me, but you might want to look for a better-matching role for your own purposes here). Those errors will appear as "Logs", which can be accessed by clicking the corresponding button in the Vertex AI Pipelines UI.
NOTE: when the "Logs" are displayed, it might be helpful to carefully check each log entry (close to the time when you created your pipeline), as generally each of them corresponds to a single warning or error line.
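If you want to pull those logs programmatically rather than through the console, a minimal sketch using the google-cloud-logging client could look like this (the filter string and timestamp below are purely illustrative; narrow them down to your own pipeline run):
# Hedged sketch: list recent ERROR-level log entries around the time the pipeline ran.
# Only generic Cloud Logging filter fields (severity, timestamp) are used here.
from google.cloud import logging as cloud_logging

client = cloud_logging.Client(project="my-project")
log_filter = 'severity>=ERROR AND timestamp>="2022-01-01T00:00:00Z"'  # illustrative filter
for entry in client.list_entries(filter_=log_filter, order_by=cloud_logging.DESCENDING, max_results=20):
    print(entry.timestamp, entry.severity, entry.payload)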
Second, the output of my pipeline was a tuple. In my original approach, I just returned the plain tuple, but it is advised to return a NamedTuple instead. In general, if you need to input / output one or more "small values" (int or str, for any reason), pick a NamedTuple to do so.
Third, when the connection between your components is an Input[Dataset] or Output[Dataset], adding the file extension is needed (and quite easy to forget). Take for instance the output of the get_data component, and notice how the data is recorded by explicitly adding the file extension, i.e. dataset.path + ".csv".
Of course, this is a very tiny example, and real projects can easily scale to something much bigger; however, as some sort of "Hello Vertex AI Pipelines" it works well.
Thank you.
Thanks for your writeup. Very helpful! I had the same error, but it turned out to be for a different reason, so I'm noting it here...
In my pipeline definition step I have the following parameters...
def my_pipeline(bq_source_project: str = BQ_SOURCE_PROJECT,
bq_source_dataset: str = BQ_SOURCE_DATASET,
bq_source_table: str = BQ_SOURCE_TABLE,
output_data_path: str = "crime_data.csv"):
My error was that when I ran the pipeline, I did not pass these same parameters. Below is the fixed version...
job = pipeline_jobs.PipelineJob(
project=PROJECT_ID,
location=LOCATION,
display_name=PIPELINE_NAME,
job_id=JOB_ID,
template_path=FILENAME,
pipeline_root=PIPELINE_ROOT,
parameter_values={'bq_source_project': BQ_SOURCE_PROJECT,
'bq_source_dataset': BQ_SOURCE_DATASET,
'bq_source_table': BQ_SOURCE_TABLE}
)
Related
I'm trying to locally test a Kubeflow component from kfp.v2.dsl (which works in a pipeline) using pytest, but I'm struggling with the input/output arguments together with fixtures.
Here is a code example to illustrate the issue:
First, I created a fixture to mock a dataset. This fixture is also a kubeflow component.
# ./fixtures/
import pandas as pd
import pytest
from kfp.v2.dsl import component, Dataset, Output

@pytest.fixture
@component()
def sample_df(dataset: Output[Dataset]):
df = pd.DataFrame(
{
'name': ['Ana', 'Maria', 'Josh'],
'age': [15, 19, 22],
}
)
dataset.path += '.csv'
df.to_csv(dataset.path, index=False)
return
Let's suppose the component doubles the ages.
# ./src/
import pandas as pd
from kfp.v2.dsl import component, Dataset, Input, Output

@component()
def double_ages(df_input: Input[Dataset], df_output: Output[Dataset]):
df = pd.read_csv(df_input.path)
double_df = df.copy()
double_df['age'] = double_df['age']*2
df_output.path += '.csv'
double_df.to_csv(df_output.path, index=False)
Then, the test:
# ./tests/
import pandas as pd
import pytest

@pytest.mark.usefixtures("sample_df")
def test_double_ages(sample_df):
expected_df = pd.DataFrame(
{
'name': ['Ana', 'Maria', 'Josh'],
'age': [30, 38, 44],
}
)
df_component = double_ages(sample_df) # This is where I call the component, sample_df is an Input[Dataset]
df_output = df_component.outputs['df_output']
df = pd.read_csv(df_output.path)
assert df['age'].tolist() == expected_df['age'].tolist()
But that's when the problem occurs. The Output[Dataset] that should be passed as an output is not, so the component cannot work with it properly, and I get the following error on assert df['age'].tolist() == expected_df['age'].tolist():
AttributeError: 'TaskOutputArgument' object has no attribute 'path'
Apparently, the object is of type TaskOutputArgument instead of Dataset.
Does anyone know how to fix this? Or how to properly use pytest with kfp components? I've searched a lot on the internet but couldn't find a clue about it.
After spending my afternoon on this, I finally figured out a way to pytest a python-based KFP component. As I found no other lead on this subject, I hope this can help:
Access the function to test
The trick is not to directly test the KFP component created by the @component decorator. However, you can access the inner decorated Python function through the component attribute python_func.
Mock artifacts
Regarding the Input and Output artifacts, as you get around KFP to access and call the tested function, you have to create them manually and pass them to the function:
input_artifact = Dataset(uri='input_df_previously_saved.csv')
output_artifact = Dataset(uri='target_output_path.csv')
I had to come up with a workaround for how the Artifact.path property works (which also applies for all KFP Artifact subclasses: Dataset, Model, ...). If you look in KFP source code, you'll find that it uses the _get_path() method that returns None if the uri attribute does not start with one of the defined cloud prefixes: "gs://", "s3://" or "minio://". As we're manually building artifacts with local paths, the tested component that wants to read the path property of an artifact would read a None value.
So I made a simple method that builds a subclass of an Artifact (or a Dataset or any other Artifact child class). The built subclass is simply altered to return the uri value instead of None in this specific case of a non-cloud uri.
Your example
Putting this all together for your test and your fixture, we can get the following code to work:
src/double_ages_component.py: your component to test
Nothing changes here. I just added the pandas import:
from kfp.v2.dsl import component, Input, Dataset, Output
@component
def double_ages(df_input: Input[Dataset], df_output: Output[Dataset]):
import pandas as pd
df = pd.read_csv(df_input.path)
double_df = df.copy()
double_df['age'] = double_df['age'] * 2
df_output.path += '.csv'
double_df.to_csv(df_output.path, index=False)
tests/utils.py: the Artifact subclass builder
import typing
def make_test_artifact(artifact_type: typing.Type):
class TestArtifact(artifact_type):
def _get_path(self):
return super()._get_path() or self.uri
return TestArtifact
I am still not sure it is the most proper workaround. You could also manually create a subclass for each Artifact that you use (Dataset in your example). Or you could directly mock the kfp.v2.dsl.Artifact class using pytest-mock.
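For reference, a minimal sketch of that pytest-mock alternative (assuming the pytest-mock plugin is installed): since Artifact.path goes through _get_path(), patching that method is enough for a locally-built artifact to expose its uri through .path:
# Hedged sketch: patch Dataset._get_path so that .path falls back to .uri for local paths.
from kfp.v2.dsl import Dataset


def test_path_falls_back_to_uri(mocker, tmp_path):
    # mocker comes from pytest-mock; tmp_path is pytest's built-in temp directory fixture
    mocker.patch.object(Dataset, "_get_path", lambda self: self.uri)
    artifact = Dataset(uri=str(tmp_path / "output_df.csv"))
    assert artifact.path == artifact.uri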
tests/conftest.py: your fixture
I separated the sample dataframe creator component from the fixture. Hence we have a standard KFP component definition + a fixture that builds its output artifact and calls its python function:
from kfp.v2.dsl import component, Dataset, Output
import pytest
from tests.utils import make_test_artifact
@component
def sample_df_component(dataset: Output[Dataset]):
import pandas as pd
df = pd.DataFrame({
'name': ['Ana', 'Maria', 'Josh'],
'age': [15, 19, 22],
})
dataset.path += '.csv'
df.to_csv(dataset.path, index=False)
@pytest.fixture
def sample_df():
# define output artifact
output_path = 'local_sample_df.csv' # any writable local path. I'd recommend to use pytest `tmp_path` fixture.
sample_df_artifact = make_test_artifact(Dataset)(uri=output_path)
# call component python_func by passing the artifact yourself
sample_df_component.python_func(dataset=sample_df_artifact)
# the artifact object is now altered with the new path that you define in sample_df_component (".csv" extension added)
return sample_df_artifact
The fixture returns an artifact object referencing a selected local path where the sample dataframe has been saved to.
tests/test_component.py: your actual component test
Once again, the idea is to build the I/O artifact(s) and to call the component's python_func:
from kfp.v2.dsl import Dataset
import pandas as pd
from src.double_ages_component import double_ages
from tests.utils import make_test_artifact
def test_double_ages(sample_df):
expected_df = pd.DataFrame({
'name': ['Ana', 'Maria', 'Josh'],
'age': [30, 38, 44],
})
# input artifact is passed in parameter via sample_df fixture
# create output artifact
output_path = 'local_test_output_df.csv'
output_df_artifact = make_test_artifact(Dataset)(uri=output_path)
# call component python_func
double_ages.python_func(df_input=sample_df, df_output=output_df_artifact)
# read output data
df = pd.read_csv(output_df_artifact.path)
# write your tests
assert df['age'].tolist() == expected_df['age'].tolist()
Result
> pytest
================ test session starts ================
platform linux -- Python 3.8.13, pytest-7.1.3, pluggy-1.0.0
rootdir: /home/USER/code/kfp_tests
collected 1 item
tests/test_component.py . [100%]
================ 1 passed in 0.28s ================
I spent some time investigating this and my conclusion is that individual components are not meant to be unit tested by kfp's design. That means that you must rely on unit testing each component's logic, wrapping each piece of that logic in a component, and then testing the end-to-end functionality of the kfp pipeline.
I agree that it would be quite cool if there were a way to easily mock Inputs and Outputs but I dug quite deep and it does not seem like this is an intended use (or an easy hack) at this point in time.
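To sketch what that split can look like in practice (a minimal example; the name double_ages_logic is just a placeholder): keep the transformation logic in a plain function that is trivial to unit test, and let the @component definition stay a thin wrapper around artifact I/O.
# Hedged sketch: unit-test the plain function directly; keep the KFP @component
# as a thin wrapper around artifact I/O, and verify it end-to-end via the pipeline.
import pandas as pd


def double_ages_logic(df: pd.DataFrame) -> pd.DataFrame:
    """Pure logic, no KFP types involved."""
    out = df.copy()
    out['age'] = out['age'] * 2
    return out


def test_double_ages_logic():
    df = pd.DataFrame({'name': ['Ana', 'Maria', 'Josh'], 'age': [15, 19, 22]})
    assert double_ages_logic(df)['age'].tolist() == [30, 38, 44]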
This has worked for me. I've used create_autospec to mock the output parameters.
from kfp import dsl
from kfp.dsl import Dataset, Metrics, Output

@dsl.component(
base_image="pipeline:latest",
target_image="simple:latest",
)
def simple(
word: str,
number: int,
output_path: Output[Dataset],
output_metric: Output[Metrics],
) -> None:
output_path.metadata["meta"] = "my meta data"
output_metric.log_metric("numbers", number)
output_metric.log_metric("other numbers", 5678)
simple_stage(output_path.path, word, number)
import pandas as pd
from unittest.mock import call, create_autospec

def test_simple(uses_temp_directory: str) -> None:
# arrange
dataset_file = f"{uses_temp_directory}/dataset"
dataset = create_autospec(Dataset, metadata=dict(), path=dataset_file)
metrics = create_autospec(Metrics)
# act
simple.python_func(
word="my word",
number=1234,
output_path=dataset,
output_metric=metrics,
)
# assert
result = pd.read_csv(dataset_file)
assert 1234 == len(result.index)
metrics.log_metric.assert_has_calls(
[call("numbers", 1234), call("other numbers", 5678)]
)
So, I can list my instances by zone using this API:
GET https://compute.googleapis.com/compute/v1/projects/{project}/zones/{zone}/instances.
I now want to filter my instances by region. Any idea how I can do this (using Python)?
You can use aggregated_list() to list all the instances in your project. Filtering by region can then be done in the code itself. See the code below, where I used a regex on the zone name to mimic a filter using the region variable.
from typing import Dict, Iterable
from google.cloud import compute_v1
import re
def list_all_instances(
project_id: str,
region: str
) -> Dict[str, Iterable[compute_v1.Instance]]:
instance_client = compute_v1.InstancesClient()
request = {
"project" : project_id,
}
agg_list = instance_client.aggregated_list(request=request)
all_instances = {}
print("Instances found:")
for zone, response in agg_list:
if response.instances:
if re.search(f"{region}*", zone):
all_instances[zone] = response.instances
print(f" {zone}:")
for instance in response.instances:
print(f" - {instance.name} ({instance.machine_type})")
return all_instances
list_all_instances(project_id="your-project-id",region="us-central1") #used us-central1 for testing
NOTE: The code above is based on this sample code; I just modified it to apply the filtering described above.
Actual instances on my GCP account, and the result from the code above: only the zones with the prefix us-central1 were displayed.
I'm working on building a Dataflow pipeline that reads a CSV file (containing 250,000 rows) from my Cloud Storage bucket, modifies the value of each row and then writes the modified contents to a new CSV in the same bucket. With the code below I'm able to read and modify the contents of the original file, but when I attempt to write the contents of the new file in GCS I get the following error:
google.api_core.exceptions.TooManyRequests: 429 POST https://storage.googleapis.com/upload/storage/v1/b/my-bucket/o?uploadType=multipart: {
"error": {
"code": 429,
"message": "The rate of change requests to the object my-bucket/product-codes/URL_test_codes.csv exceeds the rate limit. Please reduce the rate of create, update, and delete requests.",
"errors": [
{
"message": "The rate of change requests to the object my-bucket/product-codes/URL_test_codes.csv exceeds the rate limit. Please reduce the rate of create, update, and delete requests.",
"domain": "usageLimits",
"reason": "rateLimitExceeded"
}
]
}
}
: ('Request failed with status code', 429, 'Expected one of', <HTTPStatus.OK: 200>) [while running 'Store Output File']
My code in Dataflow:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
import traceback
import sys
import pandas as pd
from cryptography.fernet import Fernet
import google.auth
from google.cloud import storage
fernet_secret = 'aD4t9MlsHLdHyuFKhoyhy9_eLKDfe8eyVSD3tu8KzoP='
bucket = 'my-bucket'
inputFile = f'gs://{bucket}/product-codes/test_codes.csv'
outputFile = 'product-codes/URL_test_codes.csv'
#Pipeline Logic
def product_codes_pipeline(project, env, region='us-central1'):
options = PipelineOptions(
streaming=False,
project=project,
region=region,
staging_location="gs://my-bucket-dataflows/Templates/staging",
temp_location="gs://my-bucket-dataflows/Templates/temp",
template_location="gs://my-bucket-dataflows/Templates/Generate_Product_Codes.py",
subnetwork='https://www.googleapis.com/compute/v1/projects/{}/regions/us-central1/subnetworks/{}-private'.format(project, env)
)
# Transform function
def genURLs(code):
f = Fernet(fernet_secret)
encoded = code.encode()
encrypted = f.encrypt(encoded)
decrypted = f.decrypt(encrypted.decode().encode())
decoded = decrypted.decode()
if code != decoded:
print(f'Error: Code {code} and decoded code {decoded} do not match')
sys.exit(1)
url = 'https://some-url.com/redeem/product-code=' + encrypted.decode()
return url
class WriteCSVFIle(beam.DoFn):
def __init__(self, bucket_name):
self.bucket_name = bucket_name
def start_bundle(self):
self.client = storage.Client()
def process(self, urls):
df = pd.DataFrame([urls], columns=['URL'])
bucket = self.client.get_bucket(self.bucket_name)
bucket.blob(f'{outputFile}').upload_from_string(df.to_csv(index=False), 'text/csv')
# End function
p = beam.Pipeline(options=options)
(p | 'Read Input CSV' >> beam.io.ReadFromText(inputFile, skip_header_lines=1)
| 'Map Codes' >> beam.Map(genURLs)
| 'Store Output File' >> beam.ParDo(WriteCSVFIle(bucket)))
p.run()
The code produces URL_test_codes.csv in my bucket, but the file only contains one row (not including the 'URL' header) which tells me that my code is writing/overwriting the file as it processes each row. Is there a way to bulk write the contents of the entire file instead of making a series of requests to update the file? I'm new to Python/Dataflow so any help is greatly appreciated.
Let's point out the issues: the evident one is a quota issue on the GCS side, reflected by the '429' error code. But as you noted, this derives from the underlying issue, which is more related to how you are trying to write your data to your blob.
Since a Beam pipeline works on a parallel collection of elements (a PCollection), each pipeline step is executed for each element; in other words, your ParDo function tries to write something to your output file once per element in your PCollection.
So, there are some issues with your WriteCSVFIle function. For example, in order to write your PCollection to GCS, it would be better to use a separate transform focused on writing the whole PCollection, such as the following:
First, you can import this Function already included in Apache Beam:
from apache_beam.io import WriteToText
Then, you use it at the end of your pipeline:
| 'Write PCollection to Bucket' >> WriteToText('gs://{0}/{1}'.format(bucket, outputFile))
With this option, you don't need to create a storage client or reference a blob; the transform just needs to receive the GCS URI where it will write the final result, and you can adjust it according to the parameters you can find in the documentation.
With this, you only need to address the DataFrame created in your WriteCSVFIle function. Each pipeline step creates a new PCollection, so if a DataFrame-creating step receives one element from a PCollection of URLs, it will produce one DataFrame per URL under your current logic. Since it seems you just want to write the results from genURLs, and 'URL' is the only column in your DataFrame, going directly from genURLs to WriteToText may output exactly what you're looking for.
Either way, you can adjust your pipeline accordingly; at the very least the WriteToText transform will take care of writing your whole final PCollection to your Cloud Storage bucket.
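For clarity, here is a sketch of how the reworked pipeline section could look, reusing your existing genURLs, options, bucket and outputFile (note that WriteToText writes sharded files by default and appends a suffix such as -00000-of-00001 to the file name):
# Sketch: write the whole PCollection with WriteToText instead of uploading
# the blob once per element inside a DoFn.
import apache_beam as beam
from apache_beam.io import WriteToText

p = beam.Pipeline(options=options)
(p
 | 'Read Input CSV' >> beam.io.ReadFromText(inputFile, skip_header_lines=1)
 | 'Map Codes' >> beam.Map(genURLs)
 | 'Store Output File' >> WriteToText(
       'gs://{0}/{1}'.format(bucket, outputFile),
       header='URL',     # write the column header once per shard
       num_shards=1))    # optional: force a single output file (limits write parallelism)
p.run()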
I want to refactor my code. What I am currently doing is extracting data from an ad platform's API endpoint, transforming it, and uploading it to BigQuery. I have the following code, which works, but I want to refactor it after having learnt about decorators.
Decorators are a very powerful and useful tool in Python, since they allow programmers to modify the behavior of a function or class. They let us wrap another function in order to extend the behavior of the wrapped function without permanently modifying it.
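For illustration, a minimal sketch of that wrapping pattern (the names log_calls and extract_and_upload below are just placeholders, not part of my actual code):
# Minimal sketch of the decorator pattern: wrap a function to extend its
# behaviour (here, simple logging) without modifying the function itself.
import functools


def log_calls(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print(f"Calling {func.__name__}")
        result = func(*args, **kwargs)
        print(f"Finished {func.__name__}")
        return result
    return wrapper


@log_calls
def extract_and_upload():
    ...  # extraction / upload logic would go here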
import datauploader
import ndjson
import os
def upload_ads_details(extractor, access_token, acccount_id, req_output,
bq_client_name, bq_dataset_id, bq_gs_bucket,
ndjson_local_file_path, ndjson_file_name):
# Function to Extract data from the API/Ad Platform
ads_dictionary = extractor.get_ad_dictionary(access_token, acccount_id)
# Converting data to ndjson for upload to big query
output_ndjson = ndjson.dumps(ads_dictionary)
with open(ndjson_local_file_path, 'w') as f:
f.writelines(output_ndjson)
print(os.path.abspath(ndjson_local_file_path))
# This code below remains the same for all the other function calls
if req_output:
# Inputs for the uploading functions
print("Processing Upload")
partition_by = "_insert_time"
str_gcs_file_name = ndjson_file_name
str_local_file_name = ndjson_local_file_path
gs_bucket = bq_gs_bucket
gs_file_format = "JSON"
table_id = 'ads_performance_stats_table'
table_schema = ads_dictionary_schema
# Uploading Function
datauploader.loadToBigQuery(
bq_client_name,
bq_dataset_id,
table_id,
table_schema,
partition_by,
str_gcs_file_name,
str_local_file_name,
gs_bucket,
gs_file_format,
autodetect=False,
req_partition=True,
skip_leading_n_row=0
)
I'm brand new at Python and I'm trying to write an extension to an app that imports GA information and parses it into MySQL. There is a shamefully sparse amount of information on the topic. The Google docs only seem to have examples in JS and Java...
...I have gotten to the point where my user can authenticate into GA using SubAuth. That code is here:
import gdata.service
import gdata.analytics
from django import http
from django import shortcuts
from django.shortcuts import render_to_response
def authorize(request):
next = 'http://localhost:8000/authconfirm'
scope = 'https://www.google.com/analytics/feeds'
secure = False # set secure=True to request secure AuthSub tokens
session = False
auth_sub_url = gdata.service.GenerateAuthSubRequestUrl(next, scope, secure=secure, session=session)
return http.HttpResponseRedirect(auth_sub_url)
So, the next step is getting at the data. I have found this library (beware, the UI is offensive): http://gdata-python-client.googlecode.com/svn/trunk/pydocs/gdata.analytics.html
However, I have found it difficult to navigate. It seems like I should be using gdata.analytics.AnalyticsDataEntry.getDataEntry(), but I'm not sure what it is asking me to pass it.
I would love a push in the right direction. I feel I've exhausted google looking for a working example.
Thank you!!
EDIT: I have gotten farther, but my problem still isn't solved. The method below returns data (I believe)... the error I get is: "'str' object has no attribute '_BecomeChildElement'". I believe I am getting back a feed? However, I don't know how to drill into it. Is there a way for me to inspect this object?
def auth_confirm(request):
gdata_service = gdata.service.GDataService('iSample_acctSample_v1.0')
feedUri='https://www.google.com/analytics/feeds/accounts/default?max-results=50'
# request feed
feed = gdata.analytics.AnalyticsDataFeed(feedUri)
print str(feed)
Maybe this post can help out. It seems like there are no Analytics-specific bindings yet, so you are working with the generic gdata.
I've been using GA for a little over a year now, and since about April 2009 I have used the Python bindings supplied in a package called python-googleanalytics by Clint Ecker et al. So far, it works quite well.
Here's where to get it: http://github.com/clintecker/python-googleanalytics.
Install it the usual way.
To use it: First, so that you don't have to manually pass in your login credentials each time you access the API, put them in a config file like so:
[Credentials]
google_account_email = youraccount@gmail.com
google_account_password = yourpassword
Name this file '.pythongoogleanalytics' and put it in your home directory.
And from an interactive prompt type:
from googleanalytics import Connection
import datetime
connection = Connection() # pass in id & pw as strings **if** not in config file
account = connection.get_account(<*your GA profile ID goes here*>)
start_date = datetime.date(2009, 12, 1)
end_date = datetime.date(2009, 12, 13)
# account object does the work, specify what data you want w/
# 'metrics' & 'dimensions'; see 'USAGE.md' file for examples
account.get_data(start_date=start_date, end_date=end_date, metrics=['visits'])
The get_data call will return a Python list containing your data; the get_account call above returns the account object (bound to the variable 'account') on which you call get_data.
You need 3 files within the app. client_secrets.json, analytics.dat and google_auth.py.
Create a module Query.py within the app:
class Query(object):
def __init__(self, startdate, enddate, filter, metrics):
self.startdate = startdate.strftime('%Y-%m-%d')
self.enddate = enddate.strftime('%Y-%m-%d')
self.filter = "ga:medium=" + filter
self.metrics = metrics
Example models.py, which has the following function:
import google_auth
service = google_auth.initialize_service()
def total_visit(self):
object = AnalyticsData.objects.get(utm_source=self.utm_source)
trial = Query(object.date.startdate, object.date.enddate, object.utm_source, "ga:sessions")
result = service.data().ga().get(ids = 'ga:<your-profile-id>', start_date = trial.startdate, end_date = trial.enddate, filters= trial.filter, metrics = trial.metrics).execute()
total_visit = result.get('rows')
<yr save command, ColumnName.object.create(data=total_visit) goes here>