I have written a Cloud Function in GCP that inserts data into BigQuery when a file is uploaded to my GCS bucket, but I am wondering how to integrate this into a Jenkinsfile/pipeline.
I have checked the Jenkins documentation, but I am still unclear on how to actually integrate my Cloud Function into Jenkins.
This is my cloud function code:
def csv_in_gcs_to_table(event, context):
    from google.cloud import bigquery
    client = bigquery.Client()

    bucket_name = "bucket_name"
    object_name = event['name']  # name of the uploaded file, from the GCS event payload
    table_id = "project_id.dataset_name.table_name"

    schema = [
        bigquery.SchemaField('col1', 'STRING'),
        bigquery.SchemaField('col2', 'STRING'),
    ]

    job_config = bigquery.LoadJobConfig()
    job_config.schema = schema
    job_config.source_format = bigquery.SourceFormat.CSV
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
    job_config.skip_leading_rows = 1

    uri = "gs://{}/{}".format(bucket_name, object_name)
    load_job = client.load_table_from_uri(uri,
                                          table_id,
                                          job_config=job_config)
    load_job.result()  # Waits for the load job to complete.
Also, does the trigger code need to be added to the Jenkinsfile too?
This code retrieves the buckets of an Amazon S3-compatible storage service (not Amazon AWS itself, but Zadara-compatible cloud storage), and it works:
import boto3
from botocore.client import Config

session = boto3.session.Session()
s3_client = session.client(
    service_name='s3',
    region_name='IT',
    aws_access_key_id='xyz',
    aws_secret_access_key='abcedf',
    endpoint_url='https://nothing.com:443',
    config=Config(signature_version='s3v4'),
)
print('Buckets')
boto3.set_stream_logger(name='botocore')
print(s3_client.list_buckets())
I am trying to use the same approach to access S3 from C# with the AWS SDK, but I always get the error "The request signature we calculated does not match the signature you provided. Check your key and signing method."
AmazonS3Config config = new AmazonS3Config();
config.AuthenticationServiceName = "s3";
config.ServiceURL = "https://nothing.com:443";
config.SignatureVersion = "s3v4";
config.AuthenticationRegion = "it";

AmazonS3Client client = new AmazonS3Client(
    "xyz",
    "abcdef",
    config);

ListBucketsResponse r = await client.ListBucketsAsync();
What can I do? Why is it not working? I cannot find a solution.
I also tried to trace debug info:
Python
boto3.set_stream_logger(name='botocore')
C#
AWSConfigs.LoggingConfig.LogResponses = ResponseLoggingOption.Always;
AWSConfigs.LoggingConfig.LogMetrics = true;
AWSConfigs.LoggingConfig.LogTo = Amazon.LoggingOptions.SystemDiagnostics;
AWSConfigs.AddTraceListener("Amazon", new System.Diagnostics.ConsoleTraceListener());
but in C# the whole request is not being logged.
Any suggestions?
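One difference worth ruling out (this is an assumption, not a confirmed diagnosis) is the addressing style: many S3-compatible endpoints expect path-style requests, and a mismatch in how the two SDKs build the request URL changes the string that gets signed. A sketch that pins the working Python client to fully explicit settings, so it can be compared line by line against the C# configuration (endpoint, region, and keys are the same placeholders as above):

import boto3
from botocore.client import Config

session = boto3.session.Session()
s3_client = session.client(
    service_name='s3',
    region_name='IT',
    aws_access_key_id='xyz',
    aws_secret_access_key='abcedf',
    endpoint_url='https://nothing.com:443',
    # Make the signature version and addressing style explicit instead of
    # relying on SDK defaults, so the shape of the signed request is known.
    config=Config(signature_version='s3v4', s3={'addressing_style': 'path'}),
)
print(s3_client.list_buckets())

If path-style is what the endpoint expects, the .NET SDK's AmazonS3Config has a ForcePathStyle property that plays the same role.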
The following function runs, but when I check BigQuery there is nothing there. I am using the $ decorator to target a specific date partition, and I have waited to be sure the data has uploaded:
from google.cloud import bigquery

def write_truncate_table(client, table_id, df, schema=None):
    job_config = bigquery.LoadJobConfig(
        schema=schema,
        # To append, use "WRITE_APPEND" or don't pass job_config at all (appending is the default).
        write_disposition="WRITE_TRUNCATE",
    )
    job = client.load_table_from_dataframe(df, table_id, job_config=job_config)  # Make an API request.
    return job.result()
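When loading into a single partition with the $ decorator, it can help to look at what the finished load job reports before checking the console. A minimal sketch of calling the function above and inspecting the completed job (the project, dataset, table, and partition date are placeholders, and it assumes the target table is ingestion-time partitioned by day):

import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
df = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

# The $ decorator targets one daily partition of the (assumed) partitioned table.
table_id = "my_project.my_dataset.my_table$20240101"

job = write_truncate_table(client, table_id, df)
print(job.state)        # "DONE" once the load has finished
print(job.output_rows)  # number of rows this load job actually wrote
print(job.errors)       # None if the load succeeded

# Query the partition directly instead of relying on the table preview.
rows = client.query(
    "SELECT COUNT(*) AS n FROM `my_project.my_dataset.my_table` "
    "WHERE _PARTITIONDATE = '2024-01-01'"
).result()
print(list(rows))

If output_rows is non-zero and errors is None, the load itself worked, and the next thing to check is whether the query or preview is looking at the same partition that was written.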
This is the first time I am using a Cloud Function, and this Cloud Function does one single job: every time a file is uploaded to a GCS bucket, the function runs and copies that file (.csv) to a BigQuery table without any transformations. What would be the most efficient way to test (unit, not integration) the gcs_to_bq method?
import re

from google.cloud import bigquery


def get_bq_table_name(file_name):
    if re.search('car', file_name):
        return 'car'
    return 'bike'


def gcs_to_bq(event, context):
    # Construct a BigQuery client object.
    client = bigquery.Client()

    bq_table = get_bq_table_name(event['name'])
    table_id = f'xxx.yyy.{bq_table}'

    job_config = bigquery.LoadJobConfig(
        schema=[
            bigquery.SchemaField("datetime", "STRING"),
            bigquery.SchemaField("name", "STRING"),
            bigquery.SchemaField("id", "STRING"),
        ],
        skip_leading_rows=1,
        # The source format defaults to CSV, so the line below is optional.
        source_format=bigquery.SourceFormat.CSV,
    )
    uri = "gs://" + event['bucket'] + '/' + event['name']

    load_job = client.load_table_from_uri(
        uri, table_id, job_config=job_config
    )  # Make an API request.
    load_job.result()  # Waits for the job to complete.

    destination_table = client.get_table(table_id)  # Make an API request.
    print("Loaded {} rows.".format(destination_table.num_rows))
I think you would need three things for unit testing:
1. Create a fake .csv file and upload it to a stage/dev GCS bucket.
2. Create a staging dataset in BQ.
3. Create a fake event object that represents (1).
Then your unit test calls gcs_to_bq() with (3) and checks whether the table is created correctly in (2).
As you can see, even though it is unit testing, it requires setting up cloud resources.
There is a GCS emulator that could help if you want a completely local GCS stub/mock, but I have never tried it:
https://github.com/fsouza/fake-gcs-server
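If setting up real cloud resources is heavier than you want for a unit test, another option is to mock the BigQuery client entirely and only assert on how it is called. A sketch under the assumptions that gcs_to_bq lives in main.py (the usual Cloud Functions layout) and that main.py imports bigquery at module level:

from unittest import mock

import main  # the module containing gcs_to_bq (assumed name)


def test_gcs_to_bq_routes_car_files_to_car_table():
    # Fake finalize event, shaped like the fields the function actually reads.
    fake_event = {"bucket": "test-bucket", "name": "car_2021-01-01.csv"}

    with mock.patch.object(main.bigquery, "Client") as mock_client_cls:
        mock_client = mock_client_cls.return_value

        main.gcs_to_bq(fake_event, context=None)

        # The URI and table id should be derived from the fake event.
        args, _ = mock_client.load_table_from_uri.call_args
        assert args[0] == "gs://test-bucket/car_2021-01-01.csv"
        assert args[1] == "xxx.yyy.car"

Run it with pytest; no project, bucket, or dataset is touched, which keeps the test focused on the routing logic in get_bq_table_name and on how the load job is requested.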
I have a new CSV file each week in the same format, which I need to append to a BigQuery table using the Python client. I successfully created the table using the first CSV, but I am unsure how to append subsequent CSVs going forward. The only way I have found is the google.cloud.bigquery.client.Client().insert_rows() method (see the API link here), which would require me to first read the CSV in as a list of dictionaries. Is there a better way to append data from a CSV to a BigQuery table?
See the simple example below:
# from google.cloud import bigquery
# client = bigquery.Client()
# table_ref = client.dataset('my_dataset').table('existing_table')
job_config = bigquery.LoadJobConfig()
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
job_config.skip_leading_rows = 1
# The source format defaults to CSV, so the line below is optional.
job_config.source_format = bigquery.SourceFormat.CSV
uri = "gs://your_bucket/path/your_file.csv"
load_job = client.load_table_from_uri(
    uri, table_ref, job_config=job_config
)  # API request
print("Starting job {}".format(load_job.job_id))
load_job.result() # Waits for table load to complete.
print("Job finished.")
destination_table = client.get_table(table_ref)
print("Loaded {} rows.".format(destination_table.num_rows))
See more details in the BigQuery documentation.
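If the weekly CSV is on your local machine rather than already in GCS, the same load-job approach works with load_table_from_file instead of load_table_from_uri. A sketch, with the file path and table id as placeholders:

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my_project.my_dataset.existing_table"

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.skip_leading_rows = 1
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND

with open("path/to/weekly_file.csv", "rb") as source_file:
    load_job = client.load_table_from_file(
        source_file, table_id, job_config=job_config
    )  # API request

load_job.result()  # Waits for the load to complete.
print("Loaded {} rows total.".format(client.get_table(table_id).num_rows))

Either way, a load job parses the CSV server-side, so there is no need to read the file into a list of dictionaries first.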
My question is about code to extract a table from BigQuery and save it as a JSON file.
I wrote my code mostly by following the gcloud tutorials in their documentation.
I couldn't set my credentials implicitly, so I did it explicitly, pointing to my JSON key file. But it seems that the Client object isn't being picked up correctly via the path I took.
If anyone could clarify how implicit and explicit credentials work, that would help me a lot too!
I am using Python 2.7 and PyCharm. The code is as follows:
from gcloud import bigquery
from google.cloud import storage


def bigquery_get_rows():
    json_key = "path/to/my/json_file.json"
    storage_client = storage.Client.from_service_account_json(json_key)
    print("\nGot the client\n")

    # Make an authenticated API request.
    buckets = list(storage_client.list_buckets())
    print(buckets)
    print(storage_client)

    # Setting up the environment.
    bucket_name = 'my_bucket/name'
    print(bucket_name)
    destination_uri = 'gs://{}/{}'.format(bucket_name, 'my_table_json_name.json')
    print(destination_uri)

    #dataset_ref = client.dataset('samples', project='my_project_name')
    dataset_ref = storage_client.dataset('my_dataset_name', project='my_project_id')
    print(dataset_ref)
    table_ref = dataset_ref.table('my_table_to_be_extracted_name')
    print(table_ref)

    job_config = bigquery.job.ExtractJobConfig()
    job_config.destination_format = (
        bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON)

    extract_job = client.extract_table(
        table_ref, destination_uri, job_config=job_config)  # API request
    extract_job.result()  # Waits for job to complete.


bigquery_get_rows()
You are using the wrong client object: you are trying to use the GCS client to work with BigQuery.
Instead of
dataset_ref = storage_client.dataset('my_dataset_name', project='my_project_id')
it should be:
bq_client = bigquery.Client.from_service_account_json(
    'path/to/service_account.json')
dataset_ref = bq_client.dataset('my_dataset_name', project='my_project_id')
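Putting it together, a minimal sketch of the corrected function, with a BigQuery client used for the dataset/table references and for the extract job (all project, dataset, table, and bucket names are placeholders):

from google.cloud import bigquery


def bigquery_table_to_json():
    bq_client = bigquery.Client.from_service_account_json(
        'path/to/service_account.json')

    dataset_ref = bq_client.dataset('my_dataset_name', project='my_project_id')
    table_ref = dataset_ref.table('my_table_to_be_extracted_name')
    destination_uri = 'gs://{}/{}'.format('my_bucket_name', 'my_table_json_name.json')

    job_config = bigquery.job.ExtractJobConfig()
    job_config.destination_format = bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON

    extract_job = bq_client.extract_table(
        table_ref, destination_uri, job_config=job_config)  # API request
    extract_job.result()  # Waits for the job to complete.


bigquery_table_to_json()

The storage client is only needed if you also want to list or download the exported file from the bucket afterwards.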