ETL load from Google Cloud Storage to BigQuery - Python

I want to load data from hundreds of CSV files on Google Cloud Storage and append them to a single table in BigQuery on a daily basis using Cloud Dataflow (preferably using the Python SDK). Can you please let me know how I can accomplish that?
Thanks

We can do it through Python as well.
Please find the code snippet below.
def format_output_json(element):
    """
    :param element: a row of the CSV, as a string
    :return: a single-element list containing a dictionary that maps column names to the values in the row.
    :row_indices: hard-coded here, but they can be obtained at run time.
    """
    row_indices = ['time_stamp', 'product_name', 'units_sold', 'retail_price']
    row_data = element.split(',')
    dict1 = dict()
    for i in range(len(row_data)):
        dict1[row_indices[i]] = row_data[i]
    return [dict1]
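To wire this function into a complete pipeline, here is a minimal Apache Beam sketch; the project, region, bucket paths, table name, and schema below are placeholders you would replace with your own, and format_output_json is the function above:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    options = PipelineOptions(
        runner="DataflowRunner",
        project="your-project-id",              # placeholder
        region="us-central1",                   # placeholder
        temp_location="gs://your-bucket/tmp",   # placeholder
    )
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadCSVs" >> beam.io.ReadFromText("gs://your-bucket/input/*.csv", skip_header_lines=1)
            | "ParseRows" >> beam.FlatMap(format_output_json)  # FlatMap because the function returns a list
            | "WriteToBQ" >> beam.io.WriteToBigQuery(
                "your-project-id:your_dataset.your_table",
                schema="time_stamp:STRING,product_name:STRING,units_sold:STRING,retail_price:STRING",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )

if __name__ == "__main__":
    run()

Running this once a day (for example from Cloud Scheduler or a scheduled Dataflow template) appends the parsed rows to the single BigQuery table.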

Related

Google Cloud BigQuery WRITE_APPEND duplicate issue

I keep multiple files with similar names in Google Cloud Storage. My data arrives there daily via an API, and I want to append it to my table in BigQuery, which is refreshed daily; I do this via Python. When I use WRITE_TRUNCATE, it deletes the table and recreates it from all the files in the storage bucket, but I need to protect my table because it holds historical data that is no longer in the bucket. When I use WRITE_APPEND, I get duplicates because it appends every file in the bucket again. Does anyone have a suggested solution?
Here is a piece of my code:
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
job_config.skip_leading_rows = 1
# The source format defaults to CSV, so the line below is optional.
job_config.source_format = bigquery.SourceFormat.CSV
job_config.autodetect = True
job_config.max_bad_records = 5
uri = "gs://" + gcsbucket + "/" + tableprefix ```
@AhmetBuğraBUĞA, as you mentioned in the comments, the date is written at the end of the file names, so the problem can be solved by building the load URI with the previous day's date, as below:
(dt.datetime.today() - dt.timedelta(days=1)).strftime("%Y-%m-%d")
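Putting the two together, here is a minimal sketch that appends only the previous day's file instead of truncating; the bucket, prefix, file-naming convention, and dataset/table names are assumptions you would adapt:

import datetime as dt
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig()
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND  # keep the historical rows
job_config.skip_leading_rows = 1
job_config.source_format = bigquery.SourceFormat.CSV
job_config.autodetect = True
job_config.max_bad_records = 5

gcsbucket = "your-bucket"      # placeholder
tableprefix = "your_prefix_"   # placeholder

# Build the URI for yesterday's file only, so previously loaded files are not re-read.
date_suffix = (dt.datetime.today() - dt.timedelta(days=1)).strftime("%Y-%m-%d")
uri = "gs://" + gcsbucket + "/" + tableprefix + date_suffix + ".csv"  # assumes date-suffixed file names

load_job = client.load_table_from_uri(uri, "your_dataset.your_table", job_config=job_config)
load_job.result()  # wait for the load job to finish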

Trying to move data from one Azure Blob Storage to another using a Python script

I have data that exists in a zipped format in container A that I need to transform using a Python script, and I am trying to schedule this to occur within Azure. However, when writing the output to a new storage container (container B), it simply outputs a CSV with the name of the file inside rather than the data.
I've followed the tutorial on the Microsoft site exactly, but I can't get it to work. What am I missing?
https://learn.microsoft.com/en-us/azure/batch/tutorial-run-python-batch-azure-data-factory
file_n='iris.csv'
# Load iris dataset from the task node
df = pd.read_csv(file_n)
# Subset records
df = df[df['Species'] == "setosa"]
# Save the subset of the iris dataframe locally in task node
df.to_csv("iris_setosa.csv", index = False, encoding="utf-8")
# Upload iris dataset
blobService.create_blob_from_text(containerName, "iris_setosa.csv", "iris_setosa.csv")
Specifically, the final line seems to be giving me a CSV called "iris_setosa.csv" whose contents are just the text "iris_setosa.csv" in cell A1, rather than the actual data that it reads in.
Update:
Replace create_blob_from_text with create_blob_from_path.
create_blob_from_text creates a new blob from str/unicode, or updates the content of an existing blob. So you will find the text iris_setosa.csv as the content of the new blob.
create_blob_from_path creates a new blob from a file path, or updates the content of an existing blob. It is what you want.
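With that change, the upload line from the question would look like this (same blobService client and containerName as in the original snippet, assuming the legacy azure-storage SDK used there):

# Uploads the contents of the local file instead of the literal string "iris_setosa.csv"
blobService.create_blob_from_path(containerName, "iris_setosa.csv", "iris_setosa.csv")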
This workaround uses copy_blob and delete_blob to move an Azure blob from one container to another.
from azure.storage.blob import BlobService

def copy_azure_files(self):
    blob_service = BlobService(account_name='account_name', account_key='account_key')
    blob_name = 'iris_setosa.csv'
    copy_from_container = 'test-container'
    copy_to_container = 'demo-container'
    blob_url = blob_service.make_blob_url(copy_from_container, blob_name)
    # blob_url: https://demostorage.blob.core.windows.net/test-container/iris_setosa.csv
    blob_service.copy_blob(copy_to_container, blob_name, blob_url)
    # To move (rather than copy) the file, also delete the source blob:
    blob_service.delete_blob(copy_from_container, blob_name)

Extracting "data" from Amazon Ion file

Has anyone worked with Amazon Quantum Ledger Database (QLDB) Amazon Ion files? If so, do you know how to extract the "data" part to formulate tables? Maybe use Python to scrape the data?
I am trying to get the "data" information from these files, which are stored in S3 (I don't have access to QLDB, so I cannot query directly), and then upload the results to Glue.
I am trying to perform an ETL job using Glue, but Glue doesn't like Amazon Ion files, so I need to either query data from these files or scrape the files for relevant information.
Thanks.
PS: by "data" information I mean this:
{
    PersonId: "4tPW8xtKSGF5b6JyTihI1U",
    LicenseNumber: "LEWISR261LL",
    LicenseType: "Learner",
    ValidFromDate: 2016-12-20,
    ValidToDate: 2020-11-15
}
ref : https://docs.aws.amazon.com/qldb/latest/developerguide/working.userdata.html
Have you tried working with the Amazon Ion library?
Assuming the data mentioned in the question is present in a file called "myIonFile.ion", and the file contains only Ion objects, we can read the data from the file as follows:
from amazon.ion import simpleion

with open("myIonFile.ion", "rb") as f:  # open the file in binary mode
    data = f.read()                     # read the raw bytes
# single_value=False returns a list of all top-level Ion values in the file
iondata = simpleion.loads(data, single_value=False)
print(iondata[0]['PersonId'])           # prints "4tPW8xtKSGF5b6JyTihI1U"
Further guidance on using the Ion library is provided in the Ion Cookbook.
Besides, I'm unsure about your use case, but interacting with QLDB can also be done via the QLDB Driver, which has a direct dependency on the Ion library.
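For completeness, a minimal sketch of querying a ledger with the Python QLDB driver (pyqldb); the ledger name and the "Person" table are placeholders, and this only applies if you do get QLDB access:

from pyqldb.driver.qldb_driver import QldbDriver

driver = QldbDriver(ledger_name="your-ledger")  # placeholder ledger name

def read_documents(txn):
    # PartiQL query; "Person" is a hypothetical table name
    cursor = txn.execute_statement("SELECT * FROM Person")
    return list(cursor)

documents = driver.execute_lambda(read_documents)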
Nosiphiwe,
AWS Glue is able to read Amazon Ion input. Many other services and applications can't, though, so it's a good idea to use Glue to convert the Ion data to JSON. Note that Ion is a superset of JSON, adding some data types, so converting Ion to JSON may cause some down-conversion.
One good way to get access to your QLDB documents from the QLDB S3 export is to use Glue to extract the document data, store it in S3 as JSON, and query it with Amazon Athena. The process would go as follows:
1. Export your ledger data to S3.
2. Create a Glue crawler to crawl and catalog the exported data.
3. Run a Glue ETL job to extract the revision data from the export files, convert it to JSON, and write it out to S3.
4. Create a Glue crawler to crawl and catalog the extracted data.
5. Query the extracted document revision data using Amazon Athena.
Take a look at the PySpark script below. It extracts just the revision metadata and data payload from the QLDB export files.
The QLDB export maps the table for each document, but separately from the revisions data. You'll have to do some extra coding to include the table name in your revision data in the output. The code below doesn't do this, so you'll end up with all of your revisions in one table in the output.
Also note that you'll get whatever revisions happen to be in the exported data. That is, you might get multiple document revisions for a given document ID. Depending on your intended use of the data, you may need to figure out how to grab just the latest revision of each document ID.
from awsglue.transforms import *
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from pyspark.sql.functions import explode
from pyspark.sql.functions import col
from awsglue.dynamicframe import DynamicFrame
# Initializations
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
# Load data. 'vehicle-registration-ion' is the name of your database in the Glue catalog for the export data. '2020' is the name of your table in the Glue catalog.
dyn0 = glueContext.create_dynamic_frame.from_catalog(database = "vehicle-registration-ion", table_name = "2020", transformation_ctx = "datasource0")
# Only give me exported records with revisions
dyn1 = dyn0.filter(lambda line: "revisions" in line)
# Now give me just the revisions element and convert to a Spark DataFrame.
df0 = dyn1.select_fields("revisions").toDF()
# Revisions is an array, so give me all of the array items as top-level "rows" instead of being a nested array field.
df1 = df0.select(explode(df0.revisions))
# Now I have a list of elements with "col" as their root node and the revision
# fields ("data", "metadata", etc.) as sub-elements. Explode() gave me the "col"
# root node and some rows with null "data" fields, so filter out the nulls.
df2 = df1.where(col("col.data").isNotNull())
# Now convert back to a DynamicFrame
dyn2 = DynamicFrame.fromDF(df2, glueContext, "dyn2")
# Prep and send the output to S3
applymapping1 = ApplyMapping.apply(frame = dyn2, mappings = [("col.data", "struct", "data", "struct"), ("col.metadata", "struct", "metadata", "struct")], transformation_ctx = "applymapping1")
datasink0 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": "s3://YOUR_BUCKET_NAME_HERE/YOUR_DESIRED_OUTPUT_PATH_HERE/"}, format = "json", transformation_ctx = "datasink0")
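If you only need the latest revision of each document, one possible approach (a sketch, assuming the standard QLDB revision metadata fields id and version) is to deduplicate df2 with a window function before converting back to a DynamicFrame:

from pyspark.sql import Window
from pyspark.sql.functions import row_number, col

# Keep only the highest version number per document id.
w = Window.partitionBy(col("col.metadata.id")).orderBy(col("col.metadata.version").desc())
df_latest = df2.withColumn("rn", row_number().over(w)).where(col("rn") == 1).drop("rn")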
I hope this helps!

Tableau embedded data source refresh using python

Is there a way to refresh a Tableau embedded data source using Python? I am currently using the Tableau Server Client library to refresh published data sources, which is working fine. Can someone help me figure out a way?
The way you can reach them is kind of annoying, from my perspective.
You need to use the populate_connections() function to load the embedded data sources. It is easier if you know the name of the workbook.
import tableauserverclient as TSC

# Sign in using a personal access token
server = TSC.Server(server_address='server_name', use_server_version=True)
server.auth.sign_in_with_personal_access_token(auth_req=TSC.PersonalAccessTokenAuth(token_name='tokenName', personal_access_token='tokenValue', site_id='site_name'))

# Use RequestOptions() with a filter to pull a specific workbook
def get_workbook(name):
    req_opt = TSC.RequestOptions()
    req_opt.filter.add(TSC.Filter(req_opt.Field.Name, req_opt.Operator.Equals, name))
    # workbooks.get() returns a list of items to iterate over, but here we assume it finds only one result
    return server.workbooks.get(req_opt)[0][0]

workbook = get_workbook(name='workbook_name')     # gets the workbook
server.workbooks.populate_connections(workbook)   # loads all the embedded data sources in the workbook

for datasource in workbook.connections:  # iterate over the data source list
    # Note: each element of this list is not a TSC.DatasourceItem, so you need to load a valid one
    # using the "datasource_id" attribute of the element.
    # If you try server.datasources.refresh(datasource) directly, it will fail.
    ds = server.datasources.get_by_id(datasource.datasource_id)  # loads a valid TSC.DatasourceItem
    server.datasources.refresh(ds)  # finally, you are able to refresh it
...
The best practice is not to embed data sources, but to publish them independently.
Update:
There is an easy way to achieve this. There are two types of extract refresh tasks, workbook and data source. So, for embedded data sources, you need to perform a workbook refresh.
workbook = get_workbook(name='workbook_name')
server.workbooks.refresh(workbook.id)
You can use the "tableauserverclient" Python package; you can pip install it from PyPI.
After installing it, you can consult the docs.
I will attach an example I used some time ago:
import tableauserverclient as TSC

tableau_auth = TSC.TableauAuth('user', 'pass', 'homepage')
server = TSC.Server('server')

with server.auth.sign_in(tableau_auth):
    all_datasources, pagination_item = server.datasources.get()
    print("\nThere are {} datasources on site:".format(pagination_item.total_available))
    print([datasource.name for datasource in all_datasources])

Reading Data From Cloud Storage Via Cloud Functions

I am trying to do a quick proof of concept for building a data processing pipeline in Python. To do this, I want to build a Google Cloud Function which will be triggered when certain .csv files are dropped into Cloud Storage.
I followed along with this Google Cloud Functions Python tutorial, and while the sample code does trigger the function to create some simple logs when a file is dropped, I am really stuck on what call I have to make to actually read the contents of the data. I tried to search for an SDK/API guidance document but have not been able to find one.
In case this is relevant, once I process the .csv, I want to be able to add some data that I extract from it into GCP's Pub/Sub.
The function does not actually receive the contents of the file, just some metadata about it.
You'll want to use the google-cloud-storage client. See the "Downloading Objects" guide for more details.
Putting that together with the tutorial you're using, you get a function like:
from google.cloud import storage

storage_client = storage.Client()

def hello_gcs_generic(data, context):
    bucket = storage_client.get_bucket(data['bucket'])
    blob = bucket.blob(data['name'])
    contents = blob.download_as_string()
    # Process the file contents, etc...
This is an alternative solution using pandas:
Cloud Function Code:
import pandas as pd

def GCSDataRead(event, context):
    bucketName = event['bucket']
    blobName = event['name']
    fileName = "gs://" + bucketName + "/" + blobName
    # Reading gs:// paths with pandas requires the gcsfs package to be installed.
    dataFrame = pd.read_csv(fileName, sep=",")
    print(dataFrame)
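Since the question also mentions pushing extracted values into Pub/Sub, here is a minimal sketch using the google-cloud-pubsub client once the contents have been read with either approach above; the project and topic names are placeholders:

import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("your-project-id", "your-topic")  # placeholders

def publish_row(row_dict):
    # Pub/Sub messages are bytes, so encode the extracted data as JSON.
    future = publisher.publish(topic_path, data=json.dumps(row_dict).encode("utf-8"))
    return future.result()  # returns the message ID once the publish succeeds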
