Azure ML PipelineData with DataTransferStep results in 0 bytes file - python

I am building an Azure ML pipeline with the azureml Python SDK. The pipeline calls a PythonScriptStep which stores data on the workspaceblobstore of the AML workspace.
I would like to extend the pipeline to export the pipeline data to an Azure Data Lake (Gen 1). Connecting the output of the PythonScriptStep directly to Azure Data Lake (Gen 1) is not supported by Azure ML as far as I understand. Therefore, I added an extra DataTransferStep to the pipeline, which takes the output of the PythonScriptStep directly as its input. According to the Microsoft documentation this should be possible.
So far I have built this solution, but it results in a file of 0 bytes on the Gen 1 Data Lake. I think the output_export_blob PipelineData does not correctly reference test.csv, and therefore the DataTransferStep cannot find the input. How can I connect the DataTransferStep correctly to the PipelineData output from the PythonScriptStep?
Example I followed:
https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-with-data-dependency-steps.ipynb
pipeline.py
input_dataset = delimited_dataset(
    datastore=prdadls_datastore,
    folderpath=FOLDER_PATH_INPUT,
    filepath=INPUT_PATH
)

output_export_blob = PipelineData(
    'export_blob',
    datastore=workspaceblobstore_datastore,
)

test_step = PythonScriptStep(
    script_name="test_upload_stackoverflow.py",
    arguments=[
        "--output_extract", output_export_blob,
    ],
    inputs=[
        input_dataset.as_named_input('input'),
    ],
    outputs=[output_export_blob],
    compute_target=aml_compute,
    source_directory="."
)

output_export_adls = DataReference(
    datastore=prdadls_datastore,
    path_on_datastore=os.path.join(FOLDER_PATH_OUTPUT, 'test.csv'),
    data_reference_name='export_adls'
)

export_to_adls = DataTransferStep(
    name='export_output_to_adls',
    source_data_reference=output_export_blob,
    source_reference_type='file',
    destination_data_reference=output_export_adls,
    compute_target=adf_compute
)

pipeline = Pipeline(
    workspace=aml_workspace,
    steps=[
        test_step,
        export_to_adls
    ]
)
test_upload_stackoverflow.py
import argparse
import os
import pathlib
from azureml.core import Datastore, Run

parser = argparse.ArgumentParser("train")
parser.add_argument("--output_extract", type=str)
args = parser.parse_args()

run = Run.get_context()
df_data_all = (
    run
    .input_datasets["input"]
    .to_pandas_dataframe()
)

os.makedirs(args.output_extract, exist_ok=True)
df_data_all.to_csv(
    os.path.join(args.output_extract, "test.csv"),
    index=False
)

The code example is immensely helpful, thanks for that. You're right that it can be confusing to get PythonScriptStep -> PipelineData working, even initially without the DataTransferStep.
I don't know 100% what's going on, but I thought I'd spitball some ideas:
Does your PipelineData, export_blob, actually contain the "test.csv" file? I would verify that before troubleshooting the DataTransferStep. You can verify this using the SDK, or more easily with the UI:
Go to the PipelineRun page and click on the PythonScriptStep in question.
On the "Outputs + Logs" page there's a "Data Outputs" section (which is slow to load initially).
Open it, you'll see the output PipelineDatas, then click on "View Output".
Navigate to the given path in either the Azure Portal or Azure Storage Explorer.
In test_upload_stackoverflow.py you are treating the PipelineData as a directory when you call .to_csv(), as opposed to a file, which would mean just calling df_data_all.to_csv(args.output_extract, index=False). Perhaps try defining the PipelineData with is_directory=True. Not sure if this is required though.
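To make that last point concrete, here is a minimal sketch of the two variants, reusing the names from pipeline.py above. The is_directory and source_reference_type settings are suggestions to try, not something I have verified against your workspace:

# Variant A: treat the PipelineData as a single file.
# In test_upload_stackoverflow.py, write straight to the path you were given:
#     df_data_all.to_csv(args.output_extract, index=False)
# and keep source_reference_type='file' in the DataTransferStep.

# Variant B: keep the directory layout and declare it explicitly.
output_export_blob = PipelineData(
    'export_blob',
    datastore=workspaceblobstore_datastore,
    is_directory=True,
)

export_to_adls = DataTransferStep(
    name='export_output_to_adls',
    source_data_reference=output_export_blob,
    source_reference_type='directory',  # match what the script actually produced
    destination_data_reference=output_export_adls,  # would then point at a folder rather than test.csv
    compute_target=adf_compute
)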

Related

apache-beam: reading multiple files from multiple folders of GCS buckets and loading them into BigQuery - python

I want to set up a pipeline that runs every hour to parse 2000 raw protobuf-format files in different folders of GCS buckets and load the data into BigQuery. So far I'm able to parse the proto data successfully.
I know the wildcard method to read all the files in a folder, but I don't want that here because I have data in different folders and I want to process them in parallel rather than sequentially,
like below
for x, filename in enumerate(file_separated_comma):
    # read data from proto
    # load data to BigQuery
Now I want to know whether the approach below is the best or recommended way of parsing multiple files from different folders in Apache Beam and loading the data into BigQuery.
One more thing: after parsing each record from proto, I turn it into a JSON record to load into BigQuery, and I don't know whether that is a good way to load the data, as opposed to directly loading the deserialized (parsed) proto data.
I'm moving from a Hadoop job to Dataflow to reduce cost by setting up this pipeline.
I'm new to apache-beam and don't know the pros and cons, so could somebody take a look at the code and help me find a better approach for production?
import time
import sys
import argparse
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
import csv
import base64
import rtbtracker_log_pb2
from google.protobuf import timestamp_pb2
from google.protobuf.json_format import MessageToDict
from google.protobuf.json_format import MessageToJson
import io
from apache_beam.io.filesystems import FileSystems

def get_deserialized_log(serialized_log):
    log = rtbtracker_log_pb2.RtbTrackerLogProto()
    log.ParseFromString(serialized_log)
    return log

def print_row(message):
    message = message[3]
    message = message.replace('_', '/')
    message = message.replace('*', '=')
    message = message.replace('-', '+')
    #finalbunary=base64.b64decode(message.decode('UTF-8'))
    finalbunary = base64.b64decode(message)
    msg = get_deserialized_log(finalbunary)
    jsonObj = MessageToDict(msg)
    #jsonObj = MessageToJson(msg)
    return jsonObj

def parse_file(element):
    for line in csv.reader([element], quotechar='"', delimiter='\t', quoting=csv.QUOTE_ALL, skipinitialspace=True):
        return line

def run():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", dest="input", required=False)
    parser.add_argument("--output", dest="output", required=False)
    app_args, pipeline_args = parser.parse_known_args()
    with beam.Pipeline(options=PipelineOptions()) as p:
        input_list = app_args.input
        file_list = input_list.split(",")
        res_list = ["/home/file_{}-00000-of-00001.json".format(i) for i in range(len(file_list))]
        for i, file in enumerate(file_list):
            onesec = p | "Read Text {}".format(i) >> beam.io.textio.ReadFromText(file)
            parsingProtoFile = onesec | 'Parse file{}'.format(i) >> beam.Map(parse_file)
            printFileConetent = parsingProtoFile | 'Print output {}'.format(i) >> beam.Map(print_row)
            #i want to load to bigquery here
            ##LOAD DATA TO BIGQUERY
            #secondsec=printFileConetent | "Write TExt {}".format(i) >> ##beam.io.WriteToText("/home/file_{}".format(i),file_name_suffix=".json",
            ###num_shards=1 ,
            ##append_trailing_newlines = True)

if __name__ == '__main__':
    run()
Running the code below locally:
python3 another_main.py --input=tracker_one.gz,tracker_two.gz
I haven't mentioned the output path, as I don't want to save the data to GCS because I will be loading it into BigQuery.
And running with the DataflowRunner like below:
python3 final_beam_v1.py --input gs://bucket/folder/2020/12/23/00/00/fileread.gz --output gs://bucket/beamoutput_four/ --runner DataflowRunner --project PROJECT --staging_location gs://bucket/staging_four --temp_location gs://bucket/temp_four --region us-east1 --setup_file ./setup.py --job_name testing
I noticed that two jobs run for a single input file under the same job name, and I don't know why that is happening (please find attached a screenshot of the same).
That method of reading files is fine (as long as the number of input files isn't too large). However, if you can express the set of files you want to read as a wildcard expression (which can match against multiple folders), that will likely perform better, and Dataflow will read all the files that match the pattern in parallel.
For writing to BigQuery, it's best to use the built-in BigQuery sink. The default behavior is to create temp files in JSON format and then load those into BigQuery, but you can also use Avro instead, which can be more efficient. You can also combine all of your inputs into one PCollection using Flatten, so that you only need one BigQuery sink in your pipeline.
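To make both suggestions concrete, here is a rough sketch; the wildcard path, table name, and schema are placeholders, and parse_file / print_row are the functions from the question:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    with beam.Pipeline(options=PipelineOptions()) as p:
        # One wildcard can cover many folders; Dataflow reads matching files in parallel.
        lines = p | "Read all files" >> beam.io.ReadFromText("gs://bucket/folder/2020/12/23/*/*.gz")

        rows = (
            lines
            | "Parse line" >> beam.Map(parse_file)   # parse_file / print_row as defined in the question
            | "To dict" >> beam.Map(print_row)
        )

        # Built-in BigQuery sink; table and schema here are placeholders.
        rows | "Write to BQ" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.my_table",
            schema="field_one:STRING,field_two:INTEGER",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )

if __name__ == "__main__":
    run()

If you really do need a separate ReadFromText per folder, you can still merge the resulting PCollections with beam.Flatten() and attach a single WriteToBigQuery to the merged collection.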

Can't read .xlsx file on Azure Databricks

I'm working in Azure Databricks notebooks using Python, and I'm having trouble reading an Excel file and putting it into a Spark dataframe.
I saw that there were threads about the same problem, but the solutions don't seem to work for me.
I tried the following solution:
https://sauget-ch.fr/2019/06/databricks-charger-des-fichiers-excel-at-scale/
I did add the credentials to access my files on Azure Data Lake.
After installing all the libraries I needed, I'm running this code:
import xlrd
import pandas as pd
import azure.datalake.store

filePathBsp = projectFullPath + "BalanceShipmentPlan_20190724_19h31m37s.xlsx"
bspDf = pd.read_excel(AzureDLFileSystem.open(filePathBsp))
There, I use AzureDLFileSystem.open to get the file from Azure Data Lake, because pd.read_excel doesn't let me reach the file on the Lake directly.
The problem is, it gives me this error:
TypeError: open() missing 1 required positional argument: 'path'
I'm sure I can access this file, because when I try:
spark.read.csv(filePathBsp)
it can find my file.
Any ideas?
OK, after long days of research, I've finally found the solution.
Here it is!
First, you have to install the "spark-excel" library on your cluster.
Here's the page for this library: https://github.com/crealytics/spark-excel
You also need the "spark_hadoopOffice" library, or you'll get the following exception later:
java.io.IOException: org/apache/commons/collections4/IteratorUtils
Pay attention to your cluster's Scala version when you download the libraries; it's important.
Then, you have to mount Azure Data Lake Storage (ADLS) with your credentials, this way:
# Mount point
udbRoot = "****"

configs = {
    "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
    "dfs.adls.oauth2.client.id": "****",
    "dfs.adls.oauth2.credential": "****",
    "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/****/oauth2/token"
}

# unmount
# dbutils.fs.unmount(udbRoot)

# Mounting
dbutils.fs.mount(
    source = "adl://****",
    mount_point = udbRoot,
    extra_configs = configs
)
You only need to run the mount command once.
Then you can run this line of code:
testDf = spark.read.format("com.crealytics.spark.excel").option("useHeader", True).load(fileTest)
display(testDf)
Here you go! You have a Spark dataframe from an Excel file in Azure Data Lake Storage!
It worked for me, hopefully it will help someone else.
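As a small usage sketch tying the two snippets together (the mount point and file path below are hypothetical placeholders, not values from the original answer):

udbRoot = "/mnt/adls"  # hypothetical mount point
fileTest = udbRoot + "/myProject/BalanceShipmentPlan_20190724_19h31m37s.xlsx"

testDf = (
    spark.read.format("com.crealytics.spark.excel")
    .option("useHeader", True)
    .load(fileTest)
)
display(testDf)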

How to run a Python script over the result generated by a U-SQL script in Azure Machine Learning Pipelines?

I want to process large tables stored in Azure Data Lake Storage (Gen 1), first running a U-SQL script on them, then a Python script, and finally outputting the result.
Conceptually this is pretty simple:
Run a .usql script to generate intermediate data (two tables, intermediate_1 and intermediate_2) from a large initial_table
Run a Python script over the intermediate data to generate the final result final
What should be the Azure Machine Learning Pipeline steps to do this?
I thought the following plan would work:
Run the .usql query on an adla_compute using an AdlaStep like
int_1 = PipelineData("intermediate_1", datastore=adls_datastore)
int_2 = PipelineData("intermediate_2", datastore=adls_datastore)

adla_step = AdlaStep(script_name='script.usql',
                     source_directory=sample_folder,
                     inputs=[initial_table],
                     outputs=[int_1, int_2],
                     compute_target=adla_compute)
Run a Python step on a compute target aml_compute like
python_step = PythonScriptStep(script_name="process.py",
                               arguments=["--input1", int_1, "--input2", int_2, "--output", final],
                               inputs=[int_1, int_2],
                               outputs=[final],
                               compute_target=aml_compute,
                               source_directory=source_directory)
This, however, fails at the Python step with an error of the kind:
StepRun(process.py) Execution Summary
======================================
StepRun(process.py) Status: Failed
Unable to mount data store mydatastore because it does not specify a
storage account key.
I don't really understand the error complaining about 'mydatastore', which is the name tied to the adls_datastore Azure Data Lake datastore reference against which I am running the U-SQL queries.
Can someone smell if I am doing something really wrong here?
Should I move the intermediate data (intermediate_1 and intermediate_2) to a storage account, e.g. with a DataTransferStep, before the PythonScriptStep?
ADLS does not support mount. So, you are right, you will have to use DataTransferStep to move data to blob first.
Data Lake store is not supported for AML compute. This table lists different computes and their level of support for different datastores: https://learn.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data#compute-and-datastore-matrix
You can use DataTransferStep to copy data from ADLS to blob and then use that blob as input for PythonScriptStep. Sample notebook: https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-data-transfer.ipynb
# register blob datastore, example in linked notebook
# blob_datastore = Datastore.register_azure_blob_container(...

int_1_blob = DataReference(
    datastore=blob_datastore,
    data_reference_name="int_1_blob",
    path_on_datastore="int_1")

copy_int_1_to_blob = DataTransferStep(
    name='copy int_1 to blob',
    source_data_reference=int_1,
    destination_data_reference=int_1_blob,
    compute_target=data_factory_compute)

int_2_blob = DataReference(
    datastore=blob_datastore,
    data_reference_name="int_2_blob",
    path_on_datastore="int_2")

copy_int_2_to_blob = DataTransferStep(
    name='copy int_2 to blob',
    source_data_reference=int_2,
    destination_data_reference=int_2_blob,
    compute_target=data_factory_compute)

# update PythonScriptStep to use blob data references
python_step = PythonScriptStep(...
    arguments=["--input1", int_1_blob, "--input2", int_2_blob, "--output", final],
    inputs=[int_1_blob, int_2_blob],
    ...)
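To round this out, here is a minimal sketch of how the steps could be wired into a single pipeline; the aml_workspace variable and the experiment name are assumptions, following the same pattern as the first question above:

from azureml.core import Experiment
from azureml.pipeline.core import Pipeline

# If the ordering is not inferred automatically from the shared data
# references, make the dependency on the copy steps explicit:
python_step.run_after(copy_int_1_to_blob)
python_step.run_after(copy_int_2_to_blob)

pipeline = Pipeline(
    workspace=aml_workspace,
    steps=[adla_step, copy_int_1_to_blob, copy_int_2_to_blob, python_step]
)

run = Experiment(aml_workspace, "usql_then_python").submit(pipeline)
run.wait_for_completion(show_output=True)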

How can I combine two presentations (pptx) into one master presentation?

I'm part of a project team that created PPTX presentations to present to clients. After creating all of the files, we need to add additional slides to each presentation. All of the new slides will be the same across each presentation.
What is the best way to accomplish this programmatically?
I don't want to use VBA because (as far as I understand) I would have to open each presentation to run the script.
I've tried using the python-pptx library. But the documentation states:
"Copying a slide from one presentation to another turns out to be pretty hard to get right in the general case, so that probably won’t come until more of the backlog is burned down."
I was hoping something like the following would work -
from pptx import Presentation
main = Presentation('Universal.pptx')
abc = Presentation('Test1.pptx')
main_slides = main.slides.get(1)
abc_slides = abc.slides.get(1)
full = main.slides.add_slide(abc_slides[1])
full.save('Full.pptx')
Has anyone had success doing anything like that?
I was able to achieve this using Python and win32com.client. However, it doesn't run quietly: it launches Microsoft PowerPoint, opens the input files one by one, copies all slides from each input file, and pastes them into the output file in a loop.
import win32com.client
from os import walk

def mergePresentations(inputFileNames, outputFileName):
    Application = win32com.client.Dispatch("PowerPoint.Application")
    outputPresentation = Application.Presentations.Add()
    outputPresentation.SaveAs(outputFileName)
    for file in inputFileNames:
        currentPresentation = Application.Presentations.Open(file)
        currentPresentation.Slides.Range(range(1, currentPresentation.Slides.Count+1)).copy()
        Application.Presentations(outputFileName).Windows(1).Activate()
        outputPresentation.Application.CommandBars.ExecuteMso("PasteSourceFormatting")
        currentPresentation.Close()
    outputPresentation.save()
    outputPresentation.close()
    Application.Quit()

# Example; let's say you have a folder of presentations that need to be merged
# to a new file named "allSildesMerged.pptx" in the same folder
path, _, files = next(walk('C:\\Users\\..\\..\\myFolder'))
outputFileName = path + '\\' + 'allSildesMerged.pptx'
inputFiles = []
for file in files:
    inputFiles.append(path + '\\' + file)

mergePresentations(inputFiles, outputFileName)
The GroupDocs.Merger REST API is another option for merging multiple PowerPoint presentations into a single document. It is a paid API but provides 150 free API calls per month.
Currently, it supports working with cloud providers: Amazon S3, Dropbox, Google Drive Storage, Google Cloud Storage, Windows Azure Storage, and FTP Storage, along with GroupDocs' internal cloud storage. However, in the near future, it plans to support merging files from the request body (stream).
P.S.: I'm a developer evangelist at GroupDocs.
# For complete examples and data files, please go to https://github.com/groupdocs-merger-cloud/groupdocs-merger-cloud-python-samples
# Get Client ID and Client Secret from https://dashboard.groupdocs.cloud
import groupdocs_merger_cloud

client_id = "XXXX-XXXX-XXXX-XXXX"
client_secret = "XXXXXXXXXXXXXXXX"

documentApi = groupdocs_merger_cloud.DocumentApi.from_keys(client_id, client_secret)

item1 = groupdocs_merger_cloud.JoinItem()
item1.file_info = groupdocs_merger_cloud.FileInfo("four-slides.pptx")
item2 = groupdocs_merger_cloud.JoinItem()
item2.file_info = groupdocs_merger_cloud.FileInfo("one-slide.docx")

options = groupdocs_merger_cloud.JoinOptions()
options.join_items = [item1, item2]
options.output_path = "Output/joined.pptx"

result = documentApi.join(groupdocs_merger_cloud.JoinRequest(options))
A free tool called "powerpoint join" can help you.

BlobInfo object from a BlobKey created using blobstore.create_gs_key

I am converting code away from the deprecated Files API.
I have the following code, which works fine on the SDK dev server but fails in production. Is what I am doing even correct? If yes, what could be wrong, and any ideas on how to troubleshoot it?
# Code earlier writes the file bs_file_name. This works fine because I can see the file
# in the Cloud Console.
bk = blobstore.create_gs_key("/gs" + bs_file_name)
assert(bk)
if not isinstance(bk, blobstore.BlobKey):
    bk = blobstore.BlobKey(bk)
assert isinstance(bk, blobstore.BlobKey)
# next line fails here in production only
assert(blobstore.get(bk))  # <----------- blobstore.get(bk) returns None
Unfortunately, as per the documentation, you can't get a BlobInfo object for GCS files.
https://developers.google.com/appengine/docs/python/blobstore/#Python_Using_the_Blobstore_API_with_Google_Cloud_Storage
Note: Once you obtain a blobKey for the GCS object, you can pass it around, serialize it, and otherwise use it interchangeably anywhere you can use a blobKey for objects stored in Blobstore. This allows for usage where an app stores some data in blobstore and some in GCS, but treats the data otherwise identically by the rest of the app. (However, BlobInfo objects are currently not available for GCS objects.)
I encountered this exact same issue today, and it feels very much like a bug in the Blobstore API when using Google Cloud Storage.
Rather than leveraging the Blobstore API, I made use of the Google Cloud Storage client library. The library can be downloaded here: https://developers.google.com/appengine/docs/python/googlecloudstorageclient/download
To access a file on GCS:
import cloudstorage as gcs

with gcs.open(GCSFileName) as f:
    blob_content = f.read()
    print blob_content
It sucks that GAE behaves differently with BlobInfo locally and in the production environment; it took me a while to figure that out. But there is an easy solution:
You can use a BlobReader to access the data when you have the blob_key.
import logging
from google.appengine.ext import blobstore

def getBlob(blob_key):
    logging.info('getting blob(' + blob_key + ')')
    with blobstore.BlobReader(blob_key) as f:
        data_list = []
        chunk = f.read(1000)
        while chunk != "":
            data_list.append(chunk)
            chunk = f.read(1000)
        data = "".join(data_list)
        return data
https://developers.google.com/appengine/docs/python/blobstore/blobreaderclass
