Write data from broadcast variable (databricks) to azure blob - python

I have a url from where I download the data (which is in JSON format) using Databricks:
url="https://tortuga-prod-eu.s3-eu-west-1.amazonaws.com/%2FNinetyDays/amzf277698d77514b44"
testfile = urllib.request.URLopener()
testfile.retrieve(url, "file.gz")
with gzip.GzipFile("file.gz", 'r') as fin:
json_bytes = fin.read()
json_str = json_bytes.decode('utf-8')
data = json.loads(json_str)
Now I want to save this data in an Azure container as a blob (.json file).
I have tried saving the data in a dataframe and writing the df to a mounted location, but the data is several GB and I get a spark.rpc.message.maxSize (268435456 bytes) error.
I have also tried saving the data in a broadcast variable (it saves successfully), but I am not sure how to write the data from the broadcast variable to the mounted location.
Here is how I save the data in the broadcast variable:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
broadcastStates = spark.sparkContext.broadcast(data)
print(broadcastStates.value)
My question is:
Is there any way I can write data from the broadcast variable to the Azure mounted location?
If not, then please guide me on the right/best way to get this job done.

It is not possible to write a broadcast variable directly into mounted Azure Blob Storage. However, there is a way to write the value of a broadcast variable into a file.
pyspark.broadcast.Broadcast provides two methods, dump() and load_from_path(), which you can use to write and read the value of a broadcast variable. Since you have created the broadcast variable using:
broadcastStates = spark.sparkContext.broadcast(data)
Use the following syntax to write the value of the broadcast variable to a file:
<broadcast_variable>.dump(<broadcast_variable>.value, f)
Note: the second argument is a file object, not a path. It must have a write() method and should be opened in binary mode, because dump() serializes the value with pickle.
To read this data back, you can use load_from_path(), passing the path of the file written above:
<broadcast_variable>.load_from_path(path)
Note: load_from_path() opens the file itself, so it takes a path string rather than a file object.
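As a minimal hedged sketch (the /dbfs/mnt/mycontainer path below is only a placeholder for your own mounted location):

# Hypothetical mount path; replace with your actual mounted container.
out_path = "/dbfs/mnt/mycontainer/broadcast_data.pkl"

# dump() pickles the value, so open the target file in binary write mode.
with open(out_path, "wb") as f:
    broadcastStates.dump(broadcastStates.value, f)

# Read the pickled value back from the same path.
restored = broadcastStates.load_from_path(out_path)

Keep in mind that dump() writes a pickle, not JSON; if the blob must be a .json file, you could instead json.dump the broadcast value to the mounted path.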
There might also be a way to avoid the "spark.rpc.message.maxSize (268435456 bytes)" error. The default value of spark.rpc.message.maxSize is 128 (MiB). Refer to the following document to learn more about this setting:
https://spark.apache.org/docs/latest/configuration.html#networking
While creating a cluster in Databricks, we can configure and increase this value to avoid the error. The steps to configure the cluster are:
- While creating the cluster, expand the advanced options (at the bottom of the page).
- Under the Spark tab, enter the configuration key and its value as shown below.
- Click Create Cluster.
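For example, the Spark config box could contain a line like the following (512 here is only an illustrative value, not a recommendation; choose a limit that suits your payload):

spark.rpc.message.maxSize 512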
This might help in writing the dataframe directly to mounted blob storage without using broadcast variables.
You can also try increasing the number of partitions so that the dataframe is saved as multiple smaller files, which avoids the maxSize error. Refer to the following document about configuring Spark and partitioning:
https://kb.databricks.com/execution/spark-serialized-task-is-too-large.html
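As a rough sketch of that idea (the partition count and output path are placeholders, not tuned values):

# Writing with more, smaller partitions keeps each serialized task below
# the spark.rpc.message.maxSize limit.
df.repartition(200) \
  .write.mode("overwrite") \
  .json("/mnt/mycontainer/output/")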

How to read HDF5 files in R without the memory error?

Goal
Read the data component of a hdf5 file in R.
Problem
I am using rhdf5 to read hdf5 files in R. Out of 75 files, it successfully reads 61. For the rest, it throws a memory error, even though some of the failing files are smaller than files that were read successfully.
I have tried reading the individual failing files in a fresh R session, but I get the same error.
Following is an example:
# Exploring the contents of the file:
library(rhdf5)
h5ls("music_0_math_0_simple_12_2022_08_08.hdf5")
  group                name       otype  dclass       dim
0     /                data   H5I_GROUP
1 /data           ACC_State H5I_DATASET INTEGER     1 x 1
2 /data    ACC_State_Frames H5I_DATASET INTEGER         1
3 /data         ACC_Voltage H5I_DATASET   FLOAT 24792 x 1
4 /data AUX_CACC_Adjust_Gap H5I_DATASET INTEGER 24792 x 1
... output continues ...
# Reading the file:
rhdf5::h5read("music_0_math_0_simple_12_2022_08_08.hdf5", name = "data")
Error in H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem, :
Not enough memory to read data! Try to read a subset of data by specifying the index or count parameter.
In addition: Warning message:
In h5checktypeOrOpenLoc(file, readonly = TRUE, fapl = NULL, native = native) :
An open HDF5 file handle exists. If the file has changed on disk meanwhile, the function may not work properly. Run 'h5closeAll()' to close all open HDF5 object handles.
Error: Error in h5checktype(). H5Identifier not valid.
I can read the file via python:
import h5py
filename = "music_0_math_0_simple_12_2022_08_08.hdf5"
hf = h5py.File(filename, "r")
hf.keys()
data = hf.get('data')
data['SCC_Follow_Info']
#<HDF5 dataset "SCC_Follow_Info": shape (9, 24792), type "<f4">
How can I successfully read the file in R?
When you ask to read the data group, rhdf5 will read all the underlying datasets into R's memory. It's not clear from your example exactly how much data that is, but maybe for some of your files it really is more than the available memory on your computer. I don't know how h5py works under the hood, but perhaps it doesn't read any datasets until you run data['SCC_Follow_Info']?
One option to try, is that rather than reading the entire data group, you could be more selective and try reading only the specific dataset you're interested in at that moment. In the Python example that seems to be /data/SCC_Follow_Info.
You can do that with something like:
follow_info <- h5read(file = "music_0_math_0_simple_12_2022_08_08.hdf5",
                      name = "/data/SCC_Follow_Info")
Once you've finished working with that dataset, remove it from your R session, e.g. rm(follow_info), and read the next dataset or file you need.

How to run Python script over result generated with U-SQL script in Azure Machine Learning Pipelines?

I want to process large tables stored in an Azure Data Lake Storage (Gen 1), first running on them a U-SQL script, then a Python script, and finally output the result.
Conceptually this is pretty simple:
Run a .usql script to generate intermediate data (two tables, intermediate_1 and intermediate_2) from a large initial_table
Run a Python script over the intermediate data to generate the final result final
What should be the Azure Machine Learning Pipeline steps to do this?
I thought the following plan would work:
Run the .usql query on adla_compute using an AdlaStep like:
int_1 = PipelineData("intermediate_1", datastore=adls_datastore)
int_2 = PipelineData("intermediate_2", datastore=adls_datastore)

adla_step = AdlaStep(script_name='script.usql',
                     source_directory=sample_folder,
                     inputs=[initial_table],
                     outputs=[int_1, int_2],
                     compute_target=adla_compute)
Run a Python step on a compute target aml_compute like:
python_step = PythonScriptStep(script_name="process.py",
                               arguments=["--input1", int_1, "--input2", int_2, "--output", final],
                               inputs=[int_1, int_2],
                               outputs=[final],
                               compute_target=aml_compute,
                               source_directory=source_directory)
This, however, fails at the Python step with an error of this kind:
StepRun(process.py) Execution Summary
======================================
StepRun(process.py) Status: Failed
Unable to mount data store mydatastore because it does not specify a
storage account key.
I don't really understand the error complaining about 'mydatastore', which is the name tied to the adls_datastore Azure Data Lake datastore reference that I am running the U-SQL queries against.
Can someone tell if I am doing something really wrong here?
Should I move the intermediate data (intermediate_1 and intermediate_2) to a storage account, e.g. with a DataTransferStep, before the PythonScriptStep?
ADLS does not support mount. So you are right, you will have to use a DataTransferStep to move the data to blob storage first.
Azure Data Lake Store is not supported as a datastore for AML compute. This table lists the different compute targets and their level of support for different datastores: https://learn.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data#compute-and-datastore-matrix
You can use DataTransferStep to copy data from ADLS to blob and then use that blob as input for PythonScriptStep. Sample notebook: https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-data-transfer.ipynb
# register blob datastore, example in linked notebook
# blob_datastore = Datastore.register_azure_blob_container(...

int_1_blob = DataReference(
    datastore=blob_datastore,
    data_reference_name="int_1_blob",
    path_on_datastore="int_1")

copy_int_1_to_blob = DataTransferStep(
    name='copy int_1 to blob',
    source_data_reference=int_1,
    destination_data_reference=int_1_blob,
    compute_target=data_factory_compute)

int_2_blob = DataReference(
    datastore=blob_datastore,
    data_reference_name="int_2_blob",
    path_on_datastore="int_2")

copy_int_2_to_blob = DataTransferStep(
    name='copy int_2 to blob',
    source_data_reference=int_2,
    destination_data_reference=int_2_blob,
    compute_target=data_factory_compute)

# update PythonScriptStep to use blob data references
python_step = PythonScriptStep(...
                               arguments=["--input1", int_1_blob, "--input2", int_2_blob, "--output", final],
                               inputs=[int_1_blob, int_2_blob],
                               ...)
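To finish the picture, here is a hedged sketch of how the steps might be assembled and submitted (this assumes the v1 azureml SDK used above; ws and the experiment name are placeholders):

from azureml.core import Experiment
from azureml.pipeline.core import Pipeline

# Step ordering is inferred from the data dependencies, so listing all steps is enough.
pipeline = Pipeline(workspace=ws,
                    steps=[adla_step, copy_int_1_to_blob, copy_int_2_to_blob, python_step])
run = Experiment(ws, 'usql-python-pipeline').submit(pipeline)
run.wait_for_completion(show_output=True)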

input file is not getting read from pd.read_csv

I'm trying to read a file stored in Google Storage from Apache Beam using pandas, but I'm getting an error.
def Panda_a(self):
    import pandas as pd
    data = 'gs://tegclorox/Input/merge1.csv'
    df1 = pd.read_csv(data, names=['first_name', 'last_name', 'age',
                                   'preTestScore', 'postTestScore'])
    return df1

ip2 = p | 'Split WeeklyDueto' >> beam.Map(Panda_a)
ip7 = ip2 | 'print' >> beam.io.WriteToText('gs://tegclorox/Output/merge1234')
When I execute the above code, the error says the path does not exist. Any idea why?
A bunch of things are wrong with this code.
You are trying to get Pandas to read a file from Google Cloud Storage. Pandas does not support the Google Cloud Storage filesystem (as @Andrew pointed out, the documentation says the supported schemes are http, ftp, s3, and file). However, you can use the Beam FileSystems.open() API to get a file object, and give that object to Pandas instead of the file path.
p | ... >> beam.Map(...): beam.Map(f) transforms every element of the input PCollection using the given function f; it can't be applied to the pipeline itself. It seems that in your case you simply want to run the Pandas code without any input. You can simulate that by supplying a bogus input, e.g. beam.Create(['ignored']).
beam.Map(f) requires f to return a single value (or, more precisely, if it returns a list it will interpret that list as a single value), but your code gives it a function that returns a Pandas dataframe. I strongly doubt you want a PCollection containing a single element where that element is the entire dataframe; more likely you want one element for every row of the dataframe. For that you need beam.FlatMap, together with df.iterrows() or something like it.
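Putting those three points together, here is a minimal hedged sketch; the bucket path and column names are taken from your snippet, while the function and label names are just illustrative:

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems
import pandas as pd

def read_csv_rows(_unused):
    # Open the GCS object through Beam's filesystem layer and hand the
    # file object (not the path) to Pandas.
    f = FileSystems.open('gs://tegclorox/Input/merge1.csv')
    df = pd.read_csv(f, names=['first_name', 'last_name', 'age',
                               'preTestScore', 'postTestScore'])
    f.close()
    # Emit one element per dataframe row rather than the whole dataframe.
    for _, row in df.iterrows():
        yield row.to_dict()

ip2 = (p
       | 'Seed' >> beam.Create(['ignored'])          # bogus single-element input
       | 'ReadWithPandas' >> beam.FlatMap(read_csv_rows))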
In general, I am not sure why you are reading the CSV file using Pandas at all. You can read it using Beam's ReadFromText with skip_header_lines=1 and then parse each line yourself; if you have a large amount of data, this will be a lot more efficient (and if you only have a small amount of data and do not anticipate it ever growing beyond the capabilities of a single machine, say above a few GB, then Beam is the wrong tool).
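And a sketch of that native-Beam alternative (the parse function is hypothetical and assumes a simple comma-separated layout with no quoting):

import apache_beam as beam

def parse_line(line):
    # Hypothetical parser: split each CSV line into the expected columns.
    first_name, last_name, age, pre, post = line.split(',')
    return {'first_name': first_name, 'last_name': last_name, 'age': age,
            'preTestScore': pre, 'postTestScore': post}

rows = (p
        | 'ReadCSV' >> beam.io.ReadFromText('gs://tegclorox/Input/merge1.csv',
                                            skip_header_lines=1)
        | 'ParseCSV' >> beam.Map(parse_line))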

Azure blob storage to JSON in azure function using SDK

I am trying to create a timer trigger azure function that takes data from blob, aggregates it, and puts the aggregates in a cosmosDB. I previously tried using the bindings in azure functions to use blob as input, which I was informed was incorrect (see this thread: Azure functions python no value for named parameter).
I am now using the SDK and am running into the following problem:
import sys, os.path
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), 'myenv/Lib/site-packages')))
import json
import pandas as pd
from azure.storage.blob import BlockBlobService

data = BlockBlobService(account_name='accountname', account_key='accountkey')
container_name = 'container'
generator = data.list_blobs(container_name)

for blob in generator:
    print("{}".format(blob.name))
    json = json.loads(data.get_blob_to_text('container', open(blob.name)))
    df = pd.io.json.json_normalize(json)
    print(df)
This results in an error:
IOError: [Errno 2] No such file or directory: 'test.json'
I realize this might be an absolute path issue, but I'm not sure how that works with Azure storage. Any ideas on how to circumvent this?
Made it "work" by doing the following:
for blob in generator:
    loader = data.get_blob_to_text('kvaedevdystreamanablob', blob.name, if_modified_since=delta)
    json = json.loads(loader.content)
This works for ONE json file, i.e. when I only had one in storage, but when more are added I get this error:
ValueError: Expecting object: line 1 column 21907 (char 21906)
This happens even if I add if_modified_since so as to only take in one blob. Will update if I figure something out. Help is always welcome.
Another update: my data comes in through Stream Analytics and then down to the blob. I had selected that the data should come in as arrays, which is why the error occurs: when the stream is terminated, the blob doesn't immediately append ] to the EOF line of the json, so the json file isn't valid. Will try now using line-by-line output in Stream Analytics instead of arrays.
Figured it out. In the end it was a quite simple fix:
I had to make sure each json entry in the blob was less than 1024 characters, or it would create a new line, thus making reading lines problematic.
The code that iterates through each blob file, reads it, and adds it to a list is as follows:
import json

data = BlockBlobService(account_name='accname', account_key='key')
generator = data.list_blobs('collection')
dataloaded = []

for blob in generator:
    loader = data.get_blob_to_text('collection', blob.name)
    trackerstatusobjects = loader.content.split('\n')
    for trackerstatusobject in trackerstatusobjects:
        dataloaded.append(json.loads(trackerstatusobject))
From this you can load everything into a dataframe and do whatever you want :)
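For instance, a quick sketch of that last step (using the same json_normalize helper as in the question; nothing here is specific to Azure):

import pandas as pd

# Flatten the list of parsed JSON objects into a dataframe.
# Newer pandas versions expose this as pd.json_normalize().
df = pd.io.json.json_normalize(dataloaded)
print(df.head())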
Hope this helps if someone stumbles upon a similar problem.

reading numpy arrays from GCS into spark

I have 100 .npz files containing numpy arrays in Google Storage.
I have set up Dataproc with Jupyter and I am trying to read all the numpy arrays into a Spark RDD. What is the best way to load the numpy arrays from Google Storage into pyspark?
Is there an easy way, like np.load("gs://path/to/array.npz"), to load the numpy array and then do sc.parallelize on it?
If you plan to scale eventually, you'll want to use the distributed input methods in SparkContext rather than doing any local file loading from the driver program and relying on sc.parallelize. It sounds like you need to read each of the files intact, though, so in your case you want:
npz_rdd = sc.binaryFiles("gs://path/to/directory/containing/files/")
Or you can also specify single files if you want, but then you just have an RDD with a single element:
npz_rdd = sc.binaryFiles("gs://path/to/directory/containing/files/arr1.npz")
Then each record is a pair of <filename>, <byte string>. On Dataproc, sc.binaryFiles will just automatically work directly with GCS paths, unlike np.load, which requires local filesystem paths.
Then in your worker code, you just need to wrap those byte strings in an in-memory file object (io.BytesIO on Python 3; StringIO on the Python 2 this answer was written for) and pass that to np.load:
import numpy as np
from io import BytesIO

# For example, to create an RDD of the 'arr_0' element of each of the .npz archives:
npz_rdd.map(lambda l: np.load(BytesIO(l[1]))['arr_0'])
During development, if you really want to just read the files into your main driver program, you can always collapse your RDD down using collect() to retrieve the bytes locally:
npz_rdd = sc.binaryFiles("gs://path/to/directory/containing/files/arr1.npz")
local_bytes = npz_rdd.collect()[0][1]
local_np_obj = np.load(BytesIO(local_bytes))
