I'm on Azure Databricks notebooks using Python, and I'm having trouble reading an Excel file and putting it in a Spark dataframe.
I saw that there are topics about the same problem, but their solutions don't seem to work for me.
I tried the following solution:
https://sauget-ch.fr/2019/06/databricks-charger-des-fichiers-excel-at-scale/
I did add the credentials to access my files on Azure Data Lake.
After installing all the libraries I needed, I'm running this code:
import pandas as pd
import xlrd
from azure.datalake.store.core import AzureDLFileSystem

filePathBsp = projectFullPath + "BalanceShipmentPlan_20190724_19h31m37s.xlsx"
bspDf = pd.read_excel(AzureDLFileSystem.open(filePathBsp))
There, I use AzureDLFileSystem.open to get the file from Azure Data Lake, because pd.read_excel cannot reach my file in the Lake on its own.
The problem is, it gives me this error:
TypeError: open() missing 1 required positional argument: 'path'
I'm sure I can access this file because when I try:
spark.read.csv(filePathBsp)
it can find my file.
Any ideas?
OK, after long days of research, I've finally found the solution.
Here it is !
First, you have to install the spark-excel library on your cluster.
Here's the page for this library: https://github.com/crealytics/spark-excel
You also need the spark_hadoopOffice library, or you'll get the following exception later:
java.io.IOException: org/apache/commons/collections4/IteratorUtils
Pay attention to your cluster's Scala version when you download the libraries; it matters.
Then, you have to mount Azure Data Lake Storage (ADLS) with your credentials, this way:
# Mount point
udbRoot = "****"

configs = {
    "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
    "dfs.adls.oauth2.client.id": "****",
    "dfs.adls.oauth2.credential": "****",
    "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/****/oauth2/token"
}

# Unmount if needed
# dbutils.fs.unmount(udbRoot)

# Mounting
dbutils.fs.mount(
    source = "adl://****",
    mount_point = udbRoot,
    extra_configs = configs
)
You need to do the mount command only once.
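Since the mount only needs to happen once, a hedged way to make the notebook safe to re-run is to check the existing mounts first (a sketch that assumes the same udbRoot and configs as above):
# Sketch: only mount if the mount point is not already present
if not any(m.mountPoint == udbRoot for m in dbutils.fs.mounts()):
    dbutils.fs.mount(
        source = "adl://****",
        mount_point = udbRoot,
        extra_configs = configs
    )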
Then, you can run this line of code:
testDf = spark.read.format("com.crealytics.spark.excel").option("useHeader", True).load(fileTest)
display(testDf)
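If you also want Spark to infer column types, a hedged variant of the same read (option names can differ between spark-excel versions, so treat this as a sketch):
testDf = (spark.read
    .format("com.crealytics.spark.excel")
    .option("useHeader", True)      # first row is the header, as above
    .option("inferSchema", True)    # assumption: your spark-excel version supports schema inference
    .load(fileTest))
display(testDf)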
Here you go! You have a Spark DataFrame from an Excel file in Azure Data Lake Storage!
It worked for me, hopefully it will help someone else.
I am building an Azure ML pipeline with the azureml Python SDK. The pipeline calls a PythonScriptStep which stores data on the workspaceblobstore of the AML workspace.
I would like to extend the pipeline to export the pipeline data to an Azure Data Lake (Gen 1). Connecting the output of the PythonScriptStep directly to Azure Data Lake (Gen 1) is not supported by Azure ML as far as I understand. Therefore, I added an extra DataTransferStep to the pipeline, which takes the output from the PythonScriptStep as input directly into the DataTransferStep. According to the Microsoft documentation this should be possible.
So far I have built this solution, only this results in a file of 0 bytes on the Gen 1 Data Lake. I think the output_export_blob PipelineData does not correctly reference the test.csv, and therefore the DataTransferStep cannot find the input. How can I connect the DataTransferStep correctly with the PipelineData output from the PythonScriptStep?
Example I followed:
https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-with-data-dependency-steps.ipynb
pipeline.py
input_dataset = delimited_dataset(
    datastore=prdadls_datastore,
    folderpath=FOLDER_PATH_INPUT,
    filepath=INPUT_PATH
)

output_export_blob = PipelineData(
    'export_blob',
    datastore=workspaceblobstore_datastore,
)

test_step = PythonScriptStep(
    script_name="test_upload_stackoverflow.py",
    arguments=[
        "--output_extract", output_export_blob,
    ],
    inputs=[
        input_dataset.as_named_input('input'),
    ],
    outputs=[output_export_blob],
    compute_target=aml_compute,
    source_directory="."
)

output_export_adls = DataReference(
    datastore=prdadls_datastore,
    path_on_datastore=os.path.join(FOLDER_PATH_OUTPUT, 'test.csv'),
    data_reference_name='export_adls'
)

export_to_adls = DataTransferStep(
    name='export_output_to_adls',
    source_data_reference=output_export_blob,
    source_reference_type='file',
    destination_data_reference=output_export_adls,
    compute_target=adf_compute
)

pipeline = Pipeline(
    workspace=aml_workspace,
    steps=[
        test_step,
        export_to_adls
    ]
)
test_upload_stackoverflow.py
import argparse
import os
import pathlib

from azureml.core import Datastore, Run

parser = argparse.ArgumentParser("train")
parser.add_argument("--output_extract", type=str)
args = parser.parse_args()

run = Run.get_context()
df_data_all = (
    run
    .input_datasets["input"]
    .to_pandas_dataframe()
)

os.makedirs(args.output_extract, exist_ok=True)
df_data_all.to_csv(
    os.path.join(args.output_extract, "test.csv"),
    index=False
)
The code example is immensely helpful, thanks for that. You're right that it can be confusing to get PythonScriptStep -> PipelineData working; I'd get that part working first, even without the DataTransferStep.
I don't know 100% what's going on, but I thought I'd spitball some ideas:
Does your PipelineData, export_blob, contain the "test.csv" file? I would verify that before troubleshooting the DataTransferStep. You can verify this using the SDK, or more easily with the UI.
Go to the PipelineRun page, click on the PythonScriptStep in question.
On "Outputs + Logs" page, there's a "Data Outputs" Section (that is slow to load initially)
Open it and you'll see the output PipelineData objects; then click on "View Output".
Navigate to given path either in the Azure Portal or Azure Storage Explorer.
In test_upload_stackoverflow.py you are treating the PipelineData as a directory when calling .to_csv(), as opposed to a file, which would just be df_data_all.to_csv(args.output_extract, index=False). Perhaps try defining the PipelineData with is_directory=True (see the sketch below). Not sure if this is required though.
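For that last point, a minimal sketch of what the PipelineData definition could look like (assuming is_directory is accepted by your azureml-pipeline-core version):
from azureml.pipeline.core import PipelineData

output_export_blob = PipelineData(
    'export_blob',
    datastore=workspaceblobstore_datastore,  # same datastore as in the question
    is_directory=True                        # treat the output as a folder that will contain test.csv
)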
There are many similar questions on SO, but I simply cannot get this to work. I'm obviously missing something.
Trying to load a simple test csv file from my s3.
Doing it locally, like below, works.
from pyspark.sql import SparkSession
from pyspark import SparkContext as sc
logFile = "sparkexamplefile.csv"
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
logData = spark.read.text(logFile).cache()
numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
But if I add this below:
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "foo")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "bar")
lines = sc.textFile("s3n:///mybucket-sparkexample/sparkexamplefile.csv")
lines.count()
I get:
No FileSystem for scheme: s3n
I've also tried changing sc to spark.sparkContext, without any difference.
Also swapping // and /// in the url
Even better, I'd rather do this and go straight to data frame:
dataFrame = spark.read.csv("s3n:///mybucket-sparkexample/sparkexamplefile.csv")
Also I am slightly AWS ignorant, so I have tried s3, s3n, and s3a to no avail.
I've been around the internet and back but can't seem to resolve the scheme error. Thanks!
I think your Spark environment didn't get the AWS JARs. You need to add them in order to use s3 or s3n.
You have to copy the required JAR files from a Hadoop download into the $SPARK_HOME/jars directory. Using the --jars flag or the --packages flag with spark-submit didn't work for me.
In my case the versions are Spark 2.3.0 and Hadoop 2.7.6, so you have to copy these JARs from (hadoop dir)/share/hadoop/tools/lib/ to $SPARK_HOME/jars:
aws-java-sdk-1.7.4.jar
hadoop-aws-2.7.6.jar
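With those JARs in place, a minimal sketch of the read might look like this (note the s3a scheme; the bucket name and keys are the placeholders from the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "foo")
hadoop_conf.set("fs.s3a.secret.key", "bar")

dataFrame = spark.read.csv("s3a://mybucket-sparkexample/sparkexamplefile.csv")
dataFrame.show()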
You must check which version of the hadoop*.jar files is bound to the specific version of pyspark installed on your system: look in the pyspark/jars folder for files matching hadoop*.
You then pass the observed version into your pyspark file like this:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk-pom:1.11.538,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
This is a bit tricky for newcomers to pyspark (I ran into it on my very first day with pyspark :-)).
Otherwise, I am on a Gentoo system with a local Spark 2.4.2. Some suggested also installing Hadoop and copying the JARs directly into Spark; they still have to be the same versions PySpark is using. So I am creating a Gentoo ebuild for these versions...
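For example, a hedged sketch of wiring that in (the key point is setting the variable before the SparkSession is created; the package versions are the ones from the line above and are assumptions for your setup):
import os

# Must be set before the SparkContext/SparkSession is created
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk-pom:1.11.538,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("S3App").getOrCreate()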
I am trying to do a quick proof of concept for building a data processing pipeline in Python. To do this, I want to build a Google Cloud Function which will be triggered when certain .csv files are dropped into Cloud Storage.
I followed along with this Google Cloud Functions Python tutorial, and while the sample code does trigger the function to create some simple logs when a file is dropped, I am really stuck on what call I have to make to actually read the contents of the data. I tried to search for an SDK/API guidance document but I have not been able to find it.
In case this is relevant, once I process the .csv, I want to be able to add some data that I extract from it into GCP's Pub/Sub.
The function does not actually receive the contents of the file, just some metadata about it.
You'll want to use the google-cloud-storage client. See the "Downloading Objects" guide for more details.
Putting that together with the tutorial you're using, you get a function like:
from google.cloud import storage

storage_client = storage.Client()

def hello_gcs_generic(data, context):
    bucket = storage_client.get_bucket(data['bucket'])
    blob = bucket.blob(data['name'])
    contents = blob.download_as_string()
    # Process the file contents, etc...
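If you just want the rows out of contents, a hedged way to finish that last comment inside the function is the standard csv module (download_as_string returns bytes, so decode first):
    import csv, io
    rows = list(csv.reader(io.StringIO(contents.decode("utf-8"))))
    for row in rows:
        print(row)  # each row is a list of column values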
This is an alternative solution using pandas:
Cloud Function Code:
import pandas as pd

def GCSDataRead(event, context):
    bucketName = event['bucket']
    blobName = event['name']
    fileName = "gs://" + bucketName + "/" + blobName
    dataFrame = pd.read_csv(fileName, sep=",")
    print(dataFrame)
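Note that pd.read_csv with a gs:// path typically relies on the gcsfs package being listed in the function's requirements. Since the question also mentions pushing extracted data to Pub/Sub, here is a hedged sketch of that step with the google-cloud-pubsub client (project and topic names are placeholders):
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")  # placeholders

def publish_rows(dataFrame):
    # Publish each row as a JSON-encoded message; adapt the payload to your schema
    for record in dataFrame.to_dict(orient="records"):
        publisher.publish(topic_path, data=json.dumps(record).encode("utf-8"))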
I am trying to create a timer-triggered Azure Function that takes data from Blob Storage, aggregates it, and puts the aggregates in Cosmos DB. I previously tried using the bindings in Azure Functions to use blob as input, which I was informed was incorrect (see this thread: Azure functions python no value for named parameter).
I am now using the SDK and am running into the following problem:
import sys, os.path
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), 'myenv/Lib/site-packages')))
import json
import pandas as pd
from azure.storage.blob import BlockBlobService

data = BlockBlobService(account_name='accountname', account_key='accountkey')
container_name = ('container')
generator = data.list_blobs(container_name)

for blob in generator:
    print("{}".format(blob.name))
    json = json.loads(data.get_blob_to_text('container', open(blob.name)))
    df = pd.io.json.json_normalize(json)
    print(df)
This results in an error:
IOError: [Errno 2] No such file or directory: 'test.json'
I realize this might be an absolute path issue, but I'm not sure how that works with Azure Storage. Any ideas on how to circumvent this?
Made it "work" by doing the following:
for blob in generator:
    loader = data.get_blob_to_text('kvaedevdystreamanablob', blob.name, if_modified_since=delta)
    json = json.loads(loader.content)
This works for ONE json file, i.e. I only had one in storage, but when more are added I get this error:
ValueError: Expecting object: line 1 column 21907 (char 21906)
This happens even if I add if_modified_since so as to only take in one blob. Will update if I figure something out. Help is always welcome.
Another update: my data is coming in through Stream Analytics and then down to the blob. I had selected that the data should come in as arrays, which is why the error is occurring. When the stream is terminated, the blob doesn't immediately append ] to the EOF line of the json, so the json file isn't valid. I will now try using line-by-line output in Stream Analytics instead of arrays.
Figured it out. In the end it was quite a simple fix:
I had to make sure each json entry in the blob was less than 1024 characters, or it would create a new line, thus making reading lines problematic.
The code that iterates through each blob file, reads it, and adds it to a list is as follows:
import json
from azure.storage.blob import BlockBlobService

data = BlockBlobService(account_name='accname', account_key='key')
generator = data.list_blobs('collection')
dataloaded = []

for blob in generator:
    loader = data.get_blob_to_text('collection', blob.name)
    trackerstatusobjects = loader.content.split('\n')
    for trackerstatusobject in trackerstatusobjects:
        dataloaded.append(json.loads(trackerstatusobject))
From this you can add everything to a dataframe and do whatever you want :)
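For instance, a hedged sketch of that last step with pandas (assuming the loaded entries are flat dicts):
import pandas as pd

df = pd.DataFrame(dataloaded)  # one row per json entry
# or, if the entries are nested: df = pd.io.json.json_normalize(dataloaded)
print(df.head())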
Hope this helps if someone stumbles upon a similar problem.
I'm trying to spin up a Spark cluster with Databricks' CSV package so that I can create Parquet files and also do some stuff with Spark, obviously.
This is being done within AWS EMR, so I don't think I'm putting these options in the correct place.
This is the command I want to send to the cluster as it spins up: spark-shell --packages com.databricks:spark-csv_2.10:1.4.0 --master yarn --driver-memory 4g --executor-memory 2g. I've tried putting this on a Spark step - is this correct?
If the cluster spun up without that being properly installed, how do I start up PySpark with that package? Is this correct: pyspark --packages com.databricks:spark-csv_2.10:1.4.0? I can't tell whether it was installed properly or not, and I'm not sure which functions to test.
And in regards to actually using the package, is this correct for creating a parquet file:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='false').load('s3n://bucketname/nation.tbl', schema = customSchema)

# is it this (option 1):
df.write.parquet("s3n://bucketname/nation_parquet.parquet")

# or this (option 2):
df.select('nation_id', 'name', 'some_int', 'comment').write.parquet('com.databricks.spark.csv').save('s3n://bucketname/nation_parquet.tbl')
I'm not able to find any recent (mid 2015 and later) documentation regarding writing a Parquet file.
Edit:
Okay, now I'm not sure if I'm creating my dataframe correctly. If I try to run some select queries on it and show the result set, I don't get anything back, just an error. Here's what I tried running:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='false').load('s3n://bucketname/nation.tbl', schema = customSchema)
df.registerTempTable("region2")
tcp_interactions = sqlContext.sql(""" SELECT nation_id, name, comment FROM region2 WHERE nation_id > 1 """)
tcp_interactions.show()
# I get some weird Java error:
#Caused by: java.lang.NumberFormatException: For input string: "0|ALGERIA|0| haggle. carefully final deposits detect slyly agai|"