Create Spark context from Python in order to run Databricks SQL - python

I've been following this tutorial which lets me connect to Databricks from Python and then run delta table queries. However, I've stumbled upon a problem. When I run it for the FIRST time, I get the following error:
Container container-name in account
storage-account.blob.core.windows.net not found, and we can't create
it using anoynomous credentials, and no credentials found for them in
the configuration.
When I go back to my Databricks cluster and run this code snippet
from pyspark import SparkContext

spark_context = SparkContext.getOrCreate()

if StorageAccountName is not None and StorageAccountAccessKey is not None:
    print('Configuring the spark context...')
    spark_context._jsc.hadoopConfiguration().set(
        f"fs.azure.account.key.{StorageAccountName}.blob.core.windows.net",
        StorageAccountAccessKey)
(where StorageAccountName and StorageAccountAccessKey are known) and then run my Python app once again, it runs successfully without throwing the previous error. I'd like to ask: is there a way to run this code snippet from my Python app and at the same time have it reflected on my Databricks cluster?

You just need to add these configuration options to the cluster itself, as described in the docs. You need to set the following Spark property, the same one you set in your code:
fs.azure.account.key.<storage-account-name>.blob.core.windows.net <storage-account-access-key>
For security, it's better to put the access key into a secret scope and reference it from the Spark configuration (see docs).
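If you prefer to set this from code rather than in the cluster's Spark configuration, a minimal sketch of the notebook-side variant looks like the following. It assumes it runs in a Databricks notebook (where dbutils and spark are available) and that you have already created a secret scope; the scope, key, and account names below are hypothetical placeholders.

# Hypothetical scope/key/account names; replace with your own.
storage_account_name = "mystorageaccount"
storage_account_key = dbutils.secrets.get(scope="my-scope", key="storage-account-key")

spark.conf.set(
    f"fs.azure.account.key.{storage_account_name}.blob.core.windows.net",
    storage_account_key)

Note that, unlike the cluster-level Spark property, this only applies to the current Spark session.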

Related

How can I be sure that a library like Pandas is not sending my API key secrets to places outside my local machine?

Let's say:
I have my python code in main.py and I am using Pandas
I am storing my API key (to some Azure service) in a Windows environment variable (variable name = "AZURE_KEY" and variable value = "abc123abc")
I will import this API Key in main.py using azure_key = os.environ.get("AZURE_KEY")
Question:
How can I be sure that the Pandas library hasn't sent azure_key's value somewhere outside my local system?
Possible Approach:
I know one way is to go through the entire Pandas module files and understand the source code to see if any fishy stuff is happening, but such an approach is not feasible.
Note:
Pandas is just an example for the question. I want to use an API key within a Streamlit app.
Hence, please treat this question as agnostic to the library.
For a production system (on a server), you could use a firewall to filter outgoing connections
For a development system (your machine), you could add restrictions to the "API Key" account (e.g. only access test data, only access systems you really need, etc.)

How to remove a Cloud Armor security policy from a backend service using Python

I'm creating a few GCP Cloud Armor policies across multiple projects using the Python client library and attaching them to several backend services using the .set_security_policy() method.
I know you can do it using the console / gcloud, but I need to automate this in Python.
I've tried the .update() method in google-cloud-compute, but that did not work out:
from google.cloud import compute, compute_v1

client = compute.BackendServicesClient()
backend_service_resource = compute_v1.types.BackendService(security_policy="")
client.update(project='project_id',
              backend_service='backend_service',
              backend_service_resource=backend_service_resource)
The error I got when running the above code is
google.api_core.exceptions.BadRequest: 400 PUT https://compute.googleapis.com/compute/v1/projects/<project-id>/global/backendServices/<backend-name>: Invalid value for field 'resource.loadBalancingScheme': 'INVALID_LOAD_BALANCING_SCHEME'. Cannot change load balancing scheme.
When I specify loadBalancingScheme, the same error occurs for another resource field. At run time I would not have all the metadata of the backend service, and some metadata might not be initialized in the first place.
This is for anyone who has similar issues in the future. I was originally going to call the gcloud commands from Python using os.system() as @giles-roberts recommended, but then I stumbled across a proper way to do this using the client libraries.
You simply use the same .set_security_policy() method that set the security policy in the first place, but this time pass the policy as None. This is not obvious, since the documentation says the name of the security policy has to be a string, and it does not accept an empty string either.
from google.cloud import compute, compute_v1

client = compute.BackendServicesClient()
resource = compute_v1.types.SecurityPolicyReference(security_policy=None)
error = client.set_security_policy(project='<project_id>',
                                   backend_service='<backend_service>',
                                   security_policy_reference_resource=resource)
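If you want to confirm that the policy was actually detached, a small follow-up check could look like this (a sketch, reusing the client and placeholder names from above):

backend = client.get(project='<project_id>', backend_service='<backend_service>')
print(backend.security_policy)  # expected to be empty once the policy has been removed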

Azure Function write_dataframe_to_datalake in Python works fine in VS Code but fails when deployed to the cloud

I am new to Azure Functions and am trying to write a function (blob trigger). The function reads the file uploaded to the blob (a binary .dat file), does some conversions to turn the data into a pandas data frame, then converts it to the .parquet file format and saves it to Azure Data Lake.
Everything works well when I run the function locally in VS Code, but when I deploy it to Azure Functions in the cloud, I get an error and the function fails.
After going through the code and checking each step by uploading it to the cloud individually, I have found that the code works fine up to converting the .dat file to a data frame, but fails when I add the function that saves it to the data lake. I am using the following function found in the Microsoft tutorials.
def write_dataframe_to_datalake(df, datalake_service_client, filesystem_name, dir_name, filename):
    file_path = f'{dir_name}/{filename}'
    file_client = datalake_service_client.get_file_client(filesystem_name, file_path)
    processed_df = df.to_parquet(index=False)
    file_client.upload_data(data=processed_df, overwrite=True, length=len(processed_df))
    file_client.flush_data(len(processed_df))
    return True
reference: https://learn.microsoft.com/en-us/azure/developer/python/tutorial-deploy-serverless-cloud-etl-05
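For context, here is a rough sketch of how this function might be wired up. The connection-string setting name and the filesystem/directory names are hypothetical, and it assumes the azure-storage-file-datalake package is installed:

import os
import pandas as pd
from azure.storage.filedatalake import DataLakeServiceClient

# 'DATALAKE_CONNECTION_STRING' is a hypothetical application setting / environment variable.
service_client = DataLakeServiceClient.from_connection_string(
    os.environ['DATALAKE_CONNECTION_STRING'])

df = pd.DataFrame({'a': [1, 2, 3]})
write_dataframe_to_datalake(df, service_client, 'my-filesystem', 'processed', 'sample.parquet')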
This function works fine on the Azure cloud when I run it individually, but not with my converted data frame.
Can anyone identify what the problem could be? Thanks a lot!
Based on the error message we could identify the issue, but an Azure Function that works fine locally should generally work after deployment as well. In a few cases, after deploying our function we get internal server errors when we run our code.
Below are two ways to get rid of them:
Add Application Settings (add all the variables from local.settings.json under Configuration); see the sketch below.
If our Azure Function interacts with any Azure Storage service or Blob Storage, we need to add the API link under CORS (or you can try adding "*", which allows all origins).
Refer to this MS Docs page for adding Application Settings.
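As a concrete illustration of the first point: values that come from the "Values" section of local.settings.json when running locally show up as environment variables in Azure once they are added under Application Settings, so the function code can read them the same way in both environments. A minimal sketch (the setting name here is hypothetical):

import os

# Locally this is supplied by local.settings.json -> "Values";
# in Azure it must be added under Configuration -> Application settings.
connection_string = os.environ.get('DATALAKE_CONNECTION_STRING')
if connection_string is None:
    raise RuntimeError('DATALAKE_CONNECTION_STRING is not configured in this environment')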

PySpark using IAM roles to access S3

I'm wondering if PySpark supports S3 access using IAM roles. Specifically, I have a business constraint where I have to assume an AWS role in order to access a given bucket. This is fine when using boto (as it's part of the API), but I can't find a definitive answer as to whether PySpark supports this out of the box.
Ideally, I'd like to be able to assume a role when running in standalone mode locally and point my SparkContext to that S3 path. I've seen that non-IAM calls usually follow:
spark_conf = SparkConf().setMaster('local[*]').setAppName('MyApp')
sc = SparkContext(conf=spark_conf)
rdd = sc.textFile('s3://<MY-ID>:<MY-KEY>#some-bucket/some-key')
Does something like this exist for providing IAM info? :
rdd = sc.textFile('s3://<MY-ID>:<MY-KEY>:<MY-SESSION>#some-bucket/some-key')
or
rdd = sc.textFile('s3://<ROLE-ARN>:<ROLE-SESSION-NAME>#some-bucket/some-key')
If not, what are the best practices for working with IAM creds? Is it even possible?
I'm using Python 2.7 and PySpark 1.6.0.
Thanks!
IAM role-based access to S3 is only supported by s3a, because it uses the AWS SDK.
You need to put the hadoop-aws JAR and the aws-java-sdk JAR (and the third-party JARs in its package) on your CLASSPATH.
hadoop-aws link.
aws-java-sdk link.
Then set this in core-site.xml:
<property>
  <name>fs.s3.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
Hadoop 2.8+'s s3a connector supports IAM roles via a new credential provider; It's not in the Hadoop 2.7 release.
To use it you need to change the credential provider.
fs.s3a.aws.credentials.provider = org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
fs.s3a.access.key = <your access key>
fs.s3a.secret.key = <session secret>
fs.s3a.session.token = <session token>
What is in Hadoop 2.7 (and enabled by default) is the picking up of the AWS_ environment variables.
If you set the AWS env vars for the session login on your local system and on the remote ones, they should get picked up.
I know it's a pain, but as far as the Hadoop team is concerned, Hadoop 2.7 shipped mid-2016 and we've done a lot since then, stuff which we aren't going to backport.
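To connect the two halves of this answer, here is a rough sketch of feeding assumed-role credentials to the s3a connector from PySpark. It assumes a Hadoop 2.8+ s3a connector on the classpath, boto3 installed, and a role ARN you are allowed to assume; the role ARN and bucket name are placeholders:

import boto3
from pyspark import SparkConf, SparkContext

# Assume the role via STS to obtain temporary credentials.
creds = boto3.client('sts').assume_role(
    RoleArn='arn:aws:iam::123456789012:role/my-role',
    RoleSessionName='pyspark-session')['Credentials']

sc = SparkContext(conf=SparkConf().setMaster('local[*]').setAppName('MyApp'))
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set('fs.s3a.aws.credentials.provider',
                'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider')
hadoop_conf.set('fs.s3a.access.key', creds['AccessKeyId'])
hadoop_conf.set('fs.s3a.secret.key', creds['SecretAccessKey'])
hadoop_conf.set('fs.s3a.session.token', creds['SessionToken'])

rdd = sc.textFile('s3a://some-bucket/some-key')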
IAM role-based access to files in S3 is supported by Spark; you just need to be careful with your config. Specifically, you need:
Compatible versions of aws-java-sdk and hadoop-aws. This is quite brittle, so only specific combinations work.
You must use the S3AFileSystem, not NativeS3FileSystem. The former permits role-based access, whereas the latter only allows user credentials.
To find out which combinations work, go to hadoop-aws on mvnrepository here. Click through to the version of hadoop-aws you have and look for the version of the aws-java-sdk compile dependency.
To find out what version of hadoop-aws you are using, in PySpark you can execute:
sc._gateway.jvm.org.apache.hadoop.util.VersionInfo.getVersion()
where sc is the SparkContext
This is what worked for me:
import os
import pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1 pyspark-shell'
sc = SparkContext.getOrCreate()
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark = SparkSession(sc)
df = spark.read.csv("s3a://mybucket/spark/iris/",header=True)
df.show()
It's the specific combination of aws-java-sdk:1.7.4 and hadoop-aws:2.7.1 that made it work. There is good guidance on troubleshooting s3a access here
In particular, note that
Randomly changing hadoop- and aws- JARs in the hope of making a problem "go away" or to gain access to a feature you want, will not lead to the outcome you desire.
Here is a useful post containing further information.
Here's some more useful information about compatibility between the java libraries
I was trying to get this to work in the Jupyter PySpark notebook. Note that the hadoop-aws version had to match the Hadoop install in the Dockerfile, i.e. here.
You could try the approach in Locally reading S3 files through Spark (or better: pyspark).
However I've had better luck with setting environment variables (AWS_ACCESS_KEY_ID etc) in Bash ... pyspark will automatically pick these up for your session.
After more research, I'm convinced this is not yet supported as evidenced here.
Others have suggested taking a more manual approach (see this blog post), which suggests listing S3 keys using boto, then parallelizing that list using Spark to read each object.
The problem here (and I don't yet see how they themselves get around it) is that the S3 objects returned from listing within a bucket are not serializable/pickle-able (remember: it's suggested that these objects are given to the workers to read in independent processes via map or flatMap). A further problem is that the boto S3 client itself isn't serializable (which is reasonable in my opinion).
What we're left with is the only choice of recreating the assumed-role S3 client per file, which isn't optimal or feasible past a certain point.
If anyone sees any flaws in this reasoning or an alternative solution/approach, I'd love to hear it.
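For reference, a rough sketch of that manual pattern with boto3 (the role ARN, bucket, and keys are placeholders; the client is created inside the worker function, per partition, because it cannot be pickled and shipped from the driver):

import boto3
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

def read_keys(keys):
    # Each partition builds its own assumed-role client.
    creds = boto3.client('sts').assume_role(
        RoleArn='arn:aws:iam::123456789012:role/my-role',
        RoleSessionName='spark-worker')['Credentials']
    s3 = boto3.client('s3',
                      aws_access_key_id=creds['AccessKeyId'],
                      aws_secret_access_key=creds['SecretAccessKey'],
                      aws_session_token=creds['SessionToken'])
    for key in keys:
        yield s3.get_object(Bucket='some-bucket', Key=key)['Body'].read()

keys = ['some-prefix/part-0000', 'some-prefix/part-0001']  # e.g. listed on the driver with boto3
contents = sc.parallelize(keys).mapPartitions(read_keys).collect()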

Why is Hive attempting to write to /user in HDFS?

Working with a simple HiveQL query that looks like this:
SELECT event_type FROM {{table}} where dt=20140103 limit 10;
The {{table}} part is just interpolated via the runner code I'm using, via Jinja2. I'm running my query using the -e flag on the hive command line, via subprocess.Popen from Python.
For some reason, this setup is attempting to write into the regular /user directory in HDFS. Sudoing the command has no effect. The error produced is as follows:
Job Submission failed with exception:
org.apache.hadoop.security.AccessControlException(Permission denied: user=username, access=WRITE, inode="/user":hdfs:hadoop:drwxrwxr-x
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:234)
Why would Hive attempt to write to /user? Additionally, why would a SELECT statement like this need an output location at all?
Hive is a SQL frontend to MapReduce, so it needs to compile and stage Java code for execution. It's not trying to put your query's output there, but rather the program it will execute. Depending on your version of Hadoop, this is controlled by the variables:
mapreduce.jobtracker.staging.root.dir
And on YARN / Hadoop 2:
yarn.app.mapreduce.am.staging-dir
These are set in mapred-site.xml.
Your runner needs to be authenticated to the cluster and have a writable directory it can use.
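One common way to give the submitting user such a directory is to create its HDFS home directory as the HDFS superuser. A minimal sketch, assuming the hdfs CLI is on the PATH and 'username' is the account from the AccessControlException:

import subprocess

username = 'username'  # the account shown in the Permission denied error
subprocess.check_call(['hdfs', 'dfs', '-mkdir', '-p', '/user/%s' % username])
subprocess.check_call(['hdfs', 'dfs', '-chown', username, '/user/%s' % username])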
