Why is Hive attempting to write to /user in HDFS? - python

Working with a simple HiveQL query that looks like this:
SELECT event_type FROM {{table}} where dt=20140103 limit 10;
The {{table}} part is interpolated by the runner code I'm using, via Jinja2. I'm running the query with the -e flag on the hive command line, invoked through subprocess.Popen from Python.
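For context, the runner is roughly equivalent to this simplified sketch (the table name here is a placeholder, not the real value):
import subprocess
from jinja2 import Template
# Render the HiveQL template; "events" stands in for the real table name
query = Template("SELECT event_type FROM {{ table }} WHERE dt=20140103 LIMIT 10;").render(table="events")
# Run the rendered query through the hive CLI's -e flag
proc = subprocess.Popen(["hive", "-e", query], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = proc.communicate()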
For some reason, this setup is attempting to write into the regular /user directory in HDFS. Running the command with sudo has no effect. The error produced is as follows:
Job Submission failed with exception:
org.apache.hadoop.security.AccessControlException(Permission denied:user=username, access=WRITE, inode="/user":hdfs:hadoop:drwxrwxr-x\n\tat org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:234)
Why would Hive attempt to write to /user? And why would a SELECT statement like this need an output location at all?

Hive is a SQL frontend to MapReduce, so it needs to compile and stage Java code for execution. It's not trying to put your query output there, but rather the job it is about to execute. Depending on your version of Hadoop, the staging location is controlled by one of the following properties:
mapreduce.jobtracker.staging.root.dir
And on YARN / Hadoop 2:
yarn.app.mapreduce.am.staging-dir
These are set in mapred-site.xml.
Your runner needs to be authenticated to the cluster, and the user it runs as needs a writable directory under that staging root (typically /user/&lt;username&gt;).
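If the root cause is simply that the submitting user has no home directory in HDFS, a common fix is to create one as the HDFS superuser. A rough sketch, assuming you can shell out to the hdfs CLI with sufficient privileges (the username is a placeholder):
import subprocess
username = "username"  # placeholder: the user that submits the Hive job
# Create /user/<username> and hand it to that user; requires HDFS superuser rights
subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", "/user/" + username])
subprocess.check_call(["hdfs", "dfs", "-chown", username, "/user/" + username])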

Related

Load local files to AWS stage in Snowflake using Python

I am planning to ingest data from the local system into a Snowflake table using an Amazon S3 stage. How can I load the data to the S3 stage using Python? Previously I was loading data into Snowflake through its internal stage, using the commands below:
put file://<local_file_location> @<creating_stage_snowflake> auto_compress=true
copy into <table_name> from @<creating_stage_snowflake>/<file_name>.gz file_format = (TYPE=CSV FIELD_DELIMITER='~' error_on_column_count_mismatch=false, ENCODING = 'UTF-8')
What should be the approach to load the data from the local system to Amazon S3 and then copy the staged files into the Snowflake table using Python?
Please share your inputs.
I use the SnowSQL command-line tool, pointing it at a small SQL script that my code generates. It's not very elegant, but it does the job. You can hide the connection info for snowsql in the .snowsql\config file.
This works well for individual programs, but it is probably not the best solution for a multi-process web service.
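If you'd rather stay entirely in Python, the same PUT/COPY flow can be driven through the snowflake-connector-python package instead of shelling out to SnowSQL. A rough sketch, with placeholder credentials, file path, stage, and table names:
import snowflake.connector
# Placeholder connection details
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_wh",
    database="my_db",
    schema="public",
)
cur = conn.cursor()
# Upload the local file to the stage, then copy it into the table
cur.execute("PUT file:///tmp/data.csv @my_stage AUTO_COMPRESS=TRUE")
cur.execute(
    "COPY INTO my_table FROM @my_stage/data.csv.gz "
    "FILE_FORMAT = (TYPE=CSV FIELD_DELIMITER='~' ENCODING='UTF-8')"
)
cur.close()
conn.close()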

Create Spark context from Python in order to run databricks sql

I've been following this tutorial which lets me connect to Databricks from Python and then run delta table queries. However, I've stumbled upon a problem. When I run it for the FIRST time, I get the following error:
Container container-name in account storage-account.blob.core.windows.net not found, and we can't create it using anonymous credentials, and no credentials found for them in the configuration.
When I go back to my Databricks cluster and run this code snippet
from pyspark import SparkContext

spark_context = SparkContext.getOrCreate()
if StorageAccountName is not None and StorageAccountAccessKey is not None:
    print('Configuring the spark context...')
    spark_context._jsc.hadoopConfiguration().set(
        f"fs.azure.account.key.{StorageAccountName}.blob.core.windows.net",
        StorageAccountAccessKey)
(where StorageAccountName and AccessKey are known) then run my Python app once again, it runs successfully without throwing the previous error. I'd like to ask, is there a way to run this code snippet from my Python app and at the same time reflect it on my Databricks cluster?
You just need to add these configuration options to the cluster itself, as described in the docs. Set the following Spark property, the same one you set in your code:
fs.azure.account.key.<storage-account-name>.blob.core.windows.net <storage-account-access-key>
For security, it's better to put the access key into a secret scope and reference it from the Spark configuration (see the docs).
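For example, the cluster's Spark config entry could look roughly like this, using the secret-reference syntax (the scope and secret names are placeholders):
fs.azure.account.key.<storage-account-name>.blob.core.windows.net {{secrets/<scope-name>/<secret-name>}}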

Why do we need airflow hooks?

Doc says:
Hooks are interfaces to external platforms and databases like Hive, S3, MySQL, Postgres, HDFS, and Pig. Hooks implement a common interface when possible, and act as a building block for operators. Ref
But why do we need them?
I want to select data from one Postgres DB and store it in another one. Can I use, for example, the psycopg2 driver inside a Python script run by a PythonOperator, or does Airflow need to know for some reason what I'm doing inside the script, so that I have to use PostgresHook instead of plain psycopg2?
You should just use PostgresHook. Instead of using psycopg2 directly, like so:
import psycopg2
conn = psycopg2.connect(host=host, dbname=dbname, user=user, password=password)
cur = conn.cursor()
cur.execute(query)
data = cur.fetchall()
You can just write:
from airflow.hooks.postgres_hook import PostgresHook
postgres = PostgresHook('connection_id')
data = postgres.get_pandas_df(query)
This also takes advantage of Airflow's connection management, including encrypted credentials.
So using hooks is cleaner, safer and easier.
While it is possible to just hardcode the connection details in your script and run it, hooks let you edit the connections from within the Airflow UI.
Have a look at "Automate AWS Tasks Thanks to Airflow Hooks" to learn a bit more about how to use hooks.
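For the concrete use case in the question (copy rows from one Postgres database into another), a minimal sketch of a task function using two hooks could look like this; the connection IDs, table, and query are placeholders, and the import path may differ between Airflow versions:
from airflow.hooks.postgres_hook import PostgresHook

def copy_events():
    # Both connection IDs are defined in the Airflow UI (Admin -> Connections)
    source = PostgresHook(postgres_conn_id='source_postgres')
    target = PostgresHook(postgres_conn_id='target_postgres')
    rows = source.get_records("SELECT id, event_type, created_at FROM events")
    # insert_rows comes from DbApiHook and issues batched INSERTs
    target.insert_rows(table='events_copy', rows=rows)
Wrap copy_events in a PythonOperator and no credentials ever appear in your DAG code.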

AWS Redshift Data Processing

I'm working with a small company currently that stores all of their app data in an AWS Redshift cluster. I have been tasked with doing some data processing and machine learning on the data in that Redshift cluster.
The first task I need to do requires some basic transforming of existing data in that cluster into some new tables based on some fairly simple SQL logic. In an MSSQL environment, I would simply put all the logic into a parameterized stored procedure and schedule it via SQL Server Agent Jobs. However, sprocs don't appear to be a thing in Redshift. How would I go about creating a SQL job and scheduling it to run nightly (for example) in an AWS environment?
The other task I have involves developing a machine learning model (in Python) and scoring records in that Redshift database. What's the best way to host my python logic and do the data processing if the plan is to pull data from that Redshift cluster, score it, and then insert it into a new table on the same cluster? It seems like I could spin up an EC2 instance, host my python scripts on there, do the processing on there as well, and schedule the scripts to run via cron?
I see tons of AWS (and non-AWS) products that look like they might be relevant (AWS Glue/Data Pipeline/EMR), but there's so many that I'm a little overwhelmed. Thanks in advance for the assistance!
ETL
Amazon Redshift does not support stored procedures. Also, I should point out that stored procedures are generally a bad thing because you are putting logic into a storage layer, which makes it very hard to migrate to other solutions in the future. (I know of many Oracle customers who have locked themselves into never being able to change technologies!)
You should run your ETL logic external to Redshift, simply using Redshift as a database. This could be as simple as running a script that uses psql to call Redshift, such as:
`psql <authentication stuff> -c 'insert into z select a, b from x'`
(Use psql v8, upon which Redshift was based.)
Alternatively, you could use more sophisticated ETL tools such as AWS Glue (not currently in every Region) or 3rd-party tools such as Bryte.
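If you'd rather stay in Python than shell out to psql, the same pattern works with a Postgres driver, since Redshift speaks the Postgres wire protocol. A rough sketch with placeholder connection details:
import psycopg2
# Placeholder cluster endpoint and credentials
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="...",
)
conn.autocommit = True
with conn.cursor() as cur:
    # The transformation itself is plain SQL executed inside Redshift
    cur.execute("insert into z select a, b from x")
conn.close()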
Machine Learning
Yes, you could run code on an EC2 instance. If it is small, you could use AWS Lambda (maximum 5 minutes run-time). Many ML users like using Spark on Amazon EMR. It depends upon the technology stack you require.
Amazon CloudWatch Events can schedule Lambda functions, which could then launch EC2 instances that do your processing and then self-terminate.
Lots of options, indeed!
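As one concrete way to wire up that scheduling, a CloudWatch Events rule can trigger a Lambda function on a cron schedule via boto3. A sketch with placeholder names and ARNs (the Lambda also needs a resource policy allowing events.amazonaws.com to invoke it, omitted here):
import boto3
events = boto3.client("events")
# Placeholder rule name and Lambda ARN
rule_name = "nightly-redshift-etl"
lambda_arn = "arn:aws:lambda:us-east-1:123456789012:function:redshift-etl"
# Run every night at 02:00 UTC
events.put_rule(Name=rule_name, ScheduleExpression="cron(0 2 * * ? *)")
events.put_targets(Rule=rule_name, Targets=[{"Id": "etl-lambda", "Arn": lambda_arn}])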
The two options for running ETL on Redshift:
1. Create some "create table as" type SQL, which will take your source tables as input and generate your target (transformed) table.
2. Do the transformation outside of the database using an ETL tool, for example EMR or Glue.
Generally, in an MPP environment such as Redshift, the best practice is to push the ETL to the powerful database (i.e. option 1).
Only consider taking the ETL outside of Redshift (option 2) where SQL is not the ideal tool for the transformation, or the transformation is likely to take a huge amount of compute resource.
There is no inbuilt scheduling or orchestration tool. Apache Airflow is a good option if you need something more full-featured than cron jobs.
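If you do reach for Airflow, a nightly DAG that pushes option 1 down into Redshift can stay very small. A sketch assuming a Postgres-type Airflow connection pointed at the cluster; the DAG ID, connection ID, and SQL are placeholders, and import paths vary between Airflow versions:
from datetime import datetime
from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator

with DAG(
    dag_id="nightly_redshift_transform",
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 2 * * *",  # every night at 02:00
    catchup=False,
) as dag:
    transform = PostgresOperator(
        task_id="create_target_table",
        postgres_conn_id="redshift_default",  # Postgres-type connection aimed at Redshift
        sql="create table target_schema.daily_summary as select a, b from source_schema.events",
    )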
Basic transforming of existing data
Since you seem to be a Python developer (you mention developing a Python-based ML model), you can handle the transformation as follows:
1. You can use boto3 (https://aws.amazon.com/sdk-for-python/) to talk to Redshift from any workstation on your LAN (make sure your IP has the proper privileges).
2. You can write your own Python functions that mimic stored procedures and put your transformation logic inside them.
3. Alternatively, you can create Python UDFs in Redshift that act like stored procedures. See https://aws.amazon.com/blogs/big-data/introduction-to-python-udfs-in-amazon-redshift/
4. Finally, you can use the Windows Task Scheduler / a cron job to schedule your Python scripts with parameters, the way a SQL Server Agent job does.
Best way to host my python logic
It sounds like you are reading some data from Redshift, building test and training sets, and finally producing some predicted results (records). If so:
1. Host the script on any of your own servers (on your LAN) and connect to Redshift using boto3. If you need a large number of rows transferred over the internet, an EC2 instance in the same region is an option: enable it on an ad-hoc basis, complete your job, and shut it down, which keeps it cost-effective. You can automate this with the AWS SDK (I have done it with the .NET SDK, and I assume boto3 supports it too).
2. If your result sets are relatively small, you can save them directly into the target Redshift table.
3. If the result sets are larger, save them to CSV (there are several Python libraries for this) and upload the rows with the COPY command, into a staging table if you need any intermediate calculation, or directly into the target table if not. (A sketch of this flow follows below.)
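For the larger-result-set path, a rough sketch of the CSV to S3 to COPY flow (the bucket, table, IAM role, and credentials are all placeholders):
import csv
import boto3
import psycopg2
# 1. Write predictions to a local CSV (placeholder rows)
with open("/tmp/predictions.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="~")
    writer.writerows([(1, 0.87), (2, 0.12)])
# 2. Upload the file to S3 (placeholder bucket and key)
boto3.client("s3").upload_file("/tmp/predictions.csv", "my-bucket", "scores/predictions.csv")
# 3. COPY from S3 into the target Redshift table (placeholder connection and IAM role)
conn = psycopg2.connect(host="my-cluster.abc123.us-east-1.redshift.amazonaws.com", port=5439,
                        dbname="analytics", user="etl_user", password="...")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("""
        copy predictions
        from 's3://my-bucket/scores/predictions.csv'
        iam_role 'arn:aws:iam::123456789012:role/redshift-copy'
        delimiter '~'
    """)
conn.close()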
Hope this helps.

How to execute a shell command to populate a Jenkins Dynamic Choice Parameter Plugin

I'd like to create a Jenkins job where I do a backup and deploy of certain databases to a remote MongoDB instance. I'd like this build to be parameterized so that at build time the user chooses from a list of valid MongoDB hostnames, and once the user selects a valid DB hostname, a second choice parameter is dynamically populated with all valid database names on that hostname. Once the user has selected the DB name, it is stored in a parameter "DB" that can be passed to an "Execute Shell" build step to do the actual work.
My problem is that I need a way for the Jenkins Dynamic Parameter (Cascading) Plug-in to run a shell (or, ideally, Python) script that returns a list of valid DB names on the selected host. I'm not able to get the Groovy script portion of the plugin to execute shell commands on the local OS (the way the "Execute Shell" build step does).
Ideally I'd like to run something like this where "MONGOHOST" is the first parameter chosen by the user:
#!/usr/bin/env python
import os
from pymongo import MongoClient
# MONGOHOST is the Jenkins parameter chosen in the first choice box
client = MongoClient('mongodb://{}:27017/'.format(os.environ['MONGOHOST']))
choicelist = client.database_names()
client.close()
I'd then like "choicelist" to be presented in such a way as they become populated as the available choices for a "DB" parameter.
How can I achieve this, especially since the Dynamic Choice parameter only accepts Groovy scripts and not native Python?
Usually the dynamic parameter plugin just loads the options from simple ini files. So, if you want to update the list of available options, you just have to update these files on the Jenkins instance.
BTW, if you are trying to implement a self-service portal, you may want to have a look at Rundeck, which I discovered recently; it seems considerably more user-friendly than Jenkins.
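Building on that idea, you could keep a small Python job (run on a schedule, or as an upstream Jenkins job) that queries MongoDB and rewrites the options file the parameter plugin reads. A sketch with a placeholder host and file path:
import os
from pymongo import MongoClient
# Placeholder values: the MongoDB host and the options file the plugin reads
mongo_host = os.environ.get('MONGOHOST', 'localhost')
options_file = '/var/lib/jenkins/userContent/db_choices.txt'
client = MongoClient('mongodb://{}:27017/'.format(mongo_host))
names = client.database_names()
client.close()
# One database name per line, which the Groovy side can read back in
with open(options_file, 'w') as f:
    f.write('\n'.join(sorted(names)))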
