I'm looking into establishing a JDBC Spark connection to use from R/Python. I know that PySpark and SparkR are both available, but these seem more appropriate for interactive analysis, particularly since they reserve cluster resources for the user. I'm thinking of something more analogous to the Tableau ODBC Spark connection: something more lightweight (as I understand it) for supporting simple random access. While this seems possible, and there is some documentation, it isn't clear (to me) what the JDBC driver requirements are.
Should I use org.apache.hive.jdbc.HiveDriver, as I do to establish a Hive connection, since Hive and Spark SQL via Thrift seem closely linked? Should I swap out the hadoop-common dependency needed for my Hive connection (using the HiveServer2 port) for some Spark-specific dependency (when using hive.server2.thrift.http.port)?
Also, since most of the connection functionality seems to leverage Hive, what is the key thing that causes Spark SQL to be used as the query engine instead of Hive?
As it turned out, the URL I needed did not match the Hive database host URL listed in Ambari. I came across the correct URL in an example showing how to connect (to my cluster specifically). Given the proper URL, I was able to establish a connection using the HiveDriver without issue.
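For anyone looking for a concrete starting point from Python, here is a minimal sketch using JayDeBeApi with that same HiveDriver. The host, port, jar path, and credentials are placeholders for your cluster's Spark Thrift Server settings, not values from my setup:

import jaydebeapi

# Connect to the Spark Thrift Server with the HiveServer2 JDBC driver;
# only the URL decides whether Hive or Spark SQL executes the query.
conn = jaydebeapi.connect(
    "org.apache.hive.jdbc.HiveDriver",
    "jdbc:hive2://spark-thrift-host:10015/default",  # placeholder host/port
    ["username", "password"],
    "/path/to/hive-jdbc-standalone.jar",             # placeholder jar path
)
cursor = conn.cursor()
cursor.execute("SELECT * FROM some_table LIMIT 10")
print(cursor.fetchall())
conn.close()

Since the Thrift Server speaks the HiveServer2 wire protocol, pointing the URL at the Spark Thrift Server (rather than at HiveServer2) is what makes Spark SQL the query engine.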
Moving this question from DevOps Stack Exchange where it got only 5 views in 2 days:
I would like to query an Azure Database for MySQL Single Server.
I normally interact with this database using a universal database tool (dBeaver) installed onto an Azure VM. Now I would like to interact with this database using Python from outside Azure. Ultimately I would like to write an API (FastAPI) allowing multiple users to connect to the database.
I ran a simple test from a Jupyter notebook, using SQLAlchemy as my ORM and specifying the pem certificate as a connection argument:
import pandas as pd
from sqlalchemy import create_engine

# Supply the CA certificate so the connection to Azure is made over SSL
cnx = create_engine('mysql://XXX', connect_args={"ssl": {"ssl_ca": "mycertificate.pem"}})
I then tried reading data from a specific table (e.g. mytable):
df = pd.read_sql('SELECT * FROM mytable', cnx)
Alas I ran into the following error:
'Client with IP address 'XX.XX.XXX.XXX' is not allowed to connect to this MySQL server'.
According to my colleagues, a way to fix this issue would be to whitelist my IP address.
While this may be an option for a couple of users with static IP addresses, I am not sure whether it is a valid solution in the long run.
Is there a better way to access an Azure Database for MySQL Single Server from outside Azure?
As mentioned in the comments, you need to whitelist the IP address range(s) in the Azure portal for your MySQL database resource. This is a well-accepted and secure approach.
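Since the longer-term goal above is a FastAPI layer, one common pattern is to host the API on a machine whose IP is whitelisted (or inside Azure) and have end users call the API rather than the database. A minimal sketch reusing the engine from the question; the route and table names are illustrative:

import pandas as pd
from fastapi import FastAPI
from sqlalchemy import create_engine

app = FastAPI()
engine = create_engine(
    'mysql://XXX',  # same placeholder URL as in the question
    connect_args={"ssl": {"ssl_ca": "mycertificate.pem"}},
)

@app.get("/rows")
def read_rows(limit: int = 10):
    # Only this host needs firewall access to MySQL; API clients never
    # connect to the database directly.
    df = pd.read_sql(f"SELECT * FROM mytable LIMIT {limit}", engine)
    return df.to_dict(orient="records")

This way only one IP (or VNet rule) has to be maintained no matter how many users the API serves.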
How can I connect to a SQL/MP or SQL/MX database on NonStop using Python?
I have followed this to get a connection established to the HP NonStop database, but I am getting an error that the driver DRIVER={NonStop ODBC/MX 3.6} is not found.
Help is appreciated.
To connect to NonStop SQL (or any comparable database), an ODBC or JDBC driver is required.
In my experience JDBC works with Java only, though pyodbc and other Python libraries are available (they didn't work for me, and they internally depend on Java).
It is better to do it in Java. In my case I called the Java class from Python using OS-level commands, stored the result in an Excel file, and read that back from Python.
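A rough sketch of that workaround, calling a small Java JDBC client from Python and capturing its output directly (skipping the intermediate Excel file). The class name NonStopQueryRunner and the jar t4sqlmx.jar (the NonStop JDBC Type 4 driver) are assumptions; substitute your own:

import subprocess

# Run a Java program that executes the query over JDBC and prints rows to
# stdout; Python reads the output instead of connecting itself.
result = subprocess.run(
    ["java", "-cp", "t4sqlmx.jar:.", "NonStopQueryRunner",
     "SELECT * FROM mytable"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)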
The approach I am trying is to write a dynamic script that generates mirror tables in SQL Server with data types similar to those in Oracle, and then a second dynamic script to insert the records into SQL Server. The challenge I see is incompatible data types. Has anyone come across a similar situation? I am a SQL developer, but I can learn Python if someone can share their similar work.
Have you tried the "SQL Server Import and Export Wizard" in SSMS?
i.e. if you create an empty SQL Server database and right-click on it in SSMS, one of the "Tasks" menu options is "Import Data...", which starts the "SQL Server Import and Export Wizard". This builds a one-off SSIS package, which can be saved if you want to reuse it.
There is a data source option for "Microsoft OLE DB Provider for Oracle".
You might have a better Oracle OLE DB Provider available also to try.
This will require Oracle client software to be available.
I haven't actually tried this (Oracle to SQL Server), so I am not sure whether it is reasonable or not.
How many tables, columns?
An Oracle DB may also have views, triggers, constraints, indexes, functions, packages, sequence generators, and synonyms.
I used a linked server and got all the table metadata from dba_tab_columns in Oracle. I wrote a script to create the tables based on that metadata. I needed to use an SSIS script task to save the CREATE TABLE script for source control. Then I wrote a SQL script to insert the data from Oracle, handling the type differences in the script.
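For anyone wanting to try the same metadata-driven step from Python, here is a hedged sketch using cx_Oracle. The type map is a deliberately tiny sample, not a complete Oracle-to-SQL Server mapping, and the connection string and schema name are placeholders:

import cx_Oracle

# Minimal sample mapping; a real migration needs many more cases
# (NUMBER precision/scale, TIMESTAMP variants, LOBs, etc.).
TYPE_MAP = {"VARCHAR2": "VARCHAR", "NUMBER": "NUMERIC", "DATE": "DATETIME2"}

conn = cx_Oracle.connect("user/password@oracle-host/service")  # placeholder
cur = conn.cursor()
cur.execute("""
    SELECT table_name, column_name, data_type, data_length
    FROM dba_tab_columns
    WHERE owner = :owner
    ORDER BY table_name, column_id
""", owner="MYSCHEMA")

# Group columns by table, translating each Oracle type as we go
tables = {}
for table, column, dtype, length in cur:
    sql_type = TYPE_MAP.get(dtype, "VARCHAR(255)")
    if sql_type == "VARCHAR":
        sql_type = "VARCHAR(%d)" % length
    tables.setdefault(table, []).append("  %s %s" % (column, sql_type))

# Emit one CREATE TABLE statement per table
for table, cols in tables.items():
    print("CREATE TABLE %s (\n%s\n);" % (table, ",\n".join(cols)))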
I want to connect to Microsoft Analysis Services via Python. I have seen that you can do this with the xmla or olapy packages, but both of them require the Analysis Server to be exposed over HTTP, which is not applicable in my case. Is it possible to connect to the Analysis Server using a connection string, similar to Microsoft's OLAP package in R?
i.e. the connection will be something like:
connection_string = "Provider=MSOLAP.8;Integrated Security=SSPI;Persist Security Info=True;Initial Catalog="Database Name";Data Source="Server Name";MDX Compatibility=1;Safety Options=2;MDX Missing Member Mode=Error;Update Isolation Level=2"
After connecting to the Analysis Server via this connection string, I expect to query the Server by some MDX/DAX code.
Thanks in advance!
Just from Googling around, it seems that the IronPython library will be useful; see:
Execute query on SQL Server Analysis Services with IronPython
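If you are on regular CPython on Windows, another option is to load the ADOMD.NET client through pythonnet instead of switching to IronPython. A minimal sketch; the DLL path, server, catalog, and query are all assumptions for your environment:

import sys
import clr  # pip install pythonnet

# The install path varies with your ADOMD.NET version
sys.path.append(r"C:\Program Files\Microsoft.NET\ADOMD.NET\150")
clr.AddReference("Microsoft.AnalysisServices.AdomdClient")
from Microsoft.AnalysisServices.AdomdClient import AdomdConnection, AdomdCommand

conn = AdomdConnection("Data Source=ServerName;Initial Catalog=DatabaseName;Integrated Security=SSPI")
conn.Open()
# MDX query text; DAX also works against tabular models
cmd = AdomdCommand("SELECT {[Measures].DefaultMember} ON COLUMNS FROM [SomeCube]", conn)
reader = cmd.ExecuteReader()
while reader.Read():
    print([reader[i] for i in range(reader.FieldCount)])
conn.Close()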
I'm using SQLAlchemy in a WSGI Python web app to query the database. If I make two concurrent requests, the second request invariably throws an exception, with SQL Server stating:
[24000] [FreeTDS][SQL Server]Invalid cursor state (0) (SQLExecDirectW)
Unfortunately it looks like I can't use caching to prevent the additional requests to the database. Is there another way to resolve this issue? Ideally using native Python libraries (i.e. not relying on another Python module)?
The only thing I can think of is using threads to put a lock on the function making the database queries, but I'm worried this will slow down the app.
Is there anything else that can be done? Is this a configuration issue?
I'm using FreeTDS v0.91 on a CentOS 5.9 server, connecting to MS SQL Server 2008.
The webapp is based on Paste.
Are your two concurrent requests using different database connections? DBAPI connections are not generally thread-safe. At the ORM level, you'd want to make sure you're using a session per request, so that each request has its own Session and therefore its own dedicated DBAPI connection.
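A minimal sketch of that session-per-request pattern with SQLAlchemy's scoped_session in a bare WSGI app; the connection URL and query are placeholders:

from sqlalchemy import create_engine, text
from sqlalchemy.orm import scoped_session, sessionmaker

engine = create_engine("mssql+pyodbc://user:pass@mydsn")  # placeholder URL
Session = scoped_session(sessionmaker(bind=engine))

def application(environ, start_response):
    session = Session()  # thread-local: each request/thread gets its own session
    try:
        rows = session.execute(text("SELECT 1")).fetchall()
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [str(rows).encode("utf-8")]
    finally:
        Session.remove()  # hand the DBAPI connection back to the pool

Because scoped_session is thread-local, two concurrent requests never share a cursor, which avoids the invalid-cursor-state error without putting a global lock around your query function.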