How does writing data from spark to vertica work? - python

Are there any good online resources to learn how writing data from Spark to Vertica works? I'm trying to understand why writing to a Vertica database is slow.
This is my basic workflow:
Create a SparkContext and wrap it in a pyspark.sql.SQLContext.
From the SQLContext, use the read property to get a DataFrameReader and set its format to 'jdbc'.
df = self._sqlContext.read.format('jdbc').options(url=self._jdbcURL, dbtable=subquery).load()
Read entries from a Vertica database (call it dbA) over the JDBC connection.
Write those entries into another Vertica database (call it dbB) using the context from Step 1.
Right now it's just a simple read from dbA and write to dbB. But writing 50 entries takes about 5 seconds.
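Stripped down, the round trip looks roughly like this (hosts, table names, and credentials are placeholders, and the Vertica JDBC jar is assumed to be on the Spark classpath):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName='vertica-copy')
sqlContext = SQLContext(sc)

# read from dbA over JDBC; the subquery is pushed down as the dbtable
df = sqlContext.read.format('jdbc').options(
    url='jdbc:vertica://dbA-host:5433/dbA',
    driver='com.vertica.jdbc.Driver',
    dbtable='(SELECT * FROM source_table LIMIT 50) AS sub',
    user='dbadmin',
    password='secret',
).load()

# write the same rows to dbB over JDBC
df.write.format('jdbc').options(
    url='jdbc:vertica://dbB-host:5433/dbB',
    driver='com.vertica.jdbc.Driver',
    dbtable='target_table',
    user='dbadmin',
    password='secret',
).mode('append').save()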
Thanks!

Have you tried HPE's Big Data Marketplace, specifically the HPE Vertica Connector For Apache Spark? You'll need to create an account to download the file, but there's no cost associated with creating an account. The documentation includes a Scala example of writing a Spark data frame to a Vertica table.
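If you go the connector route, the save side in PySpark would look roughly like the sketch below. The data source name and the option keys (host, db, user, password, dbschema, table) are assumptions based on the connector's documentation, so check them against the version you download:

# assumes the Vertica Spark connector jars are on the driver/executor classpath
df.write.format('com.vertica.spark.datasource.DefaultSource').options(
    host='vertica-host',   # placeholder
    db='dbB',              # placeholder
    user='dbadmin',
    password='secret',
    dbschema='public',
    table='target_table',
).mode('append').save()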

Related

Getting data from Cassandra tables to MongoDB/RDBMS in realtime

I have an application which is using Cassandra as a database. I need to create some reports from the Cassandra DB data, but the data is not modelled to match the report queries, so one report may need data scattered across multiple tables. Because Cassandra doesn't allow joins like an RDBMS, this is not simple to do. So I am thinking of a solution that gets the required tables' data into some other DB (RDBMS or Mongo) in real time and then generates the report from there. Is there any standard way to get data from Cassandra to another DB (Mongo or RDBMS) in real time, i.e. whenever an insert/update/delete happens in Cassandra the same is applied to the destination DB? Any example program or code would be very helpful.
You would be better off using the Spark + Spark Cassandra Connector combination for this task. With Spark you can do the joins in memory and write the results back to Cassandra or to a plain text file.
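A minimal PySpark sketch of that approach, assuming the spark-cassandra-connector package is on the classpath; the keyspace, tables, and join key are made up:

from pyspark.sql import SparkSession

# e.g. launched with --packages com.datastax.spark:spark-cassandra-connector_2.12:<version>
spark = SparkSession.builder \
    .appName('cassandra-report') \
    .config('spark.cassandra.connection.host', 'cassandra-host') \
    .getOrCreate()

def cassandra_table(keyspace, table):
    return spark.read.format('org.apache.spark.sql.cassandra') \
        .options(keyspace=keyspace, table=table) \
        .load()

orders = cassandra_table('shop', 'orders')        # made-up tables
customers = cassandra_table('shop', 'customers')

# do the join in memory, then write the result wherever the report lives
report = orders.join(customers, 'customer_id')
report.write.mode('overwrite').csv('/tmp/report_csv')  # or .format('jdbc') for an RDBMS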

Is there any way we can load BigTable data into BigQuery?

I want to load Bigtable data into BigQuery directly.
Until now I have been exporting Bigtable data to a CSV file using Python and then loading the CSV file into BigQuery.
But I don't want a CSV file sitting between Bigtable and BigQuery. Is there any direct way?
To add to Mikhail's recommendation, I'd suggest creating a permanent table in BigQuery using the external table. You'll define the schema for the columns you want and then query the rows you're interested in. Once that data is saved into BigQuery, it won't have any impact on your Bigtable performance. If you want to get the latest data, you can create a new permanent table with the same query.
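A rough sketch of that with the google-cloud-bigquery Python client; the project, dataset, and column names are made up, and it assumes the external table over Bigtable already exists:

from google.cloud import bigquery

client = bigquery.Client()

# materialize a query over the external (Bigtable-backed) table into a native table
job_config = bigquery.QueryJobConfig()
job_config.destination = bigquery.TableReference.from_string(
    'my-project.my_dataset.bigtable_snapshot')  # made-up destination
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

sql = '''
    SELECT rowkey, cf.metric.cell.value AS metric  -- made-up column path
    FROM `my-project.my_dataset.bigtable_external`
'''
client.query(sql, job_config=job_config).result()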
If you're looking to have the data copied over and stored in BigQuery, Querying Cloud Bigtable data using permanent external tables is not what you're looking for. It explicitly mentions that "The data is not stored in the BigQuery table". My understanding is that the permanent table is more for persistent access controls, but still queries Bigtable directly.
This may be overkill, but you could set up an Apache Beam pipeline that runs on Dataflow, with a BigtableIO source and a BigQueryIO sink. You'd have to write a little bit of transformation logic, but overall it should be a pretty simple pipeline. The only catch is that the BigtableIO connector is only available in the Beam Java SDK, so you'd have to write this pipeline in Java.

How to use PySpark to stream data into MySQL database?

I am currently working on a single-page web app that allows users to upload large CSV files (currently testing a ~7GB file) to a Flask server and then stream that dataset into a database. The upload takes about a minute and the file gets fully saved to a temporary file on the Flask server. Now I need to be able to stream this file and store it in a database. I did some research and found that PySpark is great for streaming data, and I am choosing MySQL as the database to stream the CSV data into (but I am open to other DBs and streaming methods). I am a junior dev and new to PySpark, so I'm not sure how to go about this. The Spark Streaming guide says that data must be ingested through a source like Kafka, Flume, TCP sockets, etc., so I am wondering if I have to use one of those methods to get my CSV file into Spark. However, I came across a great example where csv data is streamed into an Azure SQL database, and it looks like they are just reading the file directly with Spark without ingesting it through a streaming source like Kafka. The only thing that confuses me about that example is that they use an HDInsight Spark cluster to stream data into the db, and I'm not sure how to incorporate all of that with a Flask server. I apologize for the lack of code, but currently I just have a Flask server file with one route doing the file upload. Any examples, tutorials, or advice would be greatly appreciated.
I am not sure about the streaming part, but Spark can handle large files efficiently, and writing to a db table is done in parallel. So, without knowing much about your details, and provided that you have the uploaded file on your server, I'd say that:
If I wanted to save a big structured file like a csv in a table, I would start like this:
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# start with some basic spark configuration, e.g. we want the timezone to be UTC
conf = SparkConf()
conf.set('spark.sql.session.timeZone', 'UTC')
# this is important: you need the MySQL Connector/J jar for your MySQL version
# (https://dev.mysql.com/downloads/connector/j/) on the classpath:
conf.set('spark.jars', '/path/to/mysql-connector-java.jar')
# instantiate a spark session: the first time it will take a few seconds
spark = SparkSession.builder \
    .config(conf=conf) \
    .appName('Huge File uploader') \
    .getOrCreate()
# read the file first as a dataframe
df = spark.read.csv('/path/to/the/huge.csv')
# optionally, add a filename column
df = df.withColumn('filename', F.lit('thecurrentfilename'))
# write it to the table
df.write.format('jdbc').options(
    url='jdbc:mysql://localhost:3306/yourdb',
    driver='com.mysql.cj.jdbc.Driver',  # the driver for MySQL
    dbtable='the_table_to_save_to',
    user='user',
    password='secret',
).mode('append').save()
Note the mode 'append' here: the catch is that Spark cannot perform updates on a table; it can either append the new rows or overwrite what is in the table.
So, if your csv is like this:
id, name, address....
You will end up with a table with the same fields.
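One more practical note: instead of setting spark.jars in the code, you can pass the connector jar when launching the job (file names here are placeholders):

spark-submit --jars /path/to/mysql-connector-java.jar your_app.py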
This is the most basic example I could think of so that you start with spark, with no considerations about a spark cluster or anything else related. I would suggest you take this for a spin and figure out if this suits your needs :)
Also, keep in mind that this might take a few seconds or more depending on your data, where the database is located, your machine, and your database load, so it might be a good idea to keep things asynchronous in your API; again, I don't know your other details.
Hope this helps. Good luck!

Python: How to update (overwrite) Google BigQuery table using pandas dataframe

I have a table in Google BigQuery (GBQ) with almost 3 million records (rows) so far, created from data coming from a MySQL db every day. This data is inserted into the GBQ table using a Python pandas DataFrame (.to_gbq()).
What is the optimal way to sync changes from MySQL to GBQ, in this direction, with Python?
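For context, my current daily load boils down to something like this (project and table names are placeholders); if_exists='replace' is the overwrite option from the title:

import pandas as pd

# df is built from the daily MySQL extract (toy example here)
df = pd.DataFrame({'id': [1, 2], 'name': ['a', 'b']})

df.to_gbq('my_dataset.my_table',        # placeholder dataset.table
          project_id='my-gcp-project',  # placeholder project
          if_exists='append')           # 'replace' overwrites the whole table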
Several different ways to import data from MySQL to BigQuery that might suit your needs are described in this article. For example Binlog replication:
This approach (sometimes referred to as change data capture - CDC) utilizes MySQL’s binlog. MySQL’s binlog keeps an ordered log of every DELETE, INSERT, and UPDATE operation, as well as Data Definition Language (DDL) data that was performed by the database. After an initial dump of the current state of the MySQL database, the binlog changes are continuously streamed and loaded into Google BigQuery.
Seems to be exactly what you are searching for.

Can I use mrjob python library on partitioned hive tables?

I have user access to a Hadoop server/cluster containing data that is stored solely in partitioned tables/files in Hive (Avro). I was wondering if I can run MapReduce jobs using Python mrjob on these tables. So far I have been testing mrjob locally on text files stored on CDH5 and I am impressed by the ease of development.
After some research I discovered there is a library called HCatalog, but as far as I know it's not available for python (only Java). Unfortunately, I do not have much time to learn Java and I would like to stick to Python.
Do you know any way to run mrjob on hive stored data?
If this is impossible, is there a way to stream python-written mapreduce code to hive? (I would rather not upload mapreduce python files to hive)
As Alex stated, mrjob currently does not work with Avro-formatted files. However, there is a way to run Python code on Hive tables directly (no mrjob needed, unfortunately with some loss of flexibility). Eventually, I managed to add a Python file as a resource to Hive by executing "ADD FILE mapper.py" and running a SELECT with TRANSFORM ... USING ..., storing the results of the mapper in a separate table. Example Hive query:
INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;
Full example is available here (at the bottom): link
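For completeness, the weekday_mapper.py referenced above is just a stdin-to-stdout script. A sketch of what it likely looks like, adapted from the standard Hive example (so treat the exact logic as an assumption):

import sys
import datetime

# Hive's TRANSFORM streams tab-separated rows to stdin; we emit them back
# with unixtime replaced by the weekday number
for line in sys.stdin:
    userid, movieid, rating, unixtime = line.strip().split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print('\t'.join([userid, movieid, rating, str(weekday)]))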
