I am currently working on a single-page web app that allows users to upload large CSV files (currently testing a ~7GB file) to a Flask server and then stream that dataset into a database. The upload takes about a minute, and the file gets fully saved to a temporary file on the Flask server. Now I need to be able to stream this file and store it in a database.

I did some research and found that PySpark is great for streaming data, and I am choosing MySQL as the database to stream the CSV data into (but I am open to other databases and streaming methods). I am a junior dev and new to PySpark, so I'm not sure how to go about this. The Spark streaming guide says that data must be ingested through a source like Kafka, Flume, TCP sockets, etc., so I am wondering if I have to use one of those methods to get my CSV file into Spark. However, I came across a great example where they stream CSV data into an Azure SQL database, and it looks like they are just reading the file directly with Spark, without ingesting it through a streaming source like Kafka. The only thing that confuses me about that example is that they use an HDInsight Spark cluster to stream data into the db, and I'm not sure how to incorporate all of that with a Flask server.

I apologize for the lack of code, but currently I just have a Flask server file with one route doing the file upload. Any examples, tutorials, or advice would be greatly appreciated.
I am not sure about the streaming part, but Spark can handle large files efficiently, and writing to a db table is done in parallel. So, without knowing more of your details, and provided that you have the uploaded file on your server, I'd say that if I wanted to save a big structured file like a CSV into a table, I would start like this:
# start with some basic spark configuration, e.g. we want the timezone to be UTC
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set('spark.sql.session.timeZone', 'UTC')
# this is important: you need the MySQL Connector/J jar for the right MySQL version,
# which you can download from https://dev.mysql.com/downloads/connector/j/
conf.set('spark.jars', '/path/to/mysql-connector-j.jar')
# instantiate a spark session: the first time it will take a few seconds
spark = SparkSession.builder \
    .config(conf=conf) \
    .appName('Huge File uploader') \
    .getOrCreate()
# read the file first as a dataframe
df = spark.read.csv('/path/to/the/huge.csv', header=True)
# optionally, add a filename column
from pyspark.sql import functions as F
df = df.withColumn('filename', F.lit('thecurrentfilename'))
# write it to the table
df.write.format('jdbc').options(
    url='jdbc:mysql://localhost:3306/yourdb',
    driver='com.mysql.cj.jdbc.Driver',  # the driver for MySQL
    dbtable='the_table_name_to_save_to',
    user='user',
    password='secret',
).mode('append').save()
Note the mode 'append' here: the catch is that Spark cannot perform updates on a table; it can either append the new rows ('append') or replace what is in the table ('overwrite').
So, if your csv is like this:
id, name, address....
You will end up with a table with the same fields.
This is the most basic example I could think of to get you started with Spark, with no consideration of a Spark cluster or anything else related. I would suggest you take it for a spin and figure out if it suits your needs :)
Also, keep in mind that this might take a few seconds or more depending on your data, where the database is located, your machine, and your database load, so it might be a good idea to keep things asynchronous with your API. Again, I don't know about any of your other details.
Hope this helps. Good luck!
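As for keeping it asynchronous: one minimal sketch, using only the standard library, is to submit the load to a background thread pool from the Flask route and let a second route poll the status. Everything here is hypothetical scaffolding; `run_spark_load` is a placeholder for the Spark code above.

```python
# Minimal sketch of running the Spark load off the request thread.
# `run_spark_load` is a hypothetical placeholder for the Spark code above.
import uuid
from concurrent.futures import ThreadPoolExecutor


class JobRunner:
    """Tracks background load jobs by id so the API can poll their status."""

    def __init__(self, worker):
        self._worker = worker
        self._pool = ThreadPoolExecutor(max_workers=2)
        self._futures = {}

    def submit(self, csv_path):
        job_id = str(uuid.uuid4())
        self._futures[job_id] = self._pool.submit(self._worker, csv_path)
        return job_id

    def status(self, job_id):
        future = self._futures[job_id]
        if not future.done():
            return 'running'
        return 'failed' if future.exception() else 'done'


def run_spark_load(csv_path):
    # placeholder: this is where the spark.read.csv(...).write... code would go
    return csv_path


jobs = JobRunner(run_spark_load)
```

The upload route would save the temp file, call `jobs.submit(tmp_path)`, and return the job id immediately; a second route would expose `jobs.status(job_id)`.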
Related
I have an application which is using Cassandra as a database. I need to create some kind of reports from the Cassandra DB data, but the data is not modelled as per the report queries, so one report may have data scattered across multiple tables. As Cassandra doesn't allow joins like an RDBMS, this is not simple to do. So I am thinking of a solution to get the required tables' data into some other DB (RDBMS or Mongo) in real time and then generate the report from there. Do we have any standard way to get the data from Cassandra into other DBs (Mongo or RDBMS) in real time, i.e. whenever an insert/update/delete happens in Cassandra, the same gets updated in the destination DB? Any example program or code would be very helpful.
You would be better off using the Spark + Spark Cassandra Connector combination for this task. With Spark you can do joins in memory and write the data back to Cassandra or to any text file.
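As a rough sketch: once the Cassandra tables are loaded as DataFrames, the join itself is plain Spark. The keyspace, table, and column names below are hypothetical, and the connector package would need to be on the classpath (e.g. via `--packages com.datastax.spark:spark-cassandra-connector_2.12:<version>`).

```python
# Hypothetical sketch: keyspace/table/column names are placeholders.
# With the spark-cassandra-connector on the classpath, each Cassandra table
# loads as a DataFrame like this:
#   orders = (spark.read.format('org.apache.spark.sql.cassandra')
#                  .options(keyspace='shop', table='orders').load())

def build_report(orders_df, users_df, key='user_id'):
    """Join two Cassandra-sourced DataFrames in Spark memory (which Cassandra
    itself cannot do); the result can then be written to any sink."""
    return orders_df.join(users_df, key)
```

The joined DataFrame can then be written out with the usual `df.write` API to wherever the reports are generated from.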
I want to load Bigtable data into BigQuery in a direct way.
Until now I have been loading Bigtable data into a CSV file using Python and then loading the CSV file into BigQuery.
But I don't want to use a CSV file in between Bigtable and BigQuery. Is there any direct way?
To add to Mikhail's recommendation, I'd suggest creating a permanent table in BigQuery using the external table. You'll define the schema for the columns you want and then query the rows you're interested in. Once that data is saved into BigQuery, it won't have any impact on your Bigtable performance. If you want to get the latest data, you can create a new permanent table with the same query.
If you're looking to have the data copied over and stored in BigQuery, Querying Cloud Bigtable data using permanent external tables is not what you're looking for. It explicitly mentions that "The data is not stored in the BigQuery table". My understanding is that the permanent table is more for persistent access controls, but still queries Bigtable directly.
This may be overkill, but you could set up an Apache Beam pipeline that runs in Dataflow, with a BigtableIO source and a BigQueryIO sink. You'd have to write a little bit of transformation logic, but overall it should be a pretty simple pipeline. The only catch is that the BigtableIO connector is only available in the Beam Java SDK, so you'd have to write this pipeline in Java.
I am currently using SQLAlchemy to create a database, and I want to enable users to update the database through an API endpoint by uploading a CSV file whenever necessary.
This is the CSV file.
The column names are the same as in my database. My API endpoints are working, and I can receive the CSV as a jsonified version of its contents. I was wondering how I update my database if, say, there's a change in the construction programme for a given date.
Thank you in advance (it's my first time posting, forgive me for the lack of details and explanation).
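One minimal sketch of such an update, assuming the table is keyed by date: for each CSV row, try an UPDATE on the key column, and INSERT when no existing row matched. The table and column names here are hypothetical, and SQLAlchemy Core with text() is used for brevity.

```python
# Hypothetical sketch: table/column names are placeholders; uses SQLAlchemy Core.
import csv
import io

from sqlalchemy import text


def upsert_csv(engine, table, csv_text, key):
    """For each CSV row, UPDATE the row matching `key`; INSERT if absent."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    with engine.begin() as conn:
        for row in rows:
            cols = [c for c in row if c != key]
            set_clause = ', '.join(f'{c} = :{c}' for c in cols)
            result = conn.execute(
                text(f'UPDATE {table} SET {set_clause} WHERE {key} = :{key}'),
                row,
            )
            if result.rowcount == 0:
                names = ', '.join(row)
                params = ', '.join(f':{c}' for c in row)
                conn.execute(
                    text(f'INSERT INTO {table} ({names}) VALUES ({params})'),
                    row,
                )
```

The API route would decode the uploaded file to text and pass it straight to `upsert_csv`. Note that the f-string interpolation of table/column names assumes they come from your own schema, never from user input.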
I have a table created by a crawler pointing to some parquet files stored in S3. From the Glue Data Catalog GUI I can see many fields (53).
When I open up an ETL dev endpoint and connect with a SageMaker notebook, load the same table, and run printSchema, I see far fewer fields (36) using the code below.
from pyspark.context import SparkContext
from awsglue.context import GlueContext
# Get the right stuff
glueContext = GlueContext(SparkContext.getOrCreate())
data = glueContext.create_dynamic_frame.from_catalog(database="my_database", table_name="my_table")
print('Fields: ', len(data.schema().fields))
data.printSchema()
returns only 36 fields. Can anyone tell me how to access the missing fields? It seems to happen most frequently on fields that are sparsely populated.
Edit: This unanswered question on the AWS forums seems to be due to the same issue - apparently PySpark tries to infer its own schema rather than use the one found by the crawler.
For parquet files, Glue uses Spark's reader and therefore relies on the schema read from the files themselves instead of the one created in the Data Catalog by the crawler.
If the source folder has files with different schemas, the Glue crawler merges them into a single schema, which makes it different from the one you see in the ETL job.
Have you tried .create_dynamic_frame.from_options and read directly from s3 bucket? Sometimes that behaves differently from Crawler.
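For reference, a sketch of that direct read, wrapped in a function since it needs a live GlueContext (the S3 path is a placeholder):

```python
# Hypothetical sketch: the S3 path is a placeholder. Reading with from_options
# bypasses the Data Catalog table and goes straight to the files.
def read_parquet_directly(glue_context, s3_path):
    return glue_context.create_dynamic_frame.from_options(
        connection_type='s3',
        connection_options={'paths': [s3_path]},
        format='parquet',
    )
```

Comparing `len(read_parquet_directly(glueContext, 's3://...').schema().fields)` against the from_catalog count would show whether the discrepancy comes from the catalog lookup or the files themselves.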
Have you tried 'Update all new and existing partitions with metadata from the table' in 'Output -> Configuration options (optional)' section of Crawler?
Are there any good online resources to learn how writing data from Spark to Vertica works? I'm trying to understand why writing to a Vertica database is slow.
This is my basic workflow:
Create a Spark context. I'm using the class pyspark.sql.SQLContext to create one.
From the SQLContext, use the read property to get a DataFrameReader for the 'jdbc' format.
df = self._sqlContext.read.format('jdbc').options(url=self._jdbcURL, dbtable=subquery).load()
Read entries from a Vertica database using jdbc connection (call it dbA)
Write those entries into another Vertica database using the SparkContext in Step 1 (call it dbB)
Right now it's just a simple read from dbA and write to dbB. But writing 50 entries takes about 5 seconds.
Thanks!
Have you tried HPE's Big Data Marketplace, specifically the HPE Vertica Connector For Apache Spark? You'll need to create an account to download the file, but there's no cost associated with creating an account. The documentation includes a Scala example of writing a Spark data frame to a Vertica table.