I have a huge amount of data stored in Cassandra and I want to process it using Spark through Python.
I just want to know how to connect Spark and Cassandra through Python.
I have seen people using sc.cassandraTable, but it isn't working for me, and fetching all the data from Cassandra at once and then feeding it to Spark doesn't make sense.
Any suggestions?
Have you tried the examples in the documentation?
Spark Cassandra Connector Python Documentation
spark.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="kv", keyspace="test") \
    .load().show()
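If the connector isn't on your classpath yet, you typically also need to point Spark at it and at your cluster when you build the session. A minimal sketch (the connector version and the host are assumptions, adjust them to your setup):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("cassandra-example")
    # assumed connector coordinates; pick the version matching your Spark/Scala build
    .config("spark.jars.packages", "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1")
    # assumed contact point for your Cassandra cluster
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate())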
I'll just give my "short" 2 cents. The official docs are totally fine for getting started. You might want to specify why it isn't working, e.g. did you run out of memory (perhaps you just need to increase the driver memory), or is there a specific error causing your example to fail? It would also help if you provided that example.
Here are some opinions/experiences of mine. Usually, not always, but most of the time, you have multiple rows per partition. You don't always have to load all the data in a table; more often than not you can keep the processing within a single partition. Since the data is sorted within a partition, this usually goes pretty fast and hasn't presented any significant problems for me.
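For example, if you filter on the partition key, the connector pushes the predicate down to Cassandra, so only that partition is read instead of the whole table (keyspace, table and column names below are made up):

df = (spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(table="sensor_readings", keyspace="test")
    .load()
    # an equality filter on the partition key is pushed down to Cassandra,
    # so Spark only pulls the rows of that one partition
    .filter("sensor_id = 'abc-123'"))

df.show()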
If you don't want the whole store-in-Cassandra, fetch-into-Spark cycle for your processing, there are really a lot of solutions out there. Basically, that would be Quora material. Here are some of the more common ones:
Do the processing in your application right away - this might require some sort of inter-instance communication framework like Hazelcast, or even better Akka Cluster; this is really a wide topic
Spark Streaming - do your processing right away in micro-batches and flush the results to some persistence layer for reading - which might be Cassandra (a rough sketch follows this list)
Apache Flink - use a proper streaming solution and periodically flush the state of the process to, e.g., Cassandra
Store data in Cassandra the way it's supposed to be read - this approach is the most advisable (though it's hard to say more with the info you provided)
The list could go on and on: user-defined functions in Cassandra, aggregate functions if your task is something simpler, and so on.
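For the Spark Streaming option, a rough sketch of what the micro-batch approach could look like with Structured Streaming, writing each batch back to Cassandra via foreachBatch (the Kafka topic, keyspace and table names are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-to-cassandra").getOrCreate()

# hypothetical source: a Kafka topic of raw events
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
    .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value"))

def write_to_cassandra(batch_df, batch_id):
    # each micro-batch is flushed to a Cassandra table for readers
    (batch_df.write
        .format("org.apache.spark.sql.cassandra")
        .options(table="events_by_key", keyspace="test")
        .mode("append")
        .save())

query = events.writeStream.foreachBatch(write_to_cassandra).start()
query.awaitTermination()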
It might also be a good idea to provide some details about your use case. What I said here is pretty general and vague, but then again, putting all of this into a comment just wouldn't make sense.
I've been learning Spark recently (PySpark to be more precise) and at first it seemed really useful and powerful to me. You can process GBs of data in parallel, so it should be much faster than processing it with classical tools... right? So I wanted to try it myself to be convinced.
So I downloaded a CSV file of almost 1 GB, ~ten million rows (link: https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhvhv/fhvhv_tripdata_2021-01.csv.gz) and wanted to try to process it with Spark and with Pandas to see the difference.
The goal was just to read the file and count how many rows there were for a certain date. I tried with PySpark:
Preprocess with PySpark
and with Pandas:
Preprocess with Pandas
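Roughly, the two versions looked like this (a minimal sketch rather than the exact screenshots; the column name pickup_datetime and the date are assumptions):

from pyspark.sql import SparkSession
import pandas as pd

path = "fhvhv_tripdata_2021-01.csv"

# PySpark version
spark = SparkSession.builder.appName("fhvhv-count").getOrCreate()
sdf = spark.read.option("header", True).csv(path)
print(sdf.filter(sdf.pickup_datetime.startswith("2021-01-15")).count())

# Pandas version
pdf = pd.read_csv(path)
print(pdf["pickup_datetime"].str.startswith("2021-01-15").sum())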
Both obviously give the same result, but it takes about 1 min 30 s with PySpark and only (!) about 30 s with Pandas.
I feel like I missed something, but I don't know what. Why does it take so much more time with PySpark? Shouldn't it be the other way around?
EDIT: I did not show my Spark configuration, but I am just running it locally, so maybe that is the explanation?
Spark is a distributed processing framework. That means that, in order to use it to its full potential, you must deploy it on a cluster of machines (called nodes): the processing is then parallelized and distributed across them. This usually happens on cloud platforms like Google Cloud or AWS. Another interesting option to check out is Databricks.
If you use it on your local machine, it runs on a single node, so it will just be a worse version of Pandas. That's fine for learning purposes, but it's not the way it is meant to be used.
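Concretely, a local session like the one below still goes through all of Spark's planning, serialization and shuffle machinery, but every "executor" is just a thread on your own machine (a minimal sketch):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .master("local[*]")   # all cores of one machine -- a single node, not a cluster
    .appName("local-test")
    .getOrCreate())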
For more information about how a Spark cluster works, check the documentation: https://spark.apache.org/docs/latest/cluster-overview.html
Keep in mind that this is a very deep topic, and it takes a while to understand everything decently...
Currently, the approach I take is (a rough sketch follows this list):
clearing the rows in the table using Python,
fetching the output of the view in Python and storing the result in a DataFrame,
appending the data to the table using df.to_sql in Python,
scheduling this script to run every day at a specified time (Prefect).
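Roughly, the script looks like this (a simplified sketch using the Prefect 2 style @flow decorator for illustration; the table, view and connection string are placeholders):

import pandas as pd
from sqlalchemy import create_engine, text
from prefect import flow

engine = create_engine("<my-database-connection-string>")  # placeholder

@flow
def refresh_table():
    with engine.begin() as conn:
        conn.execute(text("DELETE FROM reporting_table"))                    # 1. clear the rows
        df = pd.read_sql(text("SELECT * FROM reporting_view"), conn)         # 2. fetch the view
        df.to_sql("reporting_table", conn, if_exists="append", index=False)  # 3. append via to_sql

# the flow is then scheduled in Prefect to run once a day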
I find this method unappealing for the following reasons:
This method is external, hence it involves latency.
This method is subject to various dependencies, like the SQL connector I am using for Python and the scheduler (Prefect), and debugging can get tricky if I have more than 10 tables.
Is there a better way/package/tool to automate the process with the fewest dependencies and the least latency?
Have you tried Prefect 2 already? Regarding the load process, you may consider loading data to a temp table and merging from there -- by doing that in SQL, it might be faster and easier to troubleshoot. dbt is also a tool you can consider, and you can orchestrate dbt with prefect using the prefect-dbt package: https://github.com/PrefectHQ/prefect-dbt
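A sketch of that staging-table pattern (the MERGE statement below is T-SQL-style and the table/column names are made up; adapt it to your database):

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("<my-database-connection-string>")  # placeholder

with engine.begin() as conn:
    df = pd.read_sql(text("SELECT * FROM reporting_view"), conn)
    # load into a staging table instead of clearing the target
    df.to_sql("reporting_stage", conn, if_exists="replace", index=False)
    # then merge in SQL, which is easier to inspect and troubleshoot
    conn.execute(text("""
        MERGE reporting_table AS t
        USING reporting_stage AS s ON t.id = s.id
        WHEN MATCHED THEN UPDATE SET t.value = s.value
        WHEN NOT MATCHED THEN INSERT (id, value) VALUES (s.id, s.value);
    """))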
I have a question on the general strategy of how to integrate data into an MSSQL database.
Currently, I use python for my whole ETL process. I use it to clean, transform, and integrate the data in an MSSQL database. My data is small so I think this process works fine for now.
However, I think it is a little awkward for my code to constantly read data from and write data to the database. I think this strategy will become an issue once I'm dealing with large amounts of data, and the constant read/write seems very inefficient. However, I don't know enough to know whether this is a real problem or not.
I want to know if this is a feasible approach or should I switch entirely to SSIS to handle it. SSIS to me is clunky and I'd prefer not to re-write my entire code. Any input on the general ETL architecture would be very helpful.
Is this practice alright? Maybe?
There are too many factors to give a definitive answer. Conceptually, what you're doing - extract data from the source, transform it, load it to the destination (ETL) - is all that SSIS does. It likely can do some things more efficiently than Python - at least I've had a devil of a time getting a bulk load to work with memory-mapped data. Dump to disk and bulk insert that via Python - no problem. But if the existing process works, then let it go until it doesn't work.
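By "dump to disk and bulk insert" I mean something along these lines (a sketch; the connection string, file path and table are made up, and the file path has to be one the SQL Server instance can read):

import csv
import pyodbc

conn = pyodbc.connect("<my MSSQL connection string>")  # placeholder
rows = [(1, "a"), (2, "b")]                            # output of the transform step

# 1. dump the transformed rows to a flat file
with open(r"C:\staging\rows.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# 2. let SQL Server bulk-load the file, much faster than row-by-row inserts
cur = conn.cursor()
cur.execute(r"""
    BULK INSERT dbo.target_table
    FROM 'C:\staging\rows.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');
""")
conn.commit()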
If your team knows Python, introducing SSIS just to do ETL is likely going to be a bigger maintenance cost than scaling up your existing approach. On the other hand, if it's standard-ish Python + libraries and you're on SQL Server 2017+, you might be able to execute your scripts from within the database itself via sp_execute_external_script.
If the ETL process runs on the same box as the database, then ensure you have sufficient resources to support both processes at their maximum observed levels of activity. If the ETL runs elsewhere, then you'll want to ensure you have fast, full duplex connectivity between the database server and the processing box.
Stand up a load testing environment that parallels production's resources. Dummy up a 10x increase in source data and observe how the ETL fares. 100x, 1000x. At some point, you'll identify what development sins you committed that do not scale and then you're poised to ask a really good, detailed question describing the current architecture, the specific code that does not perform well under load and how one can reproduce this load.
The above design considerations will hold true for Python, SSIS or any other ETL solution - prepackaged or bespoke.
I've been poring over everything I can find for an answer to this, but can't seem to find anything:
I've got a batch update to a MySQL database that happens every few minutes, with Python handling the ETL work (I'm pulling data from web APIs into the MySQL system).
I'm trying to get a sense of what kinds of potential impact (be it positive or negative) I'd see by using either multithreading or multiprocessing to do multiple connections & inserts of the data simultaneously. Each worker (be it thread or process) would be updating a different table from any other worker.
At the moment I'm only updating a half-dozen tables with a few thousand records each, but this needs to be scalable to dozens of tables and hundreds of thousands of records each.
Every other resource I can find out there addresses doing multithreading/processing to the same table, not a distinct table per worker. I get the impression I would definitely want to use multithreading/processing, but it seems everyone's addressing the one-table use case.
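For reference, the pattern I have in mind is roughly this (table names, columns and row data below are just placeholders):

from concurrent.futures import ThreadPoolExecutor
import pymysql  # or whichever MySQL driver you already use

TABLES = {
    "table_a": [(1, "x"), (2, "y")],   # placeholder rows per table
    "table_b": [(3, "z")],
}

def load_table(table, rows):
    # one connection per worker; connections shouldn't be shared across threads
    conn = pymysql.connect(host="localhost", user="etl", password="...", database="mydb")
    with conn.cursor() as cur:
        cur.executemany(f"INSERT INTO {table} (id, val) VALUES (%s, %s)", rows)
    conn.commit()
    conn.close()

with ThreadPoolExecutor(max_workers=len(TABLES)) as pool:
    for table, rows in TABLES.items():
        pool.submit(load_table, table, rows)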
Thoughts?
I think your question is too broad to answer concisely. It seems you're asking about two separate subjects - will writing to separate MySQL tables speed it up, and is python multithreading the way to go. For the python part, since you're probably doing mostly IO, you should look at gevent, and ultramysql. As for the MySQL part, you'll have to wait for more answers.
For an ETL process I wrote in C#, I decided the best work partitioning was: each "source" having a thread for extraction, one thread for each transform "type", and one to load the transformed data to each target.
In my case, I found multiple threads per source just ended up saturating the source server too much; it became less responsive overall (to even non-ETL queries) and the extractions didn't really finish any faster since they ended up competing with each other on the source. Since retrieving the remote extract was more time consuming than the local (in memory) transform, I was able to pipeline the extract results from all sources through one transformer thread/queue (per transform "type"). Similarly, I only had a single target to load the data to, so having multiple threads there would have just monopolized the target.
(Some details omitted/simplified for brevity, and due to poor memory.)
...but I'd think we'd need more details about what your ETL process does.
I'm using Python 2.7.8 and pymongo 2.7,
and the MongoDB server is a replica set with one primary and two secondaries.
The MongoDB server is built on an AWS instance with EBS: 500 GB, 3000 IOPS.
I want to know if there is any way to speed up the inserts when w=2, j=True.
Using pymongo to insert a million files takes a lot of time,
and I know that if I use w=0 it will speed things up, but it isn't safe.
So, any suggestions? Please help me, thanks.
Setting W=0 is deprecated. This is the older model of MongoDB (pre-3.0), which they don't recommend using any more.
Using MongoDB as a file storage system also isn't a great idea; but, you can consider using GridFS if that's the case.
I assume you're trying some sort of mass-import, and you don't have many (or any) readers right now; in which case, you will be okay if any reader sees some, but not all, of the documents.
You have a couple of options:
set j=False. MongoDB will return more quickly (before the documents are committed to the journal), at the potential risk of documents being lost if the DB crashes.
set W=1. If replication is slow, this will only wait until one of the nodes (the primary) has the data before returning.
If you do need strong consistency requirements (readers seeing everything inserted so far), neither of these options will help.
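With a current pymongo (3.x+), a per-collection write concern looks roughly like this; on pymongo 2.7 you can pass w= and j= straight to the insert call instead (the host names here are made up):

from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://node1,node2,node3/?replicaSet=rs0")  # placeholder hosts

# acknowledge from the primary only, and don't wait for the journal flush
fast_coll = client.mydb.mycoll.with_options(
    write_concern=WriteConcern(w=1, j=False))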
You can use unordered or ordered bulk inserts.
This speeds things up a lot. Maybe also take a look at my muBulkOps, a wrapper around pymongo bulk operations.
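For example, with the current API this is insert_many (on pymongo 2.7 the equivalent is initialize_unordered_bulk_op); the connection string and documents below are placeholders:

from pymongo import MongoClient

coll = MongoClient("mongodb://localhost")["mydb"]["mycoll"]   # placeholder
docs = [{"seq": i, "value": i * 2} for i in range(100000)]    # placeholder documents

# unordered: the server can batch aggressively and keeps going past individual errors
coll.insert_many(docs, ordered=False)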