I've been learning Spark recently (PySpark to be more precise) and at first it seemed really useful and powerful to me. You can process GBs of data in parallel, so it should be much faster than processing it with classical tools... right? So I wanted to try it myself to be convinced.
So I downloaded a CSV file of almost 1 GB, ~ten million rows (link: https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhvhv/fhvhv_tripdata_2021-01.csv.gz) and wanted to try to process it with Spark and with Pandas to see the difference.
So the goal was just to read the file and count how many rows there were for a certain date. I tried with PySpark:
Preprocess with PySpark
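The actual snippet isn't reproduced here, but the PySpark version was essentially a read followed by a filtered count. A minimal sketch of that, where the file path, the pickup_datetime column name and the target date are assumptions:

# Hedged sketch of the PySpark read-and-count; file, column name and date are assumptions
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.master("local[*]").appName("fhvhv-count").getOrCreate()

df = spark.read.csv("fhvhv_tripdata_2021-01.csv", header=True, inferSchema=True)
count = df.filter(F.to_date(F.col("pickup_datetime")) == "2021-01-15").count()
print(count)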
and with Pandas:
Preprocess with Pandas
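And the Pandas equivalent, again only as a hedged sketch under the same assumptions about the file and the column name:

import pandas as pd

# Hedged sketch of the Pandas read-and-count; file, column name and date are assumptions
df = pd.read_csv("fhvhv_tripdata_2021-01.csv", parse_dates=["pickup_datetime"])
count = (df["pickup_datetime"].dt.date == pd.to_datetime("2021-01-15").date()).sum()
print(count)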
Which obviously gives the same result, but it takes about 1 min 30 s for PySpark and only (!) about 30 s for Pandas.
I feel like I missed something but I don't know what. Why does it take so much more time with PySpark? Shouldn't it be the other way around?
EDIT: I did not show my Spark configuration, but I am just running it locally, so maybe that is the explanation?
Spark is a distributed processing framework. That means that, in order to use it at its full potential, you must deploy it on a cluster of machines (called nodes): the processing is then parallelized and distributed across them. This usually happens on cloud platforms like Google Cloud or AWS. Another interesting option to check out is Databricks.
If you use it on your local machine, it runs on a single node, so it will be just a worse version of Pandas. That's fine for learning purposes, but it's not the way it is meant to be used.
For more information about how a Spark cluster works, check the documentation: https://spark.apache.org/docs/latest/cluster-overview.html
Keep in mind that this is a very deep topic, and it would take a while to decently understand everything.
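If you want to see what your local setup is actually doing, a purely illustrative check like this prints the master URL and the default parallelism Spark will use on your machine:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.sparkContext.master)              # e.g. local[*]
print(spark.sparkContext.defaultParallelism)  # number of local cores Spark uses for task slots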
Related
I have a question on the general strategy of how to integrate data into an MSSQL database.
Currently, I use Python for my whole ETL process. I use it to clean, transform, and integrate the data into an MSSQL database. My data is small, so I think this process works fine for now.
However, I think it is a little awkward for my code to constantly read data from and write data to the database. I think this strategy will be an issue once I'm dealing with large amounts of data, and the constant read/write seems very inefficient. However, I don't know enough to tell whether this is a real problem or not.
I want to know if this is a feasible approach or if I should switch entirely to SSIS to handle it. SSIS feels clunky to me and I'd prefer not to rewrite my entire codebase. Any input on the general ETL architecture would be very helpful.
Is this practice alright? Maybe?
There are too many factors to give a definitive answer. Conceptually, what you're doing - Extract data from a source, Transform it, Load it to a destination (ETL) - is all that SSIS does. SSIS can likely do things more efficiently than Python - at least I've had a devil of a time getting a bulk load to work with memory-mapped data. Dumping to disk and bulk inserting that via Python - no problem (a rough sketch below). But if the existing process works, then let it go until it doesn't work.
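For reference, a rough sketch of that dump-to-disk-then-bulk-insert pattern; the connection string, table name and file path are placeholders, and the CSV must be readable from the SQL Server machine for BULK INSERT to work:

import csv
import pyodbc

# Placeholder transformed data; in practice this comes out of your Python ETL step
rows = [("2021-01-15", 42), ("2021-01-16", 37)]

with open(r"C:\staging\daily_counts.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=.;DATABASE=etl;Trusted_Connection=yes"
)
conn.execute(
    "BULK INSERT dbo.daily_counts FROM 'C:\\staging\\daily_counts.csv' "
    "WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n');"
)
conn.commit()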
If your team knows Python, introducing SSIS just to do ETL is likely going to be a bigger maintenance cost than scaling up your existing approach. On the other hand, if it's standard-ish Python + libraries and you're on SQL Server 2017+, you might be able to execute your scripts from within the database itself via sp_execute_external_script.
If the ETL process runs on the same box as the database, then ensure you have sufficient resources to support both processes at their maximum observed levels of activity. If the ETL runs elsewhere, then you'll want to ensure you have fast, full duplex connectivity between the database server and the processing box.
Stand up a load testing environment that parallels production's resources. Dummy up a 10x increase in source data and observe how the ETL fares. 100x, 1000x. At some point, you'll identify what development sins you committed that do not scale and then you're poised to ask a really good, detailed question describing the current architecture, the specific code that does not perform well under load and how one can reproduce this load.
The above design considerations will hold true for Python, SSIS or any other ETL solution - prepackaged or bespoke.
When we write a PySpark dataframe to S3 from an EC2 instance, the write operation takes longer to complete than usual. Earlier it used to take 30 min to complete the write operation for 1000 records, but now it is taking more than an hour. Also, after the write operation completes, the switch to the next lines of code takes a long time (20-30 min). We are not sure whether this is an AWS S3 issue or something due to PySpark's lazy evaluation. Could anybody throw some light on this question?
Thanks in advance.
It seems like an issue with the cloud environment. Four things come to mind that you may want to check:
Spark version: with some older versions of Spark, one gets S3 issues.
Data size being written to S3, and also the format in which the data is stored (a sketch follows below).
Memory/Computation issue: The memory or CPU might be getting utilized to maximum levels.
Temporary storage issue: Spark stores some intermediate data in temporary storage, and that might be getting full.
So, with more details, it may become clearer what the solution is.
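As an illustration of point 2 (size and format), writing a small number of larger files in a columnar format is usually much cheaper than writing many tiny output parts. A hedged sketch, assuming a cluster with s3a access configured and a placeholder bucket:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-write-demo").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])  # stand-in for your data

# Fewer, larger Parquet files instead of many tiny output parts
df.coalesce(1) \
  .write \
  .mode("overwrite") \
  .parquet("s3a://my-bucket/output/demo/")  # bucket/path are placeholders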
I have huge data stored in Cassandra and I want to process it using Spark through Python.
I just want to know how to connect Spark and Cassandra through Python.
I have seen people using sc.cassandraTable, but it isn't working, and fetching all the data at once from Cassandra and then feeding it to Spark doesn't make sense.
Any suggestions?
Have you tried the examples in the documentation?
Spark Cassandra Connector Python Documentation
# Read a Cassandra table into a Spark DataFrame via the Spark Cassandra Connector
spark.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="kv", keyspace="test") \
    .load() \
    .show()
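For that snippet to work, the session needs the connector on the classpath and the Cassandra host configured. A minimal sketch, where the package version and host are assumptions about your setup:

# Started e.g. with: pyspark --packages com.datastax.spark:spark-cassandra-connector_2.12:3.4.1
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("cassandra-read") \
    .config("spark.cassandra.connection.host", "127.0.0.1") \
    .getOrCreate()
# After this, the spark.read ... load() call above works as written.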
I'll just give my "short" 2 cents. The official docs are totally fine for getting started. You might want to specify why this isn't working, i.e. did you run out of memory (perhaps you just need to increase the driver memory), or is there some specific error causing your example not to work? It would also be nice if you provided that example.
Here are some of my opinions/experiences. Usually - not always, but most of the time - you have multiple columns per partition. You don't always have to load all the data in a table, and more often than not you can keep the processing within a single partition. Since the data is sorted within a partition, this usually goes pretty fast and hasn't presented any significant problem.
If you don't want the whole store-in-Cassandra, fetch-to-Spark cycle for your processing, there are really a lot of solutions out there. Basically that would be Quora material. Here are some of the more common ones:
Do the processing in your application right away - might require some sort of inter-instance communication framework like Hazelcast, or even better an Akka cluster; this is really a wide topic
Spark Streaming - do your processing right away in micro-batches and flush the results to some persistence layer for reading - might be Cassandra (see the sketch at the end of this answer)
Apache Flink - use a proper streaming solution and periodically flush the state of the process to e.g. Cassandra
Store data in Cassandra the way it's supposed to be read - this approach is the most advisable (just hard to say with the info you provided)
The list could go on and on... user-defined functions in Cassandra, aggregate functions if your task is something simpler.
It might also be a good idea to provide some details about your use case. More or less what I said here is pretty general and vague, but then again putting all of this into a comment just wouldn't make sense.
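To make the Spark Streaming option above a bit more concrete, here is a hedged micro-batching sketch that uses the built-in rate source as a stand-in; the real source and the Cassandra write inside foreachBatch are assumptions about your setup:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

raw = spark.readStream.format("rate").load()  # toy source; replace with Kafka, files, ...
counts = raw.groupBy(F.window("timestamp", "1 minute")).count()

def flush(batch_df, batch_id):
    # here you would write each micro-batch to Cassandra (or another persistence layer)
    batch_df.show()

query = counts.writeStream.outputMode("update").foreachBatch(flush).start()
query.awaitTermination()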
I have a rather complex database which I deliver in CSV format to my client. The logic to arrive at that database is an intricate mix of Python processing and SQL joins done in sqlite3.
There are ~15 source datasets ranging from a few hundreds records to as many as several million (but fairly short) records.
Instead of having a mix of Python / sqlite3 logic, for clarity, maintainability and several other reasons I would love to move ALL logic to an efficient set of Python scripts and circumvent sqlite3 altogether.
I understand that the answer and the path to go would be Pandas, but could you please advise if this is the right track for a rather large database like the one described above?
I have been using Pandas with datasets > 20 GB in size (on a Mac with 8 GB RAM).
My main problem has been that there is a known bug in Python that makes it impossible to write files larger than 2 GB on OSX. However, using HDF5 circumvents that.
I found the tips in this and this article enough to make everything run without problem. The main lesson is to check the memory usage of your data frame and cast the types of the columns to the smallest possible data type.
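A small sketch of that lesson; the column names and file name are made up, but memory_usage, downcasting with to_numeric, and to_hdf are standard Pandas (to_hdf needs the tables package installed):

import pandas as pd

df = pd.DataFrame({"trips": [1, 2, 3], "fare": [10.5, 7.25, 33.0]})  # stand-in data

print(df.memory_usage(deep=True))  # per-column memory footprint
df["trips"] = pd.to_numeric(df["trips"], downcast="integer")
df["fare"] = pd.to_numeric(df["fare"], downcast="float")

# HDF5 sidesteps the large-file write issue mentioned above
df.to_hdf("data.h5", key="df", mode="w")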
I wish to export log files from multiple nodes (in my case Apache access and error logs) and aggregate that data in batch, as a scheduled job. I have seen multiple solutions that work with streaming data (think Scribe). I would like a tool that gives me the flexibility to define the destination. This requirement comes from the fact that I want to use HDFS as the destination.
I have not been able to find a tool that supports this in batch. Before reinventing the wheel I wanted to ask the StackOverflow community for their input.
If a solution exists already in python that would be even better.
We use http://mergelog.sourceforge.net/ to merge all our Apache logs.
Take a look at Zohmg, it's an aggregation/reporting system for log files using HBase and HDFS: http://github.com/zohmg/zohmg
Scribe can meet your requirements; there's a version (link) of Scribe that can aggregate logs from multiple sources, and after reaching a given threshold it stores everything in HDFS. I've used it and it works very well. Compilation is quite complicated, so if you have any problems, ask a question.
PiCloud may help.
The PiCloud Platform gives you the freedom to develop your algorithms and software without sinking time into all of the plumbing that comes with provisioning, managing, and maintaining servers.