show() subset of big dataframe pyspark - python

I have a big pyspark dataframe on which I am performing a number of transformations and joins with other dataframes. I would like to check whether the transformations and joins succeed and whether the dataframe looks as intended, but how can I show a small subset of the dataframe?
I have tried numerous things, e.g.
df.show(5)
and
df.limit(5).show()
but everything I try kicks off a large number of jobs, resulting in slow performance.
I could spin up a very large cluster, but is there a way of getting only a small subset of the dataframe, fast?

Try the RDD equivalent of the dataframe:
rdd_df = df.rdd
rdd_df.take(5)
Or, try printing the dataframe schema:
df.printSchema()

First, to show a certain number of rows you can use the limit() method after calling a select() method, like this:
df.select('*').limit(5).show()
Also, note that the df.show() action only prints the first 20 rows by default; it does not print the whole dataframe.
Second, and more importantly:
Spark Actions:
A Spark dataframe does not contain data; it contains instructions in the form of an operation graph. Because Spark works with big data, it does not execute every operation as soon as it is called, to avoid slow performance. Instead, methods are separated into two kinds, actions and transformations; transformations are collected and accumulated in the operation graph.
An action is a method that causes the dataframe to execute all the operations accumulated in the graph, which is what causes the slow performance, since it executes everything (note that UDFs are especially slow).
show() is an action: when you call show(), Spark has to compute every preceding transformation to show you the actual data.
Keep that in mind.
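As a minimal illustration (toy data, made-up column names), the filter below returns immediately because it is only a transformation; nothing runs until show() is called:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "value"])

filtered = df.filter(df.id > 1)   # transformation: returns instantly, nothing is computed
filtered.show(2)                  # action: only now does Spark execute the accumulated graph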

To iterate faster, you have to understand the difference between actions and transformations.
A transformation is any operation that results in another RDD/Spark dataframe, for example df.filter(), df.join(), or df.groupBy(). An action is any operation that results in something other than an RDD/dataframe, for example df.write(), df.count(), or df.show().
Transformations are lazy: unlike plain Python, writing df1 = df.filter(...) and df2 = df1.groupBy(...) does not mean df, df1, and df2 are materialized in memory. The data only flows through memory once you call an action, such as .show() in your case.
Calling df.limit(5).show() will not speed up your iteration much, because the limit only restricts the final dataframe that gets printed, not the original data flowing through the pipeline.
As others have suggested, you should limit your input data size so you can test more quickly whether your transformations work. To improve your iteration further, you can cache the dataframes produced by matured transformations instead of running them over and over again, as in the sketch below.
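For example, a rough sketch of that workflow (big_df, other_df, and the join key are hypothetical names, not from the question):
# Develop against a small, cached slice of the input (names are hypothetical).
small_df = big_df.limit(1000).cache()

# Run the expensive part once and cache the matured result.
joined = small_df.join(other_df, on="id", how="inner").cache()
joined.count()    # action that materializes the cache

joined.show(5)    # later actions reuse the cached data instead of recomputing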

Related

Memory Error when parsing two large data frames

I have two dataframes, each around 400k rows; call them a and b. What I want to do is, for every row in dataframe b, look up that row's account number in dataframe a. If it exists, I want to drop that row from dataframe a. The problem is that when I try to run this code, I keep getting memory errors. Initially I was using iterrows, but that seems to be bad for large datasets, so I switched to apply, but I'm running into the same error. Below is simplified pseudocode of what I'm trying:
def reduceAccount(accountID):
    idx = frameA.loc[frameA["AccountID"] == accountID].index
    frameA.drop(idx, inplace=True)

frameB["AccountID"].apply(reduceAccount)
I've even tried some shenanigans like iterating through the first few hundred/thousand rows, but after a cycle I still hit the memory error, which makes me think I'm still loading things into memory rather than clearing through them. Is there a better way to reduce dataframe a than what I'm trying? Note that I do not want to merge the frames (yet), just remove any row in dataframe a that has a duplicate key in dataframe b.
The issue is that in order to see all the values to filter on, you will need to have both dataframes in memory at some point. You can improve your efficiency somewhat by not using apply(), which still iterates row by row. The following is a more efficient, vectorized approach using boolean masking directly:
frameA[~frameA["AccountID"].isin(frameB["AccountID"])]
However, if storage is the problem, this may still not work. Some approaches to consider are chunking the data, as you say you've already tried, or some of the options in the pandas documentation on enhancing performance.
So basically you want every row in A whose 'AccountID' is not in B.
This can be done with a left anti-join, for example via merge with an indicator column:
frameA = frameA.merge(frameB[['AccountID']].drop_duplicates(), on='AccountID', how='left', indicator=True)
frameA = frameA[frameA['_merge'] == 'left_only'].drop(columns='_merge')
I think this is good in terms of memory efficiency, since you'll be leveraging pandas' built-in optimized code.
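As a small self-contained check of both approaches on toy data (the account IDs and extra columns are made up):
import pandas as pd

frameA = pd.DataFrame({"AccountID": [1, 2, 3, 4], "value": ["a", "b", "c", "d"]})
frameB = pd.DataFrame({"AccountID": [2, 4], "other": ["x", "y"]})

# Boolean mask: keep rows of frameA whose AccountID does not appear in frameB.
masked = frameA[~frameA["AccountID"].isin(frameB["AccountID"])]

# Left anti-join: merge on the key, then keep rows that only came from frameA.
merged = frameA.merge(frameB[["AccountID"]].drop_duplicates(), on="AccountID", how="left", indicator=True)
anti = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

print(masked)   # rows with AccountID 1 and 3
print(anti)     # the same rows via the merge route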

How to use dask to populate DataFrame in parallelized task?

I would like to use dask to parallelize a number-crunching task.
This task utilizes only one of the cores in my computer.
As a result of that task I would like to add an entry to a DataFrame via shared_df.loc[len(shared_df)] = [x, 'y']. This DataFrame should be populated by all of the (four) parallel workers / threads on my computer.
How do I have to setup dask to perform this?
The right way to do something like this, in rough outline:
Make a function that, for a given argument, returns a dataframe containing some part of the total data.
Wrap this function in dask.delayed, build a list of delayed calls, one per input argument, and make a dask dataframe with dd.from_delayed.
If you really need the index to be sorted, or the index to partition along different lines than the chunking you applied in the previous step, you may want to call set_index.
Please read the docstrings and examples for each of these steps!
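A minimal sketch of that outline (compute_part and its columns are made-up placeholders):
import pandas as pd
import dask
import dask.dataframe as dd

@dask.delayed
def compute_part(x):
    # hypothetical number-crunching step that returns one partition as a pandas frame
    return pd.DataFrame({"x": [x], "y": [x ** 2]})

parts = [compute_part(x) for x in range(4)]   # one delayed call per input argument
ddf = dd.from_delayed(parts)                  # lazily stitch the parts into one dask DataFrame
print(ddf.compute())                          # the four parts are computed in parallel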

How to perform time derivatives in Dask without sorting

I am working on a project that involves some larger-than-memory datasets, and have been evaluating different tools for working on a cluster instead of my local machine. One project that looked particularly interesting was dask, as it has a very similar API to pandas for its DataFrame class.
I would like to be taking aggregates of time-derivatives of timeseries-related data. This obviously necessitates ordering the time series data by timestamp so that you are taking meaningful differences between rows. However, dask DataFrames have no sort_values method.
When working with Spark DataFrame, and using Window functions, there is out-of-the-box support for ordering within partitions. That is, you can do things like:
from pyspark.sql.window import Window
my_window = Window.partitionBy(df['id'], df['agg_time']).orderBy(df['timestamp'])
I can then use this window function to calculate differences etc.
I'm wondering if there is a way to achieve something similar in dask. I can, in principle, use Spark, but I'm in a bit of a time crunch, and my familiarity with its API is much less than with pandas.
You probably want to set your timeseries column as your index.
df = df.set_index('timestamp')
This allows for much smarter time-series algorithms, including rolling operations, random access, and so on. You may want to look at http://dask.pydata.org/en/latest/dataframe-api.html#rolling-operations.
Note that in general setting an index and performing a full sort can be expensive. Ideally your data comes in a form that is already sorted by time.
Example
So in your case, if you just want to compute a derivative, you might do something like the following:
df = df.set_index('timestamp')
df.x.diff(...)
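Expanding that into a runnable sketch (toy data; the timestamp and x columns are assumptions standing in for your dataset):
import pandas as pd
import dask.dataframe as dd

# Small stand-in for the larger-than-memory dataset.
pdf = pd.DataFrame({
    "timestamp": pd.to_datetime("2024-01-01") + pd.to_timedelta(range(8), unit="h"),
    "x": [0.0, 1.0, 3.0, 6.0, 10.0, 15.0, 21.0, 28.0],
})
df = dd.from_pandas(pdf, npartitions=2)

# Keep a copy of the timestamp as a column so it can be differenced after set_index.
df = df.assign(ts=df["timestamp"]).set_index("timestamp", sorted=True)

dx = df["x"].diff()                       # change in value between consecutive rows
dt = df["ts"].diff().dt.total_seconds()   # elapsed seconds between consecutive rows
print((dx / dt).compute())                # finite-difference approximation of dx/dt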

pyspark Window.partitionBy vs groupBy

Let's say I have a dataset with around 2.1 billion records.
It's a dataset with customer information and I want to know how many times they did something. So I should group on the ID and sum one column (It has 0 and 1 values where the 1 indicates an action).
Now, I can use a simple groupBy and agg(sum) it, but to my understanding this is not really efficient. The groupBy will move around a lot of data between partitions.
Alternatively, I can also use a Window function with a partitionBy clause and then sum the data. One of the disadvantages is that I'll then have to apply an extra filter, because it keeps all the data, and I want one record per ID.
But I don't see how this Window handles the data. Is it better than the groupBy and sum, or is it the same?
As far as I know, when working with Spark DataFrames, the groupBy operation is optimized via Catalyst. The groupBy on DataFrames is unlike the groupBy on RDDs.
For instance, the groupBy on DataFrames performs the aggregation on each partition first, and then shuffles only the aggregated results for the final aggregation stage. Hence, only the reduced, aggregated results get shuffled, not the entire data. This is similar to reduceByKey or aggregateByKey on RDDs. See this related SO article with a nice example.
In addition, see slide 5 in this presentation by Yin Huai, which covers the benefits of using DataFrames in conjunction with Catalyst.
In conclusion, I think you're fine employing groupBy with Spark DataFrames. Using Window does not seem appropriate to me for your requirement; a minimal groupBy sketch follows below.
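To make that concrete, a minimal sketch of the groupBy approach (customer_id and action are assumed column names, not from the question):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy customer data: an ID and a 0/1 action flag.
df = spark.createDataFrame(
    [(1, 1), (1, 0), (2, 1), (2, 1), (3, 0)],
    ["customer_id", "action"],
)

# One row per customer; partial sums happen on each partition before the shuffle.
df.groupBy("customer_id").agg(F.sum("action").alias("n_actions")).show()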

How can I save partial results of dataframe transformation processes in pyspark?

I am working in Apache Spark, making multiple transformations on a single dataframe with Python.
I have coded some functions to make the different transformations easier. Imagine we have functions like:
def clearAccents(df, columns):
    # lines that remove accents from the dataframe using Spark functions or a UDF
    return df
I use those functions to "overwrite" the dataframe variable, saving the newly transformed dataframe each time a function returns. I know this is not good practice, and now I am seeing the consequences.
I noticed that every time I add a line like the ones below, the running time gets longer:
# Step transformation 1:
df = function1(df,column)
# Step transformation 2.
df = function2(df, column)
As I understand it, Spark does not save the resulting dataframe; it saves all the operations needed to produce the dataframe at the current line. For example, when running function1 Spark runs only that function, but when running function2 Spark runs function1 and then function2. What if I really need to run each function only once?
I tried df.cache() and df.persist(), but I didn't get the desired results.
I want to save partial results in a way that makes it unnecessary to recompute all the instructions from the beginning, only those from the last transformation function onward, without getting a StackOverflowError.
You probably aren't getting the desired results from cache() or persist() because they won't be evaluated until you invoke an action. You can try something like this:
# Step transformation 1:
df = function1(df,column).cache()
# Now invoke an action
df.count()
# Step transformation 2.
df = function2(df, column)
To see how the execution graph changes, the SQL tab in the Spark UI is a particularly helpful debugging tool.
I'd also recommend checking out the ML Pipeline API and seeing if it's worth implementing a custom Transformer. See Create a custom Transformer in PySpark ML.
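If you do explore the Pipeline route, a bare-bones custom Transformer looks roughly like this (the class name, columns, and lower-casing step are placeholders, not your actual transformation):
from pyspark.ml import Transformer
from pyspark.sql import functions as F

class LowercaseTransformer(Transformer):
    # Toy Transformer that lower-cases one column; a real one would hold your step.
    def __init__(self, inputCol, outputCol):
        super().__init__()
        self.inputCol = inputCol
        self.outputCol = outputCol

    def _transform(self, df):
        return df.withColumn(self.outputCol, F.lower(F.col(self.inputCol)))

# Usage: df = LowercaseTransformer("name", "name_clean").transform(df)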
