PySpark - SciPy Functions

I am dealing with Spark DataFrames with very long columns that represent time-domain signals: many millions of rows.
I need to perform signal processing on these using some functions from SciPy, which require the columns to be passed in as NumPy arrays. Specifically, I am trying to use these two:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.stft.html
https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.savgol_filter.html
My current approach is to turn the column into a NumPy array using collect, as follows:
x = np.array(df.select("column_name").collect()).reshape(-1)
Then I feed the result to SciPy.
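For reference, the full driver-side pattern I use currently looks roughly like this (the filter and window parameters below are only illustrative, not values from my real data):
import numpy as np
from scipy.signal import savgol_filter, stft

# collect() pulls the entire column to the driver -- this is the slow, non-scalable step
x = np.array(df.select("column_name").collect()).reshape(-1)

# illustrative parameters only
x_smooth = savgol_filter(x, window_length=51, polyorder=3)
f, t, Zxx = stft(x, fs=1.0, nperseg=256)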
However:
This is very slow.
collect() loads all the data onto the driver and therefore does not scale.
Could somebody please help me find the most performant way to do this?
I found this somewhat older post, but it reaches no conclusion:
How to convert a pyspark dataframe column to numpy array
My objections to that post:
It is not clear to me that Dask arrays are compatible with SciPy, because Dask apparently implements only a subset of NumPy's functionality.
The problem of how to convert from a PySpark DataFrame to an array in the first place (currently done with collect()) would not be solved by Dask.
Help with this is much needed and would be greatly appreciated.
Many thanks in advance.

Related

Dealing with very small static tables in pySpark

I am currently using Databricks to process data coming from our Azure Data Lake. The majority of the data is read into PySpark DataFrames, and these are relatively large datasets. However, I do have to perform some joins against smaller static tables to fetch additional attributes.
Currently, the only way I can do this is by converting those smaller static tables into PySpark DataFrames as well. I'm just curious whether using such a small table as a PySpark DataFrame is bad practice. I know PySpark is meant for large datasets that need to be distributed, but given that my large dataset is in a PySpark DataFrame, I assumed I would have to convert the smaller static table into a PySpark DataFrame as well in order to make the appropriate joins.
Any tips on best practices as they relate to joining with very small datasets would be appreciated. Maybe I am overcomplicating something that isn't even a big deal, but I was curious. Thanks in advance!
Take a look at broadcast joins. They are wonderfully explained here: https://mungingdata.com/apache-spark/broadcast-joins/
The best practice in your case is to broadcast the small DataFrame and join it to the large one, for example (Scala):
import org.apache.spark.sql.functions.broadcast
val joinedDF = largeDF.join(broadcast(smallDF), Seq("join_key"))  // replace "join_key" with your actual key column(s)
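Since the question is about PySpark, a minimal Python equivalent might look like this, assuming largeDF and smallDF are PySpark DataFrames and "lookup_key" stands in for whatever column the two tables actually share:
from pyspark.sql.functions import broadcast

# "lookup_key" is a placeholder for your real join column
joined = largeDF.join(broadcast(smallDF), on="lookup_key", how="left")
Note that Spark will often broadcast small tables automatically when they fall under spark.sql.autoBroadcastJoinThreshold, but the explicit hint makes the intent clear.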

Data analysis : compare two datasets for devising useful features for population segmentation

Say I have two pandas DataFrames, one containing data for the general population and one containing the same data for a target group.
I assume this is a very common use case of population segmentation. My first idea for exploring the data would be to perform some visualization using e.g. seaborn's FacetGrid, or bar plots and scatter plots, to get a general idea of the trends and differences.
However, I found that this is not as straightforward as I thought, since seaborn is built to analyze one dataset rather than to compare two datasets.
I found this SO answer which provides a solution. But I am wondering how people would go about it if the DataFrames were huge and a concat operation were not possible?
Datashader does not seem to provide such features, as far as I have seen?
Thanks for any ideas on how to go about such a task.
I would use the library Dask when data is too big for pandas. Dask comes from the same PyData ecosystem as pandas and is a bit more advanced, because it is a big-data tool, but it shares many of the same features, including concat. I found Dask easy enough to use and am using it for a couple of projects with dozens of columns and tens of millions of rows.
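For concreteness, a minimal sketch of the concat pattern in Dask; the file paths and the 'population' tagging column are made up for illustration, and the remaining columns are assumed to be numeric:
import dask.dataframe as dd

# hypothetical CSV sources; any dd.read_* function works the same way
general = dd.read_csv('general_population/*.csv')
target = dd.read_csv('target_group/*.csv')

# tag each frame so the two populations can be compared after concatenation
general['population'] = 'general'
target['population'] = 'target'
combined = dd.concat([general, target])

# everything above is lazy; compute() materializes a small aggregate for plotting
summary = combined.groupby('population').mean().compute()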

Parallelise a custom function with PySpark

I'm familiar with using UDFs to apply a custom function row-by-row to a DataFrame. However, I would like to know how to apply a custom function to different subsets of my DataFrame in parallel.
Here's a simplified example:
import numpy as np
import pandas as pd
dummy_data = pd.DataFrame({'id': np.random.choice(['a', 'b', 'c'], size=100),
                           'val': np.random.normal(size=100)})
My custom function takes an array of numbers as an input. For each unique 'id', I want to apply my function to the array of 'val' values associated with that id.
The simplistic way I'm doing it right now is to loop over my PySpark DataFrame, convert the data for each 'id' to a pandas DataFrame, and then apply the function. It works, but obviously it's slow and makes no use of Spark.
How can I parallelise this?
This answer is so short that it should really be a comment, but I don't have enough reputation to comment.
Spark 2.3 introduced pandas vectorized UDFs that are exactly what you're looking for: executing a custom pandas transformation over a grouped Spark DataFrame, in a distributed fashion, and with great performance thanks to PyArrow serialization.
See
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?pyspark.sql.functions.pandas_udf#pyspark.sql.functions.pandas_udf
for more information and examples.
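As a rough sketch of the grouped-map flavour on the toy data above: the de-meaning function is just a stand-in for your custom per-group logic, and it assumes an existing SparkSession named spark (Spark 2.3+; newer versions expose the same idea as groupBy(...).applyInPandas):
from pyspark.sql.functions import pandas_udf, PandasUDFType

sdf = spark.createDataFrame(dummy_data)

@pandas_udf(sdf.schema, PandasUDFType.GROUPED_MAP)
def demean(pdf):
    # pdf is a plain pandas DataFrame containing every row for one 'id'
    pdf['val'] = pdf['val'] - pdf['val'].mean()
    return pdf

result = sdf.groupby('id').apply(demean)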

How to perform time derivatives in Dask without sorting

I am working on a project that involves some larger-than-memory datasets, and have been evaluating different tools for working on a cluster instead of my local machine. One project that looked particularly interesting was dask, as it has a very similar API to pandas for its DataFrame class.
I would like to be taking aggregates of time-derivatives of timeseries-related data. This obviously necessitates ordering the time series data by timestamp so that you are taking meaningful differences between rows. However, dask DataFrames have no sort_values method.
When working with Spark DataFrames and using Window functions, there is out-of-the-box support for ordering within partitions. That is, you can do things like:
from pyspark.sql.window import Window
my_window = Window.partitionBy(df['id'], df['agg_time']).orderBy(df['timestamp'])
I can then use this window function to calculate differences etc.
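For example, a backward difference over that window might be written as follows, where 'value' is a placeholder for the actual measurement column:
from pyspark.sql import functions as F

# difference between each row and the previous row within the window
df = df.withColumn('d_value', F.col('value') - F.lag('value').over(my_window))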
I'm wondering if there is a way to achieve something similar in dask. I can, in principle, use Spark, but I'm in a bit of a time crunch, and my familiarity with its API is much less than with pandas.
You probably want to set your timeseries column as your index.
df = df.set_index('timestamp')
This allows for much smarter time-series algorithms, including rolling operations, random access, and so on. You may want to look at http://dask.pydata.org/en/latest/dataframe-api.html#rolling-operations.
Note that in general setting an index and performing a full sort can be expensive. Ideally your data comes in a form that is already sorted by time.
Example
So in your case, if you just want to compute a derivative you might do something like the following:
df = df.set_index('timestamp')
df.x.diff(...)
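A minimal end-to-end sketch of that suggestion, with a made-up Parquet path and a value column called 'x':
import dask.dataframe as dd

df = dd.read_parquet('timeseries/*.parquet')
df = df.set_index('timestamp')   # full sort by time; much cheaper if the data arrives roughly time-ordered

dx = df['x'].diff()              # difference between consecutive rows
mean_step = dx.mean().compute()  # aggregate down to a small in-memory result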

Efficiently creating lots of Histograms from grouped data held in pandas dataframe

I want to create a bunch of histograms from grouped data in a pandas DataFrame. Here's a link to a similar question. To generate some toy data very similar to what I am working with, you can use the following code:
from pandas import DataFrame
import numpy as np
x = ['A']*300 + ['B']*400 + ['C']*300
y = np.random.randn(1000)
df = DataFrame({'Letter':x, 'N':y})
I want to put those histograms (read: the binned data) in a new DataFrame and save it for later processing. Here's the real kicker: my file is 6 GB, with 400k+ groups and just 2 columns.
I've thought about using a simple for loop to do the work:
data = []
for group in df['Letter'].unique():
    data.append(np.histogram(df[df['Letter'] == group]['N'], range=(-2000, 2000), bins=50, density=True)[0])
df2 = DataFrame(data)
Note that the bins, range, and density keywords are all necessary for my purposes, so that the histograms are consistent and normalized across the rows in my new DataFrame df2 (the parameter values come from my real dataset, so they are overkill on the toy dataset). The for loop works great: on the toy dataset it generates a pandas DataFrame of 3 rows and 50 columns, as expected. On my real dataset, I've estimated that the time to completion would be around 9 days. Is there any better/faster way to do what I'm looking for?
P.S. I've thought about multiprocessing, but I think the overhead of creating processes and slicing data would be slower than just running this serially (I may be wrong and wouldn't mind being corrected on this one).
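For reference, the same computation can be expressed as a single groupby pass with the same bins, range, and density settings; whether that alone is fast enough at 400k+ groups is a separate question:
# one pass over the frame instead of one boolean filter per group
hists = df.groupby('Letter')['N'].apply(
    lambda s: np.histogram(s, range=(-2000, 2000), bins=50, density=True)[0])
df2 = DataFrame(hists.tolist(), index=hists.index)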
For the type of problem you describe here, I personally usually do the following, which is basically to delegate the whole thing to multithreaded Cython/C++. It's a bit of work, but not impossible, and I'm not sure there's really a viable alternative at the moment.
Here are the building blocks:
First, your df.x.values, df.y.values are just numpy arrays. This link shows how to get C-pointers from such arrays.
Now that you have pointers, you can write a true multithreaded program using Cython's prange and foregoing any Python from this point (you're now in C++ territory). So say you have k threads scanning your 6GB arrays, and thread i handles groups whose keys have a hash that is i modulo k.
For a C program (which is what your code really is now) the GNU Scientific Library has a nice histogram module.
When the prange is done, you need to convert the C++ structures back to numpy arrays, and from there back to a DataFrame. Wrap the whole thing up in Cython, and use it like a normal Python function.
