I have a clustering algorithm in Python that I am trying to convert to PySpark (for parallel processing).
I have a dataset that contains regions, and stores within those regions. I want to perform my clustering algorithm for all stores in a single region.
I have a few for loops before getting to the ML. How can I modify the code to remove the for loops in PySpark? I have read for loops in PySpark are generally not a good practice - but I need to be able to perform the model on many sub-datasets. Any advice?
For reference, I'm currently looping (through Pandas DataFrames) like the pseudocode below:
for region in df_region:
    for distinct stores in region:
        [apply ML clustering algorithm]
Search Built-in Algorithms
You could start by looking up the RDD-based built-in clustering algorithms, since they cover the common cases and have gone through a rigorous validation process.
Clustering - RDD-based API
If you're more familiar with the DataFrame-based API, you could go here for a glance instead. Keep in mind that as of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode (no new features, only bug fixes); the primary ML API is now the DataFrame-based API in the spark.ml package.
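For the DataFrame-based route, a minimal sketch could look like the following. The feature column names and the choice of KMeans are my assumptions, not something from your post:

from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# Assemble the (assumed) numeric feature columns into a single vector column.
assembler = VectorAssembler(inputCols=["feature_1", "feature_2"],
                            outputCol="features")
assembled = assembler.transform(spark_df)

# Fit a built-in clustering algorithm from spark.ml on the whole DataFrame.
kmeans = KMeans(k=3, featuresCol="features", predictionCol="cluster")
model = kmeans.fit(assembled)
clustered = model.transform(assembled)
clustered.show()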
Implement Yourself
Pandas UDFs
If you already have a model object, consider Pandas UDFs, since they have iterator support as of Spark 3.0.0. Simply put, this means the model won't be reloaded for each row.
from typing import Iterator

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf(...)  # fill in the return type of your UDF, e.g. "double"
def categorize(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    model = ...  # load the model once per executor, not once per row
    for features in iterator:
        yield pd.Series(model.predict(features))

# GROUP BY in Spark SQL or window functions can also be considered.
# It depends on your scenario; just remember that DataFrames are still based
# on RDDs: they are immutable and a high-level abstraction.
spark_df.withColumn("clustered_result", categorize("some_column")).show()
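The GROUP BY idea mentioned in the comments above maps nicely onto a grouped-map Pandas UDF (applyInPandas, Spark >= 3.0) if you want one clustering fit per region. This is only a hedged sketch under assumptions about your columns ("region", "store_id", two feature columns), and it swaps in scikit-learn's KMeans for your algorithm:

import pandas as pd
from sklearn.cluster import KMeans

def cluster_region(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf contains all rows of one region as a plain pandas DataFrame
    features = pdf[["feature_1", "feature_2"]]
    pdf["cluster"] = KMeans(n_clusters=3, n_init=10).fit_predict(features)
    return pdf

result = (
    spark_df
    .groupBy("region")  # one group (and one model fit) per region, in parallel
    .applyInPandas(
        cluster_region,
        schema="region string, store_id string, "
               "feature_1 double, feature_2 double, cluster int",
    )
)
result.show()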
RDD Exploring
If, unfortunately, the clustering algorithm you intend to run is not among Spark's built-in clustering algorithms and does not involve a training phase that produces a model, you could consider converting the Pandas DataFrame into RDDs and implementing your clustering algorithm on top of them. A rough outline would look like the following:
pandas_df = ....
spark_df = spark.createDataFrame(pandas_df)
.
.
clustering_result = spark_df.rdd.map(lambda p: cluster_algorithm(p))
note1: This is only a rough outline; you would probably want to partition the whole dataset into groups by region and then execute the clustering algorithm within each group (see the sketch below). Since the details of your clustering algorithm aren't entirely clear, I can only give advice based on some assumptions.
note2: an RDD-level implementation should be your last option.
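To make note1 a bit more concrete, a rough sketch of partitioning by region at the RDD level could look like this (the "region" column and the cluster_algorithm function are assumptions about your data and code):

def cluster_algorithm(rows):
    ...  # your own clustering over all rows (plain Python objects) of one region

clustering_result = (
    spark_df.rdd
    .map(lambda row: (row["region"], row))                  # key each row by region
    .groupByKey()                                            # one group per region
    .mapValues(lambda rows: cluster_algorithm(list(rows)))   # cluster within each region
)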
RDD Programming Guide
Chen Jin (2017), A Scalable Hierarchical Clustering Algorithm Using Spark
Related
I am currently using the tsfresh package for a project (predictive maintenance).
It is working really well and now I want to implement it live.
However, the issue is that I don't know how to store the feature engineering that has been applied to my original dataset in order to do the same feature engineering to the data that I am streaming (receiving live).
Do you have any idea if there is a parameter or a function that allows to store the feature engineering performed by tsfresh?
(I am using the extract_relevant_features function).
After searching through various posts, it turns out that the answer is that you can save your parameters into a dictionary (see here).
This dictionary can later be passed to the extract_features function to extract only those features.
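If it helps, the moving parts could look roughly like the sketch below; I'm assuming an id/time column layout and a reasonably recent tsfresh version, and timeseries_train, y_train and timeseries_live stand in for your own data:

from tsfresh import extract_features, extract_relevant_features
from tsfresh.feature_extraction.settings import from_columns

# Offline: select the relevant features once on the historical dataset.
features_train = extract_relevant_features(timeseries_train, y_train,
                                           column_id="id", column_sort="time")

# Store the feature-engineering "recipe" as a plain dictionary.
kind_to_fc_parameters = from_columns(list(features_train.columns))

# Online: apply exactly the same feature calculators to the streamed data.
features_live = extract_features(timeseries_live,
                                 column_id="id", column_sort="time",
                                 kind_to_fc_parameters=kind_to_fc_parameters)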
I'm currently using AzureML with pretty complex workflows involving large datasets, and I'm wondering what the best way is to manage the splits resulting from preprocessing steps. All my projects are built as pipelines fed by registered Datasets. I want to be able to track the splits so that I can easily retrieve, for example, the test and validation sets for integration-testing purposes.
What would be the best pattern to apply here? Registering every intermediate set as a different Dataset? Directly retrieving the intermediate sets using the Run IDs? ...
Thanks
I wish I had a more coherent answer; the upside is that you're at the bleeding edge, so should you find a pattern that works for you, you can evangelize it and make it best practice! Hopefully you find my rantings below valuable.
First off -- if you aren't already, you should definitely use PipelineData as the intermediate artifact for passing data between PipelineSteps. That way you can treat the PipelineData as semi-ephemeral: they are materialized should you need them, but you aren't required to keep hold of every single version of every PipelineData. You can always grab them using Azure Storage Explorer, or, like you said, using the SDK and walking down from a PipelineRun object.
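A minimal sketch of that wiring, assuming the v1 SDK; the compute target name and the script names are placeholders:

from azureml.core import Workspace
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Semi-ephemeral intermediate artifact passed between steps.
split_output = PipelineData("split_data", datastore=datastore)

split_step = PythonScriptStep(
    name="Split Data",
    script_name="split.py",
    arguments=["--output-dir", split_output],
    outputs=[split_output],
    compute_target="cpu-cluster",
)

train_step = PythonScriptStep(
    name="Train",
    script_name="train.py",
    arguments=["--input-dir", split_output],
    inputs=[split_output],
    compute_target="cpu-cluster",
)

pipeline = Pipeline(workspace=ws, steps=[split_step, train_step])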
Another recommendation is to split your workflow into the following pipelines:
featurization pipeline (all joining, munging, and featurizing)
training pipeline
scoring pipeline (if you have a batch scoring scenario).
The intra-pipeline artifacts are PipelineData, and the inter-pipeline artifacts would be registered Datasets.
To actually get to your question of associating data splits with models: our team struggled with this, especially because for each train/test split we also have an "extra cols" set, which contains identifiers or leaking variables that the model shouldn't see.
In our current hack implementation, we register our "gold" dataset as an Azure ML Dataset at the end of the featurization pipeline. The first step of our training pipeline is a PythonScriptStep, "Split Data", which contains our train/test split steps and outputs a pickled dictionary as data.pkl. Then we can unpickle it any time we need one of the splits, and we can join back on the index for any reporting needs. Here's a gist.
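The core of that "Split Data" step boils down to something like the sketch below; the column names and the 80/20 split are illustrative, not our exact code:

import pickle
from sklearn.model_selection import train_test_split

def split_and_pickle(df, output_path="data.pkl", label_col="label", extra_cols=("id",)):
    # Identifiers / leaking variables the model shouldn't see, kept for reporting.
    extras = df[list(extra_cols)]
    X = df.drop(columns=[label_col, *extra_cols])
    y = df[label_col]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    splits = {
        "X_train": X_train, "X_test": X_test,
        "y_train": y_train, "y_test": y_test,
        "extra_cols": extras,  # joinable back to any split via the index
    }
    with open(output_path, "wb") as f:
        pickle.dump(splits, f)
    return splits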
Registration is there to make sharing and reuse easier, so that you can retrieve a dataset by its name. If you do expect to reuse the test/validation sets in other experiments, then registering them makes sense. However, if you are just trying to keep a record of what you used for this particular experiment, you can always find that information via the Run, as you suggested.
I am working on a project that involves some larger-than-memory datasets, and have been evaluating different tools for working on a cluster instead of my local machine. One project that looked particularly interesting was dask, as it has a very similar API to pandas for its DataFrame class.
I would like to be taking aggregates of time-derivatives of timeseries-related data. This obviously necessitates ordering the time series data by timestamp so that you are taking meaningful differences between rows. However, dask DataFrames have no sort_values method.
When working with Spark DataFrames and using Window functions, there is out-of-the-box support for ordering within partitions. That is, you can do things like:
from pyspark.sql.window import Window
my_window = Window.partitionBy(df['id'], df['agg_time']).orderBy(df['timestamp'])
I can then use this window function to calculate differences etc.
I'm wondering if there is a way to achieve something similar in dask. I can, in principle, use Spark, but I'm in a bit of a time crunch, and my familiarity with its API is much less than with pandas.
You probably want to set your timeseries column as your index.
df = df.set_index('timestamp')
This allows for much smarter time-series algorithms, including rolling operations, random access, and so on. You may want to look at http://dask.pydata.org/en/latest/dataframe-api.html#rolling-operations.
Note that in general setting an index and performing a full sort can be expensive. Ideally your data comes in a form that is already sorted by time.
Example
So in your case, if you just want to compute a derivative you might do something like the following:
df = df.set_index('timestamp')
df.x.diff(...)
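Putting the pieces together, a small sketch might look like the following; the file pattern and the column names ('timestamp', 'x') are assumptions about your data:

import dask.dataframe as dd

df = dd.read_csv("timeseries-*.csv", parse_dates=["timestamp"])
df = df.set_index("timestamp")   # sorts by time; can be expensive on unsorted data

# Row-to-row differences along the sorted time index.
dx = df["x"].diff()

# Time-based rolling aggregate over the sorted index, e.g. a 1-hour mean.
hourly_mean = df["x"].rolling("1h").mean()

result = hourly_mean.compute()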
I am looking for a method/data structure to implement an evaluation system for a binary matcher used for verification.
This system will be distributed over several PCs.
Basic idea is described in many places over the internet, for example, in this document: https://precisebiometrics.com/wp-content/uploads/2014/11/White-Paper-Understanding-Biometric-Performance-Evaluation.pdf
The matcher I am testing takes two data items as input and calculates a matching score that reflects their similarity (a threshold will then be chosen, depending on the acceptable false match/false non-match rates).
Currently I store the matching scores along with labels in a CSV file, like the following:
label1, label2, genuine, 0.1
label1, label4, genuine, 0.2
...
label_2, label_n+1, impostor, 0.8
label_2, label_n+3, impostor, 0.9
...
label_m, label_m+k, genuine, 0.3
...
(I've got a labeled data base)
Then I run a Python script that loads this table into a Pandas DataFrame and calculates the FMR/FNMR curve, similar to the one shown in figure 2 in the link above. The processing is rather simple: just sort the DataFrame, scan the rows from top to bottom, and count the number of impostors/genuines on the rows above and below each row.
The system should also support finding outliers, in order to support matching-algorithm improvement (labels of pairs of data items that produced abnormally large genuine scores or abnormally small impostor scores). This is also pretty easy with DataFrames (just sort and take the head rows).
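For reference, the processing I described boils down to roughly the following pandas sketch (column names are illustrative; as in the sample rows above, a lower score means a closer match):

import pandas as pd

df = pd.read_csv("scores.csv", names=["label1", "label2", "kind", "score"])
df = df.sort_values("score").reset_index(drop=True)  # best (lowest) scores first

n_genuine = (df["kind"] == "genuine").sum()
n_impostor = (df["kind"] == "impostor").sum()

# Treat each row's score as a candidate threshold: rows up to and including it match.
false_matches = (df["kind"] == "impostor").cumsum()                 # impostors accepted
false_non_matches = n_genuine - (df["kind"] == "genuine").cumsum()  # genuines rejected

df["FMR"] = false_matches / n_impostor
df["FNMR"] = false_non_matches / n_genuine

# Outliers: genuine pairs with abnormally large scores, impostor pairs with small ones.
worst_genuines = df[df["kind"] == "genuine"].tail(10)
best_impostors = df[df["kind"] == "impostor"].head(10)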
Now I'm thinking about how to store the comparison data in RAM instead of CSV files on HDD.
I am considering Redis in this regard: the amount of data is large, several PCs are involved in the computations, and Redis has a master-slave feature that allows it to quickly sync data over the network, so that several PCs have exact clones of the data.
It is also free.
However, Redis does not seem to me to be well suited for storing such tabular data.
Therefore, I would need to change the data structures and the algorithms that process them.
However, it is not obvious to me how to translate this table into Redis data structures.
Another option would be using some other data storage system instead of Redis. However, I am unaware of such systems and will be grateful for suggestions.
You need to learn more about Redis to solve your challenges. I recommend you give https://try.redis.io a try and then revisit your questions.
TL;DR - Redis isn't a "tabular data" store, it is a store for data structures. It is up to you to use the data structure(s) that serves your query(ies) in the most optimal way.
IMO what you want to do is actually keep the large data (how big is it anyway?) on slower storage and just store the model (FMR curve computations? Outliers?) in Redis. This can almost certainly be done with the existing core data structures (probably Hashes and Sorted Sets in this case), but perhaps even more optimally with the new Modules API. See the redis-ml module as an example of serving machine learning models off Redis (and perhaps your use case would be a nice addition to it ;))
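For instance, a hedged sketch of the Sorted Set idea with redis-py (the key names are my own, and the lower-score-is-better convention is taken from your sample rows):

import redis

r = redis.Redis(host="localhost", port=6379)

def add_comparison(label1, label2, kind, score):
    # kind is "genuine" or "impostor"; the member encodes the pair of labels
    r.zadd(f"scores:{kind}", {f"{label1}|{label2}": score})

add_comparison("label1", "label2", "genuine", 0.1)
add_comparison("label_2", "label_n+1", "impostor", 0.8)

# False matches at threshold t: impostor pairs with score <= t.
t = 0.5
false_matches = r.zcount("scores:impostor", "-inf", t)

# Outliers: genuine pairs with the largest (worst) scores.
worst_genuines = r.zrevrange("scores:genuine", 0, 9, withscores=True)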
Disclaimer: I work at Redis Labs, home of the open source Redis and provider of commercial solutions that leverage it, including the above-mentioned module (open source, AGPL licensed).
I am a total beginner and am now trying to implement a simple search engine in Python.
I have the tokenizer working well, using functions from NLTK. But I am now confused about how to store the results of the tokenizer; I need to keep them for further indexing.
What's the common way to do this? What kind of database should I use?
Introduction to Information Retrieval by Manning, Raghavan and Schütze devotes several chapters to index construction and storage; so does Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto.
For a simple hobby/study project, though, SQLite will suffice for index storage. You need a table that holds (term, document-id, frequency) triples to compute tf and one that stores (term, df) pairs, both with an index on the terms; that's enough to compute tf-idf.
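A minimal sketch of that schema in SQLite follows; the table and column names are my own choices, it assumes each document is indexed exactly once, and the upsert needs SQLite >= 3.24:

import sqlite3
from collections import Counter

conn = sqlite3.connect("index.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS postings (
        term        TEXT    NOT NULL,
        document_id INTEGER NOT NULL,
        frequency   INTEGER NOT NULL,   -- term frequency within the document
        PRIMARY KEY (term, document_id)
    );
    CREATE TABLE IF NOT EXISTS document_frequency (
        term TEXT PRIMARY KEY,
        df   INTEGER NOT NULL            -- number of documents containing the term
    );
""")

def index_tokens(document_id, tokens):
    # Store one tokenized document's counts; enough to compute tf-idf later.
    for term, tf in Counter(tokens).items():
        conn.execute("INSERT INTO postings VALUES (?, ?, ?)", (term, document_id, tf))
        conn.execute("""INSERT INTO document_frequency VALUES (?, 1)
                        ON CONFLICT(term) DO UPDATE SET df = df + 1""", (term,))
    conn.commit()

index_tokens(1, ["simple", "search", "engine", "search"])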