TensorFlow Federated: How to tune non-IIDness in a federated dataset? - python

I am testing some algorithms in TensorFlow Federated (TFF). In this regard, I would like to test and compare them on the same federated dataset with different "levels" of data heterogeneity, i.e. non-IIDness.
Hence, I would like to know whether there is any way to control and tune the "level" of non-IIDness in a specific federated dataset, in an automatic or semi-automatic fashion, e.g. by means of the TFF APIs or just the traditional TF APIs (maybe inside the Dataset utils).
To be more practical: for instance, the EMNIST federated dataset provided by TFF has 3383 clients, each of them holding their own handwritten characters. However, these local datasets seem to be quite balanced in terms of the number of local examples and in terms of represented classes (all classes are, more or less, represented locally).
If I would like to have a federated dataset (e.g., starting from TFF's EMNIST one) that is:
Pathologically non-IID, for example having clients that hold only one class out of N classes (always referring to a classification task). Is this the purpose of tff.simulation.datasets.build_single_label_dataset (documentation here)? If so, how should I use it starting from a federated dataset such as the ones already provided by TFF?;
Unbalanced in terms of the amount of local examples (e.g., one client has 10 examples, another one has 100 examples);
Both of these possibilities;
how should I proceed inside the TFF framework to prepare a federated dataset with those characteristics?
Should I do all of this by hand? Or do some of you have any advice on how to automate this process?
An additional question: in the paper "Measuring the Effects of Non-Identical Data Distribution for Federated Visual Classification" by Hsu et al., they exploit the Dirichlet distribution to synthesize a population of non-identical clients, and they use a concentration parameter to control the identicalness among clients. This seems an easy-to-tune way to produce datasets with different levels of heterogeneity. Any advice about how to implement this strategy (or a similar one) inside the TFF framework, or just in TensorFlow (Python) considering a simple dataset such as EMNIST, would be very useful too.
Thank you a lot.

For Federated Learning simulations, it's quite reasonable to set up the client datasets in Python, in the experiment driver, to achieve the desired distributions. At a high level, TFF handles modeling data location ("placements" in the type system) and computation logic. Re-mixing/generating a simulation dataset is not quite core to the library, though there are helpful utilities as you've found. Doing this directly in Python by manipulating the tf.data.Dataset objects and then "pushing" the client datasets into a TFF computation seems straightforward.
Label non-IID
Yes, tff.simulation.datasets.build_single_label_dataset is intended for this purpose.
It takes a tf.data.Dataset and essentially filters out all examples that don't match desired_label values for the label_key (assuming the dataset yields dict-like structures).
For EMNIST, to create a dataset of all the ones (regardless of user), this could be achieved by:
import tensorflow_federated as tff

train_data, _ = tff.simulation.datasets.emnist.load_data()
ones = tff.simulation.datasets.build_single_label_dataset(
    train_data.create_tf_dataset_from_all_clients(),
    label_key='label', desired_label=1)
print(ones.element_spec)
>>> OrderedDict([('label', TensorSpec(shape=(), dtype=tf.int32, name=None)), ('pixels', TensorSpec(shape=(28, 28), dtype=tf.float32, name=None))])
print(next(iter(ones))['label'])
>>> tf.Tensor(1, shape=(), dtype=int32)
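To answer the "how should I use it from a federated dataset" part: instead of pooling all clients, you can apply it per client. A minimal sketch along those lines (the label assignment below, client index modulo 10, is arbitrary and just for illustration):

import tensorflow_federated as tff

train_data, _ = tff.simulation.datasets.emnist.load_data()

client_ids = train_data.client_ids[:10]
# Each simulated client keeps only a single digit class.
pathological_client_datasets = [
    tff.simulation.datasets.build_single_label_dataset(
        train_data.create_tf_dataset_for_client(cid),
        label_key='label',
        desired_label=i % 10)
    for i, cid in enumerate(client_ids)
]
# This list of tf.data.Datasets is what you would "push" into a TFF
# computation as the CLIENTS-placed argument, as described above.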
Data imbalance
A combination of tf.data.Dataset.repeat and tf.data.Dataset.take can be used to create data imbalances.
train_data, _ = tff.simulation.datasets.emnist.load_data()
datasets = [train_data.create_tf_dataset_for_client(id) for id in train_data.client_ids[:2]]
print([tf.data.experimental.cardinality(ds).numpy() for ds in datasets])
>>> [93, 109]
datasets[0] = datasets[0].repeat(5)
datasets[1] = datasets[1].take(5)
print([tf.data.experimental.cardinality(ds).numpy() for ds in datasets])
>>> [465, 5]
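Regarding the Dirichlet approach from Hsu et al. mentioned in the question: I'm not aware of a ready-made TFF utility for it, but it is straightforward to sketch in NumPy on a pooled dataset and then wrap each partition as a tf.data.Dataset. The sketch below assumes images and labels are NumPy arrays (e.g. obtained from the pooled EMNIST data); the concentration parameter alpha controls heterogeneity, with small alpha giving nearly single-class clients and large alpha approaching IID.

import numpy as np
import tensorflow as tf

def dirichlet_partition(images, labels, num_clients, alpha, num_classes=10, seed=0):
    # For each class, split its examples across clients with proportions
    # drawn from Dirichlet(alpha), following the idea in Hsu et al.
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for c in range(num_classes):
        idx = rng.permutation(np.where(labels == c)[0])
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        splits = np.split(idx, (np.cumsum(proportions)[:-1] * len(idx)).astype(int))
        for client_id, split in enumerate(splits):
            client_indices[client_id].extend(split.tolist())
    return [
        tf.data.Dataset.from_tensor_slices(
            {'pixels': images[np.array(ci, dtype=np.int64)],
             'label': labels[np.array(ci, dtype=np.int64)]})
        for ci in client_indices
    ]

client_datasets = dirichlet_partition(images, labels, num_clients=100, alpha=0.1)

As in the snippets above, the resulting list of datasets can then be fed to a TFF computation in the experiment driver.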

Related

Applying a CNN to Fast Fourier Transformed data?

I have data to which a Fast Fourier Transform has been applied (amplitudes at specific frequencies in Hz).
There are solutions on the internet where a CNN is applied to a mel spectrogram; however, I see no solution where a CNN is applied to a Fast-Fourier-Transformed signal.
Is it possible to apply a CNN to Fast-Fourier-Transformed signals?
Or is it not possible because a CNN relies on the temporal structure of the data?
Thanks!
I'm assuming each row of your spreadsheet is IID, e.g. it wouldn't change the problem to re-order the rows in that spreadsheet.
In this case you have a pretty typical ML problem. The fact that the FFT has already been applied and specific frequency responses (columns) have been extracted is a process called "feature engineering". Prior to the common use of neural networks, this was a standard step in all machine learning problems and remains common to a great many domains.
With data that has been feature engineered, you should look to traditional ML algorithms. Random Forests, XGBoost, and Linear Regression come to mind. A fully connected neural network is also appropriate, but I would typically expect it to under-perform other ML methods.
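For instance, a minimal sketch of such a baseline, assuming the FFT amplitudes are already the columns of a feature matrix X with labels y (hypothetical names):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# X: (n_samples, n_frequency_bins) of FFT amplitudes; y: class labels.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))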
The hallmark of a CNN is that it operates on an ordered sequence of data. In your case the raw data, from which your dataset was derived, would be appropriate for a CNN. In a sound file you have a 1D sequence of information. You could not re-order the data in the time dimension without fundamentally changing its meaning.
A 2D CNN operates over an image where the pixel order in X and Y cannot be changed. Again the sequential order of the data matters. The same applies for 3D CNNs.
Be aware that the application of an FFT has fundamentally biased your solution by representing it only in a limited set of frequency responses. All feature engineering fundamentally biases the problem, presumably in a well-thought-out way. However, it's entirely possible that other useful signals exist in the data which aren't expressed by the FFT at 10, 20, 30 Hz, etc. A CNN has the capacity to learn its own version of an FFT, as well as other non-cyclic patterns. Typically, the lack of a feature engineering step is the key differentiator between the CNN and traditional ML algorithms.

How to model RNN with Attention Mechanism for Non-Text Classification?

Recurrent Neural Networks (RNNs) with an attention mechanism are generally used for machine translation and natural language processing. In Python, implementations of RNNs with attention are abundant for machine translation (e.g. https://talbaumel.github.io/blog/attention/); however, what I would like to do is use an RNN with an attention mechanism on a temporal data file (not any textual/sentence-based data).
I have a CSV file of dimensions 21392 x 1972, which I have converted to a DataFrame using Pandas. The first column is in datetime format and the last column consists of target classes like "Class1", "Class2", "Class3", etc., which I would like to identify. So in total, there are 21392 rows (instances of data in 10-minute time steps) and 1971 features. The last (1972nd) column is the label column, with 14 different classes in total.
I have looked into available implementations for Keras (https://medium.com/datalogue/attention-in-keras-1892773a4f22) as well as for Tensorflow (Visualizing attention activation in Tensorflow), but none of them seem to do what I want to accomplish. I understand that this is an unusual approach, but I would like to try it and use the attention mechanism because many of my features are presumably redundant in the data.
import pandas as pd
mydataset = pd.read_csv('final_merged_data.csv')
It is evident from existing literature that an attention mechanism works quite well when coupled with an RNN. I am unable to locate any such implementation of an RNN with an attention mechanism that can also provide a visualisation. I am also unable to understand how I can convert my data into a sequence (or a list of lists) so that I can use it with one-hot encoding afterwards for an RNN with attention. I am new to Python as well as Keras/Tensorflow, and am quite confused about the procedure to convert/typecast my data into a form which will be able to mimic a sequence classification problem. My problem is basically multi-class classification, like one would normally do using machine learning classifiers to predict the labels, but using an RNN with attention. Any help in this regard would be highly appreciated. Cheers!
Kindly refer to this paper on using a sequence-to-sequence model with attention for time series classification.
https://www.computer.org/csdl/proceedings/icdmw/2016/5910/00/07836709.pdf
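Beyond that reference, here is a minimal Keras sketch of the general shape such a model could take, assuming you first window the 1971 numeric features into fixed-length sequences of consecutive 10-minute steps (the window length and layer sizes below are arbitrary), and using the built-in tf.keras.layers.Attention rather than the custom layers from the linked posts:

import tensorflow as tf

num_features = 1971
num_classes = 14
window = 16  # hypothetical number of consecutive 10-minute steps per sample

inputs = tf.keras.Input(shape=(window, num_features))
h = tf.keras.layers.LSTM(64, return_sequences=True)(inputs)
# Luong-style self-attention over the LSTM outputs.
context = tf.keras.layers.Attention()([h, h])
pooled = tf.keras.layers.GlobalAveragePooling1D()(context)
outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(pooled)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()

With sparse_categorical_crossentropy the 14 class labels can stay as integer codes (e.g. from pandas' factorize), so no one-hot encoding is strictly needed.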

How to study the effect of each data on a deep neural network model?

I'm working on training a neural network model using Python and the Keras library.
My model's test accuracy is very low (60.0%) and I have tried a lot to raise it, but I couldn't. I'm using the DEAP dataset (32 participants in total) to train the model. The splitting technique that I'm using is a fixed one: 28 participants for training, 2 for validation, and 2 for testing.
The model I'm using is as follows:
sequential model
Optimizer = Adam
With L2_regularizer, Gaussian noise, dropout, and Batch normalization
Number of hidden layers = 3
Activation = relu
Compile loss = categorical_crossentropy
initializer = he_normal
Now, I'm using a train-test technique (also a fixed one) to split the data and I got better results. However, I figured out that some of the participants are affecting the training accuracy in a negative way. Thus, I want to know if there is a way to study the effect of each piece of data (participant) on the accuracy (performance) of a model.
Best Regards,
From my Starting deep learning hands-on: image classification on CIFAR-10 tutorial, in which I insist on keeping track of both:
global metrics (log-loss, accuracy),
examples (correctly and incorrectly classified cases).
The latter may help us tell which kinds of patterns are problematic, and on numerous occasions it has helped me change the network (or supplement the training data, if that was the case).
An example of how it works (here with Neptune, though you could do it manually in a Jupyter Notebook, or using the TensorBoard image channel):
And then looking at particular examples, along with the predicted probabilities:
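If you do this manually instead of with Neptune, a minimal sketch (assuming a trained Keras model called model and test arrays x_test and y_test, hypothetical names) could be:

import numpy as np

probs = model.predict(x_test)                      # (n_samples, n_classes)
pred = probs.argmax(axis=1)
true = y_test.argmax(axis=1) if y_test.ndim > 1 else y_test

wrong = np.where(pred != true)[0]
# Sort the mistakes by how confident the wrong prediction was.
worst = wrong[np.argsort(-probs[wrong, pred[wrong]])]
for i in worst[:10]:
    print(f'example {i}: true={true[i]} pred={pred[i]} probs={np.round(probs[i], 3)}')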
Full disclaimer: I collaborate with deepsense.ai, the creators of Neptune - Machine Learning Lab.
This is, perhaps, more broad an answer than you may like, but I hope it'll be useful nevertheless.
Neural networks are great. I like them. But the vast majority of top-performance, hyper-tuned models are ensembles; they use a combination of stats-on-crack techniques, neural networks among them. One of the main reasons for this is that some techniques handle some situations better than others. In your case, you've run into a situation for which I'd recommend exploring alternative techniques.
In the case of outliers, rigorous value analyses are the first line of defense. You might also consider using principal component analysis or linear discriminant analysis. You could also try to chase them out with density estimation or nearest neighbors. There are many other techniques for handling outliers, and hopefully you'll find the tools I've pointed to easy to implement (with help from their docs); sklearn tends to readily accept data prepared for Keras.
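As a concrete illustration of the outlier-hunting side, a minimal sketch using scikit-learn's LocalOutlierFactor on the same feature matrix you feed to Keras (x_train and a parallel participant_ids array are hypothetical names here):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import LocalOutlierFactor

# Flatten samples to 2-D and reduce dimensionality to keep the neighbour search cheap.
x_2d = PCA(n_components=20).fit_transform(x_train.reshape(len(x_train), -1))
flags = LocalOutlierFactor(n_neighbors=20).fit_predict(x_2d)  # -1 marks an outlier

for pid in np.unique(participant_ids):
    mask = participant_ids == pid
    print(f'participant {pid}: {np.mean(flags[mask] == -1):.1%} of samples flagged as outliers')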

Improving SVC prediction performance on single samples

I have large-ish SVC models (~50Mb cPickles) for text classification and I am trying out various ways to use them in a production environment. Classifying batches of documents works very well (about 1k documents per minute using both predict and predict_proba).
However, prediction on a single document is another story, as explained in a comment to this question:
Are you doing predictions in batches? The SVC.predict method, unfortunately, incurs a lot of overhead because it has to reconstruct a LibSVM data structure similar to the one that the training algorithm produced, shallow-copy in the support vectors, and convert the test samples to a LibSVM format that may be different from the NumPy/SciPy formats. Therefore, prediction on a single sample is bound to be slow. – larsmans
I am already serving the SVC models as Flask web-applications, so a part of the overhead is gone (unpickling) but the prediction times for single docs are still on the high side (0.25s).
I have looked at the code in the predict methods but cannot figure out if there is a way to "pre-warm" them, reconstructing the LibSVM data structure in advance at server startup... any ideas?
def predict(self, X):
    """Perform classification on samples in X.

    For an one-class model, +1 or -1 is returned.

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape = [n_samples, n_features]

    Returns
    -------
    y_pred : array, shape = [n_samples]
        Class labels for samples in X.
    """
    y = super(BaseSVC, self).predict(X)
    return self.classes_.take(y.astype(np.int))
I can see three possible solutions.
Custom server
It is not a matter of "warming" anything up. Simply put, libSVM is a C library, and you need to pack/unpack data into the correct format. This process is more efficient on whole matrices than on each row separately. The only way to overcome this would be to write a more efficient wrapper between your production environment and libSVM (you could write a libSVM-based server which would use some kind of shared memory with your service). Unfortunately, this is too custom a problem to be solvable by existing implementations.
Batches
A naive approach like buffering the queries is an option (if it is a "high-performance" system with thousands of queries, you can simply store them in N-element batches and send them to libSVM in such packs).
Own classification
Lastly, classification using an SVM is a really simple task. You don't need libSVM to perform classification; only training is a complex problem. Once you have all the support vectors (SV_i), the kernel (K), the Lagrangian multipliers (alpha_i) and the intercept term (b), you classify using:
cl(x) = sgn( SUM_i y_i alpha_i K(SV_i, x) + b )
You can code this operation directly in your app, without the need to actually pack/unpack/send anything to libSVM. This can speed things up by an order of magnitude. Obviously, probabilities are more complex to retrieve, as they require Platt scaling, but it is still possible.
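For example, with a fitted binary scikit-learn SVC using an RBF kernel (clf below is assumed to be such a model; note that sklearn's dual_coef_ already stores y_i * alpha_i), a minimal sketch of that hand-rolled decision value is:

import numpy as np

def manual_decision(clf, x):
    # x: 1-D dense feature vector; clf: trained binary sklearn.svm.SVC with kernel='rbf'.
    gamma = clf._gamma  # private attribute; may differ across sklearn versions
    sv = clf.support_vectors_  # for sparse text features, use clf.support_vectors_.toarray()
    sq_dist = np.sum((sv - x) ** 2, axis=1)
    kernel = np.exp(-gamma * sq_dist)              # K(SV_i, x)
    return kernel @ clf.dual_coef_[0] + clf.intercept_[0]

# For a binary problem this should match clf.decision_function([x]).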
You can't construct the LibSVM data structure in advance. When a request to classify a document arrives, you get the text of the document, make a vector out of it, and only then convert it to the LibSVM format so you can get a decision.
LinearSVC should be considerably faster than an SVC with a linear kernel, as it uses liblinear. You could try using a different classifier if that does not decrease performance too much.

Is scikit-learn suitable for big data tasks?

I'm working on a TREC task involving use of machine learning techniques, where the dataset consists of more than 5 terabytes of web documents, from which bag-of-words vectors are planned to be extracted. scikit-learn has a nice set of functionalities that seems to fit my need, but I don't know whether it is going to scale well to handle big data. For example, is HashingVectorizer able to handle 5 terabytes of documents, and is it feasible to parallelize it? Moreover, what are some alternatives out there for large-scale machine learning tasks?
HashingVectorizer will work if you iteratively chunk your data into batches of 10k or 100k documents that fit in memory for instance.
You can then pass the batch of transformed documents to a linear classifier that supports the partial_fit method (e.g. SGDClassifier or PassiveAggressiveClassifier) and then iterate on new batches.
You can start scoring the model on a held-out validation set (e.g. 10k documents) as you go to monitor the accuracy of the partially trained model without waiting for having seen all the samples.
You can also do this in parallel on several machines on partitions of the data and then average the resulting coef_ and intercept_ attributes to get a final linear model for the whole dataset.
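As a rough illustration of that out-of-core pattern (iter_document_batches below is a hypothetical generator you would adapt to your corpus):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**20)   # stateless: nothing to fit
clf = SGDClassifier(loss='log_loss')               # use loss='log' on older scikit-learn
all_classes = [0, 1]                               # the full label set, known up front

for texts, labels in iter_document_batches(batch_size=10000):
    X = vectorizer.transform(texts)
    clf.partial_fit(X, labels, classes=all_classes)

# Periodically score a held-out batch to monitor progress:
# clf.score(vectorizer.transform(held_out_texts), held_out_labels)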
I discuss this in this talk I gave in March 2013 at PyData: http://vimeo.com/63269736
There is also sample code in this tutorial on parallelizing scikit-learn with IPython.parallel, taken from: https://github.com/ogrisel/parallel_ml_tutorial
