I have created a dataset from some sensor measurements and some labels and did some classification on it with good results. However, since the amount of data in my dataset is relatively small (1400 examples), I want to generate more data based on it. Each row of my dataset consists of 32 numeric values and a label.
What would be the best approach to generate more data based on the existing dataset I have? So far I have looked at Generative Adversarial Networks and Autoencoders, but I don't think these methods are suitable in my case.
Until now I have worked with scikit-learn, but I could use other libraries as well.
The keyword here is Data Augmentation: you take your available data and modify it slightly to generate additional examples that differ a little from the source data.
Please take a look at this link. The author uses Data Augmentation to rotate and flip a cat image, so he generates six additional images with different perspectives from a single source image.
If you transfer this idea to your sensor data, you can add some kind of random noise to your examples to enlarge the dataset. You can find a simple example of Data Augmentation for time series data here.
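A minimal sketch of the noise idea (the 1% noise scale and the number of copies are assumptions you would need to tune for your data):

```python
import numpy as np

def augment_with_noise(X, y, copies=3, noise_scale=0.01, seed=0):
    """Create `copies` jittered versions of each example by adding
    Gaussian noise scaled to each feature's standard deviation."""
    rng = np.random.default_rng(seed)
    sigma = X.std(axis=0) * noise_scale        # per-feature noise level
    X_parts, y_parts = [X], [y]
    for _ in range(copies):
        X_parts.append(X + rng.normal(0.0, sigma, size=X.shape))
        y_parts.append(y)                      # labels stay unchanged
    return np.vstack(X_parts), np.concatenate(y_parts)

# With X of shape (1400, 32) and y of shape (1400,):
# X_aug, y_aug = augment_with_noise(X, y, copies=3)   # -> 5600 examples
```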
Another approach is to window the data and shift the window by a small step, so that the data in each window differ slightly.
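A minimal sketch of the windowing idea, assuming your 32 values per row are cut from a longer continuous recording (the window length and step size are assumptions):

```python
import numpy as np

def sliding_windows(signal, window=32, step=4):
    """Cut a long 1-D signal into overlapping windows of length
    `window`, shifting the window by `step` samples each time."""
    starts = range(0, len(signal) - window + 1, step)
    return np.stack([signal[s:s + window] for s in starts])

# signal = ...                          # one long sensor recording
# X_extra = sliding_windows(signal)     # each row is a slightly shifted example
```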
The folks over at the statistics Stack Exchange have written about this as well; please check this for additional information.
Related
I'm in the process of collecting O2 data for work. The data show periodic behavior, and I would like to parse out each repetition to obtain statistics such as the average and the theoretical error (see the data figure).
Is there a convenient way to programmatically:
identify cyclical data?
pick out starting and ending indices so that repeating cycles can be concatenated, post-processed, etc.?
I had a few ideas, but I mostly lack the Python programming experience:
Brute force: condition the data in Excel beforehand. (I will likely collect similar data in the future, so I would prefer a more robust method.)
Train a neural network to identify cycles and output the indices. (Limited training set; I would have to label it.)
Decompose the signal into trend and seasonal components, apply a Fourier series to the seasonal data, and pick out N cycles.
Heuristics, i.e. identify rate-of-change thresholds and perform event detection (difficult due to the secondary hump, please see the data); a rough sketch of what I mean follows below.
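This is roughly the heuristic approach I have in mind, sketched on synthetic data (the prominence and distance values are made up and would need tuning on the real trace):

```python
import numpy as np
from scipy.signal import find_peaks

# Synthetic stand-in for the O2 trace: periodic with a secondary hump.
t = np.linspace(0, 10 * 2 * np.pi, 5000)
o2 = np.sin(t) + 0.3 * np.sin(2 * t) + 0.05 * np.random.randn(t.size)

# Detect troughs (peaks of the inverted signal) as cycle boundaries;
# prominence and distance are meant to filter out the secondary hump.
troughs, _ = find_peaks(-o2, prominence=0.8, distance=300)

# Slice out individual repetitions and compute per-cycle statistics.
cycles = [o2[a:b] for a, b in zip(troughs[:-1], troughs[1:])]
cycle_means = [c.mean() for c in cycles]
error = np.std(cycle_means)       # spread of the per-cycle averages
```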
Is there a Python program that systematically does this for me? Any help would be greatly appreciated.
Sample Data
I'm having trouble building even one functional machine learning model; the examples I have found all over the web are either off topic, or good but incomplete (missing dataset, explanations, ...).
The closest example related to my problem is this.
I'm trying to create a model based on accelerometer and gyroscope sensors, each with its own 3 axes. For example, if I lift the sensor parallel to gravity and then return it to its initial position, I should get a table like this.
Example
Now this whole table corresponds to one movement, which I call "Fade_away", and the duration of this same movement is variable.
I have only two main questions:
In which format do I need to save my dataset? I don't think a plain array can arrange this kind of data (a rough sketch of what I have in mind follows below).
How can I implement a simple model with at least one hidden layer?
To make it easier, let's say that I have 3 outputs: "Fade_away", "Punch" and "Rainbow".
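For question 1, this is roughly the format I have in mind; is a list of variable-length arrays plus a parallel list of labels reasonable? (the sample values below are made up):

```python
import numpy as np

# One movement = one (n_samples, 6) array: ax, ay, az, gx, gy, gz per row.
fade_away = np.array([
    [0.01, 0.98, 0.02, 0.10, -0.20, 0.00],
    [0.03, 1.10, 0.05, 0.30, -0.10, 0.10],
    # ... one row per sensor reading; n_samples varies per movement
])

# Durations differ, so keep a list of arrays instead of one big array.
recordings = [fade_away]
labels = ["Fade_away"]            # one label per recording
```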
I began to fall in love with a Python visualization library called Altair, and I use it in every small data science project that I've done.
Now, in terms of industry use cases, does it make sense to visualize Big Data, or should we just take a random sample?
Short answer: no. If you're trying to visualize data with tens of thousands of rows or more, Altair is probably not the right tool. But there are efforts in progress to add support for larger datasets to the Vega ecosystem; see https://github.com/vega/scalable-vega.
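As a workaround, a minimal sketch of the sampling approach from the question (synthetic data standing in for the large dataset):

```python
import altair as alt
import numpy as np
import pandas as pd

# Stand-in for a big dataset; in practice this would be your real data.
df = pd.DataFrame({
    "x": np.random.randn(1_000_000),
    "y": np.random.randn(1_000_000),
})

# Down-sample to something Altair handles comfortably, then plot.
chart = alt.Chart(df.sample(n=5000, random_state=0)).mark_point().encode(
    x="x",
    y="y",
)
chart.save("sampled_scatter.html")
```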
Background: I have proteomics data from seven samples (p-value / log-score of fold change), and I want to analyze the data with network (interactome) analyses.
Question: I would like to create an interactome of all the proteins in the data and map onto this network the proteins that have a significant p-value (compared to control).
After that, I would like to create subnetwork(s), and also add pathway enrichments to the subnetwork(s).
Request: please suggest online or standalone tools (or algorithms) that fit my requirements.
Thanks!
For creating network graphs to represent your protein-protein interactions, I would recommend taking a look at the networkx library. You can use it to pass in nodes (proteins of interest) and edges (interactions) and generate a graph. I believe it can generate subnetworks of these graphs as well.
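A minimal sketch of that workflow (the proteins, interactions, and significance threshold are all made up):

```python
import networkx as nx

# Full interactome: nodes are proteins, edges are interactions.
G = nx.Graph()
G.add_edges_from([("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P1", "P5")])

# Attach the p-values (vs. control) to the nodes.
pvalues = {"P1": 0.001, "P2": 0.20, "P3": 0.004, "P4": 0.03, "P5": 0.50}
nx.set_node_attributes(G, pvalues, name="pvalue")

# Subnetwork induced by the significant proteins.
significant = [n for n, p in pvalues.items() if p < 0.05]
sub = G.subgraph(significant)
print(sub.nodes(), sub.edges())
```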
First, thanks for reading this, and thanks a lot if you can give any clue to help me solve it.
As I'm new to scikit-learn, don't hesitate to offer any advice that could help me improve the process and make it more professional.
My goal is to classify data between two categories, and I would like to find the solution that gives the most precise results. At the moment, I'm still looking for the most suitable algorithm and data preprocessing.
In my data I have 24 values: 13 are nominal, 6 are binarized and the others are continuous. Here is an example of a line:
"RENAULT";"CLIO III";"CLIO III (2005-2010)";"Diesel";2010;"HOM";"_AAA";"_BBB";"_CC";0;668.77;3;"Fevrier";"_DDD";0;0;0;1;0;0;0;0;0;0;247.97
I have around 900K lines for learning, and I run my tests over another 100K lines.
As I want to compare several algorithm implementations, I wanted to encode all the nominal values so they can be used by several classifiers.
I tried several things:
LabelEncoder: this was quite good, but it produces ordered values that would be misinterpreted by the classifier.
OneHotEncoder: if I understand it correctly, it is nearly perfect for my needs because I can select which columns to binarize. But since I have a lot of nominal values, it always runs into a MemoryError. Moreover, its input must be numerical, so it is compulsory to LabelEncode everything first.
StandardScaler: this is quite useful, but not for what I need here. I decided to use it to scale my continuous values.
FeatureHasher: at first I didn't understand what it does; then I saw that it is mainly used for text analysis. I tried to use it for my problem by creating a new array containing the result of the transformation, but I don't think it was built to work that way, and the result wasn't even meaningful.
DictVectorizer: could be useful, but it looks like OneHotEncoder and puts even more data in memory.
partial_fit: this method is only provided by a few classifiers. I would like to be able to use it with Perceptron, KNearest and RandomForest at least, so it doesn't match my needs.
I looked at the documentation and found this information on the Preprocessing and Feature Extraction pages.
I would like a way to encode all the nominal values so that they are not treated as ordered, and the solution must work on large datasets with many categories and limited resources.
Is there any approach I haven't explored that could fit my needs?
Thanks for any clues and advice.
To convert unordered categorical features, you can try get_dummies in pandas; for more details, refer to its documentation. Another option is catboost, which can handle categorical features directly without transforming them into a numerical type.
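A minimal sketch of both options (the column names and values are made up; the CatBoost part assumes the catboost package is installed):

```python
import pandas as pd

df = pd.DataFrame({
    "brand": ["RENAULT", "PEUGEOT", "RENAULT"],
    "fuel": ["Diesel", "Essence", "Diesel"],
    "price": [668.77, 540.10, 702.30],
})

# Option 1: one-hot encode only the nominal columns with pandas.
encoded = pd.get_dummies(df, columns=["brand", "fuel"])

# Option 2: let CatBoost handle the categorical columns directly.
# from catboost import CatBoostClassifier
# model = CatBoostClassifier(cat_features=["brand", "fuel"], verbose=0)
# model.fit(df, y)          # y is the binary target
```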