I have a dataset generator for machine learning purposes which unfortunately is producing a distribution as shown in the attached image. Note that I can't really change the data generation process very much.
This dataset is heavily skewed towards the centre. I would like to make the distribution more uniform by removing some of those central datapoints. I am a beginner in programming and data science, so I am unaware of any methods for doing this.
If anyone can point me to a library or function I can use to achieve this, it would be much appreciated. I am using Python 3 and my data is stored in a CSV file.
Thanks!
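One simple way to do this with plain numpy and pandas is to bin the values and randomly keep at most a fixed number of points per bin; a minimal sketch (the CSV path, the column name 'value', and both tuning constants are assumptions):
import numpy as np
import pandas as pd

df = pd.read_csv('data.csv')                 # placeholder path
values = df['value'].to_numpy()              # 'value' is an assumed column name

n_bins, cap = 50, 200                        # cap = max points kept per bin (tune both)
edges = np.linspace(values.min(), values.max(), n_bins + 1)
bin_ids = np.digitize(values, edges[1:-1])   # bin index (0..n_bins-1) for every row

rng = np.random.default_rng(0)
keep = []
for b in range(n_bins):
    idx = np.flatnonzero(bin_ids == b)
    if len(idx) > cap:                       # downsample only the overfull (central) bins
        idx = rng.choice(idx, size=cap, replace=False)
    keep.extend(idx)

uniform_df = df.iloc[sorted(keep)]           # roughly uniform over the value range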
I'm reasonably new here, so sorry if this is a basic question.
I have a dataset comprising several thousand 'hotspots': small sections of 3D data, at most a few tens or hundreds of datapoints each, and each arranged around a centre. I am trying (in Python) to fit each one to a particular radial basis function, roughly of the form exp(-r)/r, and obtain the parameters of that fit for each hotspot. All the libraries I've seen that do this are ML-based and treat the hotspots as training data from which to obtain global parameters, but that's not what I need, because each hotspot should have different parameters and those variations contain meaningful information. scipy.interpolate.Rbf doesn't seem to allow you to retrieve the parameters it uses, unless I'm misreading the documentation. Is there any library that does this easily? Any advice would be very appreciated!
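One non-ML route is scipy.optimize.curve_fit, which returns the fitted parameters for each hotspot directly; a minimal sketch, assuming each hotspot comes as an (N, 3) array of coordinates, a data value per point, and a known centre:
import numpy as np
from scipy.optimize import curve_fit

def rbf(r, a, b):
    # model of the (assumed) form a * exp(-b * r) / r
    return a * np.exp(-b * r) / r

def fit_hotspot(points, centre, values):
    # points: (N, 3) coordinates, values: (N,) data at those points (assumed layout)
    r = np.linalg.norm(points - centre, axis=1)
    params, _ = curve_fit(rbf, r, values, p0=(1.0, 1.0))
    return params                                # (a, b) for this one hotspot

# one independent fit per hotspot, each with its own parameters
all_params = [fit_hotspot(p, c, v) for p, c, v in hotspots]   # hotspots: your own list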
I want to cluster some stars based on their given positions (X, Y, Z) using DBSCAN, but I do not know how to tune the parameters to get the right number of clusters to plot afterwards.
This is what the data looks like:
What are the right parameters for these data?
The dataset has 1,202,672 rows (1.202672e+06).
import pandas as pd
from sklearn.cluster import DBSCAN

data = pd.read_csv('datasets/full_dataset.csv')
# Fit on the position columns (assumed to be named X, Y, Z); labels_ gives
# one cluster id per star, with -1 marking noise
clusters = DBSCAN(eps=0.5, min_samples=40, metric="euclidean", algorithm="auto").fit(data[['X', 'Y', 'Z']])
labels = clusters.labels_
min_samples is arguably one of the tougher parameters to choose, but you can decide on it by just looking at the results and deciding how much noise you are okay with.
Choosing eps can be aided by running k-NN to understand the density distribution of your data; I believe the DBSCAN paper describes this in more detail. You can plot this in Python as well (in R it is kNNdistplot); see the sketch below.
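A minimal sketch of that k-distance plot with scikit-learn and matplotlib (X is assumed to be the (n_samples, 3) array of positions):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

k = 40                                           # match your intended min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X)      # X = the (n_samples, 3) positions
distances, _ = nn.kneighbors(X)

plt.plot(np.sort(distances[:, -1]))              # distance to the k-th neighbour
plt.xlabel('points sorted by k-NN distance')
plt.ylabel(f'{k}-NN distance')                   # the "elbow" suggests a value for eps
plt.show()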
I would prefer to use OPTICS, which essentially does all eps values simultaneously. However, I haven't found a decent implementation of it in either Python or R. In fact, there is an incorrect implementation in Python which doesn't follow the original OPTICS paper at all.
If you really want to use OPTICS, I recommend the Java implementation available in ELKI.
If anyone else has heard of a proper Python implementation, I'd love to hear about it.
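One option that has since appeared is sklearn.cluster.OPTICS in scikit-learn 0.21 and later; a minimal sketch (min_samples is a placeholder, X is the same position array as above):
from sklearn.cluster import OPTICS

# min_samples as in DBSCAN; no single eps is needed, since OPTICS sweeps them all
optics = OPTICS(min_samples=40).fit(X)           # X = the (n_samples, 3) positions
labels = optics.labels_                          # -1 marks noise, as in DBSCAN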
If you want to go the trial-and-error route, start with a much smaller eps and work up from there.
I'm currently working on a project using machine learning to determine whether a network flow is a botnet or benign flow. Of course in the process, I've been using different methods of data analysis, including visualization through self-organizing maps. I'm very new to the concept of SOMs, so please let me know if I'm making incorrect assumptions.
I've so far created self-organizing maps for a dataset with 6 dimensions using the SOMPY library: https://github.com/sevamoo/SOMPY
Essentially where I am stuck is labeling concentrations of botnet/benign flows within the map using this library. Finding trends with each dimension isn't very useful unless I can find the relationship between the clusters and types of flows.
So, is there any way of labeling SOMs using SOMPY where I can compare concentrations of flows to clusters in the other maps?
If SOMPY isn't sufficient, what other libraries would you suggest? Preferably Python, since I have more experience in that language.
Do you have labels for your data?
With labels: Use the classification ability of the SUSI package, which works like an improved majority vote.
Without labels: Look at the u-matrix of your data in the SUSI package, use its borders as cluster borders and look at the statistics of the different clusters.
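A minimal sketch of the labelled route, assuming the susi package's SOMClassifier API (the grid size, feature matrix X, and labels y are placeholders):
import susi

# X: (n_flows, 6) feature matrix; y: botnet/benign labels (both placeholders)
som = susi.SOMClassifier(n_rows=30, n_columns=30)
som.fit(X, y)
predicted = som.predict(X)                       # a label per flow from the trained map
u_matrix = som.get_u_matrix()                    # u-matrix for the unlabelled route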
For my master's thesis I am using a 3rd party program (SExtractor) in addition to a python pipeline to work with astronomical image data. SExtractor takes a configuration file with numerous parameters as input, which influences (after some intermediate steps) the statistics of my data. I've already spent way too much time playing around with the parameters, so I've looked a little bit into machine learning and have gained a very basic understanding.
What I am wondering now is: is it reasonable to use a machine learning algorithm to optimize the parameters of SExtractor when the only way to judge the performance or quality of a parameter set is the final statistics of the analysis run (which takes at least an hour on my machine), and more than 6 parameters influence those statistics?
As an example, I have included two different versions of the statistics I am referring to, made from slightly different sets of SExtractor parameters. The red line in the left image is the median of the standard deviation as it should be; the blue line is the median of the standard deviation as I get it. The right images display the differences between the objects in the two data sets.
I know this is a very specific question, but as I am new to machine learning I can't really judge if this is possible. It would be great if someone could tell me whether this is a pointless endeavor or point me in the right direction.
You can try an educated guess based on the data that you already have. You are trying to optimize the parameters such that the median of the standard deviation has the desired value. You could assume various models and try to estimate the parameters based on the model and the estimated data. But I think you need a good understanding of machine learning to do so; by 'good' I mean beyond an undergraduate course.
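One concrete way to frame the problem is black-box optimization of an expensive objective, for example with scikit-optimize's gp_minimize (my suggestion, not something prescribed above; write_config, run_pipeline, and target_median are hypothetical stand-ins for your pipeline):
from skopt import gp_minimize

def objective(params):
    # Write params into the SExtractor config, run the full pipeline
    # (roughly an hour per call), and score the resulting statistics;
    # write_config, run_pipeline, and target_median are hypothetical.
    write_config(params)
    stats = run_pipeline()
    return abs(stats.median_std - target_median)

# One (min, max) range per SExtractor parameter being tuned; 6 ranges as an example
result = gp_minimize(objective, dimensions=[(0.5, 5.0)] * 6, n_calls=25)
print(result.x, result.fun)                      # best parameter set and its score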
I'm trying to build and implement a regression tree algorithm on some raster data in python, and can't seem to find the best way to do so. I will attempt to explain what I'm trying to do:
My desired output is a raster image, whose values represent lake depth, call it depth.tif. I have a series of raster images, each representing the reflectance values in different Landsat bands, say [B1.tif, B2.tif, ..., B7.tif] that I want to use as my independent variables to predict lake depth.
For my training data, I have a shapefile of ~6000 points of known lake depth. To create a tree, I extracted the corresponding reflectance values for each of those points, then exported that to a table. I then used that table in weka, a machine-learning software, to create a 600-branch regression tree that would predict depth values based on the set of reflectance values. But because the tree is so large, I can't write it in Python manually. I ran into the python-weka-wrapper module so I can use weka in Python, but have gotten stuck with the raster part. Since my data has an extra dimension (if converted to an array, each independent variable is actually a set of ncolumns x nrows values instead of just a row of values, as in all of the examples), I don't know if it can do what I want. None of the examples for python-weka-wrapper deal with spatial data, and I think this is what's throwing me off.
To clarify, I want to use the training data (which is a point shapefile/table right now but can- if necessary- be converted into a raster of the same size as the reflectance rasters, with no data in all cells except for the few points I have known depth data at), to build a regression tree that will use the reflectance rasters to predict depth. Then I want to apply that tree to the same set of reflectance rasters, in order to obtain a raster of predicted depth values everywhere.
I realize this is confusing and I may not be doing the best job of explaining it. I am open to other options besides implementing weka in Python, such as sklearn, as long as they are open source. My question is: can what I described be done? I'm pretty sure it can, as it's very similar to image classification, except that the target values (depth) are continuous rather than discrete classes, but so far I have failed. If so, what is the most straightforward method, and are there any examples that might help?
Thanks
I have had some experience using LandSat Data for the prediction of environmental properties of soil, which seems to be somewhat related to the problem that you have described above. Although I developed my own models at the time, I could describe the general process that I went through in order to map the predicted data.
For the training data, I was able to extract the LandSat values (in addition to other properties) for the spatial points where known soil samples were taken. This way, I could use the LandSat data as inputs for predicting the environmental data. A part of this data was also reserved for testing, to confirm that the trained models were not overfitting to the training data and that they predicted the outputs well.
Once this process was completed, it was possible to map the desired area by extracting the same spatial information at every point of that area (matching the resolution of the desired image). From there, you should be able to feed these LandSat factors into the model and use the predicted output to map depth. You could likely just use Weka in this case to predict all of the cases, then use another tool to build the map from your estimates.
I believe I whipped up some code long ago to extract each of my required factors in ArcGIS, but it's been a while since I did this. There should be some good tutorials out there that could help you in that direction.
I hope this helps in your particular situation.
It sounds like you are not using any spatial information to build your tree (such as information on neighboring pixels), just reflectance. So, you can apply your decision tree to the pixels as if the pixels were all in a one-dimensional list or array.
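A minimal sketch of that with scikit-learn and rasterio (both are assumptions on my part; a weka tree would be applied the same way once the pixels are flattened, and X_train/y_train stand for the table extracted at the ~6000 known points):
import numpy as np
import rasterio
from sklearn.tree import DecisionTreeRegressor

# Stack the seven band rasters into one (n_pixels, n_bands) table
bands = []
for path in ['B1.tif', 'B2.tif', 'B3.tif', 'B4.tif', 'B5.tif', 'B6.tif', 'B7.tif']:
    with rasterio.open(path) as src:
        bands.append(src.read(1))
stack = np.stack(bands, axis=-1)                 # (rows, cols, 7)
X_all = stack.reshape(-1, stack.shape[-1])       # every pixel becomes one row

# X_train, y_train: reflectance values and known depths at the ~6000 points (assumed)
tree = DecisionTreeRegressor(min_samples_leaf=20).fit(X_train, y_train)

depth = tree.predict(X_all).reshape(stack.shape[:2])   # back to a raster grid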
A 600-branch tree for a 6000-point training data file seems like it may be overfit. Consider putting in an option that requires the tree to stop splitting when there are fewer than N points at a node, or something similar. There may be a pruning factor that can be set as well. You can test different settings until you find the one that gives you the best statistics from cross-validation or a held-out test set.
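In scikit-learn terms (an assumption; weka exposes similar stopping and pruning options), that tuning could look like:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Try several stopping points and keep the one with the best cross-validated score
search = GridSearchCV(
    DecisionTreeRegressor(),
    param_grid={'min_samples_leaf': [5, 10, 20, 50, 100]},
    cv=5,
)
search.fit(X_train, y_train)                     # the same training table as above
print(search.best_params_, search.best_score_)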