How to label data in SOM using SOMPY library? - python

I'm currently working on a project using machine learning to determine whether a network flow is a botnet or benign flow. Of course in the process, I've been using different methods of data analysis, including visualization through self-organizing maps. I'm very new to the concept of SOMs, so please let me know if I'm making incorrect assumptions.
I've so far created self-organizing maps for a dataset with 6 dimensions using the SOMPY library: https://github.com/sevamoo/SOMPY
Essentially where I am stuck is labeling concentrations of botnet/benign flows within the map using this library. Finding trends with each dimension isn't very useful unless I can find the relationship between the clusters and types of flows.
So, is there any way of labeling SOMs using SOMPY where I can compare concentrations of flows to clusters in the other maps?
If SOMPY isn't sufficient, what other libraries would you suggest? Preferably Python, since I have more experience in that language.

Do you have labels for your data?
With labels: Use the classification ability of the SUSI package which works like a better majority vote.
Without labels: Look at the u-matrix of your data in the SUSI package, use its borders as cluster borders and look at the statistics of the different clusters.
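For the labelled case, the plain majority vote that the answer uses as its point of comparison can be sketched without any SOM library. This is only an illustration, not SOMPY's or SUSI's API: the `codebook` dictionary (node position to weight vector), the sample values, and the helper names are all hypothetical, and in practice the weight vectors would come from your trained map.

```python
from collections import Counter, defaultdict


def best_matching_unit(sample, codebook):
    """Return the (row, col) of the node whose weight vector is closest to sample."""
    best_node, best_dist = None, float("inf")
    for node, weights in codebook.items():
        dist = sum((s - w) ** 2 for s, w in zip(sample, weights))
        if dist < best_dist:
            best_node, best_dist = node, dist
    return best_node


def label_nodes(samples, labels, codebook):
    """Give every node that wins at least one sample its most frequent label."""
    votes = defaultdict(Counter)
    for sample, label in zip(samples, labels):
        votes[best_matching_unit(sample, codebook)][label] += 1
    return {node: counts.most_common(1)[0][0] for node, counts in votes.items()}


# Hypothetical 2x2 map with 2-dimensional weights (a trained SOM would supply these).
codebook = {
    (0, 0): [0.0, 0.0], (0, 1): [0.0, 1.0],
    (1, 0): [1.0, 0.0], (1, 1): [1.0, 1.0],
}
samples = [[0.1, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8], [0.1, 0.2]]
labels = ["benign", "benign", "botnet", "botnet", "botnet"]

node_labels = label_nodes(samples, labels, codebook)
print(node_labels)
```

The resulting node-to-label mapping can then be laid over the per-dimension maps to see which regions are dominated by botnet or benign flows. Nodes that never win a sample stay unlabelled.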

Related

Automatically remove datapoints to change dataset distribution

I have a dataset generator for machine learning purposes which unfortunately is producing a distribution as shown in the attached image. Note that I can't really change the data generation process very much.
This dataset is heavily skewed towards the centre. I would like to be able to fit this dataset to a more uniform distribution by removing some of those central datapoints. I am a beginner to programming and data science so I am unaware of any methods in which I can do this.
If anyone can point to a library or function I can utilise to achieve this it would be much appreciated. I am using python3 and my data is stored in a csv.
Thanks!
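One possible approach, sketched here under assumptions since the image is not available: bin the values into a histogram and keep at most a fixed number of points per bin, which thins out the crowded centre while leaving the sparse tails alone. The function name, the `n_bins` and `max_per_bin` parameters, and the single-column Gaussian test data are all hypothetical; for a CSV you would first read the relevant column into a list.

```python
import random
from collections import defaultdict


def flatten_distribution(values, n_bins=10, max_per_bin=None, seed=0):
    """Undersample values so no histogram bin keeps more than max_per_bin points."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    bins = defaultdict(list)
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        bins[idx].append(v)
    if max_per_bin is None:
        # Default: every non-empty bin is cut down to the size of the rarest one.
        max_per_bin = min(len(members) for members in bins.values())
    rng = random.Random(seed)
    kept = []
    for idx in sorted(bins):
        members = bins[idx]
        kept.extend(rng.sample(members, min(len(members), max_per_bin)))
    return kept


# Hypothetical centre-heavy data standing in for the generator output.
gen = random.Random(42)
values = [gen.gauss(0.0, 1.0) for _ in range(5000)]
flattened = flatten_distribution(values, n_bins=10, max_per_bin=50)
print(len(values), "->", len(flattened))
```

With the default `max_per_bin=None` the rarest bin sets the budget, which can discard almost everything if the tails are very thin, so an explicit cap is usually the more practical choice.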

Cluster analysis algorithm for identifying line clusters on a map

I have a reasonably large set of (r,g,b)-colored data points with (x,y)-coordinates that looks like this:
Before committing them to my database, I'd like to automatically identify all point clusters (most of which look like lines) and assign a category to each colored point according to which cluster it belongs to.
According to the scikit-learn roadmap I should be using either Meanshift or Gaussian mixture models, but I'd like to know if there is any solution available that will also take into account that nearby points that share similar colors are more likely to belong to the same cluster.
I have access to a GPU so any kind of solution is welcome, even if it's based on deep learning.
I tried @mcdowella's answer and it worked surprisingly well. I ran it over the higher-dimensional version of these points (which were generated through t-SNE) using the HDBSCAN Robust Single Linkage implementation, and it approximated many lines without any parameter tuning.
I would try https://en.wikipedia.org/wiki/Single-linkage_clustering - it has a tendency to follow lines that is sometimes even a disadvantage for people who want nice compact rounded clusters and get straggling spaghetti (nice picture on P7 of https://www.stat.cmu.edu/~cshalizi/350/lectures/08/lecture-08.pdf).
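To illustrate why single linkage follows lines, here is a minimal dependency-free sketch. It relies on the fact that cutting a single-linkage dendrogram at a given height yields the connected components of the graph that joins every pair of points within that distance. The function name, the `threshold` parameter, and the toy points are hypothetical, and the pairwise loop is quadratic, so a library implementation would be the better choice for a large point set.

```python
import math
from itertools import combinations


def single_linkage_clusters(points, threshold):
    """Return a cluster id per point: points chained by gaps <= threshold share an id."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= threshold:
            parent[find(i)] = find(j)

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(points))]


# Two hypothetical parallel line segments, 3 units apart, points spaced 1 unit along each.
line_a = [(float(x), 0.0) for x in range(5)]
line_b = [(float(x), 3.0) for x in range(5)]
cluster_ids = single_linkage_clusters(line_a + line_b, threshold=1.5)
print(cluster_ids)
```

Because `math.dist` accepts points of any dimension, one way to let color influence the grouping (an assumption, not something the answer proposes) would be to append suitably scaled r, g, b values to each coordinate tuple.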

Creating text-clusters that contain similar text

Recently I had worked on image clustering which found similar images and grouped them together. I had used python's skimage module to calculate SSIM and then cluster all images based on some threshold that was decided.
I want to do something similar for text. I want to automatically create clusters containing similar text. For example, cluster-1 could have all text that represents working mothers, cluster-2 could have all text representing people talking about food, and so on. I understand this has to be unsupervised learning. Are there similar Python modules that could help achieve this task? I also checked out Google's TensorFlow to see if I could get something from it, but did not find anything relating to text clustering in its documentation.
There are numerous ways you can approach the task. In most cases the clustering algorithms are very similar to image clustering but what you need to define is the distance metric - in this case semantic similarity metric of some kind.
For this purpose you can use the approaches I list in another question around the topic of semantic similarity (even if a bit more detailed).
One additional approach worth mentioning is the 'automatic clustering' provided by topic modelling tools like LSA, which you can run fairly easily using the gensim package.
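As a concrete illustration of the point about defining a similarity metric, the sketch below mirrors the SSIM-plus-threshold idea from the question, with cosine similarity over raw word counts standing in for SSIM. Word counts only capture shared vocabulary, not meaning, so this is a stand-in for the semantic metrics or LSA mentioned above. The function names, the 0.3 threshold, and the example sentences are all hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-cased word counts for one text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def threshold_clusters(texts, threshold=0.3):
    """Greedy grouping: each text joins the first cluster whose first member is similar enough."""
    vectors = [bag_of_words(t) for t in texts]
    clusters = []  # lists of indices into texts; element 0 is the cluster's seed
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine_similarity(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


texts = [
    "working mothers juggle jobs and kids",
    "mothers working long hours with kids at home",
    "the best pizza and pasta in town",
    "pasta recipes and pizza dough tips",
]
clusters = threshold_clusters(texts)
print(clusters)
```

Swapping `bag_of_words` and `cosine_similarity` for a semantic similarity measure changes what counts as "similar" without touching the grouping step.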

CFD work with Python

I am a meteorologist, and lately I have been investigating the possibility of building my own sondes.
In order to do that, I have the following work plan:
1. I would like to generate 3D models with pyformex. An alternative is OpenSCAD, but I will start with pyformex to generate simple cylindrical sonde shapes with associated extra features, e.g. an intake tube.
2. Next, I would like to split the model into meshes using PyDistMesh, and prepare a raytraced point cloud model with Xrt.
3. In the third step, I would like to perform the CFD work.
Now, my questions :
1. Are there other simple Python libraries for generating 3D models? I would like a very simple system where I can issue commands like p = Parallelogram(length, height, width) or p.position(x, y, z). It would be nice to have built-in mouse interaction, that is, a built-in drawing component that I can use to display the model and rotate, zoom, and pan with the mouse.
2. Are there any other mesh generation tools?
3. For the CFD step, I would need a multiphysics system. I tried OpenFOAM, but it is too large to hack through. I have looked at SU2, but it seems to focus more on aerospace engineering than on fluid dynamics (I would like to simulate the flight of the sonde, which is closer to aerospace engineering, as well as the state of the atmosphere). Fluidity seems to suit my needs better, but I can't find a Python fork of it. So is there a general-purpose, not too bloated multiphysics Python library for geophysical and general hydrodynamic simulations? I have also looked at MOOSE, but can't find a Python binding for it either.
4. Scientific visualization: are there visualization libraries for 3, 4, or possibly more dimensions? I would prefer to issue simple commands such as Plot rather than first generating a window or form and then putting the graphs on it, if possible.
Finally, and most importantly, if the same can be done in C++, Fortran, or some other language besides Java, I would also consider using those.
Have a look at http://freecadweb.org/. This seems to be under active development. It is a fairly complete open source CAD package written in python. I believe it also has tools for meshing.
For CFD, you might want to consider OpenFOAM - http://www.openfoam.com/. This is an open source CFD package with the obligatory steep learning curve. There seem to be some Python libraries available that link to it, but I'm not sure how actively they are maintained:
http://openfoamwiki.net/index.php/Contrib/PyFoam
http://pythonflu.wikidot.com/

Useful packages to create online prediction tool with Python and R (example provided)

I am building a Cox PH statistical model to predict the probability of relapse for breast cancer patients. I would like to create an online interface for patients or doctors to use that would allow them to input the relevant patient characteristics, and compute the probability of relapse. Here is a perfect example, albeit for prostate cancer:
http://nomograms.mskcc.org/Prostate/PostRadicalProstatectomy.aspx
My basic plan is to create the tool with python, and compute the probability with R based on the user's inputs and from my previously fitted Cox PH model. The tool would require both drop-down boxes and user-inputted numerical values. The problem is I've never done any web programming or GUI programming with Python. My experience is with the scientific programming side of Python (e.g. Pylab, etc). My questions are:
What relevant packages for Python and R will I need? From some Googling I've done it seems that RPy and Tkinter are clear choices.
How can I store the statistical model such that the tool doesn't have to compute the model from my data set every time someone uses it? In the case of a Cox PH model, this would require storing the baseline hazard and the model formula.
Any useful tips from your experience?
I really appreciate your help!
Basically you need to learn WebDev, which is a pretty massive topic. If you are serious about making this a webapp, Django is one of the easiest places to start, and it's also fantastically documented. So essentially my answer would be:
http://djangobook.com/en/2.0/
start reading.
Apart from using R through RPy or an equivalent, there are a number of survival analysis routines in the statsmodels (formerly scipy.stats.models) Python library. They are in the "sandbox" package though, meaning they aren't considered ready for production right now.
E.g. you have the Cox model of proportional hazard coded here.
See also this question on CrossValidated.
I suggest fitting your model in R and using the 'DynNom' package (the 'DNbuilder' function). It creates an interactive tool for your prediction model that you can share as a webpage without any web programming or GUI programming skills. This takes one line of code after fitting your model, for example:
library(survival)  # provides coxph(), Surv() and the lung dataset
fit1 <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung)
library(DynNom)
DNbuilder(fit1)
You can easily share it on your account in http://shinyapps.io/ or host it on your website (needs more effort).
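On the question of storing the model so that it is not refitted on every request, a minimal sketch: a Cox PH prediction only needs the coefficients and the baseline survival at the time points of interest, since S(t | x) = S0(t) ^ exp(beta . x). The numbers, field names, and function name below are hypothetical placeholders, not output from a real fit, and the sketch assumes the stored baseline refers to a patient whose covariates are all zero (some software reports a baseline at the covariate means instead, in which case the inputs must be centred first).

```python
import json
import math

# Hypothetical values for illustration only; real ones would be exported once
# from the fitted model and saved to a file or database.
stored_model = json.dumps({
    "coefficients": {"age": 0.02, "ph_ecog": 0.45},
    "baseline_survival": {"12": 0.80, "24": 0.60},  # S0(t) at t months
})


def relapse_probability(model_json, patient, months):
    """Probability of relapse by `months`, i.e. 1 - S0(t) ** exp(linear predictor)."""
    model = json.loads(model_json)
    linear_predictor = sum(
        coef * patient[name] for name, coef in model["coefficients"].items()
    )
    s0 = model["baseline_survival"][str(months)]
    return 1.0 - s0 ** math.exp(linear_predictor)


reference = relapse_probability(stored_model, {"age": 0, "ph_ecog": 0}, 12)
higher_risk = relapse_probability(stored_model, {"age": 0, "ph_ecog": 1}, 12)
print(round(reference, 3), round(higher_risk, 3))
```

A web front end would then only need to load this small JSON document and evaluate the formula for the values the user enters.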
