I have point data from deer surveys and would like to predict values for areas not surveyed based on vegetation type (Grass, Disturbed, Oak, Pine, Mixed, etc.).
So far I have dissolved my vegetation layer to combine adjacent polygons and used a spatial (intersect) join to combine this layer with my point data. I'm now trying to predict values for polygons with null values in the pop field (deer seen) and pop_avg field (pop / 3 survey nights) based on vegetation type (a text field). I'm not really sure what my next step should be, so any suggestions would be appreciated.
EDIT: Would I need to do the prediction analysis in a program such as R or Python and then bring the results back into ArcGIS for mapping?
I found out about the randomForest package and how it might help answer my question here: https://gis.stackexchange.com/questions/176195/predict-estimate-point-values-for-unsurveyed-areas-based-on-vegetation-type
From there I was finally able to arrive at a solution to my problem, which I answered here: Predict/estimate values using randomForest in R
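For reference, here is a minimal sketch of the same workflow in Python with scikit-learn's RandomForestRegressor rather than R's randomForest. The file and column names (veg_joined.csv, veg_type, pop_avg) are assumptions about how the joined attribute table was exported, not the actual field names.

```python
# Sketch: train on surveyed polygons, predict pop_avg for the unsurveyed ones.
# File and column names are assumed, not taken from the original data.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

polys = pd.read_csv("veg_joined.csv")                  # exported attribute table
polys["veg_code"] = polys["veg_type"].astype("category").cat.codes

surveyed = polys[polys["pop_avg"].notna()]
unsurveyed = polys[polys["pop_avg"].isna()]

model = RandomForestRegressor(n_estimators=500, random_state=0)
model.fit(surveyed[["veg_code"]], surveyed["pop_avg"])

# The predictions can be joined back onto the polygons in ArcGIS via an ID field.
polys.loc[unsurveyed.index, "pop_avg_pred"] = model.predict(unsurveyed[["veg_code"]])
polys.to_csv("veg_predicted.csv", index=False)
```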
I'm having trouble building even one functional machine learning model; the examples I've found around the web are either off topic or good but incomplete (missing dataset, explanations, ...).
The closest example related to my problem is this.
I'm trying to create a model based on accelerometer and gyroscope sensors, each with its own 3 axes. For example, if I lift the sensor parallel to gravity and then return it to its initial position, I should get a table like this.
Example
Now this whole table corresponds to one movement, which I call "Fade_away", and the duration of this movement is variable.
I have only two main questions:
In which format do I need to save my dataset? I don't think a plain array can hold this kind of variable-length data.
How can I implement a simple model with at least one hidden layer?
To make it easier, let's say that I have 3 outputs: "Fade_away", "Punch" and "Rainbow".
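For both questions, one common approach is to keep each movement as a 2-D array of shape (timesteps, 6) and pad the arrays to one fixed length so they all fit in a single tensor. Below is a minimal sketch along those lines using tf.keras; the library choice, the shapes and the random stand-in data are my assumptions, not something given in the question.

```python
# Sketch: pad variable-length accelerometer/gyroscope sequences to a fixed
# length and classify them with a single hidden layer. Shapes and data are
# illustrative only.
import numpy as np
import tensorflow as tf

N_CLASSES = 3          # "Fade_away", "Punch", "Rainbow"
MAX_LEN = 200          # longest movement, in samples (assumed)
N_CHANNELS = 6         # 3 accelerometer + 3 gyroscope axes

# Each movement is a (timesteps, 6) array; keep them in a Python list,
# then pad to a common length.
movements = [np.random.randn(np.random.randint(80, MAX_LEN), N_CHANNELS)
             for _ in range(100)]                      # fake data
labels = np.random.randint(0, N_CLASSES, size=len(movements))

X = tf.keras.preprocessing.sequence.pad_sequences(
    movements, maxlen=MAX_LEN, dtype="float32", padding="post")
y = tf.keras.utils.to_categorical(labels, N_CLASSES)

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(MAX_LEN, N_CHANNELS)),
    tf.keras.layers.Dense(64, activation="relu"),      # the one hidden layer
    tf.keras.layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=10, batch_size=16)
```

A recurrent layer (e.g. an LSTM) in place of Flatten would handle the time ordering more naturally, but the padded-plus-dense version above is the simplest thing that satisfies "one hidden layer".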
I am working on the classification of a 3D point cloud using several Python libraries (whitebox, PCL, PDAL). My goal is to classify the ground points. The dataset was already classified by a company, so I am using their classification as ground truth.
For the moment I am able to classify the ground: to do that, I stripped the existing classification from the dataset and redid the classification with PDAL. Now I'm at the stage of comparing the two datasets to see the quality of my classification.
I made a script that takes the XYZ coordinates of the two sets, puts them in lists, and compares them one by one. However, the dataset contains around 5 million points, and at the beginning it was processing about 5 points per minute; after a few minutes everything crashes. Can anyone give me tips? Here is a picture of my clouds: the set on the left is the ground truth and the one on the right is my classification.
Your problem is that you are not using any spatial data structure to speed up your point-proximity queries. There are several structures that mitigate this issue, such as the k-d tree and the octree.
By using such a structure you can discard a large portion of the unnecessary distance computations, thus improving performance.
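For instance, with SciPy's cKDTree you can query the nearest neighbour of every ground-truth point in one vectorized call instead of a Python loop. This is only a sketch: the file names and the assumption that the class label sits in the fourth column are mine.

```python
# Sketch: build a k-d tree on one cloud and query it with the other.
import numpy as np
from scipy.spatial import cKDTree

truth = np.loadtxt("ground_truth.txt")      # assumed columns: x, y, z, class
mine = np.loadtxt("my_classification.txt")

tree = cKDTree(mine[:, :3])
# One query for all ~5 million points instead of comparing lists one by one.
dist, idx = tree.query(truth[:, :3], k=1)

# Fraction of ground-truth points whose nearest counterpart has the same class.
matches = truth[:, 3] == mine[idx, 3]
print("agreement:", matches.mean())
```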
I have created a 4-cluster k-means customer segmentation in scikit-learn (Python). The idea is that every month the business gets an overview of the shifts in the size of each customer cluster.
My question is how to make these clusters 'durable'. If I rerun my script with updated data, the 'boundaries' of the clusters may shift slightly, but I want to keep the old clusters (even though they fit the new data slightly worse).
My guess is that there should be a way to extract the parameters that decide which case goes to which cluster, but I haven't found the solution yet.
Got the answer in a different topic:
Just record the cluster means. Then, when new data comes in, compare it to each mean and assign it to the cluster with the closest mean.
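A sketch of that idea in scikit-learn; X_initial and X_new below are random stand-ins for your preprocessed customer feature matrices.

```python
# Sketch: fit k-means once, persist the cluster centres, and reuse them every
# month so the segment definitions stay fixed.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

# Stand-in feature matrices; in practice these are your customer features.
X_initial = np.random.rand(500, 5)
X_new = np.random.rand(60, 5)

# Initial fit (done once): save the cluster centres.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_initial)
np.save("cluster_centers.npy", km.cluster_centers_)

# Each month: load the frozen centres and assign new data to the nearest one.
centers = np.load("cluster_centers.npy")
labels_new = pairwise_distances_argmin(X_new, centers)
```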
In accounting, the dataset representing the transactions is called a 'general ledger' and takes the following form:
Note that a 'journal', i.e. a transaction, consists of two line items. For example, transaction (journal number) 1 has two lines: the receipt of cash and the income. Companies can also have transactions (journals) consisting of 3 line items or even more.
Will I first need to cleanse the data so that each journal has only one row, i.e. collapse the above 8 rows into 4?
Are there any Python machine learning algorithms that will allow me to cluster the above data without further manipulation?
The aim of this is to detect anomalies in transaction data. I do not know what the anomalies look like, so this would need to be unsupervised learning.
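One possible route, sketched below: collapse the ledger to one row per journal with a few engineered features, then cluster the journals. The column names (journal_no, account, amount) are assumptions about the ledger layout, since the table itself is not shown.

```python
# Sketch: one row per journal with a few engineered features, then clustering.
# Column names are assumed; swap in the real field names from your ledger.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

ledger = pd.read_csv("general_ledger.csv")

journals = ledger.groupby("journal_no").agg(
    total_debit=("amount", lambda s: s[s > 0].sum()),  # sum of debit amounts
    n_lines=("amount", "size"),                        # line items per journal
    n_accounts=("account", "nunique"),                 # distinct accounts touched
)

X = StandardScaler().fit_transform(journals)
journals["cluster"] = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
# Very small clusters, or journals far from their cluster centre, are
# candidates for manual review as anomalies.
```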
Fit a Gaussian to each dimension of the data to determine what counts as an anomaly. The mean and variance are estimated per dimension, and if the density at a new data point's value on that dimension is below a threshold, the point is considered an outlier. This gives one Gaussian per dimension. You can also use some feature engineering here, rather than just fitting Gaussians to the raw data.
If features don't look Gaussian (plot their histograms), apply transformations like log(x) or sqrt(x) until they look better.
Use anomaly detection when supervised learning is not an option, or when you want to find new, previously unseen kinds of anomalies (such as the failure of a power plant, or someone acting suspiciously), rather than discriminate between known classes (such as whether someone is male or female).
Error analysis: what if p(x), the probability that an example is not an anomaly, is large for all examples? Add another dimension and hope it helps to expose the anomaly. You could create this dimension by combining some of the others.
To fit the Gaussian more closely to the shape of your data, you can make it multivariate. It then takes a mean vector and a covariance matrix, and you can vary these parameters to change its shape. It will also capture feature correlations, if your features are not all independent.
https://stats.stackexchange.com/questions/368618/multivariate-gaussian-distribution
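A sketch of both variants with NumPy/SciPy; X_train and X_new are random stand-ins for your numeric feature matrices, and the thresholds are values you would tune (e.g. against any labelled anomalies you have).

```python
# Sketch: per-dimension Gaussians (independent features) and a multivariate
# Gaussian (full covariance) as anomaly scores.
import numpy as np
from scipy.stats import norm, multivariate_normal

X_train = np.random.randn(1000, 3)                     # stand-in data
X_new = np.vstack([np.random.randn(5, 3), [[6.0, 6.0, 6.0]]])

mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# Independent Gaussians: p(x) is the product of the per-dimension densities.
p = norm.pdf(X_new, loc=mu, scale=sigma).prod(axis=1)
print("anomalies (independent):", np.where(p < 1e-4)[0])

# Multivariate version: a full covariance matrix captures feature correlations.
# Its densities are on a different scale, so the threshold is tuned separately.
mvn = multivariate_normal(mean=mu, cov=np.cov(X_train, rowvar=False))
print("anomalies (multivariate):", np.where(mvn.pdf(X_new) < 1e-4)[0])
```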
I have a dataset with m observations and p categorical (nominal) variables; each variable X1, X2, ..., Xp has several different classes (possible values). Ultimately I am looking for a way to find anomalies, i.e. to identify rows for which the combination of values seems incorrect with respect to the data seen so far. So far I have been thinking about building a model to predict the value of each column and then defining a metric to evaluate how different the actual row is from the predicted row. I would greatly appreciate any help!
Take a look at the nearest-neighbours method and cluster analysis. The metric can be simple (like squared error) or custom (with predefined weights for each category).
Nearest neighbours will answer the question 'how different is the current row from the other rows?' and cluster analysis will answer the question 'is it an outlier or not?'. Some visualization may also help (e.g. t-SNE).
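A sketch of the nearest-neighbour part with simple one-hot encoding; the toy data frame and column names below are made up for illustration.

```python
# Sketch: one-hot encode the categorical columns, then use the distance to the
# k-th nearest neighbour as an anomaly score (larger = more unusual combination).
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy stand-in for the m x p table of nominal variables.
df = pd.DataFrame({
    "X1": ["a", "a", "b", "a", "b", "c"],
    "X2": ["x", "x", "y", "x", "y", "x"],
})

X = pd.get_dummies(df).to_numpy(dtype=float)

nn = NearestNeighbors(n_neighbors=3).fit(X)
dist, _ = nn.kneighbors(X)
score = dist[:, -1]   # distance to the 3rd-nearest point (the point itself is the 1st)

print(df.assign(anomaly_score=score).sort_values("anomaly_score", ascending=False))
```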