I have a CSV file with points tagged by lat/long (~10K points). I'd like to search for all points within a given distance of a user-specified lat/long coordinate — say, for example, the centroid of Manhattan.
I'm pretty new to programming and databases, so this may be a basic question. If so, I apologize. Is it performant to do this search in pure Python without using a database? As in, could I simply read the CSV into memory and do the search with a Python script? If it is performant, would it scale well as the number of points increases?
Or is this simply infeasible in Python, and I need to investigate using a database that supports geospatial queries?
Additionally, how do I go about understanding the performance of these types of calculations so that I can develop a good intuition for this?
This is definitely possible in Python without any database. I would recommend using NumPy and doing the following:
Read all points from the CSV into a NumPy array
Calculate the distance of each point to your given point
Sort the distances, or simply find the nearest one using argmin
Because all calculations are vectorized, they happen at close to C speed.
With an okay computer, the I/O will take around 2-3 seconds and the calculation around 100-200 milliseconds or less.
In terms of math, you can try http://en.wikipedia.org/wiki/Haversine_formula
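A minimal sketch of those steps with NumPy (the file name, column names, and the 5 km search radius are my assumptions, not from the question):

import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat, lon, lat0, lon0):
    # Vectorized haversine distance (in km) from (lat0, lon0) to arrays lat/lon.
    lat, lon, lat0, lon0 = map(np.radians, (lat, lon, lat0, lon0))
    dlat, dlon = lat - lat0, lon - lon0
    a = np.sin(dlat / 2) ** 2 + np.cos(lat0) * np.cos(lat) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Assumed CSV layout: columns named "lat" and "lon"; (40.783, -73.971) is an
# approximate centroid of Manhattan.
data = np.genfromtxt("points.csv", delimiter=",", names=True)
d = haversine_km(data["lat"], data["lon"], 40.7831, -73.9712)
within_5km = data[d <= 5.0]     # all points within 5 km
nearest = data[np.argmin(d)]    # or just the single nearest point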
This is my first question, so I will do my best to be as clear as possible.
If I can provide UMAP with a distance function that also outputs a gradient or some other relevant information, can I apply UMAP to non-traditional-looking data? (I.e., a data set whose points have inconsistent dimensions, data points that are non-uniformly sized matrices, etc.) The closest I have gotten to finding something that looks vaguely close to my question is in the documentation here (https://umap-learn.readthedocs.io/en/latest/embedding_space.html), but this seems to be sort of the opposite process, and as far as I can tell still supposes you are starting with tuple-based data of uniform dimension.
I'm aware that one way around this is just to calculate a full pairwise distance matrix ahead of time and give that to UMAP, but from what I understand of the way UMAP is coded, it only performs a subset of all possible distance calculations, and is thus much faster for the same amount of data than if I were to take the full pre-calculation route.
I am working in python3, but if there is an implementation of UMAP dimension reduction in some other environment that permits this, I would be willing to make a detour in my workflow to obtain this greater flexibility with incoming data types.
Thank you.
Algorithmically this is quite possible, but in practice most implementations do not support anything other than fixed dimension vectors. If computing the all pairs distances is not tractable another option is to try to find a way to featurize or vectorize the data in a way that will allow for easy distance computations. This is, of course, not always possible. The final option is to implement things yourself, but this requires handling the nearest neighbour search, which is likely a non-trivial coding project in and of itself.
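For reference, the precomputed-distance-matrix fallback mentioned above looks roughly like this with umap-learn; the objects and the distance function here are placeholders for whatever custom comparison the variable-size data supports:

import numpy as np
import umap  # umap-learn

def my_distance(a, b):
    # Hypothetical distance between two variable-length vectors.
    n = min(len(a), len(b))
    return float(np.linalg.norm(a[:n] - b[:n]) + abs(len(a) - len(b)))

rng = np.random.default_rng(0)
objects = [rng.random(rng.integers(3, 8)) for _ in range(200)]

# Full pairwise distance matrix -- O(n^2) distance calls, which is exactly
# the cost the question is hoping to avoid.
n = len(objects)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = my_distance(objects[i], objects[j])

embedding = umap.UMAP(metric="precomputed").fit_transform(D)
print(embedding.shape)  # (200, 2)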
I want to find out what would be the most efficient algorithm for finding the tree with the most neighboring trees (say, within a 500-foot radius of it). We can use the dataset from https://data.cityofnewyork.us/Environment/2015-Street-Tree-Census-Tree-Data/pi5s-9p35 for this. I understand Python best, but any other language or just pseudocode will be fine too!
In the above dataset (exported as CSV from the link above), the columns of interest are tree_id, x_sp and y_sp (the x and y coordinates of each tree).
No geospatial or other specialized libraries please, as this is an algorithm how-to question.
I can't say which would be the most efficient, but one data structure you might use would be the k-d tree: https://en.wikipedia.org/wiki/K-d_tree (I think there is a Python example in that article).
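Since the question rules out libraries, here is a minimal pure-Python sketch of that k-d tree idea; the file name is hypothetical, and I'm assuming x_sp/y_sp are state-plane coordinates in feet, so a 500 ft radius is just 500.0:

import csv
import math

def build_kdtree(points, depth=0):
    # Recursively build a 2-d tree; a node is (point, left, right, axis).
    if not points:
        return None
    axis = depth % 2                      # alternate x (index 0) and y (index 1)
    points.sort(key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1),
            axis)

def count_within(node, center, radius):
    # Count stored points within `radius` of `center`, pruning subtrees the
    # query circle cannot reach.
    if node is None:
        return 0
    point, left, right, axis = node
    total = 1 if math.dist(point[:2], center) <= radius else 0
    diff = center[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    total += count_within(near, center, radius)
    if abs(diff) <= radius:               # circle crosses the splitting plane
        total += count_within(far, center, radius)
    return total

# Hypothetical file name for the census export.
with open("2015_Street_Tree_Census.csv", newline="") as f:
    rows = [r for r in csv.DictReader(f) if r["x_sp"] and r["y_sp"]]
trees = [(float(r["x_sp"]), float(r["y_sp"]), r["tree_id"]) for r in rows]

root = build_kdtree(list(trees))
# count_within also counts the tree itself, which shifts every count by one
# and so does not change which tree wins.
busiest = max(trees, key=lambda t: count_within(root, t[:2], 500.0))
print("tree with the most neighbours within 500 ft:", busiest[2])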
I would like to represent a bunch of particles (~100k grains), for which I have the position (including rotation) and a discretized level set function (i.e. the signed distance of every voxel from the surface). Due to the large sample, I'm searching for efficient solutions to visualize it.
I first went for VTK, using its Python interface, but I'm not really sure it's the best (and simplest) way to do it since, as far as I know, there is no direct implementation for getting an isosurface from a 3D data set. In the beginning I was thinking of using marching cubes, but then I would still have to use a threshold, or interpolate, in order to get the voxels that are on the surface and label them so they can be used by the marching cubes.
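As an aside, if each grain's level set is available as a 3D NumPy array, scikit-image's marching cubes (assuming that library is acceptable here) extracts the zero isosurface directly, without any manual thresholding or labeling:

import numpy as np
from skimage import measure

# Synthetic stand-in grain: signed distance to a sphere on a 64^3 voxel grid.
x, y, z = np.mgrid[-1:1:64j, -1:1:64j, -1:1:64j]
phi = np.sqrt(x**2 + y**2 + z**2) - 0.5

# Triangle mesh of the zero level set (the grain surface).
verts, faces, normals, values = measure.marching_cubes(phi, level=0.0)
print(verts.shape, faces.shape)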
Now I found Mayavi, which has a Python function
mlab.pipeline.iso_surface()
However, I did not find much documentation on it and was wondering how it behaves in terms of performance.
Does anyone have experience with these kinds of tools? Which would be the best solution (in terms of efficiency and, secondly, in terms of simplicity)? I do not know the VTK library, but if there is a huge difference in performance I can dig into it, even without the Python interface.
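For what it's worth, a minimal call looks something like the sketch below (again with a synthetic signed-distance volume standing in for one grain); I can't speak to how it scales to ~100k grains:

import numpy as np
from mayavi import mlab

x, y, z = np.mgrid[-1:1:64j, -1:1:64j, -1:1:64j]
phi = np.sqrt(x**2 + y**2 + z**2) - 0.5        # signed distance to a sphere

src = mlab.pipeline.scalar_field(phi)
mlab.pipeline.iso_surface(src, contours=[0.0])  # render the zero level set
mlab.show()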
Lowest/Highest Combined Surface(s)
I'm looking for a methodology (and preferably a software approach) to create what I'm calling the lowest (or highest) combined surface for a set of polygons.
Given a number of "surfaces" (3D polygons): if our input was these two polygons, which partially overlap and definitely intersect [input figure omitted], my lowest combined output would be these three polygons [output figure omitted], i.e. the two non-overlapping regions plus the overlap region taken from whichever surface is lower there.
We've gone through a variety of approaches and the best solution we could come up with involved applying a point grid to each polygon and performing calculations to return the lowest sets of points at each grid location. The problem is that the original geometry is lost in this approach which doesn't give us a working solution.
Background
I'm looking at a variety of "surfaces" that can be represented by 3D faces (CAD speak) or polygons and are usually distributed in a shapefile (.shp). When two surfaces interact, I'm interested in taking either the lowest combined or the highest combined surface. I'm able to do this in CAD by manually tracing out new polygons for the interaction zones, but once I get into more than a handful of surfaces this becomes too labor intensive.
The Current Approach
My current approach, which falls somewhere in the "terrible" category, is to generate a point cloud from each surface on a 1 m grid and then do a grid-cell-based comparison of the points.
I do this by using AutoCAD Civil 3D's surface generation tools to create a TIN from each polygon surface; that surface is then exported to a 1 m DEM file, which I believe is a gridded output format.
Next, each DEM file is brought into Global Mapper, where I generate a single point at the center of each "elevation grid cell". This data is then exported to a .csv file in which each point carries a variety of attributes, such as the name of the surface it came from and its altitude.
Once I have a set of CSV files, I run them through a Python script that exports the lowest point (and its associated attributes) at each grid cell. I do everything in UTM because the UTM grid is based on meters, which makes everything easier.
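For illustration, that lowest-point-per-cell script might look roughly like this with pandas; the column names are my assumptions, not the actual Global Mapper export fields:

import glob
import pandas as pd

# Assumed columns: "easting", "northing", "elevation", "surface".
frames = [pd.read_csv(path) for path in glob.glob("surfaces/*.csv")]
points = pd.concat(frames, ignore_index=True)

# Snap each point to its 1 m UTM grid cell.
points["cell_e"] = points["easting"].round().astype(int)
points["cell_n"] = points["northing"].round().astype(int)

# Keep the lowest point (and its attributes) in every cell;
# swap idxmin for idxmax to get the highest combined surface.
lowest = points.loc[points.groupby(["cell_e", "cell_n"])["elevation"].idxmin()]
lowest.to_csv("lowest_combined.csv", index=False)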
Lastly, we bring the point file back into Global Mapper, coloring each point by the surface it started from.
There are a variety of issues with this approach: sometimes things don't line up perfectly, and there is a fair amount of cleanup I have to do.
The edges also end up jagged, which is to be expected, since I've converted nice straight lines into a point cloud.
Alternatively, we came up with a similar approach in ArcGIS using the Surface Comparison tool; however, it had limitations similar to those we ran into with my approach.
What I'm looking for is a way to do this automatically with a variable number of inputs. I'm willing to use just about any tool to get this done, as it seems like it shouldn't be too difficult a process.
Software?
When I look at this problem from a programmer's point of view it looks rather straightforward, but I'm at a total loss as to how to proceed. I'm assuming Stack Overflow is the correct Stack Exchange site for this question, but if it should be somewhere else I'm happy to move it to a different exchange.
I wasn't sure whether something like Mathematica (with which I have zero experience) could handle this situation, or whether there is some fancy 3D math library in Python that could chop polygons up by how they interact and then give me the lowest for co-located polygons.
In any case I'm willing to try anything out, so if you have an idea of what tools and/or libraries I can use to do this, please share! I have to assume that there is SOMETHING out there that can handle this type of 3D geometric processing.
Thanks
EDIT
Because the commenters seem confused: I am not asking for code. I am asking for methodologies, libraries, supporting tools, or even software packages that can perform these operations. I plan to write the software to do this; however, I am hoping I don't need to pull out my trig books and write all these operations by hand. I have to assume there is somebody out there who has dealt with something similar before.
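To make the "chop polygons up by how they interact" idea concrete, here is a tiny 2D sketch with shapely (one candidate library); the coordinates and the constant per-surface elevations are made up, and real planar or TIN surfaces would need proper elevation interpolation inside the overlap:

from shapely.geometry import Polygon

a = Polygon([(0, 0), (4, 0), (4, 3), (0, 3)])   # footprint of surface A
b = Polygon([(2, 1), (6, 1), (6, 4), (2, 4)])   # footprint of surface B
elev = {"A": 10.0, "B": 12.0}                   # hypothetical elevations

overlap = a.intersection(b)
only_a = a.difference(b)
only_b = b.difference(a)

# Lowest combined surface: each exclusive region keeps its own surface,
# and the overlap goes to whichever surface is lower there.
pieces = [
    ("A", only_a),
    ("B", only_b),
    (min(elev, key=elev.get), overlap),
]
for name, geom in pieces:
    print(name, geom.wkt)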
Hi, I have been trying to implement the DBSCAN algorithm on Neo4j, but am running into serious performance bottlenecks. I'll describe the implementation, then ask for help.
I discretized the possible epsilon values and stored, on each node, a count of its neighbors under each discretization, so that all of the core nodes can be retrieved in one query.
START a = node(*)
WHERE a.rel<cutoff threshold>! >= {minp}
RETURN a
This part is fast; the part that isn't fast is the follow-up query:
START a = node({i})
SET a.label<cutoff threshold>_<minpoints> = {clust}
WITH a
MATCH a -[:'|'.join(<valid distance relations>)]- (x)
WHERE not(has(x.label<cutoff threshold>_<minpoints>))
WITH x
SET x.label<cutoff threshold>_<minpoints>={clust}
RETURN x
I then pick a core node to start from, and as long as there are still core node neighbors, run the above query to label their neighbors.
I think the problem is that my graph has very different levels of sparsity: starting from only weak similarity it is almost fully connected, with ~50M relations between ~10k nodes, whereas at strong similarity there are as few as ~20k relations between ~10k nodes (or fewer). No matter what, it is always REALLY slow. What is the best way for me to handle this? Is it to index on relationship type and starting node? I haven't been able to find any resources on this problem, and surprisingly there isn't already an implementation, since this is a pretty standard graph algorithm. I could use scikit-learn, but then I would be restricted to in-memory distance matrices only :(
What version of Neo4j did you try this with?
Up until 1.8, performance was not a design goal of Cypher (the focus was rather on the language itself).
Have a look at a recent snapshot (1.9-SNAP).
Also make sure that your hot dataset is not just being loaded from disk (otherwise you are measuring disk I/O), i.e. that your memory-mapped settings and your JVM heap are large enough.
You might also want to check out the GCR cache from Neo4j enterprise which has a smaller memory footprint.
What is the cardinality of count(x) in your query? If it is too small, you have too many small transactions going on. Depending on whether you run Python embedded or over REST, use a larger transaction scope or REST batch operations.
You're already using parameters, which is great. What is the variability of your rel-types?
Any chance to share your dataset/generator and the code with us (Neo4j) for performance testing on our side?
There are DBSCAN implementations around that use indexing. I don't know much about Neo4j, so I can't really tell whether your approach is efficient. What you might need to precompute is actually a sparse version of your graph, with only the edges that are within the epsilon threshold.
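If exporting an edge list is an option, that sparse precomputed graph can be fed straight to scikit-learn's DBSCAN; a small sketch with made-up (i, j, distance) edges, keeping only pairs within epsilon:

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import DBSCAN

# Made-up edge list; in practice this would be exported from the graph and
# contain only pairs whose distance is within the epsilon threshold.
edges = np.array([[0, 1, 0.2],
                  [1, 2, 0.3],
                  [3, 4, 0.1]])
n = 5
i, j, d = edges[:, 0].astype(int), edges[:, 1].astype(int), edges[:, 2]

# Symmetric sparse distance matrix; absent entries mean "not neighbors".
dist = csr_matrix((np.concatenate([d, d]),
                   (np.concatenate([i, j]), np.concatenate([j, i]))),
                  shape=(n, n))

labels = DBSCAN(eps=0.5, min_samples=2, metric="precomputed").fit_predict(dist)
print(labels)  # e.g. [0 0 0 1 1]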
What I'd like to point out is that you apparently have different densities in your data set, so you might want to use OPTICS instead, which is a variant of DBSCAN that does away with the epsilon parameter (and also doesn't need to distinguish "core" nodes, as every node is a core node for a certain epsilon). Do not use the Weka version (or the Weka-inspired Python version that is floating around); they are half OPTICS and half DBSCAN.
When you have efficient sorted updatable heaps available, OPTICS can be pretty fast.
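As an aside, if an in-memory run over the node feature vectors is ever acceptable, scikit-learn now ships its own OPTICS implementation (distinct from the Weka-style ones warned about above); a minimal sketch on synthetic data:

import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.2, (100, 5)),   # dense blob
               rng.normal(3.0, 1.0, (100, 5))])  # sparser blob

# max_eps only bounds the neighborhood search; no single epsilon is chosen.
clust = OPTICS(min_samples=10, max_eps=5.0).fit(X)
print(np.unique(clust.labels_))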