Clustering latitude/longitude points in Python with a fixed number of clusters

k-means does not work properly for geospatial coordinates, even when changing the distance function to haversine as stated here.
I had a look at DBSCAN, which doesn't let me set a fixed number of clusters.
Is there any algorithm (in Python, if possible) that has the same input values as k-means? Or can I easily convert latitude and longitude to Euclidean coordinates (x, y, z), as done here, and do the calculation on my data?
It does not have to be perfectly accurate, but it would be nice if it were.

Using just lat and long leads to problems when your geo data spans a large area, especially since one degree of longitude spans less distance near the poles. To account for this, it is good practice to first convert longitude and latitude to Cartesian coordinates.
If your geo data spans the United States, for example, you could define an origin, from which to calculate distances, at the center of the contiguous United States. I believe this is located at latitude 39°50' N and longitude 98°35' W.
To convert lat/long to Cartesian coordinates, calculate the distance using haversine from every location in your dataset to the defined origin. Again, I suggest latitude 39°50' N and longitude 98°35' W.
You can use haversine in Python to calculate these distances:
from haversine import haversine

# 39°50' N, 98°35' W expressed in decimal degrees
origin = (39.8333, -98.5833)
paris = (48.8567, 2.3508)

# distance from the origin in miles
haversine(origin, paris, miles=True)
Now you can use k-means on this data to cluster, assuming the haversine model of the earth is adequate for your needs. If you are doing data analysis and not planning on launching a satellite, I think this should be okay.
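As a rough sketch of that workflow (the helper below is made up for illustration; it measures east-west and north-south haversine offsets separately so the direction information a single origin distance would lose is kept, and on newer haversine releases miles=True becomes unit=Unit.MILES):

import numpy as np
from haversine import haversine
from sklearn.cluster import KMeans

origin = (39.8333, -98.5833)  # 39°50' N, 98°35' W in decimal degrees

def to_local_xy(lat, lon):
    # unsigned east-west and north-south distances from the origin, in miles
    x = haversine((origin[0], lon), origin, miles=True)
    y = haversine((lat, origin[1]), origin, miles=True)
    # haversine returns unsigned distances, so restore the direction
    return (np.sign(lon - origin[1]) * x, np.sign(lat - origin[0]) * y)

points = [(48.8567, 2.3508), (40.7128, -74.0060), (34.0522, -118.2437)]
xy = np.array([to_local_xy(lat, lon) for lat, lon in points])
labels = KMeans(n_clusters=2, n_init=10).fit_predict(xy)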

Have you tried k-means? The issue raised in the linked question seems to be with points that are close to 180 degrees of longitude. If your points are all close enough together (in the same city or country, for example), then k-means might work OK for you.
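For instance, a minimal sketch on made-up city-scale points, where raw degrees are a reasonable approximation:

import numpy as np
from sklearn.cluster import KMeans

pts = np.array([[40.71, -74.00], [40.73, -73.99], [40.76, -73.98],
                [40.65, -73.95], [40.69, -74.04], [40.77, -73.96]])
labels = KMeans(n_clusters=2, n_init=10).fit_predict(pts)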

Related

How do I find the density of a list of points given latitude and longitude in a 5 mile radius in Python Pandas?

I am trying to create a column containing a number that represents the density at each specific location within a 5-mile radius, i.e. whether there are many other locations near it or not. I would like to compare these locations with each other to achieve this.
I'm not familiar with the math needed to achieve this and have tried to find a solution for some time now.
OK, I'm not super clear on what your problem may be, but I will try to give you my approach.
Let's first assume that the area you are querying for points is small enough to be considered flat, so the geo coordinates of your area are basically Cartesian coordinates.
You choose your circle's center as (x, y), and then you have to find which of your points are within the radius of your circle: in Cartesian coordinates, being inside a circle means that the distance of a point from your center is smaller than the given radius. You save those points in your choice of data structure, and the density will probably be the number of those points divided by the area of the circle.
I hope I understood the problem correctly!
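If you would rather not rely on the flat-plane assumption, one alternative sketch counts neighbors within 5 miles directly with scikit-learn's BallTree and the haversine metric (the dataframe and column names are made up):

import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

df = pd.DataFrame({"lat": [40.7128, 40.7306, 40.6892, 34.0522],
                   "lon": [-74.0060, -73.9866, -74.0445, -118.2437]})

earth_radius_miles = 3958.8
coords_rad = np.radians(df[["lat", "lon"]].to_numpy())
tree = BallTree(coords_rad, metric="haversine")

# neighbors within 5 miles; each point counts itself, so subtract 1
counts = tree.query_radius(coords_rad, r=5 / earth_radius_miles, count_only=True)
df["density_5mi"] = counts - 1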

How to interpolate 5D data using scipy Nearest ND interpolator

I have a set of data with each point having 5 parameters ([latitude, longitude, time, wind speed, bearing]), and I want to interpolate this data.
I have implemented SciPy's NearestNDInterpolator based on what I read in the documentation, but the values at points outside the provided data points do not seem to be correct.
Implementation
interp = scipy.interpolate.NearestNDInterpolator(Windspeed_Data_Array[:, 0:3], Windspeed_Data_Array[:, 3:5])
Where "Windspeed_Data_Array[:,0:3]" is [latitude, longitude, time] and "Windpseed_Data_Array[:,3:5]" is [windspeed, bearing].
For example, I set the test coordinates to [-37.7276, 144.9066, 1483180200].
The raw data is shown below:
| latitude | longitude | time       | windspeed | bearing |
|----------|-----------|------------|-----------|---------|
| -37.7276 | 144.9066  | 1483174800 | 16.6      | 193     |
| -37.7276 | 144.9066  | 1483185600 | 14.8      | 184     |
I thought the output at the test coordinates should lie between the two data points shown; however, when I run the code:
test = interp(test_coords)
the output is windspeed = 16.6 and bearing = 193, which seems to be wrong.
That's the nature of the chosen interpolation.
Nearest interpolation will assign to the dependent variables the value found at the nearest sample. This is exactly the behavior shown in the NearestNDInterpolator documentation.
If you want a weighted average of multiple close neighbors, I would suggest you take a look at LinearNDInterpolator.
Note: don't be seduced by the "Nearest" word hehe
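A minimal contrast between the two interpolators on made-up 2-D data:

import numpy as np
from scipy.interpolate import NearestNDInterpolator, LinearNDInterpolator

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
vals = np.array([0.0, 10.0, 10.0, 20.0])

nearest = NearestNDInterpolator(pts, vals)
linear = LinearNDInterpolator(pts, vals)

print(nearest([[0.4, 0.4]]))  # [0.]  snaps to the nearest sample
print(linear([[0.4, 0.4]]))   # [8.]  weighted average of surrounding samples

One more caveat for this particular dataset: the time column is in epoch seconds, so it is many orders of magnitude larger than the latitude/longitude columns, and the nearest-neighbor distance is dominated entirely by it. Rescaling the inputs to comparable ranges before interpolating is usually worthwhile.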

What is wrong with my geopy great circle distance computation?

I want to compute the distance (in km) between two points, defined by their respective (lat, lon) coordinates, using the geopy library.
My code:
from geopy.distance import great_circle
# lat, lon
p1 = (45.8864, -7.2305)
p2 = (46.2045, -7.2305)
# distance in km
great_circle(p1, p2).km
>>> 35.371156132664765
To check the above results, I used the tool available at https://www.movable-type.co.uk/scripts/latlong.html, but the two outputs do not match.
The output of my code is 35.371156132664765, while the tool returns a distance of 15.41 km.
How come the results are different?
Your coordinates for the points p1 and p2 are wrong; you need to convert the minutes and seconds into degrees correctly. Otherwise the code works perfectly.
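In case the mismatch comes from mixing degrees-minutes-seconds with decimal notation, a small hypothetical conversion helper:

def dms_to_decimal(degrees, minutes=0.0, seconds=0.0):
    # convert degrees/minutes/seconds to decimal degrees
    # (pass a negative degrees value for south/west)
    sign = -1.0 if degrees < 0 else 1.0
    return sign * (abs(degrees) + minutes / 60.0 + seconds / 3600.0)

print(dms_to_decimal(45, 53, 11))  # 45.8864 (rounded), i.e. 45°53'11" N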

Solar Zenith Angle for many coordinates using PVLIB

I need to calculate the solar zenith angle for approximately 106,000,000 different coordinates. These coordinates correspond to the pixels of an image projected onto the Earth's surface after the image was taken by a camera on an airplane.
I am using pvlib.solarposition.get_solarposition() to calculate the solar zenith angle. The values returned are calculated correctly (I compared some results with the NOAA website) but, since I need to calculate the solar zenith angle for many coordinate pairs, Python is taking many days (about 5 days) to finish executing the function.
As I am a beginner in programming, I wonder if there is any way to accelerate the solar zenith angle calculation.
Below is the part of the code that calculates the solar zenith angle:
sol_apar_zen = []
for i in range(size3):
    # one pvlib call per coordinate pair: this is the slow part
    solar_position = np.array(pvl.solarposition.get_solarposition(Data_time_index, lat_long[i][0], lat_long[i][1]))
    sol_apar_zen.append(solar_position[0][0])
print(len(sol_apar_zen))
Technically, if you need to compute the solar zenith angle quickly for a large array, there are more efficient algorithms than pvlib's, for example the one described by Roberto Grena in 2012 (https://doi.org/10.1016/j.solener.2012.01.024).
I found a suitable implementation here: https://github.com/david-salac/Fast-SZA-and-SAA-computation (you might need some tweaks, but it's simple to use, and it's also implemented for languages other than Python, like C/C++ and Go).
Example of how to use it:
import pandas as pd
from sza_saa_grena import solar_zenith_and_azimuth_angle
# ...
# A random time series:
time_array = pd.date_range("2020/1/1", periods=87_600, freq="10T", tz="UTC")
sza, saa = solar_zenith_and_azimuth_angle(longitude=-0.12435,  # London longitude
                                          latitude=51.48728,   # London latitude
                                          time_utc=time_array)
The unit test (in the project's folder) shows that, within the normal latitude range, the error is minimal.
Since your coordinates represent a grid, another option would be to calculate the zenith angle for a subset of your coordinates and then do a 2-D interpolation to obtain the remainder. Taking 1 point in 100 in both directions would reduce your calculation time by a factor of 10,000.
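A minimal sketch of that subsample-then-interpolate idea, assuming the pixels form a regular lat/lon grid (the extents, timestamp, and grid sizes are made up):

import numpy as np
import pandas as pd
import pvlib
from scipy.interpolate import RegularGridInterpolator

when = pd.DatetimeIndex(["2017-01-01 00:30"], tz="UTC")

# coarse subset of the full pixel grid
lats = np.linspace(-38.0, -37.0, 11)
lons = np.linspace(144.0, 145.0, 11)
zen = np.array([[pvlib.solarposition.get_solarposition(when, la, lo)["apparent_zenith"].iloc[0]
                 for lo in lons] for la in lats])

# interpolate the coarse grid back onto the full resolution
interp = RegularGridInterpolator((lats, lons), zen)
fine_lats = np.linspace(-38.0, -37.0, 1001)
fine_lons = np.linspace(144.0, 145.0, 1001)
grid = np.stack(np.meshgrid(fine_lats, fine_lons, indexing="ij"), axis=-1)
zen_full = interp(grid)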
If you want to speed up this calculation you can use the numba implementation (if numba is installed):
location.get_solarposition(datetimes, method='nrel_numba')
Otherwise you have to implement your own calculation based on vectorized numpy arrays. I know it is possible, but I am not allowed to share the code. You can find the formulation if you search for Spencer (1971) solar position; a rough sketch follows below.
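As an illustration of what such a vectorized version could look like, here is a sketch built on Spencer's (1971) Fourier series for declination and the equation of time. It is an approximation (no refraction or pressure/temperature corrections), so validate it against pvlib before relying on it:

import numpy as np

def solar_zenith_spencer(doy, utc_hours, lat_deg, lon_deg):
    # day angle for day-of-year doy (all inputs broadcast as numpy arrays)
    g = 2.0 * np.pi * (doy - 1) / 365.0
    # solar declination in radians (Spencer 1971)
    decl = (0.006918 - 0.399912 * np.cos(g) + 0.070257 * np.sin(g)
            - 0.006758 * np.cos(2 * g) + 0.000907 * np.sin(2 * g)
            - 0.002697 * np.cos(3 * g) + 0.001480 * np.sin(3 * g))
    # equation of time in minutes (Spencer 1971)
    eot = 229.18 * (0.000075 + 0.001868 * np.cos(g) - 0.032077 * np.sin(g)
                    - 0.014615 * np.cos(2 * g) - 0.040890 * np.sin(2 * g))
    # local solar time in hours, then hour angle in radians
    solar_time = utc_hours + lon_deg / 15.0 + eot / 60.0
    hour_angle = np.radians(15.0 * (solar_time - 12.0))
    lat = np.radians(lat_deg)
    cos_zen = (np.sin(lat) * np.sin(decl)
               + np.cos(lat) * np.cos(decl) * np.cos(hour_angle))
    return np.degrees(np.arccos(np.clip(cos_zen, -1.0, 1.0)))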

DBSCAN not giving correct results on spatial transportation data

I am trying to form clusters from transportation data involving lat and long, but I am getting incorrect results: points with moderate distances between them are being classified into the same cluster.
slat - source latitudes
slong - source longitudes
import numpy as np
from sklearn.cluster import DBSCAN

coords = source[['slat', 'slong']].to_numpy()  # as_matrix() was removed from pandas
kms_per_radian = 6371.0088
epsilon = 2 / kms_per_radian  # 2 km expressed in radians
db = DBSCAN(eps=epsilon, min_samples=3, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))
cluster_labels = db.labels_
source['cluster'] = db.labels_  # source is the dataset
I plotted all the points in CartoDB and the clusters were not proper: locations more than 2 km apart ended up in the same clusters.
Could anyone please tell me how to do this better?
I followed the steps from clustering spatial data. I have not followed the steps for the cluster centre point, as I could not import the necessary library in Python. Is that the reason why I am not getting correct results?
In short, my aim is to replicate, for the source and destination latitudes and longitudes, what Grab did in this article:
Grab clustering rides
Please offer any insight on how to replicate it.
DBSCAN explicitly allows points to have distances larger than epsilon as long as they are density-connected. That is why it is called a density-based approach, and why you can get clusters of arbitrary shape.
It is a frequent misconception that any two points of a cluster must be at most epsilon apart. That property is called complete linkage and will give approximately spherical clusters.
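If what you actually want is a guarantee that no two points in a cluster are more than 2 km apart, complete linkage provides it. A sketch reusing the question's source dataframe (note that scikit-learn versions before 1.2 spell the metric parameter affinity, and a precomputed distance matrix costs O(n²) memory):

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import haversine_distances

coords_rad = np.radians(source[['slat', 'slong']].to_numpy())
dist_km = haversine_distances(coords_rad) * 6371.0088

# complete linkage merges clusters only while their farthest pair stays
# under the threshold, so no cluster exceeds a 2 km diameter
agg = AgglomerativeClustering(n_clusters=None, distance_threshold=2.0,
                              metric='precomputed', linkage='complete')
source['cluster'] = agg.fit_predict(dist_km)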
