I'm looking for some help with OSRM.
First of all, I'm using PyCharm and installed the package osrm, which is a wrapper around the OSRM API.
I have a list of GPS points, but some of them are noisy, so I want to match all of them to the road they belong to (see picture).
I tried the match-function as mentioned in the documentation with the following code:
import osrm
import polyline

chunk_size = 99
coordinates = [coordinates[i:i + chunk_size] for i in range(0, len(coordinates), chunk_size)]
osrm.RequestConfig.host = "https://router.project-osrm.org"
coord = []
for i in coordinates:
    result = osrm.match(i, steps=False, overview="simplified")
    #print(result)
    routes = polyline.decode(result['matchings'][0]['geometry'], geojson=True)
    #print(routes)
    for j in routes:
        coord.append((j[0], j[1]))
First question: Is this the right way of doing this, and is it possible to plot the result right away?
After that I put these coord into a DataFrame to plot them:
import pandas as pd
import plotly.express as px

df = pd.DataFrame(coord, columns=['Lon', 'Lat'])  # Lat=51.xxx Lon=12.xxx
print(df)
df = df.sort_values(by=['Lat'])
fig_route = px.line_mapbox(df, lon=df['Lon'], lat=df['Lat'],
                           width=1200, height=900, zoom=13)
fig_route.update_layout(mapbox_style='open-street-map')
fig_route.update_layout(margin={'r': 0, 't': 0, 'l': 0, 'b': 0})
fig_route.show()
And if I do that the following happens:
These points might have been matched, but look at the whole plot:
This whole plot is a mess :D
And second question:
I've got the feeling that it takes too long to run the whole code for such a "little task". The whole thing takes roughly 10 seconds: reading the points from Excel into a parquet file (7,800 GPS points), putting them into a list, removing duplicates with list(set()), and doing the request. Are these 10 seconds okay, or is there a mistake in my code?
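A rough timing sketch I could use to see which stage dominates (placeholder file and column names, not my actual ones):
import time
import pandas as pd

t0 = time.perf_counter()
df = pd.read_excel('gps_points.xlsx')               # placeholder file name
t1 = time.perf_counter()
coordinates = list(set(zip(df['Lon'], df['Lat'])))  # dedupe, placeholder columns
t2 = time.perf_counter()
# ... chunked osrm.match() requests as in the code above ...
t3 = time.perf_counter()
print(f'read: {t1 - t0:.2f}s  dedupe: {t2 - t1:.2f}s  requests: {t3 - t2:.2f}s')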
Thank you in advance for the help!
As I am starting to build some basic plotting methods for 3D visualization with VTK for data visualization, I ran into the following issue:
My dataset is usually about 200e6-1000e6 data points (sensor values) with their corresponding coordinates, points, (X, Y, Z).
My visualization method works fine, but there is at least one bottleneck. Besides the rest of the code, the example shown below with the two for loops is the most time-consuming part of the whole method.
I am not happy with adding the coordinates (points, numpy (n,3)) and sensor values (intensity, numpy (n,1)) to the VTK objects via for loops.
The specific code example:
vtkpoints = vtk.vtkPoints()  # https://vtk.org/doc/nightly/html/classvtkPoints.html
vtkpoints.SetNumberOfPoints(self.points.shape[0])

# Bottleneck - Faster way?
self.start = time.time()
for i in range(self.points.shape[0]):
    vtkpoints.SetPoint(i, self.points[i])
self.vtkpoly = vtk.vtkPolyData()  # https://vtk.org/doc/nightly/html/classvtkPolyData.html
self.vtkpoly.SetPoints(vtkpoints)
self.elapsed_time_normal = (time.time() - self.start)
print(f" AddPoints took : {self.elapsed_time_normal}")

# Bottleneck - Faster way?
vtkcells = vtk.vtkCellArray()  # https://vtk.org/doc/nightly/html/classvtkCellArray.html
self.start = time.time()
for i in range(self.points.shape[0]):
    vtkcells.InsertNextCell(1)
    vtkcells.InsertCellPoint(i)
self.elapsed_time_normal = (time.time() - self.start)
print(f" AddCells took : {self.elapsed_time_normal}")

# Inserts Cells to vtkpoly
self.vtkpoly.SetVerts(vtkcells)
Times:
Convert DataFrame took: 6.499739646911621
AddPoints took : 58.41245102882385
AddCells took : 48.29743027687073
LookUpTable took : 0.7522616386413574
All input data is of type int; it's basically a DataFrame converted to VTK arrays by the numpy_to_vtk method.
I would be very happy if someone has an idea for speeding this up.
BR
Bastian
For the first loop, @Mathieu Westphal gave me a nice solution:
vtkpoints = vtk.vtkPoints()
vtkpoints.SetData(numpy_to_vtk(self.points))
self.vtkpoly = vtk.vtkPolyData()
self.vtkpoly.SetPoints(vtkpoints)
New Time to Add Points:
AddPoints took : 0.03202845573425293
For the second loop it took me a bit longer to find a solution, but with some hints from the VTK community I got it running too. Here's the code I used to set the verts of the vtkpoly:
vtkcells = vtk.vtkCellArray()
cells_array_init = np.arange(len(self.points)).reshape(len(self.points), 1)
cells_array_set = np.repeat(cells_array_init, repeats=3, axis=1)
cells_npy = np.column_stack([np.full(len(self.points), 3, dtype=np.int64),
                             cells_array_set.astype(np.int64)]).ravel()
vtkcells.SetCells(len(self.points), numpy_to_vtkIdTypeArray(cells_npy))
# Inserts Cells to vtkpoly
self.vtkpoly.SetVerts(vtkcells)
New Time to Add Cells/Verts:
AddCells took : 2.73202845573425293
(17-18 x faster)
I needed to do some numpy work to get a "cell array" into the right shape, (3, 0, 0, 0, ..., 3, n-1, n-1, n-1) with n = len(points), and convert it with numpy_to_vtkIdTypeArray (a quick check of this layout for a small n is shown below).
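For example, this is what the array looks like for a hypothetical n = 3 (just a sanity check of the construction above):
import numpy as np

n = 3
init = np.arange(n).reshape(n, 1)
cells_npy = np.column_stack([np.full(n, 3, dtype=np.int64),
                             np.repeat(init, repeats=3, axis=1).astype(np.int64)]).ravel()
print(cells_npy)  # [3 0 0 0 3 1 1 1 3 2 2 2]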
Pretty sure this can be improved too, but I am pretty happy for now. If someone has a quick idea to speed this up even more, I'd love to hear it!
BR
So I'm comparing NBA betting lines between different sportsbooks over time
Procedure:
Open pickle file of scraped data
Plot the scraped data
The pickle file is a dictionary of NBA betting lines over time. Each of the two teams has its own nested dictionary. Each key in these team-specific dictionaries represents a different sportsbook. The values for these sportsbook keys are lists of tuples, representing time-series data. It looks roughly like this:
dicto = {
    'Time': <time that the game starts>,
    'Team1': {
        Market1: [(time1, value1), (time2, value2), etc...],
        Market2: [(time1, value1), (time2, value2), etc...],
        etc...
    },
    'Team2': {
        <SAME FORM AS TEAM1>
    }
}
There are no issues with scraping or manipulating this data. The issue comes when I plot it. Here is the code for the script that unpickles and plots these dictionaries:
import matplotlib.pyplot as plt
import pickle, datetime, os, time, re

IMAGEPATH = 'Images'
reg = re.compile(r'[A-Z]+#[A-Z]+[0-9|-]+')
noDate = re.compile(r'[A-Z]+#[A-Z]+')

# Turn 1 into '01'
def zeroPad(num):
    if num < 10:
        return '0' + str(num)
    else:
        return num

# Turn list of time-series tuples into an x list and y list
def unzip(lst):
    x = []
    y = []
    for i in lst:
        x.append(f'{i[0].hour}:{zeroPad(i[0].minute)}')
        y.append(i[1])
    return x, y

# Make exactly 5, evenly spaced xticks
def prune(xticks):
    last = len(xticks)
    first = 0
    mid = int(len(xticks) / 2) - 1
    upMid = int(mid + (last - mid) / 2)
    downMid = int((mid - first) / 2)
    out = []
    count = 0
    for i in xticks:
        if count in [last, first, mid, upMid, downMid]:
            out.append(i)
        else:
            out.append('')
        count += 1
    return out

def plot(filename, choice):
    IMAGEPATH = 'Images'
    IMAGEPATH = os.path.join(IMAGEPATH, choice)
    with open(filename, 'rb') as pik:
        dicto = pickle.load(pik)
    fig, axs = plt.subplots(2)
    gameID = noDate.search(filename).group(0)
    tm = dicto['Time']
    fig.suptitle(gameID + '\n' + str(tm))
    i = 0
    for team in dicto.keys():
        axs[i].set_title(team)
        if team == 'Time':
            continue
        for market in dicto[team].keys():
            lst = dicto[team][market]
            x, y = unzip(lst)
            axs[i].plot(x, y, label=market)
            axs[i].set_xticks(prune(x))
            axs[i].set_xticklabels(rotation=45, labels=x)
        i += 1
    plt.tight_layout()
    # Finish
    outputFile = reg.search(filename).group(0)
    date = (datetime.datetime.today() - datetime.timedelta(hours=6)).date()
    fig.savefig(os.path.join(IMAGEPATH, str(date), f'{outputFile}.png'))
    plt.close()
Here is the image that results from calling the plot function on one of the dictionaries that I described above. It is pretty much exactly as I intended it, except for one very strange and bothersome problem.
You will notice that the bottom right tick looks haunted, demonic, jpeggy, whatever you want to call it. I am highly suspicious that this problem occurs in the prune function, which I use to set the xtick values of the plot.
The reason that I prune the values with a function like this is because these dictionaries are continuously updated, so setting a static number of xticks would not work. And if I don't prune the xticks, they end up becoming unreadable due to overlapping one another.
I am quite confused as to what could cause an xtick to look like this. It happens consistently, for every dictionary, every time. Before I added the prune function (when the xticks were unbounded and overlapping one another), this issue did not occur. So when I say I'm suspicious that the prune function is the cause, I am really quite certain.
I will be happy to share an instance of one of these dictionaries, but they are saved as .pickle files, and I'm pretty sure it's bad practice to share pickle files over the internet. I have been warned about potential malware, so I'll just stay away from that. But if you need to see the dictionary, I can take the time to prettily print one and share a screenshot. Any help is greatly appreciated!
Matplotlib does this when there are many xticks or yticks that are plotted on the same value. It is normal. If you can limit the number of times that specific value is plotted, you can make it appear indistinguishable from the rest of the xticks.
Plot a simple example to test this out and you will see for yourself.
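For instance, a minimal sketch along these lines (made-up times, mirroring the prune() pattern of replacing most labels with '') reproduces the effect:
import matplotlib.pyplot as plt

# Plotting against string x values makes the axis categorical. The empty
# string then appears many times in the tick list, every '' maps to the same
# category position, and the full label list gets drawn stacked on that one
# position, which is what produces the smudged-looking tick.
x = [f'12:{m:02d}' for m in range(60)]
y = list(range(60))

fig, ax = plt.subplots()
ax.plot(x, y)
pruned = [lab if i % 15 == 0 else '' for i, lab in enumerate(x)]
ax.set_xticks(pruned)
ax.set_xticklabels(x, rotation=45)
plt.show()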
I've encountered something very strange when passing a function which generates an NdOverlay of Points to a DynamicMap, where the function is tied to Panel widgets (I don't think the Panel widgets are important).
The below code is a working example which produces the expected behavior. Whenever you change the widget values a new plot is generated with two sets of Points overlaid, with different colors and respective legend entries. Image shown below code.
import numpy as np
import holoviews as hv
import panel as pn

a_widget = pn.widgets.Select(name='A', options=[1, 2, 3, 4])
b_widget = pn.widgets.IntSlider(name='B', start=10, end=20, value=10)
widget_box = pn.WidgetBox(a_widget, b_widget, align='center')

@pn.depends(a=a_widget.param.value, b=b_widget.param.value)
def get_points(a, b):
    return hv.NdOverlay({x: hv.Points(np.random.rand(10, 10)) for x in range(1, 3)})

points = hv.DynamicMap(get_points)
pn.Row(widget_box, points)
The second example, shown below, is meant to demonstrate that in certain situations you might want to simply return an empty plot. The way I've done it here is the same as in this example: http://holoviews.org/gallery/demos/bokeh/box_draw_roi_editor.html#bokeh-gallery-box-draw-roi-editor
The result of this code is an empty plot as expected when a == 1, but when a has values other than 1, the result is quite strange as illustrated in the image below the code.
The points all have the same color
When changing the slider, for instance, some points are frozen and never change, which is not the case in the working example above.
a_widget = pn.widgets.Select(name='A', options=[1, 2, 3, 4])
b_widget = pn.widgets.IntSlider(name='B', start=10, end=20, value=10)
widget_box = pn.WidgetBox(a_widget, b_widget, align='center')

@pn.depends(a=a_widget.param.value, b=b_widget.param.value)
def get_points(a, b):
    if a == 1:
        return hv.NdOverlay({None: hv.Points([])})
    else:
        return hv.NdOverlay({x: hv.Points(np.random.rand(10, 10)) for x in range(1, 3)})

points = hv.DynamicMap(get_points)
pn.Row(widget_box, points)
While I cannot help with the observed NdOverlay issue, creating plots with or without content can be done with the help of Overlay.
As b_widget is never used in your code, I removed it for simplicity.
a_widget = pn.widgets.Select(name='A', options=[1, 2, 3, 4])
widget_box = pn.WidgetBox(a_widget, align='center')

@pn.depends(a=a_widget.param.value)
def get_points(a):
    images = []
    if a == 3:
        images.append(hv.Points(np.random.rand(10, 10), label='None'))
    else:
        for x in range(1, 3):
            images.append(hv.Points(np.random.rand(10, 10), label=str(x)))
    return hv.Overlay(images)

points = hv.DynamicMap(get_points)
pn.Row(widget_box, points)
The way NdOverlay is used in the documentation for NdOverlay differs from your approach, which might be a reason for the observed problems.
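For reference, a rough sketch of that documented pattern (the key dimension name and data here are illustrative, not taken from your code):
import numpy as np
import holoviews as hv

# The dict keys are values of a declared key dimension, and the same set of
# keys is produced on every update.
overlay = hv.NdOverlay(
    {i: hv.Points(np.random.rand(10, 2)) for i in range(3)},
    kdims='Group',
)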
Anyway, to narrow down which part of the code is responsible for the observed issue, I removed all code that is not necessary to reproduce it.
For clarity, I renamed the values of a, and I also made sure that a start value for a is provided.
While testing the code, it turned out that the if-else statement is not important either, so I removed that too.
And just to make sure that the variables behave as expected, I added some print statements.
This gives the following minimal reproducible example:
a_widget = pn.widgets.Select(name='A', value='Test', options=['Test', 'Test1', 'Test2'])

@pn.depends(a=a_widget.param.value)
def get_points(a):
    dict_ = {}
    dict_[str(a)] = hv.Points(np.random.rand(10, 10))
    print(dict_)
    overlay = hv.NdOverlay(dict_)
    print(overlay)
    return overlay

points = hv.DynamicMap(get_points)

# using the server approach here to see the output of the print statements
app = pn.Row(a_widget, points)
app.app()
When running this code and choosing the different options in the select widget, it turns out that the option Test is not updated once one of the options Test1 and Test2 has been chosen.
When we change the default value in the first line like this
a_widget = pn.widgets.Select(name='A', value='Test2', options=['Test','Test1', 'Test2'])
now Test2 is not updated correctly.
So it looks like this is an issue of DynamicMap using NdOverlay.
So I suggest you report this issue to the developers (if not already done), and either wait for a new release or use a different approach (e.g. as shown above).
I have this data frame df1,
id lat_long
400743 2504043 (175.0976323, -41.1141412)
43203 1533418 (173.976683, -35.2235338)
463952 3805508 (174.6947496, -36.7437555)
1054906 3144009 (168.0105269, -46.36193)
214474 3030933 (174.6311167, -36.867717)
1008802 2814248 (169.3183615, -45.1859095)
988706 3245376 (171.2338968, -44.3884099)
492345 3085310 (174.740957, -36.8893026)
416106 3794301 (174.0106383, -35.3876921)
937313 3114127 (174.8436185, -37.80499)
I have constructed the tree for search here,
def construct_geopoints(s):
    data_geopoints = [tuple(x) for x in s[['longitude', 'latitude']].to_records(index=False)]
    tree = KDTree(data_geopoints, distance_metric='Arc', radius=pysal.cg.RADIUS_EARTH_KM)
    return tree

tree = construct_geopoints(actualdata)
Now, I am trying to search all the geopoints which are within 1KM of every geopoint in my data frame df1. Here is how I am doing,
dfs = []
for name, group in df1.groupby(np.arange(len(df1)) // 10000):
    s = group.reset_index(drop=True).copy()
    pts = list(s['lat_long'])
    neighbours = tree.query_ball_point(pts, 1)
    s['neighbours'] = pd.Series(neighbours)
    dfs.append(s)
output = pd.concat(dfs, axis=0)
Everything here works fine; however, I am trying to parallelise this task, since my df1 has 2M records and this process runs for more than 8 hours. Can anyone help me with this? Another thing: the result returned by query_ball_point is a list, so it throws a memory error when I process it for this huge number of records. Is there any way to handle this?
EDIT: Memory issue, look at the VIRT size.
It should be possible to parallelize your last segment of code with something like this:
from multiprocessing import Pool
...
def process_group(group):
    s = group[1].reset_index(drop=True)  # .copy() is implicit
    pts = list(s['lat_long'])
    neighbours = tree.query_ball_point(pts, 1)
    s['neighbours'] = pd.Series(neighbours)
    return s

groups = df1.groupby(np.arange(len(df1)) // 10000)
p = Pool(5)
dfs = p.map(process_group, groups)
output = pd.concat(dfs, axis=0)
But watch out, because the multiprocessing library pickles all the data on its way to and from the workers, and that can add a lot of overhead for data-intensive tasks, possibly cancelling the savings due to parallel processing.
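One hedged way to keep that overhead down (a sketch only; it assumes a fork-based start method so the module-level tree is inherited by the workers) is to send just the coordinate lists to the pool instead of whole DataFrame chunks:
def query_chunk(pts):
    # Workers only receive and return the lists, not the DataFrame chunks.
    return tree.query_ball_point(pts, 1)

chunks = [list(g['lat_long']) for _, g in df1.groupby(np.arange(len(df1)) // 10000)]
with Pool(5) as p:
    neighbour_lists = p.map(query_chunk, chunks)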
I can't see where you'd be getting out-of-memory errors from. 8 million records is not that much for pandas. Maybe if your searches are producing hundreds of matches per row that could be a problem. If you say more about that I might be able to give some more advice.
It also sounds like pysal may be taking longer than necessary to do this. You might be able to get better performance by using GeoPandas or by "rolling your own" solution like this (a rough sketch follows the list):
assign each point to a surrounding 1-km grid cell (e.g., calculate UTM coordinates x and y, then create columns cx=x//1000 and cy=y//1000);
create an index on the grid cell coordinates cx and cy (e.g., df=df.set_index(['cx', 'cy']));
for each point, find the points in the 9 surrounding cells; you can select these directly from the index via df.loc[[(cx-1,cy-1),(cx-1,cy),(cx-1,cy+1),(cx,cy-1),...(cx+1,cy+1)], :];
filter the points you just selected to find the ones within 1 km.
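Here is a rough sketch of that grid-cell idea (assumptions: the frame already has projected x/y coordinates in metres, e.g. UTM as in step 1, and the column names are illustrative):
import numpy as np
import pandas as pd

def neighbours_within_1km(df):
    # 1-km grid cell for every point
    df = df.copy()
    df['cx'] = (df['x'] // 1000).astype(int)
    df['cy'] = (df['y'] // 1000).astype(int)
    # Index the points by grid cell
    cell_index = {key: grp for key, grp in df.groupby(['cx', 'cy'])}

    result = []
    for _, row in df.iterrows():
        # Candidate points from the 3x3 block of surrounding cells
        candidates = [cell_index[(row['cx'] + dx, row['cy'] + dy)]
                      for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                      if (row['cx'] + dx, row['cy'] + dy) in cell_index]
        cand = pd.concat(candidates)
        # Exact distance filter (the point itself is included, as with query_ball_point)
        dist = np.hypot(cand['x'] - row['x'], cand['y'] - row['y'])
        result.append(cand.index[dist <= 1000].tolist())
    return result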
I have a polygon shapefile of the U.S. made up of individual states as their attribute values. In addition, I have arrays storing latitude and longitude values of point events that I am also interested in. Essentially, I would like to 'spatial join' the points and polygons (or perform a check to see which polygon [i.e., state] each point is in), then sum the number of points in each state to find out which state has the most number of 'events'.
I believe the pseudocode would be something like:
Read in US.shp
Read in lat/lon points of events
Loop through each state in the shapefile and find number of points in each state
print 'Here is a list of the number of points in each state: '
Any libraries or syntax would be greatly appreciated.
Based on what I can tell, the OGR library is what I need, but I am having trouble with the syntax:
dsPolygons = ogr.Open('US.shp')
polygonsLayer = dsPolygons.GetLayer()

#Iterating all the polygons
polygonFeature = polygonsLayer.GetNextFeature()
k = 0
while polygonFeature:
    k = k + 1
    print "processing " + polygonFeature.GetField("STATE") + "-" + str(k) + " of " + str(polygonsLayer.GetFeatureCount())
    geometry = polygonFeature.GetGeometryRef()

    #Read in some points?
    geomcol = ogr.Geometry(ogr.wkbGeometryCollection)
    point = ogr.Geometry(ogr.wkbPoint)
    point.AddPoint(-122.33, 47.09)
    point.AddPoint(-110.11, 33.33)
    #geomcol.AddGeometry(point)
    print point.ExportToWkt()
    print point

    numCounts = 0.0
    while pointFeature:
        if pointFeature.GetGeometryRef().Within(geometry):
            numCounts = numCounts + 1
        pointFeature = pointsLayer.GetNextFeature()
    polygonFeature = polygonsLayer.GetNextFeature()

#Loop through to see how many events in each state
I like the question. I doubt I can give you the best answer, and definitely can't help with OGR, but FWIW I'll tell you what I'm doing right now.
I use GeoPandas, a geospatial extension of pandas. I recommend it: it's high-level and does a lot, giving you everything in Shapely and Fiona for free. It is in active development by @kajord and others.
Here's a version of my working code. It assumes you have everything in shapefiles, but it's easy to generate a geopandas.GeoDataFrame from a list.
import geopandas as gpd

# Read the data.
polygons = gpd.GeoDataFrame.from_file('polygons.shp')
points = gpd.GeoDataFrame.from_file('points.shp')

# Make a copy because I'm going to drop points as I
# assign them to polys, to speed up subsequent search.
pts = points.copy()

# We're going to keep a list of how many points we find.
pts_in_polys = []

# Loop over polygons with index i.
for i, poly in polygons.iterrows():
    # Keep a list of points in this poly
    pts_in_this_poly = []

    # Now loop over all points with index j.
    for j, pt in pts.iterrows():
        if poly.geometry.contains(pt.geometry):
            # Then it's a hit! Add it to the list,
            # and drop it so we have less hunting.
            pts_in_this_poly.append(pt.geometry)
            pts = pts.drop([j])

    # We could do all sorts, like grab a property of the
    # points, but let's just append the number of them.
    pts_in_polys.append(len(pts_in_this_poly))

# Add the number of points for each poly to the dataframe.
polygons['number of points'] = gpd.GeoSeries(pts_in_polys)
The developer tells me that spatial joins are 'new in the dev version', so if you feel like poking around in there, I'd love to hear how that goes! The main problem with my code is that it's slow.
import geopandas as gpd

# Read the data.
polygons = gpd.GeoDataFrame.from_file('polygons.shp')
points = gpd.GeoDataFrame.from_file('points.shp')

# Spatial join
pointsInPolygon = gpd.sjoin(points, polygons, how="inner", op='intersects')

# Add a field with 1 as a constant value
pointsInPolygon['const'] = 1

# Group according to the column by which you want to aggregate data
pointsInPolygon.groupby(['statename']).sum()

The column 'const' will give you the count of points in your multipolygons.
If you want to see other columns as well, just type something like this:

pointsInPolygon = pointsInPolygon.groupby('statename').agg({'columnA': 'first', 'columnB': 'first', 'const': 'sum'}).reset_index()