I’m working with geo-located social media posts and clustering their locations (latitude/longitude) using DBSCAN. In my data set, many users have posted multiple times, which lets me derive each user's trajectory (a time-ordered sequence of positions from place to place). For example:
3945641 [[38.9875, -76.94], [38.91711157, -77.02435118], [38.8991, -77.029], [38.8991, -77.029], [38.88927534, -77.04858468]]
I have derived trajectories for my entire data set, and my next step is to cluster or aggregate the trajectories in order to identify areas with dense movements between locations. Any ideas on how to tackle trajectory clustering/aggregation in Python?
Here is some code I've been working with to create trajectories as line strings/JSON dicts:
import pandas as pd
import numpy as np
import ujson as json
import time
# Import Data
data = pd.read_csv('filepath.csv', delimiter=',', engine='python')
#print len(data),"rows"
#print data
# Create Data Frame
df = pd.DataFrame(data, columns=['user_id','timestamp','latitude','longitude','cluster_labels'])
#print data.head()
# Get a list of unique user_id values
uniqueIds = np.unique(data['user_id'].values)
# Get the ordered (by timestamp) coordinates for each user_id
output = [[uid, data.loc[data['user_id']==uid].sort_values(by='timestamp')[['latitude','longitude']].values.tolist()] for uid in uniqueIds]
# Save outputs as csv
outputs = pd.DataFrame(output)
#print outputs
outputs.to_csv('filepath_out.csv', index=False, header=False)
# Save outputs as JSON
#outputDict = {}
#for i in output:
#    outputDict[i[0]] = i[1]
#with open('filepath.json','w') as f:
#    json.dump(outputDict, f, sort_keys=True, indent=4, ensure_ascii=False)
EDIT
I've come across a Python package, NetworkX, and am considering building a network graph from my clusters instead of clustering the trajectory lines/segments. Any opinions on clustering trajectories vs. turning clusters into a graph to identify densely clustered movements between locations?
Below is an example of what some of the clusters look like:
To answer my own question more than a year later: I've come up with a couple of solutions that solve this (and similar problems), albeit without Python, which was my original hope. The first is a method I described for a user on GIS StackExchange, which uses ArcGIS and a couple of its built-in tools to carry out a line density analysis (https://gis.stackexchange.com/questions/42224/creating-polyline-based-heatmap-from-gps-tracks/270524#270524). It takes GPS points, creates lines, segments the lines, and then clusters them. The second method uses SQL (primarily ST_MakeLine) with a Postgres/PostGIS/CARTO database to create lines grouped by user and ordered by ascending timestamp (e.g. https://carto.com/blog/jets-and-datelines/). One can then count the number of occurrences of each line (assuming the points are clustered with clearly defined centroids, as in my initial question above) and treat this as a cluster (e.g. Python/NetworkX: Add Weights to Edges by Frequency of Edge Occurance, https://carto.com/blog/alteryx-and-carto-to-explore-london-bike-data/).
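For anyone who still wants a Python route, the line-counting idea above can be sketched with the standard library alone. This is a minimal illustration, not the ArcGIS or SQL workflow itself: it assumes each trajectory has already been reduced to a time-ordered list of cluster centroids, and the function name `count_transitions` and the sample data are hypothetical.

```python
from collections import Counter

def count_transitions(trajectories):
    """Count how often each ordered pair of consecutive locations occurs."""
    counts = Counter()
    for points in trajectories.values():
        # Pair each point with its successor: (p0, p1), (p1, p2), ...
        for start, end in zip(points, points[1:]):
            if start != end:  # ignore repeated posts from the same place
                counts[(tuple(start), tuple(end))] += 1
    return counts

# Hypothetical sample: user_id -> time-ordered cluster centroids
trajectories = {
    1: [[38.9875, -76.94], [38.8991, -77.029], [38.8991, -77.029]],
    2: [[38.9875, -76.94], [38.8991, -77.029]],
    3: [[38.8991, -77.029], [38.9875, -76.94]],
}

edge_counts = count_transitions(trajectories)
for (start, end), n in edge_counts.most_common():
    print(start, "->", end, n)
```

Each count could then serve as an edge weight in a NetworkX graph, for example via `G.add_edge(start, end, weight=n)`, so that the heaviest edges mark the densest movements between locations.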
I have data in a CSV file that contains zip codes in one column and what is probably GeoJSON data in the other column. I loaded the data into a pandas DataFrame. How do I extract just the coordinates from the geojson column?
zips.head(2)
Out[14]:
postal_code geojson
0 85309 {"type":"MultiPolygon","coordinates":[[[[-112....
1 85310 {"type":"MultiPolygon","coordinates":[[[[-112....
zips.geojson[1]
zips.geojson.values[0]
'{"type":"MultiPolygon","coordinates":[[[[-112.363501,33.551312],[-112.363457,33.551312],[-112.36253,33.551309],[-112.361378,33.551311],[-112.360977,33.55131],[-112.358913,33.551305],[-112.358916,33.551104],[-112.358898,33.550758],[-112.358825,33.549401],[-112.358763,33.548056],[-112.358652,33.546016],[-112.358635,33.54554],[-112.358629,33.545429],[-112.358613,33.545143],[-112.358607,33.545039],[-112.358599,33.544897],[-112.358596,33.544838],[-112.358592,33.54478],[-112.358545,33.543923],[-112.358475,33.542427],[-112.358444,33.541913],[-112.35842,33.541399],[-112.358363,33.540373],[-112.358345,33.540104],[-112.35833,33.539878],[-112.35828,33.538863],[-112.358263,33.538352],[-112.358204,33.537335],[-112.358196,33.536892],[-112.358193,33.536444],[-112.358192,33.53631],[-112.358182,33.536031],[-112.358175,33.535797],[-112.358186,33.534197],[-112.358187,33.53324],[-112.358185,33.53278],[-112.358182,33.532218],[-112.358168,33.530732],[-112.358163,33.530174],[-112.35815,33.529797],[-112.359343,33.529819],[-112.359387,33.529812],[-112.359354,33.529716],[-112.360874,33.529732],[-112.370575,33.529805],[-112.375373,33.529907],[-112.37537,33.528961],[-112.375382,33.527693],[-112.375384,33.527033],[-112.375393,33.526355],[-112.374883,33.526353],[-112.371535,33.52634],[-112.366678,33.526323],[-112.366665,33.523201],[-112.366664,33.52285],[-112.366661,33.522734],[-112.366658,33.522596],[-112.366657,33.522553],[-112.366655,33.522502],[-112.366658,33.522388],[-112.368754,33.522441],[-112.370106,33.522618],[-112.370917,33.522624],[-112.371875,33.522633],[-112.371865,33.522389],[-112.371875,33.522162],[-112.37175,33.51916],[-112.375186,33.519096],[-112.375306,33.519094],[-112.375305,33.51971],[-112.375309,33.519728],[-112.375351,33.521607],[-112.375367,33.522304],[-112.375426,33.522419],[-112.375587,33.522423],[-112.375767,33.522426],[-112.382694,33.522547],[-112.382697,33.522654],[-112.382698,33.522709],[-112.382714,33.523282],[-112.382958,33.523283],[-112.383939,33.52329],[-112.383
935,33.523153],[-112.386882,33.523097],[-112.38781,33.523781],[-112.38801,33.523609],[-112.388673,33.523001],[-112.388794,33.522895],[-112.388852,33.522844],[-112.389115,33.522837],[-112.389205,33.522761],[-112.389319,33.522661],[-112.392416,33.51994],[-112.392509,33.519195],[-112.392516,33.51914],[-112.401093,33.51914],[-112.401098,33.519779],[-112.401098,33.519838],[-112.401137,33.519885],[-112.401146,33.519903],[-112.40124,33.520001],[-112.401311,33.520066],[-112.401432,33.520158],[-112.401754,33.520412],[-112.402133,33.520685],[-112.402411,33.520892],[-112.402552,33.52098],[-112.402692,33.521087],[-112.402882,33.521256],[-112.402948,33.52133],[-112.403016,33.521428],[-112.403062,33.521517],[-112.4031,33.521621],[-112.40312,33.521715],[-112.403129,33.521822],[-112.403119,33.521937],[-112.403102,33.522011],[-112.403064,33.522109],[-112.403009,33.522208],[-112.402908,33.522336],[-112.402781,33.522475],[-112.402685,33.52257],[-112.402641,33.522613],[-112.402553,33.522692],[-112.401659,33.523488],[-112.401228,33.52388],[-112.401157,33.523961],[-112.401123,33.524028],[-112.401107,33.524102],[-112.401108,33.524213],[-112.401116,33.525097],[-112.401119,33.5263],[-112.401119,33.52634],[-112.401119,33.526441],[-112.399658,33.52646],[-112.399258,33.526743],[-112.395079,33.52973],[-112.394771,33.529977],[-112.39013,33.534207],[-112.388661,33.535533],[-112.385957,33.538011],[-112.384107,33.539698],[-112.384007,33.539732],[-112.383947,33.539786],[-112.38381,33.539862],[-112.384585,33.551063],[-112.384605,33.551372],[-112.384609,33.551434],[-112.384614,33.551508],[-112.384416,33.551505],[-112.38385,33.551499],[-112.38131,33.551461],[-112.380126,33.551454],[-112.378928,33.551432],[-112.376262,33.551405],[-112.373858,33.551381],[-112.372583,33.551378],[-112.370038,33.551354],[-112.368768,33.55135],[-112.367585,33.551339],[-112.36749,33.551338],[-112.363501,33.551312]]]]}'
I tried to use it the way I would use values inside a dictionary, but I am unable to.
This might help. It is untested, so it might not work, or might need to be adjusted slightly for your use case.
The important features of this program are:
Use json.loads() to convert a JSON string to a Python data structure
Decompose the data structure according to GeoJSON standard.
Reference:
http://geojson.org/geojson-spec.html#multipolygon
#UNTESTED
import json
# For every zipcode, print the X position of the first
# coordinate of the exterior of the multipolygon associated
# with that zip code
# Iterating over a DataFrame directly yields column names,
# so pair up the two columns explicitly
for zipcode, geo in zip(zips['postal_code'], zips['geojson']):
    geo = json.loads(geo)
    assert geo["type"] == "MultiPolygon"
    # Coordinates of a MultiPolygon are an array of Polygon coordinate arrays
    array_of_polygons = geo["coordinates"]
    polygon0 = array_of_polygons[0]
    # Coordinates of a Polygon are an array of LinearRing coordinate arrays
    ring0 = polygon0[0]
    # A LinearRing is a closed LineString with 4 or more positions
    # A LineString is an array of positions
    vertex0 = ring0[0]
    # A position is represented by an array of numbers.
    x0 = vertex0[0]
    print(zipcode, x0)
You can import json and apply json.loads to convert the string data in your geojson column to a dict. Then you can extract data from the dict directly, or use one of the many Python modules that deal with GIS data. I find shapely easy to use and helpful in many cases.
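As a rough sketch of that approach, the snippet below parses one GeoJSON string and pulls out its coordinates using only the standard library. The helper names `extract_coordinates` and `exterior_ring_points` are hypothetical, and the sample string is a shortened version of the value shown in the question.

```python
import json

def extract_coordinates(geojson_text):
    """Parse a GeoJSON geometry string and return its 'coordinates' member."""
    geometry = json.loads(geojson_text)
    return geometry["coordinates"]

def exterior_ring_points(geojson_text):
    """Flatten a MultiPolygon into a list of (lon, lat) exterior-ring vertices."""
    geometry = json.loads(geojson_text)
    if geometry["type"] != "MultiPolygon":
        raise ValueError("expected a MultiPolygon, got %s" % geometry["type"])
    points = []
    for polygon in geometry["coordinates"]:
        exterior = polygon[0]  # first ring is the exterior; any others are holes
        points.extend((p[0], p[1]) for p in exterior)
    return points

# Shortened sample based on the first row in the question
sample = ('{"type":"MultiPolygon","coordinates":[[[[-112.363501,33.551312],'
          '[-112.363457,33.551312],[-112.36253,33.551309],'
          '[-112.363501,33.551312]]]]}')

coords = extract_coordinates(sample)
points = exterior_ring_points(sample)
print(len(points), points[0])
```

With the DataFrame from the question, something like `zips['geojson'].apply(extract_coordinates)` should give a column of nested coordinate lists. If you go the shapely route instead, `shapely.geometry.shape` accepts the dict returned by `json.loads`.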