I’m working with geo-located social media posts and clustering their locations (latitude/longitude) using DBSCAN. In my data set, I have many users who have posted multiple times, which allows me to derive their trajectory (a time ordered sequence of positions from place to place). Ex:
3945641 [[38.9875, -76.94], [38.91711157, -77.02435118], [38.8991, -77.029], [38.8991, -77.029], [38.88927534, -77.04858468]]
I have derived trajectories for my entire data set, and my next step is to cluster or aggregate the trajectories in order to identify areas with dense movements between locations. Any ideas on how to tackle trajectory clustering/aggregation in Python?
Here is some code I've been working with to create trajectories as line strings/JSON dicts:
import pandas as pd
import numpy as np
import ujson as json
import time
# Import Data
data = pd.read_csv('filepath.csv', delimiter=',', engine='python')
#print len(data),"rows"
#print data
# Create Data Frame
df = pd.DataFrame(data, columns=['user_id','timestamp','latitude','longitude','cluster_labels'])
#print data.head()
# Get a list of unique user_id values
uniqueIds = np.unique(data['user_id'].values)
# Get the ordered (by timestamp) coordinates for each user_id
output = [[id,data.loc[data['user_id']==id].sort_values(by='timestamp')[['latitude','longitude']].values.tolist()] for id in uniqueIds]
# Save outputs as csv
outputs = pd.DataFrame(output)
#print outputs
outputs.to_csv('filepath_out.csv', index=False, header=False)
# Save outputs as JSON
#outputDict = {}
#for i in output:
# outputDict[i[0]]=i[1]
#with open('filepath.json','w') as f:
#json.dump(outputDict, f, sort_keys=True, indent=4, ensure_ascii=False,)
EDIT
I've come across a Python package, NetworkX, and was debating the idea of creating a network graph from my clusters as opposed to clustering the trajectory lines/segments. Any opinions on clustering trajectories vs. turning clusters into a graph to identify densely clustered movements between locations?
Below is an example of what some of the clusters look like:
In an effort to answer my own 1+ year old question, I've come up with a couple of solutions that have solved this (and similar questions), albeit without Python (which was my hope). The first is a method I provided to a user on the GIS StackExchange, using ArcGIS and a couple of built-in tools to carry out a line density analysis (https://gis.stackexchange.com/questions/42224/creating-polyline-based-heatmap-from-gps-tracks/270524#270524). This takes GPS points, creates lines, segments the lines, and then clusters them. The second method uses SQL (primarily ST_MakeLine) and a Postgres/GIS/CARTO database to create lines ordered by ascending timestamp and grouped by user (e.g. https://carto.com/blog/jets-and-datelines/). One can then count the number of line occurrences (assuming points are clustered with clearly defined centroids, as in my initial question above) and treat this as a cluster (e.g. Python/NetworkX: Add Weights to Edges by Frequency of Edge Occurance, https://carto.com/blog/alteryx-and-carto-to-explore-london-bike-data/).
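For anyone who still wants a Python route for the counting step described above, here is a minimal sketch using only the standard library. It assumes each trajectory has already been reduced to a time-ordered list of cluster labels per user; the `trajectories` dict, its contents, and the function name `count_transitions` are hypothetical and not part of the original code.

```python
from collections import Counter


def count_transitions(trajectories):
    """Count how often each (origin, destination) cluster pair occurs.

    trajectories: dict mapping user_id -> time-ordered list of cluster
    labels (a hypothetical input format).
    """
    counts = Counter()
    for labels in trajectories.values():
        # Pair each label with its successor; ignore steps that stay
        # inside the same cluster.
        for origin, dest in zip(labels, labels[1:]):
            if origin != dest:
                counts[(origin, dest)] += 1
    return counts


# Made-up example data: three users moving between clusters 0, 1 and 2.
trajectories = {
    "u1": [0, 1, 1, 2],
    "u2": [0, 1, 2],
    "u3": [2, 0, 1],
}
edge_counts = count_transitions(trajectories)
busiest = edge_counts.most_common(1)[0]  # the most frequent movement
```

The resulting counts could then serve as edge weights, for instance when building a NetworkX graph between cluster centroids.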
I am working with a data set, from which I need to remove some records from a variable.
The dataset is from the sklearn library:
from sklearn.datasets import fetch_kddcup99
I need to find the two most frequent labels in the labels variable and eliminate all other records from the dataset.
datos = pd_data.groupby('labels').size().sort_values(ascending=False)
top = datos.head(2)
print(top)
I tried to delete them this way, but the records are not removed:
When looking at the dataset, the other records are still there:
And I need:
If I understand your question, you want to create a dataframe with only the records that have one of the two most frequent labels.
Assuming you have a list a of the desired labels, you can filter the dataframe as follows:
a = ["b'neptune,'", "b'normal,'"]
dfout = df[df['labels'].isin(a)]
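The list of labels does not have to be typed by hand. As a minimal sketch, assuming the labels live in a column called 'labels', the two most frequent values can be taken from value_counts() and passed straight to isin(). The small DataFrame below is made-up stand-in data, not the real KDD Cup set.

```python
import pandas as pd

# Made-up stand-in for the real data; only the 'labels' column matters here.
df = pd.DataFrame({
    "labels": ["smurf", "neptune", "normal", "smurf", "neptune", "smurf", "back"],
    "value": range(7),
})

# value_counts() sorts by frequency (descending), so the first two
# index entries are the two most frequent labels.
top_two = df["labels"].value_counts().head(2).index

# Keep only the rows whose label is one of those two.
dfout = df[df["labels"].isin(top_two)]
```

Note that the filter returns a new DataFrame; the original df is left unchanged unless you assign the result back to it.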
I have a set of Polygon instances (loaded via geopandas) that consist of historical severe-weather watches and warnings, and netCDF data at 15-minute intervals containing parameters such as tornado/hail forecast parameters and updraft helicity (these are point data). Each polygon has a set start time and end time. For each 15-minute interval within a polygon's time range (start time -> end time), I want to spatially join, from the netCDF files, the highest value of each of these forecast parameters within the polygon.
I already have the code that pulls the range of time steps (time_list) required to analyze from the netCDF files for each polygon (TOR):
# Pull the warning issue time, locate the closest index in our subhour files
issue_time = datetime.strptime(str(TOR.ISSUED), "%Y%m%d%H%M")
expire_time = datetime.strptime(str(TOR.EXPIRED), "%Y%m%d%H%M")
closest_first = min(time_list, key=lambda d: abs(d - issue_time))
closest_last = min(time_list, key=lambda d: abs(d - expire_time))
# Pull the timesteps in between the issue time and the expire time
before = time_list.index(closest_first) if closest_first < issue_time else time_list.index(closest_first)-1
after = time_list.index(closest_last) if closest_last > expire_time else time_list.index(closest_last)+1
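The same bracketing can also be done with a binary search. This is only a sketch, assuming time_list is sorted in ascending order; the helper name bracketing_indices and the sample times are hypothetical.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta


def bracketing_indices(time_list, issue_time, expire_time):
    """Return (before, after) indices into a sorted time_list.

    before: last timestep at or before issue_time (clamped to 0)
    after:  first timestep at or after expire_time (clamped to the end)
    """
    before = max(bisect_right(time_list, issue_time) - 1, 0)
    after = min(bisect_left(time_list, expire_time), len(time_list) - 1)
    return before, after


# Made-up 15-minute timesteps from 12:00 to 13:00.
start = datetime(2020, 5, 1, 12, 0)
time_list = [start + timedelta(minutes=15 * i) for i in range(5)]

before, after = bracketing_indices(
    time_list,
    issue_time=datetime(2020, 5, 1, 12, 10),
    expire_time=datetime(2020, 5, 1, 12, 40),
)
window = time_list[before:after + 1]  # 12:00 through 12:45
```

Clamping to the ends of the list avoids an index of -1 or an out-of-range index when a warning starts before the first file or ends after the last one.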
The TOR object is generated by slicing the geoDataFrame of the full polygon data. Right now I am using itertuples on the below slice to get the TOR object:
TORNADOES_IN_WINDOW = df_slice_TO_WRN[(df_slice_TO_WRN.ISSUED > int(first.strftime("%Y%m%d%H%M")))
& (df_slice_TO_WRN.ISSUED < int(last.strftime("%Y%m%d%H%M")))]
Inside the loop, I then iterate over the list of netCDF files in the range of timesteps found (before -> after) and load these into geopandas as well, so that I can perform a spatial join. This is done with this block of code:
xr_i = xr.open_dataset(subhr_files[i])
ds_i = xr_i.to_dataframe()
ds_i = ds_i.reset_index()
geom = [Point(x,y) for x, y in zip(ds_i['XLONG'], ds_i['XLAT'])]
gdf = gpd.GeoDataFrame(ds_i, geometry=geom)
The problems I'm encountering right now are the following. When running itertuples on the GeoDataFrame object (TORNADOES_IN_WINDOW), each item comes back as a Pandas object, not a GeoPandas object, and therefore I cannot run sjoin to join the attributes. Also, this is not a very efficient approach, and I'm wondering if there is a better way to do it.
I need to extract data from a global grid only for a specific set of nodes, given by lat/lon coordinates (in the order of 5000-10000). The data are time-series of hydraulic parameters, for example wave height.
The global data set is huge, so it is divided into many NetCDF files. Each NetCDF file is around 5GB and contains data for the entire global grid, but only for one variable (e.g. wave height) and one year (e.g. 2020). Say I want to extract the full time series (42 years) of 6 variables at a certain location: I need to extract data from 6x42 = 252 NC files, each 5GB in size.
My current approach is a triple loop through years, variables, and nodes. I use Xarray to open each NC file, extract the data for all the required nodes and store it in a dictionary. Once I've extracted all the data into the dictionary, I create one pd.DataFrame for each location, which I store as a pickle file. With 6 variables and 42 years, this results in a pickle file of around 7-9 MB for each location (so not very large actually).
My approach works perfectly fine if I have a small number of locations, but as soon as it grows to a few hundred, it takes extremely long. My gut feeling is that it is a memory problem (since all the extracted data is first stored in a single dictionary, until every year and variable are extracted). But one of my colleagues said that Xarray is actually quite inefficient and that this might lead to the long duration.
Does anyone here have experience with similar issues or know of an efficient way to extract data from a multitude of NC files? I put the code I currently use below. Thanks for any help!
# set conditions
variables = {...}  # dictionary which contains the variables
years = np.arange(y0, y1 + 1)  # year range
ndata = {}  # dictionary which will contain all data
# loop through all the desired variables
for v in variables.keys():
    ndata[v] = {}
    # For each variable, loop through each year, open the nc file and extract the data
    for y in years:
        # Open file with xarray
        fname = 'xxx.nc'
        data = xr.open_dataset(fname)
        # loop through the locations and load the data for each node as temp
        for n in range(len(nodes)):
            node = nodes.node_id.iloc[n]
            lon = nodes.lon.iloc[n]
            lat = nodes.lat.iloc[n]
            temp = data.sel(longitude=lon, latitude=lat)
            # For the first year, store the data into the ndata dict
            if y == years[0]:
                ndata[v][node] = temp
            # For subsequent years, concatenate the existing array in ndata
            else:
                ndata[v][node] = xr.concat([ndata[v][node], temp], dim='time')
# merge the variables for the current location into one dataset
for n in range(len(nodes)):
    node = nodes.node_id.iloc[n]
    dset = xr.merge(ndata[v][node] for v in variables.keys())
    df = dset.to_dataframe()
    # save dataframe as pickle file, named by the node id
    df.to_pickle('%s.xz' % (node))
This is a pretty common workflow, so I'll give a few pointers. Here are a few suggested changes, with the most important ones first.
Use xarray's advanced indexing to select all points at once
It looks like you're using a pandas DataFrame nodes with columns 'lat', 'lon', and 'node_id'. As with nearly everything in python, remove an inner for loop whenever possible, leveraging array-based operations written in C. In this case:
# create an xr.Dataset indexed by node_id with arrays `lat` and `lon`
node_indexer = nodes.set_index('node_id')[['lat', 'lon']].to_xarray()
# select all points from each file simultaneously, reshaping to be
# indexed by `node_id`
node_data = data.sel(latitude=node_indexer.lat, longitude=node_indexer.lon)
# dump this reshaped data to pandas, with each variable becoming a column
node_df = node_data.to_dataframe()
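To see why this replaces the inner loop, here is a small self-contained sketch on synthetic data. The grid values, the variable name hs, and the node table are made up; the dimension names latitude and longitude follow the code in the question. Because both indexers share the node_id dimension, xarray selects one point per node rather than the full lat/lon cross product.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Made-up 3x4 grid standing in for one global file.
grid = xr.Dataset(
    {"hs": (("latitude", "longitude"), np.arange(12.0).reshape(3, 4))},
    coords={"latitude": [10.0, 20.0, 30.0], "longitude": [0.0, 1.0, 2.0, 3.0]},
)

# Made-up node table with the same columns as in the question.
nodes = pd.DataFrame({"node_id": [101, 102], "lat": [10.0, 30.0], "lon": [1.0, 3.0]})

# Both indexers are DataArrays along `node_id`, so the selection is pointwise.
node_indexer = nodes.set_index("node_id")[["lat", "lon"]].to_xarray()
picked = grid.sel(latitude=node_indexer.lat, longitude=node_indexer.lon)

# One row per node, one column per variable (plus the lat/lon coordinates).
picked_df = picked.to_dataframe()
```

If the node coordinates do not fall exactly on grid points, passing method='nearest' to sel is one option, but whether nearest-neighbour lookup is appropriate depends on your data.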
Only reshape arrays once
In your code, you are looping over many years, and every year after
the first one you are allocating a new array with enough memory to
hold as many years as you've stored so far.
# For the first year, store the data into the ndata dict
if y == years[0]:
    ndata[v][node] = temp
# For subsequent years, concatenate the existing array in ndata
else:
    ndata[v][node] = xr.concat([ndata[v][node],temp], dim='time')
Instead, just gather all the years worth of data and concatenate
them at the end. This will only allocate the needed array for all the data once.
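A minimal sketch of that pattern, using tiny synthetic datasets in place of the yearly files (the helper fake_year and the variable name hs are hypothetical):

```python
import numpy as np
import pandas as pd
import xarray as xr


def fake_year(year):
    """Hypothetical stand-in for the data extracted from one yearly file."""
    time = pd.date_range(f"{year}-01-01", periods=3, freq="D")
    return xr.Dataset(
        {"hs": ("time", np.full(3, float(year)))},
        coords={"time": time},
    )


years = [2018, 2019, 2020]

# Collect the per-year pieces in a plain list (cheap appends) ...
pieces = [fake_year(y) for y in years]

# ... and concatenate once at the end, so the full array is allocated a single time.
combined = xr.concat(pieces, dim="time")
```

With the loop in the question, the equivalent change is to append temp to a list per node and call xr.concat on that list after the year loop has finished.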
Use dask, e.g. with xr.open_mfdataset, to leverage multiple cores. If you do this, you may want to consider using a format that supports multithreaded writes, e.g. zarr.
All together, this could look something like this:
# build nested filepaths: outer list over variables, inner list over years
filepaths = [
    ['xxx.nc'.format(year=y, variable=v) for y in years]
    for v in variables
]
# build node indexer
node_indexer = nodes.set_index('node_id')[['lat', 'lon']].to_xarray()
# I'm not sure if you have conflicting variable names - you'll need to
# tailor this line to your data setup. It may be that you want to just
# concatenate along years and then use `xr.merge` to combine the
# variables, or just handle one variable at a time
ds = xr.open_mfdataset(
    filepaths,
    combine='nested',
    concat_dim=['variable', 'year'],
    parallel=True,
)
# this will only schedule the operation - no work is done until the next line
ds_nodes = ds.sel(latitude=node_indexer.lat, longitude=node_indexer.lon)
# this triggers the operation using dask, leveraging
# multiple threads on your machine (or a distributed Client if you have
# one set up)
ds_nodes.to_zarr('all_the_data.zarr')
# alternatively, you could still dump to pandas:
df = ds_nodes.to_dataframe()
I work with a test system that outputs a large CSV matrix of values which I then process using the Pandas module in Python. The parameters that system uses when testing a given part are governed by a predetermined sequence. A simplified example is shown here:
Raw data frame
However, not all of these steps are desired in the output data. In fact, the rows containing a 'Clock Frequency' value of '3.0MHz' are only included to act as buffer points to allow a climate chamber to reach the intended temperature. I do not wish to include data collected at these parameters in my results.
I found I was pretty easily able to remove these rows from my data frame by using the below code. Note that in this example I am working with a Pandas data frame called 'csvDF'.
tempBuffers = csvDF[csvDF['Clock Frequency']==3e6].index
csvDF.drop(tempBuffers, inplace=True)
This produces the following output:
Data frame with buffer steps removed
The issue with this is that now my 'Sequence Step' column is wrong. I want the data table to appear as if those buffer steps never existed. The sequence steps should be sequential for all non-buffer steps. The desired output is shown below:
Data frame with buffer steps removed and corrected sequence step column
What code do I need in order to achieve this?
You can try something like this:
n = 3  # number of rows per sequence step
csvDF.reset_index(inplace=True, drop=True)
csvDF['Sequence Step'] = pd.Series(range(len(csvDF)))
csvDF['Sequence Step'] = csvDF['Sequence Step'].apply(lambda x: x // n)
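If the number of rows per step is not constant, another option is to renumber the surviving step values directly. This is only a sketch on made-up data; it assumes the column is called 'Sequence Step' and that steps should be numbered from 1, neither of which is confirmed by the screenshots.

```python
import pandas as pd

# Made-up stand-in for csvDF: step 2 is the 3.0 MHz buffer step.
csvDF = pd.DataFrame({
    "Sequence Step": [1, 1, 2, 2, 3, 3, 4, 4],
    "Clock Frequency": [1e6, 1e6, 3e6, 3e6, 1e6, 1e6, 2e6, 2e6],
})

# Drop the buffer rows and rebuild a clean 0..N-1 index.
kept = csvDF[csvDF["Clock Frequency"] != 3e6].reset_index(drop=True)

# factorize() assigns 0, 1, 2, ... to the distinct step values in order of
# first appearance, so the gaps left by the removed steps disappear.
kept["Sequence Step"] = pd.factorize(kept["Sequence Step"])[0] + 1
```

This relies on the original step numbers being distinct for each step, which holds when the buffer steps had their own sequence numbers.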
I am stuck with a strange issue. I am reading data from a CSV file and converting it into a multi-level dictionary.
CSV Format: I have a total of 1,500 rows in my CSV file, see the format below.
1-103rd Street,1-96th Street,2327.416174
1-116th Street–Columbia University,1-Cathedral Parkway–110th Street,2327.416174
1-125th Street,1-116th Street–Columbia University,2327.416174
1-137th Street–City College,1-125th Street,2327.416174
1-145th Street,1-137th Street–City College,2327.416174
1-14th Street,1-Christopher Street–Sheridan Square,2327.416174
In the above file, the first column denotes a source station, the second column denotes a destination station, and the third column provides the distance between them.
I will have to apply Dijkstra's Algorithm to find the shortest distance between two stations, and for that I need to convert the whole CSV file into a weighted graph, in which each station is a node and the distance between them is the weight of the edge.
My approach:
First I am reading each row from the CSV file and converting it into a multi-level dictionary. I am getting a proper dictionary for this. Below is my code.
my_dict = {}
with open('final_subway_data.csv') as f_input:
    for row in csv.reader(f_input):
        my_dict[row[0]] = {row[1]: row[2]}
Now I need to convert this newly created dictionary into a graph in order to apply Dijkstra's Algorithm. For that I am using this code:
G = nx.from_dict_of_dicts(my_dict)
But I am getting an error saying "TypeError: Input graph is not a networkx graph type".
Please help me. How can I convert the whole CSV file into a graph so that I can apply Dijkstra's Algorithm to find the shortest distance between any two stations?
I'm not super familiar with NetworkX, but I'd do the following using pandas and nx.from_pandas_edgelist() (called nx.from_pandas_dataframe() before NetworkX 2.0).
import pandas as pd
import networkx as nx
df = pd.read_csv('csvpath.csv', names=['origin', 'dest', 'dist'])
g = nx.from_pandas_edgelist(df, source='origin', target='dest', edge_attr='dist')
g['1-103rd Street']['1-96th Street']['dist']
# 2327.416174
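For reference, here is a minimal sketch of the whole pipeline using only the standard library, so it does not depend on any particular NetworkX version. Note that the loop in the question assigns my_dict[row[0]] = {...}, which overwrites earlier entries for the same source station; the sketch below adds to the inner dict instead. It assumes the connections are undirected, and the function names and sample rows are made up.

```python
import csv
import heapq
import io
from collections import defaultdict


def build_graph(rows):
    """Build an adjacency dict {station: {neighbour: distance}}.

    Assumes each CSV row describes an undirected connection.
    """
    graph = defaultdict(dict)
    for source, dest, dist in rows:
        graph[source][dest] = float(dist)
        graph[dest][source] = float(dist)
    return graph


def shortest_distance(graph, start, goal):
    """Dijkstra's algorithm; returns the distance, or None if unreachable."""
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == goal:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry, a shorter route was already found
        for neighbour, weight in graph[node].items():
            new_dist = dist + weight
            if new_dist < best.get(neighbour, float("inf")):
                best[neighbour] = new_dist
                heapq.heappush(queue, (new_dist, neighbour))
    return None


# Made-up rows in the same three-column layout as the CSV above.
sample = io.StringIO("A,B,1.5\nB,C,2.0\nA,C,5.0\n")
graph = build_graph(csv.reader(sample))
```

With the real file, csv.reader(open('final_subway_data.csv')) would take the place of the in-memory sample, and shortest_distance(graph, 'A', 'C') shows how a lookup between two stations would be made.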