Merging points based on coordinates using python (pandas or geopandas)

I have a dataset like:
pointID lat lon otherinfo
I want to round up the coordinates and aggregate all the points whose coordinates become equal into one single item, and assign it a new name, which would probably be a new dataframe column. The "otherinfo" column must be preserved, meaning that by the end of the operation I will have the same number of rows I had before, but with new IDs based on the rounded coordinates.
How can I achieve this using pandas? Is it any easier if I use GeoPandas?

If you already have columns for the coordinates (lat and lon), you can do for example (rounding to 2 decimal places):
df['new_id'] = df.groupby([df.lat.round(2), df.lon.round(2)]).ngroup()
The ngroup method on the groupby returns, for each original row, the number of the group it belongs to, so in effect it gives you a new unique ID based on the rounded lat/lon.
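For illustration, a minimal self-contained example (the coordinate values below are invented; the column names follow the question):

import pandas as pd

df = pd.DataFrame({
    'pointID': ['p1', 'p2', 'p3', 'p4'],
    'lat': [45.3712, 45.3689, 45.3687, 47.1021],
    'lon': [9.1243, 9.1251, 9.1246, 8.5530],
    'otherinfo': ['a', 'b', 'c', 'd'],
})

# group by the rounded coordinates; ngroup() labels each row with the
# index of its group, so no rows are lost
df['new_id'] = df.groupby([df.lat.round(2), df.lon.round(2)]).ngroup()
print(df)  # p1 and p3 share a new_id because they round to the same cell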

Related

How to cluster values of continuous time series

In the picture I plot the values from an array of shape (400,8)
I wish to reorganize the points in order to get 8 series of "continuous" points. Let's call them a(t), b(t), ..., h(t), with a(t) being the series with the smallest values and h(t) the series with the largest values. They are unknown and I try to obtain them.
I have some missing values replaced by 0.
When there is a 0, I do not know which series it belongs to. The zeros are always stored at the high indices of the array.
For instance at time t=136 I have only 4 values that are valid. Then array[t,i] > 0 for i <=3 and array[t,i] = 0 for i > 3
How can I cluster the points in a way that I get "continuous" time series i.e. at time t=136, array[136,0] should go into d, array[136,1] should go into e, array[136,2] should go into f and array[136,3] should go into g
I tried AgglomerativeClustering and DBSCAN with scikit-learn with no success.
Data are available at https://drive.google.com/file/d/1DKgx95FAqAIlabq77F9f-5vO-WPj7Puw/view?usp=sharing
My interpretation is that you have the data in 400 columns and 8 rows. The data values are assigned to the correct columns, but not necessarily to the correct rows. Your figure shows that the 8 signals do not cross each other, so you should be able to simply sort each column individually. But now the missing data is the problem, because the zeros representing missing data will all sort to the bottom rows, forcing the real data into the wrong rows.
I don't know if this is a good answer, but my first hunch is to start by sorting each column individually. Then begin in a place where there are several adjacent columns with full spans of real data, and work away from that location, first to the left and then to the right, one column at a time. If the column contains no zeros, it is OK. If it contains zeros, compute local row averages of the immediately adjacent columns using only the non-zero values (how many columns to use depends on the density of missing data and on the separation between the signals), then put each valid value in the current column into the row with the closest 'local row average' value, and put zeros in the remaining rows.
How to code that depends on what you have done so far. If you are using numpy, it would be convenient to first convert the zeros to NaNs, because numpy.nanmean() will ignore the NaNs.
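As a rough sketch of that idea (not the poster's code; the orientation of the array and the window parameter are assumptions, and a fixed window of neighbouring columns is used instead of working outward from a complete region):

import numpy as np

def untangle(arr, window=5):
    # assumes one row per signal and one column per time step, i.e. the
    # question's (400, 8) array transposed to (8, 400), with zeros = missing
    data = arr.astype(float)
    data[data == 0] = np.nan            # treat zeros as missing
    data = np.sort(data, axis=0)        # sort each column; NaNs sort last
    n_rows, n_cols = data.shape
    out = np.full(data.shape, np.nan)

    for j in range(n_cols):
        col = data[:, j]
        valid = col[~np.isnan(col)]
        if valid.size == n_rows:        # complete column: rows already correct
            out[:, j] = col
            continue
        # local row averages from neighbouring columns, ignoring NaNs
        lo, hi = max(0, j - window), min(n_cols, j + window + 1)
        ref = np.nanmean(data[:, lo:hi], axis=1)
        free = np.ones(n_rows, dtype=bool)
        for v in valid:
            dist = np.abs(ref - v)
            dist[np.isnan(dist)] = np.finfo(float).max  # rows with no reference go last
            dist[~free] = np.inf                        # at most one value per row
            i = int(np.argmin(dist))
            out[i, j] = v
            free[i] = False
    return out

Rows left as NaN in the output mark the missing values; convert them back to 0 if needed.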

Pandas Get Row with Smallest Distance Between Strings (Closest Match)

Suppose I have a dataframe with an index column filled with strings. Now, suppose I have very similar but somewhat different strings that I want to use to look up rows within the dataframe. How would I do this since they aren't identical? My guess would be to simply choose the row with the lowest distance between the two strings, but I'm not really sure how I could do that efficiently.
For example, if my dataframe is:
and I want to look up "Lord of the rings", I should get the 2nd row. How would I do this in pandas?
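One way to implement the "lowest string distance" idea with only the standard library is difflib; a minimal sketch (the dataframe below is invented to mirror the example):

import difflib
import pandas as pd

# hypothetical data: the index holds the titles to search against
df = pd.DataFrame({'year': [1937, 1954, 1997]},
                  index=['The Hobbit', 'The Lord of the Rings', 'Harry Potter'])

def lookup_closest(df, query):
    # rank the index entries by string similarity; cutoff=0 always returns
    # the best match, however weak it is
    match = difflib.get_close_matches(query, df.index, n=1, cutoff=0)[0]
    return df.loc[match]

print(lookup_closest(df, 'Lord of the rings'))  # returns the LOTR row

This is still a linear scan per lookup; a dedicated fuzzy-matching library would be faster on large frames, but the idea is the same.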

How can I add a column of one data frame to another based on the nearest identifier?

Problem:
I have a data frame foo that contains measurements and a common_step column, which contains integers indicating when each row was measured.
I have a second data frame that also contains a common_step column and a bar_step column. It translates between the two integer steps.
I would like to add bar_step as a column to foo. However, the common_step values of both data frames are not aligned.
Thus, for each row in foo, I would like to find the row in bar with the nearest common_step and add its bar_step to the foo row.
I have found a way to do this. However, the solution is very slow, because for every row in foo it searches through all rows in bar to find the one with the closest common_step.
foo.sort_values('common_step', inplace=True)
bar.sort_values('common_step', inplace=True)
def find_nearest(foo_row):
    index = abs(bar.common_step - foo_row.common_step).idxmin()
    return bar.loc[index].bar_step

foo['bar_step'] = foo.apply(find_nearest, axis=1)
Questions:
How can I add the closest match for bar_step to the foo data frame in sub-quadratic run time?
Moreover, it would be ideal to have a flag that chooses the row with the closest but smaller common_step.
As @QuangHoang suggested in the comments, merge_asof does this. Moreover, the second data frame should be reduced to the key and the column being added, so that its other columns do not interfere with existing columns in the first one:
foo.sort_values('common_step', inplace=True)
bar.sort_values('common_step', inplace=True)
bar = bar[['bar_step', 'common_step']]
foo = pandas.merge_asof(foo, bar, on='common_step', direction='backward')
The direction parameter specifies whether to use the nearest lower match, nearest higher match, or nearest match considering both directions. From the documentation:
A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.
A “forward” search selects the first row in the right DataFrame whose ‘on’ key is greater than or equal to the left’s key.
A “nearest” search selects the row in the right DataFrame whose ‘on’ key is closest in absolute distance to the left’s key.
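A tiny invented example of the backward behaviour (the column names follow the question; the numbers are made up):

import pandas as pd

foo = pd.DataFrame({'common_step': [3, 10, 27], 'measurement': [0.1, 0.5, 0.9]})
bar = pd.DataFrame({'common_step': [0, 8, 25, 30], 'bar_step': [100, 101, 102, 103]})

# merge_asof requires both frames to be sorted on the key
foo = foo.sort_values('common_step')
bar = bar.sort_values('common_step')

# 'backward' keeps the last bar row whose common_step is <= the foo key,
# i.e. the "closest but not larger" behaviour asked for
print(pd.merge_asof(foo, bar[['bar_step', 'common_step']],
                    on='common_step', direction='backward'))
# common_step 3 -> bar_step 100, 10 -> 101, 27 -> 102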

Pandas dataframe returns incorrect sort using two float columns

I am playing with some geo data. Given a point, I am trying to map to an object. So for each connection, I generate two distances, both floats. To find the closest, I want to sort my dataframe by both distances and pick the top row.
Unfortunately when I run a sort (df.sort_values(by=['direct distance', 'pt_to_candidate'])) I get the following out-of-order result
I would expect the top two rows, but flipped. If I run the sort on either column alone, I get the expected results. If I flip the order of the sort (['pt_to_candidate', 'direct distance']) I get a correct ordering, though not the one I want for my function.
Both columns are type float64.
Why is this sort returning oddly?
For completeness, I should state that I have more columns and rows. From the main dataframe, I filter first and then sort. Also, I cannot recreate the problem by manually entering data into a new dataframe, so I suspect float precision is the issue.
Edit
Adding a value_counts on 'direct distance'
4.246947 7
3.147303 2
2.875081 1
2.875081 1
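The value_counts edit hints at the likely explanation: two values that display as 2.875081 can still be different floats, so the sort may be correct even though it looks wrong at the default display precision. A sketch reproducing that symptom with invented numbers:

import pandas as pd

df = pd.DataFrame({'direct distance': [4.246947, 2.8750814, 2.8750806],
                   'pt_to_candidate': [1.0, 2.0, 3.0]})

# looks out of order: both rows print as 2.875081, yet pt_to_candidate
# comes out as 3.0 before 2.0
print(df.sort_values(by=['direct distance', 'pt_to_candidate']))

# repr exposes the stored values, so the hidden difference becomes visible
print(df['direct distance'].map(repr))

# if differences below the display precision should not matter, round first
rounded = df.assign(**{'direct distance': df['direct distance'].round(6)})
print(rounded.sort_values(by=['direct distance', 'pt_to_candidate']))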

Rejecting zero values when creating a list of minimum values. (Python Field Calc)

I'm trying to create a list of minimum values from four columns of values. Below is the statement I have used.
min ([!Depth!, !Depth_1!, !Depth_12!, !Depth_1_13!])
The problem I'm having is that some of the fields under these columns contain zeros. I need it to return the next lowest value from the columns that is greater than zero.
I have an attribute table for a shapefile from an ArcGIS document. It has 10 columns. ID, Shape, Buffer ID (x4), Depth (x4).
I need to add an additional column to this data which represents the minimum number from the 4 depth columns. Many of the cells in this column are equal to zero. I need the new column to take the minimum value from the four depth columns but ignore the zero values and take the next lowest value.
A screen shot of what I am working from:
Create a function that does it for you. I added a pic so you can follow the steps. Just change the input names to your column names.
def my_min(d1, d2, d3, d4):
    lst = [d1, d2, d3, d4]
    return min([x for x in lst if x != 0])
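The expression box then calls it on the four depth fields, e.g. my_min(!Depth!, !Depth_1!, !Depth_12!, !Depth_1_13!). One caveat (my assumption, not part of the original answer): if all four depths in a row are zero, the list comprehension is empty and min() raises a ValueError; a variant that falls back to 0 in that case:

def my_min_safe(d1, d2, d3, d4):
    # hypothetical variant: ignore zeros, but return 0 if nothing is left
    nonzero = [x for x in (d1, d2, d3, d4) if x != 0]
    return min(nonzero) if nonzero else 0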
