Find and Delete out of range data from dataframe - python

I am working with a latitude and longitude dataset (pandas DataFrame). I tried to find the minimum and maximum values of latitude and longitude to get a bounding box and export the image of the area from OpenStreetMap. The results show out-of-range data, and I want to find and remove all of it from my dataset. I'm not sure what the best way of doing this is.
Thanks for any help.

I think you can use np.where. For example, if you want to filter longitude and latitude by a certain range, you can use it this way:
np.where((df['longitude'] >= 100) & (df['latitude'] <= 100) | (df['longitude'] <= 100) & (df['latitude'] >= 100))
Of course, you need to modify the range, column names, and comparison logic based on your requirements. More information would certainly help in answering the question better.
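If the goal is simply to drop rows whose coordinates are physically impossible, a minimal sketch (assuming the columns are named 'latitude' and 'longitude' and using the standard bounds of -90 to 90 and -180 to 180) would be a boolean mask:

import pandas as pd

# keep only rows whose coordinates fall inside the valid ranges
in_range = df["latitude"].between(-90, 90) & df["longitude"].between(-180, 180)
df_clean = df[in_range]
# or, to drop the out-of-range rows from the original frame:
# df = df.drop(df[~in_range].index)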

Related

Aggregate raster data based on geopandas DataFrame geometries

Situation
I have two datasets:
Raster data loaded using rioxarray, an xarray.DataArray
A geopandas.GeoDataFrame with geometries indicating areas in the first dataset
The geo data in both datasets are in the same CRS (EPSG:4326).
Problem
For each entry in the second dataset I want to aggregate all values from the first dataset that overlap with the specific geometry. Kind of like a .groupby() on the geometries followed by .sum().
Current WIP approach
The package xagg already does this, but it is unfortunately slow on a subset of my dataset and scales even worse when I try to use it on the full dataset.
Question
Is there a simple and efficient way to do this in Python?
(The solution doesn't need to replicate the results from xagg exactly.)
Regarding your comment, here is some pseudocode I've used to do what you're after. The function being executed outputs files in this case. If it's not obvious, this strategy isn't going to help if you just have one big raster and one big polygon file: it assumes tiled data and uses an indexing system to match the right rasters with the overlying polygons. The full example is beyond the scope of a single answer, but if you run into issues with the specifics I can try to assist. In addition to dask's good documentation, there are plenty of other posts here with dask.delayed examples.
import dask

results_list = []
for f in raster_file_list:
    # build a lazy task per raster tile; nothing runs yet
    delayed_result = dask.delayed(your_custom_function)(f, your_custom_function_arg_1, your_custom_function_arg_2)
    results_list.append(delayed_result)
# run all tasks in parallel across processes
results = dask.compute(*results_list, scheduler='processes')
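If your data is not already tiled, another option that might be worth trying before writing a custom function is rasterstats, which computes zonal statistics per geometry straight from a raster file. This is only a sketch, assuming the raster has been written to disk (the path "raster.tif" is a placeholder) and gdf is the GeoDataFrame from the question:

from rasterstats import zonal_stats

# gdf: the GeoDataFrame of areas (EPSG:4326); "raster.tif" is a placeholder path
stats = zonal_stats(gdf, "raster.tif", stats="sum")   # one dict per geometry
gdf["raster_sum"] = [s["sum"] for s in stats]         # the group-by-geometry + sum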

Interpolation in data frame without making a new data frame

I have a pandas DataFrame with an index and just one column. The index holds dates and the column holds values. I would like to find the NewValue for a NewDate that is not in the index. To do that, I suppose I could use an interpolation function such as NewValue = InterpolationFunction(NewDate, Index, Column, method, etc.). So what is that interpolation function? It seems that most interpolation functions are used for padding, i.e. filling in missing values in an existing frame. That is not what I want; I just want the NewValue, not to build a new DataFrame.
Thank you very much for taking the time to read this.
I am afraid that you cannot find the missing values without constructing a base for your data. Here is the answer to your question if you build a base DataFrame:
You need to construct a panel in order to set up your data for proper interpolation. For instance, in the case of dates, let's say that you have yearly data and you want to add information for a missing year in between, or generate data at quarterly intervals.
You need to construct a base for the time series, i.e.:
import pandas as pd
# quarterly date base covering the period of interest
dates = pd.date_range(start="1987-01-01", end="2021-01-01", freq="Q").values
panel = pd.DataFrame({'date_Q': dates})
Now you can join your data to this structure, and any dates without information will show up as missing. Then you need a proper interpolation algorithm to fill the missing values. Pandas' .interpolate() method offers some basic interpolation methods, such as polynomial and linear, which you can find here.
However, many more interpolation methods are offered by SciPy, which you can find in the tutorials here.
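To make the idea concrete, here is a rough sketch of the join-then-interpolate step, assuming df is the original single-column DataFrame with a DatetimeIndex and a column named 'Value' (both are placeholder names):

import pandas as pd

dates = pd.date_range(start="1987-01-01", end="2021-01-01", freq="Q")
panel = pd.DataFrame({'date_Q': dates})

# join the sparse data onto the dense quarterly base; absent dates become NaN
merged = panel.merge(df, left_on='date_Q', right_index=True, how='left')

# fill the gaps with a basic linear interpolation
merged['Value'] = merged['Value'].interpolate(method='linear')

# the interpolated value for any quarter-end NewDate can then be read off directly
new_value = merged.loc[merged['date_Q'] == '2000-03-31', 'Value'].iloc[0]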

Pandas.. does quantile function need sorted data to calculate percentiles?

I'm using pandas to clean up some data and do basic statistics. I am wondering whether quantile() sorts the values before the calculation, or whether I must sort them beforehand.
For example, here I'm trying to get the 50th percentile of the number of workers in each company
Percentile50th = Y2015_df.groupby(["company"])["worker"].quantile(0.50)
I'm asking because when I was verifying my values against the results in MS Excel, I discovered that the MEDIAN function requires the data to be sorted in order to get the right median. But I'm not sure if that's the case in pandas.
You do not need to sort; pandas computes quantiles correctly on unsorted data. See the link in my previous comment for an example.
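As a quick sanity check, a small sketch with made-up numbers shows that quantile() returns the same result whether or not the data is pre-sorted:

import pandas as pd

s = pd.Series([7, 1, 5, 3, 9])
print(s.quantile(0.50))                 # 5.0
print(s.sort_values().quantile(0.50))   # 5.0 -- identical

df = pd.DataFrame({"company": ["A", "A", "B", "B", "B"],
                   "worker":  [30, 10, 50, 20, 40]})
print(df.groupby("company")["worker"].quantile(0.50))
# company
# A    20.0
# B    40.0
# Name: worker, dtype: float64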

How to find distribution of a grouped column with respect to another column in excel?

I'm an Excel newbie and aspiring data analyst. I have this data and I want to find the distribution of city-wise shopping experience. Column M has the shopping experience rated from 1 to 5.
What I tried
I am not able to google how to do this at all. I tried running a correlation, but the built-in Excel data analysis tool does not let me run it on non-numeric data, and I am not able to group the City cells either. I thought of replacing every city with a numeric alias, but I don't know how to do that either. How should I search for, or go about solving, this problem?
Update: I was thinking of some way to get this out of the Cities column.
I am thinking this is better done in Python.
How about something like this? I have just taken the cities and data to show AVERAGEIF, SUMIF and COUNTIF:
I used Data validation to provide the list to select from.
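For the Python route the question hints at, a rough pandas sketch (the file name and the column names "City" and "Experience" are placeholders for whatever the sheet actually uses) could look like this:

import pandas as pd

df = pd.read_excel("survey.xlsx")                      # placeholder file name

# count of each 1-5 rating per city, i.e. the distribution
distribution = pd.crosstab(df["City"], df["Experience"])

# average rating per city (the AVERAGEIF equivalent)
avg_rating = df.groupby("City")["Experience"].mean()

print(distribution)
print(avg_rating)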

Merging millions of polylines into single unique polyline

I am trying to figure out an efficient way to calculate the uniqueness of millions of paths, i.e. to calculate coverage.
I have millions of [(lon,lat),...] points that represent paths taken around a given city.
I am trying to figure out an efficient way to calculate unique mileage, so that points from one side of the street cancel each other out, paths heading in different directions cancel each other out, and so on, leaving only one path representing that a given area has been walked.
Right now I loop through the dataset, adding points to a new list while comparing each new point's distance to every point already in the new list; if it is within a certain radius I consider it non-unique and don't include it.
This is slow.
This has to be a solved problem, but alas I cannot find anything helpful online as of yet. Does anyone have any recommendations, formulas, or general advice on merging [(lon,lat),...] paths to represent unique coverage in miles?
Since your intermediate goal seems to be reducing the number of points in your polylines, and they are conveniently situated in a city, you could try map matching them to known streets and paths to achieve that. This GIS SE question might be helpful in that regard.
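If map matching is overkill, one rough approximation (a sketch only, and not the method suggested above; paths is assumed to be the list of [(lon, lat), ...] polylines) is to snap each segment's endpoints to a coarse grid so nearby passes collapse onto the same segment, keep a set of unique undirected segments, and sum their geodesic lengths:

from pyproj import Geod

geod = Geod(ellps="WGS84")
GRID = 4  # decimal places to round to (~11 m in latitude); tune to street spacing

def snap(point):
    lon, lat = point
    return (round(lon, GRID), round(lat, GRID))

unique_segments = set()
for path in paths:                               # paths: list of [(lon, lat), ...]
    for a, b in zip(path, path[1:]):
        seg = frozenset((snap(a), snap(b)))      # undirected: both travel directions collapse
        if len(seg) == 2:                        # skip segments that snapped to a single point
            unique_segments.add(seg)

total_metres = 0.0
for seg in unique_segments:
    (lon1, lat1), (lon2, lat2) = tuple(seg)
    _, _, dist = geod.inv(lon1, lat1, lon2, lat2)
    total_metres += dist

print(f"unique coverage: {total_metres / 1609.344:.1f} miles")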
