Performing calculations on DataFrames of different lengths - python

I have two different DataFrames that look something like this:
    Lat     Lon
  28.13  -87.62
  28.12  -87.65
    ...     ...

  Calculated_Dist_m
               34.5
              101.7
                ...
The first DataFrame (name=df) (consisting of the Lat and Lon columns) has just over 1000 rows (values) in it. The second DataFrame (name=new_calc_dist) (consisting of the Calculated_Dist_m column) has over 30000 rows (values) in it. I want to determine the new longitude and latitude coordinates using the Lat, Lon, and Calculated_Dist_m columns. Here is the code I've tried:
r_earth = 6371000
new_lat = df['Lat'] + (new_calc_dist['Calculated_Dist_m'] / r_earth) * (180/np.pi)
new_lon = df['Lon'] + (new_calc_dist['Calculated_Dist_m'] / r_earth) * (180/np.pi) / np.cos(df['Lat'] * np.pi/180)
When I run the code, however, it only gives me new calculations for certain index values, and gives me NaNs for the rest. I'm not entirely sure how I should go about writing the code so that new longitude and latitude points are calculated for each of over 30000 row values based on the initial 1000 longitude and latitude points. Any suggestions?
EDIT
Here would be some sample outputs. Note that these are not exact figures, but give the idea.
    Lat     Lon
  28.13  -87.62
  28.12  -87.65
  28.12  -87.63
    ...     ...

  Calculated_Dist_m
               34.5
              101.7
               28.6
               30.8
               76.5
                ...
And so the sample output would be:
     Lat      Lon
  28.125  -87.625
  28.15   -87.61
  28.127  -87.623
  28.128  -87.623
  28.14   -87.615
  28.115  -87.655
  28.14   -87.64
  28.117  -87.653
  28.118  -87.653
  28.15   -87.645
  28.115  -87.635
  28.14   -87.62
  28.115  -87.613
  28.117  -87.633
  28.118  -87.633
     ...      ...
Again, these are just random outputs (I tried getting the exact calculations, but could not get it to work). But overall, this gives an idea of what would be wanted: taking the coordinates from the first dataframe and calculating new coordinates based on each of the calculated distances from the second dataframe.

If I understood correctly and assuming df1 and df2 as input, you can perform a cross merge to get all combinations of df1 and df2 rows, then apply your computation (here as new columns Lat2/Lon2):
df = df1.merge(df2, how='cross')
r_earth = 6371000
df['Lat2'] = df['Lat'] + (df['Calculated_Dist_m'] / r_earth) * (180/np.pi)
df['Lon2'] = df['Lon'] + (df['Calculated_Dist_m'] / r_earth) * (180/np.pi) / np.cos(df['Lat'] * np.pi/180)
output:
Lat Lon Calculated_Dist_m Lat2 Lon2
0 28.13 -87.62 34.5 28.130310 -87.619648
1 28.13 -87.62 101.7 28.130915 -87.618963
2 28.13 -87.62 28.6 28.130257 -87.619708
3 28.13 -87.62 30.8 28.130277 -87.619686
4 28.13 -87.62 76.5 28.130688 -87.619220
5 28.12 -87.65 34.5 28.120310 -87.649648
6 28.12 -87.65 101.7 28.120915 -87.648963
7 28.12 -87.65 28.6 28.120257 -87.649708
8 28.12 -87.65 30.8 28.120277 -87.649686
9 28.12 -87.65 76.5 28.120688 -87.649220
10 28.12 -87.63 34.5 28.120310 -87.629648
11 28.12 -87.63 101.7 28.120915 -87.628963
12 28.12 -87.63 28.6 28.120257 -87.629708
13 28.12 -87.63 30.8 28.120277 -87.629686
14 28.12 -87.63 76.5 28.120688 -87.629220

In case you just want the result as two 2D arrays (without repeating the input values; still O(m*n) in memory, but only 2/5 of what the cross-join result requires):
r_earth = 6371000
z = 180 / np.pi * new_calc_dist['Calculated_Dist_m'].values / r_earth
lat = df['Lat'].values
lon = df['Lon'].values
new_lat = lat[:, None] + z
new_lon = lon[:, None] + z / np.cos(np.radians(lat))[:, None]
Example:
df = pd.DataFrame([[28.13, -87.62], [28.12, -87.65]], columns=['Lat', 'Lon'])
new_calc_dist = pd.DataFrame([[34.5], [101.7], [60.0]], columns=['Calculated_Dist_m'])
# result of above
>>> new_lat
array([[28.13031027, 28.13091461, 28.13053959],
       [28.12031027, 28.12091461, 28.12053959]])
>>> new_lon
array([[-87.61964818, -87.61896289, -87.61938813],
       [-87.64964821, -87.64896298, -87.64938819]])
If you do want those results as DataFrames:
kwargs = dict(index=df.index, columns=new_calc_dist.index)
new_lat = pd.DataFrame(new_lat, **kwargs)
new_lon = pd.DataFrame(new_lon, **kwargs)
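If you would rather have a single long-format table (one row per combination), a small sketch that builds on the two DataFrames above by stacking and joining them (this is an illustration, not part of the original answer; the level names df_index/dist_index are arbitrary):
# combine the two wide DataFrames into one long-format result
long_df = pd.concat({'new_lat': new_lat.stack(), 'new_lon': new_lon.stack()}, axis=1)
long_df = long_df.rename_axis(['df_index', 'dist_index']).reset_index()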

Related

removing similar data after grouping and sorting python

I have this data:
lat = [79.211, 79.212, 79.214, 79.444, 79.454, 79.455, 82.111, 82.122, 82.343, 82.231, 79.211, 79.444]
lon = [0.232, 0.232, 0.233, 0.233, 0.322, 0.323, 0.321, 0.321, 0.321, 0.411, 0.232, 0.233]
val = [2.113, 2.421, 2.1354, 1.3212, 1.452, 2.3553, 0.522, 0.521, 0.5421, 0.521, 1.321, 0.422]
df = pd.DataFrame({"lat": lat, 'lon': lon, 'value':val})
and I am grouping it by lat & lon and then sorting by the value column and taking the top 5 as shown below:
grouped = df.groupby(["lat", "lon"])
val_max = grouped['value'].max()
df_1 = pd.DataFrame(val_max)
df_1 = df_1.sort_values('value', ascending = False)[0:5]
The output I get is this:
value
lat lon
79.212 0.232 2.4210
79.455 0.323 2.3553
79.214 0.233 2.1354
79.211 0.232 2.1130
79.454 0.322 1.4520
I want to remove any row whose lat and lon are within 1 in the last decimal place of any of the rows above it. So we see that row 1 is almost the same location as row 4, and row 2 is almost the same location as row 5, so rows 4 and 5 would be replaced by the next-ranked lat/lon, which would make the output:
value
lat lon
79.212 0.232 2.4210
79.455 0.323 2.3553
79.214 0.233 2.1354
82.343 0.321 0.5421
82.111 0.321 0.5220
Please let me know how I can do this.
You could sort the dataframe, like this:
grouped = df.groupby(["lat", "lon"])
val_max = grouped["value"].max()
df_1 = pd.DataFrame(val_max)
df_1 = (
    df_1.sort_values("value", ascending=False)
    .reset_index()
    .sort_values(["lat", "lon"])
)
Then, iterate on each row, compare it to the previous one, and drop the earlier row of each similar pair:
# Make the positional order and the index labels agree
df_1 = df_1.reset_index(drop=True)

# Mark rows whose lat and lon are both within 0.001 of the previous row
df_1["match"] = ""
for i in range(1, df_1.shape[0]):
    if (
        abs(df_1.iloc[i, 0] - df_1.iloc[i - 1, 0]) <= 0.001
        and abs(df_1.iloc[i, 1] - df_1.iloc[i - 1, 1]) <= 0.001
    ):
        df_1.loc[i, "match"] = pd.NA

# Remove the row preceding each marked row and clean up
index = [i - 1 for i in df_1[df_1["match"].isna()].index]
df_1 = df_1.drop(index=index).drop(columns="match").reset_index(drop=True)
Which outputs:
print(df_1)
lat lon value
0 79.212 0.232 2.4210
1 79.214 0.233 2.1354
2 79.444 0.233 1.3212
3 79.455 0.323 2.3553
4 82.111 0.321 0.5220
5 82.122 0.321 0.5210
6 82.231 0.411 0.5210
7 82.343 0.321 0.5421
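The same previous-row comparison can also be written without the explicit loop, using shift(); a minimal sketch under the same assumptions (run in place of the loop, on the sorted df_1 with a reset index), not part of the original answer:
prev = df_1[["lat", "lon"]].shift()
close = (
    (df_1["lat"] - prev["lat"]).abs().le(0.001)
    & (df_1["lon"] - prev["lon"]).abs().le(0.001)
)
# drop the row *before* each close match, as the loop above does
df_1 = df_1.drop(index=close[close].index - 1).reset_index(drop=True)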

xarray groupby coordinates and non coordinate variables

I am trying to calculate the distribution of a variable in an xarray Dataset. I can achieve what I am looking for by converting the xarray to a pandas dataframe as follows:
lon = np.linspace(0,10,11)
lat = np.linspace(0,10,11)
time = np.linspace(0,10,1000)
temperature = 3*np.random.randn(len(lat),len(lon),len(time))
ds = xr.Dataset(
    data_vars=dict(
        temperature=(["lat", "lon", "time"], temperature),
    ),
    coords=dict(
        lon=lon,
        lat=lat,
        time=time,
    ),
)
bin_t = np.linspace(-10,10,21)
DS = ds.to_dataframe()
DS.loc[:,'temperature_bin'] = pd.cut(DS['temperature'],bin_t,labels=(bin_t[0:-1]+bin_t[1:])*0.5)
DS_stats = DS.reset_index().groupby(['lat','lon','temperature_bin']).count()
ds_stats = DS_stats.to_xarray()
<xarray.Dataset>
Dimensions: (lat: 11, lon: 11, temperature_bin: 20)
Coordinates:
* lat (lat) float64 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0
* lon (lon) float64 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0
* temperature_bin (temperature_bin) float64 -9.5 -8.5 -7.5 ... 7.5 8.5 9.5
Data variables:
time (lat, lon, temperature_bin) int64 0 1 8 13 18 ... 9 5 3 0
temperature (lat, lon, temperature_bin) int64 0 1 8 13 18 ... 9 5 3 0
Is there a way to generate ds_stats without converting to a dataframe? I have tried to use groupby_bins but this does not preserve coordinates.
print(ds.groupby_bins('temperature',bin_t).count())
distributed.utils_perf - WARNING - full garbage collections took 21% CPU time recently (threshold: 10%)
<xarray.Dataset>
Dimensions: (temperature_bins: 20)
Coordinates:
* temperature_bins (temperature_bins) object (-10.0, -9.0] ... (9.0, 10.0]
Data variables:
temperature (temperature_bins) int64 121 315 715 1677 ... 709 300 116
Using xhistogram may be helpful.
With the same definitions as you had set above,
from xhistogram import xarray as xhist
ds_stats = xhist.histogram(ds.temperature, bins=bin_t,dim=['time'])
should do the trick.
The one difference is that it returns a DataArray, not a Dataset, so if you want to do it for multiple variables, you'll have to do it separately for each one and then recombine, I believe.
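If you do need a Dataset covering several variables, one possible sketch (my assumption, not from the xhistogram docs) is to reuse the histogram() call above for each variable and merge the named results; any additional data variable would simply be appended to the list:
import xarray as xr
from xhistogram import xarray as xhist

counts = [
    xhist.histogram(ds[name], bins=bin_t, dim=['time']).rename(name + '_counts')
    for name in ['temperature']  # add other data variable names here
]
ds_stats = xr.merge(counts)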

PCA transformation of xarray.Dataset

I need to apply a PCA transformation to some Landsat (satellite imagery) scenes stored as an xarray.Dataset and containing nan values (for technical reasons, all bands of a given nodata pixel will be nan).
Here is the code to create an example dataset:
import numpy as np
import xarray as xr
# Create a demo xarray.Dataset
ncols = 25
nrows = 50
lon = [50 + x * 0.2 for x in range(nrows)]
lat = [30 + x * 0.2 for x in range(ncols)]
red = np.random.rand(nrows, ncols) * 10000
green = np.random.rand(nrows, ncols) * 10000
blue = np.random.rand(nrows, ncols) * 10000
nir = np.random.rand(nrows, ncols) * 10000
swir1 = np.random.rand(nrows, ncols) * 10000
swir2 = np.random.rand(nrows, ncols) * 10000
ds = xr.Dataset({'red': (['longitude', 'latitude'], red),
                 'green': (['longitude', 'latitude'], green),
                 'blue': (['longitude', 'latitude'], blue),
                 'nir': (['longitude', 'latitude'], nir),
                 'swir1': (['longitude', 'latitude'], swir1),
                 'swir2': (['longitude', 'latitude'], swir2)},
                coords={'longitude': (['longitude'], lon),
                        'latitude': (['latitude'], lat)})
# To keep example realistic let's add some nodata
ds = ds.where(ds.latitude + ds.longitude < 90)
print(ds)
<xarray.Dataset>
Dimensions:    (latitude: 25, longitude: 50)
Coordinates:
  * longitude  (longitude) float64 50.0 50.2 50.4 50.6 50.8 51.0 51.2 51.4 ...
  * latitude   (latitude) float64 30.0 30.2 30.4 30.6 30.8 31.0 31.2 31.4 ...
Data variables:
    red        (longitude, latitude) float64 6.07e+03 13.8 9.682e+03 ...
    green      (longitude, latitude) float64 5.476e+03 350.4 7.556e+03 ...
    blue       (longitude, latitude) float64 4.306e+03 2.104e+03 9.267e+03 ...
    nir        (longitude, latitude) float64 1.445e+03 8.633e+03 6.388e+03 ...
    swir1      (longitude, latitude) float64 6.005e+03 7.692e+03 4.004e+03 ...
    swir2      (longitude, latitude) float64 8.235e+03 3.127e+03 674.6 ...
After a search on the internet, I tried unsuccessfully to implement the sklearn.decomposition PCA functions.
I first convert each two-dimensional band into a single dimension:
# flatten dataset
tmp_list = []
for b in ['red', 'green', 'blue', 'nir', 'swir1', 'swir2']:
    tmp_list.append(ds[b].values.flatten().astype('float64'))
flat_ds = np.array(tmp_list)
Then I tried to compute the PCA and transform the original data in a location without nan. I managed to generate some output, but it was totally different from the one generated with ArcGIS or GRASS.
When I changed my location, it appeared the sklearn functions are not able to process data containing nan. So I removed the nan values from the flattened dataset, which is problematic when I deflate the flattened PCA result, as it no longer contains a multiple of the original dataset dimensions.
# deflate PCAs
dims = ds.dims['longitude'], ds.dims['latitude']
pcas = xr.Dataset()
for i in range(flat_pcas.shape[0]):
    pcas['PCA_%i' % (i + 1)] = xr.DataArray(np.reshape(flat_pcas[i], dims),
                                            coords=[ds.longitude.values, ds.latitude.values],
                                            dims=['longitude', 'latitude'])
To summarize the situation:
Does another, simpler approach exist to implement a PCA transformation on an xarray.Dataset?
How should the nan values be dealt with?
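For reference, a minimal sketch of the mask-and-reinsert idea described above, assuming scikit-learn's PCA and the flat_ds array built earlier (an illustration of mine, not taken from the answers below):
import numpy as np
from sklearn.decomposition import PCA

X = flat_ds.T                         # shape: (pixels, bands)
valid = ~np.isnan(X).any(axis=1)      # pixels where every band is valid

pca = PCA(n_components=X.shape[1])
scores = pca.fit_transform(X[valid])  # fit/transform on valid pixels only

flat_pcas = np.full_like(X, np.nan)   # nan wherever the input was nan
flat_pcas[valid] = scores
flat_pcas = flat_pcas.T               # back to (components, pixels) for the deflate step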
Try to use eofs, available here: https://github.com/ajdawson/eofs
In the documentation they say:
Transparent handling of missing values: missing values are removed automatically when computing EOFs and re-inserted into output fields.
I have used this a few times and I have found it very well-designed.
You can also use the EOFs available from pycurrents (https://currents.soest.hawaii.edu/ocn_data_analysis/installation.html)
I have an example at https://github.com/manmeet3591/Miscellaneous/blob/master/EOF/global_sst.ipynb

Gmaps Distance Matrix : How to iterate over sequence of rows in data frame and calculate distance

Please see attached snapshot as I ask my question.
I have a data frame with pairs of latitudes/longitudes that have been geocoded by a different program, and I am trying to generate a distance matrix between (Latitude1/Longitude1) and (Latitude2/Longitude2) and so on, to find the distance between locations.
My program below doesn't seem to read all the rows.
import pandas as pd
import googlemaps
import requests, json
gmaps = googlemaps.Client(key='123')
source = pd.DataFrame({'Latitude': df['Latitude1'] ,'Longitude': df['Longitude1']})
destination = pd.DataFrame({'Latitude': df['Latitude2'] ,'Longitude': df['Longitude2']})
source = source.reset_index(drop=True)
destination = destination.reset_index(drop=True)
for i in range(len(source)):
    result = gmaps.distance_matrix(source, destination)
    print(result)
Expected Output
Distance
12 Miles
10 Miles
5 Miles
1 Mile
DataFrame
Key Latitude1 Longitude1 Latitude2 Longitude2
1 42 -91 40 -92
2 39 -94.35 38 -94
3 37 -120 36 -120
4 28.7 -90 35 -90
5 40 -94 38 -90
6 30 -90 25 -90
I haven't used gmaps, but this is a simple formula for calculating distance.
This is just maths, so I won't explain it here.
Just know you need 2 locations in the format (lat, lon) as the arguments and need to import math
import math

def distance(origin, destination):
    lat1, lon1 = origin
    lat2, lon2 = destination
    radius = 3959  # mi

    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) * math.sin(dlat / 2) + math.cos(math.radians(lat1)) \
        * math.cos(math.radians(lat2)) * math.sin(dlon / 2) * math.sin(dlon / 2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    d = radius * c

    return d
Now we need to merge the two dataframes:
maindf = pd.merge(source, destination, left_index=True, right_index=True, suffixes=('1', '2'))
Next you need to apply that to each row:
maindf['Distance'] = maindf.apply(lambda row: distance((row.Latitude1, row.Longitude1),
                                                       (row.Latitude2, row.Longitude2)), axis=1)
Apply loops over the dataframe and applies the function.
In this case it applies distance to every row, based on the two lat/lon pairs in each row.
This adds a new column 'Distance' with the distance in miles between the 2 locations.
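Alternatively, since the original df already holds both coordinate pairs, the same function can be applied directly without the merge (a sketch, assuming the Latitude1/Longitude1/Latitude2/Longitude2 columns shown in the question):
df['Distance'] = df.apply(
    lambda row: distance((row.Latitude1, row.Longitude1),
                         (row.Latitude2, row.Longitude2)),
    axis=1)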
I would also add that, if that is your full code, you never actually add any data to the dataframes.

Vectorization to calculate many distances

I am new to numpy/pandas and vectorized computation. I am doing a data task where I have two datasets. Dataset 1 contains a list of places with their longitude and latitude and a variable A. Dataset 2 also contains a list of places with their longitude and latitude. For each place in dataset 1, I would like to calculate its distances to all the places in dataset 2 but I would only like to get a count of places in dataset 2 that are less than the value of variable A. Note also both of the datasets are very large, so that I need to use vectorized operations to expedite the computation.
For example, my dataset1 may look like below:
id lon lat varA
1 20.11 19.88 100
2 20.87 18.65 90
3 18.99 20.75 120
and my dataset2 may look like below:
placeid lon lat
a 18.75 20.77
b 19.77 22.56
c 20.86 23.76
d 17.55 20.74
Then for id == 1 in dataset1, I would like to calculate its distances to all four points (a, b, c, d) in dataset2, and I would like to have a count of how many of the distances are less than the corresponding value of varA. For example, if the four calculated distances are 90, 70, 120, 110 and varA is 100, then the value should be 2.
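As a quick illustration, that count can be computed from the numbers above as:
import numpy as np
np.sum(np.array([90, 70, 120, 110]) < 100)  # -> 2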
I already have a vectorized function to calculate the distance between two pairs of coordinates. Supposing the function (haversine(x, y)) is properly implemented, I have the following code.
dataset2['count'] = dataset1.apply(
    lambda x: haversine(x['lon'], x['lat'], dataset2['lon'], dataset2['lat']).shape[0],
    axis=1)
However, this gives the total number of rows, but not the ones that satisfy my requirements.
Would anyone be able to point out how I can make the code work?
If you can project the coordinates to a local projection (e.g. UTM), which is pretty straightforward with pyproj and generally more favorable than lon/lat for measurement, then there is a much, much faster way using scipy.spatial. Neither df['something'] = df.apply(...) nor np.vectorize() is truly vectorized; under the hood, they use looping.
ds1
id lon lat varA
0 1 20.11 19.88 100
1 2 20.87 18.65 90
2 3 18.99 20.75 120
ds2
placeid lon lat
0 a 18.75 20.77
1 b 19.77 22.56
2 c 20.86 23.76
3 d 17.55 20.74
from scipy.spatial import distance
# get coordinates of each set of points as numpy arrays
coords_a = ds1.values[:,(1,2)]
coords_b = ds2.values[:, (1,2)]
coords_a
#out: array([[ 20.11, 19.88],
# [ 20.87, 18.65],
# [ 18.99, 20.75]])
distances = distance.cdist(coords_a, coords_b)
#out: array([[ 1.62533074, 2.70148108, 3.95182236, 2.70059253],
# [ 2.99813275, 4.06178532, 5.11000978, 3.92307278],
# [ 0.24083189, 1.97091349, 3.54358575, 1.44003472]])
distances is in fact the distance between every pair of points: coords_a.shape is (3, 2) and coords_b.shape is (4, 2), so the result is (3, 4). The default metric for scipy.spatial.distance.cdist is Euclidean, but there are other metrics as well.
For the sake of this example, let's assume vara is:
vara = np.array([2,4.5,2])
(instead of 100, 90, 120). We need to identify which values in distances are smaller than 2 in row one, smaller than 4.5 in row two, and so on. One way to solve this problem is to subtract each value in vara from the corresponding row (note that we must resize vara):
vara.resize(3, 1)
res = distances - vara
#out: array([[-0.37466926, 0.70148108, 1.95182236, 0.70059253],
# [-1.50186725, -0.43821468, 0.61000978, -0.57692722],
# [-1.75916811, -0.02908651, 1.54358575, -0.55996528]])
then setting positive values to zero and making negative values positive will give us the final array:
res[res>0] = 0
res = np.absolute(res)
#out: array([[ 0.37466926, 0. , 0. , 0. ],
# [ 1.50186725, 0.43821468, 0. , 0.57692722],
# [ 1.75916811, 0.02908651, 0. , 0.55996528]])
Now, to sum over each row:
sum_ = res.sum(axis=1)
#out: array([ 0.37466926, 2.51700915, 2.34821989])
and to count the items in each row:
count = np.count_nonzero(res, axis=1)
#out: array([1, 3, 3])
This is a fully vectorized (custom) solution which you can tweak to your liking and which should accommodate any level of complexity. Yet another solution is cKDTree; the code below is from the documentation. It should be fairly easy to adapt it to your problem, but in case you need assistance don't hesitate to ask.
from scipy import spatial

x, y = np.mgrid[0:4, 0:4]
points = list(zip(x.ravel(), y.ravel()))
tree = spatial.cKDTree(points)
tree.query_ball_point([2, 0], 1)
[4, 8, 9, 12]
query_ball_point() finds all points within distance r of point(s) x, and it is amazingly fast.
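A sketch of adapting it to this problem (my assumption, not from the answer: hypothetical ds1_m and ds2_m frames holding the projected x/y coordinates in metres, see the UPDATE below):
from scipy import spatial

tree = spatial.cKDTree(ds2_m[['x', 'y']].values)
ds1_m['count'] = [
    len(tree.query_ball_point(pt, r))  # places of dataset2 within varA metres
    for pt, r in zip(ds1_m[['x', 'y']].values, ds1_m['varA'].values)
]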
One final note: don't use these algorithms with lon/lat input, particularly if your area of interest is far from the equator, because the error can get huge.
UPDATE:
To project your coordinates, you need to convert from WGS84 (lon/lat) to the appropriate UTM zone. To find out which UTM zone you should project to, use epsg.io.
import pyproj

lon = -122.67598
lat = 45.52168
WGS84 = "+init=EPSG:4326"
EPSG3740 = "+init=EPSG:3740"
Proj_to_EPSG3740 = pyproj.Proj(EPSG3740)
Proj_to_EPSG3740(lon, lat)
# out: (525304.9265963673, 5040956.147893889)
You can do df.apply() and use Proj_to_... to project df.
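For example, a sketch projecting the lon/lat columns of ds2 with the Proj object defined above (passing the whole columns at once rather than a row-by-row apply, since Proj objects also accept arrays):
ds2['x'], ds2['y'] = Proj_to_EPSG3740(ds2['lon'].values, ds2['lat'].values)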
IIUC:
Source DFs:
In [160]: d1
Out[160]:
id lon lat varA
0 1 20.11 19.88 100
1 2 20.87 18.65 90
2 3 18.99 20.75 120
In [161]: d2
Out[161]:
placeid lon lat
0 a 18.75 20.77
1 b 19.77 22.56
2 c 20.86 23.76
3 d 17.55 20.74
Vectorized haversine function:
def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])

    a = np.sin((lat2 - lat1) / 2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0)**2

    return earth_radius * 2 * np.arcsin(np.sqrt(a))
Solution:
x = d2.assign(x=1) \
      .merge(d1.loc[d1['id']==1, ['lat','lon']].assign(x=1),
             on='x', suffixes=['','2']) \
      .drop(['x'], axis=1)
x['dist'] = haversine(x.lat, x.lon, x.lat2, x.lon2)
yields:
In [163]: x
Out[163]:
placeid lon lat lat2 lon2 dist
0 a 18.75 20.77 19.88 20.11 172.924852
1 b 19.77 22.56 19.88 20.11 300.078600
2 c 20.86 23.76 19.88 20.11 438.324033
3 d 17.55 20.74 19.88 20.11 283.565975
filtering:
In [164]: x.loc[x.dist < d1.loc[d1['id']==1, 'varA'].iat[0]]
Out[164]:
Empty DataFrame
Columns: [placeid, lon, lat, lat2, lon2, dist]
Index: []
let's change d1, so a few rows would satisfy the criteria:
In [171]: d1.loc[0, 'varA'] = 350
In [172]: d1
Out[172]:
id lon lat varA
0 1 20.11 19.88 350 # changed: 100 --> 350
1 2 20.87 18.65 90
2 3 18.99 20.75 120
In [173]: x.loc[x.dist < d1.loc[d1['id']==1, 'varA'].iat[0]]
Out[173]:
placeid lon lat lat2 lon2 dist
0 a 18.75 20.77 19.88 20.11 172.924852
1 b 19.77 22.56 19.88 20.11 300.078600
3 d 17.55 20.74 19.88 20.11 283.565975
Use scipy.spatial.distance.cdist with your user-defined distance algorithm as the metric
import scipy.spatial

# rows passed to the metric are plain arrays in [lon, lat] order
h = lambda u, v: haversine(u[1], u[0], v[1], v[0])
dist_mtx = scipy.spatial.distance.cdist(dataset1[['lon', 'lat']],
                                        dataset2[['lon', 'lat']],
                                        metric=h)
Then to check the number in the area, just broadcast it
dataset2['count'] = np.sum(dataset1['varA'].values[:, None] > dist_mtx, axis=0)
