I have a dataframe (df2) in which x and y are specified in RD New (EPSG:28992) coordinates.
x y z batch_nr batch_description
0 117298.377 560406.392 0.612 5800 PRF Grasland (l)
1 117297.803 560411.756 1.015
2 117296.327 560419.840 1.580
3 117295.470 560425.716 2.490
4 117296.875 560429.976 4.529
More CRS info:
# define the CRS, as used in GeoPandas
from pyproj import CRS
crs_rd = CRS.from_user_input(28992)
crs_rd
<Derived Projected CRS: EPSG:28992>
Name: Amersfoort / RD New
Axis Info [cartesian]:
- X[east]: Easting (metre)
- Y[north]: Northing (metre)
Area of Use:
- name: Netherlands - onshore, including Waddenzee, Dutch Wadden Islands and 12-mile offshore coastal zone.
- bounds: (3.2, 50.75, 7.22, 53.7)
Coordinate Operation:
- name: RD New
- method: Oblique Stereographic
Datum: Amersfoort
- Ellipsoid: Bessel 1841
- Prime Meridian: Greenwich
How can I convert df2 to a GeoDataFrame with the geometry set to CRS EPSG:28992?
It's a simple case of using the GeoDataFrame constructor with the crs parameter and points_from_xy():
import geopandas as gpd
import pandas as pd
import io
df2 = pd.read_csv(io.StringIO(""" x y z batch_nr batch_description
0 117298.377 560406.392 0.612 5800 PRF Grasland (l)
1 117297.803 560411.756 1.015
2 117296.327 560419.840 1.580
3 117295.470 560425.716 2.490
4 117296.875 560419.976 4.529"""), sep=r"\s\s+", engine="python")
gdf = gpd.GeoDataFrame(df2, geometry=gpd.points_from_xy(df2["x"], df2["y"], df2["z"]), crs="epsg:28992")
gdf
Output:

            x           y      z  batch_nr batch_description                                geometry
0  117298.377  560406.392  0.612    5800.0  PRF Grasland (l)   POINT Z (117298.377 560406.392 0.612)
1  117297.803  560411.756  1.015       NaN               NaN   POINT Z (117297.803 560411.756 1.015)
2  117296.327  560419.840  1.580       NaN               NaN     POINT Z (117296.327 560419.84 1.58)
3  117295.470  560425.716  2.490       NaN               NaN      POINT Z (117295.47 560425.716 2.49)
4  117296.875  560429.976  4.529       NaN               NaN   POINT Z (117296.875 560429.976 4.529)
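As a quick sanity check (an aside, not from the original answer): gdf.crs should now report EPSG:28992, and to_crs reprojects if, say, WGS84 coordinates are ever needed:

print(gdf.crs)                     # Amersfoort / RD New (EPSG:28992)
gdf_wgs84 = gdf.to_crs(epsg=4326)  # reproject to WGS84 lat/lon
print(gdf_wgs84.geometry.head())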
I have a data frame (waypoint_df) with easting/northing and latitude/longitude columns.
I want to copy the values of easting and northing to the columns latitude and longitude if northing is < 90, and then set the original values to None.
Here is my python code:
for i, j in waypoint_df.iterrows():
    easting = j['easting']
    northing = j['northing']
    if northing and northing < 90:
        waypoint_df.at[i, 'latitude'] = northing
        waypoint_df.at[i, 'longitude'] = easting
        waypoint_df.at[i, 'northing'] = None
        waypoint_df.at[i, 'easting'] = None
Is there a simpler way to run the operation above instead of iterating all rows?
Use pandas.DataFrame.assign to swap the values of easting/northing and latitude/longitude under a boolean mask.
mask = waypoint_df['northing'].lt(90)
waypoint_df.loc[mask] = waypoint_df.assign(
    latitude=waypoint_df['northing'], northing=waypoint_df['latitude'],
    longitude=waypoint_df['easting'], easting=waypoint_df['longitude'])
# Output :
print(waypoint_df)
locality_id locality waypoint easting northing description grid date collected_by latitude longitude waypoint_pk
0 761 None BATurnoff 368255.0 6614695.0 Turnoff from Yarri Rd (also access to Harpers AMG51 12-12-07 SJB NaN NaN 1
1 761 None BKAW 367700.0 6616800.0 End of access track at breakaway AMG51 12-12-07 SJB NaN NaN 2
2 761 None BKAWT 367071.0 6615492.0 Access track turnoff from Harpers Lagoon road AMG51 12-12-07 SJB NaN NaN 3
3 3581 None DM14-01 NaN NaN King of the Hills dyke -- Hugely contaminated None 10-11-15 D Mole -28.62 121.2 4
4 3581 None DM14-02 NaN NaN Intrusion into KOTH dyke? -- Graphic granite - None 10-11-15 D Mole -28.62 121.2 5
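A hedged aside, not from the original answer: the same result can be had without assign by writing only the two destination columns through the mask (to_numpy() sidesteps column alignment):

mask = waypoint_df['northing'].lt(90)
# copy northing/easting into latitude/longitude for the masked rows
waypoint_df.loc[mask, ['latitude', 'longitude']] = (
    waypoint_df.loc[mask, ['northing', 'easting']].to_numpy()
)
# blank out the originals, as the loop version did
waypoint_df.loc[mask, ['northing', 'easting']] = None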
I have the following 10 lines of a large Pandas dataframe df:
X,Y,Z are grid points in xyz-direction; U,V,W are (measured) velocity components in x,y,z-direction.
X Y Z U V W
0 -201.0 -2.00 11.200 3.750 -15.20 -0.75800
1 -201.0 -2.00 12.220 3.640 -15.40 -0.71100
2 -200.0 -3.00 1.079 -1.480 -3.86 0.03670
3 -198.0 -3.00 7.190 4.220 -13.50 -1.31000
4 -198.0 -1.43 5.530 3.510 -10.10 -1.56000
5 -195.0 -1.43 6.140 3.900 -11.80 -1.50000
6 -195.0 -2.54 0.000 -0.767 -5.19 0.00154
7 -195.0 -3.54 0.600 -1.210 -6.04 -0.05580
8 -191.0 -5.54 1.449 -1.510 -2.80 -0.20900
9 -191.0 -7.54 2.392 -0.782 -2.65 -0.56000
I now want to interpolate the values U, V, W over a finer 5x5x5 grid in X, Y, Z.
import numpy as np
import pandas as pd

x = np.arange(-200, -175, 5)
y = np.arange(-10, 5, 5)
z = np.arange(0, 20, 5)
xx, yy, zz = np.meshgrid(x, y, z)
NT = np.prod(xx.shape)
data_grid = {
    "x_grid": np.reshape(xx, NT),
    "y_grid": np.reshape(yy, NT),
    "z_grid": np.reshape(zz, NT)
}
df2 = pd.DataFrame(data=data_grid)
I see SciPy has this interpolate.griddata function, which I am trying to call (for now I only interpolate U in XYZ).
xp = df['X'].to_numpy()
yp = df['Y'].to_numpy()
zp = df['Z'].to_numpy()
up = df['U'].to_numpy()
U_grid = griddata([(xp,yp,zp)], up, [(x_grid,y_grid,z_grid)], method='nearest')
But this gives me:
"ValueError: different number of values and points"
What am I doing wrong?
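For reference, a hedged sketch of the likely fix (not part of the original thread): scipy.interpolate.griddata expects the sample points as an (n, D) array or a tuple of 1-D coordinate arrays, and the query points in the same layout; wrapping the tuple in a list, as above, produces a shape that does not match the values array, hence the error. Assuming the arrays defined earlier:

import numpy as np
from scipy.interpolate import griddata

# sample points as an (n, 3) array of (x, y, z) coordinates
points = np.column_stack((xp, yp, zp))

# query points, flattened the same way as the meshgrid above
xi = np.column_stack((xx.ravel(), yy.ravel(), zz.ravel()))

# nearest-neighbour interpolation of U onto the finer grid
U_grid = griddata(points, up, xi, method='nearest')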
I have a huge DataFrame (df) of datapoints on a map (latitudes and longitudes), most of which lie very close to each other, with some outliers. I would like to group all the rows by column A, calculate their z-scores, and replace every value within a group whose z-score is > 1.5 with the mean value for the group.
I have tried computing the z-scores per group, without success:

zscore = lambda x: (x - x.mean()) / x.std()
grouped_df = df.groupby("A")
transformed_df = grouped_df.transform(zscore)

transformed_df gives me a table of z-scores, but not the replacement I am after.
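As an aside (not part of the answer below, which takes a distance-based route), a minimal sketch of the direct groupby/transform approach the question describes, on made-up data:

import pandas as pd

# hypothetical example frame: a group column "A" plus coordinate columns
df = pd.DataFrame({
    'A': ['g1'] * 5 + ['g2'] * 5,
    'latitude': [51.00, 51.10, 51.05, 58.00, 51.02, 52.00, 52.10, 52.05, 52.02, 45.00],
    'longitude': [0.10, 0.20, 0.15, 5.00, 0.12, 1.00, 1.10, 1.05, 1.02, -3.00],
})

coords = ['latitude', 'longitude']
grouped = df.groupby('A')[coords]
group_mean = grouped.transform('mean')
zscores = (df[coords] - group_mean) / grouped.transform('std')

# replace any coordinate whose |z-score| exceeds 1.5 with its group mean
df[coords] = df[coords].mask(zscores.abs() > 1.5, group_mean)

Note this z-scores each coordinate independently; the answer below instead scores the haversine distance to the group centroid, which treats latitude and longitude jointly.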
You can use haversine_distances from scikit-learn to compute the distance between each point and the centroid of the points in the same group. Given that you should have very close points, you can approximate the latitude and longitude of the centroid with the mean latitude and longitude of the points in the group.
Here is an example based on data about UK towns (it is the free sample that you can download from here). In particular, the data contains, for each city, its coordinates and county (which you can think of as a group in your setting):
name county latitude longitude
0 Aaron's Hill Surrey 51.18291 -0.63098
1 Abbas Combe Somerset 51.00283 -2.41825
2 Abberley Worcestershire 52.30522 -2.37574
3 Abberton Essex 51.83440 0.91066
4 Abberton Worcestershire 52.17955 -2.00817
5 Abberwick Northumberland 55.41325 -1.79720
6 Abbess End Essex 51.78000 0.28172
7 Abbess Roding Essex 51.77815 0.27685
8 Abbey Devon 50.88896 -3.22276
9 Abbeycwmhir / Abaty Cwm-hir Powys 52.33104 -3.38988
And here is the code to solve your problem:
from math import radians
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import haversine_distances
df = pd.read_csv('uk-towns-sample.csv', usecols=['name', 'county', 'latitude', 'longitude'])
# Compute coordinates of the centroid for each county (group)
dist_county = df.groupby('county').agg({'latitude': 'mean', 'longitude': 'mean'})
# Convert latitude and longitude to radians (haversine_distances expects radians)
df[['latitude_radians', 'longitude_radians']] = df[['latitude', 'longitude']].applymap(radians)
dist_county[['latitude_radians', 'longitude_radians']] = dist_county[['latitude', 'longitude']].applymap(radians)
# Compute the distance of each town w.r.t. the centroid of its county
df['dist'] = df[['county', 'latitude_radians', 'longitude_radians']].apply(
    lambda x: haversine_distances(
        [x[['latitude_radians', 'longitude_radians']].values],
        [dist_county.loc[x['county']][['latitude_radians', 'longitude_radians']].values]
    )[0][0] * 6371,  # multiply by the Earth radius (6371 km) to get kilometers
    axis=1
)
# Compute mean and std of distances by county
county_stats = df.groupby('county').agg({'dist': ['mean', 'std']})
# Compute the z-score using the distance of each town w.r.t. the centroid of its county, and the mean and std of distances for that county
df['zscore'] = df.apply(
    lambda x: (x['dist'] - county_stats.loc[x['county']][('dist', 'mean')]) / county_stats.loc[x['county']][('dist', 'std')],
    axis=1
)
# Change latitude and longitude of the outliers with those of the centroid of their counties
df.loc[df.zscore > 1.5, ['latitude', 'longitude']] = df[df.zscore > 1.5].merge(
dist_county, left_on='county', right_on=dist_county.index, how='left'
)[['latitude_y', 'longitude_y']].values
The resulting DataFrame df looks like:
name county latitude longitude latitude_radians longitude_radians dist zscore
0 Aaron's Hill Surrey 51.18291 -0.63098 0.893310 -0.011013 12.479147 -0.293419
1 Abbas Combe Somerset 51.00283 -2.41825 0.890167 -0.042206 35.205157 1.088695
2 Abberley Worcestershire 52.30522 -2.37574 0.912898 -0.041464 17.014249 0.266168
3 Abberton Essex 51.83440 0.91066 0.904681 0.015894 24.504285 -0.254400
4 Abberton Worcestershire 52.17955 -2.00817 0.910705 -0.035049 11.906150 -0.663460
... ... ... ... ... ... ... ... ...
1795 Ayton Berwickshire 55.84232 -2.12285 0.974632 -0.037051 5.899085 0.007876
1796 Ayton Tyne and Wear 54.89416 -1.55643 0.958084 -0.027165 3.192591 -0.935937
If you look at outliers for Essex county, the new coordinates correspond to those of the centroid, i.e. (51.846594, 0.554532):
name county latitude longitude
414 Aimes Green Essex 51.846594 0.554532
1721 Aveley Essex 51.846594 0.554532
I'm pretty new to plotting using matplotlib and I'm having a few problems with the legends. I have this data set:
Wavelength CD Time
0 250.0 0.00000 1
1 249.8 -0.04278 1
2 249.6 -0.03834 1
3 249.4 -0.02384 1
4 249.2 -0.04817 1
... ... ... ...
3760 200.8 0.99883 15
3761 200.6 0.50277 15
3762 200.4 -0.19228 15
3763 200.2 0.81317 15
3764 200.0 0.90226 15
[3765 rows x 3 columns]
Column types:
Wavelength float64
CD float64
Time int64
dtype: object
When plotting with Time as the categorical variable, why aren't all the values shown in the legend?
import seaborn as sns
import matplotlib.pyplot as plt

x = df1['Wavelength']
y = df1['CD']
z = df1['Time']
sns.lineplot(x=x, y=y, hue=z)
plt.tight_layout()
plt.show()
But I can plot using pandas' built-in matplotlib function with a colorbar like this:
df1.plot.scatter('Wavelength', 'CD', c='Time', cmap='RdYlBu')
What's the best way of choosing between discrete and continuous legends using matplotlib/seaborn?
Many thanks!
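An aside, not from the original thread: seaborn treats a numeric hue as continuous and, by default, builds a 'brief' legend that samples a few evenly spaced values; passing legend='full' forces one entry per level, while the pandas scatter route with c= gives a continuous colorbar instead. A minimal sketch, assuming df1 as above:

import seaborn as sns
import matplotlib.pyplot as plt

# legend='full' lists every hue level; the default 'brief' shows only a
# sample of evenly spaced values when hue is numeric
sns.lineplot(data=df1, x='Wavelength', y='CD', hue='Time', legend='full')
plt.tight_layout()
plt.show()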
I need to grid scattered data from a GeoPandas dataframe to a regular grid (e.g. 1 degree), get the mean values of the individual grid boxes, and secondly plot this data with various projections.
The first point I managed to achieve using gpd_lite_toolbox.
This result I can plot on a simple lat/lon map; however, trying to convert it to any other projection fails.
Here is a small example with some artificial data showing my issue:
import gpd_lite_toolbox as glt
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
from shapely import wkt
# creating the artificial df
df = pd.DataFrame(
    {'data': [20, 15, 17.5, 11.25, 16],
     'Coordinates': ['POINT(-58.66 -34.58)', 'POINT(-47.91 -15.78)',
                     'POINT(-70.66 -33.45)', 'POINT(-74.08 4.60)',
                     'POINT(-66.86 10.48)']})
# converting the df to a gdf with a projection
df['Coordinates'] = df['Coordinates'].apply(wkt.loads)
gdf = gpd.GeoDataFrame(df, crs='EPSG:4326', geometry='Coordinates')
# gridding the data using the gridify_data function from the toolbox and setting grids without data to nan
g1 = glt.gridify_data(gdf, 1, 'data', cut=False)
g1 = g1.where(g1['data'] > 1)
# simple plot of the gridded data
fig, ax = plt.subplots(ncols=1, figsize=(20, 10))
g1.plot(ax=ax, column='data', cmap='jet')
# trying to convert to (any) other projection
g2 = g1.to_crs('EPSG:3395')
# I get the following error
---------------------------------------------------------------------------
AttributeError: 'float' object has no attribute 'is_empty'
I would also be happy to use a different gridding function if that solves the problem.
Your g1 contains too many NaN values:
g1 = g1.where(g1['data'] > 1)
print(g1)
geometry data
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 POLYGON ((-74.08 5.48, -73.08 5.48, -73.08 4.4... 11.25
...
You should use g1[g1['data'] > 1] instead of g1.where(g1['data'] > 1).
g1 = g1[g1['data'] > 1]
print(g1)
geometry data
5 POLYGON ((-74.08 5.48, -73.08 5.48, -73.08 4.4... 11.25
181 POLYGON ((-71.08 -32.52, -70.08 -32.52, -70.08... 17.50
322 POLYGON ((-67.08 10.48, -66.08 10.48, -66.08 9... 16.00
735 POLYGON ((-59.08 -34.52, -58.08 -34.52, -58.08... 20.00
1222 POLYGON ((-48.08 -15.52, -47.08 -15.52, -47.08... 15.00
g2 = g1.to_crs('EPSG:3395')
print(g2)
geometry data
5 POLYGON ((-8246547.877965705 606885.3761893312... 11.25
181 POLYGON ((-7912589.405585884 -3808795.10464339... 17.50
322 POLYGON ((-7467311.442412791 1165421.424891677... 16.00
735 POLYGON ((-6576755.516066602 -4074627.00861716... 20.00
1222 POLYGON ((-5352241.117340593 -1737775.44359649... 15.00
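For completeness, a hedged alternative: since the failure comes from rows whose geometry is NaN, dropping them before reprojecting should work just as well.

# drop the grid cells that received no data, then reproject
g1 = g1.dropna(subset=['data'])
g2 = g1.to_crs('EPSG:3395')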