python postgis ST_ClosestPoint

I am struggling with an SQL command issued from my Python script. Here is what I have tried so far; the first example works fine, but the rest do not.
#working SQL = "SELECT ST_Distance(ST_Transform(ST_GeomFromText(%s, 4326),27700),ST_Transform(ST_GeomFromText(%s, 4326),27700));"
#newPointSQL = "SELECT ST_ClosestPoint(ST_GeomFromText(%s),ST_GeomFromText(%s));"
#newPointSQL = "SELECT ST_As_Text(ST_ClosestPoint(ST_GeomFromText(%s), ST_GeomFromText(%s)));"
#newPointSQL = "SELECT ST_AsText(ST_ClosestPoint(ST_GeomFromEWKT(%s), ST_GeomFromText(%s)));"
#newPointSQL = "SELECT ST_AsText(ST_Line_Interpolate_Point(ST_GeomFromText(%s),ST_Line_Locate_Point(ST_GeomFromText(%s),ST_GeomFromText(%s))));"
newPointData = (correctionPathLine, pointToCorrect)  # i.e. ( MULTILINESTRING((-3.16427109855617 55.9273798550064,-3.16462372283029 55.9273883602162)), POINT(-3.164667 55.92739) )
My data is picked up OK, because the first SQL statement succeeds when executed. The problem is when I use the ST_ClosestPoint function.
Can anyone spot a misuse anywhere? Am I using ST_ClosestPoint in the wrong way?
In the last example, I did modify my data (in case someone notices) to get it to run, but it still would not execute.

I don't know what kind of geometries you are dealing with, but I had the same trouble before with MultiLineStrings. I realized that when a MultiLineString can't be merged, ST_Line_Locate_Point doesn't work (you can check whether a MultiLineString can be merged using the ST_LineMerge function).

I've made a pl/pgSQL function based on an old mailing-list post, with some performance tweaks added. It only works with MultiLineStrings and LineStrings (but can easily be modified to work with Polygons). First it checks whether the geometry has only one component; if it does, you can use the usual ST_Line_Interpolate_Point and ST_Line_Locate_Point combination; if not, you have to do the same for each LineString in the MultiLineString. I've also added an ST_LineMerge call for pre-1.5 compatibility:
CREATE OR REPLACE FUNCTION ST_MultiLine_Nearest_Point(amultiline geometry, apoint geometry)
  RETURNS geometry AS
$BODY$
DECLARE
  mindistance float8;
  adistance float8;
  nearestlinestring geometry;
  nearestpoint geometry;
  simplifiedline geometry;
  line geometry;
BEGIN
  simplifiedline := ST_LineMerge(amultiline);
  IF ST_NumGeometries(simplifiedline) <= 1 THEN
    nearestpoint := ST_Line_Interpolate_Point(simplifiedline, ST_Line_Locate_Point(simplifiedline, apoint));
    RETURN nearestpoint;
  END IF;
  -- Change your mindistance according to your projection; it should be stupidly big
  mindistance := 100000;
  FOR line IN SELECT (ST_Dump(simplifiedline)).geom AS geom LOOP
    adistance := ST_Distance(apoint, line);
    IF adistance < mindistance THEN
      mindistance := adistance;
      nearestlinestring := line;
    END IF;
  END LOOP;
  RETURN ST_Line_Interpolate_Point(nearestlinestring, ST_Line_Locate_Point(nearestlinestring, apoint));
END;
$BODY$
LANGUAGE 'plpgsql' IMMUTABLE STRICT;
UPDATE:
As noted by @Nicklas Avén, ST_ClosestPoint() should work; ST_ClosestPoint was added in PostGIS 1.5.
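For reference, here is a minimal sketch of calling ST_ClosestPoint from Python with psycopg2, parameterized the same way as the working distance query. It assumes an open psycopg2 connection named conn and PostGIS 1.5 or later; the WKT values mirror the ones in the question:
correctionPathLine = "MULTILINESTRING((-3.16427109855617 55.9273798550064,-3.16462372283029 55.9273883602162))"
pointToCorrect = "POINT(-3.164667 55.92739)"
newPointSQL = "SELECT ST_AsText(ST_ClosestPoint(ST_GeomFromText(%s, 4326), ST_GeomFromText(%s, 4326)));"
cur = conn.cursor()
cur.execute(newPointSQL, (correctionPathLine, pointToCorrect))
print(cur.fetchone()[0])  # WKT of the point on the line closest to the given point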

Related

How to use variables in query_to_pandas

I just got dumped into SQL with BigQuery and such, so I don't know a lot of the terms for this kind of thing. I'm currently trying to make a method where you input a string (the name of the dataset you want to pull from), but I can't seem to put a string into the variable I want without it returning errors.
I looked up how to use variables in SQL queries, but most of those solutions weren't for my case. I ended up adding $s to the query and an s before the """ string, which ended up with a syntax error:
import pandas as pd
import bq_helper
from bq_helper import BigQueryHelper
# Some code about using BQ_helper to get the data, if you need it lmk
# test = `data.patentsview.application`
query1 = s"""
SELECT * FROM $s
LIMIT
20;
"""
response1 = patentsview.query_to_pandas_safe(query1)
response1.head(20)
With the code above, I get the following error:
File "<ipython-input-63-6b07957ebb81>", line 8
"""
^
SyntaxError: invalid syntax
EDIT:
The original code that works, but would have to be manually brute-forced, is this:
query1 = """
SELECT * FROM `patents-public-data.patentsview.application`
LIMIT
20;
"""
response1 = patentsview.query_to_pandas_safe(query1)
response1.head(20)
If I understand you correctly, this may be what you're looking for:
#making up some variables:
vars = ['`patents-public-data.patentsview.application','`patents-private-data.patentsview.application']
for var in vars:
    query = f"""SELECT * FROM {var}
LIMIT
20;
"""
    print(query)
Output:
SELECT * FROM `patents-public-data.patentsview.application
LIMIT
20;
SELECT * FROM `patents-private-data.patentsview.application
LIMIT
20;
I believe this should help: https://cloud.google.com/bigquery/docs/parameterized-queries#bigquery_query_params_named-python:
To specify a named parameter, use the @ character followed by an identifier, such as @param_name.
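Building on that, here is a rough sketch of a parameterized query with the google-cloud-bigquery client. Note that in BigQuery only values can be parameters; a table name still has to be spliced in with ordinary string formatting, as in the f-string example above. The country column is purely hypothetical and used only to illustrate a named parameter:
from google.cloud import bigquery

client = bigquery.Client()
table = "`patents-public-data.patentsview.application`"  # table names cannot be query parameters
sql = f"SELECT * FROM {table} WHERE country = @country LIMIT 20"  # @country is the named parameter; column is hypothetical
job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ScalarQueryParameter("country", "STRING", "US")]
)
df = client.query(sql, job_config=job_config).to_dataframe()
print(df.head(20))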

How to write if statement from SAS to python

I am a SAS user trying to translate SAS code into a Python version.
I have created the SAS code below and am having some trouble reproducing it in Python. Suppose I have a data table containing fields aging1 to aging60, and I want to create two new fields named 'life_def' and 'obs_time'. These two fields start at 0 and are updated based on conditions over the other fields, aging1 to aging60.
data want;
  set have;
  array aging_array(*) aging1--aging60;
  life_def=0;
  obs_time=0;
  do i = 1 to 60;
    if life_def=0 and aging_array[i] ne . then do;
      if aging_array[i]>=4 then do;
        obs_time=i;
        life_def=1;
      end;
      if aging_array[i]<4 then do;
        obs_time=i;
      end;
    end;
  end;
  drop i;
run;
I have tried to re-create the SAS code above in Python, but it doesn't work the way I thought it would. Below is the code I am currently working on.
df['life_def']=0
df['obs_time']=0
for i in range(1,lag+1):
    if df['life_def'].all()==0 and pd.notnull(df[df.columns[i+4]].all()):
        condition=df[df.columns[i+4]]>=4
        df['life_def']=np.where(condition, 1, df['life_def'])
        df['obs_time']=np.where(condition, i, df['obs_time'])
Suppose df[df.columns[i+4]] corresponds to my aging columns from SAS. With the code above, the loop keeps going as i increases. However, in the SAS logic, updating stops at the first time that aging >= 4.
For example, if aging7 >= 4 for the first time, life_def becomes 1 and obs_time becomes 7, and they are not reassigned on the next iteration, i = 8.
Thank you!
Your objective is to get, per row, the x of the first agingx column that is greater than or equal to 4. The snippet below does the same thing.
Note - I am using python 2.7
mydf['obs_time'] = 0
agingcols_len = len([k for k in mydf.columns.tolist() if 'aging' in k])
rowcnt = mydf['aging1'].fillna(0).count()
for k in xrange(rowcnt):
    isFirst = True
    for i in xrange(1, agingcols_len):
        if isFirst and mydf['aging' + str(i)][k] >= 4:
            mydf['obs_time'][k] = i
            isFirst = False
        elif isFirst and mydf['aging' + str(i)][k] < 4:
            pass
I have uploaded the data that I used to test the above. The same can be found here.
The snippet iterates over all the agingx columns (e.g. aging1, aging2) until it finds a value greater than or equal to 4, and records that column's index in obs_time. The whole thing iterates over the DataFrame rows with k.
FYI - However, this is super slow when you have a million rows to loop through.
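If speed matters, here is a vectorized sketch that avoids the Python loops entirely. It assumes the columns are literally named aging1 to aging60 and mirrors the SAS logic: obs_time becomes the 1-based position of the first aging value >= 4 (with life_def = 1); if no such value exists, it becomes the position of the last non-missing aging value:
import numpy as np

aging_cols = ['aging' + str(i) for i in range(1, 61)]          # assumed column names
vals = df[aging_cols].astype(float).values

nonmissing = ~np.isnan(vals)
hit = np.where(nonmissing, vals, -np.inf) >= 4                 # True where a non-missing value is >= 4
has_hit = hit.any(axis=1)
first_hit = hit.argmax(axis=1)                                 # 0-based index of the first hit (0 if none)

has_obs = nonmissing.any(axis=1)
last_obs = vals.shape[1] - 1 - np.argmax(nonmissing[:, ::-1], axis=1)  # last non-missing column per row

df['life_def'] = has_hit.astype(int)
df['obs_time'] = np.where(has_hit, first_hit + 1,
                          np.where(has_obs, last_obs + 1, 0))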

PostgreSQL function to return text array with two elements

I am trying to return a text array with two elements from a PostgreSQL function, but the output seems to contain only one element.
Here is the result of the query in pgAdmin; there it does look like an array with two elements:
select "address_pts".usp_etl_gis_get_cname_bd_ext(3)
ALLEGHANY,POLYGON((1308185.61436242 959436.119048506,1308185.61436242 1036363.17188701,1441421.26094835 1036363.17188701,1441421.26094835 959436.119048506,1308185.61436242 959436.119048506))
But when I call the function from Python, the result I see has a length of only 1.
(partial Python code)
pg_cur.execute("SELECT \"VA\".address_pts.usp_etl_gis_get_cname_bd_ext(3)")
for rec in pg_cur:
    print(len(rec))
# output: 1
for rec in pg_cur:
    print(rec[0])
# output: ['ALLEGHANY', 'POLYGON((1308185.61436242 959436.119048506,1308185.61436242 1036363.17188701,1441421.26094835 1036363.17188701,1441421.26094835 959436.119048506,1308185.61436242 959436.119048506))']
It raises an error for the code below:
for rec in pg_cur:
    print(rec[1])
-- Here is the function --
-- postgresql 9.6
CREATE OR REPLACE FUNCTION address_pts.usp_etl_gis_get_cname_bd_ext(
  _cid integer
)
RETURNS TEXT[]
LANGUAGE 'plpgsql'
COST 100.0
VOLATILE NOT LEAKPROOF
AS $function$
DECLARE cname TEXT;
DECLARE cbd_ext TEXT;
DECLARE outarr TEXT[];
BEGIN
  IF (_cid NOT BETWEEN 1 and 100) THEN
    RAISE EXCEPTION '% IS NOT A VALID COUNTY ID. ENTER A COUNTY BETWEEN 1..100', _cid;
  END IF;
  SELECT upper(rtrim(ltrim(replace(name10,' ','_'))))
    INTO cname
    FROM "jurisdictions"."CB_TIGER_2010"
   WHERE county_id = _cid;
  /*
  Returns the float4 minimum bounding box for the supplied geometry,
  as a geometry. The polygon is defined by the corner points of the
  bounding box ((MINX, MINY), (MINX, MAXY), (MAXX, MAXY), (MAXX, MINY), (MINX, MINY)).
  (PostGIS will add a ZMIN/ZMAX coordinate as well).
  */
  SELECT ST_AsText(ST_Envelope(geom))
    INTO cbd_ext
    FROM "jurisdictions"."CB_TIGER_2010"
   WHERE county_id = _cid;
  outarr[0] := cname::text;
  outarr[1] := cbd_ext::text;
  RETURN outarr;
END;
$function$;
Questions:
Is the PostgreSQL function returning an array of length 1 or 2?
If the length is 1, how can I split the result? For example:
['ALLEGHANY','POLYGON((1308185.614362,...))']
Thank you
If you are happy to split the string once you are in Python, you can try a regex with the re module:
# Python3
import re
p = re.compile(r"\['(.*)','(.*)']")
res = p.search("['ALLEGHANY','POLYGON((1308185.614362,...))']")
print(res.group(1)) # 'ALLEGHANY'
print(res.group(2)) # 'POLYGON((1308185.614362,...))'
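That said, assuming the function really does return a two-element text[], no string parsing should be necessary: psycopg2 converts a PostgreSQL text[] column into a Python list, so rec has length 1 (one column in the row) while rec[0] is the list holding both elements:
for rec in pg_cur:
    arr = rec[0]            # the text[] column, already a Python list
    print(len(arr))         # 2
    cname, cbd_ext = arr    # county name and envelope WKT
    print(cname)
    print(cbd_ext)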

Load PostgreSQL database with data from a NetCDF file

I have a netCDF file with eight variables (sorry, I can't share the actual file).
Each variable has two dimensions, time and station. Time has about 14 steps and station is currently 38000 different ids.
So for 38000 different "locations" (really just an id) we have 8 variables and 14 different times.
$ ncdump -h stationdata.nc
netcdf stationdata {
dimensions:
    station = 38000 ;
    name_strlen = 40 ;
    time = UNLIMITED ; // (14 currently)
variables:
    int time(time) ;
        time:long_name = "time" ;
        time:units = "seconds since 1970-01-01" ;
    char station_name(station, name_strlen) ;
        station_name:long_name = "station_name" ;
        station_name:cf_role = "timeseries_id" ;
    float var1(time, station) ;
        var1:long_name = "Variable 1" ;
        var1:units = "m3/s" ;
    float var2(time, station) ;
        var2:long_name = "Variable 2" ;
        var2:units = "m3/s" ;
...
This data needs to be loaded into a Postgres database so that it can be joined to some geometries matching station_name for later visualization.
Currently I have done this in Python with the netCDF4 module. It works, but it takes forever!
Now I am looping like this:
times = rootgrp.variables['time']
stations = rootgrp.variables['station_name']
for timeindex, time in enumerate(times):
    stations = rootgrp.variables['station_name']
    for stationindex, stationnamearr in enumerate(stations):
        var1val = var1[timeindex][stationindex]
        print "INSERT INTO ncdata (validtime, stationname, var1) \
            VALUES ('%s','%s', %s);" % \
            ( time, stationnamearr, var1val )
This takes several minutes on my machine to run and I have a feeling it could be done in a much more clever way.
Does anyone have an idea of how this could be done in a smarter way? Preferably in Python.
Not sure this is the right way to do it but I found a good way to solve this and thought I should share it.
In the first version the script took about one hour to run. After a rewrite of the code it now runs in less than 30 sec!
The big thing was to use numpy arrays, transpose the variable arrays from the NetCDF reader into rows, and then stack all the columns into one matrix. That matrix was then loaded into the db using psycopg2's copy_from function. I got the code for that from this question:
Use binary COPY table FROM with psycopg2
Parts of my code:
dates = num2date(rootgrp.variables['time'][:], units=rootgrp.variables['time'].units)
var1 = rootgrp.variables['var1']
var2 = rootgrp.variables['var2']
cpy = cStringIO.StringIO()
for timeindex, time in enumerate(dates):
    validtimes = np.empty(var1[timeindex].size, dtype="object")
    validtimes.fill(time)
    # Transpose and stack the arrays of parameters
    # [a,a,a,a]      [[a,b,c],
    # [b,b,b,b]  =>   [a,b,c],
    # [c,c,c,c]       [a,b,c],
    #                 [a,b,c]]
    a = np.hstack((
        validtimes.reshape(validtimes.size, 1),
        stationnames.reshape(stationnames.size, 1),
        var1[timeindex].reshape(var1[timeindex].size, 1),
        var2[timeindex].reshape(var2[timeindex].size, 1)
    ))
    # Fill the cStringIO with a text representation of the created array
    for row in a:
        cpy.write(row[0].strftime("%Y-%m-%d %H:%M") + '\t' + row[1] + '\t' + '\t'.join([str(x) for x in row[2:]]) + '\n')
conn = psycopg2.connect("host=postgresserver dbname=nc user=user password=passwd")
curs = conn.cursor()
cpy.seek(0)
curs.copy_from(cpy, 'ncdata', columns=('validtime', 'stationname', 'var1', 'var2'))
conn.commit()
There are a few simple improvements you can make to speed this up. All these are independent, you can try all of them or just a couple to see if it's fast enough. They're in roughly ascending order of difficulty:
1. Use the psycopg2 database driver, it's faster.
2. Wrap the whole block of inserts in a transaction. If you're using psycopg2 you're already doing this - it auto-opens a transaction that you have to commit at the end.
3. Collect up several rows' worth of values in an array and do a multi-valued INSERT every n rows.
4. Use more than one connection to do the inserts via helper processes - see the multiprocessing module. Threads won't work as well because of GIL (global interpreter lock) issues.
5. If you don't want to use one big transaction, you can set synchronous_commit = off and set a commit_delay so the connection can return before the disk flush actually completes. This won't help you much if you're doing all the work in one transaction.
Multi-valued inserts
Psycopg2 doesn't directly support multi-valued INSERT but you can just write:
curs.execute("""
    INSERT INTO blah(a,b) VALUES
    (%s,%s),
    (%s,%s),
    (%s,%s),
    (%s,%s),
    (%s,%s);
""", parms)
and loop with something like:
parms = []
rownum = 0
for x in input_data:
    parms.extend([x.firstvalue, x.secondvalue])
    rownum += 1
    if rownum % 5 == 0:
        curs.execute("""INSERT ...""", tuple(parms))
        del(parms[:])
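For what it's worth, psycopg2 2.7 and later also ship a helper that builds the multi-valued INSERT for you, so the manual batching above can be replaced by something like the following (table and column names are placeholders):
from psycopg2.extras import execute_values

rows = [(x.firstvalue, x.secondvalue) for x in input_data]
execute_values(curs, "INSERT INTO blah (a, b) VALUES %s", rows, page_size=100)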
Organize your loop to access all the variables for each time. In other words, read and write a record at a time rather than a variable at a time. This can speed things up enormously, especially if the source netCDF dataset is stored on a file system with large disk blocks, e.g. 1MB or larger. For an explanation of why this is faster and a discussion of order-of-magnitude resulting speedups, see this NCO speedup discussion, starting with entry 7.
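As a rough illustration of record-at-a-time access with netCDF4 (the variable and file names follow the question; extend the tuple to all eight variables as needed):
from netCDF4 import Dataset

rootgrp = Dataset('stationdata.nc')
stationnames = rootgrp.variables['station_name'][:]
for t in range(len(rootgrp.dimensions['time'])):
    # one contiguous slice per variable for this time step
    record = {name: rootgrp.variables[name][t, :] for name in ('var1', 'var2')}
    # build the rows for time step t here and COPY them in one batch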

Postgis - How do i check the geometry type before i do an insert

I have a Postgres database with millions of rows in it. It has a column called geom which contains the boundary of a property.
Using a Python script I am extracting the information from this table and re-inserting it into a new table.
When I insert into the new table, the script bugs out with the following:
Traceback (most recent call last):
  File "build_parcels.py", line 258, in <module>
    main()
  File "build_parcels.py", line 166, in main
    update_cursor.executemany("insert into parcels (par_id, street_add, title_no, proprietors, au_name, ua_name, geom) VALUES (%s, %s, %s, %s, %s, %s, %s)", inserts)
psycopg2.IntegrityError: new row for relation "parcels" violates check constraint "enforce_geotype_geom"
The new table has a check constraint enforce_geotype_geom = ((geometrytype(geom) = 'POLYGON'::text) OR (geom IS NULL)) whereas the old table does not, so I'm guessing there is dud data or non-polygon data (perhaps multipolygons?) in the old table. I want to keep the new data as POLYGON, so I don't want to insert anything else.
Initially I tried wrapping the query in standard Python error handling in the hope that the dud geom rows would fail while the script kept running, but the script has been written to commit at the end rather than after each row, so that doesn't work.
I think what I need to do is iterate through the geom rows of the old table and check what type of geometry each one is, so I can establish whether or not I want to keep it before inserting into the new table.
What's the best way of going about this?
This astonishingly useful bit of PostGIS SQL should help you figure it out... there are many geometry type tests in here:
-- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
--
-- $Id: cleanGeometry.sql 2008-04-24 10:30Z Dr. Horst Duester $
--
-- cleanGeometry - remove self- and ring-selfintersections from
-- input Polygon geometries
-- http://www.sogis.ch
-- Copyright 2008 SO!GIS Koordination, Kanton Solothurn, Switzerland
-- Version 1.0
-- contact: horst dot duester at bd dot so dot ch
--
-- This is free software; you can redistribute and/or modify it under
-- the terms of the GNU General Public Licence. See the COPYING file.
-- This software is without any warrenty and you use it at your own risk
--
-- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
CREATE OR REPLACE FUNCTION cleanGeometry(geometry)
  RETURNS geometry AS
$BODY$
DECLARE
  inGeom ALIAS for $1;
  outGeom geometry;
  tmpLinestring geometry;
Begin
  outGeom := NULL;
  -- Clean Process for Polygon
  IF (GeometryType(inGeom) = 'POLYGON' OR GeometryType(inGeom) = 'MULTIPOLYGON') THEN
    -- Only process if geometry is not valid,
    -- otherwise put out without change
    if not isValid(inGeom) THEN
      -- create nodes at all self-intersecting lines by union the polygon boundaries
      -- with the startingpoint of the boundary.
      tmpLinestring := st_union(st_multi(st_boundary(inGeom)),st_pointn(boundary(inGeom),1));
      outGeom = buildarea(tmpLinestring);
      IF (GeometryType(inGeom) = 'MULTIPOLYGON') THEN
        RETURN st_multi(outGeom);
      ELSE
        RETURN outGeom;
      END IF;
    else
      RETURN inGeom;
    END IF;
  ------------------------------------------------------------------------------
  -- Clean Process for LINESTRINGS, self-intersecting parts of linestrings
  -- will be divided into multiparts of the mentioned linestring
  ------------------------------------------------------------------------------
  ELSIF (GeometryType(inGeom) = 'LINESTRING') THEN
    -- create nodes at all self-intersecting lines by union the linestrings
    -- with the startingpoint of the linestring.
    outGeom := st_union(st_multi(inGeom),st_pointn(inGeom,1));
    RETURN outGeom;
  ELSIF (GeometryType(inGeom) = 'MULTILINESTRING') THEN
    outGeom := multi(st_union(st_multi(inGeom),st_pointn(inGeom,1)));
    RETURN outGeom;
  ELSIF (GeometryType(inGeom) = '<NULL>' OR GeometryType(inGeom) = 'GEOMETRYCOLLECTION') THEN
    RETURN NULL;
  ELSE
    RAISE NOTICE 'The input type % is not supported %',GeometryType(inGeom),st_summary(inGeom);
    RETURN inGeom;
  END IF;
End;
$BODY$
LANGUAGE 'plpgsql' VOLATILE;
Option 1 is to create a savepoint before each insert and roll back to that savepoint if an INSERT fails.
Option 2 is to attach the check constraint expression as a WHERE condition on the original query that produced the data to avoid selecting it at all.
The best answer depends on the size of the tables, the relative number of faulty rows, and how fast and often this is supposed to run.
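As a rough sketch of option 2 with psycopg2, the check constraint's expression can go straight into the SELECT that feeds the inserts (old_parcels is a placeholder for the source table; the column list mirrors the INSERT in the traceback):
read_cursor = conn.cursor()
read_cursor.execute("""
    SELECT par_id, street_add, title_no, proprietors, au_name, ua_name, geom
    FROM old_parcels
    WHERE GeometryType(geom) = 'POLYGON' OR geom IS NULL
""")
inserts = read_cursor.fetchall()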
I think you can use
ST_CollectionExtract — Given a (multi)geometry, returns a (multi)geometry consisting only of elements of the specified type.
I use it when inserting the results of an ST_Intersection: ST_Dump breaks any multi-polygons and collections into individual geometries, and then ST_CollectionExtract(theGeom, 3) discards anything but polygons:
ST_CollectionExtract((st_dump(ST_Intersection(data.polygon, grid.polygon))).geom, 3)::geometry(polygon, 4326)
The second parameter above (3) can be: 1 == POINT, 2 == LINESTRING, 3 == POLYGON
