I use PyScripter to call PostgreSQL from outside the database and route my network. Here's my code:
import sys, os
#set up psycopg2 environment
import psycopg2
#driving_distance module
query = """
select *
from driving_distance ($$
select
gid as id,
source::int4 as source,
target::int4 as target,
cost::double precision as cost
from network
$$, %s, %s, %s, %s
)
"""
#make connection between python and postgresql
conn = psycopg2.connect("dbname = 'TC_area' user = 'postgres' host = 'localhost' password = 'xxxx'")
cur = conn.cursor()
#count rows in the table
cur.execute("select count(*) from network")
result = cur.fetchone()
k = result[0] + 1 #number of points = number of segments + 1
#run loops
rs = []
i = 1
while i <= k:
    cur.execute(query, (i, 1000000, False, True))
    rs.append(cur.fetchall())
    i = i + 1
#import csv module
import csv
j = 0
h = 0
ars = []
element = list(rs)
#export data to every row
with open('distMatrix.csv', 'wb') as f:
    writer = csv.writer(f, delimiter = ',')
    while j <= k - 1:
        while h <= k - 1:
            rp = element[j][h][1]
            ars.append(rp)
            h = h + 1
        else:
            h = 0
            writer.writerow(ars)
            ars = []
            j = j + 1
conn.close()
The result is good, but if I want to use the reverse_cost feature of the driving_distance function in PostgreSQL, I just add one line below 'cost::double precision as cost':
rcost::double precision as reverse_cost
An error box popped up after I added this line, and this is the error message in Python IDLE:
Traceback (most recent call last):
File "C:\Users\Heinz\Documents\pyscript\postgresql\distMatrix.py.py", line 47, in <module>
cur.execute(query, (i, 1000000, False, True))
ProgrammingError: ERROR: syntax error at or near "rcost"
LINE 7: rcost::double precision as reverse_cost
^
QUERY:
select
gid as id,
source::int4 as source,
target::int4 as target,
cost::double precision as cost
rcost::double precision as reverse_cost
from network
P.S. I have altered the network table, so it does have an 'rcost' column.
If I execute the query inside pgAdmin, I can successfully get the right result:
SELECT * FROM driving_distance('
SELECT gid as id,
source::int4 AS source,
target::int4 AS target,
cost::double precision as cost,
rcost::double precision as reverse_cost
FROM network',
3, 10000000, false, True);
But I need Python to do the loop. How can I solve this problem?
P.S. If I set both booleans in the function to False, the program should theoretically ignore rcost and return answers calculated from cost only, but I still got the same error.
The problem seems to come from rcost.
I am using PostgreSQL 8.4 and Python 2.7.6 under Windows 8.1 x64.
UPDATE#1
I changed 2 lines in my script and now it works:
cost::double precision as cost, #need to add a trailing comma if followed by rcost
cur.execute(query, (i, 100000000000, False, True)) #the range must be greater than max of rcost
Your line "cost::double precision as cost" needs a trailing comma.
Related
I've got this NetCDF of weather data (one of thousands that require PostgreSQL ingestion). I'm currently able to insert each band into a PostGIS-enabled table at a rate of about 20-23 seconds per band. (This is for monthly data; there is also daily data that I have yet to test.)
I've heard of different ways of speeding this up using COPY FROM, removing the gid, using SSDs, etc., but I'm new to Python and have no idea how to get the NetCDF data into something I could use with COPY FROM, or what the best route might be.
If anyone has any other ideas on how to speed this up, please share!
Here is the ingestion script
import netCDF4, psycopg2, time
# Establish connection
db1 = psycopg2.connect("host=localhost dbname=postgis_test user=********** password=********")
cur = db1.cursor()
# Create Table in postgis
print(str(time.ctime()) + " CREATING TABLE")
try:
    cur.execute("DROP TABLE IF EXISTS table_name;")
    db1.commit()
    cur.execute(
        "CREATE TABLE table_name (gid serial PRIMARY KEY not null, thedate DATE, thepoint geometry, lon decimal, lat decimal, thevalue decimal);")
    db1.commit()
    print("TABLE CREATED")
except:
    print(psycopg2.DatabaseError)
    print("TABLE CREATION FAILED")
rawvalue_nc_file = 'netcdf_file.nc'
nc = netCDF4.Dataset(rawvalue_nc_file, mode='r')
nc.variables.keys()
lat = nc.variables['lat'][:]
lon = nc.variables['lon'][:]
time_var = nc.variables['time']
dtime = netCDF4.num2date(time_var[:], time_var.units)
newtime = [fdate.strftime('%Y-%m-%d') for fdate in dtime]
rawvalue = nc.variables['tx_max'][:]
lathash = {}
lonhash = {}
entry1 = 0
entry2 = 0
lattemp = nc.variables['lat'][:].tolist()
for entry1 in range(lat.size):
    lathash[entry1] = lattemp[entry1]
lontemp = nc.variables['lon'][:].tolist()
for entry2 in range(lon.size):
    lonhash[entry2] = lontemp[entry2]
for timestep in range(dtime.size):
    print(str(time.ctime()) + " " + str(timestep + 1) + "/180")
    for _lon in range(lon.size):
        for _lat in range(lat.size):
            latitude = round(lathash[_lat], 6)
            longitude = round(lonhash[_lon], 6)
            thedate = newtime[timestep]
            thevalue = round(float(rawvalue.data[timestep, _lat, _lon] - 273.15), 3)
            if (thevalue > -100):
                cur.execute("INSERT INTO table_name (thedate, thepoint, thevalue) VALUES (%s, ST_MakePoint(%s,%s,0), %s)",(thedate, longitude, latitude, thevalue))
                db1.commit()
cur.close()
db1.close()
print(" Done!")
If you're certain most of the time is spent in PostgreSQL, and not in any other code of your own, you may want to look at the fast execution helpers, namely psycopg2.extras.execute_values() in your case.
Also, you may want to make sure you're in a transaction, so the database doesn't fall back to an autocommit mode. ("If you do not issue a BEGIN command, then each individual statement has an implicit BEGIN and (if successful) COMMIT wrapped around it.")
Something like this could do the trick -- not tested though.
import psycopg2.extras  # execute_values lives in psycopg2.extras, not on the cursor

for timestep in range(dtime.size):
    print(str(time.ctime()) + " " + str(timestep + 1) + "/180")
    values = []
    cur.execute("BEGIN")
    for _lon in range(lon.size):
        for _lat in range(lat.size):
            latitude = round(lathash[_lat], 6)
            longitude = round(lonhash[_lon], 6)
            thedate = newtime[timestep]
            thevalue = round(
                float(rawvalue.data[timestep, _lat, _lon] - 273.15), 3
            )
            if thevalue > -100:
                values.append((thedate, longitude, latitude, thevalue))
    psycopg2.extras.execute_values(
        cur,
        "INSERT INTO table_name (thedate, thepoint, thevalue) VALUES %s",
        values,
        template="(%s, ST_MakePoint(%s,%s,0), %s)"
    )
    db1.commit()
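The question also mentions COPY FROM. A rough, untested sketch of that route, reusing the values list and connection from the snippet above, would be to buffer each timestep into an in-memory file and hand it to cursor.copy_from, then fill the geometry column in one UPDATE (COPY itself cannot call ST_MakePoint per row):
import io

# sketch only: COPY the raw columns, then build the points in a single statement
buf = io.StringIO()
for thedate, longitude, latitude, thevalue in values:
    buf.write("%s\t%s\t%s\t%s\n" % (thedate, longitude, latitude, thevalue))
buf.seek(0)
cur.copy_from(buf, "table_name", columns=("thedate", "lon", "lat", "thevalue"))
cur.execute("UPDATE table_name SET thepoint = ST_MakePoint(lon, lat, 0) WHERE thepoint IS NULL")
db1.commit()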
I am using Python to import a CSV file with coordinates in it, passing it to a list and using the contained data to calculate the area of each irregular figure. The data within the CSV file looks like this:
ID Name DE1 DN1 DE2 DN2 DE3 DN3
88637 Zack Fay -0.026841782 -0.071375637 0.160878583 -0.231788845 0.191811833 0.396593863
88687 Victory Greenfelder 0.219394372 -0.081932907 0.053054879 -0.048356016
88737 Lynnette Gorczany 0.043632299 0.118916157 0.005488698 -0.268612073
88787 Odelia Tremblay PhD 0.083147337 0.152277791 -0.039216388 0.469656787 -0.21725977 0.073797219
The code I am using is below; however, it brings up an IndexError, as the first line doesn't have data in all columns. Is there a way to handle the CSV file so it only uses the columns with data in them?
import csv
import math

def main():
    try:
        # ask user to open a file with coordinates for 4 points
        my_file = raw_input('Enter the Irregular Differences file name and location: ')
        file_list = []
        with open(my_file, 'r') as my_csv_file:
            reader = csv.reader(my_csv_file)
            print 'my_csv_file: ', (my_csv_file)
            reader.next()
            for row in reader:
                print row
                file_list.append(row)
        all = calculate(file_list)
        save_write_file(all)
    except IOError:
        print 'File reading error, Goodbye!'
    except IndexError:
        print 'Index Error, Check Data'

# now do your calculations on the 'data' in the file.
def calculate(my_file):
    return_list = []
    for row in my_file:
        de1 = float(row[2])
        dn1 = float(row[3])
        de2 = float(row[4])
        dn2 = float(row[5])
        de3 = float(row[6])
        dn3 = float(row[7])
        de4 = float(row[8])
        dn4 = float(row[9])
        de5 = float(row[10])
        dn5 = float(row[11])
        de6 = float(row[12])
        dn6 = float(row[13])
        de7 = float(row[14])
        dn7 = float(row[15])
        de8 = float(row[16])
        dn8 = float(row[17])
        de9 = float(row[18])
        dn9 = float(row[19])
        area_squared = abs((dn1 * de2) - (dn2 * de1)) + ((de3 * dn4) - (dn3 * de4)) + ((de5 * dn6) - (de6 * dn5)) + ((de7 * dn8) - (dn7 * de8)) + ((dn9 * de1) - (de9 * dn1))
        area = area_squared / 2
        row.append(area)
        return_list.append(row)
    return return_list

def save_write_file(all):
    with open('output_task4B.csv', 'w') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["ID", "Name", "de1", "dn1", "de2", "dn2", "de3", "dn3", "de4", "dn4", "de5", "dn5", "de6", "dn6", "de7", "dn7", "de8", "dn8", "de9", "dn9", "Area"])
        writer.writerows(all)

if __name__ == '__main__':
    main()
Any suggestions?
Your problem appears to be in the calculate function.
You are trying to access various indexes of row without first confirming they exist. One naive approach might be to consider the values to be zero if they are not present, except that:
+ ((dn9 * de1) - (de9 * dn1))
is an attempt to wrap around, and this might invalidate your math since they would go to zero.
A better approach is probably to use a slice of the row, and use the sequence-iterating approach instead of trying to require a certain number of points. This lets your code fit the data.
coords = [float(c) for c in row[2:] if c != '']  # skip id and name; convert to floats, ignoring empty cells
assert len(coords) % 2 == 0, "Coordinates must come in pairs!"
prev_de = coords[-2]
prev_dn = coords[-1]
area_squared = 0.0
for de, dn in zip(coords[:-1:2], coords[1::2]):
    area_squared += (de * prev_dn) - (dn * prev_de)
    prev_de, prev_dn = de, dn
area = abs(area_squared) / 2
The next problem will be dealing with variable length output. I'd suggest putting the area before the coordinates. That way you know it's always column 3 (or whatever).
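For instance, inside calculate() the output row could be assembled with the area right after the ID and name (a sketch, reusing the names from the snippet above):
# sketch: area comes before the variable-length coordinate list
return_list.append([row[0], row[1], area] + coords)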
Using psycopg2, I'm able to select data from a table in one PostgreSQL database connection and INSERT it into a table in a second PostgreSQL database connection.
However, I'm only able to do it by setting the exact feature I want to extract, and writing out separate variables for each column I'm trying to insert.
Does anyone know of a good practice for either:
moving an entire table between databases, or
iterating through features while not having to declare variables for every column you want to move
or...?
Here's the script I'm currently using where you can see the selection of a specific feature, and the creation of variables (it works, but this is not a practical method):
import psycopg2
connDev = psycopg2.connect("host=host1 dbname=dbname1 user=postgres password=*** ")
connQa = psycopg2.connect("host=host2 dbname=dbname2 user=postgres password=*** ")
curDev = connDev.cursor()
curQa = connQa.cursor()
sql = ('INSERT INTO "tempHoods" (nbhd_name, geom) values (%s, %s);')
curDev.execute('select cast(geom as varchar) from "CCD_Neighborhoods" where nbhd_id = 11;')
tempGeom = curDev.fetchone()
curDev.execute('select nbhd_name from "CCD_Neighborhoods" where nbhd_id = 11;')
tempName = curDev.fetchone()
data = (tempName, tempGeom)
curQa.execute (sql, data)
#commit transactions
connDev.commit()
connQa.commit()
#close connections
curDev.close()
curQa.close()
connDev.close()
connQa.close()
One other note is that Python lets you explicitly work with SQL functions and data type casting, which for us is important as we work with the GEOMETRY data type. Above you can see I'm casting it to text and then dumping it into an existing geometry column in the destination table - this will also work with MS SQL Server, which is a huge feature for the geospatial community...
In your solution (your solution and your question have a different order of statements), change the lines which start with 'sql = ' and the loop before the '#commit transactions' comment to:
sql_insert = 'INSERT INTO "tempHoods" (nbhd_id, nbhd_name, typology, notes, geom) values '
sql_values = ['(%s, %s, %s, %s, %s)']
data_values = []
# you can make this larger if you want
# ...try experimenting to see what works best
batch_size = 100
sql_stmt = sql_insert + ','.join(sql_values*batch_size) + ';'
for i, row in enumerate(rows, 1):
    data_values += row[:5]
    if i % batch_size == 0:
        curQa.execute (sql_stmt , data_values )
        data_values = []
if (i % batch_size != 0):
    sql_stmt = sql_insert + ','.join(sql_values*(i % batch_size)) + ';'
    curQa.execute (sql_stmt , data_values )
BTW, I don't think you need to commit. You don't begin any transactions. So there should not be any need to commit them. Certainly, you don't need to commit a cursor if all you did was a bunch of selects on it.
Here's my updated code based on Dmitry's brilliant solution:
import psycopg2
connDev = psycopg2.connect("host=host1 dbname=dpspgisdev user=postgres password=****")
connQa = psycopg2.connect("host=host2 dbname=dpspgisqa user=postgres password=****")
curDev = connDev.cursor()
curQa = connQa.cursor()
print "Truncating Source"
curQa.execute('delete from "tempHoods"')
connQa.commit()
#Get Data
curDev.execute('select nbhd_id, nbhd_name, typology, notes, cast(geom as varchar) from "CCD_Neighborhoods";') #cast geom to varchar and insert into geometry column!
rows = curDev.fetchall()
sql_insert = 'INSERT INTO "tempHoods" (nbhd_id, nbhd_name, typology, notes, geom) values '
sql_values = ['(%s, %s, %s, %s, %s)'] #number of columns selecting / inserting
data_values = []
batch_size = 1000 #customize for size of tables...
sql_stmt = sql_insert + ','.join(sql_values*batch_size) + ';'
for i, row in enumerate(rows, 1):
    data_values += row[:5] #relates to number of columns (%s)
    if i % batch_size == 0:
        curQa.execute (sql_stmt , data_values )
        connQa.commit()
        print "Inserting..."
        data_values = []
if (i % batch_size != 0):
    sql_stmt = sql_insert + ','.join(sql_values*(i % batch_size)) + ';'
    curQa.execute (sql_stmt, data_values)
    print "Last Values..."
    connQa.commit()
print "Last Values..."
connQa.commit()
# close connections
curDev.close()
curQa.close()
connDev.close()
connQa.close()
I want to speed up one of my tasks and I wrote a little program:
import psycopg2
import random
from concurrent.futures import ThreadPoolExecutor, as_completed
def write_sim_to_db(all_ids2):
    if all_ids1[i] != all_ids2:
        c.execute("""SELECT count(*) FROM similarity WHERE prod_id1 = %s AND prod_id2 = %s""", (all_ids1[i], all_ids2,))
        count = c.fetchone()
        if count[0] == 0:
            sim_sum = random.random()
            c.execute("""INSERT INTO similarity(prod_id1, prod_id2, sim_sum)
                VALUES(%s, %s, %s)""", (all_ids1[i], all_ids2, sim_sum,))
            conn.commit()
conn = psycopg2.connect("dbname='db' user='user' host='localhost' password='pass'")
c = conn.cursor()
all_ids1 = list(n for n in range(1000))
all_ids2_list = list(n for n in range(1000))
for i in range(len(all_ids1)):
    with ThreadPoolExecutor(max_workers=5) as pool:
        results = [pool.submit(write_sim_to_db, i) for i in all_ids2_list]
For a while, the program works correctly, but then I get an error:
Segmentation fault (core dumped)
Or
*** Error in `python3': double free or corruption (out): 0x00007fe574002270 ***
Aborted (core dumped)
If I run this program in one thread, it works great.
with ThreadPoolExecutor(max_workers=1) as pool:
It seems PostgreSQL has no time to process the transactions, but I'm not sure. There are no errors in the log file.
I do not know how to find the error. Help.
I had to use a connection pool.
import psycopg2
import random
from concurrent.futures import ThreadPoolExecutor, as_completed
from psycopg2.pool import ThreadedConnectionPool
def write_sim_to_db(all_ids2):
    if all_ids1[i] != all_ids2:
        conn = tcp.getconn()
        c = conn.cursor()
        c.execute("""SELECT count(*) FROM similarity WHERE prod_id1 = %s AND prod_id2 = %s""", (all_ids1[i], all_ids2,))
        count = c.fetchone()
        if count[0] == 0:
            sim_sum = random.random()
            c.execute("""INSERT INTO similarity(prod_id1, prod_id2, sim_sum)
                VALUES(%s, %s, %s)""", (all_ids1[i], all_ids2, sim_sum,))
            conn.commit()
        tcp.putconn(conn)
DSN = "postgresql://user:pass@localhost/db"
tcp = ThreadedConnectionPool(1, 10, DSN)
all_ids1 = list(n for n in range(1000))
all_ids2_list = list(n for n in range(1000))
for i in range(len(all_ids1)):
    with ThreadPoolExecutor(max_workers=2) as pool:
        results = [pool.submit(write_sim_to_db, i) for i in all_ids2_list]
This is the sane approach to speed it up. It will be much faster and simpler than your code.
tuple_list = []
for p1 in range(3):
    for p2 in range(3):
        if p1 == p2: continue
        tuple_list.append((p1,p2,random.random()))
insert = """
insert into similarity (prod_id1, prod_id2, sim_sum)
select prod_id1, prod_id2, i.sim_sum
from
(values
{}
) i (prod_id1, prod_id2, sim_sum)
left join
similarity s using (prod_id1, prod_id2)
where s is null
""".format(',\n '.join(['%s'] * len(tuple_list)))
print(cur.mogrify(insert, tuple_list))
cur.execute(insert, tuple_list)
Output:
insert into similarity (prod_id1, prod_id2, sim_sum)
select prod_id1, prod_id2, i.sim_sum
from
(values
(0, 1, 0.7316830646236253),
(0, 2, 0.36642199082207805),
(1, 0, 0.9830936499726003),
(1, 2, 0.1401200246162232),
(2, 0, 0.9921581283868096),
(2, 1, 0.47250175432277497)
) i (prod_id1, prod_id2, sim_sum)
left join
similarity s using (prod_id1, prod_id2)
where s is null
BTW there is no need for Python at all. It can all be done in a plain SQL query.
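For example, a plain-SQL version of the whole job could look roughly like this (a sketch, assuming product IDs 0..999 as in the question):
insert into similarity (prod_id1, prod_id2, sim_sum)
select p1, p2, random()
from generate_series(0, 999) p1
cross join generate_series(0, 999) p2
where p1 <> p2
  and not exists (
    select 1 from similarity s
    where s.prod_id1 = p1 and s.prod_id2 = p2
  );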
Here's the code I'm working on:
poljeID = int(cursor.execute("SELECT poljeID FROM stanje"))
xkoord = cursor.execute("SELECT xkoord FROM polje WHERE poljeID = %s;", poljeID)
ykoord = cursor.execute("SELECT ykoord FROM polje WHERE poljeID = %s;", poljeID)
print xkoord, ykoord
It's a snippet from a larger script; basically, what it needs to do is fetch the ID of the field (poljeID) the agent is currently on (stanje) and use it to get the x and y coordinates of that field (xkoord, ykoord).
The initial values for the variables are:
poljeID = 1
xkoord = 0
ykoord = 0
The values that I get with that code are:
poljeID = 1
xkoord = 1
ykoord = 1
What am I doing wrong?
cursor.execute does not return the result of the query; it returns the number of rows affected. To get the result, you need to call cursor.fetchone() (or cursor.fetchall()) after each query.
(Note: really, the second and third queries should be done at once: SELECT xkoord, ykoord FROM ...)
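A minimal sketch of the corrected snippet, assuming the same table and column names:
cursor.execute("SELECT poljeID FROM stanje")
poljeID = cursor.fetchone()[0]

cursor.execute("SELECT xkoord, ykoord FROM polje WHERE poljeID = %s;", (poljeID,))
xkoord, ykoord = cursor.fetchone()
print xkoord, ykoord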