How to pass updates into pd.read_sql? Update - python

I have a query that I run on a weekly basis with pd.read_sql to create a report. I want to be able to update the CASE statement for the store numbers, the DATEADD values, and the Store IN (...) list at the end of the statement without manually editing them each time. Is there any way to edit the query so those values can be passed in?
This is the query:
dataframe = pd.read_sql("""
    SELECT TOP(10)
        CAST(Store as VARCHAR) + 'þ' as Store,
        CONVERT(VARCHAR, Tran_Dt2, 101) + 'þ' as Tran_Dt,
        CONVERT(char(5), Start_Time, 108) + 'þ' as Start_Time,
        [Count]
    FROM
    (
        SELECT
            CASE
                WHEN [Store] = 313 THEN 3174
                WHEN [Store] = 126 THEN 3191
            END AS Store
            ,DATEADD(YEAR, +2, DATEADD(DAY, +4, Tran_Dt2)) as Tran_Dt2
            ,[Start_Time]
            ,[Count]
            ,Store as Sister_Store
        FROM
        (
            SELECT
                Store,
                CONVERT(datetime, Tran_Dt) as Tran_Dt2,
                Start_Time,
                Count
            FROM [VolumeDrivers].[dbo].[SALES_DRIVERS_ITC_Signup_65wks]
            WHERE CONVERT(datetime, Tran_Dt) between CONVERT(datetime, '2/8/2019') and CONVERT(datetime, '3/15/2019')
            AND Store IN (313, 126)
            --Single Store: Store = Store #
        ) AS A
    ) AS B
    ORDER BY Tran_Dt2, Store
    """, con=conn)
I would like to declare variables and have them populate in the query, something like:
oldstore1 = 313
newstore1 = 3174
oldstore2 = 126
newstore2 = 3191
dataframe = pd.read_sql("""...
...
SELECT
CASE
WHEN [Store] = oldstore1 THEN newstore1
WHEN [Store] = oldstore2 THEN newstore2
...
UPDATE:
I am currently at this point. I had the query working until my kernel restarted and I lost my code. Any tips on why it isn't working anymore?
#Declare variables for queries
old_store1 = 313
new_store1 = 3157
old_store2 = 126
new_store2 = 3196
datefrom = '2/8/2019'
dateto = '3/15/2019'
yearadd = '+2'
dayadd = '+4'
ITC = pd.read_sql("""SELECT
    CAST(Store as VARCHAR) + 'þ' as Store,
    CONVERT(VARCHAR, Tran_Dt2, 101) + 'þ' as Tran_Dt,
    CONVERT(char(5), Start_Time, 108) + 'þ' as Start_Time,
    [Count]
    FROM
    (
        SELECT
            CASE
                WHEN [Store] = {old_store1} THEN {new_store1}
                WHEN [Store] = {old_store2} THEN {new_store2}
            END AS Store
            ,DATEADD(YEAR, {yearadd}, DATEADD(DAY, {dayadd}, Tran_Dt2)) as Tran_Dt2
            ,[Start_Time]
            ,[Count]
            ,Store as Sister_Store
        FROM
        (
            SELECT
                Store,
                CONVERT(datetime, Tran_Dt) as Tran_Dt2,
                Start_Time,
                Count
            FROM [VolumeDrivers].[dbo].[SALES_DRIVERS_ITC_Signup_65wks]
            WHERE CONVERT(datetime, Tran_Dt) between CONVERT(datetime, {datefrom}) and CONVERT(datetime, {dateto})
            AND Store IN ({old_store1}, {old_store2})
            --Single Store: Store = Store #
        ) AS A
    ) AS B
    ORDER BY Tran_Dt2, Store
    """, con=conn)

Was able to figure out why it wasn't working. Python 3.6 and later supports f-strings: putting an f in front of the query string lets you drop the variables you created straight into it. I know this isn't the most secure way of executing the script, but I'll look into passing the values as bound parameters in the future to make it more secure. Thanks for all the help!
#Declare variables for queries
old_store1 = 313
new_store1 = 3157
old_store2 = 126
new_store2 = 3196
datefrom = '2/15/2019'
dateto = '3/22/2019'
yearadd = '+2'
dayadd = '+4'
ITC = pd.read_sql(f"""SELECT
    CAST(Store as VARCHAR) + 'þ' as Store,
    CONVERT(VARCHAR, Tran_Dt2, 101) + 'þ' as Tran_Dt,
    CONVERT(char(5), Start_Time, 108) + 'þ' as Start_Time,
    [Count]
    FROM
    (
        SELECT
            CASE
                WHEN [Store] = {old_store1} THEN {new_store1}
                WHEN [Store] = {old_store2} THEN {new_store2}
.....
#run code and verify it works
Sales_Drivers_ITCSignup = pd.read_sql(ITCQuery, con = conn, index_col='Store')
Sales_Drivers_ITCSignup.head()
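For reference, a more secure variant keeps the f-string only for the parts that cannot be bound (the DATEADD offsets) and passes the actual data values through the driver's parameter binding. This is just a condensed, untested sketch: it assumes a pyodbc-style SQL Server connection that uses ? placeholders and flattens the query to a single SELECT, so it is not a drop-in replacement for the full report query above.
params = [old_store1, new_store1, old_store2, new_store2, datefrom, dateto, old_store1, old_store2]
ITC = pd.read_sql(
    f"""SELECT
            CASE
                WHEN [Store] = ? THEN ?
                WHEN [Store] = ? THEN ?
            END AS Store,
            DATEADD(YEAR, {int(yearadd)}, DATEADD(DAY, {int(dayadd)}, CONVERT(datetime, Tran_Dt))) AS Tran_Dt2,
            Start_Time,
            [Count]
        FROM [VolumeDrivers].[dbo].[SALES_DRIVERS_ITC_Signup_65wks]
        WHERE CONVERT(datetime, Tran_Dt) BETWEEN CONVERT(datetime, ?) AND CONVERT(datetime, ?)
          AND Store IN (?, ?)""",
    con=conn,
    params=params,  # only literal values can be bound; table/column names and the offsets cannot
)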

Related

Selecting less than / higher than in sqlite3

I would like to insert data into a new table based on whether the AmountSpent in tblFinance for a given CategoryID is over or under the budget in tblCategory, which is linked by that ID.
For example:
If AmountSpent = 1000 for CategoryID = 1 and CategoryBudget = 900, insert the AmountSpent into tblExpRptOver; if it's less than the budget, insert it into tblExpRptUnder.
I've tried what I thought should work but it gives me this error:
line 613, in test
'CategoryID = ? AND AmountSpent < ? ', (category_input, budget_input, ))
sqlite3.InterfaceError: Error binding parameter 1 - probably unsupported type.
The code I'm trying to execute is:
view_avg = input("Would you like to view the averages for under/over expenses for a specific category? Y/N \n")
if view_avg == "Y":
    category_input = input("Please input the CategoryID for the category you'd like to view the averages for:"
                           "\n")
    budget_input = ('SELECT CategoryBudget FROM tblCategory WHERE CategoryID = ?', (category_input, ))
    c.execute('INSERT INTO tblExpOver SELECT CategoryID, AmountSpent FROM tblExpRptMonth WHERE '
              'CategoryID = ? AND AmountSpent < ? ', (category_input, budget_input, ))
    c.execute('INSERT INTO tblExpUnder SELECT CategoryID, AmountSpent FROM tblExpRptMonth WHERE '
              'CategoryID = ? AND AmountSpent > ? ', (category_input, budget_input,))
    df_avg_over = pd.read_sql_query("SELECT * FROM tblExpOver", conn)
    df_avg_under = pd.read_sql_query("SELECT * FROM tblExpUnder", conn)
    with PdfPages('GraphByMonth.pdf') as pdf:
        firstPage = plt.figure(figsize=(11.69, 8.27))
        firstPage.clf()
        txt = 'Average Expense on Over Expenses for Category ', category_input, ':', \
              df_avg_over["AmountSpent"].mean()
        txt1 = 'Average Expense on Under Expenses for Category ', category_input, ':', \
               df_avg_under["AmountSpent"].mean()
        firstPage.text(0.5, 0.5, txt, txt1, transform=firstPage.transFigure, size=24, ha="center")
        pdf.savefig()
        plt.close()
        fig = plt.figure(figsize=(11.69, 8.27))
        df2.plot(kind='bar')
        plt.title('Expense Graph by Month')
        txt = 'Month Expense Graph'
        plt.text(0.05, 0.95, txt, transform=fig.transFigure, size=24)
        plt.xlabel("Expense Number")
        plt.ylabel("Amount Spent")
        pdf.savefig()
        plt.close()
menu()
Any help is appreciated!
EDIT: CategoryBudget is stored in a different table for category information - tblCategory.
You're never executing the budget_input query; you've only built a tuple holding the SQL string and its parameters, and a tuple can't be bound as a query parameter (hence the InterfaceError).
Instead of doing a separate query, you should join the tblCategory table in the INSERT ... SELECT query.
c.execute('''INSERT INTO tblExpOver
             SELECT r.CategoryID, r.AmountSpent
             FROM tblExpRptMonth AS r
             JOIN tblCategory AS c ON r.CategoryId = c.CategoryId AND r.AmountSpent < c.CategoryBudget
             WHERE r.CategoryID = ?''', (category_input,))
c.execute('''INSERT INTO tblExpUnder
             SELECT r.CategoryID, r.AmountSpent
             FROM tblExpRptMonth AS r
             JOIN tblCategory AS c ON r.CategoryId = c.CategoryId AND r.AmountSpent > c.CategoryBudget
             WHERE r.CategoryID = ?''', (category_input,))
Note that you never insert anything for rows that are exactly equal to the budgeted amount. Unless that's intentional, you should change either < or > to <= or >=, depending on which category you want them to be in.
Also, it seems like you may have < and > backwards, if the Over and Under table names mean that the spending is more and less than the budgeted amount.
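If you would rather keep it as two separate steps, the budget query has to actually be executed and its result fetched before the value can be used as a parameter. A rough sketch of that approach (untested, reusing the question's table and column names, with the comparisons oriented as the note above suggests):
c.execute('SELECT CategoryBudget FROM tblCategory WHERE CategoryID = ?', (category_input,))
row = c.fetchone()
if row is not None:
    budget_input = row[0]  # the actual budget number, not a (sql, params) tuple
    c.execute('INSERT INTO tblExpOver SELECT CategoryID, AmountSpent FROM tblExpRptMonth '
              'WHERE CategoryID = ? AND AmountSpent > ?', (category_input, budget_input))
    c.execute('INSERT INTO tblExpUnder SELECT CategoryID, AmountSpent FROM tblExpRptMonth '
              'WHERE CategoryID = ? AND AmountSpent < ?', (category_input, budget_input))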

Speeding up insertion of point data from netcdf

I've got a NetCDF file of weather data (one of thousands that require PostgreSQL ingestion). I'm currently able to insert each band into a PostGIS-enabled table at a rate of about 20-23 seconds per band (this is for monthly data; there is also daily data that I have yet to test).
I've heard of different ways of speeding this up using COPY FROM, removing the gid, using SSDs, etc., but I'm new to Python and have no idea how to get the NetCDF data into something I could use with COPY FROM, or what the best route might be.
If anyone has any other ideas on how to speed this up, please share!
Here is the ingestion script:
import netCDF4, psycopg2, time

# Establish connection
db1 = psycopg2.connect("host=localhost dbname=postgis_test user=********** password=********")
cur = db1.cursor()

# Create Table in postgis
print(str(time.ctime()) + " CREATING TABLE")
try:
    cur.execute("DROP TABLE IF EXISTS table_name;")
    db1.commit()
    cur.execute(
        "CREATE TABLE table_name (gid serial PRIMARY KEY not null, thedate DATE, thepoint geometry, lon decimal, lat decimal, thevalue decimal);")
    db1.commit()
    print("TABLE CREATED")
except:
    print(psycopg2.DatabaseError)
    print("TABLE CREATION FAILED")

rawvalue_nc_file = 'netcdf_file.nc'
nc = netCDF4.Dataset(rawvalue_nc_file, mode='r')
nc.variables.keys()

lat = nc.variables['lat'][:]
lon = nc.variables['lon'][:]
time_var = nc.variables['time']
dtime = netCDF4.num2date(time_var[:], time_var.units)
newtime = [fdate.strftime('%Y-%m-%d') for fdate in dtime]
rawvalue = nc.variables['tx_max'][:]

lathash = {}
lonhash = {}
entry1 = 0
entry2 = 0

lattemp = nc.variables['lat'][:].tolist()
for entry1 in range(lat.size):
    lathash[entry1] = lattemp[entry1]

lontemp = nc.variables['lon'][:].tolist()
for entry2 in range(lon.size):
    lonhash[entry2] = lontemp[entry2]

for timestep in range(dtime.size):
    print(str(time.ctime()) + " " + str(timestep + 1) + "/180")
    for _lon in range(lon.size):
        for _lat in range(lat.size):
            latitude = round(lathash[_lat], 6)
            longitude = round(lonhash[_lon], 6)
            thedate = newtime[timestep]
            thevalue = round(float(rawvalue.data[timestep, _lat, _lon] - 273.15), 3)
            if (thevalue > -100):
                cur.execute("INSERT INTO table_name (thedate, thepoint, thevalue) VALUES (%s, ST_MakePoint(%s,%s,0), %s)", (thedate, longitude, latitude, thevalue))
db1.commit()

cur.close()
db1.close()

print(" Done!")
If you're certain most of the time is spent in PostgreSQL, and not in any other code of your own, you may want to look at psycopg2's fast execution helpers, namely execute_values() from psycopg2.extras in your case.
Also, you may want to make sure you're in a transaction, so the database doesn't fall back to an autocommit mode. ("If you do not issue a BEGIN command, then each individual statement has an implicit BEGIN and (if successful) COMMIT wrapped around it.")
Something like this could do the trick -- not tested though.
import psycopg2.extras  # execute_values lives in psycopg2.extras and must be imported explicitly

for timestep in range(dtime.size):
    print(str(time.ctime()) + " " + str(timestep + 1) + "/180")
    values = []
    cur.execute("BEGIN")
    for _lon in range(lon.size):
        for _lat in range(lat.size):
            latitude = round(lathash[_lat], 6)
            longitude = round(lonhash[_lon], 6)
            thedate = newtime[timestep]
            thevalue = round(
                float(rawvalue.data[timestep, _lat, _lon] - 273.15), 3
            )
            if thevalue > -100:
                values.append((thedate, longitude, latitude, thevalue))
    psycopg2.extras.execute_values(
        cur,
        "INSERT INTO table_name (thedate, thepoint, thevalue) VALUES %s",
        values,
        template="(%s, ST_MakePoint(%s,%s,0), %s)"
    )
    db1.commit()
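Since the question also mentions COPY FROM: one way to try that route is to buffer the rows as tab-separated text in memory and hand them to psycopg2's copy_from(). This is an untested sketch; it assumes PostGIS will parse the WKT text POINT(x y 0) into the geometry column (it normally does), and for very large files you would probably flush the buffer once per timestep instead of holding everything in memory.
import io

buf = io.StringIO()
for timestep in range(dtime.size):
    for _lon in range(lon.size):
        for _lat in range(lat.size):
            latitude = round(lathash[_lat], 6)
            longitude = round(lonhash[_lon], 6)
            thevalue = round(float(rawvalue.data[timestep, _lat, _lon] - 273.15), 3)
            if thevalue > -100:
                # one tab-separated line per row; the geometry is written as WKT text
                buf.write("%s\tPOINT(%s %s 0)\t%s\n" % (newtime[timestep], longitude, latitude, thevalue))
buf.seek(0)
cur.copy_from(buf, 'table_name', sep='\t', columns=('thedate', 'thepoint', 'thevalue'))
db1.commit()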

python MySQLdb query tables iteratively

I'm trying to query one table against the other using Python and MySQLdb. Here's what I've got so far:
import MySQLdb

db = MySQLdb.connect(host='localhost', user='user', passwd='password', db='vacants')
cursor = db.cursor()
numrows = cursor.rowcount
query = ("SELECT address, ((20903520) * acos (cos ( radians(38.67054) ) * cos( radians( lat ) ) "
         "* cos( radians( `long` ) - radians(-90.22942) ) + sin ( radians(38.67054) ) * sin( radians( lat ) ))) "
         "AS distance FROM vacants HAVING distance < 100;")
cursor.execute(query)
I have one table, cfs, and another, vacants. For each row in cfs I want to see whether there is a vacant property within 100 feet. So for radians(38.67054) and radians(-90.22942) (that's just a test latitude and longitude we used), I need to loop through the cfs table so that each cfs latitude and longitude replaces those two numbers.
In the end I'd like to have (in a .csv) the vacant property address, the distance from the call for service, and the type of call (which are two separate fields in the calls for service database). Something like the output of the query above.
Here's example data - calls for service coordinates:
38.595767638008056,-90.2316138251402
38.57283495467307,-90.24649031378685
38.67497061776659,-90.28415976525395
38.67650431524285,-90.25623757427952
38.591971519414784,-90.27782710145746
38.61272746420862,-90.23292862245287
38.67312983860098,-90.23591869583113
38.625956494342674,-90.18853950906939
38.69044465638584,-90.24339061920696
38.67745024638241,-90.20657832034047
And vacants:
38.67054,-90.22942
38.642956,-90.21466
38.671535,-90.27293
38.666367,-90.23749
38.65339,-90.23141
38.645996,-90.20334
38.60214,-90.224815
38.67265,-90.214134
38.665504,-90.274414
38.668354,-90.269966
This is not the final solution, as there is insufficient info in the question (the address field? and the 20903520 constant?), but it might help you get on track by showing how to iterate through both tables and substitute lat/lon from the cfs table into the query applied to the vacants table:
import mysql.connector

cnx1 = mysql.connector.connect(user='root', password='xxxx', host='127.0.0.1', database=db)
cursor1 = cnx1.cursor()
cnx2 = mysql.connector.connect(user='root', password='xxxx', host='127.0.0.1', database=db)
cursor2 = cnx2.cursor()

sql_cfs = ('select lat,lon from cfs')
cursor1.execute(sql_cfs)
for cfs in cursor1:
    [cfs_lat, cfs_lon] = cfs
    print(cfs_lat, cfs_lon)
    query = ("SELECT address, ((20903520) * "
             "acos (cos(radians(lon)) * "
             "cos(radians({})) * "
             "cos(radians({})-radians(lat)) + sin(radians(lon)) * "
             "sin( radians({})))) AS distance "
             "FROM vacants HAVING distance < 100;".format(cfs_lat, cfs_lon, cfs_lat))
    print(query)
    cursor2.execute(query)
    for vacants in cursor2:
        print(vacants)
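As an aside, 20903520 looks like the Earth's radius expressed in feet (3,959 mi x 5,280 ft/mi), which would make the HAVING distance < 100 cutoff a distance in feet. If the end goal is a .csv of matches, one option (a sketch only; the call-type field is not shown in the question, so the column names here are assumptions) is to collect rows inside the loops above and write them out with the csv module:
import csv

# matches would be filled inside the loops above, e.g.
#   matches.append((vacants[0], vacants[1], call_type))
matches = []

with open('vacants_within_100ft.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['address', 'distance_ft', 'call_type'])  # call_type is a placeholder name
    writer.writerows(matches)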

read_sql query returns an empty dataframe after I pass parameters as a dict in python pandas

I am trying to parameterize some parts of a SQL Query using the below dictionary:
query_params = dict(
    {'target': 'status',
     'date_from': '201712',
     'date_to': '201805',
     'drform_target': 'NPA'
     })

sql_data_sample = str("""select *
    from table_name
    where dt = %(date_to)s
    and %(target)s in (%(drform_target)s)
    ----------------------------------------------------
    union all
    ----------------------------------------------------
    (select *
    from table_name
    where dt = %(date_from)s
    and %(target)s in ('ACT')
    order by random() limit 50000);""")

df_data_sample = pd.read_sql(sql_data_sample, con=cnxn, params=query_params)
However this returns a dataframe with no records at all. I am not sure what the error is since no error is being thrown.
df_data_sample.shape
Out[7]: (0, 1211)
The final PostgreSQL query would be:
select *
from table_name
where dt = '201805'
and status in ('NPA')
----------------------------------------------------
union all
----------------------------------------------------
(select *
from table_name
where dt = '201712'
and status in ('ACT')
order by random() limit 50000);-- This part of random() is only for running it on my local and not on server.
Below is a small sample of data for replication; the original data has more than a million records and 1,211 columns:
service_change_3m  service_change_6m  dt      grp_m2     status
 0                 -2                 201805  $50-$75    NPA
 0                  0                 201805  < $25      NPA
 0                 -1                 201805  $175-$200  ACT
 0                  0                 201712  $150-$175  ACT
 0                  0                 201712  $125-$150  ACT
-1                  1                 201805  $50-$75    NPA
Can someone please help me with this?
UPDATE:
Based on the suggestion by @shmee, I am finally using:
target = 'status'
query_params = dict(
    {'date_from': '201712',
     'date_to': '201805',
     'drform_target': 'NPA'
     })

sql_data_sample = str("""select *
    from table_name
    where dt = %(date_to)s
    and {0} in (%(drform_target)s)
    ----------------------------------------------------
    union all
    ----------------------------------------------------
    (select *
    from table_name
    where dt = %(date_from)s
    and {0} in ('ACT')
    order by random() limit 50000);""").format(target)

df_data_sample = pd.read_sql(sql_data_sample, con=cnxn, params=query_params)
Yes, I am quite confident that your issue results from trying to set a column name in your query via parameter binding (the %(target)s in (%(drform_target)s) and %(target)s in ('ACT') conditions), as mentioned in the comments.
This results in your query restricting the result set to records where 'status' in ('ACT') (i.e. Is the string 'status' an element of a list containing only the string 'ACT'?). This is, of course, false, hence no record gets selected and you get an empty result.
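A quick way to see this for yourself (assuming cnxn is a psycopg2 connection) is to render the bound statement with cursor.mogrify(); the column name comes out as a quoted string literal:
cur = cnxn.cursor()
print(cur.mogrify(
    "select * from table_name where dt = %(date_to)s and %(target)s in (%(drform_target)s)",
    {'date_to': '201805', 'target': 'status', 'drform_target': 'NPA'}
).decode())
# -> select * from table_name where dt = '201805' and 'status' in ('NPA')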
This should work as expected:
from psycopg2 import sql  # provides sql.SQL and sql.Identifier

col_name = 'status'
table_name = 'public.churn_data'
query_params = {'date_from': '201712',
                'date_to': '201805',
                'drform_target': 'NPA'
                }

sql_data_sample = """select *
    from {0}
    where dt = %(date_to)s
    and {1} in (%(drform_target)s)
    ----------------------------------------------------
    union all
    ----------------------------------------------------
    (select *
    from {0}
    where dt = %(date_from)s
    and {1} in ('ACT')
    order by random() limit 50000);"""

sql_data_sample = sql.SQL(sql_data_sample).format(sql.Identifier(table_name),
                                                  sql.Identifier(col_name))

df_data_sample = pd.read_sql(sql_data_sample, con=cnxn, params=query_params)

How to use python to ETL between databases?

Using psycopg2, I'm able to select data from a table in one PostgreSQL database connection and INSERT it into a table in a second PostgreSQL database connection.
However, I'm only able to do it by setting the exact feature I want to extract, and writing out separate variables for each column I'm trying to insert.
Does anyone know of a good practice for either:
moving an entire table between databases, or
iterating through features while not having to declare variables for every column you want to move
or...?
Here's the script I'm currently using where you can see the selection of a specific feature, and the creation of variables (it works, but this is not a practical method):
import psycopg2
connDev = psycopg2.connect("host=host1 dbname=dbname1 user=postgres password=*** ")
connQa = psycopg2.connect("host=host2 dbname=dbname2 user=postgres password=*** ")
curDev = connDev.cursor()
curQa = connQa.cursor()
sql = ('INSERT INTO "tempHoods" (nbhd_name, geom) values (%s, %s);')
curDev.execute('select cast(geom as varchar) from "CCD_Neighborhoods" where nbhd_id = 11;')
tempGeom = curDev.fetchone()
curDev.execute('select nbhd_name from "CCD_Neighborhoods" where nbhd_id = 11;')
tempName = curDev.fetchone()
data = (tempName, tempGeom)
curQa.execute (sql, data)
#commit transactions
connDev.commit()
connQa.commit()
#close connections
curDev.close()
curQa.close()
connDev.close()
connQa.close()
One other note is that Python lets you explicitly work with SQL functions and data-type casting, which for us is important since we work with the GEOMETRY data type. Above you can see I'm casting it to text and then dumping it into an existing geometry column in the destination table; this approach will also work with MSSQL Server, which is a huge feature for the geospatial community...
In your solution (your solution and your question have the statements in a different order), change the lines that start with sql = and the loop before the #commit transactions comment to:
sql_insert = 'INSERT INTO "tempHoods" (nbhd_id, nbhd_name, typology, notes, geom) values '
sql_values = ['(%s, %s, %s, %s, %s)']
data_values = []
# you can make this larger if you want
# ...try experimenting to see what works best
batch_size = 100
sql_stmt = sql_insert + ','.join(sql_values * batch_size) + ';'
for i, row in enumerate(rows, 1):
    data_values += row[:5]
    if i % batch_size == 0:
        curQa.execute(sql_stmt, data_values)
        data_values = []
if (i % batch_size != 0):
    sql_stmt = sql_insert + ','.join(sql_values * (i % batch_size)) + ';'
    curQa.execute(sql_stmt, data_values)
BTW, I don't think you need to commit. You don't begin any transactions. So there should not be any need to commit them. Certainly, you don't need to commit a cursor if all you did was a bunch of selects on it.
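As a side note, psycopg2 ships a helper that does essentially this batching for you. A sketch of the equivalent using execute_values (untested, assuming the same rows list and curQa cursor as above):
from psycopg2.extras import execute_values

execute_values(
    curQa,
    'INSERT INTO "tempHoods" (nbhd_id, nbhd_name, typology, notes, geom) VALUES %s',
    [row[:5] for row in rows],
    page_size=1000,  # rows per generated INSERT statement
)
connQa.commit()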
Here's my updated code based on Dmitry's brilliant solution:
import psycopg2

connDev = psycopg2.connect("host=host1 dbname=dpspgisdev user=postgres password=****")
connQa = psycopg2.connect("host=host2 dbname=dpspgisqa user=postgres password=****")
curDev = connDev.cursor()
curQa = connQa.cursor()

print "Truncating Source"
curQa.execute('delete from "tempHoods"')
connQa.commit()

#Get Data
curDev.execute('select nbhd_id, nbhd_name, typology, notes, cast(geom as varchar) from "CCD_Neighborhoods";')  #cast geom to varchar and insert into geometry column!
rows = curDev.fetchall()

sql_insert = 'INSERT INTO "tempHoods" (nbhd_id, nbhd_name, typology, notes, geom) values '
sql_values = ['(%s, %s, %s, %s, %s)']  #number of columns selecting / inserting
data_values = []
batch_size = 1000  #customize for size of tables...
sql_stmt = sql_insert + ','.join(sql_values * batch_size) + ';'

for i, row in enumerate(rows, 1):
    data_values += row[:5]  #relates to number of columns (%s)
    if i % batch_size == 0:
        curQa.execute(sql_stmt, data_values)
        connQa.commit()
        print "Inserting..."
        data_values = []

if (i % batch_size != 0):
    sql_stmt = sql_insert + ','.join(sql_values * (i % batch_size)) + ';'
    curQa.execute(sql_stmt, data_values)
    print "Last Values..."
    connQa.commit()

# close connections
curDev.close()
curQa.close()
connDev.close()
connQa.close()
