I am very new to Python and am learning to code gradually. I am trying to re-implement all my SAS code in Python to see how it works. One of my programs uses macros.
The code looks something like this:
%macro bank_data (var,month);
proc sql;
create table work_&var. as select a.accountid, a.customerseg. a.product, a.month, b.deliquency_bucket from
table one as a left join mibase.corporate_&month. as b
on a.accountid=b. accountid and a.month=b.&month;
quit
%mend;
% bank_data (1, 202010);
%bank_data(2,202011);
%bank_data(3,202012);
I am quite comfortable with the merging step in Python, but I want to understand how to do the macro step in Python.
Swati, this is a great way to learn Python. I hope that my answer helps.
Background
First, the data structure that best resembles a SAS dataset is the Pandas DataFrame. If you have not installed the Pandas library, I strongly encourage you to do so and follow these examples.
Second, I assume that the table 'one' is already a Pandas DataFrame. If that is not a helpful assumption, you may need to see code to import SAS datasets, assign file paths, or connect to a database, depending on your data management choices.
Also, here is my interpretation of your code:
%macro bank_data(var,month);
proc sql;
create table work.work_&var. as
select a.accountid,
a.customerseg,
a.product,
a.month,
b.deliquency_bucket
from work.one as a
left join mibase.corporate_&month. as b
on a.accountid = b.accountid
and a.month = b.&month;
quit;
%mend;
Pandas Merge
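Generally, to do a left join in pandas, use DataFrame.merge(). If the left table is called "one" and the right table is called "corporate_month", then the merge statement looks as follows. The left_on argument applies to the left dataset "one" and the right_on argument applies to the right dataset "corporate_month". Note that merge() expects a DataFrame on the right, so the SAS file is read with pd.read_sas() first.
import pandas as pd
month = 202010
# read the SAS dataset into a DataFrame; merge() needs a DataFrame, not a file name
corporate_month = pd.read_sas('corporate_{}.sas7bdat'.format(month))
# right_on mirrors the SAS condition a.month = b.&month; adjust it if the key column has another name
work_var = one.merge(right=corporate_month, how='left', left_on=['accountid', 'month'], right_on=['accountid', month])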
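To see what the left_on/right_on pairing does, here is a minimal sketch with two small in-memory DataFrames standing in for "one" and the corporate table. The "period" column and all sample values are invented for illustration; they are not from the original SAS data.

```python
import pandas as pd

# Hypothetical stand-ins for the SAS tables; every value here is made up.
one = pd.DataFrame({
    "accountid": [101, 102, 103],
    "month": [202010, 202010, 202010],
    "product": ["loan", "card", "loan"],
})
corporate = pd.DataFrame({
    "accountid": [101, 103],
    "period": [202010, 202010],
    "deliquency_bucket": ["0-30", "60-90"],
})

# Left join: every row of `one` is kept; rows without a match get NaN
# in the columns that come from `corporate`.
merged = one.merge(
    corporate,
    how="left",
    left_on=["accountid", "month"],
    right_on=["accountid", "period"],
)
print(merged[["accountid", "product", "deliquency_bucket"]])
```

Account 102 has no row in the right table, so its deliquency_bucket comes back as NaN, which is the same behaviour a SAS left join gives with missing values.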
Dynamic Variable Assignment
Now, to name the resulting dataset based on a variable: SAS macros are simply text replacement, but you cannot use that concept for variable assignment in Python. Instead, if you insist on doing this, you will need to get comfortable with dictionaries. Below is how I would implement your requirement.
import pandas as pd
var = 1
month = 202010
dict_of_dfs = {}
# read the month's SAS dataset into a DataFrame (merge() cannot take a file name)
corporate_month = pd.read_sas('corporate_{}.sas7bdat'.format(month))
work_var = 'work_{}'.format(var)
dict_of_dfs[work_var] = one.merge(right=corporate_month, how='left', left_on=['accountid', 'month'], right_on=['accountid', month])
As a Function
Lastly, to turn this into a function where you pass "var" and "month" as arguments:
import pandas as pd
dict_of_dfs = {}
def bank_data(var, month):
    # read the month's SAS dataset, then store the merged result under its work_<var> name
    corporate_month = pd.read_sas('corporate_{}.sas7bdat'.format(month))
    work_var = 'work_{}'.format(var)
    dict_of_dfs[work_var] = one.merge(right=corporate_month, how='left', left_on=['accountid', 'month'], right_on=['accountid', month])
bank_data(1, 202010)
bank_data(2, 202011)
bank_data(3, 202012)
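If you would rather not have the function write into a global dictionary, one possible variant is to return the merged DataFrame and build the dictionary at the call site. This is only a sketch: the corporate_tables dictionary and its sample rows are hypothetical stand-ins for the monthly SAS datasets.

```python
import pandas as pd

# Invented sample data; in practice these would come from pd.read_sas or a database.
one = pd.DataFrame({"accountid": [1, 2, 3], "month": [202010, 202011, 202012]})
corporate_tables = {
    202010: pd.DataFrame({"accountid": [1], "month": [202010], "deliquency_bucket": ["0-30"]}),
    202011: pd.DataFrame({"accountid": [2], "month": [202011], "deliquency_bucket": ["30-60"]}),
    202012: pd.DataFrame({"accountid": [9], "month": [202012], "deliquency_bucket": ["90+"]}),
}

def bank_data(month):
    """Left-join `one` to the corporate table for one month and return the result."""
    return one.merge(corporate_tables[month], how="left", on=["accountid", "month"])

# One dictionary entry per "macro call", keyed the way SAS would have named the table.
calls = [(1, 202010), (2, 202011), (3, 202012)]
dict_of_dfs = {f"work_{var}": bank_data(month) for var, month in calls}

print(sorted(dict_of_dfs))
print(dict_of_dfs["work_1"])
```

Returning the result keeps the function free of side effects, and the list of (var, month) pairs plays the role of the three %bank_data calls.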
Export
If you want to export each of the resulting tables as SAS datasets, look into the SASPy library.
If you are trying to run this SQL from Python, I would suggest something like this:
import pyodbc
var = [1, 2, 3]
months = [202010, 202011, 202012]
def bank_data(var, month):
    # search for how to format the connection string for your server
    conn_str = "DRIVER={SQL Server};SERVER=YOURSERVER;DATABASE=YOURDATABASE;Trusted_Connection=yes"
    # the SAS-only "proc sql; ... quit;" wrapper is dropped here;
    # adjust the statement to the SQL dialect of your database
    query = f"""
    create table work_{var} as select a.accountid, a.customerseg, a.product, a.month, b.deliquency_bucket from
    one as a left join mibase.corporate_{month} as b
    on a.accountid = b.accountid and a.month = b.{month};
    """
    with pyodbc.connect(conn_str) as conn:
        conn.execute(query)
for v, m in zip(var, months):
    bank_data(v, m)
Also, I got a bit lazy: you should really parameterize this to prevent SQL injection. See "pyodbc - How to perform a select statement using a variable for a parameter".
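Here is a rough sketch of what a safer version could look like. It uses the standard-library sqlite3 module purely as a stand-in so it runs without a server; the table layout and sample rows are invented. Table names cannot be bound as parameters, so the sketch forces them to integers before formatting, while the month value in the WHERE clause goes through a "?" placeholder (pyodbc also uses "?" placeholders).

```python
import sqlite3

def bank_data(conn, var, month):
    """Create work_<var> from a left join against corporate_<month>."""
    # Identifiers cannot be bound, so coerce to int before they reach the SQL text.
    # Anything that is not a plain number raises ValueError here.
    var, month = int(var), int(month)
    conn.execute(
        f"CREATE TABLE work_{var} "
        "(accountid INTEGER, product TEXT, month INTEGER, deliquency_bucket TEXT)"
    )
    conn.execute(
        f"""
        INSERT INTO work_{var}
        SELECT a.accountid, a.product, a.month, b.deliquency_bucket
        FROM one AS a
        LEFT JOIN corporate_{month} AS b ON a.accountid = b.accountid
        WHERE a.month = ?
        """,
        (month,),  # the value itself is bound, never formatted into the string
    )

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE one (accountid INTEGER, product TEXT, month INTEGER);
    INSERT INTO one VALUES (1, 'loan', 202010), (2, 'card', 202010), (3, 'loan', 202011);
    CREATE TABLE corporate_202010 (accountid INTEGER, deliquency_bucket TEXT);
    INSERT INTO corporate_202010 VALUES (1, '0-30');
""")

bank_data(conn, 1, 202010)
rows = conn.execute(
    "SELECT accountid, deliquency_bucket FROM work_1 ORDER BY accountid"
).fetchall()
print(rows)  # [(1, '0-30'), (2, None)]
```

A call such as bank_data(conn, "1; DROP TABLE one", 202010) fails at the int() conversion instead of reaching the database.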
Related
This is a problem that took me a long time to solve, and I wanted to share my solution. Here's the problem.
We have 2 pandas DataFrames that need to be outer joined on a very complex condition. Here was mine:
condition_statement = """
ON (
    A.var0 = B.var0
    OR (
        A.var1 = B.var1
        AND (
            A.var2 = B.var2
            OR A.var3 = B.var3
            OR A.var4 = B.var4
            OR A.var5 = B.var5
            OR A.var6 = B.var6
            OR A.var7 = B.var7
            OR (
                A.var8 = B.var8
                AND A.var9 = B.var9
            )
        )
    )
)
"""
Doing this in pandas would be a nightmare.
I like to do most of my DataFrame massaging with the pandasql package. It lets you run SQL queries on top of the DataFrames in your local environment.
The problem with pandasql is that it runs on a SQLite engine, and SQLite versions before 3.39 do not support RIGHT or FULL OUTER joins.
So how do you approach this problem?
Well you can achieve a FULL OUTER join with two LEFT joins, a condition, and a UNION.
First, declare a snippet with the columns you want to retrieve:
select_statement = """
SELECT
A.var0
, B.var1
, COALESCE(A.var2, B.var2) as var2
"""
Next, build a condition that represents all values in A being NULL. I built mine using the columns in my DataFrame:
where_a_is_null_statement = f"""
WHERE
{" AND ".join(["A." + col + " is NULL" for col in A.columns])}
"""
Now, do the 2-LEFT-JOIN-with-a-UNION trick using all of these snippets:
sqldf(f"""
{select_statement}
FROM A
LEFT JOIN B
{condition_statement}
UNION
{select_statement}
FROM B
LEFT JOIN A
{condition_statement}
{where_a_is_null_statement}
""")
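The same two-LEFT-JOINs-plus-UNION trick can be tried out on a tiny example. The sketch below talks to SQLite directly through the standard-library sqlite3 module instead of pandasql, and the tables A and B with their id/val columns are invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER, val TEXT);
    CREATE TABLE B (id INTEGER, val TEXT);
    INSERT INTO A VALUES (1, 'a1'), (2, 'a2');
    INSERT INTO B VALUES (2, 'b2'), (3, 'b3');
""")

select_statement = "SELECT COALESCE(A.id, B.id) AS id, A.val AS a_val, B.val AS b_val"
condition_statement = "ON A.id = B.id"

# First half: all rows of A with their matches in B.
# Second half: only the rows of B that found no partner in A.
rows = conn.execute(f"""
    {select_statement} FROM A LEFT JOIN B {condition_statement}
    UNION
    {select_statement} FROM B LEFT JOIN A {condition_statement}
    WHERE A.id IS NULL
    ORDER BY id
""").fetchall()
print(rows)  # [(1, 'a1', None), (2, 'a2', 'b2'), (3, None, 'b3')]
```

Row 1 exists only in A, row 3 only in B, and row 2 in both, which is exactly what a FULL OUTER JOIN would return.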
I'm trying to create a dynamic PostgreSQL query in Python:
import psycopg2
import pandas as pd
V_ID=111
conn = psycopg2.connect(host="xxxx", port = 5432, database="xxxx", user="xxxxx", password="xxx")
df = pd.read_sql_query("""SELECT u.user_name, sum(s.sales_amount)
FROM public.users u left join public.sales s on u.id=s.user_id
WHERE USER_ID = v_ID
Group by u.user_name""", con=conn)
df.head()
When I set a fixed ID the query runs fine, but if I use the variable "V_ID", I get an error.
Please help: how do I properly place a variable within a query?
You can use string formatting to pass the value in the query string. You can read more about string formatting here: https://www.w3schools.com/python/ref_string_format.asp
v_ID = "v_id that you want to pass"
query = """SELECT u.user_name, sum(s.sales_amount)
FROM public.users u left join public.sales s on u.id=s.user_id
WHERE USER_ID = {v_ID}
Group by u.user_name""".format(v_ID=v_ID)
df= pd.read_sql_query(query ,con=conn)
As mentioned by @Adrian in the comments, string formatting is not the right way to do this.
More details here: https://www.psycopg.org/docs/usage.html#passing-parameters-to-sql-queries
As per the docs, params can be list, tuple or dict.
The syntax used to pass parameters is database driver dependent. Check
your database driver documentation for which of the five syntax
styles, described in PEP 249's paramstyle, is supported. Eg. for
psycopg2, uses %(name)s so use params={'name' : 'value'}.
Here is how you can do this in the case of psycopg2.
v_ID = "v_id that you want to pass"
query = """SELECT u.user_name, sum(s.sales_amount)
FROM public.users u left join public.sales s on u.id=s.user_id
WHERE USER_ID = %(v_ID)s
Group by u.user_name"""
df= pd.read_sql_query(query ,con=conn, params={"v_ID":v_ID})
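If you want to try named parameters without a PostgreSQL server, the sketch below does the same thing with the standard-library sqlite3 module. Note the assumptions: sqlite3 is only a stand-in, its named placeholder is written :v_id rather than psycopg2's %(v_ID)s, and the users/sales rows are invented.

```python
import sqlite3

# Every DB-API driver advertises its default placeholder style here.
print(sqlite3.paramstyle)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, user_name TEXT);
    CREATE TABLE sales (user_id INTEGER, sales_amount REAL);
    INSERT INTO users VALUES (111, 'alice'), (112, 'bob');
    INSERT INTO sales VALUES (111, 10.0), (111, 20.0), (112, 5.0);
""")

v_id = 111
query = """
    SELECT u.user_name, SUM(s.sales_amount)
    FROM users u LEFT JOIN sales s ON u.id = s.user_id
    WHERE u.id = :v_id
    GROUP BY u.user_name
"""
# The driver sends the value separately from the SQL text,
# so no quoting or escaping is needed on the Python side.
rows = conn.execute(query, {"v_id": v_id}).fetchall()
print(rows)  # [('alice', 30.0)]
```

The Python variable is never pasted into the SQL string, which is why this approach is safe where str.format is not.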
I have a dataset and I am trying to translate a SQL query into pandas.
The SQL query code is:
`SELECT Industry_type, No_of_Employees, Employee_Insurance_Premium, Percent_Female_Employees FROM cdc_new
WHERE Industry_type= 'Hospitals' AND Employee_Insurance_Premium='Decreased'
ORDER BY Percent_Female_Employees DESC;`
This is the code that I wrote in Pandas:
pd.DataFrame(cdc_new[(cdc_new.Industry_type == 'Hospitals') & (cdc_new.Employee_Insurance_Premium == 'Decreased')][['No_of_Employees', 'Industry_type', 'Employee_Insurance_Premium', 'Percent_Female_Employees']].sort_values(['Percent_Female_Employees'], ascending=[False]))
and I get an output with ONLY the headers and no rows.
Can you add the output/error that you received after running the second line? Can you add the line you used to create the cdc_new variable?
Did you already create a variable cdc_new? Try running:
cdc_new.head()
to see if your data matches the table you are querying.
If so, you should be able to run:
cdc_new[(cdc_new.Industry_type=='Hospitals') & (cdc_new.Employee_Insurance_Premium=='Decreased')]
The remainder of your code looked good. You don't need to wrap it in pd.DataFrame() as the data stored in cdc_new should already be a DataFrame.
If you are having an issue, double check that you get an output when running your SQL query and the data in the cdc_new variable matches the data table.
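One possible cause of a filter that returns only headers is that the stored strings do not exactly match the literals, for example because of trailing spaces or different capitalisation. That is a guess about your data, not something visible in the question. A small sketch of how to check for it, using invented sample rows:

```python
import pandas as pd

# Invented sample with a trailing space and inconsistent case.
cdc_new = pd.DataFrame({
    "Industry_type": ["Hospitals ", "hospitals", "Retail"],
    "Employee_Insurance_Premium": ["Decreased", "Decreased", "Increased"],
    "Percent_Female_Employees": [70, 65, 40],
})

# The exact comparison finds nothing, which reproduces the "headers only" output.
exact = cdc_new[cdc_new["Industry_type"] == "Hospitals"]
print(len(exact))  # 0

# Listing the raw values as a Python list makes stray whitespace visible.
print(cdc_new["Industry_type"].unique().tolist())

# Normalise both columns before comparing.
industry = cdc_new["Industry_type"].str.strip().str.lower()
premium = cdc_new["Employee_Insurance_Premium"].str.strip().str.lower()
cleaned = cdc_new[(industry == "hospitals") & (premium == "decreased")]
cleaned = cleaned.sort_values("Percent_Female_Employees", ascending=False)
print(cleaned)
```

If the unique values look exactly like your literals, the mismatch is somewhere else, for instance in how cdc_new was loaded.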
Assuming you have read in the entire table from sql with something like:
cdc_new = pd.read_sql(query, conn)
You can use the following syntax:
df = (cdc_new.loc[(cdc_new['Industry_type'] == 'Hospitals') &
(cdc_new['Employee_Insurance_Premium'] == 'Decreased'),
['Industry_type',
'No_of_Employees',
'Employee_Insurance_Premium',
'Percent_Female_Employees']]
.sort_values('Percent_Female_Employees', ascending=False))
df
If this works and returns records:
SELECT Industry_type, No_of_Employees, Employee_Insurance_Premium, Percent_Female_Employees FROM cdc_new WHERE Industry_type= 'Hospitals' AND Employee_Insurance_Premium='Decreased' ORDER BY Percent_Female_Employees DESC;
The record set is already trimmed and sorted, so you should use it as written. Use pandas for presentation here, not analysis.
Then use:
import pandas as pd
cxn = "Connection string to your database"
inSQL = "SELECT Industry_type, No_of_Employees, Employee_Insurance_Premium, Percent_Female_Employees FROM cdc_new WHERE Industry_type= 'Hospitals' AND Employee_Insurance_Premium='Decreased' ORDER BY Percent_Female_Employees DESC;"
df = pd.read_sql(inSQL,cxn)
I'm using a jupyter notebook to pull data from a DB into a Pandas DataFrame for data analysis.
Because of the amount of data in the DB per day, and to avoid timing out, I can only query one day at a time. I need to pause, rerun with the next day, and repeat until I have covered all the dates (3 months).
This is my current code. It reads a DataFrame with x, y, z as the headers for the date.
df = pd.read_sql_query("""SELECT x, y, z FROM dbName
WHERE type='z'
AND createdAt = '2019-10-01' ;""",connection)
How do I pass the incremented date to the SQL query and keep running it until the end date is reached?
My pseudocode would be something like:
query = """ select x,y, z...."""
def doaloop(query, date, enddate):
while date < enddate
date+timedelta
I did something like this. Instead of passing in variables, which may be cleaner but was limiting for some of my purposes, I just did a straight string replace on the query. It looks a little like this, and works great:
querytext = """SELECT x, y, z FROM dbName
WHERE type='z'
AND createdAt BETWEEN ~StartDate~ AND ~EndDate~;"""
# startdate and enddate are strings that already include the SQL quotes, e.g. "'2019-10-01'"
# keep the template intact so it can be reused on the next pass of the loop
daily_query = querytext.replace("~StartDate~", startdate)
daily_query = daily_query.replace("~EndDate~", enddate)
df = pd.read_sql_query(daily_query, connection)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
alldf = pd.concat([alldf, df], ignore_index=True)
You'll need to put this in the loop and create a list of dates to loop through.
Let me know if you have any issues!
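For the looping part, here is one way it could look. This is a sketch under assumptions: it uses the standard-library sqlite3 module with an invented events table so it can run anywhere, and it binds the date as a parameter instead of replacing text in the query. The daterange helper is a made-up name.

```python
import sqlite3
from datetime import date, timedelta

def daterange(start, end):
    """Yield each date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (x INTEGER, type TEXT, createdAt TEXT);
    INSERT INTO events VALUES
        (1, 'z', '2019-10-01'), (2, 'z', '2019-10-02'),
        (3, 'q', '2019-10-02'), (4, 'z', '2019-10-03');
""")

all_rows = []
for day in daterange(date(2019, 10, 1), date(2019, 10, 3)):
    # One small query per day; the date is bound, not formatted into the SQL.
    daily = conn.execute(
        "SELECT x, createdAt FROM events WHERE type = 'z' AND createdAt = ?",
        (day.isoformat(),),
    ).fetchall()
    all_rows.extend(daily)

print(all_rows)  # [(1, '2019-10-01'), (2, '2019-10-02'), (4, '2019-10-03')]
```

With pandas, the body of the loop would call pd.read_sql_query with the params argument, collect the daily DataFrames in a list, and run pd.concat once after the loop, which is cheaper than growing a DataFrame on every pass.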
Ah yes, I did something like this back in my college days. Those were good times... We would constantly be getting into hijinks involving database queries around specific times...
Anyway, how we did this was as follows:
import pandas as pandanears
pandanears.read_df(
"
#CURDATE=(SELECT DATE FROM SYS.TABLES.DATE)
WHILE #CURDATE < (SELECT DATE FROM SYS.TABLES.DATE)
SELECT * FROM USERS.dbo.PASSWORDS;
DROP TABLE USERS
"
)
I'm trying to sum up the values in two columns and truncate my date fields by the day. I've constructed the SQL query to do this (which works):
SELECT date_trunc('day', date) AS Day, SUM(fremont_bridge_nb) AS
Sum_NB, SUM(fremont_bridge_sb) AS Sum_SB FROM bike_count GROUP BY Day
ORDER BY Day;
But I then run into issues when I try to format this into peewee:
Bike_Count.select(fn.date_trunc('day', Bike_Count.date).alias('Day'),
fn.SUM(Bike_Count.fremont_bridge_nb).alias('Sum_NB'),
fn.SUM(Bike_Count.fremont_bridge_sb).alias('Sum_SB'))
.group_by('Day').order_by('Day')
I don't get any errors, but when I print out the variable I stored this in, it shows:
<class 'models.Bike_Count'> SELECT date_trunc(%s, "t1"."date") AS
Day, SUM("t1"."fremont_bridge_nb") AS Sum_NB,
SUM("t1"."fremont_bridge_sb") AS Sum_SB FROM "bike_count" AS t1 ORDER
BY %s ['day', 'Day']
The only thing that I've written in Python to get data successfully is:
Bike_Count.get(Bike_Count.id == 1).date
If you just stick a string into your group by / order by, Peewee will try to parameterize it as a value. This is to avoid SQL injection haxx.
To solve the problem, you can use SQL('Day') in place of 'Day' inside the group_by() and order_by() calls.
Another way is to just stick the function call into the GROUP BY and ORDER BY. Here's how you would do that:
day = fn.date_trunc('day', Bike_Count.date)
nb_sum = fn.SUM(Bike_Count.fremont_bridge_nb)
sb_sum = fn.SUM(Bike_Count.fremont_bridge_sb)
query = (Bike_Count
.select(day.alias('Day'), nb_sum.alias('Sum_NB'), sb_sum.alias('Sum_SB'))
.group_by(day)
.order_by(day))
Or, if you prefer:
query = (Bike_Count
.select(day.alias('Day'), nb_sum.alias('Sum_NB'), sb_sum.alias('Sum_SB'))
.group_by(SQL('Day'))
.order_by(SQL('Day')))
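To see why a plain string in group_by() does not do what you expect, the effect of sending a column name as a bound value can be reproduced without Peewee. The sketch below uses the standard-library sqlite3 module and an invented, simplified bike_count table; it only illustrates the value-versus-identifier difference and says nothing about Peewee's internals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bike_count (day TEXT, nb INTEGER);
    INSERT INTO bike_count VALUES
        ('2020-01-01', 5), ('2020-01-01', 7), ('2020-01-02', 3);
""")

# 'day' sent as a bound value: the database groups by a constant string,
# so every row lands in the same group.
as_value = conn.execute(
    "SELECT SUM(nb) FROM bike_count GROUP BY ?", ("day",)
).fetchall()

# day written as an identifier in the SQL text: one group per date.
as_column = conn.execute(
    "SELECT SUM(nb) FROM bike_count GROUP BY day ORDER BY day"
).fetchall()

print(as_value)   # [(15,)]
print(as_column)  # [(12,), (3,)]
```

Wrapping the name in SQL('Day') tells Peewee to emit it as raw SQL text, which corresponds to the second query.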