df_sales = spark.sql(
"SELECT \
s.TRANS_DT, \
s.STORE_KEY, \
s.PROD_KEY, \
s.SALES_QTY \
FROM sales s \
JOIN inventory i ON s.cal_dt=i.cal_dt and s.store_key=i.store_key and s.prod_key=i.prod_key;"
)
I created a SQL query from two temp views (inventory and sales). How can I register the result of the df_sales query as a temp view again, so that I can run a new SQL query against it?
spark.sql returns a DataFrame, so you can call createOrReplaceTempView on the result directly:
spark.sql(
"""
SELECT
s.TRANS_DT,
s.STORE_KEY,
s.PROD_KEY,
s.SALES_QTY
FROM sales s
JOIN inventory i ON s.cal_dt=i.cal_dt and s.store_key=i.store_key and s.prod_key=i.prod_key;
"""
).createOrReplaceTempView("tmpViewName")
Goal
I want to insert database records into MySQL using Python, with one extra requirement that I explain below.
This is my current script (Fully functional & working):
#Get data from SQL
sqlCursor = mjmConnection.cursor()
sqlCursor.execute("SELECT sol.id, p.id, p.code,p.description, p.searchRef1, so.number, c.code, c.name, sol.requiredQty \
FROM salesorderline sol JOIN \
salesorder so \
ON sol.salesorderid = so.id JOIN \
product p \
ON sol.productid = p.id JOIN \
customer c \
ON so.customerid = c.id \
WHERE so.orderdate > DATEADD(dd,-35,CAST(GETDATE() AS date));")
#Send the data received from the SQL query above to the MySQL database
print("Sending MJM records to MySQL Database")
mjmCursorMysql = productionConnection.cursor()
for x in sqlCursor.fetchall():
    a,b,c,d,e,f,g,h,i = x
    mjmCursorMysql.execute("INSERT ignore INTO mjm_python (id, product_id, product_code, product_description, product_weight, \
    salesorder_number, customer_code, customer_name, requiredQty) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s);", (a,b,c,d,e,f,g,h,i))
productionConnection.commit()
mjmCursorMysql.close()
sqlCursor.close()
What it does
The above script does the following:
Gets data from SQL Server
Inserts that data into MySQL
I have specifically used IGNORE in the MySQL query, to prevent duplicate id numbers.
Data will look like this:
Next..
Now I'd like to add a column named sales_id_increment. It should start at 1, increment for each row with the same salesorder_number, and reset to 1 when the salesorder_number changes. I want it to look something like this:
Question
How do I achieve this? Where do I need to look, in my Python script or the MySQL query?
You can compute this column when you select the rows from SQL Server, using the window function ROW_NUMBER() (or DENSE_RANK() if there are duplicate ids):
SELECT sol.id, p.id, p.code,p.description, p.searchRef1, so.number, c.code, c.name, sol.requiredQty,
ROW_NUMBER() OVER (PARTITION BY so.number ORDER BY sol.id) sales_id_increment
FROM salesorderline sol
JOIN salesorder so ON sol.salesorderid = so.id
JOIN product p ON sol.productid = p.id
JOIN customer c ON so.customerid = c.id
WHERE so.orderdate > DATEADD(dd,-35,CAST(GETDATE() AS date));
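If the source query cannot be changed, the same per-order counter could also be computed in the Python script before the insert. This is only a sketch: it assumes each row is a tuple with the line id at index 0 and the sales order number at index 5 (as in the SELECT above), and the function name add_sales_id_increment and the sample rows are made up for illustration.

```python
from collections import defaultdict


def add_sales_id_increment(rows, order_idx=5, sort_idx=0):
    """Append a per-sales-order counter to each row.

    Mimics ROW_NUMBER() OVER (PARTITION BY <order> ORDER BY <id>).
    """
    counters = defaultdict(int)  # sales order number -> last counter used
    numbered = []
    # Sort by (order number, line id) so numbering within an order is stable.
    for row in sorted(rows, key=lambda r: (r[order_idx], r[sort_idx])):
        counters[row[order_idx]] += 1
        numbered.append(tuple(row) + (counters[row[order_idx]],))
    return numbered


# Made-up sample rows: (id, product_id, code, description, weight, order_no)
sample = [
    (3, 12, "C", "cog", 1.0, "SO-2"),
    (1, 10, "A", "axle", 2.5, "SO-1"),
    (2, 11, "B", "bolt", 0.1, "SO-1"),
]
numbered_sample = add_sales_id_increment(sample)
for row in numbered_sample:
    print(row)
```

Whichever side computes the counter, the INSERT statement needs the extra sales_id_increment column and one more placeholder.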
I have the following Python MySQL code.
cursor = mydb.cursor()
cursor.execute('SELECT id FROM table1 WHERE col1=%s AND col2=%s', (val1, val2))
ids = cursor.fetchall()
for id in ids:
    cursor.execute('SELECT record_key FROM table2 WHERE id=%s limit 1', (id[0], ))
    record_keys = cursor.fetchall()
    print(record_keys[0][0])
How can I make this more efficient? I am using 5.5.60-MariaDB and Python 2.7.5. I have approximately 350 million entries in table1 and 15 million entries in table2.
Happily, you can do this in a single query using a LEFT JOIN.
cursor = mydb.cursor()
cursor.execute(
    "SELECT t1.id, t2.record_key FROM table1 t1 "
    "LEFT JOIN table2 t2 ON (t1.id = t2.id) "
    "WHERE t1.col1=%s AND t1.col2=%s",
    (val1, val2),
)
for id, record_key in cursor.fetchall():
    pass  # do something...
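To see how the joined query behaves, here is a small self-contained sketch. It uses an in-memory SQLite database as a stand-in for MariaDB, so the placeholder style is "?" instead of "%s", and the sample rows are made up; both filter columns are assumed to live in table1, as in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT);
    CREATE TABLE table2 (id INTEGER, record_key TEXT);
    INSERT INTO table1 VALUES (1, 'a', 'b'), (2, 'a', 'b'), (3, 'x', 'y');
    INSERT INTO table2 VALUES (1, 'k1'), (3, 'k3');
    """
)

# One round trip instead of one query per id.
# sqlite3 uses "?" placeholders; the MySQL/MariaDB drivers use "%s".
rows = conn.execute(
    "SELECT t1.id, t2.record_key FROM table1 t1 "
    "LEFT JOIN table2 t2 ON t1.id = t2.id "
    "WHERE t1.col1 = ? AND t1.col2 = ? "
    "ORDER BY t1.id",
    ("a", "b"),
).fetchall()
print(rows)
conn.close()
```

With a LEFT JOIN, an id that has no match in table2 comes back with record_key set to None; an INNER JOIN would drop such rows instead. Unlike the original loop with limit 1, the join also returns one row per matching table2 entry.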
I'm using Pyodbc in Python to run some SQL queries. What I'm working with is actually longer than this, but this example captures what I'm trying to do:
connection = pyodbc.connect(...)
cursor = connection.cursor(...)
dte = '2018-10-24'
#note the placeholders '{}'
query = """select invoice_id
into #output
from table1 with (nolock)
where system_id = 'PrimaryColor'
and posting_date = '{}'
insert into #output
select invoice_id
from table2 with (nolock)
where system_id = 'PrimaryColor'
and posting_date = '{}'"""
#this is where I need help as explained below
cursor.execute(query.format(dte, dte))
output = pd.read_sql("""select *
from #output"""
, connection)
In the above, since there are only two '{}' placeholders, I'm passing dte to query.format() twice. However, the more complicated version I'm working with has 19 '{}' placeholders, so I'd imagine I need to pass dte to query.format() 19 times. I tried passing the values as a list, but it didn't work. Do I really need to write out the variable 19 times when passing it to the function?
Consider using a UNION ALL query to avoid the temp table, together with parameterization: put qmark (?) placeholders in the query and bind values to them in a subsequent step. Because every placeholder takes the same value, multiply the parameter tuple by the number of placeholders:
dte = '2018-10-24'
# NOTE THE QMARK PLACEHOLDERS
query = """select invoice_id
from table1 with (nolock)
where system_id = 'PrimaryColor'
and posting_date = ?
union all
select invoice_id
from table2 with (nolock)
where system_id = 'PrimaryColor'
and posting_date = ?"""
output = pd.read_sql(query, connection, params=(dte,)*2)
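For the 19-placeholder case, the repeat count does not have to be hard-coded; it could be derived from the query text. The sketch below uses sqlite3 in place of pyodbc (both use qmark placeholders) with made-up tables, and it assumes the SQL contains no literal "?" characters other than the placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table1 (invoice_id INTEGER, posting_date TEXT);
    CREATE TABLE table2 (invoice_id INTEGER, posting_date TEXT);
    INSERT INTO table1 VALUES (1, '2018-10-24'), (2, '2018-10-25');
    INSERT INTO table2 VALUES (3, '2018-10-24');
    """
)

dte = "2018-10-24"
query = """select invoice_id from table1 where posting_date = ?
union all
select invoice_id from table2 where posting_date = ?"""

# Repeat the single value once per placeholder.
# Assumption: the SQL text contains no literal "?" characters.
params = (dte,) * query.count("?")
invoice_ids = sorted(row[0] for row in conn.execute(query, params))
print(invoice_ids)
conn.close()
```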
I agree with the comments: pandas.read_sql has a params argument, which protects against SQL injection.
See this post to understand how to use it depending on the database.
Pyodbc has the same parameter on the execute method.
# standard
cursor.execute("select a from tbl where b=? and c=?", (x, y))
# pyodbc extension
cursor.execute("select a from tbl where b=? and c=?", x, y)
To answer the initial question, even though string formatting is bad practice for building SQL queries:
Do I really need to write out the variable 19 times when passing it to the function?
Of course you don't:
query = """select invoice_id
into #output
from table1 with (nolock)
where system_id = 'PrimaryColor'
and posting_date = '{dte}'
insert into #output
select invoice_id
from table2 with (nolock)
where system_id = 'PrimaryColor'
and posting_date = '{dte}'""".format(**{'dte': dte})
or :
query = """select invoice_id
into #output
from table1 with (nolock)
where system_id = 'PrimaryColor'
and posting_date = '{0}'
insert into #output
select invoice_id
from table2 with (nolock)
where system_id = 'PrimaryColor'
and posting_date = '{0}'""".format(dte)
Python 3.6+ :
query = f"""select invoice_id
into #output
from table1 with (nolock)
where system_id = 'PrimaryColor'
and posting_date = '{dte}'
insert into #output
select invoice_id
from table2 with (nolock)
where system_id = 'PrimaryColor'
and posting_date = '{dte}'"""
Note the usage of f before """ ... """
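Some DB-API drivers also accept named placeholders, which gives the same "write the value once" convenience without formatting the value into the SQL string. The sketch below uses sqlite3, whose ":name" style is documented; pyodbc itself only supports "?" placeholders, so this illustrates the idea rather than serving as a drop-in replacement. The table and rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (invoice_id INTEGER, posting_date TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [(1, "2018-10-24"), (2, "2018-10-23"), (3, "2018-10-25")],
)

# :dte appears twice in the SQL but is supplied only once, and the driver
# binds it as a value instead of pasting it into the query text.
query = """select invoice_id from table1
where posting_date = :dte or posting_date > :dte
order by invoice_id"""
matches = [row[0] for row in conn.execute(query, {"dte": "2018-10-24"})]
print(matches)
conn.close()
```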
In trying to replicate a MySQL query in SQLAlchemy, I've hit a snag in specifying which tables to select from.
The query that works is
SELECT c.*
FROM attacks AS a INNER JOIN hosts h ON a.host_id = h.id
INNER JOIN cities c ON h.city_id = c.id
GROUP BY c.id;
I try to accomplish this in SQLAlchemy using the following function
def all_cities():
    session = connection.globe.get_session()
    destination_city = aliased(City, name='destination_city')
    query = session.query(City). \
        select_from(Attack).\
        join((Host, Attack.host_id == Host.id)).\
        join((destination_city, Host.city_id == destination_city.id)).\
        group_by(destination_city.id)
    print query
    results = [result.serialize() for result in query]
    session.close()
    file(os.path.join(os.path.dirname(__file__), "servers.geojson"), 'a').write(geojson.feature_collection(results))
When printing the query, I end up with ALMOST the right query
SELECT
cities.id AS cities_id,
cities.country_id AS cities_country_id,
cities.province AS cities_province,
cities.latitude AS cities_latitude,
cities.longitude AS cities_longitude,
cities.name AS cities_name
FROM cities, attacks
INNER JOIN hosts ON attacks.host_id = hosts.id
INNER JOIN cities AS destination_city ON hosts.city_id = destination_city.id
GROUP BY destination_city.id
However, you will note that it is selecting from cities, attacks...
How can I get it to select only from the attacks table?
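The line
query = session.query(City)
selects from the unaliased cities table, while the join targets the alias destination_city. SQLAlchemy treats the two as separate FROM elements, which is why the generated query contains
FROM cities, attacks
Querying the alias instead, session.query(destination_city), should leave attacks as the only table in the FROM clause.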
name=input("input CUSTOMERID to search :")
# Prepare SQL query to view all records of a specific person from
# the SALESPRODUCTS TABLE LINKED WITH SALESPERSON TABLE.
sql = "SELECT * selling_products.customer \
FROM customer \
WHERE customer_products.CUSTOMERID == name"
# Execute the SQL command
cursor.execute(sql)
# Fetch all the rows the sql result of SQL1.
results = cursor.fetchall()
print("\n\n****** TABLE MASTERLIST*********")
print("CUSTOMERID \t PRODUCTID \t DATEOFPURCHASE")
print("**************")
for row in results:
    print (row[0],row[1],row[2])
Python accepts the code above, but it does not return any output. Help would be very much appreciated :)
I think your SQL should use a single = for the comparison and substitute the value of name into the string:
sql = """SELECT *
FROM customer
WHERE CUSTOMERID = {name}""".format(name=name)
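Because name comes straight from input(), formatting it into the SQL string leaves the query open to SQL injection; binding it as a parameter is usually the safer route. The sketch below uses sqlite3 so that it runs on its own. The table customer_products and its three columns are guessed from the printed header in the question, the sample rows are made up, and the helper purchases_for is hypothetical. Note that the placeholder is "?" for sqlite3, while MySQL drivers typically use "%s".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_products "
    "(CUSTOMERID TEXT, PRODUCTID TEXT, DATEOFPURCHASE TEXT)"
)
conn.executemany(
    "INSERT INTO customer_products VALUES (?, ?, ?)",
    [("C1", "P1", "2020-01-05"), ("C2", "P9", "2020-02-11")],
)


def purchases_for(customer_id):
    """Return all purchase rows for one customer id."""
    cursor = conn.execute(
        "SELECT CUSTOMERID, PRODUCTID, DATEOFPURCHASE "
        "FROM customer_products WHERE CUSTOMERID = ?",
        (customer_id,),  # bound by the driver, never pasted into the SQL
    )
    return cursor.fetchall()


results = purchases_for("C1")
for row in results:
    print(row[0], row[1], row[2])
```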