Joining tables in Databricks vs. SAS - how to handle duplicate column names - python

Long-time SAS user here, new to Databricks, and I am trying to migrate some basic code.
Running into an extremely basic join issue but cannot find a solution.
In SAS (proc sql), when I run the following code, SAS is smart enough to realize that the join columns exist in both the left and right tables, and so it produces only one instance of those variables.
e.g.
proc sql;
  create table work.test as
  select *
  from data.table1 t1
  left join data.table2 t2
    on (t1.bene_id = t2.bene_id) and (t1.pde_id = t2.pde_id)
  ;
quit;
This code runs just fine.
However, when I run the same thing in Databricks, it keeps both instances of the bene_id and pde_id fields, and therefore bombs out when it tries to create the table (because it is trying to create duplicate column names).
I realize one solution is to not use the * in the select statement and instead manually specify each field, ensuring I'm only selecting a single instance of each field, but with the number of joins happening plus the number of fields I'm dealing with, this is a real waste of time.
I also came across another potential solution, using this sort of syntax:
%python
from pyspark.sql import *
t1 = spark.read.table("data1")
t2 = spark.read.table("data2")
temp = t1.join(t2, ["bene_id", "pde_id"], "left")
However, this only suppresses duplicates for the fields being joined upon (i.e. bene_id and pde_id). If there were a third field, say srvc_dt, that exists in both tables but is not used in the join, it will again be generated twice and bomb out.
Finally, I realize another solution is to write some code to dynamically rename the columns in both the left and right tables so that all columns always have unique names. I just feel like there has to be a simple way to achieve what SAS is doing without requiring all the workarounds, and I'm just not aware of it.
Thanks for any advice.

You have to either rename the columns, drop one of the duplicates before joining, or use aliases, as described in this answer.
Spark wants you to be very explicit about which column you want to keep, so that you are not accidentally dropping columns.
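For example, here is a minimal PySpark sketch of the drop/rename options, reusing the table names from the question's Python snippet and treating srvc_dt as the hypothetical shared non-key column:

# A rough sketch, assuming the table and column names from the question.
t1 = spark.read.table("data1")
t2 = spark.read.table("data2")

# Joining on a list of names keeps a single copy of the join keys.
joined = t1.join(t2, ["bene_id", "pde_id"], "left")

# Option 1: drop the right-hand copy of any other shared column.
result = joined.drop(t2["srvc_dt"])

# Option 2: rename the clashing column on one side before joining.
t2_renamed = t2.withColumnRenamed("srvc_dt", "srvc_dt_t2")
result = t1.join(t2_renamed, ["bene_id", "pde_id"], "left")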

How can I safely parameterize table/column names in BigQuery SQL?

I am using Python's BigQuery client to create and keep up to date some tables in BigQuery that contain daily counts of certain Firebase events joined with data from other sources (sometimes grouped by country, etc.). Keeping them up to date requires deleting and replacing data for past days, because the day tables for Firebase events can be changed after they are created (see here and here). I keep them up to date this way to avoid querying the entire dataset, which is very financially/computationally expensive.
This deletion and replacement process needs to be repeated for many tables, so I need to reuse some queries stored in text files. For example, one deletes everything in the table from a particular date onward (delete from x where event_date >= y). But because BigQuery disallows the parameterization of table names (see here), I have to duplicate these query text files for each table I need to do this for. If I want to run tests, I would also have to duplicate the aforementioned queries for test tables.
I basically need something like psycopg2.sql for BigQuery so that I can safely parameterize table and column names while avoiding SQLi. I actually tried to repurpose this module by calling the as_string() method and using the result to query BigQuery, but the resulting syntax doesn't match and I would need to start a Postgres connection to do it (as_string() expects a cursor/connection object). I also tried something similar with sqlalchemy.text, to no avail. So I concluded I'd have to implement some way of parameterizing the table name myself, or implement some workaround using the Python client library. Any ideas on how I should go about doing this in a safe way that won't lead to SQLi? I cannot go into detail, but unfortunately I cannot store the tables in Postgres or any other db.
As discussed in the comments, the best option for avoiding SQLi in your case is ensuring your server's security.
If you nevertheless need or want to check your input parameters before building your query, I recommend using a regex to validate the input strings.
In Python you could use the re library.
As I don't know how your code works, how your datasets/tables are organized, or exactly how you are planning to check whether the string is a valid source, I created the basic example below, which shows how you could check a string using this library:
import re

tests = ["your-dataset.your-table", "(SELECT * FROM <another table>)", "dataset-09123.my-table-21112019"]

# Supposing that the input pattern is <dataset>.<table>
regex = re.compile(r"[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+")

for t in tests:
    if regex.fullmatch(t):
        print("This source is ok")
    else:
        print("This source is not ok")
In this example, only strings that match the pattern dataset.table (where both the dataset and the table contain only alphanumeric characters and dashes) are considered valid.
When running the code, the first and third elements of the list are considered valid, while the second (which could potentially change your whole query) is considered invalid.
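Once a table string passes that check, the date value itself can still go through a real query parameter. Below is a minimal sketch assuming the standard google-cloud-bigquery client and the event_date column mentioned in the question; the function and parameter names are placeholders:

from google.cloud import bigquery

def delete_from(client, source, cutoff_date):
    # "source" is the <dataset>.<table> string validated by the regex above;
    # table names cannot be query parameters, so it is interpolated after the check.
    if not regex.fullmatch(source):
        raise ValueError("invalid table reference: %r" % source)
    sql = "DELETE FROM `{}` WHERE event_date >= @cutoff".format(source)
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("cutoff", "DATE", cutoff_date)]
    )
    client.query(sql, job_config=job_config).result()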

SQLAlchemy: create a column that is auto-updated depending on other columns

I need to create a column in a table that is auto-updated when one or more columns (possibly in another table) are updated, but it should also be possible to edit this column directly (and the value should be kept in SQL unless said other columns are updated, in which case the first logic applies).
I tried column_property, but it seems to be merely a construct inside Python and doesn't represent an actual column.
I also tried hybrid_property and default; neither accomplished this.
This looks like an insert/update trigger; however, I want to know an "elegant" way to declare it, if that's even possible.
I use the declarative style for tables on Postgres.
I don't make any updates to SQL outside of SQLAlchemy.
This definitely looks like insert/update triggers. But if I were you, I would encapsulate this logic in Python by using two queries, so it will be clearer.
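If you do go the trigger route, the DDL can at least live next to the declarative model so it is emitted with the table. A rough sketch, assuming PostgreSQL and SQLAlchemy 1.4+; the table and the price/quantity/total columns are placeholders standing in for your real dependency:

from sqlalchemy import Column, Integer, Numeric, DDL, event
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Item(Base):
    __tablename__ = "item"
    id = Column(Integer, primary_key=True)
    price = Column(Numeric)
    quantity = Column(Integer)
    total = Column(Numeric)  # editable directly; refreshed by the trigger below

# Recompute total whenever price or quantity are touched by an UPDATE.
func_ddl = DDL("""
CREATE OR REPLACE FUNCTION refresh_item_total() RETURNS trigger AS $$
BEGIN
    NEW.total := NEW.price * NEW.quantity;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql
""")

trigger_ddl = DDL("""
CREATE TRIGGER item_refresh_total
BEFORE UPDATE OF price, quantity ON item
FOR EACH ROW EXECUTE PROCEDURE refresh_item_total()
""")

# Emit the trigger DDL right after the table is created.
event.listen(Item.__table__, "after_create", func_ddl.execute_if(dialect="postgresql"))
event.listen(Item.__table__, "after_create", trigger_ddl.execute_if(dialect="postgresql"))

Because the trigger fires only on UPDATE OF price, quantity, direct edits to total are preserved until those source columns change again, which matches the behaviour described in the question.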

Creating a Dynamic Select Query in SQLAlchemy

I have researched this topic and have decided just to ask here since I can't seem to find anything. I'll explain below:
Context: Flask Application with a form the client fills out and posts to the server. The form inputs are used to create a query and return data.
I am currently using SQLAlchemy to construct the query from scratch. At this point, I have successfully connected to my existing Redshift database and can query properly, but I cannot figure out how to dynamically construct a simple SELECT x, y, z statement based on the user's form inputs.
The main problem is that Query() can't take in a Python list of columns. It seems you must specify each column like table.c.column1, which doesn't work well with a dynamic query since I don't know which columns I want until the user submits the form.
My two ideas so far:
1. Loop through all column names and use Query.add_columns(table.c['colname']).
2. Use select([col1, col2, ...]) instead of Query().
I also looked at load_columns() to load only specific columns in a table to query; unfortunately, it seems to work only with model objects and not reflected tables, unless I am mistaken.
Both of these seem backwards to me, as they do not really accomplish my goal effectively.
SQLAlchemy is quite flexible, so both 1 and 2 get the job done. If you've no need for ORM functionality, then perhaps #2 is more natural. If the user were to pass a list of column names such as
columns = request.args.getlist('columns')
you could then create your select() quite easily with a bunch of column() constructs:
stmt = select([column(c) for c in columns]).\
    select_from(some_table)
or if you have the table at hand, like you hint in the question:
stmt = select([table.c[c] for c in columns])
and then all that is left is to execute your statement:
results = db.session.execute(stmt).fetchall()
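Put together, a minimal end-to-end sketch, assuming a reflected table called some_table and a Flask-SQLAlchemy db handle (these names are placeholders); on SQLAlchemy 1.4+/2.0 you would pass the columns to select() without the list:

from flask import request
from sqlalchemy import MetaData, Table, select

metadata = MetaData()
some_table = Table("some_table", metadata, autoload=True, autoload_with=db.engine)

# Column names chosen by the user in the form.
requested = request.args.getlist("columns")

# Keep only names that actually exist on the table, so arbitrary
# user input cannot select anything unexpected.
cols = [some_table.c[name] for name in requested if name in some_table.c]

stmt = select(cols)
results = db.session.execute(stmt).fetchall()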

Diffing and synchronizing 2 tables in MySQL

I have 2 tables, one with new data and another with old data.
I need to find the diff between the two tables and push only the changes into the table with the old data, as that table will be in production.
Both tables are identical in terms of columns; only the data varies.
EDIT:
I am looking for one-way sync only.
EDIT 2
The table may have foreign keys.
Here are the constraints:
I can't use shell utilities like mk-table-sync.
I can't use GUI tools, because they cannot be automated, as suggested here.
This needs to be done programmatically, or in the db.
I am working in Python on Google App Engine.
Currently I am doing things like OUTER JOINs and WHERE [NOT] EXISTS in SQL queries to compare each record and push the results.
My questions are:
Is there a better way to do this?
Is it better to do this in Python rather than in the db?
According to your comment to my question, you could simply do:
DELETE FROM OldTable;
INSERT INTO OldTable (field1, field2, ...) SELECT * FROM NewTable;
As I pointed out above, there might be reasons not to do this, e.g., data size.
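If wiping the table is not acceptable (foreign keys, table size), a set-based version of the one-way sync the question describes could look like the sketch below. It assumes a shared primary key column id, placeholder table names old_table/new_table and columns col1/col2, and any Python DB-API connection (e.g. MySQLdb):

# Hypothetical names throughout; adjust to your schema.
def sync_old_from_new(conn):
    cur = conn.cursor()

    # 1. Insert new rows and update changed ones (needs a PRIMARY or UNIQUE key).
    cur.execute("""
        INSERT INTO old_table
        SELECT * FROM new_table
        ON DUPLICATE KEY UPDATE
            col1 = VALUES(col1),
            col2 = VALUES(col2)
    """)

    # 2. Delete rows that no longer exist in the new table.
    #    With foreign keys, delete child rows first or rely on ON DELETE CASCADE.
    cur.execute("""
        DELETE o FROM old_table o
        LEFT JOIN new_table n ON n.id = o.id
        WHERE n.id IS NULL
    """)

    conn.commit()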

Python: Dumping Database Data with Peewee

Background
I am looking for a way to dump the results of MySQL queries made with Python & Peewee to an Excel file, including the database column headers. I'd like the exported content to be laid out in a near-identical order to the columns in the database. Furthermore, I'd like a way for this to work across multiple similar databases that may have slightly differing fields. To clarify, one database may have a user table containing "User, PasswordHash, DOB, [...]", while another has "User, PasswordHash, Name, DOB, [...]".
The Problem
My primary problem is getting the column headers out in an ordered fashion. All attempts thus far have produced unordered results, all of which are less than elegant.
Second, my methodology thus far has resulted in code which I'd (personally) hate to maintain, which I know is a bad sign.
Work so far
At present, I have used Peewee's pwiz.py script to generate the models for each of the preexisting tables in the target databases, then went through and entered all primary and foreign keys. The relations are set up, and some brief tests showed they're associating properly.
Code: I've managed to get the column headers out using something similar to:
for i, column in enumerate(User._meta.get_field_names()):
    ws.cell(row=0, column=i).value = column
As mentioned, this is unordered. Also, doing it this way forces me to do something along the lines of
getattr(some_object, title)
to dynamically populate the fields accordingly.
Thoughts and Possible Solutions
Manually write out the order I want in an array, and use that to loop through and populate the data. The pro of this is very strict/granular control; the con is that I'd need to specify this for every database.
Create (whether manually or via a method) a hash of fields with an associated weight for every field I might encounter, then write a method for sorting _meta.get_field_names() according to weight. The con of this is that the columns may not end up 100% in the right order, such as Name coming before DOB in one DB but after it in another.
Feel free to tell me I'm doing it all wrong or to suggest completely different ways of doing this; I'm all ears. I'm very much new to Python and Peewee (and ORMs in general, actually). I could switch back to Perl and do the database querying via DBI with little to no hassle; however, its Excel libraries would cause me just as many problems, and I'd like to take this as a chance to expand my knowledge.
There is a method on the model meta you can use:
for field in User._meta.get_sorted_fields():
    print(field.name)
This will print the field names in the order they are declared on the model.
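Building on that, a minimal sketch of the full dump, assuming a current openpyxl for the worksheet and that get_sorted_fields() yields field objects as shown above; the model name and output filename are placeholders:

from openpyxl import Workbook

wb = Workbook()
ws = wb.active

fields = User._meta.get_sorted_fields()  # declaration order, per the answer above

# Header row (openpyxl rows and columns are 1-indexed).
for col, field in enumerate(fields, start=1):
    ws.cell(row=1, column=col, value=field.name)

# One row per record, pulling values off the model with getattr().
for row, user in enumerate(User.select(), start=2):
    for col, field in enumerate(fields, start=1):
        ws.cell(row=row, column=col, value=getattr(user, field.name))

wb.save("users.xlsx")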
