How can I use "where not exists" SQL condition in pyspark? - python

I have a table on Hive and I am trying to insert data in that table. I am taking data from SQL but I don't want to insert id which already exists in the Hive table. I am trying to use the same condition like where not exists. I am using PySpark on Airflow.

The exists operator doesn't exist in Spark but there are 2 join operators that can replace it : left_anti and left_semi.
If you want for example to insert a dataframe df in a hive table target, you can do :
new_df = df.join(
spark.table("target"),
how='left_anti',
on='id'
)
then you write new_df in your table.
left_anti allows you to keep only the lines which do not meet the join condition (equivalent of not exists). The equivalent of exists is left_semi.

You can use not exist directly using spark SQL on the dataframes through temp views:
table_withNull_df.createOrReplaceTempView("table_withNull")
tblA_NoNull_df.createOrReplaceTempView("tblA_NoNull")
result_df = spark.sql("""
select * from table_withNull
where not exists
(select 1 from
tblA_NoNull
where table_withNull.id = tblA_NoNull.id)
""")
This method can be preferred to left anti joins since they can cause unexpected BroadcastNestedLoopJoin resulting in a broadcast timeout (even without explicitly requesting the broadcast in the anti join).
After that you can do write.mode("append") to insert the previously not encountered data.
Example taken from here

IMHO I don't think exists such a property in Spark. I think you can use 2 approaches:
A workaround with the UNIQUE condition (typical of relational DB): in this way when you try to insert (in append mode) an already existing record you'll get an exception that you can properly handle.
Read the table in which you want to write, outer join it with the data that you want to add to the aforementioned table and then write the result in overwrite mode (but I think that the first solution may be better in performance).
For more details feel free to ask

Related

Returning full statement with join in sqlalchemy

I'm using the SQLalchemy table declaration to avoid using strings and manually generating SQL statements. This has worked quite well, except for the following use case where I'm trying to return a statement which creates a merged table with all columns from both tables, joined on an implicit PK/FK which exists between the tables.
It seems like the statement below only selects columns from the first (Result) table and I'm not sure how to generate the full select statement?
sql = Query(Results, Details).join(Details)\
.filter(Details.result_type == 'standard')\
.statement
The query construct isn't the best to use for this use case, as it will return the objects themselves. Instead, this can be done with the select construct as follows:
sql = select([Results, Details])\
.select_from(Results.join(Details))

Spark sql version of the same query does not work whereas the normal sql query does

The normal sql query :
SELECT DISTINCT(county_geoid), state_geoid, sum(PredResponse), sum(prop_count) FROM table_a GROUP BY county_geoid;
Gives me an output. However the spark sql version of this same query used in pyspark is giving me an error. How to resolve this issue?
result_county_performance_alpha = spark.sql("SELECT distinct(county_geoid), sum(PredResponse), sum(prop_count), state_geoid FROM table_a group by county_geoid")
This gives an error :
AnalysisException: u"expression 'tract_alpha.`state_geoid`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
How to resolve this issue?
Your "normal" query should not work anywhere. The correct way to write the query is:
SELECT county_geoid, state_geoid, sum(PredResponse), sum(prop_count)
FROM table_a
GROUP BY county_geoid, state_geoid;
This should work on any database (where the columns and tables are defined and of the right types).
Your version has state_geoid in the SELECT, but it is not being aggregated. That is not correct SQL. It might happen to work in MySQL, but that is due to a (mis)feature in the database (that is finally being fixed).
Also, you almost never want to use SELECT DISTINCT with GROUP BY. And, the parentheses after the DISTINCT make no difference. The construct is SELECT DISTINCT. DISTINCT is not a function.

Deduping a table while keeping record structures

I've got a weekly process which does a full replace operation on a few tables. The process is weekly since there are large amounts of data as a whole. However, we also want to do daily/hourly delta updates, so the system would be more in sync with production.
When we update data, we are creating duplications of rows (updates of an existing row), which I want to get rid of. To achieve this, I've written a python script which runs the following query on a table, inserting the results back into it:
QUERY = """#standardSQL
select {fields}
from (
select *
, max(record_insert_time) over (partition by id) as max_record_insert_time
from {client_name}_{environment}.{table} as a
)
where 1=1
and record_insert_time = max_record_insert_time"""
The {fields} variable is replaced with a list of all the table columns; I can't use * here because that would only work for 1 run (the next will already have a field called max_record_insert_time and that would cause an ambiguity issue).
Everything is working as expected, with one exception - some of the columns in the table are of RECORD datatype; despite not using aliases for them, and selecting their fully qualified name (e.g. record_name.child_name), when the output is written back into the table, the results are flattened. I've added the flattenResults: False config to my code, but this has not changed the outcome.
I would love to hear thoughts about how to resolve this issue using my existing plan, other methods of deduping, or other methods of handling delta updates altogether.
Perhaps you can use in the outer statement
SELECT * EXCEPT (max_record_insert_time)
This should keep the exact record structure. (for more detailed documentation see https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#select-except)
Alternative approach, would be include in {fields} only top level columns even if they are non leaves, i.e. just record_name and not record_name.*
Below answer is definitely not better than use of straightforward SELECT * EXCEPT modifier, but wanted to present alternative version
SELECT t.*
FROM (
SELECT
id, MAX(record_insert_time) AS max_record_insert_time,
ARRAY_AGG(t) AS all_records_for_id
FROM yourTable AS t GROUP BY id
), UNNEST(all_records_for_id) AS t
WHERE t.record_insert_time = max_record_insert_time
ORDER BY id
What above query does is - first groups all records for each id into array of respective rows along with max value for insert_time. Then, for each id - it simply flattens all (previously aggregated) rows and picks only rows with insert_time matching max time. Result is as expected. No Analytic Function involved but rather simple Aggregation. But extra use of UNNEST ...
Still - at least different option :o)

Get All of Single Column from Every Table in Schema

In our system, we have 1000+ tables, each of which has an 'date' column containing DateTime object. I want to get a list containing every date that exists within all of the tables. I'm sure there should be an easy way to do this, but I've very limited knowledge of either postgresql or sqlalchemy.
In postgresql, I can do a full join on two tables, but there doesn't seem to be a way to do a join on every table in a schema, for a single common field.
I then tried to solve this programmatically in python with sqlalchemy. For each table, I did created a select distinct for the 'date' column, then set that list of selectes that to the selects property of a CompoundSelect object, and executed. As one might expect from an ugly brute force query, it has ben running now for an hour or so, and I am unsure if it has broken silently somewhere and will never return.
Is there a clean and better way to do this?
You definitely want to do this on the server, not at the application level, due to the many round trips between application and server and likely duplication of data in intermediate results.
Since you need to process 1,000+ tables, you should use the system catalogs and dynamically query the tables. You need a function to do that efficiently:
CREATE FUNCTION get_all_dates() RETURNS SETOF date AS $$
DECLARE
tbl name;
BEGIN
FOR tbl IN SELECT 'public.' || tablename FROM pg_tables WHERE schemaname = 'public' LOOP
RETURN QUERY EXECUTE 'SELECT DISTINCT date::date FROM ' || tbl;
END LOOP
END; $$ LANGUAGE plpgsql;
This will process all the tables in the public schema; change as required. If the tables are in multiple schemas you need to insert your additional logic on where tables are stored, or you can make the schema name a parameter of the function and call the function multiple times and UNION the results.
Note that you may get duplicate dates from multiple tables. These duplicates you can weed out in the statement calling the function:
SELECT DISTINCT * FROM get_all_dates() ORDER BY 1;
The function creates a result set in memory, but if the number of distinct dates in the rows in the 1,000+ tables is very large, the results will be written to disk. If you expect this to happen, then you are probably better off creating a temporary table at the beginning of the function and inserting the dates into that temp table.
Ended up reverting back to a previous solution of using SqlAlchemy to run the queries. This allowed me to parallelize things and run a little faster, since it really was a very large query.
I knew a few things with the dataset that helped with this query- I only wanted distinct dates from each table, and that the dates were the PK in my set. I ended up using the approach from this wiki page. Code being sent in the query looked like the following:
WITH RECURSIVE t AS (
(SELECT date FROM schema.tablename ORDER BY date LIMIT 1)
UNION ALL SELECT (SELECT knowledge_date FROM schema.table WHERE date > t.date ORDER BY date LIMIT 1)
FROM t WHERE t.date IS NOT NULL)
SELECT date FROM t WHERE date IS NOT NULL;
I pulled the results of that query into a list of all my dates if they weren't already in the list, then saved that for use later. It's possible that it takes just as long as running it all in the pgsql console, but it was easier for me to save locally than to have to query the temp table in the db.

Can I get table names along with column names using .description() in Python's DB API?

I am using Python with SQLite 3. I have user entered SQL queries and need to format the results of those for a template language.
So, basically, I need to use .description of the DB API cursor (PEP 249), but I need to get both the column names and the table names, since the users often do joins.
The obvious answer, i.e. to read the table definitions, is not possible -- many of the tables have the same column names.
I also need some intelligent behaviour on the column/table names for aggregate functions like avg(field)...
The only solution I can come up with is to use an SQL parser and analyse the SELECT statement (sigh), but I haven't found any SQL parser for Python that seems really good?
I haven't found anything in the documentation or anyone else with the same problem, so I might have missed something obvious?
Edit: To be clear -- the problem is to find the result of an SQL select, where the select statement is supplied by a user in a user interface. I have no control of it. As I noted above, it doesn't help to read the table definitions.
Python's DB API only specifies column names for the cursor.description (and none of the RDBMS implementations of this API will return table names for queries...I'll show you why).
What you're asking for is very hard, and only even approachable with an SQL parser...and even then there are many situations where even the concept of which "Table" a column is from may not make much sense.
Consider these SQL statements:
Which table is today from?
SELECT DATE('now') AS today FROM TableA FULL JOIN TableB
ON TableA.col1 = TableB.col1;
Which table is myConst from?
SELECT 1 AS myConst;
Which table is myCalc from?
SELECT a+b AS myCalc FROM (select t1.col1 AS a, t2.col2 AS b
FROM table1 AS t1
LEFT OUTER JOIN table2 AS t2 on t1.col2 = t2.col2);
Which table is myCol from?
SELECT SUM(a) as myCol FROM (SELECT a FROM table1 UNION SELECT b FROM table2);
The above were very simple SQL statements for which you either have to make up a "table", or arbitrarily pick one...even if you had an SQL parser!
What SQL gives you is a set of data back as results. The elements in this set can not necessarily be attributed to specific database tables. You probably need to rethink your approach to this problem.

Categories

Resources