Need to retain all columns but getting Duplicate Column Name #1060

I have this large query I am trying to perform. I perform a series of joins, and then from that resulting relation I want to perform another join and filter out certain tuples.
SELECT *
FROM
(
SELECT *
FROM
market_instrument
inner join exchange_instrument
on market_instrument.id = exchange_instrument.instrument_id
inner join Table1 on market_instrument.id = Table1.instrument_id
left join Table2 on market_instrument.id = Table2.instrument_id
left join `options` on market_instrument.id = `options`.instrument_id
left join Table3 on market_instrument.id = Table3.instrument_id
) as R
inner join Table4 on R.instrument_id = Table4.instrument_id
where Table4.fill_timestamp between CURDATE() - INTERVAL 30 DAY AND NOW();
R is the "series of joins" I'm referring to. I want to inner join R with Table4 and then filter out the resulting relation for the last 30 days (where the date attribute is Table4.fill_timestamp). I'm using SQLAlchemy so I thought about somehow saving R to some result relation variable and performing a separate query on that, but I don't know how SQLAlchemy handles that, so I wanted to try doing the entire query in SQL first.
I keep getting the Duplicate Column Name "instrument_id" error. instrument_id is the primary key for all tables except market_instrument, where it's the same but it's called id instead. What can I do to get around this issue?

The problem is that R has all the columns from several tables, and more than one of those tables has a column named "instrument_id". You have not assigned aliases to any of those column names, so SQL does not know which instrument_id column you mean when you say "R.instrument_id".
If market_instrument is the only table with an id column then you could join on R.id instead of R.instrument_id.
Alternatively, another group of solutions involves assigning different names to some or all of the columns in R. For example,
SELECT
market_instrument.*,
exchange_instrument.*,
Table1.instrument_id AS the_one_true_id,
Table1.another_column,
Table1.yet_another_column,
...
Table2.*,
options.*,
Table3.*
FROM
market_instrument
inner join exchange_instrument
on market_instrument.id = exchange_instrument.instrument_id
inner join Table1 on market_instrument.id = Table1.instrument_id
left join Table2 on market_instrument.id = Table2.instrument_id
left join `options` on market_instrument.id = `options`.instrument_id
left join Table3 on market_instrument.id = Table3.instrument_id
With the above, you could then join on R.the_one_true_id. Alternatively, you could leave your current join as it is, and rename all the instrument_id columns but one. It might (or might not) be convenient to do that in the context of replacing R with a full-fledged VIEW in your schema.
Alternatively, your select list could enumerate all the columns of all the tables in the join. That might be tedious, but if you really do need all of them, then you will need to do that to disambiguate the other duplicate names, which include, at least, the various other instrument_id columns. Presented with such a task, however, perhaps you would discover that you don't really need every one of them.
As yet another alternative, you could add more columns instead of renaming existing ones. For example,
SELECT
*,
exchange_instrument.instrument_id AS ei_instrument_id,
Table1.instrument_id AS t1_instrument_id,
Table2.instrument_id AS t2_instrument_id,
options.instrument_id AS op_instrument_id,
Table3.instrument_id AS t3_instrument_id
FROM
...
Then you can access, say, R.t1_instrument_id, whose name is presumably unique.
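If you really do need every column, the aliased select list does not have to be typed by hand. Below is a minimal sketch of generating it, using Python's built-in sqlite3 module as a stand-in for MySQL. The table layouts, the sample rows, and the helper name aliased_select_list are made up for illustration; on MySQL the column names would have to come from somewhere else (for example information_schema.columns) rather than from PRAGMA table_info.

```python
import sqlite3


def aliased_select_list(conn, tables):
    """Build 'tbl.col AS tbl__col' for every column of every table (hypothetical helper)."""
    parts = []
    for table in tables:
        # PRAGMA table_info is SQLite-specific; row[1] holds the column name.
        for row in conn.execute(f'PRAGMA table_info("{table}")'):
            col = row[1]
            parts.append(f'"{table}"."{col}" AS "{table}__{col}"')
    return ", ".join(parts)


conn = sqlite3.connect(":memory:")
# Invented, cut-down versions of the tables from the question.
conn.executescript("""
    CREATE TABLE market_instrument (id INTEGER PRIMARY KEY, symbol TEXT);
    CREATE TABLE Table1 (instrument_id INTEGER PRIMARY KEY, price REAL);
    CREATE TABLE Table4 (instrument_id INTEGER PRIMARY KEY, fill_timestamp TEXT);
    INSERT INTO market_instrument VALUES (1, 'ABC');
    INSERT INTO Table1 VALUES (1, 9.5);
    INSERT INTO Table4 VALUES (1, '2020-01-01');
""")

select_list = aliased_select_list(conn, ["market_instrument", "Table1"])
sql = f"""
    SELECT R.*, Table4.fill_timestamp AS fill_timestamp
    FROM (
        SELECT {select_list}
        FROM market_instrument
        INNER JOIN Table1 ON market_instrument.id = Table1.instrument_id
    ) AS R
    INNER JOIN Table4 ON R.Table1__instrument_id = Table4.instrument_id
"""
cursor = conn.execute(sql)
column_names = [d[0] for d in cursor.description]
rows = cursor.fetchall()
print(column_names)
print(rows)
```

Every column of R now has a table-prefixed name, so R.Table1__instrument_id is unambiguous and no two columns of R collide.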

Related

What did I misunderstand about inner join meaning for wrds database?

I learned join methods in sql, and I know that inner join means returning only the intersections of the two different tables that we want to set.
I thought the concept was the same in Python, but I have trouble understanding the following code.
crsp1=pd.merge(crsp, crsp_maxme, how='inner', on=['jdate','permco','me'])
crsp1=crsp1.drop(['me'], axis=1)
crsp2=pd.merge(crsp1, crsp_summe, how='inner', on=['jdate','permco'])
If I understood correctly, the first line merges table crsp and crsp_maxme with intersection on column 'jdate', 'permco', 'me'. So the table crsp1 would have 3 columns.
The second line drops the 'me' column of table crsp1.
The last line would merge the newly adjusted table crsp1 with crsp_summe using an inner join on 'jdate' and 'permco', which would leave the merged table crsp2 with only 2 columns.
However, the explanation that comes with the code says that the second and third lines drop the 'me' column from crsp1 and then replace it with 'me' from the crsp_summe table, which I have trouble understanding.
Could anyone clarify these lines for me?
PS: I didn't think it was necessary to explain what the tables crsp, crsp_summe, and crsp_maxme contain, since the question is only about the inner join. Please excuse the lack of background info.

The on parameter of merge() specifies which columns to join on, and how specifies the type of join (similar to SQL joins: outer, inner, left, right, etc.).
Example:
Suppose there are two tables A and B with columns A['x1','x2','x3'] and B['x2','y1']. Joining them on 'x2' (the column common to both tables) produces A_join_B_on_x2['x1','x2','x3','y1'], with the rows determined by the type of join you chose. The join keys only decide which rows are matched; the other columns of both tables are all carried into the result.
In your current code, consider
A = crsp
B = crsp_maxme
C = crsp_summe
Now, in your program:
The first line merges A and B on the ['jdate','permco','me'] columns and creates a new dataframe A_B containing ['jdate','permco','me', ... plus the remaining columns of A and B] as an inner join (i.e. only rows whose ['jdate','permco','me'] values appear in both A and B).
The second line drops the 'me' column from A_B, so its columns become
['jdate','permco', ... plus the remaining columns of A and B]
The third line merges A_B and C on ['jdate','permco'] and creates ['jdate','permco', ... plus the remaining columns of A_B and C] as an inner join (i.e. only rows whose ['jdate','permco'] values appear in both A_B and C). If C (crsp_summe) has its own 'me' column, it is one of those remaining columns, which is how the dropped 'me' ends up replaced by the one from crsp_summe.
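To see the three lines in action, here is a small sketch with invented data. How crsp_maxme and crsp_summe are really built is not shown in the question, so the groupby calls below (largest and total 'me' per jdate and permco) and the permno column are assumptions made only to have something to merge.

```python
import pandas as pd

# Toy stand-in for the real crsp table (values are invented).
crsp = pd.DataFrame({
    "jdate":  ["2020-01", "2020-01", "2020-01"],
    "permco": [1, 1, 2],
    "permno": [10, 11, 20],
    "me":     [100.0, 300.0, 50.0],
})
# Assumed construction: max and sum of 'me' per (jdate, permco).
crsp_maxme = crsp.groupby(["jdate", "permco"], as_index=False)["me"].max()
crsp_summe = crsp.groupby(["jdate", "permco"], as_index=False)["me"].sum()

# Line 1: keep only the rows whose 'me' equals the group maximum.
crsp1 = pd.merge(crsp, crsp_maxme, how="inner", on=["jdate", "permco", "me"])
# Line 2: drop that row's own 'me'.
crsp1 = crsp1.drop(["me"], axis=1)
# Line 3: the merge brings in crsp_summe's 'me' as a non-key column.
crsp2 = pd.merge(crsp1, crsp_summe, how="inner", on=["jdate", "permco"])

print(crsp2)
```

The result still has the non-key permno column from crsp, and its 'me' column now holds the per-group total from crsp_summe rather than the row's original value.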

Nested SELECT query in Pyspark DataFrames

Suppose I have two DataFrames in Pyspark and I'd want to run a nested SQL-like SELECT query, on the lines of
SELECT * FROM table1
WHERE b IN
(SELECT b FROM table2
WHERE c='1')
Now, I can achieve a select query by using where, as in
df.where(df.a.isin(my_list))
given I have selected the my_list tuple of values beforehand. How would I perform a nested query in one go instead?

As of now, Spark doesn't support subqueries in the WHERE clause (SPARK-4226). The closest thing you can get without collecting is a join followed by distinct, roughly equivalent to this:
SELECT DISTINCT table1.*
FROM table1 JOIN table2
WHERE table1.b = table2.b AND table2.c = '1'
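The "roughly" matters: the DISTINCT removes the duplicates that the join introduces when table2 has several matching rows, but it would also collapse rows that are genuinely duplicated in table1. As a quick sanity check of the rewrite, here is a sketch that runs both forms with Python's built-in sqlite3 module, used purely as a stand-in since it accepts both queries; the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (a INTEGER, b INTEGER);
    CREATE TABLE table2 (b INTEGER, c TEXT);
    INSERT INTO table1 VALUES (1, 10), (2, 20), (3, 30);
    -- b = 10 appears twice with c = '1', so a plain join would duplicate it.
    INSERT INTO table2 VALUES (10, '1'), (10, '1'), (20, '0'), (30, '1');
""")

# The nested form from the question.
nested = conn.execute("""
    SELECT * FROM table1
    WHERE b IN (SELECT b FROM table2 WHERE c = '1')
    ORDER BY a
""").fetchall()

# The join + distinct rewrite.
joined = conn.execute("""
    SELECT DISTINCT table1.*
    FROM table1 JOIN table2
    WHERE table1.b = table2.b AND table2.c = '1'
    ORDER BY table1.a
""").fetchall()

print(nested)
print(joined)
```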

Raw SQL to SQLAlchemy

Part of my raw sql statement looks like this:
select /*some selects*/
if(/*condition*/, table1.price , if(/*condition*/, t2.price, t3.price)) as price
/*some joins*/
left join table2 t2 on table1.type=t2.id
left join table3 t3 on table1.type=t3.id
This statement works as expected.
SQLAlchemy ORM:
query = db_session.query(Table1,\
func.IF(Table1.field5 == 5, Table1.price,\
func.IF(Table1.new_model == 1, Table2.price, Table3.price))
#+some selects
#+some joins
query = query.join(Table2, Table1.type == Table2.id)\
.join(Table3, Table1.type == Table3.id)
And it doesn't work the same way. It returns only the rows that are connected to Table2. Leaving these joins out of the query returns the rows I need, but of course without the fields I need from Table2 and Table3.
What is my mistake?

You need to use outerjoin for LEFT JOIN.

LEFT JOIN and JOIN are different operations.
For LEFT JOIN use outerjoin. For JOIN (aka INNER JOIN) use join.
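The difference is easy to see on a tiny data set. The sketch below uses Python's built-in sqlite3 module and invented rows rather than SQLAlchemy itself; the inner variant corresponds to the SQL that chained join() calls produce, the left variant to chained outerjoin() calls.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Invented rows: type 1 matches only table2, type 2 matches both, type 3 matches neither.
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, type INTEGER);
    CREATE TABLE table2 (id INTEGER, price REAL);
    CREATE TABLE table3 (id INTEGER, price REAL);
    INSERT INTO table1 VALUES (1, 1), (2, 2), (3, 3);
    INSERT INTO table2 VALUES (1, 1.5), (2, 2.5);
    INSERT INTO table3 VALUES (2, 9.5);
""")

template = """
    SELECT table1.id, t2.price, t3.price
    FROM table1
    {kind} JOIN table2 t2 ON table1.type = t2.id
    {kind} JOIN table3 t3 ON table1.type = t3.id
    ORDER BY table1.id
"""
inner_rows = conn.execute(template.format(kind="INNER")).fetchall()
left_rows = conn.execute(template.format(kind="LEFT")).fetchall()

print(inner_rows)  # only rows with a match in both joined tables
print(left_rows)   # every table1 row, with NULLs where nothing matched
```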

Nested Joins in SQLAlchemy

How do I do nested joins in SQLAlchemy? The statement I'm trying to run is
SELECT a.col1, a.col2, c.col3
FROM a
LEFT OUTER JOIN (b INNER JOIN c ON c.col4 = b.col4) ON b.col5 = a.col5
I need to show all records in A, but join them only with those records in B that can JOIN with C.
The code I have so far is
session.query(a.col1, a.col2, c.col3).outerjoin(b, b.col5 == a.col5).all()
This gets me most of what I need, with A records showing null values where it's missing B records; however, too many Bs are getting in, and I need to limit them. However, if I just add another join, i.e.,
session.query(a.col1, a.col2, c.col3).outerjoin(b, b.col5 == a.col5).join(c, b.col4 == c.col4).all()
It drops out all of the A records with null values in B.
I should point out that I can't join A to C directly because the only connection between the two is through B.

This is most easily done with a subquery:
subq = (session.query(b.col5).join(c, c.col4 == b.col4)).subquery("subq")
qry = session.query(a).outerjoin(subq, a.col5 == subq.c.col5)
print(qry)
If you showed more of the model definition and especially the nature of the relationships between the tables, there might be a more elegant solution.
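For reference, here is a sketch of the SQL shape the subquery approach amounts to, run with Python's built-in sqlite3 module on invented rows. One assumption beyond the snippet above: the subquery here also selects c.col3, since the original statement needs that column in its output. The chained version from the question is included for comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Invented rows: a1 reaches c through b, a2 has a b without a c, a3 has no b.
conn.executescript("""
    CREATE TABLE a (col1 TEXT, col2 TEXT, col5 INTEGER);
    CREATE TABLE b (col4 INTEGER, col5 INTEGER);
    CREATE TABLE c (col3 TEXT, col4 INTEGER);
    INSERT INTO a VALUES ('a1', 'x', 1), ('a2', 'y', 2), ('a3', 'z', 3);
    INSERT INTO b VALUES (100, 1), (200, 2);
    INSERT INTO c VALUES ('c1', 100);
""")

# Outer join against the (b INNER JOIN c) subquery: every row of a survives.
nested_rows = conn.execute("""
    SELECT a.col1, a.col2, subq.col3
    FROM a
    LEFT OUTER JOIN (
        SELECT b.col5 AS col5, c.col3 AS col3
        FROM b INNER JOIN c ON c.col4 = b.col4
    ) AS subq ON subq.col5 = a.col5
    ORDER BY a.col1
""").fetchall()

# Chaining the inner join after the outer join drops the unmatched rows of a.
chained_rows = conn.execute("""
    SELECT a.col1, a.col2, c.col3
    FROM a
    LEFT OUTER JOIN b ON b.col5 = a.col5
    INNER JOIN c ON b.col4 = c.col4
    ORDER BY a.col1
""").fetchall()

print(nested_rows)
print(chained_rows)
```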

Need a workaround to filter on related model and aggregated fields in Django

I opened a ticket for this problem.
In a nutshell here is my model:
class Plan(models.Model):
cap = models.IntegerField()
class Phone(models.Model):
plan = models.ForeignKey(Plan, related_name='phones')
class Call(models.Model):
phone = models.ForeignKey(Phone, related_name='calls')
cost = models.IntegerField()
I want to run a query like this one:
Phone.objects.annotate(total_cost=Sum('calls__cost')).filter(total_cost__gte=0.5*F('plan__cap'))
Unfortunately Django generates bad SQL:
SELECT "app_phone"."id", "app_phone"."plan_id",
SUM("app_call"."cost") AS "total_cost"
FROM "app_phone"
INNER JOIN "app_plan" ON ("app_phone"."plan_id" = "app_plan"."id")
LEFT OUTER JOIN "app_call" ON ("app_phone"."id" = "app_call"."phone_id")
GROUP BY "app_phone"."id", "app_phone"."plan_id"
HAVING SUM("app_call"."cost") >= 0.5 * "app_plan"."cap"
and errors with:
ProgrammingError: column "app_plan.cap" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: ...."plan_id" HAVING SUM("app_call"."cost") >= 0.5 * "app_plan"....
Is there any workaround apart from running raw SQL?

When aggregating, SQL requires any value in a field either be unique within a group, or that the field be wrapped in an aggregation function which ensures that only one value will come out for each group. The problem here is that "app_plan.cap" could have many different values for each combination of "app_phone.id" and "app_phone.plan_id", so you need to tell the DB how to treat those.
So, valid SQL for your result is one of two different possibilities, depending on the result you want. First, you could include app_plan.cap in the GROUP BY clause, so that any distinct combination of (app_phone.id, app_phone.plan_id, app_plan.cap) will be a different group:
SELECT "app_phone"."id", "app_phone"."plan_id", "app_plan"."cap",
SUM("app_call"."cost") AS "total_cost"
FROM "app_phone"
INNER JOIN "app_plan" ON ("app_phone"."plan_id" = "app_plan"."id")
LEFT OUTER JOIN "app_call" ON ("app_phone"."id" = "app_call"."phone_id")
GROUP BY "app_phone"."id", "app_phone"."plan_id", "app_plan"."cap"
HAVING SUM("app_call"."cost") >= 0.5 * "app_plan"."cap"
The trick is to get the extra value into the GROUP BY clause. We can weasel our way into this by abusing "extra", though this hard-codes the table name "app_plan", which is not ideal. You could build it programmatically from the Plan class instead if you wanted:
Phone.objects.extra({
"plan_cap": "app_plan.cap"
}).annotate(
total_cost=Sum('calls__cost')
).filter(total_cost__gte=0.5*F('plan__cap'))
Alternatively, you could wrap app_plan.cap in an aggregation function, turning it into a unique value. Aggregation functions vary by DB provider, but might include things like AVG, MAX, MIN, etc.
SELECT "app_phone"."id", "app_phone"."plan_id",
SUM("app_call"."cost") AS "total_cost",
AVG("app_plan"."cap") AS "avg_cap"
FROM "app_phone"
INNER JOIN "app_plan" ON ("app_phone"."plan_id" = "app_plan"."id")
LEFT OUTER JOIN "app_call" ON ("app_phone"."id" = "app_call"."phone_id")
GROUP BY "app_phone"."id", "app_phone"."plan_id"
HAVING SUM("app_call"."cost") >= 0.5 * AVG("app_plan"."cap")
You could get this result in Django using the following:
Phone.objects.annotate(
total_cost=Sum('calls__cost'),
avg_cap=Avg('plan__cap')
).filter(total_cost__gte=0.5 * F("avg_cap"))
You may want to consider updating the bug report you left with a clearer specification of the result you expect -- for example, the valid SQL you're after.
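Because each phone has exactly one plan, both valid forms should select the same phones. A small sketch that checks this with Python's built-in sqlite3 module and invented rows follows. SQLite is only a stand-in here: it is more lenient than PostgreSQL and would not raise the GROUP BY error for the original query, so this only confirms that the two rewritten queries agree.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Invented rows: phone 1 totals 60, phone 2 totals 10, phone 3 has no calls.
conn.executescript("""
    CREATE TABLE app_plan (id INTEGER PRIMARY KEY, cap INTEGER);
    CREATE TABLE app_phone (id INTEGER PRIMARY KEY, plan_id INTEGER);
    CREATE TABLE app_call (phone_id INTEGER, cost INTEGER);
    INSERT INTO app_plan VALUES (1, 100);
    INSERT INTO app_phone VALUES (1, 1), (2, 1), (3, 1);
    INSERT INTO app_call VALUES (1, 30), (1, 30), (2, 10);
""")

base = """
    SELECT app_phone.id, SUM(app_call.cost) AS total_cost
    FROM app_phone
    INNER JOIN app_plan ON app_phone.plan_id = app_plan.id
    LEFT OUTER JOIN app_call ON app_phone.id = app_call.phone_id
"""

# Option 1: add app_plan.cap to the GROUP BY clause.
by_group = conn.execute(base + """
    GROUP BY app_phone.id, app_phone.plan_id, app_plan.cap
    HAVING SUM(app_call.cost) >= 0.5 * app_plan.cap
""").fetchall()

# Option 2: wrap app_plan.cap in an aggregate.
by_aggregate = conn.execute(base + """
    GROUP BY app_phone.id, app_phone.plan_id
    HAVING SUM(app_call.cost) >= 0.5 * AVG(app_plan.cap)
""").fetchall()

print(by_group)
print(by_aggregate)
```

Note that a phone with no calls has a NULL total, which fails the HAVING comparison in both forms.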
