I need some guidance on how to approach this problem.
Consider the following 3 tables:
I inherited this dataset in Oracle, and I need to find a way to merge these 3 tables into a single table. Why? Compliance and fraud checks. We're being audited, so I have no choice but to comply.
The tables contain millions of records: 69 million, 7 million, and 4 million. If I try to join and search using CONTAINS, INSTR, or LIKE, I'd be creating Cartesian joins (I think) and it will be quite slow.
How do I approach this? The data is formatted with the square brackets as in the picture. I have considered pulling these tables into pandas and using Python to get the answer.
The resulting table would look like this:
I think your first step would be to get the meals and teams tables in a usable format. Using these weird concatenation columns is never going to be easy or fast.
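Assuming the teams table stores the concatenated ids roughly like this (a reconstruction based on the column names and the output below; the values are illustrative):
+------+---------------------------+
| TEAM | CONCAT_IDENTIFIER_PARTNER |
+------+---------------------------+
| Blue | [012345][012346]          |
| Red  | [012347]                  |
+------+---------------------------+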
Doing something simple like this to split the ids onto different rows:
-- replace() rewrites each ']' as ']","', turning '[012345][012346]' into a
-- quoted, comma-separated list that xmltable evaluates as a sequence,
-- returning one value per row in column_value
select trim(column_value) partner_id,
       team
from   teams,
       xmltable(('"' || replace(teams.concat_identifier_partner, ']',']","') || '"'))
Gives an output like this:
+------------+------+
| PARTNER_ID | TEAM |
+------------+------+
| [012345] | Blue |
| [012346] | Blue |
| [012347] | Red |
+------------+------+
Then you could use these as inline views in your query, or create a new temporary table for each of them; a temporary table would let you index on the new ID column.
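For example, a rough sketch of materialising the teams rows as their own table (the table and index names are placeholders; an Oracle global temporary table would work similarly, just loaded with a separate INSERT per session):
create table formatted_teams as
  select trim(column_value) partner_id,
         team
  from   teams,
         xmltable(('"' || replace(teams.concat_identifier_partner, ']',']","') || '"'));

create index formatted_teams_pid_ix on formatted_teams (partner_id);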
After your data is in the easier format, the joins become easy.
select people.identifier,
people.name,
formatted_teams.team,
formatted_meals.meal
from people
join (select trim(column_value) partner_id,
team
from teams,
xmltable(('"' || replace(teams.concat_identifier_partner, ']',']","') || '"')) ) formatted_teams
on people.identifier = formatted_teams.partner_id
left join (select trim(column_value) plate_id,
meal
from meals,
xmltable(('"' || replace(meals.concat_identifier_plate, ']',']","') || '"')) ) formatted_meals
on people.identifier = formatted_meals.plate_id
order by people.identifier
Output:
+------------+--------+------+--------+
| IDENTIFIER | NAME | TEAM | MEAL |
+------------+--------+------+--------+
| [012345] | George | Blue | Salmon |
| [012346] | Jenny | Blue | |
| [012347] | Alex | Red | Pike |
| [012347] | Alex | Red | Cod |
+------------+--------+------+--------+
Related
I have a database stored in a GridDB container. The main table in the database contains a column with outdated data. I would like to replace the whole column with a new column from another table (with the same total number of rows). Is there any way I could do that with Python?
For example, the whole process looks like this:
-- old table
column_0 | column_1 | old_column
------------+-----------+------------
1344 | Max | 5263.42525
1345 | John | 1465.41234
1346 | Alex | 8773.12344
1347 | Matthew | 5489.23522
1348 | Mark | 9874.31423
-- replacement
col_0 | updated
---------+------------
4242 | 3553.42824
4243 | 8942.98731
4244 | 1424.36742
4245 | 7642.75352
4246 | 2844.92468
-- output
column_0 | column_1 | old_column
------------+-----------+------------
1344 | Max | 3553.42824
1345 | John | 8942.98731
1346 | Alex | 1424.36742
1347 | Matthew | 7642.75352
1348 | Mark | 2844.92468
I have tried to replace the values one by one but I want something faster and kind of automated.
What you are trying to do here is what is called a join in SQL. Assuming that rows between the tables are matched according to their rank, you can join two subqueries with a rank() window function and then pick the wanted columns from both sets.
SELECT column_0, column_1, updated AS old_column
FROM
  (SELECT rank() OVER (ORDER BY column_0) r, * FROM old_table) lhs
  JOIN (SELECT rank() OVER (ORDER BY col_0) r, * FROM replacement) rhs
  ON lhs.r = rhs.r
Each subquery adds a counter for each row with the rank function; rows are then matched according to that rank value, producing a row containing columns from both subqueries. The top level then picks only the wanted columns, with an AS clause for the desired naming.
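If you would rather do the same positional match in Python, a minimal sketch (assuming both tables can be pulled into pandas DataFrames, for example via the GridDB Python client or a CSV export) looks like this:
import pandas as pd

# Both tables loaded as DataFrames with the columns shown above
old_table = pd.DataFrame({
    "column_0": [1344, 1345, 1346, 1347, 1348],
    "column_1": ["Max", "John", "Alex", "Matthew", "Mark"],
    "old_column": [5263.42525, 1465.41234, 8773.12344, 5489.23522, 9874.31423],
})
replacement = pd.DataFrame({
    "col_0": [4242, 4243, 4244, 4245, 4246],
    "updated": [3553.42824, 8942.98731, 1424.36742, 7642.75352, 2844.92468],
})

# Match rows by rank/position: sort both frames, then overwrite the column.
old_sorted = old_table.sort_values("column_0").reset_index(drop=True)
new_sorted = replacement.sort_values("col_0").reset_index(drop=True)
old_sorted["old_column"] = new_sorted["updated"]
print(old_sorted)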
I am trying to annotate a queryset with an aggregate of a subset of previous rows. Take the following example table of a player's score in a particular game, where the column last_2_average_score is the rolling average of that player's score over the previous two games.
+----------+-----------+---------+-------------------------+
| date | player | score | last_2_average_score |
+----------+-----------+---------+-------------------------+
| 12/01/19 | 1 | 10 | None |
| 12/02/19 | 1 | 9 | None |
| 12/03/19 | 1 | 8 | 9.5 |
| 12/04/19 | 1 | 7 | 8.5 |
| 12/05/19 | 1 | 6 | 7.5 |
+----------+-----------+---------+-------------------------+
In order to accomplish this, I wrote the following query, trying to annotate each "row" with the corresponding two-game average of their score:
ScoreModel.objects.annotate(
last_two_average_score=Subquery(
ScoreModel.objects.filter(
player=OuterRef("player"), date__lt=OuterRef("date")
)
.order_by("-date")[:2]
.annotate(Avg("score"))
.values("score__avg")[:1],
output_field=FloatField(),
)
)
This query, however, does not output the correct result. In fact, the result is just every record annotated with
{'last_two_average_score': None}
I have tried a variety of different combinations of the query, and cannot find the correct combination. Any advice that you can give would be much appreciated!
Instead of trying to attack the problem from the ORM first, I ended up circling back and implementing the query in raw SQL. This immediately led me to the concept of window functions, which I then found in Django's ORM very quickly.
https://docs.djangoproject.com/en/3.0/ref/models/expressions/#window-functions
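For reference, the rolling average in raw SQL is a window function along these lines (a sketch in PostgreSQL-flavoured SQL; the table name score_model is assumed, and the CASE guard reproduces the None values for the first two games):
SELECT date,
       player,
       score,
       CASE
           WHEN COUNT(score) OVER w = 2 THEN AVG(score) OVER w
       END AS last_2_average_score
FROM score_model
WINDOW w AS (
    PARTITION BY player
    ORDER BY date
    ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING
);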
For those interested, the resulting query looks something like this, which was much simpler than what I was trying to accomplish with Subquery:
ScoreModel.objects.annotate(
last_two_average=Window(
expression=Avg("score"),
partition_by=[F("player")],
order_by=[F("date").desc()],
frame=RowRange(start=-2, end=0),
)
)
Referring to here, which recommends a join to append a column from one table to another: I have indeed been using this method, but I am now hitting its limits for a huge number of tables and rows.
Let's say I have a dataframe of M features: id, salary, age, etc.
+----+--------+------------+--------------+
| id | salary | age | zone | ....
+----+--------+------------+--------------+
I have performed certain operations on each feature to arrive at something like this:
+----+--------+------------+--------------+------------+--------------+--------------+--------------+
| id | salary | bin_salary | start_salary | end_salary | count_salary | stat1_salary | stat2_salary |
+----+--------+------------+--------------+------------+--------------+--------------+--------------+
Each feature is processed independently, with the same list of rows
+-----+--------+--------------+--------------+--------------+
| id  | salary | stat1_salary | stat2_salary | stat3_salary |
+-----+--------+--------------+--------------+--------------+
| 301 | x1     | x            | x            | x            |
| 302 | null   | x            | x            | x            |
| 303 | x3     | x            | x            | x            |
+-----+--------+--------------+--------------+--------------+

+-----+------+-----------+-----------+
| id  | age  | stat1_age | stat2_age |
+-----+------+-----------+-----------+
| 301 | null | x         | x         |
| 302 | x2   | x         | x         |
| 303 | x3   | x         | x         |
+-----+------+-----------+-----------+
In the end, I would like to combine them into the final dataframe with all the attributes of each feature, by joining on the unique id across effectively hundreds to thousands of tables, one per feature. This final dataframe is my feature vector:
| id | salary | stat1_salary | stat2_salary | stat3_salary | age | stat1_age | stat2_age |
I am hitting a memory limit that causes an Out Of Memory exception. Raising executor and driver memory seems to be only a temporary solution, and it is capped by the admin.
JOIN is expensive and resource-limited in PySpark, and I wonder if it's possible to pre-sort each feature table independently, keep that order, and just APPEND the entire columns next to one another instead of performing an expensive JOIN. I can make sure every feature table keeps the same list of rows. I hope to avoid any join or lookup because my set of ids is the same.
How is this achievable? As far as I understand, even if I sort each table by id, Spark distributes the data for storage, and retrieving it (if I want to query it back to append) does not guarantee the same order.
There doesn't seem to be a Spark function to append a column from one DF to another directly, other than 'join'.
If you are starting from only one dataframe and trying to generate new features from each original column of that dataframe,
I would suggest using 'pandas_udf', where the new features can be appended inside the UDF for all the original columns.
This avoids using 'join' at all.
To control memory usage, choose the 'group' column so that each group is guaranteed to fit within the executor memory specification.
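A minimal sketch of that idea, assuming a single input dataframe with id, salary and age columns; the hash bucket used as the group column and the stats themselves are just placeholders:
import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Stand-in for the real input dataframe
df = spark.createDataFrame(
    [(301, 1000.0, None), (302, None, 35.0), (303, 3000.0, 40.0)],
    "id long, salary double, age double",
)

def add_features(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf is a plain pandas DataFrame holding one group; compute every
    # per-feature stat here instead of joining separate tables later.
    pdf["stat1_salary"] = pdf["salary"].rank(pct=True)
    pdf["stat1_age"] = pdf["age"].fillna(pdf["age"].median())
    return pdf

# Pick a group column that keeps each group within executor memory,
# e.g. a hash bucket over the id.
bucketed = df.withColumn("bucket", (F.abs(F.hash("id")) % 200).cast("long"))

schema = ("id long, salary double, age double, bucket long, "
          "stat1_salary double, stat1_age double")
features = bucketed.groupBy("bucket").applyInPandas(add_features, schema=schema)
features.show()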
I have a table like this
| Id | Name | skills |
|---- |---------|-----------------------------|
| 1 | John | python, java, c++ |
| 2 | Mike | c++, javascript |
| 3 | Smith | java, ruby, vuejs, python |
I need to find all the unique values for the column skills.
Try this:
skills_list = []
for obj in MyModel.objects.all():
skills_list += obj.skills.split(", ")
unique_skills = list(set(skills_list))
If you are using PostgreSQL as your database, consider storing skills in an ArrayField. This will allow you to treat it as a Python list, and you wouldn't need to use split(", ") as in the above example.
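A rough sketch of that alternative (PostgreSQL only; the model and field definitions here are illustrative):
from django.contrib.postgres.fields import ArrayField
from django.db import models

class MyModel(models.Model):
    name = models.CharField(max_length=100)
    # Stored as a real list in PostgreSQL, so no string splitting is needed
    skills = ArrayField(models.CharField(max_length=50), default=list)

# e.g. in a view or the shell:
unique_skills = {
    skill
    for skills in MyModel.objects.values_list("skills", flat=True)
    for skill in skills
}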
So I'm working on this database structure and trying to figure out if this is the best method. I'm pulling records from a 3rd party site and storing them in a temporary table (tableA). I then check for duplicates in tableB and insert the non-duplicates into tableB from tableA. Is there any way to get the id assigned by tableB each time a record is inserted? Right now I'm finding the latest records inserted into tableB by date and then retrieving their IDs. Is there a more efficient way?
Is there a reason you're not using INSERT IGNORE? It seems to me that you could do away with the whole temporary-table process...
+----+------+
| id | name |
|----|------|
| 1 | adam |
| 2 | bob |
| 3 | carl |
+----+------+
If id has a unique constraint, then this:
INSERT IGNORE INTO tableName (id, name) VALUES (3, "carl"), (4, "dave");
...will result in:
+----+------+
| id | name |
|----|------|
| 1 | adam |
| 2 | bob |
| 3 | carl |
| 4 | dave |
+----+------+
...whereas if you'd just done an INSERT (without the IGNORE part), it would give you a unique key constraint error.
In terms of getting the ID back, just use:
SELECT LAST_INSERT_ID()
...after every INSERT IGNORE call you make.
It sounds like you want something called an after-insert trigger on table B. This is a piece of code that runs after inserting one or more rows into a table. Documentation is here.
The code is something like:
CREATE TRIGGER mytrigger AFTER INSERT ON B
FOR EACH ROW BEGIN
-- Do what you want with each row here
END;
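For example, a minimal sketch that records the ids assigned in B as rows are inserted (the log table and its columns are made-up names):
CREATE TABLE b_insert_log (b_id INT, inserted_at DATETIME);

DELIMITER //
CREATE TRIGGER mytrigger AFTER INSERT ON B
FOR EACH ROW
BEGIN
    -- NEW.id is the id that was just assigned to the row inserted into B
    INSERT INTO b_insert_log (b_id, inserted_at) VALUES (NEW.id, NOW());
END//
DELIMITER ;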