I have a database stored in a GridDB container. The main table in the database contains a column with outdated data. I would like to replace the whole column with a new column from another table (with the same total number of rows). Is there any way I could do that with Python?
For example, the whole process looks like this:
-- old table
column_0 | column_1 | old_column
------------+-----------+------------
1344 | Max | 5263.42525
1345 | John | 1465.41234
1346 | Alex | 8773.12344
1347 | Matthew | 5489.23522
1348 | Mark | 9874.31423
-- replacement
col_0 | updated
---------+------------
4242 | 3553.42824
4243 | 8942.98731
4244 | 1424.36742
4245 | 7642.75352
4246 | 2844.92468
-- output
column_0 | column_1 | old_column
------------+-----------+------------
1344 | Max | 3553.42824
1345 | John | 8942.98731
1346 | Alex | 1424.36742
1347 | Matthew | 7642.75352
1348 | Mark | 2844.92468
I have tried to replace the values one by one but I want something faster and kind of automated.
What you are trying to do here is what SQL calls a join. Assuming that rows of the two tables are matched by their rank (their position in sorted order), you can join two subqueries that each use a rank() window function and then pick the wanted columns from both result sets.
SELECT lhs.column_0, lhs.column_1, rhs.updated AS old_column
FROM
  (SELECT rank() OVER (ORDER BY column_0) r, * FROM old_table) lhs
  JOIN (SELECT rank() OVER (ORDER BY col_0) r, * FROM replacement) rhs
  ON lhs.r = rhs.r
Each subquery adds a rank counter to its rows (the ORDER BY inside OVER() determines the numbering). Rows are then matched on that rank value, producing a row that contains columns from both subqueries. The top level picks only the wanted columns, with an AS clause to give the replacement column its final name. The subqueries are aliased lhs and rhs rather than left and right, since those are reserved words.
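If you would rather do the replacement on the Python side, here is a minimal pandas sketch of the same positional match; it assumes both tables have already been read into DataFrames (for example via the GridDB Python client) and that sorting by column_0 / col_0 reproduces the intended row pairing. After the assignment, the frame can be written back to the container.

import pandas as pd

# Illustrative DataFrames; in practice these would be fetched from the GridDB containers.
old_table = pd.DataFrame({
    "column_0": [1344, 1345, 1346, 1347, 1348],
    "column_1": ["Max", "John", "Alex", "Matthew", "Mark"],
    "old_column": [5263.42525, 1465.41234, 8773.12344, 5489.23522, 9874.31423],
})
replacement = pd.DataFrame({
    "col_0": [4242, 4243, 4244, 4245, 4246],
    "updated": [3553.42824, 8942.98731, 1424.36742, 7642.75352, 2844.92468],
})

# Sort both tables so their row order matches, then overwrite the column positionally.
old_table = old_table.sort_values("column_0").reset_index(drop=True)
replacement = replacement.sort_values("col_0").reset_index(drop=True)
old_table["old_column"] = replacement["updated"]

print(old_table)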
Referring to here, which recommends a join to append a column from one table to another: I have indeed been using this method, but I have now run into its limits for a huge number of tables and rows.
Let's say I have a dataframe with M features: id, salary, age, etc.
+----+--------+------------+--------------+
| id | salary | age | zone | ....
+----+--------+------------+--------------+
I have performed certain operations on each feature to arrive at something like this:
+----+--------+------------+--------------+------------+--------------+--------------+--------------+
| id | salary | bin_salary | start_salary | end_salary | count_salary | stat1_salary | stat2_salary |
+----+--------+------------+--------------+------------+--------------+--------------+--------------+
Each feature is processed independently, with the same list of rows:
+-----+--------+--------------+--------------+--------------+
| id  | salary | stat1_salary | stat2_salary | stat3_salary |
+-----+--------+--------------+--------------+--------------+
| 301 | x1     | x            | x            | x            |
| 302 | null   | x            | x            | x            |
| 303 | x3     | x            | x            | x            |
+-----+--------+--------------+--------------+--------------+

+-----+------+-----------+-----------+
| id  | age  | stat1_age | stat2_age |
+-----+------+-----------+-----------+
| 301 | null | x         | x         |
| 302 | x2   | x         | x         |
| 303 | x3   | x         | x         |
+-----+------+-----------+-----------+
In the end, I would like to combine them into a final dataframe with all attributes of each feature, by joining on the unique ID across effectively hundreds to thousands of tables, one per feature. This final dataframe is my feature vector:
| id | salary | stat1_salary | stat2_salary | stat3_salary| age | stat1_age | stat2_age
I am hitting a memory limit that causes an Out Of Memory exception. Raising executor and driver memory seems to be only a temporary solution, and it is limited by the admin.
JOIN is expensive and resource-limited in PySpark, and I wonder if it's possible to pre-sort each feature table independently, keep that order, and just APPEND the entire columns next to one another instead of performing an expensive JOIN. I can make sure every feature table keeps the same list of rows. I hope to need no join and no lookup, because my set of IDs is the same.
How is this achievable? As far as I understand, even if I sort each table by ID, Spark distributes them for storage, and retrieval (when I query back to append) is not guaranteed to preserve that order.
There doesn't seem to be a Spark function to append a column from one DataFrame to another directly, other than join.
If you are starting from a single dataframe and trying to generate new features from each of its original columns, I would suggest using pandas_udf, where the new features for all the original columns can be appended inside the UDF.
This avoids using join at all.
To control the memory usage, choose the grouping column so that each group stays within the executor memory specification.
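A minimal sketch of that idea with the grouped-map flavour of pandas UDFs (applyInPandas in Spark 3.x); the zone grouping column, the column names and the feature calculations below are placeholders, and the sketch assumes each group fits in executor memory:

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(301, 1000.0, "X"), (302, None, "X"), (303, 3000.0, "Y")],
    "id long, salary double, zone string",
)

def add_features(pdf: pd.DataFrame) -> pd.DataFrame:
    # All per-column features are computed here in plain pandas, so no Spark join is needed.
    pdf["stat1_salary"] = pdf["salary"].rank()            # placeholder feature
    pdf["stat2_salary"] = pdf["salary"].fillna(0.0) ** 2  # placeholder feature
    return pdf

result = df.groupBy("zone").applyInPandas(
    add_features,
    schema="id long, salary double, zone string, stat1_salary double, stat2_salary double",
)
result.show()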
Theoretical database/coding query here, Python / Pandas dataframe related. I'm dealing with up to 50k rows in a table, so optimal solutions seem... erm, optimal. And I'm no coding expert either, so bear with me.
I have a table with unique child code/country pair rows, some with matching parent codes. E.g.:
Index | Parent | Child | Country | NewValue
0 | A | A-1 | X | Null
1 | A | A-1 | Y | Null
2 | A | A-2 | X | Null
3 | B | B-1 | X | Null
4 | B | B-2 | Y | Null
I need to update every Parent / Country pair with a calculated unique value (NewValue). What's the best approach to finding and updating each pair over every row?
So far I'm generating a separate list of unique Parent / Country pairs (to avoid calculating NewValue for every row needlessly; I just iterate through this list generating NewValue for each pair), e.g.:
Parent | Country
A | X
A | Y
B | X
B | Y
Now, is it better to simply do a lookup in the first table for every given parent/country match, get the row index for any matching rows, and then update via the row index?
Or, generate the second table in a way that includes any relevant indexes to start with, and use these to update the first table? Eg:
Parent | Country | Index(s)
A | X | 0,2
A | Y | 1
B | X | 3
B | Y | 4
If option 2, how? Because I'm using df.unique() to generate the second table, I only get one index per pair, not all the matching indexes (and I'm not sure how they'd show up if I did). And I'm not sure if either way is particularly good, but it's the best I've come up with in a day :o)
Thanks,
Christopher / pepsi_max2k
You might want to look at the merge function.
What you have to do in your case is
df_children.merge(df_parent, on=["Parent","Country"])
where df_children is your table with the [Index | Parent | Child | Country] columns and df_parent has [Parent | Country | NewValue].
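Putting it all together, here is a minimal sketch; the NewValue calculation below is just a placeholder for your real calculation:

import pandas as pd

# Data matching the question's layout.
df_children = pd.DataFrame({
    "Parent":  ["A", "A", "A", "B", "B"],
    "Child":   ["A-1", "A-1", "A-2", "B-1", "B-2"],
    "Country": ["X", "Y", "X", "X", "Y"],
})

# One row per unique Parent/Country pair, then compute NewValue once per pair.
df_parent = df_children[["Parent", "Country"]].drop_duplicates().copy()
df_parent["NewValue"] = df_parent["Parent"] + "-" + df_parent["Country"]  # placeholder calculation

# The merge copies NewValue onto every matching child row; no index bookkeeping needed.
result = df_children.merge(df_parent, on=["Parent", "Country"], how="left")
print(result)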
I need some guidance on how to approach this problem.
Consider the following 3 tables:
I inherited this dataset in Oracle. I need to find a way to merge these 3 tables into a single table. Why? Compliance and fraud checks. We're being audited, I have no choice but to comply.
The tables hold millions of records: 69 million, 7 million, and 4 million rows respectively. If I try to join and search using CONTAINS, INSTR or LIKE, I'd be creating Cartesian joins (I think) and it would be quite slow.
How do I approach this? The data is formatted with the square brackets as in the picture. I have considered pulling these tables into pandas and using python to get the answer.
The resulting table would look like this:
I think your first step would be to get the meals and teams tables in a usable format. Using these weird concatenation columns is never going to be easy or fast.
Doing something simple like this to split the ids onto different rows:
select trim(column_value) partner_id,
team
from teams,
xmltable(('"' || replace(teams.concat_identifier_partner, ']',']","') || '"'))
Gives an output like this:
+------------+------+
| PARTNER_ID | TEAM |
+------------+------+
| [012345] | Blue |
| [012346] | Blue |
| [012347] | Red |
+------------+------+
Then you could use these as inline views in your query or create a new temporary table for each of these. The temporary table would allow you to index them on the new ID column.
After your data is in the easier format, the joins become easy.
select people.identifier,
people.name,
formatted_teams.team,
formatted_meals.meal
from people
join (select trim(column_value) partner_id,
team
from teams,
xmltable(('"' || replace(teams.concat_identifier_partner, ']',']","') || '"')) ) formatted_teams
on people.identifier = formatted_teams.partner_id
left join (select trim(column_value) plate_id,
meal
from meals,
xmltable(('"' || replace(meals.concat_identifier_plate, ']',']","') || '"')) ) formatted_meals
on people.identifier = formatted_meals.plate_id
order by people.identifier
Output:
+------------+--------+------+--------+
| IDENTIFIER | NAME | TEAM | MEAL |
+------------+--------+------+--------+
| [012345] | George | Blue | Salmon |
| [012346] | Jenny | Blue | |
| [012347] | Alex | Red | Pike |
| [012347] | Alex | Red | Cod |
+------------+--------+------+--------+
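Since you mentioned pulling the tables into pandas, the same split-then-join can be sketched in Python as well; the frames below just mimic the example output, and the assumption is that the concatenated columns hold bracketed values like [012345][012346]. The meals table can be handled the same way as teams:

import pandas as pd

# Illustrative frames standing in for the Oracle tables.
people = pd.DataFrame({"identifier": ["[012345]", "[012346]", "[012347]"],
                       "name": ["George", "Jenny", "Alex"]})
teams = pd.DataFrame({"concat_identifier_partner": ["[012345][012346]", "[012347]"],
                      "team": ["Blue", "Red"]})

# Split each bracketed id onto its own row (the pandas equivalent of the xmltable trick).
teams_long = teams.assign(
    partner_id=teams["concat_identifier_partner"].str.findall(r"\[\d+\]")
).explode("partner_id")[["partner_id", "team"]]

result = people.merge(teams_long, left_on="identifier", right_on="partner_id", how="left")
print(result[["identifier", "name", "team"]])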
Is it possible to select a whole row from table_1 (without the autoincrement ID) and insert it into another table, table_2, which has the same relational schema as table_1 (the same columns)?
I could do that using Python, for example, but the table has too many rows to handle it that way in code.
So this is the example:
table_1:
id | name | age | sex | degree
1 | Pate | 98 | it | doc
2 | Ken | 112 | male| -
table_2:
id | name | age | sex | degree
SQLite3:
INSERT INTO table_2 (SELECT * FROM table_1 WHERE id=2);
RESULT:
table_2:
id | name | age | sex | degree
1 | Ken | 112 | male| -
EDIT:
If this is not possible, it could be done including id so the table_2 would look like:
id | name | age | sex | degree
2 | Ken | 112 | male| -
The INSERT statement indeed has a form that inserts each row returned by a SELECT. However, that SELECT is not a subquery, so you have to omit the parentheses around it, and when you're not inserting all columns, you have to specify which columns to use:
INSERT INTO table_2 (name, age, sex, degree)
SELECT name, age, sex, degree
FROM table_1
WHERE id = 2;
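If you want to drive this from Python, here is a minimal sketch with the standard sqlite3 module; the database file name is illustrative:

import sqlite3

# Run the same INSERT ... SELECT through Python, assuming both tables already exist.
conn = sqlite3.connect("example.db")
conn.execute(
    """
    INSERT INTO table_2 (name, age, sex, degree)
    SELECT name, age, sex, degree
    FROM table_1
    WHERE id = ?;
    """,
    (2,),
)
conn.commit()
conn.close()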
So I'm working on this database structure and trying to figure out if this is the best method. I'm pulling records from a 3rd-party site and storing them in a temporary table (tableA). I then check for duplicates in tableB and insert the non-duplicates into tableB from tableA. Is there any way to get the id assigned by tableB each time a record is inserted? Right now I'm finding the latest records inserted into tableB by date and then retrieving their IDs. Is there a more efficient way?
Is there a reason you're not using INSERT IGNORE? It seems to me that you could do away with the whole temporary-table process...
+----+------+
| id | name |
|----|------|
| 1 | adam |
| 2 | bob |
| 3 | carl |
+----+------+
If id has a unique constraint, then this:
INSERT IGNORE INTO tableName (id, name) VALUES (3, "carl"), (4, "dave");
...will result in:
+----+------+
| id | name |
|----|------|
| 1 | adam |
| 2 | bob |
| 3 | carl |
| 4 | dave |
+----+------+
...whereas if you'd just done an INSERT (without the IGNORE part), it would give you a unique key constraint error.
In terms of getting the ID back, just use:
SELECT LAST_INSERT_ID()
...after every INSERT IGNORE call you make.
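If you are doing the inserts from Python, here is a minimal sketch with mysql.connector, assuming tableB has an AUTO_INCREMENT id and a unique key on the deduplicated column; the connection parameters and names are illustrative:

import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="user", password="secret", database="mydb"
)
cur = conn.cursor()

# Duplicates are silently skipped thanks to IGNORE.
cur.execute("INSERT IGNORE INTO tableB (name) VALUES (%s)", ("dave",))
conn.commit()

# lastrowid carries the AUTO_INCREMENT id from this insert,
# the same value SELECT LAST_INSERT_ID() would return.
print(cur.lastrowid)

cur.close()
conn.close()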
It sounds like you want something called an after-insert trigger on table B. This is a piece of code that runs after inserting one or more rows into a table. Documentation is here.
The code is something like:
CREATE TRIGGER mytrigger AFTER INSERT ON B
FOR EACH ROW BEGIN
-- Do what you want with each row here
END;