I'm trying to run VACUUM REINDEX for some huge tables in Redshift. When I run one of those vacuums in SQLWorkbenchJ, it never finishes and returns a "connection reset by peer" error after about 2 hours. The same thing happens in Python when I run the vacuums using something like this:
conn_string = "postgresql+pg8000://%s:%s@%s:%d/%s" % (db_user, db_pass, host, port, schema)
conn = sqlalchemy.engine.create_engine(
    conn_string,
    execution_options={'autocommit': True},
    encoding='utf-8',
    connect_args={"keepalives": 1, "keepalives_idle": 60,
                  "keepalives_interval": 60},
    isolation_level="AUTOCOMMIT")
conn.execute(query)
Is there a way, using either Python or SQLWorkbenchJ, that I can run these queries? I expect them to take at least an hour each. Is this expected behavior?
Short Answer
You might need to add a mechanism to your Python script to retry when the reindexing fails, based on https://docs.aws.amazon.com/redshift/latest/dg/r_VACUUM_command.html:
If a VACUUM REINDEX operation terminates before it completes, the next VACUUM resumes the reindex operation before performing the full vacuum operation.
However...
A couple of things to note (I apologize if you already know this):
Tables in Redshift can have N sort keys (columns that the data are sorted by), and Redshift supports only 2 sort styles:
Compound: you are really sorting by the first sort column, then by the second, and so on.
Interleaved: the table is sorted on all sort columns with equal weight (https://en.wikipedia.org/wiki/Z-order_curve). Some people choose this style when they are not sure how the table will be used; however, it comes with a lot of issues of its own (more solid documentation here: https://aws.amazon.com/blogs/big-data/amazon-redshift-engineerings-advanced-table-design-playbook-compound-and-interleaved-sort-keys/, where compound sorting is generally favored).
So how does this answer the question?
If your table uses compound sorting or no sorting at all, VACUUM REINDEX is not necessary; it brings no value.
If your table uses interleaved sorting, you will first need to check whether you even need to re-index. Sample query:
SELECT tbl AS table_id,
(col + 1) AS column_num, -- Column in this view is zero indexed
interleaved_skew,
last_reindex
FROM svv_interleaved_columns
If the value of interleaved_skew is 1.0, you definitely don't need to REINDEX.
Bringing it all together
You could have your Python script run the query listed in https://docs.aws.amazon.com/redshift/latest/dg/r_SVV_INTERLEAVED_COLUMNS.html to find the tables that you need to re-index (maybe adding some business logic that works better for your situation, for example your own sort-skew threshold); see the sketch after this list.
REINDEX applies the most restrictive type of lock, so try to schedule the script during off hours if possible.
Challenge the need for interleaved sorting and favor compound.
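Here is a minimal sketch of such a script, assuming a SQLAlchemy/pg8000 connection like the one in the question; the skew threshold, retry count, and connection details are illustrative assumptions, not recommendations:

import time
import sqlalchemy

# Hypothetical connection details; reuse whatever engine setup you already have.
engine = sqlalchemy.create_engine(
    "postgresql+pg8000://user:password@host:5439/database",
    isolation_level="AUTOCOMMIT")  # VACUUM cannot run inside a transaction block

SKEW_THRESHOLD = 1.4  # assumed threshold, tune for your workload
MAX_ATTEMPTS = 3

find_skewed = """
    SELECT DISTINCT tbl AS table_id
    FROM svv_interleaved_columns
    WHERE interleaved_skew > %s
""" % SKEW_THRESHOLD

with engine.connect() as conn:
    table_ids = [row[0] for row in conn.execute(find_skewed)]

for table_id in table_ids:
    with engine.connect() as conn:
        # pg_class maps the table id from svv_interleaved_columns to its name.
        table_name = conn.execute(
            "SELECT relname FROM pg_class WHERE oid = %s" % table_id).scalar()

    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            with engine.connect() as conn:
                conn.execute('VACUUM REINDEX "%s"' % table_name)
            break  # success
        except Exception as exc:
            # If the connection drops, the next VACUUM resumes the reindex,
            # so retrying after a pause is safe per the docs quoted above.
            print("Attempt %d on %s failed: %s" % (attempt, table_name, exc))
            time.sleep(60)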
Related
I have 2 queries that will be run repeatedly to feed a report and some charts, so I need to make sure it is tight. The first query has 25 columns and will yield 25-50 rows from a massive table. My second query will return another 25 columns (a couple of matching columns) and 25 to 50 rows from another massive table.
The desired end result is a single document in which Query 1 (Problem) and Query 2 (Problem tasks) match on a common column (Problem ID), so that row 1 is the problem, rows 2-4 are its tasks, row 5 is the next problem, rows 6-9 are its tasks, etc. I realize I could do this manually by running the 2 queries and just combining them in Excel by hand, but I'm looking for an elegant process that could be reused in my absence without too much overhead.
I was exploring inserts, UNION ALL, and CROSS JOIN, but the 2 queries have different columns that contain different critical data elements to be returned. I was also exploring setting up a Python job to do this by importing the CSVs and interlacing the results, but I am an early data science student and not much past creating charts from imported CSVs.
Any suggestions on how I might attack this challenge? Thanks for the help.
Picture of desired end result.
You can do it with something like
INSERT INTO target_table (<columns...>)
SELECT <your first query>
UNION
SELECT <your second query>
And then to retrieve data
SELECT * from target_table
WHERE <...>
ORDER BY problem_id, task_id
Just ensure both queries return the same columns, i.e. the columns you want to populate in target_table, probably using fixed default values (e.g. the first query may return a default task_id by including NULL as task_id in the column list)
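For example, a minimal sketch of that approach run from Python with pandas, assuming a PostgreSQL-style database; the table names (problem, problem_task), column names, connection string, and 30-day filter are all made-up placeholders for your real schema:

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection; point it at your actual database.
engine = create_engine("postgresql+psycopg2://user:password@server/database")

query = """
    SELECT problem_id, CAST(NULL AS INTEGER) AS task_id,
           open_time AS opened, short_description
    FROM problem
    WHERE open_time >= CURRENT_DATE - 30
    UNION ALL
    SELECT problem_id, task_id,
           date_opened AS opened, short_description
    FROM problem_task
    WHERE date_opened >= CURRENT_DATE - 30
    ORDER BY problem_id, task_id NULLS FIRST  -- problem row first, then its tasks
"""

# One frame with each problem followed by its tasks, ready to drop into Excel.
report = pd.read_sql_query(query, con=engine)
report.to_excel("weekly_report.xlsx", index=False)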
Thanks for the feedback @gimix, I ended up aliasing the columns that I was able to match up from the 2 tables (open_time vs date_opened, etc.) so they all lined up, and selected '' for the null values I needed. I unioned the 2 SELECT statements as suggested, then I finally realized I can just insert my filtering queries twice as subqueries. It is now nice and quickly repeatable for pulling and dropping into Excel 2x per week. Thank you!
I'm trying to get a DataFrame from a PostgreSQL table using the following code:
import pandas
from sqlalchemy.engine import create_engine
engine = create_engine("postgresql+psycopg2://user:password@server/database")
table = pandas.read_sql_table(con=engine, table_name="table_name", schema="schema")
Suppose the database table's primary key goes from 1 to 100; the DataFrame's first column will go like 50 to 73, then 1 to 49, then 73 to 100. I've tried adding a chunksize value to see if that made a difference and got the same result.
AFAIK databases don't always return values in order by primary key. You can sort in pandas:
table.sort_values(by=['id'])
Logically, SQL tables have no order, and the same applies to queries unless it is explicitly defined using ORDER BY. Some DBMS, but not PostgreSQL1, may use a clustered index and store rows physically in order, but even that does not guarantee that a SELECT returns rows in that order without ORDER BY; parallel execution plans, for example, throw all expectations about query results matching physical order in the bin. Note that a DBMS can use indexes or other information to fetch rows in order without having to sort, so ordering by a primary key should not add too much overhead.
Either sort the data in Python as shown in the other answer, or use read_sql_query() instead and pass a query with a specified order:
table = pandas.read_sql_query(
    "SELECT * FROM schema.table_name ORDER BY some_column",
    con=engine)
1: PostgreSQL has a CLUSTER command that clusters the table based on an index, but it is a one-time operation.
I recently began working with database queries when I was asked to develop a program that reads the last 1 month of data from a Firebird database with almost 100M rows.
After stumbling a little bit, I finally managed to filter the database using Python (and, more specifically, the Pandas library), but the code takes more than 8 hours just to filter the data, so it becomes useless for performing the task at the required frequency.
The rest of the code runs really quickly, since I only need around the last 3000 rows of the dataset.
So far, my function responsible to execute the query is:
import time
import pandas as pd
import pyodbc

def read_query(access):
    start_time = time.time()
    conn = pyodbc.connect(access)
    df = pd.read_sql_query(r"SELECT * from TABLE where DAY >= DATEADD(MONTH,-1, CURRENT_TIMESTAMP(2)) AND DAY <= 'TODAY'", conn)
    return df
Or, isolating the query:
SELECT * from TABLE where DAY >= DATEADD(MONTH,-1, CURRENT_TIMESTAMP(2)) AND DAY <= 'TODAY'
Since I only need X rows from the bottom of the table (where X changes every day), I know I could optimize my code by reading only part of the table, starting from the last rows and iterating through them, without having to process the entire dataframe.
So my question is: how can this be done? And if it's not a good idea/approach, what else could I do to solve this issue?
I think chunksize is your way out; please check the documentation here:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_query.html
and also the examples posted here:
http://shichaoji.com/2016/10/11/python-iterators-loading-data-in-chunks/#Loading-data-in-chunks
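A minimal sketch of that approach, assuming the same pyodbc connection as in the question; the chunk size of 100,000 rows and the connection string are illustrative placeholders:

import pandas as pd
import pyodbc

access = "DSN=my_firebird_dsn;UID=user;PWD=password"  # hypothetical connection string
conn = pyodbc.connect(access)

query = ("SELECT * FROM TABLE "
         "WHERE DAY >= DATEADD(MONTH, -1, CURRENT_TIMESTAMP(2)) AND DAY <= 'TODAY'")

last_rows = pd.DataFrame()
# With chunksize, read_sql_query returns an iterator of DataFrames instead of
# loading everything at once, so memory stays bounded while streaming.
for chunk in pd.read_sql_query(query, conn, chunksize=100_000):
    # Keep only a rolling tail, since only the ~3000 most recent rows are needed.
    last_rows = pd.concat([last_rows, chunk]).tail(3000)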
Good luck!
I have fetched data into a MySQL cursor and now I want to sum up a column in the cursor.
Is there any BIF (built-in function) or anything else I can do to make it work?
db = cursor.execute("""SELECT api_user.id, api_user.udid, api_user.ps, api_user.deid, api_selldata.lid, api_selldata.sells
                       FROM api_user
                       INNER JOIN api_selldata ON api_user.udid=api_selldata.udid AND api_user.pc='com'""")
I suspect the answer is that you can't include aggregate functions such as SUM() in a query unless you can guarantee (usually by adding a GROUP BY clause) that the values of the non-aggregated columns are the same for all rows included in the SUM().
The aggregate functions effectively condense a column over many rows into a single value, which cannot be done for non-aggregated columns unless SQL knows that they are guaranteed to have the same value for all considered rows (which a GROUP BY will do, but this may not be what you want).
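As a minimal sketch, assuming you want the total sells per user and that the table/column names guessed from the query in the question are roughly right:

import pymysql  # any DB-API driver (MySQLdb, mysql.connector) works the same way

conn = pymysql.connect(host="localhost", user="user", password="password", db="mydb")
cursor = conn.cursor()

# SUM() aggregates per group; every non-aggregated selected column
# (here api_user.id and api_user.udid) must appear in GROUP BY.
cursor.execute("""
    SELECT api_user.id, api_user.udid, SUM(api_selldata.sells) AS total_sells
    FROM api_user
    INNER JOIN api_selldata
        ON api_user.udid = api_selldata.udid AND api_user.pc = 'com'
    GROUP BY api_user.id, api_user.udid
""")

for user_id, udid, total_sells in cursor.fetchall():
    print(user_id, udid, total_sells)

Alternatively, if you just want one grand total from rows you have already fetched, summing in Python works too, e.g. sum(row[-1] for row in cursor.fetchall()).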
Hey all,
I have two databases. One has 145,000 rows and approx. 12 columns; the other has around 40,000 rows and 5 columns. I am trying to compare them based on two columns' values. For example, if in CSV#1 column 1 says 100-199 and column 2 says Main St (meaning that this row covers the 100 block of Main Street), how would I go about comparing that with a similar two columns in CSV#2? I need to compare every row in CSV#1 to each single row in CSV#2. If there is a match, I need to append the 5 columns of each matching row to the end of the row of CSV#2; thus CSV#2's number of columns will grow significantly and it will have repeat entries, and it doesn't matter how the columns are ordered. Any advice on how to compare two columns with another two columns in a separate database and then iterate across all rows? I've been using Python and the csv module so far for the rest of the work, but this part of the problem has me stumped.
Thanks in advance
-John
A csv file is NOT a database. A csv file is just rows of text chunks; a proper database (like PostgreSQL or MySQL or SQL Server or SQLite or many others) gives you proper data types, table joins, indexes, row iteration, proper handling of multiple matches, and many other things which you really don't want to rewrite from scratch.
How is it supposed to know that Address("100-199")==Address("Main Street")? You will have to come up with some sort of knowledge-base which transforms each bit of text into a canonical address or address-range which you can then compare; see Where is a good Address Parser but be aware that it deals with singular addresses (not address ranges).
Edit:
Thanks to Sven; if you were using a real database, you could do something like
SELECT
User.firstname, User.lastname, User.account, Order.placed, Order.fulfilled
FROM
User
INNER JOIN Order ON
User.streetnumber=Order.streetnumber
AND User.streetname=Order.streetname
if streetnumber and streetname are exact matches; otherwise you still need to consider the second point above (canonicalizing the addresses).
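For example, a minimal sketch of that approach using SQLite from Python; the file names and column layouts (block, street, plus a few extra columns) are made up and would need to match your actual CSV headers:

import csv
import sqlite3

conn = sqlite3.connect(":memory:")  # or a file path if you want to keep the database
cur = conn.cursor()

# Hypothetical layouts; adjust the column lists to your real CSV headers.
cur.execute("CREATE TABLE csv1 (block TEXT, street TEXT, a3 TEXT, a4 TEXT, a5 TEXT)")
cur.execute("CREATE TABLE csv2 (block TEXT, street TEXT, b3 TEXT, b4 TEXT, b5 TEXT)")

def load(path, table, ncols):
    with open(path, newline="") as f:
        rows = list(csv.reader(f))[1:]  # skip the header row
    placeholders = ",".join("?" * ncols)
    cur.executemany("INSERT INTO %s VALUES (%s)" % (table, placeholders),
                    (row[:ncols] for row in rows))

load("csv1.csv", "csv1", 5)
load("csv2.csv", "csv2", 5)

# Every CSV#2 row paired with every CSV#1 row that matches on block + street;
# CSV#2 rows with multiple matches simply repeat, as described in the question.
cur.execute("""
    SELECT csv2.*, csv1.a3, csv1.a4, csv1.a5
    FROM csv2
    INNER JOIN csv1
        ON csv2.block = csv1.block AND csv2.street = csv1.street
""")
for row in cur.fetchall():
    print(row)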