How to query on a MySQL cursor? - Python

I have fetched data into a MySQL cursor and now I want to sum up a column in that result set.
Is there any built-in function (BIF) or anything else I can do to make it work?
db = cursor.execute("""
    SELECT api_user.id, api_user.udid, api_user.ps, api_user.deid,
           api_selldata.lid, api_selldata.sells
    FROM api_user
    INNER JOIN api_selldata ON api_user.udid = api_selldata.udid
    WHERE api_user.pc = 'com'
""")

I suspect the answer is that you can't include aggregate functions such as SUM() in a query unless you can guarantee (usually by adding a GROUP BY clause) that the values of the non-aggregated columns are the same for all rows included in the SUM().
The aggregate functions effectively condense a column over many rows into a single value, which cannot be done for non-aggregated columns unless SQL knows that they are guaranteed to have the same value for all considered rows (which a GROUP BY will do, but this may not be what you want).
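For illustration only, here is a sketch of how the query from the question might be rewritten with SUM() and a GROUP BY; the grouping column and alias are assumptions, and it reuses the question's cursor:
query = """
    SELECT api_user.udid, SUM(api_selldata.sells) AS total_sells
    FROM api_user
    INNER JOIN api_selldata ON api_user.udid = api_selldata.udid
    WHERE api_user.pc = 'com'
    GROUP BY api_user.udid
"""
cursor.execute(query)
for udid, total_sells in cursor.fetchall():
    print(udid, total_sells)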

Related

How can I wrap a list of columns into an array in my SELECT query?

I've got a query that I'm trying to form using SQLAlchemy against a postgres DB.
Here's my query:
select id, array_remove(ARRAY[a.value1, b.value2, etc.], null) FROM table t
JOIN a on t.id = a.id
JOIN b on t.id = b.id
This returns a result where the second column is an array of values drawn from different tables. The goal is to have those columns represented in a single column value, separated by commas.
In SQLAlchemy, I'm generating those tables on the fly, so I have a list of the Column objects themselves. When I'm building my query, I've got the joins and so on down, but how can I structure the code so that I pass in my list of columns and it produces the "ARRAY[column1, column2, etc.]" I expect to see in the SQL?
Here's where I'm at so far:
my_query.add_columns(func.array_remove(ARRAY(id_cols), null))
Neither the ARRAY type nor the array function (literal) appears to take a list of columns. I tried using func.cast to an array and that also didn't work, as it wouldn't take a list of columns. Using a string list of column names isn't ideal because these columns might conflict in name... I guess the fully qualified names might be okay, but those also seem difficult to get with SQLAlchemy.
You need to use the PostgreSQL ARRAY literal instead, passing the SQL NULL as null():
from sqlalchemy import func, null
from sqlalchemy.dialects.postgresql import array

my_query.add_columns(func.array_remove(array(id_cols), null()))
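A minimal, self-contained sketch of the idea, showing that array() accepts a plain Python list of column expressions and renders as an ARRAY[...] literal; the table and column names here are made up for illustration:
from sqlalchemy import Column, Integer, MetaData, Table, func, null
from sqlalchemy.dialects import postgresql
from sqlalchemy.dialects.postgresql import array

metadata = MetaData()
a = Table("a", metadata, Column("id", Integer), Column("value1", Integer))
b = Table("b", metadata, Column("id", Integer), Column("value2", Integer))

id_cols = [a.c.value1, b.c.value2]
expr = func.array_remove(array(id_cols), null())

# Compile against the PostgreSQL dialect to see the rendered SQL.
print(expr.compile(dialect=postgresql.dialect()))
# roughly: array_remove(ARRAY[a.value1, b.value2], NULL)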

Pandas function pandas.read_sql_table() returns a DataFrame with the values in the wrong order

I'm trying to get a DataFrame from a PostgreSQL table using the following code:
import pandas
from sqlalchemy.engine import create_engine
engine = create_engine("postgresql+psycopg2://user:password@server/database")
table = pandas.read_sql_table(con=engine, table_name="table_name", schema="schema")
Suppose the database table's primary key goes from 1 to 100; the DataFrame's first column will go something like 50 to 73, then 1 to 49, then 73 to 100. I've tried adding a chunksize value to see if that made a difference and got the same result.
AFAIK databases don't always return values in order by primary key. You can sort in pandas:
table.sort_values(by=['id'])
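Note that sort_values returns a new DataFrame rather than sorting in place, so assign the result back (assuming the primary key column is named id):
table = table.sort_values(by=['id']).reset_index(drop=True)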
Logically SQL tables have no order, and the same applies to queries, unless one is explicitly defined using ORDER BY. Some DBMSs (but not PostgreSQL [1]) may use a clustered index and store rows physically in order, but even that does not guarantee that a SELECT returns rows in that order without ORDER BY; parallel execution plans, for example, throw all expectations about query results matching physical order in the bin. Note that a DBMS can use, for example, indexes or other information to fetch rows in order without having to sort, so ordering by a primary key should not add too much overhead.
Either sort the data in Python as shown in the other answer, or use read_sql_query() instead and pass a query with a specified order:
table = pandas.read_sql_query(
    "SELECT * FROM schema.table_name ORDER BY some_column",
    con=engine)
[1]: PostgreSQL has a CLUSTER command that clusters the table based on an index, but it is a one-time operation.

shuffling values in sqlite database gives None

I tried to shuffle the values of a column in an SQLite database using SQLAlchemy, but with raw SQL queries.
I have a table mtr_cdr with a column idx. If I have
conn = engine.connect()
conn.execute('SELECT idx FROM mtr_cdr').fetchall()[:3]
I get the right values: [(3,),(3,),(3,)] in this case, which are the top values in the column (at this point, the idx column is ordered, and there are multiple rows that correspond to the same idx).
However, if I try:
conn.execute('SELECT idx FROM mtr_cdr ORDER BY RANDOM()').fetchall()[:3]
I get [(None,), (None,), (None,)].
I'm using SQLAlchemy version 1.2.9, if that helps.
Which approach would give me the correct result while still scaling to a large database? The full database will have around 100 million+ rows, and I've heard ORDER BY RANDOM() (if I can even get it to work) will be very slow...

Long vacuums in redshift

I'm trying to run VACUUM REINDEX for some huge tables in Redshift. When I run one of those vacuums in SQLWorkbenchJ, it never finishes and fails with a "connection reset by peer" error after about 2 hours. The same thing actually happens in Python when I run the vacuums using something like this:
conn_string = "postgresql+pg8000://%s:%s@%s:%d/%s" % (db_user, db_pass, host, port, schema)
conn = sqlalchemy.engine.create_engine(
    conn_string,
    execution_options={'autocommit': True},
    encoding='utf-8',
    connect_args={"keepalives": 1, "keepalives_idle": 60,
                  "keepalives_interval": 60},
    isolation_level="AUTOCOMMIT")
conn.execute(query)
Is there a way, using either Python or SQLWorkbenchJ, that I can run these queries? I expect them to take at least an hour each. Is this expected behavior?
Short Answer
You might need to add a mechanism to your Python script to retry when the reindexing fails (a sketch follows below), based on https://docs.aws.amazon.com/redshift/latest/dg/r_VACUUM_command.html:
If a VACUUM REINDEX operation terminates before it completes, the next VACUUM resumes the reindex operation before performing the full vacuum operation.
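As a rough sketch (not a definitive implementation), such a retry loop could look like the following, assuming the conn engine from the question; the table name, attempt count, and sleep interval are placeholders:
import time

MAX_ATTEMPTS = 5  # arbitrary retry budget

for attempt in range(1, MAX_ATTEMPTS + 1):
    try:
        # Placeholder table name; each retry resumes the interrupted reindex.
        conn.execute("VACUUM REINDEX my_big_table")
        break
    except Exception as exc:  # e.g. the "connection reset by peer" error
        print("VACUUM attempt %d failed: %s" % (attempt, exc))
        time.sleep(60)  # wait a bit before the next attempt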
However...
A couple of things to note (apologies if you already know this):
Tables in Redshift can have N sort keys (columns that the data are sorted by), and Redshift supports only 2 sorting styles:
Compound: you are really sorting based on the first sort column, then on the second, and so on.
Interleaved: the table will be sorted on all sort columns (https://en.wikipedia.org/wiki/Z-order_curve). Some people choose this style when they are not sure how the table will be used; however, it comes with a lot of issues of its own (more solid documentation here: https://aws.amazon.com/blogs/big-data/amazon-redshift-engineerings-advanced-table-design-playbook-compound-and-interleaved-sort-keys/, where compound sorting is generally favored).
So how does this answer the question?
If your table is using compound sorting or no sorting at all, VACUUM REINDEX is not necessary at all; it brings no value.
If your table is using interleaved sorting, you first need to check whether you even need to re-index. Sample query:
SELECT tbl AS table_id,
       (col + 1) AS column_num, -- Column in this view is zero indexed
       interleaved_skew,
       last_reindex
FROM svv_interleaved_columns
If the value of the skew is 1.0, you definitely don't need to REINDEX.
Bringing it all together
You could have your Python script run the query listed in https://docs.aws.amazon.com/redshift/latest/dg/r_SVV_INTERLEAVED_COLUMNS.html to find the tables that you need to re-index (maybe adding some business logic that works better for your situation, for example your own sort-skew threshold); see the sketch after this list.
REINDEX applies the worst type of lock, so try to schedule the script during off hours if possible.
Challenge the need for interleaved sorting and favor compound sorting.
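As an illustration only, here is a rough sketch of that first step, reusing the conn engine from the question; the 1.4 skew threshold is an arbitrary placeholder, not a recommendation:
# Sketch: list interleaved tables whose skew suggests a VACUUM REINDEX may be worthwhile.
SKEW_THRESHOLD = 1.4  # placeholder; tune for your cluster

rows = conn.execute(
    "SELECT tbl, col + 1 AS column_num, interleaved_skew, last_reindex "
    "FROM svv_interleaved_columns"
).fetchall()

tables_to_reindex = sorted({r[0] for r in rows if r[2] is not None and r[2] > SKEW_THRESHOLD})
print("Tables that may need VACUUM REINDEX:", tables_to_reindex)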

sqlite3 and cursor.description

When using the sqlite3 module in Python, all elements of cursor.description except the column names are set to None, so this tuple cannot be used to find the column types for a query result (unlike other DB-API compliant modules). Is the only way to get the types of the columns to use pragma table_info(table_name).fetchall() to get a description of the table, store it in memory, and then match the column names from cursor.description to that overall table description?
No, it's not the only way. Alternatively, you can also fetch one row, iterate over it, and inspect the individual column Python objects and types. Unless the value is None (in which case the SQL field is NULL), this should give you a fairly precise indication of what the database column type was.
sqlite3 only uses sqlite3_column_decltype and sqlite3_column_type in one place each, and neither is accessible to the Python application, so there is no "direct" way that you may have been looking for.
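A minimal sketch of that "fetch one row and inspect the Python types" approach; the table and columns are made-up placeholders:
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT, price REAL)")
conn.execute("INSERT INTO t VALUES (1, 'widget', 2.5)")

cur = conn.execute("SELECT * FROM t")
row = cur.fetchone()
col_names = [d[0] for d in cur.description]  # only the names are populated

# Map column name -> Python type of the value; None means the field was NULL,
# so the column type cannot be inferred from this row alone.
col_types = {name: (type(value) if value is not None else None)
             for name, value in zip(col_names, row)}
print(col_types)  # {'id': <class 'int'>, 'name': <class 'str'>, 'price': <class 'float'>}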
I haven't tried this in Python, but you could try something like
SELECT *
FROM sqlite_master
WHERE type = 'table';
which contains the DDL CREATE statement used to create each table. By parsing the DDL you can get the column type info, such as it is. Remember that SQLite is rather vague and unrestrictive when it comes to column data types.
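For completeness, a hedged sketch of reading that DDL from Python (the in-memory table is a placeholder); PRAGMA table_info is shown as well, since it gives the declared types without any DDL parsing:
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT, price REAL)")

# The sql column of sqlite_master holds the CREATE statement verbatim.
ddl = conn.execute(
    "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = 't'"
).fetchone()[0]
print(ddl)  # CREATE TABLE t (id INTEGER, name TEXT, price REAL)

# Alternative: PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk).
declared = {row[1]: row[2] for row in conn.execute("PRAGMA table_info(t)")}
print(declared)  # {'id': 'INTEGER', 'name': 'TEXT', 'price': 'REAL'}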
