I am trying to shuffle the values of a column in an SQLite database through SQLAlchemy, but with raw SQL queries.
I have a table mtr_cdr with a column idx. If I have
conn = engine.connect()
conn.execute('SELECT idx FROM mtr_cdr').fetchall()[:3]
I get the right values: [(3,),(3,),(3,)] in this case, which are the top values in the column (at this point, the idx column is ordered, and there are multiple rows that correspond to the same idx).
However, if I try:
conn.execute('SELECT idx FROM mtr_cdr ORDER BY RANDOM()').fetchall()[:3]
I get [(None,), (None,), (None,)] instead.
I'm using sqlalchemy version 1.2.9 if that helps.
Which approach gives the correct result but also scales to a large database? The full database will have 100 million+ rows, and I've heard that ORDER BY RANDOM() (if I can even get it to work) will be very slow...
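For concreteness, a minimal sketch of the kind of query I have in mind, assuming a file-based SQLite database (the file name is a placeholder) and wrapping the raw SQL in text(); pushing LIMIT into SQLite at least avoids fetching every row and slicing the list in Python:

from sqlalchemy import create_engine, text

engine = create_engine('sqlite:///mtr.db')  # placeholder path

with engine.connect() as conn:
    # LIMIT keeps the sampling inside SQLite instead of fetching
    # every row and slicing the resulting Python list.
    rows = conn.execute(
        text('SELECT idx FROM mtr_cdr ORDER BY RANDOM() LIMIT 3')
    ).fetchall()

print(rows)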
I'm trying to get a DataFrame from a PostgreSQL table using the following code:
import pandas
from sqlalchemy.engine import create_engine
engine = create_engine("postgresql+psycopg2://user:password@server/database")
table = pandas.read_sql_table(con=engine, table_name="table_name", schema="schema")
Suppose the table's primary key goes from 1 to 100; the DataFrame's first column will then go something like 50 to 73, then 1 to 49, then 73 to 100. I've tried adding a chunksize value to see if that made a difference and got the same result.
AFAIK databases don't always return values in order by primary key. You can sort in pandas:
table = table.sort_values(by=['id'])
Logically, SQL tables have no order, and the same applies to queries unless one is explicitly defined using ORDER BY. Some DBMS (but not PostgreSQL [1]) may use a clustered index and store rows physically in order, but even that does not guarantee that a SELECT returns rows in that order without ORDER BY; parallel execution plans, for example, throw any expectation that query results match physical order out the window. Note that a DBMS can use indexes or other information to fetch rows in order without having to sort, so ordering by a primary key should not add much overhead.
Either sort the data in Python as shown in the other answer, or use read_sql_query() instead and pass a query with a specified order:
table = pandas.read_sql_query(
    "SELECT * FROM schema.table_name ORDER BY some_column",
    con=engine)
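As a side note, read_sql_query() can also make the ordering column the index of the DataFrame in the same call via index_col; a small sketch, assuming the key column is literally named id:

table = pandas.read_sql_query(
    "SELECT * FROM schema.table_name ORDER BY id",
    con=engine,
    index_col="id")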
[1]: PostgreSQL has a CLUSTER command that clusters the table based on an index, but it is a one-time operation.
I want to read data from a table in my Oracle database and fetch it into a DataFrame in Python.
The table has 22 million records, and using fetchall() takes a long time without returning any result.
(The query itself runs in Oracle in 1 second.)
I have tried slicing the data with the code below, but it is still not efficient.
import cx_Oracle
import pandas as pd
from pandas import DataFrame

connect_serv = cx_Oracle.connect(user='', password='', dsn='')
cur = connect_serv.cursor()

table_row_count = 22242387
batch_size = 100000

sql = """select t.* from (select a.*, ROW_NUMBER() OVER (ORDER BY column1) as row_num from table1 a) t where t.row_num between :LOWER_BOUND and :UPPER_BOUND"""

data = []
for lower_bound in range(0, table_row_count, batch_size):
    cur.execute(sql, {'LOWER_BOUND': lower_bound,
                      'UPPER_BOUND': lower_bound + batch_size - 1})
    for row in cur.fetchall():
        data.append(row)
I would like to know the proper solution for fetching this amount of data into Python in a reasonable time.
It's not the query that is slow, it's the stacking of the data with data.append(row).
Try using
data.extend(cur.fetchall())
for starters. It avoids the repeated single-row appends and instead adds the entire batch of rows returned by fetchall() in one call.
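Dropped into the batching loop from the question, that would look roughly like this (same placeholder bind variables as before):

data = []
for lower_bound in range(0, table_row_count, batch_size):
    cur.execute(sql, {'LOWER_BOUND': lower_bound,
                      'UPPER_BOUND': lower_bound + batch_size - 1})
    data.extend(cur.fetchall())  # append the whole batch in one call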
You will also have to tune the arraysize and prefetchrows parameters. I was having the same issue, and increasing arraysize resolved it. Choose the values based on the memory you have available.
Link:
https://cx-oracle.readthedocs.io/en/latest/user_guide/tuning.html?highlight=arraysize#choosing-values-for-arraysize-and-prefetchrows
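For reference, a minimal sketch of that tuning with cx_Oracle (connection details are placeholders; pick sizes that fit the memory you have):

import cx_Oracle

connection = cx_Oracle.connect(user='user', password='password', dsn='dsn')
cursor = connection.cursor()

# Fetch rows from Oracle in large batches instead of the defaults.
cursor.prefetchrows = 100000  # rows prefetched with the initial execute round trip
cursor.arraysize = 100000     # rows fetched per subsequent round trip

cursor.execute("select * from table1")
data = cursor.fetchall()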
I'm using SQLite in Python. I'm pretty new to SQL. My table has two columns, start and end, which represent an interval. I have another "input" list of intervals (represented as a pandas DataFrame), and I'd like to find all the overlaps between the input and the database.
-- an overlap between two intervals can be tested with two comparisons:
SELECT * FROM db WHERE
    db.start <= input.end AND db.end >= input.start
My issue is that the above only queries for overlaps with a single input interval; I'm not sure how to write a query for many intervals at once. I'm also unsure how to write this effectively in Python. From the sqlite3 docs:
t = ('RHAT',)
c.execute('SELECT * FROM stocks WHERE symbol=?', t)
print(c.fetchone())
This seems difficult because I need to pass in a range for my expression, or a list of ranges, and so a single ? probably won't cut it, right?
I'd appreciate either sql or python+sql or suggestions for how to do this entirely differently. Thanks!
Putting multiple interval values into a single query becomes cumbersome quickly.
Create a (temporary) table input for the intervals, and then search for any match in that table:
SELECT *
FROM db
WHERE EXISTS (SELECT *
FROM input
WHERE db.start <= input.end
AND db.end >= input.start);
It's simpler to write this as a join, but then you get multiple output rows if multiple inputs match (OTOH, this might actually be what you want):
SELECT db.*
FROM db
JOIN input ON db.start <= input.end
AND db.end >= input.start;
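To connect this back to Python, a rough sketch using the sqlite3 module, assuming the intervals sit in a pandas DataFrame with start and end columns and the table to search is called db as above (all names are illustrative):

import sqlite3
import pandas as pd

conn = sqlite3.connect('mydata.db')  # placeholder path

# Hypothetical input intervals; in practice this is the existing DataFrame.
intervals = pd.DataFrame({'start': [10, 40], 'end': [20, 55]})

# Load the intervals into a temporary table, then run the EXISTS query.
conn.execute('CREATE TEMP TABLE input (start INTEGER, end INTEGER)')
conn.executemany('INSERT INTO input (start, end) VALUES (?, ?)',
                 intervals[['start', 'end']].itertuples(index=False, name=None))

overlaps = conn.execute("""
    SELECT * FROM db
    WHERE EXISTS (SELECT *
                  FROM input
                  WHERE db.start <= input.end
                    AND db.end >= input.start)
""").fetchall()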
I have a very large table with 250,000+ rows, many containing a large text block in one of the columns. Right now it's 2.7GB and expected to grow at least tenfold. I need to perform python specific operations on every row of the table, but only need to access one row at a time.
Right now my code looks something like this:
c.execute('SELECT * FROM big_table')
table = c.fetchall()
for row in table:
    do_stuff_with_row
This worked fine when the table was smaller, but the table is now larger than my available RAM and Python hangs when I try to run it. Is there a better (more RAM-efficient) way to iterate row by row over the entire table?
cursor.fetchall() fetches all results into a list first.
Instead, you can iterate over the cursor itself:
c.execute('SELECT * FROM big_table')
for row in c:
    do_stuff_with_row
This produces rows as needed, rather than loading them all first.
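If you want something between one row at a time and everything at once, fetchmany() gives you a bounded batch; a small sketch (the batch size is arbitrary, and do_stuff_with_row stands in for your per-row work):

c.execute('SELECT * FROM big_table')
while True:
    rows = c.fetchmany(1000)  # pull at most 1000 rows into memory
    if not rows:
        break
    for row in rows:
        do_stuff_with_row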
I have fetched data into a MySQL cursor, and now I want to sum up one of its columns.
Is there a built-in function for this, or anything else I can do to make it work?
db = cursor.execute("""
    SELECT api_user.id, api_user.udid, api_user.ps, api_user.deid,
           api_selldata.lid, api_selldata.sells
    FROM api_user
    INNER JOIN api_selldata ON api_user.udid = api_selldata.udid
                            AND api_user.pc = 'com'
""")
I suspect the answer is that you can't include aggregate functions such as SUM() in a query unless you can guarantee (usually by adding a GROUP BY clause) that the values of the non-aggregated columns are the same for all rows included in the SUM().
The aggregate functions effectively condense a column over many rows into a single value, which cannot be done for non-aggregated columns unless SQL knows that they are guaranteed to have the same value for all considered rows (which a GROUP BY will do, but this may not be what you want).
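As an illustration of the GROUP BY route, the sum can be pushed into the query itself; the column names are borrowed from the question, and grouping by udid is only an assumption about the granularity you want:

cursor.execute("""
    SELECT api_user.udid, SUM(api_selldata.sells) AS total_sells
    FROM api_user
    INNER JOIN api_selldata ON api_user.udid = api_selldata.udid
    WHERE api_user.pc = 'com'
    GROUP BY api_user.udid
""")
totals = cursor.fetchall()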