I have a list of database names like this: [client_db1, clientdb2, ...].
There are 300 databases in total.
Each database has a collection with the same name, secrets.
I want to fetch records from each collection; I will get 10-15 records (not more than this) from each one.
In my code base, we iterate through each database and then build a DataFrame:
combined_auth_df = pd.DataFrame()
for client_db in all_client_dbname_data:
    auth_df = get_dataframe_from_collection(client_db)
    combined_auth_df = combined_auth_df.append(auth_df, ignore_index=True)
This process is taking a lot of time.
Is there a better way to do this? Also, I am curious whether we can do the same with multithreading/multiprocessing.
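A common speed-up is to stop growing the DataFrame with append inside the loop and to fetch the collections concurrently, since the work is mostly I/O bound. A minimal sketch, assuming get_dataframe_from_collection is safe to call from multiple threads (e.g. it opens its own client per call); the worker count is a guess:

from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# Fetch each database's frame in a worker thread, then concatenate once at
# the end instead of appending inside the loop (append copies every time).
with ThreadPoolExecutor(max_workers=16) as pool:
    frames = list(pool.map(get_dataframe_from_collection, all_client_dbname_data))

combined_auth_df = pd.concat(frames, ignore_index=True)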
I have the following code that iterates through the rows of certain tables in AWS. It grabs the first 50k rows and keeps going as long as there are 50k more rows to grab, and it works extremely quickly because I'm usually only getting the last 2 days' worth of data.
top = 50000
i = 0
days = 2
df = pd.DataFrame()
result = pd.DataFrame()
curs = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
while (i == 0) or (len(df) == top):
    start_time = (dt.datetime.now() - timedelta(days=days)).strftime("%Y-%m-%d %H:%M:%S")
    sql = f"SELECT * FROM {str.upper(table)} WHERE INSERTED_AT >= '{start_time}' OR UPDATED_AT >= '{start_time}' LIMIT {top} OFFSET {i}"
    curs.execute(sql)
    data = curs.fetchall()
    df = pd.DataFrame([row.copy() for row in data])  # copy each DictRow into the frame
    result = result.append(df, ignore_index=True)
    # load result to Snowflake here
    i += top
The trouble is I have a very large table that is about 7 million rows long and growing quickly. I found that if I backload all its data (days=1000), I end up missing data, probably because what each iteration covered (0-50k, 50k-100k, etc.) has shifted as the table loaded more rows while the while loop was running.
What is a better way to load data into Snowflake that will avoid missing-data issues? Do I have to use parallelization to get all these pieces of the table at once? Even if top=3mil I still find I'm missing large amounts of data, likely due to the lag while I load as the actual table keeps growing. Is there a standardized block of code that works well for large tables?
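One way to avoid the shifting OFFSET windows is keyset pagination: order by a stable, monotonically increasing column and page on its last seen value instead of an offset. A minimal sketch, assuming such a column exists (called ID here, which is an assumption, not something from the table above):

last_id = 0
while True:
    sql = (f"SELECT * FROM {str.upper(table)} "
           f"WHERE ID > {last_id} ORDER BY ID LIMIT {top}")
    curs.execute(sql)
    data = curs.fetchall()
    if not data:
        break
    df = pd.DataFrame([row.copy() for row in data])
    # load df to Snowflake here, one stable chunk at a time
    last_id = max(row["id"] for row in data)  # DictCursor rows allow lookup by (lowercased) column name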
I would skip the Python and favor Redshift's UNLOAD command.
UNLOAD lets you dump the contents of a Redshift table into an S3 bucket:
https://community.snowflake.com/s/article/How-To-Migrate-Data-from-Amazon-Redshift-into-Snowflake
unload ('select * from emp where date = GETDATE()')
to 's3://mybucket/mypath/'
credentials 'aws_access_key_id=XXX;aws_secret_access_key=XXX'
delimiter '\001'
null '\\N'
escape
[allowoverwrite]
[gzip];
You could set up a stored procedure and have a schedule that kicks it off once per day (I use Astronomer/Airflow for this); a rough sketch of that scheduling is below.
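A rough sketch of the scheduling side, assuming Airflow 2.x with the Snowflake provider installed; the DAG id, connection id and SQL are placeholders rather than details from this answer:

from datetime import datetime

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG(
    dag_id="daily_unload_and_load",        # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Kick off the daily load on the Snowflake side, e.g. a stored procedure
    # or a COPY INTO from the stage that points at the unloaded S3 files.
    run_daily_load = SnowflakeOperator(
        task_id="run_daily_load",
        snowflake_conn_id="snowflake_default",   # assumed connection id
        sql="CALL daily_load_proc()",            # placeholder procedure
    )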
From there you can build an external table on top of the bucket:
https://docs.snowflake.com/en/user-guide/data-load-s3-config-storage-integration.html
I followed the Cosmos DB example using the SQL API, but getting the data is quite slow. I'm trying to get data for one week (around 1M records). Sample code below.
client = cosmos_client.CosmosClient(HOST, {'masterKey': KEY})
database = client.get_database_client(DB_ID)
container = database.get_container_client(COLLECTION_ID)
query = """
SELECT some columns
FROM c
WHERE columna = 'a'
and columnb >= '100'
"""
result = list(container.query_items(
query=query, enable_cross_partition_query=True))
My question is, is there any other way to query data faster? Does putting the query result in list make it slow? What am I doing wrong here?
There are a couple of things you could do.
Model your data such that you don't have to do a cross-partition query. These will always take more time because your query needs to touch more partitions to find the data. You can learn more here: Model and partition data in Cosmos DB.
You can make this even faster, when you only need a single item, by using a point read (read_item) instead of a query.
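A minimal sketch of both suggestions with the azure-cosmos Python SDK; the partition key, item id and filter below are guesses, not values from the question:

# Scope the query to one partition so it does not fan out (this assumes
# 'columna' is the partition key, which is only a guess).
items = container.query_items(
    query="SELECT * FROM c WHERE c.columnb >= '100'",
    partition_key="a",
)

# Point read: the fastest way to fetch a single item when you know both its
# id and its partition key value.
item = container.read_item(item="some-item-id", partition_key="a")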
I have the following queries:
files = DataAvailability.objects.all()
type1 = files.filter(type1=True)
type2 = files.filter(type2=True)
# People information
people = Person.objects.all()
users = people.filter(is_subject=True)
# count information (this is taking a long time to query them all)
type1_users = type1.filter(person__in=users).count()
type2_users = type2.filter(person__in=users).count()
total_users = files.filter(person__in=users).count()
# another way
total_users2 = files.filter(person__in=users)
type1_users2 = total_users2.filter(type1=True).count()
type2_users2 = total_users2.filter(type2=True).count()
total_count = total_users2.count()
I thought about creating a query with .values() and putting the results into a set().
After that is done, execute some set operations (like difference) on them.
Is this the only way to improve the query time?
You can always fall back to raw SQL: https://docs.djangoproject.com/en/2.0/topics/db/sql/#performing-raw-queries
Example:
# Don't do this, it's insecure (string formatting opens you up to SQL injection)
YourModel.objects.raw(f"select id from {YourModel._meta.db_table}")
# Do it like this to avoid SQL injection issues
YourModel.objects.raw("select id from app_model_name")
The name of the table can be obtained from YourModel._meta.db_table, and you can get the SQL of a queryset like this:
type1_users = type1.filter(person__in=users)
str(type1_users.query)
This way you can join this query with another one.
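As an alternative that stays inside the ORM, the three counts can be collapsed into a single query with conditional Count aggregates (available since Django 2.0); a sketch, assuming the field names from the question:

from django.db.models import Count, Q

# One database round trip for all three counts.
counts = DataAvailability.objects.filter(person__in=users).aggregate(
    total=Count('id'),
    type1=Count('id', filter=Q(type1=True)),
    type2=Count('id', filter=Q(type2=True)),
)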
I don't have to make those queries very often (once a day at most), so I'm running a cron job which exports the data to a file (you could also create a table in your database for auditing purposes, for example). I then read the file and use the data from there. It's working well/fast.
I have implemented a Python script in order to divide millions of documents (generated by a .NET web application and all stored in a single directory) into subfolders with this scheme: year/month/batch, as all the tasks these documents come from were originally divided into batches.
My Python script performs queries against SQL Server 2014, which contains all the data it needs for each document, in particular the month and year it was created in. Then it uses the shutil module to move the PDF. So, I first perform a query to get the list of batches for a given month and year:
queryBatches = '''SELECT DISTINCT IDBATCH
FROM [DBNAME].[dbo].[WORKS]
WHERE YEAR(DATETIMEWORK)={} AND MONTH(DATETIMEWORK)={}'''.format(year, month)
Then I perform:
for batch in batches:
    query = '''SELECT IDWORK, IDBATCH, NAMEDOCUMENT
               FROM [DBNAME].[dbo].[WORKS]
               WHERE NAMEDOCUMENT IS NOT NULL and
                     NAMEDOCUMENT not like '/%/%/%/%.pdf' and
                     YEAR(DATETIMEWORK)={} and
                     MONTH(DATETIMEWORK)={} and
                     IDBATCH={}'''.format(year, month, batch[0])
whose records are collected through a cursor, as described in the pymssql documentation. Then I go on with:
IDWorksUpdate = []
row = cursor.fetchone()
while row:
    if moveDocument(...):
        IDWorksUpdate.append(row[0])
    row = cursor.fetchone()
Finally, when the loop has ended, IDWorksUpdate contains all the PKs of the WORKS rows whose documents were correctly moved into a subfolder. So I close the cursor and the connection and instantiate new ones.
In the end I perform:
subquery = '('+', '.join(str(x) for x in IDWorksUpdate)+')'
query = '''UPDATE [DBNAME].[dbo].[WORKS] SET NAMEDOCUMENT = \'/{}/{}/{}/\'+NAMEDOCUMENT WHERE IDWORK IN {}'''.format(year, month, idbatch, subquery)
newConn = pymssql.connect(server='localhost', database='DBNAME')
newCursor = newConn.cursor()
try:
    newCursor.execute(query)
    newConn.commit()
except:
    newConn.rollback()
    log.write('Error on updating documents names in database of works {}/{} of batch {}'.format(year, month, idbatch))
finally:
    newCursor.close()
    del newCursor
    newConn.close()
This morning I saw that, for a couple of batches only, that update query failed to execute on the database, even though the documents were correctly moved into the subdirectories.
Those batches had more than 55000 documents to be moved, so maybe IDWorksUpdate grew too large and produced a badly formed final update query? I thought 55000 was not such a big list of integers. The problem is that with pymssql we cannot have more than one connection/cursor at a time to the same database, so I cannot update each record as its file is moved. That is why I collect the PKs of the works whose documents were correctly moved and finally update them with a new connection/cursor. What could have happened? Am I doing it wrong?
UPDATE
I've just written a simple script to reproduce the query which is going to be executed to update the records, and this is the error I get from SQL Server:
The query processor ran out of internal resources and could not produce a query plan. This is a rare event and only expected for extremely complex queries or queries that reference a very large number of tables or partitions. Please simplify the query. If you believe you have received this message in error, contact Customer Support Services for more information.
This is the query:
UPDATE [DBNAME].[dbo].[WORKS] SET NAMEDOCUMENT = '/2016/12/1484/'+NAMEDOCUMENT WHERE IDWORK IN (list of 55157 PKs)
The fact is that the table is very big (about 14 million records). But I need that list of PKs because only the tasks whose documents have been correctly processed and moved can be updated. I cannot simply run:
UPDATE [DBNAME].[dbo].[WORKS] SET NAMEDOCUMENT = '/2016/12/1484/'+NAMEDOCUMENT WHERE YEAR(DATETIMEWORK)=2016 and
MONTH(DATETIMEWORK)=12 and IDBATCH=1484
This is because our server was attacked by a cryptolocker, so I must process and move only the documents that still exist, while waiting for the others to be recovered.
Should I split that list into sublists? How?
UPDATE 2
It seems the following could be a solution: I split the list of PKs into chunks of 10000 (a purely experimental number) and then execute as many queries as there are chunks, each with one chunk as its IN list.
def updateDB(listID, y, m, b, log):
    newConn = pymssql.connect(server='localhost', database='DBNAME')
    newCursor = newConn.cursor()
    if len(listID) <= 10000:
        subquery = '('+', '.join(str(x) for x in listID)+')'
        query = '''UPDATE [DBNAME].[dbo].[WORKS] SET NAMEDOCUMENT = \'/{}/{}/{}/\'+NAMEDOCUMENT WHERE IDWORK IN {}'''.format(y, m, b, subquery)
        try:
            newCursor.execute(query)
            newConn.commit()
        except:
            newConn.rollback()
            log.write('...')
            log.write('\n\n')
        finally:
            newCursor.close()
            del newCursor
            newConn.close()
    else:
        chunksPK = [listID[i:i + 10000] for i in xrange(0, len(listID), 10000)]
        for sublistPK in chunksPK:
            subquery = '('+', '.join(str(x) for x in sublistPK)+')'
            query = '''UPDATE [DBNAME].[dbo].[WORKS] SET NAMEDOCUMENT = \'/{}/{}/{}/\'+NAMEDOCUMENT WHERE IDWORK IN {}'''.format(y, m, b, subquery)
            try:
                newCursor.execute(query)
                newConn.commit()
            except:
                newConn.rollback()
                log.write('Could not execute partial {}'.format(query))
                log.write('\n\n')
        newCursor.close()
        del newCursor
        newConn.close()
Could this be a good/secure solution?
As stated in the MSDN document
IN (Transact-SQL)
Explicitly including an extremely large number of values (many thousands of values separated by commas) within the parentheses, in an IN clause can consume resources and return errors 8623 or 8632. To work around this problem, store the items in the IN list in a table, and use a SELECT subquery within an IN clause.
(The error message you cited was error 8623.)
Putting the IN list values into a temporary table and then using
... WHERE IDWORK IN (SELECT keyValue FROM #inListTable)
strikes me as being more straightforward than the "chunking" method you described.
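A minimal sketch of that approach with pymssql; the temp table and its keyValue column follow the snippet above, while year, month, idbatch and IDWorksUpdate are the variables from the question:

import pymssql

conn = pymssql.connect(server='localhost', database='DBNAME')
cursor = conn.cursor()

# Session-scoped temp table holding the PKs of the successfully moved documents.
cursor.execute('CREATE TABLE #inListTable (keyValue INT PRIMARY KEY)')
cursor.executemany('INSERT INTO #inListTable (keyValue) VALUES (%d)',
                   [(idwork,) for idwork in IDWorksUpdate])

# A single UPDATE driven by the temp table instead of a 55000-value IN list.
cursor.execute('''UPDATE [DBNAME].[dbo].[WORKS]
                  SET NAMEDOCUMENT = '/{}/{}/{}/' + NAMEDOCUMENT
                  WHERE IDWORK IN (SELECT keyValue FROM #inListTable)'''.format(year, month, idbatch))
conn.commit()
conn.close()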