Faster Cosmos DB query - Python

I followed the Cosmos DB SQL API example, but getting the data is quite slow. I'm trying to get data for one week (around 1M records). Sample code below:
import azure.cosmos.cosmos_client as cosmos_client

client = cosmos_client.CosmosClient(HOST, {'masterKey': KEY})
database = client.get_database_client(DB_ID)
container = database.get_container_client(COLLECTION_ID)

query = """
SELECT some columns
FROM c
WHERE columna = 'a'
  AND columnb >= '100'
"""

result = list(container.query_items(
    query=query, enable_cross_partition_query=True))
My question is: is there any other way to query the data faster? Does putting the query result into a list make it slow? What am I doing wrong here?

There are a couple of things you could do.
Model your data so that you don't have to do a cross-partition query. Cross-partition queries will always take more time because the query has to touch more partitions to find the data. You can learn more here: Model and partition data in Cosmos DB.
When you only need a single item, you can go even faster by using a point read (read_item) instead of a query, as sketched below.
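A minimal sketch of a point read using the same container object as the question (the id and partition key values here are hypothetical placeholders):

# Point read: fetch one item directly by id and partition key, bypassing the query engine.
# 'item_id' and 'partition_key_value' are placeholders for your own values.
item = container.read_item(item=item_id, partition_key=partition_key_value)

A point read of a 1 KB item costs roughly 1 RU, which is generally cheaper and faster than an equivalent query.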

Related

PySpark: Get the last/latest Hudi commit

I am doing an incremental query with spark-hudi every hour and saving each query's begin and end time in a DB (say MySQL) every time. For the next incremental query I use the end time of the previous query (fetched from MySQL) as the begin time.
The incremental query should look like this:
hudi_incremental_read_options = {
'hoodie.datasource.query.type': 'incremental',
'hoodie.datasource.read.begin.instanttime': hudi_start_commit,
'hoodie.datasource.read.end.instanttime': hudi_end_commit
}
but I am not sure how to find hudi_end_commit in PySpark (Python). In Java I can do this with the helper class HoodieDataSourceHelpers:
String hudi_end_commit = HoodieDataSourceHelpers.latestCommit(FileSystem.get(javaSparkContext.hadoopConfiguration()),l1BasePath);
but I am unable to find a way to do the same in Python.
As a workaround I found a solution, but it is not feasible for a large dataset.
spark_session.read.format("hudi").load(l1_base_path).createOrReplaceTempView("hudi_trips_snapshot")
commits = list(map(lambda row: row[0],
                   spark_session.sql("select distinct(_hoodie_commit_time) as commitTime from "
                                     "hudi_trips_snapshot order by commitTime desc").limit(1).collect()))
But when the data size is too large, this loads the whole dataset just to get the Hudi commits, which takes more time than reading the actual data itself.
Is there an easy way to find the latest/last Hudi commit?
Try this (worked for me in the pyspark shell):
hudi_end_commit = spark._sc._gateway.jvm.org.apache.hudi.HoodieDataSourceHelpers.latestCommit(
    spark._sc._gateway.jvm.org.apache.hadoop.fs.FileSystem.get(spark._sc._jsc.hadoopConfiguration()),
    "/path/to/hudi/table"
)
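For completeness, here is a sketch of how that commit can feed the incremental read options from the question (assuming the same SparkSession and table path):

# Use the latest commit as the end instant of the incremental read.
hudi_incremental_read_options = {
    'hoodie.datasource.query.type': 'incremental',
    'hoodie.datasource.read.begin.instanttime': hudi_start_commit,
    'hoodie.datasource.read.end.instanttime': hudi_end_commit
}
incremental_df = (spark.read.format("hudi")
                  .options(**hudi_incremental_read_options)
                  .load("/path/to/hudi/table"))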

SQLAlchemy `.fetchmany()` vs `.limit()`

Assuming query is some already defined query. As far as I can tell, connection.execute(query).fetchmany(n) and connection.execute(query.limit(n)).fetchall() apparently return the same result set. I'm wondering if one of them is more idiomatic or, more importantly, more performant?
Example usage would be:
query = (select([census.columns.state,
                 (census.columns.pop2008 - census.columns.pop2000).label("pop_change")])
         .group_by(census.columns.state)
         .order_by(desc("pop_change")))
results_1 = connection.execute(query.limit(5)).fetchall()
results_2 = connection.execute(query).fetchmany(5)  # results_2 == results_1
limit will be part of the SQL query sent to the database server.
With fetchmany the query is executed without any limit, but the client (Python code) requests only a certain number of rows.
Therefore using limit should be faster in most cases.
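A minimal sketch of the difference, reusing the connection and query from the question:

# limit(n) becomes part of the SQL, so the server only sends back n rows.
rows_limit = connection.execute(query.limit(5)).fetchall()

# fetchmany(n) runs the full query; the client simply pulls n rows from the result.
rows_fetchmany = connection.execute(query).fetchmany(5)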
I have found fetchmany to be very useful when you need to get a very large dataset from the database but you do not want to load all of those results into memory. It allows you to process the results in smaller batches.
result = conn.execution_options(stream_results=True).execute(
    SomeLargeTable.__table__.select()
)

while chunk := result.fetchmany(10000):  # only fetch 10K rows at a time
    for row in chunk:
        pass  # process each row before moving on to the next chunk

How does one batch/chunk both SQL source data and SQL destination data using Python?

I'm creating a quick-and-dirty ETL using Python from one database server (currently DB2) to another (MSSQL). I'm just landing the data, so no transformation is occurring. The code I've written works, but it must retrieve the entire dataset first, and then insert the entire dataset to the destination.
I'd like to create a solution that would allow me to specify 'x' number of rows to pull from the source, and batch them to the destination.
I'm sure there's an elegant solution out there, but I'm not familiar enough with Python. I'm just looking for recommendations on an implementation, methods to use, or techniques.
I'm using SQLAlchemy and pandas to accomplish the task. My source and destination tables are identical (as much as possible, since datatypes differ between SQL implementations). I'm populating a dataframe, then bulk inserting the data using MetaData and automap_base.
Bulk insert function
from sqlalchemy import MetaData
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import sessionmaker

def bulkInsert(engine, df, tableName, schemaName='dbo'):
    metadata = MetaData()
    metadata.reflect(engine, only=[tableName], schema=schemaName)
    Base = automap_base(metadata=metadata)
    Base.prepare()
    tableToInsert = Base.classes[tableName]

    conn = engine.connect()
    Session = sessionmaker(bind=conn)
    session = Session()
    session.bulk_insert_mappings(tableToInsert, df.to_dict(orient="records"), render_nulls=True)
    session.commit()
    session.close()
    conn.close()
Grab the source data
db2 = db2Connect(db2Server)
df = pd.read_sql(query, db2, coerce_float=False)
db2.close()
Set up destination
engine = mssqlSAEngine(server, database)
Start bulk insert, replace NaN with NULL
bulkInsert(engine, df.where(pd.notnull(df), None), tableName)
I've had no trouble with successfully inserting data. However, when I approach the million row mark, my system runs out of memory, and data starts paging. Naturally, performance degrades substantially.
We do have other tools in house (SSIS for example), but I'm looking for a dynamic method. In SSIS, I can either write a C# script task to basically accomplish what I'm doing here in Python, or create custom DFT's for each table. With this method, I just need to pass the source and destination.
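One possible sketch (not a definitive implementation) is to let pandas stream the source in batches via the chunksize argument of read_sql and reuse the bulkInsert function from above on each batch. The batch size of 50,000 below is an arbitrary assumption to tune against available memory:

import pandas as pd

CHUNK_ROWS = 50000  # assumed batch size; tune to available memory

db2 = db2Connect(db2Server)
engine = mssqlSAEngine(server, database)

# chunksize makes read_sql return an iterator of DataFrames instead of one big frame,
# so only one batch is held in memory at a time.
for chunk in pd.read_sql(query, db2, coerce_float=False, chunksize=CHUNK_ROWS):
    bulkInsert(engine, chunk.where(pd.notnull(chunk), None), tableName)

db2.close()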

How to distinctly bulk update all objects of a Django model without iterating over them in Python?

Basically can we achieve the same result without doing this:
from my_app import models

for prd, count in x.iteritems():
    models.AggregatedResult.objects.filter(product=prd).update(linked_epp_count=count)
?
As is evident, x is a dictionary whose keys match AggregatedResult's product field, and each value is the count I wish to update. It is taking more than 2-3 minutes to run on a test table with fewer than 15k rows; the real table is currently ~200k rows and is expected to grow to a million. So I need help.
The easiest (but not the safest) way is to use a raw SQL query.
Something like:
from django.db import connection, transaction

for prd, count in x.iteritems():
    cursor = connection.cursor()
    query = """
        UPDATE {table}
        SET {column} = {value}
        WHERE {condition} = '{condition_value}'""".format(
            table=AggregatedResult._meta.db_table,
            condition='product',
            condition_value=prd,
            column='linked_epp_count',
            value=count,
        )
    cursor.execute(query)
    transaction.commit_unless_managed()
Warning: not tested and extremely vulnerable to SQL injection. Use at your own risk.
An alternative (much safer) approach would be to first load the contents of x into a temporary table, then issue just one raw query to do the update. Assuming the temp table for x is temp_prod:
UPDATE aggregated_result ar
SET linked_epp_count = tp.count
FROM temp_prod tp
WHERE ar.product = tp.product
How you upload the data from x to the temp table is something I'm not very proficient with, so that part is left to you. :) A rough sketch of one way to do it is below.
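A sketch of that idea using Django's raw cursor (PostgreSQL-flavoured SQL; the temp table layout is an assumption, not from the thread):

from django.db import connection

with connection.cursor() as cursor:
    # Assumed temp table shape: one row per (product, count) pair from x.
    cursor.execute("CREATE TEMP TABLE temp_prod (product varchar(255), count integer)")
    cursor.executemany(
        "INSERT INTO temp_prod (product, count) VALUES (%s, %s)",
        list(x.items()),
    )
    # Single set-based update, as in the SQL above.
    cursor.execute(
        "UPDATE aggregated_result ar "
        "SET linked_epp_count = tp.count "
        "FROM temp_prod tp "
        "WHERE ar.product = tp.product"
    )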

Slow MySQL queries in Python but fast elsewhere

I'm having a heckuva time dealing with slow MySQL queries in Python. In one area of my application, "load data infile" goes quickly. In another area, the select queries are VERY slow.
Executing the same query in PhpMyAdmin AND Navicat (as a second test) yields a response ~5x faster than in Python.
A few notes...
I switched to MySQLdb as the connector and am also using SSCursor. No performance increase.
The database is optimized, indexed, etc. I'm porting this application to Python from PHP/CodeIgniter, where it ran fine (I foolishly thought getting out of PHP would help speed it up).
PHP/Codeigniter executes the select queries swiftly. For example, one key aspect of the application takes ~2 seconds in PHP/Codeigniter, but is taking 10 seconds in Python BEFORE any of the analysis of the data is done.
My link to the database is fairly standard...
dbconn = MySQLdb.connect(host="127.0.0.1", user="*", passwd="*", db="*",
                         cursorclass=MySQLdb.cursors.SSCursor)
Any insights/help/advice would be greatly appreciated!
UPDATE
In terms of fetching/handling the results, I've tried it a few ways. The initial query is fairly standard...
# Run Query
cursor.execute(query)
I removed all of the code within this loop just to make sure it wasn't the bottleneck, and it's not. I put dummy code in its place. The entire process did not speed up at all.
db_results = "test"

# Loop Results
for row in cursor:
    a = 0  # dummy code I put in to test

return db_results
The query result itself is only 501 rows (with a large number of columns)... it took 0.029 seconds outside of Python, but takes significantly longer than that within Python.
The project is related to horse racing. The query is done within the function below. The query itself is long; however, it runs fine outside of Python. I commented out the code within the loop on purpose for testing, and added the print(query) in hopes of figuring this out.
# Get PPs
def get_pps(race_ids):
    # Comma Race List
    race_list = ','.join(map(str, race_ids))

    # PPs Query
    query = ("SELECT raceindex.race_id, entries.entry_id, entries.prognum, runlines.line_id, runlines.track_code, runlines.race_date, runlines.race_number, runlines.horse_name, runlines.line_date, runlines.line_track, runlines.line_race, runlines.surface, runlines.distance, runlines.starters, runlines.race_grade, runlines.post_position, runlines.c1pos, runlines.c1posn, runlines.c1len, runlines.c2pos, runlines.c2posn, runlines.c2len, runlines.c3pos, runlines.c3posn, runlines.c3len, runlines.c4pos, runlines.c4posn, runlines.c4len, runlines.c5pos, runlines.c5posn, runlines.c5len, runlines.finpos, runlines.finposn, runlines.finlen, runlines.dq, runlines.dh, runlines.dqplace, runlines.beyer, runlines.weight, runlines.comment, runlines.long_comment, runlines.odds, runlines.odds_position, runlines.entries, runlines.track_variant, runlines.speed_rating, runlines.sealed_track, runlines.frac1, runlines.frac2, runlines.frac3, runlines.frac4, runlines.frac5, runlines.frac6, runlines.final_time, charts.raceshape "
             "FROM hrdb_raceindex raceindex "
             "INNER JOIN hrdb_runlines runlines ON runlines.race_date = raceindex.race_date AND runlines.track_code = raceindex.track_code AND runlines.race_number = raceindex.race_number "
             "INNER JOIN hrdb_entries entries ON entries.race_date=runlines.race_date AND entries.track_code=runlines.track_code AND entries.race_number=runlines.race_number AND entries.horse_name=runlines.horse_name "
             "LEFT JOIN hrdb_charts charts ON runlines.line_date = charts.race_date AND runlines.line_track = charts.track_code AND runlines.line_race = charts.race_number "
             "WHERE raceindex.race_id IN (" + race_list + ") "
             "ORDER BY runlines.line_date DESC;")
    print(query)

    # Run Query
    cursor.execute(query)

    # Query Fields
    fields = [i[0] for i in cursor.description]

    # PPs List
    pps = []

    # Loop Results
    for row in cursor:
        a = 0
        #this_pp = {}
        #for i, value in enumerate(row):
        #    this_pp[fields[i]] = value
        #pps.append(this_pp)

    return pps
One final note... I haven't considered the ideal way to handle the result. I believe one cursor class allows the result to come back as a set of dictionaries. I haven't even made it to that point yet, as the query and return themselves are so slow.
Though you have only 501 rows, it looks like you have over 50 columns. How much total data is being passed from MySQL to Python?
501 rows x 55 columns = 27,555 cells returned.
If each cell averaged "only" 1K that would be close to 27MB of data returned.
To get a sense of how much data MySQL is pushing, you can run this right after your query:
SHOW SESSION STATUS LIKE "bytes_sent"
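A minimal way to check this from Python, assuming the same cursor as in the question:

# Run right after the slow query; Bytes_sent is cumulative for the session.
cursor.execute('SHOW SESSION STATUS LIKE "bytes_sent"')
print(cursor.fetchall())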
Is your server well-resourced? Is memory allocation well configured?
My guess is that when you are using PHPMyAdmin you are getting paginated results. This masks the issue of MySQL returning more data than your server can handle (I don't use Navicat, not sure about how that returns results).
Perhaps the Python process is memory-constrained and, when faced with this large result set, it has to page out to disk to handle it.
If you reduce the number of columns requested and/or constrain the query to, say, LIMIT 10, do you get improved speed?
Can you see if the server running Python is paging to disk when this query is called? Can you see what memory is allocated to Python, how much is used during the process and how that allocation and usage compares to those same values in the PHP version?
Can you allocate more memory to your constrained resource?
Can you reduce the number of columns or rows that are called through pagination or asynchronous loading?
I know this is late; however, I have run into similar issues with MySQL and Python. My solution is to run the queries from another language: I use R to make my queries, which is blindingly fast, do what I can in R, and then send the data to Python if need be for more general programming, although R has many general-purpose libraries as well. Just wanted to post something that may help someone who has a similar problem, even though I know this sidesteps the heart of the problem.
