pyspark: Get hudi last/latest commit using pyspark

I run an incremental query with spark-hudi every hour and save each query's begin and end time in a database (say MySQL). For the next incremental query, I use the end time of the previous query, fetched from MySQL, as the begin time.
The incremental query should look like this:
hudi_incremental_read_options = {
    'hoodie.datasource.query.type': 'incremental',
    'hoodie.datasource.read.begin.instanttime': hudi_start_commit,
    'hoodie.datasource.read.end.instanttime': hudi_end_commit
}
But I am not sure how to find hudi_end_commit in PySpark (Python). In Java I can do this with the helper class HoodieDataSourceHelpers:
String hudi_end_commit = HoodieDataSourceHelpers.latestCommit(FileSystem.get(javaSparkContext.hadoopConfiguration()),l1BasePath);
but I have been unable to find a way to do the same in Python.
As a workaround I found a solution, but it is not feasible for a large dataset:
spark_session.read.format("hudi").load(l1_base_path).createOrReplaceTempView("hudi_trips_snapshot")
commits = list(map(lambda row: row[0],
                   spark_session.sql("select distinct(_hoodie_commit_time) as commitTime from "
                                     "hudi_trips_snapshot order by commitTime desc").limit(1).collect()))
But when the data is very large, this loads the whole dataset just to get the Hudi commits, which takes more time than reading the actual data itself.
Is there an easy way to find the latest Hudi commit?

Try this (worked for me in pyspark shell):
hudi_end_commit = spark._sc._gateway.jvm.org.apache.hudi.HoodieDataSourceHelpers.latestCommit(
    spark._sc._gateway.jvm.org.apache.hadoop.fs.FileSystem.get(spark._sc._jsc.hadoopConfiguration()),
    "/path/to/hudi/table"
)
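The call above can be wrapped in a small helper so that the hourly job builds its read options in one place. This is only a sketch: the function names latest_hudi_commit and incremental_read_options are made up for illustration, and the code relies on the same private PySpark attributes (_sc, _jsc, _gateway) as the snippet above, which may change between Spark versions.

```python
def latest_hudi_commit(spark, base_path):
    """Return the latest commit instant of the Hudi table at base_path.

    Hypothetical helper: it only wraps the JVM call shown above and needs
    the Hudi bundle on the Spark classpath.
    """
    jvm = spark._sc._gateway.jvm
    hadoop_conf = spark._sc._jsc.hadoopConfiguration()
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
    return jvm.org.apache.hudi.HoodieDataSourceHelpers.latestCommit(fs, base_path)


def incremental_read_options(begin_commit, end_commit):
    """Build the incremental-query options from the question."""
    return {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_commit,
        "hoodie.datasource.read.end.instanttime": end_commit,
    }
```

With a live session this would be used roughly as hudi_end_commit = latest_hudi_commit(spark, l1_base_path), followed by spark.read.format("hudi").options(**incremental_read_options(hudi_start_commit, hudi_end_commit)).load(l1_base_path).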

Related

Faster cosmos db query

I followed the Cosmos DB example for the SQL API, but getting the data is quite slow. I'm trying to get one week of data (around 1M records). Sample code below.
client = cosmos_client.CosmosClient(HOST, {'masterKey': KEY})
database = client.get_database_client(DB_ID)
container = database.get_container_client(COLLECTION_ID)
query = """
SELECT some columns
FROM c
WHERE columna = 'a'
and columnb >= '100'
"""
result = list(container.query_items(
    query=query, enable_cross_partition_query=True))
My question is, is there any other way to query data faster? Does putting the query result in list make it slow? What am I doing wrong here?

There are a couple of things you could do.
Model your data so that you don't have to do a cross-partition query. These will always take more time because the query has to touch more partitions to get the data. You can learn more in "Model and partition data in Cosmos DB".
When you only need a single item, you can go even faster by using a point read (read_item) instead of a query.
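As a rough sketch of both suggestions, assuming the azure-cosmos v4 Python SDK and a container partitioned on a known key: query_items accepts a partition_key argument that keeps the query inside one partition, and read_item performs a point read by id and partition key. The helper names and the c.columna filter are illustrative only.

```python
def query_single_partition(container, partition_key_value):
    """Yield items from one logical partition instead of fanning out."""
    items = container.query_items(
        query="SELECT * FROM c WHERE c.columna = @a",
        parameters=[{"name": "@a", "value": "a"}],
        partition_key=partition_key_value,  # scopes the query to one partition
    )
    # Iterate lazily rather than building a list of ~1M records up front.
    for item in items:
        yield item


def point_read(container, item_id, partition_key_value):
    """Fetch exactly one item by id and partition key (no query involved)."""
    return container.read_item(item=item_id, partition_key=partition_key_value)
```

Whether lazy iteration helps depends on what is done with the rows afterwards; it mainly avoids holding the whole result in memory at once.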

SQLAlchemy `.fetchmany()` vs `.limit()`

Assuming query is some already defined query. As far as I can tell, connection.execute(query).fetchmany(n) and connection.execute(query.limit(n)).fetchall() return the same result set. I'm wondering if one of them is more idiomatic or, more importantly, more performant?
Example usage would be:
query = select([census.columns.state, (census.columns.pop2008 - census.columns.pop2000).label("pop_change")]).group_by(census.columns.state).order_by(desc("pop_change"))
results_1 = connection.execute(query.limit(n)).fetchall()
results_2 = connection.execute(query).fetchmany(n)  # `results_2` == `results_1`

limit will be part of the SQL query sent to the database server.
With fetchmany the query is executed without any limit, but the client (Python code) requests only a certain number of rows.
Therefore using limit should be faster in most cases.
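The difference can be seen with any DB-API driver. The sketch below uses the standard library's sqlite3 instead of SQLAlchemy (an assumption made so that the example runs without extra packages), and the table and its data are invented. Both approaches return the same rows, but only the first one asks the database for just n rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE census (state TEXT, pop_change INTEGER)")
conn.executemany(
    "INSERT INTO census VALUES (?, ?)",
    [(f"S{i}", i) for i in range(100)],
)

n = 5
base = "SELECT state, pop_change FROM census ORDER BY pop_change DESC"

# LIMIT is part of the SQL, so the database only has to produce n rows.
limited = conn.execute(base + " LIMIT ?", (n,)).fetchall()

# No LIMIT: the full query is executed and the client stops reading after n rows.
fetched = conn.execute(base).fetchmany(n)

print(limited == fetched)  # True
```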

I have found fetchmany to be very useful when you need to get a very large dataset from the database but you do not want to load all of those results into memory. It allows you to process the results in smaller batches.
result = conn.execution_options(stream_results=True).execute(
    SomeLargeTable.__table__.select()
)
while chunk := result.fetchmany(10000):  # only get 10K rows at a time
    for row in chunk:
        ...  # process each row before moving on to the next chunk
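The same batching pattern can be packaged as a generator. This is a minimal sketch that uses sqlite3 as a stand-in for a streaming SQLAlchemy result; iter_batches and the some_large_table contents are hypothetical, and the walrus operator requires Python 3.8 or later.

```python
import sqlite3


def iter_batches(cursor, batch_size):
    """Yield lists of at most batch_size rows until the cursor is exhausted."""
    while chunk := cursor.fetchmany(batch_size):
        yield chunk


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE some_large_table (id INTEGER)")
conn.executemany(
    "INSERT INTO some_large_table VALUES (?)",
    [(i,) for i in range(25)],
)

cursor = conn.execute("SELECT id FROM some_large_table ORDER BY id")
batch_sizes = [len(batch) for batch in iter_batches(cursor, 10)]
print(batch_sizes)  # [10, 10, 5]
```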

PyMongo cursor operations are very slow

I'm new to both MongoDB and PyMongo, and am having some performance issues regarding cursors.
TL;DR: Any operation I try to perform using a cursor takes about a second.
Long version
I have a small database, which I bulk-loaded. Each entry has 3 fields:
dom: domain name (unique)
date: date, YYYYMMDD
flag: string
I've loaded about 1.9 million entries, without incident, and quite quickly.
I created a hash index on the dom field.
Now, I want to grab certain records by the domain field, and update them, using a Python program.
That's where the problem lies.
I'm using the latest MongoDB, and the latest pyMongo.
stripped down program...
import pymongo
from pymongo import MongoClient
client = MongoClient()  # connection details were stripped; default host assumed
db = client.myindexname
posts = db.posts
print(list(db.profiles.index_information()))  # shows hash index is present
for k in newdomainlist.keys():  # iterate list of domains to check
    ret = posts.find({"dom": k})  # this runs fine, and quickly
    # 'ret' is a cursor
    print(ret)  # this runs quickly
    # Here's the problem
    print(ret.count())  # this takes about a second. why?
If I just 'print ret', the speed is fine. However, if I try to reference anything in the cursor, the speed drops to the floor: I can do about 1 operation per second.
In this case, I'm just trying to see if ret.count() returns '0' (we don't have this domain), or '1' (we have it already).
I've tried adding a batch_size(10000) to the find, without it helping.
I DO have the Python C extensions loaded.
What the heck am I doing wrong?
thanks

It turned out that I'd created my hashed index on the wrong field, 'collection', rather than 'posts'. Chalk it up to mongodb inexperience. We can close this one now, or delete it entirely.
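This also explains the symptoms: a PyMongo cursor is lazy, so printing it sends nothing to the server, while count() runs the query, and without a usable index that means scanning all 1.9 million documents. A minimal sketch of creating the index on the collection that is actually queried, and of a cheap existence check, follows. It assumes PyMongo 3.7 or later, where count_documents is available (Cursor.count() was removed in PyMongo 4). The helper names are invented here, and "hashed" is the index-type string that pymongo.HASHED stands for.

```python
def ensure_dom_index(posts):
    """Create a hashed index on 'dom' in the collection being queried."""
    return posts.create_index([("dom", "hashed")])


def domain_exists(posts, dom):
    """Return True if at least one document has this domain."""
    # limit=1 lets the server stop after the first match.
    return posts.count_documents({"dom": dom}, limit=1) > 0
```

posts.find({"dom": k}).explain() can then be used to check whether the query plan reports an index scan rather than a full collection scan.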

Slow MySQL queries in Python but fast elsewhere

I'm having a heckuva time dealing with slow MySQL queries in Python. In one area of my application, "load data infile" goes quickly. In another area, the select queries are VERY slow.
Executing the same query in PhpMyAdmin AND Navicat (as a second test) yields a response ~5x faster than in Python.
A few notes...
I switched to MySQLdb as the connector and am also using SSCursor. No performance increase.
The database is optimized, indexed etc. I'm porting this application to Python from PHP/Codeigniter where it ran fine (I foolishly thought getting out of PHP would help speed it up)
PHP/Codeigniter executes the select queries swiftly. For example, one key aspect of the application takes ~2 seconds in PHP/Codeigniter, but is taking 10 seconds in Python BEFORE any of the analysis of the data is done.
My link to the database is fairly standard...
dbconn=MySQLdb.connect(host="127.0.0.1",user="*",passwd="*",db="*", cursorclass = MySQLdb.cursors.SSCursor)
Any insights/help/advice would be greatly appreciated!
UPDATE
In terms of fetching/handling the results, I've tried it a few ways. The initial query is fairly standard...
# Run Query
cursor.execute(query)
I removed all of the code within this loop just to make sure it wasn't the bottleneck, and it's not. I put dummy code in its place. The entire process did not speed up at all.
db_results = "test"
# Loop Results
for row in cursor:
    a = 0  # this was the dummy code I put in to test
return db_results
The query result itself is only 501 rows (large amount of columns)... took 0.029 seconds outside of Python. Taking significantly longer than that within Python.
The project is related to horse racing. The query is done within this function. The query itself is long, however, it runs well outside of Python. I commented out the code within the loop on purpose for testing... also the print(query) in hopes of figuring this out.
# Get PPs
def get_pps(race_ids):
    # Comma Race List
    race_list = ','.join(map(str, race_ids))
    # PPs Query
    query = ("SELECT raceindex.race_id, entries.entry_id, entries.prognum, runlines.line_id, runlines.track_code, runlines.race_date, runlines.race_number, runlines.horse_name, runlines.line_date, runlines.line_track, runlines.line_race, runlines.surface, runlines.distance, runlines.starters, runlines.race_grade, runlines.post_position, runlines.c1pos, runlines.c1posn, runlines.c1len, runlines.c2pos, runlines.c2posn, runlines.c2len, runlines.c3pos, runlines.c3posn, runlines.c3len, runlines.c4pos, runlines.c4posn, runlines.c4len, runlines.c5pos, runlines.c5posn, runlines.c5len, runlines.finpos, runlines.finposn, runlines.finlen, runlines.dq, runlines.dh, runlines.dqplace, runlines.beyer, runlines.weight, runlines.comment, runlines.long_comment, runlines.odds, runlines.odds_position, runlines.entries, runlines.track_variant, runlines.speed_rating, runlines.sealed_track, runlines.frac1, runlines.frac2, runlines.frac3, runlines.frac4, runlines.frac5, runlines.frac6, runlines.final_time, charts.raceshape "
             "FROM hrdb_raceindex raceindex "
             "INNER JOIN hrdb_runlines runlines ON runlines.race_date = raceindex.race_date AND runlines.track_code = raceindex.track_code AND runlines.race_number = raceindex.race_number "
             "INNER JOIN hrdb_entries entries ON entries.race_date=runlines.race_date AND entries.track_code=runlines.track_code AND entries.race_number=runlines.race_number AND entries.horse_name=runlines.horse_name "
             "LEFT JOIN hrdb_charts charts ON runlines.line_date = charts.race_date AND runlines.line_track = charts.track_code AND runlines.line_race = charts.race_number "
             "WHERE raceindex.race_id IN (" + race_list + ") "
             "ORDER BY runlines.line_date DESC;")
    print(query)
    # Run Query
    cursor.execute(query)
    # Query Fields
    fields = [i[0] for i in cursor.description]
    # PPs List
    pps = []
    # Loop Results
    for row in cursor:
        a = 0
        #this_pp = {}
        #for i, value in enumerate(row):
        #    this_pp[fields[i]] = value
        #pps.append(this_pp)
    return pps
One final note... I haven't considered the ideal way to handle the result. I believe one cursor allows the result to come back as a set of dictionaries. I haven't even made it to that point yet as the query and return itself is so slow.

Though you have only 501 rows, it looks like you have over 50 columns. How much total data is being passed from MySQL to Python?
501 rows x 55 columns = 27,555 cells returned.
If each cell averaged "only" 1K that would be close to 27MB of data returned.
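The estimate is easy to reproduce. The 1 KiB average cell size is the assumption made in the text above, not a measured value.

```python
rows = 501
columns = 55
avg_cell_bytes = 1024  # assumed average of 1 KiB per cell, as in the estimate above

cells = rows * columns
total_bytes = cells * avg_cell_bytes
total_mib = total_bytes / (1024 * 1024)

print(cells)                # 27555
print(round(total_mib, 1))  # 26.9
```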
To get a sense of how much data MySQL is pushing, you can run this after your query:
SHOW SESSION STATUS LIKE "bytes_sent"
Is your server well-resourced? Is memory allocation well configured?
My guess is that when you are using phpMyAdmin you are getting paginated results. This masks the issue of MySQL returning more data than your server can handle (I don't use Navicat, so I'm not sure how it returns results).
Perhaps the Python process is memory-constrained, and when faced with this large result set it has to page out to disk to handle it.
If you reduce the number of columns called and/or constrain to, say, LIMIT 10 on your query, do you get improved speed?
Can you see if the server running Python is paging to disk when this query is called? Can you see what memory is allocated to Python, how much is used during the process and how that allocation and usage compares to those same values in the PHP version?
Can you allocate more memory to your constrained resource?
Can you reduce the number of columns or rows that are called through pagination or asynchronous loading?
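Before tuning memory, it may help to see where the time actually goes by timing execute and the row fetch separately. The sketch below uses the standard library's sqlite3 as a stand-in so that it runs anywhere; with MySQLdb the same two-phase timing would apply to cursor.execute(query) and the loop over the cursor. With an unbuffered SSCursor the rows stream during the fetch phase, so transfer time would show up there. The function name timed_query and the sample table contents are hypothetical.

```python
import sqlite3
import time


def timed_query(cursor, sql):
    """Run sql and return (rows, execute_seconds, fetch_seconds)."""
    t0 = time.perf_counter()
    cursor.execute(sql)
    t1 = time.perf_counter()
    rows = cursor.fetchall()
    t2 = time.perf_counter()
    return rows, t1 - t0, t2 - t1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runlines (line_id INTEGER, horse_name TEXT)")
conn.executemany(
    "INSERT INTO runlines VALUES (?, ?)",
    [(i, f"horse-{i}") for i in range(501)],
)

rows, execute_s, fetch_s = timed_query(conn.cursor(), "SELECT * FROM runlines")
print(f"{len(rows)} rows, execute {execute_s:.4f}s, fetch {fetch_s:.4f}s")
```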

I know this is late; however, I have run into similar issues with MySQL and Python. My solution is to run the queries from another language. I use R to make my queries, which is blindingly fast, do what I can in R, and then send the data to Python if need be for more general programming, although R has many general-purpose libraries as well. Just wanted to post something that may help someone who has a similar problem, and I know this sidesteps the heart of the problem.

Optimizing performance of Postgresql database writes in Django?

I've got a Django 1.1 app that needs to import data from some big JSON files on a daily basis. To give an idea, one of these files is over 100 MB and has 90K entries that are imported to a PostgreSQL database.
The problem I'm experiencing is that it takes a really long time for the data to be imported, i.e. on the order of hours. I would have expected it to take some time to write that number of entries to the database, but certainly not that long, which makes me think I'm doing something inherently wrong. I've read similar Stack Exchange questions, and the solutions proposed suggest using the transaction.commit_manually or transaction.commit_on_success decorators to commit in batches instead of on every .save(), which I'm already doing.
As I say, I'm wondering if I'm doing anything wrong (e.g. batches to commit are too big?, too many foreign keys?...), or whether I should just go away from Django models for this function and use the DB API directly. Any ideas or suggestions?
Here are the basic models I'm dealing with when importing data (I've removed some of the fields in the original code for the sake of simplicity)
class Template(models.Model):
    template_name = models.TextField(_("Name"), max_length=70)
    sourcepackage = models.TextField(_("Source package"), max_length=70)
    translation_domain = models.TextField(_("Domain"), max_length=70)
    total = models.IntegerField(_("Total"))
    enabled = models.BooleanField(_("Enabled"))
    priority = models.IntegerField(_("Priority"))
    release = models.ForeignKey(Release)
class Translation(models.Model):
    release = models.ForeignKey(Release)
    template = models.ForeignKey(Template)
    language = models.ForeignKey(Language)
    translated = models.IntegerField(_("Translated"))
And here's the bit of code that seems to take ages to complete:
@transaction.commit_manually
def add_translations(translation_data, lp_translation):
    releases = Release.objects.all()
    # There are 5 releases
    for release in releases:
        # translation_data has about 90K entries
        # this is the part that takes a long time
        for lp_translation in translation_data:
            try:
                language = Language.objects.get(
                    code=lp_translation['language'])
            except Language.DoesNotExist:
                continue
            translation = Translation(
                template=Template.objects.get(
                    sourcepackage=lp_translation['sourcepackage'],
                    template_name=lp_translation['template_name'],
                    translation_domain=\
                        lp_translation['translation_domain'],
                    release=release),
                translated=lp_translation['translated'],
                language=language,
                release=release,
                )
            translation.save()
        # I realize I should commit every n entries
        transaction.commit()
        # I've also got another bit of code to fill in some data I'm
        # not getting from the json files
        # Add missing templates
        languages = Language.objects.filter(visible=True)
        languages_total = len(languages)
        for language in languages:
            templates = Template.objects.filter(release=release)
            for template in templates:
                try:
                    translation = Translation.objects.get(
                        template=template,
                        language=language,
                        release=release)
                except Translation.DoesNotExist:
                    translation = Translation(template=template,
                                              language=language,
                                              release=release,
                                              translated=0,
                                              untranslated=0)
                    translation.save()
        transaction.commit()

Going through your app and processing every single row is a lot slower than loading the data directly on the server, even with optimized code. Also, inserting / updating one row at a time is a lot slower than processing everything at once.
If the import files are available locally to the server you can use COPY. Otherwise you could use the meta-command \copy in the standard interface psql. You mention JSON; for this to work, you would have to convert the data to a suitable flat format like CSV.
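The conversion to CSV can be done with the Python standard library. The sketch below assumes that the JSON file is a top-level array of objects carrying the keys used in the question's lp_translation dictionaries. The function name json_to_csv and the sample entry are invented for illustration.

```python
import csv
import io
import json

# Field names taken from the lp_translation keys used in the question.
FIELDS = ["language", "sourcepackage", "template_name",
          "translation_domain", "translated"]


def json_to_csv(json_text, out):
    """Write a JSON array of translation entries to out as CSV rows."""
    entries = json.loads(json_text)
    writer = csv.writer(out)
    for entry in entries:
        writer.writerow([entry[field] for field in FIELDS])
    return len(entries)


# Invented sample entry, only to show the shape of the data.
sample = json.dumps([
    {"language": "ca", "sourcepackage": "gedit", "template_name": "gedit",
     "translation_domain": "gedit", "translated": 1200},
])
buffer = io.StringIO()
count = json_to_csv(sample, buffer)
print(count)  # 1
```

For a real import, out would be a file opened with newline="" so that the csv module controls the line endings, and the resulting file would be the one passed to COPY or \copy.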
If you just want to add new rows to a table:
COPY tbl FROM '/absolute/path/to/file' (FORMAT csv);
Or if you want to INSERT / UPDATE some rows:
First off: Use enough RAM for temp_buffers (at least temporarily, if you can) so the temp table does not have to be written to disk. Be aware that this has to be done before accessing any temporary tables in this session.
SET LOCAL temp_buffers='128MB';
In-memory representation takes somewhat more space than the on-disk representation of data. So for a 100 MB JSON file, minus the JSON overhead, plus some Postgres overhead, 128 MB may or may not be enough. But you don't have to guess, just do a test run and measure it:
select pg_size_pretty(pg_total_relation_size('tmp_x'));
Create the temporary table:
CREATE TEMP TABLE tmp_x (id int, val_a int, val_b text);
Or, to just duplicate the structure of an existing table:
CREATE TEMP TABLE tmp_x AS SELECT * FROM tbl LIMIT 0;
Copy values (should take seconds, not hours):
COPY tmp_x FROM '/absolute/path/to/file' (FORMAT csv);
From there INSERT / UPDATE with plain old SQL. As you are planning a complex query, you may even want to add an index or two on the temp table and run ANALYZE:
ANALYZE tmp_x;
For instance, to update existing rows, matched by id:
UPDATE tbl
SET    col_a = tmp_x.col_a
FROM   tmp_x
WHERE  tbl.id = tmp_x.id;
Finally, drop the temporary table:
DROP TABLE tmp_x;
Or have it dropped automatically at the end of the session.
