How to best query an AWS table without missing data using Python?

I have the following code that iterates through the rows of certain tables in AWS. It grabs the first 50k rows and keeps going as long as there are 50k more rows to grab. It works extremely quickly because I'm usually only getting the last 2 days' worth of data.
import datetime as dt
from datetime import timedelta

import pandas as pd
import psycopg2.extras

top = 50000
i = 0
days = 2
df = pd.DataFrame()
result = pd.DataFrame()
curs = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)  # conn is an existing connection
while (i == 0) or (len(df) == top):
    start_time = (dt.datetime.now() - timedelta(days=days)).strftime("%Y-%m-%d %H:%M:%S")
    sql = f"SELECT * FROM {str.upper(table)} WHERE INSERTED_AT >= '{start_time}' OR UPDATED_AT >= '{start_time}' LIMIT {top} OFFSET {i}"
    curs.execute(sql)
    data = curs.fetchall()
    df = pd.DataFrame([row.copy() for row in data])
    result = result.append(df, ignore_index=True)
    # load result to snowflake
    i += top
The trouble is I have a very large table that is about 7 million rows long and still growing. I found that if I backload all of its data (days=1000) I end up missing data, probably because what was rows 0-50k, 50k-100k, etc. on one iteration has shifted by the next, since the table keeps loading new rows while my while loop is running.
What is a better way to load data into Snowflake that will avoid missing-data issues? Do I have to use parallelization to get all these pieces of the table at once? Even with top=3,000,000 I still find I'm missing large amounts of data, likely due to the time it takes me to load while the actual table rows keep incrementing. Is there a standardized block of code that works well for large tables?
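As an illustration of why the offsets drift, here is a sketch of keyset pagination, which pages on a stable ordering column instead of a row offset. This is not from the answer below; it assumes the conn/curs/table/top names from the code above, that INSERTED_AT is indexed, and a hypothetical lowercase "inserted_at" key in the returned rows:

# Sketch: page on the last INSERTED_AT value seen instead of an OFFSET,
# so concurrent inserts cannot shift the pages under the loop.
# The strict ">" comparison is an assumption; rows sharing the boundary
# timestamp may need ">=" plus de-duplication.
last_seen = (dt.datetime.now() - timedelta(days=days)).strftime("%Y-%m-%d %H:%M:%S")
result = pd.DataFrame()
while True:
    sql = (
        f"SELECT * FROM {str.upper(table)} "
        f"WHERE INSERTED_AT > '{last_seen}' "
        f"ORDER BY INSERTED_AT "
        f"LIMIT {top}"
    )
    curs.execute(sql)
    data = curs.fetchall()
    if not data:
        break
    result = pd.concat([result, pd.DataFrame([row.copy() for row in data])], ignore_index=True)
    last_seen = max(row["inserted_at"] for row in data)  # column name is an assumption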

I would skip the Python and favor Redshift's UNLOAD command.
UNLOAD lets you dump the contents of a Redshift table (or query) into an S3 bucket:
https://community.snowflake.com/s/article/How-To-Migrate-Data-from-Amazon-Redshift-into-Snowflake
unload ('select * from emp where date = GETDATE()')
to 's3://mybucket/mypath/'
credentials 'aws_access_key_id=XXX;aws_secret_access_key=XXX'
delimiter '\001'
null '\\N'
escape
[allowoverwrite]
[gzip];
You could set up a stored procedure that runs and have a schedule that kicks it off once per day (I use Astronomer/Airflow for this).
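For reference, a minimal sketch of that kind of daily schedule in Airflow; the DAG id, task id, and the run_unload callable are hypothetical placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_unload():
    # Hypothetical: connect to Redshift and execute the UNLOAD statement above.
    ...

with DAG(
    dag_id="daily_redshift_unload",      # placeholder name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="unload_to_s3", python_callable=run_unload)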
From there you can build an external table on top of the bucket:
https://docs.snowflake.com/en/user-guide/data-load-s3-config-storage-integration.html
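As a rough sketch, assuming the UNLOAD above wrote delimited files to that bucket and a storage integration is already configured; the stage, integration, table, and column names here are placeholders, not taken from the linked docs, and sf_cur is assumed to be an open Snowflake cursor:

sf_cur.execute("""
    CREATE STAGE IF NOT EXISTS my_unload_stage
      URL = 's3://mybucket/mypath/'
      STORAGE_INTEGRATION = my_s3_integration;
""")
sf_cur.execute(r"""
    CREATE OR REPLACE EXTERNAL TABLE emp_ext (
        emp_id   INT  AS (VALUE:c1::INT),
        emp_name TEXT AS (VALUE:c2::TEXT)
    )
    LOCATION = @my_unload_stage
    FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '\001')  -- must match the UNLOAD delimiter
    AUTO_REFRESH = FALSE;
""")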

Related

pyspark: Get hudi last/latest commit using pyspark

I am doing an incremental query with spark-hudi every hour and saving each incremental query's begin and end time in a db (say MySQL) every time. For the next incremental query I use the end time of the previous query (fetched from MySQL) as the begin time.
The incremental query should look like this:
hudi_incremental_read_options = {
    'hoodie.datasource.query.type': 'incremental',
    'hoodie.datasource.read.begin.instanttime': hudi_start_commit,
    'hoodie.datasource.read.end.instanttime': hudi_end_commit
}
but I am not sure how to find hudi_end_commit in pyspark (Python). In Java I can do the same with the helper class HoodieDataSourceHelpers, like:
String hudi_end_commit = HoodieDataSourceHelpers.latestCommit(FileSystem.get(javaSparkContext.hadoopConfiguration()), l1BasePath);
but I am unable to find a way to do the same in Python.
As a workaround I found a solution, but it is not feasible for a large dataset:
spark_session.read.format("hudi").load(l1_base_path).createOrReplaceTempView("hudi_trips_snapshot")
commits = list(map(lambda row: row[0],
                   spark_session.sql("select distinct(_hoodie_commit_time) as commitTime from "
                                     "hudi_trips_snapshot order by commitTime desc").limit(1).collect()))
But when the data size is too large, this loads the whole dataset just to get the Hudi commits, which takes more time than reading the actual data itself.
Is there an easy way to find the latest/last Hudi commit?
Try this (worked for me in pyspark shell):
hudi_end_commit = spark._sc._gateway.jvm.org.apache.hudi.HoodieDataSourceHelpers.latestCommit(
    spark._sc._gateway.jvm.org.apache.hadoop.fs.FileSystem.get(spark._sc._jsc.hadoopConfiguration()),
    "/path/to/hudi/table"
)
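For context, the returned commit string can then be dropped straight into the incremental read options from the question; a minimal usage sketch (variable names and path are the question's own, and the final load call is the standard Hudi read pattern):

hudi_incremental_read_options = {
    'hoodie.datasource.query.type': 'incremental',
    'hoodie.datasource.read.begin.instanttime': hudi_start_commit,
    'hoodie.datasource.read.end.instanttime': hudi_end_commit
}
# Read only the commits in the (begin, end] window instead of the whole table.
incremental_df = (
    spark.read.format("hudi")
    .options(**hudi_incremental_read_options)
    .load(l1_base_path)
)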

Effectively querying DBF files with Python

I need to read from a legacy VFP DBF database and gather all rows which have an etd within the current week.
I am using the dbf package; however, it seems that when querying the table, it begins at the very first record. This causes performance issues when attempting to find data within the last week, as it has to iterate over every row in the database (60k+) every time it runs.
table = dbf.Table(r'\\server\file.dbf')
table.open()
for row in table:
    if (self.monday < row.etd < self.friday) and ('LOC' not in row.route):
        self.datatable.Rows.Add(row.manifest, row.route, row.etd, row.eta, row.inst, row.subname)
    else:
        continue
I tried to "reverse" the table with for row in table[::-1]:
However, this takes the same amount of time, as I believe it needs to load the whole database into memory before applying the [::-1].
What would be a more efficient way to query these DBF files?
As you know, dbf does not support index files. It does, however, have some methods reminiscent of VFP that could help:
# untested
table = ...
potential_records = []
with table:          # auto opens and closes
    table.bottom()   # go to end of table
    while True:
        table.skip(-1)               # move to previous record
        row = table.current_record
        if self.monday > row.etd:
            # gone back beyond range
            break
        elif row.etd < self.friday:
            potential_records.append(row)
# at this point the table is closed and potential_records should have all
# records in the etd range.
The above will only work if the records are physically ordered by etd.

SQLAlchemy `.fetchmany()` vs `.limit()`

Assume query is some already defined query. As far as I can tell, connection.execute(query).fetchmany(n) and connection.execute(query.limit(n)).fetchall() apparently return the same result set. I'm wondering if one of them is more idiomatic or, more importantly, more performant?
Example usage would be:
query = select([census.columns.state,
                (census.columns.pop2008 - census.columns.pop2000).label("pop_change")]
               ).group_by(census.columns.state).order_by(desc("pop_change"))

results_1 = connection.execute(query.limit(5)).fetchall()
results_2 = connection.execute(query).fetchmany(5)  # `results_2` == `results_1`
limit will be part of the SQL query sent to the database server.
With fetchmany, the query is executed without any limit, but the client (the Python code) requests only a certain number of rows.
Therefore, using limit should be faster in most cases.
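You can see the difference by printing the compiled statement; a small sketch reusing the query from the question (the exact rendered SQL varies by dialect):

# .limit() is compiled into the emitted SQL, so the server returns at most 5 rows.
print(query.limit(5))
# ... GROUP BY census.state ORDER BY pop_change DESC
#  LIMIT :param_1

# .fetchmany(5) runs the unrestricted query and only pulls 5 rows to the client.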
I have found fetchmany to be very useful when you need to get a very large dataset from the database but you do not want to load all of those results into memory. It allows you to process the results in smaller batches.
result = conn.execution_options(stream_results=True).execute(
    SomeLargeTable.__table__.select()
)

while chunk := result.fetchmany(10000):  # only get 10K rows at a time
    for row in chunk:
        ...  # process each row before moving on to the next chunk

Snowflake Fetch Many in Python

I'm trying to create a pandas DataFrame using the Snowflake packages in Python.
I run a query:
sf_cur = get_sf_connector()
sf_cur.execute("USE WAREHOUSE Warehouse;")
sf_cur.execute("""select Query""")
print('done')
The output is roughly 21k rows. Then using
df = pd.DataFrame(sf_cur.fetchall())
takes forever, even on a limited sample of only 100 rows. Is there a way to optimize this? Ideally the bigger query would be run in a loop, so that even larger data sets could be handled.
As fetchall() copies the whole result set into memory, you should try to iterate over the cursor object directly and build the data frame inside the for block:
cursor.execute(query)
for row in cursor:
    ...  # build the data frame row by row
Another example, just to illustrate:
query = "Select ID from Users"
cursor.execute(query)
for row in cursor:
    list_ids.append(row["ID"])
Use df = cur.fetch_pandas_all() to build a pandas DataFrame on top of the results.
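If the result set is too large for a single DataFrame, the connector also provides fetch_pandas_batches(), which yields DataFrames chunk by chunk; a minimal sketch, assuming cur is the Snowflake cursor above and reusing the question's placeholder query:

cur.execute("""select Query""")
# Iterate over the result as a sequence of smaller DataFrames
# instead of materializing everything at once.
for batch_df in cur.fetch_pandas_batches():
    ...  # process or upload each chunk here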

Optimizing performance of Postgresql database writes in Django?

I've got a Django 1.1 app that needs to import data from some big JSON files on a daily basis. To give an idea, one of these files is over 100 MB and has 90K entries that are imported into a PostgreSQL database.
The problem I'm experiencing is that it takes a really long time for the data to be imported, i.e. on the order of hours. I would have expected it to take some time to write that number of entries to the database, but certainly not that long, which makes me think I'm doing something inherently wrong. I've read similar Stack Exchange questions, and the solutions proposed suggest using the transaction.commit_manually or transaction.commit_on_success decorators to commit in batches instead of on every .save(), which I'm already doing.
As I say, I'm wondering if I'm doing anything wrong (e.g. batches to commit are too big? too many foreign keys?), or whether I should just move away from Django models for this function and use the DB API directly. Any ideas or suggestions?
Here are the basic models I'm dealing with when importing data (I've removed some of the fields from the original code for the sake of simplicity):
class Template(models.Model):
    template_name = models.TextField(_("Name"), max_length=70)
    sourcepackage = models.TextField(_("Source package"), max_length=70)
    translation_domain = models.TextField(_("Domain"), max_length=70)
    total = models.IntegerField(_("Total"))
    enabled = models.BooleanField(_("Enabled"))
    priority = models.IntegerField(_("Priority"))
    release = models.ForeignKey(Release)

class Translation(models.Model):
    release = models.ForeignKey(Release)
    template = models.ForeignKey(Template)
    language = models.ForeignKey(Language)
    translated = models.IntegerField(_("Translated"))
And here's the bit of code that seems to take ages to complete:
@transaction.commit_manually
def add_translations(translation_data, lp_translation):
    releases = Release.objects.all()
    # There are 5 releases
    for release in releases:
        # translation_data has about 90K entries
        # this is the part that takes a long time
        for lp_translation in translation_data:
            try:
                language = Language.objects.get(
                    code=lp_translation['language'])
            except Language.DoesNotExist:
                continue
            translation = Translation(
                template=Template.objects.get(
                    sourcepackage=lp_translation['sourcepackage'],
                    template_name=lp_translation['template_name'],
                    translation_domain=lp_translation['translation_domain'],
                    release=release),
                translated=lp_translation['translated'],
                language=language,
                release=release,
            )
            translation.save()
            # I realize I should commit every n entries
            transaction.commit()

        # I've also got another bit of code to fill in some data I'm
        # not getting from the json files

        # Add missing templates
        languages = Language.objects.filter(visible=True)
        languages_total = len(languages)
        for language in languages:
            templates = Template.objects.filter(release=release)
            for template in templates:
                try:
                    translation = Translation.objects.get(
                        template=template,
                        language=language,
                        release=release)
                except Translation.DoesNotExist:
                    translation = Translation(template=template,
                                              language=language,
                                              release=release,
                                              translated=0,
                                              untranslated=0)
                    translation.save()
                    transaction.commit()
Going through your app and processing every single row is a lot slower than loading the data directly on the server, even with optimized code. And inserting / updating one row at a time is, again, a lot slower than processing everything at once.
If the import files are available locally on the server, you can use COPY. Otherwise you could use the meta-command \copy in the standard client psql. You mention JSON; for this to work, you would have to convert the data to a suitable flat format like CSV.
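For example, a rough sketch of that conversion in Python, assuming each JSON entry carries the fields used in your import loop (the file names are placeholders):

import csv
import json

# Flatten the JSON entries into a CSV file that COPY can load below.
# Field names are taken from the question's import loop and are assumptions
# about the actual file layout.
with open("translations.json") as src, open("translations.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for entry in json.load(src):
        writer.writerow([
            entry["language"],
            entry["sourcepackage"],
            entry["template_name"],
            entry["translation_domain"],
            entry["translated"],
        ])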
If you just want to add new rows to a table:
COPY tbl FROM '/absolute/path/to/file' (FORMAT csv);
Or if you want to INSERT / UPDATE some rows:
First off: Use enough RAM for temp_buffers (at least temporarily, if you can) so the temp table does not have to be written to disk. Be aware that this has to be done before accessing any temporary tables in this session.
SET LOCAL temp_buffers='128MB';
The in-memory representation takes somewhat more space than the on-disk representation of the data. So for a 100 MB JSON file, minus the JSON overhead, plus some Postgres overhead, 128 MB may or may not be enough. But you don't have to guess, just do a test run and measure it:
select pg_size_pretty(pg_total_relation_size('tmp_x'));
Create the temporary table:
CREATE TEMP TABLE tmp_x (id int, val_a int, val_b text);
Or, to just duplicate the structure of an existing table:
CREATE TEMP TABLE tmp_x AS SELECT * FROM tbl LIMIT 0;
Copy values (should take seconds, not hours):
COPY tmp_x FROM '/absolute/path/to/file' (FORMAT csv);
From there INSERT / UPDATE with plain old SQL. As you are planning a complex query, you may even want to add an index or two on the temp table and run ANALYZE:
ANALYZE tmp_x;
For instance, to update existing rows, matched by id:
UPDATE tbl
SET    col_a = tmp_x.col_a
FROM   tmp_x
WHERE  tbl.id = tmp_x.id;
Finally, drop the temporary table:
DROP TABLE tmp_x;
Or have it dropped automatically at the end of the session.
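If you want to drive the same flow from Python rather than psql, psycopg2 exposes COPY through copy_expert; a rough sketch under the same temp-table assumptions (connection string and CSV file name are placeholders):

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection parameters
with conn, conn.cursor() as cur:
    cur.execute("SET LOCAL temp_buffers = '128MB';")
    cur.execute("CREATE TEMP TABLE tmp_x AS SELECT * FROM tbl LIMIT 0;")
    # Stream the converted CSV into the temp table via COPY ... FROM STDIN.
    with open("translations.csv") as f:
        cur.copy_expert("COPY tmp_x FROM STDIN (FORMAT csv)", f)
    # ... then run the INSERT / UPDATE statements from above in the same session ...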
