Effectively querying DBF files with Python

I have the need to read from a legacy VFP DBF database and gather all rows which have an etd within the current week.
I am using the dbf package; however, it seems that when querying the table, it begins the query at the very first record. This causes performance issues when trying to find data from the last week, as it has to iterate over every row in the table (60k+) every time it runs.
table = dbf.Table(r'\\server\file.dbf')
table.open()
for row in table:
    if (self.monday < row.etd < self.friday) and ('LOC' not in row.route):
        self.datatable.Rows.Add(row.manifest, row.route, row.etd, row.eta, row.inst, row.subname)
    else:
        continue
I tried to "reverse" the table with for row in table[::-1]:
However, this takes the same amount of time, as I believe it needs to load the whole table into memory before applying the [::-1].
What would be a more efficient way to query these DBF files?

As you know, dbf does not support index files. It does, however, have some methods reminiscent of VFP that could help:
# untested
table = ...
potential_records = []
with table:                      # auto opens and closes
    table.bottom()               # go to end of table
    while True:
        table.skip(-1)           # move to previous record
        row = table.current_record
        if self.monday > row.etd:
            # gone back beyond range
            break
        elif row.etd < self.friday:
            potential_records.append(row)
# at this point the table is closed and potential_records should have all
# records in the etd range.
The above will only work if the records are physically ordered by etd.
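If they really are in etd order, you could also cut the scan down further with a binary search on the record index. This is an untested sketch: it assumes len(table) and integer indexing behave the way the table[::-1] slicing in the question suggests, and it uses plain monday/friday variables in place of self.monday/self.friday:
# untested: binary-search the first record with etd near monday instead of
# walking all 60k+ rows; only valid if records are physically ordered by etd.
def first_at_or_after(table, target):
    lo, hi = 0, len(table)
    while lo < hi:
        mid = (lo + hi) // 2
        if table[mid].etd < target:
            lo = mid + 1
        else:
            hi = mid
    return lo

with table:                      # auto opens and closes
    wanted = []
    for idx in range(first_at_or_after(table, monday), len(table)):
        row = table[idx]
        if row.etd >= friday:
            break                # past the range, stop scanning
        if row.etd > monday and 'LOC' not in row.route:
            wanted.append(row)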

Related

Azure Table Storage sync between 2 different storages

I have a list of storage accounts and I would like to copy the exact table content from source_table to destination_table, exactly as it is. That means if I add an entry to source_table, it should be copied to destination_table, and likewise if I delete an entry from the source, I want it deleted from the destination.
So far I have in place this code:
source_table = TableService(account_name="sourcestorageaccount",
                            account_key="source key")
destination_storage = TableService(account_name="destination storage",
                                   account_key="destinationKey")

query_size = 1000

# save data to storage2 and check whether there is data left in the current
# table; if so, recurse
def queryAndSaveAllDataBySize(source_table_name, target_table_name, resp_data: ListGenerator,
                              table_out: TableService, table_in: TableService, query_size: int):
    for item in resp_data:
        tb_name = source_table_name
        del item.etag
        del item.Timestamp
        print("INSERT data:" + str(item) + "into TABLE:" + tb_name)
        table_in.insert_or_replace_entity(target_table_name, item)
    if resp_data.next_marker:
        data = table_out.query_entities(table_name=source_table_name, num_results=query_size,
                                        marker=resp_data.next_marker)
        queryAndSaveAllDataBySize(source_table_name, target_table_name, data, table_out, table_in, query_size)

tbs_out = table_service_out.list_tables()
print(tbs_out)

for tb in tbs_out:
    table = tb.name
    # create a table with the same name in storage2
    table_service_in.create_table(table_name=table, fail_on_exist=False)
    # first query
    data = table_service_out.query_entities(tb.name, num_results=query_size)
    queryAndSaveAllDataBySize(tb.name, table, data, table_service_out, table_service_in, query_size)
As you can see, this block of code runs perfectly: it loops over the source storage account's tables and creates the same tables and their content in the destination storage account. But I am missing the part where I check whether a record has been deleted from the source storage and remove the same record from the destination table.
I hope my question/issue is clear enough; if not, please just ask me for more information.
Thank you so much for any help you can provide.
UPDATE:
The more I think about this, the more the logic gets messy.
One of the solutions I thought about and tried is to have 2 lists storing every single table entity:
Source_table_entries
Destination_table_entries
Once I have populated the lists on each run, I can compare the partition keys, and if a partition key is present in Destination_table_entries but not in the source, it will be marked for deletion.
But this logic will only work flawlessly as long as I have a small table; unfortunately some tables contain hundreds of thousands of entities (and I have hundreds of storages), which sooner or later will become a mess to manage.
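Something along these lines is what I had in mind for the comparison (an untested sketch; it compares (PartitionKey, RowKey) pairs, since that pair is what identifies an entity, and reuses the two TableService objects from above):
# untested sketch of the compare-and-delete step for one table
def delete_orphans(table_out, table_in, table_name):
    # keys that still exist on the source side
    source_keys = {(e.PartitionKey, e.RowKey)
                   for e in table_out.query_entities(table_name,
                                                     select='PartitionKey,RowKey')}
    # anything in the destination that is no longer in the source gets deleted;
    # for very large tables this would still need the num_results/marker paging
    # used in queryAndSaveAllDataBySize above
    for e in table_in.query_entities(table_name, select='PartitionKey,RowKey'):
        if (e.PartitionKey, e.RowKey) not in source_keys:
            table_in.delete_entity(table_name, e.PartitionKey, e.RowKey)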
So another solution I thought about is to keep the same code I have above and just create a new table every week, deleting the older one (from the destination storage). For example:
Table week 1
Table week 2
Table week 3 (this will be deleted)
I read around that I could potentially add metadata to the table for the date and leverage that to decide which table should be deleted, but I cannot find anything in the documentation.
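In case it helps make the idea concrete, this is roughly what I have in mind (untested; it encodes the ISO year/week in the table name instead of metadata, since I could not find a metadata API, and it deletes destination tables older than two weeks):
import datetime

# untested sketch of the weekly-rotation idea
def weekly_name(base, day=None):
    day = day or datetime.date.today()
    year, week, _ = day.isocalendar()
    return '{}{}{:02d}'.format(base, year, week)          # e.g. backup202147

def drop_old_weekly_tables(table_in, base, keep_weeks=2):
    cutoff = weekly_name(base, datetime.date.today() - datetime.timedelta(weeks=keep_weeks))
    for tb in table_in.list_tables():
        # zero-padded year+week makes lexicographic order match chronological order
        if tb.name.startswith(base) and tb.name < cutoff:
            table_in.delete_table(tb.name)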
Can anyone please direct me to the best approach for this? Thank you so much, I am losing my mind on this last bit.

How to best query AWS table without missing data using Python?

I have the following code that iterates through the rows of certain tables in AWS. It grabs the first 50k rows and keeps going as long as there are 50k more rows to grab, and it works extremely quickly because I'm usually only getting the last 2 days' worth of data.
top = 50000
i = 0
days = 2
df = pd.DataFrame()
result = pd.DataFrame()
curs = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)

while (i == 0) or (len(df) == top):
    start_time = (dt.datetime.now() - timedelta(days=days)).strftime("%Y-%m-%d %H:%M:%S")
    sql = f'SELECT * FROM {str.upper(table)} WHERE INSERTED_AT >= \'{start_time}\' OR UPDATED_AT >= \'{start_time}\' LIMIT {top} OFFSET {i}'
    curs.execute(sql)
    data = curs.fetchall()
    df = pd.DataFrame([i.copy() for i in data])
    result = result.append(df, ignore_index=True)
    # load result to snowflake
    i += top
The trouble is I have a very large table that is about 7 million rows long and growing exponentially. I found that if I backload all its data (days=1000) I end up missing data, probably because what was rows 0-50k, 50k-100k, etc. on one iteration has shifted by the next, since the table keeps loading more data while the while loop is running.
What is a better way to load data into Snowflake that avoids these missing-data issues? Do I have to use parallelization to grab all these pieces of the table at once? Even with top=3mil I still find I'm missing large amounts of data, likely because the table rows keep incrementing during the time it takes me to load. Is there a standardized block of code that works well for large tables?
I would skip the Python and favor Redshift's UNLOAD command.
UNLOAD lets you dump the contents of a Redshift table into an S3 bucket:
https://community.snowflake.com/s/article/How-To-Migrate-Data-from-Amazon-Redshift-into-Snowflake
unload ('select * from emp where date = GET_DATE()')
to 's3://mybucket/mypath/'
credentials 'aws_access_key_id=XXX;aws_secret_access_key=XXX'
delimiter '\001'
null '\\N'
escape
[allowoverwrite]
[gzip];
You could set up a stored procedure that runs and have a schedule that kicks it off once per day (I use Astronomer/Airflow for this).
From there you can build an external table on top of the bucket:
https://docs.snowflake.com/en/user-guide/data-load-s3-config-storage-integration.html
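If you do stay in Python, another angle is keyset pagination: page on a monotonically increasing column instead of OFFSET, so rows inserted while the loop runs cannot shift the pages underneath it. A rough, untested sketch reusing the cursor and variables from the question (it assumes INSERTED_AT is never NULL and only moves forward, and unlike the original OR clause it does not pick up later UPDATED_AT changes):
# untested sketch: keyset pagination on INSERTED_AT instead of LIMIT/OFFSET
last_seen = (dt.datetime.now() - timedelta(days=days)).strftime("%Y-%m-%d %H:%M:%S")
result = pd.DataFrame()
while True:
    sql = (f"SELECT * FROM {str.upper(table)} "
           f"WHERE INSERTED_AT > '{last_seen}' "
           f"ORDER BY INSERTED_AT "
           f"LIMIT {top}")
    curs.execute(sql)
    data = curs.fetchall()
    if not data:
        break
    df = pd.DataFrame([row.copy() for row in data])
    result = pd.concat([result, df], ignore_index=True)
    # column key assumed to come back lowercase from the DictCursor; a unique,
    # monotonically increasing ID column would be even more robust than a timestamp
    last_seen = str(df["inserted_at"].max())
    # load df (or result) into Snowflake here, as in the original loop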

PYMSSQL/SQL Server 2014: is there a limit to the length of a list of PKs to use as a subquery?

I have implemented a Python script in order to divide millions of documents (generated by a .NET web application and all stored in a single directory) into subfolders with this scheme: year/month/batch, as all the tasks these documents come from were originally divided into batches.
My Python script queries SQL Server 2014, which contains all the data it needs for each document, in particular the month and year it was created in. It then uses the shutil module to move the pdf. So, I first perform a query to get the list of batches for a given month and year:
queryBatches = '''SELECT DISTINCT IDBATCH
FROM [DBNAME].[dbo].[WORKS]
WHERE YEAR(DATETIMEWORK)={} AND MONTH(DATETIMEWORK)={}'''.format(year, month)
Then I perform:
for batch in batches:
    query = '''SELECT IDWORK, IDBATCH, NAMEDOCUMENT
               FROM [DBNAME].[dbo].[WORKS]
               WHERE NAMEDOCUMENT IS NOT NULL and
                     NAMEDOCUMENT not like '/%/%/%/%.pdf' and
                     YEAR(DATETIMEWORK)={} and
                     MONTH(DATETIMEWORK)={} and
                     IDBATCH={}'''.format(year, month, batch[0])
whose records are collected through a cursor, as described in the pymssql documentation. So I go on with:
IDWorksUpdate = []
row = cursor.fetchone()
while row:
    if moveDocument(...):
        IDWorksUpdate.append(row[0])
    row = cursor.fetchone()
Finally, when the cycle has ended, IDWorksUpdate holds all the PKs of the WORKS rows whose documents were successfully moved into a subfolder. So I close the cursor and the connection and instantiate new ones.
In the end I perform:
subquery = '(' + ', '.join(str(x) for x in IDWorksUpdate) + ')'
query = '''UPDATE [DBNAME].[dbo].[WORKS] SET NAMEDOCUMENT = \'/{}/{}/{}/\'+NAMEDOCUMENT WHERE IDWORK IN {}'''.format(year, month, idbatch, subquery)
newConn = pymssql.connect(server='localhost', database='DBNAME')
newCursor = newConn.cursor()
try:
    newCursor.execute(query)
    newConn.commit()
except:
    newConn.rollback()
    log.write('Error on updating documents names in database of works {}/{} of batch {}'.format(year, month, idbatch))
finally:
    newCursor.close()
    del newCursor
    newConn.close()
This morning I saw that, for a couple of batches only, that update query failed to execute on the database, even though the documents were correctly moved into subdirectories.
Those batches had more than 55,000 documents to move, so maybe IDWorksUpdate got too big and produced a badly formed final update query? I thought 55,000 was not such a big list of integers. The problem is that in pymssql we cannot have more than one connection/cursor at a time to the same database, so I cannot update the records as the respective files are moved. That is why I build a list of the PKs of the works whose documents were correctly moved and update them at the end with a new connection/cursor. What could have happened? Am I doing it wrong?
UPDATE
I've just written a simple script to reproduce the query which is going to be executed to update the records, and this is the error I get from SQL Server:
The query processor ran out of internal resources and could not produce a query plan. This is a rare event and only expected for extremely complex queries or queries that reference a very large number of tables or partitions. Please simplify the query. If you believe you have received this message in error, contact Customer Support Services for more information.
This is the query:
UPDATE [DBNAME].[dbo].[WORKS] SET NAMEDOCUMENT = '/2016/12/1484/'+NAMEDOCUMENT WHERE IDWORK IN (list of 55157 PKs)
The fact is that the table is very big (about 14 million records). But I need that list of PKs because only the tasks whose documents have been correctly processed and moved can be updated. I cannot simply run:
UPDATE [DBNAME].[dbo].[WORKS] SET NAMEDOCUMENT = '/2016/12/1484/'+NAMEDOCUMENT WHERE YEAR(DATETIMEWORK)=2016 and
MONTH(DATETIMEWORK)=12 and IDBATCH=1484
This is because our server was attacked by a crypto locker, so I must process and move only the documents that still exist, while waiting for the others to be released.
Should I split that list into sub-lists? How?
UPDATE 2
It seems the following could be a solution: I split the list of PKs into chunks of 10,000 (a purely experimental number) and then execute as many queries as there are chunks, each with one chunk as its subquery.
def updateDB(listID, y, m, b, log):
    newConn = pymssql.connect(server='localhost', database='DBNAME')
    newCursor = newConn.cursor()
    if len(listID) <= 10000:
        subquery = '(' + ', '.join(str(x) for x in listID) + ')'
        query = '''UPDATE [DBNAME].[dbo].[WORKS] SET NAMEDOCUMENT= \'/{}/{}/{}/\'+NAMEDOCUMENT WHERE IDWORK IN {}'''.format(y, m, b, subquery)
        try:
            newCursor.execute(query)
            newConn.commit()
        except:
            newConn.rollback()
            log.write('...')
            log.write('\n\n')
        finally:
            newCursor.close()
            del newCursor
            newConn.close()
    else:
        chunksPK = [listID[i:i + 10000] for i in xrange(0, len(listID), 10000)]
        for sublistPK in chunksPK:
            subquery = '(' + ', '.join(str(x) for x in sublistPK) + ')'
            query = '''UPDATE [DBNAME].[dbo].[WORKS] SET NAMEDOCUMENT= \'/{}/{}/{}/\'+NAMEDOCUMENT WHERE IDWORK IN {}'''.format(y, m, b, subquery)
            try:
                newCursor.execute(query)
                newConn.commit()
            except:
                newConn.rollback()
                log.write('Could not execute partial {}'.format(query))
                log.write('\n\n')
        newCursor.close()
        del newCursor
        newConn.close()
Could this be a good/secure solution?
As stated in the MSDN document
IN (Transact-SQL)
Explicitly including an extremely large number of values (many thousands of values separated by commas) within the parentheses, in an IN clause can consume resources and return errors 8623 or 8632. To work around this problem, store the items in the IN list in a table, and use a SELECT subquery within an IN clause.
(The error message you cited was error 8623.)
Putting the IN list values into a temporary table and then using
... WHERE IDWORK IN (SELECT keyValue FROM #inListTable)
strikes me as being more straightforward than the "chunking" method you described.
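A rough pymssql sketch of that temp-table approach (untested; a #temp table lives for the duration of one connection, so everything runs on the same connection, and the names from your code are reused):
# untested sketch of the temp-table variant described above
newConn = pymssql.connect(server='localhost', database='DBNAME')
newCursor = newConn.cursor()
try:
    newCursor.execute('CREATE TABLE #inListTable (keyValue INT PRIMARY KEY)')
    # send the ~55k PKs as parameters instead of inlining them in the SQL text
    newCursor.executemany('INSERT INTO #inListTable (keyValue) VALUES (%d)',
                          [(pk,) for pk in IDWorksUpdate])
    newCursor.execute(
        '''UPDATE [DBNAME].[dbo].[WORKS]
           SET NAMEDOCUMENT = %s + NAMEDOCUMENT
           WHERE IDWORK IN (SELECT keyValue FROM #inListTable)''',
        ('/{}/{}/{}/'.format(year, month, idbatch),))
    newConn.commit()
except:
    newConn.rollback()
    log.write('Error on updating document names of batch {}/{}/{}'.format(year, month, idbatch))
finally:
    newCursor.close()
    newConn.close()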

Is it possible to read a table from a database and write to another table in the same database, in the same Python code?

I want to write Python code which reads data from one table, does some operation, and writes the output, along with some more columns from the first table, into another table in the same database, within the same code.
Here is the description:
I have a table (name = table-1) from which I am reading data. I do some operation and get a value in a variable. I want to store that value in table-2 in the same database, in the same code.
I have used two cursors (cur_1 for table-1, to read and do some operation) and (cur_2 for table-2, to insert the value).
I insert the value (using cur_2) at the point in the code where I get the value.
Here is some sample code which I would like to execute:
connection_1 = sqlite3.connect('/home/Documents/attendance_report/data.db')
cur_1 = connection_1.cursor()
connection_2 = sqlite3.connect('/home/Documents/attendance_report/data.db')
cur_2 = connection_2.cursor()

cur_2.execute('select * from table-1')
count = 1
for row in cur_2:
    count += 1
    ## doing some operation in the variable name (xyz)
    new_row = [col1_4m_table_1, col2_4m_table_1, xyz]
    cur_1.execute('''insert into total_time values(?,?,?)''', new_row)
Also, the variable xyz is in timedelta format, and its output should be displayed in a format like 09:23:54; it does not work if I declare the column as a time type.
The code has no syntax or logical errors.
Deep thanks, and feedback is welcome. If you need some clarification, please feel free to ask.
Multiple connections would interfere with each other if they could read and write at the same time.
To execute multiple SQL statements, use two cursors from the same connection.
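A minimal sketch of that: one connection, two cursors, and the timedelta stored as an HH:MM:SS text string (SQLite has no native TIME type). The path and table names are the ones from the question; row[0]/row[1] and the fixed timedelta stand in for the real columns and computation:
import datetime
import sqlite3

connection = sqlite3.connect('/home/Documents/attendance_report/data.db')
read_cur = connection.cursor()      # reads table-1
write_cur = connection.cursor()     # writes total_time

read_cur.execute('SELECT * FROM "table-1"')
for row in read_cur:
    # placeholder for the real computation that produces the timedelta
    xyz = datetime.timedelta(hours=9, minutes=23, seconds=54)
    secs = int(xyz.total_seconds())
    hhmmss = '{:02d}:{:02d}:{:02d}'.format(secs // 3600, (secs % 3600) // 60, secs % 60)
    # row[0]/row[1] stand in for the two columns carried over from table-1
    write_cur.execute('INSERT INTO total_time VALUES (?, ?, ?)',
                      (row[0], row[1], hhmmss))

connection.commit()
connection.close()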

Optimizing performance of Postgresql database writes in Django?

I've got a Django 1.1 app that needs to import data from some big JSON files on a daily basis. To give an idea, one of these files is over 100 MB and has 90K entries that are imported to a PostgreSQL database.
The problem I'm experiencing is that it takes a really long time for the data to be imported, i.e. on the order of hours. I would have expected it to take some time to write that number of entries to the database, but certainly not that long, which makes me think I'm doing something inherently wrong. I've read similar stackexchange questions, and the solutions proposed suggest using the transaction.commit_manually or transaction.commit_on_success decorators to commit in batches instead of on every .save(), which I'm already doing.
As I say, I'm wondering if I'm doing anything wrong (e.g. are the batches to commit too big? too many foreign keys?...), or whether I should just move away from Django models for this function and use the DB API directly. Any ideas or suggestions?
Here are the basic models I'm dealing with when importing data (I've removed some of the fields in the original code for the sake of simplicity)
class Template(models.Model):
    template_name = models.TextField(_("Name"), max_length=70)
    sourcepackage = models.TextField(_("Source package"), max_length=70)
    translation_domain = models.TextField(_("Domain"), max_length=70)
    total = models.IntegerField(_("Total"))
    enabled = models.BooleanField(_("Enabled"))
    priority = models.IntegerField(_("Priority"))
    release = models.ForeignKey(Release)

class Translation(models.Model):
    release = models.ForeignKey(Release)
    template = models.ForeignKey(Template)
    language = models.ForeignKey(Language)
    translated = models.IntegerField(_("Translated"))
And here's the bit of code that seems to take ages to complete:
@transaction.commit_manually
def add_translations(translation_data, lp_translation):
    releases = Release.objects.all()
    # There are 5 releases
    for release in releases:
        # translation_data has about 90K entries
        # this is the part that takes a long time
        for lp_translation in translation_data:
            try:
                language = Language.objects.get(
                    code=lp_translation['language'])
            except Language.DoesNotExist:
                continue

            translation = Translation(
                template=Template.objects.get(
                    sourcepackage=lp_translation['sourcepackage'],
                    template_name=lp_translation['template_name'],
                    translation_domain=\
                        lp_translation['translation_domain'],
                    release=release),
                translated=lp_translation['translated'],
                language=language,
                release=release,
            )
            translation.save()

            # I realize I should commit every n entries
            transaction.commit()

        # I've also got another bit of code to fill in some data I'm
        # not getting from the json files

        # Add missing templates
        languages = Language.objects.filter(visible=True)
        languages_total = len(languages)

        for language in languages:
            templates = Template.objects.filter(release=release)

            for template in templates:
                try:
                    translation = Translation.objects.get(
                        template=template,
                        language=language,
                        release=release)
                except Translation.DoesNotExist:
                    translation = Translation(template=template,
                                              language=language,
                                              release=release,
                                              translated=0,
                                              untranslated=0)
                    translation.save()

                transaction.commit()
Going through your app and processing every single row is a lot slower than loading the data directly on the server, even with optimized code. Also, inserting / updating one row at a time is again a lot slower than processing everything at once.
If the import files are available locally on the server, you can use COPY. Otherwise you can use the meta-command \copy in the standard client psql. You mention JSON; for this to work, you would have to convert the data to a suitable flat format like CSV.
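For the JSON-to-CSV step, a minimal sketch (untested; the field names are the lp_translation keys used in the question, and the file paths are placeholders):
import csv
import json

# flatten the JSON entries into a CSV that COPY can read
with open('/path/to/input.json') as src, open('/absolute/path/to/file', 'w') as dst:
    writer = csv.writer(dst)
    for entry in json.load(src):
        writer.writerow([entry['sourcepackage'], entry['template_name'],
                         entry['translation_domain'], entry['language'],
                         entry['translated']])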
If you just want to add new rows to a table:
COPY tbl FROM '/absolute/path/to/file' (FORMAT csv);
Or if you want to INSERT / UPDATE some rows:
First off: use enough RAM for temp_buffers (at least temporarily, if you can) so the temp table does not have to be written to disk. Be aware that this has to be set before accessing any temporary tables in the session.
SET LOCAL temp_buffers='128MB';
The in-memory representation takes somewhat more space than the on-disk representation of the data. So for a 100 MB JSON file (minus the JSON overhead, plus some Postgres overhead), 128 MB may or may not be enough. But you don't have to guess, just do a test run and measure it:
select pg_size_pretty(pg_total_relation_size('tmp_x'));
Create the temporary table:
CREATE TEMP TABLE tmp_x (id int, val_a int, val_b text);
Or, to just duplicate the structure of an existing table:
CREATE TEMP TABLE tmp_x AS SELECT * FROM tbl LIMIT 0;
Copy values (should take seconds, not hours):
COPY tmp_x FROM '/absolute/path/to/file' (FORMAT csv);
From there INSERT / UPDATE with plain old SQL. As you are planning a complex query, you may even want to add an index or two on the temp table and run ANALYZE:
ANALYZE tmp_x;
For instance, to update existing rows, matched by id:
UPDATE tbl
SET    val_a = tmp_x.val_a
FROM   tmp_x
WHERE  tbl.id = tmp_x.id;
Finally, drop the temporary table:
DROP TABLE tmp_x;
Or have it dropped automatically at the end of the session.
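If you prefer to drive this from Python rather than psql, psycopg2's copy_expert can stream the CSV over the client connection, so the file does not need to sit on the database server. A sketch under those assumptions (DSN, file path, and column names are placeholders following the tmp_x example above):
import psycopg2

# untested sketch: load the CSV into the temp table from the client side,
# then run the set-based UPDATE
conn = psycopg2.connect('dbname=mydb')
cur = conn.cursor()
cur.execute("CREATE TEMP TABLE tmp_x AS SELECT * FROM tbl LIMIT 0")
with open('/path/to/file.csv') as f:
    cur.copy_expert("COPY tmp_x FROM STDIN (FORMAT csv)", f)
cur.execute("""UPDATE tbl
               SET    val_a = tmp_x.val_a
               FROM   tmp_x
               WHERE  tbl.id = tmp_x.id""")
cur.execute("DROP TABLE tmp_x")
conn.commit()
cur.close()
conn.close()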
