I have a list of database names like this: [client_db1, clientdb2, ...]. There are 300 databases in total.
Each database has a collection with the same name, secrets.
I want to fetch records from each collection; each one returns only 10-15 records (not more than that).
In my code base, we iterate through each database and build a dataframe:
import pandas as pd

combined_auth_df = pd.DataFrame()
for client_db in all_client_dbname_data:
    auth_df = get_dataframe_from_collection(client_db)
    combined_auth_df = combined_auth_df.append(auth_df, ignore_index=True)
This process takes a lot of time.
Is there a better way to do this? I am also curious whether we can do the same with multithreading/multiprocessing.
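A minimal sketch of the threaded variant, assuming get_dataframe_from_collection is safe to call concurrently (e.g. each call opens its own client/connection); the names are the ones from the snippet above, and max_workers is just a guess to tune:

from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# Fetch the 10-15 records of each database's "secrets" collection in parallel.
with ThreadPoolExecutor(max_workers=16) as pool:
    frames = list(pool.map(get_dataframe_from_collection, all_client_dbname_data))

# One concat at the end instead of appending inside the loop.
combined_auth_df = pd.concat(frames, ignore_index=True)

Even without threads, collecting the frames in a list and calling pd.concat once is usually much faster than the loop above, because each append copies the whole accumulated dataframe.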
I followed the Cosmos DB example using the SQL API, but getting the data is quite slow. I'm trying to get data for one week (around 1M records). Sample code below.
import azure.cosmos.cosmos_client as cosmos_client

client = cosmos_client.CosmosClient(HOST, {'masterKey': KEY})
database = client.get_database_client(DB_ID)
container = database.get_container_client(COLLECTION_ID)

query = """
    SELECT some columns
    FROM c
    WHERE columna = 'a'
      AND columnb >= '100'
    """

result = list(container.query_items(
    query=query, enable_cross_partition_query=True))
My question is: is there any other way to query the data faster? Does putting the query result into a list make it slow? What am I doing wrong here?
There are a couple of things you could do.
Model your data such that you don't have to do a cross-partition query. These will always take more time because your query needs to touch more partitions to find the data. You can learn more here: Model and partition data in Cosmos DB.
When you only need a single item, you can make this even faster by using a point read (read_item) instead of a query.
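For reference, a point read with the Python SDK looks roughly like this (item_id and partition_key_value are placeholders for your own values):

# Point read: fetch a single item directly by id and partition key,
# without running the query engine or fanning out across partitions.
item = container.read_item(
    item=item_id,
    partition_key=partition_key_value,
)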
I have the following queries:
files = DataAvailability.objects.all()
type1 = files.filter(type1=True)
type2 = files.filter(type2=True)
# People information
people = Person.objects.all()
users = people.filter(is_subject=True)
# count information (this is taking a long time to query them all)
type1_users = type1.filter(person__in=users).count()
type2_users = type2.filter(person__in=users).count()
total_users = files.filter(person__in=users).count()
# another way
total_users2 = files.filter(person__in=users)
type1_users2 = total_users2.filter(type1=True).count()
type2_users2 = total_users2.filter(type2=True).count()
total_count = total_users2.count()
I thought about building a query with .values() and putting the results into a set(), then running set operations (like difference) on it.
Is this the only way to improve the query time?
You can always fall back to raw SQL: https://docs.djangoproject.com/en/2.0/topics/db/sql/#performing-raw-queries
Example:
# Don't do this, it's insecure
YourModel.objects.raw(f"select id from {YourModel._meta.db_table}")

# Do it like this instead, to avoid SQL injection issues
YourModel.objects.raw("select id from app_model_name")
The name of the table can be obtained from YourModel._meta.db_table, and you can also get the SQL of a queryset like this:
type1_users = type1.filter(person__in=users)
type1_users.query.__str__()
So you can join this query to another one.
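For example, a rough sketch of how the pieces could fit together (the table and column names below are hypothetical; only QuerySet.query and connection.cursor() are real APIs):

from django.db import connection

# The SQL the ORM would run for the filtered queryset, for inspection or reuse:
inner_sql = str(type1_users.query)

# Hand-written aggregate SQL executed directly on a cursor
# (hypothetical table/column names; adjust to your schema and database).
with connection.cursor() as cursor:
    cursor.execute(
        "SELECT COUNT(*) FROM app_dataavailability d "
        "JOIN app_person p ON p.id = d.person_id "
        "WHERE d.type1 = true AND p.is_subject = true"
    )
    type1_users_count = cursor.fetchone()[0]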
I don't have to run those queries very often (once a day at most), so I run them from a cron job that exports the data to a file (you could also write the results to a table in your database, for auditing purposes for example). I then read the file and use the data from there. It's working well and fast.
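A minimal sketch of that approach as a Django management command run from cron (the command module path and output file are made up; the counts mirror the queries from the question):

# yourapp/management/commands/export_counts.py  (hypothetical location)
import json

from django.core.management.base import BaseCommand

from yourapp.models import DataAvailability, Person  # hypothetical app name


class Command(BaseCommand):
    help = "Export the daily counts to a JSON file (run once a day from cron)."

    def handle(self, *args, **options):
        users = Person.objects.filter(is_subject=True)
        files = DataAvailability.objects.filter(person__in=users)
        counts = {
            "total": files.count(),
            "type1": files.filter(type1=True).count(),
            "type2": files.filter(type2=True).count(),
        }
        with open("/tmp/data_availability_counts.json", "w") as fh:
            json.dump(counts, fh)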
I create a list of queries with a loop:
from datetime import timedelta

from django.db.models import Avg, F

chunks = 1000
chunkTimeInS = (endDate - startDate).total_seconds() / chunks

queryList = []
for timeChunk in range(0, chunks):
    startTime = startDate + timedelta(seconds=chunkTimeInS * timeChunk)
    endTime = startDate + timedelta(seconds=chunkTimeInS * (timeChunk + 1))
    queryList.append(myModel.objects.using('myDB')
                     .exclude(myValue1=-99)
                     .filter(timestamp__gte=startTime, timestamp__lt=endTime)
                     .order_by("timestamp")
                     .aggregate(
                         Avg('myValue1'),
                         Avg('myValue2'),
                         myValue3=Avg(F('myValue3') * F('myValue3')),
                     ))
I want to save all the data to a list by doing dataList = list(queryList), but this call takes way too long. I guess this is due to the large number of queries in queryList.
Is there a way to merge this list into one query? Or maybe there are other solutions to speed up the database access.
The database in use is Oracle 11g.
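One possible way to collapse the per-chunk aggregates into a single grouped query (a sketch, not from the question: it assumes fixed-width buckets such as hours are an acceptable substitute for the computed chunk boundaries, and that the Trunc functions are supported on your Django/Oracle combination):

from django.db.models import Avg, F
from django.db.models.functions import TruncHour

# One round trip: group by hour bucket and aggregate per group,
# instead of issuing one aggregate() query per chunk.
dataList = list(
    myModel.objects.using('myDB')
    .exclude(myValue1=-99)
    .filter(timestamp__gte=startDate, timestamp__lt=endDate)
    .annotate(bucket=TruncHour('timestamp'))
    .values('bucket')
    .annotate(
        avg1=Avg('myValue1'),
        avg2=Avg('myValue2'),
        myValue3=Avg(F('myValue3') * F('myValue3')),
    )
    .order_by('bucket')
)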
I've got a Django 1.1 app that needs to import data from some big JSON files on a daily basis. To give an idea, one of these files is over 100 MB and has 90K entries that are imported into a PostgreSQL database.
The problem I'm experiencing is that it takes a really long time for the data to be imported, on the order of hours. I would have expected it to take some time to write that number of entries to the database, but certainly not that long, which makes me think I'm doing something inherently wrong. I've read similar stackexchange questions, and the solutions proposed suggest using the transaction.commit_manually or transaction.commit_on_success decorators to commit in batches instead of on every .save(), which I'm already doing.
As I say, I'm wondering if I'm doing anything wrong (e.g. batches to commit that are too big? too many foreign keys?...), or whether I should just move away from Django models for this function and use the DB API directly. Any ideas or suggestions?
Here are the basic models I'm dealing with when importing data (I've removed some of the fields in the original code for the sake of simplicity)
class Template(models.Model):
    template_name = models.TextField(_("Name"), max_length=70)
    sourcepackage = models.TextField(_("Source package"), max_length=70)
    translation_domain = models.TextField(_("Domain"), max_length=70)
    total = models.IntegerField(_("Total"))
    enabled = models.BooleanField(_("Enabled"))
    priority = models.IntegerField(_("Priority"))
    release = models.ForeignKey(Release)


class Translation(models.Model):
    release = models.ForeignKey(Release)
    template = models.ForeignKey(Template)
    language = models.ForeignKey(Language)
    translated = models.IntegerField(_("Translated"))
And here's the bit of code that seems to take ages to complete:
@transaction.commit_manually
def add_translations(translation_data, lp_translation):

    releases = Release.objects.all()

    # There are 5 releases
    for release in releases:

        # translation_data has about 90K entries
        # this is the part that takes a long time
        for lp_translation in translation_data:

            try:
                language = Language.objects.get(
                    code=lp_translation['language'])
            except Language.DoesNotExist:
                continue

            translation = Translation(
                template=Template.objects.get(
                    sourcepackage=lp_translation['sourcepackage'],
                    template_name=lp_translation['template_name'],
                    translation_domain=lp_translation['translation_domain'],
                    release=release),
                translated=lp_translation['translated'],
                language=language,
                release=release,
            )
            translation.save()

        # I realize I should commit every n entries
        transaction.commit()

        # I've also got another bit of code to fill in some data I'm
        # not getting from the json files

        # Add missing templates
        languages = Language.objects.filter(visible=True)
        languages_total = len(languages)

        for language in languages:
            templates = Template.objects.filter(release=release)

            for template in templates:
                try:
                    translation = Translation.objects.get(
                        template=template,
                        language=language,
                        release=release)
                except Translation.DoesNotExist:
                    translation = Translation(
                        template=template,
                        language=language,
                        release=release,
                        translated=0,
                        untranslated=0)
                    translation.save()

        transaction.commit()
Going through your app and processing every single row is a lot slower than loading the data directly on the server, even with optimized code. And inserting / updating one row at a time is again a lot slower than processing everything at once.
If the import files are available locally on the server, you can use COPY. Otherwise you can use the meta-command \copy in the standard interface psql. You mention JSON; for this to work, you would have to convert the data to a suitable flat format like CSV.
If you just want to add new rows to a table:
COPY tbl FROM '/absolute/path/to/file' (FORMAT csv);
Or if you want to INSERT / UPDATE some rows:
First off: Use enough RAM for temp_buffers (at least temporarily, if you can) so the temp table does not have to be written to disk. Be aware that this has to be done before accessing any temporary tables in this session.
SET LOCAL temp_buffers='128MB';
The in-memory representation takes somewhat more space than the on-disk representation of the data. So for a 100 MB JSON file, minus the JSON overhead, plus some Postgres overhead, 128 MB may or may not be enough. But you don't have to guess, just do a test run and measure it:
SELECT pg_size_pretty(pg_total_relation_size('tmp_x'));
Create the temporary table:
CREATE TEMP TABLE tmp_x (id int, val_a int, val_b text);
Or, to just duplicate the structure of an existing table:
CREATE TEMP TABLE tmp_x AS SELECT * FROM tbl LIMIT 0;
Copy values (should take seconds, not hours):
COPY tmp_x FROM '/absolute/path/to/file' (FORMAT csv);
From there INSERT / UPDATE with plain old SQL. As you are planning a complex query, you may even want to add an index or two on the temp table and run ANALYZE:
ANALYZE tmp_x;
For instance, to update existing rows, matched by id:
UPDATE tbl
SET    col_a = tmp_x.col_a
FROM   tmp_x
WHERE  tbl.id = tmp_x.id;
Finally, drop the temporary table:
DROP TABLE tmp_x;
Or have it dropped automatically at the end of the session.
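For the JSON-to-CSV conversion plus the bulk load from Python, a sketch along these lines could replace the per-row .save() loop (file paths, column list and connection details are placeholders; psycopg2's copy_expert is the only non-standard call used):

import csv
import json

import psycopg2

# Flatten the JSON entries into a CSV file with one column per table field.
with open('/path/to/translations.json') as src:
    entries = json.load(src)
with open('/path/to/translations.csv', 'w') as dst:
    writer = csv.writer(dst)
    for entry in entries:
        writer.writerow([entry['sourcepackage'], entry['template_name'],
                         entry['translation_domain'], entry['language'],
                         entry['translated']])

# Bulk-load the CSV into the temp table, then INSERT / UPDATE from it in SQL.
conn = psycopg2.connect("dbname=mydb user=myuser")  # placeholder DSN
with conn:
    cur = conn.cursor()
    cur.execute("CREATE TEMP TABLE tmp_x (sourcepackage text, template_name text, "
                "translation_domain text, language text, translated int)")
    with open('/path/to/translations.csv') as f:
        cur.copy_expert("COPY tmp_x FROM STDIN (FORMAT csv)", f)
    # ...followed by the INSERT / UPDATE statements shown above.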
I am writing a script in Python to QC data in a proprietary ESRI database table. The purpose of the script is not to modify invalid data, but simply to report invalid data to the user via a csv file. I am using ESRI's ArcPy package to access each individual record with arcpy.SearchCursor. The SearchCursor is the only way to access each individual record in the ESRI formats.
As I scroll through each record of the tables, I do multiple QC checks to validate specific business logic. One of those checks is looking for duplicate data in particular fields, and one of those fields may be geometry. I have done this by creating an empty container object for each of those fields, and as I check each record I use the following logic:
for field in dupCheckFields:
    if row.getValue(field) in fieldValues[field]:
        dupValues.add(row.getValue(idField))
    else:
        fieldValues[field].append(row.getValue(field))
The above code is an example of the basic logic I use. Where I run into trouble is that each of these tables may contain anywhere from 5,000 to 10 million records. I either run out of memory or the performance grinds to a halt.
I have tried the following container types: sets, lists, dictionaries, ZODB + BList, and Shelve.
With the in-memory types (sets, lists, dictionaries) the process is very fast at the start, but it gets much slower as it progresses, and with these types I run out of memory if the table has many records. With the persistent data types I don't run out of memory, but processing takes a very long time.
I only need the data while the script is running and any persistent data files will be deleted upon completion.
Question: Is there a better container type out there to provide low-memory storage of lots of data without a large cost in performance when accessing the data?
System: Win7 64-bit, Python 2.6.5 32-bit, 4 GB RAM
Thanks in advance for your help.
EDIT:
Sample SQLite code:
import sqlite3, os, arcpy, timeit

fc = r"path\to\feature\class"
# test feature class was in ESRI ArcSDE format and contained "." characters
# separating database name, owner, and feature class name
fcName = fc.split(".")[-1]

# convert ESRI data types to SQLite data types
dataTypes = {"String": "text", "Guid": "text", "Double": "real", "SmallInteger": "integer"}
fields = [(field.name, dataTypes[field.type])
          for field in arcpy.ListFields(fc)
          if field.name != arcpy.Describe(fc).OIDFieldName]

# SQL string to create table in SQLite with same schema as feature class
createTableString = """create table %s(%s, primary key(%s))""" % (
    fcName, ",\n".join('%s %s' % field for field in fields), fields[0][0])

# SQL string to insert data into SQLite table
insertString = """insert into %s values(%s)""" % (
    fcName, ",".join(["?" for i in xrange(len(fields))]))

# location to save SQLite database
loc = r'C:\TEMPORARY_QC_DATA'

def createDB():
    conn = sqlite3.connect(os.path.join(loc, 'database.db'))
    cur = conn.cursor()

    cur.execute(createTableString)
    conn.commit()

    rows = arcpy.SearchCursor(fc)
    i = 0
    for row in rows:
        try:
            cur.execute(insertString, [row.getValue(field[0]) for field in fields])
            if i % 10000 == 0:
                print i, "records"
                conn.commit()
            i += 1
        except sqlite3.IntegrityError:
            pass
    print i, "records"

t1 = timeit.Timer("createDB()", "from __main__ import createDB")
print t1.timeit(1)
Unfortunately I cannot share the test data I used with this code; however, it was an ESRI ArcSDE geodatabase table containing approx. 10 fields and approx. 7 million records.
I tried to use timeit to determine how long this process took, but after 2 hours of processing only 120,000 records were complete.
If you store hashes in (compressed) files, you could stream through them to compare hashes and look for duplicates. Streaming usually has very low memory requirements: you can set whatever buffer you want, say one line per hashed record. The tradeoff is generally time, particularly if you add compression, but if you order the files by some criteria, then you may be able to walk through the uncompressed streams to compare records more quickly.
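A minimal sketch of the hashing idea, assuming the goal is only duplicate detection: keep a fixed-size digest per record instead of the full field value. Here seen replaces the fieldValues container from the question; the file-based streaming variant would append these digests to files and compare sorted streams instead of holding sets in memory.

import hashlib

# Digests instead of raw values keep memory bounded per record.
# For geometry fields you would hash a stable text form such as WKT (an assumption).
seen = dict((field, set()) for field in dupCheckFields)

for field in dupCheckFields:
    value = row.getValue(field)
    digest = hashlib.md5(str(value)).digest()  # 16 bytes per value
    if digest in seen[field]:
        dupValues.add(row.getValue(idField))
    else:
        seen[field].add(digest)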
I think I'd evaluate storing the persistent data (such as the known field values and counts) in a SQLite database. It is of course a trade-off between memory usage and performance.
If you use a persistence mechanism that supports concurrent access, you can probably parallelise the processing of your data using multiprocessing. Once complete, a summary of errors can be generated from the database.
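For the final summary step, a sketch of pulling duplicates out of the SQLite table built in the question's EDIT (some_field is a placeholder for whichever field is being checked, and cur is a cursor on that database):

# Report duplicates per checked field with one GROUP BY query
# instead of tracking them in Python containers.
cur.execute("""
    SELECT some_field, COUNT(*) AS n
    FROM %s
    GROUP BY some_field
    HAVING COUNT(*) > 1
""" % fcName)
for value, n in cur.fetchall():
    print value, "appears", n, "times"  # Python 2 print, matching the question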