We currently have a table containing a VARCHAR2(4000) column. This became a limitation, as the 'text' being inserted can grow beyond 4000 characters, so we changed that column's data type to CLOB. Now both insertions and selections are far slower than they were with VARCHAR2(4000).
We are using Python combined with SqlAlchemy to do both the insertions and the retrieval of the data. In simple words, the implementation itself did not change at all, only the column data type in the database.
Does anyone have any idea on how to tweak the performance?
There are two kinds of storage for CLOBs:
in row
The CLOB is stored like any other column in the row. This can only be done for CLOBs up to a certain size (approximately 4K). CLOBs larger than this will be stored in a separate segment (the "lobsegment").
out of row
The CLOB is always stored out of the row, in the lobsegment.
You can see which is being used for your table by checking USER_LOBS.
It is possible, particularly in the first ('in row') case, that your table consumes more blocks for the "normal" rows because of the interspersed LOB data, and hence takes longer to scan.
See here: https://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:9536389800346757758
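To check which storage mode applies, a query against USER_LOBS along these lines should work (MYTABLE is a placeholder for your table name; the IN_ROW column reports YES or NO):

```sql
SELECT table_name, column_name, segment_name, in_row
FROM   user_lobs
WHERE  table_name = 'MYTABLE';
```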
If your LOB is stored separately, you have to navigate a separate indexing architecture and additional data files to get to the data. This will slow things down because of the additional background operations and disk I/O involved. There isn't much you can do about that except minimize the number of operations that access the LOB data, or get faster storage media.
LOBs have to be read and/or written behind the scenes by your client driver using a loop that reads or writes into a small buffer and keeps doing so until it reaches the end of the LOB. That buffer is smaller than you'd want it to be, and so it becomes quite chatty (lots of back and forth packets across the network between client and database). Not to mention creating the LOB locator and closing it. Too much back-and-forth.
Given that you only recently made this change because some data exceeded 4KB, most of your data is likely < 4KB. Take advantage of that situation: work with the < 4KB values as varchar2(4000), and only use CLOB where you have to (for values > 4KB).
Actual implementation of course depends on what you're doing. But this concept should be applied in some way. If you are selecting data, then select like this:
SELECT CAST(clobcolumn AS varchar2(4000)) clobcolumn_as_varchar
FROM mytable
WHERE NVL(dbms_lob.getlength(clobcolumn),0) <= 4000
AND [other predicates you need]
Then do a separate select for the larger rows:
SELECT clobcolumn
FROM mytable
WHERE NVL(dbms_lob.getlength(clobcolumn),0) > 4000
AND [other predicates you need]
For inserts, define one client variable that will be mapped to a varchar and one that will be mapped to a CLOB; use the first for < 4KB values and the second for > 4KB values, with separate insert statements. Even when the varchar2 version is inserted into a CLOB column, Oracle does the implicit conversion for you within the database, and it's fast. The key point is that the network conversation used varchar2.
Note that this approach can make a world of difference if the great majority of your values are small (< 4KB). If most of them are over the 4KB boundary, then there's little benefit.
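A minimal sketch of that split-by-size routing on the client side (the function name and the 4000-byte limit check are illustrative, not a real API; the actual binding to VARCHAR2 vs CLOB parameters happens in your driver):

```python
# Route each value by its encoded byte length so that values that fit in
# VARCHAR2(4000) travel over the wire as varchars, and only the oversized
# ones pay the CLOB round-trip cost. route_rows is a hypothetical helper.

VARCHAR2_LIMIT = 4000

def route_rows(rows):
    """Partition (key, text) rows into (small, large) by text byte length."""
    small, large = [], []
    for key, text in rows:
        # Oracle's 4000 limit is in bytes, so measure the encoded length.
        if len(text.encode("utf-8")) <= VARCHAR2_LIMIT:
            small.append((key, text))
        else:
            large.append((key, text))
    return small, large

small, large = route_rows([(1, "short"), (2, "x" * 5000)])
# `small` would then be bound as VARCHAR2 parameters and `large` as CLOBs,
# in two separate INSERT statements.
```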
You could also ask your DBA whether it is possible to upgrade the database to max_string_size=EXTENDED; the maximum VARCHAR2 size would then be 32K.
Related
I like the idea of having my historical stock data stored in a database instead of CSV. Is there a speed penalty for fetching large data sets from MariaDB compared to CSV?
Quite the opposite. Whenever you fetch data from a CSV, unless you have a stopping condition (for example, take the first entry with x = 3) you must parse every single line in the file. This is an expensive operation because not only do you have to read all of the lines (making it O(n)), but in general, you will be typecasting as well. In a database, you have already processed all of the lines, and if in this case there is an index on x or whatever attribute you are searching by, the database will be able to find the information in O(log(n)) time and will not look at the vast majority of entries.
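A small illustration of the index effect, using SQLite here only because it is self-contained (any indexed database behaves similarly; the table and index names are made up): with an index on x, the planner does an index SEARCH for `x = 3` instead of a full-table SCAN.

```python
import sqlite3

# Build a toy table with an index on the search column.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE prices (x INTEGER, price REAL)")
con.executemany("INSERT INTO prices VALUES (?, ?)",
                [(i, i * 1.5) for i in range(10000)])
con.execute("CREATE INDEX idx_x ON prices (x)")

# The query plan reports an index search, not a sequential scan of all rows.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT price FROM prices WHERE x = 3"
).fetchall()
print(plan)
```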
I've got 500K rows I want to insert into PostgreSQL using SQLAlchemy.
For speed, I'm inserting them using session.bulk_insert_mappings().
Normally, I'd break up the insert into smaller batches to minimize session bookkeeping. However, bulk_insert_mappings() uses dicts and bypasses a lot of the traditional session bookkeeping.
Will I still see a speed improvement if I break the insert up into smaller discrete batches, say doing an insert every 10K rows?
If so, should I close the PG transaction after every 10K rows, or leave it open the whole time?
In my experience, you'll see substantial performance improvements if you use INSERT INTO tbl (column1, column2) VALUES (...), (...), ...; as opposed to bulk_insert_mappings, which uses executemany. In this case you'll want to batch the rows at least on a statement level for sanity.
SQLAlchemy supports generating a multi-row VALUES clause for a single INSERT statement, so you don't have to hand-craft the statement.
Committing between batches probably won't have much of an effect on the performance, but the reason to do it would be to not keep an open transaction for too long, which could impact other transactions running on the server.
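The batching itself can be sketched as a plain chunking helper (the batch size of 10,000 and the table name in the comment are illustrative, not tuned recommendations):

```python
# Chunk the rows into fixed-size batches so each multi-row INSERT stays a
# manageable size.

def chunks(rows, batch_size):
    """Yield successive batches of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

rows = [{"id": i, "value": f"row {i}"} for i in range(25)]
batches = list(chunks(rows, 10))
# With SQLAlchemy Core this would drive something like:
#   for batch in chunks(rows, 10_000):
#       conn.execute(tbl.insert().values(batch))  # one multi-row VALUES clause
```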
You can also experiment with using COPY to load it into a temporary table, then INSERTing from that table.
I have a large SQLite db where I am joining a 3.5M-row table onto itself. I use SQLite since it is the serialization format of my python3 application and the flatfile format is important in my workflow. When iterating over the rows of this join (around 55M rows) using:
cursor.execute('SELECT DISTINCT p.pid, pp.pname, pp.pid FROM proteins '
               'AS p JOIN proteins AS pp USING(pname) ORDER BY p.pid')
for row in cursor:
    # do stuff with row
EXPLAIN QUERY PLAN gives the following:
0|0|0|SCAN TABLE proteins AS p USING INDEX pid_index (~1000000 rows)
0|1|1|SEARCH TABLE proteins AS pp USING INDEX pname_index (pname=?) (~10 rows)
0|0|0|USE TEMP B-TREE FOR DISTINCT
sqlite3 errors with "database or disk is full" after, say, 1,000,000 rows, which seems to indicate a full SQLite on-disk temp store. Since I have enough RAM on my current box, that can be solved by setting the temp store to memory, but that's suboptimal, since in that case all the RAM seems to get used up and I tend to run 4 or so of these processes in parallel. My (probably incorrect) assumption was that the iterator was a generator and would not put a large load on memory, unlike e.g. fetchall, which loads all rows. However, I now run out of disk space (on a small SSD scratch disk), so I assume SQLite needs to store the results somewhere.
A way around this may be to run chunks of SELECT ... LIMIT x OFFSET y queries, but they get slower each time a bigger OFFSET is used. Is there any other way to run this query? What is stored in these temporary files? They seem to grow the further I iterate.
0|0|0|USE TEMP B-TREE FOR DISTINCT
Here's what's using the disk.
In order to support DISTINCT, SQLite has to store what rows already appeared in the query. For a large number of results, this set can grow huge. So to save on RAM, SQLite will temporarily store the distinct set on disk.
Removing the DISTINCT clause is an easy way to avoid the issue, but it changes the meaning of the query; you can now get duplicate rows. If you don't mind that, or you have unique indices or some other way of ensuring that you never get duplicates, then that won't matter.
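If you do drop DISTINCT, one option (a sketch, not taken from the answer) is to exploit the existing ORDER BY p.pid and filter duplicates client-side, with memory bounded by one p.pid group instead of SQLite's temp B-tree over the whole result set. The generator below stands in for iterating the real cursor:

```python
# Duplicate rows share the same p.pid, so with rows sorted by that first
# field we only need to remember the rows seen within the current group.

def dedup_grouped(rows):
    """Drop duplicate rows from an iterable sorted by its first field."""
    current_key, seen = object(), set()
    for row in rows:
        if row[0] != current_key:        # new p.pid group: reset the seen-set
            current_key, seen = row[0], set()
        if row[1:] not in seen:
            seen.add(row[1:])
            yield row

rows = [(1, "a", 2), (1, "a", 2), (1, "b", 3), (2, "a", 2)]
unique = list(dedup_grouped(rows))
```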
What you are trying to do with SQLite3 is a very bad idea; let me try to explain why.
You have the raw data on disk, where it fits and is readable.
You generate a result inside SQLite3 which expands greatly.
You then try to transfer this very large dataset through an SQL connector.
Relational databases in general are not made for this kind of operation, and SQLite3 is no exception. They were made for small, quick queries that live for a fraction of a second and return a couple of rows.
You would be better off using another tool.
Reading the whole dataset into Python, using pandas for instance, is my recommended solution. Using itertools is also a good idea.
I have lists of about 20,000 items that I want to insert into a table (with about 50,000 rows in it). Most of these items update certain fields in existing rows and a minority will insert entirely new rows.
I am accessing the database twice for each item. First is a select query that checks whether the row exists. Next I insert or update a row depending on the result of the select query. I commit each transaction right after the update/insert.
For the first few thousand entries, I am getting through about 3 or 4 items per second, then it starts to slow down. By the end it takes more than 1/2 second for each iteration. Why might it be slowing down?
My average times are: 0.5 seconds for an entire run divided up as .18s per select query and .31s per insert/update. The last 0.01 is due to a couple of unmeasured processes to do with parsing the data before entering into the database.
Update
I've commented out all the commits as a test and got no change, so that's not it (any more thoughts on optimal committing would be welcome, though).
As to table structure:
Each row has twenty columns. The first four are TEXT fields (all set with the first insert) and the remaining 16 are REAL fields, one of which is populated by the initial insert statement.
Over time, the outstanding REAL fields will be populated by the process I'm trying to optimize here.
I don't have an explicit index, though one of the fields is a unique key for each row.
I should note that as the database has gotten larger both the SELECT and UPDATE queries have taken more and more time, with a particularly remarkable deterioration in performance in the SELECT operation.
I initially thought this might be some kind of structural problem with SQLITE (whatever that means), but haven't been able to find any documentation anywhere that suggests there are natural limits to the program.
The database is about 60ish megs, now.
I think your bottleneck is that you commit with each insert/update:
I commit each transaction right after the update/insert.
Either stop doing that, or at least switch to WAL journaling; see this answer of mine for why:
SQL Server CE 4.0 performance comparison
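Both suggestions can be sketched in a few lines (shown with an in-memory database for self-containment; use a file path in practice, and note the WAL pragma has no effect on `:memory:`):

```python
import sqlite3

# Switch to WAL journaling and wrap many writes in one transaction
# instead of committing after every row.
con = sqlite3.connect(":memory:")  # use a file path for a real database
con.execute("PRAGMA journal_mode=WAL")  # shown for the pattern; no-op on :memory:
con.execute("CREATE TABLE items (k TEXT PRIMARY KEY, v REAL)")

rows = [(str(i), i * 0.5) for i in range(1000)]
with con:  # one transaction, one commit for the whole batch
    con.executemany("INSERT INTO items VALUES (?, ?)", rows)

count = con.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```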
If you have a primary key you can optimize out the select by using the ON CONFLICT clause with INSERT INTO:
http://www.sqlite.org/lang_conflict.html
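A minimal sketch of that idea (the table and column names are made up; the `ON CONFLICT ... DO UPDATE` upsert form needs SQLite 3.24 or newer): with a primary key in place, a single upsert replaces the select-then-insert/update round trip.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (k TEXT PRIMARY KEY, v REAL)")

def upsert(key, value):
    # One statement: insert a new row, or update the existing one on
    # primary-key conflict; no preliminary SELECT needed.
    con.execute(
        "INSERT INTO items (k, v) VALUES (?, ?) "
        "ON CONFLICT(k) DO UPDATE SET v = excluded.v",
        (key, value),
    )

upsert("a", 1.0)
upsert("a", 2.0)   # second call updates instead of erroring
v = con.execute("SELECT v FROM items WHERE k = 'a'").fetchone()[0]
```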
EDIT : Earlier I meant to write "if you have a primary key " rather than foreign key; I fixed it.
Edit: shame on me. I misread the question and somehow understood this was for MySQL rather than SQLite... Oops.
Please disregard this response, other than to get generic ideas about updating DBMSes. The likely solution to the OP's problem is the overly frequent commits, as pointed out in sixfeetsix's response.
A plausible explanation is that the table gets fragmented.
You can verify this by defragmenting the table every so often and checking whether performance returns to the 3-or-4-items-per-second rate. (Which, BTW, is a priori relatively slow, but may depend on hardware, data schema, and other specifics.) Of course, you'll need to consider the amount of time defragmentation takes and balance it against the time lost to the slow update rate, to find an optimal frequency for the defragmentation.
If the slowdown is effectively caused, at least in part, by fragmentation, you may also look into performing the updates in a particular order. It is hard to be more specific without knowing details of the overall schema and the statistical profile of the data, but fragmentation is indeed sensitive to the order in which various changes to the database take place.
A final suggestion to boost overall update performance, if this is possible, is to drop a few indexes on the table, perform the updates, and then recreate the indexes anew. This counter-intuitive approach works for relatively big updates because the cost of re-creating the indexes is often less than the cumulative cost of maintaining them as the update progresses.
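The drop-update-recreate pattern looks like this (sketched with SQLite for self-containment, though the answer above was written with MySQL in mind; the table and index names are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (k TEXT PRIMARY KEY, v REAL)")
con.execute("CREATE INDEX idx_v ON items (v)")
con.executemany("INSERT INTO items VALUES (?, ?)",
                [(str(i), float(i)) for i in range(1000)])

con.execute("DROP INDEX idx_v")                 # stop paying per-row index upkeep
with con:
    con.execute("UPDATE items SET v = v * 2")   # the big batch of updates
con.execute("CREATE INDEX idx_v ON items (v)")  # rebuild the index once

top = con.execute("SELECT MAX(v) FROM items").fetchone()[0]
```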
There are two columns in the table inside a MySQL database. The first column contains the fingerprint, while the second contains the list of documents which have that fingerprint. It's much like an inverted index built by search engines. An instance of a record inside the table is shown below:
34 "doc1, doc2, doc45"
The number of fingerprints is very large (it can range up to trillions). There are basically the following operations on the database: inserting/updating a record, and retrieving a record according to a match on the fingerprint. The table-definition Python snippet is:
self.cursor.execute("CREATE TABLE IF NOT EXISTS `fingerprint` (fp BIGINT, documents TEXT)")
And the snippet for insert/update operation is:
if self.cursor.execute("UPDATE `fingerprint` SET documents=CONCAT(documents,%s) WHERE fp=%s", (","+newDocId, thisFP)) == 0L:
    self.cursor.execute("INSERT INTO `fingerprint` VALUES (%s, %s)", (thisFP, newDocId))
The only bottleneck I have observed so far is the query time in MySQL. My whole application is web-based, so time is a critical factor. I have also thought of using Cassandra, but have little knowledge of it. Please suggest a better way to tackle this problem.
Get a high-end database. Oracle has some offers; so does SQL Server.
TRILLIONS of entries is well beyond the scope of a normal database. This is very high-end, very special stuff, especially if you want decent performance. Also get the hardware for it: this means a decent mid-range server, 128+ GB of memory for caching, and either a decent SAN or a good enough DAS setup via SAS.
Remember, TRILLIONS means:
1,000 GB of storage used for EVERY single byte stored per entry.
If the fingerprint is stored as an int64, that is 8,000 GB of disc space for this data alone.
Or do you try running that from a small, cheap server with a couple of 2 TB discs? Good luck.
That data structure isn't a great fit for SQL - the 'correct' design in SQL would be to have a row for each fingerprint/document pair, but querying would be impossibly slow unless you add an index that would take up too much space. For what you are trying to do, SQL adds a lot of overhead to support functions you don't need while not supporting the multiple value column that you do need.
A redis cluster might be a good fit - the atomic set operations should be perfect for what you are doing, and with the right virtual memory setup and consistent hashing to distribute the fingerprints across nodes it should be able to handle the data volume. The commands would then be
SADD fingerprint, docid
to add or update the record, and
SMEMBERS fingerprint
to get all the document ids with that fingerprint.
SADD is O(1). SMEMBERS is O(n), but n is the number of documents in the set, not the number of documents/fingerprints in the system, so effectively also O(1) in this case.
The SQL insert you are currently using is O(n) with n being the very large total number of records, because the records are stored as an ordered list which must be reordered on insert rather than a hash table which is constant time for both get and set.
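The set semantics the Redis answer relies on can be sketched with a plain Python dict of sets standing in for a single Redis node (the real commands are SADD and SMEMBERS; `index`, `sadd`, and `smembers` here are illustrative stand-ins):

```python
# A toy in-memory inverted index mirroring Redis set behaviour.
index = {}

def sadd(fingerprint, doc_id):
    """SADD analogue: add doc_id to the set for fingerprint; O(1), idempotent."""
    index.setdefault(fingerprint, set()).add(doc_id)

def smembers(fingerprint):
    """SMEMBERS analogue: return all document ids for a fingerprint."""
    return index.get(fingerprint, set())

sadd(34, "doc1")
sadd(34, "doc2")
sadd(34, "doc1")   # duplicate add is a no-op, unlike the CONCAT approach
docs = smembers(34)
```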
Greenplum data warehouse, free of charge (FOC), Postgres-driven. Good luck...