I like the idea of having my historical stock data stored in a database instead of CSV. Is there a speed penalty for fetching large data sets from MariaDB compared to CSV
Quite the opposite. Whenever you fetch data from a CSV, unless you have a stopping condition (for example, take the first entry with x = 3), you must parse every single line in the file. This is an expensive operation: not only do you have to read all of the lines (making it O(n)), but in general you will be typecasting the values as well. In a database, all of the lines have already been parsed and stored, and if there is an index on x (or whatever attribute you are searching by), the database can find the matching rows in O(log n) time without looking at the vast majority of entries.
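As a rough illustration (the table and column names below are made up, not from your setup), an indexed query lets the server skip straight to the matching rows, while the CSV route has to parse the whole file before it can filter:

import pandas as pd
from sqlalchemy import create_engine

# Assumes a MariaDB table prices(symbol, trade_date, ...) with an index such as:
#   CREATE INDEX idx_symbol_date ON prices (symbol, trade_date);
engine = create_engine("mysql+pymysql://user:password@localhost/marketdata")

# The server uses the index to locate only the rows you asked for.
df = pd.read_sql(
    "SELECT * FROM prices WHERE symbol = 'AAPL' AND trade_date >= '2020-01-01'",
    engine,
)

# The CSV equivalent has to read and typecast every line before filtering:
# df = pd.read_csv("prices.csv", parse_dates=["trade_date"])
# df = df[(df["symbol"] == "AAPL") & (df["trade_date"] >= "2020-01-01")]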
When using df = connectorx.read_sql(conn=cx_con, query=f"SELECT * FROM table;"), I occasionally get a DataFrame back with the correct columns but every row full of zeros, or the process crashes with Process finished with exit code -1073741819 (0xC0000005). This only happens when the table is being updated at the same time by df.to_sql("table", con=con_in, if_exists="append").
My program reads a table from a local database that I am continuously updating in a concurrently running program. The issue does not occur when I read the table using pandas.read_sql_query() (which is far slower). This suggests there is some issue with how connectorx handles concurrent read/write traffic on the table that does not exist with the pandas read. Is this a bug in connectorx, or is there something I can do to prevent it from happening?
I'm using PyCharm 2022.2.1, Windows 11, and Python 3.10.
Reason for the error
ConnectorX is designed more for OLAP scenarios where the data is static and the queries are read-only. The zero rows / crashes are caused by inconsistency between the multiple queries ConnectorX issues. To achieve maximum speed-up, ConnectorX fetches some metadata before fetching the real query result, including:
A LIMIT 1 query to get the result schema (column names and types)
A COUNT query to get the number of rows in the result
MIN/MAX queries to get the minimum and maximum values of the partition column (if partitioning is enabled)
The first two are used to pre-allocate the destination pandas.DataFrame in advance (steps 1 and 2 of ConnectorX's workflow). Allocating the dataframe at the beginning makes it possible for ConnectorX to stream the result values directly into the final destination, avoiding extra data copies and result concatenation (steps 4-6, done in a streaming fashion).
If the data is being updated, the result of the COUNT query, X, may differ from the real number of rows the query returns, Y. In that case the program may crash (if X < Y) or return some rows filled with zeros (if X > Y).
Possible workarounds
Avoid the COUNT query by using Arrow
One way to avoid the COUNT query is to set return_type to arrow2 and get the result in Arrow format first. Since an Arrow table consists of multiple record batches, ConnectorX can allocate memory on demand without issuing the COUNT query. After getting the Arrow result, you can convert it to pandas using the efficient to_pandas API provided by pyarrow. Here is an example:
import connectorx as cx

# Fetch as Arrow record batches (no COUNT query needed), then convert to pandas.
table = cx.read_sql(conn, query, return_type="arrow2")
df = table.to_pandas(split_blocks=False, date_as_object=False)
However, one thing to be aware of: if you are using partitioning, the result might still be incorrect due to inconsistency between the MIN/MAX query and the multiple partitioned queries.
Add predicates for consistency
If your data is append-only in a certain way, for example with a monotonically increasing ID column, you can add a predicate like ID <= current max ID so the concurrently appended data is filtered out in both the COUNT query and the result-fetching query. If you are using partitioning, you can also partition on this ID column so the result stays consistent. A sketch of this approach follows.
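A minimal sketch of that idea (the table name quotes and the column id are placeholders, not from your schema):

import connectorx as cx

# Freeze the upper bound first so the metadata queries (COUNT, MIN/MAX) and the
# actual data fetch all see the same set of rows, even while appends continue.
max_id = cx.read_sql(cx_con, "SELECT MAX(id) AS max_id FROM quotes").loc[0, "max_id"]

df = cx.read_sql(
    cx_con,
    f"SELECT * FROM quotes WHERE id <= {max_id}",
    partition_on="id",   # optional: partitioning on the same column keeps it consistent
    partition_num=4,
)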
I am storing multiple time-series in a MongoDB with sub-second granularity. The DB is updated by a bunch of Python scripts, and the data stored serve two main purposes:
(1) It's a central information source for the latest data from all series. Multiple scripts access it every second or so to read the latest datapoint in each collection.
(2) It's a long-term data store. I often load the whole DB into Python to analyse trends in the data.
To keep the DB as efficient as possible, I want to bucket my data (ideally holding one document per day in each collection). Because of (1), however, the bigger the buckets, the more expensive the sorting required to access the last datapoint.
I can think of two solutions here, but I'm not sure what alternatives there are, or which is the best way:
a) Store the latest timestamp in a one-line document in a separate db/collection. No sorting required on read, but an additional write is required every time any series gets a new datapoint.
b) Keep the buckets smaller (say 1-hour each) and sort.
With a) you write smallish documents to a separate collection, which is preferable performance-wise to updating large documents. You could write all new datapoints into this collection and aggregate them for the hour or day, depending on your preference. But as you said, this requires an additional write operation.
With b) you need to keep the index size for the sort field in mind. Does the index fit in memory? That's crucial for the performance of the sort, as you do not want to do any in-memory sorting of a large collection.
I recommend exploring the hybrid approach of storing individual datapoints for a limited time in an 'incoming' collection. Once your bucketing interval of an hour or a day approaches, you can aggregate the datapoints into buckets and store them in a different collection. Of course there is now some additional complexity in the application, which needs to be able to read both the bucketed and the datapoint collections and merge them. A rough sketch follows.
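For what it's worth, here is a rough pymongo sketch of that hybrid layout. The collection and field names (incoming, buckets, series, ts, value) are placeholders, and $dateTrunc requires MongoDB 5.0+:

from datetime import datetime, timedelta
from pymongo import MongoClient, DESCENDING

db = MongoClient()["timeseries"]

# (1) Latest datapoint: cheap, because 'incoming' only holds recent, small documents.
latest = db.incoming.find_one({"series": "series_a"}, sort=[("ts", DESCENDING)])

# Periodic roll-up: fold everything older than an hour into bucket documents.
cutoff = datetime.utcnow() - timedelta(hours=1)
pipeline = [
    {"$match": {"ts": {"$lt": cutoff}}},
    {"$group": {
        "_id": {"series": "$series",
                "hour": {"$dateTrunc": {"date": "$ts", "unit": "hour"}}},
        "points": {"$push": {"ts": "$ts", "value": "$value"}},
        "count": {"$sum": 1},
    }},
]
for bucket in db.incoming.aggregate(pipeline):
    db.buckets.insert_one(bucket)
db.incoming.delete_many({"ts": {"$lt": cutoff}})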
So I'm trying to store Pandas DataFrames in HDF5 and getting strange errors, rather inconsistently. At least half the time, some part of the read-process-move-write cycle fails, often with no clearer explanation than "HDF5 Read Error". Even worse, sometimes the table ends up with nonsense/corrupted data that doesn't stop things until downstream -- either values that are off by orders of magnitude (and not even correlated with the correct ones) or dates that don't make sense (recent data mismarked as being dated in the 1750s...etc).
I thought I'd go through the current process and then the things that I suspect might be causing problems, in case that helps. Here's what it looks like:
Read some of the tables (call them "QUERY1" and "QUERY2") to see if they're up to date, and if they aren't,
Take the table that had been in the HDF5 store as "QUERY1" and store it as "QUERY1_YYYY_MM_DD" in the HDF5 store instead
Run the associated query on the external database for that table. Each one is between 100 and 1500 columns of daily data going back to 1980.
Store the result of query 1 as the new "QUERY1" in the HDF5 store
Compute several transformations of one or more of QUERY1, QUERY2, ..., QUERYn, which will have hierarchical (pandas MultiIndex) columns. Overwrite "Derived_Frame1" etc. with its update/replacement in the HDF5 store (a rough code sketch of this cycle follows below).
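Roughly, in code, the cycle looks like this (the path, key names, and the helper functions are placeholders, not the exact production code):

import datetime as dt
import pandas as pd

STORE_PATH = r"\\networkdrive\share\data.h5"   # shared .h5 on the network drive

with pd.HDFStore(STORE_PATH) as store:
    query1 = store["QUERY1"]
    if not up_to_date(query1):                               # hypothetical check
        # archive the stale table under a dated key
        stamp = dt.date.today().strftime("%Y_%m_%d")
        store.put(f"QUERY1_{stamp}", query1, format="table")
        # re-run the external query and overwrite the live key
        store.put("QUERY1", run_external_query("QUERY1"), format="table")   # hypothetical
        # recompute the derived frames (MultiIndex columns) and overwrite them
        derived = build_derived(store["QUERY1"], store["QUERY2"])           # hypothetical
        store.put("Derived_Frame1", derived, format="table")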
Multiple people with access to the relevant .h5 file on a Windows network drive run this routine -- potentially sometimes, but not usually, at the same time.
Some things I suspect could be part of the problem:
Using the default format (df.to_hdf(store, key)) instead of insisting on "table" format with df.to_hdf(store, key, format='table'). I do this because the default format is between 2 and 5x faster on both read and write according to %timeit.
Using a network drive to allow several users to run this routine and access at least the derived frames. Not much I can do about this requirement, especially for read access to the derived dataframes at any time.
From the docs, it sounds like repeatedly deleting and re-writing an item in the HDF5 store can do weird things (at least gradually increasing the file size, not sure what else). Maybe I should be storing query archives in another file? Maybe I should be nuking and replacing the whole main file upon update?
Storing dataframes with MultiIndex columns in HDF5 in the first place -- this seems to be what gets me a "warning" under the default format, although it seems like the warning goes away if I use format='table'.
Edit: it is also possible/likely that different users running the routine above are using different versions of Pandas and different versions of PyTables.
Any ideas?
I have a large SQLite db where I am joining a 3.5M-row table onto itself. I use SQLite since it is the serialization format of my python3 application and the flatfile format is important in my workflow. When iterating over the rows of this join (around 55M rows) using:
cursor.execute('SELECT DISTINCT p.pid, pp.pname, pp.pid FROM proteins '
               'AS p JOIN proteins AS pp USING(pname) ORDER BY p.pid')
for row in cursor:
    # do stuff with row
EXPLAIN QUERY PLAN gives the following:
0|0|0|SCAN TABLE proteins AS p USING INDEX pid_index (~1000000 rows)
0|1|1|SEARCH TABLE proteins AS pp USING INDEX pname_index (pname=?) (~10 rows)
0|0|0|USE TEMP B-TREE FOR DISTINCT
sqlite3 errors with "database or disk is full" after, say, 1,000,000 rows, which seems to indicate a full SQLite on-disk temp store. Since I have enough RAM on my current box, that can be solved by setting the temp store to memory, but it's suboptimal: in that case all the RAM seems to get used up, and I tend to run 4 or so of these processes in parallel. My (probably incorrect) assumption was that the iterator was a generator and would not put a large load on memory, unlike e.g. fetchall, which would load all rows. However, I now run out of disk space (on a small SSD scratch disk), so I assume SQLite needs to store the results somewhere.
A way around this may be to run chunks of SELECT ... LIMIT x OFFSET y queries, but they get slower each time a bigger OFFSET is used. Is there any other way to run this? What is stored in these temporary files? They seem to grow the further I iterate.
0|0|0|USE TEMP B-TREE FOR DISTINCT
Here's what's using the disk.
In order to support DISTINCT, SQLite has to keep track of which rows have already appeared in the result. For a large number of results, this set can grow huge, so to save RAM, SQLite temporarily stores the distinct set on disk.
Removing the DISTINCT clause is an easy way to avoid the issue, but it changes the meaning of the query; you can then get duplicate rows. If you don't mind that, or you have unique indices or some other way of ensuring that you never get duplicates, then that won't matter.
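Two hedged sketches of ways to relieve that pressure (the paths and the uniqueness assumption are mine, not from the question):

import os
import sqlite3

# (a) Point SQLite's on-disk temp store at a bigger disk. SQLITE_TMPDIR is
#     honoured on Unix-like builds and must be set before connecting.
os.environ["SQLITE_TMPDIR"] = "/mnt/bigdisk/sqlite_tmp"

con = sqlite3.connect("proteins.db")
cursor = con.cursor()

# (b) If the (p.pid, pp.pname, pp.pid) combinations are already unique, e.g.
#     because of a UNIQUE(pid, pname) constraint, DISTINCT adds nothing and
#     dropping it removes the temp B-tree entirely.
cursor.execute('SELECT p.pid, pp.pname, pp.pid FROM proteins '
               'AS p JOIN proteins AS pp USING(pname) ORDER BY p.pid')
for row in cursor:
    pass  # do stuff with row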
What you are trying to do with SQLite3 is a very bad idea; let me try to explain why.
You have the raw data on disk where it fits and is readable.
You generate a result inside of SQLite3 which expands greatly.
You then try to transfer this very large dataset through an sql connector.
Relational databases in general are not made for this kind of operation, and SQLite3 is no exception. They were made for small, quick queries that live for a fraction of a second and return a couple of rows.
You would be better off using another tool.
Reading the whole dataset into Python using pandas, for instance, is my recommended solution; using itertools is also a good idea. For example:
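A rough pandas version of the same self-join (assuming the table has just pid and pname, and that the joined result fits in RAM):

import sqlite3
import pandas as pd

con = sqlite3.connect("proteins.db")
proteins = pd.read_sql_query("SELECT pid, pname FROM proteins", con)

# Self-join on pname, then de-duplicate and sort, mirroring the SQL query.
joined = (
    proteins.merge(proteins, on="pname", suffixes=("", "_pp"))
            .loc[:, ["pid", "pname", "pid_pp"]]
            .drop_duplicates()
            .sort_values("pid")
)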
I am working with an Oracle database with millions of rows and 100+ columns. I am attempting to store this data in an HDF5 file using pytables with certain columns indexed. I will be reading subsets of these data in a pandas DataFrame and performing computations.
I have attempted the following:
Download the table using a utility into a CSV file, read the CSV file chunk by chunk using pandas, and append to an HDF5 table using pandas.HDFStore. I created a dtype definition and provided the maximum string sizes.
However, now when I am trying to download data directly from Oracle DB and post it to HDF5 file via pandas.HDFStore, I run into some problems.
pandas.io.sql.read_frame does not support chunked reading. I don't have enough RAM to download the entire table into memory first.
If I try to use cursor.fetchmany() with a fixed number of records, the read operation takes ages because the DB table is not indexed and I have to read records falling within a date range. I am using DataFrame(cursor.fetchmany(), columns=['a','b','c'], dtype=my_dtype);
however, the created DataFrame always infers the dtype rather than enforcing the dtype I have provided (unlike read_csv, which adheres to the dtype I provide). Hence, when I append this DataFrame to an already existing HDFStore, there is a type mismatch, e.g. a float64 may be interpreted as int64 in one chunk.
Appreciate if you guys could offer your thoughts and point me in the right direction.
Well, the only practical solution for now is to use PyTables directly since it's designed for out-of-memory operation... It's a bit tedious but not that bad:
http://www.pytables.org/moin/HintsForSQLUsers#Insertingdata
Another approach, using Pandas, is here:
"Large data" work flows using pandas
Okay, so I don't have much experience with Oracle databases, but here are some thoughts:
Your access time for any particular record from Oracle is slow because of the lack of indexing and the fact that you want the data in timestamp order.
Firstly, can't you enable indexing on the database?
If you can't manipulate the database, you can presumably request a found set that only includes the ordered unique IDs for each row?
You could potentially store this data as a single array of unique IDs, which should fit into memory. If you allow 4k for every unique key (a conservative estimate that includes overhead etc.) and you don't keep the timestamps, so it's just an array of integers, it might use up about 1.1 GB of RAM for 3 million records. That's not a whole heap, and presumably you only want a small window of active data, or perhaps you are processing row by row?
Make a generator function to do all of this (see the sketch below). That way, once you complete the iteration it frees up the memory without you having to del anything, and it also makes your code easier to follow and avoids bloating the actually important logic of your calculation loop.
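A quick sketch of that generator idea (the chunk size, cursor, and per-row function are placeholders):

def iter_rows(cursor, chunksize=10_000):
    """Yield rows one at a time while fetching from the DB in chunks."""
    while True:
        chunk = cursor.fetchmany(chunksize)
        if not chunk:
            return
        yield from chunk

# usage, with `cursor` coming from your DB connection:
# for row in iter_rows(cursor):
#     process(row)   # hypothetical per-row work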
If you can't store it all in memory, or for some other reason this doesn't work, then the best thing you can do is work out how much you can store in memory. You can potentially split the job into multiple requests and use multithreading to send a request once the last one has finished, while you process the data into your new file. It shouldn't use up memory until you ask for the data to be returned. Try to work out whether the delay is the request being fulfilled or the data being downloaded.
From the sounds of it, you might be abstracting the database, and letting pandas make the requests. It might be worth looking at how it's limiting the results. You should be able to make the request for all the data, but only load the results one row at a time from the database server.