The Blaze readme (here https://github.com/ContinuumIO/blaze) describes a number of improvements over NumPy including "Synthetic Dimensions". I have searched around but have been unable to find out what they are.
Could someone enlighten me?
Thanks.
A couple of caveats first. One, I work on the project and second these features are still in the planning stage so they are not implemented at this moment.
That being said a "synthetic dimension" in Blaze is much like a derived column in database. It is a value calculated formulaically from an expression in terms of other rows or columns but does not exist in physical memory or on disk.
In Blaze we trying to make these kind of columns ( especially for columnar data stores ) be able to be consumed by next generation ufunc objects that are able to calculate values in the computation from the synthetic dimensions as well as physical dimensions.
Hope that answers the question!
Related
I hope this is a good question, if I should post this as an issue on the PyPolars GitHub instead, please let me know.
I have a quite large parquet file where some columns contain binary data.
These columns are not interesting for me right now, so it is ok for me that PyPolars does not support the Binary datatype so far (this is how I understand it at least, my question would be irrelevant if that were not the case!), but I would like to make full use of the query optimization by lazily reading the file with .scan_parquet() instead of read_parquet().
Currently .scan_parquet() gives me the following error:
pyo3_runtime.PanicException: Arrow datatype Binary not supported by Polars. You probably need to activate that data-type feature.
and I don't know of a way to 'activate that data-type feature'
So my workaround is to use .read_parquet() and specify in advance which columns I want to use so that it never attempts to read the Binary ones.
The problem is I am doing exploratory data analysis and there are a large amount of columns so for one it is annoying to have to specify a large list of columns (basically ~150 minus the two that produce the issue) and it is also inefficient to read all these columns each time when I only need some small subset each time (it is even more annoying to change a small list of columns each time I, for example, add some filter).
It would be ideal if I could use .scan_parquet and let the query optimizer figure out that it only needs to read the (unproblematic) columns that I actually need.
Is there a better way of doing things that I am not seeing?
I`m trying to make a research in which the observations of my dataset are represented by matrices (arrays composed of numbers, similar to how images for deep learning are represented, but mine are not images) of different shapes.
What I`ve already tried is to write those arrays as lists in one column of a pandas dataframe and then save this as a csv\excel. After that I planned just to load such a file and convert those lists to arrays of appropriate shapes and then to convert a set of such arrays to a tensor which I finally will use for training the deep model in keras.
But it seems like this method is extremely inefficient, cause only 1/6 of my dataset has already occupied about 6 Gb of memory (pandas saved as csv) which is huge and I won't be able to load it in RAM (I'm using google colab to run my experiments).
So my question is: is there any other way of storing a set of arrays of different shapes, which won`t occupy so much memory? Maybe I can store tensors directly somehow? Or maybe there are some ways to store pandas in some compressed types of files which are not so heavy?
Yes, Avoid using csv/excel for big datasets, there are tons of data formats out there, for this case I would recommend to use a compressed format like pd.Dataframe.to_hdf, pd.Dataframe.to_parquet or pd.Dataframe.to_pickle.
There are even more formats to choose and compression options within the functions (for example to_hdf takes the argument complevel that you can set to 9 ).
Are you storing purely (or mostly) continuous variables? If so, maybe you could reduce the accuracy (i.e., from float64 to float32) these variables if you don't need need such an accurate value per datapoint.
There's a bunch of ways in reducing the size of your data that's being stored in your memory, and the what's written is one of the many ways to do so. Maybe you could break the process that you've mentioned into smaller chunks (i.e., storage of data, extraction of data), and work on each chunk/stage individually, which hopefully will reduce the overall size of your data!
Otherwise, you could perhaps take advantage of database management systems (SQL or NoSQL depending on which fits best) which might be better, though querying that amount of data might constitute yet another issue.
I'm by no means an expert in this but I'm just explaining more of how I've dealt with excessively large datasets (similar to what you're currently experiencing) in the past, and I'm pretty sure someone here will probably give you a more definitive answer as compared to my 'a little of everything' answer. All the best!
I am creating a Python desktop application that allows users to select different distributional forms to model agricultural yield data. I have the time series agricultural data - close to a million rows - saved in a SQLite database (although this is not set in stone if someone knows of a better choice). Once the user selects some data, say corn yields from 1990-2010 in Illinois, I want them to select a distributional form from a drop-down. Next, my function fits the distribution to the data and outputs 10,000 points drawn from that fitted distributional form in a Numpy array. I would like this data to be temporary during the execution of the program.
In an attempt to be efficient, I would only like to make this fit and the subsequent drawing of numbers one time for a specified region and distribution. I have been researching temporary files in Python, but I am not sure that is the best approach for saving many different Numpy arrays. PyTables also looks like an interesting approach and seems to be compatible with Numpy, but I am not sure it is good for handling temporary data. No SQL solutions, like MongoDB, seem to be very popular these days as well, which also interests me from a resume building perspective.
Edit: After reading the comment below and researching it, I am going to go with PyTables, but I am trying to find the best way to tackle this. Is it possible to create a table like below, where instead of Float32Col I can use createTimeSeriesTable() from the scikits time series class or do I need to create a datetime column for the date and a boolean column for the mask, in addition to the Float32Col below to hold the data. Or is there a better way to be going about this problem?
class Yield(IsDescription):
geography_id = UInt16Col()
data = Float32Col(shape=(50, 1)) # for 50 years of data
Any help on the matter would be greatly appreciated.
What's your use case for the temporary data? Are you just going to be reading it all in at once (and never wanting to just read in a subset)?
If so, just save the array to a temporary file (e.g. with numpy.save, or equivalently, pickle with a binary protocol). There's no need for fancier solutions in that case.
On a side note, I'd highly recommend PyTables over SQLite for storing your original time series data.
Based on what it sounds like you're doing, you're not going to need the "relational" parts of a relational database (e.g. joins). If you don't need to join or relate tables, you just need fast simple queries, and you want the data in memory as a numpy array, PyTables is an excellent option. PyTables uses HDF to store your data, which can be much more compact on disk than a SQLite database. PyTables is also considerably faster for loading large chunks of data into memory as numpy arrays.
PyTables supports the creation of tables from user-defined classes that inherit from the IsDescription class. This includes support for multidimensional cells, as in the following example from the documentation:
class Particle(IsDescription):
name = StringCol(itemsize=16) # 16-character string
lati = Int32Col() # integer
longi = Int32Col() # integer
pressure = Float32Col(shape=(2,3)) # array of floats (single-precision)
temperature = Float64Col(shape=(2,3)) # array of doubles (double-precision)
However, is it possible to store an arbitrarily-shaped multidimensional array in a single cell? Following the above example, something like pressure = Float32Col(shape=(x, y)) where x and y are determined upon the insertion of each row.
If not, what is the preferred approach? Storing each (arbitrarily-shaped) multidimensional array in a CArray with a unique name and then storing those names in a master index table? The application I'm imagining is storing images and associated metadata, which I'd like to be able to both query and use numexpr on.
Any pointers toward PyTables best practices are much appreciated!
The long answer is "yes, but you probably don't want to."
PyTables probably doesn't support it directly, but HDF5 does support creation of nested variable-length datatypes, allowing ragged arrays in multiple dimensions. Should you wish to go down that path, you'll want to use h5py and browse through HDF5 User's Guide, Datatypes chapter. See section 6.4.3.2.3. Variable-length Datatypes. (I'd link it, but they apparently chose not to put anchors that deep).
Personally, the way that I would arrange the data you've got is into groups of datasets, not into a single table. That is, something like:
/particles/particlename1/pressure
/particles/particlename1/temperature
/particles/particlename2/pressure
/particles/particlename2/temperature
and so on. The lat and long values would be attributes on the /particles/particlename group rather than datasets, though having small datasets for them is perfectly fine too.
If you want to be able to do searches based on the lat and long, then having a dataset with the lat/long/name columns would be good. And if you wanted to get really fancy, there's an HDF5 datatype for references, allowing you to store a pointer to a dataset, or even to a subset of a dataset.
The short answer is "no", and I think its a "limitation" of hdf5 rather than pytables.
I think the reason is that each unit of storage (the compound dataset) must be a well defined size, which if one or more component can change size then it will obviously not be. Note it is totally possible to resize and extend a dataset in hdf5 (pytables makes heavy use of this) but not the units of data within that array.
I suspect the best thing to do is either:
a) make it a well defined size and provide a flag for overflow. This works well if the largest reasonable size is still pretty small and you are okay with tail events being thrown out. Note you might be able to get ride of the unused disk space with hdf5 compression.
b) do as you suggest a create a new CArray in the same file just read that in when required. (to keep things tidy you might want to put these all under their own group)
HDF5 actually has an API which is designed (and optimized for) for storing images in a hdf5 file. I dont think its exposed in pytables.
I'm developing an app that handle sets of financial series data (input as csv or open document), one set could be say 10's x 1000's up to double precision numbers (Simplifying, but thats what matters).
I plan to do operations on that data (eg. sum, difference, averages etc.) as well including generation of say another column based on computations on the input. This will be between columns (row level operations) on one set and also between columns on many (potentially all) sets at the row level also. I plan to write it in Python and it will eventually need a intranet facing interface to display the results/graphs etc. for now, csv output based on some input parameters will suffice.
What is the best way to store the data and manipulate? So far I see my choices as being either (1) to write csv files to disk and trawl through them to do the math or (2) I could put them into a database and rely on the database to handle the math. My main concern is speed/performance as the number of datasets grows as there will be inter-dataset row level math that needs to be done.
-Has anyone had experience going down either path and what are the pitfalls/gotchas that I should be aware of?
-What are the reasons why one should be chosen over another?
-Are there any potential speed/performance pitfalls/boosts that I need to be aware of before I start that could influence the design?
-Is there any project or framework out there to help with this type of task?
-Edit-
More info:
The rows will all read all in order, BUT I may need to do some resampling/interpolation to match the differing input lengths as well as differing timestamps for each row. Since each dataset will always have a differing length that is not fixed, I'll have some scratch table/memory somewhere to hold the interpolated/resampled versions. I'm not sure if it makes more sense to try to store this (and try to upsample/interploate to a common higher length) or just regenerate it each time its needed.
"I plan to do operations on that data (eg. sum, difference, averages etc.) as well including generation of say another column based on computations on the input."
This is the standard use case for a data warehouse star-schema design. Buy Kimball's The Data Warehouse Toolkit. Read (and understand) the star schema before doing anything else.
"What is the best way to store the data and manipulate?"
A Star Schema.
You can implement this as flat files (CSV is fine) or RDBMS. If you use flat files, you write simple loops to do the math. If you use an RDBMS you write simple SQL and simple loops.
"My main concern is speed/performance as the number of datasets grows"
Nothing is as fast as a flat file. Period. RDBMS is slower.
The RDBMS value proposition stems from SQL being a relatively simple way to specify SELECT SUM(), COUNT() FROM fact JOIN dimension WHERE filter GROUP BY dimension attribute. Python isn't as terse as SQL, but it's just as fast and just as flexible. Python competes against SQL.
"pitfalls/gotchas that I should be aware of?"
DB design. If you don't get the star schema and how to separate facts from dimensions, all approaches are doomed. Once you separate facts from dimensions, all approaches are approximately equal.
"What are the reasons why one should be chosen over another?"
RDBMS slow and flexible. Flat files fast and (sometimes) less flexible. Python levels the playing field.
"Are there any potential speed/performance pitfalls/boosts that I need to be aware of before I start that could influence the design?"
Star Schema: central fact table surrounded by dimension tables. Nothing beats it.
"Is there any project or framework out there to help with this type of task?"
Not really.
For speed optimization, I would suggest two other avenues of investigation beyond changing your underlying storage mechanism:
1) Use an intermediate data structure.
If maximizing speed is more important than minimizing memory usage, you may get good results out of using a different data structure as the basis of your calculations, rather than focusing on the underlying storage mechanism. This is a strategy that, in practice, has reduced runtime in projects I've worked on dramatically, regardless of whether the data was stored in a database or text (in my case, XML).
While sums and averages will require runtime in only O(n), more complex calculations could easily push that into O(n^2) without applying this strategy. O(n^2) would be a performance hit that would likely have far more of a perceived speed impact than whether you're reading from CSV or a database. An example case would be if your data rows reference other data rows, and there's a need to aggregate data based on those references.
So if you find yourself doing calculations more complex than a sum or an average, you might explore data structures that can be created in O(n) and would keep your calculation operations in O(n) or better. As Martin pointed out, it sounds like your whole data sets can be held in memory comfortably, so this may yield some big wins. What kind of data structure you'd create would be dependent on the nature of the calculation you're doing.
2) Pre-cache.
Depending on how the data is to be used, you could store the calculated values ahead of time. As soon as the data is produced/loaded, perform your sums, averages, etc., and store those aggregations alongside your original data, or hold them in memory as long as your program runs. If this strategy is applicable to your project (i.e. if the users aren't coming up with unforeseen calculation requests on the fly), reading the data shouldn't be prohibitively long-running, whether the data comes from text or a database.
What matters most if all data will fit simultaneously into memory. From the size that you give, it seems that this is easily the case (a few megabytes at worst).
If so, I would discourage using a relational database, and do all operations directly in Python. Depending on what other processing you need, I would probably rather use binary pickles, than CSV.
Are you likely to need all rows in order or will you want only specific known rows?
If you need to read all the data there isn't much advantage to having it in a database.
edit: If the code fits in memory then a simple CSV is fine. Plain text data formats are always easier to deal with than opaque ones if you can use them.