I am currently using
tCursors=tuple(cursors)
for row in myTable.GetRows(tCursors):
for i in range(0,len(cursors)):
curValue=cursors[i].CurrentValue
to get values from a data table in spotfire. The tuple contains the cursors for all of the columns that I am interested in. Is there a faster way to obtain multiple values from the data table, possibly loading them into an array, or loading the entire data table into a 2d array?
Sorry that this is vague, I am just not sure what options I have to get the data from the spotfire table into python in a format that can be used. Thank you
If you want to process large data set, you should use R (aka datafunctions).
Python is good to manipulate Spotfire elements thanks to the possibility to modify C# elements with the API.
R is good to manipulate data, you can easily perform calculation, aggregation, pivot etc. It is really faster than Ironpython for operations on data.
Here are some great resources to help you start with R:
The official R documentation
Documentation to manipulate data
R is widely used, so you can easily find chunks of code on the web to achieve what you want to do.
Related
I have a few Pandas dataframes with several millions of rows each. The dataframes have columns containing JSON objects each with 100+ fields. I have a set of 24 functions that run sequentially on the dataframes, process the JSON (for example, compute some string distance between two fields in the JSON) and return a JSON with some new fields added. After all 24 functions execute, I get a final JSON which is then usable for my purposes.
I am wondering what the best ways to speed up performance for this dataset. A few things I have considered and read up on:
It is tricky to vectorize because many operations are not as straightforward as "subtract this column's values from another column's values".
I read up on some of the Pandas documentation and a few options indicated are Cython (may be tricky to convert the string edit distance to Cython, especially since I am using an external Python package) and Numba/JIT (but this is mentioned to be best for numerical computations only).
Possibly controlling the number of threads could be an option. The 24 functions can mostly operate without any dependencies on each other.
You are asking for advice and this is not the best site for general advice but nevertheless I will try to point a few things out.
The ideas you have already considered are not going to be helpful - neither Cython, Numba, nor threading are not going to address the main problem - the format of your data that is not conductive for performance of operations on the data.
I suggest that you first "unpack" the JSONs that you store in the column(s?) of your dataframe. Preferably, each field of the JSON (mandatory or optional - deal with empty values at this stage) ends up being a column of the dataframe. If there are nested dictionaries you may want to consider splitting the dataframe (particularly if the 24 functions are working separately at separate nested JSON dicts). Alternatively, you should strive to flatten the JSONs.
Convert to the data format that gives you the best performance. JSON stores all the data in the textual format. Numbers are best used in their binary format. You can do that column-wise on the columns that you suspect should be converted using df['col'].astype(...) (works on the whole dataframe too).
Update the 24 functions to operate not on JSON strings stored in dataframe but on the fields of the dataframe.
Recombine the JSONs for storage (I assume you need them in this format). At this stage the implicit conversion from numbers to strings will occur.
Given the level of details you provided in the question, the suggestions are necessarily brief. Should you have any more detailed questions at any of the above points, it would be best to ask maximally simple question on each of them (preferably containing a self-sufficient MWE).
I'm an absolute novice here, but know my way around some Netsuite formulas and data concepts. I have a report that lists all PACK SKUs (Kit/Bundle) and their component SKUs along with the quantity for each component that makes up the PACK. This is similar to BOMs.
I have used decode function in other saved searches to transpose data in columns, but I'm super confused about how to achieve this for Pack/Components.
There is no individual identifier such as component(a) component (b) that I can identify. These are also not under {memberitems} and looks like the company had this restructured to be under PACK/Component structure.
Is there anyway I can get the many components automatically listed in columns with their quantities next to it? Or, is there a potential Python script or an Excel Macro that might be able to assist with 800k rows? Any help would be highly appreciated.
data snapshot
I am currently trying to develop macros/programs to help me edit a big database in Excel.
Just recently I successfully wrote a custom macro in VBA, which stores two big arrays into memory, in memory it compares both arrays by only one column in each (for example by names), then the common items that reside in both arrays are copied into another temporary arrays TOGETHER with other entries in the same row of the array. So if row(11) name was "Tom", and it is common for both arrays, and next to Tom was his salary of 10,000 and his phone number, the entire row would be copied.
This was not easy, but I got to it somehow.
Now, this works like a charm for arrays as big as 10,000 rows x 5 columns + another array of the same size 10,000 rows x 5 columns. It compares and writes back to a new sheet in a few seconds. Great!
But now I tried a much bigger array with this method, say 200,000 rows x 10 columns + second array to be compared 10,000 rows x 10 columns...and it took a lot of time.
Problem is that Excel is only running at 25% CPU - I checked that online it is normal.
Thus, I am assuming that to get a better performance I would need to use another 'tool', in this case another programming language.
I heard that Python is great, Python is easy etc. but I am no programmer, I just learned a few dozen object names and I know some logic so I got around in VBA.
Is it Python? Or perhaps changing the programming language won't help? It is really important to me that the language is not too complicated - I've seen C++ and it stings my eyes, I literally have no idea what is going on in those codes.
If indeed python, what libraries should I start with? Perhaps learn some easy things first and then go into those arrays etc.?
Thanks!
I have no intention of condescending but anything I say would sound like condescending, so so be it.
The operation you are doing is called join. It's a common operation in any kind of database. Unfortunately, Excel is not a database.
I suspect that you are doing NxM operation in Excel. 200,000 rows x 10,000 rows operation quickly explodes. Pick a key in N, search a row in M, and produce result. When you do this, regardless of computer language, the computation order becomes so large that there is no way to finish the task in reasonable amount of time.
In this case, 200,000 rows x 10,000 rows require about 5,000 lookup per every row on average in 200,000 rows. That's 1,000,000,000 times.
So, how do the real databases do this in reasonable amount of time? Use index. When you look into this 10,000 rows of table, what you are looking for is indexed so searching a row becomes log2(10,000). The total order of computation becomes N * log2(M) which is far more manageable. If you hash the key, the search cost is almost O(1) - meaning it's constant. So, the computation order becomes N.
What you are doing probably is, in real database term, full table scan. It is something to avoid for real database because it is slow.
If you use any real (SQL) database, or programming language that provides a key based search in dataset, your join will become really fast. It's nothing to do with any programming language. It is really a 101 of computer science.
I do not know anything about what Excel can do. If Excel provides some facility to lookup a row based on indexing or hashing, you may be able to speed it up drastically.
Ideally you want to design a database (there are many such as SQLite, PostgreSQL, MySQL etc.) and stick your data into it. SQL is the language of talking to a database (DML data manipulation language) or creating/editing the structure of the database (DDL data definition language).
Why a database? You’ll get data validation and the ability to query data with many relationships (such as One to Many, e.g. one author can have many books but you’ll have an Author table and a Book table and will need to join these).
Pandas works not just with databases but CSV and text files, Microsoft Excel, HDF5 and is great for reading and writing to these in memory structures as well as merging, joining, slicing the data. The quickest way to what you want is likely read the data you have into panda dataframes and then manipulate from there. This makes a database optional though recommended. See Pandas Merging 101 for an idea of what you can do with pandas.
Another python tool you could use is SQLAlchemy which is an ORM object relational mapper (converts say a row in an Author table to an Author class object in python). Whilst it’s important to know SQL and database principles you don’t need to use SQL statements directly when using SQLAlchemy.
Each of these areas are huge like the ocean. You can dip your toes into each but if you wade in too deep you’ll want to know how to swim. I have fist-sized books on each to give (that I’ve not finished) you a rough idea what I mean by this.
A possible roadmap may look like:
Database (optional but recommended):
Learn about relational data
Learn database design
Learn SQL
Pandas (highly recommended):
Learn to read and write data (to excel / database)
Learn to merge, join, concatenate and update a DataFrame
I am working with an Oracle database with millions of rows and 100+ columns. I am attempting to store this data in an HDF5 file using pytables with certain columns indexed. I will be reading subsets of these data in a pandas DataFrame and performing computations.
I have attempted the following:
Download the the table, using a utility into a csv file, read the csv file chunk by chunk using pandas and append to HDF5 table using pandas.HDFStore. I created a dtype definition and provided the maximum string sizes.
However, now when I am trying to download data directly from Oracle DB and post it to HDF5 file via pandas.HDFStore, I run into some problems.
pandas.io.sql.read_frame does not support chunked reading. I don't have enough RAM to be able to download the entire data to memory first.
If I try to use cursor.fecthmany() with a fixed number of records, the read operation takes ages at the DB table is not indexed and I have to read records falling under a date range. I am using DataFrame(cursor.fetchmany(), columns = ['a','b','c'], dtype=my_dtype)
however, the created DataFrame always infers the dtype rather than enforce the dtype I have provided (unlike read_csv which adheres to the dtype I provide). Hence, when I append this DataFrame to an already existing HDFDatastore, there is a type mismatch for e.g. a float64 will maybe interpreted as int64 in one chunk.
Appreciate if you guys could offer your thoughts and point me in the right direction.
Well, the only practical solution for now is to use PyTables directly since it's designed for out-of-memory operation... It's a bit tedious but not that bad:
http://www.pytables.org/moin/HintsForSQLUsers#Insertingdata
Another approach, using Pandas, is here:
"Large data" work flows using pandas
Okay, so I don't have much experience with oracle databases, but here's some thoughts:
Your access time for any particular records from oracle are slow, because of a lack of indexing, and the fact you want data in timestamp order.
Firstly, you can't enable indexing for the database?
If you can't manipulate the database, you can presumably request a found set that only includes the ordered unique ids for each row?
You could potentially store this data as a single array of unique ids, and you should be able to fit into memory. If you allow 4k for every unique key (conservative estimate, includes overhead etc), and you don't keep the timestamps, so it's just an array of integers, it might use up about 1.1GB of RAM for 3 million records. That's not a whole heap, and presumably you only want a small window of active data, or perhaps you are processing row by row?
Make a generator function to do all of this. That way, once you complete iteration it should free up the memory, without having to del anything, and it also makes your code easier to follow and avoids bloating the actual important logic of your calculation loop.
If you can't store it all in memory, or for some other reason this doesn't work, then the best thing you can do, is work out how much you can store in memory. You can potentially split the job into multiple requests, and use multithreading to send a request once the last one has finished, while you process the data into your new file. It shouldn't use up memory, until you ask for the data to be returned. Try and work out if the delay is the request being fulfilled, or the data being downloaded.
From the sounds of it, you might be abstracting the database, and letting pandas make the requests. It might be worth looking at how it's limiting the results. You should be able to make the request for all the data, but only load the results one row at a time from the database server.
I am creating a Python desktop application that allows users to select different distributional forms to model agricultural yield data. I have the time series agricultural data - close to a million rows - saved in a SQLite database (although this is not set in stone if someone knows of a better choice). Once the user selects some data, say corn yields from 1990-2010 in Illinois, I want them to select a distributional form from a drop-down. Next, my function fits the distribution to the data and outputs 10,000 points drawn from that fitted distributional form in a Numpy array. I would like this data to be temporary during the execution of the program.
In an attempt to be efficient, I would only like to make this fit and the subsequent drawing of numbers one time for a specified region and distribution. I have been researching temporary files in Python, but I am not sure that is the best approach for saving many different Numpy arrays. PyTables also looks like an interesting approach and seems to be compatible with Numpy, but I am not sure it is good for handling temporary data. No SQL solutions, like MongoDB, seem to be very popular these days as well, which also interests me from a resume building perspective.
Edit: After reading the comment below and researching it, I am going to go with PyTables, but I am trying to find the best way to tackle this. Is it possible to create a table like below, where instead of Float32Col I can use createTimeSeriesTable() from the scikits time series class or do I need to create a datetime column for the date and a boolean column for the mask, in addition to the Float32Col below to hold the data. Or is there a better way to be going about this problem?
class Yield(IsDescription):
geography_id = UInt16Col()
data = Float32Col(shape=(50, 1)) # for 50 years of data
Any help on the matter would be greatly appreciated.
What's your use case for the temporary data? Are you just going to be reading it all in at once (and never wanting to just read in a subset)?
If so, just save the array to a temporary file (e.g. with numpy.save, or equivalently, pickle with a binary protocol). There's no need for fancier solutions in that case.
On a side note, I'd highly recommend PyTables over SQLite for storing your original time series data.
Based on what it sounds like you're doing, you're not going to need the "relational" parts of a relational database (e.g. joins). If you don't need to join or relate tables, you just need fast simple queries, and you want the data in memory as a numpy array, PyTables is an excellent option. PyTables uses HDF to store your data, which can be much more compact on disk than a SQLite database. PyTables is also considerably faster for loading large chunks of data into memory as numpy arrays.