Is sorting a DataFrame memory efficient?

Is sorting a DataFrame in pandas memory efficient? I.e., can I sort the dataframe without reading the whole thing into memory?

Internally, pandas relies on numpy.argsort to do all the sorting.
More to the point: pandas DataFrames are backed by NumPy arrays, which have to be present in memory as a whole. So, to answer your question: no, pandas needs the whole dataset in memory in order to sort it.
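To illustrate that first point, sorting by a column is conceptually an argsort of the underlying array followed by a take, and both steps operate on fully materialized in-memory arrays. A minimal sketch on synthetic data (not pandas' exact code path):

import numpy as np
import pandas as pd

# Synthetic example data.
df = pd.DataFrame({"key": np.random.rand(1_000_000), "val": np.arange(1_000_000)})

# Roughly what df.sort_values("key") does: argsort the column's NumPy
# array, then reorder the whole frame with the resulting indices.
order = np.argsort(df["key"].to_numpy(), kind="stable")
sorted_df = df.iloc[order]

# Same result via the public API; both need the full data in RAM.
assert sorted_df.equals(df.sort_values("key", kind="stable"))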
Additional thoughts:
You can, of course, implement such a disk-based external sort in multiple steps: load a chunk of your dataset, sort it, and save the sorted version; repeat for every chunk. Then load a part of each sorted subset, join them into one DataFrame, and sort that. You'll have to be careful here about how much to load from each source. For example, if your 1000-element dataset is already sorted, taking the top 10 results from each of the 10 subsets won't give you the correct top 100. It will, however, give you the correct top 10. A rough sketch of the idea is shown below.
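Here is a minimal sketch of that chunked approach, assuming the data sits in a CSV file called data.csv with a column named sort_key (both names are placeholders). heapq.merge does the final k-way merge as a stream, so the merged result never has to be in memory all at once:

import heapq
import itertools
import pandas as pd

# 'data.csv' and 'sort_key' are placeholders for the real file and sort column.
CHUNK_ROWS = 1_000_000  # tune so one chunk fits comfortably in memory

# Pass 1: sort each chunk independently and write it back to disk.
chunk_files = []
for i, chunk in enumerate(pd.read_csv("data.csv", chunksize=CHUNK_ROWS)):
    path = f"sorted_chunk_{i}.csv"
    chunk.sort_values("sort_key").to_csv(path, index=False)
    chunk_files.append(path)

# Pass 2: k-way merge of the sorted chunks, streaming row by row.
def rows(path):
    for piece in pd.read_csv(path, chunksize=10_000):
        yield from piece.itertuples(index=False)

merged = heapq.merge(*(rows(f) for f in chunk_files), key=lambda r: r.sort_key)

# Write the merged stream out in batches so it never sits in memory whole.
with open("data_sorted.csv", "w") as out:
    wrote_header = False
    while True:
        batch = list(itertools.islice(merged, 100_000))
        if not batch:
            break
        pd.DataFrame(batch).to_csv(out, index=False, header=not wrote_header)
        wrote_header = True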
Without further information about your data, I suggest you let some (relational) database handle all that stuff. They're made for this kind of thing, after all.

Related

Improve Pandas performance for very large dataframes?

I have a few Pandas dataframes with several millions of rows each. The dataframes have columns containing JSON objects each with 100+ fields. I have a set of 24 functions that run sequentially on the dataframes, process the JSON (for example, compute some string distance between two fields in the JSON) and return a JSON with some new fields added. After all 24 functions execute, I get a final JSON which is then usable for my purposes.
I am wondering about the best ways to speed up performance for this dataset. A few things I have considered and read up on:
It is tricky to vectorize because many operations are not as straightforward as "subtract this column's values from another column's values".
I read up on some of the Pandas documentation and a few options indicated are Cython (may be tricky to convert the string edit distance to Cython, especially since I am using an external Python package) and Numba/JIT (but this is mentioned to be best for numerical computations only).
Possibly controlling the number of threads could be an option. The 24 functions can mostly operate without any dependencies on each other.
You are asking for advice, and this is not the best site for general advice, but nevertheless I will try to point a few things out.
The ideas you have already considered are not going to be helpful: neither Cython, Numba, nor threading will address the main problem, which is that the format of your data is not conducive to fast operations on it.
I suggest that you first "unpack" the JSONs that you store in the column(s?) of your dataframe. Preferably, each field of the JSON (mandatory or optional; deal with empty values at this stage) ends up being a column of the dataframe. If there are nested dictionaries, you may want to consider splitting the dataframe (particularly if the 24 functions work separately on separate nested JSON dicts). Alternatively, you should strive to flatten the JSONs.
Convert to the data format that gives you the best performance. JSON stores all the data in the textual format. Numbers are best used in their binary format. You can do that column-wise on the columns that you suspect should be converted using df['col'].astype(...) (works on the whole dataframe too).
Update the 24 functions to operate not on JSON strings stored in dataframe but on the fields of the dataframe.
Recombine the JSONs for storage (I assume you need them in this format). At this stage the implicit conversion from numbers to strings will occur.
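A minimal sketch of the unpack / convert / recombine flow described above, assuming a column named raw_json that holds JSON strings (the column and field names are made up for illustration):

import json
import pandas as pd

# Toy data; 'raw_json', 'a', and 'b.c' are illustrative names only.
df = pd.DataFrame({"raw_json": ['{"a": "1", "b": {"c": "2.5"}}',
                                '{"a": "3", "b": {"c": "4.0"}}']})

# 1. Unpack: one dataframe column per (flattened) JSON field.
fields = pd.json_normalize(df["raw_json"].map(json.loads).tolist())
# columns are now "a" and "b.c"

# 2. Convert textual values to binary dtypes where appropriate.
fields = fields.astype({"a": "int64", "b.c": "float64"})

# 3. The 24 functions would now work on these columns directly
#    (vectorized where possible) instead of parsing JSON per row.

# 4. Recombine into JSON strings for storage; numbers become text again here.
#    (Recent pandas returns native Python scalars from to_dict; older
#    versions may need an explicit conversion before json.dumps.)
df["raw_json"] = [json.dumps(rec) for rec in fields.to_dict(orient="records")]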
Given the level of detail you provided in the question, the suggestions are necessarily brief. Should you have more detailed questions about any of the above points, it would be best to ask a maximally simple question on each of them (preferably containing a self-sufficient MWE).

Very efficient parallel sorting of big array in NumPy or Numba

I have an array of events over time. There are more than a billion rows, and each row is made of:
user_id
timestamp
event_type
other high-precision (float64) numerical columns
I need to sort this array in multiple manners, but let's take timestamp for example. I've tried several methods, including:
sort the entire array directly with arr = arr[arr.argsort()] => awful performance
using the same method as above, sort chunks of the array in parallel and then do a final sort => better performance
Still, even with the "sort in chunks then merge" method, performance rapidly degrades with the number of items. It takes more than 5 minutes to sort 200M rows on a 128 CPU machine. This is a problem because I need to perform multiple sortings consecutively, and I need to do it "on the fly", i.e. persisting the sorted array on disk once and for all is not an option because items are always added and removed, not necessarily in chronological order.
Is there a way to drastically improve the performance of this sorting? For instance, could a Numba implementation of mergesort which works in parallel be a solution? Also, I'd be interested in a method which works for sorting on multiple columns (e.g. sort on [user_id, timestamp]).
Notes:
the array is of dtype np.float64 and needs to stay that way because of the contents of other columns which are not mentioned in the above example.
we thought of using a database with separate tables, one for each particular sort order; the advantage would be that the tables are always perfectly sorted, but the drawback is retrieval speed
for the above reason, we went with local Parquet files, which are blazingly fast to retrieve, but that means we need to sort them once loaded
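For reference, this is the kind of column-keyed sorting being described (a sketch of the setup on synthetic data, not a tuned solution; the column positions are placeholders): sorting a 2-D float64 array by one column is an argsort of that column followed by a fancy-indexing take, and multi-column ordering can be expressed with np.lexsort, whose last key is the primary one:

import numpy as np

# Synthetic data. Columns: 0 = user_id, 1 = timestamp, rest = other float64 data.
rng = np.random.default_rng(0)
arr = rng.random((1_000_000, 4))

# Sort by a single column (timestamp).
order = np.argsort(arr[:, 1], kind="stable")
arr_by_time = arr[order]

# Sort by (user_id, timestamp): np.lexsort takes keys in reverse priority,
# so the primary key (user_id) goes last.
order = np.lexsort((arr[:, 1], arr[:, 0]))
arr_by_user_time = arr[order]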

Memory Error when parsing two large data frames

I have two dataframes, each around 400k rows; call them A and B. What I want to do is, for every row in dataframe B, look up that row's account number in dataframe A. If it exists, I want to drop that row from dataframe A. The problem is that when I try to run this code, I keep getting memory errors. Initially I was using iterrows, but that seems to be bad when working with large datasets, so I switched to apply, but I'm running into the same error. Below is simplified pseudocode of what I'm trying:
def reduceAccount(accountID):
    idx = frameA.loc[frameA["AccountID"] == accountID].index
    frameA.drop(idx, inplace=True)

frameB["AccountID"].apply(reduceAccount)
I've even tried some shenanigans like iterating through the first few hundred/thousand rows, but after a cycle I still hit the memory error, which makes me think I'm still loading things into memory rather than clearing through. Is there a better way to reduce dataframe A than what I'm trying? Note that I do not want to merge the frames (yet), just remove any row in dataframe A that has a duplicate key in dataframe B.
The issue is that in order to see all the values you need to filter on, you will have to have both DFs in memory at some point. You can improve efficiency somewhat by not using apply(), which still iterates row by row. The following is a more efficient, vectorized approach using boolean masking directly:
frameA[~frameA["AccountID"].isin(frameB["AccountID"])]
However, if storage is the problem, then this may still not work. Some approaches to consider are chunking the data, as you say you've already tried, or some of the options in the documentation on enhancing performance.
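If frameA itself is too large to hold comfortably, one way to apply the same mask piecewise is to stream frameA from disk in chunks while keeping only frameB's AccountIDs in memory. A rough sketch, assuming frameA lives in a CSV file (the file name and chunk size are placeholders):

import pandas as pd

# Only frameB's keys need to stay in memory.
b_ids = set(frameB["AccountID"])

# 'frameA.csv' is a placeholder for wherever frame A is actually stored.
filtered = []
for chunk in pd.read_csv("frameA.csv", chunksize=100_000):
    # Keep rows of A whose AccountID does not appear in B.
    filtered.append(chunk[~chunk["AccountID"].isin(b_ids)])

frameA_reduced = pd.concat(filtered, ignore_index=True)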
So basically you want every row in A whose 'AccountID' is not in B.
This can be done with a left join, using the merge indicator to keep only the rows of A that found no match in B:
frameA = frameA.merge(frameB[['AccountID']].drop_duplicates(), on='AccountID', how='left', indicator=True)
frameA = frameA[frameA['_merge'] == 'left_only'].drop(columns='_merge')
I think this is best in terms of memory efficiency, since you'll be leveraging the power of pandas' built-in optimized code.

General Approach to Working with Data in DataFrames

Question for experienced Pandas users on approach to working with Dataframe data.
Invariably we want to use Pandas to explore relationships among data elements. Sometimes we use groupby-type functions to get summary-level data on subsets of the data. Sometimes we use plots and charts to compare one column of data against another. I'm sure there are other applications I haven't thought of.
When I speak with other fairly novice users like myself, they generally try to extract portions of a "large" dataframe into smaller dfs that are sorted or formatted properly to run applications or plot. This approach certainly has disadvantages in that if you strip out a subset of data into a smaller df and then want to run an analysis against a column of data you left in the bigger df, you have to go back and recut stuff.
My question is: is it best practice for more experienced users to keep the large dataframe and pull out the data syntactically, so that the effect is the same as (or similar to) cutting out a smaller df? Or is it best to actually cut out smaller dfs to work with?
Thanks in advance.
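For concreteness, here is a small sketch of the two workflows being contrasted (the column names are made up for illustration):

import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({"region": ["east", "west", "east"],
                   "sales": [10, 20, 30],
                   "returns": [1, 2, 3]})

# Approach 1: cut out a smaller df and keep working on the copy.
east = df[df["region"] == "east"].copy()
east_sales = east["sales"].sum()

# Approach 2: keep the one large df and express the subset in the query itself.
east_sales = df.loc[df["region"] == "east", "sales"].sum()
sales_by_region = df.groupby("region")["sales"].sum()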

Efficiently add single row to Pandas Series or DataFrame

I want to use Pandas to work with series in real-time. Every second, I need to add the latest observation to an existing series. My series are grouped into a DataFrame and stored in an HDF5 file.
Here's how I do it at the moment:
>> existing_series = Series([7,13,97], [0,1,2])
>> updated_series = existing_series.append( Series([111], [3]) )
Is this the most efficient way? I've read countless posts but cannot find any that focuses on efficiency with high-frequency data.
Edit: I just read about the modules shelve and pickle. It seems like they would achieve what I'm trying to do, basically saving lists to disk. Because my lists are large, is there any way not to load the full list into memory but, rather, efficiently append values one at a time?
Take a look at the new PyTables docs in 0.10 (coming soon), or the version you can get from master: http://pandas.pydata.org/pandas-docs/dev/whatsnew.html
PyTables is actually pretty good at appending, and writing to an HDFStore every second will work. You want to store the DataFrame as a table. You can then select data in a query-like fashion, e.g.:
store = pd.HDFStore('mydata.h5')            # open (or create) the on-disk store; file name is illustrative
store.append('df', the_latest_df)           # appends the latest rows to the 'df' table
store.append('df', the_latest_df)
....
store.select('df', [ 'index>12:00:01' ])    # query-like retrieval on the index
If this is all from the same process, then this will work great. If you have a writer process and then another process is reading, this is a little tricky (but will work correctly depending on what you are doing).
Another option is to use messaging to transmit from one process to another (and then append in memory); this avoids the serialization issue.
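A small end-to-end sketch of the append-every-second pattern described in this answer (the file name, store key, and fake observation are placeholders, and the date in the query is illustrative):

import time
import pandas as pd

store = pd.HDFStore("ticks.h5")             # illustrative file name

def latest_observation():
    # Placeholder for whatever produces the newest data point each second.
    return pd.DataFrame({"value": [42.0]}, index=[pd.Timestamp.now()])

for _ in range(3):                          # in practice: while True
    store.append("df", latest_observation(), format="table")
    time.sleep(1)

# Query-style retrieval of a time slice from the on-disk table.
recent = store.select("df", where="index > '2012-01-01'")
store.close()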
