Efficient insertion of row into sorted DataFrame - python

My problem requires the incremental addition of rows into a sorted DataFrame (with a DateTimeIndex), but I'm currently unable to find an efficient way to do this. There doesn't seem to be any concept of an "insort".
I've tried appending the row and re-sorting in place, and I've also tried getting the insertion point with searchsorted and then slicing and concatenating to create a new DataFrame. Both approaches turned out to be too slow.
Is Pandas just not suited to jobs where you don't have all the data at once and instead get your data incrementally?
Solutions I've tried:
Concatenation
import pandas

def insert_data(df, data, index):
    insertion_index = df.index.searchsorted(index)
    new_df = pandas.concat([df[:insertion_index], pandas.DataFrame(data, index=[index]), df[insertion_index:]])
    return new_df, insertion_index
Resorting
def insert_data(df, data, index):
    # note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pandas.concat is the replacement
    new_df = df.append(pandas.DataFrame(data, index=[index]))
    new_df.sort_index(inplace=True)
    return new_df

pandas is built on numpy, and numpy arrays are fixed-size objects. While numpy has append and insert functions, in practice they construct new arrays from the old and new data.
There are two practical approaches to incrementally building these arrays:
initialize a large empty array, and fill in values incrementally
incrementally create a Python list (or dictionary), and create the array from the completed list.
Appending to a Python list is a common and fast operation. There is also a list insert, but it is slower. For sorted inserts the standard library provides the bisect module.
Pandas may have added functions to deal with common creation scenarios, but unless it has coded something special in C, they are unlikely to be faster than a more basic Python structure.
Even if you have to use Pandas features at various points along the incremental build, it may be best to create a new DataFrame on the fly from the underlying Python structure, as in the sketch below.
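For illustration, a minimal sketch of the second approach (accumulate in plain Python lists, build the DataFrame once), using bisect for the sorted insert; the names insert_row, timestamps and rows are placeholders, not part of any pandas API:

import bisect
import pandas as pd

timestamps = []   # kept sorted
rows = []         # row dicts, kept parallel to timestamps

def insert_row(ts, row):
    # find the sorted position in O(log n), then shift the lists
    # (cheap compared to rebuilding a DataFrame on every insert)
    pos = bisect.bisect_left(timestamps, ts)
    timestamps.insert(pos, ts)
    rows.insert(pos, row)

# ... call insert_row(timestamp, row_dict) for each incoming record ...

# build the DataFrame once, after the incremental phase
df = pd.DataFrame(rows, index=pd.DatetimeIndex(timestamps))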

Related

Memory-efficient way to merge large list of series into dataframe

I have a large list of pandas.Series that I would like to merge into a single DataFrame. This list is produced asynchronously using multiprocessing.imap_unordered and new pd.Series objects come in every few seconds. My current approach is to call pd.DataFrame on the list of series as follows:
timeseries_lst = []
counter = 0
for timeseries in pool.imap_unordered(CPU_intensive_function, args):
    if counter % 500 == 0:
        logger.debug(f"Finished creating timeseries for {counter} out of {nr_jobs}")
    counter += 1
    timeseries_lst.append(timeseries)
timeseries_df = pd.DataFrame(timeseries_lst)
The problem is that during the last line, my available RAM is all used up (I get an exit code 137 error). Unfortunately it is not possible to provide a runnable example, because the data is several hundred GB in size. Increasing the swap memory is not a feasible option, since the available RAM is already quite large (about 1 TB) and a bit of swap is not going to make much of a difference.
My idea is that one could, at regular intervals of maybe 500 iterations, add the new series to a growing dataframe. This would allow clearing timeseries_lst and thereby reduce RAM usage. My question, however, is: what is the most efficient way to do so? The options I can think of are:
Create small dataframes with the new data and merge into the growing dataframe
Concat the growing dataframe and the new series
Does anybody know which of these two would be more efficient? Or maybe have a better idea? I have seen this answer, but this would not really reduce RAM usage since the small dataframes need to be held in memory.
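For what it's worth, a rough sketch of the periodic-merge idea (closest to option 1: build a small dataframe from the buffered series and concatenate it onto the growing frame), reusing the names from the snippet above; this only illustrates the structure of the idea, not whether it actually solves the memory problem:

import pandas as pd

timeseries_df = pd.DataFrame()
buffer = []
for timeseries in pool.imap_unordered(CPU_intensive_function, args):
    buffer.append(timeseries)
    if len(buffer) == 500:
        # fold the buffered series into the growing frame, then release them
        timeseries_df = pd.concat([timeseries_df, pd.DataFrame(buffer)])
        buffer.clear()
if buffer:
    timeseries_df = pd.concat([timeseries_df, pd.DataFrame(buffer)])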
Thanks a lot!
Edit: Thanks to Timus, I am one step further.
Pandas uses the following code when creating a DataFrame:
elif is_list_like(data):
    if not isinstance(data, (abc.Sequence, ExtensionArray)):
        data = list(data)  # <-- We don't want this
So what would a generator function have to look like to be considered an instance of either abc.Sequence or ExtensionArray? Thanks!
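As a point of reference on that isinstance check (not an official pandas recipe): a generator by itself can never pass it, because collections.abc.Sequence requires random access via __getitem__ and a length via __len__, neither of which generators provide. Below is a minimal sketch of a lazily evaluating class that would pass the check; whether pandas then builds the frame without materializing everything is a separate question:

from collections.abc import Sequence

class LazySeriesSequence(Sequence):
    # wraps a list of zero-argument callables, each producing one series on demand
    def __init__(self, makers):
        self._makers = makers

    def __len__(self):
        return len(self._makers)

    def __getitem__(self, i):
        # integer indices only in this sketch
        return self._makers[i]()   # build the series only when it is actually requested

# isinstance(LazySeriesSequence([...]), Sequence) is True, so the list(data) branch above is skipped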

vaex - create a dataframe from a list of lists

In Vaex's docs, I cannot find a way to create a dataframe from a list of lists.
In pandas I would simply do pd.DataFrame([['A',1,3], ['B',2,4]]).
How can this be done in Vaex?
There is no method to read a list of lists in vaex; however, there is vaex.from_arrays(), which works like this:
vaex.from_arrays(column_name_1=list_of_values_1, column_name_2=list_of_values_2)
If you are working with other Python data structures, you may want to look at vaex.from_dict() or vaex.from_items().
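For illustration, a small sketch of going from the question's list of lists to vaex.from_arrays by transposing rows into columns; the column names are made up:

import numpy as np
import vaex

data = [['A', 1, 3], ['B', 2, 4]]
columns = [np.array(col) for col in zip(*data)]   # transpose rows into per-column arrays
df = vaex.from_arrays(label=columns[0], x=columns[1], y=columns[2])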
Since you already have the data in memory, you can use pd.DataFrame(list_of_lists) and then load it into vaex:
vaex.from_pandas(pd.DataFrame(list_of_lists))
You may want to del the list of lists data afterwards, to free up memory.

Modify pandas columns without changing underlying numpy arrays

My goal:
I have a data structure in C++ which holds strings (or, more accurately, a multi-dimensional char array). I wish to expose this structure to Python via Numpy and Pandas. Eventually the goal is to let the user modify a dataframe in a way that actually modifies the underlying C++ data structure.
What I've accomplished so far:
I've wrapped the C++ data structure with 2D numpy array (via PyArray_New API call) and returned it into python. Then, inside python I'm using pandas.DataFrame(data=ndarray, columns=columns, copy=False) constructor to wrap the ndarray with pandas' dataframe without copying any data.
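For reference, a small pure-Python illustration of that zero-copy wrapping, with an ordinary float array standing in for the C++-backed buffer and made-up column names; whether the DataFrame really shares memory with the array depends on the pandas version and its copy-on-write settings:

import numpy as np
import pandas as pd

# stand-in for the buffer that would be exposed from C++ via PyArray_New
arr = np.arange(6, dtype=np.float64).reshape(3, 2)

df = pd.DataFrame(data=arr, columns=['col_a', 'col_b'], copy=False)

# expected to print True when pandas keeps the 2D array as a single block view
# (older versions without copy-on-write); it may print False on newer versions
print(np.shares_memory(arr, df['col_a'].to_numpy()))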
I've also managed to modify a single column. For example, I turned its strings into lower case in the following way:
tmp = df["Some_field"].str.decode('ascii').str.lower().str.encode('ascii')
df["Some_field"][:] = tmp
The problem:
I'm now trying to make multiple columns lower-case. I thought it would be straightforward, but I've been struggling with this for a while, since the manipulations do not change the underlying numpy arrays.
What I've tried to solve the problem:
1.
fields_to_change = [...]
for field in fields_to_change:
    tmp = df[field].str.decode('ascii').str.lower().str.encode('ascii')
    df[field][:] = tmp
This yields a SettingWithCopyWarning, and the underlying structure is changed only for the first field in fields_to_change.
2.
fields_to_change = [...]
for field in fields_to_change:
    tmp = df[field].str.decode('ascii').str.lower().str.encode('ascii')
    df.loc[:, field] = tmp[:]
This runs without errors or warnings but, again, the underlying data is not changed.
3.
fields_to_change = [...]
for field in fields_to_change:
    tmp = df[field].str.decode('ascii').str.lower().str.encode('ascii')
    np.copyto(dst=df[field].values, src=tmp.values, casting='unsafe')
This works perfectly and changes the underlying data. But this code is problematic in a different respect. The whole point is to expose pandas functionality so the user can transparently modify the underlying data. I could copy all values from the user's manipulated dataframe into the arrays which hold the underlying data, but that would severely slow down my program.
TLDR; my question is:
How can I use pandas to manipulate strings in certain columns without changing the underlying numpy arrays from which the dataframe was composed? Also, is there a way to make sure that the user cannot change underlying numpy arrays?
Thanks very much in advance.

How do I lazily concatenate "numpy ndarray"-like objects for sequential reading?

I have a list of several large hdf5 files, each with a 4D dataset. I would like to obtain a concatenation of them on the first axis, as in, an array-like object that would be used as if all datasets were concatenated. My final intent is to sequentially read chunks of the data along the same axis (e.g. [0:100,:,:,:], [100:200,:,:,:], ...), multiple times.
Datasets in h5py share a significant part of the numpy array API, which allows me to call numpy.concatenate to get the job done:
files = [h5.File(name, 'r') for name in filenames]
X = np.concatenate([f['data'] for f in files], axis=0)
On the other hand, the memory layout is not the same, and memory cannot be shared among them (related question). Alas, concatenate will eagerly copy the entire content of each array-like object into a new array, which I cannot accept in my use case. The source code of the array concatenation function confirms this.
How can I obtain a concatenated view over multiple array-like objects, without eagerly reading them into memory? Slicing and indexing this view should behave just as if I had a concatenated array.
I can imagine that writing a custom wrapper would work, but I would like to know whether such an implementation already exists as a library, or whether another solution to the problem is available and just as feasible. My searches so far have yielded nothing of this sort. I am also willing to accept solutions specific to h5py.
flist = [f['data'] for f in files] is a list of dataset objects. The actual data stays in the h5 files and is accessible as long as those files remain open.
When you do
arr = np.concatenate(flist, axis=0)
I imagine concatenate first does
temp = [np.asarray(a) for a in flist]
that is, it constructs a list of numpy arrays. I assume np.asarray(f['data']) is the same as f['data'].value or f['data'][:] (as I discussed 2 yrs ago in the linked SO question). I should do some time tests comparing that with
arr = np.concatenate([a.value for a in flist], axis=0)
flist is a kind of lazy compilation of these data sets, in that the data still resides in the files and is accessed only when you do something more.
[a.value[:,:,:10] for a in flist]
would load a portion of each of those data sets into memory; I expect that a concatenate on that list would be the equivalent of arr[:,:,:10].
Generators or generator comprehensions are a form of lazy evaluation, but I think they have to be turned into lists before use in concatenate. In any case, the result of concatenate is always an array with all the data in a contiguous block of memory. It is never blocks of data residing in files.
You need to tell us more about what you intend to do with this large concatenated array of data sets. As outlined, I think you can construct arrays that contain slices of all the data sets. You could also perform other actions as I demonstrate in the previous answer - but with an access time cost.
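To make the custom-wrapper idea from the question more concrete, here is a rough sketch of a read-only, axis-0 concatenation over a list of h5py datasets. It is only a sketch: it supports plain contiguous slices along the first axis, assumes all datasets share the same trailing shape and dtype, and the class name is made up:

import numpy as np

class LazyConcat:
    # behaves like the axis-0 concatenation of several h5py datasets,
    # but only reads the requested slice from disk
    def __init__(self, datasets):
        self.datasets = datasets
        self.offsets = np.cumsum([0] + [d.shape[0] for d in datasets])

    @property
    def shape(self):
        return (int(self.offsets[-1]),) + self.datasets[0].shape[1:]

    def __getitem__(self, key):
        start, stop, step = key.indices(self.shape[0])   # key is a slice along axis 0
        assert step == 1, "this sketch only handles contiguous slices"
        parts = []
        for d, offset in zip(self.datasets, self.offsets[:-1]):
            lo = max(start - offset, 0)
            hi = min(stop - offset, d.shape[0])
            if lo < hi:
                parts.append(d[lo:hi])                   # reads only this chunk from its file
        if not parts:
            return np.empty((0,) + self.shape[1:], dtype=self.datasets[0].dtype)
        return np.concatenate(parts, axis=0)

# usage: chunks = LazyConcat([f['data'] for f in files]); block = chunks[0:100]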

I have single-element arrays. How do I change them into the elements themselves?

Importing a JSON document into a pandas dataframe using records = pandas.read_json(path), where path was a pre-defined path to the JSON document, I discovered that the contents of certain columns of the resulting dataframe "records" are not simply strings as expected. Instead, each "cell" in such a column is an array containing one single element -- the string of interest. This makes selecting rows using boolean indexing difficult. For example, records[records['category']=='Python Books'] in IPython outputs an empty dataframe; had the "cells" contained strings instead of arrays of strings, the output would have been nonempty, containing the rows that correspond to Python books.
I could modify the JSON document, so that "records" reads the strings in properly. But is there a way to modify "records" directly, to somehow strip the single-element arrays into the elements themselves?
Update: After clarification, I believe this might accomplish what you want while limiting it to a single iteration over the data:
nested_column_1 = records["column_name_1"]
nested_column_2 = records["column_name_2"]
clean_column_1 = []
clean_column_2 = []
for i in range(len(records.index)):
    clean_column_1.append(nested_column_1[i][0])
    clean_column_2.append(nested_column_2[i][0])
Then you convert the clean_column lists to Series like you mentioned in your comment. Obviously, you make as many nested_column and clean_column lists as you need, and update them all in the loop.
You could generalize this pretty easily by keeping a record of "problem" columns and using that to create a data structure to manage the nested/clean lists, rather than declaring them explicitly as I did in my example. But I thought this might illustrate the approach more clearly.
Obviously, this assumes that all columns have the same number of elements, which maybe isn't a valid assumption in your case.
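Putting the pieces together, a minimal sketch of the generalized approach described above; the column names in problem_columns are placeholders, and records is assumed to be the dataframe from the question:

import pandas as pd

# records = pandas.read_json(path)   # as in the question
problem_columns = ["column_name_1", "column_name_2"]   # hypothetical "problem" columns

clean = {col: [] for col in problem_columns}
for i in range(len(records.index)):
    for col in problem_columns:
        clean[col].append(records[col].iloc[i][0])   # unwrap the single-element array

for col in problem_columns:
    records[col] = pd.Series(clean[col], index=records.index)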
Original Answer:
Sorry if I'm oversimplifying or misunderstanding the problem, but could you just do something like this?
simplified_list = [element[0] for element in my_array_of_arrays]
Or if you don't need the whole thing at once, just a generator instead:
simplifying_generator = (element[0] for element in my_array_of_arrays)
