I'm writing a short program for some standard dataframe operations in Pandas, but its runtime is O(n) due to the following piece of code:
criteria = (cars["Color"] == order["CarColors"]["Include"]) \
    & (cars["Size"] != order["CarSize"]["Exclude"])
cars[criteria]
criteria is used to filter the cars dataframe, as I only want to include certain colors and exclude certain sizes. I ran the program on increasingly large cars files and the runtime increases linearly with the number of rows.
I also tried np.isin as below but it actually made the performance worse. Anyone have an idea how I can improve the time complexity? I thought Boolean operators would be quicker than this.
criteria = np.isin(cars["Color"], order["CarColors"]["Include"]) \
    & np.isin(cars["Size"], order["CarSize"]["Exclude"], invert=True)
Thanks
You may want to try and evaluate it like this:
car_color = order["CarColors"]["Include"]
car_size = order["CarSize"]["Exclude"]
cars.query("Color == #car_color and Size != #car_size")
When you do boolean indexing, you create some temporary arrays. If these arrays are much bigger than your CPU cache, the query method can improve performance a bit (source).
I'm analyzing a dataset with 200 columns and 6000 rows. I computed all the possible differences between columns using itertools and added them to the dataset, so the number of columns has increased. Until now everything worked fine and the kernel had no problems. The kernel dies when I try to group columns with the same first value and sum them.
import itertools
import pandas as pd

# difference between two columns, all possible combinations 1-2, 1-3, ..., 199-200
def sb(df):
    comb = itertools.permutations(df.columns, 2)
    N_f = pd.DataFrame()
    N_f = pd.concat([df[a] - df[b] for a, b in comb], axis=1)
    N_f.iloc[0, :] = [abs(number) for number in N_f.iloc[0, :]]
    return N_f
# Here I transform the first row into column headers and then try to sum columns with the same header
def fg(f):
    f.columns = f.iloc[0]
    f = f.iloc[1:]
    f = f.groupby(f.columns, axis=1).sum()
    return f
Now I tried to run the code without the groupby part, but the kernel keeps dying.
Kernel crashes often suggest a large spike in resource usage, which your machine and/or Jupyter configuration could not handle.
The question is then, "What am I doing that is using so many resources?".
That's for you to figure out, but my guess is that it has to do with your list comprehension over permutations. Permutations are extremely expensive: with 200 columns, permutations taken two at a time gives 200 × 199 = 39,800 new columns of 6,000 rows each, and holding an in-memory data structure for every one of them is going to hurt.
I suggest debugging like so:
# Print out the size of this. Does it surprise you?
comb = itertools.permutations(df.columns, 2)

N_f = pd.DataFrame()

# Instead of doing these operations in one list comprehension,
# make a for loop and print out the memory usage at each
# iteration of the loop. How is it scaling?
N_f = pd.concat([df[a] - df[b] for a, b in comb], axis=1)
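For example, something along these lines (just a sketch; materializing the list of pairs and the 5000-step print interval are my own choices, and df is your original frame):

import itertools
import pandas as pd

# Materialize the permutations once to see how many column pairs there really are.
pairs = list(itertools.permutations(df.columns, 2))
print("number of new columns:", len(pairs))   # 200 columns -> 200 * 199 = 39800 pairs

# Build the differences in an explicit loop and watch the memory grow.
diffs = []
for i, (a, b) in enumerate(pairs):
    diffs.append(df[a] - df[b])
    if i % 5000 == 0:
        held = sum(s.memory_usage(deep=True) for s in diffs)
        print(f"{i} pairs built, ~{held / 1e9:.2f} GB held in the list")

N_f = pd.concat(diffs, axis=1)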
I have a large list of pandas.Series that I would like to merge into a single DataFrame. This list is produced asynchronously using multiprocessing.imap_unordered and new pd.Series objects come in every few seconds. My current approach is to call pd.DataFrame on the list of series as follows:
timeseries_lst = []
counter = 0
for timeseries in pool.imap_unordered(CPU_intensive_function, args):
    if counter % 500 == 0:
        logger.debug(f"Finished creating timeseries for {counter} out of {nr_jobs}")
    counter += 1
    timeseries_lst.append(timeseries)
timeseries_df = pd.DataFrame(timeseries_lst)
The problem is that during the last line, my available RAM is all used up (I get an exit code 137 error). Unfortunately it is not possible to provide a runnable example, because the data is several hundred GB large. Increasing the swap memory is not a feasible option, since the available RAM is already quite large (about 1 TB) and a bit of swap is not going to make much of a difference.
My idea is that one could, at regular intervals of maybe 500 iterations, add the new series to a growing dataframe. This would allow for cleaning the timeseries_lst and thereby reduce RAM intensity. My question would however be the following: What is the most efficient approach to do so? The options that I can think of are:
Create small dataframes with the new data and merge into the growing dataframe
Concat the growing dataframe and the new series
Does anybody know which of these two would be more efficient? Or maybe have a better idea? I have seen this answer, but it would not really reduce RAM usage, since the small dataframes still need to be held in memory.
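For concreteness, this is roughly how I picture option 1 (just a sketch reusing the names from the snippet above; I don't know yet whether the repeated concat actually ends up cheaper):

import pandas as pd

# Every 500 series, turn the batch into a small DataFrame, concat it onto the
# growing result and clear the list so the raw series can be freed.
timeseries_df = None
batch = []
for timeseries in pool.imap_unordered(CPU_intensive_function, args):
    batch.append(timeseries)
    if len(batch) >= 500:
        part = pd.DataFrame(batch)
        timeseries_df = part if timeseries_df is None else pd.concat([timeseries_df, part])
        batch = []
if batch:
    part = pd.DataFrame(batch)
    timeseries_df = part if timeseries_df is None else pd.concat([timeseries_df, part])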
Thanks a lot!
Edit: Thanks to Timus, I am one step further.
Pandas uses the following code when creating a DataFrame:
elif is_list_like(data):
    if not isinstance(data, (abc.Sequence, ExtensionArray)):
        data = list(data)  # <-- We don't want this
So what would a generator function have to look like to be considered an instance of either abc.Sequence or ExtensionArray? Thanks!
Good evening!
I have code similar to what I will paste below; it has a lot more data, but the premise is the same. From both DataFrames I have to pull the first five values, but when I am dealing with tens of millions of entries I cannot afford to wait, sometimes up to an hour, for it to compute the whole DataFrame and return the first five values. I also cannot use plain Pandas DataFrames, as they exceed my memory limit. Is there a solution to this?
import random
import pandas
import dask.dataframe as dd
import time
# 10,000,000 random integers between 1 and 1,000,000.
random_pool = [random.randint(1, 1000000) for i in range(10000000)]
random.shuffle(random_pool)
df1 = dd.from_pandas(pandas.DataFrame(random_pool[:100000], columns=["ID"]), npartitions=10)
df2 = dd.from_pandas(pandas.DataFrame(random_pool, columns=["ID"]), npartitions=10)
# Sorting both dataframes.
df1 = df1.sort_values("ID", ascending=True)
df2 = df2.sort_values("ID", ascending=True)
df1_start = time.time()
df1.head(5)
print("DF1 took {:.2f}.".format(time.time() - df1_start))
df2_start = time.time()
df2.head(5)
print("DF2 took {:.2f}.".format(time.time() - df2_start))
The first DataFrame takes around 0.41 seconds, while the second one takes around 1.79 seconds.
One thing to keep in mind is that a value in Dask is really a serialized stack of operations. A lot of the computation is deferred until you actually ask for the values to be materialized, for example by calling head or, more generally, .compute().
In the spirit of the general advice on persist, you can try to use .persist() after the sort calls:
It is often ideal to load, filter, and shuffle data once and keep this result in memory. Afterwards, each of the several complex queries can be based off of this in-memory data rather than have to repeat the full load-filter-shuffle process each time. To do this, use the client.persist method [ The .persist() method in this case ].
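Concretely, for the example above that might look like the sketch below; the point is that the expensive shuffle then runs only once, and later calls reuse the in-memory result:

# Persist the sorted collections so the shuffle happens once and the result
# stays in memory for later queries.
df1 = df1.sort_values("ID", ascending=True).persist()
df2 = df2.sort_values("ID", ascending=True).persist()

# These now only pull the first rows of the already-sorted, persisted result
# instead of re-running the full sort each time.
print(df1.head(5))
print(df2.head(5))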
It's also worth thinking about what happens if we don't persist here: the task graph that has to be resolved when you call head still includes the sort_values call, so you are probably paying the cost of sorting all of your data every time you call head. That explains why getting just the first five items has a cost proportional to the size of the whole dataset: the whole dataset is being sorted.
The answer is that Dask is quite fast at getting the first five items, but it might not be so fast at resolving all the computations needed to get there if they are not already in memory.
In general, you should avoid whole-dataset shuffles like the sort in this example!
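If all you ultimately need are the first five values, one alternative worth measuring (not something used above, just a suggestion) is Dask's nsmallest, which as far as I know reduces each partition first and then combines the partial results, so no global shuffle is required:

# Sketch: the five smallest IDs without sorting the whole collection.
top5 = df2.nsmallest(5, columns="ID").compute()
print(top5)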
I have two datasets from different sources and I'd like to filter out redundancy. So I have a function called "compare_df" that takes one row from each df and compares them; when they match it returns True, otherwise False.
But "compare_df" is more complex than a plain comparison, because the information is not formatted the same way and I check for time-window overlaps, so simply checking whether elements match is not possible.
Also, there are no matching columns in the two dfs.
My current solution uses apply twice, as in this code:
result_df = first_df[first_df.apply(
    lambda x: ~second_df.apply(compare_df, axis=1, args=[x, no_end_time]).any(),
    axis=1)]
Is there an easy way to optimize the code so that it runs faster?
Maybe it is possible to just "break" out of the second apply as soon as a True value is returned.
Using iterrows should be slower, by the way, because there shouldn't be that much redundancy, so the benefit of breaking out of the loop early probably won't outweigh the faster apply from pandas.
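Still, to make the early-exit idea concrete, this is roughly what I mean (a sketch only, assuming compare_df takes a row of second_df first, as in the apply call above):

# any() over a generator stops at the first True, so compare_df is not
# evaluated for the remaining rows of second_df once a match is found.
def has_match(x):
    return any(
        compare_df(row, x, no_end_time)
        for _, row in second_df.iterrows()
    )

result_df = first_df[~first_df.apply(has_match, axis=1)]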
(I know numba could help, but it seems too complicated for this simple task)
Thanks for suggestions and other hints!
Problem
I have a large (> 500e6 rows) dataset that I've put into a pytables database.
Let's say the first column is ID and the second column is a counter for each ID; each ID-counter combination has to be unique. There is one non-unique row among the 500e6 rows that I'm trying to find.
As a starter I've done something like this:
index1 = db.cols.id.create_index()
index2 = db.cols.counts.create_index()

for row in db:
    query = '(id == %d) & (counts == %d)' % (row['id'], row['counts'])
    result = th.readWhere(query)
    if len(result) > 1:
        print(row)
It's a brute force method I'll admit. Any suggestions on improvements?
Update
The current brute-force runtime is 8421 minutes.
Solution
Thanks for the input everyone. I managed to get the runtime down to 2364.7 seconds using the following method:
import numpy as np
import tables as tb

# pack (id, counts) into a single hash column: hash = id * 2**16 + counts
ex = tb.Expr('(x * 65536) + y', uservars={"x": th.cols.id, "y": th.cols.counts})
ex.setOutput(th.cols.hash)
ex.eval()

indexrows = th.cols.hash.create_csindex(filters=filters)

ref = None
dups = []
for row in th.itersorted(sortby=th.cols.hash):
    if row['hash'] == ref:
        dups.append(row['hash'])
    ref = row['hash']

print("ids: ", np.right_shift(np.array(dups, dtype=np.int64), 16))
print("counts: ", np.array(dups, dtype=np.int64) & (65536 - 1))
I can generate a perfect hash because my maximum values are less than 2^16. I am effectively bit packing the two columns into a 32 bit int.
Once the csindex is generated it is fairly trivial to iterate over the sorted values and do a neighbor test for duplicates.
This method can probably be tweaked a bit, but I'm testing a few alternatives that may provide a more natural solution.
Two obvious techniques come to mind: hashing and sorting.
A) define a hash function to combine ID and Counter into a single, compact value.
B) count how often each hash code occurs
C) select from your data everything that has hash collisions (this should be a much smaller data set)
D) sort this data set to find duplicates.
The hash function in A) needs to be chosen so that the resulting hash table fits into main memory while still providing enough selectivity. Maybe use two bitsets of 2^30 bits or so for this. You can afford 5-10% collisions; that should still reduce the data set enough to allow fast in-memory sorting afterwards.
This is essentially a Bloom filter.
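For illustration, here is a rough NumPy sketch of the bitset idea. It assumes a hypothetical iter_hash_chunks() generator that yields int64 arrays of the packed (id << 16) | count values in manageable chunks. Because that packed value fits into 32 bits here (as in the solution above), a single exact bit array of 2^32 bits (512 MB) works and there are no false positives; with a wider key you would hash down to smaller bitsets as described above and re-check the candidates afterwards.

import numpy as np

seen = np.zeros(2**29, dtype=np.uint8)           # 2**32 bits as one byte array (~512 MB)
candidates = []

for h in iter_hash_chunks():                     # hypothetical chunked reader
    h = np.sort(h)                               # so repeats within a chunk sit next to each other
    in_chunk_dup = np.zeros(len(h), dtype=bool)
    in_chunk_dup[1:] = h[1:] == h[:-1]           # duplicate within this chunk
    byte = h >> 3
    bit = (1 << (h & 7)).astype(np.uint8)
    seen_before = (seen[byte] & bit) != 0        # bit already set -> seen in an earlier chunk
    candidates.append(h[seen_before | in_chunk_dup])
    np.bitwise_or.at(seen, byte, bit)            # handles repeated byte indices correctly

dups = np.unique(np.concatenate(candidates))     # the offending packed hashes
print("ids:", dups >> 16)
print("counts:", dups & 0xFFFF)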
The brute force approach that you've taken appears to require that you execute 500e6 queries, one for each row of the table. Although I think that the hashing and sorting approaches suggested in another answer are essentially correct, it's worth noting that pytables is already supposedly built for speed, and should already be expected to have these kinds of techniques effectively included "under the hood", so to speak.
I contend that the simple code you have written most likely does not yet take best advantage of the capabilities that pytables already makes available to you.
In the documentation for create_index(), it says that the default settings are optlevel=6 and kind='medium'. It mentions that you can increase the speed of each of your 500e6 queries by decreasing the entropy of the index, and you can decrease the entropy of your index to its minimum possible value (zero) either by choosing non-default values of optlevel=9 and kind='full', or equivalently, by generating the index with a call to create_csindex() instead. According to the documentation, you have to pay a little more upfront by taking a longer time to create a better optimized index to begin with, but then it pays you back later by saving you time on the series of queries that you have to repeat 500e6 times.
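For example (a sketch, assuming `th` is the open table from the question):

# Build completely sorted (zero-entropy) indexes before the query loop.
th.cols.id.create_csindex()
th.cols.counts.create_csindex()

# According to the docs this is equivalent to:
# th.cols.id.create_index(optlevel=9, kind='full')
# th.cols.counts.create_index(optlevel=9, kind='full')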
If optimizing your pytables column indices fails to speed up your code sufficiently, and you simply want to perform a massive sort of all of the rows and then search for duplicates by looking for matches in adjacent sorted rows, it's possible to perform a merge sort in O(N log N) time using relatively modest amounts of memory by sorting the data in chunks and saving the chunks to temporary files on disk. Examples here and here demonstrate in principle how to do it in Python specifically. But you should really try optimizing your pytables index first, as that's likely to provide a much simpler and more natural solution in your particular case.
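If you do end up going that route, here is a rough sketch of the chunked sort-and-merge idea, assuming a hypothetical iter_hash_chunks() generator that yields the packed int64 hashes in memory-sized chunks:

import heapq
import tempfile

import numpy as np

# Sort each chunk in memory and spill it to a temporary file (a sorted "run").
run_paths = []
for chunk in iter_hash_chunks():
    chunk = np.sort(chunk)
    with tempfile.NamedTemporaryFile(delete=False) as f:
        chunk.tofile(f)
        run_paths.append(f.name)

def read_run(path):
    # Each run was small enough to sort in memory, so reading it back whole is fine.
    for value in np.fromfile(path, dtype=np.int64):
        yield int(value)

# heapq.merge streams the sorted runs back in globally sorted order,
# so duplicates show up as equal neighbours.
prev = None
for value in heapq.merge(*(read_run(p) for p in run_paths)):
    if value == prev:
        print("duplicate hash:", value)
    prev = value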