This is an observation from "Most pythonic way to concatenate pandas cells with conditions".
I can't understand why the third solution takes so much more memory than the first one.
If I don't sample, the third solution does not raise a runtime error, so clearly something is weird.
To emulate a large dataframe I tried to resample, but I never expected to run into this kind of error.
Background
Pretty self-explanatory, one line, looks pythonic:
df['city'] + (df['city'] == 'paris')*('_' + df['arr'].astype(str))
s = """city,arr,final_target
paris,11,paris_11
paris,12,paris_12
dallas,22,dallas
miami,15,miami
paris,16,paris_16"""
import pandas as pd
import numpy as np  # needed for np.where in the timings below
import io
df = pd.read_csv(io.StringIO(s)).sample(1000000, replace=True)
df
Speeds
%%timeit
df['city'] + (df['city'] == 'paris')*('_' + df['arr'].astype(str))
# 877 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['final_target'] = np.where(df['city'].eq('paris'),
                              df['city'] + '_' + df['arr'].astype(str),
                              df['city'])
# 874 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If I don't sample, there is no error and the outputs also match exactly.
Error (updated; only happens when I sample from the dataframe)
%%timeit
df['final_target'] = df['city']
df.loc[df['city'] == 'paris', 'final_target'] += '_' + df['arr'].astype(str)
MemoryError: Unable to allocate 892. GiB for an array with shape (119671145392,) and data type int64
For a smaller input (sample size 100) we get a different error, pointing to a problem with mismatched sizes. But what's up with the memory allocation and the sampling?
ValueError: cannot reindex from a duplicate axis
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-57c5b10090b2> in <module>
1 df['final_target'] = df['city']
----> 2 df.loc[df['city'] == 'paris', 'final_target'] += '_' + df['arr'].astype(str)
~/anaconda3/lib/python3.8/site-packages/pandas/core/ops/methods.py in f(self, other)
99 # we are updating inplace so we want to ignore is_copy
100 self._update_inplace(
--> 101 result.reindex_like(self, copy=False), verify_is_copy=False
102 )
103
I reran them from scratch each time.
Update
This is part of what I figured out:
s = """city,arr,final_target
paris,11,paris_11
paris,12,paris_12
dallas,22,dallas
miami,15,miami
paris,16,paris_16"""
import pandas as pd
import io
df = pd.read_csv(io.StringIO(s)).sample(10, replace=True)
df
city arr final_target
1 paris 12 paris_12
0 paris 11 paris_11
2 dallas 22 dallas
2 dallas 22 dallas
3 miami 15 miami
3 miami 15 miami
2 dallas 22 dallas
1 paris 12 paris_12
0 paris 11 paris_11
3 miami 15 miami
Indices are repeated when sampled with replacement
So resetting the index resolved the problem, even though df.arr and the df.loc selection still have different sizes. Alternatively, replacing the right-hand side with df.loc[df['city'] == 'paris', 'arr'].astype(str) also solves it, just as 2e0byo pointed out.
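The index reset mentioned above can be sketched as follows. This is only a minimal illustration: the sample size of 1000 and the random_state are arbitrary choices of mine, not values from the original test.

```python
import io
import pandas as pd

s = """city,arr,final_target
paris,11,paris_11
paris,12,paris_12
dallas,22,dallas
miami,15,miami
paris,16,paris_16"""

# Sampling with replacement repeats index labels; reset_index(drop=True)
# replaces them with a fresh, unique 0..n-1 index.
df = (pd.read_csv(io.StringIO(s))
        .sample(1000, replace=True, random_state=0)  # size and seed are arbitrary
        .reset_index(drop=True))

expected = df['final_target'].copy()  # the precomputed answer from the CSV

df['final_target'] = df['city']
df.loc[df['city'] == 'paris', 'final_target'] += '_' + df['arr'].astype(str)
```

With unique labels, alignment matches each row to exactly one partner, so the in-place addition no longer blows up.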
Still, can someone explain how .loc works, and why memory explodes when the indices contain duplicates and don't match?
@2e0byo hit the nail on the head saying pandas' algorithm is "inefficient" in this case.
As for .loc, it's not really doing anything remarkable. Its use here is analogous to indexing a numpy array with a boolean array of the same shape, with an added dict-key-like access to a specific column. That is, df['city'] == 'paris' is itself a boolean Series with the same number of rows and the same index as df. df.loc[df['city'] == 'paris'] then gives a dataframe consisting of only the rows where that Series is True (the rows that have 'paris' in the 'city' column). Adding the additional argument 'final_target' then returns only the 'final_target' column of those rows instead of all three (and because it is a single column, it's a Series object; the same goes for df['arr']).
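A tiny sketch of the three objects involved, using a made-up three-row frame (the data here is mine, chosen only to keep the output small):

```python
import pandas as pd

# Hypothetical miniature frame, not the one from the question.
df = pd.DataFrame({'city': ['paris', 'dallas', 'paris'],
                   'arr': [11, 22, 16]})

mask = df['city'] == 'paris'   # boolean Series sharing df's index
rows = df.loc[mask]            # DataFrame: only the rows where mask is True
col = df.loc[mask, 'arr']      # Series: a single column of those rows

print(type(mask).__name__, type(rows).__name__, type(col).__name__)
```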
The memory explosion happens when pandas actually tries to add the two Series. As @2e0byo pointed out, it has to reshape the Series to do this, and it does this by calling the first Series' align() method. During the align operation, the function pandas.core.reshape.merge.get_join_indexers() calls pandas._libs.join.full_outer_join() (line 155) with three arguments: left, right, and max_groups (point of clarification: these are their names inside the function full_outer_join). left and right are integer arrays containing the indexes of the two Series objects (the values in the index column), and max_groups is the maximum number of unique elements in either left or right (in our case, that's five, corresponding to the five original rows in s).
full_outer_join immediately turns and calls pandas._libs.algos.groupsort_indexer() (line 194), once with left and max_groups as arguments and once with right and max_groups. groupsort_indexer returns two arrays - generically, indexer and counts (for the invocation with left, these are called left_sorter and left_count, and correspondingly for right). counts has length max_groups + 1, and each element (excepting the first one, which is unused) contains the count of how many times the corresponding index group appears in the input array. So for our case, with max_groups = 5, the count arrays have shape (6,), and elements 1-5 represent the number of times the 5 unique index values appear in left and right.
The other array, indexer, is constructed so that indexing the original input array with it returns all the elements grouped in ascending order - hence "sorter." After having done this for both left and right, full_outer_join chops up the two sorters and strings them up across from each other. full_outer_join returns two arrays of the same size, left_idx and right_idx - these are the arrays that get really big and throw the error. The order of elements in the sorters determines the order they appear in the final two output arrays, and the count arrays determine how often each one appears. Since left goes first, its elements stay together - in left_idx, the first left_count[1] elements in left_sorter are repeated right_count[1] times each (aaabbbccc...). At the same place in right_idx, the first right_count[1] elements are repeated in a row left_count[1] times (abcabcabc...). (Conveniently, since the 0 row in s is a 'paris' row, left_count[1] and right_count[1] are always equal, so you get x amount of repeats x amount of times to start off). Then the next left_count[2] elements of left_sorter are repeated right_count[2] times, and so on... If any of the counts elements are zero, the corresponding spots in the idx arrays are filled with -1, to be masked later (as in, right_count[i] = 0 means elements in right_idx are -1, and vice versa - this is always the case for left_count[3] and left_count[4], because rows 2 and 3 in s are non-'paris').
In the end, the _idx arrays have an amount of elements equal to N_elements, which can be calculated as follows:
left_nonzero = (left_count[1:] != 0)
right_nonzero = (right_count[1:] != 0)
left_repeats = left_count[1:]*left_nonzero + np.ones(len(left_count)-1)*(1 - left_nonzero)
right_repeats = right_count[1:]*right_nonzero + np.ones(len(right_count)-1)*(1 - right_nonzero)
N_elements = sum(left_repeats*right_repeats)
The corresponding elements of the count arrays are multiplied together (with all the zeros replaced with ones), and added together to get N_elements.
You can see this figure grows pretty quickly (O(n^2)). For an original dataframe with 1,000,000 sampled rows, with each of the five source rows appearing about equally often, the count arrays look something like:
left_count = array([0, 2e5, 2e5, 0, 0, 2e5])
right_count = array([0, 2e5, 2e5, 2e5, 2e5, 2e5])
for a total length of about 1.2e11. In general for an initial sample N (df = pd.read_csv(io.StringIO(s)).sample(N, replace=True)), the final size is approximately 0.12*N**2
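As a rough check of that estimate, here is a small helper (n_elements is my own name, not a pandas function) that applies the same "replace zeros with ones, multiply, sum" rule, plus the arithmetic behind the 892 GiB figure in the error message:

```python
import numpy as np

def n_elements(left_count, right_count):
    """Size of the joined index arrays, given the two count arrays.

    Assumes every group is present on at least one side, as in the text.
    """
    left = np.asarray(left_count[1:], dtype=np.int64)
    right = np.asarray(right_count[1:], dtype=np.int64)
    # A zero count still contributes one (masked) slot per partner row.
    return int(np.sum(np.maximum(left, 1) * np.maximum(right, 1)))

k = 200_000  # about 1,000,000 / 5 copies of each source row
big = n_elements([0, k, k, 0, 0, k], [0, k, k, k, k, k])
print(big)                             # 3*k**2 + 2*k

# The shape reported in the MemoryError, as int64 (8 bytes each), in GiB:
print(119671145392 * 8 / 2**30)
```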
An Example
It's probably helpful to look at a small example to see what full_outer_join and groupsort_indexer are trying to do when they make those ginormous arrays. We'll start with a small sample of only 10 rows, and follow the various arrays to the final output, left_idx and right_idx. We'll start by defining the initial dataframe:
df = pd.read_csv(io.StringIO(s)).sample(10, replace=True)
df['final_target'] = df['city'] # this line doesn't change much, but meh
which looks like:
city arr final_target
3 miami 15 miami
1 paris 11 paris
0 paris 12 paris
0 paris 12 paris
0 paris 12 paris
1 paris 11 paris
2 dallas 22 dallas
3 miami 15 miami
2 dallas 22 dallas
4 paris 16 paris
df.loc[df['city'] == 'paris', 'final_target'] looks like:
1 paris
0 paris
0 paris
0 paris
1 paris
4 paris
and df['arr'].astype(str):
3 15
1 11
0 12
0 12
0 12
1 11
2 22
3 15
2 22
4 16
Then, in the call to full_outer_join, our arguments look like:
left = array([1,0,0,0,1,4]) # indexes of df.loc[df['city'] == 'paris', 'final_target']
right = array([3,1,0,0,0,1,2,3,2,4]) # indexes of df['arr'].astype(str)
max_groups = 5 # the max number of unique elements in either left or right
The function call groupsort_indexer(left, max_groups) returns the following two arrays:
left_sorter = array([1, 2, 3, 0, 4, 5])
left_count = array([0, 3, 2, 0, 0, 1])
left_count holds the number of appearances of each unique value in left - the first element is unused, but then there are 3 zeros, 2 ones, 0 twos, 0 threes, and 1 four in left.
left_sorter is such that left[left_sorter] = array([0, 0, 0, 1, 1, 4]) - all in order.
Now right: groupsort_indexer(right, max_groups) returns
right_sorter = array([2, 3, 4, 1, 5, 6, 8, 0, 7, 9])
right_count = array([0, 3, 2, 2, 2, 1])
Once again, right_count contains the number of times each value appears: the unused first element, and then 3 zeros, 2 ones, 2 twos, 2 threes, and 1 four (note that elements 1, 2, and 5 of both count arrays are the same: these are the rows in s with 'city' = 'paris'). Also, right[right_sorter] = array([0, 0, 0, 1, 1, 2, 2, 3, 3, 4])
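The sorter and count arrays above can be reproduced with plain numpy. This is only a sketch of what groupsort_indexer returns, not its actual Cython implementation; groupsort_sketch is a name I made up, and it assumes the labels are already integers in the range 0..max_groups-1.

```python
import numpy as np

def groupsort_sketch(labels, max_groups):
    """Return (sorter, counts) in the same layout as described above."""
    labels = np.asarray(labels)
    counts = np.zeros(max_groups + 1, dtype=np.int64)
    counts[1:] = np.bincount(labels, minlength=max_groups)  # element 0 stays unused
    sorter = np.argsort(labels, kind='stable')  # stable: ties keep original order
    return sorter, counts

left = np.array([1, 0, 0, 0, 1, 4])
right = np.array([3, 1, 0, 0, 0, 1, 2, 3, 2, 4])

left_sorter, left_count = groupsort_sketch(left, 5)
right_sorter, right_count = groupsort_sketch(right, 5)
```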
With both count arrays calculated, we can calculate what size the idx arrays will be (a bit simpler with actual numbers than with the formula above):
N_total = 3*3 + 2*2 + 2 + 2 + 1*1 = 18
3 is element 1 for both count arrays, so we can expect something like [1,1,1,2,2,2,3,3,3] to start left_idx, since [1,2,3] starts left_sorter, and [2,3,4,2,3,4,2,3,4] to start right_idx, since right_sorter begins with [2,3,4]. Then we have twos, so [0,0,4,4] for left_idx and [1,5,1,5] for right_idx. Then left_count has two zeros, and right_count has two twos, so next go four -1's in left_idx and the next four elements in right_sorter go into right_idx: [6,8,0,7]. Both count arrays finish with a one, so one each of the last elements in the sorters goes in the idx arrays: 5 for left_idx and 9 for right_idx, leaving:
left_idx = array([1, 1, 1, 2, 2, 2, 3, 3, 3, 0, 0, 4, 4, -1, -1, -1, -1, 5])
right_idx = array([2, 3, 4, 2, 3, 4, 2, 3, 4, 1, 5, 1, 5, 6, 8, 0, 7, 9])
which is indeed 18 elements.
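The construction just walked through can be written as a short pure-Python/numpy sketch. full_outer_join_sketch is a hypothetical name of mine; the real routine is compiled Cython and is organised differently, so treat this only as a model of the output.

```python
import numpy as np

def full_outer_join_sketch(left, right, max_groups):
    """Build left_idx/right_idx the way the walkthrough describes."""
    left, right = np.asarray(left), np.asarray(right)
    left_sorter = np.argsort(left, kind='stable')
    right_sorter = np.argsort(right, kind='stable')
    left_count = np.bincount(left, minlength=max_groups)
    right_count = np.bincount(right, minlength=max_groups)

    left_idx, right_idx = [], []
    lpos = rpos = 0
    for lc, rc in zip(left_count, right_count):
        lgroup = left_sorter[lpos:lpos + lc]
        rgroup = right_sorter[rpos:rpos + rc]
        if lc == 0:                      # label only on the right: mask the left
            left_idx.extend([-1] * rc)
            right_idx.extend(rgroup)
        elif rc == 0:                    # label only on the left: mask the right
            left_idx.extend(lgroup)
            right_idx.extend([-1] * lc)
        else:                            # both sides: every pairing (lc * rc rows)
            left_idx.extend(np.repeat(lgroup, rc))   # aaabbbccc...
            right_idx.extend(np.tile(rgroup, lc))    # abcabcabc...
        lpos += lc
        rpos += rc
    return np.array(left_idx), np.array(right_idx)

left_idx, right_idx = full_outer_join_sketch(
    [1, 0, 0, 0, 1, 4], [3, 1, 0, 0, 0, 1, 2, 3, 2, 4], 5)
```

The lc * rc term in the last branch is where the quadratic growth comes from.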
With both index arrays the same shape, pandas can construct two Series of the same shape from our original ones to do any operations it needs to, and then it can mask these arrays to get back sorted indexes. Using a simple boolean filter to look at how we just sorted left and right with the outputs, we get:
left[left_idx[left_idx != -1]] = array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 4])
right[right_idx[right_idx != -1]] = array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 4])
After going back up through all the function calls and modules, the result of the addition at this point is:
0 paris_12
0 paris_12
0 paris_12
0 paris_12
0 paris_12
0 paris_12
0 paris_12
0 paris_12
0 paris_12
1 paris_11
1 paris_11
1 paris_11
1 paris_11
2 NaN
2 NaN
3 NaN
3 NaN
4 paris_16
This is the value of result in the line result = op(self, other) in pandas.core.generic.NDFrame._inplace_method (line 11066), with op = pandas.core.series.Series.__add__ and self and other being the two Series from before that we're adding.
So, as far as I can tell, pandas basically tries to perform the operation for every combination of identically-indexed rows (like, any and all rows with index 1 in the first Series should be operated with all rows index 1 in the other Series). If one of the Series has indexes that the other one doesn't, those rows get masked out. It just so happens in this case that every row with the same index is identical. It works (albeit redundantly) as long as you don't need to do anything in place - the trouble for the small dataframes arises after this when pandas tries to reindex this result back into the shape of the original dataframe df.
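The "every combination of identically-indexed rows" behaviour is easy to see on two tiny Series. The values below are made up for illustration, and I use numbers rather than strings to keep the NaN handling simple.

```python
import pandas as pd

a = pd.Series([10, 20], index=[0, 0])        # label 0 appears twice
b = pd.Series([1, 2, 3], index=[0, 0, 1])    # label 0 twice, label 1 once

result = a + b   # aligned on the index before adding

# Label 0: 2 x 2 = 4 pairings; label 1 exists only in b, so it becomes NaN.
print(result)
```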
The split (the line that smaller dataframes make it past, but larger ones don't) is that line result = op(self, other) from above. Later in the same function (called, note, _inplace_method), the program exits at self._update_inplace(result.reindex_like(self, copy=False), verify_is_copy=False). It tries to reindex result so it looks like self, so it can replace self with result (self is the original Series, the first one in the addition, df.loc[df['city'] == 'paris', 'final_target']). And this is where the smaller case fails, because, obviously, result has a bunch of repeated indexes, and pandas doesn't want to lose any information when it deletes some of them.
One Last Thing
It's probably worth mentioning that this behaviour isn't particular to the addition operation here. It happens any time you try an arithmetic operation on two large dataframes with a lot of repeated indexes - for example, try just defining a second dataframe the exact same way as the first, df2 = pd.read_csv(io.StringIO(s)).sample(1000000, replace=True), and then try running df.arr*df2.arr. You'll get the same memory error.
Interestingly, logical and comparison operators have protections against doing this - they require identical indexes, and check for it before calling their operator method.
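For the comparison case, a quick sketch with made-up Series shows the refusal; I have only checked this for ==, so take it as an illustration of comparison operators rather than of every operator mentioned above.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=[0, 0, 1])
b = pd.Series([1, 2], index=[0, 1])

message = None
try:
    a == b          # comparison refuses to align differing indexes
except ValueError as exc:
    message = str(exc)

print(message)
```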
I did all my stuff in pandas 1.2.4, python 3.7.10, but I've given links to the pandas Github, which is currently in version 1.3.3. As far as I can tell, the differences don't affect the results.
I could certainly be wrong about this, but isn't it because df["arr"] has a different shape from df.loc[df["city"] == "paris"]? So something funny is happening in Pandas' internal resampling.
If I explicitly truncate the dataframe myself it works:
df['final_target'] = df['city']
df.loc[df['city'] == 'paris', 'final_target'] += "_" + df.loc[df['city'] == 'paris', 'arr'].astype(str)
In which case, the answer would be 'because internally pandas has an algorithm for reshaping dataframes when adding different sizes which is inefficient in this case'.
I don't know if that qualifies as an answer as I've not looked more deeply into pandas.
I have a vector dogSpecies showing all four unique dog species under investigation.
#a set of possible dog species
dogSpecies = [1,2,3,4]
I also have a data vector containing integer numbers corresponding to the records of dog species of all dogs tested.
# species of examined dogs
data = np.array([1,1,2,-1,0,2,3,5,4])
Some of the records in data contain values different than 1,2,3 or 4. (Such as -1, 0 or 5). If an element in the data set is not equal to any element of the dogSpecies, such occurrence should be marked in an error evaluation boolean matrix as False.
#initially all the elements of the boolean error evaluation vector are True.
errorEval = np.ones((np.size(data,axis = 0)),dtype=bool)
Ideally my errorEval vector would look like this:
errorEval = np.array([True, True, True, False, False, True, True, False, True])
I want a piece of code that checks if the elements of data are not equal to the elements of dogSpecies vector. My code for some reason marks every single element of the errorEval vector as 'False'.
for i in range(np.size(data, axis = 0)):
    # validation of the species
    if (data[i] != dogSpecies):
        errorEval[i] = False
I understand that I cannot compare a single element with a vector of four elements like above, but how do I do this then?
Isn't this just what you want?
for index, elem in enumerate(data):
    if elem not in dogSpecies:
        errorEval[index] = False
It's probably not very fast, since it doesn't use any vectorized numpy ufuncs, but if the array isn't very large that won't matter. Converting dogSpecies to a set will also speed up the membership test.
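If the array does get large, one vectorized alternative to the loop above is numpy's np.isin. This is a swapped-in technique rather than what the loop does internally, shown here on the data from the question:

```python
import numpy as np

dogSpecies = [1, 2, 3, 4]
data = np.array([1, 1, 2, -1, 0, 2, 3, 5, 4])

# True where the record is one of the known species, False otherwise.
errorEval = np.isin(data, dogSpecies)
print(errorEval)
```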
As an aside, your Python looks very C/Java-esque. I'd suggest reading the Python style guide (PEP 8).
If I understand correctly, you have a dataframe and a list of dog species. This should achieve what you want.
df = pd.DataFrame({'dog': [1,3,4,5,1,1,8,9,0]})
dog
0 1
1 3
2 4
3 5
4 1
5 1
6 8
7 9
8 0
df['errorEval'] = df['dog'].isin(dogSpecies).astype(int)
dog errorEval
0 1 1
1 3 1
2 4 1
3 5 0
4 1 1
5 1 1
6 8 0
7 9 0
8 0 0
df.errorEval.values
# array([1, 1, 1, 0, 1, 1, 0, 0, 0])
If you don't want to create a new column then you can do:
df.assign(errorEval=df['dog'].isin(dogSpecies).astype(int)).errorEval.values
# array([1, 1, 1, 0, 1, 1, 0, 0, 0])
As @FHTMitchel stated, you have to use in to check whether an element is in a list or not.
But you can use a list comprehension, which is shorter and faster than a normal loop:
errorEval = np.array([elem in dogSpecies for elem in data])
I have a large dataframe mapping users (index) to counts of items (columns):
users_items = pd.DataFrame(np.array([[0, 1, 1, 0],   # user 0
                                     [1, 0, 0, 0],   # user 1
                                     [5, 0, 0, 9],   # user 2
                                     [0, 3, 5, 0],   # user 3
                                     [0, 2, 2, 0],   # user 4
                                     [7, 0, 0, 1],   # user 5
                                     [3, 5, 0, 4]]), # user 6
                           columns=list('ABCD'))
For each user, I want to find all the users that have non-zero counts for at least the same items and sum their counts. So for user 1, this would be users 1, 2, 5 and 6 and the sum of the counts equals [16, 5, 0, 14]. This can be used to suggest new items to users based on the items that "similar" users got.
This naive implementation uses a signature as a regular expression to filter out the relevant rows and a for loop to loop over all signatures:
def create_signature(request_counts):
    return ''.join('x' if count else '.' for count in request_counts)
users_items['signature'] = users_items.apply(create_signature, axis=1).astype('category')
current_items = users_items.groupby('signature').sum()
similar_items = pd.DataFrame(index=current_items.index,
                             columns=current_items.columns)
for signature in current_items.index:
    row = current_items.filter(regex=signature, axis='index').sum()
    similar_items.loc[signature] = row
The result is:
A B C D
signature
.xx. 0 6 8 0
x... 16 5 0 14
x..x 15 5 0 14
xx.x 3 5 0 4
This works fine, but it is too slow for the actual data set which consists of 100k users and some 600 items. Generating the signatures takes only 10 seconds, but looping over all (40k) signatures takes several hours.
Vectorizing the loop should offer a huge performance boost, but my experience with Pandas is limited so I'm not sure how to go about it. Is it even possible to vectorize this type of calculation? Perhaps using masks?
Instead of a string as signature, you can use a frozenset
def create_signature(request_counts):
    return frozenset(request_counts[request_counts != 0].index)
an alternative is
def create_signature(request_counts):
    return frozenset(request_counts.replace({0: None}).dropna().index)
I don't have a dataset large enough to see whether one is faster than the other.
If you have duplicate columns, insert a call to reset_index() before the .index
This allows you to vectorise your filter in the end
for signature in current_items.index:
    row = current_items[signature <= current_items.index].sum()
    similar_items.loc[signature] = row
results in
signature A B C D
frozenset({'B', 'C'}) 0 6 8 0
frozenset({'A'}) 16 5 0 14
frozenset({'A', 'D'}) 15 5 0 14
frozenset({'B', 'A', 'D'}) 3 5 0 4
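For what it's worth, the subset test can also be done without a Python loop by using matrix products on a 0/1 mask. This is a different technique from the frozenset approach above and only a sketch: it works per user on the toy data from the question, and it builds an n x n intermediate matrix, so on the real data it would presumably have to run on the deduplicated signatures and possibly in chunks.

```python
import numpy as np
import pandas as pd

users_items = pd.DataFrame(np.array([[0, 1, 1, 0],   # user 0
                                     [1, 0, 0, 0],   # user 1
                                     [5, 0, 0, 9],   # user 2
                                     [0, 3, 5, 0],   # user 3
                                     [0, 2, 2, 0],   # user 4
                                     [7, 0, 0, 1],   # user 5
                                     [3, 5, 0, 4]]), # user 6
                           columns=list('ABCD'))

counts = users_items.to_numpy()
has_item = (counts > 0).astype(np.int64)

# missing[i, j] = number of items user i has that user j lacks
missing = has_item @ (1 - has_item).T
# covers[i, j] = 1 when user j has at least all of user i's items
covers = (missing == 0).astype(np.int64)

# Row i sums the counts of every user that covers user i.
similar = pd.DataFrame(covers @ counts,
                       index=users_items.index,
                       columns=users_items.columns)
print(similar)
```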
Basically if a column of my pandas dataframe looks like this:
[1 1 1 2 2 2 3 3 3 1 1]
I'd like it to be turned into the following:
[1 2 3 1]
You can write a simple function that loops through the elements of your series only storing the first element in a run.
As far as I know, there is no tool built in to pandas to do this. But it is not a lot of code to do it yourself.
import pandas
example_series = pandas.Series([1, 1, 1, 2, 2, 3])
def collapse(series):
    last = ""
    seen = []
    for element in series:
        if element != last:
            last = element
            seen.append(element)
    return seen
collapse(example_series)
In the code above, you will iterate through each element of a series and check if it is the same as the last element seen. If it is not, store it. If it is, ignore the value.
If you need to handle the return value as a series you can change the last line of the function to:
return pandas.Series(seen)
You could write a function that does the following:
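import pandas
x = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1])
y = x - x.shift(1)  # difference from the previous element (NaN for the first)
y[0] = 1            # always keep the first element
result = x[y != 0]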
You can use DataFrame's diff and indexing:
>>> df = pd.DataFrame([1,1,2,2,2,2,3,3,3,3,1])
>>> df[df[0].diff()!=0]
0
0 1
2 2
6 3
10 1
>>> df[df[0].diff()!=0].values.ravel() # If you need an array
array([1, 2, 3, 1])
The same works for a Series:
>>> df = pd.Series([1,1,2,2,2,2,3,3,3,3,1])
>>> df[df.diff()!=0].values
array([1, 2, 3, 1])
You can use shift to create a boolean mask to compare the row against the previous row:
In [67]:
s = pd.Series([1,1,2,2,2,2,3,3,3,3,4,4,5])
s[s!=s.shift()]
Out[67]:
0 1
2 2
6 3
10 4
12 5
dtype: int64