Testing subsequent values in a DataFrame - python

I have a DataFrame with one column with positive and negative integers. For each row, I'd like to see how many consecutive rows (starting with and including the current row) have negative values.
So if a sequence was 2, -1, -3, 1, -1, the result would be 0, 2, 1, 0, 1.
I can do this by iterating over all the indices, using .iloc to split the column, and next() to find out where the next positive value is. But I feel like this isn't taking advantage of pandas' capabilities, and I imagine that there's a better way of doing it. I've experimented with using .shift() and expanding_window but without success.
Is there a more "pandastic" way of finding out how many consecutive rows after the current one meet some logical condition?
Here's what's working now:
import pandas as pd
df = pd.DataFrame({"a": [2, -1, -3, -1, 1, 1, -1, 1, -1]})
df["b"] = 0
for i in df.index:
    sub = df.iloc[i:].a.tolist()
    df.b.iloc[i] = next((sub.index(n) for n in sub if n >= 0), 1)
Edit: I realize that even my own example doesn't work when there's more than one negative value at the end. So that makes a better solution even more necessary.
Edit 2: I stated the problem in terms of integers, but originally only put 1 and -1 in my example. I need to solve for positive and negative integers in general.

FWIW, here's a fairly pandastic answer that requires no functions or applies. Borrows from here (among other answers I'm sure) and thanks to #DSM for mentioning the ascending=False option:
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [2, -1, -3, -1, 1, 1, -1, 1, -1, -2]})
df['pos'] = df.a > 0
df['grp'] = (df['pos'] != df['pos'].shift()).cumsum()
dfg = df.groupby('grp')
df['c'] = np.where(df['a'] < 0, dfg.cumcount(ascending=False) + 1, 0)
   a    pos  grp  c
0  2   True    1  0
1 -1  False    2  3
2 -3  False    2  2
3 -1  False    2  1
4  1   True    3  0
5  1   True    3  0
6 -1  False    4  1
7  1   True    5  0
8 -1  False    6  2
9 -2  False    6  1
I think a nice thing about this method is that once you set up the 'grp' variable you can do lots of things very easily with standard groupby methods.
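For instance (a quick sketch reusing the df and dfg defined above), transform broadcasts per-run statistics back onto the rows:
df['run_len'] = dfg['a'].transform('count')   # length of the run each row belongs to
df['run_sum'] = dfg['a'].transform('sum')     # total of the run each row belongs to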

This was an interesting puzzle. I found a way to do it using pandas tools, but I think you'll agree it's a lot more opaque :-). Here's the example:
import numpy as np
import pandas

data = pandas.Series([1, -1, -1, -1, 1, -1, -1, 1, 1, -1, 1])
x = data[::-1]  # reverse the data
print(x.groupby(((x < 0) != (x < 0).shift()).cumsum()).apply(
    lambda x: pandas.Series(np.arange(len(x)) + 1 if (x < 0).all() else np.zeros(len(x)),
                            index=x.index))[::-1])
The output is correct:
0 0
1 3
2 2
3 1
4 0
5 2
6 1
7 0
8 0
9 1
10 0
dtype: float64
The basic idea is similar to what I described in my answer to this question, and you can find the same approach used in various answers that ask how to make use of inter-row information in pandas. Your question is slightly trickier because your criterion goes in reverse (asking for the number of following negatives rather than the number of preceding negatives), and because you only want one side of the grouping (i.e., you only want the number of consecutive negatives, not the number of consecutive numbers with the same sign).
Here is a more verbose version of the same code with some explanation that may make it easier to grasp:
def getNegativeCounts(x):
    # This function takes as input a sequence of numbers, all the same sign.
    # If they're negative, it returns an increasing count of how many there are.
    # If they're positive, it just returns the same number of zeros.
    # [-1, -2, -3] -> [1, 2, 3]
    # [1, 2, 3]    -> [0, 0, 0]
    if (x < 0).all():
        return pandas.Series(np.arange(len(x)) + 1, index=x.index)
    else:
        return pandas.Series(np.zeros(len(x)), index=x.index)
# we have to reverse the data because cumsum only works in the forward direction
x = data[::-1]
# mark the positions where the sign differs from the previous number's
signChange = (x < 0) != (x < 0).shift()
# cumsum this to get an "ID" for each block of consecutive same-sign numbers
sameSignBlocks = signChange.cumsum()
# group on these block IDs
g = x.groupby(sameSignBlocks)
# for each block, apply getNegativeCounts
# this will either give us the running total of negatives in the block,
# or a stretch of zeros if the block was positive
# the [::-1] at the end reverses the result
# (to compensate for our reversing the data initially)
g.apply(getNegativeCounts)[::-1]
As you can see, run-length-style operations are not usually simple in pandas. There is, however, an open issue for adding more grouping/partitioning abilities that would ameliorate some of this. In any case, your particular use case has some specific quirks that make it a bit different from a typical run-length task.

Related

Dataframe column: to find (cumulative) local maxima

In the dataframe below, the column "CumRetperTrade" consists of a few vertical vectors (sequences of numbers) separated by zeros; these vectors correspond to the non-zero elements of the column "Portfolio". I would like to find the cumulative local maxima of every non-zero vector contained in column "CumRetperTrade".
To be precise, I would like to transform (using vectorization or other methods) column "CumRetperTrade" into the column "PeakCumRet" (the desired result), which gives, for every vector (i.e., every subset where 'Portfolio' = 1) contained in column "CumRetperTrade", the cumulative maximum of all its previous values. A numeric example is below. Thanks in advance!
PS In other words, I guess we need to use cummax(), but apply it only to the consecutive (where 'Portfolio' = 1) subsets of 'CumRetperTrade'.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"Portfolio":      [1, 1, 1, 1, 0, 0, 0, 1, 1, 1],
                    "CumRetperTrade": [2, 3, 2, 1, 0, 0, 0, 4, 2, 1],
                    "PeakCumRet":     [2, 3, 3, 3, 0, 0, 0, 4, 4, 4]})
df1
Portfolio CumRetperTrade PeakCumRet
0 1 2 2
1 1 3 3
2 1 2 3
3 1 1 3
4 0 0 0
5 0 0 0
6 0 0 0
7 1 4 4
8 1 2 4
9 1 1 4
PPS I already asked a similar question previously (Dataframe column: to find local maxima) and received a correct answer to my question, however in my question I did not explicitly mention the requirement of cumulative local maxima
You only need a small modification to the previous answer:
df1["PeakCumRet"] = (
df1.groupby(df1["Portfolio"].diff().ne(0).cumsum())
["CumRetperTrade"].expanding().max()
.droplevel(0)
)
expanding().max() is what produces the local maxima.
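A possible alternative with the same grouping (a sketch, not benchmarked here): cummax() computes the running maximum directly and comes back already aligned to the original index, so the droplevel step isn't needed:
df1["PeakCumRet"] = (
    df1.groupby(df1["Portfolio"].diff().ne(0).cumsum())
       ["CumRetperTrade"].cummax()
)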

Explosion of memory when using pandas .loc with mismatched indices + assignment giving duplicate axis error

This is an observation from Most pythonic way to concatenate pandas cells with conditions
I am not able to understand why the third solution takes more memory compared to the first one.
If I don't sample, the third solution does not give a runtime error, so clearly something is weird.
To emulate a large dataframe I tried to resample, but I never expected to run into this kind of error.
Background
Pretty self-explanatory: one line, looks pythonic.
df['city'] + (df['city'] == 'paris')*('_' + df['arr'].astype(str))
s = """city,arr,final_target
paris,11,paris_11
paris,12,paris_12
dallas,22,dallas
miami,15,miami
paris,16,paris_16"""
import io
import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO(s)).sample(1000000, replace=True)
df
Speeds
%%timeit
df['city'] + (df['city'] == 'paris')*('_' + df['arr'].astype(str))
# 877 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['final_target'] = np.where(df['city'].eq('paris'),
                              df['city'] + '_' + df['arr'].astype(str),
                              df['city'])
# 874 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If I don't sample, there is no error and the outputs also match exactly.
Error (updated; this only happens when I sample from the dataframe)
%%timeit
df['final_target'] = df['city']
df.loc[df['city'] == 'paris', 'final_target'] += '_' + df['arr'].astype(str)
MemoryError: Unable to allocate 892. GiB for an array with shape (119671145392,) and data type int64
For smaller input (sample size 100) we get a different error, pointing to a problem due to the different sizes, but what's up with the memory allocations and sampling?
ValueError: cannot reindex from a duplicate axis
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-57c5b10090b2> in <module>
1 df['final_target'] = df['city']
----> 2 df.loc[df['city'] == 'paris', 'final_target'] += '_' + df['arr'].astype(str)
~/anaconda3/lib/python3.8/site-packages/pandas/core/ops/methods.py in f(self, other)
99 # we are updating inplace so we want to ignore is_copy
100 self._update_inplace(
--> 101 result.reindex_like(self, copy=False), verify_is_copy=False
102 )
103
I rerun them from scratch each time
Update
This is part of what I figured
s = """city,arr,final_target
paris,11,paris_11
paris,12,paris_12
dallas,22,dallas
miami,15,miami
paris,16,paris_16"""
import pandas as pd
import io
df = pd.read_csv(io.StringIO(s)).sample(10, replace=True)
df
city arr final_target
1 paris 12 paris_12
0 paris 11 paris_11
2 dallas 22 dallas
2 dallas 22 dallas
3 miami 15 miami
3 miami 15 miami
2 dallas 22 dallas
1 paris 12 paris_12
0 paris 11 paris_11
3 miami 15 miami
Indices are repeated when sampled with replacement
So resetting the indices resolves the problem even though df['arr'] and the df.loc selection have essentially different sizes; alternatively, replacing the right-hand side with df.loc[df['city'] == 'paris', 'arr'].astype(str) also solves it, just as 2e0byo pointed out.
Still, can someone explain how .loc works, and why memory explodes when the indices contain duplicates and don't match?
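Written out, the two fixes mentioned above look roughly like this (a sketch, reusing the CSV string s from the setup; not benchmarked):
import io
import pandas as pd

# s is the CSV string defined in the question above
df = pd.read_csv(io.StringIO(s)).sample(1000000, replace=True)

# Fix 1: make the sampled index unique again, so the in-place update aligns row for row
df1 = df.reset_index(drop=True)
df1['final_target'] = df1['city']
df1.loc[df1['city'] == 'paris', 'final_target'] += '_' + df1['arr'].astype(str)

# Fix 2: keep the duplicated index but select the same rows on both sides,
# so both Series carry identical labels and no outer join is needed
df2 = df.copy()
df2['final_target'] = df2['city']
mask = df2['city'] == 'paris'
df2.loc[mask, 'final_target'] += '_' + df2.loc[mask, 'arr'].astype(str)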
#2e0byo hit the nail on the head saying pandas' algorithm is "inefficient" in this case.
As far as .loc, it's not really doing anything remarkable. Its use here is analogous to indexing a numpy array with a boolean array of the same shape, with an added dict-key-like access to a specific column - that is, df['city'] == 'paris' is itself a dataframe, with the same number of rows and the same indexes as df, with a single column of boolean values. df.loc[df['city'] == 'paris'] then gives a dataframe consisting of only the rows that are true in df['city'] == 'paris' (that have 'paris' in the 'city' column). Adding the additional argument 'final_target' then just returns only the 'final_target' column of those rows, instead of all three (and because it only has one column, it's technically a Series object - the same goes for df['arr']).
The memory explosion happens when pandas actually tries to add the two Series. As #2e0byo pointed out, it has to reshape the Series to do this, and it does this by calling the first Series' align() method. During the align operation, the function pandas.core.reshape.merge.get_join_indexers() calls pandas._libs.join.full_outer_join() (line 155) with three arguments: left, right, and max_groups (point of clarification: these are their names inside the function full_outer_join). left and right are integer arrays containing the indexes of the two Series objects (the values in the index column), and max_groups is the maximum number of unique elements in either left or right (in our case, that's five, corresponding to the five original rows in s).
full_outer_join immediately turns and calls pandas._libs.algos.groupsort_indexer() (line 194), once with left and max_groups as arguments and once with right and max_groups. groupsort_indexer returns two arrays - generically, indexer and counts (for the invocation with left, these are called left_sorter and left_count, and correspondingly for right). counts has length max_groups + 1, and each element (excepting the first one, which is unused) contains the count of how many times the corresponding index group appears in the input array. So for our case, with max_groups = 5, the count arrays have shape (6,), and elements 1-5 represent the number of times the 5 unique index values appear in left and right.
The other array, indexer, is constructed so that indexing the original input array with it returns all the elements grouped in ascending order - hence "sorter." After having done this for both left and right, full_outer_join chops up the two sorters and strings them up across from each other. full_outer_join returns two arrays of the same size, left_idx and right_idx - these are the arrays that get really big and throw the error. The order of elements in the sorters determines the order they appear in the final two output arrays, and the count arrays determine how often each one appears. Since left goes first, its elements stay together - in left_idx, the first left_count[1] elements in left_sorter are repeated right_count[1] times each (aaabbbccc...). At the same place in right_idx, the first right_count[1] elements are repeated in a row left_count[1] times (abcabcabc...). (Conveniently, since the 0 row in s is a 'paris' row, left_count[1] and right_count[1] are always equal, so you get x amount of repeats x amount of times to start off). Then the next left_count[2] elements of left_sorter are repeated right_count[2] times, and so on... If any of the counts elements are zero, the corresponding spots in the idx arrays are filled with -1, to be masked later (as in, right_count[i] = 0 means elements in right_idx are -1, and vice versa - this is always the case for left_count[3] and left_count[4], because rows 2 and 3 in s are non-'paris').
In the end, the _idx arrays have an amount of elements equal to N_elements, which can be calculated as follows:
left_nonzero = (left_count[1:] != 0)
right_nonzero = (right_count[1:] != 0)
left_repeats = left_count[1:]*left_nonzero + np.ones(len(left_count)-1)*(1 - left_nonzero)
right_repeats = right_count[1:]*right_nonzero + np.ones(len(right_count)-1)*(1 - right_nonzero)
N_elements = sum(left_repeats*right_repeats)
The corresponding elements of the count arrays are multiplied together (with all the zeros replaced with ones), and added together to get N_elements.
You can see this figure grows pretty quickly (O(n^2)). For an original dataframe with 1,000,000 sampled rows, with each of the five original rows appearing about equally often, the count arrays look something like:
left_count = array([0, 2e5, 2e5, 0, 0, 2e5])
right_count = array([0, 2e5, 2e5, 2e5, 2e5, 2e5])
for a total length of about 1.2e11. In general for an initial sample N (df = pd.read_csv(io.StringIO(s)).sample(N, replace=True)), the final size is approximately 0.12*N**2
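As a rough sanity check of that estimate (a sketch, assuming the five source rows are sampled about uniformly so each original index appears roughly N/5 times):
N = 1000000
per_index = N / 5
# three of the five original rows (0, 1 and 4) are 'paris' rows, so for those groups
# both counts are ~N/5 and each contributes (N/5)**2 pairs; the two non-'paris'
# groups only contribute ~N/5 masked entries each
est = 3 * per_index**2 + 2 * per_index
print(f"{est:.2e}")   # ~1.2e11 elements, i.e. the ~892 GiB int64 allocation in the error above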
An Example
It's probably helpful to look at a small example to see what full_outer_join and groupsort_indexer are trying to do when they make those ginormous arrays. We'll take a small sample of only 10 rows and follow the various arrays to the final output, left_idx and right_idx. First, define the initial dataframe:
df = pd.read_csv(io.StringIO(s)).sample(10, replace=True)
df['final_target'] = df['city'] # this line doesn't change much, but meh
which looks like:
city arr final_target
3 miami 15 miami
1 paris 11 paris
0 paris 12 paris
0 paris 12 paris
0 paris 12 paris
1 paris 11 paris
2 dallas 22 dallas
3 miami 15 miami
2 dallas 22 dallas
4 paris 16 paris
df.loc[df['city'] == 'paris', 'final_target'] looks like:
1 paris
0 paris
0 paris
0 paris
1 paris
4 paris
and df['arr'].astype(str):
3 15
1 11
0 12
0 12
0 12
1 11
2 22
3 15
2 22
4 16
Then, in the call to full_outer_join, our arguments look like:
left = array([1,0,0,0,1,4]) # indexes of df.loc[df['city'] == 'paris', 'final_target']
right = array([3,1,0,0,0,1,2,3,2,4]) # indexes of df['arr'].astype(str)
max_groups = 5 # the max number of unique elements in either left or right
The function call groupsort_indexer(left, max_groups) returns the following two arrays:
left_sorter = array([1, 2, 3, 0, 4, 5])
left_count = array([0, 3, 2, 0, 0, 1])
left_count holds the number of appearances of each unique value in left - the first element is unused, but then there are 3 zeros, 2 ones, 0 twos, 0 threes, and 1 four in left.
left_sorter is such that left[left_sorter] = array([0, 0, 0, 1, 1, 4]) - all in order.
Now right: groupsort_indexer(right, max_groups) returns
right_sorter = array([2, 3, 4, 1, 5, 6, 8, 0, 7, 9])
right_count = array([0, 3, 2, 2, 2, 1])
Once again, right_count contains the number of times each index value appears: the unused first element, and then 3 zeros, 2 ones, 2 twos, 2 threes, and 1 four (note that elements 1, 2, and 5 of both count arrays are the same: these are the rows in s with 'city' = 'paris'). Also, right[right_sorter] = array([0, 0, 0, 1, 1, 2, 2, 3, 3, 4])
With both count arrays calculated, we can calculate what size the idx arrays will be (a bit simpler with actual numbers than with the formula above):
N_total = 3*3 + 2*2 + 2 + 2 + 1*1 = 18
3 is element 1 of both count arrays, so we can expect something like [1,1,1,2,2,2,3,3,3] to start left_idx, since [1,2,3] starts left_sorter, and [2,3,4,2,3,4,2,3,4] to start right_idx, since right_sorter begins with [2,3,4]. Then we have twos, so [0,0,4,4] for left_idx and [1,5,1,5] for right_idx. Then left_count has two zeros and right_count has two twos, so the next four entries of left_idx are -1 and the next four elements of right_sorter go into right_idx: [6,8,0,7]. Both counts finish with a one, so one each of the last elements in the sorters goes into the idx arrays: 5 for left_idx and 9 for right_idx, leaving:
left_idx = array([1, 1, 1, 2, 2, 2, 3, 3, 3, 0, 0, 4, 4,-1, -1, -1, -1, 5])
right_idx = array([2, 3, 4, 2, 3, 4, 2, 3, 4, 1, 5, 1, 5, 6, 8, 0 , 7, 9])
which is indeed 18 elements.
With both index arrays the same shape, pandas can construct two Series of the same shape from our original ones to do any operations it needs to, and then it can mask these arrays to get back sorted indexes. Using a simple boolean filter to look at how we just sorted left and right with the outputs, we get:
left[left_idx[left_idx != -1]] = array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 4])
right[right_idx[right_idx != -1]] = array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 4])
After going back up through all the function calls and modules, the result of the addition at this point is:
0 paris_12
0 paris_12
0 paris_12
0 paris_12
0 paris_12
0 paris_12
0 paris_12
0 paris_12
0 paris_12
1 paris_11
1 paris_11
1 paris_11
1 paris_11
2 NaN
2 NaN
3 NaN
3 NaN
4 paris_16
which is the result produced by the line result = op(self, other) in pandas.core.generic.NDFrame._inplace_method (line 11066), with op = pandas.core.series.Series.__add__ and self and other being the two Series from before that we're adding.
So, as far as I can tell, pandas basically tries to perform the operation for every combination of identically-indexed rows (like, any and all rows with index 1 in the first Series should be operated with all rows index 1 in the other Series). If one of the Series has indexes that the other one doesn't, those rows get masked out. It just so happens in this case that every row with the same index is identical. It works (albeit redundantly) as long as you don't need to do anything in place - the trouble for the small dataframes arises after this when pandas tries to reindex this result back into the shape of the original dataframe df.
The split (the line that smaller dataframes make it past, but larger ones don't) is that line result = op(self, other) from above. Later in the same function (called, note, _inplace_method), the program exits at self._update_inplace(result.reindex_like(self, copy=False), verify_is_copy=False). It tries to reindex result so it looks like self, so it can replace self with result (self is the original Series, the first one in the addition, df.loc[df['city'] == 'paris', 'final_target']). And this is where the smaller case fails, because, obviously, result has a bunch of repeated indexes, and pandas doesn't want to lose any information when it deletes some of them.
One Last Thing
It's probably worth mentioning that this behaviour isn't particular to the addition operation here. It happens any time you try an arithmetic operation on two large dataframes with a lot of repeated indexes - for example, try just defining a second dataframe the exact same way as the first, df2 = pd.read_csv(io.StringIO(s)).sample(1000000, replace=True), and then try running df.arr*df2.arr. You'll get the same memory error.
Interestingly, logical and comparison operators have protections against doing this - they require identical indexes, and check for it before calling their operator method.
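A small illustration of that difference (a sketch; the exact error message may vary between pandas versions):
import pandas as pd

a = pd.Series([1, 2], index=[0, 0])
b = pd.Series([1, 2], index=[0, 1])
a * b     # arithmetic: attempts the alignment/outer join described above
a == b    # comparison: ValueError: Can only compare identically-labeled Series objects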
I did all my stuff in pandas 1.2.4, python 3.7.10, but I've given links to the pandas Github, which is currently in version 1.3.3. As far as I can tell, the differences don't affect the results.
I could certainly be wrong about this, but isn't it because df["arr"] has a different shape from df.loc[df["city"] == "paris"]? So something funny is happening in Pandas' internal resampling.
If I explicitly truncate the dataframe myself it works:
df['final_target'] = df['city']
df.loc[df['city'] == 'paris', 'final_target'] += "_" + df.loc[df['city'] == 'paris', 'arr'].astype(str)
In which case, the answer would be 'because internally pandas has an algorithm for reshaping dataframes when adding different sizes which is inefficient in this case'.
I don't know if that qualifies as an answer as I've not looked more deeply into pandas.

Check, if a variable does not equal to any of the vector's elements

I have a vector dogSpecies showing all four unique dog species under investigation.
import numpy as np

# a set of possible dog species
dogSpecies = [1, 2, 3, 4]
I also have a data vector containing integer numbers corresponding to the records of dog species of all dogs tested.
# species of examined dogs
data = np.array([1, 1, 2, -1, 0, 2, 3, 5, 4])
Some of the records in data contain values different than 1,2,3 or 4. (Such as -1, 0 or 5). If an element in the data set is not equal to any element of the dogSpecies, such occurrence should be marked in an error evaluation boolean matrix as False.
#initially all the elements of the boolean error evaluation vector are True.
errorEval = np.ones((np.size(data,axis = 0)),dtype=bool)
Ideally my errorEval vector would look like this:
errorEval = np.array([True, True, True, False, False, True, True, False, True])
I want a piece of code that checks if the elements of data are not equal to the elements of dogSpecies vector. My code for some reason marks every single element of the errorEval vector as 'False'.
for i in range(np.size(data, axis=0)):
    # validation of the species
    if (data[i] != dogSpecies):
        errorEval[i] = False
I understand that I cannot compare a single element with a vector of four elements like above, but how do I do this then?
Isn't this just what you want?
for index, elem in enumerate(data):
    if elem not in dogSpecies:
        errorEval[index] = False
Probably not very fast since it doesn't use any vectorized numpy ufuncs, but if the array isn't very large that won't matter. Converting dogSpecies to a set will also speed things up.
As an aside, your python looks very c/java esque. I'd suggest reading the python style guide.
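If you do want a vectorized version (a sketch), numpy's isin performs the membership test for the whole array in one call:
import numpy as np

dogSpecies = [1, 2, 3, 4]
data = np.array([1, 1, 2, -1, 0, 2, 3, 5, 4])
errorEval = np.isin(data, dogSpecies)
# array([ True,  True,  True, False, False,  True,  True, False,  True])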
If I understand correctly, you have a dataframe and a list of dog species. This should achieve what you want.
import pandas as pd

df = pd.DataFrame({'dog': [1, 3, 4, 5, 1, 1, 8, 9, 0]})
dog
0 1
1 3
2 4
3 5
4 1
5 1
6 8
7 9
8 0
df['errorEval'] = df['dog'].isin(dogSpecies).astype(int)
dog errorEval
0 1 1
1 3 1
2 4 1
3 5 0
4 1 1
5 1 1
6 8 0
7 9 0
8 0 0
df.errorEval.values
# array([1, 1, 1, 0, 1, 1, 0, 0, 0])
If you don't want to create a new column then you can do:
df.assign(errorEval=df['dog'].isin(dogSpecies).astype(int)).errorEval.values
# array([1, 1, 1, 0, 1, 1, 0, 0, 0])
As #FHTMitchel stated, you have to use in to check whether an element is in a list or not.
But you can use a list comprehension, which is faster than a normal loop and shorter:
errorEval = np.array([True if elem in dogSpecies else False for elem in data])
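Since the membership test already yields a boolean, the comprehension can be shortened to the equivalent:
errorEval = np.array([elem in dogSpecies for elem in data])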

How to vectorize pandas calculation involving custom grouping?

I have a large dataframe mapping users (index) to counts of items (columns):
import numpy as np
import pandas as pd

users_items = pd.DataFrame(np.array([[0, 1, 1, 0],   # user 0
                                     [1, 0, 0, 0],   # user 1
                                     [5, 0, 0, 9],   # user 2
                                     [0, 3, 5, 0],   # user 3
                                     [0, 2, 2, 0],   # user 4
                                     [7, 0, 0, 1],   # user 5
                                     [3, 5, 0, 4]]), # user 6
                           columns=list('ABCD'))
For each user, I want to find all the users that have non-zero counts for at least the same items and sum their counts. So for user 1, this would be users 1, 2, 5 and 6 and the sum of the counts equals [16, 5, 0, 14]. This can be used to suggest new items to users based on the items that "similar" users got.
This naive implementation uses a signature as a regular expression to filter out the relevant rows and a for loop to loop over all signatures:
def create_signature(request_counts):
    return ''.join('x' if count else '.' for count in request_counts)

users_items['signature'] = users_items.apply(create_signature, axis=1).astype('category')
current_items = users_items.groupby('signature').sum()
similar_items = pd.DataFrame(index=current_items.index,
                             columns=current_items.columns)

for signature in current_items.index:
    row = current_items.filter(regex=signature, axis='index').sum()
    similar_items.loc[signature] = row
The result is:
A B C D
signature
.xx. 0 6 8 0
x... 16 5 0 14
x..x 15 5 0 14
xx.x 3 5 0 4
This works fine, but it is too slow for the actual data set which consists of 100k users and some 600 items. Generating the signatures takes only 10 seconds, but looping over all (40k) signatures takes several hours.
Vectorizing the loop should offer a huge performance boost, but my experience with Pandas is limited so I'm not sure how to go about it. Is it even possible to vectorize this type of calculation? Perhaps using masks?
Instead of a string as signature, you can use a frozenset
def create_signature(request_counts):
    return frozenset(request_counts[request_counts != 0].index)
an alternative is
def create_signature(request_counts):
    return frozenset(request_counts.replace({0: None}).dropna().index)
I don't have a dataset large enough to see whether one is faster than the other.
If you have duplicate columns, insert a call to reset_index() before the .index
This allows you to vectorise your filter in the end
for signature in current_items.index:
    row = current_items[signature <= current_items.index].sum()
    similar_items.loc[signature] = row
results in
signature A B C D
frozenset({'B', 'C'}) 0 6 8 0
frozenset({'A'}) 16 5 0 14
frozenset({'A', 'D'}) 15 5 0 14
frozenset({'B', 'A', 'D'}) 3 5 0 4
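To see why the comparison in that loop works (a sketch, assuming the frozenset-indexed current_items built above): <= on frozensets is the subset test, and comparing one signature against the whole Index evaluates it per group, producing the boolean mask used above:
sig = frozenset({'A'})
sig <= frozenset({'A', 'D'})        # True: {'A'} is a subset of {'A', 'D'}
mask = sig <= current_items.index   # one boolean per signature group
current_items[mask].sum()           # summed item counts of all matching groups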

Collapsing identical adjacent rows in a Pandas Series

Basically if a column of my pandas dataframe looks like this:
[1 1 1 2 2 2 3 3 3 1 1]
I'd like it to be turned into the following:
[1 2 3 1]
You can write a simple function that loops through the elements of your series only storing the first element in a run.
As far as I know, there is no tool built in to pandas to do this. But it is not a lot of code to do it yourself.
import pandas
example_series = pandas.Series([1, 1, 1, 2, 2, 3])
def collapse(series):
    last = ""
    seen = []
    for element in series:
        if element != last:
            last = element
            seen.append(element)
    return seen
collapse(example_series)
In the code above, you will iterate through each element of a series and check if it is the same as the last element seen. If it is not, store it. If it is, ignore the value.
If you need to handle the return value as a series you can change the last line of the function to:
return pandas.Series(seen)
You could write a function that does the following:
import pandas

x = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1])
y = x - x.shift(1)
y[0] = 1
result = x[y != 0]
You can use DataFrame's diff and indexing:
>>> df = pd.DataFrame([1,1,2,2,2,2,3,3,3,3,1])
>>> df[df[0].diff()!=0]
0
0 1
2 2
6 3
10 1
>>> df[df[0].diff()!=0].values.ravel() # If you need an array
array([1, 2, 3, 1])
Same works for Series:
>>> df = pd.Series([1,1,2,2,2,2,3,3,3,3,1])
>>> df[df.diff()!=0].values
array([1, 2, 3, 1])
You can use shift to create a boolean mask to compare the row against the previous row:
In [67]:
s = pd.Series([1,1,2,2,2,2,3,3,3,3,4,4,5])
s[s!=s.shift()]
Out[67]:
0 1
2 2
6 3
10 4
12 5
dtype: int64
