I have a large dataframe mapping users (index) to counts of items (columns):
import numpy as np
import pandas as pd

users_items = pd.DataFrame(np.array([[0, 1, 1, 0],   # user 0
                                     [1, 0, 0, 0],   # user 1
                                     [5, 0, 0, 9],   # user 2
                                     [0, 3, 5, 0],   # user 3
                                     [0, 2, 2, 0],   # user 4
                                     [7, 0, 0, 1],   # user 5
                                     [3, 5, 0, 4]]), # user 6
                           columns=list('ABCD'))
For each user, I want to find all the users that have non-zero counts for at least the same items and sum their counts. So for user 1, this would be users 1, 2, 5 and 6 and the sum of the counts equals [16, 5, 0, 14]. This can be used to suggest new items to users based on the items that "similar" users got.
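(To make that rule concrete, here is one way to check the worked example for user 1 directly; the variable names below are purely illustrative and not part of my actual code.)
items_of_user_1 = users_items.columns[users_items.loc[1] > 0]                 # just 'A'
similar_users = users_items[(users_items[items_of_user_1] > 0).all(axis=1)]   # users 1, 2, 5 and 6
similar_users.sum()                                                           # A 16, B 5, C 0, D 14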
This naive implementation uses each signature as a regular expression to filter the relevant rows, and a for loop over all signatures:
def create_signature(request_counts):
    return ''.join('x' if count else '.' for count in request_counts)

users_items['signature'] = users_items.apply(create_signature, axis=1).astype('category')
current_items = users_items.groupby('signature').sum()

similar_items = pd.DataFrame(index=current_items.index,
                             columns=current_items.columns)
for signature in current_items.index:
    row = current_items.filter(regex=signature, axis='index').sum()
    similar_items.loc[signature] = row
The result is:
            A  B  C   D
signature
.xx.        0  6  8   0
x...       16  5  0  14
x..x       15  5  0  14
xx.x        3  5  0   4
This works fine, but it is too slow for the actual data set which consists of 100k users and some 600 items. Generating the signatures takes only 10 seconds, but looping over all (40k) signatures takes several hours.
Vectorizing the loop should offer a huge performance boost, but my experience with Pandas is limited, so I'm not sure how to go about it. Is it even possible to vectorize this type of calculation? Perhaps using masks?
Instead of a string signature, you can use a frozenset:
def create_signature(request_counts):
    return frozenset(request_counts[request_counts != 0].index)
An alternative is:
def create_signature(request_counts):
    return frozenset(request_counts.replace({0: None}).dropna().index)
I don't have a dataset large enough to see whether one is faster than the other.
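For the sample frame above, either version produces signatures like these (the exact element order shown inside each frozenset may vary between runs):
users_items['signature'] = users_items.apply(create_signature, axis=1)
print(users_items['signature'].tolist())
# [frozenset({'B', 'C'}), frozenset({'A'}), frozenset({'A', 'D'}), frozenset({'B', 'C'}),
#  frozenset({'B', 'C'}), frozenset({'A', 'D'}), frozenset({'A', 'B', 'D'})]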
If you have duplicate columns, insert a call to reset_index() before the .index
This allows you to vectorise the filter at the end:
for signature in current_items.index:
    row = current_items[signature <= current_items.index].sum()
    similar_items.loc[signature] = row
This results in:
                            A  B  C   D
signature
frozenset({'B', 'C'})       0  6  8   0
frozenset({'A'})           16  5  0  14
frozenset({'A', 'D'})      15  5  0  14
frozenset({'B', 'A', 'D'})  3  5  0   4
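If the remaining Python-level loop over ~40k signatures is still too slow, the same subset test can in principle be done entirely with NumPy masks. The following is only a rough sketch (the names sig, shared and is_superset are made up here, and it assumes current_items is the per-signature sum built above); note that it builds an n_signatures × n_signatures matrix, so memory may become the limiting factor with 40k signatures:
sig = (current_items.to_numpy() != 0)                     # boolean item mask per signature
shared = sig.astype(np.int32) @ sig.astype(np.int32).T    # shared[i, j] = number of items common to signatures i and j
is_superset = shared >= sig.sum(axis=1)[:, None]          # is_superset[i, j]: signature j contains every item of signature i
similar_items = pd.DataFrame(is_superset.astype(np.int64) @ current_items.to_numpy(),
                             index=current_items.index,
                             columns=current_items.columns)
For the toy frame above this reproduces the table shown.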
Related
I'm currently creating a new column in my pandas dataframe, calculated by subtracting a fixed value from another column. This is my current code, which almost gives me the output I desire (example shortened for reproduction):
subtraction_value = 3
data = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})
data['new_column'] = data['test'][::-1] - subtraction_value
When run, this gives me the current output:
print(data['new_column'])
[9, 1, 2, 1, -2, 0, -1, 2, 7, 6]
However, what if I wanted to subtract a different value at position [0], use the original subtraction value on positions [1] through [3], apply the second value again at position [4], and repeat this pattern down the column? I realize I could use a for loop to achieve this, but for performance reasons I'd like to do it another way. My new output would ideally look like this:
subtraction_value_2 = 6
print(data['new_column'])
[6, 1, 2, 1, -5, 0, -1, 2, 4, 6]
You can use positional indexing:
subtraction_value_2 = 6
col = data.columns.get_loc('new_column')
data.iloc[0::4, col] = data['test'].iloc[0::4].sub(subtraction_value_2)
or with numpy.where:
data['new_column'] = np.where(data.index % 4,
                              data['test'] - subtraction_value,
                              data['test'] - subtraction_value_2)
output:
   test  new_column
0    12           6
1     4           1
2     5           2
3     4           1
4     1          -5
5     3           0
6     2          -1
7     5           2
8    10           4
9     9           6
subtraction_value = 3
subtraction_value_2 = 6

data = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})
data['new_column'] = data.test - subtraction_value
# use .loc rather than chained indexing so the assignment is guaranteed to hit the frame
data.loc[::4, 'new_column'] = data.test[::4] - subtraction_value_2
print(list(data.new_column))
Output:
[6, 1, 2, 1, -5, 0, -1, 2, 4, 6]
In the dataframe below, the column "CumRetperTrade" consists of a few vertical vectors (sequences of numbers) separated by zeros (these vectors correspond to the non-zero elements of column "Portfolio"). I would like to find the cumulative local maxima of every non-zero vector contained in column "CumRetperTrade".
To be precise, I would like to transform (using vectorization or other methods) column "CumRetperTrade" into the column "PeakCumRet" (desired result), which gives, for every vector (i.e. the subset where 'Portfolio' = 1) contained in column "CumRetperTrade", the cumulative maximum of all its previous values. The numeric example is below. Thanks in advance!
PS In other words, I guess that we need to use cummax(), but apply it only to the consecutive (where 'Portfolio' = 1) subsets of 'CumRetperTrade'.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({"Portfolio":      [1, 1, 1, 1, 0, 0, 0, 1, 1, 1],
                    "CumRetperTrade": [2, 3, 2, 1, 0, 0, 0, 4, 2, 1],
                    "PeakCumRet":     [2, 3, 3, 3, 0, 0, 0, 4, 4, 4]})
df1
   Portfolio  CumRetperTrade  PeakCumRet
0          1               2           2
1          1               3           3
2          1               2           3
3          1               1           3
4          0               0           0
5          0               0           0
6          0               0           0
7          1               4           4
8          1               2           4
9          1               1           4
PPS I already asked a similar question previously (Dataframe column: to find local maxima) and received a correct answer; however, in that question I did not explicitly mention the requirement of cumulative local maxima.
You only need a small modification to the previous answer:
df1["PeakCumRet"] = (
df1.groupby(df1["Portfolio"].diff().ne(0).cumsum())
["CumRetperTrade"].expanding().max()
.droplevel(0)
)
expanding().max() is what produces the running (cumulative) maxima within each group.
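Since the question mentions cummax(), the same grouping can also be combined with it directly; this is just a restatement of the code above (the grp name is arbitrary) and should give the same result:
grp = df1["Portfolio"].diff().ne(0).cumsum()   # label consecutive runs of equal 'Portfolio' values
df1["PeakCumRet"] = df1.groupby(grp)["CumRetperTrade"].cummax()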
I'm pretty sure there's a really simple solution for this and I'm just not realising it. However...
I have a data frame of high-frequency data. Call this data frame A. I also have a separate list of far lower frequency demarcation points, call this B. I would like to append a column to A that would display 1 if A's timestamp column is between B[0] and B[1], 2 if it is between B[1] and B[2], and so on.
As said, it's probably incredibly trivial, and I'm just not realising it at this late an hour.
Here is a quick and dirty approach using a list comprehension.
>>> df = pd.DataFrame({'A': np.arange(1, 3, 0.2)})
>>> A = df.A.values.tolist()
A: [1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8]
>>> B = np.arange(0, 3, 1).tolist()
B: [0, 1, 2]
>>> BA = [k for k in range(0, len(B)-1) for a in A if (B[k]<=a) & (B[k+1]>a) or (a>max(B))]
BA: [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Use searchsorted:
A['group'] = B['timestamp'].searchsorted(A['timestamp'])
For each value in A['timestamp'], an index value is returned. That index indicates where amongst the sorted values in B['timestamp'] that value from A would be inserted into B in order to maintain sorted order.
For example,
import numpy as np
import pandas as pd
np.random.seed(2016)
N = 10
A = pd.DataFrame({'timestamp':np.random.uniform(0, 1, size=N).cumsum()})
B = pd.DataFrame({'timestamp':np.random.uniform(0, 3, size=N).cumsum()})
# B is:
#    timestamp
# 0   1.739869
# 1   2.467790
# 2   2.863659
# 3   3.295505
# 4   5.106419
# 5   6.872791
# 6   7.080834
# 7   9.909320
# 8  11.027117
# 9  12.383085
A['group'] = B['timestamp'].searchsorted(A['timestamp'])
print(A)
yields
   timestamp  group
0   0.896705      0
1   1.626945      0
2   2.410220      1
3   3.151872      3
4   3.613962      4
5   4.256528      4
6   4.481392      4
7   5.189938      5
8   5.937064      5
9   6.562172      5
Thus, the timestamp 0.896705 is in group 0 because it comes before B['timestamp'][0] (i.e. 1.739869). The timestamp 2.410220 is in group 1 because it is larger than B['timestamp'][0] (i.e. 1.739869) but smaller than B['timestamp'][1] (i.e. 2.467790).
You should also decide what to do if a value in A['timestamp'] is exactly equal to one of the cutoff values in B['timestamp']. Use
B['timestamp'].searchsorted(A['timestamp'], side='left')
if you want searchsorted to return i when the value equals B['timestamp'][i]. Use
B['timestamp'].searchsorted(A['timestamp'], side='right')
if you want searchsorted to return i+1 in that situation. If you don't specify side, then side='left' is used by default.
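A tiny illustration of the difference, using a plain NumPy array as the cutoffs:
import numpy as np
cutoffs = np.array([1.0, 2.0, 3.0])
print(np.searchsorted(cutoffs, 2.0, side='left'))   # 1 -- a tie stays with the lower group
print(np.searchsorted(cutoffs, 2.0, side='right'))  # 2 -- a tie moves to the upper group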
I know that the order of the keys is not guaranteed and that's OK, but what exactly does it mean that the order of the values is not guaranteed as well*?
For example, I am representing a matrix as a dictionary, like this:
signatures_dict = {}
M = 3
for i in range(M):          # three rows
    row = []
    for j in range(1, 6):   # row is [1, 2, 3, 4, 5]
        row.append(j)
    signatures_dict[i] = row
print(signatures_dict)
Are the columns of my matrix correctly constructed? Let's say I have 3 rows, and at the signatures_dict[i] = row line, row always holds 1, 2, 3, 4, 5. What will signatures_dict be?
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
or something like
1 2 3 4 5
1 4 3 2 5
5 1 3 4 2
? I am worried about cross-platform support.
In my application, the rows are words and the columns documents, so can I say that the first column is the first document?
*Are order of keys() and values() in python dictionary guaranteed to be the same?
You are guaranteed to have 1 2 3 4 5 in each row; Python will not reorder them. The lack of ordering of values() refers to the fact that if you call signatures_dict.values(), the rows could come out in any order. But the values are the rows, not the elements of each row. Each row is a list, and lists maintain their order.
If you want a dict which maintains order, Python has that too: https://docs.python.org/2/library/collections.html#collections.OrderedDict
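A minimal sketch of the difference: with a plain dict only the rows keep their internal order, while an OrderedDict also remembers the order in which the rows were inserted.
from collections import OrderedDict

ordered = OrderedDict()
for i in range(3):
    ordered[i] = [1, 2, 3, 4, 5]

print(list(ordered.keys()))    # [0, 1, 2] -- keys stay in insertion order
print(ordered[0])              # [1, 2, 3, 4, 5] -- each row keeps its element order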
Why not use a list of lists as your matrix? It would have whatever order you gave it:
In [1]: matrix = [[i for i in range(4)] for _ in range(4)]
In [2]: matrix
Out[2]: [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]]
In [3]: matrix[0][0]
Out[3]: 0
In [4]: matrix[3][2]
Out[4]: 2
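And a variant that matches the shape described in the question (3 rows holding 1 to 5):
In [5]: matrix = [[j for j in range(1, 6)] for _ in range(3)]

In [6]: matrix
Out[6]: [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]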
I have a DataFrame with one column with positive and negative integers. For each row, I'd like to see how many consecutive rows (starting with and including the current row) have negative values.
So if a sequence was 2, -1, -3, 1, -1, the result would be 0, 2, 1, 0, 1.
I can do this by iterating over all the indices, using .iloc to split the column, and next() to find out where the next positive value is. But I feel like this isn't taking advantage of panda's capabilities, and I imagine that there's a better way of doing it. I've experimented with using .shift() and expanding_window but without success.
Is there a more "pandastic" way of finding out how many consecutive rows after the current one meet some logical condition?
Here's what's working now:
import pandas as pd

df = pd.DataFrame({"a": [2, -1, -3, -1, 1, 1, -1, 1, -1]})
df["b"] = 0
for i in df.index:
    sub = df.iloc[i:].a.tolist()
    df.b.iloc[i] = next((sub.index(n) for n in sub if n >= 0), 1)
Edit: I realize that even my own example doesn't work when there's more than one negative value at the end. So that makes a better solution even more necessary.
Edit 2: I stated the problem in terms of integers, but originally only put 1 and -1 in my example. I need to solve for positive and negative integers in general.
FWIW, here's a fairly pandastic answer that requires no functions or applies. Borrows from here (among other answers I'm sure) and thanks to @DSM for mentioning the ascending=False option:
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [2, -1, -3, -1, 1, 1, -1, 1, -1, -2]})
df['pos'] = df.a > 0
df['grp'] = (df['pos'] != df['pos'].shift()).cumsum()
dfg = df.groupby('grp')
df['c'] = np.where(df['a'] < 0, dfg.cumcount(ascending=False) + 1, 0)
   a  b    pos  grp  c
0  2  0   True    1  0
1 -1  3  False    2  3
2 -3  2  False    2  2
3 -1  1  False    2  1
4  1  0   True    3  0
5  1  0   True    3  0
6 -1  1  False    4  1
7  1  0   True    5  0
8 -1  1  False    6  2
9 -2  1  False    6  1
I think a nice thing about this method is that once you set up the 'grp' variable you can do lots of things very easily with standard groupby methods.
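For example, the same grp labels give run lengths or the number of runs as one-liners (the names here are just illustrative):
df['run_length'] = df.groupby('grp')['a'].transform('size')   # length of the run each row belongs to
n_runs = df['grp'].nunique()                                   # total number of same-sign runs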
This was an interesting puzzle. I found a way to do it using pandas tools, but I think you'll agree it's a lot more opaque :-). Here's the example:
import numpy as np
import pandas

data = pandas.Series([1, -1, -1, -1, 1, -1, -1, 1, 1, -1, 1])
x = data[::-1]  # reverse the data
print(x.groupby(((x<0) != (x<0).shift()).cumsum()).apply(lambda x: pandas.Series(
    np.arange(len(x))+1 if (x<0).all() else np.zeros(len(x)),
    index=x.index))[::-1])
The output is correct:
0     0
1     3
2     2
3     1
4     0
5     2
6     1
7     0
8     0
9     1
10    0
dtype: float64
The basic idea is similar to what I described in my answer to this question, and you can find the same approach used in various answers that ask how to make use of inter-row information in pandas. Your question is slightly trickier because your criterion goes in reverse (asking for the number of following negatives rather than the number of preceding negatives), and because you only want one side of the grouping (i.e., you only want the number of consecutive negatives, not the number of consecutive numbers with the same sign).
Here is a more verbose version of the same code with some explanation that may make it easier to grasp:
def getNegativeCounts(x):
    # This function takes as input a sequence of numbers, all the same sign.
    # If they're negative, it returns an increasing count of how many there are.
    # If they're positive, it just returns the same number of zeros.
    # [-1, -2, -3] -> [1, 2, 3]
    # [1, 2, 3] -> [0, 0, 0]
    if (x<0).all():
        return pandas.Series(np.arange(len(x))+1, index=x.index)
    else:
        return pandas.Series(np.zeros(len(x)), index=x.index)

# we have to reverse the data because cumsum only works in the forward direction
x = data[::-1]
# mark, for each number, whether its sign differs from the previous one's
signChangedFromPrevious = (x<0) != (x<0).shift()
# cumsum this to get an "ID" for each block of consecutive same-sign numbers
sameSignBlocks = signChangedFromPrevious.cumsum()
# group on these block IDs
g = x.groupby(sameSignBlocks)
# for each block, apply getNegativeCounts
# this will either give us the running total of negatives in the block,
# or a stretch of zeros if the block was positive
# the [::-1] at the end reverses the result
# (to compensate for our reversing the data initially)
g.apply(getNegativeCounts)[::-1]
As you can see, run-length-style operations are not usually simple in pandas. There is, however, an open issue for adding more grouping/partitioning abilities that would ameliorate some of this. In any case, your particular use case has some specific quirks that make it a bit different from a typical run-length task.