Get index from rows with matching values in different columns - python

I have a DataFrame like this:
N1 N2
0 a b
1 b f
2 c d
3 d a
4 e b
I want to get the indices where a value in one column also appears in the other column, together with the value itself.
From the example, I should get tuples like these:
(value, idx(N1), idx(N2))
(a, 0, 3)
(b, 1, 0)
(b, 1, 4)
(d, 3, 2)
I have been able to do it with two for-loops, but for a half-million-row dataframe it took hours...

Use a numpy broadcasted comparison and then argwhere to find the indices where the values are equal:
import numpy as np
import pandas as pd

# example frame from the question
df = pd.DataFrame({'N1': list('abcde'), 'N2': list('bfdab')})

# make a broadcasted comparison (n x n boolean matrix)
mat = df['N2'].values == df['N1'].values[:, None]
# find the indices where the values are True
where = np.argwhere(mat)
# select the matching values
values = df['N1'].values[where[:, 0]]
# create the DataFrame
res = pd.DataFrame(data=[[val, *row] for val, row in zip(values, where)],
                   columns=['values', 'idx_N1', 'idx_N2'])
print(res)
Output
values idx_N1 idx_N2
0 a 0 3
1 b 1 0
2 b 1 4
3 d 3 2
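For a half-million-row frame, the broadcast above materialises an n x n boolean matrix, which may not fit in memory. If that becomes a problem, a plain self-merge on the values gives the same pairs without building that matrix; a rough sketch using the example's column names:
left = df[['N1']].reset_index().rename(columns={'index': 'idx_N1', 'N1': 'values'})
right = df[['N2']].reset_index().rename(columns={'index': 'idx_N2', 'N2': 'values'})
res_merge = left.merge(right, on='values')[['values', 'idx_N1', 'idx_N2']]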

Related

Pandas DataFrame conditional selection with list comprehension

I have a dataframe with 15 columns named 0,1,2,...,14. I would like to write a method that would take in this data, and a vector of length 15. I would like it to return dataframe conditionally selected based on this vector that I have passed. E.g. the data passed is data_ and the vector passed is v_
I would like to produce that:
data[(data[0] == v_[0]) & (data[1] == v_[1]) & ... & (data[14] == v_[14])]
However, I would like the method to be flexible, e.g. I could pass in a dataframe of 100 columns named 0, ..., 99 and a vector of length 99. My problem is that I do not know how to cleverly create [(data[0] == v_[0]) & (data[1] == v_[1]) & ... & (data[14] == v_[14])] programmatically to account for the "&" sign. I would be equally satisfied with a method that could merge multiple NxM matrices filled with True and False on "and" or "or" into a single NxM matrix.
Thank You very much!
You can try this:
def custom_filter(data, v):
    if len(data.columns) == len(v):
        # If data has the same number of columns
        # as v has elements
        mask = (data == v).all(axis=1)
    else:
        # If they have a different length, we'll need to subset
        # the data first, then create our mask
        # This attempts to subset the dataframe by assuming columns
        # 0 .. len(v) - 1 exist as columns, and will throw an error
        # otherwise
        colnames = list(range(len(v)))
        mask = (data[colnames] == v).all(axis=1)
    return data.loc[mask, :]
df = pd.DataFrame({
    "F": list("hiadsfin"),
    0: list("aaaabbbb"),
    1: list("cccdddee"),
    2: list("ffgghhij"),
    "H": list(range(1, 9))
})
v = ["a", "c", "f"]
df
F 0 1 2 H
0 h a c f 1
1 i a c f 2
2 a a c g 3
3 d a d g 4
4 s b d h 5
5 f b d h 6
6 i b e i 7
7 n b e j 8
custom_filter(df, v)
F 0 1 2 H
0 h a c f 1
1 i a c f 2
Note that with this function, if the number of columns exactly matches the length of your vector v, then you do not need to ensure the columns are labelled as 0, 1, 2, ..., len(v)-1. However, if you have more columns than elements of v, you need to ensure that a subset of those columns is labelled as 0, 1, 2, ..., len(v)-1. If v is longer than the number of columns in your dataframe, this will throw an error.
This might work:
data[(data==v_.transpose())].dropna(axis=1)
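For the second part of the question (collapsing several True/False matrices with "and" or "or"), numpy's logical reductions do exactly that; a small sketch with three hypothetical same-shaped boolean arrays:
import numpy as np

# three hypothetical boolean matrices of the same shape
m1 = np.array([[True, False], [True, True]])
m2 = np.array([[True, True], [False, True]])
m3 = np.array([[True, False], [True, True]])

combined_and = np.logical_and.reduce([m1, m2, m3])  # elementwise "and"
combined_or = np.logical_or.reduce([m1, m2, m3])    # elementwise "or"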

Transformation to simplify matrix with python

First of all, thank you for any support. This is my first question published, as usually my doubts are solved by reading through other users' questions.
Here is my question: I have a number (n) of sets with common elements. These elements are usually added sequentially, creating new sets, although I do not have the sequence, and the sequence is what I am trying to find. It is not always perfect, so at some points I have to settle for the closest one, with some uncertainty, when the sequence is not 'perfect'.
I coded it using set theory, sequentially searching for the set that contains all the other sets; when I do not reach the last set, I start again from the smallest to the biggest.
I gave some thought to the topic and found, in theory, a more robust and generic approach. The idea is to build a square matrix with the n sets as row index (i) and the n sets as column index (j). The element (i, j) will be equal to 1 when set j is contained in set i.
Here I have an example with sets A to G:
A={a, b, c, d1, d2, e, f};
B={b, c, d1, d2, e, f};
C={c, d1, d2, e, f};
D={d1, f, g};
E={d2, f, g};
F={f, g};
G={g};
If I create the matrix assuming sequence B, E, C, F, D, A, G, I would have:
B E C F D A G
B 1 1 1 1 1 0 1
E 0 1 0 1 0 0 1
C 0 1 1 1 1 0 1
F 0 0 0 1 0 0 1
D 0 0 0 1 1 0 1
A 1 1 1 1 1 1 1
G 0 0 0 0 0 0 1
I should get this matrix transformed into following matrix:
A B C D E F G
A 1 1 1 1 1 1 1
B 0 1 1 1 1 1 1
C 0 0 1 1 1 1 1
D 0 0 0 1 0 1 1
E 0 0 0 0 1 1 1
F 0 0 0 0 0 1 1
G 0 0 0 0 0 0 1
This shows one of the two possible sequences: A, B, C, D, E, F, G.
My first question is how you recommend handling this matrix (which data type should I use, with typical functions to swap rows and columns).
My second question is whether there is already a matrix transformation function for this task.
From my (small) experience, the most commonly used types for matrices are lists and numpy.ndarrays.
For column swaps in particular, I would recommend numpy. There are many array creation routines in numpy: you either give the list with data explicitly, or you create an array based on the shape you want. Example
>>> import numpy as np
>>> np.array([1, 2, 3])
array([1, 2, 3])
>>> np.array([[1, 2, 3], [1, 2, 3]])
array([[1, 2, 3],
[1, 2, 3]])
>>> np.zeros((2, 2))
array([[0., 0.],
[0., 0.]])
np.zeros accepts a shape as an argument (number of rows and columns for matrices). Of course, you can create arrays with how many dimensions you want.
numpy is quite complex regarding indexing its arrays. For a matrix you have:
>>> a = np.arange(6).reshape(2, 3)
>>> a
array([[0, 1, 2],
[3, 4, 5]])
>>> a[0] # row indexing
array([0, 1, 2])
>>> a[1, 1] # element indexing
4
>>> a[:, 2] # column indexing
array([2, 5])
Hopefully the examples are self-explanatory. Regarding the column index, : means "over all the values". So you specify a column index and the fact that you want all the values on that column.
For swapping rows and columns it's pretty short:
>>> a = np.arange(6).reshape(2, 3)
>>> a
array([[0, 1, 2],
[3, 4, 5]])
>>> a[[0, 1]] = a[[1, 0]] # row swapping
>>> a
array([[3, 4, 5],
[0, 1, 2]])
>>> a[:, [0, 2]] = a[:, [2, 0]] # column swapping
>>> a
array([[5, 4, 3],
[2, 1, 0]])
Here advanced indexing is used. Each dimension (called axis by numpy) can accept a list of indices, so you can get 2 or more rows/columns at the same time from a matrix.
You don't have to ask for them in a certain order. numpy gives you the values in the order you ask for them.
Swapping rows is done by asking numpy for the two rows in reversed order and saving them in their original positions. It follows the pythonic way of swapping values between two variables (although wrapped in more complex indexing):
a, b = b, a
Regarding matrix transformation, it depends on what you are looking for.
Using the swapping ideas from the answer above, I made my own functions to find all the swaps needed to get the triangular matrix.
Here I write the code:
def simple_sort_matrix(matrix):
    orden = np.array([i for i in range(len(matrix[0]))])
    change = True
    while change:
        rows_index = row_index_ones(matrix)
        change = False
        #for i in range(len(rows_index)-1):
        i = 0
        while i  # (truncated in the original post)

def swap_row_and_column(matrix, i, j):
    matrix[[i, j]] = matrix[[j, i]]        # row swapping
    matrix[:, [i, j]] = matrix[:, [j, i]]  # column swapping
    return matrix

def row_index_ones(matrix):
    return np.sum(matrix, axis=1)
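Since the loop body above is truncated, here is a minimal self-contained sketch of the same idea: order the rows (and the matching columns) by their number of ones, so the larger, containing sets come first; for a consistent containment matrix this yields the triangular form. The 7x7 example matrix is the one from the question.
import numpy as np

def sort_containment_matrix(matrix):
    # order sets by how many sets they contain (row sums), descending
    order = np.argsort(-matrix.sum(axis=1))
    # apply the same permutation to rows and columns
    return matrix[np.ix_(order, order)], order

m = np.array([[1, 1, 1, 1, 1, 0, 1],   # B
              [0, 1, 0, 1, 0, 0, 1],   # E
              [0, 1, 1, 1, 1, 0, 1],   # C
              [0, 0, 0, 1, 0, 0, 1],   # F
              [0, 0, 0, 1, 1, 0, 1],   # D
              [1, 1, 1, 1, 1, 1, 1],   # A
              [0, 0, 0, 0, 0, 0, 1]])  # G
sorted_m, order = sort_containment_matrix(m)
print(order)     # permutation that puts the largest set first
print(sorted_m)  # triangular form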
Best regards,
Pablo

Take the difference of all elements of a series with the previous ones in python pandas

I have a dataframe with sorted values labeled by ids, and I want to take the difference between the value of the first element of an id and the value of the last element of each of the previous ids. The code below does what I want:
import pandas as pd

a = 'a'; b = 'b'; c = 'c'
df = pd.DataFrame(data=[*zip([a, a, a, b, b, c, a], [1, 2, 3, 5, 6, 7, 8])],
                  columns=['id', 'value'])
print(df)

# # take the last value for a particular id
# last_value_for_id = df.loc[df.id.shift(-1) != df.id, :]
# print(last_value_for_id)

current_id = ''; prev_values = {}; diffs = {}
for t in df.itertuples(index=False):
    prev_values[t.id] = t.value
    if current_id != t.id:
        current_id = t.id
    else:
        continue
    for k, v in prev_values.items():
        if k == current_id:
            continue
        diffs[(k, current_id)] = t.value - v
print(pd.DataFrame(data=diffs.values(), columns=['diff'], index=diffs.keys()))
prints:
id value
0 a 1
1 a 2
2 a 3
3 b 5
4 b 6
5 c 7
6 a 8
diff
a b 2
c 4
b c 1
a 2
c a 1
I want to do this in a vectorized manner however. I have found a way of getting the series of last elements as in:
# take the last value for a particular id
last_value_for_id = df.loc[df.id.shift(-1) != df.id, :]
print(last_value_for_id)
which gives me:
id value
2 a 3
4 b 6
5 c 7
but can't find a way of using this to take the diffs in a vectorized manner
Depending on how many ids you have, this works with a few thousand:
import numpy as np

# enumerate ids, should be careful
ids = [a, b, c]
num_ids = len(ids)

# compute first and last value per id
f = df.groupby('id').value.agg(['first', 'last'])

# lower-triangle mask
mask = np.array([[i >= j for j in range(num_ids)] for i in range(num_ids)])

# compute diff of first and last, then mask the lower triangle
diff = np.where(mask, None, f['first'].values[None, :] - f['last'].values[:, None])
diff = pd.DataFrame(diff, index=ids, columns=ids)

# stack
diff.stack()
output:
a b 2
c 4
b c 1
dtype: object
Edit for updated data:
For the updated data, the approach is similar if we can create the f table:
# create blocks of consecutive id
blocks = df['id'].ne(df['id'].shift()).cumsum()
# groupby
groups = df.groupby(blocks)
# create first and last values
df['fv'] = groups.value.transform('first')
df['lv'] = groups.value.transform('last')
# the above f and ids
# note the column name change
f = df[['id','fv', 'lv']].drop_duplicates()
ids = f['id'].values
num_ids = len(ids)
Output:
a b 2
c 4
a 5
b c 1
a 2
c a 1
dtype: object
If you want to go further and drop the index (a,a), well, I'm so lazy :D.
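For completeness, one way to drop those self-pairs from the stacked result (assuming diff was rebuilt as above from the new f) is to compare the two index levels:
out = diff.stack()
# drop pairs whose two index levels are the same id, e.g. (a, a)
out = out[out.index.get_level_values(0) != out.index.get_level_values(1)]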
My method
s = df.groupby(df.id.shift().ne(df.id).cumsum()).agg({'id': 'first', 'value': ['min', 'max']})
s.columns = s.columns.droplevel(0)
t = s['min'].values[:, None] - s['max'].values
t = t.astype(float)
Below is all reshaping, to match your output:
t[np.triu_indices(t.shape[1], 0)] = np.nan
newdf = pd.DataFrame(t, index=s['first'], columns=s['first'])
newdf.values[newdf.index.values[:, None] == newdf.index.values] = np.nan
newdf = newdf.T.stack()
newdf
Out[933]:
first first
a b 2.0
c 4.0
b c 1.0
a 2.0
c a 1.0
dtype: float64

Conversion of pandas dataframe to sparse key-item matrix with composite key

I have a data frame of 3 columns. Col 1 is a string order number, Col 2 is an integer day, and Col 3 is a product name.
I would like to convert this into a matrix where each row represents a unique order/day combination, and each column represents a 1/0 for the presence of a product name for that combination.
My approach so far makes use of a product dictionary, and a dictionary with a composite key of order # & day.
The final step, which iterates through the original dataframe in order to flip the bits in the matrix to 1s is sloooow. Like 10 minutes for a matrix the size of 363K X 331 and a sparseness of ~97%.
Is there a different approach I should consider?
E.g.,
ord_nb day prod
1 1 A
1 1 B
1 2 B
1 2 C
1 2 D
would become
A B C D
1 1 0 0
0 1 1 1
My approach has been to create a dictionary of order/day pairs:
ord_day_dict = {}
print("Making a dictionary of ord-by-day keys...")
gp = df.groupby(['day', 'ord_nb'])
for i, g in enumerate(gp.groups.items()):
    ord_day_dict[g[0][0], g[0][1]] = i
I append the index representation to the original dataframe:
df['ord_day_idx'] = 0  # create a placeholder column
for i, row in df.iterrows():  # populate the column with the index
    df.set_value(i, 'ord_day_idx', ord_day_dict[(row['day'], row['ord_nb'])])
I then initialize a matrix the size of my ord/day X unique products:
n_items = df['prod'].unique().shape[0]  # unique number of products
n_ord_days = len(ord_day_dict)  # unique number of ord-by-day combos
df_fac_matrix = np.zeros((n_ord_days, n_items), dtype=np.float64)
I convert my products from strings into an index via a dictionary:
prod_dict = dict()
i = 0
for v in df['prod']:
    if v not in prod_dict:
        prod_dict[v] = i
        i = i + 1
And finally iterate through the original dataframe to populate the matrix with 1s where a specific order on a specific day included a specific product.
for line in df.itertuples():
    # in the order-by-day index row and the product index column
    # of our ord/day-by-prod matrix, mark a 1
    df_fac_matrix[line[4], line[3]] = 1.0
Here is one option you can try:
df.groupby(['ord_nb', 'day'])['prod'].apply(list).apply(lambda x: pd.Series(1, x)).fillna(0)
# A B C D
#ord_nb day
# 1 1 1.0 1.0 0.0 0.0
# 2 0.0 1.0 1.0 1.0
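A crosstab gives essentially the same table in one call; a brief sketch (df is the example frame from the question, and the result is clipped to 1 in case an order/day/product combination repeats):
out = pd.crosstab([df['ord_nb'], df['day']], df['prod']).clip(upper=1)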
Here's a NumPy based approach to have an array as output -
a = df[['ord_nb','day']].values.astype(int)
row = np.unique(np.ravel_multi_index(a.T,a.max(0)+1),return_inverse=1)[1]
col = np.unique(df.prd.values,return_inverse=1)[1]
out_shp = row.max()+1, col.max()+1
out = np.zeros(out_shp, dtype=int)
out[row,col] = 1
Please note that the third column was assumed to be of name 'prd' instead to avoid name conflict with built-in.
Possible improvements with focus on performance -
If prd has single letter characters only starting from A, we could compute col with simply : df.prd.values.astype('S1').view('uint8')-65.
Alternatively, we could compute row with : np.unique(a[:,0]*(a[:,1].max()+1) + a[:,1],return_inverse=1)[1].
Saving memory with sparse array : For really huge arrays, we could save on memory by storing them as sparse matrices. Thus, the final steps to get such a sparse matrix would be -
from scipy.sparse import coo_matrix
d = np.ones(row.size,dtype=int)
out_sparse = coo_matrix((d,(row,col)), shape=out_shp)
Sample input, output -
In [232]: df
Out[232]:
ord_nb day prd
0 1 1 A
1 1 1 B
2 1 2 B
3 1 2 C
4 1 2 D
In [233]: out
Out[233]:
array([[1, 1, 0, 0],
[0, 1, 1, 1]])
In [241]: out_sparse
Out[241]:
<2x4 sparse matrix of type '<type 'numpy.int64'>'
with 5 stored elements in COOrdinate format>
In [242]: out_sparse.toarray()
Out[242]:
array([[1, 1, 0, 0],
[0, 1, 1, 1]])

Comparing rows of two pandas dataframes?

This is a continuation of my question: Fastest way to compare rows of two pandas dataframes?
I have two dataframes A and B:
A is 1000 rows x 500 columns, filled with binary values indicating either presence or absence.
For a condensed example:
A B C D E
0 0 0 0 1 0
1 1 1 1 1 0
2 1 0 0 1 1
3 0 1 1 1 0
B is 1024 rows x 10 columns, and is a full iteration from 0 to 1023 in binary form.
Example:
0 1 2
0 0 0 0
1 0 0 1
2 0 1 0
3 0 1 1
4 1 0 0
5 1 0 1
6 1 1 0
7 1 1 1
I am trying to find which rows in A, at a particular 10 columns of A, correspond with each row of B.
Each row of A[My_Columns_List] is guaranteed to be somewhere in B, but not every row of B will match up with a row in A[My_Columns_List]
For example, I want to show that for columns [B,D,E] of A,
rows [1,3] of A match up with row [6] of B,
row [0] of A matches up with row [2] of B,
row [2] of A matches up with row [3] of B.
I have tried using:
pd.merge(B.reset_index(), A.reset_index(),
         left_on=B.columns.tolist(),
         right_on=A.columns[My_Columns_List].tolist(),
         suffixes=('_B', '_A'))
This works, but I was hoping that this method would be faster:
S = 2**np.arange(10)
A_ID = np.dot(A[My_Columns_List],S)
B_ID = np.dot(B,S)
out_row_idx = np.where(np.in1d(A_ID,B_ID))[0]
But when I do this, out_row_idx returns an array containing all the indices of A, which doesn't tell me anything.
I think this method will be faster, but I don't know why it returns an array from 0 to 999.
Any input would be appreciated!
Also, credit goes to @jezrael and @Divakar for these methods.
I'll stick by my initial answer but maybe explain better.
You are asking to compare 2 pandas dataframes. Because of that, I'm going to build dataframes. I may use numpy, but my inputs and outputs will be dataframes.
Setup
You said we have a 1000 x 500 array of ones and zeros. Let's build that.
A_init = pd.DataFrame(np.random.binomial(1, .5, (1000, 500)))
A_init.columns = pd.MultiIndex.from_product([range(A_init.shape[1] // 10), range(10)])
A = A_init
In addition, I gave A a MultiIndex to easily group by columns of 10.
Solution
This is very similar to @Divakar's answer, with one minor difference that I'll point out.
For one group of 10 ones and zeros, we can treat it as a bit array of length 10. We can then calculate its integer value by taking the dot product with an array of powers of 2.
twos = 2 ** np.arange(10)
I can execute this for every group of 10 ones and zeros in one go like this
AtB = A.stack(0).dot(twos).unstack()
I stack to get a row of 50 groups of 10 into columns in order to do the dot product more elegantly. I then brought it back with the unstack.
I now have a 1000 x 50 dataframe of numbers that range from 0-1023.
Assume B is a dataframe with each row one of the 1024 unique combinations of ones and zeros, sorted so that row i is the binary representation of i (e.g. B = B.sort_values(B.columns.tolist()).reset_index(drop=True)).
This is the part I think I failed at explaining last time. Look at
AtB.loc[:2, :2]
Suppose the value in the (0, 0) position is 951. That means the first group of 10 ones and zeros in the first row of A matches the row in B with index 951. That's what you want!!! Funny thing is, I never looked at B. You know why? B is irrelevant!!! It's just a goofy way of representing the numbers from 0 to 1023. This is the difference with my answer: I'm ignoring B, and skipping this useless step should save time.
These are all functions that take two dataframes A and B and return a dataframe of indices where A matches B. Spoiler alert: I'll ignore B completely.
def FindAinB(A, B):
    assert A.shape[1] % 10 == 0, 'Number of columns in A is not a multiple of 10'
    rng = np.arange(A.shape[1])
    A.columns = pd.MultiIndex.from_product([range(A.shape[1] // 10), range(10)])
    twos = 2 ** np.arange(10)
    return A.stack(0).dot(twos).unstack()

def FindAinB2(A, B):
    assert A.shape[1] % 10 == 0, 'Number of columns in A is not a multiple of 10'
    rng = np.arange(A.shape[1])
    A.columns = pd.MultiIndex.from_product([range(A.shape[1] // 10), range(10)])
    # use clever bit shifting instead of dot product with powers
    # questionable improvement
    return (A.stack(0) << np.arange(10)).sum(1).unstack()
I'm channelling my inner @Divakar (read: this is stuff I've learned from Divakar)
def FindAinB3(A, B):
    assert A.shape[1] % 10 == 0, 'Number of columns in A is not a multiple of 10'
    a = A.values.reshape(-1, 10)
    a = np.einsum('ij->i', a << np.arange(10))
    return pd.DataFrame(a.reshape(A.shape[0], -1), A.index)
Minimalist One Liner
f = lambda A: pd.DataFrame(np.einsum('ij->i', A.values.reshape(-1, 10) << np.arange(10)).reshape(A.shape[0], -1), A.index)
Use it like
f(A)
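As a quick sanity check of the one-liner on a tiny frame: with 2 rows and 2 groups of 10 columns, each output cell is the 0-1023 code of one group (the first column of a group acts as the least significant bit):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
A_demo = pd.DataFrame(rng.integers(0, 2, size=(2, 20)))

f = lambda A: pd.DataFrame(
    np.einsum('ij->i', A.values.reshape(-1, 10) << np.arange(10)).reshape(A.shape[0], -1),
    A.index)

print(f(A_demo))  # 2 x 2 frame of integers in [0, 1023]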
Timing
FindAinB3 is an order of magnitude faster
