construct sparse matrix using categorical data - python

I have data that looks something like this:
numpy array:
[[a, abc],
[b, def],
[c, ghi],
[d, abc],
[a, ghi],
[e, fg],
[f, f76],
[b, f76]]
It's like a user-item matrix.
I want to construct a sparse matrix with shape (number_of_items, number_of_users) that contains 1 if the user has rated/bought an item and 0 if they haven't. So, for the above example, the shape should be (5, 6). This is just an example; there are thousands of users and thousands of items.
Currently I'm doing this using two for loops. Is there any faster/pythonic way of achieving the same?
desired output:
1,0,0,1,0,0
0,1,0,0,0,0
1,0,1,0,0,0
0,0,0,0,1,0
0,1,0,0,0,1
where rows: abc,def,ghi,fg,f76
and columns: a,b,c,d,e,f

The easiest way is to assign integer labels to the users and items and use these as coordinates into the sparse matrix, for example:
import numpy as np
from scipy import sparse

# user_item is the (N, 2) array of [user, item] pairs from the question
users, I = np.unique(user_item[:, 0], return_inverse=True)
items, J = np.unique(user_item[:, 1], return_inverse=True)

points = np.ones(len(user_item), int)
# items as rows, users as columns, giving the desired (5, 6) shape
mat = sparse.coo_matrix((points, (J, I)))
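As a quick check (a minimal sketch, assuming user_item is the array from the question), you can densify the small example and compare it with the desired output. Note that np.unique sorts the labels, so the rows come out in the order abc, def, f76, fg, ghi rather than the order given in the question:

print(users)          # ['a' 'b' 'c' 'd' 'e' 'f']        -> column labels
print(items)          # ['abc' 'def' 'f76' 'fg' 'ghi']   -> row labels
print(mat.toarray())
# [[1 0 0 1 0 0]
#  [0 1 0 0 0 0]
#  [0 1 0 0 0 1]
#  [0 0 0 0 1 0]
#  [1 0 1 0 0 0]]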

pandas.get_dummies provides an easy way to convert categorical columns into indicator (dummy) columns:
import pandas as pd
#construct the data
x = pd.DataFrame([['a', 'abc'], ['b', 'def'], ['c', 'ghi'],
                  ['d', 'abc'], ['a', 'ghi'], ['e', 'fg'],
                  ['f', 'f76'], ['b', 'f76']],
                 columns=['user', 'item'])
print(x)
# user item
# 0 a abc
# 1 b def
# 2 c ghi
# 3 d abc
# 4 a ghi
# 5 e fg
# 6 f f76
# 7 b f76
for col, col_data in x.items():   # iteritems() was removed in pandas 2.0
    if str(col) == 'item':
        col_data = pd.get_dummies(col_data, prefix=col)
        x = x.join(col_data)
print(x)
# user item item_abc item_def item_f76 item_fg item_ghi
# 0 a abc 1 0 0 0 0
# 1 b def 0 1 0 0 0
# 2 c ghi 0 0 0 0 1
# 3 d abc 1 0 0 0 0
# 4 a ghi 0 0 0 0 1
# 5 e fg 0 0 0 1 0
# 6 f f76 0 0 1 0 0
# 7 b f76 0 0 1 0 0
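Note that get_dummies here one-hot encodes the item column row by row rather than building the item x user matrix the question asks for. A minimal sketch of a more direct route (my addition, not part of the original answer) uses pd.crosstab and converts the result to a scipy sparse matrix:

from scipy import sparse

# x is the DataFrame with 'user' and 'item' columns built above
table = pd.crosstab(x['item'], x['user'])   # items as rows, users as columns
table = table.clip(upper=1)                 # in case a user bought the same item more than once
mat = sparse.csr_matrix(table.values)       # sparse matrix of shape (5, 6)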

Here's what I could come up with:
You need to be careful since np.unique will sort the values before returning them, so the row/column order is slightly different from the one you gave in the question.
Moreover, you need to convert the array to a list of tuples: itertools.product yields tuples, and a tuple never compares equal to a list (or to a NumPy row), so a membership test like ('a', 'abc') in A would otherwise always be False.
import itertools as it
import numpy as np

A = np.array([
    ['a', 'abc'],
    ['b', 'def'],
    ['c', 'ghi'],
    ['d', 'abc'],
    ['a', 'ghi'],
    ['e', 'fg'],
    ['f', 'f76'],
    ['b', 'f76']])

customers = np.unique(A[:, 0])   # ['a' 'b' 'c' 'd' 'e' 'f']
items = np.unique(A[:, 1])       # ['abc' 'def' 'f76' 'fg' 'ghi']

A = [tuple(a) for a in A]
combinations = it.product(customers, items)
C = np.array([b in A for b in combinations], dtype=int)
C = C.reshape((customers.size, items.size)).T   # items as rows, customers as columns
C
>> array([[1, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 0],
       [1, 0, 1, 0, 0, 0]])

Here is my approach using pandas; let me know if it performs better:
import numpy as np
import pandas as pd

# create dataframe from your numpy array
x = pd.DataFrame(x, columns=['User', 'Item'])

# get rows and cols for your sparse dataframe
cols = pd.unique(x['User']); ncols = cols.shape[0]
rows = pd.unique(x['Item']); nrows = rows.shape[0]

# initialize your dataframe
# (this is not sparse, but you can check pandas' support for sparse datatypes)
spdf = pd.DataFrame(np.zeros((nrows, ncols)), columns=cols, index=rows)

# define apply function (.ix is gone; use .loc)
def hasUser(xx):
    spdf.loc[xx.name, xx] = 1

# groupby and apply to create desired output dataframe
g = x.groupby(by='Item', sort=False)
g['User'].apply(lambda xx: hasUser(xx))
Here are the sample dataframes for the above code:
spdf
Out[71]:
a b c d e f
abc 1 0 0 1 0 0
def 0 1 0 0 0 0
ghi 1 0 1 0 0 0
fg 0 0 0 0 1 0
f76 0 1 0 0 0 1
x
Out[72]:
User Item
0 a abc
1 b def
2 c ghi
3 d abc
4 a ghi
5 e fg
6 f f76
7 b f76
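If you prefer to avoid mutating a global DataFrame inside apply, a short equivalent sketch (my addition, using this answer's column names) builds the same table with pivot_table; rows and columns come out sorted by label:

spdf = (x.assign(flag=1)
         .pivot_table(index='Item', columns='User', values='flag',
                      aggfunc='max', fill_value=0))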
Also, in case you want to parallelize the groupby-apply execution, this question might be of help:
Parallelize apply after pandas groupby

Related

assign a shorter column to a longer column, given a sliding window

I have this dataframe df1 of 8 rows:
ID
A
B
C
D
E
F
G
H
And I have this array arr of size 4, [-1, 0, 1, 2], and m = 2; I want to assign the values of this array to df1, each value repeated m times, so that I eventually have:
ID N
A -1
B -1
C 0
D 0
E 1
F 1
G 2
H 2
How to do that in Python?
df1=pd.DataFrame({'ID':['A','B', 'C', 'D', 'E', 'F', 'G', 'H']})
arr=[-1,0,1,2]
m=2
If arr should be repeated again and again:
df1['N']=(arr*m)[:len(df1)]
Result:
   ID  N
0   A -1
1   B  0
2   C  1
3   D  2
4   E -1
If each element should be repeated:
df1['N']=[arr[i] for i in range(len(arr)) for j in range(m)][:len(df1)]
Result:
   ID  N
0   A -1
1   B -1
2   C  0
3   D  0
4   E  1
Without NumPy:
arr = [-1, 0, 1, 2]
m = 2
df1["N"] = sum([[x]*m for x in arr], [])
With NumPy:
import numpy as np
arr = [-1, 0, 1, 2]
m = 2
df1["N"] = np.repeat(arr, m)
Output:
ID N
0 A -1
1 B -1
2 C 0
3 D 0
4 E 1
5 F 1
6 G 2
7 H 2
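If len(df1) does not equal m * len(arr), the sum(...) and np.repeat(...) assignments above raise a length mismatch. A small sketch (my addition) that cycles the repeated values to the required length uses np.resize:

import numpy as np

# np.repeat(arr, m) -> [-1, -1, 0, 0, 1, 1, 2, 2]; np.resize cycles it to len(df1)
df1["N"] = np.resize(np.repeat(arr, m), len(df1))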

Extract dataframe from dataframe based on column value with extended bounds

I have the following dataframe:
NAME  SIGNAL
a     0
b     0
c     0
d     0
e     1
f     1
g     1
h     0
i     0
j     0
k     0
l     0
m     0
n     1
o     1
p     1
q     1
r     0
s     0
t     0
I need to write a function that will allow me to extract another dataframe, or just modify the existing frame based on a condition:
Get all columns (in my case NAME) for the rows where the SIGNAL column is 1, but also extract 2 extra rows from above and 2 extra rows from below.
In my example, the function should return me the following table:
NAME  SIGNAL
c     0
d     0
e     1
f     1
g     1
h     0
i     0
j     0
l     0
m     0
n     1
o     1
p     1
q     1
r     0
s     0
Thanks!
UPDATE:
This is the code I have so far:
# Import pandas library
import pandas as pd
# initialize list of lists
data = [['a', 0], ['b', 0], ['c', 1], ['d', 1], ['e', 0], ['f', 0], ['g', 0], ['h', 1], ['i', 0], ['j', 0], ['k', 0]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['NAME', 'SIGNAL'])
# print dataframe.
print(df)
print("----------------")
for index, row in df.iterrows():
    #print(row['Name'], row['Age'])
    if((df.iloc[index]['SIGNAL'] == 1) & (df.iloc[index-1]['SIGNAL'] == 0)): # check when the signal changes from 0 to 1
        print(df.iloc[index]['NAME'])    # first line with signal 1 after it was 0
        # print the above 2 lines
        print(df.iloc[index-1]['NAME'])
        print(df.iloc[index-2]['NAME'])
My dataframe is like:
NAME SIGNAL
0 a 0
1 b 0
2 c 1
3 d 1
4 e 0
5 f 0
6 g 0
7 h 1
8 i 0
9 j 0
10 k 0
My code is returning:
c
b
a
h
g
f
The problem here is that I cannot return the values of "d" and "e" + "f", or "i" and "j", because I get the error "IndexError: single positional indexer is out-of-bounds" if I try the condition:
(df.iloc[index]['SIGNAL'] == 1) & (df.iloc[index+1]['SIGNAL'] == 0)
Also, the extended bounds will be variable: sometimes I will work with 2 extra rows from the top and bottom, sometimes with more.
I am looking for a solution using dataframes functions and not iteration.
thanks!
This will return the desired data frame:
df[(df.shift(periods=-2, axis="rows").SIGNAL == 1)
   | (df.shift(periods=-1, axis="rows").SIGNAL == 1)
   | (df.SIGNAL == 1)
   | (df.shift(periods=1, axis="rows").SIGNAL == 1)
   | (df.shift(periods=2, axis="rows").SIGNAL == 1)]
Output:
   NAME  SIGNAL
2     c       0
3     d       0
4     e       1
5     f       1
6     g       1
7     h       0
8     i       0
11    l       0
12    m       0
13    n       1
14    o       1
15    p       1
16    q       1
17    r       0
18    s       0
Add .NAME to the end to get your series of names
2 c
3 d
4 e
5 f
6 g
7 h
8 i
11 l
12 m
13 n
14 o
15 p
16 q
17 r
18 s
Name: NAME, dtype: object
Update: for arbitrarily large span
m = (df.shift(periods=-400, axis="rows").SIGNAL == 1)
for i in list(range(-399, 401)):
    m = m | (df.shift(periods=i, axis="rows").SIGNAL == 1)
print(df[m])
Disclaimer:
This method may be inefficient for large spans
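A vectorized alternative for a variable span (a sketch of my own, not from the original answer): a centered rolling maximum over a window of 2*span + 1 rows keeps every row that lies within span rows of a SIGNAL == 1 row.

def extract_with_context(df, span=2):
    # max over the window [i - span, i + span] is 1 exactly when a SIGNAL == 1 row is nearby
    near_signal = (df['SIGNAL']
                   .rolling(window=2 * span + 1, center=True, min_periods=1)
                   .max()
                   .eq(1))
    return df[near_signal]

# extract_with_context(df, span=2) reproduces the shift-based result above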

Checking if a list is a subset of another in a pandas Dataframe

So, I have this DataFrame with almost 3 thousand rows that looks something like this:
CITIES
0 ['A','B']
1 ['A','B','C','D']
2 ['A','B','C']
4 ['X']
5 ['X','Y','Z']
... ...
2670 ['Y','Z']
I would like to remove from the DF all rows whose 'CITIES' list is contained in another row (the order does not matter). In the example above, I would like to remove 0 and 2, since both are contained in 1, and also remove 4 and 2670, since both are contained in 5. I tried something that kind of worked, but it was really clumsy and took almost 10 minutes to compute; this was it:
indexesToRemove = []
for index, row in entrada.iterrows():
    citiesListFixed = row['CITIES']
    for index2, row2 in entrada.iloc[index+1:].iterrows():
        citiesListCurrent = row2['CITIES']
        if set(citiesListFixed) <= set(citiesListCurrent):
            indexesToRemove.append(index)
            break
Is there a more efficient way to do this?
First create the DataFrame of dummies, and then we can use matrix multiplication to check whether one row is a complete subset of another row: the product with another row sums to the number of elements in that row exactly when it is a subset. (This is going to be memory intensive.)
import pandas as pd
import numpy as np
df = pd.DataFrame({'Cities': [['A','B'], ['A','B','C','D'], ['A','B','C'],
['X'], ['X','Y','Z'], ['Y','Z']]})
# .max(level=0) was removed in newer pandas; group by the index level instead
arr = pd.get_dummies(df['Cities'].explode(), dtype=int).groupby(level=0).max().to_numpy()
#[[1 1 0 0 0 0 0]
# [1 1 1 1 0 0 0]
# [1 1 1 0 0 0 0]
# [0 0 0 0 1 0 0]
# [0 0 0 0 1 1 1]
# [0 0 0 0 0 1 1]]
subsets = np.matmul(arr, arr.T)
np.fill_diagonal(subsets, 0) # So same row doesn't exclude itself
mask = ~np.equal(subsets, np.sum(arr, 1)).any(0)
df[mask]
# Cities
#1 [A, B, C, D]
#4 [X, Y, Z]
As it stands, if you have two rows which tie for the longest subset (i.e. two rows with ['A','B','C','D']), both are dropped. If this is not desired, you can first drop_duplicates on 'Cities' (you will need to convert to a hashable type like frozenset) and then apply the above.
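A minimal sketch of that pre-step (my addition; the key is just a hashable copy of each list):

# frozenset gives a hashable, order-insensitive representation of each city list
key = df['Cities'].map(frozenset)
df = df[~key.duplicated()]   # keep only the first occurrence of each duplicated list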
A possible and didactic approach would be the following:
import pandas as pd
import numpy as np
import time

start = time.process_time()

lst1 = [0, 1, 2, 3, 4, 2670]
lst2 = [['A','B'], ['A','B','C','D'], ['A','B','C'], ['X'], ['X','Y','Z'], ['Y','Z']]
df = pd.DataFrame(list(zip(lst1, lst2)), columns=['id', 'cities'])
df['contained'] = 0

n = df.shape[0]
for i in range(n):
    for j in range(n):
        if i != j:
            if (set(df.loc[i, 'cities']) & set(df.loc[j, 'cities'])) == set(df.loc[i, 'cities']):
                df.loc[i, 'contained'] = 1

print(df)
print("\nTime elapsed:", time.process_time() - start, "seconds")
The time complexity of this solution is O(n²), since a set comparison is done for every ordered pair of rows.
You end up with this data frame as a result:
id cities contained
0 0 [A, B] 1
1 1 [A, B, C, D] 0
2 2 [A, B, C] 1
3 3 [X] 1
4 4 [X, Y, Z] 0
5 2670 [Y, Z] 1
Then you just have to exclude the rows where contained == 1.

Pandas: create column based on first off occurrences in column B after signal in column A

I have column A with a signal-on flag (== 1) and column B with a signal-off flag (== 1); the rest of the values are zero.
data = {'A': [1, 0, 0, 0, 0, 1, 0],
'B': [1, 0, 1, 1, 0, 0, 1]}
df = pd.DataFrame.from_dict(data)
I need to create a column C where:
when A == 1 (whatever B is on that row), C = 1;
C stays 1 until the next row where B == 1, and from that row on C = 0 (until the next A == 1).
Here what the result should be:
df['C'] = [1, 1, 0, 0, 0, 1, 0]
I used
df.loc[df['A'] == 1, 'C'] = 1
to set C to 1 on the rows where A == 1, but I cannot find a way to locate the first non-zero value in column B after the signal in A, and fill with zeros until the next 1 in A.
You can do this with mask and transform('idxmax'). The mask here sets B to 0 wherever A equals 1, since no matter what B is on that row, C will be 1.
df['C']=(df.index<df.B.mask(df.A.eq(1),0).groupby(df.A.cumsum()).transform('idxmax')).astype(int)
df
A B C
0 1 1 1
1 0 0 1
2 0 1 0
3 0 1 0
4 0 0 0
5 1 0 1
6 0 1 0
Update
s=df.B.mask(df.A.eq(1),0)
s=(s==1)&(s.shift(-1)==0)
df['C']=(df.index<s.groupby(df.A.cumsum()).transform('idxmax')).astype(int)
df.loc[df.A==1,'C']=1
Hello and welcome to stackoverflow.
This is a case you usually wouldn't use pandas for, as the value of C depends on previous rows, and pandas is more about applying "split-apply-combine" to independent measurements.
If it is not runtime-critical I would probably write a plain old loop for this:
In [4]: C = []
...: signal = 0
...: for _, row in df.iterrows():
...: if ((signal == 1) and (row.B == 1)):
...: signal = 0
...: elif(row.A == 1):
...: signal = 1
...: C.append(signal)
...:
In [5]: C
Out[5]: [1, 1, 0, 0, 0, 1, 0]
In [6]: df['C'] = C
In [7]: df
Out[7]:
A B C
0 1 1 1
1 0 0 1
2 0 1 0
3 0 1 0
4 0 0 0
5 1 0 1
6 0 1 0
This won't have a good performance, but imho it is worth it to cleanly express the intent of your code if it is still "fast enough".
A solution based on iterrows (as proposed in one of the other answers)
may be too slow.
Define the following function computing the output signal for a group
of input rows (starting on each case of A == 1):
def signal(grp):
    return pd.Series(np.equal(np.where(grp.A == 1, 0, grp.B)
                              .cumsum(), 0).astype(int), index=grp.index)
Then group df and apply this function:
df['C'] = df.groupby(df.A.cumsum()).apply(signal)\
.reset_index(level=0, drop=True)
Edit
Yet faster solution, without grouping, is:
sig = df.A.replace(0, np.nan)
sig.update(df.A.lt(df.B).astype(int).replace(0, np.nan) - 1)
df['C'] = sig.ffill().fillna(0, downcast='infer')
For a sample of 7000 rows (your data repeated 1000 times) the execution
time of this solution is 14 times shorter than the solution by YOBEN_S.

Create a matrix with headers and inserting values by calling the headers

Is it possible in Python to create a zero matrix with headers and insert values by referring to the column header and row header? For example:
# A B C
d 0 0 0
e 0 0 0
f 0 0 0
Then I write, for example, matrix('A', 'd') = 1 and matrix('B', 'e') = 3 and get:
# A B C
d 1 0 0
e 0 3 0
f 0 0 0
And when I am done I want to be able to save it to csv.
I hope someone can help me, because I have no idea if this is possible.
this would come close to your need:
rows = dict((x, y) for y, x in enumerate('ABC'))  # rows' headers
cols = dict((x, y) for y, x in enumerate('def'))  # cols' headers
len_cols = 3
len_rows = 3
matrix = [[0 for i in range(len_cols)] for j in range(len_rows)]

def update(row, col, val):
    matrix[rows[row]][cols[col]] = val
examples:
>>> cols = dict((x,y) for y,x in enumerate('de'))
>>> rows = dict((x,y) for y,x in enumerate('ABCD'))
>>> len_cols = 2
>>> len_rows = 4
>>> matrix = [[0 for i in range(len_cols)] for j in range(len_rows)]
>>> update('B','e',1)
>>> update('D','d',3)
>>> matrix
[[0, 0], [0, 1], [0, 0], [3, 0]]
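An alternative sketch (my addition, not part of the original answer) that also covers the save-to-CSV part of the question: a pandas DataFrame with labelled index and columns is exactly this kind of header-addressed zero matrix.

import pandas as pd

# zero matrix with row headers d, e, f and column headers A, B, C
matrix = pd.DataFrame(0, index=list('def'), columns=list('ABC'))

matrix.loc['d', 'A'] = 1   # the question's matrix('A', 'd') = 1
matrix.loc['e', 'B'] = 3   # the question's matrix('B', 'e') = 3

matrix.to_csv('matrix.csv')   # writes the headers along with the values
print(matrix)
#    A  B  C
# d  1  0  0
# e  0  3  0
# f  0  0  0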
