Efficiently create sparse pivot tables in pandas? - python

I'm working on turning a list of records with two columns (A and B) into a matrix representation. I have been using the pivot function within pandas, but the result ends up being fairly large. Does pandas support pivoting into a sparse format? I know I can pivot it and then turn it into some kind of sparse representation, but that isn't as elegant as I would like. My end goal is to use it as the input for a predictive model.
Alternatively, is there some kind of sparse pivot capability outside of pandas?
edit: here is an example of a non-sparse pivot
import pandas as pd
frame=pd.DataFrame()
frame['person']=['me','you','him','you','him','me']
frame['thing']=['a','a','b','c','d','d']
frame['count']=[1,1,1,1,1,1]
frame
person thing count
0 me a 1
1 you a 1
2 him b 1
3 you c 1
4 him d 1
5 me d 1
frame.pivot(index='person', columns='thing')
count
thing a b c d
person
him NaN 1 NaN 1
me 1 NaN NaN 1
you 1 NaN 1 NaN
This creates a matrix that could contain all possible combinations of persons and things, but it is not sparse.
http://docs.scipy.org/doc/scipy/reference/sparse.html
Sparse matrices take up less space because they can imply things like NaN or 0. If I have a very large data set, this pivoting function can generate a matrix that should be sparse due to the large number of NaNs or 0s. I was hoping that I could save a lot of space/memory by generating something that was sparse right off the bat rather than creating a dense matrix and then converting it to sparse.

Here is a method that creates a sparse scipy matrix based on data and indices of person and thing. person_u and thing_u are lists representing the unique entries for your rows and columns of pivot you want to create. Note: this assumes that your count column already has the value you want in it.
from scipy.sparse import csr_matrix
person_u = list(sort(frame.person.unique()))
thing_u = list(sort(frame.thing.unique()))
data = frame['count'].tolist()
row = frame.person.astype('category', categories=person_u).cat.codes
col = frame.thing.astype('category', categories=thing_u).cat.codes
sparse_matrix = csr_matrix((data, (row, col)), shape=(len(person_u), len(thing_u)))
>>> sparse_matrix
<3x4 sparse matrix of type '<type 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
>>> sparse_matrix.todense()
matrix([[0, 1, 0, 1],
[1, 0, 0, 1],
[1, 0, 1, 0]])
Based on your original question, the scipy sparse matrix should be sufficient for your needs, but should you wish to have a sparse dataframe you can do the following:
dfs=pd.SparseDataFrame([ pd.SparseSeries(sparse_matrix[i].toarray().ravel(), fill_value=0)
for i in np.arange(sparse_matrix.shape[0]) ], index=person_u, columns=thing_u, default_fill_value=0)
>>> dfs
a b c d
him 0 1 0 1
me 1 0 0 1
you 1 0 1 0
>>> type(dfs)
pandas.sparse.frame.SparseDataFrame

The answer posted previously by @khammel was useful, but unfortunately no longer works due to changes in pandas and Python. The following should produce the same output:
from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype
person_c = CategoricalDtype(sorted(frame.person.unique()), ordered=True)
thing_c = CategoricalDtype(sorted(frame.thing.unique()), ordered=True)
row = frame.person.astype(person_c).cat.codes
col = frame.thing.astype(thing_c).cat.codes
sparse_matrix = csr_matrix((frame["count"], (row, col)), \
shape=(person_c.categories.size, thing_c.categories.size))
>>> sparse_matrix
<3x4 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
>>> sparse_matrix.todense()
matrix([[0, 1, 0, 1],
[1, 0, 0, 1],
[1, 0, 1, 0]], dtype=int64)
dfs = pd.SparseDataFrame(sparse_matrix, \
index=person_c.categories, \
columns=thing_c.categories, \
default_fill_value=0)
>>> dfs
a b c d
him 0 1 0 1
me 1 0 0 1
you 1 0 1 0
The main changes were:
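.astype() no longer accepts "category" together with a categories keyword. You have to create a CategoricalDtype object and pass that instead.
sort() is not defined in a plain Python session, so the built-in sorted() is used instead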
Other changes were more superficial:
using the category sizes instead of a length of the uniqued Series objects, just because I didn't want to make another object unnecessarily
the data input for the csr_matrix (frame["count"]) doesn't need to be a list object
pandas SparseDataFrame accepts a scipy.sparse object directly now
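The row and column codes can also be obtained with pd.factorize, which returns the integer codes and the sorted unique labels in one call. The following is only a sketch of that alternative: the helper name sparse_pivot is made up for illustration, and it assumes the key columns contain no missing values (factorize would code those as -1).

```python
import pandas as pd
from scipy.sparse import csr_matrix


def sparse_pivot(frame, index, columns, values):
    """Hypothetical helper: pivot `frame` into a SciPy CSR matrix."""
    # factorize(sort=True) returns integer codes plus the sorted unique labels
    row_codes, row_labels = pd.factorize(frame[index], sort=True)
    col_codes, col_labels = pd.factorize(frame[columns], sort=True)
    matrix = csr_matrix(
        (frame[values].to_numpy(), (row_codes, col_codes)),
        shape=(len(row_labels), len(col_labels)),
    )
    return matrix, row_labels, col_labels


frame = pd.DataFrame({
    "person": ["me", "you", "him", "you", "him", "me"],
    "thing": ["a", "a", "b", "c", "d", "d"],
    "count": [1, 1, 1, 1, 1, 1],
})
matrix, people, things = sparse_pivot(frame, "person", "thing", "count")
print(list(people), list(things))
print(matrix.toarray())
```

One thing to keep in mind with this construction: csr_matrix adds together values that share the same (row, col) pair, so repeated person/thing records end up summed rather than raising an error as pivot would.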

I had a similar problem and stumbled over this post. The only difference was that I had two columns in the DataFrame that define the "row dimension" (i) of the output matrix. I thought this might be an interesting generalisation, so here is a version that uses the groupby grouper:
# function
import pandas as pd
from scipy.sparse import csr_matrix
def df_to_sm(data, vars_i, vars_j):
    grpr_i = data.groupby(vars_i).grouper
    idx_i = grpr_i.group_info[0]
    grpr_j = data.groupby(vars_j).grouper
    idx_j = grpr_j.group_info[0]
    data_sm = csr_matrix((data['val'].values, (idx_i, idx_j)),
                         shape=(grpr_i.ngroups, grpr_j.ngroups))
    return data_sm, grpr_i, grpr_j
# example
data = pd.DataFrame({'var_i_1' : ['a1', 'a1', 'a1', 'a2', 'a2', 'a3'],
'var_i_2' : ['b2', 'b1', 'b1', 'b1', 'b1', 'b4'],
'var_j_1' : ['c2', 'c3', 'c2', 'c1', 'c2', 'c3'],
'val' : [1, 2, 3, 4, 5, 6]})
data_sm, _, _ = df_to_sm(data, ['var_i_1', 'var_i_2'], ['var_j_1'])
data_sm.todense()
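The .grouper attribute is an internal detail and may emit deprecation warnings in recent pandas versions. A sketch of the same idea using only the public groupby(...).ngroup() method could look like this. The function name df_to_sm_public and the value parameter are made up for illustration, and the code assumes non-empty data with no missing values in the key columns.

```python
import pandas as pd
from scipy.sparse import csr_matrix


def df_to_sm_public(data, vars_i, vars_j, value="val"):
    """Hypothetical variant of df_to_sm that avoids the .grouper attribute."""
    # ngroup() labels every row with the number of its group (0 .. ngroups-1);
    # groups are numbered in sorted key order because groupby sorts by default
    idx_i = data.groupby(vars_i).ngroup().to_numpy()
    idx_j = data.groupby(vars_j).ngroup().to_numpy()
    shape = (idx_i.max() + 1, idx_j.max() + 1)
    return csr_matrix((data[value].to_numpy(), (idx_i, idx_j)), shape=shape)


data = pd.DataFrame({'var_i_1': ['a1', 'a1', 'a1', 'a2', 'a2', 'a3'],
                     'var_i_2': ['b2', 'b1', 'b1', 'b1', 'b1', 'b4'],
                     'var_j_1': ['c2', 'c3', 'c2', 'c1', 'c2', 'c3'],
                     'val': [1, 2, 3, 4, 5, 6]})
sm = df_to_sm_public(data, ['var_i_1', 'var_i_2'], ['var_j_1'])
print(sm.toarray())
```

Unlike the grouper-based version, this sketch returns only the matrix; the row and column labels would have to be recovered separately, for example from the sorted unique key combinations.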

Related

Multiplying each row of a pandas dataframe by another row dataframe

So I want to multiply each row of a dataframe with a multiplier vector, and I am managing, but it looks ugly. Can this be improved?
import pandas as pd
import numpy as np
# original data
df_a = pd.DataFrame([[1,2,3],[4,5,6]])
print(df_a, '\n')
# multiplier vector
df_b = pd.DataFrame([2,2,1])
print(df_b, '\n')
# multiply by a list - it works
df_c = df_a*[2,2,1]
print(df_c, '\n')
# multiply by the dataframe - it works
df_c = df_a*df_b.T.to_numpy()
print(df_c, '\n')
"It looks ugly" is subjective. That said, if you want to multiply all rows of a dataframe by something else, you need either:
a dataframe of a compatible shape (and compatible indices, as those are aligned before operations in pandas, which is why df_a*df_b.T would only work for the common index: 0)
a 1D vector, which in pandas is a Series
Using a Series:
df_a*df_b[0]
output:
0 1 2
0 2 4 3
1 8 10 6
Of course, it is better to define a Series directly if you don't really need a 2D container:
s = pd.Series([2,2,1])
df_a*s
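To make the alignment point concrete, here is a small sketch comparing the two cases with the same data. The variable names aligned and broadcast are illustrative only.

```python
import pandas as pd

df_a = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
df_b = pd.DataFrame([2, 2, 1])
s = df_b[0]  # Series with index 0, 1, 2, matching df_a's column labels

# DataFrame * DataFrame aligns on both index and columns: df_b.T only has
# row label 0, so row 1 of the result has no partner and becomes NaN
aligned = df_a * df_b.T
print(aligned)

# DataFrame * Series aligns the Series index with the columns and
# broadcasts it down every row; .mul(..., axis="columns") spells this out
broadcast = df_a.mul(s, axis="columns")
print(broadcast)
```

The explicit .mul form behaves like df_a*s here; it is mainly useful when the direction of the broadcast should be visible in the code.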
Just for the beauty, you can use Einstein summation:
>>> np.einsum('ij,ji->ij', df_a, df_b)
array([[ 2, 4, 3],
[ 8, 10, 6]])

Conversion of pandas dataframe to sparse key-item matrix with composite key

I have a data frame of 3 columns. Col 1 is a string order number, Col 2 is an integer day, and Col 3 is a product name.
I would like to convert this into a matrix where each row represents a unique order/day combination, and each column represents a 1/0 for the presence of a product name for that combination.
My approach so far makes use of a product dictionary, and a dictionary with a composite key of order # & day.
The final step, which iterates through the original dataframe in order to flip the bits in the matrix to 1s is sloooow. Like 10 minutes for a matrix the size of 363K X 331 and a sparseness of ~97%.
Is there a different approach I should consider?
E.g.,
ord_nb day prod
1 1 A
1 1 B
1 2 B
1 2 C
1 2 D
would become
A B C D
1 1 0 0
0 1 1 1
My approach has been to create a dictionary of order/day pairs:
ord_day_dict = {}
print("Making a dictionary of ord-by-day keys...")
gp = df.groupby(['day', 'ord_nb'])
for i, g in enumerate(gp.groups.items()):
    ord_day_dict[g[0][0], g[0][1]] = i
I append the index representation to the original dataframe:
df['ord_day_idx'] = 0 #Create a place holder column
for i, row in df.iterrows(): #populate the column with the index
    df.at[i, 'ord_day_idx'] = ord_day_dict[(row['day'], row['ord_nb'])]
I then initialize a matrix the size of my ord/day X unique products:
n_items = df['prod'].unique().shape[0] #unique number of products
n_ord_days = len(ord_day_dict) #unique number of ord-by-day combos
df_fac_matrix = np.zeros((n_ord_days, n_items), dtype=np.float64)#-1)
I convert my products from strings into an index via a dictionary:
prod_dict = dict()
i = 0
for v in df['prod']: #df.prod would refer to the DataFrame.prod method, not the column
    if v not in prod_dict:
        prod_dict[v] = i
        i = i + 1
And finally iterate through the original dataframe to populate the matrix with 1s where a specific order on a specific day included a specific product.
for line in df.itertuples():
    #line is (index, ord_nb, day, prod, ord_day_idx); mark a 1 in the order-by-day index row and the product index column of our ord/day-by-prod matrix
    df_fac_matrix[line[4], prod_dict[line[3]]] = 1.0
Here is one option you can try:
df.groupby(['ord_nb', 'day'])['prod'].apply(list).apply(lambda x: pd.Series(1, x)).fillna(0)
# A B C D
#ord_nb day
# 1 1 1.0 1.0 0.0 0.0
# 2 0.0 1.0 1.0 1.0
Here's a NumPy based approach to have an array as output -
a = df[['ord_nb','day']].values.astype(int)
row = np.unique(np.ravel_multi_index(a.T,a.max(0)+1),return_inverse=1)[1]
col = np.unique(df.prd.values,return_inverse=1)[1]
out_shp = row.max()+1, col.max()+1
out = np.zeros(out_shp, dtype=int)
out[row,col] = 1
Please note that the third column is assumed to be named 'prd' instead of 'prod', to avoid a name conflict with the DataFrame.prod method.
Possible improvements with focus on performance -
If prd has single letter characters only starting from A, we could compute col with simply : df.prd.values.astype('S1').view('uint8')-65.
Alternatively, we could compute row with : np.unique(a[:,0]*(a[:,1].max()+1) + a[:,1],return_inverse=1)[1].
Saving memory with sparse array : For really huge arrays, we could save on memory by storing them as sparse matrices. Thus, the final steps to get such a sparse matrix would be -
from scipy.sparse import coo_matrix
d = np.ones(row.size,dtype=int)
out_sparse = coo_matrix((d,(row,col)), shape=out_shp)
Sample input, output -
In [232]: df
Out[232]:
ord_nb day prd
0 1 1 A
1 1 1 B
2 1 2 B
3 1 2 C
4 1 2 D
In [233]: out
Out[233]:
array([[1, 1, 0, 0],
[0, 1, 1, 1]])
In [241]: out_sparse
Out[241]:
<2x4 sparse matrix of type '<type 'numpy.int64'>'
with 5 stored elements in COOrdinate format>
In [242]: out_sparse.toarray()
Out[242]:
array([[1, 1, 0, 0],
[0, 1, 1, 1]])
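One caveat worth checking on real data: SciPy sums values that share the same (row, col) position when a COO matrix is converted to CSR, so an order/day/product triple that appears more than once would give a count above 1 rather than a 0/1 flag. The sketch below assumes the sample frame from above with one deliberately duplicated record added; the names counts and presence are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

# Sample data with one deliberately repeated record (order 1, day 1, product A)
df = pd.DataFrame({
    "ord_nb": [1, 1, 1, 1, 1, 1],
    "day":    [1, 1, 1, 2, 2, 2],
    "prd":    ["A", "A", "B", "B", "C", "D"],
})

a = df[["ord_nb", "day"]].to_numpy().astype(int)
row = np.unique(np.ravel_multi_index(a.T, a.max(0) + 1), return_inverse=True)[1]
col = np.unique(df["prd"].to_numpy(), return_inverse=True)[1]
shape = (row.max() + 1, col.max() + 1)

# tocsr() sums the duplicate entry, so cell (0, 0) holds 2 here
counts = coo_matrix((np.ones(row.size, dtype=int), (row, col)), shape=shape).tocsr()

presence = counts.copy()
presence.data[:] = 1  # overwrite every stored value to get 0/1 flags
print(counts.toarray())
print(presence.toarray())
```

If the counts themselves are useful as features, the first matrix can be kept as is; the second step is only needed when strict presence/absence is wanted.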

Filter only certain words from sklearn CountVectorizer sparse matrix

I have a pandas Series full of text. Using the CountVectorizer class in the sklearn package, I have calculated the sparse matrix. I have identified the top words as well. Now I want to filter my sparse matrix for only those top words.
The original data contains more than 7000 rows and more than 75000 words, so I am creating sample data here:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
words = pd.Series(['This is first row of the text column',
'This is second row of the text column',
'This is third row of the text column',
'This is fourth row of the text column',
'This is fifth row of the text column'])
count_vec = CountVectorizer(stop_words='english')
sparse_matrix = count_vec.fit_transform(words)
I have created the sparse matrix for all the words in that column. Here, just to print my sparse matrix, I am converting it to an array using the .toarray() function.
print count_vec.get_feature_names()
print sparse_matrix.toarray()
[u'column', u'fifth', u'fourth', u'row', u'second', u'text']
[[1 0 0 1 0 1]
[1 0 0 1 1 1]
[1 0 0 1 0 1]
[1 0 1 1 0 1]
[1 1 0 1 0 1]]
Now I am looking for frequently appearing words using the following
# Get frequency count of all features
features_count = sparse_matrix.sum(axis=0).tolist()[0]
features_names = count_vec.get_feature_names()
features = pd.DataFrame(zip(features_names, features_count),
columns=['features', 'count']
).sort_values(by=['count'], ascending=False)
features count
0 column 5
3 row 5
5 text 5
1 fifth 1
2 fourth 1
4 second 1
From the above result we know that the frequently appearing words are column, row & text. Now I want to filter my sparse matrix for only these words. I don't want to convert my sparse matrix to an array and then filter, because I get a memory error on my original data, since the number of words is quite high.
The only way I was able to get the sparse matrix was to repeat the steps with those specific words using the vocabulary parameter, like this:
countvec_subset = CountVectorizer(vocabulary= ['column', 'text', 'row'])
Instead I am looking for a better solution, where I can filter the sparse matrix directly for those words, instead of creating it again from scratch.
You can slice the sparse matrix directly with sparse_matrix[:, columns]. You only need to derive the column indices for the slice first:
In [56]: feature_count = sparse_matrix.sum(axis=0)
In [57]: columns = tuple(np.where(feature_count == feature_count.max())[1])
In [58]: columns
Out[58]: (0, 3, 5)
In [59]: sparse_matrix[:, columns].toarray()
Out[59]:
array([[1, 1, 1],
[1, 1, 1],
[1, 1, 1],
[1, 1, 1],
[1, 1, 1]], dtype=int64)
In [60]: type(sparse_matrix[:, columns])
Out[60]: scipy.sparse.csr.csr_matrix
In [71]: np.array(features_names)[list(columns)]
Out[71]:
array([u'column', u'row', u'text'],
dtype='<U6')
The sliced subset is still a scipy.sparse.csr.csr_matrix
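To pick the k most frequent columns rather than only those tied for the maximum, the column sums can be ranked without densifying the matrix. This is a minimal sketch under assumptions: the matrix and feature names are re-typed by hand from the sample output above so the block runs without scikit-learn, and k, totals and top_cols are illustrative names.

```python
import numpy as np
from scipy.sparse import csr_matrix

feature_names = np.array(["column", "fifth", "fourth", "row", "second", "text"])
sparse_matrix = csr_matrix(np.array([
    [1, 0, 0, 1, 0, 1],
    [1, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [1, 0, 1, 1, 0, 1],
    [1, 1, 0, 1, 0, 1],
]))

k = 3
# sum(axis=0) returns a 1 x n_features matrix; flatten it to a 1-D array
totals = np.asarray(sparse_matrix.sum(axis=0)).ravel()
# indices of the k largest totals, sorted so the original column order is kept
top_cols = np.sort(np.argsort(totals)[-k:])

subset = sparse_matrix[:, top_cols]  # still a sparse matrix
print(feature_names[top_cols])
print(subset.shape)
```

When several words are tied at the cut-off, which of them make it into the top k is arbitrary with this approach.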

Python - NumPy - deleting multiple rows and columns from an array

Let's say I have a square matrix as input:
array([[0, 1, 1, 0],
[1, 1, 1, 1],
[1, 1, 1, 1],
[0, 1, 1, 0]])
I want to count the nonzeros in the array after removal of rows 2 and 3 and cols 2 and 3. Afterwards I want to do the same for rows 3 and 4 and cols 3 and 4. Hence the output should be:
0 # when removing rows/cols 2 and 3
3 # when removing rows/cols 3 and 4
Here is the naive solution using np.delete:
import numpy as np
a = np.array([[0,1,1,0],[1,1,1,1],[1,1,1,1],[0,1,1,0]])
np.count_nonzero(np.delete(np.delete(a, (1,2), axis=0), (1,2), axis=1))
np.count_nonzero(np.delete(np.delete(a, (2,3), axis=0), (2,3), axis=1))
But np.delete returns a new array. Is there a faster method, which involves deleting rows and columns simultaneously? Can masking be used? The documentation on np.delete reads:
Often it is preferable to use a boolean mask.
How do I go about doing that? Thanks.
Instead of deleting the columns and rows you don't want, it is easier to select the ones you do want. Also note that it is standard to start counting rows and columns from zero. To get your first example, you thus want to select all elements in rows 0 and 3 and in columns 0 and 3. This requires advanced indexing, for which you can use the ix_ utility function:
In [25]: np.count_nonzero(a[np.ix_([0,3], [0,3])])
Out[25]: 0
For your second example, you want to select rows 0 and 1 and columns 0 and 1, which can be done using basic slicing:
In [26]: np.count_nonzero(a[:2,:2])
Out[26]: 3
There is no need to modify your original array by deleting rows/columns in order to count the number of nonzero elements. Simply use indexing:
a = np.array([[0,1,1,0],[1,1,1,1],[1,1,1,1],[0,1,1,0]])
irows, icols = np.indices(a.shape)
mask = (irows!=2)&(irows!=3)&(icols!=2)&(icols!=3)
np.count_nonzero(a[mask])
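Since the np.delete documentation suggests a boolean mask, one way to sketch that is to build a single 1-D keep mask and apply it to both axes through np.ix_. The helper name count_after_removal is hypothetical, the indices are zero-based, and the array is assumed to be square.

```python
import numpy as np


def count_after_removal(a, drop):
    """Hypothetical helper: count nonzeros after dropping the same rows and columns."""
    keep = np.ones(a.shape[0], dtype=bool)
    keep[list(drop)] = False  # mark the rows/columns to discard
    # np.ix_ turns the 1-D masks into an open mesh that selects the kept block
    return np.count_nonzero(a[np.ix_(keep, keep)])


a = np.array([[0, 1, 1, 0], [1, 1, 1, 1], [1, 1, 1, 1], [0, 1, 1, 0]])
print(count_after_removal(a, (1, 2)))  # rows/cols 2 and 3 in 1-based counting
print(count_after_removal(a, (2, 3)))  # rows/cols 3 and 4 in 1-based counting
```

Indexing with np.ix_ still copies the selected block, but only once, instead of the two intermediate arrays created by chained np.delete calls.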

Populate a Pandas SparseDataFrame from a SciPy Sparse Matrix

I noticed Pandas now has support for Sparse Matrices and Arrays. Currently, I create DataFrame()s like this:
return DataFrame(matrix.toarray(), columns=features, index=observations)
Is there a way to create a SparseDataFrame() with a scipy.sparse.csc_matrix() or csr_matrix()? Converting to dense format kills RAM badly. Thanks!
A direct conversion is not supported ATM. Contributions are welcome!
Try this; it should be OK on memory, as the SparseSeries is much like a csc_matrix (for 1 column) and pretty space efficient.
In [36]: row = np.array([0,2,2,0,1,2])
In [37]: col = np.array([0,0,1,2,2,2])
In [38]: data = np.array([1,2,3,4,5,6],dtype='float64')
In [39]: m = csc_matrix( (data,(row,col)), shape=(3,3) )
In [40]: m
Out[40]:
<3x3 sparse matrix of type '<type 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Column format>
In [46]: pd.SparseDataFrame([ pd.SparseSeries(m[i].toarray().ravel())
for i in np.arange(m.shape[0]) ])
Out[46]:
0 1 2
0 1 0 4
1 0 0 5
2 2 3 6
In [47]: df = pd.SparseDataFrame([ pd.SparseSeries(m[i].toarray().ravel())
for i in np.arange(m.shape[0]) ])
In [48]: type(df)
Out[48]: pandas.sparse.frame.SparseDataFrame
As of pandas v 0.20.0 you can use the SparseDataFrame constructor.
An example from the pandas docs:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
arr = np.random.random(size=(1000, 5))
arr[arr < .9] = 0
sp_arr = csr_matrix(arr)
sdf = pd.SparseDataFrame(sp_arr)
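SparseDataFrame and SparseSeries were removed in pandas 1.0. On later versions, a sketch of the equivalent step uses the DataFrame.sparse accessor. This assumes pandas 1.0 or newer; the column labels and the seeded random generator are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
arr = rng.random(size=(1000, 5))
arr[arr < .9] = 0
sp_arr = csr_matrix(arr)

# Build a DataFrame whose columns are backed by sparse arrays
sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr, columns=list("abcde"))
print(sdf.dtypes)
print(sdf.sparse.density)  # fraction of values that are actually stored

# Convert back to a SciPy COO matrix without building a dense array
back = sdf.sparse.to_coo()
```

The result is a regular DataFrame with sparse column dtypes, so most DataFrame operations apply, though some of them may densify the data along the way.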
A much shorter version, though note that it builds a dense DataFrame and therefore does not save memory:
df = pd.DataFrame(m.toarray())
