Populate a Pandas SparseDataFrame from a SciPy Sparse Matrix - python

I noticed Pandas now has support for Sparse Matrices and Arrays. Currently, I create DataFrame()s like this:
return DataFrame(matrix.toarray(), columns=features, index=observations)
Is there a way to create a SparseDataFrame() with a scipy.sparse.csc_matrix() or csr_matrix()? Converting to dense format kills RAM badly. Thanks!

A direct conversion is not supported at the moment. Contributions are welcome!
Try this; it should be OK on memory, as a SparseSeries is much like a csc_matrix (for one column)
and pretty space efficient.
In [36]: row = np.array([0,2,2,0,1,2])
In [37]: col = np.array([0,0,1,2,2,2])
In [38]: data = np.array([1,2,3,4,5,6],dtype='float64')
In [39]: m = csc_matrix( (data,(row,col)), shape=(3,3) )
In [40]: m
Out[40]:
<3x3 sparse matrix of type '<type 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Column format>
In [46]: pd.SparseDataFrame([ pd.SparseSeries(m[i].toarray().ravel())
for i in np.arange(m.shape[0]) ])
Out[46]:
   0  1  2
0  1  0  4
1  0  0  5
2  2  3  6
In [47]: df = pd.SparseDataFrame([ pd.SparseSeries(m[i].toarray().ravel())
for i in np.arange(m.shape[0]) ])
In [48]: type(df)
Out[48]: pandas.sparse.frame.SparseDataFrame

As of pandas v 0.20.0 you can use the SparseDataFrame constructor.
An example from the pandas docs:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
arr = np.random.random(size=(1000, 5))
arr[arr < .9] = 0
sp_arr = csr_matrix(arr)
sdf = pd.SparseDataFrame(sp_arr)
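Note that SparseDataFrame was deprecated in pandas 0.25 and removed in pandas 1.0. On current pandas the equivalent goes through the sparse accessor; a minimal sketch reusing the same sp_arr:

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

arr = np.random.random(size=(1000, 5))
arr[arr < .9] = 0
sp_arr = csr_matrix(arr)

# pandas >= 0.25: each column is backed by a SparseArray, no dense copy is made
sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr)
print(sdf.sparse.density)  # fraction of explicitly stored (non-fill) values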

A much shorter version:
df = pd.DataFrame(m.toarray())
Note that this goes through a dense array first, so it won't help if the dense conversion is what is exhausting your RAM.

Related

Multiplying each row of a pandas dataframe by another row dataframe

So I want to multiply each row of a dataframe with a multiplier vector, and I am managing, but it looks ugly. Can this be improved?
import pandas as pd
import numpy as np
# original data
df_a = pd.DataFrame([[1,2,3],[4,5,6]])
print(df_a, '\n')
# multiplier vector
df_b = pd.DataFrame([2,2,1])
print(df_b, '\n')
# multiply by a list - it works
df_c = df_a*[2,2,1]
print(df_c, '\n')
# multiply by the dataframe - it works
df_c = df_a*df_b.T.to_numpy()
print(df_c, '\n')
"It looks ugly" is subjective, that said, if you want to multiply all rows of a dataframe with something else you either need:
a dataframe of a compatible shape (and compatible indices, as those are aligned before operations in pandas, which is why df_a*df_b.T would only work for the common index: 0)
a 1D vector, which in pandas is a Series
Using a Series:
df_a*df_b[0]
output:
   0   1  2
0  2   4  3
1  8  10  6
Of course, it is better to define a Series directly if you don't really need a 2D container:
s = pd.Series([2,2,1])
df_a*s
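As an aside (not from the original answer): if the factors should instead apply one per row rather than one per column, .mul with an explicit axis controls which way the Series is aligned. A small sketch reusing df_a:

import pandas as pd

df_a = pd.DataFrame([[1, 2, 3], [4, 5, 6]])

# one factor per column (equivalent to df_a * s above)
print(df_a.mul(pd.Series([2, 2, 1]), axis=1))

# one factor per row: align the Series against the index instead
print(df_a.mul(pd.Series([10, 100]), axis=0))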
Just for the beauty, you can use Einstein summation:
>>> np.einsum('ij,ji->ij', df_a, df_b)
array([[ 2,  4,  3],
       [ 8, 10,  6]])

How do numpy functions operate on pandas objects internally?

Numpy functions, e.g. np.mean(), np.var(), etc., accept an array-like argument, like np.array, or list, etc.
But passing in a pandas dataframe also works. This means that a pandas dataframe can indeed disguise itself as a numpy array, which I find a little strange (despite knowing that the underlying values of a df are indeed numpy arrays).
For an object to be array-like, I thought it should be sliceable using integer indexing in the way a numpy array is sliced. So for instance df[1:3, 2:3] should work, but it leads to an error.
So, possibly a dataframe gets converted into a numpy array when it goes inside the function. But if that is the case then why does np.mean(numpy_array) lead to a different result than that of np.mean(df)?
a = np.random.rand(4,2)
a
Out[13]:
array([[ 0.86688862,  0.09682919],
       [ 0.49629578,  0.78263523],
       [ 0.83552411,  0.71907931],
       [ 0.95039642,  0.71795655]])
np.mean(a)
Out[14]: 0.68320065182041034
gives a different result than what the below gives...
df = pd.DataFrame(data=a, index=range(np.shape(a)[0]),
                  columns=range(np.shape(a)[1]))
df
Out[18]:
          0         1
0  0.866889  0.096829
1  0.496296  0.782635
2  0.835524  0.719079
3  0.950396  0.717957
np.mean(df)
Out[21]:
0 0.787276
1 0.579125
dtype: float64
The former output is a single number, whereas the latter is a column-wise mean. How does a numpy function know about the structure of a dataframe?
If you step through this:
--Call--
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2796)mean()
-> def mean(a, axis=None, dtype=None, out=None, keepdims=False):
(Pdb) s
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2877)mean()
-> if type(a) is not mu.ndarray:
(Pdb) s
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2878)mean()
-> try:
(Pdb) s
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2879)mean()
-> mean = a.mean
You can see that, because the type is not an ndarray, it tries to call a.mean, which in this case would be df.mean():
In [6]:
df.mean()
Out[6]:
0 0.572999
1 0.468268
dtype: float64
This is why the output is different.
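To see that dispatch in isolation, here is a minimal sketch with a hypothetical class: np.mean hands off to the object's own .mean method whenever the argument is not an ndarray.

import numpy as np

class HasMean:
    # np.mean passes axis/dtype/out (and possibly keepdims) through to this method
    def mean(self, axis=None, dtype=None, out=None, **kwargs):
        return "called my mean"

print(np.mean(HasMean()))  # -> called my mean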
Code to reproduce above:
In [3]:
a = np.random.rand(4,2)
a
Out[3]:
array([[ 0.96750329,  0.67623187],
       [ 0.44025179,  0.97312747],
       [ 0.07330062,  0.18341157],
       [ 0.81094166,  0.04030253]])
In [4]:
np.mean(a)
Out[4]:
0.52063384885403818
In [5]:
df = pd.DataFrame(data=a, index=range(np.shape(a)[0]),
                  columns=range(np.shape(a)[1]))
df
Out[5]:
          0         1
0  0.967503  0.676232
1  0.440252  0.973127
2  0.073301  0.183412
3  0.810942  0.040303
numpy output:
In [7]:
np.mean(df)
Out[7]:
0 0.572999
1 0.468268
dtype: float64
If you'd called .values to get a NumPy array, then the output is the same:
In [8]:
np.mean(df.values)
Out[8]:
0.52063384885403818
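Many numpy functions that don't defer to a method like this instead convert the argument with np.asarray, which uses the DataFrame's __array__ hook and yields the underlying 2-D values; that is why going through .values (or np.asarray) reproduces the plain-ndarray result. A short sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})

print(np.asarray(df))            # the underlying 2-D ndarray
print(np.mean(np.asarray(df)))   # 2.5, same as np.mean(df.values)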

Vectorized Splitting of String on Character Count using Numpy or Pandas

Is there a way to split a Numpy Array in a vectorized manner based upon character count for each element?
Input:
In [1]: import numpy as np
In [2]: y = np.array([ 'USC00013160194806SNOW','USC00013160194806SNOW','USC00013160194806SNOW' ])
In [3]: y
Out[3]:
array(['USC00013160194806SNOW', 'USC00013160194806SNOW',
'USC00013160194806SNOW'],
dtype='|S21')
I want each element of the array split according to a certain number of characters.
Desired Output:
In [3]: y
Out[3]:
array(['USC00013160', 'USC00013160',
'USC00013160'],
dtype='|S21')
I've done this using standard Python loops, but I'm dealing with millions of values, so I'm trying to figure out the fastest method.
You can create a view using a data type with the same size as y's dtype that has subfields corresponding to the parts that you want. For example,
In [22]: y
Out[22]:
array(['USC00013160194806SNOW', 'USC00013160194806SNOW',
'USC00013160194806SNOW'],
dtype='|S21')
In [23]: dt = np.dtype([('part1', 'S11'), ('part2', 'S6'), ('part3', 'S4')])
In [24]: v = y.view(dt)
In [25]: v['part1']
Out[25]:
array(['USC00013160', 'USC00013160', 'USC00013160'],
dtype='|S11')
In [26]: v['part2']
Out[26]:
array(['194806', '194806', '194806'],
dtype='|S6')
In [27]: v['part3']
Out[27]:
array(['SNOW', 'SNOW', 'SNOW'],
dtype='|S4')
Note that these are all views of the same data in y. If you modify them in place, you are also modifying y. For example,
In [32]: v3 = v['part3']
In [33]: v3
Out[33]:
array(['SNOW', 'SNOW', 'SNOW'],
dtype='|S4')
Change v3[1] to 'RAIN':
In [34]: v3[1] = 'RAIN'
In [35]: v3
Out[35]:
array(['SNOW', 'RAIN', 'SNOW'],
dtype='|S4')
Now see that y[1] is also changed:
In [36]: y
Out[36]:
array(['USC00013160194806SNOW', 'USC00013160194806RAIN',
'USC00013160194806SNOW'],
dtype='|S21')
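If you only need the leading characters rather than all three parts, another NumPy-only option (not from the answer above) is to cast to a shorter fixed-width string dtype, which truncates each element. Unlike the view trick, this makes a copy rather than a view into y:

import numpy as np

y = np.array(['USC00013160194806SNOW'] * 3)   # dtype '<U21' on Python 3 ('|S21' for bytes)

# astype to a shorter string dtype keeps only the first 11 characters of each element
ids = y.astype('U11')
print(ids)   # ['USC00013160' 'USC00013160' 'USC00013160']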
One possible solution I've found is to complete the operation using a Pandas Series, but I'm wondering if this can be done using only NumPy array slicing methods. If not, that's fine; I'm more curious about the best practice.
Starting Pandas Series:
In [33]: x = pd.read_csv("data.txt", delimiter='\n', dtype=str, squeeze=True)
In [34]: x
Out[34]:
0    USC00013160194807SNOW
1    USC00013160194808SNOW
2    USC00013160194809SNOW
3    USC00013160194810SNOW
4    USC00013160194811SNOW
dtype: object
Vectorized String Processing based on Character Count:
In [37]: k = x.str[0:11]
Output:
In [38]: k
Out[38]:
0 USC00013160
1 USC00013160
2 USC00013160
3 USC00013160
4 USC00013160

Efficiently create sparse pivot tables in pandas?

I'm working on turning a list of records with two columns (A and B) into a matrix representation. I have been using the pivot function within pandas, but the result ends up being fairly large. Does pandas support pivoting into a sparse format? I know I can pivot it and then turn it into some kind of sparse representation, but that isn't as elegant as I would like. My end goal is to use it as the input for a predictive model.
Alternatively, is there some kind of sparse pivot capability outside of pandas?
edit: here is an example of a non-sparse pivot
import pandas as pd
frame=pd.DataFrame()
frame['person']=['me','you','him','you','him','me']
frame['thing']=['a','a','b','c','d','d']
frame['count']=[1,1,1,1,1,1]
frame
  person thing  count
0     me     a      1
1    you     a      1
2    him     b      1
3    you     c      1
4    him     d      1
5     me     d      1
frame.pivot('person','thing')
       count
thing      a    b    c    d
person
him      NaN    1  NaN    1
me         1  NaN  NaN    1
you        1  NaN    1  NaN
This creates a matrix that could contain all possible combinations of persons and things, but it is not sparse.
http://docs.scipy.org/doc/scipy/reference/sparse.html
Sparse matrices take up less space because they only store the non-fill values and leave values like NaN or 0 implicit. If I have a very large data set, this pivoting function can generate a matrix that should be sparse due to the large number of NaNs or 0s. I was hoping that I could save a lot of space/memory by generating something that was sparse right off the bat rather than creating a dense matrix and then converting it to sparse.
Here is a method that creates a sparse scipy matrix based on the data and the person/thing indices. person_u and thing_u are lists representing the unique entries for the rows and columns of the pivot you want to create. Note: this assumes that your count column already has the value you want in it.
from scipy.sparse import csr_matrix
person_u = list(sort(frame.person.unique()))
thing_u = list(sort(frame.thing.unique()))
data = frame['count'].tolist()
row = frame.person.astype('category', categories=person_u).cat.codes
col = frame.thing.astype('category', categories=thing_u).cat.codes
sparse_matrix = csr_matrix((data, (row, col)), shape=(len(person_u), len(thing_u)))
>>> sparse_matrix
<3x4 sparse matrix of type '<type 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
>>> sparse_matrix.todense()
matrix([[0, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 0, 1, 0]])
Based on your original question, the scipy sparse matrix should be sufficient for your needs, but should you wish to have a sparse dataframe you can do the following:
dfs = pd.SparseDataFrame([ pd.SparseSeries(sparse_matrix[i].toarray().ravel(), fill_value=0)
                           for i in np.arange(sparse_matrix.shape[0]) ],
                         index=person_u, columns=thing_u, default_fill_value=0)
>>> dfs
     a  b  c  d
him  0  1  0  1
me   1  0  0  1
you  1  0  1  0
>>> type(dfs)
pandas.sparse.frame.SparseDataFrame
The answer previously posted by @khammel was useful, but unfortunately it no longer works due to changes in pandas and Python. The following should produce the same output:
from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype
person_c = CategoricalDtype(sorted(frame.person.unique()), ordered=True)
thing_c = CategoricalDtype(sorted(frame.thing.unique()), ordered=True)
row = frame.person.astype(person_c).cat.codes
col = frame.thing.astype(thing_c).cat.codes
sparse_matrix = csr_matrix((frame["count"], (row, col)), \
shape=(person_c.categories.size, thing_c.categories.size))
>>> sparse_matrix
<3x4 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
>>> sparse_matrix.todense()
matrix([[0, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 0, 1, 0]], dtype=int64)
dfs = pd.SparseDataFrame(sparse_matrix, \
index=person_c.categories, \
columns=thing_c.categories, \
default_fill_value=0)
>>> dfs
     a  b  c  d
him  0  1  0  1
me   1  0  0  1
you  1  0  1  0
The main changes were:
.astype() no longer accepts "categorical". You have to create a CategoricalDtype object.
The bare sort() call doesn't work anymore; sorted() is used instead
Other changes were more superficial:
using the category sizes instead of the length of the unique-value lists, just because I didn't want to make another object unnecessarily
the data input for the csr_matrix (frame["count"]) doesn't need to be a list object
pandas SparseDataFrame accepts a scipy.sparse object directly now
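On pandas 1.0+ the SparseDataFrame class itself is gone; assuming the sparse_matrix and the categorical dtypes built above, a sketch of the current equivalent using the sparse accessor:

import pandas as pd

# wraps the scipy matrix column by column in SparseArrays; no dense intermediate
dfs = pd.DataFrame.sparse.from_spmatrix(sparse_matrix,
                                        index=person_c.categories,
                                        columns=thing_c.categories)

print(dfs)
print(dfs.sparse.to_coo())   # back to a scipy COO matrix if needed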
I had a similar problem and stumbled over this post. The only difference was that I had two columns in the DataFrame that define the "row dimension" (i) of the output matrix. I thought this might be an interesting generalisation; I used the grouper:
# function
import pandas as pd
from scipy.sparse import csr_matrix
def df_to_sm(data, vars_i, vars_j):
    grpr_i = data.groupby(vars_i).grouper
    idx_i = grpr_i.group_info[0]
    grpr_j = data.groupby(vars_j).grouper
    idx_j = grpr_j.group_info[0]
    data_sm = csr_matrix((data['val'].values, (idx_i, idx_j)),
                         shape=(grpr_i.ngroups, grpr_j.ngroups))
    return data_sm, grpr_i, grpr_j
# example
data = pd.DataFrame({'var_i_1' : ['a1', 'a1', 'a1', 'a2', 'a2', 'a3'],
                     'var_i_2' : ['b2', 'b1', 'b1', 'b1', 'b1', 'b4'],
                     'var_j_1' : ['c2', 'c3', 'c2', 'c1', 'c2', 'c3'],
                     'val'     : [1, 2, 3, 4, 5, 6]})
data_sm, _, _ = df_to_sm(data, ['var_i_1', 'var_i_2'], ['var_j_1'])
data_sm.todense()
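The .grouper/.group_info attributes used above are internal pandas API and have shifted between versions; the same idea can be written against the public GroupBy.ngroup()/ngroups interface. A sketch reusing the example data frame (df_to_sm2 is just a hypothetical name):

import pandas as pd
from scipy.sparse import csr_matrix

def df_to_sm2(data, vars_i, vars_j, val_col='val'):
    gi = data.groupby(vars_i, sort=True)
    gj = data.groupby(vars_j, sort=True)
    # ngroup() labels every row with its group number along each axis
    idx_i = gi.ngroup().to_numpy()
    idx_j = gj.ngroup().to_numpy()
    return csr_matrix((data[val_col].to_numpy(), (idx_i, idx_j)),
                      shape=(gi.ngroups, gj.ngroups))

data_sm2 = df_to_sm2(data, ['var_i_1', 'var_i_2'], ['var_j_1'])
print(data_sm2.todense())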

Inflating a 1D array into a 2D array in numpy

Say I have a 1D array:
import numpy as np
my_array = np.arange(0,10)
my_array.shape
(10,)
In Pandas I would like to create a DataFrame with only one row and 10 columns using this array. For example:
import pandas as pd
import random, string
# Random list of characters to be used as columns
cols = [random.choice(string.ascii_uppercase) for x in range(10)]
But when I try:
pd.DataFrame(my_array, columns = cols)
I get:
ValueError: Shape of passed values is (1,10), indices imply (10,10)
I presume this is because Pandas expects a 2D array, and I have a (flat) 1D array. Is there a way to inflate my 1D array into a 2D array, or have Pandas use a 1D array in the creation of the dataframe?
Note: I am using the latest stable version of Pandas (0.11.0)
Your value array has length 9 (values from 1 till 9), and your cols list has length 10.
I don't understand your error message; based on your code, I get:
ValueError: Shape of passed values is (1, 9), indices imply (10, 9)
which makes sense.
Try:
my_array = np.arange(10).reshape(1,10)
cols = [random.choice(string.ascii_uppercase) for x in range(10)]
pd.DataFrame(my_array, columns=cols)
Which results in:
   F  H  L  N  M  X  B  R  S  N
0  0  1  2  3  4  5  6  7  8  9
Either these should do it:
my_array2 = my_array[None]  # same as my_array2 = my_array[np.newaxis]
or
my_array2 = my_array.reshape((1,10))
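Another equivalent, if you prefer not to index or reshape by hand, is np.atleast_2d, which turns the (10,) array into the (1, 10) shape the DataFrame constructor expects; a quick sketch:

import numpy as np
import pandas as pd
import random, string

my_array = np.arange(10)
cols = [random.choice(string.ascii_uppercase) for _ in range(10)]

df = pd.DataFrame(np.atleast_2d(my_array), columns=cols)
print(df.shape)   # (1, 10)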
A single-row, many-columned DataFrame is unusual. A more natural, idiomatic choice would be a Series indexed by what you call cols:
pd.Series(my_array, index=cols)
But, to answer your question, the DataFrame constructor is assuming that my_array is a column of 10 data points. Try DataFrame(my_array.reshape((1, 10)), columns=cols). That works for me.
By using one of the alternate DataFrame constructors it is possible to create a DataFrame without needing to reshape my_array.
import numpy as np
import pandas as pd
import random, string
my_array = np.arange(0,10)
cols = [random.choice(string.ascii_uppercase) for x in range(10)]
pd.DataFrame.from_records([my_array], columns=cols)
Out[22]:
   H  H  P  Q  C  A  G  N  T  W
0  0  1  2  3  4  5  6  7  8  9
