I have a .csv file with data, and I want to transform some of its columns to one-hot encoding. The problem occurs in the second-to-last line, where the one-hot index (e.g. the 1st feature) gets set in all rows instead of just the row I am currently in.
It seems to be some problem with how I access the 2D list... any suggestions?
Thank you.
def one_hot_encode(data_list, column):
    one_hot_list = [[]]

    different_elements = []
    for row in data_list[1:]:  # count different elements
        if row[column] not in different_elements:
            different_elements.append(row[column])

    for i in range(len(different_elements)):  # set variable names
        one_hot_list[0].append(different_elements[i])

    vector = []  # create list shape with zeroes
    for i in range(len(different_elements)):
        vector.append(0)
    for i in range(1460):
        one_hot_list.append(vector)

    ind_row = 1  # encode 1 for each sample
    for row in data_list[1:]:
        index = different_elements.index(row[column])
        one_hot_list[ind_row][index] = 1  # mistake!! sets all rows to 1
        ind_row += 1
Your problem stems from the vector object you're creating to do the one-hot encoding; you've created one object, and then built a one_hot_list that contains 1460 references to the same object. When you make a change in one of the rows, it will be reflected in all of the rows.
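You can see the shared-reference behaviour in a minimal sketch (the names here are made up, purely for illustration):

vector = [0, 0, 0]
rows = []
for _ in range(3):
    rows.append(vector)   # every entry refers to the *same* list object

rows[0][1] = 1
print(rows)               # [[0, 1, 0], [0, 1, 0], [0, 1, 0]] -- all "rows" changed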
A quick solution would be to create a separate copy of the vector for each row (see How to clone or copy a list?):
one_hot_list.append(vector[:])
Some of the other things you're doing in your function are a bit slow or roundabout. I'd suggest a few changes:
def one_hot_encode(data_list, column):
    one_hot_list = [[]]

    # count different elements
    different_elements = set(row[column] for row in data_list[1:])

    # convert different_elements to a list with a canonical order,
    # store in the first element of one_hot_list
    one_hot_list[0] = sorted(different_elements)

    vector = [0] * len(different_elements)  # create list shape with zeroes
    one_hot_list.extend([vector[:] for _ in range(1460)])

    # build a mapping of different_element values to indices into
    # one_hot_list[0]
    index_lookup = dict((e, i) for (i, e) in enumerate(one_hot_list[0]))

    # encode 1 for each sample
    for rindex, row in enumerate(data_list[1:], 1):
        cindex = index_lookup[row[column]]
        one_hot_list[rindex][cindex] = 1
This builds different_elements in linear time by using the set data type, and uses list comprehensions to produce the values for one_hot_list[0] (the list of element values which are being one-hot encoded), the zero vector, and one_hot_list[1:] (which is the actual one-hot-encoded matrix value). Also, there's a dict called index_lookup that lets you quickly map element values onto their integer index, instead of searching for them over and over again. Finally, your row index into the one_hot_list matrix can be managed for you by enumerate.
I'm not 100% sure of what you are trying to do but the problem you are seeing is in these lines:
for i in range(1460):
    one_hot_list.append(vector)
These lines fill one_hot_list with 1460 references to the same vector of zeros, whereas I think you want a new vector each time. A direct fix would just be to copy it each time:
for i in range(1460):
    one_hot_list.append(vector[:])
But a more Pythonic approach would be to create the list with a comprehension. Perhaps something like this:
vector_size = len(different_elements)
one_hot_list = [[0] * vector_size for i in range(1460)]
You can use set() to collect the unique items in a column:
different_elements = list(set(row[column] for row in data_list[1:]))
I suggest you save yourself from the hassle of re-implementing this in plain Python. You can use pandas.get_dummies for this:
First some test data (test.csv):
A
Foo
Bar
Baz
Then in Python:
import pandas as pd
df = pd.read_csv('test.csv')
# convert column 'A' to one-hot encoding
pd.get_dummies(df['A'])
You can retrieve the underlying numpy array using:
pd.get_dummies(df['A']).values
Which results in:
array([[0, 0, 1],
[1, 0, 0],
[0, 1, 0]], dtype=uint8)
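If you need to encode several columns at once (as the question suggests), get_dummies also accepts a whole DataFrame together with a columns argument; the second column 'B' below is made up purely for illustration:

import pandas as pd

df = pd.DataFrame({'A': ['Foo', 'Bar', 'Baz'],
                   'B': ['x', 'y', 'x']})   # 'B' is a hypothetical extra column
pd.get_dummies(df, columns=['A', 'B'])      # one-hot encodes both columns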
Given an array defined below as:
a = np.arange(30).reshape((3, 10))
col_index = [[1,2,3,5], [3,4,5,7]]
row_index = [2,1]
Is it possible to index a[row_index, col_index], so I can do something like
a[row_index, col_index] =1, so then a becomes
[[0,1,2,3,4,5,6,7,8,9], [10,11,12,1,1,1,16,1,18,19], [20,1,1,1,24,1,26,27,28,29]]
So to clarify: in row 2, columns 1, 2, 3, and 5 are set to one, and in row 1, columns 3, 4, 5, and 7 are also set to 1.
You can make the row indices broadcast against the column indices by turning them into a column vector:
a[np.array(row_index)[:, None], col_index] = 1
Or (if you don't like typing)
a[np.c_[row_index], col_index] = 1
or even shorter but Python 2 only
a[zip(row_index), col_index] = 1
What all these solutions do is to make row and col indices broadcastable to each other. np.c_ is the column concatenation convenience object. It makes columns out of 1D objects.
zip used to do essentially the same. However, since Python 3 it returns an iterator instead of a list, and numpy can't handle those. (One could do list(zip(row_index)), but that's not short.)
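A self-contained sketch of the broadcasting approach, using the arrays from the question:

import numpy as np

a = np.arange(30).reshape((3, 10))
col_index = [[1, 2, 3, 5], [3, 4, 5, 7]]
row_index = [2, 1]

# np.c_[row_index] has shape (2, 1), so it broadcasts against the (2, 4) col_index
a[np.c_[row_index], col_index] = 1
print(a)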
I have an array X of <class 'scipy.sparse.csr.csr_matrix'> format with shape (44, 4095).
I would now like to create a new numpy array, say X_train = np.empty([44, 4095]), and copy the rows over in a different order. Say I want the 5th row of X in the 1st row of X_train.
How do I do this (copying an entire row into a new numpy array), similar to MATLAB?
Define the new row order as a list of indices, then define X_train using integer indexing:
row_order = [4, ...]
X_train = X[row_order]
Note that unlike Matlab, Python uses 0-based indexing, so the 5th row has index 4.
Also note that integer indexing (due to its ability to select values in arbitrary order) returns a copy of the original NumPy array.
This works equally well for sparse matrices and NumPy arrays.
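For illustration, a small sketch with a toy matrix (the row_order values here are made up; the real X has shape (44, 4095)):

import numpy as np
from scipy.sparse import csr_matrix

X = csr_matrix(np.arange(12).reshape(4, 3))   # toy stand-in for the real X

row_order = [2, 0, 3, 1]        # hypothetical new row order
X_train = X[row_order]          # integer indexing returns a copy, rows reordered
print(X_train.toarray())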
Python generally works with references to objects, which is something you should keep in mind. What you need to do is make a copy and then swap. I have written a demo function which swaps rows.
import numpy as np  # import numpy

def swapRows(myArray, rowA, rowB):
    ''' Function which swaps rowA with rowB '''
    temp = myArray[rowA, :].copy()  # create a temporary variable
    myArray[rowA, :] = myArray[rowB, :].copy()
    myArray[rowB, :] = temp

a = np.arange(30)    # generate demo data
a = a.reshape(6, 5)  # reshape the data into a 6x5 matrix
print(a)             # print the matrix before the swap
swapRows(a, 0, 1)    # swap the rows
print(a)             # print the matrix after the swap
To answer your question, one solution would be to use
X_train = np.empty([44, 4095])
X_train[0, :] = X[4, :].toarray().ravel()  # store the 5th row of X in the 1st row of X_train
unutbu's answer seems to be the most logical.
I'm trying to figure out the best way to turn my data into a numpy/scipy sparse matrix. I don't need to do any heavy computation in this format. I just need to be able to convert data from a dense, too-large-for-memory csv to something I can pass into an sklearn estimator. My theory is that the sparse-ified data should fit in memory.
Because all of the features are categorical, I'm using a generator to iterate over the file and the hashing trick to one hot encode everything:
def get_data(train=True):
    if train:
        path = '../originalData/train_rev1_short_short.csv'
    else:
        path = '../originalData/test_rev1_short.csv'

    it = enumerate(open(path))
    it.next()     # burn the header row
    x = [0] * 27  # initialize row container

    for ix, line in it:
        for ixx, f in enumerate(line.strip().split(',')):
            # Record sample id
            if ixx == 0:
                sample_id = f
            # If this is the training data, record output class
            elif ixx == 1 and train:
                c = f
            # Use the hashing trick to one hot encode categorical features
            else:
                x[ixx] = abs(hash(str(ixx) + '_' + f)) % (2 ** 20)
        yield (sample_id, x, c) if train else (sample_id, x)
The result is rows like this:
10000222510487979663 [1, 3, 66642, 433470, 960966, ..., 802612, 319257, 80942]
10000335031004381249 [1, 2, 87543, 394759, 183945, ..., 773845, 219833, 64573]
Where the first value is the sample ID and the list is the index values of the columns that have a '1' value.
What is the most efficient way to turn this into a numpy/scipy sparse matrix? My only requirements are fast row-wise write/read and sklearn compatibility. Based on the scipy documentation, it seems like the CSR matrix is what I need, but I'm having some trouble figuring out how to convert the data I have while using the generator construct.
Any advice? Open also to alternate approaches, I'm relatively new to problems like this.
Your data format is almost the internal structure of a scipy.sparse.lil_matrix (list of lists). You should first generate one of those, and then call .tocsr() on it to obtain the desired csr matrix.
A small example on how to populate these:
from scipy.sparse import lil_matrix
positions = [[1, 2, 10], [], [5, 6, 2]]
data = [[1, 1, 1], [], [1, 1, 1]]
l = lil_matrix((3, 11))
l.rows = positions
l.data = data
c = l.tocsr()
where data is just a list of lists of ones mirroring the structure of positions, and positions corresponds to your feature indices. As you can see, the attributes l.rows and l.data are real lists here, so you can append data as it comes. In that case you need to be careful with the shape, though. When scipy generates a lil_matrix from other data, it stores arrays of dtype object in these attributes, but those behave almost like lists, too.
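As a rough sketch of how that could look with the get_data generator from the question (assuming 2 ** 20 columns to match the hashing modulus, and using plain per-row assignment instead of setting .rows/.data directly):

from scipy.sparse import lil_matrix

n_cols = 2 ** 20                       # matches the modulus in the hashing trick
ids, positions, labels = [], [], []
for sample_id, x, c in get_data(train=True):
    ids.append(sample_id)
    positions.append(sorted(set(x)))   # copy the indices; the generator reuses x
    labels.append(c)

m = lil_matrix((len(positions), n_cols))
for i, cols in enumerate(positions):
    m[i, cols] = 1                     # set the one-hot columns of row i
X = m.tocsr()                          # CSR format for sklearn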
I have my data as such:
data = {'x':Counter({'a':1,'b':45}), 'y':Counter({'b':1, 'c':212})}
where my labels are the keys of the data, and the keys of the inner dictionary are features:
all_features = ['a','b','c']
all_labels = ['x','y']
I need to create a list of lists like this:
[[data[label][feat] for feat in all_features] for label in all_labels]
[out]:
[[1, 45, 0], [0, 1, 212]]
My len(all_features) is ~5,000,000 and len(all_labels) is ~100,000
The ultimate purpose is to create scipy sparse matrix, e.g.:
from collections import Counter
from scipy.sparse import csc_matrix
import numpy as np
all_features = ['a','b','c']
all_labels = ['x','y']
csc_matrix(np.array([[data[label][feat] for feat in all_features] for label in all_labels]))
but looping through a large list of lists is rather inefficient.
So how can I loop through the large list of lists efficiently?
Is there any other way to create the scipy matrix from the data without looping through all features and labels?
Converting a dictionary of dictionaries into a numpy or scipy array is, as you are experiencing, not too much fun. If you know all_features and all_labels beforehand, you are probably better off using a scipy sparse COO matrix from the start to keep your counts.
Whether that is possible or not, you will want to keep your lists of features and labels in sorted order, to speed up lookups. So I am going to assume that the following doesn't change either array:
all_features = np.array(all_features)
all_labels = np.array(all_labels)
all_features.sort()
all_labels.sort()
Let's extract the labels in data in the order they are stored in the dictionary, and see where in all_labels each item falls:
labels = np.fromiter(data.iterkeys(), all_labels.dtype, len(data))
label_idx = np.searchsorted(all_labels, labels)
Now let's count how many features each label has, and compute from it the number of non-zero items there will be in your sparse array:
label_features = np.fromiter((len(c) for c in data.itervalues()), np.intp,
                             len(data))
indptr = np.concatenate(([0], np.cumsum(label_features)))
nnz = indptr[-1]
Now we extract the features for each label, and their corresponding counts:
import itertools
features_it = itertools.chain(*(c.iterkeys() for c in data.itervalues()))
features = np.fromiter(features_it, all_features.dtype, nnz)
feature_idx = np.searchsorted(all_features, features)
counts_it = itertools.chain(*(c.itervalues() for c in data.itervalues()))
counts = np.fromiter(counts_it, np.intp, nnz)
With what we have, we can create a CSR matrix directly, with labels as rows and features as columns:
sps_data = csr_matrix((counts, feature_idx, indptr),
shape=(len(all_labels), len(all_features)))
The only issue is that the rows of this sparse array are not in the order of all_labels, but in the order they came up when iterating over data. But we have label_idx telling us where each label ended up, and we can rearrange the rows by doing:
sps_data = sps_data[np.argsort(label_idx)]
Yes, it is messy, confusing, and probably not very fast, but it works, and it will be much more memory efficient than what you proposed in your question:
>>> sps_data.A
array([[ 1, 45, 0],
[ 0, 1, 212]], dtype=int64)
>>> all_labels
array(['x', 'y'],
dtype='<S1')
>>> all_features
array(['a', 'b', 'c'],
dtype='<S1')
The dataset is quite large, so I don't think it is practical to construct a temporary numpy array (if 32-bit integers are used, a 1e5 x 5e6 matrix would require ~2 terabytes of memory).
I assume you know an upper bound for the number of features.
The code could look like:
import scipy.sparse

n_rows = len(data.keys())
max_col = int(5e6)
temp_sparse = scipy.sparse.lil_matrix((n_rows, max_col), dtype='int')

for i, (label, counter) in enumerate(data.iteritems()):
    for feature, n in counter.iteritems():
        j = feature_pos[feature]
        temp_sparse[i, j] = n

csc_matrix = temp_sparse.tocsc()
Where feature_pos returns the column index of the feature.
If it turns out not to be practical to use a dictionary for storing the indices of 5 million features, a hard-drive database should do.
The dictionary could be created online, so previous knowledge of all the features is not necessary.
Iterating through 100,000 labels would take a reasonable time, so I think this solution could work if the dataset is sparse enough. Good luck!
Is there any other way to create the scipy matrix from the data without looping through all features and labels?
I don't think there is any short-cut that reduces the total number of lookups. You're starting with a dictionary of Counters (a dict subclass) so both levels of nesting are unordered collections. The only way to put them back in required order is to do a data[label][feat] lookup for every data point.
You can cut the time roughly in half by making sure the data[label] lookup is only done once per label:
>>> counters = [data[label] for label in all_labels]
>>> [[counter[feat] for feat in all_features] for counter in counters]
[[1, 45, 0], [0, 1, 212]]
You can also try speeding up the running time by using map() instead of a list comprehension (mapping can take advantage of the internal length hint to pre-size the result list):
>>> [map(counter.__getitem__, all_features) for counter in counters]
[[1, 45, 0], [0, 1, 212]]
Lastly, be sure to run the code inside a function (local variable lookups in CPython are faster than global variable lookups):
def f(data, all_features, all_labels):
    counters = [data[label] for label in all_labels]
    return [map(counter.__getitem__, all_features) for counter in counters]
I have a NumPy matrix that contains mostly non-zero values, but occasionally will contain a zero value. I need to be able to:
Count the non-zero values in each row and put that count into a variable that I can use in subsequent operations, perhaps by iterating through row indices and performing the calculations during the iterative process.
Count the non-zero values in each column and put that count into a variable that I can use in subsequent operations, perhaps by iterating through column indices and performing the calculations during the iterative process.
For example, one thing I need to do is to sum each row and then divide each row sum by the number of non-zero values in each row, reporting a separate result for each row index. And then I need to sum each column and then divide the column sum by the number of non-zero values in the column, also reporting a separate result for each column index. I need to do other things as well, but they should be easy after I figure out how to do the things that I am listing here.
The code I am working with is below. You can see that I am creating an array of zeros and then populating it from a csv file. Some of the rows will contain values for all the columns, but other rows will still have some zeros remaining in some of the last columns, thus creating the problem described above.
The last five lines of the code below are from another posting on this forum. These last five lines of code return a printed list of row/column indices for the zeros. However, I do not know how to use that resulting information to create the non-zero row counts and non-zero column counts described above.
ANOVAInputMatrixValuesArray = zeros([len(TestIDs), 9], float)
j = 0
for j in range(0, len(TestIDs)):
    TestID = str(TestIDs[j])
    ReadOrWrite = 'Read'
    fileName = inputFileName
    directory = GetCurrentDirectory(arguments that return correct directory)
    inputfile = open(directory, 'r')
    reader = csv.reader(inputfile)
    m = 0
    for row in reader:
        if m < 9:
            if row[0] != 'TestID':
                ANOVAInputMatrixValuesArray[(j-1), m] = row[2]
                m += 1
    inputfile.close()

IndicesOfZeros = indices(ANOVAInputMatrixValuesArray.shape)
locs = IndicesOfZeros[:, ANOVAInputMatrixValuesArray == 0]
pts = hsplit(locs, len(locs[0]))
for pt in pts:
    print(', '.join(str(p[0]) for p in pt))
Can anyone help me with this?
import numpy as np
a = np.array([[1, 0, 1],
[2, 3, 4],
[0, 0, 7]])
columns = (a != 0).sum(0)
rows = (a != 0).sum(1)
The expression (a != 0) gives a boolean array of the same shape as the original a, containing True for all non-zero elements.
The .sum(x) call sums the elements over axis x. The sum of True/False elements is the number of True elements.
The variables columns and rows contain the number of non-zero (element != 0) values in each column/row of your original array:
columns = np.array([2, 1, 3])
rows = np.array([2, 3, 1])
EDIT: The whole code could look like this (with a few simplifications in your original code):
ANOVAInputMatrixValuesArray = zeros([len(TestIDs), 9], float)
for j, TestID in enumerate(TestIDs):
    ReadOrWrite = 'Read'
    fileName = inputFileName
    directory = GetCurrentDirectory(arguments that return correct directory)
    # use directory or filename to get the CSV file?
    with open(directory, 'r') as csvfile:
        ANOVAInputMatrixValuesArray[j, :] = loadtxt(csvfile, comments='TestID', delimiter=';', usecols=(2,))[:9]
nonZeroCols = (ANOVAInputMatrixValuesArray != 0).sum(0)
nonZeroRows = (ANOVAInputMatrixValuesArray != 0).sum(1)
EDIT 2:
To get the mean value of all columns/rows, use the following:
colMean = a.sum(0) / (a != 0).sum(0)
rowMean = a.sum(1) / (a != 0).sum(1)
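For the small array a from above, this gives (cast to float so the division isn't truncated to integers on Python 2):

import numpy as np

a = np.array([[1, 0, 1],
              [2, 3, 4],
              [0, 0, 7]], dtype=float)

colMean = a.sum(0) / (a != 0).sum(0)   # [1.5, 3.0, 4.0]
rowMean = a.sum(1) / (a != 0).sum(1)   # [1.0, 3.0, 7.0]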
What do you want to do if there are no non-zero elements in a column/row? Then we can adapt the code to solve such a problem.
A fast way to count nonzero elements per row in a scipy sparse matrix m is:
np.diff(m.tocsr().indptr)
The indptr attribute of a CSR matrix indicates the indices within the data corresponding to the boundaries between rows, so calculating the difference between consecutive entries gives the number of non-zero elements in each row.
Similarly, for the number of nonzero elements in each column, use:
np.diff(m.tocsc().indptr)
If the data is already in the appropriate form, these will run in O(m.shape[0]) and O(m.shape[1]) respectively, rather than O(m.getnnz()) in Marat and Finn's solutions.
If you need both row and column nonzero counts, and, say, m is already a CSR, you might use:
row_nonzeros = np.diff(m.indptr)
col_nonzeros = np.bincount(m.indices)
which is not asymptotically faster than first converting to CSC (which is O(m.getnnz())) to get col_nonzeros, but is faster because of implementation details.
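A small self-contained check of both counts (the minlength argument is my addition; it guards against trailing all-zero columns):

import numpy as np
from scipy.sparse import csr_matrix

m = csr_matrix(np.array([[0, 1, 1],
                         [0, 1, 0]]))

row_nonzeros = np.diff(m.indptr)                              # [2, 1]
col_nonzeros = np.bincount(m.indices, minlength=m.shape[1])   # [0, 2, 1]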
A faster way is to clone your matrix with ones instead of the real values. Then just sum up by rows or columns:
X_clone = X.tocsc()
X_clone.data = np.ones( X_clone.data.shape )
NumNonZeroElementsByColumn = X_clone.sum(0)
NumNonZeroElementsByRow = X_clone.sum(1)
That worked 50 times faster for me than Finn Årup Nielsen's solution (1 second against 53)
Edit:
Perhaps you will need to flatten NumNonZeroElementsByColumn into a 1-dimensional array with
np.array(NumNonZeroElementsByColumn)[0]
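Putting it together on a toy matrix (a sketch; the flattening at the end follows the edit above):

import numpy as np
from scipy.sparse import csr_matrix

X = csr_matrix(np.array([[0, 1, 1],
                         [0, 1, 0]]))

X_clone = X.tocsc()
X_clone.data = np.ones(X_clone.data.shape)
print(np.asarray(X_clone.sum(0)).ravel())   # [0. 2. 1.]  nonzeros per column
print(np.asarray(X_clone.sum(1)).ravel())   # [2. 1.]     nonzeros per row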
For sparse matrices, use the getnnz() function supported by CSR/CSC matrix.
E.g.
a = scipy.sparse.csr_matrix([[0, 1, 1], [0, 1, 0]])
a.getnnz(axis=0)
array([0, 2, 1])
(a != 0) does not work for sparse matrices (scipy.sparse.lil_matrix) in my present version of scipy.
For sparse matrices I did:
(i, j) = X.nonzero()
column_sums = np.zeros(X.shape[1])
for n in np.asarray(j).ravel():
    column_sums[n] += 1.
I wonder if there is a more elegant way.
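One more compact alternative, just a sketch reusing the bincount idea from the earlier answer (toy matrix for illustration):

import numpy as np
from scipy.sparse import lil_matrix

X = lil_matrix(np.array([[0, 1, 1], [0, 1, 0]]))
(i, j) = X.nonzero()
column_sums = np.bincount(np.asarray(j).ravel(), minlength=X.shape[1])
print(column_sums)   # [0 2 1]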