how to extract index in factor vector in Rpy2 - python

I have a factor vector sv='ababbc' and a integer vector fv=[1,1,1,1,1,1]. fv is correspond to sv.
import rpy2.robjects as robjects
sv=robjects.StrVector('ababbc')
fac=robjects.FactorVector(sv)
fv=robjects.r['rep'](1,6)
I want to change the value of element to 2 in fv, which of index correspond to letter “a”.
made fv=[2,1,2,1,1,1]
How to do it? Thank you.

To get the index when true:
In [54]:
import numpy as np
np.argwhere(np.array(sv) == 'a')
Out[54]:
array([[0],
[2]])
The 1st and 3rd positions have the letter 'a'.
You can't do that with fac, as it is already factorized and contains only the levels, 1, 2, 3..., not the original 'a', 'b', 'c'... anymore.
In [55]:
np.argwhere(np.array(fac) == 'a')
Out[55]:
array([], shape=(0, 1), dtype=int64)
In [56]:
np.array(fac)
Out[56]:
array([1, 2, 1, 2, 2, 3], dtype=int32)
Or it can be done in R side:
In [51]:
robjects.reval('result1 <- which(sv %in% c("a"))')
print robjects.r.result1
[1] 1 3
To systematically assign a given value to a level, I suggest you to use the factor function in R:
In [53]:
robjects.r.assign('sv', sv)
robjects.reval('result3 <- factor(sv, levels=c("a","b","c"), labels=c(10,2,3))')
print robjects.r.result3
[1] 10 2 10 2 2 3
Levels: 10 2 3
So a gets 10, b gets 2, c gets 3 and so on.

Related

Find index of max element in numpy array excluding few indexes

Say:
p = array([4, 0, 8, 2, 7])
Want to find the index of max value, except few indexes, say:
excptIndx = [2, 3]
Ans: 4, as 7 will be max.
if excptIndx = [1, 3], Ans: 2, as 8 will be max.
In numpy, you can mask all values at excptIndx and run argmax to obtain index of max element:
import numpy as np
p = np.array([4, 0, 8, 2, 7])
excptIndx = [2, 3]
m = np.zeros(p.size, dtype=bool)
m[excptIndx] = True
a = np.ma.array(p, mask=m)
print(np.argmax(a))
# 4
The setup:
In [153]: p = np.array([4,0,8,2,7])
In [154]: exceptions = [2,3]
Original indexes in p:
In [155]: idx = np.arange(p.shape[0])
delete exceptions from both:
In [156]: np.delete(p,exceptions)
Out[156]: array([4, 0, 7])
In [157]: np.delete(idx,exceptions)
Out[157]: array([0, 1, 4])
Find the argmax in the deleted array:
In [158]: np.argmax(np.delete(p,exceptions))
Out[158]: 2
Use that to find the max value (could just as well use np.max(_156)
In [159]: _156[_158]
Out[159]: 7
Use the same index to find the index in the original p
In [160]: _157[_158]
Out[160]: 4
In [161]: p[_160] # another way to get the max value
Out[161]: 7
For this small example, the pure Python alternatives might well be faster. They often are in small cases. We need test cases with a 1000 or more values to really see the advantages of numpy.
Another method
Set the exceptions to a small enough value, and take the argmax:
In [162]: p1 = p.copy(); p1[exceptions] = -1000
In [163]: np.argmax(p1)
Out[163]: 4
Here the small enough is easy to pick; more generally it may require some thought.
Or taking advantage of the np.nan... functions:
In [164]: p1 = p.astype(float); p1[exceptions]=np.nan
In [165]: np.nanargmax(p1)
Out[165]: 4
A solution is
mask = np.isin(np.arange(len(p)), excptIndx)
subset_idx = np.argmax(p[mask])
parent_idx = np.arange(len(p))[mask][subset_idx]
See http://seanlaw.github.io/2015/09/10/numpy-argmin-with-a-condition/
p = np.array([4,0,8,2,7]) # given
exceptions = [2,3] # given
idx = list( range(0,len(p)) ) # simple array of index
a1 = np.delete(idx, exceptions) # remove exceptions from idx (i.e., index)
a2 = np.argmax(np.delete(p, exceptions)) # get index of the max value after removing exceptions from actual p array
a1[a2] # as a1 and a2 are in sync, this will give the original index (as asked) of the max value

Convert scipy condensed distance matrix to lower matrix read by rows

I have a condensed distance matrix from scipy that I need to pass to a C function that requires the matrix be converted to the lower triangle read by rows. For example:
0 1 2 3
0 4 5
0 6
0
The condensed form of this is: [1,2,3,4,5,6] but I need to convert it to
0
1 0
2 4 0
3 5 6 0
The lower triangle read by rows is: [1,2,4,3,5,6].
I was hoping to convert the compact distance matrix to this form without creating a redundant matrix.
Here's a quick implementation--but it creates the square redundant distance matrix as an intermediate step:
In [128]: import numpy as np
In [129]: from scipy.spatial.distance import squareform
c is the condensed form of the distance matrix:
In [130]: c = np.array([1, 2, 3, 4, 5, 6])
d is the redundant square distance matrix:
In [131]: d = squareform(c)
Here's your condensed lower triangle distances:
In [132]: d[np.tril_indices(d.shape[0], -1)]
Out[132]: array([1, 2, 4, 3, 5, 6])
Here's a method that avoids forming the redundant distance matrix. The function condensed_index(i, j, n) takes the row i and column j of the redundant distance matrix, with j > i, and returns the corresponding index in the condensed distance array.
In [169]: def condensed_index(i, j, n):
...: return n*i - i*(i+1)//2 + j - i - 1
...:
As above, c is the condensed distance array.
In [170]: c
Out[170]: array([1, 2, 3, 4, 5, 6])
In [171]: n = 4
In [172]: i, j = np.tril_indices(n, -1)
Note that the arguments are reversed in the following call:
In [173]: indices = condensed_index(j, i, n)
indices gives the desired permutation of the condensed distance array.
In [174]: c[indices]
Out[174]: array([1, 2, 4, 3, 5, 6])
(Basically the same function as condensed_index(i, j, n) was given in several answers to this question.)

Operations on 'N' dimensional numpy arrays

I am attempting to generalize some Python code to operate on arrays of arbitrary dimension. The operations are applied to each vector in the array. So for a 1D array, there is simply one operation, for a 2-D array it would be both row and column-wise (linearly, so order does not matter). For example, a 1D array (a) is simple:
b = operation(a)
where 'operation' is expecting a 1D array. For a 2D array, the operation might proceed as
for ii in range(0,a.shape[0]):
b[ii,:] = operation(a[ii,:])
for jj in range(0,b.shape[1]):
c[:,ii] = operation(b[:,ii])
I would like to make this general where I do not need to know the dimension of the array beforehand, and not have a large set of if/elif statements for each possible dimension.
Solutions that are general for 1 or 2 dimensions are ok, though a completely general solution would be preferred. In reality, I do not imagine needing this for any dimension higher than 2, but if I can see a general example I will learn something!
Extra information:
I have a matlab code that uses cells to do something similar, but I do not fully understand how it works. In this example, each vector is rearranged (basically the same function as fftshift in numpy.fft). Not sure if this helps, but it operates on an array of arbitrary dimension.
function aout=foldfft(ain)
nd = ndims(ain);
for k = 1:nd
nx = size(ain,k);
kx = floor(nx/2);
idx{k} = [kx:nx 1:kx-1];
end
aout = ain(idx{:});
In Octave, your MATLAB code does:
octave:19> size(ain)
ans =
2 3 4
octave:20> idx
idx =
{
[1,1] =
1 2
[1,2] =
1 2 3
[1,3] =
2 3 4 1
}
and then it uses the idx cell array to index ain. With these dimensions it 'rolls' the size 4 dimension.
For 5 and 6 the index lists would be:
2 3 4 5 1
3 4 5 6 1 2
The equivalent in numpy is:
In [161]: ain=np.arange(2*3*4).reshape(2,3,4)
In [162]: idx=np.ix_([0,1],[0,1,2],[1,2,3,0])
In [163]: idx
Out[163]:
(array([[[0]],
[[1]]]), array([[[0],
[1],
[2]]]), array([[[1, 2, 3, 0]]]))
In [164]: ain[idx]
Out[164]:
array([[[ 1, 2, 3, 0],
[ 5, 6, 7, 4],
[ 9, 10, 11, 8]],
[[13, 14, 15, 12],
[17, 18, 19, 16],
[21, 22, 23, 20]]])
Besides the 0 based indexing, I used np.ix_ to reshape the indexes. MATLAB and numpy use different syntax to index blocks of values.
The next step is to construct [0,1],[0,1,2],[1,2,3,0] with code, a straight forward translation.
I can use np.r_ as a short cut for turning 2 slices into an index array:
In [201]: idx=[]
In [202]: for nx in ain.shape:
kx = int(np.floor(nx/2.))
kx = kx-1;
idx.append(np.r_[kx:nx, 0:kx])
.....:
In [203]: idx
Out[203]: [array([0, 1]), array([0, 1, 2]), array([1, 2, 3, 0])]
and pass this through np.ix_ to make the appropriate index tuple:
In [204]: ain[np.ix_(*idx)]
Out[204]:
array([[[ 1, 2, 3, 0],
[ 5, 6, 7, 4],
[ 9, 10, 11, 8]],
[[13, 14, 15, 12],
[17, 18, 19, 16],
[21, 22, 23, 20]]])
In this case, where 2 dimensions don't roll anything, slice(None) could replace those:
In [210]: idx=(slice(None),slice(None),[1,2,3,0])
In [211]: ain[idx]
======================
np.roll does:
indexes = concatenate((arange(n - shift, n), arange(n - shift)))
res = a.take(indexes, axis)
np.apply_along_axis is another function that constructs an index array (and turns it into a tuple for indexing).
If you are looking for a programmatic way to index the k-th dimension an n-dimensional array, then numpy.take might help you.
An implementation of foldfft is given below as an example:
In[1]:
import numpy as np
def foldfft(ain):
result = ain
nd = len(ain.shape)
for k in range(nd):
nx = ain.shape[k]
kx = (nx+1)//2
shifted_index = list(range(kx,nx)) + list(range(kx))
result = np.take(result, shifted_index, k)
return result
a = np.indices([3,3])
print("Shape of a = ", a.shape)
print("\nStarting array:\n\n", a)
print("\nFolded array:\n\n", foldfft(a))
Out[1]:
Shape of a = (2, 3, 3)
Starting array:
[[[0 0 0]
[1 1 1]
[2 2 2]]
[[0 1 2]
[0 1 2]
[0 1 2]]]
Folded array:
[[[2 0 1]
[2 0 1]
[2 0 1]]
[[2 2 2]
[0 0 0]
[1 1 1]]]
You could use numpy.ndarray.flat, which allows you to linearly iterate over a n dimensional numpy array. Your code should then look something like this:
b = np.asarray(x)
for i in range(len(x.flat)):
b.flat[i] = operation(x.flat[i])
The folks above provided multiple appropriate solutions. For completeness, here is my final solution. In this toy example for the case of 3 dimensions, the function 'ops' replaces the first and last element of a vector with 1.
import numpy as np
def ops(s):
s[0]=1
s[-1]=1
return s
a = np.random.rand(4,4,3)
print '------'
print 'Array a'
print a
print '------'
for ii in np.arange(a.ndim):
a = np.apply_along_axis(ops,ii,a)
print '------'
print ' Axis',str(ii)
print a
print '------'
print ' '
The resulting 3D array has a 1 in every element on the 'border' with the numbers in the middle of the array unchanged. This is of course a toy example; however ops could be any arbitrary function that operates on a 1D vector.
Flattening the vector will also work; I chose not to pursue that simply because the book-keeping is more difficult and apply_along_axis is the simplest approach.
apply_along_axis reference page

Numpy maximum argument item number between three columns with output as array

I would like to formulate an array which is the maximum (item # not value) between 3 columns.
E.g.
In: arr=([(1,2,3,4), (4,5,16,0), (7,8,9,2)]) # maximum of columns 0, 1, 2, 3
Out: array([2,2,1,0]) # As: 7 > 4 > 1, 8 > 5 > 2, 16 > 9 > 3, and 4 > 2 > 0
Current (non-working solution):
np.argmax([arr['f0'], arr['f1'], arr['f2']])
You can specify the axis key in numpy.argmax, which operates over a specified axis of a numpy array independently. In your case, you want to operate over each column individually by finding the index of the maximum of each column, so specify axis=0. Here's a sample run given your data in IPython:
In [10]: import numpy as np
In [11]: arr=np.array([(1,2,3), (4,5,16), (7,8,9)])
In [12]: np.argmax(arr, axis=0)
Out[12]: array([2, 2, 1])
The above example was what you had before you edited your post. With your new data in your edit, here's a sample run:
In [13]: arr=np.array([(1,2,3,4), (4,5,16,0), (7,8,9,2)])
In [14]: np.argmax(arr, axis=0)
Out[14]: array([2, 2, 1, 0])
More information about numpy.argmax can be found here: http://docs.scipy.org/doc/numpy/reference/generated/numpy.argmax.html

Convert array of string (category) to array of int from a pandas dataframe

I am trying to do something very similar to that previous question but I get an error.
I have a pandas dataframe containing features,label I need to do some convertion to send the features and the label variable into a machine learning object:
import pandas
import milk
from scikits.statsmodels.tools import categorical
then I have:
trainedData=bigdata[bigdata['meta']<15]
untrained=bigdata[bigdata['meta']>=15]
#print trainedData
#extract two columns from trainedData
#convert to numpy array
features=trainedData.ix[:,['ratio','area']].as_matrix(['ratio','area'])
un_features=untrained.ix[:,['ratio','area']].as_matrix(['ratio','area'])
print 'features'
print features[:5]
##label is a string:single, touching,nuclei,dust
print 'labels'
labels=trainedData.ix[:,['type']].as_matrix(['type'])
print labels[:5]
#convert single to 0, touching to 1, nuclei to 2, dusts to 3
#
tmp=categorical(labels,drop=True)
targets=categorical(labels,drop=True).argmax(1)
print targets
The output console yields first:
features
[[ 0.38846334 0.97681855]
[ 3.8318634 0.5724734 ]
[ 0.67710876 1.01816444]
[ 1.12024943 0.91508699]
[ 7.51749674 1.00156707]]
labels
[[single]
[touching]
[single]
[single]
[nuclei]]
I meet then the following error:
Traceback (most recent call last):
File "/home/claire/Applications/ProjetPython/projet particule et objet/karyotyper/DAPI-Trainer02-MILK.py", line 83, in <module>
tmp=categorical(labels,drop=True)
File "/usr/local/lib/python2.6/dist-packages/scikits.statsmodels-0.3.0rc1-py2.6.egg/scikits/statsmodels/tools/tools.py", line 206, in categorical
tmp_dummy = (tmp_arr[:,None]==data).astype(float)
AttributeError: 'bool' object has no attribute 'astype'
Is it possible to convert the category variable 'type' within the dataframe into int type ? 'type' can take the values 'single', 'touching','nuclei','dusts' and I need to convert with int values such 0, 1, 2, 3.
The previous answers are outdated, so here is a solution for mapping strings to numbers that works with version 0.18.1 of Pandas.
For a Series:
In [1]: import pandas as pd
In [2]: s = pd.Series(['single', 'touching', 'nuclei', 'dusts',
'touching', 'single', 'nuclei'])
In [3]: s_enc = pd.factorize(s)
In [4]: s_enc[0]
Out[4]: array([0, 1, 2, 3, 1, 0, 2])
In [5]: s_enc[1]
Out[5]: Index([u'single', u'touching', u'nuclei', u'dusts'], dtype='object')
For a DataFrame:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'labels': ['single', 'touching', 'nuclei',
'dusts', 'touching', 'single', 'nuclei']})
In [3]: catenc = pd.factorize(df['labels'])
In [4]: catenc
Out[4]: (array([0, 1, 2, 3, 1, 0, 2]),
Index([u'single', u'touching', u'nuclei', u'dusts'],
dtype='object'))
In [5]: df['labels_enc'] = catenc[0]
In [6]: df
Out[4]:
labels labels_enc
0 single 0
1 touching 1
2 nuclei 2
3 dusts 3
4 touching 1
5 single 0
6 nuclei 2
If you have a vector of strings or other objects and you want to give it categorical labels, you can use the Factor class (available in the pandas namespace):
In [1]: s = Series(['single', 'touching', 'nuclei', 'dusts', 'touching', 'single', 'nuclei'])
In [2]: s
Out[2]:
0 single
1 touching
2 nuclei
3 dusts
4 touching
5 single
6 nuclei
Name: None, Length: 7
In [4]: Factor(s)
Out[4]:
Factor:
array([single, touching, nuclei, dusts, touching, single, nuclei], dtype=object)
Levels (4): [dusts nuclei single touching]
The factor has attributes labels and levels:
In [7]: f = Factor(s)
In [8]: f.labels
Out[8]: array([2, 3, 1, 0, 3, 2, 1], dtype=int32)
In [9]: f.levels
Out[9]: Index([dusts, nuclei, single, touching], dtype=object)
This is intended for 1D vectors so not sure if it can be instantly applied to your problem, but have a look.
BTW I recommend that you ask these questions on the statsmodels and / or scikit-learn mailing list since most of us are not frequent SO users.
I am answering the question for Pandas 0.10.1. Factor.from_array seems to do the trick.
>>> s = pandas.Series(['a', 'b', 'a', 'c', 'a', 'b', 'a'])
>>> s
0 a
1 b
2 a
3 c
4 a
5 b
6 a
>>> f = pandas.Factor.from_array(s)
>>> f
Categorical:
array([a, b, a, c, a, b, a], dtype=object)
Levels (3): Index([a, b, c], dtype=object)
>>> f.labels
array([0, 1, 0, 2, 0, 1, 0])
>>> f.levels
Index([a, b, c], dtype=object)
because none of these work for dimensions>1, I made some code working for any numpy array dimensionality:
def encode_categorical(array):
d = {key: value for (key, value) in zip(np.unique(array), np.arange(len(u)))}
shape = array.shape
array = array.ravel()
new_array = np.zeros(array.shape, dtype=np.int)
for i in range(len(array)):
new_array[i] = d[array[i]]
return new_array.reshape(shape)

Categories

Resources