I am trying to convert a multi-index pandas DataFrame into a numpy.ndarray. The DataFrame is below:
s1 s2 s3 s4
Action State
1 s1 0.0 0 0.8 0.2
s2 0.1 0 0.9 0.0
2 s1 0.0 0 0.9 0.1
s2 0.0 0 1.0 0.0
I would like the resulting numpy.ndarray to be the following with np.shape() = (2,2,4):
[[[ 0.0 0.0 0.8 0.2 ]
[ 0.1 0.0 0.9 0.0 ]]
[[ 0.0 0.0 0.9 0.1 ]
[ 0.0 0.0 1.0 0.0]]]
I have tried df.as_matrix() but this returns:
[[ 0. 0. 0.8 0.2]
[ 0.1 0. 0.9 0. ]
[ 0. 0. 0.9 0.1]
[ 0. 0. 1. 0. ]]
How do I return a list of lists for the first level, with each list representing an Action's records?
You could use the following:
dim = len(df.index.get_level_values(0).unique())
result = df.values.reshape((dim, dim, df.shape[1]))
print(result)
[[[ 0. 0. 0.8 0.2]
[ 0.1 0. 0.9 0. ]]
[[ 0. 0. 0.9 0.1]
[ 0. 0. 1. 0. ]]]
The first line finds the number of groups you want to group by.
Why this (or groupby) is needed: as soon as you use .values, you lose the dimensionality of the MultiIndex from pandas, so you need to re-pass that dimensionality to NumPy in some way.
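For instance, with the example frame above:

print(df.values.shape)   # (4, 4) -- the Action/State structure is flattened into plain rows
print(result.shape)      # (2, 2, 4) -- the reshape restores it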
One way
In [151]: df.groupby(level=0).apply(lambda x: x.values.tolist()).values
Out[151]:
array([[[0.0, 0.0, 0.8, 0.2],
[0.1, 0.0, 0.9, 0.0]],
[[0.0, 0.0, 0.9, 0.1],
[0.0, 0.0, 1.0, 0.0]]], dtype=object)
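The dtype=object result above keeps each row as a Python list. If you need a regular float array instead, one small variation (my addition, not part of the original answer) is to stack the per-group blocks:

np.stack([g.values for _, g in df.groupby(level=0)])
# array of shape (2, 2, 4), dtype float64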
Using Divakar's suggestion, np.reshape() worked:
>>> print(P)
s1 s2 s3 s4
Action State
1 s1 0.0 0 0.8 0.2
s2 0.1 0 0.9 0.0
2 s1 0.0 0 0.9 0.1
s2 0.0 0 1.0 0.0
>>> result = np.reshape(P, (2, 2, -1))
>>> result
array([[[0. , 0. , 0.8, 0.2],
        [0.1, 0. , 0.9, 0. ]],

       [[0. , 0. , 0.9, 0.1],
        [0. , 0. , 1. , 0. ]]])
>>> np.shape(result)
(2, 2, 4)
Elaborating on Brad Solomon's answer, to get a slightly more generic solution - index levels of different sizes and an arbitrary number of levels - one could do something like this:
def df_to_numpy(df):
    try:
        # MultiIndex: one output dimension per index level
        shape = [len(level) for level in df.index.levels]
    except AttributeError:
        # plain Index: a single dimension
        shape = [len(df.index)]
    ncol = df.shape[-1]
    if ncol > 1:
        shape.append(ncol)
    return df.to_numpy().reshape(shape)
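Applied to the example frame at the top, this should reproduce the (2, 2, 4) result:

arr = df_to_numpy(df)
print(arr.shape)  # (2, 2, 4)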
If df has missing sub-indexes, reshape will not work. One way to add them would be (maybe there are better solutions):
def enforce_df_shape(df):
    try:
        # full cartesian product of all index levels
        ind = pd.MultiIndex.from_product([level.values for level in df.index.levels])
    except AttributeError:
        # not a MultiIndex: nothing to pad
        return df
    fulldf = pd.DataFrame(-1, columns=df.columns, index=ind)  # remove -1 to fill fulldf with nan
    fulldf.update(df)
    return fulldf
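The two helpers compose naturally; a hypothetical usage for a frame with missing sub-indexes:

arr = df_to_numpy(enforce_df_shape(df))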
If you are just trying to pull out one column, say s1, and get an array with shape (2, 2), you can use df.index.levshape like this:
x = df.s1.to_numpy().reshape(df.index.levshape)
This will give you a (2, 2) array containing the values of s1.
Related
I have a matrix:
matrix = np.array([[[0,0.5,0.6],[0.9,1.2,0]],[[0,0.5,0.6],[0.9,1.2,0]]])
I want to replace all the values 0.55 < x < 0.95 with 0.55.
PS: My question is similar to this question, but the answer does not work in my case.
You can use np.where:
matrix = np.array([[[0,0.5,0.6],[0.9,1.2,0]],[[0,0.5,0.6],[0.9,1.2,0]]])
matrix[np.where((matrix > 0.55) & (matrix < 0.95))] = 0.55
# Or
# matrix[(matrix > 0.55) & (matrix < 0.95)] = 0.55
Output:
>>> matrix
array([[[0. , 0.5 , 0.55],
[0.55, 1.2 , 0. ]],
[[0. , 0.5 , 0.55],
[0.55, 1.2 , 0. ]]])
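If you prefer not to modify the array in place, the three-argument form of np.where builds a new array instead (a small variation on the above):

result = np.where((matrix > 0.55) & (matrix < 0.95), 0.55, matrix)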
I have a pandas DataFrame (shape = (34, 19)) which I want to use as a lookup table.
But the values I want to look up fall "between" the values in the DataFrame.
For example:
0.1 0.2 0.3 0.4 0.5
0.1 4.01 31.86 68.01 103.93 139.2
0.2 24.07 57.49 91.37 125.21 158.57
0.3 44.35 76.4 108.97 141.57 173.78
0.4 59.66 91.02 122.8 154.62 186.13
0.5 87.15 117.9 148.86 179.83 210.48
0.6 106.92 137.41 168.26 198.99 229.06
0.7 121.73 152.48 183.4 213.88 243.33
I now want to look up the value for x = 5.5, y = 1.004, so the answer should be around 114.
I tried different methods from scipy but the values I get are always way off. The last method I used was:
inter = interpolate.interpn([np.array(np.arange(34)), np.array(np.arange(19))], np_matrix, [x_value, y_value])
I even get wrong values for points in the grid which do exist.
Can someone tell me what I'm doing wrong or recommend an easy solution for the task?
EDIT:
An additional problem is:
My raw data, from an .xlsx file, look like:
0.1 0.2 0.3 0.4 0.5
0.1 4.01 31.86 68.01 103.93 139.2
0.2 24.07 57.49 91.37 125.21 158.57
0.3 44.35 76.4 108.97 141.57 173.78
0.4 59.66 91.02 122.8 154.62 186.13
0.5 87.15 117.9 148.86 179.83 210.48
0.6 106.92 137.41 168.26 198.99 229.06
0.7 121.73 152.48 183.4 213.88 243.33
But pandas adds an Index column:
0.1 0.2 0.3 0.4 0.5
0 0.1 4.01 31.86 68.01 103.93 139.2
1 0.2 24.07 57.49 91.37 125.21 158.57
2 0.3 44.35 76.4 108.97 141.57 173.78
3 0.4 59.66 91.02 122.8 154.62 186.13
4 0.8 87.15 117.9 148.86 179.83 210.48
5 1.0 106.92 137.41 168.26 198.99 229.06
6 1.7 121.73 152.48 183.4 213.88 243.33
So if I want to access x = 0.4 y = 0.15 I have to input x = 3, y = 0.15.
Data are read with:
model_references = pd.ExcelFile(model_references_path)
Matrix = model_references.parse('Model_References')
n = Matrix.stack().reset_index().values
out = interpolate.griddata(n[:,0:2], n[:,2], (Stroke, Current), method='cubic')
You can reshape the data to three columns with stack - the first column for the index, the second for the columns and the last for the values - then interpolate with scipy.interpolate.griddata:
from scipy.interpolate import griddata
a = 5.5
b = 1.004
n = df.stack().reset_index().values
#https://stackoverflow.com/a/8662243
out = griddata(n[:,0:2], n[:,2], [(a, b)], method='linear')
print (out)
[104.563]
Detail:
n = df.stack().reset_index().values
print (n)
[[ 1. 1. 4.01]
[ 1. 2. 31.86]
[ 1. 3. 68.01]
[ 1. 4. 103.93]
[ 1. 5. 139.2 ]
[ 2. 1. 24.07]
[ 2. 2. 57.49]
[ 2. 3. 91.37]
[ 2. 4. 125.21]
[ 2. 5. 158.57]
[ 3. 1. 44.35]
[ 3. 2. 76.4 ]
[ 3. 3. 108.97]
[ 3. 4. 141.57]
[ 3. 5. 173.78]
[ 4. 1. 59.66]
[ 4. 2. 91.02]
[ 4. 3. 122.8 ]
[ 4. 4. 154.62]
[ 4. 5. 186.13]
[ 5. 1. 87.15]
[ 5. 2. 117.9 ]
[ 5. 3. 148.86]
[ 5. 4. 179.83]
[ 5. 5. 210.48]
[ 6. 1. 106.92]
[ 6. 2. 137.41]
[ 6. 3. 168.26]
[ 6. 4. 198.99]
[ 6. 5. 229.06]
[ 7. 1. 121.73]
[ 7. 2. 152.48]
[ 7. 3. 183.4 ]
[ 7. 4. 213.88]
[ 7. 5. 243.33]]
Try interp2d from scipy.
import numpy as np
from scipy.interpolate import interp2d
x = [1, 2, 3, 4, 5, 6, 7]
y = [1, 2, 3, 4, 5]
z = [[4.01, 31.86, 68.01, 103.93, 139.2],
[24.07, 57.49, 91.37, 125.21, 158.57],
[44.35, 76.4, 108.97, 141.57, 173.78],
[59.66, 91.02, 122.8, 154.62, 186.13],
[87.15, 117.9, 148.86, 179.83, 210.48],
[106.92, 137.41, 168.26, 198.99, 229.06],
[121.73, 152.48, 183.4, 213.88, 243.33]]
z = np.array(z).T
f = interp2d(x, y, z)
f(x=5.5, y=1.004)  # returns array([97.15748])
Try changing the kind argument to experiment with the interpolation and its return value.
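Note that interp2d is deprecated in recent SciPy releases (and removed in SciPy 1.14). A rough equivalent with RegularGridInterpolator, under the same grid assumptions and without the transpose, would be:

import numpy as np
from scipy.interpolate import RegularGridInterpolator

x = np.arange(1, 8)  # row coordinates 1..7
y = np.arange(1, 6)  # column coordinates 1..5
z = np.array([[4.01, 31.86, 68.01, 103.93, 139.2],
              [24.07, 57.49, 91.37, 125.21, 158.57],
              [44.35, 76.4, 108.97, 141.57, 173.78],
              [59.66, 91.02, 122.8, 154.62, 186.13],
              [87.15, 117.9, 148.86, 179.83, 210.48],
              [106.92, 137.41, 168.26, 198.99, 229.06],
              [121.73, 152.48, 183.4, 213.88, 243.33]])  # shape (7, 5)

f = RegularGridInterpolator((x, y), z)  # linear (bilinear) by default
print(f([[5.5, 1.004]]))  # [97.15748]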
I have a numpy array like this:
print(pred_galactic_prob.shape)
print(pred_galactic_prob[0:3])
(465, 5)
[[0.05 0.94 0.3 0.01 0.5 ]
[0.01 0.02 0.01 0.85 0.11]
[0.03 0.95 0.3 0.3 0.02]]
I want to append to this and change the shape so there are 13 columns and it would look like this:
[[0.05 0. 0.94 0. 0. 0.3 0. 0. 0.01 0. 0. 0. 0.5 ]
[0.01 0. 0.02 0. 0. 0.01 0. 0. 0.85 0. 0. 0. 0.11]
[0.03 0. 0.95 0. 0. 0.3 0. 0. 0.3 0. 0. 0. 0.02]]
i.e. a column of all 0. is added after the first column, two columns of all 0. are added after the second entry, and so on, per above.
I have tried the following:
pred_galactic_prob2 = np.array
for i in pred_galactic_prob:
pred_galactic_prob2 = np.append(pred_galactic_prob2, [i[0], 0.0, i[1], 0.0, 0.0, i[2], 0.0, 0.0, i[3], 0.0, 0.0, 0.0, i[4]])
but this just turns it into a 1D array.
A "one-line" solution would be
np.concatenate((a[:,:1],
np.lib.stride_tricks.as_strided(0,[len(a),1],[0,0]),
a[:,1:2],
np.lib.stride_tricks.as_strided(0,[len(a),2],[0,0]),
a[:,2:3],
np.lib.stride_tricks.as_strided(0,[len(a),2],[0,0]),
a[:,3:4],
np.lib.stride_tricks.as_strided(0,[len(a),3],[0,0]),
a[:,4:]), -1)
Though it is unwieldy either way; using append would need even more as_strided calls. I believe there should be an append-ish function that automatically broadcasts its input, but I'm not sure what it is. Anyway, a better solution is definitely the one @hpaulj mentioned:
b = np.zeros((len(a), 13), a.dtype)
b[:,[0,2,5,8,12]] = a
Here a is the input and b is the output.
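For completeness, a runnable sketch with the three rows shown in the question:

import numpy as np

a = np.array([[0.05, 0.94, 0.3, 0.01, 0.5],
              [0.01, 0.02, 0.01, 0.85, 0.11],
              [0.03, 0.95, 0.3, 0.3, 0.02]])

b = np.zeros((len(a), 13), a.dtype)  # all-zero target with 13 columns
b[:, [0, 2, 5, 8, 12]] = a           # scatter the original columns into place
print(b)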
The question of how to compute a running mean of a series of numbers has been asked and answered before. However, I am trying to compute the running mean of a series of ndarrays, with an unknown length of series. So, for example, I have an iterator data where I would do:
running_mean = np.zeros((1000,3))
while True:
datum = next(data)
running_mean = calc_running_mean(datum)
What would calc_running_mean look like? My primary concern here is memory, as I can't have the entirety of the data in memory, and I don't know how much data I will be receiving. datum would be an ndarray, let's say that for this example it's a (1000,3) array, and the running mean would be an array of the same size, with each element containing the elementwise mean of every element we've seen in that position so far.
The key distinction this question has from previous questions is that it's calculating the elementwise mean of a series of ndarrays, and the number of arrays is unknown.
You can use itertools together with standard operators:
>>> import itertools, operator
>>> running_sum = itertools.accumulate(data)
>>> running_mean = map(operator.truediv, running_sum, itertools.count(1))
Example:
>>> data = (np.linspace(-i, i*i, 6) for i in range(10))
>>>
>>> running_sum = itertools.accumulate(data)
>>> running_mean = map(operator.truediv, running_sum, itertools.count(1))
>>>
>>> for i in running_mean:
... print(i)
...
[0. 0. 0. 0. 0. 0.]
[-0.5 -0.3 -0.1 0.1 0.3 0.5]
[-1. -0.46666667 0.06666667 0.6 1.13333333 1.66666667]
[-1.5 -0.5 0.5 1.5 2.5 3.5]
[-2. -0.4 1.2 2.8 4.4 6. ]
[-2.5 -0.16666667 2.16666667 4.5 6.83333333 9.16666667]
[-3. 0.2 3.4 6.6 9.8 13. ]
[-3.5 0.7 4.9 9.1 13.3 17.5]
[-4. 1.33333333 6.66666667 12. 17.33333333 22.66666667]
[-4.5 2.1 8.7 15.3 21.9 28.5]
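If you want an explicit calc_running_mean-style function instead, here is a minimal sketch of the standard incremental update (my naming, assuming equally shaped arrays as in the question):

import numpy as np

def running_mean_stream(data):
    """Yield the elementwise mean of all arrays seen so far.

    Only the current mean and the newest datum are ever in memory."""
    mean = None
    for n, datum in enumerate(data, start=1):
        if mean is None:
            mean = np.array(datum, dtype=float)
        else:
            mean += (datum - mean) / n  # incremental mean update
        yield mean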
I have a fairly big matrix (4780, 5460) and computed the Spearman correlation between rows using both pandas.DataFrame.corr and scipy.stats.spearmanr. Each function returns very different correlation coefficients, and now I am not sure which is "correct", or whether my dataset is better suited to a different implementation.
Some context: the vectors (rows) I want to test for correlation do not necessarily have all the same points; there are NaNs in some columns and not in others.
df.T.corr(method='spearman')
(r, p) = spearmanr(df.T)
df2 = pd.DataFrame(index=df.index, columns=df.columns, data=r)
In[47]: df['320840_93602.563']
Out[47]:
320840_93602.563 1.000000
3254_642.148.peg.3256 0.565812
13752_42938.1206 0.877192
319002_93602.870 0.225530
328_642.148.peg.330 0.658269
...
12566_42938.19 0.818395
321125_93602.2882 0.535577
319185_93602.1135 0.678397
29724_39.3584 0.770453
321030_93602.1962 0.738722
Name: 320840_93602.563, dtype: float64
In[32]: df2['320840_93602.563']
Out[32]:
320840_93602.563 1.000000
3254_642.148.peg.3256 0.444675
13752_42938.1206 0.286933
319002_93602.870 0.225530
328_642.148.peg.330 0.606619
...
12566_42938.19 0.212265
321125_93602.2882 0.587409
319185_93602.1135 0.696172
29724_39.3584 0.097753
321030_93602.1962 0.163417
Name: 320840_93602.563, dtype: float64
scipy.stats.spearmanr is not designed to handle nan, and its behavior with nan values is undefined. [Update: scipy.stats.spearmanr now has the argument nan_policy.]
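For reference, a minimal sketch using that argument (assuming a SciPy version that has it):

from scipy.stats import spearmanr
rho, p = spearmanr(df.values.T, nan_policy='omit')

Even then, pandas' corr drops missing values pairwise for each pair of rows, so the two results can still differ on data with NaNs depending on how the omission is applied.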
For data without nans, the functions appear to agree:
In [92]: np.random.seed(123)
In [93]: df = pd.DataFrame(np.random.randn(5, 5))
In [94]: df.T.corr(method='spearman')
Out[94]:
0 1 2 3 4
0 1.0 -0.8 0.8 0.7 0.1
1 -0.8 1.0 -0.7 -0.7 -0.1
2 0.8 -0.7 1.0 0.8 -0.1
3 0.7 -0.7 0.8 1.0 0.5
4 0.1 -0.1 -0.1 0.5 1.0
In [95]: rho, p = spearmanr(df.values.T)
In [96]: rho
Out[96]:
array([[ 1. , -0.8, 0.8, 0.7, 0.1],
[-0.8, 1. , -0.7, -0.7, -0.1],
[ 0.8, -0.7, 1. , 0.8, -0.1],
[ 0.7, -0.7, 0.8, 1. , 0.5],
[ 0.1, -0.1, -0.1, 0.5, 1. ]])