I have a simple one-dimensional integer array and an empty array in NumPy. When I try to concatenate them, I get a float array.
from numpy import *
a = zeros(5,'i')
a += 1
b = []
c = hstack((a,b))
d = concatenate((a, b))
print("a",a)
print("b",b)
print("c",c)
print("d",d)
I got:
a [1 1 1 1 1]
b []
c [1. 1. 1. 1. 1.]
d [1. 1. 1. 1. 1.]
But I am looking for an integer array
[1 1 1 1 1]
How? And what is the most efficient way?
Try this way:
The empty array np.array([]) defaults to float64, which makes the concatenated result float. Give the arrays an explicit integer dtype such as np.int32:
a = np.zeros(5,dtype=np.int32)
a += 1
b = np.array([],dtype=np.int32)
You can create b as a zero-size np.array with dtype 'i' rather than as a list, that is:
import numpy as np
a = np.zeros(5,'i')
a += 1
b = np.array([],'i')
c = np.hstack((a,b))
d = np.concatenate((a, b))
print(d)
Output:
[1 1 1 1 1]
NumPy gives the empty array the float64 data type by default.
If you run the following
np.array([]).dtype
it returns dtype('float64')
so you should initialize the empty array as follows:
b = []
b = np.array(b, dtype="int32")
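To see the promotion that causes this, you can ask NumPy directly with np.result_type (a quick check, not part of the original question):
import numpy as np
# int32 combined with the default float64 of np.array([]) promotes to float64
print(np.result_type(np.int32, np.float64))   # float64
# with both sides int32, the result stays int32
print(np.result_type(np.int32, np.int32))     # int32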
Is there a reason you need to build the array of ones that way? Use numpy.ones instead of numpy.zeros plus an increment to save a step, and give b an integer dtype:
import numpy
a = numpy.ones(5, dtype=int)
b = numpy.array([], dtype=int)
d = numpy.concatenate((a, b))
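If b has to stay a plain Python list for some reason, one option is to cast back after concatenating; a minimal sketch (it copies the data, so the explicit-dtype approaches above are preferable):
import numpy as np
a = np.ones(5, dtype='i')
b = []                                      # b stays a plain list
c = np.concatenate((a, b)).astype(a.dtype)  # cast the float result back to int
print(c, c.dtype)                           # [1 1 1 1 1] int32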
Forgive me if this has been asked before; I couldn't find it. I am trying to progressively sum a NumPy array into a new array using vector operations. What I mean is that the nth index of the new array equals the sum of the old array's elements up to and including index n, i.e. A[n] = B[0] + B[1] + ... + B[n]. I know how to do this using a for loop, but I'm looking for a vectorized solution.
Here is my non-vectorized solution:
import numpy as np

A = np.arange(10)
B = np.empty(10)
for i in range(len(A)):
    B[i] = sum(A[0:i+1])
print(B)
You can do it like this:
import numpy as np
A = np.arange(10)
B = np.cumsum(A)
# [ 0 1 3 6 10 15 21 28 36 45]
Thanks
The "progressive" sum is called cumulative sum. Use NumPy's cumsum for this.
Using your example and comparing B to np.cumsum(A) results in equal arrays:
>>> import numpy as np
>>> A = np.arange(10)
>>> B = np.empty(10)
>>> for i in range(len(A)):
...     B[i] = sum(A[0:i+1])
...
>>> np.array_equal(B, np.cumsum(A))
True
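As a side note, np.cumsum also takes an axis argument, so the same idea extends to 2-D arrays; a small sketch:
import numpy as np
M = np.arange(6).reshape(2, 3)
print(np.cumsum(M, axis=1))   # cumulative sum along each row
# [[ 0  1  3]
#  [ 3  7 12]]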
I have two 2D arrays like:
A = array([[4, 5, 6],
           [0, 7, 8],
           [0, 9, 0]])

B = array([[11, 12, 13],
           [14, 15, 16],
           [17, 18, 19]])
Wherever an element of A is 0, I want to set the corresponding element of B to 0, store the changed matrix in a new variable, and keep the original B unchanged.
Thanks in advance.
import numpy as np

A = np.array([[4, 5, 6],
              [0, 7, 8],
              [0, 9, 0]])
B = np.array([[11, 12, 13],
              [14, 15, 16],
              [17, 18, 19]])

C = B.copy()   # keep an untouched copy of B
B[A == 0] = 0  # zero out B wherever A is zero
C, B = B, C    # swap, so B is the original again and C holds the changed matrix
The line B[A == 0] = 0 first evaluates A == 0, which returns a boolean array that is True at every position where A is zero. That boolean array is then used to mask B, assigning 0 at every index where the mask is True.
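An alternative that never modifies B in place is np.where; a sketch using the arrays from the question:
import numpy as np

A = np.array([[4, 5, 6],
              [0, 7, 8],
              [0, 9, 0]])
B = np.array([[11, 12, 13],
              [14, 15, 16],
              [17, 18, 19]])

# pick 0 where A is 0, and the original B value everywhere else
C = np.where(A == 0, 0, B)
print(C)   # B itself is left unchanged
# [[11 12 13]
#  [ 0 15 16]
#  [ 0 18  0]]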
I have a nested array with some values. I have another array, where the length of both arrays are equal. I'd like to get an output, where I have a nested array of 1's and 0's, such that it is 1 where the value in the second array was equal to the value in that nested array.
I've taken a look on existing stack overflow questions but have been unable to construct an answer.
masks_list = []
for i in range(len(y_pred)):
    mask = (y_pred[i] == y_test.values[i]) * 1
    masks_list.append(mask)
masks = np.array(masks_list)
Essentially, that's the code I currently have and it works, but I don't think it's the most efficient way of doing it.
YPRED:
[[4 0 1 2 3 5 6]
[0 1 2 3 5 6 4]]
YTEST:
8 1
5 4
Masks:
[[0 0 1 0 0 0 0]
[0 0 0 0 0 0 1]]
Another solution, with fewer lines of code:
a = set(y_pred).intersection(y_test)
f = [1 if i in a else 0 for i, j in enumerate(y_pred)]
After that you can compare the performance of the two approaches, as in this answer:
import time
from time import perf_counter as pc

t0 = pc()
a = set(y_pred).intersection(y_test)
f = [1 if i in a else 0 for i, j in enumerate(y_pred)]
t1 = pc() - t0

t0 = pc()
for i in range(len(y_pred)):
    mask = (y_pred[i] == y_test[i]) * 1
    masks_list.append(mask)
t2 = pc() - t0

val = t1 - t2
If val is positive, the first solution is slower.
If you have an np.array instead of a list, you can convert it as described in this answer:
type(y_pred)
>> numpy.ndarray
y_pred = y_pred.tolist()
type(y_pred)
>> list
Idea (no explicit loop): compare the nested array against the other array directly; adding an axis to y_test.values lets the shapes broadcast row by row:
masks = np.equal(y_pred, y_test.values[:, None])
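A sketch with the example data from the question (here y_test is taken as a plain array of the two target values, one per row of y_pred):
import numpy as np

y_pred = np.array([[4, 0, 1, 2, 3, 5, 6],
                   [0, 1, 2, 3, 5, 6, 4]])
y_true = np.array([1, 4])   # one target value per row of y_pred

# the extra axis makes the shapes (2, 7) and (2, 1) broadcast row by row
masks = (y_pred == y_true[:, None]).astype(int)
print(masks)
# [[0 0 1 0 0 0 0]
#  [0 0 0 0 0 0 1]]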
you can look at this too:
np.array_equal(A,B) # test if same shape, same elements values
np.array_equiv(A,B) # test if broadcastable shape, same elements values
np.allclose(A,B,...) # test if same shape, elements have close enough values
Question
Is there a good way to transform a DataFrame with an n-level index into an n-D Numpy array (a.k.a n-tensor)?
Example
Suppose I set up a DataFrame like
from pandas import DataFrame, MultiIndex
index = range(2), range(3)
value = range(2 * 3)
frame = DataFrame(value, columns=['value'],
                  index=MultiIndex.from_product(index)).drop((1, 0))
print(frame)
which outputs
     value
0 0      0
  1      1
  2      2
1 1      4
  2      5
The index is a 2-level hierarchical index. I can extract a 2-D Numpy array from the data using
print(frame.unstack().values)
which outputs
[[ 0.  1.  2.]
 [nan  4.  5.]]
How does this generalize to an n-level index?
Playing with unstack(), it seems that it can only be used to massage the 2-D shape of the DataFrame, but not to add an axis.
I cannot use e.g. frame.values.reshape(x, y, z), since this would require that the frame contains exactly x * y * z rows, which cannot be guaranteed. This is what I tried to demonstrate by drop()ing a row in the above example.
Any suggestions are highly appreciated.
Edit. This approach is much more elegant (and two orders of magnitude faster) than the one I gave below.
import numpy as np

# create an array of NaN with the right dimensions
shape = [len(lvl) for lvl in frame.index.levels]
arr = np.full(shape, np.nan)

# fill it using NumPy's advanced indexing
arr[tuple(frame.index.codes)] = frame.values.flat
# ...or in pandas < 0.24.0, use
# arr[tuple(frame.index.labels)] = frame.values.flat
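Applied to the 2-level frame from the question, a minimal sketch (assuming pandas >= 0.24, where the level codes are exposed as frame.index.codes):
import numpy as np
from pandas import DataFrame, MultiIndex

index = range(2), range(3)
frame = DataFrame(range(2 * 3), columns=['value'],
                  index=MultiIndex.from_product(index)).drop((1, 0))

shape = [len(lvl) for lvl in frame.index.levels]
arr = np.full(shape, np.nan)
arr[tuple(frame.index.codes)] = frame.values.flat
print(arr)
# [[ 0.  1.  2.]
#  [nan  4.  5.]]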
Original solution. Given a setup similar to above, but in 3-D,
from pandas import DataFrame, MultiIndex
from itertools import product
index = range(2), range(2), range(2)
value = range(2 * 2 * 2)
frame = DataFrame(value, columns=['value'],
                  index=MultiIndex.from_product(index)).drop((1, 0, 1))
print(frame)
we have
       value
0 0 0      0
    1      1
  1 0      2
    1      3
1 0 0      4
  1 0      6
    1      7
Now, we proceed using the reshape() route, but with some preprocessing to ensure that the length along each dimension will be consistent.
First, reindex the data frame with the full cartesian product of all dimensions. NaN values will be inserted as needed. This operation can be both slow and memory-hungry, depending on the number of dimensions and the size of the data frame.
levels = map(tuple, frame.index.levels)
index = list(product(*levels))
frame = frame.reindex(index)
print(frame)
which outputs
       value
0 0 0      0
    1      1
  1 0      2
    1      3
1 0 0      4
    1    NaN
  1 0      6
    1      7
Now, reshape() will work as intended.
shape = [len(lvl) for lvl in frame.index.levels]
print(frame.values.reshape(shape))
which outputs
[[[ 0.  1.]
  [ 2.  3.]]

 [[ 4. nan]
  [ 6.  7.]]]
The (rather ugly) one-liner is
frame.reindex(list(product(*map(tuple, frame.index.levels)))).values \
    .reshape([len(lvl) for lvl in frame.index.levels])
This can be done quite nicely using the Python xarray package which can be found here: http://xarray.pydata.org/en/stable/. It has great integration with Pandas and is quite intuitive once you get to grips with it.
If you have a multiindex series you can call the built-in method multiindex_series.to_xarray() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_xarray.html). This will generate a DataArray object, which is essentially a name-indexed numpy array, using the index values and names as coordinates. Following this you can call .values on the DataArray object to get the underlying numpy array.
If you need your tensor to conform to a set of keys in a specific order, you can also call .reindex(index_name = index_values_in_order) (http://xarray.pydata.org/en/stable/generated/xarray.DataArray.reindex.html) on the DataArray. This can be extremely useful and makes working with the newly generated tensor much easier!
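A minimal sketch of that route, assuming xarray is installed, applied to the 3-level frame built in the answer above:
import pandas as pd

index = pd.MultiIndex.from_product([range(2), range(2), range(2)])
frame = pd.DataFrame({'value': range(8)}, index=index).drop((1, 0, 1))

# to_xarray unstacks the index levels into dimensions;
# missing index combinations become NaN
tensor = frame['value'].to_xarray().values
print(tensor.shape)   # (2, 2, 2)
print(tensor)         # the dropped row (1, 0, 1) shows up as nan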
In R I can do:
> y = c(2,3)
> x = c(4,5)
> z = data.frame(x,y)
> z[3,3]<-6
> z
x y V3
1 4 2 NA
2 5 3 NA
3 NA NA 6
R automatically fills the empty cells with NA.
If I use numpy.insert from numpy, numpy throws by default an error:
import numpy
y = [2,3]
x = [4,5]
z = numpy.array([y, x])
z = numpy.insert(z, 3, 6, 3)
IndexError: axis 3 is out of bounds for an array of dimension 2
Is there a way to insert values in a way that works similar to R in numpy?
numpy is more of a replacement for R's matrices, and not so much for its data frames. You should consider using python's pandas library for this. For example:
In [1]: import pandas
In [2]: y = pandas.Series([2,3])
In [3]: x = pandas.Series([4,5])
In [4]: z = pandas.DataFrame([x,y])
In [5]: z
Out[5]:
0 1
0 4 5
1 2 3
In [19]: z.loc[3,3] = 6
In [20]: z
Out[20]:
0 1 3
0 4 5 NaN
1 2 3 NaN
3 NaN NaN 6
In numpy you need to initialize an array with the appropriate size:
z = numpy.empty((3, 3))
z.fill(numpy.nan)
z[:2, 0] = x   # first column: x
z[:2, 1] = y   # second column: y
z[2, 2] = 6
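With the x and y from the question, printing z should then give:
print(z)
# [[ 4.  2. nan]
#  [ 5.  3. nan]
#  [nan nan  6.]]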
Looking at the raised error, it is possible to understand why it occurred:
you are trying to insert values along an axis that does not exist in z.
You can fix it by doing:
import numpy as np

y = [2, 3]
x = [4, 5]
array = np.array([y, x])
z = np.insert(array, 1, [3, 6], axis=1)
The interface is quite different from R's. If you are using IPython,
you can easily access the documentation for any numpy function, in this case
np.insert, by running:
help(np.insert)
which gives you the function signature, explains each parameter, and provides
some examples.
You could, alternatively, do:
import numpy as np

x = [4, 5]
y = [2, 3]
array = np.array([y, x])
z = [3, 6]
new_array = np.vstack([array.T, z]).T   # append z as a new column
# or, equivalently:
# new_array = np.hstack([array, np.array(z)[:, np.newaxis]])
Also, have a look at the pandas module. It provides
an interface similar to what you asked for, implemented on top of numpy.
With pandas you could do something like:
import pandas as pd
data = {'y':[2,3], 'x':[4,5]}
dataframe = pd.DataFrame(data)
dataframe['z'] = [3,6]
which gives the nice output:
   x  y  z
0  4  2  3
1  5  3  6
If you want a more R-like experience within python, I can highly recommend pandas, which is a higher-level numpy based library, which performs operations of this kind.