Best way to get a specific column as y in pandas DataFrame - python

I want to extract one specific column as y from a pandas DataFrame.
I found two ways to do this so far:
# The first way
y_df = df[specific_column]
y_array = np.array(y_df)
X_df = df.drop(columns=[specific_column])
X_array = np.array(X_df)
# The second way
features = ['some columns in my dataset']
y_df = np.array(df.loc[:, [specific_column]].values)
X_df = df.loc[:, features].values
But when I compare the values in each y array, I see they are not equal:
y[:4]==y_array[:4]
array([[ True,  True, False, False],
       [ True,  True, False, False],
       [False, False,  True,  True],
       [False, False,  True,  True]])
But I am sure that these two arrays contain the same elements:
y[:4], y_array[:4]
(array([[0],
        [0],
        [1],
        [1]], dtype=int64),
 array([0, 0, 1, 1], dtype=int64))
So, why do I see False values when I compare them together?

If you use double brackets [[]] you get a one-column DataFrame, and converting it to an array gives a 2D array:
y_df = np.array(df.loc[:, [specific_column]].values)
The solution is to remove the brackets to get a Series; converting that gives a 1D array:
y_df = df[specific_column].to_numpy()
# your solution, fixed
y_df = np.array(df.loc[:, specific_column].values)
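As for why the comparison showed False values: NumPy broadcasts the (4, 1) array against the (4,) array and compares every pair of elements, producing a (4, 4) matrix instead of an element-wise result. A minimal sketch reproducing the effect (with standalone arrays standing in for your data):

import numpy as np

y = np.array([[0], [0], [1], [1]])  # shape (4, 1), as from df.loc[:, [specific_column]]
y_array = np.array([0, 0, 1, 1])    # shape (4,), as from df[specific_column]

# Broadcasting compares every element of y with every element of y_array:
print((y == y_array).shape)  # (4, 4)

# With matching 1-D shapes, the comparison is element-wise as intended:
print(y.ravel() == y_array)  # [ True  True  True  True]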

Related

Optimal way to modify value of a numpy array based on condition

I have a numpy.ndarray of the form
import numpy as np
my_array = np.array([[True, True, False], [True, False, True]])
In this example it is a matrix of 3 columns and 2 rows, but my_array should be thought of as having an arbitrary 2D shape. On the other hand, I have a numpy.ndarray representing a vector W whose length equals the number of rows of my_array; this vector has float values, for example W = np.array([10., 1.5]). Additionally, I have a list WT of two-tuples with the same length as W, for example WT = [(0, 20.), (0, 1.)]. These tuples represent open mathematical intervals (a, b).
I want to modify the column values of my_array based on the following condition: given a column, we change its i-th element to False (or keep it False if it already was) if the i-th element of W does not belong to the mathematical interval given by the i-th two-tuple of WT. For example, the first column of my_array is [True, True], so we have to check whether 10. belongs to (0, 20) and 1.5 belongs to (0, 1); the resulting column should be [True, False].
I have a for loop, but I think there is a smarter way to do this.
Obs: I don't need to change values from False to True.
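(The asker's loop is not shown; a sketch of what such a loop might look like, using the names from the question:)

import numpy as np

my_array = np.array([[True, True, False], [True, False, True]])
W = np.array([10.0, 1.5])
WT = [(0, 20.0), (0, 1.0)]

# For each row i, clear the whole row if W[i] falls outside the open
# interval WT[i]; True values inside the interval are kept.
for i, (a, b) in enumerate(WT):
    if not (a < W[i] < b):
        my_array[i, :] = False

print(my_array)  # [[ True  True False]
                 #  [False False False]]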
I made this implementation:
import numpy as np
my_array = np.array([[True, True, False], [True, False, True]])
W = np.array([10.0, 1.5])
WT = np.array([[0, 20], [0, 1]])
# True where W lies strictly inside the corresponding interval
i = (W > WT[:, 0]) & (W < WT[:, 1])
print("my_array before", my_array)
my_array[:, 0] = i
print("my_array after", my_array)
This updates the first column given your conditions; the same mask can be applied to every column at once, as sketched below.
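To update all columns in one step, broadcast the row mask with a logical AND; using AND (rather than plain assignment) guarantees that False values are never flipped back to True:

# i has shape (2,); i[:, None] has shape (2, 1) and broadcasts across columns.
my_array &= i[:, None]
print(my_array)
# [[ True  True False]
#  [False False False]]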

np.meshgrid throws DeprecationWarning or MemoryError for large inputs

For a clustering problem I am trying to create the ideal similarity matrix. That is, I have a one-dimensional array of cluster labels and need to create a two-dimensional binary or boolean matrix with an entry of 1 iff two data points belong to the same cluster.
To do so I use np.meshgrid but it only works for smaller examples. Here's an MWE:
With an array of size 5 it works as desired:
import numpy as np

arr = np.random.randint(0, 10, size=5)
print(arr)
mesh_grid = np.meshgrid(arr, arr, sparse=True)
mesh_grid[0] == mesh_grid[1]
gives
[9 0 9 0 7]
array([[ True, False,  True, False, False],
       [False,  True, False,  True, False],
       [ True, False,  True, False, False],
       [False,  True, False,  True, False],
       [False, False, False, False,  True]])
However, with an array of size 60000 it does not work:
arr = np.random.randint(0, 10, size=60000)
mesh_grid = np.meshgrid(arr, arr, sparse=True)
mesh_grid[0] == mesh_grid[1]
gives
DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
mesh_grid[0] == mesh_grid[1]
Setting sparse=False throws a MemoryError. And based on this answer I assume the DeprecationWarning must be due to memory too.
Question: How can I solve this or is there another more efficient way to obtain the desired matrix?
If, for example, your array is composed of only 10 different elements (0, 1, 2, ...), then you only need to compare your array with those 10 elements and not with the whole matrix.
So you can do the following operations:
# Number of distinct elements
n = 3
# Generate a random 1-D vector of labels in [0, n)
arr = np.random.randint(0, n, size=10)
# Column vector containing all the distinct elements (shape (n, 1))
num = np.r_[0:n][:, None]
# Broadcast the two vectors to obtain an n x 10 matrix:
# uni[k, j] is True iff arr[j] == k
uni = arr == num
# Pick, for each element of arr, its matching row of uni:
# res[i, j] is True iff arr[i] == arr[j]
res = uni[arr]  # 10 x 10 matrix
You can use np.unique() to extract the unique values of arr in case your values are not the consecutive integers 0, 1, ..., n-1.
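A sketch of that np.unique() variant, reusing the label array from the small example above:

import numpy as np

arr = np.array([9, 0, 9, 0, 7])

# return_inverse maps each element of arr to its index among the unique values
uniques, inverse = np.unique(arr, return_inverse=True)
uni = arr == uniques[:, None]  # shape (k, n): uni[k, j] = (arr[j] == uniques[k])
res = uni[inverse]             # shape (n, n): res[i, j] = (arr[i] == arr[j])
print(res)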

How to properly index to an array of changing size due to masking in python

This is a problem I've run into when developing something, and it's a hard question to phrase. So it's best with an simple example:
Imagine you have 4 random number generators which generate an array of size 4:
[rng-0, rng-1, rng-2, rng-3]
   |      |      |      |
[val0,  val1,  val2,  val3]
Our goal is to loop through "generations" of arrays populated by these RNGs, and iteratively mask out the RNG which outputted the maximum value.
So an example might be starting out with:
mask = [False, False, False, False], arr = [0, 10, 1, 3], and so we would mask out rng-1.
Then the next iteration could be: mask = [False, True, False, False], arr = [2, 1, 9] (before it gets asked: yes, arr HAS to decrease in size with each rng that is masked out). In this case it is clear that rng-3 should be masked out (i.e. mask[3] = True), but since arr is now a different size than mask, getting the right index for setting the mask is tricky: the max of arr is at index 2 of arr, but the corresponding generator is at index 3. This problem grows more and more difficult as more generators get masked out (in my case I am dealing with a mask of size ~30).
If it helps, here is python version of the example:
import numpy as np

rng = np.random.RandomState(42)
mask = np.zeros(10, dtype=bool)  # True if generator is being masked
for _ in range(mask.size):
    arr = rng.randint(100, size=(~mask).sum())
    unadjusted_max_value_idx = arr.argmax()
    adjusted_max_value_idx = unadjusted_max_value_idx + ????
    mask[adjusted_max_value_idx] = True
Any idea of a good way to map the index of the max value in arr to the corresponding index in mask? (i.e. moving from unadjusted_max_value_idx to adjusted_max_value_idx)
# use a helper list
import numpy as np

rng = np.random.RandomState(42)
mask = np.zeros(10, dtype=bool)  # True if generator is being masked
ndxLst = list(range(mask.size))
maskHistory = []
for _ in range(mask.size):
    arr = rng.randint(100, size=(~mask).sum())
    unadjusted_max_value_idx = arr.argmax()
    # pop() returns the original index and shrinks the helper list in step with arr
    adjusted_max_value_idx = ndxLst.pop(unadjusted_max_value_idx)
    mask[adjusted_max_value_idx] = True
    maskHistory.append(adjusted_max_value_idx)
print(maskHistory)
print(mask)
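An alternative without the helper list: np.flatnonzero(~mask) gives the original indices of the still-active generators, which translates positions in the shrunken arr back to positions in mask. A sketch:

import numpy as np

rng = np.random.RandomState(42)
mask = np.zeros(10, dtype=bool)
for _ in range(mask.size):
    arr = rng.randint(100, size=(~mask).sum())
    # alive[k] is the index of the generator that produced arr[k]
    alive = np.flatnonzero(~mask)
    mask[alive[arr.argmax()]] = True
print(mask)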

Reduce boolean values in python ndarray using AND

I have a NumPy array of shape (3, 1000, 3) with boolean values inside. The first 3 is the batch size, and the values of a batch look like this:
[[False, False, False],
 [False, True, True],
 [False, False, True],
 [True, True, True],
 ...
]
size (1000, 3)
I want to apply a logical AND to each triplet to end up with this new array:
[[False],
 [False],
 [False],
 [True],
 ...
]
size (3, 1000)
Looking at numpy I didn't find anything useful. I've also tried importing operator and applying reduce(operator.and_, array), but it doesn't work.
Any idea to solve this?
You can easily do this using np.all.
This will check if all values along the last dimension are True:
y = np.all(arr, axis=-1)
y.shape # (3, 1000)
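A quick self-contained check on data of the shape described (the array here is random stand-in data):

import numpy as np

rng = np.random.default_rng(0)
arr = rng.random((3, 1000, 3)) > 0.5  # random boolean array of shape (3, 1000, 3)

y = np.all(arr, axis=-1)
print(y.shape)  # (3, 1000)

# Equivalent explicit reduction with logical AND over the last axis:
y2 = np.logical_and.reduce(arr, axis=-1)
print(np.array_equal(y, y2))  # True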

Fastest way to filter values in np array based on changing threshold

I want to filter an array arr based on some thresholds.
arr = np.array([2,2,2,2,2,5,5,5,1])
thresholds = np.array([4,1])
I want to filter arr based on the values in thresholds, keeping the values of arr that are greater than each threshold.
My idea is to create a mask for each threshold
Expected result:
# [[False False False False False True True True False]
# [ True True True True True True True True False]]
One way to do it in Python:
mask = [x > condi for condi in thresholds for x in arr]
mask = np.reshape(mask, (2, 9))
Then the filtered array is obtained with filteredarr = arr[mask[i]], where i is the index of the relevant threshold.
Is there a better way (performance-wise) to do this in Python? Especially since I am dealing with big arrays (len around 250000 for arr; no specific len for thresholds yet, but I am expecting a big array)?
Edit:
The final output expected on the data is [array([5, 5, 5]), array([2, 2, 2, 2, 2, 5, 5, 5])]
The mask can easily be obtained using
mask = arr[None,:]>thresholds[:,None]
mask
# Output
# array([[False, False, False, False, False, True, True, True, False],
# [ True, True, True, True, True, True, True, True, False]], dtype=bool)
The idea is to blow up the dimensionality by adding an additional axis using None (which does the same as np.newaxis) and then to compare the arrays element-wise.
Once we have the mask we can filter the data using various methods where the choice strongly depends on your problem:
Of course you can do
res = [arr[m] for m in mask]
# [array([5, 5, 5]), array([2, 2, 2, 2, 2, 5, 5, 5])]
in order to obtain a list with the filtered data, but it is slow in general.
In case you have further numeric calculations I would create a masked array in which only the filtered data are taken into account:
m = np.zeros_like(mask, dtype=arr.dtype)
m[:] = arr  # broadcast arr into every row
res = np.ma.masked_where(~mask, m)
Each line corresponds now to the filtered data according to the corresponding threshold.
Masked arrays allow you to keep working with many functions, like mean or std:
res.mean(axis=1)
# masked_array(data = [5.0 3.125],
# mask = [False False],
# fill_value = 1e+20)
res.mean(axis=1).compressed()
# array([ 5. , 3.125])
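If you need the per-threshold filtered arrays back from the masked representation, compressing each row gives the same list as the comprehension above:

filtered = [row.compressed() for row in res]
# [array([5, 5, 5]), array([2, 2, 2, 2, 2, 5, 5, 5])]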
