Assignment by logical indexing in numpy - python

I have a real-valued numpy array of size (1000,). All values lie between 0 and 1, and I want to convert this to a categorical array. All values less than 0.25 should be assigned to category 0, values between 0.25 and 0.5 to category 1, 0.5 to 0.75 to category 2, and 0.75 to 1 to category 3. Logical indexing doesn't seem to work:
Y[Y < 0.25] = 0
Y[np.logical_and(Y >= 0.25, Y < 0.5)] = 1
Y[np.logical_and(Y >= 0.5, Y < 0.75)] = 2
Y[Y >= 0.75] = 3
Result:
for i in range(4):
    print(f"Y == {i}: {sum(Y == i)}")
Y == 0: 206
Y == 1: 0
Y == 2: 0
Y == 3: 794
What needs to be done instead?

The error is in your conversion logic, not in your indexing. The final statement:
Y[Y >= 0.75] = 3
converts not only the values in the range 0.75 to 1.00, but also the values previously assigned to classes 1 and 2, since 1 and 2 are themselves >= 0.75.
You can reverse the assignment order, starting with class 3.
You can put an upper limit on the final class, although you still have a boundary problem: an original value of exactly 1.00 then stays 1.00 and is indistinguishable from class 1.
Perhaps best would be to harness the regularity of your divisions, such as:
Y = (4 * Y).astype(int)  # but you still have a boundary problem: a value of exactly 1.0 maps to 4
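A minimal sketch of that idea, clamping so a value of exactly 1.0 still lands in class 3 (the bin edges come from the question; np.digitize over the same edges is an equivalent alternative; the random Y below is just a stand-in for the real data):
import numpy as np

Y = np.random.rand(1000)                     # stand-in for the real data in [0, 1]
labels = np.minimum((4 * Y).astype(int), 3)  # floor into 4 equal-width bins, clamp 1.0 into class 3
# equivalently: labels = np.digitize(Y, [0.25, 0.5, 0.75])
for i in range(4):
    print(f"labels == {i}: {sum(labels == i)}")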

Related

How to merge two DataFrames by complicated condition?

Python 3.9.5
The first big DataFrame contains points and the second big DataFrame contains square areas. Square areas are bounded by four straight lines that are parallel to the coordinate axes and are completely defined by a set of constraints: y_min, y_max, x_min, x_max. For example:
points = pd.DataFrame({'y':[0.5, 0.5, 1.5, 1.5], 'x':[0.5, 1.5, 1.5, 0.5]})
points
     y    x
0  0.5  0.5
1  0.5  1.5
2  1.5  1.5
3  1.5  0.5
square_areas = pd.DataFrame({'y_min':[0,1], 'y_max':[1,2], 'x_min':[0,1], 'x_max':[1,2]})
square_areas
   y_min  y_max  x_min  x_max
0      0      1      0      1
1      1      2      1      2
How can I get all the points that don't belong to any square area, without sequentially enumerating the areas in a loop?
Needed Output:
     y    x
0  0.5  1.5
1  1.5  0.5
I'm not sure how to do this with a merge, but you can iterate over the square_areas DataFrame and evaluate its conditions against the points DataFrame.
I'm assuming you'll have more than two areas, so this iterative approach should still work. Each iteration only looks at points that have not already been matched by a prior square_areas row.
import numpy as np
import pandas as pd

points = pd.DataFrame({'y': [0.5, 0.5, 1.5, 1.5], 'x': [0.5, 1.5, 1.5, 0.5]})
print(points)

# assume everything is outside until it evaluates inside
points['outside'] = 'Y'

square_areas = pd.DataFrame({'y_min': [0, 1], 'y_max': [1, 2], 'x_min': [0, 1], 'x_max': [1, 2]})
print(square_areas)

for i in range(square_areas.shape[0]):
    ymin = square_areas.iloc[i]['y_min']
    ymax = square_areas.iloc[i]['y_max']
    xmin = square_areas.iloc[i]['x_min']
    xmax = square_areas.iloc[i]['x_max']
    # only re-evaluate points that are still marked as outside
    still_out = points['outside'] == 'Y'
    inside = (points.loc[still_out, 'x'].between(xmin, xmax)
              & points.loc[still_out, 'y'].between(ymin, ymax))
    points.loc[still_out, 'outside'] = np.where(inside, 'N', 'Y')

points.loc[points['outside'] == 'Y']
Output
         y        x outside
1  0.50000  1.50000       Y
3  1.50000  0.50000       Y
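If you do want to avoid the explicit loop entirely, one sketch is a cross join followed by a containment test (this assumes pandas >= 1.2, where merge supports how='cross'; the question doesn't state the pandas version):
import pandas as pd

points = pd.DataFrame({'y': [0.5, 0.5, 1.5, 1.5], 'x': [0.5, 1.5, 1.5, 0.5]})
square_areas = pd.DataFrame({'y_min': [0, 1], 'y_max': [1, 2], 'x_min': [0, 1], 'x_max': [1, 2]})

# pair every point with every area, then flag the pairs where the point falls inside the area
paired = points.reset_index().merge(square_areas, how='cross')
inside = (paired['y'].between(paired['y_min'], paired['y_max'])
          & paired['x'].between(paired['x_min'], paired['x_max']))

# a point is kept only if it is inside none of the areas
inside_any = inside.groupby(paired['index']).any()
print(points.loc[~inside_any])
Note that this materializes len(points) * len(square_areas) rows, so it trades memory for avoiding the loop.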

How to sample data from the proximity of existing data?

I have data for xor as below -
x  y  z  x ^ y ^ z
0  0  1  1
0  1  0  1
1  0  0  1
1  1  1  1
I kept only the rows that make the xor of all three equal to 1.
I want to generate synthetic data around the already available data, uniformly at random within some range. The table above can be thought of as seed data. An example of the expected table is as follows:
x     y     z     x ^ y ^ z
0.1   0.3   0.8   0.9
0.25  0.87  0.03  0.99
0.79  0.09  0.28  0.82
0.97  0.76  0.91  0.89
The table above is sampled with a range of 0 to 0.3 for the value 0 and a range of 0.7 to 1 for the value 1.
I want to achieve this using pytorch.
For a problem such as this, you can synthesise data completely without using a reference, because it has a simple structure. For zero (0 to 0.3) you can use torch.rand to generate uniformly random data in 0 to 1 and scale it. For one (0.7 to 1) you can do the same and just offset it:
import torch

N = 5
p = 0.5  # change this to bias your outputs
x_is_1 = torch.rand(N)>p #decide if x is going to be 1 or 0
y_is_1 = torch.rand(N)>p #decide if y is going to be 1 or 0
not_all_0 = ~(x_is_1 & y_is_1) #get rid of the x ^ y ^ z = 0 elements
x_is_1,y_is_1 = x_is_1[not_all_0],y_is_1[not_all_0]
N = x_is_1.shape[0]
x = torch.rand(N) * 0.3
x = torch.where(x_is_1,x+0.7,x)
y = torch.rand(N) * 0.3
y = torch.where(y_is_1,y+0.7,y)
z = torch.logical_xor(x_is_1,y_is_1).float()
triple_xor = 1 - torch.rand(z.shape)*0.3
print(torch.stack([x,y,z,triple_xor]).T)
#x y z x^y^z
tensor([[0.2615, 0.7676, 1.0000, 0.8832],
        [0.9895, 0.0370, 1.0000, 0.9796],
        [0.1406, 0.9203, 1.0000, 0.9646],
        [0.1799, 0.9722, 1.0000, 0.9327]])
Or, to treat your data as the basis (for more complex data), there is a preprocessing technique known as Gaussian noise injection, which seems to be what you're after. Or you can just define a function and call it a bunch of times:
def add_noise(x, y, z, triple_xor, range=0.3):
    def proc(dat, range):
        return torch.where(dat > 0.5, torch.rand(dat.shape)*range + 1 - range, torch.rand(dat.shape)*range)
    return proc(x, range), proc(y, range), proc(z, range), proc(triple_xor, range)
gen_new_data = torch.cat([torch.stack(add_noise(x,y,z,triple_xor)).T for _ in range(5)])
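A minimal sketch of that Gaussian-noise-injection idea, using the seed rows from the question (the number of copies and the standard deviation are illustrative choices, not values from the question):
import torch

# the four seed rows (x, y, z, x ^ y ^ z) from the question
seed = torch.tensor([[0., 0., 1., 1.],
                     [0., 1., 0., 1.],
                     [1., 0., 0., 1.],
                     [1., 1., 1., 1.]])

def jitter(rows, n_copies=5, std=0.1):
    # repeat each seed row n_copies times, add small Gaussian noise,
    # and clamp back into [0, 1] so the values stay valid soft bits
    repeated = rows.repeat(n_copies, 1)
    return torch.clamp(repeated + torch.randn_like(repeated) * std, 0., 1.)

synthetic = jitter(seed)
print(synthetic)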

How to replace values in an array?

I'm beginning to study Python and saw this:
I have an array (km_media) that has nan values,
km_media = km / (2019 - year)
It happens because the variable year contains some 2019 values.
So, for the sake of learning, I would like to know how to do 2 things:
how can I use replace() to substitute the nan values with 0 in the variable;
how can I print the variable that had the nan values after the replace.
What I have until now:
1.
km_media = km_media.replace('nan', 0)
print(f'{km_media.replace('nan',0)}')
Thanks
Not sure if this will do what you are looking for:
import numpy as np

a = 2 / np.arange(5)  # 2/0 emits a RuntimeWarning and yields inf
print(a)
array([ inf, 2. , 1. , 0.66666667, 0.5 ])
b = [i if i != np.inf and i != np.nan else 0 for i in a]  # note: i != np.nan is always True; a real nan check needs np.isnan
print(b)
Output:
[0, 2.0, 1.0, 0.6666666666666666, 0.5]
Or:
np.where(((a == np.inf) | (a == np.nan)), 0, a)
Or:
a[np.isinf(a)] = 0
Also, for part 2 of your question, I'm not sure what you mean. If you have just replaced the inf's with 0, then you will just be printing zeros. If you want the index position of the inf's you have replaced, you can grab them before replacement:
np.where(a == np.inf)[0][0]
Output:
0 # this is the index position of np.inf in array a
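For the original km_media case with real nan values (not just inf), a minimal numpy sketch; the km and year values here are made up for illustration, and km_media is assumed to be a plain numpy array rather than a pandas Series:
import numpy as np

km = np.array([10000., 0., 5000.])
year = np.array([2018, 2019, 2019])
km_media = km / (2019 - year)        # 0/0 gives nan, nonzero/0 gives inf

bad = ~np.isfinite(km_media)         # True where the value is nan or +/- inf
print(np.where(bad)[0])              # index positions of the problematic values
km_media[bad] = 0                    # replace them with 0
print(km_media)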

How can I replace values in a column using pandas?

It's my first time using Python and pandas (please help this old man). I have a column with float and negative numbers and I want to replace them based on conditions.
I.e. if the number is between -2 and -1.6, I'll replace it with -2, etc.
How can I create the condition (using if/else or otherwise) to modify my column? Thanks a lot.
mean = []
for row in df.values["mean"]:
    if row <= -1.5:
        mean.append(-2)
    elif row <= -0.5 and =-1.4:
        mean.append(-1)
    elif row <= 0.5 and =-0.4:
        mean.append(0)
    else:
        mean.append(1)
df = df.assign(mean=mean)
Doesn't work
Create a function defining your conditions and then apply it to your column (I fixed some of your conditionals based on what I thought they should be):
import pandas as pd

df = pd.read_table('fun.txt')

# create a function to apply for the value ranges
def labels(x):
    if x <= -1.5:
        return -2
    elif -1.5 < x <= -0.5:
        return -1
    elif -0.5 < x < 0.5:
        return 0
    else:
        return 1

df['mean'] = df['mean'].apply(lambda x: labels(x))  # apply your function to your table
print(df)
another way to apply your function that returns the same result:
df['mean'] = df['mean'].map(labels)
fun.txt:
mean
0
-1.5
-1
-0.5
0.1
1.1
output from above:
   mean
0     0
1    -2
2    -1
3    -1
4     0
5     1
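An equivalent vectorized sketch using pd.cut, with bin edges taken from the conditionals above (bins are closed on the right here, so a value of exactly 0.5 lands in bin 0 rather than 1 as in the apply version):
import numpy as np
import pandas as pd

df = pd.DataFrame({'mean': [0, -1.5, -1, -0.5, 0.1, 1.1]})
# bin edges (-inf, -1.5], (-1.5, -0.5], (-0.5, 0.5], (0.5, inf) -> labels -2, -1, 0, 1
df['mean'] = pd.cut(df['mean'],
                    bins=[-np.inf, -1.5, -0.5, 0.5, np.inf],
                    labels=[-2, -1, 0, 1]).astype(int)
print(df)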

Get top values and positions of items in list of lists

I have used sklearn to fit and predict a model, but I want to have the top 5 predictions (in terms of probabilities) per item.
So I used predict_proba, which gave me a list of lists like:
probabilities = [[0.8,0.15,0.5,0,0],[0.4,0.6,0,0,0],[0,0,0,0,1]]
What I want to do is loop over this list of lists to get an overview of each prediction made, along with its position in the list (which represents the classes).
When using [i for i, j in enumerate(predicted_proba[0]) if j > 0] it returns [0], [1], which is what I want for the complete list of lists (and if possible also with the probability next to it).
When trying to use a for loop over the above code, it returns an IndexError.
Something like this:
probabilities = [[0.8, 0.15, 0.5, 0, 0], [0.4, 0.6, 0, 0, 0], [0, 0, 0, 0, 1]]
for list_index in range(0, len(probabilities)):
    print("Iteration_number:", list_index)
    for index, prob in enumerate(probabilities[list_index]):
        print("index", index, "=", prob)
Results in:
Iteration_number: 0
index 0 = 0.8
index 1 = 0.15
index 2 = 0.5
index 3 = 0
index 4 = 0
Iteration_number: 1
index 0 = 0.4
index 1 = 0.6
index 2 = 0
index 3 = 0
index 4 = 0
Iteration_number: 2
index 0 = 0
index 1 = 0
index 2 = 0
index 3 = 0
index 4 = 1
for i in predicted_proba:
    for index, value in enumerate(i):
        if value > 0:
            print(index)
Hope this helps.
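Since the question asks for the top 5 predictions per item together with their probabilities, here is a small sketch using numpy's argsort (the example rows only have 5 classes, so top_k simply returns all of them sorted):
import numpy as np

probabilities = [[0.8, 0.15, 0.5, 0, 0], [0.4, 0.6, 0, 0, 0], [0, 0, 0, 0, 1]]
top_k = 5

for row_number, row in enumerate(probabilities):
    row = np.asarray(row)
    top = np.argsort(row)[::-1][:top_k]   # class indices, highest probability first
    print("Iteration_number:", row_number)
    for class_index in top:
        print("index", class_index, "=", row[class_index])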
