Filtering Numpy's array of arrays - python

I'm working with a numpy ndarray, preprocessing sensor data for a neural network. It basically contains several fixed-length arrays of sensor readings. For example:
>>> type(arr)
<class 'numpy.ndarray'>
>>> arr.shape
(400, 1, 5, 4)
>>> arr
[
[[ 9.4 -3.7 -5.2 3.8]
[ 2.8 1.4 -1.7 3.4]
[ 0.0 0.0 0.0 0.0]
[ 0.0 0.0 0.0 0.0]
[ 0.0 0.0 0.0 0.0]]
..
[[ 0.0 -1.0 2.1 0.0]
[ 3.0 2.8 -3.0 8.2]
[ 7.5 1.7 -3.8 2.6]
[ 0.0 0.0 0.0 0.0]
[ 0.0 0.0 0.0 0.0]]
]
Each nested array has shape (1, 5, 4). The goal is to run through arr and keep only those arrays whose first three rows are all non-zero (a single entry may be zero, but not a whole row).
So in the example above, the first nested array should be dropped, because only its first 2 rows are non-zero and we need at least 3.

Here's a trick you can use:
mask = arr[:, :, :3].any(axis=3).all(axis=2)  # True where each of the first 3 rows has a non-zero entry
arr_filtered = arr[mask]
Quick explanation: to keep a nested array, each of its first 3 rows (hence we only look at arr[:, :, :3]) must have at least one non-zero entry (hence .any(axis=3)), and that must hold for all three rows (hence the final .all(axis=2)).
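For instance, here is a minimal sketch on a toy batch of two (1, 5, 4) arrays (assuming the same layout as above), where only the second one should survive:
import numpy as np

arr = np.zeros((2, 1, 5, 4))
arr[0, 0, :2] = 1.0   # only the first 2 rows are non-zero -> should be dropped
arr[1, 0, :3] = 1.0   # the first 3 rows are non-zero -> should be kept

mask = arr[:, :, :3].any(axis=3).all(axis=2)   # shape (2, 1)
print(mask.ravel())       # [False  True]
print(arr[mask].shape)    # (1, 5, 4): boolean indexing over the first two axes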

Related

Remove NaN from two numpy arrays with different dimensions in Python

I would like to remove NaN elements from a pair of numpy arrays with different dimensions: one array with shape (8, 3) and another with shape (8,). If at least one NaN appears in a row (counting the paired value from the second array), the entire row needs to be removed. However, I ran into issues because the two arrays have different dimensions.
For example,
[1.7 2.3 3.4] 4.2
[2.3 3.4 4.2] 4.6
[3.4 nan 4.6] 4.8
[4.2 4.6 4.8] 4.6
[4.6 4.8 4.6] nan
[4.8 4.6 nan] nan
[4.6 nan nan] nan
[nan nan nan] nan
I want it to become
[1.7 2.3 3.4] 4.2
[2.3 3.4 4.2] 4.6
[4.2 4.6 4.8] 4.6
This is my code which generates the sequence data:
from numpy import array

def split_sequence(sequence, n_steps):
    X, y = list(), list()
    for i in range(len(sequence)):
        # find the end of this pattern
        end_ix = i + n_steps
        # check if we are beyond the sequence
        if end_ix > len(sequence) - 1:
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequence[i:end_ix], sequence[end_ix]
        X.append(seq_x)
        y.append(seq_y)
    return array(X), array(y)

n_steps = 3
sequence = df_sensorRefill['sensor'].to_list()
X, y = split_sequence(sequence, n_steps)
Thanks
You could use np.isnan() and np.any() to find rows containing NaNs, and np.delete() to remove such rows.
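For example, a minimal sketch of that approach (assuming X with shape (8, 3) and y with shape (8,), as produced by split_sequence above):
import numpy as np

# a row is bad if any entry of X in that row is NaN, or its paired y is NaN
bad = np.isnan(X).any(axis=1) | np.isnan(y)

# either index with the inverted mask...
X_clean, y_clean = X[~bad], y[~bad]

# ...or, equivalently, delete the offending rows with np.delete()
X_clean = np.delete(X, np.where(bad)[0], axis=0)
y_clean = np.delete(y, np.where(bad)[0], axis=0)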

Read data with missing values with Python into arrays

I have a data file similar to the following (the original file is much bigger):
Data
6 6
0.0 0.2 0.4
0.6 0.8
1.0
0.0 0.4 0.6
1.2 1.6
2.0
1.0 3.0 4.0 1.0
1.0 3.0
1.0 2.0 1.0 4.0
5.0 2.0
3.0 3.0 1.0 1.0
5.0 1.0
2.0 7.0 1.0 1.0
5.0 2.0
2.0 3.0 8.0 6.0
3.0 1.0
3.0 3.0 4.0 6.0
1.0 1.0
and I need to plot it as a 2D contour.
The first line is a dummy header. The first 6 gives the number of elements in the x direction:
0.0 0.2 0.4
0.6 0.8
1.0
and the second 6 gives the number of elements in the y direction:
0.0 0.4 0.6
1.2 1.6
2.0
Then each group of 6 numbers gives the contour values for one row, starting from row 1:
1.0 3.0 4.0 1.0
1.0 3.0
I want to cast this data into a 2D array so that I can plot it.
I tried,
data = numpy.genfromtxt('fileHere', delimiter=" ", skip_header=1)
to read data into a general array and then split it. But I get the following error,
Line #16999 (got 15 columns instead of 3)
I also tried Python's readline() and split() functions, but they make it much harder to continue. I want to have x and y in arrays, and a separate array for the data in, say, a 6x6 shape. In Matlab I used to use the fscanf function:
fscanf(fid,'%d',6);
I will be happy to have your ideas on this. Thanks
I think you want to read the full file into a variable, replace all newlines ('\n') with spaces, then split and convert it into an ndarray.
Here's how I did it.
import numpy as np
txt = '''\
6 6
0.0 0.2 0.4
0.6 0.8
1.0
0.0 0.4 0.6
1.2 1.6
2.0
1.0 3.0 4.0 1.0
1.0 3.0
1.0 2.0 1.0 4.0
5.0 2.0
3.0 3.0 1.0 1.0
5.0 1.0
2.0 7.0 1.0 1.0
5.0 2.0
2.0 3.0 8.0 6.0
3.0 1.0
3.0 3.0 4.0 6.0
1.0 1.0'''
txt_list = txt.replace('\n', ' ').split()
# if you want the values as floats, include the next line;
# otherwise the data ends up in the numpy array as strings
txt_list = [float(i) for i in txt_list]
width = int(txt_list[1])
height = len(txt_list[2:]) // width
txt_array = np.reshape(txt_list[2:], (height, width))
print(txt_array)
The output (with the float conversion skipped, so the values stay strings) will be:
[['0.0' '0.2' '0.4' '0.6' '0.8' '1.0']
['0.0' '0.4' '0.6' '1.2' '1.6' '2.0']
['1.0' '3.0' '4.0' '1.0' '1.0' '3.0']
['1.0' '2.0' '1.0' '4.0' '5.0' '2.0']
['3.0' '3.0' '1.0' '1.0' '5.0' '1.0']
['2.0' '7.0' '1.0' '1.0' '5.0' '2.0']
['2.0' '3.0' '8.0' '6.0' '3.0' '1.0']
['3.0' '3.0' '4.0' '6.0' '1.0' '1.0']]
Quite a smart solution to reformat your input is possible using Pandas.
Start by reading your input file as a pandasonic DataFrame (with the standard field separator, i.e. a comma):
import pandas as pd

df = pd.read_csv('Input.txt')
As your input file does not contain commas, each line is read as a single field, and the column name (Data) is taken from the first line.
So far the initial part of df is:
Data
0 6 6
1 0.0 0.2 0.4
2 0.6 0.8
3 1.0
4 0.0 0.4 0.6
The left column is the index, but it is not important.
The type of the only column is object, actually a string.
Then, to reformat this DataFrame into a 6-column Numpy array, it is enough to run the following one-liner:
result = df.drop(0).Data.str.split(' ').explode().astype('float').values.reshape(-1, 6)
Steps:
drop(0) - Drop the initial row (with index 0).
Data - Take Data column.
str.split(' ') - Split each element on spaces (the result is a list of strings).
explode() - Convert each list into a sequence of rows. So far each
element is of string type.
astype('float') - Change the type to float.
values - Take the underlying Numpy (1-D) array.
reshape(-1, 6) - Reshape to 6 columns and as many rows as needed.
The result, for your data sample is:
array([[0. , 0.2, 0.4, 0.6, 0.8, 1. ],
[0. , 0.4, 0.6, 1.2, 1.6, 2. ],
[1. , 3. , 4. , 1. , 1. , 3. ],
[1. , 2. , 1. , 4. , 5. , 2. ],
[3. , 3. , 1. , 1. , 5. , 1. ],
[2. , 7. , 1. , 1. , 5. , 2. ],
[2. , 3. , 8. , 6. , 3. , 1. ],
[3. , 3. , 4. , 6. , 1. , 1. ]])
And the last step is to divide this array into:
the 2 initial rows (x and y coordinates),
the following rows (actual data).
To do it, run:
x = result[0]
y = result[1]
data = result[2:]
Alternative: don't create separate variables, but call plt.contour, passing the respective rows of result as X, Y and Z. Something like:
plt.contour(result[0], result[1], result[2:]);
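Put together, a minimal end-to-end sketch (assuming the file is named Input.txt and matplotlib is available) could look like:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('Input.txt')
result = df.drop(0).Data.str.split(' ').explode().astype('float').values.reshape(-1, 6)

# contour() accepts 1-D x and y together with a 2-D Z of shape (len(y), len(x))
plt.contour(result[0], result[1], result[2:])
plt.show()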

How to fill masked 2D array with values from another masked array

I am trying to generate 2 children from 2 parents by crossover. I want to fix a part from parent A and fill in the blanks with elements from parent B.
I was able to mask both parents and extract the elements into another array, but I am not able to fill the gaps in the fixed part of parent A with the fill elements from parent B.
Here's what I have tried so far:
import numpy as np
from numpy.random import default_rng
rng = default_rng()
numMachines = 5
numJobs = 5
population = [[[4, 0, 2, 1, 3],
               [4, 2, 0, 1, 3],
               [4, 2, 0, 1, 3],
               [4, 0, 3, 2, 1],
               [2, 3, 4, 1, 0]],
              [[2, 0, 1, 3, 4],
               [4, 3, 1, 2, 0],
               [2, 0, 3, 4, 1],
               [4, 3, 1, 0, 2],
               [4, 0, 3, 1, 2]]]
parentA = np.array(population[0])
parentB = np.array(population[1])
childA = np.zeros((numJobs, numMachines))
np.copyto(childA, parentA)
childB = np.zeros((numJobs, numMachines))
np.copyto(childB, parentB)
subJobs = np.stack([rng.choice(numJobs, size=int(np.max([2, np.floor(numJobs / 2)])), replace=False) for i in range(numMachines)])
maskA = np.stack([(np.isin(childA[i], subJobs[i])) for i in range(numMachines)])
invMaskA = np.invert(maskA)
maskB = np.stack([(np.isin(childB[i], subJobs[i])) for i in range(numMachines)])
invMaskB = np.invert(maskB)
maskedChildAFixed = np.ma.masked_array(childA, maskA)
maskedChildBFixed = np.ma.masked_array(childB, maskB)
maskedChildAFill = np.ma.masked_array(childA, invMaskA)
maskedChildBFill = np.ma.masked_array(childB, invMaskB)
maskedChildAFill = np.stack([maskedChildAFill[i].compressed() for i in range(numMachines)])
maskedChildBFill = np.stack([maskedChildBFill[i].compressed() for i in range(numMachines)])
EDIT:
Sorry, I was so frustrated with this yesterday that I forgot to add some more information to make it clearer. First, I have fixed the code so it now runs when copied and pasted (I had forgotten some import calls and some variables).
This is a fixed portion from Parent A that won't change in child A.
>>> print(maskedChildAFixed)
[[-- 0.0 2.0 -- 3.0]
[4.0 -- 0.0 1.0 --]
[4.0 -- -- 1.0 3.0]
[-- 0.0 3.0 2.0 --]
[-- -- 4.0 1.0 0.0]]
I need to fill in these blank parts with the fill part from parent B.
>>> print(maskedChildBFill)
[[1. 4.]
[3. 2.]
[2. 0.]
[4. 1.]
[3. 2.]]
For my children to be legal, an integer can't repeat within a row. If I try to use the np.ma.filled() function with the compressed maskedChildBFill, it gives me an error.
>>> print(np.ma.filled(maskedChildAFixed, fill_value=maskedChildBFill))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Rafael\.conda\envs\CoutoBinario\lib\site-packages\numpy\ma\core.py", line 639, in filled
return a.filled(fill_value)
File "C:\Users\Rafael\.conda\envs\CoutoBinario\lib\site-packages\numpy\ma\core.py", line 3752, in filled
np.copyto(result, fill_value, where=m)
File "<__array_function__ internals>", line 6, in copyto
ValueError: could not broadcast input array from shape (5,2) into shape (5,5)
I'll now comment out the part of the code that compresses the fill portion (lines 46 and 47), so the blanks in maskedChildBFill are not deleted and the sizes of the matrices are preserved.
>>> print(np.ma.filled(maskedChildAFixed, fill_value=maskedChildBFill))
[[2. 0. 2. 3. 3.]
[4. 3. 0. 1. 0.]
[4. 0. 3. 1. 3.]
[4. 0. 3. 2. 2.]
[4. 0. 4. 1. 0.]]
See how I get an invalid individual? Note the repeated integers in row 1. The individual should look like this:
[[1.0 0.0 2.0 4.0 3.0]
[4.0 3.0 0.0 1.0 2.0]
[4.0 2.0 0.0 1.0 3.0]
[4.0 0.0 3.0 2.0 1.0]
[3.0 2.0 4.0 1.0 0.0]]
I hope this update makes it easier to understand what I am trying to do. Thanks for all the help so far! <3
EDIT 2
I was able to work around it by converting everything to lists and substituting the values in place with for loops, but this should be super slow. There might be a way to do this using numpy.
maskedChildAFill = maskedChildAFill.tolist()
maskedChildBFill = maskedChildBFill.tolist()
maskedChildAFixed = maskedChildAFixed.tolist()
maskedChildBFixed = maskedChildBFixed.tolist()
for i in range(numMachines):
    counterA = 0
    counterB = 0
    for n, j in enumerate(maskedChildAFixed[i]):
        if maskedChildAFixed[i][n] is None:
            maskedChildAFixed[i][n] = maskedChildBFill[i][counterA]
            counterA += 1
    for n, j in enumerate(maskedChildBFixed[i]):
        if maskedChildBFixed[i][n] is None:
            maskedChildBFixed[i][n] = maskedChildAFill[i][counterB]
            counterB += 1
I think you are looking for this:
parentA = np.array(population[0])
parentB = np.array(population[1])
childA = np.zeros((numJobs, numMachines))
np.copyto(childA, parentA)
childB = np.zeros((numJobs, numMachines))
np.copyto(childB, parentB)
subJobs = np.stack([rng.choice(numJobs, size=int(np.max([2, np.floor(numJobs / 2)])), replace=False) for i in range(numMachines)])
maskA = np.stack([(np.isin(childA[i], subJobs[i])) for i in range(numMachines)])
invMaskA = np.invert(maskA)
maskB = np.stack([(np.isin(childB[i], subJobs[i])) for i in range(numMachines)])
invMaskB = np.invert(maskB)
maskedChildAFixed = np.ma.masked_array(childA, maskA)
maskedChildBFixed = np.ma.masked_array(childB, maskB)
maskedChildAFill = np.ma.masked_array(childB, invMaskA)
maskedChildBFill = np.ma.masked_array(childA, invMaskB)
from operator import and_
crossA = np.ma.array(maskedChildAFixed.filled(0)+maskedChildAFill.filled(0),mask=list(map(and_,maskedChildAFixed.mask,maskedChildAFill.mask)))
crossB = np.ma.array(maskedChildBFixed.filled(0)+maskedChildBFill.filled(0),mask=list(map(and_,maskedChildBFixed.mask,maskedChildBFill.mask)))
Please note that I changed the line maskedChildAFill = np.ma.masked_array(childB, invMaskA) to fit the description of your problem. If that is not what you want, simply change it back to your original code. The last two lines should do the work for you.
output:
crossA
[[4.0 0.0 2.0 1.0 4.0]
[4.0 2.0 0.0 2.0 0.0]
[2.0 2.0 3.0 1.0 3.0]
[4.0 3.0 3.0 2.0 2.0]
[2.0 0.0 4.0 1.0 0.0]]
crossB
[[2.0 0.0 1.0 1.0 4.0]
[4.0 2.0 0.0 2.0 0.0]
[2.0 2.0 3.0 1.0 1.0]
[4.0 3.0 3.0 2.0 2.0]
[4.0 0.0 4.0 1.0 2.0]]
EDIT: Per the OP's edit to the question, this would work for the purpose:
maskedChildAFixed[np.where(maskA)] = maskedChildBFill.ravel()
maskedChildBFixed[np.where(maskB)] = maskedChildAFill.ravel()
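This works because np.where(maskA) lists the masked positions row by row, while .ravel() yields the compressed fill values in the same row order. A tiny standalone sketch of the same assignment pattern (with made-up values, just to show the mechanics):
import numpy as np

mask = np.array([[True, False, True, False],
                 [False, True, False, True]])   # two masked slots per row
a = np.ma.masked_array(np.arange(8.).reshape(2, 4), mask)
fill = np.array([[10., 11.],
                 [12., 13.]])

a[np.where(mask)] = fill.ravel()   # positions and values line up row by row
print(a)
# [[10.0 1.0 11.0 3.0]
#  [4.0 12.0 6.0 13.0]]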
Example output for maskedChildAFixed:
[[4.0 0.0 2.0 1.0 3.0]
[4.0 2.0 0.0 1.0 3.0]
[3.0 2.0 0.0 1.0 4.0]
[4.0 0.0 3.0 2.0 1.0]
[1.0 3.0 4.0 2.0 0.0]]

Rolling concatenation array of numpy arrays

I want to implement a rolling concatenation function for a numpy array of arrays. For example, if my numpy array is the following:
[[1.0]
[1.5]
[1.6]
[1.8]
...
...
[1.2]
[1.3]
[1.5]]
then, for a window size of 3, my function should return:
[[1.0]
[1.0 1.5]
[1.0 1.5 1.6]
[1.5 1.6 1.8]
...
...
[1.2 1.3 1.5]]
The input array could have elements of different shapes as well. For example, if the input is:
[[1.0]
[1.5]
[1.6 1.7]
[1.8]
...
...
[1.2]
[1.3]
[1.5]]
then output should be:
[[1.0]
[1.0 1.5]
[1.0 1.5 1.6 1.7]
[1.5 1.6 1.7 1.8]
...
...
[1.2 1.3 1.5]]
First, make your array into a list. There's no purpose in having an array of arrays in numpy.
l = arr.tolist()  # l is a list of arrays
Now use a list comprehension to take each window of elements and concatenate it with np.r_ (here n is the window size, e.g. n = 3):
l2 = [np.r_[tuple(l[max(i - n, 0):i])] for i in range(1, len(l) + 1)]
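A quick usage sketch (with n = 3 and the ragged example from the question):
import numpy as np

n = 3                                   # window size
l = [[1.0], [1.5], [1.6, 1.7], [1.8]]   # what arr.tolist() would give here
l2 = [np.r_[tuple(l[max(i - n, 0):i])] for i in range(1, len(l) + 1)]
for row in l2:
    print(row)
# [1.]
# [1.  1.5]
# [1.  1.5 1.6 1.7]
# [1.5 1.6 1.7 1.8]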

Grouping Equal Elements In An Array

I’m writing a program in Python which needs to sort through four columns of data in a text file and, for each set of identical numbers in the first column, return the row (all four numbers) with the largest value in the third column.
For example:
I need:
1.0 19.3 15.5 0.1
1.0 25.0 25.0 0.1
2.0 4.8 3.1 0.1
2.0 7.1 6.4 0.1
2.0 8.6 9.7 0.1
2.0 11.0 14.2 0.1
2.0 13.5 19.0 0.1
2.0 16.0 22.1 0.1
2.0 19.3 22.7 0.1
2.0 25.0 21.7 0.1
3.0 2.5 2.7 0.1
3.0 3.5 4.8 0.1
3.0 4.8 10.0 0.1
3.0 7.1 18.4 0.1
3.0 8.6 21.4 0.1
3.0 11.0 22.4 0.1
3.0 19.3 15.9 0.1
4.0 4.8 16.5 0.1
4.0 7.1 13.9 0.1
4.0 8.6 11.3 0.1
4.0 11.0 9.3 0.1
4.0 19.3 5.3 0.1
4.0 2.5 12.8 0.1
3.0 25.0 13.2 0.1
To return:
1.0 19.3 15.5 0.1
2.0 19.3 22.7 0.1
3.0 11.0 22.4 0.1
4.0 4.8 16.5 0.1
Here, the row [1.0, 19.3, 15.5, 0.1] is returned because 15.5 is the greatest third-column value out of all the rows where 1.0 is the first number. For each set of identical numbers in the first column, the function must return the row with the greatest value in the third column.
I am struggling with actually doing this in python, because the loop iterates over EVERY row and finds a maximum, not each ‘set’ of first column numbers.
Is there something about for loops that I don’t know which could help me do this?
Below is what I have so far.
import numpy as np

C0, C1, C2, C3 = np.loadtxt("FILE.txt", dtype={'names': ('C0', 'C1', 'C2', 'C3'), 'formats': ('f4', 'f4', 'f4', 'f4')}, unpack=True, usecols=(0, 1, 2, 3))

def FUNCTION(C_0, C_1, C_2, C_3):
    for i in range(len(C_1)):
        a = []
        a.append(C_0[i])
        for j in range(len(C_0)):
            if C_0[j] == C_0[i]:
                a.append(C_0[j])
        return a

print(FUNCTION(C0, C1, C2, C3))
where C0, C1, C2, and C3 are columns in the text file, loaded as 1-D arrays.
Right now I’m just trying to isolate the indexes of the rows with equal C0 values.
An approach could be to use a dict where the value is the row keyed by the first column item. This way you won't have to load the whole text file in memory at once. You can scan line by line and update the dict as you go.
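A minimal sketch of that idea (assuming whitespace-separated columns in a file called FILE.txt):
best = {}  # first-column value -> row with the largest third column so far
with open('FILE.txt') as f:
    for line in f:
        row = [float(v) for v in line.split()]
        if row[0] not in best or row[2] > best[row[0]][2]:
            best[row[0]] = row

for key in sorted(best):
    print(best[key])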
I got a bit confused by the first and second rows... I believe the 25.0 at (2, 3) is a mistake in your data.
My code is not a mathematical solution, but it should work.
import collections

with open("INPUT.txt", "r") as datasheet:
    data = datasheet.read().splitlines()

dataset = collections.OrderedDict()
for dataitem in data:
    temp = dataitem.split(" ")
    # I just wrote this code; the input columns were separated by spaces
    print(temp)
    if temp[0] in dataset.keys():
        if float(dataset[temp[0]][1]) < float(temp[2]):
            dataset[temp[0]] = [temp[1], temp[2], temp[3]]
    else:
        dataset[temp[0]] = [temp[1], temp[2], temp[3]]

# Some sort code here
with open("OUTPUT.txt", "w") as outputsheet:
    for datakey in dataset.keys():
        datavalue = dataset[datakey]
        outputsheet.write("%s %s %s %s\n" % (datakey, datavalue[0], datavalue[1], datavalue[2]))
Using Numpy and Lambda
Using the properties of a dict with some lambda functions does the trick:
import numpy as np

data = np.loadtxt("FILE.txt", dtype={'names': ('a', 'b', 'c', 'd'), 'formats': ('f4', 'f4', 'f4', 'f4')}, usecols=(0, 1, 2, 3))
# order by the first and third columns
sorted_data = sorted(data, key=lambda x: (x[0], x[2]))
# dict comprehension mapping the value of the first column to a row;
# since the rows are sorted, later (larger third-column) entries
# overwrite earlier ones, leaving the per-key maximum
ret = {d[0]: list(d) for d in sorted_data}.values()
Alternatively, you can make it an (ugly) one-liner:
ret = {
    d[0]: list(d)
    for d in sorted(np.loadtxt("FILE.txt", dtype={'names': ('a', 'b', 'c', 'd'),
                                                  'formats': ('f4', 'f4', 'f4', 'f4')},
                               usecols=(0, 1, 2, 3)),
                    key=lambda x: (x[0], x[2]))
}.values()
As #Fallen pointed out, this is an inefficient method as you need to read in the whole file. However, for the purposes of this example where the data set is quite small, it's reasonably acceptable.
Reading one line at a time
The more efficient way is reading in one line at a time.
import re

# Get the data
with open('data', 'r') as f:
    str_data = f.readlines()

# Convert to dict
d = {}
for s in str_data:
    data = [float(n) for n in re.split(r'\s+', s.strip())]
    if data[0] in d:
        if data[2] >= d[data[0]][2]:
            d[data[0]] = data
    else:
        d[data[0]] = data

print(d.values())
The caveat here is that there's no secondary sorting metric, so if you initially have a row for 1.0 such as [1.0, 2.0, 3.0, 5.0], then any subsequent 1.0 line whose third column is greater than or equal to 3.0 will overwrite it, e.g. [1.0, 1.0, 3.0, 1.0].
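If you would rather keep the first row seen when the third columns tie, a strict comparison in the code above avoids the overwrite (a one-character change):
if data[2] > d[data[0]][2]:
    d[data[0]] = data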
