I have a data file similar to this (the original file is much bigger):
Data
6 6
0.0 0.2 0.4
0.6 0.8
1.0
0.0 0.4 0.6
1.2 1.6
2.0
1.0 3.0 4.0 1.0
1.0 3.0
1.0 2.0 1.0 4.0
5.0 2.0
3.0 3.0 1.0 1.0
5.0 1.0
2.0 7.0 1.0 1.0
5.0 2.0
2.0 3.0 8.0 6.0
3.0 1.0
3.0 3.0 4.0 6.0
1.0 1.0
and I need to plot it as a 2D contour.
The first line is a dummy header. The first 6 gives the number of elements in the x direction, whose values are
0.0 0.2 0.4
0.6 0.8
1.0
and the second 6 gives the number of elements in the y direction, whose values are
0.0 0.4 0.6
1.2 1.6
2.0
Then each group of 6 numbers gives the contour values of one row, starting from row 1:
1.0 3.0 4.0 1.0
1.0 3.0
I want to cast this data into a 2D array so that I can plot them.
I tried,
data = numpy.genfromtxt('fileHere',delimiter= " ",skip_header=1)
to read data into a general array and then split it. But I get the following error,
Line #16999 (got 15 columns instead of 3)
I also tried Python's readline() and split() functions, but they make it much harder to continue. I want to have x and y in arrays and a separate array for the data in, let's say, a 6x6 shape. In MATLAB I used to use the fscanf function:
fscanf(fid,'%d',6);
I will be happy to have your ideas on this. Thanks
I think you want to read the full file into a string, replace every newline ('\n') with a space, split it into tokens, and convert the result into an ndarray.
Here's how I did it:
import numpy as np
txt = '''\
6 6
0.0 0.2 0.4
0.6 0.8
1.0
0.0 0.4 0.6
1.2 1.6
2.0
1.0 3.0 4.0 1.0
1.0 3.0
1.0 2.0 1.0 4.0
5.0 2.0
3.0 3.0 1.0 1.0
5.0 1.0
2.0 7.0 1.0 1.0
5.0 2.0
2.0 3.0 8.0 6.0
3.0 1.0
3.0 3.0 4.0 6.0
1.0 1.0'''
txt_list = txt.replace('\n',' ').split()
# if you want the values as floats, uncomment the next line;
# otherwise the resulting numpy array will hold strings
# txt_list = [float(i) for i in txt_list]
width = int(txt_list[1])
height = len(txt_list[2:])//width
txt_array = np.reshape(txt_list[2:], (height, width))
print (txt_array)
The output of this (with the values left as strings) will be:
[['0.0' '0.2' '0.4' '0.6' '0.8' '1.0']
['0.0' '0.4' '0.6' '1.2' '1.6' '2.0']
['1.0' '3.0' '4.0' '1.0' '1.0' '3.0']
['1.0' '2.0' '1.0' '4.0' '5.0' '2.0']
['3.0' '3.0' '1.0' '1.0' '5.0' '1.0']
['2.0' '7.0' '1.0' '1.0' '5.0' '2.0']
['2.0' '3.0' '8.0' '6.0' '3.0' '1.0']
['3.0' '3.0' '4.0' '6.0' '1.0' '1.0']]
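To do the same from the actual data file rather than a hard-coded string, and then draw the contour, a minimal sketch could look like the following (it assumes the file is the fileHere from the question, with the dummy Data line first, and that matplotlib is available):
import numpy as np
import matplotlib.pyplot as plt

with open('fileHere') as f:
    f.readline()                               # skip the dummy "Data" header line
    values = [float(v) for v in f.read().split()]

nx = int(values[0])                            # 6 points in x (and 6 in y)
grid = np.reshape(values[2:], (-1, nx))

x, y, z = grid[0], grid[1], grid[2:]           # axes first, then the 6x6 field
plt.contour(x, y, z)
plt.colorbar()
plt.show()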
A fairly smart way to reformat your input is possible using Pandas.
Start by reading your input file into a pandas DataFrame (with the standard field separator, i.e. a comma):
import pandas as pd

df = pd.read_csv('Input.txt')
As your input file does not contain commas, each line is read as a single
field and the column name (Data) is taken from the first line.
So far the initial part of df is:
Data
0 6 6
1 0.0 0.2 0.4
2 0.6 0.8
3 1.0
4 0.0 0.4 0.6
The left column is the index, but it is not important.
The type of the only column is object, actually a string.
Then, to reformat this DataFrame into a 6-column NumPy array, it is enough to run the following one-liner:
result = df.drop(0).Data.str.split(' ').explode().astype('float').values.reshape(-1, 6)
Steps:
drop(0) - Drop the initial row (with index 0).
Data - Take Data column.
str.split(' ') - Split each element on spaces (the result is a list of strings).
explode() - Convert each list into a sequence of rows. So far each
element is of string type.
astype('float') - Change the type to float.
values - Take the underlying Numpy (1-D) array.
reshape(-1, 6) - Reshape to 6 columns and as many rows as needed.
The result, for your data sample is:
array([[0. , 0.2, 0.4, 0.6, 0.8, 1. ],
[0. , 0.4, 0.6, 1.2, 1.6, 2. ],
[1. , 3. , 4. , 1. , 1. , 3. ],
[1. , 2. , 1. , 4. , 5. , 2. ],
[3. , 3. , 1. , 1. , 5. , 1. ],
[2. , 7. , 1. , 1. , 5. , 2. ],
[2. , 3. , 8. , 6. , 3. , 1. ],
[3. , 3. , 4. , 6. , 1. , 1. ]])
And the last step is to split this array into:
the 2 initial rows (the x and y coordinates),
the following rows (the actual data).
To do this, run:
x = result[0]
y = result[1]
data = result[2:]
Alternative: Don't create separate variables, but call plt.contour
passing respective rows of result as X, Y and Z.
Something like:
plt.contour(result[0], result[1], result[2:]);
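For completeness, an end-to-end sketch of this approach (a rough outline, assuming the file is called Input.txt and matplotlib is available):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('Input.txt')
result = df.drop(0).Data.str.split(' ').explode().astype('float').values.reshape(-1, 6)

# row 0 is x, row 1 is y, the remaining rows are the 6x6 field
plt.contour(result[0], result[1], result[2:])
plt.colorbar()
plt.show()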
I would like to remove NaN elements from a pair of NumPy arrays with different dimensions using Python: one array with shape (8, 3) and another with shape (8,). If at least one NaN appears in a row, the entire row (and the matching element of the second array) needs to be removed. However, I ran into issues because the two arrays have different dimensions.
For example,
[1.7 2.3 3.4] 4.2
[2.3 3.4 4.2] 4.6
[3.4 nan 4.6] 4.8
[4.2 4.6 4.8] 4.6
[4.6 4.8 4.6] nan
[4.8 4.6 nan] nan
[4.6 nan nan] nan
[nan nan nan] nan
I want it to become
[1.7 2.3 3.4] 4.2
[2.3 3.4 4.2] 4.6
[4.2 4.6 4.8] 4.6
This is my code which generates the sequence data:
from numpy import array

def split_sequence(sequence, n_steps):
    X, y = list(), list()
    for i in range(len(sequence)):
        # find the end of this pattern
        end_ix = i + n_steps
        # check if we are beyond the sequence
        if end_ix > len(sequence)-1:
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequence[i:end_ix], sequence[end_ix]
        X.append(seq_x)
        y.append(seq_y)
    return array(X), array(y)
n_steps = 3
sequence = df_sensorRefill['sensor'].to_list()
X, y = split_sequence(sequence, n_steps)
Thanks
You could use np.isnan() and np.any() to find the rows containing NaNs, and np.delete() to remove those rows.
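A minimal sketch of that approach, using the sample values from the question:
import numpy as np

X = np.array([[1.7, 2.3, 3.4],
              [2.3, 3.4, 4.2],
              [3.4, np.nan, 4.6],
              [4.2, 4.6, 4.8],
              [4.6, 4.8, 4.6],
              [4.8, 4.6, np.nan],
              [4.6, np.nan, np.nan],
              [np.nan, np.nan, np.nan]])                             # shape (8, 3)
y = np.array([4.2, 4.6, 4.8, 4.6, np.nan, np.nan, np.nan, np.nan])   # shape (8,)

# a row is "bad" if X has any NaN in it or the matching y value is NaN
bad = np.isnan(X).any(axis=1) | np.isnan(y)

X_clean = np.delete(X, np.where(bad)[0], axis=0)   # or simply X[~bad]
y_clean = np.delete(y, np.where(bad)[0])           # or y[~bad]
print(X_clean, y_clean)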
I am working with a NumPy ndarray to preprocess data for a neural network. It basically contains several fixed-length arrays of sensor data. For example:
>>> type(arr)
<class 'numpy.ndarray'>
>>> arr.shape
(400,1,5,4)
>>> arr
[
[[ 9.4 -3.7 -5.2 3.8]
[ 2.8 1.4 -1.7 3.4]
[ 0.0 0.0 0.0 0.0]
[ 0.0 0.0 0.0 0.0]
[ 0.0 0.0 0.0 0.0]]
..
[[ 0.0 -1.0 2.1 0.0]
[ 3.0 2.8 -3.0 8.2]
[ 7.5 1.7 -3.8 2.6]
[ 0.0 0.0 0.0 0.0]
[ 0.0 0.0 0.0 0.0]]
]
Each of the nested arrays has shape (1, 5, 4). The goal is to run through arr and select only those arrays whose first three rows are all non-zero (a single entry can be zero, but not a whole row).
So in the example above, the first nested array should be deleted because only its first 2 rows are non-zero, whereas we need 3 or more.
Here's a trick you can use:
mask = arr[:,:,:3].any(axis=3).all(axis=2)
arr_filtered = arr[mask]
Quick explanation: to keep a nested array, each of its first 3 rows (hence we only look at arr[:,:,:3]) must have at least one non-zero entry (hence .any(axis=3)), and this must hold for all three of them (hence .all(axis=2) at the end).
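A quick way to sanity-check this on a tiny array with the same (N, 1, 5, 4) layout (note that indexing with the 2-D boolean mask drops the singleton axis; ravel the mask first if you want to keep the original shape):
import numpy as np

arr = np.zeros((2, 1, 5, 4))
arr[0, 0, :2] = 1.0           # only the first two rows are non-zero -> drop
arr[1, 0, :3] = [1, 0, 2, 3]  # first three rows each have a non-zero entry -> keep

mask = arr[:, :, :3].any(axis=3).all(axis=2)   # boolean mask, shape (2, 1)
print(mask.ravel())                            # [False  True]
print(arr[mask].shape)                         # (1, 5, 4)
print(arr[mask.ravel()].shape)                 # (1, 1, 5, 4), singleton axis kept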
I have a matrix of data with missing values that I am trying to impute, and I am looking at the options for different imputers to see what settings would work best for the biological context I am working in. I understand the knnimpute function in MATLAB and the SimpleImputer in scikit-learn. However, I'm not quite sure my understanding of the iterative imputer is correct.
I have looked at the documentation at this site for the multivariate/iterative imputer -- https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html
I don't fully understand the description of the algorithm as "round-robin". Does the imputer use the characteristics of both the columns and the rows in the matrix to determine the "value" of a missing data point, and then take that approach one random missing data point at a time, to avoid shifting the data unnaturally towards the characteristics of a previously imputed data point?
My understanding of the algorithms is as follows:
Simple Imputer
The SimpleImputer uses the non-missing values in each column to estimate the missing values.
For example, if you had a column like age with 10% missing values, it would find the mean age and replace all missing values in the age column with that value.
It supports several other methods of imputation, such as the median and the mode (most_frequent), as well as a constant value you define yourself. The last two can also be used on categorical values.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'A':[np.nan,2,3],
'B':[3,5,np.nan],
'C':[0,np.nan,3],
'D':[2,6,3]})
print(df)
A B C D
0 NaN 3.0 0.0 2
1 2.0 5.0 NaN 6
2 3.0 NaN 3.0 3
imp = SimpleImputer()
imp.fit_transform(df)
array([[2.5, 3. , 0. , 2. ],
[2. , 5. , 1.5, 6. ],
[3. , 4. , 3. , 3. ]])
As you can see, the imputed values are simply the mean of each column.
Iterative Imputer
The Iterative Imputer can do a number of different things depending upon how you configure it. This explanation assumes the default values.
Original Data
A B C D
0 NaN 3.0 0.0 2
1 2.0 5.0 NaN 6
2 3.0 NaN 3.0 3
First, it does the same thing as the simple imputer, i.e. it imputes the missing values based on the initial_strategy parameter (default = 'mean').
Initial Pass
A B C D
0 2.5 3.0 0.0 2
1 2.0 5.0 1.5 6
2 3.0 4.0 3.0 3
Secondly, it trains the estimator passed in (default = BayesianRidge) as a predictor. In our case we have columns A, B, C, D, so the estimator fits a model with independent variables A, B, C and dependent variable D:
X = df[['A','B','C']]
y = df['D']
model = BayesianRidge().fit(X, y)
Then it calls the predict method of this newly fitted model for the rows where the value was originally missing and replaces them with the predictions (roughly, in pseudocode):
model.predict(X[mask_D])  # rows where D was originally missing
This method is repeated for all combinations of columns(the round robin described in the docs) e.g.
X = df[['B','C','D']]
y = df[['A']]
...
X = df[['A','C','D']]
y = df[['B']]
...
X = df[['A','B','D']]
y = df[['C']]
...
This round robin of training an estimator on each combination of columns makes up one pass. The process is repeated until either the stopping tolerance is met or the imputer reaches the maximum number of iterations (max_iter, default = 10).
So if we run for three passes, it looks like this:
Original Data
A B C D
0 NaN 3.0 0.0 2
1 2.0 5.0 NaN 6
2 3.0 NaN 3.0 3
Initial (simple) Pass
A B C D
0 2.5 3.0 0.0 2
1 2.0 5.0 1.5 6
2 3.0 4.0 3.0 3
pass_1
[[3.55243135 3. 0. 2. ]
[2. 5. 7.66666393 6. ]
[3. 3.7130697 3. 3. ]]
pass_2
[[ 3.39559017 3. 0. 2. ]
[ 2. 5. 10.39409964 6. ]
[ 3. 3.57003864 3. 3. ]]
pass_3
[[ 3.34980014 3. 0. 2. ]
[ 2. 5. 11.5269743 6. ]
[ 3. 3.51894112 3. 3. ]]
Obviously it doesn't work very well for such a small example because there isn't enough data to fit the estimator, so with a smaller data set it may be best to use the simple imputation method.
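If you want to try this yourself, here is a rough sketch using scikit-learn on the same small frame (the IterativeImputer is still marked experimental, so the extra enable_iterative_imputer import is required; the exact numbers per pass will depend on the estimator and the stopping tolerance):
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({'A': [np.nan, 2, 3],
                   'B': [3, 5, np.nan],
                   'C': [0, np.nan, 3],
                   'D': [2, 6, 3]})

# max_iter caps the number of round-robin passes (default is 10)
imp = IterativeImputer(max_iter=3, random_state=0)
print(imp.fit_transform(df))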
I’m writing a program in Python which needs to sort through four columns of data in a text file and, for each set of identical numbers in the first column, return the four numbers of the row with the largest value in the third column.
For example:
I need:
1.0 19.3 15.5 0.1
1.0 25.0 25.0 0.1
2.0 4.8 3.1 0.1
2.0 7.1 6.4 0.1
2.0 8.6 9.7 0.1
2.0 11.0 14.2 0.1
2.0 13.5 19.0 0.1
2.0 16.0 22.1 0.1
2.0 19.3 22.7 0.1
2.0 25.0 21.7 0.1
3.0 2.5 2.7 0.1
3.0 3.5 4.8 0.1
3.0 4.8 10.0 0.1
3.0 7.1 18.4 0.1
3.0 8.6 21.4 0.1
3.0 11.0 22.4 0.1
3.0 19.3 15.9 0.1
4.0 4.8 16.5 0.1
4.0 7.1 13.9 0.1
4.0 8.6 11.3 0.1
4.0 11.0 9.3 0.1
4.0 19.3 5.3 0.1
4.0 2.5 12.8 0.1
3.0 25.0 13.2 0.1
To return:
1.0 19.3 15.5 0.1
2.0 19.3 22.7 0.1
3.0 11.0 22.4 0.1
4.0 4.8 16.5 0.1
Here, the row [1.0, 19.3, 15.5, 0.1] is returned because 15.5 is the greatest third column value that any of the rows has, out of all the rows where 1.0 is the first number. For each set of identical numbers in the first column, the function must return the rows with the greatest value in the third column.
I am struggling with actually doing this in python, because the loop iterates over EVERY row and finds a maximum, not each ‘set’ of first column numbers.
Is there something about for loops that I don’t know which could help me do this?
Below is what I have so far.
import numpy as np
C0,C1,C2,C3 = np.loadtxt("FILE.txt",dtype={'names': ('C0', 'C1', 'C2','C3'),'formats': ('f4', 'f4', 'f4','f4')},unpack=True,usecols=(0,1,2,3))
def FUNCTION(C_0,C_1,C_2,C_3):
    for i in range(len(C_1)):
        a = []
        a.append(C_0[i])
        for j in range(len(C_0)):
            if C_0[j] == C_0[i]:
                a.append(C_0[j])
        return a
print FUNCTION(C0,C1,C2,C3)
where C0,C1,C2, and C3 are columns in the text file, loaded as 1-D arrays.
Right now I’m just trying to isolate the indexes of the rows with equal C0 values.
An approach could be to use a dict where the value is the row, keyed by the first-column item. This way you won't have to load the whole text file into memory at once: you can scan line by line and update the dict as you go.
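A minimal sketch of that idea (assuming the whitespace-separated FILE.txt from the question):
best = {}
with open("FILE.txt") as f:
    for line in f:
        row = [float(v) for v in line.split()]
        # keep, for each first-column value, the row with the largest third column
        if row[0] not in best or row[2] > best[row[0]][2]:
            best[row[0]] = row

for key in sorted(best):
    print(best[key])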
I was a bit confused by the first and second rows... I believe the 25.0 at (2, 3) is a mistake in your expected output.
My code is not a mathematical solution, but it works.
import collections
with open("INPUT.txt", "r") as datasheet:
data = datasheet.read().splitlines()
dataset = collections.OrderedDict()
for dataitem in data:
temp = dataitem.split(" ")
# I just wrote this code, input and output was seperated by four spaces
print(temp)
if temp[0] in dataset.keys():
if float(dataset[temp[0]][1]) < float(temp[2]):
dataset[temp[0]] = [temp[1], temp[2], temp[3]]
else:
dataset[temp[0]] = [temp[1], temp[2], temp[3]]
# Some sort code here
with open("OUTPUT.txt", "w") as outputsheet:
for datakey in dataset.keys():
datavalue = dataset[datakey]
outputsheet.write("%s %s %s %s\n" % (datakey, datavalue[0], datavalue[1], datavalue[2]))
Using Numpy and Lambda
Using the properties of a dict with some lambda functions does the trick:
import numpy as np

data = np.loadtxt("FILE.txt",dtype={'names': ('a', 'b', 'c','d'),'formats': ('f4', 'f4', 'f4','f4')},usecols=(0,1,2,3))
# ordering by columns 1 and 3
sorted_data = sorted(data, key=lambda x: (x[0],x[2]))
# dict comprehension mapping the value of first column to a row
# this will overwrite all previous entries as mapping is 1-to-1
ret = {d[0]: list(d) for d in sorted_data}.values()
Alternatively, you can make it an (ugly) one-liner:
ret = {
    d[0]: list(d)
    for d in sorted(np.loadtxt("FILE.txt",
                               dtype={'names': ('a', 'b', 'c', 'd'),
                                      'formats': ('f4', 'f4', 'f4', 'f4')},
                               usecols=(0, 1, 2, 3)),
                    key=lambda x: (x[0], x[2]))
}.values()
As #Fallen pointed out, this is an inefficient method as you need to read in the whole file. However, for the purposes of this example where the data set is quite small, it's reasonably acceptable.
Reading one line at a time
The more efficient way is to read one line at a time.
import re
# Build the dict one line at a time
d = {}
with open('data', 'r') as f:
    for s in f:
        data = [float(n) for n in re.split(r'\s+', s.strip())]
        if data[0] in d:
            if data[2] >= d[data[0]][2]:
                d[data[0]] = data
        else:
            d[data[0]] = data

print d.values()
The caveat here is that there's no secondary sorting metric, so if you initially have a row for 1.0 such as [1.0, 2.0, 3.0, 5.0], then any subsequent line with a 1.0 whose 3rd column is greater than or equal to 3.0 will overwrite it, e.g. [1.0, 1.0, 3.0, 1.0].
I have a numpy array (a):
array([[ 1. , 5.1, 3.5, 1.4, 0.2],
[ 1. , 4.9, 3. , 1.4, 0.2],
[ 2. , 4.7, 3.2, 1.3, 0.2],
[ 2. , 4.6, 3.1, 1.5, 0.2]])
I would like to make a pandas DataFrame with values=a, columns A, B, C, D, and the index set to the first column of my numpy array. It should finally look like this:
A B C D
1 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
2 4.6 3.1 1.5 0.2
I am trying this:
df = pd.DataFrame(a, index=a[:,0], columns=['A', 'B','C','D'])
and I get the following error:
ValueError: Shape of passed values is (5, 4), indices imply (4, 4)
Any help?
Thanks
You passed the complete array as the data param; you need to slice your array if you only want 4 columns from it as the data:
In [158]:
df = pd.DataFrame(a[:,1:], index=a[:,0], columns=['A', 'B','C','D'])
df
Out[158]:
A B C D
1 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
2 4.6 3.1 1.5 0.2
Also, having duplicate values in the index will make filtering/indexing problematic.
So with a[:,1:] I take all the rows but only the columns from index 1 onwards, as desired; see the docs.
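As a small illustration of the duplicate-index caveat mentioned above (a hypothetical follow-up, reusing the array from the question):
import numpy as np
import pandas as pd

a = np.array([[1., 5.1, 3.5, 1.4, 0.2],
              [1., 4.9, 3.0, 1.4, 0.2],
              [2., 4.7, 3.2, 1.3, 0.2],
              [2., 4.6, 3.1, 1.5, 0.2]])
df = pd.DataFrame(a[:, 1:], index=a[:, 0], columns=['A', 'B', 'C', 'D'])

# with duplicate labels in the index, label-based selection returns every
# matching row, so this is a 2-row DataFrame rather than a single Series
print(df.loc[1.0])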