I would like to remove NaN elements from a pair of NumPy arrays with different dimensions, one with shape (8, 3) and the other with shape (8,). If at least one NaN element appears in a row, the entire row (in both arrays) needs to be removed. However, I'm facing issues because the two arrays have different dimensions.
For example,
[1.7 2.3 3.4] 4.2
[2.3 3.4 4.2] 4.6
[3.4 nan 4.6] 4.8
[4.2 4.6 4.8] 4.6
[4.6 4.8 4.6] nan
[4.8 4.6 nan] nan
[4.6 nan nan] nan
[nan nan nan] nan
I want it to become
[1.7 2.3 3.4] 4.2
[2.3 3.4 4.2] 4.6
[4.2 4.6 4.8] 4.6
This is my code which generates the sequence data:
from numpy import array

def split_sequence(sequence, n_steps):
    X, y = list(), list()
    for i in range(len(sequence)):
        # find the end of this pattern
        end_ix = i + n_steps
        # check if we are beyond the sequence
        if end_ix > len(sequence)-1:
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequence[i:end_ix], sequence[end_ix]
        X.append(seq_x)
        y.append(seq_y)
    return array(X), array(y)
n_steps = 3
sequence = df_sensorRefill['sensor'].to_list()
X, y = split_sequence(sequence, n_steps)
Thanks
You could use np.isnan() and np.any() to find rows containing NaNs, and np.delete() to remove such rows.
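For example, a minimal sketch along those lines (assuming X with shape (8, 3) and y with shape (8,) are the arrays returned by split_sequence above):

import numpy as np

# mark every row that contains at least one NaN in either array
mask = np.isnan(X).any(axis=1) | np.isnan(y)

# drop those rows from both arrays
X_clean = np.delete(X, np.where(mask)[0], axis=0)
y_clean = np.delete(y, np.where(mask)[0])

Boolean indexing (X[~mask], y[~mask]) would give the same result.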
I'm looking for an identification procedure using Pandas to trace the movement of some objects.
I want to assign an ID to each row by comparing it to the previous frame (Time) within the same Group, based on multiple conditions: X, Y and Z each being within a tolerance (X < 0.5, Y < 0.6, Z < 0.7). The result should be written in place. If no match is available, a new ID should be assigned from a counter that restarts per Group.
Here is an example:
Before:
Group  Time    X    Y    Z
A       1.0  0.1  0.1  0.1
A       1.0  2.2  2.2  2.2
B       1.0  3.3  3.3  3.3
A       1.1  0.4  0.4  0.4
A       1.1  5.5  5.5  5.5
B       1.1  3.6  3.6  3.6
After:
Group  Time    X    Y    Z  ID
A       1.0  0.1  0.1  0.1   1
A       1.0  2.2  2.2  2.2   2
B       1.0  3.3  3.3  3.3   1
A       1.1  0.4  0.4  0.4   1
A       1.1  5.5  5.5  5.5   3
B       1.1  3.6  3.6  3.6   1
For clarification:
Row #4: the X, Y, Z change is within the tolerance, hence the same ID.
Row #5: the X, Y, Z change is NOT within the tolerance, hence a new ID in Group A.
Row #6: the X, Y, Z change is within the tolerance, hence the same ID.
I think I can trace the movement of my objects in only one direction (let's say X) using pd.merge_asof, regardless of their group and time, and find their ID. My problems are: 1. taking the group and time into account, 2. the assignment of new IDs, 3. different tolerances per axis.
df3=pd.merge_asof(df1, df2, on="X", direction="nearest", by=["Group"], tolerance=0.5)
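For reference, a rough sketch of the intended logic (an assumption on my part, not using merge_asof): for each Time frame, match every row against the objects of the previous frame of its Group using the per-axis tolerances, and otherwise hand out a new per-group ID.

import pandas as pd

# per-axis tolerances taken from the question
TOL_X, TOL_Y, TOL_Z = 0.5, 0.6, 0.7

def assign_ids(df):
    # df is expected to be a pandas DataFrame with Group, Time, X, Y, Z columns
    df = df.copy()
    df['ID'] = 0
    counters = {}   # next ID to hand out, per Group
    prev = {}       # objects of the previous frame, per Group: [(id, x, y, z), ...]
    for _, frame in df.groupby('Time', sort=True):
        seen = {}   # objects of the current frame, per Group
        for idx, row in frame.iterrows():
            g = row['Group']
            match = None
            for pid, px, py, pz in prev.get(g, []):
                if (abs(row['X'] - px) < TOL_X and
                        abs(row['Y'] - py) < TOL_Y and
                        abs(row['Z'] - pz) < TOL_Z):
                    match = pid
                    break
            if match is None:               # nothing close enough: new per-group ID
                counters[g] = counters.get(g, 0) + 1
                match = counters[g]
            df.at[idx, 'ID'] = match
            seen.setdefault(g, []).append((match, row['X'], row['Y'], row['Z']))
        prev.update(seen)                   # the current frame becomes the reference
    return df

On the small example above, assign_ids(df) should reproduce the After table.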
I have a data file similar to this (the original file is much bigger):
Data
6 6
0.0 0.2 0.4
0.6 0.8
1.0
0.0 0.4 0.6
1.2 1.6
2.0
1.0 3.0 4.0 1.0
1.0 3.0
1.0 2.0 1.0 4.0
5.0 2.0
3.0 3.0 1.0 1.0
5.0 1.0
2.0 7.0 1.0 1.0
5.0 2.0
2.0 3.0 8.0 6.0
3.0 1.0
3.0 3.0 4.0 6.0
1.0 1.0
and I need to plot it as a 2D contour.
The first line is a dummy header. The first 6 gives the number of elements in the x direction, i.e.
0.0 0.2 0.4
0.6 0.8
1.0
and the second 6 gives the number of elements in the y direction, i.e.
0.0 0.4 0.6
1.2 1.6
2.0
Then each set of 6 numbers gives the contour values of one row, starting from row 1, e.g.
1.0 3.0 4.0 1.0
1.0 3.0
I want to cast this data into a 2D array so that I can plot it.
I tried,
data = numpy.genfromtxt('fileHere',delimiter= " ",skip_header=1)
to read the data into a general array and then split it, but I get the following error:
Line #16999 (got 15 columns instead of 3)
I also tried Python's readline() and split() functions, but they make it much harder to continue. I want to have x and y in arrays and a separate array for the data in, let's say, a 6x6 shape. In MATLAB I used to use the fscanf function:
fscanf(fid,'%d',6);
I will be happy to have your ideas on this. Thanks
I think you want to read the full file into a variable, then replace all newlines ('\n'), then split it and convert it into an ndarray.
Here's how I did it.
import numpy as np
txt = '''\
6 6
0.0 0.2 0.4
0.6 0.8
1.0
0.0 0.4 0.6
1.2 1.6
2.0
1.0 3.0 4.0 1.0
1.0 3.0
1.0 2.0 1.0 4.0
5.0 2.0
3.0 3.0 1.0 1.0
5.0 1.0
2.0 7.0 1.0 1.0
5.0 2.0
2.0 3.0 8.0 6.0
3.0 1.0
3.0 3.0 4.0 6.0
1.0 1.0'''
txt_list = txt.replace('\n',' ').split()
# if you want the values as floats, uncomment the next line;
# otherwise the data ends up in the numpy array as strings (as in the output below)
# txt_list = [float(i) for i in txt_list]
width = int(txt_list[1])
height = len(txt_list[2:])//width
txt_array = np.reshape(txt_list[2:], (height, width))
print (txt_array)
The output of this will be:
[['0.0' '0.2' '0.4' '0.6' '0.8' '1.0']
['0.0' '0.4' '0.6' '1.2' '1.6' '2.0']
['1.0' '3.0' '4.0' '1.0' '1.0' '3.0']
['1.0' '2.0' '1.0' '4.0' '5.0' '2.0']
['3.0' '3.0' '1.0' '1.0' '5.0' '1.0']
['2.0' '7.0' '1.0' '1.0' '5.0' '2.0']
['2.0' '3.0' '8.0' '6.0' '3.0' '1.0']
['3.0' '3.0' '4.0' '6.0' '1.0' '1.0']]
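From there, splitting txt_array into the coordinate rows and the contour grid could look like this (a small sketch, not part of the original answer):

# convert to float (needed if the conversion line above is left commented out)
arr = txt_array.astype(float)
x = arr[0]       # the 6 x coordinates
y = arr[1]       # the 6 y coordinates
data = arr[2:]   # the 6x6 grid of contour values, ready for plt.contour(x, y, data)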
Quite a smart solution to reformat your input is possible using
Pandas.
Start by reading your input file as a pandasonic DataFrame (with the
standard field separator, i.e. a comma):
df = pd.read_csv('Input.txt')
As your input file does not contain commas, each line is read as a single
field and the column name (Data) is taken from the first line.
So far the initial part of df is:
Data
0 6 6
1 0.0 0.2 0.4
2 0.6 0.8
3 1.0
4 0.0 0.4 0.6
The left column is the index, but it is not important.
The type of the only column is object, actually a string.
Then, to reformat this DataFrame into a 6-column Numpy array, it
is enough to run the following one-liner:
result = df.drop(0).Data.str.split(' ').explode().astype('float').values.reshape(-1, 6)
Steps:
drop(0) - Drop the initial row (with index 0).
Data - Take Data column.
str.split(' ') - Split each element on spaces (the result is a list of strings).
explode() - Convert each list into a sequence of rows. So far each
element is of string type.
astype('float') - Change the type to float.
values - Take the underlying Numpy (1-D) array.
reshape(-1, 6) - Reshape to 6 columns and as many rows as needed.
The result, for your data sample is:
array([[0. , 0.2, 0.4, 0.6, 0.8, 1. ],
       [0. , 0.4, 0.6, 1.2, 1.6, 2. ],
       [1. , 3. , 4. , 1. , 1. , 3. ],
       [1. , 2. , 1. , 4. , 5. , 2. ],
       [3. , 3. , 1. , 1. , 5. , 1. ],
       [2. , 7. , 1. , 1. , 5. , 2. ],
       [2. , 3. , 8. , 6. , 3. , 1. ],
       [3. , 3. , 4. , 6. , 1. , 1. ]])
And the last step is to divide this array into:
the 2 initial rows (the x and y coordinates),
the following rows (the actual data).
To do it, run:
x = result[0]
y = result[1]
data = result[2:]
Alternative: Don't create separate variables, but call plt.contour
passing respective rows of result as X, Y and Z.
Something like:
plt.contour(result[0], result[1], result[2:]);
I'm implementing a machine learning algorithm and I'm extracting features out of a dataframe. I obviously need overlapping windows. Suppose the DataFrame looks like this
x y z
12.1 11 0.5
12.2 10 0.3
12.4 11 0.5
12.8 12 0.4
13.1 13 0.4
14.7 14 0.5
15.2 14 0.6
15.3 13 0.5
17.3 14 0.5
18.2 15 0.4
16.1 16 0.2
15.0 17 0.1
But in reality it is a lot larger (thousands of samples). I now want a list of DataFrames where each DataFrame has length ws (here 150), with a step (stride) of 60.
This is what I've got:
r = np.arange(len(df))
s = r[::step]
return [df.iloc[k:k+ws] for k in s]
This works reasonably well but there's still one problem. The last 1, 2 or 3 frames might not have length ws. I also cannot just discard the last 3, since sometimes there's only one with a length smaller than ws. So the s variable just keeps all the start indices; I'd need a way to keep only the start indices where start_index + step < len(df). Unless, of course, there are better and/or faster ways to do this (maybe a library). All existing documentation only talks about simple arrays.
You might only need to change s:
s = r[:len(df)-ws+1:step]
In this way you only find the start indexes of frames with length ws.
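Put together, a minimal sketch of the windowing helper (assuming ws and step are defined as in the question):

import numpy as np

def sliding_windows(df, ws, step):
    # keep only the start indices that leave room for a full window of length ws
    starts = np.arange(len(df))[:len(df) - ws + 1:step]
    return [df.iloc[k:k + ws] for k in starts]

# e.g. windows = sliding_windows(df, ws=150, step=60)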
I want to generate 6 random numbers (weights) that always sum to one, and multiply them by the columns of data I have imported from a CSV file. The sum is stored in another column (the weighted average), and the difference between the max and min of that new column is the range. I want to repeat the process 1,000,000 times and find the least range together with the set of random numbers (weights) that produced it.
Here is what I have done so far:
1. Generate 6 random numbers.
2. Import data from the CSV file.
3. Multiply the random numbers by the data from the CSV file and find the average (weighted average).
4. Save the weighted average in a new column F(x).
5. Find the range.
6. Repeat this 1,000,000 times and get the random numbers that give me the least range.
Here is some data from the file:
A B C D E F F(x)
0 4.9 3.9 6.3 3.4 7.3 3.4 0.0
1 4.1 3.7 7.7 2.8 5.5 3.9 0.0
2 6.0 6.0 4.0 3.1 3.7 4.3 0.0
3 5.6 6.3 6.6 4.6 8.3 4.6 0.0
Currently I'm getting 0.0 for all F(x), which should not be the case.
arr = np.array(np.random.dirichlet(np.ones(6), size=1))
arr=pd.DataFrame(arr)
ar=(arr.iloc[0])
df = pd.read_csv('weit.csv')
df['F(x)']=df.mul(ar).sum(1)
df
df['F(x)'].max() - df['F(x)'].min()
I am getting 0 for all my weighted averages, but I need the actual weighted averages.
I can't loop the code to run 1,000,000 times and get the least range.
If I understand correctly what you need:
#data from file
print (df)
A B C D E F
0 4.9 3.9 6.3 3.4 7.3 3.4
1 4.1 3.7 7.7 2.8 5.5 3.9
2 6.0 6.0 4.0 3.1 3.7 4.3
3 5.6 6.3 6.6 4.6 8.3 4.6
np.random.seed(3434)
Generate a 2D array with 6 'columns' and N 'rows' filled with random numbers:
N = 10
#in real data
#N = 1000000
arr = np.array(np.random.dirichlet(np.ones(6), size=N))
print (arr)
[[0.07077773 0.08042978 0.02589592 0.03457833 0.53804634 0.25027191]
[0.22174594 0.22673581 0.26136526 0.04820957 0.00976747 0.23217594]
[0.01202493 0.14247592 0.3411326 0.0239181 0.08448841 0.39596005]
[0.09354759 0.54989312 0.08893737 0.22051801 0.03850101 0.00860291]
[0.09418778 0.33345217 0.11721214 0.33480462 0.11894247 0.00140081]
[0.04285476 0.04531546 0.38105815 0.04316535 0.46902838 0.0185779 ]
[0.00441747 0.08044848 0.33383453 0.09476135 0.37568431 0.11085386]
[0.14613552 0.11260451 0.10421495 0.27880266 0.28994218 0.06830019]
[0.50747802 0.15704797 0.04410511 0.07552837 0.18744306 0.02839746]
[0.00203448 0.13225783 0.43042505 0.33410145 0.08385366 0.01732753]]
Then convert the values from the DataFrame to a 2D numpy array:
b = df.values
#pandas 0.24+
#b = df.to_numpy()
print (b)
[[4.9 3.9 6.3 3.4 7.3 3.4]
[4.1 3.7 7.7 2.8 5.5 3.9]
[6. 6. 4. 3.1 3.7 4.3]
[5.6 6.3 6.6 4.6 8.3 4.6]]
Last, multiply both arrays together (broadcast to a 3D array) and sum along axis 2; then, to subtract the minimum from the maximum, use numpy.ptp:
c = np.ptp((arr * b[:, None]).sum(axis=2), axis=1)
print (c)
[2.19787892 2.08476765 1.2654273 1.45134533]
Another solution with numpy.einsum:
c = np.ptp(np.einsum('ik,jk->jik', arr, b).sum(axis=2), axis=1)
print (c)
[2.19787892 2.08476765 1.2654273 1.45134533]
A loop solution for comparison, but slow with large N:
out = []
for row in df.values:
    # print (row)
    a = np.ptp((row * arr).sum(axis=1))
    out.append(a)
print (out)
[2.197878921892329, 2.0847676512823052, 1.2654272959079576, 1.4513453259898297]
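As a small follow-up (my own assumption about the original goal, not part of the answer above): if what's needed is, for each weight vector, the range of the weighted sums across the data rows, and then the weights giving the smallest range, the ptp axis can be swapped and the argmin taken:

# one range per weight vector: spread of the weighted sums across the data rows
ranges = np.ptp((arr * b[:, None]).sum(axis=2), axis=0)
best = ranges.argmin()                  # index of the weight vector with the smallest range
print('smallest range:', ranges[best])
print('weights:', arr[best])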