Grouping Equal Elements In An Array - python

I’m writing a program in Python which needs to sort through four columns of data in a text file and, for each set of identical numbers in the first column, return the four numbers of the row with the largest number in the third column.
For example:
I need:
1.0 19.3 15.5 0.1
1.0 25.0 25.0 0.1
2.0 4.8 3.1 0.1
2.0 7.1 6.4 0.1
2.0 8.6 9.7 0.1
2.0 11.0 14.2 0.1
2.0 13.5 19.0 0.1
2.0 16.0 22.1 0.1
2.0 19.3 22.7 0.1
2.0 25.0 21.7 0.1
3.0 2.5 2.7 0.1
3.0 3.5 4.8 0.1
3.0 4.8 10.0 0.1
3.0 7.1 18.4 0.1
3.0 8.6 21.4 0.1
3.0 11.0 22.4 0.1
3.0 19.3 15.9 0.1
4.0 4.8 16.5 0.1
4.0 7.1 13.9 0.1
4.0 8.6 11.3 0.1
4.0 11.0 9.3 0.1
4.0 19.3 5.3 0.1
4.0 2.5 12.8 0.1
3.0 25.0 13.2 0.1
To return:
1.0 19.3 15.5 0.1
2.0 19.3 22.7 0.1
3.0 11.0 22.4 0.1
4.0 4.8 16.5 0.1
Here, the row [1.0, 19.3, 15.5, 0.1] is returned because 15.5 is the greatest third-column value among all the rows whose first number is 1.0. For each set of identical numbers in the first column, the function must return the row with the greatest value in the third column.
I am struggling with actually doing this in Python, because my loop iterates over EVERY row and finds a maximum, rather than over each ‘set’ of first-column numbers.
Is there something about for loops that I don’t know which could help me do this?
Below is what I have so far.
import numpy as np
C0,C1,C2,C3 = np.loadtxt("FILE.txt",dtype={'names': ('C0', 'C1', 'C2','C3'),'formats': ('f4', 'f4', 'f4','f4')},unpack=True,usecols=(0,1,2,3))
def FUNCTION(C_0, C_1, C_2, C_3):
    for i in range(len(C_1)):
        a = []
        a.append(C_0[i])
        for j in range(len(C_0)):
            if C_0[j] == C_0[i]:
                a.append(C_0[j])
        return a
print FUNCTION(C0,C1,C2,C3)
where C0,C1,C2, and C3 are columns in the text file, loaded as 1-D arrays.
Right now I’m just trying to isolate the indexes of the rows with equal C0 values.

An approach could be to use a dict where the value is the row keyed by the first column item. This way you won't have to load the whole text file in memory at once. You can scan line by line and update the dict as you go.
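A minimal sketch of that idea (assuming the file is whitespace-separated, as in the example above) could look like this:
best = {}
with open("FILE.txt") as f:
    for line in f:
        row = [float(x) for x in line.split()]
        # keep, per first-column value, the row with the largest third column
        if row[0] not in best or row[2] > best[row[0]][2]:
            best[row[0]] = row

for key in sorted(best):
    print(best[key])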

I got a bit confused by the first and second rows... I believe the 25.0 at (2, 3) is a mistake in your example.
My code is not a mathematical solution, but it works.
import collections
with open("INPUT.txt", "r") as datasheet:
    data = datasheet.read().splitlines()
dataset = collections.OrderedDict()
for dataitem in data:
    temp = dataitem.split(" ")
    # I just wrote this code; input and output were separated by four spaces
    print(temp)
    if temp[0] in dataset.keys():
        if float(dataset[temp[0]][1]) < float(temp[2]):
            dataset[temp[0]] = [temp[1], temp[2], temp[3]]
    else:
        dataset[temp[0]] = [temp[1], temp[2], temp[3]]
# Some sort code here
with open("OUTPUT.txt", "w") as outputsheet:
    for datakey in dataset.keys():
        datavalue = dataset[datakey]
        outputsheet.write("%s %s %s %s\n" % (datakey, datavalue[0], datavalue[1], datavalue[2]))

Using Numpy and Lambda
Using the properties of a dict with some lambda functions does the trick:
data = np.loadtxt("FILE.txt",dtype={'names': ('a', 'b', 'c','d'),'formats': ('f4', 'f4', 'f4','f4')},usecols=(0,1,2,3))
# ordering by columns 1 and 3
sorted_data = sorted(data, key=lambda x: (x[0],x[2]))
# dict comprehension mapping the value of first column to a row
# this will overwrite all previous entries as mapping is 1-to-1
ret = {d[0]: list(d) for d in sorted_data}.values()
Alternatively, you can make it an (ugly) one-liner:
ret = {
    d[0]: list(d)
    for d in sorted(np.loadtxt("FILE.txt", dtype={'names': ('a', 'b', 'c', 'd'),
                                                  'formats': ('f4', 'f4', 'f4', 'f4')},
                               usecols=(0, 1, 2, 3)),
                    key=lambda x: (x[0], x[2]))
}.values()
As @Fallen pointed out, this is an inefficient method, as you need to read in the whole file. However, for the purposes of this example, where the data set is quite small, it's reasonably acceptable.
Reading one line at a time
The more efficient way is reading in one line at a time.
import re
# Get the data
with open('data', 'r') as f:
    str_data = f.readlines()
# Convert to dict
d = {}
for s in str_data:
    data = [float(n) for n in re.split(r'\s+', s.strip())]
    if data[0] in d:
        if data[2] >= d[data[0]][2]:
            d[data[0]] = data
    else:
        d[data[0]] = data
print d.values()
The caveat here is that there's no other sorting metric, so if you initially have a row for 1.0 such as [1.0, 2.0, 3.0, 5.0], then any subsequent 1.0 line whose third column is greater than or equal to 3.0 will overwrite it, e.g. [1.0, 1.0, 3.0, 1.0].

Related

Remove Nan from two numpy array with different dimension Python

I would like to remove NaN elements from a pair of NumPy arrays with different dimensions using Python. One array has shape (8, 3) and the other has shape (8,). If at least one NaN element appears in a row, the entire row needs to be removed. However, I am facing issues because the two arrays have different dimensions.
For example,
[1.7 2.3 3.4] 4.2
[2.3 3.4 4.2] 4.6
[3.4 nan 4.6] 4.8
[4.2 4.6 4.8] 4.6
[4.6 4.8 4.6] nan
[4.8 4.6 nan] nan
[4.6 nan nan] nan
[nan nan nan] nan
I want it to become
[1.7 2.3 3.4] 4.2
[2.3 3.4 4.2] 4.6
[4.2 4.6 4.8] 4.6
This is my code, which generates the sequence data:
from numpy import array

def split_sequence(sequence, n_steps):
    X, y = list(), list()
    for i in range(len(sequence)):
        # find the end of this pattern
        end_ix = i + n_steps
        # check if we are beyond the sequence
        if end_ix > len(sequence) - 1:
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequence[i:end_ix], sequence[end_ix]
        X.append(seq_x)
        y.append(seq_y)
    return array(X), array(y)

n_steps = 3
sequence = df_sensorRefill['sensor'].to_list()
X, y = split_sequence(sequence, n_steps)
Thanks
You could use np.isnan() and np.any() to find the rows containing NaNs, and np.delete() to remove those rows.
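A minimal sketch of that, assuming X has shape (8, 3) and y has shape (8,) as in the question:
import numpy as np

# rows of X that contain a NaN, or positions where y itself is NaN
bad = np.any(np.isnan(X), axis=1) | np.isnan(y)

# remove those rows from both arrays
idx = np.where(bad)[0]
X_clean = np.delete(X, idx, axis=0)
y_clean = np.delete(y, idx, axis=0)
Boolean indexing (X[~bad], y[~bad]) gives the same result without np.delete.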

find a value based on multiple conditions within a tolerance in Pandas

I'm looking for an identification procedure using Pandas to trace the movement of some objects.
I want to assign a value (ID) to each row by comparing it with the previous frame (Time) within each group, based on multiple conditions (X, Y, Z) being within a tolerance (X < 0.5, Y < 0.6, Z < 0.7), and write it in place. Also, if no matching value is available, give it a new value (Count), which restarts per Group.
Here is an example:
Before:
Group  Time  X    Y    Z
A      1.0   0.1  0.1  0.1
A      1.0   2.2  2.2  2.2
B      1.0   3.3  3.3  3.3
A      1.1   0.4  0.4  0.4
A      1.1   5.5  5.5  5.5
B      1.1   3.6  3.6  3.6
After:
Group  Time  X    Y    Z    ID
A      1.0   0.1  0.1  0.1  1
A      1.0   2.2  2.2  2.2  2
B      1.0   3.3  3.3  3.3  1
A      1.1   0.4  0.4  0.4  1
A      1.1   5.5  5.5  5.5  3
B      1.1   3.6  3.6  3.6  1
For clarification:
Row #4: the X, Y, Z change is within the tolerance, hence the same ID
Row #5: the X, Y, Z change is NOT within the tolerance, hence a new ID in Group A
Row #6: the X, Y, Z change is within the tolerance, hence the same ID
I think I can trace the movement of my objects in only one direction (let's say X) using pd.merge_asof, regardless of their group and time, and find their ID. My problems are: 1. considering the group and time, 2. assigning the new ID, 3. handling different tolerances.
df3=pd.merge_asof(df1, df2, on="X", direction="nearest", by=["Group"], tolerance=0.5)
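For illustration only, a sketch of how the pieces might fit together (df_prev and df_curr are hypothetical frames holding the previous and current time step, with df_prev already carrying an ID column; merge_asof accepts a single tolerance, so only X is handled by it and the Y/Z tolerances are applied as a filter afterwards):
import pandas as pd

# merge_asof requires both frames sorted on the "on" key
df_prev = df_prev.sort_values("X")
df_curr = df_curr.sort_values("X")

merged = pd.merge_asof(df_curr, df_prev, on="X", by="Group",
                       direction="nearest", tolerance=0.5,
                       suffixes=("", "_prev"))

# enforce the remaining tolerances; rows failing them would then
# receive a fresh per-group ID (not shown here)
ok = (merged["Y"] - merged["Y_prev"]).abs().lt(0.6) & \
     (merged["Z"] - merged["Z_prev"]).abs().lt(0.7)
merged.loc[~ok, "ID"] = None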

Read data with missing values with Python into arrays

I have a datafile similar to the following (the original file is much bigger):
Data
6 6
0.0 0.2 0.4
0.6 0.8
1.0
0.0 0.4 0.6
1.2 1.6
2.0
1.0 3.0 4.0 1.0
1.0 3.0
1.0 2.0 1.0 4.0
5.0 2.0
3.0 3.0 1.0 1.0
5.0 1.0
2.0 7.0 1.0 1.0
5.0 2.0
2.0 3.0 8.0 6.0
3.0 1.0
3.0 3.0 4.0 6.0
1.0 1.0
and I need to plot it as a 2D contour.
The first line is a dummy. The first 6 gives the number of elements in the x direction:
0.0 0.2 0.4
0.6 0.8
1.0
and the second 6 gives the number of elements in the y direction:
0.0 0.4 0.6
1.2 1.6
2.0
Then each group of 6 numbers gives the contour values for one row, starting from row 1:
1.0 3.0 4.0 1.0
1.0 3.0
I want to cast this data into a 2D array so that I can plot them.
I tried,
data = numpy.genfromtxt('fileHere',delimiter= " ",skip_header=1)
to read data into a general array and then split it. But I get the following error,
Line #16999 (got 15 columns instead of 3)
I also tried Python's readline() and split() functions, but they make it much harder to continue. I want to have x and y in arrays and a separate array for the data, in, let's say, a 6x6 shape. In Matlab I used to use the fscanf function
fscanf(fid,'%d',6);
I will be happy to have your ideas on this. Thanks
I think you want to read the full file into a variable, then replace all newlines ('\n'), then split and convert it into an ndarray.
Here's how I did it.
import numpy as np
txt = '''\
6 6
0.0 0.2 0.4
0.6 0.8
1.0
0.0 0.4 0.6
1.2 1.6
2.0
1.0 3.0 4.0 1.0
1.0 3.0
1.0 2.0 1.0 4.0
5.0 2.0
3.0 3.0 1.0 1.0
5.0 1.0
2.0 7.0 1.0 1.0
5.0 2.0
2.0 3.0 8.0 6.0
3.0 1.0
3.0 3.0 4.0 6.0
1.0 1.0'''
txt_list = txt.replace('\n', ' ').split()
# if you want numeric values, convert them to floats here;
# otherwise the data ends up in numpy as strings (as in the output shown below)
# txt_list = [float(i) for i in txt_list]
width = int(txt_list[1])
height = len(txt_list[2:]) // width
txt_array = np.reshape(txt_list[2:], (height, width))
print (txt_array)
The output of this will be:
[['0.0' '0.2' '0.4' '0.6' '0.8' '1.0']
['0.0' '0.4' '0.6' '1.2' '1.6' '2.0']
['1.0' '3.0' '4.0' '1.0' '1.0' '3.0']
['1.0' '2.0' '1.0' '4.0' '5.0' '2.0']
['3.0' '3.0' '1.0' '1.0' '5.0' '1.0']
['2.0' '7.0' '1.0' '1.0' '5.0' '2.0']
['2.0' '3.0' '8.0' '6.0' '3.0' '1.0']
['3.0' '3.0' '4.0' '6.0' '1.0' '1.0']]
Quite a smart solution to reformat your input is possible using Pandas.
Start by reading your input file as a pandasonic DataFrame (with the standard field separator, i.e. a comma):
import pandas as pd

df = pd.read_csv('Input.txt')
As your input file does not contain commas, each line is read as a single
field and the column name (Data) is taken from the first line.
So far the initial part of df is:
Data
0 6 6
1 0.0 0.2 0.4
2 0.6 0.8
3 1.0
4 0.0 0.4 0.6
The left column is the index, but it is not important.
The type of the only column is object, actually a string.
Then, to reformat this DataFrame into a 6-column Numpy array, it is enough to run the following one-liner:
result = df.drop(0).Data.str.split(' ').explode().astype('float').values.reshape(-1, 6)
Steps:
drop(0) - Drop the initial row (with index 0).
Data - Take Data column.
str.split(' ') - Split each element on spaces (the result is a list of strings).
explode() - Convert each list into a sequence of rows. So far each
element is of string type.
astype('float') - Change the type to float.
values - Take the underlying Numpy (1-D) array.
reshape(-1, 6) - Reshape to 6 columns and as many rows as needed.
The result, for your data sample is:
array([[0. , 0.2, 0.4, 0.6, 0.8, 1. ],
[0. , 0.4, 0.6, 1.2, 1.6, 2. ],
[1. , 3. , 4. , 1. , 1. , 3. ],
[1. , 2. , 1. , 4. , 5. , 2. ],
[3. , 3. , 1. , 1. , 5. , 1. ],
[2. , 7. , 1. , 1. , 5. , 2. ],
[2. , 3. , 8. , 6. , 3. , 1. ],
[3. , 3. , 4. , 6. , 1. , 1. ]])
And the last step is to divide this array into:
2 initial rows (x and y coordinates),
following rows (actual data),
To do it, run:
x = result[0]
y = result[1]
data = result[2:]
Alternative: Don't create separate variables, but call plt.contour
passing respective rows of result as X, Y and Z.
Something like:
plt.contour(result[0], result[1], result[2:]);

Python overlapping sliding dataframe

I'm implementing a machine learning algorithm and I'm extracting features out of a dataframe. I obviously need overlapping windows. Suppose the DataFrame looks like this
x y z
12.1 11 0.5
12.2 10 0.3
12.4 11 0.5
12.8 12 0.4
13.1 13 0.4
14.7 14 0.5
15.2 14 0.6
15.3 13 0.5
17.3 14 0.5
18.2 15 0.4
16.1 16 0.2
15.0 17 0.1
But in reality it is a lot larger (thousands of samples). I now want a list of DataFrames where each DataFrame has length ws (here 150), with a step (stride) of 60.
This is what I got
r = np.arange(len(df))
s = r[::step]
return [df.iloc[k:k+ws] for k in s]
This works reasonably well, but there's still one problem: the last 1, 2 or 3 frames might not have length ws. I also cannot just discard the last 3, since sometimes there's only one frame with a length smaller than ws. The s variable just keeps all the start indices; I'd need a way to keep only the start indices where start_index + step < len(df). Unless, of course, there are better and/or faster ways to do this (maybe a library). All existing documentation only talks about simple arrays.
You might only need to change s:
s = r[:len(df)-ws+1:step]
In this way you only find the start indexes of frames with length ws.
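Put together, a small sketch of the whole thing (assuming ws and step as in the question):
import numpy as np

def sliding_windows(df, ws=150, step=60):
    # start indices only where a full window of length ws still fits
    starts = np.arange(len(df))[:len(df) - ws + 1:step]
    return [df.iloc[k:k + ws] for k in starts]

# every returned frame now has exactly ws rows
windows = sliding_windows(df, ws=150, step=60)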

Repetitively multiply 6 randomly generated numbers with data from csv

I want to generate 6 random numbers (weights) that always sum to one, multiply them by the columns of data I have imported from a csv file, store the sum in another column (weighted average), and find the difference between the max and min of the new column (range). I want to repeat the process 1,000,000 times and get the least range along with the set of random numbers (weights) that produced it.
Here is what I have done so far:
1. Generate 6 random numbers.
2. Import data from the csv file.
3. Multiply the random numbers with the data from the csv file and find the average (weighted average).
4. Save the weighted average in a new column F(x).
5. Find the range.
6. Repeat this 1,000,000 times and get the random numbers that give the least range.
Here is some Data from the file
A B C D E F F(x)
0 4.9 3.9 6.3 3.4 7.3 3.4 0.0
1 4.1 3.7 7.7 2.8 5.5 3.9 0.0
2 6.0 6.0 4.0 3.1 3.7 4.3 0.0
3 5.6 6.3 6.6 4.6 8.3 4.6 0.0
Currently I am getting 0.0 for all F(x), which should not be the case.
arr = np.array(np.random.dirichlet(np.ones(6), size=1))
arr = pd.DataFrame(arr)
ar = arr.iloc[0]
df = pd.read_csv('weit.csv')
df['F(x)'] = df.mul(ar).sum(1)
df
df['F(x)'].max() - df['F(x)'].min()
I am getting 0 for all my weighted averages, but I need the actual weighted averages.
I also can't loop the code to run 1,000,000 times and get the least range.
If I understand correctly what you need:
#data from file
print (df)
A B C D E F
0 4.9 3.9 6.3 3.4 7.3 3.4
1 4.1 3.7 7.7 2.8 5.5 3.9
2 6.0 6.0 4.0 3.1 3.7 4.3
3 5.6 6.3 6.6 4.6 8.3 4.6
np.random.seed(3434)
Generate a 2d array with 6 'columns' and N 'rows', filled with random numbers:
N = 10
#in real data
#N = 1000000
arr = np.array(np.random.dirichlet(np.ones(6), size=N))
print (arr)
[[0.07077773 0.08042978 0.02589592 0.03457833 0.53804634 0.25027191]
[0.22174594 0.22673581 0.26136526 0.04820957 0.00976747 0.23217594]
[0.01202493 0.14247592 0.3411326 0.0239181 0.08448841 0.39596005]
[0.09354759 0.54989312 0.08893737 0.22051801 0.03850101 0.00860291]
[0.09418778 0.33345217 0.11721214 0.33480462 0.11894247 0.00140081]
[0.04285476 0.04531546 0.38105815 0.04316535 0.46902838 0.0185779 ]
[0.00441747 0.08044848 0.33383453 0.09476135 0.37568431 0.11085386]
[0.14613552 0.11260451 0.10421495 0.27880266 0.28994218 0.06830019]
[0.50747802 0.15704797 0.04410511 0.07552837 0.18744306 0.02839746]
[0.00203448 0.13225783 0.43042505 0.33410145 0.08385366 0.01732753]]
Then convert values from DataFrame to 2d numpy array:
b = df.values
#pandas 0.24+
#b = df.to_numpy()
print (b)
[[4.9 3.9 6.3 3.4 7.3 3.4]
[4.1 3.7 7.7 2.8 5.5 3.9]
[6. 6. 4. 3.1 3.7 4.3]
[5.6 6.3 6.6 4.6 8.3 4.6]]
Last, multiply both arrays together into a 3d array and sum along axis 2; then, to subtract the minimum from the maximum, use numpy.ptp:
c = np.ptp((arr * b[:, None]).sum(axis=2), axis=1)
print (c)
[2.19787892 2.08476765 1.2654273 1.45134533]
Another solution with numpy.einsum:
c = np.ptp(np.einsum('ik,jk->jik', arr, b).sum(axis=2), axis=1)
print (c)
[2.19787892 2.08476765 1.2654273 1.45134533]
Loop solution for comparison, but slow with large N:
out = []
for row in df.values:
    # print (row)
    a = np.ptp((row * arr).sum(axis=1))
    out.append(a)
print (out)
[2.197878921892329, 2.0847676512823052, 1.2654272959079576, 1.4513453259898297]
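If what you are after is instead the range per weight set across the rows of df (as the question describes), and then the weights giving the smallest range, a sketch of that last step (assuming df holds the A-F columns from weit.csv) could be:
import numpy as np

N = 1000000
arr = np.random.dirichlet(np.ones(6), size=N)   # (N, 6) weight sets
b = df[['A', 'B', 'C', 'D', 'E', 'F']].values   # (rows, 6) data

ranges = np.ptp(b @ arr.T, axis=0)              # range across df rows per weight set
best = np.argmin(ranges)
print(arr[best], ranges[best])                  # weights giving the least range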
