Python compiler giving IndexError when trying to find the similarity matrix

Python compiler giving IndexError when trying to find the similarity matrix - python

I am trying to make a movie recommendation system which requires me to find the user-user similarity matrix for the top 100 users.
On running the code I get:
similarMatrix[row] = top100_similar
IndexError: index 663 is out of bounds for axis 0 with size 617
Code:
def getUser_UserSimilarity(sparseMatrix, top = 100):
startTimestamp20 = datetime.now()
row_index, col_index = sparseMatrix.nonzero() #this will give indices of rows in "row_index" and indices of columns in
#"col_index" where there is a non-zero value exist.
rows = np.unique(row_index)
similarMatrix = np.zeros(61700).reshape(617,100) # 617*100 = 61700. As we are building similarity matrix only
#for top 100 most similar users.
timeTaken = []
howManyDone = 0
for row in rows[:top]:
howManyDone += 1
startTimestamp = datetime.now().timestamp() #it will give seconds elapsed
sim = cosine_similarity(sparseMatrix.getrow(row), sparseMatrix).ravel()
top100_similar_indices = sim.argsort()[-top:]
top100_similar = sim[top100_similar_indices]
similarMatrix[row] = top100_similar
timeforOne = datetime.now().timestamp() - startTimestamp
timeTaken.append(timeforOne)
if howManyDone % 20 == 0:
print("Time elapsed for {} users = {}sec".format(howManyDone, (datetime.now() - startTimestamp20)))
print("Average Time taken to compute similarity matrix for 1 user = "+str(sum(timeTaken)/len(timeTaken))+"seconds")
fig = plt.figure(figsize = (12,8))
plt.plot(timeTaken, label = 'Time Taken For Each User')
plt.plot(np.cumsum(timeTaken), label='Cumulative Time')
plt.legend(loc='upper left', fontsize = 15)
plt.xlabel('Users', fontsize = 20)
plt.ylabel('Time(Seconds)', fontsize = 20)
plt.tick_params(labelsize = 15)
plt.show()
return similarMatrix
simMatrix = getUser_UserSimilarity(TrainUISparseData, 100)
Please tell me where exactly I need to make the changes.

The error is due to the following line
similarMatrix = np.zeros(61700).reshape(617,100)
Your similarMatrix is of smaller dimension than your sparseMatrix. Thats why you are getting index error.
You need to make the dimensions of similarMatrix equal to the dimensions of sparseMatrix. So modify the code as below
similarMatrix = np.zeros(sparseMatrix.shape[0]*100).reshape(sparseMatrix.shape[0],100)
Or for more simple structure
n_cols = 100
n_rows = sparseMatrix.shape[0]
similarMatrix = np.zeros(n_rows*n_cols).reshape(n_rows, n_cols)

Related

Eliminating Certain Values in Dataframe

Initial Data
d = {'RedVal':[1,1.1,2,1.5,1.7,2,1,1.1,2,1,1.1,2,2.6,2.5,2.4,2.5], 'GreenVal':[1,1.1,1.1,1,1.1,1.7,1,1.1,1.5,1,1.9,3,2.8,2.7,2.6,2.5],'Frame':[0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3],'Particle':[0,0,0,0,2,2,2,2,3,3,3,3,4,4,4,4] }
testframe = pd.DataFrame(data=d)
testframe
framenot = 2 #set how many frames you would like to get initial ratio for
ratarray = [] #initialize blank ratio array
testframe.sort_values(by =[ 'Particle', 'Frame'])
for particle in range(0,5):
if(testframe['Particle']== particle).any() == False:
particle = particle + 1
else:
newframe = testframe.loc[(testframe['Frame']<= framenot) & (testframe['Particle'] == particle)]
#print(particle)
for i in range(framenot):
#print(i)
GVal = newframe['GreenVal'].values[i]
RVal = newframe['RedVal'].values[i]
ratio = RVal/GVal
#print(RVal)
#print(GVal)
#print(ratio)
ratarray.append(ratio)
i+=1
#print(ratarray)
particle+=1
ratarray = np.array(ratarray)
avgRatios = np.average(ratarray.reshape(-1,framenot), axis = 1)
stdRatios = np.std(ratarray.reshape(-1,framenot), axis = 1)
print(avgRatios) #array with average ratios over set frames starting from initial particle
print(stdRatios)
So far I have code that gives the avg and standard deviation for each particle's ratio of Red/Green over the frames 0 and 1. Now I want to compare this avg ratio to the ratio for the next x frames and eliminate particles where the subsequent frames ratios falls outside the avg+2stdev. Not quite sure how to do this. Any help is appreciated.

Reorder Sankey diagram vertically based on label value

I'm trying to plot patient flows between 3 clusters in a Sankey diagram. I have a pd.DataFrame counts with from-to values, see below. To reproduce this DF, here is the counts dict that should be loaded into a pd.DataFrame (which is the input for the visualize_cluster_flow_counts function).
from to value
0 C1_1 C1_2 867
1 C1_1 C2_2 405
2 C1_1 C0_2 2
3 C2_1 C1_2 46
4 C2_1 C2_2 458
... ... ... ...
175 C0_20 C0_21 130
176 C0_20 C2_21 1
177 C2_20 C1_21 12
178 C2_20 C0_21 0
179 C2_20 C2_21 96
The from and to values in the DataFrame represent the cluster number (either 0, 1, or 2) and the amount of days for the x-axis (between 1 and 21). If I plot the Sankey diagram with these values, this is the result:
Code:
import plotly.graph_objects as go
def visualize_cluster_flow_counts(counts):
all_sources = list(set(counts['from'].values.tolist() + counts['to'].values.tolist()))
froms, tos, vals, labs = [], [], [], []
for index, row in counts.iterrows():
froms.append(all_sources.index(row.values[0]))
tos.append(all_sources.index(row.values[1]))
vals.append(row[2])
labs.append(row[3])
fig = go.Figure(data=[go.Sankey(
arrangement='snap',
node = dict(
pad = 15,
thickness = 5,
line = dict(color = "black", width = 0.1),
label = all_sources,
color = "blue"
),
link = dict(
source = froms,
target = tos,
value = vals,
label = labs
))])
fig.update_layout(title_text="Patient flow between clusters over time: 48h (2 days) - 504h (21 days)", font_size=10)
fig.show()
visualize_cluster_flow_counts(counts)
However, I would like to vertically order the bars so that the C0's are always on top, the C1's are always in the middle, and the C2's are always at the bottom (or the other way around, doesn't matter). I know that we can set node.x and node.y to manually assign the coordinates. So, I set the x-values to the amount of days * (1/range of days), which is an increment of +- 0.045. And I set the y-values based on the cluster value: either 0, 0.5 or 1. I then obtain the image below. The vertical order is good, but the vertical margins between the bars are obviously way off; they should be similar to the first result.
The code to produce this is:
import plotly.graph_objects as go
def find_node_coordinates(sources):
x_nodes, y_nodes = [], []
for s in sources:
# Shift each x with +- 0.045
x = float(s.split("_")[-1]) * (1/21)
x_nodes.append(x)
# Choose either 0, 0.5 or 1 for the y-value
cluster_number = s[1]
if cluster_number == "0": y = 1
elif cluster_number == "1": y = 0.5
else: y = 1e-09
y_nodes.append(y)
return x_nodes, y_nodes
def visualize_cluster_flow_counts(counts):
all_sources = list(set(counts['from'].values.tolist() + counts['to'].values.tolist()))
node_x, node_y = find_node_coordinates(all_sources)
froms, tos, vals, labs = [], [], [], []
for index, row in counts.iterrows():
froms.append(all_sources.index(row.values[0]))
tos.append(all_sources.index(row.values[1]))
vals.append(row[2])
labs.append(row[3])
fig = go.Figure(data=[go.Sankey(
arrangement='snap',
node = dict(
pad = 15,
thickness = 5,
line = dict(color = "black", width = 0.1),
label = all_sources,
color = "blue",
x = node_x,
y = node_y,
),
link = dict(
source = froms,
target = tos,
value = vals,
label = labs
))])
fig.update_layout(title_text="Patient flow between clusters over time: 48h (2 days) - 504h (21 days)", font_size=10)
fig.show()
visualize_cluster_flow_counts(counts)
Question: how do I fix the margins of the bars, so that the result looks like the first result? So, for clarity: the bars should be pushed to the bottom. Or is there another way that the Sankey diagram can vertically re-order the bars automatically based on the label value?

Firstly I don't think there is a way with the current exposed API to achieve your goal smoothly you can check the source code here.
Try to change your find_node_coordinates function as follows (note that you should pass the counts DataFrame to):
counts = pd.DataFrame(counts_dict)
def find_node_coordinates(sources, counts):
x_nodes, y_nodes = [], []
flat_on_top = False
range = 1 # The y range
total_margin_width = 0.15
y_range = 1 - total_margin_width
margin = total_margin_width / 2 # From number of Cs
srcs = counts['from'].values.tolist()
dsts = counts['to'].values.tolist()
values = counts['value'].values.tolist()
max_acc = 0
def _calc_day_flux(d=1):
_max_acc = 0
for i in [0,1,2]:
# The first ones
from_source = 'C{}_{}'.format(i,d)
indices = [i for i, val in enumerate(srcs) if val == from_source]
for j in indices:
_max_acc += values[j]
return _max_acc
def _calc_node_io_flux(node_str):
c,d = int(node_str.split('_')[0][-1]), int(node_str.split('_')[1])
_flux_src = 0
_flux_dst = 0
indices_src = [i for i, val in enumerate(srcs) if val == node_str]
indices_dst = [j for j, val in enumerate(dsts) if val == node_str]
for j in indices_src:
_flux_src += values[j]
for j in indices_dst:
_flux_dst += values[j]
return max(_flux_dst, _flux_src)
max_acc = _calc_day_flux()
graph_unit_per_val = y_range / max_acc
print("Graph Unit per Acc Val", graph_unit_per_val)
for s in sources:
# Shift each x with +- 0.045
d = int(s.split("_")[-1])
x = float(d) * (1/21)
x_nodes.append(x)
print(s, _calc_node_io_flux(s))
# Choose either 0, 0.5 or 1 for the y-v alue
cluster_number = s[1]
# Flat on Top
if flat_on_top:
if cluster_number == "0":
y = _calc_node_io_flux('C{}_{}'.format(2, d))*graph_unit_per_val + margin + _calc_node_io_flux('C{}_{}'.format(1, d))*graph_unit_per_val + margin + _calc_node_io_flux('C{}_{}'.format(0, d))*graph_unit_per_val/2
elif cluster_number == "1": y = _calc_node_io_flux('C{}_{}'.format(2, d))*graph_unit_per_val + margin + _calc_node_io_flux('C{}_{}'.format(1, d))*graph_unit_per_val/2
else: y = 1e-09
# Flat On Bottom
else:
if cluster_number == "0": y = 1 - (_calc_node_io_flux('C{}_{}'.format(0,d))*graph_unit_per_val / 2)
elif cluster_number == "1": y = 1 - (_calc_node_io_flux('C{}_{}'.format(0,d))*graph_unit_per_val + margin + _calc_node_io_flux('C{}_{}'.format(1,d)) * graph_unit_per_val /2 )
elif cluster_number == "2": y = 1 - (_calc_node_io_flux('C{}_{}'.format(0,d))*graph_unit_per_val + margin + _calc_node_io_flux('C{}_{}'.format(1,d)) * graph_unit_per_val + margin + _calc_node_io_flux('C{}_{}'.format(2,d)) * graph_unit_per_val /2 )
y_nodes.append(y)
return x_nodes, y_nodes
Sankey graphs supposed to weigh their connection width by their corresponding normalized values right? Here I do the same, first, it calculates each node flux, later by calculating the normalized coordinate the center of each node calculated according to their flux.
Here is the sample output of your code with the modified function, note that I tried to adhere to your code as much as possible so it's a bit unoptimized(for example, one could store the values of nodes above each specified source node to avoid its flux recalculation).
With flag flat_on_top = True
With flag flat_on_top = False
There is a bit of inconsistency in the flat_on_bottom version which I think is caused by the padding or other internal sources of Plotly API.

Python: index error with numerical differentiation

In my code I am extracting the velocity and acceleration from time, position measurements and I am receiving an index error when performing numerical differentiation:
VelocityVsTime = np.empty((2,0), float)
for i in range(1, len(PosVsTime[0])-1):
velocity = (PosVsTime[1][i+1] - PosVsTime[1][i-1]) / (PosVsTime[0][i+1] - PosVsTime[0][i-1])
VelocityVsTime = np.append(VelocityVsTime, [[PosVsTime[0][i]], [velocity]], axis = 1)
#print(VelocityVsTime)
AccelerationvsTime = np.empty((2,0), float)
for j in range(1, len(VelocityVsTime[1])-1):
acceleration = (VelocityVsTime[1][i+1] - VelocityVsTime[1][i-1]) / (VelocityVsTime[0][i+1] - VelocityVsTime[0][i-1])
AccelerationvsTime = np.append(AccelerationvsTime, [VelocityVsTime[0][i]], [acceleration], axis=1)
print(AccelerationvsTime)
The error is:
IndexError: index 50 is out of bounds for axis 0 with size 49
any tips on how to correct this? Thanks

heres the full code: the error occurs on line 42 where I declare the acceleration variable
import numpy as np
import matplotlib.pyplot as plt
PosVsTime = np.loadtxt("balldata.txt", delimiter=",").transpose()
#print(PosVsTime[0][0])
#t_0 = PosVsTime[0][0]
#pos_0 = PosVsTime[1][0]
#print("The initial state of this system at time = 0 is ", pos_0)
VelocityVsTime = np.empty((2,0), float)
for i in range(1, len(PosVsTime[0])-1):
velocity = (PosVsTime[1][i+1] - PosVsTime[1][i-1]) / (PosVsTime[0][i+1] - PosVsTime[0][i-1])
VelocityVsTime = np.append(VelocityVsTime, [[PosVsTime[0][i]], [velocity]], axis = 1)
#print(VelocityVsTime)
#plt.errorbar(VelocityVsTime[0], VelocityVsTime[1], fmt = '--k')
AccelerationvsTime = np.empty((2,0), float)
for j in range(1, len(VelocityVsTime[0])-1):
#acceleration = (VelocityVsTime[1][i+1] - VelocityVsTime[1][i-1]) / (VelocityVsTime[0][i+1] - VelocityVsTime[0][i-1])
#AccelerationvsTime = np.append(AccelerationvsTime, [VelocityVsTime[0][i]], [acceleration], axis=1)
print(AccelerationvsTime)

How to improve performance of coincidence filtering of a time-series?

I'm working on instationary experimental data from fluid dynamics. We have measured data on three channels, so the samples are not directly coincident (measured at the same time). I want to filter them with a window scheme to get coincident samples and disgard all others.
Unfortunately, I cannot upload the original data set due to restrictions of the company. But I tried to set up a minimal example, which generates a similiar (smaller) dataset. The original dataset consists of 500000 values per channel, each noted with an arrival time. The coincidence is checked with these time stamps.
Just now, I loop over each sample from the first channel and look at the time differences to the other channels. If it is smaller than the specified window width, the index is saved. Probably it would be a little bit faster if I specifiy an intervall in which to check for the differences (like 100 or 1000 samples in the neighborhood). But the datarate between the channels can differ significantly, so it is not implemented yet. I prefer to get rid of looping over each sample - if possible.
def filterCoincidence(df, window = 50e-6):
'''
Filters the dataset with arbitrary different data rates on different channels to coincident samples.
The coincidence is checked with regard to a time window specified as argument.
'''
AT_cols = [col for col in df.columns if 'AT' in col]
if len(AT_cols) == 1:
print('only one group available')
return
used_ix = np.zeros( (df.shape[0], len(AT_cols)))
used_ix.fill(np.nan)
for ix, sample in enumerate(df[AT_cols[0]]):
used_ix[ix, 0] = ix
test_ix = np.zeros(2)
for ii, AT_col in enumerate(AT_cols[1:]):
diff = np.abs(df[AT_col] - sample)
index = diff[diff <= window].sort_values().index.values
if len(index) == 0:
test_ix[ii] = None
continue
test_ix[ii] = [ix_use if (ix_use not in used_ix[:, ii+1] or ix == 0) else None for ix_use in index][0]
if not np.any(np.isnan(test_ix)):
used_ix[ix, 1:] = test_ix
else:
used_ix[ix, 1:] = [None, None]
used_ix = used_ix[~np.isnan(used_ix).any(axis=1)]
print(used_ix.shape)
return
no_points = 10000
no_groups = 3
meas_duration = 60
df = pd.DataFrame(np.transpose([np.sort(np.random.rand(no_points)*meas_duration) for _ in range(no_groups)]), columns=['AT {}'.format(i) for i in range(no_groups)])
filterCoincidence(df, window=1e-3)
Is there a module already implemented, which can do this sort of filtering? However, it would be awesome if you can give me some hints to increase the performance of the code.

Just to update this thread if somebody else have a similar problem. I think after several code revisions, I have found a proper solution to this.
def filterCoincidence(self, AT1, AT2, AT3, window = 0.05e-3):
'''
Filters the dataset with arbitrary different data rates on different channels to coincident samples.
The coincidence is checked with regard to a time window specified as argument.
- arguments:
- three times series AT1, AT2 and AT3 (arrival times of particles in my case)
- window size (50 microseconds as default setting)
- output: indices of combined samples
'''
start_time = datetime.datetime.now()
AT_list = [AT1, AT2, AT3]
# take the shortest period of time
min_EndArrival = np.max(AT_list)
max_BeginArrival = np.min(AT_list)
for i, col in enumerate(AT_list):
min_EndArrival = min(min_EndArrival, np.max(col))
max_BeginArrival = max(max_BeginArrival, np.min(col))
for i, col in enumerate(AT_list):
AT_list[i] = np.delete(AT_list[i], np.where((col < max_BeginArrival - window) | (col > min_EndArrival + window)))
# get channel with lowest datarate
num_points = np.zeros(len(AT_list))
datarate = np.zeros(len(AT_list))
for i, AT in enumerate(AT_list):
num_points[i] = AT.shape[0]
datarate[i] = num_points[i] / (AT[-1]-AT[0])
used_ref = np.argmin(datarate)
# process coincidence
AT_ref_val = AT_list[used_ref]
AT_list = list(np.delete(AT_list, used_ref))
overview = np.zeros( (AT_ref_val.shape[0], 3), dtype=int)
overview[:,0] = np.arange(AT_ref_val.shape[0], dtype=int)
borders = np.empty(2, dtype=object)
max_diff = np.zeros(2, dtype=int)
for i, AT in enumerate(AT_list):
neighbors_lower = np.searchsorted(AT, AT_ref_val - window, side='left')
neighbors_upper = np.searchsorted(AT, AT_ref_val + window, side='left')
borders[i] = np.transpose([neighbors_lower, neighbors_upper])
coinc_ix = np.where(np.diff(borders[i], axis=1).flatten() != 0)[0]
max_diff[i] = np.max(np.diff(borders[i], axis=1))
overview[coinc_ix, i+1] = 1
use_ix = np.where(~np.any(overview==0, axis=1))
borders[0] = borders[0][use_ix]
borders[1] = borders[1][use_ix]
overview = overview[use_ix]
# create all possible combinations refer to the reference
combinations = np.prod(max_diff)
test = np.empty((overview.shape[0]*combinations, 3), dtype=object)
for i, [ref_ix, at1, at2] in enumerate(zip(overview[:, 0], borders[0], borders[1])):
test[i * combinations:i * combinations + combinations, 0] = ref_ix
at1 = np.arange(at1[0], at1[1])
at2 = np.arange(at2[0], at2[1])
test[i*combinations:i*combinations+at1.shape[0]*at2.shape[0],1:] = np.asarray(list(itertools.product(at1, at2)))
test = test[~np.any(pd.isnull(test), axis=1)]
# check distances
ix_ref = test[:,0]
test = test[:,1:]
test = np.insert(test, used_ref, ix_ref, axis=1)
test = test.astype(int)
AT_list.insert(used_ref, AT_ref_val)
AT_mat = np.zeros(test.shape)
for i, AT in enumerate(AT_list):
AT_mat[:,i] = AT[test[:,i]]
distances = np.zeros( (test.shape[0], len(list(itertools.combinations(range(3), 2)))))
for i, AT in enumerate(itertools.combinations(range(3), 2)):
distances[:,i] = np.abs(AT_mat[:,AT[0]]-AT_mat[:,AT[1]])
ix = np.where(np.all(distances <= window, axis=1))[0]
test = test[ix,:]
distances = distances[ix,:]
# check duplicates
# use sum of differences as similarity factor
dist_sum = np.max(distances, axis=1)
unique_sorted = np.argsort([np.unique(test[:,i]).shape[0] for i in range(test.shape[1])])[::-1]
test = np.hstack([test, dist_sum.reshape(-1, 1)])
test = test[test[:,-1].argsort()]
for j in unique_sorted:
_, ix = np.unique(test[:,j], return_index=True)
test = test[ix, :]
test = test[:,:3]
test = test[test[:,used_ref].argsort()]
# check that all values are after each other
ix = np.where(np.any(np.diff(test, axis=0) > 0, axis=1))[0]
ix = np.append(ix, test.shape[0]-1)
test = test[ix,:]
print('{} coincident samples obtained in {}.'.format(test.shape[0], datetime.datetime.now()-start_time))
return test
I'm certain that there is a better solution, but for me it works now. And I know, the variable names should definitely be chosen with more clarity (e.g. test), but I will clean up my code at the end of my master thesis... perhaps :-)

how to draw rectangles using list in python

for line, images_files in zip(lines, image_list):
info = line.split(',')
image_index = [int(info[0])]
box_coordiante1 = [info[2]]
box_coordiante2 = [info[3]]
box_coordiante3 = [info[4]]
box_coordiante4 = [info[5]]
prev_image_num = 1
for image_number in image_index: #### read each other image_number
if prev_image_num != image_number: # if read 11111 but appear different number such as 2, 3 and ect
prev_image_num = image_number # the different number becomes pre_image_num(it was 1)
#box_coordinate = [] # empty box_coordinate
#box_coordinate.append(info[2:6])
#print box_coordinate
# box_coordinate.append() #insert 2 to 6 axis
rect = plt.Rectangle((int(box_coordiante1), int(box_coordiante2)), int(box_coordiante3), int(box_coordiante4), linewidth=1, edgecolor='r', facecolor='none')
ax.add_patch(rect)
im = cv2.imread(images_files)
im = im[:, :, (2, 1, 0)]
# # Display the image
plt.imshow(im)
plt.draw()
plt.pause(0.1)
plt.cla()
I am supposed to draw boxes on each picture.
For showing boxes on each picture,
i guess that gather location of boxes and show them at that same time.
So i used a way using LIST to plt.Rectanle
but it said "TypeError: int() argument must be a string or a number, not 'list'"
Are there other ways??

Umm, I just did just. I don't know if this is what you wanted though.
x = 10
y = 10
a = []
for unit for range(x):
a.append(0)
for unit for range(y):
print(a)

I'm not very familiar with Python, but it seems like you want a plain number in the variables image_index and box_coordinateN. It looks like you're assigning single-element arrays to them. Try changing:
image_index = [int(info[0])] // list containing one element: int(info[0])
box_coordiante1 = [info[2]]
box_coordiante2 = [info[3]]
box_coordiante3 = [info[4]]
box_coordiante4 = [info[5]]
to:
image_index = int(info[0]) // number: int(info[0])
box_coordiante1 = info[2]
box_coordiante2 = info[3]
box_coordiante3 = info[4]
box_coordiante4 = info[5]

The answer above is carelessly sloppy and incorrect Python.
It must be rewritten and corrected as follows:
x = 10
y = 10
a = []
for unit in range(x):
a.append(0)
for unit in range(y):
print(a)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python compiler giving IndexError when trying to find the similarity matrix - python

Related

Eliminating Certain Values in Dataframe

Reorder Sankey diagram vertically based on label value

Python: index error with numerical differentiation

How to improve performance of coincidence filtering of a time-series?

how to draw rectangles using list in python

Categories

Resources