Unexpected DBSCAN result - Python

I have implemented code to cluster a simple array of points, but the results are unexpected.
Code:
from pandas import DataFrame
import matplotlib.pyplot as plt
import matplotlib
from sklearn.cluster import DBSCAN
x1 = [[10.0, 1.0], [10.0, 2.0], [10.0, 10.0], [10.0, 10.0], [10.0, 23.0], [10.0, 22.0]]
x2 = [[20.0, 2.0], [20.0, 15.0], [20.0, 26.0], [20.0, 13.0], [20.0, 32.0], [20.0, 35.0]]
x3 = [[30.0, 25.0], [30.0, 28.0], [30.0, 17.0], [30.0, 16.0], [30.0, 15.0], [30.0, 38.0]]
x4 = [[40.0, 1.0], [40.0, 2.0], [40.0, 16.0], [40.0, 41.0], [40.0, 40.0], [40.0, 39.0]]
x5 = [[60.0, 1.0], [60.0, 10.0], [60.0, 12.0], [60.0, 32.0], [60.0, 33.0], [60.0, 50.0]]
df1 = DataFrame(data= x1)
df2 = DataFrame(data= x2)
df3 = DataFrame(data= x3)
df4 = DataFrame(data= x4)
df5 = DataFrame(data= x5)
data = df5
dbscan_opt=DBSCAN(eps=1,min_samples=2)
dbscan_opt.fit(data[[0,1]])
data['DBSCAN_opt_labels']=dbscan_opt.labels_
data['DBSCAN_opt_labels'].value_counts()
# Plotting the resulting clusters
colors=['purple','red','blue','green']
plt.figure(figsize=(6,6))
plt.scatter(data[0],data[1],c=data['DBSCAN_opt_labels'],cmap=matplotlib.colors.ListedColormap(colors),s=60)
plt.title('DBSCAN Clustering',fontsize=20)
plt.xlabel('Feature 1',fontsize=14)
plt.ylabel('Feature 2',fontsize=14)
plt.show()
Results:
The red circle marks points that I expect to form a cluster, but the algorithm detects them as noise.
How can I tune the parameters of DBSCAN so that it detects them as well?

I found the problem: the limited color list! I had to add more colors, because DBSCAN generated additional clusters that were drawn in the same color, and that was my mistake.
So this is working fine now:
dbscan_opt=DBSCAN(eps=3.0,min_samples=1).fit(data[[0,1]])
data['DBSCAN_opt_labels']=dbscan_opt.labels_
data['DBSCAN_opt_labels'].value_counts()
print(data)
# Plotting the resulting clusters
colors=['purple','red','blue','green', 'magenta', 'teal', 'black']
plt.figure(figsize=(6,6))
plt.scatter(data[0],data[1],c=data['DBSCAN_opt_labels'],cmap=matplotlib.colors.ListedColormap(colors),s=60)
plt.title('DBSCAN Clustering',fontsize=20)
plt.xlabel('Feature 1',fontsize=14)
plt.ylabel('Feature 2',fontsize=14)
plt.show()
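As a side note, the hard-coded color list can be avoided altogether by sizing a discrete colormap to the number of labels DBSCAN actually produced. A minimal sketch (not part of the original answer; the colormap name 'tab20' is just an example choice):
import numpy as np
n_labels = len(np.unique(dbscan_opt.labels_))   # number of distinct labels, including -1 for noise
cmap = plt.get_cmap('tab20', n_labels)          # discrete colormap with one color per label
plt.figure(figsize=(6, 6))
plt.scatter(data[0], data[1], c=data['DBSCAN_opt_labels'], cmap=cmap, s=60)
plt.title('DBSCAN Clustering', fontsize=20)
plt.show()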

Related

Tensorflow 2.2, tf.nn.conv1d in Lambda layer

I'd like to perform a convolution in a Lambda layer, but I can't get it to work no matter what I try.
kernel = [1.0,2.0,1.0] # weighted moving average
x = [ # history_size=5, num_features=10
[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0],
[2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0],
[3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0],
[4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0],
[5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0],
]
k = tf.constant(kernel, dtype=tf.float32)
y = tf.nn.conv1d(x, k, stride=1, padding='SAME')
I realize the dimensions are not correct in the above example, but that's my data's actual format. The training samples have a shape of (history_size, num_features), and the kernel has to convolve along history_size, for each feature separately. Any help would be appreciated. I cannot find an example of how to perform tf.nn.conv1d manually.
You could use numpy.convolve() for this.
import numpy as np
kernel = [1.0,2.0,1.0] # weighted moving average
x = [ # history_size=5, num_features=10
[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0],
[2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0],
[3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0],
[4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0],
[5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0],
]
output = []
for i in range(len(x)):
    output.append(list(np.convolve(x[i], kernel, mode = 'same')))
output
'''
[[3.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 3.0],
[6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0],
[9.0, 12.0, 12.0, 12.0, 12.0, 12.0, 12.0, 12.0, 12.0, 9.0],
[12.0, 16.0, 16.0, 16.0, 16.0, 16.0, 16.0, 16.0, 16.0, 12.0],
[15.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0, 15.0]]
'''
You could try changing the mode to whichever fits you best, according to the documentation.
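For completeness, the same weighted moving average can also be written with tf.nn.conv1d itself once the input is given a batch and a channel axis. This is a minimal sketch, not from the original answer, and the axis handling is just one possible choice:
import tensorflow as tf
kernel = [1.0, 2.0, 1.0]                      # weighted moving average
x = [[float(v)] * 10 for v in range(1, 6)]    # same toy data, shape (5, 10)
xt = tf.expand_dims(tf.constant(x, dtype=tf.float32), axis=-1)    # (5, 10, 1): each row is one sequence
k = tf.reshape(tf.constant(kernel, tf.float32), (3, 1, 1))        # (filter_width, in_channels, out_channels)
y = tf.squeeze(tf.nn.conv1d(xt, k, stride=1, padding='SAME'), axis=-1)
print(y.numpy())   # matches the np.convolve(..., mode='same') output above
Note that tf.nn.conv1d actually computes a cross-correlation, but since the [1, 2, 1] kernel is symmetric the result is the same as a convolution. To convolve along history_size for each feature instead, as described in the question, transpose x first so that the features become the batch dimension and transpose the result back.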

Generate probability distribution or smoothing plot from points containing probabilities

I have points which include the probability on the y-axis and values on the x-axis, like:
p1 = [[0.0, 0.0001430560406790707],
[10.0, 6.2797052001508247e-13],
[15.0, 4.8114669550502021e-06],
[20.0, 0.0007443231772534647],
[25.0, 0.00061070912573869406],
[30.0, 0.48116582167944905],
[35.0, 0.24698643991977953],
[40.0, 0.016407283121225951],
[45.0, 0.2557158314329116],
[50.0, 1.1252231121357235e-05],
[55.0, 0.064666668633158647],
[60.0, 1.7631447655837744e-17],
[65.0, 1.1294722466816786e-14],
[70.0, 2.9419020411134367e-16],
[75.0, 3.0887653014525822e-17],
[80.0, 4.4973693062706866e-17],
[85.0, 9.0975358174005147e-15],
[90.0, 1.0758266454985257e-10],
[95.0, 7.2923752473657924e-08],
[100.0, 1.8065366882584036e-08]]
p2 = [[0.0, 4.1652247577331996e-06],
[10.0, 1.2212829713673957e-06],
[15.0, 6.5906857192417344e-08],
[20.0, 0.00016745946587138236],
[25.0, 0.0054431111796765554],
[30.0, 0.0067575214586160616],
[35.0, 0.00011856110316632124],
[40.0, 0.00032181662132509944],
[45.0, 0.001397981055516994],
[50.0, 0.0027058954834684062],
[55.0, 2.553142406703067e-06],
[60.0, 1.1514033594755017e-08],
[65.0, 0.21961568282994792],
[70.0, 2.4658349829099807e-08],
[75.0, 0.0022850986575076743],
[80.0, 3.5603047823624507e-06],
[85.0, 0.99406392082894734],
[90.0, 0.24399923235645221],
[95.0, 0.0013470125217945798],
[100.0, 0.042582366972883985]]
Now I want to generate a probability distribution from the points, where the x-axis values are (0,10,15,20,...,100) and the y-axis values contain the probabilities (0.00014,....)
When using the plt.plot function I get:
plt.plot([item[0] for item in p1],[item[1] for item in p1])
And for p2:
plt.plot([item[0] for item in p2],[item[1] for item in p2])
I want to get a smoother visualization, like a probability distribution:
And if a probability distribution is not possible, then a smoothing spline:
Scipy's gaussian_kde is often used to smoothly approximate a probability distribution. It sums a gaussian kernel for each input point. Usually individual measurements are used as inputs, but the weights parameter allows working with binned data. The function is normalized to have its integral equal to one.
This approach assumes the values of p1 and p2 are meant as a mean for the segment around each x-value, similar to a histogram. I.e. a step function where the x-values identify the end of each step.
from matplotlib import pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde
p1 = np.array([[0.0, 0.0001430560406790707],
[10.0, 6.2797052001508247e-13],
[15.0, 4.8114669550502021e-06],
[20.0, 0.0007443231772534647],
[25.0, 0.00061070912573869406],
[30.0, 0.48116582167944905],
[35.0, 0.24698643991977953],
[40.0, 0.016407283121225951],
[45.0, 0.2557158314329116],
[50.0, 1.1252231121357235e-05],
[55.0, 0.064666668633158647],
[60.0, 1.7631447655837744e-17],
[65.0, 1.1294722466816786e-14],
[70.0, 2.9419020411134367e-16],
[75.0, 3.0887653014525822e-17],
[80.0, 4.4973693062706866e-17],
[85.0, 9.0975358174005147e-15],
[90.0, 1.0758266454985257e-10],
[95.0, 7.2923752473657924e-08],
[100.0, 1.8065366882584036e-08]])
p2 = np.array([[0.0, 4.1652247577331996e-06],
[10.0, 1.2212829713673957e-06],
[15.0, 6.5906857192417344e-08],
[20.0, 0.00016745946587138236],
[25.0, 0.0054431111796765554],
[30.0, 0.0067575214586160616],
[35.0, 0.00011856110316632124],
[40.0, 0.00032181662132509944],
[45.0, 0.001397981055516994],
[50.0, 0.0027058954834684062],
[55.0, 2.553142406703067e-06],
[60.0, 1.1514033594755017e-08],
[65.0, 0.21961568282994792],
[70.0, 2.4658349829099807e-08],
[75.0, 0.0022850986575076743],
[80.0, 3.5603047823624507e-06],
[85.0, 0.99406392082894734],
[90.0, 0.24399923235645221],
[95.0, 0.0013470125217945798],
[100.0, 0.042582366972883985]])
x = np.linspace(0, 100, 1000)
fig, axes = plt.subplots(ncols=2)
for ax, p in zip(axes, [p1, p2]):
    p[0, 0] = 5.0 # let each x-value be the end of a segment
    ax.step(p[:,0], p[:,1], color='dodgerblue', lw=1, ls=':', where='pre')
    ax2 = ax.twinx()
    kde = gaussian_kde(p[:,0]-2.5, bw_method=.25, weights=p[:,1])
    ax2.plot(x, kde(x), color='crimson')
plt.show()
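The question also mentions a smoothing spline as a fallback. Here is a rough sketch of that option using scipy's UnivariateSpline; the smoothing factor s below is only a placeholder and needs tuning, and the spline is not guaranteed to stay non-negative, so it is clipped at zero for display:
from scipy.interpolate import UnivariateSpline
xs, ys = p1[:, 0], p1[:, 1]
spline = UnivariateSpline(xs, ys, k=3, s=0.05)   # s trades smoothness against closeness to the points
x_dense = np.linspace(xs.min(), xs.max(), 500)
plt.plot(xs, ys, 'o', ms=4, label='points')
plt.plot(x_dense, np.clip(spline(x_dense), 0, None), label='smoothing spline')
plt.legend()
plt.show()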

Problem with for loop and creating list of list

I have a numpy array called expected, which is a list of lists of lists.
expected = [[[45.0, 10.0, 10.0], [110.0, 10.0, 8.0], [60.0, 10.0, 5.0], [170.0, 10.0, 4.0]], [[-80.0, 20.0, 10.0], [97.0, 15.0, 12.0], [5.0, 20.0, 8.0], [93.0, 10.0, 8.0], [12.0, 5.0, 15.0], [-88.0, 10.0, 10.0], [176.0, 10.0, 8.0]]]
I want to put it through a loop without hardcoding anything, so it's applicable to lists of different lengths.
When the loop runs for the first time i want it to solve this:
horizontal_exp = expected[0][0][1]*expected[0][0][2]*np.cos(np.deg2rad(expected[0][0][0]))
Then the next loop to be like this:
horizontal_exp = expected[1][1][1]*expected[1][1][2]*np.cos(np.deg2rad(expected[1][1][0]))
And the following loop to be like this:
horizontal_exp = expected[2][2][1]*expected[2][2][2]*np.cos(np.deg2rad(expected[2][2][0]))
and so on, until it has finished the different sections of rows.
I don't understand why the 'i' never worked.
In the end I want horizontal_expected to be a list of lists,
e.g.
horizontal_expected = [ [12,21,23,34], [12,32,54,65,76,87,65] ] # these are not the real values, just an example
where [12,21,23,34] corresponds to [[45.0, 10.0, 10.0], [110.0, 10.0, 8.0], [60.0, 10.0, 5.0], [170.0, 10.0, 4.0]]
and [12,32,54,65,76,87,65] corresponds to [[-80.0, 20.0, 10.0], [97.0, 15.0, 12.0], [5.0, 20.0, 8.0], [93.0, 10.0, 8.0], [12.0, 5.0, 15.0], [-88.0, 10.0, 10.0], [176.0, 10.0, 8.0]]
I'm unsure how to do this. I know you have to append it with a for loop, but how do you separate it into a list of lists?
horizontal_expected = []
for i in list(range(len(expected[i]))):
    horizontal_exp = expected[i][i][1]*expected[i][i][2]*np.cos(np.deg2rad(expected[i][i][0]))
    horizontal_expected.append(horizontal_exp)
print(horizontal_expected)
The reason you don't see the desired output is that, even though expected is a nested list, you are iterating through only one level of it. You first need to iterate through the outer list and then through each nested list:
import numpy as np
expected = [ [[45.0, 10.0, 10.0], [110.0, 10.0, 8.0], [60.0, 10.0, 5.0], [170.0, 10.0, 4.0]], [[-80.0, 20.0, 10.0], [97.0, 15.0, 12.0], [5.0, 20.0, 8.0], [93.0, 10.0, 8.0], [12.0, 5.0, 15.0], [-88.0, 10.0, 10.0], [176.0, 10.0, 8.0]] ]
horizontal_expected = []
for i in range(len(expected)):
    tmp_list = []
    for j in range(len(expected[i])):
        horizontal_exp = expected[i][i][1]*expected[i][i][2]*np.cos(np.deg2rad(expected[i][i][0]))
        tmp_list.append(horizontal_exp)
    horizontal_expected.append(tmp_list)
print(horizontal_expected)
The output of that is a list of lists:
>>> print(horizontal_expected)
[[70.71067811865476, 70.71067811865476, 70.71067811865476, 70.71067811865476], [-21.936481812926527, -21.936481812926527, -21.936481812926527, -21.936481812926527, -21.936481812926527, -21.936481812926527, -21.936481812926527]]
As you can see, it holds a value for each of the lists in the input, but the value is the same. This is due to the way that your equation was set up.
You want the indices to be updated based on the level of the loop:
horizontal_exp = expected[i][j][1]*expected[i][j][2]*np.cos(np.deg2rad(expected[i][j][0]))
The full working code would look like this:
import numpy as np
expected = [ [[45.0, 10.0, 10.0], [110.0, 10.0, 8.0], [60.0, 10.0, 5.0], [170.0, 10.0, 4.0]], [[-80.0, 20.0, 10.0], [97.0, 15.0, 12.0], [5.0, 20.0, 8.0], [93.0, 10.0, 8.0], [12.0, 5.0, 15.0], [-88.0, 10.0, 10.0], [176.0, 10.0, 8.0]] ]
horizontal_expected = []
for i in range(len(expected)):
    tmp_list = []
    for j in range(len(expected[i])):
        horizontal_exp = expected[i][j][1]*expected[i][j][2]*np.cos(np.deg2rad(expected[i][j][0]))
        tmp_list.append(horizontal_exp)
    horizontal_expected.append(tmp_list)
print(horizontal_expected)
And the output:
>>> print(horizontal_expected)
[[70.71067811865476, -27.361611466053496, 25.000000000000007, -39.39231012048832], [34.72963553338608, -21.936481812926527, 159.39115169467928, -4.186876499435507, 73.36107005503543, 3.489949670250108, -79.80512402078594]]
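As a side note (not part of the original answer), once the i/j indexing is clear, the same result can be written as a nested list comprehension:
horizontal_expected = [
    [row[1] * row[2] * np.cos(np.deg2rad(row[0])) for row in group]
    for group in expected
]
print(horizontal_expected)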

Constructing a 2d interpolator given scattered input data

I have three lists as follows:
x = [100.0, 75.0, 50.0, 0.0, 100.0, 75.0, 50.0, 0.0, 100.0, 75.0, 50.0, 0.0, 100.0, 75.0, 50.0, 0.0, 100.0, 75.0, 50.0, 0.0, 100.0, 75.0, 50.0, 0.0, 100.0, 75.0, 50.0, 0.0, 100.0, 75.0, 50.0, 0.0, 100.0, 75.0, 50.0, 0.0, 100.0, 75.0, 50.0, 0.0]
y = [300.0, 300.0, 300.0, 300.0, 500.0, 500.0, 500.0, 500.0, 700.0, 700.0, 700.0, 700.0, 1000.0, 1000.0, 1000.0, 1000.0, 1500.0, 1500.0, 1500.0, 1500.0, 2000.0, 2000.0, 2000.0, 2000.0, 3000.0, 3000.0, 3000.0, 3000.0, 5000.0, 5000.0, 5000.0, 5000.0, 7500.0, 7500.0, 7500.0, 75000.0, 10000.0, 10000.0, 10000.0, 10000.0]
z = [100.0, 95.0, 87.5, 77.5, 60.0, 57.0, 52.5, 46.5, 40.0, 38.0, 35.0, 31.0, 30.0, 28.5, 26.25, 23.25, 23.0, 21.85, 20.125, 17.825, 17.0, 16.15, 14.875, 13.175, 13.0, 12.35, 11.375, 10.075, 10.0, 9.5, 8.75, 7.75, 7.0, 6.65, 6.125, 5.425, 5.0, 4.75, 4.375, 3.875]
Each entry of each list is read as a point, so point 0 is (100, 300, 100), point 1 is (75, 300, 95), and so on.
I am trying to do 2d interpolation, so that I can compute a z value for any given input (x0, y0) point.
I was reading that using meshgrid I can interpolate with RegularGridInterpolator from scipy, but I am not sure how to set it up. When I do:
x_,y_,z_ = np.meshgrid(x,y,z) # with either indexing, 'ij' or 'xy'
I don't get values for x_, y_, z_ that make sense, and I am not sure how to go on from there.
I am trying to use the data points I have above to find intermediate values, so something similar to scipy's interp1d,
f = interp1d(x, y, kind='cubic')
where I can later call f with any (x, y) point within range and get the corresponding z value.
You need 2d interpolation over scattered data. I'd default to using scipy.interpolate.griddata in this case, but you seem to want a callable interpolator, whereas griddata needs a given set of points onto which it will interpolate.
Not to worry: griddata with 2d cubic interpolation uses a CloughTocher2DInterpolator. So we can do exactly that:
import numpy as np
import scipy.interpolate as interp
x = [100.0, 75.0, 50.0, 0.0, 100.0, 75.0, 50.0, 0.0, 100.0, 75.0, 50.0, 0.0, 100.0, 75.0, 50.0, 0.0, 100.0, 75.0, 50.0, 0.0, 100.0, 75.0, 50.0, 0.0, 100.0, 75.0, 50.0, 0.0, 100.0, 75.0, 50.0, 0.0, 100.0, 75.0, 50.0, 0.0, 100.0, 75.0, 50.0, 0.0]
y = [300.0, 300.0, 300.0, 300.0, 500.0, 500.0, 500.0, 500.0, 700.0, 700.0, 700.0, 700.0, 1000.0, 1000.0, 1000.0, 1000.0, 1500.0, 1500.0, 1500.0, 1500.0, 2000.0, 2000.0, 2000.0, 2000.0, 3000.0, 3000.0, 3000.0, 3000.0, 5000.0, 5000.0, 5000.0, 5000.0, 7500.0, 7500.0, 7500.0, 75000.0, 10000.0, 10000.0, 10000.0, 10000.0]
z = [100.0, 95.0, 87.5, 77.5, 60.0, 57.0, 52.5, 46.5, 40.0, 38.0, 35.0, 31.0, 30.0, 28.5, 26.25, 23.25, 23.0, 21.85, 20.125, 17.825, 17.0, 16.15, 14.875, 13.175, 13.0, 12.35, 11.375, 10.075, 10.0, 9.5, 8.75, 7.75, 7.0, 6.65, 6.125, 5.425, 5.0, 4.75, 4.375, 3.875]
interpolator = interp.CloughTocher2DInterpolator(np.array([x,y]).T, z)
Now you can call this interpolator with 2 coordinates to give you the corresponding interpolated data point:
>>> interpolator(x[10], y[10]) == z[10]
True
>>> interpolator(2, 300)
array(77.81343)
Note that you'll have to stay inside the convex hull of the input points, otherwise you'll get nan (or whatever is passed as the fill_value keyword to the interpolator):
>>> interpolator(2, 30)
array(nan)
Extrapolation is usually meaningless anyway, and your input points are scattered in a somewhat erratic way:
So even if extrapolation were possible, I wouldn't trust it.
Just to demonstrate how the resulting interpolator is constrained to the convex hull of the input points, here's a surface plot of your data on a gridded mesh we create just for plotting:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# go linearly in the x grid
xline = np.linspace(min(x), max(x), 30)
# go logarithmically in the y grid (considering y distribution)
yline = np.logspace(np.log10(min(y)), np.log10(max(y)), 30)
# construct 2d grid from these
xgrid,ygrid = np.meshgrid(xline, yline)
# interpolate z data; same shape as xgrid and ygrid
z_interp = interpolator(xgrid, ygrid)
# create 3d Axes and plot surface and base points
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(xgrid, ygrid, z_interp, cmap='viridis',
                vmin=min(z), vmax=max(z))
ax.plot(x, y, z, 'ro')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
plt.show()
Here's the output from two angles (it's better to rotate around interactively; such stills don't do the 3d representation justice):
There are two main features to note:
The surface nicely fits the red points, which is expected from interpolation. Fortunately the input points are nice and smooth so everything goes well with interpolation. (The fact that the red points are usually hidden by the surface is only due to how pyplot's renderer mishandles the relative position of complex 3d objects)
The surface is cut (due to nan values) along the convex hull of the input points, so even though our gridded arrays define a rectangular grid we only get a cut of the surface where interpolation makes sense.
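To illustrate the point above that griddata with method='cubic' uses the same Clough-Tocher scheme, here is a small sketch comparing the two on a couple of arbitrary query points inside the convex hull (the query coordinates are just examples):
points = np.array([x, y]).T
query = np.array([[60.0, 400.0], [25.0, 1200.0]])   # arbitrary in-hull query points
z_direct = interpolator(query[:, 0], query[:, 1])                 # CloughTocher2DInterpolator built above
z_griddata = interp.griddata(points, z, query, method='cubic')    # griddata's cubic path
print(np.allclose(z_direct, z_griddata))   # expected to be True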

Python concatenating elements of one list that are between elements of another list

I have two lists: a and b. I want to concatenate all of the elements of b that are between elements of a. All of the elements of a are in b, but b also has some extraneous extra elements. I would like to take the first instance of every element of a in b and concatenate it with the extraneous elements that follow it in b, until we find another element of a in b. The following example should make it clearer.
a = [[11.0, 1.0], [11.0, 2.0], [11.0, 3.0], [11.0, 4.0], [11.0, 5.0], [12.0, 1.0], [12.0, 2.0], [12.0, 3.0], [12.0, 4.0], [12.0, 5.0], [12.0, 6.0], [12.0, 7.0], [12.0, 8.0], [12.0, 9.0], [12.0, 10.0], [12.0, 11.0], [12.0, 12.0], [12.0, 13.0], [12.0, 14.0], [13.0, 1.0], [13.0, 2.0], [13.0, 3.0], [13.0, 4.0], [13.0, 5.0], [13.0, 6.0], [13.0, 7.0], [13.0, 8.0], [13.0, 9.0], [13.0, 10.0]]
b = [[11.0, 1.0], [11.0, 1.0], [1281.0, 8.0], [11.0, 2.0], [11.0, 3.0], [11.0, 3.0], [11.0, 4.0], [11.0, 5.0], [12.0, 1.0], [12.0, 2.0], [12.0, 3.0], [12.0, 4.0], [12.0, 5.0], [12.0, 6.0], [12.0, 7.0], [12.0, 5.0], [12.0, 8.0], [12.0, 9.0], [12.0, 10.0], [13.0, 5.0], [12.0, 11.0], [12.0, 8.0], [3.0, 1.0], [13.0, 1.0], [9.0, 7.0], [12.0, 12.0], [12.0, 13.0], [12.0, 14.0], [13.0, 1.0], [13.0, 2.0], [11.0, 3.0], [13.0, 3.0], [13.0, 4.0], [13.0, 5.0], [13.0, 5.0], [13.0, 5.0], [13.0, 6.0], [13.0, 7.0], [13.0, 7.0], [13.0, 8.0], [13.0, 9.0], [13.0, 10.0]]
c = [[[11.0, 1.0], [11.0, 1.0], [1281.0, 8.0]], [[11.0, 2.0]], [[11.0, 3.0], [11.0, 3.0]], [[11.0, 4.0]], [[11.0, 5.0]], [[12.0, 1.0]], [[12.0, 2.0]], [[12.0, 3.0]], [[12.0, 4.0]], [[12.0, 5.0]], [[12.0, 6.0]], [[12.0, 7.0], [12.0, 5.0]], [[12.0, 8.0]], [[12.0, 9.0]], [[12.0, 10.0], [13.0, 5.0]], [[12.0, 11.0], [12.0, 8.0], [3.0, 1.0]], [[13.0, 1.0], [9.0, 7.0], [12.0, 12.0], [12.0, 13.0], [12.0, 14.0], [13.0, 1.0]], [[13.0, 2.0]], [[11.0, 3.0], [13.0, 3.0]], [[13.0, 4.0]], [[13.0, 5.0], [13.0, 5.0], [13.0, 5.0]], [[13.0, 6.0]], [[13.0, 7.0], [13.0, 7.0]], [[13.0, 8.0]], [[13.0, 9.0]], [[13.0, 10.0]]]
What I have thought of is something like this:
slice_list = []
for i, elem in enumerate(a):
    if i < len(a)-1:
        b_first_index = b.index(a[i])
        b_second_index = b.index(a[i+1])
        slice_list.append([b_first_index, b_second_index])
c = [b[slice_list[i][0]:slice_list[i][1]] for i in range(len(slice_list))]
This, however, will not catch the last item in the list (which I am not quite sure how to fit into my list comprehension anyway), and it seems quite ugly. My question is: is there a neater way of doing this (perhaps with itertools)?
Let's simplify the visual a bit:
key_list = ['a', 'c', 'f']
wrong_list = ['a', 'b', 'c', 'd', 'e', 'f']
wrong_list_fixed = [['a', 'b'], ['c', 'd', 'e'], ['f']]
This will be semantically identical to what you have, but I think it is easier to see without all the extra nested brackets.
You could use itertools.groupby, if you could only come up with a clever key. Luckily, the mapping of key_list to wrong_list gives you exactly what you want:
import itertools

class key:
    def __init__(self, key_list):
        self.last = -1
        self.key_list = key_list

    def __call__(self, item):
        try:
            self.last = self.key_list.index(item, self.last + 1)
        except ValueError:
            pass
        return self.last

wrong_list_fixed = [list(g) for k, g in itertools.groupby(wrong_list, key(key_list))]
The key maps elements of wrong_list to key_list using index. For missing indices, it just returns the last one successfully found, ensuring that groups are not split until a new index is found. By starting the search from the next available index, you can ensure that duplicate entries in key_list get handled correctly.
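For instance, applying this to the simplified key_list and wrong_list from above reproduces the grouping shown earlier (a small usage sketch, not part of the original answer):
key_list = ['a', 'c', 'f']
wrong_list = ['a', 'b', 'c', 'd', 'e', 'f']
wrong_list_fixed = [list(g) for k, g in itertools.groupby(wrong_list, key(key_list))]
print(wrong_list_fixed)   # [['a', 'b'], ['c', 'd', 'e'], ['f']]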
I think your example wrong_list_fixed is incorrect.
[[12.0, 10.0], [13.0, 5.0], [12.0, 11.0], [12.0, 8.0],
# There should be a new list here -^
Here's a solution that walks the lists. It can be optimized further:
from contextlib import suppress
fixed = []
current = []
key_list_iter = iter(key_list)
next_key = next(key_list_iter)
for wrong in wrong_list:
    if wrong == next_key:
        if current:
            fixed.append(current)
            current = []
        next_key = None
        with suppress(StopIteration):
            next_key = next(key_list_iter)
    current.append(wrong)
if current:
    fixed.append(current)
Here are the correct lists (modified to be easier to visually parse):
key_list = ['_a0', '_b0', '_c0', '_d0', '_e0', '_f0', '_g0', '_h0', '_i0', '_j0', '_k0', '_l0', '_m0', '_n0', '_o0', '_p0', '_q0', '_r0', '_s0', '_t0', '_u0', '_v0', '_w0', '_x0', '_y0', '_z0', '_A0', '_B0', '_C0']
wrong_list = ['_a0', '_a0', 'D0', '_b0', '_c0', '_c0', '_d0', '_e0', '_f0', '_g0', '_h0', '_i0', '_j0', '_k0', '_l0', '_j0', '_m0', '_n0', '_o0', '_x0', '_p0', '_m0', 'E0', '_t0', 'F0', '_q0', '_r0', '_s0', '_t0', '_u0', '_c0', '_v0', '_w0', '_x0', '_x0', '_x0', '_y0', '_z0', '_z0', '_A0', '_B0', '_C0']
wrong_list_fixed = [['_a0', '_a0', 'D0'], ['_b0'], ['_c0', '_c0'], ['_d0'], ['_e0'], ['_f0'], ['_g0'], ['_h0'], ['_i0'], ['_j0'], ['_k0'], ['_l0', '_j0'], ['_m0'], ['_n0'], ['_o0', '_x0'], ['_p0', '_m0', 'E0', '_t0', 'F0'], ['_q0'], ['_r0'], ['_s0'], ['_t0'], ['_u0', '_c0'], ['_v0'], ['_w0'], ['_x0', '_x0', '_x0'], ['_y0'], ['_z0', '_z0'], ['_A0'], ['_B0'], ['_C0']]
I get a slightly different result from yours, but give it a try. If this is not what you want, I will delete my answer.
idx = sorted(set([b.index(ai) for ai in a] + [len(b)]))
c = [b[i:j] for i, j in zip(idx[:-1], idx[1:])]
