I am trying to access a column of a data frame I created from two lists and do some filtering. However, there seems to be an additional space after every 12th element in my dataframe. How do I deal with this?
import pandas as pd
df = pd.DataFrame(
    {'S': s,
     'K': k})
I created the dataframe with the code above. Also, for some reason the values were stored in scientific notation as floats; I used df.round(4) before I could actually figure out what the problem was.
KeyError Traceback (most recent call last)
<ipython-input-14-2e289598b460> in <module>()
----> 1 df[df['S']]
~\Anaconda\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
1956 if isinstance(key, (Series, np.ndarray, Index, list)):
1957 # either boolean or fancy integer index
-> 1958 return self._getitem_array(key)
1959 elif isinstance(key, DataFrame):
1960 return self._getitem_frame(key)
~\Anaconda\lib\site-packages\pandas\core\frame.py in _getitem_array(self, key)
2000 return self.take(indexer, axis=0, convert=False)
2001 else:
-> 2002 indexer = self.loc._convert_to_indexer(key, axis=1)
2003 return self.take(indexer, axis=1, convert=True)
2004
~\Anaconda\lib\site-packages\pandas\core\indexing.py in _convert_to_indexer(self, obj, axis, is_setter)
1229 mask = check == -1
1230 if mask.any():
-> 1231 raise KeyError('%s not in index' % objarr[mask])
1232
1233 return _values_from_object(indexer)
KeyError: '[-0.65 -0.6 -0.6 -0.6 -0.55 -0.55 -0.55 -0.55 -0.55 -0.55 -0.5 -0.5\n -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.45 -0.45 -0.45 -0.45 -0.45 -0.45\n -0.45 -0.45 -0.45 -0.45 -0.4 -0.4 -0.4 -0.4 -0.4 -0.4 -0.4 -0.4\n -0.4 -0.4 -0.4 -0.4 -0.35 -0.35 -0.35 -0.35 -0.35 -0.35 -0.35 -0.35\n -0.35 -0.35 -0.35 -0.35 -0.35 -0.35 -0.3 -0.3 -0.3 -0.3 -0.3 -0.3\n -0.3 -0.3 -0.3 -0.3 -0.3 -0.3 -0.3 -0.3 -0.3 -0.3 -0.25 -0.25\n -0.25 -0.25 -0.25 -0.25 -0.25 -0.25 -0.25 -0.25 -0.25 -0.25 -0.25 -0.25\n -0.25 -0.25 -0.25 -0.25 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2\n -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.15\n -0.15 -0.15 -0.15 -0.15 -0.15 -0.15 -0.15 -0.15 -0.15 -0.15 -0.15 -0.15\n -0.15 -0.15 -0.15 -0.15 -0.15 -0.15 -0.15 -0.1 -0.1 -0.1 -0.1 -0.1\n -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 -0.1\n -0.1 -0.1 -0.1 -0.1 -0.05 -0.05 -0.05 -0.05 -0.05 -0.05 -0.05 -0.05\n -0.05 -0.05 -0.05 -0.05 -0.05 -0.05 -0.05 -0.05 -0.05 -0.05 -0.05 -0.05\n -0.05 -0.05 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05\n 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05\n 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1\n 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1\n 0.1 0.15 0.15 0.15 0.15 0.15 0.15 0.15 0.15 0.15 0.15 0.15\n 0.15 0.15 0.15 0.15 0.15 0.15 0.15 0.15 0.15 0.15 0.15 0.15\n 0.15 0.15 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2\n 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2\n 0.2 0.2 0.2 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25\n 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25\n 0.25 0.25 0.25 0.25 0.25 0.3 0.3 0.3 0.3 0.3 0.3 0.3\n 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3\n 0.3 0.3 0.3 0.3 0.3 0.3 0.35 0.35 0.35 0.35 0.35 0.35\n 0.35 0.35 0.35 0.35 0.35 0.35 0.35 0.35 0.35 0.35 0.35 0.35\n 0.35 0.35 0.35 0.35 0.35 0.35 0.35 0.4 0.4 0.4 0.4 0.4\n 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4\n 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.45 0.45 0.45 0.45\n 0.45 0.45 0.45 0.45 0.45 0.45 0.45 0.45 0.45 0.45 0.45 0.45\n 0.45 0.45 0.45 0.45 0.45 0.45 0.45 0.45 0.45 0.5 0.5 0.5\n 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5\n 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.55 0.55 0.55\n 0.55 0.55 0.55 0.55 0.55 0.55 0.55 0.55 0.55 0.55 0.55 0.55\n 0.55 0.55 0.55 0.55 0.55 0.55 0.55 0.55 0.6 0.6 0.6 0.6\n 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6\n 0.6 0.6 0.6 0.6 0.6 0.6 0.65 0.65 0.65 0.65 0.65 0.65\n 0.65 0.65 0.65 0.65 0.65 0.65 0.65 0.65 0.65 0.65 0.65 0.65\n 0.65 0.65 0.65 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7\n 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.75\n 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75\n 0.75 0.75 0.75 0.75 0.75 0.8 0.8 0.8 0.8 0.8 0.8 0.8\n 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.85 0.85\n 0.85 0.85 0.85 0.85 0.85 0.85 0.85 0.85 0.85 0.85 0.85 0.85\n 0.85 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9\n 0.9 0.9 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95\n 0.95 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.05 1.05\n 1.05 1.05 1.05 1.05 1.1 1.1 1.1 1.1 1.15 1.3 1.3 1.3\n 1.3 1.3 1.35 1.35 1.35 1.35 1.35 1.35 1.35 1.35 1.35 1.35\n 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4\n 1.45 1.45 1.45 1.45 1.45 1.45 1.45 1.45 1.45 1.45 1.45 1.45\n 1.45 1.45 1.45 1.45 1.45] not in index'
The \n characters in the error message are most likely just numpy wrapping the printed array (which is also where the apparent extra space every 12th element comes from), but you can try to remove every \n in S like so:
df["S"] = df["S"].apply(lambda x: float(str(x).replace("\n", "")))
But by the way: df[df["S"]] does not filter rows. Since S is not a boolean column, pandas looks up its values as column labels, and with your construction of the dataframe none of those float values are column names, hence the KeyError.
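If the goal is to filter rows by a condition on S, a boolean mask is what you want. A minimal sketch (the data and the threshold are made up purely for illustration):

import pandas as pd

s = [-0.65, -0.6, 0.1, 0.35]   # stand-ins for your real lists
k = [1.2, 3.4, 5.6, 7.8]
df = pd.DataFrame({'S': s, 'K': k})

filtered = df[df['S'] > 0]     # keep rows where the condition on S is True
print(filtered)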
I am trying to create a k-means algorithm based on the Earth Mover's Distance (EMD) instead of the Euclidean distance. However, when I run it, it just returns the same cluster for all data points.
The input is a d x n matrix containing my n probability distributions, one per column.
Here is an example of running the algorithm. The cluster assignments for my 169 data points should be much more spread out. I've also tried running it for more iterations, to no avail.
distribution = {}
num_bins = 10
for i in data:
    distribution[i] = np.histogram(data[i], bins=num_bins)[0] / len(data[i])

Z = np.zeros((len(data), num_bins))
for i in range(len(Z)):
    Z[i] = distribution[list(distribution)[i]]
Z = Z.T

ans = k_means_algorithm(Z, 8, proportionally_random_k)
res = points_to_clusters(Z, ans)
print(res)
[[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 3. 0. 0. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 4. 1. 0. 4. 1. 1. 1. 4. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 4. 2. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0.
1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 0. 1. 1. 1. 1. 3. 1. 1. 1. 1. 0. 0.
1. 0. 1. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1.]]
Here is my code:
import numpy as np
import scipy.stats

def k_random(X, k):
    # pick k random columns of X as the initial centroids
    random_idx = np.random.permutation(X.shape[1])
    centroids = X[:, random_idx[:k]]
    return centroids

def points_to_clusters(X, centroids):
    # assign every column of X to the closest centroid
    assignment = np.zeros(len(X[0]))
    for i in range(len(X[0])):
        min_dist = 100000000
        for j in range(len(centroids[0])):
            cur_dist = scipy.stats.wasserstein_distance(X[:, i], centroids[:, j])
            #cur_dist = np.linalg.norm(X[:, i] - centroids[:, j])
            #EMD vs Norm
            if cur_dist < min_dist:
                assignment[i] = j
                min_dist = cur_dist
    assignment = np.array([assignment])
    return assignment

def compute_centroids(X, k, cluster_assignments, old_centroids):
    # recompute each centroid as the mean of its assigned points;
    # keep the old centroid if a cluster ends up empty
    dimensions = X.shape[0]
    centroids = np.zeros((dimensions, k))
    for j in range(k):
        new_centroid = np.mean(X[:, cluster_assignments.squeeze() == j], axis=1)
        if np.isnan(new_centroid).any():
            centroids[:, j] = old_centroids[:, j]
        else:
            centroids[:, j] = new_centroid
    return centroids

def k_means_algorithm(X, k, random_init, max_iters=1000, return_obj=False):
    # alternate assignment and update steps until the centroids stop changing
    centroids = random_init(X, k)
    cluster_assignments = points_to_clusters(X, centroids)
    new_centroids = compute_centroids(X, k, cluster_assignments, centroids)
    counter = 0
    while not np.array_equal(centroids, new_centroids) and counter < max_iters:
        centroids = new_centroids
        cluster_assignments = points_to_clusters(X, centroids)
        new_centroids = compute_centroids(X, k, cluster_assignments, centroids)
        counter += 1
    return centroids

def proportionally_random_k(X, k):
    # farthest-point initialisation: repeatedly pick the point furthest
    # from the centroids chosen so far
    centroids = [X[:, np.random.randint(X.shape[1])]]
    for other_centroids in range(k - 1):
        distances = []
        for i in range(X.shape[1]):
            point = X[:, i]
            d = float("inf")
            for j in range(len(centroids)):
                temp_dist = np.linalg.norm(point - centroids[j])
                d = min(d, temp_dist)
            distances.append(d)
        distances = np.array(distances)
        next_centroid = X[:, np.argmax(distances)]
        centroids.append(next_centroid)
        distances = []
    return np.asarray(centroids).T
Here are my results when running the Norm vs. the EMD.
k_means_algorithm(Z, 8, k_random, max_iters=1000)
Note that the only thing that changes between the two runs is the commented-out line in the function points_to_clusters, which specifies how points are assigned to clusters.
Norm: about what I expected, well-distributed clusters.
[[6. 6. 6. 6. 6. 6. 6. 6. 0. 0. 0. 0. 4. 4. 2. 2. 2. 2. 2. 2. 2. 2. 7. 7.
6. 6. 6. 6. 6. 6. 0. 0. 0. 0. 4. 4. 2. 2. 2. 2. 2. 2. 2. 2. 7. 7. 6. 6.
6. 6. 0. 6. 0. 0. 0. 4. 2. 2. 2. 2. 2. 2. 2. 2. 7. 2. 6. 6. 0. 0. 0. 0.
4. 4. 2. 2. 5. 4. 2. 2. 2. 2. 7. 7. 0. 0. 0. 4. 4. 4. 4. 4. 2. 4. 2. 2.
5. 5. 7. 7. 4. 4. 4. 4. 5. 5. 5. 5. 5. 5. 5. 5. 7. 7. 4. 4. 5. 4. 5. 5.
5. 5. 5. 5. 7. 3. 5. 5. 5. 5. 3. 3. 3. 3. 3. 3. 5. 5. 3. 3. 3. 5. 3. 3.
3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 7. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
3.]]
EMD: gives the same cluster to all distributions. Clearly not right; the assignments should be much more spread out.
[[2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
2.]]
Here is a sample of the data that I used when running the above cluster function.
[[0.41 0.46 0.35 0.45 0.39 0.42 0.38 0.38 0.27 0.29 0.24 0.3 0.18 0.19
0.07 0.12 0.1 0.09 0.06 0.09 0.07 0.05 0.03 0.05 0.35 0.4 0.36 0.42
0.36 0.43 0.31 0.31 0.22 0.29 0.16 0.18 0.06 0.11 0.06 0.05 0.1 0.06
0.06 0.06 0.06 0.02 0.39 0.35 0.34 0.38 0.31 0.36 0.23 0.23 0.24 0.2
0.12 0.07 0.06 0.06 0.06 0.05 0.03 0.03 0.05 0.03 0.38 0.39 0.29 0.32
0.26 0.24 0.17 0.17 0.07 0.1 0.03 0.1 0.05 0.06 0.05 0.03 0.02 0.03
0.29 0.37 0.24 0.21 0.16 0.15 0.11 0.1 0.13 0.14 0.03 0.06 0.03 0.13
0.03 0.03 0.14 0.18 0.17 0.11 0.08 0.09 0.07 0.04 0.05 0.03 0.04 0.02
0.03 0.04 0.16 0.15 0.1 0.09 0.03 0.1 0.03 0.06 0.06 0.06 0.03 0.01
0.09 0.07 0.06 0.08 0.03 0.04 0.02 0.01 0.02 0.03 0.06 0.04 0.04 0.04
0.04 0.04 0.03 0.04 0.05 0.03 0.04 0.03 0.03 0.04 0.04 0.01 0.01 0.05
0.02 0.01 0.03 0.02 0.01 0.01 0.01 0.03 0.01 0.02 0.01 0.01 0. 0.02
0.01]
[0.04 0.03 0.04 0.05 0.06 0.05 0.05 0.04 0.15 0.16 0.18 0.2 0.22 0.22
0.21 0.22 0.14 0.16 0.19 0.17 0.1 0.09 0.12 0.14 0.04 0.04 0.03 0.04
0.04 0.06 0.13 0.14 0.2 0.23 0.24 0.23 0.18 0.19 0.13 0.18 0.14 0.17
0.1 0.1 0.13 0.13 0.02 0.05 0.04 0.05 0.12 0.11 0.14 0.21 0.19 0.25
0.18 0.19 0.1 0.21 0.19 0.19 0.1 0.12 0.13 0.18 0.03 0.07 0.13 0.11
0.12 0.2 0.25 0.23 0.21 0.23 0.13 0.21 0.18 0.19 0.1 0.09 0.15 0.12
0.11 0.11 0.21 0.19 0.13 0.28 0.21 0.19 0.15 0.24 0.23 0.17 0.1 0.14
0.1 0.14 0.24 0.26 0.22 0.23 0.18 0.2 0.12 0.19 0.12 0.14 0.11 0.08
0.11 0.14 0.18 0.22 0.2 0.24 0.13 0.14 0.11 0.14 0.11 0.1 0.16 0.04
0.21 0.21 0.07 0.09 0.06 0.08 0.04 0.09 0.06 0.03 0.1 0.11 0.08 0.07
0.08 0.1 0.06 0.04 0.07 0.11 0.05 0.07 0.08 0.06 0.11 0.04 0.07 0.08
0.04 0.05 0.03 0.02 0.02 0.01 0.02 0.02 0.01 0.02 0.02 0.01 0.02 0.02
0. ]
[0.03 0.02 0.02 0.02 0.02 0.02 0.02 0.01 0.03 0.03 0.01 0.01 0.07 0.09
0.12 0.15 0.14 0.15 0.09 0.17 0.21 0.2 0.09 0.09 0.03 0.01 0.01 0.02
0.02 0.04 0.03 0.01 0.02 0.03 0.06 0.06 0.18 0.15 0.13 0.15 0.13 0.17
0.18 0.23 0.07 0.09 0.01 0.02 0.02 0.02 0.02 0.01 0.03 0.03 0.05 0.04
0.14 0.15 0.15 0.13 0.18 0.18 0.21 0.18 0.1 0.1 0.01 0.02 0.01 0.01
0.02 0.02 0.04 0.06 0.12 0.15 0.18 0.08 0.16 0.14 0.22 0.23 0.1 0.08
0. 0.01 0.01 0.02 0.06 0.06 0.1 0.13 0.11 0.08 0.16 0.15 0.19 0.08
0.11 0.1 0.02 0.02 0.05 0.07 0.13 0.17 0.12 0.19 0.16 0.17 0.22 0.22
0.13 0.09 0.05 0.06 0.15 0.13 0.15 0.16 0.17 0.13 0.17 0.25 0.08 0.2
0.12 0.12 0.18 0.16 0.14 0.14 0.11 0.14 0.12 0.16 0.18 0.19 0.14 0.12
0.12 0.14 0.09 0.13 0.14 0.14 0.14 0.13 0.13 0.1 0.16 0.1 0.1 0.11
0.13 0.18 0.1 0.08 0.05 0.07 0.03 0.06 0.06 0.02 0.01 0.03 0.01 0.02
0.02]
[0.07 0.11 0.08 0.07 0.07 0.06 0.06 0.1 0.06 0.05 0.07 0.05 0.06 0.05
0.05 0.05 0.11 0.11 0.09 0.1 0.12 0.14 0.3 0.26 0.05 0.06 0.05 0.05
0.05 0.04 0.03 0.05 0.03 0.02 0.04 0.06 0.08 0.04 0.12 0.11 0.1 0.1
0.14 0.12 0.21 0.25 0.04 0.05 0.07 0.04 0.03 0.04 0.05 0.05 0.06 0.05
0.08 0.07 0.1 0.1 0.06 0.06 0.08 0.09 0.19 0.13 0.03 0.04 0.03 0.02
0.04 0.03 0.05 0.04 0.05 0.05 0.1 0.08 0.05 0.07 0.06 0.11 0.23 0.23
0.04 0.03 0.02 0.02 0.03 0.01 0.03 0.05 0.09 0.04 0.1 0.07 0.11 0.11
0.21 0.21 0.03 0.02 0.03 0.02 0.05 0.04 0.09 0.1 0.06 0.1 0.07 0.16
0.17 0.16 0.03 0.02 0.03 0.03 0.1 0.08 0.14 0.1 0.12 0.08 0.17 0.13
0.02 0.04 0.07 0.08 0.15 0.16 0.16 0.12 0.2 0.17 0.06 0.07 0.13 0.13
0.12 0.1 0.17 0.16 0.15 0.16 0.07 0.15 0.19 0.17 0.08 0.16 0.15 0.18
0.09 0.15 0.27 0.17 0.1 0.09 0.07 0.06 0.06 0.11 0.12 0.08 0.02 0.03
0.02]
[0.13 0.12 0.13 0.11 0.14 0.1 0.09 0.13 0.08 0.13 0.12 0.09 0.13 0.08
0.1 0.09 0.09 0.12 0.13 0.09 0.09 0.09 0.01 0.04 0.11 0.14 0.11 0.14
0.09 0.11 0.11 0.12 0.07 0.06 0.06 0.09 0.13 0.08 0.1 0.1 0.09 0.09
0.09 0.11 0.05 0.08 0.1 0.1 0.11 0.08 0.06 0.11 0.09 0.06 0.08 0.06
0.1 0.06 0.1 0.07 0.08 0.05 0.12 0.11 0.07 0.1 0.08 0.04 0.06 0.07
0.08 0.06 0.05 0.09 0.07 0.07 0.05 0.08 0.08 0.08 0.12 0.13 0.06 0.07
0.03 0.03 0.04 0.05 0.05 0.05 0.05 0.05 0.07 0.04 0.06 0.08 0.05 0.05
0.06 0.07 0. 0.02 0.05 0.03 0.06 0.03 0.03 0.03 0.05 0.05 0.06 0.04
0.07 0.06 0.04 0.03 0.04 0.04 0.04 0.03 0.06 0.08 0.05 0.05 0.03 0.09
0.02 0.02 0.06 0.01 0.04 0.07 0.06 0.1 0.06 0.08 0.03 0.01 0.04 0.06
0.1 0.11 0.06 0.07 0.03 0.03 0.11 0.14 0.07 0.08 0.09 0.13 0.08 0.06
0.09 0.1 0.21 0.24 0.31 0.25 0.18 0.18 0.22 0.22 0.15 0.13 0.07 0.08
0.01]
[0.13 0.09 0.09 0.09 0.11 0.14 0.14 0.11 0.15 0.12 0.09 0.12 0.09 0.11
0.07 0.07 0.11 0.09 0.09 0.08 0.09 0.1 0.09 0.08 0.15 0.11 0.15 0.13
0.15 0.13 0.11 0.13 0.15 0.1 0.1 0.1 0.1 0.11 0.08 0.08 0.06 0.08
0.1 0.06 0.09 0.07 0.13 0.17 0.13 0.17 0.16 0.14 0.13 0.13 0.08 0.13
0.11 0.1 0.1 0.09 0.07 0.12 0.07 0.09 0.09 0.11 0.16 0.16 0.14 0.15
0.11 0.15 0.1 0.11 0.1 0.11 0.09 0.11 0.07 0.09 0.09 0.09 0.05 0.09
0.15 0.12 0.11 0.12 0.16 0.12 0.1 0.1 0.12 0.13 0.09 0.13 0.07 0.08
0.09 0.07 0.12 0.13 0.1 0.11 0.08 0.06 0.08 0.07 0.09 0.08 0.08 0.1
0.07 0.12 0.07 0.04 0.05 0.05 0.07 0.05 0.05 0.05 0.09 0.07 0.08 0.06
0.02 0.03 0.04 0.02 0.05 0.03 0.05 0.07 0.05 0.07 0.03 0.02 0.02 0.02
0.03 0.05 0.04 0.05 0.02 0.02 0.03 0.02 0.05 0.05 0.02 0.03 0.04 0.04
0.04 0.04 0.12 0.15 0.16 0.21 0.31 0.3 0.26 0.2 0.21 0.24 0.11 0.17
0.05]
[0.02 0.04 0.02 0.06 0.03 0.07 0.06 0.06 0.05 0.05 0.07 0.06 0.08 0.09
0.07 0.06 0.05 0.05 0.07 0.08 0.04 0.06 0.03 0.02 0.07 0.05 0.05 0.06
0.06 0.05 0.05 0.1 0.09 0.09 0.07 0.08 0.08 0.06 0.06 0.1 0.06 0.04
0.02 0.05 0.02 0.04 0.05 0.09 0.08 0.06 0.07 0.1 0.11 0.11 0.08 0.09
0.06 0.12 0.09 0.09 0.06 0.07 0.06 0.07 0.05 0.04 0.09 0.08 0.12 0.11
0.13 0.12 0.09 0.1 0.12 0.09 0.11 0.1 0.1 0.09 0.05 0.04 0.06 0.05
0.14 0.13 0.13 0.13 0.12 0.13 0.13 0.14 0.09 0.11 0.09 0.09 0.07 0.1
0.04 0.07 0.16 0.14 0.17 0.15 0.09 0.15 0.14 0.12 0.13 0.1 0.1 0.09
0.09 0.05 0.17 0.16 0.12 0.13 0.12 0.13 0.09 0.11 0.07 0.1 0.08 0.07
0.17 0.15 0.13 0.16 0.09 0.1 0.08 0.07 0.11 0.09 0.13 0.12 0.08 0.1
0.07 0.1 0.05 0.08 0.06 0.08 0.07 0.02 0.04 0.07 0.04 0.02 0.04 0.06
0.05 0.07 0.03 0.09 0.1 0.12 0.08 0.11 0.11 0.16 0.18 0.16 0.24 0.22
0.06]
[0.01 0.02 0.03 0.02 0.05 0.06 0.03 0.05 0.03 0.03 0.05 0.07 0.04 0.05
0.08 0.1 0.09 0.1 0.07 0.12 0.09 0.11 0.1 0.1 0.03 0.05 0.03 0.04
0.04 0.04 0.04 0.04 0.03 0.06 0.07 0.06 0.05 0.1 0.1 0.12 0.1 0.14
0.09 0.1 0.11 0.1 0.04 0.03 0.03 0.06 0.03 0.03 0.06 0.05 0.06 0.07
0.04 0.08 0.11 0.1 0.1 0.11 0.11 0.12 0.11 0.15 0.03 0.09 0.06 0.07
0.06 0.03 0.07 0.07 0.08 0.08 0.12 0.1 0.1 0.09 0.1 0.11 0.12 0.14
0.06 0.06 0.05 0.07 0.07 0.05 0.07 0.06 0.05 0.1 0.07 0.11 0.11 0.16
0.11 0.12 0.07 0.06 0.05 0.08 0.13 0.12 0.1 0.1 0.08 0.13 0.15 0.11
0.09 0.13 0.1 0.14 0.08 0.11 0.11 0.11 0.1 0.11 0.11 0.13 0.12 0.12
0.1 0.16 0.13 0.19 0.12 0.14 0.15 0.13 0.1 0.1 0.18 0.18 0.12 0.16
0.12 0.14 0.16 0.13 0.14 0.17 0.13 0.14 0.11 0.1 0.11 0.14 0.08 0.08
0.08 0.05 0.01 0.02 0.04 0.06 0.09 0.04 0.09 0.07 0.09 0.13 0.24 0.2
0.34]
[0.04 0.01 0.05 0.03 0.04 0.03 0.03 0.04 0.04 0.03 0.05 0.02 0.04 0.04
0.08 0.03 0.04 0.06 0.06 0.04 0.08 0.1 0.1 0.11 0.04 0.04 0.05 0.05
0.02 0.04 0.04 0.02 0.06 0.06 0.06 0.05 0.04 0.06 0.06 0.05 0.07 0.08
0.11 0.09 0.13 0.11 0.04 0.06 0.03 0.03 0.08 0.05 0.04 0.05 0.03 0.03
0.05 0.06 0.07 0.08 0.08 0.08 0.11 0.11 0.08 0.08 0.04 0.03 0.04 0.04
0.04 0.04 0.05 0.06 0.05 0.04 0.08 0.06 0.1 0.09 0.1 0.06 0.12 0.11
0.05 0.03 0.05 0.08 0.06 0.07 0.05 0.06 0.07 0.06 0.06 0.08 0.1 0.08
0.14 0.1 0.04 0.05 0.04 0.08 0.06 0.06 0.08 0.08 0.1 0.11 0.07 0.07
0.1 0.14 0.06 0.07 0.06 0.06 0.08 0.08 0.1 0.12 0.1 0.08 0.15 0.18
0.08 0.07 0.07 0.1 0.11 0.12 0.14 0.13 0.12 0.17 0.08 0.13 0.15 0.13
0.15 0.12 0.16 0.2 0.18 0.11 0.17 0.18 0.15 0.21 0.23 0.25 0.26 0.23
0.27 0.25 0.03 0.02 0.04 0.03 0.04 0.05 0.05 0.05 0.07 0.06 0.1 0.09
0.27]
[0.12 0.1 0.18 0.1 0.1 0.06 0.14 0.08 0.13 0.11 0.11 0.08 0.1 0.07
0.17 0.1 0.14 0.07 0.14 0.05 0.11 0.06 0.13 0.1 0.13 0.1 0.17 0.07
0.16 0.07 0.15 0.08 0.13 0.06 0.12 0.09 0.11 0.09 0.16 0.07 0.15 0.08
0.12 0.08 0.13 0.12 0.18 0.1 0.16 0.11 0.12 0.05 0.12 0.09 0.14 0.08
0.12 0.09 0.12 0.07 0.12 0.08 0.12 0.09 0.12 0.09 0.14 0.08 0.12 0.11
0.14 0.12 0.12 0.08 0.12 0.09 0.09 0.07 0.11 0.1 0.11 0.12 0.08 0.08
0.14 0.13 0.13 0.1 0.15 0.09 0.15 0.12 0.13 0.07 0.11 0.05 0.17 0.07
0.11 0.09 0.18 0.12 0.13 0.11 0.15 0.08 0.17 0.09 0.14 0.1 0.11 0.11
0.14 0.08 0.15 0.11 0.15 0.11 0.17 0.12 0.15 0.1 0.12 0.09 0.1 0.09
0.16 0.12 0.19 0.11 0.19 0.11 0.19 0.14 0.16 0.1 0.15 0.13 0.2 0.16
0.17 0.11 0.16 0.11 0.15 0.15 0.18 0.12 0.16 0.12 0.12 0.12 0.16 0.12
0.19 0.12 0.17 0.18 0.18 0.15 0.17 0.15 0.13 0.12 0.14 0.15 0.19 0.14
0.22]]
There is nothing wrong with the code as such. But: the standard and best-known k-means algorithm is Lloyd's algorithm (this also seems to be the version presented in the question). It assigns points to clusters using the Euclidean distance, more or less by definition, and this distance measure cannot simply be replaced by another.
See for example the compute_centroids() function. Here new centroids are calculated as the mean of the cluster points (hence k-means). This implicitly assumes that the new centroid is a better representative of the cluster than the previous one, because the mean minimises the overall squared Euclidean distance from the points of the cluster to the centroid. This needn't hold once the distance function is changed.
There are other variants, for example k-medians, which uses the component-wise median and pairs naturally with the Manhattan distance (see the sketch below).
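Not part of the original code, just to illustrate how the assignment step and the update step have to match: a minimal sketch of a k-medians-style pair of functions (Manhattan/L1 distance for the assignment, component-wise median for the update), assuming the same d-by-n layout of X as in the question:

import numpy as np

def points_to_clusters_manhattan(X, centroids):
    # assign each column of X to the centroid with the smallest Manhattan (L1) distance
    dists = np.abs(X[:, :, None] - centroids[:, None, :]).sum(axis=0)  # shape (n, k)
    return dists.argmin(axis=1)

def compute_medians(X, k, assignments, old_centroids):
    # the component-wise median minimises the summed L1 distance within a cluster,
    # so it is the update step that matches the L1 assignment above
    centroids = old_centroids.copy()
    for j in range(k):
        members = X[:, assignments == j]
        if members.shape[1] > 0:
            centroids[:, j] = np.median(members, axis=1)
    return centroids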
See this question for a far more detailed discussion.
Quick addition to the algorithm to print out the cluster assignment distribution during each iteration:
# imports ...
np.set_printoptions(precision=1)
# ...

def k_means_algorithm(X, k, random_init, max_iters=1000, return_obj=False):
    # ...
    while not np.array_equal(centroids, new_centroids) and counter < max_iters:
        # ...
        hist, bin_edges = np.histogram(cluster_assignments.flatten(), bins=k)
        hist = 100 * hist / hist.sum()
        print(hist)  # percentage of values assigned to each cluster
    return centroids, cluster_assignments
# ...
Caution: this only gives a rough impression of how values are assigned to clusters and might level out for the full-scale data set. Still, it works nicely for the sample data provided with the question.
While the distribution converges more or less steadily to the final solution when cur_dist = np.linalg.norm(X[:, i] - centroids[:, j]) is used,
it 'jumps' around for cur_dist = wasserstein_distance(X[:, i], centroids[:, j]), for the reason given above: the mean-based centroids do not represent the centres of the clusters under the EMD.
I have list number 1:
['Limitation', 'Parameter', 'input', 'Feature', 'Dataset', 'Output', 'EvaluationMetric', 'Algorithm', 'Task', 'HyperParameter', 'Layer', 'Model', 'Operator', 'Function', 'OptimizationAlgorithm', 'ActivationFunction', 'LeakyReluFunction', 'LossFunction']
and list number 2:
['Input', 'Dataset', 'Algorithm', 'Operator', 'Task', 'HyperParameter', 'Output']
And I have these values that describe the similarity between these words. How could I make a data frame that has the first list's strings as rows, the second list's strings as columns, and these values in the cells?
0.4
0.75
0.75
0.65
0.65
0.050000000000000044
0.25
0.25
0.5
0.6
0.75
0.7
0.75
1.0
0.75
0.4
0.6
0.75
0.5
0.75
0.7
0.65
0.6
0.09999999999999998
0.25
0.30000000000000004
0.44999999999999996
0.55
0.6
0.6
0.55
0.6
0.6
0.35
1.0
0.55
0.4
0.6
0.6
0.65
0.6
0.4
0.25
0.25
0.44999999999999996
0.65
0.7
0.65
0.75
0.65
0.7
0.4
0.65
0.65
0.6
0.7
0.6
1.0
0.65
0.25
0.25
0.30000000000000004
0.5
0.55
0.6
0.75
0.7
0.75
0.7
0.25
0.55
1.0
0.35
0.8
0.75
0.65
0.6
0.0
0.15000000000000002
0.19999999999999996
0.44999999999999996
0.35
0.75
0.4
0.44999999999999996
0.5
0.35
0.30000000000000004
0.4
0.35
1.0
0.44999999999999996
0.35
0.6
0.30000000000000004
0.050000000000000044
0.15000000000000002
0.25
0.30000000000000004
0.55
0.6
0.85
0.7
0.75
1.0
0.35
0.6
0.7
0.35
0.7
0.7
0.7
0.7
0.09999999999999998
0.25
0.25
0.5
Your data has 117 values, but an 18x7 matrix (rows x columns) requires 126 values. Filling the missing cells with NaN, you could do:
import numpy as np
import pandas as pd

your_data = '0.4 0.75 0.75 0.65 0.65 0.050000000000000044 0.25 0.25 0.5 0.6 0.75 0.7 0.75 1.0 0.75 0.4 0.6 0.75 0.5 0.75 0.7 0.65 0.6 0.09999999999999998 0.25 0.30000000000000004 0.44999999999999996 0.55 0.6 0.6 0.55 0.6 0.6 0.35 1.0 0.55 0.4 0.6 0.6 0.65 0.6 0.4 0.25 0.25 0.44999999999999996 0.65 0.7 0.65 0.75 0.65 0.7 0.4 0.65 0.65 0.6 0.7 0.6 1.0 0.65 0.25 0.25 0.30000000000000004 0.5 0.55 0.6 0.75 0.7 0.75 0.7 0.25 0.55 1.0 0.35 0.8 0.75 0.65 0.6 0.0 0.15000000000000002 0.19999999999999996 0.44999999999999996 0.35 0.75 0.4 0.44999999999999996 0.5 0.35 0.30000000000000004 0.4 0.35 1.0 0.44999999999999996 0.35 0.6 0.30000000000000004 0.050000000000000044 0.15000000000000002 0.25 0.30000000000000004 0.55 0.6 0.85 0.7 0.75 1.0 0.35 0.6 0.7 0.35 0.7 0.7 0.7 0.7 0.09999999999999998 0.25 0.25 0.5'

values = [float(v) for v in your_data.split(' ')]
# pad with NaN so that the 117 values fill the 18 x 7 grid
for i in range(9):
    values.append(np.nan)

data_matrix = np.split(np.array(values), 18)

list1 = ['Limitation', 'Parameter', 'input', 'Feature', 'Dataset', 'Output', 'EvaluationMetric', 'Algorithm', 'Task', 'HyperParameter', 'Layer', 'Model', 'Operator', 'Function', 'OptimizationAlgorithm', 'ActivationFunction', 'LeakyReluFunction', 'LossFunction']
list2 = ['Input', 'Dataset', 'Algorithm', 'Operator', 'Task', 'HyperParameter', 'Output']

df = pd.DataFrame(index=list1, columns=list2, data=data_matrix)
print(df)
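An equivalent, slightly more compact sketch using padding and reshape (reusing values, list1 and list2 from above, and assuming the values are given row by row):

arr = np.pad(np.array(values[:117], dtype=float), (0, 9), constant_values=np.nan)
df = pd.DataFrame(arr.reshape(18, 7), index=list1, columns=list2)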
How would you drop / reset the column axis so that the data shifts down and the column headers become something like [0, 1, 2, 3, 4, 5], and then set the column headers to the values of df[5]? I reset the row index all the time but have never needed to do it for columns.
df = pd.DataFrame({'very_low': ['High', 'Low', 'Middle', 'Low'],
                   '0.2': [0.10000000000000001, 0.050000000000000003, 0.14999999999999999, 0.080000000000000002],
                   '0.1': [0.080000000000000002, 0.059999999999999998, 0.10000000000000001, 0.080000000000000002],
                   '0.4': [0.90000000000000002, 0.33000000000000002, 0.29999999999999999, 0.23999999999999999],
                   '0': [0.080000000000000002, 0.059999999999999998, 0.10000000000000001, 0.080000000000000002],
                   '0.3': [0.23999999999999999, 0.25, 0.65000000000000002, 0.97999999999999998]})
0 0.1 0.2 0.3 0.4 very_low
0 0.08 0.08 0.10 0.24 0.90 High
1 0.06 0.06 0.05 0.25 0.33 Low
2 0.10 0.10 0.15 0.65 0.30 Middle
3 0.08 0.08 0.08 0.98 0.24 Low
If I understood you correctly, something like this?
df2 = pd.concat([pd.DataFrame(df.columns).T, pd.DataFrame(df.values)],
                ignore_index=True).iloc[:, :-1]
df2.columns = [df.columns[-1]] + df.iloc[:, -1].tolist()
>>> df2
very_low High Low Middle Low
0 0 0.1 0.2 0.3 0.4
1 0.08 0.08 0.1 0.24 0.9
2 0.06 0.06 0.05 0.25 0.33
3 0.1 0.1 0.15 0.65 0.3
4 0.08 0.08 0.08 0.98 0.24
I think this is what you want:
tdf = df.T
tdf.columns = tdf.iloc[5]
tdf.drop(tdf.tail(1).index,inplace=True)
>>> tdf
very_low High Low Middle Low
0 0.08 0.06 0.1 0.08
0.1 0.08 0.06 0.1 0.08
0.2 0.1 0.05 0.15 0.08
0.3 0.24 0.25 0.65 0.98
0.4 0.9 0.33 0.3 0.24
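A shorter equivalent, assuming very_low is always the column whose values should become the headers, is to make it the index before transposing:

tdf = df.set_index('very_low').T   # rows become the old column labels, columns the very_low values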
I would like to transform a Python Pandas DataFrame df like:
x y result
id
1 -0.8 -1 0.64
2 -0.8 0 -0.36
3 -0.4 -1 0.16
4 -0.4 0 -0.84
5 0.0 -1 0.00
6 0.0 0 -1.00
7 0.4 -1 0.16
8 0.4 0 -0.84
9 0.8 -1 0.64
10 0.8 0 -0.36
to a DataFrame like this:
-1 0
-0.8 0.64 -0.36
-0.4 0.16 -0.84
0.0 0 -1.00
0.4 0.16 -0.84
0.8 0.64 -0.36
I know how to get unique x values:
df["x"].unique()
and unique y values with:
df["y"].unique()
but I don't know how to "distribute" the result column values inside the new DataFrame.
I would prefer a vectorized solution in order to avoid for loops.
That is a pivot operation; you can either use .pivot_table:
>>> df.pivot_table(values='result', index='x', columns='y')
y -1 0
x
-0.8 0.64 -0.36
-0.4 0.16 -0.84
0.0 0.00 -1.00
0.4 0.16 -0.84
0.8 0.64 -0.36
or .pivot:
>>> df.pivot(index='x', columns='y')['result']
y -1 0
x
-0.8 0.64 -0.36
-0.4 0.16 -0.84
0.0 0.00 -1.00
0.4 0.16 -0.84
0.8 0.64 -0.36
or .groupby followed by .unstack:
>>> df.groupby(['x', 'y'])['result'].aggregate('first').unstack()
y -1 0
x
-0.8 0.64 -0.36
-0.4 0.16 -0.84
0.0 0.00 -1.00
0.4 0.16 -0.84
0.8 0.64 -0.36
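Note: .pivot raises an error if an (x, y) combination occurs more than once, .pivot_table aggregates duplicates (mean by default), and the .groupby variant keeps whatever aggregate you ask for ('first' here); with strictly unique (x, y) pairs, all three give the same result.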
Hi, I want to plot multiple (x, y) coordinates in a single graph. Say I have a data file with contents like the following:
x y
0.0 0.5
0.12 0.1
0.16 0.4
0.2 0.35
0.31 0.8
0.34 0.6
0.38 1.0
0.46 0.2
0.51 0.7
0.7 0.9
Could I have some more data in this file, like
x y x1 y1
0.0 0.5 0.04 0.7
0.12 0.1 0.08 0.74
0.16 0.4 0.12 0.85
0.2 0.35 0.16 0.9
0.31 0.8 0.2 0.53
0.34 0.6 0.24 0.31
0.38 1.0 0.28 0.87
0.46 0.2 0.32 0.20
0.51 0.7 0.36 0.45
0.7 0.9 0.4 0.64
and plot the graph in gnuplot so that (x, y) and (x1, y1) would all be on a single curve? Thank you.
gnuplot can only plot column-format data as far as I know, so you will have to transpose your data as follows:
x 0.000000 y 0.500000 x 0.120000 y 0.100000 ...
x1 0.040000 y1 0.700000 x1 0.080000 y1 0.740000 ...
and plot data us 1:2, data us 3:4, data us 5:6.
To transpose the data, you can either change your program to write it in this way, or use the following awk script:
awk '{for (i=1;i<=NF;i++) arr[NR,i]=$i;} END{for (i=1;i<=NF;i=i+2) {for (j=1;j<=NR;j++) {printf "%f %f ",arr[j,i],arr[j,i+1]} print ""}}' datafile