Translate array into x and y direction - Python - python

We have the following two-dimensional array with x and y coordinates:
x = np.array([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]])
We flatten it: x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]) )
and our goal is to apply translations into x direction, y direction.
We are dealing with a 4x4 array (lattice), and the first transformation is 1 shift into x direction :
so from '[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]' we get '[1, 2, 3, 0, 5, 6, 7, 4, 9, 10, 11, 8, 13, 14, 15, 12]'.
The next transformation is two shifts in x:
from '[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]' we get '[2, 3, 0, 1, 6, 7, 4, 5, 10, 11, 8, 9, 14, 15, 12, 13]'.
We want to get this (flattened) array:
y = np.array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
[1, 2, 3, 0, 5, 6, 7, 4, 9, 10, 11, 8, 13, 14, 15, 12],
[2, 3, 0, 1, 6, 7, 4, 5, 10, 11, 8, 9, 14, 15, 12, 13],
[3, 0, 1, 2, 7, 4, 5, 6, 11, 8, 9, 10, 15, 12, 13, 14],
[4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3],
[5, 6, 7, 4, 9, 10, 11, 8, 13, 14, 15, 12, 1, 2, 3, 0],
[6, 7, 4, 5, 10, 11, 8, 9, 14, 15, 12, 13, 2, 3, 0, 1],
[7, 4, 5, 6, 11, 8, 9, 10, 15, 12, 13, 14, 3, 0, 1, 2],
[8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7],
[9, 10, 11, 8, 13, 14, 15, 12, 1, 2, 3, 0, 5, 6, 7, 4],
[10, 11, 8, 9, 14, 15, 12, 13, 2, 3, 0, 1, 6, 7, 4, 5],
[11, 8, 9, 10, 15, 12, 13, 14, 3, 0, 1, 2, 7, 4, 5, 6],
[12, 13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
[13, 14, 15, 12, 1, 2, 3, 0, 5, 6, 7, 4, 9, 10, 11, 8],
[14, 15, 12, 13, 2, 3, 0, 1, 6, 7, 4, 5, 10, 11, 8, 9],
[15, 12, 13, 14, 3, 0, 1, 2, 7, 4, 5, 6, 11, 8, 9, 10]])
I tried using:
y = np.roll(np.roll(x, -1), -1)

Can concatenate two vstack operations. First, roll in axis=1 and then, roll in axis=0.
np.vstack([np.roll(np.roll(arr, -i, axis=0), -x, axis=1).flatten() \
for x in range(arr.shape[0])] \
for i in range(arr.shape[1]))
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
[ 1, 2, 3, 0, 5, 6, 7, 4, 9, 10, 11, 8, 13, 14, 15, 12],
[ 2, 3, 0, 1, 6, 7, 4, 5, 10, 11, 8, 9, 14, 15, 12, 13],
[ 3, 0, 1, 2, 7, 4, 5, 6, 11, 8, 9, 10, 15, 12, 13, 14],
[ 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3],
[ 5, 6, 7, 4, 9, 10, 11, 8, 13, 14, 15, 12, 1, 2, 3, 0],
[ 6, 7, 4, 5, 10, 11, 8, 9, 14, 15, 12, 13, 2, 3, 0, 1],
[ 7, 4, 5, 6, 11, 8, 9, 10, 15, 12, 13, 14, 3, 0, 1, 2],
[ 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7],
[ 9, 10, 11, 8, 13, 14, 15, 12, 1, 2, 3, 0, 5, 6, 7, 4],
[10, 11, 8, 9, 14, 15, 12, 13, 2, 3, 0, 1, 6, 7, 4, 5],
[11, 8, 9, 10, 15, 12, 13, 14, 3, 0, 1, 2, 7, 4, 5, 6],
[12, 13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
[13, 14, 15, 12, 1, 2, 3, 0, 5, 6, 7, 4, 9, 10, 11, 8],
[14, 15, 12, 13, 2, 3, 0, 1, 6, 7, 4, 5, 10, 11, 8, 9],
[15, 12, 13, 14, 3, 0, 1, 2, 7, 4, 5, 6, 11, 8, 9, 10]])

Related

np.argsort(np.zeros(17)) gives strange order

As the title suggests, the np.argsort(np.zeros(n)) can gives right result if n<=16, otherwise, it will give a strange result, e.g.,
np.argsort(np.zeros(17))
# array([ 0, 14, 13, 12, 11, 10, 9, 15, 8, 6, 5, 4, 3, 2, 1, 7, 16], dtype=int64)
np.argsort(np.zeros(18))
# array([ 0, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 16, 17], dtype=int64)
Can anyone explain? Thanks!

Plotting different clusters markers for every class in scatter plot

I have a scatter plot where i am plotting 14 clusters, but each 2 clusters belong to the same class, they are all using the same markers. Every 50 rows is a cluster and every 100 rows is two clusters of the same class. What i want to do is change the markers for every 2 clusters or 100 rows.
Link for the Data Frame
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.pyplot import figure
y = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10,
10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11,
11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
13, 13, 13, 13]
X_lda = pd.read_pickle('lda_values')
X_lda = np.asarray(X_lda)
markers = ['x', 'o', '1', '.', '2', '>', 'D']
color=['b','r']
X_lda_colors= [ color[i] for i in list(np.array(y)%2) ]
X_lda_markers = [markers[i] for i in list(np.array(y)%2)]
plt.xlabel('1-eigenvector')
plt.ylabel('2-eigenvector')
plt.scatter(
X_lda[:,0],
X_lda[:,1],
marker = X_lda_markers,
c=X_lda_colors,
cmap='rainbow',
alpha=0.7,
)
This is the minimal reproducible example i could get from a large code with large data.
This is the actual plot:
This is what i am trying to achieve.
The error i get:
ValueError: Unrecognized marker style ['x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x'...]'
Please try this:
import numpy as np
from matplotlib import pyplot as plt
X_lda=np.array([[1,2],[1,1],[3,3],[4,4],[2,4],[3,5],[3,4],[3,2]]) # suppose you want to plot X
y=[0,1,1,1,2,3,4,4] # the cluster of each sample in X_lda
color=['b','r']
markers = ['x', 'o', '1', '.', '2', '>', 'D'] # marker
X_lda_colors= [ color[i] for i in list(np.array(y)%2) ]
X_lda_markers= [ markers[i] for i in list(np.array(y)%2) ]
plt.xlabel('1-eigenvector')
plt.ylabel('2-eigenvector')
for i in range(X_lda.shape[0]):
plt.scatter( X_lda[i,0], X_lda[i,1], c=X_lda_colors[i],
marker=X_lda_markers[i], cmap='rainbow', alpha=0.7, edgecolors='w')
plt.show()

What should be the terminating condition for this implementation of the 15-puzzle problem?

I am trying to implement a solution for outputting the sequence of moves for a 15-puzzle problem in Python. This is part of an optional assignment for a MOOC. The problem statement is given at this link.
I have a version of the program (given below) which performs valid transitions.
I am first identifying the neighbors of the empty cell (represented by 0) and putting them in a list. Then, I am randomly choosing one of the neighbors from the list to perform swaps with the empty cell. All the swaps are accumulated in a different list to record the sequence of moves to solve the puzzle. This is then outputted at the end of the program.
However, the random selection of numbers to make the swap with the empty cell is just going on forever. To avoid "infinite" (very long run) of loops, I have limited the number of swaps to 30 for now.
from random import randint
def find_idx_of_empty_cell(p):
for i in range(len(p)):
if p[i] == 0:
return i
def pick_random_neighbour_idx(neighbours_idx_list):
rand_i = randint(0, len(neighbours_idx_list)-1)
return neighbours_idx_list[rand_i]
def perform__neighbour_transposition(p, tar_idx, src_idx):
temp = p[tar_idx]
p[tar_idx] = p[src_idx]
p[src_idx] = temp
def solve_15_puzzle(p):
standard_perm = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,0]
neighbours_idx_list = []
moves_sequence = []
empty_cell_idx = find_idx_of_empty_cell(p)
previous_empty_cell_idx = empty_cell_idx
while (not(p == standard_perm) and len(moves_sequence) < 30):
if not (empty_cell_idx in [0,4,8,12]):
neighbours_idx_list.append(empty_cell_idx - 1)
if not (empty_cell_idx in [3,7,11,15]):
neighbours_idx_list.append(empty_cell_idx + 1)
if not (empty_cell_idx in [0,1,2,3]):
neighbours_idx_list.append(empty_cell_idx - 4)
if not (empty_cell_idx in [12,13,14,15]):
neighbours_idx_list.append(empty_cell_idx + 4)
if previous_empty_cell_idx in neighbours_idx_list:
neighbours_idx_list.remove(previous_empty_cell_idx)
chosen_neighbour_idx = pick_random_neighbour_idx(neighbours_idx_list)
moves_sequence.append(p[chosen_neighbour_idx])
perform__neighbour_transposition(p, empty_cell_idx, chosen_neighbour_idx)
previous_empty_cell_idx = empty_cell_idx
empty_cell_idx = chosen_neighbour_idx
neighbours_idx_list = []
if (p == standard_perm):
print("Solution: ", moves_sequence)
For the below invocation of the method, the expected output is [15, 14, 10, 13, 9, 10, 14, 15].
solve_15_puzzle([1, 2, 3, 4, 5, 6, 7, 8, 13, 9, 11, 12, 10, 14, 15, 0])
The 15-tiles problem is harder as it may seem at a first sight.
Computing the best (shortest) solution is a difficult problem and it has been proved than finding the optimal solution as N increases is NP-hard.
Finding a (non-optimal) solution is much easier. A very simple algorithm that can be made to work for example is:
Define a "distance" of the current position as the sum of the manhattan
distances of every tile from the position you want it to be
Start from the given position and make some random moves
If the distance after the moves improves or stays the same then keep the changes, otherwise undo them and return to the starting point.
This kind of algorithm could be described as a multi-step stochastic hill-climbing approach and is able to solve the 15 puzzle (just make sure to allow enough random moves to be able to escape a local minimum).
Python is probably not the best language to attack this problem, but if you use PyPy implementation you can get solutions in reasonable time.
My implementation finds a solution for a puzzle that has been mixed up with 1000 random moves in seconds, for example:
(1, 5, 43, [9, [4, 10, 14, 11, 15, 3, 8, 1, 13, None, 9, 7, 12, 2, 5, 6]])
(4, 17, 41, [9, [4, 10, 14, 11, 15, 3, 8, 1, 12, None, 6, 2, 5, 13, 9, 7]])
(7, 19, 39, [11, [4, 10, 14, 11, 15, 3, 1, 2, 12, 6, 8, None, 5, 13, 9, 7]])
(9, 54, 36, [5, [4, 14, 3, 11, 15, None, 10, 2, 12, 6, 1, 8, 5, 13, 9, 7]])
(11, 60, 34, [10, [4, 14, 3, 11, 15, 10, 1, 2, 12, 6, None, 8, 5, 13, 9, 7]])
(12, 93, 33, [14, [4, 14, 11, 2, 15, 10, 3, 8, 12, 6, 1, 7, 5, 13, None, 9]])
(38, 123, 31, [11, [4, 14, 11, 2, 6, 10, 3, 8, 15, 12, 1, None, 5, 13, 9, 7]])
(40, 126, 30, [13, [15, 6, 4, 2, 12, 10, 11, 3, 5, 14, 1, 8, 13, None, 9, 7]])
(44, 172, 28, [10, [15, 4, 2, 3, 12, 6, 11, 8, 5, 10, None, 14, 13, 9, 1, 7]])
(48, 199, 23, [11, [15, 6, 4, 3, 5, 12, 2, 8, 13, 10, 11, None, 9, 1, 7, 14]])
(61, 232, 22, [0, [None, 15, 4, 3, 5, 6, 2, 8, 1, 12, 10, 14, 13, 9, 11, 7]])
(80, 276, 20, [10, [5, 15, 4, 3, 1, 6, 2, 8, 13, 10, None, 7, 9, 12, 14, 11]])
(105, 291, 19, [4, [9, 1, 2, 4, None, 6, 8, 7, 5, 15, 3, 11, 13, 12, 14, 10]])
(112, 313, 17, [9, [1, 6, 2, 4, 9, 8, 3, 7, 5, None, 14, 11, 13, 15, 12, 10]])
(113, 328, 16, [15, [1, 6, 2, 4, 9, 8, 3, 7, 5, 15, 11, 10, 13, 12, 14, None]])
(136, 359, 15, [4, [1, 6, 2, 4, None, 8, 3, 7, 9, 5, 11, 10, 13, 15, 12, 14]])
(141, 374, 12, [15, [1, 2, 3, 4, 8, 6, 7, 10, 9, 5, 12, 11, 13, 15, 14, None]])
(1311, 385, 11, [14, [1, 2, 3, 4, 8, 5, 7, 10, 9, 6, 11, 12, 13, 15, None, 14]])
(1329, 400, 10, [13, [1, 2, 3, 4, 6, 8, 7, 10, 9, 5, 11, 12, 13, None, 15, 14]])
(1602, 431, 9, [4, [1, 2, 3, 4, None, 6, 8, 7, 9, 5, 11, 10, 13, 15, 14, 12]])
(1707, 446, 8, [5, [1, 2, 3, 4, 6, None, 7, 8, 9, 5, 15, 12, 13, 10, 14, 11]])
(1711, 475, 7, [12, [1, 2, 3, 4, 6, 5, 7, 8, 9, 10, 15, 12, None, 13, 14, 11]])
(1747, 502, 6, [8, [1, 2, 3, 4, 6, 5, 7, 8, None, 9, 10, 12, 13, 14, 15, 11]])
(1824, 519, 5, [14, [1, 2, 3, 4, 9, 6, 7, 8, 5, 10, 15, 12, 13, 14, None, 11]])
(1871, 540, 4, [10, [1, 2, 3, 4, 9, 6, 7, 8, 5, 10, None, 12, 13, 14, 15, 11]])
(28203, 555, 3, [9, [1, 2, 3, 4, 5, 6, 7, 8, 9, None, 10, 12, 13, 14, 11, 15]])
(28399, 560, 2, [10, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, None, 12, 13, 14, 11, 15]])
(28425, 581, 1, [11, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, None, 13, 14, 15, 12]])
(28483, 582, 0, [15, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, None]])
The last line means that after 24,483 experiments it found the target position after 582 moves. Note that 582 is for sure very far from optimal as it's known that no position in the classic version of the 15 puzzle requires more than 80 moves.
The number after the number of moves is the "manhattan distance", for example the fourth-last row is the position:
where the sum of manhattan distances from the solution is 3.

Using key values from one list to group corresponding items in other lists [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
I have 3 lists:
idd = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
row = [15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 7, 7]
col = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2]
what I'm trying to do is to create a dictionary where the keys are the grouped items of idd and the values are a list of list of row and col.
The final output should be something like:
dd = {1:[[15,15,15,15.....][1,2,3,4,....]], 2:[[13,13,13,13,....][6,7,8,9,...]]....}
Thanks!
Give this a shot:
from collections import defaultdict
grouped = defaultdict(lambda: [[], []])
for i, r, c in zip(idd, row, col):
grouped[i][0].append(r)
grouped[i][1].append(c)

SKlearn Random Forest error on input

I am trying to run the fit for my random forest, but I am getting the following error:
forest.fit(train[features], y)
returns
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-41-603415b5d9e6> in <module>()
----> 1 forest.fit(train[rubio_top_corr], y)
/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.pyc in fit(self, X, y, sample_weight)
210 """
211 # Validate or convert input data
--> 212 X = check_array(X, dtype=DTYPE, accept_sparse="csc")
213 if issparse(X):
214 # Pre-sort indices to avoid that each individual tree of the
/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
396 % (array.ndim, estimator_name))
397 if force_all_finite:
--> 398 _assert_all_finite(array)
399
400 shape_repr = _shape_repr(array.shape)
/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.pyc in _assert_all_finite(X)
52 and not np.isfinite(X).all()):
53 raise ValueError("Input contains NaN, infinity"
---> 54 " or a value too large for %r." % X.dtype)
55
56
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
I have coerced my dataframe from float64 to float32 for my features and made sure that there's no nulls so not sure what's throwing this error. Let me know if it would be helpful to put in more of my code.
UPDATE
It's originally a pandas dataframe, which I dropped all NaNs. The original dataframe is survey results with respondent information, where I dropped all questions except for my dv. I double checked this by running rforest_df.isnull().sum() which returned 0. This is the full code I used for the modeling.
rforest_df = qfav3_only
rforest_df[features] = rforest_df[features].astype(np.float32)
rforest_df['is_train'] = np.random.uniform(0, 1, len(rforest_df)) <= .75
train, test = rforest_df[rforest_df['is_train']==True], rforest_df[rforest_df['is_train']==False]
forest = RFC(n_jobs=2,n_estimators=50)
y, _ = pd.factorize(train['K6_QFAV3'])
forest.fit(train[features], y)
Update
This is what y data looks like
array([ 0, 1, 2, 3, 4, 3, 3, 5, 6, 7, 8, 7, 9, 6, 10, 6, 11,
7, 11, 3, 7, 9, 6, 5, 9, 11, 12, 13, 6, 11, 3, 3, 6, 14,
15, 0, 9, 9, 2, 0, 11, 3, 9, 4, 9, 7, 3, 4, 9, 12, 9,
7, 6, 13, 6, 0, 0, 16, 6, 11, 4, 10, 11, 11, 17, 3, 6, 16,
3, 4, 18, 19, 7, 11, 5, 11, 5, 4, 0, 6, 17, 7, 2, 3, 5,
11, 8, 9, 18, 6, 9, 8, 5, 16, 20, 0, 4, 8, 13, 16, 3, 20,
0, 5, 4, 2, 11, 0, 3, 0, 6, 6, 6, 9, 4, 6, 5, 11, 0,
13, 6, 2, 11, 7, 5, 6, 18, 12, 21, 17, 3, 6, 0, 13, 21, 7,
3, 2, 18, 22, 7, 3, 2, 6, 7, 8, 4, 0, 7, 12, 3, 7, 3,
2, 11, 19, 11, 6, 2, 9, 3, 7, 9, 9, 5, 6, 8, 0, 18, 11,
3, 12, 2, 6, 4, 7, 7, 11, 3, 6, 6, 0, 6, 12, 15, 3, 9,
3, 3, 0, 5, 9, 7, 9, 11, 7, 3, 20, 0, 7, 6, 6, 23, 15,
19, 0, 3, 6, 16, 13, 5, 6, 6, 3, 6, 11, 9, 0, 6, 23, 16,
4, 0, 6, 17, 11, 17, 11, 4, 3, 13, 3, 17, 16, 11, 7, 4, 24,
5, 2, 7, 7, 8, 3, 3, 11, 8, 7, 23, 7, 7, 11, 7, 11, 6,
15, 3, 25, 7, 4, 5, 3, 17, 20, 3, 26, 7, 9, 6, 6, 17, 20,
1, 0, 11, 9, 16, 20, 7, 7, 26, 3, 6, 20, 7, 2, 11, 7, 27,
9, 4, 26, 28, 8, 6, 9, 19, 7, 29, 3, 2, 26, 30, 6, 31, 6,
18, 3, 0, 18, 4, 7, 32, 0, 2, 8, 0, 5, 9, 4, 16, 6, 23,
0, 7, 0, 7, 9, 6, 8, 3, 7, 9, 3, 3, 12, 11, 8, 19, 20,
7, 3, 5, 11, 3, 11, 8, 4, 4, 6, 9, 4, 1, 3, 0, 9, 9,
6, 7, 8, 33, 8, 7, 9, 34, 11, 11, 6, 9, 9, 17, 8, 19, 0,
7, 4, 17, 6, 7, 0, 4, 12, 7, 6, 4, 16, 12, 9, 6, 6, 6,
6, 26, 13, 9, 7, 2, 7, 3, 11, 3, 6, 7, 19, 4, 8, 9, 13,
11, 15, 11, 4, 18, 7, 7, 7, 0, 5, 4, 6, 0, 3, 7, 4, 25,
18, 6, 19, 7, 9, 4, 20, 6, 3, 7, 4, 35, 15, 11, 2, 12, 0,
7, 32, 6, 18, 9, 9, 6, 2, 3, 19, 36, 32, 0, 7, 0, 9, 37,
3, 5, 6, 5, 34, 2, 6, 0, 7, 0, 7, 3, 7, 4, 18, 18, 7,
3, 7, 16, 9, 19, 13, 4, 16, 19, 3, 19, 38, 9, 4, 9, 8, 0,
17, 0, 2, 3, 5, 6, 5, 11, 11, 2, 9, 5, 33, 9, 5, 6, 20,
13, 3, 39, 13, 7, 0, 9, 0, 4, 6, 7, 16, 7, 0, 21, 5, 3,
18, 5, 20, 2, 2, 14, 6, 17, 11, 11, 16, 16, 9, 8, 11, 3, 23,
0, 11, 0, 6, 0, 0, 3, 16, 6, 7, 5, 9, 7, 13, 0, 20, 0,
25, 6, 16, 8, 4, 4, 2, 8, 7, 5, 40, 3, 8, 5, 12, 8, 9,
6, 6, 6, 6, 3, 7, 26, 4, 0, 13, 4, 3, 13, 12, 7, 7, 6,
7, 19, 15, 0, 33, 4, 5, 5, 20, 3, 11, 5, 4, 7, 9, 7, 11,
36, 9, 0, 6, 6, 11, 6, 4, 2, 5, 18, 8, 5, 5, 2, 25, 4,
41, 7, 7, 5, 7, 3, 36, 11, 6, 9, 0, 9, 0, 16, 42, 11, 11,
18, 9, 5, 36, 2, 9, 6, 3, 43, 9, 17, 13, 5, 9, 3, 4, 6,
44, 37, 0, 45, 2, 18, 8, 46, 2, 12, 9, 9, 3, 16, 6, 12, 9,
0, 11, 11, 0, 25, 8, 17, 4, 4, 3, 11, 3, 11, 6, 6, 9, 7,
23, 0, 2, 0, 3, 3, 4, 4, 9, 5, 11, 16, 7, 3, 18, 11, 7,
6, 6, 6, 5, 9, 6, 3, 9, 7, 17, 11, 4, 9, 2, 3, 0, 26,
9, 0, 20, 8, 9, 6, 11, 6, 6, 7, 26, 6, 6, 4, 19, 5, 41,
19, 18, 29, 6, 5, 13, 6, 11, 7, 7, 6, 8, 5, 0, 3, 13, 17,
6, 20, 11, 6, 9, 6, 2, 7, 11, 9, 20, 12, 7, 6, 8, 7, 4,
6, 2, 0, 7, 9, 26, 9, 16, 7, 4, 45, 7, 0, 23, 8, 4, 19,
4, 26, 11, 4, 4, 5, 7, 3, 0, 29, 12, 3, 4, 11, 4, 12, 8,
7, 5, 0, 47, 12, 0, 25, 6, 16, 20, 5, 8, 4, 4, 11, 12, 0,
6, 3, 11, 4, 3, 48, 3, 6, 7, 4, 7, 0, 3, 7, 3, 18, 6,
2, 9, 9, 11, 3, 9, 6, 18, 16, 6, 34, 2, 7, 4, 3, 45, 5,
0, 7, 2, 17, 17, 9, 18, 5, 6, 5, 15, 5, 7, 6, 9, 0, 7,
12, 17])
I would first recommend that you check the datatype for each column within your train[features] df in the following way:
print train[features].dtypes
If you see there are columns that are non-numeric, you can inspect these columns to ensure there are not any unexpected values (e.g. strings, NaNs, etc.) that would cause problems. If you do not mind dropping out non-numeric columns, you can simply select all the numeric columns with the following:
numeric_cols = X.select_dtypes(include=['float64','float32']).columns
You can also add in columns with int dtypes if you would like.
If you are encountering values that are too large or too small for the model to handle, it is a sign that scaling the data is a good idea. In sklearn, this can be accomplished as follows:
scaler = MinMaxScaler(feature_range=(0,1),copy=True).fit(train[features])
train[features] = scaler.transform(train[features])
Finally you should consider imputing missing values with sklearn's Imputer as well as filling NaNs with something like the following:
train[features].fillna(0, inplace=True)
It happens when you have null-strings (like '') in the dataset. Also try to print something like
pd.value_counts()
or even
sorted(list(set(...)))
or get min or max values for each column in the dataset in the loop.
Above example with MinMaxScaler may work, but scaled features do not work well in the RF.

Categories

Resources