one hot encode with pandas get_dummies missing values - python

I have a dataset in the form of a DataFrame, and each row has a label ranging from 1 to 5. I am one-hot encoding it with pd.get_dummies(). If my dataset contains all 5 labels, there is no problem. However, not all sets contain all 5 labels, so the encoding skips the missing value, which creates a problem for new datasets coming in. Can I set a range so that the encoder knows there should be 5 labels? Or would I have to append 1,2,3,4,5 to the end of the array before encoding and then delete the last 5 entries?
Correct encode: values 1-5 are encoded
arr = np.array([1,2,5,3,1,5,1,4])
df = pd.DataFrame(arr, columns = ['test'])
hotarr = np.array(pd.get_dummies(df['test']))
>>>[[1 0 0 0 0]
[0 1 0 0 0]
[0 0 0 0 1]
[0 0 1 0 0]
[1 0 0 0 0]
[0 0 0 0 1]
[1 0 0 0 0]
[0 0 0 1 0]]
Missing value encode: this dataset is missing label 4.
arr = np.array([1,2,5,3,1,5,1,])
df = pd.DataFrame(arr, columns = ['test'])
hotarr = np.array(pd.get_dummies(df['test']))
>>>[[1 0 0 0]
[0 1 0 0]
[0 0 0 1]
[0 0 1 0]
[1 0 0 0]
[0 0 0 1]
[1 0 0 0]]

Set up the CategoricalDtype before encoding to ensure all categories are represented when getting dummies:
import numpy as np
import pandas as pd
arr = np.array([1, 2, 5, 3, 1, 5, 1])
df = pd.DataFrame(arr, columns=['test'])
# Setup Categorical Dtype
df['test'] = df['test'].astype(pd.CategoricalDtype(categories=[1, 2, 3, 4, 5]))
hotarr = np.array(pd.get_dummies(df['test']))
print(hotarr)
Alternatively, you can reindex after get_dummies with fill_value=0 to add the missing columns:
hotarr = np.array(pd.get_dummies(df['test'])
.reindex(columns=[1, 2, 3, 4, 5], fill_value=0))
Both produce hotarr with 5 columns even though input does not contain 4:
[[1 0 0 0 0]
[0 1 0 0 0]
[0 0 0 0 1]
[0 0 1 0 0]
[1 0 0 0 0]
[0 0 0 0 1]
[1 0 0 0 0]]
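One caveat on the printed output: since pandas 2.0, get_dummies returns boolean columns by default, so the array may show True/False instead of 1/0. A minimal sketch, assuming the full label set is known in advance, that pins both the categories and an integer dtype. The helper name one_hot_fixed and the constant ALL_LABELS are illustrative and not part of pandas.

```python
import numpy as np
import pandas as pd

ALL_LABELS = [1, 2, 3, 4, 5]  # assumed: the complete label set, known up front


def one_hot_fixed(values, categories=ALL_LABELS):
    """One-hot encode `values` with one column per entry in `categories`."""
    cat = pd.Series(values).astype(pd.CategoricalDtype(categories=categories))
    # dtype=int forces 0/1 output regardless of the pandas version's default
    return pd.get_dummies(cat, dtype=int).to_numpy()


hotarr = one_hot_fixed(np.array([1, 2, 5, 3, 1, 5, 1]))
print(hotarr.shape)  # (7, 5): the column for label 4 exists and is all zeros
```

Values that are not in the category list become missing values under this dtype and end up as all-zero rows, which may or may not be what you want for unseen labels.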

Related

LabelBinarizer gives all values zeros

I'm encoding my labels with label binarizer like this:
from sklearn.preprocessing import LabelBinarizer
# Transform labels to one-hot
lb = LabelBinarizer()
Y = lb.fit_transform(df.classification)
But when I print Y I get all zeros like:
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
I don't know if all the values in all rows are zeros or not. Unfortunately, I can't see the complete row and couldn't find a way to do so. Are these values right or not?
Any help would be appreciated.
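The `...` in the printed output is only NumPy's summarization of a large array, so it says nothing about the hidden entries. A small sketch of how one might check, using a made-up stand-in for Y (the array below is hypothetical, not the asker's data). With more than two classes, every one-hot row should sum to 1.

```python
import sys
import numpy as np

# Hypothetical stand-in for the Y returned by lb.fit_transform(...):
# 200 samples, 50 classes, one-hot rows.
Y = np.eye(50, dtype=int)[np.arange(200) % 50]

print(Y.any())               # True if at least one entry is non-zero
row_sums = Y.sum(axis=1)
print(np.unique(row_sums))   # the distinct row sums; only 1 for valid one-hot rows

# Temporarily disable summarization to see one complete row.
with np.printoptions(threshold=sys.maxsize, linewidth=200):
    print(Y[0])
```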

Preprocess Accuracy metric

I have a model which predicts 5 classes. I want to change the Accuracy metric as in the example below:
def accuracy(y_pred, y_true):
    # our pred tensor
    y_pred = [[0,0,0,0,1], [0,1,0,0,0], [0,0,0,1,0], [1,0,0,0,0], [0,0,1,0,0]]
    # make some manipulations with tensor y_pred
    # description of the actions:
    for array in y_pred:
        if array[3] == 1:
            array[3] = 0
            array[0] = 1
        if array[4] == 1:
            array[4] = 0
            array[1] = 1
        else:
            continue
    # this works nicely with lists, but how can I implement it with tensors?
    # result after the manipulations ->
    y_pred = [[0,1,0,0,0], [0,1,0,0,0], [1,0,0,0,0], [1,0,0,0,0], [0,0,1,0,0]]
    # I want to apply the same actions to y_true,
    # and then run the preprocessed tensors through the standard tf.keras.metrics.Accuracy metric
I think tf.where can help to filter the tensor, but unfortunately I can't get it to work correctly.
How can I implement this preprocessing accuracy metric with tensors?
If you want to shift the ones to left by 3 indices, you can do this:
import numpy as np
y_pred = [ [0,0,0,0,1], [0,1,0,0,0], [0,0,0,1,0], [1,0,0,0,0], [0,0,1,0,0]]
y_pred = np.array(y_pred)
print(y_pred)
shift = 3
rows = np.arange(y_pred.shape[0]) # one row index per sample
one_pos = np.where(y_pred==1)[1] # column index of the 1 in each row
# updating the new positions with 1 (negative indices wrap around to the end of the row)
y_pred[rows, one_pos - shift] = 1
# making the old positions zero
y_pred[rows, one_pos] = 0
print(y_pred)
[[0 0 0 0 1]
[0 1 0 0 0]
[0 0 0 1 0]
[1 0 0 0 0]
[0 0 1 0 0]]
[[0 1 0 0 0]
[0 0 0 1 0]
[1 0 0 0 0]
[0 0 1 0 0]
[0 0 0 0 1]]
Update:
If you only want to shift the ones at indices 3 and 4:
import numpy as np
y_pred = [ [0,0,0,0,1], [0,1,0,0,0], [0,0,0,1,0], [1,0,0,0,0], [0,0,1,0,0]]
y_pred = np.array(y_pred)
print(y_pred)
shift = 3
one_pos = np.where(y_pred==1)[1]# indices where the y_pred is 1
print(one_pos)
rows = np.arange(y_pred.shape[0]) # one row index per sample
y_pred[rows, one_pos - shift] = [1 if (i == 3 or i == 4) else 0 for i in one_pos]
y_pred[rows, one_pos] = [0 if (i == 3 or i == 4) else 1 for i in one_pos]
print(y_pred)
[[0 0 0 0 1]
[0 1 0 0 0]
[0 0 0 1 0]
[1 0 0 0 0]
[0 0 1 0 0]]
[4 1 3 0 2]
[[0 1 0 0 0]
[0 1 0 0 0]
[1 0 0 0 0]
[1 0 0 0 0]
[0 0 1 0 0]]
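The fancy-indexing approach modifies a NumPy array in place, which does not carry over directly to tensors. One alternative, sketched here in NumPy under the assumption that every row is one-hot, is to express the remapping as a matrix product: a 5x5 matrix whose row i is the one-hot vector of the class that i should become. The names TARGET_CLASS and remap are illustrative. The same product could presumably be written with a tensor matrix multiplication inside a custom metric, but that part is not shown or tested here.

```python
import numpy as np

# Assumed mapping taken from the question: class 3 -> 0, class 4 -> 1, others unchanged.
TARGET_CLASS = [0, 1, 2, 0, 1]
# Row i of `remap` is the one-hot vector of TARGET_CLASS[i].
remap = np.eye(5, dtype=int)[TARGET_CLASS]

y_pred = np.array([[0, 0, 0, 0, 1],
                   [0, 1, 0, 0, 0],
                   [0, 0, 0, 1, 0],
                   [1, 0, 0, 0, 0],
                   [0, 0, 1, 0, 0]])

# For a one-hot row with a 1 at position i, the product selects row i of `remap`.
y_pred_merged = y_pred @ remap
print(y_pred_merged)
```

Because the product has no per-row branching, the same remap matrix can be applied to y_true before the two are compared.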

How to find longest consecutive occurrence of non-zero elements in 2D numpy array

I am simulating protein folding on a 2D grid where every angle is either ±90° or 0°, and have the following problem:
I have an n-by-n numpy array filled with zeros, except for certain places where the value is any integer from 1 to n. Every integer appears just once. Integer k is always a nearest neighbour to k-1 and k + 1, except for the endpoints. The array is saved as an object in the class Grid which I have created for doing energy calculations and folding the protein. Example array, with n=5:
>>> from Grid import Grid
>>> a = Grid(5)
>>> a.show()
[[0 0 0 0 0]
[0 0 0 0 0]
[1 2 3 4 5]
[0 0 0 0 0]
[0 0 0 0 0]]
My goal is to find the longest consecutive line of non-zero elements without any bends. In the above case, the result should be 5.
My idea so far is something like this:
def getDiameter(self):
    indexes = np.zeros((self.n, 2))
    for i in range(1, self.n + 1):
        indexes[i - 1] = np.argwhere(self.array == i)[0]
    for i in range(self.n):
        j = 1
        currentDiameter = 1
        while indexes[0][i] == indexes[0][i + j] and i + j <= self.n:
            currentDiameter += 1
            j += 1
        while indexes[i][0] == indexes[i + j][0] and i + j <= self.n:
            currentDiameter += 1
            j += 1
        if currentDiameter > diameter:
            diameter = currentDiameter
    return diameter
This has two problems: (1) it doesn't work, and (2) it is horribly inefficient if I get it to work. I am wondering if anybody has a better way of doing this. If anything is unclear, please let me know.
Edit:
Less trivial example
[[ 0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 10 0 0 0]
[ 0 0 0 0 0 0 9 0 0 0]
[ 0 0 0 0 0 0 8 0 0 0]
[ 0 0 0 4 5 6 7 0 0 0]
[ 0 0 0 3 0 0 0 0 0 0]
[ 0 0 0 2 1 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0]]
The correct answer here is 4 (both the longest column and the longest row have four non-zero elements).
What I understood from your question is that you need to find the length of the longest run of consecutive elements in a numpy array (row by row).
So for this below one, the output should be 5:
[[1 2 3 4 0]
[0 0 0 0 0]
[10 11 12 13 14]
[0 1 2 3 0]
[1 0 0 0 0]]
Because [10 11 12 13 14] are consecutive elements and they form the longest run compared to the consecutive elements in any other row.
If this is what you are expecting, consider this:
import numpy as np
from itertools import groupby
a = np.array([[1, 2, 3, 4, 0],
[0, 0, 0, 0, 0],
[10, 11, 12, 13, 14],
[0, 1, 2, 3, 0],
[1, 0, 0, 0, 0]])
a = a.astype(float)
a[a == 0] = np.nan
b = np.diff(a)  # first discrete difference along each row; consecutive numbers differ by 1
counter = []
for line in b:  # for each row
    if 1 in line:  # the row contains at least one pair of consecutive elements
        # longest run of 1s in the differences, plus 1, is the longest run of consecutive elements
        counter.append(max(sum(1 for _ in g) for k, g in groupby(line) if k == 1) + 1)
print(max(counter))  # maximum over all rows
# 5
For your particular example:
[[0 0 0 0 0]
[0 0 0 0 0]
[1 2 3 4 5]
[0 0 0 0 0]
[0 0 0 0 0]]
# 5
Start by finding the longest consecutive occurrence in a list:
def find_longest(l):
    counter = 0
    counters = []
    for i in l:
        if i == 0:
            counters.append(counter)
            counter = 0
        else:
            counter += 1
    counters.append(counter)
    return max(counters)
Now you can apply this function to each row and each column of the array, and find the maximum:
longest_occurrences = [find_longest(row) for row in a] + [find_longest(col) for col in a.T]
longest_occurrence = max(longest_occurrences)
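As a check of the row-and-column idea against the "less trivial example" from the question, here is a self-contained sketch. The helper name longest_run and the chain coordinate list are illustrative and were read off the printed grid. Note that this counts any adjacent non-zero cells, so for tightly folded chains where non-neighbouring residues sit side by side it may overcount relative to a strictly bend-free stretch of the chain.

```python
import numpy as np


def longest_run(values):
    """Length of the longest stretch of consecutive non-zero entries."""
    best = current = 0
    for v in values:
        current = current + 1 if v != 0 else 0
        best = max(best, current)
    return best


grid = np.zeros((10, 10), dtype=int)
# (row, column) position of residues 1..10, read off the printed example grid.
chain = [(7, 4), (7, 3), (6, 3), (5, 3), (5, 4),
         (5, 5), (5, 6), (4, 6), (3, 6), (2, 6)]
for k, (r, c) in enumerate(chain, start=1):
    grid[r, c] = k

# Scan every row and every column (rows of the transpose).
diameter = max(longest_run(line) for line in list(grid) + list(grid.T))
print(diameter)  # 4
```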

Take non-zero elements in a macro-list

I have a problem with the instruction np.nonzero() in python. I want to take all the indices of a given list that are non zero. So, consider that I have the following code:
import numpy as np
from scipy.special import binom
M=4
N=3
def generate(N, nb):
    states = np.zeros((int(binom(nb+N-1, nb)), N), dtype=int)
    states[0, 0] = nb
    ni = 0  # init
    for i in range(1, states.shape[0]):
        states[i, :N-1] = states[i-1, :N-1]
        states[i, ni] -= 1
        states[i, ni+1] += 1 + states[i-1, N-1]
        if ni >= N-2:
            if np.any(states[i, :N-1]):
                ni = np.nonzero(states[i, :N-1])[0][-1]
        else:
            ni += 1
    return states
base = generate(M,N)
The result of base is given by:
base = [[3 0 0 0]
[2 1 0 0]
[2 0 1 0]
[2 0 0 1]
[1 2 0 0]
[1 1 1 0]
[1 1 0 1]
[1 0 2 0]
[1 0 1 1]
[1 0 0 2]
[0 3 0 0]
[0 2 1 0]
[0 2 0 1]
[0 1 2 0]
[0 1 1 1]
[0 1 0 2]
[0 0 3 0]
[0 0 2 1]
[0 0 1 2]
[0 0 0 3]]
The point is that for a given pair of indices j,k I want to take all the rows of base that have non-zero components at both sites j and k, for example:
Taking j=0,k=1 I have to obtain:
result = [1 4 5 6]
which corresponds to the elements 1,4,5,6 of base that satisfies this condition. On the other hand, I have used the command:
np.nonzero((base[:, j]) & (base[:, k]))[0]
but it doesn't work correctly, any idea why?
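The indexing base[:, j] is fine as long as base is a NumPy array, which it is here because generate() returns one. (For a plain list of lists it would raise a TypeError, and base[:][j] would give row j, not column j.)
The actual problem is that
np.nonzero((base[:, j]) & (base[:, k]))[0]
applies & as a bitwise AND on the integer values, so for example 2 & 1 is 0 and the row [2 1 0 0] is dropped even though both entries are non-zero.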
you could use numpy like this:
b = np.array(base);
j=0;k=1;
np.nonzero(b.T[j]* b.T[k])[0]
which will give:
array([1, 4, 5, 6])
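To see why the & version misses rows, it may help to compare it with an explicit boolean test. On integer arrays & is a bitwise AND, so two non-zero values with no bits in common give 0. A short sketch on the first seven rows of base (the variable names bitwise and both_nonzero are illustrative):

```python
import numpy as np

# First seven rows of `base` from the question.
base = np.array([[3, 0, 0, 0],
                 [2, 1, 0, 0],
                 [2, 0, 1, 0],
                 [2, 0, 0, 1],
                 [1, 2, 0, 0],
                 [1, 1, 1, 0],
                 [1, 1, 0, 1]])
j, k = 0, 1

# Bitwise AND of the raw integers: rows 1 and 4 are lost (2 & 1 == 0, 1 & 2 == 0).
bitwise = np.nonzero(base[:, j] & base[:, k])[0]
print(bitwise)        # [5 6]

# Convert each column to a boolean mask first, then combine the masks.
both_nonzero = np.nonzero((base[:, j] != 0) & (base[:, k] != 0))[0]
print(both_nonzero)   # [1 4 5 6]
```

The product used in the answer works for the same reason as the boolean masks: it is non-zero exactly when both factors are non-zero.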

outputting large matrix in python from a dictionary

I have a python dictionary formatted in the following way:
data[author1][author2] = 1
This dictionary contains an entry for every possible author pair (all pairs of 8500 authors), and I need to output a matrix that looks like this for all author pairs:
"auth1" "auth2" "auth3" "auth4" ...
"auth1" 0 1 0 3
"auth2" 1 0 2 0
"auth3" 0 2 0 1
"auth4" 3 0 1 0
...
I have tried the following method:
x = numpy.array([[data[author1][author2] for author2 in sorted(data[author1])] for author1 in sorted(data)])
print x
outf.write(x)
However, printing this leaves me with this:
[[0 0 0 ..., 0 0 0]
[0 0 0 ..., 0 0 0]
[0 0 0 ..., 0 0 0]
...,
[0 0 0 ..., 0 0 0]
[0 0 0 ..., 0 0 0]
[0 0 0 ..., 0 0 0]]
and the output file is just a blank text file. I am trying to format the output in a way to read into Gephi (https://gephi.org/users/supported-graph-formats/csv-format/)
You almost got it right; your list comprehension is inverted. This will give you the expected result:
import numpy as np
d = dict(auth1=dict(auth1=0, auth2=1, auth3=0, auth4=3),
         auth2=dict(auth1=1, auth2=0, auth3=2, auth4=0),
         auth3=dict(auth1=0, auth2=2, auth3=0, auth4=1),
         auth4=dict(auth1=3, auth2=0, auth3=1, auth4=0))
# every author appears as both an outer and an inner key, so one sorted key list serves both loops
authors = sorted(d.keys())
np.array([[d[i][j] for i in authors] for j in authors])
#array([[0, 1, 0, 3],
# [1, 0, 2, 0],
# [0, 2, 0, 1],
# [3, 0, 1, 0]])
You could use pandas. Using @Saullo Castro's input:
import pandas as pd
df = pd.DataFrame.from_dict(d)
Result:
>>> df
       auth1  auth2  auth3  auth4
auth1      0      1      0      3
auth2      1      0      2      0
auth3      0      2      0      1
auth4      3      0      1      0
And if you want to save it, you can just do df.to_csv(file_name)
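For the Gephi import mentioned in the question, the labelled DataFrame can be written as delimited text. A sketch, assuming a semicolon delimiter (check the Gephi format page linked in the question for what your version expects) and writing to an in-memory buffer as a stand-in for a real file:

```python
import io
import pandas as pd

d = dict(auth1=dict(auth1=0, auth2=1, auth3=0, auth4=3),
         auth2=dict(auth1=1, auth2=0, auth3=2, auth4=0),
         auth3=dict(auth1=0, auth2=2, auth3=0, auth4=1),
         auth4=dict(auth1=3, auth2=0, auth3=1, auth4=0))
df = pd.DataFrame.from_dict(d)

buf = io.StringIO()       # stand-in for an opened output file
df.to_csv(buf, sep=";")   # header row and row labels are written by default
lines = buf.getvalue().splitlines()
print(lines[0])           # header: empty corner cell, then the author names
print(lines[1])           # first data row, starting with its author label
```

Unlike print, to_csv writes every row and column, so the truncated `...` display from the question does not affect the file.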
