I am attempting to take an array of strings and transform it into an array of the individual words (with the same number of columns). But the two loops below give me very different results, and this means I can't reliably access any of the values in the array.
array1 = [
["yes is a good thing","no is a bad thing"],
["maybe is a good","certainly is a bad"]
]
w2, h2 = 2, 15
array2 = [[0 for x in range(w2)] for y in range(h2)]
for column in range(len(array1[0])):
    for row in range(len(array1)):
        array2[1:][column] += str(array1[row][column]).split()
for line in array2:  # LOOP 1
    print(line)
for column in range(len(array2[0])):  # LOOP 2
    for row in range(len(array2)):
        print(array2[row][column])
The results:
Loop 1 (This is what I'd like to be represented in the second loop)
[0, 0]
[0, 0, 'yes', 'is', 'a', 'good', 'thing', 'maybe', 'is', 'a', 'good']
[0, 0, 'no', 'is', 'a', 'bad', 'thing', 'certainly', 'is', 'a', 'bad']
[0, 0]
[0, 0]
[0, 0]
[0, 0]
[0, 0]
[0, 0]
[0, 0]
[0, 0]
[0, 0]
[0, 0]
[0, 0]
[0, 0]
Loop 2:
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
Basically I want an array with two columns, and then the relevant separate words going down each column. Expected output:
yes no
is is
a a
good bad
thing thing
maybe certainly
is is
a a
good bad
You can produce your output directly from the columns:
array1 = [
["yes is a good thing", "no is a bad thing"],
["maybe is a good", "certainly is a bad"]
]
words = [[word for line in col for word in line.split()] for col in zip(*array1)]
transposed = list(zip(*words))
zip(*iterable) transposes a matrix, moving columns to rows and vice versa.
Demo:
>>> array1 = [
... ["yes is a good thing", "no is a bad thing"],
... ["maybe is a good", "certainly is a bad"]
... ]
>>> words = [[word for line in col for word in line.split()] for col in zip(*array1)]
>>> transposed = list(zip(*words))
>>> for row in transposed:
... print('{:8} {:8}'.format(*row))
...
yes no
is is
a a
good bad
thing thing
maybe certainly
is is
a a
good bad
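One caveat: zip truncates to the shortest column, so if the two columns produce different numbers of words, the trailing rows are silently dropped. A sketch using itertools.zip_longest to pad instead, with a hypothetical variant of array1 where the second column is one word shorter:

```python
from itertools import zip_longest

# hypothetical variant of array1 where the second column has fewer words
array1 = [
    ["yes is a good thing", "no is a bad"],
    ["maybe is a good", "certainly is a bad"]
]
words = [[word for line in col for word in line.split()] for col in zip(*array1)]
# zip_longest pads the shorter column with '' instead of dropping rows
transposed = list(zip_longest(*words, fillvalue=''))
for row in transposed:
    print('{:9} {:9}'.format(*row))
```

The last row prints as ('good', ''), whereas plain zip would have discarded it entirely.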
I want to insert values into a NumPy array as follows:
If the Nth row is the same as the (N-1)th row, insert 1 for both the Nth and (N-1)th rows (and 0 for the rest).
If the Nth row is different from the (N-1)th row, move to the next column and repeat the condition.
Here is the example
import numpy as np
import pandas as pd

d = {'col1': [2, 2, 3, 3, 3, 4, 4, 5, 5, 5],
     'col2': [3, 3, 4, 4, 4, 1, 1, 0, 0, 0]}
df = pd.DataFrame(data=d)
np.zeros((10, 4))
###########################################################
OUTPUT MATRIX
1 0 0 0    first two rows are the same, so 1, 1 in the first column
1 0 0 0
0 1 0 0    next three rows are the same: 1, 1, 1
0 1 0 0
0 1 0 0
0 0 1 0    again two rows are the same: 1, 1
0 0 1 0
0 0 0 1    again three rows are the same: 1, 1, 1
0 0 0 1
0 0 0 1
IIUC, you can achieve this simply with numpy indexing:
# group by successive identical values
group = df.ne(df.shift()).all(1).cumsum().sub(1)
# craft the numpy array
a = np.zeros((len(group), group.max()+1), dtype=int)
a[np.arange(len(df)), group] = 1
print(a)
Alternative with numpy.identity:
# group by successive identical values
group = df.ne(df.shift()).all(1).cumsum().sub(1)
shape = df.groupby(group).size()
# craft the numpy array
a = np.repeat(np.identity(len(shape), dtype=int), shape, axis=0)
print(a)
output:
array([[1, 0, 0, 0],
[1, 0, 0, 0],
[0, 1, 0, 0],
[0, 1, 0, 0],
[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 0, 1, 0],
[0, 0, 0, 1],
[0, 0, 0, 1],
[0, 0, 0, 1]])
intermediates:
group
0 0
1 0
2 1
3 1
4 1
5 2
6 2
7 3
8 3
9 3
dtype: int64
shape
0 2
1 3
2 2
3 3
dtype: int64
other option
For fun; likely not as efficient on large inputs:
a = pd.get_dummies(df.agg(tuple, axis=1)).to_numpy()
Note that this second option uses groups of identical values, not successive identical values. For identical values with the first (numpy) approach, you would need to use group = df.groupby(list(df)).ngroup() and the numpy indexing option (this wouldn't work with repeating the identity).
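To illustrate the note above, a small sketch of the ngroup variant, which labels identical (col1, col2) combinations wherever they occur rather than only successive runs; on this particular data the two groupings happen to coincide:

```python
import numpy as np
import pandas as pd

d = {'col1': [2, 2, 3, 3, 3, 4, 4, 5, 5, 5],
     'col2': [3, 3, 4, 4, 4, 1, 1, 0, 0, 0]}
df = pd.DataFrame(data=d)

# ngroup labels each distinct (col1, col2) combination, regardless of position
group = df.groupby(list(df)).ngroup()

# numpy indexing option from above
a = np.zeros((len(group), group.max() + 1), dtype=int)
a[np.arange(len(df)), group] = 1
```

With repeated non-adjacent values (say another (2, 3) row at the end), ngroup would assign it back to group 0, which the cumsum approach cannot do.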
Let's suppose I have the following DataFrame:
Column
0 A - B - C
1 A - B
2 A - C
3 A
4 B
5 C
I want to encode the "Column" but I have multiple classes in the same cell. Using pandas I can do the following to get the proper encoding output:
df['Column'].str.get_dummies(sep=' - ')
A B C
0 1 1 1
1 1 1 0
2 1 0 1
3 1 0 0
4 0 1 0
5 0 0 1
How can I do the same transformation using Sklearn?
Another approach is to use the MultiLabelBinarizer class, as it supports an iterable of iterables as input.
from sklearn.preprocessing import MultiLabelBinarizer

df['Column'] = df['Column'].str.split(' - ')
enc = MultiLabelBinarizer()
enc.fit_transform(df['Column'])
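For completeness, a self-contained sketch: MultiLabelBinarizer exposes the learned labels in classes_, which you can use to rebuild a labelled DataFrame matching the pandas output above:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({'Column': ['A - B - C', 'A - B', 'A - C', 'A', 'B', 'C']})
enc = MultiLabelBinarizer()
# split each cell into a list of labels, then binarize
encoded = enc.fit_transform(df['Column'].str.split(' - '))

# enc.classes_ holds the label names in sorted order
result = pd.DataFrame(encoded, columns=enc.classes_)
print(result)
```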
Looks like you can do this with CountVectorizer by specifying how you want to identify token boundaries.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
df = pd.DataFrame({'Column': ['A - B - C', 'A - B', 'A - C', 'A', 'B', 'C']})
vectorizer = CountVectorizer(analyzer=lambda x: x.split(' - '))
X = vectorizer.fit_transform(df['Column'])
X.toarray()
array([[1, 1, 1],
[1, 1, 0],
[1, 0, 1],
[1, 0, 0],
[0, 1, 0],
[0, 0, 1]])
vectorizer.get_feature_names_out()  # use get_feature_names() on scikit-learn < 1.0
array(['A', 'B', 'C'], dtype=object)
I have a 2d array r. What I want to do is to take the product of each row (excluding the zero elements in that row). For example if I have:
r = np.array([[1, 2, 0, 3, 4],
              [0, 2, 5, 0, 1],
              [1, 2, 3, 4, 0]])
Then what I want is to have another 2d array result such that:
result = [[24],
          [10],
          [24]]
How can I achieve this using numpy.prod?
I think I figured it out:
np.prod(r, axis=1, where=r > 0, keepdims=True)
Output:
array([[24],
[10],
[24]])
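One caveat: where=r > 0 also drops negative entries. If the array may contain negative numbers, where=r != 0 masks out only the zeros (a small variation on the same data):

```python
import numpy as np

r = np.array([[1, 2, 0, 3, 4],
              [0, 2, 5, 0, 1],
              [1, 2, 3, 4, 0]])

# r != 0 excludes only the zeros, so negative values still contribute
result = np.prod(r, axis=1, where=r != 0, keepdims=True)
print(result)  # [[24], [10], [24]]
```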
I want to create a column s['C'] using apply() with a Pandas DataFrame.
My dataset is similar to this:
[In]:
s=pd.DataFrame({'A':['hello', 'good', 'my', 'pandas','wrong'],
'B':[['all', 'say', 'hello'],
['good', 'for', 'you'],
['so','hard'],
['pandas'],
[]]})
[Out]:
A B
0 hello [all, say, hello]
1 good [good, for, you]
2 my [so, hard]
3 pandas [pandas]
4 wrong []
I need to create an s['C'] column where each row's value is a list of ones and zeros, depending on whether the word in column A matches the element at each position of the list in column B. My output should be like this:
[Out]:
A B C
0 hello [all, say, hello] [0, 0, 1]
1 good [good, for, you] [1, 0, 0]
2 my [so, hard] [0, 0]
3 pandas [pandas] [1]
4 wrong [] [0]
I've been trying with a function and apply, but I still haven't figured out where the error is.
[In]:
def func(valueA, listB):
    new_list = []
    for i in listB:
        if listB[i] == valueA:
            new_list.append(1)
        else:
            new_list.append(0)
    return new_list

s['C'] = s.apply(lambda x: func(x.loc[:, 'A'], x.loc[:, 'B']))
The error is: Too many indexers
And I also tried with:
[In]:
list = []
listC = []
for i in s['A']:
    for j in s['B'][i]:
        if s['A'][i] == s['B'][i][j]:
            list.append(1)
        else:
            list.append(0)
    listC.append(list)
s['C'] = listC
The error is: KeyError: 'hello'
Any suggestions?
If you are working with pandas 0.25+, explode is an option:
(s.explode('B')
.assign(C=lambda x: x['A'].eq(x['B']).astype(int))
.groupby(level=0).agg({'A':'first','B':list,'C':list})
)
Output:
A B C
0 hello [all, say, hello] [0, 0, 1]
1 good [good, for, you] [1, 0, 0]
2 my [so, hard] [0, 0]
3 pandas [pandas] [1]
4 wrong [nan] [0]
Option 2: Based on your logic, you can do a list comprehension. This should work with any version of pandas:
s['C'] = [[x==a for x in b] if b else [0] for a,b in zip(s['A'],s['B'])]
Output:
A B C
0 hello [all, say, hello] [False, False, True]
1 good [good, for, you] [True, False, False]
2 my [so, hard] [False, False]
3 pandas [pandas] [True]
4 wrong [] [0]
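If you want 1/0 integers instead of booleans, you can wrap the comparison in int(), a minor variation of the same comprehension:

```python
import pandas as pd

s = pd.DataFrame({'A': ['hello', 'good', 'my', 'pandas', 'wrong'],
                  'B': [['all', 'say', 'hello'],
                        ['good', 'for', 'you'],
                        ['so', 'hard'],
                        ['pandas'],
                        []]})

# int(x == a) turns each boolean into 1/0; empty lists map to [0]
s['C'] = [[int(x == a) for x in b] if b else [0] for a, b in zip(s['A'], s['B'])]
```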
With apply it would be:
s['c'] = s.apply(lambda x: [int(x.A == i) for i in x.B], axis=1)
s
A B c
0 hello [all, say, hello] [0, 0, 1]
1 good [good, for, you] [1, 0, 0]
2 my [so, hard] [0, 0]
3 pandas [pandas] [1]
4 wrong [] []
I could get your function to work with some minor changes:
def func(valueA, listB):
    new_list = []
    for i in range(len(listB)):  # changed `for i in listB` to `range(len(listB))`
        if listB[i] == valueA:
            new_list.append(1)
        else:
            new_list.append(0)
    return new_list
and adding the parameter axis=1 to the apply call:
s['C'] = s.apply(lambda x: func(x.A, x.B), axis=1)
Another approach that requires numpy for easy indexing:
import numpy as np

def create_vector(word, vector):
    out = np.zeros(len(vector))
    indices = [i for i, x in enumerate(vector) if x == word]
    out[indices] = 1
    return out.astype(int)

s['C'] = s.apply(lambda x: create_vector(x.A, x.B), axis=1)
# Output
# A B C
# 0 hello [all, say, hello] [0, 0, 1]
# 1 good [good, for, you] [1, 0, 0]
# 2 my [so, hard] [0, 0]
# 3 pandas [pandas] [1]
# 4 wrong [] []
Here is my implementation of one-hot encoding:
%reset -f
import numpy as np
import pandas as pd
sentences = []
s1 = 'this is sentence 1'
s2 = 'this is sentence 2'
sentences.append(s1)
sentences.append(s2)
def get_all_words(sentences):
    unf = [s.split(' ') for s in sentences]
    all_words = []
    for f in unf:
        for f2 in f:
            all_words.append(f2)
    return all_words

def get_one_hot(s, s1, all_words):
    flattened = []
    one_hot_encoded_df = pd.get_dummies(list(set(all_words)))
    for a in [np.array(one_hot_encoded_df[s]) for s in s1.split(' ')]:
        for aa in a:
            flattened.append(aa)
    return flattened
all_words = get_all_words(sentences)
print(get_one_hot(sentences , s1 , all_words))
print(get_one_hot(sentences , s2 , all_words))
This returns:
[1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
[1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]
As you can see, a sparse vector is returned even for small sentences. It appears the encoding is occurring at the character level instead of the word level? How do I correctly one-hot encode the words below?
I think the encodings should be:
s1 -> 1, 1, 1, 1
s2 -> 1, 1, 1, 0
Collecting the words
Note that in the loop:
for f in unf:
    for f2 in f:
        all_words.append(f2)
f2 iterates over the words of each split sentence (unf is a list of word lists), so all_words keeps duplicate occurrences. Indeed you can rewrite the whole function to return the unique words directly:
def get_all_words(sentences):
    unf = [s.split(' ') for s in sentences]
    return list(set([word for sen in unf for word in sen]))
Correct one-hot encoding
This loop:
for a in [np.array(one_hot_encoded_df[s]) for s in s1.split(' ')]:
    for aa in a:
        flattened.append(aa)
is actually building one very long vector. Let's look at the output of one_hot_encoded_df = pd.get_dummies(list(set(all_words))):
1 2 is sentence this
0 0 1 0 0 0
1 0 0 0 0 1
2 1 0 0 0 0
3 0 0 1 0 0
4 0 0 0 1 0
The loop above picks the corresponding column from this dataframe for each word and appends it to flattened. My suggestion is to simply leverage pandas' column subsetting: select the word columns, sum them up, and clip to either 0 or 1 to get the one-hot encoded vector:
def get_one_hot(s, s1, all_words):
    one_hot_encoded_df = pd.get_dummies(list(set(all_words)))
    return one_hot_encoded_df[s1.split(' ')].T.sum().clip(0, 1).values
The output will be:
[0 1 1 1 1]
[1 1 0 1 1]
for your two sentences respectively. To interpret these: from the row indices of the one_hot_encoded_df dataframe, we know that row 0 stands for 2, row 1 for this, row 2 for 1, etc. So the output [0 1 1 1 1] means every item in the bag of words except 2, which you can confirm against the input 'this is sentence 1'.