Here is my implementation of one-hot encoding:
%reset -f
import numpy as np
import pandas as pd
sentences = []
s1 = 'this is sentence 1'
s2 = 'this is sentence 2'
sentences.append(s1)
sentences.append(s2)
def get_all_words(sentences):
    unf = [s.split(' ') for s in sentences]
    all_words = []
    for f in unf:
        for f2 in f:
            all_words.append(f2)
    return all_words
def get_one_hot(s, s1, all_words):
    flattened = []
    one_hot_encoded_df = pd.get_dummies(list(set(all_words)))
    for a in [np.array(one_hot_encoded_df[s]) for s in s1.split(' ')]:
        for aa in a:
            flattened.append(aa)
    return flattened
all_words = get_all_words(sentences)
print(get_one_hot(sentences, s1, all_words))
print(get_one_hot(sentences, s2, all_words))
This returns:
[1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
[1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]
As you can see, a sparse vector is returned for these small sentences. It appears the encoding is occurring at the character level instead of the word level? How do I correctly one-hot encode these words?
I think the encodings should be:
s1 -> 1, 1, 1, 1
s2 -> 1, 1, 1, 0
Not character-level encoding

The encoding is not actually happening at the character level. In the loop:

for f in unf:
    for f2 in f:
        all_words.append(f2)

f2 iterates over the words of each split sentence, so all_words ends up as a list of words with duplicates. You can rewrite the whole function more compactly, deduplicating as you go:

def get_all_words(sentences):
    unf = [s.split(' ') for s in sentences]
    return list(set([word for sen in unf for word in sen]))
Correct one-hot encoding
This loop:

for a in [np.array(one_hot_encoded_df[s]) for s in s1.split(' ')]:
    for aa in a:
        flattened.append(aa)

is what produces the very long vector: it concatenates one full one-hot column per word (4 words x 5 unique words = 20 elements). Let's look at the output of one_hot_encoded_df = pd.get_dummies(list(set(all_words))):
   1  2  is  sentence  this
0  0  1   0         0     0
1  0  0   0         0     1
2  1  0   0         0     0
3  0  0   1         0     0
4  0  0   0         1     0
The loop above picks the corresponding columns from this dataframe and appends them to the output flattened. My suggestion is to simply leverage pandas' ability to subset several columns at once, then sum across them and clip the result to either 0 or 1 to get the one-hot encoded vector:
def get_one_hot(s, s1, all_words):
    one_hot_encoded_df = pd.get_dummies(list(set(all_words)))
    return one_hot_encoded_df[s1.split(' ')].T.sum().clip(0, 1).values
The output will be:
[0 1 1 1 1]
[1 1 0 1 1]
For your two sentences, respectively. This is how to interpret them: from the row indices of the one_hot_encoded_df dataframe, we know that index 0 stands for 2, index 1 for this, index 2 for 1, and so on. So the output [0 1 1 1 1] means every word in the bag of words is present except 2, which you can confirm against the input 'this is sentence 1'.
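Putting both fixes together, here is a minimal end-to-end sketch (same two sentences as above; sentence_to_one_hot is just the simplified get_one_hot, and since set() ordering is arbitrary the vector positions may differ between runs):

import pandas as pd

sentences = ['this is sentence 1', 'this is sentence 2']

# vocabulary: the unique words across all sentences
all_words = list(set(word for s in sentences for word in s.split(' ')))
one_hot_encoded_df = pd.get_dummies(all_words)

def sentence_to_one_hot(sentence):
    # one column per word, summed across words, clipped to {0, 1}
    return one_hot_encoded_df[sentence.split(' ')].T.sum().clip(0, 1).values

for s in sentences:
    print(s, '->', sentence_to_one_hot(s))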
Related
I have a list that looks like this:
a = [0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0...]
How do I get the index of the first 1 in each block of ones, so that the resulting indices are:
[8 23 ..] and so on
I've been using this code:
def find_one(a):
    for i in range(len(a)):
        if a[i] > 0:
            return i

print(find_one(a))
but it gives me only the first occurrence of 1. How can I implement it to iterate through the entire list?
Thank you!!
You can do it using zip and a list comprehension:
a = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
r = [i for n, (i, v) in zip([1] + a, enumerate(a)) if v > n]
print(r)  # [8, 23]

Prepending 1 as the sentinel means a run of ones at the very start of the list would not be reported; use [0] + a instead if index 0 should count as a block start.
Since you tagged pandas, you can use groupby: s.diff().ne(0).cumsum() labels each run of equal values, and head(1) keeps the first element of each run. If s = pd.Series(a), then
>>> x = s.groupby(s.diff().ne(0).cumsum()).head(1).astype(bool)
>>> x[x].index
Int64Index([8, 23], dtype='int64')
Without pandas:

b = a[1:]
[(num + 1) for num, i in enumerate(zip(a, b)) if i == (0, 1)]
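If the list is already a NumPy array, the same rising-edge idea can be written with np.diff (a sketch, assuming the list holds only 0s and 1s; prepending a 0 means a run of ones starting at index 0 would also be reported):

import numpy as np

a = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
              0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0])

# a 0 -> 1 transition shows up as a +1 in the first difference
starts = np.flatnonzero(np.diff(a, prepend=0) == 1)
print(starts)  # [ 8 23]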
# `state` is (prev_char, cur_char)
# where `prev_char` is the previous character seen
# and `cur_char` is the current character
#
# (0, 1) .... previous was "0", current is "1"
#             RECORD THE INDEX: a string of ones just began
#
# (0, 0) .... previous was "0", current is "0"
#             do NOT record the index
#
# (1, 1) .... previous was "1", current is "1"
#             we are inside a string of ones, but not at
#             the beginning of it, so do NOT record the index
#
# (1, 0) .... previous was "1", current is "0"
#             a string of ones just ended; this is not the start
#             of a string of ones, so do NOT record the index
state_to_print_decision = dict()
state_to_print_decision[(0, 1)] = True

def find_one(a, state_to_print_decision):
    # pretend we just saw a bunch of zeros:
    # initialize state to (0, 0)
    state = (0, 0)
    indices = []
    for i in range(len(a)):
        # a[i] is the current character; state[1] was the
        # current character and now becomes the previous one
        state = (state[1], a[i])
        it_is_time_to_print = state_to_print_decision.get(state, False)
        if it_is_time_to_print:
            indices.append(i)
    return indices

a = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(find_one(a, state_to_print_decision))  # [8]
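The same run detection can also be written with itertools.groupby, which groups consecutive equal values; this is a sketch of that alternative:

from itertools import groupby

def find_block_starts(a):
    starts = []
    pos = 0
    for value, run in groupby(a):
        run_length = len(list(run))
        if value == 1:
            # pos is the index of the first element of this run of ones
            starts.append(pos)
        pos += run_length
    return starts

a = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
     0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(find_block_starts(a))  # [8, 23]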
I am using music21 for handling MIDI and MusicXML files and converting them to a piano roll that I use in my project.

My piano roll is made up of a sequence of 88-dimensional vectors, where each element in a vector represents one pitch. One vector is one time step, which can be a 16th, an 8th, a 4th, and so on. Elements can take one of three values {0, 1, 2}: 0 means the note is off, 1 means the note is on, and 2 also means the note is on, but it always follows a 1 - that is how I distinguish repeated presses of the same key. E.g., let the time step be an 8th and the two pitches be C and E:
[0 0 0 ... 1 0 0 0 1 ... 0]
[0 0 0 ... 1 0 0 0 1 ... 0]
[0 0 0 ... 2 0 0 0 2 ... 0]
[0 0 0 ... 2 0 0 0 2 ... 0]
[0 0 0 ... 1 0 0 0 0 ... 0]
[0 0 0 ... 1 0 0 0 0 ... 0]
We see that C and E are played simultaneously for a quarter note, then again for a quarter note, and we end with a C that lasts a quarter note.

Right now, I am creating a Stream() for every note and filling it as notes come. That gives me 88 streams, and when I convert that to MIDI and open it with MuseScore, I am left with an unreadable mess.

My question is: is there some nicer way to transform this kind of piano roll to MIDI? Any algorithm or idea I could use would be appreciated.
In my opinion music21 is a very good library but too high-level for this job. There is no such thing as streams, quarter notes or chords in MIDI -- only messages. Try the Mido library instead. Here is sample code:
from mido import Message, MidiFile, MidiTrack

def stop_note(note, time):
    return Message('note_off', note=note,
                   velocity=0, time=time)

def start_note(note, time):
    return Message('note_on', note=note,
                   velocity=127, time=time)

def roll_to_track(roll):
    delta = 0
    # State of the notes in the roll.
    notes = [False] * len(roll[0])
    # MIDI note for first column.
    midi_base = 60
    for row in roll:
        for i, col in enumerate(row):
            note = midi_base + i
            if col == 1:
                if notes[i]:
                    # First stop the ringing note
                    yield stop_note(note, delta)
                    delta = 0
                yield start_note(note, delta)
                delta = 0
                notes[i] = True
            elif col == 0:
                if notes[i]:
                    # Stop the ringing note
                    yield stop_note(note, delta)
                    delta = 0
                notes[i] = False
            # col == 2 needs no message: the note keeps ringing
        # time per row (note: delta times in a MIDI file track
        # are in ticks, not milliseconds)
        delta += 500

roll = [[0, 0, 0, 1, 0, 0, 0, 1, 0],
        [0, 0, 0, 1, 0, 0, 0, 1, 0],
        [0, 0, 0, 2, 0, 0, 0, 2, 0],
        [0, 1, 0, 2, 0, 0, 0, 2, 0],
        [0, 0, 0, 1, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0, 0, 0]]

midi = MidiFile(type=1)
midi.tracks.append(MidiTrack(roll_to_track(roll)))
midi.save('test.mid')
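To sanity-check the result, you can read the file back and print its messages; when you iterate over a MidiFile, mido converts the delta times to seconds:

from mido import MidiFile

for msg in MidiFile('test.mid'):
    print(msg)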
I have a feature matrix and corresponding targets, which are ones or zeroes:
# raw observations
features = np.array([[1, 1, 0],
                     [1, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1]])
targets = np.array([1, 0, 1, 1, 0, 0])
As you can see, each feature row may correspond to both ones and zeros. I need to convert my raw observation matrix to a probability matrix, where each unique feature row corresponds to the probability of seeing a one as the target:
[1 1 0] -> 0.5
[0 1 0] -> 0.67
[0 0 1] -> 0
I have constructed a quite straightforward solution:
import numpy as np
from collections import Counter

# raw observations
features = np.array([[1, 1, 0],
                     [1, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1]])
targets = np.array([1, 0, 1, 1, 0, 0])

def convert_obs_to_proba(features, targets):
    features_ = []
    targets_ = []
    # compute unique rows (idx will point to some representative)
    b = np.ascontiguousarray(features).view(
        np.dtype((np.void, features.dtype.itemsize * features.shape[1])))
    _, idx = np.unique(b, return_index=True)
    idx = idx[::-1]
    zeros = Counter()
    ones = Counter()
    # collect the row-wise number of one and zero targets
    for i, row in enumerate(features):
        if targets[i] == 0:
            zeros[tuple(row)] += 1
        else:
            ones[tuple(row)] += 1
    # iterate over unique features and compute probabilities
    for k in idx:
        unique_row = features[k]
        zero_count = zeros[tuple(unique_row)]
        one_count = ones[tuple(unique_row)]
        proba = float(one_count) / float(zero_count + one_count)
        features_.append(unique_row)
        targets_.append(proba)
    return np.array(features_), np.array(targets_)

features_, targets_ = convert_obs_to_proba(features, targets)
print(features_)
print(targets_)
which:
extracts unique features;
counts the number of zero and one targets for each unique feature;
computes probability and constructs the result.
Could it be solved in a prettier way using some advanced numpy magic?
Update: the previous code was pretty inefficient (O(n^2)); I converted it to something more performance-friendly. Old code:
import numpy as np

# raw observations
features = np.array([[1, 1, 0],
                     [1, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1]])
targets = np.array([1, 0, 1, 1, 0, 0])

def convert_obs_to_proba(features, targets):
    features_ = []
    targets_ = []
    # compute unique rows (idx will point to some representative)
    b = np.ascontiguousarray(features).view(
        np.dtype((np.void, features.dtype.itemsize * features.shape[1])))
    _, idx = np.unique(b, return_index=True)
    idx = idx[::-1]
    # calculate ZERO class occurrences and ONE class occurrences
    for k in idx:
        unique_row = features[k]
        zeros = 0
        ones = 0
        for i, row in enumerate(features):
            if np.array_equal(row, unique_row):
                if targets[i] == 0:
                    zeros += 1
                else:
                    ones += 1
        proba = float(ones) / float(zeros + ones)
        features_.append(unique_row)
        targets_.append(proba)
    return np.array(features_), np.array(targets_)

features_, targets_ = convert_obs_to_proba(features, targets)
print(features_)
print(targets_)
It's easy using Pandas:
import pandas as pd

df = pd.DataFrame(features)
df['targets'] = targets
Now you have:
   0  1  2  targets
0  1  1  0        1
1  1  1  0        0
2  0  1  0        1
3  0  1  0        1
4  0  1  0        0
5  0  0  1        0
Now, the fancy part:
df.groupby([0,1,2]).targets.mean()
Gives you:
0  1  2
0  0  1    0.000000
   1  0    0.666667
1  1  0    0.500000
Name: targets, dtype: float64
Pandas doesn't print the 0 at the leftmost part of the 0.666667 row because it belongs to the same first-level index group as the row above, but if you inspect the value there, it is indeed 0.
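If you need the result back as arrays, like the original function returns, reset_index separates the grouped keys from the probabilities (a sketch; note the unique rows come back in sorted order rather than in the order the original function produced):

result = df.groupby([0, 1, 2]).targets.mean().reset_index()
features_ = result[[0, 1, 2]].values  # unique feature rows
targets_ = result.targets.values      # probability of target == 1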
np.sum(np.reshape([targets[f] if tuple(features[f])==tuple(i) else 0 for i in np.vstack(set(map(tuple,features))) for f in range(features.shape[0])],features.shape[::-1]),axis=1)/np.sum(np.reshape([1 if tuple(features[f])==tuple(i) else 0 for i in np.vstack(set(map(tuple,features))) for f in range(features.shape[0])],features.shape[::-1]),axis=1)
Here you go, numpy magic! Although unnecessarily so; this could probably be cleaned up using some boring variables ;)

(And it is probably far from optimal.)
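For reference, an equivalent but readable NumPy version; this is a sketch that assumes NumPy >= 1.13, where np.unique supports the axis argument:

import numpy as np

features = np.array([[1, 1, 0],
                     [1, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1]])
targets = np.array([1, 0, 1, 1, 0, 0])

# map every row to the id of its unique value
uniq, inverse = np.unique(features, axis=0, return_inverse=True)
# per-group target sums and group sizes give the probability of a 1
sums = np.bincount(inverse, weights=targets)
counts = np.bincount(inverse)
print(uniq)
print(sums / counts)  # [0.         0.66666667 0.5       ]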
Currently, I have the string "abdicator". I would like to find the frequency of the letters in this word compared against all 26 letters of the English alphabet, with output in the form below.
Output:
a b c d e f g h i ... o ... r s t ... x y z
2 1 1 1 0 0 0 0 1 ... 1 ... 1 0 1 ... 0 0 0
This output can be a numeric vector (with names being the 26 letters). My initial attempt was to first use the strsplit function to split the string into individual letters (using R):

strsplit("abdicator", "")  # split at every character
#[[1]]
#[1] "a" "b" "d" "i" "c" "a" "t" "o" "r"
However, I am a little stuck as to what to do for the next step. Can someone enlighten me on this please? Many thanks.
In R (prepending letters ensures every letter appears in the table at least once; subtracting 1 then removes that padding):
table(c(letters, strsplit("abdicator", "")[[1]]))-1
# a b c d e f g h i j k l m n o p q r s t u v w x y z
# 2 1 1 1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0
And extending that a bit to handle the possibility of multiple words and/or capital letters:
words <- c("abdicator", "Syzygy")
letterCount <- function(X) table(c(letters, strsplit(tolower(X), "")[[1]]))-1
t(sapply(words, letterCount))
# a b c d e f g h i j k l m n o p q r s t u v w x y z
# abdicator 2 1 1 1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0
# syzygy 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 3 1
In Python:
>>> from collections import Counter
>>> s = "abdicator"
>>> Counter(s)
Counter({'a': 2, 'c': 1, 'b': 1, 'd': 1, 'i': 1, 'o': 1, 'r': 1, 't': 1})
>>> map(Counter(s).__getitem__, map(chr, range(ord('a'), ord('z')+1)))
[2, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0]
Or:

>>> import string
>>> map(Counter(s).__getitem__, string.lowercase)
[2, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0]

(These snippets are Python 2; in Python 3 use string.ascii_lowercase and wrap the map call in list() to materialize the result.)
Python:
import collections
import string
counts = collections.Counter('abdicator')
chars = string.ascii_lowercase
print(*chars, sep=' ')
print(*[counts[char] for char in chars], sep=' ')
In Python 2:
import string, collections

ctr = collections.Counter('abdicator')
for l in string.ascii_lowercase:
    print l,
print
for l in string.ascii_lowercase:
    print ctr[l],
print
In Python 3, only the syntax of print changes.
This produces exactly the output you requested. The core idea is that a collections.Counter, indexed with a missing key, returns 0, with the obvious semantics "this key has been seen 0 times" - fully aligned with the semantics it uses for keys that are present (where it returns their count, i.e., the number of times they have been seen).
I have a pandas Dataframe y with 1 million rows and 5 columns.
np.shape(y)
(1037889, 5)
The column values are all 0 or 1. Looks something like this:
y.head()
a, b, c, d, e
0, 0, 1, 0, 0
1, 0, 0, 1, 1
0, 1, 1, 1, 1
0, 0, 0, 0, 0
I want a Dataframe with 1 million rows and 1 column.
np.shape(y)
(1037889, )
where the column is just the 5 columns concatenated together.
New column
0, 0, 1, 0, 0
1, 0, 0, 1, 1
0, 1, 1, 1, 1
0, 0, 0, 0, 0
I keep trying different things like merge, concat, dstack, etc., but can't seem to figure this out.
If you want the new column to have all the data concatenated into a string, it's a good case for the apply() function:
>>> df = pd.DataFrame({'a': [0, 1, 0, 0], 'b': [0, 0, 1, 0], 'c': [0, 1, 1, 0], 'd': [0, 1, 1, 0]})
>>> df
   a  b  c  d
0  0  0  0  0
1  1  0  1  1
2  0  1  1  1
3  0  0  0  0
>>> df2 = df.apply(lambda row: ','.join(map(str, row)), axis=1)
>>> df2
0    0,0,0,0
1    1,0,1,1
2    0,1,1,1
3    0,0,0,0
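On a million rows, the row-wise apply can be slow; Series.str.cat concatenates the columns in a vectorized way instead. A sketch, assuming every column should appear in order:

>>> cols = df.astype(str)
>>> df2 = cols.iloc[:, 0].str.cat([cols[c] for c in cols.columns[1:]], sep=',')

This produces the same strings as the apply version.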