How would one interpolate categorical (non-float or, in a broader sense, non-numerical) data in Python?
Test data
Here is an example dataset with string-valued y-values.
x = [1.4, 2.8, 3.1, 4.4, 5.2]
y = ['A', 'B', 'A', 'A', 'B']
Expected outputs
# with kind='nearest'
x_new = [1, 2, 3, 4, 5]
y_new = ['A', 'A', 'A', 'A', 'B']
# with kind='previous', fill_value=None
x_new = [1, 2, 3, 4, 5]
y_new = [None, 'A', 'B', 'A', 'A']
I was expecting that interp1d could do the job with kind='nearest' or kind='previous', but unfortunately that is not the case.
You can still use interp1d if you replace your target values with indices: construct a list of all unique values (in your case it would be ['A', 'B']), then transform y into indices instead of strings. The indices get converted to float, which is fine as long as the number of unique elements can be stored as a float without losing precision.
After interpolating, you just need to map the result back to the original elements. As long as you use 'previous' or 'nearest', you will always get a floating-point value that is one of your original indices.
UPD.
An even simpler version: use y_int = [float(i) for i in range(len(y))] as the input for interp1d, then use the interpolation result directly as an index into y.
Example: kind='nearest'
from scipy.interpolate import interp1d
import numpy as np
x = [1.4, 2.8, 3.1, 4.4, 5.2]
y = ['A', 'B', 'A', 'A', 'B']
x_new = [1, 2, 3, 4, 5]
f = interp1d(x, range(len(y)), kind='nearest', fill_value=(0, len(y)-1), bounds_error=False)
y_idx = f(x_new)
y_new = [y[int(i)] for i in y_idx]
# ['A', 'A', 'A', 'A', 'B']
Example: kind='previous'
from scipy.interpolate import interp1d
import numpy as np
x = [1.4, 2.8, 3.1, 4.4, 5.2]
y = ['A', 'B', 'A', 'A', 'B']
x_new = [1, 2, 3, 4, 5]
f = interp1d(x, range(len(y)), kind='previous', fill_value=-1, bounds_error=False)
y_idx = f(x_new)
y_new = [y[int(i)] if i != -1 else None for i in y_idx]
# [None, 'A', 'B', 'A', 'A']
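The same lookup can also be done without interp1d at all; here is a minimal sketch of the 'previous' case using numpy.searchsorted (assuming x is sorted, as interp1d itself requires):

```python
import numpy as np

x = np.array([1.4, 2.8, 3.1, 4.4, 5.2])
y = ['A', 'B', 'A', 'A', 'B']
x_new = [1, 2, 3, 4, 5]

# index of the last sample point <= each query point; -1 means "before the data"
idx = np.searchsorted(x, x_new, side='right') - 1
y_new = [y[i] if i >= 0 else None for i in idx]
# [None, 'A', 'B', 'A', 'A']
```

Because the lookup works purely on positions, it never touches the string values, so any hashable labels work unchanged.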
Related
I'm trying to build a pandas DataFrame of chromatic frequencies between A1 (55Hz) and A8 (7040Hz). Essentially, I want it to look like this...
df = pd.DataFrame(columns=['A', 'A#', 'B', 'C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#'])
df.loc[0] = (55, 58.27, 61.74, 32.7, 34.65, 36.71, 38.89, 41.2, 43.65, 46.25, 49, 51.91)
But without having to manually assign all the frequencies to their respective notes and with an octave per row (octave 1 to 8).
Based on the site http://pages.mtu.edu/~suits/notefreqs.html, the ratio between adjacent notes (a 'half-step') is 2**(1/12), so given a single note:
def hz_stepper(fixed_note, steps):
    a = 2 ** (1/12)
    return fixed_note * a ** steps
Using that function hz_stepper, I can chromatically raise or lower a given note n times by passing 1 or -1 as the steps argument.
My question is, how do I create a DataFrame where all the rows look like how I did it manually, but using a list comprehension to form the rows?
Just iterate over the pitches and reshape the result afterwards:
import numpy as np
import pandas as pd
base = 55.
n_octave = 8
columns = ['A', 'A#', 'B', 'C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#']
factors = 2**(np.arange(12 * n_octave) / 12.)
pd.DataFrame(data=base * factors.reshape((n_octave, 12)), columns=columns)
Explanation
factors holds the desired frequencies as a 1d numpy array, but they are not in the tabular form required for the DataFrame. reshape creates a view of the array content with shape (n_octave, 12), such that rows are contiguous. E.g.
>>> np.arange(6).reshape((2, 3))
array([[0, 1, 2],
[3, 4, 5]])
This is just the format needed for the DataFrame.
Starting from your beginning:
df = pd.DataFrame(columns=['A', 'A#', 'B', 'C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#'])
df.loc[0] = 55*2**(np.arange(12)/12)
for i in range(7): df.loc[i+1] = 2*df.loc[i]  # octaves 2 through 8
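Since the question specifically asked for a list comprehension, the same table can also be built row by row with the hz_stepper function from the question; a sketch (starting each octave's row from its A, at 55 * 2**octave Hz):

```python
import pandas as pd

def hz_stepper(fixed_note, steps):
    a = 2 ** (1/12)
    return fixed_note * a ** steps

columns = ['A', 'A#', 'B', 'C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#']
# one row per octave: start from that octave's A and step up 0..11 half-steps
data = [[hz_stepper(55 * 2**octave, step) for step in range(12)]
        for octave in range(8)]
df = pd.DataFrame(data, columns=columns)
```

The first row then starts at 55 Hz (A1), and the 'A' column doubles down the rows: 55, 110, 220, and so on.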
I have the following array:
a=[['A', 'B'],
['B', 'B'],
['B', 'C'],
['C', 'B'],
['B', 'A'],
['A', 'D'],
['D', 'D'],
['D', 'A'],
['A', 'B'],
['B', 'A'],
['A', 'D']]
I wish to make a transition probability matrix of this, such that I get:
[[P_AA,P_AB,P_AC,P_AD],
[P_BA,P_BB,P_BC,P_BD],
[P_CA,P_CB,P_CC,P_CD],
[P_DA,P_DB,P_DC,P_DD]]
(The above is for illustration), where P_AA counts how many ["A","A"] are in the array a, and so on, divided by P_AA+P_AB+P_AC+P_AD. I have started by using Counter:
from collections import Counter
Counter(tuple(x) for x in a)
which counts the elements of array correctly as:
Counter({('A', 'B'): 2,
('B', 'B'): 1,
('B', 'C'): 1,
('C', 'B'): 1,
('B', 'A'): 2,
('A', 'D'): 2,
('D', 'D'): 1,
('D', 'A'): 1})
So the matrix shall be,
[[0,2/4,0,2/4],[2/4,1/4,1/4,0],[0,1,0,0],[1/2,0,0,1/2]]
A pandas-based solution:
import pandas as pd
from collections import Counter
# Create a raw transition matrix
matrix = pd.Series(Counter(map(tuple, a))).unstack().fillna(0)
# Normalize the rows
matrix.divide(matrix.sum(axis=1),axis=0)
# A B C D
#A 0.0 0.50 0.00 0.5
#B 0.5 0.25 0.25 0.0
#C 0.0 1.00 0.00 0.0
#D 0.5 0.00 0.00 0.5
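A quick sanity check of the pandas approach, as a self-contained sketch with the transition list a from the question; every row of a normalized transition matrix should sum to 1:

```python
import pandas as pd
from collections import Counter

a = [['A', 'B'], ['B', 'B'], ['B', 'C'], ['C', 'B'], ['B', 'A'],
     ['A', 'D'], ['D', 'D'], ['D', 'A'], ['A', 'B'], ['B', 'A'], ['A', 'D']]

# unstack() pivots the second element of each tuple key into the columns
counts = pd.Series(Counter(map(tuple, a))).unstack().fillna(0)
P = counts.divide(counts.sum(axis=1), axis=0)

# every row of the normalized matrix sums to 1
assert (abs(P.sum(axis=1) - 1) < 1e-12).all()
```

Note that unstack() only creates rows and columns for labels that actually occur, so a state that never appears as a source (or target) would be missing from the matrix.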
If the number of elements is small, simply looping over all elements should be no problem:
import numpy as np
a = [['A', 'B'], ['B', 'B'], ['B', 'C'], ['C', 'B'], ['B', 'A'],
     ['A', 'D'], ['D', 'D'], ['D', 'A'], ['A', 'B'], ['B', 'A'], ['A', 'D']]
a = np.asarray(a)
elems = np.unique(a)
dim = len(elems)
P = np.zeros((dim, dim))
for j, x_in in enumerate(elems):
    for k, x_out in enumerate(elems):
        P[j, k] = (a == [x_in, x_out]).all(axis=1).sum()
    if P[j, :].sum() > 0:
        P[j, :] /= P[j, :].sum()
Output:
array([[0. , 0.5 , 0. , 0.5 ],
[0.5 , 0.25, 0.25, 0. ],
[0. , 1. , 0. , 0. ],
[0.5 , 0. , 0. , 0.5 ]])
But you could also use the counter with a pre-allocated transition matrix, map the elements to indices, assign the counts as values, and normalize (last two steps just like I did).
from collections import Counter
a = [['A', 'B'],
['B', 'B'],
['B', 'C'],
['C', 'B'],
['B', 'A'],
['A', 'D'],
['D', 'D'],
['D', 'A'],
['A', 'B'],
['B', 'A'],
['A', 'D']]
counts = Counter(map(tuple, a))
letters = 'ABCD'
p = []
for letter in letters:
    d = sum(v for k, v in counts.items() if k[0] == letter)
    p.append([counts.get((letter, x), 0) / d for x in letters])
print(p)
Output:
[[0.0, 0.5, 0.0, 0.5],
[0.5, 0.25, 0.25, 0.0],
[0.0, 1.0, 0.0, 0.0],
[0.5, 0.0, 0.0, 0.5]]
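The hard-coded letters = 'ABCD' assumes the states are known in advance; if not, they can be derived from the data. The same loop with that one change:

```python
from collections import Counter

a = [['A', 'B'], ['B', 'B'], ['B', 'C'], ['C', 'B'], ['B', 'A'],
     ['A', 'D'], ['D', 'D'], ['D', 'A'], ['A', 'B'], ['B', 'A'], ['A', 'D']]

counts = Counter(map(tuple, a))
# derive the state labels from the data instead of hard-coding them
letters = sorted(set(x for pair in a for x in pair))  # ['A', 'B', 'C', 'D']

p = []
for letter in letters:
    d = sum(v for k, v in counts.items() if k[0] == letter)
    p.append([counts.get((letter, x), 0) / d for x in letters])
```

As in the original answer, this still assumes every state appears at least once as a source; otherwise d would be 0 and the division would fail.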
This is a problem that fits itertools and Counter perfectly. Take a look at the following¹:
l = [['A', 'B'],
['B', 'B'],
['B', 'C'],
['C', 'B'],
['B', 'A'],
['A', 'D'],
['D', 'D'],
['D', 'A'],
['A', 'B'],
['B', 'A'],
['A', 'D']]
from collections import Counter
from itertools import product, groupby
unique_elements = set(x for y in l for x in y) # -> {'B', 'C', 'A', 'D'}
appearances = Counter(tuple(x) for x in l)
# generating all possible combinations to get the probabilities
all_combinations = sorted(list(product(unique_elements, unique_elements)))
# calculating and arranging the probabilities
table = []
for i, g in groupby(all_combinations, key=lambda x: x[0]):
    g = list(g)
    local_sum = sum(appearances.get(y, 0) for y in g)
    table.append([appearances.get(x, 0) / local_sum for x in g])
# [[0.0, 0.5, 0.0, 0.5], [0.5, 0.25, 0.25, 0.0], [0.0, 1.0, 0.0, 0.0], [0.5, 0.0, 0.0, 0.5]]
¹ I am assuming you have a mistake in the formulation of your question: "...where P_AA counts how many ["A","A"] are in the array a and so on divided by P_AA + P_AB + P_AC + P_AD...". You meant to divide by something else, right?
My question is how to get the indices of an array of strings that would sort another array.
I have these two arrays of strings:
A = np.array([ 'a', 'b', 'c', 'd' ])
B = np.array([ 'd', 'b', 'a', 'c' ])
I would like to get the indices that would sort the second one in order to match the first.
I have tried the np.argsort function giving the second array (transformed in a list) as order, but it doesn't seem to work.
Any help would be much appreciated.
Thanks and best regards,
Bradipo
edit:
def sortedIndxs(arr):
    ???
such that
sortedIndxs([ 'd', 'b', 'a', 'c' ]) = [2,1,3,0]
A vectorised approach is possible via numpy.searchsorted together with numpy.argsort:
import numpy as np
A = np.array(['a', 'b', 'c', 'd'])
B = np.array(['d', 'b', 'a', 'c'])
xsorted = np.argsort(B)
res = xsorted[np.searchsorted(B[xsorted], A)]
print(res)
[2 1 3 0]
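To convince yourself the result is right: indexing B with res should reproduce A exactly, since res maps each element of A to its position in B.

```python
import numpy as np

A = np.array(['a', 'b', 'c', 'd'])
B = np.array(['d', 'b', 'a', 'c'])

xsorted = np.argsort(B)
res = xsorted[np.searchsorted(B[xsorted], A)]

# res maps each element of A to its position in B, so B[res] == A
assert list(B[res]) == list(A)
```

This relies on every element of A actually occurring in B; searchsorted would silently return a neighbouring index for a missing element.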
Here is code that obtains a conversion rule from one arbitrary permutation to another.
creating indexTable: O(n)
examining indexTable: O(n)
Total: O(n)
A = [ 'a', 'b', 'c', 'd' ]
B = [ 'd', 'b', 'a', 'c' ]
indexTable = {k: v for v, k in enumerate(B)}
# {'d': 0, 'b': 1, 'a': 2, 'c': 3}
result = [indexTable[k] for k in A]
# [2, 1, 3, 0]
I am trying to find the durations of all the notes with the same name in a music piece (from .xml), in Python. I have 3 lists:
scales = ['A', 'B', 'C'] #scales names
notesAll = ['B', 'A', 'C', 'A', 'C', 'C'] #note names of the piece
durationsAll = [1, 1.5, 1.5, 1, 1, 2] #duration of each note from notesAll list
I want to sum all the durations from the durationsAll list for all the notes with the same name. For example, for all the 'A's in notesAll (i.e. scales[0]): durationsAll[1] + durationsAll[3] = 1.5 + 1 = 2.5. I need a better solution than my attempt:
for sc in scales:
    for ntPosition, nt in enumerate(notesAll):
        dtOfEach = 0
        for dtPosition, dt in enumerate(durationsAll):
            if sc == nt:
                dtPosotion = ntPosition #I guess here is the problem
                dtOfEach = dtOfEach + dt
The result I want would be: dtOfEach: 2.5 1 4.5
Should the first sum be 2.5? If so, something like the code at the end of this message should work. I think you basically had it with the for loop and enumerate, but I wasn't sure what the second enumerate was for.
Also you have a line that reads:
dtPosotion = ntPosition #I guess here is the problem
Is dtPosotion a typo? Or are you trying to set dtPosition = ntPosition? If so, I don't think that's possible, because dtPosition and ntPosition are set by your enumerate loops.
scales = ['A', 'B', 'C'] #scales names
notesAll = ['B', 'A', 'C', 'A', 'C', 'C'] #notes names of the piece
durationsAll = [1, 1.5, 1.5, 1, 1, 2] #duration of each note from notesAll list
sums = [0, 0, 0]
for s in scales:
    print("Scale letter: " + str(s))
    for i, j in enumerate(notesAll):
        if j == s:
            sums[scales.index(s)] += durationsAll[i]
    print(sums)
You could use indices and proceed like this:
def get_total_duration(note):
    return sum(durationsAll[idx] for idx, nt in enumerate(notesAll) if nt == note)
scales = ['A', 'B', 'C'] #scales names
notesAll = ['B', 'A', 'C', 'A', 'C', 'C'] #note names of the piece
durationsAll = [1, 1.5, 1.5, 1, 1, 2]
get_total_duration('A')
output:
2.5
You could do an indirect sort and then group using groupby
from itertools import groupby
scales = ['A', 'B', 'C'] #scales names
notesAll = ['B', 'A', 'C', 'A', 'C', 'C'] #note names of the piece
durationsAll = [1, 1.5, 1.5, 1, 1, 2] #duration of each note from notesAll list
# do indirect sort
idx = sorted(range(len(notesAll)), key=notesAll.__getitem__)
# group and sum
result = {k: sum(map(durationsAll.__getitem__, grp))
for k, grp in groupby(idx, notesAll.__getitem__)}
# {'A': 2.5, 'B': 1, 'C': 4.5}
defaultdict is perfect here (assuming each note and its duration are in corresponding lists, of course):
from collections import defaultdict
duration_total = defaultdict(int)
for note in list(zip(notesAll, durationsAll)):
    duration_total[note[0]] += note[1]
print(duration_total)
defaultdict(<class 'int'>, {'B': 1, 'A': 2.5, 'C': 4.5})
I think A:3.5 in your question was a typo ?
EDIT
Updated code with Chris' suggestion:
for note, duration in zip(notesAll, durationsAll):
    duration_total[note] += duration
It can be done with a list comprehension, exploiting the fact that booleans behave like integers:
In [3]: S = ['A', 'B', 'C'] #scales names
...: N = ['B', 'A', 'C', 'A', 'C', 'C'] #note names of the piece
...: D = [1, 1.5, 1.5, 1, 1, 2] #duration of each note from notesAll list
In [4]: [sum(match*d for match, d in zip((n==s for n in N), D)) for s in S]
Out[4]: [2.5, 1.0, 4.5]
If you need a dictionary
In [5]: {s:sum(match*d for match, d in zip((n==s for n in N), D)) for s in S}
Out[5]: {'A': 2.5, 'B': 1.0, 'C': 4.5}
or, avoiding one loop and w/o importing collections
In [6]: r = {}
In [7]: for n, d in zip(N, D): r[n] = r.get(n, 0) + d
In [8]: r
Out[8]: {'A': 2.5, 'B': 1, 'C': 4.5}
where we access the data in the dictionary not by indexing but using the dictionary .get(key, default) method that makes it possible to initialize correctly our values.
I have a list of strings
x = ['A', 'B', nan, 'D']
and want to remove the nan.
I tried:
x = x[~numpy.isnan(x)]
But that only works if it contains numbers. How do we solve this for strings in Python 3+?
If you have a numpy array, you can simply check whether the item is the string 'nan' (numpy converts nan to that string in a string array); but if you have a list, you can check identity with is and np.nan, since it's a singleton object.
In [25]: x = np.array(['A', 'B', np.nan, 'D'])
In [26]: x
Out[26]:
array(['A', 'B', 'nan', 'D'],
dtype='<U3')
In [27]: x[x != 'nan']
Out[27]:
array(['A', 'B', 'D'],
dtype='<U3')
In [28]: x = ['A', 'B', np.nan, 'D']
In [30]: [i for i in x if i is not np.nan]
Out[30]: ['A', 'B', 'D']
Or as a functional approach in case you have a python list:
In [34]: from operator import is_not
In [35]: from functools import partial
In [37]: f = partial(is_not, np.nan)
In [38]: x = ['A', 'B', np.nan, 'D']
In [39]: list(filter(f, x))
Out[39]: ['A', 'B', 'D']
You can use math.isnan and a good old list comprehension; note that math.isnan raises a TypeError on non-float items such as strings, so guard the call.
Something like this would do the trick:
import math
x = [y for y in x if not (isinstance(y, float) and math.isnan(y))]
You may want to avoid np.nan with strings and use None instead; but if you do have nan, you could do this:
import numpy as np
[i for i in x if i is not np.nan]
# ['A', 'B', 'D']
You could also try this:
[s for s in x if str(s) != 'nan']
Or, convert everything to str at the beginning:
[s for s in map(str, x) if s != 'nan']
Both approaches yield ['A', 'B', 'D'].
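One further NaN-specific trick, in case the list mixes strings with real float NaN values: NaN is the only value that is not equal to itself, so a comparison-based sketch:

```python
x = ['A', 'B', float('nan'), 'D']

# NaN is the only value for which v == v is False; strings always equal themselves
cleaned = [v for v in x if v == v]
# ['A', 'B', 'D']
```

This only filters genuine NaN floats; the string 'nan' compares equal to itself and would be kept, unlike with the str(s) != 'nan' approaches above.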