I have a list of sentences and I am looking to extract contents between two items.
If the start or end item does not exist, I want it to return a row with padding only.
I already have the sentences tokenized and padded with 0 to a fixed length.
I figured a way to do this using for loops, but it is extremely slow, so would like to
know what is the best way to solve this, probably by using tensor operations.
import torch
start_value, end_value = 4,9
data = torch.tensor([
[3,4,7,8,9,2,0,0,0,0],
[1,5,3,4,7,2,8,9,10,0],
[3,4,7,8,10,0,0,0,0,0], # does not contain end value
[3,7,5,9,2,0,0,0,0,0], # does not contain start value
])
# expected output
[
[7,8,0,0,0,0,0,0,0,0],
[7,2,8,0,0,0,0,0,0,0],
[0,0,0,0,0,0,0,0,0,0],
[0,0,0,0,0,0,0,0,0,0],
]
# or
[
[0,0,7,8,0,0,0,0,0,0],
[0,0,0,0,7,2,8,0,0,0],
[0,0,0,0,0,0,0,0,0,0],
[0,0,0,0,0,0,0,0,0,0],
]
My current solution, shown below, uses a for loop. It also does not produce a fixed-shape, padded array like the one in the expected output.
def _get_part_from_tokens(
    self,
    data: torch.Tensor,
    s_id: int,
    e_id: int,
) -> list[torch.Tensor]:
    input_ids = []
    for row in data:
        try:
            # index of the first occurrence of each marker in this row
            s_index = (row == s_id).nonzero(as_tuple=True)[0][0]
            e_index = (row == e_id).nonzero(as_tuple=True)[0][0]
        except IndexError:
            # one of the markers is missing
            input_ids.append(torch.tensor([]))
            continue
        if s_index is None or e_index is None or s_index > e_index:
            input_ids.append(torch.tensor([]))
            continue
        ind = torch.arange(s_index + 1, e_index)
        input_ids.append(row.index_select(0, ind))
    return input_ids
A possible loop-free approach is this:
import torch
# using the provided sample data
start_value, end_value = 4,9
data = torch.tensor([
[3,4,7,8,9,2,0,0,0,0],
[1,5,3,4,7,2,8,9,10,0],
[3,4,7,8,10,0,0,0,0,0], # does not contain end value
[3,7,5,9,2,0,0,0,0,0], # does not contain start value
[3,7,5,8,2,0,0,0,0,0], # does not contain start or end value
])
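First, find the rows in which the number of start_value and end_value occurrences differs (for example, rows that contain only one of the two) and fill these rows with 0. Note that data is modified in place.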
# fill 'invalid' rows with 0
starts = (data == start_value)
ends = (data == end_value)
invalid = ((starts.sum(axis=1) - ends.sum(axis=1)) != 0)
data[invalid] = 0
Then set the values up to (and including) the start_value and after (and including) the end_value to 0 in each row. This step targets mainly the 'valid' rows. Nevertheless, all other rows will (again) be overwritten with zeros.
# set values in the start and end of 'valid rows' to 0
row_length = data.shape[1]
start_idx = starts.long().argmax(axis=1)
start_mask = (start_idx[:,None] - torch.arange(row_length))>=0
data[start_mask] = 0
end_idx = row_length - ends.long().argmax(axis=1)
end_mask = (end_idx[:,None] + torch.arange(row_length))>=row_length
data[end_mask] = 0
Note: This also works if a row contains neither a start_value nor an end_value (I added such a row to the sample data). Still, there are many more edge cases that one could think of (e.g. multiple start and end values in one row, start value after end value, ...). I am not sure whether they are relevant for the specific problem.
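The edge cases mentioned in the note (several markers in one row, an end value before the start value) can be covered with a single boolean mask. The following is only a sketch of one possible variant, not the code that was timed in this answer: the function name extract_between and the pad parameter are made up for illustration, and it assumes that only the first start_value and the first end_value after it matter. It returns a new tensor instead of overwriting data.

```python
import torch

def extract_between(data, start_value, end_value, pad=0):
    """Keep only the values strictly between the first start marker and the
    first end marker that follows it; everything else becomes `pad`."""
    cols = torch.arange(data.shape[1])
    starts = data == start_value
    has_start = starts.any(dim=1)
    # argmax returns the first maximal index, i.e. the first start marker
    start_idx = starts.long().argmax(dim=1)
    # only end markers located after the first start marker count
    ends_after = (data == end_value) & (cols > start_idx[:, None])
    has_end = ends_after.any(dim=1)
    end_idx = ends_after.long().argmax(dim=1)
    keep = (cols > start_idx[:, None]) & (cols < end_idx[:, None])
    keep &= (has_start & has_end)[:, None]  # rows missing a marker are all padding
    return torch.where(keep, data, torch.full_like(data, pad))

data = torch.tensor([
    [3, 4, 7, 8, 9, 2, 0, 0, 0, 0],
    [1, 5, 3, 4, 7, 2, 8, 9, 10, 0],
    [3, 4, 7, 8, 10, 0, 0, 0, 0, 0],  # does not contain end value
    [3, 7, 5, 9, 2, 0, 0, 0, 0, 0],   # does not contain start value
])
result = extract_between(data, 4, 9)
print(result)
```

This produces the second, position-preserving form of the expected output from the question.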
Comparison of execution time
Using timeit and randomly generated data to compare the execution time of the different approaches suggests that the approach without loops is considerably faster than the approach from the question. If the data is converted to numpy first and converted back to PyTorch afterwards, some further (very minor) time savings are possible.
Each dot (execution time) in the plot is the minimum value of 3 trials each with 100 repetitions.
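For reference, a measurement of the "minimum of 3 trials with 100 repetitions each" kind can be set up roughly as follows. This is only a sketch: best_time and first_start_index are hypothetical names, the data shape is arbitrary, and the timed function is a small stand-in rather than one of the approaches compared above.

```python
import timeit
import torch

def first_start_index(data, start_value):
    # stand-in workload: index of the first start_value in each row
    return (data == start_value).long().argmax(dim=1)

def best_time(fn, *args, repeat=3, number=100):
    """Run `number` calls per trial, `repeat` trials, and return the
    per-call time of the fastest trial."""
    timings = timeit.repeat(lambda: fn(*args), repeat=repeat, number=number)
    return min(timings) / number

data = torch.randint(0, 11, (1000, 10))  # random token ids, arbitrary shape
seconds = best_time(first_start_index, data, 4)
print(f"{seconds:.2e} s per call")
```

Taking the minimum over the trials reduces the influence of other processes running at the same time.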
This is my attempt at improving @rosa b.'s algorithm.
Could you try this:
def function1(
    data: torch.Tensor,
    start_value: int,
    end_value: int,
):
    # fill 'invalid' rows with 0
    row_length = data.shape[1]
    starts = (data == start_value)
    ends = (data == end_value)
    invalid = ((starts.sum(axis=1) - ends.sum(axis=1)) != 0)
    data[invalid] = 0
    valid_ind = torch.where(torch.logical_not(invalid))
    # set values in the start and end of 'valid rows' to 0
    arange_arr = torch.arange(row_length)
    start_idx = starts.long()[valid_ind].argmax(axis=1)
    start_mask = (start_idx[:, None] - arange_arr) >= 0
    end_idx = row_length - ends.long()[valid_ind].argmax(axis=1)
    end_mask = (end_idx[:, None] + arange_arr) >= row_length
    mask = torch.logical_or(start_mask, end_mask)
    tmp = data[valid_ind]
    tmp.masked_fill_(mask, 0)
    data[valid_ind] = tmp
    return data
The main idea is that the set of valid rows is probably small, so many computations can be skipped. I also made some other minor changes, so it should be slightly faster.
(Sorry I don't have enough reputation to make a comment).
I have a variable with zeros and ones. Each sequence of ones represents "a phase" I want to observe, and each sequence of zeros represents the space/distance between these phases.
It may happen that a phase carries a sort of "impulse response", for example the echo of a voice: in this case we will have 1,1,1,1,0,0,1,1,1,0,0,0 as an output, where the first sequence of ones is the shout we made, while the second one is just the echo caused by the shout.
So I wrote a function that ignores the echoes/responses of the main shout/action and converts the ones of the echo/response into zeros.
(1) If the sequence of zeros is greater than or equal to the input threshold nearby_thr, the function recognizes the following sequence of ones as an independent phase and does not delete or change anything.
(2) If the sequence of zeros (between two sequences of ones) is smaller than the input threshold nearby_thr, the function recognizes that we have "an impulse response/echo", which is not taken into account: its ones are converted into zeros.
I wrote a naive function that accomplishes this, but I was wondering if pandas already has a function like that, or if it can be accomplished in a few lines, without writing a "C-like" function.
Here's my code:
import pandas as pd
import matplotlib.pyplot as plt
# import utili_funzioni.util00 as ut0
x1 = pd.DataFrame([0,0,0,0,0,0,0,1,1,1,1,1,0,0,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1])
x2 = pd.DataFrame([0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,1,1,0])
# rule = x1==1 ## counting number of consecutive ones
# cumsum_ones = rule.cumsum() - rule.cumsum().where(~rule).ffill().fillna(0).astype(int)
def detect_nearby_el_2(df, nearby_thr):
    global el2del
    # df = consecut_zeros
    # i = 0
    print("")
    print("")
    j = 0
    enterOnce_if = 1
    reset_count_0s = 0
    start2detect = False
    count0s = 0 # init
    start2_getidxs = False # if this is not true, it won't store idxs to delete
    el2del = [] # store idxs to delete elements
    for i in range(df.shape[0]):
        print("")
        print("i: ", i)
        x_i = df.iloc[i, 0]
        if x_i == 1 and j==0: # first phase (ones) has been detected
            start2detect = True # first phase (ones) has been detected
            # j += 1
        print("count0s:",count0s)
        if start2detect == True: # first phase, seen/detected, --> (wait) has ended..
            if x_i == 0: # 1st phase detected and ended with "a zero"
                if reset_count_0s == 1:
                    count0s = 0
                    reset_count_0s = 0
                count0s += 1
                if enterOnce_if == 1:
                    start2_getidxs=True # avoiding to delete first phase
                    enterOnce_0 = 0
            if start2_getidxs==True: # avoiding to delete first phase
                if x_i == 1 and count0s < nearby_thr:
                    print("this is NOT a new phase!")
                    el2del = [*el2del, i] # idxs to delete
                    reset_count_0s = 1 # reset counter
                if x_i == 1 and count0s >= nearby_thr:
                    print("this is a new phase!") # nothing to delete
                    reset_count_0s = 1 # reset counter
    return el2del
def convert_nearby_el_into_zeros(df,idx):
    df0 = df + 0 # error original dataframe is modified!
    if len(idx) > 0:
        # df.drop(df.index[idx]) # to delete completely
        df0.iloc[idx] = 0
    else:
        print("no elements nearby to delete!!")
    return df0
######
print("")
x1_2del = detect_nearby_el_2(df=x1,nearby_thr=3)
x2_2del = detect_nearby_el_2(df=x2,nearby_thr=3)
## deleting nearby elements
x1_a = convert_nearby_el_into_zeros(df=x1,idx=x1_2del)
x2_a = convert_nearby_el_into_zeros(df=x2,idx=x2_2del)
## PLOTTING
# ut0.grayplt()
fig1 = plt.figure()
fig1.suptitle("x1",fontsize=20)
ax1 = fig1.add_subplot(1,2,1)
ax2 = fig1.add_subplot(1,2,2,sharey=ax1)
ax1.title.set_text("PRE-detect")
ax2.title.set_text("POST-detect")
line1, = ax1.plot(x1)
line2, = ax2.plot(x1_a)
fig2 = plt.figure()
fig2.suptitle("x2",fontsize=20)
ax1 = fig2.add_subplot(1,2,1)
ax2 = fig2.add_subplot(1,2,2,sharey=ax1)
ax1.title.set_text("PRE-detect")
ax2.title.set_text("POST-detect")
line1, = ax1.plot(x2)
line2, = ax2.plot(x2_a)
You can see that x1 has two "responses/echoes" that I do not want to take into account, while x2 has none; in fact, nothing changed in x2.
My question is: how can this be accomplished in a few lines using pandas?
Thank You
Interesting problem, and I'm sure there's a more elegant solution out there, but here is my attempt - it's at least fairly performant:
x1 = pd.Series([0,0,0,0,0,0,0,1,1,1,1,1,0,0,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1])
x2 = pd.Series([0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,1,1,0])
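def remove_echos(series, threshold):
    # a section starts at a 1 that is preceded by a 0
    starting_points = (series==1) & (series.shift()==0)
    # the parentheses matter: & binds more tightly than ==
    echo_starting_points = starting_points & (series.shift(threshold)==1)
    echo_starting_points = series[echo_starting_points].index
    # sentinel one past the last label (assumes the default integer index)
    change_points = series[starting_points].index.to_list() + [series.index[-1] + 1]
    for (start, end) in zip(change_points, change_points[1:]):
        if start in echo_starting_points:
            # .loc slices include the end label, so stop one before the next start
            series.loc[start:end - 1] = 0
    return series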
x1 = remove_echos(x1, 3)
x2 = remove_echos(x2, 3)
(I changed x1 and x2 to be Series instead of DataFrame, it's easy to adapt this code to work with a df if you need to.)
Explanation: we define the "starting point" of each section as a 1 preceded by a 0. Of those, a starting point is an "echo" starting point if the value threshold places before it is a 1. (The assumption is that we don't have a phase that is shorter than threshold.) For each echo starting point, we zero from it to the next starting point or the end of the Series.
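If a loop-free version is preferred, the rules (1) and (2) from the question can also be expressed through run lengths. The following is only a sketch of an alternative, not the approach explained above: the name remove_short_gap_phases is hypothetical, and it assumes that a phase counts as an echo when the run of zeros directly before it is shorter than threshold, and that the first phase is never an echo.

```python
import pandas as pd

def remove_short_gap_phases(series: pd.Series, threshold: int) -> pd.Series:
    is_one = series.eq(1)
    # a new run starts wherever the value differs from the previous one
    run_id = is_one.ne(is_one.shift()).cumsum()
    runs = pd.DataFrame({
        "is_one": is_one.groupby(run_id).first(),
        "length": is_one.groupby(run_id).size(),
    })
    # runs alternate between zeros and ones, so the previous run is the gap
    gap_before = runs["length"].shift()
    not_first_phase = runs["is_one"].cumsum() > 1
    is_echo = runs["is_one"] & (gap_before < threshold) & not_first_phase
    # map the per-run decision back to every element and zero the echoes
    return series.where(~run_id.map(is_echo), 0)

x1 = pd.Series([0,0,0,0,0,0,0,1,1,1,1,1,0,0,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1])
x2 = pd.Series([0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,1,1,0])
x1_clean = remove_short_gap_phases(x1, 3)
x2_clean = remove_short_gap_phases(x2, 3)
print(x1_clean.tolist())
```

Unlike the function above, this variant returns a new Series and leaves its input untouched.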
I have seen many solutions for generating random floats within a specific range (like this), which actually helps me, and solutions for generating random floats summing to 1 (like this). Separately, these solutions work perfectly, but I can't figure out how to merge them.
Currently my code is:
import random
def sample_floats(low, high, k=1):
    """ Return a k-length list of unique random floats
        in the range of low <= x <= high
    """
    result = []
    seen = set()
    for i in range(k):
        x = random.uniform(low, high)
        while x in seen:
            x = random.uniform(low, high)
        seen.add(x)
        result.append(x)
    return result
And still, applying
weights = sample_floats(0.055, 1.0, 11)
weights /= np.sum(weights)
returns a weights array in which some floats are less than 0.055.
Should I somehow implement np.random.dirichlet in function above, or it should be built on the basis of np.random.dirichlet and then implement condition > 0.055? Can't figure any solution.
Thank you in advance!
IIUC, you want to generate an array of k values, with minimum value of low=0.055.
It is easier to generate numbers from 0 that sum up to 1-low*k, and then to add low so that the final array sums to 1. Thus, this guarantees both the lower bound and the sum.
Regarding the high, I am pretty sure it is mathematically impossible to add this constraint, as once you fix the lower bound and the sum, there are not enough degrees of freedom to choose an upper bound. The upper bound will be 1-low*(k-1) (here 0.505).
Also, be aware that, with a minimum value, you necessarily enforce a maximum k of 1//low (here 18 values). If you set k higher, the low bound won't be correct.
import numpy as np
# parameters
low = 0.055
k = 10
a = np.random.rand(k)
a = (a/a.sum()*(1-low*k))
weights = a+low
# checking that the sum is 1
assert np.isclose(weights.sum(), 1)
Example output:
array([0.13608635, 0.06796974, 0.07444545, 0.1361171 , 0.07217206,
0.09223554, 0.12713463, 0.11012871, 0.1107402 , 0.07297022])
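Since the question mentions np.random.dirichlet: the same shift-and-scale idea also works if the non-negative part is drawn from a Dirichlet distribution, which already sums to 1. The following is only a sketch under that assumption; the function name sample_weights and the rng parameter are made up for illustration, and a flat Dirichlet is not the same distribution as the normalized uniform draws used above.

```python
import numpy as np

def sample_weights(low, k, rng=None):
    """Return k weights that sum to 1, each at least `low`."""
    if low * k > 1:
        raise ValueError("low * k must not exceed 1")
    rng = np.random.default_rng() if rng is None else rng
    shares = rng.dirichlet(np.ones(k))  # non-negative, sums to 1
    # distribute the mass left after reserving `low` for every weight
    return low + (1 - low * k) * shares

weights = sample_weights(0.055, 10, rng=np.random.default_rng(0))
print(weights, weights.sum())
```

As before, no weight can exceed 1-low*(k-1), because the other k-1 weights each take at least low.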
You could generate k-1 numbers iteratively by varying the upper bound of the uniform random number generator: at each iteration, the generated number must leave enough room for each of the remaining numbers to be at least low. The last number is whatever is still missing to reach a sum of 1.
def sample_floats(low, high, k=1):
    result = []
    generated = 0
    while generated < k-1:
        current_higher_bound = max(low, 1 - (k - 1 - generated)*low - sum(result))
        next_num = random.uniform(low, current_higher_bound)
        result.append(next_num)
        generated += 1
    last_num = 1 - sum(result)
    result.append(last_num)
    return result
print(sample_floats(0.01, 1, k=15))
#[0.08878760926151083,
# 0.17897435239586243,
# 0.5873150041878156,
# 0.021487776792166513,
# 0.011234379498998357,
# 0.012408564286727042,
# 0.015391011259745103,
# 0.01264921242128719,
# 0.010759267284382326,
# 0.010615007333002748,
# 0.010288605412288477,
# 0.010060487014659121,
# 0.010027216923973544,
# 0.010000064276203318,
# 0.010001441651377285]
The samples are correlated, so I believe you can't generate them in an IID way. You can, however, do it in an iterative manner, for example as I show in the code below. There are a few more special cases to check, such as what happens if the user inputs low > high or high*k < sum, but I figured you can find and account for them using my modification of your code.
import random
import warnings
def sample_floats(low = 0.055, high = 1., x_sum = 1., k = 1):
    """ Return a k-length list of unique random floats
        in the range of 'low' <= x <= 'high' summing up to 'x_sum'.
    """
    sum_i = 0
    xs = []
    if x_sum - (k-1)*low < high:
        warnings.warn(f'high = {high} is too high to be generated under the'
                      f' conditions set by k = {k}, sum = {x_sum}, and low = {low}.'
                      f' high automatically set to {x_sum - (k-1)*low}.')
        high = x_sum - (k-1)*low  # apply the cap announced in the warning
    if k == 1:
        if high < x_sum:
            raise ValueError(f'The parameter combination k = {k}, sum = {x_sum},'
                             f' and high = {high} is impossible.')
        else: return [x_sum]  # a one-element list, as the docstring promises
    high_i = high
    for i in range(k-1):
        x = random.uniform(low, high_i)
        xs.append(x)
        sum_i = sum_i + x
        # leave at least 'low' for each of the numbers still to come
        if high < (x_sum - sum_i - (k-1-i)*low):
            high_i = high
        else: high_i = x_sum - sum_i - (k-1-i)*low
    xs.append(x_sum - sum_i)
    return xs
For example:
random.seed(0)
xs = sample_floats(low = 0.055, high = 0.5, x_sum = 1., k = 5)
print(xs)
print(sum(xs))
Output:
[0.43076772392864643, 0.27801464913542906, 0.08495210994346317, 0.06568433355884717, 0.14058118343361425]
1.0
I am writing a program to discretize a set of attributes via entropy discretization. The goal is to parse the dataset
A,Class
5,1
12.5,1
11.5,2
8.6,2
7,1
6,1
5.9,2
1.5,2
9,2
7.8,1
2.1,1
13.5,2
12.45,2
Into
A,Class
1,1
3,1
3,2
2,2
2,1
2,1
1,2
1,2
3,2
2,1
1,1
3,2
3,2
The specific problem that I am facing with my program is determining the number of classes in my dataset. This takes place at numberOfClasses = s['Class'].value_counts(). I would like to use a pandas method to return the number of distinct classes. In this example there are only two. However I get back
Number of classes: 2 5
1 4
From the print statement.
import pandas as pd
import numpy as np
import entropy_based_binning as ebb
from math import log2
def main():
    df = pd.read_csv('S1.csv')
    s = df
    s = entropy_discretization(s)
# This method discretizes s A1
# If the information gain is 0, i.e the number of
# distinct class is 1 or
# If min f/ max f < 0.5 and the number of distinct values is floor(n/2)
# Then that partition stops splitting.
def entropy_discretization(s):
    informationGain = {}
    # while(uniqueValue(s)):
    # Step 1: pick a threshold
    threshold = 6
    # Step 2: Partition the data set into two partitions
    s1 = s[s['A'] < threshold]
    print("s1 after splitting")
    print(s1)
    print("******************")
    s2 = s[s['A'] >= threshold]
    print("s2 after splitting")
    print(s2)
    print("******************")
    # Step 3: calculate the information gain.
    informationGain = information_gain(s1,s2,s)
    print(informationGain)
    # # Step 5: calculate the max information gain
    # minInformationGain = min(informationGain)
    # # Step 6: keep the partitions of S based on the value of threshold_i
    # s = bestPartition(minInformationGain, s)
def uniqueValue(s):
    # all values of A in s are the same: return False
    if s.nunique()['A'] == 1:
        return False
    # otherwise return True
    else:
        return True
def bestPartition(maxInformationGain):
    # determine the best threshold_i
    threshold_i = 6
    return
def information_gain(s1, s2, s):
    # calculate cardinality for s1
    cardinalityS1 = len(pd.Index(s1['A']).value_counts())
    print(f'The Cardinality of s1 is: {cardinalityS1}')
    # calculate cardinality for s2
    cardinalityS2 = len(pd.Index(s2['A']).value_counts())
    print(f'The Cardinality of s2 is: {cardinalityS2}')
    # calculate cardinality of s
    cardinalityS = len(pd.Index(s['A']).value_counts())
    print(f'The Cardinality of s is: {cardinalityS}')
    # calculate informationGain
    informationGain = (cardinalityS1/cardinalityS) * entropy(s1) + (cardinalityS2/cardinalityS) * entropy(s2)
    print(f'The total informationGain is: {informationGain}')
    return informationGain
def entropy(s):
    # calculate the number of classes in s
    numberOfClasses = s['Class'].value_counts()
    print(f'Number of classes: {numberOfClasses}')
    # TODO calculate pi for each class.
    # calculate the frequency of class_i in S1
    p1 = 2/4
    p2 = 3/4
    ent = -(p1*log2(p2)) - (p2*log2(p2))
    return ent
main()
Ideally, I'd like to print Number of classes: 2. This way I can loop over the classes and calculate the frequencies for the attribute A from the dataset. I've reviewed the pandas documentation, but I got stuck at value_counts(). Any help would be greatly appreciated.
Maybe try:
number_of_classes = len(s['Class'].unique())
which will return the number of unique classes in the column Class.
Or even shorter:
s['Class'].nunique()
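Since the stated goal is to loop over the classes and compute their frequencies, value_counts can still be useful: with normalize=True it returns the relative frequency of each class directly. The following is a minimal sketch, assuming the entropy is meant to be the usual Shannon entropy in bits; the helper name class_entropy is hypothetical, and the labels are the Class column of the dataset from the question.

```python
import numpy as np
import pandas as pd

def class_entropy(labels: pd.Series) -> float:
    """Shannon entropy (in bits) of the class distribution in `labels`."""
    # relative frequency p_i of every class that occurs in labels
    probabilities = labels.value_counts(normalize=True)
    return float(-(probabilities * np.log2(probabilities)).sum())

classes = pd.Series([1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 2], name="Class")
print(f"Number of classes: {classes.nunique()}")
print(f"Entropy: {class_entropy(classes):.4f}")
```

Only classes that actually occur appear in value_counts, so no log2(0) term can arise.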
I'm writing a method for calculating the covariance of 2 to 8 time-series variables. I'm intending for the variables to be contained in list objects when they are passed to this method. The method should return 1 number, not a covariance matrix.
The method works fine the first time it's called. Any time it's called after that, it returns 0. An example is attached at the bottom, below my code. Any advice/feedback regarding the variable scope issues here would be greatly appreciated. Thanks!
p = [3,4,4,654]
o = [4,67,4,1]
class Toolkit():
    def CovarianceScalar(self, column1, column2 = [], column3 = [], column4 = [],column5 = [],column6 = [],column7 = [],column8 = []):
        """Assumes all columns have length equal to Len(column1)"""
        #If only the first column is passed, this will act as a variance function
        import numpy as npObject
        #This is a binary-style number that is assigned a value of 1 if one of the input vectors/lists has zero length. This way, the CovarianceResult variable can be computed, and the relevant
        # terms can have a 1 added to them if they would otherwise go to 0, preventing the CovarianceResult value from incorrectly going to 0.
        binUnityFlag2 = 1 if (len(column2) == 0) else 0
        binUnityFlag3 = 1 if (len(column3) == 0) else 0
        binUnityFlag4 = 1 if (len(column4) == 0) else 0
        binUnityFlag5 = 1 if (len(column5) == 0) else 0
        binUnityFlag6 = 1 if (len(column6) == 0) else 0
        binUnityFlag7 = 1 if (len(column7) == 0) else 0
        binUnityFlag8 = 1 if (len(column8) == 0) else 0
        # Some initial housekeeping: ensure that all input column lengths match that of the first column. (Will later advise the user if they do not.)
        lngExpectedColumnLength = len(column1)
        inputList = [column2, column3, column4, column5, column6, column7, column8]
        inputListNames = ["column2","column3","column4","column5","column6","column7","column8"]
        for i in range(0,len(inputList)):
            while len(inputList[i]) < lngExpectedColumnLength: #Empty inputs now become vectors of 1's.
                inputList[i].append(1)
        #Now start calculating the covariance of the inputs:
        avgColumn1 = sum(column1)/len(column1) #<-- Each column's average
        avgColumn2 = sum(column2)/len(column2)
        avgColumn3 = sum(column3)/len(column3)
        avgColumn4 = sum(column4)/len(column4)
        avgColumn5 = sum(column5)/len(column5)
        avgColumn6 = sum(column6)/len(column6)
        avgColumn7 = sum(column7)/len(column7)
        avgColumn8 = sum(column8)/len(column8)
        avgList = [avgColumn1,avgColumn2,avgColumn3,avgColumn4,avgColumn5, avgColumn6, avgColumn7,avgColumn8]
        #start building the scalar-valued result:
        CovarianceResult = float(0)
        for i in range(0,lngExpectedColumnLength):
            CovarianceResult +=((column1[i] - avgColumn1) * ((column2[i] - avgColumn2) + binUnityFlag2) * ((column3[i] - avgColumn3) + binUnityFlag3 ) * ((column4[i] - avgColumn4) + binUnityFlag4 ) *((column5[i] - avgColumn5) + binUnityFlag5) * ((column6[i] - avgColumn6) + binUnityFlag6 ) * ((column7[i] - avgColumn7) + binUnityFlag7)* ((column8[i] - avgColumn8) + binUnityFlag8))
        #Finally, divide the sum of the multiplied deviations by the sample size:
        CovarianceResult = float(CovarianceResult)/float(lngExpectedColumnLength) #Coerce both terms to a float-type to prevent return of array-type objects.
        return CovarianceResult
Example:
myInst = Toolkit() #Create a class instance.
First execution of the function:
myInst.CovarianceScalar(o,p)
#Returns -2921.25, the covariance of the numbers in lists o and p.
Second time around:
myInst.CovarianceScalar(o,p)
#Returns: 0.0
I believe that the problem you are facing is due to mutable default arguments. Basically, when you first execute myInst.CovarianceScalar(o,p), all columns other than the first two are []. During this execution, the function appends to these default lists, and default values are created only once, when the function is defined. Thus, when you execute myInst.CovarianceScalar(o,p) again, the other columns are not [] anymore: they keep whatever values they were given during the first execution (here, lists of 1's). Their binUnityFlag values are then 0 and every (column[i] - avg) term is 1 - 1 = 0, so the whole product, and therefore the result, is 0.
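The effect can be reproduced in isolation, and the usual way around it is to use None as the default and create the list inside the function. The following is a minimal sketch with made-up function names (append_one, append_one_fixed), not a rewrite of CovarianceScalar.

```python
def append_one(values=[]):
    # the default list is created once and shared by every call
    values.append(1)
    return values

def append_one_fixed(values=None):
    # None as a sentinel: a fresh list is created on every call
    if values is None:
        values = []
    values.append(1)
    return values

first = list(append_one())   # copy, so later calls cannot change it
second = list(append_one())  # the shared default still holds the earlier 1
print(first, second)
print(append_one_fixed(), append_one_fixed())
```

Applied to CovarianceScalar, this would mean declaring column2=None through column8=None and replacing each None with a new list at the top of the method.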