I have a list of sentences and I am looking to extract contents between two items.
If the start or end item does not exist, I want it to return a row with padding only.
I already have the sentences tokenized and padded with 0 to a fixed length.
I figured a way to do this using for loops, but it is extremely slow, so would like to
know what is the best way to solve this, probably by using tensor operations.
import torch
start_value, end_value = 4,9
data = torch.tensor([
[3,4,7,8,9,2,0,0,0,0],
[1,5,3,4,7,2,8,9,10,0],
[3,4,7,8,10,0,0,0,0,0], # does not contain end value
[3,7,5,9,2,0,0,0,0,0], # does not contain start value
])
# expected output
[
[7,8,0,0,0,0,0,0,0,0],
[7,2,8,0,0,0,0,0,0,0],
[0,0,0,0,0,0,0,0,0,0],
[0,0,0,0,0,0,0,0,0,0],
]
# or
[
[0,0,7,8,0,0,0,0,0,0],
[0,0,0,0,7,2,8,0,0,0],
[0,0,0,0,0,0,0,0,0,0],
[0,0,0,0,0,0,0,0,0,0],
]
The current solution that I have, which uses a for loop. It does not produce a symmetric array like I want in the expected output.
def _get_part_from_tokens(
self,
data: torch.Tensor,
s_id: int,
e_id: int,
) -> list[str]:
input_ids = []
for row in data:
try:
s_index = (row == s_id).nonzero(as_tuple=True)[0][0]
e_index = (row == e_id).nonzero(as_tuple=True)[0][0]
except IndexError:
input_ids.append(torch.tensor([]))
continue
if s_index is None or e_index is None or s_index > e_index:
input_ids.append(torch.tensor([]))
continue
ind = torch.arange(s_index + 1, e_index)
input_ids.append(row.index_select(0, ind))
return input_ids
A possible loop-free approach is this:
import torch
# using the provided sample data
start_value, end_value = 4,9
data = torch.tensor([
[3,4,7,8,9,2,0,0,0,0],
[1,5,3,4,7,2,8,9,10,0],
[3,4,7,8,10,0,0,0,0,0], # does not contain end value
[3,7,5,9,2,0,0,0,0,0], # does not contain start value
[3,7,5,8,2,0,0,0,0,0], # does not contain start or end value
])
First, check which rows contain only a start_value or an end_value and fill these rows with 0.
# fill 'invalid' rows with 0
starts = (data == start_value)
ends = (data == end_value)
invalid = ((starts.sum(axis=1) - ends.sum(axis=1)) != 0)
data[invalid] = 0
Then set the values up to (and including) the start_value and after (and including) the end_value to 0 in each row. This step targets mainly the 'valid' rows. Nevertheless, all other rows will (again) be overwritten with zeros.
# set values in the start and end of 'valid rows' to 0
row_length = data.shape[1]
start_idx = starts.long().argmax(axis=1)
start_mask = (start_idx[:,None] - torch.arange(row_length))>=0
data[start_mask] = 0
end_idx = row_length - ends.long().argmax(axis=1)
end_mask = (end_idx[:,None] + torch.arange(row_length))>=row_length
data[end_mask] = 0
Note: This works also, if a row contains neither a start_value nor an end_value (I added such a row to the sample data). Still, there are many more edge cases that one could think of (e.g. multiple start and end values in one row, start value after end value, ...). Not sure if they are of relevance for the specific problem.
Comparison of execution time
Using timeit and randomly generated data to compare the execution time of the different approaches suggests, that the approach without loops is considerably faster than the approach from the question. If the data is converted to numpy first and converted back to Pytorch afterwards some further (very minor) time savings are possible.
Each dot (execution time) in the plot is the minimum value of 3 trials each with 100 repetitions.
this is my attempt at improving #rosa b. algorithm.
Could you try this:
def function1(
data: torch.Tensor,
start_value: int,
end_value: int,
):
# fill 'invalid' rows with 0
row_length = data.shape[1]
starts = (data == start_value)
ends = (data == end_value)
invalid = ((starts.sum(axis=1) - ends.sum(axis=1)) != 0)
data[invalid] = 0
valid_ind = torch.where(torch.logical_not(invalid))
# set values in the start and end of 'valid rows' to 0
arange_arr = torch.arange(row_length)
start_idx = starts.long()[valid_ind].argmax(axis=1)
start_mask = (start_idx[:, None] - arange_arr) >= 0
end_idx = row_length - ends.long()[valid_ind].argmax(axis=1)
end_mask = (end_idx[:, None] + arange_arr) >= row_length
mask = torch.logical_or(start_mask, end_mask)
tmp = data[valid_ind]
tmp.masked_fill_(mask, 0)
data[valid_ind] = tmp
return data
The main idea is I think the list of valid indexes is small. Therefore, we could skip many computations. I make some other minor updates so it should be slightly faster.
(Sorry I don't have enough reputation to make a comment).
In an .xlsx file there is logged machine data in a way that is not suitable for further calculations. Meaning I've got a file that contains depth data of a cutting tool. Each depth increment comes with several further informations like pressure, rotational speed, forces and many more.
As you can see in some datapoints the resolution of the depth parameter (0.01) is insufficient, as other parameters are updated more often. So I want to interpolate between two consecutive depth datapoints.
What is important to know, this effect doesn't occure on each depth. When the cutting tool moves fast, everything is fine.
Here is also an example file.
So I just need to interpolate values of the depth, when the differnce between two consecutive depth datapoints is 0.01
I've tried the following approach:
Open as dataframe, rename, drop NaN, convert to list
count identical depths in list and transfer them to dataframe
calculate Delta between depth i and depth i-1 (i.e. to the predecessor), replace NaN with "0"
Divide delta depth by number of time steps if 0.009 < delta depth < 0.011 -->interpolated depth
empty List of Lists with the number of elements of the sublist corresponding to the duration
Pass values from interpolated depth to the respective sublists --> List 1
Transfer elements from delta_depth to sublists --> Liste 2
Merge List 1 and List 2
Flatten the Lists
replace the original depth value by the interpolated values in dataframe
It looks like this, but at point 8 (merging) I don't get what I need:
import pandas as pd
from itertools import groupby
from itertools import zip_longest
import matplotlib.pyplot as plt
import numpy as np
#open and rename of some columns
df_raw=pd.read_excel(open('---.xlsx', 'rb'), sheet_name='---')
df_raw=df_raw.rename(columns={"---"})
#drop NaN
df_1=df_raw.dropna(subset=['depth'])
#convert to list
li = df_1['depth'].tolist()
#count identical depths in list and transfer them to dataframe
df_count = pd.DataFrame.from_records([[i, len([*group])] for i, group in groupby(li)])
df_count = df_count.rename(columns={0: "depth", 1: "duration"})
#calculate Delta between depth i and depth i-1 (i.e. to the predecessor), replace NaN with "0".
df_count["delta_depth"] = df_count["depth"].diff()
df_count=df_count.fillna(0)
#Divide delta depth by number of time steps if 0.009 < delta depth < 0.011
df_count["inter_depth"] = np.where(np.logical_and(df_count['delta_depth'] > 0.009, df_count['delta_depth'] < 0.011),df_count["delta_depth"] / df_count["duration"],0)
li2=df_count.values.tolist()
li_depth = df_count['depth'].tolist()
li_delta = df_count['delta_depth'].tolist()
li_duration = df_count['duration'].tolist()
li_inter = df_count['inter_depth'].tolist()
#empty List of Lists with the number of elements of the sublist corresponding to the duration
out=[]
for number in li_duration:
out.append(li_inter[:number])
#Pass values from interpolated depth to the respective sublists --> Liste 1
out = [[i]*j for i, j in zip(li_inter, [len(j) for j in out])]
#Transfer elements from delta_depth to sublists --> Liste 2
def extractDigits(lst):
return list(map(lambda el:[el], lst))
lst=extractDigits(li_delta)
#Merge list 1 and list 2
list1 = out
list2 = lst
new_list = []
for l1, l2 in zip_longest(list1, list2, fillvalue=[]):
new_list.append([y if y else x for x, y in zip_longest(l1, l2)])
new_list
After merging the first elements of the sublists the original depth values are followed by the interpolated values. But the sublists should contain only interpolated values.
Now I have the following questions:
is there in general a better approach to this problem?
How could I solve the problem with merging, or...
... find a way to override the wrong first elements in the sublists
The desired result would look something like this.
Any help would be much appreciated, as I'm very unexperienced in python and totally stuck.
I am sure someone could write something prettier, but I think this will work just fine:
Edited to some kinda messy scripting. I think this will do what you need it to though
_list_helper1 = df["Depth [m]"].to_list()
_list_helper1.insert(0, 0)
_list_helper1.insert(0, 0)
_list_helper1 = _list_helper1[:-2]
df["helper1"] = _list_helper1
_list = df["Depth [m]"].to_list() # grab all depth values
_list.insert(0, 0) # insert a value at the beginning to offset from original col
_list = _list[0:-1] # Delete the very last item
df["helper"] = _list # add the list to a helper col which is now offset
df["delta depth"] = df["Depth [m]"] - df["helper"] # subtract helper col from original
_id = 0
for i in range(len(df)):
if df.loc[i, "Depth [m]"] == df.loc[i, "helper"]:
break_val = df.loc[i, "Depth [m]"]
break_val_2 = df.loc[i+1, "Depth [m]"]
if break_val_2 == break_val:
df.loc[i, "IDcol"] = _id
df.loc[i+1, "IDcol"] = _id
else:
_id += 1
depth = df["IDcol"].to_list()
depth = list(dict.fromkeys(depth))
depth = [x for x in depth if str(x) != 'nan']
increments = []
for i in depth:
_df = df.copy()
_df = _df[_df["IDcol"] == i]
_df.reset_index(inplace=True, drop=True)
div_by = len(_df)
increment = _df.loc[0, "helper"] - _df.loc[0, "helper1"]
_df["delta depth"] = increment / div_by
_increment = increment / div_by
base_value = _df.loc[0, "Depth [m]"]
for y in range(div_by):
_df.loc[y, "Depth [m]"] = base_value + ((y + 1) * _increment)
increments.append(_df)
df["IDcol"] = df["IDcol"].fillna("KEEP")
df = df[df["IDcol"] == "KEEP"]
increments.append(df)
df = pd.concat(increments)
df = df.fillna(0)
df = df[["index", "Depth [m]", "delta depth", "IDcol"]] # and whatever other cols u want
Good mooring to all,
The objective is to be able to create a series of new columns by inserting x and y into the df[f'sma_{x}Vs_sma{y}'] function.
The problem that I’m having is that I’m only getting the last tuple value into the function and therefore into the data frame as you can see on the last image.
On the second part of the code, 3 examples on how the tuples values must be plug into the function. IN the examples I will be using the first 2 tuples (10,11), (10,12) and the last tuple (48,49)
Code:
a = list(combinations(range(10, 15),2))
print(a)
for index, tuple in enumerate(a):
x = tuple[0]
y = tuple[1]
print(x, y)
df[f'sma_{x}_Vs_sma_{y}'] = np.where(ta.sma(df['close'], lenght = x) > ta.sma(df['close'], lenght = y),1,-1)
Code Examples:
Tuple (10,11)
df[f'sma_{10}_Vs_sma_{11}'] = np.where(ta.sma(df['close'], lenght = 10) > ta.sma(df['close'], lenght = 11),1,-1)
Tuple (10,12)
df[f'sma_{10}_Vs_sma_{12}'] = np.where(ta.sma(df['close'], lenght = 10) > ta.sma(df['close'], lenght = 12),1,-1)
Tuple (13,14)
df[f'sma_{13}_Vs_sma_{14}'] = np.where(ta.sma(df['close'], lenght = 13) > ta.sma(df['close'], lenght = 14),1,-1)
Error code
On the next lines the code that solve the issue. Although looking backwards seams very easy, it took me some time to get to the answer.
Thanks to the people that comment on the issue
a = list(combinations(range(5, 51),2))
print(a)
for x, y in a :
df[f'hma_{x}_Vs_hma_{y}'] = np.where(ta.hma(df['close'], lenght = x) > ta.hma(df['close'], lenght = y),1,-1)
The code :
import pandas as pd
import numpy as np
import csv
data = pd.read_csv("/content/NYC_temperature.csv", header=None,names = ['temperatures'])
np.cumsum(data['temperatures'])
printcounter = 0
list_30 = [15.22]#first temperature , i could have also added it by doing : list_30.append(i)[0] since it's every 30 values but doesn't append the first one :)
list_2 = [] #this is for the values of the subtraction (for the second iteration)
for i in data['temperatures']:
if (printcounter == 30):
list_30.append(i)
printcounter = 0
printcounter += 1
**for x in list_30:
substract = list_30[x] - list_30[x+1]**
list_2.append(substraction)
print(max(list_2))
Hey guys ! i'm really having trouble with the black part.
**for x in list_30:
substract = list_30[x] - list_30[x+1]**
I'm trying to iterate over the elements and sub stracting element x with the next element (x+1) but the following error pops out TypeError: 'float' object is not iterable. I have also tried to iterate using x instead of list_30[x] but then when I use next(x) I have another error.
for x in list_30: will iterate on list_30, and affect to x, the value of the item in the list, not the index in the list.
for your case you would prefer to loop on your list with indexes:
index = 0
while index < len(list_30):
substract = list_30[index] - list_30[index + 1]
edit: you will still have a problem when you will reach the last element of list_30 as there will be no element of list_30[laste_index + 1],
so you should probably stop before the end with while index < len(list_30) -1:
in case you want the index and the value, you can do:
for i, v in enumerate(list_30):
substract = v - list_30[i + 1]
but the first one look cleaner i my opinion
if you`re trying to find ifference btw two adjacent elements of an array (like differentiate it), you shoul probably use zip function
inp = [1, 2, 3, 4, 5]
delta = []
for x0,x1 in zip(inp, inp[1:]):
delta.append(x1-x0)
print(delta)
note that list of deltas will be one shorter than the input