Updating Dataframe during Traversal - python

I'm working with dataframes, and need to delete a few rows as I iterate through them.
A brief overview: I read a row (N), compare it with the next 20 rows (till N+20), and delete a few rows between N and N+20 based on the comparison. I then go back to N+1, and compare that row with the next 20 rows, until N+1+20. I do not want to compare N+1 with the rows I've previously deleted.
However, the deletions are not reflected while I iterate, because iterrows() keeps traversing the original copy of the dataframe.
Any solutions for this?
df = pd.read_csv(r"C:\snip\test.csv")
index_to_delete = []
for index, row in df.iterrows():
    snip
    for i in range(20):
        if (index + i + 1) < len(df.index):
            if condition:
                index_to_delete.append(index + i + 1) #storing indices of rows to delete between N and N+20
    df.loc[index, ['snip1', 'snip2']] = [snip, snip] #updating values in row N
    df = df.drop(index_to_delete)
    index_to_delete.clear()

pandas.DataFrame.iterrows():
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
There are several ways to work around the problem:
1: Iterate over the length of df instead of iterating over df itself.
for inx in range(len(df)):
    try:
        row = df.loc[inx]
    except KeyError:
        # this label was dropped earlier, so skip it
        continue
2: Store the indices already marked for deletion and skip them.
df = pd.read_csv(r"C:\snip\test.csv")
all_index_to_delete = []
index_to_delete = []
for index, row in df.iterrows():
    if index in all_index_to_delete:
        continue
    snip
    for i in range(20):
        if (index + i + 1) < len(df.index):
            if condition:
                index_to_delete.append(index + i + 1) #storing indices of rows to delete between N and N+20
                all_index_to_delete.append(index + i + 1) #remembering every deleted index so it can be skipped later
    df.loc[index, ['snip1', 'snip2']] = [snip, snip] #updating values in row N
    df = df.drop(index_to_delete)
    index_to_delete.clear()
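As a rough sketch of the same idea without dropping rows inside the loop, a boolean "keep" mask can record deletions and be applied once at the end. The function name prune_lookahead and the predicate should_drop are hypothetical stand-ins for the snipped comparison logic, and the window is measured in original row positions, as in the question.

```python
import numpy as np
import pandas as pd

def prune_lookahead(df, should_drop, window=20):
    """Return df without the rows flagged by should_drop (a hypothetical predicate)."""
    keep = np.ones(len(df), dtype=bool)  # True means the row is still alive
    for pos in range(len(df)):
        if not keep[pos]:
            continue  # row N was deleted by an earlier comparison, so skip it
        last = min(pos + 1 + window, len(df))
        for nxt in range(pos + 1, last):
            # only compare against rows that have not been deleted yet
            if keep[nxt] and should_drop(df.iloc[pos], df.iloc[nxt]):
                keep[nxt] = False
    return df[keep]  # apply all deletions in a single step

# toy data: drop a later row when it repeats the value of the current row
demo = pd.DataFrame({"x": [1, 1, 2, 2, 3]})
pruned = prune_lookahead(demo, lambda a, b: a["x"] == b["x"], window=2)
```

Because the dataframe itself is never modified during the traversal, the positions stay stable and the mask is the only state that changes.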

Related

Fastest way to count event occurences in a Pandas dataframe?

I have a Pandas dataframe with ~100,000,000 rows and 3 columns (Names str, Time int, and Values float), which I compiled from ~500 CSV files using glob.glob(path + '/*.csv').
Given that two different names alternate, the job is to go through the data and count the number of times a value associated with a specific name ABC deviates from its preceding value by ±100, given that the previous 50 values for that name did not deviate by more than ±10.
I initially solved it with a for loop function that iterates through each row, as shown below. It checks for the correct name, then checks the stability of the previous values of that name, and finally adds one to the count if there is a large enough deviation.
count = 0
stabilityTime = 0
i = 0
if names[0] == "ABC":
    j = values[0]
    stability = np.full(50, values[0])
else:
    j = values[1]
    stability = np.full(50, values[1])
for name in names:
    value = values[i]
    if name == "ABC":
        if j - 10 < value < j + 10:
            stabilityTime += 1
        if stabilityTime >= 50 and np.std(stability) < 10:
            if value > j + 100 or value < j - 100:
                stabilityTime = 0
                count += 1
        stability = np.roll(stability, -1)
        stability[-1] = value
        j = value
    i += 1
Naturally, this process takes a very long computing time. I have looked at NumPy vectorization, but do not see how I can apply it in this case. Is there some way I can optimize this?
Thank you in advance for any advice!
Bonus points if you can give me a way to concatenate all the data from every CSV file in the directory that is faster than glob.glob(path + '/*.csv').
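One possible vectorized reading of the rule described above, offered only as a sketch: treat a "stable" stretch as 50 consecutive steps smaller than 10, and count a jump when the next step exceeds 100. The function name count_jumps and its parameters are hypothetical, and this interpretation is not guaranteed to reproduce the loop's exact bookkeeping (the loop also uses a standard-deviation check).

```python
import pandas as pd

def count_jumps(values, window=50, stable_tol=10, jump_tol=100):
    """Count large steps that follow `window` consecutive small steps."""
    s = pd.Series(values, dtype="float64")
    step = s.diff().abs()            # absolute change from the preceding value
    small = step < stable_tol        # True where the step stayed within tolerance
    # number of small steps among the `window` steps that end just before each row
    prior_small = small.astype(int).rolling(window).sum().shift(1)
    jumps = (step > jump_tol) & (prior_small == window)
    return int(jumps.sum())

stable_then_jump = [0.0] * 51 + [200.0]   # 50 small steps, then a jump
too_short = [0.0] * 50 + [200.0]          # only 49 small steps before the jump
```

Assuming the column names from the question, the values for one name could be selected first with something like df.loc[df["Names"] == "ABC", "Values"] and then passed to the function.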

How to split a series by the longest repetition of a number in python?

df = pd.DataFrame({
    'label': [f"subj_{i}" for i in range(28)],
    'data': [i for i in range(1, 14)] + [1,0,0,0,2] + [0,0,0,0,0,0,0,0,0,0]
})
I have a dataset something like this.
I want to cut it at where the longest repetitions of 0s occur, so I want to cut at index 18, but I want to leave index 14-16 intact. So far I've tried stuff like:
# Counters
cad_recorder = 0
new_index = []
for i,row in tqdm(temp_df.iterrows()):
    if row['cadence'] == 0:
        cad_recorder += 1
        new_index.append(i)
But obviously that won't work, since the indices will be rewritten at each occurrence of zero.
I also tried a dictionary, but I'm not sure how to compare previous and next values using iterrows.
I also took the rolling mean for X rows at a time, and if it was zero I recorded an index. But then I got stuck at actually inferring the range of indices, or finding the longest sequence of zeroes.
Edit: A friend of mine suggested the following logic, which gave the same results as @shubham-sharma's solution. The poster's solution is much more Pythonic and elegant.
def find_longest_zeroes(df):
    '''
    Finds the index at which the longest repetition of <= 1 values begins
    '''
    current_length = 0
    max_length = 0
    start_idx = 0
    max_idx = 0
    for i in range(len(df['data'])):
        if df.iloc[i,9] <= 1:
            if current_length == 0:
                start_idx = i
            current_length += 1
            if current_length > max_length:
                max_length = current_length
                max_idx = start_idx
        else:
            current_length = 0
    return max_idx
The code I went with, following @shubham-sharma's solution:
cut_us_sof = {}
og_df_sof = pd.DataFrame()
cut_df_sof = pd.DataFrame()
for lab in df['label'].unique():
    temp_df = df[df['label'] == lab].reset_index(drop=True)
    mask = temp_df['data'] <= 1 # some values in actual dataset were 0.0000001
    counts = temp_df[mask].groupby((~mask).cumsum()).transform('count')['data']
    idx = counts.idxmax()
    # my dataset's trailing zeroes are usually after 200th index. But I also didn't want to remove trailing zeroes < 500 in length
    if (idx > 2000) & (counts.loc[idx] > 500):
        cut_us_sof[lab] = idx
        # DataFrame.append was removed in pandas 2.0, so pd.concat is used here
        og_df_sof = pd.concat([og_df_sof, temp_df])
        cut_df_sof = pd.concat([cut_df_sof, temp_df.iloc[:idx,:]])
We can use boolean masking and cumsum to identify the blocks of zeros. Then we groupby these blocks and transform them using count, followed by idxmax, to get the starting index of the block with the most consecutive zeros.
m = df['data'].eq(0)
idx = m[m].groupby((~m).cumsum()).transform('count').idxmax()
print(idx)
18
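For reference, here is a self-contained sketch of the masking approach applied to the sample dataframe from the question. The intermediate names block_id, run_len and trimmed are illustrative only, and the grouping key is aligned to the zero rows explicitly rather than relying on index alignment.

```python
import pandas as pd

df = pd.DataFrame({
    'label': [f"subj_{i}" for i in range(28)],
    'data': [i for i in range(1, 14)] + [1, 0, 0, 0, 2] + [0] * 10,
})

m = df['data'].eq(0)                 # True where the value is zero
block_id = (~m).cumsum()             # constant within each run of zeros
# length of the run each zero belongs to, indexed by the original row labels
run_len = m[m].groupby(block_id[m]).transform('count')
idx = run_len.idxmax()               # first row of the longest run
trimmed = df.iloc[:idx]              # everything before the longest run
```

Since idxmax returns the first label holding the maximum, it points at the start of the longest run, and the shorter run at indices 14 to 16 stays in the trimmed frame.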

Speeding up rolling calculations within a dataframe

I have a dataframe where the index is a datetime, and it is sorted. Basically, I want to create columns rolling_time1, rolling_time2, etc., where each value is the number of rows after the current row that fall within timex of it. I wrote the following, but it is very slow. Is there any way to make this faster?
def sum_window_wd(row, wd_file, wd, df, num):
    if row.start_index > num:
        return row['rolling_' + str(wd)]
    count = 0
    for i in range(row.start_index + 1, len(df)):
        if GetWinddownLeft(wd_file, df.iloc[i].name, row.name) < wd:
            count = count + 1
        else:
            break
    return count
for rolling in rollings:
    df['rolling_' + str(rolling)] = 0
for rolling in rollings:
    df['rolling_' + str(rolling)] = df.apply(sum_window_wd, axis=1, args = (winddown, rolling, df, len))
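If the custom GetWinddownLeft helper can be approximated by a plain time difference (an assumption, since its definition is not shown), the per-row loop could be replaced by a binary search over the sorted index. The function name count_following_within is hypothetical; this is a sketch of the idea rather than a drop-in replacement.

```python
import numpy as np
import pandas as pd

def count_following_within(index, window):
    """For each timestamp in a sorted DatetimeIndex, count later rows closer than `window`."""
    times = index.values                              # datetime64[ns] array
    limit = times + window.to_timedelta64()           # exclusive upper bound per row
    # first position whose timestamp is >= the bound, found by binary search
    ends = np.searchsorted(times, limit, side="left")
    return ends - np.arange(len(times)) - 1           # exclude the row itself

stamps = pd.to_datetime([
    "2024-01-01 00:00", "2024-01-01 00:01",
    "2024-01-01 00:02", "2024-01-01 00:10",
])
counts = count_following_within(stamps, pd.Timedelta("5min"))
```

Each lookup costs O(log n) instead of a Python-level scan, so one pass per window size should be far cheaper than df.apply with an inner loop.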

Improve efficiency of python for loop counting items against IDs in nested list

I'm trying to improve the efficiency of a script that takes a nested list representing a data table, with a column of IDs (each of which might have many entries). The script counts the number of IDs that have more than 100 entries, and more than 200 entries.
Is there a way I can not have to cycle through the list each time with the list comprehension maybe?
list_of_IDs = [row[4] for row in massive_nested_list] ### get list of ID numbers
list_of_IDs = set(list_of_IDs) ### remove duplicates
list_of_IDs = list(list_of_IDs)
counter200 = 0
counter100 = 0
for my_ID in list_of_IDs:
    temp = [row for row in massive_nested_list if row[4] == my_ID]
    if len(temp) > 200:
        counter200 += 1
    if len(temp) > 100:
        counter100 += 1
Use a collections.Counter() instance to count your ids. There is no need to collect all possible ids first. You can then collate counts from there:
from collections import Counter
counts = Counter(row[4] for row in massive_nested_list)
counter100 = counter200 = 0
for id, count in counts.most_common():
    if count >= 200:
        counter200 += 1
    elif count >= 100:
        counter100 += 1
    else:
        break
Given K unique IDs in N nested lists, your code would take O(KN) loops to count everything; in the worst case (K == N) that means your solution takes quadratic time (for every additional row you need to do N times more work). The above code reduces this to one loop over N items, then another loop over K items, making it an O(N) (linear) algorithm.
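A small variation, sketched here under the assumption that the thresholds should match the question exactly (strictly more than 100 and strictly more than 200, with IDs above 200 counted in both totals). The function name count_busy_ids and the id_column parameter are illustrative.

```python
from collections import Counter

def count_busy_ids(rows, id_column=4):
    """Return (IDs with > 100 entries, IDs with > 200 entries)."""
    counts = Counter(row[id_column] for row in rows)   # single pass over all rows
    over_100 = sum(1 for n in counts.values() if n > 100)
    over_200 = sum(1 for n in counts.values() if n > 200)
    return over_100, over_200

# toy table: "a" has 201 entries, "b" has 101, "c" has exactly 100
sample_rows = (
    [[0, 0, 0, 0, "a"]] * 201
    + [[0, 0, 0, 0, "b"]] * 101
    + [[0, 0, 0, 0, "c"]] * 100
)
result = count_busy_ids(sample_rows)
```

The rows are still scanned only once; the two sums run over the K distinct IDs, so the overall cost stays linear.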
The simplest method would be to go:
temp100 = [row for row in massive_nested_list if row[4] == my_ID and row >= 100 and row < 200]
temp200 = [row for row in massive_nested_list if row[4] == my_ID and row >= 200]
then you could go:
len(temp200)
OR
counter200 = len(temp200)

How do I count these comparisons in selection sort?

Here is my code
count = 0
def selectionSort(data):
    for index in range(len(data)):
        min = index
        count += 1
        # Find the index'th smallest element
        for scan in range(index + 1, len(data)):
            if (data[scan] < data[min]):
                min = scan
        if min != index: # swap the elements
            data[index], data[min] = data[min], data[index]
    return data
data = selectionSort([3,4,5,2,6])
print(count, data)
Your code as-is should not run. You should get an UnboundLocalError: local variable 'count' referenced before assignment.
To fix this, add the following to the top of selectionSort(data):
global count
A better way is to scrap the global variable and return count alongside the sorted data:
def selectionSort(data):
    count = 0
    for index in range(len(data)):
        min = index
        count += 1
        # Find the index'th smallest element
        for scan in range(index + 1, len(data)):
            if (data[scan] < data[min]):
                min = scan
        if min != index: # swap the elements
            data[index], data[min] = data[min], data[index]
    return count, data
count, data = selectionSort([3,4,5,2,6])
print(count, data)
Last but not least, you are counting something other than comparisons. I leave fixing that as an exercise for the reader.
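For readers who want to check their answer to that exercise, one possible version increments the counter inside the inner loop, where each element comparison actually happens. The names selection_sort_counting, items and comparisons are illustrative, and this sketch sorts a copy rather than the input list.

```python
def selection_sort_counting(data):
    """Selection sort that returns (number of element comparisons, sorted list)."""
    items = list(data)        # work on a copy so the caller's list is untouched
    comparisons = 0
    for index in range(len(items)):
        smallest = index
        # find the smallest remaining element
        for scan in range(index + 1, len(items)):
            comparisons += 1  # one comparison per inner-loop step
            if items[scan] < items[smallest]:
                smallest = scan
        if smallest != index:  # swap the elements
            items[index], items[smallest] = items[smallest], items[index]
    return comparisons, items

comparisons, ordered = selection_sort_counting([3, 4, 5, 2, 6])
```

For a list of n elements the inner loop runs n(n-1)/2 times regardless of the input order, which gives 10 comparisons for the five-element example.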
