I have a Pandas dataframe with ~100,000,000 rows and 3 columns (Names str, Time int, and Values float), which I compiled from ~500 CSV files using glob.glob(path + '/*.csv').
Given that two different names alternate, the job is to go through the data and count the number of times a value associated with a specific name ABC deviates from its preceding value by ±100, given that the previous 50 values for that name did not deviate by more than ±10.
I initially solved it with a for loop function that iterates through each row, as shown below. It checks for the correct name, then checks the stability of the previous values of that name, and finally adds one to the count if there is a large enough deviation.
import numpy as np

# names and values are the 'Names' and 'Values' columns of the dataframe
count = 0
stabilityTime = 0
i = 0
if names[0] == "ABC":
    j = values[0]
    stability = np.full(50, values[0])
else:
    j = values[1]
    stability = np.full(50, values[1])
for name in names:
    value = values[i]
    if name == "ABC":
        if j - 10 < value < j + 10:
            stabilityTime += 1
        if stabilityTime >= 50 and np.std(stability) < 10:
            if value > j + 100 or value < j - 100:
                stabilityTime = 0
                count += 1
        stability = np.roll(stability, -1)
        stability[-1] = value
        j = value
    i += 1
Naturally, this process takes a very long computing time. I have looked at NumPy vectorization, but do not see how I can apply it in this case. Is there some way I can optimize this?
Thank you in advance for any advice!
Bonus points if you can give me a way to concatenate all the data from every CSV file in the directory that is faster than glob.glob(path + '/*.csv').
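One possible vectorized take on the counting rule, as a rough sketch rather than a drop-in replacement for the loop. It assumes the columns are called Names and Values as described in the question; the function name count_jumps and its parameters are made up for illustration. It implements the rule as stated in words (a step of more than ±100 that follows 50 consecutive steps of less than ±10 each), so it will not necessarily match the loop's np.std check number for number.

```python
import numpy as np
import pandas as pd

def count_jumps(df, name="ABC", window=50, small=10, big=100):
    """Count big steps that directly follow `window` consecutive small steps."""
    # Keep only the rows for the name of interest, in their original order.
    v = df.loc[df["Names"] == name, "Values"].to_numpy()
    # step[k] is the absolute change between value k and value k + 1.
    step = np.abs(np.diff(v))
    # 1 where a step is small, 0 otherwise.
    calm = pd.Series(step < small).astype(int)
    # calm_run[k] is True when the `window` steps ending at k were all small.
    calm_run = calm.rolling(window).sum().eq(window)
    # A step only counts if the run ending just before it was calm.
    stable_before = calm_run.shift(1, fill_value=False).to_numpy(dtype=bool)
    return int(np.sum((step > big) & stable_before))
```

The rolling sum replaces the per-row bookkeeping, so the whole column is processed in a handful of array operations instead of a Python-level loop.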
df = pd.DataFrame({
    'label': [f"subj_{i}" for i in range(28)],
    'data': [i for i in range(1, 14)] + [1,0,0,0,2] + [0,0,0,0,0,0,0,0,0,0]
})
I have a dataset something like the one above.
I want to cut it where the longest run of 0s begins, so I want to cut at index 18 while leaving indices 14-16 intact. So far I've tried stuff like:
# Counters
cad_recorder = 0
new_index = []
for i, row in tqdm(temp_df.iterrows()):
    if row['cadence'] == 0:
        cad_recorder += 1
        new_index.append(i)
But obviously that won't work, since the indices will be rewritten at each occurrence of zero.
I also tried a dictionary, but I'm not sure how to compare previous and next values using iterrows.
I also took the rolling mean over X rows at a time and recorded the index whenever it was zero. But then I got stuck on actually inferring the range of indices, or finding the longest sequence of zeroes.
Edit: A friend of mine suggested the following logic, which gave the same results as #shubham-sharma. The poster's solution is much more pythonic and elegant.
def find_longest_zeroes(df):
    '''
    Finds the index at which the longest repetition of values <= 1 begins
    '''
    current_length = 0
    max_length = 0
    start_idx = 0
    max_idx = 0
    for i in range(len(df['data'])):
        if df['data'].iloc[i] <= 1:
            if current_length == 0:
                start_idx = i
            current_length += 1
            if current_length > max_length:
                max_length = current_length
                max_idx = start_idx
        else:
            current_length = 0
    return max_idx
The code I went with following #shubham-sharma's solution:
cut_us_sof = {}
og_df_sof = pd.DataFrame()
cut_df_sof = pd.DataFrame()
for lab in df['label'].unique():
    temp_df = df[df['label'] == lab].reset_index(drop=True)
    mask = temp_df['data'] <= 1  # some values in actual dataset were 0.0000001
    counts = temp_df[mask].groupby((~mask).cumsum()).transform('count')['data']
    idx = counts.idxmax()
    # my dataset's trailing zeroes are usually after 200th index. But I also didn't want to remove trailing zeroes < 500 in length
    if (idx > 2000) & (counts.loc[idx] > 500):
        cut_us_sof[lab] = idx
        # DataFrame.append was removed in pandas 2.0, so use pd.concat
        og_df_sof = pd.concat([og_df_sof, temp_df])
        cut_df_sof = pd.concat([cut_df_sof, temp_df.iloc[:idx, :]])
We can use boolean masking and cumsum to identify the blocks of zeros, then groupby and transform these blocks using count, followed by idxmax, to get the starting index of the block with the most consecutive zeros:
m = df['data'].eq(0)
idx = m[m].groupby((~m).cumsum()).transform('count').idxmax()
print(idx)
18
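To see why the masking and cumsum trick works, it can help to look at the intermediate pieces one at a time. The snippet below is only an illustrative walk-through on the sample data from the question; the variable names (is_zero, block_id, run_len, start) are mine, not part of the original answer.

```python
import pandas as pd

# Same sample values as the 'data' column in the question.
s = pd.Series(list(range(1, 14)) + [1, 0, 0, 0, 2] + [0] * 10)

# True wherever the value is zero.
is_zero = s.eq(0)
# The running count of non-zero rows stays constant across a run of zeros,
# so every zero in the same run shares one block id.
block_id = (~is_zero).cumsum()
# For each zero row, the length of the run it belongs to.
run_len = is_zero[is_zero].groupby(block_id).transform("count")
# idxmax returns the first label holding the maximum, i.e. the run's start.
start = run_len.idxmax()
# Everything before the longest run of zeros.
trimmed = s.iloc[:start]
```

Here the short run at indices 14-16 gets its own block id with a length of 3, so it loses to the trailing run of 10 zeros and is left intact in trimmed.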
I have a data table of 100,000 records with 50 columns. Each record has a start time, an end time, and an equipment key. A record is stored whenever a node goes down: the start time is when the node goes down, and the end time is when it comes back up. If several records share the same equipment key and a record's start and end times fall inside a previous record's start and end times, we say the new record overlaps, and it needs to be ignored. To find these overlapping records I wrote a function and applied it to the dataframe, but it takes a long time. I am not very experienced with optimization, so I would appreciate any suggestions.
sitecode_info = []

def check_overlapping_sitecode(it):
    sitecode = it['equipmentkey']
    fo = it['firstoccurrence']
    ct = it['cleartimestamp']
    if len(sitecode_info) == 0:
        sitecode_info.append({
            'sc': sitecode,
            'fo': fo,
            'ct': ct
        })
        return 0
    else:
        for list_item in sitecode_info:
            for item in list_item.keys():
                if item == 'sc':
                    if list_item[item] == sitecode:
                        # print("matched")
                        if fo >= list_item['fo'] and ct <= list_item['ct'] or \
                           fo >= list_item['fo'] and fo <= list_item['ct'] and ct >= list_item['ct'] or \
                           fo <= list_item['fo'] and ct >= list_item['ct'] or \
                           fo <= list_item['fo'] and ct >= list_item['fo'] and ct <= list_item['ct']:
                            return 1
                        else:
                            sitecode_info.append({
                                'sc': sitecode,
                                'fo': fo,
                                'ct': ct
                            })
                            return 0
                    else:
                        sitecode_info.append({
                            'sc': sitecode,
                            'fo': fo,
                            'ct': ct
                        })
                        return 0
I am calling it as follows:
temp_df['false_alarms'] = temp_df.apply(check_overlapping_sitecode, axis=1)
I think you were just iterating over that list of dictionaries a touch too much.
**EDIT:** Added appending the fo's and ct's even when the method returns 1, for better accuracy.
'''
Set up an empty dictionary.
It will look like: {sc1: [[fo, ct], [fo, ct]],
                    sc2: [[fo, ct], [fo, ct]]}
The keys are just the site codes, so we don't have to iterate over
all of the fo's and ct's, only the ones related to that site code.
'''
sitecode_info = {}

# I set up a dataframe with 200000 rows x 50 columns.
def check_overlapping_sitecode(site_code, fo, ct):
    try:
        # Try to grab the existing information for this site code.
        # If that fails, create it below and return 0 for this site code.
        my_list = sitecode_info[site_code]
        # If it works, go through that site's list.
        for fo_old, ct_old in my_list:
            # The first occurrence falls inside an old interval.
            if fo >= fo_old and fo <= ct_old:
                sitecode_info[site_code].append([fo, ct])
                return 1
            # Same, but for the clear timestamp instead.
            elif ct <= ct_old and ct >= fo_old:
                sitecode_info[site_code].append([fo, ct])
                return 1
        else:
            # No stored interval overlaps: add this one to the site's list.
            sitecode_info[site_code].append([fo, ct])
            return 0
    except KeyError:
        # First record for this site code: start its list.
        sitecode_info[site_code] = [[fo, ct]]
        return 0
import time

t = time.time()
"""Here's the real meat and potatoes.
Use a lambda function to call check_overlapping_sitecode.
lambda x: where x is a row
It returns the output of check_overlapping_sitecode.
"""
temp_df['false_alarms'] = temp_df.apply(lambda x: check_overlapping_sitecode(x['equipmentkey'], x['firstoccurrence'], x['cleartimestamp']), axis=1)
print(time.time() - t)
# This code runs in nearly 6 seconds for me.
# Then you can do whatever you want with your DF.
I'm trying to improve the efficiency of a script that takes a nested list representing a data table, with a column of IDs (each of which might have many entries). The script counts the number of IDs that have more than 100 entries, and more than 200 entries.
Is there a way to avoid cycling through the whole list for every ID, as the list comprehension below does?
list_of_IDs = [row[4] for row in massive_nested_list]  ### get list of ID numbers
list_of_IDs = set(list_of_IDs)  ### remove duplicates
list_of_IDs = list(list_of_IDs)
counter200 = 0
counter100 = 0
for my_ID in list_of_IDs:
    temp = [row for row in massive_nested_list if row[4] == my_ID]
    if len(temp) > 200:
        counter200 += 1
    if len(temp) > 100:
        counter100 += 1
Use a collections.Counter() instance to count your ids. There is no need to collect all possible ids first. You can then collate counts from there:
from collections import Counter

counts = Counter(row[4] for row in massive_nested_list)
counter100 = counter200 = 0
for _, count in counts.most_common():
    if count <= 100:
        break  # counts come in descending order, so nothing further qualifies
    counter100 += 1  # more than 100 entries (includes IDs with more than 200)
    if count > 200:
        counter200 += 1
Given K unique IDs in N nested lists, your code takes O(KN) steps to count everything; in the worst case (K == N) that is quadratic time (every additional row can add up to N more steps). The code above reduces this to one loop over N items to build the counts, then a sort and a loop over the K distinct IDs in most_common(), so it takes O(N + K log K) time, which is close to linear.
The simplest method would be to go:
temp100 = [row for row in massive_nested_list if row[4] == my_ID and row >= 100 and row < 200]
temp200 = [row for row in massive_nested_list if row[4] == my_ID and row >= 200]
then you could go:
len(temp200)
OR
counter200 = len(temp200)
I have a collection of 101 documents. I need to iterate over them 10 documents at a time and store the value of a particular field (from each batch of 10 documents) in a list.
I tried this:
values = db.find({}, {"field": 1})
urls = []
count = 0
for value in values:
    if count < 10:
        urls.append(value["field"])
        count = count + 1
        print(count)
    else:
        print(urls)
        urls = []
        urls.append(value["field"])
        count = 1
It doesn't fetch the last value, because the loop ends before the final batch is printed. Is there an elegant way to do this and fix the situation?
You reset count to 0 every time the loop restarts. Move the declaration outside the loop:
count = 0
for value in values:
If urls is already filled, this will be your only problem.
As far as I can tell, you've some data that you want to organize into batches of size 10. If so, perhaps this will help:
N = 10
values = list(db.find({}, {"field": 1}))
url_batches = [
    [v['field'] for v in values[i:i+N]]
    for i in range(0, len(values), N)
]
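If loading the whole result into a list first is a concern, the same batching can be done lazily. This is just a sketch: the helper name batched is my own, and a generator of fake documents stands in for the db.find cursor so the snippet runs without a database.

```python
from itertools import islice

def batched(iterable, size):
    """Yield successive lists of up to `size` items from any iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Stand-in for the cursor: 101 fake documents with a "field" key.
docs = ({"field": "url_%d" % n} for n in range(101))
url_batches = [[d["field"] for d in batch] for batch in batched(docs, 10)]
```

With 101 documents this gives ten full batches plus a final batch holding the single leftover value, which is the one the original loop never printed.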
Here is my code
count = 0

def selectionSort(data):
    for index in range(len(data)):
        min = index
        count += 1
        # Find the index'th smallest element
        for scan in range(index + 1, len(data)):
            if (data[scan] < data[min]):
                min = scan
        if min != index:  # swap the elements
            data[index], data[min] = data[min], data[index]
    return data

data = selectionSort([3,4,5,2,6])
print(count, data)
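Your code as-is should not run. You should get an UnboundLocalError (local variable 'count' referenced before assignment), because count is assigned inside the function and is therefore treated as a local variable there.
To fix this, add the following to the top of selectionSort(data):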
global count
A better way is to scrap the global variable and return count alongside the sorted data:
def selectionSort(data):
    count = 0
    for index in range(len(data)):
        min = index
        count += 1
        # Find the index'th smallest element
        for scan in range(index + 1, len(data)):
            if (data[scan] < data[min]):
                min = scan
        if min != index:  # swap the elements
            data[index], data[min] = data[min], data[index]
    return count, data

count, data = selectionSort([3,4,5,2,6])
print(count, data)
Last but not least, you are counting something other than comparisons. I leave fixing that as an exercise for the reader.
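For anyone who wants to check their answer to that exercise, here is one possible version, offered as a sketch. It assumes that "comparisons" means the data[scan] < data[min] tests in the inner loop; the names selection_sort_counting, comparisons and smallest are mine.

```python
def selection_sort_counting(data):
    """Sort data in place and count the element comparisons made."""
    comparisons = 0
    for index in range(len(data)):
        smallest = index
        # Find the index'th smallest element
        for scan in range(index + 1, len(data)):
            comparisons += 1  # one comparison per inner-loop pass
            if data[scan] < data[smallest]:
                smallest = scan
        if smallest != index:  # swap the elements
            data[index], data[smallest] = data[smallest], data[index]
    return comparisons, data

comparisons, result = selection_sort_counting([3, 4, 5, 2, 6])
print(comparisons, result)  # 10 [2, 3, 4, 5, 6]
```

The counter moves into the inner loop, so a list of n items always gives n(n-1)/2 comparisons, whatever the initial order.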