I have a data frame that basically consists of three columns: group, timestamp, value.
I created the following for loop that will iterate through the data frame and run tests to see if the values are acceptable or not. For example, if not enough time has passed between the timestamps to account for the value, then it is tagged as potentially bad data.
The only caveat here is that values should not always be compared to the previous value, but rather to the last 'good' value within the group, which is why I went with a loop.
I'm wondering if there is a better way to do this without the loop, or are there inefficiencies in the loop that would help speed it up?
dfy = pd.DataFrame(index=dfx.index, columns=['gvalue', 'quality'])

# last known-good state; effectively reset whenever the group changes
prevgroup = None
goodtimestamp = None
goodvalue = None

for row in dfx.itertuples():
    thisgroup = row[1]
    thistimestamp = row[2]
    thisvalue = row[3]
    qualitytag = ''
    qualitytest = True
    if prevgroup == thisgroup:
        ts_gap = thistimestamp - goodtimestamp
        hour_gap = (thisvalue - goodvalue) * 3600
        if hour_gap < 0:
            qualitytag = 'H'
            qualitytest = False
        elif hour_gap > ts_gap:
            qualitytag = 'A'
            qualitytest = False
        elif hour_gap >= 86400:
            qualitytag = 'U'
            qualitytest = False
    # if the tests pass, update the good values
    if qualitytest:
        goodvalue = thisvalue
        goodtimestamp = thistimestamp
    # save the good values to the y dataframe
    dfy.iat[row[0], 0] = goodvalue
    dfy.iat[row[0], 1] = qualitytag
    prevgroup = thisgroup

df = dfx.join(dfy)
I have a big dataframe in pandas and want to fill one column based on the values of another column. That column contains sequences of '0' and '1', and I want to calculate the ratio between them. This is my working code, but it's really slow; do you have a good idea how to speed it up?
t1 = time.time()
phase = df.loc[0]['Phase']
sequence_0 = 0
sequence_1 = 0
sequence = 0
ratio = 0
for val in df.itertuples():
    if val[10] == phase:        # column 10 holds 'Phase'
        sequence += 1
    else:
        if phase == 0:
            sequence_0 = sequence
        else:
            sequence_1 = sequence
        if sequence_0 > 0:
            ratio = (sequence_0 / (sequence_1 + sequence_0)) * 100
        sequence = 0
        phase = val[10]
    df.at[val[0], 'Ratio'] = ratio
print("Elapsed: %.2f seconds" % (time.time() - t1))
This takes ~10 s for a dataframe of ~850k rows.
Thanks and best regards
Christoph
Vectorize the calculation. Something like:
df[df['col10'] == phase].mean()
should yield the expected result, using the appropriate column names.
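For completeness, a more concrete vectorized sketch of a run-length approach (this goes beyond the one-liner above and is only a sketch): it assumes the 0/1 column is named 'Phase' as in the question, and it is not byte-for-byte identical to the loop (early rows get NaN rather than 0, and the ratio updates once per run rather than once per row), but it shows how the per-run lengths can be computed without iterating.

import pandas as pd

# Sketch only: 'Phase' is assumed to be the 0/1 column from the question.
# 1. Label consecutive runs of equal Phase values and get each run's value and length.
run_id = (df['Phase'] != df['Phase'].shift()).cumsum()
runs = df.groupby(run_id)['Phase'].agg(['first', 'size'])

# 2. For each run, carry forward the length of the most recent *completed*
#    run of 0s and of 1s (shift(1) so a run does not see its own length).
runs['len_0'] = runs['size'].where(runs['first'] == 0).shift(1).ffill()
runs['len_1'] = runs['size'].where(runs['first'] == 1).shift(1).ffill()
runs['Ratio'] = runs['len_0'] / (runs['len_0'] + runs['len_1']) * 100

# 3. Map the per-run ratio back onto the original rows.
df['Ratio'] = run_id.map(runs['Ratio'])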
I have a Pandas dataframe with ~100,000,000 rows and 3 columns (Names str, Time int, and Values float), which I compiled from ~500 CSV files using glob.glob(path + '/*.csv').
Two different names alternate in the data. The job is to go through it and count the number of times a value associated with a specific name, ABC, deviates from its preceding value by ±100, provided that the previous 50 values for that name did not deviate by more than ±10.
I initially solved it with a for loop function that iterates through each row, as shown below. It checks for the correct name, then checks the stability of the previous values of that name, and finally adds one to the count if there is a large enough deviation.
import numpy as np

# names and values are assumed to come from the dataframe's columns
names = df["Names"].to_numpy()
values = df["Values"].to_numpy()

count = 0
stabilityTime = 0
i = 0
if names[0] == "ABC":
    j = values[0]
    stability = np.full(50, values[0])
else:
    j = values[1]
    stability = np.full(50, values[1])
for name in names:
    value = values[i]
    if name == "ABC":
        if j - 10 < value < j + 10:
            stabilityTime += 1
        if stabilityTime >= 50 and np.std(stability) < 10:
            if value > j + 100 or value < j - 100:
                stabilityTime = 0
                count += 1
        stability = np.roll(stability, -1)
        stability[-1] = value
        j = value
    i += 1
Naturally, this process takes a very long time to run. I have looked at NumPy vectorization, but I do not see how to apply it in this case. Is there some way I can optimize this?
Thank you in advance for any advice!
Bonus points if you can give me a way to concatenate all the data from every CSV file in the directory that is faster than glob.glob(path + '/*.csv').
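For reference, the usual pattern for combining the CSVs is to read each file once and concatenate the results in a single call; a minimal sketch, assuming path is the directory mentioned above:

import glob
import pandas as pd

# Minimal sketch, assuming `path` points to the directory with the ~500 CSVs.
# Reading each file once and concatenating at the end is usually much faster
# than appending to a dataframe inside the loop.
files = glob.glob(path + '/*.csv')
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)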
I have a list of data frames that I'm opening in a for loop. For each data frame I want to query a portion of it and find the average.
This is what I have so far:
k = 0
for i in open('list.txt', 'r'):
    k = k + 1
    i_name = i.strip()
    df = pd.read_csv(i_name, sep='\t')
    #Create queries
    A = df.query('location == 1' and '1000 >= start <= 120000000')
    B = df.query('location == 10' and '2000000 >= start <= 60000000')
    print A
    print B
    #Find average
    avgA = (sum(A['height'])/len(A['height']))
    print avgA
    avgB = (sum(B['height'])/len(B['height']))
    print avgB
The problem is that I'm not getting the average values I expect (when checking manually in Excel). Printing the query just prints the entire data frame, so I'm not sure whether there's a problem with how I'm querying the data.
Am I correctly assigning the queries to A and B? Is there another way to do this that doesn't involve saving every data frame as a CSV? I have many queries to create and don't want to save each intermediate query for hundreds of samples, since I'm only interested in the average.
This does not do what you expect:
A = df.query('location == 1' and '1000 >= start <= 120000000')
B = df.query('location == 10' and '2000000 >= start <= 60000000')
You are doing the Python "and" of two strings. Since the first string has a True value, the result of that expression is "1000 >= start <= 120000000".
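A quick way to see this short-circuit behaviour in an interpreter:

>>> 'location == 1' and '1000 >= start <= 120000000'
'1000 >= start <= 120000000'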
You want the "and" to be inside the query:
A = df.query('location == 1 and 1000 >= start <= 120000000')
B = df.query('location == 10 and 2000000 >= start <= 60000000')
Secondly, you have the inequality operators backwards. The first one is only going to get values less than or equal to 1000. What you really want is:
A = df.query('location == 1 and 1000 <= start <= 120000000')
B = df.query('location == 10 and 2000000 <= start <= 60000000')
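As a side note, once the queries are fixed, pandas can compute the averages directly, which avoids the manual sum/len in the question's loop:

avgA = A['height'].mean()
avgB = B['height'].mean()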
I have a data table of 100,000 records with 50 columns. Each record has a start time, an end time, and an equipment key. Records are stored when a node goes down: the start time is when the node goes down, and the end time is when it comes back up. If there are multiple records with the same equipment key whose start and end times fall inside a previous record's start and end times, we say the new record has an overlapping time and we need to ignore it. To find these overlapping records I wrote a function and apply it to the dataframe, but it's taking a long time. I'm not very experienced with optimization, so I'm looking for suggestions.
sitecode_info = []

def check_overlapping_sitecode(it):
    sitecode = it['equipmentkey']
    fo = it['firstoccurrence']
    ct = it['cleartimestamp']
    if len(sitecode_info) == 0:
        sitecode_info.append({
            'sc': sitecode,
            'fo': fo,
            'ct': ct
        })
        return 0
    else:
        for list_item in sitecode_info:
            for item in list_item.keys():
                if item == 'sc':
                    if list_item[item] == sitecode:
                        # print("matched")
                        if fo >= list_item['fo'] and ct <= list_item['ct'] or \
                                fo >= list_item['fo'] and fo <= list_item['ct'] and ct >= list_item['ct'] or \
                                fo <= list_item['fo'] and ct >= list_item['ct'] or \
                                fo <= list_item['fo'] and ct >= list_item['fo'] and ct <= list_item['ct']:
                            return 1
                        else:
                            sitecode_info.append({
                                'sc': sitecode,
                                'fo': fo,
                                'ct': ct
                            })
                            return 0
                    else:
                        sitecode_info.append({
                            'sc': sitecode,
                            'fo': fo,
                            'ct': ct
                        })
                        return 0
I am calling it as follows.
temp_df['false_alarms'] = temp_df.apply(check_overlapping_sitecode, axis=1)
I think you were just iterating over that list of dictionaries a touch too much.
**EDIT:** Added appending the fo's and ct's even when the method returns 1, for better accuracy.
'''
Set up an empty dictionary.
It will look like: {sc1: [[fo, ct], [fo, ct]],
                    sc2: [[fo, ct], [fo, ct]]}
The keys are just the site codes, so we don't have to iterate over
all of the fo's and ct's, only the ones related to that site code.
'''
sitecode_info = {}

# I set up a dataframe with 200000 rows x 50 columns.
def check_overlapping_sitecode(site_code, fo, ct):
    try:
        # Try to grab the existing site_code information from the sitecode_info dict.
        # If that fails, create it below while returning 0 for that site_code.
        my_list = sitecode_info[site_code]
        # If it works, go through that site's list.
        for fo_old, ct_old in my_list:
            # If the first occurrence falls between the old first occurrence and clear timestamp...
            if fo >= fo_old and fo <= ct_old:
                sitecode_info[site_code].append([fo, ct])
                return 1
            # ...same, but for the clear timestamp instead.
            elif ct <= ct_old and ct >= fo_old:
                sitecode_info[site_code].append([fo, ct])
                return 1
        # If it doesn't overlap with anything, record the interval and return 0.
        sitecode_info[site_code].append([fo, ct])
        return 0
    except KeyError:
        # First time we see this site_code: start its list of intervals.
        sitecode_info[site_code] = [[fo, ct]]
        return 0

t = time.time()
"""Here's the real meat and potatoes:
use a lambda to call check_overlapping_sitecode on each row
and write its output into the 'false_alarms' column.
"""
temp_df['false_alarms'] = temp_df.apply(
    lambda x: check_overlapping_sitecode(x['equipmentkey'], x['firstoccurrence'], x['cleartimestamp']),
    axis=1)
print(time.time() - t)
# This code runs in nearly 6 seconds for me.
# Then you can do whatever you want with your DF.
I have coded the following for loop. The main idea is that, for each occurrence of 'D' in the column 'A_D', it looks for all the possible cases where some specific conditions hold. When all the conditions are verified, a value is added to a list.
a = []
for i in df.index:
    if df['A_D'][i] == 'D':
        if df['TROUND_ID'][i] == ' ':
            vb = df[(df['O_D'] == df['O_D'][i])
                    & (df['A_D'] == 'A')
                    & (df['Terminal'] == df['Terminal'][i])
                    & (df['Operator'] == df['Operator'][i])]
            number = df['number_ac'][i]
            try:  ## if all the conditions above are verified, a value is added to the list
                x = df.START[i] - pd.Timedelta(int(number), unit='m')
                value = vb.loc[(vb.START - x).abs().idxmin()].FlightID
            except:  ## if they are not verified, a placeholder string is added instead
                value = 'No_link_found'
        else:
            value = 'Has_link'
    else:
        value = 'IsArrival'
    a.append(value)
My main problem is that df has millions of rows, so this for loop is far too time-consuming. Is there any vectorized solution where I do not need a for loop?
An initial set of improvements: use apply rather than a loop; create a second dataframe at the start of the rows where df["A_D"] == "A"; and vectorise the value x.
arr = df[df["A_D"] == "A"]

# If the next line is slow, apply it only to those rows where x is needed.
df["x"] = df.START - pd.to_timedelta(df["number_ac"].astype(int), unit='m')

def link_func(row):
    if row["A_D"] != "D":
        return "IsArrival"
    if row["TROUND_ID"] != " ":
        return "Has_link"
    vb = arr[(arr["O_D"] == row["O_D"])
             & (arr["Terminal"] == row["Terminal"])
             & (arr["Operator"] == row["Operator"])]
    try:
        return vb.loc[(vb.START - row["x"]).abs().idxmin()].FlightID
    except:
        return "No_link_found"

df["a"] = df.apply(link_func, axis=1)
Using apply is apparently more efficient but does not automatically vectorise the calculation. But finding a value in arr based on each row of df is inherently time consuming, however efficiently it is implemented. Consider whether the two parts of the original dataframe (where df["A_D"] == "A" and df["A_D"] == "D", respectively) can be reshaped into a wide format somehow.
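One possible shape for that, sketched here as an assumption rather than as part of the original answer: if START (and therefore x) is a datetime column, pd.merge_asof can match each departure row to the arrival row with the nearest START while grouping exactly on O_D, Terminal and Operator, which removes the per-row search altogether.

import pandas as pd

# Sketch only: assumes START and x are datetime64 columns and that df carries
# the columns used in the question (A_D, TROUND_ID, O_D, Terminal, Operator, FlightID).
dep = df[(df["A_D"] == "D") & (df["TROUND_ID"] == " ")].sort_values("x").reset_index()
arrs = df[df["A_D"] == "A"].sort_values("START")

linked = pd.merge_asof(
    dep,
    arrs[["O_D", "Terminal", "Operator", "START", "FlightID"]],
    left_on="x", right_on="START",
    by=["O_D", "Terminal", "Operator"],
    direction="nearest",
    suffixes=("", "_arr"))

# Write the matched flight IDs back to the departure rows; the 'IsArrival' and
# 'Has_link' cases can be filled beforehand with ordinary boolean indexing.
df.loc[linked["index"], "a"] = linked["FlightID_arr"].fillna("No_link_found").to_numpy()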
EDIT: You might be able to speed up the querying of arr by storing query strings in df, like this:
df["query_string"] = ('O_D == "' + df["O_D"]
                      + '" & Terminal == "' + df["Terminal"]
                      + '" & Operator == "' + df["Operator"] + '"')

def link_func(row):
    vb = arr.query(row["query_string"])
    try:
        return vb.loc[(vb.START - row["x"]).abs().idxmin()].FlightID
    except:
        return "No_link_found"

# Assigning inside the row (row["a"] = ...) would only change a copy of the row,
# so return the value and write it back to the matching rows instead.
mask = (df["A_D"] == "D") & (df["TROUND_ID"] == " ")
df.loc[mask, "a"] = df.loc[mask].apply(link_func, axis=1)