How to merge continuous lines of a csv file - python
I have a csv file that carries the outputs of a detection process over video frames. In the file, each line is labelled either fire or none, and each line has a startTime and an endTime. I need to cluster continuous fire detections and print only one instance per cluster, with its overall start and end time. A few none lines in the middle can be tolerated as long as they span less than 1 second. In other words, the goal is to cluster detections from nearby frames together and smooth out the results: instead of multiple lines covering 31-32, 32-33, ..., produce a single line covering 31-35 seconds.
How can I do that?
For instance, all of the following continuous items should be considered a single one, since the none gaps are within 1 s. So the result would be something like 1,file1,name1,30.6,32.2,fire,0.83, where the score is the mean of all the fire lines.
frame_num,uniqueId,title,startTime,endTime,startTime_fmt,object,score
...
10,file1,name1,30.6,30.64,0:00:30,fire,0.914617
11,file1,name1,30.72,30.76,0:00:30,none,0.68788
12,file1,name1,30.84,30.88,0:00:30,fire,0.993345
13,file1,name1,30.96,31,0:00:30,fire,0.991015
14,file1,name1,31.08,31.12,0:00:31,fire,0.983197
15,file1,name1,31.2,31.24,0:00:31,fire,0.979572
16,file1,name1,31.32,31.36,0:00:31,fire,0.985898
17,file1,name1,31.44,31.48,0:00:31,none,0.961606
18,file1,name1,31.56,31.6,0:00:31,none,0.685139
19,file1,name1,31.68,31.72,0:00:31,none,0.458374
20,file1,name1,31.8,31.84,0:00:31,none,0.413711
21,file1,name1,31.92,31.96,0:00:31,none,0.496828
22,file1,name1,32.04,32.08,0:00:32,fire,0.412836
23,file1,name1,32.16,32.2,0:00:32,fire,0.383344
This is my attempt so far:
with open(filename) as fin:
    lastWasFire = False
    for line in fin:
        if "fire" in line:
            if lastWasFire == False and line != "" and line.split(",")[5] != lastline.split(",")[5]:
                fout.write(line)
        else:
            lastWasFire = False
        lastline = line
I assume you don't want to use external libraries for data processing like numpy or pandas. The following code should be quite similar to your attempt:
threshold = 1.0

# We will chain a "none" object at the end which triggers the threshold
# to make sure no "fire" objects are left unprinted
from itertools import chain
trigger = (",,,0,{},,none,".format(threshold + 1),)

# Keys for columns of input data
keys = (
    "frame_num",
    "uniqueId",
    "title",
    "startTime",
    "endTime",
    "startTime_fmt",
    "object",
    "score",
)

# Store last "fire" or "none" objects
last = {
    "fire": [],
    "none": [],
}

with open(filename) as f:
    # Skip first line of input file (the header)
    next(f)
    for line in chain(f, trigger):
        line = dict(zip(keys, line.split(",")))
        last[line["object"]].append(line)
        # Check threshold for "none" objects if there are previous unprinted "fire" objects
        if line["object"] == "none" and last["fire"]:
            if float(last["none"][-1]["endTime"]) - float(last["none"][0]["startTime"]) > threshold:
                print("{},{},{},{},{},{},{},{}".format(
                    last["fire"][0]["frame_num"],
                    last["fire"][0]["uniqueId"],
                    last["fire"][0]["title"],
                    last["fire"][0]["startTime"],
                    last["fire"][-1]["endTime"],
                    last["fire"][0]["startTime_fmt"],
                    last["fire"][0]["object"],
                    sum(float(x["score"]) for x in last["fire"]) / len(last["fire"]),
                ))
                last["fire"] = []
        # Previous "none" objects don't matter anymore as soon as a "fire" object is encountered
        if line["object"] == "fire":
            last["none"] = []
The input file is processed line by line, and "fire" objects are accumulated in last["fire"]. They are merged and printed when either the "none" objects collected in last["none"] span more than the threshold defined in threshold, or the end of the input file is reached: the manually chained trigger object is a "none" object of length threshold + 1, so it always trips the threshold and forces a final merge and print.
You could replace print with a call to write into an output file, of course.
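For example, here is a minimal, self-contained sketch of that variant; the output file name out.csv is a placeholder of my own, not something from the question:
# Hypothetical sketch: write a merged row to out.csv instead of printing it
with open("out.csv", "w") as fout:
    merged_row = "10,file1,name1,30.6,32.2,0:00:30,fire,0.830478"  # example merged line from the data above
    print(merged_row, file=fout)           # print(..., file=...) sends the line to the file
    # equivalently: fout.write(merged_row + "\n")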
This is close to what you are looking for and may be an acceptable alternative.
If your sample rate is quite stable (it looks to be about 0.12 s per frame, i.e. roughly 8 Hz) then you can find the equivalent number of samples you can tolerate to be 'none'. For a 1-second tolerance that works out to about 8.
This code will read in the data and fill the 'none' values with up to 8 of the last valid value.
import numpy as np
import pandas as pd
def groups_of_true_values(x):
    """Returns array of integers where each True value in x
    is replaced by the count of the group of consecutive
    True values that it was found in.
    """
    return (np.diff(np.concatenate(([0], np.array(x, dtype=int)))) == 1).cumsum() * x
df = pd.read_csv('test.csv', index_col=0)
# Forward-fill the 'none' values to a limit
df['filled'] = df['object'].replace('none', None).fillna(method='ffill', limit=8)
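# note (assumption about pandas version): on recent pandas the same intent can be written as
# df['object'].replace('none', np.nan).ffill(limit=8), since fillna(method='ffill') is deprecated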
# Find the groups of consecutive fire values
df['group'] = groups_of_true_values(df['filled'] == 'fire')
# Produce sum of scores by group
group_scores = df[['group', 'score']].groupby('group').sum()
print(group_scores)
# Find firing start and stop times
df['start'] = ((df['filled'] == 'fire') & (df['filled'].shift(1) == 'none'))
df['stop'] = ((df['filled'] == 'none') & (df['filled'].shift(1) == 'fire'))
start_times = df.loc[df['start'], 'startTime'].to_list()
stop_times = df.loc[df['stop'], 'startTime'].to_list()
print(start_times, stop_times)
Output:
score
group
1 10.347362
[] []
Hopefully, the output would be more interesting if there were longer sequences of no firing...
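As a side note, instead of hard-coding the limit of 8 samples, the tolerable number of 'none' samples could be estimated from the timestamps themselves. A minimal sketch, assuming the same test.csv as above:
import pandas as pd

df = pd.read_csv('test.csv', index_col=0)
# median spacing between consecutive frame start times (about 0.12 s for the sample data)
frame_step = df['startTime'].diff().median()
# number of frames that fit inside the 1-second tolerance
limit = int(1.0 // frame_step)
print(frame_step, limit)  # roughly 0.12 and 8 for the sample data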
My approach, using pandas and groupby:
Combine continuous lines of the same object (fire or none) into a spell
Drop 'none' spells with a duration of less than 1 second
Combine continuous series of spells of the same object (fire or none) into a superspell, and calculate the corresponding score
I assume the data is sorted by time (otherwise a sort needs to be added after reading the data). The trick for combining continuous lines of the same object into spells/superspells is: first, mark where a new spell/superspell starts (i.e. where the object type changes), and second, give each spell a unique id equal to the cumulative number of spell starts up to that line, as the small illustration below shows.
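A toy illustration of that change-flag plus cumsum trick in isolation (the values here are made up, not taken from the data above):
import pandas as pd

s = pd.Series(['fire', 'fire', 'none', 'fire', 'fire'])
new_spell = s != s.shift(1)    # True whenever the object changes: [True, False, True, True, False]
spell_id = new_spell.cumsum()  # run ids: [1, 1, 2, 3, 3]
print(spell_id.tolist())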
import pandas as pd
# preparing the test data
data = '''frame_num,uniqueId,title,startTime,endTime,startTime_fmt,object,score
10,file1,name1,30.6,30.64,0:00:30,fire,0.914617
11,file1,name1,30.72,30.76,0:00:30,none,0.68788
12,file1,name1,30.84,30.88,0:00:30,fire,0.993345
13,file1,name1,30.96,31,0:00:30,fire,0.991015
14,file1,name1,31.08,31.12,0:00:31,fire,0.983197
15,file1,name1,31.2,31.24,0:00:31,fire,0.979572
16,file1,name1,31.32,31.36,0:00:31,fire,0.985898
17,file1,name1,31.44,31.48,0:00:31,none,0.961606
18,file1,name1,31.56,31.6,0:00:31,none,0.685139
19,file1,name1,31.68,31.72,0:00:31,none,0.458374
20,file1,name1,31.8,31.84,0:00:31,none,0.413711
21,file1,name1,31.92,31.96,0:00:31,none,0.496828
22,file1,name1,32.04,32.08,0:00:32,fire,0.412836
23,file1,name1,32.16,32.2,0:00:32,fire,0.383344'''
with open("a.txt", 'w') as f:
print(data, file=f)
df1 = pd.read_csv("a.txt")
# mark new spell (the start of a series of continuous lines of the same object)
# new spell if the current object is different from the previous object
df1['newspell'] = df1.object != df1.object.shift(1)
# give each spell a unique spell number (equal to the total number of new spell before it)
df1['spellnum'] = df1.newspell.cumsum()
# group lines from the same spell together
spells = df1.groupby(by=["uniqueId", "title", "spellnum", "object"]).agg(
    first_frame = ('frame_num', 'min'),
    last_frame = ('frame_num', 'max'),
    startTime = ('startTime', 'min'),
    endTime = ('endTime', 'max'),
    totalScore = ('score', 'sum'),
    cnt = ('score', 'count')).reset_index()
# drop 'none' spells shorter than 1 second
spells = spells[(spells.object == 'fire') | (spells.endTime > spells.startTime + 1)]
# Now group continuous fire spells into superspells
# mark new superspell
spells['newsuperspell'] = spells.object != spells.object.shift(1)
# give each superspell a unique number
spells['superspellnum'] = spells.newsuperspell.cumsum()
superspells = spells.groupby(by=["uniqueId", "title", "superspellnum", "object"]).agg(
    first_frame = ('first_frame', 'min'),
    last_frame = ('last_frame', 'max'),
    startTime = ('startTime', 'min'),
    endTime = ('endTime', 'max'),
    totalScore = ('totalScore', 'sum'),
    cnt = ('cnt', 'sum')).reset_index()
superspells['score'] = superspells.totalScore/superspells.cnt
superspells.drop(columns=['totalScore', 'cnt'], inplace=True)
print(superspells.to_csv(index=False))
# output
#uniqueId,title,superspellnum,object,first_frame,last_frame,startTime,endTime,score
#file1,name1,1,fire,10,23,30.6,32.2,0.8304779999999999