find duplicate data from CSV and merge them in a specific schema - python

Context:
For video editing purposes, I receive CSVs with shot name, start and end timecode.
The interesting CSV data is formatted this way:
the source in / out needs to be converted to a framerange (like 1-24),
so basically a shot is extracted as sequence('shot','framerange'), like: SQ010,('010','150-1000').
A sequence can have multiple shots, and the same shot can appear multiple times with the same or a different source in & out. An edit (a CSV) can contain multiple sequences.
The problem:
1/ I'm not able to merge duplicate shots with overlapping frameranges in the expected way, that is:
[ ('SQ010', ('010', '0-81'),('010', '10-250') ), ('SQ020', (...) ) ]
and if the cuts of the same shot don't overlap, the result must be:
[ ('SQ010', ('050', '0-65,70-81'),('070', '10-250') ), ('SQ020', (...) ) ]
2/ When I have duplicate shots, I can't run all the scripts we have behind, as they expect unique shots. So I can't (for now) change the expected result formatting.
3/ For some reason, some 'None' values can appear in the list because of duplicate shots with my code. I don't really understand how to avoid them.
What I can do for now:
I'm able to extract sequences, shots and timecodes and convert them, but not to merge duplicate shots' timecodes.
import csv
import re
import sys
import os
from itertools import groupby

# writeClean is the path to the cleaned-up CSV and tcToFrames() converts a
# timecode string to a frame count for a given framerate (both defined elsewhere).
with open(writeClean, 'r') as clearSL:
    reader = csv.DictReader(clearSL)
    data = {}
    for row in reader:
        for header, value in row.items():
            try:
                data[header].append(value)
            except KeyError:
                data[header] = [value]

# extract sequences and shots from the CSV
seqLine = []
shotLine = []
for i in data['Name']:
    if i.startswith('SQ'):
        seq, shot = i.split('_')
        seqLine.append(seq)
        shotLine.append(shot)

fInLine = []
fOutLine = []
# read framerate from the first usable shot  TODO: do it per shot for precision
framerate = int(float(list(set(data['Source FPS']))[0]))
# convert timecode to frames
for TC in data['Source In']:
    fInLine.append(int(round(tcToFrames(TC, framerate))))
for TC in data['Source Out']:
    fOutLine.append(int(round(tcToFrames(TC, framerate))))

fRange = ["{}-{}".format(*i) for i in zip(fInLine, fOutLine)]  # merge first frame and last frame into a string format usable by GCC
shotList = list(map(lambda x, y: (x, y), shotLine, fRange))  # merge shot name with framerange
#TEST - print(shotList)
seqInfo = map(lambda x, y: (x, y), seqLine, shotList)  # merge sequence and shot
#TEST - print(seqInfo)
seqClean = list(set(seqInfo))  # remove duplicates with set()
seqClean.sort()  # order all
#TEST - print(seqClean)
seqList = [(k,) + tuple(shotinfo for _, shotinfo in grp) for k, grp in groupby(seqClean, lambda x: x[0])]  # group shots per sequence
#TEST - print(seqList)
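tcToFrames() is not shown above; for context, a minimal sketch of what such a helper could look like, assuming non-drop-frame 'HH:MM:SS:FF' timecodes (this is an assumption, not the original implementation):
def tcToFrames(tc, framerate):
    # split 'HH:MM:SS:FF' and convert to an absolute frame count
    hours, minutes, seconds, frames = (int(x) for x in tc.split(':'))
    return (hours * 3600 + minutes * 60 + seconds) * framerate + frames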
This gives me this type of result:
[('SQ010', ('050', '0-81'), ('050', '0-65'),(...)]
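What I'm still missing is the per-shot merge of those frameranges. A rough, untested sketch of the behaviour I'm after (assuming the ranges arrive as 'start-end' strings): overlapping cuts collapse into one range, disjoint cuts stay comma-separated, e.g.:
def mergeRanges(ranges):
    # merge overlapping 'start-end' frame ranges of one shot;
    # disjoint cuts are kept and joined with a comma (e.g. '0-65,70-81')
    pairs = sorted(tuple(int(v) for v in r.split('-')) for r in ranges)
    merged = [list(pairs[0])]
    for start, end in pairs[1:]:
        if start <= merged[-1][1]:  # overlaps the previous cut
            merged[-1][1] = max(merged[-1][1], end)
        else:  # disjoint cut, keep it separate
            merged.append([start, end])
    return ",".join("{}-{}".format(s, e) for s, e in merged)
#TEST - mergeRanges(['0-81', '10-250'])  -> '0-250'
#TEST - mergeRanges(['0-65', '70-81'])   -> '0-65,70-81'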
If it helps, I can provide a full CSV.
thank you


DataFrame returns Value Error after adding auto index

This script needs to query the DC server for events. Since this is done live, each time the server is queried, it returns query results of varying lengths. The log file is long and messy, as most logs are. I need to filter only the event names and their codes and then create a DataFrame. Additionally, I need to add a third column that counts the number of times each event took place. I've done most of it but can't figure out how to fix the error I'm getting.
After doing all the filtering from Elasticsearch, I get two lists - action and code - which I have emulated here.
action_list = ['logged-out', 'logged-out', 'logged-out', 'Directory Service Access', 'Directory Service Access', 'Directory Service Access', 'logged-out', 'logged-out', 'Directory Service Access', 'created-process', 'created-process']
code_list = ['4634', '4634', '4634', '4662', '4662', '4662', '4634', '4634', '4662', '4688', '4688']
I then created a list that contains only the codes that need to be filtered out.
event_code_list = ['4662', '4688']
My script is as follows:
import pandas as pd
from collections import Counter
#Create a dict that combines action and code
lists2dict = {}
lists2dict = dict(zip(action_list,code_list))
# print(lists2dict)
#Filter only wanted events
filtered_events = {k: v for k, v in lists2dict.items() if v in event_code_list}
# print(filtered_events)
index = 1 * pd.RangeIndex(start=1, stop=2) #add automatic index to DataFrame
df = pd.DataFrame(filtered_events,index=index)#Create DataFrame from filtered events
#Create Auto Index
count = Counter(df)
action_count = dict(Counter(count))
action_count_values = action_count.values()
# print(action_count_values)
#Convert Columns to Rows and Add Index
new_df = df.melt(var_name="Event",value_name="Code")
new_df['Count'] = action_count_values
print(new_df)
Up until this point, everything works as it should. The problem is what comes next. If there are no events, the script outputs an empty DataFrame. This works fine. However, if there are events, then we should see the events, the codes, and the number of times each event occurred. The problem is that it always outputs 1. How can I fix this? I'm sure it's something ridiculous that I'm missing.
#If no alerts, create empty DataFrame
if new_df.empty:
    empty_df = pd.DataFrame(columns=['Event','Code','Count'])
    empty_df['Event'] = ['-']
    empty_df['Code'] = ['-']
    empty_df['Count'] = ['-']
    html = empty_df.to_html()
    with open('alerts.html', 'w') as f:
        f.write(html)
else: #else, output alerts + codes + count
    html = new_df.to_html()
    with open('alerts.html', 'w') as f:
        f.write(html)
Any help is appreciated.
It is because you are collecting the result as a dictionary - the repeated records are ignored. You lost the record count here: lists2dict = dict(zip(action_list,code_list)).
You can do all these operations very easily on a dataframe. Just construct a pandas dataframe from the given lists, then filter by code, group by, and aggregate as count:
df = pd.DataFrame({"Event": action_list, "Code": code_list})
df = df[df.Code.isin(event_code_list)] \
    .groupby(["Event", "Code"]) \
    .agg(Count=("Code", len)) \
    .reset_index()
print(df)
Output:
                      Event  Code  Count
0  Directory Service Access  4662      4
1           created-process  4688      2
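If you still need the alerts.html output from the question, the empty-frame branch can be kept around the new df in roughly the same shape as before (a sketch reusing the question's file name and '-' placeholders):
if df.empty:
    df = pd.DataFrame([['-', '-', '-']], columns=['Event', 'Code', 'Count'])
with open('alerts.html', 'w') as f:
    f.write(df.to_html(index=False))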

Looking for a more elegant and sophisticated solution when multiple if and for-loop are used

I am a beginner/intermediate Python user, and when I write elaborate code (at least for me), I always try to rewrite it to reduce the number of lines where possible.
Here is the code I have written.
It basically reads all values of one data frame looking for a specific string; if the string is found, it saves the index and value in a dictionary and drops the rows where that string was found. And then the same with the next string...
##### Reading CSV file values and looking for variant IDs ######
# Find Variant ID (rs000000) in CSV
# \d+ is necessary in case a line contains rs+something. rs\d+ looks for rs + numbers
rs = df_draft[df_draft.apply(lambda x: x.str.contains(r"rs\d+"))].dropna(how='all').dropna(axis=1, how='all')
# Now, we save the results found in a dict: key=index, value=variant ID
if not rs.empty:
    ind = rs.index.to_list()
    vals = list(rs.stack().values)
    row2rs = dict(zip(ind, vals))
    print(row2rs)
    # We need to remove the rows where rs has been found,
    # because if more than one ID variant is found in the same row (i.e. rs# and NM_#)
    # this code would pick up the same variant more than once.
    for index, rs in row2rs.items():
        # Rows where substring 'rs' has been found are dropped from df_draft to avoid repetition
        df_draft = df_draft.drop(index)

## Same thing with the other ID variants
# Here with Variant ID (NM_0000000) in CSV
NM = df_draft[df_draft.apply(lambda x: x.str.contains(r"NM_\d+"))].dropna(how='all').dropna(axis=1, how='all')
if not NM.empty:
    ind = NM.index.to_list()
    vals = list(NM.stack().values)
    row2NM = dict(zip(ind, vals))
    print(row2NM)
    for index, NM in row2NM.items():
        df_draft = df_draft.drop(index)

# Here with Variant ID (NP_0000000) in CSV
NP = df_draft[df_draft.apply(lambda x: x.str.contains(r"NP_\d+"))].dropna(how='all').dropna(axis=1, how='all')
if not NP.empty:
    ind = NP.index.to_list()
    vals = list(NP.stack().values)
    row2NP = dict(zip(ind, vals))
    print(row2NP)
    for index, NP in row2NP.items():
        df_draft = df_draft.drop(index)

# Here with the ClinVar field (RCV#) in CSV
RCV = df_draft[df_draft.apply(lambda x: x.str.contains(r"RCV\d+"))].dropna(how='all').dropna(axis=1, how='all')
if not RCV.empty:
    ind = RCV.index.to_list()
    vals = list(RCV.stack().values)
    row2RCV = dict(zip(ind, vals))
    print(row2RCV)
    for index, RCV in row2RCV.items():
        df_draft = df_draft.drop(index)
I was wondering about a more elegant way of writing this simple but long code.
I have been thinking of sa
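One way to shorten this is to loop over the patterns and collect each result in a dictionary keyed by pattern name. A rough sketch (assuming df_draft already exists and, as in the code above, all of its columns hold strings; the pattern names are just labels):
patterns = {
    "rs": r"rs\d+",
    "NM": r"NM_\d+",
    "NP": r"NP_\d+",
    "RCV": r"RCV\d+",
}

results = {}
for name, pattern in patterns.items():
    hits = df_draft[df_draft.apply(lambda col: col.str.contains(pattern))] \
        .dropna(how='all').dropna(axis=1, how='all')
    if not hits.empty:
        # key=row index, value=ID found in that row (same shape as row2rs etc. above)
        results[name] = dict(zip(hits.index.to_list(), list(hits.stack().values)))
        # drop the matched rows so the same row is not picked up again by a later pattern
        df_draft = df_draft.drop(hits.index)
print(results)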

Drop rows in dataframe if length of the name columns <=1

Please point out where I am going wrong, or whether this is a duplicate of another question.
I have 11 columns in my table. I am loading data from a Ceph (AWS) bucket to Postgres, and while doing that I have to filter the data with the conditions below before inserting it into Postgres:
1. Drop the entire row if there are any empty/null values in any column.
2. First name and last name should have more than a single letter. E.g. if first name = A or last name = P, or both, the entire record/row should be dropped.
3. Zip code should be 5 digits or greater, max 7 digits.
4. First name and last name should not have [Jr, Sr, I, II, etc.] in them, or the entire record should be dropped.
I have managed to execute the first step (I am new to pandas), but I was blocked at the next step, and I believe a solution for step 2 might also help me solve step 3. While doing a quick search on Google, I found that I might be complicating the process by using chunks and might have to use 'concat' to apply it to all chunks, or maybe I am wrong, but I am dealing with a huge amount of data and using chunks helps me load the data into Postgres faster.
I am going to paste my code here and mention what I tried, what the output was, and what the expected output would be.
What I tried:
columns = [
    'cust_last_nm',
    'cust_frst_nm',
    'cust_brth_dt',
    'cust_gendr_cd',
    'cust_postl_cd',
    'indiv_entpr_id',
    'TOKEN_1',
    'TOKEN_2',
    'TOKEN_3',
    'TOKEN_4',
    'TOKEN_KEY'
]

def push_to_pg_weekly(key):
    vants = []
    print(key)
    key = _download_s3(key)
    how_many_files_pushed.append(True)
    s = sp.Popen(["wc", "-l", key], stdout=sp.PIPE)
    a, b = s.communicate()
    total_rows = int(a.split()[0])
    rows = 0
    data = pd.read_csv(key, sep="|", header=None, chunksize=100000)
    for chunk in data:
        rows += len(chunk)
        print("Processed rows: ", (float(rows)/total_rows)*100)
        chunk = chunk.dropna(axis=0)  # step 1: drop the rows where at least one element is missing
        index_names = chunk[(len(chunk[0]) <= 1) | (len(chunk[1]) <= 1)].index  # step 2
        chunk.drop(index_names, axis=0)
        chunk.to_csv("/tmp/sample.csv", sep="|", header=None, index=False)
        connection = psycopg2.connect(user=os.environ.get("DATABASE_USER", "USERNAME"),
                                      password=os.environ.get("DATABASE_PASS", "PASSWORD"),
                                      host=os.environ.get("DATABASE_HOST", "cvlpsql.pgsql.com"),
                                      port=5432,
                                      dbname=os.environ.get("DATABASE_NAME", "cvlpsql_db"),
                                      options="-c search_path=DATAVANT_O")
        with connection.cursor() as cursor:
            cursor.copy_from(open('/tmp/sample.csv'), "COVID1", sep='|')
        connection.commit()

def push_to_pg():
    paginator = CLIENT.get_paginator('list_objects')
    pages = paginator.paginate(Bucket=bucket)
    for page in pages:
        if "Contents" in page:
            for obj in page["Contents"]:
                if obj['Key'].startswith('test/covid-2020-11-10-175213') and (obj['Key'].endswith('.txt') or obj['Key'].endswith('.csv')):
                    push_to_pg_weekly(obj['Key'])
                    os.remove(obj['Key'])
    return
Data:
john|doe|1974-01-01|F|606.0|113955973|cC80fi6kHVjKRNgUnATuE8Nn5x/YyoTUdSDY3sUDis4=|2qalDHguJRO9gR66LZcRLSe2SSQSQAIcT9btvaqLnZk=|eLQ9vYAj0aUfMM9smdpXqIh7QRxLBh6wYl6iYkItz6g=|3ktelRCCKf1CHOVBUdaVbjqltxa70FF+9Lf9MNJ+HDU=|cigna_TOKEN_ENCRYPTION_KEY
j|ab|1978-01-01|M|328.0|125135976|yjYaupdG9gdlje+2HdQB+FdEEj6Lh4+WekqEuB1DSvM=|j8VuTUKll7mywqsKrqBnomppGutsoJAR+2IoH/Tq0b8=|6qNP9ch57MlX912gXS7RMg7UfjtaP6by/cR68PbzNmQ=|R5DemSNrFvcevijrktwf3aixOShNU6j7wfahyKeUyzk=|cigna_TOKEN_ENCRYPTION_KEY
j|j|1985-01-01|F|105.0|115144390|fn0r8nVzmDJUihnaQh1SXm1sLOIjzGsPDBskdX4/b+0=|Fh6facONoOiL9hCCA8Q1rtUp9n5h9VBhg2IaX9gjaKI=|NWtnZegpcpgcit2u063zQv3pcEhk4bpKHKFa9hW7LtU=|P3cVOUd6PyYN5tKezdMkVDI62aW8dv+bjIwKtAgX3OM=|cigna_TOKEN_ENCRYPTION_KEY
jh|on|1989-01-01|M|381.0|133794239|PvCWdh+ucgi1WyP5Vr0E6ysTrTZ1gLTQIteXDxZbEJg=|7K3RsfC8ItQtrEQ+MdBGpx6neggYvBvR8nNDMOBTRtU=|nHsF/rJFM/O+HPevTj9cVYwrXS1ou+2/4FelEXTV0Ww=|Jw/nzI/Gu9s6QsgtxTZhTFFBXGLUv06vEewxQbhDyWk=|cigna_TOKEN_ENCRYPTION_KEY
||1969-01-01|M|926.0|135112782|E2sboFz4Mk2aGIKhD4vm6J9Jt3ZSoSdLm+0PCdWsJto=|YSILMFS5sPPZZF/KFroEHV77z1bMeiL/f4FqF2kj4Xc=|tNjgnby5zDbfT2SLsCCwhNBxobSDcCp7ws0zYVme5w4=|kk25p0lrp2T54Z3B1HM3ZQN0RM63rjqvewrwW5VhYcI=|cigna_TOKEN_ENCRYPTION_KEY
||1978-01-01|M|70.0|170737333|Q8NDJz563UrquOUUz0vD6Es05vIaAD/AfVOef4Mhj24=|k5Q02GVd0nJ6xMs1vHVM24MxV6tZ46HJNKoePcDsyoM=|C9cvHz5n+sDycUecioiWZW8USE6D2dli5gRzo4nOyvY=|z4eNSVNDAjiPU2Sw3VY+Ni1djO5fptl5FGQvfnBodr4=|cigna_TOKEN_ENCRYPTION_KEY
||1996-01-01|M|840.0|91951973|Y4kmxp0qdZVCW5pJgQmvWCfc4URg9oFnv2DWGglfQKM=|RJfyDYJjwuZ1ZDjP+5PA5S2fLS6llFD51Lg+uJ84Tus=|+PXzrKt7O79FehSnL3Q8EjGmnyZVDUfdM4zzHk1ghOY=|gjyVKjunky2Aui3dxzmeLt0U6+vT39/uILMbEiT0co8=|cigna_TOKEN_ENCRYPTION_KEY
||1960-01-01|M|180.0|64496569|80e1CgNJeO8oYQHlSn8zWYL4vVrHSPe9AnK2T2PrdII=|bJl7veT+4MlU4j2mhFpFyins0xeCFWeaA30JUzWsfqo=|0GuhUfbS4xCnCj2ms43wqmGFG5lCnfiIQdyti9moneM=|lq84jO9yhz8f9/DUM0ACVc/Rp+sKDvHznVjNnLOaRo4=|cigna_TOKEN_ENCRYPTION_KEY
||1963-01-01|M|310.0|122732991|zEvHkd5AVT7hZFR3/13dR9KzN5WSulewY0pjTFEov2Y=|eGqNbLoeCN1GJyvgaa01w+z26OtmplcrAY2vxwOZ4Y4=|6q9DPLPK5PPAItZA/x253DvdAWA/r6zIi0dtIqPIu2g=|lOl11DhznPphGQOFz6YFJ8i28HID1T6Sg7B/Y7W1M3o=|cigna_TOKEN_ENCRYPTION_KEY
||2001-01-01|F|650.0|43653178|vv/+KLdhHqUm13bWhpzBexwxgosXSIzgrxZIUwB7PDo=|78cJu1biJAlMddJT1yIzQAH1KCkyDoXiL1+Lo1I2jkw=|9/BM/hvqHYXgfmWehPP2JGGuB6lKmfu7uUsmCtpPyz8=|o/yP8bMzFl6KJ1cX+uFll1SrleCC+8BXmqBzyuGdtwM=|cigna_TOKEN_ENCRYPTION_KEY
output - data inserted into postgresDB:
john|doe|1974-01-01|F|606.0|113955973|cC80fi6kHVjKRNgUnATuE8Nn5x/YyoTUdSDY3sUDis4=|2qalDHguJRO9gR66LZcRLSe2SSQSQAIcT9btvaqLnZk=|eLQ9vYAj0aUfMM9smdpXqIh7QRxLBh6wYl6iYkItz6g=|3ktelRCCKf1CHOVBUdaVbjqltxa70FF+9Lf9MNJ+HDU=|cigna_TOKEN_ENCRYPTION_KEY
j|ab|1978-01-01|M|328.0|125135976|yjYaupdG9gdlje+2HdQB+FdEEj6Lh4+WekqEuB1DSvM=|j8VuTUKll7mywqsKrqBnomppGutsoJAR+2IoH/Tq0b8=|6qNP9ch57MlX912gXS7RMg7UfjtaP6by/cR68PbzNmQ=|R5DemSNrFvcevijrktwf3aixOShNU6j7wfahyKeUyzk=|cigna_TOKEN_ENCRYPTION_KEY
j|j|1985-01-01|F|105.0|115144390|fn0r8nVzmDJUihnaQh1SXm1sLOIjzGsPDBskdX4/b+0=|Fh6facONoOiL9hCCA8Q1rtUp9n5h9VBhg2IaX9gjaKI=|NWtnZegpcpgcit2u063zQv3pcEhk4bpKHKFa9hW7LtU=|P3cVOUd6PyYN5tKezdMkVDI62aW8dv+bjIwKtAgX3OM=|cigna_TOKEN_ENCRYPTION_KEY
jh|on|1989-01-01|M|381.0|133794239|PvCWdh+ucgi1WyP5Vr0E6ysTrTZ1gLTQIteXDxZbEJg=|7K3RsfC8ItQtrEQ+MdBGpx6neggYvBvR8nNDMOBTRtU=|nHsF/rJFM/O+HPevTj9cVYwrXS1ou+2/4FelEXTV0Ww=|Jw/nzI/Gu9s6QsgtxTZhTFFBXGLUv06vEewxQbhDyWk=|cigna_TOKEN_ENCRYPTION_KEY
Expected Output:
john|doe|1974-01-01|F|606.0|113955973|cC80fi6kHVjKRNgUnATuE8Nn5x/YyoTUdSDY3sUDis4=|2qalDHguJRO9gR66LZcRLSe2SSQSQAIcT9btvaqLnZk=|eLQ9vYAj0aUfMM9smdpXqIh7QRxLBh6wYl6iYkItz6g=|3ktelRCCKf1CHOVBUdaVbjqltxa70FF+9Lf9MNJ+HDU=|cigna_TOKEN_ENCRYPTION_KEY
jh|on|1989-01-01|M|381.0|133794239|PvCWdh+ucgi1WyP5Vr0E6ysTrTZ1gLTQIteXDxZbEJg=|7K3RsfC8ItQtrEQ+MdBGpx6neggYvBvR8nNDMOBTRtU=|nHsF/rJFM/O+HPevTj9cVYwrXS1ou+2/4FelEXTV0Ww=|Jw/nzI/Gu9s6QsgtxTZhTFFBXGLUv06vEewxQbhDyWk=|cigna_TOKEN_ENCRYPTION_KEY
Any answers/comments will be very much appreciated, thank you.
The fastest way to do operations like this in pandas is through numpy.where.
E.g. for string length:
data = data[np.where((data['cust_last_nm'].str.len() > 1) &
                     (data['cust_frst_nm'].str.len() > 1), True, False)]
Note: you can add the postal code condition in the same way. By default the postal codes in your data will be read in as floats, so cast them to string first, and then set the length limits:
## string length & postal code conditions together
data = data[np.where((data['cust_last_nm'].str.len() > 1) &
                     (data['cust_frst_nm'].str.len() > 1) &
                     (data['cust_postl_cd'].astype('str').str.len() > 4) &
                     (data['cust_postl_cd'].astype('str').str.len() < 8),
                     True, False)]
EDIT:
Since you are working in chunks, change data to chunk and put this inside your loop. Also, since you don't read in headers (header=None), change the column names to their index values. And convert all values to strings before comparison, since otherwise NaN columns will be treated as floats, e.g.:
chunk = chunk[np.where((chunk[0].astype('str').str.len() > 1) &
                       (chunk[1].astype('str').str.len() > 1) &
                       (chunk[5].astype('str').str.len() > 4) &
                       (chunk[5].astype('str').str.len() < 8), True, False)]
Create a new column in the dataframe with a value for the length:
df['name_length'] = df.name.str.len()
Index using the new column:
df = df[df.name_length > 1]
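Neither snippet above covers condition 4 (no Jr/Sr/I/II in the name columns); it could be added with the same boolean-mask style. A rough sketch with a toy header-less frame shaped like the question's chunks (the suffix list is illustrative, not exhaustive):
import pandas as pd

# col 0 = last name, col 1 = first name, as in the pipe-delimited sample data
chunk = pd.DataFrame([["doe", "john"], ["doe Jr", "john"], ["on", "jh"]])

suffix_pattern = r"\b(?:Jr|Sr|III|II|I)\b"  # suffixes to reject, per condition 4

mask = ~(chunk[0].astype(str).str.contains(suffix_pattern, case=False, regex=True) |
         chunk[1].astype(str).str.contains(suffix_pattern, case=False, regex=True))
chunk = chunk[mask]
print(chunk)  # the "doe Jr" row is dropped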

How to merge continuous lines of a csv file

I have a csv file that carries the outputs of some processes over video frames. In the file, each line is either fire or none. Each line has a startTime and an endTime. Now I need to cluster them and print only one instance per run of continuous fires, with its start and end time. The point is that a few none lines in the middle can also be tolerated if their time span is within 1 second. So, to be clear, the whole point is to cluster detections of nearby frames together... somehow smooth out the results. Instead of multiple lines 31-32, 32-33, ..., have a single line with 31-35 seconds.
How can I do that?
For instance, all of the following continuous items are considered a single one, since the none gap is within 1 s. So we would have something like 1,file1,name1,30.6,32.2,fire,0.83, with the score being the mean of all the fire lines.
frame_num,uniqueId,title,startTime,endTime,startTime_fmt,object,score
...
10,file1,name1,30.6,30.64,0:00:30,fire,0.914617
11,file1,name1,30.72,30.76,0:00:30,none,0.68788
12,file1,name1,30.84,30.88,0:00:30,fire,0.993345
13,file1,name1,30.96,31,0:00:30,fire,0.991015
14,file1,name1,31.08,31.12,0:00:31,fire,0.983197
15,file1,name1,31.2,31.24,0:00:31,fire,0.979572
16,file1,name1,31.32,31.36,0:00:31,fire,0.985898
17,file1,name1,31.44,31.48,0:00:31,none,0.961606
18,file1,name1,31.56,31.6,0:00:31,none,0.685139
19,file1,name1,31.68,31.72,0:00:31,none,0.458374
20,file1,name1,31.8,31.84,0:00:31,none,0.413711
21,file1,name1,31.92,31.96,0:00:31,none,0.496828
22,file1,name1,32.04,32.08,0:00:32,fire,0.412836
23,file1,name1,32.16,32.2,0:00:32,fire,0.383344
This is my attempt so far:
with open(filename) as fin:
    lastWasFire = False
    for line in fin:
        if "fire" in line:
            if lastWasFire == False and line != "" and line.split(",")[5] != lastline.split(",")[5]:
                fout.write(line)
        else:
            lastWasFire = False
        lastline = line
I assume you don't want to use external libraries for data processing like numpy or pandas. The following code should be quite similar to your attempt:
threshold = 1.0

# We will chain a "none" object at the end which triggers the threshold
# to make sure no "fire" objects are left unprinted
from itertools import chain
trigger = (",,,0,{},,none,".format(threshold + 1),)

# Keys for columns of input data
keys = (
    "frame_num",
    "uniqueId",
    "title",
    "startTime",
    "endTime",
    "startTime_fmt",
    "object",
    "score",
)

# Store last "fire" or "none" objects
last = {
    "fire": [],
    "none": [],
}

with open(filename) as f:
    # Skip first line of input file
    next(f)
    for line in chain(f, trigger):
        line = dict(zip(keys, line.split(",")))
        last[line["object"]].append(line)
        # Check threshold for "none" objects if there are previous unprinted "fire" objects
        if line["object"] == "none" and last["fire"]:
            if float(last["none"][-1]["endTime"]) - float(last["none"][0]["startTime"]) > threshold:
                print("{},{},{},{},{},{},{},{}".format(
                    last["fire"][0]["frame_num"],
                    last["fire"][0]["uniqueId"],
                    last["fire"][0]["title"],
                    last["fire"][0]["startTime"],
                    last["fire"][-1]["endTime"],
                    last["fire"][0]["startTime_fmt"],
                    last["fire"][0]["object"],
                    sum([float(x["score"]) for x in last["fire"]]) / len(last["fire"]),
                ))
                last["fire"] = []
        # Previous "none" objects don't matter anymore as soon as a "fire" object is encountered
        if line["object"] == "fire":
            last["none"] = []
The input file is being processed line by line and "fire" objects are being accumulated in last["fire"]. They will be merged and printed if either
the "none" objects in last["none"] reach the threshold defined in threshold
or when the end of the input file is reached due to the manually chained trigger object, which is a "none" object of length threshold + 1, therefore triggering the threshold and subsequent merge and print.
You could replace print with a call to write into an output file, of course.
This is close to what you are looking for and may be an acceptable alternative.
If your sample rate is quite stable (it looks to be about 0.12 s per frame, i.e. roughly 8 Hz), then you can work out the equivalent number of samples you can tolerate to be 'none'. Let's say that's 8.
This code will read in the data and fill the 'none' values with the last valid value, for up to 8 consecutive samples.
import numpy as np
import pandas as pd

def groups_of_true_values(x):
    """Returns array of integers where each True value in x
    is replaced by the count of the group of consecutive
    True values that it was found in.
    """
    return (np.diff(np.concatenate(([0], np.array(x, dtype=int)))) == 1).cumsum() * x

df = pd.read_csv('test.csv', index_col=0)
# Forward-fill the 'none' values to a limit
df['filled'] = df['object'].replace('none', None).fillna(method='ffill', limit=8)
# Find the groups of consecutive fire values
df['group'] = groups_of_true_values(df['filled'] == 'fire')
# Produce sum of scores by group
group_scores = df[['group', 'score']].groupby('group').sum()
print(group_scores)
# Find firing start and stop times
df['start'] = ((df['filled'] == 'fire') & (df['filled'].shift(1) == 'none'))
df['stop'] = ((df['filled'] == 'none') & (df['filled'].shift(1) == 'fire'))
start_times = df.loc[df['start'], 'startTime'].to_list()
stop_times = df.loc[df['stop'], 'startTime'].to_list()
print(start_times, stop_times)
Output:
           score
group
1      10.347362
[] []
Hopefully, the output would be more interesting if there were longer sequences of no firing...
My approach, using pandas and groupby:
Combine continuous lines of the same object (fire or none) into a spell
Drop none-fire spells with duration less than 1 second
Combine continuous series of spells of the same object (fire or none) into a superspell, and calculate the corresponding score
I assume the data is sorted by time (otherwise we need to add a sort after reading the data). The trick to combining continuous lines of the same object into spells/superspells is: first, identify where a new spell/superspell starts (i.e. when the object type changes), and second, assign a unique id to each spell (equal to the number of new spells before it).
import pandas as pd
# preparing the test data
data = '''frame_num,uniqueId,title,startTime,endTime,startTime_fmt,object,score
10,file1,name1,30.6,30.64,0:00:30,fire,0.914617
11,file1,name1,30.72,30.76,0:00:30,none,0.68788
12,file1,name1,30.84,30.88,0:00:30,fire,0.993345
13,file1,name1,30.96,31,0:00:30,fire,0.991015
14,file1,name1,31.08,31.12,0:00:31,fire,0.983197
15,file1,name1,31.2,31.24,0:00:31,fire,0.979572
16,file1,name1,31.32,31.36,0:00:31,fire,0.985898
17,file1,name1,31.44,31.48,0:00:31,none,0.961606
18,file1,name1,31.56,31.6,0:00:31,none,0.685139
19,file1,name1,31.68,31.72,0:00:31,none,0.458374
20,file1,name1,31.8,31.84,0:00:31,none,0.413711
21,file1,name1,31.92,31.96,0:00:31,none,0.496828
22,file1,name1,32.04,32.08,0:00:32,fire,0.412836
23,file1,name1,32.16,32.2,0:00:32,fire,0.383344'''
with open("a.txt", 'w') as f:
print(data, file=f)
df1 = pd.read_csv("a.txt")
# mark new spell (the start of a series of continuous lines of the same object)
# new spell if the current object is different from the previous object
df1['newspell'] = df1.object != df1.object.shift(1)
# give each spell a unique spell number (equal to the total number of new spell before it)
df1['spellnum'] = df1.newspell.cumsum()
# group lines from the same spell together
spells = df1.groupby(by=["uniqueId", "title", "spellnum", "object"]).agg(
    first_frame=('frame_num', 'min'),
    last_frame=('frame_num', 'max'),
    startTime=('startTime', 'min'),
    endTime=('endTime', 'max'),
    totalScore=('score', 'sum'),
    cnt=('score', 'count')).reset_index()
# remove none-fire spells with duration less than 1
spells = spells[(spells.object == 'fire') | (spells.endTime > spells.startTime + 1)]
# Now group conitnous fire spells into superspells
# mark new superspell
spells['newsuperspell'] = spells.object != spells.object.shift(1)
# give each superspell a unique number
spells['superspellnum'] = spells.newsuperspell.cumsum()
superspells = spells.groupby(by=["uniqueId", "title", "superspellnum", "object"]).agg(
    first_frame=('first_frame', 'min'),
    last_frame=('last_frame', 'max'),
    startTime=('startTime', 'min'),
    endTime=('endTime', 'max'),
    totalScore=('totalScore', 'sum'),
    cnt=('cnt', 'sum')).reset_index()
superspells['score'] = superspells.totalScore/superspells.cnt
superspells.drop(columns=['totalScore', 'cnt'], inplace=True)
print(superspells.to_csv(index=False))
# output
#uniqueId,title,superspellnum,object,first_frame,last_frame,startTime,endTime,score
#file1,name1,1,fire,10,23,30.6,32.2,0.8304779999999999

group a bunch of files by day

I have a bunch of files containing atmospheric measurements in one directory. The file format is NetCDF. Each file has a timestamp (variable 'base_time'). I can read all the files and plot individual measurement events (temperature vs. altitude).
What I need to do next is "group the files by day" and plot all measurements taken on a single day together in one plot. Unfortunately I have no clue how to do that.
One idea is to use the variable 'measurement_day' as it is defined in the code below.
For each day I normally have four different files containing temperature and altitude.
Ideally the data of those four files should be grouped (e.g. for plotting).
I hope my question is clear. Can anyone please help me?
EDIT: I am trying to use a dictionary now, but I have trouble determining whether an entry already exists for a given measurement day. Please see the edited code below.
from os import listdir
from os.path import isfile, join
from time import gmtime, strftime
from netCDF4 import Dataset

data = {}  # was edited
for f in listdir(path):  # path is the directory holding the NetCDF files
    if isfile(join(path, f)):
        full_path = join(path, f)
        f = Dataset(full_path, 'r')
        basetime = f.variables['base_time'][:]
        altitude = f.variables['alt'][:]
        temp = f.variables['tdry'][:]
        actual_date = strftime("%Y-%m-%d %H:%M:%S", gmtime(basetime))
        measurement_day = strftime("%Y-%m-%d", gmtime(basetime))
        # check if dict entries for this day already exist, if not create an empty dict
        # and lists inside
        if len(data[measurement_day]) == 0:
            data[measurement_day] = {}
        else:
            pass
        if len(data[measurement_day]['temp']) == 0:
            data[measurement_day]['temp'] = []
            data[measurement_day]['altitude'] = []
        else:
            pass
I get the following error message:
Traceback (most recent call last):... if len(data[measurement_day]) == 0:
KeyError: '2009/05/28'
Can anyone please help me.
I will try. Though I'm not totally clear on what you already have.
I can read all files and plot individual measurement events
(temperature vs. altitude). What I need to do next is "group the files
by day" and plot all measurements taken at one single day together in
one plot.
From this, I am assuming that you know how to plot the information given a list of Datasets. To get that list of Datasets, try something like this.
from netCDF4 import Dataset

# a dictionary of lists that hold all the datasets from a given day
grouped_datasets = {}
for f in listdir(path):
    if isfile(join(path, f)):
        full_path = join(path, f)
        f = Dataset(full_path, 'r')
        basetime = f.variables['base_time'][:]
        altitude = f.variables['alt'][:]
        temp = f.variables['tdry'][:]
        actual_date = strftime("%Y-%m-%d %H:%M:%S", gmtime(basetime))
        measurement_day = strftime("%Y-%m-%d", gmtime(basetime))
        # if we haven't encountered any datasets from this day yet...
        if measurement_day not in grouped_datasets:
            # add that day to our dict
            grouped_datasets[measurement_day] = []
        # now append our dataset to the correct day (list)
        grouped_datasets[measurement_day].append(f)
Now you have a dictionary keyed on measurement_day. I'm not sure how you are graphing your data, so this is as far as I can get you. Hope it helps, good luck.
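For the plotting step itself, one possible pattern (a sketch assuming matplotlib, and that each dataset is plotted as temperature vs. altitude using the 'tdry' and 'alt' variables from the question):
import matplotlib.pyplot as plt

for day, datasets in grouped_datasets.items():
    fig, ax = plt.subplots()
    for ds in datasets:
        # one line per file, all files of the same day in one figure
        ax.plot(ds.variables['tdry'][:], ds.variables['alt'][:])
    ax.set_xlabel('temperature (tdry)')
    ax.set_ylabel('altitude (alt)')
    ax.set_title(day)
    fig.savefig('profiles_{}.png'.format(day))
    plt.close(fig)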
