I have a list of elements like this:
new_element = {'start':start, 'end':end, 'category':cat, 'value': val}
Now, I want to append it to a list only if there's no other element that already contains this new element (checking by start, end and category).
Also, if this element contains an element that is already in the list, I want to add it and delete the old one.
To sum up, I don't want nested elements and I only want to keep the larger one.
What I have so far (id is category):
nested = False
for ir in irs[:]:
    # is the new element nested in another?
    if ir['category'] == ir_new['category'] and ir['start'] <= ir_new['start'] and ir['end'] >= ir_new['end']:
        nested = True
    # is another element nested in this one?
    if ir['category'] == ir_new['category'] and ir['start'] >= ir_new['start'] and ir['end'] <= ir_new['end']:
        irs.remove(ir)
if not nested:
    # append it to the list
    irs.append(ir_new)
    found += 1
This works, but I think it's O(n*n). Maybe there's a more efficient way to do it using dicts or pandas.
Some thoughts:
Should I do it before appending or append all and then check?
UPDATE 1:
There is an implementation of an interval tree in this lib; the only issue is that it is not possible to delete intervals once added.
http://bx-python.readthedocs.io/en/latest/lib/bx.intervals.intersection.html#bx.intervals.intersection.IntervalTree
UPDATE 2:
https://github.com/chaimleib/intervaltree is interesting; the thing is that I could not find a way to discard partial overlaps while keeping the rest. I only need full overlaps / nesting.
rough draft:
define a class for your "ir" items, with __lt__ using 'start'
have a master dict with category as the key
store a sorted list of items with that category (bisect)
when you find the insert position based on start time, you can start comparing
end times until you find an "ir" item which should not be deleted.
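A minimal sketch of that draft using the stdlib bisect module (the IR class and the insert helper are names invented here, not from the question):

```python
import bisect

class IR:
    """One interval item; __lt__ orders by 'start' so bisect can be used."""
    def __init__(self, start, end, category):
        self.start, self.end, self.category = start, end, category

    def __lt__(self, other):
        return self.start < other.start

def insert(by_cat, new):
    """Add `new` unless an existing same-category interval contains it;
    delete any existing intervals that `new` contains. Returns True if added."""
    items = by_cat.setdefault(new.category, [])  # master dict: one sorted list per category
    hi = bisect.bisect_right(items, new)         # items[:hi] have start <= new.start
    if any(ir.end >= new.end for ir in items[:hi]):
        return False                             # new is nested in an existing item
    j = bisect.bisect_left(items, new)           # items[j:] have start >= new.start
    while j < len(items) and items[j].start <= new.end:
        if items[j].end <= new.end:
            del items[j]                         # nested inside new: drop the old one
        else:
            j += 1
    bisect.insort(items, new)
    return True
```

Note the containment scan over `items[:hi]` is still linear in the worst case; the bisect only narrows the candidates to one side of the insert position.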
By using the pandas library and some coding I was able to get a decent solution.
Initialize
...
df = pd.DataFrame(columns=['start','end','seq','record','len','ir_1','ir_2'])
Add
...
with l_lock:
    new_element = [ir_start, ir_end, ir_seq, record.id, ir_len, seq_q, seq_q_prime]
    df.loc[len(df)] = new_element
Delete dups
...
for idx, row in df.iterrows():
    res = df[(df.index != idx) & (df.start >= row.start) & (df.end <= row.end)]
    df.drop(res.index, inplace=True)
As suggested in some comments, an interval tree was also a possible solution, but I could not get it to work.
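For reference, here is a consolidated, runnable sketch of the pandas steps above, cut down to the columns the containment test needs; the `add_row` helper and the sample intervals are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame(columns=['start', 'end', 'category'])

def add_row(df, start, end, cat):
    # Skip the new interval if an existing same-category row contains it.
    same = df[df.category == cat]
    if ((same.start <= start) & (same.end >= end)).any():
        return df
    # Drop existing same-category rows that the new interval contains.
    keep = ~((df.category == cat) & (df.start >= start) & (df.end <= end))
    row = pd.DataFrame([{'start': start, 'end': end, 'category': cat}])
    return pd.concat([df[keep], row], ignore_index=True)

df = add_row(df, 5, 10, 'A')
df = add_row(df, 6, 8, 'A')   # nested in (5, 10): not added
df = add_row(df, 4, 12, 'A')  # contains (5, 10): replaces it
df = add_row(df, 6, 8, 'B')   # different category: kept
```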
Related
I was practicing my grip on dictionaries and then I came across this problem.
I created a dictionary with a few keys and some keys have multiple values which were assigned using lists.
eg:
mypouch = {
    'writing_stationery': ['Pens', 'Pencils'],
    'gadgets': ['calculator', 'Watch'],
    'Documents': ['ID Card', 'Hall Ticket'],
    'Misc': ['Eraser', 'Sharpener', 'Sticky Notes']
}
I want to delete a specific item 'Eraser'.
Usually I could just use pop or the del statement, but here the element is inside a list.
I want the output for Misc to be 'Misc': ['Sharpener', 'Sticky Notes'].
What are the possible solutions for this kind of a problem?
You can do:
mypouch['Misc'].remove('Eraser')
Or, use a set rather than a list:
for k in mypouch:
    mypouch[k] = set(mypouch[k])
Then removing an element is easy and O(1) on average, using the same remove call as for the list (note that a set does not preserve order).
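For instance, with the same pouch (note that converting to sets drops the list order):

```python
mypouch = {
    'writing_stationery': ['Pens', 'Pencils'],
    'Misc': ['Eraser', 'Sharpener', 'Sticky Notes'],
}

# Convert each value list to a set once...
mypouch = {k: set(v) for k, v in mypouch.items()}

# ...after which removal is an average O(1) operation:
mypouch['Misc'].remove('Eraser')
```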
Here are 4 ways to do it. The best is remove, but maybe you want to see different approaches.
You can use filter:
mypouch['Misc'] = list(filter(lambda x: x!='Eraser', mypouch['Misc']))
or a list comprehension with an if:
mypouch['Misc'] = [x for x in mypouch['Misc'] if x != 'Eraser']
or the built-in list method remove:
mypouch['Misc'].remove('Eraser')
or a pop and index combo:
mypouch['Misc'].pop(mypouch['Misc'].index('Eraser'))
I have lists of items:
['MRS_103_005_010_BG_001_v001',
'MRS_103_005_010_BG_001_v002',
'MRS_103_005_010_FG_001_v001',
'MRS_103_005_010_FG_001_v002',
'MRS_103_005_010_FG_001_v003',
'MRS_103_005_020_BG_001_v001',
'MRS_103_005_020_BG_001_v002',
'MRS_103_005_020_BG_001_v003']
I need to identify the latest version of each item and store it to a new list. Having trouble with my logic.
Based on how this has been built I believe I need to first compare the indices to each other. If I find a match I then check to see which number is greater.
I figured I first needed to check whether the folder names matched between the current index and the next index. I did this by making two variables, 0 and 1, to represent the indices so I could do a staggered incremental comparison of the list against itself. If the two entries matched, I then needed to check the vXXX number on the end. Whichever one was the highest would be appended to the new list.
I suspect that the problem lies in one copy of the list getting to an empty index before the other one does but I'm unsure of how to compensate for that.
Again, I am not a programmer by trade. Any help would be appreciated! Thank you.
# Preparing variables for filtering the folders
versions = foundVerList
verAmountTotal = len(foundVerList)
verIndex = 0
verNextIndex = 1
highestVerCount = 1
filteredVersions = []
# Filtering, this will find the latest version of each folder and store to a list
while verIndex < verAmountTotal:
    try:
        nextVer = versions[verIndex]
        nextVerCompare = versions[verNextIndex]
    except IndexError:
        verNextIndex -= 1
    if nextVer[0:24] == nextVerCompare[0:24]:
        if nextVer[-3:] < nextVerCompare[-3:]:
            filteredVersions.append(nextVerCompare)
        else:
            filteredVersions.append(nextVer)
    verIndex += 1
    verNextIndex += 1
My expected output is:
print filteredVersions
['MRS_103_005_010_BG_001_v002', 'MRS_103_005_010_FG_001_v003']
['MRS_103_005_020_BG_001_v003']
The actual output is:
print filteredVersions
['MRS_103_005_010_BG_001_v002', 'MRS_103_005_010_FG_001_v002',
'MRS_103_005_010_FG_001_v003']
['MRS_103_005_020_BG_001_v002', 'MRS_103_005_020_BG_001_v003']
During the while loop I am using os.listdir on each folder referenced via verIndex. I believe the problem is that a list is being generated for every folder that is searched, but I want all the searches combined into a single list which will THEN go through the groupby and sorted actions.
Seems like a case for itertools.groupby:
from itertools import groupby
grouped = groupby(data, key=lambda version: version.rsplit('_', 1)[0])
result = [sorted(group, reverse=True)[0] for key, group in grouped]
print(result)
Output:
['MRS_103_005_010_BG_001_v002',
'MRS_103_005_010_FG_001_v003',
'MRS_103_005_020_BG_001_v003']
This groups the entries by everything before the last underscore, which I understand to be the "item code". (Note that groupby only groups consecutive entries, so the input must already be sorted by that prefix, as it is here.)
Then, it sorts each group in reverse order. The elements of each group differ only by the version, so the entry with the highest version number will be first.
Lastly, it extracts the first entry from each group, and puts it back into a result list.
Try this:
text = """MRS_103_005_010_BG_001_v001
MRS_103_005_010_BG_001_v002
MRS_103_005_010_FG_001_v001
MRS_103_005_010_FG_001_v002
MRS_103_005_010_FG_001_v003
MRS_103_005_020_BG_001_v001
MRS_103_005_020_BG_001_v002
MRS_103_005_020_BG_001_v003
"""
result = {}
versions = text.splitlines()
for item in versions:
    v = item.split('_')
    num = int(v.pop()[1:])
    name = item[:-3]
    if result.get(name, 0) < num:
        result[name] = num
filteredVersions = [k + str(v) for k, v in result.items()]
print(filteredVersions)
output:
['MRS_103_005_010_BG_001_v2', 'MRS_103_005_010_FG_001_v3', 'MRS_103_005_020_BG_001_v3']
I have a list of ids and am trying to do some processing, using the eq function on a dataframe object, on all but a particular element of the list. Can you please suggest how it can be done?
ids = list(set(df['user_id']))
for k in ids:
    #processing = df.user_id.eq(ids-{k}????)
One thing to watch out for is that you don't want to destructively modify the ids list while you are looping through it to remove the current element. Instead, you can loop through the indexes and, for each index i, build a new spliced-together list that contains every element of ids except the one at index i. I would do it as such:
for i in range(len(ids)):
    elemsExcludingi = ids[:i] + ids[i + 1:]
    # use this list to do things
Good luck!
You can put an if statement in the for-loop, where target_id is the id you do not wish to process:
ids = list(set(df['user_id']))
for k in ids:
    if k != target_id:
        # processing code goes here
Use the keyword continue.
Whenever continue is executed inside a loop, it skips the rest of the loop body and moves straight on to the next iteration.
ids = list(set(df['user_id']))
for k in ids:
    if k == 'the_element_you_dont_want':
        continue  # skips the code below when it is hit
    # other processing code goes here
    #processing = df.user_id.eq(ids-{k}????)
I have this array of data
data = [20001202.05, 20001202.05, 20001202.50, 20001215.75, 20021215.75]
I remove the duplicate data with list(set(data)), which gives me
data = [20001202.05, 20001202.50, 20001215.75, 20021215.75]
But I would like to remove the duplicate data, based on the numbers before the "period"; for instance, if there is 20001202.05 and 20001202.50, I want to keep one of them in my array.
As you don't care about the order of the items you keep, you could do:
>>> {int(d):d for d in data}.values()
[20001202.5, 20021215.75, 20001215.75]
If you would like to keep the lowest item, I can't think of a one-liner.
Here is a basic example for anybody who would like to add a condition on the key or value to keep.
seen = set()
result = []
for item in sorted(data):
    key = int(item)  # or whatever condition
    if key not in seen:
        result.append(item)
        seen.add(key)
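Run on the data from the question, this keeps the lowest item for each integer key:

```python
data = [20001202.05, 20001202.05, 20001202.50, 20001215.75, 20021215.75]

seen = set()
result = []
for item in sorted(data):   # ascending, so the lowest item per key wins
    key = int(item)         # de-duplicate on the digits before the period
    if key not in seen:
        result.append(item)
        seen.add(key)

print(result)  # [20001202.05, 20001215.75, 20021215.75]
```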
Generically, with python 3.7+, because dictionaries maintain order, you can do this, even when order matters:
data = {d:None for d in data}.keys()
However for OP's original problem, OP wants to de-dup based on the integer value, not the raw number, so see the top voted answer. But generically, this will work to remove true duplicates.
data1 = [20001202.05, 20001202.05, 20001202.50, 20001215.75, 20021215.75]
ls = []
for i in data1:
    if i not in ls:
        ls.append(i)
print(ls)
Using: Python 2.4
Currently, I have a nested for loop that iterates over 2 lists and makes a match based on two elements that exist in both lists. Once a match has been found, it takes the element from the r120Final list and puts it in a new list called "r120Delta":
for r120item in r120Final:
    for spectraItem in spectraFinal:
        if str(spectraItem[0]) == r120item[2].strip() and str(spectraItem[25]) == r120item[10]:
            r120Delta.append(r120item)
            break
The problem is that this is SO SLOW and the lists aren't that deep. The R120 is about 64,000 lines and the Spectra is about 150,000 lines.
The r120Final list is a nested array and it looks like so:
r120Final[0] = [['xxx','xxx','12345','xxx','xxx','xxx','xxx','xxx','xxx','xxx','234567']]
...
r120Final[n] = [['xxx','xxx','99999','xxx','xxx','xxx','xxx','xxx','xxx','xxx','678901']]
The spectraFinal list is essentially the same, a nested array and it looks like so:
spectraFinal[0] = [['12345','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','234567']]
...
spectraFinal[n] = [['99999','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','678901']]
Finally, the reason for "r120Delta" is so that I can then do a list differential between r120Final and r120Delta and retrieve the r120 data elements that were NOT matched. This is the function I defined for that task, and again, it is slow:
def listDiff(diffList, completeList):
    returnList = []
    for completeItem in completeList:
        if completeItem not in diffList:
            returnList.append(completeItem)
    return returnList
Basically, I'm knowledgeable in Python but by no means an expert. I'm looking for some experts to show me how to speed this up. Any help is appreciated!
spectra_set = set((str(spectraItem[0]), str(spectraItem[25])) for spectraItem in spectraFinal)
returnList = []
for r120item in r120Final:
    if (r120item[2].strip(), r120item[10]) not in spectra_set:
        returnList.append(r120item)
This will add all items that didn't match to the returnList.
You can do it in one line (if you really want, though note the set expression is then re-evaluated on every iteration, losing the speedup) as
returnList = [r120item for r120item in r120Final
              if (r120item[2].strip(), r120item[10]) not in
              set((str(spectraItem[0]), str(spectraItem[25]))
                  for spectraItem in spectraFinal)]
If you need the full spectraItem:
spectra_dict = dict(((str(spectraItem[0]), str(spectraItem[25])), spectraItem) for spectraItem in spectraFinal)
returnList = []
for r120item in r120Final:
    key = (r120item[2].strip(), r120item[10])
    if key not in spectra_dict:
        returnList.append(r120item)
    else:
        return_item = some_function_of(r120item, spectra_dict[key])
        returnList.append(return_item)
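A toy run of the set-based filter above; the data shapes follow the question (indexes 0 and 25 of each spectra row, indexes 2 and 10 of each r120 row), with invented values:

```python
spectraFinal = [
    ['12345'] + ['xxx'] * 24 + ['234567'],
    ['99999'] + ['xxx'] * 24 + ['678901'],
]
r120Final = [
    ['xxx', 'xxx', '12345 ', 'xxx', 'xxx', 'xxx', 'xxx', 'xxx', 'xxx', 'xxx', '234567'],
    ['xxx', 'xxx', '55555 ', 'xxx', 'xxx', 'xxx', 'xxx', 'xxx', 'xxx', 'xxx', '111111'],
]

# One pass over spectraFinal builds the lookup set...
spectra_set = set((str(s[0]), str(s[25])) for s in spectraFinal)

# ...and one pass over r120Final collects the unmatched rows,
# so the whole differential is O(n + m) instead of O(n * m).
r120Delta = [r for r in r120Final
             if (r[2].strip(), r[10]) not in spectra_set]
```

Here only the second r120 row survives, since ('12345', '234567') is found in the set.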