Comparing the elements of a list with themselves - python

I have lists of items:
['MRS_103_005_010_BG_001_v001',
'MRS_103_005_010_BG_001_v002',
'MRS_103_005_010_FG_001_v001',
'MRS_103_005_010_FG_001_v002',
'MRS_103_005_010_FG_001_v003',
'MRS_103_005_020_BG_001_v001',
'MRS_103_005_020_BG_001_v002',
'MRS_103_005_020_BG_001_v003']
I need to identify the latest version of each item and store it to a new list. Having trouble with my logic.
Based on how this has been built I believe I need to first compare the indices to each other. If I find a match I then check to see which number is greater.
I figured I first needed to do a check to see if the folder names matched between the current index and the next index. I did this by making two variables, 0 and 1, to represent the index so I could do a staggered incremental comparison of the list on itself. If the two indices matched I then needed to check the vXXX number on the end. whichever one was the highest would be appended to the new list.
I suspect that the problem lies in one copy of the list getting to an empty index before the other one does but I'm unsure of how to compensate for that.
Again, I am not a programmer by trade. Any help would be appreciated! Thank you.
# Preparing variables for filtering the folders
versions = foundVerList
verAmountTotal = len(foundVerList)
verIndex = 0
verNextIndex = 1
highestVerCount = 1
filteredVersions = []
# Filtering, this will find the latest version of each folder and store to a list
while verIndex < verAmountTotal:
    try:
        nextVer = (versions[verIndex])
        nextVerCompare = (versions[verNextIndex])
    except IndexError:
        verNextIndex -= 1
    if nextVer[0:24] == nextVerCompare[0:24]:
        if nextVer[-3:] < nextVerCompare[-3:]:
            filteredVersions.append(nextVerCompare)
        else:
            filteredVersions.append(nextVer)
    verIndex += 1
    verNextIndex += 1
My expected output is:
print filteredVersions
['MRS_103_005_010_BG_001_v002', 'MRS_103_005_010_FG_001_v003']
['MRS_103_005_020_BG_001_v003']
The actual output is:
print filteredVersions
['MRS_103_005_010_BG_001_v002', 'MRS_103_005_010_FG_001_v002',
'MRS_103_005_010_FG_001_v003']
['MRS_103_005_020_BG_001_v002', 'MRS_103_005_020_BG_001_v003']
During the while loop I am using os.listdir on each folder referenced via verIndex. I believe the problem is that a list is being generated for every folder that is searched, but I want all the searches to be combined in a single list which will THEN go through the groupby and sorted actions.

Seems like a case for itertools.groupby:
from itertools import groupby
grouped = groupby(data, key=lambda version: version.rsplit('_', 1)[0])
result = [sorted(group, reverse=True)[0] for key, group in grouped]
print(result)
Output:
['MRS_103_005_010_BG_001_v002',
'MRS_103_005_010_FG_001_v003',
'MRS_103_005_020_BG_001_v003']
This groups the entries by everything before the last underscore, which I understand to be the "item code".
Then, it sorts each group in reverse order. The elements of each group differ only by the version, so the entry with the highest version number will be first.
Lastly, it extracts the first entry from each group, and puts it back into a result list.
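One caveat worth spelling out: groupby only merges adjacent items, so the input has to be ordered by the grouping key first. The sketch below is one way to make that explicit. The helper names (item_code, version_number, latest_versions) are made up for illustration, and it assumes every name ends in an underscore followed by "v" and digits. It also compares versions as integers, so it would still work if the numbers were not zero-padded.

```python
from itertools import groupby

def item_code(name):
    # Everything before the last underscore, e.g. 'MRS_103_005_010_BG_001'
    return name.rsplit('_', 1)[0]

def version_number(name):
    # 'v003' -> 3, so 'v10' ranks above 'v9' even without zero padding
    return int(name.rsplit('_', 1)[1].lstrip('v'))

def latest_versions(names):
    # groupby only merges adjacent items, so sort by the grouping key first
    ordered = sorted(names, key=item_code)
    return [max(group, key=version_number)
            for _, group in groupby(ordered, key=item_code)]

data = [
    'MRS_103_005_010_BG_001_v001',
    'MRS_103_005_010_BG_001_v002',
    'MRS_103_005_010_FG_001_v001',
    'MRS_103_005_010_FG_001_v002',
    'MRS_103_005_010_FG_001_v003',
    'MRS_103_005_020_BG_001_v001',
    'MRS_103_005_020_BG_001_v002',
    'MRS_103_005_020_BG_001_v003',
]
print(latest_versions(data))
```

Using max instead of sorted(...)[0] avoids sorting each group just to take one element.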

Try this:
text = """MRS_103_005_010_BG_001_v001
MRS_103_005_010_BG_001_v002
MRS_103_005_010_FG_001_v001
MRS_103_005_010_FG_001_v002
MRS_103_005_010_FG_001_v003
MRS_103_005_020_BG_001_v001
MRS_103_005_020_BG_001_v002
MRS_103_005_020_BG_001_v003
"""
result = {}
versions = text.splitlines()
for item in versions:
    v = item.split('_')
    num = int(v.pop()[1:])
    name = item[:-3]
    if result.get(name, 0) < num:
        result[name] = num
# re-pad the version to three digits so 'v002' does not become 'v2'
filteredVersions = ['{}{:03d}'.format(k, v) for k, v in result.items()]
print(filteredVersions)
output:
['MRS_103_005_010_BG_001_v002', 'MRS_103_005_010_FG_001_v003', 'MRS_103_005_020_BG_001_v003']

Related

How to decode a list and remove items from two lists when there is a match in both of them based on an index?

I have two lists which contain the following type of information.
List #1:
Request_List = ["1/1/1.34", "1/2/1.3.5", "1/3/1.2.3", ...same format elements]
List #2:
Reply_List = ["1/1/0", "1/3/1", "1/2/0", ...same format elements]
From the "Reply" list, I want to be able to compare the second item in the "#/#/#", in this case it will be 1,3,2, and so on with all the items in the Reply list and check if there is a match with the second item in "Request list". If there is a match, then I want to be able to return a new list which would contain the information of the third index in the request string appended with the third index of the matching string in the reply.
The result would be like the following.
Result = ["1.34.0", "1.3.5.0", "1.2.3.1"]
Note that the 0 was appended to the 1.34, the 0 to the 1.3.5, and the 1 to the 1.2.3, each taken from the entry in the "Reply" list whose second index matches. The matching item could be placed anywhere in the "Reply" list.
The code which does the problem stated above is shown below.
def get_list_of_error_codes(self, Reply_List, Request_List):
    # I am not sure if this is the right way to decode all the elements in the list?
    decoded_Reply_List = Reply_List.decode("utf-8")
    Result = [
        f"{i.split('/')[-1]}.{j.split('/')[-1]}"
        for i in Request_List
        for j in decoded_Reply_List
        if (i.split("/")[1] == j.split("/")[1])
    ]
    return Result

res = get_list_of_error_codes(Reply_List, Request_List)
print(res)  # ["1.34.0", "1.3.5.0", "1.2.3.1"]
Issues I am facing right now:
I am NOT sure if I decode the Reply_List correctly and in the proper manner. Can someone help me also verify this?
I am not sure on how to also remove the corresponding items for the Reply_List and Request_List when I find a match based on the condition if (i.split("/")[1] == j.split("/")[1]).
You can use list comprehension to decode the list:
decoded_Reply_List = [li.decode(encoding='utf-8') for li in Reply_List]
In this case, if you wanted to also remove items from the list while you create the new list, I would say list comprehension isn't the right move. Just go with the nested for loops:
def get_list_of_error_codes(self, Reply_List, Request_List):
    decoded_Reply_List = [li.decode(encoding='utf-8') for li in Reply_List]
    Result = []
    for i in list(Request_List):
        for j in decoded_Reply_List:
            if (i.split("/")[1] == j.split("/")[1]):
                Result.append(f"{i.split('/')[-1]}.{j.split('/')[-1]}")
                Reply_List.remove(j)
                break
        else:
            continue
        Request_List.remove(i)
    return Result

Request_List = ["1/1/1.34", "1/2/1.3.5", "1/3/1.2.3"]
Reply_List = [b"1/1/0", b"1/3/1", b"1/2/0"]
print(get_list_of_error_codes("Foo", Reply_List, Request_List))
# Output: ['1.34.0', '1.3.5.0', '1.2.3.1']
Some things to note:
I added a break so that we don't keep looking for matches if we find one. It will only match the first pair, then move on.
In for i in list(Request_List), I added the list() cast to effectively make a copy of the list. This allows us to remove entries from Request_List without disrupting the loop. I didn't do this for for j in decoded_Reply_List because it's already a copy of Reply_List. (I assumed you wanted to remove the entries from Reply_List)
The last is the else: continue. We don't want to reach Request_List.remove(i) if we didn't find a match. If break is called, else will not be called, which means we will reach Request_List.remove(i). But if the loop completes without finding a match, the loop will then enter else and we will skip the removal step by calling continue
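The for/else behaviour described in these notes can be seen in isolation with a small sketch. The function name pair_up and the sample strings are made up for illustration. It only collects matches and leaves both input lists untouched.

```python
def pair_up(requests, replies):
    """Pair each request with the first reply sharing its second field."""
    pairs = []
    unmatched = []
    for req in requests:
        for rep in replies:
            if req.split("/")[1] == rep.split("/")[1]:
                pairs.append((req, rep))
                break  # leaving via break skips the else clause
        else:
            # Runs only when the inner loop finished without hitting break
            unmatched.append(req)
    return pairs, unmatched

pairs, unmatched = pair_up(["1/1/a", "1/9/b"], ["1/1/0", "1/2/1"])
print(pairs)
print(unmatched)
```

Here "1/9/b" has no reply with a matching second field, so the inner loop runs to completion and the else branch records it as unmatched.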
EDIT:
Actually, Reply_List.remove(j) breaks, since we've decoded j in this method, thus decoded j is not the same object as it is in Reply_List. Here's some revised code which will solve this issue:
def get_list_of_error_codes(Reply_List, Request_List):
    # decoded_Reply_List = [li.decode(encoding='utf-8') for li in Reply_List]
    Result = []
    for i in list(Request_List):
        for j in list(Reply_List):
            dj = j.decode(encoding='utf-8')
            if (i.split("/")[1] == dj.split("/")[1]):
                Result.append(f"{i.split('/')[-1]}.{dj.split('/')[-1]}")
                Reply_List.remove(j)
                break
        else:
            continue
        Request_List.remove(i)
    return Result

Request_List = ["1/1/1.34", "1/2/1.3.5", "1/3/1.2.3"]
Reply_List = [b"1/1/0", b"1/3/1", b"1/2/0"]
print("Result: ", get_list_of_error_codes(Reply_List, Request_List))
print("Reply_List: ", Reply_List)
print("Request_List: ", Request_List)
# Output:
# Result: ['1.34.0', '1.3.5.0', '1.2.3.1']
# Reply_List: []
# Request_List: []
What I've done is that instead of creating a separate decoded list, I just decode the entries as they're looped through, and then remove the un-decoded entry from Reply_List. This should be a little more efficient too, since we're not looping through Reply_List twice now.

Python Aggregation without PANDAS

I have a sorted and nested list. Each element in the list has 3 sub-elements; 'Drugname', Doctor_id, Amount. For a given drugname (which repeats) the doctor ids are different and so are the amounts. See sample list below..
I need an output where, for each drugname, I need to count the total UNIQUE doctor ids and the sum of the dollar amount for that drug. For ex, for the list snippet below..
[
['CIPROFLOXACIN HCL', 1801093968, 61.49],
['CIPROFLOXACIN HCL', 1588763981, 445.23],
['HYDROCODONE-ACETAMINOPHEN', 1801093968, 251.52],
['HYDROCODONE-ACETAMINOPHEN', 1588763981, 263.16],
['HYDROXYZINE HCL', 1952310666, 945.5],
['IBUPROFEN', 1801093968, 67.06],
['INVEGA SUSTENNA', 1952310666, 75345.68]
]
The desired output is as below.
[
['CIPROFLOXACIN HCL', 2, 506.72],
['HYDROCODONE-ACETAMINOPHEN', 2, 514.68],
['HYDROXYZINE HCL', 1, 945.5],
['IBUPROFEN', 1, 67.06],
['INVEGA SUSTENNA', 1, 75345.68]
]
In a database world this is the easiest thing with a simple GROUP BY on drugname. In Python, I am not allowed to use PANDAS, NumPy etc. Just the basic building blocks of Python. I tried the below code but I am unable to reset the count variable to count doctor ids and amounts. This commented code is one of several attempts. Not sure if I need to use a nested for loop or a for loop-while loop combo.
All help is appreciated!
aggr_list = []
temp_drug_name = ''
doc_count = 0
amount = 0
for list_element in sorted_new_list:
    temp_drug_name = list_element[0]
    if temp_drug_name == list_element[0]:
        amount += float(amount)
        doc_count += 1
    aggr_list.append([temp_drug_name, doc_count, amount])
print(aggr_list)
Since the list is already sorted, you can simply iterate through it (it is named l in the example below) and keep track of the name seen in the previous iteration. If the name of the current iteration differs from the last one, append a new entry to the output. Use a set to keep track of the doctor IDs already seen for the current drug, and increment the second item of the last output entry by 1 only if the doctor ID has not been seen yet. Finally, increment the third item of the last output entry by the amount of the current iteration:
output = []
last = None
for name, id, amount in l:
    if name != last:
        output.append([name, 0, 0])
        last = name
        ids = set()
    if id not in ids:
        output[-1][1] += 1
        ids.add(id)
    output[-1][2] += amount
output becomes:
[['CIPROFLOXACIN HCL', 2, 506.72],
['HYDROCODONE-ACETAMINOPHEN', 2, 514.6800000000001],
['HYDROXYZINE HCL', 1, 945.5],
['IBUPROFEN', 1, 67.06],
['INVEGA SUSTENNA', 1, 75345.68]]
Note that decimal floating points are approximated in the binary system that the computer uses (please read Is floating point math broken?), so some minor errors are inevitable as seen in the sum of the second entry above.
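If the tiny floating-point residue matters for display or comparison, two common workarounds are rounding the total to cents or summing with decimal.Decimal. The snippet below is only a sketch using the two HYDROCODONE-ACETAMINOPHEN amounts from the question. The variable names are made up for illustration.

```python
from decimal import Decimal

amounts = [251.52, 263.16]

# Option 1: keep floats, but round the final total to two decimal places
rounded_total = round(sum(amounts), 2)

# Option 2: convert via str() so each Decimal holds the value exactly as written
exact_total = sum(Decimal(str(a)) for a in amounts)

print(rounded_total)
print(exact_total)
```

Rounding only at the end, rather than after every addition, keeps the accumulated error as small as possible.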
Here is a solution with a focus on readability. It doesn't rely on the entries in your original list being sorted by drug name.
It does one pass over all the entries of your data, then a pass over the unique drugs.
To do only a single pass over all the entries of your sorted data, see @blhsing's solution.
from collections import defaultdict, namedtuple
Entry = namedtuple('Entry',['doctors', 'prices'])
processed_data = defaultdict(lambda: Entry(doctors=set(), prices=[]))
for entry in data:
    drug_name, doctor_id, price = entry
    processed_data[drug_name].doctors.add(doctor_id)
    processed_data[drug_name].prices.append(price)
stat_list = [[drug_name, len(entry.doctors), sum(entry.prices)] for drug_name, entry in processed_data.items()]
Without Pandas or defaultdict:
d = {}
for row in l:
    if row[0] in d:
        d[row[0]].append(row[1])
        d[row[0]].append(row[2])
    else:
        d[row[0]] = [row[1]]
        d[row[0]].append(row[2])
# even positions hold doctor ids, odd positions hold amounts
result = [[key, len(set(val[0::2])), sum(val[1::2])] for key, val in d.items()]
Reusable solution, meant for those who arrive here through Google:
def group_by(rows, key):
    m = {}
    for row in rows:
        k = key(row)
        try:
            m[k].append(row)
        except KeyError:
            m[k] = [row]
    return m.values()

grouped_by_drug = group_by(data, key=lambda row: row[0])
result = [
    (
        drug_rows[0][0],
        # count distinct doctor ids, not rows
        len({row[1] for row in drug_rows}),
        sum(row[2] for row in drug_rows)
    )
    for drug_rows in grouped_by_drug
]
You can also use defaultdict in this implementation, which for my use case is slightly faster.
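A defaultdict-based variant of such a helper might look like the sketch below. It is an illustration rather than the answerer's exact code: the helper returns the whole mapping instead of just the values, distinct doctor IDs are counted with a set, and the total is rounded to cents. The sample rows are the ones from the question.

```python
from collections import defaultdict

def group_by(rows, key):
    # Missing keys start out as an empty list, so no try/except is needed
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return groups

data = [
    ['CIPROFLOXACIN HCL', 1801093968, 61.49],
    ['CIPROFLOXACIN HCL', 1588763981, 445.23],
    ['HYDROCODONE-ACETAMINOPHEN', 1801093968, 251.52],
    ['HYDROCODONE-ACETAMINOPHEN', 1588763981, 263.16],
    ['HYDROXYZINE HCL', 1952310666, 945.5],
    ['IBUPROFEN', 1801093968, 67.06],
    ['INVEGA SUSTENNA', 1952310666, 75345.68],
]

result = [
    [drug,
     len({row[1] for row in rows}),            # distinct doctor ids
     round(sum(row[2] for row in rows), 2)]    # total amount, rounded to cents
    for drug, rows in group_by(data, key=lambda row: row[0]).items()
]
print(result)
```

Because dictionaries preserve insertion order (Python 3.7+), the drugs come out in the order they first appear in the input.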

Algorithm: add vector to list while avoiding nested

I have a list of elements like this:
new_element = {'start':start, 'end':end, 'category':cat, 'value': val}
Now, I want to append it to a list only if there's no other element that already contains this new element (checking by start, end and category).
Also, if this element contains an element that is already in the list, I want to add it and delete the old one.
To sum up, I don't want nested elements and I only want to keep the larger one.
What I have so far (id is category):
nested = False
for ir in irs[:]:
    # is it nested into another?
    if ir['category'] == ir_new['category'] and ir['start'] <= ir_new['start'] and ir['end'] >= ir_new['end']:
        nested = True
    # another is nested in this one
    if ir['category'] == ir_new['category'] and ir['start'] >= ir_new['start'] and ir['end'] <= ir_new['end']:
        irs.remove(ir)
if not nested:
    # append in a list
    irs.append(ir_new)
    found += 1
This works, but I think it's O(n*n). Maybe there's a way to do it more efficiently by using dicts or pandas.
Some thoughts:
Should I do it before appending or append all and then check?
UPDATE 1:
There is an implementation of an interval tree in this lib; the only issue is that it is not possible to delete intervals once added.
http://bx-python.readthedocs.io/en/latest/lib/bx.intervals.intersection.html#bx.intervals.intersection.IntervalTree
UPDATE 2:
https://github.com/chaimleib/intervaltree is interesting, but I cannot find a way to query it while discarding partial overlaps. I only need full overlaps (nesting).
rough draft:
define a class for your "ir" items, with __lt__ using 'start'
have a master dict with category as the key
store a sorted list of items with that category (bisect)
when you find the insert position based on start time, you can start comparing
end times until you find an "ir" item which should not be deleted.
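The rough draft above could be fleshed out along the following lines. This is only a sketch under a few assumptions: the class name IntervalStore and its methods are hypothetical, plain dicts are kept instead of a class with __lt__, and an interval equal to an existing one is treated as nested (so it is skipped). It relies on the fact that when no stored interval contains another, sorting by start also sorts by end, so only neighbours of the insert position need checking.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

class IntervalStore:
    def __init__(self):
        # category -> parallel lists, both ordered by interval start
        self._starts = defaultdict(list)
        self._items = defaultdict(list)

    def add(self, item):
        """Insert item unless an existing interval contains it.
        Existing intervals contained in item are dropped. Returns True if added."""
        starts = self._starts[item['category']]
        items = self._items[item['category']]
        s, e = item['start'], item['end']
        # Among intervals starting at or before s, the last one has the
        # largest end, so it is the only possible container of the new item.
        i = bisect_right(starts, s)
        if i > 0 and items[i - 1]['end'] >= e:
            return False
        # Intervals starting at or after s with end <= e are nested in the
        # new item; they form a contiguous run beginning at position lo.
        lo = bisect_left(starts, s)
        hi = lo
        while hi < len(items) and items[hi]['end'] <= e:
            hi += 1
        starts[lo:hi] = [s]
        items[lo:hi] = [item]
        return True

    def items(self, category):
        return list(self._items[category])

store = IntervalStore()
store.add({'start': 2, 'end': 5, 'category': 'a', 'value': 1})
store.add({'start': 3, 'end': 4, 'category': 'a', 'value': 2})   # nested, skipped
store.add({'start': 1, 'end': 10, 'category': 'a', 'value': 3})  # replaces (2, 5)
print([(ir['start'], ir['end']) for ir in store.items('a')])
```

Each insertion costs a binary search plus the list splice, so it avoids comparing the new element against every stored one, although the splice itself is still linear in the worst case.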
By using pandas library and some coding I was able to get a decent solution
Initialize
...
df = pd.DataFrame(columns=['start','end','seq','record','len','ir_1','ir_2'])
Add
...
with l_lock:
    new_element = [ir_start, ir_end, ir_seq, record.id, ir_len, seq_q, seq_q_prime]
    df.loc[len(df)] = new_element
Delete dups
...
for idx, row in df.iterrows():
    res = df[(df.index != idx) & (df.start >= row.start) & (df.end <= row.end)]
    df.drop(res.index, inplace=True)
As suggested in some comments, an interval tree was also a possible solution, but I could not get it to work.

Constantly getting IndexError and am unsure why in Python

I am new to python and really programming in general and am learning python through a website called rosalind.info, which is a website that aims to teach through problem solving.
Here is the problem, wherein you're asked to calculate the percentage of guanine and cytosine in the string of DNA given for each ID, then return the ID of the sample with the greatest percentage.
I'm working on the sample problem on the page and am experiencing some difficulty. I know my code is probably really inefficient and cumbersome but I take it that's to be expected for those who are new to programming.
Anyway, here is my code.
gc = open("rosalind_gcsamp.txt","r")
biz = gc.readlines()
i = 0
gcc = 0
d = {}
for i in xrange(biz.__len__()):
    if biz[i].startswith(">"):
        biz[i] = biz[i].replace("\n","")
        biz[i+1] = biz[i+1].replace("\n","") + biz[i+2].replace("\n","")
        del biz[i+2]
What I'm trying to accomplish here is, given input such as this:
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
Break what's given into a list based on the lines and concatenate the two lines of DNA like so:
['>Rosalind_6404', 'CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG', 'TCCCACTAATAATTCTGAGG\n']
And delete the entry two indices after the ID, which is >Rosalind. What I do with it later I still need to figure out.
However, I keep getting an index error and can't, for the life of me, figure out why. I'm sure it's a trivial reason, I just need some help.
I've even attempted the following to limited success:
for i in xrange(biz.__len__()):
    if biz[i].startswith(">"):
        biz[i] = biz[i].replace("\n","")
        biz[i+1] = biz[i+1].replace("\n","") + biz[i+2].replace("\n","")
    elif biz[i].startswith("A" or "C" or "G" or "T") and biz[i+1].startswith(">"):
        del biz[i]
which still gives me an index error but at least gives me the biz value I want.
Thanks in advance.
It is very easy to do with itertools.groupby, using lines that start with > as the keys and as the delimiters:
from itertools import groupby
with open("rosalind_gcsamp.txt","r") as gc:
    # group elements using lines that start with ">" as the delimiter
    groups = groupby(gc, key=lambda x: not x.startswith(">"))
    d = {}
    for k,v in groups:
        # if k is False, the line starts with ">", i.e. it is a header line,
        # so use the value v as the key and call next on the groupby object
        # to get the group of sequence lines that follows
        if not k:
            key, val = list(v)[0].rstrip(), "".join(map(str.rstrip, next(groups)[1]))
            d[key] = val
print(d)
{'>Rosalind_0808': 'CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT', '>Rosalind_5959': 'CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC', '>Rosalind_6404': 'CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG'}
If you need order use a collections.OrderedDict in place of d.
You are looping over the length of biz. So in your last iteration biz[i+1] and biz[i+2] don't exist. There is no item after the last.
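One way to sidestep the index arithmetic entirely is to walk the lines once and build a dictionary, never touching i+1 or i+2 and never deleting from the list being iterated. The sketch below assumes the FASTA-style layout shown in the question; the function names read_fasta and gc_content are made up for illustration.

```python
def read_fasta(lines):
    """Map each '>' header (without the '>') to its concatenated sequence."""
    records = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:]
            records[current] = ""
        elif current is not None:
            # Sequence lines are appended to the most recent header
            records[current] += line
    return records

def gc_content(seq):
    # Percentage of bases that are G or C
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)

sample = [
    ">Rosalind_6404\n",
    "CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC\n",
    "TCCCACTAATAATTCTGAGG\n",
]
records = read_fasta(sample)
best_id = max(records, key=lambda name: gc_content(records[name]))
print(best_id)
```

This also handles sequences that span more than two lines, which the biz[i+1] + biz[i+2] approach does not.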

How do you get back tuple or 2 lists with key and value matching order of reg pattern group names?

I'm trying to create repaired path using 2 dicts created using groupdict() from re.compile
The idea is to swap out values from the wrong path with the equally named values of the correct dict.
However, due to the fact they are not in the captured group order, I can't rebuild the resulting string as a correct path as the values are not in order that is required for path.
I hope that makes sense, I've only been using python for a couple of months, so I may be missing the obvious.
# for k, v in pat_list.iteritems():
# pat = re.compile(v)
# m = pat.match(Path)
# if m:
# mgd = m.groups(0)
# pp (mgd)
this gives correct value order, and groupdict() creates the right k,v pair, but in wrong order.
You could perhaps use something a bit like that:
pat = re.compile(r"(?P<FULL>(?P<to_ext>(?:(?P<path_file_type>(?P<path_episode>(?P<path_client>[A-Z]:[\\/](?P<client_name>[a-zA-z0-1]*))[\\/](?P<episode_format>[a-zA-z0-9]*))[\\/](?P<root_folder>[a-zA-Z0-9]*)[\\/])(?P<file_type>[a-zA-Z0-9]*)[\\/](?P<path_folder>[a-zA-Z0-9]*[_,\-]\d*[_-]?\d*)[\\/](?P<base_name>(?P<episode>[a-zA-Z0-9]*)(?P<scene_split>[_,\-])(?P<scene>\d*)(?P<shot_split>[_-])(?P<shot>\d*)(?P<version_split>[_,\-a-zA-Z]*)(?P<version>[0-9]*))))[\.](?P<ext>[a-zA-Z0-9]*))")
s = r"T:\Grimm\Grimm_EPS321\Comps\Fusion\G321_08_010\G321_08_010_v02.comp"
mat = pat.match(s)
result = []
# group numbers run from 1 to pat.groups inclusive
for i in range(1, pat.groups + 1):
    name = list(pat.groupindex.keys())[list(pat.groupindex.values()).index(i)]
    cap = mat.group(i)
    result.append([name, cap])
That will give you a list of lists, each smaller list having the capture group's name as its first item and the captured text as its second item.
Or if you want 2 lists, you can make something like:
names = []
captures = []
# group numbers run from 1 to pat.groups inclusive
for i in range(1, pat.groups + 1):
    name = list(pat.groupindex.keys())[list(pat.groupindex.values()).index(i)]
    cap = mat.group(i)
    names.append(name)
    captures.append(cap)
Getting key from value in a dict obtained from this answer
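Since groupindex maps each group name to its group number, sorting its items by that number gives the names in pattern order without a reverse lookup per group. The sketch below uses a deliberately shortened, made-up pattern and file name rather than the full one above, and it assumes every capturing group is named.

```python
import re

# Hypothetical, shortened pattern: all capturing groups are named
pat = re.compile(
    r"(?P<base_name>(?P<episode>[A-Za-z0-9]+)_(?P<scene>\d+)_(?P<shot>\d+))"
    r"\.(?P<ext>\w+)"
)
mat = pat.match("G321_08_010.comp")

# groupindex maps name -> group number; sort by the number to get pattern order
ordered = sorted(pat.groupindex.items(), key=lambda kv: kv[1])
names = [name for name, _ in ordered]
captures = [mat.group(index) for _, index in ordered]

print(names)
print(captures)
```

Zipping names and captures back together gives ordered (name, value) pairs, which is enough to rebuild the path in the right order.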
