Removing duplicates based on a partial match Python - python

Say I have a list
var1 = ["VenueA/2003", "VenueA/2006", "VenueA/2009","VenueB/2009"]
What I want to do is remove all duplicate elements in the list based on the VenueX and keep the first occurrence
In the above example, there are three similar VenueA which are VenueA/2003, VenueA/2006 and VenueA/2009. As VenueA/2003 is the first occurrence, I want to keep that and remove the rest of VenueA
The result that I want is
var1 = ["VenueA/2003","VenueB/2009"]
How do I go about doing that?

You can build a map keyed by the prefixes, and with as value, one of the strings that has this prefix. The Map constructor can be used for this.
As the Map constructor will retain the last occurrence of the string that has a given prefix, you should reverse the input and the output to get the first match instead:
const arr = ["VenueA/2003", "VenueA/2006", "VenueA/2009","VenueB/2009"];
const map = new Map(arr.map(s => [s.split("/")[0], s]).reverse());
const result = [...map.values()].reverse();
console.log(result);

k=[]
for x in var1:
if x.startswith('VenueA') and len(k)==0:
k.append(x)
if x.startswith('VenueB') and len(k)==1:
k.append(x)
#output
['VenueA/2003', 'VenueB/2009']
Similarly, if you have more Venues you can increase the value of len and append them to k

Related

Deleting every item in a list corresponding to match in other list

I am showing two cases of many such cases to explain the problem. Each case
has two lists. First list i.e. nu contains elements ID , which has to be matched with first element of every tuple in second list i.e. nu_ew.
If the match is found, I want to delete every occurrence of tuple with same ID from the second list i.e. nu_ew.
Issue is that I am successfully able to delete all desired element (tuple) from Case-2, but last occurrence of element (tuple) i.e. ('Na23', 0.0078838) corresponding to last ID in nu remains undeleted in Case-1.
I am looking for any way out to get desired result. Any suggestion is greatly appreciated.
Case-1:
nu=['F19', 'U234', 'U235', 'U238', 'Cl35', 'Cl37', 'Na23']
nu_ew = [('Mg24', 0.070385), ('Mg25', 0.0092824),
('Mg26', 0.0106276), ('F19', 0.42348),
('U234', 1.083506277), ('U235', 0.0014516),
('U238', 0.202605), ('Cl35', 0.0454252),
('Cl37', 0.0153641), ('Na23', 0.047303),
('F19', 0.0521210), ('U234', 3.61168759),
('U235', 0.000483890), ('U238', 0.067535),
('F19', 0.0217170), ('Na23', 0.0078838),
('Cl35', 0.0181700), ('Cl37', 0.0061456)]
Case-2:
nu=['F19', 'U234', 'U235', 'U238']
nu_ew = [('Mg24', 0.068893), ('Mg25', 0.009085),
('Mg26', 0.0104025), ('F19', 0.414511),
('U234', 1.060551431), ('U235', 0.0014209),
('U238', 0.198313), ('Cl35', 0.0444628),
('Cl37', 0.0150386), ('Na23', 0.046301),
('F19', 0.0510167), ('U234', 5.65627430),
('U235', 0.00075782), ('U238', 0.105767),
('F19', 0.034011)]
I tried doing:
for n in nu:
for ind, id_wf in enumerate(nu_ew):
if n == id_wf[0]:
del nu_ew[ind]`
print(nu_ew)`
I would just use list comprehension here:
something like
result = [t for t in nu_ew if t[0] not in nu]
For larger lists
nu_as_set = set(nu)
result = [t for t in nu_ew if t[0] not in nu_as_set]
Two points to address.
Don't delete/add an element to a list while iterating over it. Instead, make a new list containing the result. This can be done by a for loop, or by list comprehension.
You don't have to iterate over nu, just use in. If nu is going to be larger, consider using a set, because it takes less time with in operator.
So, either one of the following will work for your second code.
Use for loop.
result = []
for id_wf in nu_ew:
if id_wf[0] in nu:
result.append(id_wf)
nu_ew = result
print(nu_ew)
Use list comprehension.
nu_ew = [id_wf in nu_ew if id_wf[0] in nu]
Making a set out of nu is simple, just add
nu = set(nu)
or
nu_set = set(nu)
(If you want to keep the original list)
beforehand.

Regular expressions matching words which contain the pattern but also the pattern plus something else

I have the following problem:
list1=['xyz','xyz2','other_randoms']
list2=['xyz']
I need to find which elements of list2 are in list1. In actual fact the elements of list1 correspond to a numerical value which I need to obtain then change. The problem is that 'xyz2' contains 'xyz' and therefore matches also with a regular expression.
My code so far (where 'data' is a python dictionary and 'specie_name_and_initial_values' is a list of lists where each sublist contains two elements, the first being specie name and the second being a numerical value that goes with it):
all_keys = list(data.keys())
for i in range(len(all_keys)):
if all_keys[i]!='Time':
#print all_keys[i]
pattern = re.compile(all_keys[i])
for j in range(len(specie_name_and_initial_values)):
print re.findall(pattern,specie_name_and_initial_values[j][0])
Variations of the regular expression I have tried include:
pattern = re.compile('^'+all_keys[i]+'$')
pattern = re.compile('^'+all_keys[i])
pattern = re.compile(all_keys[i]+'$')
And I've also tried using 'in' as a qualifier (i.e. within a for loop)
Any help would be greatly appreciated. Thanks
Ciaran
----------EDIT------------
To clarify. My current code is below. its used within a class/method like structure.
def calculate_relative_data_based_on_initial_values(self,copasi_file,xlsx_data_file,data_type='fold_change',time='seconds'):
copasi_tool = MineParamEstTools()
data=pandas.io.excel.read_excel(xlsx_data_file,header=0)
#uses custom class and method to get the list of lists from a file
specie_name_and_initial_values = copasi_tool.get_copasi_initial_values(copasi_file)
if time=='minutes':
data['Time']=data['Time']*60
elif time=='hour':
data['Time']=data['Time']*3600
elif time=='seconds':
print 'Time is already in seconds.'
else:
print 'Not a valid time unit'
all_keys = list(data.keys())
species=[]
for i in range(len(specie_name_and_initial_values)):
species.append(specie_name_and_initial_values[i][0])
for i in range(len(all_keys)):
for j in range(len(specie_name_and_initial_values)):
if all_keys[i] in species[j]:
print all_keys[i]
The table returned from pandas is accessed like a dictionary. I need to go to my data table, extract the headers (i.e. the all_keys bit), then look up the name of the header in the specie_name_and_initial_values variable and obtain the corresponding value (the second element within the specie_name_and_initial_value variable). After this, I multiply all values of my data table by the value obtained for each of the matched elements.
I'm most likely over complicating this. Do you have a better solution?
thanks
----------edit 2 ---------------
Okay, below are my variables
all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])
species = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])
You don't need a regex to find common elements, set.intersection will find all elements in list2 that are also in list1:
list1=['xyz','xyz2','other_randoms']
list2=['xyz']
print(set(list2).intersection(list1))
set(['xyz'])
Also if you wanted to compare 'xyz' to 'xyz2' you would use == not in and then it would correctly return False.
You can also rewrite your own code a lot more succinctly, :
for key in data:
if key != 'Time':
pattern = re.compile(val)
for name, _ in specie_name_and_initial_values:
print re.findall(pattern, name)
Based on your edit you have somehow managed to turn lists into strings, one option is to strip the []:
all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])
specie_name_and_initial_values = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])
specie_name_and_initial_values = set(s.strip("[]") for s in specie_name_and_initial_values)
print(all_keys.intersection(specie_name_and_initial_values))
Which outputs:
set([u'Cyp26_G_R1', u'Cyp26_G_rep1'])
FYI, if you had lists inside the set you would have gotten an error as lists are mutable so are not hashable.

How to add to python dictionary without replacing

the current code I have is category1[name]=(number) however if the same name comes up the value in the dictionary is replaced by the new number how would I make it so instead of the value being replaced the original value is kept and the new value is also added, giving the key two values now, thanks.
You would have to make the dictionary point to lists instead of numbers, for example if you had two numbers for category cat1:
categories["cat1"] = [21, 78]
To make sure you add the new numbers to the list rather than replacing them, check it's in there first before adding it:
cat_val = # Some value
if cat_key in categories:
categories[cat_key].append(cat_val)
else:
# Initialise it to a list containing one item
categories[cat_key] = [cat_val]
To access the values, you simply use categories[cat_key] which would return [12] if there was one key with the value 12, and [12, 95] if there were two values for that key.
Note that if you don't want to store duplicate keys you can use a set rather than a list:
cat_val = # Some value
if cat_key in categories:
categories[cat_key].add(cat_val)
else:
# Initialise it to a set containing one item
categories[cat_key] = set(cat_val)
a key only has one value, you would need to make the value a tuple or list etc
If you know you are going to have multiple values for a key then i suggest you make the values capable of handling this when they are created
It's a little hard to understand your question.
I think you want this:
>>> d[key] = [4]
>>> d[key].append(5)
>>> d[key]
[4, 5]
Depending on what you expect, you could check if name - a key in your dictionary - already exists. If so, you might be able to change its current value to a list, containing both the previous and the new value.
I didn't test this, but maybe you want something like this:
mydict = {'key_1' : 'value_1', 'key_2' : 'value_2'}
another_key = 'key_2'
another_value = 'value_3'
if another_key in mydict.keys():
# another_key does already exist in mydict
mydict[another_key] = [mydict[another_key], another_value]
else:
# another_key doesn't exist in mydict
mydict[another_key] = another_value
Be careful when doing this more than one time! If it could happen that you want to store more than two values, you might want to add another check - to see if mydict[another_key] already is a list. If so, use .append() to add the third, fourth, ... value to it.
Otherwise you would get a collection of nested lists.
You can create a dictionary in which you map a key to a list of values, in which you would want to append a new value to the lists of values stored at each key.
d = dict([])
d["name"] = 1
x = d["name"]
d["name"] = [1] + x
I guess this is the easiest way:
category1 = {}
category1['firstKey'] = [7]
category1['firstKey'] += [9]
category1['firstKey']
should give you:
[7, 9]
So, just use lists of numbers instead of numbers.

eliminate '\n' in dictionary

I have a dictionary looks like this, the DNA is the keys and quality value is value:
{'TTTGTTCTTTTTGTAATGGGGCCAGATGTCACTCATTCCACATGTAGTATCCAGATTGAAATGAAATGAGGTAGAACTGACCCAGGCTGGACAAGGAAGG\n':
'eeeecdddddaaa`]eceeeddY\\cQ]V[F\\\\TZT_b^[^]Z_Z]ac_ccd^\\dcbc\\TaYcbTTZSb]Y]X_bZ\\a^^\\S[T\\aaacccBBBBBBBBBB\n',
'ACTTATATTATGTTGACACTCAAAAATTTCAGAATTTGGAGTATTTTGAATTTCAGATTTTCTGATTAGGGATGTACCTGTACTTTTTTTTTTTTTTTTT\n':
'dddddd\\cdddcdddcYdddd`d`dcd^dccdT`cddddddd^dddddddddd^ddadddadcd\\cda`Y`Y`b`````adcddd`ddd_dddadW`db_\n',
'CTGCCAGCACGCTGTCACCTCTCAATAACAGTGAGTGTAATGGCCATACTCTTGATTTGGTTTTTGCCTTATGAATCAGTGGCTAAAAATATTATTTAAT\n':
'deeee`bbcddddad\\bbbbeee\\ecYZcc^dd^ddd\\\\`]``L`ccabaVJ`MZ^aaYMbbb__PYWY]RWNUUab`Y`BBBBBBBBBBBBBBBBBBBB\n'}
I want to write a function so that if I query a DNA sequence, it returns a tuple of this DNA sequence and its corresponding quality value
I wrote the following function, but it gives me an error message that says list indices must be integers, not str
def query_sequence_id(self, dna_seq=''):
"""Overrides the query_sequence_id so that it optionally returns both the sequence and the quality values.
If DNA sequence does not exist in the class, return a string error message"""
list_dna = []
for t in self.__fastqdict.keys():
list_dna.append(t.rstrip('\n'))
self.dna_seq = dna_seq
if self.dna_seq in list_dna:
return (self.dna_seq,self.__fastqdict.values()[self.dna_seq + "\n"])
else:
return "This DNA sequence does not exist"
so I want something like if I print
query_sequence_id("TTTGTTCTTTTTGTAATGGGGCCAGATGTCACTCATTCCACATGTAGTATCCAGATTGAAATGAAATGAGGTAGAACTGACCCAGGCTGGACAAGGAAGG"),
I would get
('TTTGTTCTTTTTGTAATGGGGCCAGATGTCACTCATTCCACATGTAGTATCCAGATTGAAATGAAATGAGGTAGAACTGACCCAGGCTGGACAAGGAAGG',
'eeeecdddddaaa`]eceeeddY\\cQ]V[F\\\\TZT_b^[^]Z_Z]ac_ccd^\\dcbc\\TaYcbTTZSb]Y]X_bZ\\a^^\\S[T\\aaacccBBBBBBBBBB')
I want to get rid of "\n" for both keys and values, but my code failed. Can anyone help me fix my code?
The newline characters aren't your problem, though they are messy. You're trying to index the view returned by dict.values() based on the string. That's not only not what you want, but it also defeats the whole purpose of using the dictionary in the first place. Views are iterables, not mappings like dicts are. Just look up the value in the dictionary, the normal way:
return (self.dna_seq, self.__fastqdict[self.dna_seq + "\n"])
As for the newlines, why not just take them out when you build the dictionary in the first place?
To modify the dictionary you can just do the following:
myNewDict = {}
for var in myDict:
myNewDict[var.strip()] = myDict[var].strip()
You can remove those pesky newlines from your dictionary's keys and values like this (assuming your dictionary was stored in a variable nameddna):
dna = {k.rstrip(): v.rstrip() for k, v in dna.iteritems()}

use slice in for loop to build a list

I would like to build up a list using a for loop and am trying to use a slice notation. My desired output would be a list with the structure:
known_result[i] = (record.query_id, (align.title, align.title,align.title....))
However I am having trouble getting the slice operator to work:
knowns = "output.xml"
i=0
for record in NCBIXML.parse(open(knowns)):
known_results[i] = record.query_id
known_results[i][1] = (align.title for align in record.alignment)
i+=1
which results in:
list assignment index out of range.
I am iterating through a series of sequences using BioPython's NCBIXML module but the problem is adding to the list. Does anyone have an idea on how to build up the desired list either by changing the use of the slice or through another method?
thanks zach cp
(crossposted at [Biostar])1
You cannot assign a value to a list at an index that doesn't exist. The way to add an element (at the end of the list, which is the common use case) is to use the .append method of the list.
In your case, the lines
known_results[i] = record.query_id
known_results[i][1] = (align.title for align in record.alignment)
Should probably be changed to
element=(record.query_id, tuple(align.title for align in record.alignment))
known_results.append(element)
Warning: The code above is untested, so might contain bugs. But the idea behind it should work.
Use:
for record in NCBIXML.parse(open(knowns)):
known_results[i] = (record.query_id, None)
known_results[i][1] = (align.title for align in record.alignment)
i+=1
If i get you right you want to assign every record.query_id one or more matching align.title. So i guess your query_ids are unique and those unique ids are related to some titles. If so, i would suggest a dictionary instead of a list.
A dictionary consists of a key (e.g. record.quer_id) and value(s) (e.g. a list of align.title)
catalog = {}
for record in NCBIXML.parse(open(knowns)):
catalog[record.query_id] = [align.title for align in record.alignment]
To access this catalog you could either iterate through:
for query_id in catalog:
print catalog[query_id] # returns the title-list for the actual key
or you could access them directly if you know what your looking for.
query_id = XYZ_Whatever
print catalog[query_id]

Categories

Resources