Python to remove duplicates using only some, not all, columns - python

I have a tab-delimited input.txt file like this
A B C
A B D
E F G
E F T
E F K
These are tab-delimited.
I want to remove duplicates only when multiple rows have the same 1st and 2nd columns.
So, even though 1st and 2nd rows are different in 3rd column, they have the same 1st and 2nd columns, so I want to remove "A B D" that appears later.
So output.txt will be like this.
A B C
E F G
If I were removing duplicates in the usual way, I would just turn the list into a set and be done.
But now I am trying to remove duplicates using only "some" columns.
Using Excel, it's easy.
Data -> Remove Duplicates -> Select columns
Using MATLAB, it's easy, too.
import input.txt -> Use "unique" function with respect to 1st and 2nd columns -> Remove the rows numbered "1"
But using Python, I couldn't find how to do this, because all I knew about removing duplicates was using "set".
===========================
This is what I experimented following undefined_is_not_a_function's answer.
I am not sure how to overwrite the result to output.txt, and how to alter the code to let me specify the columns to use for duplicate-removing (like 3 and 5).
import sys
input = sys.argv[1]
seen = set()
data = []
for line in input.splitlines():
    key = tuple(line.split(None, 2)[0])
    if key not in seen:
        data.append(line)
        seen.add(key)

You should use itertools.groupby for this. Here I am grouping the data based on the first two columns and then using next() to get the first item from each group.
>>> from itertools import groupby
>>> s = '''A B C
A B D
E F G
E F T
E F K'''
>>> for k, g in groupby(s.splitlines(), key=lambda x:x.split()[:2]):
...     print(next(g))
...
A B C
E F G
Simply replace s.splitlines() with the file object if the input is coming from a file.
Note that the above solution works only if the data is sorted by the first two columns. If that's not the case, you'll have to use a set instead.
>>> from operator import itemgetter
>>> ig = itemgetter(0, 1) #Pass any column number you want, note that indexing starts at 0
>>> s = '''A B C
A B D
E F G
E F T
E F K
A B F'''
>>> seen = set()
>>> data = []
>>> for line in s.splitlines():
...     key = ig(line.split())
...     if key not in seen:
...         data.append(line)
...         seen.add(key)
...
>>> data
['A B C', 'E F G']
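To cover the follow-up in the question (choosing the columns and writing the result to a file), the set-based approach above could be wrapped in a small function. This is only a sketch: the name dedupe_lines and its columns and sep parameters are made up for illustration, and it splits on tabs because the input is described as tab-delimited.

```python
from operator import itemgetter


def dedupe_lines(lines, columns=(0, 1), sep="\t"):
    """Keep the first line seen for each combination of the key columns.

    dedupe_lines, columns and sep are hypothetical names; columns are 0-based.
    """
    get_key = itemgetter(*columns)
    seen = set()
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        key = get_key(line.split(sep))
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


sample = ["A\tB\tC\n", "A\tB\tD\n", "E\tF\tG\n",
          "E\tF\tT\n", "E\tF\tK\n", "A\tB\tF\n"]
result = dedupe_lines(sample)
print(result)
```

To work with files, an open input.txt can be passed as lines (a file object iterates over its lines) and the returned list written to output.txt joined with newlines. For the 3rd and 5th columns, columns=(2, 4) would be passed; a line with fewer columns than requested would raise an IndexError in this sketch.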

If you have access to a Unix system, sort is a nice utility that is made for your problem.
sort -u -t$'\t' --key=1,2 filein.txt
I know this is a Python question, but sometimes Python is not the tool for the task. And you can always embed a system call in your python script.
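Embedding that call in a Python script might look roughly like the following. It is a sketch that assumes a sort executable is on the PATH; the function name sort_unique and its parameters are hypothetical, and it uses the short -k form of the key option.

```python
import subprocess


def sort_unique(path, first_col=1, last_col=2):
    """Run sort -u on a tab-delimited file, keyed on columns first_col..last_col.

    Hypothetical helper; column numbers are 1-based, as sort expects.
    """
    key = "{0},{1}".format(first_col, last_col)
    cmd = ["sort", "-u", "-t", "\t", "-k", key, path]
    output = subprocess.check_output(cmd, universal_newlines=True)
    return output.splitlines()
```

Note that the lines come back sorted by the key rather than in their original order, so this fits best when the order of the output does not matter.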

You can do it with the code below.
lst = []
with open('yourfile.txt') as file_:
    for each_line in file_.read().split('\n'):
        li = each_line.split()
        if li:  # skip blank lines, e.g. the empty string after the final newline
            lst.append(li)
dic = {}
for l in lst:
    if (l[0], l[1]) not in dic:
        dic[(l[0], l[1])] = l[2]
print(dic)
sorry for variable names.

Assuming that you have already read your file and have a list named rows (tell me if you need help with that), the following code should work:
entries = []
keys = set()
for row in rows:
    key = (row[0], row[1]) # Only the first two columns
    if key not in keys:
        keys.add(key)
        entries.append((row[0], row[1], row[2]))  # a list keeps the rows in their original order

Please note that I am not an expert, but I have some ideas that may help you.
There is a csv module that is useful for CSV files; you might find something interesting there.
First, how are you storing the data? In a list?
Something like
[[A,B,C],
[A,B,D],
[E,F,G],...]
Could be suitable. (maybe not the best choice)
Second, is it possible to go through the whole list?
You can simply take one line and compare it to all the other lines.
I would do this, supposing rows is the list of lists shown above:
index_list = []
for i in range(len(rows)):
    for j in range(i + 1, len(rows)):  # only look at rows after row i
        if rows[i][0] == rows[j][0] and rows[i][1] == rows[j][1] and j not in index_list:
            index_list.append(j)
for j in sorted(index_list, reverse=True):  # pop from the end so the remaining indices stay valid
    rows.pop(j)
This is only a sketch of the idea. It is the simplest way to perform your task, and not likely the most suitable (it will take a while on large inputs, since it performs a quadratic number of comparisons).
Edit: pop, not remove.

Related

Constantly getting IndexError and am unsure why in Python

I am new to Python, and to programming in general, and am learning Python through a website called rosalind.info, which aims to teach through problem solving.
Here is the problem: you're asked to calculate the percentage of guanine and cytosine in the DNA string given for each ID, then return the ID of the sample with the greatest percentage.
I'm working on the sample problem on the page and am experiencing some difficulty. I know my code is probably really inefficient and cumbersome but I take it that's to be expected for those who are new to programming.
Anyway, here is my code.
gc = open("rosalind_gcsamp.txt","r")
biz = gc.readlines()
i = 0
gcc = 0
d = {}
for i in xrange(biz.__len__()):
    if biz[i].startswith(">"):
        biz[i] = biz[i].replace("\n","")
        biz[i+1] = biz[i+1].replace("\n","") + biz[i+2].replace("\n","")
        del biz[i+2]
What I'm trying to accomplish here is, given input such as this:
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
Break what's given into a list based on the lines and concatenate the two lines of DNA like so:
['>Rosalind_6404', 'CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG', 'TCCCACTAATAATTCTGAGG\n']
And delete the entry two indices after the ID, which is >Rosalind. What I do with it later I still need to figure out.
However, I keep getting an index error and can't, for the life of me, figure out why. I'm sure it's a trivial reason, I just need some help.
I've even attempted the following to limited success:
for i in xrange(biz.__len__()):
    if biz[i].startswith(">"):
        biz[i] = biz[i].replace("\n","")
        biz[i+1] = biz[i+1].replace("\n","") + biz[i+2].replace("\n","")
    elif biz[i].startswith("A" or "C" or "G" or "T") and biz[i+1].startswith(">"):
        del biz[i]
which still gives me an index error but at least gives me the biz value I want.
Thanks in advance.
It is very easy to do with itertools.groupby, using lines that start with > as the keys and as the delimiters:
from itertools import groupby
with open("rosalind_gcsamp.txt","r") as gc:
    # group elements using lines that start with ">" as the delimiter
    groups = groupby(gc, key=lambda x: not x.startswith(">"))
    d = {}
    for k,v in groups:
        # if k is False the line starts with ">", so it is an ID line:
        # use it as the key and call next on the groupby object
        # to get the following group of sequence lines
        if not k:
            key, val = list(v)[0].rstrip(), "".join(map(str.rstrip,next(groups)[1]))
            d[key] = val
print(d)
{'>Rosalind_0808': 'CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT', '>Rosalind_5959': 'CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC', '>Rosalind_6404': 'CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG'}
If you need order use a collections.OrderedDict in place of d.
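You are looping over the original length of biz. In the last iteration biz[i+1] and biz[i+2] don't exist, because there is no item after the last one. On top of that, del biz[i+2] shortens the list while the xrange was built from the original length, so i eventually runs past the end.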
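One way to avoid index arithmetic altogether is to build the records in a single pass instead of editing the list in place. The sketch below uses a plain loop rather than groupby; the names parse_fasta and gc_percent and the tiny sample data are made up for illustration.

```python
def parse_fasta(lines):
    """Map each ID (without the leading '>') to its concatenated sequence."""
    records = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:]
            records[current] = ""
        elif current is not None:
            records[current] += line  # sequence may span several lines
    return records


def gc_percent(seq):
    """Percentage of G and C characters in seq."""
    if not seq:
        return 0.0
    return 100.0 * sum(1 for base in seq if base in "GC") / len(seq)


# Hypothetical sample input, not taken from the Rosalind data set.
sample = [">id1\n", "GGCC\n", "AT\n", ">id2\n", "AAAA\n"]
records = parse_fasta(sample)
best = max(records, key=lambda name: gc_percent(records[name]))
print(best, gc_percent(records[best]))
```

With a real file, the open file object can be passed to parse_fasta in place of the sample list.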

How do you get back tuple or 2 lists with key and value matching order of reg pattern group names?

I'm trying to create a repaired path using 2 dicts created with groupdict() from a compiled re pattern.
The idea is to swap out values from the wrong path with the equally named values of the correct dict.
However, because the dict entries are not in captured-group order, I can't rebuild the resulting string as a correct path: the values are not in the order the path requires.
I hope that makes sense, I've only been using python for a couple of months, so I may be missing the obvious.
# for k, v in pat_list.iteritems():
#     pat = re.compile(v)
#     m = pat.match(Path)
#     if m:
#         mgd = m.groups(0)
#         pp (mgd)
This gives the correct value order, and groupdict() creates the right k,v pairs, but in the wrong order.
You could perhaps use something like this:
import re
pat = re.compile(r"(?P<FULL>(?P<to_ext>(?:(?P<path_file_type>(?P<path_episode>(?P<path_client>[A-Z]:[\\/](?P<client_name>[a-zA-z0-1]*))[\\/](?P<episode_format>[a-zA-z0-9]*))[\\/](?P<root_folder>[a-zA-Z0-9]*)[\\/])(?P<file_type>[a-zA-Z0-9]*)[\\/](?P<path_folder>[a-zA-Z0-9]*[_,\-]\d*[_-]?\d*)[\\/](?P<base_name>(?P<episode>[a-zA-Z0-9]*)(?P<scene_split>[_,\-])(?P<scene>\d*)(?P<shot_split>[_-])(?P<shot>\d*)(?P<version_split>[_,\-a-zA-Z]*)(?P<version>[0-9]*))))[\.](?P<ext>[a-zA-Z0-9]*))")
s = r"T:\Grimm\Grimm_EPS321\Comps\Fusion\G321_08_010\G321_08_010_v02.comp"
mat = pat.match(s)
result = []
for i in range(1, pat.groups + 1):  # group numbers run from 1 to pat.groups inclusive
    name = list(pat.groupindex.keys())[list(pat.groupindex.values()).index(i)]
    cap = mat.group(i)  # mat is the match object created above
    result.append([name, cap])
That will give you a list of lists, each smaller list having the group name as its first item and the captured text as its second item.
Or if you want 2 lists, you can make something like:
names = []
captures = []
for i in range(1, pat.groups + 1):
    name = list(pat.groupindex.keys())[list(pat.groupindex.values()).index(i)]
    cap = mat.group(i)
    names.append(name)
    captures.append(cap)
Getting key from value in a dict obtained from this answer
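A shorter variant of the same idea is to sort the items of pat.groupindex by group number, which avoids the reverse lookup on every iteration. The pattern and path below are a deliberately small, made-up example rather than the pattern from the question.

```python
import re

# Hypothetical, much smaller pattern; every capturing group is named.
pat = re.compile(
    r"(?P<drive>[A-Z]):[\\/](?P<client>\w+)[\\/]"
    r"(?P<file>(?P<base>\w+)\.(?P<ext>\w+))"
)
mat = pat.match(r"T:\Grimm\shot_v02.comp")

# groupindex maps group name -> group number; sort by the number
# to recover the order in which the groups open in the pattern.
ordered = sorted(pat.groupindex.items(), key=lambda item: item[1])
names = [name for name, _ in ordered]
captures = [mat.group(name) for name in names]
pairs = list(zip(names, captures))
print(pairs)
```

Because names and captures are in pattern order, values taken from another match's groupdict() can be substituted by name while keeping this order.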

culling values in csv.DictReader

I'm working with a huge CSV that I am parsing with csv.DictReader. What would be the most efficient way to trim the data in the resulting dictionary based on the key name?
Say, just keep the keys that contain "JAN".
Thanks !
result = {key:val for key, val in row.items() if 'JAN' in key}
where row is a dictionary obtained from DictReader.
Okay, here's a dirt stupid example of using csv.DictReader with /etc/passwd
import csv
keepers = dict()
r = csv.DictReader(open('/etc/passwd', 'r'), delimiter=":",
                   fieldnames=('login','pw', 'uid','gid','gecos','homedir', 'shell'))
for i in r:
    if not i['uid'] or int(i['uid']) < 1:  # fields are read as strings; also skips lines with no uid field
        continue
    keepers[i['login']]=i
Now, trying to apply that to your question ... I'm only guessing that you were building a dictionary of dictionaries based on the phrase "from the resulting dictionary." It seems obvious that the read/object is going to return a dictionary for every input record. So there will be one resulting dictionary for every line of your file (assuming any of the common CSV "dialects").
Naturally I could have used if int(i['uid']) > 1 or if "Jan" in i['gecos'] and only added to my "keepers" if the condition holds true. I wrote it this way to emphasize how you can easily skip those values in which you're not interested, such that the rest of your for suite could do various interesting things with those records that are of interest.
However, this answer is so simple that I have to suspect that I'm not understanding the question. (I'm using ''/etc/passwd'' and a colon separated list simply because it's an extremely well known format and world-readable copies are readily available on Linux, Unix, and MacOS X systems).
You could do something like this:
>>> import csv
>>> with open('file.csv') as f:
...     culled = [{k: d[k] for k in d if "JAN" in k} for d in csv.DictReader(f)]
When I tried this on a simple CSV file with the following contents:
JAN11,FEB11,MAR11,APR11,MAY11,JUN11,JUL11,AUG11,SEP11,OCT11,NOV11,DEC11,JAN12,FEB12,MAR12,APR12
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32
I got the following result:
>>> with open('file.csv') as f:
...     culled = [{k: d[k] for k in d if "JAN" in k} for d in csv.DictReader(f)]
...
>>> culled
[{'JAN11': '1', 'JAN12': '13'}, {'JAN11': '17', 'JAN12': '29'}]
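Since the file is described as huge, it may help to work out the matching column names once from the header and to yield the trimmed rows lazily instead of building one big list. This is a sketch of that idea; the function name cull_rows and its needle parameter are hypothetical, and the in-memory sample stands in for a real file.

```python
import csv
import io


def cull_rows(fileobj, needle="JAN"):
    """Yield one dict per row, keeping only columns whose name contains needle."""
    reader = csv.DictReader(fileobj)
    # The header is read once; the key test is not repeated for every row.
    wanted = [name for name in (reader.fieldnames or []) if needle in name]
    for row in reader:
        yield {name: row[name] for name in wanted}


# Small in-memory stand-in for the CSV file.
sample = io.StringIO("JAN11,FEB11,JAN12\n1,2,13\n17,18,29\n")
culled = list(cull_rows(sample))
print(culled)
```

With a real file, the generator can be consumed row by row in a for loop, so only one trimmed row needs to be held in memory at a time.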

how to print dict values based on key containing delimiters

I have a dict:
x1={'b;0':'A1;B2;C3','b;1':'aa1;aa2;aa3','a;1': 'a1;a2;a3', 'a;0': 'A;B;C'}
My convention is that 'a;0' and 'b;0' contain the tags and 'a;1' and 'b;1' hold the corresponding values; I have to group and print based on this.
The output I want from this dict is:
<a> # this is the group name
<A>a1</A> # these are the tags and values
<B>a2</B>
<C>a3</C>
</a>
<b>
<A1>aa1</A1>
<B2>aa2</B2>
<C3>aa3</C3>
</b>
This is only a sample dict; many more groups like c;0, d;0, ... may come.
I am using code like this:
a=[]
b=[]
c=[]
d=[]
e=[]
for k,v in x1.iteritems():
    if k.split(";").count('0')==1: # 'a;0' and 'b;0' hold the tags, so I check whether the split key contains '0'
        a=k.split(";") # this contains a=['a','0','b','0']
        b=v.split(";") # this contains the 'a;0','b;0' values
    else:
        c=v.split(";") # this contains the 'a;1','b;1' values
for i in range(0,len(b)):
    d=b[i]
    e=c[i]
    print "<%s>%s<%s>"%(c,e,c)
This code only half works. Whether there is a single group in the
dict ('a;1': 'a1;a2;a3', 'a;0': 'A;B;C') or there are multiple groups in the
dict ('b;0':'A1;B2;C3','b;1':'aa1;aa2;aa3','a;1': 'a1;a2;a3', 'a;0': 'A;B;C'),
in both cases it prints
aa1
aa2
aa3
It prints only the most recent values, not all of them.
Be aware: dictionaries have no order. So the iteritems() loop does not necessarily start with 'b;0'. Try for example
for k,v in x1.iteritems():
    print k
to see. On my computer it gives
a;1
a;0
b;0
b;1
This gives a problem since your code assumes the keys to come in the order they appear in the definition of x1 [edit: or rather that they come in order]. You can e.g. iterate over sorted keys instead:
for k in sorted(x1.keys()):
    v = x1[k]
    print k, v
Then the problem with the order is solved. But I think you have more problems in your code.
Edit: Data structures:
it might be better to store your data in some way like
x1 = {'a': [('A','a1'),('B','a2'),('C','a3')], 'b': ... }
if you cannot change the format, this is how you could convert your data:
x1f = {}
for k in x1.iterkeys():
    tag, id = k.split(';')
    if int(id) == 0:
        x1f[tag] = zip(x1[k].split(';'), x1[tag+';'+'1'].split(';'))
print x1f
From there it should be easier to convert to the desired output.
Depending on whether you want to extend the complexity of the output in the future,
you might want to consider using xml.dom.minidom from the standard library:
from xml.dom import minidom
doc = minidom.Document()
Then you can use the createElement and appendChild methods.
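Putting the two suggestions together (pair tags with values per group, then build elements with minidom) could look roughly like this. It is a sketch written for Python 3; the function name build_groups and the wrapper element called groups are assumptions, the latter added because an XML document can only have one root element.

```python
from xml.dom import minidom

x1 = {'b;0': 'A1;B2;C3', 'b;1': 'aa1;aa2;aa3',
      'a;1': 'a1;a2;a3', 'a;0': 'A;B;C'}


def build_groups(data):
    """Build <group><tag>value</tag>...</group> elements under one root."""
    doc = minidom.Document()
    root = doc.createElement('groups')  # hypothetical wrapper element
    doc.appendChild(root)
    # Sort the group names so the output does not depend on dict order.
    for name in sorted({key.split(';')[0] for key in data}):
        tags = data[name + ';0'].split(';')
        values = data[name + ';1'].split(';')
        group = doc.createElement(name)
        root.appendChild(group)
        for tag, value in zip(tags, values):
            child = doc.createElement(tag)
            child.appendChild(doc.createTextNode(value))
            group.appendChild(child)
    return doc


doc = build_groups(x1)
xml_text = doc.documentElement.toxml()
print(xml_text)
```

The toprettyxml method of the document can be used instead of toxml when indented, multi-line output is wanted.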

Mapping data from excel with Python

I am reading data from an xls spreadsheet with xlrd. First, I gather the index for the column that contains the data that I need (may not always be in the same column in every instance):
amr_list, pssr_list, inservice_list = [], [], []
for i in range(sh.ncols):
    for j in range(sh.nrows):
        if 'amrprojectnumber' in sh.cell_value(j,i).lower():
            amr_list.append(sh.cell_value(j,i))
        if 'pssrnumber' in sh.cell_value(j,i).lower():
            pssr_list.append(sh.cell_value(j,i))
        if 'inservicedate' in sh.cell_value(j,i).lower():
            inservice_list.append(sh.cell_value(j,i))
Now I have three lists, which I need to use for writing data to a new workbook. The values in a row are related. So the index of an item in one list corresponds to the same index of the items in the other lists.
The amr_list has repeating string values. For example:
['4006BA','4006BA','4007AC','4007AC','4007AC']
The pssr_list always shares the same value as the amr_list but with additional info:
['4006BA(1)','4006BA(2)','4007AC(1)','4007AC(2)','4007AC(3)']
Finally, the inservice_list may or may not contain a variable date (as read from excel):
[40780.0, '', 40749.0, 40764.0, '']
This is the result I want from the data:
amr = { '4006BA':[('4006BA(1)',40780.0),('4006BA(2)','')], '4007AC':[('4007AC(1)',40749.0),('4007AC(2)',40764.0),('4007AC(3)','')] }
But I am having a hard time figuring out an easy way to get there. Thanks in advance.
Maybe this can help:
A = ['4006BA','4006BA','4007AC','4007AC','4007AC']
B = ['4006BA(1)','4006BA(2)','4007AC(1)','4007AC(2)','4007AC(3)']
C = [40780.0, '', 40749.0, 40764.0, '']
result = dict()
for item in range(len(A)):
    key = A[item]
    result.setdefault(key, [])
    result[key].append( (B[item], C[item] ) )
print(result)
This will print you the data in the format you are looking for.
Look into itertools.groupby and
zip(amr_list, pssr_list, inservice_list)
For your case:
import itertools
dict((x,list(a[1:] for a in y)) for x,y in
     itertools.groupby(zip(amr_list, pssr_list, inservice_list), lambda z: z[0]))
Note that this assumes your input is sorted by amr_list.
Another approach would be:
combined={}
for k, v in zip(amr_list, zip(pssr_list, inservice_list)):
    combined.setdefault(k, []).append(v)
Which does not require your input to be sorted.
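The same grouping can also be written with collections.defaultdict, which is a swapped-in alternative to the setdefault calls above rather than something the answers used. The lists below are the sample values from the question.

```python
from collections import defaultdict

amr_list = ['4006BA', '4006BA', '4007AC', '4007AC', '4007AC']
pssr_list = ['4006BA(1)', '4006BA(2)', '4007AC(1)', '4007AC(2)', '4007AC(3)']
inservice_list = [40780.0, '', 40749.0, 40764.0, '']

grouped = defaultdict(list)  # missing keys start out as an empty list
for key, pssr, date in zip(amr_list, pssr_list, inservice_list):
    grouped[key].append((pssr, date))

amr = dict(grouped)  # plain dict, if the default behaviour is no longer wanted
print(amr)
```

Like the setdefault version, this does not require the input to be sorted by amr_list.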
