Finding duplicates in each row and column

The function needs to be able to check a file for duplicates in each row and column.
Example of file with duplicates:
A B C
A A B
B C A
As you can see, row 2 contains a duplicate (two A's), and so does column 1 (two A's).
Code:
def duplication_char(dc):
    with open(dc, "r") as duplicatechars:
        linecheck = duplicatechars.readlines()
    linecheck = [line.split() for line in linecheck]
    for row in linecheck:
        if len(set(row)) != len(row):
            print ("duplicates", " ".join(row))
    for column in zip(*linecheck):
        if len(set(column)) != len(column):
            print ("duplicates", " ".join(column))

Well, here is how I would do it.
First, read your files and create a 2d numpy array with the content:
import numpy
with open('test.txt', 'r') as fil:
    lines = fil.readlines()
lines = [line.strip().split() for line in lines]
arr = numpy.array(lines)
Then, check if each row has duplicates using sets (a set has no duplicates, so if the length of the set is different than the length of the array, the array has duplicates):
for row in arr:
    if len(set(row)) != len(row):
        print 'Duplicates in row: ', row
Then, check if each column has duplicates using sets, by transposing your numpy array:
for col in arr.T:
    if len(set(col)) != len(col):
        print 'Duplicates in column: ', col
If you wrap all of this in a function:
def check_for_duplicates(filename):
    import numpy
    with open(filename, 'r') as fil:
        lines = fil.readlines()
    lines = [line.strip().split() for line in lines]
    arr = numpy.array(lines)
    for row in arr:
        if len(set(row)) != len(row):
            print 'Duplicates in row: ', row
    for col in arr.T:
        if len(set(col)) != len(col):
            print 'Duplicates in column: ', col
As suggested by Apero, you can also do this without numpy using zip (https://docs.python.org/3/library/functions.html#zip):
def check_for_duplicates(filename):
    with open(filename, 'r') as fil:
        lines = fil.readlines()
    lines = [line.strip().split() for line in lines]
    for row in lines:
        if len(set(row)) != len(row):
            print 'Duplicates in row: ', row
    for col in zip(*lines):
        if len(set(col)) != len(col):
            print 'Duplicates in column: ', col
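On your example, the numpy version prints the following (the zip version reports the same row and column, shown as a list and a tuple):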
# Duplicates in row: ['A' 'A' 'B']
# Duplicates in column: ['A' 'A' 'B']

You can have a List of Lists and use zip to transpose it.
Given your example, try:
from collections import Counter
with open(fn) as fin:
    data = [line.split() for line in fin]
rowdups = {}
coldups = {}
for d, m in ((rowdups, data), (coldups, zip(*data))):
    for i, sl in enumerate(m):
        count = Counter(sl)
        for c in count.most_common():
            if c[1] > 1:
                d.setdefault(i, []).append(c)
>>> rowdups
{1: [('A', 2)]}
>>> coldups
{0: [('A', 2)]}
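The answers above print their findings; a variant that returns them can be easier to reuse and test. Here is a minimal Python 3 sketch along those lines. It is not the code of any answer above: the name find_duplicates and the choice to take already-split rows instead of a filename are assumptions made for illustration.

```python
from collections import Counter


def find_duplicates(rows):
    """Hypothetical helper: return (row_dups, col_dups), two dicts mapping a
    0-based index to the sorted values that occur more than once there."""
    def dups(sequences):
        found = {}
        for i, seq in enumerate(sequences):
            repeated = sorted(v for v, n in Counter(seq).items() if n > 1)
            if repeated:
                found[i] = repeated
        return found

    # zip(*rows) transposes the grid; it stops at the shortest row.
    return dups(rows), dups(zip(*rows))


text = "A B C\nA A B\nB C A\n"
grid = [line.split() for line in text.splitlines()]
row_dups, col_dups = find_duplicates(grid)
print(row_dups)  # {1: ['A']}
print(col_dups)  # {0: ['A']}
```

The indices are 0-based, so row 2 and column 1 of the question show up as 1 and 0.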

Related

Index a list python

I have a text file with tuples in it that I would like to convert to a list with indices as follows:
2, 60;
3, 67;
4, 67;
5, 60;
6, 60;
7, 67;
8, 67;
Needs to become:
60, 2 5 6
67, 3 4 7 8
And so on with many numbers...
I've made it as far as reading in the file and getting rid of the punctuation and casting it as ints, but I'm not quite sure how to iterate through and add multiple items at a given index of a list. Any help would be much appreciated!
Here is my code so far:
with open('cues.txt') as f:
    lines = f.readlines()
arr = []
for i in lines:
    i = i.replace(', ', ' ')
    i = i.replace(';', '')
    i = i.replace('\n', '')
arr.append(i)
array = []
for line in arr: # read rest of lines
    array.append([int(x) for x in line.split()])
arr = []
#make array of first values 40 to 80
for i in range(40, 81):
    arr.append(i)
print arr
for j in range(0, len(array)):
    for i in array:
        if (i[0] == arr[j]):
            arr[i[0]].extend(i[1])

Do you need it in a list? If not, you can simply collect the values into a dict:
i = {}
with open('cues.txt') as f:
    # strip() removes the trailing newline first, so that rstrip(';') can remove the semicolon
    for (x, y) in (l.strip().rstrip(';').split(', ') for l in f):
        i.setdefault(y, []).append(x)
for k, v in i.iteritems():
    print "{0}, {1}".format(k, " ".join(v))

You could use defaultdict from the collections module.
from collections import defaultdict
with open('file') as f:
    l = []
    for line in f:
        l.append(tuple(line.replace(';','').strip().split(', ')))
m = defaultdict(list)
for i in l:
    m[i[1]].append(i[0])
for j in m:
    print j+", "+' '.join(m[j])

You can use a dict to store the index:
results = {}
with open("cues.txt") as f:
    for line in f:
        value, index = line.strip()[:-1].split(", ")
        if index not in results:
            results[index] = [value]
        else:
            results[index].append(value)
for index in results:
    print("{0}, {1}".format(index, " ".join(results[index])))

1) This code is wrong on several levels. See the inline comment:
arr = []
for i in lines:
    i = i.replace(', ', ' ')
    i = i.replace(';', '')
    i = i.replace('\n', '')
arr.append(i) # Wrong indentation. You will only get the last line in arr
You can simply do
arr = []
for i in lines:
    i = i.strip().replace(';', '').split(", ")
    arr.append(i)
This removes the newline character, removes the ; and splits each line into a two-item list such as ['2', '60'].
2) This code can be simplified to one line
arr = [] # It should not be named `arr` because it overwrites the arr created in stage 1
for i in range(40, 81):
    arr.append(i)
print arr
becomes:
result = range(40, 81)
But a list is not an ideal data structure for your problem; you should use a dictionary instead. In other words, you can drop this bit of code altogether.
3) Finally, you are ready to iterate over arr and build the result:
from collections import defaultdict
result = defaultdict(list)
for a in arr:
    result[a[1]].append(a[0])

You should use a dict to store the data, as in the following code:
d = {}
with open('cues.txt') as f:
    lines = f.readlines()
for line in lines:
    line = line.split(',')
    key = line[1].strip()[0:-1]  # drop the trailing ';'
    if key in d:
        d[key].append(line[0])
    else:
        d[key] = [line[0]]
for key, value in d.iteritems():
    print "{0}, {1}".format(key, " ".join(value))
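For anyone on Python 3, the grouping idea shared by the answers above can be sketched as follows. This is only an illustration under a few assumptions: the function name group_by_second is made up, the input is passed as a list of lines instead of being read from cues.txt, and the numbers are converted to int so that the keys sort numerically.

```python
from collections import defaultdict


def group_by_second(lines):
    """Hypothetical helper: group the first number of each 'a, b;' line
    under its second number."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip().rstrip(";")
        if not line:  # skip blank lines
            continue
        value, key = (int(part) for part in line.split(","))
        groups[key].append(value)
    return groups


sample = ["2, 60;\n", "3, 67;\n", "4, 67;\n", "5, 60;\n",
          "6, 60;\n", "7, 67;\n", "8, 67;\n"]
for key, values in sorted(group_by_second(sample).items()):
    print("{}, {}".format(key, " ".join(str(v) for v in values)))
# 60, 2 5 6
# 67, 3 4 7 8
```

With a real file you would pass the open file object in place of sample, since iterating over a file yields its lines.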

How can I customize map() for a list of strings in Python?

How do I tell map() to selectively convert only some of the strings (not all the strings) within a list to integer values?
Input file (tab-delimited):
abc1 34 56
abc1 78 90
My attempt:
import csv
with open('file.txt') as f:
    start = csv.reader(f, delimiter='\t')
    for row in start:
        X = map(int, row)
        print X
Error message: ValueError: invalid literal for int() with base 10: 'abc1'
When I read in the file with the csv module, it is a list of strings:
['abc1', '34', '56']
['abc1', '78', '90']
map() obviously does not like 'abc1', even though it is a string just like '34' is a string.
I thoroughly examined "Convert string to integer using map()" but it did not help me deal with the first column of my input file.

def safeint(val):
    try:
        return int(val)
    except ValueError:
        return val
for row in start:
    X = map(safeint, row)
    print X
is one way to do it ... you can step it up even more
from functools import partial
myMapper = partial(map,safeint)
map(myMapper,start)
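One caveat if you are on Python 3: map() returns a lazy iterator there, so printing its result shows a map object instead of the converted values. A minimal sketch of the same idea with a list comprehension, using in-memory rows as a stand-in for the csv reader (the sample rows are taken from the question):

```python
def safeint(val):
    """Return int(val) when val parses as an integer, otherwise val unchanged."""
    try:
        return int(val)
    except ValueError:
        return val


# Stand-in for the rows produced by csv.reader in the question.
rows = [["abc1", "34", "56"], ["abc1", "78", "90"]]
converted = [[safeint(cell) for cell in row] for row in rows]
print(converted)  # [['abc1', 34, 56], ['abc1', 78, 90]]
```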

Map only the part of the list that interests you:
row[1:] = map(int, row[1:])
print row
Here, row[1:] is a slice of the list that starts at the second element (the one with index 1) up to the end of the list.
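The slice assignment also works unchanged on Python 3, because slice assignment accepts any iterable, including the lazy object that map() returns there. A small sketch, with io.StringIO standing in for the tab-delimited file from the question (that stand-in is an assumption for the sake of a self-contained example):

```python
import csv
import io

# In-memory stand-in for the tab-delimited file from the question.
data = io.StringIO("abc1\t34\t56\nabc1\t78\t90\n")

rows = []
for row in csv.reader(data, delimiter="\t"):
    # Replace everything after the first column with its int conversion.
    row[1:] = map(int, row[1:])
    rows.append(row)

print(rows)  # [['abc1', 34, 56], ['abc1', 78, 90]]
```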

I like Roberto Bonvallet's answer, but if you want to do things immutably, as you're doing in your question, you can:
import csv
with open('file.txt') as f:
    start = csv.reader(f, delimiter='\t')
    for row in start:
        X = [row[0]] + map(int, row[1:])
        print X
… or…
numeric_cols = (1, 2)
X = [int(value) if col in numeric_cols else value
     for col, value in enumerate(row)]
… or, probably most readably, wrap that up in a map_partial function, so you can do this:
X = map_partial(int, (1, 2), row)
You could implement it as:
def map_partial(func, indices, iterable):
    return [func(value) if i in indices else value
            for i, value in enumerate(iterable)]
If you want to be able to access all of the rows after you're done, you can't just print each one, you have to store it in some kind of structure. What structure you want depends on how you want to refer to these rows later.
For example, maybe you just want a list of rows:
rows = []
with open('file.txt') as f:
    for row in csv.reader(f, delimiter='\t'):
        rows.append(map_partial(int, (1, 2), row))
print('The second column of the first row is {}'.format(rows[0][1]))
Or maybe you want to be able to look them up by the string ID in the first column, rather than by index. Since those IDs aren't unique, each ID will map to a list of rows:
rows = {}
with open('file.txt') as f:
    for row in csv.reader(f, delimiter='\t'):
        rows.setdefault(row[0], []).append(map_partial(int, (1, 2), row))
print('The second column of the first abc1 row is {}'.format(rows['abc1'][0][1]))

Adding column in CSV python and enumerating it

My CSV looks like:
John,Bomb,Dawn
3,4,5
3,4,5
3,4,5
I want to add an ID column in front, like so:
ID,John,Bomb,Dawn
1,3,4,5
2,3,4,5
3,3,4,5
using the enumerate function, but I don't know how. Here's my code so far:
import csv
with open("testi.csv", 'rb') as input, open('temp.csv', 'wb') as output:
    reader = csv.reader(input, delimiter = ',')
    writer = csv.writer(output, delimiter = ',')
    all = []
    row = next(reader)
    row.append('ID')
    all.append(row)
    count = 0
    for row in reader:
        count += 1
        while count:
            all.append(row)
            row.append(enumerate(reader, 1))
            break
    writer.writerows(all)
And the output comes all wrong:
John,Bomb,Dawn,ID
3,4,5,<enumerate object at 0x7fb2a5728d70>
3,4,5,<enumerate object at 0x1764370>
3,4,5,<enumerate object at 0x17643c0>
So the ID column ends up at the end when it should be at the start, and it doesn't even contain 1, 2, 3; some weird enumerate objects come out instead.

I can suggest the code below to solve your question:
import csv
with open("testi.csv", 'rb') as input, open('temp.csv', 'wb') as output:
    reader = csv.reader(input, delimiter = ',')
    writer = csv.writer(output, delimiter = ',')
    all = []
    row = next(reader)
    row.insert(0, 'ID')
    all.append(row)
    for k, row in enumerate(reader):
        all.append([str(k+1)] + row)
    writer.writerows(all)
More compact code can be:
all = [['ID'] + next(reader)] + [[str(k+1)] + row for k, row in enumerate(reader)]
UPDATE (some explanation):
You have misunderstood how enumerate works. It is meant to be used in a for loop: iterating over its result yields a sequence of tuples, where the first item is the running number of the element and the second is the element itself.
enumerate itself returns an enumerate object (docs), so when you write that object to the file it is converted to a string via its __repr__ method, which produces <enumerate object at ...>.
In other words, enumerate lets you avoid extra counters in loops, such as your count += 1 variable.
You also have some very strange code here:
while count:
    all.append(row)
    row.append(enumerate(reader, 1))
    break
Because of the unconditional break, the body of this while loop can never run more than once per row.

You should use insert() instead of append(). This will allow you to specify the index where you want to add the element.
Try this
import csv
with open("testi.csv", 'rb') as input, open('temp.csv', 'wb') as output:
    reader = csv.reader(input, delimiter = ',')
    writer = csv.writer(output, delimiter = ',')
    all = []
    row = next(reader)
    row.insert(0, 'ID')
    all.append(row)
    count = 0
    for row in reader:
        count += 1
        row.insert(0, count)
        all.append(row)
    writer.writerows(all)

You can do something like this:
import csv
with open('testi.csv') as inp, open('temp.csv', 'w') as out:
    reader = csv.reader(inp)
    writer = csv.writer(out, delimiter=',')
    # No need to use `insert()` or `append()`; simply use `+` to concatenate two lists.
    writer.writerow(['ID'] + next(reader))
    # Iterate over the enumerate object of reader and pass the starting index as 1.
    writer.writerows([i] + row for i, row in enumerate(reader, 1))
enumerate() returns an enumerate object that yields the index and item as a tuple, one at a time, so you need to iterate over the enumerate object instead of writing it to the CSV file.
>>> lst = ['a', 'b', 'c']
>>> e = enumerate(lst)
>>> e
<enumerate object at 0x1d48f50>
>>> for ind, item in e:
...     print ind, item
...
0 a
1 b
2 c
Output:
>>> !cat temp.csv
ID,John,Bomb,Dawn
1,3,4,5
2,3,4,5
3,3,4,5
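If you are on Python 3, the same approach can be sketched like this. The function name add_id_column is made up for illustration, and io.StringIO objects stand in for testi.csv and temp.csv so the example is self-contained; with real files on Python 3 you would open both in text mode with newline='' instead of 'rb'/'wb'.

```python
import csv
import io


def add_id_column(src, dst):
    """Hypothetical helper: copy CSV rows from src to dst, prepending a
    1-based ID column."""
    reader = csv.reader(src)
    # lineterminator is set explicitly only to keep the output predictable here.
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(["ID"] + next(reader))      # header row
    for i, row in enumerate(reader, start=1):   # data rows, numbered from 1
        writer.writerow([i] + row)


src = io.StringIO("John,Bomb,Dawn\n3,4,5\n3,4,5\n3,4,5\n")
dst = io.StringIO()
add_id_column(src, dst)
print(dst.getvalue())
```

Writing row by row avoids building the whole list in memory, which the all list in the question does.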

output items in a list using curly braces

I have a text file with 'n' lines. I want to extract first word, second word, third word, ... of each line into a list1, list2, list3,...
Suppose input txt file contains:
a1#a2#a3
b1#b2#b3#b4
c1#c2
After reading the file, Output should be:
List1: {a1,b1,c1}
List2: {a2,b2,c2}
List3: {a3,b3}
List4: {b4}
The code:
f = open('path','r')
for line in f:
    List=line.split('#')
    List1 = List[0]
    print '{0},'.format(List1),
    List2 = List[1]
    print '{0},'.format(List2),
    List3 = List[2]
    print '{0},'.format(List3),
    List4 = List[3]
    print '{0},'.format(List4),
OUTPUT
a1,b1,c1,a2,b2,c2,a3,b3,b4

You really don't want to use separate lists here; just use a list of lists. Using the csv module here would make handling splitting a little easier:
import csv
columns = [[] for _ in range(4)] # 4 columns expected
with open('path', 'rb') as f:
    reader = csv.reader(f, delimiter='#')
    for row in reader:
        for i, col in enumerate(row):
            columns[i].append(col)
or, if the number of columns needs to grow dynamically:
import csv
columns = []
with open('path', 'rb') as f:
    reader = csv.reader(f, delimiter='#')
    for row in reader:
        while len(row) > len(columns):
            columns.append([])
        for i, col in enumerate(row):
            columns[i].append(col)
Or you can use itertools.izip_longest() to transpose the CSV rows:
import csv
from itertools import izip_longest
with open('path', 'rb') as f:
    reader = csv.reader(f, delimiter='#')
    columns = [filter(None, column) for column in izip_longest(*reader)]
In the end, you can then print your columns with:
for i, col in enumerate(columns, 1):
    print 'List{}: {{{}}}'.format(i, ','.join(col))
Demo:
>>> import csv
>>> from itertools import izip_longest
>>> data = '''\
... a1#a2#a3
... b1#b2#b3#b4
... c1#c2
... '''.splitlines(True)
>>> reader = csv.reader(data, delimiter='#')
>>> columns = [filter(None, column) for column in izip_longest(*reader)]
>>> for i, col in enumerate(columns, 1):
...     print 'List{}: {{{}}}'.format(i, ','.join(col))
...
List1: {a1,b1,c1}
List2: {a2,b2,c2}
List3: {a3,b3}
List4: {b4}
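On Python 3, izip_longest is named zip_longest and filter() returns a lazy iterator, so the transposing step needs small changes. A minimal sketch of that variant, using the sample lines from the question as in-memory data:

```python
import csv
from itertools import zip_longest

# Sample lines from the question; csv.reader accepts any iterable of strings.
data = ["a1#a2#a3\n", "b1#b2#b3#b4\n", "c1#c2\n"]
reader = csv.reader(data, delimiter="#")

# zip_longest pads short rows with None; drop that padding afterwards.
columns = [[cell for cell in column if cell is not None]
           for column in zip_longest(*reader)]

for i, col in enumerate(columns, 1):
    print("List{}: {{{}}}".format(i, ",".join(col)))
# List1: {a1,b1,c1}
# List2: {a2,b2,c2}
# List3: {a3,b3}
# List4: {b4}
```

Note one difference from filter(None, column): testing for None keeps genuinely empty fields (empty strings), which filter(None, ...) would discard.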

Python 2: IndexError: list index out of range

I have an "asin.txt" document:
in,Huawei1,DE
out,Huawei2,UK
out,Huawei3,none
in,Huawei4,FR
in,Huawei5,none
in,Huawei6,none
out,Huawei7,IT
I open this file twice and build two OrderedDicts:
from collections import OrderedDict
reader = csv.reader(open('asin.txt','r'),delimiter=',')
reader1 = csv.reader(open('asin.txt','r'),delimiter=',')
d = OrderedDict((row[0], row[1].strip()) for row in reader)
d1 = OrderedDict((row[1], row[2].strip()) for row in reader1)
Then I want to create variables (a,b,c,d) so if we take the first line of the asin.txt it should be like: a = in; b = Huawei1; c = Huawei1; d = DE. To do this I'm using a "for" loop:
from itertools import izip
for (a, b), (c, d) in izip(d.items(), d1.items()): # here
    try:
        .......
It worked before, but now, for some reason, it prints an error:
d = OrderedDict((row[0], row[1].strip()) for row in reader)
IndexError: list index out of range
How do I fix that?

You probably have a row in your text file that does not have at least two fields delimited by ",". A blank line is enough (csv.reader returns an empty list for it), as is a line with a single field, e.g.:
in
Try a solution along these lines:
d = OrderedDict((row[0], row[1].strip()) for row in reader if len(row) >= 2)
or
l = []
for row in reader:
    if len(row) >= 2:
        l.append((row[0], row[1].strip()))
d = OrderedDict(l)
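To find out which lines are the short ones, you can record their line numbers while building both dicts in a single pass. This is only a sketch under assumptions: the sample lines are invented to include one blank line and one two-field line, the skipped list is a made-up name, and rows are required to have all three fields because d1 reads row[2].

```python
import csv
from collections import OrderedDict

# Invented sample: line 2 is blank and line 3 lacks the third field.
lines = ["in,Huawei1,DE\n", "\n", "out,Huawei2\n", "out,Huawei3,none\n"]

d = OrderedDict()
d1 = OrderedDict()
skipped = []  # 1-based numbers of the lines that were too short
for lineno, row in enumerate(csv.reader(lines), start=1):
    if len(row) < 3:  # blank lines come back as []
        skipped.append(lineno)
        continue
    d[row[0]] = row[1].strip()
    d1[row[1]] = row[2].strip()

print(skipped)  # [2, 3]
```

With the real file you would pass the open file object to csv.reader in place of lines.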
