Cannot remove duplicates from a list using Python

I have a CSV file which I want to edit, so I read the file and copy its contents into a list. The list contains duplicates, so I do:
csv_in = list(set(csv_in))
But I get:
TypeError: unhashable type: 'list'
with open(source_initial2, 'r', encoding='ISO-8859-1') as file_in, open(source_initial3, 'w', encoding='ISO-8859-1', newline='') as file_out:
    csv_in = csv.reader(file_in, delimiter=',')
    csv_out = csv.writer(file_out, delimiter=';')
    csv_in = list(set(csv_in))
    for row in csv_in:
        for i in range(len(row)):
            if "/" in row[i]:
                row[i] = row[i].replace('/', '')
            if "\"" in row[i]:
                row[i] = row[i].replace('\"', '')
            if "Yes" in row[i]:
                row[i] = row[i].replace('Yes', '1')
            if "No" in row[i]:
                row[i] = row[i].replace('No', '0')
            if myrowlen > 5:
                break
        print(row)
        csv_out.writerow(row)
The list is something like
[['DCA.P/C.05820', '5707119001793', 'P/C STEELSERIES SUR... QcK MINI', '5,4', 'Yes'],['DCA.P/C.05820', '5707119001793', 'P/C STEELSERIES SUR... QcK MINI', '5,4', 'Yes'].....['DCA.P/C.05820', '5707119001793', 'P/C STEELSERIES SUR... QcK MINI', '5,4', 'Yes']]
Why do I get this, and how can I solve it?
Thank you.

The problem is that csv_in is a list of lists, and a list is not a hashable data type. To get around the issue you can do the following:
csv_in = list(set([tuple(row) for row in csv_in]))
or if you need it as a list of lists:
csv_in = [list(element) for element in set([tuple(row) for row in csv_in])]
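If the original order of the rows matters, a dict-based variant keeps the first occurrence of each row instead (a minimal sketch, assuming csv_in has already been materialized as a list of lists):
# Cast each row to a tuple so it can be used as a dict key (keys must be hashable);
# dict.fromkeys drops later duplicates and preserves insertion order (Python 3.7+).
csv_in = [list(row) for row in dict.fromkeys(tuple(row) for row in csv_in)]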

csv.reader yields rows, where each row read from the CSV file is returned as a list of strings. A set requires its items to be immutable (and therefore hashable), and list is not one of those types.
test_reader = [[0,1,2], [3,4,5]]
print(set(test_reader)) # throws TypeError: unhashable type: 'list'
# after casting to tuple type
test_reader = [(0,1,2), (3,4,5)]
print(set(test_reader)) # {(0, 1, 2), (3, 4, 5)}

Related

Python - '<' not supported between instances of 'int' and 'str'

Hi, I am new to coding and I've just started. I have seen many examples of the same error, but I am unsure how to apply them to my code. I am trying to sort a text file in order of score. This is my current code:
ScoresFile = open("Top Scores.txt","r")
newScoresRec = []
ScoresRec = []
for row in ScoresFile:
    ScoresRec = row.split(",")
    username = ScoresRec[0]
    Bestscores = int(ScoresRec[1])
    newScoresRec.append(username)
    newScoresRec.append(Bestscores)
    ScoresRec.append(newScoresRec)
    newScoresRec = []
sortedTable = sorted(ScoresRec, key=lambda x: x[1])
for n in range(len(sortedTable)):
    print(sortedTable[n][0], sortedTable[n][1])
ScoresFile.close()
The text file is just in the simple format of:
'username','score'
example: BO15,78
Any help would be greatly appreciated thanks.
Try this and let me know if it works:
scores_rec = []
with open("Top Scores.txt", "r") as scores_file:
    lines = scores_file.readlines()
    for line in lines:
        s_line = line.rstrip().split(",")
        scores_rec.append([s_line[0], int(s_line[1])])
sorted_table = sorted(scores_rec, key=lambda x: x[1])
for item in sorted_table:
    print(item[0], item[1])
Part of the issue is that you're appending back into the split list,
ScoresRec = row.split(",")
...
ScoresRec.append(newScoresRec)
rather than building a list of lists; by the end of the loop, ScoresRec holds only the split fields of the last line of the file plus the [username, score] list you appended to it.
You are therefore sorting a list that mixes strings with a nested list: the sort key x[1] is a character for the string items but an int for the nested list, so Python ends up comparing an int with a str, hence the error.
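A minimal fix along those lines, keeping the variable names from the question, is to append each parsed record to a separate results list (a sketch, not tested against your exact file):
allScores = []                                   # one [username, score] pair per line
for row in ScoresFile:
    ScoresRec = row.split(",")
    allScores.append([ScoresRec[0], int(ScoresRec[1])])
sortedTable = sorted(allScores, key=lambda x: x[1])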
Text file (file.txt):
"MORETHANSMALLER",2
"BIGGER",10
"SMALLER",1
"UNDERBIGGER",9
"MIDDLE",5
Code (csv_reader.py):
import csv

with open('file.txt', newline='') as csvfile:
    rows = list(csv.reader(csvfile, delimiter=',', quotechar='"'))
rows.sort(key=lambda x: int(x[1]))
print('From 1 to 10:')
print(rows)
rows.reverse()
print('From 10 to 1:')
print(rows)
Result:
From 1 to 10:
[['SMALLER', '1'], ['MORETHANSMALLER', '2'], ['MIDDLE', '5'], ['UNDERBIGGER', '9'], ['BIGGER', '10']]
From 10 to 1:
[['BIGGER', '10'], ['UNDERBIGGER', '9'], ['MIDDLE', '5'], ['MORETHANSMALLER', '2'], ['SMALLER', '1']]
Don't parse CSV files manually. Use Python's csv module; it will help you avoid traps with quotes.
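To illustrate the trap with quotes, here is a small sketch (the sample line is made up) comparing a plain split(',') with csv.reader on a quoted field that contains a comma:
import csv
import io

line = '"LAST, FIRST",42'                      # hypothetical quoted field containing a comma

print(line.split(','))                         # ['"LAST', ' FIRST"', '42'] -- field is torn apart
print(next(csv.reader(io.StringIO(line))))     # ['LAST, FIRST', '42'] -- quoting handled correctly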

Making a list from list of lists

I have the below list of lists
[['Afghanistan,2.66171813,7.460143566,0.490880072,52.33952713,0.427010864,-0.106340349,0.261178523'], ['Albania,4.639548302,9.373718262,0.637698293,69.05165863,0.74961102,-0.035140377,0.457737535']]
I want to create a new list with only the country names.
So
[Afghanistan, Albania]
Currently using this code.
with open(fileName, "r") as f:
    _ = next(f)
    row_lst = f.read().split()
    countryLst = [[i] for i in row_lst]
Try this, using split(','), since the first element of each inner list is a single comma-separated string.
>>> lst = [['Afghanistan,2.66171813,7.460143566,0.490880072,52.33952713,0.427010864,-0.106340349,0.261178523'], ['Albania,4.639548302,9.373718262,0.637698293,69.05165863,0.74961102,-0.035140377,0.457737535']]
Output:
>>> [el[0].split(',')[0] for el in lst]
['Afghanistan', 'Albania']
Explanation:
# el[0] gives the first element of the inner list, which is a string.
# .split(',') returns a list of elements after splitting on `,`.
# [0] finally selects the first element, as required.
Edit-1:
Using regex,
import re

pattern = r'([a-zA-Z]+)'
new_lst = []
for el in lst:
    new_lst += re.findall(pattern, el[0])
>>> new_lst # output
['Afghanistan', 'Albania']
Looks like a CSV file. Use the csv module
Ex:
import csv

with open(fileName, "r") as f:
    reader = csv.reader(f)
    next(reader)  # Skip header
    country = [row[0] for row in reader]

Finding duplicates in each row and column

The function needs to be able to check a file for duplicates in each row and column.
Example of file with duplicates:
A B C
A A B
B C A
As you can see, there is a duplicate in row 2 with 2 A's but also in Column 1 with two A's.
code:
def duplication_char(dc):
    with open(dc, "r") as duplicatechars:
        linecheck = duplicatechars.readlines()
        linecheck = [line.split() for line in linecheck]
        for row in linecheck:
            if len(set(row)) != len(row):
                print("duplicates", " ".join(row))
        for column in zip(*linecheck):
            if len(set(column)) != len(column):
                print("duplicates", " ".join(column))
Well, here is how I would do it.
First, read your file and create a 2D numpy array from its contents:
import numpy

with open('test.txt', 'r') as fil:
    lines = fil.readlines()
lines = [line.strip().split() for line in lines]
arr = numpy.array(lines)
Then, check if each row has duplicates using sets (a set has no duplicates, so if the length of the set differs from the length of the row, the row has duplicates):
for row in arr:
    if len(set(row)) != len(row):
        print('Duplicates in row: ', row)
Then, check if each column has duplicates using sets, by transposing your numpy array:
for col in arr.T:
    if len(set(col)) != len(col):
        print('Duplicates in column: ', col)
If you wrap all of this in a function:
def check_for_duplicates(filename):
    import numpy
    with open(filename, 'r') as fil:
        lines = fil.readlines()
    lines = [line.strip().split() for line in lines]
    arr = numpy.array(lines)
    for row in arr:
        if len(set(row)) != len(row):
            print('Duplicates in row: ', row)
    for col in arr.T:
        if len(set(col)) != len(col):
            print('Duplicates in column: ', col)
As suggested by Apero, you can also do this without numpy using zip (https://docs.python.org/3/library/functions.html#zip):
def check_for_duplicates(filename):
    with open(filename, 'r') as fil:
        lines = fil.readlines()
    lines = [line.strip().split() for line in lines]
    for row in lines:
        if len(set(row)) != len(row):
            print('Duplicates in row: ', row)
    for col in zip(*lines):
        if len(set(col)) != len(col):
            print('Duplicates in column: ', col)
In your example, this code prints:
# Duplicates in row:  ['A', 'A', 'B']
# Duplicates in column:  ('A', 'A', 'B')
You can keep a list of lists and use zip to transpose it.
Given your example, try:
from collections import Counter

with open(fn) as fin:
    data = [line.split() for line in fin]

rowdups = {}
coldups = {}
for d, m in ((rowdups, data), (coldups, zip(*data))):
    for i, sl in enumerate(m):
        count = Counter(sl)
        for c in count.most_common():
            if c[1] > 1:
                d.setdefault(i, []).append(c)
>>> rowdups
{1: [('A', 2)]}
>>> coldups
{0: [('A', 2)]}

How can I customize map() for a list of strings in Python?

How do I tell map() to selectively convert only some of the strings (not all the strings) within a list to integer values?
Input file (tab-delimited):
abc1 34 56
abc1 78 90
My attempt:
import csv

with open('file.txt') as f:
    start = csv.reader(f, delimiter='\t')
    for row in start:
        X = map(int, row)
        print X
Error message: ValueError: invalid literal for int() with base 10: 'abc1'
When I read in the file with the csv module, it is a list of strings:
['abc1', '34', '56']
['abc1', '78', '90']
map() obviously does not like 'abc1' even though it is a string just like '34' is a string.
I thoroughly examined Convert string to integer using map() but it did not help me deal with the first column of my input file.
def safeint(val):
    try:
        return int(val)
    except ValueError:
        return val

for row in start:
    X = map(safeint, row)
    print X
is one way to do it ... you can step it up even more
from functools import partial
myMapper = partial(map,safeint)
map(myMapper,start)
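A quick usage sketch of that partial, reusing the safeint helper defined above (hypothetical sample rows; written for Python 3, where map() is lazy and needs to be wrapped in list()):
from functools import partial

start = [['abc1', '34', '56'], ['abc1', '78', '90']]   # rows as csv.reader would yield them

myMapper = partial(map, safeint)                       # maps safeint over a single row
result = [list(mapped) for mapped in map(myMapper, start)]
print(result)                                          # [['abc1', 34, 56], ['abc1', 78, 90]]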
Map only the part of the list that interests you:
row[1:] = map(int, row[1:])
print row
Here, row[1:] is a slice of the list that starts at the second element (the one with index 1) up to the end of the list.
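For example, applied to the first sample row (a small sketch; list slice assignment accepts any iterable, so the map object works here):
row = ['abc1', '34', '56']
row[1:] = map(int, row[1:])   # only the numeric columns are converted
print(row)                    # ['abc1', 34, 56]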
I like Roberto Bonvallet's answer, but if you want to do things immutably, as you're doing in your question, you can:
import csv

with open('file.txt') as f:
    start = csv.reader(f, delimiter='\t')
    for row in start:
        X = [row[0]] + map(int, row[1:])
        print X
… or…
numeric_cols = (1, 2)
X = [int(value) if col in numeric_cols else value
     for col, value in enumerate(row)]
… or, probably most readably, wrap that up in a map_partial function, so you can do this:
X = map_partial(int, (1, 2), row)
You could implement it as:
def map_partial(func, indices, iterable):
    return [func(value) if i in indices else value
            for i, value in enumerate(iterable)]
If you want to be able to access all of the rows after you're done, you can't just print each one, you have to store it in some kind of structure. What structure you want depends on how you want to refer to these rows later.
For example, maybe you just want a list of rows:
rows = []
with open('file.txt') as f:
    for row in csv.reader(f, delimiter='\t'):
        rows.append(map_partial(int, (1, 2), row))
print('The second column of the first row is {}'.format(rows[0][1]))
Or maybe you want to be able to look them up by the string ID in the first column, rather than by index. Since those IDs aren't unique, each ID will map to a list of rows:
rows = {}
with open('file.txt') as f:
    for row in csv.reader(f, delimiter='\t'):
        rows.setdefault(row[0], []).append(map_partial(int, (1, 2), row))
print('The second column of the first abc1 row is {}'.format(rows['abc1'][0][1]))

Adding column in CSV python and enumerating it

My CSV looks like:
John,Bomb,Dawn
3,4,5
3,4,5
3,4,5
I want to add an ID column in front, like so:
ID,John,Bomb,Dawn
1,3,4,5
2,3,4,5
3,3,4,5
using the enumerate function, but I don't know how. Here's my code so far:
import csv

with open("testi.csv", 'rb') as input, open('temp.csv', 'wb') as output:
    reader = csv.reader(input, delimiter = ',')
    writer = csv.writer(output, delimiter = ',')
    all = []
    row = next(reader)
    row.append('ID')
    all.append(row)
    count = 0
    for row in reader:
        count += 1
        while count:
            all.append(row)
            row.append(enumerate(reader, 1))
            break
    writer.writerows(all)
And the output comes all wrong:
John,Bomb,Dawn,ID
3,4,5,<enumerate object at 0x7fb2a5728d70>
3,4,5,<enumerate object at 0x1764370>
3,4,5,<enumerate object at 0x17643c0>
So the ID comes in the end, when it should be in the start, and it doesn't even do the 1,2,3. Some weird error comes out.
I can suggest the code below to solve your question:
import csv

with open("testi.csv", 'rb') as input, open('temp.csv', 'wb') as output:
    reader = csv.reader(input, delimiter = ',')
    writer = csv.writer(output, delimiter = ',')
    all = []
    row = next(reader)
    row.insert(0, 'ID')
    all.append(row)
    for k, row in enumerate(reader):
        all.append([str(k+1)] + row)
    writer.writerows(all)
More compact code can be:
all = [['ID'] + next(reader)] + [[str(k+1)] + row for k, row in enumerate(reader)]
UPDATE (some explanation):
You have misunderstood the enumerate function. enumerate is meant to be used in a for loop: when you iterate over its result you get a sequence of tuples, where the first item is the ordinal number of the element and the second is the element itself.
But enumerate returns an object (docs), so when you try to convert it to a string its __repr__ magic method is called and you get <enumerate object at ...>.
In other words, enumerate helps you avoid extra counters in loops, such as your count += 1 variable.
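A tiny sketch of that idea, using rows shaped like the ones in your file:
rows = [['3', '4', '5'], ['3', '4', '5'], ['3', '4', '5']]

# Manual counter:
count = 0
for row in rows:
    count += 1
    print(count, row)

# The same loop with enumerate, starting the count at 1:
for count, row in enumerate(rows, 1):
    print(count, row)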
Also, you have a very strange piece of code here:
while count:
    all.append(row)
    row.append(enumerate(reader, 1))
    break
This part of the code can never run more than once.
You should use insert() instead of append(). This will allow you to specify the index where you want to add the element.
Try this
import csv

with open("testi.csv", 'rb') as input, open('temp.csv', 'wb') as output:
    reader = csv.reader(input, delimiter = ',')
    writer = csv.writer(output, delimiter = ',')
    all = []
    row = next(reader)
    row.insert(0, 'ID')
    all.append(row)
    count = 0
    for row in reader:
        count += 1
        row.insert(0, count)
        all.append(row)
    writer.writerows(all)
You can do something like this:
import csv

with open('testi.csv') as inp, open('temp.csv', 'w') as out:
    reader = csv.reader(inp)
    writer = csv.writer(out, delimiter=',')
    # No need to use insert() or append(); simply use `+` to concatenate two lists.
    writer.writerow(['ID'] + next(reader))
    # Iterate over the enumerate object of the reader, passing the starting index as 1.
    writer.writerows([i] + row for i, row in enumerate(reader, 1))
enumerate() returns an enumerate object that yields the index and item as a tuple, one at a time, so you need to iterate over the enumerate object instead of writing it to the CSV file.
>>> lst = ['a', 'b', 'c']
>>> e = enumerate(lst)
>>> e
<enumerate object at 0x1d48f50>
>>> for ind, item in e:
... print ind, item
...
0 a
1 b
2 c
Output:
>>> !cat temp.csv
ID,John,Bomb,Dawn
1,3,4,5
2,3,4,5
3,3,4,5
