Python extract unique values from CSV

I am using the following Python script to remove duplicates from a CSV file:
with open('test.csv', 'r') as in_file, open('final.csv', 'w') as out_file:
    seen = set()  # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen:
            continue  # skip duplicate
        seen.add(line)
        out_file.write(line)
I am trying to modify it so that, instead of outputting the de-duplicated list to final.csv, it outputs the unique values that were found. Kind of the opposite of what it does now. Anyone got an example?

Use a dict to keep track of how many times each line occurs; then you can process the dict, add only the unique items to the seen set, and write those to final.csv:
from collections import defaultdict

uniques = defaultdict(int)
with open('test.csv', 'r') as in_file, open('final.csv', 'w') as out_file:
    seen = set()  # set for fast O(1) amortized lookup
    for line in in_file:
        uniques[line] += 1        # count occurrences of each line
    for k, v in uniques.items():  # use iteritems() on Python 2
        if v == 1:                # keep only lines that occur exactly once
            seen.add(k)
            out_file.write(k)
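(Strictly speaking, the seen set is redundant here, since k could be written directly inside the if; it is kept only to mirror the structure of your original script.)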
Or:
from collections import defaultdict

uniques = defaultdict(int)
with open('test.csv', 'r') as in_file, open('final.csv', 'w') as out_file:
    for line in in_file:
        uniques[line] += 1
    seen = set(k for k in uniques if uniques[k] == 1)
    for itm in seen:
        out_file.write(itm)
Or, using Counter:
from collections import Counter

with open('test.csv', 'r') as in_file, open('final.csv', 'w') as out_file:
    lines = Counter(in_file)  # count every line in one pass (the original Counter(file.readlines()) referenced an undefined name)
    seen = set(k for k in lines if lines[k] == 1)
    for itm in seen:
        out_file.write(itm)
This will output only the lines which appear exactly once; depending on what you mean by "uniques", this may or may not be what you want. If, instead, you want to output ALL lines, but only one instance per line, use the last method:
from collections import Counter

with open('test.csv', 'r') as in_file, open('final.csv', 'w') as out_file:
    lines = Counter(in_file)
    for itm in lines:  # iterating a Counter yields each distinct line once
        out_file.write(itm)
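Note that on Python 3.7+ a Counter, like a regular dict, remembers insertion order, so the de-duplicated lines are written in the order of their first appearance.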

You could collect the duplicates in another variable and use those to remove the non-unique values from the set.
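A minimal sketch of that idea, assuming the same test.csv and final.csv as the script above:

seen = set()
duplicates = set()
with open('test.csv') as in_file:
    for line in in_file:
        if line in seen:
            duplicates.add(line)  # remember every line that occurs more than once
        seen.add(line)

with open('final.csv', 'w') as out_file:
    out_file.writelines(seen - duplicates)  # keep only the lines that occurred once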

Related

Get duplicated lines in file

I'm working on a file with thousands of lines, trying to find which line is duplicated exactly two times.
from collections import Counter

with open('log.txt') as f:
    string = f.readlines()
c = Counter(string)
print(c)
It gives me the counts of all the duplicated lines, but I need to get only the lines repeated exactly 2 times.
You're printing all the strings, not just the repeated ones. To print only the ones which are repeated exactly twice, filter on a count of two:
from collections import Counter

with open('log.txt') as f:
    string = f.readlines()

c = Counter(string)
for line, count in c.items():
    if count == 2:
        print(line)
The Counter object also provides information about how often a line occurs. You can filter it using, e.g., a list comprehension. This will print all lines that occur exactly two times in the file:
from collections import Counter

with open('log.txt') as f:
    string = f.readlines()

print([k for k, v in Counter(string).items() if v == 2])
If you want all repeated lines (lines duplicated two or more times):
with open('log.txt') as f:
    string = f.readlines()

print([k for k, v in Counter(string).items() if v > 1])
You could use Counter.most_common, i.e.:
from collections import Counter

with open('log.txt') as f:
    c = Counter(f)

print(c.most_common(1))
This prints the Counter entry with the highest count.
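For illustration, most_common returns a list of (line, count) tuples sorted by count, so on its own it does not filter for a count of exactly two. A small example with a hypothetical Counter:

from collections import Counter

c = Counter({'foo\n': 3, 'bar\n': 2, 'baz\n': 1})  # made-up counts, not a real log.txt
print(c.most_common(1))  # [('foo\n', 3)]
print(c.most_common(2))  # [('foo\n', 3), ('bar\n', 2)]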

Remove duplicates from large list, but remove both if a duplicate exists?

So I have a text file like this
123
1234
123
1234
12345
123456
You can see 123 appears twice, so both instances should be removed, but 12345 appears once, so it stays. My text file is about 70,000 lines.
Here is what I came up with.
file = open("test.txt", 'r')
lines = file.read().splitlines()  # to ignore the '\n' and turn into a list structure
for appId in lines:
    if lines.count(appId) > 1:  # if element count is not unique, remove both elements
        lines.remove(appId)     # first instance removed
        lines.remove(appId)     # second instance removed

writeFile = open("duplicatesRemoved.txt", 'a')  # output the leftover unique elements to file
for element in lines:
    writeFile.write(element + "\n")
When I run this I feel like my logic is correct, but I know for a fact the output is supposed to be around 950 lines, yet I'm still getting about 23,000 elements in my output, so a lot is not getting removed. Any ideas where the bug could reside?
Edit: I forgot to mention that an element can only appear twice at most.
Use Counter from the built-in collections module:
In [1]: from collections import Counter
In [2]: a = [123, 1234, 123, 1234, 12345, 123456]
In [3]: a = Counter(a)
In [4]: a
Out[4]: Counter({123: 2, 1234: 2, 12345: 1, 123456: 1})
In [5]: a = [k for k, v in a.items() if v == 1]
In [6]: a
Out[6]: [12345, 123456]
For your particular problem, I would do it like this:
from collections import defaultdict

out = defaultdict(int)
with open('input.txt') as f:
    for line in f:
        out[line.strip()] += 1

with open('out.txt', 'w') as f:
    for k, v in out.items():
        if v == 1:  # here you use logic suitable for what you want
            f.write(k + '\n')
Be careful about removing elements from a list while still iterating over that list. This changes the behavior of the list iterator and can make it skip over elements, which may be part of your problem.
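For example, here is a tiny demonstration of the skipping, with made-up values:

nums = [1, 1, 2, 2, 3]
for n in nums:
    if nums.count(n) > 1:
        nums.remove(n)
print(nums)  # [1, 2, 3]: the remaining 1 was shifted into a slot the iterator had already passed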
Instead, I suggest creating a filtered copy of the list using a list comprehension - instead of removing elements that appear more than twice, you would keep elements that appear less than that:
file = open("test.txt", 'r')
lines = file.read().splitlines()
unique_lines = [line for line in lines if lines.count(line) <= 2]  # keep lines that appear twice or less

with open("duplicatesRemoved.txt", "w") as writefile:
    writefile.writelines(line + "\n" for line in unique_lines)  # re-add the newlines splitlines() removed
You could also easily modify this code to look for only one occurrence (if lines.count(line) == 1) or for more than two occurrences.
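As a side note, lines.count(line) rescans the entire list for every element, which gets slow for 70,000 lines. A sketch of the same filter using the Counter approach shown above:

from collections import Counter

counts = Counter(lines)  # a single pass over the list
unique_lines = [line for line in lines if counts[line] == 1]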
You can count all of the elements and store them in a dictionary:
dic = {a: lines.count(a) for a in lines}
Then remove all the duplicated ones from the list:
for k in dic:
    if dic[k] > 1:
        while k in lines:
            lines.remove(k)
NOTE: The while loop is needed because lines.remove(k) removes only the first occurrence of k from the list, so it must be repeated until there is no k value left in the list.
If the for loop seems complicated, you can use the dictionary in another way to get rid of the duplicated values:
lines = [k for k, v in dic.items() if v == 1]

how to read file line by line in python

I'm trying to read the text file below in Python and print each line as a key/value pair, but it's not working as expected:
test.txt
productId1 ProdName1,ProdPrice1,ProdDescription1,ProdDate1
productId2 ProdName2,ProdPrice2,ProdDescription2,ProdDate2
productId3 ProdName3,ProdPrice3,ProdDescription3,ProdDate3
productId4 ProdName4,ProdPrice4,ProdDescription4,ProdDate4
myPython.py
import sys
with open('test.txt') as f
    lines = list(line.split(' ', 1) for line in f)
    for k, v in lines.items();
        print("Key : {0}, Value: {1}".format(k, v))
I'm trying to parse the text file and trying to print key and value separately. Looks like I'm doing something wrong here. Need some help to fix this?
Thanks!
You're needlessly storing a list. Just loop, split, and print:
with open('test.txt') as f:
    for line in f:
        k, v = line.rstrip().split(' ', 1)
        print("Key : {0}, Value: {1}".format(k, v))
This should work, with a list comprehension:
with open('test.txt') as f:
    lines = [line.split(' ', 1) for line in f]

for k, v in lines:
    print("Key: {0}, Value: {1}".format(k, v))
You can make a dict right off the bat with a dict comprehension and then iterate over it to print what you wanted. What you had done was create a list, which does not have an items() method.
with open('notepad.txt') as f:
    d = {line.split(' ')[0]: line.split(' ')[1] for line in f}

for k, v in d.items():
    print("Key : {0}, Value: {1}".format(k, v))
lines is a list of lists, so the right way to finish the job is:
import sys
with open('test.txt') as f:
    lines = list(line.split(' ', 1) for line in f)
    for k, v in lines:
        print("Key : {0}, Value: {1}".format(k, v))
Perhaps I am reading too much into your description, but I see one key, a space, and a comma-delimited list of other fields. If I interpret that as there being data for each of those fields, then I would conclude you want a dictionary per line. That would lead to code like:
data_keys = 'ProdName', 'ProdPrice', 'ProdDescription', 'ProdDate'
with open('test.txt') as f:
    for line in f:
        prod_id, values = line.strip().split()  # splits automatically on whitespace
        keyed_values = zip(data_keys, values.split(','))
        print(dict([('key', prod_id)] + list(keyed_values)))  # list() is needed on Python 3, where zip is lazy
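For the first line of test.txt this prints something like {'key': 'productId1', 'ProdName': 'ProdName1', 'ProdPrice': 'ProdPrice1', 'ProdDescription': 'ProdDescription1', 'ProdDate': 'ProdDate1'}.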
You can use the f.readlines() function, which returns a list of the lines in the file f. I changed the code to use f.readlines() in line 3.
import sys
with open('test.txt') as f:
    lines = list(line.split(' ', 1) for line in f.readlines())
    for k, v in lines:  # lines is a list, so no .items() here
        print("Key : {0}, Value: {1}".format(k, v))

Python- how to convert lines in a .txt file to dictionary elements?

Say I have a file "stuff.txt" that contains the following on separate lines:
q:5
r:2
s:7
I want to read each of these lines from the file, and convert them to dictionary elements, the letters being the keys and the numbers the values.
So I would like to get
y ={"q":5, "r":2, "s":7}
I've tried the following, but it just prints an empty dictionary "{}"
y = {}
infile = open("stuff.txt", "r")
z = infile.read()
for line in z:
    key, value = line.strip().split(':')
    y[key].append(value)
print(y)
infile.close()
try this:
d = {}
with open('text.txt') as f:
    for line in f:
        key, value = line.strip().split(':')
        d[key] = int(value)
You are appending to d[key] as if it were a list. What you want is to just straight-up assign the value, as above.
Also, using with to open the file is good practice, as it auto-closes the file after the code in the with block is executed.
There are some possible improvements to be made. The first is using a context manager for file handling, that is, with open(...); in case of an exception, it will handle closing the file for you.
Second, you have a small mistake in your dictionary assignment: values are assigned using the = operator, as in dict[key] = value.
y = {}
with open("stuff.txt", "r") as infile:
    for line in infile:
        key, value = line.strip().split(':')
        y[key] = int(value)  # assign with =; int() so the values match the desired output
print(y)
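This prints {'q': 5, 'r': 2, 's': 7}, which matches the output you wanted.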
Python 3:
with open('input.txt', 'r', encoding="utf-8") as f:
    for line in f.readlines():
        s = []  # collect this line's whitespace-separated tokens into a list
        for i in line.split(" "):
            s.append(i)
        d = dict(x.strip().split(":") for x in s)  # turn the "key:value" tokens into a dictionary
        e = {a: int(x) for a, x in d.items()}      # convert the values from strings to integers
        print(e)  # prints one small dict per line, e.g. {'q': 5}

how to sort out repeated entry from a column in python

I have a file like:
q12j4
q12j4
fj45j
q12j4
fjmep
fj45j
Now all I want to do is find whether any entries are repeated; if so, print the entry once, and print the entries that are not repeated normally.
The output should be like:
q12j4
fj45j
fjmep
[repetition is omitted]
I was trying to do it with the defaultdict function, but I think it will not work for strings.
Please help.
This should be roughly enough:
with open('file.txt', 'r') as f:
    for line in set(f):
        print(line)
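Note that a set has no defined order, so the lines may be printed in a different order than they appear in the file.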
def unique(seq):
    seen = set()
    for val in seq:
        if val not in seen:
            seen.add(val)
            yield val

with open('file.txt') as f:
    print(''.join(unique(f)))
As you can see, I've chosen to write a separate generator for removing duplicates from an iterable. This generator, unique(), can be used in lots of other contexts too.
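For example, the same generator works on any iterable, not just file objects:

print(list(unique([3, 1, 3, 2, 1])))  # [3, 1, 2]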
seen = set()
with open('file.txt', 'r') as f:
    for line in f:
        if line not in seen:
            print(line)
            seen.add(line)
You should use the itertools.groupby function; for an example of usage, look at the standard library docs or this related question: How do I use Python's itertools.groupby()?
Assume that toorder is your list with repeated entries:
import itertools

toorder = ["a", "a", "b", "a", "b", "c"]
for key, group in itertools.groupby(sorted(toorder)):
    print(key)
Should output:
a
b
c
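Note that groupby only groups consecutive equal items, which is why the input is sorted first. If you also want to know how many times each entry occurred, a small variation of the same loop would be:

for key, group in itertools.groupby(sorted(toorder)):
    print(key, len(list(group)))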
