Get duplicated lines in file - python

Working on file with thousands of line
trying to find which line is duplicated exactly ( 2 times )
from collections import Counter
with open('log.txt') as f:
string = f.readlines()
c = Counter(string)
print c
it give me the result of all duplicated lines but i need to get the repeated line (2 times only)

You're printing all the strings and not just the repeated ones, to print only the ones which are repeated twice, you can print the strings which have a count of two.
from collections import Counter
with open('log.txt') as f:
string = f.readlines()
c = Counter(string)
for line, count in c.items():
if count==2:
print(line)

The Counter Object also provides information about how often a line occurs.
You can filter it using e.g. list comprehension.
This will print all lines, that occur exactly two times in the file
with open('log.txt') as f:
string = f.readlines()
print([k for k,v in Counter(string).items() if v == 2])
If you want to have all repeated lines (lines duplicated two or more times)
with open('log.txt') as f:
string = f.readlines()
print([k for k,v in Counter(string).items() if v > 1])

You could use Counter.most_common i.e.
from collections import Counter
with open('log.txt') as f:
c = Counter(f)
print(c.most_common(1))
This prints the Counter entry with the highest count.

Related

Remove duplicates from large list but remove both if it does exist?

So I have a text file like this
123
1234
123
1234
12345
123456
You can see 123 appears twice so both instances should be removed. but 12345 appears once so it stays. My text file is about 70,000 lines.
Here is what I came up with.
file = open("test.txt",'r')
lines = file.read().splitlines() #to ignore the '\n' and turn to list structure
for appId in lines:
if(lines.count(appId) > 1): #if element count is not unique remove both elements
lines.remove(appId) #first instance removed
lines.remove(appId) #second instance removed
writeFile = open("duplicatesRemoved.txt",'a') #output the left over unique elements to file
for element in lines:
writeFile.write(element + "\n")
When I run this I feel like my logic is correct, but I know for a fact the output is suppose to be around 950, but Im still getting 23000 elements in my output so a lot is not getting removed. Any ideas where the bug could reside?
Edit: I FORGOT TO MENTION. An element can only appear twice MAX.
Use Counter from built in collections:
In [1]: from collections import Counter
In [2]: a = [123, 1234, 123, 1234, 12345, 123456]
In [3]: a = Counter(a)
In [4]: a
Out[4]: Counter({123: 2, 1234: 2, 12345: 1, 123456: 1})
In [5]: a = [k for k, v in a.items() if v == 1]
In [6]: a
Out[6]: [12345, 123456]
For your particular problem I will do it like this:
from collections import defaultdict
out = defaultdict(int)
with open('input.txt') as f:
for line in f:
out[line.strip()] += 1
with open('out.txt', 'w') as f:
for k, v in out.items():
if v == 1: #here you use logic suitable for what you want
f.write(k + '\n')
Be careful about removing elements from a list while still iterating over that list. This changes the behavior of the list iterator, and can make it skip over elements, which may be part of your problem.
Instead, I suggest creating a filtered copy of the list using a list comprehension - instead of removing elements that appear more than twice, you would keep elements that appear less than that:
file = open("test.txt",'r')
lines = file.read().splitlines()
unique_lines = [line for line in lines if lines.count(line) <= 2] # if it appears twice or less
with open("duplicatesRemoved.txt", "w") as writefile:
writefile.writelines(unique_lines)
You could also easily modify this code to look for only one occurrence (if lines.count(line) == 1) or for more than two occurrences.
You can count all of the elements and store them in a dictionary:
dic = {a:lines.count(a) for a in lines}
Then remove all duplicated one from array:
for k in dic:
if dic[k]>1:
while k in lines:
lines.remove(k)
NOTE: The while loop here is becaues line.remove(k) removes first k value from array and it must be repeated till there's no k value in the array.
If the for loop is complicated, you can use the dictionary in another way to get rid of duplicated values:
lines = [k for k, v in dic.items() if v==1]

Create dictionary from a list deleting all spaces

sorry for asking but I'm kind of new to these things. I'm doing a splitting words from the text and putting them to dict creating an index for each token:
import re
f = open('/Users/Half_Pint_Boy/Desktop/sentenses.txt', 'r')
a=0
c=0
e=[]
for line in f:
b=re.split('[^a-z]', line.lower())
a+=len(list(filter(None, b)))
c = c + 1
e = e + b
d = dict(zip(e, range(len(e))))
But in the end I receive a dict with spaces in it like that:
{'': 633,
'a': 617,
'according': 385,
'adjacent': 237,
'allow': 429,
'allows': 459}
How can I remove "" from the final result in dict? Also how can I change the indexing after that to not use "" in index counting? (with "" the index count is 633, without-248)
Big thanks!
How about this?
b = list(filter(None, re.split('[^a-z]', line.lower())))
As an alternative:
b = re.findall('[a-z]+', line.lower())
Either way, you can then also remove that filter from the next line:
a += len(b)
EDIT
As an aside, I think what you end up with here is a dictionary mapping words to the last position in which they appear in the text. I'm not sure if that's what you intended to do. E.g.
>>> dict(zip(['hello', 'world', 'hello', 'again'], range(4)))
{'world': 1, 'hello': 2, 'again': 3}
If you instead want to keep track of all the positions a word occurs, perhaps try this code instead:
from collections import defaultdict
import re
indexes = defaultdict(list)
with open('test.txt', 'r') as f:
for index, word in enumerate(re.findall(r'[a-z]+', f.read().lower())):
indexes[word].append(index)
indexes then maps each word to a list of indexes at which the word appears.
EDIT 2
Based on the comment discussion below, I think you want something more like this:
from collections import defaultdict
import re
word_positions = {}
with open('test.txt', 'r') as f:
index = 0
for word in re.findall(r'[a-z]+', f.read().lower()):
if word not in word_positions:
word_positions[word] = index
index += 1
print(word_positions)
# Output:
# {'hello': 0, 'goodbye': 2, 'world': 1}
Your regex looks not a good one. Consider to use:
line = re.sub('[^a-z]*$', '', line.strip())
b = re.split('[^a-z]+', line.lower())
Replace:
d = dict(zip(e, range(len(e))))
With:
d = {word:n for n, word in enumerate(e) if word}
Alternatively, to avoid the empty entries in the first place, replace:
b=re.split('[^a-z]', line.lower())
With:
b=re.split('[^a-z]+', re.sub('(^[^a-z]+|[^a-z]+$)', '', line.lower()))

Python: Read text file into dict and ignore comments

I am trying to put the following text file into a dictionary but I would like any section starting with '#' or empty lines ignored.
My text file looks something like this:
# This is my header info followed by an empty line
Apples 1 # I want to ignore this comment
Oranges 3 # I want to ignore this comment
#~*~*~*~*~*~*~*Another comment~*~*~*~*~*~*~*~*~*~*
Bananas 5 # I want to ignore this comment too!
My desired output would be:
myVariables = {'Apples': 1, 'Oranges': 3, 'Bananas': 5}
My Python code reads as follows:
filename = "myFile.txt"
myVariables = {}
with open(filename) as f:
for line in f:
if line.startswith('#') or not line:
next(f)
key, val = line.split()
myVariables[key] = val
print "key: " + str(key) + " and value: " + str(val)
The error I get:
Traceback (most recent call last):
File "C:/Python27/test_1.py", line 11, in <module>
key, val = line.split()
ValueError: need more than 1 value to unpack
I understand the error but I do not understand what is wrong with the code.
Thank you in advance!
Given your text:
text = """
# This is my header info followed by an empty line
Apples 1 # I want to ignore this comment
Oranges 3 # I want to ignore this comment
#~*~*~*~*~*~*~*Another comment~*~*~*~*~*~*~*~*~*~*
Bananas 5 # I want to ignore this comment too!
"""
We can do this in 2 ways. Using regex, or using Python generators. I would choose the latter (described below) as regex is not particularly fast(er) in such cases.
To open the file:
with open('file_name.xyz', 'r') as file:
# everything else below. Just substitute `for line in lines` with
# `for line in file.readline()`
Now to create a similar, we split the lines, and create a list:
lines = text.split('\n') # as if read from a file using `open`.
Here is how we do all you want in a couple of lines:
# Discard all comments and empty values.
comment_less = filter(None, (line.split('#')[0].strip() for line in lines))
# Separate items and totals.
separated = {item.split()[0]: int(item.split()[1]) for item in comment_less}
Lets test:
>>> print(separated)
{'Apples': 1, 'Oranges': 3, 'Bananas': 5}
Hope this helps.
This doesn't exactly reproduce your error, but there's a problem with your code:
>>> x = "Apples\t1\t# This is a comment"
>>> x.split()
['Apples', '1', '#', 'This', 'is', 'a', 'comment']
>>> key, val = x.split()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: too many values to unpack
Instead try:
key = line.split()[0]
val = line.split()[1]
Edit: and I think your "need more than 1 value to unpack" is coming from the blank lines. Also, I'm not familiar with using next() like this. I guess I would do something like:
if line.startswith('#') or line == "\n":
pass
else:
key = line.split()[0]
val = line.split()[1]
To strip comments, you could use str.partition() which works whether the comment sign is present or not in the line:
for line in file:
line, _, comment = line.partition('#')
if line.strip(): # non-blank line
key, value = line.split()
line.split() may raise an exception in this code too—it happens if there is a non-blank line that does not contain exactly two whitespace-separated words—it is application depended what you want to do in this case (ignore such lines, print warning, etc).
You need to ignore empty lines and lines starting with # splitting the remaining lines after either splitting on # or using rfind as below to slice the string, an empty line will have a new line so you need and line.strip() to check for one, you cannot just split on whitespace and unpack as you have more than two elements after splitting including what is in the comment:
with open("in.txt") as f:
d = dict(line[:line.rfind("#")].split() for line in f
if not line.startswith("#") and line.strip())
print(d)
Output:
{'Apples': '1', 'Oranges': '3', 'Bananas': '5'}
Another option is to split twice and slice:
with open("in.txt") as f:
d = dict(line.split(None,2)[:2] for line in f
if not line.startswith("#") and line.strip())
print(d)
Or splitting twice and unpacking using an explicit loop:
with open("in.txt") as f:
d = {}
for line in f:
if not line.startswith("#") and line.strip():
k, v, _ = line.split(None, 2)
d[k] = v
You can also use itertools.groupby to group the lines you want.
from itertools import groupby
with open("in.txt") as f:
grouped = groupby(f, lambda x: not x.startswith("#") and x.strip())
d = dict(next(v).split(None, 2)[:2] for k, v in grouped if k)
print(d)
To handle where we have multiple words in single quotes we can use shlex to split:
import shlex
with open("in.txt") as f:
d = {}
for line in f:
if not line.startswith("#") and line.strip():
data = shlex.split(line)
d[data[0]] = data[1]
print(d)
So changing the Banana line to:
Bananas 'north-side disabled' # I want to ignore this comment too!
We get:
{'Apples': '1', 'Oranges': '3', 'Bananas': 'north-side disabled'}
And the same will work for the slicing:
with open("in.txt") as f:
d = dict(shlex.split(line)[:2] for line in f
if not line.startswith("#") and line.strip())
print(d)
If the format of the file is correctly defined you can try a solution with regular expressions.
Here's just an idea:
import re
fruits = {}
with open('fruits_list.txt', mode='r') as f:
for line in f:
match = re.match("([a-zA-Z0-9]+)[\s]+([0-9]+).*", line)
if match:
fruit_name, fruit_amount = match.groups()
fruits[fruit_name] = fruit_amount
print fruits
UPDATED:
I changed the way of reading lines taking care of large files. Now I read line by line and not all in one. This improves the memory usage.

Python extract unique values from CSV

I am using the following python script to remove duplicates from a CSV file
with open('test.csv','r') as in_file, open('final.csv','w') as out_file:
seen = set() # set for fast O(1) amortized lookup
for line in in_file:
if line in seen: continue # skip duplicate
seen.add(line)
out_file.write(line)
I am trying to modify it so that instead of outputting the list without duplicates to final.csv it outputs the unique values that were found.
Kind of the opposite to what it does now. Anyone got an example?
Using a dict to keep track of how many times each line occurs, then you can process the dict and add only the unique items to the seen set, and write those to the final.csv:
from collections import defaultdict
uniques = defaultdict(int)
with open('test.csv','r') as in_file, open('final.csv','w') as out_file:
seen = set() # set for fast O(1) amortized lookup
for line in in_file:
uniques[line] +=1
for k, v in uniques.iteritems():
if v = 1:
seen.add(k)
out_file.write(k)
Or:
from collections import defaultdict
uniques = defaultdict(int)
with open('test.csv','r') as in_file, open('final.csv','w') as out_file:
seen = set() # set for fast O(1) amortized lookup
for line in in_file:
uniques[line] +=1
seen = set(k for k in uniques if uniques[k] == 1)
for itm in seen:
out_file.write(itm)
Or, using Counter:
from collections import Counter
with open('test.csv','r') as in_file, open('final.csv','w') as out_file:
seen = set() # set for fast O(1) amortized lookup
lines = Counter(file.readlines())
seen = set(k for k in lines if lines[k] == 1)
for itm in seen:
out_file.write(itm)
This will output only the lines which appear once, depending on what you mean by "uniques", this may or may not be correct. If, instead, you want to output ALL lines but only one instance per line, using the last method:
with open('test.csv','r') as in_file, open('final.csv','w') as out_file:
lines = Counter(file.readlines())
for itm in lines:
out_file.write(itm)
You could collect the dublicates to another variable and use those to remove not unique values from the set.

prints line number in both txtfile and list?

i have this code which prints the line number in infile but also the linenumber in words what do i do to only print the line number of the txt file next to the words???
d = {}
counter = 0
wrongwords = []
for line in infile:
infile = line.split()
wrongwords.extend(infile)
counter += 1
for word in infile:
if word not in d:
d[word] = [counter]
if word in d:
d[word].append(counter)
for stuff in wrongwords:
print(stuff, d[stuff])
the output is :
hello [1, 2, 7, 9] # this is printing the linenumber of the txt file
hello [1] # this is printing the linenumber of the list words
hello [1]
what i want is:
hello [1, 2, 7, 9]
Four things:
You can keep track of the line number by doing this instead of handling a
counter on your own:
for line_no, word in enumerate(infile):
As sateesh pointed out above, you probably need an else in your
conditions:
if word not in d:
d[word] = [counter]
else:
d[word].append(counter)
Also note that the above code snippet is exactly what defaultdicts are
for:
from collections import defaultdict
d = defaultdict(list)
Then in your main loop, you can get rid of the if..else part:
d[word].append(counter)
Why are you doing wrongwords.extend(infile)?
Also, I don't really understand how you are supposed to decide what "wrong words" are. I assume that you have a set named wrongwords that contains the wrong words, which makes your final code something like this:
from collections import defaultdict
d = defaultdict(list)
wrongwords = set(["hello", "foo", "bar", "baz"])
for counter, line in enumerate(infile):
infile = line.split()
for word in infile:
if word in wrongwords:
d[word].append(counter)

Categories

Resources