Removing duplicates from text file using python


I have this text file and let's say it contains 10 lines.
Bye
Hi
2
3
4
5
Hi
Bye
7
Hi
I want every repeat of "Hi" and "Bye" removed, keeping only the first occurrence of each.
My current code is (yes filename is actually pointing towards a file, I just didn't place it in this one)
text_file = open(filename)
for i, line in enumerate(text_file):
    if i == 0:
        var_Line1 = line
    if i == 1:
        var_Line2 = line
    if i > 1:
        if line == var_Line2:
            del line
text_file.close()
It does detect the duplicates, but it takes a very long time given how many lines there are, and I'm not sure how to delete them and save the result.

You could use dict.fromkeys to remove duplicates and preserve order efficiently (dicts keep insertion order in Python 3.7+):
with open(filename, "r") as f:
    lines = dict.fromkeys(f.readlines())
with open(filename, "w") as f:
    f.writelines(lines)
Idea from Raymond Hettinger
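One thing to watch for with this approach: if the last line of the file has no trailing newline, "Hi" and "Hi\n" count as different keys. Here is a minimal sketch that normalises the line endings first. The function name dedupe_lines and the sample data are my own, not from the question.

```python
def dedupe_lines(lines):
    """Return lines with later duplicates dropped, first occurrences kept in order."""
    # Strip the newline so a final "Hi" (no newline) matches an earlier "Hi\n".
    stripped = (line.rstrip("\n") for line in lines)
    # dict.fromkeys keeps the first occurrence of each key, in insertion order.
    return [line + "\n" for line in dict.fromkeys(stripped)]


# Hypothetical sample: the last line has no trailing newline.
sample = ["Bye\n", "Hi\n", "2\n", "3\n", "Hi\n", "Bye\n", "7\n", "Hi"]
result = dedupe_lines(sample)
print(result)
```

To use it on a file, read the lines with readlines() and pass the result to writelines(), as in the snippet above.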

Using a set & some basic filtering logic:
with open('test.txt') as f:
    seen = set()  # keep track of the lines already seen
    deduped = []
    for line in f:
        line = line.rstrip()
        if line not in seen:  # if not seen already, add the line to the result
            deduped.append(line)
            seen.add(line)
# re-write the file with the de-duplicated lines
with open('test.txt', 'w') as f:
    f.writelines([l + '\n' for l in deduped])

Related

How to remove lines that start with the same letters (sequence) in a txt file?

#!/usr/bin/env python
FILE_NAME = "testprecomb.txt"
NR_MATCHING_CHARS = 5
lines = set()
with open(FILE_NAME, "r") as inF:
    for line in inF:
        line = line.strip()
        if line == "": continue
        beginOfSequence = line[:NR_MATCHING_CHARS]
        if not (beginOfSequence in lines):
            print(line)
            lines.add(beginOfSequence)
This is the code I have right now, but it is not working. I have a file containing lines of DNA that sometimes start with the same sequence (or pattern of letters). I need to write code that finds all lines of DNA that start with the same letters (perhaps the same 10 characters) and deletes one of them.
Example (issue):
CCTGGATGGCTTATATAAGAT***GTTAT***
***GTTAT***ATAATATACCACCGGGCTGCTT
***GTTAT***ATAGTTACAGCGGAGTCTTGTGACTGGCTCGAGTCAAAAT
What I need as result after one is taken out of file:
CCTGGATGGCTTATATAAGAT***GTTAT***
***GTTAT***ATAATATACCACCGGGCTGCTT
(no third line)

I think your set logic is correct. You are just missing the portion that will save the lines you want to write back into the file. I am guessing you tried this with a separate list that you forgot to add here since you are using append somewhere.
FILE_NAME = "sample_file.txt"
NR_MATCHING_CHARS = 5
lines = set()
output_lines = []  # keep track of lines you want to keep
with open(FILE_NAME, "r") as inF:
    for line in inF:
        line = line.strip()
        if line == "": continue
        beginOfSequence = line[:NR_MATCHING_CHARS]
        if not (beginOfSequence in lines):
            output_lines.append(line + '\n')  # add line to list, newline needed since we will write to file
            lines.add(beginOfSequence)
print(output_lines)
with open(FILE_NAME, 'w') as f:
    f.writelines(output_lines)  # write it out to the file

Your approach has a few problems. First, I would avoid naming file variables inF as this can be confused with inf. Descriptive names are better: testFile for instance. Also, rather than comparing against "" to test for empty strings, the idiomatic check is truthiness: use the not keyword instead. As for your actual problem, you're not actually doing anything based on that set membership:
FILE_NAME = "testprecomb.txt"
NR_MATCHING_CHARS = 5
prefixCache = set()
data = []
with open(FILE_NAME, "r") as testFile:
    for line in testFile:
        line = line.strip()
        if not line:
            continue
        beginOfSequence = line[:NR_MATCHING_CHARS]
        if (beginOfSequence in prefixCache):
            continue
        else:
            print(line)
            data.append(line)
            prefixCache.add(beginOfSequence)
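The snippet above collects the kept lines in data but never writes them anywhere. As a rough sketch of the same prefix filter as a reusable function (the name drop_repeated_prefixes, the prefix_len parameter and the sample sequences are my own, not from the question):

```python
def drop_repeated_prefixes(lines, prefix_len=5):
    """Keep only the first line seen for each prefix of prefix_len characters."""
    seen = set()
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        prefix = line[:prefix_len]
        if prefix in seen:
            continue  # an earlier line already started with this prefix
        seen.add(prefix)
        kept.append(line)
    return kept


# Hypothetical sample modelled on the question: lines 2 and 4 share "GTTAT".
sample = [
    "CCTGGATGGCTTATATAAGATGTTAT",
    "GTTATATAATATACCACCGGGCTGCTT",
    "",
    "GTTATATAGTTACAGCGGAGTCTTGTGACTGGCTCGAGTCAAAAT",
]
kept = drop_repeated_prefixes(sample)
print(kept)
```

To save the result, write each kept line plus '\n' to the output file, for example with writelines().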

Changing the contents of a text file and making a new file with same format

I have a big text file with a lot of parts. Every part has 4 lines and next part starts immediately after the last part.
The first line of each part starts with #, the 2nd line is a sequence of characters, the 3rd line is a + and the 4th line is again a sequence of characters.
Small example:
#M00872:462:000000000-D47VR:1:1101:15294:1338 1:N:0:ACATCG
TGCTCGGTGTATGTAAACTTCCGACTTCAACTGTATAGGGATCCAATTTTGACAAAATATTAACGCTTATCGATAAAATTTTGAATTTTGTAACTTGTTTTTGTAATTCTTTAGTTTGTATGTCTGTTGCTATTATGTCTACTATTCTTTCCCCTGCACTGTACCCCCCAATCCCCCCTTTTCTTTTAAAAGTTAACCGATACCGTCGAGATCCGTTCACTAATCGAACGGATCTGTCTCTGTCTCTCTC
+
BAABBADBBBFFGGGGGGGGGGGGGGGHHGHHGH55FB3A3GGH3ADG5FAAFEGHHFFEFHD5AEG1EF511F1?GFH3#BFADGD55F?#GFHFGGFCGG/GHGHHHHHHHDBG4E?FB?BGHHHHHHHHHHHHHHHHHFHHHHHHHHHGHGHGHHHHHFHHHHHGGGGHHHHGGGGHHHHHHHGHGHHHHHHFGHCFGGGHGGGGGGGGFGGEGBFGGGGGGGGGFGGGGFFB9/BFFFFFFFFFF/
I want to change the 2nd and the 4th line of each part and make a new file with similar structure (4 lines for each part). In fact I want to keep the 1st 65 characters (in lines 2 and 4) and remove the rest of characters. The expected output for the small example would look like this:
#M00872:462:000000000-D47VR:1:1101:15294:1338 1:N:0:ACATCG
TGCTCGGTGTATGTAAACTTCCGACTTCAACTGTATAGGGATCCAATTTTGACAAAATATTAACG
+
BAABBADBBBFFGGGGGGGGGGGGGGGHHGHHGH55FB3A3GGH3ADG5FAAFEGHHFFEFHD5A
I wrote the following code:
infile = open("file.fastq", "r")
new_line=[]
for line_number in len(infile.readlines()):
    if line_number ==2 or line_number ==4:
        new_line.append(infile[line_number])
with open('out_file.fastq', 'w') as f:
    for item in new_line:
        f.write("%s\n" % item)
but it does not return what I want. How to fix it to get the expected output?

This code will achieve what you want:
from itertools import islice
with open('bio.txt', 'r') as infile:
    while True:
        lines_gen = list(islice(infile, 4))
        if not lines_gen:
            break
        a,b,c,d = lines_gen
        # strip the newline before slicing so short lines don't end up with two
        b = b.rstrip('\n')[0:65]+'\n'
        d = d.rstrip('\n')[0:65]+'\n'
        with open('mod_bio.txt', 'a+') as f:
            f.write(a+b+c+d)
How does it work?
islice pulls 4 lines at a time from the file, as you describe.
Then we unpack them into the individual lines a, b, c, d and slice the 2nd and 4th. Finally we join the four strings and append them to a new file. Because the output file is opened in append mode, delete mod_bio.txt before rerunning the script.
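If it helps, here is the same four-lines-at-a-time idea as a small function that works on any iterable of lines, so you can try it without touching a file. The name truncate_records, the keep parameter and the sample reads are assumptions of mine. Note that this sketch silently drops an incomplete trailing record.

```python
from itertools import islice


def truncate_records(lines, keep=65):
    """Trim the 2nd and 4th line of every 4-line record to `keep` characters."""
    it = iter(lines)
    out = []
    while True:
        record = list(islice(it, 4))  # next four lines, fewer at the end of input
        if len(record) < 4:
            break  # end of input, or an incomplete trailing record
        header, seq, plus, qual = (s.rstrip("\n") for s in record)
        out.extend([header, seq[:keep], plus, qual[:keep]])
    return out


# Hypothetical miniature records; keep=4 just makes the effect easy to see.
sample = [
    "#read1\n", "ACGTACGT\n", "+\n", "BBBBFFFF\n",
    "#read2\n", "TTGGCCAA\n", "+\n", "GGGGHHHH\n",
]
trimmed = truncate_records(sample, keep=4)
print(trimmed)
```

For a real file you would pass the open file object as lines and write each returned line followed by '\n'.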

I think some itertools.cycle could be nice here:
import itertools
with open("transformed.file.fastq", "w+") as output_file:
    with open("file.fastq", "r") as input_file:
        for i in itertools.cycle((1,2,3,4)):
            line = input_file.readline().strip()
            if not line:
                break
            if i in (2,4):
                line = line[:65]
            output_file.write("{}\n".format(line))

readlines() returns a list of the lines in your file. You don't need to build a separate new_line list. Iterate directly over the index-value pairs of that list and modify the values at the positions you want.
Modifying your code, try this:
with open("file.fastq", "r") as infile:
    new_lines = infile.readlines()
for i, t in enumerate(new_lines):
    # the 2nd and 4th line of every 4-line part
    if i % 4 == 1 or i % 4 == 3:
        new_lines[i] = t.rstrip('\n')[:65] + '\n'
with open('out_file.fastq', 'w') as f:
    for item in new_lines:
        f.write("%s" % item)

How do I remove an entire line with a specific word, from a text file?

I made a text file called contacts.txt which contains:
pot 2000
derek 45
snow 55
I want to get user input (a name) on which contact to remove, and delete the entire line containing that name. So far, this is what I've done:
# ... previous code
if int(number) == 5:
    print "\n"
    newdict = {}
    with open('contacts.txt','r') as f:
        for line in f:
            if line != "\n":
                splitline = line.split( )
                newdict[(splitline[0])] = ",".join(splitline[1:])
    print newdict
    removethis = raw_input("Contact to be removed: ")
    if removethis in newdict:
        with open('contacts.txt','r') as f:
            new = f.read()
        new = new.replace(removethis, '')
        with open('contacts.txt','w') as f:
            f.write(new)
When I enter "pot", only "pot" is removed from the text file; the "2000" stays there. I tried new = new.replace(removethis + '\n', '') as other forums suggested, but it didn't work.
Notes:
Everything I've read on other forums requires me to make a new file, but I don't want that; I only want one file.
I already tried 'r+' the first time I opened the file and inserted a for loop which only picks the lines that do not contain the input word, but it doesn't work either.

I saw you said this is not a duplicate, but isn't this discussion equivalent to your question?
Deleting a specific line in a file (python)
Based on the discussion in the link, I created a .txt file from your input (with the usernames you supplied) and ran the following code:
filename = 'smp.txt'
f = open(filename, "r")
lines = f.readlines()
f.close()
f = open(filename, "w")
for line in lines:
    if line!="\n":
        f.write(line)
f.close()
This removes the blank lines between the entries.
It seems to me that this is what you want.

How about this:
Read in all the lines from the file into a list
Write out each line
Skip the line that you want removed
Something like this:
filename = 'contacts.txt'
with open(filename, 'r') as fin:
    lines = fin.readlines()
with open(filename, 'w') as fout:
    for line in lines:
        if removethis not in line:
            fout.write(line)
If you want to be more precise about the line you remove, you could use if not line.startswith(removethis+' '), or you could put together a regular expression of some kind.
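As a rough illustration of the more precise variant, the sketch below compares the name against the first whitespace-separated field only, so removing "pot" leaves "potter" alone. The function name remove_contact and the extra "potter" entry are hypothetical, added just to show the difference.

```python
def remove_contact(lines, name):
    """Drop every line whose first whitespace-separated field equals name."""
    kept = []
    for line in lines:
        fields = line.split()
        if fields and fields[0] == name:
            continue  # this is the contact to remove
        kept.append(line)
    return kept


# "potter 12" is a made-up entry: a plain substring test would remove it too.
contacts = ["pot 2000\n", "derek 45\n", "snow 55\n", "potter 12\n"]
remaining = remove_contact(contacts, "pot")
print(remaining)
```

The returned list can be written back to the same file with writelines(), so no second file is needed.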

Locate a specific line in a file based on user input then delete a specific number of lines

I'm trying to delete specific lines in a text file. The user is prompted for a string (a phrase that should exist in the file), the file is searched, and if the string is found, both the data on that line and the line number are stored.
After the phrase has been found, it and the five following lines are printed out. Now I have to figure out how to delete those six lines without changing any other text in the file, which is where I'm stuck.
Any ideas as to how I can delete those six lines?
This was my latest attempt to delete the lines
file = open('C:\\test\\example.txt', 'a')
locate = "example string"
for i, line in enumerate(file):
    if locate in line:
        line[i] = line.strip()
        i = i+1
        line[i] = line.strip()
        i = i+1
        line[i] = line.strip()
        i = i+1
        line[i] = line.strip()
        i = i + 1
        line[i] = line.strip()
        i = i+1
        line[i] = line.strip()
        break

Usually I would not think it's desirable to overwrite the source file - what if the user does something by mistake? If your project allows, I would write the changes out to a new file.
with open('source.txt', 'r') as ifile:
    with open('output.txt', 'w') as ofile:
        locate = "example string"
        skip_next = 0
        for line in ifile:
            if locate in line:
                skip_next = 5  # the matching line is dropped here; 5 more follow
                print(line.rstrip('\n'))
            elif skip_next > 0:
                print(line.rstrip('\n'))
                skip_next -= 1
            else:
                ofile.write(line)
This is also robust to finding the phrase multiple times - it will just start counting lines to remove again.

You can find the occurrences, copy the list items between the occurrences to a new list and then save the new list into the file.
_newData = []
_linesToSkip = 3
with open('data.txt', 'r') as _file:
    data = _file.read().splitlines()
occurrences = [i for i, x in enumerate(data) if "example string" in x]
_lastOcurrence = 0
for ocurrence in occurrences:
    _newData.extend(data[_lastOcurrence : ocurrence])
    _lastOcurrence = ocurrence + _linesToSkip
_newData.extend(data[_lastOcurrence:])
# Save new data into the file
with open('data.txt', 'w') as _file:
    _file.writelines(line + '\n' for line in _newData)

There are a couple of points that you clearly misunderstand here:
.strip() removes whitespace or given characters:
>>> print(str.strip.__doc__)
S.strip([chars]) -> str
Return a copy of the string S with leading and trailing
whitespace removed.
If chars is given and not None, remove characters in chars instead.
incrementing i doesn't actually do anything:
>>> for i, _ in enumerate('ignore me'):
...     print(i)
...     i += 10
...
0
1
2
3
4
5
6
7
8
You're assigning to the ith element of the line, which should raise an exception (that you neglected to tell us about)
>>> line = 'some text'
>>> line[i] = line.strip()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
Ultimately...
You have to write to a file if you want to change its contents. Writing to a file that you're reading from is tricky business. Writing to an alternative file, or just storing the file in memory if it's small enough is a much healthier approach.
search_string = 'example'
lines = []
with open('/tmp/fnord.txt', 'r+') as f:  # `r+` so we can read *and* write to the file
    for line in f:
        line = line.strip()
        if search_string in line:
            print(line)
            for _ in range(5):
                print(next(f).strip())
        else:
            lines.append(line)
    f.seek(0)  # back to the beginning!
    f.truncate()  # goodbye, original lines
    for line in lines:
        print(line, file=f)  # python2 requires `from __future__ import print_function`
There is a fatal flaw in this approach, though: if fewer than five lines follow the matching line, next(f) raises StopIteration. I'll leave fixing that as an exercise for the reader.
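One possible way around that flaw is to count down with a plain counter instead of calling next(f), so running out of lines simply ends the loop. This is only a sketch on an in-memory list; the name drop_match_and_following, its parameters and the sample data are my own.

```python
def drop_match_and_following(lines, phrase, following=5):
    """Drop each line containing phrase plus the `following` lines after it."""
    kept = []
    to_skip = 0
    for line in lines:
        if phrase in line:
            to_skip = following  # drop this line and restart the countdown
        elif to_skip > 0:
            to_skip -= 1  # one of the lines following a match
        else:
            kept.append(line)
    return kept


# Hypothetical data: the second match sits one line from the end.
sample = ["keep 1", "example string", "a", "b", "c", "d", "e",
          "keep 2", "example string", "x"]
result = drop_match_and_following(sample, "example string")
print(result)
```

A match close to the end of the input just drops whatever lines remain, and a second match inside a skipped region restarts the count.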

You are opening your file with 'a', which is append mode, so it cannot be read. Also, you are not closing your file (bad habit). str.strip() does not delete the line; it removes whitespace by default. Also, this would usually be done in a loop.
Here is something to get you started:
locate = "example string"
n=0
with open('example.txt', 'r+') as f:
    for i,line in enumerate(f):
        if locate in line:
            n = 6
        if n:
            print( line, end='' )
            n-=1
print( "done" )
Edit:
Read-modify-write solution:
locate = "example string"
filename='example.txt'
removelines=5
with open(filename) as f:
    lines = f.readlines()
with open(filename, 'w') as f:
    n=0
    for line in lines:
        if locate in line:
            n = removelines+1
        if n:
            n-=1
        else:
            f.write(line)

Sorting/Deleting File Lines - Python

I want to delete every line in a file that is shorter than 6 characters. I tried running this code, but it ended up deleting the whole text file. How would I go about this?
Code:
import linecache
i = 1
while i < 5:
    line = linecache.getline('file.txt', i)
    if len(line) < 6:
        str.replace(line, line, '')
    i += 1
Thanks in advance!

You'll want to use the open function instead of linecache:
def deleteShortLines():
    text = 'file.txt'
    f = open(text)
    output = []
    for line in f:
        if len(line) >= 6:
            output.append(line)
    f.close()
    f = open(text, 'w')
    f.writelines(output)
    f.close()

Done with iterators instead of lists to support very long files:
with open('file.txt', 'r') as input_file:
    # iterating over a file object yields its lines one at a time
    # keep only lines with at least 6 characters
    filtered_lines = (line for line in input_file if len(line) >= 6)
    # write the kept lines to a new file
    with open('output_file.txt', 'w') as output_file:
        output_file.writelines(filtered_lines)
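Note that len(line) counts the trailing newline, so a 5-character line followed by '\n' passes the check above. If the line ending should not count, something along these lines might be closer to what was asked. The generator name keep_long_lines and the sample are assumptions on my part.

```python
def keep_long_lines(lines, min_chars=6):
    """Yield lines whose text, not counting the line ending, has at least min_chars characters."""
    for line in lines:
        if len(line.rstrip("\r\n")) >= min_chars:
            yield line


# Hypothetical sample: "abcde\n" is 6 characters long only if the newline counts.
sample = ["abc\n", "abcde\n", "abcdef\n", "\n", "long enough"]
kept = list(keep_long_lines(sample))
print(kept)
```

Because it is a generator, it can be passed straight to writelines() while the input file is still open, just like the generator expression above.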
