Remove both duplicates (original and duplicate) from a text file using Python

I am trying to remove both occurrences of duplicated lines, for example:
STANGHOLMEN_TA02_GT11
STANGHOLMEN_TA02_GT41
STANGHOLMEN_TA02_GT81
STANGHOLMEN_TA02_GT11
STANGHOLMEN_TA02_GT81
Expected result:
STANGHOLMEN_TA02_GT41
I tried this script:
lines_seen = set()
with open("example.txt", "w") as output_file:
    for each_line in open("example2.txt", "r"):
        if each_line not in lines_seen:
            output_file.write(each_line)
            lines_seen.add(each_line)
Unfortunately, it doesn't work the way I want: it misses lines and doesn't remove the duplicates. The original file also has blank lines scattered here and there between the entries.

You need two passes for this to work correctly, because with a single pass you can't know whether the current line will be repeated later. Try something like this:
# count each line's occurrences
lines_count = {}
for each_line in open('example2.txt', "r"):
    lines_count[each_line] = lines_count.get(each_line, 0) + 1

# write only the lines that are not repeated
with open('example.txt', "w") as output_file:
    for each_line, count in lines_count.items():
        if count == 1:
            output_file.write(each_line)
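Since the original file also contains stray blank lines, a variant of the same two-pass idea using collections.Counter can normalize the lines before counting. This is only a sketch, assuming the example2.txt/example.txt file names from the question and that blank lines should simply be dropped:

from collections import Counter

# first pass: count each stripped, non-blank line
with open('example2.txt', 'r') as input_file:
    counts = Counter(line.strip() for line in input_file if line.strip())

# second pass over the counts: keep only lines that appeared exactly once
with open('example.txt', 'w') as output_file:
    for line, count in counts.items():
        if count == 1:
            output_file.write(line + '\n')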

Related

Removing duplicates from a text file using Python

I have this text file and let's say it contains 10 lines.
Bye
Hi
2
3
4
5
Hi
Bye
7
Hi
Every time it says "Hi" or "Bye" I want it to be removed, except for the first time it was said.
My current code is (yes, filename actually points to a file, I just didn't include it here):
text_file = open(filename)
for i, line in enumerate(text_file):
    if i == 0:
        var_Line1 = line
    if i == 1:
        var_Line2 = line
    if i > 1:
        if line == var_Line2:
            del line
text_file.close()
It does detect the duplicates, but it takes a very long time considering the number of lines there are, and I'm not sure how to delete them and save the result as well.
You could use dict.fromkeys to remove duplicates and preserve order efficiently:
with open(filename, "r") as f:
    lines = dict.fromkeys(f.readlines())
with open(filename, "w") as f:
    f.writelines(lines)
Idea from Raymond Hettinger
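A minimal sketch of why this works: dict.fromkeys keeps only the first occurrence of each key and, in Python 3.7+, dicts preserve insertion order, so iterating over the dict yields the de-duplicated lines in their original order.

lines = ["Bye\n", "Hi\n", "2\n", "Hi\n", "Bye\n"]

# dict keys are unique and keep insertion order (Python 3.7+)
deduped = dict.fromkeys(lines)
print(list(deduped))  # ['Bye\n', 'Hi\n', '2\n']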
Using a set & some basic filtering logic:
with open('test.txt') as f:
    seen = set()  # keep track of the lines already seen
    deduped = []
    for line in f:
        line = line.rstrip()
        if line not in seen:  # if not seen already, keep the line for the result
            deduped.append(line)
            seen.add(line)

# re-write the file with the de-duplicated lines
with open('test.txt', 'w') as f:
    f.writelines([l + '\n' for l in deduped])

How to remove lines that start with the same letters (sequence) in a txt file?

#!/usr/bin/env python
FILE_NAME = "testprecomb.txt"
NR_MATCHING_CHARS = 5
lines = set()
with open(FILE_NAME, "r") as inF:
    for line in inF:
        line = line.strip()
        if line == "": continue
        beginOfSequence = line[:NR_MATCHING_CHARS]
        if not (beginOfSequence in lines):
            print(line)
            lines.add(beginOfSequence)
This is the code I have right now, but it is not working. I have a file with lines of DNA that sometimes start with the same sequence (or pattern of letters). I need to write code that will find all lines of DNA that start with the same letters (perhaps the same 10 characters) and delete one of the lines.
Example (issue):
CCTGGATGGCTTATATAAGAT***GTTAT***
***GTTAT***ATAATATACCACCGGGCTGCTT
***GTTAT***ATAGTTACAGCGGAGTCTTGTGACTGGCTCGAGTCAAAAT
What I need as result after one is taken out of file:
CCTGGATGGCTTATATAAGAT***GTTAT***
***GTTAT***ATAATATACCACCGGGCTGCTT
(no third line)
I think your set logic is correct. You are just missing the portion that will save the lines you want to write back into the file. I am guessing you tried this with a separate list that you forgot to add here since you are using append somewhere.
FILE_NAME = "sample_file.txt"
NR_MATCHING_CHARS = 5
lines = set()
output_lines = [] # keep track of lines you want to keep
with open(FILE_NAME, "r") as inF:
for line in inF:
line = line.strip()
if line == "": continue
beginOfSequence = line[:NR_MATCHING_CHARS]
if not (beginOfSequence in lines):
output_lines.append(line + '\n') # add line to list, newline needed since we will write to file
lines.add(beginOfSequence)
print output_lines
with open(FILE_NAME, 'w') as f:
f.writelines(output_lines) # write it out to the file
Your approach has a few problems. First, I would avoid naming file variables inF, as this can be confused with inf. Descriptive names are better: testFile, for instance. Also, testing for empty strings using equality misses a few edge cases; use the not keyword instead. As for your actual problem: you only print the surviving lines, you never collect them or write them back to the file:
FILE_NAME = "testprecomb.txt"
NR_MATCHING_CHARS = 5
prefixCache = set()
data = []
with open(FILE_NAME, "r") as testFile:
for line in testFile:
line = line.strip()
if not line:
continue
beginOfSequence = line[:NR_MATCHING_CHARS]
if (beginOfSequence in prefixCache):
continue
else:
print(line)
data.append(line)
prefixCache.add(beginOfSequence)
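The snippet above only collects the surviving lines in data; if you also want to write the result back out, a short follow-up along these lines should work (a sketch, assuming the same FILE_NAME and data from above and that overwriting the original file is acceptable):

# write the de-duplicated lines back to the file, one per line
with open(FILE_NAME, "w") as outFile:
    outFile.write("\n".join(data) + "\n")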

Changing the contents of a text file and making a new file with the same format

I have a big text file with a lot of parts. Every part has 4 lines, and the next part starts immediately after the previous one.
The first line of each part starts with #, the 2nd line is a sequence of characters, the 3rd line is a + and the 4th line is again a sequence of characters.
Small example:
#M00872:462:000000000-D47VR:1:1101:15294:1338 1:N:0:ACATCG
TGCTCGGTGTATGTAAACTTCCGACTTCAACTGTATAGGGATCCAATTTTGACAAAATATTAACGCTTATCGATAAAATTTTGAATTTTGTAACTTGTTTTTGTAATTCTTTAGTTTGTATGTCTGTTGCTATTATGTCTACTATTCTTTCCCCTGCACTGTACCCCCCAATCCCCCCTTTTCTTTTAAAAGTTAACCGATACCGTCGAGATCCGTTCACTAATCGAACGGATCTGTCTCTGTCTCTCTC
+
BAABBADBBBFFGGGGGGGGGGGGGGGHHGHHGH55FB3A3GGH3ADG5FAAFEGHHFFEFHD5AEG1EF511F1?GFH3#BFADGD55F?#GFHFGGFCGG/GHGHHHHHHHDBG4E?FB?BGHHHHHHHHHHHHHHHHHFHHHHHHHHHGHGHGHHHHHFHHHHHGGGGHHHHGGGGHHHHHHHGHGHHHHHHFGHCFGGGHGGGGGGGGFGGEGBFGGGGGGGGGFGGGGFFB9/BFFFFFFFFFF/
I want to change the 2nd and the 4th line of each part and make a new file with a similar structure (4 lines for each part). In fact, I want to keep the first 65 characters (in lines 2 and 4) and remove the rest of the characters. The expected output for the small example would look like this:
#M00872:462:000000000-D47VR:1:1101:15294:1338 1:N:0:ACATCG
TGCTCGGTGTATGTAAACTTCCGACTTCAACTGTATAGGGATCCAATTTTGACAAAATATTAACG
+
BAABBADBBBFFGGGGGGGGGGGGGGGHHGHHGH55FB3A3GGH3ADG5FAAFEGHHFFEFHD5A
I wrote the following code:
infile = open("file.fastq", "r")
new_line=[]
for line_number in len(infile.readlines()):
if line_number ==2 or line_number ==4:
new_line.append(infile[line_number])
with open('out_file.fastq', 'w') as f:
for item in new_line:
f.write("%s\n" % item)
but it does not return what I want. How can I fix it to get the expected output?
This code will achieve what you want:
from itertools import islice

with open('bio.txt', 'r') as infile:
    while True:
        lines_gen = list(islice(infile, 4))
        if not lines_gen:
            break
        a, b, c, d = lines_gen
        b = b[0:65] + '\n'
        d = d[0:65] + '\n'
        with open('mod_bio.txt', 'a+') as f:
            f.write(a + b + c + d)
How does it work?
islice gives us 4 lines at a time, as you mention.
Then we unpack them into individual lines a, b, c, d and perform string slicing. Finally we join the strings and write them to the new file.
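One small design note: reopening mod_bio.txt in append mode on every iteration works, but it is cleaner (and avoids leftover data from a previous run) to open the output file once, outside the loop. A possible variant of the same islice idea, using the same file names as above:

from itertools import islice

with open('bio.txt', 'r') as infile, open('mod_bio.txt', 'w') as outfile:
    while True:
        record = list(islice(infile, 4))  # one 4-line record at a time
        if not record:
            break
        header, seq, plus, qual = record
        outfile.write(header)
        # truncate the sequence and quality lines to 65 characters
        outfile.write(seq[:65].rstrip('\n') + '\n')
        outfile.write(plus)
        outfile.write(qual[:65].rstrip('\n') + '\n')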
I think some itertools.cycle could be nice here:
import itertools

with open("transformed.file.fastq", "w+") as output_file:
    with open("file.fastq", "r") as input_file:
        for i in itertools.cycle((1, 2, 3, 4)):
            line = input_file.readline().strip()
            if not line:
                break
            if i in (2, 4):
                line = line[:65]
            output_file.write("{}\n".format(line))
readlines() returns a list of all the lines in your file, so you don't need to build a separate new_line list. Iterate directly over the index/value pairs of that list and modify the values at the positions you want.
Modifying your code, try this:
infile = open("file.fastq", "r")
new_lines = infile.readlines()
for i, t in enumerate(new_lines):
if i == 1 or i == 3:
new_lines[i] = new_lines[i][:65]
with open('out_file.fastq', 'w') as f:
for item in new_lines:
f.write("%s" % item)

How to count empty lines in a file in Python

I would like to print the total number of empty lines using Python. I have been trying:
f = open('file.txt', 'r')
for line in f:
    if (line.split()) == 0:
but I am not able to get the proper output. I have also been trying to print it like this; it prints the value as 0 and I'm not sure what's wrong with the code:
print "\nblank lines are",(sum(line.isspace() for line in fname))
It prints:
blank lines are 0
There are 7 lines in the file.
There are 46 characters in the file.
There are 8 words in the file.
Since the empty string is a falsy value, you may use .strip():
for line in f:
    if not line.strip():
        ...
The above ignores lines that contain only whitespace.
If you want completely empty lines, you may want to use this instead:
if line in ['\r\n', '\n']:
    ...
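Putting the two checks together, a minimal complete sketch (assuming the file.txt from the question) might look like this:

blank = 0        # lines that are empty or whitespace-only
truly_empty = 0  # lines that contain nothing but the line terminator

with open('file.txt') as f:
    for line in f:
        if not line.strip():
            blank += 1
        if line in ('\r\n', '\n'):
            truly_empty += 1

print("blank lines:", blank)
print("completely empty lines:", truly_empty)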
Please use a context manager (with statement) to open files:
with open('file.txt') as f:
    print(sum(line.isspace() for line in f))
line.isspace() returns True (== 1) if line doesn't have any non-whitespace characters, and False (== 0) otherwise. Therefore, sum(line.isspace() for line in f) returns the number of lines that are considered empty.
line.split() always returns a list. Both
if line.split() == []:
and
if not line.split():
would work.
FILE_NAME = 'file.txt'
empty_line_count = 0
with open(FILE_NAME, 'r') as fh:
    for line in fh:
        # split() breaks the line into a list of words; if the line is
        # empty (or only whitespace), split() returns an empty list, and
        # '== []' checks whether the list is empty
        if line.split() == []:
            empty_line_count += 1

print('Empty Line Count :', empty_line_count)

Count number of lines in a txt file with Python excluding blank lines

I wish to count the number of lines in a .txt file which looks something like this:
apple
orange
pear

hippo
donkey
where blank lines are used to separate blocks. The result I'm looking for, based on the above sample, is five (lines).
How can I achieve this?
As a bonus, it would be nice to know how many blocks/paragraphs there are. So, based on the above example, that would be two blocks.
non_blank_count = 0

with open('data.txt') as infp:
    for line in infp:
        if line.strip():
            non_blank_count += 1

print('number of non-blank lines found %d' % non_blank_count)
UPDATE: Re-read the question, OP wants to count non-blank lines .. (sigh .. thanks #RanRag).
(I need a break from the computer ...)
A short way to count the number of non-blank lines could be:
with open('data.txt', 'r') as f:
    lines = f.readlines()
    num_lines = len([l for l in lines if l.strip(' \n') != ''])
I am surprised to see that there isn't a clean pythonic answer yet (as of Jan 1, 2019). Many of the other answers create unnecessary lists, count in a non-pythonic way, loop over the lines of the file in a non-pythonic way, do not close the file properly, do unnecessary things, assume that the end of line character can only be '\n', or have other smaller issues.
Here is my suggested solution:
with open('myfile.txt') as f:
    line_count = sum(1 for line in f if line.strip())
The question does not define what blank line is. My definition of blank line: line is a blank line if and only if line.strip() returns the empty string. This may or may not be your definition of blank line.
sum([1 for i in open("file_name","r").readlines() if i.strip()])
Considering that blank lines will only contain the newline character, it is faster to avoid calling str.strip, which creates a new string, and instead check whether the line contains only whitespace using str.isspace and skip it:
with open('data.txt') as f:
    non_blank_lines = sum(not line.isspace() for line in f)
Demo:
from io import StringIO

s = '''apple
orange
pear

hippo
donkey'''

non_blank_lines = sum(not line.isspace() for line in StringIO(s))
# 5
You can further use str.isspace with itertools.groupby to count the number of contiguous lines/blocks in the file:
from itertools import groupby
no_paragraphs = sum(k for k, _ in groupby(StringIO(s), lambda x: not x.isspace()))
print(no_paragraphs)
# 2
Non-blank lines counter:
lines_counter = 0

with open('test_file.txt') as f:
    for line in f:
        if line != '\n':
            lines_counter += 1
Blocks counter:
para_counter = 0
prev = '\n'

with open('test_file.txt') as f:
    for line in f:
        if line != '\n' and prev == '\n':
            para_counter += 1
        prev = line
This bit of Python code should solve your problem:
with open('data.txt', 'r') as f:
    lines = len(list(filter(lambda x: x.strip(), f)))
This is how I would've done it:
f = open("file.txt")
l = [x for x in f.readlines() if x != "\n"]
print len(l)
readlines() will make a list of all the lines in the file and then you can just take those lines that have at least something in them.
Looks pretty straightforward to me!
Pretty straightforward one, I believe:
f = open('path', 'r')
count = 0
for lines in f:
    if lines.strip():
        count += 1
print(count)
My one-liner would be:
print(sum(1 for line in open(path_to_file,'r') if line.strip()))
