Generate output file Python - python

As can be seen in the code, I created two output files: one for the output after splitting,
and a second for the actual output after removing duplicate lines.
How can I make only one output file? Sorry if I sound too stupid, I'm a beginner.
import sys
txt = sys.argv[1]
lines_seen = set() # holds lines already seen
outfile = open("out.txt", "w")
actualout = open("output.txt", "w")
for line in open(txt, "r"):
    line = line.split("?", 1)[0]
    outfile.write(line+"\n")
outfile.close()
for line in open("out.txt", "r"):
    if line not in lines_seen: # not a duplicate
        actualout.write(line)
        lines_seen.add(line)
actualout.close()

You can add the lines from the input file directly into the set. Since sets cannot have duplicates, you don't even need to check for those. Note that a set does not preserve the original line order. Try this:
import sys
txt = sys.argv[1]
lines_seen = set() # holds lines already seen
actualout = open("output.txt", "w")
for line in open(txt, "r"):
    line = line.split("?", 1)[0]
    lines_seen.add(line + "\n")
for line in lines_seen:
    actualout.write(line)
actualout.close()

In the first step you iterate through every line in the file, split the line on your delimiter and store the result in a list. After that you iterate through the list and write each new entry to your output file.
import sys
txt = sys.argv[1]
lines_seen = set() # holds lines already seen
actualout = open("output.txt", "w")
# keep only the part of each line before the first "?"
data = [line.rstrip("\n").split("?", 1)[0] for line in open(txt, "r")]
for line in data:
    if line not in lines_seen: # not a duplicate
        actualout.write(line + "\n")
        lines_seen.add(line)
actualout.close()
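
The splitting and the duplicate check can also be done in a single pass, which avoids the intermediate file and keeps the original line order. This is only a minimal sketch, assuming "?" is the delimiter as in the question; the names unique_prefixes and write_unique are made up for illustration and do not come from the question.

```python
def unique_prefixes(lines, sep="?"):
    """Yield the part of each line before `sep`, skipping repeats.

    Hypothetical helper: the name and signature are assumptions.
    """
    seen = set()  # prefixes already written
    for line in lines:
        prefix = line.rstrip("\n").split(sep, 1)[0]
        if prefix not in seen:
            seen.add(prefix)
            yield prefix


def write_unique(in_path, out_path="output.txt"):
    """Read `in_path` and write its unique prefixes to one output file."""
    with open(in_path, "r") as src, open(out_path, "w") as dst:
        for prefix in unique_prefixes(src):
            dst.write(prefix + "\n")
```

In the script from the question, calling write_unique(sys.argv[1]) would take the place of both loops.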

Related

Removing duplicates from text file using python

I have this text file and let's say it contains 10 lines.
Bye
Hi
2
3
4
5
Hi
Bye
7
Hi
Every time it says "Hi" and "Bye" I want it to be removed except for the first time it was said.
My current code is (yes filename is actually pointing towards a file, I just didn't place it in this one)
text_file = open(filename)
for i, line in enumerate(text_file):
    if i == 0:
        var_Line1 = line
    if i == 1:
        var_Line2 = line
    if i > 1:
        if line == var_Line2:
            del line
text_file.close()
It does detect the duplicates, but it takes a very long time given the number of lines, and I'm not sure how to delete them and save the result.
You could use dict.fromkeys to remove duplicates and preserve order efficiently:
with open(filename, "r") as f:
    lines = dict.fromkeys(f.readlines())
with open(filename, "w") as f:
    f.writelines(lines)
Idea from Raymond Hettinger
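
To see why this works, here is a small sketch that applies dict.fromkeys to the ten example lines from the question, held in a list instead of a file. It relies on dictionaries keeping insertion order, which is guaranteed from Python 3.7 on.

```python
lines = ["Bye\n", "Hi\n", "2\n", "3\n", "4\n", "5\n", "Hi\n", "Bye\n", "7\n", "Hi\n"]

# Dict keys are unique and keep insertion order (Python 3.7+),
# so only the first occurrence of each line survives.
deduped = list(dict.fromkeys(lines))

print("".join(deduped), end="")
```

Every repeated line is dropped this way, not only "Hi" and "Bye".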
Using a set & some basic filtering logic:
with open('test.txt') as f:
    seen = set() # keep track of the lines already seen
    deduped = []
    for line in f:
        line = line.rstrip()
        if line not in seen: # if not seen already, write the lines to result
            deduped.append(line)
            seen.add(line)
# re-write the file with the de-duplicated lines
with open('test.txt', 'w') as f:
    f.writelines([l + '\n' for l in deduped])

Adding a comma to end of first row of csv files within a directory using python

I've got some code that opens all CSV files in a directory and removes the top 2 lines of each file. Ideally, during this process I would also like it to add a single comma at the end of the new first line (what was originally line 3).
Another possible approach would be to remove the trailing commas on all other rows in each of the CSVs.
Any thoughts or approaches would be gratefully received.
import glob
path=r'P:\pytest'
for filename in glob.iglob(path+'/*.csv'):
    with open(filename, 'r') as f:
        lines = f.read().split("\n")
        f.close()
        if len(lines) >= 1:
            lines = lines[2:]
            o = open(filename, 'w')
            for line in lines:
                o.write(line+'\n')
            o.close()
Adding a counter in there can solve this:
import glob
path=r'C:/Users/dsqallihoussaini/Desktop/dev_projects/stack_over_flow'
for filename in glob.iglob(path+'/*.csv'):
    with open(filename, 'r') as f:
        lines = f.read().split("\n")
        print(lines)
        f.close()
        if len(lines) >= 1:
            lines = lines[2:]
            o = open(filename, 'w')
            counter=0
            for line in lines:
                counter=counter+1
                if counter==1:
                    o.write(line+',\n')
                else:
                    o.write(line+'\n')
            o.close()
One possible problem with your code is that you are reading the whole file into memory, which might be fine. If you are reading larger files, then you want to process the file line by line.
The easiest way to do that is to use the fileinput module: https://docs.python.org/3/library/fileinput.html
Something like the following should work:
#!/usr/bin/env python3
import glob
import fileinput
# inplace makes a backup of the file, then any output to stdout is written
# to the current file.
# change the glob..below is just an example.
#
# Iterate through each file in the glob.iglob() results
with fileinput.input(files=glob.iglob('*.csv'), inplace=True) as f:
    for line in f: # Iterate over each line of the current file.
        if f.filelineno() > 2: # Skip the first two lines
            # Note: 'line' has the newline in it.
            # Insert the comma if line 3 of the file, otherwise output original line
            print(line[:-1]+',') if f.filelineno() == 3 else print(line, end="")
I've added an encoding as well, since mine was throwing an error; specifying the encoding fixed that nicely.
import glob
path=r'C:/whateveryourfolderis'
for filename in glob.iglob(path+'/*.csv'):
    with open(filename, 'r',encoding='utf-8') as f:
        lines = f.read().split("\n")
        #print(lines)
        f.close()
        if len(lines) >= 1:
            lines = lines[2:]
            o = open(filename, 'w',encoding='utf-8')
            counter=0
            for line in lines:
                counter=counter+1
                if counter==1:
                    o.write(line+',\n')
                else:
                    o.write(line+'\n')
            o.close()
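
The per-file transformation can also be isolated in a small function, which makes it easy to try out on a string before touching any files. This is a sketch only; the function name is made up for illustration. It uses str.splitlines() instead of split("\n"), because split("\n") returns an extra empty string when the text ends with a newline, and that empty string becomes an extra blank line each time the file is rewritten.

```python
def drop_two_and_mark_header(text):
    """Drop the first two lines of `text` and append a comma to the new first line.

    Hypothetical helper: the name is an assumption, not from the answers.
    """
    lines = text.splitlines()[2:]  # skip the two leading lines
    if lines:
        lines[0] += ","  # what was originally line 3
    return "".join(line + "\n" for line in lines)
```

Inside the glob loop, the file contents would be read, passed through this function, and written back to the same filename.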

How to remove lines that start with the same letters (sequence) in a txt file?

#!/usr/bin/env python
FILE_NAME = "testprecomb.txt"
NR_MATCHING_CHARS = 5
lines = set()
with open(FILE_NAME, "r") as inF:
    for line in inF:
        line = line.strip()
        if line == "": continue
        beginOfSequence = line[:NR_MATCHING_CHARS]
        if not (beginOfSequence in lines):
            print(line)
            lines.add(beginOfSequence)
This is the code I have right now, but it is not working. I have a file that has lines of DNA that sometimes start with the same sequence (or pattern of letters). I need to write code that will find all lines of DNA that start with the same letters (perhaps the same 10 characters) and delete one of the lines.
Example (issue):
CCTGGATGGCTTATATAAGAT***GTTAT***
***GTTAT***ATAATATACCACCGGGCTGCTT
***GTTAT***ATAGTTACAGCGGAGTCTTGTGACTGGCTCGAGTCAAAAT
What I need as result after one is taken out of file:
CCTGGATGGCTTATATAAGAT***GTTAT***
***GTTAT***ATAATATACCACCGGGCTGCTT
(no third line)
I think your set logic is correct. You are just missing the portion that will save the lines you want to write back into the file. I am guessing you tried this with a separate list that you forgot to add here since you are using append somewhere.
FILE_NAME = "sample_file.txt"
NR_MATCHING_CHARS = 5
lines = set()
output_lines = [] # keep track of lines you want to keep
with open(FILE_NAME, "r") as inF:
    for line in inF:
        line = line.strip()
        if line == "": continue
        beginOfSequence = line[:NR_MATCHING_CHARS]
        if not (beginOfSequence in lines):
            output_lines.append(line + '\n') # add line to list, newline needed since we will write to file
            lines.add(beginOfSequence)
print(output_lines)
with open(FILE_NAME, 'w') as f:
    f.writelines(output_lines) # write it out to the file
Your approach has a few problems. First, I would avoid naming file variables inF as this can be confused with inf. Descriptive names are better: testFile for instance. Also testing for empty strings using equality misses a few important edge cases (what if line is None for instance?); use the not keyword instead. As for your actual problem, you're not actually doing anything based on that set membership:
FILE_NAME = "testprecomb.txt"
NR_MATCHING_CHARS = 5
prefixCache = set()
data = []
with open(FILE_NAME, "r") as testFile:
    for line in testFile:
        line = line.strip()
        if not line:
            continue
        beginOfSequence = line[:NR_MATCHING_CHARS]
        if (beginOfSequence in prefixCache):
            continue
        else:
            print(line)
            data.append(line)
            prefixCache.add(beginOfSequence)
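
The same filtering can be written as a function that works on any iterable of lines, which makes it easy to check against the example from the question. This is a sketch; the function name and the default prefix length of 5 are assumptions taken from NR_MATCHING_CHARS above.

```python
def dedupe_by_prefix(lines, n_chars=5):
    """Keep only the first line seen for each leading `n_chars` characters.

    Hypothetical helper: the name and signature are assumptions.
    """
    seen = set()  # prefixes already encountered
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        prefix = line[:n_chars]
        if prefix not in seen:
            seen.add(prefix)
            kept.append(line)
    return kept


sample = [
    "CCTGGATGGCTTATATAAGATGTTAT",
    "GTTATATAATATACCACCGGGCTGCTT",
    "GTTATATAGTTACAGCGGAGTCTTGTGACTGGCTCGAGTCAAAAT",
]
result = dedupe_by_prefix(sample)
```

Here result holds the first two sequences; the third is dropped because it shares the prefix GTTAT with the second.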

How do I remove an entire line with a specific word, from a text file?

I made a text file called contacts.txt which contains:
pot 2000
derek 45
snow 55
I want to get user input (a name) on which contact to remove, and delete the entire line containing that name. So far, this is what I've done:
# ... previous code
if int(number) == 5:
    print "\n"
    newdict = {}
    with open('contacts.txt','r') as f:
        for line in f:
            if line != "\n":
                splitline = line.split( )
                newdict[(splitline[0])] = ",".join(splitline[1:])
    print newdict
    removethis = raw_input("Contact to be removed: ")
    if removethis in newdict:
        with open('contacts.txt','r') as f:
            new = f.read()
        new = new.replace(removethis, '')
        with open('contacts.txt','w') as f:
            f.write(new)
When I enter "pot", I come back to the text file and only "pot" is removed; the "2000" stays there. I tried new = new.replace(removethis + '\n', '') as other forums suggested, but it didn't work.
Notes:
Everything I've read on other forums requires me to make a new file, but I don't want that; I only want one file.
I already tried 'r+' the first time I opened the file and inserted a for loop which only picks the lines that do not contain the input word, but it doesn't work either.
I saw you said this is not a duplicate, but isn't this discussion equivalent to your question?
Deleting a specific line in a file (python)
Based on the discussion in the link, I created a .txt file from your input (with the usernames you supplied) and ran the following code:
filename = 'smp.txt'
f = open(filename, "r")
lines = f.readlines()
f.close()
f = open(filename, "w")
for line in lines:
    if line!="\n":
        f.write(line)
f.close()
What this does is remove the blank lines between the entries.
It seems to me as if this is what you want.
How about this:
Read in all the lines from the file into a list
Write out each line
Skip the line that you want removed
Something like this:
filename = 'contacts.txt'
with open(filename, 'r') as fin:
    lines = fin.readlines()
with open(filename, 'w') as fout:
    for line in lines:
        if removethis not in line:
            fout.write(line)
If you want to be more precise about the line you remove, you could use if not line.startswith(removethis+' '), or you could put together a regular expression of some kind.
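
A stricter variant of that check compares the first whitespace-separated field of each line with the name, so that entering "po" does not remove "pot 2000". This is only a sketch, assuming the name is always the first field as in the contacts.txt shown in the question; the function name is made up for illustration.

```python
def remove_contact(lines, name):
    """Return `lines` without the entries whose first field equals `name`.

    Hypothetical helper: the name and signature are assumptions.
    """
    kept = []
    for line in lines:
        fields = line.split()
        if fields and fields[0] == name:
            continue  # this is the contact to drop
        kept.append(line)
    return kept


contacts = ["pot 2000\n", "derek 45\n", "snow 55\n"]
remaining = remove_contact(contacts, "pot")
```

The remaining lines can then be written back to the same file with writelines, as in the answer above.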

Take lines from two files, output into same line- python

I am attempting to open two files then take the first line in the first file, write it to an out file, then take the first line in the second file and append it to the same line in the output file, separated by a tab.
I've attempted to code this, and my outfile just ends up being the whole contents of the first file, followed by the entire contents of the second file. I included print statements just because I wanted to see something going on in the terminal while the script was running, that is why they are there. Any ideas?
import sys
InFileName = sys.argv[1]
InFile = open(InFileName, 'r')
InFileName2 = sys.argv[2]
InFile2 = open(InFileName2, 'r')
OutFileName = "combined_data.txt"
OutFile = open(OutFileName, 'a')
for line in InFile:
    OutFile.write(str(line) + '\t')
    print line
for line2 in InFile2:
    OutFile.write(str(line2) + '\n')
    print line
InFile.close()
InFile2.close()
OutFile.close()
You can use zip for this:
with open(file1) as f1,open(file2) as f2,open("combined_data.txt","w") as fout:
    for t in zip(f1,f2):
        fout.write('\t'.join(x.strip() for x in t)+'\n')
In the case where your two files don't have the same number of lines (or if they're REALLY BIG), you could use itertools.izip_longest(f1,f2,fillvalue=''). In Python 3 the function is called itertools.zip_longest.
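
A short Python 3 sketch of the zip_longest variant, written against lists of lines so it can be tried without any files. The function name is made up for illustration; when one input runs out first, its side of the tab is left empty.

```python
from itertools import zip_longest


def merge_side_by_side(lines1, lines2):
    """Join corresponding lines with a tab, padding the shorter input with ''.

    Hypothetical helper: the name and signature are assumptions.
    """
    merged = []
    for left, right in zip_longest(lines1, lines2, fillvalue=""):
        # Strip the line endings so both parts end up on one output line.
        merged.append(left.rstrip("\r\n") + "\t" + right.rstrip("\r\n") + "\n")
    return merged


combined = merge_side_by_side(["a\n", "b\n", "c\n"], ["1\n", "2\n"])
```

Open file objects can be passed in place of the lists, and the result written out with writelines.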
Perhaps this gives you a few ideas:
Adding entries from multiple files in python
o = open('output.txt', 'wb')
fh = open('input.txt', 'rb')
fh2 = open('input2.txt', 'rb')
for line in fh.readlines():
    o.write(line.strip('\r\n') + '\t' + fh2.readline().strip('\r\n') + '\n')
## If you want to write remaining lines from input2.txt:
# for line in fh2.readlines():
#     o.write(line.rstrip('\r\n') + '\n')
fh.close()
fh2.close()
o.close()
This will give you:
line1_of_file_1 line1_of_file_2
line2_of_file_1 line2_of_file_2
line3_of_file_1 line3_of_file_2
line4_of_file_1 line4_of_file_2
Where the space in my output example is a [tab]
Note: no line ending is appended to the file for obvious reasons.
For this to work, the line endings would need to be proper in both file 1 and 2.
To check this:
print 'File 1:'
f = open('input.txt', 'rb')
print [f.read()[:200]]
f.close()
print 'File 2:'
f = open('input2.txt', 'rb')
print [f.read()[:200]]
f.close()
This should give you something like
File 1:
['This is\ta lot of\t text\r\nWith a few line\r\nendings\r\n']
File 2:
['Give\r\nMe\r\nSome\r\nLove\r\n']
