I have an unstructured, messy data file from which I have to scrub (delete) a specific list of strings.
Here is what I am doing, but with no result:
infile = r"messy_data_file.txt"
outfile = r"cleaned_file.txt"
delete_list = ["firstname1 lastname1","firstname2 lastname2"....,"firstnamen lastnamen"]
fin=open(infile,"")
fout = open(outfile,"w+")
for line in fin:
    for word in delete_list:
        line = line.replace(word, "")
    fout.write(line)
fin.close()
fout.close()
When I execute the file, I get the following error:
NameError: name 'word' is not defined
The readlines method returns a list of lines, not words, so your code would only work where one of your words is on a line by itself.
Since files are iterators over their lines, this can be done much more simply:
infile = "messy_data_file.txt"
outfile = "cleaned_file.txt"
delete_list = ["word_1", "word_2", "word_n"]
with open(infile) as fin, open(outfile, "w+") as fout:
    for line in fin:
        for word in delete_list:
            line = line.replace(word, "")
        fout.write(line)
To remove the strings in place (rewriting the same file), I used this code:
f = open('./test.txt','r')
a = ['word1','word2','word3']
lst = []
for line in f:
    for word in a:
        if word in line:
            line = line.replace(word,'')
    lst.append(line)
f.close()
f = open('./test.txt','w')
for line in lst:
    f.write(line)
f.close()
Based on your comment "I am double clicking the .py file. It seems to invoke the python application which disappears after a couple of seconds. I dont get any error thought", I believe your issue is that the script is not finding the input file, which is also why you are not getting any output. When you double-click the file, I can't recall exactly which directory the interpreter treats as the working directory, but I think it's where python.exe is installed.
Use a fully qualified path, like so:
# Depends on your OS; on Windows:
infile = r"C:\tmp\messy_data_file.txt"
outfile = r"C:\tmp\cleaned_file.txt"

# On Linux/macOS:
infile = r"/etc/tmp/messy_data_file.txt"
outfile = r"/etc/tmp/cleaned_file.txt"
Also, for your sanity, run it from the command-line instead of double clicking. It'll be much easier to catch errors/output.
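If you're not sure which directory the script is actually using for relative paths, a quick diagnostic helps. This is just a sketch of the idea; the file names are the ones assumed above:

import os

# Show the directory that relative paths are resolved against
print(os.getcwd())

# Build the paths next to the script itself, regardless of how it was launched
script_dir = os.path.dirname(os.path.abspath(__file__))
infile = os.path.join(script_dir, "messy_data_file.txt")
outfile = os.path.join(script_dir, "cleaned_file.txt")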
To the OP,
Ross Patterson's method above works perfectly for me, i.e.
infile = "messy_data_file.txt"
outfile = "cleaned_file.txt"
delete_list = ["word_1", "word_2", "word_n"]
fin = open(infile)
fout = open(outfile, "w+")
for line in fin:
    for word in delete_list:
        line = line.replace(word, "")
    fout.write(line)
fin.close()
fout.close()
Example:
I have a file named messy_data_file.txt that includes the following words (animals), not necessarily on the same line. Like this:
Goat
Elephant
Horse Donkey Giraffe
Lizard
Bird
Fish
When I modify the code to read (actually just adding the words to delete to the "delete_list" line):
infile = "messy_data_file.txt"
outfile = "cleaned_file.txt"
delete_list = ["Donkey", "Goat", "Fish"]
fin = open(infile)
fout = open(outfile, "w+")
for line in fin:
    for word in delete_list:
        line = line.replace(word, "")
    fout.write(line)
fin.close()
fout.close()
The resulting "cleaned_file.txt" looks like this:
Elephant
Horse Giraffe
Lizard
Bird
There is a blank line where "Goat" used to be (removing "Donkey" did not leave one, since the rest of that line remained), but for my purposes this works fine.
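If those leftover blank lines bother you, one option (a small sketch of my own, not part of the original answer) is to skip any line that is empty after the replacements:

with open(infile) as fin, open(outfile, "w") as fout:
    for line in fin:
        for word in delete_list:
            line = line.replace(word, "")
        if line.strip():  # skip lines that became blank
            fout.write(line)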
I also add input("Press Enter to exit...") at the very end of the code to keep the command-line window from opening and slamming shut on me when I'm double-clicking the remove_text.py file to run it, but take note that you'll catch no errors this way.
To catch errors, I run it from the command line (where C:\Just_Testing is the directory where all my files are, i.e. remove_text.py and messy_text.txt),
like this:
C:\Just_Testing>py remove_text.py
or
C:\Just_Testing>python remove_text.py
works exactly the same.
Of course, as when writing HTML, it never hurts to use a fully qualified path when running py or python from somewhere other than the directory you happen to be sitting in, such as:
C:\Windows\System32>python C:\Users\Me\Desktop\remove_text.py
Of course, in the code it would then be (note the raw-string prefix, so the backslashes aren't treated as escape sequences):
infile = r"C:\Users\Me\Desktop\messy_data_file.txt"
outfile = r"C:\Users\Me\Desktop\cleaned_file.txt"
Be careful to use a fully qualified path for the newly created cleaned_file.txt as well, or it will be created in whatever directory you happen to be in, which can cause confusion when you go looking for it.
Personally, I have the PATH in my Environment Variables set to point to all my Python installs i.e. C:\Python3.5.3, C:\Python2.7.13, etc. so I can run py or python from anywhere.
Anyway, I hope making fine-tuning adjustments to this code from Mr. Patterson can get you exactly what you need. :)
Maybe you can add encoding='utf-8' to your fin and fout open() calls.
Here are the modified lines you may want to use:
fin = open(infile, "r", encoding='utf-8')
fout = open(outfile, "w+", encoding='utf-8')
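For completeness, a minimal sketch of the whole loop with the encoding spelled out (assuming the same infile, outfile and delete_list as in the question):

with open(infile, "r", encoding="utf-8") as fin, open(outfile, "w", encoding="utf-8") as fout:
    for line in fin:
        for word in delete_list:
            line = line.replace(word, "")
        fout.write(line)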
This (adding utf-8) mostly comes up on Windows. For plain reading, writing, and appending it usually isn't a problem, but for anything beyond that, like replacing text inside the file, specifying the encoding explicitly is a good idea.
Hope this helps you.
The code below reads the old data and writes back only the lines that are not equal to the string you don't want (this also works if you want to remove empty lines):
lines = []  # avoid calling this "str", which would shadow the builtin
with open("file.txt", "r+") as f:
    for i in f.readlines():
        lines.append(i)
with open("file.txt", "w") as f:
    for i in lines:
        # strip the trailing newline before comparing, otherwise nothing ever matches
        if i.rstrip("\n") != "The string you want to remove":
            f.write(i)
I wrote the following Python code snippet to append a lowercase 'p' character to each line of a txt file:
f = open('helloworld.txt','r')
for line in f:
    line+='p'
print(f.read())
f.close()
However, when I execute this Python program, it prints nothing but a blank line:
zhiwei#zhiwei-Lenovo-Rescuer-15ISK:~/Documents/1001/ass5$ python3 helloworld.py
Can anyone tell me what's wrong with my code?
Currently, you are only reading each line and not writing to the file. Reopen the file in write mode and write your full string to it, like so:
newf = ""
with open('helloworld.txt','r') as f:
    for line in f:
        newf += line.strip() + "p\n"

# the with blocks close the file automatically, so no explicit f.close() is needed
with open('helloworld.txt','w') as f:
    f.write(newf)
Well, type help(f) in the shell and you get "Character and line based layer over a BufferedIOBase object, buffer."
What this means here is that the file object is an iterator: the first pass over it (your for loop) consumes the contents, so reading it again gives you nothing, it's already empty.
So do it like this:
with open(oldfile, 'r') as f1, open(newfile, 'w') as f2:
    newline = ''
    for line in f1:
        newline += line.strip() + "p\n"
    f2.write(newline)
open(filePath, openMode) takes two arguments: the first is the path to your file, the second is the mode it will be opened in. When you use 'r' as the second argument, you are actually telling Python to open it as a read-only file.
If you want to write to it, you need to open it in write mode, using 'w' as the second argument. You can find more about how to read and write files in Python in the official documentation.
If you want to read and write at the same time, you have to open the file in both reading and writing modes. You can do this simply by using 'r+' mode.
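For instance, a minimal sketch of reading and then rewriting a file through a single 'r+' handle (the filename is just an example):

with open('helloworld.txt', 'r+') as f:
    content = f.read()        # read everything; the file position is now at the end
    f.seek(0)                 # move back to the start before writing
    f.write(content.replace('\n', 'p\n'))
    f.truncate()              # drop any leftover bytes if the new text is shorter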
It seems that your for loop has already read the file to the end, so f.read() returns an empty string.
If you just need to print the lines in the file, you could move the print into the for loop, as in print(line). It is also clearer to read the lines once, before the loop:
f = open("filename", "r")
lines = f.readlines()
for line in lines:
line += "p"
print(line)
f.close()
If you need to modify the file, you need to create another file object, open it in "w" mode, and use f.write(line) to write the modified lines into the new file.
Besides, it is better to use a with statement instead of a bare open(); it is more Pythonic:
with open("filename", "r") as f:
    lines = f.readlines()
    for line in lines:
        line += "p"
        print(line)
When using a with statement, there is no need to close the file yourself, which is simpler.
I'm fairly new to python and I'm having an issue with my python script (split_fasta.py). Here is an example of my issue:
list = ["1.fasta", "2.fasta", "3.fasta"]
for file in list:
    contents = open(file, "r")
    for line in contents:
        if line[0] == ">":
            new_file = open(file + "_chromosome.fasta", "w")
            new_file.write(line)
I've left the bottom part of the program out because it's not needed. My issue is that when I run this program in the same directory as my fasta files, it works great:
python split_fasta.py *.fasta
But if I'm in a different directory and I want the program to output the new files (e.g. 1.fasta_chromosome.fasta) to my current directory... it doesn't:
python /home/bin/split_fasta.py /home/data/*.fasta
This still creates the new files in the same directory as the fasta files. The issue here I'm sure is with this line:
new_file = open(file + "_chromosome.fasta", "w")
Because if I change it to this:
new_file = open("seq" + "_chromosome.fasta", "w")
It creates an output file in my current directory.
I hope this makes sense to some of you and that I can get some suggestions.
You are giving the full path of the old file, plus a new name. So basically, if file == /home/data/something.fasta, the output file will be file + "_chromosome.fasta" which is /home/data/something.fasta_chromosome.fasta
If you use os.path.basename on file, you will get the name of the file (i.e. in my example, something.fasta)
From #Adam Smith
You can use os.path.splitext to get rid of the .fasta
basename, _ = os.path.splitext(os.path.basename(file))
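For example, a quick illustration of what these calls return (just a throwaway path):

import os

path = "/home/data/something.fasta"
print(os.path.basename(path))                    # 'something.fasta'
print(os.path.splitext(os.path.basename(path)))  # ('something', '.fasta')

basename, _ = os.path.splitext(os.path.basename(path))
print(basename + "_chromosome.fasta")            # 'something_chromosome.fasta'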
Getting back to the code example, I see several things that are not recommended in Python. I'll go into detail.
Avoid shadowing builtin names, such as list, str, int... It is not explicit and can lead to potential issues later.
When opening a file for reading or writing, you should use the with syntax. This is highly recommended since it takes care of closing the file.
with open(filename, "r") as f:
data = f.read()
with open(new_filename, "w") as f:
f.write(data)
If you have an empty line in your file, line[0] == ... will raise an IndexError exception. Use line.startswith(...) instead.
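A quick illustration of the difference (nothing project-specific here):

line = ""                     # e.g. a blank line in the file
print(line.startswith(">"))   # False, no exception
print(line[0] == ">")         # raises IndexError: string index out of range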
Final code:
import os

files = ["1.fasta", "2.fasta", "3.fasta"]
for file in files:
    with open(file, "r") as fin:
        for line in fin:
            if line.startswith(">"):
                new_name = os.path.splitext(os.path.basename(file))[0] + "_chromosome.fasta"
                with open(new_name, "w") as fout:
                    fout.write(line)
Often, people come at me and say "that's ugly". Not really :). The levels of indentation make it clear which context is which.
I'm new to Python, and to programming in general.
I want to remove the first character from each line in a text file and write the changes back to the file. For example, I have a file with 36 lines, and the first character of each line is a symbol or a number that I want removed.
I made a little code here, but it doesn't work as expected; it only duplicates whole lines. Any help would be appreciated in advance!
from sys import argv
run, filename = argv
f = open(filename, 'a+')
f.seek(0)
lines = f.readlines()
for line in lines:
    f.write(line[1:])
f.close()
Your code already does remove the first character. I saved exactly your code as both dupy.py and dupy.txt, then ran python dupy.py dupy.txt, and the result is:
from sys import argv
run, filename = argv
f = open(filename, 'a+')
f.seek(0)
lines = f.readlines()
for line in lines:
    f.write(line[1:])
f.close()
rom sys import argv
un, filename = argv
= open(filename, 'a+')
.seek(0)
ines = f.readlines()
or line in lines:
   f.write(line[1:])
.close()
It's not copying entire lines; it's copying lines with their first character stripped.
But from the initial statement of your problem, it sounds like you want to overwrite the lines, not append new copies. To do that, don't use append mode. Read the file, then write it:
from sys import argv
run, filename = argv
f = open(filename)
lines = f.readlines()
f.close()
f = open(filename, 'w')
for line in lines:
    f.write(line[1:])
f.close()
Or, alternatively, write a new file, then move it on top of the original when you're done:
import os
from sys import argv
run, filename = argv
fin = open(filename)
fout = open(filename + '.tmp', 'w')
lines = fin.readlines()
for line in lines:
    fout.write(line[1:])
fout.close()
fin.close()
os.rename(filename + '.tmp', filename)
(Note that this version will not work as-is on Windows, but it's simpler than the actual cross-platform version; if you need Windows, I can explain how to do this.)
You can make the code a lot simpler, more robust, and more efficient by using with statements, looping directly over the file instead of calling readlines, and using tempfile:
import os
import tempfile
from sys import argv

run, filename = argv
with open(filename) as fin, tempfile.NamedTemporaryFile("w", delete=False) as fout:
    for line in fin:
        fout.write(line[1:])
os.rename(fout.name, filename)
On most platforms, this guarantees an "atomic write"—when your script finishes, or even if someone pulls the plug in the middle of it running, the file will end up either replaced by the new version, or untouched; there's no way it can end up half-way overwritten into unrecoverable garbage.
Again this version won't work on Windows. Without a whole lot of work, there is no way to implement this "write-temp-and-rename" algorithm on Windows. But you can come close with only a bit of extra work:
with open(filename) as fin, tempfile.NamedTemporaryFile("w", delete=False) as fout:
    for line in fin:
        fout.write(line[1:])
outname = fout.name
os.remove(filename)
os.rename(outname, filename)
This does prevent you from half-overwriting the file, but it leaves a hole where you may have deleted the original file while the new file is still sitting in a temporary location that you'll have to search for. You can make this a little nicer by putting the file somewhere easier to find (see the NamedTemporaryFile docs to see how), or by renaming the original file to a temporary name, then writing to the original filename, then deleting the renamed original (sketched below). There are various other possibilities, but actually getting the same behavior as on other platforms is very difficult.
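A rough sketch of that last rename-first variant (my illustration, using a hypothetical .bak name; not part of the original answer):

import os
from sys import argv

run, filename = argv
backup = filename + '.bak'

os.rename(filename, backup)                 # move the original out of the way first
with open(backup) as fin, open(filename, 'w') as fout:
    for line in fin:
        fout.write(line[1:])
os.remove(backup)                           # drop the renamed original once the new file exists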
You can either read all the lines into memory and then recreate the file,
from sys import argv

run, filename = argv
with open(filename, 'r') as f:
    data = [line[1:] for line in f]  # drop the first character; the trailing newline is kept
with open(filename, 'w') as f:
    f.writelines(data)
or you can create a second file and move the data from the first file to it line by line, then rename it if you'd like:
import os
from sys import argv

run, filename = argv
new_name = filename + '.tmp'
with open(filename, 'r') as f_in, open(new_name, 'w') as f_out:
    for line in f_in:
        f_out.write(line[1:])
os.rename(new_name, filename)
At its most basic, your problem is that you need to seek back to the beginning of the file after you read its complete contents into the list lines. Since you are making the file shorter, you also need to use truncate to adjust the official length of the file after you're done. Furthermore, open mode a+ (a is for append) overrides seek and forces all writes to go to the end of the file. So your code should look something like this:
import sys
def main(argv):
    filename = argv[1]
    with open(filename, 'r+') as f:
        lines = f.readlines()
        f.seek(0)
        for line in lines:
            f.write(line[1:])
        f.truncate()
if __name__ == '__main__': main(sys.argv)
It is better, when doing something like this, to write the changes to a new file and then rename it over the old file when you're done. This causes the update to happen "atomically" - a concurrent reader sees either the old file or the new one, not some mangled combination of the two. That looks like this:
import os
import sys
import tempfile
def main(argv):
    filename = argv[1]
    with open(filename, 'r') as inf:
        with tempfile.NamedTemporaryFile('w', dir=".", delete=False) as outf:
            tname = outf.name
            for line in inf:
                outf.write(line[1:])
    os.rename(tname, filename)
if __name__ == '__main__': main(sys.argv)
(Note: Atomically replacing a file via rename does not work on Windows; you have to os.remove the old name first. This unfortunately does mean there is a brief window (no pun intended) where a concurrent reader will find that the file does not exist. As far as I know there is no way to avoid this.)
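A side note: on Python 3.3 and newer, os.replace is documented to overwrite an existing destination even on Windows, which sidesteps the remove-then-rename gap. A minimal sketch, assuming the same tname and filename as above:

import os

# os.replace overwrites the destination if it already exists, on POSIX and Windows alike
os.replace(tname, filename)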
import re

with open(filename, 'r+') as f:
    modified = re.sub('^.', '', f.read(), flags=re.MULTILINE)
    f.seek(0, 0)
    f.write(modified)
    f.truncate()
In the regex pattern:
^ means 'start of string'
^ with flag re.MULTILINE means 'start of line'
^. means 'exactly one character at the start of a line'
The start of a line is the start of the string or any position after a newline (a newline is \n)
So we might fear that some of the newlines in a sequence like \n\n\n\n\n\n\n could match the pattern. But the dot matches any character EXCEPT a newline, so the newlines themselves never match it (a short demonstration follows below).
During the reading of the file triggered by f.read(), the file's pointer moves to the end of the file.
f.seek(0,0) moves the file's pointer back to the beginning of the file
f.truncate() puts a new EOF = end of file at the point where the writing has stopped. It's necessary since the modified text is shorter than the original one.
Compare the result with and without this line.
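A quick demonstration of the substitution itself, on a throwaway string rather than a file:

import re

text = "1abc\n2def\n\n3ghi\n"
result = re.sub('^.', '', text, flags=re.MULTILINE)
print(repr(result))  # 'abc\ndef\n\nghi\n' -- the first character of each non-empty line is
                     # removed, and the blank line is untouched because '.' never matches '\n'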
To be honest, I'm really not sure how good or bad an idea it is to nest with open(), but you can do something like this:
with open(filename_you_reading_lines_FROM, 'r') as f0:
    with open(filename_you_appending_modified_lines_TO, 'a') as f1:
        for line in f0:
            f1.write(line[1:])
While there was some discussion of best practice and whether this would run on Windows or not, being new to Python I took the first example that worked, got it running in my Windows environment (which has cygwin binaries in its Path environment variable), and used it to remove the first 3 characters (which were line numbers from a sample file):
import os
from sys import argv
run, filename = argv
fin = open(filename)
fout = open(filename + '.tmp', 'w')
lines = fin.readlines()
for line in lines:
    fout.write(line[3:])
fout.close()
fin.close()
I chose not to automatically overwrite since I wanted to be able to eyeball the output.
python c:\bin\remove1st3.py sampleCode.txt
I'm new to programming pretty much in general, and I am having difficulty getting this command to print its output to the .txt document. My goal in the end is to be able to swap the term "Sequence" out for a variable so I can integrate it into a custom easygui with multiple inputs and returns, but that's a story for later down the road. For the sake of testing and completing the current project I will just be altering the term manually.
I've been able to get another program to send its output to a .txt, but this one is being difficult. I don't know if I'm overlooking something simple, but I've been stuck on this for more time than I would like.
When it searches for the lines, it prints the fields of the file that I want; however, when it goes to write, it only finds the last line of the file and puts that in the .txt as the output. I know what the issue is, but I haven't been able to wrap my head around how to fix it, mainly due to my lack of knowledge of the language, I think.
I am using Sublime Text 2 on Windows.
def main():
    import os
    filelist = list()
    filed = open('out.txt', 'w')
    searchfile = open("asdf.csv")
    for lines in searchfile:
        if "Sequence" in lines:
            print lines
    filelist.append(lines)
    TheString = " ".join(filelist)
    searchfile.close()
    filed.write(TheString)
    filed.close()
main()
It sounds like you want the lines you are printing out to be collected in the variable "filelist", which is then written to the file at the .write() call. Only a difference of indentation (which is significant in Python) prevents this from happening:
def main():
    import os
    filelist = list()
    filed = open('out.txt', 'w')
    searchfile = open("asdf.csv")
    for lines in searchfile:
        if "Sequence" in lines:
            print lines
            filelist.append(lines)
    TheString = " ".join(filelist)
    searchfile.close()
    filed.write(TheString)
    filed.close()
main()
Having
filelist.append(lines)
at the same level of indentation as
print lines
tells Python that they are in the same block, and that the second statement also belongs to the "then" clause of the if statement.
Your problem is that you are not appending inside the loop; as a consequence, you only append the last line. Do it like this:
for lines in searchfile:
if "Sequence" in lines:
print lines
filelist.append(lines)
BONUS: This is the "pythonic" way to do what you want:
def main():
    with open('asdf.csv', 'r') as src, open('out.txt', 'w') as dest:
        dest.writelines(line for line in src if 'Sequence' in line)
def main():
seq = "Sequence"
record = file("out.txt", "w")
search = file("in.csv", "r")
output = list()
for line in search:
if seq in line: output.append(line)
search.close()
record.write(" ".join(output))
record.close()
I'm looking at how to do file input and output in Python. I've written the following code to read a list of names (one per line) from a file into another file while checking each name against the names in the file and appending text to the occurrences in the file. The code works. Could it be done better?
I wanted to use the with open(... statement for both the input and output files, but I can't see how they could be in the same block, which means I'd need to store the names in a temporary location.
def filter(txt, oldfile, newfile):
    '''\
    Read a list of names from a file line by line into an output file.
    If a line begins with a particular name, insert a string of text
    after the name before appending the line to the output file.
    '''
    outfile = open(newfile, 'w')
    with open(oldfile, 'r', encoding='utf-8') as infile:
        for line in infile:
            if line.startswith(txt):
                line = line[0:len(txt)] + ' - Truly a great person!\n'
            outfile.write(line)
    outfile.close()
    return # Do I gain anything by including this?
# input the name you want to check against
text = input('Please enter the name of a great person: ')
letsgo = filter(text,'Spanish', 'Spanish2')
Python allows putting multiple open() statements in a single with. You comma-separate them. Your code would then be:
def filter(txt, oldfile, newfile):
    '''\
    Read a list of names from a file line by line into an output file.
    If a line begins with a particular name, insert a string of text
    after the name before appending the line to the output file.
    '''
    with open(newfile, 'w') as outfile, open(oldfile, 'r', encoding='utf-8') as infile:
        for line in infile:
            if line.startswith(txt):
                line = line[0:len(txt)] + ' - Truly a great person!\n'
            outfile.write(line)
# input the name you want to check against
text = input('Please enter the name of a great person: ')
letsgo = filter(text,'Spanish', 'Spanish2')
And no, you don't gain anything by putting an explicit return at the end of your function. You can use return to exit early, but you had it at the end, and the function will exit without it. (Of course with functions that return a value, you use the return to specify the value to return.)
Using multiple open() items with with was not supported in Python 2.5 when the with statement was introduced, or in Python 2.6, but it is supported in Python 2.7 and Python 3.1 or newer.
http://docs.python.org/reference/compound_stmts.html#the-with-statement
http://docs.python.org/release/3.1/reference/compound_stmts.html#the-with-statement
If you are writing code that must run in Python 2.5, 2.6 or 3.0, nest the with statements as the other answers suggested or use contextlib.nested.
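For the older-Python case, a rough sketch with contextlib.nested might look like this (an assumption on my part; nested was deprecated in 2.7 and removed in Python 3, and Python 2's open() has no encoding parameter):

from contextlib import nested

def filter(txt, oldfile, newfile):
    with nested(open(newfile, 'w'), open(oldfile, 'r')) as (outfile, infile):
        for line in infile:
            if line.startswith(txt):
                line = line[0:len(txt)] + ' - Truly a great person!\n'
            outfile.write(line)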
Use nested blocks like this,
with open(newfile, 'w') as outfile:
    with open(oldfile, 'r', encoding='utf-8') as infile:
        # your logic goes right here
You can nest your with blocks. Like this:
with open(newfile, 'w') as outfile:
    with open(oldfile, 'r', encoding='utf-8') as infile:
        for line in infile:
            if line.startswith(txt):
                line = line[0:len(txt)] + ' - Truly a great person!\n'
            outfile.write(line)
This is better than your version because you guarantee that outfile will be closed even if your code encounters exceptions. Obviously you could do that with try/finally, but with is the right way to do this.
Or, as I have just learnt, you can have multiple context managers in a with statement as described by #steveha. That seems to me to be a better option than nesting.
And for your final minor question, the return serves no real purpose. I would remove it.
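For comparison, here is roughly what those nested with blocks buy you over doing it by hand with try/finally (a sketch, not the exact machinery Python uses):

outfile = open(newfile, 'w')
try:
    infile = open(oldfile, 'r', encoding='utf-8')
    try:
        for line in infile:
            if line.startswith(txt):
                line = line[0:len(txt)] + ' - Truly a great person!\n'
            outfile.write(line)
    finally:
        infile.close()
finally:
    outfile.close()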
Sometimes you might want to open a variable number of files and treat each one the same; you can do this with contextlib:
from contextlib import ExitStack

filenames = ['file1.txt', 'file2.txt', 'file3.txt']
with open('outfile.txt', 'a') as outfile:
    with ExitStack() as stack:
        file_pointers = [stack.enter_context(open(file, 'r')) for file in filenames]
        for fp in file_pointers:
            outfile.write(fp.read())