Read, manipulate and save text file - python

All,
I'm trying to read text files that are downloaded every 20 min into a certain folder, check each one, manipulate it, and move it to another location for further processing. Basically, I want to check each file that comes in, see if a string contains a "0.00" value and, if so, delete that particular string. There are two such strings per file.
I managed to manipulate a file with a given name, but now I need to do the same for files with variable names (there is a timestamp included in the filename). One file will need to be processed at a time.
This is what I got so far:
import os

path = r"C:\Users\r1.0"
dir = os.listdir(path)

def remove_line(line, stop):
    return any([word in line for word in stop])

stop = ["0.00"]

for file in dir:
    if file.lower().endswith('.txt'):
        with open(file, "r") as f:
            lines = f.readlines()
        with open(file, "w") as f:
            for line in lines:
                if not remove_line(line, stop):
                    f.write(line)
The def function and the two "with open..." blocks work on their own. What am I doing wrong here?
Also, can I write a file to another directory using the open() function?
Thanks in advance!

Your code looks mostly fine. Note, though, that your list-comprehension check drops the whole line rather than removing the "0.00" string from it. You can write to a different folder with open(). This should do the trick for you:
import os

path = r"C:\Users\r1.0"
dir = os.listdir(path)
stop = ["0.00"]

for file in dir:
    if file.lower().endswith('.txt'):
        # join the folder, since listdir() returns bare file names
        with open(os.path.join(path, file), "r") as f:
            lines = f.readlines()
        # put the new file in a different location
        newfile = os.path.join("New", "directory", file)
        with open(newfile, "w") as f:
            for line in lines:
                for word in stop:
                    if word in line:  # check if we need to modify the line
                        # this removes the word from the line
                        line = line.replace(word, "")
                # Regardless of whether the line has changed, we need to write it out.
                f.write(line)
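The question also mentions moving each file onward once it has been processed. A minimal sketch of that last step, assuming a hypothetical processed folder (the folder name is illustrative, not from the question):

import os
import shutil

src = r"C:\Users\r1.0"
dest = r"C:\Users\r1.0\processed"  # hypothetical destination folder

for name in os.listdir(src):
    if name.lower().endswith('.txt'):
        # ... filter the file as shown above, then move it along:
        shutil.move(os.path.join(src, name), os.path.join(dest, name))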

Related

How to delete blank lines in a text file on a windows machine

I am trying to delete all blank lines in all YAML files in a folder. I have multiple lines with nothing but CRLF (using Notepad++), and I can't seem to eliminate these blank lines. I researched this before posting, as always, but I can't seem to get this working.
import glob
import re

path = 'C:\\Users\\ryans\\OneDrive\\Desktop\\output\\*.yaml'

for fname in glob.glob(path):
    with open(fname, 'r') as f:
        sfile = f.read()
        for line in sfile.splitlines(True):
            line = sfile.rstrip('\r\n')
            f = open(fname, 'w')
            f.write(line)
            f.close()
Here is a view in Notepad++ (screenshot omitted):
I want to delete the very first row shown here, as well as all other blank rows. Thanks.
If you use Python, you can update each line using:
re.sub(r'[\s\r\n]','',line)
Close the reading file handler before writing.
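A minimal sketch of the whole loop, using the re.sub() call above as the blank-line test (the read-then-rewrite structure is my assumption, not spelled out above):

import glob
import re

for fname in glob.glob('C:\\Users\\ryans\\OneDrive\\Desktop\\output\\*.yaml'):
    with open(fname, 'r') as f:  # close the reading handle before writing
        lines = f.readlines()
    with open(fname, 'w') as f:
        for line in lines:
            # a line is blank if nothing is left after stripping whitespace/CR/LF
            if re.sub(r'[\s\r\n]', '', line):
                f.write(line)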
If you use Notepad++, install the plugin called TextFX.
Replace all occurrences of \r\n with blank.
Select all the text
Use the new menu TextFX -> TextFX Edit -> E:Delete Blank Lines
I hope this helps.
You can't write to the file you are currently reading. Also, you are stripping \r\n from each line produced by file.splitlines(); this way you'll remove all \r\n, not only those in empty lines. Store the content under a new name and delete/rename the file afterwards:
Create demo file:
with open ("t.txt","w") as f:
f.write("""
asdfb
adsfoine
""")
Load / create new file from it:
with open("t.txt", 'r') as r, open("q.txt","w") as w:
for l in r:
if l.strip(): # only write line if when stripped it is not empty
w.write(l)
with open ("q.txt","r") as f:
print(f.read())
Output:
asdfb
adsfoine
(You need to strip() lines to catch ones that contain only spaces and a newline.)
For rename/delete see e.g. How to rename a file using Python and Delete a file or folder
import os

os.remove("t.txt")           # remove the original
os.rename("q.txt", "t.txt")  # rename the cleaned one
It's nice and easy...
import glob

# open() does not expand the *.yaml wildcard, so glob the pattern first
for file_path in glob.glob("C:\\Users\\ryans\\OneDrive\\Desktop\\output\\*.yaml"):
    with open(file_path, "r+") as file:
        lines = file.readlines()
        file.seek(0)
        for i in lines:
            if i.rstrip():
                file.write(i)
        file.truncate()  # drop whatever is left of the old, longer contents
Where you open each file, read its lines, write the non-blank ones back out, and truncate what is left of the old contents.

Using variable as part of name of new file in python

I'm fairly new to python and I'm having an issue with my python script (split_fasta.py). Here is an example of my issue:
list = ["1.fasta", "2.fasta", "3.fasta"]

for file in list:
    contents = open(file, "r")
    for line in contents:
        if line[0] == ">":
            new_file = open(file + "_chromosome.fasta", "w")
            new_file.write(line)
I've left the bottom part of the program out because it's not needed. My issue is that when I run this program in the same directory as my fasta files, it works great:
python split_fasta.py *.fasta
But if I'm in a different directory and I want the program to output the new files (e.g. 1.fasta_chromosome.fasta) to my current directory... it doesn't:
python /home/bin/split_fasta.py /home/data/*.fasta
This still creates the new files in the same directory as the fasta files. The issue here I'm sure is with this line:
new_file = open(file + "_chromosome.fasta", "w")
Because if I change it to this:
new_file = open("seq" + "_chromosome.fasta", "w")
It creates an output file in my current directory.
I hope this makes sense to some of you and that I can get some suggestions.
You are giving the full path of the old file, plus a new name. So basically, if file == /home/data/something.fasta, the output file will be file + "_chromosome.fasta" which is /home/data/something.fasta_chromosome.fasta
If you use os.path.basename on file, you will get the name of the file (i.e. in my example, something.fasta)
From @Adam Smith
You can use os.path.splitext to get rid of the .fasta
basename, _ = os.path.splitext(os.path.basename(file))
Getting back to the code example, I saw many things that are not recommended in Python. I'll go into detail.
Avoid shadowing builtin names, such as list, str, int... It is not explicit and can lead to potential issues later.
When opening a file for reading or writing, you should use the with syntax. This is highly recommended since it takes care to close the file.
with open(filename, "r") as f:
data = f.read()
with open(new_filename, "w") as f:
f.write(data)
If you have an empty line in your file, line[0] == ... will raise an IndexError exception. Use line.startswith(...) instead.
Final code:

import os

files = ["1.fasta", "2.fasta", "3.fasta"]

for file in files:
    with open(file, "r") as infile:
        for line in infile:
            if line.startswith(">"):
                basename, _ = os.path.splitext(os.path.basename(file))
                # note: "w" re-truncates on every match; use "a" to keep all matching lines
                with open(basename + "_chromosome.fasta", "w") as outfile:
                    outfile.write(line)
Often people come at me and say "that's ugly". Not really :). The levels of indentation make it clear which context is which.

Read line from file, process it, then remove it

I have a 22 MB text file containing a list of numbers (1 number per line). I am trying to have Python read the number, process the number, and write the result to another file. All of this works, but if I have to stop the program it starts all over from the beginning. I tried to use a MySQL database at first, but it was way too slow; numbers get processed about 4 times faster this way. I would like to be able to delete the line after the number was processed.
with open('list.txt', 'r') as file:
    for line in file:
        filename = line.rstrip('\n') + ".txt"
        if os.path.isfile(filename):
            print "File", filename, "exists, skipping!"
        else:
            #process number and write file
            #(need code to delete current line here)
As you can see, every time it is restarted it has to search the hard drive for each file name to make sure it gets to the place it left off. With 1.5 million numbers this can take a while. I found an example with truncate, but it did not work.
Are there any commands similar to PHP's array_shift() for Python that will work with text files?
I would use a marker file to keep a count of the lines already processed, instead of rewriting the input file:

import os

start_from = 0
try:
    with open('last_line.txt', 'r') as llf:
        start_from = int(llf.read())
except (IOError, ValueError):
    pass

with open('list.txt', 'r') as file:
    for i, line in enumerate(file):
        if i < start_from:
            continue
        filename = line.rstrip('\n') + ".txt"
        if os.path.isfile(filename):
            print "File", filename, "exists, skipping!"
        else:
            pass  # process number and write file
        with open('last_line.txt', 'w') as outfile:
            outfile.write(str(i + 1))  # number of lines processed so far
This code first checks for the file last_line.txt and tries to read a number from it: the count of lines processed during the previous attempt. Then it simply skips that many lines.
I use Redis for stuff like that. Install redis and then pyredis and you can have a persistent set in memory. Then you can do:
import redis

r = redis.StrictRedis('localhost')

with open('list.txt', 'r') as file:
    for line in file:
        if r.sismember('done', line):
            continue
        else:
            #process number and write file
            r.sadd('done', line)
If you don't want to install Redis, you can also use the shelve module, making sure that you open it with the writeback=False option. I really recommend Redis though; it makes things like this so much easier.
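For the shelve alternative, a minimal sketch mirroring the Redis approach (the 'done.db' filename and the use of shelf keys as a membership set are my illustrative assumptions):

import shelve

# use the shelf's keys as a persistent "done" set
done = shelve.open('done.db', writeback=False)  # hypothetical filename
try:
    with open('list.txt', 'r') as f:
        for line in f:
            number = line.rstrip('\n')
            if number in done:
                continue
            # process number and write file here
            done[number] = True  # mark as processed; persisted on close
finally:
    done.close()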
Reading the data file should not be a bottleneck. The following code reads a 36 MB, 697,997-line text file in about 0.2 seconds on my machine:
import time

start = time.clock()
with open('procmail.log', 'r') as f:
    lines = f.readlines()
end = time.clock()
print 'Readlines time:', end - start
It produced the following result:
Readlines time: 0.1953125
Note that this code produces a list of lines in one go.
To know where you've been, just write the number of lines you've processed to a file. Then if you want to try again, read all the lines and skip the ones you've already done:
import os

# Read the data file
with open('list.txt', 'r') as f:
    lines = f.readlines()

skip = 0
try:
    # Did we try earlier? If so, skip what has already been processed.
    with open('lineno.txt', 'r') as lf:
        skip = int(lf.read())  # this should only be one number
    del lines[:skip]  # Remove already processed lines from the list.
except:
    pass

with open('lineno.txt', 'w+') as lf:
    for n, line in enumerate(lines):
        # Do your processing here.
        lf.seek(0)  # go to the beginning of lf
        lf.write(str(n + skip + 1) + '\n')  # number of lines processed so far
        lf.flush()
        os.fsync(lf.fileno())  # flush and fsync make sure the lf file is written

Open a file for input and output in Python

I have the following code which is intended to remove specific lines of a file. When I run it, it prints the two filenames that live in the directory, then deletes all information in them. What am I doing wrong? I'm using Python 3.2 under Windows.
import os

files = [file for file in os.listdir() if file.split(".")[-1] == "txt"]

for file in files:
    print(file)
    input = open(file, "r")
    output = open(file, "w")
    for line in input:
        print(line)
        # if line is good, write it to output
    input.close()
    output.close()
open(file, 'w') wipes the file. To prevent that, open it in r+ mode (read+write/don't wipe), then read it all at once, filter the lines, and write them back out again. Something like
with open(file, "r+") as f:
lines = f.readlines() # read entire file into memory
f.seek(0) # go back to the beginning of the file
f.writelines(filter(good, lines)) # dump the filtered lines back
f.truncate() # wipe the remains of the old file
I've assumed that good is a function telling whether a line should be kept.
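For instance, good could be any predicate taking a line and returning whether to keep it; a hypothetical example:

def good(line):
    # hypothetical filter: keep every line that is not blank
    return bool(line.strip())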
If your file fits in memory, the easiest solution is to open the file for reading, read its contents to memory, close the file, open it for writing and write the filtered output back:
with open(file_name) as f:
    lines = list(f)

# filter lines

with open(file_name, "w") as f:  # This removes the file contents
    f.writelines(lines)
Since you are not intermingling read and write operations, the advanced file modes like "r+" are unnecessary here, and only complicate things.
If the file does not fit into memory, the usual approach is to write the output to a new, temporary file, and move it back to the original file name after processing is finished.
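A minimal sketch of that temporary-file approach (the filter_file() helper and its keep predicate are illustrative assumptions, not part of the original answer):

import os
import tempfile

def filter_file(file_name, keep):
    # write filtered output to a temporary file in the same directory,
    # then move it over the original once processing has finished
    dir_name = os.path.dirname(os.path.abspath(file_name))
    fd, tmp_name = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "w") as out, open(file_name) as src:
        for line in src:
            if keep(line):
                out.write(line)
    os.replace(tmp_name, file_name)  # atomic on POSIX; os.rename() before 3.3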
One way is to use the fileinput stdlib module. Then you don't have to worry about opening/closing files, file modes, etc...
import fileinput
from contextlib import closing
import os

fnames = [fname for fname in os.listdir() if fname.split(".")[-1] == "txt"]  # use splitext

with closing(fileinput.input(fnames, inplace=True)) as fin:
    for line in fin:
        # some condition
        if 'z' not in line:  # your condition here
            print line,  # trailing comma suppresses the extra newline; in py3: print(line, end='')
When using inplace=True, fileinput redirects stdout to the file currently being processed. A backup of the file (default extension '.bak') is made while this happens; pass backup='.bak' explicitly if you want it kept afterwards, which may come in useful.
jon@minerva:~$ cat testtext.txt
one
two
three
four
five
six
seven
eight
nine
ten
After running the above with a condition of not line.startswith('t'):
jon@minerva:~$ cat testtext.txt
one
four
five
six
seven
eight
nine
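Since the asker is on Python 3.2, a sketch of the same approach there (fileinput.input() can be used as a context manager from 3.2 on, so contextlib.closing() is no longer needed):

import fileinput
import os

fnames = [f for f in os.listdir() if f.endswith(".txt")]

with fileinput.input(fnames, inplace=True) as fin:
    for line in fin:
        if 'z' not in line:      # your condition here
            print(line, end='')  # stdout is redirected into the current file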
You're deleting everything when you open the file to write to it. You can't usefully hold separate read and write handles on the same file at once. Use open(file, "r+") instead, and then save all the lines to another variable before writing anything.
You should not open the same file for reading and writing at the same time.
"w" means create a empty for writing. If the file already exists, its data will be deleted.
So you can use a different file name for writing.

Replace a word in a file

I am new to Python programming...
I have a .txt file. It looks like:
0,Salary,14000
0,Bonus,5000
0,gift,6000
I want to replace the first '0' value with '1' in each line. How can I do this? Can anyone help me, with sample code?
Thanks in advance.
Nimmyliji
I know that you're asking about Python, but forgive me for suggesting that perhaps a different tool is better for the job. :) It's a one-liner via sed:
sed 's/^0,/1,/' yourtextfile.txt > output.txt
This applies the regex /^0,/ (which matches any 0, that occurs at the beginning of a line) to each line and replaces the matched text with 1, instead. The output is directed into the file output.txt specified.
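If it has to be Python after all, a rough equivalent of that sed one-liner (the file names are carried over from the sed example):

import re

with open("yourtextfile.txt") as src, open("output.txt", "w") as dst:
    for line in src:
        # same idea as the sed command: a leading "0," becomes "1,"
        dst.write(re.sub(r'^0,', '1,', line))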
inFile = open("old.txt", "r")
outFile = open("new.txt", "w")
for line in inFile:
outFile.write(",".join(["1"] + (line.split(","))[1:]))
inFile.close()
outFile.close()
If you would like something more general, take a look at the Python csv module. It contains utilities for processing comma-separated values (abbreviated as csv) in files. But it can work with an arbitrary delimiter, not only the comma. As your sample is obviously a csv file, you can use it as follows:
import csv

reader = csv.reader(open("old.txt"))
writer = csv.writer(open("new.txt", "w"))
writer.writerows(["1"] + line[1:] for line in reader)
To overwrite the original file with the new one:

import os

os.remove("old.txt")
os.rename("new.txt", "old.txt")
I think that writing to a new file and then renaming it is more fault-tolerant and less likely to corrupt your data than directly overwriting the source file. Imagine that your program raised an exception after the source file had already been read into memory and reopened for writing: you would lose the original data, and your new data wouldn't be saved because of the crash. In my case, I would only lose the new data while preserving the original.
o=open("output.txt","w")
for line in open("file"):
s=line.split(",")
s[0]="1"
o.write(','.join(s))
o.close()
Or you can use fileinput with in-place editing:

import fileinput

for line in fileinput.FileInput("file", inplace=1):
    s = line.split(",")
    s[0] = "1"
    print ','.join(s),  # trailing comma: the line already ends with \n
f = open(filepath, 'r')
data = f.readlines()
f.close()
edited = []
for line in data:
    edited.append('1' + line[1:])
f = open(filepath, 'w')
f.writelines(edited)
f.flush()
f.close()
Or in Python 2.5+:
with open(filepath, 'r') as f:
    data = f.readlines()

with open(outfilepath, 'w') as f:
    for line in data:
        f.write('1' + line[1:])
This should do it. I wouldn't recommend it for a truly big file though ;-)
What is going on (ex 1):
1: Open the file in read mode
2,3: Read all the lines into a list (each line is a separate index) and close the file.
4,5,6: Iterate over the list, constructing a new list where each line has the first character replaced by a 1. The line[1:] slices the string from index 1 onward; we concatenate the "1" with the truncated line.
7,8,9: Reopen the file in write mode, write the list to the file (overwrite), flush the buffer, and close the file handle.
In Ex. 2, I use the with statement, which lets the file handle close itself, but do essentially the same thing.
