This question already has answers here:
In python, how to check the end of standard input streams (sys.stdin) and do something special on that
(2 answers)
Closed 6 months ago.
How do I check for EOF in Python? I found a bug in my code where the last block of text after the separator isn't added to the return list. Or maybe there's a better way of expressing this function?
Here's my code:
def get_text_blocks(filename):
    text_blocks = []
    text_block = StringIO.StringIO()
    with open(filename, 'r') as f:
        for line in f:
            text_block.write(line)
            print line
            if line.startswith('-- -'):
                text_blocks.append(text_block.getvalue())
                text_block.close()
                text_block = StringIO.StringIO()
    return text_blocks
You might find it easier to solve this using itertools.groupby.
def get_text_blocks(filename):
    import itertools
    with open(filename, 'r') as f:
        groups = itertools.groupby(f, lambda line: line.startswith('-- -'))
        return [''.join(lines) for is_separator, lines in groups if not is_separator]
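To see how the grouping works, here is a small illustrative run (the sample lines are made up):

import itertools

lines = ['first\n', 'block\n', '-- -\n', 'second\n', 'block\n']
groups = itertools.groupby(lines, lambda line: line.startswith('-- -'))
print([(is_sep, list(g)) for is_sep, g in groups])
# [(False, ['first\n', 'block\n']), (True, ['-- -\n']), (False, ['second\n', 'block\n'])]

Note that, unlike the original function, this version does not include the separator lines in the returned blocks.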
Another alternative is to use a regular expression to match the separators:
def get_text_blocks(filename):
    import re
    separator = re.compile('^-- -.*', re.M)
    with open(filename, 'r') as f:
        return re.split(separator, f.read())
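A quick illustrative run (the sample text is made up); note that the newline terminating the separator line stays attached to the start of the following block:

import re

separator = re.compile('^-- -.*', re.M)
print(re.split(separator, 'first block\n-- - cut\nsecond block\n'))
# ['first block\n', '\nsecond block\n']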
The end-of-file condition holds as soon as the for statement terminates -- that seems the simplest way to fix this code with minimal changes (you can extract text_block.getvalue() at the end and check that it's not empty before appending it).
This is the standard problem with emitting buffers.
You don't detect EOF -- that's needless. You write the last buffer.
def get_text_blocks(filename):
    text_blocks = []
    text_block = StringIO.StringIO()
    with open(filename, 'r') as f:
        for line in f:
            text_block.write(line)
            print line
            if line.startswith('-- -'):
                text_blocks.append(text_block.getvalue())
                text_block.close()
                text_block = StringIO.StringIO()
        ### At this moment, you are at EOF
        if text_block.getvalue():    # len() does not work on a StringIO object
            text_blocks.append(text_block.getvalue())
        ### Now your final block (if any) is appended.
    return text_blocks
Why do you need StringIO here?
def get_text_blocks(filename):
    text_blocks = [""]
    with open(filename, 'r') as f:
        for line in f:
            if line.startswith('-- -'):
                text_blocks.append(line)
            else:
                text_blocks[-1] += line
    return text_blocks
EDIT: Fixed the function, other suggestions might be better, just wanted to write a function similar to the original one.
EDIT: This assumed the file starts with "-- -"; seeding the list with an empty string "fixes" the IndexError, or you could use this version instead:
def get_text_blocks(filename):
    text_blocks = []
    with open(filename, 'r') as f:
        for line in f:
            if line.startswith('-- -'):
                text_blocks.append(line)
            else:
                if len(text_blocks) != 0:
                    text_blocks[-1] += line
    return text_blocks
But both versions look a bit ugly to me; the regex version is much cleaner.
This is a fast way to see if you have an empty file:
if f.read(1) == '':    # f is an already-open file object
    print "EOF"
f.close()
I have files that sometimes contain weird end-of-line characters like \r\r\n. With this, it works like I want:
with open('test.txt', 'wb') as f:  # simulate a file with weird end-of-lines
    f.write(b'abc\r\r\ndef')

with open('test.txt', 'rb') as f:
    for l in f:
        print(l)
# b'abc\r\r\n'
# b'def'
I want to be able to get the same result from a string. I thought about splitlines, but it does not give the same result:
print(b'abc\r\r\ndef'.splitlines())
# [b'abc', b'', b'def']
Even with keepends=True, it's not the same result.
Question: how can I get the same behaviour as for l in f using splitlines()?
Linked: Changing str.splitlines to match file readlines and https://bugs.python.org/issue22232
Note: I don't want to put everything in a BytesIO or StringIO, because it does a x0.5 speed performance (already benchmarked); I want to keep a simple string. So it's not a duplicate of How do I wrap a string in a file in Python?.
Why don't you just split it:
input = b'\nabc\r\r\r\nd\ref\nghi\r\njkl'
result = input.split(b'\n')
print(result)
[b'', b'abc\r\r\r', b'd\ref', b'ghi\r', b'jkl']
You will lose the trailing \n, which can be re-added to every line later if you really need it. For the last line you have to check whether it is really needed, like this:
fixed = [bstr + b'\n' for bstr in result]
if input[-1:] != b'\n':    # slice with [-1:] so the comparison is bytes-to-bytes
    fixed[-1] = fixed[-1][:-1]
print(fixed)
# [b'\n', b'abc\r\r\r\n', b'd\ref\n', b'ghi\r\n', b'jkl']
Another variant with a generator. This way it will be memory-friendly on huge files, and the syntax stays similar to the original: for l in bin_split(input):
def bin_split(input_str):
    start = 0
    while start >= 0:
        found = input_str.find(b'\n', start) + 1
        if 0 < found < len(input_str):
            yield input_str[start:found]
            start = found
        else:
            yield input_str[start:]
            break
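For example, on the sample input from above:

for l in bin_split(b'\nabc\r\r\r\nd\ref\nghi\r\njkl'):
    print(l)
# b'\n'
# b'abc\r\r\r\n'
# b'd\ref\n'
# b'ghi\r\n'
# b'jkl'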
There are a couple ways to do this, but none are especially fast.
If you want to keep the line endings, you might try the re module:
import re

lines = re.findall(r'[\r\n]+|[^\r\n]+[\r\n]*', text)
# or equivalently
line_split_regex = re.compile(r'[\r\n]+|[^\r\n]+[\r\n]*')
lines = line_split_regex.findall(text)
If you need the endings and the file is really big, you may want to iterate instead:
for r in re.finditer(r'[\r\n]+|[^\r\n]+[\r\n]*', text):
    line = r.group()
    # do stuff with line here
If you don't need the endings, then you can do it much more easily:
lines = list(filter(None, text.splitlines()))
You can omit the list() part if you just iterate over the results (or if using Python2):
for line in filter(None, text.splitlines()):
    pass  # do stuff with line
I would iterate through like this:
text = "b'abc\r\r\ndef'"
results = text.split('\r\r\n')
for r in results:
print(r)
This is a for l in f: solution:
The key to this is the newline argument on the open call. From the documentation:
[Screenshot of the open() documentation describing the newline argument]
Therefore, you should use newline='' when writing to suppress newline translation and then when reading use newline='\n', which will work if all your lines terminate with 0 or more '\r' characters followed by a '\n' character:
with open('test.txt', 'w', newline='') as f:
    f.write('abc\r\r\ndef')

with open('test.txt', 'r', newline='\n') as f:
    for line in f:
        print(repr(line))
Prints:
'abc\r\r\n'
'def'
A quasi-splitlines solution:
This is, strictly speaking, not a splitlines solution, since to handle arbitrary line endings a regular-expression version of split would have to be used, capturing the line endings and then re-assembling the lines and their endings. So instead this solution just uses a regular expression to break up the input text, allowing line endings consisting of any number of '\r' characters followed by a '\n' character:
import re

input = '\nabc\r\r\ndef\nghi\r\njkl'

with open('test.txt', 'w', newline='') as f:
    f.write(input)

with open('test.txt', 'r', newline='') as f:
    text = f.read()

lines = re.findall(r'[^\r\n]*\r*\n|[^\r\n]+$', text)
for line in lines:
    print(repr(line))
Prints:
'\n'
'abc\r\r\n'
'def\n'
'ghi\r\n'
'jkl'
This question already has answers here:
Search and replace a line in a file in Python
(13 answers)
Closed 3 years ago.
I have a JSON Lines file like the one below:
{"id":0,"country":"fr"}
{"id":1,"country":"en"}
{"id":2,"country":"fr"}
{"id":3,"country":"fr"}
I have a list of codes, and I want to assign a code to each user by updating the file's lines.
The result should be the following:
{"id":0,"country":"fr", code:1}
{"id":1,"country":"en", code:2}
{"id":2,"country":"fr", code:3}
{"id":3,"country":"fr", code:4}
This is how I do it now:
import ujson
from tempfile import mkstemp

fh, abs_path = mkstemp()
with open(fh, 'w') as tmp_file:
    with open(shooting.segment_filename) as segment_filename:
        for line in segment_filename:
            enriched_line = ujson.loads(line)
            code = compute_code()
            if code:
                enriched_line["code"] = code
            tmp_file.write(ujson.dumps(enriched_line) + '\n')
My question is: is there a faster way to do this? Maybe via a Linux command launched via sarge, for example? Or any Pythonic way that avoids the read / write / replace cycle on the original file?
Thank you!
For performance you can skip the json serialization / deserialization step completely and just replace the closing bracket with your code + a closing bracket.
So this should perform much better:
content = ""
with open("inp.txt", "r") as inp:
for line in inp:
content += line[:-1] + ", code:%s}\n" % compute_code()
with open("inp.txt", "w") as out:
out.write(content)
EDIT:
If you don't want to load the whole file into memory you can do something like this.
with open("inp.txt", "r") as inp, open("out.txt", "w") as out:
for line in inp:
out.write(line[:-1] + ", code:%s}\n" % compute_code())
I do not know if this will satisfy you but here is some "cleaner" code:
import json

with open(shooting.segment_filename, "r") as f:
    data = [json.loads(line) for line in f.readlines()]

for json_line in data:
    code = compute_code()
    if code:
        json_line["code"] = code

# Will overwrite the source file; you might want to try it on a bogus file first
with open(shooting.segment_filename, "w") as f:
    f.write("\n".join([json.dumps(elem) for elem in data]))
I have a huge text file (12 GB). The lines are tab-delimited and the first column contains an ID. For each ID I want to do something. Therefore, my plan is to start with the first line and go through the first column line by line until the next ID is reached.
import linecache

start_line = b    # b is the current line number, initialised elsewhere
num_lines = 377763316
while b < num_lines:
    plasmid1 = linecache.getline("Result.txt", b - 1)
    plasmid1 = plasmid1.strip("\n")
    plasmid1 = plasmid1.split("\t")
    plasmid2 = linecache.getline("Result.txt", b)
    plasmid2 = plasmid2.strip("\n")
    plasmid2 = plasmid2.split("\t")
    if not str(plasmid1[0]) == str(plasmid2[0]):
        end_line = b
        # do something
The code works, but the problem is that linecache seems to reload the txt file every time. The code would run for several years if I don't improve its performance.
I appreciate your help if you have a good idea how to solve the issue or know an alternative approach!
Thanks,
Philipp
I think numpy.loadtxt() is the way to go. It would also be nice to pass the usecols argument to specify which columns you actually need from the file. NumPy is a solid library written with high performance in mind.
After calling loadtxt() you will get an ndarray back.
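A minimal sketch of that suggestion (the column index and dtype are illustrative; adjust them to your file):

import numpy as np

# Load only the first (ID) column of the tab-delimited file, as strings.
ids = np.loadtxt("Result.txt", delimiter="\t", usecols=(0,), dtype=str)
print(ids.shape)  # one entry per line

Note, though, that loadtxt reads the whole file into memory, which may be a problem for a 12 GB input.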
You can use itertools:
from itertools import takewhile

class EqualityChecker(object):
    def __init__(self, id):
        self.id = id

    def __call__(self, current_line):
        result = False
        current_id = current_line.split('\t')[0]
        if self.id == current_id:
            result = True
        return result

with open('hugefile.txt', 'r') as f:
    for id in ids:
        checker = EqualityChecker(id)
        for line in takewhile(checker, f.xreadlines()):
            do_stuff(line)
In the outer loop, id can actually be obtained from the first line whose id does not match the previous value.
You should open the file just once, and iterate over the lines.
with open('Result.txt', 'r') as f:
    aline = f.next()
    currentid = aline.split('\t', 1)[0]
    for nextline in f:
        nextid = nextline.split('\t', 1)[0]
        if nextid != currentid:
            # do stuff
            currentid = nextid
You get the idea, just use plain python.
Only one line is read in each iteration. The extra 1 argument to split will split only at the first tab, increasing performance. You will not get better performance with any specialized library. Only a plain C implementation could beat this approach.
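For example, the maxsplit argument stops the split at the first tab:

print('id1\tcol2\tcol3'.split('\t', 1))
# ['id1', 'col2\tcol3']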
If you get AttributeError: '_io.TextIOWrapper' object has no attribute 'next', it is probably because you are using Python 3.x (see question io-textiowrapper-object). Try this version instead:
with open('Result.txt', 'r') as f:
    aline = f.readline()
    currentid = aline.split('\t', 1)[0]
    while aline != '':
        aline = f.readline()
        nextid = aline.split('\t', 1)[0]
        if nextid != currentid:
            # do stuff
            currentid = nextid
I have a file which contains the following row:
//hva_SaastonJakaumanMuutos/printData/reallocationAssignment/changeUser/firstName>
I want to add "John" at the end of line.
I have written following code but for some reason it is not working,
def add_text_to_file(self, file, rowTitle, inputText):
    f = open("check_files/" + file + ".txt", "r")
    fileList = list(f)
    f.close()

    j = 0
    for row in fileList:
        if fileList[j].find(rowTitle) > 0:
            fileList[j] = fileList[j].replace("\n", "") + inputText + "\n"
            break
        j = j + 1

    f = open("check_files/" + file + ".txt", "w")
    f.writelines(fileList)
    f.close()
Do you see what I am doing wrong?
str.find may return 0 if the text you are searching for is found at the beginning. After all, it returns the index at which the match begins.
So your condition should be:
if fileList[j].find(rowTitle) >= 0 :
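For instance, the row from the question starts with the title, so find returns 0 and the original > 0 test skips it:

row = "//hva_SaastonJakaumanMuutos/printData/reallocationAssignment/changeUser/firstName>"
print(row.find("//hva_SaastonJakaumanMuutos"))  # prints 0 -- a match at the very start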
Edit:
The correction above would save the day, but it's better if you do things the right way, the Pythonic way.
If you are looking for a substring in a text, you can use the foo in bar comparison. It will be True if foo can be found in bar and False otherwise.
You rarely need a counter in Python. enumerate built-in is your friend here.
You can combine the iteration and writing and eliminate an unnecessary step.
strip or rstrip is better than replace in your case.
For Python 2.6+, it is better to use the with statement when dealing with files. It will deal with closing the file the right way. For Python 2.5, you need from __future__ import with_statement.
Refer to PEP8 for commonly preferred naming conventions.
Here is a cleaned up version:
def add_text_to_file(self, file, row_title, input_text):
    with open("check_files/" + file + ".txt", "r") as infile:
        file_list = infile.readlines()
    with open("check_files/" + file + ".txt", "w") as outfile:
        for row in file_list:
            if row_title in row:
                row = row.rstrip() + input_text + "\n"
            outfile.write(row)
You are not giving much information, so even though I wouldn't use the following code (I'm sure there are better ways), it might help to clarify your problem.
import os.path

def add_text_to_file(self, filename, row_title, input_text):
    # filename should have the .txt extension in it
    filepath = os.path.join("check_files", filename)
    with open(filepath, "r") as f:
        content = f.readlines()
    for j in range(len(content)):    # the original `for j in len(content)` raises TypeError
        if row_title in content[j]:
            content[j] = content[j].strip() + input_text + "\n"
            break
    with open(filepath, "w") as f:
        f.writelines(content)
Is there any way to remove what is found between two lines that contain two concrete strings?
I mean: I want to remove anything found between 'heaven' and 'hell' in a text file with this text:
I'm in heaven
foobar
I'm in hell
After executing the script/function I'm asking for, the text file will be empty.
Use a flag to indicate whether you're writing or not.
from __future__ import with_statement
import os

writing = True

with open('myfile.txt') as f:
    with open('output.txt', 'w') as out:    # the output file must be opened for writing
        for line in f:
            if writing:
                if "heaven" in line:
                    writing = False
                else:
                    out.write(line)
            elif "hell" in line:
                writing = True

os.remove('myfile.txt')
os.rename('output.txt', 'myfile.txt')
EDIT
As extraneon pointed out in the comments, the requirement is to remove the lines between two concrete strings. That means that if the second (closing) string is never found, nothing should be removed. That can be achieved by keeping a buffer of lines. The buffer gets discarded if the closing string "I'm in hell" is found, but if the end of file is reached without finding it, the whole contents must be written to the file.
Example:
I'm in heaven
foo
bar
Should keep the whole contents since there's no closing tag and the question says between two lines.
Here's an example to do that, for completion:
from __future__ import with_statement
import os

writing = True

with open('myfile.txt') as f:
    with open('output.txt', 'w') as out:    # opened for writing, as above
        for line in f:
            if writing:
                if "heaven" in line:
                    writing = False
                    buffer = [line]
                else:
                    out.write(line)
            elif "hell" in line:
                writing = True
            else:
                buffer.append(line)
        else:
            if not writing:
                # There wasn't a closing "I'm in hell", so write buffer contents
                out.writelines(buffer)

os.remove('myfile.txt')
os.rename('output.txt', 'myfile.txt')
Looks like by "remove" you mean "rewrite the input file in-place" (or make it look like you're so doing;-), in which case fileinput.input helps:
import fileinput

writing = True
for line in fileinput.input(['thefile.txt'], inplace=True):
    if writing:
        if 'heaven' in line:
            writing = False
        else:
            print line,
    else:
        if 'hell' in line:
            writing = True
You could do something like the following with regular expressions. There are probably more efficient ways to do it since I'm still learning a lot of python, but this should work.
import re

f = open('hh_remove.txt')
lines = f.readlines()
f.close()

pattern1 = re.compile("heaven", re.I)
pattern2 = re.compile("hell", re.I)
mark1 = False
mark2 = False

# Note: deleting from `lines` while enumerating it can skip lines,
# so this is only safe when there is a single heaven/hell block.
for i, line in enumerate(lines):
    if pattern1.search(line) is not None:
        mark1 = True
        set1 = i
    if pattern2.search(line) is not None:
        mark2 = True
        set2 = i + 1
    if mark1 and mark2:
        del lines[set1:set2]
        mark1 = False
        mark2 = False

out = open('hh_remove.txt', 'w')
out.write("".join(lines))
out.close()
I apologize but this sounds like a homework problem. We have a policy on these: https://meta.stackexchange.com/questions/10811/homework-on-stackoverflow
However, what I can say is that the feature #nosklo wrote about is available in any Python 2.5.x (or newer), but you need to learn enough Python to enable it. :-)
My solution would involve creating a new string with the undesired stuff stripped out, using str.find() or str.index() (or some relative of those two).
Best of luck!
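For what it's worth, here is a minimal sketch of that str.find() idea (remove_between is a hypothetical helper; the marker strings are the ones from the question):

def remove_between(text, start_marker, end_marker):
    # Cut out everything from start_marker through end_marker,
    # leaving text unchanged if either marker is missing.
    i = text.find(start_marker)
    if i == -1:
        return text
    j = text.find(end_marker, i)
    if j == -1:
        return text
    return text[:i] + text[j + len(end_marker):]

print(repr(remove_between("I'm in heaven\nfoobar\nI'm in hell\n",
                          "I'm in heaven", "I'm in hell")))
# '\n' -- only the newline that terminated the closing line survives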
See below. I don't know if it's OK, but it seems to be working.
import re, fileinput, os

for path, dirs, files in os.walk(path):
    for filename in files:
        fullpath = os.path.join(path, filename)
        f = open(fullpath, 'r')
        data = f.read()
        f.close()
        pattern = re.compile('Im in heaven.*?Im in hell', re.I | re.S)
        data = pattern.sub("", data)
        f = open(fullpath, 'w')
        f.write(data)
        f.close()
Anyway, when I execute it, it leaves a blank line. I mean, if I have this function:
public function preFetchAll(Doctrine_Event $event){
    //Im in heaven
    $a = sfContext::getInstance()->getUser()->getAttribute("passw.formulario");
    var_dump($a);
    //Im in hell
    foreach ($this->_listeners as $listener) {
        $listener->preFetchAll($event);
    }
}
and I execute my script, I get this:
public function preFetchAll(Doctrine_Event $event){

    foreach ($this->_listeners as $listener) {
        $listener->preFetchAll($event);
    }
}
As you can see, there is an empty line between "public..." and "foreach...".
Why?
Javi