Reading a file until a specific character in Python

I am currently working on an application which requires reading all the input from a file until a certain character is encountered.
By using the code:
file=open("Questions.txt",'r')
c=file.readlines()
c=[x.strip() for x in c]
This splits the input at every \n, so each line ends up as its own string in the list c. But instead I want each list element to run up to the point where a special character is encountered, like this:
if the input file has the contents:
1.Hai
2.Bye\-1
3.Hello
4.OAPd\-1
then I want to get a list as
c=['1.Hai\n2.Bye','3.Hello\n4.OAPd']
Please help me in doing this.

The easiest way would be to read the file in as a single string and then split it across your separator:
with open('myFileName') as myFile:
    text = myFile.read()
result = text.split(separator)  # use your \-1 (whatever that means) here
In case your file is very large, holding the complete contents in memory as a single string just to call .split() may not be desirable (and neither is holding the complete list of parts after the split). Then you could read it in chunks:
CHUNK_SIZE = 4096  # I propose 4096 or so

def each_chunk(stream, separator):
    buffer = ''
    while True:  # until EOF
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:  # EOF?
            yield buffer
            break
        buffer += chunk
        while True:  # until no separator is found
            try:
                part, buffer = buffer.split(separator, 1)
            except ValueError:
                break
            else:
                yield part

with open('myFileName') as myFile:
    for chunk in each_chunk(myFile, separator='\\-1\n'):
        print(chunk)  # not holding in memory, but printing chunk by chunk

I used "*" instead of "-1", I'll let you make the appropriate changes.
s = '1.Hai\n2.Bye*3.Hello\n4.OAPd*'
temp = ''
results = []
for char in s:
    if char == '*':
        results.append(temp)
        temp = ''
    else:
        temp += char
if len(temp) > 0:
    results.append(temp)

Related

Python: How do I read from stdin/file word by word?

As the title says, how do I read from stdin or from a file word by word, rather than line by line? I'm dealing with very large files, not guaranteed to have any newlines, so I'd rather not load all of a file into memory. So the standard solution of:
for line in sys.stdin:
    for word in line.split():
        foo(word)
won't work, since line may be too large. Even if it's not too large, it's still inefficient since I don't need the entire line at once. I essentially just need to look at a single word at a time, and then forget it and move on to the next one, until EOF.
EDIT: The suggested "duplicate" is not really a duplicate. It mentions reading line by line and THEN splitting it into words, something I explicitly said I wanted to avoid.
Here's a generator approach. I don't know when you plan to stop reading, so this is a forever loop.
def read_by_word(filename, chunk_size=16):
    '''This generator function opens a file and reads it by word'''
    buff = ''  # Preserve word from previous chunk
    with open(filename) as fd:
        while True:
            chunk = fd.read(chunk_size)
            if not chunk:  # Empty means end of file
                if buff:  # Corner case -- file had no whitespace at end
                    # Unfortunately, big chunk sizes could make the
                    # final chunk have spaces in it
                    yield from buff.split()
                break
            chunk = buff + chunk  # Add any previous reads
            if chunk != chunk.rstrip():
                yield chunk.rstrip()  # This chunk ends with whitespace
                buff = ''
            else:
                comp = chunk.split(None, 1)  # Split at most once, on whitespace
                if len(comp) == 1:
                    buff = chunk  # No whitespace seen yet; keep accumulating
                    continue
                else:
                    yield comp[0]
                    buff = comp[1]

for word in read_by_word('huge_file_with_few_newlines.txt'):
    print(word)
Here's a straightforward answer, which I'll post if anyone else goes looking and doesn't feel like wading through toxic replies:
word = ''
with open('filename', 'r') as f:
    while (c := f.read(1)):
        if c.isspace():
            if word:
                print(word)  # Here you can do whatever you want, e.g. append to a list
                word = ''
        else:
            word += c
    if word:  # don't lose a final word that isn't followed by whitespace
        print(word)
Edit: I will note that it would be faster to read larger byte-chunks at a time and detect words after the fact. Ben Y's answer has an (as of this edit) incomplete solution that might be of assistance. If performance (rather than memory, as was my issue) is a problem, that should probably be your approach. The code will be quite a bit longer, however.
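For what it's worth, here is a minimal sketch of that chunked idea (the words_from_chunks name and the 4096-character chunk size are my own choices, not from the answers above): keep the possibly unfinished last word in a buffer between reads and split everything else on whitespace.
def words_from_chunks(filename, chunk_size=4096):
    """Yield whitespace-separated words, reading the file chunk by chunk."""
    buff = ''
    with open(filename) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # EOF: whatever is buffered is the last word
                if buff:
                    yield buff
                break
            buff += chunk
            parts = buff.split()
            if buff[-1].isspace():
                # The chunk ended on whitespace, so every part is a complete word
                yield from parts
                buff = ''
            else:
                # The last part may be cut off mid-word; keep it for the next read
                yield from parts[:-1]
                buff = parts[-1] if parts else ''

for word in words_from_chunks('huge_file_with_few_newlines.txt'):
    print(word)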

How can I choose the line separator when reading a file?

I am trying to read a file that consists of a single 2.9 GB line of comma-separated values. This code would read the file line by line, with each print stopping at '\n':
with open('eggs.txt', 'rb') as file:
    for line in file:
        print(line)
How can I instead iterate over "lines" that stop at ', ' (or any other character/string)?
I don't think there is a built-in way to achieve this. You will have to use file.read(block_size) to read the file block by block, split each block at commas, and rejoin strings that go across block boundaries manually.
Note that you still might run out of memory if you don't encounter a comma for a long time. (The same problem applies to reading a file line by line, when encountering a very long line.)
Here's an example implementation:
def split_file(file, sep=",", block_size=16384):
    last_fragment = ""
    while True:
        block = file.read(block_size)
        if not block:
            break
        block_fragments = iter(block.split(sep))
        last_fragment += next(block_fragments)
        for fragment in block_fragments:
            yield last_fragment
            last_fragment = fragment
    yield last_fragment
Using buffered reading from the file (Python 3):
buffer_size = 2**12
delimiter = ','
with open(filename, 'r') as f:
    # remember the characters after the last delimiter in the previously processed chunk
    remaining = ""
    while True:
        # read the next chunk of characters from the file
        chunk = f.read(buffer_size)
        # end the loop if the end of the file has been reached
        if not chunk:
            break
        # add the remaining characters from the previous chunk,
        # split according to the delimiter, and keep the remaining
        # characters after the last delimiter separately
        *lines, remaining = (remaining + chunk).split(delimiter)
        # print the parts up to each delimiter one by one
        for line in lines:
            print(line, end=delimiter)
    # print the characters after the last delimiter in the file
    if remaining:
        print(remaining, end='')
Note that the way this is currently written, it will just print the original file's contents exactly as they were. This is easily changed though, e.g. by changing the end=delimiter parameter passed to the print() function in the loop.
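As a hedged illustration of that point, the same loop can collect the parts into a list instead of reprinting the file (the parts list is my own addition; filename, buffer_size and delimiter are as above):
buffer_size = 2**12
delimiter = ','
parts = []
with open(filename, 'r') as f:
    remaining = ""
    while True:
        chunk = f.read(buffer_size)
        if not chunk:
            break
        # everything before the last delimiter is complete; carry the rest over
        *lines, remaining = (remaining + chunk).split(delimiter)
        parts.extend(lines)
    if remaining:
        parts.append(remaining)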
Read the file a character at a time, and assemble the comma-separated lines:
def commaBreak(filename):
    word = ""
    with open(filename) as f:
        while True:
            char = f.read(1)
            if not char:
                print("End of file")
                yield word
                break
            elif char == ',':
                yield word
                word = ""
            else:
                word += char
You may choose to do something like this with a larger number of characters, e.g. 1000, read at a time.
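A possible buffered variant along those lines (an untested sketch; the commaBreakBuffered name and the exact chunk handling are mine, not from the answer above):
def commaBreakBuffered(filename, chunk_size=1000):
    word = ""
    with open(filename) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                print("End of file")
                yield word
                break
            # split what we have; the last piece may be an unfinished word
            *complete, word = (word + chunk).split(',')
            yield from complete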
with open('eggs.txt', 'rb') as file:
    for line in file:
        str_line = line.decode()
        words = str_line.split(', ')
        for word in words:
            print(word)
This yields the file one character at a time, which means the whole file is never held in memory.
def lazy_read():
    with open('eggs.txt', 'r') as file:
        item = file.read(1)
        while item:
            if item == ',':
                return  # stop at the first comma
            yield item
            item = file.read(1)

print(''.join(lazy_read()))

How to read a big binary file and split its content by some marker

In Python, reading a big text file line-by-line is simple:
for line in open('somefile', 'r'): ...
But how do I read a binary file and 'split' (by generator) its content on some given marker, rather than the newline '\n'?
I want something like this:
content = open('somefile', 'r').read()
result = content.split('some_marker')
but, of course, memory-efficient (the file is around 70 GB). Of course, we can't read the file byte by byte (it would be too slow because of the nature of an HDD).
The 'chunks' length (the data between those markers) might differ, theoretically from 1 byte to megabytes.
So, to give an example to sum up, the data looks like this (digits mean bytes here; the data is in a binary format):
12345223-MARKER-3492-MARKER-34834983428623762374632784-MARKER-888-MARKER-...
Is there any simple way to do that (not implementing reading in chunks, splitting the chunks, remembering tails etc.)?
There is no magic in Python that will do it for you, but it's not hard to write. For example:
def split_file(fp, marker):
    BLOCKSIZE = 4096
    result = []
    current = b''
    for block in iter(lambda: fp.read(BLOCKSIZE), b''):
        current += block
        while 1:
            markerpos = current.find(marker)
            if markerpos == -1:
                break
            result.append(current[:markerpos])
            current = current[markerpos + len(marker):]
    result.append(current)
    return result
Memory usage of this function can be further reduced by turning it into a generator, i.e. converting result.append(...) to yield .... This is left as an exercise to the reader.
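For reference, one possible way that exercise could look, keeping the same bytes-oriented logic (split_file_gen is my name for it, not the original author's):
def split_file_gen(fp, marker):
    BLOCKSIZE = 4096
    current = b''
    for block in iter(lambda: fp.read(BLOCKSIZE), b''):
        current += block
        while 1:
            markerpos = current.find(marker)
            if markerpos == -1:
                break
            yield current[:markerpos]
            current = current[markerpos + len(marker):]
    yield current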
The general idea is to mmap the file; you can then run re.finditer over it:
import mmap
import re

with open('somefile', 'rb') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    markers = re.finditer(b'(.*?)MARKER', mf)
    for marker in markers:
        print(marker.group(1))
I haven't tested, but you may want a (.*?)(MARKER|$) or similar in there as well.
Then, it's down to the OS to provide the necessaries for access to the file.
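A possible spelling of that untested suggestion, which also captures the data after the last marker (the re.DOTALL flag and the guard against the trailing zero-length match are my additions):
import mmap
import re

with open('somefile', 'rb') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    # re.DOTALL lets '.' match newline bytes inside the binary data;
    # the final zero-length match at the end of the buffer is skipped explicitly
    for m in re.finditer(b'(.*?)(MARKER|$)', mf, re.DOTALL):
        if m.group(1) or m.group(2):
            print(m.group(1))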
I don't think there's any built-in function for that, but you can "read in chunks" nicely with an iterator to prevent memory inefficiency, similarly to @user4815162342's suggestion:
def split_by_marker(f, marker=b"-MARKER-", block_size=4096):
    current = b''
    while True:
        block = f.read(block_size)
        if not block:  # end-of-file
            yield current
            return
        current += block
        while True:
            markerpos = current.find(marker)
            if markerpos < 0:
                break
            yield current[:markerpos]
            current = current[markerpos + len(marker):]
This way you won't keep all the results in memory at once, and you can still iterate it like:
for line in split_by_marker(open(filename, 'rb')): ...
Just make sure that each "line" does not take too much memory...
Readline itself reads in chunks, splits the chunks, remembers tails, etc. So, no.

Character reading in a variety of ways

I have a text file in this format:
abc? cdfde" nhj.cde' dfwe-df$sde.....
How can I ignore all the special characters, blanks, numbers, line endings, etc., and write only the letters to another file? For example, the above file becomes
abccdfdenhjcdedfwedfsde.....
And from this output file,
I should be able to read a single character at a time, until the end of the file.
I should be able to read two characters at a time, like ab, bc, cc, cd, df, ... from the above file.
I should be able to read three characters at a time, like abc, bcc, ccd, cdf, ... from the above file.
First of all, how can I read only the letters and write them to an external file?
I can read a single character at a time using f.read(1) until the end of the file. How can I apply this to read 2 or 3 characters at a time while advancing only one character per read (that is, for abcd I should read ab, bc, cd, not ab, cd, which is what f.read(2) would give)? Thanks. I am doing this for cryptanalysis work, to analyze ciphertexts by frequency.
If you need to peek ahead (read a few extra characters at a time), you need a buffered file object. The following class does just that:
import io

class AlphaPeekReader(io.BufferedReader):
    def readalpha(self, count):
        "Read one character, and peek ahead (count - 1) *extra* characters"
        val = [self.read1(1)]
        # Find first alpha character
        while not val[0].isalpha():
            if val == ['']:
                return ''  # EOF
            val = [self.read1(1)]
        require = count - len(val)
        peek = self.peek(require * 3)  # Account for a lot of garbage
        if peek == '':  # EOF
            return val[0]
        for c in peek:
            if c.isalpha():
                require -= 1
                val.append(c)
                if not require:
                    break
        # There is a chance here that there were not 'require' alpha chars in peek
        # Return anyway.
        return ''.join(val)
This attempts to find extra characters beyond the one character you are reading, but doesn't make a guarantee it'll be able to satisfy your requirements. It could read fewer if we are at the end of the file or if there is a lot of non-alphabetic text in the next block.
Usage:
with AlphaPeekReader(io.open(filename, 'rb')) as alphafile:
    alphafile.readalpha(3)
Demo, using a file with your example input:
>>> f = io.open('/tmp/test.txt', 'rb')
>>> alphafile = AlphaPeekReader(f)
>>> alphafile.readalpha(3)
'abc'
>>> alphafile.readalpha(3)
'bcc'
>>> alphafile.readalpha(3)
'ccd'
>>> alphafile.readalpha(10)
'cdfdenhjcd'
>>> alphafile.readalpha(10)
'dfdenhjcde'
To use the readalpha() calls in a loop, where you get each character separately plus the next 2 bytes, use iter() with a sentinel:
for alpha_with_extra in iter(lambda: alphafile.readalpha(3), ''):
    pass  # Do something with alpha_with_extra
To read the file line by line, keep only the letters, and write them to another file:
import fileinput

text_file = open("Output.txt", "w")
for line in fileinput.input("sample.txt"):
    outstring = ''.join(ch for ch in line if ch.isalpha())
    text_file.write("%s" % outstring)
text_file.close()

Is there a way to read a file in a loop in python using a separator other than newline

I usually read files like this in Python:
f = open('filename.txt', 'r')
for x in f:
    doStuff(x)
f.close()
However, this splits the file by newlines. I now have a file which has all of its info in one line (45,000 strings separated by commas). While a file of this size is trivial to read in using something like
f = open('filename.txt', 'r')
doStuff(f.read())
f.close()
I am curious whether, for a much larger file that is all on one line, it would be possible to achieve a similar iteration effect as in the first code snippet, but splitting on a comma (or any other character) instead of a newline.
The following function is a fairly straightforward way to do what you want:
def file_split(f, delim=',', bufsize=1024):
    prev = ''
    while True:
        s = f.read(bufsize)
        if not s:
            break
        split = s.split(delim)
        if len(split) > 1:
            yield prev + split[0]
            prev = split[-1]
            for x in split[1:-1]:
                yield x
        else:
            prev += s
    if prev:
        yield prev
You would use it like this:
for item in file_split(open('filename.txt')):
    doStuff(item)
This should be faster than the solution that EMS linked, and will save a lot of memory over reading the entire file at once for large files.
Open the file using open(), then use the file.read(x) method to read (approximately) the next x bytes from the file. You could keep requesting blocks of 4096 characters until you hit end-of-file.
You will have to implement the splitting yourself - you can take inspiration from the csv module, but I don't believe you can use it directly because it wasn't designed to deal with extremely long lines.
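A minimal sketch of that block-by-block approach (the split_on helper and the iter()/partial() idiom are my own choices, not part of the answer; doStuff is the placeholder from the question):
from functools import partial

def split_on(f, delim=',', blocksize=4096):
    """Yield delim-separated items from an already-open text file."""
    tail = ''
    for block in iter(partial(f.read, blocksize), ''):
        pieces = (tail + block).split(delim)
        tail = pieces.pop()  # the last piece may be unfinished; carry it over
        yield from pieces
    if tail:
        yield tail

with open('filename.txt', 'r') as f:
    for item in split_on(f):
        doStuff(item)  # doStuff is the placeholder from the question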
