I am trying to copy bytes X through Y from a huge data file to a new file. I got X and Y by using f.readline() and f.tell(). Is there a faster way to do this than the code below?
import os
a = 300  # Beginning byte location
b = 208000  # Ending byte location
def file_split(x, y):
    g = open('C:/small_file.dat', 'wb')
    with open('C:/huge_data_file.dat', 'rb') as f:
        f.seek(x, os.SEEK_SET)  # Sets file pointer to x
        while f.tell() < y:
            byte = f.read(1)
            if byte == b'':  # b'' would indicate EOF
                break
            g.write(byte)
    g.close()
file_split(a, b)
You could start with a larger block size than 1 byte? If it's never going to be megabytes worth of data, just go for g.write(f.read(b-a)) and you're done, no need for the loop. If it's going to be megabytes, you may want to do it block by block, making sure the last block is shorter to not exceed b.
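The block-by-block approach described above might be sketched as follows. The function name `copy_range` and the 64 KiB default block size are assumptions chosen for illustration, not part of the original answer.

```python
import os

def copy_range(src_path, dst_path, start, end, block_size=64 * 1024):
    """Copy bytes [start, end) from src_path to dst_path in blocks."""
    remaining = end - start
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        src.seek(start, os.SEEK_SET)
        while remaining > 0:
            # The last block is shortened so we never read past `end`.
            block = src.read(min(block_size, remaining))
            if not block:  # EOF reached before `end`
                break
            dst.write(block)
            remaining -= len(block)
```

With the values from the question, this would be called as copy_range('C:/huge_data_file.dat', 'C:/small_file.dat', a, b).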
I am currently working on an application which requires reading all the input from a file until a certain character is encountered.
By using the code:
file=open("Questions.txt",'r')
c=file.readlines()
c=[x.strip() for x in c]
readlines() splits the input at every \n, and strip() then removes it, so each line becomes a separate string in the list c.
But I want each list element to run up to the point where a special character sequence is encountered, like this:
if the input file has the contents:
1.Hai
2.Bye\-1
3.Hello
4.OAPd\-1
then I want to get a list as
c=['1.Hai\n2.Bye','3.Hello\n4.OAPd']
Please help me in doing this.
The easiest way would be to read the file in as a single string and then split it across your separator:
with open('myFileName') as myFile:
    text = myFile.read()
result = text.split(separator)  # use your \-1 (whatever that means) here
If your file is very large, holding the complete contents in memory as a single string for .split() may not be desirable (and holding the complete contents in the list after the split is probably not desirable either). In that case you could read it in chunks:
CHUNK_SIZE = 4096  # amount to read per call; tune as needed
def each_chunk(stream, separator):
    buffer = ''
    while True:  # until EOF
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:  # EOF?
            yield buffer
            break
        buffer += chunk
        while True:  # until no separator is found
            try:
                part, buffer = buffer.split(separator, 1)
            except ValueError:
                break
            else:
                yield part
with open('myFileName') as myFile:
    for chunk in each_chunk(myFile, separator='\\-1\n'):
        print(chunk)  # not holding in memory, but printing chunk by chunk
I used "*" instead of "-1", I'll let you make the appropriate changes.
s = '1.Hai\n2.Bye*3.Hello\n4.OAPd*'
temp = ''
results = []
for char in s:
    if char == '*':  # compare by value, not identity
        results.append(temp)
        temp = ''  # reset to an empty string, not a list
    else:
        temp += char
if len(temp) > 0:
    results.append(temp)
I have a .raw file containing a 52-line HTML header followed by the data itself. The data is encoded as little-endian 24-bit signed integers, and I want to convert it to integers in an ASCII file. I use Python 3.
I tried to 'unpack' the entire file with the following code found in this post:
import sys
import chunk
import struct
f1 = open('/Users/anais/Documents/CR_lab/Lab_files/labtest.raw', mode = 'rb')
data = struct.unpack('<i', chunk + ('\0' if chunk[2] < 128 else '\xff'))
But I get this error message:
TypeError: 'module' object is not subscriptable
EDIT
It seems this is better:
data = struct.unpack('<i','\0'+ bytes)[0] >> 8
But I still get an error message:
TypeError: must be str, not type
Easy to fix I presume?
That's not a nice file to process in Python! Python is great for processing text files, because it reads them in big chunks into an internal buffer and then iterates over lines, but you cannot easily access binary data that comes after text read like that. Additionally, the struct module has no support for 24-bit values.
The only way I can imagine is to read the file one byte at a time: first skip 52 end-of-line characters, then read bytes 3 at a time, concatenate them into a 4-byte string and unpack it.
Possible code could be:
import struct
eol = b'\n'  # or whatever is the end of line in your file
nlines = 52  # number of lines to skip
with open('/Users/anais/Documents/CR_lab/Lab_files/labtest.raw', mode = 'rb') as f1:
    for i in range(nlines):  # process nlines lines
        t = b''  # to store the content of each line
        while True:
            x = f1.read(1)  # one byte at a time
            if x == eol:  # ok we have one full line
                break
            else:
                t += x  # else concatenate into current line
        print(t)  # to check the initial 52 lines
    while True:
        t = bytes((0,))  # struct only knows how to process 4-byte ints
        for i in range(3):  # so build one starting with a null byte
            t += f1.read(1)
        # print(t)
        if(len(t) == 1): break  # reached end of file
        if(len(t) < 4):  # reached end of file with an incomplete value
            print("Remaining bytes at end of file", t)
            break
        # the trick is that the integer division by 256 skips the initial 0 byte and keeps the sign
        i = struct.unpack('<i', t)[0]//256  # // for Python 3, only / for Python 2
        print(i, hex(i))  # or any other more useful processing
Remark: the above code assumes that your description of 52 lines (each terminated by an end of line) is accurate, but the image shown suggests that the last line is not. In that case, you should first skip 51 lines and then skip the content of the last line.
def skiplines(fd, nlines, eol):
    for i in range(nlines):  # process nlines lines
        t = b''  # to store the content of each line
        while True:
            x = fd.read(1)  # one byte at a time
            if x == eol:  # ok we have one full line
                break
            else:
                t += x  # else concatenate into current line
        # print(t)  # to check the initial 52 lines
with open('/Users/anais/Documents/CR_lab/Lab_files/labtest.raw', mode = 'rb') as f1:
    skiplines(f1, 51, b'\n')  # skip 51 lines terminated with a \n
    skiplines(f1, 1, b'>')  # skip last line assuming it ends at the >
    ...
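As an alternative to padding each sample to four bytes for struct, Python 3's built-in int.from_bytes can decode a 3-byte little-endian signed value directly. The following is a minimal sketch that assumes the header has already been skipped; the helper name `iter_int24` is hypothetical and the sample bytes are made up for the demonstration.

```python
import io

def iter_int24(stream):
    """Yield signed 24-bit little-endian integers read from a binary stream."""
    while True:
        raw = stream.read(3)
        if len(raw) < 3:  # EOF, or an incomplete trailing sample
            break
        yield int.from_bytes(raw, byteorder='little', signed=True)

# Small in-memory demonstration: 1, -1 and the most negative 24-bit value
sample = io.BytesIO(b'\x01\x00\x00' + b'\xff\xff\xff' + b'\x00\x00\x80')
print(list(iter_int24(sample)))  # [1, -1, -8388608]
```

Passing the already-positioned file object (f1 above) instead of the BytesIO object would decode the real data the same way.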
I want to read through a binary file.
Googling "python binary eof" led me here.
Now, the questions:
Why does the container (x in the SO answer) contain not a single (current) byte but a whole bunch of them? What am I doing wrong?
If that is how it should be and I am doing nothing wrong, how do I read a single byte? I mean, is there any way to detect EOF while reading the file with the read(1) method?
To quote the documentation:
file.read([size])
Read at most size bytes from the file (less if the read hits EOF before obtaining size bytes). If the size argument is negative or omitted, read all data until EOF is reached. The bytes are returned as a string object. An empty string is returned when EOF is encountered immediately. (For certain files, like ttys, it makes sense to continue reading after an EOF is hit.) Note that this method may call the underlying C function fread() more than once in an effort to acquire as close to size bytes as possible. Also note that when in non-blocking mode, less data than was requested may be returned, even if no size parameter was given.
That means (for a regular file):
f.read(1) will return a bytes object containing either 1 byte, or 0 bytes if EOF was reached
f.read(2) will return a bytes object containing either 2 bytes, or 1 byte if EOF is reached after the first byte, or 0 bytes if EOF is encountered immediately.
...
If you want to read your file one byte at a time, you will have to read(1) in a loop and test for "emptiness" of the result:
# From answer by @Daniel
with open(filename, 'rb') as f:
    while True:
        b = f.read(1)
        if not b:
            # eof
            break
        do_something(b)
If you want to read your file by "chunk" of say 50 bytes at a time, you will have to read(50) in a loop:
with open(filename, 'rb') as f:
    while True:
        b = f.read(50)
        if not b:
            # eof
            break
        do_something(b)  # <- be prepared to handle a last chunk of length < 50
                         #    if the file length *is not* a multiple of 50
In fact, you may even break one iteration sooner:
with open(filename, 'rb') as f:
    while True:
        b = f.read(50)
        do_something(b)  # <- be prepared to handle a last chunk of size 0
                         #    if the file length *is* a multiple of 50
                         #    (incl. 0 byte-length file!)
                         #    and be prepared to handle a last chunk of length < 50
                         #    if the file length *is not* a multiple of 50
        if len(b) < 50:
            break
Concerning the other part of your question:
Why does the container [..] contain [..] a whole bunch of them [bytes]?
Referring to that code:
for x in file:
    i=i+1
    print(x)
To quote again the doc:
A file object is its own iterator, [..]. When a file is used as an iterator, typically in a for loop (for example, for line in f: print line.strip()), the next() method is called repeatedly. This method returns the next input line, or raises StopIteration when EOF is hit when the file is open for reading (behavior is undefined when the file is open for writing).
The code above reads a binary file line by line, that is, stopping at each occurrence of the EOL character (\n). Usually, that leads to chunks of varying length, as most binary files contain occurrences of that character randomly distributed.
I wouldn't encourage you to read a binary file that way. Please prefer a solution based on read(size).
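A small in-memory illustration of the difference, using io.BytesIO as a stand-in for a real file (the sample bytes are made up):

```python
import io

data = b'\x00\x01\n\x02\x03\x04\x05\n\x06'

# Iterating splits on b'\n', so the pieces have varying lengths.
by_line = list(io.BytesIO(data))

# read(size) returns fixed-size pieces; only the last one may be shorter.
stream = io.BytesIO(data)
by_size = list(iter(lambda: stream.read(4), b''))

print(by_line)  # [b'\x00\x01\n', b'\x02\x03\x04\x05\n', b'\x06']
print(by_size)  # [b'\x00\x01\n\x02', b'\x03\x04\x05\n', b'\x06']
```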
"" will signify the end of the file
with open(filename, 'rb') as f:
    # in binary mode the end-of-file sentinel is the empty bytes object b""
    for ch in iter(lambda: f.read(1), b""):  # keep calling f.read(1) until end of the data
        print(ch)
Reading byte-by-byte:
with open(filename, 'rb') as f:
    while True:
        b = f.read(1)
        if not b:
            # eof
            break
        do_something(b)
Here is what I did. The call to read returns a falsy value when it encounters the end of the file, and this terminates the loop. Using while ch != "": copied the image but it gave me a hung loop.
from sys import argv
donor = argv[1]
recipient = argv[2]
# read from donor and write into recipient
# with statement ends, file gets closed
with open(donor, "rb") as fp_in:
    with open(recipient, "wb") as fp_out:
        ch = fp_in.read(1)
        while ch:
            fp_out.write(ch)
            ch = fp_in.read(1)
In Python, reading a big text file line-by-line is simple:
for line in open('somefile', 'r'): ...
But how to read a binary file and 'split' (by generator) its content by some given marker, not the newline '\n'?
I want something like that:
content = open('somefile', 'r').read()
result = content.split('some_marker')
but, of course, memory-efficient (the file is around 70GB). Of course, we can't read the file by every byte (it'll be too slow because of the HDD nature).
The 'chunks' length (the data between those markers) might differ, theoretically from 1 byte to megabytes.
So, to give an example to sum up, the data looks like that (digits mean bytes here, the data is in a binary format):
12345223-MARKER-3492-MARKER-34834983428623762374632784-MARKER-888-MARKER-...
Is there any simple way to do that (not implementing reading in chunks, splitting the chunks, remembering tails etc.)?
There is no magic in Python that will do it for you, but it's not hard to write. For example:
def split_file(fp, marker):
    # fp must be opened in binary mode and marker must be a bytes object
    BLOCKSIZE = 4096
    result = []
    current = b''
    for block in iter(lambda: fp.read(BLOCKSIZE), b''):
        current += block
        while 1:
            markerpos = current.find(marker)
            if markerpos == -1:
                break
            result.append(current[:markerpos])
            current = current[markerpos + len(marker):]
    result.append(current)
    return result
Memory usage of this function can be further reduced by turning it into a generator, i.e. converting result.append(...) to yield .... This is left as an exercise to the reader.
A general idea is to use mmap; you can then run re.finditer over it:
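import mmap
import re
with open('somefile', 'rb') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    # bytes pattern (needed for mmap in Python 3); DOTALL lets '.' match newline bytes too
    markers = re.finditer(b'(.*?)MARKER', mf, flags=re.DOTALL)
    for marker in markers:
        print(marker.group(1))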
I haven't tested, but you may want a (.*?)(MARKER|$) or similar in there as well.
Then, it's down to the OS to provide the necessaries for access to the file.
I don't think there's any built-in function for that, but you can "read-in-chunks" nicely with an iterator to prevent memory-inefficiency, similarly to @user4815162342's suggestion:
def split_by_marker(f, marker = b"-MARKER-", block_size = 4096):
    # f is opened in binary mode, so the marker and buffer are bytes
    current = b''
    while True:
        block = f.read(block_size)
        if not block:  # end-of-file
            yield current
            return
        current += block
        while True:
            markerpos = current.find(marker)
            if markerpos < 0:
                break
            yield current[:markerpos]
            current = current[markerpos + len(marker):]
This way you won't save all the results in the memory at once, and you can still iterate it like:
for line in split_by_marker(open(filename, 'rb')): ...
Just make sure that each "line" does not take too much memory...
Readline itself reads in chunks, splits the chunks, remembers tails, etc. So, no.
I have a text file which contains a time stamp on each line. My goal is to find the time range. All the times are in order so the first line will be the earliest time and the last line will be the latest time. I only need the very first and very last line. What would be the most efficient way to get these lines in python?
Note: These files are relatively large in length, about 1-2 million lines each and I have to do this for several hundred files.
To read both the first and final line of a file you could...
open the file, ...
... read the first line using built-in readline(), ...
... seek (move the cursor) to the end of the file, ...
... step backwards until you encounter EOL (line break) and ...
... read the last line from there.
def readlastline(f):
    f.seek(-2, 2)              # Jump to the second last byte.
    while f.read(1) != b"\n":  # Until EOL is found ...
        f.seek(-2, 1)          # ... jump back, over the read byte plus one more.
    return f.read()            # Read all data from this point on.
with open(file, "rb") as f:
    first = f.readline()
    last = readlastline(f)
Jump to the second last byte directly so that a trailing newline character does not cause an empty line to be returned*.
The current offset is pushed ahead by one every time a byte is read so the stepping backwards is done two bytes at a time, past the recently read byte and the byte to read next.
The whence parameter passed to f.seek(offset, whence=0) indicates that seek should move to a position offset bytes relative to...
0 or os.SEEK_SET = The beginning of the file.
1 or os.SEEK_CUR = The current position.
2 or os.SEEK_END = The end of the file.
* This matches the default behavior of most applications, including print and echo, which append a newline to every line written. It has no effect on files whose last line is missing the trailing newline character.
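The three whence values can be tried out on an in-memory stream. This is only an illustrative sketch: io.BytesIO stands in for a file opened in binary mode, and the sample content is made up.

```python
import io
import os

f = io.BytesIO(b'first\nsecond\nlast\n')  # 18 bytes, stands in for a real file

f.seek(6, os.SEEK_SET)        # 6 bytes from the beginning
from_start = f.read(6)        # b'second'

f.seek(-6, os.SEEK_CUR)       # back 6 bytes from the current position
from_current = f.read(6)      # b'second' again

f.seek(-5, os.SEEK_END)       # 5 bytes before the end
from_end = f.read()           # b'last\n'

print(from_start, from_current, from_end)
```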
Efficiency
1-2 million lines each and I have to do this for several hundred files.
I timed this method and compared it against the top answer.
10k iterations processing a file of 6k lines totalling 200kB: 1.62s vs 6.92s.
100 iterations processing a file of 6k lines totalling 1.3GB: 8.93s vs 86.95s.
Millions of lines would increase the difference a lot more.
Exact code used for timing:
with open(file, "rb") as f:
    first = f.readline()  # Read and store the first line.
    for last in f: pass   # Read all lines, keep final value.
Amendment
A more complex, and harder to read, variation to address comments and issues raised since.
Return empty string when parsing empty file, raised by comment.
Return all content when no delimiter is found, raised by comment.
Avoid relative offsets to support text mode, raised by comment.
UTF16/UTF32 hack, noted by comment.
Also adds support for multibyte delimiters, readlast(b'X<br>Y', b'<br>', fixed=False).
Please note that this variation is really slow for large files because of the non-relative offsets needed in text mode. Modify to your need, or do not use it at all as you're probably better off using f.readlines()[-1] with files opened in text mode.
#!/bin/python3
from os import SEEK_END
def readlast(f, sep, fixed=True):
    r"""Read the last segment from a file-like object.
    :param f: File to read last line from.
    :type f: file-like object
    :param sep: Segment separator (delimiter).
    :type sep: bytes, str
    :param fixed: Treat data in ``f`` as a chain of fixed size blocks.
    :type fixed: bool
    :returns: Last line of file.
    :rtype: bytes, str
    """
    bs = len(sep)
    step = bs if fixed else 1
    if not bs:
        raise ValueError("Zero-length separator.")
    try:
        o = f.seek(0, SEEK_END)
        o = f.seek(o-bs-step)          # - Ignore trailing delimiter 'sep'.
        while f.read(bs) != sep:       # - Until reaching 'sep': Read sep-sized block
            o = f.seek(o-step)         #   and then seek to the block to read next.
    except (OSError,ValueError):       # - Beginning of file reached.
        f.seek(0)
    return f.read()
def test_readlast():
    from io import BytesIO, StringIO
    # Text mode.
    f = StringIO("first\nlast\n")
    assert readlast(f, "\n") == "last\n"
    # Bytes.
    f = BytesIO(b'first|last')
    assert readlast(f, b'|') == b'last'
    # Bytes, UTF-8.
    f = BytesIO("X\nY\n".encode("utf-8"))
    assert readlast(f, b'\n').decode() == "Y\n"
    # Bytes, UTF-16.
    f = BytesIO("X\nY\n".encode("utf-16"))
    assert readlast(f, b'\n\x00').decode('utf-16') == "Y\n"
    # Bytes, UTF-32.
    f = BytesIO("X\nY\n".encode("utf-32"))
    assert readlast(f, b'\n\x00\x00\x00').decode('utf-32') == "Y\n"
    # Multichar delimiter.
    f = StringIO("X<br>Y")
    assert readlast(f, "<br>", fixed=False) == "Y"
    # Make sure you use the correct delimiters.
    seps = { 'utf8': b'\n', 'utf16': b'\n\x00', 'utf32': b'\n\x00\x00\x00' }
    assert "\n".encode('utf8' ) == seps['utf8']
    assert "\n".encode('utf16')[2:] == seps['utf16']
    assert "\n".encode('utf32')[4:] == seps['utf32']
    # Edge cases.
    edges = (
        # Text , Match
        (""    , ""  ), # Empty file, empty string.
        ("X"   , "X" ), # No delimiter, full content.
        ("\n"  , "\n"),
        ("\n\n", "\n"),
        # UTF16/32 encoded U+270A (b"\n\x00\n'\n\x00"/utf16)
        (b'\n\xe2\x9c\x8a\n'.decode(), b'\xe2\x9c\x8a\n'.decode()),
    )
    for txt, match in edges:
        for enc,sep in seps.items():
            assert readlast(BytesIO(txt.encode(enc)), sep).decode(enc) == match
if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        with open(path) as f:
            print(f.readline()    , end="")
            print(readlast(f,"\n"), end="")
docs for io module
with open(fname, 'rb') as fh:
    first = next(fh).decode()
    fh.seek(-1024, 2)
    last = fh.readlines()[-1].decode()
The variable value here is 1024: it represents the average line length. I chose 1024 only as an example. If you have an estimate of the average line length, you could just use that value times 2.
Since you have no idea whatsoever about the possible upper bound for the line length, the obvious solution would be to loop over the file:
for line in fh:
    pass
last = line
You don't need to bother with the binary flag; you could just use open(fname).
ETA: Since you have many files to work on, you could take a sample of a couple of dozen files using random.sample and run this code on them to determine the length of the last line, starting with an a priori large value of the position shift (say 1 MB). This will help you estimate the value for the full run.
Here's a modified version of SilentGhost's answer that will do what you want.
with open(fname, 'rb') as fh:
    first = next(fh)
    offs = -100
    while True:
        fh.seek(offs, 2)
        lines = fh.readlines()
        if len(lines) > 1:
            last = lines[-1]
            break
        offs *= 2
    print(first)
    print(last)
No need for an upper bound for line length here.
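One caveat with the loop above: once the doubled offset exceeds the file size, seeking to a position before the start of the file raises an error, which happens for short or single-line files. The following is a hedged sketch that clamps the window to the file size; the function name `first_and_last` and the initial window of 100 bytes are illustrative choices, not part of the original answer.

```python
import os

def first_and_last(fname, window=100):
    """Return (first_line, last_line) as bytes, reading only the file's ends."""
    window = max(1, window)  # guard against a zero or negative window
    with open(fname, 'rb') as fh:
        first = fh.readline()
        size = fh.seek(0, os.SEEK_END)
        while True:
            window = min(window, size)  # never seek before the start
            fh.seek(size - window)
            lines = fh.readlines()
            # More than one line in the window means the last one is complete;
            # if the window already covers the whole file, take what we have.
            if len(lines) > 1 or window == size:
                last = lines[-1] if lines else first
                return first, last
            window *= 2
```

For an empty file this returns two empty bytes objects, and for a single-line file the same line is returned twice.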
Can you use unix commands? I think using head -1 and tail -n 1 are probably the most efficient methods. Alternatively, you could use a simple fid.readline() to get the first line and fid.readlines()[-1], but that may take too much memory.
This is my solution, compatible also with Python3. It does also manage border cases, but it misses utf-16 support:
import os
def tail(filepath):
    """
    @author Marco Sulla (marcosullaroma#gmail.com)
    @date May 31, 2016
    """
    try:
        filepath.is_file
        fp = str(filepath)
    except AttributeError:
        fp = filepath
    with open(fp, "rb") as f:
        size = os.stat(fp).st_size
        start_pos = 0 if size - 1 < 0 else size - 1
        if start_pos != 0:
            f.seek(start_pos)
            char = f.read(1)
            if char == b"\n":
                start_pos -= 1
                f.seek(start_pos)
            if start_pos == 0:
                f.seek(start_pos)
            else:
                char = ""
                for pos in range(start_pos, -1, -1):
                    f.seek(pos)
                    char = f.read(1)
                    if char == b"\n":
                        break
        return f.readline()
It's inspired by Trasp's answer and AnotherParker's comment.
First open the file in read mode. Then use the readlines() method to read all the lines into a list. Now you can use list indexing to get the first and last lines of the file.
with open('file.txt', 'rb') as a:
    lines = a.readlines()
if lines:
    first_line = lines[0]  # index, not a slice, so this is a line rather than a list
    last_line = lines[-1]
w = open('file.txt', 'r')
print('first line is : ', w.readline())
for line in w:
    x = line
print('last line is : ', x)
w.close()
The for loop runs through the lines and x gets the last line on the final iteration.
with open("myfile.txt") as f:
    lines = f.readlines()
first_row = lines[0]
print(first_row)
last_row = lines[-1]
print(last_row)
Here is an extension of @Trasp's answer that has additional logic for handling the corner case of a file that has only one line. It may be useful to handle this case if you repeatedly want to read the last line of a file that is continuously being updated. Without this, if you try to grab the last line of a file that has just been created and has only one line, IOError: [Errno 22] Invalid argument will be raised.
def tail(filepath):
    with open(filepath, "rb") as f:
        first = f.readline()       # Read the first line.
        f.seek(-2, 2)              # Jump to the second last byte.
        while f.read(1) != b"\n":  # Until EOL is found...
            try:
                f.seek(-2, 1)      # ...jump back the read byte plus one more.
            except IOError:
                f.seek(-1, 1)
                if f.tell() == 0:
                    break
        last = f.readline()        # Read last line.
    return last
Nobody mentioned using reversed:
f = open(file, "r")
r = reversed(f.readlines())
last_line_of_file = next(r)  # next(r) works in Python 2.6+ and Python 3
Getting the first line is trivially easy. For the last line, presuming you know an approximate upper bound on the line length, os.lseek some amount from SEEK_END find the second to last line ending and then readline() the last line.
def last_line(filename):  # wrapper with an illustrative name, so the return statements are valid
    with open(filename, "rb") as f:  # Needs to be in binary mode for the seek from the end to work
        first = f.readline()
        if f.read(1) == b'':  # Nothing after the first line: single-line file.
            return first
        f.seek(-2, 2)  # Jump to the second last byte.
        while f.read(1) != b"\n":  # Until EOL is found...
            f.seek(-2, 1)  # ...jump back the read byte plus one more.
        last = f.readline()  # Read last line.
        return last
This is a modified version of the answers above that handles the case where the file contains only one line.