Using Python 2.7:
msg = self.infile.read(1500)
I am reading 1500-byte pieces from a file inside a possibly infinite while loop.
When the file has been read completely, what happens? Would I start reading again from the start? (I don't want that.)
Is there a simple way to count how many chunks of 1500-byte strings (or fewer bytes for the last one) I have read in total, without saving them?
You can hold a counter:
ct = 0
while 1:
    msg = self.infile.read(1500)
    if len(msg) == 0:
        break
    else:
        ct += 1
After the loop ends the ct variable will hold your number of chunks.
Update
While I'll leave my notes below, the other answers here meet your requirements better.
Specifically, adding a
if not msg:
    break
into the loop will take care of stopping the infinite loop when you reach EOF.
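For reference, a minimal sketch of the original read loop with that guard in place (self.infile as in the question; the processing step is a placeholder):
while True:
    msg = self.infile.read(1500)
    if not msg:  # read() returns an empty string at EOF, so the loop ends here
        break
    # ... process msg here ...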
As you haven't explained why you are grabbing 1500-byte chunks, I'm going to suggest an alternative per-line solution, assuming that self.infile is a file object.
msg = [ line for line in self.infile ]
This will give you a list with each line of the file as a separate element, with index 0 being the first line and index n (i.e. the last element in the list) being the last.
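If the file is large and you only need the lines one at a time, you can also iterate over the file object directly instead of building the list (a sketch; process() is a placeholder for whatever you do with each line):
for line in self.infile:
    process(line)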
You can read chunks using iter with a sentinel, like this:
>>> with open('infile') as infile:
...     for chunk in iter(lambda: infile.read(1500), ''):
...         if len(chunk) == 1500:  # skip if the chunk is smaller than 1500 bytes
...             print chunk  # you can process `chunk` and count it here
Or, if you just want to know the count:
>>> import os
>>> os.stat('infile').st_size/1500
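Note that integer division truncates, so a final partial chunk is not counted. A small sketch that counts the partial last chunk as well, using ceiling division:
>>> import os
>>> size = os.stat('infile').st_size
>>> (size + 1499) // 1500  # rounds up, so a short final chunk counts too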
Related
As the title says, how do I read from stdin or from a file word by word, rather than line by line? I'm dealing with very large files, not guaranteed to have any newlines, so I'd rather not load all of a file into memory. So the standard solution of:
for line in sys.stdin:
    for word in line.split():
        foo(word)
won't work, since line may be too large. Even if it's not too large, it's still inefficient since I don't need the entire line at once. I essentially just need to look at a single word at a time, and then forget it and move on to the next one, until EOF.
EDIT: The suggested "duplicate" is not really a duplicate. It mentions reading line by line and THEN splitting it into words, something I explicitly said I wanted to avoid.
Here's a generator approach. I don't know when you plan to stop reading, so this is a forever loop.
def read_by_word(filename, chunk_size=16):
    '''This generator function opens a file and reads it word by word'''
    buff = ''  # Preserve a partial word from the previous chunk
    with open(filename) as fd:
        while True:
            chunk = fd.read(chunk_size)
            if not chunk:  # Empty means end of file
                if buff:  # Corner case -- file had no whitespace at end
                    yield from buff.split()
                break
            chunk = buff + chunk  # Add any previous reads
            if chunk != chunk.rstrip():
                # This chunk ends with whitespace, so every word in it is complete
                yield from chunk.split()
                buff = ''
            else:
                comp = chunk.split(None, 1)  # Split off at most one complete word
                if len(comp) == 1:
                    buff = chunk  # No whitespace seen yet -- keep accumulating
                    continue
                else:
                    yield comp[0]
                    buff = comp[1]

for word in read_by_word('huge_file_with_few_newlines.txt'):
    print(word)
Here's a straightforward answer, which I'll post if anyone else goes looking and doesn't feel like wading through toxic replies:
word = ''
with open('filename', 'r') as f:
    while (c := f.read(1)):
        if c.isspace():
            if word:
                print(word)  # Here you can do whatever you want, e.g. append to a list
                word = ''
        else:
            word += c
    if word:
        print(word)  # Emit the final word if the file does not end with whitespace
Edit: I will note that it would be faster to read larger byte-chunks at a time and detect words after the fact. Ben Y's answer has an (as of this edit) incomplete solution that might be of assistance. If performance (rather than memory, as was my issue) is a problem, that should probably be your approach. The code will be quite a bit longer, however.
I have created a method which reads a file line by line and checks whether all lines contain the same number of delimiters (see the code below). The trouble with this solution is that it works on a line-by-line basis. Given that some of the files I am dealing with are gigabytes in size, this will take a while to process. Is there a better solution which will 1) validate whether all lines contain the same number of delimiters and 2) not cause any out-of-memory issues? Thanks in advance.
def isValid(fileName):
    with open(fileName, 'rb') as infile:
        for lineNumber, line in enumerate(infile, 1):
            count = line.count(',')
            if lineNumber > 1 and prevCount != count:
                # this line does not contain the same number of delimiters
                return False
            prevCount = count
    return True
You can use all with a generator expression instead:
with open(file_name) as your_file:
    start = your_file.readline().count(',')  # initial count
    print all(i.count(',') == start for i in your_file)
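If you prefer to keep the question's isValid() interface, the same idea wraps into a function directly (a sketch using the names from the question):
def isValid(fileName):
    with open(fileName) as infile:
        start = infile.readline().count(',')  # delimiter count of the first line
        return all(line.count(',') == start for line in infile)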
I propose a different approach (without code):
1. read the file as binary, in chunks of, say, 64 KB
2. count the number of end-of-line tokens in the chunk
3. count the number of delimiters in the chunk, but only up to the position of the last EOL token
4. if the delimiter count is not an exact multiple of the EOL count, stop and return False
5. at EOF, return True
As you'd have to handle the 'overlap' between the last EOL token and the end of the chunk, the logic is a bit more complicated than the 'brute-force' approach, but when dealing with GBs it might pay off.
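For illustration, here is a rough sketch of the chunked idea (naming it isValidChunked here). It checks every complete line's delimiter count exactly, carrying the partial last line over to the next chunk, rather than relying on the divisibility shortcut, and it assumes ',' as the delimiter and '\n' line endings:
def isValidChunked(fileName, chunk_size=64 * 1024):
    with open(fileName, 'rb') as infile:
        expected = infile.readline().count(b',')  # delimiter count of the first line
        tail = b''  # partial line carried over between chunks (the 'overlap')
        while True:
            chunk = infile.read(chunk_size)
            if not chunk:
                # a final line without a trailing newline still has to match
                return not tail or tail.count(b',') == expected
            lines = (tail + chunk).split(b'\n')
            tail = lines.pop()  # the last piece is an incomplete line
            for line in lines:
                if line.count(b',') != expected:
                    return False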
I just noticed that, if you want to stick with simple logic, the original code can be slimmed down a bit:
def isValid(fileName):
    with open(fileName, 'r') as infile:
        count = infile.readline().count(',')
        for line in infile:
            if line.count(',') != count:
                return False
    return True
There is no need to keep the previous line's count, as a single difference decides it; keep only the delimiter count of the first line.
Then, the file needs to be opened as a text file ('r'), not in binary mode.
Lastly, by prefetching the very first line just before the loop, we can drop the call to enumerate.
I have this method:
def get_chunksize(path):
    """
    Breaks a file into chunks and yields the chunk sizes.
    Number of chunks equals the number of available cores.
    Ensures that each chunk ends at an EOL.
    """
    size = os.path.getsize(path)
    cores = mp.cpu_count()
    chunksize = size/cores  # gives truncated integer
    f = open(path)
    while 1:
        start = f.tell()
        f.seek(chunksize, 1)  # Go to the next chunk
        s = f.readline()  # Ensure the chunk ends at the end of a line
        yield start, f.tell()-start
        if not s:
            break
It is supposed to break a file into chunks and return the start of the chunk (in bytes) and the chunk size.
Crucially, the end of a chunk should correspond to the end of a line (which is why the f.readline() behaviour is there), but I am finding that my chunks are not seeking to an EOL at all.
The purpose of the method is to then read chunks which can be passed to a csv.reader instance (via StringIO) for further processing.
I've been unable to spot anything obviously wrong with the function...any ideas why it is not moving to the EOL?
I came up with this rather clunky alternative:
def line_chunker(path):
    size = os.path.getsize(path)
    cores = mp.cpu_count()
    chunksize = size/cores  # gives truncated integer
    f = open(path)
    while True:
        part = f.readlines(chunksize)
        yield csv.reader(StringIO("".join(part)))
        if not part:
            break
This will split the file into chunks with a csv reader for each chunk, but the last chunk is always empty (??) and having to join the list of strings back together is rather clunky.
if not s:
    break
Instead of looking at s to see if you're at the end of the file, check whether you've reached the end of the file directly; this should fix it:
if size == f.tell(): break
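For reference, here is a sketch of the generator with that change applied. I've opened the file in binary mode so the relative seek behaves the same on Python 2 and 3, and used >= as a small safety margin in case the final seek lands past EOF; both tweaks are mine, not part of the original code:
import os
import multiprocessing as mp

def get_chunksize(path):
    """Yield (start, length) pairs, one chunk per core, each ending at an EOL."""
    size = os.path.getsize(path)
    chunksize = size // mp.cpu_count()  # truncated integer
    with open(path, 'rb') as f:
        while True:
            start = f.tell()
            f.seek(chunksize, 1)  # jump roughly one chunk ahead
            f.readline()  # then move forward to the end of the current line
            yield start, min(f.tell(), size) - start
            if f.tell() >= size:  # stop once we have reached (or passed) EOF
                break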
I wouldn't depend on a CSV file having a single record per line, though. I've worked with several CSV files that have strings with newlines:
first,last,message
sue,ee,hello
bob,builder,"hello,
this is some text
that I entered"
jim,bob,I'm not so creative...
Notice that the 2nd record (bob) spans 3 lines; csv.reader can handle this. If the idea is to do some CPU-intensive work on a CSV, I'd create an array of threads, each with a buffer of n records, and have csv.reader pass a record to each thread round-robin, skipping a thread if its buffer is full.
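To illustrate that point, a quick sketch feeding the sample above to csv.reader (the sample string simply repeats the records shown):
import csv

sample = ('first,last,message\n'
          'sue,ee,hello\n'
          'bob,builder,"hello,\n'
          'this is some text\n'
          'that I entered"\n'
          "jim,bob,I'm not so creative...\n")

for record in csv.reader(sample.splitlines(True)):
    print(record)  # the "bob" record comes back as a single three-field row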
Hope this helps - enjoy.
In Python, reading a big text file line-by-line is simple:
for line in open('somefile', 'r'): ...
But how do I read a binary file and 'split' (via a generator) its content on some given marker, rather than the newline '\n'?
I want something like this:
content = open('somefile', 'r').read()
result = content.split('some_marker')
but, of course, memory-efficient (the file is around 70 GB). We also can't read the file byte by byte (it would be too slow because of the nature of the HDD).
The 'chunks' (the data between those markers) might differ in length, theoretically from 1 byte to megabytes.
So, to give an example to sum up, the data looks like that (digits mean bytes here, the data is in a binary format):
12345223-MARKER-3492-MARKER-34834983428623762374632784-MARKER-888-MARKER-...
Is there any simple way to do this (without implementing reading in chunks, splitting the chunks, remembering tails, etc.)?
There is no magic in Python that will do it for you, but it's not hard to write. For example:
def split_file(fp, marker):
    BLOCKSIZE = 4096
    result = []
    current = ''
    for block in iter(lambda: fp.read(BLOCKSIZE), ''):
        current += block
        while 1:
            markerpos = current.find(marker)
            if markerpos == -1:
                break
            result.append(current[:markerpos])
            current = current[markerpos + len(marker):]
    result.append(current)
    return result
Memory usage of this function can be further reduced by turning it into a generator, i.e. converting result.append(...) to yield .... This is left as an exercise to the reader.
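For completeness, a sketch of that exercise (the same logic with result.append(...) turned into yield ...; I've named it split_file_gen here):
def split_file_gen(fp, marker):
    BLOCKSIZE = 4096
    current = ''
    for block in iter(lambda: fp.read(BLOCKSIZE), ''):
        current += block
        while 1:
            markerpos = current.find(marker)
            if markerpos == -1:
                break
            yield current[:markerpos]
            current = current[markerpos + len(marker):]
    yield current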
A general idea: using mmap, you can then re.finditer over it:
import mmap
import re

with open('somefile', 'rb') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    markers = re.finditer('(.*?)MARKER', mf)
    for marker in markers:
        print marker.group(1)
I haven't tested, but you may want a (.*?)(MARKER|$) or similar in there as well.
Then, it's down to the OS to provide the necessaries for access to the file.
I don't think there's any built-in function for that, but you can "read in chunks" nicely with an iterator to avoid memory inefficiency, similar to @user4815162342's suggestion:
def split_by_marker(f, marker="-MARKER-", block_size=4096):
    current = ''
    while True:
        block = f.read(block_size)
        if not block:  # end-of-file
            yield current
            return
        current += block
        while True:
            markerpos = current.find(marker)
            if markerpos < 0:
                break
            yield current[:markerpos]
            current = current[markerpos + len(marker):]
This way you won't keep all the results in memory at once, and you can still iterate over it like:
for line in split_by_marker(open(filename, 'rb')): ...
Just make sure that each "line" does not take too much memory...
Readline itself reads in chunks, splits the chunks, remembers tails, etc. So, no.
I want to read a large file (>5GB), line by line, without loading its entire contents into memory. I cannot use readlines() since it creates a very large list in memory.
Use a for loop on a file object to read it line-by-line. Use with open(...) to let a context manager ensure that the file is closed after reading:
with open("log.txt") as infile:
    for line in infile:
        print(line)
All you need to do is use the file object as an iterator.
for line in open("log.txt"):
    do_something_with(line)
Even better is using a context manager, available in recent Python versions:
with open("log.txt") as fileobject:
    for line in fileobject:
        do_something_with(line)
This will automatically close the file as well.
Please try this:
with open('filename', 'r', buffering=100000) as f:
    for line in f:
        print line
An old school approach:
fh = open(file_name, 'rt')
line = fh.readline()
while line:
    # do stuff with line
    line = fh.readline()
fh.close()
You are better off using an iterator instead.
Relevant: fileinput — Iterate over lines from multiple input streams.
From the docs:
import fileinput
for line in fileinput.input("filename", encoding="utf-8"):
    process(line)
This will avoid copying the whole file into memory at once.
Here's what you do if you don't have newlines in the file:
with open('large_text.txt') as f:
    while True:
        c = f.read(1024)
        if not c:
            break
        print(c, end='')
I couldn't believe that it could be as easy as @john-la-rooy's answer made it seem. So I recreated the cp command using line-by-line reading and writing. It's CRAZY FAST.
#!/usr/bin/env python3.6
import sys
with open(sys.argv[2], 'w') as outfile:
    with open(sys.argv[1]) as infile:
        for line in infile:
            outfile.write(line)
The blaze project has come a long way over the last 6 years. It has a simple API covering a useful subset of pandas features.
dask.dataframe takes care of chunking internally, supports many parallelisable operations and allows you to export slices back to pandas easily for in-memory operations.
import dask.dataframe as dd
df = dd.read_csv('filename.csv')
df.head(10) # return first 10 rows
df.tail(10) # return last 10 rows
# iterate rows
for idx, row in df.iterrows():
    ...
# group by my_field and return mean
df.groupby(df.my_field).value.mean().compute()
# slice by column
df[df.my_field=='XYZ'].compute()
Here's the code for loading text files of any size without causing memory issues. It supports gigabyte-sized files:
https://gist.github.com/iyvinjose/e6c1cb2821abd5f01fd1b9065cbc759d
Download the file data_loading_utils.py and import it into your code.
Usage:
import data_loading_utils

file_name = 'file_name.ext'
CHUNK_SIZE = 1000000

def process_lines(data, eof, file_name):
    # check if end of file reached
    if not eof:
        # process data -- data is one single line of the file
        pass
    else:
        # end of file reached
        pass

data_loading_utils.read_lines_from_file_as_data_chunks(file_name, chunk_size=CHUNK_SIZE, callback=process_lines)
The process_lines method is the callback function. It is called for every line, with the data parameter representing one single line of the file at a time.
You can configure the CHUNK_SIZE variable depending on your machine's hardware configuration.
How about this?
Divide your file into chunks and then read it line by line, because when you read a file, your operating system will cache the next line. If you are reading the file line by line, you are not making efficient use of the cached information.
Instead, divide the file into chunks and load the whole chunk into memory and then do your processing.
import os

def chunks(fh, size=1024):
    while 1:
        startat = fh.tell()
        print startat  # file object's current position from the start
        fh.seek(size, 1)  # offset from current position --> 1
        data = fh.readline()
        yield startat, fh.tell() - startat  # doesn't store the whole list in memory
        if not data:
            break

if os.path.isfile(fname):
    try:
        fh = open(fname, 'rb')
    except IOError as e:  # file --> permission denied
        print "I/O error({0}): {1}".format(e.errno, e.strerror)
    except Exception as e1:  # handle other exceptions such as attribute errors
        print "Unexpected error: {0}".format(e1)
    for ele in chunks(fh):
        fh.seek(ele[0])  # startat
        data = fh.read(ele[1])  # endat
        print data
Thank you! I have recently converted to Python 3 and have been frustrated by using readlines(0) to read large files. This solved the problem. But to get each line, I had to do a couple of extra steps. Each line was preceded by a "b'", which I guess means it was in binary format. Using decode('utf-8') changed it to ASCII.
Then I had to remove a "=\n" in the middle of each line.
Then I split the lines at the newline character.
b_data = (fh.read(ele[1]))  # endat -- this is one chunk of ascii data in binary format
a_data = ((binascii.b2a_qp(b_data)).decode('utf-8'))  # data chunk in 'split' ascii format
data_chunk = (a_data.replace('=\n', '').strip())  # splitting characters removed
data_list = data_chunk.split('\n')  # list containing lines in chunk
#print(data_list,'\n')
#time.sleep(1)
for j in range(len(data_list)):  # iterate through data_list to get each item
    i += 1
    line_of_data = data_list[j]
    print(line_of_data)
Here is the code starting just above "print data" in Arohi's code.
This is the best solution I found for this, and I tried it on a 330 MB file.
lineno = 500
line_length = 8
with open('catfour.txt', 'r') as file:
    file.seek(lineno * (line_length + 2))
    print(file.readline(), end='')
Here line_length is the number of characters in a single line; for example, "abcd" has line length 4.
I have added 2 to the line length to skip the line-ending character(s) and move to the next line.
I realise this has been answered quite some time ago, but here is a way of doing it in parallel without killing your memory overhead (which would be the case if you tried to fire each line into the pool individually). Obviously swap the readJSON_line2 function out for something sensible - it's just to illustrate the point here!
The speedup will depend on the file size and what you are doing with each line, but for the worst-case scenario of a small file that is just read with the JSON reader, I'm seeing performance similar to the single-threaded version with the settings below.
Hopefully useful to someone out there:
def readJSON_line2(linesIn):
    #Function for reading a chunk of json lines
    '''
    Note, this function is nonsensical. A user would never use the approach suggested
    for reading in a JSON file,
    its role is to evaluate the MT approach for full line by line processing to both
    increase speed and reduce memory overhead
    '''
    import json
    linesRtn = []
    for lineIn in linesIn:
        if lineIn.strip():  # skip blank lines
            lineRtn = json.loads(lineIn)
        else:
            lineRtn = ""
        linesRtn.append(lineRtn)
    return linesRtn
# -------------------------------------------------------------------
if __name__ == "__main__":
    import multiprocessing as mp

    path1 = "C:\\user\\Documents\\"
    file1 = "someBigJson.json"

    nCPUs = mp.cpu_count()  # Number of worker processes
    pool = mp.Pool(nCPUs)   # SMP worker pool

    nBuffer = 20*nCPUs  # How many chunks are queued up (so cpus aren't waiting on processes spawning)
    nChunk = 1000  # How many lines are in each chunk
    #Both of the above will require balancing speed against memory overhead

    iJob = 0  #Tracker for SMP jobs submitted into pool
    iiJob = 0  #Tracker for SMP jobs extracted back out of pool

    jobs = []  #SMP job holder
    MTres3 = []  #Final result holder
    chunk = []
    iBuffer = 0  # Buffer line count
    with open(path1+file1) as f:
        for line in f:
            #Send to the chunk
            if len(chunk) < nChunk:
                chunk.append(line)
            else:
                #Chunk full
                #Don't forget to add the current line to chunk
                chunk.append(line)
                #Then add the chunk to the buffer (submit to SMP pool)
                jobs.append(pool.apply_async(readJSON_line2, args=(chunk,)))
                iJob += 1
                iBuffer += 1
                #Clear the chunk for the next batch of entries
                chunk = []
                #Buffer is full, any more chunks submitted would cause undue memory overhead
                #(Partially) empty the buffer
                if iBuffer >= nBuffer:
                    temp1 = jobs[iiJob].get()
                    for rtnLine1 in temp1:
                        MTres3.append(rtnLine1)
                    iBuffer -= 1
                    iiJob += 1

    #Submit the last chunk if it exists (as it would not have been submitted to the SMP buffer)
    if chunk:
        jobs.append(pool.apply_async(readJSON_line2, args=(chunk,)))
        iJob += 1
        iBuffer += 1

    #And gather up the last of the buffer, including the final chunk
    while iiJob < iJob:
        temp1 = jobs[iiJob].get()
        for rtnLine1 in temp1:
            MTres3.append(rtnLine1)
        iiJob += 1

    #Cleanup
    del chunk, jobs, temp1
    pool.close()
This might be useful when you want to work in parallel and read the data in chunks, while keeping the chunk boundaries clean at newlines.
def readInChunks(fileObj, chunkSize=1024):
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        while data[-1:] != '\n':
            extra = fileObj.read(1)
            if not extra:  # EOF reached without a trailing newline
                break
            data += extra
        yield data
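A quick usage sketch (the file name and process() are placeholders):
with open('large_file.txt') as f:
    for chunk in readInChunks(f, chunkSize=1024 * 1024):
        process(chunk)  # each chunk ends at a newline boundary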