Python read huge file line by line with utf-8 encoding

Python read huge file line by line with utf-8 encoding - python

I want to read some quite huge files(to be precise: the google ngram 1 word dataset) and count how many times a character occurs. Now I wrote this script:
import fileinput
files = ['../../datasets/googlebooks-eng-all-1gram-20090715-%i.csv' % value for value in range(0,9)]
charcounts = {}
lastfile = ''
for line in fileinput.input(files):
line = line.strip()
data = line.split('\t')
for character in list(data[0]):
if (not character in charcounts):
charcounts[character] = 0
charcounts[character] += int(data[1])
if (fileinput.filename() is not lastfile):
print(fileinput.filename())
lastfile = fileinput.filename()
if(fileinput.filelineno() % 100000 == 0):
print(fileinput.filelineno())
print(charcounts)
which works fine, until it reaches approx. line 700.000 of the first file, I then get this error:
../../datasets/googlebooks-eng-all-1gram-20090715-0.csv
100000
200000
300000
400000
500000
600000
700000
Traceback (most recent call last):
File "charactercounter.py", line 5, in <module>
for line in fileinput.input(files):
File "C:\Python31\lib\fileinput.py", line 254, in __next__
line = self.readline()
File "C:\Python31\lib\fileinput.py", line 349, in readline
self._buffer = self._file.readlines(self._bufsize)
File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 7771: cha
racter maps to <undefined>
To solve this I searched the web a bit, and came up with this code:
import fileinput
files = ['../../datasets/googlebooks-eng-all-1gram-20090715-%i.csv' % value for value in range(0,9)]
charcounts = {}
lastfile = ''
for line in fileinput.input(files,False,'',0,'r',fileinput.hook_encoded('utf-8')):
line = line.strip()
data = line.split('\t')
for character in list(data[0]):
if (not character in charcounts):
charcounts[character] = 0
charcounts[character] += int(data[1])
if (fileinput.filename() is not lastfile):
print(fileinput.filename())
lastfile = fileinput.filename()
if(fileinput.filelineno() % 100000 == 0):
print(fileinput.filelineno())
print(charcounts)
but the hook I now use tries to read the entire, 990MB, file into the memory at once, which kind of crashes my pc. Does anyone know how to rewrite this code so that it actually works?
p.s: the code hasn't even run all the way yet, so I don't even know if it does what it has to do, but for that to happen I first need to fix this bug.
Oh, and I use Python 3.2

I do not know why fileinput does not work as expected.
I suggest you use the open function instead. The return value can be iterated over and will return lines, just like fileinput.
The code will then be something like:
for filename in files:
print(filename)
for filelineno, line in enumerate(open(filename, encoding="utf-8")):
line = line.strip()
data = line.split('\t')
# ...
Some documentation links: enumerate, open, io.TextIOWrapper (open returns an instance of TextIOWrapper).

The problem is that fileinput doesn't use file.xreadlines(), which reads line by line, but file.readline(bufsize), which reads bufsize bytes at once (and turns that into a list of lines). You are providing 0 for the bufsize parameter of fileinput.input() (which is also the default value). Bufsize 0 means that the whole file is buffered.
Solution: provide a reasonable bufsize.

This works for me: you can use "utf-8" in the hook definition. I used it on a 50GB/200M lines file with no problem.
fi = fileinput.FileInput(openhook=fileinput.hook_encoded("iso-8859-1"))

Could you try to read not a whole file, but a part of it as binary, then decode(), then proccess, then call the function again to read another part?

I don't if the one I have is the latest version (and I don't remember how I read them), but...
$ file -i googlebooks-eng-1M-1gram-20090715-0.csv
googlebooks-eng-1M-1gram-20090715-0.csv: text/plain; charset=us-ascii
Have you tried fileinput.hook_encoded('ascii') or fileinput.hook_encoded('latin_1')? Not sure why this would make a difference, since I think the these are just subsets of unicode with the same mapping, but worth a try.
EDIT I think this might be a bug in fileinput, neither of these work.

If you are worried about the mem usage, why not read by line using readline()? This will get rid of the memory issues you are running into. Currently you are reading the full file before performing any actions on the fileObj, with readline() you are not saving the data, merely searching it on a per-line basis.
def charCount1(_file, _char):
result = []
file = open(_file, encoding="utf-8")
data = file.read()
file.close()
for index, line in enumerate(data.split("\n")):
if _char in line:
result.append(index)
return result
def charCount2(_file, _char):
result = []
count = 0
file = open(_file, encoding="utf-8")
while 1:
line = file.readline()
if _char in line:
result.append(count)
count += 1
if not line: break
file.close()
return result
I didn't have a chance to really look over your code but the above samples should give you an idea of how to make the appropriate changes to your structure. charCount1() demonstrates your method which caches the entire file in a single call from read(). I tested your method out on a +400MB text file and the python.exe process went as high as +900MB. when you run charCount2(), the python.exe process shouldn't exceed more than a few MB's (provided you haven't bulked up the size with other code) ;)

Related

Inplace file encoding

I'm trying to remove duplicates from a csv file with a lot of data. The removal works as intended but I can't seem to figure out how to change encoding on inplace removal. Googling for an answer didn't help. Any of you got a suggestion?
This is my code:
seen = set()
for line in fileinput.FileInput('Dupes.csv', inplace=1):
if line in seen: continue # skip duplicated line
seen.add(line)
print(line, end='')

This script works fine with me.
import fileinput
import sys
encoding = 'utf8'
end = '\n'
seen = set()
dupeCount = 0
for line in fileinput.FileInput('Dupes.csv', inplace=1, mode='rU'):
stripped = line.strip()
if stripped in seen:
dupeCount += 1
continue
seen.add(stripped)
# Sends the output in the right representation
sys.stdout.buffer.write(stripped.encode(encoding) + end.encode(encoding))
print('Removed %d dupes' % dupeCount)
The idea is to read the file with the right mode, and then write to the file thru stdout in the correct encoding, which is done by writing everything in the utf8's byte representation.
Tested with accents, seems to work.

Searching for a string in a file is not working in Python

I am using this code to find a string in Python:
buildSucceeded = "Build succeeded."
datafile = r'C:\PowerBuild\logs\Release\BuildAllPart2.log'
with open(datafile, 'r') as f:
for line in f:
if buildSucceeded in line:
print(line)
I am quite sure there is the string in the file although it does not return anything.
If I just print one line by line it returns a lot of 'NUL' characters between each "valid" character.
EDIT 1:
The problem was the encoding of Windows. I changed the encoding following this post and it worked: Why doesn't Python recognize my utf-8 encoded source file?
Anyway the file looks like this:
Line 1.
Line 2.
...
Build succeeded.
0 Warning(s)
0 Error(s)
...
I am currently testing with Sublime for Windows editor - which outputs a 'NUL' character between each "real" character which is very odd.
Using python command line I have this output:
C:\Dev>python readFile.py
Traceback (most recent call last):
File "readFile.py", line 7, in <module>
print(line)
File "C:\Program Files\Python35\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xfe' in position 1: character maps to <undefined>
Thanks for your help anyway...

If your file is not that big you can do a simple find. Otherwise I would check to file to see if you have the string in the file/ check the location for any spelling mistakes and try to narrow down the problem.
f = open(datafile, 'r')
lines = f.read()
answer = lines.find(buildSucceeded)
Also note that if it does not find the string answer would be -1.

As explained, the problem happening was related to encoding. In the below website there is a very good explanation on how to convert between files with one encoding to some other.
I used the last example (with Python 3 which is my case) it worked as expected:
buildSucceeded = "Build succeeded."
datafile = 'C:\\PowerBuild\\logs\\Release\\BuildAllPart2.log'
# Open both input and output streams.
#input = open(datafile, "rt", encoding="utf-16")
input = open(datafile, "r", encoding="utf-16")
output = open("output.txt", "w", encoding="utf-8")
# Stream chunks of unicode data.
with input, output:
while True:
# Read a chunk of data.
chunk = input.read(4096)
if not chunk:
break
# Remove vertical tabs.
chunk = chunk.replace("\u000B", "")
# Write the chunk of data.
output.write(chunk)
with open('output.txt', 'r') as f:
for line in f:
if buildSucceeded in line:
print(line)
Source: http://blog.etianen.com/blog/2013/10/05/python-unicode-streams/

How to overcome memory issue when sequentially appending files to one another

I am running the following script in order to append files to one another by cycling through months and years if the file exists, I have just tested it with a larger dataset where I would expect the output file to be roughly 600mb in size. However I am running into memory issues. Firstly is this normal to run into memory issues (my pc has 8 gb ram) I am not sure how I am eating all of this memory space?
Code I am running
import datetime, os
import StringIO
stored_data = StringIO.StringIO()
start_year = "2011"
start_month = "November"
first_run = False
current_month = datetime.date.today().replace(day=1)
possible_month = datetime.datetime.strptime('%s %s' % (start_month, start_year), '%B %Y').date()
while possible_month <= current_month:
csv_filename = possible_month.strftime('%B %Y') + ' MRG.csv'
if os.path.exists(csv_filename):
with open(csv_filename, 'rb') as current_csv:
if first_run != False:
next(current_csv)
else:
first_run = True
stored_data.writelines(current_csv)
possible_month = (possible_month + datetime.timedelta(days=31)).replace(day=1)
if stored_data:
contents = stored_data.getvalue()
with open('FullMergedData.csv', 'wb') as output_csv:
output_csv.write(contents)
The trackback I receive:
Traceback (most recent call last):
File "C:\code snippets\FullMerger.py", line 23, in <module>
contents = stored_output.getvalue()
File "C:\Python27\lib\StringIO.py", line 271, in getvalue
self.buf += ''.join(self.buflist)
MemoryError
Any ideas how to achieve a work around or make this code more efficient to overcome this issue. Many thanks
AEA
Edit1
Upon running the code supplied alKid I received the following traceback.
Traceback (most recent call last):
File "C:\FullMerger.py", line 22, in <module>
output_csv.writeline(line)
AttributeError: 'file' object has no attribute 'writeline'
I fixed the above by changing it to writelines however I still received the following trace back.
Traceback (most recent call last):
File "C:\FullMerger.py", line 19, in <module>
next(current_csv)
StopIteration

In stored_data, you're trying to store the whole file, and since it's too large, you're getting the error you are showing.
One solution is to write the file per line. It is far more memory-efficient, since you only store a line of data in the buffer, instead of the whole 600 MB.
In short, the structure can be something this:
with open('FullMergedData.csv', 'a') as output_csv: #this will append
# the result to the file.
with open(csv_filename, 'rb') as current_csv:
for line in current_csv: #loop through the lines
if first_run != False:
next(current_csv)
first_run = True #After the first line,
#you should immidiately change first_run to true.
output_csv.writelines(line) #write it per line
Should fix your problem. Hope this helps!

Your memory error is because you store all the data in a buffer before writing it. Consider using something like copyfileobj to directly copy from one open file object to another, this will only buffer small amounts of data at a time. You could also do it line by line, which will have much the same effect.
Update
Using copyfileobj should be much faster than writing the file line by line. Here is an example of how to use copyfileobj. This code opens two files, skips the first line of the input file if skip_first_line is True and then copies the rest of that file to the output file.
skip_first_line = True
with open('FullMergedData.csv', 'a') as output_csv:
with open(csv_filename, 'rb') as current_csv:
if skip_first_line:
current_csv.readline()
shutil.copyfileobj(current_csv, output_csv)
Notice that if you're using copyfileobj you'll want to use current_csv.readline() instead of next(current_csv). That's because iterating over a file object buffers part of the file, which is normally very useful, but you don't want that in this case. More on that here.

problem with seek() in file handling

I am using seek in a file- the file has a bunch of filenames and some logs of a process done on the file- some of these logs have errors. I am going line by line, if i get an error, I want to log everything in between two filenames.
When I use seek, I think that instead of moving it to the line i want it to, it moves it to the character # instead. For example
f=open("fileblah",'r')
while f:
line=f.readline()
counter=counter+1
f.seek(tail_position) # i want the next loop to start from after the error happened.
if line.startswith("D:")
header_position=counter
error_flag=0 #unset error flag
if line.startswith("error")
error_flag=1 #set error_flag
while(not(line.startswith("D:"): #go until next file beginning
line=f.readline()
counter=counter+1
tail_position=counter #have come to the next filename
I can see this is highly inefficient, but it doesn't work at all, because f.seek(tail_position) is moving the file pointer to the character # instead of the line #

Use .tell() to store your start-of-line position, then you can .seek() back to it.
Edit: I think this is what you want:
def errorsInLog(fname, newfileStr='D:', iserrorStr='error'):
with open(fname) as inf:
prev = pos = inf.tell()
line = inf.readline()
error = False
while line:
if line.startswith(newfileStr):
if error:
inf.seek(prev)
yield(inf.read(pos-prev))
prev = pos
error = False
elif line.startswith(iserrorStr):
error = True
pos = inf.tell()
line = inf.readline()
if error:
inf.seek(prev)
yield(inf.read())
def main():
print('\n\n'.join(errorsInLog('fileblah')))
For each filename followed by an error it returns a string encompassing the filename and all following lines, up to but not including the next filename or end-of-file.

seek() is used more often in random access file reading. If the file being read is already text and in can be read line by line then you need only read the line and then operate on the line using string operations. There is no need to move the file read position.
Your code need only look like this:
for line in f:
do_stuff_with line

like stdio's fseek(),seek(offset[,whence]) sets offset of the current position. whence is defaults to 0.so you can do something like this:
while(not(line.startwith("D:"))):
fseek(tail_position,'\n')
tail_position ++

How to modify a text file?

I'm using Python, and would like to insert a string into a text file without deleting or copying the file. How can I do that?

Unfortunately there is no way to insert into the middle of a file without re-writing it. As previous posters have indicated, you can append to a file or overwrite part of it using seek but if you want to add stuff at the beginning or the middle, you'll have to rewrite it.
This is an operating system thing, not a Python thing. It is the same in all languages.
What I usually do is read from the file, make the modifications and write it out to a new file called myfile.txt.tmp or something like that. This is better than reading the whole file into memory because the file may be too large for that. Once the temporary file is completed, I rename it the same as the original file.
This is a good, safe way to do it because if the file write crashes or aborts for any reason, you still have your untouched original file.

Depends on what you want to do. To append you can open it with "a":
with open("foo.txt", "a") as f:
f.write("new line\n")
If you want to preprend something you have to read from the file first:
with open("foo.txt", "r+") as f:
old = f.read() # read everything in the file
f.seek(0) # rewind
f.write("new line\n" + old) # write the new line before

The fileinput module of the Python standard library will rewrite a file inplace if you use the inplace=1 parameter:
import sys
import fileinput
# replace all occurrences of 'sit' with 'SIT' and insert a line after the 5th
for i, line in enumerate(fileinput.input('lorem_ipsum.txt', inplace=1)):
sys.stdout.write(line.replace('sit', 'SIT')) # replace 'sit' and write
if i == 4: sys.stdout.write('\n') # write a blank line after the 5th line

Rewriting a file in place is often done by saving the old copy with a modified name. Unix folks add a ~ to mark the old one. Windows folks do all kinds of things -- add .bak or .old -- or rename the file entirely or put the ~ on the front of the name.
import shutil
shutil.move(afile, afile + "~")
destination= open(aFile, "w")
source= open(aFile + "~", "r")
for line in source:
destination.write(line)
if <some condition>:
destination.write(<some additional line> + "\n")
source.close()
destination.close()
Instead of shutil, you can use the following.
import os
os.rename(aFile, aFile + "~")

Python's mmap module will allow you to insert into a file. The following sample shows how it can be done in Unix (Windows mmap may be different). Note that this does not handle all error conditions and you might corrupt or lose the original file. Also, this won't handle unicode strings.
import os
from mmap import mmap
def insert(filename, str, pos):
if len(str) < 1:
# nothing to insert
return
f = open(filename, 'r+')
m = mmap(f.fileno(), os.path.getsize(filename))
origSize = m.size()
# or this could be an error
if pos > origSize:
pos = origSize
elif pos < 0:
pos = 0
m.resize(origSize + len(str))
m[pos+len(str):] = m[pos:origSize]
m[pos:pos+len(str)] = str
m.close()
f.close()
It is also possible to do this without mmap with files opened in 'r+' mode, but it is less convenient and less efficient as you'd have to read and temporarily store the contents of the file from the insertion position to EOF - which might be huge.

As mentioned by Adam you have to take your system limitations into consideration before you can decide on approach whether you have enough memory to read it all into memory replace parts of it and re-write it.
If you're dealing with a small file or have no memory issues this might help:
Option 1)
Read entire file into memory, do a regex substitution on the entire or part of the line and replace it with that line plus the extra line. You will need to make sure that the 'middle line' is unique in the file or if you have timestamps on each line this should be pretty reliable.
# open file with r+b (allow write and binary mode)
f = open("file.log", 'r+b')
# read entire content of file into memory
f_content = f.read()
# basically match middle line and replace it with itself and the extra line
f_content = re.sub(r'(middle line)', r'\1\nnew line', f_content)
# return pointer to top of file so we can re-write the content with replaced string
f.seek(0)
# clear file content
f.truncate()
# re-write the content with the updated content
f.write(f_content)
# close file
f.close()
Option 2)
Figure out middle line, and replace it with that line plus the extra line.
# open file with r+b (allow write and binary mode)
f = open("file.log" , 'r+b')
# get array of lines
f_content = f.readlines()
# get middle line
middle_line = len(f_content)/2
# overwrite middle line
f_content[middle_line] += "\nnew line"
# return pointer to top of file so we can re-write the content with replaced string
f.seek(0)
# clear file content
f.truncate()
# re-write the content with the updated content
f.write(''.join(f_content))
# close file
f.close()

Wrote a small class for doing this cleanly.
import tempfile
class FileModifierError(Exception):
pass
class FileModifier(object):
def __init__(self, fname):
self.__write_dict = {}
self.__filename = fname
self.__tempfile = tempfile.TemporaryFile()
with open(fname, 'rb') as fp:
for line in fp:
self.__tempfile.write(line)
self.__tempfile.seek(0)
def write(self, s, line_number = 'END'):
if line_number != 'END' and not isinstance(line_number, (int, float)):
raise FileModifierError("Line number %s is not a valid number" % line_number)
try:
self.__write_dict[line_number].append(s)
except KeyError:
self.__write_dict[line_number] = [s]
def writeline(self, s, line_number = 'END'):
self.write('%s\n' % s, line_number)
def writelines(self, s, line_number = 'END'):
for ln in s:
self.writeline(s, line_number)
def __popline(self, index, fp):
try:
ilines = self.__write_dict.pop(index)
for line in ilines:
fp.write(line)
except KeyError:
pass
def close(self):
self.__exit__(None, None, None)
def __enter__(self):
return self
def __exit__(self, type, value, traceback):
with open(self.__filename,'w') as fp:
for index, line in enumerate(self.__tempfile.readlines()):
self.__popline(index, fp)
fp.write(line)
for index in sorted(self.__write_dict):
for line in self.__write_dict[index]:
fp.write(line)
self.__tempfile.close()
Then you can use it this way:
with FileModifier(filename) as fp:
fp.writeline("String 1", 0)
fp.writeline("String 2", 20)
fp.writeline("String 3") # To write at the end of the file

If you know some unix you could try the following:
Notes: $ means the command prompt
Say you have a file my_data.txt with content as such:
$ cat my_data.txt
This is a data file
with all of my data in it.
Then using the os module you can use the usual sed commands
import os
# Identifiers used are:
my_data_file = "my_data.txt"
command = "sed -i 's/all/none/' my_data.txt"
# Execute the command
os.system(command)
If you aren't aware of sed, check it out, it is extremely useful.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python read huge file line by line with utf-8 encoding - python

This works for me: you can use "utf-8" in the hook definition. I used it on a 50GB/200M lines file with no problem. fi = fileinput.FileInput(openhook=fileinput.hook_encoded("iso-8859-1"))

Could you try to read not a whole file, but a part of it as binary, then decode(), then proccess, then call the function again to read another part?

Related

Inplace file encoding

Searching for a string in a file is not working in Python

How to overcome memory issue when sequentially appending files to one another

problem with seek() in file handling

How to modify a text file?

Categories

Resources