Read in a large data file in Python

So I am attempting to read in a large data file in Python. If the data had one column and 1 million rows I would do:
fp = open(ifile, 'r')
for row in fp:
    # process row
My problem arises when the data I am reading in has, say, 1 million columns and only one row. What I would like is functionality similar to C's fscanf().
Namely,
while not EOF:
    part_row = read_next(%lf)
    work on part_row
I could use fp.read(%lf), if I knew that the format was long float or whatever.
Any thoughts?

A million floats in text format really isn't that big... So unless it's proving a bottleneck of some sort, I wouldn't worry about it and would just do:
with open('file') as fin:
    my_data = [process_line(word) for word in fin.read().split()]
A possible alternative (assuming space-delimited "words") is something like:
import mmap, re

with open('whatever.txt', 'rb') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    for word in re.finditer(rb'(.*?)\s', mf):
        print(word.group(1))
And that'll scan the entire file and effectively give a massive word stream, regardless of rows / columns.

There are two basic ways to approach this:
First, you can write a read_column function with its own explicit buffer, either as a generator function:
def column_reader(fp):
    buf = ''
    while True:
        col_and_buf = buf.split(',', 1)
        while len(col_and_buf) == 1:
            buf += fp.read(4096)
            col_and_buf = buf.split(',', 1)
        col, buf = col_and_buf
        yield col
… or as a class:
class ColumnReader(object):
    def __init__(self, fp):
        self.fp, self.buf = fp, ''
    def next(self):
        col_and_buf = self.buf.split(',', 1)
        while len(col_and_buf) == 1:
            self.buf += self.fp.read(4096)
            col_and_buf = self.buf.split(',', 1)
        col, self.buf = col_and_buf
        return col
But, if you write a read_until function that handles the buffering internally, then you can just do this:
next_col = read_until(fp, ',')[:-1]
There are multiple read_until recipes on ActiveState.
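As a rough illustration (not one of those recipes), a deliberately simple read_until can lean on the file object's own internal buffering and read one character at a time:
def read_until(fp, delim):
    # Collect characters until the delimiter (or EOF) is reached.
    # Returns the data read, including the delimiter if it was found.
    chunks = []
    while True:
        ch = fp.read(1)
        if not ch:        # EOF
            break
        chunks.append(ch)
        if ch == delim:
            break
    return ''.join(chunks)
A production version would read larger blocks and keep its own leftover buffer between calls, which is what the recipes above do.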
Or, if you mmap the file, you effectively get this for free. You can just treat the file as a huge string and use find (or regular expressions) on it. (This assumes the entire file fits within your virtual address space—probably not a problem in 64-bit Python builds, but in 32-bit builds, it can be.)
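As an illustrative sketch of that mmap approach (the filename and the comma delimiter here are just assumptions for the example):
import mmap

with open('data.txt', 'rb') as fp:
    mm = mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ)
    pos = 0
    while pos < len(mm):
        comma = mm.find(b',', pos)
        if comma == -1:
            comma = len(mm)      # last column has no trailing comma
        col = mm[pos:comma]      # one column, as bytes
        # ... process col ...
        pos = comma + 1
    mm.close()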
Obviously these are incomplete. They don't handle EOF, or newline (in real life you probably have six rows of a million columns, not one, right?), etc. But this should be enough to show the idea.

You can accomplish this using yield:
def read_in_chunks(file_object, chunk_size=1024):
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

with open('your_file.txt') as f:
    for piece in read_in_chunks(f):
        process_data(piece)
Take a look at this question for more examples.

Related

Most efficient way to convert large .txt files (size >30GB) into .csv after pre-processing using Python

I have data in a .txt file that looks like this (let's name it "myfile.txt"):
28807644'~'0'~'Maun FCU'~'US#####28855353'~'0'~'WNB Holdings LLC'~'US#####29212330'~'0'~'Idaho First Bank'~'US#####29278777'~'0'~'Republic Bank of Arizona'~'US#####29633181'~'0'~'Friendly Hills Bank'~'US#####29760145'~'0'~'The Freedom Bank of Virginia'~'US#####100504846'~'0'~'Community First Fund Federal Credit Union'~'US#####
I have tried a couple of ways to convert this .txt into a .csv; one of them was using the csv library, but since I like pandas a lot, I used the following:
import pandas as pd
import time
#time at the start of program is noted
start = time.time()
# We set the path where our file is located and read it
path = r'myfile.txt'
f = open(path, 'r')
content = f.read()
# We replace undesired strings and introduce a breakline.
content_filtered = content.replace("#####", "\n").replace("'", "")
# We read everything in columns with the separator "~"
df = pd.DataFrame([x.split('~') for x in content_filtered.split('\n')], columns = ['a', 'b', 'c', 'd'])
# We print the dataframe into a csv
df.to_csv(path.replace('.txt', '.csv'), index = None)
end = time.time()
#total time taken to print the file
print("Execution time in seconds: ",(end - start))
This takes about 35 seconds to process a 300MB file; I can accept that kind of performance, but when I try to do the same for a much larger file (35GB), it produces a MemoryError.
I tried using the csv library, but the results were similar: I attempted putting everything into a list and afterwards writing it out to a CSV:
import csv
# We write to CSV
with open(path.replace('.txt', '.csv'), "w") as outfile:
    write = csv.writer(outfile)
    write.writerows(split_content)
Results were similar, not a huge improvement. Is there a way or methodology I can use to convert VERY large .txt files (likely above 35GB) into .csv?
I'd be happy to read any suggestions you may have, thanks in advance!
I took your sample string and made a sample file by repeating that string 100 million times (something like your_string * 10**8) to get a test file that is 31GB.
Following @Grismar's suggestion of chunking, I made the following, which processes that 31GB file in ~2 minutes, with peak RAM usage depending on the chunk size.
The complicated part is keeping track of the field and record separators, which are multiple characters long and will inevitably span a chunk boundary at some point, getting truncated.
My solution is to inspect the end of each chunk and see if it has a partial separator. If it does, that partial is removed from the end of the current chunk, the current chunk is written-out, and the partial becomes the beginning of (and should be completed by) the next chunk:
CHUNK_SZ = 1024 * 1024
FS = "'~'"
RS = '#####'
# With chars repeated in the separators, check most specific (least ambiguous)
# to least specific (most ambiguous) to definitively catch a partial with the
# fewest number of checks
PARTIAL_RSES = ['####', '###', '##', '#']
PARTIAL_FSES = ["'~", "'"]
ALL_PARTIALS = PARTIAL_FSES + PARTIAL_RSES

f_out = open('out.csv', 'w')
f_out.write('a,b,c,d\n')

f_in = open('my_file.txt')
line = ''
while True:
    # Read chunks till no more, then break out
    chunk = f_in.read(CHUNK_SZ)
    if not chunk:
        break

    # Any previous partial separator, plus new chunk
    line += chunk

    # Check end-of-line for a partial FS or RS; only when separators are more than one char
    final_partial = ''

    if line.endswith(FS) or line.endswith(RS):
        pass  # Write-out will replace complete FS or RS
    else:
        for partial in ALL_PARTIALS:
            if line.endswith(partial):
                final_partial = partial
                line = line[:-len(partial)]
                break

    # Process/write chunk
    f_out.write(line
                .replace(FS, ',')
                .replace(RS, '\n'))

    # Add partial back, to be completed next chunk
    line = final_partial

# Clean up
f_in.close()
f_out.close()
Since your code just does straight up replacement, you could just read through all the data sequentially and detect parts that need replacing as you go:
def process(fn_in, fn_out, columns):
    new_line = b'#####'
    with open(fn_out, 'wb') as f_out:
        # write the header
        f_out.write((','.join(columns) + '\n').encode())
        i = 0
        with open(fn_in, "rb") as f_in:
            while (b := f_in.read(1)):
                if ord(b) == new_line[i]:
                    # keep matching the newline block
                    i += 1
                    if i == len(new_line):
                        # if matched entirely, write just a newline
                        f_out.write(b'\n')
                        i = 0
                    # write nothing while matching
                    continue
                elif i > 0:
                    # if you reach this, it was a partial match, write it
                    f_out.write(new_line[:i])
                    i = 0
                if b == b"'":
                    pass
                elif b == b"~":
                    f_out.write(b',')
                else:
                    # write the byte if no match
                    f_out.write(b)

process('my_file.txt', 'out.csv', ['a', 'b', 'c', 'd'])
That does it pretty quickly. You may be able to improve performance by reading in larger chunks, but it is quick all the same.
This approach has the advantage over yours that it holds almost nothing in memory, but it does very little to optimise for reading speed.
Edit: there was a big mistake in an edge case, which I realised after re-reading, fixed now.
Just to share an alternative way, based on convtools (table docs | github).
This solution is faster than the OP's, but ~7 times slower than Zach's (Zach's works with str chunks, while this one works with row tuples, reading via csv.reader).
Still, this approach may be useful as it allows you to tap into stream processing and work with columns: rearrange them, add new ones, etc.
from convtools import conversion as c
from convtools.contrib.fs import split_buffer
from convtools.contrib.tables import Table

def get_rows(filename):
    with open(filename, "r") as f:
        for row in split_buffer(f, "#####"):
            yield row.replace("'", "")

Table.from_csv(
    get_rows("tmp.csv"), dialect=Table.csv_dialect(delimiter="~")
).into_csv("tmp_out.csv", include_header=False)

Easy way to switch endianness of a string

I need to read a binary file and write its content in the form of a text file that will initialize a memory model. The problem is, I need to switch endianness in the process. Let's look at an example.
The binary file content, when I read it with:
with open(source_name, mode='rb') as file:
    fileContent = file.read().hex()
comes out as fileContent = "aa000000bb000000...".
I need to transform that into "000000aa000000bb...".
Of course, I can split this string into a list of 8-character substrings, then manually reorganize each one like newsubstr = substr[6:8]+substr[4:6]+substr[2:4]+substr[0:2], and then merge them into the result string, but that seems clumsy; I suppose there is a more natural way to do this in Python.
Thanks to k1m190r, I found out about the struct module, which looks like what I need, but I'm still lost. I just designed another clumsy solution:
import struct

with open(source_name, mode='rb') as file:
    fileContent = file.read()

while len(fileContent) % 4 != 0:
    fileContent += b"\x00"

res = ""
for i in range(0, len(fileContent), 4):
    substr = fileContent[i:i+4]
    substr_val = struct.unpack("<L", substr)[0]
    res += struct.pack(">L", substr_val).hex()
Is there a more elegant way? This solution is just slightly better than the original.
Actually, in your specific case you don't even need struct. The following should be sufficient:
from binascii import b2a_hex

# open files in binary
with open("infile", "rb") as infile, open("outfile", "wb") as outfile:
    # read 4 bytes at a time till read() spits out empty byte string b""
    for x in iter(lambda: infile.read(4), b""):
        if len(x) != 4:
            # skip last bit if it is not 4 bytes long
            break
        outfile.write(b2a_hex(x[::-1]))
Is there a more elegant way? This solution is just slightly better than the original
Alternatively, you can craft a "smarter" struct format string: format specifiers take a number prefix which is the number of repetitions, e.g. 10L is the same as LLLLLLLLLL, so you can inject the size of your data divided by 4 before the letter and convert the entire thing in one go (or a few steps; I don't know how big the counter can be).
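For example, a sketch of that idea (assuming fileContent has already been padded to a multiple of 4 bytes, as in your loop):
import struct

n = len(fileContent) // 4
# "<%dL" means n little-endian unsigned 32-bit ints, e.g. "<3L" == "<LLL"
values = struct.unpack("<%dL" % n, fileContent)
res = struct.pack(">%dL" % n, *values).hex()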
array.array might also work, as it has a byteswap method, but you can't specify the input endianness (I think), so it's iffier.
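A sketch of the array.array route, for comparison (this assumes the "I" type code is 4 bytes on your platform, which it usually is; byteswap just reverses each element's bytes in place, so the result doesn't depend on native byte order here):
from array import array

arr = array("I", fileContent)   # requires len(fileContent) % arr.itemsize == 0
arr.byteswap()                  # reverse the bytes of every 4-byte element
res = arr.tobytes().hex()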
To answer the original question:
import re

# flags=re.S (DOTALL) so that '.' also matches newline bytes in the binary data
changed = re.sub(b'(....)', lambda m: m.group()[::-1], bindata, flags=re.S)
Note: original had r'(....)' when the r should have been b.

Replacing string with id using dictionary in python

I have a dictionary file that contains a word in each line.
titles-sorted.txt
a&a
a&b
a&c_bus
a&e
a&f
a&m
....
For each word, its line number is the word's id.
Then I have another file that contains a set of words separated by tab in each line.
a.txt
a_15 a_15_highway_(sri_lanka) a_15_motorway a_15_motorway_(germany) a_15_road_(sri_lanka)
I'd like to replace all of the words by id if it exists in the dictionary, so that the output looks like,
3454 2345 123 5436 322 ....
So I wrote the following Python code to do this:
f = open("titles-sorted.txt")
lines = f.readlines()
titlemap = {}
nr = 1
for l in lines:
l = l.replace("\n", "")
titlemap[l.lower()] = nr
nr+=1
fw = open("a.index", "w")
f = open("a.txt")
lines = f.readlines()
for l in lines:
tokens = l.split("\t")
if tokens[0] in titlemap.keys():
fw.write(str(titlemap[tokens[0]]) + "\t")
for t in tokens[1:]:
if t in titlemap.keys():
fw.write(str(titlemap[t]) + "\t")
fw.write("\n")
fw.close()
f.close()
But this code is ridiculously slow, so it makes me suspicious if I have done everything right.
Is this an efficient way to do this?
The write loop contains a lot of calls to write, which are usually inefficient. You can probably speed things up by writing only once per line (or once per file if the file is small enough):
tokens = l.split("\t")
fw.write('\t'.join(str(titlemap[t]) for t in tokens if t in titlemap))
fw.write("\n")
or even:
lines = []
for l in f:
    lines.append('\t'.join(str(titlemap[t]) for t in l.split('\t') if t in titlemap))
fw.write('\n'.join(lines))
Also, if your tokens are used more than once, you can save time by converting them to strings when you read them:
titlemap = {l.strip().lower(): str(index) for index, l in enumerate(f, start=1)}
So, I suspect this differs based on the operating system you're running on and the specific Python implementation (someone wiser than I may be able to provide some clarity here), but I have a suspicion about what is going on:
Every time you call write, some amount of your desired write request gets written to a buffer, and then once the buffer is full, this information is written to file. The file needs to be fetched from your hard disk (as it doesn't exist in main memory). So your computer pauses while it waits the several milliseconds it takes to fetch the block from the hard disk and write to it. On the other hand, you can do the parsing of the string and the lookup in your hashmap in a couple of nanoseconds, so you spend a lot of time waiting for the write request to finish!
Instead of writing immediately, what if you instead kept a list of the lines that you wanted to write and then only wrote them at the end, all in a row, or if you're handling a huge, huge file that will exceed the capacity of your main memory, write it once you have parsed a certain number of lines.
This allows the writing to disk to be optimized, as you can write multiple blocks at a time (again, this depends on how Python and the operating system handle the write call).
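As a rough sketch of that idea, reusing the titlemap from your question and dropping your first-token check for brevity (the batch size of 100,000 lines is arbitrary):
out_lines = []
with open("a.txt") as f, open("a.index", "w") as fw:
    for line in f:
        tokens = line.rstrip("\n").split("\t")
        out_lines.append("\t".join(str(titlemap[t]) for t in tokens if t in titlemap))
        if len(out_lines) >= 100000:            # write a whole batch at once
            fw.write("\n".join(out_lines) + "\n")
            out_lines = []
    if out_lines:                               # flush whatever is left
        fw.write("\n".join(out_lines) + "\n")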
If we apply the suggestions so far and clean up your code some more (e.g. remove unnecessary .keys() calls), is the following still too slow for your needs?
title_map = {}
token_file = open("titles-sorted.txt")
for number, line in enumerate(token_file):
    title_map[line.rstrip().lower()] = str(number + 1)
token_file.close()

input_file = open("a.txt")
output_file = open("a.index", "w")
for line in input_file:
    tokens = line.split("\t")
    if tokens[0] in title_map:
        output_list = [title_map[tokens[0]]]
        output_list.extend(title_map[token] for token in tokens[1:] if token in title_map)
        output_file.write("\t".join(output_list) + "\n")
output_file.close()
input_file.close()
If it's still too slow, give us slightly more data to work with including an estimate of the number of lines in each of your two input files.

How to read a big binary file and split its content by some marker

In Python, reading a big text file line-by-line is simple:
for line in open('somefile', 'r'): ...
But how to read a binary file and 'split' (by generator) its content by some given marker, not the newline '\n'?
I want something like that:
content = open('somefile', 'r').read()
result = content.split('some_marker')
but, of course, memory-efficient (the file is around 70GB). We also can't read the file byte by byte (it would be too slow because of the nature of the HDD).
The length of the 'chunks' (the data between those markers) might differ, theoretically from 1 byte to megabytes.
So, to give an example to sum up, the data looks like this (digits mean bytes here; the data is in a binary format):
12345223-MARKER-3492-MARKER-34834983428623762374632784-MARKER-888-MARKER-...
Is there any simple way to do that (not implementing reading in chunks, splitting the chunks, remembering tails etc.)?
There is no magic in Python that will do it for you, but it's not hard to write. For example:
def split_file(fp, marker):
    BLOCKSIZE = 4096
    result = []
    current = ''
    for block in iter(lambda: fp.read(BLOCKSIZE), ''):
        current += block
        while 1:
            markerpos = current.find(marker)
            if markerpos == -1:
                break
            result.append(current[:markerpos])
            current = current[markerpos + len(marker):]
    result.append(current)
    return result
Memory usage of this function can be further reduced by turning it into a generator, i.e. converting result.append(...) to yield .... This is left as an exercise to the reader.
A general idea is to use mmap; you can then re.finditer over it:
import mmap
import re

with open('somefile', 'rb') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    markers = re.finditer(rb'(.*?)MARKER', mf)
    for marker in markers:
        print(marker.group(1))
I haven't tested, but you may want a (.*?)(MARKER|$) or similar in there as well.
Then, it's down to the OS to provide the necessaries for access to the file.
I don't think there's any built-in function for that, but you can "read in chunks" nicely with an iterator to prevent memory inefficiency, similarly to @user4815162342's suggestion:
def split_by_marker(f, marker=b"-MARKER-", block_size=4096):
    current = b''
    while True:
        block = f.read(block_size)
        if not block:  # end-of-file
            yield current
            return
        current += block
        while True:
            markerpos = current.find(marker)
            if markerpos < 0:
                break
            yield current[:markerpos]
            current = current[markerpos + len(marker):]
This way you won't save all the results in the memory at once, and you can still iterate it like:
for line in split_by_marker(open(filename, 'rb')): ...
Just make sure that each "line" does not take too much memory...
Readline itself reads in chunks, splits the chunks, remembers tails, etc. So, no.

Is there a way to read a file in a loop in Python using a separator other than newline

I usually read files like this in Python:
f = open('filename.txt', 'r')
for x in f:
    doStuff(x)
f.close()
However, this splits the file by newlines. I now have a file which has all of its info in one line (45,000 strings separated by commas). While a file of this size is trivial to read in using something like
f = open('filename.txt', 'r')
doStuff(f.read())
f.close()
I am curious whether, for a much larger file that is all on one line, it would be possible to achieve a similar iteration effect as in the first code snippet, but splitting by comma (or any other character) instead of newline?
The following function is a fairly straightforward way to do what you want:
def file_split(f, delim=',', bufsize=1024):
    prev = ''
    while True:
        s = f.read(bufsize)
        if not s:
            break
        split = s.split(delim)
        if len(split) > 1:
            yield prev + split[0]
            prev = split[-1]
            for x in split[1:-1]:
                yield x
        else:
            prev += s
    if prev:
        yield prev
You would use it like this:
for item in file_split(open('filename.txt')):
    doStuff(item)
This should be faster than the solution that EMS linked, and will save a lot of memory over reading the entire file at once for large files.
Open the file using open(), then use the file.read(x) method to read (approximately) the next x bytes from the file. You could keep requesting blocks of 4096 characters until you hit end-of-file.
You will have to implement the splitting yourself - you can take inspiration from the csv module, but I don't believe you can use it directly because it wasn't designed to deal with extremely long lines.
