I'm trying to read a huge csv.gz file from a url into chunks and write it into a database on the fly. I have to do all this in memory, no data can exist on disk.
I have the below generator function that generates the response chunks into Dataframe objects.
It works when I use the request's response.raw as input to pd.read_csv, but it is unreliable and sometimes throws this connection error: urllib3.exceptions.ProtocolError: ('Connection broken: OSError("(10054, \'WSAECONNRESET\')",)', OSError("(10054, 'WSAECONNRESET')",))
response = session.get(target, stream=True)
df_it = pd.read_csv(response.raw, compression='gzip', chunksize=10**6,
                    header=None, dtype=str, names=columns, parse_dates=['datetime'])

for i, df in enumerate(self.process_df(df_it)):
    if df.empty:
        continue
    if (i % 10) == 0:
        time.sleep(10)
    yield df
I decided to use iter_content instead, as I read it should be more reliable. I have implemented the functionality below, but I'm getting this error: EOFError: Compressed file ended before the end-of-stream marker was reached.
I think it's because I'm passing in only a partial chunk of the compressed byte stream, but I'm not sure how to give pandas.read_csv an object it will accept.
response = session.get(target, stream=True)

for chunk in response.iter_content(chunk_size=10**6):
    file_obj = io.BytesIO()
    file_obj.write(chunk)
    file_obj.seek(0)
    df_it = pd.read_csv(file_obj, compression='gzip', dtype=str,
                        header=None, names=columns, parse_dates=['datetime'])
    for i, df in enumerate(self.process_df(df_it)):
        if df.empty:
            continue
        if (i % 10) == 0:
            time.sleep(10)
        yield df
Any ideas greatly appreciated!
Thanks
You may wish to try this:
import io

def iterable_to_stream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a read-only
    input stream.

    The stream implements Python 3's newer I/O API (available in Python 2's io module).
    For efficiency, the stream is buffered.
    """
    class IterStream(io.RawIOBase):
        def __init__(self):
            self.leftover = None

        def readable(self):
            return True

        def readinto(self, b):
            try:
                l = len(b)  # We're supposed to return at most this much
                chunk = self.leftover or next(iterable)
                output, self.leftover = chunk[:l], chunk[l:]
                b[:len(output)] = output
                return len(output)
            except StopIteration:
                return 0  # indicate EOF

    return io.BufferedReader(IterStream(), buffer_size=buffer_size)
Then
response = session.get(target, stream=True)
response.raw.decode_content = True  # let urllib3 undo any Content-Encoding (gzip/deflate) applied by the server
df = pd.read_csv(iterable_to_stream(response.iter_content()), sep=';')
I use this to stream CSV files in odsclient. It seems to work, although I have not tried it with gzip compression.
Source: https://stackoverflow.com/a/20260030/7262247
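For the original gzip-compressed CSV, a sketch of how this could plug into the question's chunked read might look like the following (untested with gzip; target, columns, and session are the question's own names, and write_to_database is a hypothetical placeholder for the database insert):

import pandas as pd

response = session.get(target, stream=True)
# wrap the chunk iterator in a buffered, file-like stream
stream = iterable_to_stream(response.iter_content(chunk_size=2**20))

# pandas handles the gzip decompression and the chunking, as in the original attempt
df_it = pd.read_csv(stream, compression='gzip', chunksize=10**6,
                    header=None, dtype=str, names=columns,
                    parse_dates=['datetime'])

for df in df_it:
    if df.empty:
        continue
    write_to_database(df)  # hypothetical placeholder for the actual insert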
Related
I am reading through a .json file and parsing some of the data to save into an Object. There are only 2000 or so items within the JSON that I need to iterate over, but the script I currently have running takes a lot longer than I'd like.
data_file = 'v1/data/data.json'

user = User.objects.get(username='lsv')
format = Format(format='Limited')
format.save()

lost_cards = []
lost_cards_file = 'v1/data/LostCards.txt'

with open(data_file) as file:
    data = json.load(file)
    for item in data:
        if item['model'] == 'cards.cardmodel':
            if len(Card.objects.filter(name=item['fields']['name'])) == 0:
                print(f"card not found: {item['fields']['name']}")
                lost_cards.append(item['fields']['name'])
            try:
                Rating(
                    card=Card.objects.get(name=item['fields']['name'], set__code=item['fields']['set']),
                    rating=item['fields']['rating'],
                    reason=item['fields']['reason'],
                    format=format,
                    rator=user
                ).save()
            except Exception as e:
                print(e, item['fields']['name'], item['fields']['set'])
                break

with open(lost_cards_file, 'w') as file:
    file.write(str(lost_cards))
The code is working as expected, but it's taking a lot longer than I'd like. I'm hoping there is a built-in JSON or iterator function that could accelerate this process.
There is. It's called the json module.
with open(data_file, 'r') as input_file:
    dictionary_from_json = json.load(input_file)
should do it.
I'm trying to send a binary file from a client running IronPython 2.7.4 to a server running CPython 2.7.6 on a Linux box. I followed this example; however, when the server starts writing the file (on the first call to f.write), I get an error:
TypeError: must be string or buffer, not int
Here are the pieces of the code I think are relevant.
server:
def recvall(self, count):
    msgparts = []
    while count > 0:
        newbuf = self.conn.recv(count)
        if not newbuf:
            return None
        msgparts.append(newbuf)
        count -= len(newbuf)
        # print "%i bytes left" % count
    return "".join(msgparts)

# receive file, write out
f = open(fname, 'wb')
chunk = self.recvall(1024)
while chunk:
    f.write(1024)  # <-- error happens here
    chunk = self.recvall(1024)
f.close()
client:
f = open(fname, 'rb')
chunk = f.read(1024)
while chunk:
    self.conn.send(chunk)
    chunk = f.read(1024)
f.close()
conn is the socket connection; that part works, since I can transfer pickled dicts over it successfully.
Any hints?
thanks and regards,
Dominic
f.write(chunk)
should do it (instead of f.write(1024)).
As the error message states, f.write expects a string parameter, and 1024 is clearly an int.
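For context, a sketch of the corrected receive loop, keeping the question's own names:

# receive file, writing out the bytes just received rather than the literal 1024
f = open(fname, 'wb')
chunk = self.recvall(1024)
while chunk:
    f.write(chunk)
    chunk = self.recvall(1024)
f.close()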
I have this function for streaming text files:
def txt_response(filename, iterator):
    if not filename.endswith('.txt'):
        filename += '.txt'
    filename = filename.format(date=str(datetime.date.today()).replace(' ', '_'))
    response = Response((_.encode('utf-8') + '\r\n' for _ in iterator), mimetype='text/txt')
    response.headers['Content-Disposition'] = 'attachment; filename={filename}'.format(filename=filename)
    return response
I am working out how to stream a CSV in a similar manner. This page gives an example, but I wish to use the CSV module.
I can use StringIO and create a fresh "file" and CSV writer for each line, but it seems very inefficient. Is there a better way?
According to this answer (How do I clear a StringIO object?), it is quicker to just create a new StringIO object for each line than to reuse one as I do below. However, if you still don't want to create new StringIO instances, you can achieve what you want like this:
import csv
import StringIO

from flask import Response


def iter_csv(data):
    line = StringIO.StringIO()
    writer = csv.writer(line)
    for csv_line in data:
        writer.writerow(csv_line)
        line.seek(0)
        yield line.read()
        line.truncate(0)
        line.seek(0)  # required for Python 3


def csv_response(data):
    response = Response(iter_csv(data), mimetype='text/csv')
    response.headers['Content-Disposition'] = 'attachment; filename=data.csv'
    return response
If you just want to stream back the results as they are created by csv.writer you can create a custom object implementing an interface the writer expects.
import csv

from flask import Response


class Line(object):
    def __init__(self):
        self._line = None

    def write(self, line):
        self._line = line

    def read(self):
        return self._line


def iter_csv(data):
    line = Line()
    writer = csv.writer(line)
    for csv_line in data:
        writer.writerow(csv_line)
        yield line.read()


def csv_response(data):
    response = Response(iter_csv(data), mimetype='text/csv')
    response.headers['Content-Disposition'] = 'attachment; filename=data.csv'
    return response
A slight improvement to Justin's existing great answer: you can take advantage of the fact that csv.writer's writerow() returns the value returned by the underlying file object's write call.
import csv

from flask import Response


class DummyWriter:
    def write(self, line):
        return line


def iter_csv(data):
    writer = csv.writer(DummyWriter())
    for row in data:
        yield writer.writerow(row)


def csv_response(data):
    response = Response(iter_csv(data), mimetype='text/csv')
    response.headers['Content-Disposition'] = 'attachment; filename=data.csv'
    return response
If you are dealing with large amounts of data that you don't want to hold in memory, you could use SpooledTemporaryFile. It buffers data in a StringIO until it reaches max_size, after which it rolls over to disk.
However, I would stick with the recommended answer if you just want to stream back the results as they are created.
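A minimal sketch of the SpooledTemporaryFile variant (Python 3; the function name and max_size value are illustrative, not taken from the answers above):

import csv
from tempfile import SpooledTemporaryFile

from flask import Response


def csv_response_spooled(data, max_size=1024 * 1024):
    # rows are buffered in memory until max_size bytes, then spilled to disk
    tmp = SpooledTemporaryFile(max_size=max_size, mode='w+', newline='')
    writer = csv.writer(tmp)
    writer.writerows(data)
    tmp.seek(0)

    def generate():
        # stream the finished file back in fixed-size pieces, then clean up
        for piece in iter(lambda: tmp.read(8192), ''):
            yield piece
        tmp.close()

    response = Response(generate(), mimetype='text/csv')
    response.headers['Content-Disposition'] = 'attachment; filename=data.csv'
    return response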
I have a function which processes binary data from a file using the file.read(len) method. However, my file is huge and has been cut into many smaller files of 50 MB each. Is there a wrapper class that feeds many files into a buffered stream and provides a read() method?
The fileinput.FileInput class can do something like this, but it only supports line-by-line reading (readline() with no arguments) and has no read(len) that takes a number of bytes to read.
It's quite easy to concatenate iterables with itertools.chain:
from itertools import chain

def read_by_chunks(file_objects, block_size=1024):
    readers = (iter(lambda f=f: f.read(block_size), '') for f in file_objects)
    return chain.from_iterable(readers)
You can then do:
for chunk in read_by_chunks([f1, f2, f3, f4], 4096):
    handle(chunk)
to process the files in sequence while reading them in chunks of 4096 bytes.
If you need to provide an object with a read method because some other function expects one, you can write a very simple wrapper:
class ConcatFiles(object):
    def __init__(self, files, block_size):
        self._block_size = block_size  # needed by the variable-size read() shown below
        self._reader = read_by_chunks(files, block_size)

    def __iter__(self):
        return self._reader

    def read(self):
        return next(self._reader, '')
This, however, only uses a fixed block size. It's possible to support a block_size parameter for read by doing something like:
def read(self, block_size=None):
    block_size = block_size or self._block_size
    total_read = 0
    chunks = []
    for chunk in self._reader:
        chunks.append(chunk)
        total_read += len(chunk)
        if total_read > block_size:
            contents = ''.join(chunks)
            self._reader = chain([contents[block_size:]], self._reader)
            return contents[:block_size]
    return ''.join(chunks)
Note: if you are reading in binary mode you should replace the empty strings '' in the code with empty bytes b''.
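For example, a binary-mode version of read_by_chunks would only need a different sentinel (a sketch; the function name is illustrative):

from itertools import chain


def read_by_chunks_binary(file_objects, block_size=1024):
    # binary read() returns b'' at EOF, so the iter() sentinel must be b''
    readers = (iter(lambda f=f: f.read(block_size), b'') for f in file_objects)
    return chain.from_iterable(readers)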
Instead of converting the list of streams into a generator - as some of the other answers do - you can chain the streams together and then use the file interface:
import io

def chain_streams(streams, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    Chain an iterable of streams together into a single buffered stream.

    Usage:
        def generate_open_file_streams():
            for file in filenames:
                yield open(file, 'rb')

        f = chain_streams(generate_open_file_streams())
        f.read()
    """
    class ChainStream(io.RawIOBase):
        def __init__(self):
            self.leftover = b''
            self.stream_iter = iter(streams)
            try:
                self.stream = next(self.stream_iter)
            except StopIteration:
                self.stream = None

        def readable(self):
            return True

        def _read_next_chunk(self, max_length):
            # Return 0 or more bytes from the current stream, first returning all
            # leftover bytes. If the stream is closed, returns b''.
            if self.leftover:
                return self.leftover
            elif self.stream is not None:
                return self.stream.read(max_length)
            else:
                return b''

        def readinto(self, b):
            buffer_length = len(b)
            chunk = self._read_next_chunk(buffer_length)
            while len(chunk) == 0:
                # move to next stream
                if self.stream is not None:
                    self.stream.close()
                try:
                    self.stream = next(self.stream_iter)
                    chunk = self._read_next_chunk(buffer_length)
                except StopIteration:
                    # No more streams to chain together
                    self.stream = None
                    return 0  # indicate EOF
            output, self.leftover = chunk[:buffer_length], chunk[buffer_length:]
            b[:len(output)] = output
            return len(output)

    return io.BufferedReader(ChainStream(), buffer_size=buffer_size)
Then use it as any other file/stream:
f = chain_streams(open_files_or_chunks)
f.read(len)
I'm not familiar with anything in the standard library that performs that function, so, in case there is none:
try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO


class ConcatenatedFiles(object):
    def __init__(self, file_objects):
        self.fds = list(reversed(file_objects))

    def read(self, size=None):
        remaining = size
        data = StringIO()
        while self.fds and (remaining > 0 or remaining is None):
            data_read = self.fds[-1].read(remaining or -1)
            if len(data_read) < remaining or remaining is None:  # exhausted file
                self.fds.pop()
            if remaining is not None:
                remaining -= len(data_read)
            data.write(data_read)
        return data.getvalue()
Another method would be to use a generator:
def read_iter(streams, block_size=1024):
    for stream in streams:
        # read fixed-size chunks until this stream is exhausted
        for chunk in iter(lambda: stream.read(block_size), ''):
            yield chunk
# open file handles
file1 = open('f1.txt', 'r')
file2 = open('f2.txt', 'r')
fileOut = open('out.txt', 'w')

# concatenate files 1 & 2
for chunk in read_iter([file1, file2]):
    # process chunk (in this case, just concatenate to output)
    fileOut.write(chunk)

# close files
file1.close()
file2.close()
fileOut.close()
This shouldn't consume any memory beyond what the base script and a single chunk require; each chunk is passed straight from one file's reader to the output writer, repeating until all streams are exhausted.
If you need this behaviour in a class, it could easily be built into a container class, as Bakuriu describes.
I have a function which takes a list of custom objects, conforms some values then writes them to a CSV file. Something really strange is happening in that when the list only contains a few objects, the resulting CSV file is always blank. When the list is longer, the function works fine. Is it some kind of weird anomaly with the temporary file perhaps?
I have to point out that this function returns the temporary file to a web server allowing the user to download the CSV. The web server function is below the main function.
def makeCSV(things):
    from tempfile import NamedTemporaryFile

    # make the csv headers from an object
    headers = [h for h in dir(things[0]) if not h.startswith('_')]

    # this just pretties up the object and returns it as a dict
    def cleanVals(item):
        new_item = {}
        for h in headers:
            try:
                new_item[h] = getattr(item, h)
            except:
                new_item[h] = ''
            if isinstance(new_item[h], list):
                if new_item[h]:
                    new_item[h] = [z.__str__() for z in new_item[h]]
                    new_item[h] = ', '.join(new_item[h])
                else:
                    new_item[h] = ''
            new_item[h] = new_item[h].__str__()
        return new_item

    things = map(cleanVals, things)

    f = NamedTemporaryFile(delete=True)
    dw = csv.DictWriter(f, sorted(headers), restval='', extrasaction='ignore')
    dw.writer.writerow(dw.fieldnames)
    for t in things:
        try:
            dw.writerow(t)
            # I can always see the dicts here...
            print t
        except Exception as e:
            # and there are no exceptions
            print e
    return f
Web server function:
f = makeCSV(search_results)
response = FileResponse(f.name)
response.headers['Content-Disposition'] = (
    "attachment; filename=export_%s.csv" % collection)
return response
Any help or advice greatly appreciated!
Summarizing eumiro's answer: the file needs to be flushed. Call f.flush() at the end of makeCSV().
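A stripped-down Python 3 sketch of where the flush goes (the names here are illustrative, not the question's exact code):

import csv
from tempfile import NamedTemporaryFile


def make_csv(rows, fieldnames):
    f = NamedTemporaryFile(mode='w', delete=True, newline='')
    writer = csv.DictWriter(f, fieldnames, restval='', extrasaction='ignore')
    writer.writeheader()
    writer.writerows(rows)
    f.flush()  # push buffered rows to disk so FileResponse(f.name) sees the data
    return f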