Retrieving CSV fields and also raw string from file simultaneously in Python - python

I have a generator that yields rows from a CSV file one at a time, something like:
import csv

def as_csv(filename):
    with open(filename) as fin:
        yield from csv.reader(fin)
However, I need to also capture the raw string returned from the file, as this needs to be persisted at the same time.
As far as I can tell, the csv built-in can be used on an ad-hoc basis, something like this:
import csv

def as_csv_and_raw(filename):
    with open(filename) as fin:
        for row in fin:
            raw = row.strip()
            values = next(csv.reader([raw]))
            yield (values, raw)
... but this has the overhead of creating a new reader and a new iterable for each row of the file, so on files with millions of rows I'm concerned about the performance impact.
It feels like I could create a coroutine that could interact with the primary function, yielding the parsed fields in a way where I can control the input directly without losing it, something like this:
import csv

def as_csv_and_raw(filename):
    with open(filename) as fin:
        reader = raw_to_csv(some_coroutine())
        next(reader)
        for row in fin:
            raw = row.strip()
            fields = reader.send(raw)
            yield fields, raw

def raw_to_csv(data):
    yield from csv.reader(data)

def some_coroutine():
    # what goes here?
    raise NotImplementedError
I haven't really wrapped my head around coroutines and using yield as an expression, so I'm not sure what goes in some_coroutine, but the intent is that each time I send a value in, that value is run through the csv.reader object and I get the set of fields back.
Can someone provide the implementation of some_coroutine, or alternately show me a better mechanism for getting the desired data?

You can use itertools.tee to create two independent iterators from the iterable file object, create a csv.reader from one of them, and then zip the other iterator with it for output:
import csv
from itertools import tee

def as_csv_and_raw(filename):
    with open(filename) as fin:
        row, raw = tee(fin)
        yield from zip(csv.reader(row), raw)
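For illustration, here is the tee-based version run against a small temporary file (the sample rows and the temp-file setup are made up for the demo):

```python
import csv
import tempfile
from itertools import tee

def as_csv_and_raw(filename):
    # Duplicate the file iterator: one copy feeds csv.reader,
    # the other supplies the raw line paired with each parsed row.
    with open(filename) as fin:
        row, raw = tee(fin)
        yield from zip(csv.reader(row), raw)

with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write('a,b,c\n"x,y",z,1\n')
    name = tmp.name

rows = list(as_csv_and_raw(name))
for fields, raw in rows:
    print(fields, repr(raw))
```

Note that the raw string here still carries its trailing newline, since `tee` hands back the file's lines untouched; strip it at the point of use if needed.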

Related

Handle huge bz2-file

I should work with a huge bz2-file (5+ GB) using python. With my actual code, I always get a memory error. Somewhere, I read that I could use sqlite3 to handle the problem. Is this right? If yes, how should I adapt my code?
(I'm not very experienced using sqlite3...)
Here is my actual beginning of the code:
import csv, bz2

names = ('ID', 'FORM')
filename = "huge-file.bz2"
with open(filename) as f:
    f = bz2.BZ2File(f, 'rb')
    reader = csv.DictReader(f, fieldnames=names, delimiter='\t')
    tokens = [sentence for sentence in reader]
After this, I need to go through the 'tokens'. It would be great if I could handle this huge bz2-file - so, any help is very very welcome! Thank you very much for any advice!
The file is huge, and reading all the file won't work because your process will run out of memory.
The solution is to read the file in chunks/lines, and process them before reading the next chunk.
The list comprehension line
tokens = [sentence for sentence in reader]
is reading the whole file into tokens, and it may cause the process to run out of memory.
The csv.DictReader can read the CSV records line by line, meaning that on each iteration only one line of data is loaded into memory.
Like this:
with open(filename) as f:
    f = bz2.BZ2File(f, 'rb')
    reader = csv.DictReader(f, fieldnames=names, delimiter='\t')
    for sentence in reader:
        # do something with sentence (process/aggregate/store/etc.)
        pass
Please note that if, inside the added loop, the data from each sentence is again stored in another variable (like tokens), lots of memory may still be consumed, depending on how big the data is. So it's better to aggregate it, or use another type of storage for that data.
Update
If you need some of the previous lines available while processing (as discussed in the comments), you have a couple of options:
You can store the previous line in another variable, which gets replaced on each iteration.
Or, if you need multiple previous lines, you can keep a list of the last n lines.
How
Use a collections.deque with a maxlen to keep track of the last n lines. Import deque from the collections standard module at the top of your file.
from collections import deque

# rest of the code ...

last_sentences = deque(maxlen=5)  # keep as many previous lines as we need for processing new lines
for sentence in reader:
    # process the sentence
    last_sentences.append(sentence)
I suggest the above solution, but you can also implement it yourself using a list, and manually keep track of its size.
Define an empty list before the loop; at the end of each iteration, append the current line, and if the list has grown beyond the number of lines you need, drop the older items.
last_sentences = []  # keep the previous lines we need for processing new lines
for sentence in reader:
    # process the sentence
    last_sentences.append(sentence)
    if len(last_sentences) > 5:  # make sure we won't keep all the previous sentences
        last_sentences = last_sentences[-5:]

Obtain csv-like parse AND line length byte count?

I'm familiar with the csv Python module, and believe it's necessary in my case, as I have some fields that contain the delimiter (| rather than ,, but that's irrelevant) within quotes.
However, I am also looking for the byte-count length of each original row, prior to splitting into columns. I can't count on the data to always quote a column, and I don't know if/when csv will strip off outer quotes, so I don't think (but might be wrong) that simply joining on my delimiter will reproduce the original line string (less CRLF characters). Meaning, I'm not positive the following works:
with open(fname) as fh:
    reader = csv.reader(fh, delimiter="|")
    for row in reader:
        original = "|".join(row)  ## maybe?
I've tried looking at csv to see if there was anything in there that I could use/monkey-patch for this purpose, but since _csv.reader is a .so, I don't know how to mess around with that.
In case I'm dealing with an XY problem, my ultimate goal is to read through a CSV file, extracting certain fields and their overall file offsets to create a sort of look-up index. That way, later, when I have a list of candidate values, I can check each one's file-offset and seek() there, instead of chugging through the whole file again. As an idea of scale, I might have 100k values to look up across a 10GB file, so re-reading the file 100k times doesn't feel efficient to me. I'm open to other suggestions than the CSV module, but will still need csv-like intelligent parsing behavior.
EDIT: Not sure how to make it more clear than the title and body already explains - simply seek()-ing on a file handle isn't sufficient because I also need to parse the lines as a csv in order to pull out additional information.
You can't subclass _csv.reader, but the csvfile argument to the csv.reader() constructor only has to be a "file-like object". This means you could supply an instance of your own class that does some preprocessing—such as remembering the length of the last line read and file offset. Here's an implementation showing exactly that. Note that the line length does not include the end-of-line character(s). It also shows how the offsets to each line/row could be stored and used after the file is read.
import csv

class CSVInputFile(object):
    """ File-like object. """
    def __init__(self, file):
        self.file = file
        self.offset = None
        self.linelen = None

    def __iter__(self):
        return self

    def __next__(self):
        offset = self.file.tell()
        data = self.file.readline()
        if not data:
            raise StopIteration
        self.offset = offset
        self.linelen = len(data)
        return data

    next = __next__  # Python 2 compatibility

offsets = []  # remember where each row starts
fname = 'unparsed.csv'
with open(fname) as fh:
    csvfile = CSVInputFile(fh)
    for row in csv.reader(csvfile, delimiter="|"):
        print('offset: {}, linelen: {}, row: {}'.format(
            csvfile.offset, csvfile.linelen, row))  # file offset and length of row
        offsets.append(csvfile.offset)  # remember where each row started
Depending on performance requirements and the size of the data, the low-tech solution is simply to read the file twice: make a first pass where you record the length (or offset) of each line, and then run the data through the csv parser. On my somewhat outdated Mac I can read and count the length of 2-3 million lines in a second, which isn't a huge performance hit.
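A minimal sketch of that two-pass idea (the sample data and temp-file setup are made up): the first pass records each line's starting byte offset, and a later lookup seeks straight to a row and parses only that one line. Reading in binary mode keeps tell() and seek() byte-accurate.

```python
import csv
import tempfile

# Sample pipe-delimited data; in the real case this would be the 10GB file.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write('a|b|1\n"x|y"|z|2\nc|d|3\n')
    fname = tmp.name

# Pass 1: record the byte offset where every line starts.
offsets = []
with open(fname, 'rb') as fh:
    while True:
        pos = fh.tell()
        line = fh.readline()
        if not line:
            break
        offsets.append(pos)

# Later lookup: seek to the second row and parse just that line.
with open(fname, 'rb') as fh:
    fh.seek(offsets[1])
    raw = fh.readline().decode()
    fields = next(csv.reader([raw], delimiter='|'))

print(offsets, fields)
```

Note this sketch assumes one physical line per logical row; quoted fields containing embedded newlines would need the CSVInputFile-style wrapper above instead.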

Saving a Queue to a File

My goal is to have a text file that allows me to append data to the end and retrieve and remove the first data entry from the file. Essentially I want to use a text file as a queue (first in, first out). I thought of two ways to accomplish this, but I am unsure which way is more Pythonic and efficient. The first way is to use the json library.
import json

def add_to_queue(item):
    q = retrieve_queue()
    q.append(item)
    write_to_queue(q)

def pop_from_queue():
    q = retrieve_queue()
    write_to_queue(q[1:])
    return q[0]

def write_to_queue(data):
    with open('queue.txt', 'w') as file_pointer:
        json.dump(data, file_pointer)

def retrieve_queue():
    try:
        with open('queue.txt', 'r') as file_pointer:
            return json.load(file_pointer)
    except (IOError, ValueError):
        return []
Seems pretty clean, but it requires serialization/deserialization of all of the json data every time I write/read, even though I only need the first item in the list.
The second option is to call readlines() and writelines() to retrieve and to store the data in the text file.
def add_to_queue(item):
    with open('queue.txt', 'a') as file_pointer:
        file_pointer.write(item + '\n')

def pop_from_queue():
    with open('queue.txt', 'r+') as file_pointer:
        lines = file_pointer.readlines()
        file_pointer.seek(0)
        file_pointer.truncate()
        file_pointer.writelines(lines[1:])
        return lines[0].strip()
Both of them work fine, so my question is: what is the recommended way to implement a "text file queue"? Is using json "better" (more Pythonic/faster/more memory efficient) than reading and writing to the file myself? Both of these solutions seem rather complicated based on the simplicity of the problem; am I missing a more obvious way to do this?
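For what it's worth, a quick sanity check of the second variant, rewritten here to take the queue path as a parameter so it can run against a temporary file (that parameterization is an addition, not in the original):

```python
import os
import tempfile

def add_to_queue(path, item):
    # Append one line per queued item.
    with open(path, 'a') as fp:
        fp.write(item + '\n')

def pop_from_queue(path):
    # Read everything, then rewrite the file without the first line.
    with open(path, 'r+') as fp:
        lines = fp.readlines()
        fp.seek(0)
        fp.truncate()
        fp.writelines(lines[1:])
        return lines[0].strip()

qpath = os.path.join(tempfile.mkdtemp(), 'queue.txt')
add_to_queue(qpath, 'first')
add_to_queue(qpath, 'second')
first = pop_from_queue(qpath)
print(first)  # oldest entry comes out first
```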

How to work with csv files in Python?

What is the 'Python way' of working with a CSV file? If I want to run some methods on the data in a particular column, should I copy the whole thing into an array, or should I pass the open file into a series of methods?
I tried to return the open file and got this error:
ValueError: I/O operation on closed file
here's the code:
import sys
import os
import csv

def main():
    pass

def openCSVFile(CSVFile, openMode):
    with open(CSVFile, openMode) as csvfile:
        zipreader = csv.reader(csvfile, delimiter=',')
        return zipreader

if __name__ == '__main__':
    zipfile = openCSVFile('propertyOutput.csv', 'rb')
    numRows = sum(1 for row in zipfile)
    print "Rows equals %d." % numRows
Well, there are many ways you could go about manipulating csv files. It depends largely on how big your data is and how often you will perform these operations. I will build on the already good answers and comments to present a somewhat more complex handling, one that wouldn't be far off from a real-world example.
First of all, I prefer csv.DictReader, because most csv files have a header row with the column names. csv.DictReader takes advantage of that and lets you grab each cell's value by its column name.
Also, most of the time you need to perform various validation and normalization operations on said data, so we're going to associate some functions with specific columns.
Suppose we have a csv with information about products.
e.g.
Product Name,Release Date,Price
foo product,2012/03/23,99.9
awesome product,2013/10/14,40.5
.... and so on ........
Let's write a program to parse it and normalize the values
into appropriate native python objects.
import csv
import datetime
from decimal import Decimal

def stripper(value):
    # Strip any whitespace from the left and right
    return value.strip()

def to_decimal(value):
    return Decimal(value)

def to_date(value):
    # We expect dates like: "2013/05/23"
    return datetime.datetime.strptime(value, '%Y/%m/%d').date()

OPERATIONS = {
    'Product Name': [stripper],
    'Release Date': [stripper, to_date],
    'Price': [stripper, to_decimal]
}

def parse_csv(filepath):
    with open(filepath, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            for column in row:
                operations = OPERATIONS[column]
                value = row[column]
                for op in operations:
                    value = op(value)
                # Print the cleaned value or store it somewhere
                print value
Things to note:
1) We operate on the csv on a line-by-line basis. DictReader yields lines one at a time, and that means we can handle csv files of arbitrary size, since we are not going to load the whole file into memory.
2) You can go crazy with normalizing the values of a csv, by building special
classes with magic methods or whatnot. As I said, it depends on the complexity
of your files, the quality of the data and the operations you need to perform
on them.
Have fun.
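The same column-pipeline idea, rendered in Python 3 and run against the sample rows above (the lambda-based pipeline and the in-memory sample are this sketch's own shorthand):

```python
import csv
import datetime
import io
from decimal import Decimal

# Map each column name to a list of normalization steps.
OPERATIONS = {
    'Product Name': [str.strip],
    'Release Date': [str.strip,
                     lambda v: datetime.datetime.strptime(v, '%Y/%m/%d').date()],
    'Price': [str.strip, Decimal],
}

sample = 'Product Name,Release Date,Price\nfoo product,2012/03/23,99.9\n'

cleaned = []
for row in csv.DictReader(io.StringIO(sample)):
    out = {}
    for column, value in row.items():
        # Run each cell through the pipeline registered for its column.
        for op in OPERATIONS[column]:
            value = op(value)
        out[column] = value
    cleaned.append(out)

print(cleaned[0])
```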
The csv module provides one row at a time, understanding its content by splitting it into a list object (or a dict in the case of DictReader).
As Python knows how to loop over such an object, if you're just interested in some specific fields, building a list with those fields seems 'Pythonic' to me. Using an iterator is also valid if each item is to be considered separately from the others.
You probably need to read PEP 343: The 'with' statement
Relevant quote:
Some standard Python objects now support the context management protocol and can be used with the 'with' statement. File objects are one example:
with open('/etc/passwd', 'r') as f:
    for line in f:
        print line
        ... more processing code ...
After this statement has executed, the file object in f will have been automatically closed, even if the 'for' loop raised an exception part-way through the block.
So your csvfile is closed outside the with statement, and therefore outside the openCSVFile function. You either need to avoid the with statement:
def openCSVFile(CSVFile, openMode):
    csvfile = open(CSVFile, openMode)
    return csv.reader(csvfile, delimiter=',')
or move it to __main__:
def get_csv_reader(filelike):
    return csv.reader(filelike, delimiter=',')

if __name__ == '__main__':
    with open('propertyOutput.csv', 'rb') as csvfile:
        zipfile = get_csv_reader(csvfile)
        numRows = sum(1 for row in zipfile)
        print "Rows equals %d." % numRows
Firstly, the reason you're getting ValueError: I/O operation on closed file is that the with statement, acting as a context manager, operates on the opened file, which is the underlying file object that zipreader is set to work on. As soon as the with block is exited, the file that was opened is closed, which leaves the file unusable for zipreader to read from:
with open(CSVFile, openMode) as csvfile:
    zipreader = csv.reader(csvfile, delimiter=',')
    return zipreader
Generally, acquire the resource and then pass it to a function if needed. So, in your main program, open the file, create the csv.reader, pass that to whatever needs it, and close the file in the main program when it makes more sense that "you're done with it now".

Can csv data be made lazy?

Using Python's csv module, is it possible to read an entire, large, csv file into a lazy list of lists?
I am asking this, because in Clojure there are csv parsing modules that will parse a large file and return a lazy sequence (a sequence of sequences). I'm just wondering if that's possible in Python.
Unless I'm misunderstanding you, this is the default behavior, which is the very essence of reading through a csv file:
import csv

def lazy(csvfile):
    with open(csvfile) as f:
        r = csv.reader(f)
        for row in r:
            yield row
gives you back one row at a time.
The csv module's reader is lazy by default.
It will read a line in at a time from the file, parse it to a list, and return that list.
Python's reader and DictReader are lazy iterators. A row is produced only when the object's next() method is called.
The csv module does load the data lazily, one row at a time.
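To see the laziness directly, here is a standalone sketch (the counting wrapper and sample data are made up for the demo): the reader pulls lines from the underlying iterable only as rows are consumed.

```python
import csv
import io
from itertools import islice

data = io.StringIO('a,b\nc,d\ne,f\ng,h\n')

lines_read = 0
def counting_lines(f):
    # Generator wrapper that counts how many raw lines csv.reader pulls.
    global lines_read
    for line in f:
        lines_read += 1
        yield line

reader = csv.reader(counting_lines(data))
first_two = list(islice(reader, 2))  # consume only two rows

print(first_two, lines_read)  # only two lines were ever read
```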
