Obtain csv-like parse AND line length byte count? - python

I'm familiar with the csv Python module, and believe it's necessary in my case, as I have some fields that contain the delimiter (| rather than ,, but that's irrelevant) within quotes.
However, I am also looking for the byte-count length of each original row, prior to splitting into columns. I can't count on the data to always quote a column, and I don't know if/when csv will strip off outer quotes, so I don't think (but might be wrong) that simply joining on my delimiter will reproduce the original line string (less CRLF characters). Meaning, I'm not positive the following works:
with open(fname) as fh:
    reader = csv.reader(fh, delimiter="|")
    for row in reader:
        original = "|".join(row)  ## maybe?
I've tried looking at csv to see if there was anything in there that I could use/monkey-patch for this purpose, but since _csv.reader is a .so, I don't know how to mess around with that.
In case I'm dealing with an XY problem, my ultimate goal is to read through a CSV file, extracting certain fields and their overall file offsets to create a sort of look-up index. That way, later, when I have a list of candidate values, I can check each one's file-offset and seek() there, instead of chugging through the whole file again. As an idea of scale, I might have 100k values to look up across a 10GB file, so re-reading the file 100k times doesn't feel efficient to me. I'm open to other suggestions than the CSV module, but will still need csv-like intelligent parsing behavior.
EDIT: Not sure how to make it clearer than the title and body already explain - simply seek()-ing on a file handle isn't sufficient, because I also need to parse the lines as a csv in order to pull out additional information.

You can't subclass _csv.reader, but the csvfile argument to the csv.reader() constructor only has to be a "file-like object". This means you could supply an instance of your own class that does some preprocessing—such as remembering the length of the last line read and file offset. Here's an implementation showing exactly that. Note that the line length does not include the end-of-line character(s). It also shows how the offsets to each line/row could be stored and used after the file is read.
import csv

class CSVInputFile(object):
    """ File-like object that remembers the offset and length of the last line read. """
    def __init__(self, file):
        self.file = file
        self.offset = None
        self.linelen = None

    def __iter__(self):
        return self

    def __next__(self):
        offset = self.file.tell()
        data = self.file.readline()
        if not data:
            raise StopIteration
        self.offset = offset
        self.linelen = len(data.rstrip('\r\n'))  # length without the end-of-line character(s)
        return data

    next = __next__  # Python 2 compatibility

offsets = []  # remember where each row starts
fname = 'unparsed.csv'
with open(fname) as fh:
    csvfile = CSVInputFile(fh)
    for row in csv.reader(csvfile, delimiter="|"):
        print('offset: {}, linelen: {}, row: {}'.format(
            csvfile.offset, csvfile.linelen, row))  # file offset and length of row
        offsets.append(csvfile.offset)  # remember where each row started
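A hedged sketch of the later look-up step described in the question, reusing the offsets list built above: seek to a stored offset and feed just that one physical line back through csv.reader with the same settings. Like the readline-based wrapper, this assumes each row is a single physical line (no embedded newlines inside quoted fields).
def row_at(fh, offset):
    """Re-read and parse the single row that starts at the given file offset."""
    fh.seek(offset)
    return next(csv.reader([fh.readline()], delimiter="|"))

with open(fname) as fh:
    for off in offsets[:3]:  # e.g. spot-check the first few recorded rows
        print(row_at(fh, off))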

Depending on performance requirements and the size of the data, the low-tech solution is to simply read the file twice. Make a first pass where you record the length of each line, and then run the data through the csv parser. On my somewhat outdated Mac I can read and count the length of 2-3 million lines in a second, which isn't a huge performance hit.
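A minimal sketch of that two-pass idea, reusing the file name and delimiter from above. Byte lengths come from the first, binary pass; the normal csv parsing happens on the second pass (again assuming one row per physical line):
import csv

fname = 'unparsed.csv'

# pass 1: record each line's offset and byte length (without the EOL bytes)
offsets, lengths = [], []
pos = 0
with open(fname, 'rb') as fh:
    for raw in fh:
        offsets.append(pos)
        lengths.append(len(raw.rstrip(b'\r\n')))
        pos += len(raw)

# pass 2: normal csv parsing, zipped with the measurements from pass 1
with open(fname, newline='') as fh:
    for row, offset, length in zip(csv.reader(fh, delimiter="|"), offsets, lengths):
        pass  # use row together with its offset and byte length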

Related

Handle huge bz2-file

I need to work with a huge bz2-file (5+ GB) using Python. With my current code, I always get a memory error. Somewhere, I read that I could use sqlite3 to handle the problem. Is this right? If yes, how should I adapt my code?
(I'm not very experienced using sqlite3...)
Here is the current beginning of my code:
import csv, bz2

names = ('ID', 'FORM')
filename = "huge-file.bz2"
with open(filename) as f:
    f = bz2.BZ2File(f, 'rb')
    reader = csv.DictReader(f, fieldnames=names, delimiter='\t')
    tokens = [sentence for sentence in reader]
After this, I need to go through the 'tokens'. It would be great if I could handle this huge bz2-file - so, any help is very very welcome! Thank you very much for any advice!
The file is huge, and reading the whole file at once won't work because your process will run out of memory.
The solution is to read the file in chunks/lines, and process them before reading the next chunk.
The list comprehension line
tokens = [sentence for sentence in reader]
reads the whole file into tokens and may cause the process to run out of memory.
csv.DictReader can read the CSV records line by line, meaning that on each iteration only one line of data is loaded into memory.
Like this:
with open(filename) as f:
    f = bz2.BZ2File(f, 'rb')
    reader = csv.DictReader(f, fieldnames=names, delimiter='\t')
    for sentence in reader:
        # do something with sentence (process/aggregate/store/etc.)
        pass
Please note that if, inside the added loop, the data from each sentence is again stored in another variable (like tokens), a lot of memory may still be consumed, depending on how big the data is. So it's better to aggregate as you go, or use some other kind of storage for that data.
Update
About keeping some of the previous lines available while you process (as discussed in the comments): you can store the previous line in another variable, which gets replaced on each iteration. Or, if you need multiple lines back, you can keep a list of the last n lines.
How
Use a collections.deque with a maxlen to keep track of the last n lines. Import deque from the collections standard module at the top of your file.
from collections import deque

# rest of the code ...

last_sentences = deque(maxlen=5)  # keep the previous lines we need for processing new lines
for sentence in reader:
    # process the sentence
    last_sentences.append(sentence)
I suggest the above solution, but you can also implement it yourself using a list, and manually keep track of its size.
Define an empty list before the loop; inside the loop, check whether the list already holds as many previous lines as you need, drop the oldest ones if so, then append the current line.
last_sentences = []  # keep the previous lines we need for processing new lines
for sentence in reader:
    # process the sentence
    if len(last_sentences) >= 5:  # make sure we won't keep all the previous sentences
        last_sentences = last_sentences[-4:]  # keep only the 4 most recent before appending
    last_sentences.append(sentence)

Python reading nothing from file [duplicate]

I am a beginner in Python. I am trying to figure out why the second 'for' loop doesn't work in the following script. I mean that I only get the result of the first 'for' loop, but nothing from the second one. I copied and pasted my script and the data csv below.
It would be helpful if you could tell me why it goes this way and how to make the second 'for' loop work as well.
My SCRIPT:
import csv

file = "data.csv"
fh = open(file, 'rb')
read = csv.DictReader(fh)
for e in read:
    print(e['a'])
for e in read:
    print(e['b'])
"data.csv":
a,b,c
tree,bough,trunk
animal,leg,trunk
fish,fin,body
The csv reader is an iterator over the file. Once you go through it once, you read to the end of the file, so there is no more to read. If you need to go through it again, you can seek to the beginning of the file:
fh.seek(0)
This will reset the file to the beginning so you can read it again. Depending on the code, it may also be necessary to skip the field name header:
next(fh)
This is necessary for your code, since the DictReader consumed that line the first time around to determine the field names, and it's not going to do that again. It may not be necessary for other uses of csv.
If the file isn't too big and you need to do several things with the data, you could also just read the whole thing into a list:
data = list(read)
Then you can do what you want with data.
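Putting the seek(0) and next(fh) pieces together, a sketch of the script from the question with both loops printing (text mode shown here):
import csv

with open("data.csv") as fh:
    read = csv.DictReader(fh)
    for e in read:
        print(e['a'])

    fh.seek(0)  # rewind the underlying file
    next(fh)    # skip the "a,b,c" header line DictReader already consumed
    for e in read:
        print(e['b'])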
I have created a small function which takes the path of a csv file, reads it, and returns a list of dicts all at once; then you can loop through the list very easily:
def read_csv_data(path):
    """
    Reads CSV from the given path and returns a list of dicts, one per row.
    """
    data = csv.reader(open(path))
    # Read the column names from the first line of the file
    fields = next(data)
    data_lines = []
    for row in data:
        items = dict(zip(fields, row))
        data_lines.append(items)
    return data_lines
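A quick usage sketch (the file name is just an example); since the result is a plain list, it can be looped over as many times as you like, which also sidesteps the original one-pass problem:
rows = read_csv_data("data.csv")
for row in rows:
    print(row['a'])
for row in rows:
    print(row['b'])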
Regards

python replace random line in large file

Assuming I've got a large file where I want to replace the nth line. I am aware of this solution:
import os

w = open('out', 'w')
for line in open('in', 'r'):
    w.write(replace_somehow(line))
w.close()
os.remove('in')
os.rename('out', 'in')
I do not want to rewrite the whole file, with its many lines, if the line to be replaced is near the beginning of the file.
Is there any proper way to replace the nth line directly?
Unless your new line is guaranteed to be exactly the same length as the original line, there is no way around rewriting the entire file.
Some word processors get really fancy by storing a journal of changes, or a big list of chunks with extra space at the end of each chunk, or a database of smaller chunks, so that auto-save modifications can be done quickly (just append to the journal, or rewrite a single chunk, or do a database update), but the real "save" button will then reconstruct the whole file and write it all at once.
This is worth doing if you autosave much more often than the user manually saves, and your files are very big. (Keep in mind that when, e.g., Microsoft Word was designed, 100KB was really big…)
And this points to the right answer. If you've got 5GB of data, and you need to change the Nth record within that, you should not be using a format that's defined as a sequence of variable-length records with no index. Which is what a text file is. The simplest format that makes sense for your case is a sequence of fixed-size records—but if you need to insert or remove records as well as changing them in-place, it will be just as bad as a text file would. So, first think through your requirements, then pick a data structure.
If you need to deal with some more limited format (like text files) for interchange with other programs, that's fine. You will have to rewrite the entire file once, after all of your changes, to "export", but you won't have to do it every time you make any change.
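As a rough illustration of the fixed-size-record idea (RECORD_LEN and the file name are arbitrary choices here, and `records` stands for whatever rows you are exporting), padding each encoded line out to a fixed width is what makes the in-place updates below possible:
RECORD_LEN = 128  # bytes per record slot, including the trailing newline

with open('export.dat', 'wb') as out:
    for rec in records:  # `records` is assumed to be an iterable of strings
        encoded = rec.encode('utf-8')
        assert len(encoded) < RECORD_LEN  # every record must fit in its slot
        out.write(encoded.ljust(RECORD_LEN - 1) + b'\n')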
If all of your lines are exactly the same length, you can do this as follows:
with open('myfile.txt', 'rb+') as f:
    f.seek(FIXED_LINE_LENGTH * line_number)
    f.write(new_line)
Note that it's length in bytes that matters, not length in characters. And you must open the file in binary mode to use it this way.
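For example, the two lengths diverge as soon as the text isn't pure ASCII:
s = 'café'
print(len(s))                  # 4 characters
print(len(s.encode('utf-8')))  # 5 bytes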
If you don't know which line number you're trying to replace, you'd want something like this:
with open('myfile.txt', 'rb+') as f:
    for line_number, line in enumerate(f):
        if is_the_right_line(line):
            f.seek(FIXED_LINE_LENGTH * line_number)
            f.write(new_line)
If your lines aren't all required to be the same length, but you can be absolutely positive that this one new line is the same length as the old line, you can do this:
with open('myfile.txt', 'rb+') as f:
    last_pos = 0
    for line_number, line in enumerate(f):
        if is_the_right_line(line):
            f.seek(last_pos)
            f.write(new_line)
        last_pos = f.tell()

Selecting and printing specific rows of text file

I have a very large (~8 GB) text file that has very long lines. I would like to pull out lines in selected ranges of this file and put them in another text file. In fact my question is very similar to this and this but I keep getting stuck when I try to select a range of lines instead of a single line.
So far this is the only approach I have gotten to work:
lines = readin.readlines()
out1.write(str(lines[5:67]))
out2.write(str(lines[89:111]))
However, this gives me a list, and I would like to output a file with a format identical to the input file (one line per row).
You can call join on the ranges.
lines = readin.readlines()
out1.write(''.join(lines[5:67]))
out2.write(''.join(lines[89:111]))
Might I suggest not storing the entire file (since it is large), as per one of your links?
f = open('file')
n = open('newfile', 'w')
for i, text in enumerate(f):
    if i > 4 and i < 68:
        n.write(text)
    elif i > 88 and i < 112:
        n.write(text)
    else:
        pass
I'd also recommend using 'with' instead of opening and closing the file, but unfortunately I am not allowed to upgrade to a new enough version of Python for that here :(.
The first thing you should think of when facing a problem like this, is to avoid reading the entire file into memory at once. readlines() will do that, so that specific method should be avoided.
Luckily, we have an excellent standard library in Python, itertools. itertools has a lot of useful functions, and one of them is islice. islice iterates over an iterable (such as lists, generators, file-like objects, etc.) and returns an iterator over the specified range:
itertools.islice(iterable, start, stop[, step])
Make an iterator that returns selected elements from the iterable. If start is non-zero, then elements from the iterable are skipped until start is reached. Afterward, elements are returned consecutively unless step is set higher than one, which results in items being skipped. If stop is None, then iteration continues until the iterator is exhausted, if at all; otherwise, it stops at the specified position. Unlike regular slicing, islice() does not support negative values for start, stop, or step. Can be used to extract related fields from data where the internal structure has been flattened (for example, a multi-line report may list a name field on every third line).
Using this information, together with the str.join method, you can e.g. extract lines 10-19 by using this simple code:
from itertools import islice

with open('huge_data_file.txt') as data_file:
    txt = ''.join(islice(data_file, 10, 20))
Note that when looping over the file object, each line keeps its trailing newline character, so joining on the empty string preserves the original line breaks.
(Partial Answer) In order to make your current approach work you'll have to write line by line. For instance:
lines = readin.readlines()
for each in lines[5:67]:
    out1.write(each)
for each in lines[89:111]:
    out2.write(each)
path = "c:\\someplace\\"
Open 2 text files. One for reading and one for writing
f_in = open(path + "temp.txt", 'r')
f_out = open(path + output_name, 'w')
go through each line of the input file
for line in f_in:
if i_want_to_write_this_line == True:
f_out.write(line)
close the files when done
f_in.close()
f_out.close()

Python goto text file line without reading previous lines

I am working with a very large text file (tsv), around 200 million entries. One of the columns is a date, and records are sorted on that date. Now I want to start reading the records from a given date. Currently I just read from the start, which is very slow since I need to read almost 100-150 million records just to reach the right one. I was thinking that if I could use binary search to speed it up, I could get away with at most about 28 extra record reads (log(200 million)). Does Python allow reading the nth line without caching or reading the lines before it?
If the lines are not fixed length, you are out of luck: something will have to read through the file. If they are fixed length, you can open the file, call file.seek(line * linesize), and then read from there.
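That said, since the question notes the records are sorted on the date column, a rough sketch of the binary-search idea is possible even with variable-length lines; after each midpoint seek it throws away the partial line it landed in and compares the next full line. This assumes a tab-separated file with an ISO-style YYYY-MM-DD date in the first column, so byte-wise comparison matches date order:
import os

def first_offset_on_or_after(path, date):
    """Return the byte offset of the first line whose date column is >= date."""
    target = date.encode()
    with open(path, 'rb') as f:
        lo, hi = 0, os.path.getsize(path)
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid)
            if mid:
                f.readline()  # discard the partial line we landed in
            line = f.readline()
            if not line or line.split(b'\t', 1)[0] >= target:
                hi = mid      # the first match is at or before this region
            else:
                lo = mid + 1
        f.seek(lo)
        if lo:
            f.readline()      # realign to the next line boundary
        return f.tell()

# usage sketch: seek there once, then read forward as usual
# with open('huge.tsv', 'rb') as f:
#     f.seek(first_offset_on_or_after('huge.tsv', '2015-06-01'))
#     for line in f:
#         ...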
If the file to read is big, and you don't want to read the whole file into memory at once:
fp = open("file")
for i, line in enumerate(fp):
    if i == 25:
        pass  # 26th line
    elif i == 29:
        pass  # 30th line
    elif i > 29:
        break
fp.close()
Note that i == n-1 for the nth line.
You can use the method fileObject.seek(offset[, whence])
#offset -- This is the position of the read/write pointer within the file.
#whence -- This is optional and defaults to 0 which means absolute file positioning, other values are 1 which means seek relative to the current position and 2 means seek relative to the file's end.
file = open("test.txt", "r")
line_size = 8 # Because there are 6 numbers and the newline
line_number = 5
file.seek(line_number * line_size, 0)
for i in range(5):
print(file.readline())
file.close()
With this code I use the following file:
100101
101102
102103
103104
104105
105106
106107
107108
108109
109110
110111
Python has no way to skip "lines" in a file. The best way that I know of is to employ a generator that yields lines based on a certain condition, e.g. date > 'YYYY-MM-DD'. At least this way you reduce memory usage and time spent on I/O.
example:
# using python 3.4 syntax (parameter type annotation)
from datetime import datetime

def yield_right_dates(filepath: str, mydate: datetime):
    with open(filepath, 'r') as myfile:
        for line in myfile:
            # assume:
            # the file is tab separated (because .tsv is the extension)
            # the date column has column-index == 0
            # the date format is '%Y-%m-%d'
            line_splt = line.split('\t')
            if datetime.strptime(line_splt[0], '%Y-%m-%d') > mydate:
                yield line_splt

my_file_gen = yield_right_dates(filepath='/path/to/my/file', mydate=datetime(2015, 1, 1))
# then you can do whatever processing you need on the stream, or put it in one giant list.
desired_lines = [line for line in my_file_gen]
But this is still limiting you to one processor :(
Assuming you're on a unix-like system and bash is your shell, I would split the file using the shell utility split, then use multiprocessing and the generator defined above.
I don't have a large file to test with right now, but I'll update this answer later with a benchmark on iterating it whole, vs. splitting and then iterating it with the generator and multiprocessing module.
With greater knowledge on the file (e.g. if all the desired dates are clustered at the beginning | center | end), you might be able to optimize the read further.
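A rough sketch of that split-then-multiprocess idea, assuming something like `split -l 1000000 huge.tsv chunk_` produced chunk files named chunk_aa, chunk_ab, ... in the working directory, and that yield_right_dates above is defined at module level. The worker drains the generator into a list, because generators themselves can't be shipped back between processes:
from datetime import datetime
from glob import glob
from multiprocessing import Pool

def collect_right_dates(path):
    # run the generator defined above inside the worker and return a plain list
    return list(yield_right_dates(filepath=path, mydate=datetime(2015, 1, 1)))

if __name__ == '__main__':
    chunk_files = sorted(glob('chunk_*'))
    with Pool() as pool:
        chunks = pool.map(collect_right_dates, chunk_files)
    desired_lines = [line for chunk in chunks for line in chunk]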
As others have commented, Python doesn't support this, as it doesn't know where lines start and end (unless they're fixed length). If you're doing this repeatedly, I'd recommend either padding the lines out to a constant length (if practical) or, failing that, reading them into some kind of basic database. You'll take a bit of a hit on size, but unless you're only indexing once in a blue moon it'll probably be worth it.
If space is a big concern and padding isn't possible, you could also add a (line number) tag at the start of each line. You would have to guess the size of the jumps and then parse a sample of lines to check them, but that would let you build a search algorithm that finds the right line quickly, for a cost of only around 10 extra characters per line.
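A minimal sketch of the "basic database" option: index each line's byte offset in SQLite once, then seek straight to any line later. The file and table names here are made up, and a real run over 200 million rows would batch the inserts with executemany:
import sqlite3

db = sqlite3.connect('line_index.db')
db.execute('CREATE TABLE IF NOT EXISTS lines (lineno INTEGER PRIMARY KEY, offset INTEGER)')

# one slow pass to build the index
offset = 0
with open('huge.tsv', 'rb') as fh:
    for lineno, line in enumerate(fh):
        db.execute('INSERT OR REPLACE INTO lines VALUES (?, ?)', (lineno, offset))
        offset += len(line)
db.commit()

# later: jump straight to, say, line 150000000 without reading anything before it
(target,) = db.execute('SELECT offset FROM lines WHERE lineno = ?', (150000000,)).fetchone()
with open('huge.tsv', 'rb') as fh:
    fh.seek(target)
    print(fh.readline())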
