I need to work with a huge bz2 file (5+ GB) using Python. With my current code, I always get a memory error. Somewhere I read that I could use sqlite3 to handle the problem. Is this right? If yes, how should I adapt my code?
(I'm not very experienced using sqlite3...)
Here is the current beginning of my code:
import csv, bz2

names = ('ID', 'FORM')
filename = "huge-file.bz2"
with open(filename) as f:
    f = bz2.BZ2File(f, 'rb')
    reader = csv.DictReader(f, fieldnames=names, delimiter='\t')
    tokens = [sentence for sentence in reader]
After this, I need to go through the 'tokens'. It would be great if I could handle this huge bz2 file - so, any help is very welcome! Thank you very much for any advice!
The file is huge, and reading the whole file won't work because your process will run out of memory.
The solution is to read the file in chunks/lines, and process them before reading the next chunk.
The list comprehension line
tokens = [sentence for sentence in reader]
is reading the whole file into tokens, and that is what is causing the process to run out of memory.
The csv.DictReader can read the CSV records line by line, meaning that on each iteration only one line of data is loaded into memory.
Like this:
with open(filename) as f:
    f = bz2.BZ2File(f, 'rb')
    reader = csv.DictReader(f, fieldnames=names, delimiter='\t')
    for sentence in reader:
        # do something with sentence (process/aggregate/store/etc.)
        pass
Please note that if, inside the added loop, the data from sentence is again stored in another variable (like tokens), a lot of memory may still be consumed, depending on how big the data is. So it's better to aggregate the results, or use some other type of storage for that data.
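If you do want to go the sqlite3 route you asked about, a minimal sketch could look like this (Python 3; the tokens.db file name and table layout are just placeholders):

import bz2
import csv
import sqlite3

names = ('ID', 'FORM')
filename = "huge-file.bz2"

conn = sqlite3.connect("tokens.db")  # placeholder database file
conn.execute("CREATE TABLE IF NOT EXISTS tokens (id TEXT, form TEXT)")

with bz2.open(filename, "rt") as f:  # text mode so csv gets strings
    reader = csv.DictReader(f, fieldnames=names, delimiter='\t')
    for sentence in reader:
        conn.execute("INSERT INTO tokens VALUES (?, ?)",
                     (sentence['ID'], sentence['FORM']))

conn.commit()
conn.close()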
Update
About having some of the previous lines available while you process the current one (as discussed in the comments), you can do something like this:
You can store the previous line in another variable, which gets replaced on each iteration.
Or, if you need multiple previous lines, you can keep a list of the last n lines.
How
Use a collections.deque with a maxlen to keep track of the last n lines. Import deque from the standard collections module at the top of your file.
from collections import deque

# rest of the code ...

last_sentences = deque(maxlen=5)  # keep the previous lines we need for processing new lines

for sentence in reader:
    # process the sentence
    last_sentences.append(sentence)
I suggest the above solution, but you can also implement it yourself using a list, and manually keep track of its size.
Define an empty list before the loop; at the end of each iteration, append the current line and, if the list has grown larger than what you need, drop the older items.
last_sentences = []  # keep the previous lines we need for processing new lines

for sentence in reader:
    # process the sentence
    last_sentences.append(sentence)
    if len(last_sentences) > 5:  # make sure we won't keep all the previous sentences
        last_sentences = last_sentences[-5:]
Related
Hello, I have a huge CSV file (1GB) that can be updated (the server often adds new values).
I want to read this file in Python line by line (not load the whole file into memory), and I want to read it in "real time".
This is an example of my CSV file:
id,name,lastname
1,toto,bob
2,tutu,jordan
3,titi,henri
First, I want to get the header of the file (the column names); in my example I want to get: id,name,lastname.
Second, I want to read the file line by line, not load the whole file into memory.
Third, I want to try to read new values every 10 seconds (with sleep(10), for example).
I am currently looking for a solution using pandas.
I read this topic:
Reading a huge .csv file
import pandas as pd
chunksize = 10 ** 8
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
But I don't understand:
1) I don't know the size of my CSV file, so how do I define chunksize?
2) When I finish reading, how do I tell pandas to try to read new values every 10 seconds (for example)?
Thanks in advance for your help.
First of all, 1GB is not huge - pretty much any modern device can keep that in its working memory. Second, pandas doesn't let you poke around the CSV file, you can only tell it how much data to 'load' - I'd suggest using the built-in csv module if you want to do more advanced CSV processing.
Unfortunately, the csv module's reader() will produce an exhaustible iterator for your file so you cannot just build it as a simple loop and wait for the next lines to become available - you'll have to collect the new lines manually and then feed them to it to achieve the effect you want, something like:
import csv
import time

filename = "path/to/your/file.csv"

with open(filename, "rb") as f:  # on Python 3.x use: open(filename, "r", newline="")
    reader = csv.reader(f)  # create a CSV reader
    header = next(reader)  # grab the first line and keep it as a header reference
    print("CSV header: {}".format(header))
    for row in reader:  # iterate over the available rows
        print("Processing row: {}".format(row))  # process each row however you want
    # file exhausted, entering a 'waiting for new data' state where we manually read new lines
    while True:  # process ad infinitum...
        reader = csv.reader(f.readlines())  # create a CSV reader for the new lines
        for row in reader:  # iterate over the new rows, if any
            print("Processing new row: {}".format(row))  # process each row however you want
        time.sleep(10)  # wait 10 seconds before attempting again
Beware of the edge cases that may break this process - for example, if you attempt to read new lines as they are being added, some data might get lost or split (depending on the flushing mechanism used when appending), and if you delete previous lines, the reader might get corrupted, etc. If at all possible, I'd suggest controlling the CSV writing process in such a way that it explicitly informs your processing routines.
UPDATE: The above processes the CSV file line by line, so it never gets loaded whole into working memory. The only part that actually loads more than one line into memory is when an update to the file occurs: it picks up all the new lines at once because it's faster to process them that way, and unless you're expecting millions of rows of updates between two checks, the memory impact would be negligible. However, if you want that part processed line by line as well, here's how to do it:
import csv
import time

filename = "path/to/your/file.csv"

with open(filename, "rb") as f:  # on Python 3.x use: open(filename, "r", newline="")
    reader = csv.reader(f)  # create a CSV reader
    header = next(reader)  # grab the first line and keep it as a header reference
    print("CSV header: {}".format(header))
    for row in reader:  # iterate over the available rows
        print("Processing row: {}".format(row))  # process each row however you want
    # file exhausted, entering a 'waiting for new data' state where we manually read new lines
    while True:  # process ad infinitum...
        line = f.readline()  # collect the next line, if any available
        if line.strip():  # new line found, we'll ignore empty lines too
            row = next(csv.reader([line]))  # load a line into a reader, parse it immediately
            print("Processing new row: {}".format(row))  # process the row however you want
            continue  # avoid waiting before grabbing the next line
        time.sleep(10)  # wait 10 seconds before attempting again
The chunk size is the number of lines pandas reads at once, so it doesn't depend on the file size. At the end of the file the for loop will end.
The chunk size depends on the optimal amount of data for your processing. In some cases 1GB is not a problem, as it can fit in memory, and you don't need chunks at all. If you aren't OK with 1GB loaded at once, you can select, for example, one million lines with chunksize = 10 ** 6; with a line length of about 20 characters that would be something less than 100 MB, which seems reasonably low, but you may vary the parameter depending on your conditions.
When you need to read the updated file, you just start your for loop once again.
If you don't want to read the whole file just to find out that it hasn't changed, you can look at its modification time (details here) and skip reading if it hasn't changed.
If the question is about re-reading the file every 10 seconds, it can be done with an infinite loop and sleep, like:
import time
while True:
    do_what_you_need()
    time.sleep(10)
In fact the period will be more than 10 seconds, as do_what_you_need() also takes time.
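Combining the modification-time check with that loop might look roughly like this (do_what_you_need() stands in for your own reading code, and the file path is a placeholder):

import os
import time

filename = "path/to/your/file.csv"
last_mtime = 0.0

while True:
    mtime = os.path.getmtime(filename)  # the file's last modification time
    if mtime != last_mtime:             # only re-read when the file has changed
        last_mtime = mtime
        do_what_you_need()              # your own reading/processing code
    time.sleep(10)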
If the question is about reading the tail of a file, I don't know a good way to do that in pandas, but there are some workarounds.
The first idea is just to read the file without pandas and remember the last position. The next time you need to read, you can seek() to that position. Or you can try to implement the seek-and-read for pandas by using StringIO as the source for pandas.read_csv.
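A rough sketch of that idea (the column names are taken from the file's own header line; the path is a placeholder):

import io
import time

import pandas as pd

filename = "path/to/your/file.csv"

with open(filename, "r") as f:
    header = f.readline().strip().split(",")  # grab the column names once
    last_pos = f.tell()                       # remember where the data starts

while True:
    with open(filename, "r") as f:
        f.seek(last_pos)     # jump past what was already processed
        new_text = f.read()  # read only the newly appended part
        last_pos = f.tell()  # remember where we stopped
    if new_text.strip():
        # parse just the new lines with pandas via an in-memory buffer
        new_rows = pd.read_csv(io.StringIO(new_text), header=None, names=header)
        print(new_rows)      # process the new rows however you want
    time.sleep(10)           # wait before checking for new data again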
The other workaround is to use the Unix command tail to cut off the last n lines, if you are sure not too much was added at once. It is much faster than reading and parsing all the lines with pandas, although an in-process seek is still theoretically faster on very long files. Here you need to check whether too many lines were added (you don't see the last processed id); in that case you'll need to get a longer tail or read the whole file.
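For example (a rough sketch; the -n 100 value is arbitrary, and the column names are those from your example file):

import io
import subprocess

import pandas as pd

filename = "path/to/your/file.csv"

# Ask the Unix tail command for the last 100 lines only
result = subprocess.run(["tail", "-n", "100", filename],
                        capture_output=True, text=True)
tail_rows = pd.read_csv(io.StringIO(result.stdout), header=None,
                        names=["id", "name", "lastname"])
# note: on a short file the header line may end up in the tail as well
print(tail_rows)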
All that involves additional code, logic, and potential mistakes. One of them is that the last line could be broken (if you read it at the moment it is being written). So the way I like most is just to switch from a text file to sqlite, which is an SQL-compatible database that stores its data in a file and doesn't need a separate server process to access it. It has a Python library that makes it easy to use. It will handle all the stuff with the long file, simultaneous writing and reading, and reading only the data you need. Just save the last processed id and make a request like this: SELECT * FROM table_name WHERE id > last_processed_id;. Well, this is possible only if you also control the server code and can save the data in this format.
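A minimal polling sketch of that approach (the data.db file name, table name, and column layout are placeholders):

import sqlite3
import time

conn = sqlite3.connect("data.db")  # placeholder: the database file the server writes to
last_processed_id = 0

while True:
    rows = conn.execute(
        "SELECT * FROM table_name WHERE id > ?", (last_processed_id,)
    ).fetchall()
    for row in rows:
        print(row)                   # process the row however you want
        last_processed_id = row[0]   # assumes id is the first column
    time.sleep(10)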
I was trying to do some CSV processing using the csv reader and was stuck on an issue where I have to iterate over the lines read by the csv reader twice. But on the second iteration it returns nothing, since all the lines have already been iterated. Is there any way to reset the iterator to start from scratch again?
Code:
import csv

desc = open("example.csv", "r")
Reader1 = csv.reader(desc)
for lines in Reader1:
    (Some code)
for lines in Reader1:
    (some code)
What I precisely want to do is read a CSV file in the format below
id,price,name
x,y,z
a,b,c
and rearrange it in the format below
id:x a
price: y b
name: z c
without using the pandas library
Reset the underlying file object with seek, adding the following before the second loop:
desc.seek(0)
# Apparently, csv.reader will not refresh if the file is seeked to 0,
# so recreate it
Reader1 = csv.reader(desc)
Mind you, if memory is not a concern, it would typically be faster to read the input into a list, then iterate the list twice. Alternatively, you could use itertools.tee to make two iterators from the initial iterator (it requires similar memory to slurping into a list if you iterate one iterator completely before starting the other, but it allows you to begin iterating immediately, instead of waiting for the whole file to be read before you can process any of it). Either approach avoids the additional system calls that iterating the file twice would entail. The tee approach goes right after the line where you create Reader1:
import itertools

# It's not safe to reuse the argument to tee, so we replace it with one of
# the results of tee
Reader1, Reader2 = itertools.tee(Reader1)
for line in Reader1:
    ...
for line in Reader2:
    ...
I am a beginner in Python. I am now trying to figure out why the second 'for' loop doesn't work in the following script. I mean that I only get the result of the first 'for' loop, but nothing from the second one. I copied and pasted my script and the CSV data below.
It would be helpful if you could tell me why it behaves this way and how to make the second 'for' loop work as well.
My SCRIPT:
import csv

file = "data.csv"
fh = open(file, 'rb')
read = csv.DictReader(fh)

for e in read:
    print(e['a'])

for e in read:
    print(e['b'])
"data.csv":
a,b,c
tree,bough,trunk
animal,leg,trunk
fish,fin,body
The csv reader is an iterator over the file. Once you go through it once, you read to the end of the file, so there is no more to read. If you need to go through it again, you can seek to the beginning of the file:
fh.seek(0)
This will reset the file to the beginning so you can read it again. Depending on the code, it may also be necessary to skip the field name header:
next(fh)
This is necessary for your code, since the DictReader consumed that line the first time around to determine the field names, and it's not going to do that again. It may not be necessary for other uses of csv.
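Putting those two pieces together with your script, a Python 3 sketch (so the file is opened in text mode rather than 'rb') might look like:

import csv

with open("data.csv", "r") as fh:
    read = csv.DictReader(fh)
    for e in read:
        print(e['a'])

    fh.seek(0)  # rewind the underlying file
    next(fh)    # skip the header row; the fieldnames are already set
    for e in read:
        print(e['b'])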
If the file isn't too big and you need to do several things with the data, you could also just read the whole thing into a list:
data = list(read)
Then you can do what you want with data.
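For example, with your data.csv:

import csv

with open("data.csv", "r") as fh:
    data = list(csv.DictReader(fh))  # the whole file, parsed once

for e in data:
    print(e['a'])
for e in data:
    print(e['b'])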
I have created a small function which takes the path of a CSV file, reads it, and returns a list of dicts all at once; then you can loop through the list very easily.
import csv

def read_csv_data(path):
    """
    Reads the CSV at the given path and returns a list of dicts,
    mapping each row to the column names from the first line.
    """
    with open(path) as f:
        data = csv.reader(f)
        # Read the column names from the first line of the file
        fields = next(data)
        data_lines = []
        for row in data:
            items = dict(zip(fields, row))
            data_lines.append(items)
    return data_lines
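For example, with the data.csv from the question above:

rows = read_csv_data("data.csv")
for row in rows:
    print(row['a'], row['b'])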
Regards
I was trying to process my huge CSV file (more than 20 GB), but the process was killed when reading the whole CSV file into memory. To avoid this issue, I am trying to read the second column line by line.
For example, the 2nd column contains data like
xxx, computer is good
xxx, build algorithm
import collections

wordcount = collections.Counter()
with open('desc.csv', 'rb') as infile:
    for line in infile:
        wordcount.update(line.split())
My code works, but it counts words from all the columns; how can I read only the second column without using the csv reader?
As far as I know, calling csv.reader(infile) opens and reads the whole file...which is where your problem lies.
You can just read line-by-line and parse manually:
X = []
with open('desc.csv', 'r') as infile:
    for line in infile:
        # Split on comma first
        cols = [x.strip() for x in line.split(',')]
        # Grab 2nd "column"
        col2 = cols[1]
        # Split on spaces
        words = [x.strip() for x in col2.split(' ')]
        for word in words:
            if word not in X:
                X.append(word)

for w in X:
    print(w)
That will keep only a small chunk of the file in memory at any given time (one line). However, you may still have problems with the variable X growing quite large, such that the program errors out due to memory limits. That depends on how many unique words are in your "vocabulary" list.
It looks like the code in your question is reading the 20 GB file, splitting each line into space-separated tokens, and then creating a counter that keeps a count of every unique token. I'd say that is where your memory is going.
From the manual, csv.reader is an iterator:
a reader object which will iterate over lines in the given csvfile.
csvfile can be any object which supports the iterator protocol and
returns a string each time its next() method is called
so it is fine to iterate through a huge file using csv.reader.
import collections
import csv

wordcount = collections.Counter()
with open('desc.csv', 'rb') as infile:
    for row in csv.reader(infile):
        # count words in strings from the second column
        wordcount.update(row[1].split())
I'm familiar with the csv Python module, and believe it's necessary in my case, as I have some fields that contain the delimiter (| rather than ,, but that's irrelevant) within quotes.
However, I am also looking for the byte-count length of each original row, prior to splitting into columns. I can't count on the data to always quote a column, and I don't know if/when csv will strip off outer quotes, so I don't think (but might be wrong) that simply joining on my delimiter will reproduce the original line string (less CRLF characters). Meaning, I'm not positive the following works:
with open(fname) as fh:
    reader = csv.reader(fh, delimiter="|")
    for row in reader:
        original = "|".join(row)  ## maybe?
I've tried looking at csv to see if there was anything in there that I could use/monkey-patch for this purpose, but since _csv.reader is a .so, I don't know how to mess around with that.
In case I'm dealing with an XY problem, my ultimate goal is to read through a CSV file, extracting certain fields and their overall file offsets to create a sort of look-up index. That way, later, when I have a list of candidate values, I can check each one's file-offset and seek() there, instead of chugging through the whole file again. As an idea of scale, I might have 100k values to look up across a 10GB file, so re-reading the file 100k times doesn't feel efficient to me. I'm open to other suggestions than the CSV module, but will still need csv-like intelligent parsing behavior.
EDIT: Not sure how to make it more clear than the title and body already explains - simply seek()-ing on a file handle isn't sufficient because I also need to parse the lines as a csv in order to pull out additional information.
You can't subclass _csv.reader, but the csvfile argument to the csv.reader() constructor only has to be a "file-like object". This means you could supply an instance of your own class that does some preprocessing—such as remembering the length of the last line read and file offset. Here's an implementation showing exactly that. Note that the line length does not include the end-of-line character(s). It also shows how the offsets to each line/row could be stored and used after the file is read.
import csv

class CSVInputFile(object):
    """ File-like object. """
    def __init__(self, file):
        self.file = file
        self.offset = None
        self.linelen = None

    def __iter__(self):
        return self

    def __next__(self):
        offset = self.file.tell()
        data = self.file.readline()
        if not data:
            raise StopIteration
        self.offset = offset
        self.linelen = len(data)
        return data

    next = __next__

offsets = []  # remember where each row starts
fname = 'unparsed.csv'
with open(fname) as fh:
    csvfile = CSVInputFile(fh)
    for row in csv.reader(csvfile, delimiter="|"):
        print('offset: {}, linelen: {}, row: {}'.format(
            csvfile.offset, csvfile.linelen, row))  # file offset and length of row
        offsets.append(csvfile.offset)  # remember where each row started
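Later on, to jump back to a particular row without rescanning the file, you can seek to its stored offset and parse just that line; a small follow-on sketch, reusing fname and offsets from the code above:

with open(fname) as fh:
    fh.seek(offsets[2])                        # jump to where the third row starts
    row = next(csv.reader(fh, delimiter="|"))  # parse just that one row
    print(row)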
Depending on performance requirements and the size of the data, the low-tech solution is to simply read the file twice. Make a first pass where you get the length of each line, and then you can run the data through the csv parser. On my somewhat outdated Mac I can read and count the length of 2-3 million lines in a second, which isn't a huge performance hit.
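A sketch of that two-pass approach; it assumes no quoted field contains an embedded newline, so csv rows and physical lines correspond one-to-one:

import csv

fname = 'unparsed.csv'

# First pass: record the byte offset at which each line starts
offsets = []
with open(fname, 'rb') as fh:
    pos = 0
    for line in fh:
        offsets.append(pos)
        pos += len(line)

# Second pass: normal csv parsing; offsets[i] is where row i begins
with open(fname, newline='') as fh:
    for i, row in enumerate(csv.reader(fh, delimiter="|")):
        print(offsets[i], row)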