Reset the csv.reader() iterator - python

I was trying to do some CSV processing with csv.reader and got stuck on an issue where I have to iterate twice over the lines it reads. On the second pass it returns nothing, since all the lines have already been consumed. Is there any way to reset the iterator so it starts from scratch again?
Code:
import csv

desc = open("example.csv", "r")
Reader1 = csv.reader(desc)   # note: csv.read() does not exist; csv.reader() is the correct call
for lines in Reader1:
    ...  # (some code)
for lines in Reader1:
    ...  # (some code)
What I precisely want to do is read a CSV file in the format below
id,price,name
x,y,z
a,b,c
and rearrange it in the format below
id: x a
price: y b
name: z c
without using the pandas library.

Reset the underlying file object with seek by adding the following before the second loop:
desc.seek(0)
# Apparently, csv.reader will not refresh if the file is seeked to 0,
# so recreate it
Reader1 = csv.reader(desc)
Mind you, if memory is not a concern, it would typically be faster to read the input into a list, then iterate the list twice. Alternatively, you could use itertools.tee to make two iterators from the initial iterator (it requires similar memory to slurping into a list if you iterate one iterator completely before starting the other, but it allows you to begin iterating immediately, instead of waiting for the whole file to be read before you can process any of it). Either approach avoids the additional system calls that iterating the file twice would entail. The tee approach, placed after the line where you create Reader1:
import itertools

# It's not safe to reuse the argument to tee, so we replace it with one of
# the results of tee
Reader1, Reader2 = itertools.tee(Reader1)
for line in Reader1:
    ...  # (some code)
for line in Reader2:
    ...  # (some code)
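For the specific rearrangement in the question, slurping the rows into a list also makes the transposition straightforward. A minimal sketch without pandas (the file name and layout are the ones from the question):

import csv

with open("example.csv", newline="") as desc:
    rows = list(csv.reader(desc))          # read everything once; fine for a small file

header, data = rows[0], rows[1:]
for name, values in zip(header, zip(*data)):   # zip(*data) transposes rows into columns
    print("{}: {}".format(name, " ".join(values)))
# id: x a
# price: y b
# name: z c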

Related

python read bigger csv line by line

Hello, I have a huge CSV file (1 GB) that can be updated (the server often adds new values).
I want to read this file in Python line by line (not load the whole file into memory), and I want to read it in "real time".
This is an example of my CSV file:
id,name,lastname
1,toto,bob
2,tutu,jordan
3,titi,henri
First, I want to get the header of the file (the column names); in my example that is: id,name,lastname.
Second, I want to read the file line by line, not load the whole file into memory.
Third, I want to try to read new values every 10 seconds (with sleep(10), for example).
I am currently looking for a solution using pandas.
I read this topic:
Reading a huge .csv file
import pandas as pd

chunksize = 10 ** 8
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
but I don't understand:
1) I don't know the size of my CSV file, so how do I define chunksize?
2) When I finish reading, how do I tell pandas to try to read new values every 10 seconds (for example)?
Thanks in advance for your help.
First of all, 1 GB is not huge; pretty much any modern device can keep that in its working memory. Second, pandas doesn't let you poke around in the CSV file; you can only tell it how much data to 'load'. I'd suggest using the built-in csv module if you want to do more advanced CSV processing.
Unfortunately, the csv module's reader() will produce an exhaustible iterator for your file so you cannot just build it as a simple loop and wait for the next lines to become available - you'll have to collect the new lines manually and then feed them to it to achieve the effect you want, something like:
import csv
import time

filename = "path/to/your/file.csv"

with open(filename, "rb") as f:  # on Python 3.x use: open(filename, "r", newline="")
    reader = csv.reader(f)  # create a CSV reader
    header = next(reader)  # grab the first line and keep it as a header reference
    print("CSV header: {}".format(header))
    for row in reader:  # iterate over the available rows
        print("Processing row: {}".format(row))  # process each row however you want
    # file exhausted, entering a 'waiting for new data' state where we manually read new lines
    while True:  # process ad infinitum...
        reader = csv.reader(f.readlines())  # create a CSV reader for the new lines
        for row in reader:  # iterate over the new rows, if any
            print("Processing new row: {}".format(row))  # process each row however you want
        time.sleep(10)  # wait 10 seconds before attempting again
Beware of the edge cases that may break this process: for example, if you attempt to read new lines as they are being added, some data might get lost or split (depending on the flushing mechanism used when appending); if you delete previous lines, the reader might get corrupted; etc. If at all possible, I'd suggest controlling the CSV writing process in such a way that it explicitly informs your processing routines.
UPDATE: The above processes the CSV file line by line; it never gets loaded whole into working memory. The only part that loads more than one line into memory is when an update to the file occurs, where it picks up all the new lines because it's faster to process them that way and, unless you're expecting millions of rows of updates between two checks, the memory impact would be negligible. However, if you want that part processed line by line as well, here's how to do it:
import csv
import time

filename = "path/to/your/file.csv"

with open(filename, "rb") as f:  # on Python 3.x use: open(filename, "r", newline="")
    reader = csv.reader(f)  # create a CSV reader
    header = next(reader)  # grab the first line and keep it as a header reference
    print("CSV header: {}".format(header))
    for row in reader:  # iterate over the available rows
        print("Processing row: {}".format(row))  # process each row however you want
    # file exhausted, entering a 'waiting for new data' state where we manually read new lines
    while True:  # process ad infinitum...
        line = f.readline()  # collect the next line, if any available
        if line.strip():  # new line found, we'll ignore empty lines too
            row = next(csv.reader([line]))  # load a line into a reader, parse it immediately
            print("Processing new row: {}".format(row))  # process the row however you want
            continue  # avoid waiting before grabbing the next line
        time.sleep(10)  # wait 10 seconds before attempting again
Chunk size is the number of lines it would read at once, so it doesn't depend on the file size. At the end of the file the for loop will end.
The chunk size depends on the optimal amount of data to process at once. In some cases 1 GB is not a problem, as it fits in memory, and you don't need chunks. If you aren't OK with 1 GB loaded at once, you can select, for example, 1M rows with chunksize = 10 ** 6; with a line length of about 20 characters that would be something less than 100 MB, which seems reasonably low, but you may vary the parameter depending on your conditions.
When you need to read the updated file, you just start your for loop once again.
If you don't want to read the whole file just to see that it hasn't changed, you can look at its modification time (details here) and skip reading if it hasn't changed.
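A rough sketch of that modification-time check (os.path.getmtime is the standard call; the 10-second interval is just the one from the question):

import os
import time

filename = "path/to/your/file.csv"
last_mtime = 0.0

while True:
    mtime = os.path.getmtime(filename)   # last modification time, seconds since the epoch
    if mtime != last_mtime:              # the file changed since the previous check
        last_mtime = mtime
        # re-read / process the file here
    time.sleep(10)                       # wait before checking again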
If the question is about reading after 10 seconds, it can be done in an infinite loop with sleep, like:
import time

while True:
    do_what_you_need()
    time.sleep(10)
In fact the period will be more than 10 seconds, as do_what_you_need() also takes time.
If the question is about reading the tail of a file, I don't know a good way to do that in pandas, but you can do some workarounds.
The first idea is just to read the file without pandas and remember the last position. The next time you need to read, you can use seek. Or you can try to implement the seek-and-read with pandas, using StringIO as a source for pandas.read_csv.
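A hedged sketch of that first idea, remembering the file position with tell() and feeding only the newly appended text to pandas through io.StringIO (column names taken from the question; this ignores the partially-written-line problem mentioned below):

import io
import time
import pandas as pd

filename = "path/to/your/file.csv"
columns = ["id", "name", "lastname"]

with open(filename, "r", newline="") as f:
    f.readline()              # skip the header once
    position = f.tell()       # remember where we stopped reading
    while True:
        f.seek(position)
        new_text = f.read()   # everything appended since the last check
        position = f.tell()
        if new_text.strip():
            new_rows = pd.read_csv(io.StringIO(new_text), names=columns)
            print(new_rows)   # process the new rows however you want
        time.sleep(10)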
The other workaround is to use the Unix command tail to grab the last n lines, if you are sure that not too much was added at once. It will read the whole file, but it is much faster than reading and parsing all lines with pandas. Still, seek is theoretically faster on very long files. Here you need to check whether too many lines were added (you don't see the last processed id); in that case you'll need to get a longer tail or read the whole file.
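And a sketch of the tail workaround using subprocess (GNU tail assumed; the file name, column names, and 1000-line window are made-up examples):

import io
import subprocess
import pandas as pd

last_processed_id = 42   # example value, kept from the previous run

# grab only the last 1000 lines of the file and parse just those;
# assumes the file is longer than 1000 lines, so the header row is not in the tail output
out = subprocess.run(["tail", "-n", "1000", "data.csv"],
                     capture_output=True, text=True, check=True).stdout
recent = pd.read_csv(io.StringIO(out), names=["id", "name", "lastname"])

# keep only the rows we have not processed yet
recent = recent[recent["id"] > last_processed_id]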
All that involves additional code, logic, and mistakes. One of them is that the last line could be broken (if you read at the moment it is being written). So the way I like most is just to switch from a text file to sqlite, which is an SQL-compatible database that stores data in a file and doesn't need a special process to access it. It has a Python library that makes it easy to use. It will handle all the stuff with long files, simultaneous writing and reading, and reading only the data you need. Just save the last processed id and make a request like this: SELECT * FROM table_name WHERE id > last_processed_id;. Well, this is possible only if you also control the server code and can save in this format.
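A minimal sqlite3 sketch of that last suggestion (the database path, table, and column names are made up; the server-side writer would have to insert into the same database):

import sqlite3
import time

last_processed_id = 0

conn = sqlite3.connect("data.db")   # a plain file; no separate server process needed
while True:
    rows = conn.execute(
        "SELECT id, name, lastname FROM people WHERE id > ?",
        (last_processed_id,),
    ).fetchall()
    for row in rows:
        print("Processing row: {}".format(row))   # process the row however you want
        last_processed_id = row[0]                # remember where we got to
    time.sleep(10)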

Handle huge bz2-file

I have to work with a huge bz2 file (5+ GB) using Python. With my current code, I always get a memory error. Somewhere, I read that I could use sqlite3 to handle the problem. Is this right? If yes, how should I adapt my code?
(I'm not very experienced with sqlite3...)
Here is the current beginning of my code:
import csv, bz2

names = ('ID', 'FORM')
filename = "huge-file.bz2"
with open(filename) as f:
    f = bz2.BZ2File(f, 'rb')
    reader = csv.DictReader(f, fieldnames=names, delimiter='\t')
    tokens = [sentence for sentence in reader]
After this, I need to go through the 'tokens'. It would be great if I could handle this huge bz2 file, so any help is very, very welcome! Thank you very much for any advice!
The file is huge, and reading all the file won't work because your process will run out of memory.
The solution is to read the file in chunks/lines, and process them before reading the next chunk.
The list comprehension line
tokens = [sentence for sentence in reader]
is reading the whole file into tokens and may cause the process to run out of memory.
The csv.DictReader can read the CSV records line by line, meaning on each iteration, 1 line of data will be loaded to memory.
Like this:
with open(filename) as f:
    f = bz2.BZ2File(f, 'rb')
    reader = csv.DictReader(f, fieldnames=names, delimiter='\t')
    for sentence in reader:
        # do something with sentence (process/aggregate/store/etc.)
        pass
Please note that if, in the added loop, the data from each sentence is again stored in another variable (like tokens), a lot of memory may still be consumed, depending on how big the data is. So it's better to aggregate it, or use another type of storage for that data.
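For example, instead of keeping every row, you could accumulate only what you actually need. A sketch that counts the 'FORM' values (the counting itself is just an illustrative aggregate, and bz2.open in text mode is used here instead of the BZ2File wrapper above):

import bz2
import csv
from collections import Counter

names = ('ID', 'FORM')
form_counts = Counter()

with bz2.open("huge-file.bz2", "rt") as f:   # decompress on the fly, line by line
    reader = csv.DictReader(f, fieldnames=names, delimiter='\t')
    for sentence in reader:
        form_counts[sentence['FORM']] += 1   # keep a count, not the rows themselves

print(form_counts.most_common(10))           # e.g. the ten most frequent forms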
Update
About having some of the previous lines available in your process (as discussed in the comments), you can do something like this:
You can store the previous line in another variable, which gets replaced on each iteration.
Or, if you need multiple lines back, you can keep a list of the last n lines.
How
Use a collections.deque with a maxlen to keep track of the last n lines. Import deque from the collections standard module at the top of your file.
from collections import deque

# rest of the code ...

last_sentences = deque(maxlen=5)  # keep as many previous lines as we need for processing new lines
for sentence in reader:
    # process the sentence
    last_sentences.append(sentence)
I suggest the above solution, but you can also implement it yourself using a list and manually keeping track of its size:
define an empty list before the loop; at the end of each iteration, append the current line, and if the list has grown longer than what you need, drop the older items.
last_sentences = []  # keep the previous lines we need for processing new lines
for sentence in reader:
    # process the sentence
    last_sentences.append(sentence)
    if len(last_sentences) > 5:  # make sure we won't keep all the previous sentences
        last_sentences = last_sentences[-5:]

Why does Python require double the RAM to read a file?

I am reading a file which is 24 GB in size. I am using
lines = open(fname).read().splitlines()
and it seems that when reading the lines, it always uses about double the amount of RAM that should be needed. It uses about 50 GB for my current script (after it jumps up to 50 GB it goes back down to 28 GB), but every time I use this kind of line to read a file in Python it generally uses double the file size before dropping down to a size that I would expect.
Any idea why this is happening or how I can avoid it?
RAM Usage: Filesize * 1: Read the entire file into memory
open(fname).read()
RAM Usage: Filesize * 2: Allocate enough space in a list to split on the newlines
open(fname).read().splitlines()
After this operation is complete, the RAM usage drops back down to about Filesize * 1 because the full text of the file isn't needed anymore and it can be garbage-collected.
If you don't need the full text of the file at once, and are only operating on lines, then just iterate over the file
with open(filename) as f:
    for line in f:
        # do something
My guess is that read returns a string of the entire file, which is not garbage collected until a list is returned from splitlines. If you need the file in memory, try the readlines method:
with open(fname) as f:
    lines = f.readlines()
read() tries to load the whole file into memory. With overhead and buffers, this can exceed the size of the file. You then split the contents of the file into lines, and Python allocates new memory for each line.
Can your code be refactored to use readline() and process the lines one by one instead? This would reduce the amount of memory that your program uses at once.
with open(filename) as f:
    for line in f:
        # process a single line, maybe keeping some state elsewhere.
However, if you still need to load all of the lines in memory at once, use readlines() instead:
with open(filename) as f:
    lines = f.readlines()
read() is returning a single str with the whole file data in it. splitlines is returning the list of lines with the same data. The whole file data isn't cleaned up until after splitlines creates the list, so you store two copies of the data for a brief period.
If you want to minimize this overhead (and still strip newlines), you can try:
with open(fname) as f:
    lines = [line.rstrip('\r\n') for line in f]
If you can process line by line (you don't need the whole list at once), it's even better:
with open(fname) as f:
    for line in f:
        line = line.rstrip('\r\n')
which avoids storing more than two lines at a time.
If the file contains 25 GB of data, then file_handle.read() will return a string that is 25 GB in size. When you split that string, you create a list that holds strings adding up to 25 GB of data (plus additional string overhead for each one). So you end up using about twice the memory.
The big string will get reaped almost immediately by the garbage collector, making the memory available for new Python objects to occupy, but that doesn't mean the memory is completely freed back to the operating system (due to optimizations in Python's memory allocator).
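As a rough illustration of that per-line overhead (not from the original answer; exact numbers vary by Python version and platform):

import sys

data = ("x" * 50 + "\n") * 1000        # ~51 KB of text, 1000 lines
as_lines = data.splitlines()

print(sys.getsizeof(data))                       # one big str object
print(sum(sys.getsizeof(s) for s in as_lines)    # per-line str objects...
      + sys.getsizeof(as_lines))                 # ...plus the list that holds them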
A better approach is to accumulate a list of lines one at a time:
with open(filename) as f:
    lines = list(f)
You'll only hold approximately one line from the file in memory at a time[1], so your memory use will be mostly just the memory needed to store the list.
[1] This isn't exactly true ... Python's internal line buffering will probably have a couple of KB of data buffered at any given time ...
Of course, there might also be the option to process the file iteratively:
with open(filename) as f:
    for line in f:
        process(line)
You read the whole file into memory with:
open(fname).read()
In a second step you create a list from this string with .splitlines(). During this time the string stays in memory, but you copy parts of the string into the list, line by line. Only after you are finished creating the list can the string be garbage collected. So during this time you store all the information twice and hence need twice the memory.
You could use open(fname).readlines() or read the file line by line to reduce the memory footprint.

"for line in file object" method to read files

I'm trying to find out the best way to read/process lines of a super large file.
Here I just try
for line in f:
Part of my script is as below:
o = gzip.open(file2, 'w')
LIST = []
f = gzip.open(file1, 'r')
for i, line in enumerate(f):
    if i % 4 != 3:
        LIST.append(line)
    else:
        LIST.append(line)
        b1 = [ord(x) for x in line]
        ave1 = (sum(b1) - 10) / float(len(line) - 1)
        if ave1 < 84:
            del LIST[-4:]
output1 = o.writelines(LIST)
My file1 is around 10 GB, and when I run the script the memory usage just keeps increasing to something like 15 GB without any output. Does that mean the computer is still trying to read the whole file into memory first? That would really make it no different from using readlines().
However in the post:
Different ways to read large data in python
Srika told me:
The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files.
But obviously I still need to worry about large files... I'm really confused.
Thanks.
edit:
Every 4 lines form a kind of group in my data.
The purpose is to do some calculations on every 4th line and, based on that calculation, decide whether we need to keep those 4 lines. So writing lines is my purpose.
The reason the memory keeps increasing even though you use an enumerator is that you are using LIST.append(line). That basically accumulates all the lines of the file in a list. Obviously it's all sitting in memory. You need to find a way to not accumulate lines like this. Read, process, and move on to the next.
Another way is to read your file in chunks (in fact, reading one line at a time qualifies under this criterion: 1 chunk == 1 line), i.e. read a small part of the file, process it, then read the next chunk, and so on. I still maintain that this is the best way to read files in Python, large or small.
with open(...) as f:
    for line in f:
        <do something with line>
The with statement handles opening and closing the file, including if an exception is raised in the inner block. The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files.
It looks like at the end of this function, you're taking all of the lines you've read into memory, and then immediately writing them to a file. Maybe you can try this process:
Read the lines you need into memory (the first 3 lines).
On the 4th line, append the line & perform your calculation.
If your calculation is what you're looking for, flush the values in your collection to the file.
Regardless of what follows, create a new collection instance.
I haven't tried this out, but it could maybe look something like this:
o = gzip.open(file2, 'w')
f = gzip.open(file1, 'r')
LIST = []

for i, line in enumerate(f):
    if i % 4 != 3:
        LIST.append(line)
    else:
        LIST.append(line)
        b1 = [ord(x) for x in line]
        ave1 = (sum(b1) - 10) / float(len(line) - 1)
        # If we've found what we want, save them to the file
        if ave1 >= 84:
            o.writelines(LIST)
        # Release the values in the list by starting a clean list to work with
        LIST = []
EDIT: As a thought though, since your file is so large, this may not be the best technique because of all the lines you would have to write to file, but it may be worth investigating regardless.
Since you add all the lines to the list LIST and only sometimes remove some lines from it, LIST will become longer and longer. All those lines that you store in LIST will take up memory. Don't keep all the lines around in a list if you don't want them to take up memory.
Also your script doesn't seem to produce any output anywhere, so the point of it all isn't very clear.
Ok, you know what your problem is already from the other comments/answers, but let me simply state it.
You are only reading a single line at a time into memory, but you are storing a significant portion of these in memory by appending to a list.
In order to avoid this you need to store something in the filesystem or a database (on disk) for later lookup, if your algorithm is complicated enough.
From what I see, it seems you can easily write the output incrementally, i.e. you are currently using a list to store valid lines to write to output as well as temporary lines you may delete at some point. To be efficient with memory, you want to write the lines from your temporary list as soon as you know they are valid output.
In summary, use your list to store only temporary data you need to do your calculations based off of, and once you have some valid data ready for output you can simply write it to disk and delete it from your main memory (in python this would mean you should no longer have any references to it.)
If you do not use the with statement, you must close the file handles:
o.close()
f.close()

Selecting and printing specific rows of text file

I have a very large (~8 gb) text file that has very long lines. I would like to pull out lines in selected ranges of this file and put them in another text file. In fact my question is very similar to this and this but I keep getting stuck when I try to select a range of lines instead of a single line.
So far this is the only approach I have gotten to work:
lines = readin.readlines()
out1.write(str(lines[5:67]))
out2.write(str(lines[89:111]))
However this gives me a list and I would like to output a file with a format identical to the input file (one line per row)
You can call join on the ranges.
lines = readin.readlines()
out1.write(''.join(lines[5:67]))
out2.write(''.join(lines[89:111]))
Might I suggest not storing the entire file (since it is large), as per one of your links?
f = open('file')
n = open('newfile', 'w')
for i, text in enumerate(f):
    if i > 4 and i < 68:
        n.write(text)
    elif i > 88 and i < 112:
        n.write(text)
    else:
        pass
I'd also recommend using 'with' instead of opening and closing the file, but unfortunately I am not allowed to upgrade to a new enough version of Python for that here :(.
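For reference, the same range selection written with a with statement (needs Python 2.7/3.1+ for two context managers in one statement; purely a sketch):

with open('file') as f, open('newfile', 'w') as n:
    for i, text in enumerate(f):
        if 4 < i < 68 or 88 < i < 112:   # same ranges as above
            n.write(text)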
The first thing you should think of when facing a problem like this, is to avoid reading the entire file into memory at once. readlines() will do that, so that specific method should be avoided.
Luckily, we have an excellent standard library in Python: itertools. itertools has a lot of useful functions, and one of them is islice. islice iterates over an iterable (such as a list, generator, file-like object, etc.) and returns an iterator over the range specified:
itertools.islice(iterable, start, stop[, step])
Make an iterator that returns selected elements from the iterable. If start is non-zero,
then elements from the iterable are skipped until start is reached.
Afterward, elements are returned consecutively unless step is set
higher than one which results in items being skipped. If stop is None,
then iteration continues until the iterator is exhausted, if at all;
otherwise, it stops at the specified position. Unlike regular slicing,
islice() does not support negative values for start, stop, or step.
Can be used to extract related fields from data where the internal
structure has been flattened (for example, a multi-line report may
list a name field on every third line)
Using this information, together with the str.join method, you can e.g. extract lines 10-19 by using this simple code:
from itertools import islice
# Use the 'rb' flag on Python 2 under Windows
with open('huge_data_file.txt', 'r') as data_file:
    txt = ''.join(islice(data_file, 10, 20))
Note that when looping over the file object, each line keeps its trailing newline character, so joining with an empty string preserves the original line breaks.
(Partial Answer) In order to make your current approach work you'll have to write line by line. For instance:
lines = readin.readlines()
for each in lines[5:67]:
    out1.write(each)
for each in lines[89:111]:
    out2.write(each)
path = "c:\\someplace\\"

# Open 2 text files: one for reading and one for writing
f_in = open(path + "temp.txt", 'r')
f_out = open(path + output_name, 'w')

# Go through each line of the input file
for line in f_in:
    if i_want_to_write_this_line == True:
        f_out.write(line)

# Close the files when done
f_in.close()
f_out.close()
