Hello, I have a huge CSV file (1 GB) that can be updated (the server often adds new values).
I want to read this file in Python line by line (without loading the whole file into memory), and I want to read it in "real time".
This is an example of my CSV file:
id,name,lastname
1,toto,bob
2,tutu,jordan
3,titi,henri
First, I want to get the header of the file (the column names). In my example I want to get this: id,name,lastname
Second, I want to read this file line by line, without loading the whole file into memory.
Third, I want to check for new values every 10 seconds (with sleep(10), for example).
I am currently looking for a solution using pandas.
I read this topic:
Reading a huge .csv file
import pandas as pd
chunksize = 10 ** 8
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
but I don't understand:
1) I don't know the size of my CSV file, so how do I define chunksize?
2) When I finish reading, how do I tell pandas to try to read new values every 10 seconds (for example)?
Thanks in advance for your help.
First of all, 1GB is not huge - pretty much any modern device can keep that in its working memory. Second, pandas doesn't let you poke around the CSV file, you can only tell it how much data to 'load' - I'd suggest using the built-in csv module if you want to do more advanced CSV processing.
Unfortunately, the csv module's reader() will produce an exhaustible iterator for your file so you cannot just build it as a simple loop and wait for the next lines to become available - you'll have to collect the new lines manually and then feed them to it to achieve the effect you want, something like:
import csv
import time
filename = "path/to/your/file.csv"
with open(filename, "r", newline="") as f:  # on Python 2.x use: open(filename, "rb")
    reader = csv.reader(f)  # create a CSV reader
    header = next(reader)  # grab the first line and keep it as a header reference
    print("CSV header: {}".format(header))
    for row in reader:  # iterate over the available rows
        print("Processing row: {}".format(row))  # process each row however you want
    # file exhausted, entering a 'waiting for new data' state where we manually read new lines
    while True:  # process ad infinitum...
        reader = csv.reader(f.readlines())  # create a CSV reader for the new lines
        for row in reader:  # iterate over the new rows, if any
            print("Processing new row: {}".format(row))  # process each row however you want
        time.sleep(10)  # wait 10 seconds before attempting again
Beware of the edge cases that may break this process. For example, if you attempt to read new lines while they are being added, some data might get lost or split (depending on the flushing mechanism used by the writer), and if previous lines are deleted the reader might get corrupted. If at all possible, I'd suggest controlling the CSV writing process in such a way that it explicitly informs your processing routines.
UPDATE: The above is processing the CSV file line by line, it never gets loaded whole into the working memory. The only part that actually loads more than one line in memory is when an update to the file occurs where it picks up all the new lines because it's faster to process them that way and, unless you're expecting millions of rows of updates between two checks, the memory impact would be negligible. However, if you want to have that part processed line-by-line as well, here's how to do it:
import csv
import time
filename = "path/to/your/file.csv"
with open(filename, "r", newline="") as f:  # on Python 2.x use: open(filename, "rb")
    reader = csv.reader(f)  # create a CSV reader
    header = next(reader)  # grab the first line and keep it as a header reference
    print("CSV header: {}".format(header))
    for row in reader:  # iterate over the available rows
        print("Processing row: {}".format(row))  # process each row however you want
    # file exhausted, entering a 'waiting for new data' state where we manually read new lines
    while True:  # process ad infinitum...
        line = f.readline()  # collect the next line, if any available
        if line.strip():  # new line found, we'll ignore empty lines too
            row = next(csv.reader([line]))  # load a line into a reader, parse it immediately
            print("Processing new row: {}".format(row))  # process the row however you want
            continue  # avoid waiting before grabbing the next line
        time.sleep(10)  # wait 10 seconds before attempting again
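The edge case noted above, where a line is read while it is still being written, can be reduced by handing only complete lines (those already terminated by a newline) to the CSV parser and keeping the unfinished tail for the next poll. This is a minimal sketch for Python 3, assuming the writer only appends and that no field contains an embedded newline; read_complete_lines is a hypothetical helper name, not part of the csv module:

```python
import csv


def read_complete_lines(f, pending=""):
    """Read whatever is currently available from the open text file f.

    Returns (rows, pending): the parsed CSV rows of every complete line,
    plus any trailing fragment that is not terminated by a newline yet.
    """
    data = pending + f.read()      # f.read() returns "" at end of file
    lines = data.split("\n")
    pending = lines.pop()          # "" or an unfinished last line
    rows = list(csv.reader(line for line in lines if line.strip()))
    return rows, pending
```

In the polling loop you would start with pending = "" and call rows, pending = read_complete_lines(f, pending) before each sleep, so that a half-written row is only parsed once the rest of it has arrived.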
Chunk size is the number of lines it would read at once, so it doesn't depend on the file size. At the end of the file the for loop will end.
The chunk size depends on the optimal amount of data to process at once. In some cases 1 GB is not a problem, as it can fit in memory, and you don't need chunks. If you aren't OK with 1 GB loaded at once, you can select, for example, 1M lines (chunksize = 10**6); with a line length of about 20 characters that would be something less than 100 MB, which seems reasonably low, but you may vary the parameter depending on your conditions.
When you need to read the updated file, you just start your for loop once again.
If you don't want to read the whole file just to find out that it hasn't changed, you can look at its modification time (for example with os.path.getmtime) and skip reading if it hasn't changed.
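A minimal sketch of the modification-time check, using only the standard library. The helper name file_changed is hypothetical, and the sketch assumes the filesystem's timestamp resolution is fine enough for your polling interval (two writes within the same timestamp tick would go unnoticed):

```python
import os


def file_changed(path, last_mtime):
    """Return (changed, mtime), comparing the file's current modification time
    with the one remembered from the previous check (None on the first call)."""
    mtime = os.path.getmtime(path)
    return mtime != last_mtime, mtime
```

You would keep the returned mtime between polls and re-read the file only when changed is true.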
If the question is about reading every 10 seconds, it can be done in an infinite loop with sleep, like:
import time
while True:
    do_what_you_need()
    time.sleep(10)
In fact the period will be more than 10 seconds, as do_what_you_need() also takes time.
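If a steadier period matters, one option is to sleep only for whatever remains of the period after the work is done. A rough sketch, where run_every is a hypothetical helper and the iterations parameter is added only so the example terminates (the original loop runs forever):

```python
import time


def run_every(period, task, iterations):
    """Call task() `iterations` times, aiming to start each call `period`
    seconds after the previous one started."""
    next_start = time.monotonic()
    for _ in range(iterations):
        task()
        next_start += period
        # Sleep only for the remainder of the period, never a negative amount.
        time.sleep(max(0.0, next_start - time.monotonic()))
```

If task() takes longer than the period, the next call simply starts right away.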
If the question is about reading the tail of a file, I don't know a good way to do that in pandas, but you can do some workarounds.
The first idea is just to read the file without pandas and remember the last position. The next time you need to read, you can use seek. Or you can try to implement the seek and read for pandas, using StringIO as a source for pandas.read_csv.
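The remember-the-position idea could look roughly like this. It is a sketch under a few assumptions: the file is UTF-8, the writer only appends, and read_new_lines is a hypothetical helper name. The file is opened in binary mode so that the offset is an exact byte count:

```python
def read_new_lines(path, offset=0):
    """Return (lines, new_offset) for the complete lines found after byte `offset`."""
    lines = []
    with open(path, "rb") as f:          # binary mode keeps byte offsets exact
        f.seek(offset)
        for raw in f:
            if not raw.endswith(b"\n"):
                break                    # unfinished last line: leave it for next time
            offset += len(raw)
            lines.append(raw.decode("utf-8").rstrip("\r\n"))
    return lines, offset
```

Keep the returned offset between polls and pass it back in on the next call; the returned lines can be handed to csv.reader(lines), or joined and wrapped in io.StringIO for pandas.read_csv.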
The other workaround is to use the Unix command tail to cut the last n lines, if you are sure that not too many were added at once. Even if it has to scan the file, it is much faster than reading and parsing all lines with pandas. Still, seek is theoretically faster on very long files. Here you need to check whether too many lines were added (you don't see the last processed id); in that case you'll need to get a longer tail or read the whole file.
All of that involves additional code, logic, and mistakes. One of them is that the last line could be broken (if you read at the moment it is being written). So the way I like most is just to switch from a text file to SQLite, which is an SQL-compatible database that stores data in a file and doesn't need a separate server process to access it. It has a Python library (sqlite3) which makes it easy to use. It will handle all the stuff with long files, simultaneous writing and reading, and reading only the data you need. Just save the last processed id and make a request like this: SELECT * FROM table_name WHERE id > last_processed_id;. Well, this is possible only if you also control the server code and can save in this format.
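A small sketch of the SQLite approach with the standard sqlite3 module. The table name people and the helper fetch_new_rows are made up for illustration, the columns mirror the CSV example from the question, and an in-memory database stands in for the file the server would write to:

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # a real setup would pass a file path instead
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, lastname TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [(1, "toto", "bob"), (2, "tutu", "jordan"), (3, "titi", "henri")],
)
conn.commit()


def fetch_new_rows(conn, last_processed_id):
    """Return the rows whose id is greater than last_processed_id, oldest first."""
    cur = conn.execute(
        "SELECT id, name, lastname FROM people WHERE id > ? ORDER BY id",
        (last_processed_id,),
    )
    return cur.fetchall()


new_rows = fetch_new_rows(conn, 1)   # everything added after id 1
```

After processing, you would remember the largest id seen and pass it to the next call.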
I am reading from a force sensor on an Arduino through the serial port. For research reasons, I need to sample the sensor values as often as possible and save each sample to a CSV file as it is taken (I cannot keep the data in a variable and write it to the CSV file only once). Throughout the code, I also need to read from the CSV file.
The problem I have right now is that the code (shown below) starts working normally and appears to keep working normally forever (it never throws any error). However, every time I run the code, the CSV stops updating after it reaches 32768 lines. I know there is a 32767-character limit for each cell of a CSV (although I don't know exactly how a cell is defined), but I don't see how it can be relevant to the number of rows in my case.
This is the code that reads from the serial port and saves the value to the CSV file (ku is a multiprocessing queue object, lock is a multiprocessing lock used to limit access to the file while it is being written, and delay_serial is used to create a small delay between reads from the serial port):
def make_measurement(delay_serial, lock, ku, filename_sensor):
    ser = serial.Serial(sPort, 115200)
    ser.close()
    ser.open()
    while True:
        lock.acquire()
        tmp_data = ser.readline()
        try:
            tmp_data = float(tmp_data.strip())
        except ValueError:
            tmp_data = np.nan
        this_data = [tmp_data, time.time()]
        sleep(delay_serial)
        with open(filename_sensor, 'a', newline='') as fp:
            # Pass the CSV file object to the writer() function
            writer_object = writer(fp, dialect='excel')
            # Result - a writer object
            # Pass the data in the list as an argument into the writerow() function
            writer_object.writerow(this_data)
            # Close the file object
            fp.close()
        ku.put(this_data)
        lock.release()
UPDATE: Thanks to everyone for your help and comments. It seems the problem was with ku.put(this_data), which keeps the last reading in a buffer to be read as needed later in the code. Since I am saving all the data anyway, instead of putting the last reading in the buffer, I now read it from the saved CSV file. With that change, the CSV file keeps growing, seemingly without limit.
I tested the following code and it writes all the lines.
Edit your code based on it.
import csv
for i in range(100000):
    with open("test_file.csv",'a',newline='') as fl:
        writer = csv.writer(fl)
        writer.writerow([i,1.5*i])
        writer= None
    # No need to fl.close(), the with statement does that.
Also, there is no 32,767-character limit for CSV file cells. That's an Excel limitation. If you try to read a CSV file with Excel, you need to respect Excel's cell, column, and row limits, that is:
32,767 characters per cell
1,048,576 rows
16,384 columns.
You can see what specs are supported by Excel here.
Note that these limitations are not inherent to CSV though.
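As a quick check that the format itself does not impose the Excel limit, this sketch writes and reads back a cell longer than 32,767 characters using the standard csv module and an in-memory buffer. One caveat worth knowing: Python's csv reader has its own, configurable field limit (csv.field_size_limit(), 131072 by default), which the 40,000 characters used here stay below:

```python
import csv
import io

long_cell = "x" * 40000                      # longer than Excel's per-cell limit
buffer = io.StringIO()
csv.writer(buffer).writerow(["id", long_cell])

buffer.seek(0)                               # rewind and parse the row again
row = next(csv.reader(buffer))
```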
I have been searching for a solution for this and haven't been able to find one. I have a directory of folders which contain multiple, very-large csv files. I'm looping through each csv in each folder in the directory to replace values of certain headers. I need the headers to be consistent (from file to file) in order to run a different script to process all the data properly.
I found this solution that I thought would work: change first line of a file in python.
However this is not working as expected. My code:
from_file = open(filepath)
# for line in f:
# if
data = from_file.readline()
# print(data)
# with open(filepath, "w") as f:
print 'DBG: replacing in file', filepath
# s = s.replace(search_pattern, replacement)
for i in range(len(search_pattern)):
    data = re.sub(search_pattern[i], replacement[i], data)
# data = re.sub(search_pattern, replacement, data)
to_file = open(filepath, mode="w")
to_file.write(data)
shutil.copyfileobj(from_file, to_file)
I want to replace the header values in search_pattern with values in replacement without saving or writing to a different file - I want to modify the file. I have also tried
shutil.copyfileobj(from_file, to_file, -1)
As I understand it that should copy the whole file rather than breaking it up in chunks, but it doesn't seem to have an effect on my output. Is it possible that the csv is just too big?
I haven't been able to determine a different way to do this or make this way work. Any help would be greatly appreciated!
The answer from change first line of a file in python that you copied doesn't work on Windows.
On Linux, you can open a file for reading & writing at the same time. The system ensures that there's no conflict, but behind the scenes, 2 different file objects are being handled. And this method is very unsafe: if the program crashes while reading/writing (power off, disk full)... the file has a great chance to be truncated/corrupt.
Anyway, in Windows, you cannot open a file for reading and writing at the same time using 2 handles. It just destroys the contents of the file.
So there are 2 options, which are portable and safe:
Create a new file in the same directory and, once the contents are copied, delete the original file and rename the new one.
Like this:
import os
import shutil
filepath = "test.txt"
with open(filepath) as from_file, open(filepath+".new","w") as to_file:
    data = from_file.readline()
    to_file.write("something else\n")
    shutil.copyfileobj(from_file, to_file)
os.remove(filepath)
os.rename(filepath+".new",filepath)
This doesn't take much longer, because the rename operation is instantaneous. Besides, if the program/computer crashes at any point, one of the files (old or new) is valid, so it's safe.
If the patterns have the same length, use read/write mode.
Like this:
filepath = "test.txt"
with open(filepath,"r+") as rw_file:
    data = rw_file.readline()
    data = "h"*(len(data)-1) + "\n"
    rw_file.seek(0)
    rw_file.write(data)
Here we read the line, replace the first line with the same number of h characters, rewind the file, and write the first line back, overwriting the previous contents and keeping the rest of the lines. This is also safe, and even if the file is huge, it's very fast. The only constraint is that the replacement must be exactly the same size as the original (otherwise you would either leave remnants of the previous data or overwrite the following line(s), since no data is shifted).
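Applied to the header-renaming case from the question, the first option could be sketched as follows. The function name rewrite_header and the ".new" suffix are illustrative choices, and os.replace (Python 3.3+) is swapped in for the separate remove and rename steps because it overwrites the destination in one call:

```python
import os
import re
import shutil


def rewrite_header(filepath, search_patterns, replacements):
    """Apply each regex substitution to the first line only, then copy the rest unchanged."""
    tmp_path = filepath + ".new"
    with open(filepath, newline="") as src, open(tmp_path, "w", newline="") as dst:
        header = src.readline()
        for pattern, repl in zip(search_patterns, replacements):
            header = re.sub(pattern, repl, header)
        dst.write(header)
        shutil.copyfileobj(src, dst)   # streams the remaining data in chunks
    os.replace(tmp_path, filepath)     # swap the new file in place of the old one
```

Because the rest of the file is streamed rather than loaded, the size of the CSV should not matter beyond the time it takes to copy it.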
I need to work with a huge bz2 file (5+ GB) using Python. With my current code, I always get a memory error. Somewhere I read that I could use sqlite3 to handle the problem. Is this right? If yes, how should I adapt my code?
(I'm not very experienced with sqlite3...)
Here is the beginning of my current code:
import csv, bz2
names = ('ID', 'FORM')
filename = "huge-file.bz2"
with open(filename) as f:
    f = bz2.BZ2File(f, 'rb')
    reader = csv.DictReader(f, fieldnames=names, delimiter='\t')
    tokens = [sentence for sentence in reader]
After this, I need to go through the 'tokens'. It would be great if I could handle this huge bz2 file, so any help is very welcome! Thank you very much for any advice!
The file is huge, and reading all the file won't work because your process will run out of memory.
The solution is to read the file in chunks/lines, and process them before reading the next chunk.
The list comprehension line
tokens = [sentence for sentence in reader]
is reading the whole file to tokens and it may cause the process to run out of memory.
The csv.DictReader can read the CSV records line by line, meaning on each iteration, 1 line of data will be loaded to memory.
Like this:
with open(filename) as f:
    f = bz2.BZ2File(f, 'rb')
    reader = csv.DictReader(f, fieldnames=names, delimiter='\t')
    for sentence in reader:
        # do something with sentence (process/aggregate/store/etc.)
        pass
Please note that if, inside the loop, the data from each sentence is again stored in another variable (like tokens), a lot of memory may still be consumed, depending on how big the data is. So it's better to aggregate the data, or to use another type of storage for it.
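On Python 3, the same streaming loop can be written with bz2.open in text mode, which decompresses lazily and yields str lines that csv.DictReader accepts. This is a sketch rather than the asker's actual processing: counting FORM values is just one example of aggregating instead of storing, and count_forms is a hypothetical name:

```python
import bz2
import csv
from collections import Counter


def count_forms(filename, names=("ID", "FORM")):
    """Stream a tab-separated .bz2 file and count how often each FORM value occurs."""
    counts = Counter()
    # Text mode ("rt") decompresses on the fly and yields one str line at a time.
    with bz2.open(filename, "rt", newline="") as f:
        reader = csv.DictReader(f, fieldnames=names, delimiter="\t")
        for sentence in reader:
            counts[sentence["FORM"]] += 1   # aggregate instead of keeping every row
    return counts
```

Memory use then grows with the number of distinct values, not with the number of rows.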
Update
To keep some of the previous lines available while processing (as discussed in the comments), you can store the previous line in another variable, which gets replaced on each iteration.
Or, if you need multiple previous lines, you can keep a list of the last n lines.
How
Use a collections.deque with a maxlen to keep track of the last n lines. Import deque from the collections standard module at the top of your file.
from collections import deque
# rest of the code ...
last_sentences = deque(maxlen=5) # keep the previous lines as we need for processing new lines
for sentence in reader:
    # process the sentence
    last_sentences.append(sentence)
I suggest the above solution, but you can also implement it yourself using a list, and manually keep track of its size.
Define an empty list before the loop; at the end of each iteration, check whether the list has reached the length you need, remove the older items, then append the current line.
last_sentences = [] # keep the previous lines as we need for processing new lines
for sentence in reader:
    # process the sentence
    if len(last_sentences) >= 5: # make sure we won't keep all the previous sentences
        last_sentences = last_sentences[-4:] # leave room for the current sentence
    last_sentences.append(sentence)
I have large log files that are in a compressed format, i.e. largefile.gz; these are commonly 4-7 GB each.
Here's the relevant part of the code:
for filename in os.listdir(path):
    if not filename.startswith("."):
        with open(b, 'a') as newfile, gzip.GzipFile(path+filename,'rb') as oldfile:
            # BEGIN Reads each remaining line from the log into a list
            data = oldfile.readlines()
            for line in data:
                parts = line.split()
After this, the code does some calculations (basically totaling up the bytes) and writes to a file a line that says "total bytes for x criteria = y". All of this works fine on a small file, but on a large file it kills the system.
What I think my program is doing is reading the whole file and storing it in data. Correct me if I'm wrong, but I think it's trying to put the whole log into memory first.
Question:
How can I read one line from the compressed file, process it, then move on to the next without trying to store the whole thing in memory first? (Or is it really already doing that? I'm not sure, but based on the activity monitor my guess is that it is trying to load everything into memory.)
Thanks
It wasn't storing the entire content in-memory until you told it to. That is to say -- instead of:
# BAD: stores your whole file's decompressed contents, split into lines, in data
data = oldfile.readlines()
for line in data:
    parts = line.split()
...use:
# GOOD: Iterates a line at a time
for line in oldfile:
    parts = line.split()
...so you aren't storing the entire file in a variable. And obviously, don't store parts anywhere that persists past the one line either.
That easy.
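Put together with the byte totaling described in the question, a streaming version might look like the sketch below. The log layout is an assumption (whitespace-separated fields with the byte count in the last column), and total_bytes is a hypothetical helper name:

```python
import gzip


def total_bytes(path, column=-1):
    """Sum one whitespace-separated integer column of a gzip-compressed log,
    reading a single line at a time."""
    total = 0
    with gzip.open(path, "rt") as log:   # text mode yields one decoded line at a time
        for line in log:
            parts = line.split()
            if parts and parts[column].isdigit():
                total += int(parts[column])
    return total
```

Only the running total and the current line are kept, so memory use should stay flat regardless of the file size.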
I am a beginner in Python. I am trying to figure out why the second 'for' loop doesn't work in the following script. I mean that I only get the result of the first 'for' loop, but nothing from the second one. I have pasted my script and the CSV data below.
It would be helpful if you could tell me why it behaves this way and how to make the second 'for' loop work as well.
My SCRIPT:
import csv
file = "data.csv"
fh = open(file, 'rb')
read = csv.DictReader(fh)
for e in read:
    print(e['a'])
for e in read:
    print(e['b'])
"data.csv":
a,b,c
tree,bough,trunk
animal,leg,trunk
fish,fin,body
The csv reader is an iterator over the file. Once you go through it once, you read to the end of the file, so there is no more to read. If you need to go through it again, you can seek to the beginning of the file:
fh.seek(0)
This will reset the file to the beginning so you can read it again. Depending on the code, it may also be necessary to skip the field name header:
next(fh)
This is necessary for your code, since the DictReader consumed that line the first time around to determine the field names, and it's not going to do that again. It may not be necessary for other uses of csv.
If the file isn't too big and you need to do several things with the data, you could also just read the whole thing into a list:
data = list(read)
Then you can do what you want with data.
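Here is the rewind approach end to end on Python 3, using an in-memory io.StringIO with the question's data in place of the real file so the sketch is self-contained:

```python
import csv
import io

fh = io.StringIO("a,b,c\ntree,bough,trunk\nanimal,leg,trunk\nfish,fin,body\n")
read = csv.DictReader(fh)

first_pass = [e["a"] for e in read]    # consumes the reader to the end

fh.seek(0)   # rewind the underlying file
next(fh)     # skip the header line; the reader already knows the field names
second_pass = [e["b"] for e in read]
```

With a real file the same two calls, fh.seek(0) and next(fh), go between the two loops.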
I have written a small function which takes the path of a CSV file, reads it, and returns a list of dicts all at once, so you can then loop through the list very easily:
import csv
def read_csv_data(path):
    """
    Reads CSV from the given path and returns a list of dicts mapping column names to values
    """
    with open(path, newline="") as f:  # on Python 2.x use: open(path, "rb")
        data = csv.reader(f)
        # Read the column names from the first line of the file
        fields = next(data)
        data_lines = []
        for row in data:
            items = dict(zip(fields, row))
            data_lines.append(items)
    return data_lines
Regards