Save ongoing measurements into a CSV file - Python

I would like to store data in a CSV file, but the data keep growing over time. I wrote a simple example to show the problem:
import csv
import time

i = 0
with open('testfile.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile, delimiter=';', quoting=csv.QUOTE_NONE)
    while True:
        i = i + 1
        print i
        writer.writerow([i])
        time.sleep(2)
While the loop is running, nothing appears in the CSV file; only when I stop the program are the data actually written out.
Is there a way to keep the program running and 'force' the writing to the CSV file?

File writes in Python are buffered. You can force the output to be written (flush the buffer) with:
csvfile.flush()
In your code, I suggest adding this line right after writer.writerow([i]).
You could also pass a buffering argument to the open() function, but I suggest you do not: switching buffering off comes with a performance penalty.
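Applied to the loop from the question (keeping its Python 2 syntax), the change would look like this; a minimal sketch rather than a drop-in replacement:

import csv
import time

i = 0
with open('testfile.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile, delimiter=';', quoting=csv.QUOTE_NONE)
    while True:
        i = i + 1
        print i
        writer.writerow([i])
        csvfile.flush()  # push the buffered row out to disk immediately
        time.sleep(2)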

Related

(Solved) 32768 row limit on writing CSV file from Serial port in Python

I am reading a force sensor on an Arduino over the serial port. For research reasons, I need to sample the sensor as often as possible and save each reading to a CSV file at every sampling point (I cannot keep the data in a variable and write it to the CSV file only once at the end). Elsewhere in the code, I also need to read from the CSV file.
The problem I have right now is that the code (shown below) starts out fine and appears to keep running normally forever (it never throws an error). However, every time I run it, the CSV stops updating once it reaches 32768 lines. I know there is a 32,767-character limit per cell (although I'm not sure exactly what counts as a cell), but I don't see how that could be relevant to the number of rows in my case.
This is the code that reads from the serial port and appends each value to the CSV file (ku is a multiprocessing queue, lock is a multiprocessing lock that limits access to the file while it is being written, and delay_serial adds a small delay between serial reads):
def make_measurement(delay_serial, lock, ku, filename_sensor):
    ser = serial.Serial(sPort, 115200)
    ser.close()
    ser.open()
    while True:
        lock.acquire()
        tmp_data = ser.readline()
        try:
            tmp_data = float(tmp_data.strip())
        except ValueError:
            tmp_data = np.nan
        this_data = [tmp_data, time.time()]
        sleep(delay_serial)
        with open(filename_sensor, 'a', newline='') as fp:
            # Pass the CSV file object to the writer() function
            writer_object = writer(fp, dialect='excel')
            # Result - a writer object
            # Pass the data in the list as an argument into the writerow() function
            writer_object.writerow(this_data)
            # Close the file object
            fp.close()
        ku.put(this_data)
        lock.release()
UPDATE: Thanks to everyone for your help and comments. It seems the problem was with ku.put(this_data), which keeps the latest reading in a buffer to be read later in the code. Since I am saving all the data anyway, instead of putting the latest reading into that buffer I now read it back from the saved CSV file. With that change, the CSV file grows without any apparent limit.
I tested the following code and it writes all the lines.
Edit your code based on it.
import csv

for i in range(100000):
    with open("test_file.csv", 'a', newline='') as fl:
        writer = csv.writer(fl)
        writer.writerow([i, 1.5 * i])
        writer = None
        # No need for fl.close(); the with statement does that.
Also, there is no 32,767-character limit on CSV file cells; that is an Excel limitation. If you open a CSV file with Excel, you have to respect Excel's cell, column, and row limits, which are:
32,767 characters per cell
1,048,576 rows
16,384 columns.
You can see which specs are supported by Excel here.
Note that these limitations are not inherent to CSV though.
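As a quick sanity check (a minimal sketch reusing the test_file.csv produced by the loop above), you can count the rows with plain Python rather than opening the file in Excel:

# Count the rows in the CSV written above; no Excel row limit applies here.
with open("test_file.csv", newline="") as fl:
    row_count = sum(1 for _ in fl)

print(row_count)  # 100000 -- well beyond the 32768 rows where the file seemed to stop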

Reading a file without locking it in Python

I want to read a file without putting any lock on it.
with open(source, "rb") as infile:
    data = infile.read()
Can the code above lock the source file?
The source file can be updated with new rows at any time (while my script is running, for example).
I think not, because it is opened in read-only mode ("rb"). But I found that the Windows API can be used to read it without a lock. I have not found a simple answer to my question.
My script runs locally, but the source file and the script/software that appends to it are not local (they are on a network drive).
Opening a file does not put a lock on it. In fact, if you needed to ensure that separate processes did not access a file simultaneously, all of these processes would have to cooperatively take special steps to ensure that only a single process accessed the file at one time (see Locking a file in Python). This can also be demonstrated by the following small program, which deliberately takes its time reading a file so that another process (namely me with a text editor) has a chance to append data to the end of the file while it is running. The program reads and prints the file one byte at a time, pausing 0.1 seconds between reads. While it was running I appended some text to the end of the file, and the program printed the additional text:
import time

with open('test.txt', "rb") as infile:
    while True:
        data = infile.read(1)
        if data == b'':
            break
        time.sleep(.1)
        print(data.decode('ascii'), end='', flush=True)
You can read your file in pieces and then join these pieces together if you need one single byte string. But this will not be as memory efficient as reading the file with a single read:
BLOCKSIZE = 64 * 1024  # or some other value depending on the file size

with open(source, "rb") as infile:
    blocks = []
    while True:
        data = infile.read(BLOCKSIZE)
        if data == b'':
            break
        blocks.append(data)

# if you need the data in one piece (otherwise the pieces are in blocks):
data = b''.join(blocks)
One alternative is to make a copy of the file temporarily and read the copy.
You can use the shutil package for such a task:
import os
import time
from shutil import copyfile

def read_file_non_blocking(file):
    temp_file = f"{file}-{time.time()}"  # Stores the copy in the local directory
    copyfile(file, temp_file)
    with open(temp_file, 'r') as my_file:
        # Do something cool with my_file here
        my_file.close()  # not strictly needed; the with statement closes the file
    os.remove(temp_file)
Windows is weird in how it handles files if, like me, you are used to POSIX-style file handling. I have run into this issue numerous times and have been lucky enough to avoid having to solve it. However, if I had to solve it in this case, I would look at the flags that can be passed to os.open and see whether any of them can disable the locking.
https://docs.python.org/3/library/os.html#os.open
I would do a little testing, but I don't have a non-production-critical Windows workstation to test on.
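If os.open's flags turn out not to be enough, a hedged sketch of the Windows API route mentioned in the question (assuming the third-party pywin32 package, which exposes CreateFile and its sharing flags; the path is hypothetical) could look like this:

import win32file  # pip install pywin32

# Open the file while explicitly allowing other processes to keep reading and writing it.
handle = win32file.CreateFile(
    r"\\server\share\source.dat",       # hypothetical network path
    win32file.GENERIC_READ,             # we only need read access
    win32file.FILE_SHARE_READ | win32file.FILE_SHARE_WRITE,
    None,                               # default security attributes
    win32file.OPEN_EXISTING,            # fail if the file does not exist
    0,                                  # no special flags or attributes
    None,                               # no template file
)
try:
    # ReadFile returns an (error code, bytes) pair.
    _, data = win32file.ReadFile(handle, 64 * 1024)
finally:
    handle.Close()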

python read bigger csv line by line

Hello, I have a huge CSV file (1 GB) that can be updated (the server often adds new values).
In Python I want to read this file line by line (not load the whole file into memory), and I want to read it in "real time".
This is an example of my CSV file:
id,name,lastname
1,toto,bob
2,tutu,jordan
3,titi,henri
First, I want to get the header of the file (the column names); in my example that is: id,name,lastname.
Second, I want to read the file line by line, not load the whole file into memory.
Third, I want to try to read new values every 10 seconds (with sleep(10), for example).
I am currently looking for a solution using pandas.
I read this topic:
Reading a huge .csv file
import pandas as pd

chunksize = 10 ** 8
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
But I don't understand:
1) I don't know the size of my CSV file, so how do I define chunksize?
2) When I have finished reading, how do I tell pandas to check for new values every 10 seconds (for example)?
Thanks in advance for your help.
First of all, 1 GB is not huge - pretty much any modern device can keep that in its working memory. Second, pandas doesn't let you poke around in the CSV file; you can only tell it how much data to 'load'. I'd suggest using the built-in csv module if you want to do more advanced CSV processing.
Unfortunately, the csv module's reader() produces an exhaustible iterator for your file, so you cannot just build it as a simple loop and wait for the next lines to become available - you'll have to collect the new lines manually and then feed them to it to achieve the effect you want, something like:
import csv
import time

filename = "path/to/your/file.csv"

with open(filename, "rb") as f:  # on Python 3.x use: open(filename, "r", newline="")
    reader = csv.reader(f)  # create a CSV reader
    header = next(reader)  # grab the first line and keep it as a header reference
    print("CSV header: {}".format(header))
    for row in reader:  # iterate over the available rows
        print("Processing row: {}".format(row))  # process each row however you want
    # file exhausted, entering a 'waiting for new data' state where we manually read new lines
    while True:  # process ad infinitum...
        reader = csv.reader(f.readlines())  # create a CSV reader for the new lines
        for row in reader:  # iterate over the new rows, if any
            print("Processing new row: {}".format(row))  # process each row however you want
        time.sleep(10)  # wait 10 seconds before attempting again
Beware of the edge cases that may break this process - for example, if you attempt to read new lines while they are being added, some data might get lost or split (depending on the flushing mechanism used for the additions), and if you delete previous lines the reader may get thrown off, etc. If at all possible, I'd suggest controlling the CSV writing process in such a way that it explicitly informs your processing routines.
UPDATE: The above processes the CSV file line by line; it never gets loaded whole into working memory. The only part that loads more than one line into memory is when an update to the file occurs and it picks up all the new lines, because it's faster to process them that way and, unless you're expecting millions of rows of updates between two checks, the memory impact would be negligible. However, if you want that part processed line by line as well, here's how to do it:
import csv
import time

filename = "path/to/your/file.csv"

with open(filename, "rb") as f:  # on Python 3.x use: open(filename, "r", newline="")
    reader = csv.reader(f)  # create a CSV reader
    header = next(reader)  # grab the first line and keep it as a header reference
    print("CSV header: {}".format(header))
    for row in reader:  # iterate over the available rows
        print("Processing row: {}".format(row))  # process each row however you want
    # file exhausted, entering a 'waiting for new data' state where we manually read new lines
    while True:  # process ad infinitum...
        line = f.readline()  # collect the next line, if any available
        if line.strip():  # new line found, we'll ignore empty lines too
            row = next(csv.reader([line]))  # load a line into a reader, parse it immediately
            print("Processing new row: {}".format(row))  # process the row however you want
            continue  # avoid waiting before grabbing the next line
        time.sleep(10)  # wait 10 seconds before attempting again
Chunk size is the number of lines it would read at once, so it doesn't depend on the file size. At the end of the file the for loop will end.
The chunk size depends on the optimal amount of data to process at once. In some cases 1 GB is not a problem, as it fits in memory, and you don't need chunks. If you aren't OK with 1 GB loaded at once, you can pick, for example, 1M lines (chunksize = 1e6); with lines about 20 characters long that is somewhat less than 100 MB, which seems reasonably low, but you can vary the parameter depending on your conditions.
When you need to read the updated file, you just start your for loop once again.
If you don't want to read the whole file just to find out that it hasn't changed, you can look at its modification time (details here) and skip reading when it hasn't changed.
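A minimal sketch of that check (the file name and 10-second interval are just the ones from the question):

import os
import time

filename = "path/to/your/file.csv"
last_mtime = 0.0

while True:
    mtime = os.path.getmtime(filename)  # modification time as a Unix timestamp
    if mtime != last_mtime:             # the file has changed since the last check
        last_mtime = mtime
        # ... re-read / process the file here ...
    time.sleep(10)                      # check again in 10 seconds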
If the question is about reading every 10 seconds, that can be done with an infinite loop and sleep, like:
import time

while True:
    do_what_you_need()
    time.sleep(10)
In fact, the period will be a bit more than 10 seconds, as do_what_you_need() also takes time.
If the question is about reading the tail of a file, I don't know a good way to do that in pandas, but there are some workarounds.
The first idea is just to read the file without pandas and remember the last position. The next time you need to read, you can use seek. Or you can combine seek with pandas by using StringIO as the source for pandas.read_csv, as sketched below.
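A hedged sketch of that seek-and-parse approach (the file name is a placeholder and the column names are the ones from the question's example):

import io
import time

import pandas as pd

filename = "path/to/your/file.csv"
columns = ["id", "name", "lastname"]  # header from the question's example

with open(filename, "r", newline="") as f:
    f.readline()             # skip the header line once
    position = f.tell()      # remember where the data starts
    while True:
        f.seek(position)     # jump back to where we stopped last time
        new_text = f.read()  # everything appended since then
        position = f.tell()  # remember the new end of file
        if new_text.strip():
            # parse only the new rows with pandas, via an in-memory buffer
            new_rows = pd.read_csv(io.StringIO(new_text), names=columns)
            print(new_rows)
        time.sleep(10)       # check again in 10 seconds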
The other workaround is to use the Unix command tail to cut off the last n lines, if you are sure not too much was added at once. It will read the whole file, but it is much faster than reading and parsing all the lines with pandas. Still, seek is theoretically faster on very long files. Here you need to check whether too many lines were added (you don't see the last processed id); in that case you will need a longer tail or have to read the whole file.
All of that involves additional code, logic, and potential mistakes. One of them is that the last line could be incomplete (if you read it at the moment it is being written). So the approach I like most is to switch from a text file to SQLite, an SQL-compatible database that stores its data in a single file and doesn't need a separate server process to access it. It has a Python library (sqlite3) that makes it easy to use. It handles all the work with long files, simultaneous writing and reading, and reading only the data you need. Just save the last processed id and make a query like SELECT * FROM table_name WHERE id > last_processed_id;. Of course, this is only possible if you also control the server code and can save data in this format.
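A minimal sketch of the reading side with the standard sqlite3 module (the database file, table, and column names here are hypothetical):

import sqlite3
import time

last_processed_id = 0

conn = sqlite3.connect("measurements.db")  # hypothetical database file
while True:
    rows = conn.execute(
        "SELECT id, name, lastname FROM table_name WHERE id > ?",
        (last_processed_id,),
    ).fetchall()
    for row in rows:
        print(row)                  # process the new row however you want
        last_processed_id = row[0]  # remember the last id we have seen
    time.sleep(10)                  # poll again in 10 seconds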

Python saving data inside Memory? (ram)

I am new to Python, and I didn't know about this until now.
I have a basic program with a for loop that requests data from a site and saves it to a text file.
But when I checked my task manager I saw that the memory usage only increases. This might be a problem for me when running it for a long time.
Is it standard for Python to do this, or can you change it?
Here is what the program basically is:
savefile = open("file.txt", "r+")
for i in savefile:
#My code goes here
savefile.write(i)
#end of loop
savefile.close()
Python does not write to the file until you call .close() or .flush(), or until its internal buffer reaches a certain size. This question might help you: How often does python flush to a file?
As #Almog said, Python does not write to the file immediately. Because of this, every line you write to the file gets stored into RAM until you use savefile.close(), which flushes the internal buffer and writes everything to the file. This would explain the extra memory usage.
Try changing the loop to this:
savefile = open('file.txt', 'r+')
for i in savefile:
    savefile.write(i)
    savefile.flush()  # flushes the buffer, saving RAM
savefile.close()
There is a better, more Pythonic solution to this:
with open("your_file.txt", "r+") as file_name:  # use whatever mode your program needs
    for line in file_name:
        file_name.write(line)
        file_name.flush()
This code flushes the file after each line, and when it finishes the file is closed automatically thanks to the with statement.

Python - Reading from a text file that is being written in Windows

I am using Windows 7 and Python 2.7. In one program I am writing to a text file through a single file handle, and it keeps writing new data/numbers for several minutes.
In a separate program, started after the writing has begun, I am trying to read from the file that is being written in order to update and plot the data in it.
While the first program is writing the data, I am unable to read it until the first program finishes. Here is some example code to illustrate my point:
Program 1:
import time

fid = open("test1.txt", "w+")
for i in range(0, 5):
    fid.write(str(i) + "\n")
    print(i)
    time.sleep(5)
fid.close()
Program 2:
fid = open("test1.txt", "r+")
dataList = fid.read().splitlines()
print(dataList)
fid.close()
Executing Program 2 while Program 1 is running does not allow me to see any changes until Program 1 is completed.
Is there a way to fix this issue? I need to keep the reading and writing in two separate programs.
This might be caused by buffering in program 1. You can try flushing the output in program 1 after each write:
fid.write(str(i) + "\n")
fid.flush()
Another thing you can try is to run the Python interpreter in unbuffered mode for program 1. Use the python -u option.
Also, do you need to open the file for update (mode r+) in program 2? If you just want to read it, mode r is sufficient, or you can omit the mode when calling open().
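For completeness, a hedged sketch of a polling version of program 2 (the one-second interval is just an example), assuming program 1 flushes after each write as suggested above:

import time

seen = 0  # number of lines already processed
while True:
    with open("test1.txt", "r") as fid:
        dataList = fid.read().splitlines()
    for value in dataList[seen:]:
        print(value)   # update/plot the new value here
    seen = len(dataList)
    time.sleep(1)      # poll the file again in a second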
