How to read a csv file in tail -f manner using python? - python

I want to read a CSV file in a manner similar to tail -f, i.e. like watching an error log file as it grows.
I can perform this operation on a plain text file with this code:
while 1:
    where = self.file.tell()
    line = self.file.readline()
    if not line:
        print "No line waiting, waiting for one second"
        time.sleep(1)
        self.file.seek(where)
    if (re.search('[a-zA-Z]', line) == False):
        continue
    else:
        response = self.naturalLanguageProcessing(line)
        if(response is not None):
            response["id"] = self.id
            self.id += 1
            response["tweet"] = line
            self.saveResults(response)
        else:
            continue
How do I perform the same task for a csv file? I have gone through a link which can give me the last 8 rows, but that is not what I require. The csv file will be updated continuously, and I need to get the newly appended rows.

Connecting A File Tailer To A csv.reader
In order to plug your code that looks for content newly appended to a file into a csv.reader, you need to put it into the form of an iterator.
I'm not intending to showcase correct code, but specifically to show how to adapt your existing code into this form, without making assertions about its correctness. In particular, the sleep() would be better replaced with a mechanism such as inotify to let the operating system assertively inform you when the file has changed; and the seek() and tell() would be better replaced with storing partial lines in memory rather than backing up and rereading them from the beginning over and over.
import csv
import time

class FileTailer(object):
    def __init__(self, file, delay=0.1):
        self.file = file
        self.delay = delay

    def __iter__(self):
        while True:
            where = self.file.tell()
            line = self.file.readline()
            if line and line.endswith('\n'):  # only emit full lines
                yield line
            else:  # for a partial line, pause and back up
                time.sleep(self.delay)  # ...not actually a recommended approach.
                self.file.seek(where)

csv_reader = csv.reader(FileTailer(open('myfile.csv')))
for row in csv_reader:
    print("Read row: %r" % (row,))
If you create an empty myfile.csv, start python csvtailer.py, and then echo "first,line" >>myfile.csv from a different window, you'll see the output of Read row: ['first', 'line'] immediately appear.
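As mentioned above, the polling sleep() can be replaced by letting the kernel report changes. A rough sketch of that idea with the third-party pyinotify package (Linux-only; the file name and the print-instead-of-parse behaviour are assumptions for illustration, and a partially written last line is still not handled):
import pyinotify  # third-party, Linux-only: pip install pyinotify

f = open('myfile.csv')
f.seek(0, 2)  # start at the current end of the file, like tail -f

class Appended(pyinotify.ProcessEvent):
    def process_IN_MODIFY(self, event):
        # drain whatever lines were appended since the last event
        for line in f:
            print("new data:", line.rstrip())

wm = pyinotify.WatchManager()
wm.add_watch('myfile.csv', pyinotify.IN_MODIFY)
pyinotify.Notifier(wm, Appended()).loop()  # blocks; the kernel wakes us on each write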
Finding A Correct File Tailer In Python
For a correctly-implemented iterator that waits for new lines to be available, consider referring to one of the existing StackOverflow questions on the topic:
How to implement a pythonic equivalent of tail -F?
Reading infinite stream - tail
Reading updated files on the fly in Python

Related

In Python (SageMath 9.0) - text file on 1B lines - optimal way to read from a specific line

I'm running SageMath 9.0, on Windows 10 OS
I've read several similar questions (and answers) on this site, mainly this one on reading from the 7th line, and this one on optimizing. But I have some specific issues: I need to understand how to optimally read from a specific (possibly very far away) line, and whether I should read line by line, or whether reading by block could be "more optimal" in my case.
I have a 12 GB text file, made of around 1 billion small lines, all made of ASCII printable characters. Each line has a constant number of characters. Here are the actual first 5 lines:
J??????????
J???????C??
J???????E??
J??????_A??
J???????F??
...
For context, this file is a list of all non-isomorphic graphs on 11 vertices, encoded using the graph6 format. The file has been computed and made available by Brendan McKay on his webpage here.
I need to check every graph for some properties. I could use the generator for G in graphs(11), but this can be very long (a few days at least on my laptop). I want to use the complete database in the file, so that I'm able to stop and start again from a certain point.
My current code reads the file line by line from the start, and does some computation after reading each line:
with open(filename, 'r') as file:
    while True:
        # Get next line from file
        line = file.readline()
        # if line is empty, end of file is reached
        if not line:
            print("End of Database Reached")
            break
        G = Graph()
        from_graph6(G, line.strip())
        run_some_code(G)
In order to be able to stop the code, or save the progress in case of a crash, I was thinking of:
Every million lines read (or so), save the progress to a specific file
When restarting the code, read the last saved value, and instead of using line = file.readline(), use the itertools option for line in islice(file, start_line, None)
so that my new code is:
from itertools import islice

start_line = load('foo')
count = start_line
save_every_n_lines = 1000000

with open(filename, 'r') as file:
    for line in islice(file, start_line, None):
        G = Graph()
        from_graph6(G, line.strip())
        run_some_code(G)
        count += 1
        if (count % save_every_n_lines) == 0:
            save(count, 'foo')
The code does work, but I would like to understand whether I can optimise it. I'm not a big fan of the if statement in my for loop.
Is itertools.islice() the right option here? The documentation states "If start is non-zero, then elements from the iterable are skipped until start is reached". As "start" could be quite large, and given that I'm working on simple text files, could there be a faster option, in order to "jump" directly to the start line?
Knowing that the text file is fixed, could it be more optimal to split the actual file into 100 or 1000 smaller files and read them one by one? This would get rid of the if statement in my for loop.
I also have the option to read blocks of lines in one go instead of line by line, and then work on a list of graphs. Could that be a good option?
Each line has a constant number of characters. So "jumping" might be feasible.
Assuming each line is the same size, you can use a memory-mapped file and read it by index without mucking about with seek and tell. The memory-mapped file emulates a bytearray, and you can take record-sized slices from the array for the data you want. If you want to pause processing, you only have to save the current record index in the array, and you can start up again with that index later.
This example is on Linux - mmap open on Windows is a bit different - but after it's set up, access should be the same.
import os
import mmap

# I think this is the record plus newline
LINE_SZ = 12
RECORD_SZ = LINE_SZ - 1

# generate test file
testdata = "testdata.txt"

with open(testdata, 'wb') as f:
    for i in range(100):
        f.write("R{: 10}\n".format(i).encode('ascii'))

f = open(testdata, 'rb')
data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# the i-th record is
i = 20
record = data[i*LINE_SZ:i*LINE_SZ+RECORD_SZ]
print("record 20", record)

# you can stick it in a function. this is a bit slower, but encapsulated
def get_record(mmapped_file, index):
    return mmapped_file[index*LINE_SZ:index*LINE_SZ+RECORD_SZ]

print("get record 20", get_record(data, 20))

# to enumerate
def enum_records(mmapped_file, start, stop=None, step=1):
    if stop is None:
        stop = mmapped_file.size() // LINE_SZ
    for pos in range(start*LINE_SZ, stop*LINE_SZ, step*LINE_SZ):
        yield mmapped_file[pos:pos+RECORD_SZ]

print("enum 6 to 8", [record for record in enum_records(data, 6, 9)])

del data
f.close()
If the length of the line is constant (in this case it's 12: 11 characters plus the newline), you might do
def get_line(k, line_len):
    with open('file') as f:
        f.seek(k*line_len)
        return next(f)
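For example, a hypothetical usage (k is zero-based and line_len counts the newline):
# fetch the millionth line of a file whose lines are 12 bytes long
print(get_line(999999, 12).strip())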

How to create a non-blocking generator to read a file?

I'm trying to create a file generator which would allow me to keep reading a file (CSV) line by line, and keep running as new lines get added to the file (like a continuous log), but also keep waiting/running in the background when no new lines are found in the log.
I've tried using aiofiles, but I couldn't figure out how to run the async function from my sync function/main function. Then I tried trio, and using the following code I'm able to read the lines in the file.
import csv
import trio

async def open_file(filepath):
    with open(filepath, newline='') as f:
        first_line = True
        _reader = csv.reader(f, lineterminator='\n')
        for row in _reader:
            if row:
                # skip header row
                if first_line:
                    first_line = False
                else:
                    print(tuple(row))  # yield tuple(row) here gives the error stated below
            else:
                await trio.sleep(1)

trio.run(open_file, 'products.csv')
But the script stops after reading the rows and doesn't wait for more rows in the background.
And when I replace with print(row) with yield tuple(row) (which will actually return a generator), I get the error TypeError: start_soon expected an async function but got an async generator <async_generator object open_file at 0x1050944d0>.
So printing the rows works fine, but yielding does not.
How can I fix this? Also, will this be able to help read lines in parallel?
Update:
Please note that I have to use csv.reader to read the lines, as some of the rows contain \n, and this is the only way to read the records properly.
An iterator won't help in your case, because an iterator stops immediately when it reaches the end of the file.
You could probably look at the similar functionality in the following module: https://github.com/kasun/python-tail/blob/master/tail.py
def follow(filename):
    with open(filename) as file_:
        file_.seek(0, 2)  # Remove it if you need to scan the file from the beginning
        while True:
            curr_position = file_.tell()
            line = file_.readline()
            if not line:
                file_.seek(curr_position)
                yield None
            else:
                yield line
This gives you a generator that yields None if a line is not ready yet and a string if the next line in the file is available. Using the next() function you can fetch lines one by one in a non-blocking manner.
Here is how you use it:
non_blocking_reader = follow("my_file.txt")
# do something
line = next(non_blocking_reader)
if line is not None:  # You need to distinguish None from empty string, so use `is not`, not just `if line:`
    # do something else
# do something else
next_line = next(non_blocking_reader)
# ...
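Since the update says the rows must go through csv.reader (some records contain embedded \n), one way to connect the two is to wrap follow() in a blocking iterator that sleeps instead of yielding None. This is only a sketch under that assumption; the file name is made up, and like the FileTailer near the top of the page it does not guard against a partially written last line:
import csv
import time

def complete_lines(filename, delay=1.0):
    # blocking wrapper around follow(): sleep when nothing is ready instead of yielding None
    for line in follow(filename):
        if line is None:
            time.sleep(delay)
        else:
            yield line

# csv.reader pulls additional physical lines when a quoted field contains '\n'
for row in csv.reader(complete_lines("products.csv")):
    print(tuple(row))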

AWK to Python For Mbox

What would be the best Pythonic way of implementing this awk command in python?
awk 'BEGIN{chunk=0} /^From /{msgs++;if(msgs==500){msgs=0;chunk++}}{print > "chunk_" chunk ".txt"}' mbox
I'm using this now to split up enormous mailbox (mbox format) files.
I'm trying a recursive method right now.
def chunkUp(mbox, chunk=0):
    with open(mbox, 'r') as bigfile:
        msg = 0
        for line in bigfile:
            if msg == 0:
                with open("./TestChunks/chunks/chunk_"+str(chunk)+".txt", "a+") as cf:
                    if line.startswith("From "): msg += 1
                    cf.write(line)
            if msg > 20: chunkUp(mbox, chunk+1)
I would love to be able to implement this in python and be able to resume progress if it is interrupted. Working on that bit now.
I'm tying my brain into knots! Cheers!
Your recursive approach is doomed to fail: you may end up having too many open files at once, since the with blocks don't exit until the end of the program.
Better to keep one handle open and write to it, closing it and opening a new handle when "From " is encountered.
Also, open your files in write mode, not append. The code below does the minimal operations and tests needed to write each line to a file, closing the current file and opening another when a "From " line is found. At the end, the last file is closed.
def chunkUp(mbox):
    with open(mbox, 'r') as bigfile:
        handle = None
        chunk = 0
        for line in bigfile:
            if line.startswith("From "):
                # next (or first) file
                chunk += 1
                if handle is not None:
                    handle.close()
                    handle = None
            # file was closed / first file: create a new one
            if handle is None:
                handle = open("./TestChunks/chunks/chunk_{}.txt".format(chunk), "w")
            # write the line in the current file
            handle.write(line)
        if handle is not None:
            handle.close()
I haven't tested it, but it's simple enough that it should work. If the file doesn't start with a "From " line, the lines before the first one are stored in chunk_0.txt.
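Note that the awk command above starts a new chunk every 500 messages rather than one chunk per message. A more literal, untested sketch of that behaviour (the output file prefix is an assumption):
def chunk_up(mbox, msgs_per_chunk=500, prefix="chunk_"):
    # mirror the awk logic: count "From " lines, roll over to a new file every msgs_per_chunk
    chunk = 0
    msgs = 0
    out = open("{}{}.txt".format(prefix, chunk), "w")
    with open(mbox, "r") as bigfile:
        for line in bigfile:
            if line.startswith("From "):
                msgs += 1
                if msgs == msgs_per_chunk:
                    msgs = 0
                    chunk += 1
                    out.close()
                    out = open("{}{}.txt".format(prefix, chunk), "w")
            out.write(line)
    out.close()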

Check if a file is modified in Python

I am trying to create a bot that tells me if a text file is modified or not; if it is modified, it prints out the new text inside of it. This should be in an infinite loop (the bot sleeps until the text file is modified).
I have tried this code but it doesn't work.
while True:
    tfile1 = open("most_recent_follower.txt", "r")
    SMRF1 = tfile1.readline()
    if tfile1.readline() == SMRF1:
        print(tfile1.readline())
But this is totally not working... I am new to Python, can anyone help me?
def read_file():
    with open("most_recent_follower.txt", "r") as f:
        SMRF1 = f.readlines()
    return SMRF1

initial = read_file()

while True:
    current = read_file()
    if initial != current:
        for line in current:
            if line not in initial:
                print(line)
        initial = current
Read the file in once to get its initial state, then continuously re-read it. When it changes, print out the lines that were added.
I don't know what bot you are referring to, but this code, and yours, will continuously read the file; it never exits.
I might suggest copying the file to a safe duplicate location, and possibly using a diff program to determine whether the current file is different from the original copy, and printing the added lines. If you just want appended lines, you might try a utility like tail.
You can also use a library like pyinotify to only trigger when the filesystem detects that the file has been modified.
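For example, here is a minimal sketch of that event-driven approach using the third-party watchdog package (cross-platform; the file name and the re-read-the-whole-file behaviour are assumptions for illustration):
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class PrintOnChange(FileSystemEventHandler):
    def on_modified(self, event):
        # fires whenever something in the watched directory changes
        if event.src_path.endswith("most_recent_follower.txt"):
            with open(event.src_path) as f:
                print(f.read())

observer = Observer()
observer.schedule(PrintOnChange(), path=".", recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)  # the handler runs on the observer's worker thread
except KeyboardInterrupt:
    observer.stop()
observer.join()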
This is the first result on Google for "check if a file is modified in python" so I'm gonna add an extra solution here.
If you're curious if a file is modified in the sense that its contents have changed, OR it was touched, then you can use os.stat:
import os

get_time = lambda f: os.stat(f).st_ctime

fn = 'file.name'
prev_time = get_time(fn)
while True:
    t = get_time(fn)
    if t != prev_time:
        do_stuff()
        prev_time = t

Open text file, print new lines only in python

I am opening a text file which, once created, is constantly being written to, and printing only the new lines to a console, as I don't want to reprint the whole text file each time. I check whether the file has grown in size and, if it has, print the next new lines. This mostly works, but occasionally it gets confused about where the new lines start, and new content appears a few lines up, mixed in with the old lines.
Is there a better way to do this? Below is my current code.
infile = "Null"
while not os.path.exists(self.logPath):
    time.sleep(.1)

if os.path.isfile(self.logPath):
    infile = codecs.open(self.logPath, encoding='utf8')
else:
    raise ValueError("%s isn't a file!" % file_path)

lastSize = 0
lastLineIndex = 0
while True:
    wx.Yield()
    fileSize = os.path.getsize(self.logPath)
    if fileSize > lastSize:
        lines = infile.readlines()
        newLines = 0
        for line in lines[lastLineIndex:]:
            newLines += 1
            self.running_log.WriteText(line)
        lastLineIndex += newLines
        if "DBG-X: Returning 1" in line:
            self.subject = "FAILED! - "
            self.sendEmail(self)
            break
        if "DBG-X: Returning 0" in line:
            self.subject = "PASSED! - "
            self.sendEmail(self)
            break
        fileSize1 = fileSize
        infile.flush()
        infile.seek(0)

infile.close()
Also my application freezes whilst waiting for the text file to be created, as it takes a couple of seconds to appear, which isn't great.
Cheers.
This solution could help. You'd also have to do a bit of waiting until the file appears, using os.path.isfile and time.sleep.
Maybe you could:
open the file each time you need to read from it,
use lastSize as an argument to seek directly to where you stopped at the last read.
Additional comment: I don't know if you need some protection, but I think you should not bother testing whether the given filename is a file or not; just open it in a try...except block and catch problems if any.
As for the freezing of your application, you may want to use some kind of threading: one thread, your main one, handles the GUI, and a second one waits for the file to be created. Once the file exists, the second thread sends signals to the GUI thread containing the data to be displayed.
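A rough sketch of both ideas combined: a background thread waits for the file, remembers the last offset it read, and hands new lines to the GUI thread with wx.CallAfter. The names self.logPath and self.running_log come from the question; everything else is an assumption, not tested code:
import os
import time
import threading
import wx

def tail_log(path, post_line, poll=0.5):
    # runs on a worker thread so the GUI never blocks
    while not os.path.exists(path):
        time.sleep(0.1)
    last_pos = 0
    while True:
        size = os.path.getsize(path)
        if size > last_pos:
            with open(path, encoding='utf8') as f:
                f.seek(last_pos)  # jump straight to where we stopped last time
                for line in f:
                    wx.CallAfter(post_line, line)  # marshal each new line onto the GUI thread
                last_pos = f.tell()
        time.sleep(poll)

# in the frame's __init__ (hypothetical wiring):
# threading.Thread(target=tail_log,
#                  args=(self.logPath, self.running_log.WriteText),
#                  daemon=True).start()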
