I'm trying to create a file generator which would allow me to keep reading a file (CSV) line by line, and keep running as new lines get added to the file (like a continuous log), but also keep waiting/running in the background when no new lines are found in the log.
I've tried using aiofiles, but I couldn't figure out how to run the async function from my sync function/main function. Then I tried trio, and using the following code I'm able to read the lines in the file.
import csv
import trio

async def open_file(filepath):
    with open(filepath, newline='') as f:
        first_line = True
        _reader = csv.reader(f, lineterminator='\n')
        for row in _reader:
            if row:
                # skip header row
                if first_line:
                    first_line = False
                else:
                    print(tuple(row))  # yield tuple(row) here gives the error stated below
            else:
                await trio.sleep(1)

trio.run(open_file, 'products.csv')
But the script stops after reading the rows and doesn't wait for more rows in the background.
And when I replace print(row) with yield tuple(row) (which actually turns the function into a generator), I get the error TypeError: start_soon expected an async function but got an async generator <async_generator object open_file at 0x1050944d0>.
So printing the rows works fine, but yielding does not.
How can I fix this? Also, will this be able to help read lines in parallel?
Update:
Please note that I have to use csv.reader to read the lines, as some of the rows contain \n and this is the only way to read the records properly.
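For anyone wondering why: a quoted CSV field may legitimately contain a newline, so one record can span two physical lines. A tiny illustration (made-up data, not my real file):

import csv
import io

# one header row plus one record whose quoted field spans two physical lines
sample = 'id,description\n1,"first part\nsecond part"\n'
rows = list(csv.reader(io.StringIO(sample)))
print(rows)  # [['id', 'description'], ['1', 'first part\nsecond part']]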
An iterator won't help in your case, because it stops immediately when it reaches the end of the file.
You could look up similar functionality in the following module: https://github.com/kasun/python-tail/blob/master/tail.py
def follow(filename):
    with open(filename) as file_:
        file_.seek(0, 2)  # Remove it if you need to scan the file from the beginning
        while True:
            curr_position = file_.tell()
            line = file_.readline()
            if not line:
                file_.seek(curr_position)
                yield None
            else:
                yield line
Then you can create a generator that yields None if a line is not ready yet and a string if there is a next line available in the file. Using the next() function you can fetch lines one by one in a non-blocking manner.
Here is how you use it:
non_blocking_reader = follow("my_file.txt")
# do something
line = next(non_blocking_reader)
if line is not None:  # You need to distinguish None from an empty string, so use `is not`, not just `if line:`
    # do something else
# do something else
next_line = next(non_blocking_reader)
# ...
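If you would rather consume it in a loop than call next() by hand, one possible pattern (the file name and sleep interval are just placeholders) is to pause whenever None comes back:

import time

non_blocking_reader = follow("my_file.txt")
while True:
    line = next(non_blocking_reader)
    if line is None:       # no new data yet
        time.sleep(1)      # wait a bit before polling again
        continue
    print(line, end='')    # or hand the line to your own processing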
Related
I have this code for opening a big file:
fr = open('X1','r')
text = fr.read()
print(text)
fr.close()
When I open it with gedit, each row is shown with its line number, but in the terminal it is printed without any row numbers.
So it is difficult to distinguish between different rows.
How can I print the row numbers in my Python script?
If you just want to show the line number, wrap the line iterator in enumerate; it returns an iterator of tuples with the index (zero-based) and the line.
Like this:
with open('X1', 'r') as fr:
    for index, line in enumerate(fr):
        print(f'{index}: {line}')
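If you want the numbering to start at 1, the way gedit shows it, enumerate also accepts a start argument:

with open('X1', 'r') as fr:
    for index, line in enumerate(fr, start=1):
        print(f'{index}: {line}', end='')  # the line already ends with a newline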
Also, when working with files, using the with context manager is better. It ensures proper closing and flushing of the data buffers even if an exception is raised.
EDIT:
As a bonus, the example I gave uses only iterators: the file object is an iterator of lines, and enumerate also returns an iterator that builds the tuples.
This means the script only holds one line at a time in memory (plus the buffers defined by your platform), not the whole file.
fr = open('X1', 'r')
rowcount = 0
for i in fr:
    print(rowcount, end="")
    print(":", end="")
    print(i)
    rowcount += 1
print(rowcount)
fr.close()
We just read the file line by line, printing a counter each time.
I process log files with Python. Let's say that I have a log file that contains a line which is START and a line that is END, like below:
START
one line
two line
...
n line
END
What I do want is to be able to store the content between the START and END lines for further processing.
I do the following in Python:
data = []
found_start = found_end = False

with open(file) as name_of_file:
    for line in name_of_file:
        if 'START' in line:  # We found the start delimiter
            print(line)
            found_start = True
            for line in name_of_file:  # We now read until the end delimiter
                if 'END' in line:  # We exit here as we have the info
                    found_end = True
                    break
                else:
                    if not line.isspace():  # We do not want to add empty strings to the data, so we ensure the line is not empty
                        data.append(line.replace(',', '').strip().split())  # We store the information in a list called data; we do not want ',' or spaces

if found_start and found_end:
    relevant_data = data
And then I process the relevant_data.
Looks far too complicated for the purity of Python, and hence my question: is there a more Pythonic way of doing this?
Thanks!
You are right that there is something not OK with having a nested loop over the same iterator. File objects are already iterators, and you can use that to your advantage. For example, to find the first line with a START in it:
line = next(l for l in name_of_file if 'START' in l)
This will raise a StopIteration if there is no such line. It also sets the file pointer to the beginning of the first line you care about.
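If you would rather not handle the exception, next() also accepts a default value to return when the iterator is exhausted:

line = next((l for l in name_of_file if 'START' in l), None)
if line is None:
    ...  # no START found; handle however you like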
Getting the last line without anything that comes after it is a bit more complicated because it's difficult to set external state in a generator expression. Instead, you can make a simple generator:
def interesting_lines(file):
    if not next((line for line in file if 'START' in line), None):
        return
    for line in file:
        if 'END' in line:
            break
        line = line.strip()
        if not line:
            continue
        yield line.replace(',', '').split()
The generator will yield nothing if you don't have a START, but it will yield all the lines until the end if there is no END, so it differs a little from your implementation. You would use the generator to replace your loop entirely:
with open(name_of_file) as file:
    data = list(interesting_lines(file))
    if data:
        ...  # process data
Wrapping the generator in list immediately processes it, so the lines persist even after you close the file. The iterator can be used repeatedly because at the end of your call, the file pointer will be just past the END line:
with open(name_of_file) as file:
    for data in iter(lambda: list(interesting_lines(file)), []):
        ...  # Process another data set.
The relatively lesser known form of iter converts any callable object that accepts no arguments into an iterator. The end is reached when the callable returns the sentinel value, in this case an empty list.
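As a quick illustration of that form of iter() outside this problem, the classic example is reading a file in fixed-size chunks until read() returns an empty bytes object (the file name here is just an example):

with open('some_file.bin', 'rb') as f:
    for chunk in iter(lambda: f.read(4096), b''):  # stop once read() returns b''
        print(len(chunk))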
To perform that, you can use iter(callable, sentinel), discussed in this post, which will read until a sentinel value is reached, in your case 'END' (after applying .strip()).
with open(filename) as file:
    start_token = next(l for l in file if l.strip() == 'START')  # Used to read until the start token
    result = [line.replace(',', '').split() for line in iter(lambda x=file: next(x).strip(), 'END') if line]
This is a mission for the regular expressions module re, for example:
import re

lines = """ not this line
 START
 this line
 this line too
 END
 not this one
 """

search_obj = re.search(r'START(.*)END', lines, re.S)
search_obj.groups(1)
# ('\n this line\n this line too\n ',)
The re.S is necessary for spanning multiple lines.
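Without that flag the same pattern finds nothing, because . does not match newlines by default:

print(re.search(r'START(.*)END', lines))  # None: '.' stops at the first '\n'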
I want to read the csv file in a manner similar to tail -f i.e. like reading an error log file.
I can perform this operation in a text file with this code:
while 1:
    where = self.file.tell()
    line = self.file.readline()
    if not line:
        print "No line waiting, waiting for one second"
        time.sleep(1)
        self.file.seek(where)
    if (re.search('[a-zA-Z]', line) == False):
        continue
    else:
        response = self.naturalLanguageProcessing(line)
        if (response is not None):
            response["id"] = self.id
            self.id += 1
            response["tweet"] = line
            self.saveResults(response)
        else:
            continue
How do I perform the same task for a csv file? I have gone through a link which can give me the last 8 rows, but that is not what I require. The csv file will be updated continuously and I need to get the newly appended rows.
Connecting A File Tailer To A csv.reader
In order to plug your code that looks for content newly appended to a file into a csv.reader, you need to put it into the form of an iterator.
I'm not intending to showcase correct code, but specifically to show how to adapt your existing code into this form, without making assertions about its correctness. In particular, the sleep() would be better replaced with a mechanism such as inotify to let the operating system assertively inform you when the file has changed; and the seek() and tell() would be better replaced with storing partial lines in memory rather than backing up and rereading them from the beginning over and over.
import csv
import time

class FileTailer(object):
    def __init__(self, file, delay=0.1):
        self.file = file
        self.delay = delay

    def __iter__(self):
        while True:
            where = self.file.tell()
            line = self.file.readline()
            if line and line.endswith('\n'):  # only emit full lines
                yield line
            else:                             # for a partial line, pause and back up
                time.sleep(self.delay)        # ...not actually a recommended approach.
                self.file.seek(where)

csv_reader = csv.reader(FileTailer(open('myfile.csv')))
for row in csv_reader:
    print("Read row: %r" % (row,))
If you create an empty myfile.csv, start python csvtailer.py, and then echo "first,line" >>myfile.csv from a different window, you'll see the output of Read row: ['first', 'line'] immediately appear.
Finding A Correct File Tailer In Python
For a correctly-implemented iterator that waits for new lines to be available, consider referring to one of the existing StackOverflow questions on the topic:
How to implement a pythonic equivalent of tail -F?
Reading infinite stream - tail
Reading updated files on the fly in Python
I have a text document that I would like to repeatedly remove the first line of text from every 30 seconds or so.
I have already written (or more accurately copied) the code for the Python resettable timer object that allows a function to be called every 30 seconds in a non-blocking way, if not asked to reset or cancel.
Resettable timer in python repeats until cancelled
(If someone could check that the way I implemented the repeat there is OK, it would be appreciated, because my Python sometimes crashes while running it :))
I now want to write my function to load a text file, copy all but the first line, and then rewrite it to the same text file. I can do it this way, I think... but is it the most efficient?
from collections import deque

def removeLine():
    with open(path, 'rU') as file:
        lines = deque(file)
        try:
            print lines.popleft()
        except IndexError:
            print "Nothing to pop?"

    with open(path, 'w') as file:
        file.writelines(lines)
This works, but is it the best way to do it ?
I'd use the fileinput module with inplace=True:
import fileinput

def removeLine():
    inputfile = fileinput.input(path, inplace=True, mode='rU')
    next(inputfile, None)  # skip a line *if present*
    for line in inputfile:
        print line,  # write out again, but without an extra newline
    inputfile.close()
inplace=True causes sys.stdout to be redirected to the open file, so we can simply 'print' the lines.
The next() call is used to skip the first line; giving it a default None suppresses the StopIteration exception for an empty file.
This makes rewriting a large file more efficient as you only need to keep the fileinput readlines buffer in memory.
I don't think a deque is needed at all, even for your solution; just use next() there too, then use list() to catch the remaining lines:
def removeLine():
    with open(path, 'rU') as file:
        next(file, None)  # skip a line *if present*
        lines = list(file)

    with open(path, 'w') as file:
        file.writelines(lines)
but this requires you to read all of the file in memory; don't do that with large files.
So I have a file that contains this:
SequenceName 4.6e-38 810..924
SequenceName_FGS_810..924 VAWNCRQNVFWAPLFQGPYTPARYYYAPEEPKHYQEMKQCFSQTYHGMSFCDGCQIGMCH
SequenceName 1.6e-38 887..992
SequenceName_GYQ_887..992 PLFQGPYTPARYYYAPEEPKHYQEMKQCFSQTYHGMSFCDGCQIGMCH
I want my program to read only the lines that contain these protein sequences. Up until now I got this, which skips the first line and reads the second one:
handle = open(filename, "r")
handle.readline()
linearr = handle.readline().split()
handle.close()
fnamealpha = fname + ".txt"
handle = open(fnamealpha, "w")
handle.write(">%s\n%s\n" % (linearr[0], linearr[1]))
handle.close()
But it only processes the first sequence and I need it to process every line that contains a sequence, so I need a loop, how can I do it?
The part that saves to a txt file is really important too so I need to find a way in which I can combine these two objectives.
My output with the above code is:
>SequenceName_810..924
VAWNCRQNVFWAPLFQGPYTPARYYYAPEEPKHYQEMKQCFSQTYHGMSFCDGCQIGMCH
Okay, I think I understand your question--you want to iterate over the lines in the file, right? But only the second line in the sequence--the one with the protein sequence--matters, correct? Here's my suggestion:
# context manager `with` takes care of file closing, error handling
with open(filename, 'r') as handle:
    for line in handle:
        if line.startswith('SequenceName_'):
            print line.split()
            # Write to file, etc.
My reasoning being that you're only interested in lines that start with SequenceName_###.
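If you also want to write each sequence out in the format from your snippet, a minimal sketch of that idea (reusing your fname/filename variables and assuming every matching line has exactly two whitespace-separated fields) could be:

fnamealpha = fname + ".txt"
with open(filename, 'r') as handle, open(fnamealpha, 'w') as out:
    for line in handle:
        if line.startswith('SequenceName_'):
            name, seq = line.split()
            out.write(">%s\n%s\n" % (name, seq))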
Use readlines and throw it all into a for loop.
with open(filename, 'r') as fh:
    for line in fh.readlines():
        pass  # do processing here
In the # do processing here section, you can just prepare another list of lines to write to the other file. (Using with handles all the proper closing for you.)
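For instance, a rough sketch of that idea (the startswith test is an assumption about your line format, borrowed from the other answer):

output_lines = []
with open(filename, 'r') as fh:
    for line in fh.readlines():
        if line.startswith('SequenceName_'):
            output_lines.append(line)

with open(fname + ".txt", 'w') as out:
    out.writelines(output_lines)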