In C++ you can read one value at a time like this:
//from console
cin >> x;
//from file:
ifstream fin("file name");
fin >> x;
I would like to emulate this behaviour in Python. It seems, however, that the ordinary ways to get input in Python read either whole lines, the whole file, or a set number of bytes.
I would like a function, let's call it one_read(), that reads from a file until it encounters either a white-space or a newline character, then stops. Also, on subsequent calls to one_read() the input should begin where it left off.
Examples of how it should work:
# file input.in is:
# 5 4
# 1 2 3 4 5
n = int(one_read())
k = int(one_read())
a = []
for i in range(n):
    a.append(int(one_read()))
# n = 5 , k = 4 , a = [1,2,3,4,5]
How can I do this?
I think the following should get you close. I admit I haven't tested the code carefully. It sounds like itertools.takewhile should be your friend, and a generator like yield_characters below will be useful.
from itertools import chain, takewhile
import re

# this generator yields characters from a file one at a time.
def yield_characters(file):
    with open(file, 'r') as f:
        for line in f:
            for char in line:
                yield char

# True for any non-whitespace character. (Double check this; my Python regex is weak.)
def not_whitespace(char):
    return bool(re.match(r"\S", char))

# this uses takewhile to pull characters off the stream while they are not whitespace.
def read_one(file):
    chars = yield_characters(file)
    while True:
        word = ''.join(takewhile(not_whitespace, chars))
        if word:
            yield word
            continue
        # takewhile returned nothing: either it just consumed a whitespace
        # character or the stream is exhausted. Probe one character to find
        # out, and push it back if it starts the next word.
        probe = next(chars, None)
        if probe is None:
            return
        if not_whitespace(probe):
            chars = chain([probe], chars)
The read_one above is a generator, so you will need to do something like call list on it.
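For example (my own usage sketch, assuming the corrected read_one above and the input.in file from the question):

words = read_one('input.in')               # generator over whitespace-separated tokens
n = int(next(words))                       # 5
k = int(next(words))                       # 4
a = [int(next(words)) for _ in range(n)]   # [1, 2, 3, 4, 5]

Using next() like this, rather than list, avoids pulling the whole file into memory at once.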
Normally you would just read a line at a time, then split this and work with each part. However if you can't do this for resource reasons, you can implement your own reader which will read one character at a time, and then yield a word each time it reaches a delimiter (or in this example also a newline or the end of the file).
This implementation uses a context manager to handle the file opening/reading, though this might be overkill:
from functools import partial
class Words():
    def __init__(self, fname, delim):
        self.delims = ['\n', delim]
        self.fname = fname
        self.fh = None

    def __enter__(self):
        self.fh = open(self.fname)
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.fh.close()

    def one_read(self):
        chars = []
        for char in iter(partial(self.fh.read, 1), ''):
            if char in self.delims:
                # delimiter signifies end of word
                word = ''.join(chars)
                chars = []
                yield word
            else:
                chars.append(char)
# Assuming x.txt contains 12 34 567 8910
with Words('/tmp/x.txt', ' ') as w:
    print(next(w.one_read()))
    # 12
    print(next(w.one_read()))
    # 34
    print(list(w.one_read()))
    # ['567', '8910']
More or less anything that operates on files in Python can operate on the standard input and standard output. The sys standard library module defines stdin and stdout which give you access to those streams as file-like objects.
Reading a line at a time is considered idiomatic in Python because the other way is quite error-prone (just one C++ example question on Stack Overflow). But if you insist: you will have to build it yourself.
As you've found, .read(n) will read at most n text characters (technically, Unicode code points) from a stream opened in text mode. You can't tell where the end of the word is until you read the whitespace, but you can .seek back one spot - though not on the standard input, which isn't seekable.
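As a rough sketch of that approach (my own illustration, not from the question; it assumes a seekable file opened in text mode):

def one_read(f):
    """Read one whitespace-delimited token from an open, seekable text file."""
    chars = []
    while True:
        pos = f.tell()            # remember where this character starts
        c = f.read(1)
        if c == '':               # end of file
            break
        if c.isspace():
            if chars:             # the token is complete
                f.seek(pos)       # put the whitespace back for the next caller
                break
            continue              # skip leading whitespace
        chars.append(c)
    return ''.join(chars)

with open('input.in') as f:
    n = int(one_read(f))   # 5
    k = int(one_read(f))   # 4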
You should also be aware that the built-in input will ignore any existing data on the standard input before prompting the user:
>>> sys.stdin.read(1) # blocks
foo
'f'
>>> # the `foo` is our input, the `'f'` is the result
>>> sys.stdin.read(1) # data is available; doesn't block
'o'
>>> input()
bar
'bar'
>>> # the second `o` from the first input was lost
Try creating a class to remember where the operation left off.
The __init__ function takes the filename; you could modify this to take a list or other iterable.
read_one checks if there is anything left to read, and if there is, removes and returns the item at index 0 in the list, which is everything up to the first whitespace.
class Reader:
    def __init__(self, filename):
        self.file_contents = open(filename).read().split()

    def read_one(self):
        if self.file_contents != []:
            return self.file_contents.pop(0)
Initialise the class as follows and adapt to your liking:
reader = Reader(filepath)
reader.read_one()
I have a file named sample.txt which looks like this:
ServiceProfile.SharediFCList[1].DefaultHandling=1
ServiceProfile.SharediFCList[1].ServiceInformation=
ServiceProfile.SharediFCList[1].IncludeRegisterRequest=n
ServiceProfile.SharediFCList[1].IncludeRegisterResponse=n
My requirement is to remove the brackets and the integer inside them (and then build OS commands from the result), so the lines look like this:
ServiceProfile.SharediFCList.DefaultHandling=1
ServiceProfile.SharediFCList.ServiceInformation=
ServiceProfile.SharediFCList.IncludeRegisterRequest=n
ServiceProfile.SharediFCList.IncludeRegisterResponse=n
I am quite a newbie in Python. This is my first attempt. I have used this code to remove the brackets:
#!/usr/bin/python
import re
import os
import sys
f = os.open("sample.txt", os.O_RDWR)
ret = os.read(f, 10000)
os.close(f)
print ret
var1 = re.sub("[\(\[].*?[\)\]]", "", ret)
print var1
f = open("removed.cfg", "w+")
f.write(var1)
f.close()
After this, using the file as input, I want to form application-specific commands which look like this:
cmcli INS "DefaultHandling=1 ServiceInformation="
and the next set as
cmcli INS "IncludeRegisterRequest=n IncludeRegisterRequest=y"
So basically now I want all the output to be bunched into sets of two for me to execute the commands on the operating system.
Is there any way that I could bunch them up in sets of two?
Reading 10,000 bytes of text into a string is really not necessary when your file is line-oriented text, and isn't scalable either. And you need a very good reason to be using os.open() instead of open().
So, treat your data as the lines of text that it is, and every two lines, compose a single line of output.
from __future__ import print_function
import re

command = [None, None]
cmd_id = 1
bracket_re = re.compile(r".+\[\d\]\.(.+)")

# This doesn't just remove the brackets: what you actually seem to want is
# to pick out everything after [1]. and ignore the rest.

with open("removed_cfg", "w") as outfile:
    with open("sample.txt") as infile:
        for line in infile:
            m = bracket_re.match(line)
            cmd_id = 1 - cmd_id  # gives 0, 1, 0, 1
            command[cmd_id] = m.group(1)
            if cmd_id == 1:  # we have a pair
                output_line = """cmcli INS "{0} {1}" """.format(*command)
                print(output_line, file=outfile)
This gives the output
cmcli INS "DefaultHandling=1 ServiceInformation="
cmcli INS "IncludeRegisterRequest=n IncludeRegisterResponse=n"
The second line doesn't correspond to your sample output. I don't know how the input IncludeRegisterResponse=n is supposed to become the output IncludeRegisterRequest=y. I assume that's a mistake.
Note that this code depends on your input data being precisely as you describe it and has no error checking whatsoever. So if the format of the input is in reality more variable than that, then you will need to add some validation.
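For instance, a minimal guard (my addition, not part of the original code) that returns None for lines without the expected [n]. prefix instead of crashing on m.group(1):

import re

bracket_re = re.compile(r".+\[\d\]\.(.+)")

def extract_setting(line):
    """Return the part after '[n].', or None if the line doesn't match."""
    m = bracket_re.match(line.rstrip("\n"))
    return m.group(1) if m else None

# in the loop above you could then skip lines that don't match:
#     setting = extract_setting(line)
#     if setting is None:
#         continue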
I'd like to write a python script that reads a text file containing this:
FRAME
1 J=1,8 SEC=CL1 NSEG=2 ANG=0
2 J=8,15 SEC=CL2 NSEG=2 ANG=0
3 J=15,22 SEC=CL3 NSEG=2 ANG=0
And output a text file that looks like this:
1 1 8
2 8 15
3 15 22
I essentially don't need the commas or the SEC, NSEG and ANG data. Could someone help me use regex to do this?
So far I have this:
import re
r = re.compile(r"\s*(\d+)\s+J=(\S+)\s+SEC=(\S+)\s+NSEG=(\S+)+ANG=(\S+)\s")
with open('RawDataFile_445.txt') as a:
    # open all 4 files with a meaningful name
    file = open("outputfile.txt", "w")
    for line in a:
Without regex:
for line in file:
    keep = []
    line = line.strip()
    if line.startswith('FRAME'):
        continue
    first, second, *_ = line.split()
    keep.append(first)
    first, second = second.split('=')
    keep.extend(second.split(','))
    print(' '.join(keep))
My advice? Since I don't write many regexes, I avoid writing big ones all at once. Since you've already done that, I would try to verify it a small chunk at a time, as illustrated in this code.
import re
r = re.compile(r"\s*(\d+)\s+J=(\S+)\s+SEC=(\S+)\s+NSEG=(\S+)+ANG=(\S+)\s")
r = re.compile(r"\s*(\d+)")
r = re.compile(r"\s*(\d+)\s+J=(\d+)")
with open('RawDataFile_445.txt') as a:
    a.readline()
    for line in a.readlines():
        result = r.match(line)
        if result:
            print(result.groups())
The first regex is your entire brute of an expression. The next line is the first chunk I verified. The next line is the second, bigger chunk that worked. Notice the slight change.
At this point I would go back, make the correction to the original, whole regex and then copy a bigger chunk to try. And re-run.
Let's focus on an example string we want to parse:
1 J=1,8
We have space(s), digit(s), more space(s), some characters, then digit(s), a comma, and more digit(s). If we replace them with regex characters, we get (\d+)\s+J=(\d+),(\d+), where + means we want 1 or more of that type. Note that we surround the digits with parentheses so we can capture them later with .groups() or .group(#), where # is the nth group.
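As a quick check (my own snippet, using one of the sample lines from the question):

import re

line = "1 J=1,8 SEC=CL1 NSEG=2 ANG=0"
m = re.match(r"\s*(\d+)\s+J=(\d+),(\d+)", line)
print(m.groups())            # ('1', '1', '8')
print(' '.join(m.groups()))  # 1 1 8  -- one line of the desired output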
Previously, I had been cleaning out data using the code snippet below
import unicodedata, re, io

all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = ''.join(c for c in all_chars if unicodedata.category(c)[0] == 'C')
cc_re = re.compile('[%s]' % re.escape(control_chars))

def rm_control_chars(s):  # see http://www.unicode.org/reports/tr44/#General_Category_Values
    return cc_re.sub('', s)

cleanfile = []
with io.open('filename.txt', 'r', encoding='utf8') as fin:
    for line in fin:
        line = rm_control_chars(line)
        cleanfile.append(line)
There are newline characters in the file that I want to keep.
The following records the time taken for cc_re.sub('', s) to substitute the first few lines (1st column is the time taken and 2nd column is len(s)):
0.275146961212 251
0.672796010971 614
0.178567171097 163
0.200030088425 180
0.236430883408 215
0.343492984772 313
0.317672967911 290
0.160616159439 142
0.0732028484344 65
0.533437013626 468
0.260229110718 236
0.231380939484 204
0.197766065598 181
0.283867120743 258
0.229172945023 208
As @ashwinichaudhary suggested, using s.translate(dict.fromkeys(control_chars)) instead, the same timing log outputs:
0.464188098907 252
0.366552114487 615
0.407374858856 164
0.322507858276 181
0.35142993927 216
0.319973945618 314
0.324357032776 291
0.371646165848 143
0.354818105698 66
0.351796150208 469
0.388131856918 237
0.374715805054 205
0.363368988037 182
0.425950050354 259
0.382766962051 209
But the code is really slow for my 1GB of text. Is there any other way to clean out control characters?
I found a solution working character by character; I benchmarked it using a 100K file:
import unicodedata, re, io
from time import time
# This is to generate randomly a file to test the script
from string import lowercase
from random import random
all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = [c for c in all_chars if unicodedata.category(c)[0] == 'C']
chars = (list(u'%s' % lowercase) * 115117) + control_chars
fnam = 'filename.txt'
out=io.open(fnam, 'w')
for line in range(1000000):
    out.write(u''.join(chars[int(random()*len(chars))] for _ in range(600)) + u'\n')
out.close()
# version proposed by alvas
all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = ''.join(c for c in all_chars if unicodedata.category(c)[0] == 'C')
cc_re = re.compile('[%s]' % re.escape(control_chars))
def rm_control_chars(s):
    return cc_re.sub('', s)
t0 = time()
cleanfile = []
with io.open(fnam, 'r', encoding='utf8') as fin:
    for line in fin:
        line = rm_control_chars(line)
        cleanfile.append(line)
out=io.open(fnam + '_out1.txt', 'w')
out.write(''.join(cleanfile))
out.close()
print time() - t0
# using a set and checking character by character
all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = set(c for c in all_chars if unicodedata.category(c)[0] == 'C')
def rm_control_chars_1(s):
    return ''.join(c for c in s if not c in control_chars)
t0 = time()
cleanfile = []
with io.open(fnam, 'r', encoding='utf8') as fin:
    for line in fin:
        line = rm_control_chars_1(line)
        cleanfile.append(line)
out=io.open(fnam + '_out2.txt', 'w')
out.write(''.join(cleanfile))
out.close()
print time() - t0
the output is:
114.625444174
0.0149750709534
I tried on a file of 1Gb (only for the second one) and it lasted 186s.
I also wrote this other version of the same script, slightly faster (176s), and more memory efficient (for very large files not fitting in RAM):
t0 = time()
out=io.open(fnam + '_out5.txt', 'w')
with io.open(fnam, 'r', encoding='utf8') as fin:
    for line in fin:
        out.write(rm_control_chars_1(line))
out.close()
print time() - t0
Since in UTF-8 all control characters are coded in 1 byte (compatible with ASCII) and are below 32, I suggest this fast piece of code:
#!/usr/bin/python
import sys
ctrl_chars = [x for x in range(0, 32) if x not in (ord("\r"), ord("\n"), ord("\t"))]
filename = sys.argv[1]
with open(filename, 'rb') as f1:
    with open(filename + '.txt', 'wb') as f2:
        b = f1.read(1)
        while b != '':
            if ord(b) not in ctrl_chars:
                f2.write(b)
            b = f1.read(1)
Is it ok enough?
Does this have to be in Python? How about cleaning the file before you read it into Python in the first place? Use sed, which will treat it line by line anyway.
See removing control characters using sed.
And if you pipe it out to another file, you can open that. I don't know how fast it would be though. You can do it in a shell script and test it. According to this page, sed is 82M characters per second.
Hope it helps.
Want it to move really fast? Break your input into multiple chunks, wrap that data-munging code up as a function, and use Python's multiprocessing package to parallelize it, writing to some common text file.
https://docs.python.org/3/library/multiprocessing.html
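A rough sketch of that idea (my own illustration, not tested on the 1GB file; it chunks the input by lines and, for brevity, uses the narrower ASCII control-character class rather than the full Unicode 'C' category):

from multiprocessing import Pool
import io
import re

cc_re = re.compile(u'[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]+')

def clean_chunk(lines):
    # strip control characters from a list of lines
    return [cc_re.sub(u'', line) for line in lines]

def chunks(f, size=100000):
    # yield lists of `size` lines at a time
    chunk = []
    for line in f:
        chunk.append(line)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

if __name__ == '__main__':
    with io.open('filename.txt', encoding='utf8') as fin, \
         io.open('cleaned.txt', 'w', encoding='utf8') as fout:
        pool = Pool()
        # imap keeps the chunks in order, so the output matches the input
        for cleaned in pool.imap(clean_chunk, chunks(fin)):
            fout.writelines(cleaned)
        pool.close()
        pool.join()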
I'm surprised no one has mentioned mmap which might just be the right fit here.
Note: I'll put this in as an answer in case it's useful and apologize that I don't have the time to actually test and compare it right now.
You load the file into memory (kind of) and then you can actually run a re.sub() over the object. This helps eliminate the IO bottleneck and allows you to change the bytes in-place before writing it back at once.
After this, then, you can experiment with str.translate() vs re.sub() and also include any further optimisations like double buffering CPU and IO or using multiple CPU cores/threads.
But it'll look something like this:
import mmap
f = open('test.out', 'r')
m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
A nice excerpt from the mmap documentation is:
..You can use mmap objects in most places where strings are expected; for example, you can use the re module to search through a memory-mapped file. Since they’re mutable, you can change a single character by doing obj[index] = 'a',..
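Something like the following could work (again untested, my own sketch; it maps the file read-only, runs a single byte-level substitution over it, and writes the result to a new file rather than modifying it in place):

import mmap
import re

# ASCII control characters except \t, \n, \r, plus DEL; '+' grabs whole runs
ctrl_run = re.compile(br'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]+')

with open('test.out', 'rb') as f:
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    cleaned = ctrl_run.sub(b'', m)   # the re module accepts mmap objects for searching
    m.close()

with open('test.cleaned', 'wb') as out:
    out.write(cleaned)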
A couple of things I would try.
First, do the substitution with a replace-all regex.
Second, set up a regex character class with known control-char ranges instead of a class of individual control chars. (This is in case the engine doesn't optimize it to ranges; a range requires two conditionals at the assembly level, as opposed to an individual conditional for each char in the class.)
Third, since you are removing the characters, add a greedy quantifier after the class. This negates the need to enter the substitution subroutine after each single-char match, instead grabbing all adjacent chars as needed.
I don't know Python's syntax for regex constructs off the top of my head, nor all the control codes in Unicode, but the result would look something like this:
[\u0000-\u0009\u000B\u000C\u000E-\u001F\u007F]+
The largest amount of time would be in copying the results to another string. The smallest amount of time would be in finding all the control codes, which would be minuscule.
All things being equal, the regex (as described above) is the fastest way to go.
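In Python terms, that suggestion would look roughly like this (my own sketch):

import re

# Character class of control codes as above, with a greedy quantifier so a
# whole run of adjacent control characters is removed in a single match.
cc_run = re.compile(u'[\u0000-\u0009\u000B\u000C\u000E-\u001F\u007F]+')

def rm_control_chars(s):
    return cc_run.sub(u'', s)

print(rm_control_chars(u'abc\x00\x01\x02def'))  # abcdef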
I am attempting to create a regular expression pattern for strings similar to the below which are stored in a file. The aim is to get any column for any row, the rows need not be on a single line. So for example, consider the following file:
"column1a","column2a","column
3a,", #entity 1
"column\"this is, a test\"4a"
"column1b","colu
mn2b,","column3b", #entity 2
"column\"this is, a test\"4b"
"column1c,","column2c","column3c", #entity 3
"column\"this is, a test\"4c"
Each entity consists of four columns; column 4 for entity 2 would be "column\"this is, a test\"4b", and column 2 for entity 3 would be "column2c". Each column begins with a quote and closes with a quote; however, you must be careful because some columns have escaped quotes. Thanks in advance!
You could do it like this, i.e.:
Read the whole file.
Split the input on newline characters which are not preceded by a comma.
Iterate over the split elements and split again on the comma (plus the optional following newline character) which is preceded and followed by double quotes.
Code:
import re

with open(file) as f:
    fil = f.read()
    m = re.split(r'(?<!,)\n', fil.strip())
    for i in m:
        print(re.split('(?<="),\n?(?=")', i))
Output:
['"column1a"', '"column2a"', '"column3a,"', '"column\\"this is, a test\\"4a"']
['"column1b"', '"column2b,"', '"column3b"', '"column\\"this is, a test\\"4b"']
['"column1c,"', '"column2c"', '"column3c"', '"column\\"this is, a test\\"4c"']
Here is the check..
$ cat f
"column1a","column2a","column3a,",
"column\"this is, a test\"4a"
"column1b","column2b,","column3b",
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",
"column\"this is, a test\"4c"
$ python3 f.py
['"column1a"', '"column2a"', '"column3a,"', '"column\\"this is, a test\\"4a"']
['"column1b"', '"column2b,"', '"column3b"', '"column\\"this is, a test\\"4b"']
['"column1c,"', '"column2c"', '"column3c"', '"column\\"this is, a test\\"4c"']
f is the input file name and f.py is the file-name which contains the python script.
Your problem is terribly similar to what I have to deal with thrice every month :) Except I'm not using Python to solve it, but I can 'translate' what I usually do:
text = r'''"column1a","column2a","column
3a,",
"column\"this is, a test\"4a"
"column1a2","column2a2","column3a2","column4a2"
"column1b","colu
mn2b,","column3b",
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",
"column\"this is, a test\"4c"'''
import re

# Number of columns one line is supposed to have
columns = 4
# Temporary variable to hold partial lines
buffer = ""
# Our regex to check for each column
check = re.compile(r'"(?:[^"\\]*|\\.)*"')

# Read the file line by line
for line in text.split("\n"):
    # If there's no stored partial line, this is a new line
    if buffer == "":
        # Check if we get 4 columns and print, if not, put the line
        # into buffer so we store a partial line for later
        matches = check.findall(line)
        if len(matches) == columns:
            print matches
        else:
            # use line.strip() if you need to trim whitespaces
            buffer = line
    else:
        # Update the variable (containing a partial line) with the
        # next line and recheck if we get 4 columns
        # use line.strip() if you need to trim whitespaces
        buffer = buffer + line
        matches = check.findall(buffer)
        # If we indeed get 4, our line is complete and print
        # We must not forget to empty buffer now that we got a whole line
        if len(matches) == columns:
            print matches
            buffer = ""
        # Optional; always good to have a safety backdoor though
        # If there is a problem with the csv itself like a weird unescaped
        # quote, you send it somewhere else
        elif len(matches) > columns:
            print "Error: cannot parse line:\n" + buffer
            buffer = ""
ideone demo
I have a text file which has a lot of random occurrences of the string #STRING_A, and I would be interested in writing a short script which removes only some of them. Particularly one that scans the file and once it finds a line which starts with this string like
#STRING_A
then checks if 3 lines backwards there is another occurrence of a line starting with the same string, like
#STRING_A
#STRING_A
and if it happens, to delete the occurrence 3 lines backward. I was thinking about bash, but I do not know how to "go backwards" with it, so I am sure that this is not possible with bash. I also thought about Python, but then I would have to store all the information in memory in order to go backwards, and for long files that would be unfeasible.
What do you think? Is it possible to do it in bash or Python?
Thanks
Funny that after all these hours nobody's yet given a solution to the problem as actually phrased (as @John Machin points out in a comment) -- remove just the leading marker (if followed by another such marker 3 lines down), not the whole line containing it. It's not hard, of course -- here's a tiny mod as needed of @truppo's fun solution, for example:
from itertools import izip, chain
f = "foo.txt"
for third, line in izip(chain("   ", open(f)), open(f)):
    if third.startswith("#STRING_A") and line.startswith("#STRING_A"):
        line = line[len("#STRING_A"):]
    print line,
Of course, in real life, one would use an itertools.tee instead of reading the file twice, have this code in a function, not repeat the marker constant endlessly, &c;-).
Of course Python will work as well. Simply store the last three lines in an array and check if the first element in the array is the same as the value you are currently reading. Then delete the value and print out the current array. You would then move over your elements to make room for the new value and repeat. Of course when the array is filled you'd have to make sure to continue to move values out of the array and put in the newly read values, stopping to check each time to see if the first value in the array matches the value you are currently reading.
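A rough sketch of that idea (my own illustration, keeping the last three lines in a plain list and stripping only the marker rather than the whole line, per the problem as phrased):

import sys

window = []  # holds up to the last 3 lines read

with open("foo.txt") as fp:
    for line in fp:
        if len(window) == 3:
            old = window.pop(0)   # the line read 3 lines ago
            # strip its marker if the current line also starts with it
            if old.startswith("#STRING_A") and line.startswith("#STRING_A"):
                old = old[len("#STRING_A"):]
            sys.stdout.write(old)
        window.append(line)

# flush whatever is left in the window
for line in window:
    sys.stdout.write(line)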
Here is a more fun solution, using two iterators with a three element offset :)
from itertools import izip, chain, tee
f1, f2 = tee(open("foo.txt"))
for third, line in izip(chain("   ", f1), f2):
    if not (third.startswith("#STRING_A") and line.startswith("#STRING_A")):
        print line,
Why shouldn't it be possible in bash? You don't need to keep the whole file in memory, just the last three lines (if I understood correctly), and write what's appropriate to standard out. Redirect that into a temporary file, check that everything worked as expected, and overwrite the source file with the temporary one.
Same goes for Python.
I'd provide a script of my own, but that wouldn't be tested. ;-)
As AlbertoPL said, store lines in a fifo for later use--don't "go backwards". For this I would definitely use python over bash+sed/awk/whatever.
I took a few moments to code this snippet up:
from collections import deque

line_fifo = deque()
for line in open("test"):
    line_fifo.append(line)
    if len(line_fifo) == 4:
        # "look 3 lines backward"
        if line_fifo[0] == line_fifo[-1] == "#STRING_A\n":
            # get rid of that match
            line_fifo.popleft()
        else:
            # print out the top of the fifo
            print line_fifo.popleft(),

# don't forget to print out the fifo when the file ends
for line in line_fifo: print line,
This code will scan through the file and remove lines starting with the marker when another marker line follows *gap* lines later. It keeps only three lines in memory by default:
from collections import deque
def delete(fp, marker, gap=3):
    """Delete lines from *fp* if they start with *marker* and are followed
    by another line starting with *marker* *gap* lines after.
    """
    buf = deque()
    for line in fp:
        if len(buf) < gap:
            buf.append(line)
        else:
            old = buf.popleft()
            if not (line.startswith(marker) and old.startswith(marker)):
                yield old
            buf.append(line)
    for line in buf:
        yield line
I've tested it with:
>>> from StringIO import StringIO
>>> fp = StringIO('''a
... b
... xxx 1
... c
... xxx 2
... d
... e
... xxx 3
... f
... g
... h
... xxx 4
... i''')
>>> print ''.join(delete(fp, 'xxx'))
a
b
xxx 1
c
d
e
xxx 3
f
g
h
xxx 4
i
This "answer" is for lyrae ... I'll amend my previous comment: if the needle is in the first 3 lines of the file, your script will either cause an IndexError or access a line that it shouldn't be accessing, sometimes with interesting side-effects.
Example of your script causing IndexError:
>>> lines = "#string line 0\nblah blah\n".splitlines(True)
>>> needle = "#string "
>>> for i,line in enumerate(lines):
...     if line.startswith(needle) and lines[i-3].startswith(needle):
...         lines[i-3] = lines[i-3].replace(needle, "")
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
IndexError: list index out of range
and this example shows not only that the Earth is round but also why your "fix" to the "don't delete the whole line" problem should have used .replace(needle, "", 1) or [len(needle):] instead of .replace(needle, "")
>>> lines = "NEEDLE x NEEDLE y\nnoddle\nnuddle\n".splitlines(True)
>>> needle = "NEEDLE"
>>> # Expected result: no change to the file
... for i,line in enumerate(lines):
...     if line.startswith(needle) and lines[i-3].startswith(needle):
...         lines[i-3] = lines[i-3].replace(needle, "")
...
>>> print ''.join(lines)
x y <<<=== whoops!
noddle
nuddle
<<<=== still got unwanted newline in here
>>>
My awk-fu has never been that good... but the following may provide you what you're looking for in a bash-shell/shell-utility form:
sed `awk 'BEGIN{ORS=";"}
/#STRING_A/ {
if(LAST!="" && LAST+3 >= NR) print LAST "d"
LAST = NR
}' test_file` test_file
Basically... awk is producing a command for sed to strip certain lines. I'm sure there's a relatively easy way to make awk do all of the processing, but this does seem to work.
The bad part? It does read the test_file twice.
The good part? It is a bash/shell-utility implementation.
Edit: Alex Martelli points out that the sample file above might have confused me. (my above code deletes the whole line, rather than the #STRING_A flag only)
This is easily remedied by adjusting the command to sed:
sed `awk 'BEGIN{ORS=";"}
/#STRING_A/ {
if(LAST!="" && LAST+3 >= NR) print LAST "s/#STRING_A//"
LAST = NR
}' test_file` test_file
This may be what you're looking for?
lines = open('sample.txt').readlines()
needle = "#string "
for i,line in enumerate(lines):
    if line.startswith(needle) and lines[i-3].startswith(needle):
        lines[i-3] = lines[i-3].replace(needle, "")
print ''.join(lines)
this outputs:
string 0 extra text
string 1 extra text
string 2 extra text
string 3 extra text
--replaced -- 4 extra text
string 5 extra text
string 6 extra text
#string 7 extra text
string 8 extra text
string 9 extra text
string 10 extra text
In bash you can use sort -r filename and tail -n filename to read the file backwards.
LINES=`tail -n filename | sort -r`
# now iterate through the lines and do your checking
I would consider using sed. GNU sed supports definition of line ranges. If sed fails, then there is another beast - awk - and I'm sure you can do it with awk.
O.K. I feel I should put up my awk POC. I could not figure out how to use sed addresses. I have not tried a combination of awk+sed, but it seems to me it's overkill anyway.
My awk script works as follows:
It reads lines and stores them in a 3-line buffer.
Once the desired pattern is found (/^data.*/ in my case), the 3-line buffer is checked to see whether the pattern was also seen three lines ago.
If the pattern has been seen, then those 3 lines are scratched.
To be honest, I would probably go with Python also, given that awk is really awkward.
the AWK code follows:
function max(a, b)
{
    if (a > b)
        return a;
    else
        return b;
}

BEGIN {
    w = 0; #write index
    r = 0; #read index
    buf[0, 1, 2]; #buffer
}

END {
    # flush buffer
    # start at read index and print out up to w index
    for (k = r % 3; k > r - max(r - 3, 0); k--) {
        #search in 3 line history buf
        if (match(buf[k % 3], /^data.*/) != 0) {
            # found -> remove lines from history
            # by rewriting them -> adjust write index
            w -= max(r, 3);
        }
    }
    buf[w % 3] = $0;
    w++;
}

/^.*/ {
    # store line into buffer, if the history
    # is full, print out the oldest one.
    if (w > 2) {
        print buf[r % 3];
        r++;
        buf[w % 3] = $0;
    }
    else {
        buf[w] = $0;
    }
    w++;
}