I have a file and I just want to get /X/Y/Z/C, /X/Y/Z/D, /X/Y/Z/E back (whatever comes after -trees).
The script should read the file, ignore everything until it sees WFS, then get the information in {}, find -trees, and just give me the paths back.
I am a beginner in Python. A fixed match pattern doesn't work because I think the paths change every day.
Any help will be appreciated.
The file:
DEFAULTS
{
FS
{
-A AAA
-B
} -aaaaaa
C
{
}
}
D "FW0"
{
}
WFS "C:" XXXX:"/C"
{
-trees
"/X/Y/Z/C"
"/X/Y/Z/D"
"/X/Y/Z/E"
-A AAA
}
A state machine-based lexical analyzer would do the trick reliably.
It recognizes the constructs that interest us: nested curly braces; named sections (an identifier followed by an opening brace on the next line; this one only cares about top-level sections); and clauses (started by -identifier inside a top-level section, possibly followed by data lines, and terminated by another clause or the end of the section).
It then keeps reading the file and prints any data lines found in the section and clause we're interested in. It also sets a flag upon finding them, so it can quit as soon as that clause ends.
import re

f = open("t.txt")
identifier = None
brace_level = 0
section = None
clause = None
req_clause_found = False

def in_req_clause():
    return section == 'WFS' and clause == 'trees'

for l in (l.strip() for l in f):
    if req_clause_found and not in_req_clause():
        break
    m = re.match(r'[A-Z]+', l)  # adjust if section names can be different
    if m and section is None:
        identifier = m.group(0)
        continue
    m = re.match(r'\{(\s|$)', l)
    if m:
        brace_level += 1
        if identifier is not None and brace_level == 1:
            section = identifier
        identifier = None
        continue
    else:
        identifier = None
    m = re.match(r'\}(\s|$)', l)
    if m:
        brace_level -= 1
        if brace_level == 0:
            section = None
        clause = None
        continue
    m = re.match(r'-([A-Za-z]+)', l)  # adjust if clause names can be different
    if m and brace_level == 1:
        clause = m.group(1)
        continue
    m = re.match(r'"(.*)"$', l)
    if m and in_req_clause():
        print m.group(1)
        req_clause_found = True
        continue
On the sample, this outputs
/X/Y/Z/C
/X/Y/Z/D
/X/Y/Z/E
I'm a little confused by the layout of your file but is there any reason not to parse it line-by-line?
def parse():
    with open('data.txt') as fptr:
        for line in fptr:
            if line.startswith('WFS'):
                for line in fptr:
                    if line.strip().startswith('-trees'):
                        result = []
                        for line in fptr:
                            if line.strip().startswith('"'):
                                result.append(line.strip())
                            else:
                                return result
That's not pretty but I think it'll work! Let's try it:
In [1]: !cat temp.txt
DEFAULTS
{
FS
{
-A AAA
-B
} -aaaaaa
C
{
}
}
D "FW0"
{
}
WFS "C:" XXXX:"/C"
{
-trees
"/X/Y/Z/C"
"/X/Y/Z/D"
"/X/Y/Z/E"
-A AAA
}
In [2]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:def parse():
:    with open('temp.txt') as fptr:
:        for line in fptr:
:            if line.startswith('WFS'):
:                for line in fptr:
:                    if line.strip().startswith('-trees'):
:                        result = []
:                        for line in fptr:
:                            if line.strip().startswith('"'):
:                                result.append(line.strip())
:                            else:
:                                return result
:
:--
In [3]: parse()
Out[3]: ['"/X/Y/Z/C"', '"/X/Y/Z/D"', '"/X/Y/Z/E"']
I'm not sure what the exact variations of your patterns are, but you could use regex groups:
import re

myjunk = open("t.txt", "r")
for line in myjunk:
    if re.match(r'\s*"(/\w+)+"', line):
        print line,
You may need to fiddle with the regex a bit, but the important point here is to invest a bit of time learning regex; then you won't have to deal with some of the unnecessarily complicated solutions suggested elsewhere. Regex is a mini-language purpose-built for so many text-related tasks that it's really essential knowledge, even for the Python newbie. You'll be glad you put the time in! And the Python community is helpful, so why not join IRC and we'll see you in your favorite Python channel for real-time help.
Best of luck, let me know if you need more help.
PJ
Related
I have a huge text file which I need to read line by line for memory optimization.
I would like to get the string within two identifiers, as an example here between the identifiers '{' and '}':
input:
"
not this line
not this line
Pattern 'pattern' {
get this line
get this line
}
not this line
not this line
"
the output would be a string "get this line get this line "
There can be other identifiers ('{', '}', '[', ...) inside the string, but I need matching ones. E.g. Pattern { something else {...} } would get something else {...} (the enclosed {...} is part of the string).
I have written a simple counter like this but it is quite slow. I was looking at a faster way of doing this.
currentString = ""
counter = 0

def GetStringBetweenIdentifiers(string, identifierA, identifierB):
    global currentString, counter
    for i in string:
        if i == identifierB:
            counter -= 1
        if counter > 0:
            currentString += i
        if i == identifierA:
            counter += 1
        if counter == 0 and currentString:  # a top-level block just closed
            string = currentString
            currentString = ""
            return string
    return ""

with open(filePath) as read_obj:
    for num, line in enumerate(read_obj, 1):
        String = GetStringBetweenIdentifiers(line, '{', '}')
        if String != "":
            "Do something with the string"
To add some examples, there can be identifiers in the middle of the line, for example:
input:
"
not this line
not this line
Pattern 'pattern' { I want this
get this line { something here }
get this line
also this part } not this part
not this line
not this line
"
the output would be a string " I want this get this line { something here } get this line also this part"
Thank you for reading!
This kind of thing can be very tricky due to ambiguous sequences. For example... Let's say that the start of a sequence of interest is '{' and the end is '}'. Now imagine that you've observed a start sentinel then, before you see an end marker, you see another start marker. What do you do then?
Anyway, here's something that will work in the perfect world (which doesn't really exist but it might give some ideas).
My input file looks like this:
not this line
not this line
Pattern 'pattern' { I want this
get this line { something here }
get this line
also this part } not this part
not this line
not this line
...and the code like this...
START = '{'
END = '}'
capture = 0
data = []
section = []

with open('foo.txt') as txt:
    while (c := txt.read(1)):
        if c == START:
            if (capture := capture + 1) > 1:
                section.append(c)
        elif c == END:
            if (capture := capture - 1) < 0:
                print('ERROR: unable to process (too many end tags)')
                break
            if capture:
                section.append(c)
            elif section:
                data.append(section)
                section = []
        elif capture and c not in '\r\n':
            section.append(c)

for section in data:
    print(''.join(section))
...and this output....
I want this get this line { something here }get this line also this part
Welcome to the world of regex. It's quirky, but highly effective. This works for your situation if each input string contains only one capturable sequence, which may itself contain nested subsequences, as you show in your example; it captures the outermost sequence it finds. It will fail if there are independent sequences within the same input string, and it would be a little more work to handle that case. (As they say, an exercise left to the interested reader.)
Lots of good info in the python dox and this website is key for testing.
Aside: You may also want to look into grep terminal command (not a python solution). grep is highly effective at processing massive files and pulling out matches and it works seamlessly with regex also
Anyhow:
import re

with open('dummy_text.txt', 'r') as src:
    lines = src.readlines()

composite_string = ''.join(lines)
print('loaded and working with:\n')
print(composite_string)
print()

pattern = r'{((?s:.*))}'
results = re.search(pattern, composite_string)
print(f'I found: {results.group(1)}')
Produces:
loaded and working with:
not this line
not this line
Pattern 'pattern' {
get this line
get {this} line
}
not this line
not this line
I found:
get this line
get {this} line
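For the independent-sequences case left as an exercise above, one possible sketch uses a small depth counter instead of a single regex; the function name and the sample string here are assumptions, not part of the original answer:

```python
def top_level_blocks(text, start='{', end='}'):
    """Collect the contents of every top-level start..end block,
    keeping nested delimiters inside a block verbatim."""
    blocks, depth, buf = [], 0, []
    for ch in text:
        if ch == start:
            if depth:              # a nested open brace is content
                buf.append(ch)
            depth += 1
        elif ch == end:
            depth -= 1
            if depth:              # closing a nested brace
                buf.append(ch)
            else:                  # a top-level block just closed
                blocks.append(''.join(buf))
                buf = []
        elif depth:
            buf.append(ch)
    return blocks

print(top_level_blocks('a { x {y} z } b { q } c'))  # [' x {y} z ', ' q ']
```

Unlike the greedy regex, this returns each independent top-level block separately while keeping nested braces intact.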
I am writing a script, and this part of the code is making my script's output print slowly. I think it's the nested loops that are causing the issue (I used a dictionary there). Is there an alternative way to make my script print the result without the delay?
Log = open("file.txt")
for LogLine in Log:
    flag = True
    for key, ConfLine in Conf.items():
        for patterns in ConfLine:
            patterns = DateString + patterns
            if re.match(patterns, LogLine):
                flag = False
                break
        if not flag:
            break
    if flag:
        print LogLine.strip()
C Panda's answer is good but it's not obvious that a regex full of | is the fastest way to try all regexes. Test the performance of this alternative:
pats = [re.compile(date_string + pat) for conf in Conf.values() for pat in conf]
with open('file.txt') as log:
    for line in log:
        if any(pat.match(line) for pat in pats):
            print(line.strip())
On a side note, here's how your current code could be written with a clean break and no need for flag:
for ConfLine, patterns in ((c, p) for c in Conf.values() for p in c):
    patterns = DateString + patterns
    if re.match(patterns, LogLine):
        break
else:
    print LogLine.strip()
Try the following. It will give you a big speed-up. Apply the appropriate changes for Python 2.x.
pats = (date_string + pat for conf in Conf.values() for pat in conf)
master_pat = re.compile('|'.join(pats))
with open('file.txt') as log:
    for line in log:
        if master_pat.match(line):
            print(line.strip())
If I misread the logic and it is not working, please comment.
I have a file, say:
Program foo
Implicit None
integer::ab
End Program bar
Now, I want the "bar" in the last line to be "foo" (i.e. the Program and End Program names should be the same).
I wrote a python script for this, which works very well:
#!/usr/bin/python
import fileinput

with open("i.f90") as inp:
    for line in inp:
        if line.startswith("Program"):
            var = line.rsplit(' ', 1)[-1].strip()
        if line.startswith("End Program"):
            v2 = line.strip()
            print v2

STR = "End Program " + var
print STR
for line in fileinput.input("i.f90", inplace=True):
    print line.replace(v2, STR).strip()
But, as I want to call it from vim, as a ftplugin, I put it in vim's ftplugin as:
function! Edname()
python<<EOF
import vim
import fileinput
with open("i.f90") as inp:
    for line in inp:
        if line.startswith("Program"):
            var = line.rsplit(' ', 1)[-1].strip()
        if line.startswith("End Program"):
            v2 = line.strip()
            print v2

STR = "End Program " + var
print STR
for line in fileinput.input("i.f90", inplace=True):
    print line.replace(v2, STR).strip()
EOF
endfunction
As you can see, the only change I have made is to put it inside vim; there is no real change to the actual Python code. But in this case it is not working: it prints v2 and STR properly, but the line in the file (or the buffer) is not getting updated.
Any clue?
This is largely related to my earlier post. But now I have found a partial solution with Python, which does not work when called from vim.
If that's what you mean by automation, you don't need Python. Just put something like this in a ftplugin for your filetype:
function! s:FixName()
    let [buf, l, c, off] = getpos('.')
    call cursor([1, 1, 0])
    let lnum = search('\v\c^Program\s+', 'cnW')
    if !lnum
        call cursor(l, c, off)
        return
    endif
    let parts = matchlist(getline(lnum), '\v\c^Program\s+(\S*)\s*$')
    if len(parts) < 2
        call cursor(l, c, off)
        return
    endif
    let lnum = search('\v\c^End\s+Program\s+', 'cnW')
    call cursor(l, c, off)
    if !lnum
        return
    endif
    call setline(lnum, substitute(getline(lnum), '\v\c^End\s+Program\s+\zs.*', parts[1], ''))
endfunction

call s:FixName()
You can also do it with a macro, but that doesn't look as clever as the function above. ;) Something like this:
nnoremap <buffer> <silent> <C-p> /\v\c^Program\s+\zs<CR>"zy$/\v\c^End\s+Program\s+\zs<CR>D"zP
I have a file in the following format
Summary;None;Description;Emails\nDarlene\nGregory Murphy\nDr. Ingram\n;DateStart;20100615T111500;DateEnd;20100615T121500;Time;20100805T084547Z
Summary;Presence tech in smart energy management;Description;;DateStart;20100628T130000;DateEnd;20100628T133000;Time;20100628T055408Z
Summary;meeting;Description;None;DateStart;20100629T110000;DateEnd;20100629T120000;Time;20100805T084547Z
Summary;meeting;Description;None;DateStart;20100630T090000;DateEnd;20100630T100000;Time;20100805T084547Z
Summary;Balaji Viswanath: Meeting;Description;None;DateStart;20100712T140000;DateEnd;20100712T143000;Time;20100805T084547Z
Summary;Government Industry Training: How Smart is Your City - The Smarter City Assessment Tool\nUS Call-In Information: 1-866-803-2143\, International Number: 1-210-795-1098\, International Toll-free Numbers: See below\, Passcode: 6785765\nPresentation Link - Copy and paste URL into web browser: http://w3.tap.ibm.com/medialibrary/media_view?id=87408;Description;International Toll-free Numbers link - Copy and paste this URL into your web browser:\n\nhttps://w3-03.sso.ibm.com/sales/support/ShowDoc.wss?docid=NS010BBUN-7P4TZU&infotype=SK&infosubtype=N0&node=clientset\,IA%7Cindustries\,Y&ftext=&sort=date&showDetails=false&hitsize=25&offset=0&campaign=#International_Call-in_Numbers;DateStart;20100811T203000;DateEnd;20100811T213000;Time;20100805T084547Z
Now I need to create a function that does the following: the function argument specifies which line to read, and let's say I have already done line.split(';').
See if there is "meeting" or "call in number" anywhere in line[1], and likewise anywhere in line[2]. If either or both of these are true, the function should return "call-in meeting". Otherwise it should return "None Inferred".
Thanks in advance
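For reference, the check described above can be sketched directly; the field indices follow the line.split(';') layout from the question, and the function name and shortened sample record are assumptions:

```python
def classify(line):
    """Split a record on ';' and label it from fields 1 and 2,
    as described in the question."""
    fields = line.split(';')
    keywords = ('meeting', 'call in number')
    for field in fields[1:3]:
        if any(k in field.lower() for k in keywords):
            return 'call-in meeting'
    return 'None Inferred'

record = 'Summary;meeting;Description;None;DateStart;20100629T110000'
print(classify(record))  # call-in meeting
```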
Use the in operator to see if there is a match:
for line in open("file"):
    if "string" in line:
        ....
vlad003 is right: if you have newline characters in the lines, they will be treated as new lines! In this case, I would split on "Summary" instead:
from itertools import islice

def chunks(filePath):
    """Since you have newline characters in each section, you can't
    read each line in turn.  This function reads lines of the file
    and splits them into chunks, restarting each time 'Summary'
    starts a line."""
    with open(filePath) as theFile:
        chunk = []
        for line in theFile:
            if line.startswith("Summary"):
                if chunk:
                    yield chunk
                chunk = [line]
            else:
                chunk.append(line)
        yield chunk

def nth(iterable, n, default=None):
    "Gets the nth element of an iterator."
    return next(islice(iterable, n, None), default)

def getStatus(filePath, chunkNum):
    'Get the nth chunk of the file, join it, split it by ";", and return the result.'
    chunk = ''.join(nth(chunks(filePath), chunkNum, "")).split(";")
    if not chunk[0]:
        raise ValueError("could not get the right chunk")
    if "meeting" in chunk[1].lower() or "call in number" in chunk[1].lower():
        return "call-in meeting"
    else:
        return "None Inferred"
Note that this is silly if you plan to read all the chunks of the file, since it opens the file and reads through it once per query. If you plan to do this often, it would be worth parsing it into a better data format (e.g. an array of statuses). This would require one pass through the file, and give you much better lookups.
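That one-pass idea might look like the following sketch; the helper name is made up, and the keyword check is borrowed from the question:

```python
def all_statuses(lines):
    """One pass over the records: split on lines starting with
    'Summary', classify each chunk, and collect the labels so
    later lookups are simple list indexing."""
    def classify(chunk):
        fields = ''.join(chunk).split(';')
        text = ' '.join(fields[1:3]).lower()
        if 'meeting' in text or 'call in number' in text:
            return 'call-in meeting'
        return 'None Inferred'

    statuses, chunk = [], []
    for line in lines:
        if line.startswith('Summary') and chunk:
            statuses.append(classify(chunk))
            chunk = []
        chunk.append(line)
    if chunk:
        statuses.append(classify(chunk))
    return statuses

sample = ['Summary;meeting;Description;None\n',
          'Summary;None;Description;x\n']
print(all_statuses(sample))  # ['call-in meeting', 'None Inferred']
```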
A build on ghostdog74's answer:
def finder(line):
    '''Takes line number as argument. First line is number 0.'''
    with open('/home/vlad/Desktop/file.txt') as f:
        lines = f.read().split('Summary')[1:]
    searchLine = lines[line]
    if 'meeting' in searchLine.lower() or 'call in number' in searchLine.lower():
        return 'call-in meeting'
    else:
        return 'None Inferred'
I don't quite understand what you meant by line[1] and line[2] so this is the best I could do.
EDIT: Fixed the problem with the \n's. I figure since you're searching for the meeting and call in number you don't need the Summary so I used it to split the lines.
I have a text file with a lot of random occurrences of the string #STRING_A, and I would be interested in writing a short script which removes only some of them: specifically, one that scans the file and, once it finds a line which starts with this string, like
#STRING_A
then checks whether, 3 lines earlier, there is another line starting with the same string, like
#STRING_A
#STRING_A
and if it does, deletes the occurrence 3 lines backward. I was thinking about bash, but I do not know how to "go backwards" with it, so I suspect this is not possible in bash. I also thought about Python, but then I would have to store everything in memory in order to go backwards, which would be unfeasible for long files.
What do you think? Is it possible to do it in bash or python?
Thanks
Funny that after all these hours nobody's yet given a solution to the problem as actually phrased (as @John Machin points out in a comment): remove just the leading marker (if followed by another such marker 3 lines down), not the whole line containing it. It's not hard, of course; here's a tiny mod of @truppo's fun solution, for example:
from itertools import izip, chain

f = "foo.txt"
# three dummy leading entries give the three-line offset
for third, line in izip(chain([""] * 3, open(f)), open(f)):
    if third.startswith("#STRING_A") and line.startswith("#STRING_A"):
        line = line[len("#STRING_A"):]
    print line,
Of course, in real life, one would use itertools.tee instead of reading the file twice, put this code in a function, not repeat the marker constant endlessly, &c ;-).
Of course Python will work as well. Simply store the last three lines in an array and check if the first element in the array is the same as the value you are currently reading. Then delete the value and print out the current array. You would then move over your elements to make room for the new value and repeat. Of course when the array is filled you'd have to make sure to continue to move values out of the array and put in the newly read values, stopping to check each time to see if the first value in the array matches the value you are currently reading.
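A sketch of that sliding-window idea, assuming the marker and three-line window from the question (the function name is made up):

```python
def drop_earlier_marker(lines, marker='#STRING_A', gap=3):
    """Keep a window of the last *gap* lines; when the oldest and the
    newest line both start with the marker, drop the oldest line
    instead of emitting it."""
    window, out = [], []
    for line in lines:
        window.append(line)
        if len(window) > gap:
            oldest = window.pop(0)
            if not (oldest.startswith(marker) and line.startswith(marker)):
                out.append(oldest)
    return out + window  # flush whatever is still buffered

sample = ['#STRING_A\n', 'x\n', 'y\n', '#STRING_A\n', 'z\n']
print(''.join(drop_earlier_marker(sample)))
```

Note this variant drops the whole earlier line, matching the description above rather than the marker-only behaviour discussed later.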
Here is a more fun solution, using two iterators with a three element offset :)
from itertools import izip, chain, tee

f1, f2 = tee(open("foo.txt"))
# three dummy leading entries give the three-line offset
for third, line in izip(chain([""] * 3, f1), f2):
    if not (third.startswith("#STRING_A") and line.startswith("#STRING_A")):
        print line,
Why shouldn't it be possible in bash? You don't need to keep the whole file in memory, just the last three lines (if I understood correctly), and write what's appropriate to standard output. Redirect that into a temporary file, check that everything worked as expected, and overwrite the source file with the temporary one.
Same goes for Python.
I'd provide a script of my own, but that wouldn't be tested. ;-)
As AlbertoPL said, store lines in a FIFO for later use; don't "go backwards". For this I would definitely use Python over bash+sed/awk/whatever.
I took a few moments to code this snippet up:
from collections import deque

line_fifo = deque()
for line in open("test"):
    line_fifo.append(line)
    if len(line_fifo) == 4:
        # "look 3 lines backward"
        if line_fifo[0] == line_fifo[-1] == "#STRING_A\n":
            # get rid of that match
            line_fifo.popleft()
        else:
            # print out the top of the fifo
            print line_fifo.popleft(),

# don't forget to print out the fifo when the file ends
for line in line_fifo: print line,
This code will scan through the file and remove lines starting with the marker. It keeps only three lines in memory by default:
from collections import deque

def delete(fp, marker, gap=3):
    """Delete lines from *fp* if they start with *marker* and are
    followed by another line starting with *marker* *gap* lines after.
    """
    buf = deque()
    for line in fp:
        if len(buf) < gap:
            buf.append(line)
        else:
            old = buf.popleft()
            if not (line.startswith(marker) and old.startswith(marker)):
                yield old
            buf.append(line)
    for line in buf:
        yield line
I've tested it with:
>>> from StringIO import StringIO
>>> fp = StringIO('''a
... b
... xxx 1
... c
... xxx 2
... d
... e
... xxx 3
... f
... g
... h
... xxx 4
... i''')
>>> print ''.join(delete(fp, 'xxx'))
a
b
xxx 1
c
d
e
xxx 3
f
g
h
xxx 4
i
This "answer" is for lyrae ... I'll amend my previous comment: if the needle is in the first 3 lines of the file, your script will either cause an IndexError or access a line that it shouldn't be accessing, sometimes with interesting side-effects.
Example of your script causing IndexError:
>>> lines = "#string line 0\nblah blah\n".splitlines(True)
>>> needle = "#string "
>>> for i,line in enumerate(lines):
...     if line.startswith(needle) and lines[i-3].startswith(needle):
...         lines[i-3] = lines[i-3].replace(needle, "")
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
IndexError: list index out of range
and this example shows not only that the Earth is round but also why your "fix" to the "don't delete the whole line" problem should have used .replace(needle, "", 1) or [len(needle):] instead of .replace(needle, "")
>>> lines = "NEEDLE x NEEDLE y\nnoddle\nnuddle\n".splitlines(True)
>>> needle = "NEEDLE"
>>> # Expected result: no change to the file
... for i,line in enumerate(lines):
...     if line.startswith(needle) and lines[i-3].startswith(needle):
...         lines[i-3] = lines[i-3].replace(needle, "")
...
>>> print ''.join(lines)
x y <<<=== whoops!
noddle
nuddle
<<<=== still got unwanted newline in here
>>>
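Combining the two fixes suggested above (a bounds check plus prefix-only removal), the loop might become something like this sketch (the function name is made up):

```python
def strip_earlier_needle(lines, needle='#string '):
    """Remove only the leading needle from a line when another
    needle-prefixed line appears exactly three lines later, and
    never index before the start of the list."""
    lines = list(lines)
    for i, line in enumerate(lines):
        if i >= 3 and line.startswith(needle) and lines[i - 3].startswith(needle):
            lines[i - 3] = lines[i - 3][len(needle):]  # strip the prefix only
    return lines

before = ['#string a\n', 'b\n', 'c\n', '#string d\n']
print(''.join(strip_earlier_needle(before)))
```

Using a slice of len(needle) avoids both the negative-index wraparound and the replace-everywhere problem demonstrated above.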
My awk-fu has never been that good... but the following may provide you what you're looking for in a bash-shell/shell-utility form:
sed `awk 'BEGIN{ORS=";"}
/#STRING_A/ {
    if(LAST!="" && LAST+3 >= NR) print LAST "d"
    LAST = NR
}' test_file` test_file
Basically... awk is producing a command for sed to strip certain lines. I'm sure there's a relatively easy way to make awk do all of the processing, but this does seem to work.
The bad part? It does read the test_file twice.
The good part? It is a bash/shell-utility implementation.
Edit: Alex Martelli points out that the sample file above might have confused me (my code above deletes the whole line, rather than only the #STRING_A flag).
This is easily remedied by adjusting the command to sed:
sed `awk 'BEGIN{ORS=";"}
/#STRING_A/ {
    if(LAST!="" && LAST+3 >= NR) print LAST "s/#STRING_A//"
    LAST = NR
}' test_file` test_file
This may be what you're looking for?
lines = open('sample.txt').readlines()
needle = "#string "
for i, line in enumerate(lines):
    if line.startswith(needle) and lines[i-3].startswith(needle):
        lines[i-3] = lines[i-3].replace(needle, "")
print ''.join(lines)
this outputs:
string 0 extra text
string 1 extra text
string 2 extra text
string 3 extra text
--replaced -- 4 extra text
string 5 extra text
string 6 extra text
#string 7 extra text
string 8 extra text
string 9 extra text
string 10 extra text
In bash you can use tac (or tail -r on BSD systems) to read the file backwards:
LINES=`tac filename`
# now iterate through the lines and do your checking
I would consider using sed. GNU sed supports definition of line ranges. If sed fails, there is another beast, awk, and I'm sure you can do it with awk.
O.K., I feel I should post my awk POC. I could not figure out how to use sed addresses. I have not tried a combination of awk+sed, but it seems to me that would be overkill.
My awk script works as follows:
It reads lines and stores them in a 3-line buffer.
Once the desired pattern is found (/^data.*/ in my case), the 3-line buffer is checked to see whether the pattern was seen three lines ago.
If the pattern was seen, those 3 lines are scratched.
To be honest, I would probably go with Python as well, given that awk is really awkward.
the AWK code follows:
function max(a, b)
{
    if (a > b)
        return a;
    else
        return b;
}

BEGIN {
    w = 0;         # write index
    r = 0;         # read index
    buf[0, 1, 2];  # buffer
}

END {
    # flush buffer
    # start at read index and print out up to w index
    for (k = r % 3; k >= r - max(r - 3, 0); k--) {
        # search in 3 line history buf
        if (match(buf[k % 3], /^data.*/) != 0) {
            # found -> remove lines from history
            # by rewriting them -> adjust write index
            w -= max(r, 3);
        }
    }
    buf[w % 3] = $0;
    w++;
}

/^.*/ {
    # store line into buffer; if the history
    # is full, print out the oldest one.
    if (w > 2) {
        print buf[r % 3];
        r++;
        buf[w % 3] = $0;
    }
    else {
        buf[w] = $0;
    }
    w++;
}