I have a text file, which is structured as follows:
segmentA {
content Aa
content Ab
content Ac
....
}
segmentB {
content Ba
content Bb
content Bc
......
}
segmentC {
content Ca
content Cb
content Cc
......
}
I know how to search for certain strings through the whole text file, but how can I restrict the search for a certain string to within, for example, "segmentC"? I need something like a regular expression to tell the script:
If the text begins with "segmentC {", perform a search for a certain string until the first "}" appears.
Does anyone have an idea?
Thanks in advance!
Not a RegEx solution ... but it does the job!

def SearchStuff(lines, sstr):
    i = 0
    while lines[i].strip() != '}':
        # Do stuff ... for example:
        if 'Ca' in lines[i]:
            return lines[i]
        i += 1

def main(search_str):
    with open('file.txt', 'r') as f:
        lines = f.readlines()
    for index, line in enumerate(lines):
        if search_str in line:
            break
    lines = lines[index + 1:]
    print(SearchStuff(lines, search_str))

search_str = 'segmentC'  # set this string accordingly
main(search_str)
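For comparison, a regex version is also possible. A minimal sketch, assuming the first } after the header closes the segment (re.S lets . span newlines):

import re

with open('file.txt') as f:
    text = f.read()

# Non-greedy: capture everything between "segmentC {" and the first "}"
match = re.search(r"segmentC\s*\{(.*?)\}", text, re.S)
if match and 'Ca' in match.group(1):
    print("found 'Ca' inside segmentC")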
Depending on the complexity you are looking for, you can range from a simple state machine with line-based pattern searching to a full lexer.
Line-based search
The example below assumes that you are only looking for one segment, and that segmentC { and the closing } are each on a single line.
def parsesegment(fh):
    # Yields all lines inside "segmentC"
    state = "out"
    for line in fh:
        line = line.strip()  # in case there are whitespaces around
        if state == "out":
            if line.startswith("segmentC {"):
                state = "in"
                continue
        elif state == "in":
            if line.startswith("}"):
                state = "out"
                break
            # Work on the specific lines here
            yield line

with open(...) as fh:
    for line in parsesegment(fh):
        # do something
        pass
Simple Lexer
If you need more flexibility, you can design a simple lexer/parser pair. For example, the following code makes no assumptions about the organisation of the syntax across lines. It also ignores unknown patterns, which a typical lexer does not (normally it should raise a syntax error):
import re

class ParseSegment:
    # Dictionary of patterns per state
    # Tuples are (token name, pattern, state change command)
    _regexes = {
        "out": [
            ("open", re.compile(r"segment(?P<segment>\w+)\s+\{"), "in")
        ],
        "in": [
            ("close", re.compile(r"\}"), "out"),
            # Here an example of what you could want to match
            ("content", re.compile(r"content\s+(?P<content>\w+)"), None)
        ]
    }

    def lex(self, source, initpos=0):
        pos = initpos
        end = len(source)
        state = "out"
        while pos < end:
            for token_name, reg, state_chng in self._regexes[state]:
                # Try to get a match
                match = reg.match(source, pos)
                if match:
                    # Advance according to how much was matched
                    pos = match.end()
                    # yield a token if it has a name
                    if token_name is not None:
                        # Yield token name, the full matched part of source
                        # and the match grouped according to (?P<tag>) tags
                        yield (token_name, match.group(), match.groupdict())
                    # Switch state if requested
                    if state_chng is not None:
                        state = state_chng
                    break
            else:
                # No match, advance by one character
                # This is particular to that lexer; usually no match means
                # the input file has an error in the syntax and the lexer
                # should raise an exception
                pos += 1

    def parse(self, source, initpos=0):
        # This is an example of use of the lexer with a parser
        # This converts the input file into a dictionary. Keys are segment
        # names, and values are lists of contents.
        segments = {}
        cur_segment = None
        # Use lexer to get tokens from source
        for token, fullmatch, groups in self.lex(source, initpos):
            # On open, create the list of content in segments
            if token == "open":
                cur_segment = groups["segment"]
                segments[cur_segment] = []
            # On content, ensure we know the segment and add content to
            # the list
            elif token == "content":
                if cur_segment is None:
                    raise RuntimeError("Content found outside a segment")
                segments[cur_segment].append(groups["content"])
            # On close, set the current segment to unknown
            elif token == "close":
                cur_segment = None
            # ignore unknown tokens, we could raise an error instead
        return segments

def main():
    with open("...", "r") as fh:
        data = fh.read()
    lexer = ParseSegment()
    segments = lexer.parse(data)
    print(segments)
    return 0

if __name__ == '__main__':
    main()
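For reference, run against the sample file from the question, parse() should return something along these lines (the "...." filler lines match no pattern and are simply skipped):

{'A': ['Aa', 'Ab', 'Ac'], 'B': ['Ba', 'Bb', 'Bc'], 'C': ['Ca', 'Cb', 'Cc']}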
Full Lexer
Then if you need even more flexibility and reusability, you will have to create a full parser. No need to reinvent the wheel; have a look at this list of language parsing modules, and you will probably find one that suits you.
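For instance, here is a minimal sketch with pyparsing (one of the modules typically on such lists); the grammar below is an assumption based on the sample file from the question, not the only way to write it:

import pyparsing as pp

# One segment: "segment<Name> {" followed by content lines and a closing "}"
header = pp.Regex(r"segment\w+")("name")
line = pp.Regex(r"[^{}\n]+")  # one raw content line
segment = pp.Group(
    header
    + pp.Suppress("{")
    + pp.Group(pp.ZeroOrMore(line))("lines")
    + pp.Suppress("}")
)
document = pp.ZeroOrMore(segment)

with open("file.txt") as fh:
    text = fh.read()
for seg in document.parseString(text, parseAll=True):
    # seg["name"] is the full header token, e.g. "segmentC"
    print(seg["name"], list(seg["lines"]))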
Related
Suppose I have a file that looks like this:
Channel 1
12:30-14:00 Children’s program
17:00-19:00 Afternoon News
8:00-9:00 Morning News
—————————————————————————
Channel 2
19:30-21:00 National Geographic
14:00-15:30 Comedy movies
And so on for a finite number of channels and programs.
I would like to read the file, create a dictionary, and sort it with respect to channel and the program being shown at a given time. So, something like this:
Channels = {
    Channel 1: {Childrens program: 19:30-21:00, Afternoon news: 17:00-18:00},
    Channel 2: {National Geographic: 19:30-21:00, Batman: 14:00-15:30}
}
Try regular expressions on each line you parse, and fill your dictionary depending on what type of line it is, like so:
import re

d = {}
with open("f.txt") as f:
    channel = None
    for l in f.readlines():
        # If a channel header is found on this line
        match_line = re.findall(r"Channel \d*", l)
        if match_line:
            channel = match_line[0]
            d[channel] = {}
        # If a program entry is found on this line
        match_program = re.findall(r"(\d{1,2}:\d{1,2}-\d{1,2}:\d{1,2}) (.*$)", l)
        if match_program:
            d[channel][match_program[0][1]] = match_program[0][0]
d then equals:
{
    "Channel 1": {
        "Children’s program": "12:30-14:00",
        "Afternoon News": "17:00-19:00",
        "Morning News": "8:00-9:00"
    },
    "Channel 2": {
        "National Geographic": "19:30-21:00",
        "Comedy movies": "14:00-15:30"
    }
}
My approach: first, break the text into blocks, each separated by the dash lines. Next, convert each block into a single key/value pair of a bigger dictionary. Finally, put them all together to form the result.
import json
import pathlib
import re

def parse_channel(text):
    """
    The `text` is a block of text such as:

    Channel 1
    12:30-14:00 Children’s program
    17:00-19:00 Afternoon News
    8:00-9:00 Morning News

    This function will return ("Channel 1", {...}) which are
    the key and value of a bigger dictionary
    """
    lines = (line.strip() for line in text.splitlines() if line)
    key = next(lines)
    value = {}
    for line in lines:
        time, name = line.split(" ", 1)
        value[name] = time
    return key, value

path = pathlib.Path(__file__).with_name("data.txt")
assert path.exists()
text = path.read_text(encoding="utf-8")

# Break text into blocks, separated by dash lines (the sample shows
# em-dashes, so accept both em-dashes and ASCII hyphens)
blocks = re.split("[-—]{2,}", text)

# Convert each block into key/value, then build the final dictionary
channels = dict(map(parse_channel, blocks))
print(json.dumps(channels, indent=4))
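The question also asks for sorting, which neither snippet does. A small follow-up sketch, assuming the channels dictionary built above (program name mapped to time span) and the insertion-ordered dicts of Python 3.7+:

def start_minutes(span):
    # "8:00-9:00" -> 480; assumes 24-hour H:MM or HH:MM times
    hours, minutes = map(int, span.split("-")[0].split(":"))
    return hours * 60 + minutes

# Rebuild each channel's dict with programs ordered by start time
sorted_channels = {
    channel: dict(sorted(programs.items(), key=lambda item: start_minutes(item[1])))
    for channel, programs in channels.items()
}
print(sorted_channels)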
I have a plugin for Sublime Text 3 that lets me move the cursor to a line by its number:
import sublime, sublime_plugin

class prompt_goto_lineCommand(sublime_plugin.WindowCommand):
    def run(self):
        self.window.show_input_panel("Goto Line:", "", self.on_done, None, None)
        pass

    def on_done(self, text):
        try:
            line = int(text)
            if self.window.active_view():
                self.window.active_view().run_command("goto_line", {"line": line})
        except ValueError:
            pass

class go_to_lineCommand(sublime_plugin.TextCommand):
    def run(self, edit, line):
        # Convert from 1-based to a 0-based line number
        line = int(line) - 1
        # Negative line numbers count from the end of the buffer
        if line < 0:
            lines, _ = self.view.rowcol(self.view.size())
            line = lines + line + 1
        pt = self.view.text_point(line, 0)
        self.view.sel().clear()
        self.view.sel().add(sublime.Region(pt))
        self.view.show(pt)
I want to improve it to let me move the cursor to the first line containing a specified string. It is like a search in a file:
For example, if I pass it the string "class go_to_lineCommand", the plugin must move the cursor to line 17:
and possibly select the string class go_to_lineCommand.
The problem is reduced to finding regionWithGivenString, and then I can select it:
self.view.sel().add(regionWithGivenString)
But I don't know a method to get regionWithGivenString.
I tried to:
- search Google for "sublime plugin find and select text"
- check the API
But still no result.
I am not sure about the typical way, but you can achieve this in the following way:
1. Get the content of the current doc.
2. Search for the target string to find its start and end positions.
3. Add the Region(start, end) to the selections.
Example:

def run(self, edit, target):
    if not target:
        return
    content = self.view.substr(sublime.Region(0, self.view.size()))
    begin = content.find(target)
    if begin == -1:
        return
    end = begin + len(target)
    target_region = sublime.Region(begin, end)
    self.view.sel().clear()
    self.view.sel().add(target_region)
There you have it in the API: use the view.find(regex, pos) method.
s = self.view.find("go_to_lineCommand", 0)
self.view.sel().add(s)
http://www.sublimetext.com/docs/3/api_reference.html
A possible improvement to longhua's answer: also move the cursor to the target line.
class FindcustomCommand(sublime_plugin.TextCommand):
    def _select(self):
        self.view.sel().clear()
        self.view.sel().add(self._target_region)

    def run(self, edit):
        TARGET = 'http://nabiraem'
        # if not target or target == "":
        #     return
        content = self.view.substr(sublime.Region(0, self.view.size()))
        begin = content.find(TARGET)
        if begin == -1:
            return
        end = begin + len(TARGET)
        self._target_region = sublime.Region(begin, end)
        self._select()
        self.view.show(self._target_region)  # scroll to selection
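For completeness, here is a sketch combining view.find() from the earlier answer with the cursor movement, taking the search string as a command argument. The command name and the not-found guard are assumptions, not part of the plugin above:

import sublime
import sublime_plugin

# Hypothetical command; run it with:
# view.run_command("find_and_select", {"target": "class go_to_lineCommand"})
class FindAndSelectCommand(sublime_plugin.TextCommand):
    def run(self, edit, target):
        if not target:
            return
        # sublime.LITERAL treats `target` as plain text, not a regex
        region = self.view.find(target, 0, sublime.LITERAL)
        # older API versions signal "not found" with Region(-1, -1)
        if region is None or region.a == -1:
            return
        self.view.sel().clear()
        self.view.sel().add(region)
        self.view.show(region)  # scroll to the selection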
I have a neat little script in Python that I would like to port to Ruby, and I think it's highlighting my noobishness at Ruby. I'm getting an error that there is an unexpected end statement, but I don't see how this can be so. Perhaps there is a keyword that requires an end, or something that doesn't want an end, that I forgot about. Here is all of the code leading up to the offending line; the offending line is commented.
begin
  require 'base64'
  require 'base32'
rescue LoadError
  puts "etext requires base32. use 'gem install --remote base32' and try again"
end

# Get a string from a text file from disk
filename = ARGV.first
textFile = File.open(filename)
text = textFile.read()
mailType = "text only" # set the default mailType

# cut the email up by sections
textList1 = text.split(/\n\n/)
header = textList1[0]
if header.match(/MIME-Version/)
  mailType = "MIME"
end

# If mail has no attachments, parse as text-only. This is the class that does this
class TextOnlyMailParser
  def initialize(textList)
    a = 1
    body = ""
    header = textList[0]
    @parsedEmail = Email.new(header)
    while a < textList.count
      body += ('\n' + textList[a] + '\n')
      a += 1
    end
    @parsedEmail.body = body
  end
end

def separate(text, boundary = nil)
  # returns list of strings and lists containing all of the parts of the email
  if !boundary # look in the email for "boundary= X"
    text.scan(/(?<=boundary=).*/) do |bound|
      textList = recursiveSplit(text, bound)
    end
    return textList
  end
  if boundary
    textList = recursiveSplit(text, boundary)
  end
end

def recursiveSplit(chunk, boundary)
  if chunk.is_a? String
    searchString = "--" + boundary
    ar = chunk.split(searchString)
    return ar
  elsif chunk.is_a? Array
    chunk do |bit|
      recursiveSplit(bit, boundary)
    end
  end
end

class MIMEParser
  def initialize(textList)
    @textList = textList
    @nestedItems = []
    newItem = NestItem.new(self)
    newItem.value = @textList[0]
    newItem.contentType = "Header"
    @nestedItems.push(newItem)
    # setup parsed email
    @parsedEmail = Email.new(newItem.value)
    self._constructNest
  end

  def checkForContentSpecial(item)
    match = item.value.match(/Content-Disposition: attachment/)
    if match
      filename = item.value.match(/(?<=filename=").+(?=")/)
      encoding = item.value.match(/(?<=Content-Transfer-Encoding: ).+/)
      data = item.value.match(/(?<=\n\n).*(?=(\n--)|(--))/m)
      dataGroup = data.split(/\n/)
      dataString = ''
      i = 0
      while i < dataGroup.count
        dataString += dataGroup[i]
        i ++
      end #<-----THIS IS THE OFFENDING LINE
      @parsedEmail.attachments.push(Attachment.new(filename, encoding, dataString))
    end
Your issue is the i ++ line; Ruby does not have post- or pre-increment/decrement operators, and the line fails to parse: Ruby reads i ++ as an unfinished addition i + (+ ...), expects an operand on the next line, and instead finds the end keyword, hence the unexpected end error. I can't personally account for why i++ evaluates in IRB, but i ++ does not perform any action.
Instead, replace your ++ operators with += 1, making that last while:

while i < dataGroup.count
  dataString += dataGroup[i]
  i += 1
end
But also think about the Ruby way: if you're just appending to a string, why not do dataString = dataGroup.join instead of looping with a while construct?
If I have a keyword, how can I get the lexer, once it encounters that keyword, to just grab the rest of the line and return it as a string? That is, once it encounters an end of line, it should return everything on that line.
Here is the line I'm looking at:
description here is the rest of my text to collect
Thus, when the lexer encounters description, I would like "here is the rest of my text to collect" returned as a string.
I have the following defined, but it seems to be throwing an error:
states = (
    ('bcdescription', 'exclusive'),
)

def t_bcdescription(t):
    r'description '
    t.lexer.code_start = t.lexer.lexpos
    t.lexer.level = 1
    t.lexer.begin('bcdescription')

def t_bcdescription_close(t):
    r'\n'
    t.value = t.lexer.lexdata[t.lexer.code_start:t.lexer.lexpos+1]
    t.type = "BCDESCRIPTION"
    t.lexer.lineno += t.value.count('\n')
    t.lexer.begin('INITIAL')
    return t
This is part of the error being returned:
File "/Users/me/Coding/wm/wm_parser/ply/lex.py", line 393, in token
raise LexError("Illegal character '%s' at index %d" % (lexdata[lexpos],lexpos), lexdata[lexpos:])
ply.lex.LexError: Illegal character ' ' at index 40
Finally, if I wanted this functionality for more than one token, how could I accomplish that?
Thanks for your time
There is no big problem with your code; in fact, I just copied your code and ran it, and it works well:
import ply.lex as lex

states = (
    ('bcdescription', 'exclusive'),
)

tokens = ("BCDESCRIPTION",)

def t_bcdescription(t):
    r'\bdescription\b'
    t.lexer.code_start = t.lexer.lexpos
    t.lexer.level = 1
    t.lexer.begin('bcdescription')

def t_bcdescription_close(t):
    r'\n'
    t.value = t.lexer.lexdata[t.lexer.code_start:t.lexer.lexpos+1]
    t.type = "BCDESCRIPTION"
    t.lexer.lineno += t.value.count('\n')
    t.lexer.begin('INITIAL')
    return t

def t_bcdescription_content(t):
    r'[^\n]+'

lexer = lex.lex()

data = 'description here is the rest of my text to collect\n'
lexer.input(data)
while True:
    tok = lexer.token()
    if not tok:
        break
    print(tok)
and the result is:
LexToken(BCDESCRIPTION,' here is the rest of my text to collect\n',1,50)
So maybe you can check other parts of your code.
As for wanting this functionality for more than one token: you can simply capture words, and when a word matching one of those tokens appears, start capturing the rest of the content with the code above.
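A sketch of that idea, with two illustrative keywords (description and title); the state name, token names, and the error/ignore rules are assumptions added to keep the example self-contained:

import ply.lex as lex

states = (('grab', 'exclusive'),)
tokens = ("BCDESCRIPTION", "BCTITLE")

# keywords that trigger grabbing the rest of the line
keywords = {"description": "BCDESCRIPTION", "title": "BCTITLE"}

def t_keyword(t):
    r'\b(description|title)\b'
    t.lexer.grabbed_type = keywords[t.value]
    t.lexer.code_start = t.lexer.lexpos
    t.lexer.begin('grab')

def t_grab_close(t):
    r'\n'
    # everything between the keyword and the end of the line
    t.value = t.lexer.lexdata[t.lexer.code_start:t.lexer.lexpos].strip()
    t.type = t.lexer.grabbed_type
    t.lexer.lineno += 1
    t.lexer.begin('INITIAL')
    return t

def t_grab_content(t):
    r'[^\n]+'  # swallow the line; it is emitted by t_grab_close

t_ignore = ' \t'
t_grab_ignore = ''

def t_error(t):
    t.lexer.skip(1)

def t_grab_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('description here is the rest\ntitle another line\n')
while True:
    tok = lexer.token()
    if not tok:
        break
    print(tok)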
It is not obvious, without further information, why you need to use a lexer/parser for this:
>>> x = 'description here is the rest of my text to collect'
>>> a, b = x.split(' ', 1)
>>> a
'description'
>>> b
'here is the rest of my text to collect'
I have a config file:
[local]
variable1 : val1 ;#comment1
variable2 : val2 ;#comment2
Code like this reads only the value of each key:
class Config(object):
    def __init__(self):
        self.config = ConfigParser.ConfigParser()
        self.config.read('config.py')

    def get_path(self):
        return self.config.get('local', 'variable1')

if __name__ == '__main__':
    c = Config()
    print(c.get_path())
but I also want to read the comment present along with the value; any suggestions in this regard would be very helpful.
Alas, this is not easily done in the general case. Comments are supposed to be ignored by the parser.
In your specific case, it is easy, because # only serves as a comment character if it begins a line. So variable1's value will be "val1 #comment1". I suppose you could use something like this, only less brittle:
val1_line = c.get('local', 'var1')
val1, comment = val1_line.split(' #')
If the value of a 'comment' is needed, it is probably not a proper comment? Consider adding explicit keys for the 'comments', like this:
[local]
var1: 108.5j
var1_comment: remember, the flux capacitor capacitance is imaginary!
Your only solution is to write another ConfigParser, overriding the _read() method. In your ConfigParser you should delete all the checks for comment removal. This is a dangerous solution, but it should work.
class ValuesWithCommentsConfigParser(ConfigParser.ConfigParser):
    def _read(self, fp, fpname):
        from ConfigParser import DEFAULTSECT, MissingSectionHeaderError, ParsingError

        cursect = None  # None, or a dictionary
        optname = None
        lineno = 0
        e = None  # None, or an exception
        while True:
            line = fp.readline()
            if not line:
                break
            lineno = lineno + 1
            # comment or blank line?
            if line.strip() == '' or line[0] in '#;':
                continue
            if line.split(None, 1)[0].lower() == 'rem' and line[0] in "rR":
                # no leading whitespace
                continue
            # continuation line?
            if line[0].isspace() and cursect is not None and optname:
                value = line.strip()
                if value:
                    cursect[optname].append(value)
            # a section header or option header?
            else:
                # is it a section header?
                mo = self.SECTCRE.match(line)
                if mo:
                    sectname = mo.group('header')
                    if sectname in self._sections:
                        cursect = self._sections[sectname]
                    elif sectname == DEFAULTSECT:
                        cursect = self._defaults
                    else:
                        cursect = self._dict()
                        cursect['__name__'] = sectname
                        self._sections[sectname] = cursect
                    # So sections can't start with a continuation line
                    optname = None
                # no section header in the file?
                elif cursect is None:
                    raise MissingSectionHeaderError(fpname, lineno, line)
                # an option line?
                else:
                    mo = self._optcre.match(line)
                    if mo:
                        optname, vi, optval = mo.group('option', 'vi', 'value')
                        optname = self.optionxform(optname.rstrip())
                        # This check is fine because the OPTCRE cannot
                        # match if it would set optval to None
                        if optval is not None:
                            optval = optval.strip()
                            # allow empty values
                            if optval == '""':
                                optval = ''
                            cursect[optname] = [optval]
                        else:
                            # valueless option handling
                            cursect[optname] = optval
                    else:
                        # a non-fatal parsing error occurred. set up the
                        # exception but keep going. the exception will be
                        # raised at the end of the file and will contain a
                        # list of all bogus lines
                        if not e:
                            e = ParsingError(fpname)
                        e.append(lineno, repr(line))
        # if any parsing errors occurred, raise an exception
        if e:
            raise e
        # join the multi-line values collected while reading
        all_sections = [self._defaults]
        all_sections.extend(self._sections.values())
        for options in all_sections:
            for name, val in options.items():
                if isinstance(val, list):
                    options[name] = '\n'.join(val)
In the ValuesWithCommentsConfigParser, I fixed some imports and deleted the appropriate sections of code.
Using the same config.ini from my previous answer, I can prove the previous code is correct:
config = ValuesWithCommentsConfigParser()
config.read('config.ini')
assert config.get('local', 'variable1') == 'value1 ; comment1'
assert config.get('local', 'variable2') == 'value2 # comment2'
According to the ConfigParser module documentation,
Configuration files may include comments, prefixed by specific
characters (# and ;). Comments may appear on their own in an otherwise
empty line, or may be entered in lines holding values or section
names. In the latter case, they need to be preceded by a whitespace
character to be recognized as a comment. (For backwards compatibility,
only ; starts an inline comment, while # does not.)
If you want to read the "comment" with the value, you can omit the whitespace before the ; character or use the #. But in this case the strings comment1 and comment2 become part of the value and are not considered comments any more.
A better approach would be to use a different property name, such as variable1_comment, or to define another section in the configuration dedicated to comments:
[local]
variable1 = value1
[comments]
variable1 = comment1
The first solution requires you to generate a new key from another one (i.e. compute variable1_comment from variable1); the second one allows you to use the same key targeting different sections of the configuration file.
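A minimal sketch of the second approach, assuming the [local]/[comments] layout just shown is saved as config.ini:

import ConfigParser

config = ConfigParser.ConfigParser()
config.read('config.ini')

# the value and its "comment" share a key across two sections
value = config.get('local', 'variable1')       # 'value1'
comment = config.get('comments', 'variable1')  # 'comment1'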
As of Python 2.7.2, it is always possible to read a comment along with the value if you use the # character. As the docs say, this is for backwards compatibility. The following code should run smoothly:
config = ConfigParser.ConfigParser()
config.read('config.ini')
assert config.get('local', 'variable1') == 'value1'
assert config.get('local', 'variable2') == 'value2 # comment2'
for the following config.ini file:
[local]
variable1 = value1 ; comment1
variable2 = value2 # comment2
If you adopt this solution, remember to manually parse the result of get() for values and comments.
According to the manuals:
Lines beginning with '#' or ';' are ignored and may be used to provide comments.
So the value of variable1 is "val1 #comment1"; the comment is part of the value.
You can check your config to see whether you put a newline before your comment.
In case anyone comes along afterwards: my situation was that I needed to read in a .ini file generated by a Pascal application, which didn't care about # or ; starting its keys.
For example, the .ini file would look like this:
[KEYTYPE PATTERNS]
##-######=CLAIM
Python's configparser would skip that key/value pair, so I needed to modify the configparser to not treat # as a comment prefix:
config = configparser.ConfigParser(comment_prefixes="")
config.read("thefile")
I'm sure I could set comment_prefixes to whatever Pascal uses for comments, but I didn't see any, so I set it to an empty string.
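A quick check of that idea (assuming the Pascal-generated file above is saved as "thefile"):

import configparser

# with no comment prefixes at all, '#'-style keys survive parsing
config = configparser.ConfigParser(comment_prefixes="")
config.read("thefile")
print(config["KEYTYPE PATTERNS"]["##-######"])  # -> CLAIM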