s = '^^^# """#$ raw data &*823ohcneuj^^^ Important Information ^^^raw data^^^ Imp Info'
In it, I want to remove the text between each pair of ^^^ delimiters.
The output should be "Important Information Imp Info".
You can do this with regular expressions:
import re
s = '^^^# """#$ raw data &*823ohcneuj^^^ Important Information ^^^raw data^^^ Imp Info'
important = re.compile(r'\^\^\^.*?\^\^\^').sub('', s)
The key elements in this regular expression are:
escape the ^ character, since it has a special meaning in regular expressions
use the non-greedy (ungreedy) match .*?, so each pair of delimiters is matched separately
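To see why the non-greedy qualifier matters here, compare the two on the sample string (a quick illustrative check):

import re

s = '^^^# """#$ raw data &*823ohcneuj^^^ Important Information ^^^raw data^^^ Imp Info'

# Greedy .* runs from the first ^^^ to the very last one, swallowing the middle text
print(re.sub(r'\^\^\^.*\^\^\^', '', s))   # ' Imp Info'

# Non-greedy .*? stops at the nearest closing ^^^, so each delimited span is removed separately
print(re.sub(r'\^\^\^.*?\^\^\^', '', s))  # ' Important Information  Imp Info'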
def removeText(text):
    caretCount = 0
    newText = ""
    for char in text:
        if char == '^':
            # Reset once we have passed 2 full sets of carets
            if caretCount == 6:
                caretCount = 1
            else:
                caretCount += 1
        elif caretCount == 3:
            # Between the first '^^^' and the second: ignore everything
            continue
        elif caretCount == 6:
            # Past the second '^^^': this is the message, keep the character
            newText += char
    return newText
This returns "Important Information Imp Info" (note that the function returns the string rather than printing it, and it leaves some whitespace where the delimited spans were removed).
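A quick check, assuming s is the sample string from the question:

print(removeText(s))   # ' Important Information  Imp Info'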
I am getting error messages when I try to print out comments or submissions with emojis in them. How can I just disregard the emoji and print only letters and numbers?
I am using PRAW to web-scrape:
top_posts2 = page.top(limit=25)
for post in top_posts2:
    outputFile.write(post.title)
    outputFile.write(' ')
    outputFile.write(str(post.score))
    outputFile.write('\n')
    outputFile.write(post.selftext)
    outputFile.write('\n')
    submissions = reddit.submission(id=post.id)
    comment_page = submissions.comments
    top_comment = comment_page[0]  # by default, this will be the best comment of the post
    commentBody = top_comment.body
    outputFile.write(top_comment.body)
    outputFile.write('\n')
I want to output only letters and numbers, and maybe some special characters (or all of them).
There are a couple of ways you can do this. I would recommend writing a small "text cleaning" function:
def cleanText(text):
    new_text = ""
    for c in text:  # for each character in the text
        if c.isalnum():  # keep it if it is a letter or a number (alphanumeric)
            new_text += c
    return new_text
or, if you want to keep specific non-alphanumeric characters as well:
def cleanText(text):
    valid_symbols = "!##$%^&*()"  # <-- add whatever symbols you want here
    new_text = ""
    for c in text:  # for each character in the text
        if c.isalnum() or c in valid_symbols:  # keep alphanumerics and valid symbols
            new_text += c
    return new_text
so then in your script you can do something like
commentBody = cleanText(top_comment.body)
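If you prefer something more compact, the same whitelist idea fits in one line; this is just an equivalent sketch of the loop above (valid_symbols being whatever set you chose):

def cleanText(text, valid_symbols="!##$%^&*()"):
    # keep alphanumerics plus the whitelisted symbols, drop everything else
    return ''.join(c for c in text if c.isalnum() or c in valid_symbols)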
I'm trying to extract some words between two delimiters. It works for the files where the script finds the delimiters, but for the other files the code extracts the whole file.
Example:
File 00.txt:
'bqukfkb saved qshfqs illjQNqdj iohqsijqsd qsoiqsdqs'
File 01.txt:
'jkhjkl dbdqs ihnzqid Bad value okkkk SPAN sfsdf didjsfsdf'
I want to open 2 or more files like these two and extract only the words between 'Bad value' and 'SPAN'.
My code works for the file 01.txt, but not for 00.txt (I think that is because it doesn't find the delimiters, so it prints everything). How can I fix it?
import datetime
from tkinter import Tk, filedialog

def get_path():  # return the path of the selected file(s)
    root = Tk()
    i = datetime.datetime.now()
    day = i.day
    month = i.month
    root.filename = filedialog.askopenfilenames(initialdir="Z:\SGI\SYNCBBG", title="Select your files", filetypes=(("Fichier 1", "f6365tscf.SCD*" + str(month) + str(day) + ".1"), ("all files", ".*")))
    root.withdraw()
    return root.filename

def extraction_error(file):
    f = open(file, 'r')
    file = f.read()
    f.close()
    start = file.find('Bad value') + 9
    end = file.find('SPAN', start)
    return file[start:end]

paths = get_path()
cpt = len(paths)
for x in range(0, cpt):
    print(extraction_error(paths[x]))
Output:
saved qshfqs illjQNqdj iohqsijqsd qsoiqsdq
okkkk
So in this case I just want to extract 'okkkk', and not print 'saved....' for the other file.
Thanks in advance for your help
In your extraction_error function, you may want to test if the two key words can be found:
start = file.find('Bad value')  # remove the + 9 here, apply it later
end = file.find('SPAN', start)
if start != -1 and end != -1:  # test if both key words were found (-1 means not found)
    return file[start + 9:end]
else:
    return ""
You're printing out something because you are adding 9 to the start variable. find returns negative one if the string is not found, so you end up printing the elements from [8:-1]. I would add an if statement before the print statement:
start = file.find('Bad value')
end = file.find('SPAN', start)
if start != -1 and end != -1:
    print(file[start + 9:end])
str.find() returns -1 if the argument is not found in the string, for example:
print("abcd".find("e"))  # -1
You can just check the result before the return (the + 9 has to be applied after the check, otherwise start can never be -1):
start = file.find('Bad value')
end = file.find('SPAN', start)
if start == -1 or end == -1:
    return ''  # Or None
return file[start + 9:end]
Using re:
import re

def get_text(text):
    pattern = r'.+(Bad value)(.+)(SPAN).+'
    r = re.match(pattern, text)
    if r is not None and len(r.groups()) == 3:
        print(r.groups()[1])

lines = [
    'jkhjkl dbdqs ihnzqid Bad value okkkk SPAN sfsdf didjsfsdf',
    'ghghujh']
for line in lines:
    get_text(line)
Output:
okkkk
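One caveat worth noting: the leading and trailing .+ in that pattern require at least one character before 'Bad value' and after 'SPAN'. If your real lines can start or end right at the delimiters (an assumption about your data), a slightly more permissive variant would be:

pattern = r'.*Bad value(.+?)SPAN.*'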
I have a text file, which is structured as follows:
segmentA {
content Aa
content Ab
content Ac
....
}
segmentB {
content Ba
content Bb
content Bc
......
}
segmentC {
content Ca
content Cb
content Cc
......
}
I know how to search for certain strings through the whole text file, but how can I limit the search to a certain segment, for example "segmentC"? I need something like a regular expression that tells the script:
If the text begins with "segmentC {", search for a certain string until the first "}" appears.
Does someone have an idea?
Thanks in advance!
Not a RegEx solution... but it would do the job!
def SearchStuff(lines, sstr):
    i = 0
    while lines[i].strip() != '}':  # strip the newline so the closing brace is detected
        # Do stuff ..... for e.g.
        if 'Ca' in lines[i]:
            return lines[i]
        i += 1

def main(search_str):
    f = open('file.txt', 'r')
    lines = f.readlines()
    f.close()
    for line in lines:
        if search_str in line:
            index = lines.index(line)
            break
    lines = lines[index + 1:]
    print(SearchStuff(lines, search_str))

search_str = 'segmentC'  # set this string accordingly
main(search_str)
Depending on the complexity you need, solutions range from a simple state machine with line-based pattern searching to a full lexer.
Line-based search
The example below assumes that you are only looking for one segment and that segmentC { and the closing } each sit on their own line.
def parsesegment(fh):
    # Yields all lines inside "segmentC"
    state = "out"
    for line in fh:
        line = line.strip()  # in case there are whitespaces around
        if state == "out":
            if line.startswith("segmentC {"):
                state = "in"
                continue
        elif state == "in":
            if line.startswith("}"):
                state = "out"
                break  # only one segment is expected, so we can stop here
            # Work on the specific lines here
            yield line
with open(...) as fh:
    for line in parsesegment(fh):
        ...  # do something
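For a quick self-contained test of the generator (using io.StringIO to stand in for a real file handle, with the sample format from the question):

import io

sample = "segmentC {\ncontent Ca\ncontent Cb\n}\n"
for line in parsesegment(io.StringIO(sample)):
    print(line)  # prints 'content Ca', then 'content Cb'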
Simple Lexer
If you need more flexibility, you can design a simple lexer/parser pair. For example, the following code makes no assumption about how the syntax is spread across lines. It also ignores unknown patterns, which a typical lexer would not (normally it should raise a syntax error):
import re

class ParseSegment:
    # Dictionary of patterns per state
    # Tuples are (token name, pattern, state change command)
    _regexes = {
        "out": [
            ("open", re.compile(r"segment(?P<segment>\w+)\s+\{"), "in")
        ],
        "in": [
            ("close", re.compile(r"\}"), "out"),
            # Here an example of what you could want to match
            ("content", re.compile(r"content\s+(?P<content>\w+)"), None)
        ]
    }

    def lex(self, source, initpos=0):
        pos = initpos
        end = len(source)
        state = "out"
        while pos < end:
            for token_name, reg, state_chng in self._regexes[state]:
                # Try to get a match
                match = reg.match(source, pos)
                if match:
                    # Advance according to how much was matched
                    pos = match.end()
                    # yield a token if it has a name
                    if token_name is not None:
                        # Yield token name, the full matched part of source
                        # and the match grouped according to (?P<tag>) tags
                        yield (token_name, match.group(), match.groupdict())
                    # Switch state if requested
                    if state_chng is not None:
                        state = state_chng
                    break
            else:
                # No match, advance by one character
                # This is particular to that lexer, usually no match means
                # the input file has an error in the syntax and lexer should
                # yield an exception
                pos += 1

    def parse(self, source, initpos=0):
        # This is an example of use of the lexer with a parser
        # This converts the input file into a dictionary. Keys are segment
        # names, and values are lists of contents.
        segments = {}
        cur_segment = None
        # Use lexer to get tokens from source
        for token, fullmatch, groups in self.lex(source, initpos):
            # On open, create the list of content in segments
            if token == "open":
                cur_segment = groups["segment"]
                segments[cur_segment] = []
            # On content, ensure we know the segment and add content to the
            # list
            elif token == "content":
                if cur_segment is None:
                    raise RuntimeError("Content found outside a segment")
                segments[cur_segment].append(groups["content"])
            # On close, set the current segment to unknown
            elif token == "close":
                cur_segment = None
            # ignore unknown tokens, we could raise an error instead
        return segments

def main():
    with open("...", "r") as fh:
        data = fh.read()
    lexer = ParseSegment()
    segments = lexer.parse(data)
    print(segments)
    return 0

if __name__ == '__main__':
    main()
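Assuming the sample file from the question, parse would return a dictionary keyed by the (?P<segment>\w+) group, i.e. the part after the literal segment prefix:

{'A': ['Aa', 'Ab', 'Ac'], 'B': ['Ba', 'Bb', 'Bc'], 'C': ['Ca', 'Cb', 'Cc']}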
Full Lexer
Then, if you need even more flexibility and reusability, you will have to create a full parser. No need to reinvent the wheel: have a look at this list of language parsing modules, you will probably find the one that suits you.
I am working on a simple script that processes text files and logs. It has to take a list of regular-expression substitutions from the command line. For example:
./myscript.py --replace=s/foo/bar/ --replace=s#/etc/hosts#/etc/foo# --replace=#test\#email.com#root\#email.com#
Is there a simple way to pass a user-specified substitution pattern to the Python re library and have that pattern run against a string? Any elegant solution?
If possible, I'd like to avoid writing my own parser. Note that I'd like support for modifiers like /g or /i and so on.
Thanks!
You could use a space as a separator to exploit the shell's command-line parser:
$ myscript --replace=foo bar \
> --replace=/etc/hosts /etc/foo gi \
> --replace=test#email.com root#email.com
The g flag's behaviour (replace all occurrences) is the default in Python, so you need to add special support for it:
#!/usr/bin/env python
import re
from argparse import ArgumentParser
from functools import partial

all_re_flags = 'Lgimsux'  # regex flags
parser = ArgumentParser(usage='%(prog)s [--replace PATTERN REPL [FLAGS]]...')
parser.add_argument('-e', '--replace', action='append', nargs='*')
args = parser.parse_args()
print(args.replace)

subs = []  # replacement functions: input string -> result
for arg in args.replace:
    count = 1  # replace only the first occurrence if no `g` flag
    if len(arg) == 2:
        pattern, repl = arg
    elif len(arg) == 3:
        pattern, repl, flags = arg
        if ''.join(sorted(flags)) not in all_re_flags:
            parser.error('invalid flags %r for --replace option' % flags)
        if 'g' in flags:  # add support for `g` flag
            flags = flags.replace('g', '')
            count = 0  # replace all occurrences
        if flags:  # embed flags
            pattern = "(?%s)%s" % (flags, pattern)
    else:
        parser.error('wrong number of arguments for --replace option')
    subs.append(partial(re.compile(pattern).sub, repl, count=count))
You could use subs as follows:
input_string = 'a b a b'
for replace in subs:
    print(replace(input_string))
Example:
$ ./myscript -e 'a b' 'no flag' -e 'a B' 'with flags' ig
Output:
[['a b', 'no flag'], ['a B', 'with flags', 'ig']]
no flag a b
with flags with flags
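The "(?%s)%s" trick works because Python's re accepts inline flags at the start of a pattern; for example, (?i) compiles the rest of the pattern case-insensitively:

import re
print(re.sub('(?i)a b', 'X', 'a B a b'))  # 'X X'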
As mentioned in the comments, you can use re.compile(), but apparently that only works for matching and searching. Assuming you have only substitutions, you might do something like this:
import operator
import re
from functools import reduce

modifiers_map = {
    'i': re.IGNORECASE,
    # add more flag characters here as needed
}

for replace in replacements:
    # Look for a generalized separator in front of a command
    # (a backreference cannot be used inside a character class, so use a
    # lazy .+? followed by the separator instead)
    m = re.match(r'(s?)(.)(.+?)\2(.+?)\2([ig]*)$', replace)
    if not m:
        print('Invalid command: %s' % replace)
        continue
    command, separator, query, substitution, modifiers = m.groups()
    # Convert the modifiers to flags
    flags = reduce(operator.__or__, [modifiers_map[char] for char in modifiers], 0)
    # This needs a little bit of tweaking if you want to support
    # group matching (like \1, \2, etc.). This also assumes that
    # you're only getting 's' as a command
    my_text = re.sub(query, substitution, my_text, flags=flags)
Suffice it to say, this is a rough draft but I think it'd get you 90% of the way to what you're looking for.
Thanks for the answers. Given the complexity of the proposed solutions and the lack of a pre-baked parser in the standard library, I just went the extra mile and implemented my own parser.
It is not significantly more complex than the other proposals; see below. I just need to write tests now.
Thanks!
import re

class Replacer(object):
    def __init__(self, patterns=[]):
        self.patterns = []
        for pattern in patterns:
            self.AddPattern(pattern)

    def ParseFlags(self, flags):
        mapping = {
            'g': 0, 'i': re.I, 'l': re.L, 'm': re.M, 's': re.S, 'u': re.U, 'x': re.X,
            'd': re.DEBUG
        }
        result = 0
        for flag in flags:
            try:
                result |= mapping[flag]
            except KeyError:
                raise ValueError(
                    "Invalid flag: %s, known flags: %s" % (flag, mapping.keys()))
        return result

    def Apply(self, text):
        for regex, repl in self.patterns:
            text = regex.sub(repl, text)
        return text

    def AddPattern(self, pattern):
        separator = pattern[0]
        match = []
        for position, char in enumerate(pattern[1:], start=1):
            if char == separator:
                if pattern[position - 1] != '\\':
                    break
                match[-1] = separator
                continue
            match += char
        else:
            raise ValueError("Invalid pattern: could not find divisor.")
        replacement = []
        for position, char in enumerate(pattern[position + 1:], start=position + 1):
            if char == separator:
                if pattern[position - 1] != '\\':
                    break
                replacement[-1] = separator
                continue
            replacement += char
        else:
            raise ValueError(
                "Invalid pattern: could not find divisor '%s'." % separator)
        flags = self.ParseFlags(pattern[position + 1:])
        match = ''.join(match)
        replacement = ''.join(replacement)
        self.patterns.append((re.compile(match, flags=flags), replacement))
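For illustration, a hypothetical round trip through the class (the patterns here are made up):

replacer = Replacer(['/foo/bar/', '#/etc/hosts#/etc/foo#i'])
print(replacer.Apply('foo is defined in /ETC/HOSTS'))
# -> 'bar is defined in /etc/foo'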
If I have a keyword, how can I get the lexer, once it encounters that keyword, to grab the rest of the line and return it as a string? That is, once it reaches the end of the line, it should return everything on that line after the keyword.
Here is the line I'm looking at:
description here is the rest of my text to collect
Thus, when the lexer encounters description, I would like "here is the rest of my text to collect" returned as a string
I have the following defined, but it seems to be throwing an error:
states = (
    ('bcdescription', 'exclusive'),
)

def t_bcdescription(t):
    r'description '
    t.lexer.code_start = t.lexer.lexpos
    t.lexer.level = 1
    t.lexer.begin('bcdescription')

def t_bcdescription_close(t):
    r'\n'
    t.value = t.lexer.lexdata[t.lexer.code_start:t.lexer.lexpos+1]
    t.type = "BCDESCRIPTION"
    t.lexer.lineno += t.value.count('\n')
    t.lexer.begin('INITIAL')
    return t
This is part of the error being returned:
File "/Users/me/Coding/wm/wm_parser/ply/lex.py", line 393, in token
raise LexError("Illegal character '%s' at index %d" % (lexdata[lexpos],lexpos), lexdata[lexpos:])
ply.lex.LexError: Illegal character ' ' at index 40
Finally, if I wanted this functionality for more than one token, how could I accomplish that?
Thanks for your time
There is no big problem with your code; in fact, I just copied your code and ran it, and it works well:
import ply.lex as lex

states = (
    ('bcdescription', 'exclusive'),
)

tokens = ("BCDESCRIPTION",)

def t_bcdescription(t):
    r'\bdescription\b'
    t.lexer.code_start = t.lexer.lexpos
    t.lexer.level = 1
    t.lexer.begin('bcdescription')

def t_bcdescription_close(t):
    r'\n'
    t.value = t.lexer.lexdata[t.lexer.code_start:t.lexer.lexpos+1]
    t.type = "BCDESCRIPTION"
    t.lexer.lineno += t.value.count('\n')
    t.lexer.begin('INITIAL')
    return t

def t_bcdescription_content(t):
    r'[^\n]+'  # consume the rest of the line without emitting a token

lexer = lex.lex()

data = 'description here is the rest of my text to collect\n'
lexer.input(data)
while True:
    tok = lexer.token()
    if not tok:
        break
    print(tok)
and the result is:
LexToken(BCDESCRIPTION,' here is the rest of my text to collect\n',1,50)
So maybe you can check the other parts of your code.
And if you want this functionality for more than one token, you can simply match several keywords and, when one of those tokens appears, start capturing the rest of the line with the code above, as in the sketch below.
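A minimal sketch of the multi-keyword case (the keywords description, title and summary are hypothetical, and the state trick is the same as above):

import ply.lex as lex

states = (('grabline', 'exclusive'),)
tokens = ("KEYWORDLINE",)

t_ignore = ' \t'        # INITIAL state: skip whitespace between tokens
t_grabline_ignore = ''  # inside a keyword line, keep every character

def t_keyword(t):
    r'\b(?:description|title|summary)\b'  # non-capturing group keeps PLY's master regex intact
    t.lexer.code_start = t.lexer.lexpos
    t.lexer.begin('grabline')

def t_grabline_close(t):
    r'\n'
    t.value = t.lexer.lexdata[t.lexer.code_start:t.lexer.lexpos].strip()
    t.type = "KEYWORDLINE"
    t.lexer.begin('INITIAL')
    return t

def t_grabline_content(t):
    r'[^\n]+'  # swallow the rest of the line; the close rule extracts it

def t_error(t):
    t.lexer.skip(1)  # ignore anything outside a keyword line

def t_grabline_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('title my first line\ndescription my second line\n')
while True:
    tok = lexer.token()
    if not tok:
        break
    print(tok)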
Without further information, it is not obvious why you need to use a lexer/parser for this at all.
>>> x = 'description here is the rest of my text to collect'
>>> a, b = x.split(' ', 1)
>>> a
'description'
>>> b
'here is the rest of my text to collect'
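str.partition is a close alternative if the keyword might be missing, since unpacking its result never raises:

>>> keyword, _, rest = x.partition(' ')
>>> rest
'here is the rest of my text to collect'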