Capture ALL strings within a Python script with regex - python

This question was inspired by my failed attempts after trying to adapt this answer: RegEx: Grabbing values between quotation marks
Consider the following Python script (t.py):
print("This is also an NL test")
variable = "!\n"
print('And this has an escaped quote "don\'t" in it ', variable,
"This has a single quote ' but doesn\'t end the quote as it" + \
" started with double quotes")
if "Foo Bar" != '''Another Value''':
"""
This is just nonsense
"""
aux = '?'
print("Did I \"failed\"?", f"{aux}")
I want to capture all strings in it, as:
This is also an NL test
!\n
And this has an escaped quote "don\'t" in it
This has a single quote ' but doesn\'t end the quote as it
started with double quotes
Foo Bar
Another Value
This is just nonsense
?
Did I \"failed\"?
{aux}
I wrote another Python script using re module and, from my attempts into regex, the one which finds most of them is:
import re
pattern = re.compile(r"""(?<=(["']\b))(?:(?=(\\?))\2.)*?(?=\1)""")
with open('t.py', 'r') as f:
msg = f.read()
x = pattern.finditer(msg, re.DOTALL)
for i, s in enumerate(x):
print(f'[{i}]',s.group(0))
with the following result:
[0] And this has an escaped quote "don\'t" in it
[1] This has a single quote ' but doesn\'t end the quote as it started with double quotes
[2] Foo Bar
[3] Another Value
[4] Did I \"failed\"?
To improve my failures, I couldn't also fully replicate what I can found with regex101.com:
I'm using Python 3.6.9, by the way, and I'm asking for more insights into regex to crack this one.

Because you want to match ''' or """ or ' or " as the delimiter, put all of that into the first group:
('''|"""|["'])
Don't put \b after it, because then it won't match strings when those strings start with something other than a word character.
Because you want to make sure that the final delimiter isn't treated as a starting delimiter when the engine starts the next iteration, you'll need to fully match it (not just lookahead for it).
The middle part to match anything but the delimiter can be:
((?:\\.|.)*?)
Put it all together:
('''|"""|["'])((?:\\.|.)*?)\1
and the result you want will be in the second capture group:
pattern = re.compile(r"""(?s)('''|\"""|["'])((?:\\.|.)*?)\1""")
with open('t.py', 'r') as f:
msg = f.read()
x = pattern.finditer(msg)
for i, s in enumerate(x):
print(f'[{i}]',s.group(2))
https://regex101.com/r/dvw0Bc/1

Related

Replacing unwanted characters with regex

I have this string which is on one line:
https[:]//sometest[.]com,http[:]//differentt,est.net,https://lololo.com
Note that I purposely placed , into the second URL. I am trying to replace the , where the http(s) meets. So far I tried this:
pattern_src = r"http(.*)"
for i, line_src in enumerate(open("/Users/test/Documents/tools/dump/email.txt")):
for match in re.finditer(pattern_src, line_src):
mal_url = (match.group())
source_ = mal_url
string = source_
for ch in ["[" , "]"]:
for c in [","]:
string = string.replace(c,"\n")
string = string.replace(ch,"")
with open("/Users/test/Documents/tools/dump/urls.txt", 'w') as file:
file.write(string)
print(string)
But you can clearly see it will replace all the , in the string. So my question is, how would I go around replacing just the , before the http and have every http URL on a new line?
>>> s = 'https[:]//sometest[.]com,http[:]//differentt,est.net,https://lololo.com'
>>> print(re.sub(r',(?=http)', '\n', s))
https[:]//sometest[.]com
http[:]//differentt,est.net
https://lololo.com
,(?=http) will match , only if it is followed by http. Here (?=http) is a positive lookahead assertion, which allows to check for conditions without consuming those characters.
See Reference - What does this regex mean? for details on lookarounds or my book: https://learnbyexample.github.io/py_regular_expressions/lookarounds.html

A way to remove all occurrences of words within brackets in a string?

I'm trying to find a way to delete all mentions of references in a text file.
I haven't tried much, as I am new to Python but thought that this is something that Python could do.
def remove_bracketed_words(text_from_file: string) -> string:
"""Remove all occurrences of words with brackets surrounding them,
including the brackets.
>>> remove_bracketed_words("nonsense (nonsense, 2015)")
"nonsense "
>>> remove_bracketed_words("qwerty (qwerty) dkjah (Smith, 2018)")
"qwerty dkjah "
"""
with open('random_text.txt') as file:
wholefile = f.read()
for '(' in
I have no idea where to go from here or if what I've done is right. Any suggestions would be helpful!
You'll have an easier time with a text editing program that handles regular expressions, like Notepad++, than learning Python for this one task (reading in a file, correcting fundamental errors like for '(' in..., etc.). You can even use tools available online for this, such as RegExr (a regular expression tester). In RegExr, write an appropriate expression into the "expression" field and paste your text into the "text" field. Then, in the "tools" area below the text, choose the "replace" option and remove the placeholder expression. Your cleaned-up text will appear there.
You're looking for a space, then a literal opening parenthesis, then some characters, then a comma, then a year (let's just call that 3 or 4 digits), then a literal closing parenthesis, so I'd suggest the following expression:
\(.*?, \d{3,4}\)
This will preserve non-citation parenthesized text and remove the leading space before a citation.
Try re
>>> import re
>>> re.sub(r'\(.*?\)', '', 'nonsense (nonsense, 2015)')
'nonsense '
>>> re.sub(r'\(.*?\)', '', 'qwerty (qwerty) dkjah (Smith, 2018)')
'qwerty dkjah '
import re
def remove_bracketed_words(text_from_file: string) -> string:
"""Remove all occurrences of words with brackets surrounding them,
including the brackets.
>>> remove_bracketed_words("nonsense (nonsense, 2015)")
"nonsense "
>>> remove_bracketed_words("qwerty (qwerty) dkjah (Smith, 2018)")
"qwerty dkjah "
"""
with open('random_text.txt', 'r') as file:
wholefile = file.read()
# Be care for use 'w', it will delete raw data.
whth open('random_text.txt', 'w') as file:
file.write(re.sub(r'\(.*?\)', '', wholefile))

function to get rid of delimeters in python

I have a function where the user passes in a file and a String and the code should get rid of the specificed delimeters. I am having trouble finishing the part where I loop through my code and get rid of each of the replacements. I will post the code down below
def forReader(filename):
try:
# Opens up the file
file = open(filename , "r")
# Reads the lines in the file
read = file.readlines()
# closes the files
file.close()
# loops through the lines in the file
for sentence in read:
# will split each element by a spaace
line = sentence.split()
replacements = (',', '-', '!', '?' '(' ')' '<' ' = ' ';')
# will loop through the space delimited line and get rid of
# of the replacements
for sentences in line:
# Exception thrown if File does not exist
except FileExistsError:
print('File is not created yet')
forReader("mo.txt")
mo.txt
for ( int i;
After running the filemo.txt I would like for the output to look like this
for int i
Here's a way to do this using regex. First, we create a pattern consisting of all the delimiter characters, being careful to escape them, since several of those characters have special meaning in a regex. Then we can use re.sub to replace each delimiter with an empty string. This process can leave us with two or more adjacent spaces, which we then need to replace with a single space.
The Python re module allows us to compile patterns that are used frequently. Theoretically, this can make them more efficient, but it's a good idea to test such patterns against real data to see if it does actually help. :)
import re
delimiters = ',-!?()<=;'
# Make a pattern consisting of all the delimiters
pat = re.compile('|'.join(re.escape(c) for c in delimiters))
s = 'for ( int i;'
# Remove the delimiters
z = pat.sub('', s)
#Clean up any runs of 2 or more spaces
z = re.sub(r'\s{2,}', ' ', z)
print(z)
output
for int i

Python: re.compile and re.sub

Question part 1
I got this file f1:
<something #37>
<name>George Washington</name>
<a23c>Joe Taylor</a23c>
</something #37>
and I want to re.compile it that it looks like this f1: (with spaces)
George Washington Joe Taylor
I tried this code but it kind of deletes everything:
import re
file = open('f1.txt')
fixed = open('fnew.txt', 'w')
text = file.read()
match = re.compile('<.*>')
for unwanted in text:
fixed_doc = match.sub(r' ', text)
fixed.write(fixed_doc)
My guess is the re.compile line but I'm not quite sure what to do with it. I'm not supposed to use 3rd party extensions. Any ideas?
Question part 2
I had a different question about comparing 2 files I got this code from Alfe:
from collections import Counter
def test():
with open('f1.txt') as f:
contentsI = f.read()
with open('f2.txt') as f:
contentsO = f.read()
tokensI = Counter(value for value in contentsI.split()
if value not in [])
tokensO = Counter(value for value in contentsO.split()
if value not in [])
return not (tokensI - tokensO) and not (set(tokensO) - set(tokensI))
Is it possible to implement the re.compile and re.sub in the 'if value not in []' section?
I will explain what happens with your code:
import re
file = open('f1.txt')
fixed = open('fnew.txt','w')
text = file.read()
match = re.compile('<.*>')
for unwanted in text:
fixed_doc = match.sub(r' ',text)
fixed.write(fixed_doc)
The instruction text = file.read() creates an object text of type string named text.
Note that I use bold characters text to express an OBJECT, and text to express the name == IDENTIFIER of this object.
As a consequence of the instruction for unwanted in text:, the identifier unwanted is successively assigned to each character referenced by the text object.
Besides, re.compile('<.*>') creates an object of type RegexObject (which I personnaly call compiled) regex or simply regex , <.*> being only the regex pattern).
You assign this compiled regex object to the identifier match: it's a very bad practice, because match is already the name of a method of regex objects in general, and of the one you created in particular, so then you could write match.match without error.
match is also the name of a function of the re module.
This use of this name for your particular need is very confusing. You must avoid that.
There's the same flaw with the use of file as a name for the file-handler of file f1. file is already an identifier used in the language, you must avoid it.
Well. Now this bad-named match object is defined, the instruction fixed_doc = match.sub(r' ',text) replaces all the occurences found by the regex match in text with the replacement r' '.
Note that it's completely superfluous to write r' ' instead of just ' ' because there's absolutely nothing in ' ' that needs to be escaped. It's a fad of some anxious people to write raw strings every time they have to write a string in a regex problem.
Because of its pattern <.+> in which the dot symbol means "greedily eat every character situated between a < and a > except if it is a newline character" , the occurences catched in the text by match are each line until the last > in it.
As the name unwanted doesn't appear in this instruction, it is the same operation that is done for each character of the text, one after the other. That is to say: nothing interesting.
To analyze the execution of a programm, you should put some printing instructions in your code, allowing to understand what happens. For example, if you do print repr(fixed_doc), you'll see the repeated printing of this: ' \n \n \n '. As I said: nothing interesting.
There's one more default in your code: you open files, but you don't shut them. It is mandatory to shut files, otherwise it could happen some weird phenomenons, that I personnally observed in some of my codes before I realized this need. Some people pretend it isn't mandatory, but it's false.
By the way, the better manner to open and shut files is to use the with statement. It does all the work without you have to worry about.
.
So , now I can propose you a code for your first problem:
import re
def ripl(mat=None,li = []):
if mat==None:
li[:] = []
return
if mat.group(1):
li.append(mat.span(2))
return ''
elif mat.span() in li:
return ''
else:
return mat.group()
r = re.compile('</[^>]+>'
'|'
'<([^>]+)>(?=.*?(</\\1>))',
re.DOTALL)
text = '''<something #37>
<name>George <wxc>Washington</name>
<a23c>Joe </zazaza>Taylor</a23c>
</something #37>'''
print '1------------------------------------1'
print text
print '2------------------------------------2'
ripl()
print r.sub(ripl,text)
print '3------------------------------------3'
result
1------------------------------------1
<something #37>
<name>George <wxc>Washington</name>
<a23c>Joe </zazaza>Taylor</a23c>
</something #37>
2------------------------------------2
George <wxc>Washington
Joe </zazaza>Taylor
3------------------------------------3
The principle is as follows:
When the regex detects a tag,
- if it's an end tag, it matches
- if it's a start tag, it matches only if there is a corresponding end tag somewhere further in the text
For each match, the method sub() of the regex r calls the function ripl() to perform the replacement.
If the match is with a start tag (which is necessary followed somewhere in the text by its corresponding end tag, by construction of the regex), then ripl() returns ''.
If the match is with an end tag, ripl() returns '' only if this end tag has previously in the text been detected has being the corresponding end tag of a previous start tag. This is done possible by recording in a list li the span of each corresponding end tag's span each time a start tag is detected and matching.
The recording list li is defined as a default argument in order that it's always the same list that is used at each call of the function ripl() (please, refer to the functionning of default argument to undertsand, because it's subtle).
As a consequence of the definition of li as a parameter receiving a default argument, the list object li would retain all the spans recorded when analyzing several text in case several texts would be analyzed successively. In order to avoid the list li to retain spans of past text matches, it is necessary to make the list empty. I wrote the function so that the first parameter is defined with a default argument None: that allows to call ripl() without argument before any use of it in a regex's sub() method.
Then, one must think to write ripl() before any use of it.
.
If you want to remove the newlines of the text in order to obtain the precise result you showed in your question, the code must be modified to:
import re
def ripl(mat=None,li = []):
if mat==None:
li[:] = []
return
if mat.group(1):
return ''
elif mat.group(2):
li.append(mat.span(3))
return ''
elif mat.span() in li:
return ''
else:
return mat.group()
r = re.compile('( *\n *)'
'|'
'</[^>]+>'
'|'
'<([^>]+)>(?=.*?(</\\2>)) *',
re.DOTALL)
text = '''<something #37>
<name>George <wxc>Washington</name>
<a23c>Joe </zazaza>Taylor</a23c>
</something #37>'''
print '1------------------------------------1'
print text
print '2------------------------------------2'
ripl()
print r.sub(ripl,text)
print '3------------------------------------3'
result
1------------------------------------1
<something #37>
<name>George <wxc>Washington</name>
<a23c>Joe </zazaza>Taylor</a23c>
</something #37>
2------------------------------------2
George <wxc>WashingtonJoe </zazaza>Taylor
3------------------------------------3
You can use Beautiful Soup to do this easily:
from bs4 import BeautifulSoup
file = open('f1.txt')
fixed = open('fnew.txt','w')
#now for some soup
soup = BeautifulSoup(file)
fixed.write(str(soup.get_text()).replace('\n',' '))
The output of the above line will be:
George Washington Joe Taylor
(Atleast this works with the sample you gave me)
Sorry I don't understand part 2, good luck!
Don't need re.compile
import re
clean_string = ''
with open('f1.txt') as f1:
for line in f1:
match = re.search('.+>(.+)<.+', line)
if match:
clean_string += (match.group(1))
clean_string += ' '
print(clean_string) # 'George Washington Joe Taylor'
Figured the first part out it was the missing '?'
match = re.compile('<.*?>')
does the trick.
Anyway still not sure about the second questions. :/
For part 1 try the below code snippet. However consider using a library like beautifulsoup as suggested by Moe Jan
import re
import os
def main():
f = open('sample_file.txt')
fixed = open('fnew.txt','w')
#pattern = re.compile(r'(?P<start_tag>\<.+?\>)(?P<content>.*?)(?P<end_tag>\</.+?\>)')
pattern = re.compile(r'(?P<start><.+?>)(?P<content>.*?)(</.+?>)')
output_text = []
for text in f:
match = pattern.match(text)
if match is not None:
output_text.append(match.group('content'))
fixed_content = ' '.join(output_text)
fixed.write(fixed_content)
f.close()
fixed.close()
if __name__ == '__main__':
main()
For part 2:
I am not completely clear with what you are asking - however my guess is that you want to do something like if re.sub(value) not in []. However, note that you need to call re.compile only once prior to initializing the Counter instance. It would be better if you clarify the second part of your question.
Actually, I would recommend you to use the built-in Python diff module to find difference between two files. Using this way better than using your own diff algorithm, since the diff logic is well tested and widely used and is not vulnerable to logical or programmatic errors resulting from presence of spurious newlines, tab and space characters.

Python complex regex replace

I'm trying to do a simple VB6 to c translator to help me port an open source game to the c language.
I want to be able to get "NpcList[NpcIndex]" from "With Npclist[NpcIndex]" using ragex and to replace it everywhere it has to be replaced. ("With" is used as a macro in VB6 that adds Npclist[NpcIndex] when ever it needs to until it founds "End With")
Example:
With Npclist[NpcIndex]
.goTo(245) <-- it should be replaced with Npclist[NpcIndex].goTo(245)
End With
Is it possible to use regex to do the job?
I've tried using a function to perfom another regex replace between the "With" and the "End With" but I can't know the text the "With" is replacing (Npclist[NpcIndex]).
Thanks in advance
I personally wouldn't trust any single-regex solution to get it right on the first time nor feel like debugging it. Instead, I would parse the code line-to-line and cache any With expression to use it to replace any . directly preceded by whitespace or by any type of brackets (add use-cases as needed):
(?<=[\s[({])\. - positive lookbehind for any character from the set + escaped literal dot
(?:(?<=[\s[({])|^)\. - use this non-capturing alternatives list if to-be-replaced . can occur on the beginning of line
import re
def convert_vb_to_c(vb_code_lines):
c_code = []
current_with = ""
for line in vb_code_lines:
if re.search(r'^\s*With', line) is not None:
current_with = line[5:] + "."
continue
elif re.search(r'^\s*End With', line) is not None:
current_with = "{error_outside_with_replacement}"
continue
line = re.sub(r'(?<=[\s[({])\.', current_with, line)
c_code.append(line)
return "\n".join(c_code)
example = """
With Npclist[NpcIndex]
.goTo(245)
End With
With hatla
.matla.tatla[.matla.other] = .matla.other2
dont.mind.me(.do.mind.me)
.next()
End With
"""
# use file_object.readlines() in real life
print(convert_vb_to_c(example.split("\n")))
You can pass a function to the sub method:
# just to give the idea of the regex
regex = re.compile(r'''With (.+)
(the-regex-for-the-VB-expression)+?
End With''')
def repl(match):
beginning = match.group(1) # NpcList[NpcIndex] in your example
return ''.join(beginning + line for line in match.group(2).splitlines())
re.sub(regex, repl, the_string)
In repl you can obtain all the information about the matching from the match object, build whichever string you want and return it. The matched string will be replaced by the string you return.
Note that you must be really careful to write the regex above. In particular using (.+) as I did matches all the line up to the newline excluded, which or may not be what you want(but I don't know VB and I have no idea which regex could go there instead to catch only what you want.
The same goes for the (the-regex-forthe-VB-expression)+. I have no idea what code could be in those lines, hence I leave to you the detail of implementing it. Maybe taking all the line can be okay, but I wouldn't trust something this simple(probably expressions can span multiple lines, right?).
Also doing all in one big regular expression is, in general, error prone and slow.
I'd strongly consider regexes only to find With and End With and use something else to do the replacements.
This may do what you need in Python 2.7. I'm assuming you want to strip out the With and End With, right? You don't need those in C.
>>> import re
>>> search_text = """
... With Np1clist[Npc1Index]
... .comeFrom(543)
... End With
...
... With Npc2list[Npc2Index]
... .goTo(245)
... End With"""
>>>
>>> def f(m):
... return '{0}{1}({2})'.format(m.group(1), m.group(2), m.group(3))
...
>>> regex = r'With\s+([^\s]*)\s*(\.[^(]+)\(([^)]+)\)[^\n]*\nEnd With'
>>> print re.sub(regex, f, search_text)
Np1clist[Npc1Index].comeFrom(543)
Npc2list[Npc2Index].goTo(245)

Categories

Resources