Regular expressions in a Python find-and-replace script? (Update)

I'm new to Python scripting, so please forgive me in advance if the answer to this question seems inherently obvious.
I'm trying to put together a large-scale find-and-replace script using Python. I'm using code similar to the following:
import sys

infile = sys.argv[1]
charenc = sys.argv[2]
outFile = infile + '.output'
findreplace = [
    ('term1', 'term2'),
]
inF = open(infile, 'rb')
s = unicode(inF.read(), charenc)   # decode using the encoding given on the command line
inF.close()
for couple in findreplace:
    outtext = s.replace(couple[0], couple[1])
    s = outtext
outF = open(outFile, 'wb')
outF.write(outtext.encode('utf-8'))
outF.close()
How would I go about having the script do a find and replace for regular expressions?
Specifically, I want it to find some information (metadata) specified at the top of a text file. Eg:
Title: This is the title
Author: This is the author
Date: This is the date
and convert it into LaTeX format. Eg:
\title{This is the title}
\author{This is the author}
\date{This is the date}
Maybe I'm tackling this the wrong way. If there's a better way than regular expressions please let me know!
Thanks!
Update: Thanks for posting some example code in your answers! I can get it to work as long as I replace the findreplace action entirely, but I can't get both to work together. The problem now is that I can't integrate it properly into the code I've got. How would I go about having the script perform multiple actions on 'outtext' in the snippet below?
for couple in findreplace:
    outtext = s.replace(couple[0], couple[1])
    s = outtext

>>> import re
>>> s = """Title: This is the title
... Author: This is the author
... Date: This is the date"""
>>> p = re.compile(r'^(\w+):\s*(.+)$', re.M)
>>> print p.sub(r'\\\1{\2}', s)
\Title{This is the title}
\Author{This is the author}
\Date{This is the date}
To change the case, use a function as the replacement argument:
def repl_cb(m):
    return "\\%s{%s}" % (m.group(1).lower(), m.group(2))

p = re.compile(r'^(\w+):\s*(.+)$', re.M)
print p.sub(repl_cb, s)
\title{This is the title}
\author{This is the author}
\date{This is the date}
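To fold this back into the original script (the update above asks how to run several actions on outtext), a minimal sketch, assuming s and the findreplace list from the question, is to chain the passes on the same string:

import re

meta_re = re.compile(r'^(\w+):\s*(.+)$', re.M)

def repl_cb(m):
    return "\\%s{%s}" % (m.group(1).lower(), m.group(2))

outtext = s
for old, new in findreplace:
    outtext = outtext.replace(old, new)    # the original literal replacements
outtext = meta_re.sub(repl_cb, outtext)    # then the metadata-to-LaTeX pass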

See re.sub()

The regular expression you want would probably be along the lines of this one:
^([^:]+): (.*)
and the replacement expression would be
\\\1{\2}
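For instance, a quick usage sketch (assuming the metadata lines are in a string s; note the heading keeps its original capitalization, so you get \Title rather than the lower-cased \title LaTeX expects):

import re

s = "Title: This is the title"
print(re.sub(r'^([^:]+): (.*)', r'\\\1{\2}', s, flags=re.M))
# \Title{This is the title}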

>>> import re
>>> m = 'title', 'author', 'date'
>>> s = """Title: This is the title
Author: This is the author
Date: This is the date"""
>>> for i in m:
s = re.compile(i+': (.*)', re.I).sub(r'\\' + i + r'{\1}', s)
>>> print(s)
\title{This is the title}
\author{This is the author}
\date{This is the date}


Regex match but re.match() doesn't return anything

I am trying to parse a .md file using a specific pattern with regex in Python. The file is written like this:
## title
## title 2
### first paragraph
[lines]
...
### second
[lines]
...
## third
[lines]
...
## last
[lines]
...
So I used this regular expression to match it:
##(.*)\n+##(.*)\n+###((\n|.)*)###((\n|.)*)##((\n|.)*)##((\n|.)*)
when I am trying it online, the regex match:
https://regex101.com/r/8iYBrp/1
But when I use it in Python it doesn't work, and I can't understand why.
Here is my code:
import re
str = (
r'##(.*)\n+##(.*)\n+###((\n|.)*)###((\n|.)*)##((\n|.)*)##((\n|.)*)')
file_regexp = re.compile(str)
## Retrieve the content of the file (I am sure this part
## returns what I want)
m = file_regexp.match(fileContent)
# m is always None
I have already tried adding flags such as re.DOTALL, re.I, re.M and re.S, but when I do, the script becomes really slow and my computer starts making strange noises.
Does anyone know what I did wrong? Any help is appreciated.
First of all, you assign your regex pattern to a variable named str, which overrides the built-in str; in the example below it is renamed featureStr. You also can't easily get at the parts you matched, because all of your groups are anonymous; you can assign names to the groups using ?P<name> and access them later via groupdict(). Here is a working example (using re.S and non-greedy .*? in place of the (\n|.)* groups):
import re
featureStr = (
    r'##(?P<title>.*?)\n+##(?P<title_2>.*?)\n+###(?P<first>.*?)###(?P<second>.*?)##(?P<third>.*?)##(.*)')
file_regexp = re.compile(featureStr, re.S)
fileContent = open("markdown.md").read()
m = file_regexp.match(fileContent)
print(m.groupdict())
Which prints:
{'title': ' title', 'title_2': ' title 2', 'first': ' first paragraph\n[lines]\n...\n\n', 'second': ' second\n[lines]\n...\n\n', 'third': ' third \n[lines]\n...\n\n'}
I hope this helps you. Let me know if there are any questions left. Have a nice day!
Correct me if I'm wrong, but if you're interested only in the [lines] content, you could just skip the lines starting with '#'. This could be solved with something like:
with open("/path/to/your/file", 'r') as in_file:
    for line in in_file:
        if line.startswith('#'):
            continue
        else:
            pass  # do something with the line here
Why do you need a regex?
Use re.search instead of re.match.
str = (r'##(.*?)\n##(.*?)\n+###(.*?)\n+###(.*?)\n+##(.*?)\n+##(.*?)')
file_regexp = re.compile(str, re.S)
fileContent = '''
## title
## title 2
### first paragraph
[lines]
...
### second
[lines]
...
## third
[lines]
...
## last
[lines]
...
'''
m = file_regexp.search(fileContent)
print(m.groups())
Output:
(' title', ' title 2', ' first paragraph\n[lines]\n...', ' second\n[lines]\n...', ' third \n[lines]\n...', '')
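The difference matters here because re.match only matches at the very start of the string, while re.search scans forward; fileContent above begins with a newline, so match never gets past position 0. A tiny illustration:

import re

print(re.match(r'##', '\n## title'))   # None: the string does not start with '##'
print(re.search(r'##', '\n## title'))  # a match object spanning positions 1-3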

Stripping URLs in regex

I am trying to isolate the domain name for a database full of URLs, but I'm running into some regex problems.
Starting example:
examples = ['www2.chccs.k12.nc.us', 'wwwsco.com', 'www-152.aig.com', 'www.google.com']
Desired goal:
['chccs.k12.nc.us', 'sco.com', 'aig.com', 'google.com']
I've been trying a two stage process where I add in a "." before "www", then replace the "www.", but that doesn't quite lead to the results I'd like.
Any regex wizards out there able to help?
Thanks in advance!
import re

def extract(domain):
    return re.sub(r'^www[\d-]*\.?', '', domain)

examples = ['www2.chccs.k12.nc.us', 'wwwsco.com', 'www-152.aig.com', 'www.google.com']
result = [extract(d) for d in examples]
assert result == ['chccs.k12.nc.us', 'sco.com', 'aig.com', 'google.com'], result
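If the database actually holds full URLs (scheme, path and all) rather than bare hostnames, one option is to pull out the hostname first and then apply the same stripping; a sketch assuming Python 3's urllib.parse:

import re
from urllib.parse import urlparse

def extract_domain(url):
    # urlparse only fills in netloc when a scheme such as https:// is present,
    # so fall back to the raw value for bare hostnames
    host = urlparse(url).netloc or url
    return re.sub(r'^www[\d-]*\.?', '', host)

print(extract_domain('https://www2.chccs.k12.nc.us/some/page'))  # chccs.k12.nc.us
print(extract_domain('wwwsco.com'))                              # sco.com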

Python usage of regular expressions

How can I extract string1#string2 from the line below?
<![CDATA[<html><body><p style="margin:0;">string1#string2</p></body></html>]]>
The # character and the structure of the line is always the same.
Simple, buggy, not reliable:
line.replace('<![CDATA[<html><body><p style="margin:0;">', "").replace('</p></body></html>]]>', "").split("#")
re.search(r'[^>]+#[^<]+',s).group()
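For example, assuming s is the full CDATA line from the question:

import re

s = '<![CDATA[<html><body><p style="margin:0;">string1#string2</p></body></html>]]>'
print(re.search(r'[^>]+#[^<]+', s).group())  # string1#string2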
I would like to refer you to the classic answer about parsing HTML with regular expressions: in short, a regex is not the appropriate tool for this job.
Also, have you tried an XML parser instead?
EDIT:
import xml.etree.ElementTree as ET

a = "<html><body><p style=\"margin:0;\">string1#string2</p></body></html>"
root = ET.fromstring(a)
c = root[0][0].text
# c == 'string1#string2'
d = c.replace('#', ' ').split()
# d == ['string1', 'string2']
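Note that the line in the question is wrapped in <![CDATA[ ... ]]>, which ElementTree will not parse on its own; a small sketch of stripping that wrapper first, assuming it is always exactly this prefix and suffix:

import xml.etree.ElementTree as ET

line = '<![CDATA[<html><body><p style="margin:0;">string1#string2</p></body></html>]]>'
inner = line[len('<![CDATA['):-len(']]>')]  # drop the CDATA wrapper
print(ET.fromstring(inner)[0][0].text)      # string1#string2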
If you wish to use a regex:
>>> re.search(r"<p.*?>(.+?)</p>", txt).group(1)
'string1#string2'

pattern between two strings in python

I have a string in following format:
In-product feedback from Vince (aaa#bbb.com)...In-product feedback from Corey Zimmerman Anderson (ccc#ddd.com)...In-product feedback from Andrea Ibarra (eee#fff.com)
I need to extract the email IDs from the above string. The "In-product feedback from " part will always be static and the email IDs will always be inside parentheses, but the name in between will vary.
Since the text you have is pretty much static and the names will likely not contain parentheses, you can use a non-regex approach:
s = "In-product feedback from Vince (aaa#bbb.com)"
s_clean = s.rsplit('(')[1].strip(')')
print(s_clean)
# 'aaa#bbb.com'
Or use regex anyway:
import re
s = "In-product feedback from Vince (aaa#bbb.com)"
s_clean = re.findall(r'\((.*?)\)', s)[0]
print(s_clean)
# 'aaa#bbb.com'
And with multiple occurrences, you'll get a list of all the emails:
s = "In-product feedback from Vince (aaa#bbb.com)...In-product feedback from Corey Zimmerman Anderson (ccc#ddd.com)...In-product feedback from Andrea Ibarra (eee#fff.com)"
s_clean = re.findall(r'\((.*?)\)', s)
print(s_clean)
# ['aaa#bbb.com', 'ccc#ddd.com', 'eee#fff.com']
Use the following code:
import re
r = re.findall(r"\(([^)]+)\)", s)
print(r)
where s is your string.
Try this:
import re

s = 'In-product feedback from Vince (aaa#bbb.com)'
regex = r'(In-product feedback from) ([a-zA-Z ]+) \(([a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)\)'
phrase = re.match(regex, s)
print(phrase.group(1))  # In-product feedback from
print(phrase.group(2))  # Vince
print(phrase.group(3))  # aaa#bbb.com

Multiple distinct replaces using RegEx

I am trying to write some Python code that will replace some unwanted string using RegEx. The code I have written has been taken from another question on this site.
I have a text:
text_1=u'I\u2019m \u2018winning\u2019, I\u2019ve enjoyed none of it. That\u2019s why I\u2019m withdrawing from the market,\u201d wrote Arment.'
I want to remove all of the \u2019m, \u2019s, \u2019ve sequences, and so on.
The code that I've written is given below:
rep={"\n":" ","\n\n":" ","\n\n\n":" ","\n\n\n\n":" ",u"\u201c":"", u"\u201d":"", u"\u2019[a-z]":"", u"\u2013":"", u"\u2018":""}
rep = dict((re.escape(k), v) for k, v in rep.iteritems())
pattern = re.compile("|".join(rep.keys()))
text = pattern.sub(lambda m: rep[re.escape(m.group(0))], text_1)
The code works perfectly for:
u"\u201c":"", u"\u201d":"", u"\u2013":"", and u"\u2018":""
However, it doesn't work for u"\u2019[a-z]":"" because re.escape turns the [a-z] part into \[a\-z\], which no longer matches anything.
The output I am looking for is:
text_1=u'I winning, I enjoyed none of it. That why I withdrawing from the market,wrote Arment.'
How do I achieve this?
The information about the newlines completely changes the answer. For this, I think building the expression using a loop is actually less legible than just using better formatting in the pattern itself.
import re

replacements = {'newlines': ' ',
                'deletions': ''}

pattern = re.compile(u'(?P<newlines>\n+)|'
                     u'(?P<deletions>\u201c|\u201d|\u2019[a-z]?|\u2013|\u2018)')

def lookup(match):
    return replacements[match.lastgroup]

text = pattern.sub(lookup, text_1)
The problem here is actually the escaping; this code does what you want more directly:
remove = (u"\u201c", u"\u201d", u"\u2019[a-z]?", u"\u2013", u"\u2018")
pattern = re.compile("|".join(remove))
text = pattern.sub("", text_1)
I've added the ? to the u2019 match, as I suppose that's what you want as well given your test string.
For completeness, I think I should also link to the Unidecode package which may actually be more closely what you're trying to achieve by removing these characters.
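For example, a quick sketch, assuming the third-party unidecode package is installed (it transliterates to ASCII look-alikes rather than deleting, so the curly quotes come back as plain ASCII quotes):

from unidecode import unidecode

print(unidecode(text_1))
# I'm 'winning', I've enjoyed none of it. That's why I'm withdrawing from the market," wrote Arment.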
The simplest way is this regex:
X = re.compile(r'(\\)(.*?) ')
text = re.sub(X, ' ', text_1)
