Python regular expression extract sub-strings from template - python

In Python, I'm looking for a way to extract regex groups given a string and a matching template pattern, for example:
file_path = "/101-001-015_fg01/4312x2156/101-001-015_fg01.0001.exr"
file_template = "/{CODE}_{ELEMENT}/{WIDTH}x{HEIGHT}/{CODE}_{ELEMENT}.{FRAME}.exr"
The output I'm looking for is the following:
{
"CODE": "101-001-015",
"ELEMENT": "fg01",
"WIDTH": "4312",
"HEIGHT": "2156",
"FRAME": "0001"
}
My initial approach was to format my template and find any and all matches, but it's not ideal:
import re
re_format = file_template.format(CODE='(.*)', ELEMENT='(.*)', WIDTH='(.*)', HEIGHT='(.*)', FRAME='(.*)')
search = re.compile(re_format)
result = search.findall(file_path)
# result: [('101-001-015', 'fg01', '4312', '2156', '101-001-015', 'fg01.000', '')]
All template keys can contain various characters and be of various lengths, so I'm looking for a good matching algorithm. Any ideas on whether and how this could be done with Python's re or an alternative library?
Thanks!

I would go for named capturing groups and extract the desired results with the groupdict() function:
import re
file_path = "/101-001-015_fg01/4312x2156/101-001-015_fg01.0001.exr"
rx = r"\/(?P<CODE>.+)_(?P<ELEMENT>.+)\/(?P<WIDTH>.+)x(?P<HEIGHT>.+)\/.+\.(?P<FRAME>\w+)\.exr"
m = re.match(rx, file_path)
result = m.groupdict()
# {'CODE': '101-001-015', 'ELEMENT': 'fg01', 'WIDTH': '4312', 'HEIGHT': '2156', 'FRAME': '0001'}

Similar to what Simon did, I'd also use named capturing groups:
import re
regex = r"(?P<CODE>[0-9-]+)_(?P<ELEMENT>[0-9a-z]+)\/(?P<WIDTH>[0-9]+)x(?P<HEIGHT>[0-9]+)\/\1_\2\.(?P<FRAME>[0-9]+)\.exr"
test_str = "101-001-015_fg01/4312x2156/101-001-015_fg01.0001.exr"
matches = re.match(regex, test_str)
print(matches.groupdict())
DEMO: https://rextester.com/BEZH21139
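Neither answer above builds the regex from file_template itself. Here is a minimal sketch of that, assuming placeholders always look like {NAME} with word-character names, and that a repeated placeholder must match the same text each time. template_to_regex is a hypothetical helper name, not something provided by re:

```python
import re

def template_to_regex(template):
    """Turn {KEY} placeholders into named groups; repeated keys become backreferences."""
    seen = set()
    parts = []
    # Splitting on a capturing group keeps the placeholder names at the odd indices.
    for i, chunk in enumerate(re.split(r"\{(\w+)\}", template)):
        if i % 2 == 0:
            parts.append(re.escape(chunk))               # literal text between placeholders
        elif chunk in seen:
            parts.append("(?P={})".format(chunk))        # same key again: must repeat its value
        else:
            seen.add(chunk)
            parts.append("(?P<{}>.+?)".format(chunk))    # first occurrence: lazy named group
    return re.compile("^" + "".join(parts) + "$")

file_path = "/101-001-015_fg01/4312x2156/101-001-015_fg01.0001.exr"
file_template = "/{CODE}_{ELEMENT}/{WIDTH}x{HEIGHT}/{CODE}_{ELEMENT}.{FRAME}.exr"

match = template_to_regex(file_template).match(file_path)
result = match.groupdict() if match else None
print(result)
```

The lazy .+? groups are a guess at "various characters"; if a key can contain the separator that follows it (for example an underscore inside CODE), the split becomes ambiguous and a stricter per-key pattern would be needed.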

Related

How to extract some url from html?

I need to extract all image links from a local html file. Unfortunately, I can't install bs4 and cssutils to process html.
html = """<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911"><br>
<div><a style="background-image:url(https://s2.example.com/path/image1.jpg?lastmod=1625296911)"</a><a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)"></a><a style="background-image:url(https://s2.example.com/path/image3.jpg?lastmod=1625296912)"></a></div>"""
I tried to extract data using a regex:
images = []
for line in html.split('\n'):
    images.append(re.findall(r'(https://s2.*\?lastmod=\d+)', line))
print(images)
[['https://s2.example.com/path/image0.jpg?lastmod=1625296911'],
['https://s2.example.com/path/image1.jpg?lastmod=1625296911)"</a><a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)"></a><a style="background-image:url(https://s2.example.com/path/image3.jpg?lastmod=1625296912']]
I suppose my regular expression is greedy because I used .*?
How to get the following outcome?
images = ['https://s2.example.com/path/image0.jpg',
'https://s2.example.com/path/image1.jpg',
'https://s2.example.com/path/image2.jpg',
'https://s2.example.com/path/image3.jpg']
If it can help all links are enclosed by src="..." or url(...)
Thanks for your help.
import re
indeces_start = sorted(
    [m.start()+5 for m in re.finditer("src=", html)]
    + [m.start()+4 for m in re.finditer("url", html)])
indeces_end = [m.end() for m in re.finditer(r"\.jpg", html)]
image_list = []
for start,end in zip(indeces_start,indeces_end):
    image_list.append(html[start:end])
print(image_list)
That's a solution that comes to mind. It finds the start and end indices of the image path strings. It obviously has to be adjusted if there are different image types.
Edit: Changed the start criteria, in case there are other URLs in the document
You can use
import re
html = """<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911"><br>
<div><a style="background-image:url(https://s2.example.com/path/image1.jpg?lastmod=1625296911)"</a><a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)"></a><a style="background-image:url(https://s2.example.com/path/image3.jpg?lastmod=1625296912)"></a></div>"""
images = re.findall(r'https://s2[^\s?]*(?=\?lastmod=\d)', html)
print(images)
See the Python demo. Output:
['https://s2.example.com/path/image0.jpg',
'https://s2.example.com/path/image1.jpg',
'https://s2.example.com/path/image2.jpg',
'https://s2.example.com/path/image3.jpg']
See the regex demo, too. It means
https://s2 - some literal text
[^\s?]* - zero or more chars other than whitespace and ? chars
(?=\?lastmod=\d) - immediately to the right, there must be ?lastmod= and a digit (the text is not added to the match since it is a pattern inside a positive lookahead, a non-consuming pattern).
import re
xx = '<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911" alt="asdasd"><img a src="https://s2.example.com/path/image0.jpg?lastmod=1625296911">'
r1 = re.findall(r"<img(?=\s|>)[^>]*>",xx)
url = []
for x in r1:
    x = re.findall(r"src\s{0,}=\s{0,}['\"][\w\d:/.=]{0,}",x)
    if(len(x)== 0): continue
    x = re.findall(r"http[s]{0,1}[\w\d:/.=]{0,}",x[0])
    if(len(x)== 0): continue
    url.append(x[0])
print(url)
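Since the question says every link is wrapped in src="..." or url(...), another option is to anchor on those wrappers and capture up to the first quote, question mark, or closing parenthesis. This is only a sketch on a shortened copy of the sample html; it assumes url(...) values are unquoted, as they are in the question:

```python
import re

html = '''<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911"><br>
<div><a style="background-image:url(https://s2.example.com/path/image1.jpg?lastmod=1625296911)"></a></div>'''

# Start right after src=" or url( and stop before the first ", ? or )
images = re.findall(r'(?:src="|url\()([^"?)]+)', html)
print(images)
```

This does not depend on the host name or on the .jpg extension, so it keeps working if other image types appear.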

How to pass regex in presidio library

I want to recognize my custom pattern using Microsoft's Presidio library in Python.
While passing the regex, I get this error:
AttributeError: 'str' object has no attribute 'regex'
from presidio_analyzer import PatternRecognizer
regex = ("^[2-9]{1}[0-9]{3}\\" +
"s[0-9]{4}\\s[0-9]{4}$")
#p = re.compile(regex)
aadhar_number_recognizer = PatternRecognizer(supported_entity="AADHAR_NUMBER",
                                             patterns=[regex])
PatternRecognizer expects a list of Pattern objects as the patterns argument. You are passing a plain regex string.
Should be:
from presidio_analyzer import PatternRecognizer, Pattern
aadhar_number_recognizer = PatternRecognizer(supported_entity = "AADHAR_NUMBER",
deny_list=[],
patterns=[Pattern(name="AADHAR Number", score=0.8,
regex="(^[2-9]{1}[0-9]{3}\\s[0-9]{4}\\s[0-9]{4}$)")],
context=[])
For more references you can take a look at how Presidio implements PatternRecognizer in its built-in recognizers.
I was working on this. Although I'm using a different Regex, you can use the code below as a template.
In the example I'm using the base sentence:
"Hi, Java is awesome"
and by using Presidio custom Regex it will be "anonymized" into:
"Hi, Python is awesome"
The code below is just an example; if you only want to replace "Java" with "Python", there are easier ways. This was just the first thing that came to mind. When anonymizing, it makes more sense to replace "Java" or "Python" with something like <PROGRAMMING_LANGUAGE>.
from presidio_analyzer import PatternRecognizer, Pattern
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
base_sentence = "Hi, Java is awesome!"
# Define the regex pattern in a Presidio `Pattern` object:
java_pattern = Pattern(name="java_pattern",regex="Java", score = 0.5)
# Define the recognizer with one or more patterns
java_recognizer = PatternRecognizer(supported_entity="JAVA", patterns = [java_pattern])
java_pattern_result = java_recognizer.analyze(text=base_sentence, entities=["JAVA"])
print("Sentence:", base_sentence)
print("Found:", java_pattern_result)
print()
# Now anonymize
# Initialize the engine:
engine = AnonymizerEngine()
anonymize_result = engine.anonymize(
text=base_sentence,
analyzer_results=java_pattern_result,
operators={"JAVA":OperatorConfig("replace",
{"new_value": "Python"})})
print("Anonymized result:")
print(anonymize_result)
This will print:
Sentence: Hi, Java is awesome!
Found: [type: JAVA, start: 4, end: 8, score: 0.5]
Anonymized result:
text: Hi, Python is awesome!
items:
[
{'start': 4, 'end': 10, 'entity_type': 'JAVA', 'text': 'Python', 'operator': 'replace'}
]

Regex : How can I capture multiple text from this string?

I have text from a log file in this format:
{s:9:\\"batch_num\\";s:16:\\"4578123645712459\\";s:9:\\"full_name\\";s:8:\\"John
Doe\\";s:6:\\"mobile\\";s:12:\\"123456784512\\";s:7:\\"address\\";s:5:\\"Redacted"\\";s:11:\\"create_time\\";s:19:\\"2017-09-10
12:45:01\\";s:6:\\"gender\\";s:1:\\"1\\";s:9:\\"birthdate\\";s:10:\\"1996-03-09\\";s:11:\\"contact_num\\";s:1:\\"0\\";s:8:\\"identity\\";s:1:\\"2\\";s:6:\\"school\\";N;s:14:\\"school_city_id\\";N;s:17:\\"profile_pic\\";s:43:\\"profile\/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg\\";s:14:\\"school_address\\";N;s:17:\\"enter_school_date\\";N;s:10:\\"speciality\\";}
Currently I can only extract batch_num, with this regex:
(?<=batch_num\\\\";s:16:\\\\")([0-9]{1,16})(?=\\\)
Link : https://regex101.com/r/OBaOY0/1/
Question
I want to extract value from batch_num, full_name and profile_pic.
My expected output is :
4578123645712459
John Doe
profile\/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg
How do I get the desired output with the right regex?
Thanks in advance.
A solution to elegantly extract values by converting the string to json.
Step 1: Clean the string
import re, itertools
# `text` holds the log line shown in the question
str_text = text.replace('\\','').replace(';','').replace('""','"').replace(':"','"').replace('N',',""')
str_text = re.sub(r's:\d+',',', str_text)
str_text = re.sub('^{,','{', str_text)
str_text = re.sub('}$',':""}', str_text)
str_text = re.sub('(,)', lambda m, c=itertools.count(): m.group() if next(c) % 2 else ':', str_text)
str_text
#'{"batch_num":"4578123645712459","full_name":"John Doe","mobile":"123456784512","address":"Redacted","create_time":"2017-09-10 12:45:01","gender":"1","birthdate":"1996-03-09","contact_num":"0","identity":"2","school":"","school_city_id":"","profile_pic":"profile/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg","school_address":"","enter_school_date":"","speciality":""}'
Step 2: Convert string to json and extract
import json
str_json = json.loads(str_text)
print(str_json['batch_num'])
print(str_json['full_name'])
print(str_json['profile_pic'])
#4578123645712459
#John Doe
#profile/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg
With multiple regular expressions.
Batch Number
(?<="batch_num)\\{3}";s:\d+:\\{3}"(\d+)
Full Name
(?<="full_name)\\{3}";s:\d+:\\{3}"(\w+\s\w+)
Full Name (with more than 2 words)
(?<="full_name)\\{3}";s:\d+:\\{3}"([\w+\s]{1,})
Profile
(?<="profile_pic)\\{3}";s:\d+:\\{3}"(\w+\\{2}\/\w+\.\w+)
Code
regex_batch = r'(?<="batch_num)\\{3}";s:\d+:\\{3}"(\d+)'
regex_name = r'(?<="full_name)\\{3}";s:\d+:\\{3}"(\w+\s\w+)'
regex_prof = r'(?<="profile_pic)\\{3}";s:\d+:\\{3}"(\w+\\{2}\/\w+\.\w+)'
test_str = "{s:9:\\\\\\\"batch_num\\\\\\\";s:16:\\\\\\\"4578123645712459\\\\\\\";s:9:\\\\\\\"full_name\\\\\\\";s:8:\\\\\\\"John Doe\\\\\\\";s:6:\\\\\\\"mobile\\\\\\\";s:12:\\\\\\\"123456784512\\\\\\\";s:7:\\\\\\\"address\\\\\\\";s:5:\\\\\\\"Redacted\"\\\\\\\";s:11:\\\\\\\"create_time\\\\\\\";s:19:\\\\\\\"2017-09-10 12:45:01\\\\\\\";s:6:\\\\\\\"gender\\\\\\\";s:1:\\\\\\\"1\\\\\\\";s:9:\\\\\\\"birthdate\\\\\\\";s:10:\\\\\\\"1996-03-09\\\\\\\";s:11:\\\\\\\"contact_num\\\\\\\";s:1:\\\\\\\"0\\\\\\\";s:8:\\\\\\\"identity\\\\\\\";s:1:\\\\\\\"2\\\\\\\";s:6:\\\\\\\"school\\\\\\\";N;s:14:\\\\\\\"school_city_id\\\\\\\";N;s:17:\\\\\\\"profile_pic\\\\\\\";s:43:\\\\\\\"profile\\\\/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg\\\\\\\";s:14:\\\\\\\"school_address\\\\\\\";N;s:17:\\\\\\\"enter_school_date\\\\\\\";N;s:10:\\\\\\\"speciality\\\\\\\";}"
m_batch = re.findall(regex_batch, test_str, re.MULTILINE)[0]
m_name = re.findall(regex_name, test_str, re.MULTILINE)[0]
m_prof = re.findall(regex_prof, test_str, re.MULTILINE)[0]
print(m_batch, m_name, m_prof)
Output
4578123645712459 John Doe profile\\/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg
I think I have one for you. The jpeg match (group 2) excludes the two slashes, which is why they are shown in pink; they belong to the same match group:
https://regex101.com/r/OBaOY0/2
import itertools, re
a = '{s:9:\\"batch_num\\";s:16:\\"4578123645712459\\";s:9:\\"full_name\\";s:8:\\"John Doe\\";s:6:\\"mobile\\";s:12:\\"123456784512\\";s:7:\\"address\\";s:5:\\"Redacted"\\";s:11:\\"create_time\\";s:19:\\"2017-09-10 12:45:01\\";s:6:\\"gender\\";s:1:\\"1\\";s:9:\\"birthdate\\";s:10:\\"1996-03-09\\";s:11:\\"contact_num\\";s:1:\\"0\\";s:8:\\"identity\\";s:1:\\"2\\";s:6:\\"school\\";N;s:14:\\"school_city_id\\";N;s:17:\\"profile_pic\\";s:43:\\"profile\/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg\\";s:14:\\"school_address\\";N;s:17:\\"enter_school_date\\";N;s:10:\\"speciality\\";}'.replace("\\","")
list(filter(None, list(itertools.chain.from_iterable(re.findall(r'(?:s:16:\")(\d+)|(?:s:8:\")(\w+ \w+)|(?:s:43:\")(\w+/\w+\.\w+)', a)))))
output:
['4578123645712459',
'John Doe',
'profile/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg']
You could get all 3 matches for the example data using an alternation and a capturing group:
\b(?:batch_num|full_name|profile_pic)\b\\\\\\";s:\d+:\\\\\\"([^"]+)\\\\\\"
In parts
\b(?:batch_num|full_name|profile_pic)\b Match one of the options between word boundaries
\\\\\\";s:\d+: Match \\\";s: followed by 1+ digits and :
\\\\\\" Match \\\"
( Capture group 1
[^"]+ Match 1+ times char except "
) Close group
\\\\\\" Match \\\"
Regex demo | Python demo
For example
import re
regex = r'\b(?:batch_num|full_name|profile_pic)\b\\\\\\";s:\d+:\\\\\\"([^"]+)\\\\\\"'
test_str = r'''{s:9:\\\"batch_num\\\";s:16:\\\"4578123645712459\\\";s:9:\\\"full_name\\\";s:8:\\\"John Doe\\\";s:6:\\\"mobile\\\";s:12:\\\"123456784512\\\";s:7:\\\"address\\\";s:5:\\\"Redacted"\\\";s:11:\\\"create_time\\\";s:19:\\\"2017-09-10 12:45:01\\\";s:6:\\\"gender\\\";s:1:\\\"1\\\";s:9:\\\"birthdate\\\";s:10:\\\"1996-03-09\\\";s:11:\\\"contact_num\\\";s:1:\\\"0\\\";s:8:\\\"identity\\\";s:1:\\\"2\\\";s:6:\\\"school\\\";N;s:14:\\\"school_city_id\\\";N;s:17:\\\"profile_pic\\\";s:43:\\\"profile\\/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg\\\";s:14:\\\"school_address\\\";N;s:17:\\\"enter_school_date\\\";N;s:10:\\\"speciality\\\";}'''
matches = re.finditer(regex, test_str)
print(re.findall(regex, test_str))
Output
['4578123645712459', 'John Doe', 'profile\\\\/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg']
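If you want the values keyed by field name rather than as a bare list, the same alternation idea can capture the key as well. This is a rough sketch on a shortened sample with the backslashes already stripped (as in the .replace("\\","") answer above); the sample string and the wanted tuple are made up for illustration:

```python
import re

# Shortened sample in the same shape as the log line, backslashes already removed
sample = ('{s:9:"batch_num";s:16:"4578123645712459";s:9:"full_name";s:8:"John Doe";'
          's:6:"mobile";s:12:"123456784512";'
          's:11:"profile_pic";s:43:"profile/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg";}')

wanted = ("batch_num", "full_name", "profile_pic")

# Group 1 is the key, group 2 is the quoted value that follows it
pattern = re.compile(r's:\d+:"(' + "|".join(wanted) + r')";s:\d+:"([^"]*)"')

values = dict(pattern.findall(sample))
print(values["batch_num"], values["full_name"], values["profile_pic"])
```

Keys whose value is N (null) are simply skipped, since the pattern requires a quoted value.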

Python and regex: create a template

I need to find many substrings in a string, but it takes a long time, so I want to combine them into a single pattern.
I need to find strings like:
003.ru/%[KEYWORD]%
1click.ru/%[KEYWORD]%
3dnews.ru/%[KEYWORD]%
where % stands for any sequence of characters
and [KEYWORD] can be any of ['sony%xperia', 'iphone', 'samsung%galaxy', 'lenovo_a706']
I tried searching with:
keywords = ['sony%xperia', 'iphone', 'samsung%galaxy', 'lenovo_a706']
for i, key in enumerate(keywords):
    coding['keyword_url'] = coding.url.apply(lambda x: x.replace('[KEYWORD]', key).replace('%', '[a-zA-Z0-9-_\.\?!##$%^&*+=]+') if '[KEYWORD]' in x else x.replace('%', '[a-zA-Z0-9-_\.\?!##$%^&*+=]+'))
    for (domain, keyword_url) in zip(coding.domain.values.tolist(), coding.keyword_url.values.tolist()):
        df.loc[df.event_address.str.contains(keyword_url), 'domain'] = domain
Where df contains only event_address (urls)
coding
domain url
003.ru 003.ru/%[KEYWORD]%
1CLICK 1click.ru/%[KEYWORD]%
33033.ru 33033.ru/%[KEYWORD]%
3D NEWS 3dnews.ru/%[KEYWORD]%
96telefonov.ru 96telefonov.ru/%[KEYWORD]%
How can I improve my pattern to do it faster?
First, consider using the re module directly: compile your patterns once with re.compile and then match against the compiled objects.
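To make that concrete, here is a rough sketch that folds all keywords into one alternation, so each URL template is compiled once instead of once per keyword. It assumes % means "any run of characters, possibly empty" (the question's character class required at least one); to_regex, url_templates, and find_domain are hypothetical names for illustration:

```python
import re

keywords = ['sony%xperia', 'iphone', 'samsung%galaxy', 'lenovo_a706']
url_templates = {  # domain label -> url template, as in the `coding` table
    '003.ru': '003.ru/%[KEYWORD]%',
    '1CLICK': '1click.ru/%[KEYWORD]%',
    '3D NEWS': '3dnews.ru/%[KEYWORD]%',
}

def to_regex(fragment):
    # Escape the literal pieces and turn every % wildcard into .*
    return '.*'.join(re.escape(piece) for piece in fragment.split('%'))

# One alternation covering all keywords
keyword_alt = '(?:' + '|'.join(to_regex(k) for k in keywords) + ')'

compiled = {}
for domain, template in url_templates.items():
    head, tail = template.split('[KEYWORD]')
    compiled[domain] = re.compile(to_regex(head) + keyword_alt + to_regex(tail))

def find_domain(url):
    # Return the first domain label whose pattern occurs in the url, else None
    for domain, rx in compiled.items():
        if rx.search(url):
            return domain
    return None

print(find_domain('http://003.ru/catalog/iphone-6s'))
print(find_domain('3dnews.ru/news/samsung-new-galaxy-s8'))
```

With this layout the outer loop over keywords disappears, and the number of regex searches per URL is bounded by the number of domains rather than domains times keywords.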

Regular expressions in a Python find-and-replace script? Update

I'm new to Python scripting, so please forgive me in advance if the answer to this question seems inherently obvious.
I'm trying to put together a large-scale find-and-replace script using Python. I'm using code similar to the following:
import sys
infile = sys.argv[1]
charenc = sys.argv[2]
outFile=infile+'.output'
findreplace = [
    ('term1', 'term2'),
]
inF = open(infile,'rb')
s=unicode(inF.read(),charenc)
inF.close()
for couple in findreplace:
    outtext=s.replace(couple[0],couple[1])
    s=outtext
outF = open(outFile,'wb')
outF.write(outtext.encode('utf-8'))
outF.close()
How would I go about having the script do a find and replace for regular expressions?
Specifically, I want it to find some information (metadata) specified at the top of a text file. Eg:
Title: This is the title
Author: This is the author
Date: This is the date
and convert it into LaTeX format. Eg:
\title{This is the title}
\author{This is the author}
\date{This is the date}
Maybe I'm tackling this the wrong way. If there's a better way than regular expressions please let me know!
Thanks!
Update: Thanks for posting some example code in your answers! I can get it to work so long as I replace the findreplace action, but I can't get both to work. The problem now is I can't integrate it properly into the code I've got. How would I go about having the script do multiple actions on 'outtext' in the below snippet?
for couple in findreplace:
    outtext=s.replace(couple[0],couple[1])
    s=outtext
>>> import re
>>> s = """Title: This is the title
... Author: This is the author
... Date: This is the date"""
>>> p = re.compile(r'^(\w+):\s*(.+)$', re.M)
>>> print p.sub(r'\\\1{\2}', s)
\Title{This is the title}
\Author{This is the author}
\Date{This is the date}
To change the case, use a function as replace parameter:
def repl_cb(m):
    return "\\%s{%s}" %(m.group(1).lower(), m.group(2))
p = re.compile(r'^(\w+):\s*(.+)$', re.M)
print p.sub(repl_cb, s)
\title{This is the title}
\author{This is the author}
\date{This is the date}
See re.sub()
The regular expression you want would probably be along the lines of this one:
^([^:]+): (.*)
and the replacement expression would be
\\\1{\2}
>>> import re
>>> m = 'title', 'author', 'date'
>>> s = """Title: This is the title
... Author: This is the author
... Date: This is the date"""
>>> for i in m:
...     s = re.compile(i+': (.*)', re.I).sub(r'\\' + i + r'{\1}', s)
...
>>> print(s)
\title{This is the title}
\author{This is the author}
\date{This is the date}
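Regarding the update in the question (doing both the literal find-and-replace and the regex conversion on the same text), one possible arrangement is to run the two kinds of rules one after the other on the same string. This is a sketch in Python 3 syntax, not the original Python 2 script; regex_rules and convert are hypothetical names, and the header pattern is limited to the three fields from the question:

```python
import re

findreplace = [('term1', 'term2')]   # literal pairs, as in the question

# Each rule is (compiled pattern, replacement); the replacement may be a function
regex_rules = [
    (re.compile(r'^(Title|Author|Date):\s*(.+)$', re.M),
     lambda m: '\\%s{%s}' % (m.group(1).lower(), m.group(2))),
]

def convert(text):
    # Literal replacements first, then the regex substitutions
    for old, new in findreplace:
        text = text.replace(old, new)
    for pattern, repl in regex_rules:
        text = pattern.sub(repl, text)
    return text

s = "Title: This is term1\nAuthor: This is the author\nDate: This is the date"
outtext = convert(s)
print(outtext)
```

Because every step reassigns the same variable, the result is defined even when one of the rule lists is empty, which the original loop (assigning outtext only inside the loop) does not guarantee.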
