Regex : How can I capture multiple text from this string? - python

I have text from a log file in a format like this:
{s:9:\\"batch_num\\";s:16:\\"4578123645712459\\";s:9:\\"full_name\\";s:8:\\"John
Doe\\";s:6:\\"mobile\\";s:12:\\"123456784512\\";s:7:\\"address\\";s:5:\\"Redacted"\\";s:11:\\"create_time\\";s:19:\\"2017-09-10
12:45:01\\";s:6:\\"gender\\";s:1:\\"1\\";s:9:\\"birthdate\\";s:10:\\"1996-03-09\\";s:11:\\"contact_num\\";s:1:\\"0\\";s:8:\\"identity\\";s:1:\\"2\\";s:6:\\"school\\";N;s:14:\\"school_city_id\\";N;s:17:\\"profile_pic\\";s:43:\\"profile\/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg\\";s:14:\\"school_address\\";N;s:17:\\"enter_school_date\\";N;s:10:\\"speciality\\";}
Currently I can extract only batch_num, with this regex:
(?<=batch_num\\\\";s:16:\\\\")([0-9]{1,16})(?=\\\)
Link : https://regex101.com/r/OBaOY0/1/
Question
I want to extract the values of batch_num, full_name and profile_pic.
My expected output is :
4578123645712459
John Doe
profile\/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg
How do I get the desired output with the right regex?
Thanks in advance.

A solution to elegantly extract values by converting the string to json.
Step 1: Clean the string
import re, itertools
str_text = text.replace('\\','').replace(';','').replace('""','"').replace(':"','"').replace('N',',""')
str_text = re.sub(r's:\d+', ',', str_text)
str_text = re.sub('^{,','{', str_text)
str_text = re.sub('}$',':""}', str_text)
str_text = re.sub('(,)', lambda m, c=itertools.count(): m.group() if next(c) % 2 else ':', str_text)
str_text
#'{"batch_num":"4578123645712459","full_name":"John Doe","mobile":"123456784512","address":"Redacted","create_time":"2017-09-10 12:45:01","gender":"1","birthdate":"1996-03-09","contact_num":"0","identity":"2","school":"","school_city_id":"","profile_pic":"profile/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg","school_address":"","enter_school_date":"","speciality":""}'
Step 2: Convert string to json and extract
import json
str_json = json.loads(str_text)
print(str_json['batch_num'])
print(str_json['full_name'])
print(str_json['profile_pic'])
#4578123645712459
#John Doe
#profile/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg

With multiple regular expressions.
Batch Number
(?<="batch_num)\\{3}";s:\d+:\\{3}"(\d+)
Full Name
(?<="full_name)\\{3}";s:\d+:\\{3}"(\w+\s\w+)
Full Name (with more than 2 words)
(?<="full_name)\\{3}";s:\d+:\\{3}"([\w+\s]{1,})
Profile
(?<="profile_pic)\\{3}";s:\d+:\\{3}"(\w+\\{2}\/\w+\.\w+)
Code
regex_batch = r'(?<="batch_num)\\{3}";s:\d+:\\{3}"(\d+)'
regex_name = r'(?<="full_name)\\{3}";s:\d+:\\{3}"(\w+\s\w+)'
regex_prof = r'(?<="profile_pic)\\{3}";s:\d+:\\{3}"(\w+\\{2}\/\w+\.\w+)'
test_str = "{s:9:\\\\\\\"batch_num\\\\\\\";s:16:\\\\\\\"4578123645712459\\\\\\\";s:9:\\\\\\\"full_name\\\\\\\";s:8:\\\\\\\"John Doe\\\\\\\";s:6:\\\\\\\"mobile\\\\\\\";s:12:\\\\\\\"123456784512\\\\\\\";s:7:\\\\\\\"address\\\\\\\";s:5:\\\\\\\"Redacted\"\\\\\\\";s:11:\\\\\\\"create_time\\\\\\\";s:19:\\\\\\\"2017-09-10 12:45:01\\\\\\\";s:6:\\\\\\\"gender\\\\\\\";s:1:\\\\\\\"1\\\\\\\";s:9:\\\\\\\"birthdate\\\\\\\";s:10:\\\\\\\"1996-03-09\\\\\\\";s:11:\\\\\\\"contact_num\\\\\\\";s:1:\\\\\\\"0\\\\\\\";s:8:\\\\\\\"identity\\\\\\\";s:1:\\\\\\\"2\\\\\\\";s:6:\\\\\\\"school\\\\\\\";N;s:14:\\\\\\\"school_city_id\\\\\\\";N;s:17:\\\\\\\"profile_pic\\\\\\\";s:43:\\\\\\\"profile\\\\/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg\\\\\\\";s:14:\\\\\\\"school_address\\\\\\\";N;s:17:\\\\\\\"enter_school_date\\\\\\\";N;s:10:\\\\\\\"speciality\\\\\\\";}"
m_batch = re.findall(regex_batch, test_str, re.MULTILINE)[0]
m_name = re.findall(regex_name, test_str, re.MULTILINE)[0]
m_prof = re.findall(regex_prof, test_str, re.MULTILINE)[0]
print(m_batch, m_name, m_prof)
Output
4578123645712459 John Doe profile\\/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg

I think I have one for you. The jpeg match group 2 excludes the two //'s, which is why they are in pink; they are the same match group:
https://regex101.com/r/OBaOY0/2
import itertools, re
a = '{s:9:\\"batch_num\\";s:16:\\"4578123645712459\\";s:9:\\"full_name\\";s:8:\\"John Doe\\";s:6:\\"mobile\\";s:12:\\"123456784512\\";s:7:\\"address\\";s:5:\\"Redacted"\\";s:11:\\"create_time\\";s:19:\\"2017-09-10 12:45:01\\";s:6:\\"gender\\";s:1:\\"1\\";s:9:\\"birthdate\\";s:10:\\"1996-03-09\\";s:11:\\"contact_num\\";s:1:\\"0\\";s:8:\\"identity\\";s:1:\\"2\\";s:6:\\"school\\";N;s:14:\\"school_city_id\\";N;s:17:\\"profile_pic\\";s:43:\\"profile\/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg\\";s:14:\\"school_address\\";N;s:17:\\"enter_school_date\\";N;s:10:\\"speciality\\";}'.replace("\\","")
list(filter(None, list(itertools.chain.from_iterable(re.findall(r'(?:s:16:\")(\d+)|(?:s:8:\")(\w+ \w+)|(?:s:43:\")(\w+/\w+\.\w+)', a)))))
output:
['4578123645712459',
'John Doe',
'profile/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg']

You could get all 3 matches for the example data using an alternation and a capturing group:
\b(?:batch_num|full_name|profile_pic)\b\\\\\\";s:\d+:\\\\\\"([^"]+)\\\\\\"
In parts
\b(?:batch_num|full_name|profile_pic)\b Match one of the options between word boundaries
\\\\\\";s:\d+: Match \\\"s: and 1+ digits
\\\\\\" Match \\\"
( Capture group 1
[^"]+ Match 1+ times char except "
) Close group
\\\\\\" Match \\\"
Regex demo | Python demo
For example
import re
regex = r'\b(?:batch_num|full_name|profile_pic)\b\\\\\\";s:\d+:\\\\\\"([^"]+)\\\\\\"'
test_str = r'''{s:9:\\\"batch_num\\\";s:16:\\\"4578123645712459\\\";s:9:\\\"full_name\\\";s:8:\\\"John Doe\\\";s:6:\\\"mobile\\\";s:12:\\\"123456784512\\\";s:7:\\\"address\\\";s:5:\\\"Redacted"\\\";s:11:\\\"create_time\\\";s:19:\\\"2017-09-10 12:45:01\\\";s:6:\\\"gender\\\";s:1:\\\"1\\\";s:9:\\\"birthdate\\\";s:10:\\\"1996-03-09\\\";s:11:\\\"contact_num\\\";s:1:\\\"0\\\";s:8:\\\"identity\\\";s:1:\\\"2\\\";s:6:\\\"school\\\";N;s:14:\\\"school_city_id\\\";N;s:17:\\\"profile_pic\\\";s:43:\\\"profile\\/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg\\\";s:14:\\\"school_address\\\";N;s:17:\\\"enter_school_date\\\";N;s:10:\\\"speciality\\\";}'''
matches = re.finditer(regex, test_str)
print(re.findall(regex, test_str))
Output
['4578123645712459', 'John Doe', 'profile\\\\/2df0d9f29ab3ha65fed4847c8lb1o9sa.jpeg']
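If you also need to know which key each value belongs to, one possible variation (a sketch, not part of the answer above) is to capture the key as well and build a dict. The shortened `sample` string and the `fields` name are made up for illustration, and the pattern assumes every quote is preceded by zero or more backslashes.

```python
import re

# Hypothetical, shortened sample: each quote is escaped with one backslash.
sample = r'{s:9:\"batch_num\";s:16:\"4578123645712459\";s:9:\"full_name\";s:8:\"John Doe\";s:11:\"profile_pic\";s:7:\"a\/b.jpg\";}'

# Group 1 is the key, group 2 the value.
# \\* tolerates any number of backslashes in front of a quote.
pattern = re.compile(r'\b(batch_num|full_name|profile_pic)\b\\*";s:\d+:\\*"(.*?)\\*";')

# findall returns (key, value) tuples, which dict() accepts directly.
fields = dict(pattern.findall(sample))
print(fields)
```

Because the value group is lazy, it stops at the first escaped quote followed by a semicolon, so values containing spaces or slashes are kept whole.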


REGEX_String between strings in a list

From this list:
['AUSTRALIA\nBELMONT PARK (WA', '\nR3\n1/5/4/2\n2/3/1/5\nEAGLE FARM (QLD']
I would like to reduce it to this list:
['BELMONT PARK', 'EAGLE FARM']
You can see from the first list that the desired words are between '\n' and '('.
My attempted solution is:
for i in x:
    result = re.search('\n(.*)(', i)
    print(result.group(1))
This returns the error 'unterminated subpattern'.
Thank you
You’re getting an error because the ( is unescaped. Regardless, it will not work, as you’ll get the following matches:
\nBELMONT PARK (
\nR3\n1/5/4/2\n2/3/1/5\nEAGLE FARM (
You can try the following:
(?<=\\n)(?!.*\\n)(.*)(?= \()
(?<=\\n): Positive lookbehind to ensure \n is before match
(?!.*\\n): Negative lookahead to ensure no further \n is included
(.*): Your match
(?= \(): Positive lookahead to ensure ( is after match
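A minimal sketch of how the lookaround pattern could be used in Python, assuming the list items contain real newline characters. For that reason the pattern below uses `\n` (a newline) rather than `\\n` (a literal backslash followed by n); the `names` variable is introduced only for this example.

```python
import re

x = ['AUSTRALIA\nBELMONT PARK (WA', '\nR3\n1/5/4/2\n2/3/1/5\nEAGLE FARM (QLD']

# \n is a real newline here; "." does not match newlines by default,
# so (.*) cannot run past the end of the line it starts on.
pattern = re.compile(r'(?<=\n)(?!.*\n)(.*)(?= \()')

# Search each item once and keep group 1 of every successful match.
names = [m.group(1) for m in map(pattern.search, x) if m]
print(names)  # ['BELMONT PARK', 'EAGLE FARM']
```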
You can get the matches without using any lookarounds, as you are already using a capture group.
\n(.*) \(
Explanation
\n Match a newline
(.*) Capture group 1, match any character except a newline, as much as possible
\( Match a space and (
See a regex101 demo and a Python demo.
Example
import re
x = ['AUSTRALIA\nBELMONT PARK (WA', '\nR3\n1/5/4/2\n2/3/1/5\nEAGLE FARM (QLD']
pattern = r"\n(.*) \("
for i in x:
    m = re.search(pattern, i)
    if m:
        print(m.group(1))
Output
BELMONT PARK
EAGLE FARM
If you want to return a list:
x = ['AUSTRALIA\nBELMONT PARK (WA', '\nR3\n1/5/4/2\n2/3/1/5\nEAGLE FARM (QLD']
pattern = r"\n(.*) \("
res = [m.group(1) for i in x for m in [re.search(pattern, i)] if m]
print(res)
Output
['BELMONT PARK', 'EAGLE FARM']

Regex to match "words" that contain two continuous streaks of digits and letters or vice-versa and split them

I have the following line of text:
text= 'Cms12345678 Gleandaleacademy Fee Collection 00001234Abcd Renewal 123Acgf456789'
I am trying to split numbers followed by letters, or letters followed by numbers, to get this output:
output_text = 'Cms 12345678 Gleandaleacademy Fee Collection 00001234 Abcd Renewal 123Acgf456789'
I have tried the following approach:
import re
text = 'Cms12345678 Gleandaleacademy Fee Collection 00001234Abcd Renewal 123Acgf456789'
text = text.lower().strip()
text = text.split(' ')
output_text =[]
for i in text:
    if bool(re.match(r'[a-z]+\d+|\d+\w+',i, re.IGNORECASE))==True:
        out_split = re.split('(\d+)',i)
        for j in out_split:
            output_text.append(j)
    else:
        output_text.append(i)
output_text = ' '.join(output_text)
Which is giving output as:
output_text = 'cms 12345678 gleandaleacademy fee collection 00001234 abcd renewal 123 acgf 456789 '
This code also splits the last element of the text, 123acgf456789, because of the incorrect regex in re.match.
Please help me get the correct output.
You can use
re.sub(r'\b(?:([a-zA-Z]+)(\d+)|(\d+)([a-zA-Z]+))\b', r'\1\3 \2\4', text)
See the regex demo
Details
\b - word boundary
(?: - start of a non-capturing group (necessary for the word boundaries to be applied to all the alternatives):
([a-zA-Z]+)(\d+) - Group 1: one or more letters and Group 2: one or more digits
| - or
(\d+)([a-zA-Z]+) - Group 3: one or more digits and Group 4: one or more letters
) - end of the group
\b - word boundary
During the replacement, either \1 and \2 or \3 and \4 replacement backreferences are initialized, so concatenating them as \1\3 and \2\4 yields the right results.
See a Python demo:
import re
text = "Cms1291682971 Gleandaleacademy Fee Collecti 0000548Andb Renewal 402Ecfev845410001"
print( re.sub(r'\b(?:([a-zA-Z]+)(\d+)|(\d+)([a-zA-Z]+))\b', r'\1\3 \2\4', text) )
# => Cms 1291682971 Gleandaleacademy Fee Collecti 0000548 Andb Renewal 402Ecfev845410001
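As a quick check, the same substitution can be run on the exact text from the question. This sketch assumes Python 3.5 or later, where groups that did not participate in the match are replaced with an empty string; the `pattern` and `result` names are only for this example.

```python
import re

text = 'Cms12345678 Gleandaleacademy Fee Collection 00001234Abcd Renewal 123Acgf456789'
pattern = re.compile(r'\b(?:([a-zA-Z]+)(\d+)|(\d+)([a-zA-Z]+))\b')

# Only tokens made of exactly two runs (letters+digits or digits+letters) match.
# "123Acgf456789" has three runs, so the trailing \b fails and it is left alone.
result = pattern.sub(r'\1\3 \2\4', text)
print(result)
```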

How to replace characters in a text by space except for list of words in python

I want to replace all characters in a text by spaces, but I want to leave a list of words.
For instance:
text = "John Thomas bought 300 shares of Acme Corp. in 2006."
list_of_words = ['Acme Corp.', 'John Thomas']
My wanted output would be:
output_text = "***********                      **********         "
I would like to change unwanted characters to spaces before I do the * replacement:
"John Thomas                      Acme Corp.         "
Right now I know how to replace only the list of words, but cannot come up with the spaces part.
rep = {key: len(key)*'*' for key in list_of_words}
rep = dict((re.escape(k), v) for k, v in rep.items())
pattern = re.compile("|".join(rep.keys()))
pattern.sub(lambda m: rep[re.escape(m.group(0))], text)
You may build a pattern like
(?s)word1|word2|wordN|(.)
When Group 1 matches, replace with a space, else, replace with the same amount of asterisks as the match text length:
import re
text = "John Thomas bought 300 shares of Acme Corp. in 2006."
list_of_words = ['Acme Corp.', 'John Thomas']
pat = "|".join(sorted(map(re.escape, list_of_words), key=len, reverse=True))
pattern = re.compile(f'{pat}|(.)', re.S)
print(pattern.sub(lambda m: " " if m.group(1) else len(m.group(0))*"*", text))
# => '***********                      **********         '
See the Python demo
Details
sorted(map(re.escape, list_of_words), key=len, reverse=True) - escapes words in list_of_words and sorts the list by length in descending order (it will be necessary if there are multiword items)
"|".join(...) - build the alternatives out of list_of_words items
lambda m: " " if m.group(1) else len(m.group(0))*"*" - if Group 1 matches, replace with a space, else with the asterisks of the same length as the match length.
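The question also asks for the intermediate string in which only the unwanted characters become spaces. A minimal sketch of that step, reusing the same pattern and changing only the replacement callback (the `spaced` name is introduced here for illustration):

```python
import re

text = "John Thomas bought 300 shares of Acme Corp. in 2006."
list_of_words = ['Acme Corp.', 'John Thomas']

pat = "|".join(sorted(map(re.escape, list_of_words), key=len, reverse=True))
pattern = re.compile(f'{pat}|(.)', re.S)

# Keep listed phrases as they are; turn every other character into a space.
spaced = pattern.sub(lambda m: " " if m.group(1) else m.group(0), text)
print(repr(spaced))
```

The result has the same length as the input, so the asterisk replacement can be applied to it afterwards without shifting any positions.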

Python - regular expressions - part of the search pattern is on the same line, part is on the next one

I have 3 emails which have the following in the email body:
1st email
2nd email
3rd email
= means new line. There are 3 cases:
Case 1
machine name is on the next line
Example
MACHINE: =
ldnmdsbatchxl01
Case 2
machine name is on the same line:
MACHINE: p2prog06
Case 3
Part of the machine name is on the same line, part is on the next line
MACHINE: p1prog=
07
The following works for the first 2 cases and partially for the 3rd case: regex2 = r'\bMACHINE:\s*(?:=.*)?\s*([^<^\n ]+)'
In the 3rd case I'm getting p1prog=
Desired output:
p1prog07
ldnmdsbatchxl01
p2prog06
Thanks
if resp == 'OK':
    email_body = data[0][1].decode('utf-8')
    mail = email.message_from_string(email_body)
    #get all emails with words "PA1" or "PA2" in subject
    if mail["Subject"].find("PA1") > 0 or mail["Subject"].find("PA2") > 0:
        #search email body for job name (string after word "JOB")
        regex1 = r'(?<!^)JOB:\s*(\S+)'
        regex2 = r'\bMACHINE:\s*(?:=.*)?\s*([^<^\n ]+)|$'
        c=re.findall(regex2, email_body)[0]#,re.DOTALL)
        a=re.findall(regex1 ,email_body)
You may use
import re
email = 'MACHINE: =\nldnmdsbatchxl01\n\n\nMACHINE: p2prog06\n\n\nMACHINE: p1prog=^M\n07'
res = list(set([re.sub(r'=(?:\^M)?|[\r\n]+', '', x) for x in re.findall(r'\bMACHINE:\s*(.*(?:(?:\r\n?|\n)\S+)?)', email, re.M)]))
print(res)
# => ['ldnmdsbatchxl01', 'p2prog06', 'p1prog07']
See the Python demo
The regex used is \bMACHINE:\s*(.*(?:(?:\r\n?|\n)\S+)?):
\bMACHINE - whole word MACHINE
: - a : char
\s* - 0+ whitespaces
(.*(?:(?:\r\n?|\n)\S+)?) - Group 1 (this substring will be returned by re.findall):
.* - 0+ chars other than line break chars
(?:(?:\r\n?|\n)\S+)? - an optional substring:
(?:\r\n?|\n) - a CRLF, LF or CR line break sequence
\S+ - 1+ non-whitespace chars
The re.sub(r'=(?:\^M)?|[\r\n]+', '', x) removes = or =^M and CR/LF symbols from the Group 1 value.
To get unique values, use list(set(res)).
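A different approach, sketched here as an alternative rather than as the method above, is to join the wrapped lines first and then use a much simpler pattern. It assumes that an `=` at the end of a line (optionally followed by a literal `^M`) is always a soft line break, as in quoted-printable text, and never part of a machine name; `unwrapped` and `machines` are names chosen for this example.

```python
import re

email = 'MACHINE: =\nldnmdsbatchxl01\n\n\nMACHINE: p2prog06\n\n\nMACHINE: p1prog=^M\n07'

# Remove "=" (optionally followed by a literal "^M") at the end of a line,
# together with the line break, so wrapped values are joined again.
unwrapped = re.sub(r'=(?:\^M)?\r?\n', '', email)

# After unwrapping, the machine name is simply the next run of non-whitespace.
machines = re.findall(r'\bMACHINE:\s*(\S+)', unwrapped)
print(machines)  # ['ldnmdsbatchxl01', 'p2prog06', 'p1prog07']
```

Unlike the `set`-based version, this keeps the names in the order they appear in the text.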
Short answer:
regexp = re.compile(r'MACHINE:\s={0,1}\s{0,1}((\S+=\^M\s\S+|\S+))')
value = regexp.search(data)[1]
value = value.replace('=^M\n', '')
Long answer:
Assume we have data from your examples:
data = """
BFAILURE JOB: p2_batch_excel_quants_fx_daily_vol_check_0800 MACHINE: =
ldnmdsbatchxl01 EXITCODE: 268438455
(...)
RUNALARM JOB: p2_credit_qv_curve_snap MACHINE: p2prog06
Attachments:
(...)
[11/01/2019 08:15:09] CAUAJM_I_40245 EVENT: ALARM ALARM: JO=^M
BFAILURE JOB: p1_static_console_row_based_permissions MACHINE: p1prog=^M
07 EXITCODE: 1<br>^M
"""
Then we may use code:
import re
regexp = re.compile(r'MACHINE:\s={0,1}\s{0,1}((\S+=\^M\s\S+|\S+))')
for d in data.split("(...)"):
    value = regexp.search(d)[1]
    print(value.replace('=^M\n', ''))
As you can see, the regexp matches =^M\n too, so we need to remove it afterwards.
output:
ldnmdsbatchxl01
p2prog06
p1prog07
EDIT:
If your data contains many email bodies in one string:
import re
regexp = re.compile(r'MACHINE:\s={0,1}\s{0,1}((\S+=\^M\s\S+|\S+))')
matches = regexp.findall(data)
print(matches)
print('---')
for m in matches:
    print(m[0].replace('=^M\n', ''))
This produces:
[('ldnmdsbatchxl01', 'ldnmdsbatchxl01'), ('p2prog06', 'p2prog06'), ('p1prog=^M\n07', 'p1prog=^M\n07')]
---
ldnmdsbatchxl01
p2prog06
p1prog07

Python Regular Expression to find all combinations of a Letter Number Letter Designation

I need to implement a Python regular expression to search for all occurrences of A1a or A_1_a or A-1-a or _A_1_a_ or _A1a, where:
A can be A to Z.
1 can be 1 to 9.
a can be a to z.
Where there are only three characters letter number letter, separated by Underscores, Dashes or nothing. The case in the search string needs to be matched exactly.
The main problem I am having is that sometimes these three letter combinations are connected to other text by dashes and underscores. Also creating the same regular expression to search for A1a, A-1-a and A_1_a.
Also I forgot to mention this is an XML file.
Thanks, this found every occurrence of what I was looking for with a slight modification [-]?[A][-]?[1][-]?[a][-]?, but I need these to be variables, something like
[-]?[var_A][-]?[var_3][-]?[Var_a][-]?
Would that be done like this:
regex = r"[-]?[%s][-]?[%s][-]?[%s][-]?"
print re.findall(regex,var_A,var_Num,Var_a)
Or more like:
regex = ''.join(['r','\"','[-]?[',Var_X,'][-]?[',Var_Num,'][-]?[',Var_x,'][-]?','\"'])
print regex
for sstr in searchstrs:
    matches = re.findall(regex, sstr, re.I)
But this isn't working.
Sample Lines of the File:
Before Running Script
<t:ION t:SA="BoolObj" t:H="2098947" t:P="2098944" t:N="AN7 Result" t:CI="Boolean_Register" t:L="A_3_a Fdr2" t:VS="true">
<t:ION t:SA="RegisterObj" t:H="20971785" t:P="20971776" t:N="ART1 Result 1" t:CI="NumericVariable_Register" t:L="A3a1 Status" t:VS="1">
<t:ION t:SA="ModuleObj" t:H="2100736" t:P="2097152" t:N="AND/OR 14" t:CI="AndOr_Module" t:L="A_3_a_2 Energized from Norm" t:S="0" t:SC="5">
After Running Script
What I am getting: (It's deleting the entire line and leaving only what is below)
B_1_c
B1c1
B_1_c_2
What I Want to get:
<t:ION t:SA="BoolObj" t:H="2098947" t:P="2098944" t:N="AN7 Result" t:CI="Boolean_Register" t:L="B_1_c Fdr2" t:VS="true">
<t:ION t:SA="RegisterObj" t:H="20971785" t:P="20971776" t:N="ART1 Result 1" t:CI="NumericVariable_Register" t:L="B1c1 Status" t:VS="1">
<t:ION t:SA="ModuleObj" t:H="2100736" t:P="2097152" t:N="AND/OR 14" t:CI="AndOr_Module" t:L="B_1_c_2 Energized from Norm" t:S="0" t:SC="5">
import re
import os
search_file_name = 'Alarms Test.fwn'
pattern = 'A3a'
fileName, fileExtension = os.path.splitext(search_file_name)
newfilename = fileName + '_' + pattern + fileExtension
outfile = open(newfilename, 'wb')
def find_ext(text):
    matches = re.findall(r'([_-]?[A{1}][_-]?[3{1}][_-]?[a{1}][_-]?)', text)
    records = [m.replace('3', '1').replace('A', 'B').replace('a', 'c') for m in matches]
    if matches:
        outfile.writelines(records)
        return 1
    else:
        outfile.writelines(text)
        return 0
def main():
    success = 0
    count = 0
    with open(search_file_name, 'rb') as searchfile:
        try:
            searchstrs = searchfile.readlines()
            for s in searchstrs:
                success = find_ext(s)
                count = count + success
        finally:
            searchfile.close()
    print count
if __name__ == "__main__":
    main()
You want to use the following to find your matches.
matches = re.findall(r'([_-]?[a-z][_-]?[1-9][_-]?[a-z][_-]?)', s, re.I)
See regex101 demo
If you are looking to find the matches and then strip all of the - and _ characters, you could do:
import re
s = '''
A1a _A_1 A_ A_1_a A-1-a _A_1_a_ _A1a _A-1-A_ a1_a A-_-5-a
_A-_-5-A a1_-1 XMDC_A1a or XMDC-A1a or XMDC_A1-a XMDC_A_1_a_ _A-1-A_
'''
def find_this(text):
    matches = re.findall(r'([_-]?[a-z][_-]?[1-9][_-]?[a-z][_-]?)', text, re.I)
    records = [m.replace('-', '').replace('_', '') for m in matches]
    print(records)
find_this(s)
Output
['A1a', 'A1a', 'A1a', 'A1a', 'A1a', 'A1A', 'a1a', 'A1a', 'A1a', 'A1a', 'A1a', 'A1A']
See working demo
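The script in the question loses the rest of each line because it writes only the matches. One way to keep the line intact is to substitute inside it with `re.sub`. This is a sketch under assumptions: the `lines` list is a shortened, hypothetical version of the sample file, and the A/3/a to B/1/c mapping is hard-coded as in the question.

```python
import re

# Hypothetical, shortened lines modelled on the sample in the question.
lines = [
    't:L="A_3_a Fdr2" t:VS="true"',
    't:L="A3a1 Status" t:VS="1"',
    't:L="A_3_a_2 Energized from Norm" t:S="0"',
]

# Capture the optional separators so they can be written back unchanged.
pattern = re.compile(r'A([_-]?)3([_-]?)a')

# \g<1> and \g<2> re-insert the separators.
# Writing \g<1>1 avoids "\11" being read as a reference to group 11.
renamed = [pattern.sub(r'B\g<1>1\g<2>c', line) for line in lines]
for line in renamed:
    print(line)
```

Everything outside the matched designation is copied through untouched, so each output line keeps its attributes.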
To quickly get the A1as out without the punctuation, and not having to reconstruct the string from captured parts...
t = '''A1a _B_2_z_
A_1_a
A-1-a
_A_1_a_
_C1c '''
re.findall("[A-Z][0-9][a-z]",t.replace("-","").replace("_",""))
Output:
['A1a', 'B2z', 'A1a', 'A1a', 'A1a', 'C1c']
(But if you don't want to capture from FILE.TXT-2b, then you would have to be careful about most of these solutions...)
If the string can be separated by multiple underscores or dashes (e.g. A__1a):
[_-]*[A-Z][_-]*[1-9][_-]*[a-z]
If there can only be one or zero underscores or dashes:
[_-]?[A-Z][_-]?[1-9][_-]?[a-z]
regex = r"[A-Z][-_]?[1-9][-_]?[a-z]"
print(re.findall(regex, some_string_variable))
should work.
To capture just the parts you're interested in, wrap them in parentheses:
regex = r"([A-Z])[-_]?([1-9])[-_]?([a-z])"
print(re.findall(regex, some_string_variable))
If the underscores, dashes, or lack thereof must be consistent within a match, this pattern will not enforce it and may return bad results; that would need extra logic, such as a backreference to the first separator.
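The question also asks how to build the pattern from variables. A minimal sketch, assuming Python 3.6+ for f-strings: `build_pattern` is a hypothetical helper, and the backreference `\1` is one possible way to require the same separator on both sides of the digit.

```python
import re

def build_pattern(upper, digit, lower):
    """Hypothetical helper: match e.g. A3a, A_3_a or A-3-a for the given characters."""
    # re.escape guards against characters that are special in a regex.
    # \1 requires the second separator to equal the first
    # (both "_", both "-", or both empty).
    return re.compile(rf'{re.escape(upper)}([-_]?){re.escape(digit)}\1{re.escape(lower)}')

pattern = build_pattern('A', '3', 'a')
sample = 'A3a A_3_a A-3-a A_3-a A3_a'

# finditer with group(0) returns the whole match rather than just the separator group.
found = [m.group(0) for m in pattern.finditer(sample)]
print(found)  # ['A3a', 'A_3_a', 'A-3-a']
```

Mixed forms such as A_3-a are skipped; drop the `\1` and use a second `[-_]?` if those should match too.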
