extracting items using regular expression in python

extracting items using regular expression in python - python

I have a a file which has the following :
new=['{"TES1":"=TES0"}}', '{"""TES1:IDD""": """=0x3C""", """TES1:VCC""": """=0x00"""}']
I am trying to extract the first item, TES1:=TES0 from the list. I am trying to use a regular expression to do this. This is what i tried but i am not able to grab the second item TES0.
import re
TES=re.compile('(TES[\d].)+')
for item in new:
result = TES.search(item)
print result.groups()
The result of the print was ('TES1:',). I have tried various ways to extract it but am always getting the same result. Any suggestion or help is appreciated. Thanks!

I think you are looking for findall:
import re
TES=re.compile('TES[\d].')
for item in new:
result = TES.findall(item)
print result

First Option (with quotes)
To match "TES1":"=TES0", you can use this regex:
"TES\d+":"=TES\d+"
like this:
match = re.search(r'"TES\d+":"=TES\d+"', subject)
if match:
result = match.group()
Second Option (without quotes)
If you want to get rid of the quotes, as in TES1:=TES0, you use this regex:
Search: "(TES\d+)":"(=TES\d+)"
Replace: \1:\2
like this:
result = re.sub(r'"(TES\d+)":"(=TES\d+)"', r"\1:\2", subject)
How does it work?
"(TES\d+)":"(=TES\d+)"
Match the character “"” literally "
Match the regex below and capture its match into backreference number 1 (TES\d+)
Match the character string “TES” literally (case sensitive) TES
Match a single character that is a “digit” (0–9 in any Unicode script) \d+
Between one and unlimited times, as many times as possible, giving back as needed (greedy) +
Match the character string “":"” literally ":"
Match the regex below and capture its match into backreference number 2 (=TES\d+)
Match the character string “=TES” literally (case sensitive) =TES
Match a single character that is a “digit” (0–9 in any Unicode script) \d+
Between one and unlimited times, as many times as possible, giving back as needed (greedy) +
Match the character “"” literally "
\1:\2
Insert the text that was last matched by capturing group number 1 \1
Insert the character “:” literally :
Insert the text that was last matched by capturing group number 2 \2

You can use a single replacement, example:
import re
result = re.sub(r'{"(TES\d)":"(=TES\d)"}}', '$1:$2', yourstr, 1)

Related

How to match regex in python?

describe aws_security_group({:group_id=>"sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do
I try to filter the sg-ezsrzerzer out of it (so I want to filter on start sg- till double quote). I'm using python
I currently have:
import re
a = 'describe aws_security_group({:group_id=>"sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do'
test = re.findall(r'\bsg-.*\b', a)
print(test)
output is
['sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do']
How do I only get ['sg-ezsrzerzer']?

The pattern (?<=group_id=\>").+?(?=\") would work nicely if the goal is to extract the group_id value within a given string formatted as in your example.
(?<=group_id=\>") Looks behind for the sub-string group_id=>" before the string to be matched.
.+? Matches one or more of any character lazily.
(?=\") Looks ahead for the character " following the match (effectively making the expression .+ match any character except a closing ").
If you only want to extract sub-strings where the group_id starts with sg- then you can simply add this to the matching part of the pattern as follows (?<=group_id=\>")sg\-.+?(?=\")
import re
s = 'describe aws_security_group({:group_id=>"sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do'
results = re.findall('(?<=group_id=\>").+?(?=\")', s)
print(results)
Output
['sg-ezsrzerzer']
Of course you could alternatively use re.search instead of re.findall to find the first instance of a sub-string matching the above pattern in a given string - depends on your use case I suppose.
import re
s = 'describe aws_security_group({:group_id=>"sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do'
result = re.search('(?<=group_id=\>").+?(?=\")', s)
if result:
result = result.group()
print(result)
Output
'sg-ezsrzerzer'
If you decide to use re.search you will find that it returns None if there is no match found in your input string and an re.Match object if there is - hence the if statement and call to s.group() to extract the matching string if present in the above example.

The pattern \bsg-.*\b matches too much as the .* will match until the end of the string, and will then backtrack to the first word boundary, which is after the o and the end of string.
If you are using re.findall you can also use a capture group instead of lookarounds and the group value will be in the result.
:group_id=>"(sg-[^"\r\n]+)"
The pattern matches:
:group_id=>" Match literally
(sg-[^"\r\n]+) Capture group 1 match sg- and 1+ times any char except " or a newline
" Match the double quote
See a regex demo or a Python demo
For example
import re
pattern = r':group_id=>"(sg-[^"\r\n]+)"'
s = "describe aws_security_group({:group_id=>\"sg-ezsrzerzer\", :vpc_id=>\"vpc-zfds54zef4s\"}) do"
print(re.findall(pattern, s))
Output
['sg-ezsrzerzer']

Match until the first word boundary with \w+:
import re
a = 'describe aws_security_group({:group_id=>"sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do'
test = re.findall(r'\bsg-\w+', a)
print(test[0])
See Python proof.
EXPLANATION
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
sg- 'sg-'
--------------------------------------------------------------------------------
\w+ word characters (a-z, A-Z, 0-9, _) (1 or
more times (matching the most amount
possible))
Results: g-ezsrzerzer

only keep digits after ":" in regular expression in python [duplicate]

This question already has answers here:
How to grab number after word in python
(4 answers)
Closed 2 years ago.
I want to extract the numbers for each parameter below:
import re
parameters = '''
NO2: 42602
SO2: 42401
CO: 42101
'''
The desired output should be:['42602','42401','42101']
I first tried re.findall(r'\d+',parameters), but it also returns the "2" from "NO2" and "SO2".
Then I tried re.findall(':.*',parameters), but it returns [': 42602', ': 42401', ': 42101']
If I can not rename the "NO2" to "Nitrogen dioxide", is there a way just to collect numbers on the right (after ":")?
Many thanks.

If you do not want to use capturing groups, you could use look behind.
(?<=:\s)\d+
Details:
(?<=:\s): gets string after :\s
\d+: gets digits
I also tried result on python.
import re
parameters = '''
NO2: 42602
SO2: 42401
CO: 42101
'''
result = re.findall(r'(?<=:\s)\d+',parameters)
print (result)
Result
['42602', '42401', '42101']

You can use the following regex to capture the numbers
^\s*\w+:\s(\d+)$
Hereby, ^ in the beginning asserts the position at the start of the line. \s* means that there may be 0 or more whitespaces before the content. \w+:\s matches a word character followed by ":" and space, that is "NO2: ".
Finally, (\d+) matches the following digits you want as a group. $ matches the end of the line.
To get all the matches as a list you can use
matches = re.findall(r'^\s*\w+:\s(\d+)$', parameters, re.MULTILINE)
As re.MULTILINE is specified,
the pattern character '^' matches at the beginning of the string and
at the beginning of each line.
as stated in the docs.
The result is as follows
>> print(matches)
['42602', '42401', '42101']

To put my two cents in, you could simpley use
re.findall(r'(\b\d+\b)', parameters)
See a demo on regex101.com.
If you happen to have other digits floating around somewhere in your string, be more precise with
\w+:\s*(\d+)
See another demo on regex101.com.

re.findall(r'(?<=:\s)\d+', parameters)
Should work. You can learn more about look-behind from here.

You just need to specify where in your string do you want to search for digits, you can use:
re.findall(r': (\d+)', parameters)
This tells Python to look for digits in the part of the string after ":" and the "space".

strange output regular expression r'[-.\:alnum:](.*)'

I expect to fetch all alphanumeric characters after "-"
For an example:
>>> str1 = "12 - mystr"
>>> re.findall(r'[-.\:alnum:](.*)', str1)
[' mystr']
First, it's strange that white space is considered alphanumeric, while I expected to get ['mystr'].
Second, I cannot understand why this can be fetched, if there is no "-":
>>> str2 = "qwertyuio"
>>> re.findall(r'[-.\:alnum:](.*)', str2)
['io']

First of all, Python re does not support POSIX character classes.
The white space is not considered alphanumeric, your first pattern matches - with [-.\:alnum:] and then (.*) captures into Group 1 all 0 or more chars other than a newline. The [-.\:alnum:] pattern matches one char that is either -, ., :, a, l, n, u or m. Thus, when run against the qwertyuio, u is matched and io is captured into Group 1.
Alphanumeric chars can be matched with the [^\W_] pattern. So, to capture all alphanumeric chars after - that is followed with 0+ whitespaces you may use
re.findall(r'-\s*([^\W_]+)', s)
See the regex demo
Details
- - a hyphen
\s* - 0+ whitespaces
([^\W_]+) - Capturing group 1: one or more (+) chars that are letters or digits.
Python demo:
print(re.findall(r'-\s*([^\W_]+)', '12 - mystr')) # => ['mystr']
print(re.findall(r'-\s*([^\W_]+)', 'qwertyuio')) # => []

Your regex says: "Find any one of the characters -.:alnum, then capture any amount of any characters into the first capture group".
In the first test, it found - for the first character, then captured mystr in the first capture group. If any groups are in the regex, findall returns list of found groups, not the matches, so the matched - is not included.
Your second test found u as one of the -.:alnum characters (as none of qwerty matched any), then captured and returned the rest after it, io.
As #revo notes in comments, [....] is a character class - matching any one character in it. In order to include a POSIX character class (like [:alnum:]) inside it, you need two sets of brackets. Also, there is no order in a character class; the fact that you included - inside it just means it would be one of the matched characters, not that alphanumeric characters would be matched without it. Finally, if you want to match any number of alphanumerics, you have your quantifier * on the wrong thing.
Thus, "match -, then any number of alphanumeric characters" would be -([[:alnum:]]*), except... Python does not support POSIX character classes. So you have to write your own: -([A-Za-z0-9]*).
However, that will not match your string because the intervening space is, as you note, not an alphanumeric character. In order to account for that, -\s*([A-Za-z0-9]*).

Not quite sure what you want to match. I'll assume you don't want to include '-' in any matches.
If you want to get all alphanumeric chars after the first '-' and skip all other characters you can do something like this.
re.match('.*?(?<=-)(((?<=\s+)?[a-zA-Z\d]+(?=\s+)?)+)', inputString)
If you want to find each string of alphanumerics after a each '-' then you can do this.
re.findall('(?<=-)[a-zA-Z\d]+')

regex select sequences that start with specific number

I want to select select all character strings that begin with 0
x= '1,1,1075 1,0,39 2,4,1,22409 0,1,1,755,300 0,1,1,755,50'
I have
re.findall(r'\b0\S*', x)
but this returns
['0,39', '0,1,1,755,300', '0,1,1,755,50']
I want
['0,1,1,755,300', '0,1,1,755,50']

The problem is that \b matches the boundaries between digits and commas too. The simplest way might be not to use a regex at all:
thingies = [thingy for thingy in x.split() if thingy.startswith('0')]

Instead of using the boundary \b which will match between the comma and number (between any word [a-zA-Z0-9_] and non word character), you will want to match on start of string or space like (^|\s).
(^|\s)0\S*
https://regex101.com/r/Mrzs8a/1
Which will match the start of string or a space preceding the target string. But that will also include the space if present so I would suggest either trimming your matched string or wrapping the latter part with parenthesis to make it a group and then just getting group 1 from the matches like:
(?:^|\s)(0\S*)
https://regex101.com/r/Mrzs8a/2

python regex get value after string

I am trying to parse a comma separated string keyword://pass#ip:port.
The string is a comma separated string, however the password can contain any character including comma. hence I can not use a split operation based on comma as delimiter.
I have tried to use regex to get the string after "myserver://" and later on I can split the rest of the information by using string operation (pass#ip:port/key1) but I could not make it working as I can not fetch the information after the above keyword.
myserver:// is a hardcoded string, and I need to get whatever follows each myserver as a comma separated list (i.e. pass#ip:port/key1, pass2#ip2:port2/key2, etc)
This is the closest I can get:
import re
my_servers="myserver://password,123#ip:port/key1,myserver://pass2#ip2:port2/key2"
result = re.search(r'myserver:\/\/(.*)[,(.*)|\s]', my_servers)
using search I tries to find the occurrence of the "myserver://" keyword followed by any characters, and ends with comma (means it will be followed by myserver://zzz,myserver://qqq) or space (incase of single myserver:// element, but I do not know how to do this better apart of using space as end-indicator). However this does not come out right. How can I do this better with regex?

You may consider the following splitting approach if you do not need to keep myserver:// in the results:
filter(None, re.split(r'\s*,?\s*myserver://', s))
The \s*,?\s*myserver:// pattern matches an optional , enclosed with 0+ whitespaces and then myserver:// substring. See this regex demo. Note we need to remove empty entries to get rid of an empty leading entry as when the match is found at the string start, the empty string at the beginning will be added to the resulting list.
Alternatively, you can use the lookahead based pattern with a lazy dot matching pattern with re.findall:
rx = r"myserver://(.*?)(?=\s*,\s*myserver://|$)"
See the Python demo
Details:
myserver:// - a literal substring
(.*?) - Capturing group 1 whose contents will be returned by re.findall matching any 0+ chars other than line break chars, as few as possible, up to the first occurrence (but excluding it)
(?=\s*,\s*myserver://|$) - either of the 2 alternatives:
\s*,\s*myserver:// - , enclosed with 0+ whitespaces and then a literal myserver:// substring
| - or
$ - end of string.
Here is the regex demo.
See a Python demo for the both approaches:
import re
s = "myserver://password,123#ip:port/key1,myserver://pass2#ip2:port2/key2"
rx1 = r'\s*,?\s*myserver://'
res1 = filter(None, re.split(rx1, s))
print(res1)
#or
rx2 = r"myserver://(.*?)(?=\s*,\s*myserver://|$)"
res2 = re.findall(rx2, s)
print(res2)
Both will print ['password,123#ip:port/key1', 'pass2#ip2:port2/key2'].

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

extracting items using regular expression in python - python

I think you are looking for findall: import re TES=re.compile('TES[\d].') for item in new: result = TES.findall(item) print result

You can use a single replacement, example: import re result = re.sub(r'{"(TES\d)":"(=TES\d)"}}', '$1:$2', yourstr, 1)

Related

How to match regex in python?

only keep digits after ":" in regular expression in python [duplicate]

strange output regular expression r'[-.\:alnum:](.*)'

regex select sequences that start with specific number

python regex get value after string

Categories

Resources