Python negative regex - python

I have a string such as:
s = "The code for the product is A8H4DKE3SP93W6J and you can buy it here."
The text in this string will not always be in the same format, it will be dynamic, so I can't do a simple find and replace to obtain the product code.
I can see that:
re.sub(r'A[0-9a-zA-Z_]{14} ', '', s)
will get rid of the product code. How do I go about doing the opposite of this, i.e. deleting all of the text apart from the product code? The product code will always be a 15-character string starting with the letter A.
I have been racking my brain and Googling to find a solution, but can't seem to figure it out.
Thanks

Instead of substituting the rest of the string, use re.search() to search for the product number:
In [1]: import re
In [2]: s = "The code for the product is A8H4DKE3SP93W6J and you can buy it here."
In [3]: re.search(r"A[0-9a-zA-Z_]{14}", s).group()
Out[3]: 'A8H4DKE3SP93W6J'

In regex, you can capture the portion you want to keep by putting parentheses around that part of the pattern, and then refer to it in the replacement string with a backslash followed by the index of that group. In the code below, "(A[0-9A-Za-z_]{14})" is the portion you want to keep, and "\1" inserts it into the resulting string.
re.sub(r'.*(A[0-9A-Za-z_]{14}).*', r'\1', s)
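Both answers assume the string always contains a code. A minimal sketch of a more defensive variant follows; the helper name extract_product_code and the \b word boundaries are illustrative assumptions, not part of the original answers. It returns None instead of raising when no code is present.

```python
import re

# Same character class as in the question; \b (an assumption added here) keeps
# the match from starting or ending in the middle of a longer word.
CODE_PATTERN = re.compile(r"\bA[0-9a-zA-Z_]{14}\b")


def extract_product_code(text):
    """Return the first product code found in text, or None if there is none."""
    match = CODE_PATTERN.search(text)
    return match.group() if match else None


s = "The code for the product is A8H4DKE3SP93W6J and you can buy it here."
print(extract_product_code(s))                # A8H4DKE3SP93W6J
print(extract_product_code("No code here."))  # None
```

Calling .group() directly on the result of re.search(), as in the first answer, raises AttributeError when nothing matches, which is why the sketch checks the match object first.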

Related

I wish to take the middle pattern of the sentence in chinese character using regex

I tried to extract the middle words based on my pattern. Below is my code:
text = "東京都田中区9-7−4"
import re
#Sorry due to the edit problem and stackoverflow doesnt allow me to include long sentences here, please check my comment below for the compile function of re.
city = re.findall(r,text)
print("getCity: {}".format(city))
My current output:
getCity: ['都田中区']
My expected output:
getCity: ['田中区']
I do not want to take the [都道府県] so I use "?!" in my first beginning pattern as (?!...??[都道府県]). However, when I run my program, it shows that "都" is inside as well like I show on my current output. Could anyone please direct me on this?
The problem with your regex is that it is too permissive.
If you look at a visualisation of it (I have removed all the hardcoded city (市) names because they are irrelevant),
you can see a lot of "any character" repeated x times, or just "not 市" and "not 町" repeated x times. These are what match the 都道府県 in your string. Therefore, these are the places where you should disallow 都道府県.
The corresponding regex would be:
(?:余市|高市|[^都道府県市]{2,3}?)郡(?:玉村|大町|[^都道府県]{1,5}?)[町村]|(?:[^都道府県]{1,4}市)?[^都道府県町]{1,4}?区|[^都道府県]{1,7}?[市町村]
Remember to add the hardcoded cities when you put this in your code!
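As a quick check, the pattern above can be run against the sample string from the question. This is only a sketch: it uses the shortened pattern exactly as posted, without the hard-coded city names, and splits it across lines purely for readability.

```python
import re

# Pattern copied from the answer above (hard-coded city names still omitted).
pattern = re.compile(
    r"(?:余市|高市|[^都道府県市]{2,3}?)郡(?:玉村|大町|[^都道府県]{1,5}?)[町村]"
    r"|(?:[^都道府県]{1,4}市)?[^都道府県町]{1,4}?区"
    r"|[^都道府県]{1,7}?[市町村]"
)

text = "東京都田中区9-7−4"
city = pattern.findall(text)
print("getCity: {}".format(city))  # getCity: ['田中区']
```

Because every character class now excludes 都道府県, no match can start on or run across the prefecture suffix, so the match begins at 田.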

Regex to match a word and the first parenthesis I find

I need a regex that matches a word like 'estabilidade' and then matches anything until it reaches the first parenthesis.
I have already tried some regexes that I found on the internet, but I have difficulty writing my own, as I don't understand very well how they work.
Can someone help me?
The regexes I have already tried were:
re.search(r"([^\(]+)", resultado) -> trying to get just the parenthesis.
and
re.search(r"estabilidade((\s*|.*))\(+", resultado).group(1)
Real Example (need to pick up all the numbers inside the parenthesis, but knowing which word this number is related to. For instance, the first 7 is related to the sentence 'Procura por estabilidade'):
Procura por
estabilidade
(7)
É assertivo(a)
com os outros
(5)
Procura convencer
os outros
(7)
Espontaneamente
se aproxima
dos outros
LIDERANÇA INFLUÊ
10
9
(6)
Demonstra
diplomacia
(5)
You didn't specify which part of the matched string you want, so I included some extra groups.
import re
s = 'hello there estabilidade this is just some text (yes it is)'
# .*? lazily takes everything up to the whitespace before the first "("
r = re.search(r"(estabilidade(.*?))\s*\(", s)
print(r.group(1)) # "estabilidade this is just some text"
print(r.group(2)) # " this is just some text"
Something like this?
In [1]: import re
In [2]: re.findall(r'([^()]+)\((\d+)\)', 'estabilidade_smth(10) estabilidade_other(20)')
Out[2]: [('estabilidade_smth', '10'), (' estabilidade_other', '20')]
This should do it:
estabilidade([^(]+)
It uses a negated character class, which is the key takeaway and a good tool to have in your bag. [] is a character class, i.e. a list of characters; if you put ^ as the first character, it matches any character not in the list. So [^(] means any character that isn't (. Adding + means at least one of the item to its left. Putting all that together, we want at least one character that is not (.
Here it is in Python:
import re
text = "hello estabilidade how are you today (at the farm)"
print (re.search("estabilidade([^(]+)", text).group(1))
Output:
how are you today
Example to play with:
https://regex101.com/r/2qxa0y/1/
Here is a good site to learn some of the basic regex tricks, this will go a long way: https://www.regular-expressions.info/tutorial.html
For my question, I solved the problem with the following regex, using the tool suggested by one of the users here (https://regex101.com/r/2qxa0y/1/):
((|.|[(]|\s)*)\((\d*)\)
Thanks to everyone!!
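For the multi-line sample in the question, one possible way to pair each phrase with its number is sketched below. It is an alternative to the regexes above, not the asker's final solution: it assumes every number of interest sits alone inside parentheses and that everything since the previous number belongs to the phrase. The variable names and the shortened sample are illustrative.

```python
import re

# Shortened copy of the sample text from the question.
resultado = """Procura por
estabilidade
(7)
É assertivo(a)
com os outros
(5)
Procura convencer
os outros
(7)"""

# Lazily take everything up to the next "(digits)" group.
# re.DOTALL lets "." cross line breaks; "(a)" is skipped because it holds no digits.
pattern = re.compile(r"(.+?)\s*\((\d+)\)", re.DOTALL)

# Collapse the line breaks inside each phrase and convert the number to int.
pairs = [(" ".join(label.split()), int(number))
         for label, number in pattern.findall(resultado)]
print(pairs)
# [('Procura por estabilidade', 7), ('É assertivo(a) com os outros', 5), ('Procura convencer os outros', 7)]
```

On the full sample, stray lines such as "LIDERANÇA INFLUÊ", "10" and "9" would end up inside the phrase that precedes "(6)", so some clean-up of the input may still be needed.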

Regex (Python) to count elements in domain name

I would like to parse a URL and count the number of "elements" in its domain name.
If, for example, I had the URL http://news.bbc.co.uk/foo/bar/xyzzy.html, I would be interested in the number 4 (news, bbc, co, uk).
I have always shunned regular expressions as too cryptic. I would normally do this by splitting the string between // and / and counting dots in between. This time I decided to move away from my comfort zone and boldly try some self-improvement and do this with regular expressions, counting the number of match groups.
This is what I tried:
pattern = r"^.*//(([^./]+\.)+)/.*$"
but this does not match anything. I know there is a problem somewhere there, at least in handling the final part of the domain uk/ (should be counted in but then something else than a dot should be consumed), but still breaking the match group pattern so that parsing enters the tail part.
My idea was to first consume everything until // including //. This part probably works. Then I would start matching groups where a group is anything that is not . or /, repeat until a dot, then consume the dot, until all such groups have been consumed. These would be the match groups I am interested in. Then consume / and deal with the rest as I am not interested in it anymore. This goes wrong.
Or is this a futile attempt to use regex somewhere where it is not suitable?
Assuming consistent input, you can do:
^[^:]+://([^/]+)
^[^:]+ matches one or more characters from start till first :
:// matches the characters literally
([^/]+) the captured group contains one or more characters till next /
You would get e.g. news.bbc.co.uk using the above; then it's a matter of a simple str.split('.').
Note: the obvious advice is not to use regex for this, but a proper URL parser library (e.g. urlparse, which lives in urllib.parse in Python 3).
Example:
In [49]: s = 'http://news.bbc.co.uk/foo/bar/xyzzy.html'
In [50]: re.search(r'^[^:]+://([^/]+)', s).group(1).split('.')
Out[50]: ['news', 'bbc', 'co', 'uk']
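The parser-based route mentioned in the note could look like the sketch below. It assumes Python 3, where the parser lives in urllib.parse, and the variable names are illustrative.

```python
from urllib.parse import urlsplit

url = "http://news.bbc.co.uk/foo/bar/xyzzy.html"

# hostname is the network location without any port or credentials.
host = urlsplit(url).hostname
labels = host.split(".")

print(labels)       # ['news', 'bbc', 'co', 'uk']
print(len(labels))  # 4
```

Unlike the regex, this also copes with URLs that carry a port or user information in front of the host.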
You can try this regex:
import re
pattern = r'(?:\/\/)(\w+)|(?<=\.)(\w+)'
string = 'http://news.bbc.co.uk/foo/bar/xyzzy.html'
result = []
for i in re.finditer(pattern, string):
    # group 1 is the label right after "//"; group 2 is any label that follows a dot
    if i.group(1) is not None:
        result.append(i.group(1))
    # skip the file extension, which also follows a dot
    elif i.group(2) is not None and i.group(2) != 'html':
        result.append(i.group(2))
print(result)
output:
['news', 'bbc', 'co', 'uk']
But the cool thing is that you can do this with very little code using tldextract:
import tldextract
result = tldextract.extract("http://news.bbc.co.uk/foo/bar/xyzzy.html")
print([part.split('.') for part in (result.subdomain, result.domain, result.suffix)])
output:
[['news'], ['bbc'], ['co', 'uk']]

Regular expression capturing entire match consisting of repeated groups

I've looked through the forums but could not find exactly how to solve my problem.
Let's say I have a string like the following:
UDK .636.32/38.082.4454.2(575.3)
and I would like to match the expression with a regex, capturing the actual number (in this case the '.636.32/38.082.4454.2(575.3)').
There could be some garbage characters between the 'UDK' and the actual number, and characters like '.', '/' or '-' are valid parts of the number. Essentially the number is a sequence of digits separated by some allowed characters.
What I've come up with is the following regex:
'UDK.*(\d{1,3}[\.\,\(\)\[\]\=\'\:\"\+/\-]{0,3})+'
but it does not group the '.636.32/38.082.4454.2(575.3)'! It leaves me with nothing more than a last digit of the last group (3 in this case).
Any help would be greatly appreciated.
First, you need a non-greedy .*?.
Second, most of those characters don't need escaping inside [ ].
Third, you might just treat it as a sequence of digits and some allowed characters. Why \d{1,3} when the sample contains 4454?
>>> re.match(r'UDK.*?([\d.,()\[\]=\':"+/-]+)', s).group(1)
'.636.32/38.082.4454.2(575.3)'
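The reason the original attempt returned only the tail of the number is that a capturing group placed under a quantifier keeps just its final repetition. The sketch below illustrates this with a simplified character class ([./()] is an assumption chosen to fit the sample, not the full set of allowed separators) and shows that wrapping the repetition in a single outer group captures the whole number.

```python
import re

s = "UDK .636.32/38.082.4454.2(575.3)"

# A repeated capturing group only remembers its last repetition.
repeated = re.search(r"(\d+[./()]*)+", s)
print(repeated.group(0))  # 636.32/38.082.4454.2(575.3)
print(repeated.group(1))  # 3)

# Repeating a non-capturing group inside one capturing group keeps everything.
whole = re.search(r"UDK.*?((?:[./()]*\d+[./()]*)+)", s)
print(whole.group(1))     # .636.32/38.082.4454.2(575.3)
```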
Not so much a direct answer to your problem, but a general regexp tip: use Kodos (http://kodos.sourceforge.net/). It is simply awesome for composing/testing out regexps. You can enter some sample text, and "try out" regular expressions against it, seeing what matches, groups, etc. It even generates Python code when you're done. Good stuff.
Edit: using Kodos I came up with:
UDK.*?(?P<number>[\d/.)(]+)
as a regexp which matches the given example. Code that Kodos produces is:
import re
rawstr = r"""UDK.*?(?P<number>[\d/.)(]+)"""
matchstr = """UDK .636.32/38.082.4454.2(575.3)"""
# method 1: using a compile object
compile_obj = re.compile(rawstr)
match_obj = compile_obj.search(matchstr)
# Retrieve group(s) by name
number = match_obj.group('number')

re.match() multiple times in the same string with Python

I have a regular expression to find :ABC:`hello` pattern. This is the code.
format =r".*\:(.*)\:\`(.*)\`"
patt = re.compile(format, re.I|re.U)
m = patt.match(l.rstrip())
if m:
    ...
It works well when the pattern happens once in a line, but with an example ":tagbox:`Verilog` :tagbox:`Multiply` :tagbox:`VHDL`". It finds only the last one.
How can I find all the three patterns?
EDIT
Based on Paul Z's answer, I got it working with this code:
format = r"\:([^:]*)\:\`([^`]*)\`"
patt = re.compile(format, re.I|re.U)
for m in patt.finditer(l.rstrip()):
    tag, value = m.groups()
    print(tag, ":::", value)
Result
tagbox ::: Verilog
tagbox ::: Multiply
tagbox ::: VHDL
Yeah, dcrosta suggested looking at the re module docs, which is probably a good idea, but I'm betting you actually wanted the finditer function. Try this:
# lazy .*? stops at the first possible delimiter, so each tag is matched separately
format = r"\:(.*?)\:\`(.*?)\`"
patt = re.compile(format, re.I|re.U)
for m in patt.finditer(l.rstrip()):
    tag, value = m.groups()
    ....
Your current solution always finds the last one because the initial .* eats as much as it can while still leaving a valid match (the last one). Incidentally, this is probably also making your program much slower than it needs to be, because .* first tries to eat the entire string, then backs up character by character as the remaining expression tells it "that was too much, go back". Using finditer should perform much better.
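The greedy behaviour described above can be seen side by side with a pattern that cannot overrun the delimiters. This is a small sketch on the sample line from the question; the variable names are illustrative.

```python
import re

line = ":tagbox:`Verilog` :tagbox:`Multiply` :tagbox:`VHDL`"

# Greedy: the leading .* swallows the first two tags, so only the last one is captured.
greedy = re.match(r".*:(.*):`(.*)`", line)
print(greedy.groups())  # ('tagbox', 'VHDL')

# Negated character classes cannot run past the next ":" or "`",
# so finditer yields one match per tag.
pairs = [m.groups() for m in re.finditer(r":([^:]*):`([^`]*)`", line)]
print(pairs)  # [('tagbox', 'Verilog'), ('tagbox', 'Multiply'), ('tagbox', 'VHDL')]
```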
A good place to start is the re module docs. In addition to re.match (which matches only at the beginning of the string), there is re.findall (finds all non-overlapping occurrences of the pattern), and the methods match and search of compiled RegexObjects, both of which accept start and end positions to limit the portion of the string being considered. See also split, which returns a list of substrings, split by the pattern. Depending on how you want your output, one of these may help.
re.findall or even better regex.findall can do that for you in a single line:
import regex as re #or just import re
s = ":tagbox:`Verilog` :tagbox:`Multiply` :tagbox:`VHDL`"
format = r"\:([^:]*)\:\`([^`]*)\`"
re.findall(format,s)
result is:
[('tagbox', 'Verilog'), ('tagbox', 'Multiply'), ('tagbox', 'VHDL')]
