regex to match a word and the first parenthesis I find - python

I need a regex that matches a word like 'estabilidade' and then matches anything up to the first parenthesis.
I have already tried some regexes that I found on the internet, but I have difficulty writing my own, as I don't understand very well how they work.
Can someone help me?
The regexes I have already tried were:
re.search(r"([^\(]+)", resultado) -> trying to get just the parenthesis.
and
re.search(r"estabilidade((\s*|.*))\(+", resultado).group(1)
Real example (I need to pick up all the numbers inside the parentheses, while knowing which phrase each number relates to. For instance, the first 7 relates to the phrase 'Procura por estabilidade'):
Procura por
estabilidade
(7)
É assertivo(a)
com os outros
(5)
Procura convencer
os outros
(7)
Espontaneamente
se aproxima
dos outros
LIDERANÇA INFLUÊ
10
9
(6)
Demonstra
diplomacia
(5)

You didn't specify which part of the matched string you want, so I included a couple of groups.
import re
s = 'hello there estabilidade this is just some text (yes it is)'
r = re.search(r"(estabilidade([^(]+))\(", s)  # [^(]+ = everything up to the first "("
print(r.group(1)) # "estabilidade this is just some text " (note the trailing space)
print(r.group(2)) # " this is just some text "

Something like this?
In [1]: import re
In [2]: re.findall(r'([^()]+)\((\d+)\)', 'estabilidade_smth(10) estabilidade_other(20)')
Out[2]: [('estabilidade_smth', '10'), (' estabilidade_other', '20')]

This should do it:
estabilidade([^(]+)
It uses a negated character class, which is the key takeaway and a good tool to have in your bag. [] is a character class, i.e. a list of characters. If you put ^ as the first character, it becomes a list of characters that must not match. So [^(] means any character that isn't (. Adding + means at least one of the item to its left. Putting all that together, we want at least one character that is not (.
Here it is in Python:
import re
text = "hello estabilidade how are you today (at the farm)"
print (re.search("estabilidade([^(]+)", text).group(1))
Output:
how are you today
Example to play with:
https://regex101.com/r/2qxa0y/1/
Here is a good site to learn some of the basic regex tricks, this will go a long way: https://www.regular-expressions.info/tutorial.html

For my question, I solved the problem with the following regex, using the tool suggested by one of the users here (https://regex101.com/r/2qxa0y/1/):
((|.|[(]|\s)*)\((\d*)\)
Thanks to everyone!!
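For anyone who wants to pair every label with its number in one pass, here is a minimal sketch. It assumes the text looks like the sample in the question (a label spread over several lines, followed by a number in parentheses); the variable names are only illustrative. The lazy `.*?` stops at the first `(digits)` it can find, so the `(a)` in 'É assertivo(a)' is skipped because it contains no digits.

```python
import re

text = """Procura por
estabilidade
(7)
É assertivo(a)
com os outros
(5)
Procura convencer
os outros
(7)"""

# Lazily capture everything up to the next "(digits)".
# re.DOTALL lets "." match newlines, so a label may span several lines.
pattern = re.compile(r"(.*?)\((\d+)\)", re.DOTALL)

# Collapse the line breaks inside each label into single spaces.
pairs = [(" ".join(label.split()), int(number))
         for label, number in pattern.findall(text)]

for label, number in pairs:
    print(label, "->", number)
```

Stray lines between a number and the next label (such as the '10' and '9' in the sample) would end up inside the following label, so real data may need an extra cleaning step.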

Related

Python negative regex

I have a string such as:
s = "The code for the product is A8H4DKE3SP93W6J and you can buy it here."
The text in this string will not always be in the same format, it will be dynamic, so I can't do a simple find and replace to obtain the product code.
I can see that:
re.sub(r'A[0-9a-zA-Z_]{14} ', '', s)
will get rid of the product code. How do I go about doing the opposite of this, i.e. deleting all of the text apart from the product code? The product code will always be a 15-character string starting with the letter A.
I have been racking my brain and Googling to find a solution, but can't seem to figure it out.
Thanks
Instead of substituting the rest of the string, use re.search() to search for the product number:
In [1]: import re
In [2]: s = "The code for the product is A8H4DKE3SP93W6J and you can buy it here."
In [3]: re.search(r"A[0-9a-zA-Z_]{14}", s).group()
Out[3]: 'A8H4DKE3SP93W6J'
In a regex, you can capture the portion you want to keep by putting parentheses around that part of the pattern, and then refer to it in the replacement string with a backslash followed by the group's index. In the code below, "(A[0-9A-Za-z_]{14})" is the portion you want to capture, and "\1" inserts it into the resulting string.
re.sub(r'.*(A[0-9A-Za-z_]{14}).*', r'\1', s)
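One caveat with calling .group() directly on the result of re.search(): if the text contains no product code, re.search() returns None and the call raises AttributeError. A small sketch of a safer wrapper follows; the function name extract_code is made up for illustration, and the \b word boundaries are an added assumption so that a longer alphanumeric run is not mistaken for a code.

```python
import re

# 'A' followed by exactly 14 word characters, delimited by word boundaries.
CODE_PATTERN = re.compile(r"\bA[0-9A-Za-z_]{14}\b")

def extract_code(text):
    """Return the first product code found in text, or None if there is none."""
    match = CODE_PATTERN.search(text)
    return match.group() if match else None

s = "The code for the product is A8H4DKE3SP93W6J and you can buy it here."
print(extract_code(s))
print(extract_code("no code here"))
```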

regexp for nvda to put spaces between all capital letters?

So, I use NVDA, a free screen reader for the blind that many people use, and a speech synthesizer. I am building a library of modified versions of addons which it takes, and dictionaries that can contain regular expressions acceptable by python, as well as standard word replacement operation.
My thing is, I do not know how to design a regular expression that will place a space between capital letters such as in ANM, which the synth says as one word rather than spelling it like it should.
I do not know enough python to manually code an addon for this thing, I only use regexp for this kind of thing. I do know regular expressions basics, the general implementation, which you can find by googling "regular expressions in about 55 minutes".
I want it to do something like this.
Input: ANM
Output: A N M
Also with the way this speech synth works, I may have to replace A with eh, which would make this.
Input: ANM
Output: Eh N M
Could any of you provide me a regular expression to do this if it is possible? And no, I don't think I can compile them in loops because I didn't write the python.
This should do the trick for the capital letters, it uses ?= to look ahead for the next capital letter without 'eating it up':
>>> import re
>>> re.sub("([A-Z])(?=[A-Z])", r"\1 ", "ABC thIs iS XYZ a Test")
'A B C thIs iS X Y Z a Test'
If you have a lot of replacements to make, it might be easiest to put them into a single variable:
replacements = [("A", "eh"), ("B", "bee"), ("X", "ex")]
result = re.sub("([A-Z])(?=[A-Z])", r"\1 ", "ABC thIs iS XYZX. A Xylophone")
for source, dest in replacements:
    result = re.sub("("+source+r")(?=\W)" , dest, result)
print(result)
Output:
eh bee C thIs iS ex Y Z ex. eh Xylophone
I built a regex in the 'replacements' code to handle capitalised words and standalone capitals at the end of sentences correctly. If you want to avoid replacing e.g. the standalone 'A' with 'eh', then the more advanced regex replacement function mentioned in @fjarri's answer is the way to go.
While @Galax's solution certainly works, it may be easier to perform further processing of abbreviations if you use callbacks on matches (this way you won't replace any standalone capitals):
import re
s = "This is a normal sentence featuring an abbreviation ANM. One, two, three."
def process_abbreviation(match_object):
    spaced = ' '.join(match_object.group(1))
    return spaced.replace('A', 'Eh')
print(re.sub("([A-Z]{2,})", process_abbreviation, s))
Okay, found the answer. Using a sequence of regexes in a certain order, I got it to work. Thank you guys, you helped me form the basis and you are appreciated.
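For readers who can run Python directly, the two ideas above (spacing out the letters and swapping in pronunciations) can be combined in a single callback. This is only a sketch: the PRONUNCIATIONS table and the function names are hypothetical, and it will not help where only a bare regex and replacement string can be entered, as in a speech dictionary.

```python
import re

# Hypothetical pronunciation table; letters not listed are kept as they are.
PRONUNCIATIONS = {"A": "Eh"}

def spell_out(match):
    """Space out the letters of one abbreviation, applying pronunciations."""
    return " ".join(PRONUNCIATIONS.get(letter, letter) for letter in match.group(0))

def expand_abbreviations(text):
    # Only whole words made of two or more capitals are treated as abbreviations,
    # so a standalone "A" or a capitalised word is left alone.
    return re.sub(r"\b[A-Z]{2,}\b", spell_out, text)

expanded = expand_abbreviations("The ANM file. A test of XY.")
print(expanded)
```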

in python find index in list if combination of strings exist

I'm writing my first script and trying to learn python.
But I'm stuck and can't get out of this one.
I'm writing a script to change file names.
Lets say I have a string = "this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv"
I want the result to be string = "This Is Test3 E00"
this is what I have so far:
l = list(string)
//Transform the string into list
for i in l:
if "E" in l:
p = l.index("E")
if isinstance((p+1), int () is True:
if isinstance((p+2), int () is True:
delp = p+3
a = p-3
del l[delp:]
new = "".join(l)
new = new.replace("."," ")
print (new)
It gets the index of "E" and checks whether there are 2 integers after the "E".
Then it deletes everything after the second integer.
However, this will not work if there is an "E" anywhere else.
at the moment the result I get is:
this is tEst
because it finds the index of the first "E" in the list and deletes everything after index+3.
I guess my question is: how do I get the index in the list if a combination of strings exists?
I can't seem to find out how.
Thanks, everyone, for the answers.
I was going in another direction, but it is also not working.
If someone could see why, it would be awesome. It is much better to learn by doing than just copying what others write :)
this is what I came up with:
for i in l:
    if i=="E" and isinstance((i+1), int ) is True:
        p = l.index(i)
        print (p)
Can anyone tell me why this isn't working? I get an error.
Thank you so much
Have you ever heard of a Regular Expression?
Check out python's re module. Link to the Docs.
Basically, you can define a "regex" that would match "E and then two integers" and give you the index of it.
After that, I'd just use python's "Slice Notation" to choose the piece of the string that you want to keep.
Then, check out the string methods for str.replace to swap the periods for spaces, and str.title to put them in Title Case
An easy way is to use a regex to find up until the E followed by 2 digits criteria, with s as your string:
import re
up_until = re.match(r'(.*?E\d{2})', s).group(1)
# this.is.tEst3.E00
Then, we replace the . with a space and then title case it:
output = up_until.replace('.', ' ').title()
# This Is Test3 E00
The technique to consider using is Regular Expressions. They allow you to search for a pattern of text in a string, rather than a specific character or substring. Regular Expressions have a bit of a tough learning curve, but are invaluable to learn and you can use them in many languages, not just in Python. Here is the Python resource for how Regular Expressions are implemented:
http://docs.python.org/2/library/re.html
The pattern you are looking to match in your case is an "E" followed by two digits. In Regular Expressions (usually shortened to "regex" or "regexp"), that pattern looks like this:
E\d\d # ('\d' is the specifier for any digit 0-9)
In Python, you create a string of the regex pattern you want to match, and pass that and your file name string into the search() function of the re module. Regex patterns tend to use a lot of special characters, so it's common in Python to prefix the regex pattern string with 'r', which tells the Python interpreter not to interpret backslashes as escape characters. All of this together looks like this:
import re
filename = 'this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv'
match_object = re.search(r'E\d\d', filename)
if match_object:
    # Group 0 is the whole match; end(0) is the index just past it
    index_of_Exx = match_object.end(0)
    truncated_filename = filename[:index_of_Exx]
    # Now take care of any more processing
Regular expressions can get very detailed (and complex). In fact, you can probably accomplish your entire task of fully changing the file name using a single regex that's correctly put together. But since I don't know the full details about what sorts of weird file names might come into your program, I can't go any further than this. I will add one more piece of information: if the 'E' could possibly be lower-case, then you want to add a flag as a third argument to your pattern search which indicates case-insensitive matching. That flag is 're.I' and your search() method would look like this:
match_object = re.search(r'E\d\d', filename, re.I)
Read the documentation on Python's 're' module for more information, and you can find many great tutorials online, such as this one:
http://www.zytrax.com/tech/web/regex.htm
And before you know it you'll be a superhero. :-)
The reason why this isn't working:
for i in l:
    if i=="E" and isinstance((i+1), int ) is True:
        p = l.index(i)
        print (p)
...is because 'i' contains a character from the list 'l', not an integer. You compare it with 'E' (which works), but then try to add 1 to it, which raises a TypeError because a string and an integer cannot be added.
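Since you said you prefer to learn by doing, here is a sketch of the loop-based approach done with indexes instead of characters, without any regex. The function name find_episode_marker is made up for illustration. The idea is to loop over positions, so that "the next two characters" can be sliced out and tested with str.isdigit().

```python
def find_episode_marker(name):
    """Return the index of the first 'E' followed by two digits, or -1."""
    for i in range(len(name) - 2):
        # name[i + 1:i + 3] is the two characters right after position i
        if name[i] == "E" and name[i + 1:i + 3].isdigit():
            return i
    return -1

filename = "this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv"
index = find_episode_marker(filename)

if index != -1:
    # Keep everything up to and including the two digits, then tidy up.
    cleaned = filename[:index + 3].replace(".", " ").title()
else:
    cleaned = filename

print(cleaned)
```

The "E" in "tEst3" is skipped because the two characters after it are "st", not digits.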

REGEX: Parsing n digits with non numeric word boundaries

I hope this message finds you in good spirits. I am trying to find a quick tutorial on the \b expression (apologies if there is a better term). I am writing a script at the moment to parse some XML files, but have run into a bit of a speed bump. I will show an example of my XML:
<....></...><...></...><OrderId>123456</OrderId><...></...>
<CustomerId>44444444</CustomerId><...></...><...></...>
<...> is unimportant and non relevant xml code. Focus primarily on the CustomerID and OrderId.
My issue lies in parsing a string, similar to the above statement. I have a regexParse definition that works perfectly. However it is not intuitive. I need to match only the part of the string that contains 44444444.
My Current setup is:
searchPattern = '>\d{8}</CustomerId'
Great! It works, but I want to do it the right way. My thinking is: 1) find 8 digits, 2) if the non-numeric text on either side of them matches CustomerId, return them.
Idea:
searchPattern = '\bd{16}\b'
My issue in my tests is incorporating the search for CustomerId somewhere before and after the digits. I was wondering if any of you can either help me out with my issue, or point me in the right path (in words of a guide or something along the lines). Any help is appreciated.
Mods if this is in the wrong area apologies, I wanted to post this in the Python discussion because I am not sure if Python regex supports this functionality.
Thanks again all,
darcmasta
txt = """
<....></...><...></...><OrderId>123456</OrderId><...></...>
<CustomerId>44444444</CustomerId><...></...><...></...>
"""
import re
pattern = r"<(\w+)>(\d+)<"
print(re.findall(pattern, txt))
# output: [('OrderId', '123456'), ('CustomerId', '44444444')]
You might consider using a lookbehind assertion in your regex to make it easy for a human to read:
import re
a = re.compile("(?<=OrderId>)\\d{6}")
a.findall("<....></...><...></...><OrderId>123456</OrderId><...></...><CustomerId>44444444</CustomerId><...></...><...></...>")
['123456']
b = re.compile("(?<=CustomerId>)\\d{8}")
b.findall("<....></...><...></...><OrderId>123456</OrderId><...></...><CustomerId>44444444</CustomerId><...></...><...></...>")
['44444444']
You should be using raw string literals:
searchPattern = r'\b\d{16}\b'
The escape sequence \b in a plain (non-raw) string literal represents the backspace character, so that's what the re module would be receiving (unrecognised escape sequences such as \d get passed on as-is, i.e. backslash followed by 'd').
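A short sketch to make the difference visible, and to show one way of anchoring the digits to the CustomerId tags on both sides. The combination of a lookbehind and a lookahead is just one option under the assumption that the tags look exactly as in the question; the variable names are illustrative.

```python
import re

xml = "<OrderId>123456</OrderId><CustomerId>44444444</CustomerId>"

# In a plain string "\b" is a single backspace character;
# in a raw string it stays as backslash + "b", which re reads as a word boundary.
print(len("\b"), len(r"\b"))

# Match 8 digits only when they sit between the CustomerId tags.
pattern = re.compile(r"(?<=<CustomerId>)\d{8}(?=</CustomerId>)")
match = pattern.search(xml)
customer_id = match.group() if match else None
print(customer_id)
```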

How to find all words followed by symbol using Python Regex?

I need re.findall to detect words that are followed by a "="
So it works for an example like
re.findall('\w+(?=[=])', "I think Python=amazing")
but it won't work for "I think Python = amazing" or "Python =amazing"...
I do not know how to possibly integrate the whitespace issue here properly.
Thanks a bunch!
'(\w+)\s*=\s*'
re.findall(r'(\w+)\s*=\s*', 'I think Python=amazing')  # returns ['Python']
re.findall(r'(\w+)\s*=\s*', 'I think Python = amazing')  # returns ['Python']
re.findall(r'(\w+)\s*=\s*', 'I think Python =amazing')  # returns ['Python']
You said "Again stuck in the regex" probably in reference to your earlier question Looking for a way to identify and replace Python variables in a script where you got answers to the question that you asked, but I don't think you asked the question you really wanted the answer to.
You are looking to refactor Python code, and unless your tool understands Python, it will generate false positives and false negatives; that is, finding instances of variable = that aren't assignments and missing assignments that aren't matched by your regexp.
There is a partial list of tools at What refactoring tools do you use for Python? and more general searches with "refactoring Python your_editing_environment" will yield more still.
Just add some optional whitespace before the =:
\w+(?=\s*=)
Use this instead
re.findall('^(.+)(?=[=])', "I think Python=amazing")
Explanation
# ^(.+)(?=[=])
#
# Options: case insensitive
#
# Assert position at the beginning of the string «^»
# Match the regular expression below and capture its match into backreference number 1 «(.+)»
# Match any single character that is not a line break character «.+»
# Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=[=])»
# Match the character “=” «[=]»
You need to allow for whitespace between the word and the =:
re.findall('\w+(?=\s*[=])', "I think Python = amazing")
You can also simplify the expression by using a capturing group around the word, instead of a non-capturing group around the equals:
re.findall('(\w+)\s*=', "I think Python = amazing")
r'(.*)=.*' would do it as well ...
You have anything #1 followed with a = followed with anything #2, you get anything #1.
>>> re.findall(r'(.*)=.*', "I think Python=amazing")
['I think Python']
>>> re.findall(r'(.*)=.*', " I think Python = amazing oh yes very amazing ")
[' I think Python ']
>>> re.findall(r'(.*)=.*', "= crazy ")
['']
Then you can strip() the string that is in the list returned.
re.split(r'\s*=', "I think Python=amazing")[0].split() # returns ['I', 'think', 'Python']
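To compare the behaviour on all three inputs from the question at once, here is a small sketch built on the lookahead version. The extra (?!=) is an addition of mine, not part of the answers above: it keeps a comparison such as "x == y" from being reported as a word followed by "=". As noted earlier in the thread, no regex of this kind really understands Python code, so treat it as a rough filter.

```python
import re

# A word, optional whitespace, then a single "=" that is not part of "==".
pattern = re.compile(r"\w+(?=\s*=(?!=))")

samples = [
    "I think Python=amazing",
    "I think Python = amazing",
    "Python =amazing",
    "x = 1; y=2",
    "x == y",
]

results = [pattern.findall(sample) for sample in samples]
for sample, found in zip(samples, results):
    print(repr(sample), "->", found)
```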
