python's re: find words beginning with "string" in any case

I'm trying to write a regex that returns a list of the words that begin with barbar, in any case. It must return not the whole word, but only the matching part. For example, from the string
string = u'baRbarus, semibarbarus: qui BARbari sunt, alteres BARBARos non sequuntur!'
# output is...
>>> ['baRbar', 'BARbar', 'BARBAR']
I've tried such code:
re.compile(ur"([\A\b]*)(barbar)", re.UNICODE | re.IGNORECASE).findall(string)
# it returns...
[(u'', u'baRbar'), (u'', u'barbar'), (u'', u'BARbar'), (u'', u'BARBAR')]
It seems that I misunderstood something. Could you help me, please? It would also be great if you could recommend some good tutorials about the re module. It's too hard to understand re from Python's default documentation. Thanks!

The following regex is sufficient for what you want to do (as long as flags are set):
\bbarbar
Example:
>>> s = u'baRbarus, semibarbarus: qui BARbari sunt, alteres BARBARos non sequuntur!'
>>> re.findall(r'\bbarbar', s, re.IGNORECASE | re.UNICODE)
[u'baRbar', u'BARbar', u'BARBAR']
Here are some comments on your current regex which may clarify why \bbarbar does the job:
[\A\b] - \A is normally the start of string, and \b is word boundary, but inside of a character class \b becomes a backspace and I'm not really sure what \A becomes
[\A\b]* - This is why your regex matched 'semibarbarus', the * means 0 or more so it doesn't require that portion to match, if you dropped the * and fixed the above problem it would work
([\A\b]*)(barbar) - Multiple groups mean that re.findall() will return a tuple of the groups, rather than just the portion you are interested in
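A minimal sketch of the points above, written in Python 3 syntax (the thread itself uses Python 2, so the u'' prefixes are dropped here). The variable names are only illustrative. It shows that two capturing groups make re.findall() return tuples, that a pattern without groups returns the matched text directly, and that \b inside a character class matches a backspace character:

```python
import re

s = 'baRbarus, semibarbarus: qui BARbari sunt, alteres BARBARos non sequuntur!'

# Two capturing groups: findall returns one tuple per match.
two_groups = re.findall(r'(\b)(barbar)', s, re.IGNORECASE)

# No groups: findall returns the matched text itself.
no_groups = re.findall(r'\bbarbar', s, re.IGNORECASE)

# Inside a character class, \b is a backspace (chr(8)), not a word boundary.
backspace = re.findall(r'[\b]', 'a\bb')

print(two_groups)  # [('', 'baRbar'), ('', 'BARbar'), ('', 'BARBAR')]
print(no_groups)   # ['baRbar', 'BARbar', 'BARBAR']
print(backspace)   # ['\x08']
```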

Because you only want the words beginning with barbar, you can split the string into words first. Something like this:
def findBarbarus(my_string):
    result = []
    for s in my_string.split(" "):
        result += re.compile(ur"(^barbar)", re.UNICODE | re.IGNORECASE).findall(s)
    return result
The ^ in the regular expression means that the word must begin with barbar.

You could try...
string = 'baRbarus, semibarbarus: qui BARbari sunt, alteres BARBARos non sequuntur!'
l=re.findall(' barbar.+? |^barbar.+? ', string, re.IGNORECASE)
print l

Just for the record: If you use \A inside a character class e.g. r"[\A]", it should be treated like a literal A. However it is silently ignored. The same happens with \B and \Z.
I have reported the bug.

Related

re.match never returns None? [duplicate]

There is a problem that I need to do, but there are some caveats that make it hard.
Problem: Match on all non-empty strings over the alphabet {abc} that contain at most one a.
Examples
a
abc
bbca
bbcabb
Nonexample
aa
bbaa
Caveats: You cannot use a lookahead/lookbehind.
What I have is this:
^[bc]*a?[bc]*$
but it matches empty strings. Maybe a hint? Idk anything would help
(And if it matters, I'm using python).
As I understand your question, the only problem is that your current pattern matches empty strings. To prevent this, you can use a word boundary \b at the start to require at least one word character.
^\b[bc]*a?[bc]*$
See demo at regex101
Another option would be to alternate in a group. Match an a surrounded by any amount of [bc] or one or more [bc] from start to end which could look like: ^(?:[bc]*a[bc]*|[bc]+)$
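A quick way to check these suggestions is to run them against the examples from the question plus the empty string. This is only a sketch; the dictionary keys and variable names are made up for illustration:

```python
import re

candidates = ['a', 'abc', 'bbca', 'bbcabb', 'aa', 'bbaa', '']

patterns = {
    'original': r'^[bc]*a?[bc]*$',               # from the question
    'boundary': r'^\b[bc]*a?[bc]*$',             # \b needs a word character next
    'alternation': r'^(?:[bc]*a[bc]*|[bc]+)$',   # either one a, or only b/c
}

# For each pattern, collect the candidates it accepts.
results = {
    name: [w for w in candidates if re.search(p, w)]
    for name, p in patterns.items()
}

for name, accepted in results.items():
    print(name, accepted)
```

Only the original pattern accepts the empty string; the other two accept exactly the four examples.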
The way I understood the issue was that any character in the alphabet should match, with at most one a character.
Match on all non-empty strings over the alphabet... at most one a
^[b-z]*a?[b-z]*$
If spaces can be included:
^([b-z]*\s?)*a?([b-z]*\s?)*$
You do not even need a regex here, you might as well use .count() and a list comprehension:
data = """a,abc,bbca,bbcabb,aa,bbaa,something without the bespoken letter,ooo"""
def filter(string, char):
    return [word
            for word in string.split(",")
            for c in [word.count(char)]
            if c in [0,1]]
print(filter(data, 'a'))
Yielding
['a', 'abc', 'bbca', 'bbcabb', 'something without the bespoken letter', 'ooo']
You've got to positively match something excluding the empty string,
using only a, b, or c letters. But can't use assertions.
Here is what you do.
The regex ^(?:[bc]*a[bc]*|[bc]+)$
The explanation
^                      # BOS
(?:                    # Cluster choice
     [bc]* a [bc]*     # only 1 [a] allowed, arbitrary [bc]'s
  |                    # or,
     [bc]+             # no [a]'s only [bc]'s ( so must be some )
)                      # End cluster
$                      # EOS

how to make a list in python from a string and using regular expression [duplicate]

I have a sample string <alpha.Customer[cus_Y4o9qMEZAugtnW] active_card=<alpha.AlphaObject[card] ...>, created=1324336085, description='Customer for My Test App', livemode=False>
I only want the value cus_Y4o9qMEZAugtnW and NOT card (which is inside another [])
How could I do it in easiest possible way in Python?
Maybe by using RegEx (which I am not good at)?
How about:
import re
s = "alpha.Customer[cus_Y4o9qMEZAugtnW] ..."
m = re.search(r"\[([A-Za-z0-9_]+)\]", s)
print m.group(1)
For me this prints:
cus_Y4o9qMEZAugtnW
Note that the call to re.search(...) finds the first match to the regular expression, so it doesn't find the [card] unless you repeat the search a second time.
Edit: The regular expression here is a python raw string literal, which basically means the backslashes are not treated as special characters and are passed through to the re.search() method unchanged. The parts of the regular expression are:
\[ matches a literal [ character
( begins a new group
[A-Za-z0-9_] is a character set matching any letter (capital or lower case), digit or underscore
+ matches the preceding element (the character set) one or more times.
) ends the group
\] matches a literal ] character
Edit: As D K has pointed out, the regular expression could be simplified to:
m = re.search(r"\[(\w+)\]", s)
since the \w is a special sequence which means the same thing as [a-zA-Z0-9_] depending on the re.LOCALE and re.UNICODE settings.
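For reference, here is a small Python 3 sketch of the same idea run against the full sample string from the question. It assumes the elided part of the sample ("...") contains no further square brackets, and the variable names are illustrative. It also guards against re.search() returning None:

```python
import re

s = ("<alpha.Customer[cus_Y4o9qMEZAugtnW] active_card=<alpha.AlphaObject[card] ...>, "
     "created=1324336085, description='Customer for My Test App', livemode=False>")

# re.search stops at the first bracketed word.
m = re.search(r"\[(\w+)\]", s)
first = m.group(1) if m else None

# re.findall collects every bracketed word, including [card].
all_ids = re.findall(r"\[(\w+)\]", s)

print(first)    # cus_Y4o9qMEZAugtnW
print(all_ids)  # ['cus_Y4o9qMEZAugtnW', 'card']
```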
You could use str.split to do this.
s = "<alpha.Customer[cus_Y4o9qMEZAugtnW] active_card=<alpha.AlphaObject[card]\
...>, created=1324336085, description='Customer for My Test App',\
livemode=False>"
val = s.split('[', 1)[1].split(']')[0]
Then we have:
>>> val
'cus_Y4o9qMEZAugtnW'
This should do the job:
re.match(r"[^[]*\[([^]]*)\]", yourstring).groups()[0]
your_string = "lnfgbdgfi343456dsfidf[my data] ljfbgns47647jfbgfjbgskj"
your_string[your_string.find("[")+1 : your_string.find("]")]
courtesy: Regular expression to return text between parenthesis
You can also use
re.findall(r"\[([A-Za-z0-9_]+)\]", string)
if there are many occurrences that you would like to find.
See also for more info:
How can I find all matches to a regular expression in Python?
You can use
import re
s = re.search(r"\[.*?]", string)
if s:
print(s.group(0))
How about this? Example illustrated using a file:
import re
with open('abc.log', 'r') as f:
    for line in f:
        m = re.search(r"\[(.*?)\]", line)
        if m:  # skip lines that have no bracketed value
            print(m.group(1))
Hope this helps:
Magic regex : \[(.*?)\]
Explanation:
\[ : [ is a meta char and needs to be escaped if you want to match it literally.
(.*?) : match everything in a non-greedy way and capture it.
\] : ] is a meta char and needs to be escaped if you want to match it literally.
This snippet should work too, but it will return any text enclosed within "[]"
re.findall(r"\[([a-zA-Z0-9 ._]*)\]", your_text)
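The difference between the non-greedy (.*?) explained above and a greedy (.*) only shows up when a line contains more than one pair of brackets. A small sketch with a made-up input string:

```python
import re

text = "first [alpha] then [beta]"  # hypothetical sample input

# Non-greedy: stops at the nearest closing bracket.
lazy = re.findall(r"\[(.*?)\]", text)

# Greedy: runs on to the last closing bracket on the line.
greedy = re.findall(r"\[(.*)\]", text)

print(lazy)    # ['alpha', 'beta']
print(greedy)  # ['alpha] then [beta']
```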

Use python 3 regex to match a string in double quotes

I want to match a string contained in a pair of either single or double quotes. I wrote a regex pattern as so:
pattern = r"([\"\'])[^\1]*\1"
mytext = '"bbb"ccc"ddd'
re.match(pattern, mytext).group()
The expected output would be:
"bbb"
However, this is the output:
"bbb"ccc"
Can someone explain what's wrong with the pattern above? I googled and found the correct pattern to be:
pattern = r"([\"\'])[^\1]*?\1"
However, I don't understand why I must use ?.
In your regex
([\"'])[^\1]*\1
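A character class matches exactly one character, and inside a character class \1 is not a backreference: Python treats it as an octal escape for the character with code 1. So [^\1] does not mean "anything but the opening quote"; it matches almost any character, including quotes, which is why the greedy * runs on to the last matching quote. Think, too, about what would have happened if there were more than one character in the first capturing group.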
You can use negative lookahead like this
(["'])((?!\1).)*\1
or simply with alternation
(["'])(?:[^"'\\]+|\\.)*\1
or
(?<!\\)(["'])(?:[^"'\\]+|\\.)*\1
if you want to make sure "b\"ccc" does not match in the string bb\"b\"ccc"
You should use a negative lookahead assertion. And I assume there won't be any escaped quotes in your input string.
>>> pattern = r"([\"'])(?:(?!\1).)*\1"
>>> mytext = '"bbb"ccc"ddd'
>>> re.search(pattern, mytext).group()
'"bbb"'
You can use:
pattern = r"[\"'][^\"']*[\"']"
https://regex101.com/r/dO0cA8/1
[^\"']* will match everything that isn't " or '
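To see what is going on with the patterns in this thread, here is a short Python 3 sketch (variable names are illustrative). It checks that [^\1] happily matches a quote character, and compares the greedy, lazy and lookahead versions on the string from the question:

```python
import re

mytext = '"bbb"ccc"ddd'

# Inside [...] the sequence \1 is an octal escape for chr(1), not a
# backreference, so [^\1] accepts quote characters as well.
accepts_quote = re.fullmatch(r"[^\1]", '"') is not None
rejects_chr1 = re.fullmatch(r"[^\1]", "\x01") is None

# Greedy: [^\1]* swallows everything, then backtracks to the LAST quote.
greedy = re.match(r"([\"'])[^\1]*\1", mytext).group()

# Lazy: *? expands one character at a time, so it stops at the FIRST quote.
lazy = re.match(r"([\"'])[^\1]*?\1", mytext).group()

# Lookahead: each character is only consumed if it is not the opening quote.
lookahead = re.search(r"([\"'])(?:(?!\1).)*\1", mytext).group()

print(greedy)     # "bbb"ccc"
print(lazy)       # "bbb"
print(lookahead)  # "bbb"
```

So the ? in the question's second pattern does not fix the character class; it only makes the repetition stop at the earliest closing quote.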

Match string between special characters

I've messed around with regex a little bit but am pretty unfamiliar with it for the most part. The string will be in the format:
\n\n*text here, can be any spaces, etc. etc.*
The string that I will get will have two line breaks, followed by an asterisk, followed by text, and then ending with another asterisk.
I want to exclude the beginning \n\n from the returned text. This is the pattern that I've come up with so far and it seems to work:
pattern = "(?<=\\n\\n)\*(.*)(\*)"
match = re.search(pattern, string)
if match:
    text = match.group()
    print (text)
else:
    print ("Nothing")
I'm wondering if there is a better way to go about matching this pattern or if the way I'm handling it is okay.
Thanks.
You can avoid capturing groups and have the whole match as result using:
pattern = r'(?<=\n\n\*)[^*]*(?=\*)'
Example:
import re
print re.findall(r'(?<=\n\n\*)[^*]*(?=\*)','\n\n*text here, can be any spaces, etc. etc.*')
If you want to include the asterisk in the result you can use instead:
pattern = r'(?<=\n\n)\*[^*]*\*'
Regular expressions are overkill in a case like this -- if the delimiters are always static and at the head/tail of the string:
>>> s = "\n\n*text here, can be any spaces, etc. etc.*"
>>> def CheckString(s):
...     if s.startswith("\n\n*") and s.endswith("*"):
...         return s[3:-1]
...     else:
...         return "(nothing)"
>>> CheckString(s)
'text here, can be any spaces, etc. etc.'
>>> CheckString("no delimiters")
'(nothing)'
Adjust the slice indexes as needed; it wasn't clear to me whether you want to keep the leading/trailing '*' characters. If you want to keep them, change the slice to:
return s[2:]

python's re: replace regex to regex

I have to replace text with the text that was found. Something like this:
regex = u'barbar'
oldstring = u'BarBaR barbarian BarbaRONt'
pattern = re.compile(regex, re.UNICODE | re.DOTALL | re.IGNORECASE)
newstring = pattern.sub(.....)
print(newstring) # And here is what I want to see
>>> u'TEXT1BarBaRTEXT2 TEXT1barbarTEXT2ian TEXT1BarbaRTEXT2ONt'
So I want to get back my original text, where each part that matches 'barbar' (ignoring case) is surrounded by two words, TEXT1 and TEXT2. The return value must be a unicode string.
How can I do this? Thanks!
You can use capturing group for that:
regex = u'(barbar)'
...
pattern.sub('TEXT1\\1TEXT2', oldstring)
# => u'TEXT1BarBaRTEXT2 TEXT1barbarTEXT2ian TEXT1BarbaRTEXT2ONt'
Putting barbar in parentheses makes the regexp capture every part of the string that matches it into a group. As it's the first (and only) capturing group, you can refer to it as \1 anywhere in the replacement string or in the regexp itself.
For more explanation see (...) and \number sections in the docs.
Btw, if you don't like escaping of the slash before group number you can use raw string instead:
pattern.sub(r'TEXT1\1TEXT2', oldstring)
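As a variation (a sketch in Python 3, not part of the answer above), the pattern can stay without parentheses if the replacement refers to the whole match with \g<0>, or if a function builds the replacement from the match object. The variable names are illustrative:

```python
import re

pattern = re.compile(r'barbar', re.IGNORECASE)
oldstring = 'BarBaR barbarian BarbaRONt'

# \g<0> in the replacement template stands for the entire match.
with_template = pattern.sub(r'TEXT1\g<0>TEXT2', oldstring)

# A callable receives each match object and returns the replacement text.
with_function = pattern.sub(lambda m: 'TEXT1' + m.group(0) + 'TEXT2', oldstring)

print(with_template)
# TEXT1BarBaRTEXT2 TEXT1barbarTEXT2ian TEXT1BarbaRTEXT2ONt
```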
