Replacing a special identifier pattern with re.sub in Python

I am new to regex and have a regex replacement in a re.sub that I can't figure out.
import re
test_cases = [
"1-Some String #0123",
"2-Some String #1234-56-a",
"3-Some String #1234-56A ",
"4-Some String (Fubar/ #12-345-67A)",
"5-Some String (Fubar - #12-345.67 A)",
"6-Some String / #123",
"7-Some String/#0233",
"8-Some #1 String/#0233"
]
for test in test_cases:
    test = re.sub(r'[/|#][A-Z|a-z|0-9|-]*','', test)
    print(test)
The code should print:
1-Some String
2-Some String
3-Some String
4-Some String (Fubar)
5-Some String (Fubar)
6-Some String
7-Some String
8-Some #1 String
But instead I am currently getting this (with 4, 5 and 8 not converted correctly):
1-Some String
2-Some String
3-Some String
4-Some String (Fubar )
5-Some String (Fubar - .67 A)
6-Some String
7-Some String
8-Some String

Please try the following:
import re
test_cases = [
"1-Some String #0123",
"2-Some String #1234-56-a",
"3-Some String #1234-56A ",
"4-Some String (Fubar/ #12-345-67A)",
"5-Some String (Fubar - #12-345.67 A)",
"6-Some String / #123",
"7-Some String/#0233",
"8-Some #1 String/#0233"
]
for test in test_cases:
    test = re.sub(r'\s*([/#]|- )[\sA-Za-z0-9-#\.]*(?=(\)|$))','', test)
    print(test)
Result:
1-Some String
2-Some String
3-Some String
4-Some String (Fubar)
5-Some String (Fubar)
6-Some String
7-Some String
8-Some #1 String
The substring to delete can be defined as follows:
It starts with "/", "#" or "- "
It may be preceded by whitespace
It consists of whitespace, alphanumerics, hyphens, hashes or dots
It is anchored by the end of the line or ")" by using a positive lookahead
The regex then looks like:
\s*([/#]|- )[\sA-Za-z0-9-#\.]*(?=(\)|$))
The positive lookahead may require some explanation. The pattern (?=regex)
is a zero-width assertion meaning "followed by regex".
The benefit is that the matched substring does not include the text matched
by the lookahead, so you can use the lookahead as an anchor.
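As a sketch, the same pattern can be written in verbose mode so that each of the rules above sits next to the part of the regex that implements it. The names PATTERN and strip_identifier are made up for this illustration. The space after the hyphen is written as [ ] because re.VERBOSE ignores unescaped whitespace outside character classes, and the capturing groups are dropped since nothing reads them.

```python
import re

# Verbose rewrite of the pattern above; PATTERN and strip_identifier are
# hypothetical names used only for this sketch.
PATTERN = re.compile(r"""
    \s*                    # optional leading whitespace
    (?:[/#]|-[ ])          # starts with '/', '#' or '- '
    [\sA-Za-z0-9#.-]*      # whitespace, alphanumerics, hashes, dots or hyphens
    (?=\)|$)               # must be followed by ')' or the end of the string
""", re.VERBOSE)

def strip_identifier(text):
    # The lookahead is zero-width, so a closing ')' is never removed.
    return PATTERN.sub("", text)

print(strip_identifier("5-Some String (Fubar - #12-345.67 A)"))  # 5-Some String (Fubar)
print(strip_identifier("8-Some #1 String/#0233"))                # 8-Some #1 String
```

In the last case " #1 String" is left alone because the run of allowed characters stops at "/", which is neither ")" nor the end of the string, so the lookahead fails there.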

Another option is to match only the last occurrence of # using a negative lookahead (?![^#\n\r]*#). For clarity I have put matching a space [ ] between square brackets.
[ ]*(?:[/-][ ]*)?#(?![^#\n\r]*#)[\da-zA-Z. -]+
Explanation
[ ]* Match 0+ times a space
(?:[/-][ ]*)? Optionally match / or - and 0+ spaces
# Match literally
(?![^#\n\r]*#) Negative lookahead, assert that what is on the right does not contain #
[\da-zA-Z. -]+ Match 1+ times what is listed in the character class
In the replacement use an empty string.
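A minimal sketch of how this pattern might be used from Python, applied to three of the test cases from the question. The variable names last_hash, samples and cleaned are only illustrative.

```python
import re

# Pattern from the explanation above; only the last '#' on a line can match.
last_hash = re.compile(r"[ ]*(?:[/-][ ]*)?#(?![^#\n\r]*#)[\da-zA-Z. -]+")

samples = [
    "4-Some String (Fubar/ #12-345-67A)",
    "5-Some String (Fubar - #12-345.67 A)",
    "8-Some #1 String/#0233",
]

# Replace each match with an empty string.
cleaned = [last_hash.sub("", s) for s in samples]
print(cleaned)
# ['4-Some String (Fubar)', '5-Some String (Fubar)', '8-Some #1 String']
```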

It is probably easier to do it in two steps:
First: Clean up the part in parentheses. After the '(' and some letters, remove everything up to the closing ')'.
Second: Remove the unwanted stuff at the end of a line. The part to keep ends either at a '#' followed by 2 or more digits or at a '/'. There may be whitespace before the '#' or '/'.
import re
# test_cases is the list defined in the question
paren_re = re.compile(r"([(][a-zA-Z]+)([^)]*)")
eol_re = re.compile(r"(.*?)\s*(?:#\d\d|/).*")
for line in test_cases:
    result = paren_re.sub(r"\1", line)
    result = eol_re.sub(r"\1", result)
    print(result)

I couldn't fit them into one regex, maybe someone can. Here's a 2-line solution:
import re
test_cases = [
"1-Some String #0123",
"2-Some String #1234-56-a",
"3-Some String #1234-56A ",
"4-Some String (Fubar/ #12-345-67A)",
"5-Some String (Fubar - #12-345.67 A)",
"6-Some String / #123",
"7-Some String/#0233",
"8-Some #1 String/#0233"
]
for test in test_cases:
    test = re.sub(r'[\/#][\w\s\d\-]*', '', test)
    test = re.sub(r'[\s\.\-\d]+\w+\)', ')', test)
    print(test)
Output:
1-Some String
2-Some String
3-Some String
4-Some String (Fubar)
5-Some String (Fubar)
6-Some String
7-Some String
8-Some
Explanation:
\w for word characters (letters, digits and underscore)
\d for 0-9
\s for whitespace
\. for a dot
\- for a minus sign
But I'm confused by the last line of your expected output: on what basis is "#1 String" kept? Once you confirm the rule, a specific regex can be written for that pattern.

Related

How to split a string with parentheses and spaces into a list

I want to split strings like:
(so) what (are you trying to say)
what (do you mean)
Into lists like:
[(so), what, (are you trying to say)]
[what, (do you mean)]
The code that I tried is below. On the regexr site, the regular expression matches the parts that I want but gives a warning. I'm not an expert in regex, so I don't know what I'm doing wrong.
import re
string = "(so) what (are you trying to say)?"
rx = re.compile(r"((\([\w \w]*\)|[\w]*))")
print(re.split(rx, string ))
Using [\w \w]* is the same as [\w ]* and also matches an empty string.
Instead of using split, you can use re.findall without any capture groups and write the pattern like:
\(\w+(?:[^\S\n]+\w+)*\)|\w+
\( Match (
\w+ Match 1+ word chars
(?:[^\S\n]+\w+)* Optionally repeat matching spaces and 1+ word chars
\) Match )
| Or
\w+ Match 1+ word chars
import re
string = "(so) what (are you trying to say)? what (do you mean)"
rx = re.compile(r"\(\w+(?:[^\S\n]+\w+)*\)|\w+")
print(re.findall(rx, string))
Output
['(so)', 'what', '(are you trying to say)', 'what', '(do you mean)']
For your two examples you can write:
re.split(r'(?<=\)) +| +(?=\()', str)
This does not work, however, for the string defined in the OP's code, which ends with a question mark, unlike the two examples given in the statement of the question.
The regular expression can be broken down as follows.
(?<=\))  # positive lookbehind asserts that the location in the string is preceded by ')'
[ ]+     # match one or more spaces
|        # or
[ ]+     # match one or more spaces
(?=\()   # positive lookahead asserts that the location in the string is followed by '('
In the above I've put each of the two space characters in a character class merely to make them visible.
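For illustration, here is a small sketch that runs this split on the two examples from the question and on the string from the OP's code. The name splitter is an assumption made for this example.

```python
import re

# Split on spaces that directly follow ')' or directly precede '('.
splitter = re.compile(r"(?<=\)) +| +(?=\()")

print(splitter.split("(so) what (are you trying to say)"))
# ['(so)', 'what', '(are you trying to say)']
print(splitter.split("what (do you mean)"))
# ['what', '(do you mean)']

# With the trailing question mark, the '?' stays attached to the last item.
print(splitter.split("(so) what (are you trying to say)?"))
# ['(so)', 'what', '(are you trying to say)?']
```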

Regular Expression that return matches specific strings in bracket and return its next and preceding string in brackets

I need to find the string in brackets that matches some specific string and all values in the string.
Right now I am only getting values from the position where the string matches.
import re
text = 'This is an sample string which have some information in brackets (info; matchingString, someotherString).'
regex= r"\(*?matchingString.*?\)"
matches = re.findall(regex, text)
From this I am getting result
matchingString, someotherString)
What I want is to get the string before the matching string as well.
The result should be like this:
(info; matchingString, someotherString)
This regex works if the matching string is in the first string in brackets.
You can use
\([^()]*?matchingString[^)]*\)
Due to the [^()]*?, the match will never overflow across other (...) substrings.
Regex details:
\( - a ( char
[^()]*? - zero or more chars other than ( and ) as few as possible
matchingString - a hardcoded string
[^)]* - zero or more chars other than )
\) - a ) char.
See the Python demo:
import re
text = 'This is an sample string which have some information in brackets (info; matchingString, someotherString).'
regex= r"\([^()]*?matchingString[^)]*\)"
print( re.findall(regex, text) )
# => ['(info; matchingString, someotherString)']
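To see why the negated character classes matter, the sketch below compares the pattern with a looser variant that uses .*? instead. The looser pattern and the sample text with two bracketed groups are assumptions introduced only for this comparison.

```python
import re

# Hypothetical input with an unrelated bracketed group before the wanted one.
text = "See (first note) and (info; matchingString, someotherString)."

# Looser variant (not from the answer): '.' may cross ')' and '('.
loose = r"\(.*?matchingString.*?\)"
# Pattern from the answer: the match stays inside one pair of brackets.
strict = r"\([^()]*?matchingString[^)]*\)"

print(re.findall(loose, text))
# ['(first note) and (info; matchingString, someotherString)']
print(re.findall(strict, text))
# ['(info; matchingString, someotherString)']
```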

python string split only by `/` but not `//` [duplicate]

I have a string like this
"yJdz:jkj8h:jkhd::hjkjh"
I want to split it using colon as a separator, but not a double colon. Desired result:
("yJdz", "jkj8h", "jkhd::hjkjh")
I'm trying with:
re.split(":{1}", "yJdz:jkj8h:jkhd::hjkjh")
but I got a wrong result.
In the meantime I'm escaping "::" with string.replace("::", "$$").
You could split on (?<!:):(?!:). This uses two negative lookarounds (a lookbehind and a lookahead) which assert that a valid match only has one colon, without a colon before or after it.
To explain the pattern:
(?<!:) # assert that the previous character is not a colon
: # match a literal : character
(?!:) # assert that the next character is not a colon
Both lookarounds are needed, because if there was only the lookbehind, then the regular expression engine would match the first colon in :: (because the previous character isn't a colon), and if there was only the lookahead, the second colon would match (because the next character isn't a colon).
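The effect of dropping either lookaround can be checked directly. This is a small sketch using the string from the question, with the variable name s chosen for illustration.

```python
import re

s = "yJdz:jkj8h:jkhd::hjkjh"

# Lookbehind only: the first colon of '::' still matches.
print(re.split(r"(?<!:):", s))
# ['yJdz', 'jkj8h', 'jkhd', ':hjkjh']

# Lookahead only: the second colon of '::' still matches.
print(re.split(r":(?!:)", s))
# ['yJdz', 'jkj8h', 'jkhd:', 'hjkjh']

# Both lookarounds: only isolated colons match.
print(re.split(r"(?<!:):(?!:)", s))
# ['yJdz', 'jkj8h', 'jkhd::hjkjh']
```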
You can do this with lookahead and lookbehind, if you want:
>>> import re
>>> s = "yJdz:jkj8h:jkhd::hjkjh"
>>> l = re.split(r"(?<!:):(?!:)", s)
>>> print(l)
['yJdz', 'jkj8h', 'jkhd::hjkjh']
This regex essentially says "match a : that is neither preceded nor followed by a :".

How to cut non-alphanumeric prefix and suffix from a string in Python?

How to cut all characters from the beginning and the end of the string which are not alphanumerical?
For example:
print(clearText('%!_./123apple_42.juice_(./$)'))
# => '123apple_42.juice'
print(clearText(' %!_also remove.white_spaces(./$) '))
# => 'also remove.white_spaces'
You could use this pattern: ^[^a-zA-Z0-9]+|[^a-zA-Z0-9]+$
Explanation:
^[^a-zA-Z0-9]+ - match one or more non-alphanumeric characters at the beginning of a string (thanks to ^)
[^a-zA-Z0-9]+$ - match one or more non-alphanumeric characters at the end of a string (thanks to $)
| means alternation, so it matches a run of non-alphanumeric characters at the beginning or at the end
Then it's enough to replace matches with empty string.
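A minimal sketch of that replacement in Python, reusing the clearText name from the question. EDGE_JUNK is a hypothetical name for the compiled pattern.

```python
import re

# Leading or trailing runs of non-alphanumeric characters.
EDGE_JUNK = re.compile(r"^[^a-zA-Z0-9]+|[^a-zA-Z0-9]+$")

def clearText(s):
    # Replace both edge matches with an empty string.
    return EDGE_JUNK.sub("", s)

print(clearText('%!_./123apple_42.juice_(./$)'))        # 123apple_42.juice
print(clearText(' %!_also remove.white_spaces(./$) '))  # also remove.white_spaces
```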
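This grabs everything from the first alphanumeric character to the last one.
import re
def clearText(s):
    # Greedy match from the first to the last alphanumeric character.
    # Note: raises AttributeError if s has fewer than two alphanumerics.
    return re.search("[a-zA-Z0-9].*[a-zA-Z0-9]", s).group(0)
print(clearText("%!_./123apple_42.juice_(./$)"))
# prints 123apple_42.juice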

Match characters and whitespaces, but not numbers

I am trying to create a regex that will match characters, whitespaces, but not numbers.
So hello 123 will not match, but hell o will.
I tried this:
[^\d\w]
but, I cannot find a way to add whitespaces here. I have to use \w, because my strings can contain Unicode characters.
Brief
It's unclear what exactly characters refers to, but, assuming you mean alpha characters (based on your input), this regex should work for you.
Code
^(?:(?!\d)[\w ])+$
Note: This regex uses the mu flags for multiline and Unicode (multiline only necessary if input is separated by newline characters)
Results
Input
ÀÇÆ some words
ÀÇÆ some words 123
Output
This only shows matches
ÀÇÆ some words
Explanation
^ Assert position at the start of the line
(?:(?!\d)[\w ])+ Match the following one or more times (tempered greedy token)
    (?!\d) Negative lookahead ensuring what follows doesn't match a digit. You can change this to (?![\d_]) if you want to ensure _ is also not used.
    [\w ] Match any word character or space (matches Unicode word characters with the u flag)
$ Assert position at the end of the line
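The answer above does not name a language, so here is a hedged sketch of how the pattern might be used in Python 3. Python 3 str patterns are Unicode-aware by default, so only the multiline flag is passed. The names pattern and text are illustrative.

```python
import re

# Python 3 str patterns match Unicode word characters by default, so only
# the multiline flag has to be passed explicitly.
pattern = re.compile(r"^(?:(?!\d)[\w ])+$", re.MULTILINE)

# The two input lines from the Results section, separated by a newline.
text = "ÀÇÆ some words\nÀÇÆ some words 123"

print(pattern.findall(text))
# ['ÀÇÆ some words']
```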
You can use a lookahead:
(?=^\D+$)[\w\s]+
In Python:
import re
strings = ['hello 123', 'hell o']
rx = re.compile(r'(?=^\D+$)[\w\s]+')
new_strings = [string for string in strings if rx.match(string)]
print(new_strings)
# ['hell o']
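One caveat, sketched under the assumption that punctuation should be rejected as well (the question does not say either way): rx.match only needs a matching prefix, so a string such as 'hell-o' is accepted. Using fullmatch instead requires the whole string to consist of word characters and whitespace.

```python
import re

rx = re.compile(r'(?=^\D+$)[\w\s]+')

# 'hell-o' contains no digits, so the lookahead succeeds, and match()
# is satisfied by the prefix 'hell'.
print(bool(rx.match('hell-o')))      # True

# fullmatch() requires [\w\s]+ to cover the whole string.
print(bool(rx.fullmatch('hell-o')))  # False
print(bool(rx.fullmatch('hell o')))  # True
```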
