Regex remove 'by' from a string - python

Update 2: https://regex101.com/r/bE5aWW/2
Update: This is what I can come up with so far, https://regex101.com/r/bE5aWW/1/, but need help to get rid of .
Case 1
\n \n by name name\n \n
Case 2
\n \n name name\n \n
Case 3
by name name
Case 4
name name
I would like to select the name part from the above strings, i.e. name name. The one I came up with, (?:by)? ([\w ]+), doesn't work when there are spaces before by.
Thanks
Code from regex101
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(?:by)? ([\w ]+)"
test_str = ("\\n \\n by Ally Foster\\n \\n \n\n"
"\\n \\n Ally Foster\\n \\n \n\n"
"by name name\n\n"
"name name")
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

(?:by )?(\b(?!by\b)[\w, ]+\S)
My final version, which also won't select strings that contain only by

I suggest using
re.findall(r'\b(?!by\b)[^\W\d_]+(?: *(?:, *)?[^\W\d_]+)*', s)
See the regex demo. In Python 2, you will need to pass re.U flag to make all the shorthand character classes and the word boundary Unicode aware. To also match tabs rather than just spaces, replace spaces with [ \t].
Details
\b - a word boundary
(?!by\b) - the next word cannot be by
[^\W\d_]+ - one or more letters
(?: *(?:, *)?[^\W\d_]+)* - a non-capturing group that matches 0 or more occurrences of:
* - zero or more spaces
(?:, *)? - an optional sequence of , and 0+ spaces
[^\W\d_]+ - one or more letters.
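The suggested pattern can be checked against the four cases from the question. This is a minimal sketch, assuming the \n sequences in the cases are real newline characters rather than a literal backslash followed by n; the helper name extract_names is hypothetical and not part of the original answer.

```python
import re

# Pattern from the answer above: a run of letter-only words
# whose first word is not the standalone word "by".
NAME_PATTERN = re.compile(r'\b(?!by\b)[^\W\d_]+(?: *(?:, *)?[^\W\d_]+)*')


def extract_names(s):
    """Return every name-like run of words found in s (hypothetical helper)."""
    return NAME_PATTERN.findall(s)


# The four cases from the question, assuming real newlines.
cases = [
    "\n \n by name name\n \n",
    "\n \n name name\n \n",
    "by name name",
    "name name",
]
for case in cases:
    print(extract_names(case))  # ['name name'] for each case
```

Note that the lookahead only guards the first word of each match, so a by that appears after a name (for example "name by name") would be kept as part of the match.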

Related

How to get all substrings between some delimiters in python [duplicate]

This question already has answers here:
How to use regex to find all overlapping matches
(5 answers)
Closed 3 years ago.
I am trying to get all the substrings that sit between certain delimiters. My issue is that I also need the delimiter at the end of one match to start the next match. The strings need to be between any of these characters: . , / , ? , = , - , _
I have tried this regular expression
pattern = re.compile(r"""[./?=\-_][^./?=\-_]+[./?=\-_]""")
In this example:
-facebook=chat.messenger?
I am not able to get the substring =chat.
I am only getting -facebook= and .messenger?
Looks like the overlap is what's causing the trouble. If you use the third-party regex module (installable from PyPI), you can do
import regex
delimiters = r'[./?=\-_]'
pattern = delimiters + r'[a-z]+' + delimiters
s = '-facebook=chat.messenger?'
print(regex.findall(pattern, s, overlapped=True))
# ['-facebook=', '=chat.', '.messenger?']
Notice that this assumes all characters are lowercase with [a-z], and that [./?=\-_] is the list of delimiters you specified.
Hope this helps!
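If installing a third-party module is not an option, overlapping matches can also be collected with the standard re module by wrapping the pattern in a lookahead. This is a sketch of that alternative technique, not the method used in the answer above; the names DELIMS, OVERLAP_PATTERN and delimited_substrings are hypothetical.

```python
import re

# Same delimiter set as in the question.
DELIMS = r'[./?=\-_]'

# The whole pattern sits inside a lookahead, so each match is zero-width.
# The closing delimiter is therefore not consumed and can open the next match.
OVERLAP_PATTERN = re.compile(r'(?=(' + DELIMS + r'[^./?=\-_]+' + DELIMS + r'))')


def delimited_substrings(s):
    """Return all delimiter-to-delimiter substrings, including overlaps."""
    return OVERLAP_PATTERN.findall(s)


print(delimited_substrings('-facebook=chat.messenger?'))
# ['-facebook=', '=chat.', '.messenger?']
```

Unlike the [a-z]+ version, this keeps the question's original [^./?=\-_]+ body, so it does not assume lowercase letters.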
My guess is that this expression might be a reasonable starting point:
((?:[/?=_–.-])([a-z]+)(?:[/?=_–.-]))|([a-z]+)
Demo
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"((?:[/?=_–.-])([a-z]+)(?:[/?=_–.-]))|([a-z]+)"
test_str = "-facebook=chat.messenger?"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

How can I check whether a string is Thai, returning a boolean like isalpha()?

I'm trying to check whether a str contains only Thai characters, using a regex or any other approach that works.
I'm currently using
regexp_thai = re.compile(u"[^\u0E00-\u0E7F']|^'|'$|''")
ret = regexp_thai.sub("", s)
to strip out other languages and digits.
However, this only strips characters; it does not return a boolean.
I expect output like
s = "engภาษาไทยที่มีสระ123!#"
regexp_thai = re.compile(u"[^\u0E00-\u0E7F']|^'|'$|''")
ret = regexp_thai.sub("", s)
print(ret) # ภาษาไทยที่มีสระ
print(isthai(ret)) # True
\u0E00-\u0E7F is the Unicode range for Thai.
How can I write the isthai function?
I'm not quite sure what the desired output is. However, I'm guessing that we'd like to capture the Thai letters. Based on your original expression, we can put the Thai range in a character class, wrap it in a capturing group, and collect the Thai letters from left to right, maybe similar to:
([\u0E00-\u0E7F]+)
DEMO
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"([\u0E00-\u0E7F]+)"
test_str = "engภาษาไทยที่มีสระ123!#"
matches = re.finditer(regex, test_str, re.MULTILINE | re.UNICODE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
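The question asks for a boolean check rather than extraction. One way to get that, sketched here for Python 3, is to require that the whole string consist of characters from the same \u0E00-\u0E7F range using re.fullmatch. The name isthai comes from the question; THAI_ONLY is a hypothetical name, and treating the empty string as not Thai is an assumption.

```python
import re

# One or more characters from the Thai Unicode block (U+0E00 to U+0E7F).
THAI_ONLY = re.compile(r'[\u0E00-\u0E7F]+')


def isthai(s):
    """Return True only if s is non-empty and every character is in the Thai block."""
    return THAI_ONLY.fullmatch(s) is not None


print(isthai("ภาษาไทยที่มีสระ"))          # True
print(isthai("engภาษาไทยที่มีสระ123!#"))  # False
```

The block also contains Thai digits and the baht sign, so strings made of those would pass as well; narrow the range if only letters and vowel marks should count.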
Demo
const regex = /([\u0E00-\u0E7F]+)/gmu;
const str = `engภาษาไทยที่มีสระ123!#`;
let m;
while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
RegEx
If this expression wasn't desired, it can be modified or changed in regex101.com.
RegEx Circuit
jex.im visualizes regular expressions:
Reference
Regular Expression to accept all Thai characters and English letters in python

RegEx for matching email in URLs

I have a regex in my django code but I don't know what it means actually. Here is my regex :
r'^email/(?P<email>[^#\s]+#[^#\s]+\.[^#\s]+)/$',
Could you give me some examples which match with this regex?
RegEx Circuit
You can visualize your expressions in jex.im:
You can also test/modify/change your expressions in regex101.com.
Basically, your expression would match:
email/some_alphanumeric[A-Z0-9]_special_chars_##$*some_alphanumeric_special_chars_#$*.some_alphanumeric_special_chars_#$*
Demo
If you wish to match:
myurl/email/blabla#blabla.com
You can modify it to:
myurl\/email\/([^#\s]+#[^#\s]+\.[^#\s]+)
Python Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"myurl\/email\/([^#\s]+#[^#\s]+\.[^#\s]+)"
test_str = "myurl/email/blabla#blabla.com"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
In addition:
r'^email/(?P<email>[^#\s]+#[^#\s]+\.[^#\s]+)/$'
This regex is used in a Django URL pattern.
URL example: email/test#gmail.com/
email/ = a constant prefix in your URL
[^#\s] = any character except # and whitespace (\s)
#[^#\s] = a literal # followed by any character except # and whitespace (\s)
\. = matches a literal "."
[^#\s] = any character except # and whitespace (\s)
+ = one or more of the preceding character class
/$ = a trailing slash at the end of the URL
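The breakdown above can be tried out with the plain re module, without Django. This is a small sketch using the pattern exactly as given in the question; the sample paths and the helper name captured_email are made up for illustration.

```python
import re

# Pattern copied from the question; the named group "email" holds the captured value.
EMAIL_URL = re.compile(r'^email/(?P<email>[^#\s]+#[^#\s]+\.[^#\s]+)/$')


def captured_email(path):
    """Return the captured email group, or None if the path does not match."""
    m = EMAIL_URL.match(path)
    return m.group('email') if m else None


print(captured_email('email/test#gmail.com/'))    # test#gmail.com
print(captured_email('email/a.b#example.org/'))   # a.b#example.org
print(captured_email('email/test#gmail.com'))     # None (no trailing slash)
print(captured_email('email/test#gmail/'))        # None (no dot after the #)
print(captured_email('email/a#b#c.com/'))         # None (a second # is not allowed)
print(captured_email('email/te st#gmail.com/'))   # None (whitespace is not allowed)
```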

Python findall Regex function catches only some text

I'm still new at Regex, and I've been trying to implement a Gmail validation algorithm in my Python program.
This is my Regex
mail_address = "hello.89#gmail.com"
result = re.findall(r'\w+[\w.]+(#gmail.com){1}', mail_address)
print (str(result))
The first char must be alphanumeric (\w+), from there it catches every set of chars ([\w.]+), followed by only one instance of #gmail.com
This is what it prints:
['#gmail.com']
But it should print
['hello.89#gmail.com']
What am I doing wrong?
EDIT: Here's the Regex I chose:
\A(\w+[\w.]+#gmail\.com)\Z
Just alter the parentheses so that it includes all of your desired output:
result = re.findall(r'(\w+[\w.]+#gmail.com)', mail_address)
I have slightly altered your expression insofar as the gmail.com part is now only a string. Additionally, you don't need to convert the results to string plus you don't need to repeat a group just once.
That being said, in the end, you'd end up having:
import re
mail_address = "hello.89#gmail.com"
result = re.findall(r'(\w+[\w.]+#gmail.com)', mail_address)
print (result)
# ['hello.89#gmail.com']
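The underlying behaviour is that re.findall returns the contents of the capturing group when the pattern has exactly one, instead of the whole match. The sketch below contrasts a capturing and a non-capturing group, and adds a hypothetical is_gmail helper that uses re.fullmatch for validation; escaping the dot in gmail\.com is my addition, in line with the regex in the question's edit.

```python
import re

mail_address = "hello.89#gmail.com"

# One capturing group: findall returns only what the group captured.
with_group = re.findall(r'\w+[\w.]+(#gmail\.com)', mail_address)
# Non-capturing group (?:...): findall returns the whole match.
without_group = re.findall(r'\w+[\w.]+(?:#gmail\.com)', mail_address)

print(with_group)     # ['#gmail.com']
print(without_group)  # ['hello.89#gmail.com']


def is_gmail(address):
    """Hypothetical validator: True only if the entire string fits the pattern."""
    return re.fullmatch(r'\w+[\w.]+#gmail\.com', address) is not None


print(is_gmail(mail_address))  # True
```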
Problem is in the parentheses as Jan mentioned. But your regex can be also simplified to this:
result = re.findall(r'(\w+[\w.]+#gmail.com)', mail_address)
Demo: https://regex101.com/r/Z5EGbZ/1
The {1} quantifier after #gmail.com is redundant.
This should work using only your regex, because match.group() returns the whole match rather than just the captured group:
import re
regex = r"\w+[\w.]+(#gmail.com){1}"
test_str = "hello.89#gmail.com"
matches = re.finditer(regex, test_str)
for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))

Remove leading zeros in middle of string with regex

I have a large number of strings in the format YYYYYYYYXXXXXXXXZZZZZZZZ, where X, Y, and Z are numbers of fixed length, eight digits each. Now, the problem is that I need to parse out the middle sequence of integers and remove any leading zeroes. Unfortunately, the only way to determine where each of the three sequences begins and ends is to count the number of digits.
I am currently doing it in two steps, i.e:
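import re
m = re.match(
    r"(?P<first_sequence>\d{8})"
    r"(?P<second_sequence>\d{8})"
    r"(?P<third_sequence>\d{8})",
    string)
second_sequence = m.group("second_sequence")
# lstrip takes a string of characters to strip and returns a new string
second_sequence = second_sequence.lstrip("0")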
Which does work, and gives me the right results, e.g.:
112233441234567855667788 --> 12345678
112233440012345655667788 --> 123456
112233001234567855667788 --> 12345678
112233000012345655667788 --> 123456
But is there a better method? Is it possible to write a single regular expression which matches the second sequence, sans the leading zeros?
I guess I am looking for a regex which does the following:
Skips over the first eight digits.
Skips any leading zeros.
Captures anything after that, up to the point where there are sixteen characters behind and eight in front.
The above solution does work, as mentioned, so the purpose of this problem is more to improve my knowledge of regex. I appreciate any pointers.
This is a typical case of "useless use of regular expressions".
Your strings are fixed-length. Just cut them at the appropriate positions.
s = "112233440012345655667788"
int(s[8:16])
# -> 123456
I think it's simpler not to use regex.
result = my_str[8:16].lstrip('0')
Agree with the other answers here that regex isn't really required. If you really want to use regex, then \d{8}0*(\d*)\d{8} should do it.
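To compare the two approaches on the sample strings from the question, here is a short sketch. It assumes the input is always 24 digits; the names MIDDLE, middle_via_regex and middle_via_slice are hypothetical.

```python
import re

# Eight digits, any leading zeros of the middle part, the captured remainder,
# then the final eight digits.
MIDDLE = re.compile(r'\d{8}0*(\d*)\d{8}')


def middle_via_regex(s):
    """Return the middle sequence without leading zeros, or None if s does not match."""
    m = MIDDLE.fullmatch(s)
    return m.group(1) if m else None


def middle_via_slice(s):
    """Same result by slicing the fixed-width field and stripping zeros."""
    return s[8:16].lstrip('0')


samples = [
    "112233441234567855667788",
    "112233440012345655667788",
    "112233001234567855667788",
    "112233000012345655667788",
]
for s in samples:
    # Both approaches should agree on every sample.
    assert middle_via_regex(s) == middle_via_slice(s)
    print(s, "-->", middle_via_regex(s))
```

Both functions return strings; wrap the result in int(...) if a number is wanted, keeping in mind that an all-zero middle sequence yields an empty string here.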
Just to show that it is possible with regex:
https://regex101.com/r/8RSxaH/2
# CODE AUTO GENERATED BY REGEX101.COM (SEE LINK ABOVE)
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(?<=\d{8})((?:0*)(\d{,8}))(?=\d{8})"
test_str = ("112233441234567855667788\n"
"112233440012345655667788\n"
"112233001234567855667788\n"
"112233000012345655667788")
matches = re.finditer(regex, test_str)
for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Although you don't really need it to do what you're asking
