I am using this \d{2}-\d{2}-\d{4} to validate my string. It works to pull the sequence of numbers out of said string resulting in 02-01-1716 however, i also need to pull the letter the string begins with and ends with; i.e. Q:\Region01s\FY 02\02-01-1716A.pdf i need the Q as well as the A so in the end i would have Q: 02-01-1716A
You can use
import re
regex = r"^([a-zA-Z]:)\\(?:.*\\)?(\d{2}-\d{2}-\d{4}[a-zA-Z]?)"
text = r"Q:\Region01s\FY 02\02-01-1716A.pdf"
match = re.search(regex, text)
if match:
print(f"{match.group(1)} {match.group(2)}")
# => Q: 02-01-1716A
See the Python demo. Also, see the regex demo. Details:
^ - start of string
([a-zA-Z]:) - Group 1: a letter and :
\\ - a backslash
(?:.*\\)? - an optional sequence of any chars other than line break chars as many as possible, followed with a backslash
(\d{2}-\d{2}-\d{4}[a-zA-Z]?) - Group 2: two digits, -, two digits, -, four digits, an optional letter.
The output - if there is a match - is a concatenation of Group 1, space and Group 2 values.
You can try:
(.).*(.)\.[^\.]+$
Or with the validation:
(.).*\d{2}-\d{2}-\d{4}(.)\.[^\.]+$
Related
String 1:
[impro:0,grp:00,time:0xac,magic:0x00ac] CAR<7:5>|BIKE<4:0>,orig:0x8c,new:0x97
String 2:
[impro:0,grp:00,time:0xbc,magic:0x00bc] CAKE<4:0>,orig:0x0d,new:0x17
In string 1, I want to extract CAR<7:5 and BIKE<4:0,
In string 2, I want to extract CAKE<4:0
Any regex for this in Python?
You can use \w+<[^>]+
DEMO
\w matches any word character (equivalent to [a-zA-Z0-9_])
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy).
< matches the character <
[^>] Match a single character not present in the list
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
We can use re.findall here with the pattern (\w+.*?)>:
inp = ["[impro:0,grp:00,time:0xac,magic:0x00ac] CAR<7:5>|BIKE<4:0>,orig:0x8c,new:0x97", "[impro:0,grp:00,time:0xbc,magic:0x00bc] CAKE<4:0>,orig:0x0d,new:0x17"]
for i in inp:
matches = re.findall(r'(\w+<.*?)>', i)
print(matches)
This prints:
['CAR<7:5', 'BIKE<4:0']
['CAKE<4:0']
In the first example, the BIKE part has no leading space but a pipe char.
A bit more precise match might be asserting either a space or pipe to the left, and match the digits separated by a colon and assert the > to the right.
(?<=[ |])[A-Z]+<\d+:\d+(?=>)
In parts, the pattern matches:
(?<=[ |]) Positive lookbehind, assert either a space or a pipe directly to the left
[A-Z]+ Match 1+ chars A-Z
<\d+:\d+ Match < and 1+ digits betqeen :
(?=>) Positive lookahead, assert > directly to the right
Regex demo
Or the capture group variant:
(?:[ |])([A-Z]+<\d+:\d)>
Regex demo
I have a string composed of both letters followed by a number, and I need to remove all the letters, as well as the leading zeros in the number.
For example: in the test string U012034, I want to match the U and the 0 at the beginning of 012034.
So far I have [^0-9] to match the any characters that aren't digits, but I can't figure out how to also remove the leading zeros in the number.
I know I could do this in multiple steps with something like int(re.sub("[^0-9]", "", test_string) but I need this process to be done in one regex.
You can use
re.sub(r'^\D*0*', '', text)
See the regex demo. Details
^ - start of string
\D* - any zero or more non-digit chars
0* - zero or more zeros.
See Python demo:
import re
text = "U012034"
print( re.sub(r'^\D*0*', '', text) )
# => 12034
If there is more text after the first number, use
print( re.sub(r'^\D*0*(\d+).*', r'\1', text) )
See this regex demo. Details:
^ - start of string
\D* - zero or more non-digits
0* - zero or more zeros
(\d+) - Group 1: one or more digits (use (\d+(?:\.\d+)?) to match float or int values)
`.* - the rest of the string.
The replacement is the Group 1 value.
You may use this re.sub in Python:
string = re.sub(r'^[a-zA-Z]*0*|[a-zA-Z]+', '', string)
RegEx Demo
Explanation:
^: Start
[a-zA-Z]*: Match 0 or more letters
0*L: Match 0 or more zeroes
|: OR
[a-zA-Z]+: Match 1+ of letters
Does this do what you need?
re.sub("[^0-9]+0*", "", "U0123")
>>> '123'
Using python v3, I'm trying to find a string only if it contains one to two digits (and not anymore than that in the same number) along with everything else following it. The match breaks on periods or new lines.
\d{1,2}[^.\n]+ is almost right except it returns numbers greater than two digits.
For example:
"5+years {} experience. stop.
10 asdasdas . 255
1abc1
5555afasfasf++++s()(jn."
Should return:
5+years {} experience
10 asdasdas
1abc1
Based upon your description and your sample data, you can use following regex to match the intended strings and discard others,
^\d[^\d.]*\d?[^\d.\n]*(?=\.|$)
Regex Explanation:
^ - Start of line
\d - Matches a digit
[^\d.]* - This matches any character other than digit or dot zero or more times. This basically allows optionally matching of non-digit non-dot characters.
\d? - As you want to allow one or two digits, this is the second digit which is optional hence \d followed by ?
[^\d.\n]* - This matches any character other than digit or dot or newline
(?=\.|$) - This positive look ahead ensures, the match either ends with a dot or end of line
Also, notice, multiline mode is enabled as ^ and $ need to match start of line and end of line.ad
Regex Demo 1
Code:
import re
s = '''5+years {} experience. stop.
10 asdasdas . 255
1abc1
5555afasfasf++++s()(2jn.'''
print(re.findall(r'(?m)^\d[^\d.]*\d?[^\d.\n]*(?=\.|$)', s))
Prints:
['5+years {} experience', '10 asdasdas ', '1abc1']
Also, if matching lines doesn't necessarily start with digits, you can use this regex to capture your intended string but here you need to get your string from group1 if you want captured string to start with number only, and if intended string doesn't necessarily have to start with digits, then you can capture whole match.
^[^\d\n]*(\d[^\d.]*\d?[^\d.\n]*)(?=\.|$)
Regex Explanation:
^ - Start of line
[^\d\n]* - Allows zero or more non-digit characters before first digit
( - Starts first grouping pattern to capture the string starting with first digit
\d - Matches a digit
[^\d.]* - This matches any character other than digit or dot zero or more times. This basically allows optionally matching of non-digit non-dot characters.
\d? - As you want to allow one or two digits, this is the second digit which is optional hence \d followed by ?
[^\d.\n]* - This matches any character other than digit or dot or newline
`) - End of first capturing pattern
(?=\.|$) - This positive look ahead ensures, the match either ends with a dot or end of line
Multiline mode is enabled which you can enable by placing (?m) before start of regex also called inline modifier or by passing third argument to re.search as re.MULTILINE
Regex Demo 2
Code:
import re
s = '''5+years {} experience. stop.
10 asdasdas . 255
1abc1
aaa1abc1
aa2aa1abc1
5555afasfasf++++s()(2jn.'''
print(re.findall(r'(?m)^[^\d\n]*(\d[^\d.]*\d?[^\d.\n]*)(?=\.|$)', s))
Prints:
['5+years {} experience', '10 asdasdas ', '1abc1', '1abc1']
I expect to fetch all alphanumeric characters after "-"
For an example:
>>> str1 = "12 - mystr"
>>> re.findall(r'[-.\:alnum:](.*)', str1)
[' mystr']
First, it's strange that white space is considered alphanumeric, while I expected to get ['mystr'].
Second, I cannot understand why this can be fetched, if there is no "-":
>>> str2 = "qwertyuio"
>>> re.findall(r'[-.\:alnum:](.*)', str2)
['io']
First of all, Python re does not support POSIX character classes.
The white space is not considered alphanumeric, your first pattern matches - with [-.\:alnum:] and then (.*) captures into Group 1 all 0 or more chars other than a newline. The [-.\:alnum:] pattern matches one char that is either -, ., :, a, l, n, u or m. Thus, when run against the qwertyuio, u is matched and io is captured into Group 1.
Alphanumeric chars can be matched with the [^\W_] pattern. So, to capture all alphanumeric chars after - that is followed with 0+ whitespaces you may use
re.findall(r'-\s*([^\W_]+)', s)
See the regex demo
Details
- - a hyphen
\s* - 0+ whitespaces
([^\W_]+) - Capturing group 1: one or more (+) chars that are letters or digits.
Python demo:
print(re.findall(r'-\s*([^\W_]+)', '12 - mystr')) # => ['mystr']
print(re.findall(r'-\s*([^\W_]+)', 'qwertyuio')) # => []
Your regex says: "Find any one of the characters -.:alnum, then capture any amount of any characters into the first capture group".
In the first test, it found - for the first character, then captured mystr in the first capture group. If any groups are in the regex, findall returns list of found groups, not the matches, so the matched - is not included.
Your second test found u as one of the -.:alnum characters (as none of qwerty matched any), then captured and returned the rest after it, io.
As #revo notes in comments, [....] is a character class - matching any one character in it. In order to include a POSIX character class (like [:alnum:]) inside it, you need two sets of brackets. Also, there is no order in a character class; the fact that you included - inside it just means it would be one of the matched characters, not that alphanumeric characters would be matched without it. Finally, if you want to match any number of alphanumerics, you have your quantifier * on the wrong thing.
Thus, "match -, then any number of alphanumeric characters" would be -([[:alnum:]]*), except... Python does not support POSIX character classes. So you have to write your own: -([A-Za-z0-9]*).
However, that will not match your string because the intervening space is, as you note, not an alphanumeric character. In order to account for that, -\s*([A-Za-z0-9]*).
Not quite sure what you want to match. I'll assume you don't want to include '-' in any matches.
If you want to get all alphanumeric chars after the first '-' and skip all other characters you can do something like this.
re.match('.*?(?<=-)(((?<=\s+)?[a-zA-Z\d]+(?=\s+)?)+)', inputString)
If you want to find each string of alphanumerics after a each '-' then you can do this.
re.findall('(?<=-)[a-zA-Z\d]+')
I am working on a Chinese NLP project. I need to remove all punctuation characters except those characters between numbers and remain only Chinese character(\u4e00-\u9fff),alphanumeric characters(0-9a-zA-Z).For example,the
hyphen in 12-34 should be kept while the equal mark after 123 should be removed.
Here is my python script.
import re
s = "中国,中,。》%国foo中¥国bar#中123=国%中国12-34中国"
res = re.sub(u'(?<=[^0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[^0-9])','',s)
print(res)
the expected output should be
中国中国foo中国bar中123国中国12-34中国
but the result is
中国中国foo中国bar中123=国中国12-34中国
I can't figure out why there is an extra equal sign in the output?
Your regex will first check "=" against [^\u4e00-\u9fff0-9a-zA-Z]+. This will succeed. It will then check the lookbehind and lookahead, which must both fail. Ie: If one of them succeeds, the character is kept. This means your code actually keeps any non-alphanumeric, non-Chinese characters which have numbers on any side.
You can try the following regex:
u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))'
You can use it as such:
import re
s = "中国,中,。》%国foo中¥国bar#中123=国%中国12-34中国"
res = re.findall(u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))',s)
print(res.join(''))
I suggest matching and capturing these characters in between digits (to restore them later in the output), and just match them in other contexts.
In Python 2, it will look like
import re
s = u"中国,中,。》%国foo中¥国bar#中123=国%中国12-34中国"
pat_block = u'[^\u4e00-\u9fff0-9a-zA-Z]+';
pattern = u'([0-9]+{0}[0-9]+)|{0}'.format(pat_block)
res = re.sub(pattern, lambda x: x.group(1) if x.group(1) else u"" ,s)
print(res.encode("utf8")) # => 中国中国foo中国bar中123国中国12-34中国
See the Python demo
If you need to preserve those symbols inside any Unicode digits, you need to replace [0-9] with \d and pass the re.UNICODE flag to the regex.
The regex will look like
([0-9]+[^\u4e00-\u9fff0-9a-zA-Z]+[0-9]+)|[^\u4e00-\u9fff0-9a-zA-Z]+
It will works like this:
([0-9]+[^\u4e00-\u9fff0-9a-zA-Z]+[0-9]+) - Group 1 capturing
[0-9]+ - 1+ digits
[^\u4e00-\u9fff0-9a-zA-Z]+ - 1+ chars other than those defined in the specified ranges
[0-9]+ - 1+ digits
| - or
[^\u4e00-\u9fff0-9a-zA-Z]+ - 1+ chars other than those defined in the specified ranges
In Python 2.x, when a group is not matched in re.sub, the backreference to it is None, that is why a lambda expression is required to check if Group 1 matched first.