Split a split (regex) in python

Split a split (regex) in python - python

I do have got the below string and I am looking for a way to split it in order to consistently end up with the following output
'1GB 02060250396L7.067,702BE 129517720L6.633,403NL 134187650L3.824,234DE 165893440L3.111,005PL 65775644897L1.010,006DE 811506926L3.547,407AT U16235008L-830,008SE U57469158L3.001,30'
['1GB 02060250396L1.060,70',
'2BE 129517720L2.639,40',
'3NL 134187650L4.024,23',
'4DE 165893440L8.111,00',
'5PL 65775644897L3.010,00',
'6DE 811506926L3.547,40',
'7AT U16235008L-830,00',
'8SE U57469158L8.0221,30']
My current approach
re.split("([0-9][0-9][0-9][A-Z][A-Z])", input) however is also splitting my delimiter which gives and there is no other split possible than the one I am currently using in order to remain consistent. Is it possible to split my delimiter as well and assign a part of it "70" to the string in front and a part "2BE" to the following string?

Use re.findall() instead of re.split().
You want to match
a number \d, followed by
two letters [A-Z]{2}, followed by
a space \s, followed by
a bunch of characters until you encounter a comma [^,]+, followed by
two digits \d{2}
Try it at regex101
So do:
input_str = '1GB 02060250396L7.067,702BE 129517720L6.633,403NL 134187650L3.824,234DE 165893440L3.111,005PL 65775644897L1.010,006DE 811506926L3.547,407AT U16235008L-830,008SE U57469158L3.001,30'
re.findall(r"\d[A-Z]{2}\s[^,]+,\d{2}", input_str)
Which gives
['1GB 02060250396L7.067,70',
'2BE 129517720L6.633,40',
'3NL 134187650L3.824,23',
'4DE 165893440L3.111,00',
'5PL 65775644897L1.010,00',
'6DE 811506926L3.547,40',
'7AT U16235008L-830,00',
'8SE U57469158L3.001,30']
Alternatively, if you don't want to be so specific with your pattern, you could simply use the regex
[^,]+,\d{2} Try it at regex101
This will match as many of any character except a comma, then a single comma, then two digits.
re.findall(r"[^,]+,\d{2}", input_str)
# Output:
['1GB 02060250396L7.067,70',
'2BE 129517720L6.633,40',
'3NL 134187650L3.824,23',
'4DE 165893440L3.111,00',
'5PL 65775644897L1.010,00',
'6DE 811506926L3.547,40',
'7AT U16235008L-830,00',
'8SE U57469158L3.001,30']

Is it possible to split my delimiter as well and assign a part of it "70" to the string in front and a part "2BE" to the following string?
If you must use re.split AT ANY PRICE then you might exploit zero-length assertion for this task following way
import re
text = '1GB 02060250396L7.067,702BE 129517720L6.633,403NL 134187650L3.824,234DE 165893440L3.111,005PL 65775644897L1.010,006DE 811506926L3.547,407AT U16235008L-830,008SE U57469158L3.001,30'
parts = re.split(r'(?<=,[0-9][0-9])', text)
print(parts)
output
['1GB 02060250396L7.067,70', '2BE 129517720L6.633,40', '3NL 134187650L3.824,23', '4DE 165893440L3.111,00', '5PL 65775644897L1.010,00', '6DE 811506926L3.547,40', '7AT U16235008L-830,00', '8SE U57469158L3.001,30', '']
Explanation: This particular one is positive lookbehind, it does find zero-length substring preceded by , digit digit. Note that parts has superfluous empty str at end.

Related

Regex to match first occurrence of non alpha-numeric characters

I am parsing some user input to make a basic Discord bot assigning roles and such. I am trying to generalize some code to reuse for different similar tasks (doing similar things in different categories/channels).
Generally, I am looking for a substring (the category), then taking the string after as that categories value. I am looking line by line for my category, replacing the "category" substring and returning a stripped version. However, what I have now also replaces any space in the "value" string.
Originally the string looks like this:
Gamertag : 00test gamertag
What I want to do, is preserve the spaces in the value. The regex I am trying to do is: match all non alpha-numeric chars until the first letter.
My return is already matching non alpha but can't figure out how to get just first group, looks like it should be simply adding a ? to make it a lazy operator but not sure.. example code and string below (regex I want to replace is the final print string).
String I am working with:
- 00test Gamertag #(or any non-alpha delimiter)
Desired Results (by matching and stripping the extra characters)
00test Gamertag #(remove leading space and any non-alpha characters before the first words)
The regex I am trying to do is: match all non alpha-numeric chars until the first letter. Should be something like the following, which is close to what I use to strip non-alphas now but it does all not the first group - so I want to match the first group of non-alphas in a string to strip that part using re.sub..
\W+?
https://www.online-python.com/gDVhZrnmlq
Thank you!

Your regex will substitute the non-alphanumerical characters anywhere in the input string. If you only need to have this happening at the start of the string, then use the start-of-input anchor (i.e. ^):
^\W+

It depends on your inputs, you can use two regex to achieve your goal, the first to remove all non alpha-numeric from your string including the ones between words, and the second one to remove whitespaces between words if there is more than one space between each two words :
import re
gamer_tag = "µ& - 00test - Gamertag"
gamer_tag = re.sub(r"[^a-zA-Z0-9\s]", "", gamer_tag)
gamer_tag = re.sub(r" +", " ", gamer_tag)
print(gamer_tag.strip())
# Output: 00test Gamertag
You can remove the second re.sub() if you sure that there will no more than one space between words.
gamer_tag = "- 00test Gamertag "
gamer_tag = re.sub(r"[^a-zA-Z0-9\s]", "", gamer_tag)
print(gamer_tag.strip())
# Output: 00test Gamertag

Parsing based on pattern not at the beginning

I want to extract the number before "2022" in a set of strings possibly. I current do
a= mystring.strip().split("2022")[0]
and, for instance, when mystring=' 1020220519AX', this gives a = '10'. However,
mystring.strip().split("2022")[0]
fails when mystring=' 20220220519AX' to return a='202'. Therefore, I want the code to split the string on "2022" that is not at the beginning non-whitespace characters in the string.
Can you please guide with this?

Use a regular expression rather than split().
import re
mystring = ' 20220220519AX'
match = re.search(r'^\s*(\d+?)2022', mystring)
if match:
print(match.group(1))
^\s* skips over the whitespace at the beginning, then (\d+?) captures the following digits up to the first 2022.

You can tell a regex engine that you want all the digits before 2022:
r'\d+(?=2022)'
Like .split(), a regex engine is 'greedy' by default - 'greedy' here means that as soon as it can take something that it is instructed to take, it will take that and it won't try another option, unless the rest of the expression cannot be made to work.
So, in your case, mystring.strip().split("2022") splits on the first 2020 it can find and since there's nothing stopping it, that is the result you have to work with.
Using regex, you can even tell it you're not interested in the 2022, but in the numbers before it: the \d+ will match as long a string of digits it can find (greedy), but the (?=2022) part says it must be followed by a literal 2022 to be a match (and that won't be part of the match, a 'positive lookahead').
Using something like:
import re
mystring = ' 20220220519AX'
print(re.findall(r'\d+(?=2022)', mystring))
Will show you all consecutive matches.
Note that for a string like ' 920220220519AX 12022', it will find ['9202', '1'] and only that - it won't find all possible combinations of matches. The first, greedy pass through the string that succeeds is the answer you get.

You could split() asserting not the start of the string to the left after using strip(), or you can get the first occurrence of 1 or more digits from the start of the string, in case there are more occurrences of 2022
import re
strings = [
' 1020220519AX',
' 20220220519AX'
]
for s in strings:
parts = re.split(r"(?<!^)2022", s.strip())
if parts:
print(parts[0])
for s in strings:
m = re.match(r"\s*(\d+?)2022", s)
if m:
print(m.group(1))
Both will output
10
202
Note that the split variant does not guarantee that the first part consists of digits, it is only splitted.
If the string consists of only word characters, splitting on \B2022 where \B means non a word boundary, will also prevent splitting at the start of the example string.

Regex find content in between single quotes, but only if contains certain word

I want to get the content between single quotes, but only if it contains a certain word (i.e 'sample_2'). It additionally should not match ones with white space.
Input example: (The following should match and return only: ../sample_2/file and sample_2/file)
['asdf', '../sample_2/file', 'sample_2/file', 'example with space', sample_2, sample]
Right now I just have that matched the first 3 items in the list:
'(.\S*?)'
I can't seem to find the right regex that would return those containing the word 'sample_2'

If you want specific words/characters you need to have them in the regular expression and not use the '\S'. The \S is the equivalent to [^\r\n\t\f\v ] or "any non-whitespace character".
import re
teststr = "['asdf', '../sample_2/file', 'sample_2/file', 'sample_2 with spaces','example with space', sample_2, sample]"
matches = re.findall(r"'([^\s']*sample_2[^\s]*?)',", teststr)
# ['../sample_2/file', 'sample_2/file']
Based on your wording, you suggest the desired word can change. In that case, I would recommend using re.compile() to dynamically create a string which then defines the regular expression.
import re
word = 'sample_2'
teststr = "['asdf', '../sample_2/file', 'sample_2/file', ' sample_2 with spaces','example with space', sample_2, sample]"
regex = re.compile("'([^'\\s]*"+word+"[^\\s]*?)',")
matches = regex.findall(teststr)
# ['../sample_2/file', 'sample_2/file']
Also if you haven't heard of this tool yet, check out regex101.com. I always build my regular expressions here to make sure I get them correct. It gives you the references, explanation of what is happening and even lets you test it right there in the browser.
Explanation of regex
regex = r"'([^\s']*sample_2[^\s]*?)',"
Find first apostrophe, start group capture. Capture anything except a whitespace character or the corresponding ending apostrophe. It must see the letters "sample_2" before accepting any non-whitespace character. Stop group capture when you see the closing apostrophe and a comma.
Note: In python, a string " or ' prepositioned with the character 'r' means the text is compiled as a regular expression. Strings with the character 'r' also do not require double-escape '\' characters.

regex select sequences that start with specific number

I want to select select all character strings that begin with 0
x= '1,1,1075 1,0,39 2,4,1,22409 0,1,1,755,300 0,1,1,755,50'
I have
re.findall(r'\b0\S*', x)
but this returns
['0,39', '0,1,1,755,300', '0,1,1,755,50']
I want
['0,1,1,755,300', '0,1,1,755,50']

The problem is that \b matches the boundaries between digits and commas too. The simplest way might be not to use a regex at all:
thingies = [thingy for thingy in x.split() if thingy.startswith('0')]

Instead of using the boundary \b which will match between the comma and number (between any word [a-zA-Z0-9_] and non word character), you will want to match on start of string or space like (^|\s).
(^|\s)0\S*
https://regex101.com/r/Mrzs8a/1
Which will match the start of string or a space preceding the target string. But that will also include the space if present so I would suggest either trimming your matched string or wrapping the latter part with parenthesis to make it a group and then just getting group 1 from the matches like:
(?:^|\s)(0\S*)
https://regex101.com/r/Mrzs8a/2

regex to strict check numbers in string

Example strings:
I am a numeric string 75698
I am a alphanumeric string A14-B32-C7D
So far my regex works: (\S+)$
I want to add a way (probably look ahead) to check if the result generated by above regex contains any digit (0-9) one or more times?
This is not working: (\S+(?=\S*\d\S*))$
How should I do it?

Look ahead is not necessary for this, this is simply :
(\S*\d+\S*)
Here is a test case :
http://regexr.com?34s7v

permute it and use the \D class instead of \S:
((?=\D*\d)\S+)$
explanation: \D = [^\d] in other words it is all that is not a digit.
You can be more explicit (better performances for your examples) with:
((?=[a-zA-Z-]*\d)\[a-zA-Z\d-]+)$
and if you have only uppercase letters, you know what to do. (smaller is the class, better is the regex)

text = '''
I am a numeric string 75698 \t
I am a alphanumeric string A14-B32-C7D
I am a alphanumeric string A14-B32-C74578
I am an alphabetic number: three
'''
import re
regx = re.compile('\s(?=.*\d)([\da-zA-Z-]+)\s*$',re.MULTILINE)
print regx.findall(text)
# result ['75698', 'A14-B32-C7D', 'A14-B32-C74578']
Note the presence of \s* in front of $ in order to catch alphanumeric portions that are separated with whitespazces from the end of the lines.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.