Let's say I have strings like:
"H39_M1", "H3_M15", "H3M19", "H3M11", "D363_H3", "D_128_H17_M50"
How can I split each of them into a list of substrings, like this:
["H39", "M1"], ["H3", "Min15"], ["H3", "M19"], ["H3", "M11"], ["D363", "H3"], ["D128", "H17", "M50"]
And afterwards swap the places of the letter group and the numeric group, like this:
["39H", "1M"], ["3H", "15Min"], ["3H", "19M"], ["3H", "11M"], ["363D", "3H"], ["128D", "17H", "50M"]
The lengths of the numeric group and the letter group vary, as you can see.
Underscores ("_") can also separate them.
I might suggest using re.findall here with re.sub:
import re
inp = "H3M19"
inp = re.sub(r'([A-Z]+)([0-9]+)', r'\2\1', inp)
parts = re.findall(r'[0-9]+[A-Z]+', inp)
print(parts)
This prints:
['3H', '19M']
The first re.sub step converts H3M19 into 3H19M, by capturing the letter and numeric pairs and then swapping them. Then, we use re.findall to find all number/letter pairs in the swapped input.
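The snippet above handles a single input without underscores. A minimal sketch of how the same idea might be extended to all of the sample strings, assuming underscores carry no meaning and can simply be dropped (`swap_groups` is a hypothetical helper name, not something from the question):

```python
import re

def swap_groups(label):
    """Split a label into letter/number pairs and return them as number+letters."""
    # Drop underscores first so "D_128" and "D128" are treated alike.
    cleaned = label.replace("_", "")
    # Each match is a (letters, digits) tuple; emit the digits first.
    pairs = re.findall(r"([A-Za-z]+)(\d+)", cleaned)
    return [digits + letters for letters, digits in pairs]

samples = ["H39_M1", "H3_M15", "H3M19", "H3M11", "D363_H3", "D_128_H17_M50"]
for sample in samples:
    print(sample, "->", swap_groups(sample))
```

Capturing the letters and digits separately means the swap and the split happen in one pass, so no intermediate re.sub is needed.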
Related
I am trying to extract some groups of data from a text and validate if the input text is correct. In the simplified form my input text looks like this:
Sample=A,B;C,D;E,F;G,H;I&other_text
Here, A through I are the groups I am interested in extracting.
In the generic form, Sample looks like this:
val11,val12;val21,val22;...;valn1,valn2;final_val
That is, an arbitrary number of comma-separated pairs, separated from each other by semicolons, with one single value at the very end.
There must be at least two pairs before the final value.
The regular expression I came up with is something like this:
r'Sample=(\w),(\w);(\w),(\w);((\w),(\w);)*(\w)'
Assuming my desired groups are simply words (in reality they are more complex but this is out of the scope of the question).
It actually captures the whole text but fails to group the values correctly.
I am assuming that your "values" are composed of any characters other than , and ;, i.e. [^,;]+. This will need to be modified in the re.match and re.finditer calls to meet your actual requirements.
import re
s = 'Sample=val11,val12;val21,val22;val31,val32;valn1,valn2;final_val'
# verify if there is a match:
m = re.match(r'^Sample=([^,;]+),+([^,;]+)(;([^,;]+),+([^,;]+))+;([^,;]+)$', s)
if m:
    final_val = m.group(6)
    other_vals = [(m.group(1), m.group(2)) for m in re.finditer(r'([^,;]+),+([^,;]+)', s[7:])]
    print(final_val)
    print(other_vals)
Prints:
final_val
[('val11', 'val12'), ('val21', 'val22'), ('val31', 'val32'), ('valn1', 'valn2')]
You can do this with a regex that has an OR in it to decide which kind of data you are parsing. I spaced out the regex for commenting and clarity.
import re
data = 'val11,val12;val21,val22;valn1,valn2;final_val'
pat = re.compile(r'''
    (?P<pair>              # either comma separated ending in semicolon
        (?P<entry_1>[^,;]+) , (?P<entry_2>[^,;]+) ;
    )
    |                      # OR
    (?P<end_part>          # the ending token which contains no comma or semicolon
        [^;,]+
    )''', re.VERBOSE)
results = []
for match in pat.finditer(data):
    if match.group('pair'):
        results.append(match.group('entry_1', 'entry_2'))
    elif match.group('end_part'):
        results.append(match.group('end_part'))
print(results)
This results in:
[('val11', 'val12'), ('val21', 'val22'), ('valn1', 'valn2'), 'final_val']
You can do this without using regex, by using str.split.
An example:
words = [part.split(',') for part in 'val11,val12;val21,val22;valn1,valn2;final_val'.split(';')]
This will result in the following list:
[
['val11', 'val12'],
['val21', 'val22'],
['valn1', 'valn2'],
['final_val']
]
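Splitting alone does not check that the input is well formed. A rough sketch of how validation could be layered on top, assuming the rules stated in the question (at least two pairs, then a single final value); `parse_sample` is a hypothetical name and the text is assumed to have the "Sample=" prefix already removed:

```python
def parse_sample(text):
    """Return (pairs, final_value), or None if the text does not fit the format."""
    # Everything before the last semicolon is a pair; the rest is the final value.
    *pair_parts, final_value = text.split(';')
    pairs = [part.split(',') for part in pair_parts]
    # Require at least two pairs, each with exactly two non-empty values.
    if len(pairs) < 2 or any(len(p) != 2 or not all(p) for p in pairs):
        return None
    # The final value must be a single non-empty token without commas.
    if not final_value or ',' in final_value:
        return None
    return [tuple(p) for p in pairs], final_value

print(parse_sample('val11,val12;val21,val22;valn1,valn2;final_val'))
print(parse_sample('val11,val12;final_val'))  # only one pair, so rejected
```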
Using a regular expression, I want to find all the matching words in a sentence and, at the same time, extract only the wanted part of each match.
I use findall from the re module to find the matching words, plus parentheses to extract the parts I want.
For example, I have the string "0xQQ1A, 0xWW2B, 0xEE3C, 0xQQ4C".
I only want the two characters that follow "0xQQ" or "0xWW", which should result in the list ["1A", "2B", "4C"].
Here is my code:
import re
MyString = "0xQQ1A, 0xWW2B, 0xEE3C, 0xQQ4C"
MySearch = re.compile(r"0xQQ(\w{2})|0xWW(\w{2})")
MyList = MySearch.findall(MyString)
print(MyList)
So my expected result is ["1A", "2B", "4C"].
But the actual result is [('1A', ''), ('', '2B'), ('4C', '')]
I think I might have used the combination of "()" and "|" in the wrong way.
Thx for the help!
Two different capturing groups will result in two items in the output (whatever matched each).
Instead, use a single capturing group and put your | (OR) earlier:
re.compile(r"0x(?:QQ|WW)(\w{2})")
((?:...) is a non-capturing group that matches ... - used to limit the effects of the | to only the QQ/WW split, without adding another capture to the output.)
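To see the difference side by side, here is a small sketch that runs both patterns over the string from the question (the variable names are only illustrative):

```python
import re

text = "0xQQ1A, 0xWW2B, 0xEE3C, 0xQQ4C"

# Two capturing groups: findall returns one tuple per match, with an
# empty string for the group that did not take part in that match.
two_groups = re.findall(r"0xQQ(\w{2})|0xWW(\w{2})", text)

# One capturing group behind a non-capturing alternation:
# findall returns plain strings.
one_group = re.findall(r"0x(?:QQ|WW)(\w{2})", text)

print(two_groups)  # [('1A', ''), ('', '2B'), ('4C', '')]
print(one_group)   # ['1A', '2B', '4C']
```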
You can try this:
import re
string = "0xQQ1A, 0xWW2B, 0xEE3C, 0xQQ4C"
pattern = re.compile(r"(0xQQ|0xWW)(\w{2})")
result = [match[2] for match in pattern.finditer(string)]
result will be:
['1A', '2B', '4C']
I'm trying to split a string by specific letters (in this case 'r', 'g' and 'b') so that I can later append the pieces to a list. The catch is that I want the letters to be copied over to the list as well.
string = '1b24g55r44r'
What I want:
[[1b], [24g], [55r], [44r]]
You can use findall:
import re
print([match for match in re.findall('[^rgb]+?[rgb]', '1b24g55r44r')])
Output
['1b', '24g', '55r', '44r']
The regex matches:
[^rgb]+? one or more characters that are not r, g or b (lazily),
followed by one of [rgb].
If you need the result to be singleton lists you can do it like this:
print([[match] for match in re.findall('[^rgb]+?[rgb]', '1b24g55r44r')])
Output
[['1b'], ['24g'], ['55r'], ['44r']]
Also if the string is only composed of digits and rgb you can do it like this:
import re
print([[match] for match in re.findall(r'\d+?[rgb]', '1b24g55r44r')])
The only change in the above regex is \d+?, which matches one or more digits (lazily; a plain \d+ gives the same result here).
Output
[['1b'], ['24g'], ['55r'], ['44r']]
I have this string:
abc,12345,abc,abc,abc,abc,12345,98765443,xyz,zyx,123
What can I use to add a 0 to the beginning of each number in this string? So how can I turn that string into something like:
abc,012345,abc,abc,abc,abc,012345,098765443,xyz,zyx,0123
I've tried playing around with regex, but I'm unsure how to use it effectively to get the result I want. It needs to match values made up only of digits, so something like
1234abc567 should not become 01234abc567, because it has letters in it. Each value is always separated by a comma.
Use re.sub,
re.sub(r'(^|,)(\d)', r'\g<1>0\2', s)
or
re.sub(r'(^|,)(?=\d)', r'\g<1>0', s)
or
re.sub(r'\b(\d)', r'0\1', s)
Try the following:
re.sub(r'\b(\d+)\b', r'0\g<1>', s)
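The question also requires that mixed values such as 1234abc567 stay untouched. A minimal sketch of one way to enforce that, assuming fields are delimited only by commas and the string boundaries (`pad_numeric_fields` is a hypothetical helper name):

```python
import re

def pad_numeric_fields(csv_line):
    """Prefix a 0 to every comma-delimited field that consists only of digits."""
    # (?<![^,]) : the previous character is a comma, or we are at the start.
    # (?![^,])  : the next character is a comma, or we are at the end.
    # Together they force the digits to span the whole field.
    return re.sub(r'(?<![^,])(\d+)(?![^,])', r'0\1', csv_line)

s = 'abc,12345,abc,abc,abc,abc,12345,98765443,xyz,zyx,123'
print(pad_numeric_fields(s))
print(pad_numeric_fields('1234abc567,42'))  # the mixed field is left alone
```

Because both lookarounds are zero-width, the commas are never consumed, so adjacent numeric fields are each matched.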
If the numbers are always separated by commas in your string, you can use basic list methods to achieve the result you want.
Let's say your string is called x
y = x.split(',')
x = ''
for i in y:
    if i.isdigit():
        i = '0' + i
    x = x + i + ','
x = x[:-1]  # drop the trailing comma added by the last iteration
What this piece of code does is the following:
Splits your string into pieces depending on where you have commas and returns a list of the pieces.
Checks if the pieces are actually numbers, and if they are a 0 is added using string concatenation.
Finally your string is rebuilt by concatenating the pieces along with the commas.
I have a string format where A = alphanumeric and N = integer, so the template is "AAAAAA-NNNN". The user will sometimes omit the dash, and sometimes the "NNNN" part has only three digits, in which case I need to pad it with a 0. The first digit of "NNNN" has to be 0, so any other digit in that position is the last character of "AAAAAA" rather than the first digit of "NNNN". In essence, for the following inputs I want the following results:
Sample Inputs:
"SAMPLE0001"
"SAMPL1-0002"
"SAMPL3003"
"SAMPLE-004"
Desired Outputs:
"SAMPLE-0001"
"SAMPL1-0002"
"SAMPL3-0003"
"SAMPLE-0004"
I know how to check for this using regular expressions, but essentially I want to do the opposite. I was wondering if there is an easy way to do this other than a nested conditional checking for all these variations. I am using Python and pandas, but either will suffice.
The regex pattern would be:
"[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]-\d\d\d\d"
or in abbreviated form:
"[a-zA-Z0-9]{6}-[\d]{4}"
It would be possible through two re.sub functions.
>>> import re
>>> s = '''SAMPLE0001
SAMPL1-0002
SAMPL3003
SAMPLE-004'''
>>> print(re.sub(r'(?m)(?<=-)(?=\d{3}$)', '0', re.sub(r'(?m)(?<=^[A-Z\d]{6})(?!-)', '-', s)))
SAMPLE-0001
SAMPL1-0002
SAMPL3-0003
SAMPLE-0004
Explanation:
re.sub(r'(?m)(?<=^[A-Z\d]{6})(?!-)', '-', s) would be processed at first. It just places a hyphen after the 6th character from the beginning only if the following character is not a hyphen.
re.sub(r'(?m)(?<=-)(?=\d{3}$)', '0', re.sub(r'(?m)(?<=^[A-Z\d]{6})(?!-)', '-', s)) takes the output of the first call as its input and inserts a 0 right after the hyphen, but only when exactly three digits follow it up to the end of the line.
An alternative solution that uses str.join:
import re
inputs = ['SAMPLE0001', 'SAMPL1-0002', 'SAMPL3003','SAMPLE-004']
outputs = []
for input_ in inputs:
    m = re.match(r'(\w{6})-?\d?(\d{3})', input_)
    outputs.append('-0'.join(m.groups()))
print(outputs)
# ['SAMPLE-0001', 'SAMPL1-0002', 'SAMPL3-0003', 'SAMPLE-0004']
We are matching the regex (\w{6})-?\d?(\d{3}) against the input strings and joining the captured groups with the string '-0'. This is very simple and fast.
Let me know if you need a more in-depth explanation of the regex itself.
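The loop above raises an AttributeError when an input does not match, and \d? silently discards whatever digit sits in the first "NNNN" position. A cautious variation, sketched under the question's stated assumption that this digit is always 0 (so 0? replaces \d?, and re.fullmatch rejects trailing characters); `normalize_code` is a hypothetical name:

```python
import re

# Six word characters, an optional dash, an optional leading 0, three digits.
CODE_PATTERN = re.compile(r'(\w{6})-?0?(\d{3})')

def normalize_code(raw):
    """Return the code as 'AAAAAA-0NNN', or None if it does not fit the template."""
    m = CODE_PATTERN.fullmatch(raw)
    if m is None:
        return None
    # Rejoin the two captured parts with the dash and the leading zero.
    return '-0'.join(m.groups())

for raw in ['SAMPLE0001', 'SAMPL1-0002', 'SAMPL3003', 'SAMPLE-004', 'SAMPLE-1004']:
    print(raw, '->', normalize_code(raw))
```

Returning None for inputs such as 'SAMPLE-1004' leaves the caller to decide how to handle values that do not fit the template, rather than rewriting them.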