Extract Number before a Character in a String Using Python - python

I'm trying to extract the number before character "M" in a series of strings. The strings may look like:
"107S33M15H"
"33M100S"
"12M100H33M"
so basically there would be sets of numbers separated by different characters, and "M" may show up more than once. For the examples here, I would like my code to return:
33
33
12,33 #doesn't matter what delimiter to use here
One way I could think of is to split the string by "M", and find items that are pure numbers, but I suspect there are better ways to do it. Thanks a lot for the help.

You may use a simple (\d+)M regex (1+ digit(s) followed with M where the digits are captured into a capture group) with re.findall.
See IDEONE demo:
import re
s = "107S33M15H\n33M100S\n12M100H33M"
print(re.findall(r"(\d+)M", s))
And here is a regex demo
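If one result per input string is wanted, as in the question, the same pattern can be applied to each string separately. This is only a sketch: the helper name numbers_before_m is made up for illustration and is not part of the original answer.

```python
import re

def numbers_before_m(s):
    """Return every run of digits that is immediately followed by 'M'."""
    return re.findall(r"(\d+)M", s)

# Print one comma-separated line per input string.
for s in ["107S33M15H", "33M100S", "12M100H33M"]:
    print(",".join(numbers_before_m(s)))
```

re.findall returns only the captured group, so the "M" itself is not part of the results.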

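You can use rpartition as a starting point. Note that it returns everything before the last "M", so the trailing digits still have to be pulled out of the result:
s = '107S33M15H'
prefix = s.rpartition('M')[0]  # '107S33', the number 33 is at the end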
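A regex-free variant of the "split by M" idea from the question might look like the sketch below. The function name digits_before_m is hypothetical, and the approach assumes that only the digits directly in front of each "M" are wanted.

```python
from itertools import takewhile

def digits_before_m(s):
    """Collect the digits that sit directly in front of each 'M'."""
    results = []
    # The last piece comes after the final 'M', so it is skipped.
    for piece in s.split('M')[:-1]:
        # Walk backwards from the end of the piece while characters are digits.
        tail = ''.join(takewhile(str.isdigit, reversed(piece)))[::-1]
        if tail:
            results.append(tail)
    return results

for s in ["107S33M15H", "33M100S", "12M100H33M"]:
    print(",".join(digits_before_m(s)))
```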

Related

How to extract date in yyyy-yyyy format using regex python

I know this is basic but can someone please provide a regex solution to extract "1234-5678" out of "abcfd1234-5678gfvjh". Here the leading and trailing strings can be anything and they might not always be there, i.e. the string can be just "1234-5678" as well. It is guaranteed that there will be no letters between the numbers; only "-" can be there. There is one more format of the string, "1234-56", i.e. the second number can be of length 2 or 4. Please see the explanation below:
input :a = "abcfd1234-5678gfvjh"
output :"1234-5678"
input :a = "abcfd1234-56gfvjh"
output :"1234-56"
input :a = "1234-5678hgjg"
output :"1234-5678"
input :a = "abcfd1234-5678"
output :"1234-5678"
input :a = "1234-56"
output :"1234-56"
\d{4}[-–](?:\d{4}|\d{2})
See an explanation here: https://regex101.com/r/kocRuY/2
Basically we say to search for four digits then a hyphen then either (using a non-capturing group to bracket) four digits or, failing that, two digits.
You should use the regex "search" method rather than "match" method as the processor will have to find where the sequence starts in the string. If you are restricted to matching from the start with "match", then you could add some sort of quantifier at the start to gobble up the start characters.
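As a rough illustration of the search versus match point, the pattern above can be run against the inputs from the question. The helper name find_range is an assumption made for this sketch.

```python
import re

# Four digits, a hyphen or en dash, then four digits or, failing that, two.
RANGE_RE = re.compile(r"\d{4}[-–](?:\d{4}|\d{2})")

def find_range(text):
    """Return the first yyyy-yyyy or yyyy-yy occurrence, or None."""
    m = RANGE_RE.search(text)
    return m.group(0) if m else None

for a in ["abcfd1234-5678gfvjh", "abcfd1234-56gfvjh", "1234-5678hgjg",
          "abcfd1234-5678", "1234-56"]:
    print(find_range(a))

# match() only tries at position 0, so leading letters make it fail.
print(RANGE_RE.match("abcfd1234-5678gfvjh"))
```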
>>> import re
>>> re.findall(r'\d+-\d+', "abcfd1234-5678gfvjh")
['1234-5678']
you can try different regexes in https://regex101.com/
Surely a dozen duplicates on StackOverflow.
As the request occurs very often, there's a module called datefinder (pip install datefinder). You'd then call it like this:
import datefinder
matches = datefinder.find_dates(your_string_here)
for match in matches:
    print(match)

How to replace a number sequence using regex in python? [duplicate]

I need to filter a set of strings with a wildcard-type search, like the following:
Looking for He*lo should match "Hello", but not "Helo"
Looking for *ant should match "pant" and "want" but not "ant"
Looking for *yp* should match "gypsy" and "typical"
The * represents one or more characters. I don't mind a handwritten or regex-based search. Any ideas? The typical .NET approach for wildcards matches 0 or more, but I need 1 or more characters. How can I do this?
What you're looking for is the + regex operator
You want the .
For example: he.lo will match your hello, but not helo.
same goes for the rest.
You can easily test it here: http://regexpal.com/.
Do note that .yp. will not match typical nor gypsy as whole words, but '.yp.+' will (because of the rest of the characters)
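Since the question asks for * to mean one or more characters, each * can be translated to .+ and the result matched against the whole string. The sketch below is in Python rather than .NET, and the function name wildcard_match is invented for illustration.

```python
import re

def wildcard_match(pattern, text):
    """Match text against pattern, where '*' stands for one or more characters."""
    # Escape the literal parts so characters like '.' keep their plain meaning.
    literal_parts = [re.escape(part) for part in pattern.split('*')]
    regex = '.+'.join(literal_parts)
    return re.fullmatch(regex, text) is not None

print(wildcard_match("He*lo", "Hello"), wildcard_match("He*lo", "Helo"))
print(wildcard_match("*ant", "pant"), wildcard_match("*ant", "ant"))
print(wildcard_match("*yp*", "gypsy"), wildcard_match("*yp*", "typical"))
```

re.fullmatch anchors the pattern at both ends, which is what makes "*ant" reject "ant".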

Dynamically Removing string with regex python

I am currently having trouble removing the end of strings using regex. I have tried using .partition with unsuccessful results. I am now trying to use regex unsuccessfully. All the strings follow the format of some random words **X*.* Some more words. Where * is a digit and X is a literal X. For Example 21X2.5. Everything after this dynamic string should be removed. I am trying to use re.sub('\d\d\X\d.\d', string). Can someone point me in the right direction with regex and how to split the string?
The expected output should read:
some random words 21X2.5
Thanks!
Use following regex:
re.search(r"(.*?\d\dX\d\.\d)", "some random words 21X2.5 Some more words").groups()[0]
Output:
'some random words 21X2.5'
Your regex is not correct. The biggest problem is that you need to escape the period. Otherwise, the regex treats the period as a match to any character. To match just that pattern, you can use something like:
re.findall(r'[\d]{2}X\d\.\d', 'asb12X4.4abc')
[\d]{2} matches a sequence of two digits, X matches the literal X, \d matches a single digit, \. matches the literal ., and \d matches the final digit.
This will match and return only 12X4.4.
It sounds like you instead want to remove everything after the matched expression. To get your desired output, you can do something like:
re.split(r'(.*?[\d]{2}X\d\.\d)', 'some random words 21X2.5 Some more words')[1]
which will return some random words 21X2.5. This expression pulls everything before and including the matched regex and returns it, discarding the end.
Let me know if this works.
To remove everything after the pattern, i.e do exactly as you say...:
s = re.sub(r'(\d\dX\d\.\d).*', r'\1', s)
Of course, if you mean something else than what you said, something different will be needed! E.g if you want to also remove the pattern itself, not just (as you said) what's after it:
s = re.sub(r'\d\dX\d\.\d.*', r'', s)
and so forth, depending on what, exactly, are your specs!-)
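Putting the re.sub variant into a small runnable form may help when testing against real data. The function name truncate_after_code is hypothetical; the pattern is the one from the answer above.

```python
import re

# Two digits, a literal X, a digit, a literal period, a digit.
CODE_RE = r'\d\dX\d\.\d'

def truncate_after_code(s):
    """Drop everything that follows the first **X*.* code in s."""
    return re.sub(r'(' + CODE_RE + r').*', r'\1', s)

print(truncate_after_code("some random words 21X2.5 Some more words"))

# With the period left unescaped, it matches any character, not just '.'.
print(re.search(r'\d\dX\d.\d', '21X2a5') is not None)
print(re.search(CODE_RE, '21X2a5') is not None)
```

Strings that do not contain the code are returned unchanged, since re.sub finds nothing to replace.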

How to parse a string into two different strings based on first instance of an integer? (Python)

I'm trying to take a string like "PR405j" and separate it into two strings. In this instance, the two strings would be "PR" and "405j". There are a variety of strings I have to do this to. Examples:
"ACR498" would be "ACR" and "498", "FR707e" would be "FR" and "707e", "TY699l" would be "TY" and "699l" and so on and so forth.
The problem I'm having is separating the first part from the second part. The amount of characters on either side differs, and the second string (the one with the numbers) may or may not have alphabetic characters in there as well. The only commonality between all of these strings is that you can divide them based on the first instance of an integer.
I thought a for loop that goes through every character in the original string and builds two separate strings inside would work, but I could only think to base the separation on integers and alphabetic characters, which would make something like "PR405j" turn into "PRj" and "405".
I also thought the split string method would help, but there's no one character all these strings have in common.
Finally, I can't split the strings based on the numbers of alphabetic characters in the beginning of the string (say 2 for "PR405j") because there is variation between strings.
If anybody could help me with this, I'd greatly appreciate it. Thank you!
You can use regular expressions to do simple string matching such as this. The expression '(\D+)(.+)' is saying 'Extract one or more non-digits as the first group, then extract one or more other characters as the second.'
import re
inputs = ['PR405j']
for text in inputs:
    # Group 1: leading non-digits; group 2: everything from the first digit on.
    match = re.match(r'(\D+)(.+)', text)
    start = match.group(1)
    end = match.group(2)
    print(text, start, end)
EDIT: I misunderstood the question, thought you wanted 3 groups, not two. Zack Bloom's answer is more correct, but I'll leave this here as a reference in case someone has a similar question.
You can use re.split:
>>> re.split(r'(\d+)', 'PR405j')
['PR', '405', 'j']
The trick here is using a capturing group (with parentheses) as the regular expression to split by; this will cause the output to contain the portions that caused the split as well as the portions to either side of it. If you have a string with multiple groups of digits separated by non-digits, this will fully split the string:
>>> re.split(r'(\d+)', 'PR405j123abc')
['PR', '405', 'j', '123', 'abc']
re.split, like the rest of the answers. But you have to munge it to deal with the grouping:
import re
re.split(r'([a-zA-Z]+)', 'PR405j', maxsplit=1)[1:]
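Another way to read the requirement "divide at the first digit" is to locate that digit and slice the string there. This is a minimal sketch with an invented function name, split_at_first_digit, and it assumes a string without digits should come back whole with an empty second part.

```python
import re

def split_at_first_digit(s):
    """Split s into the part before the first digit and the rest."""
    m = re.search(r'\d', s)
    if m is None:
        # No digit at all: assumed behaviour is to keep s as the first part.
        return s, ''
    return s[:m.start()], s[m.start():]

for s in ["PR405j", "ACR498", "FR707e", "TY699l"]:
    print(split_at_first_digit(s))
```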

Regular expression capturing entire match consisting of repeated groups

I've looked through the forums but could not find exactly how to solve my problem.
Let's say I have a string like the following:
UDK .636.32/38.082.4454.2(575.3)
and I would like to match the expression with a regex, capturing the actual number (in this case the '.636.32/38.082.4454.2(575.3)').
There could be some garbage characters between the 'UDK' and the actual number, and characters like '.', '/' or '-' are valid parts of the number. Essentially the number is a sequence of digits separated by some allowed characters.
What I've come up with is the following regex:
'UDK.*(\d{1,3}[\.\,\(\)\[\]\=\'\:\"\+/\-]{0,3})+'
but it does not group the '.636.32/38.082.4454.2(575.3)'! It leaves me with nothing more than a last digit of the last group (3 in this case).
Any help would be greatly appreciated.
First, you need a non-greedy .*?.
Second, you don't need to escape some chars in [ ].
Third, you might just consider it as a sequence of digits AND some allowed characters? Why is there a \d{1,3} when the number contains 4454?
>>> re.match(r'UDK.*?([\d.,()\[\]=\':"+/-]+)', s).group(1)
'.636.32/38.082.4454.2(575.3)'
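The reason the original attempt kept so little is that a capturing group with a quantifier on it only remembers its last repetition. The sketch below shows that behaviour on a toy string and then runs the pattern from this answer with s defined; the variable names are chosen for illustration.

```python
import re

# A repeated capturing group keeps only the final repetition.
toy = re.search(r'(\d)+', '123')
print(toy.group(0))  # the whole match
print(toy.group(1))  # only the last digit the group matched

# Putting the repetition inside the group captures the whole run instead.
s = "UDK .636.32/38.082.4454.2(575.3)"
number = re.match(r'UDK.*?([\d.,()\[\]=\':"+/-]+)', s).group(1)
print(number)
```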
Not so much a direct answer to your problem, but a general regexp tip: use Kodos (http://kodos.sourceforge.net/). It is simply awesome for composing/testing out regexps. You can enter some sample text, and "try out" regular expressions against it, seeing what matches, groups, etc. It even generates Python code when you're done. Good stuff.
Edit: using Kodos I came up with:
UDK.*?(?P<number>[\d/.)(]+)
as a regexp which matches the given example. Code that Kodos produces is:
import re
rawstr = r"""UDK.*?(?P<number>[\d/.)(]+)"""
matchstr = """UDK .636.32/38.082.4454.2(575.3)"""
# method 1: using a compile object
compile_obj = re.compile(rawstr)
match_obj = compile_obj.search(matchstr)
# Retrieve group(s) by name
number = match_obj.group('number')
