regex expression to get all digits before full stop

I want to do what the title says, but I can't seem to get it working.
string = "tex３５９１．４５" # please be aware that my digits are full-width
text_temp = re.findall("(\d.)", string)
My current output is:
['３５', '９１', '４５']
My expected output is:
['3591.'] # with the "." at the end of the number, no matter how many digits come in front of the full stop

You need to escape the .:
text_temp = re.findall(r"\d+\.", string)
since . is a special character in regex that matches any character. I also added + to match one or more digits.
Or, if you are actually using 'FULLWIDTH FULL STOP' (U+FF0E), you can use that character in the regex directly, without escaping it:
text_temp = re.findall(r"\d+．", string)
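A minimal sketch of both variants, run against the string from the question. In Python 3, \d in a str pattern matches full-width digits as well as ASCII ones. The NFKC normalization step is my own addition, not part of the answer above; it assumes you want plain ASCII output like the expected ['3591.'].

```python
import re
import unicodedata

string = "tex３５９１．４５"  # full-width digits and a full-width full stop

# \d matches any Unicode decimal digit here, so the full-width digits
# match; the full-width full stop is matched literally.
fullwidth_matches = re.findall(r"\d+．", string)
print(fullwidth_matches)  # ['３５９１．']

# Assumption: ASCII output is wanted. NFKC normalization folds full-width
# characters to their ASCII equivalents, after which the escaped "\." works.
normalized = unicodedata.normalize("NFKC", string)
ascii_matches = re.findall(r"\d+\.", normalized)
print(ascii_matches)  # ['3591.']
```

If the input may mix both kinds of full stop, normalizing first keeps the pattern simple.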

You can use this regex with re.findall to get your desired result:
\d(?=.*?．)
This returns each digit as a separate match.
\d+(?=.*?．)
This returns each run of digits as a single string.
I used a positive lookahead with a lazy match (.*?) to check whether there is a full stop somewhere after the digit before giving output. Hope this helps :).
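As a rough illustration of the two lookahead patterns, here is how they behave on the string from the question. The third input, "12ab34．56", is a made-up example to show that the lookahead does not require the full stop to follow the digits directly.

```python
import re

string = "tex３５９１．４５"

# One match per digit that has a full-width full stop somewhere after it.
single_digits = re.findall(r"\d(?=.*?．)", string)
print(single_digits)  # ['３', '５', '９', '１']

# One match per run of digits that has a full-width full stop after it.
digit_runs = re.findall(r"\d+(?=.*?．)", string)
print(digit_runs)  # ['３５９１']

# The lookahead allows other characters in between, so digits that are
# not directly adjacent to the full stop are matched as well.
separated = re.findall(r"\d+(?=.*?．)", "12ab34．56")
print(separated)  # ['12', '34']
```

Neither pattern includes the full stop in the result, because a lookahead checks what follows without consuming it.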

Related

How to handle " in Regex Python

I am trying to grab fary_trigger_post in the code below using regex. However, I don't understand why the match always includes " at the end, which I don't expect.
Any idea or suggestion?
re.match(
r'-instance[ "\']*(.+)[ "\']*$',
'-instance "fary_trigger_post" '.strip(),
flags=re.S).group(1)
'fary_trigger_post"'
Thank you.

The (.+) is greedy and grabs ANY character until the end of the input. If you modified your input to include characters after the final double quote (e.g. '-instance "fary_trigger_post" asdf') you would find the double quote and the remaining characters in the output (e.g. fary_trigger_post" asdf). Instead of .+ you should try [^"\']+ to capture all characters except the quotes. This should return what you expect.
re.match(r'-instance[ "\']*([^"\']+)[ "\'].*$', '-instance "fary_trigger_post" '.strip(), flags=re.S).group(1)
Also, note that I modified the end of the expression to use .* which will match any characters following the last quote.
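To make the difference concrete, here is a small sketch comparing the original greedy pattern with the negated character class, using the inputs mentioned in this thread.

```python
import re

line = '-instance "fary_trigger_post" '.strip()

# Greedy (.+) runs to the end of the input; the optional trailing class
# [ "']* then matches nothing, so the closing quote stays in the group.
greedy = re.match(r'-instance[ "\']*(.+)[ "\']*$', line, flags=re.S).group(1)
print(greedy)  # fary_trigger_post"

# A negated character class stops at the first quote instead.
negated = re.match(r'-instance[ "\']*([^"\']+)[ "\'].*$', line, flags=re.S).group(1)
print(negated)  # fary_trigger_post

# With trailing text, the greedy version captures that as well.
longer = '-instance "fary_trigger_post" asdf'
greedy_longer = re.match(r'-instance[ "\']*(.+)[ "\']*$', longer, flags=re.S).group(1)
print(greedy_longer)  # fary_trigger_post" asdf
```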

Here's what I'd use in your matching string, but it's hard to provide a better answer without knowing all your cases:
r'-instance\s+"(.+)"\s*$'

When you ask for group 1 (i.e. (.+)), the regex engine follows this match to the end of the string, because . (any character) can match one or more times and takes as many as it can. I would suggest using the following pattern:
'-instance[ "\']*(.+)["\']+ *$'
This requires the regex to match the trailing quotes and the trailing spaces separately, so that they are not included in group 1.
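A quick sketch of how this pattern behaves. The single-quoted and unquoted inputs are my own test strings, added to show one consequence of making the trailing quote mandatory.

```python
import re

# The trailing quote is mandatory here (["']+), so backtracking forces
# (.+) to give the closing quote back.
pattern = re.compile(r'-instance[ "\']*(.+)["\']+ *$')

double_quoted = pattern.match('-instance "fary_trigger_post" ').group(1)
print(double_quoted)  # fary_trigger_post

single_quoted = pattern.match("-instance 'fary_trigger_post'").group(1)
print(single_quoted)  # fary_trigger_post

# Side effect: an unquoted value no longer matches at all.
unquoted = pattern.match("-instance fary_trigger_post")
print(unquoted)  # None
```

Whether rejecting unquoted values is acceptable depends on the inputs you expect.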

Regular expression to find even/odd number

I used the code below to find even numbers in a string, but it returned nothing.
Could anyone tell me what I missed? Thank you very much.
import re
str2 = "adbv345hj43hvb42"
even_number = re.findall('/^[0-9]*[02468]$/', str2 )

In Python you should not wrap the expression in slashes ('/^[0-9]*[02468]$/' -> '^[0-9]*[02468]$').
^ and $ match the beginning and the end of the string (or of a line in a MULTILINE regex), but your example doesn't look like it needs them ('^[0-9]*[02468]$' -> '[0-9]*[02468]').
After that you need to stop it from matching only prefixes of a number ('[0-9]*[02468]' -> r'[0-9]*[02468](?![0-9])').
That's it :)
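The three steps can be checked one at a time against the string from the question. The odd-number variant at the end is my own extension of the same idea and is not part of the answer above.

```python
import re

str2 = "adbv345hj43hvb42"

# Step 1: slashes removed but still anchored to the whole string. No match,
# because the string does not consist solely of digits.
anchored = re.findall(r"^[0-9]*[02468]$", str2)
print(anchored)  # []

# Step 2: without anchors, prefixes of odd numbers match too.
unanchored = re.findall(r"[0-9]*[02468]", str2)
print(unanchored)  # ['34', '4', '42']

# Step 3: the negative lookahead rejects a match that is followed by a digit.
even_numbers = re.findall(r"[0-9]*[02468](?![0-9])", str2)
print(even_numbers)  # ['42']

# Odd numbers: same idea with the odd digits.
odd_numbers = re.findall(r"[0-9]*[13579](?![0-9])", str2)
print(odd_numbers)  # ['345', '43']
```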

Your re matches:
Start of string
0 or more digits (0-9)
One even digit
End of string
That does not match your string, so you should drop the start-of-string ^ and end-of-string $ markers, as well as the surrounding slashes.
To find an even number, match any number of digits followed by an even digit: '[0-9]*[02468]'. Note that on its own this also matches an even-ending prefix of an odd number, such as '34' in '345'.

Not sure what exactly you want to extract from the string, but to match a single even digit use a character class: [02468] (it matches any one of the listed digits).

How to extract date in yyyy-yyyy format using regex python

I know this is basic, but can someone please provide a regex solution to extract "1234-5678" out of "abcfd1234-5678gfvjh"? The leading and trailing strings can be anything, and they might not always be there, i.e. the string can be just "1234-5678" as well. It is guaranteed that there will be no letters between the numbers; only "-" can appear there. There is one more format of the string, "1234-56", i.e. the second number can be of length 2 or 4. Please see the examples below:
input :a = "abcfd1234-5678gfvjh"
output :"1234-5678"
input :a = "abcfd1234-56gfvjh"
output :"1234-56"
input :a = "1234-5678hgjg"
output :"1234-5678"
input :a = "abcfd1234-5678"
output :"1234-5678"
input :a = "1234-56"
output :"1234-56"

\d{4}[-–](?:\d{4}|\d{2})
See an explanation here: https://regex101.com/r/kocRuY/2
Basically we say to search for four digits then a hyphen then either (using a non-capturing group to bracket) four digits or, failing that, two digits.
You should use the regex "search" method rather than "match" method as the processor will have to find where the sequence starts in the string. If you are restricted to matching from the start with "match", then you could add some sort of quantifier at the start to gobble up the start characters.
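As a sketch, the pattern can be run with search over the five inputs from the question. The names YEAR_RANGE and samples are only illustrative.

```python
import re

# Four digits, a hyphen or en dash, then four digits or, failing that, two.
YEAR_RANGE = re.compile(r"\d{4}[-–](?:\d{4}|\d{2})")

samples = [
    "abcfd1234-5678gfvjh",
    "abcfd1234-56gfvjh",
    "1234-5678hgjg",
    "abcfd1234-5678",
    "1234-56",
]

# search() scans the whole string; match() would only try position 0.
results = [YEAR_RANGE.search(s).group(0) for s in samples]
print(results)
# ['1234-5678', '1234-56', '1234-5678', '1234-5678', '1234-56']

# match() fails when the range is preceded by other characters.
leading = YEAR_RANGE.match("abcfd1234-5678gfvjh")
print(leading)  # None
```

The four-digit alternative is listed first so that "1234-5678" is not cut short to "1234-56".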

>>> import re
>>> re.findall(r'\d+-\d+', "abcfd1234-5678gfvjh")
['1234-5678']
you can try different regexes in https://regex101.com/

Surely a dozen duplicates on StackOverflow.
As the request occurs very often, there's a module called datefinder (pip install datefinder). You'd then call it like this:
import datefinder
matches = datefinder.find_dates(your_string_here)
for match in matches:
    print(match)

Dynamically Removing string with regex python

I am currently having trouble removing the end of strings using regex. I tried .partition without success, and I am now trying regex, also without success. All the strings follow the format some random words **X*.* Some more words, where * is a digit and X is a literal X, for example 21X2.5. Everything after this dynamic string should be removed. I am trying to use re.sub('\d\d\X\d.\d', string). Can someone point me in the right direction with regex and how to split the string?
The expected output should read:
some random words 21X2.5
Thanks!

Use the following regex:
re.search("(.*?\d\dX\d\.\d)", "some random words 21X2.5 Some more words").groups()[0]
Output:
'some random words 21X2.5'
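One caveat with calling .groups() directly: re.search returns None when nothing matches. A minimal sketch of a guarded version follows; the helper name keep_through_code and the fallback of returning the text unchanged are assumptions of mine, not part of the answer.

```python
import re

# Hypothetical helper name; the pattern is the one from the answer above.
CODE_PREFIX = re.compile(r"(.*?\d\dX\d\.\d)")


def keep_through_code(text):
    """Return text up to and including the first NNXN.N code, or text unchanged."""
    found = CODE_PREFIX.search(text)
    # search() returns None when nothing matches, so guard before using it.
    return found.group(1) if found else text


print(keep_through_code("some random words 21X2.5 Some more words"))
# some random words 21X2.5
print(keep_through_code("no code in this one"))
# no code in this one
```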

Your regex is not correct. The biggest problem is that you need to escape the period. Otherwise, the regex treats the period as a match to any character. To match just that pattern, you can use something like:
re.findall('[\d]{2}X\d\.\d', 'asb12X4.4abc')
[\d]{2} matches a sequence of two digits, X matches the literal X, \d matches a single digit, \. matches the literal ., and \d matches the final digit.
This will match and return only 12X4.4.
It sounds like you instead want to remove everything after the matched expression. To get your desired output, you can do something like:
re.split('(.*?[\d]{2}X\d\.\d)', 'some random words 21X2.5 Some more words')[1]
which will return some random words 21X2.5. This expression pulls everything before and including the matched regex and returns it, discarding the end.
Let me know if this works.

To remove everything after the pattern, i.e. do exactly as you say:
s = re.sub(r'(\d\dX\d\.\d).*', r'\1', s)
Of course, if you mean something other than what you said, something different will be needed! E.g. if you also want to remove the pattern itself, not just (as you said) what's after it:
s = re.sub(r'\d\dX\d\.\d.*', r'', s)
and so forth, depending on what exactly your specs are!-)
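Both substitutions can be tried on the example string from the question. The last lines are an extra check of mine, showing why the unescaped period in the question's pattern is too permissive.

```python
import re

s = "some random words 21X2.5 Some more words"

# Keep the pattern, drop everything after it.
kept = re.sub(r"(\d\dX\d\.\d).*", r"\1", s)
print(kept)  # some random words 21X2.5

# Drop the pattern and everything after it (a trailing space remains).
dropped = re.sub(r"\d\dX\d\.\d.*", "", s)
print(repr(dropped))  # 'some random words '

# The unescaped "." from the question also matches other characters.
loose = re.search(r"\d\dX\d.\d", "21X2-5")
print(loose.group(0))  # 21X2-5
```

By default . does not match newlines, so for multi-line input the re.DOTALL flag would be needed to remove text on following lines as well.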

Regular expression capturing entire match consisting of repeated groups

I've looked through the forums but could not find exactly how to solve my problem.
Let's say I have a string like the following:
UDK .636.32/38.082.4454.2(575.3)
and I would like to match the expression with a regex, capturing the actual number (in this case the '.636.32/38.082.4454.2(575.3)').
There could be some garbage characters between the 'UDK' and the actual number, and characters like '.', '/' or '-' are valid parts of the number. Essentially the number is a sequence of digits separated by some allowed characters.
What I've come up with is the following regex:
'UDK.*(\d{1,3}[\.\,\(\)\[\]\=\'\:\"\+/\-]{0,3})+'
but it does not capture the '.636.32/38.082.4454.2(575.3)'! It leaves me with nothing more than the last digit of the last group (3 in this case).
Any help would be greatly appreciated.

First, you need a non-greedy .*?.
Second, you don't need to escape most of those characters inside [ ].
Third, why not simply treat the number as a sequence of digits AND allowed characters? Why use \d{1,3} when the example contains 4454?
>>> re.match(r'UDK.*?([\d.,()\[\]=\':"+/-]+)', s).group(1)
'.636.32/38.082.4454.2(575.3)'
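A short sketch of the underlying behaviour: a repeated capturing group only keeps its last repetition, which is one reason the original pattern yields so little. The "1,2,3" input is a simplified example of my own; the final pattern is the one from the answer above.

```python
import re

s = "UDK .636.32/38.082.4454.2(575.3)"

# A repeated capturing group only remembers its last repetition.
last_only = re.match(r"(\d,?)+", "1,2,3").group(1)
print(last_only)  # 3

# Wrapping the repetition in an outer group captures the whole run.
whole_run = re.match(r"((?:\d,?)+)", "1,2,3").group(1)
print(whole_run)  # 1,2,3

# The character-class approach from the answer above, applied to the example.
number = re.match(r'UDK.*?([\d.,()\[\]=\':"+/-]+)', s).group(1)
print(number)  # .636.32/38.082.4454.2(575.3)
```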

Not so much a direct answer to your problem, but a general regexp tip: use Kodos (http://kodos.sourceforge.net/). It is simply awesome for composing/testing out regexps. You can enter some sample text, and "try out" regular expressions against it, seeing what matches, groups, etc. It even generates Python code when you're done. Good stuff.
Edit: using Kodos I came up with:
UDK.*?(?P<number>[\d/.)(]+)
as a regexp which matches the given example. Code that Kodos produces is:
import re
rawstr = r"""UDK.*?(?P<number>[\d/.)(]+)"""
matchstr = """UDK .636.32/38.082.4454.2(575.3)"""
# method 1: using a compile object
compile_obj = re.compile(rawstr)
match_obj = compile_obj.search(matchstr)
# Retrieve group(s) by name
number = match_obj.group('number')
