How to extract date in yyyy-yyyy format using regex python

How to extract date in yyyy-yyyy format using regex python - python

I know this is basic but can someone please provide a regex solution to extract "1234-5678" out of "abcfd1234-5678gfvjh". Here the leading and trailing strings can be anything and they might not be there always i.e. the string can be just "1234-5678" as well. It is guaranteed that there will no be alphabet between the numbers only "-" can be there. There is one more format of the string "1234-56". i.e. the second number can be of length 2 or 4. Please see the below explanation:
input :a = "abcfd1234-5678gfvjh"
output :"1234-5678"
input :a = "abcfd1234-56gfvjh"
output :"1234-56"
input :a = "1234-5678hgjg"
output :"1234-5678"
input :a = "abcfd1234-5678"
output :"1234-5678"
input :a = "1234-56"
output :"1234-56"

\d{4}[-–](?:\d{4}|\d{2})
See an explanation here: https://regex101.com/r/kocRuY/2
Basically we say to search for four digits then a hyphen then either (using a non-capturing group to bracket) four digits or, failing that, two digits.
You should use the regex "search" method rather than "match" method as the processor will have to find where the sequence starts in the string. If you are restricted to matching from the start with "match", then you could add some sort of quantifier at the start to gobble up the start characters.

>>> import re
>>> re.findall('\d+-\d+', "abcfd1234-5678gfvjh")
['1234-5678']
you can try different regexes in https://regex101.com/

Surely a dozen duplicates on StackOverflow.
As the request occurs very often, there's a module called datefinder (pip install datefinder). You'd then call it like this:
import datefinder
matches = datefinder.find_dates(your_string_here)
for match in matches:
print (match)

Related

Python regex expression example

I have an input that is valid if it has this parts:
starts with letters(upper and lower), numbers and some of the following characters (!,#,#,$,?)
begins with = and contains only of numbers
begins with "<<" and may contain anything
example: !!Hel##lo!#=7<<vbnfhfg
what is the right regex expression in python to identify if the input is valid?
I am trying with
pattern= r"([a-zA-Z0-9|!|#|#|$|?]{2,})([=]{1})([0-9]{1})([<]{2})([a-zA-Z0-9]{1,})/+"
but apparently am wrong.

For testing regex I can really recommend regex101. Makes it much easier to understand what your regex is doing and what strings it matches.
Now, for your regex pattern and the example you provided you need to remove the /+ in the end. Then it matches your example string. However, it splits it into four capture groups and not into three as I understand you want to have from your list. To split it into four caputre groups you could use this:
"([a-zA-Z0-9!##$?]{2,})([=]{1}[0-9]+)(<<.*)"
This returns the capture groups:
!!Hel##lo!#
=7
<<vbnfhfg
Notice I simplified your last group a little bit, using a dot instead of the list of characters. A dot matches anything, so change that back to your approach in case you don't want to match special characters.
Here is a link to your regex in regex101: link.

regex expression to get all digits before full stop

I wish to do as my title said but I cant seem to be able to do it.
string = "tex３５９１．４５" #please be aware that my digit is in half-width
text_temp = re.findall("(\d.)", string)
My current output is:
['３５', '９１', '４５']
My expected output is:
['3591.'] # with the "." at the end of the integer. No matter how many integer infront of this full stop

You need to escape the .:
text_temp = re.findall(r"\d+\.", string)
since . is a special character in regex, which matches any character. Added the + also to match 1 or more digits.
Or if you actually are using 'FULLWIDTH FULL STOP' (U+FF0E) you can just use the special character in the regex without escaping it:
text_temp = re.findall(r"\d+．", string)

You can use this regex along with re.findall to get your desired result
\d(?=.*?．)
will generate individual digits as answer
Demo in regex 101
\d+(?=.*?．)
Demo2
This will generate a bunch of numbers as one string
I used a positive lookahead and a greedy matching to check if there is a full stop after a certain digit and then give output. Hope this helps :).

Python regex find all matches

I'm using python 2.7 re library to find all numbers written in scientific form in a string. I'm using the following code:
import re
y = re.findall(".([0-9]+\.[0-9]+[eE][-+]?[0-9]+).","{8.25e+07|8.26206e+07}")
print y
However, the output is only ['8.25e+07'] while I'm expecting something like [('8.25e+07'),(8.26206e+07)]. I've been trying around but couldn't find where the problem is. If I input y = re.findall(".([0-9]+\.[0-9]+[eE][-+]?[0-9]+).","|8.26206e+07}") then it gives ['8.26206e+07'] so the pattern is matching the second number but I don't get it why it doesn't match both at the same time.

You are slightly overcomplicating your regex by misusing the . which matches any character while not actually needing it and using a capturing group () without really using it.
With your pattern you are looking for a number in scientific notation which has to be BOTH preceded and followed by exactly one character.
{8.25e+07|8.26206e+07}
[--------]
After re.findall traverses your string from the beginning it finds your defined pattern, which then drops the { and the | because of your capturing group (..) and saves this as a match. It then continues but only has 8.26206e+07} left. That now does not satisfy your pattern, because it is missing one "any" character for your first ., and no further match is found. Note that findall only looks for non-overlapping matches[1].
To illustrate, change your input string by duplicating your separator |:
>>> p = ".([0-9]+\.[0-9]+[eE][-+]?[0-9]+)."
>>> s = "{8.25e+07||8.26206e+07}"
>>> print(re.findall(p, s))
['8.25e+07', '8.26206e+07']
To satisfy your two .s you need two separators between any two numbers.
Two things I would change in your pattern, (1) remove the .s and (2) remove your capturing group ( ), you have no need for it:
p = "[0-9]+\.[0-9]+[eE][-+]?[0-9]+"
Capturing groups can be very useful if you need to refer to specific captured groups again later, but your task at hand has no need for them.
[1] https://docs.python.org/2/library/re.html?highlight=findall#re.findall

Because findall is documented to
... Return all non-overlapping matches of pattern in string, as a list of strings.
But your patterns overlap: the leading . of the second match would have to be the | character, but that was already consumed by the trailing . of the first match.
Just remove those non-captured .s at the start and end of your regex.

i think you have extra dots.
try this below
import re
y = re.findall("([0-9]+\.[0-9]+[eE][-+]?[0-9]+)","{8.25e+07|8.26206e+07}")
print (y)

When you use regular expressions to match. The default mode will be to find all non-overlapping matches. Using the dots at both the end and the beginning, you make them overlap.
"([0-9]+\.[0-9]+[eE][-+]?[0-9]+)"
should work

Dynamically Removing string with regex python

I am currently having trouble removing the end of strings using regex. I have tried using .partition with unsuccessful results. I am now trying to use regex unsuccessfully. All the strings follow the format of some random words **X*.* Some more words. Where * is a digit and X is a literal X. For Example 21X2.5. Everything after this dynamic string should be removed. I am trying to use re.sub('\d\d\X\d.\d', string). Can someone point me in the right direction with regex and how to split the string?
The expected output should read:
some random words 21X2.5
Thanks!

Use following regex:
re.search("(.*?\d\dX\d\.\d)", "some random words 21X2.5 Some more words").groups()[0]
Output:
'some random words 21X2.5'

Your regex is not correct. The biggest problem is that you need to escape the period. Otherwise, the regex treats the period as a match to any character. To match just that pattern, you can use something like:
re.findall('[\d]{2}X\d\.\d', 'asb12X4.4abc')
[\d]{2} matches a sequence of two integers, X matches the literal X, \d matches a single integer, \. matches the literal ., and \d matches the final integer.
This will match and return only 12X4.4.
It sounds like you instead want to remove everything after the matched expression. To get your desired output, you can do something like:
re.split('(.*?[\d]{2}X\d\.\d)', 'some random words 21X2.5 Some more words')[1]
which will return some random words 21X2.5. This expression pulls everything before and including the matched regex and returns it, discarding the end.
Let me know if this works.

To remove everything after the pattern, i.e do exactly as you say...:
s = re.sub(r'(\d\dX\d\.\d).*', r'\1', s)
Of course, if you mean something else than what you said, something different will be needed! E.g if you want to also remove the pattern itself, not just (as you said) what's after it:
s = re.sub(r'\d\dX\d\.\d.*', r'', s)
and so forth, depending on what, exactly, are your specs!-)

Regular expression capturing entire match consisting of repeated groups

I've looked thrould the forums but could not find exactly how exactly to solve my problem.
Let's say I have a string like the following:
UDK .636.32/38.082.4454.2(575.3)
and I would like to match the expression with a regex, capturing the actual number (in this case the '.636.32/38.082.4454.2(575.3)').
There could be some garbage characters between the 'UDK' and the actual number, and characters like '.', '/' or '-' are valid parts of the number. Essentially the number is a sequence of digits separated by some allowed characters.
What I've came up with is the following regex:
'UDK.*(\d{1,3}[\.\,\(\)\[\]\=\'\:\"\+/\-]{0,3})+'
but it does not group the '.636.32/38.082.4454.2(575.3)'! It leaves me with nothing more than a last digit of the last group (3 in this case).
Any help would be greatly appreciated.

First, you need a non-greedy .*?.
Second, you don't need to escape some chars in [ ].
Third, you might just consider it as a sequence of digits AND some allowed characters? Why there is a \d{1,3} but a 4454?
>>> re.match(r'UDK.*?([\d.,()\[\]=\':"+/-]+)', s).group(1)
'.636.32/38.082.4454.2(575.3)'

Not so much a direct answer to your problem, but a general regexp tip: use Kodos (http://kodos.sourceforge.net/). It is simply awesome for composing/testing out regexps. You can enter some sample text, and "try out" regular expressions against it, seeing what matches, groups, etc. It even generates Python code when you're done. Good stuff.
Edit: using Kodos I came up with:
UDK.*?(?P<number>[\d/.)(]+)
as a regexp which matches the given example. Code that Kodos produces is:
import re
rawstr = r"""UDK.*?(?P<number>[\d/.)(]+)"""
matchstr = """UDK .636.32/38.082.4454.2(575.3)"""
# method 1: using a compile object
compile_obj = re.compile(rawstr)
match_obj = compile_obj.search(matchstr)
# Retrieve group(s) by name
number = match_obj.group('number')

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to extract date in yyyy-yyyy format using regex python - python

>>> import re >>> re.findall('\d+-\d+', "abcfd1234-5678gfvjh") ['1234-5678'] you can try different regexes in https://regex101.com/

Surely a dozen duplicates on StackOverflow. As the request occurs very often, there's a module called datefinder (pip install datefinder). You'd then call it like this: import datefinder matches = datefinder.find_dates(your_string_here) for match in matches: print (match)

Related

Python regex expression example

regex expression to get all digits before full stop

Python regex find all matches

Dynamically Removing string with regex python

Regular expression capturing entire match consisting of repeated groups

Categories

Resources