Quick regular expressions question.
I want an expression that will find the first digit in a line and also a word at the end of that line. (this will exclude any digits in there)
For example, if the string is "12345hello", then I want the regular expression to find "1hello".
Or even if it's "12345hel45667lo" to find the same thing.
I have the first digit down but my expression I thought would work is:
print re.findall(r'^\d\D+',string)
This just gives me empty brackets, or the first digit if I take out the \D. What gives?
Edit: If I put in a | for or then I get what I want sort of. Returns the words in the string along with the first digit but in separate groupings. I want it all in one.
print re.findall(r'^\d|\D+',string)
print re.sub(r'(?<!^)\d', '', "12345hel45667lo9a") -> '1helloa'
The only thing I can think of is to run a for loop that scans across the string for letters and combines them.
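The original ^\d\D+ returns nothing because the character right after the first digit is another digit, so \D+ has nothing to match. A minimal Python 3 sketch of the re.sub approach shown above, assuming the goal is "keep a leading digit, drop every later digit" (the helper name first_digit_and_letters is made up for illustration):

```python
import re

def first_digit_and_letters(line):
    """Keep a digit at position 0 and remove all other digits (hypothetical helper)."""
    # (?<!^) is a negative lookbehind on the start anchor: it fails only at
    # position 0, so a digit at the very start of the string is left alone.
    return re.sub(r'(?<!^)\d', '', line)

print(first_digit_and_letters("12345hello"))       # 1hello
print(first_digit_and_letters("12345hel45667lo"))  # 1hello
```

This returns one string instead of the separate pieces that findall produces with the | alternation.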
So, quite recently I have been introduced to regular expressions in Python, and I've come across some code online to filter words from a string list that are contained in other substrings.
def Filter(string, substr):
    return [str for str in string
            if re.match(r'[^\d]+|^', str).group(0) in substr]
It seems pretty straightforward and it works pretty well for my specific problem I'm meeting, but I really can't wrap my head around the meaning of it and how it is working. It just seems very confusing. Can anyone explain to me as if I was a baby or something? My coding skills are not that great, and I'm still a rookie.
Just to be clear, the code works, and I'm happy to move on, I just don't understand this bit.
[^\d] matches any character that isn't a numeric digit; this can also be written as \D.
+ after a pattern means to match any sequence of characters that match the pattern, so [^\d]+ matches a sequence of non-digits.
| separates alternative patterns to match.
The second alternative ^ matches the beginning of the string. Every string will match this. I think they use this just to avoid the match failing, so that you can always call .group(0) on the result. They could accomplish the same thing by changing + to * in the first alternative, since this means that the matched sequence can be 0 repetitions.
re.match() looks for a match of the regexp at the beginning of the argument string. And .group(0) returns what was matched by the entire regexp. So this whole thing returns the initial sequence of non-digits in str.
Finally, the list comprehension returns any of the items in string whose initial sequence of non-digits is in substr.
With the simplifications I mentioned above, this can be rewritten:
def Filter(string, substr):
    return [item for item in string
            if re.match(r'\D*', item).group(0) in substr]
Note that if any of the items begin with a digit, the result of the regexp will be an empty string, and an empty string is a substring of every string. So these items will be included in the filter result. I suspect this is not the intended result.
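If items that start with a digit should be left out, one option is to skip empty prefixes explicitly. This is only a sketch under that assumption; the name filter_by_prefix and the sample data are invented for the example, and the second argument is treated as a plain string, so "in" is a substring test.

```python
import re

def filter_by_prefix(items, allowed):
    """Keep items whose leading run of non-digits is non-empty and occurs in `allowed`."""
    kept = []
    for item in items:
        # \D* always matches at the start, possibly as the empty string.
        prefix = re.match(r'\D*', item).group(0)
        # Guard against '' because the empty string is "in" every string.
        if prefix and prefix in allowed:
            kept.append(item)
    return kept

print(filter_by_prefix(["abc12", "42xyz", "zzz9"], "abcdef"))  # ['abc12']
```

Without the "prefix and" guard, "42xyz" would also be returned, which is the surprising behavior described above.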
I will try to explain this for you.
So basically we are defining a function named "Filter" that takes two arguments: "string" (the list of strings to go through) and "substr" (what each prefix is looked up in). The return statement uses a list comprehension: a for loop that goes through the items of the list one by one, with an if condition that calls re.match on each item.
As for r'[^\d]+|^': this is a regular expression pattern where \d matches a digit, [^\d] matches any character that is not a digit, and + means one or more repetitions. The | separates this from the alternative ^, which matches the start of the string. The surrounding parentheses belong to the re.match(...) call; they are not a capture group.
re.match:
re.match is a function that tries to match only at the beginning of the string and returns a match object if it succeeds. If the pattern only matches somewhere in the middle of the string, it returns None.
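A short illustration of that difference, using a made-up sample string; re.search is included only for contrast, since it scans the whole string:

```python
import re

text = "abc123"

# re.match only tries position 0, so a digit pattern fails here.
print(re.match(r'\d+', text))            # None
# re.search looks at every position and finds the digits in the middle.
print(re.search(r'\d+', text).group(0))  # 123

# The pattern from the question: the leading non-digits, or '' via the ^ fallback.
print(re.match(r'[^\d]+|^', text).group(0))            # abc
print(re.match(r'[^\d]+|^', "123abc").group(0) == "")  # True
```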
I have an input that is valid if it has these parts:
starts with letters (upper and lower case), numbers and some of the following characters (!,#,#,$,?)
then a part that begins with = and contains only numbers
then a part that begins with "<<" and may contain anything
example: !!Hel##lo!#=7<<vbnfhfg
what is the right regex expression in python to identify if the input is valid?
I am trying with
pattern= r"([a-zA-Z0-9|!|#|#|$|?]{2,})([=]{1})([0-9]{1})([<]{2})([a-zA-Z0-9]{1,})/+"
but apparently am wrong.
For testing regex I can really recommend regex101. Makes it much easier to understand what your regex is doing and what strings it matches.
Now, for your regex pattern and the example you provided, you need to remove the /+ at the end. Then it matches your example string. However, it splits it into five capture groups, not the three parts I understand you want from your list. To split it into three capture groups you could use this:
"([a-zA-Z0-9!##$?]{2,})([=]{1}[0-9]+)(<<.*)"
This returns the capture groups:
!!Hel##lo!#
=7
<<vbnfhfg
Notice I simplified your last group a little bit, using a dot instead of the list of characters. A dot matches anything, so change that back to your approach in case you don't want to match special characters.
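To check validity in Python, one possible sketch is to compile the three-group pattern and use fullmatch, so that the whole input has to fit. The names INPUT_RE and split_input are made up for this example, and [=]{1} is written as a plain =, which matches the same thing.

```python
import re

# Three groups: leading characters, "=" plus digits, "<<" plus anything.
INPUT_RE = re.compile(r"([a-zA-Z0-9!#$?]{2,})(=[0-9]+)(<<.*)")

def split_input(text):
    """Return the three parts as a tuple, or None if the input is not valid."""
    match = INPUT_RE.fullmatch(text)
    return match.groups() if match else None

print(split_input("!!Hel##lo!#=7<<vbnfhfg"))  # ('!!Hel##lo!#', '=7', '<<vbnfhfg')
print(split_input("=7<<vbnfhfg"))             # None
```

Inside a character class, |, $ and ? are ordinary characters, which is why the | separators from the original pattern are not needed.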
I have a piece of code that records times in this format:
0.0-8.0
0.0-9.0
0.0-10.0
I want to use a regular expression that will find all of these strings and have checked here and here for help but am still confused. I understand how to do it if I only wanted to do single digit numbers, but I can't figure out how to handle double digit numbers like 10 or 20.
It is also important that the expression does not find the string
0.0-1.0
as it should be ignored.
So far my expression looks like this:
expression = re.compile(r',0\.0\-[0-2][0-9]')
If you want to match each line shown in your question, try an expression like this:
0\.0\-[0-2]?\d\.\d
\d is the same as [0-9]. The ? means 0 or 1 occurrences, so this will only match 1- or 2-digit numbers. If you need the comma at the start of the regex, add that in.
If you want to exclude 0.0-1.0, then you should do that in code, not in the regular expression, since that would make it less readable. But if you insist, I have included one that will exclude that string for you:
0\.0\-[0-2]?[0-9]\.(?<!0-1\.)\d
This uses a negative lookbehind to ensure the previous part is not 0-1., which would only occur in the match you didn't want.
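The "do it in code" option could look roughly like this. It is a sketch with an invented sample string and invented names (RANGE_RE, find_ranges); it uses the simpler pattern and filters the unwanted value afterwards.

```python
import re

# The simpler pattern: "0.0-" followed by a 1- or 2-digit number, a dot and a digit.
RANGE_RE = re.compile(r'0\.0-[0-2]?\d\.\d')

def find_ranges(text, ignore=("0.0-1.0",)):
    """Return all matching time ranges except the ones listed in `ignore`."""
    return [found for found in RANGE_RE.findall(text) if found not in ignore]

sample = "0.0-8.0,0.0-9.0,0.0-10.0,0.0-1.0"
print(find_ranges(sample))  # ['0.0-8.0', '0.0-9.0', '0.0-10.0']
```

Keeping the exclusion in a tuple makes it easy to ignore further values later without touching the regular expression.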
I wanted to search a string for a substring beginning with ">"
Does this syntax say what I want it to say: this character followed by anything.
regex_firstline = re.compile("[>]{1}.*")
As a Pythonic way to do such tasks, you can use the str.startswith() method and don't need a regex.
But about your regex "[>]{1}.*": you don't need {1} after your character class, and you can mark the start of the string with the anchor ^. So it can be "^>.*"
Using http://regex101.com:
[>]{1} matches the single character > literally, exactly one time (the site also notes that {1} is a meaningless quantifier), and
.* then matches any character as many times as possible.
If a list was provided inside square brackets (as opposed to a single character), regex would attempt to match a single character within the list exactly one time. http://regex101.com has a good listing of tokens and what they mean.
An ideal regex expression would be ^[>].*, meaning at the beginning of a string find exactly one > character followed by anything else (and with only one character in the square brackets, you can remove those to simplify it even further: ^>.*).
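A small comparison of the two approaches, using made-up sample lines; for lines starting with ">" both select the same entries:

```python
import re

FIRSTLINE_RE = re.compile(r'^>.*')

lines = [">seq1 description", "ACGT", "x > y"]

# Regex version: the ^ anchor makes the start-of-string requirement explicit.
by_regex = [line for line in lines if FIRSTLINE_RE.match(line)]
# Plain string method: no regex needed for a fixed prefix.
by_method = [line for line in lines if line.startswith(">")]

print(by_regex)               # ['>seq1 description']
print(by_regex == by_method)  # True
```

Note that "x > y" is not selected: the > there is in the middle of the line, not at the start.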
I'm trying to use regular expressions to find three or more of the same character in a string. So for example:
'hello' would not match
'ohhh' would.
I've tried doing things like:
re.compile('(?!.*(.)\1{3,})^[a-zA-Z]*$')
re.compile('(\w)\1{5,}')
but neither seem to work.
(\w)\1{2,} is the regex you are looking for.
In Python it could be quoted like r"(\w)\1{2,}"
If you're looking for the same character three times consecutively, you can do this:
(\w)\1\1
If you want to find the same character three times anywhere in the string, you need to put a dot and an asterisk between the parts of the expression above, like so:
(\w).*\1.*\1
The .* matches any number of any character, so this expression should match any string which has any single word character that appears three or more times, with any number of any characters in between them.
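Both variants can be tried side by side in Python. This sketch uses raw strings, which matters here: in a normal string literal '\1' is the character with code 1, not a backreference, which is one reason the attempts in the question fail. The names CONSECUTIVE and ANYWHERE and the test words are chosen for the example.

```python
import re

CONSECUTIVE = re.compile(r'(\w)\1{2,}')    # same character 3+ times in a row
ANYWHERE = re.compile(r'(\w).*\1.*\1')     # same character 3+ times, gaps allowed

for word in ("hello", "ohhh", "banana"):
    print(word, bool(CONSECUTIVE.search(word)), bool(ANYWHERE.search(word)))
# hello False False
# ohhh True True
# banana False True
```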
Hope that helps.