This question already has answers here:
Carets in Regular Expressions
(2 answers)
Difference between * and + regex
(7 answers)
Closed 5 years ago.
I apologize for the poorly worded question.
I have a large number of strings like:
"ODLS_ND33283633__PS1185"
where the letters up to the first "_" are a header and the remainder (ND33283633__PS1185) is a unique ID.
I wrote a regex in Python to remove everything up to and including the first "_", wanting
"ND33283633__PS1185"
as the end result.
I figured something like:
.*_? or .+?_
Would do the trick, but that was not the case...
I kept trying various regexes without success, and finally found someone else's answer online that I could adapt into:
^[^_]+_
Which gave me my desired result, but now I have questions which I can't figure out the answer for:
I found that removing the "^" at the front and writing it as:
[^_]+_
caused the regex to remove everything up to the second "_" so the resulting string was:
"_PS1185"
I understand that "^" anchors the match to the beginning of the line, but why does leaving it out remove everything up to the second "_"?
My understanding is that [^_]+ matches one or more characters that are NOT "_", so why does including the "^" make the removal stop at the first "_", while excluding it makes it stop at the second?
Another thing, when I replaced the "+" symbol with a "*":
[^_]*_
I expected the same result but instead got:
PS1185
I thought that * matches 0 or more, while + matches 1 or more, so they're effectively the same except + is supposed to be more 'strict'. However, seeing these results makes me feel like I don't fully understand how regex is behaving. Is there anyone here that can please explain what is actually going on?
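A minimal sketch of what seems to be happening, assuming the patterns were used with re.sub (the question does not show the call, so that is an assumption). re.sub replaces every non-overlapping match, not just the first. Without "^" the pattern can match again further along the string, and with "*" even a lone "_" counts as a match because zero preceding characters are allowed.

```python
import re

s = "ODLS_ND33283633__PS1185"

# Anchored: "^" allows a match only at position 0, so only the header goes.
anchored = re.sub(r"^[^_]+_", "", s)

# Unanchored "+": "ODLS_" and "ND33283633_" both match and are removed.
# The lone "_" that follows is not a match, because "+" needs at least
# one non-underscore character in front of it.
plus_all = re.sub(r"[^_]+_", "", s)

# Unanchored "*": zero non-underscore characters is allowed, so the lone
# "_" is a third match and is removed as well.
star_all = re.sub(r"[^_]*_", "", s)

# Two ways to stop after the first match without using "^".
first_only = re.sub(r"[^_]+_", "", s, count=1)
by_split = s.split("_", 1)[1]

print(anchored)    # ND33283633__PS1185
print(plus_all)    # _PS1185
print(star_all)    # PS1185
print(first_only)  # ND33283633__PS1185
print(by_split)    # ND33283633__PS1185
```

If only the first header ever needs removing, count=1 or str.split with maxsplit=1 states that intent directly.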
This question already has answers here:
Best way to strip punctuation from a string
(32 answers)
Closed 1 year ago.
How would I remove punctuation in this function? Would "str.maketrans('', '', string.punctuation)" work? And where in the function should I write it?
def convert(lst):
    return " ".join(lst).split()

lst = ["Good for the price, but poor Bluetooth connections."]
print(convert(lst))
In general on Stack Overflow, you would have been better editing your original question with an update on your progress.
string.punctuation is definitely a step in the right direction. You've got a few options including:
str.strip() (removes punctuation only from the ends of a string, so it has to be applied to each word)
str.translate() (with or without maketrans())
As to where? If you do it after you create the list, you'll need to apply it to each element individually. But if you split the string transformation calls (join, split) on the return line into their own lines, you could do it earlier. Try performing one action at a time, then printing the result, to see where else you can remove punctuation without having to iterate.
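One possible arrangement, as a sketch rather than the only answer: do the join, the punctuation removal, and the split on separate lines, so the translation runs once on the whole string before it becomes a list. The intermediate variable names are illustrative.

```python
import string


def convert(lst):
    # Step 1: join the list into one string.
    text = " ".join(lst)
    # Step 2: delete every character listed in string.punctuation.
    table = str.maketrans("", "", string.punctuation)
    cleaned = text.translate(table)
    # Step 3: split into words.
    return cleaned.split()


lst = ["Good for the price, but poor Bluetooth connections."]
result = convert(lst)
print(result)
# ['Good', 'for', 'the', 'price', 'but', 'poor', 'Bluetooth', 'connections']
```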
This question already has answers here:
Remove Last instance of a character and rest of a string
(5 answers)
Closed 3 years ago.
I have a string such as:
string="lcl|NC_011588.1_cds_YP_002321424.1_1"
and I would like to keep only: "YP_002321424.1"
So I tried :
string = re.sub(r".*_cds_", "", string)
string = re.sub(r"_\d", "", string)
But the first "_" is removed too.
Does anyone have an idea?
Note: the numbers are not fixed; they can change.
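A short sketch of why the first underscore disappears, and one possible fix. The pattern _\d matches any underscore followed by a digit, so it also hits the "_0" inside the ID. Anchoring it to the end of the string with $ is an assumption about the intent (strip only a trailing "_<digits>" suffix).

```python
import re

s = "lcl|NC_011588.1_cds_YP_002321424.1_1"

# Step 1 from the question: drop everything up to and including "_cds_".
tail = re.sub(r".*_cds_", "", s)

# Step 2 from the question: "_\d" matches "_0" as well as the trailing "_1".
unanchored = re.sub(r"_\d", "", tail)

# Anchored variant: only an underscore plus digits at the very end is removed.
anchored = re.sub(r"_\d+$", "", tail)

print(tail)        # YP_002321424.1_1
print(unanchored)  # YP02321424.1
print(anchored)    # YP_002321424.1
```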
"Ordinary" split, as proposed in the other answer, is not enough,
because you also want to strip the trailing _1, so the part to capture
should end after a dot and digit.
Try the following pattern:
(?<=_cds_)\w+\.\d
For a working example see https://regex101.com/r/U2QsFH/1
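In Python the pattern above could be applied with re.search, roughly as follows. This is a sketch: the single \d after the dot assumes a one-digit version number, as in the sample string.

```python
import re

s = "lcl|NC_011588.1_cds_YP_002321424.1_1"

# The lookbehind requires "_cds_" just before the match without consuming it.
# \w+ covers letters, digits and underscores, then a dot and one digit follow.
m = re.search(r"(?<=_cds_)\w+\.\d", s)
protein_id = m.group(0) if m else None

print(protein_id)  # YP_002321424.1
```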
Don't bother with regexes, a simple
string.split('_cds_')[1]
will be enough
This question already has answers here:
re.findall behaves weird
(3 answers)
Closed 3 years ago.
I want to achieve the following:
Say I have two regexes, regex1 and regex2. I want to construct a new regex equivalent to 'prefix_regex1|prefix_regex2'. What syntax should I use to share the prefix? I tried 'prefix_(regex1|regex2)', but it's not working; I think the parentheses are being treated as a capturing group rather than just raising the precedence of |.
example:
I have two strings that should both match the pattern:
prefix_123
prefix_abc
I wrote this pattern: prefix_(\d*|\D*) that tries to capture both cases, but when I run it against prefix_abc it's only matching prefix_, not the entire string.
This site might help with this problem (and others). It lets you tinker with the regex and see the result both graphically and in code: https://www.debuggex.com/
For example, I changed your regex to this: prefix_(\d+|\D+), which requires one or more digits or one or more non-digits after "prefix_". Not sure if that's what you are looking for, but it's easy to experiment with the site I shared above.
Hope it helps.
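A small sketch comparing the two patterns in Python. With \d*, the first alternative succeeds by matching zero digits, so the match ends right after "prefix_". The non-capturing group (?:...) is an addition here, not part of the answer above; it matters mainly with re.findall, which returns the group contents whenever the pattern has a capturing group.

```python
import re

loose = re.compile(r"prefix_(\d*|\D*)")
strict = re.compile(r"prefix_(?:\d+|\D+)")

# \d* is tried first and is satisfied by zero digits.
loose_abc = loose.match("prefix_abc").group(0)

# With "+", the digit branch fails on "abc" and the \D+ branch is tried.
strict_abc = strict.match("prefix_abc").group(0)
strict_123 = strict.match("prefix_123").group(0)

# findall returns the group, not the whole match, when a group captures.
captured = re.findall(r"prefix_(\d+|\D+)", "prefix_abc")
whole = strict.findall("prefix_abc")

print(loose_abc)   # prefix_
print(strict_abc)  # prefix_abc
print(strict_123)  # prefix_123
print(captured)    # ['abc']
print(whole)       # ['prefix_abc']
```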
This question already has an answer here:
Restricting character length in a regular expression
(1 answer)
Closed 4 years ago.
So I have a regex that goes like:
regex1= re.compile(r'\S+#\S+')
This works perfectly, but I am trying to add a character limit so that the total number of characters is less than 20.
I tried re.compile(r'\S+#\S+{5,20}') but it keeps giving me an error. It seems like a simple fix, but I can't see what I am doing wrong.
You can't follow one quantifier (+) directly with another ({5,20}), so \S+{5,20} is not a valid pattern. If you're doing this in Python, I'd suggest just using the len(...) function on the string in addition to the regex to verify. For example:
if regex1.match(email) and (len(email) < 20):
    ...
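A fuller sketch of the same idea, with some assumptions spelled out: is_short_match is a hypothetical helper name, and it uses fullmatch rather than match so that the whole string has to fit the pattern. The lookahead variant is an alternative not mentioned above; it puts the length cap inside the regex itself.

```python
import re

# The stacked quantifier from the question is rejected at compile time.
try:
    re.compile(r"\S+#\S+{5,20}")
    stacked_ok = True
except re.error:
    stacked_ok = False

regex1 = re.compile(r"\S+#\S+")


def is_short_match(text, limit=20):
    """Hypothetical helper: regex check plus a separate length check."""
    return len(text) < limit and regex1.fullmatch(text) is not None


# Same rule in one pattern: the lookahead caps the total length at 19.
regex_capped = re.compile(r"(?=.{1,19}\Z)\S+#\S+")

print(stacked_ok)                                          # False
print(is_short_match("abc#def"))                           # True
print(is_short_match("a" * 25 + "#b"))                     # False
print(regex_capped.fullmatch("abc#def") is not None)       # True
print(regex_capped.fullmatch("a" * 25 + "#b") is not None) # False
```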
This question already has answers here:
Python for-in loop preceded by a variable [duplicate]
(5 answers)
Closed 4 years ago.
I'm new to Python, so I was hoping somebody could break down the following statement and explain the purpose of each part.
[digit for digit in string.split() if digit.isdigit()][0]
Obviously for digit in string.split() creates a list of substrings by separating the string into elements at each space.
What confuses me is the digit at the very beginning and the if statement at the very end.
Is the very first digit what will be returned if digit.isdigit()?
Why must this statement be wrapped in a list?
I've never seen a for loop and an if statement combined into one statement like this before, but it reminds me of a particular JS syntax: for (condition) // whatever or if (condition) // whatever. However, in JS you can't combine them into a single statement (i.e. for (condition) if (condition) // whatever).
This is called a list comprehension. You will find plenty of pages explaining how it works. Just ask your favorite search engine.
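To make that concrete, here is a sketch that unrolls the statement from the question into an ordinary loop. The sample text is made up for illustration. The leading digit is the expression stored for each item that passes the if filter, the brackets build a list of those items, and [0] picks the first one.

```python
text = "order 66 shipped in 3 days"  # made-up sample input

# The statement from the question.
first = [digit for digit in text.split() if digit.isdigit()][0]

# The same logic written as a plain loop.
matches = []
for digit in text.split():      # iterate over whitespace-separated words
    if digit.isdigit():         # the trailing "if" is a filter
        matches.append(digit)   # the leading "digit" is what gets stored
first_from_loop = matches[0]

# A variant that stops at the first hit and tolerates no match at all.
first_lazy = next((w for w in text.split() if w.isdigit()), None)

print(first)            # 66
print(first_from_loop)  # 66
print(first_lazy)       # 66
```

Note that the [0] version raises IndexError when no word is all digits, whereas the next(...) variant returns None in that case.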