This question already has answers here:
Remove Last instance of a character and rest of a string
(5 answers)
Closed 3 years ago.
I have a string such as:
string="lcl|NC_011588.1_cds_YP_002321424.1_1"
and I would like to keep only: "YP_002321424.1"
So I tried:
string=re.sub(".*_cds_","",string)
string=re.sub("_\d","",string)
But the first _ is removed too.
Does someone have an idea?
Note: The number can change (they are not fixed).
"Ordinary" split, as proposed in the other answer, is not enough,
because you also want to strip the trailing _1, so the part to capture
should end after a dot and digit.
Try the following pattern:
(?<=_cds_)\w+\.\d
For a working example see https://regex101.com/r/U2QsFH/1
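A minimal sketch of how the pattern above might be applied with Python's re module. The variable name accession is only illustrative and does not come from the question:

```python
import re

string = "lcl|NC_011588.1_cds_YP_002321424.1_1"

# (?<=_cds_) is a lookbehind: the match must start right after "_cds_".
# \w+\.\d then takes word characters, a literal dot, and one digit.
match = re.search(r"(?<=_cds_)\w+\.\d", string)
accession = match.group(0) if match else None
print(accession)  # YP_002321424.1
```

Note that \d matches a single digit, so if the version after the dot can have several digits, \d+ would probably be the safer choice.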
Don't bother with regexes, a simple
string.split('_cds_')[1]
will be enough
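On its own, the split above still leaves the trailing _1 in place. A small sketch of one way to drop it without a regex, assuming the unwanted suffix always follows the last underscore (the variable names are illustrative):

```python
string = "lcl|NC_011588.1_cds_YP_002321424.1_1"

# Everything after the "_cds_" marker; this still carries the trailing "_1".
after_marker = string.split("_cds_")[1]

# rsplit with a limit of 1 cuts at the LAST underscore only.
accession = after_marker.rsplit("_", 1)[0]

print(after_marker)  # YP_002321424.1_1
print(accession)     # YP_002321424.1
```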
This question already has answers here:
Python string.strip stripping too many characters [duplicate]
(3 answers)
Strip removing more characters than expected
(2 answers)
How to remove the left part of a string?
(21 answers)
Closed 13 days ago.
I have the following list of elements named 'files_temp':
['CDS_SPREAD_AA1EUNBCBM', 'CDS_SPREAD_AA1EUNCCBM', 'CDS_SPREAD_AA1USNBCBM', 'CDS_SPREAD_AA1USNCCBM', 'CDS_SPREAD_AALLN1EUNECBM', 'CDS_SPREAD_AALLN1USNECBM', 'CDS_SPREAD_ABB3EUNECBM', 'CDS_SPREAD_ABB3USNECBM', 'CDS_SPREAD_ABX1EUNCCBM', 'CDS_SPREAD_ABX1USNCCBM', 'CDS_SPREAD_ACAFP1EUBECBM', 'CDS_SPREAD_ACAFP1EUNECBM', 'CDS_SPREAD_ACOM1JPNACBM', 'CDS_SPREAD_ACOM1USNACBM', 'CDS_SPREAD_AEGON1EUBACBM', 'CDS_SPREAD_AEGON1EUNECBM', 'CDS_SPREAD_AEGON1JPBACBM', 'CDS_SPREAD_AEGON1USBACBM', 'CDS_SPREAD_AEGON1USNECBM', 'CDS_SPREAD_AEP1USNBCBM' ...]
I would like to keep only the alphanumeric codes, removing the CDS_SPREAD_ part and tried the following code:
files_temp=[elem.strip('CDS_SPREAD_') for elem in files_temp]
However, besides the CDS_SPREAD_ part it is also removing a part of the alphanumeric code:
['1EUNBCBM', '1EUNCCBM', '1USNBCBM', '1USNCCBM', 'LLN1EUNECBM', 'LLN1USNECBM', 'BB3EUNECBM', 'BB3USNECBM', 'BX1EUNCCBM', 'BX1USNCCBM', 'FP1EUBECBM', 'FP1EUNECBM', 'OM1JPNACBM', 'OM1USNACBM', 'GON1EUBACBM', 'GON1EUNECBM', 'GON1JPBACBM', 'GON1USBACBM', 'GON1USNECBM', '1USNBCBM', '1USNCCBM', 'T1EUNCCBM', 'T1USNBCBM' ...]
For instance, for the first element, in theory I should get AA1EUNBCBM instead of 1EUNBCBM. Would you know why this is happening? I would highly appreciate an alternative to solve the issue as well.
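strip() does not remove a prefix. It treats its argument as a set of characters and removes any of them from both ends of the string until it reaches a character that is not in the set. That is why leading letters such as A and C, which also occur in CDS_SPREAD_, disappear from your codes. For your case, you can use replace() instead (note that it removes every occurrence of the substring, not only a leading one):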
files_temp=[elem.replace('CDS_SPREAD_', '') for elem in files_temp]
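If only a leading CDS_SPREAD_ should be dropped, a small helper is one option. This is a sketch: remove_prefix is a hypothetical name, and on Python 3.9+ the built-in str.removeprefix does the same job. The two sample entries are taken from the list in the question:

```python
def remove_prefix(text, prefix):
    """Return text without a leading prefix; leave it unchanged otherwise."""
    if text.startswith(prefix):
        return text[len(prefix):]
    return text

files_temp = ['CDS_SPREAD_AA1EUNBCBM', 'CDS_SPREAD_ACAFP1EUBECBM']

# strip() removes any leading/trailing character found in the given set.
stripped = [elem.strip('CDS_SPREAD_') for elem in files_temp]
# The helper removes the exact prefix once, and only at the start.
cleaned = [remove_prefix(elem, 'CDS_SPREAD_') for elem in files_temp]

print(stripped)  # ['1EUNBCBM', 'FP1EUBECBM']
print(cleaned)   # ['AA1EUNBCBM', 'ACAFP1EUBECBM']
```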
This question already has answers here:
Split a string only by first space in python [duplicate]
(4 answers)
Closed 1 year ago.
I have a list in python:
name = ['A.A.BCD', 'B.B.AAD', 'B.A.A.D']
I wish to discard everything before the second '.' and keep the rest. Below is what I have come up with.
[n.split('.')[2] for n in name]
Above is working for all except the last entry. Any way to do this:
Expected output: ['BCD', 'AAD', 'A.D']
Read the documentation for split() and you’ll find it has an optional parameter for the maximum number of splits - use this to get the last one to work:
[n.split('.',maxsplit=2)[2] for n in name]
See https://docs.python.org/3/library/stdtypes.html?highlight=split#str.split
A big disadvantage of doing this as a one-liner is that it raises an IndexError if a string ever contains fewer than two dots, so a for loop with an explicit check can be more robust.
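A sketch of what a more defensive version might look like. The function name after_second_dot, the default argument, and the extra 'NODOTS' entry are all made up for illustration:

```python
def after_second_dot(text, default=None):
    """Return the part after the second '.', or default if there are fewer than two dots."""
    parts = text.split('.', 2)  # at most two splits, so at most three parts
    if len(parts) < 3:
        return default
    return parts[2]

# 'NODOTS' is a hypothetical entry added to show the fallback.
name = ['A.A.BCD', 'B.B.AAD', 'B.A.A.D', 'NODOTS']
result = [after_second_dot(n) for n in name]
print(result)  # ['BCD', 'AAD', 'A.D', None]
```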
name = ['A.A.BCD', 'B.B.AAD', 'B.A.A.D']
['.'.join(n.split('.')[2:]) for n in name]
result
['BCD', 'AAD', 'A.D']
This question already has answers here:
re.findall behaves weird
(3 answers)
Closed 3 years ago.
I want to achieve the following:
Say I have two regexes, regex1 and regex2. I want to construct a new regex equivalent to 'prefix_regex1 | prefix_regex2'. What syntax should I use to share the prefix? I tried 'prefix_(regex1|regex2)', but it's not working. I think the brackets are being treated as a capturing group rather than just raising the precedence of the |.
example:
I have two string that both should match the pattern:
prefix_123
prefix_abc
I wrote this pattern: prefix_(\d*|\D*) that tries to capture both cases, but when I run it against prefix_abc it's only matching prefix_, not the entire string.
This site might help with this problem (and others). It lets you tinker with the regex and see the result both graphically and in code: https://www.debuggex.com/
For example, I changed your regex to this: prefix_(\d+|\D+), which requires one or more digits or one or more non-digits after "prefix_". I'm not sure if that's what you are looking for, but it's easy to experiment with on the site I shared above.
Hope it helps.
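A short sketch of why the original pattern stops at prefix_ and how the + version behaves, using the two sample strings from the question. The names loose and strict are illustrative. The last lines touch on the re.findall behaviour mentioned in the linked duplicate, as far as I understand it:

```python
import re

# \d* can match the empty string, so the first alternative always succeeds
# and the second alternative is never tried.
loose = re.compile(r"prefix_(\d*|\D*)")
# With +, each alternative must consume at least one character.
strict = re.compile(r"prefix_(\d+|\D+)")

loose_hits = [loose.match(s).group(0) for s in ("prefix_123", "prefix_abc")]
strict_hits = [strict.match(s).group(0) for s in ("prefix_123", "prefix_abc")]
print(loose_hits)   # ['prefix_123', 'prefix_']
print(strict_hits)  # ['prefix_123', 'prefix_abc']

# With a capturing group, findall returns the group rather than the whole match.
# A non-capturing group (?:...) returns the whole match instead.
print(re.findall(r"prefix_(\d+|\D+)", "prefix_abc"))    # ['abc']
print(re.findall(r"prefix_(?:\d+|\D+)", "prefix_abc"))  # ['prefix_abc']
```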
This question already has answers here:
Python csv string to array
(10 answers)
In regex, match either the end of the string or a specific character
(2 answers)
Closed 3 years ago.
I need to capture words separated by tabs as illustrated in the image below.
The expression (.*?)[\t|\n] works well, except for the last line where a line feed is missing. Can anyone suggest a modification of the regular expression to also match the last word, i.e. Cheyenne? Link to code example
Replace [\t|\n] with (\t|$).
BTW, [\t|\n] is a character class, so the pipe | is literal here. You probably meant [\t\n].
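A hedged sketch of the idea in Python. The sample text is invented (only Cheyenne comes from the question), and the pattern swaps (.*?) for [^\t\n]+ so that no empty matches are produced. re.MULTILINE is assumed here so that $ also matches at the end of each line, not only at the end of the string:

```python
import re

# Hypothetical tab-separated data; the last line has no trailing line feed.
text = "Denver\tBoston\nAustin\tCheyenne"

# A word is a run of non-tab, non-newline characters followed by a tab or end of line.
words = re.findall(r"([^\t\n]+)(?:\t|$)", text, flags=re.MULTILINE)
print(words)  # ['Denver', 'Boston', 'Austin', 'Cheyenne']

# The same result without a regex: split into lines, then split each line on tabs.
words_split = [w for line in text.splitlines() for w in line.split("\t")]
print(words_split)  # ['Denver', 'Boston', 'Austin', 'Cheyenne']
```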
This question already has answers here:
Are there limits to using string.lstrip() in python? [duplicate]
(3 answers)
Closed 8 years ago.
So I have a super long string composed of integers and I am trying to extract and remove the first three numbers in the string, and I have been using the lstrip method (the idea is kinda like pop) but sometimes it would remove more than three.
x="49008410..."
x.lstrip(x[0:3])
"8410..."
I was hoping it would just remove 490 and return 08410 but it's being stubborn -_- .
Also I am running Python 2.7 on Windows... And don't ask why the integers are strings. If that bothers you, just replace them with letters. Same thing! LOL
Instead of removing the first three characters, take everything after the third position. You can do this with slice notation (the : operator).
x="49008410..."
x[3:]
>> "08410..."
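To see the difference side by side, here is a small sketch. The string is a shortened stand-in for the one in the question, since the full value is not shown there:

```python
x = "49008410"  # shortened stand-in for the long string in the question

# lstrip treats "490" as a set of characters and keeps removing
# any leading '4', '9' or '0' until it meets something else.
stripped = x.lstrip(x[0:3])

# Slicing drops exactly the first three characters, whatever they are.
sliced = x[3:]

print(stripped)  # 8410
print(sliced)    # 08410
```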