Find all substrings that start and end with specific characters - python

I need to find all substrings of a string a that start with M and end with _.
I tried:
import re
a = 'ICQLEFAKNASFSVSNVSKKNGEFSHAHEQDQNLRLIARQR_RSADGTPNKVNTSNVRCSTPIFGNNPFAQSLAHREYGHEGENVQCRPCGSLPSRKCQRNVHPKQQQQQQHQHCHRNSA_APAIRAAQAAGGDNSSRSEK_RAAAARIPVNDDSNMETSLALESRRRNHQSIEPLVRG_PCRQCNNRFSCTWAWRTM_PISNEAHIDLVELASLERADNC_NRPKYR_GLQPYHGNCSTLFK_IAGMSIFYHNTKILKCFM_RETL_F_NYVDN_VGILELL_KTWNS_SSSFLALNNKL_YTNKNLCNS_NVAPKLIYKN_IYFVS_QIA'
b = re.findall('^M_$', a)
but it gives an empty list.
I want the output to look like this:
['METSLALESRRRNHQSIEPLVRG_', 'M_', 'M_']

Here is one way to do it:
>>> re.findall('M.*?_', a)
['METSLALESRRRNHQSIEPLVRG_', 'M_', 'MSIFYHNTKILKCFM_']
Or, if the results must not contain embedded M characters:
>>> re.findall('M[^M]*?_', a)
['METSLALESRRRNHQSIEPLVRG_', 'M_', 'M_']
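As for why the original attempt returned nothing: `^` and `$` anchor the pattern to the start and end of the whole string, so `^M_$` can only match a string that is exactly `M_`. Below is a minimal sketch of the three patterns side by side. The string `sample` is a shortened, made-up stand-in, not the sequence from the question.

```python
import re

# Shortened, made-up sample; not the sequence from the question.
sample = "AAMETS_RTM_PIAGMSIFM_RETL"

# Anchored pattern: matches only if the whole string is exactly "M_".
anchored = re.findall(r"^M_$", sample)

# Lazy match: from each M to the nearest following underscore.
lazy = re.findall(r"M.*?_", sample)

# Same idea, but no further M is allowed between the first M and the underscore.
no_inner_m = re.findall(r"M[^M]*?_", sample)

print(anchored)    # []
print(lazy)        # ['METS_', 'M_', 'MSIFM_']
print(no_inner_m)  # ['METS_', 'M_', 'M_']
```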

Related

remove gibberish prefix from a string

a = "aajfkdfvf_valid_name0"
b = "gdhdhsdsdeeeeex_valid_name1"
How do I remove the gibberish from my string before valid so that I have something like this -
valid_name0
valid_name1
If your strings always contain the word valid, you can try something like this:
a = "aajfkdfvf_valid_name0"
b = "gdhdhsdsdeeeeex_valid_name1"
for s in (a, b):
    print(s[s.rfind('valid'):])
So, even if the prefix contains _ or the substring valid, the output will be correct. However, if the part you want to keep contains the word valid more than once, this will not work, because rfind locates the last occurrence.
We can try using re.sub here:
import re
a = "aajfkdfvf_valid_name0"
b = "gdhdhsdsdeeeeex_valid_name1"
inp = [a, b]
output = [re.sub(r'^[^_]+_', '', i) for i in inp]
print(output) # ['valid_name0', 'valid_name1']
You can use a split join approach for this.
Try this:
a = "aajfkdfvf_valid_name0"
valid_a = '_'.join(a.split('_')[1:])
# 'valid_name0'
# alternatively, use maxsplit=1 to split only at the first _ and keep the remainder
another_valid_a = a.split('_',1)[1]
# valid_name0
The first version splits the original string at every _, drops the first element, and joins the remaining parts with _ again.
The other approaches seem a bit too over-engineered for this task, at least in my opinion.
If you already know that the gibberish comes before the first underscore _ character, you can just do a single str.split and discard the first split result:
a = "aajfkdfvf_valid_name0"
b = "gdhdhsdsdeeeeex_valid_name1"
def clean_string(s: str) -> str:
    return s.split('_', 1)[1]
print(clean_string(a)) # valid_name0
print(clean_string(b)) # valid_name1
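One caveat with `s.split('_', 1)[1]`: it raises an IndexError when the string contains no underscore at all. A possible variant, sketched here with `str.partition` (a different technique from the answers in this thread), returns the input unchanged in that case. The function name `strip_prefix` is made up for illustration.

```python
def strip_prefix(s: str) -> str:
    """Return the text after the first underscore, or s unchanged if there is none."""
    head, sep, tail = s.partition("_")
    # sep is "" when no underscore was found; keep the input as-is in that case.
    return tail if sep else s

print(strip_prefix("aajfkdfvf_valid_name0"))  # valid_name0
print(strip_prefix("no-underscore-here"))     # no-underscore-here
```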
If a single '_' is all you need to split on, a string split will do:
fixed_a = '_'.join(a.split('_')[1:])
In the worst case, this is not the only pattern you have to handle. Then you need to know exactly what your 'valid_name' looks like, so that you can write a regex for it.
Check for standards, patterns and so on.
I'm pretty sure that if there is a pattern, a regex can handle it.
I recommend this site to do so.

Slice substrings from long string to a list in python

In Python I have a long string like this (from which I removed all line breaks):
stringA = 'abcdefkey:12/eas9ghijklkey:43/e3mnop'
What I want to do is to search this string for all occurrences of "key:", then extract the "values" following "key:".
One further complication for me is that I don't know how long these values belonging to key are (e.g. key:12/eas9 and key:43/e3). All I do know is that they do have to end with a digit whereas the rest of the string does not contain any digits.
This is why my idea was to slice from the indices of key plus the next say 10 characters (e.g. key:12/eas9g) and then work backward until isdigit() is false.
I tried to split my initial string (that did contain breaks):
stringA_split = re.split("\n", stringA)
for linex in stringA_split:
    index_start = linex.rfind("key:")
    index_end = index_start + 8
    print(linex[index_start:index_end])
    # then work backward
However, inserting line breaks does not help in any way, as they are meaningless artifacts of a PDF-to-text conversion.
How would I then solve this (e.g. by first getting all indices of "key:" and slicing the string into a list)?
>>> import re
>>> re.findall(r'key:(\d+[^\d]+[\d])', stringA)
['12/eas9', '43/e3']
\d+ # One or more digits.
[^\d]+ # Everything except a digit (equivalent to [\D]).
[\d] # The final digit
(\d+[^\d]+[\d]) # The group of the expression above
'key:(\d+[^\d]+[\d])' # 'key:' followed by the group expression
If you want key: in your result:
>>> re.findall(r'(key:\d+[^\d]+[\d])', stringA)
['key:12/eas9', 'key:43/e3']
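Since the question also asks for the indices of each "key:", here is a hedged sketch that uses `re.finditer` with the same pattern to collect both the position and the value of every match. The list name `hits` is made up for illustration.

```python
import re

stringA = 'abcdefkey:12/eas9ghijklkey:43/e3mnop'

# Same pattern as in the answer; \D is the shorthand for [^\d].
pattern = re.compile(r'key:(\d+\D+\d)')

# Each match object knows where the whole match sits and what the group captured.
hits = [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(stringA)]
print(hits)  # [(6, 17, '12/eas9'), (23, 32, '43/e3')]
```

The start and end offsets can be used directly for slicing, e.g. `stringA[6:17]` gives `'key:12/eas9'`.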
I'm not 100% sure I understand your definition of what defines a value, but I think this will get you what you described
import re
stringA = 'abcdefkey:12/eas9ghijklkey:43/e3mnop'
for v in stringA.split('key:'):
    ma = re.match(r'(\d+\/.*\d+)', v)
    if ma:
        print(ma.group(1))
This returns:
12/eas9
43/e3
You can apply just one RE that gets all the keys into an array of tuples:
import re
p = re.compile(r'key:(\d+)/([^\d]+\d)')
ret=p.findall(stringA)
After the execution, you have:
ret
[('12', 'eas9'), ('43', 'e3')]
edit: a better answer was posted above. I misread the original question when proposing to reverse here, which really wasn't necessary. Good luck!
If you know that the format is always key:, what if you reversed the string and searched for :yek? You'd isolate all keys and could then reverse them back.
import re
# \w is alphanumeric, you may want to add some symbols
rex = re.compile(r"\w*:yek")
word = 'abcdefkey:12/eas9ghijklkey:43/e3mnop'
matches = re.findall(rex, word[::-1])
matches = [match[::-1] for match in matches]

Regex substring one mismatch in any location of string

Can someone explain why the code below returns an empty list:
>>> import re
>>> m = re.findall("(SS){e<=1}", "PSSZ")
>>> m
[]
I am trying to find the total number of occurrences of SS (and incorporating the possibility of up to one mismatch) within PSSZ.
I saw a similar example of code here: Search for string allowing for one mismatch in any location of the string
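You need to remove the e<= characters from inside the range quantifier. The standard re module does not support that fuzzy-matching syntax (it comes from the third-party regex module), so re treats {e<=1} as literal text and finds nothing. A range quantifier must have one of these forms:
{n} Repeats the previous token n times.
{min,max} Repeats the previous token from min to max times.
It would be,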
m = re.findall("(SS){1}", "PSSZ")
or
m = re.findall(r'SS','PSSZ')
Update:
>>> re.findall(r'(?=(S.|.S))', 'PSSZ')
['PS', 'SS', 'SZ']
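The lookahead pattern above is specific to a two-character target. For longer targets, one possible approach with the standard library only is to slide a window over the text and count differing positions. This is a rough sketch, assuming that a "mismatch" means a substituted character (no insertions or deletions) and that overlapping windows all count. The function name `count_near_matches` is made up for illustration.

```python
def count_near_matches(text: str, pattern: str, max_mismatches: int = 1) -> int:
    """Count windows of len(pattern) in text that differ from pattern
    in at most max_mismatches positions (substitutions only)."""
    width = len(pattern)
    count = 0
    for start in range(len(text) - width + 1):
        window = text[start:start + width]
        # Number of positions where the window and the pattern disagree.
        mismatches = sum(1 for a, b in zip(window, pattern) if a != b)
        if mismatches <= max_mismatches:
            count += 1
    return count

print(count_near_matches("PSSZ", "SS"))  # 3 (the windows PS, SS and SZ)
```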

Python - Parse strings with variable repeating substring

I am trying to do something which I thought would be simple (and probably is), however I am hitting a wall. I have a string that contains document numbers. In most cases the format is ######-#-### however in some cases, where the single digit should be, there are multiple single digits separated by a comma (i.e. ######-#,#,#-###). The number of single digits separated by a comma is variable. Below is an example:
For the string below:
('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')
I need to return:
['030421-1-001', '030421-2-001', '030421-1-002', '030421-1-002', '030421-2-002', '030421-3-002', '030421-1-003']
I have only gotten as far as returning the strings that match the ######-#-### pattern:
import re
p = re.compile(r'\d{6}-\d{1}-\d{3}')
m = p.findall('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')
print(m)
Thanks in advance for any help!
Matt
Perhaps something like this:
>>> import re
>>> s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'
>>> it = re.finditer(r'(\b\d{6}-)(\d(?:,\d)*)(-\d{3})\b', s)
>>> for m in it:
...     a, b, c = m.groups()
...     for x in b.split(','):
...         print(a + x + c)
...
030421-1-001
030421-2-001
030421-1-002
030421-1-002
030421-2-002
030421-3-002
030421-1-003
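Or using a list comprehension (re-create the iterator first, because the loop above has already consumed it):
>>> it = re.finditer(r'(\b\d{6}-)(\d(?:,\d)*)(-\d{3})\b', s)
>>> [a+x+c for a, b, c in (m.groups() for m in it) for x in b.split(',')]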
['030421-1-001', '030421-2-001', '030421-1-002', '030421-1-002', '030421-2-002', '030421-3-002', '030421-1-003']
Use '\d{6}-\d(,\d)*-\d{3}'.
* means "as many as you want (0 included)".
It is applied to the previous element, here '(,\d)'.
I wouldn't use a single regular expression to try and parse this. Since it is essentially a list of strings, you might find it easier to replace the "&" with a comma globally in the string and then use split() to put the elements into a list.
Looping over the list lets you write a single function to parse and fix each item; you can then push the results onto a new list and display your string.
string = string.replace('&', ',')
initial_list = string.split(',')
new_list = []
for item in initial_list:
    new_item = myfunction(item)  # myfunction is your own per-item parser
    new_list.append(new_item)
new_string = ','.join(new_list)
(\d{6}-)((?:\d,?)+)(-\d{3})
We use 3 capturing groups. The first and last parts are matched the easy way. The center part is a digit optionally followed by a ',', repeated one or more times. A repeated capturing group would only keep its last repetition, so the inner group is made non-capturing with ?: and the surrounding group captures the whole run. What we're left with is the following result:
>>> p = re.compile(r'(\d{6}-)((?:\d,?)+)(-\d{3})')
>>> m = p.findall('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')
>>> m
[('030421-', '1,2', '-001'), ('030421-', '1', '-002'), ('030421-', '1,2,3', '-002'), ('030421-', '1', '-003')]
You'll have to manually process the 2nd term to split them up and join them, but a list comprehension should be able to do that.
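The manual processing step mentioned above could look roughly like this. It is only a sketch: it assumes every middle group is a clean comma-separated run such as '1,2,3' (a trailing comma would produce an empty entry). The name `expanded` is made up for illustration.

```python
import re

s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'
p = re.compile(r'(\d{6}-)((?:\d,?)+)(-\d{3})')

# For each (prefix, digits, suffix) tuple, emit one document number per digit.
expanded = [prefix + digit + suffix
            for prefix, digits, suffix in p.findall(s)
            for digit in digits.split(',')]
print(expanded)
```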

Replacing recurring characters in strings in Python 3.1

Is it possible to replace a single character inside a string that occurs many times?
Input:
Sentence=("This is an Example. Thxs code is not what I'm having problems with.") #Example input
                                 ^
Sentence=("This is an Example. This code is not what I'm having problems with.") #Desired output
Replace the 'x' in "Thxs" with an i, without replacing the x in "Example".
You can do it by including some context:
s = s.replace("Thxs", "This")
Alternatively you can keep a list of words that you don't wish to replace:
import re
whitelist = ['example', 'explanation']
def replace_except_whitelist(m):
    s = m.group()
    # leave whitelisted words untouched, fix the x in everything else
    if s in whitelist:
        return s
    return s.replace('x', 'i')
s = 'Thxs example'
result = re.sub(r"\w+", replace_except_whitelist, s)
print(result)
Output:
This example
Sure, but you essentially have to build up a new string out of the parts you want:
>>> s = "This is an Example. Thxs code is not what I'm having problems with."
>>> s[22]
'x'
>>> s[:22] + "i" + s[23:]
"This is an Example. This code is not what I'm having problems with."
For information about the notation used here, see good primer for python slice notation.
If you know whether you want to replace the first occurrence of x, or the second, or the third, or the last, you can combine str.find (or str.rfind if you wish to start from the end of the string) with slicing and str.replace. Call str.find with the character you wish to replace as many times as needed to reach the position just before the occurrence you want (for the sentence you suggest, just once). Then slice the string in two at that position and replace only one occurrence in the second slice.
An example is worth a thousand words, or so they say. In the following, I assume you want to substitute the (n+1)th occurrence of the character.
>>> s = "This is an Example. Thxs code is not what I'm having problems with."
>>> n = 1
>>> pos = 0
>>> for i in range(n):
...     pos = s.find('x', pos) + 1
...
>>> s[:pos] + s[pos:].replace('x', 'i', 1)
"This is an Example. This code is not what I'm having problems with."
Note that you need to add an offset to pos, otherwise you will replace the occurrence of x you have just found.
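The same idea can be wrapped in a small helper. This is a minimal sketch, not part of the answer above: the name `replace_nth` and its 0-based `n` parameter are made up for illustration, and it returns the string unchanged when there are too few occurrences.

```python
def replace_nth(s: str, old: str, new: str, n: int) -> str:
    """Replace the n-th (0-based) occurrence of old in s with new.
    Return s unchanged if it has fewer than n + 1 occurrences."""
    pos = -1
    for _ in range(n + 1):
        # Search strictly after the previous hit so it is not found again.
        pos = s.find(old, pos + 1)
        if pos == -1:
            return s
    return s[:pos] + new + s[pos + len(old):]

s = "This is an Example. Thxs code is not what I'm having problems with."
print(replace_nth(s, 'x', 'i', 1))
```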
