Finding part of string using regular expressions - python

I'm pretty new in Python and don't really know regex. I've got some strings like:
a = "Tom Hanks XYZ doesn't really matter"
b = "Julia Roberts XYZ don't worry be happy"
c = "Morgan Freeman XYZ all the best"
In the middle of each string there's the word XYZ and then some text. I need a regex that will find and match this part, more precisely: from XYZ to the end of the string.

Unless there is a specific requirement to do this with a regex, a non-regex solution will work fine here.
There are two possible ways you can approach this problem:
1.
Given
a = "Tom Hanks XYZ doesn't really matter"
Partition the string with the separator, preceded with a space
''.join(a.partition(" XYZ")[1:])[1:]
Please note, if the separator string does not exist this will return an empty string.
2.
a[a.index(" XYZ") + 1:]
This will raise an exception ValueError: substring not found if the string is not found
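If neither an empty string nor an exception is what you want for the missing case, a third variant is to use str.find, which returns -1 when the substring is absent. The following is only a minimal sketch; the helper name tail_from and the choice to return None are assumptions made for illustration, not part of the answer above.

```python
def tail_from(text, marker="XYZ"):
    """Return text from the first occurrence of marker to the end, or None."""
    # str.find returns -1 instead of raising when the marker is missing.
    pos = text.find(marker)
    return text[pos:] if pos != -1 else None


a = "Tom Hanks XYZ doesn't really matter"
print(tail_from(a))            # XYZ doesn't really matter
print(tail_from("no marker"))  # None
```

Returning None makes the "not found" case explicit for the caller, at the cost of one extra check.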

Use the following expression
(XYZ.*)
What this does is start capturing when it sees the letters "XYZ" and matches anything beyond that zero or more times.

import re
m = re.search("(XYZ.*)", a)
If you want to show that part of the string:
print(m.group(1))
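Keep in mind that re.search returns None when XYZ is not present, so calling a method on the result without a check would raise AttributeError. A small sketch of a guarded version follows; the function name match_from_xyz is hypothetical.

```python
import re


def match_from_xyz(text):
    """Return the part of text from XYZ to the end of the line, or None."""
    m = re.search(r"XYZ.*", text)
    # re.search returns None when nothing matches, so guard before using it.
    return m.group(0) if m else None


for s in ["Tom Hanks XYZ doesn't really matter",
          "Julia Roberts XYZ don't worry be happy",
          "no marker here"]:
    print(match_from_xyz(s))
```

By default . does not match a newline, so on multi-line input this stops at the end of the line that contains XYZ; passing re.DOTALL would extend the match to the end of the whole string.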

Related

re.match never returns None? [duplicate]

There is a problem that I need to do, but there are some caveats that make it hard.
Problem: Match on all non-empty strings over the alphabet {abc} that contain at most one a.
Examples
a
abc
bbca
bbcabb
Nonexample
aa
bbaa
Caveats: You cannot use a lookahead/lookbehind.
What I have is this:
^[bc]*a?[bc]*$
but it matches empty strings. Maybe a hint? Idk anything would help
(And if it matters, I'm using python).
As I understand your question, the only problem is that your current pattern matches empty strings. To prevent this, you can use a word boundary \b to require at least one word character.
^\b[bc]*a?[bc]*$
See demo at regex101
Another option would be to alternate in a group. Match an a surrounded by any amount of [bc] or one or more [bc] from start to end which could look like: ^(?:[bc]*a[bc]*|[bc]+)$
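A quick way to convince yourself that both patterns behave the same is to run them against the examples from the question. This is just a sketch; the extra test strings "b" and "" are my own additions.

```python
import re

patterns = {
    "word boundary": re.compile(r"^\b[bc]*a?[bc]*$"),
    "alternation": re.compile(r"^(?:[bc]*a[bc]*|[bc]+)$"),
}

should_match = ["a", "abc", "bbca", "bbcabb", "b"]
should_fail = ["", "aa", "bbaa"]

results = {}
for name, rx in patterns.items():
    # Collect every candidate string that the pattern accepts.
    results[name] = [s for s in should_match + should_fail if rx.search(s)]
    print(name, results[name])
```

Both lines of output should list exactly the strings in should_match.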
The way I understood the issue was that any character in the alphabet should match, just only one a character.
Match on all non-empty strings over the alphabet... at most one a
^[b-z]*a?[b-z]*$
If spaces can be included:
^([b-z]*\s?)*a?([b-z]*\s?)*$
You do not even need a regex here, you might as well use .count() and a list comprehension:
data = """a,abc,bbca,bbcabb,aa,bbaa,something without the bespoken letter,ooo"""
def filter(string, char):
    return [word
            for word in string.split(",")
            for c in [word.count(char)]
            if c in [0,1]]
print(filter(data, 'a'))
Yielding
['a', 'abc', 'bbca', 'bbcabb', 'something without the bespoken letter', 'ooo']
You've got to positively match something excluding the empty string,
using only a, b, or c letters. But can't use assertions.
Here is what you do.
The regex ^(?:[bc]*a[bc]*|[bc]+)$
The explanation
^ # BOS
(?: # Cluster choice
[bc]* a [bc]* # only 1 [a] allowed, arbitrary [bc]'s
| # or,
[bc]+ # no [a]'s only [bc]'s ( so must be some )
) # End cluster
$ # EOS
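In Python, the commented layout above can be kept inside the pattern itself by compiling with re.VERBOSE, which ignores whitespace outside character classes and treats # as a comment marker. A sketch, where the constant name AT_MOST_ONE_A is an assumption:

```python
import re

AT_MOST_ONE_A = re.compile(r"""
    ^                   # beginning of string
    (?:
        [bc]* a [bc]*   # exactly one a, with any number of b or c around it
      |
        [bc]+           # no a at all, so there must be at least one b or c
    )
    $                   # end of string
""", re.VERBOSE)

for candidate in ["a", "bbcabb", "cc", "", "aa", "bbaa"]:
    print(repr(candidate), bool(AT_MOST_ONE_A.search(candidate)))
```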

Find String Between Two Substrings in Python When There is A Space After First Substring

While there are several posts on StackOverflow that are similar to this, none of them involve a situation when the target string is one space after one of the substrings.
I have the following string (example_string):
<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>
I want to extract "I want this string." from the string above. The randomletters will always change, however the quote "I want this string." will always be between [?] (with a space after the last square bracket) and Reduced.
Right now, I can do the following to extract "I want this string".
target_quote_object = re.search('[?](.*?)Reduced', example_string)
target_quote_text = target_quote_object.group(1)
print(target_quote_text[2:])
This eliminates the ] and the space that always appear at the start of my extracted string, thus only printing "I want this string." However, this solution seems ugly, and I'd rather make re.search() return the correct target string without any modification. How can I do this?
Your '[?](.*?)Reduced' pattern matches a literal ?, then captures any 0+ chars other than line break chars, as few as possible up to the first Reduced substring. That [?] is a character class formed with unescaped brackets, and the ? inside a character class is a literal ? char. That is why your Group 1 contains the ] and a space.
To make your regex match [?] you need to escape [ and ? and they will be matched as literal chars. Besides, you need to add a space after ] to actually make sure it does not land into Group 1. A better idea is to use \s* (0 or more whitespaces) or \s+ (1 or more occurrences).
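To see the difference between the character class and the escaped literal, you can compare what each pattern captures on the sample string. A small sketch; the variable names are only for illustration:

```python
import re

s = "<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>"

# [?] is a character class that matches a single literal ?
unescaped = re.search(r"[?](.*?)Reduced", s)
# \[\?\] matches the three characters [?] and \s* swallows the space
escaped = re.search(r"\[\?\]\s*(.*?)Reduced", s)

print(repr(unescaped.group(1)))  # '] I want this string.'
print(repr(escaped.group(1)))    # 'I want this string.'
```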
Use
re.search(r'\[\?]\s*(.*?)Reduced', example_string)
See the regex demo.
import re
rx = r"\[\?]\s*(.*?)Reduced"
s = "<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>"
m = re.search(rx, s)
if m:
    print(m.group(1))
# => I want this string.
See the Python demo.
Regex may not be necessary for this, provided your string is in a consistent format:
mystr = '<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>'
res = mystr.split('Reduced')[0].split('] ')[1]
# 'I want this string.'
The solution turned out to be:
target_quote_object = re.search('] (.*?)Reduced', example_string)
target_quote_text = target_quote_object.group(1)
print(target_quote_text)
However, Wiktor's solution is better.
You [co]/[sho]uld use Positive Lookbehind (?<=\[\?\]) :
import re
pattern=r'(?<=\[\?\])(\s\w.+?)Reduced'
string_data='<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>'
print(re.findall(pattern,string_data)[0].strip())
output:
I want this string.
Like the other answer, this might not be necessary. Or just too long-winded for Python.
This method uses one of the common string methods find.
str.find(sub,start,end) will return the index of the first occurrence of sub in the substring str[start:end] or returns -1 if none found.
In each iteration, the index of [?] is retrieved, followed by the index of Reduced. The resulting substring is printed.
Every time this [?]...Reduced pattern is returned, the index is updated to the rest of the string. The search is continued from that index.
Code
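s = ' [?] Nice to meet you.Reduced efweww [?] Who are you? Reduced<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>'
idx = s.find('[?]')
# compare with != rather than "is not": identity checks on ints are unreliable
while idx != -1:
    start = idx
    end = s.find('Reduced', idx)
    # skip the three characters of '[?]' and trim surrounding whitespace
    print(s[start+3:end].strip())
    idx = s.find('[?]', end)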
Output
$ python splmat.py
Nice to meet you.
Who are you?
I want this string.
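For comparison, the same three quotes can be collected in one call with re.findall instead of the find loop. This swaps in a regex for the string-method approach above and is only a sketch under the assumption that every [?] is eventually followed by Reduced.

```python
import re

s = (" [?] Nice to meet you.Reduced efweww [?] Who are you? Reduced"
     "<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>")

# Non-greedy capture between each [?] and the next Reduced, then trim spaces.
quotes = [q.strip() for q in re.findall(r"\[\?\](.*?)Reduced", s)]
print(quotes)
```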

Python RegEx Overlapping

The title of this question probably isn't sufficient to describe the problem I'm trying to solve so hopefully my example gets the point across. I am hoping a Python RegEx is the right tool for the job:
First, we're looking for any one of these strings:
CATGTG
CATTTG
CACGTG
Second, the pattern is:
string
6-7 letters
string
Example
match: CATGTGXXXXXXCACGTG
no match: CATGTGXXXCACGTG (because 3 letters between)
Third, when a match is found, begin the next search from the end of the previous match, inclusive. Report index of each match.
Example:
input (spaces for readability): XXX CATGTG XXXXXX CATTTG XXXXXXX CACGTG XXX
workflow (spaces for readability):
found match: CATGTG XXXXXX CATTTG
it starts at 3
resuming search at C in CATTTG
found match: CATTTG XXXXXXX CACGTG
it starts at 15
and so on...
After a few hours of tinkering, my sorry attempt did not yield what I expected:
regex = re.compile("CATGTG|CATTTG|CACGTG(?=.{6,7})CATGTG|CATTTG|CACGTG")
for m in regex.finditer('ATTCATGTG123456CATTTGCCG'):
    print(m.start(), m.group())
3 CATGTG
15 CATTTG (incorrect)
You're a genius if you can figure this out with a RegEx. Thanks :D
You can use this kind of pattern:
import re
s='XXXCATGTGXXXXXXCATTTGXXXXXXXCACGTGXXX'
regex = re.compile(r'(?=(((?:CATGTG|CATTTG|CACGTG).{6,7}?)(?:CATGTG|CATTTG|CACGTG)))\2')
for m in regex.finditer(s):
    print(m.start(), m.group(1))
The idea is to put the whole string inside the lookahead and to use a backreference to consume characters you don't want to test after.
The first capture group contains the whole sequence, the second contains all characters until the next start position.
Note that you can change (?:CATGTG|CATTTG|CACGTG) to CA(?:TGTG|TTTG|CGTG) to improve the pattern.
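Here is a sketch of the same lookahead-plus-backreference idea with the shortened alternation factored into one variable. The name MOTIF is an assumption for illustration; the doubled braces are only there because the pattern is built with an f-string.

```python
import re

MOTIF = r"CA(?:TGTG|TTTG|CGTG)"
# Group 1: the whole sequence (inside the lookahead, so nothing is consumed).
# Group 2: first motif plus the 6-7 letter gap, consumed by \2 afterwards.
regex = re.compile(rf"(?=(({MOTIF}.{{6,7}}?){MOTIF}))\2")

s = "XXXCATGTGXXXXXXCATTTGXXXXXXXCACGTGXXX"
hits = [(m.start(), m.group(1)) for m in regex.finditer(s)]
for start, seq in hits:
    print(start, seq)
# 3 CATGTGXXXXXXCATTTG
# 15 CATTTGXXXXXXXCACGTG
```

Because only group 2 is consumed, the next search resumes at the second motif of the previous match, which is what the question asks for.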
The main issue is that in order to use the | character, you need to enclose the alternatives in parentheses.
Assuming from your example that you want only the first matching string, try the following:
regex = re.compile("(CATGTG|CATTTG|CACGTG).{6,7}(?:CATGTG|CATTTG|CACGTG)")
for m in regex.finditer('ATTCATGTG123456CATTTGCCG'):
    print(m.start(), m.group(1))
Note the .group(1), which will match only what's in the first set of parentheses, as opposed to .group() which will return the whole match.
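One thing to be aware of: this pattern consumes the second motif, so on the longer input from the question it reports only the first match and does not resume at CATTTG. A short sketch that shows this:

```python
import re

consuming = re.compile(r"(CATGTG|CATTTG|CACGTG).{6,7}(?:CATGTG|CATTTG|CACGTG)")

s = "XXXCATGTGXXXXXXCATTTGXXXXXXXCACGTGXXX"
# finditer continues after the end of each match, i.e. after CATTTG here,
# so the overlapping match that starts at index 15 is never found.
starts = [m.start() for m in consuming.finditer(s)]
print(starts)  # [3]
```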

How to pull out language via regex

I have the following two strings:
s1 = 'Audio: Dolby Digital 5.1 (English)'
s2 = 'Audio: Stereo (English, French)'
I want to pull out the first language in each string. Here is what I have so far:
re.search(r'\s\((.+)', s1)
['English)']
How would I improve this to work on both of the above?
You could use this which will only find the first language and it is only a small tweak to your existing code
f=re.findall(r'\((\w+)', s1)
e=re.findall(r'\((\w+)', s2)
if f:
    print(f)
if e:
    print(e)
f = ['English']
e = ['English']
If you only want the first language then you should be using search instead, like so:
f = re.search(r'\((\w+)', s1)
e = re.search(r'\((\w+)', s2)
if f:
    print(f.group(1))
if e:
    print(e.group(1))
This will print a string rather than a list since it is only finding one thing
Widen the search to start the phrase with a parenthesis or comma+space, and end with a parenthesis or comma+space:
>>> re.findall(r'\s(?:\(|, )(.+)(?:\)|, )', s2)
['English, French']
The ?: after a parenthesis indicates a non-capturing group.
You can then grab whichever language you're interested in with indexing.
Since the strings you're searching are actually pretty tidy, you can also do this without regex:
>>> s1.split('(')[1].split(')')[0].split(', ')[0]
'English'
>>> s2.split('(')[1].split(')')[0].split(', ')[0]
'English'
You can just use this simple modification of your regular expression:
\s\(([^,\n\)]+)
Regex101
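You're looking for the text after the first LParen and before the first comma. So, a regex that would match this is:
\(([^,]*),
(Your answer will be in group 1)
Note that this requires a comma after the language, so it matches s2 but not s1, which lists a single language.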
Finally, I'd like to point you to https://www.debuggex.com/, which will help you easily visualize your regex questions.
Assuming languages are always at the end, surrounded by brackets and listed with ,:
(?<=\()\w+(?=(?:, \w+)*\)$)
See it in action
The idea is:
(?<=\() - the string should be preceded by an opening bracket, (
\w+ - the language itself is a sequence of letters
(?=(?:, \w+)*\)$) - after it, there can be zero or more other languages, separated by a comma and a space, and the closing bracket, ), must leave us at the end of the string
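To try the lookaround pattern on both strings from the question, something like the following sketch works; the helper name first_language is hypothetical.

```python
import re

FIRST_LANG = re.compile(r"(?<=\()\w+(?=(?:, \w+)*\)$)")


def first_language(text):
    """Return the first language inside the trailing (...) list, or None."""
    m = FIRST_LANG.search(text)
    return m.group(0) if m else None


s1 = 'Audio: Dolby Digital 5.1 (English)'
s2 = 'Audio: Stereo (English, French)'
print(first_language(s1))  # English
print(first_language(s2))  # English
```

Since the lookarounds consume nothing, the whole match (group 0) is already the language and no capture group is needed.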

How to use Python re to match a string containing only several specific characters?

I want to search the DNA sequences in a file, the sequence contains only [ATGC], 4 characters.
I try this pattern:
m=re.search('([ATGC]+)',line_in_file)
but it gives me hits for all lines that contain at least 1 character of ATGC.
So how do I search for lines that contain only those 4 characters, and no others?
Sorry for mis-describing my question. I'm not looking for an exact match of ATGC as a word, but for a string containing only the 4 characters ATCG.
Thanks
Currently your regex is matching against any part of the line. Using the ^ and $ anchors, you can force the regex to match the whole line, so it only succeeds when the line consists entirely of those four characters.
m=re.search('(^[ATGC]+$)',line_in_file)
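A related option is re.fullmatch, which requires the whole string to match without writing the anchors. The sketch below filters a list of lines; the sample lines are made up, and it assumes trailing newlines have been stripped (with $, a line ending in a newline would still match, with fullmatch it would not).

```python
import re

DNA_LINE = re.compile(r"[ATGC]+")

lines = ["ATGCCATG", "STUPID", "ATG CAT", "GATTACA", ""]
# fullmatch succeeds only if the entire line consists of A, T, G or C.
dna_only = [line for line in lines if DNA_LINE.fullmatch(line)]
print(dna_only)  # ['ATGCCATG', 'GATTACA']
```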
From your clarification above:
If you want to match a sequence like this AAAGGGCCCCCCT with the order AGCT then the regex will be:
(A+G+C+T+)
The square brackets in your search string tell the regex compiler to match any of the letters in the set, not the full string. Remove the square brackets, and move the + to outside your parens.
m=re.search('(ATGC)+',a)
EDIT:
According to your comment, this won't match the pattern you actually want, just the one I thought you wanted. I can edit again once I understand the actual pattern.
EDIT2:
To match "ATGCCATG" but not "STUPID" try,
re.search("[^ATGC]", str)
Then check for a NOT match, rather than a match.
The regex will hit if there are any characters NOT in [ATGC], so you exclude the strings that match.
A slight modification:
import re
def DNAcheck(dna):
    y = dna.upper()
    print(y)
    if re.match("^[ATGC]+$", y):
        return (2)
    else:
        return(1)
If the entire sequence is composed only of A/T/G/C, the code above returns 2; otherwise it returns 1.
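If a regex is not required, the same check can be expressed with a set comparison. This swaps in a different technique from the answers above and returns a boolean instead of 1/2; the function name is_dna is an assumption.

```python
def is_dna(seq):
    """Return True if seq is non-empty and contains only A, T, G, C."""
    seq = seq.upper()
    # A set is a subset of "ATGC" exactly when no other character occurs.
    return bool(seq) and set(seq) <= set("ATGC")


print(is_dna("atgccatg"))  # True
print(is_dna("STUPID"))    # False
print(is_dna(""))          # False
```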
