Split string via regular expression - python

Suppose I am given a string like:
input = """
abc#gmail.com is a very nice person
xyz#gmail.com sucks
lol#gmail.com is pretty funny."""
I have a regular expression for email addresses: ^[A-z0-9\+\.]+\#[A-z0-9\+\.]+\.[A-z0-9\+]+$
The goal is to split the string based on the email address regular expression.
The output should be:
["is a very nice person", "sucks", "is pretty funny."]
I have been trying to use re.split(EMAIL_REGEX, input) but I haven't been successful.
I get the output as the entire string contained in the list.

Remove the ^ and $ anchors, as they only match the beginning and end of the string. Since the email addresses are in the middle of the string, they'll never match.
Your regexp has other problems. The account name can contain many other characters than the ones you allow, e.g. _ and -. The domain name can contain - characters, but not +. And you shouldn't use the range A-z to get upper and lower case characters, because there are characters between the two alphabetic blocks that you probably don't want to include (see the ASCII Table); either use A-Za-z or use a-z and add flags = re.IGNORECASE.
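A minimal sketch of both points, assuming a simplified pattern that is only meant for the "#"-style sample data in the question. The names text, accidental, pieces and messages are illustrative and do not come from the question.

```python
import re

text = """
abc#gmail.com is a very nice person
xyz#gmail.com sucks
lol#gmail.com is pretty funny."""

# The range A-z also covers the ASCII characters between "Z" and "a",
# so it accepts characters such as "_" and "^" by accident.
accidental = re.findall(r"[A-z]", "[^_`")

# Assumed, simplified pattern: no ^/$ anchors, explicit A-Za-z ranges,
# "_" and "-" allowed in the account name, "-" (but not "+") in the domain.
EMAIL_REGEX = r"[A-Za-z0-9._+-]+#[A-Za-z0-9.-]+\.[A-Za-z0-9]+"

# Split on every address, then drop surrounding whitespace and empty pieces.
pieces = (piece.strip() for piece in re.split(EMAIL_REGEX, text))
messages = [piece for piece in pieces if piece]

print(accidental)  # ['[', '^', '_', '`']
print(messages)    # ['is a very nice person', 'sucks', 'is pretty funny.']
```

The pattern is deliberately loose; it is not a full email validator.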

The '^$' might be throwing it off. It'll only match a string that starts and ends with the matching regex.
I have something close to what you want:
>>> EMAIL_REGEX = r'[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}'
>>> re.split(EMAIL_REGEX, input, flags=re.IGNORECASE)
['\n', ' is a very nice person\n', ' sucks\n', ' is pretty funny.']

You will probably need to loop through the lines and then split each with your regex.
Also your regex shouldn't have $ at the end.
Try something like:
import re
EMAIL_REGEX = r"\.[a-z]{3} "  # just for the demo; note the trailing space
ends = []
for L in input.split("\n"):
    parts = re.split(EMAIL_REGEX, L)
    if len(parts) > 1:
        ends.append(parts[1])
Output:
['is a very nice person', 'sucks', 'is pretty funny.']

Wouldn't use a regex here, it would work like this as well:
messages = []
for item in input.split('\n'):
    # removes everything before the first space, which is just the email address in this case
    item = ' '.join(item.split(' ')[1:])
    messages.append(item)
Output of messages when using:
input = """
abc#gmail.com is a very nice person
xyz#gmail.com sucks
lol#gmail.com is pretty funny."""
['', 'is a very nice person', 'sucks', 'is pretty funny.']
If you want to remove the first element, just do it like this: messages = messages[1:]
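The same idea can be sketched with str.partition, which avoids both the split/join round trip and the empty first element. This assumes, as above, that every non-blank line starts with the address followed by a single space; the name text is illustrative.

```python
text = """
abc#gmail.com is a very nice person
xyz#gmail.com sucks
lol#gmail.com is pretty funny."""

# str.partition(" ") returns (before, separator, after) for the first space;
# index 2 is everything after the address. Blank lines are skipped up front.
messages = [line.partition(" ")[2] for line in text.splitlines() if line.strip()]

print(messages)  # ['is a very nice person', 'sucks', 'is pretty funny.']
```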

Related

Parsing based on pattern not at the beginning

I want to extract the number before "2022" in a set of strings. I currently do
a= mystring.strip().split("2022")[0]
and, for instance, when mystring=' 1020220519AX', this gives a = '10'. However,
mystring.strip().split("2022")[0]
fails when mystring=' 20220220519AX': it does not return the desired a='202'. Therefore, I want the code to split the string on a "2022" that is not at the beginning of the non-whitespace characters in the string.
Can you please guide with this?
Use a regular expression rather than split().
import re
mystring = ' 20220220519AX'
match = re.search(r'^\s*(\d+?)2022', mystring)
if match:
    print(match.group(1))
^\s* skips over the whitespace at the beginning, then (\d+?) captures the following digits up to the first 2022.
You can tell a regex engine that you want all the digits before 2022:
r'\d+(?=2022)'
Like .split(), a regex engine is 'greedy' by default - 'greedy' here means that as soon as it can take something that it is instructed to take, it will take that and it won't try another option, unless the rest of the expression cannot be made to work.
So, in your case, mystring.strip().split("2022") splits on the first 2022 it can find and since there's nothing stopping it, that is the result you have to work with.
Using regex, you can even tell it you're not interested in the 2022, but in the numbers before it: the \d+ will match as long a string of digits it can find (greedy), but the (?=2022) part says it must be followed by a literal 2022 to be a match (and that won't be part of the match, a 'positive lookahead').
Using something like:
import re
mystring = ' 20220220519AX'
print(re.findall(r'\d+(?=2022)', mystring))
Will show you all consecutive matches.
Note that for a string like ' 920220220519AX 12022', it will find ['9202', '1'] and only that - it won't find all possible combinations of matches. The first, greedy pass through the string that succeeds is the answer you get.
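To make the greedy behaviour concrete, here is a small sketch that runs the pattern on that string and compares it with a lazy variant, \d+?. The lazy pattern is an addition for contrast and is not part of the answer above.

```python
import re

sample = ' 920220220519AX 12022'

# Greedy: \d+ grabs as many digits as it can, then backs off until
# the lookahead (?=2022) succeeds.
greedy = re.findall(r'\d+(?=2022)', sample)

# Lazy: \d+? grows one digit at a time and stops at the first
# position where 2022 follows.
lazy = re.findall(r'\d+?(?=2022)', sample)

print(greedy)  # ['9202', '1']
print(lazy)    # ['9', '202', '1']
```

In both cases findall resumes after the end of the previous match, which is why overlapping candidates are never reported.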
After using strip(), you could split() on 2022 while asserting that the start of the string is not directly to its left. Alternatively, you can match the first run of 1 or more digits from the start of the string, in case 2022 occurs more than once:
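import re
strings = [
    ' 1020220519AX',
    ' 20220220519AX'
]
# Variant 1: split on a 2022 that is not at the start of the stripped string
for s in strings:
    parts = re.split(r"(?<!^)2022", s.strip())
    if parts:
        print(parts[0])
# Variant 2: capture the leading digits up to the first following 2022
for s in strings:
    m = re.match(r"\s*(\d+?)2022", s)
    if m:
        print(m.group(1))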
Both will output
10
202
Note that the split variant does not guarantee that the first part consists of digits; it only splits the string.
If the string consists of only word characters, splitting on \B2022, where \B means not a word boundary, will also prevent splitting at the start of the example string.
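A short sketch of the \B variant on the two example strings. The maxsplit=1 argument and the name first_parts are choices made here for illustration.

```python
import re

strings = [' 1020220519AX', ' 20220220519AX']

# \B matches only where there is no word boundary, e.g. between two word
# characters, so a 2022 at the very start of the stripped string is skipped.
first_parts = [re.split(r'\B2022', s.strip(), maxsplit=1)[0] for s in strings]

print(first_parts)  # ['10', '202']
```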

Find String Between Two Substrings in Python When There is A Space After First Substring

While there are several posts on StackOverflow that are similar to this, none of them involve a situation when the target string is one space after one of the substrings.
I have the following string (example_string):
<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>
I want to extract "I want this string." from the string above. The randomletters will always change, however the quote "I want this string." will always be between [?] (with a space after the last square bracket) and Reduced.
Right now, I can do the following to extract "I want this string".
target_quote_object = re.search('[?](.*?)Reduced', example_string)
target_quote_text = target_quote_object.group(1)
print(target_quote_text[2:])
This eliminates the ] and the space that always appear at the start of my extracted string, thus printing only "I want this string." However, this solution seems ugly, and I'd rather make re.search() return the target string without any modification. How can I do this?
Your '[?](.*?)Reduced' pattern matches a literal ?, then captures any 0+ chars other than line break chars, as few as possible up to the first Reduced substring. That [?] is a character class formed with unescaped brackets, and the ? inside a character class is a literal ? char. That is why your Group 1 contains the ] and a space.
To make your regex match [?] you need to escape [ and ? and they will be matched as literal chars. Besides, you need to add a space after ] to actually make sure it does not land into Group 1. A better idea is to use \s* (0 or more whitespaces) or \s+ (1 or more occurrences).
Use
re.search(r'\[\?]\s*(.*?)Reduced', example_string)
import re
rx = r"\[\?]\s*(.*?)Reduced"
s = "<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>"
m = re.search(rx, s)
if m:
    print(m.group(1))
# => I want this string.
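The difference between the character class and the escaped brackets can be seen side by side in the sketch below. The names loose and strict are illustrative only.

```python
import re

s = "<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>"

# Unescaped: [?] is a character class that matches a single "?",
# so the "]" and the space end up inside group 1.
loose = re.search(r'[?](.*?)Reduced', s)

# Escaped: \[\?] matches the three literal characters "[?]",
# and \s* consumes the space before the capture starts.
strict = re.search(r'\[\?]\s*(.*?)Reduced', s)

print(repr(loose.group(1)))   # '] I want this string.'
print(repr(strict.group(1)))  # 'I want this string.'
```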
Regex may not be necessary for this, provided your string is in a consistent format:
mystr = '<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>'
res = mystr.split('Reduced')[0].split('] ')[1]
# 'I want this string.'
The solution turned out to be:
target_quote_object = re.search('] (.*?)Reduced', example_string)
target_quote_text = target_quote_object.group(1)
print(target_quote_text)
However, Wiktor's solution is better.
You could (or should) use a positive lookbehind, (?<=\[\?\]):
import re
pattern=r'(?<=\[\?\])(\s\w.+?)Reduced'
string_data='<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>'
print(re.findall(pattern,string_data)[0].strip())
output:
I want this string.
Like the other answer, this might not be necessary. Or just too long-winded for Python.
This method uses one of the common string methods find.
str.find(sub,start,end) will return the index of the first occurrence of sub in the substring str[start:end] or returns -1 if none found.
In each iteration, the index of [?] is retrieved, followed by the index of Reduced, and the substring between them is printed.
Each time a [?]...Reduced pair is found, the search index is advanced past it and the search continues from there through the rest of the string.
Code
s = ' [?] Nice to meet you.Reduced efweww [?] Who are you? Reduced<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>'
idx = s.find('[?]')
while idx != -1:
    start = idx
    end = s.find('Reduced', idx)
    print(s[start+3:end].strip())
    idx = s.find('[?]', end)
Output
$ python splmat.py
Nice to meet you.
Who are you?
I want this string.
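For comparison, the same multi-occurrence extraction can be sketched with re.findall, reusing the escaped-bracket pattern from the earlier answer. The extra \s* before Reduced is an assumption added here to trim the space in "Who are you? Reduced".

```python
import re

s = (' [?] Nice to meet you.Reduced efweww [?] Who are you? Reduced'
     '<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>')

# findall returns the contents of group 1 for every non-overlapping match.
found = re.findall(r'\[\?]\s*(.*?)\s*Reduced', s)

print(found)  # ['Nice to meet you.', 'Who are you?', 'I want this string.']
```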

regular expression to extract part of email address

I am trying to use a regular expression to extract the part of an email address between the "#" sign and the "." character. This is how I am currently doing it, but can't get the right results.
company = re.findall('^From:.+#(.*).',line)
Gives me:
['#iupui.edu']
I want to get rid of the .edu
To match a literal . in your regex, you need to use \., so your code should look like this:
company = re.findall('^From:.+#(.*)\.',line)
#                                  ^ this position was wrong
Note that this will always match the last occurrence of . in your string, because (.*) is greedy. If you want to match the first occurrence, you need to exclude any . from your capturing group:
company = re.findall('^From:.+#([^\.]*)\.',line)
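The difference only shows up when the host part contains more than one dot. A small sketch, using a made-up sample line:

```python
import re

line = "From: someone#mail.example.edu"  # made-up sample line

# Greedy (.*) runs to the last dot on the line.
greedy = re.findall(r'^From:.+#(.*)\.', line)

# [^.]* cannot cross a dot, so it stops at the first one.
first_label = re.findall(r'^From:.+#([^.]*)\.', line)

print(greedy)       # ['mail.example']
print(first_label)  # ['mail']
```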
You can try this:
(?<=\#)(.*?)(?=\.)
A simple example would be:
>>> import re
>>> re.findall(".*(?<=\#)(.*?)(?=\.)", "From: atc#moo.com")
['moo']
>>> re.findall(".*(?<=\#)(.*?)(?=\.)", "From: atc#moo-hihihi.com")
['moo-hihihi']
This matches the hostname regardless of the beginning of the line, i.e. it's greedy.
You could just split and find:
s = " abc.def#ghi.mn I"
s = s.split("#", 1)[-1]
print(s[:s.find(".")])
Or just split if it is not always going to match your string:
s = s.split("#", 1)[-1].split(".", 1)[0]
If it is then find will be the fastest:
i = s.find("#")
s = s[i+1:s.find(".", i)]
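One caveat with the find-based version: str.find returns -1 when nothing is found, and a -1 index silently produces a wrong slice instead of an error. A hedged sketch of a guarded helper (company_of is a hypothetical name, not from the answer):

```python
def company_of(address):
    """Return the text between the first "#" and the next ".", or None."""
    at = address.find("#")
    if at == -1:
        return None
    dot = address.find(".", at)
    if dot == -1:
        return None
    return address[at + 1:dot]


print(company_of(" abc.def#ghi.mn I"))  # ghi
print(company_of("no separator here"))  # None
```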

How to replace .. in a string in python

I am trying to replace the double dot in the string below so that it becomes the desired output shown further down:
import re
s = "haha..hehe.hoho"
s = re.sub('[..+]+',' ', s)
The output I get:
haha hehe hoho
Desired output:
haha hehe.hoho
What am I doing wrong?
Test on sites like regexpal: http://regexpal.com/
It's easier to get the output and check if the regex is right.
You should change your regex to something like: '\.\.' if you want to remove only double dots.
If you want to remove when there's at least 2 dots you can use '\.{2,}'.
Every character you put inside [] is treated as one of the characters the class is allowed to match.
Also, the dot character has a special meaning in a regex; to avoid this meaning, prefix it with the escape character: \
You can read more about regular expressions metacharacters here: https://www.hscripts.com/tutorials/regular-expression/metacharacter-list.php
[a-z]   A range of characters. Matches any character in the specified range.
.       Matches any single character except "\n".
\       Specifies the next character as either a special character, a literal, a back reference, or an octal escape.
Your new code:
import re
s = "haha..hehe.hoho"
#pattern = '\.\.' #If you want to remove when there's 2 dots
pattern = '\.{2,}' #If you want to remove when there's at least 2 dots
s = re.sub(pattern, ' ', s)
Unless you are constrained to use regex, then I find the replace() function much simpler:
s = "haha..hehe.hoho"
print(s.replace('..', ' '))
gives your desired output:
haha hehe.hoho
Change:
re.sub('[..+]+',' ', s)
to:
re.sub('\.\.+',' ', s)
[..+]+ means "match any of the characters in the list, one or more times". So it matches the single . as well as the .. in your input. Make the change below:
s = re.sub('\.\.+',' ', s)
[] is a character class and will match on anything in it (meaning any 1 .).
I'm guessing you used it because a simple . wouldn't work, because it's a meta character meaning any character. You can simply escape it to mean a literal dot with a \. As such:
s = re.sub('\.\.',' ', s)
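Here is what your regex means: [..+]+ is a character class containing a literal period and a plus sign, repeated one or more times.
So, you allow for 1 or more literal periods or plus symbols, which is not what you want.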
You do not have to repeat the same symbol when looking for it, you can use quantifiers, like {2}, which means "exactly 2 occurrences".
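A quick way to see this is to list what each pattern actually matches. The second sample string, "a+b.+c", is made up here to show that the class also accepts plus signs.

```python
import re

s = "haha..hehe.hoho"

# The class [..+] is the same as [.+]: one period or one plus sign.
class_matches = re.findall(r'[..+]+', s)
plus_matches = re.findall(r'[.+]+', "a+b.+c")

# Escaped dot with a quantifier: exactly two literal periods.
double_dot_matches = re.findall(r'\.{2}', s)

print(class_matches)       # ['..', '.']
print(plus_matches)        # ['+', '.+']
print(double_dot_matches)  # ['..']
```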
You can use split and join, see sample working program:
import re
s = "haha..hehe.hoho"
s = " ".join(re.split(r'\.{2}', s))
print(s)
Output:
haha hehe.hoho
Or you can use the sub with the regex, too:
s = re.sub(r'\.{2}', ' ', "haha..hehe.hoho")
In case you have cases with more than 2 periods, you should use \.{2,} regex.
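The difference between the two quantifiers shows up as soon as there are three dots in a row. A small sketch with a made-up input:

```python
import re

s = "haha...hehe.hoho"  # three dots, made-up variation of the input

# {2} consumes exactly two dots and leaves the third behind.
exactly_two = re.sub(r'\.{2}', ' ', s)

# {2,} consumes the whole run of two or more dots.
two_or_more = re.sub(r'\.{2,}', ' ', s)

print(exactly_two)  # haha .hehe.hoho
print(two_or_more)  # haha hehe.hoho
```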

Match string between special characters

I've messed around with regex a little bit but am pretty unfamiliar with it for the most part. The string will be in the format:
\n\n*text here, can be any spaces, etc. etc.*
The string that I will get will have two line breaks, followed by an asterisk, followed by text, and then ending with another asterisk.
I want to exclude the beginning \n\n from the returned text. This is the pattern that I've come up with so far and it seems to work:
pattern = "(?<=\\n\\n)\*(.*)(\*)"
match = re.search(pattern, string)
if match:
    text = match.group()
    print(text)
else:
    print("Nothing")
I'm wondering if there is a better way to go about matching this pattern or if the way I'm handling it is okay.
Thanks.
You can avoid capturing groups and have the whole match as result using:
pattern = r'(?<=\n\n\*)[^*]*(?=\*)'
Example:
import re
print(re.findall(r'(?<=\n\n\*)[^*]*(?=\*)','\n\n*text here, can be any spaces, etc. etc.*'))
If you want to include the asterisk in the result you can use instead:
pattern = r'(?<=\n\n)\*[^*]*\*'
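Both patterns can be compared on the sample string with re.search. This is a sketch; the names inner and with_stars are illustrative.

```python
import re

text = "\n\n*text here, can be any spaces, etc. etc.*"

# Lookbehind and lookahead keep both asterisks out of the match.
inner = re.search(r'(?<=\n\n\*)[^*]*(?=\*)', text)

# Only the two newlines are in the lookbehind, so the asterisks are matched.
with_stars = re.search(r'(?<=\n\n)\*[^*]*\*', text)

print(inner.group())       # text here, can be any spaces, etc. etc.
print(with_stars.group())  # *text here, can be any spaces, etc. etc.*
```

Because both patterns use [^*]*, they assume the text between the delimiters contains no asterisk of its own.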
Regular expressions are overkill in a case like this -- if the delimiters are always static and at the head/tail of the string:
>>> s = "\n\n*text here, can be any spaces, etc. etc.*"
>>> def CheckString(s):
...     if s.startswith("\n\n*") and s.endswith("*"):
...         return s[3:-1]
...     else:
...         return "(nothing)"
...
>>> CheckString(s)
'text here, can be any spaces, etc. etc.'
>>> CheckString("no delimiters")
'(nothing)'
Adjust the slice indexes as needed; it wasn't clear to me whether you want to keep the leading/trailing '*' characters. If you want to keep them, change the slice to:
return s[2:]
