Matching alphanumeric words, mentions or emails with Python regex


I've already read several related questions and lots of other posts. None of them solves my problem.
I'd like to filter a string that may contain emails or strings starting with "#" (like emails but without the text before the "#"). I've tested many patterns, and one of the simplest that comes close is:
import re
re.split(r'(#)', "test #aa test2 #bb #cc t-es #dd-#ee, test#again")
Out[40]:
['test ', '#', 'aa test2 ', '#', 'bb ', '#', 'cc t-es ', '#', 'dd-', '#', 'ee, test', '#', 'again']
I'm looking for the right regexp that could give me:
['test ', '#aa', 'test2 ', '#bb ', '#cc', 't-es ', '#dd-', '#ee', 'test#again']

Why try to split when you can go "yo regex, give me all that matches":
test = "test #aa test2 #bb #cc t-es #dd-#ee, test#again"
import re
print(
    re.findall(r"[^\s#]*?#?[^#]* |[^#]*#[^\s#]*", test)
)
# ['test ', '#aa test2 ', '#bb ', '#cc t-es ', '#dd-', '#ee, ', 'test#again']
I tried, but I couldn't make the regex any smaller. At least it works, and who expects a regex to be small anyway?
Per the OP's updated (or corrected) requirements, the pattern becomes:
[^\s#]*?#?[^\s#]* |[^#]*#[^\s#]*

My own solution based on different email parsing + simple "#[:alphanum:]+" parsing is:
USERNAME_OR_EMAIL_REGEX = re.compile(
    r"#[a-zA-Z0-9-]+"  # simple username
    r"|"
    r"[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"  # email
    r"#"  # following: domain name:
    r"[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)")
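For comparison, here is a minimal sketch of the same idea with a much looser pattern. The name MENTION_OR_EMAIL and the character classes are my own simplification, not the stricter regex above: it tries an email-like token (text, "#", domain) first and falls back to a bare "#name" mention.

```python
import re

# Hypothetical, simplified pattern: email-like token first, bare mention second.
# [\w.+-]+ is a loose stand-in for the local part; it is not a full email grammar.
MENTION_OR_EMAIL = re.compile(r"[\w.+-]+#[\w-]+(?:\.[\w-]+)*|#[\w-]+")

text = "test #aa test2 #bb #cc t-es #dd-#ee, test#again"
tokens = MENTION_OR_EMAIL.findall(text)
print(tokens)
```

Only the mentions and email-like tokens come back; the plain words in between are dropped, so this fits filtering rather than splitting.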

Related

Split a string in python with side by side delimiters

The Question:
Given a list of strings create a function that returns the same list but split along any of the following delimiters ['&', 'OR', 'AND', 'AND/OR', 'IFT'] into a list of lists of strings.
Note the delimiters can be mixed inside a string, there can be many adjacent delimiters, and the list is a column from a dataframe.
EX//
function(["Mary & had a little AND lamb", "Twinkle twinkle ITF little OR star"])
>> [['Mary', 'had a little', 'lamb'], ['Twinkle twinkle', 'little', 'star']]
function(["Mary & AND had a little OR IFT lamb", "Twinkle twinkle AND & ITF little OR & star"])
>> [['Mary', 'had a little', 'lamb'], ['Twinkle twinkle', 'little', 'star']]
My Solution Attempt
Start by replacing any kind of delimiter with a &. I include spaces on either side so that other words like HANDY don't get affected. Next, split each string along the & delimiter, knowing that every other kind of delimiter has been replaced.
def clean_and_split(lolon):
    # Constants
    banned_list = {' AND ', ' OR ', ' ITF ', ' AND/OR '}
    # Loop through each list of strings
    for i in range(len(lolon)):
        # Loop through each delimiter and replace it with ' & '
        for word in banned_list:
            lolon[i] = lolon[i].replace(word, ' & ')
        # Split the string along the ' & ' delimiter
        lolon[i] = lolon[i].split('&')
    return lolon
The problem is that side-by-side delimiters often get replaced in a way that leaves an empty string in the middle. Also, certain combinations of delimiters don't get removed. This is because when the 'replace' method reads ' OR OR OR ', it replaces the first ' OR ' (since it matches) but not the second, because the leading space has already been consumed and it reads the second one as 'OR '.
EX//
clean_and_split(["Mario AND Luigi AND & Peach"])
>> [['Mario ', ' Luigi ', ' ', ' Peach']]
clean_and_split(["Mario OR OR OR Luigi", "Testing AND AND PlsWork "])
>> [['Mario ', ' OR ', ' Luigi '], ['Testing', 'AND PlsWork']]
The work around to resolve this is to make banned_list = {' AND ', ' OR ', ' ITF ', ' AND/OR ', ' AND ', ' OR ', ' ITF ', ' AND/OR '} forcing the code to loop through everything twice.
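A small sketch of why a single pass misses the middle delimiter (the variable names once and twice are only for illustration): str.replace scans left to right and never reuses a character it has already consumed, so the space shared by two neighbouring ' OR ' tokens only counts for the first one.

```python
text = "Mario OR OR OR Luigi"

# First pass: the 1st and 3rd ' OR ' match; the 2nd has lost its leading space.
once = text.replace(" OR ", " & ")
print(once)

# Second pass: the leftover 'OR' is now surrounded by spaces again and gets replaced.
twice = once.replace(" OR ", " & ")
print(twice)
```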
Alternate Solution?
Split the column along a list of delimiters. The problem with this is that back to back delimiters don't get caught
df['Correct_Column'].str.split('(?: AND | IFT | OR | & )')
EX//
function(["Mary & AND had a little OR IFT lamb", "Twinkle twinkle AND & ITF little OR & star"])
>> [['Mary', 'AND had a little', 'IFT lamb'], ['Twinkle twinkle', '& little', '& star']]
There HAS to be a more elegant way!

This is where a lookahead and lookbehind are useful, as they won't eat up the spaces you use to match correctly:
import re
text = 'Mary & had a little AND OR lamb, white as ITF snow OR'
# Group the alternatives so both lookarounds apply to every delimiter, longest first.
replaced = re.sub(r'(?<!\S)(?:AND/OR|AND|OR|ITF|&)(?!\S)', '&', text)
parts = [stripped for s in replaced.split('&') if (stripped := s.strip())]
print(parts)
Result:
['Mary', 'had a little', 'lamb, white as', 'snow']
However, note that:
the parts = line may solve most of your problems anyway, using your own method;
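a lookbehind requires a fixed-width pattern in Python's re module (a lookahead does not), so something like (?<=\s|^) raises an error; the negative forms (?<!\S) and (?!\S) are fixed-width and also succeed at the start and end of the string. Either way, the OR at the end still causes an empty string to be found at the end;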
the lookahead/lookbehind correctly deals with 'AND OR', but still finds an empty string in between, which is removed on the parts = line;
the walrus operator is in the parts = line as a simple way to filter out empty strings; stripped := s.strip() is not truthy if the result is an empty string, so stripped will only show up in the list if it is not an empty string.
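If you would rather skip the intermediate '&' replacement, the same idea can be sketched as a single re.split over whole runs of delimiters. The names DELIM_RUN and split_on_delimiters are made up for this example, and I've listed both IFT and ITF because the question uses both spellings:

```python
import re

# One match covers a whole run of whitespace-separated delimiters.
# (?<!\S) / (?!\S) make sure a delimiter is a standalone token (so HANDY is safe).
DELIM_RUN = re.compile(r"\s*(?<!\S)(?:(?:AND/OR|AND|OR|IFT|ITF|&)(?!\S)\s*)+")

def split_on_delimiters(strings):
    """Split each string on runs of delimiters, dropping empty pieces."""
    return [
        [part.strip() for part in DELIM_RUN.split(s) if part.strip()]
        for s in strings
    ]

result = split_on_delimiters([
    "Mary & AND had a little OR IFT lamb",
    "Twinkle twinkle AND & ITF little OR & star",
    "Mario OR OR OR Luigi",
    "HANDY ANDY OR",
])
print(result)
```

Because adjacent delimiters are swallowed by one match, no empty strings appear between them; the final filter only matters for a delimiter at the very start or end.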

Parse sentences with [value](type) format

I want to parse and extract key, values from a given sentence which follow the following format:
I want to get [samsung](brand) within [1 week](duration) to be happy.
I want to convert it into a split list like below:
['I want to get ', 'samsung:brand', ' within ', '1 week:duration', ' to be happy.']
I have tried to split it using [ or ) :
re.split('\[|\]|\(|\)',s)
which is giving output:
['I want to get ',
'samsung',
'',
'brand',
' within ',
'1 week',
'',
'duration',
' to be happy.']
and
re.split('\[||\]|\(|\)',s)
is giving below output :
['I want to get ',
'samsung](brand) within ',
'1 week](duration) to be happy.']
Any help is appreciated.
Note: This is similar to stackoverflow inline links as well where if we type : go to [this link](http://google.com) it parse it as link.

As first step we split the string, and in second step we modify the string:
s = 'I want to get [samsung](brand) within [1 week](duration) to be happy.'
import re
s = re.split(r'(\[[^]]*\]\([^)]*\))', s)
s = [re.sub(r'\[([^]]*)\]\(([^)]*)\)', r'\1:\2', i) for i in s]
print(s)
Prints:
['I want to get ', 'samsung:brand', ' within ', '1 week:duration', ' to be happy.']

You may use a two step approach: process the [...](...) first to format as needed and protect these using some rare/unused chars, and then split with that pattern.
Example:
import re
s = "I want to get [samsung](brand) within [1 week](duration) to be happy."
print(re.split(r'｟([^｟｠]+)｠', re.sub(r'\[([^][]*)]\(([^()]*)\)', r'｟\1:\2｠', s)))
The \[([^\][]*)]\(([^()]*)\) pattern matches
\[ - a [ char
([^\][]*) - Group 1 ($1): any 0+ chars other than [ and ]
]\( - ]( substring
([^()]*) - Group 2 ($2): any 0+ chars other than ( and )
\) - a ) char.
The ｟([^｟｠]+)｠ pattern just matches any ｟...｠ substring but keeps what is in between as it is captured.
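If what you ultimately need is the key/value pairs rather than the split list, the same two capture groups can be pulled out directly with re.findall. This is only a sketch; PAIR and pairs are names I made up, and the brackets are escaped inside the character class to keep the pattern unambiguous:

```python
import re

# Group 1: text inside [...], group 2: text inside (...).
PAIR = re.compile(r'\[([^\[\]]*)\]\(([^()]*)\)')

s = "I want to get [samsung](brand) within [1 week](duration) to be happy."
pairs = PAIR.findall(s)          # list of (value, type) tuples
by_type = {kind: value for value, kind in pairs}
print(pairs)
print(by_type)
```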

You could replace the ]( substring with : first, then split on the [ and ) characters:
re.split(r'\[|\)', s.replace('](', ':'))

One approach, using re.split with a lambda function:
sentence = "I want to get [samsung](brand) within [1 week](duration) to be happy."
parts = re.split(r'(?<=[\])])\s+|\s+(?=[\[(])', sentence)
processTerms = lambda x: re.sub(r'\[([^\]]+)\]\(([^)]+)\)', r'\1:\2', x)
parts = list(map(processTerms, parts))
print(parts)
['I want to get', 'samsung:brand', 'within', '1 week:duration', 'to be happy.']

How to split at spaces and commas in Python?

I've been looking around here, but I didn't find anything that was close to my problem. I'm using Python3.
I want to split a string at every whitespace and at commas. Here is what I got now, but I am getting some weird output:
(Don't worry, the sentence is translated from German)
import re
sentence = "We eat, Granny"
split = re.split(r'(\s|\,)', sentence.strip())
print (split)
>>>['We', ' ', 'eat', ',', '', ' ', 'Granny']
What I actually want to have is:
>>>['We', ' ', 'eat', ',', ' ', 'Granny']

I'd go for findall instead of split and just match all the desired contents, like
import re
sentence = "We eat, Granny"
print(re.findall(r'\s|,|[^,\s]+', sentence))
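To see the difference side by side, here is a short sketch (split_result and findall_result are just illustrative names). re.split with a capturing group returns the text between every pair of delimiters, and between the comma and the following space that text is the empty string; findall never produces it because every alternative matches at least one character.

```python
import re

sentence = "We eat, Granny"

# The comma and the space are adjacent delimiters, so split reports '' between them.
split_result = re.split(r'(\s|,)', sentence)
# findall only returns what the pattern actually matched.
findall_result = re.findall(r'\s|,|[^,\s]+', sentence)

print(split_result)
print(findall_result)
```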

This should work for you:
import re
sentence = "We eat, Granny"
split = list(filter(None, re.split(r'(\s|\,)', sentence.strip())))
print (split)

Alternate way:
import re
sentence = "We eat, Granny"
split = [a for a in re.split(r'(\s|\,)', sentence.strip()) if a]
Output:
['We', ' ', 'eat', ',', ' ', 'Granny']
Works with both Python 2.7 and 3.

Regex - Splitting Strings at full-stops unless it's part of an honorific [closed]

I have a list containing all possible titles:
['Mr.', 'Mrs.', 'Ms.', 'Dr.', 'Prof.', 'Rev.', 'Capt.', 'Lt.-Col.', 'Col.', 'Lt.-Cmdr.', 'The Hon.', 'Cmdr.', 'Flt. Lt.', 'Brgdr.', 'Wng. Cmdr.', 'Group Capt.' ,'Rt.', 'Maj.-Gen.', 'Rear Admrl.', 'Esq.', 'Mx', 'Adv', 'Jr.']
I need Python 2.7 code that replaces every full stop \. with a newline \n unless it is part of one of the above titles.
Splitting it into a list of strings would be fine as well.
Sample Input:
Modi is waiting in line to Thank Dr. Manmohan Singh for preparing a road map for introduction of GST in India. The bill is set to pass.
Sample Output:
Modi is waiting in line to Thank Dr. Manmohan Singh for preparing a road map for introduction of GST in India.
The bill is set to pass.

This should do the trick. Here we use a conditional to append a \n to each word that contains a full stop and is not in the list of keywords; otherwise we just append a space.
Finally the words in the sentence are joined using join(), and we use rstrip() to eliminate any newline remaining at the end of the string.
l = set(['Mr.', 'Mrs.', 'Ms.', 'Dr.', 'Prof.', 'Rev.', 'Capt.', 'Lt.-Col.',
         'Col.', 'Lt.-Cmdr.', 'The Hon.', 'Cmdr.', 'Flt. Lt.', 'Brgdr.', 'Wng. Cmdr.',
         'Group Capt.', 'Rt.', 'Maj.-Gen.', 'Rear Admrl.', 'Esq.', 'Mx', 'Adv', 'Jr.'])
s = ('Modi is waiting in line to Thank Dr. Manmohan Singh for preparing a road '
     'map for introduction of GST in India. The bill is set to pass.')

def split_at_period(input_str, keywords):
    final = []
    split_l = input_str.split(' ')
    for word in split_l:
        # A word with a period that is not a known title ends a sentence.
        if '.' in word and word not in keywords:
            final.append(word + '\n')
            continue
        final.append(word + ' ')
    return ''.join(final).rstrip()

print split_at_period(s, l)
or a one liner :D
print ''.join([w + '\n' if '.' in w and w not in l else w + ' ' for w in s.split(' ')]).rstrip()
Sample output:
Modi is waiting in line to Thank Dr. Manmohan Singh for preparing a road map for introduction of GST in India.
The bill is set to pass.
How it works?
Firstly we split up our string with a space ' ' delimiter using the split() string function, thus returning the following list:
>>> ['Modi', 'is', 'waiting', 'in', 'line', 'to', 'Thank', 'Dr.',
'Manmohan', 'Singh', 'for', 'preparing', 'a', 'road', 'map', 'for',
'introduction', 'of', 'GST', 'in', 'India.', 'The', 'bill', 'is',
'set', 'to', 'pass.']
We then start to build up a new list by iterating through the split-up list. If we see a word that contains a period, but is not a keyword, (Ex: India. and pass. in this case) then we have to concatenate a newline \n to the word to begin the new sentence. We can then append() to our final list, and continue out of the current iteration.
If the word does not end off a sentence with a period, we can just concatenate a space to rebuild the original string.
This is what final looks like before it is built as a string using join().
>>> ['Modi ', 'is ', 'waiting ', 'in ', 'line ', 'to ', 'Thank ', 'Dr. ',
'Manmohan ', 'Singh ', 'for ', 'preparing ', 'a ', 'road ', 'map ',
'for ', 'introduction ', 'of ', 'GST ', 'in ', 'India.\n', 'The ', 'bill ',
'is ', 'set ', 'to ', 'pass.\n']
Excellent, we have spaces and newlines where they need to be! Now we can rebuild the string. Notice, however, that the last element in the list also happens to contain a \n; we can clean that up by calling rstrip() on our new string.
The initial solution did not support spaces in the keywords, so I've included a new, more robust solution below:
import re
def format_string(input_string, keywords):
    # Escape each keyword so the '.' in a title is matched literally.
    regexes = '|'.join(re.escape(k) for k in keywords)
    split_list = re.split(regexes, input_string)  # Split on keys.
    removed = re.findall(regexes, input_string)   # Find removed keys.
    newly_joined = split_list + removed           # Interleave removed and split.
    newly_joined[::2] = split_list
    newly_joined[1::2] = removed
    space_regex = r'\.\s*'
    for index, section in enumerate(newly_joined):
        if '.' in section and section not in removed:
            newly_joined[index] = re.sub(space_regex, '.\n', section)
    return ''.join(newly_joined).strip()

convert all titles (and sole dot) into a regular expression
use a replacement callback
code:
import re
l = "|".join(map(re.escape,['.','Mr.', 'Mrs.', 'Ms.', 'Dr.', 'Prof.', 'Rev.', 'Capt.', 'Lt.-Col.', 'Col.', 'Lt.-Cmdr.', 'The Hon.', 'Cmdr.', 'Flt. Lt.', 'Brgdr.', 'Wng. Cmdr.', 'Group Capt.' ,'Rt.', 'Maj.-Gen.', 'Rear Admrl.', 'Esq.', 'Mx', 'Adv', 'Jr.']))
e="Dear Mr. Foo, I would like to thank you. Because Lt.-Col. Collins told me blah blah. Bye."
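def do_repl(m):
    s = m.group(1)
    # A bare dot ends a sentence; a title is returned unchanged.
    if s == ".":
        rval = ".\n"
    else:
        rval = s
    return rval

z = re.sub("(" + l + ")", do_repl, e)
# bonus: blanks around the newlines are stripped, even though that's not the question
# (the fourth positional argument of re.sub is count, so no flag is passed here)
z = re.sub(r"\s*\n\s*", "\n", z)
print(z)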
output:
Dear Mr. Foo, I would like to thank you.
Because Lt.-Col. Collins told me blah blah.
Bye.
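Since the question says a list of strings would be fine as well, here is a sketch of the same callback idea returning a list. TITLES is shortened to a sample, and TOKEN and split_sentences are names I made up. Two extra guards are assumptions of mine rather than part of the answer above: titles are sorted longest first, and (?<!\w) stops a title from matching in the middle of a longer word.

```python
import re

# Abbreviated sample of the titles from the question.
TITLES = ['Mr.', 'Mrs.', 'Dr.', 'Lt.-Col.', 'Col.', 'Flt. Lt.']

# Titles (longest first) are tried before the bare dot at each position.
title_alt = "|".join(re.escape(t) for t in sorted(TITLES, key=len, reverse=True))
TOKEN = re.compile(r"(?<!\w)(?:%s)|\.\s*" % title_alt)

def split_sentences(text):
    """Return the sentences of text, keeping titles such as 'Dr.' intact."""
    def repl(m):
        matched = m.group(0)
        # A match starting with '.' is a sentence end; anything else is a title.
        return ".\n" if matched.startswith(".") else matched
    marked = TOKEN.sub(repl, text)
    return [line for line in marked.split("\n") if line]

e = "Dear Mr. Foo, I would like to thank you. Because Lt.-Col. Collins told me blah blah. Bye."
sentences = split_sentences(e)
print(sentences)
```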

Why isn't this regex parsing the whole string?

Writing a simple script to parse a large text file into words, their parent sentences, and some metadata (are they within a quote, etc.). Trying to get the regex to function properly and running into a strange issue. Here's a small bit of test code showing what's going on with my parsing. The white space is intentional, but I can't understand why the last 'word' is not parsing. It is not preceded by any problematic characters (at least as far as I can tell using repr) and when I run parse() on just the problem 'word' it returns the expected array of single words and spaces.
Code:
def parse(new_line):
    new_line = new_line.rstrip()
    word_array = re.split('([\.\?\!\ ])',new_line,re.M)
    print(word_array)

x = full_text.readline()
print(repr(x))
parse(x)
Output:
'Far out in the uncharted backwaters of the unfashionable end of the western spiral arm of the Galaxy\n'
['Far', ' ', 'out', ' ', 'in', ' ', 'the', ' ', 'uncharted', ' ', 'backwaters', ' ', 'of', ' ', 'the', ' ', 'unfashionable end of the western spiral arm of the Galaxy']

re.M is 8, and you're passing that as the maxsplit positional argument. You want flags=re.M instead.
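A quick sketch that makes the effect visible (the sample line is shortened, and limited / full are just illustrative names). Passing maxsplit=int(re.M) is what the positional third argument amounts to, so the split stops after 8 delimiters and the rest of the line comes back as one element:

```python
import re

line = "Far out in the uncharted backwaters of the unfashionable end"
pattern = r'([.?! ])'

# Equivalent to re.split(pattern, line, re.M): the third parameter is maxsplit.
limited = re.split(pattern, line, maxsplit=int(re.M))
# Passing the flag by keyword leaves maxsplit at its default of 0 (no limit).
full = re.split(pattern, line, flags=re.M)

print(int(re.M))
print(limited[-1])
print(full[-1])
```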
