I am trying to split/parse comments which have strings, numbers and emojis and I want to do a generic code that can parse strings in different parts depending on the existence of an emoji in the comment.
For example:
comment_1 = "This is :) my comment :O"
comment_2 = ">:O Another comment to :v parse"
The output should be something like:
output_1 = ["This is", "my comment"]
output_2 = ["Another comment to", "parse"]
I have been thinking that I could do a parsing with special characters only, but maybe it will leave the "O" in ":O", or the "v" in ":v"
You may try matching on the pattern (?<!\S)\w+\S?(?: \w+\S?)*, which finds each run of space-separated words, where every word may be followed by one optional non-whitespace character (such as a punctuation character).
import re
inp = ["This is :) my comment :O", ">:O Another comment to :v parse"]
for i in inp:
    matches = re.findall(r'(?<!\S)\w+\S?(?: \w+\S?)*', i)
    print(matches)
This prints:
['This is', 'my comment']
['Another comment to', 'parse']
Here is an explanation of the regex pattern being used:
(?<!\S) assert that what precedes the word is either whitespace
or the start of the string
\w+ match a word
\S? followed by zero or one non whitespace character
(such as punctuation symbols)
(?: \w+\S?)* zero or more further words, each preceded by a single space
and optionally followed by one non whitespace character
When splitting, re.split is often useful. I would do:
import re
comment_1 = "This is :) my comment :O"
comment_2 = ">:O Another comment to :v parse"
output_1 = re.split(r'\s*\S?:\S+\s*', comment_1)
output_2 = re.split(r'\s*\S?:\S+\s*', comment_2)
print(output_1)
print(output_2)
output
['This is', 'my comment', '']
['', 'Another comment to', 'parse']
Note that this differs from your required output because the results contain empty strings, but these can easily be removed with a list comprehension, e.g. [i for i in output_1 if i]. The pattern r'\s*\S?:\S+\s*' can be read as zero or one non-whitespace character (\S?), followed by a colon (:) and one or more non-whitespace characters (\S+), together with any leading and trailing whitespace (\s*).
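A minimal sketch that combines the split with the filtering step described above. The function name split_on_emoticons is hypothetical, and the sketch assumes that every emoticon contains a colon:

```python
import re

# Assumption: an emoticon is an optional leading non-space character, a colon,
# and one or more non-space characters (e.g. ":)", ">:O", ":v").
EMOTICON = re.compile(r'\s*\S?:\S+\s*')

def split_on_emoticons(comment):
    """Split a comment on emoticons and drop the empty pieces."""
    return [part for part in EMOTICON.split(comment) if part]

print(split_on_emoticons("This is :) my comment :O"))
print(split_on_emoticons(">:O Another comment to :v parse"))
```

Because the pattern keys on the colon, a time such as 10:30 would also be treated as a separator, so the pattern may need tightening for real comments.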
Related
I want to split '10.1 This is a sentence. Another sentence.'
as ['10.1 This is a sentence', 'Another sentence'] and split '10.1. This is a sentence. Another sentence.' as ['10.1. This is a sentence', 'Another sentence']
I have tried
s.split(r'\D.\D')
It doesn't work, how can this be solved?
If you plan to split a string on a . char that is neither preceded nor followed by a digit, and that is not at the end of the string, a splitting approach might work for you:
re.split(r'(?<!\d)\.(?!\d|$)', text)
See the regex demo.
If your strings can contain more special cases, you could use a more customizable extracting approach:
re.findall(r'(?:\d+(?:\.\d+)*\.?|[^.])+', text)
See this regex demo. Details:
(?:\d+(?:\.\d+)*\.?|[^.])+ - a non-capturing group that matches one or more occurrences of
\d+(?:\.\d+)*\.? - one or more digits (\d+), then zero or more sequences of . and one or more digits ((?:\.\d+)*) and then an optional . char (\.?)
| - or
[^.] - any char other than a . char.
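As a rough sketch, the extracting approach can be wrapped in a helper and run on both strings from the question. The name split_sentences and the whitespace trimming are additions for illustration, not part of the original answer:

```python
import re

samples = [
    '10.1 This is a sentence. Another sentence.',
    '10.1. This is a sentence. Another sentence.',
]

def split_sentences(text):
    # Extract runs made of dotted numbers or any non-dot character,
    # then trim surrounding whitespace and drop empty pieces.
    parts = re.findall(r'(?:\d+(?:\.\d+)*\.?|[^.])+', text)
    return [p.strip() for p in parts if p.strip()]

for s in samples:
    print(split_sentences(s))
```

The trimming matters because the raw findall result keeps the space that follows each sentence-ending period.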
All sentences (except the very last one) end with a period followed by space, so split on that. Worrying about the clause number is backwards. You could potentially find all kinds of situations that you DON'T want, but it is generally much easier to describe the situation that you DO want. In this case '. ' is that situation.
import re
doc = '10.1 This is a sentence. Another sentence.'
def sentences(doc):
    # split all sentences
    s = re.split(r'\.\s+', doc)
    # remove empty index or remove period from absolute last index, if present
    if s[-1] == '':
        s = s[0:-1]
    elif s[-1].endswith('.'):
        s[-1] = s[-1][:-1]
    # return sentences
    return s
print(sentences(doc))
The way I structured my regex, it should also eliminate arbitrary whitespace between paragraphs.
You have multiple issues:
You're not using re.split(), you're using str.split().
You haven't escaped the ., use \. instead.
You're not using lookahead and lookbehind assertions, so the 3 characters matched by your pattern are consumed by the split.
Fixed code:
>>> import re
>>> s = '10.1 This is a sentence. Another sentence.'
>>> re.split(r"(?<=\D\.)(?=\D)", s)
['10.1 This is a sentence.', ' Another sentence.']
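Basically, (?<=\D\.) finds a position right after a . that is itself preceded by a non-digit character. (?=\D) then makes sure there is a non-digit character after the current position. The string is split at every position where both conditions hold. Note that splitting on an empty match like this requires Python 3.7 or later.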
I have a decent familiarity with regex but this is tricky. I need to find instances like this from a SQL case statement:
when col_name = 'this can be a word or sentence'
I can match the above when it's just one word, but when it's more than one word it's not working.
s = """when col_name = 'a sentence of words'"""
x = re.search("when\s(\w+)\s*=\s*\'(\w+)", s)
if x:
    print(x.group(1)) # this returns "col_name"
    print(x.group(2)) # this returns "a"
I want group(2) to return "a sentence of words" but I'm just getting the first word. That part could either be one word or several. How to do it?
When I add in the second \', then I get no match:
x = re.search("when\s(\w+)\s*=\s*\'(\w+)\'", s)
You may match all characters other than single quotation mark rather than matching letters, digits and connector punctuation ("word" chars) with the Group 2 pattern:
import re
s = """when col_name = 'a sentence of words'"""
x = re.search(r"when\s+(\w+)\s*=\s*'([^']+)", s)
if x:
    print(x.group(1)) # this returns "col_name"
    print(x.group(2)) # this returns "a sentence of words"
See the Python demo
The [^'] is a negated character class that matches any char but a single quotation mark, see the regex demo.
In case the string can contain escaped single quotes, you may consider replacing the Group 2 pattern ([^']+) with one of the following:
If the escape char is ': ([^']*(?:''[^']*)*)
If the escape char is \: ([^\\']*(?:\\.[^'\\]*)*).
Note the use of the raw string literal to define the regex pattern (all backslashes are treated as literal backslashes inside it).
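A small sketch of the doubled-quote variant, assuming single quotes inside the literal are escaped as ''. The helper name parse_when and the unescaping step are illustrative additions:

```python
import re

# Assumption: a single quote inside the literal is escaped by doubling it ('').
WHEN_CLAUSE = re.compile(r"when\s+(\w+)\s*=\s*'([^']*(?:''[^']*)*)'")

def parse_when(clause):
    """Return (column, literal) for a clause like: when col = 'text'."""
    m = WHEN_CLAUSE.search(clause)
    if m is None:
        return None
    # Turn the doubled quotes back into single ones.
    return m.group(1), m.group(2).replace("''", "'")

print(parse_when("when col_name = 'a sentence of words'"))
print(parse_when("when col_name = 'it''s a sentence'"))
```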
I want to use regex to match all substrings that are completely capitalized, including the spaces.
Right now I am using regexp: \w*[A-Z]\s]
HERE IS Test WHAT ARE WE SAYING
Which returns:
HERE
IS
WHAT
ARE
WE
SAYING
However, I would like it to match with all substrings that are allcaps, so that it returns:
HERE IS
WHAT ARE WE SAYING
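You can use word boundaries \b and [^\s] to prevent leading and trailing spaces. Note that [^\s] matches any non-whitespace character, not only uppercase letters, and that the pattern needs at least three characters per match. Put together it might look a little like: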
import re
string = "HERE IS Test WHAT ARE WE SAYING is that OKAY"
matches = re.compile(r"\b[^\s][A-Z\s]+[^\s]\b")
matches.findall(string)
>>> ['HERE IS', 'WHAT ARE WE SAYING', 'OKAY']
You could use findall:
import re
text = 'HERE IS Test WHAT ARE WE SAYING'
print(re.findall(r'[\sA-Z]+(?![a-z])', text))
Output
['HERE IS ', ' WHAT ARE WE SAYING']
The pattern [\sA-Z]+(?![a-z]) matches one or more whitespace characters or uppercase letters, provided the run is not followed by a lowercase letter. The notation (?![a-z]) is known as a negative lookahead (see Regular Expression Syntax).
One option is to use re.split with the pattern \s*(?:\w*[^A-Z\s]\w*\s*)+:
import re
text = "HERE IS Test WHAT ARE WE SAYING"
parts = re.split(r'\s*(?:\w*[^A-Z\s]\w*\s*)+', text)
print(parts)
['HERE IS', 'WHAT ARE WE SAYING']
The idea here is to split on any sequential cluster of words in which each word contains at least one character that is not an uppercase letter.
You can use [A-Z ]+ to match capital letters and spaces, and use negative lookahead (?! ) and negative lookbehind (?<! ) to forbid the first and last character from being a space.
Finally, surrounding the pattern with \b to match word boundaries will make it only match full words.
import re
text = "A ab ABC ABC abc Abc aBc abC C"
pattern = r'\b(?! )[A-Z ]+(?<! )\b'
re.findall(pattern, text)
>>> ['A', 'ABC ABC', 'C']
You can also use the following method:
>>> import re
>>> s = 'HERE IS Test WHAT ARE WE SAYING'
>>> print(re.findall(r'((?!\s+)[A-Z\s]+(?![a-z]+))', s))
OUTPUT:
['HERE IS ', 'WHAT ARE WE SAYING']
Using findall() without matching leading and trailing spaces:
re.findall(r"\b[A-Z]+(?:\s+[A-Z]+)*\b",s)
Out: ['HERE IS', 'WHAT ARE WE SAYING']
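For comparison, here is a short sketch that runs the last pattern on both sample strings used in this thread. The helper name all_caps_runs is made up for the example:

```python
import re

# One or more all-uppercase words separated by whitespace, bounded by \b.
ALL_CAPS_RUN = re.compile(r"\b[A-Z]+(?:\s+[A-Z]+)*\b")

def all_caps_runs(text):
    return ALL_CAPS_RUN.findall(text)

print(all_caps_runs("HERE IS Test WHAT ARE WE SAYING"))
print(all_caps_runs("A ab ABC ABC abc Abc aBc abC C"))
```

Because each word must consist solely of [A-Z], mixed-case words such as Abc or abC are rejected, and the matches never carry leading or trailing spaces.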
I want to replace all single quotes in the string with double with the exception of occurrences such as "n't", "'ll", "'m" etc.
input="the stackoverflow don\'t said, \'hey what\'"
output="the stackoverflow don\'t said, \"hey what\""
Code 1:(#https://stackoverflow.com/users/918959/antti-haapala)
def convert_regex(text):
    return re.sub(r"(?<!\w)'(?!\w)|(?<!\w)'(?=\w)|(?<=\w)'(?!\w)", '"', text)
There are 3 cases: ' is neither preceded nor followed by an alphanumeric character; or it is not preceded, but is followed, by an alphanumeric character; or it is preceded, but not followed, by an alphanumeric character.
Issue: That doesn't work on words that end in an apostrophe, i.e.
most possessive plurals, and it also doesn't work on informal
abbreviations that start with an apostrophe.
Code 2:(#https://stackoverflow.com/users/953482/kevin)
def convert_text_func(s):
    c = "_"  # placeholder character. Must NOT appear in the string.
    assert c not in s
    protected = {word: word.replace("'", c) for word in ["don't", "it'll", "I'm"]}
    for k, v in protected.items():  # .items() replaces Python 2's .iteritems()
        s = s.replace(k, v)
    s = s.replace("'", '"')
    for k, v in protected.items():
        s = s.replace(v, k)
    return s
Issue: The set of words to protect is too large to specify by hand; for example, how would one list every plural possessive such as persons'?
Please help.
Edit 1:
I am using anubhava's brilliant answer, but I am facing this issue: sometimes the text contains words translated from other languages, and the approach fails on them.
Code=
text=re.sub(r"(?<!s)'(?!(?:t|ll|e?m|s|d|ve|re|clock)\b)", '"', text)
Problem:
In the text 'Kumbh melas', melas is a Hindi word carried over into English, not a plural possessive noun.
Input="Similar to the 'Kumbh melas', celebrated by the banks of the holy rivers of India,"
Output=Similar to the "Kumbh melas', celebrated by the banks of the holy rivers of India,
Expected Output=Similar to the "Kumbh melas", celebrated by the banks of the holy rivers of India,
I am looking maybe to add a condition that somehow fixes it. Human-level intervention is the last option.
Edit 2:
Naive and long approach to fix:
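import string
import enchant  # third-party package (pyenchant)
def replace_translations(text):
    d = enchant.Dict("en_US")
    words = tokenize_words(text)  # tokenize_words is my own helper (not shown)
    punctuations = list(string.punctuation)
    for i, word in enumerate(words):
        print(i, word)
        # i < len(words) - 1 guards the words[i + 1] lookup on the last token
        if i < len(words) - 1 and word not in punctuations and not d.check(word) and words[i + 1] == "'":
            text = text.replace(words[i] + words[i + 1], words[i] + '"')
    return text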
Are there any corner cases I am missing or are there any better approaches?
First attempt
You can also use this regex:
(?:(?<!\w)'((?:.|\n)+?'?)'(?!\w))
DEMO IN REGEX101
This regex matches the whole quoted sentence/word, including both quotation marks, and also captures the content of the quotation in group 1, so you can replace the matched part with "\1".
(?<!\w) - negative lookbehind: the quote must not be preceded by a word character. This excludes words like "you'll", etc., but still allows the regex to match quotations after characters like \n, :, ;, . or -. Assuming that there will always be whitespace before a quotation is risky.
' - a single quotation mark,
((?:.|\n)+?'?) - capturing group 1: one or more of any character or
new line (to match multiline sentences) with a lazy quantifier (to avoid
matching from the first to the last single quotation mark), followed by
an optional single quotation mark, in case there are two in a row,
'(?!\w) - a single quotation mark not followed by a word character, to exclude
text like "i'm", "you're" etc., where the quotation mark is between letters.
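A minimal sketch of how the first-attempt pattern might be applied with re.sub. The function name to_double_quotes is hypothetical; the pattern itself is the one given above:

```python
import re

# The "first attempt" pattern; group 1 holds the text between the quotes.
QUOTED = re.compile(r"(?:(?<!\w)'((?:.|\n)+?'?)'(?!\w))")

def to_double_quotes(text):
    # Re-emit group 1 wrapped in double quotes; apostrophes inside words
    # (don't, I'll) never start a match because of the (?<!\w) lookbehind.
    return QUOTED.sub(r'"\1"', text)

print(to_double_quotes("the stackoverflow don't said, 'hey what'"))
print(to_double_quotes("Similar to the 'Kumbh melas', celebrated by the banks"))
```

The replacement template is a raw string so that \1 is read as a group reference.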
The s' case
However, it still has a problem with sentences where an apostrophe occurs after a word ending in s, like: 'the classes' hours'. I think it is impossible to distinguish with a regex whether an s followed by ' should be treated as the end of a quotation or as a possessive s with an apostrophe. But I figured out a limited workaround for this problem, with the regex:
(?:(?<!\w)'((?:.|\n)+?'?)(?:(?<!s)'(?!\w)|(?<=s)'(?!([^']|\w'\w)+'(?!\w))))
DEMO IN REGEX101
PYTHON IMPLEMENTATION
It adds an alternative for the s' cases: (?<!s)'(?!\w)|(?<=s)'(?!([^']|\w'\w)+'(?!\w)) where:
(?<!s)'(?!\w) - if there is no s before ', match as in the regex above (first attempt),
(?<=s)'(?!([^']|\w'\w)+'(?!\w)) - if there is an s before ', end the match on this ' only if no other ' followed by a non-word
character occurs in the following text, before the end or before another ' (but only a ' preceded by a letter other than s, or the opening of the next quotation). The \w'\w is there to include in such a match any ' that sits between letters, like in i'm, etc.
This regex should only match wrongly if there are a couple of s' cases in a row. Still, it is far from a perfect solution.
Flaws of \w
Also, when using \w there is always a chance that ' occurs after a symbol, or after a character that is a letter but not in [a-zA-Z_0-9], such as a local-language character, and it will then be treated as the beginning of a quotation. This could be avoided by replacing (?<!\w) and (?!\w) with (?<!\p{L}) and (?!\p{L}), or with something like (?<=^|[,.?!)\s]), i.e. a positive lookaround for the characters which can occur in a sentence before a quotation. However, such a list could be quite long. Note that Python's built-in re module supports neither \p{L} nor variable-width lookbehinds; the third-party regex module supports \p{L}.
You can use:
import re
text = "I'm one of the persons' stackoverflow don't th'em said, 'hey what' I'll handle it."
print(re.sub(r"(?<!s)'(?!(?:t|ll|e?m)\b)", '"', text))
Output:
I'm one of the persons' stackoverflow don't th'em said, "hey what" I'll handle it.
RegEx Demo
Try this: you can use the regex ((?<=\s)'([^']+)'(?=\s)) and replace each match with "\2" (group 2 wrapped in double quotes)
import re
p = re.compile(r"((?<=\s)'([^']+)'(?=\s))")
test_str = "I'm one of the persons' stackoverflow don't th'em said, 'hey what' I'll handle it."
subst = r'"\2"'  # raw string, so \2 is a group reference rather than an octal escape
result = p.sub(subst, test_str)
print(result)
Output
I'm one of the persons' stackoverflow don't th'em said, "hey what" I'll handle it.
Demo
Here is a non-regex way of doing it
text = "the stackoverflow don't said, 'hey what'"
out = []
for i, j in enumerate(text):
    if j == "'":
        # "'m" is two characters long, so compare it with a two-character slice
        if text[i-1:i+2] == "n't" or text[i:i+3] == "'ll" or text[i:i+2] == "'m":
            out.append(j)
        else:
            out.append('"')
    else:
        out.append(j)
print(''.join(out))
gives as an output
the stackoverflow don't said, "hey what"
Of course, you can improve the exclusion list so that you do not have to check each exclusion manually...
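One possible way to generalize the exclusion check is sketched below. The SUFFIXES tuple and the names is_contraction and convert_quotes are assumptions made for illustration, not part of the original answer:

```python
# Hypothetical list of contraction endings that may follow an apostrophe.
SUFFIXES = ("t", "ll", "m", "re", "ve", "d", "s")

def is_contraction(text, i):
    """True if the apostrophe at index i sits inside a word like don't or I'll."""
    if i == 0 or not text[i - 1].isalpha():
        return False  # an apostrophe at a word start opens a quotation
    for suffix in SUFFIXES:
        end = i + 1 + len(suffix)
        # The suffix must follow directly and must end the word.
        if text.startswith(suffix, i + 1) and not text[end:end + 1].isalpha():
            return True
    return False

def convert_quotes(text):
    return "".join(
        ch if ch != "'" or is_contraction(text, i) else '"'
        for i, ch in enumerate(text)
    )

print(convert_quotes("the stackoverflow don't said, 'hey what'"))
```

This sketch still converts a trailing possessive apostrophe (as in persons'), so it shares that limitation with the loop above.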
Here is another possible way of doing it:
import re
text = "I'm one of the persons' stackoverflow don't th'em said, 'hey what' I'll handle it."
print(re.sub(r"((?<!s)'(?!\w+)|(\s+'))", '"', text))
I have tried to avoid the need for special cases. Note that the whitespace before an opening quote is consumed by the replacement. It gives:
I'm one of the persons' stackoverflow don't th'em said,"hey what" I'll handle it.
I have been fighting with this regex for too long now.
The split should use a blank as the separator
but keep any remaining blanks in a run of blanks with the next token
'123 45  678  123.0'
=>
'123', '45', ' 678', ' 123.0'
My numbers are floats as well and the group count is unknown.
What about using a lookbehind assertion?
>>> import re
>>> regex = re.compile(r'(?<=[^\s])\s')
>>> regex.split('this  is a  string')
['this', ' is', 'a', ' string']
regex breakdown:
(?<=...) #lookbehind. Only match if the `...` matches before hand
[^\s] #Anything that isn't whitespace
\s #single whitespace character
In English, this translates to "match a single whitespace character if it is preceded by a non-whitespace character."
Or you can use a negative lookbehind assertion:
regex = re.compile(r'(?<!\s)\s')
which might be slightly nicer (as suggested in the comments), and should be relatively easy to figure out how it works since it is very similar to the above.
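Applied to the numbers from the question, the lookbehind split might look like the sketch below. It assumes the input contains runs of two blanks, as the expected output implies, and the function name split_keep_extra_blanks is made up:

```python
import re

# Split on a whitespace character only when a non-whitespace character precedes it.
SINGLE_BLANK = re.compile(r'(?<=[^\s])\s')

def split_keep_extra_blanks(text):
    return SINGLE_BLANK.split(text)

print(split_keep_extra_blanks('123 45  678  123.0'))
```

Only the first blank of each run is consumed as the separator, so any further blanks stay attached to the following token.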