I have the following string:
text = 'interesting/practical/useful'
I need to split it like interesting / practical / useful using Regex.
I'm trying this
newText = re.sub(r'[a-zA-Z](/)[a-zA-Z]', ' \g<0> ', text)
but it returns
interestin g/p ractica l/u seful
P.S. I have other texts in the corpus that also contain forward slashes that shouldn't be altered by this operation, which is why I'm looking for regex patterns specifically with letters before and after the '/'.
Use lookarounds for the letters before and after the /, so they're not included in the match.
newText = re.sub(r'(?<=[a-zA-Z])/(?=[a-zA-Z])', r' \g<0> ', text)
I usually do it this way:
newText = re.sub(r'([a-zA-Z])/([a-zA-Z])', r'\1 / \2', text)
\1 and \2 are references to the first and second expressions captured by the parentheses (). The replacement must be a raw string; otherwise Python reads '\1' and '\2' as control characters.
In plain English it means: look for a letter, a slash and a letter, and replace them with the first captured letter, a space, a slash, a space and the second captured letter.
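A minimal sketch comparing the two approaches above. The function names and the extra sample strings are made up for illustration. Both approaches give the same result on the question's text, but the capture-group version consumes the letters it matches, so it can skip a slash when single letters are chained (as in 'a/b/c'), whereas the lookaround version does not.

```python
import re

# Lookarounds only assert that letters surround the slash; they do not consume them.
LOOKAROUND = re.compile(r'(?<=[a-zA-Z])/(?=[a-zA-Z])')
# Capture groups consume the surrounding letters and put them back in the replacement.
CAPTURE = re.compile(r'([a-zA-Z])/([a-zA-Z])')


def space_with_lookaround(text):
    return LOOKAROUND.sub(r' \g<0> ', text)


def space_with_groups(text):
    return CAPTURE.sub(r'\1 / \2', text)


text = 'interesting/practical/useful'
print(space_with_lookaround(text))          # interesting / practical / useful
print(space_with_groups(text))              # interesting / practical / useful
print(space_with_lookaround('1/2 and/or'))  # 1/2 and / or  (digits are left alone)
print(space_with_lookaround('a/b/c'))       # a / b / c
print(space_with_groups('a/b/c'))           # a / b/c  (the second slash is missed)
```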
I want to split '10.1 This is a sentence. Another sentence.'
as ['10.1 This is a sentence', 'Another sentence'] and split '10.1. This is a sentence. Another sentence.' as ['10.1. This is a sentence', 'Another sentence']
I have tried
s.split(r'\D.\D')
It doesn't work, how can this be solved?
If you plan to split a string on a . char that is neither preceded nor followed by a digit, and that is not at the end of the string, a splitting approach might work for you:
re.split(r'(?<!\d)\.(?!\d|$)', text)
If your strings can contain more special cases, you could use a more customizable extracting approach:
re.findall(r'(?:\d+(?:\.\d+)*\.?|[^.])+', text)
Details:
(?:\d+(?:\.\d+)*\.?|[^.])+ - a non-capturing group that matches one or more occurrences of
\d+(?:\.\d+)*\.? - one or more digits (\d+), then zero or more sequences of . and one or more digits ((?:\.\d+)*) and then an optional . char (\.?)
| - or
[^.] - any char other than a . char.
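As a rough sketch, the splitting and extracting patterns above can be run side by side on both sample strings from the question. The helper names are hypothetical, and stripping the surrounding whitespace is an extra step added here. Note that the splitting approach keeps the final period, while the extracting approach drops it.

```python
import re

SPLIT_RE = re.compile(r'(?<!\d)\.(?!\d|$)')
EXTRACT_RE = re.compile(r'(?:\d+(?:\.\d+)*\.?|[^.])+')


def split_sentences(text):
    # Split on periods that are not next to a digit and not at the end.
    return [part.strip() for part in SPLIT_RE.split(text)]


def extract_sentences(text):
    # Collect runs of "number with dots" tokens and non-period characters.
    return [part.strip() for part in EXTRACT_RE.findall(text)]


for text in ('10.1 This is a sentence. Another sentence.',
             '10.1. This is a sentence. Another sentence.'):
    print(split_sentences(text))
    print(extract_sentences(text))
# ['10.1 This is a sentence', 'Another sentence.']
# ['10.1 This is a sentence', 'Another sentence']
# ['10.1. This is a sentence', 'Another sentence.']
# ['10.1. This is a sentence', 'Another sentence']
```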
All sentences (except the very last one) end with a period followed by space, so split on that. Worrying about the clause number is backwards. You could potentially find all kinds of situations that you DON'T want, but it is generally much easier to describe the situation that you DO want. In this case '. ' is that situation.
import re
doc = '10.1 This is a sentence. Another sentence.'
def sentences(doc):
    # split all sentences
    s = re.split(r'\.\s+', doc)
    # remove empty index or remove period from absolute last index, if present
    if s[-1] == '':
        s = s[0:-1]
    elif s[-1].endswith('.'):
        s[-1] = s[-1][:-1]
    # return sentences
    return s
print(sentences(doc))
The way I structured my regex it should also eliminate arbitrary whitespace between paragraphs.
You have multiple issues:
You're not using re.split(), you're using str.split().
You haven't escaped the ., use \. instead.
You're not using lookahead and lookbehinds so your 3 characters are gone.
Fixed code:
>>> import re
>>> s = '10.1 This is a sentence. Another sentence.'
>>> re.split(r"(?<=\D\.)(?=\D)", s)
['10.1 This is a sentence.', ' Another sentence.']
Basically, (?<=\D\.) finds a position right after a . that is itself preceded by a non-digit character. (?=\D) then makes sure there's a non-digit after the current position. When both conditions hold, the string is split at that position.
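A small sketch of the same zero-width split applied to both inputs from the question. The function name is hypothetical and the strip() call is an addition. One assumption worth stating: re.split only splits on empty matches in Python 3.7 and later, so this will not behave the same on older versions.

```python
import re


def split_after_period(text):
    # Zero-width split: the pattern matches a position, not characters,
    # so nothing is removed from the text (needs Python 3.7+).
    return [part.strip() for part in re.split(r'(?<=\D\.)(?=\D)', text)]


print(split_after_period('10.1 This is a sentence. Another sentence.'))
# ['10.1 This is a sentence.', 'Another sentence.']
print(split_after_period('10.1. This is a sentence. Another sentence.'))
# ['10.1. This is a sentence.', 'Another sentence.']
```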
I want to apply a regex to a Latin text, and I followed the solution in this question: How to account for accent characters for regex in Python?, where they suggest adding a # character before the regex.
def clean_str(string):
    string = re.sub(r"#(#[a-zA-Z_0-9]+)", " ", string, re.UNICODE)
    string = re.sub(r'#([^a-zA-Z0-9#])', r' \1 ', string, re.UNICODE)
    string = re.sub(r'#([^a-zA-Z0-9#])', r' ', string, re.UNICODE)
    string = re.sub(r'(\s{2,})', ' ', string, re.UNICODE)
    return string.lower().strip()
My problem is that the regexes no longer choke on the Latin characters, but none of the substitutions is applied to the text.
Example:
If I have a text like "#aaa bbb các. ddd",
it should become "bbb các . ddd", with a space before the dot and with the tag "#aaa" deleted.
But it returns the input text unchanged: "#aaa bbb các. ddd"
Did I miss something?
You have several issues in the current code:
To match any Unicode word char, use \w (rather than [A-Za-z0-9_]) with a Unicode flag
When using a re.U with re.sub, remember to either use the count argument (set it to 0 to match all occurrences) before the flag, or just use flags=re.U/ flags=re.UNICODE
To match any non-word char but a whitespace, you may use [^\w\s]
When you want to replace with a whole match, you do not have to wrap the whole pattern with (...), just make sure you use \g<0> backreference in the replacement pattern.
See an updated method to clean the strings:
>>> def clean_str(s):
...     s = re.sub(r'#\w+', ' ', s, flags=re.U)
...     s = re.sub(r'[^\w\s]', r' \g<0>', s, flags=re.U)
...     s = re.sub(r'\s{2,}', ' ', s, flags=re.U)
...     return s.lower().strip()
...
>>> print(clean_str(s))
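For reference, here is the updated method as a self-contained sketch, run on the sample text from the question. The variable name sample is an assumption, since the session above does not show how s was defined.

```python
import re


def clean_str(s):
    # Drop hashtags such as "#aaa" (\w is Unicode-aware with re.U).
    s = re.sub(r'#\w+', ' ', s, flags=re.U)
    # Put a space before every character that is neither a word char nor whitespace.
    s = re.sub(r'[^\w\s]', r' \g<0>', s, flags=re.U)
    # Collapse runs of whitespace into a single space.
    s = re.sub(r'\s{2,}', ' ', s, flags=re.U)
    return s.lower().strip()


sample = "#aaa bbb các. ddd"
print(clean_str(sample))  # bbb các . ddd
```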
For example I have a string:
my_str = 'my example example string contains example some text'
What I want to do is delete all duplicates of a specific word, but only when they occur consecutively. Result:
my example string contains example some text
I tried the following code:
import re
my_str = re.sub(' example +', ' example ', my_str)
or
my_str = re.sub('\[ example ]+', ' example ', my_str)
But it doesn't work.
I know there are a lot of questions about re, but I still can't implement them to my case correctly.
You need to create a group and quantify it:
import re
my_str = 'my example example string contains example some text'
my_str = re.sub(r'\b(example)(?:\s+\1)+\b', r'\1', my_str)
print(my_str) # => my example string contains example some text
# To build the pattern dynamically, if your word is not static
word = "example"
my_str = re.sub(r'(?<!\w)({})(?:\s+\1)+(?!\w)'.format(re.escape(word)), r'\1', my_str)
I added word boundaries as - judging by the spaces in the original code - whole word matches are expected.
Pattern details:
\b - word boundary (replaced with (?<!\w) - no word char before the current position is allowed - in the dynamic approach since re.escape might also support "words" like .word. and then \b might stop the regex from matching)
(example) - Group 1 (referred to with \1 from the replacement pattern):
the example word
(?:\s+\1)+ - 1 or more occurrences of
\s+ - 1+ whitespaces
\1 - a backreference to the Group 1 value, that is, an example word
\b - word boundary (replaced with (?!\w) - no word char after the current position is allowed).
Remember that in Python 2.x, you need to use re.U if you need to make \b word boundary Unicode-aware.
Regex: \b(\w+)(?:\s+\1)+\b or \b(example)(?:\s+\1)+\b Substitution: \1
Details:
\b Assert position at a word boundary
\w Matches any word character (equal to [a-zA-Z0-9_] for ASCII text; in Python 3 it also matches Unicode letters and digits by default)
\s Matches any whitespace character
+ Matches between one and unlimited times
\1 Group 1.
Python code:
text = 'my example example string contains example some text'
text = re.sub(r'\b(\w+)(?:\s+\1)+\b', r'\1', text)
Output:
my example string contains example some text
You could also do this in pure Python (without a regex), by creating a list of words and then generating a new string - applying your rules.
>>> words = my_str.split()
>>> ' '.join(w for i, w in enumerate(words) if w != words[i-1] or i == 0)
'my example string contains example some text'
Why not use the .replace function:
my_str = 'my example example string contains example some text'
print(my_str.replace("example example", "example"))
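One caveat with a plain .replace call is that it makes a single left-to-right pass, so a word repeated three times in a row is only reduced to two. The sketch below contrasts it with the dynamic regex from the earlier answer; collapse_repeats and the triple sample string are hypothetical names introduced here.

```python
import re


def collapse_repeats(text, word):
    # Same dynamic pattern as in the regex answer: the word, followed by one
    # or more whitespace-separated repetitions of itself, as a whole word.
    pattern = r'(?<!\w)({})(?:\s+\1)+(?!\w)'.format(re.escape(word))
    return re.sub(pattern, r'\1', text)


triple = 'my example example example string'
print(triple.replace('example example', 'example'))  # my example example string
print(collapse_repeats(triple, 'example'))           # my example string
```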
For example, if I have a string:
"I really..like something like....that"
I want to get only:
"I something"
Any suggestion?
If you want to do it with a regex, you can use the pattern below (for example with re.sub) to remove them:
r"[^\.\s]+\.{2,}[^\.\s]+"
Regex explanation:
[^\.\s]+ one or more characters other than '.' and whitespace
\.{2,} two or more '.'
[^\.\s]+ one or more characters other than '.' and whitespace
Or use this regex, where the leading \s+ also removes the whitespace before each of those combinations:
r"\s+[^\.\s]+\.{2,}[^\.\s]+"
If you want to use a regex explicitly you could use the following.
import re
string = "I really..like something like....that"
with_dots = re.findall(r'\w+[.]+\w+', string)
split = string.split()
without_dots = [word for word in split if word not in with_dots]
The solution provided by rawing also works in this case.
' '.join(word for word in text.split() if '..' not in word)
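To make the difference between these options concrete, here is a sketch that applies both removal patterns from the first answer and the split-and-join one-liner to the question's string. The variable names are chosen for illustration only.

```python
import re

text = "I really..like something like....that"

# Pattern without leading whitespace: the spaces around removed tokens stay behind.
without_space = re.sub(r"[^\.\s]+\.{2,}[^\.\s]+", "", text)
# Pattern with leading \s+: the whitespace before each removed token goes too.
with_space = re.sub(r"\s+[^\.\s]+\.{2,}[^\.\s]+", "", text)
# No regex: keep only whitespace-separated tokens that do not contain "..".
joined = ' '.join(word for word in text.split() if '..' not in word)

print(repr(without_space))  # 'I  something '
print(repr(with_space))     # 'I something'
print(repr(joined))         # 'I something'
```

Note that the \s+ variant would not remove a dotted token at the very start of the string, since there is no whitespace before it.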
You may very well use boundaries in combination with lookarounds:
\b(?<!\.)(\w+)\b(?!\.)
Broken apart, this says:
\b        # a word boundary
(?<!\.)   # followed by a negative lookbehind making sure there's no '.' behind
\w+       # 1+ word characters
\b        # another word boundary
(?!\.)    # a negative lookahead making sure there's no '.' ahead
As a whole Python snippet:
import re
string = "I really..like something like....that"
rx = re.compile(r'\b(?<!\.)(\w+)\b(?!\.)')
print(rx.findall(string))
# ['I', 'something']
I'm working with a search&replace programming assignment. I'm a student and I'm finding the regex documentation a bit overwhelming (e.g. https://docs.python.org/2/library/re.html), so I'm hoping someone here could explain to me how to accomplish what I'm looking for.
I've used regex to get a list of strings from my document. They all look like this:
%#import fileName (regexStatement)
An actual example:
%#import script_example.py ( *out =(.|\n)*?return out)
Now, I'm wondering how I can split these up so I get the fileName and regexStatement as separate strings. I'd assume using a regex or string split function, but I'm not sure how to make it work on all kinds of variations of %#import fileName (regexStatement). Splitting on parentheses could hit the middle of the regex statement, or fail if a parenthesis is part of the fileName, for instance. The assignment doesn't specify whether it should only be able to import from Python files, so I don't believe I can use ".py (" as a splitting point before the regex statement either.
I'm thinking something like a regex "%#import " to hit the gap after import, "\..* " to hit the gap after fileName. But I'm not sure how to get rid of the parentheses that enclose the regex statement, or how to use all of it to actually split the string correctly so I have one variable storing fileName and one storing regexStatement for each entry in my list.
Thanks a lot for your attention!
If the filename can't contain spaces, just split your string on spaces with maxsplit 2:
>>> line.split(' ', 2)
['%#import', 'script_example.py', '( *out =(.|\n)*?return out)']
The maxsplit 2 makes it split only the first two spaces, and leave intact any spaces within the regex. Now you have the filename as the second element and the regex as the third. It's not clear from your statement whether the parentheses are part of the regex or not (i.e., as a capturing group). If not, you can easily remove them by trimming the first and last characters from that part.
If you assign the values like this:
filename, regex = line.split(' ', 2)[1:]
then you can strip the parentheses with:
regex = regex[1:-1]
That should do it nicely
^%#import (\S+) \((.*)\)
or, if the filename may have spaces:
^%#import ((?:(?! \().)+) \((.*)\)
Both expressions contain two groups, one for the file name and one for the contents of the parentheses. Run in multiline mode on the entire file or in normal mode if you work with single lines anyway.
This: ((?:(?! \().)+) breaks down as:
(            # group start
  (?:        # non-capturing group
    (?!      # negative look-ahead: a position NOT followed by
       \(    # " ("
    )        # end look-ahead
    .        # match any char (this is part of the filename)
  )+         # end non-capturing group, repeat
)            # end group
The other bits of the expression should be self-explanatory.
import re
line = "%#import script_example.py ( *out =(.|\\n)*?return out)"
pattern = r'^%#import (\S+) \((.*)\)'
match = re.match(pattern, line)
if match:
    print("match.group(1) '" + match.group(1) + "'")
    print("match.group(2) '" + match.group(2) + "'")
else:
    print("No match.")
prints
match.group(1) 'script_example.py'
match.group(2) ' *out =(.|\n)*?return out'
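The variant that allows spaces in the file name can be wrapped in a small helper, sketched below. parse_import and the second sample line are hypothetical. The sketch assumes the file name itself never contains the sequence " (", since that is what the look-ahead uses to find where the name ends.

```python
import re

# Group 1: file name (may contain spaces); group 2: contents of the outer parentheses.
IMPORT_RE = re.compile(r'^%#import ((?:(?! \().)+) \((.*)\)')


def parse_import(line):
    """Return (file_name, regex_statement), or None if the line does not match."""
    match = IMPORT_RE.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_import(r'%#import script_example.py ( *out =(.|\n)*?return out)'))
# ('script_example.py', ' *out =(.|\\n)*?return out')
print(parse_import('%#import my script.py (foo(bar))'))
# ('my script.py', 'foo(bar)')
print(parse_import('not an import line'))
# None
```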
For matching something like %#import script_example.py ( *out =(.|\n)*?return out), I suggest:
r'%#impor[\w\W ]+'
Note that:
\w matches any word character [a-zA-Z0-9_]
\W matches any non-word character [^a-zA-Z0-9_]
So you can use re.findall() to find all the matches:
import re
re.findall(r'%#impor[\w\W ]+', your_string)
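One thing to be aware of: [\w\W ]+ matches every character, newlines included, so on a multi-line document the pattern above returns a single match running from the first %#import to the end of the text. The sketch below shows this on a made-up document and, as one possible alternative (not part of the answer above), a per-line pattern using re.MULTILINE.

```python
import re

# Hypothetical document with two import lines.
doc = (
    "%#import a.py (x = 1)\n"
    "some other text\n"
    "%#import b.py (y = 2)\n"
)

# [\w\W ]+ is greedy and crosses line breaks: one match covering everything.
greedy = re.findall(r'%#impor[\w\W ]+', doc)
# Anchoring to line start/end with re.MULTILINE yields one match per import line.
per_line = re.findall(r'^%#import .+$', doc, flags=re.MULTILINE)

print(len(greedy))  # 1
print(per_line)     # ['%#import a.py (x = 1)', '%#import b.py (y = 2)']
```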