Python replace - Treat multiple instances as one


I am trying to replace all the carriage returns with a comma for incoming lines. It works fine, except when multiple carriage returns occur in a row. I see nothing in the documentation for Python's str.replace() about treating multiple consecutive instances of the same item as though they are one. Is this possible?
For instance, this line:
This is\nA sentence\nwith multiple\nbreaklines\n\npython.
Should end up like this:
This is, A sentence, with multiple, breaklines, python.
But it actually turns into this:
This is, A sentence, with multiple, breaklines, , python.

You can use regex.
In [47]: import re
In [48]: mystr = "This is\nA sentence\nwith multiple\nbreaklines\n\npython."
In [49]: re.sub(r'\n+', ', ', mystr)
Out[49]: 'This is, A sentence, with multiple, breaklines, python.'
The pattern \n+ matches one or more consecutive \n characters, and re.sub replaces each such run with a single ', '.
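A small sketch of the difference, using the sentence from the question. The helper name join_lines is made up for this illustration, and also collapsing \r is an assumption about what "carriage returns" might mean for real incoming lines.

```python
import re

def join_lines(text, sep=", "):
    # Hypothetical helper: collapse every run of \r and/or \n into one separator.
    return re.sub(r"[\r\n]+", sep, text)

mystr = "This is\nA sentence\nwith multiple\nbreaklines\n\npython."

# str.replace handles each newline on its own, so "\n\n" becomes ", , ".
naive = mystr.replace("\n", ", ")

# The regex version treats the whole run as a single match.
collapsed = join_lines(mystr)

print(naive)
print(collapsed)
```

If Windows-style "\r\n" line endings never occur in the input, the character class can be reduced to a plain \n+ as in the answer above.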

Related

removing block comments but keeping linebreaks

I'm removing block comments from python scripts with this regex
re.sub("'''.*?'''", "", string, flags = re.DOTALL)
It removes the complete block comment including line breaks (\n). However I would like to keep the line breaks for further processing of the files. Any way to do this with a regex?

What you're trying to do is find each line inside the triple-quoted block and replace it with a newline character, rather than removing the whole block. re.sub can take a function (or lambda) as its second argument, repl, and that is what you should use here. Here is the description and an example from Python's documentation:
If repl is a function, it is called for every non-overlapping
occurrence of pattern. The function takes a single match object
argument, and returns the replacement string.
>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
...
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
'Baked Beans & Spam'
Using that concept, you find the triple-quoted blocks and everything within them, then pass each match to a function that runs its own substitution, this time keeping the newline characters. That would be something like replacing "^.*$" (with re.MULTILINE) with an empty string so that only the "\n" characters remain, while making sure you either don't remove the triple quotes or don't include them in the original regex group. The function then returns that value, and this happens for each match independently.
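A minimal sketch of that idea, in a slightly simpler variant: instead of running a second regex inside the function, it counts the line breaks in the matched block and returns that many newlines. The function name blank_block and the sample source string are made up for this illustration. Like the regex in the question, it only handles ''' blocks, not """ ones.

```python
import re

def blank_block(match):
    # Keep one "\n" for every line break inside the matched block,
    # so the line numbers of the remaining code do not shift.
    return "\n" * match.group(0).count("\n")

source = "a = 1\n'''first\nsecond\n'''\nb = 2\n"

stripped = re.sub(r"'''.*?'''", blank_block, source, flags=re.DOTALL)

print(repr(stripped))
```

Here the block spans two line breaks, so it is replaced by two newlines and "b = 2" stays on the fifth line.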

Matching regex pattern where there is \n\r between starting and ending pattern

The red underscore is the desired string I want to match
I would like to match everything (including \n) between the two strings provided in the example
However, in the first example, where there is a newline, I can't get anything to match
In the second example, the regex expression works. It matches the string highlighted in Green because it resides on a single line
Not sure if there is a notation I need to include for \n\r to be part of the pattern to match

Use this
output = re.search('This(.*?)\n\n(.*?)match', text)
>>> output.group(1)
'is a multiline expression'
>>> output.group(2)
'I would like to '

Try this one as well:
output = re.search(r"This ([\s\S]+?) match", text).group(1).replace('\n', '')
That will find the entire thing as one group, then remove the newlines.
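Another way to let a pattern run across line breaks is the re.DOTALL flag, which makes . match newlines as well. The sketch below assumes a sample text reconstructed from the outputs shown above; the original text was only given as an image.

```python
import re

# Assumed sample; the original text is not available in this thread.
text = "This is a multiline expression\n\nI would like to match"

# Without DOTALL, "." stops at the first newline, so nothing is found.
no_flag = re.search(r"This (.*?) match", text)

# With DOTALL, "." also matches "\n", so the group spans both lines.
m = re.search(r"This (.*?) match", text, flags=re.DOTALL)
inner = m.group(1)

# Collapse the line breaks (and any repeated spaces) into single spaces.
flat = " ".join(inner.split())

print(repr(inner))
print(flat)
```

Joining with a space rather than removing the newlines outright keeps the words on either side of a line break from running together.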

python regex - characters between certain characters

Edit: I should add that the strings in the test are supposed to be able to contain every possible character (e.g. * + $ § € /), so I thought a regexp would work best.
I am using a regex to find all characters between certain delimiters ([" and "]). My example goes like this:
test = """["this is a text and its supposed to contain every possible char."],
["another one after a newline."],
["and another one even with
newlines
in it."]"""
The supposed output should be like this:
['this is a text and its supposed to contain every possible char.', 'another one after a newline.', 'and another one even with newlines in it.']
My code including the regex looks like this:
import re
my_list = re.findall(r'(?<=\[").*(?="\])*[^ ,\n]', test)
print (my_list)
And my outcome is the following:
['this is a text and its supposed to contain every possible char."]', 'another one after a newline."]', 'and another one even with']
So there are two problems:
1) It is not removing "] at the end of a text, as I wanted it to do with (?="\])
2) It is not capturing the third text in brackets, I guess because of the newlines. So far I wasn't able to capture those; when I try .*\n it gives me back an empty string.
I am thankful for any help or hints with this issue. Thank you in advance.
By the way, I am using Python 3.6 on Anaconda/Spyder and the newest regex (2018).
EDIT 2: One Alteration to the test:
test = """[
"this is a text and its supposed to contain every possible char."
],
[
"another one after a newline."
],
[
"and another one even with
newlines
in it."
]"""
Once again I have trouble removing the newlines from it. I guessed the whitespace could be matched with \s, so I thought a regexp like this could solve it:
my_list = re.findall(r'(?<=\[\S\s\")[\w\W]*(?=\"\S\s\])', test)
print (my_list)
But that returns only an empty list. How to get the supposed output above from that input?

In case you would also accept a non-regex solution, you can try:
import ast
result = []
# ast.literal_eval parses the list literals without executing arbitrary code, unlike eval
for l in ast.literal_eval(' '.join(test.split())):
    result.extend(l)
print(result)
# ['this is a text and its supposed to contain every possible char.', 'another one after a newline.', 'and another one even with newlines in it.']

You can try this:
(?<=\[\")[\w\s.]+(?=\"\])
What you missed is that . in your regex does not match a newline, so .* stops at the end of each line.
P.S. This pattern does not match special characters. If you want those too, use this one instead:
(?<=\[\")[\w\W]+?(?=\"\])
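For reference, a sketch that covers both test inputs from the question and also turns the embedded newlines into spaces, as the desired output shows. It uses a capturing group with re.DOTALL instead of lookarounds, which is a swapped-in technique rather than the pattern above; the function name extract_texts is made up for this illustration.

```python
import re

def extract_texts(raw):
    # Capture whatever sits between [" and "], allowing whitespace
    # (including newlines) between the bracket and the quote.
    found = re.findall(r'\[\s*"(.*?)"\s*\]', raw, flags=re.DOTALL)
    # Collapse internal newlines and runs of spaces into single spaces.
    return [" ".join(item.split()) for item in found]

test1 = '''["this is a text and its supposed to contain every possible char."],
["another one after a newline."],
["and another one even with
newlines
in it."]'''

test2 = '''[
"this is a text and its supposed to contain every possible char."
],
[
"and another one even with
newlines
in it."
]'''

print(extract_texts(test1))
print(extract_texts(test2))
```

One caveat: because the match is lazy, a text that itself contains the sequence "] would be cut short at that point.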

So here's what I came up with:
test = """["this is a text and its supposed to contain every possible char."],
["another one after a newline."],
["and another one even with
newlines
in it."]"""
# Note: splitting on ',' assumes the texts themselves contain no commas.
for i in test.replace('\n', ' ').split(','):
    print(i.lstrip(' ["').rstrip('"]'))
Which results in the following being printed to the screen
this is a text and its supposed to contain every possible char.
another one after a newline.
and another one even with newlines in it.
If you want a list of those exact strings, we could modify it to:
newList = []
for i in test.replace('\n', ' ').split(','):
    newList.append(i.lstrip(' ["').rstrip('"]'))

Regex only finds results once

I'm trying to find any text between a '>' character and a new line, so I came up with this regex:
result = re.search(">(.*)\n", text).group(1)
It works perfectly with only one result, such as:
>test1
(something else here)
Where the result, as intended, is
test1
But whenever there's more than one result, it only shows the first one, like in:
>test1
(something else here)
>test2
(something else here)
Which should give something like
test1\ntest2
But instead just shows
test1
What am I missing? Thank you very much in advance.

re.search only returns the first match, as documented:
Scan through string looking for the first location where the regular
expression pattern produces a match, and return a corresponding
MatchObject instance.
To find all the matches, use findall.
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found.
Here's an example from the shell:
>>> import re
>>> re.findall(">(.*)\n", ">test1\nxxx>test2\nxxx")
['test1', 'test2']
Edit: I just read your question again and realised that you want "test1\ntest2" as output. Well, just join the list with \n:
>>> "\n".join(re.findall(">(.*)\n", ">test1\nxxx>test2\nxxx"))
'test1\ntest2'
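One thing to watch for: the pattern ">(.*)\n" requires a newline after each result, so a final line without a trailing newline is missed. A possible alternative, sketched here with a made-up sample text, is to anchor on line boundaries with re.MULTILINE. Note that the ^ anchor only finds a '>' at the start of a line, which is an assumption about the input.

```python
import re

# Sample input whose last line has no trailing newline.
text = ">test1\n(something else here)\n>test2"

# The original pattern needs a "\n" after each result, so ">test2" is missed.
with_newline = re.findall(r">(.*)\n", text)

# With MULTILINE, ^ and $ match at the start and end of every line.
per_line = re.findall(r"^>(.*)$", text, flags=re.MULTILINE)

joined = "\n".join(per_line)

print(with_newline)
print(per_line)
```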

You could try:
y = re.findall(r'((?:(?:.+?)(?:(?=[\n\r][^\n\r])\n|))+)', text)
Which returns ['t1\nt2\nt3'] for 't1\nt2\nt3\n'. If you simply want the string, you can get it by:
s = y[0]
Although it seems much larger than your initial code, it will give you your desired string.
Explanation -
((?:(?:.+?)(?:(?=[\n\r][^\n\r])\n|))+) is the regex as well as the match.
(?:(?:.+?)(?:(?=[\n\r][^\n\r])\n|)) is the non-capturing group that matches any text followed by a newline, and is repeatedly found one-or-more times by the + after it.
(?:.+?) matches the actual words which are then followed by a newline.
(?:(?=[\n\r][^\n\r])\n|) is a non-capturing conditional group which tells the regex that if the matched text is followed by a newline, then it should match it, provided that the newline is not followed by another newline or carriage return
(?=[\n\r][^\n\r]) is a positive look-ahead which ascertains that the text found is followed by a newline or carriage return, and then some non-newline characters, which combined with the \n| after it, tells the regex to match a newline.
Granted, after typing this big mess out, the regex is pretty long and complicated, so you would be better off implementing the answers you understand, rather than this answer, which you may not. However, this seems to be the only one-line answer to get the exact output you desire.

Replacing only a specific group within a matched expression

I'm parsing text in which I would like to make changes, but only to specific lines.
I have a regular expression pattern that catches the entire line if it's a line of interest, and within the expression I have a remembered group of the thing I would actually like to change.
I would like to be able to change only the specific group within a matched expression, and not replace the entire expression (that would replace the entire line).
For example:
I have a textual file with:
This is a completely silly example.
something something "this should be replaced" bla.
more uninteresting stuff
And I have the regex:
pattern = '.*("[^"]*").*'
Then I catch the second line, but I would like to replace only the "this should be replaced" matched group within the line, not the entire line (so using re.sub(pattern, replacement, string) won't do the job).
Thanks in advance!

What's wrong with this simpler pattern?
r'"[^"]+"'
The .* before and after the quoted part can also match a zero-length string, so you don't need them at all.
re.sub(r'"[^"]+"', 'DEF', 'abc"def"ghi')
# returns 'abcDEFghi'
and your example text will result in:
'This is a completely silly example.\nsomething something DEF bla.\nmore uninteresting stuff'

eumiro's answer is the best in this particular case, but for the sake of completeness: if you really need to do more complicated processing of the text before, inside, and after the quotes, you can simply use multiple groups, like:
'(.*)("[^"]*")(.*)'
(the first group provides the text before, the third the text after; do what you like with them)
Also, you may prefer to forbid " in the pre-part:
'([^"]*)("[^"]*")(.*)'
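A short sketch of how the three groups might be put back together around a new middle part. The function name replace_quoted and the replacement text are made up for this illustration.

```python
import re

def replace_quoted(line, new_text):
    # Group 1: text before the quotes, group 2: the quoted part,
    # group 3: the rest of the line.
    pattern = r'([^"]*)("[^"]*")(.*)'
    # Passing a function as repl avoids backslash-escape handling in new_text.
    return re.sub(pattern,
                  lambda m: m.group(1) + new_text + m.group(3),
                  line,
                  count=1)

line = 'something something "this should be replaced" bla.'
result = replace_quoted(line, '"new text"')

print(result)
```

A line without any quoted part does not match the pattern and comes back unchanged.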

re.match and re.search return a "match object" (see the Python documentation). Supposing you want to replace group 3 in your RE, pull out its start/end indices and replace the substring directly:
mobj = re.match(pattern, line)
start = mobj.start(3)
end = mobj.end(3)
line = line[:start] + replacement + line[end:]
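The same slicing approach, wrapped into a small function and applied to the pattern from the question, where the group of interest is group 1 rather than 3. The name replace_group is made up for this illustration, and returning the line unchanged when nothing matches is an assumption about the desired behaviour.

```python
import re

def replace_group(pattern, line, replacement, group=1):
    mobj = re.match(pattern, line)
    # No match, or the group did not take part in the match: leave the line alone.
    if mobj is None or mobj.start(group) == -1:
        return line
    start, end = mobj.span(group)
    return line[:start] + replacement + line[end:]

pattern = '.*("[^"]*").*'
lines = [
    "This is a completely silly example.",
    'something something "this should be replaced" bla.',
    "more uninteresting stuff",
]

changed = [replace_group(pattern, l, '"new text"') for l in lines]

print(changed)
```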
