Splitting a string by using two substrings in Python

I am searching a string using re, which works in almost all cases except when the string contains a newline character (\n).
For instance, if the string is defined as:
testStr = " Test to see\n\nThis one print\n "
then searching with re.search('Test(.*)print', testStr) does not return anything.
What is the problem here? How can I fix it?

The re module has re.DOTALL to indicate "." should also match newlines. Normally "." matches anything except a newline.
re.search('Test(.*)print', testStr, re.DOTALL)
Alternatively:
re.search('Test((?:.|\n)*)print', testStr)
# (?:…) is a non-capturing group, used here only so that * applies to the whole alternation
Example:
>>> testStr = " Test to see\n\nThis one print\n "
>>> m = re.search('Test(.*)print', testStr, re.DOTALL)
>>> print m
<_sre.SRE_Match object at 0x1706300>
>>> m.group(1)
' to see\n\nThis one '
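For readers on Python 3, the same behaviour can be sketched as below. This is only an illustration: the variable names are made up for the example, and the inline (?s) form is shown as an alternative way of switching on DOTALL.

```python
import re

test_str = " Test to see\n\nThis one print\n "

# Without re.DOTALL, "." stops at the first newline, so nothing matches.
no_flag = re.search(r"Test(.*)print", test_str)

# With re.DOTALL, "." also matches "\n" and the group spans the line breaks.
with_flag = re.search(r"Test(.*)print", test_str, re.DOTALL)

# The inline flag (?s) at the start of the pattern has the same effect.
inline = re.search(r"(?s)Test(.*)print", test_str)

print(no_flag)                   # None
print(repr(with_flag.group(1)))  # ' to see\n\nThis one '
print(inline.group(1) == with_flag.group(1))  # True
```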

Related

Python re regex sub letters not surrounded in quotes and not if they match specific word including regex group / match

I need to sub letters not surrounded in quotes and not if they match the word TODAY with a particular string where a part of it includes the match group e.g.
import re
import string
s = 'AB+B+" HELLO"+TODAY()/C* 100'
x = re.sub(r'\"[^"]*\"|\bTODAY\b|([A-Z]+)', r'a2num("\g<0>")', s)
print (x)
expected output:
'a2num("AB")+a2num("B")+" HELLO"+TODAY()/a2num("C")* 100'
actual output:
'a2num("AB")+a2num("B")+a2num("" HELLO"")+a2num("TODAY")()/a2num("C")* 100'
I am nearly there, but it is not obeying the quote rule or the TODAY rule. I know the string doesn't make any sense, but it's just a harsh test of the regex.

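Your regex approach is correct, but you need a replacement function (here a lambda) in re.sub so that only matches captured in group 1 are wrapped, while quoted strings and TODAY are returned unchanged: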
>>> s = 'AB+B+" HELLO"+TODAY()/C* 100'
>>> rs = re.sub(r'"[^"]*"|\bTODAY\b|\b([A-Z]+)\b',
... lambda m: 'a2num("' + m.group(1) + '")' if m.group(1) else m.group(), s)
>>> print (rs)
a2num("AB")+a2num("B")+" HELLO"+TODAY()/a2num("C")* 100
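The same idea written as a named function for Python 3 might look like the sketch below. The name wrap_names is hypothetical and chosen only for this example; a2num is simply the text from the question and is never called.

```python
import re

# Alternatives are tried left to right: quoted text and TODAY are matched
# first and never reach the capturing group.
PATTERN = re.compile(r'"[^"]*"|\bTODAY\b|\b([A-Z]+)\b')

def wrap_names(expr):
    """Wrap bare upper-case names in a2num("..."), leaving quoted text and TODAY alone."""
    def repl(m):
        # Group 1 is set only when the third alternative matched.
        if m.group(1):
            return 'a2num("' + m.group(1) + '")'
        # Otherwise return the match unchanged.
        return m.group()
    return PATTERN.sub(repl, expr)

print(wrap_names('AB+B+" HELLO"+TODAY()/C* 100'))
# a2num("AB")+a2num("B")+" HELLO"+TODAY()/a2num("C")* 100
```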

python regex issue with underscore

I am trying to do a string search with regular expressions, where I need to print the [a-z,A-Z,_] part only if it ends with a space " ". I am having trouble, though: if there is an underscore at the end, it doesn't wait for the space and executes the command anyway.
if re.search(r".*\s\D+\s", string):
    print string
If I set
string = "abc shot0000 "
it works fine; I need it to execute only when the string ends with a space \s.
But if I set
string = "abc shot0000 _"
then it doesn't wait for the space \s and executes the command.

You're using search, and this function, as the name says, searches your string for any occurrence of the pattern. Both of your strings contain one.
You should add a $ to your regular expression to anchor it at the end of the string:
if re.search(r".*\s\D+\s$", string):
    print string

You need to anchor the RE at the end of the string with $:
if re.search(r".*\s\D+\s$", string):
    print string

Use a $:
>>> strs = "abc shot0000 "
>>> re.search(r"\s\w+\s$", strs) #use \w: it'll handle A-Za-z_
<_sre.SRE_Match object at 0xa530100>
>>> strs = "abc shot0000 _"
>>> re.search(r"\s\w+\s$", strs)
#None
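A small Python 3 sketch of the difference the anchor makes, reusing the \s\w+\s pattern from the answer above. The helper name matches_at_end is an assumption made for the example.

```python
import re

# Same pattern with and without the end-of-string anchor.
unanchored = re.compile(r"\s\w+\s")
anchored = re.compile(r"\s\w+\s$")

def matches_at_end(text):
    """Return True when the anchored pattern finds a match in text."""
    return anchored.search(text) is not None

for text in ("abc shot0000 ", "abc shot0000 _"):
    # search() alone succeeds on both strings; the anchored form only on the first.
    print(repr(text), unanchored.search(text) is not None, matches_at_end(text))
```

Without the $, search is satisfied by " shot0000 " in the middle of the second string, which is why the trailing underscore did not prevent the match.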

Python regular expression for a sentence does not want to match

Can anyone explain why this re (in Python):
pattern = re.compile(r"""
^
([[a-zA-Zàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]+\s{1}]+)
([a-zA-Zàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]+) # Last word.
\.{1}
$
""", re.VERBOSE + re.UNICODE)
if re.match(pattern, line):
does not match "A sentence."
I would actually like to return the entire sentence (including the period) as a returned group (), but have been failing miserably.

I think that maybe you meant to do this:
(([a-zA-Zàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]+\s{1})+)
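(note the parentheses where your pattern had the outer pair of square brackets).
I don't think the nested square brackets you had do what you think they do: square brackets define a character class, not a group, so they cannot be nested to repeat a sub-pattern.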

This regex works:
pattern = re.compile(r"""
^
([a-zA-Zàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]+\s{1})+
([a-zA-Zàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]+) # Last word.
\.{1}
$
""", re.VERBOSE + re.UNICODE)
line = "A sentence."
match = re.match(pattern, line)
>>> print "'%s'" % match.group(0)
'A sentence.'
>>> print "'%s'" % match.group(1)
'A '
>>> print "'%s'" % match.group(2)
'sentence'
To return the entire match (line in this case), use match.group(0).
Because the first group is repeated (once for each word except the last one), it retains only its final repetition, so match.group(1) gives you just the next-to-last word.
By the way, the {1} notation is not necessary here: matching exactly once is the default behavior, so it can be removed.
The extra set of square brackets definitely weren't helping you :)
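To see the "last repetition only" behaviour on a longer sentence, here is a rough Python 3 sketch. It assumes a plain ASCII character class to keep the pattern short; the accented ranges from the question could be added back in the same way.

```python
import re

line = "This is a sentence."

# A repeated capturing group keeps only its final repetition.
m = re.match(r"^([a-zA-Z]+\s)+([a-zA-Z]+)\.$", line)
print(repr(m.group(0)))  # 'This is a sentence.'
print(repr(m.group(1)))  # 'a '
print(repr(m.group(2)))  # 'sentence'

# One option for collecting every word is a second pass with findall.
words = re.findall(r"[a-zA-Z]+", m.group(0))
print(words)  # ['This', 'is', 'a', 'sentence']
```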

It turns out the following actually works and includes all the extended ASCII characters I wanted:
^
([\w+\s{1}]+\w{1}\.{1})
$

Period stops multiline regex substitute in Python?

I have a multi-line string in which I'd like to replace a section, but I don't understand why it's not working. For some reason, a period in the string stops the matching for the regular expression.
My string:
s = """
[some_previous_text]
<start>
one_period .
<end>
[some_text_after]
"""
What I'd like to end up with:
s = """
[some_previous_text]
foo
[some_text_after]
"""
What I initially tried, but it doesn't match anything:
>>> import re
>>> s = "<start>\none_period .\n<end>"
>>> print re.sub("<start>[^.]*<end>", "foo", s)
<start>
one_period .
<end>
However, when I took the period out, it worked fine:
>>> import re
>>> s = "<start>\nno_period\n<end>"
>>> print re.sub("<start>[^.]*<end>", "foo", s)
foo
Also, when I put an <end> tag before the period, it matched the first <end> tag:
>>> import re
>>> s = "<start>\n<end>\none_period .\n<end>"
>>> print re.sub("<start>[^.]*<end>", "foo", s)
foo
one_period .
<end>
So what's going on here? Why does the period stop the [^.]* matching?
EDIT:
SOLVED
I mistakenly thought that the caret ^ was for new-line matching. What I needed was the re.DOTALL flag (as indicated by Amber). Here's the expression I'm now using:
>>> import re
>>> s = "<start>\none_period .\n<end>"
>>> print re.sub("<start>.*<end>", "foo", s, flags=re.DOTALL)
foo

Why wouldn't it? [^.] is "the set of all characters that is not a ." and thus doesn't match periods.
Perhaps you instead meant to just put .* (any number of any characters) instead of [^.]*?
For matching across newlines, specify re.DOTALL:
re.sub("<start>.*<end>", "foo", s, flags=re.DOTALL)

That's because [^.]* is a negated character class that matches any character except a period.
You probably want something like <start>.*?<end> together with the re.S modifier, which makes the dot match newline characters as well.
re.sub("<start>.*?<end>", "foo", s, flags=re.S)
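The choice between .* and .*? matters as soon as the text contains more than one <start>…<end> block. A quick Python 3 sketch, using a made-up sample string:

```python
import re

# Hypothetical sample with two blocks and a line between them.
text = "<start>\na .\n<end>\nkeep me\n<start>\nb .\n<end>\n"

# Greedy: runs from the first <start> to the LAST <end>.
greedy = re.sub(r"<start>.*<end>", "foo", text, flags=re.DOTALL)

# Non-greedy: each block is replaced separately.
lazy = re.sub(r"<start>.*?<end>", "foo", text, flags=re.DOTALL)

print(repr(greedy))  # 'foo\n'
print(repr(lazy))    # 'foo\nkeep me\nfoo\n'
```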

How do I remove \n found between double quotes from a string?

Good day,
I am totally new to Python and I am trying to do something with a string.
I would like to remove any \n characters found between double quotes (") only, from a given string:
str = "foo,bar,\n\"hihi\",\"hi\nhi\""
The desired output must be:
foo,bar
"hihi", "hihi"
Edit:
The desired output must be similar to that string:
after = "foo,bar,\n\"hihi\",\"hihi\""
Any tips?

A simple stateful filter will do the trick.
in_string = False
input_str = 'foo,bar,\n"hihi","hi\nhi"'
output_str = ''
for ch in input_str:
    if ch == '"': in_string = not in_string
    if ch == '\n' and in_string: continue
    output_str += ch
print output_str
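Wrapped in a function for Python 3, the same filter could look roughly like this. The function name is invented for the example, and, like the loop above, it assumes the quotes are balanced and never escaped.

```python
def strip_newlines_in_quotes(text):
    """Drop newline characters that fall between a pair of double quotes."""
    in_string = False
    kept = []
    for ch in text:
        if ch == '"':
            in_string = not in_string  # toggle on every double quote
        if ch == "\n" and in_string:
            continue  # skip newlines inside a quoted field
        kept.append(ch)
    return "".join(kept)

print(repr(strip_newlines_in_quotes('foo,bar,\n"hihi","hi\nhi"')))
# 'foo,bar,\n"hihi","hihi"'
```

Collecting characters in a list and joining once at the end avoids repeated string concatenation on long inputs.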

This should do:
def removenewlines(s):
    inquotes = False
    result = []
    for chunk in s.split("\""):
        # str.replace returns a new string, so the result must be assigned back
        if inquotes: chunk = chunk.replace("\n", "")
        result.append(chunk)
        inquotes = not inquotes
    return "\"".join(result)

Quick note: Python strings can use '' or "" as delimiters, so it's common practice to use one when the other is inside your string, for readability. Eg: 'foo,bar,\n"hihi","hi\nhi"'. On to the question...
You probably want the python regexp module: re.
In particular, the substitution function is what you want here. There are a bunch of ways to do it, but one quick option is to use a regexp that identifies the "" substrings, then calls a helper function to strip any \n out of them...
import re
def helper(match):
    return match.group().replace("\n","")
input = 'foo,bar,\n"hihi","hi\nhi"'
result = re.sub('(".*?")', helper, input, flags=re.S)

>>> str = "foo,bar,\n\"hihi\",\"hi\nhi\""
>>> re.sub(r'".*?"', lambda x: x.group(0).replace('\n',''), str, flags=re.S)
'foo,bar,\n"hihi","hihi"'
>>>
Short explanation:
re.sub is a substitution engine. It takes a regular expression, a substitution function or expression, a string to work on, and other options.
The regular expression ".*?" catches strings in double quotes that don't in themselves contain other double quotes (it has a small bug, because it wouldn't catch strings which contain escaped double-quotes).
lambda x: ... is an expression which can be used wherever a function can be used.
The substitution engine calls the function with the match object.
x.group(0) is "the whole matched string", which also includes the double quotes.
x.group(0).replace('\n','') is the matched string with every '\n' replaced by ''.
The flag re.S tells re.sub that '\n' is a valid character to catch with a dot.
Personally I find longer functions that say the same thing more tiring and less readable, in the same way that in C I would prefer i++ to i = i + 1. It's all about what one is used to reading.

This regex works (assuming that quotes are correctly balanced):
import re
result = re.sub(r"""(?x) # verbose regex
\n # Match a newline
(?! # only if it is not followed by
(?:
[^"]*" # an even number of quotes
[^"]*" # (and any other non-quote characters)
)* # (yes, zero counts, too)
[^"]*
\Z # until the end of the string.
)""",
"", str)

Something like this
Break the CSV data into columns.
>>> m=re.findall(r'(".*?"|[^"]*?)(,\s*|\Z)',s,re.M|re.S)
>>> m
[('foo', ','), ('bar', ',\n'), ('"hihi"', ','), ('"hi\nhi"', ''), ('', '')]
Replace just the field instances of '\n' with ''.
>>> [ field.replace('\n','') + sep for field,sep in m ]
['foo,', 'bar,\n', '"hihi",', '"hihi"', '']
Reassemble the resulting stuff (if that's really the point.)
>>> "".join(_)
'foo,bar,\n"hihi","hihi"'
