Good day,
I am totally new to Python and I am trying to do something with a string.
I would like to remove any \n characters found between double quotes ( " ) only, from a given string :
str = "foo,bar,\n\"hihi\",\"hi\nhi\""
The desired output must be:
foo,bar
"hihi", "hihi"
Edit:
The desired output must be similar to that string:
after = "foo,bar,\n\"hihi\",\"hihi\""
Any tips?
A simple stateful filter will do the trick.
in_string = False
input_str = 'foo,bar,\n"hihi","hi\nhi"'
output_str = ''
for ch in input_str:
    # Toggle the flag on every double quote.
    if ch == '"':
        in_string = not in_string
    # Drop newlines only while inside a quoted section.
    if ch == '\n' and in_string:
        continue
    output_str += ch
print(output_str)
This should do:
def removenewlines(s):
    inquotes = False
    result = []
    for chunk in s.split("\""):
        if inquotes:
            # str.replace returns a new string, so keep the result
            chunk = chunk.replace("\n", "")
        result.append(chunk)
        inquotes = not inquotes
    return "\"".join(result)
Quick note: Python strings can use '' or "" as delimiters, so it's common practice to use one when the other is inside your string, for readability. Eg: 'foo,bar,\n"hihi","hi\nhi"'. On to the question...
You probably want the python regexp module: re.
In particular, the substitution function is what you want here. There are a bunch of ways to do it, but one quick option is to use a regexp that identifies the "" substrings, then calls a helper function to strip any \n out of them...
import re
def helper(match):
    return match.group().replace("\n", "")
# named input_str to avoid shadowing the built-in input()
input_str = 'foo,bar,\n"hihi","hi\nhi"'
result = re.sub('(".*?")', helper, input_str, flags=re.S)
>>> str = "foo,bar,\n\"hihi\",\"hi\nhi\""
>>> re.sub(r'".*?"', lambda x: x.group(0).replace('\n',''), str, flags=re.S)
'foo,bar,\n"hihi","hihi"'
>>>
Short explanation:
re.sub is a substitution engine. It takes a regular expression, a substitution function or expression, a string to work on, and other options.
The regular expression ".*?" catches strings in double quotes that don't in themselves contain other double quotes (it has a small bug, because it wouldn't catch strings which contain escaped double-quotes).
lambda x: ... is an expression which can be used wherever a function can be used.
The substitution engine calls the function with the match object.
x.group(0) is "the whole matched string", which also includes the double quotes.
x.group(0).replace('\n','') is the matched string with every '\n' replaced by ''.
The flag re.S tells re.sub that '\n' is a valid character to catch with a dot.
Personally I find longer functions that say the same thing more tiring and less readable, in the same way that in C I would prefer i++ to i = i + 1. It's all about what one is used to reading.
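The "small bug" with escaped double quotes mentioned above can be worked around with a slightly stricter pattern. The following is a minimal sketch, not part of the original answer: the name strip_newlines_in_quotes and the sample strings are made up for illustration, and it assumes that a backslash is the only escape mechanism used inside quoted sections.

```python
import re

# A quoted section: an opening quote, then any mix of escaped characters
# (backslash + anything) or characters that are neither a quote nor a
# backslash, then the closing quote.
QUOTED = re.compile(r'"(?:\\.|[^"\\])*"', re.S)

def strip_newlines_in_quotes(text):
    """Remove newlines only inside double-quoted sections (hypothetical helper)."""
    return QUOTED.sub(lambda m: m.group(0).replace("\n", ""), text)

plain = 'foo,bar,\n"hihi","hi\nhi"'
escaped = '"say \\"x\ny\\"",z\n'
print(repr(strip_newlines_in_quotes(plain)))
print(repr(strip_newlines_in_quotes(escaped)))
```

The newline after `z` stays in place because it sits outside the quoted section.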
This regex works (assuming that quotes are correctly balanced):
import re
result = re.sub(r"""(?x)   # verbose regex
    \n                     # Match a newline
    (?!                    # only if it is not followed by
        (?:
            [^"]*"         # an even number of quotes
            [^"]*"         # (and any other non-quote characters)
        )*                 # (yes, zero counts, too)
        [^"]*
        \Z                 # until the end of the string.
    )""",
    "", str)
Something like this
Break the CSV data into columns.
>>> m=re.findall(r'(".*?"|[^"]*?)(,\s*|\Z)',s,re.M|re.S)
>>> m
[('foo', ','), ('bar', ',\n'), ('"hihi"', ','), ('"hi\nhi"', ''), ('', '')]
Replace just the field instances of '\n' with ''.
>>> [ field.replace('\n','') + sep for field,sep in m ]
['foo,', 'bar,\n', '"hihi",', '"hihi"', '']
Reassemble the resulting stuff (if that's really the point.)
>>> "".join(_)
'foo,bar,\n"hihi","hihi"'
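Since the input looks like CSV, another option is to let the standard csv module do the field splitting instead of a hand-written regex. This is a different technique from the answer above, sketched under the assumption that the data really is well-formed CSV; the helper name strip_newlines_in_fields is made up. Note that the quotes themselves are consumed by the parser, so the result is a list of rows rather than the original string.

```python
import csv
import io

def strip_newlines_in_fields(text):
    """Parse CSV text and drop newlines inside each field (hypothetical helper)."""
    # newline="" lets the csv module see line breaks inside quoted fields.
    reader = csv.reader(io.StringIO(text, newline=""))
    return [[field.replace("\n", "") for field in row] for row in reader]

rows = strip_newlines_in_fields('foo,bar,\n"hihi","hi\nhi"')
print(rows)
```

If the cleaned data has to be written back out, csv.writer can re-quote the fields, though not necessarily in exactly the same way as the input.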
I'd like to use a variable inside a regex, how can I do this in Python?
TEXTO = sys.argv[1]
if re.search(r"\b(?=\w)TEXTO\b(?!\w)", subject, re.IGNORECASE):
    pass  # Successful match
else:
    pass  # Match attempt failed
You have to build the regex as a string:
TEXTO = sys.argv[1]
my_regex = r"\b(?=\w)" + re.escape(TEXTO) + r"\b(?!\w)"
if re.search(my_regex, subject, re.IGNORECASE):
    ...  # etc.
Note the use of re.escape so that if your text has special characters, they won't be interpreted as such.
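To make the effect of re.escape concrete, here is a small sketch. The helper name contains_word and the sample strings are hypothetical; the pattern is the one from the answer above.

```python
import re

def contains_word(text, subject, escape=True):
    """Search for text as a whole word in subject (hypothetical helper)."""
    body = re.escape(text) if escape else text
    pattern = r"\b(?=\w)" + body + r"\b(?!\w)"
    return re.search(pattern, subject, re.IGNORECASE) is not None

# Without escaping, the dot in "a.c" matches any character.
print(contains_word("a.c", "x abc y", escape=False))
# With escaping, only a literal "a.c" matches.
print(contains_word("a.c", "x abc y"))
print(contains_word("a.c", "x a.c y"))
```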
From Python 3.6 on you can also use Literal String Interpolation, "f-strings". In your particular case the solution would be:
if re.search(rf"\b(?=\w){TEXTO}\b(?!\w)", subject, re.IGNORECASE):
    ...  # do something
EDIT:
Since there have been some questions in the comment on how to deal with special characters I'd like to extend my answer:
raw strings ('r'):
One of the main concepts you have to understand when dealing with special characters in regular expressions is to distinguish between string literals and the regular expression itself. It is very well explained here:
In short:
Let's say instead of finding a word boundary \b after TEXTO you want to match the string \boundary. Then you have to write:
TEXTO = "Var"
subject = r"Var\boundary"
if re.search(rf"\b(?=\w){TEXTO}\\boundary(?!\w)", subject, re.IGNORECASE):
    print("match")
This only works because we are using a raw string (the regex is preceded by 'r'); otherwise we would have to write "\\\\boundary" in the regex (four backslashes). Additionally, without the 'r' prefix, '\b' would no longer be interpreted as a word boundary but as a backspace character!
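The backspace pitfall is easy to check directly. A minimal sketch, with made-up variable names and sample text:

```python
import re

plain = "\bcat\b"   # \b becomes a backspace character (U+0008) in a normal literal
raw = r"\bcat\b"    # \b reaches the regex engine as a word boundary

print(len(plain), len(raw))          # the raw version keeps both backslashes
print(re.search(plain, "a cat"))     # looks for backspace + "cat" + backspace
print(re.search(raw, "a cat"))       # finds the word "cat"
```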
re.escape:
Basically puts a backslash in front of any special character. Hence, if you expect a special character in TEXTO, you need to write:
if re.search(rf"\b(?=\w){re.escape(TEXTO)}\b(?!\w)", subject, re.IGNORECASE):
    print("match")
NOTE: Since Python 3.7, re.escape no longer escapes !, ", %, ', ,, /, :, ;, <, =, >, @, and `. Only characters that can have special meaning in a regular expression are still escaped. _ has not been escaped since Python 3.3.
Curly braces:
If you want to use quantifiers within the regular expression using f-strings, you have to use double curly braces. Let's say you want to match TEXTO followed by exactly 2 digits:
if re.search(rf"\b(?=\w){re.escape(TEXTO)}\d{{2}}\b(?!\w)", subject, re.IGNORECASE):
    print("match")
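The doubled braces can be verified by looking at the string the f-string actually produces. A short sketch; the value of TEXTO and the sample subjects are assumptions for illustration:

```python
import re

TEXTO = "Var"  # hypothetical value standing in for sys.argv[1]

# {{2}} in the f-string becomes the literal quantifier {2} in the pattern.
pattern = rf"\b(?=\w){re.escape(TEXTO)}\d{{2}}\b(?!\w)"
print(pattern)

# With single braces, {2} is evaluated as the expression 2 instead.
wrong = rf"\d{2}"
print(wrong)

print(bool(re.search(pattern, "x Var42 y", re.IGNORECASE)))
print(bool(re.search(pattern, "x Var423 y", re.IGNORECASE)))
```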
if re.search(r"\b(?=\w)%s\b(?!\w)" % TEXTO, subject, re.IGNORECASE):
This will insert what is in TEXTO into the regex as a string.
rx = r'\b(?=\w){0}\b(?!\w)'.format(TEXTO)
I find it very convenient to build a regular expression pattern by stringing together multiple smaller patterns.
import re
string = "begin:id1:tag:middl:id2:tag:id3:end"
re_str1 = r'(?<=(\S{5})):'
re_str2 = r'(id\d+):(?=tag:)'
re_pattern = re.compile(re_str1 + re_str2)
match = re_pattern.findall(string)
print(match)
Output:
[('begin', 'id1'), ('middl', 'id2')]
I agree with all of the above, unless sys.argv[1] is itself meant to be a regex, for example:
sys.argv[1] = r"Chicken\d{2}-\d{2}An\s*important\s*anchor"
In that case you would not want to use re.escape, because you want the argument to behave like a regex:
TEXTO = sys.argv[1]
if re.search(r"\b(?=\w)" + TEXTO + r"\b(?!\w)", subject, re.IGNORECASE):
    pass  # Successful match
else:
    pass  # Match attempt failed
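One way to support both cases is to make the escaping optional. This is only a sketch: build_pattern and its literal parameter are hypothetical names, and the sample input is invented.

```python
import re

def build_pattern(text, literal=True):
    """Wrap text in the word-boundary pattern (hypothetical helper).

    literal=True treats text as plain characters; literal=False treats it
    as a regex, so re.compile may raise re.error if it is malformed.
    """
    body = re.escape(text) if literal else text
    return re.compile(r"\b(?=\w)" + body + r"\b(?!\w)", re.IGNORECASE)

user_input = r"Chicken\d{2}"
as_regex = build_pattern(user_input, literal=False)
as_text = build_pattern(user_input, literal=True)

print(bool(as_regex.search("one Chicken42 here")))
print(bool(as_text.search("one Chicken42 here")))
```

Treating command-line input as a regex means a malformed argument fails at compile time, so catching re.error around the call is worth considering.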
You can also try str.format as syntactic sugar:
re_genre = r'{}'.format(your_variable)
regex_pattern = re.compile(re_genre)
I needed to search for usernames that are similar to each other, and what Ned Batchelder said was incredibly helpful. However, I found I had cleaner output when I used re.compile to create my re search term:
pattern = re.compile(r"(" + username + ".*):(.*?):(.*?):(.*?):(.*)")
matches = re.findall(pattern, lines)
Output can be printed using the following:
print(matches[0])     # prints all capture groups of the first matching line, as a tuple
print(matches[0][3])  # prints the fourth capture group (set by the parentheses in the regex) of the first matching line
from re import search, IGNORECASE
def is_string_match(word1, word2):
    # Case-insensitive check of whether two words are the same
    # word1: string
    # word2: string | list
    # if word2 is a list, check whether word1 is in that list of words
    if isinstance(word2, list):
        for word in word2:
            if search(rf'\b{word1}\b', word, IGNORECASE):
                return True
        return False
    # otherwise check whether word1 is the same as word2
    if search(rf'\b{word1}\b', word2, IGNORECASE):
        return True
    return False
is_match_word = is_string_match("Hello", "hELLO")
True
is_match_word = is_string_match("Hello", ["Bye", "hELLO", "#vagavela"])
True
is_match_word = is_string_match("Hello", "Bye")
False
here's another format you can use (tested on python 3.7)
regex_str = r'\b(?=\w)%s\b(?!\w)' % TEXTO
I find it's useful when you can't use {} for variable (here replaced with %s)
You can use the format method for this as well. It replaces the {} placeholder with the variable you pass to it as an argument.
if re.search(r"\b(?=\w){}\b(?!\w)".format(TEXTO), subject, re.IGNORECASE):
    pass  # Successful match
else:
    pass  # Match attempt failed
Another example:
I have a configus.yml file with these entries for each flow:
"pattern":
  - _(\d{14})_
"datetime_string":
  - "%m%d%Y%H%M%f"
In the Python code I use:
data_time_real_file = re.findall(flows[flow]["pattern"][0], latest_file)
I am trying to replace "[!" at the start of the string only with "(".
The same holds for "!]" with ")", only at the end.
import re
l=["[!hdfjkhtxt.png!] abc", "hghjgfsdjhfg [a!=234]", "[![ITEM:15120710/1]/1587454425954.png!]", "abc"]
p=[re.sub("\[!\w+!]", '', l[i]) for i in range(len(l)) if l[i] != ""]
print(p)
the required output is
["(hdfjkhtxt.png)", "hghjgfsdjhfg [a!=234]", "([ITEM:15120710/1]/1587454425954.png)", "abc"]
Regex placing parens around content between matching pairs of '[!', '!]'
# content between '[!' and '!]' is in capture group 1
[re.sub(r"\[!(.*)!\]", lambda m: "(" + m.group(1) + ")", s) for s in l]
Output
['(hdfjkhtxt.png) abc', 'hghjgfsdjhfg [a!=234]', '([ITEM:15120710/1]/1587454425954.png)', 'abc']
You describe your task as a combination of two parts:
substitute [! by ( and
substitute !] by ).
Whether these can be done separately or have to be done together is addressed later.
First approach
Think if str.replace could do the job. It looks quite convenient and you don't even need to import re:
[e.replace("[!", "(").replace("!]", ")") for e in l]
BTW: there is no need to exclude the empty string ("") from the replacement because it's formally replaced by "" and will be technically skipped anyway.
Regex version
[re.sub(r"\[!", "(", re.sub(r"!\]", ")", e)) for e in l]
Decomposition
The nested substitutions may not look like two steps on first glance, so see the following example
import re
l = [
    "[!hdfjkhtxt.png!] abc",
    "hghjgfsdjhfg [a!=234]",
    "[![ITEM:15120710/1]/1587454425954.png!]",
    "abc"
]
for e in l:
    sd = re.sub(r"\[!", "(", e)
    sd = re.sub(r"!\]", ")", sd)
    print(e, " --> ", sd)
that produces this output:
[!hdfjkhtxt.png!] abc --> (hdfjkhtxt.png) abc
hghjgfsdjhfg [a!=234] --> hghjgfsdjhfg [a!=234]
[![ITEM:15120710/1]/1587454425954.png!] --> ([ITEM:15120710/1]/1587454425954.png)
abc --> abc
See the re.sub documentation for correct argument use.
Refinement
Because re.sub also supports back references, it's also possible to do the replacement of paired brackets.
re.sub(r"\[!(.+)!\]", r"(\1)", e)
Choose wisely!
It's important to read the actual requirement carefully. If you have to replace bracket pairs, use the second approach. If you have to replace the sequences regardless of whether the occurrences are paired, use the first. Otherwise you are doing it wrong.
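The difference between the two approaches only shows up on input with unpaired markers. A small sketch with made-up helper names and test strings; note that it uses a non-greedy .+? in the paired version, which is a deviation from the greedy .+ above so that several pairs on one line are handled separately.

```python
import re

def replace_unpaired(s):
    """Replace every '[!' and '!]' regardless of pairing (first approach)."""
    return s.replace("[!", "(").replace("!]", ")")

def replace_paired(s):
    """Replace only matching '[!' ... '!]' pairs (second approach, non-greedy)."""
    return re.sub(r"\[!(.+?)!\]", r"(\1)", s)

for sample in ["open [!only", "[!a!] and [!b!]"]:
    print(sample, "-->", replace_unpaired(sample), "|", replace_paired(sample))
```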
Escaping
Keep in mind that the backslash (\), as an escape character, has to be doubled in normal string literals; an alternative is to prefix the string literal with r. Doubling the backslash (or adding the r prefix) is optional in all but the last example, because \[ and \] have no meaning as string escapes in Python, whereas \1 is the code for SOH (the control character in ASCII), or U+0001 (the Unicode code point). Recent Python versions do emit a warning for unrecognized escapes such as \[ in normal string literals, so the r prefix is the safer habit.
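The \1 case is worth trying once, because the failure is silent. A minimal sketch with an invented input string:

```python
import re

# In a normal literal, "\1" is a single control character (U+0001).
print("\1" == "\x01", len("\1"), len(r"\1"))

# Without the r prefix the replacement inserts that control character.
bad = re.sub(r"\[!(.+)!\]", "(\1)", "[!x!]")
# With the r prefix, \1 is a back reference to the first group.
good = re.sub(r"\[!(.+)!\]", r"(\1)", "[!x!]")
print(repr(bad), repr(good))
```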
I am trying to remove punctuation to check if a phrase (or word) is a palindrome, though when I have a word with numbers they are removed and it return True instead of False. "1a2" after cleaning punctuation with sub returns 'a' though it should still give me '1a2'. I thought I picked up only punctuation for substitution.
import re
def isPalindrome(s):
    clean = re.sub("[,.;##?+^:%-=()!&$]", " ", s)
    lower = ''.join([i.lower() for i in clean.split()])
    if lower == lower[::-1]:
        return True
    else:
        return False
print(isPalindrome("1a2"))
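You're using an unescaped - inside your character set, so %-= is read as a range from % to =, and that range includes the digits 0-9. Escape it correctly; try this instead:
re.sub(r"[,.;##?+^:%\-=()!&$]", " ", s)
Have a look at the docs for the list of special characters and how to write a character set ([]).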
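To see why the unescaped - matters: inside a character set, %-= covers every character from % (code 37) to = (code 61), which includes the digits. A quick check, using the "1a2" input from the question:

```python
import re

# Unescaped hyphen: a range from '%' to '=', which contains '0'..'9'.
print(re.findall(r"[%-=]", "1a2"))
# Escaped hyphen: just the three characters '%', '-' and '='.
print(re.findall(r"[%\-=]", "1a2"))
# A hyphen placed last in the set is also taken literally.
print(re.findall(r"[%=-]", "1a2"))
```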
I would use str.maketrans and the punctuation set from the string module in your case, because I think that this is more readable than a regex :
import string
s = s.translate(str.maketrans('', '', string.punctuation))
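Put together, a palindrome check based on str.translate could look like the sketch below. The function name is_palindrome is made up, and removing whitespace in addition to punctuation is an assumption based on the question's use of split().

```python
import string

# Translation table that deletes ASCII punctuation and whitespace.
_REMOVE = str.maketrans("", "", string.punctuation + string.whitespace)

def is_palindrome(s):
    """Case-insensitive palindrome test that keeps letters and digits."""
    cleaned = s.translate(_REMOVE).lower()
    return cleaned == cleaned[::-1]

print(is_palindrome("1a2"))
print(is_palindrome("A man, a plan, a canal: Panama"))
```

string.punctuation only covers ASCII punctuation, so text with other punctuation marks would need a different table.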
Special characters must be escaped in your regex string. I.e.
clean = re.sub(r"[,\.;#\#\?\+\^:%\-=\(\)!\&\$]", " ", s)
or use re.escape, which automatically escapes special characters
esc = re.escape(r',.;##?+^:%-=()!&$')
clean = re.sub("[" + esc + "]", " ", s)
with open('templates/data.xml', 'r') as s:
for line in s:
line = line.rstrip() #removes trailing whitespace and '\n' chars
if "\\$\\(" not in line:
if ")" not in line:
continue
print(line)
start = line.index("$(")
end = line.index(")")
print(line[start+2:end])
I need to match the strings which are like $(hello). But now this even matches (hello).
I'm really new to Python. So what am I doing wrong here?
Use the following regex:
\$\(([^)]+)\)
It matches $, followed by (, then anything up to the next ), and captures the characters between the parentheses.
Here we escaped the $, ( and ) because when you use a function that accepts a regex (like findall), you don't want $ to be treated as the special character $, but as the literal "$" (the same holds for ( and )). However, note that the inner parentheses are not escaped, since they form the capture group for the text between the outer parentheses.
Note that you don't need to escape the special characters when you're not using regex.
You can do:
>>> import re
>>> escaper = re.compile(r'\$\((.*?)\)')
>>> escaper.findall("I like to say $(hello)")
['hello']
I believe something along the lines of:
import re
data = "$(hello)"
matchObj = re.match(r'\$\(([^)]+)\)', data, re.M | re.I)
print(matchObj.group())
might do the trick.
If you don't want to do it with regexes (I wouldn't necessarily; they can be hard to read), here are some notes on your code:
Your for loop indentation is wrong.
"\\$\\(" means \$\( (you're escaping the backslashes, not the $ and ().
You don't need to escape $ or (. Just do if "$(" not in line
You need to check that $( is found before ). Currently your code will match "foo)bar$(baz".
Rather than checking if $( and ) are in the string twice, it would be better to just do the .index() anyway and catch the exception. Something like this:
with open('templates/data.xml', 'r') as s:
    for line in s:
        try:
            start = line.index("$(")
            end = line.index(")", start)
            print(line[start+2:end])
        except ValueError:
            pass
Edit: That will only match one $() per line; you'll want to add a loop.
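The loop mentioned in the edit could be sketched as follows. The function name extract_all and the sample lines are hypothetical; it uses str.find (which returns -1 instead of raising) so the loop can stop cleanly.

```python
def extract_all(line):
    """Return the contents of every $(...) in line, without regexes."""
    results = []
    pos = 0
    while True:
        start = line.find("$(", pos)
        if start == -1:
            break  # no further "$(" in the line
        end = line.find(")", start + 2)
        if end == -1:
            break  # "$(" without a closing ")"
        results.append(line[start + 2:end])
        pos = end + 1  # continue searching after this match
    return results

print(extract_all("a $(x) b (y) $(z)"))
print(extract_all("foo)bar$(baz"))
```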
How do I encode """ in a raw python string?
The following does not seem to work:
string = r"""\"\"\""""
since when trying to match """ with a regular expression, I have to double-escape the character ":
Returns an empty list:
string = r"""\"\"\""""
regEx = re.compile(r"""
(\"\"\")
""", re.S|re.X)
result = re.findall(regEx, string)
in this case result is an empty list.
This same regular expression returns ['"""'] when I load a string with """ from file content.
Returns double-escaped quotations:
string = r"""\"\"\""""
regEx = re.compile(r"""
(\\"\\"\\")
""", re.S|re.X)
result = re.findall(regEx, string)
now result is equal to ['\\"\\"\\"'].
I want it to be equal to ['"""'].
In general, there are three options:
Don't use the r prefix. That's just a convenience to avoid excessive use of double-backslashes in regexes. It isn't required.
Use r'…', inside which the " character isn't special.
Mix and match r"…" and '…', e.g. pattern = '"""' + r"\s*\d\d-'\d\d'-\d\d\s*" + '"""'
In this instance, you can do both 1 and 2: single quotes and no r prefix.
The simplest way is to just do '"""'.
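A quick check of both points, as a sketch with an invented sample text: the single-quoted pattern matches three consecutive double quotes, while the raw string from the question actually contains backslashes, which is why nothing was found in it.

```python
import re

text = 'x = """doc"""'          # hypothetical text containing triple quotes
print(re.findall('"""', text))

# The raw literal from the question keeps its backslashes.
raw = r"""\"\"\""""
print(len(raw), raw)
print(re.findall('"""', raw))   # no three consecutive quotes in it
```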