Removing hash comments that are not inside quotes - python

I am using Python to go through a file and remove any comments. A comment is a hash and everything to the right of it, provided the hash isn't inside double quotes. I currently have a solution, but it seems sub-optimal:
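import re
filelines = []
r = re.compile('(".*?")')
for line in f:
    # Split into quoted and unquoted tokens; the capture group keeps the quoted ones
    m = r.split(line)
    nline = ''
    for token in m:
        if token.find('#') != -1 and token[0] != '"':
            # First hash outside quotes: keep what precedes it and stop
            nline += token[:token.find('#')]
            break
        else:
            nline += token
    filelines.append(nline)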
Is there a way to find the first hash not within quotes without for loops (i.e. through regular expressions?)
Examples:
' "Phone #":"555-1234" ' -> ' "Phone #":"555-1234" '
' "Phone "#:"555-1234" ' -> ' "Phone "'
'#"Phone #":"555-1234" ' -> ''
' "Phone #":"555-1234" #Comment' -> ' "Phone #":"555-1234" '
Edit: Here is a pure regex solution created by user2357112. I tested it, and it works great:
filelines = []
r = re.compile('(?:"[^"]*"|[^"#])*(#)')
for line in f:
    m = r.match(line)
    if m is not None:
        filelines.append(line[:m.start(1)])
    else:
        filelines.append(line)
See his reply for more details on how this regex works.
Edit2: Here's a version of user2357112's code that I modified to account for escape characters (\"). This code also eliminates the 'if' by including a check for end of string ($):
filelines = []
r = re.compile(r'(?:"(?:[^"\\]|\\.)*"|[^"#])*(#|$)')
for line in f:
    m = r.match(line)
    filelines.append(line[:m.start(1)])

r'''(?:       # Non-capturing group
    "[^"]*"   # A quote, followed by not-quotes, followed by a quote
    |         # or
    [^"#]     # not a quote or a hash
)             # end group
*             # Match quoted strings and not-quote-not-hash characters until...
(\#)          # the comment begins! (the hash is escaped because re.VERBOSE treats a bare # as a comment start)
'''
This is a verbose regex, designed to operate on a single line, so make sure to use the re.VERBOSE flag and feed it one line at a time. It'll capture the first unquoted hash as group 1 if there is one, so you can use match.start(1) to get the index. It doesn't handle backslash escapes, so a backslash-escaped quote inside a string will be treated as the end of that string. This is untested.
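For anyone who wants to try the verbose pattern, here is a minimal sketch of how it might be compiled and applied. The names UNQUOTED_HASH and strip_comment are made up for illustration, and the hash is written as \# because re.VERBOSE would otherwise read it as the start of a comment. Like the pattern above, it makes no attempt to handle backslash-escaped quotes.

```python
import re

# Hypothetical name; the pattern follows the verbose regex described above.
UNQUOTED_HASH = re.compile(r'''
    (?:
        "[^"]*"    # a complete double-quoted string
      |
        [^"\#]     # or one character that is neither a quote nor a hash
    )*
    (\#)           # group 1: the first hash outside quotes
''', re.VERBOSE)


def strip_comment(line):
    """Return line without its trailing comment, if it has one."""
    m = UNQUOTED_HASH.match(line)
    return line[:m.start(1)] if m else line


# The four examples from the question.
for example in (' "Phone #":"555-1234" ',
                ' "Phone "#:"555-1234" ',
                '#"Phone #":"555-1234" ',
                ' "Phone #":"555-1234" #Comment'):
    print(repr(strip_comment(example)))
```

The loop prints the stripped form of each example line so the results can be compared with the expected values listed in the question.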

You can remove comments using this script:
import re
print(re.sub(r'(?s)("[^"\\]*(?:\\.[^"\\]*)*")|#[^\n]*', lambda m: m.group(1) or '', '"Phone #"#:"555-1234"'))
The idea is to match the parts enclosed in double quotes first, capturing them so that they are replaced by themselves, and only then to look for a hash:
(?s)             # the dot matches newlines too
(                # open the capture group 1
    "            # "
    [^"\\]*      # all characters except a quote or a backslash,
                 # zero or more times
    (?:          # open a non-capturing group
        \\.      # a backslash and any character
        [^"\\]*  # all characters except a quote or a backslash, zero or more times
    )*           # repeat zero or more times
    "            # "
)                # close the capture group 1
|                # OR
#[^\n]*          # a hash and zero or more characters that are not a newline
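As a rough sketch of how the substitution above could be wrapped for whole-file use, the snippet below applies it to a multi-line string. The names COMMENT_OR_STRING and strip_comments, and the sample text, are assumptions made for illustration only.

```python
import re

# Same pattern as in the answer above: a quoted string (captured) or a comment.
COMMENT_OR_STRING = re.compile(r'(?s)("[^"\\]*(?:\\.[^"\\]*)*")|#[^\n]*')


def strip_comments(text):
    """Drop hash comments, keeping quoted strings (with escapes) intact."""
    # If group 1 matched, we hit a quoted string and put it back unchanged;
    # otherwise we hit a comment and replace it with nothing.
    return COMMENT_OR_STRING.sub(lambda m: m.group(1) or '', text)


sample = 'a = "x \\" # y" # real comment\nb = 1 # another\n'
print(strip_comments(sample))
```

Because the quoted-string alternative is tried first at every position, a hash inside a string is consumed as part of the string and never reaches the comment alternative.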

This code was so ugly, I had to post it.
def remove_comments(text):
    char_list = list(text)
    in_str = False
    deleting = False
    for i, c in enumerate(char_list):
        if deleting:
            if c == '\n':
                deleting = False
            else:
                char_list[i] = None
        elif c == '"':
            in_str = not in_str
        elif c == '#':
            if not in_str:
                deleting = True
                char_list[i] = None
    char_list = filter(lambda x: x is not None, char_list)
    return ''.join(char_list)
It seems to work, though I'm not sure how it handles the different newline conventions on Windows and Linux.

Related

Split string by comma, but ignore commas within brackets

I'm trying to split a string by commas using python:
s = "year:2020,concepts:[ab553,cd779],publisher:elsevier"
But I want to ignore any commas within brackets []. So the result for above would be:
["year:2020", "concepts:[ab553,cd779]", "publisher:elsevier"]
Anybody have advice on how to do this? I tried to use re.split like so:
params = re.split(",(?![\w\d\s])", param)
But it is not working properly.
result = re.split(r",(?!(?:[^,\[\]]+,)*[^,\[\]]+])", subject, 0)
, # Match the character “,” literally
(?! # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
(?: # Match the regular expression below
[^,\[\]] # Match any single character NOT present in the list below
# The literal character “,”
# The literal character “[”
# The literal character “]”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
, # Match the character “,” literally
)
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
[^,\[\]] # Match any single character NOT present in the list below
# The literal character “,”
# The literal character “[”
# The literal character “]”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
] # Match the character “]” literally
)
Updated to support more than 2 items in brackets. E.g.
year:2020,concepts:[ab553,cd779],publisher:elsevier,year:2020,concepts:[ab553,cd779,xx345],publisher:elsevier
This regex works on your example:
,(?=[^,]+?:)
Here, we use a positive lookahead to match only those commas that are followed by one or more non-comma characters and then a colon. This correctly finds the <comma><key> pattern you are searching for. Of course, if the keys are allowed to have commas, this would have to be adapted a little further.
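To make the lookahead concrete, here is a small sketch that runs it through re.split on the string from the question. The second input is an assumed edge case, not something from the question: a colon inside the brackets makes the lookahead succeed there too.

```python
import re

s = "year:2020,concepts:[ab553,cd779],publisher:elsevier"

# Split only at commas that are followed by "<non-commas>:", i.e. by a key.
print(re.split(r",(?=[^,]+?:)", s))

# Assumed edge case: a colon inside the brackets also satisfies the
# lookahead, so the bracketed list gets broken up.
print(re.split(r",(?=[^,]+?:)", "a:[x,y:z],b:1"))
```

So the pattern relies on colons appearing only between keys and values, never inside a bracketed list.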
You can work this out using a user-defined function instead of split:
s = "year:2020,concepts:[ab553,cd779],publisher:elsevier"
def split_by_commas(s):
    lst = list()
    last_bracket = ''
    word = ""
    for c in s:
        if c == '[' or c == ']':
            last_bracket = c
        if c == ',' and last_bracket == ']':
            lst.append(word)
            word = ""
            continue
        elif c == ',' and last_bracket == '[':
            word += c
            continue
        elif c == ',':
            lst.append(word)
            word = ""
            continue
        word += c
    lst.append(word)
    return lst
main_lst = split_by_commas(s)
print(main_lst)
The result of the run of above code:
['year:2020', 'concepts:[ab553,cd779]', 'publisher:elsevier']
A pattern that uses only a lookahead to assert a character to the right does not check whether there is an accompanying character on the left.
Instead of using split, you could either match 1 or more repetitions of values between square brackets, or match any character except a comma.
(?:[^,]*\[[^][]*])+[^,]*|[^,]+
import re
s = "year:2020,concepts:[ab553,cd779],publisher:elsevier"
params = re.findall(r"(?:[^,]*\[[^][]*])+[^,]*|[^,]+", s)
print(params)
Output
['year:2020', 'concepts:[ab553,cd779]', 'publisher:elsevier']
I adapted @Bemwa's solution (which didn't work for my use-case):
def split_by_commas(s):
    lst = list()
    brackets = 0
    word = ""
    for c in s:
        if c == "[":
            brackets += 1
        elif c == "]":
            if brackets > 0:
                brackets -= 1
        elif c == "," and not brackets:
            lst.append(word)
            word = ""
            continue
        word += c
    lst.append(word)
    return lst
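The depth-counting idea above carries over to nested brackets and to other delimiter pairs. The following is only a sketch under that assumption; split_top_level and its sep/opener/closer parameters are hypothetical names, not part of any answer here.

```python
def split_top_level(s, sep=",", opener="[", closer="]"):
    """Split s on sep, ignoring separators nested inside opener/closer pairs."""
    parts, word, depth = [], "", 0
    for c in s:
        if c == opener:
            depth += 1
        elif c == closer and depth > 0:
            depth -= 1
        elif c == sep and depth == 0:
            # A separator at nesting depth 0 ends the current piece.
            parts.append(word)
            word = ""
            continue
        word += c
    parts.append(word)
    return parts


print(split_top_level("year:2020,concepts:[ab553,[cd779,x]],publisher:elsevier"))
print(split_top_level("f(a,b),g", opener="(", closer=")"))
```

Keeping a counter instead of remembering only the last bracket seen is what lets nested lists stay in one piece.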

How can i remove only the last bracket from a string in python?

How can I remove only the last bracketed group from a string?
For example,
Input 1:
"hell(h)o(world)"
I want this result:
"hell(h)o"
Input 2:
hel(lo(wor)ld)
I want:
hel
As you can see, the middle brackets remain intact; only the last bracketed group is removed.
I tried:
import re
string = 'hell(h)o(world)'
print(re.sub('[()]', '', string))
Output:
hellhoworld
I figured out a solution, which looks like this:
string = 'hell(h)o(world)'
if string[-1] == ")":
    add = string.rfind('(')
    print(string[:add])
Output:
hell(h)o
I'm looking for other optimised solutions or suggestions.
See whether the code below is useful; let me know and I will optimize it further.
string = 'hell(h)o(world)'
count = 0
r = ''
for i in reversed(string):
    # Skip the first two bracket characters found when scanning from the right
    if count < 2 and (i == ')' or i == '('):
        count += 1
    else:
        r += i
for i in reversed(r):
    print(i, end='')
If you want to remove the last bracket from the string even if it's not at the end of the string, you can try something like this. This will only work if you know you have a substring beginning and ending with parentheses somewhere in the string, so you may want to implement some sort of check for that. You will also need to modify it if you are dealing with nested parentheses.
str = "hell(h)o(world)"
r_str = str[::-1] # creates reverse copy of string
for i in range(len(str)):
    if r_str[i] == ")":
        start = i
    elif r_str[i] == "(":
        end = i+1
        break
x = r_str[start:end][::-1] # substring that we want to remove
str = str.replace(x,'')
print(str)
output:
hell(h)o
If the string is not at the end:
str = "hell(h)o(world)blahblahblah"
output:
hell(h)oblahblahblah
Edit: Here is a modified version to detect nested parentheses. However, please keep in mind that this will not work if there are unbalanced parentheses in the string.
str = "hell(h)o(w(orld))"
r_str = str[::-1]
p_count = 0
for i in range(len(str)):
    if r_str[i] == ")":
        if p_count == 0:
            start = i
        p_count = p_count+1
    elif r_str[i] == "(":
        if p_count == 1:
            end = i+1
            break
        else:
            p_count = p_count - 1
x = r_str[start:end][::-1]
print("x:", x)
str = str.replace(x,'')
print(str)
output:
hell(h)o
Something like this?
string = 'hell(h)o(w(orl)d)23'
new_str = ''
escaped = 0
for char in reversed(string):
    if escaped is not None and char == ')':
        escaped += 1
    if not escaped:
        new_str = char + new_str
    if escaped is not None and char == '(':
        escaped -= 1
        if escaped == 0:
            escaped = None
print(new_str)
This starts escaping at a ) and stops when its current level is closed with a (.
So a nested () would not affect it.
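The right-to-left scan described above can also be packaged as a function that does nothing when the parentheses are unbalanced. This is just a sketch; remove_last_group is a made-up name, and treating an unbalanced string as "leave unchanged" is an assumption rather than something the question specifies.

```python
def remove_last_group(s):
    """Remove the last balanced (...) group, including anything nested in it."""
    end = s.rfind(")")
    if end == -1:
        return s  # nothing to remove
    depth = 0
    # Walk left from the last ')' until its matching '(' is found.
    for i in range(end, -1, -1):
        if s[i] == ")":
            depth += 1
        elif s[i] == "(":
            depth -= 1
            if depth == 0:
                return s[:i] + s[end + 1:]
    return s  # unbalanced: assumed behaviour is to leave the string alone


for text in ("hell(h)o(world)", "hel(lo(wor)ld)", "hell(h)o(world)blahblahblah"):
    print(remove_last_group(text))
```

Slicing around the matched indices avoids str.replace, which would also remove any earlier identical substring.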
Using re.sub('[()]', '', string) will replace any parenthesis in the string with an empty string.
To match the last set of balanced parentheses, and if you can make use of the regex module from PyPI, you can use a recursive pattern that repeats the first sub group, and assert that there are no more occurrences of either ( or ) to the right:
(\((?:[^()\n]++|(?1))*\))(?=[^()\n]*$)
The pattern matches:
(                      Capture group 1
  \(                   Match (
  (?:[^()\n]++|(?1))*  Repeat 0+ times: either match 1+ characters other than ( ) or a newline, or recurse into group 1 using (?1)
  \)                   Match )
)                      Close group 1
(?=[^()\n]*$)          Positive lookahead: assert that no ( or ) or newline occurs up to the end of the string
For example
import regex
strings = [
    "hell(h)o(world)",
    "hel(lo(wor)ld)",
    "hell(h)o(world)blahblahblah"
]
pattern = r"(\((?:[^()]++|(?1))*\))(?=[^()]*$)"
for s in strings:
    print(regex.sub(pattern, "", s))
Output
hell(h)o
hel
hell(h)oblahblahblah

Replace character only when character not in parentheses

I have a string like the following:
test_string = "test:(apple:orange,(orange:apple)):test2"
I want to replace ":" with "/" only if it is not contained within any set of parentheses.
The desired output is "test/(apple:orange,(orange:apple))/test2"
How can this be done in Python?
You can use the code below to achieve the expected output:
def solve(args):
    ans = ''
    seen = 0  # current nesting depth of parentheses
    for i in args:
        if i == '(':
            seen += 1
        elif i == ')':
            seen -= 1
        if i == ':' and seen <= 0:
            ans += '/'
        else:
            ans += i
    return ans
test_string = "test:(apple:orange,(orange:apple)):test2"
print(solve(test_string))
With regex module:
>>> import regex
>>> test_string = "test:(apple:orange,(orange:apple)):test2"
>>> regex.sub(r'\((?:[^()]++|(?0))++\)(*SKIP)(*F)|:', '/', test_string)
'test/(apple:orange,(orange:apple))/test2'
\((?:[^()]++|(?0))++\) match pair of parantheses recursively
See Recursive Regular Expressions for explanations
(*SKIP)(*F) to avoid replacing the preceding pattern
See Backtracking Control Verbs for explanations
|: to specify : as alternate match
Find the first opening parentheses
Find the last closing parentheses
Replace every ":" with "/" before the first opening parentheses
Don't do anything to the middle part
Replace every ":" with "/" after the last closing parentheses
Put these 3 substrings together
Code:
test_string = "test:(apple:orange,(orange:apple)):test2"
first_opening = test_string.find('(')
last_closing = test_string.rfind(')')
result_string = test_string[:first_opening].replace(':', '/') + test_string[first_opening : last_closing] + test_string[last_closing:].replace(':', '/')
print(result_string)
Output:
test/(apple:orange,(orange:apple))/test2
Warning: as the comments pointed out, this won't work if the string contains multiple separate parenthesized groups, because colons between the groups are left untouched :(
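To illustrate that warning, the sketch below runs the find/rfind approach and a depth-counting loop on an input with two separate groups. The function names and the sample string are assumptions chosen for the comparison.

```python
def replace_first_last(s):
    """The find/rfind approach: only text outside the outermost span changes."""
    first_opening = s.find("(")
    last_closing = s.rfind(")")
    return (s[:first_opening].replace(":", "/")
            + s[first_opening:last_closing]
            + s[last_closing:].replace(":", "/"))


def replace_top_level(s):
    """Depth-counting approach: replace ':' wherever the nesting depth is 0."""
    out, depth = [], 0
    for c in s:
        if c == "(":
            depth += 1
        elif c == ")":
            depth -= 1
        out.append("/" if c == ":" and depth <= 0 else c)
    return "".join(out)


sample = "a:(b:c):d:(e:f):g"  # hypothetical input with two separate groups
print(replace_first_last(sample))
print(replace_top_level(sample))
```

The first call leaves the colons around d unchanged, since they fall between the first opening and last closing parenthesis, while the depth counter replaces them.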

How to remove or strip off white spaces without using strip() function?

Write a function that accepts an input string consisting of alphabetic
characters and removes all the leading whitespace of the string and
returns it without using .strip(). For example if:
input_string = " Hello "
then your function should return a string such as:
output_string = "Hello "
The below is my program for removing white spaces without using strip:
def Leading_White_Space (input_str):
    length = len(input_str)
    i = 0
    while (length):
        if(input_str[i] == " "):
            input_str.remove()
        i =+ 1
        length -= 1
#Main Program
input_str = " Hello "
result = Leading_White_Space (input_str)
print (result)
I chose the remove function as it would be an easy way to get rid of the white spaces before the string 'Hello'. Also, the task says to eliminate only the white spaces before the actual string. By my logic, I suppose my program eliminates not only the leading but also the trailing white spaces. Any help would be appreciated.
You can loop over the characters of the string and stop when you reach a non-space one. Here is one solution :
def Leading_White_Space(input_str):
    for i, c in enumerate(input_str):
        if c != ' ':
            return input_str[i:]
Edit:
@PM 2Ring mentioned a good point. If you want to handle all types of whitespace (e.g. \t, \n, \r), you need to use isspace(), so a correct solution could be:
def Leading_White_Space(input_str):
    for i, c in enumerate(input_str):
        if not c.isspace():
            return input_str[i:]
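One detail worth noting: if the input is empty or contains only whitespace, the loop above finishes without reaching a return, so the function returns None. A minimal sketch of one way to cover that case, assuming an empty string is the desired result (the lowercase name is only for illustration):

```python
def leading_white_space(input_str):
    """Return input_str without its leading whitespace."""
    for i, c in enumerate(input_str):
        if not c.isspace():
            return input_str[i:]
    # Reached only when every character was whitespace (or the string is empty).
    return ''


print(repr(leading_white_space(" \n\t Hello ")))
print(repr(leading_white_space("   ")))
```

Returning '' here matches what the built-in lstrip does for an all-whitespace string.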
Here's another way to strip the leading whitespace, that actually strips all leading whitespace, not just the ' ' space char. There's no need to bother tracking the index of the characters in the string, we just need a flag to let us know when to stop checking for whitespace.
def my_lstrip(input_str):
    leading = True
    for ch in input_str:
        if leading:
            # All the chars read so far have been whitespace
            if not ch.isspace():
                # The leading whitespace is finished
                leading = False
                # Start saving chars
                result = ch
        else:
            # We're past the whitespace, copy everything
            result += ch
    return result
# test
input_str = " \n \t Hello "
result = my_lstrip(input_str)
print(repr(result))
output
'Hello '
There are various other ways to do this. Of course, in a real program you'd simply use the string .lstrip method, but here are a couple of cute ways to do it using an iterator:
def my_lstrip(input_str):
    it = iter(input_str)
    for ch in it:
        if not ch.isspace():
            break
    return ch + ''.join(it)
and
def my_lstrip(input_str):
    it = iter(input_str)
    ch = next(it)
    while ch.isspace():
        ch = next(it)
    return ch + ''.join(it)
Use re.sub
>>> import re
>>> input_string = " Hello "
>>> re.sub(r'^\s+', '', input_string)
'Hello '
or
>>> def remove_space(s):
...     ind = 0
...     for i, j in enumerate(s):
...         if j != ' ':
...             ind = i
...             break
...     return s[ind:]
...
>>> remove_space(input_string)
'Hello '
Just to be thorough and without using other modules, we can also specify which whitespace to remove (leading, trailing, both or all), including tab and new line characters. The code I used (which is, for obvious reasons, less compact than other answers) is as follows and makes use of slicing:
def no_ws(string, which='left'):
    """
    `which` takes the value 'left'/'right'/'both'/'all' to remove the relevant
    whitespace.
    """
    remove_chars = (' ', '\n', '\t')
    first_char = 0; last_char = 0
    if which in ['left', 'both']:
        for idx, letter in enumerate(string):
            if not first_char and letter not in remove_chars:
                first_char = idx
                break
        if which == 'left':
            return string[first_char:]
    if which in ['right', 'both']:
        for idx, letter in enumerate(string[::-1]):
            if not last_char and letter not in remove_chars:
                last_char = -(idx + 1)
                break
        # Convert the negative index to an absolute end position so that
        # last_char == -1 (no trailing whitespace) does not give an empty slice
        return string[first_char:len(string) + last_char + 1]
    if which == 'all':
        return ''.join([s for s in string if s not in remove_chars])
You can use itertools.dropwhile to remove all occurrences of particular characters from the start of your string, like this:
import itertools
def my_lstrip(input_str,remove=" \n\t"):
    return "".join( itertools.dropwhile(lambda x:x in remove,input_str))
To make it more flexible, I added an additional argument called remove, which holds the characters to remove from the string, with a default value of " \n\t". dropwhile then skips all leading characters that are in remove; to check this I use a lambda function (a practical way to write short anonymous functions).
Here are a few tests:
>>> my_lstrip(" \n \t Hello ")
'Hello '
>>> my_lstrip(" Hello ")
'Hello '
>>> my_lstrip(" \n \t Hello ")
'Hello '
>>> my_lstrip("--- Hello ","-")
' Hello '
>>> my_lstrip("--- Hello ","- ")
'Hello '
>>> my_lstrip("- - - Hello ","- ")
'Hello '
>>>
The previous function is equivalent to:
def my_lstrip(input_str,remove=" \n\t"):
    i=0
    for i,x in enumerate(input_str):
        if x not in remove:
            break
    return input_str[i:]

Python - How to clear spaces from a text

In Python, I have a lot of strings, containing spaces.
I would like to clear all spaces from the text, except if it is in quotation marks.
Example input:
This is "an example text" containing spaces.
And I want to get:
Thisis"an example text"containingspaces.
line.split() is not good, I think, because it clears all of spaces from the text.
What do you recommend?
For the simple case that only " are used as quotes:
>>> import re
>>> s = 'This is "an example text" containing spaces.'
>>> re.sub(r' (?=(?:[^"]*"[^"]*")*[^"]*$)', "", s)
'Thisis"an example text"containingspaces.'
Explanation:
[ ]          # Match a space
(?=          # only if an even number of quotes follows --> lookahead
  (?:        # This is true when the following can be matched:
    [^"]*"   # Any number of non-quote characters, then a quote, then
    [^"]*"   # the same thing again to get an even number of quotes.
  )*         # Repeat zero or more times.
  [^"]*      # Match any remaining non-quote characters
  $          # and then the end of the string.
)            # End of lookahead.
There is probably a more elegant solution than this, but:
>>> test = "This is \"an example text\" containing spaces."
>>> '"'.join([x if i % 2 else "".join(x.split())
...           for i, x in enumerate(test.split('"'))])
'Thisis"an example text"containingspaces.'
We split the text on quotes, then iterate through the pieces in a list comprehension. Pieces at even indices lie outside the quotes, so we remove their spaces by splitting and rejoining; pieces at odd indices lie inside the quotes and are kept unchanged. We then rejoin the whole thing with quotes.
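The even/odd rule can be checked with a small sketch. The function name remove_spaces_outside_quotes is made up for illustration, and the approach assumes the quotes are balanced and never escaped.

```python
def remove_spaces_outside_quotes(line):
    """Strip spaces from the parts of line that are not inside double quotes."""
    pieces = line.split('"')
    # Even indices are outside quotes, odd indices are inside.
    cleaned = [p if i % 2 else p.replace(' ', '') for i, p in enumerate(pieces)]
    return '"'.join(cleaned)


print(remove_spaces_outside_quotes('This is "an example text" containing spaces.'))
print(remove_spaces_outside_quotes('a b "c d" e "f g"'))
```

With an unbalanced quote, everything after the last quote would be treated as quoted, so real input may need validation first.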
Using re.findall is probably the more easily understood/flexible method:
>>> import re
>>> s = 'This is "an example text" containing spaces.'
>>> ''.join(re.findall(r'(?:".*?")|(?:\S+)', s))
'Thisis"an example text"containingspaces.'
You could (ab)use the csv.reader:
>>> import csv
>>> ''.join(next(csv.reader([s.replace('"', '"""')], delimiter=' ')))
'Thisis"an example text"containingspaces.'
Or using re.split:
>>> ''.join(filter(None, re.split(r'(?:\s*(".*?")\s*)|[ ]', s)))
'Thisis"an example text"containingspaces.'
Use regular expressions!
import io, re
result = io.StringIO()
regex = re.compile('("[^"]*")')
text = 'This is "an example text" containing spaces.'
for part in regex.split(text):
    if part and part[0] == '"':
        # Quoted part: keep it unchanged
        result.write(part)
    else:
        # Unquoted part: drop the spaces
        result.write(part.replace(" ", ""))
print(result.getvalue())
You can do this with csv as well:
import csv
out = []
for e in csv.reader('This is "an example text" containing spaces. '):
    e = ''.join(e)
    if e == ' ': continue
    if ' ' in e: out.extend('"' + e + '"')
    else: out.extend(e)
print(''.join(out))
Prints Thisis"an example text"containingspaces.
'"'.join(v if i%2 else v.replace(' ', '') for i, v in enumerate(line.split('"')))
quotation_mark = '"'
space = " "
example = 'foo choo boo "blaee blahhh" didneid ei did '
formatted_example = ''
inside_quotes = False
for character in example:
    if character == quotation_mark:
        # Every quotation mark toggles the inside/outside state
        inside_quotes = not inside_quotes
    if inside_quotes or character != space:
        # Keep everything inside quotes; outside, keep only non-spaces
        formatted_example += character
print(formatted_example)
