I have some LaTeX text that I am working on, and I need to clean it in order to split it properly based on spacing.
So the string:
\\mathrm l >\\mathrm li ^ + >\\mathrm mg ^ +>\\mathrm a \\beta+ \\mathrm co
should be:
\\mathrm l > \\mathrm li ^ + > \\mathrm mg ^ + > \\mathrm a \\beta + \\mathrm co
So in order to split it, I have to add spacing around every character that is a special character. I also want to keep the LaTeX notation intact, as \something.
I can have re.compile() with [a-zA-Z0-9 \\] to match everything that isn't a special character, but then how can I go about inserting the spaces?
I have written some code like this, but it does not look good in terms of efficiency (or is it?):
import re

def insert_space(sentence):
    '''
    Add a space around special characters, so "x+y +-=y \\latex" becomes "x + y + - = y \\latex".
    '''
    string = ''
    for i in sentence:
        if (not i.isalnum()) and i not in [' ', '\\']:
            string += ' ' + i + ' '
        else:
            string += i
    return re.sub(r'\s+', ' ', string)
I haven't used LaTeX, so if you're sure that [a-zA-Z0-9 \\] captures everything that isn't a special character, you could do something like this:
import re
def insert_space(sentence):
    sentence = re.sub(r'(?<! )(?![a-zA-Z0-9 \\])', ' ', sentence)
    sentence = re.sub(r'(?<!^)(?<![a-zA-Z0-9 \\])(?! )', ' ', sentence)
    return sentence
my_string = '\\mathrm l >\\mathrm li ^ + >\\mathrm mg ^ +>\\mathrm a \\beta+ \\mathrm co'
print('before', my_string)
# before \mathrm l >\mathrm li ^ + >\mathrm mg ^ +>\mathrm a \beta+ \mathrm co
print('after', insert_space(my_string))
# after \mathrm l > \mathrm li ^ + > \mathrm mg ^ + > \mathrm a \beta + \mathrm co
The first regex is:
(?<! ) Negative look behind for a space.
(?![a-zA-Z0-9 \\]) Negative look ahead for the character class you specified.
Replace all of these occurrences with a space ' '.
The second regex is:
(?<!^) Negative look behind for the start of the string.
(?<![a-zA-Z0-9 \\]) Negative look behind for the character class you specified.
(?! ) Negative look ahead for a space.
Replace all of these occurrences with a space ' '.
So effectively, it finds every position between a special character and an adjacent character that is not a space, and inserts a space at that position.
The reason you also need to include (?<!^) is to ignore the position between the start of the string and the first character. Otherwise it would insert an extra space at the beginning.
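Since the end goal was to split the string, a final .split() on the result yields the individual tokens. A quick sketch using the same my_string as above (the expected tokens follow directly from the "after" string):
tokens = insert_space(my_string).split()
print(tokens)
# ['\\mathrm', 'l', '>', '\\mathrm', 'li', '^', '+', '>', '\\mathrm', 'mg', '^', '+', '>', '\\mathrm', 'a', '\\beta', '+', '\\mathrm', 'co']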
I'm trying to split a string by commas using Python:
s = "year:2020,concepts:[ab553,cd779],publisher:elsevier"
But I want to ignore any commas within brackets []. So the result for above would be:
["year:2020", "concepts:[ab553,cd779]", "publisher:elsevier"]
Anybody have advice on how to do this? I tried to use re.split like so:
params = re.split(",(?![\w\d\s])", param)
But it is not working properly.
result = re.split(r",(?!(?:[^,\[\]]+,)*[^,\[\]]+])", subject, 0)
,                   # Match the character “,” literally
(?!                 # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
   (?:              # Match the regular expression below
      [^,\[\]]      # Match any single character NOT present in the list below:
                    #   the literal character “,”
                    #   the literal character “[”
                    #   the literal character “]”
      +             # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
      ,             # Match the character “,” literally
   )
   *                # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   [^,\[\]]         # Match any single character NOT present in the list below:
                    #   the literal character “,”
                    #   the literal character “[”
                    #   the literal character “]”
   +                # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
   ]                # Match the character “]” literally
)
Updated to support more than 2 items in brackets. E.g.
year:2020,concepts:[ab553,cd779],publisher:elsevier,year:2020,concepts:[ab553,cd779,xx345],publisher:elsevier
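As a quick sanity check (assuming Python 3), applying the pattern to that longer string splits it into the six expected fields:
import re

subject = "year:2020,concepts:[ab553,cd779],publisher:elsevier,year:2020,concepts:[ab553,cd779,xx345],publisher:elsevier"
print(re.split(r",(?!(?:[^,\[\]]+,)*[^,\[\]]+])", subject))
# ['year:2020', 'concepts:[ab553,cd779]', 'publisher:elsevier',
#  'year:2020', 'concepts:[ab553,cd779,xx345]', 'publisher:elsevier']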
This regex works on your example:
,(?=[^,]+?:)
Here, we use a positive lookahead to find commas that are followed by one or more non-comma characters and then a colon. This correctly finds the <comma><key> pattern you are searching for. Of course, if the keys are allowed to contain commas, this would have to be adapted a little further.
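A short usage sketch of this pattern on the string from the question:
import re

s = "year:2020,concepts:[ab553,cd779],publisher:elsevier"
print(re.split(r",(?=[^,]+?:)", s))
# ['year:2020', 'concepts:[ab553,cd779]', 'publisher:elsevier']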
You can work this out using a user-defined function instead of split:
s = "year:2020,concepts:[ab553,cd779],publisher:elsevier"
def split_by_commas(s):
lst = list()
last_bracket = ''
word = ""
for c in s:
if c == '[' or c == ']':
last_bracket = c
if c == ',' and last_bracket == ']':
lst.append(word)
word = ""
continue
elif c == ',' and last_bracket == '[':
word += c
continue
elif c == ',':
lst.append(word)
word = ""
continue
word += c
lst.append(word)
return lst
main_lst = split_by_commas(s)
print(main_lst)
The result of running the above code:
['year:2020', 'concepts:[ab553,cd779]', 'publisher:elsevier']
Using a pattern with only a lookahead asserts what follows the matched comma, but it does not check what precedes it.
Instead of using split, you could either match 1 or more repetitions of values between square brackets, or match any character except a comma.
(?:[^,]*\[[^][]*])+[^,]*|[^,]+
s = "year:2020,concepts:[ab553,cd779],publisher:elsevier"
params = re.findall(r"(?:[^,]*\[[^][]*])+[^,]*|[^,]+", s)
print(params)
Output
['year:2020', 'concepts:[ab553,cd779]', 'publisher:elsevier']
I adapted @Bemwa's solution (which didn't work for my use case):
def split_by_commas(s):
    lst = list()
    brackets = 0
    word = ""
    for c in s:
        if c == "[":
            brackets += 1
        elif c == "]":
            if brackets > 0:
                brackets -= 1
        elif c == "," and not brackets:
            lst.append(word)
            word = ""
            continue
        word += c
    lst.append(word)
    return lst
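A quick check against the sample string from the question gives the expected result:
s = "year:2020,concepts:[ab553,cd779],publisher:elsevier"
print(split_by_commas(s))
# ['year:2020', 'concepts:[ab553,cd779]', 'publisher:elsevier']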
I made a function that replaces multiple instances of a single character with multiple patterns depending on the character location.
There were two ways I found to accomplish this:
This one looks horrible but it works:
def xSubstitution(target_string):
    while target_string.casefold().find('x') != -1:
        x_finded = target_string.casefold().find('x')
        if (x_finded == 0 and target_string[1] == ' ') or (target_string[x_finded-1] == ' ' and
                ((target_string[-1] == 'x' or 'X') or target_string[x_finded+1] == ' ')):
            target_string = target_string.replace(target_string[x_finded], 'ecks', 1)
        elif (target_string[x_finded+1] != ' '):
            target_string = target_string.replace(target_string[x_finded], 'z', 1)
        else:
            target_string = target_string.replace(target_string[x_finded], 'cks', 1)
    return(target_string)
This one technically works, but I just can't get the regex patterns right:
import re

def multipleRegexSubstitutions(sentence):
    patterns = {(r'^[xX]\s'): 'ecks ', (r'[^\w]\s?[xX](?!\w)'): 'ecks',
                (r'[\w][xX]'): 'cks', (r'[\w][xX][\w]'): 'cks',
                (r'^[xX][\w]'): 'z', (r'\s[xX][\w]'): 'z'}
    regexes = [
        re.compile(p)
        for p in patterns
    ]
    for regex in regexes:
        for match in re.finditer(regex, sentence):
            match_location = sentence.casefold().find('x', match.start(), match.end())
            sentence = sentence.replace(sentence[match_location], patterns.get(regex.pattern), 1)
    return sentence
From what I can figure out, the only problem in the second function is the regex patterns. Could someone help me?
EDIT: Sorry, I forgot to mention that the regexes are looking for the different x characters in a string: replace an X at the beginning of a word with 'Z', in the middle or end of a word with 'cks', and if it is a lone 'x' char, replace it with 'ecks'.
You need \b (word boundary) and \B (position other than word boundary):
Replace an X at the beginning of a word with a 'Z':
re.sub(r'\bX\B', 'Z', s, flags=re.I)
In the middle or at the end of a word, with 'cks':
re.sub(r'\BX', 'cks', s, flags=re.I)
If it is a lone 'x' char, replace it with 'ecks':
re.sub(r'\bX\b', 'ecks', s, flags=re.I)
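Applied in sequence to a small made-up sample string (my own example, just to illustrate the three cases):
import re

s = "x xylophone taxi"
s = re.sub(r'\bX\B', 'Z', s, flags=re.I)     # word-initial x -> Z
s = re.sub(r'\BX', 'cks', s, flags=re.I)     # x in the middle or at the end of a word -> cks
s = re.sub(r'\bX\b', 'ecks', s, flags=re.I)  # lone x -> ecks
print(s)
# ecks Zylophone tacksi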
I would just use the following set of substitutions for this:
string = re.sub(r"\b[Xx]\b", "ecks", string)
string = re.sub(r"\b[Xx](?!\s)", "Z", string)
string = re.sub(r"(?<=\w)[Xx](?=\w)", "cks", string)
Here,
(?!\s)
just asserts that the next character is not a whitespace character, and
\b
matches a word boundary (a position between a \w character and a non-\w character).
Edit: The last regex would also match x or X at the beginning of a word. So we can use the following, instead,
(?<=\w)[xX](?=\w)
to make sure there must be a character \w before/after x or X.
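On the same made-up sample as above, these three substitutions produce the same result:
import re

string = "x xylophone taxi"
string = re.sub(r"\b[Xx]\b", "ecks", string)
string = re.sub(r"\b[Xx](?!\s)", "Z", string)
string = re.sub(r"(?<=\w)[Xx](?=\w)", "cks", string)
print(string)
# ecks Zylophone tacksi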
How do you get a list with the extra spaces fixed from the list m?
m = ['m, a \n', 'l, n \n', 'c, l\n']
for i in m:
    if (' ') in i:
        i.strip(' ')
I got:
'm, a \n'
'l, n \n'
'c, l\n'
and I want it to return:
['m, a\n', 'l, n\n', 'c, l\n']
The strip(' ') call only removes the given characters from the ends of the string. In your case, it starts at the end of your string, immediately encounters a '\n' character (which is not in the set of characters to strip), and stops.
It seems a little unclear what you are trying to do, but I will assume that you are looking to clear out any white space between the last non-whitespace character of your string and the newline. Correct me if I'm wrong.
There are many ways to do this, and this may not be the best, but here is what I came up with:
m = ['This, is a string. \n', 'another string! \n', 'final example\n ']
m = list(map(lambda x: x.rstrip() + '\n' if x[-1] == '\n' else x.rstrip(' '), m))
print(m)
['This, is a string.\n', 'another string!\n', 'final example\n']
Here I use the built-in map function to iterate over each list element, remove all whitespace from the end of the string (rstrip() instead of strip(), which strips both the start and the end), and add a newline back in if one was present in the original string.
Your code wouldn't be useful in a script; you are just seeing the REPL displaying the result of the expression i.strip(' '). In a script, that value would just be ignored.
To create a list, use a list comprehension:
result = [i.strip(' ') for i in m if ' ' in i]
Note, however, strip only removes the requested character from either end; in your data, the space precedes the newline. You'll need to do something like removing the newline as well, then put it back:
result = ["%s\n" % i.strip() for i in m if ' ' in i]
You can use regex:
import re
m = ['m, a \n', 'l, n \n', 'c, l\n']
final_m = [re.sub(r'(?<=[a-zA-Z])\s+(?=\n)', '', i) for i in m]
Output:
['m, a\n', 'l, n\n', 'c, l\n']
Quick and dirty:
m = [x.replace(' \n', '\n') for x in m]
This works if you know that only one space goes before the '\n'.
My goal is to make a regex that can handle 2 situations:
Multiple whitespace including one or more newlines in any order should become a single newline
Multiple whitespace excluding any newline should become a space
The arbitrary order of the whitespace characters, combined with the different handling for the newline and no-newline cases, is what makes this complex.
What is the most efficient way to do this?
E.g.
' \n \n \n a' # --> '\na'
' \t \t a' # --> ' a'
' \na\n ' # --> '\na\n'
Benchmark:
s = ' \n \n \n a \t \t a \na\n '
n_times = 1000000
------------------------------------------------------
change_whitespace(s) - 5.87 s
change_whitespace_2(s) - 3.51 s
change_whitespace_3(s) - 3.93 s
n_times = 100000
------------------------------------------------------
change_whitespace(s * 100) - 27.9 s
change_whitespace_2(s * 100) - 16.8 s
change_whitespace_3(s * 100) - 19.7 s
(Assumes Python can do regex replace with callback function)
You could use a callback to decide what the replacement needs to be:
If group 1 matches, replace with a space.
If group 2 matches, replace with a newline.
(?<!\s)(?:([^\S\r\n]+)|(\s+))(?!\s)
(?<! \s )            # No whitespace behind
(?:
    ( [^\S\r\n]+ )   # (1), Non-linebreak whitespace
  |
    ( \s+ )          # (2), At least 1 linebreak
)
(?! \s )             # No whitespace ahead
This replaces the whitespace that contains a newline with a single newline, then replaces the whitespace that doesn't contain a newline with a single space.
import re

def change_whitespace(string):
    return re.sub(r'[ \t\f\v]+', ' ', re.sub(r'[\s]*[\n\r]+[\s]*', '\n', string))
Results:
>>> change_whitespace(' \n \n \n a')
'\na'
>>> change_whitespace(' \t \t a')
' a'
>>> change_whitespace(' \na\n ')
'\na\n'
Thanks to @sln for reminding me of regex callback functions:
def change_whitespace_2(string):
    return re.sub(r'\s+', lambda x: '\n' if '\n' in x.group(0) else ' ', string)
Results:
>>> change_whitespace_2(' \n \n \n a')
'\na'
>>> change_whitespace_2(' \t \t a')
' a'
>>> change_whitespace_2(' \na\n ')
'\na\n'
And here's a function with @sln's expression:
def change_whitespace_3(string):
    return re.sub(r'(?<!\s)(?:([^\S\r\n]+)|(\s+))(?!\s)', lambda x: ' ' if x.group(1) else '\n', string)
Results:
>>> change_whitespace_3(' \n \n \n a')
'\na'
>>> change_whitespace_3(' \t \t a')
' a'
>>> change_whitespace_3(' \na\n ')
'\na\n'
I am using Python to go through a file and remove any comments. A comment is defined as a hash and anything to the right of it, as long as the hash isn't inside double quotes. I currently have a solution, but it seems sub-optimal:
filelines = []
r = re.compile('(".*?")')
for line in f:
    m = r.split(line)
    nline = ''
    for token in m:
        if token.find('#') != -1 and token[0] != '"':
            nline += token[:token.find('#')]
            break
        else:
            nline += token
    filelines.append(nline)
Is there a way to find the first hash not within quotes without for loops (i.e., through regular expressions)?
Examples:
' "Phone #":"555-1234" ' -> ' "Phone #":"555-1234" '
' "Phone "#:"555-1234" ' -> ' "Phone "'
'#"Phone #":"555-1234" ' -> ''
' "Phone #":"555-1234" #Comment' -> ' "Phone #":"555-1234" '
Edit: Here is a pure regex solution created by user2357112. I tested it, and it works great:
filelines = []
r = re.compile('(?:"[^"]*"|[^"#])*(#)')
for line in f:
    m = r.match(line)
    if m is not None:
        filelines.append(line[:m.start(1)])
    else:
        filelines.append(line)
See his reply for more details on how this regex works.
Edit2: Here's a version of user2357112's code that I modified to account for escape characters (\"). This code also eliminates the 'if' by including a check for end of string ($):
filelines = []
r = re.compile(r'(?:"(?:[^"\\]|\\.)*"|[^"#])*(#|$)')
for line in f:
    m = r.match(line)
    filelines.append(line[:m.start(1)])
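For example, with a line that contains an escaped quote (a made-up sample line, just to illustrate the escape handling):
import re

r = re.compile(r'(?:"(?:[^"\\]|\\.)*"|[^"#])*(#|$)')
line = ' "escaped \\" quote # inside" # real comment'
m = r.match(line)
print(line[:m.start(1)])
# prints: "escaped \" quote # inside"  (the trailing comment is stripped, the quoted '#' is kept)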
r'''(?:            # Non-capturing group
        "[^"]*"    # A quote, followed by not-quotes, followed by a quote
    |              # or
        [^"#]      # not a quote or a hash
    )              # end group
    *              # Match quoted strings and not-quote-not-hash characters until...
    (#)            # the comment begins!
'''
This is a verbose regex, designed to operate on a single line, so make sure to use the re.VERBOSE flag and feed it one line at a time. It'll capture the first unquoted hash as group 1 if there is one, so you can use match.start(1) to get the index. It doesn't handle backslash escapes, if you want to be able to put a backslash-escaped quote in a string. This is untested.
You can remove comments using this script:
import re

print(re.sub(r'(?s)("[^"\\]*(?:\\.[^"\\]*)*")|#[^\n]*', lambda m: m.group(1) or '', '"Phone #"#:"555-1234"'))
The idea is to first capture the parts enclosed in double quotes and replace them with themselves, before searching for a hash:
(?s)                 # the dot matches newlines too
(                    # open the capture group 1
    "                # "
    [^"\\]*          # all characters except a quote or a backslash,
                     # zero or more times
    (?:              # open a non-capturing group
        \\.          # a backslash and any character
        [^"\\]*      #
    )*               # repeat zero or more times
    "                # "
)                    # close the capture group 1
|                    # OR
#[^\n]*              # a hash and zero or more characters that are not a newline
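As a quick check against the examples from the question (a small sketch that wraps the substitution in a helper; the function name is just for illustration):
import re

def remove_comment(line):
    return re.sub(r'(?s)("[^"\\]*(?:\\.[^"\\]*)*")|#[^\n]*',
                  lambda m: m.group(1) or '', line)

print(remove_comment(' "Phone #":"555-1234" '))           #  "Phone #":"555-1234" 
print(remove_comment(' "Phone "#:"555-1234" '))           #  "Phone "
print(remove_comment('#"Phone #":"555-1234" '))           # (empty string)
print(remove_comment(' "Phone #":"555-1234" #Comment'))   #  "Phone #":"555-1234" 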
This code was so ugly, I had to post it.
def remove_comments(text):
    char_list = list(text)
    in_str = False
    deleting = False
    for i, c in enumerate(char_list):
        if deleting:
            if c == '\n':
                deleting = False
            else:
                char_list[i] = None
        elif c == '"':
            in_str = not in_str
        elif c == '#':
            if not in_str:
                deleting = True
                char_list[i] = None
    char_list = filter(lambda x: x is not None, char_list)
    return ''.join(char_list)
Seems to work, though. Although I'm not sure how it might handle newline characters between Windows and Linux.
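For what it's worth, it does handle the examples from the question, for instance:
print(remove_comments(' "Phone #":"555-1234" #Comment'))
#  "Phone #":"555-1234" 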