Tricky String Normalization in Python

I'm sure this will be easy pickings for more experienced programmers than I, but this problem is bedeviling me and I've made a couple of failed attempts, so I wanted to see what other people might come up with.
I have about a hundred strings that look something like this:
(argument1 OR argument2) | inputlookup my_lookup.csv | `macro1(tag,bunit)` | `macro2(category)` | `macro_3(tag,\"expected\",category)` | `macro4(tag,\"timesync\")`
The goal is to find the arguments to the macro function and replace them with the count of the arguments, so that the final output looks like this:
(argument1 OR argument2) | inputlookup my_lookup.csv | `macro1(2)` | `macro2(1)` | `macro_3(3)` | `macro4(2)`
Python has ways of obtaining the count I need (I was simply counting up the number of commas in a string and adding 1), and Python has plenty of regex-type solutions for inline string replacement, but for the life of me I can't figure out how to combine them.
It seems something like re.sub won't let me identify a substring, count the number of commas in the substring, and then replace the substring with that value (unless I am missing something in the docs).
Can anybody think of a way to do this? Have I missed something obvious?

Solution:
import re

def count_commas(input_str):
    c = 0
    for s in input_str:
        if s == ',':
            c += 1
    return c

pattern = r'\([A-Za-z0-9,""]+\)'
original_str = '(argument1 OR argument2) | inputlookup my_lookup.csv | `macro1(tag,bunit)` | `macro2(category)` | `macro_3(tag,\"expected\",category)` | `macro4(tag,\"timesync\")`'
matches = re.findall(pattern, original_str)
for match in matches:
    comma_count = count_commas(match) + 1
    # escape '(' and ')' so the match can itself be used as a regex pattern
    match = match.replace('(', r'\(').replace(')', r'\)')
    original_str = re.sub(match, '(' + str(comma_count) + ')', original_str)
print(original_str)
Explanation:
pattern : r'\([A-Za-z0-9,""]+\)' - the backslashes escape the special regex characters '(' and ')', and the character class in the square brackets matches alphanumerics, commas, and quotation marks; the trailing '+' means one or more repetitions of those symbols.
matches : the list of all matches found, e.g. (tag,bunit).
Then I loop over all the matches to count the commas in each match, replacing '(' with '\(' and ')' with '\)' so the match can itself be used as a regex pattern.
Finally, in the last line of the loop, re.sub replaces the matched string in the original string with the comma count.
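For what it's worth, re.sub does support exactly the "find a substring, count its commas, replace it with that value" step the question asks about: the replacement argument can be a function that receives each match object. A minimal sketch using the same character-class idea as the pattern above:

```python
import re

original_str = ('(argument1 OR argument2) | inputlookup my_lookup.csv | '
                '`macro1(tag,bunit)` | `macro2(category)` | '
                '`macro_3(tag,"expected",category)` | `macro4(tag,"timesync")`')

# The callable receives each match; count the commas in the captured
# argument list and emit the count in parentheses instead.
result = re.sub(r'\(([A-Za-z0-9,"]+)\)',
                lambda m: '(' + str(m.group(1).count(',') + 1) + ')',
                original_str)
print(result)
```

This does the whole job in one pass, with no need to escape matches and re-scan the string for each one.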

Related

Split string into functions on parentheses, but not subfunctions

I am cleaning a data set that consists of concatenated function calls strings that look like this: "hello(data=x, capitalize = True)there()my(x = x)dear(x, 6L, ...)friend(x = c(1, 2, 3))". The goal is to split such a string into separate list elements, so that every function stands on its own.
So far I can split all functions that do not contain a subfunction (such as "c(1,2,3)") using regex:
import re
s="hello(data=x, capitalize = True)there()my(x = x)dear(x, 6L, ...)"
t = re.findall(r"\w+\(.*?\)", s)
['hello(data=x, capitalize = True)', 'there()', 'my(x = x)', 'dear(x, 6L, ...)']
I am however stuck when a subfunction is included inside a function call, such as "friend(x = c(1, 2, 3))", where the function is then split in half at the subfunction instead of being preserved.
Is it possible to leave functions that contain other functions as substring intact using regex?
This can be done without regex, just by keeping a tally of how ( and ) are balanced. I don't know where that string comes from, and I want to caveat this answer: it's pretty crude and brittle, not my finest work. Then again, I suspect a regex approach would be too. It does what you want, but a file like this probably contains more complex grammar than you've shown.
s = "hello(data=x, capitalize = True)there()my(x = x)dear(x, 6L, ...)friend(x = c(1, 2, 3))"

open_count = 0
close_count = 0
last_index = 0
rebuilt = []
for i, char in enumerate(s):
    if char == '(':
        open_count += 1
    elif char == ')':
        close_count += 1
    if open_count != 0 and open_count == close_count:
        rebuilt.append(s[last_index:i+1])
        open_count = 0
        close_count = 0
        last_index = i+1
print(rebuilt)
If you could assume that your string is syntactically correct Python code except for a lack of newlines between function calls, you could repeatedly parse the string, catching SyntaxError exceptions and using them to split it into a valid function call and the remainder of the code to check. You mention in a comment that your input is actually a stream of R function calls, so a Python parser may not work on it, but the same approach is valid if you can find an R parser that reports the same kind of information on a syntax error.
import ast

# s holds the concatenated call string from the question
calls = []
while True:
    try:
        ast.parse(s)
    except SyntaxError as exc:
        i = exc.offset - 1
        calls.append(s[:i])
        s = s[i:]
    else:
        calls.append(s)
        break
You can do it using the pypi/regex module (a regex module with more advanced features, such as references to subpatterns, which allow recursion, and backtracking control verbs).
import regex
s='''hello(data=x, capitalize = True)there()my(x = x)dear(x, 6L, ...)friend(x = c(1, 2, 3))
hello('little)bobby(tables')
'inastring(blablubli)'
'''
pattern = r'''(?x)
# subpatterns definitions
(?(DEFINE)
    (?<string> '{3} [^'\\]*+ (?s: \\. [^'\\]* | ''? (?!') [^'\\]* )*+ (?:'{3} | ['\\]* \z )
             | "{3} [^"\\]*+ (?s: \\. [^"\\]* | ""? (?!") [^"\\]* )*+ (?:"{3} | ["\\]* \z )
             | ' [^'\\]*+ (?s: \\. [^'\\]* )*+ (?:' | \z )
             | " [^"\\]*+ (?s: \\. [^"\\]* )*+ (?:" | \z )
    )
    (?<parens> \( [^'"()]*+ (?: (?&string) [^'"()]* | (?&parens) [^'"()]* )*+ (?: \) | \z )
    )
)
# main pattern
(?&string) (*SKIP)(*FAIL) # to ignore all that is in a string
|
\w+ (?&parens)'''
print(regex.findall(pattern, s))
Note that this pattern is designed for Python syntax (with triple-quoted strings); feel free to change the string subpattern according to the target language.
This pattern shows you how to deal with strings; in the same way, you can add support for comments.

Regex group doesn't match with "?" even if it should

input strings:
"| VLAN56 | LAB06 | Labor 06 | 56 | 172.16.56.0/24 | VLAN56_LAB06 | ✔️ | |",
"| VLAN57 | LAB07 | Labor 07 | 57 | 172.16.57.0/24 | VLAN57_LAB07 | ✔️ | ##848484: |"
regex:
'\|\s+(\d+).+(VLAN\d+_[0-9A-Za-z]+)\s+\|.+(#[0-9A-Fa-f]{6})?'
The goal is to get the VLAN number, hostname, and if there is one, the color code, but with a "?" it ignores the color code every time, even when it should match.
With the "?" the last capture group is always None.
You may use this regex:
\|\s+(\d+).+(VLAN\d+_[0-9A-Za-z]+)\s+\|[^|]+\|[^#|]*(#[0-9A-Fa-f]{6})?
You have a demo here: https://regex101.com/r/SWe42v/1
The reason why it didn't work with your regex is that .+ is a greedy quantifier: it matches as much as it can.
So, when you added the ? to the last part of the regex, you gave the engine no reason to backtrack. The .+ matches the rest of the string/line and the group captures nothing (which is permitted, because it is optional).
To fix it, you can simply match the column with the emoji. You don't care about its content, so you use \|[^|]+ to skip the column.
This sort of construct is widely used in regexes: SEPARATOR[^SEPARATOR]*
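The failure mode described above is easy to reproduce; here is a minimal sketch running the original pattern against the second sample line. The optional group participates as the empty alternative, so it comes back as None:

```python
import re

s = "| VLAN57 | LAB07 | Labor 07 | 57 | 172.16.57.0/24 | VLAN57_LAB07 | ✔️ | ##848484: |"

# Original pattern: the greedy .+ consumes the rest of the line, and since
# the trailing group is optional the engine never backtracks to fill it.
m = re.search(r'\|\s+(\d+).+(VLAN\d+_[0-9A-Za-z]+)\s+\|.+(#[0-9A-Fa-f]{6})?', s)
print(m.group(1), m.group(2), m.group(3))  # 57 VLAN57_LAB07 None
```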
The reason why the last capture group is None is that the preceding .+ can consume the rest of the line.
I would however first use the fact that this is a pipe-separated format: split on the pipe symbol, then retrieve the elements of interest from the result by their index:
import re
s = "| VLAN57 | LAB07 | Labor 07 | 57 | 172.16.57.0/24 | VLAN57_LAB07 | ✔️ | ##848484: |"
vlan,name,color = re.split(r"\s*\|\s*", s)[4:9:2]
print(vlan, name, color)
This code is in my opinion easier to read and to maintain.
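To make the slice indices concrete, this is the intermediate list that the split produces on the second sample line (the empty strings at both ends come from the leading and trailing pipes), from which [4:9:2] picks out exactly the three fields of interest:

```python
import re

s = "| VLAN57 | LAB07 | Labor 07 | 57 | 172.16.57.0/24 | VLAN57_LAB07 | ✔️ | ##848484: |"
fields = re.split(r"\s*\|\s*", s)
print(fields)
# ['', 'VLAN57', 'LAB07', 'Labor 07', '57', '172.16.57.0/24', 'VLAN57_LAB07', '✔️', '##848484:', '']
print(fields[4:9:2])  # ['57', 'VLAN57_LAB07', '##848484:']
```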
I think this is what you're after: Demo
^\|\s+(VLAN[0-9A-Za-z]+)\s+\|\s+([0-9A-Za-z]+)\s+\|.*((?<=\#)[0-9A-Fa-f]{6})?.*$
^\|\s+ - the start of the line must be a pipe followed by some whitespace.
(VLAN[0-9A-Za-z]+) - What comes next is the VLAN - so we capture it; with the VLAN and all (at least 1) following alpha-numeric chars.
\s+\|\s+ - there's then another pipe delimiter, with whitespace on either side.
([0-9A-Za-z]+) - the column after the vlan name is the device name; so we capture the alphanumeric value from that.
\s+\| - after our device there's more whitespace and then the delimiter
.* - following that there's a load of stuff that we're not interested in; could be anything.
((?<=\#)[0-9A-Fa-f]{6})? - next there may be a 6 hex char value preceded by a hash; we want to capture only the hex value part.
(...) says this is another capture group
(?<=\#) is a positive look behind; i.e. checks that we're preceded by some value (in this case #) but doesn't include it within the surrounding capture
[0-9A-Fa-f]{6} is the 6 hex chars to capture
? after the parenthesis says there's 0 or 1 of these (i.e. it's optional); so if it's there we capture it, but if it's not that's not an issue.
.*$ says we can have whatever else through to the end of the string.
We could strip a few of those bits out, or add more in. E.g. if we know exactly which column everything will be in, we can massively simplify the pattern by just capturing content from those columns:
^\|\s*([^\|\s]+)\s*\|\s*([^\|\s]+)\s*\|\s*[^\|]*\s*\|\s*[^\|\s]*\s*\|\s*[^\|\s]*\s*\|\s*[^\|\s]*\s*\|\s*[^\|\s]*\s*\|\s*[^\|\d]*(\d{6})?[^\|]*\s*\|$
...but amend per your requirements / whatever feels most robust and suitable for your purposes.

Python - string replace between parentheses with wildcards

I am trying to remove some text from a string. What I want to remove could be any of the examples listed below. Basically any combination of uppercase and lowercase, any combination of integers at the end, and any combination of letters at the end. There could also be a space between or not.
(Disk 1)
(Disk 5)
(Disc2)
(Disk 10)
(Part A)
(Pt B)
(disk a)
(CD 7)
(cD X)
I have a method already to get the beginning "(type"
multi_disk_search = ['(disk', '(disc', '(part', '(pt', '(prt']
if any(mds in fileName.lower() for mds in multi_disk_search):  # https://stackoverflow.com/a/3389611
    for mds in multi_disk_search:
        if mds in fileName.lower():
            print(mds)
            break
That returns (disc for example.
I cannot just split by the parenthesis because there could be other tags in other parenthesis. Also there is no specific order to the tags. The one I am searching for is typically last; however many times it is not.
I think the solution will require regex, but I'm really lost when it comes to that.
I tried this, but it returns something that doesn't make any sense to me.
regex = re.compile(r"\s*\%s\s*" % (mds), flags=re.I) #https://stackoverflow.com/a/20782251/11214013
regex.split(fileName)
newName = regex
print(newName)
Which returns re.compile('\\s*\\(disc\\s*', re.IGNORECASE)
What are some ways to solve this?
Perhaps something like this:
rx = re.compile(r'''
    \(
    (?: dis[ck] | p(?:a?r)?t )
    [ ]?
    (?: [a-z]+ | [0-9]+ )
    \)''', re.I | re.X)
This pattern uses only basic regex syntax, except perhaps the X flag (verbose mode: any whitespace in the pattern is ignored unless it is escaped or inside a character class). Feel free to read the Python manual on the re module. Adding support for CD is left as an exercise.
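A quick usage sketch of that compiled pattern; the filenames here are made up for illustration:

```python
import re

rx = re.compile(r'''
    \(
    (?: dis[ck] | p(?:a?r)?t )
    [ ]?
    (?: [a-z]+ | [0-9]+ )
    \)''', re.I | re.X)

# Hypothetical filenames, just to exercise the pattern.
print(rx.sub('', 'My Movie (Disk 1).mkv'))   # My Movie .mkv
print(rx.sub('', 'Show (part B).avi'))       # Show .avi
```

Note that the tag's space survives as the space before the extension; you could wrap the pattern in ` ?...` to consume it as well.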
>>> import re
>>> def remove_parens(s, multi_disk_search):
...     mds = '|'.join([re.escape(x) for x in multi_disk_search])
...     return re.sub(rf'\s*\((?:{mds})[0-9A-Za-z ]*\)', '', s, flags=re.I)
...
>>> multi_disk_search = ['disk', 'cd', 'disc', 'part', 'pt']
>>> remove_parens('this is a (disc a) string with (123xyz) parens removed', multi_disk_search)
'this is a string with (123xyz) parens removed'
(The leading \s* also consumes the space before the tag, so no double space is left behind.)

python regex splitting at wrong places

I have the following regex and the input string.
pattern = re.compile(r'\s+(?=[^()|^{}|^<>]*(?:\(|\{|\<|$))')
string = "token1 token2 {a | op (b|c) | d}"
print pattern.split(string)
the result is : ["token1","token2","{a | op","(b|c) |d}"]
I want the regex to give the following result : ["token1","token2","{a | op (b|c) | d}"]
string = "token1 token2 {a | op (b|c) | d}"
re.findall(r'\w+|\{.*}',string)
output:
['token1', 'token2', '{a | op (b|c) | d}']
You can simply split on this pattern:
\s+(?![^{]*\})
See demo.
https://regex101.com/r/WjQVqZ/1
The raw pattern to use with the split method is r'\s+(?=[^\}]*(?:\{|$))'.
Every time whitespace is encountered, look ahead: if an opening curly brace or the end of the string can be reached without first passing a closing curly brace, the whitespace is not inside braces and is safe to split on.
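A minimal check of that pattern against the sample string from the question:

```python
import re

string = "token1 token2 {a | op (b|c) | d}"
print(re.split(r'\s+(?=[^\}]*(?:\{|$))', string))
# ['token1', 'token2', '{a | op (b|c) | d}']
```

The whitespace inside the braces never satisfies the lookahead, because the closing } is reached before any { or the end of the string.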

How can I trim the whitespace immediately preceding and following a given character in python?

I have a bunch of strings that look like this
s = '| this is | my | made up string '
I'd like to write a function that removes all the whitespace immediately preceding and following the |'s. So running myFunc(s, '|') would return
'|this is|my|made up string '
Obviously strip() is too powerful, as I'd like to respect some of the whitespace. How can I do this in python?
You can split the string at |, then trim each substring, and then join it again:
'|'.join([i.strip() for i in s.split('|')])
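One subtlety worth noting with the split-and-join approach: strip() also removes whitespace that is not adjacent to a |, so the trailing space of the sample string is lost, unlike in the requested output:

```python
s = '| this is | my | made up string '

# Split at the pipes, trim each piece, and glue them back together.
joined = '|'.join([i.strip() for i in s.split('|')])
print(repr(joined))  # '|this is|my|made up string' -- trailing space is gone
```

The regex-replace answer below only touches whitespace next to a pipe, so it preserves that trailing space.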
Use regex replace.
import re
s = '| this is | my | made up string '
print(re.sub(r'\s*\|\s*', '|', s))
Will give this output -
'|this is|my|made up string '
