Matching a character class multiple times in a string - python

I am writing a short script to sanitise folder and file names for upload to SharePoint. Since SharePoint is fussy and has some filename rules beyond simple disallowed characters (multiple consecutive periods are disallowed for instance) it seemed like regular expressions were the way to go rather than simple replacement of single characters. One expression that doesn't seem to be working however is:
[/<>*?|:"~#%&{}\\]+
As a simple character class match I would have expected this to work fine, and it appears to do so in notepad++. My expectation was that a string like
St\r/|ng
with the above regex would match \, / and |. However no matter what I do I can only get the string to match the first backslash, or the first of whatever character is in that class that it comes across. This is being done with the Python re library. Does anyone know what the issue is here?
import os, sys, shutil, re
def cleanPath(path):
    #Compiling regex...
    multi_dot = re.compile(r"[\.]{2,}")
    start_dot = re.compile(r"^[\.]")
    end_dot = re.compile(r"[\.]$")
    disallowed_chars = re.compile(r'[/<>*?|:"~#%&{}\\]+')
    dis1 = re.compile(r'\.files$')
    dis2 = re.compile(r'_files$')
    dis3 = re.compile(r'-Dateien$')
    dis4 = re.compile(r'_fichiers$')
    dis5 = re.compile(r'_bestanden$')
    dis5 = re.compile(r'_file$')
    dis6 = re.compile(r'_archivos$')
    dis7 = re.compile(r'-filer$')
    dis8 = re.compile(r'_tiedostot$')
    dis9 = re.compile(r'_pliki$')
    dis10 = re.compile(r'_soubory$')
    dis11 = re.compile(r'_elemei$')
    dis12 = re.compile(r'_ficheiros$')
    dis13 = re.compile(r'_arquivos$')
    dis14 = re.compile(r'_dosyalar$')
    dis15 = re.compile(r'_datoteke$')
    dis16 = re.compile(r'_fitxers$')
    dis17 = re.compile(r'_failid$')
    dis18 = re.compile(r'_fails$')
    dis19 = re.compile(r'_bylos$')
    dis20 = re.compile(r'_fajlovi$')
    dis21 = re.compile(r'_fitxategiak$')
    regxlist = [multi_dot,start_dot,end_dot,disallowed_chars,dis1,dis2,dis3,dis4,dis5,dis5,dis6,dis7,dis8,dis9,dis10,dis11,dis12,dis13,dis14,dis15,dis16,dis17,dis18,dis19,dis20,dis21]
    print("************************************\n\n"+path+"\n\n************************************\n")
    for x in regxlist:
        match = x.search(path)
        if match:
            print("\n")
            print("MATCHED")
            print(match.group())
            print("___________________________________________________________________________")
    return path
#testlist of conditions that should be found, some OK, some bad
testlist = ["string","str....ing","str..ing","str.ing",".string","string.",".string.","$tring",r"st\r\ing","st/r/ing",r"st\r/|ng","/str<i>ng","str.filesing","string.files"]
testlist_ans = ["OK","Match ....","Match ..","OK","Match .","Match .","Match . .","OK",r"Match \ ","Match /",r"Match \/|","Match / < >","OK","Match .files"]
count = 0
for i in testlist:
    print(testlist_ans[count])
    count = count + 1
    cleanPath(i)

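Which function from the re module are you using? search() stops at the first match, so it can never report the later ones.
To get every match, use re.findall:

re.sub(pattern, new_txt, subject)  # replace all instances of pattern with new_txt
re.findall(pattern, subject)  # find all instances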
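
As a minimal sketch of the difference, here is the character class from the question applied to the test string `st\r/|ng`. The names `DISALLOWED` and `name`, and the choice of "_" as the replacement character, are only illustrative assumptions:

```python
import re

# Character class taken from the question; the names are illustrative.
DISALLOWED = re.compile(r'[/<>*?|:"~#%&{}\\]+')

name = r"st\r/|ng"

# search() reports only the first run of disallowed characters.
print(DISALLOWED.search(name).group())  # \

# findall() reports every run; "/|" is one run because of the trailing "+".
print(DISALLOWED.findall(name))         # ['\\', '/|']

# sub() replaces every run, which is what a sanitiser usually wants.
print(DISALLOWED.sub("_", name))        # st_r_ng
```

Note that the `r` in the string separates the backslash from `/|`, so the class matches two separate runs rather than one.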

Related

Python - Possibly Regex - How to replace part of a filepath with another filepath based on a match?

I'm new to Python and relatively new to programming. I'm trying to replace part of a file path with a different file path. If possible, I'd like to avoid regex as I don't know it. If not, I understand.
I want an item in the Python list [] before the word PROGRAM to be replaced with the 'replaceWith' variable.
How would you go about doing this?
Current Python List []
item1ToReplace1 = \\server\drive\BusinessFolder\PROGRAM\New\new.vb
item1ToReplace2 = \\server\drive\BusinessFolder\PROGRAM\old\old.vb
Variable to replace part of the Python list path
replaceWith = 'C:\ProgramFiles\Microsoft\PROGRAM'
Desired results for Python List []:
item1ToReplace1 = C:\ProgramFiles\Microsoft\PROGRAM\New\new.vb
item1ToReplace2 = C:\ProgramFiles\Microsoft\PROGRAM\old\old.vb
Thank you for your help.
The following code does what you ask. Note that I changed each '\' in your paths to '\\': the backslash is the escape character in Python string literals, so you need to account for it in your code.
import os
item1ToReplace1 = '\\server\\drive\\BusinessFolder\\PROGRAM\\New\\new.vb'
item1ToReplace2 = '\\server\\drive\\BusinessFolder\\PROGRAM\\old\\old.vb'
# Backslashes are doubled here too, so that \P and \M are not read as escapes
replaceWith = 'C:\\ProgramFiles\\Microsoft\\PROGRAM'
keyword = "PROGRAM\\"
def replacer(rp, s, kw):
    # Split once on the keyword and keep only what follows it
    ss = s.split(kw,1)
    if (len(ss) > 1):
        tail = ss[1]
        return os.path.join(rp, tail)
    else:
        return ""
print(replacer(replaceWith, item1ToReplace1, keyword))
print(replacer(replaceWith, item1ToReplace2, keyword))
The code splits on your keyword and puts that on the back of the string you want.
If your keyword is not in the string, your result will be an empty string.
Result:
C:\ProgramFiles\Microsoft\PROGRAM\New\new.vb
C:\ProgramFiles\Microsoft\PROGRAM\old\old.vb
One way would be:
item_ls = item1ToReplace1.split("\\")
idx = item_ls.index("PROGRAM")
result = ["C:", "ProgramFiles", "Microsoft"] + item_ls[idx:]
result = "\\".join(result)
Resulting in:
>>> item1ToReplace1 = r"\\server\drive\BusinessFolder\PROGRAM\New\new.vb"
... # the above
>>> result
'C:\\ProgramFiles\\Microsoft\\PROGRAM\\New\\new.vb'
Note the use of r"..." (a raw string) so that the backslashes in the input do not need to be escaped. The arguments to split and join are ordinary string literals, so there the backslash has to be written as a double backslash ("\\").
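
Another regex-free option is str.partition, which splits a string around the first occurrence of a keyword. The sketch below is only an illustration: the function name `replace_prefix` and its return-None-on-miss behaviour are assumptions, not part of either answer above.

```python
def replace_prefix(path, keyword, new_root):
    """Replace everything up to and including `keyword` with `new_root`.

    Hypothetical helper: returns None when the keyword is not in the path.
    """
    head, found, tail = path.partition(keyword)
    if not found:
        return None
    return new_root + tail

# Raw strings keep the backslashes literal.
old = r"\\server\drive\BusinessFolder\PROGRAM\New\new.vb"
new_root = r"C:\ProgramFiles\Microsoft\PROGRAM"

print(replace_prefix(old, "PROGRAM", new_root))
```

Because partition uses the first occurrence, a path that contains the keyword earlier (for example inside a server or folder name) would be cut at that earlier position.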

Generating multiple strings by replacing wildcards

I have the following strings:
"xxxxxxx#FUS#xxxxxxxx#ACS#xxxxx"
"xxxxx#3#xxxxxx#FUS#xxxxx"
I want to generate the following strings from this pattern (using the second example), assuming #FUS# represents 2:
"xxxxx0xxxxxx0xxxxx"
"xxxxx0xxxxxx1xxxxx"
"xxxxx0xxxxxx2xxxxx"
"xxxxx1xxxxxx0xxxxx"
"xxxxx1xxxxxx1xxxxx"
"xxxxx1xxxxxx2xxxxx"
"xxxxx2xxxxxx0xxxxx"
"xxxxx2xxxxxx1xxxxx"
"xxxxx2xxxxxx2xxxxx"
"xxxxx3xxxxxx0xxxxx"
"xxxxx3xxxxxx1xxxxx"
"xxxxx3xxxxxx2xxxxx"
Basically, given a string like the ones above, I want to generate multiple strings by replacing each wildcard with every value in the range it represents. A wildcard can be a name such as #FUS# or #WHATEVER#, or a number such as #20#.
I've managed to write a regex that finds the wildcards:
wildcardRegex = f"(#FUS#|#WHATEVER#|#([0-9]|[1-9][0-9]|[1-9][0-9][0-9])#)"
This correctly finds the target wildcards.
With only one wildcard present it's easy:
re.sub()
With more than one it gets complicated. Or maybe it was a long day...
I think my logic is failing because I can't work out how to write code that generates all the signals. I suspect I need some kind of recursive function that is called once per wildcard present (there may be up to about 4, e.g. xxxxx#2#xxx#2#xx#FUS#xx#2#x).
I need a list of resulting signals.
Is there any easy way to do this that I'm completely missing?
Thanks.
import re
stringV1 = "xxx#FUS#xxxxi#3#xxx#5#xx"
stringV2 = "XXXXXXXXXX#FUS#XXXXXXXXXX#3#xxxxxx#5#xxxx"
regex = "(#FUS#|#DSP#|#([0-9]|[1-9][0-9]|[1-9][0-9][0-9])#)"
WILDCARD_FUS = "#FUS#"
RANGE_FUS = 3
def getSignalsFromWildcards(app, can):
    sigList = list()
    if WILDCARD_FUS in app:
        for i in range(RANGE_FUS):
            # Replace only the first occurrence, then recurse for the rest
            outAppSig = app.replace(WILDCARD_FUS, str(i), 1)
            outCanSig = can.replace(WILDCARD_FUS, str(i), 1)
            if "#" in outAppSig:
                newSigList = getSignalsFromWildcards(outAppSig, outCanSig)
                sigList += newSigList
            else:
                sigList.append((outAppSig, outCanSig))
    # Check the string being expanded (app), not the global stringV1
    elif len(re.findall("(#([0-9]|[1-9][0-9]|[1-9][0-9][0-9])#)", app)) > 0:
        wildcard = re.search("(#([0-9]|[1-9][0-9]|[1-9][0-9][0-9])#)", app).group()
        tarRange = int(wildcard.strip("#"))
        for i in range(tarRange):
            outAppSig = app.replace(wildcard, str(i), 1)
            outCanSig = can.replace(wildcard, str(i), 1)
            if "#" in outAppSig:
                newSigList = getSignalsFromWildcards(outAppSig, outCanSig)
                sigList += newSigList
            else:
                sigList.append((outAppSig, outCanSig))
    return sigList
if "#" in stringV1:
    resultList = getSignalsFromWildcards(stringV1, stringV2)
    for item in resultList:
        print(item)
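results in the following (this listing appears to come from a run with the shorter two-wildcard inputs "xxx#FUS#xxxxi#3#xxxxx" and "XXXXXXXXXX#FUS#XXXXXXXXXX#3#xxxxxxxxxx"; the three-wildcard strings in the code produce 3*3*5 = 45 tuples):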
('xxx0xxxxi0xxxxx', 'XXXXXXXXXX0XXXXXXXXXX0xxxxxxxxxx')
('xxx0xxxxi1xxxxx', 'XXXXXXXXXX0XXXXXXXXXX1xxxxxxxxxx')
('xxx0xxxxi2xxxxx', 'XXXXXXXXXX0XXXXXXXXXX2xxxxxxxxxx')
('xxx1xxxxi0xxxxx', 'XXXXXXXXXX1XXXXXXXXXX0xxxxxxxxxx')
('xxx1xxxxi1xxxxx', 'XXXXXXXXXX1XXXXXXXXXX1xxxxxxxxxx')
('xxx1xxxxi2xxxxx', 'XXXXXXXXXX1XXXXXXXXXX2xxxxxxxxxx')
('xxx2xxxxi0xxxxx', 'XXXXXXXXXX2XXXXXXXXXX0xxxxxxxxxx')
('xxx2xxxxi1xxxxx', 'XXXXXXXXXX2XXXXXXXXXX1xxxxxxxxxx')
('xxx2xxxxi2xxxxx', 'XXXXXXXXXX2XXXXXXXXXX2xxxxxxxxxx')
long day after-all...
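
A non-recursive alternative is to split the template on the wildcards and let itertools.product enumerate the combinations. This is only a sketch under stated assumptions: the `NAMED_RANGES` table, the simplified `\d{1,3}` pattern, and the function name `expand_wildcards` are illustrative, and it follows the code above in treating a wildcard of size n as the values 0..n-1.

```python
import itertools
import re

# Hypothetical table: how many values each named wildcard expands to.
NAMED_RANGES = {"FUS": 3}

# Simplified pattern: a known name or a 1-3 digit number between '#' signs.
WILDCARD = re.compile(r"#(FUS|\d{1,3})#")

def expand_wildcards(template):
    """Return every string produced by replacing each wildcard with 0..n-1."""
    # With one capturing group, split() alternates literal text and wildcard names.
    parts = WILDCARD.split(template)
    literals = parts[0::2]
    names = parts[1::2]
    ranges = [
        range(NAMED_RANGES[n]) if n in NAMED_RANGES else range(int(n))
        for n in names
    ]
    results = []
    for combo in itertools.product(*ranges):
        pieces = [literals[0]]
        for value, literal in zip(combo, literals[1:]):
            pieces.append(str(value))
            pieces.append(literal)
        results.append("".join(pieces))
    return results

for signal in expand_wildcards("xxx#FUS#xxxxi#3#xxxxx"):
    print(signal)
```

The number of wildcards no longer matters here, since product takes one range per wildcard found; a template without wildcards comes back unchanged as a one-element list.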

can anybody help me to correct this "argv" program?

import sys
import re
x = sys.argv[1]
y = sys.argv[2]
f = open("formula.txt" ,'r')
line = f.read()
match = re.search(r'x',line,re.M|re.I)
match = re.search(r'y',line,re.M|re.I)
f.close()
print x
print y
I tried the program above but I do not get the output I expect.
The desired output is as follows. When I run the program:
>>>python argument.py circle_area rectangle_area
the output should look like this:
x = 2*3.14*r*r
y = l*b
The file the program reads is formula.txt.
formula.txt contains the following data:
circle_area = '3.14*r*r'
circle_circumference = '2*3.14*r'
rectangle_area = 'l*b'
rectangle_perimeter = '2(l+b)'
Can anybody help me implement the above?
There are several mistakes in your code.
Don't put variable names inside quotes: r'x' searches for the literal letter x, not for the value of x.
Use capturing groups or lookarounds to match the text you want to print.
Call .group() on the match object returned by re.search to get the matched text.
The code should look like this:
import sys
import re
x = sys.argv[1]
y = sys.argv[2]
f = open("formula.txt" ,'r')
line = f.read()
match1 = re.search(x + r"\s*=\s*'([^']*)'" , line, re.M|re.I).group(1)
match2 = re.search(y + r"\s*=\s*'([^']*)'" , line, re.M|re.I).group(1)
f.close()
print match1
print match2
In r"\s*=\s*'([^']*)'", \s* matches zero or more whitespace characters and [^']* matches zero or more characters other than a single quote. The value part is captured into group 1, which we then retrieve by passing that index to group().
First off, you are searching for the literal letters x and y rather than the argument values. Instead do:
match = re.search(r"^%s\s+=\s*'(.*)'" % x, line, re.M|re.I)
Then do something with the match, such as putting the captured value back into the existing variable:
x = match.group(1)
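
Putting the two answers together, here is a Python 3 sketch. To keep it self-contained it searches a string that stands in for the contents of formula.txt; the function name `lookup_formula` and the use of re.escape on the argument are assumptions added for illustration.

```python
import re

# Stand-in for the contents of formula.txt from the question.
FORMULAS = """circle_area = '3.14*r*r'
circle_circumference = '2*3.14*r'
rectangle_area = 'l*b'
rectangle_perimeter = '2(l+b)'
"""

def lookup_formula(name, text):
    """Return the quoted value assigned to `name`, or None if it is absent."""
    # re.escape guards against regex metacharacters in the argument;
    # "^" with re.MULTILINE anchors the name at the start of a line.
    pattern = r"^" + re.escape(name) + r"\s*=\s*'([^']*)'"
    match = re.search(pattern, text, re.MULTILINE)
    return match.group(1) if match else None

print(lookup_formula("circle_area", FORMULAS))     # 3.14*r*r
print(lookup_formula("rectangle_area", FORMULAS))  # l*b
```

In the real script the `name` arguments would come from sys.argv and `text` from reading the file, as in the answers above.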

stripping a pattern from the end of the string

I want to see if a file like test_100.webp exists and then look at the file test.yaml. Therefore, I need to strip the pattern "_100.webp" from the end. I tried to use the code below and it is giving me issues.
for i, image in enumerate(images_in_item):
    if image.endswith("_100.webp"):
        image_strip = image.rstrip(_100.webp)
        snapshot_markup = os.path.join(image_strip + 'yaml')
Do this:
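suffix = '_100.webp'
if image.endswith(suffix):
    # Slice off exactly len(suffix) characters from the end
    image_strip = image[:-len(suffix)]
    # The stripped suffix included the dot, so add it back: test -> test.yaml
    snapshot_markup = os.path.join(image_strip + '.yaml')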
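
The reason rstrip causes trouble is that it treats its argument as a set of characters to remove, not as a suffix. A small sketch of the difference, using a hypothetical file name `image_100.webp` and a hypothetical helper `strip_suffix`:

```python
def strip_suffix(name, suffix):
    """Remove `suffix` from the end of `name` if it is present."""
    if suffix and name.endswith(suffix):
        return name[:-len(suffix)]
    return name

# rstrip removes any trailing characters found in "_100.webp",
# so the final "e" of "image" is eaten as well.
print("image_100.webp".rstrip("_100.webp"))         # imag

# Slicing by the suffix length removes exactly the suffix.
print(strip_suffix("image_100.webp", "_100.webp"))  # image
```

On Python 3.9 and later, str.removesuffix does the same job as the helper.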

Splitting lines in a file into string and hex and do operations on the hex values

I have a large file with several lines like the ones given below. I want to read only those lines that have the _INIT pattern in them, strip the _INIT from the name, and save only the OSD_MODE_15_H part in a variable. Then I need to read the corresponding hex value (8'h00 in this case), strip the 8'h from it, replace it with 0x, and save that in a variable.
I have been trying to strip off the _INIT, the spaces and the =, and the code is becoming really messy.
localparam OSD_MODE_15_H_ADDR = 16'h038d;
localparam OSD_MODE_15_H_INIT = 8'h00
Can you suggest a lean and clean method to do this?
Thanks!
The following solution uses a regular expression (compiled to speed searching up) to match the relevant lines and extract the needed information. The expression uses named groups "id" and "hexValue" to identify the data we want to extract from the matching line.
import re
# Raw string, so that \w and \s reach the regex engine unchanged
expression = r"(?P<id>\w+?)_INIT\s*?=.*?'h(?P<hexValue>[0-9a-fA-F]*)"
regex = re.compile(expression)
def getIdAndValueFromInitLine(line):
    mm = regex.search(line)
    if mm is None:
        return None # Not the ..._INIT parameter or line was empty or other mismatch happened
    else:
        return (mm.groupdict()["id"], "0x" + mm.groupdict()["hexValue"])
EDIT: If I understood the next task correctly, you need to find the hexvalues of those INIT and ADDR lines whose IDs match and make a dictionary of the INIT hexvalue to the ADDR hexvalue.
# lines is assumed to hold the whole file contents as a single string
regex = r"(?P<init_id>\w+?)_INIT\s*?=.*?'h(?P<initValue>[0-9a-fA-F]*)"
init_dict = {}
# finditer yields match objects (findall would return plain tuples without groupdict)
for x in re.finditer(regex, lines):
    init_dict[x.groupdict()["init_id"]] = "0x" + x.groupdict()["initValue"]
regex = r"(?P<addr_id>\w+?)_ADDR\s*?=.*?'h(?P<addrValue>[0-9a-fA-F]*)"
addr_dict = {}
for y in re.finditer(regex, lines):
    addr_dict[y.groupdict()["addr_id"]] = "0x" + y.groupdict()["addrValue"]
init_to_addr_hexvalue_dict = {init_dict[x] : addr_dict[x] for x in init_dict.keys() if x in addr_dict}
Even if this is not what you actually need, having the init and addr dictionaries might make your goal easier to reach. If there are several _INIT (or _ADDR) lines with the same ID and different hex values, the dict approach above will not work in a straightforward way, because later entries overwrite earlier ones.
Try something like this. I'm not sure what all your requirements are, but this should get you close:
with open(someFile, 'r') as infile:
    for line in infile:
        if '_INIT' in line:
            apostropheIndex = line.find("'h")
            clean_hex = '0x' + line[apostropheIndex + 2:]
In the case of "16'h038d;", clean_hex would be "0x038d;" (need to remove the ";" somehow) and in the case of "8'h00", clean_hex would be "0x00"
Edit: if you want to guard against characters like ";" you could do this and test if a character is alphanumeric:
clean_hex = '0x' + ''.join([s for s in line[apostropheIndex + 2:] if s.isalnum()])
You can use a regular expression and the re.findall() function. For example, to generate a list of tuples with the data you want just try:
import re
lines = open("your_file").read()
# Raw string, so that \w, \s and \d reach the regex engine unchanged
regex = r"([\w]+?)_INIT\s*=\s*\d+'h([\da-fA-F]*)"
res = [(x[0], "0x"+x[1]) for x in re.findall(regex, lines)]
print(res)
The regular expression is very specific for your input example. If the other lines in the file are slightly different you may need to change it a bit.
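
As a self-contained check of the approach, the sketch below runs a similar pattern over the two sample lines from the question instead of a file. The names `SAMPLE`, `INIT_LINE` and `parse_init_values` are illustrative assumptions, and the pattern is just as specific to the sample input as the one above.

```python
import re

# The two sample lines from the question, standing in for the real file.
SAMPLE = """localparam OSD_MODE_15_H_ADDR = 16'h038d;
localparam OSD_MODE_15_H_INIT = 8'h00
"""

# name, "_INIT", "=", a bit width, then 'h followed by hex digits
INIT_LINE = re.compile(r"(\w+?)_INIT\s*=\s*\d+'h([0-9a-fA-F]+)")

def parse_init_values(text):
    """Map each parameter name (without _INIT) to its value as a 0x string."""
    return {m.group(1): "0x" + m.group(2) for m in INIT_LINE.finditer(text)}

print(parse_init_values(SAMPLE))  # {'OSD_MODE_15_H': '0x00'}
```

The _ADDR line is skipped because the pattern requires the literal _INIT after the name.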
