I'm reading a file and I need to replace certain empty tags ([[Image:]]).
The problem is every replacement has to be unique.
Here's the code:
import re
import codecs
re_imagematch = re.compile('(\[\[Image:([^\]]+)?\]\])')
wf = codecs.open('converted.wiki', "r", "utf-8")
wikilines = wf.readlines()
wf.close()
imgidx = 0
for i in range(0,len(wikilines)):
    if re_imagematch.search(wikilines[i]):
        print 'MATCH #######################################################'
        print wikilines[i]
        wikilines[i] = re_imagematch.sub('[[Image:%s_%s.%s]]' % ('outname', imgidx, 'extension'), wikilines[i])
        print wikilines[i]
        imgidx += 1
This does not work, as there can be many tags in one line:
Here's the input file.
[[Image:]][[Image:]]
[[Image:]]
This is what the output should look like:
[[Image:outname_0.extension]][[Image:outname_1.extension]]
[[Image:outname_2.extension]]
This is what it currently looks like:
[[Image:outname_0.extension]][[Image:outname_0.extension]]
[[Image:outname_1.extension]]
I tried using a replacement function; the problem is that the function only gets called once per line when used with re.sub.
You can use itertools.count here and take advantage of the fact that default arguments are evaluated when the function is created, so the value of a mutable default argument persists between function calls.
from itertools import count
def rep(m, cnt=count()):
    return '[[Image:%s_%s.%s]]' % ('outname', next(cnt), 'extension')
This function will be invoked for each match found and it'll use a new value for each replacement.
So, you simply need to change this line in your code:
wikilines[i] = re_imagematch.sub(rep, wikilines[i])
Demo:
def rep(m, count=count()):
    return str(next(count))
>>> re.sub(r'a', rep, 'aaa')
'012'
To get the current counter value:
>>> from copy import copy
>>> next(copy(rep.__defaults__[0])) - 1
2
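For reference, here is a minimal sketch of how the counter-based rep function could slot into the original loop, assuming the same file name and naming scheme as in the question:

import re
import codecs
from itertools import count

re_imagematch = re.compile(r'(\[\[Image:([^\]]+)?\]\])')

def rep(m, cnt=count()):
    # Every match draws a fresh index from the shared counter
    return '[[Image:%s_%s.%s]]' % ('outname', next(cnt), 'extension')

wf = codecs.open('converted.wiki', "r", "utf-8")
wikilines = [re_imagematch.sub(rep, line) for line in wf]
wf.close()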
I'd use a simple string replacement wrapped in a while loop:
s = '[[Image:]][[Image:]]\n[[Image:]]'
pattern = '[[Image:]]'
i = 0
while s.find(pattern) >= 0:
    s = s.replace(pattern, '[[Image:outname_' + str(i) + '.extension]]', 1)
    i += 1
print s
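Run against the sample input from the question, this prints the desired result:
[[Image:outname_0.extension]][[Image:outname_1.extension]]
[[Image:outname_2.extension]]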
Related
I'm new to Python and relatively new to programming. I'm trying to replace part of a file path with a different file path. If possible, I'd like to avoid regex as I don't know it. If not, I understand.
I want the part of each item in the Python list up to and including the word PROGRAM to be replaced with the 'replaceWith' variable.
How would you go about doing this?
Current Python List []
item1ToReplace1 = \\server\drive\BusinessFolder\PROGRAM\New\new.vb
item1ToReplace2 = \\server\drive\BusinessFolder\PROGRAM\old\old.vb
Variable to replace part of the Python list path
replaceWith = 'C:\ProgramFiles\Microsoft\PROGRAM'
Desired results for Python List []:
item1ToReplace1 = C:\ProgramFiles\Microsoft\PROGRAM\New\new.vb
item1ToReplace2 = C:\ProgramFiles\Microsoft\PROGRAM\old\old.vb
Thank you for your help.
The following code does what you ask. Note that I changed your '\' to '\\'; you probably need to account for the backslash in your code, since it is used as an escape character in Python.
import os
item1ToReplace1 = '\\server\\drive\\BusinessFolder\\PROGRAM\\New\\new.vb'
item1ToReplace2 = '\\server\\drive\\BusinessFolder\\PROGRAM\\old\\old.vb'
replaceWith = 'C:\ProgramFiles\Microsoft\PROGRAM'
keyword = "PROGRAM\\"
def replacer(rp, s, kw):
    ss = s.split(kw, 1)
    if (len(ss) > 1):
        tail = ss[1]
        return os.path.join(rp, tail)
    else:
        return ""
print(replacer(replaceWith, item1ToReplace1, keyword))
print(replacer(replaceWith, item1ToReplace2, keyword))
The code splits the path on your keyword and joins the tail that follows the keyword onto the replacement path you want.
If your keyword is not in the string, your result will be an empty string.
Result:
C:\ProgramFiles\Microsoft\PROGRAM\New\new.vb
C:\ProgramFiles\Microsoft\PROGRAM\old\old.vb
One way would be:
item_ls = item1ToReplace1.split("\\")
idx = item_ls.index("PROGRAM")
result = ["C:", "ProgramFiles", "Micosoft"] + item_ls[idx:]
result = "\\".join(result)
Resulting in:
>>> item1ToReplace1 = r"\\server\drive\BusinessFolder\PROGRAM\New\new.vb"
... # the above
>>> result
'C:\\ProgramFiles\\Microsoft\\PROGRAM\\New\\new.vb'
Note the use of r"..." to avoid having to 'escape the escape characters' in your input (i.e. the \). Also note that the join/split still requires you to escape these characters with a double backslash.
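To make the escaping point concrete, here is a tiny illustrative check (my own example, not part of the original answer):

# Both literals spell out the same path; the raw string avoids
# doubling every backslash by hand.
escaped = 'C:\\ProgramFiles\\Microsoft\\PROGRAM'
raw = r'C:\ProgramFiles\Microsoft\PROGRAM'
print(escaped == raw)  # True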
I have a code similar to this:
s = some_template
s = s.replace('<HOLDER1>',data1)
s = s.replace('<HOLDER2>',data2)
s = s.replace('<HOLDER3>',data3)
... #(about 30 similar lines)
where data1/data2/etc. are often calls to functions or complex expressions that might take a while to calculate. For example:
s = some_template
s = s.replace('<HOLDER4>',long_func4(a,b,'some_flag') if c==1 else '')
s = s.replace('<HOLDER5>',long_func5(d,e).replace('.',''))
s = s.replace('<HOLDER6>',self.attr6)
s = s.replace('<HOLDER7>',f'{self.name}_{get_cur_month()}')
... #(about 30 similar lines)
In order to save on runtime, I want the new value to be calculated only if the old value is actually found in s. This can be achieved by:
if '<HOLDER1>' in s:
    s = s.replace('<HOLDER1>',data1)
if '<HOLDER2>' in s:
    s = s.replace('<HOLDER2>',data2)
if '<HOLDER3>' in s:
    s = s.replace('<HOLDER3>',data3)
...
But I don't like this solution because it doubles the number of lines of code, which gets really messy, and it also searches s twice for each holder.
Any ideas?
Thanks!
str is immutable: you can't change it in place, you can only create a new instance.
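For example, replace() returns a new string and leaves the original untouched:
>>> s = 'this_will_be_replaced'
>>> s.replace('this_will_be_replaced', 'by_this')
'by_this'
>>> s
'this_will_be_replaced'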
You could do something like:
def replace_many(replacements, s):
    for pattern, replacement in replacements:
        s = s.replace(pattern, replacement)
    return s
without_replacements = 'this_will_be_replaced, will it?'
replacements = [('this_will_be_replaced', 'by_this')]
with_replacements = replace_many(replacements, without_replacements)
You can easily make it lazy:
def replace_many_lazy(replacements, s):
    for pattern, replacement_func in replacements:
        if pattern in s:
            s = s.replace(pattern, replacement_func())
    return s
without_replacements = 'this_will_be_replaced, will it?'
replacements = [('this_will_be_replaced', lambda: 'by_this')]
with_replacements = replace_many_lazy(replacements, without_replacements)
...now you don't do the expensive computation unless necessary.
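Applied to the question's placeholders, the call might look roughly like this (a sketch that assumes it runs in the same method context as the question's snippet, where long_func4, long_func5, self and the other names are defined):

replacements = [
    ('<HOLDER4>', lambda: long_func4(a, b, 'some_flag') if c == 1 else ''),
    ('<HOLDER5>', lambda: long_func5(d, e).replace('.', '')),
    ('<HOLDER6>', lambda: self.attr6),
    ('<HOLDER7>', lambda: f'{self.name}_{get_cur_month()}'),
]
s = replace_many_lazy(replacements, some_template)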
I want to extract the code written under a specified function. I am trying to do it like this:
With an example file TestFile.py containing the following function sub():
def sub(self,num1,num2):
    # Subtract two numbers
    answer = num1 - num2
    # Print the answer
    print('Difference = ',answer)
If I run get_func_data.py:
def giveFunctionData(data, function):
    dataRequired = []
    for i in range(0, len(data)):
        if data[i].__contains__(str(function)):
            startIndex = i
            for p in range(startIndex + 1, len(data)):
                dataRequired.append(data[p])
                if data[p].startswith('\n' + 'def'):
                    dataRequired.remove(dataRequired[len(dataRequired) - 1])
                    break
    print(dataRequired)
    return dataRequired

data = []
f = open("TestFile.py", "r")
for everyLine in f:
    if not(everyLine.startswith('#') or everyLine.startswith('\n' + '#')):
        data.append(everyLine)

giveFunctionData(data,'sub') # Extract content in sub() function
I expect to obtain the following result:
answer = num1 - num2
print('Difference = ',answer)
But here I get the comments written inside the function as well. Also, instead of a list, is there a way to get the code exactly as it is written in the file?
Returning a string from your function giveFunctionData()
In your function giveFunctionData you're instantiating the variable dataRequired as a list and returning it after filling it, so of course you're getting a list back.
You'd have to unpack the list back into a string. One way could be this:
# Unpack the list into a string
function_content = ''
for line in dataRequired:
    function_content += line + '\n'
# function_content now contains your desired string
The reason you're still getting comment lines
Iterating over a file object instantiated via open() gives you the file's lines with \n already used as the delimiter, so no line ever starts with a newline. As a result, there is no '\n#' for .startswith('\n' + '#') to find, and the comment lines inside the function are indented, so .startswith('#') never matches them either.
General comments
There is no need to specify the newline and # characters separately like you did in .startswith('\n' + '#'); '\n#' would have been fine.
If you intend for the file to be run as a script, you really should put the code to be run in an if __name__ == "__main__": block. See What does if __name__ == "__main__": do?
It might be cleaner to move the reading of the file object into your giveFunctionData() function. It also eliminates having to iterate over it multiple times.
Putting it all together
Note that this script isn't able to handle comments placed on the same line as code (e.g. a line like some = statement  # with a comment will be dropped entirely rather than just having its comment stripped).
def giveFunctionData(data, function):
    function_content = ''
    # Tells us whether to append lines to the `function_content` string
    record_content = False
    for line in data:
        if not record_content:
            # Once we find a match, we start recording lines
            if function in line:
                record_content = True
        else:
            # We keep recording until we encounter another function
            if line.startswith('def'):
                break
            elif line.isspace():
                continue
            elif '#' not in line:
                # Add line to `function_content` string
                function_content += line
    return function_content

if __name__ == "__main__":
    data = []
    script = open("TestFile.py")
    output = giveFunctionData(script, 'sub')
    print(output)
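Run against the TestFile.py from the question, this should print the function body with its original indentation preserved, roughly:
    answer = num1 - num2
    print('Difference = ',answer)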
I have written code which can do your task. I don't think you need two different processing parts (one for the function and one for the code) to fetch the data.
You can do one thing: create a function which accepts 2 arguments, i.e. the file name and the function name. The function should return the code you want.
I have created the function getFunctionCode(filename, funcname). The code is working well.
def getFunctionCode(filename, funcname):
    data = []
    with open(filename) as fp:
        line = fp.readlines()
    startIndex = 0   # From where to start reading the body
    endIndex = 0     # Till what line, because the file may have multiple functions
    for i in range(len(line)):   # Finding the starting index
        if(line[i].__contains__(funcname)):
            startIndex = i+1
            break
    for i in range(startIndex, len(line)):
        if(line[i].__contains__('def')):   # Find the end, in case of multiple functions
            endIndex = i-1
            break
        else:
            endIndex = len(line)
    for i in range(startIndex, endIndex):
        if(line[i] != None):
            temp = line[i].strip()[:1]   # First character of the stripped line ('' for blank lines)
            if(temp != '' and temp != '#'):
                data.append(line[i][:-1])
    return(data)
I read the file provided in the first argument.
Then I found the index where the function is located; the function name is provided in the second argument.
Starting from that index, I stripped each line and checked its first character to detect comments (#) and blank lines (\n).
Finally, the lines without these are appended.
Here you can find the file TestFile.py:
def sub(self,num1,num2):
    # Subtract two numbers
    answer = num1 - num2
    # Print the answer
    print('Difference = ',answer)

def add(self,num1,num2):
    # addition of two numbers
    answer = num1 + num2
    # Print the answer
    print('Summation = ',answer)

def mul(self,num1,num2):
    # Product of two numbers
    answer = num1 * num2
    # Print the answer
    print('Product = ',answer)
Execution of the function:
getFunctionCode('TestFile.py','sub')
['    answer = num1 - num2', "    print('Difference = ',answer)"]
getFunctionCode('TestFile.py','add')
['    answer = num1 + num2', "    print('Summation = ',answer)"]
getFunctionCode('TestFile.py','mul')
['    answer = num1 * num2', "    print('Product = ',answer)"]
The solution by MoltenMuffins is simpler as well.
Your implementation of this function would fail badly if you have multiple functions inside your TestFile.py file and you intend to retrieve the source code of only specific functions from TestFile.py. It would also fail if you have some variables defined between two function definitions in TestFile.py.
A cleaner and simpler way to extract the source code of a function from TestFile.py is to use the inspect.getsource() method, as follows:
#Import necessary packages
import os
import sys
import inspect

#This function takes as input your Python .py file and the function name in the .py file for which code is needed
def giveFunctionData(file_path, function_name):
    folder_path = os.path.dirname(os.path.abspath(file_path))
    #Change directory to the folder containing the .py file
    os.chdir(folder_path)
    head, tail = os.path.split(file_path)
    tail = tail.split('.')[0]
    #Construct the import statement for the function that needs to be imported
    import_statement = "from " + tail + " import " + function_name
    #Execute the import statement
    exec(import_statement)
    #Extract the function code with comments
    function_code_with_comments = eval("inspect.getsource(" + function_name + ")")
    #Now, filter out the comments from the function code
    function_code_without_comments = ''
    for line in function_code_with_comments.splitlines():
        currentstr = line.lstrip()
        if not currentstr.startswith("#"):
            function_code_without_comments = function_code_without_comments + line + '\n'
    return function_code_without_comments

#Specify the absolute path of your Python file from which the function code needs to be extracted
file_path = "Path_To_Testfile.py"
#Specify the name of the function for which code is needed
function_name = "sub"
#Print the function code without comments by calling giveFunctionData(file_path, function_name)
print(giveFunctionData(file_path, function_name))
This method will work for any function code you need to extract, irrespective of the formatting of the .py file where the function is present, rather than relying on parsing the .py file as a string variable.
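For reference, a more direct sketch of the same idea, assuming TestFile.py is importable from the current working directory (my own simplification, not the code above):

import inspect
import TestFile

source = inspect.getsource(TestFile.sub)
# Drop full-line comments, keep everything else exactly as written in the file
cleaned = '\n'.join(line for line in source.splitlines()
                    if not line.lstrip().startswith('#'))
print(cleaned)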
Cheers!
Given a Python string describing object.attribute, how do I separate the attribute's namespace from the attribute?
Desired Examples:
ns_attr_split("obj.attr") => ("obj", "attr")
ns_attr_split("obj.arr[0]") => ("obj", "arr[0]")
ns_attr_split("obj.dict['key']") => ("obj", "dict['key']")
ns_attr_split("mod.obj.attr") => ("mod.obj", "attr")
ns_attr_split("obj.dict['key.word']") => ("obj", "dict['key.word']")
Note: I understand writing my own string parser would be one option, but I am looking for a more elegant solution to this. Rolling my own string parser isn't as simple as an rsplit on '.' because of the last option listed above where a given keyword may contain the namespace delimiter.
I've recently discovered the tokenize library for tokenizing python source code. Using this library I've come up with this little code snippet:
import tokenize
import StringIO
def ns_attr_split(s):
    arr = []
    last_delim = -1
    cnt = 0

    # Tokenize the expression, tracking the last namespace
    # delimiter index in last_delim
    str_io = StringIO.StringIO(s)
    for i in tokenize.generate_tokens(str_io.readline):
        arr.append(i[1])
        if i[1] == '.':
            last_delim = cnt
        cnt = cnt + 1

    # Join the namespace parts into a string
    ns = ""
    for i in range(0, last_delim):
        ns = ns + arr[i]

    # Join the attr parts into a string
    attr = ""
    for i in range(last_delim + 1, len(arr)):
        attr = attr + arr[i]

    return (ns, attr)
This should work with intermediate indexes/keys as well (e.g. "mod.ns[3].obj.dict['key']").
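For example, this should give:
>>> ns_attr_split("mod.ns[3].obj.dict['key']")
('mod.ns[3].obj', "dict['key']")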
Assuming that the namespace is always alphanumeric, you could first split on /[^a-zA-Z.]/, then rsplit on .:
>>> import re
>>> ns_attr_split = lambda s: re.split("[^a-zA-Z.]", s, 1)[0].rsplit('.')
>>> ns_attr_split("obj.dict['key.word']")
['obj', 'dict']
Obviously this isn't exactly what you want… but the fiddling would be straightforward.
A fun little regular expression problem...
This code works on all the examples you provided using Python 2.6, and assumes you don't have any intermediate index/key accesses (e.g. "obj['foo'].baz"):
import re
ns_attr_split = lambda s: re.match(r"((?:\w+\.)*\w+)\.(.+)", s).groups()
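A quick check against the examples from the question (my own illustration):
>>> ns_attr_split("mod.obj.attr")
('mod.obj', 'attr')
>>> ns_attr_split("obj.dict['key.word']")
('obj', "dict['key.word']")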
This code block works - it loops through a file that has a repeating number of sets of data
and extracts out each of the 5 pieces of information for each set.
But I know that the current factoring is not as efficient as it can be, since it loops
through each key for each line found.
Wondering if some Python gurus can offer a better way to do this more efficiently.
def parse_params(num_of_params,lines):
    for line in lines:
        for p in range(1,num_of_params + 1,1):
            nam = "model.paramName "+str(p)+" "
            par = "model.paramValue "+str(p)+" "
            opt = "model.optimizeParam "+str(p)+" "
            low = "model.paramLowerBound "+str(p)+" "
            upp = "model.paramUpperBound "+str(p)+" "
            keys = [nam,par,opt,low,upp]
            for key in keys:
                if key in line:
                    a,val = line.split(key)
                    if key == nam: names.append(val.rstrip())
                    if key == par: params.append(val.rstrip())
                    if key == opt: optimize.append(val.rstrip())
                    if key == upp: upper.append(val.rstrip())
                    if key == low: lower.append(val.rstrip())

    print "Names = ",names
    print "Params = ",params
    print "Optimize = ",optimize
    print "Upper = ",upper
    print "Lower = ",lower
Though this doesn't answer your question (other answers are getting at that), something that has helped me a lot in doing things similar to what you're doing is list comprehensions. They allow you to build lists in a concise and (I think) easy-to-read way.
For instance, the code below builds a 2-dimensional array with the values you're trying to get at. some_funct here would be a little regex, if I were doing it, that takes the index of the last space in the key as its parameter, looks ahead to collect the value you're trying to get in the line (the value which corresponds to the key currently being looked at), and appends it to the correct index in the seen_keys 2D array.
Wordy, yes, but if you get list comprehensions and you're able to construct the regex to do that, you've got a nice, concise solution.
keys = ["model.paramName ","model.paramValue ","model.optimizeParam ""model.paramLowerBound ","model.paramUpperBound "]
for line in lines:
seen_keys = [[],[],[],[],[]]
[seen_keys[keys.index(k)].some_funct(line.index(k) for k in keys if k in line]
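A concrete reading of that sketch might look like the following; the value extraction (splitting on the key and dropping the leading digit) is my own assumption about the file format rather than part of the original answer, and seen_keys is hoisted out of the loop so values accumulate across all lines (lines is the input list from the question):

import re

keys = ["model.paramName ", "model.paramValue ", "model.optimizeParam ",
        "model.paramLowerBound ", "model.paramUpperBound "]
seen_keys = [[], [], [], [], []]
for line in lines:
    # For each key present in the line, drop the key and the index digit
    # and record the remaining value at the matching position in seen_keys
    [seen_keys[keys.index(k)].append(re.sub(r'^\s*\d+\s+', '', line.split(k, 1)[1]).rstrip())
     for k in keys if k in line]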
It's not totally easy to see the expected format. From what I can see, the format is like:
lines = [
    "model.paramName 1 foo",
    "model.paramValue 2 bar",
    "model.optimizeParam 3 bat",
    "model.paramLowerBound 4 zip",
    "model.paramUpperBound 5 ech",
    "model.paramName 1 foo2",
    "model.paramValue 2 bar2",
    "model.optimizeParam 3 bat2",
    "model.paramLowerBound 4 zip2",
    "model.paramUpperBound 5 ech2",
]
I don't see the above code working if there is more than one value in each line, which means the digit is not really significant unless I'm missing something. In that case this works very easily:
import re

def parse_params(num_of_params, lines):
    key_to_collection = {
        "model.paramName": names,
        "model.paramValue": params,
        "model.optimizeParam": optimize,
        "model.paramLowerBound": lower,
        "model.paramUpperBound": upper,
    }
    reg = re.compile(r'(.+?) (\d) (.+)')
    for line in lines:
        m = reg.match(line)
        key, digit, value = m.group(1, 2, 3)
        key_to_collection[key].append(value)
It's not entirely obvious from your code, but it looks like each line can have one "hit" at most; if that's indeed the case, then something like:
import re

def parse_params(num_of_params, lines):
    sn = 'Names Params Optimize Upper Lower'.split()
    ks = '''paramName paramValue optimizeParam
            paramUpperBound paramLowerBound'''.split()
    vals = dict((k, []) for k in ks)
    are = re.compile(r'model\.(%s) (\d+) (.*)' % '|'.join(ks))
    for line in lines:
        mo = are.search(line)
        if not mo: continue
        p = int(mo.group(2))
        if p < 1 or p > num_of_params: continue
        vals[mo.group(1)].append(mo.group(3).rstrip())
    for k, s in zip(ks, sn):
        print '%-8s =' % s,
        print vals[k]
might work -- I exercised it with a little code as follows:
if __name__ == '__main__':
    lines = '''model.paramUpperBound 1 ZAP
model.paramLowerBound 1 zap
model.paramUpperBound 5 nope'''.splitlines()
    parse_params(2, lines)
and it emits
Names = []
Params = []
Optimize = []
Upper = ['ZAP']
Lower = ['zap']
which I think is what you want (if some details must differ, please indicate exactly what they are and let's see if we can fix it).
The two key ideas are: use a dict instead of lots of ifs; use a re to match "any of the following possibilities" with parenthesized groups in the re's pattern to catch the bits of interest (the keyword after model., the integer number after that, and the "value" which is the rest of the line) instead of lots of if x in y checks and string manipulation.
There is a lot of duplication there, and if you ever add another key or param, you're going to have to add it in many places, which leaves you ripe for errors. What you want to do is pare down all of the places you have repeated things and use some sort of data model, such as a dict.
Some others have provided some excellent examples, so I'll just leave my answer here to give you something to think about.
Are you sure that parse_params is the bottleneck? Have you profiled your app?
import re
from collections import defaultdict

names = ("paramName paramValue optimizeParam "
         "paramLowerBound paramUpperBound".split())

stmt_regex = re.compile(r'model\.(%s)\s+(\d+)\s+(.*)' % '|'.join(names))

def parse_params(num_of_params, lines):
    stmts = defaultdict(list)
    for m in (stmt_regex.match(s) for s in lines):
        if m and 1 <= int(m.group(2)) <= num_of_params:
            stmts[m.group(1)].append(m.group(3).rstrip())
    for k, v in stmts.iteritems():
        print "%s = %s" % (k, ' '.join(v))
The code given in the OP does multiple tests per line to try to match against the expected set of values, each of which is being constructed on the fly. Rather than construct paramValue1, paramValue2, etc. for each line, we can use a regular expression to try to do the matching in a cheaper (and more robust) manner.
Here's my code snippet, drawing from some ideas that have already been posted. This lets you add a new keyword to the key_to_collection dictionary and not have to change anything else.
import re

def parse_params(num_of_params, lines):
    pattern = re.compile(r"""
        model\.
        (\S+)       # keyword
        [ ]+        # whitespace
        (\d+)       # index of the keyword
        [ ]+        # whitespace
        (.+)        # value
        """, re.VERBOSE)

    key_to_collection = {
        "paramName": names,
        "paramValue": params,
        "optimizeParam": optimize,
        "paramLowerBound": lower,
        "paramUpperBound": upper,
    }

    for line in lines:
        match = pattern.match(line)
        if not match:
            print "Invalid line: " + line
        elif match.group(1) not in key_to_collection:
            print "Invalid key: " + line
        # Not sure if you really care about enforcing this
        elif int(match.group(2)) > num_of_params:
            print "Invalid param: " + line
        else:
            key_to_collection[match.group(1)].append(match.group(3))
Full disclosure: I have not compiled/tested this.
It can certainly be made more efficient. But, to be honest, unless this function is called hundreds of times a second, or works on thousands of lines, is it necessary?
I would be more concerned about making it clear what is happening... currently, I'm far from clear on that aspect.
Just eyeballing it, the input seems to look like this:
model.paramName 1 A model.paramValue 1 B model.optimizeParam 1 C model.paramLowerBound 1 D model.paramUpperBound 1 E model.paramName 2 F model.paramValue 2 G model.optimizeParam 2 H model.paramLowerBound 2 I model.paramUpperBound 2 J
And your desired output seems to be something like:
Names = AF
Params = BG
etc...
Now, since my input certainly doesn't match yours, the output is likely off too, but I think I have the gist.
There are a few points. First, does it matter how many parameters are passed to the function? For example, if the input has two sets of parameters, do I just want to read both, or is it necessary to allow the function to only read one? For example, your code allows me to call parse_params(1,1) and have it only read parameters ending in a 1 from the same input. If that's not actually a requirement, you can skip a large chunk of the code.
Second, is it important to ONLY read the given parameters? If I, for example, have a parameter called 'paramFoo', is it bad if I read it? You can also simplify the procedure by just grabbing all parameters regardless of their name, and extracting their value.
import re

def parse_params(input):
    parameter_list = {}
    param = re.compile(r"model\.([^ ]+) [0-9]+ ([^ ]+)")
    each_parameter = param.finditer(input)
    for match in each_parameter:
        key = match.group(1)
        value = match.group(2)
        if key not in parameter_list:
            parameter_list[key] = []
        parameter_list[key].append(value)
    return parameter_list
The output, in this instance, will be something like this:
{'paramName':[A, F], 'paramValue':[B, G], 'optimizeParam':[C, H], etc...}
Notes: I don't know Python well, I'm a Ruby guy, so my syntax may be off. Apologies.