Rewrite functions in pywb without changing the source code - python

I'm new to Python development and I'm using the pywb library to replay web archives (WARC files).
I would like to modify a function in pywb/warcserver/index without modifying the source code.
The idea is to change some behaviour while keeping the original source intact, so that the library can be updated without losing my changes.
How can this be done in pywb with Python?
The function to rewrite is load_index, in the indexsource.py file.
Thanks

load_index is a method of the FileIndexSource class. You can modify the method on an instance level without having to change the source code of the library. For example:
import types
from pywb.utils.binsearch import iter_range
from pywb.utils.wbexception import NotFoundException
from pywb.warcserver.index.cdxobject import CDXObject
from pywb.utils.format import res_template
def modified_load_index(self, params):
    filename = res_template(self.filename_template, params)
    try:
        fh = open(filename, 'rb')
    except IOError:
        raise NotFoundException(filename)
    def do_load(fh):
        with fh:
            gen = iter_range(fh, params['key'], params['end_key'])
            for line in gen:
                yield CDXObject(line)
    # (... some modification on this method)
    return do_load(fh)
# Change the "load_index" method on the instance of FileIndexSource.
# A plain function assigned to an instance attribute is not bound automatically,
# so bind it explicitly to make "self" refer to the instance.
my_file_index_source.load_index = types.MethodType(modified_load_index, my_file_index_source)
From then on, every time load_index is called on my_file_index_source, the modified method will run.
Another option would be to make a new class which inherits from FileIndexSource and overrides the load_index method.
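The two options can be compared side by side in a minimal sketch. The FileIndexSource class below is a hypothetical stand-in, not pywb's real class, so the example runs without pywb installed. The names CustomFileIndexSource and patched_load_index are likewise made up for illustration.

```python
import types


class FileIndexSource:
    """Hypothetical stand-in for pywb's FileIndexSource, for illustration only."""

    def __init__(self, filename_template):
        self.filename_template = filename_template

    def load_index(self, params):
        return "original:" + self.filename_template


class CustomFileIndexSource(FileIndexSource):
    """Subclass that overrides load_index and can still reuse the original."""

    def load_index(self, params):
        original = super().load_index(params)
        return "custom(" + original + ")"


def patched_load_index(self, params):
    return "patched:" + self.filename_template


# Option 1: patch a single instance; other instances are unaffected.
source = FileIndexSource("index.cdx")
source.load_index = types.MethodType(patched_load_index, source)

# Option 2: use the subclass wherever the original class was instantiated.
custom = CustomFileIndexSource("index.cdx")

print(source.load_index({}))
print(custom.load_index({}))
print(FileIndexSource("index.cdx").load_index({}))
```

Patching an instance only helps if you control the object that the library ends up calling. The subclass keeps the change in one named place and lets you call the original through super().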

Related

How can you get the source code of dynamically defined functions in Python?

When defining code dynamically in Python (e.g. through exec, or by loading it from some medium other than import), I am unable to get at the source of the defined function.
inspect.getsource seems to look for the module file the function was loaded from.
import inspect
code = """
def my_function():
    print("Hello dears")
"""
exec(code)
my_function()  # Works, as expected
print(inspect.getsource(my_function))  # Fails with OSError('could not get source code')
Is there any other way to get at the source of a dynamically interpreted function (or other object, for that matter)?
One option would be to dump the source to a file and exec from there, though that litters your filesystem with garbage you need to clean up.
A somewhat less reliable but less garbagey alternative would be to rebuild the source (-ish) from the parsed AST, using astor.to_source() for instance (it takes an AST node rather than bytecode, so you would need to keep the tree from ast.parse around). It will give you a "corresponding" source but may alter formatting or lose metadata compared to the original.
The simplest would be to attach your original source to the created function object:
code = """
def my_function():
    print("Hello dears")
"""
exec(code)
my_function.__source__ = code  # has nothing to do with getsource
One more alternative (though probably not useful here, as I assume you want the body to be created dynamically, from a template for instance) would be to swap the code object for one you've updated with the correct / relevant firstlineno (and optionally filename, though you can set that as part of the compile statement). That's only useful if for some weird reason you have your Python code literally embedded in another file but can't or don't want to extract it to its own module for a normal evaluation.
You can do it roughly as follows, by writing the source to a file and compiling it under that file name:
import inspect
source = """
def foo():
    print("Hello World")
"""
file_name = '/tmp/foo.py'  # you can use any hash_function
with open(file_name, 'w') as f:
    f.write(source)
code = compile(source, file_name, 'exec')
exec(code)
foo()  # Works, as expected
print(inspect.getsource(foo))
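If writing a file is undesirable, a variation on the same idea is to register the source with linecache, which inspect consults when it looks up source lines. This is only a sketch: it relies on the layout of linecache.cache entries, which is an implementation detail of CPython rather than a documented API, and the pseudo-filename "<dynamic-foo>" is an arbitrary name chosen for illustration.

```python
import inspect
import linecache

source = '''
def foo():
    return "Hello World"
'''

# Hypothetical pseudo-filename; nothing is written to disk under this name.
file_name = "<dynamic-foo>"

# Register the source lines with linecache, which inspect consults.
# The tuple is (size, mtime, lines, fullname); mtime=None stops
# linecache.checkcache() from discarding the entry.
linecache.cache[file_name] = (
    len(source),
    None,
    source.splitlines(keepends=True),
    file_name,
)

# Compile under the same name so the code object points at the cached lines.
namespace = {}
exec(compile(source, file_name, "exec"), namespace)
foo = namespace["foo"]

print(inspect.getsource(foo))
```

The name passed to compile must match the linecache key exactly, because that name is what inspect looks up.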

Best way to pass function specified in file x as commandline parameter to file y in python

I'm writing a wrapper or pipeline to create a tfrecords dataset to which I would like to supply a function to apply to the dataset.
I would like to make it possible for the user to inject a function defined in another python file which is called in my script to transform the data.
Why? The only thing the user has to do is write the function which brings his data into the right format, then the existing code does the rest.
I'm aware of the fact that I could have the user write the function in the same file and call it, or to have an import statement etc.
So as a minimal example, I would like to have file y.py
def main(argv):
    # Parse args etc, let's assume it is there.
    dataset = tf.data.TFRecordDataset(args.filename)
    dataset = dataset.map(args.function)
    # Continue with doing stuff that is independent from actual content
So what I'd like to be able to do is something like this
python y.py --func x.py my_func
And use the function defined in x.py my_func in dataset.map(...)
Is there a way to do this in python and if yes, which is the best way to do it?
1. Pass the name of the file (and the function name) as arguments to your script.
2. Read the file into a string, possibly extracting the given function.
3. Use Python's exec() to execute the code.
An example:
file = "def fun(*args): \n return args"
func = "fun(1,2,3)"
def execute(func, file):
    program = file + "\nresult = " + func
    local = {}
    exec(program, local)
    return local['result']
r = execute(func, file)
print(r)
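This is similar to the linked approach; however, because exec is not being called at global scope, we pass it an explicit dictionary and read the result back from that dictionary.
Note: the use of exec is somewhat dangerous because it runs arbitrary code, so you should be sure that the function is safe. If you are the only one using it, then it's fine!
Hope this helps.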
Ok so I have composed the answer myself now using the information from comments and this answer.
import importlib, inspect, sys, os
# path is the given path to the file, function_name is the name of the function and args are the function arguments
# Create package and module name from path
package = os.path.dirname(path).replace(os.path.sep,'.')
module_name = os.path.basename(path).split('.')[0]
# Import module and get members
module = importlib.import_module(module_name, package)
members = inspect.getmembers(module)
# Find matching function
function = [t[1] for t in members if t[0] == function_name][0]
function(args)
This solves the question exactly, since I get a function object which I can call, pass around, and use like any normal function.
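One caveat: importlib.import_module uses its package argument only for relative names (those starting with a dot), so the snippet above works only when the file's directory is already on the import path. A sketch of an alternative that loads the module directly from its file path is shown below. The helper name load_function and the temporary x.py with my_func are assumptions made for the demonstration.

```python
import importlib.util
import os
import tempfile


def load_function(path, function_name):
    """Import the Python file at `path` and return the named callable from it."""
    module_name = os.path.splitext(os.path.basename(path))[0]
    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise ImportError("cannot load a module from " + path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    function = getattr(module, function_name)
    if not callable(function):
        raise TypeError(function_name + " is not callable")
    return function


# Demonstration with a throwaway x.py defining a hypothetical my_func.
with tempfile.TemporaryDirectory() as tmp_dir:
    user_file = os.path.join(tmp_dir, "x.py")
    with open(user_file, "w") as f:
        f.write("def my_func(record):\n    return record * 2\n")
    my_func = load_function(user_file, "my_func")
    result = my_func(21)

print(result)
```

In y.py the two values would come from the command line, e.g. two positional arguments parsed with argparse, and the returned function could then be passed to dataset.map(...). As with exec, loading a user-supplied file runs arbitrary code.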

Obtain Python docstring of arbitrary script file

I am writing a module that requires the docstring of the calling script. Thus far I have managed to obtain the filename of the calling script using
import inspect
filename = inspect.stack()[1].filename
The docstring can be found inside the calling script using __doc__. Getting the docstring from the called module does not seem trivial, however. Of course I could write a function, but this is bound to ignore some uncommon cases. Is there a way to actually parse the calling script to find its docstring (without executing its code)?
Based on chaos's suggestion to use ast I wrote the following which seems to work nicely.
import ast
with open(fname, 'r') as f:
    tree = ast.parse(f.read())
docstring = ast.get_docstring(tree)
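Putting the two pieces together, a small helper might look like the sketch below. The function names docstring_of and caller_docstring are made up for illustration, and the demonstration uses a throwaway script rather than a real caller.

```python
import ast
import inspect
import os
import tempfile


def docstring_of(path):
    """Return the module docstring of the file at `path` without executing it."""
    with open(path, "r", encoding="utf-8") as f:
        tree = ast.parse(f.read(), filename=path)
    return ast.get_docstring(tree)


def caller_docstring():
    """Return the docstring of the file whose code called this function."""
    # stack()[0] is this frame, stack()[1] is the caller's frame.
    return docstring_of(inspect.stack()[1].filename)


# Demonstration on a throwaway script; its body is parsed but never run.
with tempfile.TemporaryDirectory() as tmp_dir:
    script = os.path.join(tmp_dir, "script.py")
    with open(script, "w", encoding="utf-8") as f:
        f.write('"""Docstring of the script."""\nraise SystemExit("never executed")\n')
    found = docstring_of(script)

print(found)
```

ast.get_docstring returns None when the file has no module docstring, so callers should be prepared for that case.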

How to instantiate an object once

I am instantiating the object below every time I call csv in my function. Is there any way I could instantiate the object just once?
I tried to split the return csv out of def csv() into another function, but failed.
Code instantiating the object
def csv():
    proj = Project.Project(db_name='test', json_file="/home/qingyong/workspace/Project/src/json_files/sys_setup.json")#, _id='poc_1'
    csv = CSVDatasource(proj, "/home/qingyong/workspace/Project/src/json_files/data_setup.json")
    return csv
Test function
def test_df(csv, df):
    ..............
Is your csv function actually a pytest.fixture? If so, you can change its scope to session so it will only be called once per py.test session.
@pytest.fixture(scope="session")
def csv():
    # rest of code
Of course, the returned data should be immutable so tests can't affect each other.
You can use a global variable to cache the object:
_csv = None
def csv():
    global _csv
    if _csv is None:
        proj = Project.Project(db_name='test', json_file="/home/qingyong/workspace/Project/src/json_files/sys_setup.json")#, _id='poc_1'
        _csv = CSVDatasource(proj, "/home/qingyong/workspace/Project/src/json_files/data_setup.json")
    return _csv
Another option is to change the caller to cache the result of csv() in a manner similar to the snippet above.
Note that your "code to call the function" doesn't call the function, it only declares another function that apparently receives the csv function's return value. You didn't show the call that actually calls the function.
You can use a decorator for this if CSVDatasource doesn't have side effects like reading the input line by line.
See Efficient way of having a function only execute once in a loop
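As one possible form of the decorator approach, the standard library's functools.lru_cache can cache the result of a zero-argument function. The CSVDatasource class below is a hypothetical stand-in for the real one, and its constructor arguments are placeholders.

```python
import functools


class CSVDatasource:
    """Hypothetical stand-in for the question's CSVDatasource."""

    instances_created = 0

    def __init__(self, project, json_file):
        CSVDatasource.instances_created += 1
        self.project = project
        self.json_file = json_file


@functools.lru_cache(maxsize=None)
def csv():
    # The body runs on the first call only; later calls return the cached object.
    return CSVDatasource("test-project", "data_setup.json")


first = csv()
second = csv()
print(first is second, CSVDatasource.instances_created)
```

Every caller receives the same object, so this fits best when the object is not mutated between uses.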
You can store the object as an attribute on the function itself, returning it if it already exists and creating it if it doesn't.
def csv():
    if not hasattr(csv, 'obj'):
        proj = Project.Project(db_name='test', json_file="/home/qingyong/workspace/Project/src/json_files/sys_setup.json")#, _id='poc_1'
        csv.obj = CSVDatasource(proj, "/home/qingyong/workspace/Project/src/json_files/data_setup.json")
    return csv.obj

Run multiple functions in a for loop python

Here is the start of my program. I want a lot of the functions to be inside the for loop as seen in the 3rd function here. How do I go about this?
#!/usr/bin/env python
from rdflib import URIRef, Graph
from StringIO import StringIO
import subprocess as sub
class Wordnet():
    def __init__(self, graph):
        graph = Graph()
    def process_file(self, file):
        file = open("new_2.txt", "r")
        return file
    def line_for_loop(self, file):
        for line in file:
    def split_pointer_part(self, before_at, after_at, line):
        before_at, after_at = line.split('#', 1)
        return before_at, after_at
    def split_word_part(self, word_part, line):
        word_part = line.split()
        return word_part
Is it just a matter of indenting everything else inside the for loop, or does the loop have to be defined where the functions are called?
How does one go about calling multiple functions as part of a program? I am new to Python and I don't really know.
There's no program here. Classes by themselves don't do anything. You need to instantiate the class, then call one of its methods (which is the correct term for what you seem to be calling "processes"). So, at the end of this file, you might do:
wordnet = Wordnet()
my_file = wordnet.process_file()
wordnet.line_for_loop(my_file)
Inside one method, you can call another: so for your loop, you would do:
def line_for_loop(self, file):
    for line in file:
        self.my_method_1()
        self.my_method_2()
There are some other issues with your code. For example, in the __init__ method, you define a graph local variable, but never do anything with it, so it is not stored anywhere. You need to store variables on self for them to become instance properties:
def __init__(self):
    self.graph = Graph()
Also, you seem to be confused about when to pass parameters to functions. Twice (in __init__ and process_file) you accept a parameter, then override it inside the method with a local variable. If you're defining the variable in the function, you shouldn't pass it as a parameter.
Note that, as I've had occasion to say before, Python is not Java, and doesn't always require classes. In this case, the class is not contributing anything to the program, other than as a holder for methods. In Python, you would normally just use functions inside a module for that.
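A minimal sketch of that module-level style, with the per-line work called from inside the loop, could look like this. The function process_lines and the sample input are made up for illustration, and the helpers take only the line as a parameter, following the advice above about not passing in values that the function computes itself.

```python
import io


def split_pointer_part(line):
    """Split a line at the first '#' into the text before and after it."""
    before, after = line.split('#', 1)
    return before.strip(), after.strip()


def split_word_part(line):
    """Split a line into whitespace-separated words."""
    return line.split()


def process_lines(lines):
    """Call both helpers once per line and collect the results."""
    results = []
    for line in lines:
        if '#' not in line:
            continue  # skip lines without a pointer part
        before, after = split_pointer_part(line)
        results.append((split_word_part(before), after))
    return results


# Hypothetical sample input standing in for new_2.txt.
sample = io.StringIO("dog canine # animal\nno pointer here\ncat feline # animal\n")
results = process_lines(sample)
print(results)
```

The functions are defined once at module level and only called inside the loop; nothing is defined inside the loop body.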
Process isn't the proper term to use. Those are better known as functions or methods. As far as Python loops go, indentation is important. You do need to indent.
def line_for_loop(self, file):
    for line in file:
        self.process_file("example_file_name.txt")
        self.split_pointer_part(0, 10, "some test string")
You should make the function calls from inside the loop. The example code above may not be the exact solution for your code, but it should be sufficient to answer your question.
