Suppose I have a list of reference numbers in a file, numbers.csv.
I am writing a module checker.py that can be imported and called to check for numbers like this:
import checker
checker.check_number(1234)
checker.py will load a list of numbers and provide check_number to check if a given number is in the list. By default it should load the data from numbers.csv, although the path can be specified:
DATA_FILEPATH = 'numbers.csv'
def load_data(data_filepath=DATA_FILEPATH):
...
REF_LIST = load_data()
def check_number(num, ref_list=REF_LIST):
...
Question -- Interleaving variables and functions seems strange to me. Is there a better way to structure checker.py than the above?
I read the excellent answer to How to create module-wide variables in Python?.
Is the best practice to:
declare REF_LIST as I have done above?
create a dict like VARS = {'REF_LIST': None} and set VARS['REF_LIST'] in load_data?
create a trivial class and assign clas.REF_LIST in load_data?
Or does it depend on the situation? (And in which situations do I use which?)
Note
Previously, I avoided this by loading the data only when needed, in the calling module. So in checker.py:
DATA_FILEPATH = 'numbers.csv'
def load_data(data_filepath=DATA_FILEPATH):
...
def check_number(num, ref_list):
...
In the calling module:
import checker
ref_list = checker.load_data()
checker.check_number(1234, ref_list)
But it didn't quite make sense to me to load in the calling module, because I would need to call load_data 5 times if I want to check numbers in 5 different modules.
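One middle-ground sketch, assuming the CSV holds one reference number per row: checker.py loads the file lazily on first use, and because Python caches imported modules, every importing module shares the same loaded data.
import csv
DATA_FILEPATH = 'numbers.csv'
_ref_set = None  # populated on first call

def load_data(data_filepath=DATA_FILEPATH):
    # read one reference number per row (illustrative assumption about the CSV layout)
    with open(data_filepath, newline='') as fh:
        return {int(row[0]) for row in csv.reader(fh) if row}

def check_number(num):
    global _ref_set
    if _ref_set is None:  # the file is read at most once per process
        _ref_set = load_data()
    return num in _ref_set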
You can load the CSV data easily with the help of the pandas library:
import pandas as pd
dataframe = pd.read_csv('numbers.csv')
To check whether a number is present in the dataframe, you can use code like this:
numbers = [1, 3, 8]
for number in numbers:
    # use .values so membership is tested against the column's values,
    # not the Series index (which is what `in` on a Series checks)
    if number in dataframe[dataframe.columns[0]].values:
        print(True)
    else:
        print(False)
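If you need to test many numbers at once, a vectorised sketch using Series.isin, assuming the reference numbers are in the first column:
numbers = [1, 3, 8]
# True/False for each candidate number, depending on whether it appears in the first column
hits = pd.Series(numbers).isin(dataframe[dataframe.columns[0]])
print(dict(zip(numbers, hits)))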
Related
import os
BUCKET = os.getenv("BUCKET")
IN_CSV = os.getenv("IN_CSV")
OUT_CSV = os.getenv("OUT_CSV")
Now, you see the problem, right? I don't want to retype each variable name twice. Is there a way to avoid that? Maybe some function get_and_init_env.
After get_and_init_env(BUCKET) is executed, there should be a variable named BUCKET with the value of os.getenv("BUCKET") in locals().
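A minimal sketch of one way to avoid the repetition (the helper name load_env is hypothetical, and this relies on globals() being the module namespace when used at module level):
import os

def load_env(*names):
    # return a dict mapping each name to its environment value (or None if unset)
    return {name: os.getenv(name) for name in names}

# at module level, globals() and locals() are the same dict, so this injects
# BUCKET, IN_CSV and OUT_CSV as module variables in one line
globals().update(load_env("BUCKET", "IN_CSV", "OUT_CSV"))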
It may not be exactly what you need, but to save time typing in things built on IPython, I once made a class that took in a dict of strings (such as one that can easily be made from os.environ). In its __init__ it called setattr to give itself attributes reflecting the dict contents. From there I just had to use .blah on that instance instead of ['blah'], and more importantly, in IPython I could type .b<tab> and bring up the items it could be. It probably went something like:
...
class DotDict:
    def __init__(self, dictish):
        # a dict has a lot of useful capabilities that calls can be routed to
        self._original = dict(dictish)
        for x, y in self._original.items():
            setattr(self, CleanStr(x), y)
...
...
# make useful dicts part of the module
env = DotDict(os.environ)
...
from MyMod import env as env0
env0.BUCKET #just use it...
Since most environment variable names are already pretty clean, you can probably just use x instead of CleanStr(x), but you should really have a way to turn any x object into a valid name, be it str-, repr- or hash-related and prefixed by some favourite character sequence.
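CleanStr is left undefined above; one possible, purely illustrative implementation that turns an arbitrary key into a valid attribute name:
import re

def CleanStr(key):
    # replace anything that isn't a letter, digit or underscore
    name = re.sub(r'\W', '_', str(key))
    # identifiers cannot start with a digit (or be empty)
    if not name or name[0].isdigit():
        name = '_' + name
    return name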
I have a utilities.py file for my python project. It contains only util functions, for example is_float(string), is_empty(file), etc.
Now I want to have a function is_valid(number), which has to:
read from a file, valid.txt, which contains all the numbers which are valid, and load them into a map/set.
check the map for the presence of number and return True or False.
This function is called often, and its running time should be as small as possible. I don't want to open and read valid.txt every time the function is called. The only solution I have come up with is to use a global variable, valid_dict, which is loaded once from valid.txt when utilities.py is imported. The loading code is written as main in utilities.py.
My question is how do I do this without using a global variable, as it is considered bad practice? What is a good design pattern for doing such a task without using globals? Also note again that this is a util file, so there should ideally be no main as such, just functions.
The following is a simple example of a closure. The dictionary, cache, is encapsulated within the outer function (_load_func), but remains in scope for the inner function, even after it is returned. Notice that _load_func returns the inner function as an object; it does not call it.
In utilities.py:
def _load_func(filename):
    cache = {}
    with open(filename) as fn:
        for line in fn:
            key, value = line.split()
            cache[int(key)] = value
    def inner(number):
        return number in cache
    return inner
is_valid = _load_func('valid.txt')
In __main__:
from utilities import is_valid # or something similar
if is_valid(42):
    print(42, 'is valid')
else:
    print(42, 'is not valid')
The dictionary (cache) creation could have been done using a dictionary comprehension, but I wanted you to concentrate on the closure.
The variable valid_dict would not be global but local to utilities.py. It would only become global if you did something like from utilities import *. Now that is considered bad practice when you're developing a package.
However, I have used a trick in cases like this that essentially requires a static variable: Add an argument valid_dict={} to is_valid(). This dictionary will be instantiated only once and each time the function is called the same dict is available in valid_dict.
def is_valid(number, valid_dict={}):
if not valid_dict:
# first call to is_valid: load valid.txt into valid_dict
# do your check
Do NOT assign to valid_dict in the if-clause but only modify it: e.g., by setting keys valid_dict[x] = y or using something like valid_dict.update(z).
(PS: Let me know if this is considered "dirty" or "un-pythonic".)
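Putting the pieces together, a runnable sketch of this pattern, assuming valid.txt holds one number per line:
def is_valid(number, valid_dict={}):
    if not valid_dict:
        # first call: populate the default dict in place, so it persists across calls
        with open('valid.txt') as fh:
            valid_dict.update((int(line), True) for line in fh if line.strip())
    return number in valid_dict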
I'm trying to speed-up some multiprocessing code in Python 3. I have a big read-only DataFrame and a function to make some calculations based on the read values.
I tried to solve the issue by writing the function inside the same file and sharing the big DataFrame, as you can see here. This approach does not allow moving the process function to another file/module, and it's a bit weird to access a variable outside the scope of the function.
import pandas as pd
import multiprocessing
def process(user):
    # Locate all the user sessions in the *global* sessions dataframe
    user_session = sessions.loc[sessions['user_id'] == user]
    user_session_data = pd.Series()
    # Make calculations and append to user_session_data
    return user_session_data
# The DataFrame users contains ID, and other info for each user
users = pd.read_csv('users.csv')
# Each row is the details of one user action.
# There is several rows with the same user ID
sessions = pd.read_csv('sessions.csv')
p = multiprocessing.Pool(4)
sessions_id = sessions['user_id'].unique()
# I'm passing an integer ID argument to process() function so
# there is no copy of the big sessions DataFrame
result = p.map(process, sessions_id)
Things I've tried:
Passing the DataFrame instead of an integer ID argument, to avoid the sessions.loc... line of code. This approach slows the script down a lot.
Also, I've looked at How to share pandas DataFrame object between processes? but didn't find a better way.
You can try defining process as:
def process(sessions, user):
...
And put it wherever you prefer.
Then when you call p.map you can use the functools.partial function, which allows you to specify arguments incrementally:
from functools import partial
...
p.map(partial(process, sessions), sessions_id)
This should not slow down the processing too much, and it should address your issue.
Note that you could do the same without partial as well, using:
p.map(lambda id: process(sessions, id), sessions_id)
(Keep in mind, though, that multiprocessing has to pickle the mapped function, and lambdas cannot be pickled by the standard pickler, so partial is usually the safer choice.)
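On platforms that spawn new processes instead of forking (e.g. Windows), the pool setup should also sit under an if __name__ == '__main__': guard. A trimmed, illustrative sketch of the partial-based approach, with a placeholder calculation:
from functools import partial
import multiprocessing
import pandas as pd

def process(sessions, user):
    # locate all the sessions for this user in the DataFrame passed in
    user_session = sessions.loc[sessions['user_id'] == user]
    return len(user_session)  # placeholder for the real per-user calculation

if __name__ == '__main__':
    sessions = pd.read_csv('sessions.csv')
    with multiprocessing.Pool(4) as p:
        result = p.map(partial(process, sessions), sessions['user_id'].unique())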
So I just found out today that I can import variables from other Python files. For example, I can have a variable green = 1, and have a completely separate file use that variable. I've found that this is really helpful for functions.
So, here's my question. I'm sorry if the title didn't help very much, I wasn't entirely sure what this would be called.
I want my program to ask the player what his or her name is. Once the player has answered, I want that variable to be stored in a separate Python file, "player_variables.py", so every time I want to say the player's name, instead of having to go to from game import name, I can use from player_variables import name, just to make it easier.
I fully understand that this is a lazy man's question, but I'm just trying to learn as much as I could. I'm still very new, and I'm sorry if this question is ridiculous. :).
Thanks for the help, I appreciate it. (be nice to me!)
From your question, I think you're confusing some ideas about variables and the values of variables.
1) Simply creating a python file with variable names allows you to access the values of those variables defined therein. Example:
# myvariables.py
name = 'Steve'
# main.py
import myvariables
print myvariables.name
Since this requires that the variables themselves be defined as global, this solution is not ideal. See this link for more: https://docs.python.org/2/reference/simple_stmts.html#global
2) However, your question states that you want your program to save a user-entered value back into a Python file. That is another thing entirely: it is metaprogramming in Python, since your program would be outputting Python code, in this case a file with the same name as your variables file.
Given the above, if this is how you're accessing and updating values, it is simpler to have a config file which can be loaded or changed at will. Metaprogramming is only necessary when you really need it.
I think you should use a data format for storing data, not a data manipulating programming language for this:
import json
import myconfig
def load_data():
    with open(myconfig.config_loc, 'r') as json_data:
        return json.loads(json_data.read())
def save_data(name, value):
    existing_data = load_data()
    existing_data[name] = value
    with open(myconfig.config_loc, 'w') as w:
        w.write(json.dumps(existing_data, indent=4, sort_keys=True))
You should only store the location of this json file in your myvariables.py file, but I'd call it something else (in this case, myconfig, or just config).
More on the json library can be found here.
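As a usage sketch (module and file names are illustrative, and the JSON file is assumed to already exist, e.g. containing {}):
# myconfig.py -- only the location of the JSON data file
config_loc = 'player_data.json'

# in any other module, assuming the load/save functions above live in storage.py
import storage
storage.save_data('name', 'Steve')
print(storage.load_data()['name'])  # -> Steve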
I know this must be a trivial question, but I've tried many different ways and searched quite a bit for a solution: how do I create and reference subfunctions in the current module?
For example, I am writing a program to parse through a text file, and for each of the 300 different names in it, I want to assign it to a category.
There are 300 of these, and I have a list of them structured so as to create a dict, of the form lookup[key] = value (bonus question: is there any more efficient or sensible way to do this than a massive dict?).
I would like to keep all of this in the same module, but with the functions (dict initialisation, etc.) at the end of the file, so I don't have to scroll down 300 lines to see the code, i.e. laid out as in the example below.
When I run it as below, I get the error 'initLookups is not defined'. When I structure it so that it is initialisation, then function definition, then function use, there is no problem.
I'm sure there must be an obvious way to initialise the functions and associated dict without keeping the code inline, but I have tried quite a few so far without success. I can put it in an external module and import it, but would prefer not to, for simplicity.
What should I be doing in terms of module structure? Is there any better way than using a dict to store this lookup table? (It is 300 unique text keys mapping onto approximately 10 categories.)
Thanks,
Brendan
import ..... (initialisation code,etc )
initLookups() # **Should create the dict - How should this be referenced?**
print getlookup(KEY) # **How should this be referenced?**
def initLookups():
    global lookup
    lookup = {}
    lookup["A"] = "AA"
    lookup["B"] = "BB"
    (etc etc etc....)
def getlookup(name):
    if name in lookup.keys():
        getlookup = lookup[name]
    else:
        getlookup = ""
    return getlookup
A function needs to be defined before it can be called. If you want to have the code that needs to be executed at the top of the file, just define a main function and call it from the bottom:
import sys
def main(args):
    pass
# All your other function definitions here
if __name__ == '__main__':
    exit(main(sys.argv[1:]))
This way, whatever you reference in main will have been parsed and is hence known already. The reason for testing __name__ is that in this way the main method will only be run when the script is executed directly, not when it is imported by another file.
Side note: a dict with 300 keys is by no means massive, but you may want to either move the code that fills the dict to a separate module, or (perhaps more fancy) store the key/value pairs in a format like JSON and load it when the program starts.
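A minimal sketch of the JSON variant, with an illustrative file name and contents:
import json

def load_lookup(path='lookup.json'):
    # lookup.json would contain something like {"A": "AA", "B": "BB"}
    with open(path) as fh:
        return json.load(fh)

lookup = load_lookup()
print(lookup.get("A", ""))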
Here's a more Pythonic way to do this. There aren't a lot of choices, BTW.
A function must be defined before it can be used. Period.
However, you don't have to strictly order all functions for the compiler's benefit. You merely have to put your execution of the functions last.
import # (initialisation code,etc )
def initLookups(): # Definitions must come before actual use
    lookup = {}
    lookup["A"] = "AA"
    lookup["B"] = "BB"
    (etc etc etc....)
    return lookup
# Any functions initLookups uses can be defined here,
# as long as they're findable in the same module.
if __name__ == "__main__": # Use comes last
    lookup = initLookups()
    print lookup.get("Key", "")
Note that you don't need the getlookup function, it's a built-in feature of a dict, named get.
Also, "initialisation code" is suspicious. An import should not "do" anything. It should define functions and classes, but not actually provide any executable code. In the long run, executable code that is processed by an import can become a maintenance nightmare.
The most notable exception is a module-level Singleton object that gets created by default. Even then, be sure that the mystery object which makes a module work is clearly identified in the documentation.
If your lookup dict is unchanging, the simplest way is to just make it a module scope variable. ie:
lookup = {
'A' : 'AA',
'B' : 'BB',
...
}
If you may need to make changes, and later re-initialise it, you can do this in an initialisation function:
def initLookups():
    global lookup
    lookup = {
        'A' : 'AA',
        'B' : 'BB',
        ...
    }
(Alternatively, lookup.update({'A':'AA', ...}) to change the dict in-place, affecting all callers with access to the old binding.)
However, if you've got these lookups in some standard format, it may be simpler to just load them from a file and create the dictionary from that.
You can arrange your functions as you wish. The only rule about ordering is that the accessed variables must exist at the time the function is called - it's fine if the function has references to variables in the body that don't exist yet, so long as nothing actually tries to use that function. ie:
def foo():
    print greeting, "World" # Note that greeting is not yet defined when foo() is created
greeting = "Hello"
foo() # Prints "Hello World"
But:
def foo():
    print greeting, "World"
foo() # Gives an error - greeting not yet defined.
greeting = "Hello"
One further thing to note: your getlookup function is very inefficient. Using "if name in lookup.keys()" is actually getting a list of the keys from the dict, and then iterating over this list to find the item. This loses all the performance benefit the dict gives. Instead, "if name in lookup" would avoid this, or even better, use the fact that .get can be given a default to return if the key is not in the dictionary:
def getlookup(name):
    return lookup.get(name, "")
I think that keeping the names in a flat text file and loading them at runtime would be a good alternative. I try to stick to the lowest level of complexity possible with my data, starting with plain text and working up to an RDBMS (I lifted this idea from The Pragmatic Programmer).
Dictionaries are very efficient in python. It's essentially what the whole language is built on. 300 items is well within the bounds of sane dict usage.
names.txt:
A = AAA
B = BBB
C = CCC
getname.py:
import sys
FILENAME = "names.txt"
def main(key):
    pairs = (line.split("=") for line in open(FILENAME))
    names = dict((x.strip(), y.strip()) for x, y in pairs)
    return names.get(key, "Not found")
if __name__ == "__main__":
    print main(sys.argv[-1])
If you really want to keep it all in one module for some reason, you could just stick a string at the top of the module. I think that a big swath of text is less distracting than a huge mess of dict initialization code (and easier to edit later):
import sys
LINES = """
A = AAA
B = BBB
C = CCC
D = DDD
E = EEE""".strip().splitlines()
PAIRS = (line.split("=") for line in LINES)
NAMES = dict((x.strip(), y.strip()) for x,y in PAIRS)
def main(key):
    return NAMES.get(key, "Not found")
if __name__ == "__main__":
    print main(sys.argv[-1])