Elegant way to refer to files in data science project

I have spent the last few days learning how to structure a data science project so that it stays simple, reusable and pythonic. Sticking to this guideline I have created my_project. You can see its structure below.
├── README.md
├── data
│   ├── processed          <-- data files
│   └── raw
├── notebooks
│   └── notebook_1
├── setup.py
│
├── settings.py            <-- settings file
└── src
    ├── __init__.py
    │
    └── data
        └── get_data.py    <-- script
I defined a function that loads data from ./data/processed. I want to use this function in other scripts and also in Jupyter notebooks located in ./notebooks.
import random
import pandas as pd
def data_sample(code=None):
    df = pd.read_parquet('../../data/processed/my_data')
    if not code:
        code = random.choice(df.code.unique())
    df = df[df.code == code].sort_values('Date')
    return df
Obviously this function won't work anywhere unless I run it directly in the script where it is defined.
My idea was to create settings.py where I'd declare:
from os.path import join, dirname
DATA_DIR = join(dirname(__file__), 'data', 'processed')
So now I can write:
from my_project import settings
import os
import random
import pandas as pd
def data_sample(code=None):
    file_path = os.path.join(settings.DATA_DIR, 'my_data')
    df = pd.read_parquet(file_path)
    if not code:
        code = random.choice(df.code.unique())
    df = df[df.code == code].sort_values('Date')
    return df
Questions:
Is it common practice to refer to files this way? settings.DATA_DIR looks kind of ugly.
Is this how settings.py should be used at all? And should it be placed in this directory? I have seen it in a different spot in this repo, under .samr/settings.py
I understand that there might not be 'one right answer'; I am just trying to find a logical, elegant way of handling these things.

I'm maintaining an economics data project based on the DataDriven Cookiecutter, which I feel is a great template.
Separating your data folders from your code seems like an advantage to me: it lets you treat your work as a directed flow of transformations (a 'DAG'), starting with immutable initial data and moving on to interim and final results.
Initially I reviewed pkg_resources, but decided against it (long syntax, and I did not understand package creation well enough) in favour of my own helper functions and classes that navigate through the directory tree.
Essentially, the helpers do two things.
1. Persist the project root folder and some other paths in constants:
from pathlib import Path
# shorter version
ROOT = Path(__file__).parents[3]
# longer version
def find_repo_root():
    """Returns root folder for repository.
    Current file is assumed to be:
        <repo_root>/src/kep/helper/<this file>.py
    """
    levels_up = 3
    return Path(__file__).parents[levels_up]
ROOT = find_repo_root()
DATA_FOLDER = ROOT / 'data'
UNPACK_RAR_EXE = str(ROOT / 'bin' / 'UnRAR.exe')
XL_PATH = str(ROOT / 'output' / 'kep.xlsx')
This is similar to what you do with DATA_DIR. A possible weak point is that here I manually hardcode the location of the helper file relative to the project root. If the helper file is moved, this needs to be adjusted. But hey, this is the same way it is done in Django.
2. Allow access to specific data in raw, interim and processed folders.
This can be a simple function returning a full path by a filename in a folder, for example:
def interim(filename):
    """Return path for *filename* in 'data/interim folder'."""
    return str(ROOT / 'data' / 'interim' / filename)
In my project I have year-month subfolders for the interim and processed directories, and I address data by year, month and sometimes frequency. For this data structure I have InterimCSV and ProcessedCSV classes that return specific paths, like:
from .helper import ProcessedCSV, InterimCSV
# somewhere in code
csv_text = InterimCSV(self.year, self.month).text()
# later in code
path = ProcessedCSV(2018,4).path(freq='q')
The code for helper is here. Additionally the classes create subfolders if they are not present (I want this for unittest in temp directory), and there are methods for checking files exist and for reading their contents.
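The helper itself is not reproduced in this answer, so the following is only a rough sketch of what such a class might look like. The name MonthlyFolder, its methods and the folder layout data/<stage>/<year>/<month> are all assumptions made for illustration, not the actual InterimCSV or ProcessedCSV code.

```python
from pathlib import Path
import tempfile


class MonthlyFolder:
    """Hypothetical helper: locates <root>/data/<stage>/<year>/<month>/."""

    def __init__(self, root, stage, year, month):
        self.folder = Path(root) / "data" / stage / str(year) / f"{month:02d}"

    def path(self, filename):
        # Create the subfolder on demand so the code also works in an empty temp directory.
        self.folder.mkdir(parents=True, exist_ok=True)
        return self.folder / filename

    def exists(self, filename):
        return (self.folder / filename).is_file()


# Usage against a throwaway root, as one might do in a unit test.
with tempfile.TemporaryDirectory() as tmp:
    interim = MonthlyFolder(tmp, "interim", 2018, 4)
    target = interim.path("tab.csv")
    target.write_text("a,b\n1,2\n")
    found = interim.exists("tab.csv")
    relative = target.relative_to(tmp).as_posix()
```

Passing the root in as an argument, rather than reading a module-level constant, is what makes it easy to point the same class at a temporary directory in tests.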
In your example, you can easily keep the root directory fixed in settings.py, but I think you can go a step further by abstracting your data access.
Currently data_sample() mixes file access and data transformation, which is not a great sign, and it also uses a global name, another bad sign for a function. I suggest you consider the following:
# keep this in settings.py
def processed(filename):
    return os.path.join(DATA_DIR, filename)
# this works on a dataframe - your argument is a dataframe,
# and you return a dataframe
def transform_sample(df: pd.DataFrame, code=None) -> pd.DataFrame:
    # FIXME: what is `code`?
    if not code:
        code = random.choice(df.code.unique())
    return df[df.code == code].sort_values('Date')
# make a small but elegant pipeline of data transformation
file_path = processed('my_data')
df0 = pd.read_parquet(file_path)
df = transform_sample(df0)

This is fine as long as you are not committing lots of data and you make clear the difference between snapshots of the uncontrolled outside world and your own derived data: (code + raw) == state.
It is sometimes useful to treat raw as append-only-ish and to symlink steps like raw/interesting_source/2018.csv.gz -> raw_appendonly/interesting_source/2018.csv.gz.20180401T12:34:01, or some similar pattern, to establish a "use latest" input structure.
Try to clearly separate config settings (my_project/__init__.py, config.py, settings.py or whatever) that might need to change depending on the environment (imagine swapping out the file system for a blob store).
setup.py is usually at the top level, my_project/setup.py, and anything related to runnable code goes in my_project/my_project (docs do not; for examples I am not sure).
Define one _mydir = os.path.dirname(os.path.realpath(__file__)) in one place (config.py) and rely on that to avoid refactoring pain.
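One way to keep environment-dependent settings in a single place is to let an environment variable override the default location. This is a minimal sketch, not an established convention: the variable name MY_PROJECT_DATA_DIR and the function resolve_data_dir are made up for the example, and the package directory is passed in explicitly so the snippet can run anywhere.

```python
import os
from pathlib import Path


def resolve_data_dir(package_dir, env=None):
    """Return the data directory: the env override if set, else <package_dir>/data."""
    env = os.environ if env is None else env
    override = env.get("MY_PROJECT_DATA_DIR")  # hypothetical variable name
    if override:
        return Path(override)
    return Path(package_dir) / "data"


# In a real config.py one would likely pass the directory derived from __file__ here.
default_dir = resolve_data_dir("/opt/my_project", env={})
custom_dir = resolve_data_dir(
    "/opt/my_project", env={"MY_PROJECT_DATA_DIR": "/mnt/blob/data"}
)
```

The rest of the code then imports the resolved path from config.py and never builds paths from __file__ on its own.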

No, using a settings.py is only common practice if you are using Django. As for referencing the data directory this way, it depends on whether you want users to ever be able to change this value. The way you have it set up, changing the value requires editing the settings.py file. If you want users to have a default but also be able to change it easily when they call your function, create the base path value inline and make it the default, as in def data_sample(..., datadir=filepath):.
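A small sketch of the default-argument idea follows. It uses the standard csv module instead of pandas to stay dependency-free, and the names load_rows and DEFAULT_DATA_DIR are invented for the example. The default here is a plain relative path, which is an assumption for brevity; in a real module it would more likely be built from the module's own location.

```python
import csv
import tempfile
from pathlib import Path

# Assumed default; a real module would probably anchor this to its own location.
DEFAULT_DATA_DIR = Path("data") / "processed"


def load_rows(filename, datadir=DEFAULT_DATA_DIR):
    """Read a CSV file from datadir and return its rows as dictionaries."""
    with open(Path(datadir) / filename, newline="") as f:
        return list(csv.DictReader(f))


# A caller (or a test) overrides the location without touching any settings file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "my_data.csv").write_text("code,Date\nA,2018-01-01\n")
    rows = load_rows("my_data.csv", datadir=tmp)
```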

You can open a file with open(), keep the result in a variable, and use that variable wherever you wish to refer to the file.
with open('Test.txt', 'r') as f:
    contents = f.read()
or
f = open('Test.txt', 'r')
and use f to refer to the file.
If you want the file to be both readable and writable, you can use r+ in place of r.

Related

Using Resources Module to Import Data Files

Background
I have a function called get_player_call_logic_df which reads a csv file from the PLAYER_TEST_INPUT_DIR path. I have a module called player_test, and inside it I have another folder called player_test_input where I store all the csv files that are used for processing.
Code
PLAYER_TEST_INPUT_DIR = Path("../dev/playerassignment/src/player/player/player_test/player_test_input/")
def get_player_call_logic_df() -> pd.DataFrame:
    df = pd.read_csv(
        PLAYER_TEST_INPUT_DIR / "player_call_logic.csv"
    )
    return df
Issue
I created a PR and got a very good suggestion to look at the importlib.resources module. You can store the data as a "resource" in the library. Then, instead of referring to the data via a file path, you can import it much like any other player module.
I am unsure how I would use the resources module here. I read the docs and here is what I could come up with. I can probably do something like this.
from importlib import resources
def get_player_call_logic_df() -> pd.DataFrame:
    with resources.path("player.player_test_input", "player_call_logic.csv") as df:
        return df
I feel like I am doing the same thing, so I am not sure how to use the resources module correctly. Any help would be appreciated as I am new to Python.

Please use
from importlib import resources
import pandas as pd
def get_player_call_logic_df() -> pd.DataFrame:
    with resources.path("player.player_test_input", "player_call_logic.csv") as df:
        return pd.read_csv(df)
and bear in mind the __init__.py file inside the player_test_input folder:
.
└── player
    ├── __init__.py
    └── player_test_input
        ├── __init__.py
        └── player_call_logic.csv
Very good reference and alternatives can be found here and here
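For completeness, here is a sketch of the same idea using resources.files, the newer interface available since Python 3.9. To keep it runnable without the real player package, it builds a throwaway package in a temporary directory; the package name demo_player_pkg and the CSV contents are made up for the example.

```python
import importlib
import sys
import tempfile
from importlib import resources
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny throwaway package (hypothetical name) with one data file.
    pkg = Path(tmp) / "demo_player_pkg"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "player_call_logic.csv").write_text("player,call\nalice,1\n")

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        # files() returns a path-like object rooted at the package directory.
        resource = resources.files("demo_player_pkg").joinpath("player_call_logic.csv")
        csv_text = resource.read_text()
    finally:
        # Undo the temporary import setup.
        sys.path.remove(tmp)
        sys.modules.pop("demo_player_pkg", None)
```

With pandas, the text could then be wrapped in io.StringIO and handed to pd.read_csv, so no filesystem path is needed at all.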

File operations not working in relative paths

I am working on a python3 app with a fairly simple file structure, but I'm having issues reading a text file in a script, both of which are lower in the file structure than the script calling them. To be absolutely clear, the file structure is as follows:
app/
|- cli-script
|- app_core/
   |- dictionary.txt
   |- lib.py
cli-script calls lib.py, and lib.py requires dictionary.txt to do what I need it to, so it gets opened and read in lib.py.
The very basics of cli-script looks like this:
from app_core import lib
def cli_func():
    x = lib.Lib_Class()
    x.lib_func()
The problem area of lib is here:
class Lib_Class:
    def __init__(self):
        dictionary = open('dictionary.txt')
The problem I'm getting is that while I have this file structure, the lib file can't find the dictionary file, returning a FileNotFoundError. I would prefer to only use relative paths for portability reasons, but otherwise I just need to make the solution OS agnostic. Symlinks are a last resort option I've figured out, but I want to avoid it at all costs. What are my options?

When you run a Python script, relative paths are resolved against the directory you run it from (the current working directory), not the directory where the source file lives.
The __file__ variable stores the path of the current file (no matter where it is run from), so you can build paths to sibling files from it.
In your structure, __file__ refers to the path app/app_core/lib.py, so to build app/app_core/dictionary.txt, you need to go up one level and then down again.
app/app_core/lib.py
import os.path
class Lib_Class:
    def __init__(self):
        path = os.path.join(os.path.dirname(__file__), 'dictionary.txt')
        dictionary = open(path)
or using pathlib
path = pathlib.Path(__file__).parent / 'dictionary.txt'

Because you are expecting the dictionary.txt to be present in the same path as your lib.py file you can do the following.
Instead of dictionary = open('dictionary.txt') use
dictionary = open(Path(__file__).parent / 'dictionary.txt')
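To see why the bare relative path fails, here is a small self-contained demonstration. It recreates the app_core folder in a temporary directory and changes the working directory to its parent, which stands in for launching cli-script from app/. The explicit app_core variable plays the role that Path(__file__).parent would play inside lib.py.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    app_core = Path(tmp) / "app_core"
    app_core.mkdir()
    (app_core / "dictionary.txt").write_text("word\n")

    previous = os.getcwd()
    os.chdir(tmp)  # pretend the user launched cli-script from app/
    try:
        # Resolved against the working directory, so the file is not found.
        found_relative = Path("dictionary.txt").exists()
        # Resolved against a known directory, so the file is found.
        found_anchored = (app_core / "dictionary.txt").exists()
    finally:
        os.chdir(previous)
```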

How do I ensure that a python package module saves results to a sub-directory of that package?

I'm creating a package with the following structure
/package
    __init__.py
    /sub_package_1
        __init__.py
        other_stuff.py
    /sub_package_2
        __init__.py
        calc_stuff.py
    /results_dir
I want to ensure that calc_stuff.py will save results to /results_dir, unless otherwise specified (yes, I'm not entirely certain having a results directory in my package is the best idea, but it should work well for now). However, since I don't know from where, or on which machine, calc_stuff will be run, I need the package, or at least calc_stuff.py, to know where it is saved.
So far the two approaches I have tried:
from os import path
saved_dir = path.join(path.dirname(__file__), 'results_dir')
and
from pkg_resources import resource_filename
filepath = resource_filename(__name__, 'results_dir')
have only given me paths relative to the root of the package.
What do I need to do to ensure a statement along the lines of:
pickle.dump(my_data, open(os.path.join(full_path,
                                       'results_dir',
                                       'results.pkl'), 'wb'))
will result in a pickle file being saved into results_dir?

"I'm not entirely certain having a results directory in my package is the best idea": me neither :)
But if you put a function like the following inside a module in sub_package_2, it returns a path made of the module's directory, 'results_dir', and the filename you passed to the function as an argument:
def get_save_path(filename):
    import os
    return os.path.join(os.path.dirname(__file__), "results_dir", filename)
Example result for the filename foo.ext:
C:\Users\me\workspaces\workspace-oxygen\test36\TestPackage\results_dir\foo.ext
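The question also asks for an "unless otherwise specified" override. A possible sketch is below; the function name save_results and its parameters are invented for the example, and the package directory is passed in explicitly (in real code it would come from __file__) so the snippet can run in a temporary folder.

```python
import pickle
import tempfile
from pathlib import Path


def save_results(data, package_dir, results_dir=None, filename="results.pkl"):
    """Pickle data into results_dir, defaulting to <package_dir>/results_dir."""
    if results_dir is not None:
        target_dir = Path(results_dir)
    else:
        target_dir = Path(package_dir) / "results_dir"
    target_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    target = target_dir / filename
    with open(target, "wb") as f:
        pickle.dump(data, f)
    return target


with tempfile.TemporaryDirectory() as tmp:
    # tmp stands in for the package directory.
    saved = save_results({"answer": 42}, package_dir=tmp)
    with open(saved, "rb") as f:
        loaded = pickle.load(f)
    saved_relative = saved.relative_to(tmp).as_posix()
```

Using a with block also closes the file handle, which the one-line pickle.dump(..., open(...)) form leaves to the garbage collector.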

Accessing resource files in Python unit tests & main code

I have a Python project with the following directory structure:
project/
project/src/
project/src/somecode.py
project/src/mypackage/mymodule.py
project/src/resources/
project/src/resources/datafile1.txt
In mymodule.py, I have a class (let's call it "MyClass") which needs to load datafile1.txt. This sort of works when I do:
open("../resources/datafile1.txt")
assuming the code that creates the MyClass instance is run from somecode.py.
The gotcha, however, is that I have unit tests for mymodule.py which are defined in that file, and if I leave the relative pathname as described above, the unittest code blows up, because the code is now being run from project/src/mypackage instead of project/src and the relative file path doesn't resolve correctly.
Any suggestions for a best-practice approach to resolve this problem? If I move my test cases into project/src, that clutters the main source folder with test cases.

I usually use this to get a path relative to my module. I have never tried it in a unittest, though.
import os
print(os.path.join(os.path.dirname(__file__),
                   '..',
                   'resources',
                   'datafile1.txt'))
Note: The .. trick works pretty well, but if you change your directory structure you will need to update that part.

On top of the above answers, I'd like to add some Python 3 tricks to make your tests cleaner.
With the help of the pathlib library, you can make your resource imports explicit in your tests. It even handles the difference in path separators between Unix (/) and Windows (\).
Let's say we have a folder structure like this:
`-- tests
    |-- test_1.py          <-- You are here !
    |-- test_2.py
    `-- images
        |-- fernando1.jpg  <-- You want to import this image !
        `-- fernando2.jpg
You are in the test_1.py file, and you want to import fernando1.jpg. With the help of the pathlib library, you can read your test resource with object-oriented logic as follows:
import os
from pathlib import Path
current_path = Path(os.path.dirname(os.path.realpath(__file__)))
image_path = current_path / "images" / "fernando1.jpg"
with image_path.open(mode='rb') as image:
    image_bytes = image.read()  # do what you want with your image object
But there are actually convenience methods that make your code more explicit than mode='rb':
image_path.read_bytes() # Returns the file content as bytes
text_file_path.read_text() # Returns the text file content as a string
And there you go!

In each directory that contains Python scripts, put a Python module that knows the path to the root of the hierarchy. It can define a single global variable with the relative path. Import this module in each script. Python searches the directory of the script being run first, so it will always use the version of the module that sits next to the script, which holds the relative path from that directory to the root. Then use this to find your other files. For example:
# rootpath.py
rootpath = "../../../"
# in your scripts
import os
from rootpath import rootpath
datapath = os.path.join(rootpath, "src/resources/datafile1.txt")
If you don't want to put additional modules in each directory, you could use this approach:
Put a sentinel file in the top level of the directory structure, e.g. thisisthetop.txt. Have your Python script move up the directory hierarchy until it finds this file. Write all your pathnames relative to that directory.
Possibly some file you already have in the project directory can be used for this purpose (e.g. keep moving up until you find a src directory), or you can name the project directory in such a way to make it apparent.
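The sentinel-file approach can be sketched in a few lines with pathlib. The function name find_project_root is made up for the example, and the starting point is passed in as an argument (in a real script it would typically be the script's own location) so the snippet can be tried in a temporary directory.

```python
import tempfile
from pathlib import Path


def find_project_root(start, marker="thisisthetop.txt"):
    """Walk upward from start until a directory containing marker is found."""
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"{marker} not found above {start}")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "thisisthetop.txt").write_text("")
    nested = root / "src" / "mypackage"
    nested.mkdir(parents=True)
    located = find_project_root(nested)
    is_root = located == root
```

Once the root is known, every other path can be written relative to it, for example located / "src" / "resources" / "datafile1.txt".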

You can access files in a package using importlib.resources (mind Python version compatibility of the individual functions, there are backports available as importlib_resources), as described here. Thus, if you put your resources folder into your mypackage, like
project/src/mypackage/__init__.py
project/src/mypackage/mymodule.py
project/src/mypackage/resources/
project/src/mypackage/resources/datafile1.txt
you can access your resource file in code without having to rely on inferring file locations of your scripts:
import importlib.resources
file_path = importlib.resources.files('mypackage').joinpath('resources/datafile1.txt')
with file_path.open() as f:
    do_something_with(f)
Note, if you distribute your package, don't forget to include the resources/ folder when creating the package.

The file path is resolved relative to the current working directory, which is usually the directory from which you invoked the initial script. I would suggest that you pass the path in as an argument to MyClass. This way, you can use different paths depending on which script is creating MyClass.

How can I store testing data for python nosetests?

I want to write some tests for a python MFCC feature extractor for running with nosetest. As well as some lower-level tests, I would also like to be able to store some standard input and expected-output files with the unit tests.
At the moment we are hard-coding the paths to the files on our servers, but I would prefer the testing files (both input and expected-output) to be somewhere in the code repository so they can be kept under source control alongside the testing code.
The problem I am having is that I'm not sure where the best place to put the testing files would be, and how to know what that path is when nosetest calls each testing function. At the moment I am thinking of storing the testing data in the same folder as the tests and using __file__ to work out where that is (would that work?), but I am open to other suggestions.

I think that using __file__ to figure out where the test is located and storing the data alongside it is a good idea. I'm doing the same for some tests that I write.
This:
os.path.dirname(os.path.abspath(__file__))
is probably the best you are going to get, and that's not bad. :-)

Based on the idea of using __file__, maybe you could use a module to help with the path construction. You could find all the files contained in the module directory and gather their names and paths in a dictionary for later use.
Create a module accessible to your tests, i.e. a directory beside your tests such as testData, where you can put your data files. In the __init__.py of this module, insert the following code.
import os
from os.path import join, dirname, abspath
testDataFiles = dict()
baseDir = dirname(abspath(__file__)) + os.path.sep
for root, dirs, files in os.walk(baseDir):
    localDataFiles = [(join(root.replace(baseDir, ""), name), join(root, name)) for name in files]
    testDataFiles.update(dict(localDataFiles))
Assuming you called your module testData and it contains a file called data.txt, you can then use the following construct in your test to obtain the path to the file. aFileOperation is assumed to be a function that takes a path parameter.
import unittest
from testData import testDataFiles
class ATestCase(unittest.TestCase):
    def test_Something(self):
        self.assertEqual(0, aFileOperation(testDataFiles['data.txt']))
It will also allow you to use subdirectories; the key uses the platform's path separator (a backslash here, as on Windows):
    def test_SomethingInASubDir(self):
        self.assertEqual(0, aFileOperation(testDataFiles['subdir\\data.txt']))
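If the backslash in the key is a concern, a variant of the same index can be built with pathlib so that the keys always use forward slashes, whatever the platform. This is a sketch rather than a drop-in replacement: the function name index_data_files is made up, and the base directory is passed in as an argument instead of being derived from __file__.

```python
import tempfile
from pathlib import Path


def index_data_files(base_dir):
    """Map each file's POSIX-style relative path to its absolute path."""
    base = Path(base_dir).resolve()
    return {
        path.relative_to(base).as_posix(): path
        for path in base.rglob("*")
        if path.is_file()
    }


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "subdir").mkdir()
    (Path(tmp) / "data.txt").write_text("top")
    (Path(tmp) / "subdir" / "data.txt").write_text("nested")
    files = index_data_files(tmp)
    keys = sorted(files)
    nested_text = files["subdir/data.txt"].read_text()
```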
