Using Resources Module to Import Data Files - python

Background
I have a function called get_player_call_logic_df that reads a CSV file from the PLAYER_TEST_INPUT_DIR path. I have a module called player_test, and inside it another folder called player_test_input where I store all the CSV files used for processing.
Code
from pathlib import Path

import pandas as pd

PLAYER_TEST_INPUT_DIR = Path("../dev/playerassignment/src/player/player/player_test/player_test_input/")

def get_player_call_logic_df() -> pd.DataFrame:
    df = pd.read_csv(
        PLAYER_TEST_INPUT_DIR / "player_call_logic.csv"
    )
    return df
Issue
I created a PR and got a very good suggestion to look at the importlib.resources module: you can store the data as a "resource" in the library and then, instead of referring to the data via a file path, import it much like any other player module.
I am unsure how I would use the resources module here. I read up on the docs, and here is what I could come up with. I can probably do something like this:
from importlib import resources

def get_player_call_logic_df() -> pd.DataFrame:
    with resources.path("player.player_test_input", "player_call_logic.csv") as df:
        return df
I feel like I am doing the same thing, so I am not sure how to use the resources module correctly. Any help would be appreciated, as I am new to Python.

Please use

from importlib import resources

import pandas as pd

def get_player_call_logic_df() -> pd.DataFrame:
    # the context manager yields a real filesystem path to the resource
    with resources.path("player.player_test_input", "player_call_logic.csv") as csv_path:
        return pd.read_csv(csv_path)
and bear in mind that you need an __init__.py file inside the player_test_input folder:
.
└── player
    ├── __init__.py
    └── player_test_input
        ├── __init__.py
        └── player_call_logic.csv
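If you are on Python 3.9+, note that resources.path() is a legacy API (deprecated as of Python 3.11); the files() API is the recommended replacement. A minimal sketch, assuming the same package layout as above:

from importlib import resources

import pandas as pd

def get_player_call_logic_df() -> pd.DataFrame:
    # files() returns a Traversable for the package's data files;
    # as_file() materializes a real filesystem path for it (even when
    # the package is imported from a zip archive)
    csv_resource = resources.files("player.player_test_input").joinpath("player_call_logic.csv")
    with resources.as_file(csv_resource) as csv_path:
        return pd.read_csv(csv_path)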
Very good references and alternatives can be found here and here.

Related

understanding hierarchical python modules & packages

I am trying to work with Python packages and modules for the first time and have come across some import errors I don't understand.
My project has the following structure:
upper
├── __init__.py
├── upper_file.py    # contains "from middle.middle_file import *"
└── middle
    ├── __init__.py
    ├── middle_file.py    # contains "from lower.lower_file import Person, Animal"
    └── lower
        ├── __init__.py
        └── lower_file.py    # contains the classes Person and Animal
I can run middle_file.py and create a Person() and an Animal() inside it without any problems.
If I try to run upper_file.py, I get ModuleNotFoundError: No module named 'lower'.
However, I have no trouble importing Animal() or Person() in upper_file.py directly with from middle.lower.lower_file import *.
If I change the import statement inside middle_file.py from from lower.lower_file import Person, Animal to from middle.lower.lower_file import Person, Animal, I can successfully run upper_file.py but not middle_file.py itself (and PyCharm underlines the import in middle_file.py in red and says it doesn't know middle).
In the end, I need to access, inside upper_file.py, a class that is located inside middle_file.py, but middle_file.py itself depends on the imports from lower_file.py.
I already read through this answer and the docs but just don't get how it works and why it behaves the way it does.
Thanks for any help in advance.
You should use relative imports to accomplish this.
In middle_file.py, try from .lower.lower_file import *. That should solve the issue in upper_file.py.
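A minimal sketch of the fix, assuming the layout above. Note that with a relative import, middle_file.py can no longer be run directly as a script; run it as a module instead (e.g. python -m middle.middle_file from inside the upper directory):

# middle/middle_file.py
# "." means "the package containing this file" (middle), so this resolves
# to middle.lower.lower_file no matter where the importer lives
from .lower.lower_file import Person, Animal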

How to build package like pandas/numpy where pd/np is an object with all the functions

As per the title, I am trying to build a Python package myself. I am already familiar with writing Python packages from the notes at https://packaging.python.org/en/latest/tutorials/packaging-projects/ and https://docs.python.org/3/tutorial/modules.html#packages. These gave me an idea of how to write classes and functions that I can import.
What I want is to write a package like pandas or numpy, where I run an import and it works as an "object", that is to say most/all of the functions are dotted after the package name.
E.g. after importing

import pandas as pd
import numpy as np

pd and np have all the functions available, callable as pd.read_csv() or np.arange(), and running dir(pd) and dir(np) lists the various functions they provide. I tried looking at the pandas source code to try and replicate this behaviour, but I could not do it. Maybe there are some parts that I am missing or misunderstanding. Any help or a pointer in the right direction would be much appreciated.
As a more general example, I want to write a package and import it to have the functionality dotted after it: import pypack and then call pypack.FUNCTION(), instead of having to import the function as from pypack.module import FUNCTION and call FUNCTION(), or importing it as just a submodule.
I hope my question makes sense, as I have no formal training in writing software.
Let's assume you have a module (package) called my_library.

.
├── main.py
└── my_library/
    └── __init__.py

/my_library/__init__.py:

def foo(x):
    return x

In your main.py you can import my_library:

import my_library

print(my_library.foo("Hello World"))

The directory with __init__.py will be your package and can be imported.
Now consider an even deeper example.

.
├── main.py
└── my_library/
    ├── __init__.py
    └── inner_module.py

inner_module.py:

def bar(x):
    return x

In your /my_library/__init__.py you can add:

from .inner_module import bar

def foo(x):
    return x

You can use bar() in your main.py as follows:

import my_library

print(my_library.foo("Hello World"))
print(my_library.bar("Hello World"))
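If you also want to make the package's public surface explicit, you can add __all__ to __init__.py. A small, optional sketch building on the example above:

# my_library/__init__.py
from .inner_module import bar

__all__ = ["foo", "bar"]  # names exported by "from my_library import *"

def foo(x):
    return x

This re-export pattern in __init__.py is exactly how packages like pandas and numpy make everything callable off a single top-level name.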

what is .random. in this statement?

numpy.random.randn(100)
I understand numpy is the name of the imported module and randn is a function defined within it, but I'm not sure what .random. is.
Thanks and happy new year!
Yann's answer is definitely correct but might not make the whole picture clear.
The best analogy for package structure is probably folders. Imagine the whole numpy package as a big folder. In that folder are a bunch of files; these are our functions. But you also have subfolders. random is one of these subfolders. It contains more files (functions) that are grouped together because they deal with the same thing, namely randomness.
numpy
├── arccos
├── vectorize
├── random
│   ├── randn
│   └── <more functions in the random subfolder>
└── <more functions in the numpy folder>
The .random part is a module within numpy. You can confirm this in the Python interpreter:

# first import numpy into the interpreter
import numpy
# evaluating the bare name makes the interpreter display info about the random module
numpy.random

The output should be something like <module 'numpy.random' from 'path to module'>.

Elegant way to refer to files in data science project

I have spent the last few days learning how to structure a data science project to keep it simple, reusable and pythonic. Sticking to this guideline, I have created my_project. You can see its structure below.
├── README.md
├── data
│   ├── processed    <-- data files
│   └── raw
├── notebooks
│   └── notebook_1
├── setup.py
│
├── settings.py      <-- settings file
└── src
    ├── __init__.py
    │
    └── data
        └── get_data.py    <-- script
I defined a function that loads data from ./data/processed. I want to use this function in other scripts and also in Jupyter notebooks located in ./notebooks.
import random

import pandas as pd

def data_sample(code=None):
    df = pd.read_parquet('../../data/processed/my_data')
    if not code:
        code = random.choice(df.code.unique())
    df = df[df.code == code].sort_values('Date')
    return df
Obviously this function won't work anywhere unless I run it directly in the script where it is defined.
My idea was to create settings.py where I'd declare:
from os.path import join, dirname
DATA_DIR = join(dirname(__file__), 'data', 'processed')
So now I can write:

from my_project import settings
import os

def data_sample(code=None):
    file_path = os.path.join(settings.DATA_DIR, 'my_data')
    df = pd.read_parquet(file_path)
    if not code:
        code = random.choice(df.code.unique())
    df = df[df.code == code].sort_values('Date')
    return df
Questions:
Is it common practice to refer to files in this way? settings.DATA_DIR looks kinda ugly.
Is this at all how settings.py should be used? And should it be placed in this directory? I have seen it in a different spot in this repo, under .samr/settings.py.
I understand that there might not be 'one right answer'; I'm just trying to find a logical, elegant way of handling these things.
I'm maintaining an economics data project based on the DataDriven Cookiecutter, which I feel is a great template.
Separating your data folders and code seems an advantage to me, allowing you to treat your work as a directed flow of transformations (a 'DAG'), starting with immutable initial data and going to interim and final results.
Initially, I reviewed pkg_resources, but declined to use it (long syntax and a short understanding of creating a package) in favour of my own helper functions/classes that navigate the directory tree.
Essentially, the helpers do two things:
1. Persist the project root folder and some other paths in constants:
from pathlib import Path

# shorter version
ROOT = Path(__file__).parents[3]

# longer version
def find_repo_root():
    """Returns root folder for repository.

    Current file is assumed to be:
        <repo_root>/src/kep/helper/<this file>.py
    """
    levels_up = 3
    return Path(__file__).parents[levels_up]

ROOT = find_repo_root()

DATA_FOLDER = ROOT / 'data'
UNPACK_RAR_EXE = str(ROOT / 'bin' / 'UnRAR.exe')
XL_PATH = str(ROOT / 'output' / 'kep.xlsx')
This is similar to what you do with DATA_DIR. A possible weak point is that here I manually hardcode the relative location of the helper file with respect to the project root. If the helper file is moved, this needs to be adjusted. But hey, this is the same way it is done in Django.
2. Allow access to specific data in raw, interim and processed folders.
This can be a simple function returning the full path for a filename in a given folder, for example:

def interim(filename):
    """Return path for *filename* in the 'data/interim' folder."""
    return str(ROOT / 'data' / 'interim' / filename)
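Used, hypothetically, as:

# resolves to "<repo_root>/data/interim/tab.csv"
path = interim('tab.csv')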
In my project I have year-month subfolders for the interim and processed directories, and I address data by year, month and sometimes frequency. For this data structure I have InterimCSV and ProcessedCSV classes that reference specific paths, like:

from .helper import ProcessedCSV, InterimCSV

# somewhere in code
csv_text = InterimCSV(self.year, self.month).text()

# later in code
path = ProcessedCSV(2018, 4).path(freq='q')
The code for the helper is here. Additionally, the classes create subfolders if they are not present (I want this for unit tests in a temp directory), and there are methods for checking that files exist and for reading their contents.
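As a rough illustration only (the real implementation is in the linked helper module; the names and layout below are guesses), such a class might look like:

from pathlib import Path

class InterimCSV:
    # hypothetical sketch of a path helper for data/interim/<year>/<month>/
    def __init__(self, year, month, root=Path('.')):
        self.path = root / 'data' / 'interim' / str(year) / f'{month:02d}' / 'tab.csv'
        # create subfolders if absent - handy for unit tests in a temp directory
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def exists(self):
        return self.path.exists()

    def text(self):
        return self.path.read_text(encoding='utf-8')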
In your example, you can easily have the root directory fixed in settings.py, but I think you can go a step further by abstracting your data access.
Currently data_sample() mixes file access and data transformation, which is not a great sign, and it also uses a global name, another bad sign for a function. I suggest you consider the following:
# keep this in settings.py
def processed(filename):
    return os.path.join(DATA_DIR, filename)

# this works on a dataframe - your argument is a dataframe,
# and you return a dataframe
def transform_sample(df: pd.DataFrame, code=None) -> pd.DataFrame:
    # FIXME: what is `code`?
    if not code:
        code = random.choice(df.code.unique())
    return df[df.code == code].sort_values('Date')

# make a small but elegant pipeline of data transformation
file_path = processed('my_data')
df0 = pd.read_parquet(file_path)
df = transform_sample(df0)
This works as long as you are not committing lots of data, and as long as you keep clear the difference between snapshots of the uncontrolled outside world and your own derived data: (code + raw) == state. It is sometimes useful to keep raw append-only-ish and to think about symlinking steps like raw/interesting_source/2018.csv.gz -> raw_appendonly/interesting_source/2018.csv.gz.20180401T12:34:01, or some similar pattern, to establish a "use latest" input structure.

Try to clearly separate config settings (my_project/__init__.py, config.py, settings.py or whatever) that might need to change depending on the environment (imagine swapping out the filesystem for a blob store). setup.py usually sits at the top level, my_project/setup.py, with anything related to runnable code (not docs or examples) in my_project/my_project. Define one _mydir = os.path.dirname(os.path.realpath(__file__)) in a single place (config.py) and rely on that to avoid refactoring pain.
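A minimal sketch of that single-anchor pattern, assuming a config.py at the project root (the names are illustrative):

# config.py -- define the project anchor once; derive every other path from it
import os

_mydir = os.path.dirname(os.path.realpath(__file__))  # folder containing config.py
DATA_DIR = os.path.join(_mydir, 'data', 'processed')
RAW_DIR = os.path.join(_mydir, 'data', 'raw')

Moving the project then requires no code changes, and moving config.py itself means updating only this one file.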
No, the use of a settings.py is only common practice if you are using Django. As for referencing the data directory this way, it depends on whether you want users to be able to change the value. The way you have it set up, changing the value requires editing the settings.py file. If you want users to have a default but also be able to change it easily as they use your function, create the base path value inline and make it the default of a keyword argument: def data_sample(..., datadir=filepath):.
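A hedged sketch of that suggestion (the default path here is illustrative):

import os
import random

import pandas as pd

# illustrative default; point it at wherever your processed data actually lives
DEFAULT_DATA_DIR = os.path.join(os.path.dirname(os.path.realpath(__file__)),
                                'data', 'processed')

def data_sample(code=None, datadir=DEFAULT_DATA_DIR):
    # callers get a sensible default but can override it per call,
    # e.g. data_sample(datadir='/tmp/test_data')
    df = pd.read_parquet(os.path.join(datadir, 'my_data'))
    if not code:
        code = random.choice(df.code.unique())
    return df[df.code == code].sort_values('Date')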
You can open a file using open(), save the handle in a variable, and keep using that variable wherever you wish to refer to the file:

with open('Test.txt', 'r') as f:
    ...

or

f = open('Test.txt', 'r')

and use f to refer to the file.
If you want the file to be both readable and writable, you can use 'r+' in place of 'r'.

Imported function not working

I am pretty new to Python and am trying to import a function I have made in a separate file. When I run the code I get TypeError: signal() missing 1 required positional argument: 'handler'. I think it means the signal function is not being passed an argument, but I am pretty sure that is what the for loop does. Where am I going wrong? Also, the files are in the same folder, which is part of the working directory. The code that calls the function is:
import numpy as np
t = np.linspace(-5, 5, 200)
import signal
y = []
for i in t:
    y.append(signal.signal(i))
The function code is saved in a file called signal.py. The code is:
def signal(t):
    import numpy as np
    y = np.cos(t) * np.exp(-abs(t))
    return y
It seems you are importing the standard library's signal module instead of your own file. Try to import it like this:

from .signal import signal

PS: since you are new to Python, you should also make sure you have an __init__.py file in the directory, like so:

/Parent
    __init__.py
    main.py
    signal.py
As suggested by chepner, you have a module name conflict with Python's built-in signal module.
If the name is not important, you could simply rename your file.
If the name is important, you could create a package, place the file inside it, and import it from there.
For example, your directory tree would then be:

signal_module/
├── __init__.py
└── signal.py
original_file.py

Then import it as follows:

from signal_module import signal

The __init__.py file is important.
It can be empty, but it needs to exist for Python to treat the directory as a package.
As you said you are new to Python, have a look at this answer to learn more about the importance of __init__.py.
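Putting it together, a minimal sketch of original_file.py under the layout above:

# original_file.py
import numpy as np

from signal_module import signal  # your signal.py, not the stdlib "signal" module

t = np.linspace(-5, 5, 200)
y = [signal.signal(x) for x in t]  # module "signal", function "signal"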
