How can I create a dynamic logging structure in scrapy?

I am building a scraper using the Scrapy framework, and I want to log every run in a structured manner. I have already set up dynamic log file names based on the timestamp in settings.py:
import os
from datetime import datetime
LOG_DIR = os.path.join(BASE_DIR, 'logs')
if not os.path.exists(LOG_DIR):
    try:
        os.mkdir(LOG_DIR)
    except OSError as e:
        pass
LOG_FILE = os.path.join(LOG_DIR, f'{datetime.now().timestamp()}.log')
Now I want to store the logs in a nested directory structure, which would make it easier to find the right log,
i.e.
- ROOT
|- logs
| |- 21-02-2022
| |- 22-02-2022
| | |- us
| | |- UK
| | | |- t-shirt
| | | |- hoodie
| | | | |- 1653029099.520938.log
Can someone please point me to how I can achieve this?
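One possible approach is to build the nested directory from the run date plus the values that distinguish a run, and to create it with os.makedirs before assigning LOG_FILE. The sketch below is only an illustration: the helper name build_log_file and the country and category arguments are hypothetical, and how those values reach settings.py (for example through environment variables) is an assumption that depends on how the spider is launched.

```python
import os
from datetime import datetime


def build_log_file(log_root, country, category, now=None):
    """Return <log_root>/<dd-mm-YYYY>/<country>/<category>/<timestamp>.log."""
    now = now or datetime.now()
    log_dir = os.path.join(log_root, now.strftime('%d-%m-%Y'), country, category)
    # makedirs creates every missing level; exist_ok avoids an error on reruns.
    os.makedirs(log_dir, exist_ok=True)
    return os.path.join(log_dir, f'{now.timestamp()}.log')
```

In settings.py this could be used as LOG_FILE = build_log_file(os.path.join(BASE_DIR, 'logs'), 'UK', 'hoodie'), with the last two values replaced by whatever identifies the run.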

Related

How to get absolute path of root directory from anywhere within the directory in python

Let's say I have the following directory
model_folder
|
|
------- model_modules
| |
| ---- __init__.py
| |
| ---- foo.py
| |
| ---- bar.py
|
|
------- research
| |
| ----- training.ipynb
| |
| ----- eda.ipynb
|
|
------- main.py
and I want to import model_modules into a script in research
I can do that with the following
import sys
sys.path.append('/absolute/path/model_folder')
from model_modules.foo import Foo
from model_modules.bar import Bar
However, let's say I don't explicitly know the absolute path of the root, or perhaps just don't want to hardcode it because its location may change. How could I get the absolute path of model_folder from anywhere in the directory so I could do something like this?
import sys
sys.path.append(root)
from model_modules.foo import Foo
from model_modules.bar import Bar
I referred to this question in which one of the answers recommends adding the following to the root directory, like so:
utils.py
from pathlib import Path
def get_project_root() -> Path:
    return Path(__file__).parent.parent
model_folder
|
|
------- model_modules
| |
| ---- __init__.py
| |
| ---- foo.py
| |
| ---- bar.py
|
|
|
------- src
| |
| ---- utils.py
|
|
|
|
|
------- research
| |
| ----- training.ipynb
| |
| ----- eda.ipynb
|
|
------- main.py
But then when I try to import this into a script in a subdirectory, like training.ipynb, I get an error:
from src.utils import get_project_root
root = get_project_root()
ModuleNotFoundError: No module named 'src'
So my question is, how can I get the absolute path to the root directory from anywhere within the directory in python?

sys.path[0] contains your root directory (the directory of the script that was used to start the interpreter). You can use that to add your sub-directories:
import sys
sys.path.append(sys.path[0] + "/model_modules")
import foo
And for cases where a foo.py may also exist elsewhere on the path:
import sys
sys.path.insert(1, sys.path[0] + "/model_modules")  # put near front of list
import foo
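Because sys.path[0] depends on how the interpreter was started, another option is to walk up from a known location until a marker file is found. Here is a minimal sketch, assuming main.py sits in the project root as in the layout from the question; the function name find_project_root and the choice of marker are illustrative assumptions, not part of any library.

```python
from pathlib import Path


def find_project_root(start, marker="main.py"):
    """Return the closest directory at or above `start` that contains `marker`."""
    start = Path(start).resolve()
    # Check the starting directory first, then each parent up to the filesystem root.
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"no directory containing {marker!r} above {start}")
```

From a notebook in research, this could be used as root = find_project_root(Path.cwd()) followed by sys.path.append(str(root)), after which from model_modules.foo import Foo should resolve.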

Python extension with multiple modules

I am building Python bindings for a standalone C library that I wrote. The file layout of the library is as follows:
<project root>
|
`- cpython
| |
| `- module1_mod.c
| `- module2_mod.c
| `- module3_mod.c
|
`- include
| |
| `- module1.h
| `- module2.h
| `- module3.h
|
`- src
| |
| `- module1.c
| `- module2.c
| `- module3.c
|
`- setup.py
I want to obtain a Python package so I can import modules in a namespace such as my_package.module1, my_package.module2, etc.
This is my setup.py so far:
from os import path
from setuptools import Extension, setup
ROOT_DIR = path.dirname(path.realpath(__file__))
MOD_DIR = path.join(ROOT_DIR, 'cpython')
SRC_DIR = path.join(ROOT_DIR, 'src')
INCL_DIR = path.join(ROOT_DIR, 'include')
EXT_DIR = path.join(ROOT_DIR, 'ext')
ext_libs = [
    path.join(EXT_DIR, 'ext_lib1', 'lib.c'),
    # [...]
]
setup(
    name="my_package",
    version="1.0a1",
    ext_modules=[
        Extension(
            "my_package.module1",
            [
                path.join(SRC_DIR, 'module1.c'),
                path.join(MOD_DIR, 'module1_mod.c'),
            ] + ext_libs,
            include_dirs=[INCL_DIR],
            libraries=['uuid', 'pthread'],
        ),
    ],
)
Importing my_package.module1 works, but the problem is that the external libraries are also needed by module2 and module3 (not all of them for all the modules), and I assume that if I include the same external libs in the other modules, I would get a lot of bloat.
I looked at sample setups on GitHub but haven't found an example that resolves this problem.
What is a good way to organize my builds?
EDIT: This is actually a more severe problem in that I have symbols in module1 that are needed in module2, etc. E.g. an object in module2 requires an object type defined in module1. If I create separate binaries without including all sources for each dependency, the symbols won't be available at linking time, thus increasing redundancy and complexity of keeping track of what is needed for which module.

After a couple of days of digging into Python bug reports and scarcely documented features, I found an answer to this, which resolved both the multiple external dependencies and the internal cross-linking.
The solution was to create a monolithic "module" with all the modules defined inside it, then exposing them with a few lines of Python code in a package initialization file.
To do this I changed the module source files to header files, maintaining most of their methods static and only exposing the PyTypeObject structs and my object type structs so they can be used in other modules.
Then I moved the PyMODINIT_FUNC functions defining all the modules into a "package" module (py_mypackage.c), which also defines an empty module. The "package" module is defined as _my_package.
Finally I added some internal machinery to an __init__.py script that extracts the module symbols from the .so file and exposes them as modules of the package. This is documented in the Python docs:
import importlib.util
import sys
import _my_package
pkg_path = _my_package.__file__
def _load_module(mod_name, path):
    spec = importlib.util.spec_from_file_location(mod_name, path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[mod_name] = module
    spec.loader.exec_module(module)
    return module
for mod_name in ('module1', 'module2', 'module3'):
    locals()[mod_name] = _load_module(mod_name, pkg_path)
The new layout is thus:
<project root>
|
`- cpython
| |
| `- my_package
| |
| `- __init__.py
|
| `- py_module1.h
| `- py_module2.h
| `- py_module3.h
| `- py_mypackage.c
|
`- include
| |
| `- module1.h
| `- module2.h
| `- module3.h
|
`- src
| |
| `- module1.c
| `- module2.c
| `- module3.c
|
`- setup.py
And setup.py:
setup(
    name="my_package",
    version="1.0a1",
    package_dir={'my_package': path.join(CPYTHON_DIR, 'my_package')},
    packages=['my_package'],
    ext_modules=[
        Extension(
            "_my_package",
            "<all .c files in cpython folder + ext library sources>",
            libraries=[...],
        ),
    ],
)
For the curious, the complete code is at https://notabug.org/scossu/lsup_rdf/src/e08da1a83647454e98fdb72f7174ee99f9b8297c/cpython (pinned at the current commit).
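The placeholder for the source list in the setup.py above could be filled with a small helper. This is only a sketch: collect_c_sources is a hypothetical name, and which directories hold the sources is an assumption based on the layout shown earlier, not something taken from the linked repository.

```python
from pathlib import Path


def collect_c_sources(*directories):
    """Return the .c files found directly inside each directory, in a stable order."""
    sources = []
    for directory in directories:
        # glob("*.c") is not recursive; sorting keeps builds reproducible.
        sources.extend(sorted(str(p) for p in Path(directory).glob("*.c")))
    return sources
```

The result could then be passed as the second argument to Extension, for example collect_c_sources('cpython', 'src') plus any external library sources.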

PyCharm PYTHONPATH with different parts of single logical package

Assume I have projects deployment and cms with this structure:
+ deployment
| + src
| | + my_company
| | | + __init__.py
| | | + deployment
| | | | + ...
+ cms
| + src
| | + my_company
| | | + __init__.py
| | | + cms
| | | | + ...
+ ...
My company has many projects that are distributed as a single logical package, my_company. This is made possible by calling extend_path in each my_company/__init__.py file:
https://docs.python.org/2/library/pkgutil.html#pkgutil.extend_path
Imports like these then work:
from my_company import cms
from my_company import deployment
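The mechanism described above can be reproduced with a throwaway sketch. It builds two source roots in a temporary directory, each holding its own portion of my_company, and shows that both subpackages become importable once pkgutil.extend_path is called in each __init__.py. The directory layout mirrors the question; the helper name make_source_root and the NAME attribute are illustrative assumptions.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Contents of every my_company/__init__.py, following the pkgutil documentation.
INIT_SOURCE = (
    "from pkgutil import extend_path\n"
    "__path__ = extend_path(__path__, __name__)\n"
)


def make_source_root(base, subpackage):
    """Create <base>/<subpackage>/src/my_company/<subpackage>/ and return the src path."""
    src = Path(base) / subpackage / "src"
    pkg = src / "my_company"
    (pkg / subpackage).mkdir(parents=True)
    (pkg / "__init__.py").write_text(INIT_SOURCE)
    (pkg / subpackage / "__init__.py").write_text(f"NAME = {subpackage!r}\n")
    return str(src)


# Keep a reference so the temporary directory lives until the interpreter exits.
_tmp = tempfile.TemporaryDirectory()
for name in ("deployment", "cms"):
    sys.path.append(make_source_root(_tmp.name, name))

importlib.invalidate_caches()
# extend_path scans sys.path, so both portions of my_company are found.
cms = importlib.import_module("my_company.cms")
deployment = importlib.import_module("my_company.deployment")
```

This only illustrates the runtime behaviour; it does not change how PyCharm indexes the source roots.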
The problem comes when I mark all src directories as Sources Root in PyCharm. PyCharm then sees only one package (probably the first it encounters) at the first level of imports in the suggestions box, so when I type import my_company. the only suggestion is deployment. Strangely, the second level of imports works fine: all suggestions for import my_company.cms. appear as soon as I type the dot after the cms package name.
Is there any option in settings to fix this problem?

It looks like this is a known issue: https://youtrack.jetbrains.com/issue/PY-23087.

Get absolute path in Python

I'm having a problem with assigning static files containing resources.
My working directory structure is:
|- README.md
|- nlp
| |-- morpheme
| |-- |-- morpheme_builder.py
| |-- fsa_setup.py
| - tests
| |-- test_fsa.py
| - res
| |-- suffixes.xml
The code for fsa_setup.py is:
class FSASetup():
    fsa = None
    def get_suffixes():
        list_suffix = list()
        file = os.path.realpath("../res/suffixes.xml")
        .....
if __name__ == "__main__":
    FSASetup.get_suffixes()
The code for morpheme_builder.py is:
class MorphemeBuilder:
    def get_all_words_from_fsa(self):
        ......
if __name__ == "__main__":
    FSASetup.get_suffixes()
When it is called in fsa_setup.py, the file path's value is '\res\suffixes.xml', which is correct, but when it is called from morpheme_builder.py, the value is '\nlp\res\suffixes.xml'.
I understand why this happens. So how can I give the file a path to the resource that works in both cases?

The problem is that morpheme_builder.py is in the directory morpheme. When you say ../res/suffixes.xml, the path goes one directory up from there, so it ends up at nlp/res/suffixes.xml. What about using os.path.abspath("../res/suffixes.xml")?
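Note that both os.path.realpath and os.path.abspath resolve a relative path against the current working directory, so the result still depends on where the script is started from. A common alternative is to anchor the path on the location of the module itself. Here is a minimal sketch, assuming res sits next to the nlp directory as the question suggests; the helper name resource_path is hypothetical.

```python
from pathlib import Path


def resource_path(module_file, *parts):
    """Build a path under the directory that contains the module's package directory."""
    # module_file is e.g. <root>/nlp/fsa_setup.py, so parent.parent is <root>.
    project_root = Path(module_file).resolve().parent.parent
    return project_root.joinpath(*parts)
```

Inside fsa_setup.py this could be called as file = resource_path(__file__, "res", "suffixes.xml"), which gives the same result no matter which script imports FSASetup.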

How to structure celery tasks

I have 2 types of tasks: async tasks and scheduled tasks. Here is my directory structure:
proj
|
-- tasks
|
-- __init__.py
|
-- celeryapp.py => celery instance defined in this file.
|
-- celeryconfig.py
|
-- async
| |
| -- __init__.py
| |
| -- task1.py => from proj.tasks.celeryapp import celery
| |
| -- task2.py => from proj.tasks.celeryapp import celery
|
-- schedule
|
-- __init__.py
|
-- task1.py => from proj.tasks.celeryapp import celery
|
-- task2.py => from proj.tasks.celeryapp import celery
But when I run the Celery worker as shown below, it does not work: it does not accept tasks from the celery beat scheduler.
$ celery worker --app=tasks -Q my_queue,default_queue
So, is there a best practice for organizing multiple task files?

According to the Celery documentation, you can import a set of Celery task modules like this:
For example if you have an (imagined) directory tree like this:
|
|-- foo
| |-- __init__.py
| |-- tasks.py
|
|-- bar
|-- __init__.py
|-- tasks.py
Then calling app.autodiscover_tasks(['foo', 'bar']) will result in the modules foo.tasks and bar.tasks being imported.

Celery tasks can be async, sync, or scheduled, depending on how they are invoked:
task.delay(arg1, arg2)  # async
task.delay(arg1, arg2).get()  # sync: blocks until the result is ready
task.apply_async(args=[arg1, arg2], countdown=some_seconds)  # async, executed after a delay
There are many ways to invoke a task, depending on your needs.
However, you must start Celery with the -B flag to enable the Celery scheduler:
$ celery worker --app=tasks -B -Q my_queue,default_queue
How you organize your tasks is a matter of personal preference and depends on your project's complexity, but I don't think organizing them by their type of synchronism is the best option.
I've googled this topic and haven't found any guide or advice, but I've read about some projects that organize their tasks by functionality.
Since there is no established pattern, I've followed this approach in my projects. Here is one example of how I did it:
your_app
|
-- reports
|
-- __init__.py
-- foo_report.py
-- bar_report.py
-- tasks
|
-- __init__.py
-- report_task.py
-- maintenance
|
-- __init__.py
-- tasks
|
-- __init__.py
-- delete_old_stuff_task.py
-- twitter
|
-- __init__.py
-- tasks
|
-- __init__.py
-- batch_timeline.py
