python Packaging and distributing projects

I want to package a Python project. Once it is packaged, the modules that need to read data files fail with FileNotFoundError. My project structure is shown in the figures below.
[Figures: three screenshots of the project structure]
setup.py:
from setuptools import setup, find_packages
# To use a consistent encoding
from codecs import open
from os import path
here = path.abspath(path.dirname(__file__))
# Get the long description from the README file
with open(path.join(here, 'README.md'), encoding='utf-8') as f:
    long_description = f.read()
setup(
    name='NERChinese',
    version='0.0.6',
    # In English: "Character-level Chinese named entity recognition model based on BiLSTM-CRF"
    description='基于BiLSTM-CRF的字级别中文命名实体识别模型',
    long_description=long_description,
    long_description_content_type='text/markdown',
    # The project's main homepage.
    url='https://github.com/cswangjiawei/Chinese-NER',
    # Author details
    # author='wangjiawei',
    # author_email='cswangjiawei#163.com',
    # Choose your license
    # license='MIT',
    # What does your project relate to?
    keywords='Named-entity recognition using neural networks',
    # You can just specify the packages manually here if your project is
    # simple. Or you can use find_packages().
    packages=find_packages(),
    zip_safe=False,
    package_data={
        'NERChinese': ['data/*'],
    },
)
When the line `with open(train_path, 'r', encoding='utf-8') as f1:` in utils.py runs, I get "FileNotFoundError: [Errno 2] No such file or directory: 'data/msra_train.txt'". How should I package the data files?

It looks like the path you pass to open() is relative to the current working directory rather than to your package. Try this instead:
with open(os.path.join(os.path.dirname(__file__), 'data/msra_train.txt'), 'r', encoding='utf-8') as F1:
This way the path is resolved relative to the module inside your package, not relative to the directory you run the program from.
If that does not work, check whether the data files were actually installed next to the package in site-packages.
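To make the idea above reusable, here is a minimal sketch of a helper that builds a path next to a given module file. The name package_data_path is made up for illustration and is not part of the asker's project.

```python
from pathlib import Path


def package_data_path(module_file, *parts):
    """Build a path under the directory that contains module_file.

    module_file is normally the calling module's __file__, so the result
    does not depend on the current working directory.
    """
    return Path(module_file).resolve().parent.joinpath(*parts)


# Hypothetical use inside NERChinese/utils.py:
# train_path = package_data_path(__file__, "data", "msra_train.txt")
# with open(train_path, "r", encoding="utf-8") as f1:
#     ...
```

On Python 3.9 or newer, importlib.resources.files('NERChinese') is another way to locate files shipped inside an installed package, assuming the data directory lives inside the package.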

Related

python yaml path after deployment

This is a question about how to handle settings files and relative paths in Python (and probably about best practice as well).
I have written a small project that I want to deploy to a Docker image. Everything is set up, except that when I run the Python task (through cron) I get the error: settings/settings.yml not found.
tree .
├───settings
│   └───settings.yml
└───main.py
I reference the yml file like this:
with open('settings/settings.yml', 'r') as f:
    config = yaml.load(f, Loader=yaml.FullLoader)
I can see that this is what causes the problem, but I am unsure how to fix it. In the future I want to start the main file through setuptools entry_points, so my quick fix of cd'ing into the directory before running python main.py will not be a lasting solution.

Instead of hardcoding a path as a string, you can find the directory of the running file and build the file path with os.path. For example:
import os
import yaml
current_dir = os.path.dirname(os.path.abspath(__file__))
settings_dir = os.path.join(current_dir, "settings")
filename = "settings.yml"
settings_path = os.path.join(settings_dir, filename)
with open(settings_path, "r") as infile:
    # safe_load avoids the Loader argument that yaml.load requires in recent PyYAML versions
    settings_data = yaml.safe_load(infile)
This way the path is built from the location of the Python file, so the script can be called from any directory and on any file system.
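The same approach can be sketched with pathlib. The helper names settings_file and read_settings_text are invented for this example; the point is only that the result stays the same whatever directory cron (or anything else) starts the process in.

```python
import os
from pathlib import Path


def settings_file(anchor_file, name="settings.yml"):
    """Return <directory of anchor_file>/settings/<name>, independent of the cwd."""
    return Path(anchor_file).resolve().parent / "settings" / name


def read_settings_text(anchor_file):
    """Read the settings file as text, with a clearer error if it is missing."""
    path = settings_file(anchor_file)
    if not path.is_file():
        raise FileNotFoundError(
            f"settings file not found: {path} (cwd is {os.getcwd()})"
        )
    return path.read_text(encoding="utf-8")


# Hypothetical use inside main.py:
# text = read_settings_text(__file__)
# config = yaml.safe_load(text)
```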

Python - How to change configuration files in data package during installation

As the title states, I am writing a library that contains a data package with several configuration files.
The configuration files contain hard-coded paths to other configuration files. I would like to change these paths at installation time, so that the new hard-coded paths point to where the library is actually installed.
I tried different approaches that work well in the Windows environment, but not on Unix-based platforms (e.g. Ubuntu).
My setup.py code:
import atexit
import os
import sys
import fileinput
import fnmatch
import glob
from setuptools import setup
from setuptools.command.develop import develop
from setuptools.command.install import install
from setuptools.command.egg_info import egg_info
LIB_NAME = "namsim"
NAMSIM_DATA_DIRECTORY = "data"
NAMSIM_CONF_DIRECTORY = "default_namsim_conf"
def post_install_operations(lib_path):
    # TODO: workaround to exit in library creation process
    if 'site-packages' not in lib_path:
        return
    # set conf path and replace slash to backslash to support UNIX systems
    conf_dir_path = os.path.join(lib_path, NAMSIM_DATA_DIRECTORY, NAMSIM_CONF_DIRECTORY)
    conf_dir_path = conf_dir_path.replace(os.sep, '/')
    # change paths in all conf .xml files
    file_pattern = "*.xml"
    for path, dirs, files in os.walk(conf_dir_path):
        for filename in fnmatch.filter(files, file_pattern):
            full_file_path = os.path.join(path, filename)
            print(full_file_path)
            # replace stub with the actual path
            stub_name = 'STUB_PATH'
            # Read in the file
            with open(full_file_path, 'r') as file:
                file_data = file.read()
            print(file_data)
            # Replace the target string and fix slash direction based
            file_data = file_data.replace(stub_name, conf_dir_path)
            print(file_data)
            # Write the file out again
            with open(full_file_path, 'w') as file:
                file.write(file_data)
def post_install_decorator(command_subclass):
    """A decorator for classes subclassing one of the setuptools commands.
    It modifies the run() method so that it will change the configuration paths.
    """
    orig_run = command_subclass.run
    def modified_run(self):
        def find_module_path():
            for p in sys.path:
                if os.path.isdir(p) and LIB_NAME in os.listdir(p):
                    return os.path.join(p, LIB_NAME)
        orig_run(self)
        lib_path = find_module_path()
        post_install_operations(lib_path)
    command_subclass.run = modified_run
    return command_subclass
@post_install_decorator
class CustomDevelopCommand(develop):
    pass
@post_install_decorator
class CustomInstallCommand(install):
    pass
@post_install_decorator
class CustomEggInfoCommand(egg_info):
    pass
atexit.register(all_done)
setup(
    name="namsim",
    version="1.0.0",
    author="Barak David",
    license="MIT",
    keywords="Name similarity mock-up library.",
    packages=['namsim', 'namsim.wrapper', 'namsim.data'],
    package_date={'data': ['default_namsim_conf/*']},
    include_package_data=True,
    cmdclass={
        'develop': CustomDevelopCommand,
        'install': CustomInstallCommand,
        'egg_info': CustomEggInfoCommand
    }
)
Picture of my library source tree:
To be clear, the original namsim_config.xml contains the text:
STUB_PATH/conf/multiplier_config.xml
My goal is for that text to be changed after installation to:
{actual lib installation path}/conf/multiplier_config.xml
Some additional information:
I tried the above code on both Python 2.7 and 3.x.
On Windows I get the expected result, but not on Unix-based platforms.
I use the "python setup.py sdist" command on Windows to create the library, and I install the resulting tar.gz on the different platforms.
I also tried using the atexit module to change the configuration files before the process terminates, but I got the same result.
Thank you.

How to extract the title of a PDF document from within a script for renaming?

I have thousands of PDF files on my computer, named a0001.pdf to a3621.pdf. Each one contains a title, e.g. "aluminum carbonate" in a0001.pdf, "aluminum nitrate" in a0002.pdf, etc., which I'd like to extract and use to rename the files.
I use this program to rename a file:
import os
path=r"C:\Users\YANN\Desktop\..."
old='string 1'
new='string 2'
def rename(path,old,new):
    for f in os.listdir(path):
        os.rename(os.path.join(path, f), os.path.join(path, f.replace(old, new)))
rename(path,old,new)
Is there a way to extract the title embedded in each PDF file and use it to rename the file?

Installing the package
This cannot be solved with the standard library alone. You will need an external package such as pdfrw, which can read PDF metadata. Installation is straightforward with pip, the standard Python package manager.
On Windows, first make sure you have a recent version of pip using the shell command:
python -m pip install -U pip
On Linux:
pip install -U pip
On both platforms, then install the pdfrw package using
pip install pdfrw
The code
I combined the approaches of zeebonk and user2125722 to write something compact and readable that stays close to your original code:
import os
from pdfrw import PdfReader
path = r'C:\Users\YANN\Desktop'
def renameFileToPDFTitle(path, fileName):
    fullName = os.path.join(path, fileName)
    # Extract pdf title from pdf file
    newName = PdfReader(fullName).Info.Title
    # Remove surrounding brackets that some pdf titles have
    newName = newName.strip('()') + '.pdf'
    newFullName = os.path.join(path, newName)
    os.rename(fullName, newFullName)
for fileName in os.listdir(path):
    # Rename only pdf files
    fullName = os.path.join(path, fileName)
    if (not os.path.isfile(fullName) or fileName[-4:] != '.pdf'):
        continue
    renameFileToPDFTitle(path, fileName)
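The script above assumes every PDF has a usable title. As a hedged sketch, the helper below (title_to_filename is an invented name, not part of pdfrw) turns a raw title into a file name, or returns None when nothing usable is left, so the caller can skip that file. The 120-character limit is an arbitrary assumption.

```python
import re

# Characters that Windows does not allow in file names, plus control characters.
_FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def title_to_filename(raw_title, max_length=120):
    """Turn a raw PDF title into '<title>.pdf', or return None if unusable."""
    if raw_title is None:
        return None
    title = str(raw_title).strip()
    # Some titles come back wrapped in parentheses, e.g. '(aluminum carbonate)'.
    if title.startswith("(") and title.endswith(")"):
        title = title[1:-1]
    title = _FORBIDDEN.sub("", title).strip(" .")
    if not title:
        return None
    return title[:max_length] + ".pdf"
```

In the loop, a file would then be renamed only when title_to_filename(...) returns a name.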

What you need is a library that can actually read PDF files. For example pdfrw:
In [8]: from pdfrw import PdfReader
In [9]: reader = PdfReader('example.pdf')
In [10]: reader.Info.Title
Out[10]: 'Example PDF document'

You can use the pdfminer library to parse the PDFs. The info property contains the title of the PDF. Here is what a sample info looks like:
[{'CreationDate': "D:20170110095753+05'30'", 'Producer': 'PDF-XChange Printer V6 (6.0 build 317.1) [Windows 10 Enterprise x64 (Build 10586)]', 'Creator': 'PDF-XChange Office Addin', 'Title': 'Python Basics'}]
Then we can extract the Title using the properties of a dictionary. Here is the whole code (including iterating all the files and renaming them):
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
import os
start = "0000"
def convert(var):
    # Left-pad the number with zeros to four digits
    while len(var) < 4:
        var = "0" + var
    return var
for i in range(1,3622):
    var = str(i)
    var = convert(var)
    file_name = "a" + var + ".pdf"
    fp = open(file_name, 'rb')
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    fp.close()
    metadata = doc.info # The "Info" metadata
    print(metadata)
    metadata = metadata[0]
    for x in metadata:
        if x == "Title":
            new_name = metadata[x] + ".pdf"
            os.rename(file_name,new_name)
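Two small pieces of the loop above can be sketched with the standard library only. Both helper names are invented for this example. The second helper assumes that the metadata value may arrive as bytes rather than str, which depends on the pdfminer version in use.

```python
def numbered_name(index, prefix="a", width=4, suffix=".pdf"):
    """Build names such as a0001.pdf using zero-padded formatting."""
    return f"{prefix}{index:0{width}d}{suffix}"


def as_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding, errors="replace")
    return str(value)
```

With these, the file name becomes numbered_name(i) and the new name becomes as_text(metadata["Title"]) + ".pdf".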

You can look at only the metadata using a ghostscript tool pdf_info.ps. It used to ship with ghostscript but is still available at https://r-forge.r-project.org/scm/viewvc.php/pkg/inst/ghostscript/pdf_info.ps?view=markup&root=tm

Building on Ciprian Tomoiagă's suggestion of using pdfrw, I've uploaded a script which also:
renames files in sub-directories
adds a command-line interface
handles an already existing file name by appending a random string
strips any non-alphanumeric character from the new file name
replaces non-ASCII characters (such as á è í ò ç...) with ASCII ones (a e i o c) in the new file name
lets you set the root directory and limit the length of the new file name from the command line
shows a progress bar and, after the script has finished, shows some statistics
does some error handling
As TextGeek mentioned, unfortunately not all files have the title metadata, so some files won't be renamed.
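Two of the features listed above can be illustrated with the standard library. This is not the code from the repository, only a sketch with invented names, and it swaps in a numeric suffix where the script is described as appending a random string.

```python
import unicodedata
from pathlib import Path


def to_ascii(text):
    """Replace accented letters with their base ASCII letter; drop other non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def unique_path(directory, stem, suffix=".pdf"):
    """Return a path in directory that does not exist yet.

    Tries '<stem><suffix>' first, then '<stem>_1<suffix>', '<stem>_2<suffix>', ...
    """
    candidate = Path(directory) / f"{stem}{suffix}"
    counter = 1
    while candidate.exists():
        candidate = Path(directory) / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate
```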
Repository: https://github.com/favict/pdf_renamefy
Usage:
After downloading the files, install the dependencies by running pip:
$ pip install -r requirements.txt
and then run the script:
$ python -m renamefy <directory> <filename maximum length>
...where directory is the full path in which to look for PDF files, and filename maximum length is the length at which the file name is truncated in case the title is too long or was set incorrectly in the file.
Both parameters are optional. If neither is provided, the directory defaults to the current directory and the filename maximum length to 120 characters.
Example:
$ python -m renamefy C:\Users\John\Downloads 120
I used it on Windows, but it should work on Linux too.
Feel free to copy, fork and edit as you see fit.

I had some issues with the solutions above, so here is my recipe:
from pathlib import Path
from pdfrw import PdfReader
import re
path_to_files = Path(r"C:\Users\Malac\Desktop\articles\Downloaded")
# Exclude windows forbidden chars for name <>:"/\|?*
# Newlines \n and backslashes will be removed anyway
exclude_chars = '[<>:"/|?*]'
for i in path_to_files.glob("*.pdf"):
    try:
        title = PdfReader(i).Info.Title
    except Exception:
        # print(f"File {i} not renamed.")
        # Skip this file; otherwise `title` would be undefined or left over from the previous file
        continue
    # Some names were just ()
    if not title:
        continue
    # For some reason, titles are returned in brackets - remove brackets if around titles
    if title.startswith("("):
        title = title[1:]
    if title.endswith(")"):
        title = title[:-1]
    title = re.sub(exclude_chars, "", title)
    title = re.sub(r"\\", "", title)
    title = re.sub("\n", "", title)
    # Some names are just ()
    if not title:
        continue
    try:
        final_path = (path_to_files / title).with_suffix(".pdf")
        if final_path.exists():
            continue
        i.rename(final_path)
    except Exception:
        # print(f"Name {i} incorrect.")
        pass


py2exe not recognizing jsonschema

I've been attempting to build a Windows executable with py2exe for a Python program that uses the jsonschema package, but every time I try to run the executable it fails with the following error:
File "jsonschema\__init__.pyc", line 18, in <module>
File "jsonschema\validators.pyc", line 163, in <module>
File "jsonschema\_utils.pyc", line 57, in load_schema
File "pkgutil.pyc", line 591, in get_data
IOError: [Errno 0] Error: 'jsonschema\\schemas\\draft3.json'
I've tried adding json and jsonschema to the package options for py2exe in setup.py and I also tried manually copying the jsonschema directory from its location in Python27\Libs\site-packages into library.zip, but neither of those work. I also attempted to use the solution found here (http://crazedmonkey.com/blog/python/pkg_resources-with-py2exe.html) that suggests extending py2exe to be able to copy files into the zip file, but that did not seem to work either.
I'm assuming this happens because py2exe only includes Python files in the library.zip, but I was wondering if there is any way for this to work without having to convert draft3.json and draft4.json into .py files in their original location.
Thank you in advance

Well after some more googling (I hate ugly) I got it working without patching the build_exe.py file. The key to the whole thing was the recipe at http://crazedmonkey.com/blog/python/pkg_resources-with-py2exe.html. My collector class looks like this:
import os
import jsonschema
from py2exe.build_exe import py2exe as build_exe
class JsonSchemaCollector(build_exe):
    """
    This class adds the jsonschema files draft3.json and draft4.json to
    the list of compiled files so they will be included in the zipfile.
    """
    def copy_extensions(self, extensions):
        build_exe.copy_extensions(self, extensions)
        # Define the data path where the files reside.
        data_path = os.path.join(jsonschema.__path__[0], 'schemas')
        # Create the subdir where the json files are collected.
        media = os.path.join('jsonschema', 'schemas')
        full = os.path.join(self.collect_dir, media)
        self.mkpath(full)
        # Copy the json files to the collection dir. Also add the copied file
        # to the list of compiled files so it will be included in the zipfile.
        for name in os.listdir(data_path):
            file_name = os.path.join(data_path, name)
            self.copy_file(file_name, os.path.join(full, name))
            self.compiled_files.append(os.path.join(media, name))
What's left is to add it to the core setup like this:
options = {"bundle_files": 1, # Bundle ALL files inside the EXE
           "compressed": 2, # compress the library archive
           "optimize": 2, # like python -OO
           "packages": packages, # Packages needed by lxml.
           "excludes": excludes, # COM stuff we don't want
           "dll_excludes": skip} # Exclude unused DLLs
distutils.core.setup(
    cmdclass={"py2exe": JsonSchemaCollector},
    options={"py2exe": options},
    zipfile=None,
    console=[prog])
Some of the code is omitted since it's not relevant in this context but I think you get the drift.

Python's docx module giving AssertionError when .exe is created

I've a Python file titled my_python_file.py that makes, among other things, a .doc file using the python-docx module. The .doc is created perfectly and gives no problem. The problem comes when I build a .exe of my script and I try to make the .doc. An AssertionError problem appears.
This is my exe maker code (exe_maker.py):
from distutils.core import setup
import py2exe, sys, os
sys.argv.append('py2exe')
setup(
    options = {'py2exe': {'bundle_files': 3, 'compressed': True, 'includes': ['lxml.etree', 'lxml._elementpath', 'gzip', 'docx']}},
    windows = [{'script': "my_python_file.py"}],
    zipfile = None,
)
It seems that moving the python script to a different location produces the error.
File "docx.pyc", line 1063, in savedocx
AssertionError
This is the savedocx line:
document = newdocument()
[...]
coreprops = coreproperties(title=title, subject=subject, creator=creator, keywords=keywords)
approps = appproperties()
contenttypes2 = contenttypes()
websettings2 = websettings()
wordrelationships2 = wordrelationships(relationships)
path_save = r"C:\output"
savedocx(document, coreprops, approps, contenttypes2, websettings2, wordrelationships2, path_save)
The savedocx call is written correctly, since it works when the script is not run as an .exe file.
How can I make the docx module work correctly? Do I have to add any other path or variable when I build the exe?
Thanks in advance

I solved the problem by editing the api.py file in the docx egg folder, which is located in the system's Python folder.
Changing this:
_thisdir = os.path.split(__file__)[0]
_default_docx_path = os.path.join(_thisdir, 'templates', 'default.docx')
To this:
thisdir = os.getcwd()
_default_docx_path = os.path.join(thisdir, 'templates', 'default.docx')
The first version built the path from the running program itself, so the executable's name ended up inside the path to the templates folder:
C:\myfiles\myprogram.exe\templates\default.docx
The second version uses only the directory, not the running program. Note that os.getcwd() is the current working directory, so this works only when the program is started from the folder that contains templates:
C:\myfiles\templates\default.docx
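A variation that does not depend on the working directory is to check whether the program is running as a frozen executable. The sketch below assumes the freezing tool sets sys.frozen (py2exe and similar tools do) and that the templates folder sits next to the executable. The function names are invented for illustration.

```python
import os
import sys


def application_dir(module_file):
    """Directory in which to look for bundled data files."""
    if getattr(sys, "frozen", False):
        # Frozen executable: use the folder that contains the .exe itself.
        return os.path.dirname(os.path.abspath(sys.executable))
    # Normal interpreter run: use the folder that contains the module.
    return os.path.dirname(os.path.abspath(module_file))


def template_path(module_file, name="default.docx"):
    """Path to templates/<name> next to the executable or the module."""
    return os.path.join(application_dir(module_file), "templates", name)


# Hypothetical use:
# _default_docx_path = template_path(__file__)
```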

Instead of changing some library file, I find it easier and cleaner to tell python-docx explicitly where to look for the template, i.e.:
document = Document('whatever/path/you/choose/to/some.docx')
This effectively solves the py2exe and docx path problem.
