Exception in newsplease commoncrawl.py file (Python)

I am using the news-please library, which I cloned from https://github.com/fhamborg/news-please.
I want to use it to get news articles from the Common Crawl news dataset (CC-NEWS).
I am running the commoncrawl.py file as instructed on the project page.
I used the following command:
python -m newsplease.examples.commoncrawl
Running this command produces the following output and errors:
my_local_download_dir_warc=./cc_download_warc/
my_local_download_dir_article=./cc_download_articles/
delete_warc_after_extraction=False
my_number_of_extraction_processes=1
INFO:newsplease.crawler.commoncrawl_crawler:executing: aws s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/ --no-sign-request > .tmpaws.txt && awk '{ print $4 }' .tmpaws.txt && rm .tmpaws.txt
INFO:newsplease.crawler.commoncrawl_crawler:found 2 files at commoncrawl.org
INFO:newsplease.crawler.commoncrawl_crawler:creating extraction process pool with 1 processes
INFO:newsplease.crawler.commoncrawl_extractor:found local file ./cc_download_warc/https%3A%2F%2Fcommoncrawl.s3.amazonaws.com%2F, not downloading again due to configuration
Traceback (most recent call last):
File "/home/prateek/.local/lib/python3.6/site-packages/warcio/recordloader.py", line 236, in _detect_type_load_headers
rec_headers = self.arc_parser.parse(stream, statusline)
File "/home/prateek/.local/lib/python3.6/site-packages/warcio/recordloader.py", line 312, in parse
raise StatusAndHeadersParserException(msg, parts)
warcio.statusandheaders.StatusAndHeadersParserException: Wrong # of headers, expected arc headers ['uri', 'ip-address', 'archive-date', 'content-type', 'length'], Found ['<?xml', 'version="1.0"', 'encoding="UTF-8"?>']
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/examples/commoncrawl.py", line 172, in <module>
main()
File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/examples/commoncrawl.py", line 168, in main
continue_process=True)
File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_crawler.py", line 320, in crawl_from_commoncrawl
log_pathname_fully_extracted_warcs=__log_pathname_fully_extracted_warcs)
File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_crawler.py", line 230, in __start_commoncrawl_extractor
log_pathname_fully_extracted_warcs=__log_pathname_fully_extracted_warcs)
File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_extractor.py", line 338, in extract_from_commoncrawl
self.__run()
File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_extractor.py", line 292, in __run
self.__process_warc_gz_file(local_path_name)
File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_extractor.py", line 231, in __process_warc_gz_file
for record in ArchiveIterator(stream):
File "/home/prateek/.local/lib/python3.6/site-packages/warcio/archiveiterator.py", line 110, in _iterate_records
self.record = self._next_record(self.next_line)
File "/home/prateek/.local/lib/python3.6/site-packages/warcio/archiveiterator.py", line 262, in _next_record
self.check_digests)
File "/home/prateek/.local/lib/python3.6/site-packages/warcio/recordloader.py", line 88, in parse_record_stream
known_format))
File "/home/prateek/.local/lib/python3.6/site-packages/warcio/recordloader.py", line 243, in _detect_type_load_headers
raise ArchiveLoadFailed(msg + str(se.statusline))
warcio.exceptions.ArchiveLoadFailed: Unknown archive format, first line: ['<?xml', 'version="1.0"', 'encoding="UTF-8"?>']
What is the error here, and how can I resolve it?
https://github.com/fhamborg/news-please says to adapt the config section in
newsplease/examples/commoncrawl.py.
What does this mean?
I copied the configuration from this file and pasted it into
config.cfg, which is in the newsplease/config directory.
Is this what they instructed, or have I made a mistake here?
I am using Python 3.6, and it is the only Python installed on my machine.

This error is caused by the versions of the libraries that newsplease depends on. The mistake usually happens when every library is installed manually without paying attention to package versions. The required version of each library is listed in the setup.py file, so install exactly those versions.
There may also be problems when executing setup.py,
so use this command:
python3 setup.py install
If you need to uninstall all previously installed versions of the packages first, run:
pip3 freeze --user | xargs pip3 uninstall -y
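Independently of package versions, the traceback in the question shows that warcio found `<?xml` at the start of the file it was asked to parse, which suggests that the cached file in ./cc_download_warc/ may not be a WARC archive at all. Here is a minimal sketch for checking that, assuming the file is still on disk. The function name sniff_download is hypothetical and is not part of newsplease or warcio.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def sniff_download(path):
    """Classify a downloaded file by looking at its first bytes.

    Hypothetical helper for illustration only.
    """
    with open(path, "rb") as fh:
        head = fh.read(16)
    if head.startswith(GZIP_MAGIC):
        return "gzip"      # what a .warc.gz download should look like
    if head.lstrip().startswith(b"<?xml"):
        return "xml"       # e.g. an XML response saved in place of the archive
    if head.startswith(b"WARC/"):
        return "warc"      # uncompressed WARC record
    return "unknown"
```

If this returns "xml" for the cached file, deleting it so that it is downloaded again may be worth trying. That is a guess based on the log output, not something confirmed in this thread.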

Related

Python Beeware Briefcase Git SHA issue

I'm trying to teach myself to use the BeeWare Briefcase package for Python, but I'm having issues setting it up. I have pyenv installed, and the local Python version in the root of my project is set to 3.8.9. I'm using Windows and PowerShell.
In PowerShell, I created the Python virtual environment and installed briefcase via pip.
I have also installed Git and linked the repo to GitHub.
When I run "briefcase new" and go through the prompts, I receive the following traceback (in both PowerShell and Git Bash):
(I've removed my root details from the errors below.)
Traceback (most recent call last):
File "C:\~\.pyenv\pyenv-win\versions\3.8.9\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\~\.pyenv\pyenv-win\versions\3.8.9\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\~\CodeProjects\beeware-tutorial\.venv\Scripts\briefcase.exe\__main__.py", line 7, in <module>
File "c:\~\codeprojects\beeware-tutorial\.venv\lib\site-packages\briefcase\__main__.py", line 14, in main
command(**options)
File "c:\~\codeprojects\beeware-tutorial\.venv\lib\site-packages\briefcase\commands\new.py", line 537, in __call__
return self.new_app(template=template, **options)
File "c:\~\codeprojects\beeware-tutorial\.venv\lib\site-packages\briefcase\commands\new.py", line 488, in new_app
cached_template = self.update_cookiecutter_cache(
File "c:\~\codeprojects\beeware-tutorial\.venv\lib\site-packages\briefcase\commands\base.py", line 569, in update_cookiecutter_cache
f"Using existing template (sha {head.commit.hexsha}, "
File "c:\~\codeprojects\beeware-tutorial\.venv\lib\site-packages\git\refs\symbolic.py", line 217, in _get_commit
obj = self._get_object()
File "c:\~\codeprojects\beeware-tutorial\.venv\lib\site-packages\git\refs\symbolic.py", line 210, in _get_object
return Object.new_from_sha(self.repo, hex_to_bin(self.dereference_recursive(self.repo, self.path)))
File "c:\~\codeprojects\beeware-tutorial\.venv\lib\site-packages\git\objects\base.py", line 85, in new_from_sha
oinfo = repo.odb.info(sha1)
File "c:\~\codeprojects\beeware-tutorial\.venv\lib\site-packages\git\db.py", line 43, in info
hexsha, typename, size = self._git.get_object_header(bin_to_hex(binsha))
File "c:\~\codeprojects\beeware-tutorial\.venv\lib\site-packages\git\cmd.py", line 1253, in get_object_header
return self.__get_object_header(cmd, ref)
File "c:\~\codeprojects\beeware-tutorial\.venv\lib\site-packages\git\cmd.py", line 1240, in __get_object_header
return self._parse_object_header(cmd.stdout.readline())
File "c:\~\codeprojects\beeware-tutorial\.venv\lib\site-packages\git\cmd.py", line 1198, in _parse_object_header
raise ValueError("SHA could not be resolved, git returned: %r" % (header_line.strip()))
ValueError: SHA could not be resolved, git returned: b''
I'm not sure exactly how I fixed the issue, but I uninstalled Git version 2.37.1, installed Git version 2.30.2, used Git Bash to create a new environment, and installed briefcase.
That seems to have solved the issue.

Command '['git' 'clone', '--recurse-submodules', '--', 'ssh://git@git/b2b/py_client.git'] returned non-zero exit status 128

I am trying to create a virtualenv with make setup and Poetry in Git Bash:
$ make setup
poetry install --no-root
Creating virtualenv ad-ml in C:\Users\user1\Documents\ad_ml\.venv
Installing dependencies from lock file
Warning: The lock file is not up to date with the latest changes in pyproject.toml. You may be getting outdated dependencies. Run update to update them.
CalledProcessError
Command '['C:\\Users\\user1\\AppData\\Local\\Programs\\Git\\mingw64\\bin\\git.exe'
, 'clone', '--recurse-submodules', '--', 'ssh://git@git.bcb.local:7999/b2b/py_client.git',
'C:\\Users\\EMANZH~1\\AppData\\Local\\Temp\\pypoetry-git-py_clien9fdvh9lr']'
returned non-zero exit status 128.
at ~\AppData\Roaming\pypoetry\venv\lib\site-packages\poetry\utils\_compat.py:217 in run
I get a CalledProcessError with exit status 128, followed by another exception:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\user1\Anaconda3\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\user1\Anaconda3\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\user1\AppData\Roaming\Python\Scripts\poetry.exe\__main__.py", line 7, in <module>
File "C:\Users\user1\AppData\Roaming\pypoetry\venv\lib\site-packages\poetry\console\__init__.py", line 5, in main
return Application().run()
File "C:\Users\user1\AppData\Roaming\pypoetry\venv\lib\site-packages\clikit\console_application.py", line 142, in run
trace.render(io, simple=isinstance(e, CliKitException))
File "C:\Users\user1\AppData\Roaming\pypoetry\venv\lib\site-packages\clikit\ui\components\exception_trace.py", line 232, in render
return self._render_exception(io, self._exception)
File "C:\Users\user1\AppData\Roaming\pypoetry\venv\lib\site-packages\clikit\ui\components\exception_trace.py", line 269, in _render_exception
self._render_snippet(io, current_frame)
File "C:\Users\user1\AppData\Roaming\pypoetry\venv\lib\site-packages\clikit\ui\components\exception_trace.py", line 289, in _render_snippet
self._render_line(io, code_line)
File "C:\Users\user1\AppData\Roaming\pypoetry\venv\lib\site-packages\clikit\ui\components\exception_trace.py", line 402, in _render_line
io.write_line("{}{}".format(indent * " ", line))
File "C:\Users\user1\AppData\Roaming\pypoetry\venv\lib\site-packages\cleo\io\io_mixin.py", line 65, in write_line
super(IOMixin, self).write_line(string, flags)
File "C:\Users\user1\AppData\Roaming\pypoetry\venv\lib\site-packages\clikit\api\io\io.py", line 66, in write_line
self._output.write_line(string, flags=flags)
File "C:\Users\user1\AppData\Roaming\pypoetry\venv\lib\site-packages\clikit\api\io\output.py", line 69, in write_line
self.write(string, flags=flags, new_line=True)
File "C:\Users\user1\AppData\Roaming\pypoetry\venv\lib\site-packages\clikit\api\io\output.py", line 61, in write
self._stream.write(to_str(formatted))
File "C:\Users\user1\AppData\Roaming\pypoetry\venv\lib\site-packages\clikit\io\output_stream\stream_output_stream.py", line 24, in write
self._stream.write(string)
File "C:\Users\user1\Anaconda3\lib\encodings\cp1251.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2502' in position 27: character maps to <undefined>
Does anyone have an idea how to fix this error? Any help would be much appreciated.
Environment: Windows 10 PC, Git Bash 2.34.0, repo cloned with Sourcetree from Bitbucket, Python 3.8.8.
Check first whether this is similar to python-poetry/poetry issue 3297, which refers to pypa/virtualenv issue 1986.
The first issue includes this comment (by Daniel Taylor):
We downgrade virtualenv inside the conda environment in our circle CI windows executors, not sure if it with pip or not.
So adding a step like this to your yml config should fix the issue (or just adding virtualenv=20.0.33 to the step where you install your conda dependencies):
- run: conda install virtualenv=20.0.33
The OP (Taky) reports in the comments:
I changed dependency link to py_client.git in pyproject.toml from "ssh" to "https", and that's worked for me.
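The second traceback in the question is a follow-on failure: while rendering the git error, Poetry tried to write the box-drawing character '\u2502' to a console whose encoding is cp1251, which has no such character. The small sketch below reproduces just that encoding step. The helper name can_encode is made up for illustration.

```python
def can_encode(text, encoding):
    """Return True if text can be written using the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


BOX_CHAR = "\u2502"  # the character named in the UnicodeEncodeError above

if __name__ == "__main__":
    print(can_encode(BOX_CHAR, "cp1251"))                # False
    print(can_encode(BOX_CHAR, "utf-8"))                 # True
    # With errors="replace" the character degrades to "?" instead of raising.
    print(BOX_CHAR.encode("cp1251", errors="replace"))   # b'?'
```

Setting the environment variable PYTHONIOENCODING=utf-8 before running Poetry is one commonly used way to make Python write UTF-8 to the console; whether it helps in this setup is untested. Either way, this only affects how the error is displayed. The underlying git clone failure (exit status 128) still needs one of the fixes described above.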

How to run a ParlAI Blender chat bot?

I wish to play with the Blender chatbot, but most of the scripts on the Quick Start and Recipes pages don't work. For example,
from https://parl.ai/projects/recipes/
python parlai/scripts/safe_interactive.py -t blended_skill_talk -mf zoo:blender/blender_90M/model
I got
Traceback (most recent call last):
File "parlai/scripts/safe_interactive.py", line 11, in <module>
from parlai.core.params import ParlaiParser
ModuleNotFoundError: No module named 'parlai'
or
from https://parl.ai/docs/tutorial_quick.html
parlai eval_model -t twitter -mf zoo:blender/blender_90M/model
I got
17:28:54 INFO | loading dictionary from /home/jacek/ParlAI/data/models/blender/blender_90M/model.dict
17:28:54 INFO | num words = 54944
Traceback (most recent call last):
File "/usr/local/bin/parlai", line 11, in <module>
load_entry_point('parlai', 'console_scripts', 'parlai')()
File "/home/jacek/ParlAI/parlai/core/script.py", line 266, in superscript_main
SCRIPT_REGISTRY[cmd].klass._run_from_parser_and_opt(opt, parser)
File "/home/jacek/ParlAI/parlai/core/script.py", line 88, in _run_from_parser_and_opt
return script.run()
File "/home/jacek/ParlAI/parlai/scripts/eval_model.py", line 232, in run
return eval_model(self.opt, print_parser=self.parser)
File "/home/jacek/ParlAI/parlai/scripts/eval_model.py", line 198, in eval_model
agent = create_agent(opt, requireModelExists=True)
File "/home/jacek/ParlAI/parlai/core/agents.py", line 394, in create_agent
model = create_agent_from_opt_file(opt)
File "/home/jacek/ParlAI/parlai/core/agents.py", line 347, in create_agent_from_opt_file
return model_class(opt_from_file)
File "/home/jacek/ParlAI/parlai/core/torch_generator_agent.py", line 445, in __init__
super().__init__(opt, shared)
File "/home/jacek/ParlAI/parlai/core/torch_agent.py", line 728, in __init__
self.dict = self.build_dictionary()
File "/home/jacek/ParlAI/parlai/core/torch_agent.py", line 812, in build_dictionary
d = self.dictionary_class()(self.opt)
File "/home/jacek/ParlAI/parlai/core/dict.py", line 305, in __init__
self.bpe = bpe_factory(opt, shared)
File "/home/jacek/ParlAI/parlai/utils/bpe.py", line 80, in bpe_factory
bpe_helper = SubwordBPEHelper(opt, shared)
File "/home/jacek/ParlAI/parlai/utils/bpe.py", line 292, in __init__
raise RuntimeError(
RuntimeError: Please run "pip install 'git+https://github.com/rsennrich/subword-nmt.git#egg=subword-nmt'"
(I tried to do that pip install and it says it's already installed)
After
parlai train_model -t babi:task10k:1 -mf /tmp/babi_memnn -bs 1 -nt 4 -eps 5 -m memnn --no-cuda
I got
Parse Error: argument -mtw/--multitask-weights: invalid 'multitask_weights' value: 'memnn'
although I figured that one out myself: the '--model' flag should be used instead of '-m'.
I guess all of these are the result of some configuration error. However, I installed ParlAI, Python, and PyTorch using the instructions on the Quick Start page.
System info: Ubuntu 20.04 LTS, Intel® Core™ i7-9750H CPU @ 2.60GHz × 12, 31,2 GiB RAM, GeForce GTX 1660 Ti/PCIe/SSE2.
I'm stuck, please help.
You just have to install ParlAI:
pip install parlai
Then run:
parlai interactive -mf zoo:blender/blender_90M/model -t blended_skill_talk
For me, Blender worked out of the box in less than 10 minutes.
I was pleasantly surprised by how easy it was to run.
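The ModuleNotFoundError in the question means that the interpreter running the script could not find a package named parlai on its import path. A quick way to check this from Python is sketched below, using only the standard library. The helper name is_importable is hypothetical.

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if this interpreter can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Show which interpreter is running, since pip may have installed
    # the package for a different one.
    print("interpreter:", sys.executable)
    print("parlai importable:", is_importable("parlai"))
```

If this prints False after pip install parlai, one possible explanation is that pip and python point at different installations; running python -m pip install parlai with the same interpreter avoids that mismatch.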

Canopy is having difficulties updating packages, "AbortedOperationDetected"

I run Canopy version 2.1.3.3542 (64 bit) on Windows 10.
Canopy can't update any package; every attempt fails in the same way.
Here is the log output for a numpy update, for example:
Warming up...
Traceback (most recent call last):
File "build\bdist.win-amd64\egg\canopy_dashboard\packman\package_action_worker.py", line 54, in run
File "build\bdist.win-amd64\egg\canopy_dashboard\packman\package_action.py", line 196, in execute
File "build\bdist.win-amd64\egg\canopy_dashboard\packman\packman.py", line 626, in <lambda>
File "build\bdist.win-amd64\egg\canopy_dashboard\packman\packman.py", line 1051, in _install
File "build\bdist.win-amd64\egg\canopy_platform\cpython_packages_manager.py", line 152, in install_packages_prompt
File "build\bdist.win-amd64\egg\canopy_platform\cpython_packages_manager.py", line 137, in _install_packages_prompt
PackagesInstallationError: installation of packages ['numpy 1.11.3-3'] failed. Details below:
Traceback (most recent call last):
File "build\bdist.win-amd64\egg\canopy_platform\edm_api.py", line 64, in wrapper
File "build\bdist.win-amd64\egg\canopy_platform\edm_api.py", line 384, in install_command
File "build\bdist.win-amd64\egg\canopy_platform\edm_api.py", line 414, in _install_packages_command
File "build\bdist.win-amd64\egg\edm\core\packages_manager.py", line 124, in decorator
File "build\bdist.win-amd64\egg\edm\core\packages_manager.py", line 219, in install_command
File "build\bdist.win-amd64\egg\edm\core\packages_manager.py", line 223, in _install_command
File "build\bdist.win-amd64\egg\edm\core\packages_manager.py", line 549, in _compute_fix_aborted_actions
AbortedOperationDetected: Aborted operation detected in environment 'User'
There are absolutely zero results on Google for this error, so Stack Exchange is my last resort.
It looks like a previous update was force-aborted, possibly leaving the environment corrupted. Assuming that you are using the standard installer, it should suffice to:
1. Reboot your computer.
2. Temporarily disable your anti-virus software if possible (or at least disable its more intrusive or slow functionality, such as online checking of each of the tens of thousands of package files that Canopy provides).
3. From the Canopy Tools menu, select Troubleshoot => Reset Python environment.

Unable to install Flask-Mail

I am trying to send an email using Flask when a user registers on my website. I ran pip install Flask-Mail to install the extension. However, I get the following error, which looks like a version mismatch:
Downloading/unpacking Flask-mail
Downloading Flask-Mail-0.9.1.tar.gz (45kB): 45kB downloaded
Running setup.py (path:/tmp/pip_build_root/Flask-mail/setup.py) egg_info for package Flask-mail
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "/tmp/pip_build_root/Flask-mail/setup.py", line 52, in <module>
'Topic :: Software Development :: Libraries :: Python Modules'
File "/usr/lib/python2.7/distutils/core.py", line 111, in setup
_setup_distribution = dist = klass(attrs)
File "/usr/local/lib/python2.7/dist-packages/distribute-0.6.28-py2.7.egg/setuptools/dist.py", line 225, in __init__
_Distribution.__init__(self,attrs)
File "/usr/lib/python2.7/distutils/dist.py", line 287, in __init__
self.finalize_options()
File "/usr/local/lib/python2.7/dist-packages/distribute-0.6.28-py2.7.egg/setuptools/dist.py", line 257, in finalize_options
ep.require(installer=self.fetch_build_egg)
File "/usr/local/lib/python2.7/dist-packages/distribute-0.6.28-py2.7.egg/pkg_resources.py", line 2029, in require
working_set.resolve(self.dist.requires(self.extras),env,installer))
File "/usr/local/lib/python2.7/dist-packages/distribute-0.6.28-py2.7.egg/pkg_resources.py", line 592, in resolve
raise VersionConflict(dist,req) # XXX put more info here
pkg_resources.VersionConflict: (certifi 2016.2.28 (/usr/local/lib/python2.7/dist-packages), Requirement.parse('certifi==2015.11.20'))
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "/tmp/pip_build_root/Flask-mail/setup.py", line 52, in <module>
'Topic :: Software Development :: Libraries :: Python Modules'
File "/usr/lib/python2.7/distutils/core.py", line 111, in setup
_setup_distribution = dist = klass(attrs)
File "/usr/local/lib/python2.7/dist-packages/distribute-0.6.28-py2.7.egg/setuptools/dist.py", line 225, in __init__
_Distribution.__init__(self,attrs)
File "/usr/lib/python2.7/distutils/dist.py", line 287, in __init__
self.finalize_options()
File "/usr/local/lib/python2.7/dist-packages/distribute-0.6.28-py2.7.egg/setuptools/dist.py", line 257, in finalize_options
ep.require(installer=self.fetch_build_egg)
File "/usr/local/lib/python2.7/dist-packages/distribute-0.6.28-py2.7.egg/pkg_resources.py", line 2029, in require
working_set.resolve(self.dist.requires(self.extras),env,installer))
File "/usr/local/lib/python2.7/dist-packages/distribute-0.6.28-py2.7.egg/pkg_resources.py", line 592, in resolve
raise VersionConflict(dist,req) # XXX put more info here
pkg_resources.VersionConflict: (certifi 2016.2.28 (/usr/local/lib/python2.7/dist-packages), Requirement.parse('certifi==2015.11.20'))
----------------------------------------
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /tmp/pip_build_root/Flask-mail
Any possible workarounds? Any help will be greatly appreciated. Thanks!
The easiest way to avoid this kind of problem is to create a virtual environment:
$ pip install virtualenv
$ cd my_project_folder
$ virtualenv venv
Now activate your virtual environment:
$ source venv/bin/activate
Now install the package there:
$ pip install Flask-Mail
Hopefully it will work there.
When you're done working, deactivate it:
$ deactivate
or
The whole problem seems to come from the certifi version conflict,
so try downloading the source from here:
https://pypi.python.org/pypi/certifi
and installing from source.
Extract it, go into the folder, and run this command:
sudo python setup.py install
and it should work.
peace
If you do not have a virtual environment set up, I would suggest doing that first. If you already have one, you may need to activate it from your terminal:
$ source venv/bin/activate
Once you are finished, deactivate it before running the rest of your commands:
$ deactivate
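The VersionConflict in the question says that certifi 2016.2.28 is installed while something requires certifi==2015.11.20. Below is a minimal sketch of checking such an exact pin. It assumes Python 3.8 or later (the question itself uses Python 2.7, where importlib.metadata is not available), and the helper names are hypothetical.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def satisfies_exact_pin(installed, pinned):
    """Check an '==' pin by plain string comparison.

    Simplification: real resolvers normalize version strings first.
    """
    return installed is not None and installed == pinned


if __name__ == "__main__":
    found = installed_version("certifi")
    print("installed certifi:", found)
    print("matches pin:", satisfies_exact_pin(found, "2015.11.20"))
```

Running this inside and outside the virtual environment shows whether the globally installed certifi is the one causing the conflict.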
