importing module (nltk) causes multiprocessing to hang - python

I tracked a Python multiprocessing headache down to the import of a module (nltk). Reproducible (hopefully) code is pasted below. This doesn't make any sense to me. Does anybody have any ideas?
from multiprocessing import Pool
import time, requests
#from nltk.corpus import stopwords # uncomment this and it hangs
def gethtml(key, url):
    r = requests.get(url)
    return r.text
def getnothing(key, url):
    return "nothing"
if __name__ == '__main__':
    pool = Pool(processes=4)
    result = list()
    nruns = 4
    url = 'http://davidchao.typepad.com/webconferencingexpert/2013/08/gartners-magic-quadrant-for-cloud-infrastructure-as-a-service.html'
    for i in range(0,nruns):
        # print gethtml(i,url)
        result.append(pool.apply_async(gethtml, [i,url]))
        # result.append(pool.apply_async(getnothing, [i,url]))
    pool.close()
    # monitor jobs until they complete
    running = nruns
    while running > 0:
        time.sleep(1)
        running = 0
        for run in result:
            if not run.ready(): running += 1
        print "processes still running:",running
    # print results
    for i,run in enumerate(result):
        print i,run.get()[0:40]
Note that the 'getnothing' function works. The hang only shows up with the combination of the nltk module import and the requests call. Sigh.
> python --version
Python 2.7.6
> python -c 'import sys;print("%x" % sys.maxsize, sys.maxsize > 2**32)'
('7fffffffffffffff', True)
> pip freeze | grep requests
requests==2.2.1
> pip freeze | grep nltk
nltk==2.0.4

I would redirect others with similar problems to solutions which do not use the multiprocessing module:
1) Apache Spark for scalability/flexibility. However, this doesn't seem to be a solution for Python multiprocessing. It looks like pyspark is also limited by the Global Interpreter Lock?
2) 'gevent' or 'twisted' for general python asynchronous processing
http://sdiehl.github.io/gevent-tutorial/
3) grequests for asynchronous requests
Asynchronous Requests with Python requests
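For I/O-bound work like the page fetches in the question, a thread pool from the standard library is another way to avoid the multiprocessing module. The sketch below is a minimal Python 3 example using concurrent.futures. This is a swapped-in technique, not one of the libraries listed above. The names fetch_with_urllib and fetch_all are hypothetical, and the fetch function is passed in as a parameter so the pattern can be tried without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_with_urllib(url):
    """Download a URL and return its body as text (assumes UTF-8)."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def fetch_all(urls, fetch=fetch_with_urllib, workers=4):
    """Run fetch(url) for every URL on a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))


if __name__ == "__main__":
    # Stand-in fetcher so the example runs offline.
    pages = fetch_all(["a", "b", "c"], fetch=lambda u: "page for " + u)
    for page in pages:
        print(page[:40])
```

Because the threads spend most of their time waiting on the network, the Global Interpreter Lock is usually not the bottleneck for this kind of job.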

Related

Can't run a script that uses pyshark with pypy3

I tried to run the script with pypy3 c.py, but the error below occurred.
I installed pyshark with pypy3 -m pip install pyshark, but ...
pypy3 c.py
ModuleNotFoundError: No module named 'lxml.objectify'
import pyshark
import pandas as pd
import numpy as np
from multiprocessing import Pool
import re
import sys
temp_array = []
cap = pyshark.FileCapture("ddos_attack.pcap")
#print(cap._extract_packet_json_from_data(cap[0]))
def parse(capture):
    print(capture)
    packet_raw = [i.strip('\r').strip('\t').split(':') for i in str(capture).split('\n')]
    packet_raw = map(lambda num:[num[0].replace('(',''),num[1].strip(')').replace('(','')] if len(num)== 2 else [num[0],':'.join(num[1:])] ,[i for i in packet_raw])
    raw = list(packet_raw)[:-1]
    cols = [i[0] for i in raw]
    vals = [i[1] for i in raw]
    temp_array.append(dict(zip(cols,vals)))
    return dict(zip(cols,vals))
def preprocess_dataset(x):
    count = 0
    temp = []
    #print(list(cap))
    #p = Pool(5)
    #r = p.map(parse,cap)
    #p.close()
    #p.join()
    #print(r)
    try:
        for i in list(cap):
            temp.append(parse(i))
            count += 1
    except Exception:
        print("somethin")
    data = pd.DataFrame(temp)
    print(data)
    data = data[['Packet Length','.... 0101 = Header Length','Protocol','Time to Live','Source Port','Length','Time since previous frame in this TCP stream','Window']]
    data.rename(columns={".... 0101 = Header Length": 'Header Length'})
    filtr = ["".join(re.findall(r'\d.',str(i))) for i in data['Time since previous frame in this TCP stream']]
    data['Time since previous frame in this TCP stream'] = filtr
    print(data.to_csv('data.csv'))
Check your terminal settings.
Try to use another compiler like PyCharm.
It seems lxml is not installed correctly. It is hard to figure out what is going on, since you only show the last line of the traceback and do not state what platform you are on or which version of PyPy you are using. The lxml package is listed as a requirement for pyshark, so it should have been installed. What happens when you try import lxml?
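To follow up on that last question, a small check like the one below can be run with the same interpreter that fails (for example pypy3 check_imports.py, a hypothetical file name). It is only a sketch: the helper name can_import is made up for illustration, and it reports nothing more than whether each module imports under the interpreter that runs it.

```python
import importlib
import sys


def can_import(module_name):
    """Return True if module_name imports cleanly in this interpreter."""
    try:
        importlib.import_module(module_name)
    except ImportError:  # ModuleNotFoundError is a subclass of ImportError
        return False
    return True


if __name__ == "__main__":
    # Shows which interpreter is actually running the check.
    print("interpreter:", sys.executable)
    for name in ("lxml", "lxml.objectify", "pyshark"):
        print(name, "->", "ok" if can_import(name) else "MISSING")
```

If lxml imports but lxml.objectify does not, that points at an incomplete lxml build for this interpreter rather than at pyshark itself.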

How Do I Make an API Call from Python v2.7.13 When the Requests Module Isn't Available

Good-day Folks,
I am writing a small Python script that will be used on a Ubiquiti EdgeRouter 12P to make an API Call, using Digest Authentication, against another router for some JSON data. This is my first attempt at writing a Python script and I have been able to do this using the Requests module, but only in a Python v3.x.y environment. As I was writing my script, I discovered that the Python version on the EdgeRouter is v2.7.13 and the Requests module isn't installed/loaded.
I have attempted a pip install requests, but it fails with an invalid syntax error message. With my limited knowledge, I can't figure out what my options are now. I Googled around a bit and saw references to using urllib or urllib2, but I'm struggling to figure out how to use either to make my API call with Digest Authentication.
I would very much like to stick to the Requests module, as it appears to be the simplest and cleanest approach. Below is a snippet of my code, any help would be really appreciated, thanks.
PS. My script is not yet finished, as I'm still learning how to parse the response data received.
HERE'S MY SCRIPT
#PYTHON MODULE IMPORTS
import sys #Import the Python Sys module - used later to detect the Python version
PythonVersion = sys.version_info.major #Needed to put this global variable up here, so I can use it early
import json #Import the JSON module - used to print the API Call response content in JSON format
if PythonVersion == 3: #Import the REQUESTS and HTTPDigestAuth modules - used to make API Calls if we detect Python version 3
    import requests
    from requests.auth import HTTPDigestAuth
if PythonVersion == 2: #Import the UrlLib module - used to make API Calls if we detect Python version 2
    import urllib
#GLOBAL VARIABLES
NodeSerial = '1234'
NodeIP = '192.168.1.100'
BasePath = 'api/v0.1'
apiURL = f"http://{NodeIP}/{BasePath}/MagiNodes"
#Define the credentials to use
Username='admin'
Passwd='mypassword'
#MAIN ROUTINE
print("This is a test script")
if PythonVersion == 2:
    print("Python Version 2 detected")
elif PythonVersion == 3:
    print("Python Version 3 detected")
print(f"Now querying MagiLink-{NodeSerial}...")
Query1 = requests.get(f"{apiURL}/{NodeSerial}", auth=HTTPDigestAuth(Username, Passwd)) #Using an F string
pretty_Query1 = json.dumps(Query1.json(), indent=3) #Pretty printing the response content
print(pretty_Query1)
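For the Python 2.7 case, the standard library can handle Digest Authentication without Requests. The following is a hedged sketch, not a tested drop-in for the EdgeRouter. It uses urllib2 on Python 2 and falls back to urllib.request on Python 3, where the same classes exist. It also avoids f-strings, which are a syntax error on Python 2. The helper names build_digest_opener and get_json are hypothetical, and the URL and credentials are the placeholder values from the script above.

```python
try:
    import urllib2 as urlreq           # Python 2
except ImportError:
    import urllib.request as urlreq    # Python 3
import json


def build_digest_opener(top_url, username, password):
    """Return an opener that answers HTTP Digest challenges for top_url."""
    mgr = urlreq.HTTPPasswordMgrWithDefaultRealm()
    # realm=None means: use these credentials for any realm under top_url.
    mgr.add_password(None, top_url, username, password)
    return urlreq.build_opener(urlreq.HTTPDigestAuthHandler(mgr))


def get_json(opener, url):
    """GET url through the opener and decode the JSON body."""
    resp = opener.open(url, timeout=10)
    try:
        return json.loads(resp.read().decode("utf-8"))
    finally:
        resp.close()


if __name__ == "__main__":
    # Placeholder values from the question; nothing is sent until get_json().
    api_url = "http://192.168.1.100/api/v0.1/MagiNodes"
    opener = build_digest_opener(api_url, "admin", "mypassword")
    print("opener ready for " + api_url)
    # data = get_json(opener, api_url + "/1234")
    # print(json.dumps(data, indent=3))
```

The request lines are left commented out because they need the real router to answer. Once uncommented, json.dumps(data, indent=3) pretty-prints the response the same way as the Requests version.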

Run Python API script from cron

I'm trying to run a Python script that calls an API. When I run it manually it works fine. When I run the same script using cron, it fails.
My cron job looks like this:
0 8-17 * * * root /usr/bin/python3 /root/scripts/python_script.sh
I've also tried running the Python script from a bash script. Manually that works, but running the bash script with Python inside via cron doesn't.
I think the problem is with saving the file in my script, but I can't figure it out.
My Python script for the API is below:
import requests
import json
import shutil
import os
import glob
def main():
    ses = requests.Session()
    base_url = 'URL/auth'
    headers = {'Content-Type':'application/x-www-form-urlencoded; charset=UTF-8'}
    payload = {'username':'USER', 'password':'PASS'}
    response = ses.post(base_url, data=payload, headers=headers)
    cookieJar = ses.cookies
    s = str(cookieJar)
    base_url_export = 'URL/export'
    headers_export = {'accept':'application/json', 'Cookie':get_cookies, 'Content-Type':'application/json'}
    payload_export = {'names':'FILE_1'}
    response_export_route = ses.post(base_url_export, data=json.dumps(payload_export), headers=headers_export)
    file_download_FILE_1 = str(response_export_FILE_1.text)
    file_request = ses.get("URL/getFile/"+file_download_FILE_1)
    open(file_download_FILE_1+".xlsx", "wb").write(file_request.content)
    shutil.move('/root/scripts/'+file_download_FILE_1+'.xlsx', 'DST-DIR')
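One possible cause, offered as an assumption rather than a confirmed diagnosis: cron typically starts jobs in the user's home directory, not in the directory you were in when testing by hand. The open(...) call above uses a relative path, so under cron the file may be written somewhere other than /root/scripts/, and the later shutil.move from /root/scripts/ would then fail. The sketch below shows the file-saving step with an explicit directory. The helper name save_download is hypothetical, and the demo writes into a temporary directory in place of the real paths.

```python
import os


def save_download(content, name, directory):
    """Write bytes to directory/name and return the absolute path."""
    directory = os.path.abspath(directory)
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, name)
    with open(path, "wb") as fh:
        fh.write(content)
    return path


if __name__ == "__main__":
    import tempfile
    # Hypothetical stand-ins for file_request.content and the export name.
    target_dir = tempfile.mkdtemp()
    saved = save_download(b"demo bytes", "FILE_1.xlsx", target_dir)
    print("cwd:", os.getcwd())
    print("saved to:", saved)
```

Printing os.getcwd() from the cron run (and redirecting the job's output to a log file) is a quick way to confirm or rule out this explanation.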

python script slow after base64 module import

I am trying to generate keys using os.urandom() and base64 methods. Please see the code below. gen_keys() itself may not be very slow, but the script's overall run time is. For example, gen_keys() takes about 0.85 s, whereas the overall script run time is 2 minutes 6 seconds. I suspect this has something to do with module imports, although I need all of the modules in my script.
Any thoughts on the real issue? Thanks.
I am using Python 3.4.
#!/usr/bin/env python3
import netifaces
import os
import subprocess
import shlex
import requests
import time
import json
import psycopg2
import base64
def gen_keys():
    start_time = time.time()
    a_tok = os.urandom(40)
    a_key = base64.urlsafe_b64encode(a_tok).rstrip(b'=').decode('ascii')
    s_tok = os.urandom(64)
    s_key = base64.urlsafe_b64encode(s_tok).rstrip(b'=').decode('ascii')
    print("a_key: ", a_key)
    print("s_key: ", s_key)
    end_time = time.time()
    print("time taken: ", end_time-start_time)
def Main():
    gen_keys()
if __name__ == '__main__':
    Main()
$~: time ./keys.py
a_key: 52R_5u4I1aZENTsCl-fuuHU1P4v0l-urw-_5_jCL9ctPYXGz8oFnsQ
s_key: HhJgnywrfgfplVjvtOciZAZ8E3IfeG64RCAMgW71Z8Tg112J11OHewgg0r4CWjK_SJRzYzfnN-igLJLRi1CkeA
time taken: 0.8523025512695312
real 2m6.536s
user 0m0.287s
sys 0m7.007s
$~:
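One way to test the import suspicion is to time each import separately. The sketch below does that with the standard library only. The helper name time_imports is hypothetical, and a module that was already imported earlier in the process will report close to zero because Python caches it.

```python
import importlib
import time


def time_imports(module_names):
    """Import each module once and return {name: seconds taken, or None}."""
    timings = {}
    for name in module_names:
        start = time.perf_counter()
        try:
            importlib.import_module(name)
        except ImportError:
            timings[name] = None  # module is not installed here
            continue
        timings[name] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    # The module list from the script in the question.
    names = ["netifaces", "os", "subprocess", "shlex", "requests",
             "time", "json", "psycopg2", "base64"]
    for name, secs in time_imports(names).items():
        print(name, "not installed" if secs is None else "%.3f s" % secs)
```

If every import turns out to be fast, the imports are not the cause. Note also that the timing output shows only about 7 seconds of user plus sys time against 2 minutes of real time, so the process appears to spend most of the run waiting, not computing.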

How to integrate checking of readme in pytest

I use pytest in my .travis.yml to check my code.
I would like to check the README.rst, too.
I found readme_renderer via this Stack Overflow answer.
Now I ask myself how to integrate this into my current tests.
The docs of readme_renderer suggest this, but I have no clue how to integrate it into my setup:
python setup.py check -r -s
I think the simplest and most robust option is to write a pytest plugin that replicates what the distutils command you mentioned in your answer does.
That could be as simple as a conftest.py in your test dir. Or, if you want a standalone plugin that's distributable for all of us to benefit from, there's a nice cookiecutter template.
Of course, there's inherently nothing wrong with calling the check manually in your script section after the call to pytest.
I check it like this now:
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, unicode_literals, print_function
import os
import subx
import unittest
class Test(unittest.TestCase):
    def test_readme_rst_valid(self):
        base_dir = os.path.dirname(os.path.dirname(os.path.dirname(__file__)))
        subx.call(cmd=['python', os.path.join(base_dir, 'setup.py'), 'check', '--metadata', '--restructuredtext', '--strict'])
Source: https://github.com/guettli/reprec/blob/master/reprec/tests/test_setup.py
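The test above relies on subx, a third-party helper. Assuming you would rather depend only on the standard library, the same idea can be sketched with subprocess. The helper names run_setup_check and assert_setup_check_passes are hypothetical. The sketch uses sys.executable in place of 'python' so the check runs under the interpreter that runs the tests.

```python
import subprocess
import sys


def run_setup_check(setup_py):
    """Run `setup.py check` strictly; return (exit code, combined output)."""
    cmd = [sys.executable, setup_py, "check",
           "--metadata", "--restructuredtext", "--strict"]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    output, _ = proc.communicate()
    return proc.returncode, output.decode("utf-8", errors="replace")


def assert_setup_check_passes(setup_py):
    """Test helper: fail with the command's output if the check fails."""
    code, output = run_setup_check(setup_py)
    assert code == 0, output
```

Inside a test, assert_setup_check_passes(os.path.join(base_dir, 'setup.py')) would take the place of the subx.call line, and a failing check shows the command's own output in the assertion message.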
So I implemented something but it does require some modifications. You need to modify your setup.py as below
from distutils.core import setup
setup_info = dict(
    name='so1',
    version='',
    packages=[''],
    url='',
    license='',
    author='tarun.lalwani',
    author_email='',
    description=''
)
if __name__ == "__main__":
    setup(**setup_info)
Then you need to create a symlink so we can import this package in the test
ln -s setup.py setup_mypackage.py
And then you can create a test like below
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, unicode_literals, print_function
import os
import unittest
from distutils.command.check import check
from distutils.dist import Distribution
import setup_mypackage
class Test(unittest.TestCase):
    def test_readme_rst_valid(self):
        dist = Distribution(setup_mypackage.setup_info)
        test = check(dist)
        test.ensure_finalized()
        test.metadata = True
        test.strict = True
        test.restructuredtext = True
        global issues
        issues = []
        def my_warn(msg):
            global issues
            issues += [msg]
        test.warn = my_warn
        test.check_metadata()
        test.check_restructuredtext()
        if len(issues) > 0:
            assert len(issues) == 0, "\n".join(issues)
Running the test then I get
...
AssertionError: missing required meta-data: version, url
missing meta-data: if 'author' supplied, 'author_email' must be supplied too
Ran 1 test in 0.067s
FAILED (failures=1)
This is one possible workaround that I can think of
Upvoted because checking readme consistency is a nice thing I never integrated in my own projects. Will do from now on!
I think your approach with calling the check command is fine, although it will check more than readme's markup. check will validate the complete metadata of your package, including the readme if you have readme_renderer installed.
If you want to write a unit test that does only markup check and nothing else, I'd go with an explicit call of readme_renderer.rst.render:
import pathlib
from readme_renderer.rst import render
def test_markup_is_generated():
    readme = pathlib.Path('README.rst')
    assert render(readme.read_text()) is not None
The None check is the most basic test: if render returns None, the readme contains errors that prevent it from being translated to HTML. If you want more fine-grained tests, work with the HTML string returned. For example, I expect the word "extensions" in my readme to be emphasized:
import pathlib
import bs4
from readme_renderer.rst import render
def test_extensions_is_emphasized():
    readme = pathlib.Path('README.rst')
    html = render(readme.read_text())
    # name the parser explicitly so bs4 does not warn and pick one by itself
    soup = bs4.BeautifulSoup(html, 'html.parser')
    assert soup.find_all('em', string='extensions')
Edit: If you want to see the printed warnings, use the optional stream argument:
from io import StringIO
from readme_renderer.rst import render
def test_markup_is_generated():
    warnings = StringIO()
    with open('README.rst') as f:
        html = render(f.read(), stream=warnings)
    warnings.seek(0)
    assert html is not None, warnings.read()
Sample output:
tests/test_readme.py::test_markup_is_generated FAILED
================ FAILURES ================
________ test_markup_is_generated ________
def test_markup_is_generated():
warnings = StringIO()
with open('README.rst') as f:
html = render(f.read(), stream=warnings)
warnings.seek(0)
> assert html is not None, warnings.read()
E AssertionError: <string>:54: (WARNING/2) Title overline too short.
E
E ----
E fffffff
E ----
E
E assert None is not None
tests/test_readme.py:10: AssertionError
======== 1 failed in 0.26 seconds ========
