Ipyparallel and numba (jit) - python

I've been trying to run some code using ipyparallel. However, some of my functions are decorated with numba's jit, and this leaves the IPython console doing nothing and freezing. Here is an example:
import time
import numpy
from numba import jit
from ipyparallel import Client

client = Client()  # create client and direct view to all engines available
dview = client[:]
dview.block = True

with dview.sync_imports():
    import numpy

@jit
def TestingFnc(x):
    a = x[0]
    b = x[1]
    c = x[2]
    Result = 0
    for i in range(100000):
        Result = Result + ((a + i)**2 + (b - i)**3) / (c + i)**3
    return numpy.array([Result, 1])

d = numpy.array([[0., 0., 0.], [0., 0., 1.], [0., 0., 2.]])
Results = dview.map(TestingFnc, d)
Without the @jit decorator, the code runs in parallel.
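One workaround that is sometimes used, shown here only as a hedged sketch: keep the function that is sent to the engines undecorated, and let each engine compile the numerical kernel itself when the function runs. TestingFncLocalJit and kernel are hypothetical names; recompiling on every call adds overhead, but it avoids shipping a jit-decorated object from the interactive session to the engines.
import numpy
from ipyparallel import Client

client = Client()
dview = client[:]
dview.block = True

with dview.sync_imports():
    import numpy

def TestingFncLocalJit(x):
    # numba is imported and the kernel compiled on the engine itself,
    # so no jit-decorated object has to be serialized from the client
    import numba
    import numpy

    @numba.jit(nopython=True)
    def kernel(a, b, c):
        result = 0.0
        for i in range(100000):
            result += ((a + i)**2 + (b - i)**3) / (c + i)**3
        return result

    return numpy.array([kernel(x[0], x[1], x[2]), 1])

d = numpy.array([[0., 0., 0.], [0., 0., 1.], [0., 0., 2.]])
Results = dview.map(TestingFncLocalJit, d)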

Related

issues with Dask in python

I have built a simple Dask application that uses multiprocessing to loop through files and create summaries. The code loops through all the zip files in a directory and builds a list of names while iterating through the files (a dummy task). I am not able to either print the name or append it to the list, and I can't figure out what the issue is.
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
plt.ioff()
import time
import os
from pathlib import Path
import glob
import webbrowser
from dask.distributed import Client

client = Client(n_workers=4, threads_per_worker=2)  # in this example I have 8 cores and processes (can also use threads if desired)
webbrowser.open(client.dashboard_link)
print(client)

os.chdir("D:\spx\Complete data\item_000027392")
csv_file_list = [file for file in glob.glob("*.zip")]
total_file = len(csv_file_list)

data_date = []
columns = ['Date', 'straddle_price_open', 'straddle_price_close']
summary = pd.DataFrame(columns=columns)

def my_function(i):
    df = pd.read_csv(Path("D:\spx\Complete data\item_000027392", csv_file_list[i]), skiprows=0)
    date = csv_file_list
    data_date.append(date)
    print(date)
    return date

futures = []
for i in range(0, total_file):
    future = client.submit(my_function, i)
    futures.append(future)

results = client.gather(futures)
client.close()
The idea is that I should be able to operate on the data and print outputs and charts while using Dask, but for some reason I can't.
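For reference, a hedged sketch of the pattern that usually works with the distributed scheduler. Tasks run in worker processes, so appending to the driver-side data_date list inside my_function never changes the list the driver sees (and date = csv_file_list assigns the whole list rather than one name); returning a value from the task and collecting it with client.gather does work. summarize_file is a hypothetical helper, not the original function:
import os
from pathlib import Path

import pandas as pd
from dask.distributed import Client

def summarize_file(path):
    # runs on a worker: read the file and return whatever the driver needs
    df = pd.read_csv(path, skiprows=0)
    return {"file": os.path.basename(str(path)), "rows": len(df)}

if __name__ == "__main__":
    client = Client(n_workers=4, threads_per_worker=2)
    data_dir = Path(r"D:\spx\Complete data\item_000027392")  # raw string avoids escape issues
    zip_files = sorted(data_dir.glob("*.zip"))

    futures = [client.submit(summarize_file, p) for p in zip_files]
    results = client.gather(futures)   # list of dicts, one per file
    summary = pd.DataFrame(results)    # build the summary on the driver
    print(summary)
    client.close()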

Dask diagnostics - progress bar with map_partition / delayed

I am using the distributed scheduler and distributed progressbar.
Is there a way to have the progress bar work with DataFrame.map_partitions or delayed? I assume the lack of futures is what causes the bar not to work. If I change my code to use client.submit, the progress bar does work.
Code looks like this:
import dask.dataframe as dd
from distributed import Client
from distributed.diagnostics.progressbar import progress
client = Client("tcp://....")
...
ddf = dd.read_parquet("...")
ddf = ddf.map_partitions(..)
progress(ddf) # no futures to pass
dask.compute(ddf)
An alternative with dask.delayed does not work either:
delayed = [dask.delayed(myfunc)(ddf.get_partition(i)) for i in range(ddf.npartitions)]
progress(delayed)
dask.compute(*delayed)
Client.submit does produce a working progress bar, but code execution fails and I haven't managed to debug it yet.
futures = [client.submit(myfunc, ddf.get_partition(i)) for i in range(ddf.npartitions)]
progress(futures)
dask.compute(*futures)
Is there a way to get the progress bar (or a report of tasks completed vs total) working for map_partitions or dask.delayed ?
Full code example with delayed:
import dask
import numpy as np
import pandas as pd
import dask.dataframe as dd
from distributed import Client
from distributed.diagnostics.progressbar import progress
import time

cl = Client("tcp://10.0.2.15:8786")

def wait(df):
    print("Received chunk")
    time.sleep(2)
    print("finish")

df = pd.DataFrame(np.random.randint(0, 100, size=(1000, 4)), columns=list('ABCD'))
ddf = dd.from_pandas(df, npartitions=4)

futures = [dask.delayed(wait)(ddf.get_partition(i)) for i in range(ddf.npartitions)]
progress(futures)
dask.compute(*futures)
Yes, you are right: progress is intended to work with futures, or collections that contain futures. You don't need to submit a big list of futures to use it, though:
ddf = ddf.map_partitions(..)
fut = client.compute(ddf)
progress(fut)
# wait on fut, call fut.result() or continue
Also don't forget: the distributed scheduler that you are using, even if only on a single machine, comes with a diagnostics dashboard that contains the same information. It is usually at http://localhost:8787, and you can access it from any browser.
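For the dask.delayed variant from the question, the same idea applies. A minimal sketch, reusing myfunc and ddf from the example above and assuming the distributed Client is bound to client:
import dask
from distributed.diagnostics.progressbar import progress

# client.compute turns the delayed objects into futures, which progress can track
delayed_parts = [dask.delayed(myfunc)(ddf.get_partition(i))
                 for i in range(ddf.npartitions)]
futures = client.compute(delayed_parts)  # one future per partition
progress(futures)
results = client.gather(futures)         # block until all partitions finish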

python script slow after base64 module import

I am trying to generate keys using os.urandom() and base64 methods; please see the code below. gen_keys() itself is not very slow, but the overall run time of the script is: gen_keys() takes about 0.85 seconds, whereas the whole script takes 2 minutes 6 seconds. I suspect this has something to do with the module imports, although I do need all of the modules in my script.
Any thoughts on the real issue? Thanks.
I am using Python 3.4.
#!/usr/bin/env python3
import netifaces
import os
import subprocess
import shlex
import requests
import time
import json
import psycopg2
import base64

def gen_keys():
    start_time = time.time()
    a_tok = os.urandom(40)
    a_key = base64.urlsafe_b64encode(a_tok).rstrip(b'=').decode('ascii')
    s_tok = os.urandom(64)
    s_key = base64.urlsafe_b64encode(s_tok).rstrip(b'=').decode('ascii')
    print("a_key: ", a_key)
    print("s_key: ", s_key)
    end_time = time.time()
    print("time taken: ", end_time - start_time)

def Main():
    gen_keys()

if __name__ == '__main__':
    Main()
$~: time ./keys.py
a_key: 52R_5u4I1aZENTsCl-fuuHU1P4v0l-urw-_5_jCL9ctPYXGz8oFnsQ
s_key: HhJgnywrfgfplVjvtOciZAZ8E3IfeG64RCAMgW71Z8Tg112J11OHewgg0r4CWjK_SJRzYzfnN-igLJLRi1CkeA
time taken: 0.8523025512695312
real 2m6.536s
user 0m0.287s
sys 0m7.007s
$~:
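Nearly all of the two minutes shows up as real time rather than user or sys time, which points at the process waiting during startup (i.e. the imports) rather than computing. A rough diagnostic sketch to confirm which import is responsible, timing each one separately (the module list is copied from the script above; Python 3.7+ has python -X importtime for this, but that option does not exist on 3.4):
#!/usr/bin/env python3
import time

# time each import on its own to see where the startup cost goes
for name in ("netifaces", "subprocess", "shlex", "requests",
             "json", "psycopg2", "base64"):
    t0 = time.time()
    __import__(name)
    print("import {:<12} {:6.3f} s".format(name, time.time() - t0))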

Python type error while embedding

I am trying to run the following Python code from C++ (embedding Python).
import sys
import os
import time
import win32com.client
from com.dtmilano.android.viewclient import ViewClient
import re
import pythoncom
import thread

os.popen('adb devices')

CANalyzer = None
measurement = None

def can_start(config_path):
    global CANalyzer, measurement
    CANalyzer = win32com.client.Dispatch('CANalyzer.Application')
    CANalyzer.Visible = 1
    measurement = CANalyzer.Measurement
    CANalyzer.Open(config_path)
    measurement.Start()
    com_marshall_stream = pythoncom.CoMarshalInterThreadInterfaceInStream(pythoncom.IID_IDispatch, CANalyzer)
    return com_marshall_stream
When I try to call the can_start function, I get a Python TypeError. The error traceback is shown below.
"type 'exceptions.TypeError'. an integer is required. traceback object at 0x039A198"
The function executes fine if I run it directly from Python, and it also works on the PC where the code was developed. But after transferring it to another laptop, I am seeing this problem.
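One thing worth checking, offered only as a hedged guess: when Python is embedded, can_start may run on a thread created by the C++ host on which COM has never been initialized. Calling pythoncom.CoInitialize() before Dispatch is a common requirement in that situation, although it is not certain that this explains the "an integer is required" TypeError:
import pythoncom
import win32com.client

CANalyzer = None
measurement = None

def can_start(config_path):
    global CANalyzer, measurement
    pythoncom.CoInitialize()  # make sure COM is initialized on this thread
    CANalyzer = win32com.client.Dispatch('CANalyzer.Application')
    CANalyzer.Visible = 1
    measurement = CANalyzer.Measurement
    CANalyzer.Open(config_path)
    measurement.Start()
    return pythoncom.CoMarshalInterThreadInterfaceInStream(
        pythoncom.IID_IDispatch, CANalyzer)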

importing module (nltk) causes multiprocessing to hang

I tracked a Python multiprocessing headache down to the import of a module (nltk). Reproducible (hopefully) code is pasted below. This doesn't make any sense to me; does anybody have any ideas?
from multiprocessing import Pool
import time, requests
#from nltk.corpus import stopwords  # uncomment this and it hangs

def gethtml(key, url):
    r = requests.get(url)
    return r.text

def getnothing(key, url):
    return "nothing"

if __name__ == '__main__':
    pool = Pool(processes=4)
    result = list()
    nruns = 4
    url = 'http://davidchao.typepad.com/webconferencingexpert/2013/08/gartners-magic-quadrant-for-cloud-infrastructure-as-a-service.html'

    for i in range(0, nruns):
        # print gethtml(i, url)
        result.append(pool.apply_async(gethtml, [i, url]))
        # result.append(pool.apply_async(getnothing, [i, url]))
    pool.close()

    # monitor jobs until they complete
    running = nruns
    while running > 0:
        time.sleep(1)
        running = 0
        for run in result:
            if not run.ready(): running += 1
        print "processes still running:", running

    # print results
    for i, run in enumerate(result):
        print i, run.get()[0:40]
Note that the getnothing function works; it's the combination of the nltk module import and the requests call that hangs. Sigh.
> python --version
Python 2.7.6
> python -c 'import sys;print("%x" % sys.maxsize, sys.maxsize > 2**32)'
('7fffffffffffffff', True)
> pip freeze | grep requests
requests==2.2.1
> pip freeze | grep nltk
nltk==2.0.4
I would point others with similar problems to solutions that do not use the multiprocessing module:
1) Apache Spark for scalability/flexibility. However, this doesn't seem to be a solution for Python multiprocessing; it looks like pyspark is also limited by the Global Interpreter Lock.
2) 'gevent' or 'twisted' for general Python asynchronous processing
http://sdiehl.github.io/gevent-tutorial/
3) grequests for asynchronous requests (a sketch follows below)
Asynchronous Requests with Python requests
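As a rough illustration of option 3, a hedged sketch using grequests (install it separately with pip install grequests); the URL is the one from the question:
import grequests

url = ('http://davidchao.typepad.com/webconferencingexpert/2013/08/'
       'gartners-magic-quadrant-for-cloud-infrastructure-as-a-service.html')

reqs = (grequests.get(url) for _ in range(4))
responses = grequests.map(reqs)  # fetch all pages concurrently via gevent
for i, r in enumerate(responses):
    if r is not None:
        print("%d %s" % (i, r.text[0:40]))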
