Dask diagnostics - progress bar with map_partitions / delayed - python

I am using the distributed scheduler and the distributed progress bar.
Is there a way to get the progress bar to work for DataFrame.map_partitions or delayed? I assume the lack of futures is what keeps the bar from working. If I change my code to client.submit, the progress bar does work.
Code looks like this:
import dask
import dask.dataframe as dd
from distributed import Client
from distributed.diagnostics.progressbar import progress
client = Client("tcp://....")
...
ddf = dd.read_parquet("...")
ddf = ddf.map_partitions(..)
progress(ddf)  # no futures to pass
dask.compute(ddf)
An alternative with dask.delayed does not work either:
delayed = [dask.delayed(myfunc)(ddf.get_partition(i)) for i in range(ddf.npartitions)]
progress(delayed)
dask.compute(*delayed)
Client.submit does produce a working progress bar, but code execution fails and I haven't managed to debug it yet.
futures = [client.submit(myfunc, ddf.get_partition(i)) for i in range(ddf.npartitions)]
progress(futures)
dask.compute(*futures)
Is there a way to get the progress bar (or a report of tasks completed vs. total) working for map_partitions or dask.delayed?
Full code example with delayed:
import dask
import numpy as np
import pandas as pd
import dask.dataframe as dd
from distributed import Client
from distributed.diagnostics.progressbar import progress
import time
cl = Client("tcp://10.0.2.15:8786")
def wait(df):
    print("Received chunk")
    time.sleep(2)
    print("finish")
df = pd.DataFrame(np.random.randint(0, 100, size=(1000, 4)), columns=list('ABCD'))
ddf = dd.from_pandas(df, npartitions=4)
futures = [dask.delayed(wait)(ddf.get_partition(i)) for i in range(ddf.npartitions)]
progress(futures)
dask.compute(*futures)

Yes, you are right: progress is intended to work with futures or collections that contain futures. You don't need to submit a big list of futures to use it, though:
ddf = ddf.map_partitions(..)
fut = client.compute(ddf)
progress(fut)
# wait on fut, call fut.result() or continue
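The same approach covers the dask.delayed case from the question: client.compute also accepts a list of delayed objects and returns futures that progress can track. A minimal sketch, reusing the question's myfunc, ddf and client:
# Sketch only: turn the delayed objects into futures via client.compute,
# then hand those futures to progress().
tasks = [dask.delayed(myfunc)(ddf.get_partition(i)) for i in range(ddf.npartitions)]
futs = client.compute(tasks)      # list of futures
progress(futs)
results = client.gather(futs)     # block until done and collect the results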
Also don't forget: the distributed scheduler that you are using, even if it runs on a single machine only, comes with a diagnostics dashboard that contains the same information. Usually this is at http://localhost:8787, and you can access it from any browser.
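In recent versions of distributed the Client can also tell you the exact address, which is handy when the scheduler is remote; a quick check:
# Print the dashboard address of the connected scheduler.
print(client.dashboard_link)   # e.g. http://10.0.2.15:8787/status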

Related

issues with Dask in python

I have built a simple Dask application that uses multiprocessing to loop through files and create summaries. The code loops through all the zip files in the directory and builds a list of names while iterating over the files (a dummy task). I was not able to either print the name or append it to the list, and I can't figure out what the issue is.
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
plt.ioff()
import time
import os
from pathlib import Path
import glob
import webbrowser
from dask.distributed import Client

client = Client(n_workers=4, threads_per_worker=2)  # In this example I have 8 cores and processes (can also use threads if desired)
webbrowser.open(client.dashboard_link)
print(client)

os.chdir("D:\spx\Complete data\item_000027392")
csv_file_list = [file for file in glob.glob("*.zip")]
total_file = len(csv_file_list)

data_date = []
columns = ['Date', 'straddle_price_open', 'straddle_price_close']
summary = pd.DataFrame(columns=columns)

def my_function(i):
    df = pd.read_csv(Path("D:\spx\Complete data\item_000027392", csv_file_list[i]), skiprows=0)
    date = csv_file_list
    data_date.append(date)
    print(date)
    return date

futures = []
for i in range(0, total_file):
    future = client.submit(my_function, i)
    futures.append(future)

results = client.gather(futures)
client.close()
The idea is that I should be able to operate on the data and print outputs and charts while using Dask, but for some reason I can't.
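For what it's worth, a likely explanation is that my_function runs in worker processes: the print output goes to the workers, and data_date.append(...) mutates a copy of the list in each worker's memory rather than the one in the client process. A minimal sketch of the pattern that does bring values back to the client, reusing the question's client, csv_file_list and total_file (an illustration, not a complete fix):
# Sketch: return values from the task and collect them with client.gather,
# instead of printing or appending to a driver-side list inside the worker.
def my_function(i):
    df = pd.read_csv(csv_file_list[i], skiprows=0)
    return csv_file_list[i]              # return the file name (or any summary of df)

futures = [client.submit(my_function, i) for i in range(total_file)]
data_date = client.gather(futures)       # results arrive back on the client
print(data_date)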

How to productionise Python script for AWS Glue?

I'm following this tutorial video: https://www.youtube.com/watch?v=EzQArFt_On4
The example code provided in this video:
import sys

from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
glueJob = Job(glueContext)
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueJob.init(args['JOB_NAME'], args)
sparkSession = glueContext.spark_session

# ETL process code
def etl_process():
    ...
    return xxx

glueJob.commit()
I'm wondering if the part before the etl_process function can be used in production directly, or do I need to wrap that part into a separate function so that I can add a unit test for it?
something like this:
def define_spark_session():
    sc = SparkContext.getOrCreate()
    glue_context = GlueContext(sc)
    glue_job = Job(glue_context)
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    glue_job.init(args['JOB_NAME'], args)
    spark_session = glue_context.spark_session
    return spark_session
But it seems it doesn't need a parameter...
Or should I just write unit test for etl_process function?
Or maybe I can create a separate python file with etl_process function and import it in this script?
I'm new to this and a bit confused. Could someone help, please? Thanks.
For now it is very difficult to test AWS Glue itself locally, although there are some options, such as running the Docker image AWS provides (you'll probably need some tweaks, but it should be all right).
I guess the easiest way is to transform the DynamicFrame you get from the Glue libraries into a Spark DataFrame (.toDF()) and then do things in pure Spark (PySpark), so you'll be able to test the result.
dataFrame = dynamic_frame.toDF()

def transformation(dataframe):
    return dataframe.withColumn(...)

def test_transformation():
    result = transformation(input_test_dataframe)
    assert ...
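To make that test concrete, here is a minimal sketch of how a pure-PySpark transformation can be unit-tested against a local SparkSession, independent of Glue; the function and column names below are hypothetical examples, not part of the original job:
# Minimal sketch: unit-test a pure-PySpark transformation with a local SparkSession.
# `add_flag`, `amount` and `is_large` are hypothetical names for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

def add_flag(df):
    return df.withColumn("is_large", F.col("amount") > 100)

def test_add_flag():
    spark = SparkSession.builder.master("local[1]").appName("glue-unit-test").getOrCreate()
    input_df = spark.createDataFrame([(50,), (200,)], ["amount"])
    result = add_flag(input_df).collect()
    assert [row.is_large for row in result] == [False, True]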

Use Pandas dataframe in mrJob

I have a Python script and I need to use mrjob to make it faster.
How do I make the script below use mrjob?
The script below works fine for a small file, but when I run a large file it takes forever, so I am planning to use mrjob, which is a MapReduce Python package. The problem is that I don't know how to use mrjob for this script; please advise.
import os
import pandas as pd
import pyffx
import string
import sys

column = 'first_name'
filename = "python_test.csv"
encrypted_value_list = []
alpha = string.printable
key = b'sec-key'
seperator_in = '|'
seperator_out = '|'
outputfile = 'encypted.csv'
compression_in = None
compression_out = None

df = pd.read_csv(filename, compression=compression_in, sep=seperator_in, low_memory=False, encoding='utf-8-sig')
df_null = df[df[column].isnull()]
df_notnull = df[df[column].notnull()].copy()

for index, row in df_notnull.iterrows():
    e = pyffx.String(key, alphabet=alpha, length=len(row[column]))
    encrypted_value_list.append(e.encrypt(row[column]))

df_notnull[column] = encrypted_value_list
df_merged = pd.concat([df_notnull, df_null], axis=0, ignore_index=True, sort=False)
df_merged
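Since the question is specifically about how the mrjob API would apply here, below is a minimal, hedged sketch of a mapper-only mrjob job for the encryption step. It assumes each CSV record reaches the mapper as one line, that first_name sits at a known column index, and it reuses the pyffx call from the script above; treat it as a starting point, not a drop-in replacement:
# Sketch of a mapper-only mrjob job for the encryption step above.
# Assumptions: one CSV record per line, first_name at FIRST_NAME_IDX,
# and the same pyffx parameters as in the question.
import string
import pyffx
from mrjob.job import MRJob

FIRST_NAME_IDX = 0          # hypothetical position of the first_name column
KEY = b'sec-key'
ALPHABET = string.printable

class MREncryptFirstName(MRJob):

    def mapper(self, _, line):
        fields = line.split(',')
        value = fields[FIRST_NAME_IDX]
        if value:           # skip empty values, like the df_notnull split above
            e = pyffx.String(KEY, alphabet=ALPHABET, length=len(value))
            fields[FIRST_NAME_IDX] = e.encrypt(value)
        yield None, ','.join(fields)

if __name__ == '__main__':
    MREncryptFirstName.run()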

How to do data profiling on a table using pandas_profiling

When I'm trying to do data profiling on a SQL Server table using pandas_profiling, it throws an error like:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
This is the code I'm running; I couldn't figure out how to resolve this issue.
import pandas as pd
import pandas_profiling
df=pd.DataFrame(read)
profile=pandas_profiling.ProfileReport(df)
I expect to see a profiling result for the given table.
Try using multiprocessing.freeze_support() as below:
import multiprocessing

import numpy as np
import pandas as pd
import pandas_profiling

def test_profile():
    df = pd.DataFrame(
        np.random.rand(100, 5),
        columns=['a', 'b', 'c', 'd', 'e']
    )
    profile = pandas_profiling.ProfileReport(df)
    profile.to_file(outputfile="output.html")

if __name__ == '__main__':
    multiprocessing.freeze_support()
    test_profile()
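A short note on why this helps, as far as I understand it: pandas_profiling uses multiprocessing internally, and with the spawn start method (the default on Windows) each child process re-imports the main module, so the profiling call has to sit behind the if __name__ == '__main__': guard; freeze_support() additionally covers the case where the script is frozen into an executable.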

Ipyparallel and numba (jit)

I've been trying to run some code using ipyparallel. However, some of my functions are decorated with jit, which results in the IPython console doing nothing and freezing. Here is an example:
import time
import numpy
from numba import jit
from ipyparallel import Client

client = Client()  # create client and direct view to all engines available
dview = client[:]
dview.block = True

with dview.sync_imports():
    import numpy

@jit
def TestingFnc(x):
    a = x[0]
    b = x[1]
    c = x[2]
    Result = 0
    for i in range(100000):
        Result = Result + ((a+i)**2 + (b-i)**3) / (c+i)**3
    return numpy.array([Result, 1])

d = numpy.array([[0., 0., 0.], [0., 0., 1.], [0., 0., 2]])
Results = dview.map(TestingFnc, d)
Without the @jit, the code runs in parallel.
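One workaround that is often suggested for this kind of freeze (an assumption about the cause, not a confirmed fix) is to apply jit inside the function that ipyparallel ships to the engines, so the numba-compiled object is built on each engine instead of being serialized from the client. A sketch of that idea with a simplified kernel:
# Sketch of a possible workaround: import numba and apply jit inside the
# function that gets mapped, so compilation happens on each engine.
# The kernel below is a simplified stand-in, not the question's exact formula.
def testing_fnc_remote(x):
    import numpy
    from numba import jit

    @jit(nopython=True)
    def kernel(a, b, c):
        result = 0.0
        for i in range(100000):
            result += (a + i) ** 2 + (b - i) ** 3 + (c + i) ** 3
        return result

    return numpy.array([kernel(x[0], x[1], x[2]), 1.0])

Results = dview.map(testing_fnc_remote, d)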
