Issues with Dask in Python

I have built a simple Dask application that uses multiprocessing to loop through files and create summaries. The code loops through all the zip files in the directory and builds a list of names while iterating through the files (a dummy task). I was not able to either print the name or append it to the list, and I can't figure out what the issue is.
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
plt.ioff()
import time
import os
from pathlib import Path
import glob
import webbrowser
from dask.distributed import Client
client = Client(n_workers=4, threads_per_worker=2)  # 8 cores in this example; worker processes are used (threads can be used instead if desired)
webbrowser.open(client.dashboard_link)
print(client)
os.chdir(r"D:\spx\Complete data\item_000027392")
csv_file_list=[file for file in glob.glob("*.zip")]
total_file=len(csv_file_list)
data_date=[]
columns=['Date', 'straddle_price_open', 'straddle_price_close']
summary=pd.DataFrame(columns =columns)
def my_function(i):
    df = pd.read_csv(Path(r"D:\spx\Complete data\item_000027392", csv_file_list[i]), skiprows=0)
    date = csv_file_list
    data_date.append(date)
    print(date)
    return date
futures = []
for i in range(0, total_file):
    future = client.submit(my_function, i)
    futures.append(future)
results = client.gather(futures)
client.close()
The idea is that I should be able to operate on the data and print outputs and charts while using Dask, but for some reason I can't.
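For what it's worth, the behaviour follows from how the distributed scheduler runs tasks: my_function executes in the worker processes, so appending to the module-level data_date list happens in the workers rather than in the client process, and print() output also appears on the workers (it may not show up where you expect). Only returned values travel back through client.gather. A minimal sketch of collecting the names via return values instead (the helper name summarize and passing the filename directly are my own choices):
def summarize(fname):
    # runs on a worker; only what is returned travels back to the client
    df = pd.read_csv(Path(r"D:\spx\Complete data\item_000027392", fname), skiprows=0)
    return fname

futures = [client.submit(summarize, f) for f in csv_file_list]
data_date = client.gather(futures)   # list of names, built in the client process
print(data_date)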

Related

NameError: name 'athena' is not defined, when importing an Athena query function from another Jupyter notebook

My query_distinct_data() function executes successfully when run.
But when I try to import query_distinct_data() from my function notebook map_distinct_data into my main notebook, I get the following error.
NameError: name 'athena' is not defined
Below is my main page:
import pandas as pd
import requests
import xml.etree.ElementTree as ET
from datetime import date
import boto3
import time
import geopandas
import folium
from ipynb.fs.full.qld_2 import qld_data
from ipynb.fs.full.vic_2 import vic_data
from ipynb.fs.full.put_to_s3_bucket import put_to_s3_bucket
from ipynb.fs.full.map_distinct_data import query_distinct_data
from ipynb.fs.full.map_distinct_data import distinct_data_df
from ipynb.fs.full.map_distinct_data import create_distinct_data_map
aws_region = "ap-southeast-2"
schema_name = "fire_data"
table_name ='rfs_fire_data'
result_output_location = "s3://camgoo2-rfs-visualisation/query_results/"
bucket='camgoo2-rfs-visualisation'
athena = boto3.client("athena",region_name=aws_region)
qld_data()
vic_data()
put_to_s3_bucket()
execution_id = query_distinct_data()
df = distinct_data_df()
create_distinct_data_map()
Below is the function that I want to import from the map_distinct_data notebook. It executes successfully on its own, but I get the error when trying to import it into my main page.
def query_distinct_data():
    query = "SELECT DISTINCT * from fire_data.rfs_fire_data where state in ('NSW','VIC','QLD')"
    response = athena.start_query_execution(
        QueryString=query,
        ResultConfiguration={"OutputLocation": result_output_location})
    return response["QueryExecutionId"]
I am able to run query_distinct_data() and it executes fine when run separately.
But it fails when I try to import the function.
The other functions that I import using ipynb.fs.full, which do involve athena, execute okay when imported.
It is all about variable visibility scope (1, 2).
In short: the map_distinct_data module knows nothing about the main page's athena variable.
The good and correct way is to pass the athena variable into the function as a parameter:
from ipynb.fs.full.map_distinct_data import create_distinct_data_map
...
athena = boto3.client("athena",region_name=aws_region)
execution_id = create_distinct_data_map(athena)
where create_distinct_data_map should be defined as
def create_distinct_data_map(athena):
    ...
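Applying the same change to the failing function from the question, a sketch could look like the following. Note that result_output_location is also read inside the function and is a main-page global, so it has the same visibility problem and is passed in as well here:
def query_distinct_data(athena, result_output_location):
    query = "SELECT DISTINCT * from fire_data.rfs_fire_data where state in ('NSW','VIC','QLD')"
    response = athena.start_query_execution(
        QueryString=query,
        ResultConfiguration={"OutputLocation": result_output_location})
    return response["QueryExecutionId"]

# on the main page:
execution_id = query_distinct_data(athena, result_output_location)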
The second way is to set the variable inside the imported module:
from ipynb.fs.full.map_distinct_data import create_distinct_data_map
from ipynb.fs.full import map_distinct_data
athena = boto3.client("athena",region_name=aws_region)
map_distinct_data.athena = athena
execution_id = create_distinct_data_map()
Even though the second way works, it is really bad style.
Here is some must-know information about encapsulation in Python.

Unable to read URL

My code is below; it ran perfectly for quite a while but suddenly started raising an error. I tried other stock data providers like Google and Alpha Vantage and got the same error message.
import plotly.graph_objects as go
import plotly.express as px
from datetime import datetime
import numpy as np
!pip install ffn
import ffn
import pandas_datareader.data as web
from pandas.plotting import register_matplotlib_converters
from pylab import *
import pandas as pd
import pandas_profiling
import matplotlib.pyplot as plot
from matplotlib import style
%matplotlib inline
stocks = 'alf,mrin,auud,sncr,ddd,ssnt,seac,ttd'
df = ffn.get(stocks, start='06/18/2021').to_returns().dropna()
print(df.as_format('.2%'))
df = df.apply(pd.to_numeric, errors='coerce').fillna(0)
sums = df.select_dtypes(np.number).sum()
sort_sums = sums.sort_values(ascending = False)
pd.set_option('Display.max_rows', len(stocks))
sharpe = ffn.core.calc_sharpe(df)
sharpe = sharpe.sort_values(ascending = False)
df.append({stocks: sharpe},ignore_index=True)
print(str(sort_sums.as_format('.2%')))
print(sharpe.head(10))
df.shape
I'm using Google Colaboratory
Please run the code and you will see the error message I'm getting (I can't copy it here).
Please help & thank you very much in advance!

Dask diagnostics - progress bar with map_partition / delayed

I am using the distributed scheduler and the distributed progress bar.
Is there a way of having the progress bar work for DataFrame.map_partitions or delayed? I assume the lack of futures is what causes the bar not to work. If I change my code to use client.submit, the progress bar does work.
Code looks like this:
import dask
import dask.dataframe as dd
from distributed import Client
from distributed.diagnostics.progressbar import progress
client = Client("tcp://....")
...
ddf = dd.read_parquet("...")
ddf = ddf.map_partitions(..)
progress(ddf) # no futures to pass
dask.compute(ddf)
An alternative with dask.delayed does not work either:
delayed = [dask.delayed(myfunc)(ddf.get_partition(i)) for i in range(ddf.npartitions)]
progress(delayed)
dask.compute(*delayed)
Client.submit does produce a working progress bar, but code execution fails and I haven't managed to debug it yet.
futures = [client.submit(myfunc, ddf.get_partition(i)) for i in range(ddf.npartitions)]
progress(futures)
dask.compute(*futures)
Is there a way to get the progress bar (or a report of tasks completed vs total) working for map_partitions or dask.delayed?
Full code example with delayed:
import dask
import numpy as np
import pandas as pd
import dask.dataframe as dd
from distributed import Client
from distributed.diagnostics.progressbar import progress
import time
cl = Client("tcp://10.0.2.15:8786")
def wait(df):
    print("Received chunk")
    time.sleep(2)
    print("finish")
df = pd.DataFrame(np.random.randint(0, 100, size=(1000, 4)), columns=list('ABCD'))
ddf = dd.from_pandas(df, npartitions=4)
futures = [dask.delayed(wait)(ddf.get_partition(i)) for i in range(ddf.npartitions)]
progress(futures)
dask.compute(*futures)
Yes, you are right: progress is intended to work with futures or collections that contain futures. You don't need to submit a big list of futures to use it, though:
ddf = ddf.map_partitions(..)
fut = client.compute(ddf)
progress(fut)
# wait on fut, call fut.result() or continue
Also don't forget: the distributed scheduler that you are using, even if only on a single machine, comes with a diagnostics dashboard that contains the same information. Usually this is at http://localhost:8787, and you can access it from any browser.
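Putting that together with the map_partitions example from the question, a minimal end-to-end sketch (reusing the scheduler address from the question) could look like:
import time
import numpy as np
import pandas as pd
import dask.dataframe as dd
from distributed import Client
from distributed.diagnostics.progressbar import progress

client = Client("tcp://10.0.2.15:8786")

def wait(df):
    time.sleep(2)   # stand-in for real per-partition work
    return df

df = pd.DataFrame(np.random.randint(0, 100, size=(1000, 4)), columns=list('ABCD'))
ddf = dd.from_pandas(df, npartitions=4)
ddf = ddf.map_partitions(wait)

fut = client.compute(ddf)   # futures now exist, so the bar has something to track
progress(fut)
result = fut.result()       # block until finished and pull the result back to the client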

Use Pandas dataframe in mrJob

I have a Python script and I need to use mrjob to make it faster.
How do I make the script below use mrjob?
The script below works fine for a small file, but when I run a large file it takes forever, so I am planning to use mrjob, which is a MapReduce Python package. The problem is that I don't know how to use mrjob for this script; please advise.
import os
import pandas as pd
import pyffx
import string
import sys
column='first_name'
filename="python_test.csv"
encrypted_value_list = []
alpha=string.printable
key=b'sec-key'
seperator_in='|'
seperator_out='|'
outputfile='encypted.csv'
compression_in=None
compression_out=None
df = pd.read_csv(filename, compression=compression_in, sep=seperator_in, low_memory=False, encoding='utf-8-sig')
df_null = df[df[column].isnull()]
df_notnull = df[df[column].notnull()].copy()
for index, row in df_notnull.iterrows():
    e = pyffx.String(key, alphabet=alpha, length=len(row[column]))
    encrypted_value_list.append(e.encrypt(row[column]))
df_notnull[column]=encrypted_value_list
df_merged = pd.concat([df_notnull, df_null], axis=0, ignore_index=True, sort=False)
df_merged
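For orientation only, a mapper-only mrjob job doing the same per-row encryption might look roughly like the sketch below. It is untested and makes assumptions beyond the original script: the class name MREncryptColumn is mine, the input is treated as raw '|'-separated lines (mrjob's default line-based input), the position of first_name is hard-coded, and the header row and null handling from the pandas version are not reproduced:
from mrjob.job import MRJob
import pyffx
import string

KEY = b'sec-key'
ALPHA = string.printable
COL = 0           # hypothetical: index of the first_name field in each line
SEP = '|'         # mirrors seperator_in from the original script

class MREncryptColumn(MRJob):
    def mapper(self, _, line):
        fields = line.split(SEP)
        value = fields[COL]
        if value:  # rough stand-in for the notnull() filter
            e = pyffx.String(KEY, alphabet=ALPHA, length=len(value))
            fields[COL] = e.encrypt(value)
        yield None, SEP.join(fields)

if __name__ == '__main__':
    MREncryptColumn.run()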

How to let my subfile use my definition in main file?

I have this code in my main.py file:
import pandas as pd
from modules.my_self_defined import *
input='1.csv'
df=just_an_example(input)
in ./modules/my_self_defined.py:
def just_an_example(csv_file):
    a = pd.read_csv(csv_file)
    return a
Then when I run the file, it says pd is not defined in ./modules/my_self_defined.py
How could I make it work?
You use pandas (pd) in my_self_defined.py, not in main.py. So import it in my_self_defined.py instead and it'll work.
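In other words, ./modules/my_self_defined.py should contain its own import:
# ./modules/my_self_defined.py
import pandas as pd

def just_an_example(csv_file):
    a = pd.read_csv(csv_file)
    return a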
