How to do data profiling on a table using pandas_profiling - Python

When I'm trying to do data profiling on a SQL Server table using pandas_profiling, it throws an error like:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
This is the code I'm running; I couldn't figure out how to resolve this issue.
import pandas as pd
import pandas_profiling
df=pd.DataFrame(read)
profile=pandas_profiling.ProfileReport(df)
I expect to see a profiling result of a given table:

Try using multiprocessing.freeze_support() as below:
import multiprocessing
import numpy as np
import pandas as pd
import pandas_profiling

def test_profile():
    df = pd.DataFrame(
        np.random.rand(100, 5),
        columns=['a', 'b', 'c', 'd', 'e']
    )
    profile = pandas_profiling.ProfileReport(df)
    profile.to_file(outputfile="output.html")

if __name__ == '__main__':
    multiprocessing.freeze_support()
    test_profile()
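Applied to the original SQL Server table, a minimal sketch could look like the following. The connection details, table name, and the use of pyodbc are assumptions for illustration only; replace them with your own setup.
import multiprocessing
import pandas as pd
import pandas_profiling
import pyodbc  # assumption: pyodbc is used to connect to SQL Server

def profile_table():
    # hypothetical connection string; replace server/database/table with your own
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=my_server;DATABASE=my_db;Trusted_Connection=yes;"
    )
    df = pd.read_sql("SELECT * FROM my_table", conn)
    profile = pandas_profiling.ProfileReport(df)
    profile.to_file(outputfile="table_profile.html")

if __name__ == '__main__':
    multiprocessing.freeze_support()
    profile_table()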

Related

NameError: name 'athena' is not defined, when importing an athena query function from another Jupyter notebook

My query_distinct_data() function executes successfully when run.
But when I try to import query_distinct_data() from my function notebook map_distinct_data into my main page using Jupyter notebook, I get the following error.
NameError: name 'athena' is not defined
Below is my main page:
import pandas as pd
import requests
import xml.etree.ElementTree as ET
from datetime import date
import boto3
import time
import geopandas
import folium
from ipynb.fs.full.qld_2 import qld_data
from ipynb.fs.full.vic_2 import vic_data
from ipynb.fs.full.put_to_s3_bucket import put_to_s3_bucket
from ipynb.fs.full.map_distinct_data import query_distinct_data
from ipynb.fs.full.map_distinct_data import distinct_data_df
from ipynb.fs.full.map_distinct_data import create_distinct_data_map
aws_region = "ap-southeast-2"
schema_name = "fire_data"
table_name ='rfs_fire_data'
result_output_location = "s3://camgoo2-rfs-visualisation/query_results/"
bucket='camgoo2-rfs-visualisation'
athena = boto3.client("athena",region_name=aws_region)
qld_data()
vic_data()
put_to_s3_bucket()
execution_id = query_distinct_data()
df = distinct_data_df()
create_distinct_data_map()
Below is the function that I want to import from the map_distinct_data notebook. It executes successfully on its own, but I get the error when trying to import it into my main page.
def query_distinct_data():
    query = "SELECT DISTINCT * from fire_data.rfs_fire_data where state in ('NSW','VIC','QLD')"
    response = athena.start_query_execution(
        QueryString=query,
        ResultConfiguration={"OutputLocation": result_output_location})
    return response["QueryExecutionId"]
I am able to run query_distinct_data() and it executes when run separately.
But it fails when I try to import the function.
The other functions that I import using ipynb.fs.full that do not involve athena execute okay when imported.
It is all about variable visibility scope (1, 2).
In short: the map_distinct_data module knows nothing about the main page's athena variable.
The good and correct way is to pass the athena variable into the function as a parameter:
from ipynb.fs.full.map_distinct_data import create_distinct_data_map
...
athena = boto3.client("athena",region_name=aws_region)
execution_id = create_distinct_data_map(athena)
where create_distinct_data_map should be defined as
def create_distinct_data_map(athena):
    ...
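Applied to the question's own query_distinct_data function, the same fix would look roughly like this (a sketch; result_output_location is also passed in here so the module does not depend on the main page's globals):
def query_distinct_data(athena, result_output_location):
    # athena is the boto3 Athena client created on the main page
    query = "SELECT DISTINCT * from fire_data.rfs_fire_data where state in ('NSW','VIC','QLD')"
    response = athena.start_query_execution(
        QueryString=query,
        ResultConfiguration={"OutputLocation": result_output_location})
    return response["QueryExecutionId"]

# on the main page:
# execution_id = query_distinct_data(athena, result_output_location)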
The second way is to set the variable inside the imported module:
from ipynb.fs.full.map_distinct_data import create_distinct_data_map
from ipynb.fs.full import map_distinct_data
athena = boto3.client("athena",region_name=aws_region)
map_distinct_data.athena = athena
execution_id = create_distinct_data_map()
Even though the second way works, it is really bad style.
Here is some must-know information about encapsulation in Python.

Issues with Dask in Python

I have built a simple Dask application that uses multiprocessing to loop through files and create summaries. The code loops through all the zip files in the directory and builds a list of names while iterating through the files (a dummy task). I was not able to either print the name or append it to the list. I can't figure out what the issue is.
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
plt.ioff()
import time
import os
from pathlib import Path
import glob
import webbrowser
from dask.distributed import Client

client = Client(n_workers=4, threads_per_worker=2)  # In this example I have 8 cores and processes (can also use threads if desired)
webbrowser.open(client.dashboard_link)
print(client)

os.chdir("D:\spx\Complete data\item_000027392")
csv_file_list = [file for file in glob.glob("*.zip")]
total_file = len(csv_file_list)
data_date = []
columns = ['Date', 'straddle_price_open', 'straddle_price_close']
summary = pd.DataFrame(columns=columns)

def my_function(i):
    df = pd.read_csv(Path("D:\spx\Complete data\item_000027392", csv_file_list[i]), skiprows=0)
    date = csv_file_list
    data_date.append(date)
    print(date)
    return date

futures = []
for i in range(0, total_file):
    future = client.submit(my_function, i)
    futures.append(future)

results = client.gather(futures)
client.close()
The idea is that I should be able to perform operations on the data and print outputs and charts while using Dask, but for some reason I can't.
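For reference, a minimal sketch of the submit/gather pattern: each task returns its value and the results are collected from client.gather on the client side, rather than relying on a module-level list mutated inside the workers (the file names here are hypothetical):
from dask.distributed import Client

def extract_name(filename):
    # this runs in a worker process; print() goes to the worker's stdout,
    # not the notebook, so return the value instead
    return filename

if __name__ == '__main__':
    client = Client(n_workers=4, threads_per_worker=2)
    files = ['a.zip', 'b.zip', 'c.zip']        # hypothetical file list
    futures = [client.submit(extract_name, f) for f in files]
    data_date = client.gather(futures)         # results collected on the client
    print(data_date)
    client.close()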

Use Pandas dataframe in mrJob

I have a Python script and I need to use mrjob to make it faster.
How do I make the script below use mrJob?
The script below works fine for a small file, but when I run a large file it takes forever, so I am planning to use mrJob, which is a MapReduce Python package. The problem is: I don't know how to use mrJob for this script. Please advise.
import os
import pandas as pd
import pyffx
import string
import sys

column = 'first_name'
filename = "python_test.csv"
encrypted_value_list = []
alpha = string.printable
key = b'sec-key'
seperator_in = '|'
seperator_out = '|'
outputfile = 'encypted.csv'
compression_in = None
compression_out = None

df = pd.read_csv(filename, compression=compression_in, sep=seperator_in, low_memory=False, encoding='utf-8-sig')
df_null = df[df[column].isnull()]
df_notnull = df[df[column].notnull()].copy()
for index, row in df_notnull.iterrows():
    e = pyffx.String(key, alphabet=alpha, length=len(row[column]))
    encrypted_value_list.append(e.encrypt(row[column]))
df_notnull[column] = encrypted_value_list
df_merged = pd.concat([df_notnull, df_null], axis=0, ignore_index=True, sort=False)
df_merged
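For orientation, a minimal mrjob job skeleton could look roughly like the following. This is only a sketch: it processes each line of the input file in a mapper using the same pyffx encryption, and the column position, key, and separator are assumptions mirroring the script above (the header line would also need to be skipped in a real job).
from mrjob.job import MRJob
import pyffx
import string

class MREncryptColumn(MRJob):
    # hypothetical settings mirroring the script above
    KEY = b'sec-key'
    ALPHABET = string.printable
    COLUMN_INDEX = 0   # position of first_name in the line (assumption)

    def mapper(self, _, line):
        fields = line.split('|')
        value = fields[self.COLUMN_INDEX]
        if value:
            e = pyffx.String(self.KEY, alphabet=self.ALPHABET, length=len(value))
            fields[self.COLUMN_INDEX] = e.encrypt(value)
        yield None, '|'.join(fields)

if __name__ == '__main__':
    MREncryptColumn.run()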

An attempt has been made to start a new process before the current process has finished its bootstrapping phase

I am new to Dask and I found it so nice to have a module that makes it easy to get parallelization. I am working on a project where I was able to parallelize a loop on a single machine, as you can see here. However, I would like to move over to dask.distributed. I applied the following changes to the class above:
diff --git a/mlchem/fingerprints/gaussian.py b/mlchem/fingerprints/gaussian.py
index ce6a72b..89f8638 100644
--- a/mlchem/fingerprints/gaussian.py
+++ b/mlchem/fingerprints/gaussian.py
@@ -6,7 +6,7 @@ from sklearn.externals import joblib
from .cutoff import Cosine
from collections import OrderedDict
import dask
-import dask.multiprocessing
+from dask.distributed import Client
import time
@@ -141,13 +141,14 @@ class Gaussian(object):
         for image in images.items():
             computations.append(self.fingerprints_per_image(image))
+        client = Client()
         if self.scaler is None:
-            feature_space = dask.compute(*computations, scheduler='processes',
+            feature_space = dask.compute(*computations, scheduler='distributed',
                                          num_workers=self.cores)
             feature_space = OrderedDict(feature_space)
         else:
             stacked_features = dask.compute(*computations,
-                                            scheduler='processes',
+                                            scheduler='distributed',
                                             num_workers=self.cores)
             stacked_features = numpy.array(stacked_features)
Doing so generates this error:
File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/spawn.py", line 136, in _check_not_importing_main
is not going to be frozen to produce an executable.''')
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
I have tried different ways of adding if __name__ == '__main__': without any success. This can be reproduced by running this example. I would appreciate it if anyone could help me figure this out. I have no clue how I should change my code to make it work.
Thanks.
Edit: The example is cu_training.py.
The Client command starts up new processes, so it will have to be within the if __name__ == '__main__': block, as described in this SO question or this GitHub issue.
This is the same as with the multiprocessing module.
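For example, a minimal sketch of guarding the Client startup (the main() function name is just illustrative, and the actual computations are elided):
from dask.distributed import Client

def main():
    client = Client()   # starts the worker processes
    # build the delayed computations here, then e.g.:
    # feature_space = dask.compute(*computations, scheduler='distributed')
    client.close()

if __name__ == '__main__':
    main()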
I faced several issues in my code even after including if __name__ == '__main__':.
I am using several Python files and modules, and multiprocessing is used by only one function for some sampling operation. The only fix that worked for me was to put the main guard in the very first file, at the very first line of the whole code (even including the imports). The following worked well:
if __name__ == '__main__':
    from mjrl.utils.gym_env import GymEnv
    from mjrl.policies.gaussian_mlp import MLP
    from mjrl.baselines.quadratic_baseline import QuadraticBaseline
    from mjrl.baselines.mlp_baseline import MLPBaseline
    from mjrl.algos.npg_cg import NPG
    from mjrl.algos.dapg import DAPG
    from mjrl.algos.behavior_cloning import BC
    from mjrl.utils.train_agent import train_agent
    from mjrl.samplers.core import sample_paths
    import os
    import json
    import mjrl.envs
    import mj_envs
    import time as timer
    import pickle
    import argparse
    import numpy as np

    # ===============================================================================
    # Get command line arguments
    # ===============================================================================
    parser = argparse.ArgumentParser(description='Policy gradient algorithms with demonstration data.')
    parser.add_argument('--output', type=str, required=True, help='location to store results')
    parser.add_argument('--config', type=str, required=True, help='path to config file with exp params')
    args = parser.parse_args()
    JOB_DIR = args.output
    if not os.path.exists(JOB_DIR):
        os.mkdir(JOB_DIR)
    with open(args.config, 'r') as f:
        job_data = eval(f.read())
    assert 'algorithm' in job_data.keys()
    assert any([job_data['algorithm'] == a for a in ['NPG', 'BCRL', 'DAPG']])
    job_data['lam_0'] = 0.0 if 'lam_0' not in job_data.keys() else job_data['lam_0']
    job_data['lam_1'] = 0.0 if 'lam_1' not in job_data.keys() else job_data['lam_1']
    EXP_FILE = JOB_DIR + '/job_config.json'
    with open(EXP_FILE, 'w') as f:
        json.dump(job_data, f, indent=4)

    # ===============================================================================
    # Train Loop
    # ===============================================================================
    e = GymEnv(job_data['env'])
    policy = MLP(e.spec, hidden_sizes=job_data['policy_size'], seed=job_data['seed'])
    baseline = MLPBaseline(e.spec, reg_coef=1e-3, batch_size=job_data['vf_batch_size'],
                           epochs=job_data['vf_epochs'], learn_rate=job_data['vf_learn_rate'])

ImportError: cannot import X from a file that has if __name__ == "__main__". Any solutions without deleting if __name__ == "__main__"?

I have 2 files, n1711_001_insilico and n1711_002_insilico, in the same folder. I want two variables (mzdf, which is <class 'pandas.core.frame.DataFrame'>, and charge, which is an int) from the first file, so I do the import at the top of the second file:
import numpy as np
import pandas as pd
from n1711_001_insilico import mzdf, charge
I got ImportError: cannot import name mzdf (as well as for charge). In the first file, I explicitly return both mzdf and charge from the function and call them like this:
if __name__ == "__main__":
    mzdf, charge = CALC(peptides_report, aa_dict, charge_from=1, charge_to=6)
Update: from the comment I now know the issue comes from if __name__ == "__main__": in the first file. Is there any way I can fix this without deleting if __name__ == "__main__"?
In the file you are trying to import, you have this statement:
if __name__ == "__main__":
    mzdf, charge = CALC(peptides_report, aa_dict, charge_from=1, charge_to=6)
This means that CALC only actually runs, and sets the two variables mzdf and charge, when the file is executed directly from the command line, for example with python n1711_001_insilico.py.
In other words, these two variables only exist when you run the file directly with python n1711_001_insilico.py; when you import it, Python will not run that function.
This is by design; once you import the file, the variable __name__ holds the module's name rather than "__main__", so the condition fails.
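A quick way to see this is a hypothetical one-line check placed in n1711_001_insilico.py:
print(__name__)   # prints "__main__" when the file is run directly,
                  # prints "n1711_001_insilico" when it is imported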
Now, to solve this problem you will have to run the CALC function when you import the file, and get your own copy of the result:
import numpy as np
import pandas as pd
from n1711_001_insilico import peptides_report, aa_dict, CALC
mzdf, charge = CALC(peptides_report, aa_dict, charge_from=1, charge_to=6)
