I wrote a script to remove the background from 8,000 images, but it takes approximately 8 hours to finish.
How can I reduce its running time, given that I will have to work on larger datasets in the future?
Or do I have to write entirely new code? If so, please suggest some sample code.
from rembg import remove
import cv2
import glob
for img in glob.glob('../images/*.jpg'):
    a = img.split('../images/')
    a1 = a[1].split('.jpg')
    try:
        cv_img = cv2.imread(img)
        output = remove(cv_img)
    except:
        continue
    cv2.imwrite('../output image/' + str(a1[0]) + '.png', output)
One simple approach would be to divide the work across multiple threads. See ThreadPoolExecutor for more.
You can play around with max_workers= to see what gets the best results. Note that max_workers can be any positive integer; if you omit it, Python 3.8 and later default to min(32, os.cpu_count() + 4).
This sample code is ready to run. It assumes the .png image files are in the same directory as your main.py and that the output_image directory exists.
import cv2
import rembg
import sys
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor
out_dir = Path("output_image")
in_dir = Path(".")
def is_image(absolute_path: Path):
    # is_file must be called; the bare method object is always truthy
    return absolute_path.is_file() and str(absolute_path).endswith('.png')
input_filenames = [p for p in in_dir.iterdir() if is_image(p)]
def process_image(in_path):
    # Returns the output path on success, or None if the image failed
    try:
        image = cv2.imread(str(in_path))
        if image is None or not image.data:
            raise cv2.error("read failed")
        output = rembg.remove(image)
        out_path = out_dir / in_path.with_suffix(".png").name
        cv2.imwrite(str(out_path), output)
        return out_path
    except Exception as e:
        print(f"{in_path}: {e}", file=sys.stderr)
        return None
# The with block waits for all tasks and shuts the pool down cleanly
with ThreadPoolExecutor(max_workers=4) as executor:
    for result in executor.map(process_image, input_filenames):
        print(f"Processed image: {result}")
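To pick max_workers empirically, one option is to time a small sample of the workload at several pool sizes before committing to the full run. The sketch below is an illustration only: time_worker_counts and fake_remove are hypothetical names, and fake_remove just sleeps as a stand-in for the real per-image function (process_image above).

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_remove(item):
    # Hypothetical stand-in for the real per-image work; swap in process_image
    time.sleep(0.01)
    return item

def time_worker_counts(func, items, worker_counts=(1, 2, 4, 8)):
    """Return {max_workers: elapsed seconds} for running func over items."""
    items = list(items)
    timings = {}
    for workers in worker_counts:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as executor:
            # list() forces every task to finish before the clock stops
            list(executor.map(func, items))
        timings[workers] = time.perf_counter() - start
    return timings

timings = time_worker_counts(fake_remove, range(16))
for workers, seconds in sorted(timings.items()):
    print(f"max_workers={workers}: {seconds:.3f}s")
```

If the timings stop improving as max_workers grows, the work is probably limited by the CPU or GPU rather than by waiting on I/O, and more threads are unlikely to help.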
Check out the U^2-Net repository. As in u2net_test.py, writing your own remove function and using a DataLoader can speed up the process. If alpha matting is not necessary, skip it; otherwise you can add the alpha matting code from rembg.
def main():
    # --------- 1. get image path and name ---------
    model_name='u2net'#u2netp
    image_dir = os.path.join(os.getcwd(), 'test_data', 'test_images')
    prediction_dir = os.path.join(os.getcwd(), 'test_data', model_name + '_results' + os.sep)
    model_dir = os.path.join(os.getcwd(), 'saved_models', model_name, model_name + '.pth')
    img_name_list = glob.glob(image_dir + os.sep + '*')
    print(img_name_list)
    #1. dataloader
    test_salobj_dataset = SalObjDataset(img_name_list = img_name_list,
                                        lbl_name_list = [],
                                        transform=transforms.Compose([RescaleT(320),
                                                                      ToTensorLab(flag=0)])
                                        )
    test_salobj_dataloader = DataLoader(test_salobj_dataset,
                                        batch_size=1,
                                        shuffle=False,
                                        num_workers=1)
    # (model loading omitted in this excerpt; u2net_test.py builds net and loads its weights from model_dir here)
    for i_test, data_test in enumerate(test_salobj_dataloader):
        print("inferencing:",img_name_list[i_test].split(os.sep)[-1])
        inputs_test = data_test['image']
        inputs_test = inputs_test.type(torch.FloatTensor)
        if torch.cuda.is_available():
            inputs_test = Variable(inputs_test.cuda())
        else:
            inputs_test = Variable(inputs_test)
        d1,d2,d3,d4,d5,d6,d7= net(inputs_test)
        # normalization
        pred = d1[:,0,:,:]
        pred = normPRED(pred)
        # save results to test_results folder
        if not os.path.exists(prediction_dir):
            os.makedirs(prediction_dir, exist_ok=True)
        save_output(img_name_list[i_test],pred,prediction_dir)
        del d1,d2,d3,d4,d5,d6,d7
Try parallelizing with multiprocessing, as Mark Setchell mentioned in his comment. I rewrote your code according to Method 8 from here. Multiprocessing should reduce your execution time. I did not test the code, so check whether it works.
import glob
from multiprocessing import Pool
import cv2
from rembg import remove
def remove_background(filename):
    a = filename.split("../images/")
    a1 = a[1].split(".jpg")
    try:
        cv_img = cv2.imread(filename)
        output = remove(cv_img)
    except Exception:
        # 'continue' is only valid inside a loop; skip this file by returning
        return
    cv2.imwrite("../output image/" + str(a1[0]) + ".png", output)
if __name__ == "__main__":
    # The guard keeps worker processes from re-running the pool setup on import
    files = glob.glob("../images/*.jpg")
    with Pool(8) as pool:
        results = pool.map(remove_background, files)
Ah, you used the example from https://github.com/danielgatis/rembg#usage-as-a-library as a template for your code. Maybe try the other example there, which uses a PIL image instead of OpenCV. The latter is usually slower, but who knows. Try it with maybe 10 images and compare the execution times.
Here is your code using PIL instead of OpenCV. Not tested.
import glob
from PIL import Image
from rembg import remove
for img in glob.glob("../images/*.jpg"):
    a = img.split("../images/")
    a1 = a[1].split(".jpg")
    try:
        cv_img = Image.open(img)
        output = remove(cv_img)
    except:
        continue
    output.save("../output image/" + str(a1[0]) + ".png")
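The 10-image comparison suggested above can be sketched with a small timing helper. This is only an illustration: time_on_sample and compare are hypothetical names, and each candidate is assumed to be a function that takes one filename and processes it (for example, the body of the OpenCV loop and the body of the PIL loop wrapped in functions).

```python
import time

def time_on_sample(func, filenames, sample_size=10):
    """Run func on the first sample_size filenames and return elapsed seconds."""
    sample = list(filenames)[:sample_size]
    start = time.perf_counter()
    for name in sample:
        func(name)
    return time.perf_counter() - start

def compare(candidates, filenames, sample_size=10):
    """Return {label: seconds} for each candidate function on the same sample."""
    return {label: time_on_sample(func, filenames, sample_size)
            for label, func in candidates.items()}
```

A call such as compare({"opencv": remove_with_cv2, "pil": remove_with_pil}, glob.glob("../images/*.jpg")) would then give one number per variant, where remove_with_cv2 and remove_with_pil are placeholder names for your own wrappers. The first call in a process may include one-time model loading, so it is worth running the comparison twice or discarding the first measurement.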
I have trained a word2vec model with the h2o Python package.
Is there a simple way for me to save that word2vec model and load it back later for use?
I have tried the h2o.save_model() and h2o.load_model() functions with no luck.
That approach gives me an error like:
ERROR: Unexpected HTTP Status code: 412 Precondition Failed (url = http://localhost:54321/99/Models.bin/)
water.exceptions.H2OIllegalArgumentException
[1] "water.exceptions.H2OIllegalArgumentException: Illegal argument: dir of function: importModel:
I am using the same version of h2o to train the model and to load it back in, so the issue outlined in this question is not applicable: Can't import binay h2o model with h2o.loadModel() function: 412 Precondition Failed
Does anyone have any insight into how to save and load an h2o word2vec model?
My sample code, with a few of the important snippets:
import h2o
from h2o.estimators import H2OWord2vecEstimator
df['text'] = df['text'].ascharacter()
# Break text into sequence of words
words = tokenize(df["text"])
# Initializing h2o
print('Initializing h2o.')
h2o.init(ip=h2o_ip, port=h2o_port, min_mem_size=h2o_min_memory)
# Build word2vec model:
w2v_model = H2OWord2vecEstimator(sent_sample_rate = 0.0, epochs = 10)
w2v_model.train(training_frame=words)
# Calculate a vector for each row
word_vecs = w2v_model.transform(words, aggregate_method = "AVERAGE")
#Save model to path
wv_path = '/models/wordvec/'
model_path = h2o.save_model(model = w2v_model, path= wv_path ,force=True)
# Load model in later script
w2v_model = h2o.load_model(model_path)
It sounds like you might have an access issue with the directory you are trying to read from. I just tested on H2O 3.30.0.1, following the w2v example from the docs, and it ran fine:
job_titles = h2o.import_file(("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv"),
col_names = ["category", "jobtitle"],
col_types = ["string", "string"],
header = 1)
STOP_WORDS = ["ax","i","you","edu","s","t","m","subject","can",
"lines","re","what","there","all","we","one","the",
"a","an","of","or","in","for","by","on","but","is",
"in","a","not","with","as","was","if","they","are",
"this","and","it","have","from","at","my","be","by",
"not","that","to","from","com","org","like","likes",
"so"]
# Make the 'tokenize' function:
def tokenize(sentences, stop_word = STOP_WORDS):
    tokenized = sentences.tokenize("\\W+")
    tokenized_lower = tokenized.tolower()
    tokenized_filtered = tokenized_lower[(tokenized_lower.nchar() >= 2) | (tokenized_lower.isna()),:]
    tokenized_words = tokenized_filtered[tokenized_filtered.grep("[0-9]",invert=True,output_logical=True),:]
    tokenized_words = tokenized_words[(tokenized_words.isna()) | (~ tokenized_words.isin(STOP_WORDS)),:]
    return tokenized_words
# Break job titles into a sequence of words:
words = tokenize(job_titles["jobtitle"])
# Build word2vec model:
w2v_model = H2OWord2vecEstimator(sent_sample_rate = 0.0, epochs = 10)
w2v_model.train(training_frame=words)
#Save model
wv_path = 'models/'
model_path = h2o.save_model(model = w2v_model, path= wv_path ,force=True)
#Load Model
w2v_model2 = h2o.load_model(model_path)
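If the 412 error does come from a directory problem, as suggested above, a quick check before calling h2o.save_model() can help confirm it. The sketch below uses only the standard library; check_model_dir is a hypothetical helper, and it assumes the H2O server runs on the same machine and under the same user as the Python client. When the client connects to a remote cluster (as with h2o.init(ip=..., port=...)), the path has to be valid on the server side, which this check cannot see.

```python
import os

def check_model_dir(path):
    """Return a list of problems that could prevent writing models to path."""
    problems = []
    abs_path = os.path.abspath(path)
    if not os.path.isdir(abs_path):
        # Only a hint: this check does not try to create the directory
        problems.append(f"{abs_path} does not exist or is not a directory")
    elif not os.access(abs_path, os.W_OK | os.X_OK):
        problems.append(f"{abs_path} is not writable by the current user")
    return problems
```

An empty list means the directory looks usable from the client's point of view. Note that the question uses the absolute path '/models/wordvec/' while the working example uses the relative path 'models/', so the two resolve to different locations.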
I am trying to construct a pipeline in Microsoft Azure that (for now) takes a simple Python script as its input.
The problem is that I cannot find my output.
In my Notebooks section I have created the following two notebooks:
1) script called "test.ipynb"
# azureml-core of version 1.0.72 or higher is required
from azureml.core import Workspace, Dataset, Datastore
import pandas as pd
import numpy as np
import datetime
import math
#Upload datasets
subscription_id = 'myid'
resource_group = 'myrg'
workspace_name = 'mywn'
workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset_zre = Dataset.get_by_name(workspace, name='file1')
dataset_SLA = Dataset.get_by_name(workspace, name='file2')
df_zre = dataset_zre.to_pandas_dataframe()
df_SLA = dataset_SLA.to_pandas_dataframe()
result = pd.concat([df_SLA,df_zre], sort=True)
result.to_csv(path_or_buf="/mnt/azmnt/code/Users/aniello.spiezia/outputs/output.csv",index=False)
def_data_store = workspace.get_default_datastore()
def_data_store.upload(src_dir = '/mnt/azmnt/code/Users/aniello.spiezia/outputs', target_path = '/mnt/azmnt/code/Users/aniello.spiezia/outputs', overwrite = True)
print("\nFinished!")
#End of the file
2) pipeline code called "pipeline.ipynb"
import os
import pandas as pd
import json
import azureml.core
from azureml.core import Workspace, Run, Experiment, Datastore
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.runconfig import CondaDependencies, RunConfiguration
from azureml.core.runconfig import DEFAULT_CPU_IMAGE
from azureml.telemetry import set_diagnostics_collection
from azureml.pipeline.steps import PythonScriptStep
from azureml.pipeline.core import Pipeline, PipelineData, StepSequence
print("SDK Version:", azureml.core.VERSION)
###############################
ws = Workspace.from_config()
print('Workspace name: ' + ws.name,
'Subscription id: ' + ws.subscription_id,
'Resource group: ' + ws.resource_group, sep = '\n')
experiment_name = 'aml-pipeline-cicd' # choose a name for experiment
project_folder = '.' # project folder
experiment = Experiment(ws, experiment_name)
print("Location:", ws.location)
set_diagnostics_collection(send_diagnostics=True)
###############################
cd = CondaDependencies.create(pip_packages=["azureml-sdk==1.0.17", "azureml-train-automl==1.0.17", "pyculiarity", "pytictoc", "cryptography==2.5", "pandas"])
amlcompute_run_config = RunConfiguration(framework = "python", conda_dependencies = cd)
amlcompute_run_config.environment.docker.enabled = False
amlcompute_run_config.environment.docker.base_image = DEFAULT_CPU_IMAGE
amlcompute_run_config.environment.spark.precache_packages = False
###############################
aml_compute_target = "aml-compute"
try:
    aml_compute = AmlCompute(ws, aml_compute_target)
    print("found existing compute target.")
except:
    print("creating new compute target")
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = "STANDARD_D2_V2",
                                                                idle_seconds_before_scaledown=1800,
                                                                min_nodes = 0,
                                                                max_nodes = 4)
    aml_compute = ComputeTarget.create(ws, aml_compute_target, provisioning_config)
    aml_compute.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
print("Azure Machine Learning Compute attached")
###############################
def_data_store = ws.get_default_datastore()
def_blob_store = Datastore(ws, "workspaceblobstore")
print("Blobstore's name: {}".format(def_blob_store.name))
# Naming the intermediate data as anomaly data and assigning it to a variable
output_data = PipelineData("output_data", datastore = def_blob_store)
print("output_data object created")
step = PythonScriptStep(name = "test",
script_name = "test.ipynb",
compute_target = aml_compute,
source_directory = project_folder,
allow_reuse = True,
runconfig = amlcompute_run_config)
print("Step created.")
###############################
steps = [step]
print("Step lists created")
pipeline = Pipeline(workspace = ws, steps = steps)
print ("Pipeline is built")
pipeline.validate()
print("Pipeline validation complete")
pipeline_run = experiment.submit(pipeline)
print("Pipeline is submitted for execution")
pipeline_run.wait_for_completion(show_output = False)
print("Pipeline run completed")
###############################
def_data_store.download(target_path = '.',
prefix = 'outputs',
show_progress = True,
overwrite = True)
model_fname = 'output.csv'
model_path = os.path.join("outputs", model_fname)
pipeline_run.upload_file(name = model_path, path_or_stream = model_path)
print('Uploaded the model {} to experiment {}'.format(model_fname, pipeline_run.experiment.name))
And this gives me the following error:
Pipeline run completed
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-22-a8a523969bb3> in <module>
111
112 # Upload the model file explicitly into artifacts (for CI/CD)
--> 113 pipeline_run.upload_file(name = model_path, path_or_stream = model_path)
114 print('Uploaded the model {} to experiment {}'.format(model_fname, pipeline_run.experiment.name))
115
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/azureml/core/run.py in wrapped(self, *args, **kwargs)
47 "therefore, the {} cannot upload files, or log file backed metrics.".format(
48 self, self.__class__.__name__))
---> 49 return func(self, *args, **kwargs)
50 return wrapped
51
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/azureml/core/run.py in upload_file(self, name, path_or_stream)
1749 :rtype: azure.storage.blob.models.ResourceProperties
1750 """
-> 1751 return self._client.artifacts.upload_artifact(path_or_stream, RUN_ORIGIN, self._container, name)
1752
1753 #_check_for_data_container_id
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/azureml/_restclient/artifacts_client.py in upload_artifact(self, artifact, *args, **kwargs)
108 if isinstance(artifact, str):
109 self._logger.debug("Uploading path artifact")
--> 110 return self.upload_artifact_from_path(artifact, *args, **kwargs)
111 elif isinstance(artifact, IOBase):
112 self._logger.debug("Uploading io artifact")
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/azureml/_restclient/artifacts_client.py in upload_artifact_from_path(self, path, *args, **kwargs)
100 path = os.path.normpath(path)
101 path = os.path.abspath(path)
--> 102 with open(path, "rb") as stream:
103 return self.upload_artifact_from_stream(stream, *args, **kwargs)
104
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/azmnt/code/Users/aniello.spiezia/outputs/output.csv'
Do you know what the problem could be?
In particular, I am interested in saving the output file called "output.csv" somewhere.
The best way for you to do this depends a bit on how you want to process the output.csv file after the run has completed. In general, though, you can just write your csv to the ./outputs folder:
# azureml-core of version 1.0.72 or higher is required
from azureml.core import Workspace, Dataset, Datastore
import pandas as pd
import numpy as np
import datetime
import math
#Upload datasets
subscription_id = 'myid'
resource_group = 'myrg'
workspace_name = 'mywn'
workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset_zre = Dataset.get_by_name(workspace, name='file1')
dataset_SLA = Dataset.get_by_name(workspace, name='file2')
df_zre = dataset_zre.to_pandas_dataframe()
df_SLA = dataset_SLA.to_pandas_dataframe()
result = pd.concat([df_SLA,df_zre], sort=True)
import os  # needed for makedirs; normally placed with the other imports
os.makedirs('outputs', exist_ok=True)
result.to_csv('outputs/output.csv', index=False)
print("\nFinished!")
#End of the file
After the run has completed, AzureML will upload the contents of the outputs directory to the run history, so there is no need to call datastore.upload().
Afterwards, you can see the file in http://ml.azure.com when you navigate to the run like my model.pt file below:
See here for some information on the ./outputs and ./logs folders: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-save-write-experiment-files#where-to-write-files
If you actually want to create another DataSet as a result of your Run, please see this post here: Azure Machine Learning Service - dataset API question
In Daniel's example above, you would need to download the output from the run rather than the datastore in your pipeline.ipynb code. Instead of calling def_data_store.download(), you would call pipeline_run.download('outputs/output.csv', '.').
Another option is to output your data using PipelineData. PipelineData represents a named piece of output of a pipeline step, and is useful if you want to connect multiple steps together with inputs and outputs. With PipelineData, you would need to pass the PipelineData object into PythonScriptStep when you declare your step (as part of arguments=[] and outputs=[]), and then have your script read the output path from the command-line arguments.
This notebook has examples of using PipelineData within a pipeline and downloading the outputs: https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-with-data-dependency-steps.ipynb
And this blog post has details about how to handle this within your script (parsing the command-line arguments, creating the output directory, and writing the output file): https://blog.x5ff.xyz/blog/ai-azureml-python-data-pipelines/
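The script side of the PipelineData approach described above can be sketched as follows. This is an illustration, not the code from the linked notebook or blog post: the --output_dir argument name and the parse_args and write_output helpers are assumptions, and the step would need to pass the PipelineData object under the same argument name in arguments=[].

```python
import argparse
import csv
import os

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hypothetical argument name; it must match what the step passes in arguments=[]
    parser.add_argument("--output_dir", required=True)
    return parser.parse_args(argv)

def write_output(output_dir, rows, filename="output.csv"):
    """Create output_dir if needed and write rows (lists of values) as CSV."""
    os.makedirs(output_dir, exist_ok=True)
    out_path = os.path.join(output_dir, filename)
    with open(out_path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return out_path
```

Inside the step script you would call args = parse_args() and then write to args.output_dir. With a pandas DataFrame, result.to_csv(os.path.join(args.output_dir, "output.csv"), index=False) after the os.makedirs call does the same job as write_output.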
I have trained multiple tensorflow models for the same set of data, each with slightly different configuration.
Now I want to run prediction for the given input file utilizing each tensorflow model and store the prediction in a csv.
It seems I am unable to get TensorFlow to completely unload/reset before loading a new model.
Here is my code. It works fine for the first model, then it generates an error. I have tried changing the sequence of the models; it always runs the first model without any issue, no matter which model comes first.
import tensorflow as tf
import os
import numpy as np
predictionoutputfile = 'data\\prediction.csv'
predictioninputfile = 'data\\today.csv'
modelslist = 'data\\models.csv'
def predict(dirname,testfield,testper,threshold,prediction_OutFile):
    with tf.Session() as sess:
        print(dirname)
        exported_path = 'imp\\exported\\' + dirname
        tf.saved_model.loader.load(sess, [tf.saved_model.tag_constants.SERVING], exported_path)
        # get the predictor , refer tf.contrib.predictor
        predictor = tf.contrib.predictor.from_saved_model(exported_path)
        with open(predictioninputfile) as inf:
            # Skip header
            #next(inf)
            for line in inf:
                # Read data, using python, into our features
                var1,var2,var3,var4,var5 = line.strip().split(",")
                # Create a feature_dict for train.example - Get Feature Columns using
                feature_dict = {
                    'var1': _bytes_feature(value=var1.encode()),
                    'var2': _bytes_feature(value=var2.encode()),
                    'var3': _bytes_feature(value=var3.encode()),
                    'var4':_float_feature(value=int(var4)),
                    'var5':_float_feature(value=int(var5)),
                }
                # Prepare model input
                model_input = tf.train.Example(features=tf.train.Features(feature=feature_dict))
                model_input = model_input.SerializeToString()
                output_dict = predictor({"inputs": [model_input]})
                # Positive label = 1
                if float(output_dict['scores'][0][1])>=float(threshold) :
                    prediction_OutFile.write(str(var1)+ "," + str(var2)+ "," + str(var3)+ "," + str(var4)+ "," + str(var5)+ ",")
                    label_index = tf.argmax(output_dict['scores'])
                    prediction_OutFile.write(str(output_dict['classes'][0][1]))
                    prediction_OutFile.write(',')
                    prediction_OutFile.write(str(output_dict['scores'][0][1]))
                    prediction_OutFile.write('\n')
def main():
    prediction_OutFile = open(predictionoutputfile, 'w')
    prediction_OutFile.write("model,SYMBOL,RECORDDATE,TESTFIELD,TESTPER,prediction,probability")
    prediction_OutFile.write('\n')
    with open(modelslist) as modlist:
        #Skip header
        next(modlist)
        for mline in modlist:
            try:
                dirname = ''
                modelname,datafield,dataper,testfield,testper,threshold,truepositive,falsepositive,truenegative,falsenegative,correct,incorrect,accuracy,precision = mline.strip().split(",")
                # load the current model
                predict(modelname,testfield,testper,threshold,prediction_OutFile)
                # Read file and create feature_dict for each record
            except:
                print('error' + modelname)
    prediction_OutFile.close()
def _float_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))
def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
if __name__ == "__main__":
    main()
You can: just use tf.reset_default_graph between the two sessions.
# some stuff
with tf.Session() as sess:
    pass  # more stuff
tf.reset_default_graph()
# some other stuff (again)
with tf.Session() as sess:
    pass  # more other stuff
The elephant in the room: why not call the Python script multiple times, using flags to select the model?
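Running each model in its own Python process sidesteps the reset problem, because every process starts with an empty default graph. A minimal sketch, assuming a hypothetical predict_one.py script that accepts a --model flag and handles a single model (neither the script name nor the flag comes from the question):

```python
import subprocess
import sys

def run_per_model(model_names, script="predict_one.py"):
    """Run `script --model NAME` in a fresh interpreter for each model.

    Returns {model name: process exit code}.
    """
    exit_codes = {}
    for name in model_names:
        # script and --model are hypothetical; adapt them to your own entry point
        completed = subprocess.run([sys.executable, script, "--model", name])
        exit_codes[name] = completed.returncode
    return exit_codes
```

The trade-off is that each process pays the TensorFlow start-up cost again, which is usually acceptable for a handful of models. Each process would also need to write to its own output file, or append to a shared one, rather than all opening prediction.csv in 'w' mode.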
I have used tf to make two separate models. During training I saved each one alone. Now I want to use them both. I can use the first one but when I try to load the second one I get this message (in part):
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 850M, pci bus id: 0000:0a:00.0)
W tensorflow/core/framework/op_kernel.cc:968] Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [5,5,32,64] rhs shape= [1024,2]
[[Node: save_1/Assign_16 = Assign[T=DT_FLOAT, _class=["loc:#Variable_6"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](Variable_6, save_1/restore_slice_16/_47)]]
There was also a message indicating that the error took place in the 'restore' part of the code. Here's a snippet of that code:
def save(self):
    filename = self.save_name
    folder = self.ckpt_folder + os.sep + "ckpt"
    if not os.path.exists(folder) :
        os.makedirs(folder)
    saver = tf.train.Saver()
    save_path = saver.save(self.sess, folder + os.sep + self.ckpt_name + "."+ filename)
    print ("saved?", filename)
def load(self):
    filename = self.save_name
    file = self.ckpt_folder + os.sep + "ckpt" + os.sep + self.ckpt_name +"."+ filename
    if os.path.isfile(file) :
        saver = tf.train.Saver()
        saver.restore(self.sess, file)
    print ("load?", filename)
The functions above, and specifically load(), are called by the model after the session object is initialized. How can I run both tf models together from the data I have already saved?
Depending on what you want to do with them, you should either:
Use name scoping to make them unique
Load them into separate graphs