SageMaker Batch Transform: Could not convert string to float '*' - python
Hello and thank you for reading. To put it simply, I want to perform Batch Transform on my XGBoost model that I made using SageMaker Experiments. I trained my model on csv data stored in S3, deployed an endpoint for my model, successfully hit said endpoint with single csv lines and got back expected inferences.
(I followed this tutorial to the letter before starting to work on Batch Transformation)
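For context, single CSV rows can be sent to the deployed endpoint roughly like this (a simplified sketch with SDK v2; the endpoint name is a placeholder, not the one from the tutorial):
from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer

# Placeholder endpoint name; use the name of the endpoint deployed above.
predictor = Predictor(endpoint_name="my-xgboost-endpoint", serializer=CSVSerializer())

# One CSV row from the test data, sent as text/csv.
row = "33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown"
print(predictor.predict(row))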
Now I am attempting to run Batch Transformation using the model created from the above tutorial and I'm running into an error (skip to the bottom to see my error logs). Before I get straight to the error, I want to show my batch transform code.
(imports are done from SageMaker SDK v2.24.4)
import sagemaker
import boto3
from sagemaker import get_execution_role
from sagemaker.model import Model
region = boto3.Session().region_name
role = get_execution_role()
image = sagemaker.image_uris.retrieve('xgboost', region, '1.2-1')
model_location = '{mys3info}/output/model.tar.gz'
model = Model(image_uri=image,
              model_data=model_location,
              role=role,
              )
transformer = model.transformer(instance_count=1,
                                instance_type='ml.m5.xlarge',
                                strategy='MultiRecord',
                                assemble_with='Line',
                                output_path='myOutputPath',
                                accept='text/csv',
                                max_concurrent_transforms=1,
                                max_payload=20)
transformer.transform(data='s3://test-s3-prefix/short_test_data.csv',
                      content_type='text/csv',
                      split_type='Line',
                      join_source='Input'
                      )
transformer.wait()
short_test_data.csv
33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown
47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown
33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown
35,management,married,tertiary,no,231,yes,no,unknown,5,may,139,1,-1,0,unknown
57,blue-collar,married,primary,no,52,yes,no,unknown,5,may,38,1,-1,0,unknown
32,blue-collar,single,primary,no,23,yes,yes,unknown,5,may,160,1,-1,0,unknown
53,technician,married,secondary,no,-3,no,no,unknown,5,may,1666,1,-1,0,unknown
29,management,single,tertiary,no,0,yes,no,unknown,5,may,363,1,-1,0,unknown
32,management,married,tertiary,no,0,yes,no,unknown,5,may,179,1,-1,0,unknown
38,management,single,tertiary,no,424,yes,no,unknown,5,may,104,1,-1,0,unknown
I made the above csv test data from my original dataset on the command line by running:
head original_training_data.csv > short_test_data.csv
and then I uploaded it to my S3 bucket manually.
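(For completeness, an equivalent scripted upload with boto3 might look like the sketch below; the bucket name is taken from the transform call above and the key is the file name itself.)
import boto3

# Upload the local CSV to the same location referenced by transformer.transform()
s3 = boto3.client("s3")
s3.upload_file("short_test_data.csv", "test-s3-prefix", "short_test_data.csv")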
Logs
[sagemaker logs]: MaxConcurrentTransforms=1, MaxPayloadInMB=20, BatchStrategy=MULTI_RECORD
[sagemaker logs]: */short_test_data.csv: ClientError: 415
[sagemaker logs]: */short_test_data.csv:
[sagemaker logs]: */short_test_data.csv: Message:
[sagemaker logs]: */short_test_data.csv: Loading csv data failed with Exception, please ensure data is in csv format:
[sagemaker logs]: */short_test_data.csv: <class 'ValueError'>
[sagemaker logs]: */short_test_data.csv: could not convert string to float: 'entrepreneur'
I understand the concept of one-hot encoding and other methods for converting strings to numbers for usage by an algorithm like XGBoost. My problem here is that I was easily able to input the exact same format of data into a deployed endpoint and get results back without doing that level of encoding. I am clearly missing something though, so any help is greatly appreciated!
Your Batch Transform code looks good and does not raise any alarms, but judging by the error message it is clearly an input format error. As silly as it may sound, I'd advise you to use pandas to save off the test data from the validation set to ensure the formatting is appropriate.
You could do something like this -
import pandas as pd

data = pd.read_csv("file")
# specify the columns to keep from the extracted df (placeholder names)
data = data[["choose", "columns"]]
# save the data to csv without the index
data.to_csv("data.csv", sep=',', index=False)
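If the model was trained on the processed, all-numeric dataset (with the raw string columns one-hot encoded), the batch input needs to be built from that same processed data rather than from the raw original_training_data.csv. A minimal sketch of what that could look like, assuming a processed file named validation.csv whose first column is the label (the layout the built-in XGBoost algorithm uses for training); the file name is hypothetical:
import pandas as pd

# Hypothetical file name; use your own processed validation/test split.
val = pd.read_csv("validation.csv", header=None)

# Drop the label column (assumed to be column 0) so the batch input
# contains only the numeric feature columns the model was trained on.
features = val.drop(columns=[0])

# Write without header or index, matching the CSV layout the
# XGBoost container expects for Batch Transform input.
features.to_csv("batch_input.csv", header=False, index=False)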
Related
AWS Sagemaker batch transform job with large parquet file and split_type
I'm trying to run a SageMaker batch transform job on a large parquet file (2 GB) and I keep having issues with it. In my transformer, I have had to specify split_type='Line' so that I don't get the following error, even when using max_payload=100:
Too much data for max payload size
Instead of the above error I get another error when pd.read_parquet(data) is called:
sagemaker_containers._errors.ClientError: Could not open parquet input source '<Buffer>': Invalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
I also tried using max_payload=0 rather than split_type='Line', however that consumes too much memory when doing the transformation. So I do want the benefits of splitting the data for parquet files. Here is my code:
transformer = Transformer(
    model_name=model_name,
    instance_type='ml.m5.4xlarge',
    instance_count=1,
    output_path=output_path,
    accept='application/x-parquet',
    strategy='MultiRecord',
    max_payload=100,
)
transformer.transform(
    data=data,
    content_type='application/x-parquet',
    split_type='Line',
)
And in the model:
def input_fn(input_data, content_type):
    if content_type == 'application/x-parquet':
        data = BytesIO(input_data)
        df = pd.read_parquet(data)
        return df
    else:
        raise ValueError("{} not supported by script!".format(content_type))

def output_fn(prediction, accept):
    if accept == "application/x-parquet":
        buffer = BytesIO()
        output.to_parquet(buffer)
        return buffer.getvalue()
    else:
        raise Exception("Requested unsupported ContentType in Accept: " + accept)
I have looked into the answers here, and none of them fix the problem I'm having. Is there any way that I can use split_type with the parquet file and not run into this error?
Write data from broadcast variable (databricks) to azure blob
I have a URL from which I download the data (which is in JSON format) using Databricks:
url = "https://tortuga-prod-eu.s3-eu-west-1.amazonaws.com/%2FNinetyDays/amzf277698d77514b44"
testfile = urllib.request.URLopener()
testfile.retrieve(url, "file.gz")
with gzip.GzipFile("file.gz", 'r') as fin:
    json_bytes = fin.read()
json_str = json_bytes.decode('utf-8')
data = json.loads(json_str)
Now I want to save this data in an Azure container as a blob .json file. I have tried saving the data in a dataframe and writing the df to the mounted location, but the data is huge (in GBs) and I get a spark.rpc.message.maxSize (268435456 bytes) error. I have tried saving the data in a broadcast variable (it saves successfully), but I am not sure how to write data from the broadcast variable to the mounted location. Here is how I save the data in a broadcast variable:
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
broadcastStates = spark.sparkContext.broadcast(data)
print(broadcastStates.value)
My question is: is there any way I can write data from the broadcast variable to the Azure mounted location? If not, please guide me on the right/best way to get this job done.
It is not possible to write the broadcast variable into mounted Azure blob storage directly. However, there is a way you can write the value of a broadcast variable into a file. pyspark.broadcast.Broadcast provides two methods, dump() and load_from_path(), which you can use to write and read the value of a broadcast variable. Since you have created a broadcast variable using:
broadcastStates = spark.sparkContext.broadcast(data)
use the following syntax to write its value to a file:
<broadcast_variable>.dump(<broadcast_variable>.value, f)
Note: f must be an open file object with a write() method; the value is pickled into it, so open the file in binary write mode. To read this data back from the file, you can use load_from_path() as shown below:
<broadcast_variable>.load_from_path(filename)
There might also be a way to avoid the spark.rpc.message.maxSize (268435456 bytes) error. The default value of spark.rpc.message.maxSize is 128. Refer to the following document to learn more about this setting: https://spark.apache.org/docs/latest/configuration.html#networking
While creating a cluster in Databricks, you can configure and increase this value to avoid the error. The steps to configure the cluster are:
⦁ While creating the cluster, choose Advanced Options (at the bottom).
⦁ Under the Spark tab, add the configuration key and its value, then click Create Cluster.
This might let you write the dataframe directly to mounted blob storage without using broadcast variables. You can also try increasing the number of partitions to save the dataframe into multiple smaller files and avoid the maxSize error. Refer to the following document about Spark configuration and partitioning: https://kb.databricks.com/execution/spark-serialized-task-is-too-large.html
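A minimal sketch of the dump() / load_from_path() round trip described above; the mount path and file name are hypothetical:
# Hypothetical path on the mounted Azure container; adjust to your own mount.
path = "/dbfs/mnt/mycontainer/broadcast_data.pkl"

# dump() pickles the broadcast value into an open binary file object.
with open(path, "wb") as f:
    broadcastStates.dump(broadcastStates.value, f)

# load_from_path() reads the pickled value back from the given path.
restored = broadcastStates.load_from_path(path)
print(type(restored))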
Meeting ValueError while using flopy to load a MODFLOW-USG model
I am using FloPy to load an existing MODFLOW-USG model:
load_model = flopy.modflow.Modflow.load('HTHModel', model_ws='model_ws', version='mfusg',
                                        exe_name='exe_name', verbose=True, check=False)
In the process of loading the LPF package, Python shows that hk and hani have been successfully loaded, and then the following error is reported:
loading bas6 package file...
adding Package: BAS6
BAS6 package load...success
loading lpf package file...
loading IBCFCB, HDRY, NPLPF...
loading LAYTYP...
loading LAYAVG...
loading CHANI...
loading LAYVKA...
loading LAYWET...
loading hk layer 1...
loading hani layer 1...
D:\Anaconda\program\lib\site-packages\flopy\utils\util_array.py in parse_control_record(line, current_unit, dtype, ext_unit_dict, array_format)
   3215 locat = int(line[0:10].strip())
ValueError: invalid literal for int() with base 10: '-877.0
How can I solve this kind of problem? By the way, I created this model by using the "save native text copy" function in GMS. FloPy can read the other contents of the LPF package normally, and the error occurs in the part that reads the [ANGLEX(NJAG)] data. I compared the LPF file with the input and output description of MODFLOW-USG, and it meets the format requirements of the input file. I am a newbie to Python and FloPy, and this question confuses me a lot. Thank you very much for providing me with some reference information, whether it is about Python, FloPy, MODFLOW-USG or GMS.
Can you upload your lpf file? Then I can check this out. But at first glance, that "'" before the -877.0 looks suspect - is that in the lpf file?
How to load waymo scenario data?
How do I load the Waymo motion scenario files (Scenario protocol buffers)? According to the tutorial, this should work:
FILENAME = "uncompressed_scenario_testing_testing.tfrecord-00000-of-00150"
dataset = tf.data.TFRecordDataset(FILENAME, compression_type='')
for data in dataset:
    frame = open_dataset.Frame()
    frame.ParseFromString(bytearray(data.numpy()))
    break
However, I get "Error parsing message". Nevertheless, loading tf.Example data is no problem when using:
dataset = tf.data.TFRecordDataset(FILENAME, compression_type='')
data = next(dataset.as_numpy_iterator())
parsed = tf.io.parse_single_example(data, features_description)
Unfortunately, this does not work for the scenario data, because I don't know the feature_description of the data, or I am unable to understand Waymo's detailed explanation. Could somebody help me not only to load the data but also to understand the basic architecture of the dataset? Thanks.
How to extract metadata from tflite model
I'm loading this object detection model in Python. I can load it with the following lines of code:
import tflite_runtime.interpreter as tflite
model_path = 'path_to_model_file.tf'
interpreter = tflite.Interpreter(model_path)
I'm able to perform inference on this without any problem. However, labels are supposed to be included in the metadata, according to the model's documentation, but I can't extract them. The closest I got was when following this:
from tflite_support import metadata as _metadata

displayer = _metadata.MetadataDisplayer.with_model_file(model_path)
export_json_file = "extracted_metadata.json"
json_file = displayer.get_metadata_json()
# Optional: write out the metadata as a json file
with open(export_json_file, "w") as f:
    f.write(json_file)
but the very first line of that snippet fails with this error:
AttributeError: 'int' object has no attribute 'tobytes'
How do I extract it?
If you only care about the label file, you can simply run a command like unzip model_path on Linux or Mac. A TFLite model with metadata is essentially a zip file. See the public introduction for more details.
Your code snippet to extract the metadata works on my end. Make sure to double check model_path; it should be a string, such as "lite-model_ssd_mobilenet_v1_1_metadata_2.tflite".
If you'd like to read label files in an Android app, here is the sample code to do so.
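For the unzip route, a small Python equivalent (using the example model file name mentioned above):
import zipfile

# A TFLite model with metadata is a zip archive, so the associated files
# (such as the label file) can be listed and extracted directly.
model_path = "lite-model_ssd_mobilenet_v1_1_metadata_2.tflite"
with zipfile.ZipFile(model_path) as z:
    print(z.namelist())        # lists the packed associated files
    z.extractall("extracted")  # writes them to a local folder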