Hello and thank you for reading. To put it simply, I want to perform Batch Transform on my XGBoost model that I made using SageMaker Experiments. I trained my model on csv data stored in S3, deployed an endpoint for my model, successfully hit said endpoint with single csv lines and got back expected inferences.
(I followed this tutorial to the letter before starting to work on Batch Transformation)
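(For reference, I was hitting the endpoint with single lines roughly like this; the endpoint name below is just a placeholder:)
import boto3

# Rough sketch of invoking the deployed endpoint with a single CSV line
# (the endpoint name is a placeholder; the payload is one row of my data)
runtime = boto3.client('sagemaker-runtime')
response = runtime.invoke_endpoint(
    EndpointName='my-xgboost-endpoint',
    ContentType='text/csv',
    Body='33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown'
)
print(response['Body'].read().decode('utf-8'))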
Now I am attempting to run Batch Transformation using the model created from the above tutorial and I'm running into an error (skip to the bottom to see my error logs). Before I get straight to the error, I want to show my batch transform code.
(imports are done from SageMaker SDK v2.24.4)
import sagemaker
import boto3
from sagemaker import get_execution_role
from sagemaker.model import Model
region = boto3.Session().region_name
role = get_execution_role()
image = sagemaker.image_uris.retrieve('xgboost', region, '1.2-1')
model_location = '{mys3info}/output/model.tar.gz'
model = Model(image_uri=image,
              model_data=model_location,
              role=role,
              )

transformer = model.transformer(instance_count=1,
                                instance_type='ml.m5.xlarge',
                                strategy='MultiRecord',
                                assemble_with='Line',
                                output_path='myOutputPath',
                                accept='text/csv',
                                max_concurrent_transforms=1,
                                max_payload=20)

transformer.transform(data='s3://test-s3-prefix/short_test_data.csv',
                      content_type='text/csv',
                      split_type='Line',
                      join_source='Input'
                      )
transformer.wait()
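(For completeness, once the job finishes I would expect to read the result back roughly like this; the bucket and prefix are placeholders, and Batch Transform appends .out to the input file name:)
import boto3

# Hypothetical check of the transform output; bucket/prefix are placeholders
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='my-output-bucket', Key='my-output-prefix/short_test_data.csv.out')
print(obj['Body'].read().decode('utf-8'))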
short_test_data.csv
33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown
47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown
33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown
35,management,married,tertiary,no,231,yes,no,unknown,5,may,139,1,-1,0,unknown
57,blue-collar,married,primary,no,52,yes,no,unknown,5,may,38,1,-1,0,unknown
32,blue-collar,single,primary,no,23,yes,yes,unknown,5,may,160,1,-1,0,unknown
53,technician,married,secondary,no,-3,no,no,unknown,5,may,1666,1,-1,0,unknown
29,management,single,tertiary,no,0,yes,no,unknown,5,may,363,1,-1,0,unknown
32,management,married,tertiary,no,0,yes,no,unknown,5,may,179,1,-1,0,unknown
38,management,single,tertiary,no,424,yes,no,unknown,5,may,104,1,-1,0,unknown
I made the above CSV test data from my original dataset by running the following on my command line:
head original_training_data.csv > short_test_data.csv
and then I uploaded it to my S3 bucket manually.
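(The manual upload is equivalent to something like this with boto3; the bucket name matches the transform input path above:)
import boto3

# Hypothetical equivalent of the manual upload to S3
s3 = boto3.client('s3')
s3.upload_file('short_test_data.csv', 'test-s3-prefix', 'short_test_data.csv')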
Logs
[sagemaker logs]: MaxConcurrentTransforms=1, MaxPayloadInMB=20, BatchStrategy=MULTI_RECORD
[sagemaker logs]: */short_test_data.csv: ClientError: 415
[sagemaker logs]: */short_test_data.csv:
[sagemaker logs]: */short_test_data.csv: Message:
[sagemaker logs]: */short_test_data.csv: Loading csv data failed with Exception, please ensure data is in csv format:
[sagemaker logs]: */short_test_data.csv: <class 'ValueError'>
[sagemaker logs]: */short_test_data.csv: could not convert string to float: 'entrepreneur'
I understand the concept of one-hot encoding and other methods for converting strings to numbers for use by an algorithm like XGBoost (a rough sketch of what I mean follows below). My problem here is that I was easily able to send the exact same format of data to a deployed endpoint and get results back without doing that level of encoding. I am clearly missing something, though, so any help is greatly appreciated!
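(For reference, the kind of encoding I mean, sketched with pandas. Since the CSV has no header row the columns are just numbered, and the string columns in the sample above are 1, 2, 3, 4, 6, 7, 8, 10 and 15.)
import pandas as pd

# Rough sketch of one-hot encoding the string columns of the sample data above
df = pd.read_csv('short_test_data.csv', header=None)
encoded = pd.get_dummies(df, columns=[1, 2, 3, 4, 6, 7, 8, 10, 15])
encoded.to_csv('encoded_test_data.csv', index=False, header=False)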
Your Batch Transform code looks good and does not raise any alarms, but looking at the error message, it is clearly an input format error. As silly as it may sound, I'd advise using pandas to save off the test data from your validation set to ensure the formatting is appropriate.
You could do something like this -
import pandas as pd

data = pd.read_csv("file")
# specify the columns to keep from the extracted df
data = data[["choose columns"]]
# save the data to csv
data.to_csv("data.csv", sep=',', index=False)
I am a newbie to TensorFlow, so I would appreciate any constructive help.
I am trying to build a feature-extraction and data pipeline with TensorFlow for video processing, where multiple folders hold video files belonging to multiple classes (the JHMDB database), but I am kind of stuck.
I have the features extracted into one folder, at the moment as separate *.npz compressed arrays, and the class name is stored in each filename as well.
First Attempt
First I thought I would use this code from the TF tutorials site, the simple 'read files from a folder' method:
from pathlib import Path

import numpy as np
import tensorflow as tf

jhmdb_path = Path('...')

# Process files in folder
list_ds = tf.data.Dataset.list_files(str(jhmdb_path/'*.npz'))
for f in list_ds.take(5):
    print(f.numpy())

def process_path(file_path):
    # the class name is the last '_'-separated token of the file name
    labels = tf.strings.split(file_path, '_')[-1]
    features = np.load(file_path)  # fails here: file_path is a Tensor, not a str
    return features, labels

labeled_ds = list_ds.map(process_path)
for a, b in labeled_ds.take(5):
    print(a, b)
TypeError: expected str, bytes or os.PathLike object, not Tensor
...but this is not working.
Second Attempt
Then I thought, OK, I will use generators:
# using a generator
jhmdb_path = Path('...')

def generator():
    for item in jhmdb_path.glob("*.npz"):
        features = np.load(item)
        print(features.files)
        print(features['PAFs'].shape)
        features = features['PAFs']
        yield features

dataset = tf.data.Dataset.from_generator(generator, (tf.uint8))
iter(next(dataset))
TypeError: 'FlatMapDataset' object is not an iterator
...this is not working either.
In the first case, the path somehow arrives as a bytes/Tensor object, and I could not convert it to a str so that np.load() could read it. (Strangely, if I point np.load(direct_path) directly at a file, it works.)
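(My best guess for the first case is that np.load has to be wrapped so the path tensor can be decoded to a plain string first, something like the sketch below, which assumes the arrays live under the 'PAFs' key as in my second attempt, but I am not sure this is the idiomatic way:)
# Sketch only: wrap np.load with tf.py_function so the path tensor can be decoded
def load_npz(file_path):
    path = file_path.numpy().decode('utf-8')  # EagerTensor -> bytes -> str
    return np.load(path)['PAFs']              # assumes the array is stored under 'PAFs'

def process_path(file_path):
    labels = tf.strings.split(file_path, '_')[-1]
    features = tf.py_function(load_npz, [file_path], tf.uint8)
    return features, labels

labeled_ds = list_ds.map(process_path)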
In the second case... I am not sure what is wrong.
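(For the second case, maybe part of the problem is simply the call order, i.e. next(iter(dataset)) rather than iter(next(dataset)), though even with that I am not confident the pipeline itself is right:)
# probably the intended call: build an iterator from the dataset, then pull one element
first_item = next(iter(dataset))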
I have looked for hours for a solution for how to build an iterable dataset from a large number of relatively big 'npz' or 'npy' files, but this does not seem to be mentioned anywhere (or maybe it is just too trivial).
Also, as I have not been able to test the model so far, I am not sure if this is the right way to go, i.e. to feed the model with hundreds of files in this way, or to just build one huge 3.5 GB npz (which would still fit in memory) and use that instead. Or to use TFRecords, but that looks more complicated than the usual examples.
What is really annoying here is that the TF tutorials, and in general everything out there, cover how to load a ready-made dataset directly, or how to load np arrays, or image, text, or dataframe objects, but I cannot find any real examples of how to process big chunks of data files, e.g. the case of extracting features from audio or video files.
So any suggestions or solutions would be highly appreciated and I would be really, really grateful to have something finally working! :)
I'm trying to extract data from an OSM.PBF file using the python GDAL/OGR module.
Currently my code looks like this:
import gdal, ogr
osm = ogr.Open('file.osm.pbf')
## Select multipolygon from the layer
layer = osm.GetLayer(3)
# Create list to store pubs
pubs = []
for feat in layer:
    if feat.GetField('amenity') == 'pub':
        pubs.append(feat)
This little bit of code works fine with small .pbf files (15 MB). However, when parsing files larger than 50 MB, I get the following error:
ERROR 1: Too many features have accumulated in points layer. Use OGR_INTERLEAVED_READING=YES MODE
When I turn this mode on with:
gdal.SetConfigOption('OGR_INTERLEAVED_READING', 'YES')
ogr does not return any features at all anymore, even when parsing small files.
Does anyone know what is going on here?
Thanks to scai's answer I was able to figure it out.
The special reading pattern required for interleaved reading, which is mentioned at gdal.org/1.11/ogr/drv_osm.html, is translated into the working Python example below.
This is an example of how to extract all features in an .osm.pbf file that have the 'amenity=pub' tag
import gdal, ogr

gdal.SetConfigOption('OGR_INTERLEAVED_READING', 'YES')
osm = ogr.Open('file.osm.pbf')

# Grab available layers in file
nLayerCount = osm.GetLayerCount()

thereIsDataInLayer = True
pubs = []

while thereIsDataInLayer:
    thereIsDataInLayer = False
    # Cycle through available layers
    for iLayer in xrange(nLayerCount):
        lyr = osm.GetLayer(iLayer)
        # Get first feature from layer
        feat = lyr.GetNextFeature()
        while feat is not None:
            thereIsDataInLayer = True
            # Do something with the feature, in this case store it in a list
            if feat.GetField('amenity') == 'pub':
                pubs.append(feat)
            # The Destroy method is necessary for interleaved reading
            feat.Destroy()
            feat = lyr.GetNextFeature()
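(A quick sanity check one could add after the loop; note that only the count is reliably usable here, since each feature object was already destroyed:)
# Hypothetical follow-up: report how many pubs were collected
print('Number of pubs found: %d' % len(pubs))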
As far as I understand it, a while-loop is needed instead of a for-loop because, when using the interleaved reading method, it is impossible to obtain the feature count of a collection.
More clarification on why this piece of code works like it does would be greatly appreciated.