I have gzipped json files in S3 and I'm trying to show them in a SageMaker Studio Notebook like so:
import boto3
import gzip

s3 = boto3.resource("s3")
s3_object = s3.Object("bucket", "path")

with gzip.GzipFile(fileobj=s3_object.get()["Body"]) as gzip_file:
    print("reading s3 object through gzip stream")
    raw_json = gzip_file.read()
    print("done reading s3 object, about to flush")
    gzip_file.flush()
    print("done flushing")

print("about to print")
print(raw_json.decode("utf-8"))
print("done printing")
My constraint is that it has to be done in memory; I've resorted to running on an ml.m5.2xlarge instance, which should be more than enough.
I know about IPython.display.JSON, pandas.read_json and json.load()/json.loads(); I'm treating the content as a plain string to keep the question simple.
The (unexpected) output of the above code is:
reading s3 object through gzip stream
done reading s3 object, about to flush
done flushing
about to print
At this point the kernel status is 'Busy', and it can stay that way for a good few minutes until it finally seems to just give up, with no output.
If I run the exact same code in a notebook running on my laptop it works just fine, quickly showing the json content.
What am I missing? Is there a better way to do this?
My endgame is to sometimes present data in a pandas DataFrame and sometimes show it as an IPython.display.JSON for conveniently viewing the content.
This is a known issue with Studio. How big is your json file?
I would recommend you convert it to pandas DF to view it on the console.
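A minimal sketch of that suggestion, assuming the object is newline-delimited JSON (drop lines=True otherwise) and using placeholder bucket/key names:

import gzip
import io

import boto3
import pandas as pd

s3 = boto3.resource("s3")
body = s3.Object("bucket", "path").get()["Body"].read()

# Decompress entirely in memory and let pandas parse the JSON.
with gzip.GzipFile(fileobj=io.BytesIO(body)) as gz:
    df = pd.read_json(gz, lines=True)

df.head()  # render a small preview instead of dumping the whole document

Rendering df.head() keeps the notebook from trying to display the entire payload at once, which is likely what is stalling the Studio kernel.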
Related
I have a Python generator that will yield a large and unknown amount of byte data. I'd like to stream the output to GCS, without buffering to a file on disk first.
While I'm sure this is possible (e.g., I can create a subprocess of gsutil cp - <...> and just write my bytes into its stdin), I'm not sure what the recommended/supported way is, and the documentation only gives an example of uploading a local file.
What's the right way to do this?
The BlobWriter class makes this a bit easier:
from google.cloud import storage
from google.cloud.storage.fileio import BlobWriter

storage_client = storage.Client()
bucket = storage_client.bucket('my_bucket')
blob = bucket.blob('my_object')

writer = BlobWriter(blob)
for d in your_generator:
    writer.write(d)
writer.close()
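On a reasonably recent google-cloud-storage release you can also get the same kind of writer via blob.open() and let a context manager handle the close; a sketch under that assumption:

# Assumes a google-cloud-storage version new enough to provide Blob.open().
with blob.open("wb") as writer:
    for d in your_generator:
        writer.write(d)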
I have old code below that gzips a file and stores it as json in S3, using the io library (so nothing is saved to local disk). I'm having trouble converting that same approach (i.e. using io for an in-memory buffer) to create a .txt file, push it into S3, and later retrieve it. I know how to create txt files and push them into S3; what I don't know is how to use io in the process.
The value I want stored in the text file is just a variable holding the string 'test'.
Goal: use the io library to save a string variable as a text file in S3 and be able to pull it down again.
import gzip, io, json
import boto3

s3_resource = boto3.resource('s3')
# s3bucket and s3path are set elsewhere

x = 'test'
inmemory = io.BytesIO()
with gzip.GzipFile(fileobj=inmemory, mode='wb') as fh:
    with io.TextIOWrapper(fh, encoding='utf-8', errors='replace') as wrapper:
        wrapper.write(json.dumps(x, ensure_ascii=False, indent=2))
inmemory.seek(0)
s3_resource.Object(s3bucket, s3path + '.json.gz').upload_fileobj(inmemory)
inmemory.close()
Also, I'd welcome any documentation anyone likes that deals specifically with the io library and writing to files; the official docs (e.g. f = io.StringIO("some initial text data"), https://docs.python.org/3/library/io.html) just didn't give me enough at my current level.
Duplicate.
For the sake of brevity, it turns out there's a way to override the putObject call so that it takes in a string of text instead of a file.
The original post is answered in Java, but this additional thread should be sufficient for a Python-specific answer.
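In boto3 terms, a minimal in-memory sketch (the bucket and key names are placeholders) might look like this:

import io

import boto3

s3_resource = boto3.resource('s3')
x = 'test'

# Upload: wrap the encoded string in an in-memory buffer, no local file needed.
buffer = io.BytesIO(x.encode('utf-8'))
s3_resource.Object('my-bucket', 'my/key.txt').upload_fileobj(buffer)

# Download: pull the object back into another in-memory buffer and decode it.
download = io.BytesIO()
s3_resource.Object('my-bucket', 'my/key.txt').download_fileobj(download)
text = download.getvalue().decode('utf-8')

If the io detour isn't strictly required, obj.put(Body=x.encode('utf-8')) and obj.get()['Body'].read() achieve the same thing with less ceremony.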
I'm pulling back a pdf from the echosign API, which gives me the bytes of a file.
I'm trying to take those bytes and save them into a boto s3 backed FileField. I'm not having much luck.
This was the closest I got, but it errors on saving 'speaker', and the pdf, although written to S3, appears to be corrupt.
Here speaker is an instance of my model, and fileData is the 'bytes' string returned from the echosign API:
afile = speaker.the_file = S3BotoStorageFile(filename, "wb", S3BotoStorage())
afile.write(fileData)
afile.close()
speaker.save()
I'm closer!
content = ContentFile(fileData)
speaker.profile_file.save(filename, content)
speaker.save()
Turns out the FileField is already an S3BotoStorage, and you can create a new file by passing the raw data in like that. What I don't know is how to make it binary (I'm assuming it's not). My file keeps coming up corrupted, despite having a good amount of data in it.
For reference here is the response from echosign:
https://secure.echosign.com/static/apiv14/sampleXml/getDocuments-response.xml
I'm essentially grabbing the bytes and passing them to ContentFile as fileData. Maybe I need to base64 decode. Going to try that!
Update
That worked!
It seems I have to ask the question here before I figure out the answer. Sigh. So the final code looks something like this:
import base64
from django.core.files.base import ContentFile

content = ContentFile(base64.b64decode(fileData))
speaker.profile_file.save(filename, content)
speaker.save()
I'm using xlwt in Python to create an Excel spreadsheet. You could interchange this for almost anything else that generates a file; it's what I want to do with the file that's important.
from xlwt import *
w = Workbook()
#... do something
w.save('filename.xls')
I have two use cases for the file: I stream it out to the user's browser, or I attach it to an email. In both cases the file only needs to exist for the duration of the web request that generates it.
What I'm getting at, and the reason for starting this thread, is that saving to a real file on the filesystem has its own hurdles (preventing overwrites, cleaning up the file once done). Is there somewhere I could "save" it where it lives only in memory and only for the duration of the request?
cStringIO
(or mmap if it should be mutable)
Generalising the answer, as you suggested: if the "anything else that generates a file" won't accept a file-like object as well as a filepath, then you can reduce the hassle by using tempfile.NamedTemporaryFile.
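For the in-memory route specifically, xlwt's Workbook.save() also accepts a file-like object (at least in recent xlwt releases), so a sketch along these lines (Python 2 era, to match the cStringIO suggestion above) covers both the browser download and the email attachment:

from cStringIO import StringIO  # Python 2, matching the answer above
from xlwt import Workbook

w = Workbook()
ws = w.add_sheet('Sheet1')
ws.write(0, 0, 'hello')     # ... do something

buf = StringIO()
w.save(buf)                 # xlwt writes the .xls bytes into the in-memory buffer
xls_bytes = buf.getvalue()  # hand this to the HTTP response or the email attachment
buf.close()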
I'm banging my head against the wall with this one:
What I want to do is store a file that is returned from an API in the data store as a blob.
Here is the code that I use on my local machine (which of course works due to an existing file system):
client.convertHtml(html, open('html.pdf', 'wb'))
Since I cannot write to a file on App Engine I tried several ways to store the response, without success.
Any hints on how to do this? I was trying to do it with StringIO and managed to store the response, but then wasn't able to store it as a blob in the datastore.
Thanks,
Chris
Found the error. Here is how it looks right now (simplified):
import logging
import StringIO

import pdfcrowd
from google.appengine.ext import db

output = StringIO.StringIO()
try:
    client.convertURI("example.com", output)
    Report.pdf = db.Blob(output.getvalue())  # getvalue() returns the full PDF byte string
    Report.put()
except pdfcrowd.Error, why:
    logging.error('PDF creation failed %s' % why)
I was trying to save the output without calling "getvalue()", that was the problem. Perhaps this is of use to someone in the future :)