How to validate JSON format without loading the file? I am copying files from one S3 bucket to another S3 bucket. After the JSONL files are copied, I want to check that the file format is correct, in the sense that the curly braces and commas are fine.
I don't want to use json.load() because the files are large and numerous, which would slow down the process. Besides, the file is already copied, so there is no need to parse it; validation is all that is required.
There is no capability within Amazon S3 itself to validate the content of objects.
You could configure S3 to trigger an AWS Lambda function whenever a file is created in the S3 bucket. The Lambda function could then parse the file and perform some action (e.g. send a notification or move the object to another location) if the validation fails.
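As an illustration only, a minimal sketch of such a Lambda handler (the event parsing and the failure action are assumptions, not something the answer specifies) might look like:
import json
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # invoked by an s3:ObjectCreated:* event on the destination bucket
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        body = s3.get_object(Bucket=bucket, Key=key)['Body']
        for line_number, line in enumerate(body.iter_lines(), start=1):
            if not line:
                continue  # skip blank lines
            try:
                json.loads(line)  # each JSONL line must be a valid JSON document
            except json.JSONDecodeError:
                # placeholder failure action: notify or move the object
                print(f"Invalid JSON in s3://{bucket}/{key} at line {line_number}")
                return {'valid': False}
    return {'valid': True}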
Streaming the file seems to be the way to go: put it into a generator and yield line by line to check whether the JSON is valid. The requests library supports streaming a file.
The solution would look something like this:
import requests

def get_data():
    # stream the file instead of downloading it all at once
    r = requests.get('s3_file_url', stream=True)
    yield from r.iter_lines()

def parse_data():
    # initialize generator
    gd_gen = get_data()
    while True:
        try:
            line = next(gd_gen)
        except StopIteration:
            break
        # put your validation code here
Let me know if you need further clarification.
I am making a Python-based web app that uses JSON files to store information for different accounts. I am not able to temporarily save the JSON data to a local file and then upload it to a blob. Is there a way to remove that step and just upload the JSON data directly as a file in a blob?
This is how you normally upload a file
with open(upload_file_path, "rb") as data:
    blob_client.upload_blob(data)
Is there something like blob_client.upload_blob(<json object>) that works?
Is there something like blob_client.upload_blob(<json object>) that works?
Yes. If you look at the documentation for upload_blob, you will notice that the data parameter can be of AnyStr, Iterable or IO type.
What you could do is serialize the JSON object as string using json.dumps() and pass that as data to the upload_blob method.
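For illustration, a minimal sketch of that approach (the connection string, container name, and blob name are placeholders, not taken from the question) might be:
import json
from azure.storage.blob import BlobClient

# placeholder connection details
blob_client = BlobClient.from_connection_string(
    conn_str="<connection-string>",
    container_name="<container-name>",
    blob_name="account.json",
)

account_data = {"name": "example-account", "plan": "free"}

# serialize the dict to a JSON string and upload it directly, no local file needed
blob_client.upload_blob(json.dumps(account_data), overwrite=True)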
I need to read a zip file located in a Google Storage bucket, without unzipping it, using Airflow. I am using a library called stream-unzip.
from stream_unzip import stream_unzip
import httpx

def zipped_chunks():
    # Iterable that yields the bytes of a zip file
    with httpx.stream('GET', "<google_bucket_file_path>") as r:
        yield from r.iter_bytes(chunk_size=65536)

for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks(), password=b'my-password'):
    # unzipped_chunks must be iterated to completion or UnfinishedIterationError will be raised
    for chunk in unzipped_chunks:
        print(chunk)
The error I got when I tried the above code is:
stream_unzip.TruncatedDataError
This zip file contains a text file that has millions of lines. I need to stream these lines and get the data in each line on the fly. Airflow is deployed in a Cloud Composer environment.
Please suggest a good solution for this.
Thank you
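For reference, one way to feed stream_unzip from the authenticated google-cloud-storage client rather than a plain HTTP GET (a sketch only; the bucket and object names are placeholders, and this assumes the object is not publicly readable, which is one common cause of a truncated download) could be:
from google.cloud import storage
from stream_unzip import stream_unzip

def zipped_chunks(bucket_name, blob_name, chunk_size=65536):
    # stream the zip through the authenticated client, chunk by chunk
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    with blob.open("rb") as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            yield data

for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks("<bucket-name>", "<object-name>.zip")):
    for chunk in unzipped_chunks:
        ...  # process each decompressed chunk on the fly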
I am using Backblaze B2 and b2sdk.v2 in Flask to upload files.
This is the code I tried, using the upload method:
# I am not showing authorization code...
def upload_file(file):
    bucket = b2_api.get_bucket_by_name(bucket_name)
    file = request.files['file']
    bucket.upload(
        upload_source=file,
        file_name=file.filename,
    )
This shows an error like this
AttributeError: 'SpooledTemporaryFile' object has no attribute 'get_content_length'
I think it's because I am using a FileStorage instance for the upload_source parameter.
I want to know whether I am using the API correctly or, if not, how should I use this?
Thanks
You're correct - you can't use a Flask FileStorage instance as a B2 SDK UploadSource. What you need to do is to use the upload_bytes method with the file's content:
def upload_file(file):
    bucket = b2_api.get_bucket_by_name(bucket_name)
    file = request.files['file']
    bucket.upload_bytes(
        data_bytes=file.read(),
        file_name=file.filename,
        ...other parameters...
    )
Note that this reads the entire file into memory. The upload_bytes method may need to restart the upload if something goes wrong (with the network, usually), so the file can't really be streamed straight through into B2.
If you anticipate that your files will not fit into memory, you should look at using create_file_stream to upload the file in chunks.
I am fairly new to both S3 as well as boto3. I am trying to read in some data in the following format:
https://blahblah.s3.amazonaws.com/data1.csv
https://blahblah.s3.amazonaws.com/data2.csv
https://blahblah.s3.amazonaws.com/data3.csv
I am importing boto3, and it seems like I would need to do something like:
import boto3
s3 = boto3.client('s3')
However, what should I do after creating this client if I want to read all the files separately into memory (I am not supposed to download this data locally)? Ideally, I would like to read each CSV data file into a separate Pandas DataFrame (which I know how to do once I know how to access the S3 data).
Please understand I'm fairly new to both boto3 as well as S3, so I don't even know where to begin.
You have two options, both of which you've already mentioned:
Downloading the file locally using download_file
s3.download_file(
    "<bucket-name>",
    "<key-of-file>",
    "<local-path-where-file-will-be-downloaded>"
)
See download_file
Loading the file contents into memory using get_object
response = s3.get_object(Bucket="<bucket-name>", Key="<key-of-file>")
contentBody = response.get("Body")
# You need to read the content as it is a Stream
content = contentBody.read()
See get_object
Either approach is fine and you can just choose which one fits your scenario better.
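Since the end goal is separate pandas DataFrames, a minimal sketch building on the get_object approach (the bucket and key names are placeholders based on the question's URLs) could be:
import io
import boto3
import pandas as pd

s3 = boto3.client("s3")

def read_csv_from_s3(bucket, key):
    # fetch the object and parse the streamed body straight into a DataFrame
    response = s3.get_object(Bucket=bucket, Key=key)
    return pd.read_csv(io.BytesIO(response["Body"].read()))

# one DataFrame per file, nothing written to disk
frames = {key: read_csv_from_s3("blahblah", key)
          for key in ["data1.csv", "data2.csv", "data3.csv"]}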
Try this:
import boto3
s3 = boto3.resource('s3')
obj = s3.Object(<<bucketname>>, <<itemname>>)
body = obj.get()['Body'].read()
I have old code below that gzips a file and stores it as JSON in S3, using the io library (so a file does not get saved locally). I am having trouble converting this same approach (i.e. using the io library for a buffer) to create a .txt file, push it into S3, and later retrieve it. I know how to create txt files and push them into S3 as well, but not how to use io in the process.
The value I want stored in the text file would just be a variable with the string value 'test'.
Goal: use the io library to save a string variable as a text file in S3 and be able to pull it down again.
x = 'test'
inmemory = io.BytesIO()
with gzip.GzipFile(fileobj=inmemory, mode='wb') as fh:
    with io.TextIOWrapper(fh, encoding='utf-8', errors='replace') as wrapper:
        wrapper.write(json.dumps(x, ensure_ascii=False, indent=2))
inmemory.seek(0)
s3_resource.Object(s3bucket, s3path + '.json.gz').upload_fileobj(inmemory)
inmemory.close()
Also, I would welcome any documentation anyone likes specifically about the io library and writing to files, because the official documentation (e.g. f = io.StringIO("some initial text data"), https://docs.python.org/3/library/io.html) just did not give me enough at my current level.
Duplicate.
For the sake of brevity, it turns out there's a way to override the putObject call so that it takes in a string of text instead of a file.
The original post is answered in Java, but this additional thread should be sufficient for a Python-specific answer.
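Following the same in-memory pattern as the gzip example above, a minimal sketch (reusing the question's s3_resource, s3bucket, and s3path names, which are assumed to already be defined) might be:
import io

x = 'test'

# upload: wrap the encoded string in an in-memory buffer and push it as a .txt object
buffer = io.BytesIO(x.encode('utf-8'))
s3_resource.Object(s3bucket, s3path + '.txt').upload_fileobj(buffer)

# download: pull the object back into memory and decode it
downloaded = io.BytesIO()
s3_resource.Object(s3bucket, s3path + '.txt').download_fileobj(downloaded)
downloaded.seek(0)
print(downloaded.read().decode('utf-8'))  # prints 'test'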