Azure blob storage to JSON in azure function using SDK - python

I am trying to create a timer-triggered Azure Function that takes data from a blob, aggregates it, and puts the aggregates in Cosmos DB. I previously tried using the Azure Functions bindings to use the blob as input, which I was informed was incorrect (see this thread: Azure functions python no value for named parameter).
I am now using the SDK and am running into the following problem:
import sys, os.path
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), 'myenv/Lib/site-packages')))
import json
import pandas as pd
from azure.storage.blob import BlockBlobService
data = BlockBlobService(account_name='accountname', account_key='accountkey')
container_name = 'container'
generator = data.list_blobs(container_name)
for blob in generator:
    print("{}".format(blob.name))
    json = json.loads(data.get_blob_to_text('container', open(blob.name)))
    df = pd.io.json.json_normalize(json)
    print(df)
This results in an error:
IOError: [Errno 2] No such file or directory: 'test.json'
I realize this might be an absolute path issue, but I'm not sure how that works with Azure Storage. Any ideas on how to circumvent this?
Made it "work" by doing the following:
for blob in generator:
    loader = data.get_blob_to_text('kvaedevdystreamanablob', blob.name, if_modified_since=delta)
    json = json.loads(loader.content)
This works for ONE JSON file, i.e. I only had one in storage, but when more are added I get this error:
ValueError: Expecting object: line 1 column 21907 (char 21906)
This happens even if I add if_modified_since so as to only take in one blob. Will update if I figure something out. Help is always welcome.
Another update: My data is coming in through Stream Analytics and then down to the blob. I had selected that the data should come in as arrays, which is why the error is occurring: when the stream is terminated, the blob doesn't immediately append ] to the EOF line of the JSON, so the JSON file isn't valid. I will now try line-by-line output in Stream Analytics instead of arrays.

Figured it out. In the end it was quite a simple fix:
I had to make sure each JSON entry in the blob was less than 1024 characters, or it would wrap onto a new line, which made reading line by line problematic.
The code that iterates through each blob file, reads it, and appends it to a list is as follows:
import json
from azure.storage.blob import BlockBlobService

data = BlockBlobService(account_name='accname', account_key='key')
generator = data.list_blobs('collection')
dataloaded = []

for blob in generator:
    # each blob contains line-delimited JSON: one object per line
    loader = data.get_blob_to_text('collection', blob.name)
    trackerstatusobjects = loader.content.split('\n')
    for trackerstatusobject in trackerstatusobjects:
        dataloaded.append(json.loads(trackerstatusobject))
From this you can build a dataframe and do whatever you want with it :)
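For completeness, here is a minimal sketch of that last step, assuming the dataloaded list built above; the aggregation column names would depend on your own JSON schema, so the groupby line is only illustrative:

import pandas as pd

# flatten the list of parsed JSON objects into a DataFrame
df = pd.io.json.json_normalize(dataloaded)
print(df.head())

# illustrative aggregation before writing to Cosmos DB; 'deviceId' and
# 'value' are placeholder column names, not ones from the original data
# summary = df.groupby('deviceId')['value'].mean()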
Hope this helps if someone stumbles upon a similar problem.

Related

RedVox Python SDK | Not Reading in .rdvxz Files

I'm attempting to read in a series of files contained in a single directory for processing, using RedVox:
input_directory = "/home/ben/Documents/Data/F1D1/21"  # file location
rdvx_data = DataWindow(input_dir=input_directory, apply_correction=False, debug=True)  # using RedVox to read in the files
print(os.listdir(input_directory))  # verifying the files actually exist...
# returns "['file1.rdvxz', 'file2.rdvxz', 'file3.rdvxz', ...etc]", they exist

# write audio portion to file
rdvx_data.to_json_file(base_dir=output_rpd_directory,
                       file_name=output_filename)

# this never runs, because rdvx_data.stations = [] (verified through debugging)
for station in rdvx_data.stations:
    # some code here
Enabling debugging through the arguments as seen above does not provide any extra details. In fact, there is no error message whatsoever. It writes the JSON file and pickle to disk, but the JSON file is full of null values and the pickle object is just a shell with no contents. So the files definitely exist, os.listdir() sees them, but RedVox does not.
I assume this is some very silly error or lack of understanding on my part. Any help is greatly appreciated. I have not worked with RedVox previously, nor do I have much understanding of what these files contain other than some audio data and some other data. I've simply been tasked with opening them to work on a model to analyze the data within.
SOLVED: Not sure why the previous code doesn't work (it was handed to me), but I worked around the DataWindow call and went straight to the "redvox.api900.reader" module:
from glob import glob
from redvox.api900 import reader

dataset_dir = "/home/*****/Documents/Data/F1D1/21/"
rdvx_files = glob(dataset_dir + "*.rdvxz")

for file in rdvx_files:
    wrapped_packet = reader.read_rdvxz_file(file)
From here I can view all of the sensor data within:
if wrapped_packet.has_microphone_sensor():
    microphone_sensor = wrapped_packet.microphone_sensor()
    print("sample_rate_hz", microphone_sensor.sample_rate_hz())
Hope this helps anyone else who's confused.

Convert .txt to .csv with AWS Lambda function

I am trying to create a Lambda function that is triggered whenever a .txt file is dropped into the bucket 'testinput' (a private bucket), converts it from a pipe-delimited .txt file to a .csv file, and then puts the converted file into the bucket 'testoutput' (also a private bucket).
I created a Lambda function using the code below and set up the S3 trigger. I followed the steps in this blog so that I could import Pandas. However, when I dropped the file 'pipedelimitedtest' into 'testinput', nothing happened.
This leads me to believe that there is something wrong with the code I am using for converting the .txt file to a .csv file.
If someone could provide insight into what specifically is wrong with my code, or if there is a simpler way to address this problem, I would greatly appreciate it.
UPDATE: I changed the script based on the helpful feedback from Marcin and added a pandas layer to the function. However, when I test the function, I am now getting the error: Unable to import module 'lambda_function': No module named 'pandas'.
import boto3
import io
import pandas as pd
import json

def lambda_handler(event, context):
    s3 = boto3.client('s3')
    obj = s3.get_object(Bucket='lambdatestinput985', Key='Pipedelimitedtest.txt')
    df = pd.read_csv(io.BytesIO(obj['Body'].read()))
    csv_buffer = StringIO()
    df.to_csv(csv_buffer)
    s3_resource = boto3.resource('s3')
    s3_resource.Object(Bucket, 'df.csv').put(Body=csv_buffer.getvalue())
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from S3 events Lambda!')
    }
Your code is not a valid Lambda function. You need a lambda handler, which is the entry point to your function.
Also, you need to bundle pandas in your Lambda deployment package or use a pandas layer to provide the pandas library to your function.
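To illustrate, here is a minimal sketch of a handler along those lines (not the original poster's final code): it takes the bucket and key from the S3 event record instead of hardcoding them, parses the pipe-delimited file, and writes the CSV to the 'testoutput' bucket named in the question. A pandas layer still has to be attached to the function for the import to succeed.

import io
import json
import boto3
import pandas as pd

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # take the source bucket and key from the triggering S3 event
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    key = record['object']['key']

    # read the pipe-delimited text file into a DataFrame
    obj = s3.get_object(Bucket=bucket, Key=key)
    df = pd.read_csv(io.BytesIO(obj['Body'].read()), sep='|')

    # write it back out as CSV to the destination bucket from the question
    csv_buffer = io.StringIO()
    df.to_csv(csv_buffer, index=False)
    out_key = key.rsplit('.', 1)[0] + '.csv'
    s3.put_object(Bucket='testoutput', Key=out_key, Body=csv_buffer.getvalue())

    return {
        'statusCode': 200,
        'body': json.dumps('Converted %s' % key)
    }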

Reading Data From Cloud Storage Via Cloud Functions

I am trying to do a quick proof of concept for building a data processing pipeline in Python. To do this, I want to build a Google Cloud Function that will be triggered when certain .csv files are dropped into Cloud Storage.
I followed along with this Google Functions Python tutorial, and while the sample code does trigger the function to create some simple logs when a file is dropped, I am really stuck on what call I have to make to actually read the contents of the file. I tried to search for an SDK/API guidance document but have not been able to find one.
In case this is relevant, once I process the .csv, I want to be able to add some data that I extract from it into GCP's Pub/Sub.
The function does not actually receive the contents of the file, just some metadata about it.
You'll want to use the google-cloud-storage client. See the "Downloading Objects" guide for more details.
Putting that together with the tutorial you're using, you get a function like:
from google.cloud import storage

storage_client = storage.Client()

def hello_gcs_generic(data, context):
    bucket = storage_client.get_bucket(data['bucket'])
    blob = bucket.blob(data['name'])
    contents = blob.download_as_string()
    # Process the file contents, etc...
This is an alternative solution using pandas:
Cloud Function Code:
import pandas as pd

def GCSDataRead(event, context):
    bucketName = event['bucket']
    blobName = event['name']
    fileName = "gs://" + bucketName + "/" + blobName
    dataFrame = pd.read_csv(fileName, sep=",")
    print(dataFrame)
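Since the question also mentions forwarding extracted data to Pub/Sub, here is a hedged sketch of how that step could follow the pandas read, using the google-cloud-pubsub client; the project and topic names are placeholders rather than anything from the original answers:

import json
import pandas as pd
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# 'my-project' and 'my-topic' are placeholder names
topic_path = publisher.topic_path('my-project', 'my-topic')

def gcs_csv_to_pubsub(event, context):
    fileName = "gs://" + event['bucket'] + "/" + event['name']
    dataFrame = pd.read_csv(fileName, sep=",")

    # publish each row as a JSON-encoded message; Pub/Sub payloads must be bytes
    # default=str guards against non-JSON-native types such as numpy numbers
    for row in dataFrame.to_dict(orient="records"):
        future = publisher.publish(topic_path, json.dumps(row, default=str).encode("utf-8"))
        future.result()  # block until the message is accepted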

How do I load JSON into Couchbase Headless Server in Python?

I am trying to create a Python script that can take a JSON object and insert it into a headless Couchbase server. I have been able to successfully connect to the server and insert some data. I'd like to be able to specify the path of a JSON object and upsert that.
So far I have this:
from couchbase.bucket import Bucket
from couchbase.exceptions import CouchbaseError
import json

cb = Bucket('couchbase://XXX.XXX.XXX?password=XXXX')
print cb.server_nodes

#tempJson = json.loads(open("myData.json","r"))

try:
    result = cb.upsert('healthRec', {'record': 'bob'})
    # result = cb.upsert('healthRec', {'record': tempJson})
except CouchbaseError as e:
    print "Couldn't upsert", e
    raise

print(cb.get('healthRec').value)
I know that the first commented-out line that loads the JSON is incorrect, because json.loads expects a string rather than a file object... Can anyone help?
Thanks!
Figured it out:
with open('myData.json', 'r') as f:
    data = json.load(f)

try:
    result = cb.upsert('healthRec', {'record': data})
except CouchbaseError as e:
    print "Couldn't upsert", e
    raise
I am looking into using cbdocloader, but this was my first step in getting this to work. Thanks!
I know that you've found a solution that works for you in this instance but I thought I'd correct the issue that you experienced in your initial code snippet.
json.loads() takes a string as input and decodes the JSON string into a dictionary (or whatever custom object you use via the object_hook), which is why you were seeing the issue: you were passing it a file handle.
There is actually a method json.load() which works the way you expected, and which you used in your eventual answer.
You would have been able to use it as follows (if you wanted something slightly less verbose than the with statement):
tempJson = json.load(open("myData.json","r"))
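Building on that, here is a small sketch of the multi-document case (before reaching for cbdocloader); it assumes the same cb bucket connection shown in the snippets above, and the 'records/' directory and key-naming scheme are purely illustrative:

import glob
import json

from couchbase.bucket import Bucket
from couchbase.exceptions import CouchbaseError

cb = Bucket('couchbase://XXX.XXX.XXX?password=XXXX')

# 'records/' is an illustrative directory containing one JSON document per file
for path in glob.glob('records/*.json'):
    with open(path, 'r') as f:
        data = json.load(f)
    # use the file name (without extension) as the document key
    key = path.rsplit('/', 1)[-1].rsplit('.', 1)[0]
    try:
        cb.upsert(key, {'record': data})
    except CouchbaseError as e:
        print("Couldn't upsert %s: %s" % (path, e))
        raise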
As Kirk mentioned, though, if you have a large number of JSON documents to insert, then it might be worth taking a look at cbdocloader, as it will handle all of this boilerplate code for you (with appropriate error handling and other functionality).
This readme covers the uses of cbdocloader and how to format your data correctly to allow it to load your documents into Couchbase Server.

App Engine - Save response from an API in the data store as file (blob)

I'm banging my head against the wall with this one:
What I want to do is store a file that is returned from an API in the data store as a blob.
Here is the code that I use on my local machine (which of course works due to an existing file system):
client.convertHtml(html, open('html.pdf', 'wb'))
Since I cannot write to a file on App Engine I tried several ways to store the response, without success.
Any hints on how to do this? I was trying to do it with StringIO and managed to store the response, but then wasn't able to store it as a blob in the data store.
Thanks,
Chris
Found the error. Here is how it looks right now (simplified):
output = StringIO.StringIO()

try:
    client.convertURI("example.com", output)
    Report.pdf = db.Blob(output.getvalue())
    Report.put()
except pdfcrowd.Error, why:
    logging.error('PDF creation failed %s' % why)
I was trying to save the output without calling getvalue(); that was the problem. Perhaps this is of use to someone in the future :)
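For context, the snippet above assumes a datastore model along these lines; this is a guess at the shape of Report rather than code from the original post, using the old db API's BlobProperty:

from google.appengine.ext import db

# hypothetical model matching the snippet above; the real Report class
# likely has more fields
class Report(db.Model):
    pdf = db.BlobProperty()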
