Split up a massive WSDL file - python

I am working on interacting with the Netsuite Web Services layer with Python. Using suds to parse the WSDL takes close to two minutes. I was able to write a caching layer using redis that solves a bit of the loading headaches once the client has been parsed, but it still takes a ton of time the first time around.
>>> # Takes several minutes to load
>>> client = suds.Client(huge_four_mb_wsdl_file)
Since I only use a small subset of the services, is there a way to pull only those services from the WSDL and put them into my own smaller WSDL?

If you look at the v2013_2 version of the WSDL source, you'll see that it's actually importing 38 other XSD files.
You can speed up your process by:
Creating a local wsdl that only imports some of the xsd files. (saves download/parse time)
Serialising a ready client using pickle and loading it on boot (saves parse time; see the sketch below)
Also make sure you only have to create the client once in your application lifetime.
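For the pickle route, here is a minimal sketch, assuming the cache path and WSDL location are placeholders and that your suds client pickles cleanly (pickles are version-sensitive, so discard the cache when you upgrade suds):

import os
import pickle
from suds.client import Client

CACHE_PATH = "netsuite_client.pkl"          # hypothetical cache file
WSDL_URL = "file:///path/to/netsuite.wsdl"  # placeholder for the local WSDL

def get_client():
    # Load the already-parsed client if a cached copy exists
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, "rb") as f:
            return pickle.load(f)
    # Otherwise pay the parse cost once and cache the result for next boot
    client = Client(WSDL_URL)
    with open(CACHE_PATH, "wb") as f:
        pickle.dump(client, f)
    return client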

Related

Running python script concurrently based on trigger

What would be the best way to solve the following problem with Python?
I have a real-time data stream coming into my object storage from a user application (JSON files being stored into Amazon S3).
Upon receiving each JSON file, I have to process the data in the file within a certain time (1 s in this instance) and generate a response that is sent back to the user. This data is processed by a simple Python script.
My issue is that the real-time stream can generate up to a few hundred JSON files from user applications at the same time, all of which I need to run through my Python script, and I don't know how best to approach this.
I understand that one way to tackle this would be trigger-based Lambdas that execute a job on top of every file as it is uploaded from the real-time stream in a serverless environment, but this option seems quite expensive compared to having a single server instance running and somehow triggering jobs inside it.
Any advice is appreciated. Thanks.
Serverless can actually be cheaper than using a server. It is much cheaper when there are periods of no activity because you don't need to pay for a server doing nothing.
The hardest part of your requirement is sending the response back to the user. If an object is uploaded to S3, there is no easy way to send back a response, and it isn't even obvious which user sent the file.
You could process the incoming file and then store a response back in a similarly-named object, and the client could then poll S3 for the response. That requires the upload to use a unique name that is somehow generated.
An alternative would be for the data to be sent to AWS API Gateway, which can trigger an AWS Lambda function and then directly return the response to the requester. No server required, automatic scaling.
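As a minimal sketch of that Lambda behind API Gateway (proxy integration; process() here is a hypothetical stand-in for your one-second processing script):

import json

def process(payload):
    # Hypothetical stand-in for the existing Python processing script
    return {"status": "ok", "items": len(payload)}

def lambda_handler(event, context):
    # With API Gateway proxy integration, the client's JSON arrives as a string body
    payload = json.loads(event["body"])
    result = process(payload)
    # Whatever is returned here goes straight back to the requester
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(result),
    }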
If you wanted to use a server, then you'd need a way for the client to send a message to the server with a reference to the JSON object in S3 (or with the data itself). The server would need to be running a web server that can receive the request, perform the work and provide back the response.
Bottom line: Think about the data flow first, rather than the processing.

Get a stream object from an Azure ItemPaged[TableEntity] iterator - python

I want to get a stream object from the Azure ItemPaged[TableEntity] iterator (Python). Is it possible?
https://learn.microsoft.com/en-us/python/api/azure-core/azure.core.paging.itempaged?view=azure-python
Updated 11.08.2021
I have an implementation that backs up Azure Tables to Azure Blob (Current process to backup Azure Tables), but I want to improve this process and am considering different options. I am trying to get a stream from Azure Tables so I can use create_blob_from_stream.
I assume you want to stream bytes from the HTTP response, and not use the iterator of objects you receive.
Each API in the SDK supports a keyword argument called raw_response_hook that gives you access to the HTTP response object and lets you use a stream-download API if you want. Note that since the payload is considered to represent objects, it will be pre-loaded in memory no matter what, but you can still use stream syntax nonetheless.
The callback simply takes one parameter:
def response_callback(response):
    # Do something with the response
    requests_response = response.internal_response
    # Use the "requests" API now
    for chunk in requests_response.iter_content():
        work_with_chunk(chunk)
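As a usage sketch, you pass the hook to whichever operation you call; the client and connection details below are only illustrative of where the keyword argument goes:

from azure.data.tables import TableClient

table_client = TableClient.from_connection_string(
    "<connection-string>",  # placeholder
    table_name="mytable",   # hypothetical table
)

# The hook is invoked with the pipeline response for each underlying HTTP call
for entity in table_client.list_entities(raw_response_hook=response_callback):
    pass  # iterate as usual; the hook has already seen the raw response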
Note that this is pretty advanced; you may encounter difficulties, and it might not fit what you want precisely. We are working on a new pattern in the SDK to simplify complex scenarios like that, but it hasn't shipped yet. You would be able to send and receive raw requests using a send_request method, which gives you absolute control over all aspects of the query, such as specifying that you just want to stream (no pre-loading in memory) or disabling deserialization by default.
Feel free to open an issue on the Azure SDK for Python repo if you have additional questions or need clarification: https://github.com/Azure/azure-sdk-for-python/issues
Edit with new suggestions: TableEntity is a dict-like class, so you can use json.dumps to get a string, or json.dump to write to a stream, while iterating the ItemPaged[TableEntity]. If json.dumps raises an exception, you can try our JSON encoder in azure.core.serialization.AzureJSONEncoder: https://github.com/Azure/azure-sdk-for-python/blob/1ffb583d57347257159638ae5f71fa85d14c2366/sdk/core/azure-core/tests/test_serialization.py#L83
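For example, a minimal backup sketch along those lines (connection string, table name, and output file are placeholders; note that list() materializes all entities in memory before dumping):

import json

from azure.core.serialization import AzureJSONEncoder
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<connection-string>")
table_client = service.get_table_client("mytable")  # hypothetical table

# list_entities returns ItemPaged[TableEntity]; TableEntity is dict-like,
# and AzureJSONEncoder handles types such as datetime and bytes that the
# default encoder rejects.
with open("backup.json", "w") as stream:
    json.dump(list(table_client.list_entities()), stream, cls=AzureJSONEncoder)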
(I work at MS in the Azure SDK for Python team.)
Ref:
https://docs.python-requests.org/en/master/api/#requests.Response.iter_content
https://azuresdkdocs.blob.core.windows.net/$web/python/azure-core/1.17.0/azure.core.pipeline.policies.html#azure.core.pipeline.policies.CustomHookPolicy

How to send real time data generated by Azure webjob to angular UI

I have a Python script running continuously as a WebJob on Azure. Roughly every 3 minutes it generates a new set of data. Once the data is generated, we want to send it to the UI (Angular) in real time.
What could be the ideal approach (fastest) to get this functionality?
The data generated is a JSON object containing 50 key-value pairs. I read about SignalR, but can I use SignalR directly with my Python code? Is there any other approach, like sockets?
What you need is WebSocket, a protocol that allows back-end servers to push data to connected web clients.
There are implementations of WebSocket for Python (a quick search found me this one).
Once you have a WebSocket connection going, you can create a service in your Angular project to handle the messages from your Python service, most likely using observables.
Hopefully this sets you on the right path.
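For illustration, a minimal push-server sketch with the websockets library (assumes a recent version installed via pip install websockets; the payload and the 3-minute cycle stand in for your WebJob's output):

import asyncio
import json
import websockets

CONNECTED = set()

async def handler(websocket):
    # Track each connected Angular client
    CONNECTED.add(websocket)
    try:
        await websocket.wait_closed()
    finally:
        CONNECTED.discard(websocket)

async def producer():
    # Stand-in for the WebJob's roughly 3-minute generation cycle
    while True:
        data = {"metric": 42}  # hypothetical key-value payload
        websockets.broadcast(CONNECTED, json.dumps(data))
        await asyncio.sleep(180)

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await producer()

asyncio.run(main())

On the Angular side, the browser's native WebSocket (or RxJS's webSocket subject) can feed those messages into an observable.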

How to speed up Flask response download speed

My frontend web app is calling my python Flask API on an endpoint that is cached and returns a JSON that is about 80,000 lines long and 1.7 megabytes.
It takes my UI about 7.5 seconds to download all of it.
It takes Chrome when calling the path directly about 6.5 seconds.
I know that I can split up this endpoint for performance gains, but out of curiosity, what are some other great options to improve the download speed of all this content?
Options I can think of so far:
1) Compressing the content, but then I would have to decompress it on the frontend
2) Using something like gRPC
Further info:
My Flask server is using WSGIServer from gevent, and the endpoint code is below. PROJECT_DATA_CACHE is the already-JSONified data that is returned:
#blueprint_2.route("/projects")
def getInitialProjectsData():
global PROJECT_DATA_CACHE
if PROJECT_DATA_CACHE:
return PROJECT_DATA_CACHE
else:
LOGGER.debug('No cache available for GET /projects')
updateProjectsCache()
return PROJECT_DATA_CACHE
Maybe you could stream the file? I cannot see any way to transfer a file 80,000 lines long without some kind of download or wait.
This would be an opportunity to compress and decompress it, like you suggested. Definitely make sure that the JSON is minified.
One way to minify a JSON: https://www.npmjs.com/package/json-minify
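On the compression point: if the response is gzipped via the standard Content-Encoding header, the browser decompresses it automatically, so there is no extra frontend work. A minimal sketch with the flask-compress extension (assumed installed via pip install flask-compress):

from flask import Flask
from flask_compress import Compress

app = Flask(__name__)
# Transparently gzips responses for clients that send Accept-Encoding: gzip;
# a 1.7 MB JSON payload typically shrinks considerably.
Compress(app)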
Streaming a file:
https://blog.al4.co.nz/2016/01/streaming-json-with-flask/
It also really depends on the project; maybe you could get the users to download it completely?
The best way to do this is to break your JSON into chunks and stream it by passing a generator to the Response. You can then render the data as you receive it or show a progress bar displaying the percentage that is done. I have an example of how to stream data as a file is being downloaded from AWS S3 here. That should point you in the right direction.
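A minimal sketch of that generator pattern, reusing the blueprint and cache from the question (the chunk size is arbitrary):

from flask import Response

@blueprint_2.route("/projects-stream")
def streamProjectsData():
    def generate():
        chunk_size = 64 * 1024  # arbitrary; tune for your payload
        data = PROJECT_DATA_CACHE
        for i in range(0, len(data), chunk_size):
            yield data[i:i + chunk_size]
    return Response(generate(), mimetype="application/json")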

Combining many log files in Amazon S3 and read in locally

I have a log file being stored in Amazon S3 every 10 minutes. I am trying to access weeks' and months' worth of these log files and read them into Python.
I have used boto to open and read every key and append all the logs together, but it's way too slow. I am looking for an alternative solution. Do you have any suggestions?
There is no functionality on Amazon S3 to combine or manipulate files.
I would recommend using the AWS Command-Line Interface (CLI) to synchronize files to a local directory using the aws s3 sync command. This can copy files in parallel and supports multi-part transfer for large files.
Running that command regularly can bring down a copy of the files, then your app can combine the files rather quickly.
If you do this from an Amazon EC2 instance, there is no charge for data transfer. If you download to a computer via the Internet, then Data Transfer charges apply.
Your first problem is that your naive solution is probably only using a single connection and isn't making full use of your network bandwidth. You can try to roll your own multi-threading support, but it's probably better to experiment with existing clients that already do this (s4cmd, aws-cli, s3gof3r).
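If you do want to roll your own, a minimal sketch of a multi-connection download with boto3 and a thread pool (the bucket name and prefix are hypothetical):

import concurrent.futures
import os
import boto3

BUCKET = "my-log-bucket"  # hypothetical
PREFIX = "logs/"          # hypothetical
s3 = boto3.client("s3")   # low-level clients are thread-safe

def download(key):
    s3.download_file(BUCKET, key, os.path.join("logs", os.path.basename(key)))

# List every key under the prefix, then fetch them in parallel
paginator = s3.get_paginator("list_objects_v2")
keys = [obj["Key"]
        for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX)
        for obj in page.get("Contents", [])]

os.makedirs("logs", exist_ok=True)
with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(download, keys))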
Once you're making full use of your bandwidth, there are then some further tricks you can use to boost your transfer speed to S3.
Tip 1 of this SumoLogic article has some good info on these first two areas of optimization.
Also, note that you'll need to modify your key layout if you hope to consistently get above 100 requests per second.
Given a year's worth of this log file is only ~50k objects, a multi-connection client on a fast ec2 instance should be workable. However, if that's not cutting it, the next step up is to use EMR. For instance, you can use S3DistCP to concatenate your log chunks into larger objects that should be faster to pull down. (Or see this AWS Big Data blog post for some crazy overengineering) Alternatively, you can do your log processing in EMR with something like mrjob.
Finally, there's also Amazon's new Athena product that allows you to query data stored in S3 and may be appropriate for your needs.
