While working with the JavaMail API, I used the two properties below to create a session and then download emails from the server. Setting these properties causes data to be fetched from the server in larger chunks, so large message bodies download efficiently. I am looking for a similar option in Python so that data is fetched from the server in larger chunks. Can someone help me achieve this in Python?
props.setProperty("mail.imaps.partialfetch","true");
props.setProperty("mail.imaps.fetchsize", "2000000");
What would be the best way to solve the following problem with Python?
I have a real-time data stream coming into my object storage from a user application (JSON files being stored in Amazon S3).
Upon receipt of each JSON file, I have to process the data in the file within a certain time (1 s in this instance) and generate a response that is sent back to the user. The data is processed by a simple Python script.
My issue is that the real-time stream can generate even a few hundred JSON files at the same time, all of which need to run through my Python script, and I don't know the best way to approach this.
I understand that one way to tackle this would be trigger-based Lambdas that execute a job for every file as it is uploaded from the real-time stream in a serverless environment; however, this option seems quite expensive compared to having a single server instance running and somehow triggering jobs inside it.
Any advice is appreciated. Thanks.
Serverless can actually be cheaper than using a server. It is much cheaper when there are periods of no activity because you don't need to pay for a server doing nothing.
The hardest part of your requirement is sending the response back to the user. If an object is uploaded to S3, there is no easy way to send back a response, and it isn't even obvious which user sent the file.
You could process the incoming file and then store a response back in a similarly-named object, and the client could then poll S3 for the response. That requires the upload to use a unique name that is somehow generated.
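A minimal sketch of the polling side, assuming boto3; the bucket name and the uploads/-to-responses/ key convention are placeholder assumptions:

import time
import boto3

s3 = boto3.client("s3")

def wait_for_response(bucket, upload_key, timeout=10.0, interval=0.5):
    # Assumed convention: the response is written under responses/ with the same basename.
    response_key = upload_key.replace("uploads/", "responses/", 1)
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            obj = s3.get_object(Bucket=bucket, Key=response_key)
            return obj["Body"].read()
        except s3.exceptions.NoSuchKey:
            time.sleep(interval)  # not there yet, keep polling
    return None  # timed out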
An alternative would be for the data to be sent to AWS API Gateway, which can trigger an AWS Lambda function and then directly return the response to the requester. No server required, automatic scaling.
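A sketch of what that Lambda function could look like with the API Gateway proxy integration; the processing step is a placeholder for your existing script:

import json

def lambda_handler(event, context):
    # With the proxy integration, API Gateway puts the request body here.
    payload = json.loads(event.get("body") or "{}")

    # ... run your existing processing logic on `payload` ...
    result = {"status": "processed", "fields": len(payload)}

    # Returning this structure makes API Gateway send the response straight back to the caller.
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(result),
    }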
If you wanted to use a server, then you'd need a way for the client to send a message to the server with a reference to the JSON object in S3 (or with the data itself). The server would need to be running a web server that can receive the request, perform the work and provide back the response.
Bottom line: Think about the data flow first, rather than the processing.
I have a Python script running continuously as a WebJob on Azure. Roughly every 3 minutes it generates a new set of data. Once the data is generated, we want to send it to the UI (Angular) in real time.
What would be the ideal (fastest) approach to get this functionality?
The generated data is JSON containing 50 key-value pairs. I read about SignalR, but can I use SignalR directly with my Python code? Is there any other approach, like sockets?
What you need is a WebSocket: a protocol that allows back-end servers to push data to connected web clients.
There are implementations of WebSocket for python (a quick search found me this one).
Once you have a WebSocket going, you can create a service in your Angular project to handle what your Python service pushes, most likely using observables.
Hopefully this sets you on the right path.
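As a concrete starting point, a minimal sketch of a Python push server using the third-party websockets package (the port, payload and interval are placeholders; recent versions of the library pass only the connection to the handler, older ones also pass a path argument):

import asyncio
import json
import websockets  # pip install websockets

async def push_updates(websocket):
    while True:
        # Placeholder for the ~50 key/value pairs the webjob generates.
        payload = {f"key_{i}": i for i in range(50)}
        await websocket.send(json.dumps(payload))
        await asyncio.sleep(180)  # roughly every 3 minutes

async def main():
    async with websockets.serve(push_updates, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever

asyncio.run(main())

On the Angular side, the browser's native WebSocket (or an RxJS webSocket subject) can connect to ws://your-host:8765 and feed each message into an observable.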
I have a log file being stored in Amazon S3 every 10 minutes. I am trying to access weeks' and months' worth of these log files and read them into Python.
I have used boto to open and read every key and append all the logs together, but it's way too slow. I am looking for an alternative solution. Do you have any suggestions?
There is no functionality on Amazon S3 to combine or manipulate files.
I would recommend using the AWS Command-Line Interface (CLI) to synchronize files to a local directory using the aws s3 sync command. This can copy files in parallel and supports multi-part transfer for large files.
Running that command regularly will bring down a copy of the files, and your app can then combine them rather quickly.
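For example, once aws s3 sync has pulled everything into a local directory, combining the files could be as simple as this sketch (the directory and output names are placeholders):

from pathlib import Path

log_dir = Path("logs")  # wherever `aws s3 sync` put the files
with open("combined.log", "wb") as out:
    for path in sorted(log_dir.rglob("*")):
        if path.is_file():
            out.write(path.read_bytes())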
If you do this from an Amazon EC2 instance, there is no charge for data transfer. If you download to a computer via the Internet, then Data Transfer charges apply.
Your first problem is that your naive solution is probably using only a single connection and isn't making full use of your network bandwidth. You could try to roll your own multi-threading support, but it's probably better to experiment with existing clients that already do this (s4cmd, aws-cli, s3gof3r).
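If you do want to roll your own, a rough sketch with boto3 and a thread pool might look like this (the bucket name, prefix and worker count are placeholders):

import concurrent.futures
import boto3

BUCKET = "my-log-bucket"   # placeholder
PREFIX = "logs/"           # placeholder
s3 = boto3.client("s3")

def download(key):
    # Flatten the key into a local filename; adjust to taste.
    s3.download_file(BUCKET, key, key.replace("/", "_"))
    return key

# List every object under the prefix, then fetch them over many connections.
keys = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
    keys.extend(obj["Key"] for obj in page.get("Contents", []))

with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
    for key in pool.map(download, keys):
        print("fetched", key)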
Once you're making full use of your bandwidth, there are then some further tricks you can use to boost your transfer speed to S3.
Tip 1 of this SumoLogic article has some good info on these first two areas of optimization.
Also, note that you'll need to modify your key layout if you hope to consistently get above 100 requests per second.
Given a year's worth of this log file is only ~50k objects, a multi-connection client on a fast ec2 instance should be workable. However, if that's not cutting it, the next step up is to use EMR. For instance, you can use S3DistCP to concatenate your log chunks into larger objects that should be faster to pull down. (Or see this AWS Big Data blog post for some crazy overengineering) Alternatively, you can do your log processing in EMR with something like mrjob.
Finally, there's also Amazon's new Athena product that allows you to query data stored in S3 and may be appropriate for your needs.
One of our clients, who will be supplying data to us, has a REST-based API. This API fetches data from the client's big-data columnar store and dumps it as the response to the requested query parameters.
We will be issuing queries like the one below:
http://api.example.com/biodataid/xxxxx
The challenge is that the response is quite huge: for a given id it contains a JSON or XML response with at least 800-900 attributes for that single id. The client is refusing to change the service, for reasons I can't cite here. In addition, due to some constraints, we will get only a 4-5 hour window daily to download this data for about 25,000 to 100,000 ids.
I have read about synchronous vs. asynchronous handling of responses. What options are available for designing a data processing service that loads this efficiently into a relational database? We use Python for data processing, MySQL as the current (more recent) data store, and HBase as the back-end big-data store (recent and historical data). The goal is to retrieve this data, process it, and load it into either the MySQL database or the HBase store as fast as possible.
If you have built high-throughput processing services, any pointers will be helpful. Are there any resources with example implementations for creating such services?
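For reference, a rough sketch of the asynchronous download side I am considering, assuming aiohttp and a JSON response; the concurrency limit and the id range are placeholders, and the loading into MySQL/HBase is left as a stub:

import asyncio
import aiohttp

API = "http://api.example.com/biodataid/{}"
CONCURRENCY = 50  # tune to what the provider tolerates

async def fetch(session, sem, record_id):
    async with sem:
        async with session.get(API.format(record_id)) as resp:
            return record_id, await resp.json()

async def main(ids):
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, sem, i) for i in ids]
        for done in asyncio.as_completed(tasks):
            record_id, payload = await done
            # hand `payload` off to the loader that writes to MySQL / HBase
            print(record_id, len(payload))

asyncio.run(main(range(25000)))  # placeholder id range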
PS - If this question sounds too high-level, please comment and I will provide additional details.
I appreciate your response.
I set up a server using CherryPy to which files can be uploaded. However, I want to prevent files from being uploaded if they exceed a certain size. I searched a bit but was not able to find an answer. Is there a way to achieve this with CherryPy, or in general?
cherrypy._cpserver.Server.max_request_body_size is probably what you want.
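For example, a minimal sketch that caps uploads at an assumed 10 MB:

import cherrypy

class Root:
    @cherrypy.expose
    def upload(self, file):
        return "received %s" % file.filename

cherrypy.config.update({
    # Requests with a larger body are rejected with 413 Request Entity Too Large.
    "server.max_request_body_size": 10 * 1024 * 1024,  # assumed 10 MB cap
})
cherrypy.quickstart(Root())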
Before an HTTP client uploads a file, it must specify the size of its message body in the HTTP headers. Based on that, you can immediately reject the upload attempt with an HTTP 413 Request Entity Too Large error.
It's possible to circumvent this by declaring a certain size and then uploading more, but most servers are smart enough to stop reading once they've hit the maximum they intend to accept from a client.
Unfortunately, this isn't always the case, because there is also a method of HTTP upload called chunked encoding, in which the client (or server, depending on direction) is not required to advertise the size of the upload. You mostly see this on the server side as a way to stream data to clients.