How to efficiently process a large data response from a REST API? - python

One of our clients who will be supplying data to us has a REST-based API. This API fetches data from the client's big-data columnar store and dumps it as the response to the requested query parameters.
We will be issuing queries as below
http://api.example.com/biodataid/xxxxx
The challenge is that the response is quite large: for a given id it contains a JSON or XML response with at least 800-900 attributes for that single id. The client is refusing to change the service for reasons I can't cite here. In addition, due to some constraints, we will get only a 4-5 hour window daily to download this data for about 25,000 to 100,000 ids.
I have read about synchronous vs. asynchronous handling of responses. What options are available for designing a data processing service that loads this data efficiently into a relational database? We use Python for data processing, MySQL as the current (more recent) data store, and HBase as the backend big data store (recent and historical data). The goal is to retrieve this data, process it, and load it into either the MySQL database or the HBase store as fast as possible.
If you have built high-throughput processing services, any pointers would be helpful. Are there any resources for creating such services, with example implementations?
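For reference, the rough direction I am considering is bounded-concurrency asynchronous downloads with asyncio and aiohttp, handing each payload off to a loader afterwards. This is only a sketch; the endpoint format, concurrency limit, and id range are placeholders, not the client's actual API.

import asyncio

import aiohttp  # pip install aiohttp

API_URL = "http://api.example.com/biodataid/{id}"  # placeholder endpoint
CONCURRENCY = 50  # placeholder; tune to what the API and network tolerate


async def fetch_one(session, semaphore, record_id):
    """Download the JSON payload for a single id."""
    async with semaphore:
        async with session.get(API_URL.format(id=record_id)) as resp:
            resp.raise_for_status()
            return record_id, await resp.json()


async def fetch_all(record_ids):
    """Fetch all ids concurrently, bounded by a semaphore."""
    semaphore = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_one(session, semaphore, rid) for rid in record_ids]
        for coro in asyncio.as_completed(tasks):
            record_id, payload = await coro
            # Hand off to a MySQL/HBase loader here; kept as a print in this sketch.
            print(record_id, len(payload))


if __name__ == "__main__":
    ids = range(1, 1001)  # placeholder id range
    asyncio.run(fetch_all(ids))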
PS - If this question sounds too high level, please comment and I will provide additional details.
I appreciate your response.

Related

Extract Data from Qlik API using Python

I have a requirement where I need to fetch data from the Qlik API in JSON format (just as we did with the Power BI dataset) and convert it to CSV format.
Essentially, attached is the kind of data I'm trying to extract from the Qlik Engine/QRS API.
Is there any way of achieving this requirement?
Communication with the Qlik Engine is done via web sockets (JSON-RPC).
Please have a look at the official documentation.
In your case the workflow should be:
establish a communication channel with the Engine
connect to the app that contains the data
construct a table object and provide the required measures/dimensions in the definition
get the layout of the table/object
extract the data from the layout (if there are more than 10 000 data cells you'll have to implement paging)
once you have all the data, do whatever you want with it
There are a few basic examples of how to at least connect with Python. For example: Qlik Sense: call Qlik Sense Engine API with Python
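A minimal sketch of that workflow with the websocket-client package, assuming Qlik Sense Desktop listening on its default port; the app path and field name are placeholders, and authentication/certificates are left out.

import json

from websocket import create_connection  # pip install websocket-client

# Placeholder Engine URL and app id; adjust for your deployment.
ENGINE_URL = "ws://localhost:4848/app/"
APP_ID = "C:\\Users\\me\\Documents\\Qlik\\Sense\\Apps\\MyApp.qvf"

ws = create_connection(ENGINE_URL)
ws.recv()  # discard the initial OnConnected notification


def call(method, handle, params, request_id):
    """Send one JSON-RPC request to the Engine and return the parsed reply."""
    ws.send(json.dumps({
        "jsonrpc": "2.0", "id": request_id,
        "method": method, "handle": handle, "params": params,
    }))
    return json.loads(ws.recv())


# 1. connect to the app that contains the data
doc = call("OpenDoc", -1, [APP_ID], 1)
doc_handle = doc["result"]["qReturn"]["qHandle"]

# 2. construct a table (hypercube) object with the required dimensions/measures
cube_def = {
    "qInfo": {"qType": "my-cube"},
    "qHyperCubeDef": {
        "qDimensions": [{"qDef": {"qFieldDefs": ["SomeField"]}}],  # placeholder field
        "qInitialDataFetch": [{"qTop": 0, "qLeft": 0, "qHeight": 100, "qWidth": 1}],
    },
}
obj = call("CreateSessionObject", doc_handle, [cube_def], 2)
obj_handle = obj["result"]["qReturn"]["qHandle"]

# 3. get the layout and extract the data
#    (page with GetHyperCubeData if there are more than 10 000 cells)
layout = call("GetLayout", obj_handle, [], 3)
for page in layout["result"]["qLayout"]["qHyperCube"]["qDataPages"]:
    for row in page["qMatrix"]:
        print([cell.get("qText") for cell in row])

ws.close()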

How to fetch data from server in larger chunks in Python

While working with the JavaMail API, I used the two properties below to create a session and then download emails from the server. Setting these properties causes data to be fetched from the server in larger chunks, and hence large message bodies to be downloaded efficiently. I am looking for a similar option in Python so that data is fetched from the server in larger chunks. Can someone help me achieve this in Python?
props.setProperty("mail.imaps.partialfetch","true");
props.setProperty("mail.imaps.fetchsize", "2000000");

Update a row multiple times when streaming data from Google Datastore to BigQuery

We are trying to push Datastore entity updates to BigQuery as streaming input to provide real-time data analysis.
Each entity in the Datastore will be updated multiple times a day. When we push the entities, I need to make sure that only the latest data ends up in the BigQuery record. How can I achieve this?
There is no built-in streaming path to go from Datastore to BigQuery as far as I know. What is supported is making a Datastore backup (exported to Cloud Storage) and loading the backup into BigQuery with a load job.
Instead of using a job to load data into BigQuery, you can also choose to stream your data into BigQuery one record at a time by using the tabledata().insertAll() method. This approach enables querying data without the delay of running a load job.
Generally streamed data is available for real-time analysis within a few seconds of the first streaming insertion into a table. However data in the streaming buffer may be temporarily unavailable. When data is unavailable, queries continue to run successfully, but they skip some of the data that is still in the streaming buffer.
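A minimal streaming-insert sketch with the google-cloud-bigquery client library; the project, dataset, table, and row fields are placeholders. Note that streaming inserts append rows, so keeping only the latest version of an entity would typically be handled at query time (for example by selecting the newest row per entity key).

from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# Placeholder table reference: project.dataset.table
table_id = "my-project.my_dataset.datastore_entities"

# Placeholder rows; in practice these come from your Datastore update handler.
rows = [
    {"entity_key": "order-123", "status": "shipped", "updated_at": "2020-01-01T10:00:00"},
]

# Stream the rows into the table; a non-empty return value lists per-row errors.
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Insert errors:", errors)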
For more details you can check the links below:
Link-1
Link-2

Near real-time streaming of data from MongoDB to a data warehouse

We have a number of MongoDB collections which store data generated by the application (e-commerce order updates, deliveries, cancellations, new orders, etc.). Currently we follow a traditional ETL approach with scheduled data pulls (convert to files in S3 / staging load) and load into the DW. As the data volume increases, this feels inefficient, since we lag at least one day in generating reports compared to real-time streaming or similar newer ETL approaches. So as a streaming option I first read about Apache Kafka, which is very popular. But the biggest challenge we are facing is how to convert these MongoDB collections into Kafka topics.
I read this: MongoDb Streaming Out Inserted Data in Real-time (or near real-time). We are not using capped collections, so that recommended solution won't work for us.
Can a MongoDB collection be a Kafka producer?
Is there any better way to pull data from MongoDB in real time / near real time to the target DB / S3, other than Kafka?
Note: I would prefer a Python solution that can easily integrate into our current workflow, rather than Java/Scala.
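For reference, the direction I am exploring is a small Python bridge that tails a MongoDB change stream and publishes each change to a Kafka topic with kafka-python. Change streams need MongoDB 3.6+ and a replica set; the connection strings, database, collection, and topic below are placeholders.

import json

from kafka import KafkaProducer        # pip install kafka-python
from pymongo import MongoClient        # pip install pymongo

# Placeholder connection details
mongo = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
collection = mongo["shop"]["orders"]

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda doc: json.dumps(doc, default=str).encode("utf-8"),
)

# Change streams (MongoDB 3.6+, replica set required) emit inserts/updates/deletes
# without needing a capped collection or oplog tailing.
with collection.watch(full_document="updateLookup") as stream:
    for change in stream:
        producer.send("orders-changes", value={
            "operation": change["operationType"],
            "document": change.get("fullDocument"),
        })
        producer.flush()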
Thanks

Google App Engine request logs are duplicated with small time difference

Looking into the GAE logs of a Python project, I found a number of duplicated or even triplicated entries with very small differences in their timestamps.
These are requests triggered by an iPhone device which sends only unique data, so it seems extremely unlikely that the duplication comes from the phone, especially if you take into account the differences in time between the requests.
00:53:32.139 POST 200 93B 1.6 s APPNAME/1.2 CFNetwork/758.4.3 Darwin/15.5.0 /logData
00:53:32.142 POST 200 93B 930 ms APPNAME/1.2 CFNetwork/758.4.3 Darwin/15.5.0 /logData
00:53:32.279 POST 200 93B 835 ms APPNAME/1.2 CFNetwork/758.4.3 Darwin/15.5.0 /logData
The requests are the same (source IP, headers and so on) with equal data inside:
{u'version': 1.2, u'data': u'some data', u'user': u'0a9b....0a57'}
And the actual question is "How is that possible"?
Could there be an explanation of such short intervals between duplicated logs?
It happened because of asynchronous tasks: when the iPhone gets a location from its location manager and the OS has some free time to run our code, the application fires an HTTP request to the /logData endpoint. At the same time, user activity causes one more HTTP request. Data in local variables is removed only after acknowledgment that the data was received (HTTP response 200). Since the requests were triggered almost simultaneously, both of them got into the database and into the GAE logs.
All data stored by Google is triplicated so that there is always a backup, even if one of the nodes goes down.
This is a big factor, because as your data grows, the likelihood of losing the data becomes very real. Most companies deal with this by making multiple copies of the data in different datacenters in different locations. Google also very obviously replicates important data at least 3x, as the GFS paper suggests.
-- https://www.quora.com/How-does-Google-store-their-data
So it's not surprising that the logs are triplicated onto different servers. Those servers might have slightly different timestamps as you have observed. For me, the question is why you got all three copies of the log entry.
