Near real-time streaming of data from MongoDB to a data warehouse - python

We have a number of MongoDB collections that store data generated by the application (e-commerce order updates, deliveries, cancellations, new orders, etc.). Currently we follow a traditional ETL approach: a scheduled data pull (convert to a file in S3 / staging load) and a load into the data warehouse. As data volume grows this feels inefficient, since our reports lag at least a day behind compared to real-time streaming or similar newer ETL approaches. As a streaming option I first read about Apache Kafka, which is very popular, but the biggest challenge I'm facing is how to turn these MongoDB collections into Kafka topics.
I read MongoDB Streaming Out Inserted Data in Real-time (or near real-time). We are not using capped collections, so the solution recommended there won't work for us.
Can a MongoDB collection be a Kafka producer?
Is there a better way than Kafka to pull data from MongoDB in real time / near real time into the target DB / S3?
Note: I would prefer a Python solution that integrates easily into our current workflow, rather than Java/Scala.
Thanks
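
A minimal sketch of one possible approach, assuming MongoDB 3.6+ running as a replica set (required for change streams), pymongo, and the kafka-python client; the database, collection and topic names are placeholders:

from bson import json_util
from kafka import KafkaProducer
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
orders = client["shop"]["orders"]

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda doc: json_util.dumps(doc).encode("utf-8"),
)

# watch() yields one event per insert/update/replace/delete on the collection
with orders.watch(full_document="updateLookup") as stream:
    for change in stream:
        doc = change.get("fullDocument")  # absent for delete events
        if doc is not None:
            producer.send("orders-events", doc)

Change streams do not require capped collections, so they avoid the tailable-cursor limitation mentioned above.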

Related

Is there any way to replicate real-time streaming from Azure Blob Storage to Azure MySQL?

We can basically use Databricks as an intermediary, but I'm stuck on the Python script to replicate data from Blob Storage to Azure MySQL every 30 seconds; we are using CSV files here. The script needs to store the CSVs with the current timestamp.
There is no ready-made streaming option for MySQL in Spark/Databricks, as it is not a streaming source/sink technology.
In Databricks you can use writeStream with the .foreach() or .foreachBatch() option. Each micro-batch is handed to you as a temporary DataFrame, which you can save wherever you choose (so write it to MySQL), as in the sketch below.
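A rough PySpark sketch of that foreachBatch pattern (the storage path, credentials, schema and table name are placeholders, and it assumes the MySQL JDBC driver is available on the cluster):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def write_to_mysql(batch_df, batch_id):
    # each micro-batch arrives as a bounded DataFrame, so the ordinary JDBC writer can be used
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:mysql://myserver.mysql.database.azure.com:3306/mydb")
        .option("dbtable", "staging_table")
        .option("user", "my_user")
        .option("password", "my_password")
        .mode("append")
        .save())

(spark.readStream
    .format("csv")
    .option("header", "true")
    .schema("id INT, amount DOUBLE, updated_at TIMESTAMP")  # streaming CSV reads need an explicit schema
    .load("wasbs://container@account.blob.core.windows.net/incoming/")
    .writeStream
    .foreachBatch(write_to_mysql)
    .trigger(processingTime="30 seconds")
    .option("checkpointLocation", "/mnt/checkpoints/blob-to-mysql")
    .start())

The checkpoint location lets the stream resume after a restart and process only CSV files it has not already seen.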
Personally I would go for the simple solution: in Azure Data Factory it is enough to create two datasets (it can even be done without them) - one for MySQL, one for Blob Storage - and use a pipeline with a Copy activity to transfer the data.

How to read large data from a BigQuery table using a Cloud Run Python API, and what should the system config be?

I have created a Flask API in Python, deployed it as a container image on GCP Cloud Run, and run it through Cloud Scheduler. In my code I am reading a large amount of data (15 million rows and 20 columns) from BigQuery. I have set my system config to 8 GB RAM and 4 CPUs.
Problem 1: It takes too much time to read the data (about 2200 seconds).
import numpy as np
import pandas as pd
from pandas.io import gbq

query = """SELECT * FROM TABLE_SALES"""
df = gbq.read_gbq(query, project_id="project_name")  # pulls the whole table through pandas-gbq
Is there any efficient way to read the data from BQ?
Problem 2: My code stops working after reading the data. When I checked the logs, I got this:
error - 503
textPayload: "The request failed because either the HTTP response was malformed or connection to the instance had an error.
While handling this request, the container instance was found to be using too much memory and was terminated. This is likely to cause a new container instance to be used for the next request to this revision. If you see this message frequently, you may have a memory leak in your code or may need more memory. Consider creating a new revision with more memory."
One workaround is to increase the system config; if that is the solution, please let me know roughly what it would cost.
You can try a GCP Dataflow batch job to read through a large amount of data from BQ.
Depending on the complexity of your BigQuery query, you may want to consider the high-performance BigQuery Storage API: https://cloud.google.com/bigquery/docs/reference/storage/libraries
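A minimal sketch of reading the same query through the Storage API path, assuming the google-cloud-bigquery and google-cloud-bigquery-storage packages (plus pyarrow) are installed; the project and table names are taken from the question:

from google.cloud import bigquery

client = bigquery.Client(project="project_name")

query = "SELECT * FROM TABLE_SALES"

# create_bqstorage_client=True downloads the result over the BigQuery
# Storage API's gRPC streams instead of the paginated REST endpoint
df = client.query(query).result().to_dataframe(create_bqstorage_client=True)

That mainly addresses the read time. For the memory error, selecting only the columns you actually need instead of SELECT *, or processing the result in chunks rather than one 15-million-row DataFrame, is likely to matter more than adding RAM.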

Elasticsearch Data Insertion with Python

I'm brand new to using the Elastic Stack so excuse my lack of knowledge on the subject. I'm running the Elastic Stack on a Windows 10, corporate work computer. I have Git Bash installed for a bash cli, and I can successfully launch the entire Elastic Stack. My task is to take log data that is stored in one of our databases and display it on a Kibana dashboard.
From what my team and I have reasoned, I don't need to use Logstash, because the database that the logs are sent to is effectively our 'log stash', so using the Logstash service would be redundant. I found a nifty diagram on freeCodeCamp, and from what I gather, Logstash is just the intermediary for log retrieval from different services. So instead of using Logstash, since the log data is already in a database, I could just do something like this:
USER ---> KIBANA <---> ELASTICSEARCH <--- My Python Script <--- [DATABASE]
My Python script successfully calls our database and retrieves the data, and it has a function that molds the data into a dict object (as I understand it, Elasticsearch takes data in JSON format).
Now I want to insert all of that data into Elasticsearch - I've been reading the Elastic docs, and there's a lot of talk about indexing that isn't really indexing, and I haven't found any API calls I can use to plug the data right into Elasticsearch. All of the documentation I've found so far concerns the use of Logstash, but since I'm not using Logstash, I'm kind of at a loss here.
If there's anyone who can help me out and point me in the right direction I'd appreciate it. Thanks
-Dan
You ingest data into Elasticsearch using the Index API; it is basically a request using the PUT method.
To do that with Python you can use elasticsearch-py, the official Python client for Elasticsearch.
But sometimes what you need is easier to do with Logstash, since it can extract the data from your database, format it with its many filters, and send it to Elasticsearch.
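A minimal sketch with elasticsearch-py, using its bulk helper so the dicts your script already builds are sent in batches rather than one PUT per document (the host, index name and document shape are illustrative):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")

def fetch_log_records():
    # stand-in for your existing function that queries the database and returns dicts
    return [{"timestamp": "2021-01-01T00:00:00", "level": "INFO", "message": "hello"}]

actions = (
    {"_index": "app-logs", "_source": record}
    for record in fetch_log_records()
)
bulk(es, actions)

Once the documents are in the index, you create an index pattern for it in Kibana and build the dashboard on top of that.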

Update a row multiple times when streaming data from Google Datastore to BigQuery

We are trying to push Datastore entity updates to BigQuery as streaming input to provide real-time data analysis.
Each entity in Datastore will be updated multiple times a day. When we push the entities, I need to make sure that only the latest data ends up in the BigQuery record. How can I achieve this?
There is no built-in streaming path to go from Datastore to BigQuery as far as I know. What is supported is making a Datastore backup (exported to Cloud Storage) and loading the backup into BigQuery with a load job.
Instead of using a job to load data into BigQuery, you can also choose to stream your data into BigQuery one record at a time by using the tabledata().insertAll() method. This approach enables querying data without the delay of running a load job.
Generally streamed data is available for real-time analysis within a few seconds of the first streaming insertion into a table. However data in the streaming buffer may be temporarily unavailable. When data is unavailable, queries continue to run successfully, but they skip some of the data that is still in the streaming buffer.
For more details you can check the links below:
Link-1
Link-2
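For reference, a hedged sketch of the streaming path described above, using the google-cloud-bigquery client's insert_rows_json (which calls tabledata.insertAll under the hood); the table name and row shape are illustrative:

from google.cloud import bigquery

client = bigquery.Client()

rows = [
    {"entity_key": "order-123", "status": "shipped", "updated_at": "2021-06-01T12:00:00"},
]

errors = client.insert_rows_json(
    "my_project.my_dataset.datastore_entities",
    rows,
    row_ids=[row["entity_key"] for row in rows],  # becomes insertId, a best-effort dedup hint
)
if errors:
    raise RuntimeError(f"Streaming insert failed: {errors}")

Because streaming inserts are append-only, a common pattern for the "latest version only" requirement is to keep every update and pick the newest row per key at query time, e.g. with ROW_NUMBER() OVER (PARTITION BY entity_key ORDER BY updated_at DESC).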

How to efficiently process a large data response from a REST API?

One of our clients, who will be supplying data to us, has a REST-based API. This API fetches data from the client's big-data columnar store and dumps the data as a response to the requested query parameters.
We will be issuing queries like the one below:
http://api.example.com/biodataid/xxxxx
The challenge is that the response is quite large: for a given id it contains a JSON or XML response with at least 800-900 attributes. The client is refusing to change the service, for reasons I can't cite here. In addition, due to some constraints, we will get only a 4-5 hour window daily to download this data for about 25,000 to 100,000 ids.
I have read about synchronous vs. asynchronous handling of responses. What options are available for designing a data processing service that loads efficiently into a relational database? We use Python for data processing, MySQL as the current (more recent) data store, and HBase as the backend big-data store (recent and historical data). The goal is to retrieve this data, process it, and load it into either the MySQL database or the HBase store as fast as possible.
If you have built high-throughput processing services, any pointers would be helpful. Are there any resources for creating such services, with example implementations?
PS - If this question sounds too high level please comment and I will provide additional details.
I appreciate your response.
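To make the synchronous vs. asynchronous distinction concrete, here is a rough sketch (the endpoint is from the question; the concurrency limit and the JSON assumption are placeholders, not recommendations) of downloading the per-id responses concurrently with asyncio/aiohttp, so the 4-5 hour window is spent waiting on many requests at once rather than one at a time:

import asyncio
import aiohttp

API_URL = "http://api.example.com/biodataid/{id}"
CONCURRENCY = 20  # tune against what the client's API tolerates

async def fetch_one(session, semaphore, record_id):
    async with semaphore:
        async with session.get(API_URL.format(id=record_id)) as resp:
            resp.raise_for_status()
            return record_id, await resp.json()  # assuming the JSON variant

async def fetch_all(record_ids):
    semaphore = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_one(session, semaphore, rid) for rid in record_ids]
        for finished in asyncio.as_completed(tasks):
            record_id, payload = await finished
            # flatten the 800-900 attributes here and append them to a batch
            # that a separate worker bulk-loads into MySQL or HBase
            print(record_id, len(payload))

if __name__ == "__main__":
    asyncio.run(fetch_all(["xxxxx"]))

Keeping the download loop separate from the parsing/loading step (for example, a queue feeding a pool of loader processes) is usually what keeps the database writes from becoming the bottleneck.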
