We are thinking about creating a dynamic, scheduled data-fetching application that pulls from a number of data sources (REST API calls). The considerations are as follows:
The user shall be able to configure API/web service endpoints, the frequency of fetching, and the response content type (JSON or CSV).
Once the user completes the configuration part, a job queue will be created programmatically.
A scheduler framework shall be used to make requests to the endpoints and push the responses into the respective queues. We are thinking of a queue here both to preserve the order of the responses and to act as intermediate storage for the raw responses from the endpoints.
The items stored in the queues shall be processed using Python/pandas. We are planning to use a NoSQL DB as storage for this.
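For illustration, here is a rough sketch of what we have in mind, assuming Celery with a RabbitMQ broker and celery beat for scheduling; the endpoints, intervals, and the pandas/NoSQL step are only placeholders, not a final design.

```python
# Minimal sketch (assumptions only): each user-configured endpoint becomes a
# periodic fetch task, and the raw response is pushed to a per-endpoint queue
# so order is preserved before the pandas processing step.
import io

import pandas as pd
import requests
from celery import Celery

app = Celery("fetcher", broker="amqp://localhost")

# In the real application this would come from the user's configuration, not a literal.
ENDPOINTS = {
    "prices": {"url": "https://example.com/api/prices", "seconds": 300, "type": "json"},
    "stock": {"url": "https://example.com/api/stock.csv", "seconds": 3600, "type": "csv"},
}

@app.task
def fetch(name):
    cfg = ENDPOINTS[name]
    resp = requests.get(cfg["url"], timeout=30)
    resp.raise_for_status()
    # Hand the raw payload to the endpoint's own queue; a worker consuming that
    # queue in order acts as the intermediate storage described above.
    process_raw.apply_async(args=(name, resp.text), queue=f"raw_{name}")

@app.task
def process_raw(name, payload):
    # Parse with pandas; the write to the NoSQL store is omitted here.
    if ENDPOINTS[name]["type"] == "csv":
        df = pd.read_csv(io.StringIO(payload))
    else:
        df = pd.read_json(io.StringIO(payload))
    print(df.head())

@app.on_after_configure.connect
def setup_schedules(sender, **kwargs):
    # One beat entry per configured endpoint, at the user-chosen frequency.
    for name, cfg in ENDPOINTS.items():
        sender.add_periodic_task(cfg["seconds"], fetch.s(name), name=f"fetch-{name}")
```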
Question
For this purpose, is it better to use Celery or RabbitMQ? We are thinking of using Celery as it has a relatively simple implementation.
Any thoughts on this are greatly appreciated.
Thank you.
Related
I am wondering about the available solutions to my problem. I need to retrieve data from an API every (preferably) 200 ms and save this data to the database, since it will then be processed by another service. I wanted to base my solution on RabbitMQ and task queuing: my API would let you add or delete a task, and each task fetches data every 200 ms and adds it to the database. There may be several such tasks, though not many. While I know that the latency associated with the database cannot be avoided, I do not know whether the RabbitMQ solution is optimal in this case. Maybe someone has experience and can suggest a better approach to this problem? My API is based on Python and FastAPI.
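To make this concrete, here is roughly what the add/delete-task API could look like if I skipped the broker entirely and just used asyncio background tasks inside FastAPI; the names and the DB write are placeholders, and I am not sure whether this or RabbitMQ is the better fit.

```python
# Minimal sketch (illustrative only): FastAPI endpoints that start/stop a
# background task polling an external API every 200 ms; the DB write is a stub.
import asyncio

import httpx
from fastapi import FastAPI, HTTPException

app = FastAPI()
running_tasks: dict[str, asyncio.Task] = {}

async def save_to_db(record):
    # Stand-in for the real persistence layer (or a RabbitMQ producer).
    print("would save:", record)

async def poll(source_url: str):
    async with httpx.AsyncClient() as client:
        while True:
            resp = await client.get(source_url, timeout=5)
            await save_to_db(resp.json())
            await asyncio.sleep(0.2)  # the 200 ms interval

@app.post("/tasks/{name}")
async def add_task(name: str, source_url: str):
    if name in running_tasks:
        raise HTTPException(status_code=409, detail="task already exists")
    running_tasks[name] = asyncio.create_task(poll(source_url))
    return {"status": "started"}

@app.delete("/tasks/{name}")
async def delete_task(name: str):
    task = running_tasks.pop(name, None)
    if task is None:
        raise HTTPException(status_code=404, detail="no such task")
    task.cancel()
    return {"status": "stopped"}
```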
A bit of context:
I want to write an algorithm that accepts tickets from the client, sorts them by some constraints, handles them, and replies back to the client with the results.
I did some research and thought a REST API in Python was a good idea. But as I explored it, I found out that it is usually built to handle one request at a time.
Is there a way to add tasks (REST API requests) to a queue, sort them, execute them with workers, and reply back to clients once processing is done?
I can suggest three ways to do that.
Use a database to store the request content, the constraint, and a status of 'pending'. Later, when you want to trigger the processing of the requests, just retrieve them in sorted order by your constraint and update the status to 'processed' (see the sketch after this list).
You can use a Redis task queue with Flask. See this article: https://realpython.com/flask-by-example-implementing-a-redis-task-queue/
You can also use the Celery module with Flask. See the documentation: https://flask.palletsprojects.com/en/1.1.x/patterns/celery/
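For example, option 1 could look roughly like this, using plain sqlite3 and made-up names just to show the 'pending'/'processed' flow; swap in your real database and handler.

```python
# Minimal sketch of the database-as-queue idea: incoming tickets are stored as
# 'pending', then processed in sorted order and marked 'processed'.
import sqlite3

conn = sqlite3.connect("tickets.db")
conn.execute("""CREATE TABLE IF NOT EXISTS tickets
                (id INTEGER PRIMARY KEY, content TEXT, priority INTEGER, status TEXT)""")

def submit(content, priority):
    # Called from the request handler: just record the ticket as pending.
    conn.execute("INSERT INTO tickets (content, priority, status) VALUES (?, ?, 'pending')",
                 (content, priority))
    conn.commit()

def handle(content):
    print("handling", content)  # placeholder for the real processing

def process_pending():
    # Called later (worker / cron): handle tickets sorted by the constraint.
    rows = conn.execute(
        "SELECT id, content FROM tickets WHERE status = 'pending' ORDER BY priority").fetchall()
    for ticket_id, content in rows:
        handle(content)
        conn.execute("UPDATE tickets SET status = 'processed' WHERE id = ?", (ticket_id,))
    conn.commit()
```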
I need some direction as to how to achieve the following functionality using Django.
I want my application to enable multiple users to submit jobs to make calls to an API.
Each user job will require multiple API calls and will store the results in a db or a file.
Each user should be able to submit multiple jobs.
In case of some failure, such as the network being blocked or the API not returning results, I want the application to pause for a while and then resume completing that job.
Basically I want the application to pick up from where it left off.
Any ideas on how I could implement this, any technologies such as Celery I should be looking at, or even an open-source project where I can learn how to do this would be a great help.
You can do this with RabbitMQ and Celery.
This post might be helpful.
https://medium.com/#ffreitasalves/executing-time-consuming-tasks-asynchronously-with-django-and-celery-8578eebab356
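To illustrate the pause-and-resume part, a Celery task with retries could look roughly like this; it assumes Celery is already wired into your Django project, and the endpoint and storage details are placeholders.

```python
# tasks.py - minimal sketch; assumes Celery is configured for the Django project.
import requests
from celery import shared_task

@shared_task(bind=True, max_retries=5, default_retry_delay=60)
def fetch_api_data(self, endpoint_url):
    """Call an external API; on failure, pause for a while and retry the job."""
    try:
        response = requests.get(endpoint_url, timeout=10)
        response.raise_for_status()
    except requests.RequestException as exc:
        # Requeues this task to run again after default_retry_delay seconds,
        # which gives you the "pick up where it left off" behaviour per call.
        raise self.retry(exc=exc)
    # Store the result in your DB model or a file here.
    return response.json()
```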
I am building an application in Django that collects hotel information from various sources and converts this data to a uniform format. Thereafter I need to expose an API using django-rest-framework to allow access from web apps and devices.
So, for example, if I have 4 sources:
[HotelPlus, xHotelService, HotelSignup, HotelSource]
So please let me know the best implementation practice in terms of Django. Being a PHP developer, I would prefer to do this by writing custom third-party services implementing a common interface, so that adding more sources becomes easy. That way I only need to call an execute() method from the cron task and the rest is done by the service controller (fetching the feed and populating it into the database).
But I am new to Python/Django, so I don't have much idea whether creating services or middleware is the right fit for this task.
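To show what I mean by the interface approach, here is a rough Python sketch with made-up names (in PHP I would do something similar with one service class per source).

```python
# Rough sketch of the interface idea in Python (all names are made up): each
# source implements fetch()/to_uniform(), and the scheduled job loops over the
# registered sources.
from abc import ABC, abstractmethod

class HotelSource(ABC):
    @abstractmethod
    def fetch(self):
        """Pull the raw feed from the external service."""

    @abstractmethod
    def to_uniform(self, raw):
        """Convert the raw feed to the application's uniform format."""

class HotelPlusSource(HotelSource):
    def fetch(self):
        return [{"name": "Example Hotel", "stars": 4}]  # stand-in for the real API call

    def to_uniform(self, raw):
        return [{"hotel_name": r["name"], "rating": r["stars"]} for r in raw]

SOURCES = [HotelPlusSource()]

def execute():
    # This is what the periodic job (cron or celery beat) would call.
    for source in SOURCES:
        for record in source.to_uniform(source.fetch()):
            print("would save:", record)  # replace with the ORM save
```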
For fetching data from the sources you will need dedicated worker processes and a broker, so that your main Django process won't be blocked. You can use Celery for that, and it already supports Django.
After writing the tasks for fetching and formatting the data, you will need a scheduler to call these tasks periodically. You can use celery beat for that.
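For example, a beat schedule could look roughly like this; the app name, broker URL, task path, and timing are all illustrative.

```python
# celery.py - minimal sketch of a beat schedule; names and timings are made up.
from celery import Celery
from celery.schedules import crontab

app = Celery("hotels", broker="redis://localhost:6379/0")

app.conf.beat_schedule = {
    "fetch-hotelplus-hourly": {
        "task": "feeds.tasks.fetch_source",  # the fetch/format task you write per source
        "schedule": crontab(minute=0),       # top of every hour
        "args": ("HotelPlus",),
    },
}
```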
I was wondering what the 'best' way of passing data between views is. Is it better to create invisible fields and pass them using POST, or should I encode them in my URLs? Or is there a better/easier way of doing this? Sorry if this question is stupid, I'm pretty new to web programming :)
Thanks
There are different ways to pass data between views. Actually, this is not much different from the problem of passing data between two different scripts, and of course some concepts of inter-process communication come in as well. Some things that come to mind are:
GET request - First request hits view1 -> view1 sends data to the browser -> the browser redirects to view2 (with the data in the query string).
POST request - (as you suggested) Same flow as above, but suitable when more data is involved.
Django session variables - This is the simplest to implement (see the sketch after this list).
Client-side cookies - Can be used, but there are limits on how much data can be stored.
Shared memory at the web server level - Tricky, but can be done.
REST APIs - If you can have a stand-alone server, then that server can expose REST APIs to invoke views.
Message queues - Again, if a stand-alone server is possible, message queues would work: the first view (API) takes requests and pushes them onto a queue, and some other process pops messages off and hits your second view (another API). This decouples the first and second view APIs and can possibly manage load better.
Cache - Maybe a cache like memcached can act as a mediator. But if one is going this route, it's better to use Django sessions, as they hide a whole lot of implementation details; if scale is a concern, memcached or Redis are good options.
Persistent storage - Store data in some persistent storage mechanism like MySQL. This decouples the request-taking part (probably a client-facing API) from the processing part by having a DB in the middle.
NoSQL storage - If the write rate is on the order of hundreds of thousands per second, MySQL performance would become a bottleneck (there are ways around this by tweaking the MySQL config, but it's not easy). In that case NoSQL DBs could be an alternative, e.g. DynamoDB, Redis, HBase, etc.
Stream processing - Something like Storm or AWS Kinesis could be an option if your use case is real-time computation. In fact, you could use AWS Lambda in the middle as a serverless compute module that reads off the stream and calls your second view API.
Write data into a file - Then the next view can read from that file (really ugly). This should probably never be done; I'm only listing it as something to avoid.
Can't think of any more. Will update if I get any. Hope this helps in some way.
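As an example of the session option (the simplest one above), here is a rough sketch with made-up view and template names.

```python
# views.py - minimal sketch of handing data from one view to another via the session.
from django.shortcuts import redirect, render

def view1(request):
    # Stash whatever needs to travel to the next view in the session.
    request.session["payload"] = {"ticket_id": 42, "priority": "high"}
    return redirect("view2")  # assumes a URL pattern named "view2"

def view2(request):
    # Read it back (and clear it) in the second view.
    payload = request.session.pop("payload", {})
    return render(request, "result.html", {"payload": payload})
```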