Django + Scrapy multi scrapers architecture


Recently I took over a Django project in which one component is a large set of Scrapy scrapers (they are core functionality). The scrapers simply feed the database several times a day, and the Django web app uses this data.
The scrapers have direct access to the Django models, but in my opinion this is not the best idea (mixed responsibilities: Django should act as a web app, not also host the scrapers, shouldn't it?). For example, after such a split the scrapers could run serverless, saving money and being spawned only when needed.
I see the scrapers as at least a separate component in the architecture. But if I separated the scrapers from the Django website, I would need to populate the DB from there as well, and a change to the model in either the Django web app or the scraping app would require a matching change in the other.
I haven't really seen any articles about splitting these apps.
What are the best practices here? Is it worth splitting? How would you organise deployment to a cloud provider (e.g. AWS)?
Thank you

Well, this is a big discussion and I have the same "good problem".
Short answer:
If you want to separate them, I suggest separating the logic from the data by using different schemas. I have done it before and it is a good approach.
Long answer:
The questions are:
1. Once you gather information from the scrapers, do you do something with it (aggregation, treatment, or anything else)?
If the answer is yes, you can separate it into two DBs: one with the raw information and one with the treated information (which is the one shared with Django).
If the answer is no, I don't see any reason to separate it. In the end, Django is only the visualizer of the data.
2. Does the Django website store a lot of its own data that, for the sake of single responsibility, you want to keep separate from the scraped data?
If the answer is yes, separate it by schemas or even by DB.
If the answer is no, you can store it in the same DB as Django. In the end, the important data is the extracted data. Django may have configuration tables or other extra data to manage the website, but most of the DB will be the crawled/treated data.
It depends on how much it will cost you to separate and maintain it. If you are starting from scratch, I would keep them separate.
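To make the raw/treated split concrete, here is a minimal sketch. It assumes sqlite3 purely for illustration; the table names (raw_price, web_product) and the "lowest price" treatment are hypothetical, and in a real setup the two tables would live in separate schemas or databases.

```python
import sqlite3

# One in-memory database stands in for both stores; the "raw_" and "web_"
# prefixes play the role of separate schemas (or separate databases).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_price (source TEXT, product TEXT, price REAL);
    CREATE TABLE web_product (product TEXT PRIMARY KEY, best_price REAL);
""")

def store_raw(rows):
    """Scraper side: append scraped rows untouched."""
    conn.executemany("INSERT INTO raw_price VALUES (?, ?, ?)", rows)

def publish():
    """Treatment step: rebuild the table that the web app reads."""
    conn.execute("DELETE FROM web_product")
    conn.execute("""
        INSERT INTO web_product (product, best_price)
        SELECT product, MIN(price) FROM raw_price GROUP BY product
    """)

store_raw([
    ("shop_a", "widget", 9.5),
    ("shop_b", "widget", 8.0),
    ("shop_a", "gadget", 20.0),
])
publish()
best_prices = dict(conn.execute("SELECT product, best_price FROM web_product"))
```

With this shape the scrapers only ever need to know the raw table, so a change to the web-facing model touches the publish step rather than every scraper.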

Related

Handling repetitive content within django apps

I am currently building a tool in Django for managing the design information within an engineering department. The idea is to have a common catalogue of items accessible to all projects. However, the projects would be restricted based on user groups.
For each project, you can import items from the catalogue and change them within the project. There is a requirement that each project must be linked to a different database.
I am not entirely sure how to approach this problem. From what I have read, the solution I came up with is to have multiple Django apps: one represents the common catalogue of items (linked to its own database), plus one app for each project (which can read and write its own database and can additionally read from the common catalogue database). In this way, I can restrict which user can access which database/project. However, the problem with this solution is that it is not DRY. All projects look the same: same models, same forms, same templates. They are just linked to different databases, and I do not know how to do this in a smart way (without copy-pasting entire files, because I think managing that would be a pain).
I was thinking that this could be avoided by changing the database label when doing queries (employing using()) depending on the group of the authenticated user. The problem with this is that a user can have access to multiple projects. So, I am again at a loss.

It looks to me like all you need is a single application that manages access properly.
If the requirement is to have separate DBs then I will not argue with that, but there is always a small chance that separate tables in one DB will be accepted.

Django apps don't segregate objects, they are a way of structuring your code base. The idea is that an app can be re-used in other projects. Having a separate app for your catalogue of items and your projects is a good idea, but having them together in one is not a problem if you have a small codebase.
If I have understood your post correctly, what you want is for the databases of different departments to be separate. This is essentially a multi-tenancy question, which is a big topic in itself. There are a few options:
Code separation - all of your projects/departments exist in a single database and schema but are separated by code that filters by department depending on who the end user is (literally by using Django's QuerySet .filter()). This is easy to do, but there is a risk that data could be leaked to the wrong user if you get your code wrong. I would recommend this one for your use case.
Schema separation - you are still using a single database but each department has its own schema. You would need to use PostgreSQL for this, but once a schema has been set, there is far less chance that data will be visible to the wrong user. There are Django libraries such as django-tenants that can do a lot of the heavy lifting.
Database separation - each department has its own database. There is even less chance that data will be leaked, but you have to manage multiple databases and it is more difficult to scale. You can manage this through Django, as it supports multiple databases.
Application separation - each department has not only its own database but also its own application instance. The separation is absolute, but you need to manage multiple applications on a host like Heroku, which is even less scalable.
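For the database separation option, one way around the "a user can belong to several projects" problem is to pick the database from the project being requested (for example from the URL) and use group membership only as the access check. A rough sketch, assuming hypothetical project slugs and database aliases, and assuming group names match project slugs:

```python
# Hypothetical mapping from project slug to an alias defined in DATABASES.
PROJECT_DATABASES = {"bridge": "bridge_db", "tunnel": "tunnel_db"}

def alias_for(user_groups, project):
    """Return the database alias for `project`, or raise if access is not allowed."""
    if project not in PROJECT_DATABASES:
        raise KeyError(f"unknown project: {project}")
    if project not in user_groups:
        raise PermissionError(f"no access to project: {project}")
    return PROJECT_DATABASES[project]

# A user in both groups asks for the "tunnel" project.
alias = alias_for({"bridge", "tunnel"}, "tunnel")
```

In a view the result would be passed to the queryset, e.g. Item.objects.using(alias), where Item is a hypothetical model shared by all projects. That keeps one set of models, forms and templates for every project.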

How to properly migrate a monolith architecture app to a microservice app in Django

Today I come with this question probably to someone who has large experience in this.
Basically what the title indicates. We have that app and we have to migrate it to microservices.
We didn't find any solid approach (or we felt it like that) about this. What we ended up doing is creating 1 project per microservice (a single functionality related to a module app, in general) but then we had some problems because we already had a database to work with since this is already a functioning app.
We had problems communicating with the existing models, so basically what we did was to point in every settings.py of the projects to the existing DB, and with python3 manage.py inspectdb, we grabbed the existing models. This approach ended up working, but we feel that is not the best approach. We had a lot of problems with circular imports and more.
Are there good practices for building microservices with Django, and a proper way to do it, as there is for most other things we want to build with the framework?
If someone knows or has something we would really appreciate it!

You can use Django for microservices. In this case each project has only a few apps, and you start every service on its own port:
first Django project:
PDF generator + views that generate a small PDF
second Django project:
PDF generator + views that generate a big PDF (the code can inherit from the other project)
orchestrator:
third Django project: authorization + a call to the big or the small PDF generator service
settings:
settings.py for the first and second projects is very simple and only allows calls from internal IPs. Here we don't need middleware, templates, cache, admin and other settings.
settings.py for the orchestrator is also very simple: it only handles auth, calls the services by internal IP and sends the response to the user. Here we don't need much middleware or many other settings.
gains:
Everything is independent. If one server falls over, the others can keep working.
Updates are easy. Updating one small server is always easier than updating a monolith.
Development is easy: three small teams each work on their own small project.
Unit testing is easy and fast.
For complex business goals the whole system is faster.
pains:
After 100 microservices it becomes very complex to work with them all.
Code style always differs between teams, no matter how strictly you define the style guide or the settings for the black formatter.
Integration testing is difficult: if something does not work, it is hard to find where.
The auth/services/messaging ecosystem is really complex.
For simple business goals the whole system is overcomplicated.
summary:
It doesn't matter how many DBs you want to use; that works with a monolith or with many services.
I don't see any problem with importing something from another project into a microservice: a model, an admin class or other stuff. It works. You will probably need to split the monolith smartly, but that is also manageable; there is plenty of experience to draw on (my own, and on the internet or in books).
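As a rough illustration of what the orchestrator project does, here is a sketch of its routing decision only (no HTTP calls are made). The internal addresses, ports and the 50-page threshold are all assumptions made up for the example:

```python
# Hypothetical internal addresses: each service runs on its own port.
SERVICES = {
    "small": "http://10.0.0.5:8001/pdf/",
    "big": "http://10.0.0.5:8002/pdf/",
}
BIG_THRESHOLD_PAGES = 50  # assumed cut-off between "small" and "big"

def route(user_is_authenticated, pages):
    """Authorise first, then pick the downstream service to call."""
    if not user_is_authenticated:
        return 403, None
    kind = "big" if pages > BIG_THRESHOLD_PAGES else "small"
    return 200, SERVICES[kind]

status, url = route(True, 120)
```

The two generator services never see unauthenticated traffic in this layout, which is why their settings can stay so small.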

How do I structure a database cache (memcached/Redis) for a Python web app with many different variables for querying?

For my app, I am using Flask, however the question I am asking is more general and can be applied to any Python web framework.
I am building a comparison website where I can update details about products in the database. I want to structure my app so that 99% of users who visit my website will never need to query the database, where information is instead retrieved from the cache (memcached or Redis).
I require my app to be realtime, so any update I make to the database must be instantly available to any visitor to the site. Therefore I do not want to cache views/routes/html.
I want to cache the entire database. However, because there are so many different variables when it comes to querying, I am not sure how to structure this. For example, if I were to cache every query and then later need to update a product in the database, I would basically need to flush the entire cache, which isn't ideal for a large web app.
What I would prefer is to cache individual rows within the database. The problem is, how do I structure this so I can flush the cache appropriately when an update is made to the database? Also, how can I map all of this together from the cache?
I hope this makes sense.

I had this exact question myself, with a PHP project, though. My solution was to use ElasticSearch as an intermediate cache between the application and database.
The trick to this is the ORM. I designed it so that when Entity.save() is called it is first stored in the database, then the complete object (with all references) is pushed to ElasticSearch and only then the transaction is committed and the flow is returned back to the caller.
This way I maintained full functionality of a relational database (atomic changes, transactions, constraints, triggers, etc.) and still have all entities cached with all their references (parent and child relations) together with the ability to invalidate individual cached objects.
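The save flow described above (write to the database, push the object to the cache, then commit) can be sketched in Python roughly like this. It is not the original PHP code: a dict stands in for ElasticSearch/Redis, sqlite3 stands in for the relational database, and the "product:<id>" key format is an assumption.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
cache = {}  # stand-in for the cache server; one key per row

def save_product(product_id, name, price):
    """Write the row, refresh its cache entry, then commit."""
    with db:  # commits on success, rolls back if anything inside raises
        db.execute(
            "INSERT OR REPLACE INTO product (id, name, price) VALUES (?, ?, ?)",
            (product_id, name, price),
        )
        cache[f"product:{product_id}"] = json.dumps(
            {"id": product_id, "name": name, "price": price}
        )

def get_product(product_id):
    """Read from the cache first; fall back to the database on a miss."""
    key = f"product:{product_id}"
    if key not in cache:
        row = db.execute(
            "SELECT id, name, price FROM product WHERE id = ?", (product_id,)
        ).fetchone()
        if row is None:
            return None
        cache[key] = json.dumps(dict(zip(("id", "name", "price"), row)))
    return json.loads(cache[key])

save_product(1, "kettle", 25.0)
save_product(1, "kettle", 19.99)  # the update replaces only this row's entry
```

Because each row has its own key, an update never requires flushing anything except the entry for that row.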
Hope this helps.

So a free eBook called "Redis in Action" by Josiah Carlson answered all of my questions. It is quite long, but after reading through it, I have a fairly solid understanding of how to structure a caching architecture. It gives real-world examples, such as a social network and a shopping site with tons of traffic. I will need to read through it again once or twice to fully understand. A great book!
Link: Redis in Action

Django code organisation

I've recently started working with Django. I'm working on an existing Django/Python based site. In particular I'm implementing some functionality to create and display a PDF document when a particular URL is hit. I have an entry in the app's urls file that routes to a function in the views file and the PDF generation is working fine.
However, the view function is pretty big and I want to extract the code out somewhere to keep my view as thin as possible, but I'm not sure of the best/correct approach. I'll probably need to generate other PDFs in due course so would it make sense to create a 'pdfs' app and put code in there? If so, should it go in a model or view?
In a PHP/CodeIgniter environment for example I would put the code into a model, but models seem to be closely linked to database tables in Django and I don't need any db functionality for this.
Any pointers/advice from more experienced Django users would be appreciated.
Thanks

If you plan to scale your project, I would suggest moving it to a separate app. Generally speaking, generating PDFs directly on a URL hit is not the best thing to do performance-wise. Generating a PDF file is pretty heavy on your server, so if multiple people do it at the same time, the performance of your system will suffer.
As a first step, just put it in a separate class, and execute that code from the view. At some point you will probably want to do some permission checks etc - that stays in the view, while generation of the PDF itself will be cleanly separated.
Once you have tested your code, scaled, etc., you can replace that one-line call in the view with code that puts the PDF generation in a queue and only fetches the result once it's done. That will let you manage your computing resources better.
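A minimal sketch of both steps, assuming a hypothetical PdfBuilder class and using a thread pool from the standard library as a stand-in for a real task queue. The bytes returned here are a placeholder, not a valid PDF:

```python
from concurrent.futures import ThreadPoolExecutor

class PdfBuilder:
    """Holds the generation logic so the view stays thin."""

    def __init__(self, title):
        self.title = title

    def build(self):
        # Placeholder bytes; a real builder would call a PDF library here.
        return f"PDF:{self.title}".encode()

# A small worker pool stands in for a real task queue.
executor = ThreadPoolExecutor(max_workers=2)
jobs = {}

def enqueue(job_id, title):
    """What the view would call instead of building the PDF inline."""
    jobs[job_id] = executor.submit(PdfBuilder(title).build)

def fetch(job_id):
    """Return the PDF bytes once ready, or None while still running."""
    future = jobs[job_id]
    return future.result() if future.done() else None

enqueue("report-1", "Monthly report")
pdf_bytes = jobs["report-1"].result(timeout=5)  # block only for this demo
```

The first step from the answer is just calling PdfBuilder(title).build() from the view; the enqueue/fetch pair is the later substitution, with permission checks staying in the view either way.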

Yes you can in principle do it in an app (the concept of reusable apps is the basis for their existence)
However not many people do it/not many applications require it. It depends on how/if the functionality will be shared. In other words there must be a real benefit.
The code normally goes in both the view/s and in the models (to isolate code and for the model managers)

Psych Experiment in Python (w/Django) - how to port to interactive web app?

I'm writing a psychology experiment in Python, and I need to make it available as a web app. I've already got the Python basically working as a command-line program. On the recommendation of a CS buddy I'm using Django with a sqlite db. This is also working, my development server is up and the database tables are ready and waiting.
What I don't understand is how to glue these two pieces together. The Django tutorials I've found are all about building things like blogs, messaging systems or polls; systems based on sending form data. I can't do that, because I'm timing responses to presented stimuli in milliseconds - I need to build an interactive app that doesn't rely (during the exercise) on form POST data or URL changes.
In short: I have no idea how to go from my simple command line program to a "real time" interactive web application.
Maximum kudos for links to relevant tutorials! I will also really appreciate a high-level explanation of the concept I'm missing here.
(FYI, I asked a previous question (choice of database) about this project here)

You are going to need to use HTML/Javascript, and then you can collect and send the results to the server. The results can get gamed though, as the code for the exercise is going to be client side.
Edit: I recommend a Javascript library, jQuery: http://docs.jquery.com/Tutorials
Edit 2:
I'll be a bit more specific. You need at least two models in Django: Exercise and ExecutedExercise. Exercise will have fields with its name, number, etc. (generic data for each exercise). ExecutedExercise will have two fields: a foreign key to Exercise, and a field to store how long it took to finish.
Now, in JavaScript, you're going to time the exercises and then post the timings to a Django view that will handle the data storage. How to post them? You could use http://api.jquery.com/jQuery.post/. Create the data object, data = { e1: timingE1, e2: timingE2 }, and post it to the view. In that view you can read the POST parameters, create an ExecutedExercise object for each one (you'll have the time it took for each exercise) and save them.
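On the server side, the view mostly has to turn those POST parameters into (exercise number, time) pairs. A small sketch of that step as a plain function, assuming the e1, e2, ... key naming from above and timings sent as whole milliseconds; the function name is hypothetical:

```python
def parse_timings(post_data):
    """Turn {"e1": "532", "e2": "871"} into [(1, 532), (2, 871)]."""
    results = []
    for key, value in post_data.items():
        if not key.startswith("e") or not key[1:].isdigit():
            continue  # ignore unrelated fields such as a CSRF token
        results.append((int(key[1:]), int(value)))
    return sorted(results)

rows = parse_timings({"e2": "871", "e1": "532", "csrfmiddlewaretoken": "abc"})
```

In the Django view you would pass request.POST to this function and create one ExecutedExercise per returned pair.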
