Scrapy - Database choice [closed] - python

I'm currently building a Scrapy project that can crawl any website from the first depth to the last. I don't extract much data, but I store the entire page HTML (response.body) in a database.
I am currently using Elasticsearch with the bulk API to store my raw HTML.
I had a look at Cassandra, but I did not find an equivalent of the Elasticsearch bulk API, and that hurts the performance of my spider.
I care about performance and am wondering whether Elasticsearch is a good choice, or whether there is a more appropriate NoSQL database.

That very much depends on what you are planning to do with the scraped data later on.
Elasticsearch does some complex indexing work on insertion, which makes subsequent searches in the database quite fast ... but this also costs processing time and introduces latency.
So, to answer your question of whether Elasticsearch is a good choice:
If you plan to build some kind of search engine later on, Elasticsearch is a good choice (as the name indicates). But you should have a close look at the configuration of Elasticsearch's indexing options etc. to make sure it works the way you need it to.
If, on the other hand, you just want to store the data and do processing tasks on it later, Elasticsearch is a poor choice and you would be better off with Cassandra or another NoSQL database.
Which NoSQL database suits your needs best depends, again, on the actual usage scenario.
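For reference, buffering pages in a Scrapy item pipeline and flushing them with the Elasticsearch bulk helper is roughly what the setup in the question looks like. This is a minimal sketch assuming the official elasticsearch Python client; the index name, buffer size, and item fields (url, html) are made up for illustration.
```python
# Hypothetical Scrapy pipeline that buffers raw HTML and bulk-indexes it.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk


class ElasticsearchHtmlPipeline:
    def __init__(self, buffer_size=500):
        self.buffer_size = buffer_size  # flush after this many pages
        self.buffer = []

    def open_spider(self, spider):
        self.es = Elasticsearch(["http://localhost:9200"])

    def process_item(self, item, spider):
        # item["url"] and item["html"] are assumed fields holding
        # response.url and response.body decoded to text.
        self.buffer.append({
            "_index": "raw_html",
            "_source": {"url": item["url"], "html": item["html"]},
        })
        if len(self.buffer) >= self.buffer_size:
            self.flush()
        return item

    def flush(self):
        if self.buffer:
            bulk(self.es, self.buffer)
            self.buffer = []

    def close_spider(self, spider):
        self.flush()  # index whatever is left when the crawl ends
```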

Related

Should I use a SQLite database or Pandas for my application [closed]

I have a user-installable application that takes a 2-5 MB JSON file and then queries the data for metrics. It pulls metrics like the number of unique items, or the number of items with a field set to a certain value, etc. Sometimes it pulls metrics that are more tabular, like returning all items with certain properties, along with all their fields from the JSON.
I need help making a technology choice. I am deciding between Pandas and SQLite with peewee as an ORM. I am not concerned about converting the JSON file to a SQLite database; I already have that prototyped. I want help evaluating the pros and cons of a SQLite database versus Pandas.
Another factor to consider is that my application may need to analyze metrics across multiple JSON files of the same structure, for example, how many unique items there are across 3 selected JSON files.
I am new to Pandas, so I can't make a strong argument for or against it yet. I am comfortable with SQLite and an ORM, but I don't want to settle if this choice would be restrictive for future development. I don't want to factor in a learning curve; I just want a head-to-head evaluation of the technologies for my application.
You are comparing a database to an in-memory processing library; they are two separate things. Do you need persistent storage over multiple runs of the code? Use SQLite (since you're computing metrics, I would guess this is the path you need). You could use Pandas to write CSVs/TSVs and use those as permanent storage, but you'll eventually hit a bottleneck having to load multiple CSVs into one DataFrame for processing.
Your use case sounds better suited to SQLite, in my opinion.
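As a rough illustration of the cross-file metric mentioned above, loading several JSON files into one SQLite table and counting distinct items takes only a few lines with the standard library. The file layout and the item_id field are hypothetical; adjust them to the real JSON structure.
```python
# Minimal sketch: merge several JSON files into SQLite and count unique items.
import json
import sqlite3


def count_unique_items(json_paths):
    conn = sqlite3.connect(":memory:")  # or a file path for persistence
    conn.execute("CREATE TABLE items (item_id TEXT, source_file TEXT)")
    for path in json_paths:
        with open(path) as f:
            records = json.load(f)  # assumed to be a list of objects
        conn.executemany(
            "INSERT INTO items (item_id, source_file) VALUES (?, ?)",
            [(rec["item_id"], path) for rec in records],
        )
    (count,) = conn.execute("SELECT COUNT(DISTINCT item_id) FROM items").fetchone()
    return count


print(count_unique_items(["a.json", "b.json", "c.json"]))
```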

MongoDB or PostgreSQL: which will be better with Python, Django, Express and Node.js [closed]

I am dealing with money, debiting from one account and crediting another, so I think my database must strictly follow all the ACID properties. For this kind of work, which will be more suitable: MongoDB or PostgreSQL? I have read that MongoDB does not follow the ACID properties, which is why I am confused.
A SQL RDBMS is definitely your choice. If you are selecting between MongoDB and PostgreSQL, then PostgreSQL is the answer.
As stated in MongoDB's official FAQ:
MongoDB may not be a good fit for some applications. For example, applications that require complex transactions (e.g., a double-entry bookkeeping system) and scan-oriented applications that access large subsets of the data most of the time may not be a good fit for MongoDB. MongoDB is not a drop-in replacement for legacy applications built around the relational data model and SQL.
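To make the ACID point concrete, here is roughly what an atomic debit/credit looks like against PostgreSQL from Python: both updates commit together or not at all. This is a minimal sketch; the psycopg2 driver, the accounts table, and its columns are assumptions for illustration, not part of the original question.
```python
# Minimal sketch of an atomic money transfer using psycopg2 transactions.
import psycopg2


def transfer(dsn, from_id, to_id, amount):
    conn = psycopg2.connect(dsn)
    try:
        # The `with conn` block commits on success and rolls back on error,
        # so the two UPDATEs are applied atomically.
        with conn:
            with conn.cursor() as cur:
                cur.execute(
                    "UPDATE accounts SET balance = balance - %s WHERE id = %s",
                    (amount, from_id),
                )
                cur.execute(
                    "UPDATE accounts SET balance = balance + %s WHERE id = %s",
                    (amount, to_id),
                )
    finally:
        conn.close()


transfer("dbname=bank user=app", from_id=1, to_id=2, amount=100)
```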

Need an (easy to use) database for analyzing data [closed]

I'm looking for a database (with a Python API) for the tasks described below.
There is a growing amount of data. In the initial period of accumulating it, only simple analysis is needed, the kind that can be done with simple SQL queries. But later I plan to extract data with more complex queries and to look for more complex relationships. I need to choose a storage system up front that will still let me analyze the data with increasingly sophisticated tools as the topics and my skills develop.
Example:
At first there is only data on buckwheat and rice, and I need to compare their sales growth over a month. No problem: two SQL queries by product name, limited to that time window, draw some graphs, and it is clear what is what. Then more kinds of goods appear. Now I need to find out how the sales growth of soy sauce depends on the sales growth of rice. This is still somehow possible with SQL queries. Eventually there are 5,000 product names in the database, and I need algorithms (e.g. neural networks) to look for dependencies in the data automatically.
That is, I start simple, my needs grow, and the tools become more complicated.
Which database suits such increasing requirements while being simple enough to use at the beginning?
Would Redis, for example, be a fit?
It would also be very useful to know what is wrong with my question. I am totally new to this subject, so any pointer on where to look would help.
I agree, MongoDB is suited for that. If you had millions of entries with multiple relations, SQL would be ahead, but for a few thousand entries a document-based DB does the job. As a benefit, you don't have to care about the structure of your DB before you create it, and you can easily change it later. Take a look at the PyMongo tutorial.
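As a small taste of what that looks like, here is a sketch that stores sales records with PyMongo and aggregates total sales per product. The database, collection, and field names are invented for the example.
```python
# Minimal PyMongo sketch: insert sales records and sum sales per product.
from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
sales = client["shop"]["sales"]  # hypothetical database and collection

sales.insert_many([
    {"product": "rice", "amount": 12, "date": datetime(2023, 5, 3)},
    {"product": "buckwheat", "amount": 7, "date": datetime(2023, 5, 4)},
    {"product": "rice", "amount": 5, "date": datetime(2023, 5, 20)},
])

# Total amount sold per product; more complex pipelines can be added later.
for row in sales.aggregate([
    {"$group": {"_id": "$product", "total": {"$sum": "$amount"}}}
]):
    print(row["_id"], row["total"])
```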

SQLAlchemy, Python: easiest way to populate a database with data [closed]

I tend to start projects that are far beyond what I am capable of doing; bad habit or a good way to force myself to learn, I don't know. Anyway, this project uses a PostgreSQL database, Python and SQLAlchemy. I am slowly learning everything from SQL to SQLAlchemy and Python. I have started to figure out models and the declarative approach, but I am wondering: what is the easiest way to populate the database with data that needs to be there from the beginning, such as an admin user for my project? How is this usually done?
Edit:
Perhaps this question was worded badly. What I wanted to know was the possible ways to insert initial data into my database. I tried using SQLAlchemy and checking whether each item existed, inserting it if not. This seemed tedious and can't be the way to go if there is a lot of initial data. I am a beginner at this, and what better way to learn than to ask the people who do this regularly how they do it? Perhaps not a good fit for a Stack Overflow question, sorry.
You could use a schema change management tool like Liquibase. Normally this is used to keep your database schema in source control and to apply patches that update the schema.
You can also use Liquibase to load data from CSV files. So you could add a startup.csv file in Liquibase that runs the first time you run Liquibase against your database. You can also have it run every time, and it will merge the data in the CSV with the database.
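If you stay within SQLAlchemy, the check-then-insert approach from the edit can be wrapped in a small idempotent seeding function, so running it twice does no harm. This is a minimal sketch assuming a declarative User model; the model, fields, and the SQLite URL are placeholders for illustration.
```python
# Minimal sketch: idempotent seed data with SQLAlchemy.
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String, unique=True, nullable=False)
    role = Column(String, nullable=False)


def seed(session, initial_users):
    # Insert each user only if a row with the same name does not exist yet.
    for data in initial_users:
        if not session.query(User).filter_by(name=data["name"]).first():
            session.add(User(**data))
    session.commit()


engine = create_engine("sqlite:///app.db")  # placeholder URL
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
seed(session, [{"name": "admin", "role": "admin"}])
```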

Search engine from scratch [closed]

I have a MySQL database with around 10,000 articles in it, but that number will probably go up with time. I want to be able to search through these articles and pull out the most relevant results based on some keywords. I know there are a number of projects I can plug in that can essentially do this for me. However, the application is very simple, and it would be nice to have direct control over, and working knowledge of, how the whole thing operates. Therefore, I would like to look into building a very simple search engine from scratch in Python.
I'm not even sure where to start, really. I could just dump everything from the MySQL DB into a list and try to sort that list by relevance, but that seems like it would be slow, and get slower as the number of database items increases. I could use some basic MySQL search to get the top 100 results MySQL thinks are most relevant, then sort those 100. But that is a two-step process which may be less efficient, and I might risk missing an article if it falls just outside that range.
What are the best approaches I can take to this?
Your best bet for building a search engine over 10,000 articles is to read "Programming Collective Intelligence" by Toby Segaran. It's a wonderful read, and to save time, go straight to Chapter 4 of the August 2007 edition.
If you don't mind replacing the MySQL database with something else, then I suggest Elasticsearch, using pyes.
It has the functionality you would expect of a search engine, including full-text search, great performance, pagination, more-like-this, and a pluggable scoring algorithm, and it is real time, so when more data is added it shows up instantly in the search results.
If you don't want to remove the current database, you can easily run them side by side and treat MySQL as the master.
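For the from-scratch route the question asks about, the core pieces are an inverted index plus a ranking function such as TF-IDF. Here is a toy sketch in pure Python; it ignores stemming, stop words and persistence, and the example documents and class name are made up.
```python
# Toy inverted index with TF-IDF ranking (no stemming, stop words, or persistence).
import math
from collections import Counter, defaultdict


class TinySearchEngine:
    def __init__(self):
        self.index = defaultdict(dict)   # term -> {doc_id: term frequency}
        self.doc_count = 0

    def add(self, doc_id, text):
        self.doc_count += 1
        for term, freq in Counter(text.lower().split()).items():
            self.index[term][doc_id] = freq

    def search(self, query, top_n=10):
        scores = defaultdict(float)
        for term in query.lower().split():
            postings = self.index.get(term, {})
            if not postings:
                continue
            idf = math.log(self.doc_count / len(postings))  # rarer terms weigh more
            for doc_id, freq in postings.items():
                scores[doc_id] += freq * idf
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


engine = TinySearchEngine()
engine.add(1, "python search engine from scratch")
engine.add(2, "mysql database of articles")
print(engine.search("python search"))
```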
