I have a MySQL database with around 10,000 articles in it, and that number will probably grow over time. I want to be able to search through these articles and pull out the most relevant results based on some keywords. I know there are a number of projects I could plug in that would essentially do this for me. However, the application is very simple, and it would be nice to have direct control over, and a working knowledge of, how the whole thing operates. Therefore, I would like to look into building a very simple search engine from scratch in Python.
I'm not even sure where to start, really. I could dump everything from the MySQL DB into a list and try to sort that list by relevance, but that seems like it would be slow, and it would get slower as the number of database rows increases. I could use some basic MySQL search to get the 100 results MySQL thinks are most relevant, then sort those 100. But that is a two-step process which may be less efficient, and I might miss an article that falls just outside that range.
What are the best approaches I can take to this?
Your best bet for building a search engine over 10,000 articles is to read "Programming Collective Intelligence" by Toby Segaran (August 2007). It's a wonderful read, and to save time you can go straight to Chapter 4, which walks through searching and ranking.
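If you want to see what "from scratch" looks like before committing to the book, here is a minimal sketch of the idea that chapter builds on: an inverted index with TF-IDF scoring. The tokenizer, the sample documents and the function names are my own placeholders, not anything from the book.

# Minimal in-memory keyword search; replace `docs` with rows loaded from MySQL.
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    # docs: {article_id: text}. Returns term -> {article_id: term frequency}.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index

def search(index, query, n_docs, top_k=10):
    # Score each document by a simple TF-IDF sum over the query terms.
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

docs = {1: "rice and soy sauce", 2: "bananas are eaten raw", 3: "fried rice recipe"}
index = build_index(docs)
print(search(index, "rice sauce", len(docs)))

Load the articles and build the index once (or persist it), and queries stay fast even as the table grows.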
If you don't mind replacing the MySQL database with something else, then I suggest Elasticsearch, using pyes.
It has the functionality you would expect of a search engine, including full-text search, great performance, pagination, more-like-this queries, and pluggable scoring algorithms, and it is near real time: when more data is added, it shows up in the search results almost immediately.
If you don't want to remove the current database, you can easily run the two side by side and treat MySQL as the master.
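A minimal sketch of what that looks like, using the official elasticsearch Python client instead of pyes (the idea is identical; the index name, field names and exact keyword arguments depend on your setup and client version):

# Index one article copied from the MySQL master, then run a full-text search.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.index(index="articles", id=42, document={
    "title": "Cooking rice",
    "body": "How to cook rice with soy sauce...",
})

# Full-text search with pagination; new documents become searchable almost
# immediately after indexing.
results = es.search(index="articles",
                    query={"match": {"body": "rice sauce"}},
                    size=10, from_=0)
for hit in results["hits"]["hits"]:
    print(hit["_id"], hit["_score"], hit["_source"]["title"])

A small sync job (or application-level hooks on writes) keeps the Elasticsearch index up to date while MySQL stays the source of truth.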
I have a list of words, and a few definitions as follows:
word - trans1, trans2, trans3 ...
I'm now unsure whether all those translations are correct. I want to use a library that, for a given word, returns all possible translations:
word - trans1 ... transn
I'll then match each of my translations against those provided by the library, and make sure that each one is valid. The problem is that I don't know of such a library. I do not want something like googletrans: it only provides one possible translation (because it is meant for translating paragraphs), it has a clear word/search limit that makes it stop abruptly after just a few trial runs, and it is inconsistent in its output, for example sometimes adding "to" to verb infinitives and sometimes not. Does something like this exist? Essentially, what I want is a many-result English-to-target-language dictionary library.
Google Translate API is probably your best bet out there. I'm no Google fanboy but credit where credit's due, and Google Translate is probably the best in the game right now.
As far as the problem of the program abruptly stopping goes, make sure that you're using the API correctly (read this).
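For what it's worth, a rough sketch of calling the Cloud Translation API through the google-cloud-translate package (the v2 "basic" client); the credentials setup, the target language code and the word list below are assumptions:

# Compare candidate translations against what the API returns.
# Requires GOOGLE_APPLICATION_CREDENTIALS to point at a service-account key.
from google.cloud import translate_v2 as translate

client = translate.Client()

def api_translation(word, target="es"):
    # Returns the single translation the API gives for this word.
    result = client.translate(word, target_language=target)
    return result["translatedText"]

candidates = {"dog": ["perro", "can"], "run": ["correr"]}
for word, translations in candidates.items():
    official = api_translation(word)
    for t in translations:
        print(word, t, "matches API" if t.lower() == official.lower() else "check manually")

Note that, like googletrans, this gives you one translation per word, so it can confirm a candidate but not enumerate every valid one.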
As far as the infinitives are concerned, machines are generally still bad at translation; to understand why, watch this great video by Tom Scott.
I am working on a project that needs to store a considerable amount of data. I was wondering what the difference is between using SQL and the datascience library in Python. If I go with SQL, I intend to use it through its Python libraries; if I go with "datascience", I would store the data in a CSV file. I am leaning very much towards "datascience" because I see the following advantages:
It is subjectively very easy for me to use; I make far fewer mistakes.
With my limited knowledge of runtime performance, I think the datascience library will be more efficient.
Most importantly, it has many built-in functions that make it easier for me to write my own.
However, since so many people are using SQL, I was wondering if I am missing something major, particularly around scalability.
Some people online said that SQL allows us to store files in a database, but I do not see how that makes a difference: I can simply store the file in a folder on the file system and save its path in the "datascience" table.
The "datascience library" is only intended to be a tool for teaching basic concepts in an academic entry level class. Unless you are taking such a class, you should ignore it and learn more standard tools.
If it helps you, you can learn Data Science using Pandas starting just from flat data files, such as CSV and JSON. You will absolutely need to learn to interface with SQL and NoSQL servers eventually. The advantages of a database over flat files are numerous and well described elsewhere.
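To make the two routes concrete, here is a small sketch; the file name, table name and columns are made up:

# Flat-file route: everything lives in a CSV.
import sqlite3
import pandas as pd

df = pd.read_csv("sales.csv")                      # e.g. columns: product, month, units
by_month = df.groupby(["product", "month"])["units"].sum()

# SQL route: the same data served by a database (SQLite here for brevity).
conn = sqlite3.connect("sales.db")
df.to_sql("sales", conn, if_exists="replace", index=False)
by_month_sql = pd.read_sql(
    "SELECT product, month, SUM(units) AS units FROM sales GROUP BY product, month",
    conn,
)
conn.close()

The flat-file route gets you started immediately; the SQL route adds indexing, concurrent access, and the ability to query data that does not fit in memory.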
It's up to you whether you want to learn Pandas first and SQL second, or SQL first. Many people in the real world would have learned SQL before Python/Pandas/Data Science, so you may want to go that route.
If you go ahead and study that datascience library, you will learn some concepts, but will then have to re-learn everything in there "for real." Maybe this is best for your learning style, maybe it isn't. We don't know you well enough. Do you want academic hand holding or do you want to do things the real way?
Good luck and enjoy your journey.
I'm currently building a Scrapy project that can crawl any website from the first depth to the last. I don't extract much data, but I store the whole page HTML (response.body) in a database.
I am currently using Elasticsearch with the bulk API to store my raw HTML.
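(Roughly, the pipeline looks like the sketch below; the index name, buffer size and item fields are simplified.)

# Scrapy item pipeline that bulk-indexes raw HTML into Elasticsearch.
from elasticsearch import Elasticsearch, helpers

class ElasticsearchPipeline:
    def __init__(self):
        self.es = Elasticsearch("http://localhost:9200")
        self.buffer = []

    def process_item(self, item, spider):
        self.buffer.append({
            "_index": "raw_pages",
            "_source": {"url": item["url"], "html": item["html"]},
        })
        if len(self.buffer) >= 500:        # flush in batches to keep the spider fast
            helpers.bulk(self.es, self.buffer)
            self.buffer = []
        return item

    def close_spider(self, spider):
        if self.buffer:                    # flush whatever is left at the end
            helpers.bulk(self.es, self.buffer)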
I had a look at Cassandra, but I did not find an equivalent of the Elasticsearch bulk API, and that hurts the performance of my spider.
I am interested in performance and was wondering whether Elasticsearch is a good choice, and whether there is perhaps a more appropriate NoSQL database.
That very much depends on what you are planning to do with the scraped data later on.
Elasticsearch does some complex indexing operations upon insertion which will make subsequent searches in the database quite fast ... but this also costs processing time and introduces a latency.
So, to answer your question of whether Elasticsearch is a good choice:
If you plan to build some kind of search engine later on, Elasticsearch is a good choice (as the name indicates). But you should have a close look at the configuration of Elasticsearch's indexing options etc. to make sure it works the way you need it to.
If, on the other hand, you just want to store the data and run processing tasks on it later, Elasticsearch is a poor choice and you would be better off with Cassandra or another NoSQL database.
Which NoSQL database suits your needs best depends - again - on the actual usage scenario.
I'm writing a system that uses a lot of rules. It's time for me to organize them and make them efficient. Main requirements are - business friendly, easy to understand, easy to find, easy to maintain, testable.
This question is not about how to create a rule engine; I'm not writing one. My goal is to find a way to maintain a lot of rules in one place and to make that easy. I need some expert advice on how to do so and what approach to take. Below are examples of what I have done already, to show that I'm working on this task and not simply asking somebody to do my job.
So far I have 3 approaches:
1) Array typed:
item = context.GetNextItem()
if item in ('banana', 'apple', 'orange'): EatRaw(item)
if item in ('banana', 'apple', 'potato'): BakeAndEat(item)
if item in ('meat', 'egg', 'potato', 'fish'): FryAndEat(item)
if item in ('pasta', 'egg', 'potato'): BoilAndEat(item)
2) A separate file for each item:
item = context.GetNextItem()
execfile(str(item) + '.py')  # Python 2; in Python 3: exec(open(str(item) + '.py').read())
#banana.py:
EatRaw(item)
BakeAndEat(item)
#potato.py:
BakeAndEat(item)
FryAndEat(item)
BoilAndEat(item)
3) Database stored:
item = context.GetNextItem()
SQL = "SELECT rule FROM rules WHERE item = ?"  # parameterized, avoids SQL injection
for row in cursor.execute(SQL, (str(item),)):
    eval(row.rule)(item)  # look up the function named in the rule column and call it
Table RULES
banana,EatRaw
banana,BakeAndEat
potato,BakeAndEat
potato,FryAndEat
potato,BoilAndEat
3.a) Data in file
File RULES.txt
banana,EatRaw
banana,BakeAndEat
potato,BakeAndEat
potato,FryAndEat
potato,BoilAndEat
This file could be considered a UI.
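One more variant of 3.a, sketched here from scratch rather than taken from my code: load RULES.txt once into a dict and dispatch through a registry of the rule functions.

# Variant of 3.a: parse RULES.txt once, then dispatch via a function registry.
from collections import defaultdict

RULE_FUNCS = {'EatRaw': EatRaw, 'BakeAndEat': BakeAndEat,
              'FryAndEat': FryAndEat, 'BoilAndEat': BoilAndEat}

def load_rules(path='RULES.txt'):
    rules = defaultdict(list)
    with open(path) as f:
        for line in f:
            item_name, rule_name = line.strip().split(',')
            rules[item_name].append(RULE_FUNCS[rule_name])
    return rules

rules = load_rules()
item = context.GetNextItem()
for rule in rules.get(str(item), []):
    rule(item)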
Each approach has its own pros and cons but, to be honest, I'm not satisfied with any of them. The files grow, become bulky, and get hard to search, maintain and understand. Any other approach or suggestion is welcome.
Let's zoom in on this statement:
Main requirements are - business friendly, easy to understand
However, your approaches so far are easy for programmers to understand, but not particularly easy for business users to understand.
You are approaching this problem from the wrong direction: you are starting with "which data structure has good ergonomics" rather than "how will business users view or modify the rules?"
Start with a good few rounds of UI design. Once you've got something, put it in front of potential users (if you have any); the implementation will then follow naturally, being whichever design most closely models or supports the way the resulting UI works and is used.
Edit:
A "UI" need not be a fancy single-page JavaScript application; it can be a text file on a particular network share that gets read every day by a cron job. That's still a "user interface". Design it in a way that is compatible with both the business users' needs and the available budget.
I'm looking for a database (with a Python API) for the tasks described below.
There is an increasing amount of data. In the initial period of accumulation, the analysis will be simple and can be done with plain SQL queries. But in the future I plan to extract data with more complex queries and to find more complex relationships. So I need to choose a data storage system from the start that will later let me analyze the data with progressively more sophisticated tools (as I explore the topic and develop my skills).
Example:
At first there is only data on buckwheat and rice, and I need to compare their sales growth over a month. No problem: two SQL queries filtered by product name and limited to a one-month window; draw the graphs and you clearly see what is what. Then more kinds of goods appear. Now I need to learn how soy sauce sales growth depends on rice sales growth; this is still somehow possible with SQL queries. Eventually there are 5,000 product names in the database and I need algorithms (e.g. neural networks) to look for dependencies in the data automatically.
That is, I start simple; as the needs grow, the tools become more complicated.
What DB suits such growing requirements while being simple enough to use at the beginning?
Would Redis work, for example?
It would also be very useful to know what is wrong in my question, since I am totally new to this subject; that would tell me what to look into.
I agree, MongoDB is suited for that. If you had millions of entries with multiple relations, SQL would be ahead, but for some thousands of entries a document-based DB does the job. As a benefit, you don't have to define the structure of your DB before you create it, and you can easily change it later. Take a look at the PyMongo Tutorial.
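A minimal sketch of what getting started looks like; the database, collection and field names are made up:

# Store sales records without defining a schema up front, then query them.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
sales = client.shop.sales                      # database "shop", collection "sales"

sales.insert_one({"product": "rice", "month": "2016-01", "units": 120})
sales.insert_one({"product": "buckwheat", "month": "2016-01", "units": 80})

# Simple analysis: one product, one month.
for doc in sales.find({"product": "rice", "month": "2016-01"}):
    print(doc["units"])

# Later, richer analysis through the aggregation pipeline.
totals = sales.aggregate([
    {"$group": {"_id": {"product": "$product", "month": "$month"},
                "total": {"$sum": "$units"}}},
])
for row in totals:
    print(row)

When the built-in queries are no longer enough, you can pull the documents into Python and feed them to whatever analysis library you pick up next.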