I'm currently working on an App Engine project, and I'd like to implement autocompletion of search terms. The items that can be searched for are reasonably unambiguous and short, so I was thinking of implementing it by giving each item a list of incomplete typings. So foobar would get the list [f, fo, foo, foob, fooba, foobar]. The user's text in the search box is then compared against this list, and positive matches are suggested.
There are a couple of possible optimizations to this list that I was thinking of:
Removing spaces and punctuation from search terms, so "Foo. Bar" becomes "FooBar".
Lowercasing everything.
Removing leading particles like "the", "a", "an". "The Guy" would become "guy" and be indexed as [g, gu, guy].
Only adding substrings of at least two or three characters to the index list, so "The Guy" would be indexed as [gu, guy]. I suspect suggestions that match only the first letter would not be very relevant.
The user's search term would also be normalized in this way before the DB is searched. When suggesting a search term, the particles, punctuation, and capitalization would be restored from the suggested item's full name. So searching for "the" would give no suggestions, but searching for "The Gu.." or "gu" would suggest "The Guy".
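In code, the normalization and prefix indexing I have in mind would look roughly like this (just a sketch; PARTICLES and MIN_PREFIX are placeholders for whatever I settle on):

import re

PARTICLES = {"the", "a", "an"}  # leading particles to strip (my assumed list)
MIN_PREFIX = 2                  # shortest prefix worth indexing

def normalize(name):
    # Lowercase, drop punctuation, strip a leading particle, remove spaces.
    words = re.sub(r"[^\w\s]", "", name.lower()).split()
    if words and words[0] in PARTICLES:
        words = words[1:]
    return "".join(words)

def prefixes(name):
    key = normalize(name)
    return [key[:i] for i in range(MIN_PREFIX, len(key) + 1)]

print(prefixes("The Guy"))   # ['gu', 'guy']
print(prefixes("Foo. Bar"))  # ['fo', 'foo', 'foob', 'fooba', 'foobar']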
Is this a good idea? Mainly: would this formatting help, or only cause trouble?
I have already run into the same problem, and the solution I adopted was very similar to your idea. I split the items into words, convert them to lowercase, remove accents, and create a list of word beginnings. For instance, "Báz Bar" becomes ['b', 'ba', 'bar', 'baz'].
I have posted the code in this thread. The search box of this site is using it. Feel free to use it if you like.
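For reference, a condensed sketch of the idea (not the full posted code; accent removal here uses unicodedata):

import unicodedata

def strip_accents(s):
    # Decompose accented characters and drop the combining marks.
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if not unicodedata.combining(c))

def word_prefixes(name):
    result = set()
    for word in strip_accents(name).lower().split():
        for i in range(1, len(word) + 1):
            result.add(word[:i])
    return sorted(result)

print(word_prefixes("Báz Bar"))  # ['b', 'ba', 'bar', 'baz']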
Today I wrote my first program, which is essentially a vocabulary learning program! So naturally I have pretty long vocabulary lists and a couple of questions. I created a class whose parameters include the German word and the Spanish word. My first question is: is there any way to turn the plain-text vocabulary I copy from an online vocab list into separate strings, without adding the quotes and the commas manually?
And my second question:
I created another list to assign each German word to each Spanish word, and it looks a little like this:
vocabs = [
    Vocabulary(spanish_word[0], german_word[0]),
    Vocabulary(spanish_word[1], german_word[1]),
    # etc.
]
Vocabulary would be the class, spanish_word the first word list, and german_word the other, obviously.
But with a lot of vocab that's a lot of work too. Is there any way to automate pairing each word from the Spanish list with the corresponding German word? I first tried this:
vocabs = [
    for spanish_word in german_word
    Vocabulary(spanish_word[0], german_word[0])
]
But that didn't work, and researching on the internet didn't help much either.
Please don't be rude if these are noob questions. I'm actually pretty happy that my program runs so well, and I'd be thankful for any help to make it better.
Without knowing what it is you're looking to do with the result, it appears you're trying to do this:
vocabs = [Vocabulary(s, g) for s, g in zip(spanish_word, german_word)]
You didn't provide any code or example data for the "turn all the plain text vocabulary [..] into strings and separate them without adding the quotes and the commas manually" part. There's sure to be a way to do what you need, but you should probably ask a separate question, after first looking for a solution yourself and making an attempt. Ask a question if you can't get it to work.
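For illustration, with two small made-up lists (the Vocabulary class is stubbed out here):

class Vocabulary:
    def __init__(self, spanish, german):
        self.spanish = spanish
        self.german = german

spanish_word = ["hola", "gato", "perro"]
german_word = ["hallo", "Katze", "Hund"]

# zip pairs the lists element by element and stops at the shorter one.
vocabs = [Vocabulary(s, g) for s, g in zip(spanish_word, german_word)]
print(vocabs[1].spanish, vocabs[1].german)  # gato Katze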
I have an unstructured, free form text (taken from emails, phone conversation transcriptions), a list of first names and a list of last names.
What would be the most effective and Pythonic way to replace all the first names in the text with "--FIRSTNAME--" and all the last names with "--LASTNAME--", based on the lists I have?
I could iterate over the first-name list and do a
text.replace(firstname, '--FIRSTNAME--')
but that seems very inefficient, especially for a very long list of names and many long texts to process. Are there better options?
Example:
Text: "Hello, this is David, how may I help you? Hi, my name is Alex Bender and I am trying to install my new coffee machine."
First name list: ['Abe', 'Alex', 'Andy', 'David', 'Mark', 'Timothy']
Last name list: ['Baxter', 'Bender', 'King', 'McLoud']
Expected output: "Hello, this is --FIRSTNAME--, how may I help you? Hi, my name is --FIRSTNAME-- --LASTNAME-- and I am trying to install my new coffee machine."
I followed the advice of @furas and checked out the flashtext module. This pretty much answers my need in full.
I did run into a problem, as I am working with Hebrew (non-ASCII characters) and the text replacement would not follow word boundaries.
There is a method of the KeywordProcessor class, add_non_word_boundary(character), which for some reason is not documented. It lets you add characters that should not be treated as boundary characters (in addition to the default [a-zA-Z0-9_]), allowing whole-word replacement only.
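A minimal sketch of that setup, using the names from the question (the Hebrew letters passed to add_non_word_boundary are just an example set):

from flashtext import KeywordProcessor

kp = KeywordProcessor()
for name in ['Abe', 'Alex', 'Andy', 'David', 'Mark', 'Timothy']:
    kp.add_keyword(name, '--FIRSTNAME--')
for name in ['Baxter', 'Bender', 'King', 'McLoud']:
    kp.add_keyword(name, '--LASTNAME--')

# Treat Hebrew letters as word characters too, so replacements
# respect word boundaries in Hebrew text as well.
for ch in 'אבגדהוזחטיכלמנסעפצקרשת':
    kp.add_non_word_boundary(ch)

text = ("Hello, this is David, how may I help you? "
        "Hi, my name is Alex Bender and I am trying to install my new coffee machine.")
print(kp.replace_keywords(text))
# Hello, this is --FIRSTNAME--, how may I help you? Hi, my name is
# --FIRSTNAME-- --LASTNAME-- and I am trying to install my new coffee machine.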
I am looking for something slightly more reliable for unpredictable strings than just checking if "word" in "check for word".
To paint an example, let's say I have the following sentence:
"Learning Python!"
If the sentence contains "Python", I'd want to evaluate to true, but what if it were:
"Learning #python!"
Doing a split with a space as the delimiter would give me ["learning", "#python"], which does not match "python".
(Note: While I do understand that I could remove the # for this particular case, the problem with this is that 1. I am tagging programming languages and don't want to strip out the # in C#, and 2. This is just an example case, there's a lot of different ways I could see human typed titles including these hints that I'd still like to catch.)
I'd basically like to check whether, beyond reasonable doubt, the sequence of characters I'm looking for is there, despite any weird way it might be mentioned. What are some ways to do this? I have looked at fuzzy search a bit, but I haven't seen any use cases involving single words.
The end goal is that I have tags for programming languages, and I'd like to take people's stream titles and tag a language if it's mentioned in the title.
This code prints True if the string contains "python", ignoring case:
import re

text = "Learning Python!"  # named `text` rather than `input` to avoid shadowing the builtin
print(re.search("python", text, re.IGNORECASE) is not None)
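The plain substring search above will also match inside longer words (e.g. "java" in "javascript"). One variant using lookarounds avoids that while still catching "#python" and a literal "C#" (just a sketch; adjust the character class to taste):

import re

def mentions(term, text):
    # Escape the term so '#' in 'C#' is taken literally, and require that
    # it is not embedded in a longer run of letters or digits.
    pattern = r"(?<![A-Za-z0-9])" + re.escape(term) + r"(?![A-Za-z0-9])"
    return re.search(pattern, text, re.IGNORECASE) is not None

print(mentions("python", "Learning #python!"))  # True
print(mentions("c#", "Building stuff in C#"))   # True
print(mentions("java", "javascript tips"))      # False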
I currently have several names that look like this:
SmithJohn
smithJohn
O'BrienPeter
None of these have spaces, but each has a capital letter in the middle.
Is there a regex to match these types of names (but that won't match names like "Smith", "John", "Smith John", or "Smith.John")? Furthermore, how could I split the last name and first name into two different variables?
Thanks!
If all you want is a string with a capital letter in the middle and lowercase letters around it, this should work okay: [a-z][A-Z] (make sure you use re.search and not re.match). It handles "O'BrienPeter" fine, but might match names like "McCutchon" when it shouldn't. It's impossible to come up with a regex, or any program really, that does what you want for all names (see Falsehoods Programmers Believe About Names).
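To actually split on that match, a small sketch:

import re

def split_name(s):
    # Find a lowercase letter immediately followed by an uppercase letter
    # and split between them. Returns (last, first), or None if no match.
    m = re.search(r"[a-z][A-Z]", s)
    if m is None:
        return None
    return s[:m.start() + 1], s[m.start() + 1:]

print(split_name("SmithJohn"))     # ('Smith', 'John')
print(split_name("O'BrienPeter"))  # ("O'Brien", 'Peter')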
As Brian points out, there's a question you need to ask yourself here: What guarantees do you have about the strings you will be processing?
Do you know without a doubt that the only capitals will be the beginnings of the names? Or could something like "McCutchonBrian", or in my case "Mallegol-HansenPhilip", have found its way in there as well?
In the greater context of software in general, you need to consider the assumptions you are going in with. Otherwise you're going to be solving a problem that is, in fact, not the problem you have.
I have a set of strings that contain concatenated words, like the following:
longstring (two English words)
googlecloud (a name and an English word)
When I type these terms into Google, it recognizes the words with "did you mean?" ("long string", "google cloud"). I need similar functionality in my application.
I looked into the options provided by Python and Elasticsearch. All the tokenizing examples I found are based on whitespace, uppercase letters, special characters, etc.
What are my options, given that the strings are in English (but may contain names)? It doesn't have to be based on a specific technology.
Can I get this done with Google BigQuery?
Can you also roll your own implementation? I am thinking of an algorithm like this (a sketch follows the list):
Get a dictionary with all words you want to distinguish
Build a data structure that allows quick lookup (I am thinking of a trie)
Try to find the first word (starting with one character and extending until a word is found); if found, take the remaining string and do the same until nothing is left. If nothing matches, backtrack and extend the previous word.
This should be OK-ish if the string can be split, but it will try all possibilities if it's gibberish. Of course, it depends on how big your dictionary is going to be. This was just a quick thought; maybe it helps.
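Here is a minimal sketch of that backtracking idea, using memoization and a plain set of words instead of a trie to keep it short:

def segment(s, words, memo=None):
    # Return a list of dictionary words that concatenate to s, or None.
    if memo is None:
        memo = {}
    if s in memo:
        return memo[s]
    if not s:
        return []
    result = None
    # Try every prefix that is a known word; backtrack if the rest fails.
    for i in range(1, len(s) + 1):
        prefix = s[:i]
        if prefix in words:
            rest = segment(s[i:], words, memo)
            if rest is not None:
                result = [prefix] + rest
                break
    memo[s] = result
    return result

words = {"long", "string", "google", "cloud"}  # toy dictionary
print(segment("longstring", words))      # ['long', 'string']
print(segment("googlecloud", words))     # ['google', 'cloud']
print(segment("helloxiuhiewuh", words))  # None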
If you do choose to solve this with BigQuery, then the following is a candidate solution:
Load a list of all possible English words into a table called words. For example, https://github.com/dwyl/english-words has a list of ~350,000 words. Other datasets (e.g., WordNet) are freely available on the Internet too.
Using Standard SQL, run the following query over the list of candidates:
SELECT first, second FROM (
  SELECT word AS first, SUBSTR(candidate, LENGTH(word) + 1) AS second
  FROM dataset.words
  CROSS JOIN (
    SELECT candidate
    FROM UNNEST(["longstring", "googlecloud", "helloxiuhiewuh"]) candidate)
  WHERE STARTS_WITH(candidate, word))
WHERE second IN (SELECT word FROM dataset.words)
For this example it produces:
Row  first   second
1    long    string
2    google  cloud
Even a very big list of English words would be only a couple of MB, so the cost of this query is minimal. The first 1 TB scanned each month is free, which is good enough for about 500,000 scans of a 2 MB table. After that, each additional scan costs 0.001 cents.