I have a table which has two columns: ID (primary key, auto increment) and keyword (text, full-text index).
The values entered in the keyword column include the following:
keyword
Car
Car sales
Cars
Sports cars
Sports foo
Car bar
Statistics
Suppose that we have this sentence as an input:
"Find sports car sales statistics in Manhattan."
I'm looking (and I have been searching for quite a while) to find either a MySQL query or an algorithm which takes in the given input, and detects the keywords used from the keywords column, resulting in an output of:
"Sports cars", "Car sales", "Statistics"
In other words, I'm trying to take an input that's in the form of a sentence, and then match all the existing (and most relevant) keyword values in the database that are found in the sentence. Note that these keywords could be phrases that consist of words separated by a space.
After some research I learned that MySQL does something similar through its full-text search feature. I have tried the natural language, boolean, and query expansion modes, but they all return keyword records that only partially match the input. For example, the output is:
"Car", "Car sales", "Sports cars", "Sports foo", "Car bar", "Statistics".
I don't want this to happen because it includes words that aren't even in the input (i.e. foo and bar).
Here's the MySQL query for the above mentioned search:
SELECT * FROM tags WHERE MATCH(keyword) AGAINST('Find sports car sales statistics in Manhattan.' IN BOOLEAN MODE)
I also tried to improve on the relevancy, but this one only returns a single record:
SELECT *, SUM(MATCH(keyword) AGAINST('Find sports car sales statistics in Manhattan.' IN BOOLEAN MODE)) as score FROM tags WHERE MATCH(keyword) AGAINST('Find sports car sales statistics in Manhattan.' IN BOOLEAN MODE) ORDER BY score DESC
Assuming you have loaded the column into a Python collection, one Pythonic option is set.intersection, which returns the elements common to two sets (the argument can be any iterable, such as a list or tuple):
>>> col={'Car','Car sales','Cars','Sports cars','Sports foo','Car bar','Statistics'}
>>> col={i.lower() for i in col}
>>> s="Find sports car sales statistics in Manhattan."
>>> col.intersection(s.strip('.').split())
set(['car', 'statistics'])
In your case you can put the result of your query into a set, or convert it to one.
Note: the set comprehension above (col={i.lower() for i in col}) converts the elements of your column to lower case.
However, this recipe only intersects the column with the individual whitespace-separated words of the sentence, so multi-word keywords such as 'car sales' are never found and the result is:
set(['car', 'statistics'])
As another option, you can use re.search:
>>> import re
>>> col={'Car','Car sales','Cars','Sports cars','Sports foo','Car bar','Statistics'}
>>> s='Find sports car sales statistics in Manhattan.'
>>> for i in col:
...     g=re.search(i,s,re.IGNORECASE)
...     if g:
...         print(g.group(0))
...
statistics
car sales
car
A simple way to generate variations of a phrase (every word ordering, plus naive plural forms) is a function like the following:
from itertools import permutations

def combs(phrase):
    sp=phrase.split()
    # all orderings of 1..n words, using the words as given
    com1=[[' '.join(x) for x in permutations(sp,j)] for j in range(1,len(sp)+1)]
    # naive pluralisation: append 's' to every word that lacks one
    for i,k in enumerate(sp):
        if not k.endswith('s'):
            sp[i]=k+'s'
    com2=[[' '.join(x) for x in permutations(sp,j)] for j in range(1,len(sp)+1)]
    return com1+com2

print({j for i in combs('Car sales') for j in i})
set(['Car', 'sales', 'sales Cars', 'Car sales', 'Cars sales', 'sales Car', 'Cars'])
Note that this function could be more efficient and complete.
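For what it's worth, here is a minimal sketch of one way to get the output the question asks for. It rests on two assumptions that are mine, not the asker's: plurals are handled by a naive rule (strip a trailing "s" from words longer than three letters), and a keyword is dropped when it is fully covered by a longer matched phrase (so "Car" loses to "Car sales"). The helper names normalize, find_spans and match_keywords are made up for illustration.

```python
import re

def normalize(text):
    """Lowercase, split into words, and strip a trailing 's' (naive plural rule, an assumption)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w[:-1] if len(w) > 3 and w.endswith("s") else w for w in words]

def find_spans(needle, haystack):
    """Return (start, end) positions where the word list `needle` occurs contiguously in `haystack`."""
    if not needle:
        return []
    n = len(needle)
    return [(i, i + n) for i in range(len(haystack) - n + 1)
            if haystack[i:i + n] == needle]

def match_keywords(sentence, keywords):
    """Return the keywords found in the sentence, minus those covered by a longer match."""
    tokens = normalize(sentence)
    hits = [(kw, start, end)
            for kw in keywords
            for start, end in find_spans(normalize(kw), tokens)]
    result = []
    for kw, start, end in hits:
        # A hit is "covered" when a strictly longer hit spans the same words.
        covered = any(s <= start and end <= e and (e - s) > (end - start)
                      for _, s, e in hits)
        if not covered and kw not in result:
            result.append(kw)
    return result

KEYWORDS = ["Car", "Car sales", "Cars", "Sports cars",
            "Sports foo", "Car bar", "Statistics"]
print(match_keywords("Find sports car sales statistics in Manhattan.", KEYWORDS))
# prints ['Car sales', 'Sports cars', 'Statistics']
```

The keywords would come from your query result (for example, all rows of the keyword column). Overlapping phrases such as "Sports cars" and "Car sales" are both kept because neither fully contains the other.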
I'm new to Python. I have worked in R for a while, but I'm mostly used to working with data frames. I web-scraped some data, and the output is the four lists included below (with most values removed to create a minimal reproducible example).
I am trying to put these into a list of dictionaries such that the data is set up in the following way.
Each dictionary, rankings[i], should have these keys:
rankings[i]['name']: The name of the restaurant, a string.
rankings[i]['stars']: The star rating, as a string, e.g., '4.5', '4.0'
rankings[i]['numrevs']: The number of reviews, as an integer.
rankings[i]['price']: The price range, as dollar signs, e.g., '$', '$$', '$$$', or '$$$$'.
I get so confused by dictionaries within lists and sequences of sequences in general, so if you have any great resources, please link them here! I've been reading through Think Python.
This is what I have, but it ends up returning one dictionary with lists of values, which is not what I need.
Thanks much!
def Convert(tup, di):
    for w, x,y,z in tup:
        di.setdefault(w, []).extend((w,x,y,z))
    return di
yelp = list(zip(name, stars, numrevs, price))
dictionary = {}
output = Convert(yelp, dictionary)
#data
name = ['Mary Mac’s Tea Room', 'Busy Bee Cafe', 'Richards’ Southern Fried']
stars = ['4.0 ', '4.0 ', '4.0']
numrevs = ['80', '49', '549']
price = ['$$', '$$', '$$']
Update:
This will give me a dictionary of a single restaurant:
def Convert(tup, di):
    dictionary = {}
    for name, stars, numrevs, price in tup:
        yelp = list(zip(name, stars, numrevs, price))
    name, stars, numrevs, price = tup[0]
    entry = {"name" : name,
             "stars" : stars,
             "numrevs": numrevs,
             "price" : price}
    return entry
output = Convert(yelp, dictionary)
output
This is my attempt to iterate over all the restaurants and add them to a list. It looks like I am only getting the final restaurant, and everything else is being overwritten. Perhaps I need to do something like:
def Convert(tup, di):
    dictionary = {}
    for name, stars, numrevs, price in tup:
        yelp = list(zip(name, stars, numrevs, price))
        for i in name, stars, numrevs, price:
            l = []
            entry = {"name" : name,
                     "stars" : stars,
                     "numrevs": numrevs,
                     "price" : price}
            l.append(entry)
    return l
output = Convert(yelp, dictionary)
output
There is nothing in your function that attempts to build a list of dicts in the format you want. You make each value with the extend method. This adds to a list; it does not construct a dict.
Start from the inside: build a single dict in the format you want, from only the first set of elements. Also, use meaningful variable names -- do you expect everyone else to learn your one-letter mappings, so you can save a few seconds of typing?
name, stars, numrevs, price = tup[0]
entry = {"name" : name,
         "stars" : stars,
         "numrevs": numrevs,
         "price" : price}
That is how you make a single dict for your list of dicts.
That should get you started. Now you need to (1) learn to add entry to a list; (2) iterate through all of the quadruples in tup (another meaningless name to replace) so that you end up adding each entry to the final list, which you return.
Update for second question:
It keeps overwriting because you explicitly tell it to do so. Every time through the loop, you reset l to an empty list. Stop that! There are plenty of places that show you how to accumulate iteration results in a list. Start by lifting the initialization to before the loop.
l = []
for ...
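Putting the two hints together (initialise the list once, then append one dict per restaurant), a minimal sketch could look like the following. The function name build_rankings is made up for illustration, and stripping the stray spaces from the star ratings and converting the review counts to int are my reading of the key descriptions in the question.

```python
def build_rankings(names, stars, numrevs, prices):
    """Build one dict per restaurant from four parallel lists."""
    rankings = []  # initialised once, before the loop
    for name, star, numrev, price in zip(names, stars, numrevs, prices):
        entry = {"name": name,
                 "stars": star.strip(),    # '4.0 ' -> '4.0'
                 "numrevs": int(numrev),   # '80' -> 80
                 "price": price}
        rankings.append(entry)
    return rankings

# data from the question
name = ['Mary Mac’s Tea Room', 'Busy Bee Cafe', 'Richards’ Southern Fried']
stars = ['4.0 ', '4.0 ', '4.0']
numrevs = ['80', '49', '549']
price = ['$$', '$$', '$$']

rankings = build_rankings(name, stars, numrevs, price)
print(rankings[1])
```

Each element of rankings is then a separate dict, so rankings[1]['numrevs'] is the integer 49.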
First question here, so apologies if there are any mistakes or unclear points!
I am trying to develop a sort of search engine to look through some tabular data in a pandas dataframe, but am getting partial matches included in the search.
For example, I have a table with the following values:
            style
release_id
7306        House, Deep House
37759       House, Tech House
38319       House, Techno
39202       House
I want to select the rows where the style matches my input exactly (e.g. 'House'), using the code:
df_search_2 = df_search[(df_search['style'].str.match('House'))]
However, this also returns the other rows where the style contains the word House:
            style
release_id
7306        House, Deep House
37759       House, Tech House
39202       House
Furthermore, when I try to run a search with multiple tags eg: 'House, Deep House', I end up with an empty dataframe, even if the string is in fact contained in a row.
Any help on the matter would be greatly appreciated.
You can test for equality instead of using str.match, which keeps only the rows whose style is exactly 'House':
df_search_2 = df_search[df_search['style'] == 'House']
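If the search also needs to handle several tags at once, one option is to compare sets of tags rather than raw strings. The sketch below shows only the matching logic in plain Python; the helper names split_tags, has_exactly and has_all are hypothetical, and the comma-separated format is taken from the sample data in the question.

```python
def split_tags(style):
    """Turn 'House, Deep House' into {'house', 'deep house'}."""
    return {tag.strip().lower() for tag in style.split(",") if tag.strip()}

def has_exactly(style, wanted):
    """True if the row's tags are exactly the wanted tags, in any order."""
    return split_tags(style) == split_tags(wanted)

def has_all(style, wanted):
    """True if the row has every wanted tag (it may have others too)."""
    return split_tags(wanted) <= split_tags(style)

# release_id -> style, copied from the question
styles = {7306: "House, Deep House",
          37759: "House, Tech House",
          38319: "House, Techno",
          39202: "House"}

exact = [rid for rid, style in styles.items() if has_exactly(style, "House")]
multi = [rid for rid, style in styles.items() if has_all(style, "House, Deep House")]
print(exact)  # prints [39202]
print(multi)  # prints [7306]
```

With a dataframe, the same predicate could be used to build a boolean mask, for example df_search[df_search['style'].apply(lambda s: has_exactly(s, 'House'))].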
I have a dataframe of +- 130k tweets, alongside a label (1=positive, 0=negative). From this dataframe, I want to extract the tweets that are movie-related. To do this, I've come up with a list of movie-related words:
movie_related_words = ["movie", "movies", "watch",
                       "watching", "film", "cinema",
                       "actor", "video", "thriller",
                       "horror", "dvd", "bluray", "soundtrack",
                       "director", "remake", "blockbuster"]
After some pre-processing the tweets in the dataframe are tokenized, so that the text column of my dataframe contains each tweet as a list, where every word is a separate list element. For your reference, please find three random elements of my dataframe below:
[well, time, for, bed, 500, am, comes, early, nice, chatting, with, everyone, have, a, good, evening, and, rest, of, the, weekend, whats, left, of, it]
[tekkah, defyingsantafe, umm, dont, forget, that, youre, all, gay, socialist, atheists]
[s, mom, nearly, got, ran, over, by, a, truck, on, her, bike, and, dropped, her, work, bag, with, all, her, information, which, was, then, stolen, fb]
I want to filter the tweets: when any word of a given tweet (i.e. any element of its list) is in the movie_related_words list, I want to retain that observation; if not, I want to discard it.
I have tried applying a lambda expression like so:
def filter_movies(text):
    movie_filtered = "".join([i for i in text if i in movie_related_words])
    return movie_filtered
twitter_loaded_df['text'] = twitter_loaded_df['text'].apply(lambda x : filter_movies(x))
But it gives me a strange result. Any guidance on how to achieve this would be greatly appreciated. A pythonic/efficient way will result in eternal love from me for you. My hope is that there exists some kind of pandas function for this purpose, but I have not yet found it...
If I understood you right, try:
twitter_loaded_df['movie_related'] = twitter_loaded_df['text'].map(lambda x: any(word in movie_related_words for word in x))
It should add a column "movie_related" containing True if any of the tweet's words are in your list, and False otherwise (including for empty tweets).
How about this:
MOVIE_RELATED_WORDS = set(["movie", "movies", "watch",
                           "watching", "film", "cinema",
                           "actor", "video", "thriller",
                           "horror", "dvd", "bluray", "soundtrack",
                           "director", "remake", "blockbuster"])

def contains_movie_word(words):
    return any(word in MOVIE_RELATED_WORDS for word in words)

is_movie_related = df['text'].apply(contains_movie_word)
df = df[is_movie_related]  # Filter using boolean series
The advantages of this approach are:
It short-circuits (returns early) as soon as a single movie-related word is found in a given tweet.
It is O(N_tweet_words) for each row in the dataframe, since set lookups are O(1) on average.
Example:
import pandas
df = pandas.DataFrame({'text': [['Hello', 'world'], ['Great', 'movie'], ['Bad', 'weather']]})
Here, df is:
             text
0  [Hello, world]
1  [Great, movie]
2  [Bad, weather]
After applying the solution, is_movie_related is:
0    False
1     True
2    False
Name: text, dtype: bool
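The short-circuiting mentioned above can be observed directly. The sketch below is a pandas-free variant of contains_movie_word that records every word it actually looks at; the extra examined parameter and the shortened word set are additions for illustration only.

```python
MOVIE_RELATED_WORDS = {"movie", "film", "cinema"}  # shortened set, for illustration only

def contains_movie_word(words, examined):
    """Same test as above, but appends each word it inspects to `examined`."""
    def checks():
        for word in words:
            examined.append(word)
            yield word in MOVIE_RELATED_WORDS
    # any() stops pulling from the generator at the first True
    return any(checks())

examined = []
result = contains_movie_word(["great", "movie", "last", "night"], examined)
print(result, examined)  # prints True ['great', 'movie']
```

The words after "movie" are never inspected, which is the early return described in the list of advantages.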
I am using FuzzyWuzzy to match a string against tuples that each contain two strings. For example:
from fuzzywuzzy import fuzz, process
query = "cat"
animals = [('cat','owner1'),('dog','owner3'),('lizard','owner45')]
result = process.extractOne(query, animals, scorer=fuzz.ratio)
This code returns an error because the list being compared to, animals, is not a list of strings. I would only like to compare to the 1st item in each tuple. What I would like to be returned is: (('cat','owner1'), 100) because it is a 100% match.
The below code works, outputting ('cat', 100) but I need the other part of the tuple.
from fuzzywuzzy import fuzz, process
query = "cat"
animals = ["cat","dog",'lizard']
result = process.extractOne(query, animals, scorer=fuzz.ratio)
print(result)
Any ideas?
edit: I know that I can get a list of 1st elements with a list comprehension, but for memory and speed reasons, I would like to do this without creating a new list, because I am working with large data sets.
From your list of tuples you can create a sub-list containing only the first item of each tuple by using a list comprehension.
>>> animal_owners = [('cat','owner1'),('dog','owner3'),('lizard','owner45')]
>>> [ao[0] for ao in animal_owners]
['cat', 'dog', 'lizard']
With this technique you can substitute the second expression where you only need the animals while leaving the original list alone.
I know the post is older, but this is an issue I just had to contend with and found a way to do it! If you look at its signature:
process.extractOne(
    query,
    choices,
    processor: function=function,
    scorer: function=function,
    score_cutoff: int=0
)
you can use the processor function to return the part of each tuple that you want analyzed. For example, say you have a list of company names and their ticker symbols in tuples, and want to get the closest match based on the company name:
from fuzzywuzzy import process
def get_company_name(tup):
    return tup[0]

choices = [
    ('Apple, Inc.', 'AAPL'),
    ('Google, Inc.', 'GOOGL'),
    ('Tesla, Inc.', 'TSLA')
]
closest_match = process.extractOne("apple", choices, processor=get_company_name)
The script then returns a tuple containing the best-matching tuple and the percentage match:
(('Apple, Inc.', 'AAPL'), 100)
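If you would rather not depend on how FuzzyWuzzy treats the processor, the same idea (score only the first element of each tuple, without building a second list) can be sketched with the standard library. Note that this swaps in difflib.SequenceMatcher as the scorer, so the scores are not guaranteed to equal FuzzyWuzzy's; best_match is a made-up helper name.

```python
from difflib import SequenceMatcher

def best_match(query, choices, key=lambda choice: choice[0]):
    """Return (choice, score) for the choice whose key is most similar to query.

    Scores run from 0 to 100. Raises ValueError if choices is empty.
    """
    def score(choice):
        return SequenceMatcher(None, query.lower(), key(choice).lower()).ratio()
    best = max(choices, key=score)  # iterates over choices; no copy is made
    return best, round(100 * score(best))

animals = [('cat', 'owner1'), ('dog', 'owner3'), ('lizard', 'owner45')]
print(best_match("cat", animals))  # prints (('cat', 'owner1'), 100)
```

Because max takes any iterable, choices can also be a generator over a large data set, as long as it only needs to be read once.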
This is a homework question, I got the basics down, but I can't seem to find the correct method of searching two parallel arrays.
Original Question: Design a program that has two parallel arrays: a String array named people that is initialized with the names of seven people, and a String array named phoneNumbers that is initialized with your friends' phone numbers. The program should allow the user to enter a person's name (or part of a person's name). It should then search for that person in the people array. If the person is found, it should get that person's phone number from the phoneNumbers array and display it. If the person is not found, program should display a message indicating so.
My current code:
# create main
def main():
    # take in name or part of persons name
    person = raw_input("Who are you looking for? \n> ")
    # convert string to all lowercase for easier searching
    person = person.lower()
    # run people search with the "person" as the parameters
    peopleSearch(person)

# create module to search the people list
def peopleSearch(person):
    # create list with the names of the people
    people = ["john",
              "tom",
              "buddy",
              "bob",
              "sam",
              "timmy",
              "ames"]
    # create list with the phone numbers, indexes are corresponding with the names
    # people[0] is phoneNumbers[0] etc.
    phoneNumbers = ["5503942",
                    "9543029",
                    "5438439",
                    "5403922",
                    "8764532",
                    "8659392",
                    "9203940"]
Now, my entire problem begins here. How do I conduct a search (or partial search) on a name, return the index of the person's name in the people array, and print the phone number accordingly?
Update: I added this to the bottom of the code in order to conduct the search.
lookup = dict(zip(people, phoneNumbers))
if person in lookup:
    print "Name: ", person ," \nPhone:", lookup[person]
But this only works for full matches, so I tried using this to get a partial match:
[x for x in enumerate(people) if person in x[1]]
But when I search it on 'tim' for example, it returns [(5, 'timmy')]. How do I get that index of 5 and apply it in print phoneNumbers[the index returned from the search]?
Update 2: Finally got it to work perfectly. Used this code:
# conduct a search for the person in the people list
search = [x for x in enumerate(people) if person in x[1]]
# for each person that matches the "search", print the name and phone
for index, person in search:
    # print name and phone of each person that matches search
    print "Name: ", person , "\nPhone: ", phoneNumbers[index]
# if there is nothing that matches the search
if not search:
    # display message saying no matches
    print "No matches."
Since this is homework, I'll refrain from giving the code outright.
You can create a dict that works as a lookup table with the name as the key and the phone number as its value.
Creating the lookup table:
You can easily convert the parallel arrays into a dict using dict() and zip(). Something along the lines of:
lookup = dict(zip(people, phoneNumbers))
To see how that works, have a look at this example:
>>> people = ["john", "jacob", "bob"]
>>> phoneNumbers = ["5503942", "8659392", "8659392"]
>>> zip(people, phoneNumbers)
[('john', '5503942'), ('jacob', '8659392'), ('bob', '8659392')]
>>> dict(zip(people, phoneNumbers))
{'jacob': '8659392', 'bob': '8659392', 'john': '5503942'}
Finding if a person exists:
You can quickly figure out if a person (key) exists in the lookup table using:
if name in lookup:
    # ... phone number will be lookup[name]
List of people whose name matches substring:
This answer should put you on the right track.
And of course, if the search returns an empty list there are no matching names and you can display an appropriate message.
Alternative suggestion
Another approach is to search the list directly and obtain the index of matches which you can then use to retrieve the phone number.
I'll offer you this example and leave it up to you to expand it into a viable solution.
>>> people = ["john", "jacob", "bob", "jacklyn", "cojack", "samantha"]
>>> [x for x in enumerate(people) if "jac" in x[1]]
[(1, 'jacob'), (3, 'jacklyn'), (4, 'cojack')]
If you hit a snag along the way, share what you've done and we'll be glad to assist.
Good luck, and have fun.
Response to updated question
Note that I've provided two alternative solutions, one using a dict as a lookup table and another searching the list directly. Your updates indicate you're trying to mix both solutions together, which is not necessary.
If you need to search through all the names for substring matches, you might be better off with the second solution (searching the list directly). The code example I provided returns a list (since there may be more than one name that contains that substring), with each item being a tuple of (index, name). You'll need to iterate through the list and extract the index and name. You can then use the index to retrieve the phone number.
To avoid just giving you the solution, here's a related example:
>>> people = ["john", "jacob", "bob", "jacklyn", "cojack", "samantha"]
>>> matches = [x for x in enumerate(people) if "jac" in x[1]]
>>> for index, name in matches:
...     print index, name
...
1 jacob
3 jacklyn
4 cojack
>>> matches = [x for x in enumerate(people) if "doesnotexist" in x[1]]
>>> if not matches:
...     print "no matches"
...
no matches
You might want to look here for the answer to How do I ... return the index of the persons name in the people array.
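Since the asker has already posted a working version in Update 2, here is a minimal Python 3 sketch of the same list-search approach wrapped in a function, which makes it easier to test. The function name find_phone_numbers is made up for illustration; the data is copied from the question.

```python
def find_phone_numbers(query, people, phone_numbers):
    """Return (name, phone) pairs for every name that contains the query text."""
    query = query.lower()
    return [(name, phone_numbers[index])
            for index, name in enumerate(people)
            if query in name.lower()]

people = ["john", "tom", "buddy", "bob", "sam", "timmy", "ames"]
phone_numbers = ["5503942", "9543029", "5438439", "5403922",
                 "8764532", "8659392", "9203940"]

matches = find_phone_numbers("tim", people, phone_numbers)
if matches:
    for name, phone in matches:
        print("Name:", name, "\nPhone:", phone)
else:
    print("No matches.")
```

The index from enumerate is what ties the two parallel lists together: the phone number for people[index] is phone_numbers[index].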