Select exact match in pandas row or with combined search terms - python

First question here, so apologies if there are any mistakes or unclear points!
I am trying to develop a sort of search engine to look through some tabular data in a pandas dataframe, but am getting partial matches included in the search.
For example, I have a table with the following values:
release_id  style
7306        House, Deep House
37759       House, Tech House
38319       House, Techno
39202       House
And I want to select only the rows where the style exactly matches my input, e.g. 'House', with the code:
df_search_2 = df_search[(df_search['style'].str.match('House'))]
However, this also returns the other rows where the style contains the word House:
release_id  style
7306        House, Deep House
37759       House, Tech House
39202       House
Furthermore, when I try to run a search with multiple tags, e.g. 'House, Deep House', I end up with an empty dataframe, even though the string is in fact contained in a row.
Any help on the matter would be greatly appreciated.

You can use this:
df_search_2 = df_search[df_search['style'] == 'House']
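A minimal sketch of both cases from the question, assuming the styles are stored as comma-separated strings. The sample frame is rebuilt from the table in the question; the helper has_all_tags and the variable names are made up for illustration. Equality handles the single-tag exact match, and comparing sets of tags is one possible way to handle a multi-tag search.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df_search = pd.DataFrame(
    {"style": ["House, Deep House", "House, Tech House", "House, Techno", "House"]},
    index=pd.Index([7306, 37759, 38319, 39202], name="release_id"),
)

# Exact match: the whole cell must equal the search term.
exact = df_search[df_search["style"] == "House"]


def has_all_tags(style, wanted):
    """Return True if every wanted tag appears among the comma-separated tags."""
    tags = {tag.strip() for tag in style.split(",")}
    return set(wanted) <= tags


# Multi-tag search: keep rows that carry all of the requested tags.
wanted = ["House", "Deep House"]
multi = df_search[df_search["style"].apply(lambda s: has_all_tags(s, wanted))]

print(exact)
print(multi)
```

Splitting on commas avoids the partial matches that a substring or regex prefix test produces, since 'House' and 'Deep House' are compared as whole tags.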

Related

Processing dictionaries in Python with large amount of data

I am trying to process IMDb data with a Python dictionary. After some basic data cleaning, I have a dictionary people_dict, which looks like
people_dict = {...,936: ['And White Was the Night (2015)', 'Lipton Cockton in the Shadows of Sodoma (1995)', 'Maraton (1997)', 'Rundi (1990)', 'Sounds Like Suomi (2008)'],...}
where the key stands for the id of an actor/actress and the list is a set of movies he/she has acted in.
Now I am trying to get another dictionary movie_dict based on people_dict, which looks like
movie_dict = {...,'Beats, Rhymes & Life: The Travels of a Tribe Called Quest (2011)': [3],...}
where the key is the name of a movie and the value is a list of actor/actress ids.
However, my implementation (see below) uses nested loops, and almost 100,000 movies and actors/actresses are involved. Optimistically, it could give me what I want in a week.
for value in movie_dict.keys():
    for people_id, movie_list in people_dict.items():
        if value in movie_list:
            movie_dict[value].append(people_id)
So is there anything I could do to significantly reduce the runtime? I have checked out this thread, where map seems to be a good option.
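One possible way to avoid the nested loops is to invert people_dict in a single pass, so each movie list is visited once instead of once per movie. This is only a sketch under the assumption that the data has the shape shown above; the sample entries are shortened, and the id 937 is made up for illustration.

```python
from collections import defaultdict

# Shortened sample in the shape described in the question (id 937 is hypothetical).
people_dict = {
    3: ["Beats, Rhymes & Life: The Travels of a Tribe Called Quest (2011)"],
    936: ["Maraton (1997)", "Rundi (1990)"],
    937: ["Rundi (1990)"],
}

# Invert the mapping: walk every (person, movie) pair exactly once.
movie_dict = defaultdict(list)
for people_id, movie_list in people_dict.items():
    for movie in movie_list:
        movie_dict[movie].append(people_id)

print(dict(movie_dict))
```

The work is proportional to the total number of credits rather than to the number of movies times the number of people, because no list is searched with `in`.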

List Matching in Python using nested for loops

I have three lists: (1) treatments, (2) medicine names and (3) medicine code symbols. I am trying to identify the respective medicine code symbol for each of 14,700 treatments. My current approach is to check whether any name in (2) is "in" (1), and then return the corresponding (3). However, I get back an arbitrary list (of the correct length) of medicine code symbols for the 14,700 treatments. Code for the method I've written is below:
codes = pandas.read_csv('Codes.csv', dtype=str)
codes_list = codes.values.tolist()
names = pandas.read_csv('Names.csv', dtype=str)
names_list = names.values.tolist()
treatments = pandas.read_csv('Treatments.csv', dtype=str)
treatments_list = treatments.values.tolist()
matched_codes_list = range(len(treatments_list))
for i in range(len(treatments_list)):
    for j in range(len(names_list)):
        if names_list[j] in treatments_list[i]:
            matched_codes_list[i]=codes_list_text[j]
print matched_codes_list
Any suggestions for where I am going wrong would be much appreciated!
I can't tell what you are expecting. You should replace the xxx_list code with examples instead, since you don't seem to have any problems with the csv reading.
Let's suppose you did that, and your result looks like this.
codes_list = ['shark', 'panda', 'horse']
names_list = ['fin', 'paw', 'hoof']
assert len(codes_list) == len(names_list)
treatments_list = ['tape up fin', 'reverse paw', 'stand on one hoof', 'pawn affinity maneuver', 'alert wing patrol']
It sounds like you are trying to determine the 'code' for each 'treatment', assuming that the numbers of codes and names are the same (and that their positions define the mapping). You plan to use the presence of the name in the treatment text to determine the code.
We can zip together the names and codes lists to avoid using indexes there, and we can iterate over the treatment list directly instead of using indexes, for Pythonic readability:
matched_codes_list = []
for treatment in treatments_list:
    matched_codes = []
    for name, code in zip(names_list, codes_list):
        if name in treatment:
            matched_codes.append(code)
    matched_codes_list.append(matched_codes)
This would give something like
assert matched_codes_list == [
    ['shark'],           # 'tape up fin'
    ['panda'],           # 'reverse paw'
    ['horse'],           # 'stand on one hoof'
    ['shark', 'panda'],  # 'pawn affinity maneuver'
    [],                  # 'alert wing patrol'
]
Note that the method used to do this is quite slow (and will probably give false positives, see the 4th entry). You will traverse the text of all treatment descriptions once for each name/code pair.
You can use a dictionary like lookup = {name: code for name, code in zip(names_list, codes_list)}, or itertools.izip for minor gains. Otherwise something more clever might be needed, perhaps splitting treatments into a set of words, or mapping words to multiple codes.
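The word-splitting idea can be sketched as follows, reusing the made-up sample lists from above. It assumes that a name only counts when it appears as a whole whitespace-separated word, which may or may not suit the real treatment descriptions.

```python
names_list = ['fin', 'paw', 'hoof']
codes_list = ['shark', 'panda', 'horse']
treatments_list = ['tape up fin', 'reverse paw', 'stand on one hoof',
                   'pawn affinity maneuver', 'alert wing patrol']

# Map each name to its code once, so every word costs a single dict lookup.
lookup = dict(zip(names_list, codes_list))

# Split each treatment into words and keep the codes of the words that are names.
matched_codes_list = [
    [lookup[word] for word in treatment.split() if word in lookup]
    for treatment in treatments_list
]

print(matched_codes_list)
```

With whole-word matching, 'pawn affinity maneuver' no longer picks up 'paw' or 'fin', and each treatment is scanned once regardless of how many names there are.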

Searching for Phrase Keywords in MySQL

I have a table which has two columns: ID (primary key, auto increment) and keyword (text, full-text index).
The values entered in the keyword column include the following:
keyword
Car
Car sales
Cars
Sports cars
Sports foo
Car bar
Statistics
Suppose that we have this sentence as an input:
"Find sports car sales statistics in Manhattan."
I'm looking (and I have been searching for quite a while) to find either a MySQL query or an algorithm which takes in the given input, and detects the keywords used from the keywords column, resulting in an output of:
"Sports cars", "Car sales", "Statistics"
In other words, I'm trying to take an input that's in the form of a sentence, and then match all the existing (and most relevant) keyword values in the database that are found in the sentence. Note that these keywords could be phrases that consist of words separated by a space.
After researching, I learned that MySQL does a similar job through its full-text search feature. I have tried all the natural language, boolean, and query expansion options, but they include keyword records that have only part of their contents matching the input. For example, it outputs:
"Car", "Car sales", "Sports cars", "Sports foo", "Car bar", "Statistics".
I don't want this to happen because it includes words that aren't even in the input (i.e. foo and bar).
Here's the MySQL query for the above mentioned search:
SELECT * FROM tags WHERE MATCH(keyword) AGAINST('Find sports car sales statistics in Manhattan.' IN BOOLEAN MODE)
I also tried to improve on the relevancy, but this one only returns a single record:
SELECT *, SUM(MATCH(keyword) AGAINST('Find sports car sales statistics in Manhattan.' IN BOOLEAN MODE)) as score FROM tags WHERE MATCH(keyword) AGAINST('Find sports car sales statistics in Manhattan.' IN BOOLEAN MODE) ORDER BY score DESC
Assuming you have your column in a Python collection, a Pythonic way to handle such tasks is set.intersection, which returns the intersection of a set with another iterable (the argument can be a list or tuple as well as a set):
>>> col={'Car','Car sales','Cars','Sports cars','Sports foo','Car bar','Statistics'}
>>> col={i.lower() for i in col}
>>> s="Find sports car sales statistics in Manhattan."
>>> col.intersection(s.strip('.').split())
set(['car', 'statistics'])
In your case you can put the result of your query into a set, or convert it to one. The set comprehension above lowercases the elements of your column. Note, however, that this recipe only intersects your column with the sentence split on whitespace, so multi-word keywords such as 'car sales' are never matched and the result is limited to the single words shown above.
Another way is to use re.search:
>>> import re
>>> col={'Car','Car sales','Cars','Sports cars','Sports foo','Car bar','Statistics'}
>>> s='Find sports car sales statistics in Manhattan.'
>>> for i in col:
...     g=re.search('{}'.format(i),s,re.IGNORECASE)
...     if g:
...         print g.group(0)
...
statistics
car sales
car
As a simple approach, you can use a function like the following to generate the combinations of the words in a phrase, including naive plural forms:
from itertools import permutations
def combs(phrase):
    sp=phrase.split()
    com1=[map(lambda x:' '.join(x),li) for li in [permutations(sp,j) for j in range(1,len(sp)+1)]]
    for i,k in enumerate(sp):
        if not k.endswith('s'):
            sp[i]=k+'s'
    com2=[map(lambda x:' '.join(x),li) for li in [permutations(sp,j) for j in range(1,len(sp)+1)]]
    return com1+com2
print {j for i in combs('Car sales') for j in i}
set(['Car', 'sales', 'sales Cars', 'Car sales', 'Cars sales', 'sales Car', 'Cars'])
Note that this function could be more efficient and complete.
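A different route to the output requested in the question can be sketched in Python 3. It is not the approach used above: it assumes a very naive singular/plural rule (dropping one trailing 's'), requires the words of a keyword to appear next to each other in the sentence, and discards a keyword when a longer matched keyword already contains it. The function names are made up for illustration.

```python
import re


def normalize(text):
    """Lowercase, split into words and drop one trailing 's' (naive plural rule)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return tuple(w[:-1] if w.endswith("s") and len(w) > 1 else w for w in words)


def contains(haystack, needle):
    """Return True if needle occurs as a contiguous run inside haystack."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


def match_keywords(sentence, keywords):
    sent = normalize(sentence)
    # Keep keywords whose normalized words appear next to each other in the sentence.
    hits = [(kw, normalize(kw)) for kw in keywords]
    hits = [(kw, norm) for kw, norm in hits if norm and contains(sent, norm)]
    # Drop keywords that are fully covered by a longer matched keyword.
    return [
        kw for kw, norm in hits
        if not any(len(other) > len(norm) and contains(other, norm) for _, other in hits)
    ]


keywords = ["Car", "Car sales", "Cars", "Sports cars", "Sports foo", "Car bar", "Statistics"]
result = match_keywords("Find sports car sales statistics in Manhattan.", keywords)
print(result)
```

The trailing-'s' rule is only a stand-in for real stemming and will mishandle words such as 'bus', so treat it as a placeholder.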

Whoosh NestedChildren search not returning all results

I'm making a search index which must support nested hierarchies of data.
For test purposes, I'm making a very simple schema:
test_schema = Schema(
    name_ngrams=NGRAMWORDS(minsize=4, field_boost=1.2),
    name=TEXT(stored=True),
    id=ID(unique=True, stored=True),
    type=TEXT
)
For test data I'm using these:
test_data = [
    dict(
        name=u'The Dark Knight Returns',
        id=u'chapter_1',
        type=u'chapter'),
    dict(
        name=u'The Dark Knight Triumphant',
        id=u'chapter_2',
        type=u'chapter'),
    dict(
        name=u'Hunt The Dark Knight',
        id=u'chapter_3',
        type=u'chapter'),
    dict(
        name=u'The Dark Knight Falls',
        id=u'chapter_4',
        type=u'chapter')
]
parent = dict(
    name=u'The Dark Knight Returns',
    id=u'book_1',
    type=u'book')
I've added to the index all the (5) documents, like this
with index_writer.group():
    index_writer.add_document(
        name_ngrams=parent['name'],
        name=parent['name'],
        id=parent['id'],
        type=parent['type']
    )
    for data in test_data:
        index_writer.add_document(
            name_ngrams=data['name'],
            name=data['name'],
            id=data['id'],
            type=data['type']
        )
So, to get all the chapters for a book, I've made a function which uses a NestedChildren search:
def search_childs(query_string):
    os.chdir(settings.SEARCH_INDEX_PATH)
    # Initialize index
    index = open_dir(settings.SEARCH_INDEX_NAME, indexname='test')
    parser = qparser.MultifieldParser(
        ['name',
         'type'],
        schema=index.schema)
    parser.add_plugin(qparser.FuzzyTermPlugin())
    parser.add_plugin(DateParserPlugin())
    myquery = parser.parse(query_string)
    # First, we need a query that matches all the documents in the "parent"
    # level we want of the hierarchy
    all_parents = And([parser.parse(query_string), Term('type', 'book')])
    # Then, we need a query that matches the children we want to find
    wanted_kids = And([parser.parse(query_string),
                       Term('type', 'chapter')])
    q = NestedChildren(all_parents, wanted_kids)
    print q
    with index.searcher() as searcher:
        #these results are the parents
        results = searcher.search(q)
        print "number of results:", len(results)
        if len(results):
            for result in results:
                print(result.highlights('name'))
                print(result)
        return results
But for my test data, if I search for "dark knigth", I'm only getting 3 results when there should be 4.
I don't know if the missing result is excluded for having the same name as the book, but it simply doesn't show up in the search results.
I know that all the items are in the index, but I don't know what I'm missing here.
Any thoughts?
Turns out that I was using NestedChildren wrong.
Here is the answer I got from Matt Chaput in Google Groups:
I'm making a search index which must support nested hierarchies of data.
The second parameter to NestedChildren isn't what you think it is.
TL;DR: you're using the wrong query type. Let me know what you're trying to do, and I can tell you how to do it :)
ABOUT NESTED CHILDREN
(Note, I found a bug, see the end)
NestedChildren is hard to understand, but hopefully I can try to explain it better.
NestedChildren is about searching for certain PARENTS, but getting their CHILDREN as the hits.
The first argument is a query that matches all documents of the "parent" class (e.g. "type:book"). The second argument is a query that matches all documents of the parent class that match your search criteria (e.g. "type:book AND name:dark").
In your example, this would mean searching for a certain book, but getting its chapters as the search results.
This isn't super useful on its own, but you can combine it with queries on the children to do complex queries like "show me chapters with 'hunt' in their names that are in books with 'dark' in their names":
# Find the children of books matching the book criterion
all_parents = query.Term("type", "book")
wanted_parents = query.Term("name", "dark")
children_of_wanted_parents = query.NestedChildren(all_parents, wanted_parents)
# Find the children matching the chapter criterion
wanted_chapters = query.And([query.Term("type", "chapter"),
                             query.Term("name", "hunt")])
# The intersection of those two queries are the chapters we want
complex_query = query.And([children_of_wanted_parents,
                           wanted_chapters])
OR, at least, that's how it SHOULD work. But I just found a bug in the implementation of NestedChildren's skip_to() method that makes the above example not work :( :( :( The bug is now fixed on Bitbucket, I'll have to make a new release.
Cheers,
Matt

Group related objects in Django

I'm building an app where you can search for objects in a database (let's assume the objects you search for are persons). What I want to do is to group related objects, for example married couples. In other words, if two people share the same last name, we assume that they are married (not a great example, but you get the idea). The last name is the only thing that identifies two people as married.
In the search results I want to display the married couples next to each other, and all the other persons by themselves.
Let's say you search for "John", this is what I want:
John Smith - Jane Smith
John Adams - Nancy Adams
John Washington
John Andersson
John Ryan
Each name is then a link to that person's profile page.
What I have right now is a function that finds all pairs, and returns a list of tuples, where each tuple is a pair. The problem is that on the search results, every name that is in a pair is listed twice.
I do a query for the search query (Person.objects.filter(name__contains="John")), and the result of that query is sent to the match function. I then send both the original queryset and the match function result to the template.
I guess I could just exclude every person that the match function finds a match for, but is that the most efficient solution?
Edit:
As I wrote in a comment, the actual strings that I want to match are not identical. To quote myself:
In fact, the strings I want to match are not identical, instead they
look more like this: "foo2(bar13)" - "foo2(bar14)". That is, if two
strings have the same foo id (2), and if the bar id is an odd number
(13), then its match is the bar id + 1 (14). I have a regular
expression to find these matches
First get your objects sorted by last name:
def keyfun(p):
    return p.name.split()[-1]
persons = sorted(Person.objects.all(), key=keyfun)
Then use groupby:
from itertools import groupby
for lname, group in groupby(persons, keyfun):
    print ' - '.join(p.name for p in group)
Update: Yes, this solution works for your new requirement too. All you need is a stable way to generate keys for each item, and replace the body of the keyfun with it:
from re import findall
def keyfun(p):
    # findall takes the pattern first, then the string to search
    v1, v2 = findall(r'\d+', p.name)
    tot = int(v1) + int(v2) % 2
    return tot
Your description for how to generate the key for each item is not clear enough, although you should be able to figure it out yourself with the above example.
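For the rule quoted in the edit (same foo id, and an odd bar id pairs with bar id + 1), one possible key is the foo id together with (bar id + 1) // 2, which is equal for 13 and 14 but not for 14 and 15. The sketch below uses plain strings instead of Person objects, and the regular expression is an assumption based on the "foo2(bar13)" example.

```python
import re
from itertools import groupby

# Assumed name format, based on the "foo2(bar13)" example in the question.
PATTERN = re.compile(r"foo(\d+)\(bar(\d+)\)")


def pair_key(name):
    """Give both members of a pair the same key: (foo id, pair index)."""
    foo_id, bar_id = map(int, PATTERN.fullmatch(name).groups())
    # An odd bar id n and its partner n + 1 share the value (n + 1) // 2.
    return (foo_id, (bar_id + 1) // 2)


names = ["foo2(bar14)", "foo3(bar5)", "foo2(bar13)", "foo2(bar15)"]

# groupby only merges neighbours, so sort by the same key first.
ordered = sorted(names, key=pair_key)
groups = [list(group) for _, group in groupby(ordered, key=pair_key)]

for group in groups:
    print(" - ".join(group))
```

Names without a partner end up in a group of one, which matches the layout where unpaired people are listed by themselves.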
