Searching a normal query in an inverted index - python

I have a full inverted index in the form of a nested Python dictionary. Its structure is:
{word : { doc_name : [location_list] } }
For example, let the dictionary be called index; then for a word "spam", the entry would look like:
{ spam : { doc1.txt : [102,300,399], doc5.txt : [200,587] } }
so that the documents containing any word are given by index[word].keys(), and its frequency in a document by len(index[word][document]).
Now my question is: how do I implement a normal query search in this index? I.e. given a query containing, let's say, 4 words, find the documents containing all four words (ranked by total frequency of occurrence), then the documents containing 3 of the words, and so on.
Edit: I added the code below, using S. Lott's answer. This is the code I have written. It works exactly as I want (just some formatting of the output is needed), but I know it could be improved.
from collections import defaultdict
from operator import itemgetter

# Take input
query = input(" Enter the query : ")

# Some preprocessing
query = query.lower()
query = query.strip()

# now the real work
wordlist = query.split()
search_words = [x for x in wordlist if x in index]   # list of query words that are present in the index
print("\nsearching for words ... :", search_words, "\n")

doc_has_word = [(index[word].keys(), word) for word in search_words]
doc_words = defaultdict(list)
for d, w in doc_has_word:
    for p in d:
        doc_words[p].append(w)

# create a dictionary identifying matches for each document
result_set = {}
for i in doc_words.keys():
    count = 0
    matches = len(doc_words[i])      # number of query words matched
    for w in doc_words[i]:
        count += len(index[w][i])    # count total occurrences
    result_set[i] = (matches, count)

# Now print in sorted order
print(" Document \t\t Words matched \t\t Total Frequency ")
print('-' * 40)
for doc, (matches, count) in sorted(result_set.items(), key=itemgetter(1), reverse=True):
    print(doc, "\t", doc_words[doc], "\t", count)
Please comment. Thanks.
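A possible tightening, offered only as a sketch (it assumes index is already built, and it keeps the same ranking: number of matched words first, total frequency second):
from collections import defaultdict

query = input(" Enter the query : ").lower().strip()
search_words = [w for w in query.split() if w in index]

# document -> [number of distinct query words found, total number of occurrences]
scores = defaultdict(lambda: [0, 0])
for word in search_words:
    for doc, positions in index[word].items():
        scores[doc][0] += 1
        scores[doc][1] += len(positions)

print(" Document \t Words matched \t Total frequency ")
print('-' * 48)
for doc, (matches, freq) in sorted(scores.items(), key=lambda kv: tuple(kv[1]), reverse=True):
    print(doc, "\t", matches, "\t", freq)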

Here's a start:
doc_has_word = [ (index[word].keys(),word) for word in wordlist ]
This builds a list of (documents, word) pairs. You can't easily make a dictionary out of that, since each document occurs many times.
But
from collections import defaultdict
doc_words = defaultdict(list)
for d, w in doc_has_word:
    for doc in d:
        doc_words[doc].append(w)
Might be helpful.

import itertools

index = {...}

def query(*args):
    result = []
    doc_count = [(doc, len(index[word][doc])) for word in args for doc in index[word]]
    # groupby only groups consecutive items, so sort by document name first
    doc_count.sort(key=lambda pair: pair[0])
    doc_group = itertools.groupby(doc_count, key=lambda pair: pair[0])
    for doc, group in doc_group:
        result.append((doc, sum(count for _, count in group)))
    return sorted(result, key=lambda x: x[1], reverse=True)
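For example, with the single-word index from the question (note that this ranks by total frequency only, not by how many of the query words matched):
index = {'spam': {'doc1.txt': [102, 300, 399], 'doc5.txt': [200, 587]}}
print(query('spam'))   # [('doc1.txt', 3), ('doc5.txt', 2)]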

Here is a solution for finding the similar documents (the hardest part):
wordList = ['spam','eggs','toast'] # our list of words to query for
wordMatches = [index.get(word, {}) for word in wordList]
similarDocs = reduce(set.intersection, [set(docMatch.keys()) for docMatch in wordMatches])
wordMatches gets a list where each element is a dictionary of the document matches for one of the words being matched.
similarDocs is a set of the documents that contain all of the words being queried for. This is found by taking just the document names out of each of the dictionaries in the wordMatches list, representing these lists of document names as sets, and then intersecting the sets to find the common document names.
Once you have found the documents that are similar, you should be able to use a defaultdict (as shown in S. Lott's answer) to append all of the lists of matches together for each word and each document.
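A minimal sketch of that last step, using the toy index from the question (note that in Python 3 reduce has to be imported from functools):
from collections import defaultdict
from functools import reduce

index = {'spam': {'doc1.txt': [102, 300, 399], 'doc5.txt': [200, 587]}}
wordList = ['spam']

wordMatches = [index.get(word, {}) for word in wordList]
similarDocs = reduce(set.intersection, [set(docMatch.keys()) for docMatch in wordMatches])

# gather every occurrence list per common document, then rank by total frequency
positions = defaultdict(list)
for matches in wordMatches:
    for doc in similarDocs:
        positions[doc].extend(matches.get(doc, []))

ranked = sorted(positions.items(), key=lambda kv: len(kv[1]), reverse=True)
print(ranked)   # [('doc1.txt', [102, 300, 399]), ('doc5.txt', [200, 587])]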
Related links:
This answer demonstrates defaultdict(int). defaultdict(list) works pretty much the same way.
set.intersection example

n-gram fuzzy matching from Dictionary

Given a string S of variable length and a dictionary D of n-grams N, I want to:
extract all N in S that match with a fuzzy matching logic (to catch spelling errors)
extract all Numbers in S
show the results in the same order as they are in S
I accomplished points 1 and 2, but my approach, based on creating n-grams from S and fuzzy-matching them against the dictionary (plus matching the numbers), does not maintain the order in which the items appear in S.
from nltk import everygrams
from flask_caching import Cache
import re

string = "Hello everybody, today we have 2.000 cell phones here"
ngrams = list(everygrams(string.split(), 1, 4))

my_dict = {
    "brand": "ITEM_01",
    "model": "ITEM_02",
    "cell phone": "ITEM_04",
    "today": "ITEM_05"
}

result = ""
results = []                 # list with final results
d = FuzzyDict(my_dict)       # create the dictionary for fuzzy matching

for k in ngrams:
    candidate = ' '.join(k)
    print(f"Searching for {candidate}")
    try:
        # matching n-gram in Dictionary using fuzzy match
        result = d[candidate]
        print(f"Found {result}")
        results.append(result)
    except:
        print("An exception occurred")
    # matching complex numbers
    numbers = re.findall(r'(?:[+-]|\()?\$?\d+(?:,\d+)*(?:\.\d+)?\)?', candidate)
    # appending numbers to list
    results.extend(numbers)

# NOTE: chronological order is not kept!
# keeping unique values since my approach will extract several instances of the same item
myset = set(results)
results_unique = list(myset)
This should give me "ITEM_05 2.000 ITEM_04" (at the moment the order is arbitrary).
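One way to keep the original order, sketched under the assumption that FuzzyDict (not shown above) raises an exception on a failed lookup: build the n-grams by hand so that every candidate keeps the index of its first token, then sort the extracted values by that index.
import re

tokens = string.split()
matches = []   # (position of first token, extracted value)

for start in range(len(tokens)):
    for size in range(1, 5):
        candidate = ' '.join(tokens[start:start + size])
        try:
            matches.append((start, d[candidate]))   # d is the FuzzyDict from above
        except Exception:
            pass
    # numbers are single tokens, so check them once per position
    for number in re.findall(r'(?:[+-]|\()?\$?\d+(?:,\d+)*(?:\.\d+)?\)?', tokens[start]):
        matches.append((start, number))

# drop duplicates, keeping the leftmost occurrence, then restore the original order
seen = set()
ordered = []
for start, value in sorted(matches, key=lambda m: m[0]):
    if value not in seen:
        seen.add(value)
        ordered.append(value)
print(' '.join(ordered))   # e.g. ITEM_05 2.000 ITEM_04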

Python program to find if a certain keyword is present in a list of documents (string)

Question: A researcher has gathered thousands of news articles. But she wants to focus her attention on articles including a specific word.
The function should meet the following criteria:
Do not include documents where the keyword string shows up only as a part of a larger word. For example, if she were looking for the keyword “closed”, you would not include the string “enclosed.”
She does not want you to distinguish upper case from lower case letters. So the phrase “Closed the case.” would be included when the keyword is “closed”
Do not let periods or commas affect what is matched. “It is closed.” would be included when the keyword is “closed”. But you can assume there are no other types of punctuation.
My code:-
keywords = ["casino"]

def multi_word_search(document, keywords):
    dic = {}
    z = []
    for word in document:
        i = document.index(word)
        token = word.split()
        new = [j.rstrip(",.").lower() for j in token]
        for k in keywords:
            if k.lower() in new:
                dic[k] = z.append(i)
            else:
                dic[k] = []
    return dic
It should return {'casino': [0]} for document=['The Learn Python Challenge Casino', 'They bought a car', 'Casinoville?'] and keywords=['casino'], but I got {'casino': []} instead.
I wonder if someone could help me?
I would tokenize each string with split(), as you already do for new, and then build a set from the tokens to speed up the lookups.
If you want case-insensitive matching you need to lowercase both sides:
for k in keywords:
    s = set(new)          # new is already a list of cleaned, lowercased tokens
    if k.lower() in s:
        dic[k] = z.append(i)
    else:
        dic[k] = []
return dic
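Note that the snippet above still inherits the two problems that make the original return empty lists: list.append returns None (so dic[k] = z.append(i) stores None), and the else branch overwrites earlier results. A fully corrected sketch, for reference:
def multi_word_search(documents, keywords):
    dic = {k: [] for k in keywords}
    for i, doc in enumerate(documents):
        # lowercase and strip periods/commas so "Closed the case." matches "closed"
        tokens = {w.rstrip('.,').lower() for w in doc.split()}
        for k in keywords:
            if k.lower() in tokens:
                dic[k].append(i)
    return dic

documents = ['The Learn Python Challenge Casino', 'They bought a car', 'Casinoville?']
print(multi_word_search(documents, ['casino']))   # {'casino': [0]}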
This is not as trivial as it seems. From an NLP (natural language processing) point of view, splitting a text into words is not trivial (it is called tokenisation).
import nltk

# stemmer = nltk.stem.PorterStemmer()

def multi_word_search(documents, keywords):
    # Initialize result dictionary
    dic = {kw: [] for kw in keywords}
    for i, doc in enumerate(documents):
        # Preprocess document
        doc = doc.lower()
        tokens = nltk.word_tokenize(doc)
        # tokens = [stemmer.stem(token) for token in tokens]
        # Search each keyword
        for kw in keywords:
            # kw = stemmer.stem(kw.lower())
            kw = kw.lower()
            if kw in tokens:
                # If found, add to result dictionary
                dic[kw].append(i)
    return dic

documents = ['The Learn Python Challenge Casino', 'They bought a car', 'Casinoville?', 'Some casinos']
keywords = ['casino']
multi_word_search(documents, keywords)
To increase matching you can use stemming by uncommenting the stemmer lines above (stemming removes plurals and verb inflections, e.g. running -> run).
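A quick check of what the (commented-out) stemmer would do, assuming NLTK is installed:
import nltk

stemmer = nltk.stem.PorterStemmer()
print(stemmer.stem('casinos'))   # casino
print(stemmer.stem('running'))   # run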
This should work too..
document = ['The Learn Python Challenge Casino', 'They bought a car', 'Casinoville?']
keywords = ['casino', 'car']

def findme(term):
    for x in document:
        val = x.split(' ')
        for v in val:
            if term.lower() == v.lower():
                return document.index(x)

for key in keywords:
    n = findme(key)
    print(f'{key}:{n}')

Store specific lines from a multiline file as values in a dictionary (Python)

I have a multiline transcript file that contains lines of text and corresponding timestamps. It looks like this:
00:02:01,640 00:02:04,409
word word CHERRY word word
00:02:04,409 00:02:07,229
word APPLE word word
00:02:07,229 00:02:09,380
word word word word
00:02:09,380 00:02:12,060
word BANANA word word word
Now, if the text contains specific words (types of fruit) which I have already stored in a list, these words shall be stored as keys in a dictionary. My code for this:
Dict = {}
FruitList = []
for w in transcript.split():
    if w in my_list:
        FruitList.append(w)
keys = FruitList
The output of printing keys is: ['CHERRY', 'APPLE', 'BANANA'].
Moving on, my problem is that I want to extract the timestamps belonging to the lines containing fruits and store them in the dictionary as values, but only those timestamps whose following line contains a type of fruit.
For this task, I have several code snippets:
values = []   # shall contain timestamps later
timestamp_pattern = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} \d{2}:\d{2}:\d{2},\d{3}")
for i in keys:
    Dict[i] = values[i]
Unfortunately, I have no idea how to write the code in order to get only the relevant timestamps and store them as values with their keys (fruits) in the Dict.
The desired output (Dict) should look like this:
{'CHERRY': '00:02:01,640 -> 00:02:04,409',
'APPLE': '00:02:04,409 -> 00:02:07,229',
'BANANA': '00:02:09,380 -> 00:02:12,060'}
Can anyone help?
Thank you very much!
This looks like something you can do using zip and avoid regex, considering the pattern of lines:
d = {}
lines = transcript.split('\n')
for x, y in zip(lines, lines[1:]):
    for w in my_list:
        if w in y.split():
            splits = x.split()
            d[w] = f'{splits[0]} -> {splits[1]}'
print(d)
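For example, assuming transcript holds the raw file contents and my_list is the fruit list from the question:
transcript = """00:02:01,640 00:02:04,409
word word CHERRY word word

00:02:04,409 00:02:07,229
word APPLE word word

00:02:07,229 00:02:09,380
word word word word

00:02:09,380 00:02:12,060
word BANANA word word word"""
my_list = ['CHERRY', 'APPLE', 'BANANA']

Running the loop above should then print:
{'CHERRY': '00:02:01,640 -> 00:02:04,409', 'APPLE': '00:02:04,409 -> 00:02:07,229', 'BANANA': '00:02:09,380 -> 00:02:12,060'}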
You may use
^(\d{2}:\d{2}:\d{2},\d{3} \d{2}:\d{2}:\d{2},\d{3})\n.*\b(CHERRY|APPLE|BANANA)\b
See the regex demo. With this pattern, you capture the time span line and the keyword into separate groups that can be retrieved with re.findall. After swapping the two captured values, you may cast the list of tuples into a dictionary.
If you read the data from a file, you need to use with open(fpath, 'r') as r: and then contents = r.read() to read the whole contents into a single string variable.
See Python demo:
import re
text = "00:02:01,640 00:02:04,409\nword word CHERRY word word\n\n00:02:04,409 00:02:07,229\nword APPLE word word\n\n00:02:07,229 00:02:09,380\nword word word word\n\n00:02:09,380 00:02:12,060\nword BANANA word word word"
t = r"\d{2}:\d{2}:\d{2},\d{3}"
keys = ['CHERRY', 'APPLE', 'BANANA']
rx = re.compile(fr"^({t} {t})\n.*\b({'|'.join(keys)})\b", re.M)
print( dict([(y,x) for x, y in rx.findall(text)]) )
Output:
{'CHERRY': '00:02:01,640 00:02:04,409', 'APPLE': '00:02:04,409 00:02:07,229', 'BANANA': '00:02:09,380 00:02:12,060'}
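If you also want the " -> " separator from the desired output, you can capture the two timestamps separately and join them (reusing t, keys and text from the demo above):
rx = re.compile(fr"^({t}) ({t})\n.*\b({'|'.join(keys)})\b", re.M)
print({kw: f'{start} -> {end}' for start, end, kw in rx.findall(text)})
# {'CHERRY': '00:02:01,640 -> 00:02:04,409', 'APPLE': '00:02:04,409 -> 00:02:07,229', 'BANANA': '00:02:09,380 -> 00:02:12,060'}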

Remove close matches / similar phrases from list

I am working on removing similar phrases in a list, but I have hit a small roadblock.
I have sentences and phrases; the phrases are related to the sentences, and all phrases of a sentence are in a single list.
Let the phrase list be : p=[['This is great','is great','place for drinks','for drinks'],['Tonight is a good','good night','is a good','for movies']]
I want my output to be [['This is great','place for drinks'],['Tonight is a good','for movies']]
Basically, I want to get all the longest unique phrases of a list.
I took a look at the fuzzywuzzy library, but I am unable to get to a good solution with it.
Here is my code:
def remove_dup(arr, threshold=80):
    ret_arr = []
    for item in arr:
        if item[1] < threshold:
            ret_arr.append(item[0])
    return ret_arr

def find_important(sents=sents, phrase=phrase):
    import os, random
    from fuzzywuzzy import process, fuzz
    all_processed = []   # final array to be returned
    for i in range(len(sents)):
        new_arr = []   # reshaped phrases for a single sentence
        for item in phrase[i]:
            new_arr.append(item)
        new_arr.sort(reverse=True, key=lambda x: len(x))   # sort with highest length
        important = []   # array to store terms
        important = process.extractBests(new_arr[0], new_arr)   # to get levenshtein distance matches
        to_proc = remove_dup(important)   # remove_dup removes all relatively matching terms.
        to_proc.append(important[0][0])   # the term with highest match is obviously the important term.
        all_processed.append(to_proc)   # add non duplicates to all_processed[]
    return all_processed
Can someone point out what I am missing, or what is a better way to do this?
Thanks in advance!
I would use the difference between each phrase and all the other phrases.
If a phrase has at least one different word compared to all the other phrases then it's unique and should be kept.
I've also made it robust to exact duplicates and to extra spaces.
sentences = [['This is great', 'is great', 'place for drinks', 'for drinks'],
             ['Tonight is a good', 'good night', 'is a good', 'for movies'],
             ['Axe far his favorite brand for deodorant body spray', ' Axe far his favorite brand for deodorant spray', 'Axe is']]

new_sentences = []
s = " "
for phrases in sentences:
    new_phrases = []
    phrases = [phrase.split() for phrase in phrases]
    for i in range(len(phrases)):
        phrase = phrases[i]
        # keep phrase i only if it contains at least one word that each other phrase lacks
        if all([len(set(phrase).difference(phrases[j])) > 0 or i == j for j in range(len(phrases))]):
            new_phrases.append(phrase)
    new_phrases = [s.join(phrase) for phrase in new_phrases]
    new_sentences.append(new_phrases)
print(new_sentences)
Output:
[['This is great', 'place for drinks'],
['Tonight is a good', 'good night', 'for movies'],
['Axe far his favorite brand for deodorant body spray', 'Axe is']]

Most efficient way to compare words in list / dict in Python

I have the following sentence and dict :
sentence = "I love Obama and David Card, two great people. I live in a boat"
dico = {
'dict1':['is','the','boat','tree'],
'dict2':['apple','blue','red'],
'dict3':['why','Obama','Card','two'],
}
I want to count, for each dict, how many of its elements appear in the sentence. The heavy-handed method is the following procedure:
classe_sentence = []
text_splited = sentence.split(" ")
dic_keys = dico.keys()
for key_dics in dic_keys:
    for values in dico[key_dics]:
        if values in text_splited:
            classe_sentence.append(key_dics)

from collections import Counter
Counter(classe_sentence)
Which gives the following output:
Counter({'dict1': 1, 'dict3': 2})
However, it's not efficient at all, since there are two nested loops doing raw comparisons. I was wondering if there is a faster way to do it, maybe using an itertools object. Any idea?
Thanks in advance !
You can use the set data type for all your comparisons, and the set.intersection method to get the number of matches.
It will increase the algorithm's efficiency, but it will only count each word once, even if it shows up in several places in the sentence.
sentence = set("I love Obama and David Card, two great people. I live in a boat".split())

dico = {
    'dict1': {'is', 'the', 'boat', 'tree'},
    'dict2': {'apple', 'blue', 'red'},
    'dict3': {'why', 'Obama', 'Card', 'two'}
}

results = {}
for key, words in dico.items():
    results[key] = len(words.intersection(sentence))
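With that sentence, results should end up as follows ("Card," keeps its trailing comma after split(), so only "Obama" and "two" count towards dict3):
print(results)   # {'dict1': 1, 'dict2': 0, 'dict3': 2}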
Assuming you want case-sensitive matching:
from collections import defaultdict

sentence_words = defaultdict(lambda: 0)
for word in sentence.split(' '):
    # strip off any trailing or leading punctuation
    word = word.strip('\'";.,!?')
    sentence_words[word] += 1

for name, words in dico.items():
    count = 0
    for x in words:
        count += sentence_words.get(x, 0)
    print('Dictionary [%s] has [%d] matches!' % (name, count,))
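Because this version strips punctuation, "Card," now counts as well; with the question's sentence and dico it should print:
Dictionary [dict1] has [1] matches!
Dictionary [dict2] has [0] matches!
Dictionary [dict3] has [3] matches!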
