Question: A researcher has gathered thousands of news articles, but she wants to focus her attention on articles that include a specific word.
The function should meet the following criteria:
Do not include documents where the keyword string shows up only as a part of a larger word. For example, if she were looking for the keyword “closed”, you would not include the string “enclosed.”
She does not want you to distinguish upper case from lower case letters, so the phrase "Closed the case." would be included when the keyword is "closed".
Do not let periods or commas affect what is matched. “It is closed.” would be included when the keyword is “closed”. But you can assume there are no other types of punctuation.
My code:
keywords=["casino"]
def multi_word_search(document,keywords):
dic={}
z=[]
for word in document:
i=document.index(word)
token=word.split()
new=[j.rstrip(",.").lower() for j in token]
for k in keywords:
if k.lower() in new:
dic[k]=z.append(i)
else:
dic[k]=[]
return dic
It should return {'casino': [0]} for document=['The Learn Python Challenge Casino', 'They bought a car', 'Casinoville?'] and keywords=['casino'], but I got {'casino': []} instead. I wonder if someone could help me?
I would build a set from your cleaned token list new to speed up lookups. If you want case-insensitive matching, you need to lower-case both sides:
for k in keywords:
    s = set(new)  # "new" is already a list of cleaned tokens; a set gives O(1) lookups
    if k.lower() in s:
        dic.setdefault(k, []).append(i)
    else:
        dic.setdefault(k, [])
return dic
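Putting those fixes together, the whole function could look like this (a sketch that keeps your cleaning logic; enumerate gives each line's index directly, and appending to the list avoids the z.append(i) bug, which assigns None):

def multi_word_search(document, keywords):
    dic = {k: [] for k in keywords}          # one list of match indices per keyword
    for i, line in enumerate(document):      # enumerate avoids repeated index() lookups
        tokens = {t.rstrip(",.").lower() for t in line.split()}
        for k in keywords:
            if k.lower() in tokens:
                dic[k].append(i)
    return dic

document = ['The Learn Python Challenge Casino', 'They bought a car', 'Casinoville?']
print(multi_word_search(document, ['casino']))  # {'casino': [0]}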
This is not as trivial as it seems. From an NLP (natural language processing) point of view, splitting a text into words is not trivial (it is called tokenisation).
import nltk
# stemmer = nltk.stem.PorterStemmer()

def multi_word_search(documents, keywords):
    # Initialize result dictionary (keys lower-cased to match the lookup below)
    dic = {kw.lower(): [] for kw in keywords}
    for i, doc in enumerate(documents):
        # Preprocess document
        doc = doc.lower()
        tokens = nltk.word_tokenize(doc)
        # tokens = [stemmer.stem(token) for token in tokens]
        # Search each keyword
        for kw in keywords:
            # kw = stemmer.stem(kw.lower())
            kw = kw.lower()
            if kw in tokens:
                # If found, add to result dictionary
                dic[kw].append(i)
    return dic
documents = ['The Learn Python Challenge Casino', 'They bought a car',
             'Casinoville?', 'Some casinos']
keywords = ['casino']
multi_word_search(documents, keywords)
To increase matching you can use stemming (it removes plurals and verb inflections, e.g. running -> run).
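A minimal sketch of what the commented-out stemming lines would do, assuming nltk is installed:

import nltk

stemmer = nltk.stem.PorterStemmer()
print(stemmer.stem("casinos"))  # casino
print(stemmer.stem("running"))  # run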
This should work too:
document = ['The Learn Python Challenge Casino', 'They bought a car', 'Casinoville?']
keywords = ['casino', 'car']

def findme(term):
    for i, x in enumerate(document):
        for v in x.split():
            if term.lower() == v.strip('.,').lower():
                return i  # index of the first document containing the term

for key in keywords:
    n = findme(key)
    print(f'{key}:{n}')
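Note that findme only returns the index of the first matching document; if every index is needed, a small variant works (find_all is a hypothetical name):

def find_all(term):
    # collect the index of every document that contains the term as a whole word
    return [i for i, x in enumerate(document)
            if term.lower() in (v.strip('.,').lower() for v in x.split())]

print(find_all('casino'))  # [0]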
I want to get the correct result from my condition (the text values are read from my database). Here is my code.
My defined text:
# define
country = ('america', 'indonesia', 'england', 'france')
city = ('new york', 'jakarta', 'london', 'paris')
c1 = "Country"
c2 = "City"
c3 = "<blank>"
and condition ("text" here is passing from select database, ofc using looping - for)
if str(text) in str(country):
    stat = c1
elif str(text) in str(city):
    stat = c2
else:
    stat = c3
and I got the wrong result from the condition. Is there any solution to make this code work? It works when there is just a single text value to test with "in", but this case defines many text conditions.
If I understood you correctly, you need:
text = "i was born in paris"
country = ('america','indonesia', 'england', 'france')
city = ('new york', 'jakarta', 'london', 'paris')
def check(text):
for i in country:
if i in text.lower():
return "Country"
for i in city:
if i in text.lower():
return "City"
return "<blank>"
print(check(text))
print(check("I dnt like vacation in america"))
Output:
City
Country
You might be better off using dictionaries. I assume that text is a list:
dict1 = {
    "countries": ['america', 'indonesia', 'england', 'france'],
    "city": ['new york', 'jakarta', 'london', 'paris']
}

for x in text:
    for y in dict1['countries']:
        if y in x:
            print('country: ' + x)
    for z in dict1['city']:
        if z in x:
            print('city: ' + x)
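For example, with a hypothetical text list the loops above print one line per match:

text = ["i was born in paris", "I dnt like vacation in america"]
# city: i was born in paris
# country: I dnt like vacation in america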
First of all, check what you are testing.
>>> country = ('america','indonesia', 'england', 'france')
>>> city = ('new york', 'jakarta', 'london', 'paris')
>>>
>>> c1="Country"
>>> c2="City"
>>> c3="<blank>"
Same as your setup. So, you are testing for the presence of a substring.
>>> str(country)
"('america', 'indonesia', 'england', 'france')"
Let's see if we can find a country.
>>> 'america' in str(country)
True
Yes! Unfortunately a simple string test such as the one above, besides involving an unnecessary conversion of the tuple to a string, also finds things that aren't countries.
>>> "ca', 'in" in str(country)
True
The in test for strings is true if the string to the right contains the substring on the left. The in test for sequences such as lists and tuples is different, however: it is true when the tested sequence contains the value on the left as an element.
>>> 'america' in country
True
Nice! Have we got rid of the "weird other matches" bug?
>>> "ca', 'in" in country
False
It would appear so. However, using the list inclusion test you need to check every word in the input string rather than the whole string.
>>> "I don't like to vacation in america" in country
False
The above is similar to what you are doing now, but testing list elements rather than the list as a string. This expression generates a list of words in the input.
>>> [word for word in "I don't like to vacation in america".split()]
['I', "don't", 'like', 'to', 'vacation', 'in', 'america']
Note that you may have to be more careful than I have been in splitting the input. In the example above, "america, steve" when split would give ['america,', 'steve'] and neither word would match.
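One simple fix (a sketch) is to strip the punctuation from each word before testing:

>>> [word.strip('.,') for word in "america, steve".split()]
['america', 'steve']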
The any function iterates over a sequence of expressions, returning True at the first true member of the sequence (and False if no such element is found). (Here I use a generator expression instead of a list, but the same iterable sequence is generated).
>>> any(word in country for word in "I don't like to vacation in america".split())
True
For extra marks (and this is left as an exercise for the reader) you could write
a function that takes two arguments, a sentence and a list of possible matches,
and returns True if any of the words in the sentence are present in the list. Then you could use two different calls to that function to handle the countries and the
cities.
You could speed things up somewhat by using sets rather than lists, but the principles are the same.
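A possible sketch of that exercise function (contains_any is a name I made up; it uses a set for the fast lookups just mentioned, and note that multi-word entries such as 'new york' would still need a substring test):

def contains_any(sentence, candidates):
    # True if any word of the sentence appears among the candidates
    matches = set(candidates)
    return any(word.strip('.,') in matches for word in sentence.lower().split())

stat = ("Country" if contains_any(text, country)
        else "City" if contains_any(text, city)
        else "<blank>")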
I have the following sentence and dict:
sentence = "I love Obama and David Card, two great people. I live in a boat"
dico = {
    'dict1': ['is', 'the', 'boat', 'tree'],
    'dict2': ['apple', 'blue', 'red'],
    'dict3': ['why', 'Obama', 'Card', 'two'],
}
I want to count the elements that appear both in the sentence and in a given dict. The heavy-handed method consists of the following procedure:
classe_sentence = []
text_splited = sentence.split(" ")
dic_keys = dico.keys()
for key_dics in dic_keys:
    for values in dico[key_dics]:
        if values in text_splited:
            classe_sentence.append(key_dics)

from collections import Counter
Counter(classe_sentence)
Which gives the following output:
Counter({'dict1': 1, 'dict3': 2})
However it's not efficient at all, since there are two loops and raw comparison. I was wondering if there is a faster way to do it, maybe using an itertools object. Any ideas?
Thanks in advance!
You can use the set data type for all your comparisons, and the set.intersection method to get the number of matches.
It will increase the algorithm's efficiency, but it will only count each word once, even if it shows up in several places in the sentence.
sentence = set("I love Obama and David Card, two great people. I live in a boat".split())

dico = {
    'dict1': {'is', 'the', 'boat', 'tree'},
    'dict2': {'apple', 'blue', 'red'},
    'dict3': {'why', 'Obama', 'Card', 'two'}
}

results = {}
for key, words in dico.items():
    results[key] = len(words.intersection(sentence))
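Printing results shows the per-dict counts; note that 'Card' is not counted here, because the raw split() leaves the trailing comma on 'Card,':

print(results)  # {'dict1': 1, 'dict2': 0, 'dict3': 2}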
Assuming you want case-sensitive matching:
from collections import defaultdict

sentence_words = defaultdict(lambda: 0)
for word in sentence.split(' '):
    # strip off any trailing or leading punctuation
    word = word.strip('\'";.,!?')
    sentence_words[word] += 1

for name, words in dico.items():
    count = 0
    for x in words:
        count += sentence_words.get(x, 0)
    print('Dictionary [%s] has [%d] matches!' % (name, count))
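For the sample sentence and dico from the question, this prints the following ('Card' now counts because the punctuation is stripped):

Dictionary [dict1] has [1] matches!
Dictionary [dict2] has [0] matches!
Dictionary [dict3] has [3] matches!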
I managed to do that, but the case I'm struggling with is when I have to treat 'color' as equal to 'colour' (and likewise for all such words) and return the count accordingly. To do this I wrote a dictionary of common words whose spelling differs between American and British English, but I'm pretty sure this isn't the right approach.
ukus = {'COLOUR': 'COLOR', 'CHEQUE': 'CHECK',
        'PROGRAMME': 'PROGRAM', 'GREY': 'GRAY',
        'JEWELLERY': 'JEWELERY', 'ALUMINIUM': 'ALUMINUM',
        'THEATER': 'THEATRE', 'LICENSE': 'LICENCE', 'ARMOUR': 'ARMOR',
        'ARTEFACT': 'ARTIFACT', 'CENTRE': 'CENTER',
        'CYPHER': 'CIPHER', 'DISC': 'DISK', 'FIBRE': 'FIBER',
        'FULFILL': 'FULFIL', 'METRE': 'METER',
        'SAVOURY': 'SAVORY', 'TONNE': 'TON', 'TYRE': 'TIRE',
        'COLOR': 'COLOUR', 'CHECK': 'CHEQUE',
        'PROGRAM': 'PROGRAMME', 'GRAY': 'GREY',
        'JEWELERY': 'JEWELLERY', 'ALUMINUM': 'ALUMINIUM',
        'THEATRE': 'THEATER', 'LICENCE': 'LICENSE', 'ARMOR': 'ARMOUR',
        'ARTIFACT': 'ARTEFACT', 'CENTER': 'CENTRE',
        'CIPHER': 'CYPHER', 'DISK': 'DISC', 'FIBER': 'FIBRE',
        'FULFIL': 'FULFILL', 'METER': 'METRE', 'SAVORY': 'SAVOURY',
        'TON': 'TONNE', 'TIRE': 'TYRE'}
This is the dictionary I wrote to check the values. As you can see, it is degrading the performance. PyEnchant isn't available for 64-bit Python. Someone please help me out. Thank you in advance.
Okay, I think I know enough from your comments to provide this as a solution. The function below allows you to choose either UK or US replacement (it defaults to US, but you can of course flip that) and lets you optionally perform minor hygiene on the string.
import re

ukus = {'COLOUR': 'COLOR', 'CHEQUE': 'CHECK',
        'PROGRAMME': 'PROGRAM', 'GREY': 'GRAY',
        'JEWELLERY': 'JEWELERY', 'ALUMINIUM': 'ALUMINUM',
        'THEATRE': 'THEATER', 'LICENCE': 'LICENSE', 'ARMOUR': 'ARMOR',
        'ARTEFACT': 'ARTIFACT', 'CENTRE': 'CENTER',
        'CYPHER': 'CIPHER', 'DISC': 'DISK', 'FIBRE': 'FIBER',
        'FULFIL': 'FULFILL', 'METRE': 'METER',
        'SAVOURY': 'SAVORY', 'TONNE': 'TON', 'TYRE': 'TIRE'}

usuk = {'COLOR': 'COLOUR', 'CHECK': 'CHEQUE',
        'PROGRAM': 'PROGRAMME', 'GRAY': 'GREY',
        'JEWELERY': 'JEWELLERY', 'ALUMINUM': 'ALUMINIUM',
        'THEATER': 'THEATRE', 'LICENSE': 'LICENCE', 'ARMOR': 'ARMOUR',
        'ARTIFACT': 'ARTEFACT', 'CENTER': 'CENTRE',
        'CIPHER': 'CYPHER', 'DISK': 'DISC', 'FIBER': 'FIBRE',
        'FULFILL': 'FULFIL', 'METER': 'METRE', 'SAVORY': 'SAVOURY',
        'TON': 'TONNE', 'TIRE': 'TYRE'}
def str_wd_count(my_string, uk=False, hygiene=True):
    us = not uk
    # if the UK flag is True, default to the UK version, else default to the US version
    print("Using the " + uk*"UK" + us*"US" + " dictionary for default words")
    # optional hygiene of non-alphanumeric characters for pure word counting
    if hygiene:
        my_string = re.sub(r'[^ \d\w]', ' ', my_string)
        my_string = re.sub(r' {1,}', ' ', my_string)
    # create a list of the words in the text, swapped to the default spelling
    ttl_wds = [ukus.get(w, w) if us else usuk.get(w, w)
               for w in my_string.upper().split(' ')]
    wd_counts = {}
    for wd in ttl_wds:
        wd_counts[wd] = wd_counts.get(wd, 0) + 1
    return wd_counts
As a sample of use, consider the string
str1 = 'The colour of the dog is not the same as the color of the tire, or is it tyre, I can never tell which one will fulfill'
# Resulting sorted dict.items() With Default Settings
'[(THE,5),(TIRE,2),(COLOR,2),(OF,2),(IS,2),(FULFILL,1),(NEVER,1),(DOG,1),(SAME,1),(IT,1),(WILL,1),(I,1),(AS,1),(CAN,1),(WHICH,1),(TELL,1),(NOT,1),(ONE,1),(OR,1)]'
# Resulting sorted dict.items() With hygiene=False
'[(THE,5),(COLOR,2),(OF,2),(IS,2),(FULFILL,1),(NEVER,1),(DOG,1),(SAME,1),(TIRE,,1),(WILL,1),(I,1),(AS,1),(CAN,1),(WHICH,1),(TELL,1),(NOT,1),(ONE,1),(OR,1),(IT,1),(TYRE,,1)]'
# Resulting sorted dict.items() With UK Swap, hygiene=True
'[(THE,5),(OF,2),(IS,2),(TYRE,2),(COLOUR,2),(WHICH,1),(I,1),(NEVER,1),(DOG,1),(SAME,1),(OR,1),(WILL,1),(AS,1),(CAN,1),(TELL,1),(NOT,1),(FULFIL,1),(ONE,1),(IT,1)]'
# Resulting sorted dict.items() With UK Swap, hygiene=False
'[(THE,5),(OF,2),(IS,2),(COLOUR,2),(ONE,1),(I,1),(NEVER,1),(DOG,1),(SAME,1),(TIRE,,1),(WILL,1),(AS,1),(CAN,1),(WHICH,1),(TELL,1),(NOT,1),(FULFIL,1),(TYRE,,1),(IT,1),(OR,1)]'
You can use the resulting dictionary of word counts in any way you'd like, and if you need the original string with the modifications added it is easy enough to modify the function to also return that.
Step 1:
Create a temporary string, then replace every occurrence of your dict's values with the corresponding keys, as:
>>> temp_string = str(my_string)
>>> for k, v in ukus.items():
... temp_string = temp_string.replace(" {} ".format(v), " {} ".format(k)) # <--surround by space " " to replace only words
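Note that because the replacement pattern is wrapped in spaces, a word at the very start or end of the string will be missed; one simple workaround (a sketch) is to pad the string first:

>>> temp_string = " {} ".format(my_string)  # pad so boundary words are also surrounded by spaces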
Step 2:
Now, in order to find words in the string, first split it into a list of words and then use collections.Counter() to get the count of each element in the list. Below is the sample code:
>>> from collections import Counter
>>> my_string = 'Hello World! Hello again. I am saying Hello one more time'
>>> count_dict = Counter(my_string.split())
# Value of count_dict:
# Counter({'Hello': 3, 'saying': 1, 'again.': 1, 'I': 1, 'am': 1, 'one': 1, 'World!': 1, 'time': 1, 'more': 1})
>>> count_dict['Hello']
3
Step 3:
Now, since you want the count of both "colour" and "color" in your dict, iterate over the dict again to mirror those values, filling in missing values as 0:
for k, v in ukus.items():
    if k in count_dict:
        count_dict[v] = count_dict[k]
    else:
        count_dict[v] = count_dict[k] = 0
I need to keep one list of the words that appear exactly once and another list of the words that appear twice, without using any count method. I tried using a set, but it removes only the duplicate, not the original. Is there any way to keep the words appearing once in one list, and the words that appear twice in another list?
The sample file is text = ['Andy Fennimore Cooper\n', 'Peter, Paul, and Mary\n', 'Andy Gosling\n'], so technically 'Andy' and 'Andy' would be in one list, and the rest in the other.
Using dictionaries is not allowed :/
for word in text:
    clean = clean_up(word)
    for words in clean.split():
        clean2 = clean_up(words)
        l = clean_list.append(clean2)
        if clean2 not in clean_list:
            clean_list.append(clean2)
print(clean_list)
This is a very bad, unPythonic way of doing things; but once you disallow Counter and dict, this is about all that's left. (Edit: except for sets, d'oh!)
text = ['Andy Fennimore Cooper\n', 'Peter, Paul, and Mary\n', 'Andy Gosling\n']

once_words = []
more_than_once_words = []
for sentence in text:
    for word in sentence.split():
        if word in more_than_once_words:
            pass  # do nothing
        elif word in once_words:
            once_words.remove(word)
            more_than_once_words.append(word)
        else:
            once_words.append(word)
which results in
# once_words
['Fennimore', 'Cooper', 'Peter,', 'Paul,', 'and', 'Mary', 'Gosling']
# more_than_once_words
['Andy']
It is a silly problem, removing key data structures or loops or whatever. Why not just program in C then? Tell your teacher to get a job...
Editorial aside, here is a solution:
>>> text = ['Andy Fennimore Cooper\n', 'Peter, Paul, and Mary\n','Andy Gosling\n']
>>> data=' '.join(e.strip('\n,.') for e in ''.join(text).split()).split()
>>> data
['Andy', 'Fennimore', 'Cooper', 'Peter', 'Paul', 'and', 'Mary', 'Andy', 'Gosling']
>>> [e for e in data if data.count(e)==1]
['Fennimore', 'Cooper', 'Peter', 'Paul', 'and', 'Mary', 'Gosling']
>>> list({e for e in data if data.count(e)==2})
['Andy']
If you can use a set (though I suspect it's disallowed too, given you're not allowed to use dictionaries), then you can use one set to keep track of the words you have 'seen', and another for the words that appear more than once. E.g.:
seen = set()
duplicate = set()
Then, each time you get a word, test whether it is in seen. If it is not, add it to seen. If it is, add it to duplicate.
At the end, you'd have a set of seen words, containing all the words, and a duplicate set, with all those that appear more than once.
Then you only need to subtract duplicate from seen, and the result is the words that have no duplicates (i.e. the ones that appear only once).
This can also be implemented using only lists (which would be more honest to your homework, if a bit more laborious).
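A minimal sketch of that set-based idea, assuming the same cleaning (stripping newlines, commas, and periods) as the other answers:

seen = set()
duplicate = set()
for line in text:
    for word in line.split():
        word = word.strip('\n,.')
        if word in seen:
            duplicate.add(word)
        else:
            seen.add(word)

once = seen - duplicate  # words that appear exactly once
# sorted(once) -> ['Cooper', 'Fennimore', 'Gosling', 'Mary', 'Paul', 'Peter', 'and']
# duplicate    -> {'Andy'}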
from itertools import groupby
from operator import itemgetter

text = ['Andy Fennimore Cooper\n', 'Peter, Paul, and Mary\n', 'Andy Gosling\n']

one, two = [
    list(group)
    for key, group in groupby(
        sorted(((word, len(list(g)))
                for word, g in groupby(sorted(' '.join(text).split()))),
               key=itemgetter(1)),
        key=itemgetter(1))
]
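This yields (word, count) pairs rather than bare words, split by count; note that punctuation is not stripped here:

# one -> [('Cooper', 1), ('Fennimore', 1), ('Gosling', 1), ('Mary', 1),
#         ('Paul,', 1), ('Peter,', 1), ('and', 1)]
# two -> [('Andy', 2)]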