wordnet alternatives to find semantic relationships between words python - python

I have a project that needs the semantic relationship between two words. I want word-to-word relationships such as hypernyms, hyponyms, synonyms, holonyms, ...
I tried WordNet through NLTK, but for most pairs no relationship is found.
Here is sample code:
from nltk.corpus import wordnet as wn
from wordhoard import synonyms
Word1 = 'red'
Word2 = 'color'
LSTWord1 = []
for syn in wn.synsets(Word1):
    for lemma in syn.part_meronyms():
        LSTWord1.append(lemma)
for s in LSTWord1:
    if Word2 in s.name():
        print(Word1 + ' is meronyms of ' + Word2)
        break
LSTWord2 = []
for syn in wn.synsets(Word2):
    for lemma in syn.part_meronyms():
        LSTWord2.append(lemma)
for s in LSTWord2:
    if Word1 in s.name():
        print(Word2 + ' is meronyms of ' + Word1)
        break
Here are some example word pairs:
scheduled ,geometry
games,river
campaign,sea
adventure,place
session,road
long,town
campaign,road
session,railway
difficulty of session,place of interest
campaign,town
leader,historic place
have,town
player,town
skills,church
campaign,cultural interest
character name,town
player,monument
player,province
games,beach
expertise level,gas station
character,municipality
world,electrict line
social interaction,municipality
world,electric line
percentage,municipality
character,hospital
inhabitants,mine
active character,municipality
campaign,altitude
died,municipality
many time,mountain
adventurer,altitude
campaign,peak
gain,place of interest
new capabilities,cultural interest
player,cultural interest
achievement,national park
campaign,good
first action,railway station
player,province
Maybe WordNet is limited, or maybe there is no relationship between these words. Is there an alternative to WordNet for handling semantic relationships between words, or a better way to find the semantic relation between two words?
Thanks

As I previously stated, I'm the author of the Python package wordhoard that you used in your question. Based on your question, I decided to add some additional modules to the package. These modules focus on:
homophones
hypernyms
hyponyms
I could not find an easy way to add the meronyms, but I'm still looking at the best way to do that.
The homophones module queries a hand-built list of the 60,000+ most frequently used English words for known homophones. I plan to expand this list in the future.
from wordhoard import Homophones
words = ['scheduled' ,'geometry', 'games', 'river', 'campaign', 'sea', 'adventure','place','session', 'road', 'long', 'town', 'campaign', 'road', 'session', 'railway']
for word in words:
    homophone = Homophones(word)
    results = homophone.find_homophones()
    print(results)
# output
no homophones for scheduled
no homophones for geometry
no homophones for games
no homophones for river
no homophones for campaign
['sea is a homophone of see', 'sea is a homophone of cee']
no homophones for adventure
['place is a homophone of plaice']
['session is a homophone of cession']
['road is a homophone of rowed', 'road is a homophone of rode']
truncated...
The hypernyms module queries various online repositories.
from wordhoard import Hypernyms
words = ['scheduled' ,'geometry', 'games', 'river', 'campaign','sea', 'adventure',
         'place','session','road', 'long','town', 'campaign','road', 'session', 'railway']
for word in words:
    hypernym = Hypernyms(word)
    results = hypernym.find_hypernyms()
    print(results)
# output
['no hypernyms found']
['arrangement', 'branch of knowledge', 'branch of math', 'branch of mathematics', 'branch of maths', 'configuration', 'figure', 'form', 'math', 'mathematics', 'maths', 'pure mathematics', 'science', 'shape', 'study', 'study of numbers', 'study of quantities', 'study of shapes', 'system', 'type of geometry']
['lake', 'recreation']
['branch', 'dance', 'fresh water', 'geological feature', 'landform', 'natural ecosystem', 'natural environment', 'nature', 'physical feature', 'recreation', 'spring', 'stream', 'transportation', 'watercourse']
['action', 'actively seek election', 'activity', 'advertise', 'advertisement', 'battle', 'canvass', 'crusade', 'discuss', 'expedition', 'military operation', 'operation', 'political conflict', 'politics', 'promote', 'push', 'race', 'run', 'seek votes', 'wage war']
truncated...
The hyponyms module queries repositories.
from wordhoard import Hyponyms
words = ['scheduled' ,'geometry', 'games', 'river', 'campaign','sea', 'adventure',
         'place','session','road', 'long','town', 'campaign','road', 'session', 'railway']
for word in words:
    hyponym = Hyponyms(word)
    results = hyponym.find_hyponyms()
    print(results)
# output
['no hyponyms found']
['absolute geometry', 'affine geometry', 'algebraic geometry', 'analytic geometry', 'combinatorial geometry', 'descriptive geometry', 'differential geometry', 'elliptic geometry', 'euclidean geometry', 'finite geometry', 'geometry of numbers', 'hyperbolic geometry', 'non-euclidean geometry', 'perspective', 'projective geometry', 'pythagorean theorem', 'riemannian geometry', 'spherical geometry', 'taxicab geometry', 'tropical geometry']
['jack in the box', 'postseason']
['affluent', 'arkansas river', 'arno river', 'avon', 'big sioux river', 'bighorn river', 'brazos river', 'caloosahatchee river', 'cam river', 'canadian river', 'cape fear river', 'changjiang', 'chari river', 'charles river', 'chattahoochee river', 'cimarron river', 'colorado river', 'orange', 'red', 'tunguska']
['ad blitz', 'ad campaign', 'advertising campaign', 'agitating', 'anti-war movement', 'campaigning', 'candidacy', 'candidature', 'charm campaign', 'come with', 'electioneering', 'feminism', 'feminist movement', 'fund-raising campaign', 'fund-raising drive', 'fund-raising effort', 'military campaign', 'military expedition', 'political campaign', 'senate campaign']
truncated...
Please let me know if you have any issues when you use these new modules.

It looks like you want arbitrary semantic relationships between pairs of words drawn from a large vocabulary. The cosine similarity of their word embeddings will probably help here. You can start with GloVe.
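As a rough illustration of the cosine-similarity idea, here is a minimal sketch in plain Python. The three-dimensional vectors in toy_embeddings are invented for the example; real GloVe vectors would be loaded from a downloaded vector file and have far more dimensions. The helper names cosine_similarity and similarity are hypothetical, not part of any library.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # similarity is undefined for a zero vector; treat it as unrelated
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, invented for illustration only (not real GloVe values).
toy_embeddings = {
    "red": [0.9, 0.1, 0.0],
    "color": [0.8, 0.3, 0.1],
    "river": [0.0, 0.2, 0.9],
}

def similarity(word_a, word_b, embeddings):
    """Look up both words and compare their vectors."""
    return cosine_similarity(embeddings[word_a], embeddings[word_b])

print(similarity("red", "color", toy_embeddings))
print(similarity("red", "river", toy_embeddings))
```

Note that this gives a single relatedness score per pair; it does not say whether the relationship is hypernymy, meronymy, or something else.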

Related

How do I get the top 'realistic' hypernym from wordnet synset hyper_paths?

Python3.8 nltk wordnet
How can I find the highest reasonable hypernym for a given word?
I'm using WordNet's synset.hypernym_paths(), but if I traverse a path all the way to the top I get a hypernym that is far too abstract (e.g. entity).
I've tried creating a list of 'too high' hypernyms, but the list is difficult to determine and far too long.
import sys
import nltk
from nltk.corpus import wordnet as wn
arTooHigh = ['act', 'event', 'action', 'artifact', 'instrumentality', 'furnishing', 'organism', 'cognition', 'content', 'discipline', 'humanistic_discipline', 'diversion', 'communication', 'auditory_communication', 'speech', 'state', 'feeling', 'causal_agent', 'think', 'reason','information', 'evidence', 'measure', 'fundamental_quantity', 'condition']
arWords = ['work', 'frog', 'bad', 'corpus', 'chair', 'dancing', 'gossip', 'love', 'jerry', 'compute', 'satisfied', 'gift', 'candy', 'bookkeeper', 'construction', 'amplified', 'party', 'dinner', 'family', 'sky', 'office', 'project', 'budget', 'price']
for word in arWords:
    hypernym = word
    last = ''
    synsets = wn.synsets(word)
    syn = synsets[0]
    for path in syn.hypernym_paths():
        for i, ss in reversed(list(enumerate(path))):
            test = ss.lemmas()[0].name()
            if test in arTooHigh:
                hypernym = last
                break
            else:
                last = test
    print(word + ' => ' + hypernym)

how to stop letter repeating itself python

I am writing code that takes a jumbled word and returns the unjumbled word. data.json contains a list of words; I take each word in turn, check whether it contains all the characters of the jumbled word, and then check whether the lengths match. The problem is that when I enter a word such as helol, the l is checked twice, so I get other outputs in addition to the correct one (hello). I know why this happens, but I can't find a fix for it.
import json
val = open("data.json")
val1 = json.load(val)#loads the list
a = input("Enter a Jumbled word ")#takes a word from user
a = list(a)#changes into list to iterate
for x in val1:#iterates words from list
    for somethin in a:#iterates letters from list
        if somethin in list(x):#checks if the letter is in the iterated word
            continue
        else:
            break
    else:#checks if the loop ended correctly (that means word has same letters)
        if len(a) != len(list(x)):#checks if it has same number of letters
            continue#returns
        else:
            print(x)#continues the loop to see if there are more like that
EDIT: Many people asked for the JSON file, so here it is:
['Torres Strait Creole', 'good bye', 'agon', "queen's guard", 'animosity', 'price list', 'subjective', 'means', 'severe', 'knockout', 'life-threatening', 'entry into the war', 'dominion', 'damnify', 'packsaddle', 'hallucinate', 'lumpy', 'inception', 'Blankenese', 'cacophonous', 'zeptomole', 'floccinaucinihilipilificate', 'abashed', 'abacterial', 'ableism', 'invade', 'cohabitant', 'handicapped', 'obelus', 'triathlon', 'habitue', 'instigate', 'Gladstone Gander', 'Linked Data', 'seeded player', 'mozzarella', 'gymnast', 'gravitational force', 'Friedelehe', 'open up', 'bundt cake', 'riffraff', 'resourceful', 'wheedle', 'city center', 'gorgonzola', 'oaf', 'auf', 'oafs', 'galoot', 'imbecile', 'lout', 'moron', 'news leak', 'crate', 'aggregator', 'cheating', 'negative growth', 'zero growth', 'defer', 'ride back', 'drive back', 'start back', 'shy back', 'spring back', 'shrink back', 'shy away', 'abderian', 'unable', 'font manager', 'font management software', 'consortium', 'gown', 'inject', 'ISO 639', 'look up', 'cross-eyed', 'squinting', 'health club', 'fitness facility', 'steer', 'sunbathe', 'combatives', 'HTH', 'hope that helps', 'How The Hell', 'distributed', 'plum cake', 'liberalization', 'macchiato', 'caffè macchiato', 'beach volley', 'exult', 'jubilate', 'beach volleyball', 'be beached', 'affogato', 'gigabyte', 'terabyte', 'petabyte', 'undressed', 'decameter', 'sensual', 'boundary marker', 'poor man', 'cohabitee', 'night sleep', 'protruding ears', 'three quarters of an hour', 'spermophilus', 'spermophilus stricto sensu', "devil's advocate", 'sacred king', 'sacral king', 'myr', 'million years', 'obtuse-angled', 'inconsolable', 'neurotic', 'humiliating', 'mortifying', 'theological', 'rematch', 'varıety', 'be short', 'ontological', 'taxonomic', 'taxonomical', 'toxicology testing', 'on the job training', 'boulder', 'unattackable', 'inviolable', 'resinous', 'resiny', 'ionizing radiation', 'citrus grove', 'comic book shop', 'preparatory measure', 'written account', 
'brittle', 'locker', 'baozi', 'bao', 'bau', 'humbow', 'nunu', 'bausak', 'pow', 'pau', 'yesteryear', 'fire drill', 'rotted', 'putto', 'overthrow', 'ankle monitor', 'somewhat stupid', 'a little stupid', 'semordnilap', 'pangram', 'emordnilap', 'person with a sunlamp tan', 'tittle', 'incompatible', 'autumn wind', 'dairyman', 'chesty', 'lacustrine', 'chronophotograph', 'chronophoto', 'leg lace', 'ankle lace', 'ankle lock', 'Babelfy', 'ventricular', 'recurrent', 'long-lasting', 'long-standing', 'long standing', 'sea bass', 'reap', 'break wind', 'chase away', 'spark', 'speckle', 'take back', 'Westphalian', 'Aeolic Greek', 'startup', 'abseiling', 'impure', 'bottle cork', 'paralympic', 'work out', 'might', 'ice-cream man', 'ice cream man', 'ice cream maker', 'ice-cream maker', 'traveling', 'special delivery', 'prizefighter', 'abs', 'ab', 'churro', 'pilfer', 'dehumanize', 'fertilize', 'inseminate', 'digitalize', 'fluke', 'stroke of luck', 'decontaminate', 'abandonware', 'manzanita', 'tule', 'jackrabbit', 'system administrator', 'system admin', 'springtime lethargy', 'Palatinean', 'organized religion', 'bearing puller', 'wheel puller', 'gear puller', 'shot', 'normalize', 'palindromic', 'lancet window', 'terminological', 'back of head', 'dragon food', 'barbel', 'Central American Spanish', 'basis', 'birthmark', 'blood vessel', 'ribes', 'dog-rose', 'dreadful', 'freckle', 'free of charge', 'weather verb', 'weather sentence', 'gipsy', 'gypsy', 'glutton', 'hump', 'low voice', 'meek', 'moist', 'river mouth', 'turbid', 'multitude', 'palate', 'peak of mountain', 'poetry', 'pure', 'scanty', 'spicy', 'spicey', 'spruce', 'surface', 'infected', 'copulate', 'dilute', 'dislocate', 'grow up', 'hew', 'hinder', 'infringe', 'inhabit', 'marry off', 'offend', 'pass by', 'brother of a man', 'brother of a woman', 'sister of a man', 'sister of a woman', 'agricultural farm', 'result in', 'rebel', 'strew', 'scatter', 'sway', 'tread', 'tremble', 'hog', 'circuit breaker', 'Southern Quechua', 'safety 
pin', 'baby pin', 'college student', 'university student', 'pinus sibirica', 'Siberian pine', 'have lunch', 'floppy', 'slack', 'sloppy', 'wishi-washi', 'turn around', 'bogeyman', 'selfish', 'Talossan', 'biomembrane', 'biological membrane', 'self-sufficiency', 'underevaluation', 'underestimation', 'opisthenar', 'prosody', 'Kumhar Bhag Paharia', 'psychoneurotic', 'psychoneurosis', 'levant', "couldn't-care-less attitude", 'noctambule', 'acid-free paper', 'decontaminant', 'woven', 'wheaten', 'waste-ridden', 'war-ridden', 'violence-ridden', 'unwritten', 'typewritten', 'spoken', 'abiogenetically', 'rasp', 'abstractly', 'cyclically', 'acyclically', 'acyclic', 'ad hoc', 'spare tire', 'spare wheel', 'spare tyre', 'prefabricated', 'ISO 9000', 'Barquisimeto', 'Maracay', 'Ciudad Guayana', 'San Cristobal', 'Barranquilla', 'Arequipa', 'Trujillo', 'Cusco', 'Callao', 'Cochabamba', 'Goiânia', 'Campinas', 'Fortaleza', 'Florianópolis', 'Rosario', 'Mendoza', 'Bariloche', 'temporality', 'papyrus sedge', 'paper reed', 'Indian matting plant', 'Nile grass', 'softly softly', 'abductive reasoning', 'abductive inference', 'retroduction', 'Salzburgian', 'cymotrichous', 'access point', 'wireless access point', 'dynamic DNS', 'IP address', 'electrolyte', 'helical', 'hydrometer', 'intranet', 'jumper', 'MAC address', 'Media Access Control address', 'nickel–cadmium battery', 'Ni-Cd battery', 'oscillograph', 'overload', 'photovoltaic', 'photovoltaic cell', 'refractor telescope', 'autosome', 'bacterial artificial chromosome', 'plasmid', 'nucleobase', 'base pair', 'base sequence', 'chromosomal deletion', 'deletion', 'deletion mutation', 'gene deletion', 'chromosomal inversion', 'comparative genomics', 'genomics', 'cytogenetics', 'DNA replication', 'DNA repair', 'DNA sequence', 'electrophoresis', 'functional genomics', 'retroviral', 'retroviral infection', 'acceptance criteria', 'batch processing', 'business rule', 'code review', 'configuration management', 'entity–relationship model', 'lifecycle', 
'object code', 'prototyping', 'pseudocode', 'referential', 'reusability', 'self-join', 'timestamp', 'accredited', 'accredited translator', 'certify', 'certified translation', 'computer-aided design', 'computer-aided', 'computer-assisted', 'management system', 'computer-aided translation', 'computer-assisted translation', 'machine-aided translation', 'conference interpreter', 'freelance translator', 'literal translation', 'mother-tongue', 'whispered interpreting', 'simultaneous interpreting', 'simultaneous interpretation', 'base anhydride', 'binary compound', 'absorber', 'absorption coefficient', 'attenuation coefficient', 'active solar heater', 'ampacity', 'amorphous semiconductor', 'amorphous silicon', 'flowerpot', 'antireflection coating', 'antireflection', 'armored cable', 'electric arc', 'breakdown voltage','casing', 'facing', 'lining', 'assumption of Mary', 'auscultation']
This is just an example; the full word list contains many more items.
As I understand it you are trying to identify all possible matches for the jumbled string in your list. You could sort the letters in the jumbled word and match the resulting list against sorted lists of the words in your data file.
sorted_jumbled_word = sorted(a)
for word in val1:
    if len(sorted_jumbled_word) == len(word) and sorted(word) == sorted_jumbled_word:
        print(word)
Checking by length first reduces unnecessary sorting. If doing this repeatedly, you might want to create a dictionary of the words in the data file with their sorted versions, to avoid having to repeatedly sort them.
There are spaces and punctuation in some of the terms in your word list. If you want to make the comparison ignoring spaces then remove them from both the jumbled word and the list of unjumbled words, using e.g. word = word.replace(" ", "")
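The dictionary idea mentioned above could look roughly like the sketch below: each word is stored under a key made of its sorted letters, so a jumbled input needs only one lookup. The names letter_key, build_index and unjumble are hypothetical, sample_words is a small stand-in for the list loaded from data.json, and lowercasing the letters is an extra assumption that the answer does not require.

```python
from collections import defaultdict

def letter_key(text):
    """Sorted letters of text, ignoring spaces and case (case-folding is an assumption)."""
    return "".join(sorted(text.replace(" ", "").lower()))

def build_index(words):
    """Map each sorted-letter key to every word that shares it."""
    index = defaultdict(list)
    for word in words:
        index[letter_key(word)].append(word)
    return index

def unjumble(jumbled, index):
    """Return all known words made of exactly the letters in jumbled."""
    return index.get(letter_key(jumbled), [])

# Small stand-in for the list loaded from data.json
sample_words = ["hello", "reap", "pear", "good bye", "moist"]
index = build_index(sample_words)

print(unjumble("helol", index))  # ['hello']
print(unjumble("aepr", index))   # ['reap', 'pear']
```

Building the index costs one sort per word up front; after that, every query sorts only the jumbled input.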

Find the anagram pairs from 2 lists and create a list of tuples of the anagrams

Say I have two lists:
list_A = [ 'Tar', 'Arc', 'Elbow', 'State', 'Cider', 'Dusty', 'Night', 'Inch', 'Brag', 'Cat', 'Bored', 'Save', 'Angel','bla', 'Stressed', 'Dormitory', 'School master','Awesoame', 'Conversation', 'Listen', 'Astronomer', 'The eyes', 'A gentleman', 'Funeral', 'The Morse Code', 'Eleven plus two', 'Slot machines', 'Fourth of July', 'Jim Morrison', 'Damon Albarn', 'George Bush', 'Clint Eastwood', 'Ronald Reagan', 'Elvis', 'Madonna Louise Ciccone', 'Bart', 'Paris', 'San Diego', 'Denver', 'Las Vegas', 'Statue of Liberty']
and
list_B = ['Cried', 'He bugs Gore', 'They see', 'Lives', 'Joyful Fourth', 'The classroom', 'Diagnose', 'Silent', 'Taste', 'Car', 'Act', 'Nerved', 'Thing', 'A darn long era', 'Brat', 'Twelve plus one', 'Elegant man', 'Below', 'Robed', 'Study', 'Voices rant on', 'Chin', 'Here come dots', 'Real fun', 'Pairs', 'Desserts', 'Moon starer', 'Dan Abnormal', 'Old West action', 'Built to stay free', 'One cool dance musician', 'Dirty room', 'Grab', 'Salvages', 'Cash lost in me', "Mr. Mojo Risin'", 'Glean', 'Rat', 'Vase']
I want to find the words in list_A that have an anagram in list_B and create a list of tuples of those anagram pairs.
For one list I can do the following and generate the list of tuples, but for two lists I need some assistance. Thanks in advance for the help!
What I have tried for one list:
from collections import defaultdict
anagrams = defaultdict(list)
for w in list_A:
    anagrams[tuple(sorted(w))].append(w)
You can use a nested for loop, outer for the first list, inner for the second (also, use str.lower to make it case-insensitive):
anagram_pairs = [] # (w_1 from list_A, w_2 from list_B)
for w_1 in list_A:
    for w_2 in list_B:
        if sorted(w_1.lower()) == sorted(w_2.lower()):
            anagram_pairs.append((w_1, w_2))
print(anagram_pairs)
Output:
[('Tar', 'Rat'), ('Arc', 'Car'), ('Elbow', 'Below'), ('State', 'Taste'), ('Cider', 'Cried'), ('Dusty', 'Study'), ('Night', 'Thing'), ('Inch', 'Chin'), ('Brag', 'Grab'), ('Cat', 'Act'), ('Bored', 'Robed'), ('Save', 'Vase'), ('Angel', 'Glean'), ('Stressed', 'Desserts'), ('School master', 'The classroom'), ('Listen', 'Silent'), ('The eyes', 'They see'), ('A gentleman', 'Elegant man'), ('The Morse Code', 'Here come dots'), ('Eleven plus two', 'Twelve plus one'), ('Damon Albarn', 'Dan Abnormal'), ('Elvis', 'Lives'), ('Bart', 'Brat'), ('Paris', 'Pairs'), ('Denver', 'Nerved')]
You are quite close with your current attempt. All you need to do is repeat the same process on list_B:
from collections import defaultdict
anagrams = defaultdict(list)
list_A = [ 'Tar', 'Arc', 'Elbow', 'State', 'Cider', 'Dusty', 'Night', 'Inch', 'Brag', 'Cat', 'Bored', 'Save', 'Angel','bla', 'Stressed', 'Dormitory', 'School master','Awesoame', 'Conversation', 'Listen', 'Astronomer', 'The eyes', 'A gentleman', 'Funeral', 'The Morse Code', 'Eleven plus two', 'Slot machines', 'Fourth of July', 'Jim Morrison', 'Damon Albarn', 'George Bush', 'Clint Eastwood', 'Ronald Reagan', 'Elvis', 'Madonna Louise Ciccone', 'Bart', 'Paris', 'San Diego', 'Denver', 'Las Vegas', 'Statue of Liberty']
list_B = ['Cried', 'He bugs Gore', 'They see', 'Lives', 'Joyful Fourth', 'The classroom', 'Diagnose', 'Silent', 'Taste', 'Car', 'Act', 'Nerved', 'Thing', 'A darn long era', 'Brat', 'Twelve plus one', 'Elegant man', 'Below', 'Robed', 'Study', 'Voices rant on', 'Chin', 'Here come dots', 'Real fun', 'Pairs', 'Desserts', 'Moon starer', 'Dan Abnormal', 'Old West action', 'Built to stay free', 'One cool dance musician', 'Dirty room', 'Grab', 'Salvages', 'Cash lost in me', "Mr. Mojo Risin'", 'Glean', 'Rat', 'Vase']
for w in list_A:
    anagrams[tuple(sorted(w))].append(w)
for w in list_B:
    anagrams[tuple(sorted(w))].append(w)
result = [b for b in anagrams.values() if len(b) > 1]
Output (the comparison here is case-sensitive, so only pairs whose capital letters also match are found):
[['Cider', 'Cried'], ['The eyes', 'They see'], ['Damon Albarn', 'Dan Abnormal'], ['Bart', 'Brat'], ['Paris', 'Pairs']]
Another solution using dictionary:
out = {}
for word in list_A:
    out.setdefault(tuple(sorted(word.lower())), []).append(word)
for word in list_B:
    word_s = tuple(sorted(word.lower()))
    if word_s in out:
        out[word_s].append(word)
print(list(tuple(v) for v in out.values() if len(v) > 1))
Prints:
[
("Tar", "Rat"),
("Arc", "Car"),
("Elbow", "Below"),
("State", "Taste"),
("Cider", "Cried"),
("Dusty", "Study"),
("Night", "Thing"),
("Inch", "Chin"),
("Brag", "Grab"),
("Cat", "Act"),
("Bored", "Robed"),
("Save", "Vase"),
("Angel", "Glean"),
("Stressed", "Desserts"),
("School master", "The classroom"),
("Listen", "Silent"),
("The eyes", "They see"),
("A gentleman", "Elegant man"),
("The Morse Code", "Here come dots"),
("Eleven plus two", "Twelve plus one"),
("Damon Albarn", "Dan Abnormal"),
("Elvis", "Lives"),
("Bart", "Brat"),
("Paris", "Pairs"),
("Denver", "Nerved"),
]
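The approaches above compare spaces as characters, so phrases such as 'Dormitory' and 'Dirty room' are not paired. If the comparison should use letters only, a sketch along the following lines might help. The names anagram_key and anagram_pairs are hypothetical, and dropping every non-letter character is an assumption about what counts as an anagram here.

```python
from collections import defaultdict

def anagram_key(phrase):
    """Sorted lowercase letters of phrase; spaces and punctuation are dropped."""
    return "".join(sorted(ch for ch in phrase.lower() if ch.isalpha()))

def anagram_pairs(first, second):
    """Return (a, b) tuples where a is from first, b is from second, and they are anagrams."""
    # Index the second list once so each phrase in the first list needs a single lookup.
    index = defaultdict(list)
    for phrase in second:
        index[anagram_key(phrase)].append(phrase)
    pairs = []
    for phrase in first:
        for match in index.get(anagram_key(phrase), []):
            pairs.append((phrase, match))
    return pairs

# A few entries taken from list_A and list_B in the question
sample_a = ['Dormitory', 'Astronomer', 'Listen', 'bla']
sample_b = ['Moon starer', 'Dirty room', 'Silent']
print(anagram_pairs(sample_a, sample_b))
# [('Dormitory', 'Dirty room'), ('Astronomer', 'Moon starer'), ('Listen', 'Silent')]
```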

match one or more specific strings in a list

I would like a function that extracts one or more words from each string in a list. For example, I have a list called single_Word that consists of four strings:
single_Word = ['news in media', 'car in automobile', 'email in technology', 'painting in art']
I would like to extract the first word (or, more generally, everything before 'in'), so that it returns the following output:
['news', 'car', 'email', 'painting']
I have the following code that shows what I intended to do:
text_list = []
for text in single_Word:
    x = text.split()
    text_list.append(x[0])
print(text_list)
# ['news', 'car', 'email', 'painting']
This works as expected, but it fails once the part before 'in' contains more than one word. I know the reason is x[0], which returns only the first element, but how can I change this so that it matches more than one word (that is, everything before 'in')? These are the lists I would like to match on:
two_Word = ['news online in media', 'car insurance in automobile', 'email account in technology', 'painting ideas in art']
three_Word = ['news online live in media', 'car insurance online in automobile', 'email account settings in technology', 'painting ideas pinterest in art']
The desired output for the second and third lists:
['news online', 'car insurance', 'email account', 'painting ideas']
['news online live', 'car insurance online', 'email account settings', 'painting ideas pinterest']
Thank you @Ma0. It works by using split(' in ', 1):
text_list = []
for text in three_Word:
    x = text.split(' in ', 1)[0]
    text_list.append(x)
print(text_list)
# Output for single_Word, two_Word and three_Word respectively:
# ['news', 'car', 'email', 'painting']
# ['news online', 'car insurance', 'email account', 'painting ideas']
# ['news online live', 'car insurance online', 'email account settings', 'painting ideas pinterest']
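For reuse across all three lists, the same idea can be wrapped in a small function. This is only a sketch: text_before is a hypothetical name, and it uses str.partition, which like split(' in ', 1) cuts at the first ' in ' and returns the whole string when the separator is absent.

```python
def text_before(strings, separator=' in '):
    """Return the part of each string that comes before the first separator."""
    return [s.partition(separator)[0] for s in strings]

single_Word = ['news in media', 'car in automobile', 'email in technology', 'painting in art']
two_Word = ['news online in media', 'car insurance in automobile', 'email account in technology', 'painting ideas in art']
three_Word = ['news online live in media', 'car insurance online in automobile', 'email account settings in technology', 'painting ideas pinterest in art']

for words in (single_Word, two_Word, three_Word):
    print(text_before(words))
```

Because the separator includes the surrounding spaces, the 'in' inside words such as 'insurance' or 'painting' is not treated as a split point.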

Search for multiple words in a list using python

I'm currently working on my first python project. The goal is to be able to summarise a webpage's information by searching for and printing sentences that contain a specific word from a word list I generate. For example, the following (large) list contains 'business key terms' I generated by using cewl on business websites;
business_list = ['business', 'marketing', 'market', 'price', 'management', 'terms', 'product', 'research', 'organisation', 'external', 'operations', 'organisations', 'tools', 'people', 'sales', 'growth', 'quality', 'resources', 'revenue', 'account', 'value', 'process', 'level', 'stakeholders', 'structure', 'company', 'accounts', 'development', 'personal', 'corporate', 'functions', 'products', 'activity', 'demand', 'share', 'services', 'communication', 'period', 'example', 'total', 'decision', 'companies', 'service', 'working', 'businesses', 'amount', 'number', 'scale', 'means', 'needs', 'customers', 'competition', 'brand', 'image', 'strategies', 'consumer', 'based', 'policy', 'increase', 'could', 'industry', 'manufacture', 'assets', 'social', 'sector', 'strategy', 'markets', 'information', 'benefits', 'selling', 'decisions', 'performance', 'training', 'customer', 'purchase', 'person', 'rates', 'examples', 'strategic', 'determine', 'matrix', 'focus', 'goals', 'individual', 'potential', 'managers', 'important', 'achieve', 'influence', 'impact', 'definition', 'employees', 'knowledge', 'economies', 'skills', 'buying', 'competitive', 'specific', 'ability', 'provide', 'activities', 'improve', 'productivity', 'action', 'power', 'capital', 'related', 'target', 'critical', 'stage', 'opportunities', 'section', 'system', 'review', 'effective', 'stock', 'technology', 'relationship', 'plans', 'opportunity', 'leader', 'niche', 'success', 'stages', 'manager', 'venture', 'trends', 'media', 'state', 'negotiation', 'network', 'successful', 'teams', 'offer', 'generate', 'contract', 'systems', 'manage', 'relevant', 'published', 'criteria', 'sellers', 'offers', 'seller', 'campaigns', 'economy', 'buyers', 'everyone', 'medium', 'valuable', 'model', 'enterprise', 'partnerships', 'buyer', 'compensation', 'partners', 'leaders', 'build', 'commission', 'engage', 'clients', 'partner', 'quota', 'focused', 'modern', 'career', 'executive', 'qualified', 'tactics', 'supplier', 'investors', 
'entrepreneurs', 'financing', 'commercial', 'finances', 'entrepreneurial', 'entrepreneur', 'reports', 'interview', 'ansoff']
The following program copies all the text from a URL I specify and organises it into a list whose elements are individual sentences:
from bs4 import BeautifulSoup
import urllib.request as ul
url = input("Enter URL: ")
html = ul.urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')
for script in soup(["script", "style"]):
    script.decompose()
strips = list(soup.stripped_strings)
# Joining list to form single text
text = " ".join(strips)
text = text.lower()
# Replacing substitutes of '.'
for i in range(len(text)):
    if text[i] in "?!:;":
        text = text.replace(text[i], ".")
# Splitting text by sentences
sentences = text.split(".")
My current objective is for the program to print all sentences that contain one (or more) of the key terms above; however, I've only been successful with a single word at a time:
# Word to search for in the text
word_search = input("Enter word: ")
word_search = word_search.lower()
sentences_with_word = []
for x in sentences:
    if x.count(word_search)>0:
        sentences_with_word.append(x)
# Separating sentences into separate lines
sentence_text = "\n\n".join(sentences_with_word)
print(sentence_text)
Could somebody demonstrate how this could be achieved for an entire list at once? Thanks.
Edit
As suggested by MachineLearner, here is an example of the output for a single word. If I use Wikipedia's page on marketing for the URL and choose the word 'marketing' as the input for 'word_search', this is a segment of the output generated (the entire output is almost 600 lines long):
marketing mix the marketing mix is a foundational tool used to guide decision making in marketing
the marketing mix represents the basic tools which marketers can use to bring their products or services to market
they are the foundation of managerial marketing and the marketing plan typically devotes a section to the marketing mix
the 4ps [ edit ] the traditional marketing mix refers to four broad levels of marketing decision
Use a double loop to check multiple words contained in a list:
words = business_list  # the key terms from the question
output = []
for sentence in sentences:
    for word in words:
        if sentence.count(word) > 0:
            output.append(sentence)
            # Do not forget to break the second loop, else
            # you'll end up with multiple times the same sentence
            # in the output array if the sentence contains
            # multiple words
            break
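One thing to be aware of is that count() matches substrings, so the term 'market' also matches inside 'marketing'. If whole-word matches are wanted (an assumption, since the question does not say), a sketch using a single compiled regular expression could look like this. The function name sentences_with_terms and the sample sentences are made up for illustration.

```python
import re

def sentences_with_terms(sentences, terms):
    """Return the sentences that contain at least one term as a whole word."""
    if not terms:
        return []
    # \b anchors keep 'market' from matching inside 'marketing'
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b")
    # Each sentence is tested once, so it can never be added twice
    return [s for s in sentences if pattern.search(s)]

# Made-up sample sentences, already lowercased like the scraped text
sample_sentences = [
    "the marketing mix is a foundational tool",
    "prices rose in the market last year",
    "the weather was pleasant",
]
print(sentences_with_terms(sample_sentences, ["market", "price"]))
# ['prices rose in the market last year']
```

With the full business_list, plural and derived forms such as 'markets' and 'marketing' are already separate entries, so whole-word matching would still find them.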
