I am looking for a way to create several lists and have the keywords in those lists extracted and matched with a response.
User Input: This is a good day I am heading out for a jog.
List 1 : Keywords : good day, great day, awesome day, best day.
List 2 : Keywords : a run, a swim, a game.
But for a huge database of words, can this be linked to just the lists? Or does it need to be specific words?
Also would you recommend Python for a huge database of keywords?
The first thing to do is to break the input string up into tokens. A token is just a piece of the string that you want to match. In your case, it looks like your token size is 2 words (but it doesn't have to be). You might also want to strip all punctuation from the input string as well.
Then for your input, your tokens are
['This is', 'is a', 'a good', 'good day', 'day I', 'I am', 'am heading', 'heading out', 'out for', 'for a', 'a jog']
Then you can iterate over the tokens and check to see if they're contained in each one of the lists. Might look like this:
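# Keyword lists from the question
list1 = ['good day', 'great day', 'awesome day', 'best day']
list2 = ['a run', 'a swim', 'a game']

text = 'This is a good day I am heading out for a jog'  # avoid shadowing the built-in input()
words = text.split(' ')
# Build overlapping two-word tokens
tokens = [' '.join(words[i:i+2]) for i in range(len(words) - 1)]
for token in tokens:
    if token in list1:
        print('{} is in list1'.format(token))
    if token in list2:
        print('{} is in list2'.format(token))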
One thing you will likely want to do to optimize this is to use sets for list1 and list2, instead of lists.
set1 = set(list1)
Sets offer O(1) average-case lookups, as opposed to O(n) for lists, which is critical if your keyword lists are large.
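To tie matched keywords back to a response, as the question asks, one option is a dict that maps each keyword phrase to its response. This is only a minimal sketch: the RESPONSES table, its response strings, the 'a jog' entry, and the find_responses helper are hypothetical names made up for illustration, and it assumes phrases of at most two words.

```python
import string

# Hypothetical keyword -> response table; dict lookups are O(1) on average, like sets.
RESPONSES = {
    'good day': 'Glad to hear it!',
    'great day': 'Glad to hear it!',
    'a run': 'Enjoy the exercise!',
    'a jog': 'Enjoy the exercise!',
}

def ngrams(words, n):
    """Return every run of n consecutive words, joined by single spaces."""
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

def find_responses(text, max_words=2):
    """Return the distinct responses for keyword phrases found in text."""
    # Lowercase and strip punctuation so 'Good day,' still matches 'good day'.
    cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))
    words = cleaned.split()
    found = []
    for n in range(1, max_words + 1):
        for token in ngrams(words, n):
            response = RESPONSES.get(token)
            if response is not None and response not in found:
                found.append(response)
    return found

print(find_responses('This is a good day, I am heading out for a jog.'))
```

With a very large keyword table the per-token lookup cost stays roughly constant, so the running time depends mostly on the length of the input rather than the number of keywords.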
I am a middle school student studying Python. Is there a way to omit certain characters from the list and mix them?
Input list
['Hello', 'Middle school student', 'I am']
Expected output
['Middle school student', 'Hello', 'I am']
If you specify is, everything except for is mixed.
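Here is a simple shuffle that is effective and efficient. Basically, you swap each element with a randomly chosen element at or after its own position (a Fisher-Yates shuffle), which keeps every ordering equally likely.
import random

def shuffle(lst):
    for i in range(len(lst)):
        # Pick j from the not-yet-fixed part of the list, i included
        j = random.randrange(i, len(lst))
        lst[i], lst[j] = lst[j], lst[i]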
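The question also asks to leave certain elements out of the mix. One way to read that, and this reading is an assumption, is that the excluded elements stay at their original index while everything else is shuffled around them. A rough sketch using the standard library's random.shuffle, with shuffle_except and keep as hypothetical names:

```python
import random

def shuffle_except(items, keep):
    """Return a shuffled copy of items; elements found in keep stay at their index."""
    # Indices whose elements are allowed to move.
    movable = [i for i, item in enumerate(items) if item not in keep]
    values = [items[i] for i in movable]
    random.shuffle(values)  # in-place shuffle of the movable values only
    result = list(items)
    for i, value in zip(movable, values):
        result[i] = value
    return result

words = ['Hello', 'Middle school student', 'I am']
print(shuffle_except(words, keep={'I am'}))
```

Because the result is random, the two movable items may also come back in their original order.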
I am new to Python, so apologies for a simple question. My task is the following:
Create a list of alphabetically sorted unique words and display the first 5 words
I have a text variable, which contains a lot of text.
I did
test = text.split()
sorted(test)
As a result, I receive a list that starts with symbols like $ and numbers.
How do I get only the words and print the first N of them?
I'm assuming that by "word" you mean strings that consist only of alphabetical characters. In that case, you can use the built-in filter function to first get rid of the unwanted strings, turn the result into a set, sort it, and then print what you need.
text = "$1523-the king of the 521236 mountain rests atop the king mountain's peak $#"
# Extract only the words that consist of alphabets
words = filter(lambda x: x.isalpha(), text.split(' '))
# Print the first 5 words
sorted(set(words))[:5]
Output-
['atop', 'king', 'mountain', 'of', 'peak']
But the problem with this is that it will still ignore words like mountain's, because of that pesky '. A regex solution might actually be far better in such a case-
For now, we'll go with this regex: ^[A-Za-z']+$, which means the string must contain only letters and '. You may add more to this regex according to what you deem to be "words". Read more on regexes here.
We'll be using re.match instead of .isalpha this time.
import re
WORD_PATTERN = re.compile(r"^[A-Za-z']+$")
text = "$1523-the king of the 521236 mountain rests atop the king mountain's peak $#"
# Extract only the words that consist of alphabets
words = filter(lambda x: bool(WORD_PATTERN.match(x)), text.split(' '))
# Print the first 5 words
sorted(set(words))[:5]
Output-
['atop', 'king', 'mountain', "mountain's", 'of']
Keep in mind, however, that this gets tricky with a string like hi! What's your name?. Here hi! and name? are words, but they are not fully alphabetic. The trick is to split the string in such a way that you get hi instead of hi! and name instead of name? in the first place.
Unfortunately, a true word split is far outside the scope of this question. I suggest taking a look at this question
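Short of a true word splitter, a rough approximation is to pull the words out with re.findall instead of splitting on spaces and filtering afterwards. This sketch assumes a "word" is a run of ASCII letters, optionally with internal apostrophes; WORD_RE and first_unique_words are hypothetical names.

```python
import re

# Assumption: a word is a run of letters, optionally with internal apostrophes.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def first_unique_words(text, n=5):
    """Return the first n unique words of text in alphabetical order."""
    # Lowercasing makes 'What' and 'what' count as the same word.
    words = {w.lower() for w in WORD_RE.findall(text)}
    return sorted(words)[:n]

print(first_unique_words("hi! What's your name?"))
# ['hi', 'name', "what's", 'your']
```

Note that this also extracts letters that are glued to other characters, so a token such as $1523-the contributes the word the, which the filter-based versions above would have discarded.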
I am newbie here, apologies for mistakes. Thank you.
test = '''The coronavirus outbreak has hit hard the cattle farmers in Pabna and Sirajganj as they are now getting hardly any customer for the animals they prepared for the last year targeting the Eid-ul-Azha this year.
Normally, cattle traders flock in large numbers to the belt -- one of the biggest cattle producing areas of the country -- one month ahead of the festival, when Muslims slaughter animals as part of their efforts to honour Prophet Ibrahim's spirit of sacrifice.
But the scene is different this year.'''
test = test.lower().split()
test2 = sorted({j for j in test if j.isalpha()})  # a set keeps only the unique words
print(test2[:5])
You can slice the sorted list to take its first 5 elements:
sorted(test)[:5]
or, if you are looking only for words:
sorted([i for i in test if i.isalpha()])[:5]
or with a regex (this needs import re, and keeps any string that contains at least one letter):
sorted([i for i in test if re.search(r"[a-zA-Z]", i)])[:5]
Slicing a list with [:5] returns all elements up to, but not including, index 5.
I have a list of strings as follows
list_1 = ['what are you 3 guys doing there on 5th avenue', 'my password is 5x35omega44',
'2 days ago I saw it', 'every day is a blessing',
' 345000 people have eaten here at the beach']
I want to remove 3, but not 5th or 5x35omega44. All the solutions I have searched for and tried end up removing numbers in an alphanumeric string, but I want those to remain as is. I want my list to look as follows:
list_1 = ['what are you guys doing there on 5th avenue', 'my password is 5x35omega44',
'days ago I saw it', 'every day is a blessing',
' people have eaten here at the beach']
I am trying the following:
[' '.join(s for s in words.split() if not any(c.isdigit() for c in s)) for words in list_1]
Use lookarounds to check if digits are not enclosed with letters or digits or underscores:
import re
list_1 = ['what are you 3 guys doing there on 5th avenue', 'my password is 5x35omega44',
'2 days ago I saw it', 'every day is a blessing',
' 345000 people have eaten here at the beach']
for sentence in list_1:
    print(re.sub(r'(?<!\w)\d+(?!\w)', '', sentence))
Output:
what are you guys doing there on 5th avenue
my password is 5x35omega44
days ago I saw it
every day is a blessing
people have eaten here at the beach
One approach would be to use try and except:
def is_intable(x):
    try:
        int(x)
        return True
    except ValueError:
        return False
[' '.join([word for word in sentence.split() if not is_intable(word)]) for sentence in list_1]
It sounds like you should be using regex. This will match numbers separated by word boundaries:
\b(\d+)\b
Here is a working example.
Some Python code may look like this:
import re
for item in list_1:
    new_item = re.sub(r'\b(\d+)\b', ' ', item)
    print(new_item)
I am not sure what the best way to handle spaces would be for your project. You may want to put \s at the end of the expression, making it \b(\d+)\b\s or you may wish to handle this some other way.
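One possible way to deal with the leftover spaces is to replace each standalone number with a space and then rebuild the sentence from its words. This is a sketch rather than the only option; drop_standalone_numbers and STANDALONE_NUMBER are hypothetical names, and the approach collapses all runs of whitespace, including leading and trailing spaces.

```python
import re

# Digits with a word boundary on both sides, so '5th' and '5x35omega44' are untouched.
STANDALONE_NUMBER = re.compile(r'\b\d+\b')

def drop_standalone_numbers(sentence):
    """Remove standalone numbers and normalize the whitespace they leave behind."""
    without_numbers = STANDALONE_NUMBER.sub(' ', sentence)
    # split() with no arguments discards empty strings, so extra spaces vanish.
    return ' '.join(without_numbers.split())

list_1 = ['what are you 3 guys doing there on 5th avenue', 'my password is 5x35omega44',
          '2 days ago I saw it', 'every day is a blessing',
          ' 345000 people have eaten here at the beach']

for cleaned in (drop_standalone_numbers(s) for s in list_1):
    print(cleaned)
```

If the original spacing matters, for example the leading space in the last string, this normalization is too aggressive and the \s variant mentioned above may fit better.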
Note that isinstance(word, int) does not help here: split() always returns strings, so the check is never true and nothing gets removed. A shorter way that does work is str.isdigit(), which is true only when every character of the word is a digit. You could try something like this:
[' '.join([word for word in expression.split() if not word.isdigit()]) for expression in list_1]
>>>['what are you guys doing there on 5th avenue', 'my password is 5x35omega44',
'days ago I saw it', 'every day is a blessing', 'people have eaten here at the beach']
Combining the very helpful regex solutions provided, in a list comprehension format that I wanted, I was able to arrive at the following:
[' '.join([re.sub(r'\b(\d+)\b', '', item) for item in expression.split()]) for expression in list_1]
I'm trying to find keywords within a sentence, where the keywords are usually single words, but can be multi-word combos (like "cost in euros"). So if I have a sentence like cost in euros of bacon it would find cost in euros in that sentence and return true.
For this, I was using this code:
if any(phrase in line for phrase in keyword['aliases']):
where line is the input and aliases is an array of phrases that match a keyword (like for cost in euros, it's ['cost in euros', 'euros', 'euro cost']).
However, I noticed that it was also triggering on word parts. For example, I had a match phrase of y and a sentence of trippy cake. I'd not expect this to return true, but it does, since it apparently finds the y in trippy. How do I get this to only check whole words? Originally I was doing this keyword search with a list of words (essentially doing line.split() and checking those), but that doesn't work for multi-word keyword aliases.
This should accomplish what you're looking for:
import re
aliases = [
    'cost.',
    '.cost',
    '.cost.',
    'cost in euros of bacon',
    'rocking euros today',
    'there is a cost inherent to bacon',
    'europe has cost in place',
    'there is a cost.',
    'I was accosted.',
    'dealing with euro costing is painful']
phrases = ['cost in euros', 'euros', 'euro cost', 'cost']
matched = list(set([
    alias
    for alias in aliases
    for phrase in phrases
    if re.search(r'\b{}\b'.format(phrase), alias)
]))
print(matched)
Output:
['there is a cost inherent to bacon', '.cost.', 'rocking euros today', 'there is a cost.', 'cost in euros of bacon', 'europe has cost in place', 'cost.', '.cost']
Basically, we collect every alias that matches at least one phrase, using Python's re module as the test inside a compound list comprehension. Because an alias can match several phrases, set() trims the duplicates and list() coerces the set back into a list. A set is unordered, so the order of the printed list can differ from run to run.
Refs:
Lists:
https://docs.python.org/3/tutorial/datastructures.html#more-on-lists
List comprehensions:
https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions
Sets:
https://docs.python.org/3/tutorial/datastructures.html#sets
re (or regex):
https://docs.python.org/3/library/re.html#module-re
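The answer above interpolates each phrase directly into the pattern, which is fine for plain words but would misbehave if a phrase contained regex metacharacters such as . or +. A hedged variation is to escape the phrases with re.escape and compile them into a single alternation; build_phrase_pattern is a hypothetical helper name, and the sketch assumes every phrase starts and ends with a letter, digit, or underscore so that \b can anchor it.

```python
import re

def build_phrase_pattern(phrases):
    """Compile one whole-word pattern that matches any of the given phrases."""
    # Longest first, so 'cost in euros' is tried before a shorter overlapping phrase.
    ordered = sorted(phrases, key=len, reverse=True)
    alternation = '|'.join(re.escape(p) for p in ordered)
    return re.compile(r'\b(?:{})\b'.format(alternation))

pattern = build_phrase_pattern(['cost in euros', 'euros', 'euro cost', 'y'])

print(bool(pattern.search('cost in euros of bacon')))  # True
print(bool(pattern.search('trippy cake')))             # False: 'y' is only part of a word
```

Compiling once and reusing the pattern avoids rebuilding a regex for every phrase on every line.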
This has been resolved. See bottom of this post for a solution
I'm trying to filter a constant feed of strings coming in (from an API) inside a continuous loop.
Here's an example of the code I'm using:
I have a filter set up with an array like so:
filter_a = ['apples and oranges', 'a test', 'bananas']
and a function I found on Stack Overflow like this:
def words_in_string(word_list, a_string):
    return set(word_list).intersection(a_string.split())
title = 'bananas'
#(this is a continuously looping thing, so sometimes it
# might be for example 'apples and oranges')
And my if statement:
if words_in_string(filter_a, str(title.lower())):
    print(title.lower())
For some reason it would detect 'bananas' but not 'apples and oranges'. It will skip right over strings with multiple words. I'm guessing it's because of the split() but I'm not sure.
Edit:
Here's another example of what I meant:
Match this and make it successful:
title = 'this is 1'
word_list = ['this is','a test']
if title in word_list:
    print("successful")
else:
    print("unsuccessful")
Edit 2:
Solution
title = 'this is 1'
word_list = ['this is','a test']
if any(item in title for item in word_list):
    print("successful")
else:
    print("unsuccessful")
I don't think your code does what you expect. Let's analyze what words_in_string does.
word_list is the list of words you want to keep, and set(word_list) turns that list into a set, which contains only unique elements. In your example, ['apples and oranges', 'a test', 'bananas'] becomes {'apples and oranges', 'a test', 'bananas'}.
Next, a_string.split() splits a_string into a list of single words, and the set's intersection method returns the elements common to the set and that list. Since split() never produces a multi-word element, 'apples and oranges' can never be part of the intersection.
Finally, the result is returned.
To put it more clearly: given a list of words, this function returns the words in a_string that are also contained in the list.
For example:
Given ["banana", "apple", "orange"] and a_string = "I like banana and apple", it will return {"banana", "apple"}.
But if you change the list to ["bananas", "apple", "orange"], it will only return {"apple"}, as "banana" is not equal to "bananas".
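To make the set-based idea work for multi-word entries, one option is to compare runs of consecutive words rather than single words. The following is a sketch under that assumption; contains_phrase and phrases_in_string are hypothetical names, and matching is case-insensitive and whitespace-delimited only (punctuation attached to a word is not stripped).

```python
def contains_phrase(phrase, text):
    """True if the words of phrase appear consecutively, as whole words, in text."""
    phrase_words = phrase.lower().split()
    text_words = text.lower().split()
    n = len(phrase_words)
    if n == 0:
        return False  # an empty phrase never matches
    # Slide a window of n words across the text and compare it to the phrase.
    return any(text_words[i:i + n] == phrase_words
               for i in range(len(text_words) - n + 1))

def phrases_in_string(phrases, text):
    """Return the set of phrases that occur in text as whole-word sequences."""
    return {p for p in phrases if contains_phrase(p, text)}

filter_a = ['apples and oranges', 'a test', 'bananas']
print(phrases_in_string(filter_a, 'Fresh apples and oranges today'))  # {'apples and oranges'}
print(phrases_in_string(filter_a, 'this is a testament'))             # set()
```

Unlike a plain substring check such as item in title, this does not report 'a test' for a title containing 'a testament'.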