Python, find words from array in string

Python, find words from array in string - python

I just want to ask how can I find words from array in my string?
I need to do filter that will find words i saved in my array in text that user type to text window on my web.
I need to have 30+ words in array or list or something.
Then user type text in text box.
Then script should find all words.
Something like spam filter i quess.
Thanks

import re
words = ['word1', 'word2', 'word4']
s = 'Word1 qwerty word2, word3 word44'
r = re.compile('|'.join([r'\b%s\b' % w for w in words]), flags=re.I)
r.findall(s)
>> ['Word1', 'word2']

Solution 1 uses the regex approach which will return all instances of the keyword found in the data. Solution 2 will return the indexes of all instances of the keyword found in the data
import re
dataString = '''Life morning don't were in multiply yielding multiply gathered from it. She'd of evening kind creature lesser years us every, without Abundantly fly land there there sixth creature it. All form every for a signs without very grass. Behold our bring can't one So itself fill bring together their rule from, let, given winged our. Creepeth Sixth earth saying also unto to his kind midst of. Living male without for fruitful earth open fruit for. Lesser beast replenish evening gathering.
Behold own, don't place, winged. After said without of divide female signs blessed subdue wherein all were meat shall that living his tree morning cattle divide cattle creeping rule morning. Light he which he sea from fill. Of shall shall. Creature blessed.
Our. Days under form stars so over shall which seed doesn't lesser rule waters. Saying whose. Seasons, place may brought over. All she'd thing male Stars their won't firmament above make earth to blessed set man shall two it abundantly in bring living green creepeth all air make stars under for let a great divided Void Wherein night light image fish one. Fowl, thing. Moved fruit i fill saw likeness seas Tree won't Don't moving days seed darkness.
'''
keyWords = ['Life', 'stars', 'seed', 'rule']
#---------------------- SOLUTION 1
print 'Solution 1 output:'
for keyWord in keyWords:
print re.findall(keyWord, dataString)
#---------------------- SOLUTION 2
print '\nSolution 2 output:'
for keyWord in keyWords:
index = 0
indexes = []
indexFound = 0
while indexFound != -1:
indexFound = dataString.find(keyWord, index)
if indexFound not in indexes:
indexes.append(indexFound)
index += 1
indexes.pop(-1)
print indexes
Output:
Solution 1 output:
['Life']
['stars', 'stars']
['seed', 'seed']
['rule', 'rule', 'rule']
Solution 2 output:
[0]
[765, 1024]
[791, 1180]
[295, 663, 811]

Try
words = ['word1', 'word2', 'word4']
s = 'word1 qwerty word2, word3 word44'
s1 = s.split(" ")
i = 0
for x in s1:
if(x in words):
print x
i++
print "count is "+i
output
'word1'
'word2'
count is 2

Related

mark "***" at the end of every 12 words

There is a text.
text = """Among domestic cats, males are more likely to fight than females. Among feral cats, the most common reason for cat fighting is competition between two males to mate with a female. In such cases, most fights are won by the heavier male. Another common reason for fighting in domestic cats is the difficulty of establishing territories within a small home. Female cats also fight over territory or to defend their kittens."""
How to implement this function (mark "***" every 12 words), please tell me in python 3?
"""
Among domestic cats, males are more likely to fight than females. Among***
feral cats, the most common reason for cat fighting is competition between***
next ...***
"""

Use a list comprehension:
text = "Create your own function that takes in a sentence and mark every 12th word with ***"
mark = " ".join(["{}***".format(word)
for idx, word in enumerate(text.split())
if idx % 12 == 0])
print(mark)
The main point here is to use the enumerate() function and the modulo operator (%).

First we break the text into individual words using str.split(), we can then iterate through every 12 words by setting the step of range to be 12, adding "***" where appropriate and rejoining the words with a space.
words = text.split()
for i in range(0, len(words), 12): # step by 12
words[i] += "***"
new_text = " ".join(words)
NOTE: this will mark the 0th word with "***", use range(11, len(words), 12): to start with 12th word

Delete based on presence

I'm trying to analyze an article to determine if a specific substring appears.
If "Bill" appears, then I want to delete the substring's parent sentence from the article, as well as every sentence following the first deleted sentence.
If "Bill" does not appear, no alteration are made to the article.
Sample Text:
stringy = """This is Bill Everest here. A long time ago in, erm, this galaxy, a game called Star Wars Episode I: Racer was a smash hit, leading to dozens of enthusiastic magazine reviews with the byline "now this is podracing!" Unfortunately, the intervening years have been unkind to the Star Wars prequels, Star Fox in the way you can rotate your craft to fit through narrow gaps.
This is Bill, signing off. Thank you for reading. And see you tomorrow!"""
Desired Result When Targeted Substring is "Bill":
stringy = """This is Bill Everest here. A long time ago in, erm, this galaxy, a game called Star Wars Episode I: Racer was a smash hit, leading to dozens of enthusiastic magazine reviews with the byline "now this is podracing!" Unfortunately, the intervening years have been unkind to the Star Wars prequels, but does that hindsight extend to this thoroughly literally-named racing tie-in? Star Fox in the way you can rotate your craft to fit through narrow gaps.
"""
This is the code so far:
if "Bill" not in stringy[-200:]:
print(stringy)
text = stringy.rsplit("Bill")[0]
text = text.split('.')[:-1]
text = '.'.join(text) + '.'
It currently doesn't work when "Bill" appears outside of the last 200 characters, cutting off the text at the very first instance of "Bill" (the opening sentence, "This is Bill Everest here"). How can this code be altered to only select for "Bill"s in the last 200 characters?

Here's another approach that loops through each sentence using a regex. We keep a line count and once we're in the last 200 characters we check for 'Bill' in the line. If found, we exclude from this line onward.
Hope the code is readable enough.
import re
def remove_bill(stringy):
sentences = re.findall(r'([A-Z][^\.!?]*[\.!?]\s*\n*)', stringy)
total = len(stringy)
count = 0
for index, line in enumerate(sentences):
#Check each index of 'Bill' in line
for pos in (m.start() for m in re.finditer('Bill', line)):
if count + pos >= total - 200:
stringy = ''.join(sentences[:index])
return stringy
count += len(line)
return stringy
stringy = remove_bill(stringy)

Here is how you can use re:
import re
stringy = """..."""
target = "Bill"
l = re.findall(r'([A-Z][^\.!?]*[\.!?])',stringy)
for i in range(len(l)-1,0,-1):
if target in l[i] and sum([len(a) for a in l[i:]])-sum([len(a) for a in l[i].split(target)[:-1]]) < 200:
strings = ' '.join(l[:i])
print(stringy)

Create a list of alphabetically sorted UNIQUE words and display the first N words in python

I am new to Python, apologize for a simple question. My task is the following:
Create a list of alphabetically sorted unique words and display the first 5 words
I have text variable, which contains a lot of text information
I did
test = text.split()
sorted(test)
As a result, I receive a list, which starts from symbols like $ and numbers.
How to get to words and print N number of them.

I'm assuming by "word", you mean strings that consist of only alphabetical characters. In such a case, you can use .filter to first get rid of the unwanted strings, turn it into a set, sort it and then print your stuff.
text = "$1523-the king of the 521236 mountain rests atop the king mountain's peak $#"
# Extract only the words that consist of alphabets
words = filter(lambda x: x.isalpha(), text.split(' '))
# Print the first 5 words
sorted(set(words))[:5]
Output-
['atop', 'king', 'mountain', 'of', 'peak']
But the problem with this is that it will still ignore words like mountain's, because of that pesky '. A regex solution might actually be far better in such a case-
For now, we'll be going for this regex - ^[A-Za-z']+$, which means the string must only contain alphabets and ', you may add more to this regex according to what you deem as "words". Read more on regexes here.
We'll be using re.match instead of .isalpha this time.
WORD_PATTERN = re.compile(r"^[A-Za-z']+$")
text = "$1523-the king of the 521236 mountain rests atop the king mountain's peak $#"
# Extract only the words that consist of alphabets
words = filter(lambda x: bool(WORD_PATTERN.match(x)), text.split(' '))
# Print the first 5 words
sorted(set(words))[:5]
Output-
['atop', 'king', 'mountain', "mountain's", 'of']
Keep in mind however, this gets tricky when you have a string like hi! What's your name?. hi!, name? are all words except they are not fully alphabetic. The trick to this is to split them in such a way that you get hi instead of hi!, name instead of name? in the first place.
Unfortunately, a true word split is far outside the scope of this question. I suggest taking a look at this question

I am newbie here, apologies for mistakes. Thank you.
test = '''The coronavirus outbreak has hit hard the cattle farmers in Pabna and Sirajganj as they are now getting hardly any customer for the animals they prepared for the last year targeting the Eid-ul-Azha this year.
Normally, cattle traders flock in large numbers to the belt -- one of the biggest cattle producing areas of the country -- one month ahead of the festival, when Muslims slaughter animals as part of their efforts to honour Prophet Ibrahim's spirit of sacrifice.
But the scene is different this year.'''
test = test.lower().split()
test2 = sorted([j for j in test if j.isalpha()])
print(test2[:5])

You can slice the sorted return list until the 5 position
sorted(test)[:5]
or if looking only for words
sorted([i for i in test if i.isalpha()])[:5]
or by regex
sorted([i for i in test if re.search(r"[a-zA-Z]")])
by using the slice of a list you will be able to get all list elements until a specific index in this case 5.

matching content creating new column

Hello I have a dataset where I want to match my keyword with the location. The problem I am having is the location "Afghanistan" or "Kabul" or "Helmund" I have in my dataset appears in over 150 combinations including spelling mistakes, capitalization and having the city or town attached to its name. What I want to do is create a separate column that returns the value 1 if any of these characters "afg" or "Afg" or "kab" or "helm" or "are contained in the location. I am not sure if upper or lower case makes a difference.
For instance there are hundreds of location combinations like so: Jegdalak, Afghanistan, Afghanistan,Ghazni♥, Kabul/Afghanistan,
I have tried this code and it is good if it matches the phrase exactly but there is too much variation to write every exception down
keywords= ['Afghanistan','Kabul','Herat','Jalalabad','Kandahar','Mazar-i-Sharif', 'Kunduz', 'Lashkargah', 'mazar', 'afghanistan','kabul','herat','jalalabad','kandahar']
#how to make a column that shows rows with a certain keyword..
def keyword_solution(value):
strings = value.split()
if any(word in strings for word in keywords):
return 1
else:
return 0
taleban_2['keyword_solution'] = taleban_2['location'].apply(keyword_solution)
# below will return the 1 values
taleban_2[taleban_2['keyword_solution'].isin(['1'])].head(5)
Just need to replace this logic where all results will be put into column "keyword_solution" that matches either "Afg" or "afg" or "kab" or "Kab" or "kund" or "Kund"

Given the following:
Sentences from the New York Times
Remove all non-alphanumeric characters
Change everything to lowercase, thereby removing the need for different word variations
Split the sentence into a list or set. I used set because of the long sentences.
Add to the keywords list as needed
Matching words from two lists
'afgh' in ['afghanistan']: False
'afgh' in 'afghanistan': True
Therefore, the list comprehension searches for each keyword, in each word of word_list.
[True if word in y else False for y in x for word in keywords]
This allows the list of keywords to be shorter (i.e. given afgh, afghanistan is not required)
import re
import pandas as pd
keywords= ['jalalabad',
'kunduz',
'lashkargah',
'mazar',
'herat',
'mazar',
'afgh',
'kab',
'kand']
df = pd.DataFrame({'sentences': ['The Taliban have wanted the United States to pull troops out of Afghanistan Turkey has wanted the Americans out of northern Syria and North Korea has wanted them to at least stop military exercises with South Korea.',
'President Trump has now to some extent at least obliged all three — but without getting much of anything in return. The self-styled dealmaker has given up the leverage of the United States’ military presence in multiple places around the world without negotiating concessions from those cheering for American forces to leave.',
'For a president who has repeatedly promised to get America out of foreign wars, the decisions reflect a broader conviction that bringing troops home — or at least moving them out of hot spots — is more important than haggling for advantage. In his view, decades of overseas military adventurism has only cost the country enormous blood and treasure, and waiting for deals would prolong a national disaster.',
'The top American commander in Afghanistan, Gen. Austin S. Miller, said Monday that the size of the force in the country had dropped by 2,000 over the last year, down to somewhere between 13,000 and 12,000.',
'“The U.S. follows its interests everywhere, and once it doesn’t reach those interests, it leaves the area,” Khairullah Khairkhwa, a senior Taliban negotiator, said in an interview posted on the group’s website recently. “The best example of that is the abandoning of the Kurds in Syria. It’s clear the Kabul administration will face the same fate.”',
'afghan']})
# substitute non-alphanumeric characters
df['sentences'] = df['sentences'].apply(lambda x: re.sub('[\W_]+', ' ', x))
# create a new column with a list of all the words
df['word_list'] = df['sentences'].apply(lambda x: set(x.lower().split()))
# check the list against the keywords
df['location'] = df.word_list.apply(lambda x: any([True if word in y else False for y in x for word in keywords]))
# final
print(df.location)
0 True
1 False
2 False
3 True
4 True
5 True
Name: location, dtype: bool

Count number of times each word has repeated in a string?

For example if I had a string without any punctuation:
"She walked the dog to the park and played ball with the dog When she threw the ball to the dog the dog missed the ball and ran to the other side of the park to fetch it"
I know how to do it by converting the string to uppercase/lowercase and using the function
from collections import Counter
but I can't think of any other way to count without using built-in functions (this includes set.default, get, sorted, etc.)
It should come out in a key:value format. Any ideas?

Forget about libraries and "fast" ways of doing it, use simpler logic:
Start by splitting your string using stringName.split(). This returns to you an array of words. Now create an empty dicitonary. Then iterate through the array and do one of two things, if it exists in the dictionary, increment the count by 1, otherwise, create the key value pair with key as the word and value as 1.
At the end, you'll have a count of words.
The code:
testString = "She walked the dog to the park and played ball with the dog When she threw the ball to the dog the dog missed the ball and ran to the other side of the park to fetch it"
dic = {}
words = testString.split()
for raw_word in words:
word = raw_word.lower()
if word in dic:
dic[word] += 1
else:
dic[word] = 1
print dic

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.