Scenario:
I have tasks performed against a respective "Section Header" (stored as a string), and the result of each task has to be saved against the same respective "Existing Section Header" (also stored as a string).
While mapping, if a task's "Section Header" is one of the "Existing Section Headers", the task's results are added to it.
If not, the new Section Header gets appended to the Existing Section Header list.
The Existing Section Header list looks like this:
["Activity (Last 3 Days)", "Activity (Last 7 days)", "Executable running from disk", "Actions from File"]
For the below set of strings, the expected behaviour is as follows:
"Activity (Last 30 Days)" - a new section should be added.
"Executables running from disk" - the existing "Executable running from disk" should be referred to [treating the extra "s" in "Executables" as equivalent to "Executable"].
"Actions from a file" - the existing "Actions from File" should be referred to [ignoring the extra article "a"].
Is there any built-in function available in Python that may help incorporate this logic? Any suggestion regarding an algorithm for this is highly appreciated.
This is a case where you may find regular expressions helpful. You can use re.sub() to find specific substrings and replace them. It searches for non-overlapping matches of a regular expression and replaces them with the specified string.
import re  # this will allow you to use regular expressions

def modifyHeader(header):
    # change the # of days to 30; the literal parentheses must be escaped
    modifiedHeader = re.sub(r"Activity \(Last \d+ [Dd]ays\)", "Activity (Last 30 Days)", header)
    # add an "s" to "Executable"
    modifiedHeader = re.sub(r"Executable running from disk", "Executables running from disk", modifiedHeader)
    # add "a"
    modifiedHeader = re.sub(r"Actions from File", "Actions from a file", modifiedHeader)
    return modifiedHeader
The r"" refers to raw strings which make it a bit easier to deal with the \ characters needed for regular expressions, \d matches any digit character, and + means "1 or more". Read the page I linked above for more information.
Since you want to compare only the stem or "root word" of a given word, I suggest using a stemming algorithm. Stemming algorithms attempt to automatically remove suffixes (and in some cases prefixes) in order to find the "root word" or stem of a given word. This is useful in various natural language processing scenarios, such as search. Luckily there is a Python package for stemming; you can download it from here.
Next, you want to compare strings without stop words ("a", "an", "the", "from", etc.), so you need to filter these words out before comparing. You can get a list of stop words from the internet, or you can use the nltk package to import a stop-word list. You can get nltk from here.
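For example, after installing nltk you can fetch its stop-word corpus once and use it from then on (a minimal sketch; 'stopwords' is the standard NLTK corpus name):

import nltk
nltk.download('stopwords')  # one-time download of the stop-word corpus
from nltk.corpus import stopwords
print(stopwords.words('english')[:10])  # first few entries of the English list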
If there is any issue with nltk, here is the list of stop words:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself',
'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which',
'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be',
'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an',
'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for',
'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',
'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under',
'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all',
'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not',
'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don',
'should', 'now']
Now use this simple code to get your desired output:
from stemming.porter2 import stem
from nltk.corpus import stopwords

stopwords_ = stopwords.words('english')

def addString(x):
    flag = True
    # normalise the candidate: drop stop words, then stem and lowercase the rest
    y = [stem(j).lower() for j in x.split() if j.lower() not in stopwords_]
    for i in section:  # `section` is the existing header list (a global here)
        i = [stem(j).lower() for j in i.split() if j.lower() not in stopwords_]
        if y == i:
            flag = False
            break
    if flag:
        section.append(x)
        print "\tNew Section Added"
Demo:
>>> from stemming.porter2 import stem
>>> from nltk.corpus import stopwords
>>> stopwords_ = stopwords.words('english')
>>>
>>> def addString(x):
...     flag = True
...     y = [stem(j).lower() for j in x.split() if j.lower() not in stopwords_]
...     for i in section:
...         i = [stem(j).lower() for j in i.split() if j.lower() not in stopwords_]
...         if y==i:
...             flag = False
...             break
...     if flag:
...         section.append(x)
...         print "\tNew Section Added"
...
>>> section = [ "Activity (Last 3 Days)", "Activity (Last 7 days)", "Executable running from disk", "Actions from File"] # initial Section list
>>> addString("Activity (Last 30 Days)")
New Section Added
>>> addString("Executables running from disk")
>>> addString("Actions from a file")
>>> section
['Activity (Last 3 Days)', 'Activity (Last 7 days)', 'Executable running from disk', 'Actions from File', 'Activity (Last 30 Days)'] # Final section list
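Note that the stemming package and the print statement above are Python 2 era. On Python 3, NLTK's Snowball stemmer (an implementation of the Porter2 algorithm) can stand in; this is a minimal sketch assuming nltk and its stopwords corpus are installed:

from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

stemmer = SnowballStemmer("english")
stopwords_ = set(stopwords.words("english"))

def normalize(header):
    # stop-word-free, stemmed, lowercased token list used as the comparison key
    return [stemmer.stem(w.lower()) for w in header.split() if w.lower() not in stopwords_]

def add_header(section, header):
    if not any(normalize(header) == normalize(existing) for existing in section):
        section.append(header)
        print("\tNew Section Added")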
I am trying to remove common words from a text. For example, take the sentence
"It is not a commonplace river, but on the contrary is in all ways remarkable."
I want to turn it into just the unique words. This means removing "it", "but", "a", etc. I have a text file that contains all the common words and another text file that contains a paragraph. How can I delete the common words from the paragraph text file?
For example:
['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
How do I remove the common words from the file efficiently? I have a text file called common.txt that lists all the common words. How do I use that list to remove the identical words from the sentence above? The end output I want:
['commonplace', 'river', 'contrary', 'remarkable']
Does that make sense?
Thanks.
You would want to use set objects in Python.
If order and number of occurrence are not important:
str_list = ['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
common_words = ['It', 'is', 'not', 'a', 'but', 'on', 'the', 'in', 'all', 'ways','other_words']
set(str_list) - set(common_words)
>>> {'contrary', 'commonplace', 'river', 'remarkable'}
If both are important:
#Using "set" is so much faster
common_set = set(common_words)
[s for s in str_list if s not in common_set]
>>> ['commonplace', 'river', 'contrary', 'remarkable']
Here's an example that you can use:
l = text.replace(",", "").replace(".", "").split(" ")
occurs = {}
for word in l:
    occurs[word] = l.count(word)  # how many times each word appears
resultx = ''
for word in occurs.keys():
    if occurs[word] < 3:  # keep only words that occur fewer than 3 times
        resultx += word + " "
resultx = resultx[:-1]  # drop the trailing space
You can replace 3 with whatever threshold you think is suited, or base it on the average count using:
sum(occurs.values()) / len(occurs)
Additionally, if you want it to be case-insensitive, change the first line to:
l = text.replace(",","").replace(".","").lower().split(" ")
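For instance, with a short sample text (output assumes Python 3.7+, where dicts keep insertion order):

>>> text = "the cat sat on the mat. the end."
>>> # ... run the snippet above ...
>>> resultx
'cat sat on mat end'

Here "the" occurs three times, so it is dropped, while every other word survives.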
The simplest method would be to just read() your common.txt and then use a list comprehension, taking only the words that are not in the file we read:
with open('common.txt') as f:
    content = f.read().split()  # split into words so we compare whole words, not substrings

s = ['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
res = [i for i in s if i not in content]
print(res)
# ['commonplace', 'river', 'contrary', 'remarkable']
filter also works here:
res = list(filter(lambda x: x not in content, s))
I'm trying to split a string using a regular expression in Python and get all the matched literals.
RE: \w+(\.?\w+)*
This needs to capture [a-zA-Z0-9_]-like stuff only.
Here is an example, but when I try to match and get all the contents from the string, it doesn't return the proper results.
Code snippet:
>>> import re
>>> from pprint import pprint
>>> pattern = r"\w+(\.?\w+)*"
>>> string = """this is some test string and there are some digits as well that need to be captured as well like 1234567890 and 321 etc. But it should also select _ as well. I'm pretty sure that that RE does exactly the same.
... Oh wait, it also need to filter out the symbols like !##$%^&*()-+=[]{}.,;:'"`| \(`.`)/
...
... I guess that's it."""
>>> pprint(re.findall(r"\w+(.?\w+)*", string))
[' etc', ' well', ' same', ' wait', ' like', ' it']
It's only returning some of the words, but it should return all the words, numbers and underscore(s) [as in the linked example].
Python version: Python 3.6.2 (default, Jul 17 2017, 16:44:45)
Thanks.
You need to use a non-capturing group (see here for why) and escape the dot (see here for which characters should be escaped in a regex):
>>> import re
>>> from pprint import pprint
>>> pattern = r"\w+(?:\.?\w+)*"
>>> string = """this is some test string and there are some digits as well that need to be captured as well like 1234567890 and 321 etc. But it should also select _ as well. I'm pretty sure that that RE does exactly the same.
... Oh wait, it also need to filter out the symbols like !##$%^&*()-+=[]{}.,;:'"`| \(`.`)/
...
... I guess that's it."""
>>> pprint(re.findall(pattern, string, re.A))
['this', 'is', 'some', 'test', 'string', 'and', 'there', 'are', 'some', 'digits', 'as', 'well', 'that', 'need', 'to', 'be', 'captured', 'as', 'well', 'like', '1234567890', 'and', '321', 'etc', 'But', 'it', 'should', 'also', 'select', '_', 'as', 'well', 'I', 'm', 'pretty', 'sure', 'that', 'that', 'RE', 'does', 'exactly', 'the', 'same', 'Oh', 'wait', 'it', 'also', 'need', 'to', 'filter', 'out', 'the', 'symbols', 'like', 'I', 'guess', 'that', 's', 'it']
Also, to match only ASCII letters, digits and _, you must pass the re.A flag.
See the Python demo.
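As a quick illustration of what re.A changes (in Python 3, \w also matches non-ASCII word characters unless the flag is set):

>>> import re
>>> re.findall(r"\w+", "naïve café")
['naïve', 'café']
>>> re.findall(r"\w+", "naïve café", re.A)
['na', 've', 'caf']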
import urllib2,sys
from bs4 import BeautifulSoup,NavigableString
from string import punctuation as p
# URL for Obama's presidential acceptance speech in 2008
obama_4427_url = 'http://www.millercenter.org/president/obama/speeches/speech-4427'
# read in URL
obama_4427_html = urllib2.urlopen(obama_4427_url).read()
# BS magic
obama_4427_soup = BeautifulSoup(obama_4427_html)
# find the speech itself within the HTML
obama_4427_div = obama_4427_soup.find('div',{'id': 'transcript'},{'class': 'displaytext'})
# obama_4427_div.text.lower() removes extraneous characters (e.g. '<br/>')
# and places all letters in lowercase
obama_4427_str = obama_4427_div.text.lower()
# for further text analysis, remove punctuation
for punct in list(p):
    obama_4427_str_processed = obama_4427_str.replace(p,'')
    obama_4427_str_processed_2 = obama_4427_str_processed.replace(p,'')
print(obama_4427_str_processed_2)
# store individual words
words = obama_4427_str_processed.split(' ')
print(words)
Long story short, I have a speech from President Obama and am looking to remove all punctuation so that I'm left with only the words. I've imported the punctuation module and ran a for loop, which didn't remove all my punctuation. What am I doing wrong here?
str.replace() searches for the whole value of the first argument. It is not a pattern, so only if the whole string.punctuation value is present will it be replaced with an empty string.
Use a regular expression instead:
import re
from string import punctuation as p
punctuation = re.compile('[{}]+'.format(re.escape(p)))
obama_4427_str_processed = punctuation.sub('', obama_4427_str)
words = obama_4427_str_processed.split()
Note that you can just use str.split() without an argument to split on any arbitrary-width whitespace, including newlines.
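For instance, a quick check of the compiled pattern on a sample sentence (note that the apostrophe is part of string.punctuation, so "I'm" becomes "Im"):

>>> punctuation.sub('', "I'm pretty sure, it works!")
'Im pretty sure it works'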
If you want to remove the punctuation you can rstrip it off:
obama_4427_str = obama_4427_div.text.lower()
# for further text analysis, remove punctuation
from string import punctuation
print([w.rstrip(punctuation) for w in obama_4427_str.split()])
Output:
['transcript', 'to', 'chairman', 'dean', 'and', 'my', 'great',
'friend', 'dick', 'durbin', 'and', 'to', 'all', 'my', 'fellow',
'citizens', 'of', 'this', 'great', 'nation', 'with', 'profound',
'gratitude', 'and', 'great', 'humility', 'i', 'accept', 'your',
'nomination', 'for', 'the', 'presidency', 'of', 'the', 'united',
................................................................
Using Python 3, to remove punctuation from anywhere in the string, use str.translate:
from string import punctuation
tbl = str.maketrans({ord(ch):"" for ch in punctuation})
obama_4427_str = obama_4427_div.text.lower().translate(tbl)
print(obama_4427_str.split())
For Python 2:
from string import punctuation
obama_4427_str = obama_4427_div.text.lower().encode("utf-8").translate(None,punctuation)
print( obama_4427_str.split())
Output:
['transcript', 'to', 'chairman', 'dean', 'and', 'my', 'great',
'friend', 'dick', 'durbin', 'and', 'to', 'all', 'my', 'fellow',
'citizens', 'of', 'this', 'great', 'nation', 'with', 'profound',
'gratitude', 'and', 'great', 'humility', 'i', 'accept', 'your',
'nomination', 'for', 'the', 'presidency', 'of', 'the', 'united',
............................................................
On another note, you can iterate over a string, so list(p) is redundant in your own code.
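A minimal illustration of the Python 3 version (using the same tbl mapping built above):

>>> from string import punctuation
>>> tbl = str.maketrans({ord(ch): "" for ch in punctuation})
>>> "Hello, world! It's me.".lower().translate(tbl)
'hello world its me'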
I am trying to isolate the first words in a series of sentences using Python/NLTK.
I created an unimportant series of sentences (the_text), and while I am able to divide it into tokenized sentences, I cannot successfully separate just the first words of each sentence into a list (first_words). The tokenized sentences look like this:
[['Here', 'is', 'some', 'text', '.'], ['There', 'is', 'a', 'a', 'person', 'on', 'the', 'lawn', '.'], ['I', 'am', 'confused', '.'], ['There', 'is', 'more', '.'], ['Here', 'is', 'some', 'more', '.'], ['I', 'do', "n't", 'know', 'anything', '.'], ['I', 'should', 'add', 'more', '.'], ['Look', ',', 'here', 'is', 'more', 'text', '.'], ['How', 'great', 'is', 'that', '?']]
the_text="Here is some text. There is a a person on the lawn. I am confused. "
the_text= (the_text + "There is more. Here is some more. I don't know anything. ")
the_text= (the_text + "I should add more. Look, here is more text. How great is that?")
sents_tok=nltk.sent_tokenize(the_text)
sents_words=[nltk.word_tokenize(sent) for sent in sents_tok]
number_sents=len(sents_words)
print (number_sents)
print(sents_words)
for i in sents_words:
    first_words=[]
    first_words.append(sents_words (i,0))
print(first_words)
Thanks for the help!
There are three problems with your code, and you have to fix all three to make it work:
for i in sents_words:
    first_words=[]
    first_words.append(sents_words (i,0))
First, you're erasing first_words each time through the loop: move the first_words=[] outside the loop.
Second, you're mixing up function calling syntax (parentheses) with indexing syntax (brackets): you want sents_words[i][0].
Third, for i in sents_words: iterates over the elements of sents_words, not the indices. So you just want i[0]. (Or, alternatively, for i in range(len(sents_words)), but there's no reason to do that.)
So, putting it together:
first_words=[]
for i in sents_words:
    first_words.append(i[0])
If you know anything about comprehensions, you may recognize that this pattern (start with an empty list, iterate over something, appending some expression to the list) is exactly what a list comprehension does:
first_words = [i[0] for i in sents_words]
If you don't, then either now is a good time to learn about comprehensions, or don't worry about this part. :)
>>> sents_words = [['Here', 'is', 'some', 'text', '.'], ['There', 'is', 'a', 'a', 'person', 'on', 'the', 'lawn', '.'], ['I', 'am', 'confused', '.'], ['There', 'is', 'more', '.'], ['Here', 'is', 'some', 'more', '.'], ['I', 'do', "n't", 'know', 'anything', '.'], ['I', 'should', 'add', 'more', '.'], ['Look', ',', 'here', 'is', 'more', 'text', '.'], ['How', 'great', 'is', 'that', '?']]
You can use a loop to append to a list you've initialized previously:
>>> first_words = []
>>> for i in sents_words:
...     first_words.append(i[0])
...
>>> print(*first_words)
Here There I There Here I I Look How
or a comprehension (replace those square brackets with parentheses to create a generator instead):
>>> first_words = [i[0] for i in sents_words]
>>> print(*first_words)
Here There I There Here I I Look How
or if you don't need to save it for later use, you can directly print the items:
>>> print(*(i[0] for i in sents_words))
Here There I There Here I I Look How
Here's an example of how to access items in lists and list of lists:
>>> fruits = ['apple','orange', 'banana']
>>> fruits[0]
'apple'
>>> fruits[1]
'orange'
>>> cars = ['audi', 'ford', 'toyota']
>>> cars[0]
'audi'
>>> cars[1]
'ford'
>>> things = [fruits, cars]
>>> things[0]
['apple', 'orange', 'banana']
>>> things[1]
['audi', 'ford', 'toyota']
>>> things[0][0]
'apple'
>>> things[0][1]
'orange'
For your problem:
>>> from nltk import sent_tokenize, word_tokenize
>>>
>>> the_text="Here is some text. There is a a person on the lawn. I am confused. There is more. Here is some more. I don't know anything. I should add more. Look, here is more text. How great is that?"
>>>
>>> tokenized_text = [word_tokenize(s) for s in sent_tokenize(the_text)]
>>>
>>> first_words = []
>>> # Iterates through the sentences.
... for sent in tokenized_text:
...     print sent
...
['Here', 'is', 'some', 'text', '.']
['There', 'is', 'a', 'a', 'person', 'on', 'the', 'lawn', '.']
['I', 'am', 'confused', '.']
['There', 'is', 'more', '.']
['Here', 'is', 'some', 'more', '.']
['I', 'do', "n't", 'know', 'anything', '.']
['I', 'should', 'add', 'more', '.']
['Look', ',', 'here', 'is', 'more', 'text', '.']
['How', 'great', 'is', 'that', '?']
>>> # First words in each sentence.
... for sent in tokenized_text:
...     word0 = sent[0]
...     first_words.append(word0)
...     print word0
...
...
Here
There
I
There
Here
I
I
Look
How
>>> print first_words
['Here', 'There', 'I', 'There', 'Here', 'I', 'I', 'Look', 'How']
As one-liners with list comprehensions:
# From the_text, you extract the first word directly
first_words = [word_tokenize(s)[0] for s in sent_tokenize(the_text)]
# From tokenized_text
tokenized_text= [word_tokenize(s) for s in sent_tokenize(the_text)]
first_words = [s[0] for s in tokenized_text]
Another alternative, although it's pretty much similar to abarnert's suggestion:
first_words = []
for i in range(number_sents):
    first_words.append(sents_words[i][0])
I am trying to parse a query which has text plus a number.
Example: Apple iphone 6 results in:
Results for And([Term('title', u'apple'), Term('title', u'iphone')])
while Apple iphone 62 results in:
Results for And([Term('title', u'apple'), Term('title', u'iphone'), Term('title', u'62')])
Why isn't it accepting the single-digit number?
All single-character words are considered stop words in Whoosh by default and are ignored; this applies to single letters and single digits alike.
Stop words are words which are filtered out before or after processing of natural language data (text). (ref)
You can check that StopFilter has minsize=2 by default, in addition to its predefined stop-word set:
class whoosh.analysis.StopFilter(
    stoplist=frozenset(['and', 'is', 'it', 'an', 'as', 'at', 'have', 'in', 'yet', 'if', 'from', 'for', 'when', 'by', 'to', 'you', 'be', 'we', 'that', 'may', 'not', 'with', 'tbd', 'a', 'on', 'your', 'this', 'of', 'us', 'will', 'can', 'the', 'or', 'are']),
    minsize=2,
    maxsize=None,
    renumber=True,
    lang=None
)
So you can resolve this issue by redefining your schema, either removing the StopFilter or using it with minsize=1:
from whoosh.fields import Schema, TEXT
from whoosh.analysis import StandardAnalyzer

# stoplist=None omits the StopFilter entirely
schema = Schema(content=TEXT(analyzer=StandardAnalyzer(stoplist=None)))
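To see the effect end to end, here is a sketch (assuming Whoosh is installed; RamStorage just keeps the demo index in memory, and the field name title matches the query output above):

from whoosh.fields import Schema, TEXT
from whoosh.analysis import StandardAnalyzer
from whoosh.filedb.filestore import RamStorage
from whoosh.qparser import QueryParser

# title is analysed without a stop filter, so single-character tokens survive
schema = Schema(title=TEXT(analyzer=StandardAnalyzer(stoplist=None), stored=True))
ix = RamStorage().create_index(schema)

with ix.writer() as w:
    w.add_document(title=u"Apple iphone 6")

with ix.searcher() as s:
    query = QueryParser("title", ix.schema).parse(u"Apple iphone 6")
    print(query)  # the Term for '6' is no longer dropped
    print([hit["title"] for hit in s.search(query)])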