newContents = ['The', 'crazy', 'panda', 'walked', 'to', 'the', 'Maulik', 'and', 'then', 'picked.', 'A', 'nearby', 'Ankur', 'was\n', 'unaffected', 'by', 'these', 'events.\n']
print(' '.join(newContents))
output:
The crazy panda walked to the Maulik and then picked. A nearby Ankur was
unaffected by these events.
There is a space before the first word, "unaffected", on the second line; I don't want a space there.
There's a simple enough solution: replace \n[space] with \n. That way all other spaces are left alone, and the only string replaced is \n[space], which becomes a newline without a trailing space.
>>> newContents = ['The', 'crazy', 'panda', 'walked', 'to', 'the', 'Maulik', 'and', 'then', 'picked.', 'A', 'nearby', 'Ankur', 'was\n', 'unaffected', 'by', 'these', 'events.\n']
>>> print(' '.join(newContents).replace('\n ', '\n'))
The crazy panda walked to the Maulik and then picked. A nearby Ankur was
unaffected by these events.
You could remove it after join:
your_string = ' '.join(newContents).replace('\n ', '\n')
print(your_string)
You could use replace to check for a space after a newline:
print(' '.join(newContents).replace('\n ', '\n'))
It outputs:
The crazy panda walked to the Maulik and then picked. A nearby Ankur was
unaffected by these events.
Use re.sub function to remove spaces right after newline:
import re
newContents = ['The', 'crazy', 'panda', 'walked', 'to', 'the', 'Maulik', 'and', 'then', 'picked.', 'A', 'nearby', 'Ankur', 'was\n', 'unaffected', 'by', 'these', 'events.\n']
print(re.sub(r'\n\s+', '\n',' '.join(newContents)))
The output:
The crazy panda walked to the Maulik and then picked. A nearby Ankur was
unaffected by these events.
The above will also remove multiple spaces after the newline, if they occur (note that \s matches any whitespace, so it would collapse following blank lines too).
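A quick check of that behaviour on a hypothetical two-space input:
>>> import re
>>> re.sub(r'\n\s+', '\n', 'was\n  unaffected')
'was\nunaffected'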
Strip the whitespace from each one:
>>> newContents = ['The', 'crazy', 'panda', 'walked', 'to', 'the', 'Maulik', 'and', 'then', 'picked.', 'A', 'nearby', 'Ankur', 'was\n', 'unaffected', 'by', 'these', 'events.\n']
>>> print(' '.join(item.strip() for item in newContents))
The crazy panda walked to the Maulik and then picked. A nearby Ankur was unaffected by these events.
Related
I am trying to remove common words from a text. For example, the sentence
"It is not a commonplace river, but on the contrary is in all ways remarkable."
I want to turn it into just unique words. This means removing "it", "but", "a" etc. I have a text file that has all the common words and another text file that contains a paragraph. How can I delete the common words in the paragraph text file?
For example:
['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
How do I remove the common words from the paragraph efficiently? I have a text file called common.txt that lists all the common words. How do I use that list to remove the matching words from the sentence above? The end output I want:
['commonplace', 'river', 'contrary', 'remarkable']
Does that make sense?
Thanks.
You would want to use "set" objects in Python.
If order and number of occurrences are not important:
str_list = ['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
common_words = ['It', 'is', 'not', 'a', 'but', 'on', 'the', 'in', 'all', 'ways','other_words']
set(str_list) - set(common_words)
# {'contrary', 'commonplace', 'river', 'remarkable'}
If both are important:
# Using a set makes the membership test much faster
common_set = set(common_words)
[s for s in str_list if s not in common_set]
# ['commonplace', 'river', 'contrary', 'remarkable']
Here's an example that you can use:
l = text.replace(",","").replace(".","").split(" ")
occurs = {}
for word in l:
    occurs[word] = l.count(word)
resultx = ''
for word in occurs:
    if occurs[word] < 3:
        resultx += word + " "
resultx = resultx[:-1]
You can change 3 to whatever threshold you think is suitable, or base it on the average occurrence count using:
sum(occurs.values()) / len(occurs)
Additional
If you want it to be case-insensitive, change the 1st line to:
l = text.replace(",","").replace(".","").lower().split(" ")
The simplest method would be just to read() your common.txt and then use a list comprehension, taking only the words that are not in the file we read:
with open('common.txt') as f:
content = f.read()
s = ['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
res = [i for i in s if i not in content]
print(res)
# ['commonplace', 'river', 'contrary', 'remarkable']
filter also works here
res = list(filter(lambda x: x not in content, s))
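Note that i not in content is a substring test against the whole file text, so a word such as river would be wrongly dropped if common.txt happened to contain rivers. Splitting the file into a set of words avoids that; a minimal sketch, assuming common.txt has one word per line:
with open('common.txt') as f:
    common = set(f.read().split())  # exact word matches only

s = ['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
res = [w for w in s if w not in common]
print(res)
# ['commonplace', 'river', 'contrary', 'remarkable']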
I am having trouble joining a pre-split string after modification while preserving the previous structure.
say I have a string like this:
string = """
This is a nice piece of string isn't it?
I assume it is so. I have to keep typing
to use up the space. La-di-da-di-da.
This is a spaced out sentence
Bonjour.
"""
I have to do some tests on that string: finding specific words and characters within those words, etc., and then replacing them accordingly. So to accomplish that I had to break it up using
string.split()
The problem with this is that split also gets rid of the \n and extra spaces, immediately ruining the integrity of the previous structure.
Are there some extra arguments to split that will allow me to accomplish this, or should I seek an alternative route?
Thank you.
The split method takes an optional argument to specify the delimiter. If you only want to split words using space (' ') characters, you can pass that as an argument:
>>> string = """
...
... This is a nice piece of string isn't it?
... I assume it is so. I have to keep typing
... to use up the space. La-di-da-di-da.
...
... Bonjour.
... """
>>>
>>> string.split()
['This', 'is', 'a', 'nice', 'piece', 'of', 'string', "isn't", 'it?', 'I', 'assume', 'it', 'is', 'so.', 'I', 'have', 'to', 'keep', 'typing', 'to', 'use', 'up', 'the', 'space.', 'La-di-da-di-da.', 'Bonjour.']
>>> string.split(' ')
['\n\nThis', 'is', 'a', 'nice', 'piece', 'of', 'string', "isn't", 'it?\nI', 'assume', 'it', 'is', 'so.', 'I', 'have', 'to', 'keep', 'typing\nto', 'use', 'up', 'the', 'space.', 'La-di-da-di-da.\n\nBonjour.\n']
>>>
The split method splits your string on all whitespace by default. If you want to split the lines separately, you can first split your string on newlines, then split each line on whitespace:
>>> [line.split() for line in string.strip().split('\n')]
[['This', 'is', 'a', 'nice', 'piece', 'of', 'string', "isn't", 'it?'], ['I', 'assume', 'it', 'is', 'so.', 'I', 'have', 'to', 'keep', 'typing'], ['to', 'use', 'up', 'the', 'space.', 'La-di-da-di-da.'], [], ['Bonjour.']]
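To get the text back after editing the nested lists, reverse the two splits (note this normalizes runs of spaces to single spaces); a minimal sketch:
>>> lines = [line.split() for line in string.strip().split('\n')]
>>> print('\n'.join(' '.join(words) for words in lines))
This is a nice piece of string isn't it?
I assume it is so. I have to keep typing
to use up the space. La-di-da-di-da.

Bonjour.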
Just split with a delimiter:
>>> string.split(' ')
['\n\nThis', 'is', 'a', 'nice', 'piece', 'of', 'string', "isn't", 'it?\nI', 'assume', 'it', 'is', 'so.', 'I', 'have', 'to', 'keep', 'typing\nto', 'use', 'up', 'the', 'space.', 'La-di-da-di-da.\n\nThis', '', '', 'is', '', '', '', 'a', '', '', '', 'spaced', '', '', 'out', '', '', 'sentence\n\nBonjour.\n']
And to get it back:
>>> a = string.split(' ')
>>> print(' '.join(a))
This is a nice piece of string isn't it?
I assume it is so. I have to keep typing
to use up the space. La-di-da-di-da.
This is a spaced out sentence
Bonjour.
Just do string.split(' ') (note the space argument to the split method).
This will keep your precious newlines within the strings that go into the resulting list...
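For example, a word-level edit survives the round trip (the replace here is just a stand-in for whatever modification you need):
pieces = string.split(' ')
pieces = [p.replace('nice', 'fine') for p in pieces]
print(' '.join(pieces))  # same layout as the original, with 'nice' changed to 'fine'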
You can save the whitespace runs in another list; then, after modifying the words list, you can join them back together.
In [1]: from nltk.tokenize import RegexpTokenizer
In [2]: spacestokenizer = RegexpTokenizer(r'\s+', gaps=False)
In [3]: wordtokenizer = RegexpTokenizer(r'\s+', gaps=True)
In [4]: string = """
...:
...: This is a nice piece of string isn't it?
...: I assume it is so. I have to keep typing
...: to use up the space. La-di-da-di-da.
...:
...: This is a spaced out sentence
...:
...: Bonjour.
...: """
In [5]: spaces = spacestokenizer.tokenize(string)
In [6]: words = wordtokenizer.tokenize(string)
In [7]: print ''.join([s+w for s, w in zip(spaces, words)])
This is a nice piece of string isn't it?
I assume it is so. I have to keep typing
to use up the space. La-di-da-di-da.
This is a spaced out sentence
Bonjour.
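One caveat with the zip above: this particular string starts and ends with whitespace, so there is one more whitespace token than word tokens, and zip silently drops the last one (the trailing newline here). itertools can keep it; a minimal sketch:
from itertools import izip_longest  # zip_longest in Python 3

# keep the trailing whitespace token that zip would silently drop
restored = ''.join(s + w for s, w in izip_longest(spaces, words, fillvalue=''))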
import urllib2,sys
from bs4 import BeautifulSoup,NavigableString
from string import punctuation as p
# URL for Obama's presidential acceptance speech in 2008
obama_4427_url = 'http://www.millercenter.org/president/obama/speeches/speech-4427'
# read in URL
obama_4427_html = urllib2.urlopen(obama_4427_url).read()
# BS magic
obama_4427_soup = BeautifulSoup(obama_4427_html)
# find the speech itself within the HTML
obama_4427_div = obama_4427_soup.find('div',{'id': 'transcript'},{'class': 'displaytext'})
# obama_4427_div.text.lower() removes extraneous characters (e.g. '<br/>')
# and places all letters in lowercase
obama_4427_str = obama_4427_div.text.lower()
# for further text analysis, remove punctuation
for punct in list(p):
obama_4427_str_processed = obama_4427_str.replace(p,'')
obama_4427_str_processed_2 = obama_4427_str_processed.replace(p,'')
print(obama_4427_str_processed_2)
# store individual words
words = obama_4427_str_processed.split(' ')
print(words)
Long story short, I have a speech from President Obama, and I am looking to remove all punctuation so that I'm left only with the words. I imported the punctuation constant from the string module and ran a for loop, but it didn't remove all my punctuation. What am I doing wrong here?
str.replace() searches for the whole value of the first argument. It is not a pattern, so only if the whole string.punctuation value is present will it be replaced with an empty string.
Use a regular expression instead:
import re
from string import punctuation as p
punctuation = re.compile('[{}]+'.format(re.escape(p)))
obama_4427_str_processed = punctuation.sub('', obama_4427_str)
words = obama_4427_str_processed.split()
Note that you can just use str.split() without an argument to split on any arbitrary-width whitespace, including newlines.
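A quick check on a small stand-in string (the full speech from the question isn't reproduced here):
>>> punctuation.sub('', "with profound gratitude, and great humility, i accept!").split()
['with', 'profound', 'gratitude', 'and', 'great', 'humility', 'i', 'accept']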
If you want to remove the punctuation, you can rstrip it off:
obama_4427_str = obama_4427_div.text.lower()
# for further text analysis, remove punctuation
from string import punctuation
print([w.rstrip(punctuation) for w in obama_4427_str.split()])
Output:
['transcript', 'to', 'chairman', 'dean', 'and', 'my', 'great',
'friend', 'dick', 'durbin', 'and', 'to', 'all', 'my', 'fellow',
'citizens', 'of', 'this', 'great', 'nation', 'with', 'profound',
'gratitude', 'and', 'great', 'humility', 'i', 'accept', 'your',
'nomination', 'for', 'the', 'presidency', 'of', 'the', 'united',
................................................................
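One caveat: rstrip only trims the right-hand side, so leading punctuation (an opening quote, say) survives; strip trims both ends:
>>> from string import punctuation
>>> "'where".rstrip(punctuation), "'where".strip(punctuation)
("'where", 'where')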
Using Python 3, to remove punctuation from anywhere in the string, use str.translate:
from string import punctuation
tbl = str.maketrans({ord(ch):"" for ch in punctuation})
obama_4427_str = obama_4427_div.text.lower().translate(tbl)
print(obama_4427_str.split())
For Python 2:
from string import punctuation
obama_4427_str = obama_4427_div.text.lower().encode("utf-8").translate(None,punctuation)
print( obama_4427_str.split())
Output:
['transcript', 'to', 'chairman', 'dean', 'and', 'my', 'great',
'friend', 'dick', 'durbin', 'and', 'to', 'all', 'my', 'fellow',
'citizens', 'of', 'this', 'great', 'nation', 'with', 'profound',
'gratitude', 'and', 'great', 'humility', 'i', 'accept', 'your',
'nomination', 'for', 'the', 'presidency', 'of', 'the', 'united',
............................................................
On another note, you can iterate over a string directly, so list(p) is redundant in your own code.
Scenario:
I have some tasks performed for a respective "Section Header" (stored as a string), and the result of each task has to be saved against the matching "Existing Section Header" (also stored as a string).
While mapping, if a task's "Section Header" is one of the "Existing Section Headers", the task results are added to it.
And if not, the new Section Header gets appended to the Existing Section Header list.
The Existing Section Headers look like this:
["Activity (Last 3 Days)", "Activity (Last 7 days)", "Executable running from disk", "Actions from File"]
For the below set of strings, the expected behaviour is as follows:
"Activity (Last 30 Days)" - a new section should be added
"Executables running from disk" - the existing "Executable running from disk" should be referred to [considering the extra "s" in "Executables" the same as "Executable"]
"Actions from a file" - the existing "Actions from File" should be referred to [considering the extra article "a"]
Is there any built-in function available in Python that may help incorporate this logic? Any suggestion regarding an algorithm for this is highly appreciated.
This is a case where you may find regular expressions helpful. You can use re.sub() to find specific substrings and replace them. It searches for non-overlapping matches of a regular expression and replaces them with the specified string.
import re  # this will allow you to use regular expressions

def modifyHeader(header):
    # change the # of days to 30; note the parentheses must be escaped in the pattern
    modifiedHeader = re.sub(r"Activity \(Last \d+ [Dd]ays?\)", "Activity (Last 30 Days)", header)
    # add an s to "Executable"
    modifiedHeader = re.sub(r"Executable running from disk", "Executables running from disk", modifiedHeader)
    # add "a"
    modifiedHeader = re.sub(r"Actions from File", "Actions from a file", modifiedHeader)
    return modifiedHeader
The r"" refers to raw strings which make it a bit easier to deal with the \ characters needed for regular expressions, \d matches any digit character, and + means "1 or more". Read the page I linked above for more information.
Since you want to compare only the stem or "root word" of a given word, I suggest using a stemming algorithm. Stemming algorithms attempt to automatically remove suffixes (and in some cases prefixes) in order to find the "root word" or stem of a given word. This is useful in various natural-language-processing scenarios, such as search. Luckily there is a Python package for this: the stemming package on PyPI.
Next you want to compare strings without stop-words (a, an, the, from, etc.), so you need to filter these words out before comparing. You can find a stop-word list on the internet, or you can use the nltk package to import one.
If there is any issue with nltk, here is the list of stop words:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself',
'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which',
'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be',
'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an',
'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for',
'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',
'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under',
'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all',
'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not',
'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don',
'should', 'now']
Now use this simple code to get your desired output:
from stemming.porter2 import stem
from nltk.corpus import stopwords
stopwords_ = stopwords.words('english')
def addString(x):
    flag = True
    # compare stemmed, lowercased words with stop-words removed
    y = [stem(j).lower() for j in x.split() if j.lower() not in stopwords_]
    for i in section:
        i = [stem(j).lower() for j in i.split() if j.lower() not in stopwords_]
        if y == i:
            flag = False
            break
    if flag:
        section.append(x)
        print "\tNew Section Added"
Demo:
>>> from stemming.porter2 import stem
>>> from nltk.corpus import stopwords
>>> stopwords_ = stopwords.words('english')
>>>
>>> def addString(x):
...     flag = True
...     y = [stem(j).lower() for j in x.split() if j.lower() not in stopwords_]
...     for i in section:
...         i = [stem(j).lower() for j in i.split() if j.lower() not in stopwords_]
...         if y == i:
...             flag = False
...             break
...     if flag:
...         section.append(x)
...         print "\tNew Section Added"
...
>>> section = [ "Activity (Last 3 Days)", "Activity (Last 7 days)", "Executable running from disk", "Actions from File"] # initial Section list
>>> addString("Activity (Last 30 Days)")
New Section Added
>>> addString("Executables running from disk")
>>> addString("Actions from a file")
>>> section
['Activity (Last 3 Days)', 'Activity (Last 7 days)', 'Executable running from disk', 'Actions from File', 'Activity (Last 30 Days)'] # Final section list
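If the stemming package is hard to come by, nltk ships its own Porter stemmer, so the same idea can be written with nltk alone; a minimal sketch (normalize is a hypothetical helper name):
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

stemmer = PorterStemmer()
stopwords_ = set(stopwords.words('english'))

def normalize(header):
    # stem and lowercase each word, dropping stop-words, so that
    # "Executables running from disk" matches "Executable running from disk"
    return [stemmer.stem(w.lower()) for w in header.split() if w.lower() not in stopwords_]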
I was designing a regex to split all the actual words from a given text:
Input Example:
"John's mom went there, but he wasn't there. So she said: 'Where are you'"
Expected Output:
["John's", "mom", "went", "there", "but", "he", "wasn't", "there", "So", "she", "said", "Where", "are", "you"]
I thought of a regex like this:
"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)"
After splitting in Python, the result contains None items and empty strings.
How do I get rid of the None items? And why didn't the spaces match?
Edit:
Splitting on spaces will give items like: ["there."]
Splitting on non-letters will give items like: ["John", "s"]
And splitting on non-letters except ' will give items like: ["'Where", "you'"]
Instead of regex, you can use string-functions:
to_be_removed = ".,:!" # all characters to be removed
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
for c in to_be_removed:
    s = s.replace(c, '')
print(s.split())
BUT, in your example you do not want to remove the apostrophe in John's, while you do wish to remove it in you!!'. So string operations fail at that point, and you need a finely adjusted regex.
EDIT: probably a simple regex can solve your problem:
(\w[\w']*)
It will capture all chars that starts with a letter and keep capturing while next char is an apostrophe or letter.
(\w[\w']*\w)
This second regex is for a very specific situation. The first regex can capture words like you'. This one will avoid that and only capture an apostrophe if it is within the word (not at the beginning or the end). But then a new issue arises: you cannot capture the trailing apostrophe in Moss' mom with the second regex. You must decide whether you want to capture a trailing apostrophe in names ending with s that denote ownership.
Example:
rgx = re.compile("([\w][\w']*\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
rgx.findall(s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you']
UPDATE 2: I found a bug in my regex! It cannot capture single-letter words, such as the A in 'A a'. The fixed, brand-new regex is here:
(\w[\w']*\w|\w)
rgx = re.compile("(\w[\w']*\w|\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
rgx.findall(s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', 'a']
You have too many capturing groups in your regular expression; make them non-capturing:
(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)
Demo:
>>> import re
>>> s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
>>> re.split("(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)", s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', '']
That leaves only one spurious element: an empty string at the end.
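If that trailing empty string matters, filter it out; a minimal sketch:
>>> pattern = "(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)"
>>> [w for w in re.split(pattern, s) if w]
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you']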
This regex will only allow one ending apostrophe, which may be followed by one more character:
([\w][\w]*'?\w?)
Demo:
>>> import re
>>> s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
>>> re.compile("([\w][\w]*'?\w?)").findall(s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', "a'"]
I am new to Python, but I think I have figured it out:
import re
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
result = re.findall(r"(.+?)[\s'\",!]{1,}", s)
print(result)
Result:
['John', 's', 'mom', 'went', 'there', 'but', 'he', 'wasn', 't', 'there.', 'So', 'she', 'said:', 'Where', 'are', 'you']