Splitting words and taking specific words - Python

I have a problem with my Python script; it's a straightforward problem, but I can't solve it.
For example, I have a text of 11 words, which I split using the re.split(r"\s+", text) function:
import re
text = "this is the example text and i will splitting this text"
split = re.split(r"\s+", text)
for a in range(len(split)):
    print(split[a])
The result is
this
is
the
example
text
and
i
will
splitting
this
text
I only need to take the last 10 of the 11 words, so the result I need looks like this:
is
the
example
text
and
i
will
splitting
this
text
Can you solve this problem? It would be very helpful.
Thank you!

Just slice it, like this:
>>> import re
>>>
>>> text = "this is the example text and i will splitting this text"
>>> split = re.split(r"\s+", text)
>>> split
['this', 'is', 'the', 'example', 'text', 'and', 'i', 'will', 'splitting', 'this', 'text']
>>> split[-10:]
['is', 'the', 'example', 'text', 'and', 'i', 'will', 'splitting', 'this', 'text']
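Since this case only drops the first word, a slice from index 1 gives the same result for this input:
>>> split[1:]
['is', 'the', 'example', 'text', 'and', 'i', 'will', 'splitting', 'this', 'text']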

No need for a regex:
text = "this is the example text and i will splitting this text"
l = text.split()  # split on whitespace
l.pop(0)          # remove the first word
print(l)          # print the result
Result: ['is', 'the', 'example', 'text', 'and', 'i', 'will', 'splitting', 'this', 'text']
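If all you want is everything after the first word, a one-liner using split's maxsplit argument also works (a small sketch of the same idea, not the code above):
text = "this is the example text and i will splitting this text"
rest = text.split(None, 1)[1]  # split once on whitespace, keep the remainder
print(rest.split())            # the same ten words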

Related

How to only return actual tokens, rather than empty variables when tokenizing?

I have a function:
def remove_stopwords(text):
    return [[word for word in simple_preprocess(str(doc), min_len=2) if word not in stop_words] for doc in texts]
My input is a list with a tokenized sentence:
input = ['This', 'is', 'an', 'example', 'of', 'my', 'input']
Assume that stop_words contains the words 'this', 'is', 'an', 'of' and 'my'; the output I would like to get is:
desired_output = ['example', 'input']
However, the actual output that I'm getting now is:
actual_output = [[], [], [], ['example'], [], [], ['input']]
How can I adjust my code, to get this output?
There are two solutions to your problem:
Solution 1:
Your remove_stopwords requires an array of documents to work properly, so modify your input like this:
input = [['This', 'is', 'an', 'example', 'of', 'my', 'input']]
Solution 2:
You change your remove_stopwords function to work on a single document
def remove_stopwords(text):
    return [word for word in simple_preprocess(str(text), min_len=2) if word not in stop_words]
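A minimal usage sketch for Solution 2 (assuming simple_preprocess comes from gensim and that stop_words is a lowercase set; both names here are illustrative):
from gensim.utils import simple_preprocess

stop_words = {'this', 'is', 'an', 'of', 'my'}
tokens = ['This', 'is', 'an', 'example', 'of', 'my', 'input']
# join the pre-tokenized sentence back into one document before preprocessing
print(remove_stopwords(' '.join(tokens)))  # ['example', 'input']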
You can use the code below for removing stopwords, if there is no specific reason to use your own code.
def remove_stopwords(text):
    wordsFiltered = []  # fresh list per call, so repeated calls don't accumulate
    for w in text:
        if w not in stop_words:
            wordsFiltered.append(w)
    return wordsFiltered

input = ['This', 'is', 'an', 'example', 'of', 'my', 'input']
stop_words = ['This', 'is', 'an', 'of', 'my']
print(remove_stopwords(input))
Output:
['example', 'input']

removing common words from a text file

I am trying to remove common words from a text. For example, take the sentence
"It is not a commonplace river, but on the contrary is in all ways remarkable."
I want to turn it into just the unique words. This means removing "it", "but", "a", etc. I have a text file that contains all the common words and another text file that contains a paragraph. How can I delete the common words from the paragraph text file?
For example:
['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
How do I remove the common words from the file efficiently? I have a text file called common.txt that lists all the common words. How do I use that list to remove matching words from the sentence above? The end output I want:
['commonplace', 'river', 'contrary', 'remarkable']
Does that make sense?
Thanks.
You would want to use set objects in Python.
If order and number of occurrence are not important:
str_list = ['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
common_words = ['It', 'is', 'not', 'a', 'but', 'on', 'the', 'in', 'all', 'ways','other_words']
set(str_list) - set(common_words)
>>> {'contrary', 'commonplace', 'river', 'remarkable'}
If both are important:
#Using "set" is so much faster
common_set = set(common_words)
[s for s in str_list if not s in common_set]
>>> ['commonplace', 'river', 'contrary', 'remarkable']
Here's an example that you can use:
l = text.replace(",", "").replace(".", "").split(" ")
occurs = {}
for word in l:
    occurs[word] = l.count(word)
resultx = ''
for word in occurs.keys():
    if occurs[word] < 3:
        resultx += word + " "
resultx = resultx[:-1]
You can replace 3 with whatever threshold suits you, or base it on the average count using:
sum(occurs.values()) / len(occurs)
Additional
If you want it to be case-insensitive, change the first line to:
l = text.replace(",","").replace(".","").lower().split(" ")
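As an aside, the same counting idea can be written more compactly with collections.Counter (a sketch, not the answer's original code; the threshold of 3 is kept from above):
from collections import Counter

words = text.replace(",", "").replace(".", "").lower().split()
occurs = Counter(words)  # word -> number of occurrences
resultx = " ".join(w for w in occurs if occurs[w] < 3)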
The simplest method would be to read() your common.txt and then use a list comprehension, taking only the words that are not in the word list we read:
with open('common.txt') as f:
    content = f.read().split()  # one list entry per common word

s = ['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
res = [i for i in s if i not in content]
print(res)
# ['commonplace', 'river', 'contrary', 'remarkable']
filter also works here
res = list(filter(lambda x: x not in content, s))
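For larger inputs, membership tests against a set are faster than against a list; a hedged variant of the same approach (still assuming common.txt holds whitespace-separated common words):
with open('common.txt') as f:
    common = set(f.read().split())

res = [w for w in s if w not in common]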

How to join a list while preserving previous structure?

I am having trouble joining a pre-split string after modification while preserving the previous structure.
say I have a string like this:
string = """
This is a nice piece of string isn't it?
I assume it is so. I have to keep typing
to use up the space. La-di-da-di-da.
This is a spaced out sentence
Bonjour.
"""
I have to run some tests on that string, finding specific words and characters within those words, etc., and then replace them accordingly. To accomplish that, I had to break it up using
string.split()
The problem is that split() also gets rid of the \n characters and the extra spaces, immediately ruining the integrity of the previous structure.
Are there some extra arguments to split that will allow me to accomplish this, or should I seek an alternative route?
Thank you.
The split method takes an optional argument to specify the delimiter. If you only want to split words using space (' ') characters, you can pass that as an argument:
>>> string = """
...
... This is a nice piece of string isn't it?
... I assume it is so. I have to keep typing
... to use up the space. La-di-da-di-da.
...
... Bonjour.
... """
>>>
>>> string.split()
['This', 'is', 'a', 'nice', 'piece', 'of', 'string', "isn't", 'it?', 'I', 'assume', 'it', 'is', 'so.', 'I', 'have', 'to', 'keep', 'typing', 'to', 'use', 'up', 'the', 'space.', 'La-di-da-di-da.', 'Bonjour.']
>>> string.split(' ')
['\n\nThis', 'is', 'a', 'nice', 'piece', 'of', 'string', "isn't", 'it?\nI', 'assume', 'it', 'is', 'so.', 'I', 'have', 'to', 'keep', 'typing\nto', 'use', 'up', 'the', 'space.', 'La-di-da-di-da.\n\nBonjour.\n']
>>>
The split method will split your string on all whitespace by default. If you want to split the lines separately, you can first split your string on newlines, then split each line on whitespace:
>>> [line.split() for line in string.strip().split('\n')]
[['This', 'is', 'a', 'nice', 'piece', 'of', 'string', "isn't", 'it?'], ['I', 'assume', 'it', 'is', 'so.', 'I', 'have', 'to', 'keep', 'typing'], ['to', 'use', 'up', 'the', 'space.', 'La-di-da-di-da.'], [], ['Bonjour.']]
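To reassemble the text after modifying the nested lists, you can invert the two splits (a sketch: blank lines survive as empty inner lists, but runs of multiple spaces within a line collapse to single spaces):
lines = [line.split() for line in string.strip().split('\n')]
# ... modify the words inside `lines` here ...
restored = '\n'.join(' '.join(line) for line in lines)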
Just split with a delimiter:
>>> a = string.split(' ')
>>> a
['\n\nThis', 'is', 'a', 'nice', 'piece', 'of', 'string', "isn't", 'it?\nI', 'assume', 'it', 'is', 'so.', 'I', 'have', 'to', 'keep', 'typing\nto', 'use', 'up', 'the', 'space.', 'La-di-da-di-da.\n\nThis', '', '', 'is', '', '', '', 'a', '', '', '', 'spaced', '', '', 'out', '', '', 'sentence\n\nBonjour.\n']
And to get it back:
>>> print(' '.join(a))
This is a nice piece of string isn't it?
I assume it is so. I have to keep typing
to use up the space. La-di-da-di-da.
This is a spaced out sentence
Bonjour.
Just do string.split(' ') (note the space argument to the split method).
This will keep your precious newlines within the strings that go into the resulting array...
You can save the whitespace runs in another list; then, after modifying the words list, you can zip the two back together.
In [1]: from nltk.tokenize import RegexpTokenizer
In [2]: spacestokenizer = RegexpTokenizer(r'\s+', gaps=False)
In [3]: wordtokenizer = RegexpTokenizer(r'\s+', gaps=True)
In [4]: string = """
...:
...: This is a nice piece of string isn't it?
...: I assume it is so. I have to keep typing
...: to use up the space. La-di-da-di-da.
...:
...: This is a spaced out sentence
...:
...: Bonjour.
...: """
In [5]: spaces = spacestokenizer.tokenize(string)
In [6]: words = wordtokenizer.tokenize(string)
In [7]: print(''.join([s+w for s, w in zip(spaces, words)]))
This is a nice piece of string isn't it?
I assume it is so. I have to keep typing
to use up the space. La-di-da-di-da.
This is a spaced out sentence
Bonjour.

text.replace(punctuation,'') does not remove all punctuation contained in list(punctuation)?

import urllib2,sys
from bs4 import BeautifulSoup,NavigableString
from string import punctuation as p
# URL for Obama's presidential acceptance speech in 2008
obama_4427_url = 'http://www.millercenter.org/president/obama/speeches/speech-4427'
# read in URL
obama_4427_html = urllib2.urlopen(obama_4427_url).read()
# BS magic
obama_4427_soup = BeautifulSoup(obama_4427_html)
# find the speech itself within the HTML
obama_4427_div = obama_4427_soup.find('div',{'id': 'transcript'},{'class': 'displaytext'})
# obama_4427_div.text.lower() removes extraneous characters (e.g. '<br/>')
# and places all letters in lowercase
obama_4427_str = obama_4427_div.text.lower()
# for further text analysis, remove punctuation
for punct in list(p):
    obama_4427_str_processed = obama_4427_str.replace(p,'')
    obama_4427_str_processed_2 = obama_4427_str_processed.replace(p,'')
print(obama_4427_str_processed_2)
# store individual words
words = obama_4427_str_processed.split(' ')
print(words)
Long story short, I have a speech from President Obama and am looking to remove all the punctuation, so that I'm left with only the words. I imported string.punctuation and ran a for loop, but it didn't remove all the punctuation. What am I doing wrong here?
str.replace() searches for the whole value of the first argument. It is not a pattern, so the replacement only happens where the entire string.punctuation value appears as a literal substring.
Use a regular expression instead:
import re
from string import punctuation as p
punctuation = re.compile('[{}]+'.format(re.escape(p)))
obama_4427_str_processed = punctuation.sub('', obama_4427_str)
words = obama_4427_str_processed.split()
Note that you can just use str.split() without an argument to split on any arbitrary-width whitespace, including newlines.
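A quick interactive check of the compiled pattern (the sample string is made up for illustration):
>>> punctuation.sub('', "hello, world... (really!)")
'hello world really'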
If you want to remove trailing punctuation, you can rstrip it off:
obama_4427_str = obama_4427_div.text.lower()
# for further text analysis, remove punctuation
from string import punctuation
print([w.rstrip(punctuation) for w in obama_4427_str.split()])
Output:
['transcript', 'to', 'chairman', 'dean', 'and', 'my', 'great',
'friend', 'dick', 'durbin', 'and', 'to', 'all', 'my', 'fellow',
'citizens', 'of', 'this', 'great', 'nation', 'with', 'profound',
'gratitude', 'and', 'great', 'humility', 'i', 'accept', 'your',
'nomination', 'for', 'the', 'presidency', 'of', 'the', 'united',
................................................................
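Note that rstrip() only trims punctuation from the right-hand end of each word; to trim both ends, a hedged tweak is to use strip() instead:
print([w.strip(punctuation) for w in obama_4427_str.split()])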
Using Python 3, to remove punctuation from anywhere in the string, use str.translate:
from string import punctuation
tbl = str.maketrans({ord(ch): "" for ch in punctuation})
obama_4427_str = obama_4427_div.text.lower().translate(tbl)
print(obama_4427_str.split())
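An equivalent table can be built with the three-argument form of str.maketrans, whose third argument lists the characters to delete:
tbl = str.maketrans('', '', punctuation)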
For Python 2:
from string import punctuation
obama_4427_str = obama_4427_div.text.lower().encode("utf-8").translate(None,punctuation)
print(obama_4427_str.split())
Output:
['transcript', 'to', 'chairman', 'dean', 'and', 'my', 'great',
'friend', 'dick', 'durbin', 'and', 'to', 'all', 'my', 'fellow',
'citizens', 'of', 'this', 'great', 'nation', 'with', 'profound',
'gratitude', 'and', 'great', 'humility', 'i', 'accept', 'your',
'nomination', 'for', 'the', 'presidency', 'of', 'the', 'united',
............................................................
On another note, you can iterate over a string directly, so list(p) is redundant in your own code.

Python regex string to list of words (including words with hyphens)

I would like to parse a string to obtain a list including all words (hyphenated words, too). Current code is:
s = '-this is. A - sentence;one-word'
re.compile("\W+",re.UNICODE).split(s)
returns:
['', 'this', 'is', 'A', 'sentence', 'one', 'word']
and I would like it to return:
['', 'this', 'is', 'A', 'sentence', 'one-word']
If you don't need the leading empty string, you could use the pattern \w(?:[-\w]*\w)? for matching:
>>> import re
>>> s = '-this is. A - sentence;one-word'
>>> rx = re.compile(r'\w(?:[-\w]*\w)?')
>>> rx.findall(s)
['this', 'is', 'A', 'sentence', 'one-word']
Note that it won't match words with apostrophes, like won't.
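If apostrophes should also be kept inside words, a hedged extension of the same pattern adds them to the inner character class:
>>> re.findall(r"\w(?:[-'\w]*\w)?", "-this is. A - sentence;one-word won't")
['this', 'is', 'A', 'sentence', 'one-word', "won't"]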
Here is my traditional "why use the regexp language when you can use Python" alternative:
import string

s = "-this is. A - sentence;one-word what's"
s = list(filter(None, [word.strip(string.punctuation)
                       for word in s.replace(';', '; ').split()]))
print(s)
""" Output:
['this', 'is', 'A', 'sentence', 'one-word', "what's"]
"""
You could use "[^\w-]+" instead.
s = "-this is. A - sentence;one-word what's"
re.findall("\w+-\w+|[\w']+",s)
result:
['this', 'is', 'A', 'sentence', 'one-word', "what's"]
Make sure you notice that the correct ordering is to look for hyphenated words first!
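To see why the ordering matters, here is the same call with the alternation reversed; the [\w']+ branch wins first, so the hyphen is never consumed:
>>> re.findall(r"[\w']+|\w+-\w+", s)
['this', 'is', 'A', 'sentence', 'one', 'word', "what's"]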
You can try the NLTK library:
>>> import nltk
>>> s = '-this is a - sentence;one-word'
>>> hyphen = r'(\w+\-\s?\w+)'
>>> wordr = r'(\w+)'
>>> r = "|".join([ hyphen, wordr])
>>> tokens = nltk.tokenize.regexp_tokenize(s,r)
>>> print(tokens)
['this', 'is', 'a', 'sentence', 'one-word']
I found it here: http://www.cs.oberlin.edu/~jdonalds/333/lecture03.html. Hope it helps!
