I am using the tokenizer from NLTK in Python.
There are a whole bunch of answers for removing punctuation on the forum already. However, none of them addresses all of the following issues together:
More than one symbol in a row. For example, take the sentence: He said,"that's it." Because there's a comma followed by a quotation mark, the tokenizer won't remove the ." in the sentence. It will give ['He', 'said', ',"', 'that', 's', 'it.'] instead of ['He', 'said', 'that', 's', 'it']. Some other examples include '...', '--', '!?', ',"', and so on.
A symbol at the end of the sentence. E.g. the sentence: Hello World. The tokenizer will give ['Hello', 'World.'] instead of ['Hello', 'World']. Notice the period at the end of the word 'World'. Other examples include '--' or ',' at the beginning, middle, or end of any word.
Characters with symbols in front and after, e.g. '*u*', "''", '""'.
Is there an elegant way of solving all of these problems?
Solution 1: Tokenize and strip punctuation off the tokens
>>> from nltk import word_tokenize
>>> import string
>>> punctuations = list(string.punctuation)
>>> punctuations
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '#', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
>>> punctuations.append("''")
>>> sent = '''He said,"that's it."'''
>>> word_tokenize(sent)
['He', 'said', ',', "''", 'that', "'s", 'it', '.', "''"]
>>> [i for i in word_tokenize(sent) if i not in punctuations]
['He', 'said', 'that', "'s", 'it']
>>> [i.strip("".join(punctuations)) for i in word_tokenize(sent) if i not in punctuations]
['He', 'said', 'that', 's', 'it']
Solution 2: remove punctuation then tokenize
>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~'
>>> sent = '''He said,"that's it."'''
>>> " ".join("".join([" " if ch in string.punctuation else ch for ch in sent]).split())
'He said that s it'
>>> " ".join("".join([" " if ch in string.punctuation else ch for ch in sent]).split()).split()
['He', 'said', 'that', 's', 'it']
If you want to tokenize your string all in one shot, I think your only choice will be to use nltk.tokenize.RegexpTokenizer. The following approach will allow you to use punctuation as a marker to remove characters of the alphabet (as noted in your third requirement) before removing the punctuation altogether. In other words, this approach will remove *u* before stripping all punctuation.
One way to go about this, then, is to tokenize on gaps like so:
>>> from nltk.tokenize import RegexpTokenizer
>>> s = '''He said,"that's it." *u* Hello, World.'''
>>> toker = RegexpTokenizer(r'((?<=[^\w\s])\w(?=[^\w\s])|(\W))+', gaps=True)
>>> toker.tokenize(s)
['He', 'said', 'that', 's', 'it', 'Hello', 'World'] # omits *u* per your third requirement
This should meet all three of the criteria you specified above. Note, however, that this tokenizer will not return tokens such as "A" when they are wrapped in punctuation. Furthermore, the pattern only treats single letters as gaps when they both begin and end with punctuation; otherwise, "Go." would not return a token. You may need to refine the regex in other ways, depending on what your data looks like and what your expectations are.
Related
I currently have a list of strings, let's say like this:
strings = ['Hello my name is John.', 'What is your name?', 'My name is Peter.']
and I want to remove the punctuation from each of those strings, and also replace each string with a list of its tokens. The code that I wrote to do that is:
# Original list:
# strings = ['Hello my name is John.', 'What is your name?', 'My name is Peter.']
PUNC = ['.', ',', '?', '!', ':', ';', '(', ')']
for i in range(len(strings)):
    for token in PUNC:
        if token in strings[i]:
            strings[i] = strings[i].replace(token, '').split()

# New desired list:
# strings = [['Hello', 'my', 'name', 'is', 'John'],
#            ['What', 'is', 'your', 'name'],
#            ['My', 'name', 'is', 'Peter']]
The code works fine when I run it on individual string elements, but gives me the following error when I run the code I wrote above:
AttributeError: 'list' object has no attribute 'replace'
I've set up breakpoints with the Python debugger and stepped through the code, and I noticed that the data is fine before I run the above code, but after I run it only the first two elements are converted into their tokenized versions and the code throws the error afterwards. This error shouldn't even be occurring, since the original list only contains string elements.
Does anybody know why this might be the case? Thank you.
The problem is that you call split after each replace, turning strings[i] into a list. Just do it once after all replacements.
Also, you don't need to check whether a character is in the string before replacing it. Furthermore, using enumerate allows you to avoid using indices all the time.
Here is an improved version of your code:
strings = ['Hello my name is John.', 'What is your name?', 'My name is Peter.']
# Original list:
# strings = ['Hello my name is John.', 'What is your name?', 'My name is Peter.']
PUNC = ['.', ',', '?', '!', ':', ';', '(', ')']
for i, s in enumerate(strings):
    for token in PUNC:
        s = s.replace(token, '')
    strings[i] = s.split()

print(strings)
# [['Hello', 'my', 'name', 'is', 'John'], ['What', 'is', 'your', 'name'], ['My', 'name', 'is', 'Peter']]
You should remove the .split(). It turns the string into a list:
PUNC = ['.', ',', '?', '!', ':', ';', '(', ')']
for i in range(len(strings)):
    for token in PUNC:
        if token in strings[i]:
            strings[i] = strings[i].replace(token, '')
You also don't need the if statement:
PUNC = ['.', ',', '?', '!', ':', ';', '(', ')']
for i in range(len(strings)):
    for token in PUNC:
        strings[i] = strings[i].replace(token, '')
If you want to split all strings, do it at the end:
PUNC = ['.', ',', '?', '!', ':', ';', '(', ')']
for i in range(len(strings)):
    for token in PUNC:
        strings[i] = strings[i].replace(token, '')
    strings[i] = strings[i].split()
I am able to get your desired list with the below code:
strings = ['Hello my name is John.', 'What is your name?', 'My name is Peter.']
PUNC = ['.', ',', '?', '!', ':', ';', '(', ')']
new_list =[]
for i in range(len(strings)):
    for token in PUNC:
        if token in strings[i]:
            strings[i] = strings[i].replace(token, '').split()
    new_list.append(strings[i])
print(new_list)
I'm trying to split strings every time I encounter a punctuation mark or a number, such as:
toSplit = 'I2eat!Apples22becauseilike?Them'
result = re.sub('[0123456789,.?:;~!@#$%^&*()]', ' \1', toSplit).split()
The desired output would be:
['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']
However, the code above (although it properly splits where it's supposed to) removes all the numbers and punctuation marks.
Any clarification would be greatly appreciated.
Use re.split with a capture group:
import re

toSplit = 'I2eat!Apples22becauseilike?Them'
result = re.split('([0-9,.?:;~!@#$%^&*()])', toSplit)
result
Output:
['I', '2', 'eat', '!', 'Apples', '2', '', '2', 'becauseilike', '?', 'Them']
If you want to keep runs of repeated numbers or punctuation together as one token, add +:
result = re.split('([0-9,.?:;~!@#$%^&*()]+)', toSplit)
result
Output:
['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']
You may tokenize a string like yours into digits, letters, and other characters that are not whitespace, letters, or digits using
re.findall(r'\d+|(?:[^\w\s]|_)+|[^\W\d_]+', toSplit)
Here,
\d+ - 1+ digits
(?:[^\w\s]|_)+ - 1+ chars other than word and whitespace chars or _
[^\W\d_]+ - any 1+ Unicode letters.
See the regex demo.
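Applied to the string from the question, this gives the desired output:
>>> import re
>>> toSplit = 'I2eat!Apples22becauseilike?Them'
>>> re.findall(r'\d+|(?:[^\w\s]|_)+|[^\W\d_]+', toSplit)
['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']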
A matching approach is more flexible than splitting, as it also allows tokenizing more complex structures. Say you also want to tokenize decimal (float, double, ...) numbers: you just need to use \d+(?:\.\d+)? instead of \d+:
re.findall(r'\d+(?:\.\d+)?|(?:[^\w\s]|_)+|[^\W\d_]+', toSplit)
^^^^^^^^^^^^^
See this regex demo.
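For instance, on a made-up string containing a decimal number:
>>> re.findall(r'\d+(?:\.\d+)?|(?:[^\w\s]|_)+|[^\W\d_]+', 'I eat 2.5 apples!')
['I', 'eat', '2.5', 'apples', '!']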
Use re.split to split whenever a run of alphabetic characters is found:
>>> import re
>>> re.split(r'([A-Za-z]+)', toSplit)
['', 'I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them', '']
>>>
>>> ' '.join(re.split(r'([A-Za-z]+)', toSplit)).split()
['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']
I have had some trouble with this problem, and I need your help.
I have to make a Python function (mySplit(x)) which takes an input list (which has only one string as its element) and splits that element on the elements of another list and on digits.
I use Python 3.6
So here is an example:
l=['I am learning']
l1=['____-----This4ex5ample---aint___ea5sy;782']
banned=['-', '+' , ',', '#', '.', '!', '?', ':', '_', ' ', ';']
The returned lists should be like this:
mySplit(l)=['I', 'am', 'learning']
mySplit(l1)=['This', 'ex', 'ample', 'aint', 'ea', 'sy']
I have tried the following, but I always get stuck:
def mySplit(x):
    l = ['-', '+', ',', '#', '.', '!', '?', ':', '_', ';']  # Banned chars
    l2 = [i for i in x if i not in l]                       # Removing chars from input list
    l2 = ",".join(l2)
    l3 = [i for i in l2 if not i.isdigit()]                 # Removes all the digits
    l4 = [i for i in l3 if i is not ',']
    l5 = [",".join(l4)]
    l6 = l5[0].split(' ')
    return l6
and
mySplit(l1)
mySplit(l)
returns:
['T,h,i,s,e,x,a,m,p,l,e,a,i,n,t,e,a,s,y']
['I,', ',a,m,', ',l,e,a,r,n,i,n,g']
Use re.split() for this task:
import re
w_list = [i for i in re.split(r'[^a-zA-Z]',
                              '____-----This4ex5ample---aint___ea5sy;782') if i]
Out[12]: ['This', 'ex', 'ample', 'aint', 'ea', 'sy']
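Wrapped into the mySplit function the question asks for (a minimal sketch of the same idea; note it splits on any non-letter, not just the characters in banned):
import re

def mySplit(x):
    # x is a one-element list; split its string on anything that is not a letter
    # and drop the empty pieces produced by consecutive separators
    return [tok for tok in re.split(r'[^a-zA-Z]', x[0]) if tok]

print(mySplit(['I am learning']))
# ['I', 'am', 'learning']
print(mySplit(['____-----This4ex5ample---aint___ea5sy;782']))
# ['This', 'ex', 'ample', 'aint', 'ea', 'sy']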
I would import the punctuation marks from string and proceed with regular expressions as follows.
l=['I am learning']
l1=['____-----This4ex5ample---aint___ea5sy;782']
import re
from string import punctuation
punctuation # to see the punctuation marks.
>>> '!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~'
' '.join([re.sub('[!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~\d]',' ', w) for w in l]).split()
Here is the output:
>>> ['I', 'am', 'learning']
Notice the \d attached at the end of the punctuation marks to remove any digits.
Similarly,
' '.join([re.sub('[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\d]', ' ', w) for w in l1]).split()
Yields
>>> ['This', 'ex', 'ample', 'aint', 'ea', 'sy']
You can also modify your function as follows:
def mySplit(x):
    banned = ['-', '+', ',', '#', '.', '!', '?', ':', '_', ';'] + list('0123456789')  # banned chars and digits
    return ''.join([ch if ch not in banned else ' ' for ch in x[0]]).split()
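Calling it on the two inputs from the question gives the expected results:
>>> mySplit(['I am learning'])
['I', 'am', 'learning']
>>> mySplit(['____-----This4ex5ample---aint___ea5sy;782'])
['This', 'ex', 'ample', 'aint', 'ea', 'sy']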
I extracted threegrams from a bunch of HTML files following a certain pattern. When I print them, I get a list of lists (where each line is a threegram). I would like to print them to an outfile for further text analysis, but when I try, it only prints the first threegram. How can I print all the threegrams to the outfile (the list of lists of threegrams)? I would ideally like to merge all the threegrams into one list instead of having multiple lists with one threegram each. Your help would be highly appreciated.
My code looks like this so far:
from nltk import sent_tokenize, word_tokenize
from nltk import ngrams
from bs4 import BeautifulSoup
from string import punctuation
import glob
import sys
punctuation_set = set(punctuation)
# Open and read file
text = glob.glob('C:/Users/dell/Desktop/python-for-text-analysis-master/Notebooks/TEXTS/*')
for filename in text:
    with open(filename, encoding='ISO-8859-1', errors="ignore") as f:
        mytext = f.read()

    # Extract text from HTML using BeautifulSoup
    soup = BeautifulSoup(mytext, "lxml")
    extracted_text = soup.getText()
    extracted_text = extracted_text.replace('\n', '')

    # Split the text in sentences (using the NLTK sentence splitter)
    sentences = sent_tokenize(extracted_text)

    # Create list of tokens (after pre-processing: punctuation removal, tokenization)
    all_tokens = []
    for sent in sentences:
        sent = "".join([char for char in sent if char not in punctuation_set])  # remove punctuation from sentence (optional; comment out if necessary)
        tokenized_sent = word_tokenize(sent)  # split sentence into tokens (using NLTK word tokenization)
        all_tokens.extend(tokenized_sent)     # add tokens to list

    n = 3
    threegrams = ngrams(all_tokens, n)

    # Find ngrams with specific pattern
    for (first, second, third) in threegrams:
        if first == "a":
            if second.endswith("bb") and second.startswith("leg"):
                print(first, second, third)
Firstly, the punctuation removal could have been simpler, see Removing a list of characters in string
>>> from string import punctuation
>>> text = "The lazy bird's flew, over the rainbow. We'll not have known."
>>> text.translate(None, punctuation)
'The lazy birds flew over the rainbow Well not have known'
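(Note: the two-argument translate above is Python 2 only. A Python 3 sketch of the same idea uses str.maketrans:)
>>> text.translate(str.maketrans('', '', punctuation))
'The lazy birds flew over the rainbow Well not have known'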
But it's not really correct to remove punctuation before you do tokenization; you see that We'll -> Well, which I think is not desired.
Possibly this is a better approach:
>>> from nltk import sent_tokenize, word_tokenize
>>> [[word for word in word_tokenize(sent) if word not in punctuation] for sent in sent_tokenize(text)]
[['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow'], ['We', "'ll", 'not', 'have', 'known']]
But do note that the idiom above doesn't handle multi-character punctuation.
E.g., we see that word_tokenize() changes " -> ``, and the idiom above doesn't remove it:
>>> sent = 'He said, "There is no room for room"'
>>> word_tokenize(sent)
['He', 'said', ',', '``', 'There', 'is', 'no', 'room', 'for', 'room', "''"]
>>> [word for word in word_tokenize(sent) if word not in punctuation]
['He', 'said', '``', 'There', 'is', 'no', 'room', 'for', 'room', "''"]
To handle that, explicitly make punctuation into a list and append the multi-character punctuation tokens to it:
>>> sent = 'He said, "There is no room for room"'
>>> punctuation
'!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~'
>>> list(punctuation)
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '#', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
>>> list(punctuation) + ['...', '``', "''"]
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '#', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', '...', '``', "''"]
>>> p = list(punctuation) + ['...', '``', "''"]
>>> [word for word in word_tokenize(sent) if word not in p]
['He', 'said', 'There', 'is', 'no', 'room', 'for', 'room']
As for getting the document stream (as you called it all_tokens), here's a neat way to get it:
>>> from collections import Counter
>>> from nltk import sent_tokenize, word_tokenize
>>> from string import punctuation
>>> p = list(punctuation) + ['...', '``', "''"]
>>> text = "The lazy bird's flew, over the rainbow. We'll not have known."
>>> [[word for word in word_tokenize(sent) if word not in p] for sent in sent_tokenize(text)]
[['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow'], ['We', "'ll", 'not', 'have', 'known']]
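That gives one list of tokens per sentence; to flatten it into a single token stream like your all_tokens, one option (a small sketch, not from the answer above) is itertools.chain:
>>> from itertools import chain
>>> tokenized = [[word for word in word_tokenize(sent) if word not in p] for sent in sent_tokenize(text)]
>>> list(chain.from_iterable(tokenized))
['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow', 'We', "'ll", 'not', 'have', 'known']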
And now to the part of your actual question.
What you really need isn't checking the strings in the ngrams; rather, you should consider regex pattern matching.
You want to find the pattern \ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b, see https://regex101.com/r/zBVgp4/4
>>> import re
>>> re.findall(r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b", "This is a legobatmanbb cave hahaha")
['a legobatmanbb cave']
>>> re.findall(r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b", "This isa legobatmanbb cave hahaha")
[]
Now to write a string to a file, you can use this idiom, see https://docs.python.org/3/whatsnew/3.0.html#print-is-a-function:
with open('filename.txt', 'w') as fout:
    print('Hello World', end='\n', file=fout)
In fact, if you are only interested in the ngrams without the tokens, there's no need to filter or tokenize the text ;P
You can simplify your code to this:
soup = BeautifulSoup(mytext, "lxml")
extracted_text = soup.getText()
pattern = r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b"
with open('filename.txt', 'w') as fout:
    for interesting_ngram in re.findall(pattern, extracted_text):
        print(interesting_ngram, end='\n', file=fout)
I was designing a regex to split all the actual words from a given text:
Input Example:
"John's mom went there, but he wasn't there. So she said: 'Where are you'"
Expected Output:
["John's", "mom", "went", "there", "but", "he", "wasn't", "there", "So", "she", "said", "Where", "are", "you"]
I thought of a regex like that:
"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)"
After splitting in Python, the result contains None items and empty spaces.
How to get rid of the None items? And why didn't the spaces match?
Edit:
Splitting on spaces will give items like: ["there."]
And splitting on non-letters will give items like: ["John", "s"]
And splitting on non-letters except ' will give items like: ["'Where", "you'"]
Instead of a regex, you can use string functions:
to_be_removed = ".,:!" # all characters to be removed
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
for c in to_be_removed:
s = s.replace(c, '')
s.split()
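Running that gives the list below; note the leftover apostrophes on 'Where and you', which is exactly the problem described next:
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', "'Where", 'are', "you'"]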
BUT, in your example you do not want to remove the apostrophe in John's, yet you do want to remove it in you!!'. So string operations fail at that point, and you need a finely adjusted regex.
EDIT: probably a simple regex can solve your problem:
(\w[\w']*)
It will capture all chars that starts with a letter and keep capturing while next char is an apostrophe or letter.
(\w[\w']*\w)
This second regex is for a very specific situation. The first regex can capture words like you'. This one will avoid that and only capture an apostrophe if it is within the word (not at the beginning or the end). But then a new situation arises: you cannot capture the apostrophe in Moss' mom with the second regex. You must decide whether you want to capture the trailing apostrophe in possessive names ending with s.
Example:
rgx = re.compile("([\w][\w']*\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
rgx.findall(s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you']
UPDATE 2: I found a bug in my regex! It cannot capture single letters followed by an apostrophe, like A'. The fixed, brand new regex is here:
(\w[\w']*\w|\w)
rgx = re.compile("(\w[\w']*\w|\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
rgx.findall(s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', 'a']
You have too many capturing groups in your regular expression; make them non-capturing:
(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)
Demo:
>>> import re
>>> s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
>>> re.split("(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)", s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', '']
That returns only one empty element (at the very end of the list).
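If that trailing empty string is a problem, it can be filtered out afterwards, for example:
>>> [t for t in re.split("(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)", s) if t]
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you']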
This regex will only allow one ending apostrophe, which may be followed by one more character:
([\w][\w]*'?\w?)
Demo:
>>> import re
>>> s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
>>> re.compile("([\w][\w]*'?\w?)").findall(s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', "a'"]
I am new to Python, but I think I have figured it out:
import re
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
result = re.findall(r"(.+?)[\s'\",!]{1,}", s)
print(result)
['John', 's', 'mom', 'went', 'there', 'but', 'he', 'wasn', 't', 'there.', 'So', 'she', 'said:', 'Where', 'are', 'you']