Split string on punctuation or number in Python

I'm trying to split strings every time I encounter a punctuation mark or a number, such as:
toSplit = 'I2eat!Apples22becauseilike?Them'
result = re.sub('[0123456789,.?:;~!@#$%^&*()]', ' \1', toSplit).split()
The desired output would be:
['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']
However, the code above (although it properly splits where it's supposed to) removes all the numbers and punctuation marks.
Any clarification would be greatly appreciated.

Use re.split with a capture group:
toSplit = 'I2eat!Apples22becauseilike?Them'
result = re.split('([0-9,.?:;~!@#$%^&*()])', toSplit)
result
Output:
['I', '2', 'eat', '!', 'Apples', '2', '', '2', 'becauseilike', '?', 'Them']
If you want repeated numbers or punctuation marks to stay together as single tokens, add +:
result = re.split('([0-9,.?:;~!@#$%^&*()]+)', toSplit)
result
Output:
['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']
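Putting it together as a runnable snippet (the list comprehension for dropping empty strings is my addition, for the case where the string starts or ends with a delimiter):
import re

toSplit = 'I2eat!Apples22becauseilike?Them'
# capture runs of digits/punctuation so they come back as tokens, then
# drop the empty strings re.split yields when delimiters touch the edges
result = [tok for tok in re.split(r'([0-9,.?:;~!@#$%^&*()]+)', toSplit) if tok]
print(result)
# ['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']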

You may tokenize strings like yours into digits, letters, and other chars that are neither whitespace, letters, nor digits, using
re.findall(r'\d+|(?:[^\w\s]|_)+|[^\W\d_]+', toSplit)
Here,
\d+ - 1+ digits
(?:[^\w\s]|_)+ - 1+ chars other than word and whitespace chars or _
[^\W\d_]+ - any 1+ Unicode letters.
The matching approach is more flexible than splitting, as it also allows tokenizing more complex structures. Say you also want to tokenize decimal (float, double...) numbers; you will just need to use \d+(?:\.\d+)? instead of \d+:
re.findall(r'\d+(?:\.\d+)?|(?:[^\w\s]|_)+|[^\W\d_]+', toSplit)
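For instance, with the sample string from the question, the first (integer-only) pattern should produce the following; the transcript is my own sketch, not from the original answer:
import re

toSplit = 'I2eat!Apples22becauseilike?Them'
# digits, then runs of non-word/non-space chars, then runs of letters
print(re.findall(r'\d+|(?:[^\w\s]|_)+|[^\W\d_]+', toSplit))
# ['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']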

Use re.split to split wherever a run of alphabetic characters is found:
>>> import re
>>> re.split(r'([A-Za-z]+)', toSplit)
['', 'I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them', '']
>>>
>>> ' '.join(re.split(r'([A-Za-z]+)', toSplit)).split()
['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']
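The empty strings at both ends can also be dropped with a list comprehension instead of the join/split round-trip; a quick sketch (note the join/split version additionally removes whitespace-only pieces, which doesn't matter for this input):
import re

toSplit = 'I2eat!Apples22becauseilike?Them'
# keep only the non-empty pieces of the split
print([tok for tok in re.split(r'([A-Za-z]+)', toSplit) if tok])
# ['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']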


Remove all the special chars from a list [duplicate]

I have a list of strings, some of which are special characters. What would be the approach to exclude them from the resulting list?
list = ['ben','kenny',',','=','Sean',100,'tag242']
expected output = ['ben','kenny','Sean',100,'tag242']
Please guide me with the approach to achieve this. Thanks.
The string module has a list of punctuation marks that you can use and exclude from your list of words:
import string
punctuations = list(string.punctuation)
input_list = ['ben','kenny',',','=','Sean',100,'tag242']
output = [x for x in input_list if x not in punctuations]
print(output)
Output:
['ben', 'kenny', 'Sean', 100, 'tag242']
This list of punctuation marks includes the following characters:
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '#', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
It can simply be done using the isalnum() string method. isalnum() returns True if the string contains only digits or letters; if the string contains any special character other than that, it returns False. (No module needs to be imported for isalnum(); it is a built-in string method.)
code:
list = ['ben','kenny',',','=','Sean',100,'tag242']
olist = []
for a in list:
    if str(a).isalnum():
        olist.append(a)
print(olist)
output:
['ben', 'kenny', 'Sean', 100, 'tag242']
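The same filter can also be written as a one-line list comprehension; a quick sketch:
input_list = ['ben', 'kenny', ',', '=', 'Sean', 100, 'tag242']
# str(a) is what lets a non-string item such as 100 pass through isalnum()
print([a for a in input_list if str(a).isalnum()])
# ['ben', 'kenny', 'Sean', 100, 'tag242']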
my_list = ['ben', 'kenny', ',' ,'=' ,'Sean', 100, 'tag242']
stop_words = [',', '=']
filtered_output = [i for i in my_list if i not in stop_words]
The list with stop words can be expanded if you need to remove other characters.

Python, Split the input string on elements of other list and remove digits from it

I have had some trouble with this problem, and I need your help.
I have to write a Python method (mySplit(x)) which takes an input list (which has only one string as its element) and splits that element on the elements of another list and on digits.
I use Python 3.6.
So here is an example:
l=['I am learning']
l1=['____-----This4ex5ample---aint___ea5sy;782']
banned=['-', '+' , ',', '#', '.', '!', '?', ':', '_', ' ', ';']
The returned lists should be like this:
mySplit(l)=['I', 'am', 'learning']
mySplit(l1)=['This', 'ex', 'ample', 'aint', 'ea', 'sy']
I have tried the following, but I always get stuck:
def mySplit(x):
    l=['-', '+' , ',', '#', '.', '!', '?', ':', '_', ';'] # Banned chars
    l2=[i for i in x if i not in l] # Removing chars from input list
    l2=",".join(l2)
    l3=[i for i in l2 if not i.isdigit()] # Removes all the digits
    l4=[i for i in l3 if i is not ',']
    l5=[",".join(l4)]
    l6=l5[0].split(' ')
    return l6
and
mySplit(l1)
mySplit(l)
returns:
['T,h,i,s,e,x,a,m,p,l,e,a,i,n,t,e,a,s,y']
['I,', ',a,m,', ',l,e,a,r,n,i,n,g']
Use re.split() for this task:
import re
w_list = [i for i in re.split(r'[^a-zA-Z]',
                              '____-----This4ex5ample---aint___ea5sy;782') if i]
Out[12]: ['This', 'ex', 'ample', 'aint', 'ea', 'sy']
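Wrapped into the mySplit interface the question asks for; the body is my sketch of the same idea, not the original answer's code:
import re

def mySplit(x):
    # x is a one-element list; split its string on every run of non-letters
    return [tok for tok in re.split(r'[^a-zA-Z]+', x[0]) if tok]

print(mySplit(['I am learning']))
# ['I', 'am', 'learning']
print(mySplit(['____-----This4ex5ample---aint___ea5sy;782']))
# ['This', 'ex', 'ample', 'aint', 'ea', 'sy']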
I would import the punctuation marks from string and proceed with regular expressions as follows.
l=['I am learning']
l1=['____-----This4ex5ample---aint___ea5sy;782']
import re
from string import punctuation
punctuation # to see the punctuation marks.
>>> '!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~'
' '.join([re.sub('[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\d]',' ', w) for w in l]).split()
Here is the output:
>>> ['I', 'am', 'learning']
Notice the \d attached at the end of the punctuation marks to remove any digits.
Similarly,
' '.join([re.sub('[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\d]',' ', w) for w in l1]).split()
Yields
>>> ['This', 'ex', 'ample', 'aint', 'ea', 'sy']
You can also modify your function as follows:
def mySplit(x):
    banned = ['-', '+', ',', '#', '.', '!', '?', ':', '_', ';'] + list('0123456789') # Banned chars
    return ''.join([ch if ch not in banned else ' ' for ch in x[0]]).split()
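With the sample inputs from the question, this version should give the following (my expected output, not from the original post):
>>> mySplit(['I am learning'])
['I', 'am', 'learning']
>>> mySplit(['____-----This4ex5ample---aint___ea5sy;782'])
['This', 'ex', 'ample', 'aint', 'ea', 'sy']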

Python regular expression split string into numbers and text/symbols

I would like to split a string into sections of numbers and sections of text/symbols.
My current code doesn't handle negative numbers or decimals, and it behaves weirdly, adding an empty list element at the end of the output:
import re
mystring = 'AD%5(6ag 0.33--9.5'
newlist = re.split('([0-9]+)', mystring)
print (newlist)
current output:
['AD%', '5', '(', '6', 'ag ', '0', '.', '33', '--', '9', '.', '5', '']
desired output:
['AD%', '5', '(', '6', 'ag ', '0.33', '-', '-9.5']
Your issue stems from the fact that the digits are used as the delimiter, and the capture group adds them back to the resulting list; re.split keeps the parts before and after each delimiter. So if there are digits at the end of the string, the split produces an empty string as the final element.
You may split with a regex that matches float or integer numbers with an optional minus sign, and then remove the empty values:
result = re.split(r'(-?\d*\.?\d+)', mystring)
result = list(filter(None, result))
To match negative/positive numbers with exponents, use
r'([+-]?\d*\.?\d+(?:[eE][-+]?\d+)?)'
The -?\d*\.?\d+ regex matches:
-? - an optional minus
\d* - 0+ digits
\.? - an optional literal dot
\d+ - one or more digits.
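As a runnable sketch of the above (the list comprehension for dropping empties is my own substitution for filter):
import re

mystring = 'AD%5(6ag 0.33--9.5'
# split on optionally-signed integers/floats, keep them via the capture
# group, and drop the empty string left when the input ends with a number
result = [x for x in re.split(r'(-?\d*\.?\d+)', mystring) if x]
print(result)
# ['AD%', '5', '(', '6', 'ag ', '0.33', '-', '-9.5']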
Unfortunately, re.split() does not offer an "ignore empty strings" option. However, to retrieve your numbers, you could easily use re.findall() with a different pattern:
import re
string = "AD%5(6ag0.33-9.5"
rx = re.compile(r'-?\d+(?:\.\d+)?')
numbers = rx.findall(string)
print(numbers)
# ['5', '6', '0.33', '-9.5']
As mentioned here before, there is no option to ignore the empty strings in re.split() but you can easily construct a new list the following way:
import re
mystring = "AD%5(6ag0.33--9.5"
newlist = [x for x in re.split(r'(-?\d+\.?\d*)', mystring) if x != '']
print(newlist)
output:
['AD%', '5', '(', '6', 'ag', '0.33', '-', '-9.5']

Write filtered ngrams into outfile - list of lists

I extracted threegrams from a bunch of HTML files following a certain pattern. When I print them, I get a list of lists (where each line is a threegram). I would like to write them to an outfile for further text analysis, but when I try, only the first threegram is written. How can I write all the threegrams to the outfile (the list of lists of threegrams)? I would ideally like to merge all the threegrams into one list instead of having multiple lists with one threegram each. Your help would be highly appreciated.
My code looks like this so far:
from nltk import sent_tokenize, word_tokenize
from nltk import ngrams
from bs4 import BeautifulSoup
from string import punctuation
import glob
import sys

punctuation_set = set(punctuation)

# Open and read files
text = glob.glob('C:/Users/dell/Desktop/python-for-text-analysis-master/Notebooks/TEXTS/*')
for filename in text:
    with open(filename, encoding='ISO-8859-1', errors="ignore") as f:
        mytext = f.read()

    # Extract text from HTML using BeautifulSoup
    soup = BeautifulSoup(mytext, "lxml")
    extracted_text = soup.getText()
    extracted_text = extracted_text.replace('\n', '')

    # Split the text in sentences (using the NLTK sentence splitter)
    sentences = sent_tokenize(extracted_text)

    # Create list of tokens (after pre-processing: punctuation removal and tokenization)
    all_tokens = []
    for sent in sentences:
        sent = "".join([char for char in sent if char not in punctuation_set]) # remove punctuation from sentence (optional; comment out if necessary)
        tokenized_sent = word_tokenize(sent) # split sentence into tokens (using NLTK word tokenization)
        all_tokens.extend(tokenized_sent) # add tokens to list

    n = 3
    threegrams = ngrams(all_tokens, n)

    # Find ngrams with specific pattern
    for (first, second, third) in threegrams:
        if first == "a":
            if second.endswith("bb") and second.startswith("leg"):
                print(first, second, third)
Firstly, the punctuation removal could have been simpler, see Removing a list of characters in string
>>> from string import punctuation
>>> text = "The lazy bird's flew, over the rainbow. We'll not have known."
>>> text.translate(str.maketrans('', '', punctuation))  # in Python 2: text.translate(None, punctuation)
'The lazy birds flew over the rainbow Well not have known'
But it's not really correct to remove punctuation before tokenization; you can see that We'll -> Well, which I think is not desired.
Possibly this is a better approach:
>>> from nltk import sent_tokenize, word_tokenize
>>> [[word for word in word_tokenize(sent) if word not in punctuation] for sent in sent_tokenize(text)]
[['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow'], ['We', "'ll", 'not', 'have', 'known']]
But do note that the idiom above doesn't handle multi-character punctuation.
E.g., we see that word_tokenize() changes " to `` or '', and the idiom above doesn't remove those:
>>> sent = 'He said, "There is no room for room"'
>>> word_tokenize(sent)
['He', 'said', ',', '``', 'There', 'is', 'no', 'room', 'for', 'room', "''"]
>>> [word for word in word_tokenize(sent) if word not in punctuation]
['He', 'said', '``', 'There', 'is', 'no', 'room', 'for', 'room', "''"]
To handle that, explicitly make punctuation into a list and append the multi-character punctuations to it:
>>> sent = 'He said, "There is no room for room"'
>>> punctuation
'!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~'
>>> list(punctuation)
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '#', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
>>> list(punctuation) + ['...', '``', "''"]
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '#', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', '...', '``', "''"]
>>> p = list(punctuation) + ['...', '``', "''"]
>>> [word for word in word_tokenize(sent) if word not in p]
['He', 'said', 'There', 'is', 'no', 'room', 'for', 'room']
As for getting the document stream (as you called it all_tokens), here's a neat way to get it:
>>> from collections import Counter
>>> from nltk import sent_tokenize, word_tokenize
>>> from string import punctuation
>>> p = list(punctuation) + ['...', '``', "''"]
>>> text = "The lazy bird's flew, over the rainbow. We'll not have known."
>>> [[word for word in word_tokenize(sent) if word not in p] for sent in sent_tokenize(text)]
[['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow'], ['We', "'ll", 'not', 'have', 'known']]
And now to the part of your actual question.
What you really need isn't checking the strings in the ngrams; rather, you should consider regex pattern matching.
You want to find the pattern \ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b, see https://regex101.com/r/zBVgp4/4
>>> import re
>>> re.findall(r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b", "This is a legobatmanbb cave hahaha")
['a legobatmanbb cave']
>>> re.findall(r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b", "This isa legobatmanbb cave hahaha")
[]
Now to write a string to a file, you can use this idiom, see https://docs.python.org/3/whatsnew/3.0.html#print-is-a-function:
with open('filename.txt', 'w') as fout:
    print('Hello World', end='\n', file=fout)
In fact, if you are only interested in the ngrams without the tokens, there's no need to filter or tokenize the text ;P
You can simplify your code to this:
import re

soup = BeautifulSoup(mytext, "lxml")
extracted_text = soup.getText()
pattern = r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b"
with open('filename.txt', 'w') as fout:
    for interesting_ngram in re.findall(pattern, extracted_text):
        print(interesting_ngram, end='\n', file=fout)

How to remove punctuation?

I am using the tokenizer from NLTK in Python.
There are whole bunch of answers for removing punctuations on the forum already. However, none of them address all of the following issues together:
More than one symbol in a row. For example, in the sentence He said,"that's it.", because there's a comma followed by a quotation mark, the tokenizer won't remove ." from the sentence. The tokenizer will give ['He', 'said', ',"', 'that', 's', 'it.'] instead of ['He', 'said', 'that', 's', 'it']. Some other examples include '...', '--', '!?', ',"', and so on.
Remove symbols at the end of a sentence, i.e. the sentence Hello World. The tokenizer will give ['Hello', 'World.'] instead of ['Hello', 'World']. Notice the period at the end of the word 'World'. Some other examples include '--' and ',' at the beginning, middle, or end of any word.
Remove characters with symbols in front and after, i.e. '*u*', '''', '""'
Is there an elegant way of solving all of these problems?
Solution 1: Tokenize and strip punctuation off the tokens
>>> from nltk import word_tokenize
>>> import string
>>> punctuations = list(string.punctuation)
>>> punctuations
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '#', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
>>> punctuations.append("''")
>>> sent = '''He said,"that's it."'''
>>> word_tokenize(sent)
['He', 'said', ',', "''", 'that', "'s", 'it', '.', "''"]
>>> [i for i in word_tokenize(sent) if i not in punctuations]
['He', 'said', 'that', "'s", 'it']
>>> [i.strip("".join(punctuations)) for i in word_tokenize(sent) if i not in punctuations]
['He', 'said', 'that', 's', 'it']
Solution 2: remove punctuation then tokenize
>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~'
>>> sent = '''He said,"that's it."'''
>>> " ".join("".join([" " if ch in string.punctuation else ch for ch in sent]).split())
'He said that s it'
>>> " ".join("".join([" " if ch in string.punctuation else ch for ch in sent]).split()).split()
['He', 'said', 'that', 's', 'it']
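If you prefer, the same substitution can be done with str.translate; this variant is my own sketch, not part of the original answer:
import string

sent = '''He said,"that's it."'''
# map every punctuation character to a space, then split on whitespace
table = str.maketrans({ch: ' ' for ch in string.punctuation})
print(sent.translate(table).split())
# ['He', 'said', 'that', 's', 'it']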
If you want to tokenize your string all in one shot, I think your only choice will be to use nltk.tokenize.RegexpTokenizer. The following approach will allow you to use punctuation as a marker to remove characters of the alphabet (as noted in your third requirement) before removing the punctuation altogether. In other words, this approach will remove *u* before stripping all punctuation.
One way to go about this, then, is to tokenize on gaps like so:
>>> from nltk.tokenize import RegexpTokenizer
>>> s = '''He said,"that's it." *u* Hello, World.'''
>>> toker = RegexpTokenizer(r'((?<=[^\w\s])\w(?=[^\w\s])|(\W))+', gaps=True)
>>> toker.tokenize(s)
['He', 'said', 'that', 's', 'it', 'Hello', 'World'] # omits *u* per your third requirement
This should meet all three of the criteria you specified above. Note, however, that this tokenizer will not return tokens such as "A". Furthermore, I only tokenize on single letters that begin and end with punctuation. Otherwise, "Go." would not return a token. You may need to nuance the regex in other ways, depending on what your data looks like and what your expectations are.
