Correctly strip the ':' char with regex - Python

I want to get the words in a text string in Python:
s = "The saddest aspect of life right now is: science gathers knowledge faster than society gathers wisdom."
result = re.sub("\b[^\w\d_]+\b", " ", s ).split()
print result
I am getting:
['The', 'saddest', 'aspect', 'of', 'life', 'right', 'now', 'is:', 'science', 'gathers', 'knowledge', 'faster', 'than', 'society', 'gathers', 'wisdom.']
How can I get "is" and not "is:" on strings that happen to contain : ?
I thought using \b would be enough...

I think you intended to pass a raw string to re.sub (notice the r).
result = re.sub(r"\b[^\w\d_]+\b", " ", s ).split()
Returns:
['The', 'saddest', 'aspect', 'of', 'life', 'right', 'now', 'is', 'science', 'gathers', 'knowledge', 'faster', 'than', 'society', 'gathers', 'wisdom.']

You forgot to make it a raw string literal (r"..")
>>> import re
>>> s = "The saddest aspect of life right now is: science gathers knowledge faster than society gathers wisdom."
>>> re.sub("\b[^\w\d_]+\b", " ", s ).split()
['The', 'saddest', 'aspect', 'of', 'life', 'right', 'now', 'is:', 'science', 'gathers', 'knowledge', 'faster', 'than', 'society', 'gathers', 'wisdom.']
>>> re.sub(r"\b[^\w\d_]+\b", " ", s ).split()
['The', 'saddest', 'aspect', 'of', 'life', 'right', 'now', 'is', 'science', 'gathers', 'knowledge', 'faster', 'than', 'society', 'gathers', 'wisdom.']
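The root cause: in a non-raw string literal, "\b" is a single backspace character (\x08), so the pattern the regex engine receives contains no word boundary at all. A quick check of standard Python behaviour:
>>> "\b"
'\x08'
>>> r"\b"
'\\b'
>>> len("\b"), len(r"\b")
(1, 2)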

As the other answers pointed out, you need to define a raw string literal using r, like so: (r"...")
If you want to strip the periods, I believe you can simplify your regex to just:
result = re.sub(r"[^\w' ]", " ", s ).split()
As you likely know, the \w character class matches a-z, A-Z, 0-9 and the underscore, so [^\w' ] replaces anything that is not one of those characters, an apostrophe, or a space.
So as long as you don't mind digits and underscores being kept, that should do the trick.
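For reference, a quick run of this simplified pattern on the sample sentence from the question (just a sketch of the idea above):
>>> import re
>>> s = "The saddest aspect of life right now is: science gathers knowledge faster than society gathers wisdom."
>>> re.sub(r"[^\w' ]", " ", s).split()
['The', 'saddest', 'aspect', 'of', 'life', 'right', 'now', 'is', 'science', 'gathers', 'knowledge', 'faster', 'than', 'society', 'gathers', 'wisdom']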

Related

How to strip multiple unwanted characters from a list of strings in python?

I have the following input string:
text='''Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!'''
So far, I have split the text string into a list like so:
list=['Although', 'never', 'is', 'often', 'better', 'than', '*right*', 'now.\n\nIf', 'the', 'implementation', 'is', 'hard', 'to', 'explain,', "it's", 'a', 'bad', 'idea.\n\nIf', 'the', 'implementation', 'is', 'easy', 'to', 'explain,', 'it', 'may', 'be', 'a', 'good', 'idea.\n\nNamespaces', 'are', 'one', 'honking', 'great','idea', '--', "let's", 'do', 'more', 'of', 'those!']
Now, I want to use the strip function to remove unwanted characters such as \n\n and -- from the above list.
Can you please help me with this?
Use the re module; the re.sub function will allow you to do that.
We need to replace multiple \n occurrences with a single \n and remove the -- string:
import re
code='''Although never is often better than right now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!'''
result = re.sub('\n{2,}', '\n', code)
result = re.sub(' -- ', ' ', result)
print(result)
After that, split() your text.
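For example, continuing from the snippet above (a minimal sketch of that final step):
words = result.split()
# split() with no argument splits on any run of whitespace, so the newlines are gone;
# the '--' was already removed by the second re.sub, but punctuation such as 'now.'
# and 'explain,' stays attached to its word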
This will split the string on whitespace (spaces or newlines) and drop the -- tokens:
import re
output = [i for i in re.split(r'\s+|--', code) if i]
You can use a list comprehension to get rid of --:
>>> code='''Although never is often better than right now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!'''
>>>
>>> [word for word in code.split() if word != '--']
['Although', 'never', 'is', 'often', 'better', 'than', 'right', 'now.', 'If', 'the', 'implementation', 'is', 'hard', 'to', 'explain,', "it's", 'a', 'bad', 'idea.', 'If', 'the', 'implementation', 'is', 'easy', 'to', 'explain,', 'it', 'may', 'be', 'a', 'good', 'idea.', 'Namespaces', 'are', 'one', 'honking', 'great', 'idea', "let's", 'do', 'more', 'of', 'those!']

python regular expression to split string and get all words is not working

I'm trying to split a string using a regular expression with Python and get all the matched literals.
RE: \w+(\.?\w+)*
This needs to capture [a-zA-Z0-9_]-like content only.
Here is an example, but when I try to match and get all the contents from the string, it doesn't return proper results.
Code snippet:
>>> import re
>>> from pprint import pprint
>>> pattern = r"\w+(\.?\w+)*"
>>> string = """this is some test string and there are some digits as well that need to be captured as well like 1234567890 and 321 etc. But it should also select _ as well. I'm pretty sure that that RE does exactly the same.
... Oh wait, it also need to filter out the symbols like !##$%^&*()-+=[]{}.,;:'"`| \(`.`)/
...
... I guess that's it."""
>>> pprint(re.findall(r"\w+(.?\w+)*", string))
[' etc', ' well', ' same', ' wait', ' like', ' it']
It's only returning some of the words, but it should return all of the words, numbers and underscore(s) [as in the linked example].
python version: Python 3.6.2 (default, Jul 17 2017, 16:44:45)
Thanks.
You need to use a non-capturing group (see here for why) and escape the dot (see here for which characters should be escaped in a regex):
>>> import re
>>> from pprint import pprint
>>> pattern = r"\w+(?:\.?\w+)*"
>>> string = """this is some test string and there are some digits as well that need to be captured as well like 1234567890 and 321 etc. But it should also select _ as well. I'm pretty sure that that RE does exactly the same.
... Oh wait, it also need to filter out the symbols like !##$%^&*()-+=[]{}.,;:'"`| \(`.`)/
...
... I guess that's it."""
>>> pprint(re.findall(pattern, string, re.A))
['this', 'is', 'some', 'test', 'string', 'and', 'there', 'are', 'some', 'digits', 'as', 'well', 'that', 'need', 'to', 'be', 'captured', 'as', 'well', 'like', '1234567890', 'and', '321', 'etc', 'But', 'it', 'should', 'also', 'select', '_', 'as', 'well', 'I', 'm', 'pretty', 'sure', 'that', 'that', 'RE', 'does', 'exactly', 'the', 'same', 'Oh', 'wait', 'it', 'also', 'need', 'to', 'filter', 'out', 'the', 'symbols', 'like', 'I', 'guess', 'that', 's', 'it']
Also, to only match ASCII letters, digits and _, you must pass the re.A flag.
See the Python demo.
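The short version of why the original findall misbehaved: when a pattern contains capturing groups, re.findall returns the group contents instead of the whole match. A minimal illustration with a made-up input:
>>> import re
>>> re.findall(r"\w+(\.?\w+)*", "foo.bar baz")
['.bar', '']
>>> re.findall(r"\w+(?:\.?\w+)*", "foo.bar baz")
['foo.bar', 'baz']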

Regular expression to skip some specific characters

I am trying to clean the string so that it does not contain any punctuation or numbers; it must only have a-z and A-Z.
For example, the given string is:
"coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????"
Required output is :
['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']
My solution is
re.findall(r"([A-Za-z]+)" ,string)
My output is
['coMPuter', 'scien', 'tist', 's', 'are', 'the', 'rock', 'stars', 'of', 'tomorrow', 'cool']
You don't need to use a regular expression.
Convert the string to lower case (if you want all lower-cased words), split it into words, keep only the words that start with a letter, and then filter each word down to its alphabetic characters:
>>> s = "coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????"
>>> [filter(str.isalpha, word) for word in s.lower().split() if word[0].isalpha()]
['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']
In Python 3.x, filter(str.isalpha, word) should be replaced with ''.join(filter(str.isalpha, word)), because in Python 3.x, filter returns a filter object.
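A minimal Python 3 version of the same idea, using the sample string from the question (just a sketch):
>>> s = "coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????"
>>> [''.join(filter(str.isalpha, word)) for word in s.lower().split() if word[0].isalpha()]
['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']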
With the recommendations of all the people who answered, I got the solution that I really wanted. Thanks to everyone...
s = "coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????"
cleaned = re.sub(r'(<.*>|[^a-zA-Z\s]+)', '', s).split()
print cleaned
Using re, although I'm not sure this is what you want, because you said you didn't want "cool" left over.
import re
s = "coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????"
REGEX = r'([^a-zA-Z\s]+)'
cleaned = re.sub(REGEX, '', s).split()
# ['coMPuter', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow', 'cool']
EDIT
WORD_REGEX = re.compile(r'(?!<?\S+>)(?=\w)(\S+)')
CLEAN_REGEX = re.compile(r'([^a-zA-Z])')
def cleaned(match_obj):
    return re.sub(CLEAN_REGEX, '', match_obj.group(1)).lower()
[cleaned(x) for x in re.finditer(WORD_REGEX, s)]
# ['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']
WORD_REGEX uses a negative lookahead to skip anything wrapped in <...> and a positive lookahead to require a word character. Any non-whitespace that makes it past the lookaheads is grouped:
(?!<?\S+>) # negative lookahead
(?=\w) # positive lookahead
(\S+) # group non-whitespace
cleaned takes each match group, removes any non-letter characters with CLEAN_REGEX, and lower-cases the result.

Need a solution for partial string comparison in Python

Scenario:
I have some tasks performed for a respective "Section Header" (stored as a string), and the result of each task has to be saved against the matching "Existing Section Header" (also stored as a string).
While mapping, if a task's "Section Header" matches one of the existing section headers, the task results are added to it.
If not, the new section header gets appended to the existing section header list.
The existing section headers look like this:
["Activity (Last 3 Days)", "Activity (Last 7 days)", "Executable running from disk", "Actions from File"]
For the strings below, the expected behaviour is as follows:
"Activity (Last 30 Days)" - a new section should be added.
"Executables running from disk" - the existing "Executable running from disk" should be used [treating the extra "s" in "Executables" as the same word].
"Actions from a file" - the existing "Actions from File" should be used [ignoring the extra article "a"].
Is there any built-in function available in Python that may help implement this logic? Any suggestion regarding an algorithm for this is highly appreciated.
This is a case where you may find regular expressions helpful. You can use re.sub() to find specific substrings and replace them. It searches for non-overlapping matches to a regular expression and replaces them with the specified string.
import re  # this will allow you to use regular expressions

def modifyHeader(header):
    # change the # of days to 30
    modifiedHeader = re.sub(r"Activity \(Last \d+ Days?\)", "Activity (Last 30 Days)", header)
    # add an "s" to "Executable"
    modifiedHeader = re.sub(r"Executable running from disk", "Executables running from disk", modifiedHeader)
    # add "a"
    modifiedHeader = re.sub(r"Actions from File", "Actions from a file", modifiedHeader)
    return modifiedHeader
The r"" refers to raw strings which make it a bit easier to deal with the \ characters needed for regular expressions, \d matches any digit character, and + means "1 or more". Read the page I linked above for more information.
Since you want to compare only the stem or "root word" of a given word, I suggest using a stemming algorithm. Stemming algorithms attempt to automatically remove suffixes (and in some cases prefixes) in order to find the "root word" or stem of a given word. This is useful in various natural language processing scenarios, such as search. Luckily there is a Python package for stemming. You can download it from here.
Next, you want to compare strings without stop words (a, an, the, from, etc.), so you need to filter these words out before comparing. You can get a list of stop words from the internet, or you can use the nltk package to import a stop-word list. You can get nltk from here.
If there is any issue with nltk, here is the list of stop words:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself',
'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which',
'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be',
'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an',
'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for',
'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',
'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under',
'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all',
'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not',
'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don',
'should', 'now']
Now use this simple code to get your desired output:
from stemming.porter2 import stem
from nltk.corpus import stopwords

stopwords_ = stopwords.words('english')

def addString(x):
    flag = True
    y = [stem(j).lower() for j in x.split() if j.lower() not in stopwords_]
    for i in section:
        i = [stem(j).lower() for j in i.split() if j.lower() not in stopwords_]
        if y == i:
            flag = False
            break
    if flag:
        section.append(x)
        print "\tNew Section Added"
Demo:
>>> from stemming.porter2 import stem
>>> from nltk.corpus import stopwords
>>> stopwords_ = stopwords.words('english')
>>>
>>> def addString(x):
...     flag = True
...     y = [stem(j).lower() for j in x.split() if j.lower() not in stopwords_]
...     for i in section:
...         i = [stem(j).lower() for j in i.split() if j.lower() not in stopwords_]
...         if y == i:
...             flag = False
...             break
...     if flag:
...         section.append(x)
...         print "\tNew Section Added"
...
>>> section = [ "Activity (Last 3 Days)", "Activity (Last 7 days)", "Executable running from disk", "Actions from File"] # initial Section list
>>> addString("Activity (Last 30 Days)")
New Section Added
>>> addString("Executables running from disk")
>>> addString("Actions from a file")
>>> section
['Activity (Last 3 Days)', 'Activity (Last 7 days)', 'Executable running from disk', 'Actions from File', 'Activity (Last 30 Days)'] # Final section list

Regex to split words in Python

I was designing a regex to split all the actual words from a given text:
Input Example:
"John's mom went there, but he wasn't there. So she said: 'Where are you'"
Expected Output:
["John's", "mom", "went", "there", "but", "he", "wasn't", "there", "So", "she", "said", "Where", "are", "you"]
I thought of a regex like that:
"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)"
After splitting in Python, the result contains None items and empty spaces.
How to get rid of the None items? And why didn't the spaces match?
Edit:
Splitting on spaces will give items like: ["there."]
Splitting on non-letters will give items like: ["John", "s"]
And splitting on non-letters except ' will give items like: ["'Where", "you'"]
Instead of regex, you can use string functions:
to_be_removed = ".,:!" # all characters to be removed
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
for c in to_be_removed:
    s = s.replace(c, '')
s.split()
BUT, in your example you do not want to remove the apostrophe in John's, yet you do want to remove it in you!!'. So string operations fail at that point and you need a finely adjusted regex.
EDIT: a simple regex can probably solve your problem:
(\w[\w']*)
It will capture any run that starts with a letter and keeps going while the next character is an apostrophe or another word character.
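For instance, with an apostrophe directly after a word (a small made-up snippet, just to show the behaviour discussed below):
>>> import re
>>> re.findall(r"(\w[\w']*)", "Where are you'")
['Where', 'are', "you'"]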
(\w[\w']*\w)
This second regex is for a very specific situation. The first regex can capture words like you'. This one will avoid that and only capture an apostrophe if it is inside the word (not at the beginning or the end). But then a new situation arises: you cannot capture the apostrophe in Moss' mom with the second regex. You must decide whether you want to capture a trailing apostrophe in possessive names ending with s.
Example:
rgx = re.compile(r"([\w][\w']*\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
rgx.findall(s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you']
UPDATE 2: I found a bug in my regex! It cannot capture single letters followed by an apostrophe, like A'. The fixed, brand-new regex is here:
(\w[\w']*\w|\w)
rgx = re.compile(r"(\w[\w']*\w|\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
rgx.findall(s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', 'a']
You have too many capturing groups in your regular expression; make them non-capturing:
(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)
Demo:
>>> import re
>>> s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
>>> re.split("(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)", s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', '']
That leaves only one empty element, at the very end, because the string ends with characters that the split pattern matches.
This regex will only allow one ending apostrophe, which may be followed by one more character:
([\w][\w]*'?\w?)
Demo:
>>> import re
>>> s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
>>> re.compile("([\w][\w]*'?\w?)").findall(s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', "a'"]
I am new to Python but I think I have figured it out:
import re
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
result = re.findall(r"(.+?)[\s'\",!]{1,}", s)
print(result)
result
['John', 's', 'mom', 'went', 'there', 'but', 'he', 'wasn', 't', 'there.', 'So', 'she', 'said:', 'Where', 'are', 'you']
