I am trying to clean a string so that it contains no punctuation or digits; it must contain only a-z and A-Z.
For example, the given string is:
"coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????"
Required output is :
['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']
My solution is
re.findall(r"([A-Za-z]+)" ,string)
My output is
['coMPuter', 'scien', 'tist', 's', 'are', 'the', 'rock', 'stars', 'of', 'tomorrow', 'cool']
You don't need to use a regular expression:
(Convert the string to lower case if you want all words lower-cased.) Split it into words, keep only the words that start with a letter, and filter the non-letter characters out of each one:
>>> s = "coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????"
>>> [filter(str.isalpha, word) for word in s.lower().split() if word[0].isalpha()]
['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']
In Python 3.x, filter(str.isalpha, word) should be replaced with ''.join(filter(str.isalpha, word)), because in Python 3.x, filter returns a filter object.
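A minimal Python 3 sketch of the same idea, wrapped in a helper so it can be reused. The function name clean_words is only illustrative and does not come from the answer above.

```python
def clean_words(text):
    """Lower-case text, split on whitespace, keep tokens that start with a
    letter, and drop every non-letter character from those tokens."""
    return [
        "".join(filter(str.isalpha, word))  # join is needed in Python 3
        for word in text.lower().split()
        if word[0].isalpha()
    ]


s = "coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????"
words = clean_words(s)
print(words)
# ['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']
```

Note that str.isalpha is Unicode-aware, so accented letters would also be kept; if strictly a-z is required, a regex character class is the safer choice.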
With the recommendations of everyone who answered, I arrived at the solution I really wanted. Thanks to everyone.
import re
s = "coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????"
cleaned = re.sub(r'(<.*>|[^a-zA-Z\s]+)', '', s).split()
print(cleaned)
Using re, although I'm not sure this is what you want, because you said you didn't want "cool" left over.
import re
s = "coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????"
REGEX = r'([^a-zA-Z\s]+)'
cleaned = re.sub(REGEX, '', s).split()
# ['coMPuter', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow', 'cool']
EDIT
WORD_REGEX = re.compile(r'(?!<?\S+>)(?=\w)(\S+)')
CLEAN_REGEX = re.compile(r'([^a-zA-Z])')
def cleaned(match_obj):
    return re.sub(CLEAN_REGEX, '', match_obj.group(1)).lower()
[cleaned(x) for x in re.finditer(WORD_REGEX, s)]
# ['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']
WORD_REGEX uses a positive lookahead for any word characters and a negative lookahead for <...>. Whatever non-whitespace that makes it past the lookaheads is grouped:
(?!<?\S+>) # negative lookahead
(?=\w) # positive lookahead
(\S+) #group non-whitespace
cleaned takes each match object, removes every non-letter character from the captured group with CLEAN_REGEX, and lower-cases the result.
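To see what each stage contributes, here is a small sketch that prints the tokens that survive the two lookaheads before any cleaning is applied. The names raw_tokens and cleaned_tokens are illustrative only.

```python
import re

WORD_REGEX = re.compile(r'(?!<?\S+>)(?=\w)(\S+)')
CLEAN_REGEX = re.compile(r'[^a-zA-Z]')

s = "coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????"

# Stage 1: tokens that start with a word character and are not inside <...>
raw_tokens = WORD_REGEX.findall(s)
print(raw_tokens)
# ['coMPuter', 'scien_tist-s', 'are,,,', 'the', 'rock__stars', 'of', 'tomorrow_']

# Stage 2: strip everything that is not a letter, then lower-case
cleaned_tokens = [CLEAN_REGEX.sub('', tok).lower() for tok in raw_tokens]
print(cleaned_tokens)
# ['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']
```

Because \w also matches the underscore and digits, a token such as _foo would pass the positive lookahead; that does not matter for this input.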
Related
I have a list of strings; each string is about 10 sentences long. I am hoping to find all the words in each string that begin with a capital letter, preferably excluding the first word of each sentence. I am using re.findall to do this. When I manually set string = '...' I have no trouble doing this, but when I use a for loop to go over each entry in my list I get a different output.
for i in list_3:
    string = i
    test = re.findall(r"(\b[A-Z][a-z]*\b)", string)
print(test)
output:
['I', 'I', 'As', 'I', 'University', 'Illinois', 'It', 'To', 'It', 'I', 'One', 'Manu', 'I', 'I', 'Once', 'And', 'Through', 'I', 'I', 'Most', 'Its', 'The', 'I', 'That', 'I', 'I', 'I', 'I', 'I', 'I']
When I manually input the string value
txt = 0
for i in list_3:
    string = list_3[txt]
    test = re.findall(r"(\b[A-Z][a-z]*\b)", string)
print(test)
output:
['Remember', 'The', 'Common', 'App', 'Do', 'Your', 'Often', 'We', 'Monica', 'Lannom', 'Co', 'Founder', 'Campus', 'Ventures', 'One', 'Break', 'Campus', 'Ventures', 'Universities', 'Undermatching', 'Stanford', 'Yale', 'Undermatching', 'What', 'A', 'Yale', 'Lannom', 'There', 'During', 'Some', 'The', 'Lannom', 'That', 'It', 'Lannom', 'Institutions', 'University', 'Chicago', 'Boston', 'College', 'These', 'Students', 'If', 'Lannom', 'Recruiting', 'Elite', 'Campus', 'Ventures', 'Understanding', 'Campus', 'Ventures', 'The', 'For', 'Lannom', 'What', 'I', 'Wish', 'I', 'Knew', 'Before', 'Starting', 'Company', 'I', 'Even', 'I', 'Lannom', 'The', 'There']
But I can't seem to write a for loop that correctly prints the output for each of the 5 items in the list. Any ideas?
The easiest way to do that is to write a for loop that checks whether the first letter of each element of the list is capitalized. If it is, the element is appended to the output list.
output = []
for i in list_3:
    if i[0] == i[0].upper():
        output.append(i)
print(output)
We can also use a list comprehension and do it in one line. Again, we check whether the first letter of each element is a capital letter.
output = [x for x in list_3 if x[0].upper() == x[0]]
print(output)
EDIT
Each element of your list is a whole passage of sentences, so here is the solution for that case. We iterate over list_3, then iterate over every word by using the split() function. We then check whether the word is capitalized. If it is, it is added to the output.
list_3 = ["Remember your college application process? The tedious Common App applications, hours upon hours of research, ACT/SAT, FAFSA, visiting schools, etc. Do you remember who helped you through this process? Your family and guidance counselors perhaps, maybe your peers or you may have received little to no help"]
output = []
for i in list_3:
    for j in i.split():
        if j[0].isupper():
            output.append(j)
print(output)
Assuming sentences are separated by one space, you could use re.findall with the following regular expression.
r'(?m)(?<!^)(?<![.?!] )[A-Z][A-Za-z]*'
Python's regex engine performs the following operations.
(?m) : set multiline mode so that ^ and $ match the beginning
and the end of a line
(?<!^) : negative lookbehind asserts current location is not
at the beginning of a line
(?<![.?!] ) : negative lookbehind asserts current location is not
preceded by '.', '?' or '!', followed by a space
[A-Z] : match an uppercase letter
[A-Za-z]* : match 0+ letters
If sentences can be separated by one or two spaces, insert a second negative lookbehind with two spaces, (?<![.?!]  ), after (?<![.?!] ).
If the PyPI regex module were used, one could use the variable-length lookbehind (?<![.?!] +)
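As a quick check of the single-space version with the standard re module, here is a minimal sketch. The sample text and the name MID_SENTENCE_CAPS are made up for illustration.

```python
import re

# Capitalized words that are neither at the start of a line nor
# directly after '.', '?' or '!' followed by one space.
MID_SENTENCE_CAPS = re.compile(r'(?m)(?<!^)(?<![.?!] )[A-Z][A-Za-z]*')

text = "Remember the Common App. Do you know Monica Lannom? Yes."
found = MID_SENTENCE_CAPS.findall(text)
print(found)
# ['Common', 'App', 'Monica', 'Lannom']
```

The pattern has no word boundary before [A-Z], so a capital in the middle of a word (the D in McDonald, say) would also start a match; adding \b in front of [A-Z] is one way to rule that out.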
As I understand it, you have a list like this:
list_3 = [
'First sentence. Another Sentence',
'And yet one another. Sentence',
]
You are iterating over the list, but every iteration overwrites the test variable, so you end up with an incorrect result. You either have to accumulate the results in an additional variable or print them right away on every iteration:
acc = []
for item in list_3:
    acc.extend(re.findall(regexp, item))
print(acc)
or
for item in list_3:
    print(re.findall(regexp, item))
As for a regexp that ignores the first word of a sentence, you can use
re.findall(r'(?<!\A)(?<!\.)\s+[A-Z]\w+', s)
(?<!\A) - not at the beginning of the string
(?<!\.) - not immediately after a dot
\s+ - one or more whitespace characters before the word
The words you receive are prefixed by that whitespace, so here's the final example:
acc = []
for item in list_3:
    words = [w.strip() for w in re.findall(r'(?<!\A)(?<!\.)\s+[A-Z]\w+', item)]
    acc.extend(words)
print(acc)
As I really like regexes, try this one:
#!/bin/python3
import re
PATTERN = re.compile(r'[A-Z][A-Za-z0-9]*')
all_sentences = [
"My House! is small",
"Does Annie like Cats???"
]
def flat_list(sentences):
    for sentence in sentences:
        yield from PATTERN.findall(sentence)
upper_words = list(flat_list(all_sentences))
print(upper_words)
# Result: ['My', 'House', 'Does', 'Annie', 'Cats']
I have the following string.
words = "this is a  book and i  like it"
What I want is that when I split it on a single space, I get the following.
wordList = words.split(" ")
print(wordList)
<< ['this','is','a',' book','and','i',' like','it']
A plain words.split(" ") splits the string, but in the case of a double space it consumes both spaces, which gives 'book' and 'like'. What I need is ' book' and ' like', keeping the extra spaces intact in the split output for double, triple, ... n spaces.
You can split on a space that is preceded by a non-whitespace character, using the lookbehind (?<=...) syntax:
import re
re.split("(?<=\\S) ", words)
# ['this', 'is', 'a', ' book', 'and', 'i', ' like', 'it']
Or similarly, use a negative lookbehind:
re.split("(?<!\\s) ", words)
# ['this', 'is', 'a', ' book', 'and', 'i', ' like', 'it']
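The lookbehind split also handles longer runs of spaces. A minimal sketch, assuming an input with one double and one triple space (the sample string is made up for illustration):

```python
import re

# one double space before "book", one triple space before "like"
words = "this is a  book and i   like it"

# split only on a space that directly follows a non-whitespace character;
# any further spaces stay attached to the next word
parts = re.split(r"(?<=\S) ", words)
print(parts)
# ['this', 'is', 'a', ' book', 'and', 'i', '  like', 'it']
```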
Just another regex solution: if you need to split with a single left-most whitespace char, use \s? to match one or zero whitespaces, and then capture 0+ remaining whitespaces and the subsequent non-whitespace chars.
One very important step: run rstrip on the input string before running the regex to remove all the trailing whitespace, since otherwise, its performance will decrease greatly.
import re
words = "this is a  book and i  like it"
print(re.findall(r'\s?(\s*\S+)', words.rstrip()))
# => ['this', 'is', 'a', ' book', 'and', 'i', ' like', 'it']
re.findall returns just the captured substrings, and since there is only one capturing group, the result is a list of those captures.
Details:
\s? - 1 or 0 (due to ? quantifier) whitespaces
(\s*\S+) - Capturing group #1 matching
\s* - zero or more (due to the * quantifier) whitespace
\S+ - 1 or more (due to + quantifier) non-whitespace symbols.
If you don't feel like using a regex and want to keep something close to your own code, you could use something like this:
words = "this is a  book and i  like it"
wordList = words.split(" ")
for i in range(len(wordList)):
    if wordList[i] == '':
        wordList[i+1] = ' ' + wordList[i+1]
wordList = [x for x in wordList if x != '']
print(wordList)
# Outputs: ['this', 'is', 'a', ' book', 'and', 'i', ' like', 'it']
An alternative using a list comprehension:
word_list = iter(words.split(" "))
["".join([" ", next(word_list)]) if not w else w for w in word_list]
# ['this', 'is', 'a', ' book', 'and', 'i', ' like', 'it']
import urllib2,sys
from bs4 import BeautifulSoup,NavigableString
from string import punctuation as p
# URL for Obama's presidential acceptance speech in 2008
obama_4427_url = 'http://www.millercenter.org/president/obama/speeches/speech-4427'
# read in URL
obama_4427_html = urllib2.urlopen(obama_4427_url).read()
# BS magic
obama_4427_soup = BeautifulSoup(obama_4427_html)
# find the speech itself within the HTML
obama_4427_div = obama_4427_soup.find('div',{'id': 'transcript'},{'class': 'displaytext'})
# obama_4427_div.text.lower() removes extraneous characters (e.g. '<br/>')
# and places all letters in lowercase
obama_4427_str = obama_4427_div.text.lower()
# for further text analysis, remove punctuation
for punct in list(p):
    obama_4427_str_processed = obama_4427_str.replace(p,'')
    obama_4427_str_processed_2 = obama_4427_str_processed.replace(p,'')
print(obama_4427_str_processed_2)
# store individual words
words = obama_4427_str_processed.split(' ')
print(words)
Long story short, I have a speech from President Obama, and am looking to remove all punctuation, so that I'm left only with the words. I've imported the punctuation module, ran a for loop which didn't remove all my punctuation. What am I doing wrong here?
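str.replace() searches for the whole value of the first argument. It is not a pattern, so the text is replaced with an empty string only where the entire string.punctuation value appears in it. Your loop passes p (the full punctuation string) rather than punct (the current character), so nothing is ever replaced.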
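A short sketch of the difference, using a made-up sample sentence rather than the speech itself:

```python
from string import punctuation

text = "hello, world! it's fine."

# Passing the whole punctuation string looks for that exact 32-character
# sequence, which never occurs, so the text comes back unchanged.
unchanged = text.replace(punctuation, "")
print(unchanged == text)  # True

# Replacing one character at a time, and feeding each result into the
# next call, removes every punctuation character.
cleaned = text
for ch in punctuation:
    cleaned = cleaned.replace(ch, "")
print(cleaned)  # hello world its fine
```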
Use a regular expression instead:
import re
from string import punctuation as p
punctuation = re.compile('[{}]+'.format(re.escape(p)))
obama_4427_str_processed = punctuation.sub('', obama_4427_str)
words = obama_4427_str_processed.split()
Note that you can just use str.split() without an argument to split on any arbitrary-width whitespace, including newlines.
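To illustrate the difference between the two forms of split, here is a small sketch on a made-up string containing a newline, a double space and a tab:

```python
text = "four score\nand  seven\tyears"

# No argument: split on any run of whitespace, never yielding empty strings.
by_whitespace = text.split()
print(by_whitespace)
# ['four', 'score', 'and', 'seven', 'years']

# A single-space argument: split on each space character only.
by_space = text.split(' ')
print(by_space)
# ['four', 'score\nand', '', 'seven\tyears']
```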
If you want to remove the punctuation from the end of each word, you can rstrip it off:
obama_4427_str = obama_4427_div.text.lower()
# for further text analysis, remove punctuation
from string import punctuation
print([w.rstrip(punctuation) for w in obama_4427_str.split()])
Output:
['transcript', 'to', 'chairman', 'dean', 'and', 'my', 'great',
'friend', 'dick', 'durbin', 'and', 'to', 'all', 'my', 'fellow',
'citizens', 'of', 'this', 'great', 'nation', 'with', 'profound',
'gratitude', 'and', 'great', 'humility', 'i', 'accept', 'your',
'nomination', 'for', 'the', 'presidency', 'of', 'the', 'united',
................................................................
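Keep in mind that rstrip only touches the right-hand end of each word. A minimal sketch with made-up tokens shows what is and is not removed, and how strip handles both ends:

```python
from string import punctuation

tokens = ["nation,", "(applause)", "don't!"]

# rstrip removes trailing punctuation only; the leading "(" survives
right_only = [w.rstrip(punctuation) for w in tokens]
print(right_only)  # ['nation', '(applause', "don't"]

# strip removes punctuation from both ends but leaves inner apostrophes
both_ends = [w.strip(punctuation) for w in tokens]
print(both_ends)  # ['nation', 'applause', "don't"]
```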
Using Python 3, to remove punctuation from anywhere in the string, use str.translate:
from string import punctuation
tbl = str.maketrans({ord(ch):"" for ch in punctuation})
obama_4427_str = obama_4427_div.text.lower().translate(tbl)
print(obama_4427_str.split())
For Python 2:
from string import punctuation
obama_4427_str = obama_4427_div.text.lower().encode("utf-8").translate(None,punctuation)
print( obama_4427_str.split())
Output:
['transcript', 'to', 'chairman', 'dean', 'and', 'my', 'great',
'friend', 'dick', 'durbin', 'and', 'to', 'all', 'my', 'fellow',
'citizens', 'of', 'this', 'great', 'nation', 'with', 'profound',
'gratitude', 'and', 'great', 'humility', 'i', 'accept', 'your',
'nomination', 'for', 'the', 'presidency', 'of', 'the', 'united',
............................................................
On another note, you can iterate over a string directly, so list(p) is redundant in your own code.
I have been playing around with this code, which is supposed to read a string of text that has no spaces. The code needs to separate the string into words by identifying all the capital letters using regular expressions. However, I can’t seem to get it to display the capital letters.
import re
mystring = 'ThisIsStringWithoutSpacesWordsTextManDogCow!'
wordList = re.sub("[^\^a-z]"," ",mystring)
print (wordList)
Try:
re.sub("([A-Z])"," \\1",mystring).split()
This prepends a space in front of every capital letter and splits on these spaces.
Output:
['This',
'Is',
'String',
'Without',
'Spaces',
'Words',
'Text',
'Man',
'Dog',
'Cow!']
As an alternative to sub, you could use re.findall to find all the words (beginning with an uppercase letter followed by zero or more non-uppercase characters) and then join them back together:
>>> ' '.join(re.findall(r'[A-Z][^A-Z]*', mystring))
'This Is String Without Spaces Words Text Man Dog Cow!'
Try
>>> re.split('([A-Z][a-z]*)', mystring)
['', 'This', '', 'Is', '', 'String', '', 'Without', '', 'Spaces', '', 'Words', '', 'Text', '', 'Man', '', 'Dog', '', 'Cow', '!']
This gives you word per word output. Even the ! is separated out.
If you don't want the extra '' entries, you can remove them with filter(lambda x: x != '', a), where a is the output of the command above:
>>> filter(lambda x: x != '', a)
['This', 'Is', 'String', 'Without', 'Spaces', 'Words', 'Text', 'Man', 'Dog', 'Cow', '!']
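In Python 3, filter returns an iterator rather than a list, so a list comprehension is a convenient way to get the same result. A minimal sketch (the names parts and non_empty are illustrative):

```python
import re

mystring = 'ThisIsStringWithoutSpacesWordsTextManDogCow!'

# The capturing group makes re.split keep the matched words; the text
# between consecutive matches shows up as empty strings.
parts = re.split(r'([A-Z][a-z]*)', mystring)

# Drop the empty strings; "if p" is true for every non-empty piece.
non_empty = [p for p in parts if p]
print(non_empty)
# ['This', 'Is', 'String', 'Without', 'Spaces', 'Words', 'Text', 'Man', 'Dog', 'Cow', '!']
```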
Not a regular expression solution, but you can do it in normal code as well :-)
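mystring = 'ThisIsStringWithoutSpacesWordsTextManDogCow!'
output_list = []
index = 0  # start of the word currently being collected
for i, letter in enumerate(mystring):
    # an uppercase letter (other than at the current start) begins a new word
    if i != index and letter.isupper():
        output_list.append(mystring[index:i])
        index = i
# append the final word, which runs to the end of the string
output_list.append(mystring[index:])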
Now, on topic, this could be what you are looking for:
mystring = re.sub(r"([a-z\d])([A-Z])", r'\1 \2', mystring)
# Makes the string space separated. You can use split to convert it to list
mystring = mystring.split()
I was designing a regex to split all the actual words from a given text:
Input Example:
"John's mom went there, but he wasn't there. So she said: 'Where are you'"
Expected Output:
["John's", "mom", "went", "there", "but", "he", "wasn't", "there", "So", "she", "said", "Where", "are", "you"]
I thought of a regex like this:
"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)"
After splitting in Python, the result contains None items and empty spaces.
How to get rid of the None items? And why didn't the spaces match?
Edit:
Splitting on spaces will give items like: ["there."]
Splitting on non-letters will give items like: ["John","s"]
And splitting on non-letters except ' will give items like: ["'Where","you'"]
Instead of regex, you can use string-functions:
to_be_removed = ".,:!" # all characters to be removed
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
for c in to_be_removed:
s = s.replace(c, '')
s.split()
BUT, in your example you do not want to remove the apostrophe in John's, yet you do want to remove it in you!!'. So plain string operations fail at that point and you need a finely adjusted regex.
EDIT: a simple regex can probably solve your problem:
(\w[\w']*)
It captures every run of characters that starts with a word character and keeps capturing while the next character is an apostrophe or a word character.
(\w[\w']*\w)
This second regex is for a very specific situation. The first regex can capture words like you'. This one avoids that and captures an apostrophe only if it is inside the word (not at the beginning or the end). But then another situation arises: the second regex cannot capture the trailing apostrophe in Moss' mom. You must decide whether to capture a trailing apostrophe in possessive names that end with s.
Example:
import re
rgx = re.compile(r"(\w[\w']*\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
rgx.findall(s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you']
UPDATE 2: I found a bug in my regex! It cannot capture single-letter words, such as the A in A'. Here is the fixed regex:
(\w[\w']*\w|\w)
rgx = re.compile(r"(\w[\w']*\w|\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
rgx.findall(s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', 'a']
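The trade-off between keeping and dropping a trailing apostrophe can be seen side by side in a small sketch. The sample text and the names loose and strict are made up for illustration.

```python
import re

text = "Moss' mom said 'you'"

# First pattern: keeps a trailing apostrophe, for better (Moss') or worse (you')
loose = re.findall(r"\w[\w']*", text)
print(loose)   # ["Moss'", 'mom', 'said', "you'"]

# Final pattern: apostrophes are kept only inside a word
strict = re.findall(r"\w[\w']*\w|\w", text)
print(strict)  # ['Moss', 'mom', 'said', 'you']
```

Neither pattern can tell a possessive apostrophe from a closing quote, so the choice depends on which error is more acceptable for the text at hand.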
You have too many capturing groups in your regular expression; make them non-capturing:
(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)
Demo:
>>> import re
>>> s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
>>> re.split("(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)", s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', '']
That returns only one element that is empty.
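The None items come from how re.split treats capturing groups: for every separator it inserts the text of each group, and a group that did not take part in the match contributes None. A minimal sketch on a toy string (not the original input):

```python
import re

# Two capturing groups: each separator adds one entry per group,
# and the group that did not match shows up as None.
with_groups = re.split(r"(a)|(b)", "xaybz")
print(with_groups)     # ['x', 'a', None, 'y', None, 'b', 'z']

# Non-capturing groups: only the text between separators is returned.
without_groups = re.split(r"(?:a)|(?:b)", "xaybz")
print(without_groups)  # ['x', 'y', 'z']
```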
This regex will only allow one ending apostrophe, which may be followed by one more character:
([\w][\w]*'?\w?)
Demo:
>>> import re
>>> s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
>>> re.compile(r"([\w][\w]*'?\w?)").findall(s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', "a'"]
I am new to Python, but I think I have figured it out:
import re
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
result = re.findall(r"(.+?)[\s'\",!]{1,}", s)
print(result)
result
['John', 's', 'mom', 'went', 'there', 'but', 'he', 'wasn', 't', 'there.', 'So', 'she', 'said:', 'Where', 'are', 'you']