I wanted to split a sentence on multiple delimiters:
.?!\n
However, I want to keep the comma along with the word.
For example, for the string
'Hi, How are you?'
I want the result
['Hi,', 'How', 'are', 'you', '?']
I tried the following, but I'm not getting the required result:
words = re.findall(r"\w+|\W+", text)
Use re.split and keep your delimiters, then filter out the strings that contain only whitespace.
>>> import re
>>> s = 'Hi, How are you?'
>>> [x for x in re.split(r'(\s|!|\.|\?|\n)', s) if x.strip()]
['Hi,', 'How', 'are', 'you', '?']
If using re.findall:
>>> ss = """
... Hi, How are
...
... yo.u
... do!ing?
... """
>>> [w for w in re.findall(r'(\w+,?|[.?!]?)?\s*', ss) if w]
['Hi,', 'How', 'are', 'yo', '.', 'u', 'do', '!', 'ing', '?']
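For the simpler original example, a shorter findall pattern would also do; a minimal sketch, assuming only trailing commas need to stay attached to words:
>>> re.findall(r'\w+,?|[.?!]', 'Hi, How are you?')
['Hi,', 'How', 'are', 'you', '?']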
You can use:
re.findall(r'(.*?)([\s\.\?!\n])', text)
With a bit of itertools magic and list comprehensions:
[i.strip() for i in itertools.chain.from_iterable(re.findall(r'(.*?)([\s\.\?!\n])', text)) if i.strip()]
And a bit more comprehensible version:
import itertools
import re

words = []
found = itertools.chain.from_iterable(re.findall(r'(.*?)([\s\.\?!\n])', text))
for i in found:
    w = i.strip()
    if w:
        words.append(w)
I have a problem with my Python script; it's a straightforward problem, but I can't resolve it.
For example, I have an 11-word text that I split with the re.split(r"\s+", text) function:
import re
text = "this is the example text and i will splitting this text"
split = re.split(r"\s+", text)
for a in range(len(split)):
print(split[a])
The result is
this
is
the
example
text
and
i
will
splitting
this
text
I only need the last 10 of the 11 words, so the result I need is:
is
the
example
text
and
i
will
splitting
this
text
Can you solve this problem? It would be very helpful.
Thank you!
Just slice the list:
>>> import re
>>>
>>> text = "this is the example text and i will splitting this text"
>>> split = re.split(r"\s+", text)
>>> split
['this', 'is', 'the', 'example', 'text', 'and', 'i', 'will', 'splitting', 'this', 'text']
>>> split[-10:]
['is', 'the', 'example', 'text', 'and', 'i', 'will', 'splitting', 'this', 'text']
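If the intent is "drop the first word" rather than "keep exactly ten", slicing from index 1 works for any length:
>>> split[1:]
['is', 'the', 'example', 'text', 'and', 'i', 'will', 'splitting', 'this', 'text']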
No need for regex:
text = "this is the example text and i will splitting this text"
l = text.split() # Split on whitespace
l.pop(0) # Remove first item
print(l) # Print the results
Results: ['is', 'the', 'example', 'text', 'and', 'i', 'will', 'splitting', 'this', 'text']
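Another non-regex option, sketched here as an alternative: str.split's maxsplit argument can peel off the first word without mutating a list:
text = "this is the example text and i will splitting this text"
rest = text.split(maxsplit=1)[1]  # split off only the first word, keep the remainder
print(rest.split())
# ['is', 'the', 'example', 'text', 'and', 'i', 'will', 'splitting', 'this', 'text']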
I want to split a string into words while keeping the whitespace as separate tokens. For example:
String1='Hi what are you doing?'
should be split like:
List1=['Hi','\s','what','\s','are','\s','you','\s','doing','\s','?']
If you just want to split:
String1='Hi what are you doing ?'
print(String1.split())
output:
['Hi', 'what', 'are', 'you', 'doing', '?']
If you want the output as shown in your example:
print(String1.replace(" "," \s ").split())
output:
['Hi', '\\s', 'what', '\\s', 'are', '\\s', 'you', '\\s', 'doing', '\\s', '?']
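For completeness, a regex sketch of the same idea (the pattern below is my own assumption of the intended tokenization, not from the original answer):
>>> import re
>>> [m.group() if m.group() != ' ' else r'\s'
...  for m in re.finditer(r'\w+|\s|[^\w\s]', 'Hi what are you doing?')]
['Hi', '\\s', 'what', '\\s', 'are', '\\s', 'you', '\\s', 'doing', '?']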
import re
s = "your string here \nhello"
re.split(r'\s+', s)
Another method, via the re module:
>>> import re
>>> s = "your string here \nhello \thi"
>>> re.findall(r'\S+', s)
['your', 'string', 'here', 'hello', 'hi']
This would match one or more non-space characters.
Try this:
s ='Hi what are you doing?'
import re
re.findall('[a-zA-Z]{1,}|[^a-zA-Z]{1,}', s)
output:
['Hi', ' ', 'what', ' ', 'are', ' ', 'you', ' ', 'doing', '?']
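Note that {1,} is just a longer spelling of +, so an equivalent, more idiomatic sketch is:
re.findall('[a-zA-Z]+|[^a-zA-Z]+', s)
# ['Hi', ' ', 'what', ' ', 'are', ' ', 'you', ' ', 'doing', '?']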
I want to tokenize all currency symbols using NLTK's regex tokenizer.
For example this is my sentence:
The price of it is $5.00.
The price of it is RM5.00.
The price of it is €5.00.
I used this regex pattern:
pattern = r'''(['()""\w]+|\.+|\?+|\,+|\!+|\$?\d+(\.\d+)?%?)'''
tokenize_list = nltk.regexp_tokenize(sentence, pattern)
But as we can see, it only handles $.
I tried to use \p{Sc} as explained in What is regex for currency symbol? but it is still not working for me.
Try padding the numbers with spaces to separate them from the currency symbol, then tokenize:
>>> import re
>>> from nltk import word_tokenize
>>> sents = """The price of it is $5.00.
... The price of it is RM5.00.
... The price of it is €5.00.""".split('\n')
>>>
>>> for sent in sents:
...     numbers_in_sent = re.findall(r"[-+]?\d+\.?\d*", sent)
...     for num in numbers_in_sent:
...         sent = sent.replace(num, ' ' + num + ' ')
...     print(word_tokenize(sent))
...
['The', 'price', 'of', 'it', 'is', '$', '5.00', '.']
['The', 'price', 'of', 'it', 'is', 'RM', '5.00', '.']
['The', 'price', 'of', 'it', 'is', '€', '5.00', '.']
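If a third-party dependency is acceptable, the regex module (unlike the stdlib re) does support the \p{Sc} currency-symbol property mentioned in the question; a minimal sketch:
import regex  # third-party module: pip install regex; stdlib re does not support \p{Sc}

sent = 'The price of it is €5.00.'
# \p{Sc} matches any Unicode currency symbol ($, €, £, ...)
print(regex.findall(r'\p{Sc}', sent))
# ['€']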
I have to match all the alphanumeric words in a text.
>>> import re
>>> text = "hello world!! how are you?"
>>> final_list = re.findall(r"[a-zA-Z0-9]+", text)
>>> final_list
['hello', 'world', 'how', 'are', 'you']
>>>
This is fine, but I also have a few words to negate, i.e. words that shouldn't be in my final list.
>>> negate_words = ['world', 'other', 'words']
A bad way to do it
>>> negate_str = '|'.join(negate_words)
>>> list(filter(lambda x: not re.match(negate_str, x), final_list))
['hello', 'how', 'are', 'you']
But I could save a loop if my very first regex pattern could be changed to handle the negation of those words. I found character-class negation, but I have whole words to negate; I also found regex lookbehind in other questions, but that doesn't help either.
Can it be done using python re?
Update
My text can span a few hundred lines, and the list of negate_words can be lengthy too.
Considering this, is regex even the right tool for this task in the first place? Any suggestions?
I don't think there is a clean way to do this using regular expressions. The closest I could find was a bit ugly and not exactly what you wanted:
>>> re.findall(r"\b(?:world|other|words)|([a-zA-Z0-9]+)\b", text)
['hello', '', 'how', 'are', 'you']
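If you did go that route, the empty strings are easy to filter out afterwards:
>>> [w for w in re.findall(r"\b(?:world|other|words)|([a-zA-Z0-9]+)\b", text) if w]
['hello', 'how', 'are', 'you']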
Why not use Python's sets instead? They are very fast:
>>> list(set(final_list) - set(negate_words))
['hello', 'how', 'are', 'you']
If order is important, see the reply from @glglgl below. His list comprehension version is very readable. Here's a fast but less readable equivalent using itertools:
>>> negate_words_set = set(negate_words)
>>> list(itertools.filterfalse(negate_words_set.__contains__, final_list))
['hello', 'how', 'are', 'you']
Another alternative is to build up the word list in a single pass using re.finditer:
>>> negate_words_set = set(negate_words)
>>> result = []
>>> for mo in re.finditer(r"[a-zA-Z0-9]+", text):
...     word = mo.group()
...     if word not in negate_words_set:
...         result.append(word)
...
>>> result
['hello', 'how', 'are', 'you']
Maybe it's worth trying pyparsing for this:
>>> from pyparsing import *
>>> negate_words = ['world', 'other', 'words']
>>> parser = OneOrMore(Suppress(oneOf(negate_words)) ^ Word(alphanums)).ignore(CharsNotIn(alphanums))
>>> parser.parseString('hello world!! how are you?').asList()
['hello', 'how', 'are', 'you']
Note that oneOf(negate_words) must be before Word(alphanums) to make sure that it matches earlier.
Edit: Just for the fun of it, I repeated the exercise using lepl (also an interesting parsing library)
>>> from lepl import *
>>> negate_words = ['world', 'other', 'words']
>>> parser = OneOrMore(~Or(*negate_words) | Word(Letter() | Digit()) | ~Any())
>>> parser.parse('hello world!! how are you?')
['hello', 'how', 'are', 'you']
Don't ask too much of regex.
Instead, think in terms of generators.
import re
unwanted = ('world', 'other', 'words')
text = "hello world!! how are you?"
gen = (m.group() for m in re.finditer(r"[a-zA-Z0-9]+", text))
li = [w for w in gen if w not in unwanted]
A generator expression could be used instead of the list li, too.
I would like to parse a string to obtain a list including all words (hyphenated words, too). My current code is:
s = '-this is. A - sentence;one-word'
re.compile("\W+",re.UNICODE).split(s)
returns:
['', 'this', 'is', 'A', 'sentence', 'one', 'word']
and I would like it to return:
['', 'this', 'is', 'A', 'sentence', 'one-word']
If you don't need the leading empty string, you could use the pattern \w(?:[-\w]*\w)? for matching:
>>> import re
>>> s = '-this is. A - sentence;one-word'
>>> rx = re.compile(r'\w(?:[-\w]*\w)?')
>>> rx.findall(s)
['this', 'is', 'A', 'sentence', 'one-word']
Note that it won't match words with apostrophes like won't.
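If apostrophes matter, one possible tweak (my assumption, not part of the original answer) is to allow ' inside the word as well:
>>> rx = re.compile(r"\w(?:[-'\w]*\w)?")
>>> rx.findall("-this is. A - sentence;one-word won't")
['this', 'is', 'A', 'sentence', 'one-word', "won't"]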
Here is my traditional "why use the regexp language when you can use Python" alternative:
import string
s = "-this is. A - sentence;one-word what's"
s = [word.strip(string.punctuation)
     for word in s.replace(';', '; ').split()]
s = [w for w in s if w]  # drop the empty strings left by bare punctuation
print(s)
""" Output:
['this', 'is', 'A', 'sentence', 'one-word', "what's"]
"""
You could use "[^\w-]+" instead.
s = "-this is. A - sentence;one-word what's"
re.findall("\w+-\w+|[\w']+",s)
result:
['this', 'is', 'A', 'sentence', 'one-word', "what's"]
Make sure you notice that the correct ordering is to look for hyphenated words first!
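To see why the ordering matters, here is the same alternation reversed; [\w']+ then wins first and splits the hyphenated word:
re.findall(r"[\w']+|\w+-\w+", s)
# ['this', 'is', 'A', 'sentence', 'one', 'word', "what's"]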
You can try the NLTK library:
>>> import nltk
>>> s = '-this is a - sentence;one-word'
>>> hyphen = r'(\w+\-\s?\w+)'
>>> wordr = r'(\w+)'
>>> r = "|".join([ hyphen, wordr])
>>> tokens = nltk.tokenize.regexp_tokenize(s,r)
>>> print(tokens)
['this', 'is', 'a', 'sentence', 'one-word']
I found it here: http://www.cs.oberlin.edu/~jdonalds/333/lecture03.html. Hope it helps!