Split sentences based on different patterns in Python 3 - python

I need to split strings based on a sequence of regex patterns. I can apply each split individually, but the problem is recursively splitting the resulting sentences with the remaining patterns.
For example I have this sentence:
"I want to be splitted using different patterns. It is a complex task, and not easy to solve; so, I would need help."
I would need to split the sentence based on ",", ";" and ".".
The result should be 5 sentences like:
"I want to be splitted using different patterns."
"It is a complex task,"
"and not easy to solve;"
"so,"
"I would need help."
My code so far:
import re
sample_sentence = "I want to be splitted using different patterns. It is a complex task, and not easy to solve; so, I would need help."
patterns = [re.compile(r'(?<=\.) '),
            re.compile(r'(?<=,) '),
            re.compile(r'(?<=;) ')]
for pattern in patterns:
    # Each pass splits the original sentence again, so earlier results are lost
    splitted_sentences = pattern.split(sample_sentence)
    print(f'Pattern used: {pattern}')
How can I apply the different patterns without losing the results and get the expected result?
Edit: I need to run each pattern one by one, because I need to do some checks on the result of every pattern, so it would run as some sort of tree algorithm. Sorry for not explaining this fully; it was clear in my head, but I did not think it would have side effects.

You can join each pattern with |:
import re
s = "I want to be splitted using different patterns. It is a complex task, and not easy to solve; so, I would need help."
result = re.split(r'(?<=\.)\s|,\s*|;\s*', s)
Output:
['I want to be splitted using different patterns.', 'It is a complex task', 'and not easy to solve', 'so', 'I would need help.']

Python has this in re.
Try
re.split(r'; |, |\. ', ourString)

I can't think of a single regex to do this. So, what you can do is replace all the different types of delimiters with a custom-defined delimiter, say $DELIMITER$, and then split your sentence based on this delimiter.
new_sent = re.sub('[.,;]', '$DELIMITER$', sent)
new_sent.split('$DELIMITER$')
This will result in the following:
['I want to be splitted using different patterns',
' It is a complex task',
' and not easy to solve',
' so',
' I would need help',
'']
NOTE: The above output has an additional empty string. This is because there is a period at the end of the sentence. To avoid this, you can either remove that empty element from the list or you can substitute the custom defined delimiter if it occurs at the end of the sentence.
new_sent = re.sub('[.,;]', '$DELIMITER$', sent)
new_sent = re.sub(r'\$DELIMITER\$$', '', new_sent)
new_sent.split('$DELIMITER$')
In case you have a list of delimiters, you can build your regex pattern using the following code:
delimiter_list = [',', '.', ':', ';']
pattern = '[' + ''.join(delimiter_list) + ']' #will result in [,.:;]
new_sent = re.sub(pattern, '$DELIMITER$', sent)
new_sent = re.sub(r'\$DELIMITER\$$', '', new_sent)
new_sent.split('$DELIMITER$')
I hope this helps!!!
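Joining the delimiter list directly into a character class works for `,.:;`, but characters such as `]`, `^` or `-` would change the meaning of the class. A minimal sketch, assuming single-character delimiters, that escapes each one and splits in a single step; the helper name `split_on_delimiters` is hypothetical and not part of the answer above:

```python
import re

def split_on_delimiters(sent, delimiters):
    """Split sent on any single-character delimiter, dropping empty pieces."""
    # re.escape keeps characters like ']' or '-' literal inside the class
    pattern = '[' + ''.join(re.escape(d) for d in delimiters) + ']'
    # Strip surrounding whitespace and skip pieces that end up empty
    return [part.strip() for part in re.split(pattern, sent) if part.strip()]

print(split_on_delimiters("a, b. c; d.", [',', '.', ':', ';']))
```

Filtering out empty pieces also removes the trailing empty string caused by a delimiter at the end of the sentence.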

Use a lookbehind with a character class:
import re
s = "I want to be splitted using different patterns. It is a complex task, and not easy to solve; so, I would need help."
result = re.split(r'(?<=[.,;])\s', s)
print(result)
Output:
['I want to be splitted using different patterns.',
'It is a complex task,',
'and not easy to solve;',
'so,',
'I would need help.']
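The edit to the question asks for the patterns to be applied one at a time so that each intermediate result can be checked. A minimal sketch of that idea follows. The helper name `split_sequentially` is hypothetical, and the spot for the per-pattern checks is only marked with a comment, since the question does not say what those checks are:

```python
import re

def split_sequentially(text, patterns):
    """Apply each pattern in turn to every piece produced so far."""
    pieces = [text]
    for pattern in patterns:
        next_pieces = []
        for piece in pieces:
            next_pieces.extend(pattern.split(piece))
        # Assumption: any checks on this pattern's result would go here
        pieces = next_pieces
    return pieces

patterns = [re.compile(r'(?<=\.) '),
            re.compile(r'(?<=,) '),
            re.compile(r'(?<=;) ')]
sample_sentence = ("I want to be splitted using different patterns. "
                   "It is a complex task, and not easy to solve; so, I would need help.")
result = split_sequentially(sample_sentence, patterns)
print(result)
```

Each pass works on the output of the previous pass, so no earlier split is lost.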

Related

How to remove duplicate words, not case sensitive

I have a string
str1='This Python is good Good python'
I want the output to remove duplicates, keeping the first occurrence of each word irrespective of case; e.g. good and Good are considered the same, as are Python and python. The output should be
output='This Python is good'
Following a rather traditional approach:
str1 = 'This Python is good Good python'
words_seen = set()
output = []
for word in str1.split():
    if word.lower() not in words_seen:
        words_seen.add(word.lower())
        output.append(word)
output = ' '.join(output)
print(output) # This Python is good
A caveat: it would not preserve word boundaries consisting of multiple spaces: 'python   puppy' would become 'python puppy'.
A very ugly short version:
words_seen = set()
output = ' '.join(word for word in str1.split() if not (word.lower() in words_seen or words_seen.add(word.lower())))
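The multiple-space caveat mentioned above can be worked around by keeping each word together with the whitespace in front of it. This is only a sketch under that assumption; the function name `dedupe_keep_spacing` is hypothetical:

```python
import re

def dedupe_keep_spacing(text):
    """Drop case-insensitive duplicate words, keeping the first occurrence
    and the original whitespace in front of every kept word."""
    words_seen = set()
    kept = []
    # Each token is a word together with the whitespace that precedes it
    for token in re.findall(r'\s*\S+', text):
        key = token.strip().lower()
        if key not in words_seen:
            words_seen.add(key)
            kept.append(token)
    return ''.join(kept)

print(dedupe_keep_spacing('This Python is good Good python'))
print(repr(dedupe_keep_spacing('python   puppy Python')))
```

Trailing whitespace at the very end of the input is not matched by the pattern and is therefore dropped.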
One approach might be to use regular expressions to remove any word for which we can find a duplicate. The catch is that regex engines move from start to end of a string. Since you want to retain the first occurrence, we can reverse the string, and then do the cleanup.
import re
str1 = 'This Python is good Good python'
str1_reverse = ' '.join(reversed(str1.split(' ')))
str1_reverse = re.sub(r'\s*(\w+)\s*(?=.*\b\1\b)', ' ', str1_reverse, flags=re.I)
str1 = ' '.join(reversed(str1_reverse.strip().split(' ')))
print(str1) # This Python is good

Remove instances of words following a character in python

I'm trying to preprocess message data from the StockTwits API. How can I remove all instances of $name from a string in Python?
For example if the string is:
$AAPL $TSLA $MSFT are all going up!
The output would be:
are all going up!
Something like this would do:
>>> s = "$AAPL $TSLA $MSFT are all going up!"
>>> re.sub(r"\$[a-zA-Z0-9]+\s*", "", s)
'are all going up!'
This allows numbers in the name as well; remove 0-9 if that's not what you want (with it, the pattern would remove e.g. $15 as well).
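If digits should be allowed inside a name but amounts such as $15 should stay, one option is to require a letter right after the dollar sign. A small sketch under that assumption; the names `TICKER` and `strip_tickers` are hypothetical, and whether real ticker symbols always start with a letter is not something the question confirms:

```python
import re

# Assumed rule: "$", then a letter, then any number of letters or digits
TICKER = re.compile(r"\$[A-Za-z][A-Za-z0-9]*\s*")

def strip_tickers(message):
    """Remove $name tokens (and the whitespace after them) from message."""
    return TICKER.sub("", message)

print(strip_tickers("$AAPL $TSLA $MSFT are all going up!"))
print(strip_tickers("$AAPL is up $15 today"))
```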
I'm not sure I get it, but if I were to remove all instances of words that start with $, I would break the string into individual substrings, drop those that start with $, and re-form the string using a list comprehension.
substrings = string.split(' ')
substrings = [s for s in substrings if not s.startswith('$')]
new_string = ' '.join(substrings)
Some would use regular expressions, which are likely more computationally efficient, but less easy to read.

Python re.findall() Capitalized Words including Apostrophes

I'm having trouble completing a regex tutorial that went from "\w+ matches words" straight to this problem: "Find all capitalized words in my_string and print the result", where some of the words have apostrophes.
Original String:
In [1]: my_string
Out[1]: "Let's write RegEx! Won't that be fun? I sure think so. Can you
find 4 sentences? Or perhaps, all 19 words?"
Current Attempt:
# Import the regex module
import re
# Find all capitalized words in my_string and print the result
capitalized_words = r"((?:[A-Z][a-z]+ ?)+)"
print(re.findall(capitalized_words, my_string))
Current Result:
['Let', 'RegEx', 'Won', 'Can ', 'Or ']
What I think the desired outcome is:
["Let's", 'RegEx', "Won't", 'Can', 'Or']
How do you go from r"((?:[A-Z][a-z]+ ?)+)" to also selecting the 's and 't at the end of Let's and Won't when not everything we're trying to catch is expected to have an apostrophe?
Just add an apostrophe to the second bracket group:
capitalized_words = r"((?:[A-Z][a-z']+)+)"
I suppose you can add a little apostrophe in the group [a-z'].
So it will be like ((?:[A-Z][a-z']+ ?)+)
Hope that works
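As a quick check of the apostrophe fix, here is a short sketch that runs the pattern without the optional trailing space against the string from the question, written on one line for convenience:

```python
import re

my_string = ("Let's write RegEx! Won't that be fun? I sure think so. "
             "Can you find 4 sentences? Or perhaps, all 19 words?")

# A capital letter followed by one or more lowercase letters or apostrophes,
# repeated so that mixed-case words like "RegEx" are captured whole
capitalized_words = r"(?:[A-Z][a-z']+)+"
found = re.findall(capitalized_words, my_string)
print(found)  # ["Let's", 'RegEx', "Won't", 'Can', 'Or']
```

The single-letter word "I" is not returned, because the pattern requires at least one character after the capital letter.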
While you do have your answer, I'd like to provide a more "real-world" solution using nltk:
from nltk import sent_tokenize, regexp_tokenize
my_string = """Let's write RegEx! Won't that be fun? I sure think so. Can you
find 4 sentences? Or perhaps, all 19 words?"""
sent = sent_tokenize(my_string)
print(len(sent))
# 5
pattern = r"(?i)\b[a-z][\w']*"  # inline flags must come at the start of the pattern
print(len(regexp_tokenize(my_string, pattern)))
# 19
And imo, these are 5 sentences, not 4 unless there's some special requirement for a sentence in place.

Count all occurrences of elements with and without special characters in a list from a text file in python

I really apologize if this has been answered before but I have been scouring SO and Google for a couple of hours now on how to properly do this. It should be easy and I know I am missing something simple.
I am trying to read from a file and count all occurrences of elements from a list. This list is not just whole words though. It has special characters and punctuation that I need to get as well.
This is what I have so far, I have been trying various ways and this post got me the closest:
Python - Finding word frequencies of list of words in text file
So I have a file that contains a couple of paragraphs and my list of strings is:
listToCheck = ['the','The ','the,','the;','the!','the\'','the.','\'the']
My full code is:
#!/usr/bin/python
import re
from collections import Counter
wanted = ['the','The ','the,','the;','the!','the\'','the.','\'the']
with open('text.txt', 'r') as f:
    # \w+ keeps only word characters, so punctuation is dropped here
    words = re.findall(r'\w+', f.read().lower())
cnt = Counter()
for word in words:
    if word in wanted:
        print(word)
        cnt[word] += 1
print(cnt)
my output thus far looks like:
the
the
the
the
the
the
the
the
the
the
the
the
the
the
the
the
the
Counter({'the': 17})
It is counting my "the" strings with punctuation but not counting them as separate counters. I know it is because of the \w+. I am just not sure what the proper regex pattern is here, or whether I'm going about this the wrong way.
I suspect there may be some extra details to your specific problem that you are not describing here for simplicity. However, I'll assume that what you are looking for is to find a given word, e.g. "the", which could have either an upper or lower case first letter, and can be preceded and followed either by a whitespace or by some punctuation characters such as ;,.!'. You want to count the number of all the distinct instances of this general pattern.
I would define a single (non-disjunctive) regular expression that defines this. Something like this:
import re
pattern = re.compile(r"[\s',;.!][Tt]he[\s.,;'!]")
(That might not be exactly what you are looking for in general. I'm just assuming it is, based on what you stated above.)
Now, let's say our text is
text = '''
Foo and the foo and ;the, foo. The foo 'the and the;
and the' and the; and foo the, and the. foo.
'''
We could do
matches = pattern.findall(text)
where matches will be
[' the ',
';the,',
' The ',
"'the ",
' the;',
" the'",
' the;',
' the,',
' the.']
And then you just count.
from collections import Counter
count = Counter()
for match in matches:
    count[match] += 1
which in this case would lead to
Counter({' the;': 2, ' the.': 1, ' the,': 1, " the'": 1, ' The ': 1, "'the ": 1, ';the,': 1, ' the ': 1})
As I said at the start, this might not be exactly what you want, but hopefully you could modify this to get what you want.
Just to add, a difficulty with using a disjunctive regular expression like
'the|the;|the,|the!'
is that the strings like "the," and "the;" will also match the first option, i.e. "the", and that will be returned as the match. Even though this problem could be avoided by more careful ordering of the options, I think it might not be easier in general.
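The ordering problem described above can be seen in a short sketch. The sample text is made up for illustration:

```python
import re

text = "the; the, the!"

# The bare 'the' comes first, so it wins before the longer options are tried
naive = re.findall(r'the|the;|the,|the!', text)

# Putting the longer alternatives first lets them match
ordered = re.findall(r'the;|the,|the!|the', text)

print(naive)    # ['the', 'the', 'the']
print(ordered)  # ['the;', 'the,', 'the!']
```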
The simplest option is to combine all "wanted" strings into one regular expression:
rr = '|'.join(map(re.escape, wanted))
and then find all matches in the text using re.findall.
To make sure longer strings match first, just sort the wanted list by length:
wanted.sort(key=len, reverse=True)
rr = '|'.join(map(re.escape, wanted))
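Putting these pieces together, a minimal sketch of the counting step might look like the following. The sample text is invented for illustration (the question's text.txt is not available), and note that a bare 'the' would also match inside longer words such as "other":

```python
import re
from collections import Counter

wanted = ['the', 'The ', 'the,', 'the;', 'the!', "the'", 'the.', "'the"]

# Longest strings first, so 'the,' is tried before the bare 'the'
wanted_sorted = sorted(wanted, key=len, reverse=True)
rr = re.compile('|'.join(map(re.escape, wanted_sorted)))

# Made-up sample text standing in for the contents of the file
text = "The cat saw the dog; the, the. 'the end"
counts = Counter(rr.findall(text))
print(counts)
```

Because the text is not lowercased here, 'The ' and 'the' are counted as separate entries.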

Split string multiple times in Python

This should be a simple thing to do but I can't get it to work.
Say I have this string.
I want this string to be splitted into smaller strings.
And, well, I want to split it into smaller strings, but only take what is between a T and a S.
So, the result should yield
this, to be s, to s, trings
So far I've tried splitting on every S, and then up to every T (backwards). However, it will only get the first "this", and stop. How can I make it continue and get all the things that are between T's and S's?
(In this program I export the results to another text file)
matches = open('string.txt', 'r')
with open('test.txt', 'a') as file:
    for line in matches:
        test = line.split("S")
        file.write(test[0].split("T")[-1] + "\n")
matches.close()
Maybe using Regular Expressions would be better, though I don't know how to work with them too well?
You want a re.findall() call instead:
re.findall(r't[^s]*s', line, flags=re.I)
Demo:
>>> import re
>>> sample = 'I want this string to be splitted into smaller strings.'
>>> re.findall(r't[^s]*s', sample, flags=re.I)
['t this', 'tring to be s', 'tted into s', 'trings']
Note that this matches 't this' and 'tted into s'; your rules need clarification as to why those first t characters shouldn't match when 'trings' does.
It sounds as if you want to match only text between t and s without including any other t:
>>> re.findall(r't[^ts]*s', sample, flags=re.I)
['this', 'to be s', 'to s', 'trings']
where tring in the second result and tted in the 3rd are not included because there is a later t in those results.
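The same idea can be wrapped in a small helper for other boundary characters. This is a sketch that assumes single-character boundaries; the name `segments_between` and its parameters are hypothetical:

```python
import re

def segments_between(line, start='t', end='s'):
    """Return the shortest case-insensitive spans running from start to end."""
    s, e = re.escape(start), re.escape(end)
    # Excluding both boundary characters from the middle keeps each span minimal
    pattern = f'{s}[^{s}{e}]*{e}'
    return re.findall(pattern, line, flags=re.I)

sample = 'I want this string to be splitted into smaller strings.'
print(segments_between(sample))  # ['this', 'to be s', 'to s', 'trings']
```

Each returned segment could then be written to the output file one per line, as in the question's loop.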
