Split a string in python based on a set of characters - python

I need to split a string based on some set of characters using python.
For example
String = "A==B AND B==C OR C!=A OR JP Bank==Chase"
I don't want to split the string based on space, since JP and Chase will form two different words.
So, I need to split based on ==,!=,AND,OR.
Expected output
[A,==,B,AND,B,==,C,OR,C,!=,A,OR,JP Bank,==,Chase]

Use re.split with a capturing group in your regular expression.
import re
s = "A==B AND B==C OR C!=A OR JP Bank==Chase"
pat = re.compile(r'(==|!=|AND|OR)')
pat.split(s)
Result
['A', '==', 'B ', 'AND', ' B', '==', 'C ', 'OR', ' C', '!=', 'A ', 'OR', ' JP Bank', '==', 'Chase']

You could try the re.split function. The \s* before and after (AND|OR|[!=]=) also removes the surrounding spaces.
>>> import re
>>> s = "A==B AND B==C OR C!=A OR JP Bank==Chase"
>>> re.split(r'\s*(AND|OR|[!=]=)\s*', s)
['A', '==', 'B', 'AND', 'B', '==', 'C', 'OR', 'C', '!=', 'A', 'OR', 'JP Bank', '==', 'Chase']
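One thing the patterns above do not guard against is AND or OR appearing inside a longer word such as ORANGE or BRAND. A minimal sketch of one way to handle that, assuming word boundaries (\b) are acceptable for your data; the function name split_expression and the second sample string are made up for illustration:

```python
import re

# Operators to split on; \b keeps AND/OR from matching inside longer words.
TOKEN_SPLIT = re.compile(r'\s*(==|!=|\bAND\b|\bOR\b)\s*')

def split_expression(text):
    """Split text on ==, !=, AND, OR and keep the operators as tokens."""
    # The capturing group makes re.split return the operators too;
    # empty strings (e.g. from a leading operator) are dropped.
    return [tok for tok in TOKEN_SPLIT.split(text) if tok]

print(split_expression("A==B AND B==C OR C!=A OR JP Bank==Chase"))
print(split_expression("ORANGE==FRUIT AND BRAND!=Chase"))
```

The first call should reproduce the expected output from the question, and the second should leave ORANGE and BRAND intact.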

Like this?
import re
inString = "A==B AND B==C OR C!=A OR JP Bank==Chase"
outList = re.split('(==|!=|OR|AND)', inString)
# strip the whitespace left around each piece
outList = [x.strip() for x in outList]

Related

Python split by multiple separators, including space?

Input:
Some Text here: Java, PHP, JS, HTML 5, CSS, Web, C#, SQL, databases, AJAX, etc.
Code:
import re
input_words = list(re.split('\s+', input()))
print(input_words)
Works perfect and returns me:
['Some', 'Text', 'here:', 'Java,', 'PHP,', 'JS,', 'HTML', '5,', 'CSS,', 'Web,', 'C#,', 'SQL,', 'databases,', 'AJAX,', 'etc.']
But when add some other separators, like this:
import re
input_words = list(re.split('\s+ , ; : . ! ( ) " \' \ / [ ] ', input()))
print(input_words)
It doesn't split by spaces anymore, where am I wrong?
Expected output would be:
['Some', 'Text', 'here', 'Java', 'PHP', 'JS', 'HTML', '5', 'CSS', 'Web', 'C#', 'SQL', 'databases', 'AJAX', 'etc']
You should be splitting on a regex character class containing all those symbols:
input_words = re.split(r'[\s,;:.!()"\'\\\[\]]', input())
print(input_words)
This is a literal answer to your question. What you probably want is to split on runs of these symbols and whitespace, so that a symbol together with any surrounding spaces counts as a single separator, e.g.
text = "A B ; C.D ! E[F] G"
input_words = re.split(r'[\s,;:.!()"\'\\\[\]]+', text)
print(input_words)
Prints:
['A', 'B', 'C', 'D', 'E', 'F', 'G']
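Write the separators inside square brackets (a character class), as shown below, with the + after the closing bracket so that a run of separators counts as one. Hope it helps.
import re
input_words = re.split(r'[\s,:.!()]+', input())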
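Putting the character-class idea together with the sample input from the question, a minimal sketch might look like the following. The names SEPARATORS and split_words are illustrative only, and the set of separators is taken from the question (adjust it as needed):

```python
import re

# Separator characters from the question; '+' treats a run of them as one split point.
SEPARATORS = re.compile(r'[\s,;:.!()"\'\\/\[\]]+')

def split_words(text):
    # A separator at the very start or end leaves an empty string behind, so drop those.
    return [word for word in SEPARATORS.split(text) if word]

sample = "Some Text here: Java, PHP, JS, HTML 5, CSS, Web, C#, SQL, databases, AJAX, etc."
print(split_words(sample))
```

Because # is not in the character class, C# survives as a single word.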
Word tokenization using nltk module
#!/usr/bin/python3
import nltk
sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
words = nltk.tokenize.word_tokenize(sentence)
print(words)
output:
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', '...',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']

python regular expression to split string and get all words is not working

I'm trying to split string using regular expression with python and get all the matched literals.
RE: \w+(\.?\w+)*
this need to capture [a-zA-Z0-9_] like stuff only.
Here is example
but when I try to match and get all the contents from string, it doesn't return proper results.
Code snippet:
>>> import re
>>> from pprint import pprint
>>> pattern = r"\w+(\.?\w+)*"
>>> string = """this is some test string and there are some digits as well that need to be captured as well like 1234567890 and 321 etc. But it should also select _ as well. I'm pretty sure that that RE does exactly the same.
... Oh wait, it also need to filter out the symbols like !##$%^&*()-+=[]{}.,;:'"`| \(`.`)/
...
... I guess that's it."""
>>> pprint(re.findall(r"\w+(.?\w+)*", string))
[' etc', ' well', ' same', ' wait', ' like', ' it']
it's only returning some of words, but actually it should return all the words, numbers and underscore(s)[as in linked example].
python version: Python 3.6.2 (default, Jul 17 2017, 16:44:45)
Thanks.
You need to use a non-capturing group (when the pattern contains a capturing group, re.findall returns only what the group captured, not the whole match) and escape the dot (an unescaped . matches any character):
>>> import re
>>> from pprint import pprint
>>> pattern = r"\w+(?:\.?\w+)*"
>>> string = """this is some test string and there are some digits as well that need to be captured as well like 1234567890 and 321 etc. But it should also select _ as well. I'm pretty sure that that RE does exactly the same.
... Oh wait, it also need to filter out the symbols like !##$%^&*()-+=[]{}.,;:'"`| \(`.`)/
...
... I guess that's it."""
>>> pprint(re.findall(pattern, string, re.A))
['this', 'is', 'some', 'test', 'string', 'and', 'there', 'are', 'some', 'digits', 'as', 'well', 'that', 'need', 'to', 'be', 'captured', 'as', 'well', 'like', '1234567890', 'and', '321', 'etc', 'But', 'it', 'should', 'also', 'select', '_', 'as', 'well', 'I', 'm', 'pretty', 'sure', 'that', 'that', 'RE', 'does', 'exactly', 'the', 'same', 'Oh', 'wait', 'it', 'also', 'need', 'to', 'filter', 'out', 'the', 'symbols', 'like', 'I', 'guess', 'that', 's', 'it']
Also, to match only ASCII letters, digits and _, pass the re.A flag.
See the Python demo.
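To see the difference in isolation, here is a small sketch on a made-up string (the sample text is an assumption, not from the question). With a capturing group, re.findall reports only the last thing the group captured, or an empty string when the group did not take part in the match:

```python
import re

text = "version 1.2.3 and file.name_01"

# Capturing group: findall returns the group's last repetition, not the whole match.
captured = re.findall(r"\w+(\.?\w+)*", text)
print(captured)  # ['', '.3', '', '.name_01']

# Non-capturing group: findall returns each whole match.
whole = re.findall(r"\w+(?:\.?\w+)*", text)
print(whole)  # ['version', '1.2.3', 'and', 'file.name_01']
```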

Python .split() on a string on every splittable token bar whitespace, but ignore some specific strings

I am looking to split a sentence into tokens, but ignore 2 specific strings and also ignore spaces.
For example:
GNI per capita ; PPP -LRB- US dollar -RRB- in LOCATION_SLOT was last measured at NUMBER_SLOT in 2011 , according to the World Bank .
Should be split into [GNI,per,capita,;,PPP,-,LRB,-,US,dollar,-,RRB,-,in, LOCATION_SLOT,was,last,measured,at,NUMBER_SLOT,in,2011,,,according,to, the, World,Bank,.,].
I do not want LOCATION_SLOT or NUMBER_SLOT to be split, for example the former into [LOCATION,_,SLOT]. But I do want to account for dots.
My current function which only allows character based words but is removing numbers and things like ;,,,: etc is here - I don't want it to remove these:
def sentence_to_words(sentence, remove_stopwords=False):
    letters_only = re.sub("[^a-zA-Z| LOCATION_SLOT | NUMBER_SLOT]", " ", sentence)
    words = letters_only.lower().split()
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if not w in stops]
    return(words)
This generates these tokens:
gni per capita ppp lrb us dollar rrb location_slot last measured number_slot according world bank
You can use re.findall and strip the leading and trailing spaces from each match (line holds the sentence):
>>> [x.strip() for x in re.findall(r'\s*(\w+|\W+)', line)]
#['GNI', 'per', 'capita', ';', 'PPP', '-', 'LRB', '-', 'US', 'dollar', '-', 'RRB', '-', 'in', 'LOCATION_SLOT', 'was', 'last', 'measured', 'at', 'NUMBER_SLOT', 'in', '2011', ',', 'according', 'to', 'the', 'World', 'Bank', '.']
Regex Explanation
> \w matches word character [A-Za-z0-9_].
> \W is negation of \w. i.e. it matches anything except word character.
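A slightly different variant, offered as a sketch rather than the answer's exact method: match either a run of word characters or a single non-space, non-word character. This avoids the strip step and always emits punctuation one character at a time. The variable names are illustrative.

```python
import re

line = ("GNI per capita ; PPP -LRB- US dollar -RRB- in LOCATION_SLOT was last "
        "measured at NUMBER_SLOT in 2011 , according to the World Bank .")

# \w+ grabs a run of word characters, so LOCATION_SLOT stays whole
# (the underscore is a word character); [^\w\s] grabs one punctuation
# character; whitespace matches neither alternative and is skipped.
tokens = re.findall(r"\w+|[^\w\s]", line)
print(tokens)
```

Note that this version splits a sequence such as "--" into two tokens, whereas \W+ would keep it together.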
You can simply use split
>>> x = "GNI per capita ; PPP -LRB- US dollar -RRB- in LOCATION_SLOT was last measured at NUMBER_SLOT in 2011 , according to the World Bank ."
>>>
>>> x.split()
['GNI', 'per', 'capita', ';', 'PPP', '-LRB-', 'US', 'dollar', '-RRB-', 'in', 'LOCATION_SLOT', 'was', 'last', 'measured', 'at', 'NUMBER_SLOT', 'in', '2011', ',', 'according', 'to', 'the', 'World', 'Bank', '.']
To remove the - around -LRB- and -RRB-, strip it from each token:
>>> z = [y.strip('-') for y in x.split()]
>>> z
['GNI', 'per', 'capita', ';', 'PPP', 'LRB', 'US', 'dollar', 'RRB', 'in', 'LOCATION_SLOT', 'was', 'last', 'measured', 'at', 'NUMBER_SLOT', 'in', '2011', ',', 'according', 'to', 'the', 'World', 'Bank', '.']
>>>
If you want to keep the dashes as separate tokens:
>>> y = []
>>> for item in x.split():
...     if item.startswith('-') and item.endswith('-'):
...         y.append('-')
...         y.append(item.strip('-'))
...         y.append('-')
...     else:
...         y.append(item)
...

Separating a continuous string Python

I have been playing around with this code, which is meant to read a string of text without spaces. The code needs to separate the string into words by identifying all the capital letters using regular expressions. However, I can’t seem to get it to display the capital letters.
import re
mystring = 'ThisIsStringWithoutSpacesWordsTextManDogCow!'
wordList = re.sub("[^\^a-z]"," ",mystring)
print (wordList)
Try:
re.sub("([A-Z])"," \\1",mystring).split()
This prepends a space in front of every capital letter and splits on these spaces.
Output:
['This',
'Is',
'String',
'Without',
'Spaces',
'Words',
'Text',
'Man',
'Dog',
'Cow!']
As an alternative to sub, you could use re.findall to find all the words (beginning with an uppercase letter followed by zero or more non-uppercase characters) and then join them back together:
>>> ' '.join(re.findall(r'[A-Z][^A-Z]*', mystring))
'This Is String Without Spaces Words Text Man Dog Cow!'
Try
>>> re.split('([A-Z][a-z]*)', mystring)
['', 'This', '', 'Is', '', 'String', '', 'Without', '', 'Spaces', '', 'Words', '', 'Text', '', 'Man', '', 'Dog', '', 'Cow', '!']
This gives you word-by-word output. Even the ! is separated out.
If you don't want the extra '' entries, filter them out (a is the output of the command above). In Python 3, filter returns an iterator rather than a list, so a list comprehension is simpler:
>>> a = re.split('([A-Z][a-z]*)', mystring)
>>> [x for x in a if x != '']
['This', 'Is', 'String', 'Without', 'Spaces', 'Words', 'Text', 'Man', 'Dog', 'Cow', '!']
Not a regular expression solution, but you can do it in normal code as well :-)
mystring = 'ThisIsStringWithoutSpacesWordsTextManDogCow!'
output_list = []
index = 0  # start of the word currently being collected
for i, letter in enumerate(mystring):
    if i != index and letter.isupper():
        output_list.append(mystring[index:i])
        index = i
else:
    # the loop has finished: add the last word
    output_list.append(mystring[index:])
Now on topic, this could be what you are looking for:
mystring = re.sub(r"([a-z\d])([A-Z])", r'\1 \2', mystring)
# Makes the string space separated. You can use split to convert it to list
mystring = mystring.split()

Regex to split words in Python

I was designing a regex to split all the actual words from a given text:
Input Example:
"John's mom went there, but he wasn't there. So she said: 'Where are you'"
Expected Output:
["John's", "mom", "went", "there", "but", "he", "wasn't", "there", "So", "she", "said", "Where", "are", "you"]
I thought of a regex like that:
"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)"
After splitting in Python, the result contains None items and empty spaces.
How to get rid of the None items? And why didn't the spaces match?
Edit:
Splitting on spaces, will give items like: ["there."]
And splitting on non-letters, will give items like: ["John","s"]
And splitting on non-letters except ', will give items like: ["'Where","you'"]
Instead of regex, you can use string-functions:
to_be_removed = ".,:!" # all characters to be removed
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
for c in to_be_removed:
s = s.replace(c, '')
s.split()
BUT, in your example you do not want to remove the apostrophe in John's, yet you do want to remove it in you!!'. String operations fail at that point, and you need a finely tuned regex.
EDIT: a simple regex can probably solve your problem:
(\w[\w']*)
It captures a run that starts with a word character and keeps going while the next character is an apostrophe or a word character.
(\w[\w']*\w)
This second regex is for a very specific situation. The first regex can capture words like you'. The second avoids this and only captures an apostrophe if it is inside the word (not at the beginning or the end). The drawback is that it cannot capture the apostrophe in Moss' mom. You must decide whether to capture the trailing apostrophe in possessive names ending with s.
Example:
rgx = re.compile("([\w][\w']*\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
rgx.findall(s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you']
UPDATE 2: I found a bug in my regex! It cannot capture single letters followed by an apostrophe, like A'. The fixed regex is here:
(\w[\w']*\w|\w)
rgx = re.compile(r"(\w[\w']*\w|\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
rgx.findall(s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', 'a']
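The trailing-apostrophe trade-off described above can be seen side by side in a short sketch. The sample sentence is made up for illustration; only the two patterns come from this answer.

```python
import re

s = "Moss' mom said: 'Where are you'"

# Allows a trailing apostrophe: keeps "Moss'" but also keeps "you'".
loose = re.findall(r"\w[\w']*", s)
print(loose)   # ["Moss'", 'mom', 'said', 'Where', 'are', "you'"]

# Requires a word character at both ends: drops both trailing apostrophes.
strict = re.findall(r"\w[\w']*\w|\w", s)
print(strict)  # ['Moss', 'mom', 'said', 'Where', 'are', 'you']
```

Neither pattern can tell a possessive apostrophe from a closing quote, so the choice depends on which mistake is cheaper for your text.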
You have too many capturing groups in your regular expression; make them non-capturing:
(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)
Demo:
>>> import re
>>> s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
>>> re.split("(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)", s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', '']
That returns only one element that is empty.
This regex will only allow one ending apostrophe, which may be followed by one more character:
([\w][\w]*'?\w?)
Demo:
>>> import re
>>> s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
>>> re.compile("([\w][\w]*'?\w?)").findall(s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', "a'"]
I am new to Python, but I think I have figured it out:
import re
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
result = re.findall(r"(.+?)[\s'\",!]{1,}", s)
print(result)
result
['John', 's', 'mom', 'went', 'there', 'but', 'he', 'wasn', 't', 'there.', 'So', 'she', 'said:', 'Where', 'are', 'you']
