Split string based on delimiter and word using re


I am using Python for natural language processing and am trying to split my input string with re. I want to split on the characters ";", "," and "." as well as on the word "but".
import re
print (re.split("[;,.]", 'i am; working here but you are. working here, as well'))
['i am', ' working here but you are', ' working here', ' as well']
How can I do that? When I add the word but to the regex, each of its characters is treated as a separate splitting criterion. How do I get the following output?
['i am', ' working here', 'you are', ' working here', ' as well']

You can use an alternation: but |[;,.]
It splits on the characters ";", "," and "." and also on the word "but" followed by a space.
import re
print(re.split(r"but |[;,.]", 'i am; working here but you are. working here, as well'))
Hope this helps.

This one works as well (a raw string keeps \. from being read as a string escape):
import re
print(re.split(r'; |, |\. | but', 'i am; working here but you are. working here, as well'))
Output:
['i am', 'working here', ' you are', 'working here', 'as well']
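
Both patterns above leave some surrounding whitespace in the pieces, and they would also split on longer words that merely end or begin with "but". A minimal sketch of one way to tighten this, assuming whole-word matching and trimmed pieces are what is wanted (the names text, pattern and parts are illustrative):

```python
import re

text = 'i am; working here but you are. working here, as well'

# [;,.] matches any single delimiter character;
# \bbut\b matches "but" only as a whole word (not inside "butter").
# The \s* on both sides consumes whitespace around each delimiter,
# so the resulting pieces come out trimmed.
pattern = r'\s*(?:[;,.]|\bbut\b)\s*'

parts = re.split(pattern, text)
print(parts)
# ['i am', 'working here', 'you are', 'working here', 'as well']
```

A delimiter at the very end of the input would produce a trailing empty string, which can be filtered out if it is not wanted.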

Related

Parse sentences with [value](type) format

I want to parse and extract key, values from a given sentence which follow the following format:
I want to get [samsung](brand) within [1 week](duration) to be happy.
I want to convert it into a split list like below:
['I want to get ', 'samsung:brand', ' within ', '1 week:duration', ' to be happy.']
I have tried splitting on the bracket and parenthesis characters:
re.split('\[|\]|\(|\)',s)
which is giving output:
['I want to get ',
'samsung',
'',
'brand',
' within ',
'1 week',
'',
'duration',
' to be happy.']
and
re.split('\[||\]|\(|\)',s)
is giving below output :
['I want to get ',
'samsung](brand) within ',
'1 week](duration) to be happy.']
Any help is appreciated.
Note: This is similar to Stack Overflow inline links, where typing go to [this link](http://google.com) is parsed as a link.

As a first step we split the string, keeping each [value](type) item as its own piece, and in a second step we rewrite those pieces:
import re
s = 'I want to get [samsung](brand) within [1 week](duration) to be happy.'
s = re.split(r'(\[[^]]*\]\([^)]*\))', s)
s = [re.sub(r'\[([^]]*)\]\(([^)]*)\)', r'\1:\2', i) for i in s]
print(s)
Prints:
['I want to get ', 'samsung:brand', ' within ', '1 week:duration', ' to be happy.']

You may use a two-step approach: first rewrite each [...](...) item into the required format and wrap it in rare, otherwise unused characters, then split on that wrapper pattern.
Example:
import re
s = "I want to get [samsung](brand) within [1 week](duration) to be happy."
print(re.split(r'｟([^｟｠]+)｠', re.sub(r'\[([^][]*)]\(([^()]*)\)', r'｟\1:\2｠', s)))
The \[([^][]*)]\(([^()]*)\) pattern matches:
\[ - a [ char
([^][]*) - Group 1 (\1): any 0+ chars other than [ and ]
]\( - a ]( substring
([^()]*) - Group 2 (\2): any 0+ chars other than ( and )
\) - a ) char.
The ｟([^｟｠]+)｠ pattern matches any ｟...｠ substring, and because the text in between is captured, re.split keeps it in the result.

You could replace the ]( sequence with a colon first, then split on the remaining [ and ) characters:
re.split(r'\[|\)', re.sub(r'\]\(', ':', s))

One approach, using re.split and then a lambda function to rewrite each part (note that it also drops the spaces around each item):
import re
sentence = "I want to get [samsung](brand) within [1 week](duration) to be happy."
parts = re.split(r'(?<=[\])])\s+|\s+(?=[\[(])', sentence)
processTerms = lambda x: re.sub(r'\[([^\]]+)\]\(([^)]+)\)', r'\1:\2', x)
parts = list(map(processTerms, parts))
print(parts)
Output:
['I want to get', 'samsung:brand', 'within', '1 week:duration', 'to be happy.']
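
The answers above split and rewrite in separate steps. The same result can be built in a single pass with re.finditer, which also makes it easy to keep the value and type separate if that is needed later. This is only a sketch; the names PAIR and split_pairs are hypothetical and not taken from any of the answers:

```python
import re

# One [value](type) item: group 1 is the value, group 2 is the type.
PAIR = re.compile(r'\[([^\]]*)\]\(([^)]*)\)')

def split_pairs(text):
    """Return plain-text pieces and 'value:type' pieces in their original order."""
    parts = []
    pos = 0
    for match in PAIR.finditer(text):
        # Keep the plain text between the previous item and this one.
        if match.start() > pos:
            parts.append(text[pos:match.start()])
        parts.append(f'{match.group(1)}:{match.group(2)}')
        pos = match.end()
    # Keep whatever follows the last item.
    if pos < len(text):
        parts.append(text[pos:])
    return parts

s = 'I want to get [samsung](brand) within [1 week](duration) to be happy.'
print(split_pairs(s))
# ['I want to get ', 'samsung:brand', ' within ', '1 week:duration', ' to be happy.']
```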

Regex - Matching String values and Date together

I have a string that looks like the following, and I need to extract the key : value pairs from it using regex.
line ="Date : 20/20/20 Date1 : 15/15/15 Name : Hello World Day : Month Weekday : Monday"
1) Extracting the key or attributes only.
re.findall(r'\w+\s?(?=:)',line)
#['Date ', 'Date1 ', 'Name ', 'Day ', 'Weekday ']
2)Extracting the dates only
re.findall(r'(?<=:)\s?\d{2}/\d{2}/\d{2}',line)
#[' 20/20/20', ' 15/15/15']
3) Extracting the strings works, but it also returns the dates in a truncated form.
re.findall(r'(?<=:)\s?\w+\s?\w+',line)
# [' 20', ' 15', ' Hello World', ' Month', ' Monday']
But when I try to use the OR operator to pull both the strings and dates I get wrong output. I believe the piping has not worked properly.
re.findall(r'(?<=:)\s?\w+\s?\w+|\s?\d{2}/\d{2}/\d{2}',line)
# [' 20', ' 15', ' Hello World', ' Month', ' Monday']
Any help with adjusting the above command so that it extracts both the dates (dd/mm/yy format) and the string values would be highly appreciated.

You need to flip the alternatives around:
\s?\d{2}/\d{2}/\d{2}|(?<=:)\s?\w+\s?\w+
At each position the regex engine tries the alternatives from left to right, and once one succeeds it does not try the next. In your pattern the string alternative comes first, so \w+ matches the first number of the date. Since / is not a word character (\w), the match stops at that point.
Flipping the order makes the engine try the date first and fall back to the string alternative only if the date does not match, which avoids the problem.
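
The effect of alternative order can be seen in isolation on a single date. The sketch below uses simplified patterns (plain \w+ rather than the full pattern from the question) purely to illustrate the left-to-right behaviour; the variable names are illustrative:

```python
import re

sample = '20/20/20'

# Word alternative first: \w+ succeeds at each position and stops at every "/".
word_first = re.findall(r'\w+|\d{2}/\d{2}/\d{2}', sample)

# Date alternative first: the whole date is matched in one piece.
date_first = re.findall(r'\d{2}/\d{2}/\d{2}|\w+', sample)

print(word_first)  # ['20', '20', '20']
print(date_first)  # ['20/20/20']
```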

How to split at spaces and commas in Python?

I've been looking around here, but I didn't find anything that was close to my problem. I'm using Python3.
I want to split a string at every whitespace and at commas. Here is what I got now, but I am getting some weird output:
(Don't worry, the sentence is translated from German)
import re
sentence = "We eat, Granny"
split = re.split(r'(\s|\,)', sentence.strip())
print (split)
>>>['We', ' ', 'eat', ',', '', ' ', 'Granny']
What I actually want to have is:
>>>['We', ' ', 'eat', ',', ' ', 'Granny']

I'd go for findall instead of split and just match all the desired contents, like
import re
sentence = "We eat, Granny"
print(re.findall(r'\s|,|[^,\s]+', sentence))

This should work for you:
import re
sentence = "We eat, Granny"
split = list(filter(None, re.split(r'(\s|\,)', sentence.strip())))
print (split)

Alternate way:
import re
sentence = "We eat, Granny"
split = [a for a in re.split(r'(\s|\,)', sentence.strip()) if a]
print(split)
Output:
['We', ' ', 'eat', ',', ' ', 'Granny']
Works with both python 2.7 and 3
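
None of the answers spells out where the empty string comes from, so here is a short sketch of the mechanism (variable names are illustrative). With a capturing group, re.split returns the separators as well, and the comma and the following space are two adjacent separators with nothing between them:

```python
import re

sentence = "We eat, Granny"

# The capturing group makes re.split return the separators too.
# Between the "," and the " " right after it there is no text,
# so an empty string is returned for that gap.
raw = re.split(r'(\s|,)', sentence)
print(raw)       # ['We', ' ', 'eat', ',', '', ' ', 'Granny']

# Dropping the empty strings gives the same tokens as matching them with findall.
filtered = [tok for tok in raw if tok]
matched = re.findall(r'\s|,|[^,\s]+', sentence)
print(filtered)  # ['We', ' ', 'eat', ',', ' ', 'Granny']
print(filtered == matched)  # True
```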

Can I use regular expressions re.sub() with a numpy array or list of strings?

I have a numpy array of entries with dtype=string_. I would like to use the re module to collapse all excess whitespace: spaces, \t tabs and \n newlines.
If I was working with a single string, I would use re.sub() as follows:
import re
proust = 'If a little dreaming is dangerous, \t the cure for it is not to dream less but to dream more. \t\t'
newstring = re.sub(r"\s+", " ", proust)
which returns
'If a little dreaming is dangerous, the cure for it is not to dream less but to dream more. '
To do this for each entry of a numpy array, I presumably need some kind of for loop.
Something like for i in numpy_arr:, but I'm not sure what should follow so as to apply re.sub() to every element of the array.
What is the most sensible approach to this problem?
EDIT:
My original numpy array or list is a LONG list/array of entries, each entry one sentence like the above. An example of five entries is below:
original_list = [ 'to be or \n\n not to be that is the question',
' to be or not to be that is the question\t ',
'to be or not to be that is the question',
'to be or not to be that is the question\t ',
'to be or not to be that is \t the question']

This isn't exactly your re.sub, but the effect is the same, if not better:
In [109]: oarray
Out[109]:
array(['to be or \n\n not to be that is the question',
' to be or not to be that is the question\t ',
'to be or not to be that is the question',
'to be or not to be that is the question\t ',
'to be or not to be that is \t the question'],
dtype='<U55')
In [110]: np.char.join(' ',np.char.split(oarray))
Out[110]:
array(['to be or not to be that is the question',
'to be or not to be that is the question',
'to be or not to be that is the question',
'to be or not to be that is the question',
'to be or not to be that is the question'],
dtype='<U39')
It works in this case because split() recognizes the same whitespace set of characters as '\s+'.
np.char.replace will replace selected characters, but it would have to be applied several times to remove '\n', then '\t', and so on. There is also np.char.translate.
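
If staying with re.sub is preferred, a plain list comprehension is probably the simplest loop. The sketch below assumes the entries are str rather than bytes (a bytes array would need a bytes pattern), and the names WHITESPACE and cleaned are illustrative. Unlike the split/join approach, re.sub leaves a single space where leading or trailing whitespace was, so strip() is added to get the same result:

```python
import re

original_list = [
    'to be or \n\n not to be that is the question',
    ' to be or not to be that is the question\t ',
    'to be or not to be that is \t the question',
]

# Compile once, since the same pattern is applied to every entry.
WHITESPACE = re.compile(r'\s+')

# Collapse each whitespace run to one space, then trim both ends.
cleaned = [WHITESPACE.sub(' ', entry).strip() for entry in original_list]
print(cleaned)
```

The same comprehension can iterate over a NumPy array of strings, and the resulting list can be passed to np.array if an array is needed afterwards.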

Multiline python regex

I have a file structured like this :
A: some text
B: more text
even more text
on several lines
A: and we start again
B: more text
more
multiline text
I'm trying to find the regex that will split my file like this :
>>>re.findall(regex,f.read())
[('some text','more text','even more text\non several lines'),
('and we start again','more text', 'more\nmultiline text')]
So far, I've ended up with the following :
>>>re.findall('A:(.*?)\nB:(.*?)\n(.*?)',f.read(),re.DOTALL)
[(' some text', ' more text', ''), (' and we start again', ' more text', '')]
The multiline text is not captured. I guess this is because the lazy quantifier is really lazy and captures nothing, but if I take it out, the regex gets really greedy:
>>>re.findall('A:(.*?)\nB:(.*?)\n(.*)',f.read(),re.DOTALL)
[(' some text',
' more text',
'even more text\non several lines\nA: and we start again\nB: more text\nmore\nmultiline text')]
Does anyone have an idea? Thanks!

You could tell the regex to stop matching at the next line that starts with A: (or at the end of the string):
re.findall(r'A:(.*?)\nB:(.*?)\n(.*?)(?=^A:|\Z)', f.read(), re.DOTALL|re.MULTILINE)
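
A self-contained sketch of this lookahead approach follows, with the file contents inlined as a string so it can be run directly. Two details are assumptions beyond the answer: a leading ^ anchors each record to the start of a line, and each captured group is stripped, because the raw groups keep the space after "A:" and the newline before the next record:

```python
import re

# Stand-in for f.read(); same layout as the file in the question.
text = (
    "A: some text\n"
    "B: more text\n"
    "even more text\n"
    "on several lines\n"
    "A: and we start again\n"
    "B: more text\n"
    "more\n"
    "multiline text\n"
)

# DOTALL lets . cross newlines; MULTILINE lets ^ match at each line start.
# The lookahead stops the third group right before the next "A:" line
# or at the end of the string (\Z).
pattern = re.compile(r'^A:(.*?)\nB:(.*?)\n(.*?)(?=^A:|\Z)', re.DOTALL | re.MULTILINE)

# findall returns one tuple of groups per record; strip the stray whitespace.
records = [tuple(part.strip() for part in groups) for groups in pattern.findall(text)]
print(records)
# [('some text', 'more text', 'even more text\non several lines'),
#  ('and we start again', 'more text', 'more\nmultiline text')]
```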
