I have a file structured like this:
A: some text
B: more text
even more text
on several lines
A: and we start again
B: more text
more
multiline text
I'm trying to find the regex that will split my file like this:
>>>re.findall(regex,f.read())
[('some text','more text','even more text\non several lines'),
('and we start again','more text', 'more\nmultiline text')]
So far, I've ended up with the following:
>>>re.findall('A:(.*?)\nB:(.*?)\n(.*?)',f.read(),re.DOTALL)
[(' some text', ' more text', ''), (' and we start again', ' more text', '')]
The multiline text is not captured. I guess it's because the lazy quantifier is really lazy and captures nothing, but if I take it out, the regex gets really greedy:
>>>re.findall('A:(.*?)\nB:(.*?)\n(.*)',f.read(),re.DOTALL)
[(' some text',
' more text',
'even more text\non several lines\nA: and we start again\nB: more text\nmore\nmultiline text')]
Does anyone have an idea? Thanks!
You could tell the regex to stop matching at the next line that starts with A: (or at the end of the string):
re.findall(r'A:(.*?)\nB:(.*?)\n(.*?)(?=^A:|\Z)', f.read(), re.DOTALL|re.MULTILINE)
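For reference, here is that regex run on the sample data from the question (note that the captured groups keep the leading space after A:/B:, and the first multiline block keeps a trailing newline, which is consumed so that ^ in the lookahead can match):

```python
import re

text = ("A: some text\n"
        "B: more text\n"
        "even more text\n"
        "on several lines\n"
        "A: and we start again\n"
        "B: more text\n"
        "more\n"
        "multiline text")

# lazy third group, stopped by a lookahead for the next "A:" line
# (re.MULTILINE makes ^ match at line starts) or end of string (\Z)
pairs = re.findall(r'A:(.*?)\nB:(.*?)\n(.*?)(?=^A:|\Z)',
                   text, re.DOTALL | re.MULTILINE)
# → [(' some text', ' more text', 'even more text\non several lines\n'),
#    (' and we start again', ' more text', 'more\nmultiline text')]
```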
Related
I want to parse and extract key/value pairs from a given sentence which follows the following format:
I want to get [samsung](brand) within [1 week](duration) to be happy.
I want to convert it into a split list like below:
['I want to get ', 'samsung:brand', ' within ', '1 week:duration', ' to be happy.']
I have tried to split it using [ or ) :
re.split('\[|\]|\(|\)',s)
which is giving output:
['I want to get ',
'samsung',
'',
'brand',
' within ',
'1 week',
'',
'duration',
' to be happy.']
and
re.split('\[||\]|\(|\)',s)
is giving the output below:
['I want to get ',
'samsung](brand) within ',
'1 week](duration) to be happy.']
Any help is appreciated.
Note: This is similar to Stack Overflow inline links as well, where if we type go to [this link](http://google.com) it parses it as a link.
As a first step we split the string, and as a second step we modify each piece:
import re

s = 'I want to get [samsung](brand) within [1 week](duration) to be happy.'
s = re.split(r'(\[[^]]*\]\([^)]*\))', s)
s = [re.sub(r'\[([^]]*)\]\(([^)]*)\)', r'\1:\2', i) for i in s]
print(s)
Prints:
['I want to get ', 'samsung:brand', ' within ', '1 week:duration', ' to be happy.']
You may use a two-step approach: process the [...](...) substrings first, formatting them as needed and protecting them with some rare/unused chars, and then split on that wrapper pattern.
Example:
s = "I want to get [samsung](brand) within [1 week](duration) to be happy.";
print(re.split(r'⦅([^⦅⦆]+)⦆', re.sub(r'\[([^][]*)]\(([^()]*)\)', r'⦅\1:\2⦆', s)))
The \[([^][]*)]\(([^()]*)\) pattern matches
\[ - a [ char
([^][]*) - Group 1 ($1): any 0+ chars other than [ and ]
]\( - ]( substring
([^()]*) - Group 2 ($2): any 0+ chars other than ( and )
\) - a ) char.
The ⦅([^⦅⦆]+)⦆ pattern just matches any ⦅...⦆ substring but keeps what is in between as it is captured.
You could replace the ]( separator with : first, then split on the remaining [ and ) characters:
re.split(r'\[|\)', re.sub(r'\]\(', ':', s))
One approach, using re.split with a lambda function:
sentence = "I want to get [samsung](brand) within [1 week](duration) to be happy."
parts = re.split(r'(?<=[\])])\s+|\s+(?=[\[(])', sentence)
processTerms = lambda x: re.sub(r'\[([^\]]+)\]\(([^)]+)\)', r'\1:\2', x)
parts = list(map(processTerms, parts))
print(parts)
['I want to get', 'samsung:brand', 'within', '1 week:duration', 'to be happy.']
I am using Python for natural language processing. I am trying to split my input string using re. I want to split on ;, and . as well as on the word but.
import re
print (re.split("[;,.]", 'i am; working here but you are. working here, as well'))
['i am', ' working here but you are', ' working here', ' as well']
How to do that? When I put the word but inside the character class, it treats every character as a splitting criterion. How do I get the following output?
['i am', ' working here', 'you are', ' working here', ' as well']
You can combine them with an alternation: but |[;,.]
It will split on the characters ;, and . but also on the word but!
import re
print (re.split("but |[;,.]", 'i am; working here but you are. working here, as well'))
Hope this helps.
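For reference, here is that split run on the sample sentence. Note the trailing space kept in ' working here ', since 'but ' consumes only the space after the word, not the one before it:

```python
import re

sentence = 'i am; working here but you are. working here, as well'
# alternation: split on 'but ' (with its trailing space) or on ; , .
parts = re.split(r"but |[;,.]", sentence)
# → ['i am', ' working here ', 'you are', ' working here', ' as well']
```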
Even this one works:
import re
print (re.split(r'; |, |\. | but', 'i am; working here but you are. working here, as well'))
Output:
['i am', 'working here', ' you are', 'working here', 'as well']
I have a string that looks like the following and am supposed to extract the key : value pairs, and I am using regex for the same.
line ="Date : 20/20/20 Date1 : 15/15/15 Name : Hello World Day : Month Weekday : Monday"
1) Extracting the key or attributes only.
re.findall(r'\w+\s?(?=:)',line)
#['Date ', 'Date1 ', 'Name ', 'Day ', 'Weekday ']
2) Extracting the dates only
re.findall(r'(?<=:)\s?\d{2}/\d{2}/\d{2}',line)
#[' 20/20/20', ' 15/15/15']
3) Extracting the strings, but also some wrong-format dates.
re.findall(r'(?<=:)\s?\w+\s?\w+',line)
# [' 20', ' 15', ' Hello World', ' Month', ' Monday']
But when I try to use the OR operator to pull both the strings and the dates I get the wrong output. I believe the alternation has not worked properly.
re.findall(r'(?<=:)\s?\w+\s?\w+|\s?\d{2}/\d{2}/\d{2}',line)
# [' 20', ' 15', ' Hello World', ' Month', ' Monday']
Any help on the above command to extract both the dates (dd/mm/yy) format and the string values will be highly appreciated.
You need to flip it around.
\s?\d{2}/\d{2}/\d{2}|(?<=:)\s?\w+\s?\w+
The regex will first try to match the first alternative. If it succeeds, it will not try the next one. The reason it then breaks is that \w matches the first number of the date. Since / isn't a \w (word character), it stops at that point.
Flipping it around makes it first try matching the date. If it doesn't match then it tries matching an attribute. Thus avoiding the problem.
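A quick check of the flipped pattern on the question's line (illustration only; the point is that the dates now come out whole):

```python
import re

line = ("Date : 20/20/20 Date1 : 15/15/15 Name : Hello World "
        "Day : Month Weekday : Monday")

# the date alternative is tried first, so it wins wherever both could match
values = re.findall(r'\s?\d{2}/\d{2}/\d{2}|(?<=:)\s?\w+\s?\w+', line)
# the two dates and ' Hello World' are now extracted intact
```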
I am writing a latex to text converter, and I'm basing my work on top of a well-known Python parser for latex (python-latex). I am improving it day after day, but now I have a problem when parsing multiple commands inside one line. A latex command can be in the following four forms:
\commandname
\commandname[text]
\commandname{other text}
\commandname[text]{other text}
In the assumption that the commands are not split over lines, and that there could be spaces in the text (but not in the command name), I ended up with the following regexp to catch a command in a line:
'(\\.+\[*.*\]*\{.*\})'
In fact, a sample program is working:
string="\documentclass[this is an option]{this is a text} this is other text ..."
re.split(r'(\\.+\[*.*\]*\{.*\}|\w+)+?', string)
>>>['', '\\documentclass[this is an option]{this is a text}', ' ', 'this', ' ', 'is', ' ', 'other', ' ', 'text', ' ...']
Well, to be honest, I would prefer an output like this:
>>> [ '\\documentclass[this is an option]{this is a text}', 'this is other text ...' ]
But the first one can work anyway. Now, my problem arises if, in one line, there are more than one command, like in the following example:
dstring=string+" \emph{tt}"
print (dstring)
\documentclass[this is an option]{this is a text} this is other text ... \emph{tt}
re.split(r'(\\.+\[*.*\]*\{.*\}|\w+)+?', dstring)
['', '\\documentclass[this is an option]{this is a text} this is other text ... \\emph{tt}', '']
As you can see, the result is quite different from the one that I would like:
[ '\\documentclass[this is an option]{this is a text}', 'this is other text ...', '\\emph{tt}']
I have tried to use lookahead and lookbehind assertions, but since lookbehind requires a fixed number of characters, it is impossible to use them here. I hope there is a solution.
Thank you!
You can accomplish this simply with github.com/alvinwan/TexSoup. This will give you what you want, albeit with whitespaces preserved.
>>> from TexSoup import TexSoup
>>> string = "\documentclass[this is an option]{this is a text} this is other text ..."
>>> soup = TexSoup(string)
>>> list(soup.contents)
[\documentclass[this is an option]{this is a text}, ' this is other text ...']
>>> string2 = string + "\emph{tt}"
>>> soup2 = TexSoup(string2)
>>> list(soup2.contents)
[\documentclass[this is an option]{this is a text}, ' this is other text ...', \emph{tt}]
Disclaimer: I know (1) I'm posting over a year later and (2) OP asks for regex, but assuming the task is tool-agnostic, I'm leaving this here for folks with similar problems. Also, I wrote TexSoup, so take this suggestion with a grain of salt.
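If a regex-only approach is still preferred, one sketch (assuming commands are never nested and brackets/braces are balanced) is to split on the whole command pattern and keep it via a capturing group:

```python
import re

# \commandname, optionally followed by [...] and/or {...}
CMD = r'\\\w+(?:\[[^\]]*\])?(?:\{[^}]*\})?'

def split_tex(line):
    # split around commands (the capturing group keeps them in the result),
    # then drop empty pieces and strip whitespace from the plain-text ones
    parts = re.split('(' + CMD + ')', line)
    return [p if p.startswith('\\') else p.strip()
            for p in parts if p.strip()]
```

On the dstring example this yields the three pieces the question asks for: the two commands plus 'this is other text ...'.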
I have a numpy array of entries with dtype=string_. I would like to use the regular expressions re module to replace all excess spaces, \t tabs, and \n newlines.
If I was working with a single string, I would use re.sub() as follows:
import re
proust = 'If a little dreaming is dangerous, \t the cure for it is not to dream less but to dream more. \t\t'
newstring = re.sub(r"\s+", " ", proust)
which returns
'If a little dreaming is dangerous, the cure for it is not to dream less but to dream more. '
To do this in each entry of a numpy array, I should somehow use a for loop. Something like for i in numpy_arr:, but I'm not sure what should follow this so as to apply re.sub() to every numpy array element.
What is the most sensible approach to this problem?
EDIT:
My original numpy array or list is a LONG list/array of entries, each entry one sentence like the above. An example of five entries is below:
original_list = [ 'to be or \n\n not to be that is the question',
' to be or not to be that is the question\t ',
'to be or not to be that is the question',
'to be or not to be that is the question\t ',
'to be or not to be that is \t the question']
This isn't exactly your re.sub, but the effect is the same, if not better:
In [109]: oarray
Out[109]:
array(['to be or \n\n not to be that is the question',
' to be or not to be that is the question\t ',
'to be or not to be that is the question',
'to be or not to be that is the question\t ',
'to be or not to be that is \t the question'],
dtype='<U55')
In [110]: np.char.join(' ', np.char.split(oarray))
Out[110]:
array(['to be or not to be that is the question',
'to be or not to be that is the question',
'to be or not to be that is the question',
'to be or not to be that is the question',
'to be or not to be that is the question'],
dtype='<U39')
It works in this case because split() with no argument recognizes the same set of whitespace characters as '\s+'.
np.char.replace will replace selected substrings, but it would have to be applied several times to remove '\n', then '\t', etc. There is also np.char.translate.