Regexp to catch multiple LaTeX commands in a single line - Python

I am writing a LaTeX-to-text converter, and I'm basing my work on a well-known Python parser for LaTeX (python-latex). I am improving it day after day, but now I have a problem when parsing multiple commands in one line. A LaTeX command can take the following four forms:
\commandname
\commandname[text]
\commandname{other text}
\commandname[text]{other text}
Assuming that commands are not split across lines, and that there can be spaces in the text (but not in the command name), I ended up with the following regexp to catch a command in a line:
'(\\.+\[*.*\]*\{.*\})'
In fact, a sample program is working:
string="\documentclass[this is an option]{this is a text} this is other text ..."
re.split(r'(\\.+\[*.*\]*\{.*\}|\w+)+?', string)
>>>['', '\\documentclass[this is an option]{this is a text}', ' ', 'this', ' ', 'is', ' ', 'other', ' ', 'text', ' ...']
Well, to be honest, I would prefer an output like this:
>>> [ '\\documentclass[this is an option]{this is a text}', 'this is other text ...' ]
But the first one can work anyway. Now, my problem arises when a line contains more than one command, as in the following example:
dstring=string+" \emph{tt}"
print (dstring)
\documentclass[this is an option]{this is a text} this is other text ... \emph{tt}
re.split(r'(\\.+\[*.*\]*\{.*\}|\w+)+?', dstring)
['', '\\documentclass[this is an option]{this is a text} this is other text ... \\emph{tt}', '']
As you can see, the result is quite different from the one that I would like:
[ '\\documentclass[this is an option]{this is a text}', 'this is other text ...', '\\emph{tt}']
I have tried to use lookahead and lookbehind assertions, but since lookbehind requires a fixed number of characters, I could not make them work. I hope there is a solution.
Thank you!

You can accomplish this simply with github.com/alvinwan/TexSoup. It will give you what you want, albeit with whitespace preserved.
>>> from TexSoup import TexSoup
>>> string = "\documentclass[this is an option]{this is a text} this is other text ..."
>>> soup = TexSoup(string)
>>> list(soup.contents)
[\documentclass[this is an option]{this is a text}, ' this is other text ...']
>>> string2 = string + "\emph{tt}"
>>> soup2 = TexSoup(string2)
>>> list(soup2.contents)
[\documentclass[this is an option]{this is a text}, ' this is other text ...', \emph{tt}]
Disclaimer: I know (1) I'm posting over a year later and (2) OP asks for regex, but assuming the task is tool-agnostic, I'm leaving this here for folks with similar problems. Also, I wrote TexSoup, so take this suggestion with a grain of salt.
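If a pure-regex route is still preferred, the usual fix is to replace the greedy `.*` pieces with negated character classes, so each match stops at its own closing bracket instead of swallowing the rest of the line. A sketch under the question's assumptions (no nesting, no commands split across lines):

```python
import re

# Negated classes [^]]* and [^}]* stop at the first closing bracket,
# so two commands on one line yield two separate matches.
command = re.compile(r'\\\w+(?:\[[^\]]*\])?(?:\{[^}]*\})?')

dstring = (r"\documentclass[this is an option]{this is a text}"
           r" this is other text ... \emph{tt}")
print(command.findall(dstring))
# ['\\documentclass[this is an option]{this is a text}', '\\emph{tt}']
```

Splitting on the same pattern then separates commands from the surrounding plain text.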

Related

Multiple distinct replaces using RegEx

I am trying to write some Python code that will replace some unwanted strings using RegEx. The code I have written was taken from another question on this site.
I have a text:
text_1=u'I\u2019m \u2018winning\u2019, I\u2019ve enjoyed none of it. That\u2019s why I\u2019m withdrawing from the market,\u201d wrote Arment.'
I want to remove all occurrences of \u2019m, \u2019s, \u2019ve, and so on.
The code that I've written is given below:
rep={"\n":" ","\n\n":" ","\n\n\n":" ","\n\n\n\n":" ",u"\u201c":"", u"\u201d":"", u"\u2019[a-z]":"", u"\u2013":"", u"\u2018":""}
rep = dict((re.escape(k), v) for k, v in rep.iteritems())
pattern = re.compile("|".join(rep.keys()))
text = pattern.sub(lambda m: rep[re.escape(m.group(0))], text_1)
The code works perfectly for:
"u"\u201c":"", u"\u201d":"", u"\u2013":"" and u"\u2018":""
However, it doesn't work that great for:
u"\u2019[a-z]": the presence of [a-z] means re.escape turns it into \\[a\\-z\\], which doesn't match.
The output I am looking for is:
text_1=u'I winning, I enjoyed none of it. That why I withdrawing from the market,wrote Arment.'
How do I achieve this?
The information about the newlines completely changes the answer. For this, I think building the expression using a loop is actually less legible than just using better formatting in the pattern itself.
replacements = {'newlines': ' ',
                'deletions': ''}
pattern = re.compile(u'(?P<newlines>\n+)|'
                     u'(?P<deletions>\u201c|\u201d|\u2019[a-z]?|\u2013|\u2018)')

def lookup(match):
    return replacements[match.lastgroup]

text = pattern.sub(lookup, text_1)
The problem here is actually the escaping, this code does what you want more directly:
remove = (u"\u201c", u"\u201d", u"\u2019[a-z]?", u"\u2013", u"\u2018")
pattern = re.compile("|".join(remove))
text = pattern.sub("", text_1)
I've added the ? to the u2019 match, as I suppose that's what you want as well given your test string.
For completeness, I think I should also link to the Unidecode package which may actually be more closely what you're trying to achieve by removing these characters.
The simplest way is this regex:
X = re.compile(r'(\\(.*?)) ')
text = re.sub(X, ' ', text_1)
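For what it's worth, the OP's dict-of-replacements idea can also be salvaged by escaping only the keys that are literal text, leaving genuine patterns unescaped. A sketch; the `[a-z]*` quantifier and the newline-to-space handling are assumptions based on the question:

```python
import re

text_1 = (u'I\u2019m \u2018winning\u2019, I\u2019ve enjoyed none of it. '
          u'That\u2019s why I\u2019m withdrawing from the market,\u201d wrote Arment.')

# Literal keys get escaped; real regex fragments are appended as-is.
literal = {u'\u201c': u'', u'\u201d': u'', u'\u2013': u'', u'\u2018': u''}
fragments = [re.escape(k) for k in literal] + [u'\u2019[a-z]*', u'\n+']
pattern = re.compile(u'|'.join(fragments))

def repl(match):
    token = match.group(0)
    if token.startswith(u'\n'):
        return u' '                 # collapse newline runs to one space
    return literal.get(token, u'')  # tokens matched by the patterns are deleted

text = pattern.sub(repl, text_1)
print(text)
```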

When splitting text lines after a full stop, how do I specify when NOT to split like after the title 'Dr.'? [duplicate]

This question already has answers here:
How can I split a text into sentences?
(20 answers)
Closed 7 years ago.
#!/usr/bin/python
# Opening the file
myFile = open('Text File.txt', 'r')
# Printing the file's original text first
for line in myFile.readlines():
    print line
# Splitting the text
varLine = line
splitLine = varLine.split(". ")
# Printing the edited text
print splitLine
# Closing the file
myFile.close()
When opening a .txt file in a Python program, I want the output text to be formatted like sentences, i.e. a new line is generated after each full stop. That is what I have achieved so far; however, I do not know how to prevent this from happening when a full stop does not end a sentence, as with titles like 'Dr.' or abbreviations like 'i.e.'.
Best way, if you control the input, is to use two spaces at the end of a sentence (like people should, IMHO), then split on '.  ' (full stop plus two spaces) so you don't touch the 'Dr.' or 'i.e.'.
If you don't control the input... I'm not sure this is really Pythonic, but here's one way you could do it: use a placeholder to identify all locations where you want to save the period. Below, I assume 'XYZ' never shows up in my text. You can make it as complex as you like, and it will be better the more complex it is (less likely to run into it that way).
sentence = "Hello, Dr. Brown. Nice to meet you. I'm Bob."
targets = ['Dr.', 'i.e.', 'etc.']
placeholder = 'XYZ'
replacements = [t.replace('.', placeholder) for t in targets]
# replacements now looks like: ['DrXYZ', 'iXYZeXYZ', 'etcXYZ']
for i in range(len(targets)):
    sentence = sentence.replace(targets[i], replacements[i])
# sentence now looks like: "Hello, DrXYZ Brown. Nice to meet you. I'm Bob."
output = sentence.split('. ')
# output now looks like: ['Hello, DrXYZ Brown', 'Nice to meet you', "I'm Bob."]
output = [o.replace(placeholder, '.') for o in output]
print(output)
>>> ['Hello, Dr. Brown', 'Nice to meet you', "I'm Bob."]
Use the in keyword to check.
'.' in "Dr."
# True
'.' in "Bob"
# False
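A middle ground between the two answers (my addition, not from the original answers) is a split with fixed-width lookbehinds: split on whitespace that follows a full stop, unless the stop belongs to a known abbreviation. The abbreviation list is an assumption and would need extending for real text:

```python
import re

# Each lookbehind is fixed-width, which Python's re module requires.
splitter = re.compile(r"(?<=\.)(?<!Dr\.)(?<!i\.e\.)(?<!etc\.)\s+")

sentence = "Hello, Dr. Brown. Nice to meet you. I'm Bob."
print(splitter.split(sentence))
# ['Hello, Dr. Brown.', 'Nice to meet you.', "I'm Bob."]
```

Because the split consumes only whitespace, the sentence-ending periods are preserved.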

Extract unquoted text from a string

I have a string that may contain random segments of quoted and unquoted texts. For example,
s = "\"java jobs in delhi\" it software \"pune\" hello"
I want to separate out the quoted and unquoted parts of this string in python.
So, basically I expect the output to be:
quoted_string = "\"java jobs in delhi\"" "\"pune\""
unquoted_string = "it software hello"
I believe using a regex is the best way to do it. But I am not very good with regex. Is there some regex expression that can help me with this?
Or is there a better solution available?
I dislike regex for something like this, why not just use a split like this?
s = "\"java jobs in delhi\" it software \"pune\" hello"
print s.split("\"")[0::2] # Unquoted
print s.split("\"")[1::2] # Quoted
If your quotes are as basic as in your example, you could just split; example:
for s in (
        '"java jobs in delhi" it software "pune" hello',
        'foo "bar"',
):
    result = s.split('"')
    print 'text between quotes: %s' % (result[1::2],)
    print 'text outside quotes: %s' % (result[::2],)
Otherwise you could try:
import re
pattern = re.compile(
    r'(?<!\\)(?:\\\\)*(?P<quote>["\'])(?P<value>.*?)(?<!\\)(?:\\\\)*(?P=quote)'
)
for s in data:
    print pattern.findall(s)
Here is the regex explained:
(?<!\\)(?:\\\\)*            # any run of escaped backslash pairs,
                            # not itself preceded by a backslash
(?P<quote>["\'])            # any quote character (either " or ')
                            # which is *not* escaped (by a backslash)
(?P<value>.*?)              # text between the quotes
(?<!\\)(?:\\\\)*(?P=quote)  # end (matching) quote, also not escaped
Use a regex for that:
re.findall(r'"(.*?)"', s)
will return
['java jobs in delhi', 'pune']
You should use Python's shlex module, it's very nice:
>>> from shlex import shlex
>>> def get_quoted_unquoted(s):
...     lexer = shlex(s)
...     items = list(iter(lexer.get_token, ''))
...     return ([i for i in items if i[0] in "\"'"],
...             [i for i in items if i[0] not in "\"'"])
...
>>> get_quoted_unquoted("\"java jobs in delhi\" it software \"pune\" hello")
(['"java jobs in delhi"', '"pune"'], ['it', 'software', 'hello'])
>>> get_quoted_unquoted("hello 'world' \"foo 'bar' baz\" hi")
(["'world'", '"foo \'bar\' baz"'], ['hello', 'hi'])
>>> get_quoted_unquoted("does 'nested \"quotes\" work' yes")
(['\'nested "quotes" work\''], ['does', 'yes'])
>>> get_quoted_unquoted("what's up with single quotes?")
([], ["what's", 'up', 'with', 'single', 'quotes', '?'])
>>> get_quoted_unquoted("what's up when there's two single quotes")
([], ["what's", 'up', 'when', "there's", 'two', 'single', 'quotes'])
I think this solution is as simple as any other (basically a one-liner, if you remove the function declaration and grouping), and it handles nested quotes well, etc.
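If only the two result strings from the question are needed, a short regex pair gets there as well (my sketch, not from the answers above): findall collects the quoted parts, and removing them leaves the unquoted remainder:

```python
import re

s = "\"java jobs in delhi\" it software \"pune\" hello"

quoted = re.findall(r'"[^"]*"', s)                       # keeps the surrounding quotes
unquoted = ' '.join(re.sub(r'"[^"]*"', ' ', s).split())  # whatever is left over

print(quoted)    # ['"java jobs in delhi"', '"pune"']
print(unquoted)  # it software hello
```

Note this, like the split-based answers, does not handle escaped or nested quotes.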

Ambiguity in parsing csv file

I am trying to parse a csv file with the following contents:
# country,title1,title2,type
GB,Fast Friends,Burn Notice, S:4, E:2,episode,
SE,The Spiderwick Chronicles,"SPIDERWICK CHRONICLES, THE",movie,
The expected output is:
['SE', 'The Spiderwick Chronicles', '"SPIDERWICK CHRONICLES, THE"', 'movie']
['GB', 'Fast Friends', 'Burn Notice, S:4, E:2', 'episode']
The problem is, the commas in the 'title' fields are not escaped. I tried using csv.reader as well as string and regex parsing, but was unable to get unambiguous matches.
Is it possible at all to parse this file accurately with unescaped commas on half of the fields? Or, does it require that a new csv be created?
You may be able to play a trick if you can make the assumption that all commas will appear in title2. Otherwise, you have ambiguous data.
strings = ['SE,The Spiderwick Chronicles,"SPIDERWICK CHRONICLES, THE",movie,',
           'GB,Fast Friends,Burn Notice, S:4, E:2,episode,']

for string in strings:
    xs = string.split(',')
    country = xs[0]
    title1 = xs[1]
    title2 = ' '.join(xs[2:-2])
    mtype = xs[-2]
    print [country, title1, title2, mtype]
Output:
['SE', 'The Spiderwick Chronicles', '"SPIDERWICK CHRONICLES THE"', 'movie']
['GB', 'Fast Friends', 'Burn Notice S:4 E:2', 'episode']
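The same trick can also keep the commas inside title2, which the expected output preserves. A variant (mine, under the same assumption that all stray commas belong to title2) that strips the trailing comma first and rejoins with ',':

```python
def parse_line(line):
    # Assumes stray commas only ever occur inside the title2 field.
    xs = line.rstrip(',').split(',')
    return [xs[0], xs[1], ','.join(xs[2:-1]), xs[-1]]

print(parse_line('GB,Fast Friends,Burn Notice, S:4, E:2,episode,'))
# ['GB', 'Fast Friends', 'Burn Notice, S:4, E:2', 'episode']
```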
You can use RegEx (import re) - see documentation
Match for (\".*\",)|(.*,)
This way you're looking either for [quoted string,] or [any string,].
If there are commas in the fields, I would save the Excel file as a text file with fields separated by tabs.

Multiline python regex

I have a file structured like this :
A: some text
B: more text
even more text
on several lines
A: and we start again
B: more text
more
multiline text
I'm trying to find the regex that will split my file like this :
>>>re.findall(regex,f.read())
[('some text','more text','even more text\non several lines'),
('and we start again','more text', 'more\nmultiline text')]
So far, I've ended up with the following :
>>>re.findall('A:(.*?)\nB:(.*?)\n(.*?)',f.read(),re.DOTALL)
[(' some text', ' more text', ''), (' and we start again', ' more text', '')]
The multiline text is not caught. I guess this is because the lazy quantifier is really lazy and catches nothing; but if I take it out, the regex gets really greedy:
>>>re.findall('A:(.*?)\nB:(.*?)\n(.*)',f.read(),re.DOTALL)
[(' some text',
' more text',
'even more text\non several lines\nA: and we start again\nB: more text\nmore\nmultiline text')]
Does anyone have an idea? Thanks!
You could tell the regex to stop matching at the next line that starts with A: (or at the end of the string):
re.findall(r'A:(.*?)\nB:(.*?)\n(.*?)(?=^A:|\Z)', f.read(), re.DOTALL|re.MULTILINE)
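Applied to the sample input, that pattern yields the two tuples the question asks for; note that the third group may keep a trailing newline before the next "A:" block:

```python
import re

text = ("A: some text\n"
        "B: more text\n"
        "even more text\n"
        "on several lines\n"
        "A: and we start again\n"
        "B: more text\n"
        "more\n"
        "multiline text")

# (?=^A:|\Z) stops group 3 at the next "A:" line or at end of input,
# without consuming it, so the next match can start there.
pattern = re.compile(r'A:(.*?)\nB:(.*?)\n(.*?)(?=^A:|\Z)', re.DOTALL | re.MULTILINE)
result = pattern.findall(text)
print(result)
```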

Categories

Resources