How to format a LaTeX string in python? - python

I'm writing an application, part of whose functionality is to generate LaTeX CVs, so I find myself in a situation where I have strings like
\begin{document}
\title{Papers by AUTHOR}
\author{}
\date{}
\maketitle
\begin{enumerate}
%% LIST OF PAPERS
%% Please comment out anything between here and the
%% first \item
%% Please send any updates or corrections to the list to
%% XXXEMAIL???XXX
%\usepackage[pdftex, ...
which I would like to populate with dynamic information, e.g. an email address. Due to the format of LaTeX itself, .format with the {email} syntax won't work, and neither will using a dictionary with the %(email)s syntax. Edit: in particular, strings like "\begin{document}" (a command in LaTeX) should be left literally as they are, without replacement from .format, and strings like "%%" (a comment in LaTeX) should also be left, without replacement from a populating dictionary. What's a reasonable way to do this?

Why won't this work?
>>> output = r'\author{{email}}'.format(email='user#example.org')
>>> print(output)
\author{email}
Edit: Use double curly braces to "escape" the literal curly braces that are meant for LaTeX, and add a third pair for the placeholder itself:
>>> output = r'\begin{{document}} ... \author{{{email}}}'.format(
... email='user#example.org')
>>> print(output)
\begin{document} ... \author{user#example.org}

You can skip the new format syntax and use the older %-style formatting instead, which avoids having to escape { and }.
This should work:
>>> a = r'''
\title{%(title)s}
\author{%(author)s}
\begin{document}'''
>>> b = a % {'title': 'My Title', 'author': 'Me, Of course'}
>>> print(b)
\title{My Title}
\author{Me, Of course}
\begin{document}
You should use raw strings r'something' to avoid escaping \ as \\.
PS: You should take a look at txt2tags, a Python script that converts t2t-formatted text into HTML, LaTeX, Markdown, etc. Check the source code to see how these conversions are done.
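Another option, sketched here on the assumption that the template contains no literal $ characters (LaTeX math delimiters would have to be doubled as $$), is string.Template from the standard library. It uses $name placeholders, so neither the LaTeX braces nor the % comments need escaping. The template text and the values below are illustrative only.

```python
from string import Template

# Raw string: backslashes, braces and %% comments stay exactly as written.
template = Template(r"""\begin{document}
\title{Papers by $author}
\begin{enumerate}
%% Please send any updates or corrections to the list to
%% $email
""")

# substitute() raises KeyError if a placeholder has no value;
# safe_substitute() would leave unknown placeholders untouched instead.
filled = template.substitute(author="A. Author", email="user@example.org")
print(filled)
```

By contrast, %-style formatting treats every % as the start of a conversion, so a template with LaTeX comments would need each literal % doubled.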

Related

Different strings in a single variable in python [duplicate]

I can create a multi-line string using this syntax:
string = str("Some chars "
"Some more chars")
This will produce the following string:
"Some chars Some more chars"
Is Python joining these two separate strings or is the editor/compiler treating them as a single string?
P.s: I just want to understand the internals. I know there are other ways to declare or create multi-line strings.
Read the reference manual, it's in there.
Specifically:
Multiple adjacent string or bytes literals (delimited by whitespace), possibly using different quoting conventions, are allowed, and their meaning is the same as their concatenation. Thus, "hello" 'world' is equivalent to "helloworld". This feature can be used to reduce the number of backslashes needed, to split long strings conveniently across long lines, or even to add comments to parts of strings,
(emphasis mine)
This is why:
string = str("Some chars "
"Some more chars")
is exactly the same as: str("Some chars Some more chars").
This action is performed wherever a string literal might appear: list initializations, function calls (as is the case with str above), etc.
The only caveat is when the string literals are not enclosed in a pair of grouping delimiters (), {} or [] but, instead, are spread over two separate physical lines. In that case we can use the backslash character to join these lines and get the same result:
string = "Some chars " \
"Some more chars"
Of course, concatenation of strings on the same physical line does not require the backslash. (string = "Hello " "World" is just fine)
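The three spellings described above can be checked side by side. This is only a small sketch; the variable names are made up for illustration.

```python
# Adjacent literals inside parentheses: no backslash needed.
grouped = ("Some chars "
           "Some more chars")

# Adjacent literals outside any brackets: a backslash joins the physical lines.
continued = "Some chars " \
            "Some more chars"

# Adjacent literals on one physical line, with different quoting conventions.
same_line = "Hello " 'World'

print(grouped)
print(continued)
print(same_line)
```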
Is Python joining these two separate strings or is the editor/compiler treating them as a single string?
Python is; exactly when Python does this is where things get interesting.
From what I could gather (take this with a pinch of salt, I'm not a parsing expert), this happens when Python transforms the parse tree (LL(1) parser) for a given expression into its corresponding AST (Abstract Syntax Tree).
You can get a view of the parsed tree via the parser module:
import parser
expr = """
str("Hello "
"World")
"""
pexpr = parser.expr(expr)
parser.st2list(pexpr)
This dumps a pretty big and confusing list that represents the concrete syntax tree parsed from the expression in expr:
-- rest snipped for brevity --
[322,
[323,
[3, '"Hello "'],
[3, '"World"']]]]]]]]]]]]]]]]]],
-- rest snipped for brevity --
The numbers correspond to either symbols or tokens in the parse tree and the mappings from symbol to grammar rule and token to constant are in Lib/symbol.py and Lib/token.py respectively.
As you can see in the snipped version I added, you have two different entries corresponding to the two different str literals in the expression parsed.
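The parser module was removed in Python 3.10, so the snippet above only runs on older interpreters. A rough way to make the same observation on current versions, one level lower (tokens rather than the concrete syntax tree), is the tokenize module. This is a sketch, not a replacement for the parse-tree dump.

```python
import io
import tokenize

source = 'str("Hello "\n    "World")\n'

# Collect the source text of every STRING token the tokenizer produces.
string_tokens = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.type == tokenize.STRING
]

print(string_tokens)  # ['"Hello "', '"World"']
```

The two literals are still separate at this stage; the joining happens later.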
Next, we can view the output of the AST tree produced by the previous expression via the ast module provided in the Standard Library:
import ast
p = ast.parse(expr)
ast.dump(p)
# this prints out the following:
"Module(body=[Expr(value=Call(func=Name(id='str', ctx=Load()), args=[Str(s='Hello World')], keywords=[]))])"
The output is more user friendly in this case; you can see that the function call has a single argument, the concatenated string Hello World.
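The same check can be done programmatically instead of reading the dump. The sketch below assumes Python 3.8 or later, where string literals appear in the AST as ast.Constant nodes (older versions, such as the 3.5 used in this answer, used ast.Str with an s attribute).

```python
import ast

source = 'str("Hello "\n    "World")\n'
tree = ast.parse(source)

# Module -> Expr -> Call; the call's positional arguments live in .args.
call = tree.body[0].value
arg = call.args[0]

print(len(call.args))      # 1
print(type(arg).__name__)  # Constant (Python 3.8+)
print(repr(arg.value))     # 'Hello World'
```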
In addition, I also stumbled upon a cool module that generates a visualization of the tree for ast nodes. Using it, the output of the expression expr is visualized like this:
[Image not shown: AST visualization, cropped to show only the relevant part for the expression.]
In the terminal leaf node there is a single str object, the joined string for "Hello " and "World", i.e. "Hello World".
If you are feeling brave enough, dig into the source: the code for transforming expressions into a parse tree is located at Parser/pgen.c, while the code transforming the parse tree into an Abstract Syntax Tree is in Python/ast.c.
This information is for Python 3.5 and I'm pretty sure that unless you're using some really old version (< 2.5) the functionality and locations should be similar.
Additionally, if you are interested in the whole compilation step python follows, a good gentle intro is provided by one of the core contributors, Brett Cannon, in the video From Source to Code: How CPython's Compiler Works.

python UTF-8 str.replace() vs re.sub()

When receiving a JSON from some OCR server the encoding seems to be broken. The image includes some characters that are not encoded(?) properly. Displayed in console they are represented by \uXXXX.
For example, processing an image (not shown) ends up with this output:
"some text \u0141\u00f3\u017a"
It's confusing because if I add some code like this:
mystr = mystr.replace(r'\u0141', '\u0141')
mystr = mystr.replace(r'\u00d3', '\u00d3')
mystr = mystr.replace(r'\u0142', '\u0142')
mystr = mystr.replace(r'\u017c', '\u017c')
mystr = mystr.replace(r'\u017a', '\u017a')
The output is ok:
"some text Ółźż"
What is more, if I try to replace them with a regex:
mystr = re.sub(r'(\\u[0-9|abcdef|ABCDEF]{4})', r'\g<1>', mystr)
The output remains "broken":
"some text \u0141\u00f3\u017a"
This OCR processes an image into MathML / LaTeX prepared for use in Python. For example, an image (not shown) will produce the following RAW output:
"\\(\\Delta=b^{2}-4 a c\\)"
Note that the quotes are included in the string; maybe this is relevant to the case.
Why are the characters not displayed properly in the first place, while after this silly mystr.replace(x, x) everything is fine?
Why does the first method work while re.sub fails? The code seems to be okay and it works fine in another script. What am I missing?
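Python strings are Unicode by default, but the string you received contains literal backslash-u escape sequences rather than the characters they stand for, so the string you have is different from the string you expect to output.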
>>> txt = r"some text \u0141\u00f3\u017a"
>>> txt
'some text \\u0141\\u00f3\\u017a'
>>> print(txt)
some text \u0141\u00f3\u017a
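The regex doesn't work because the replacement r'\g<1>' simply puts back the text that was matched (a literal backslash followed by uXXXX), so nothing changes. In the str.replace() calls, the second argument (for example '\u0141') is a normal, non-raw string literal, so Python converts the \uXXXX escape into the actual character before inserting it, which is why that works. To reproduce: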
>>> txt[-5:]
'u017a'
>>> txt[-6:]
'\\u017a'
>>> txt[-6:-5]
'\\'
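The difference between the two approaches can be shown on a shortened input. This is a minimal sketch; the last step, a callback that converts each escape, is an addition for illustration and does not handle surrogate pairs.

```python
import re

raw = r"some text \u0141"  # literal backslash, 'u', four hex digits

# Raw literal: six separate characters. Normal literal: one character.
print(len(r"\u0141"), len("\u0141"))

# str.replace: the second argument is already the real character.
replaced = raw.replace(r"\u0141", "\u0141")

# re.sub with r'\g<1>': every match is replaced by itself, so nothing changes.
unchanged = re.sub(r"(\\u[0-9a-fA-F]{4})", r"\g<1>", raw)

# A callback that turns the four hex digits into the character they name.
decoded = re.sub(r"\\u([0-9a-fA-F]{4})",
                 lambda m: chr(int(m.group(1), 16)),
                 raw)

print(replaced, unchanged, decoded, sep="\n")
```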
What you should do to resolve it:
Make sure your response is received in the correct encoding and not as a raw string (e.g. use response.text instead of response.body).
Otherwise
>>> txt.encode("raw-unicode-escape").decode('unicode-escape')
'some text Łóź'
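Since the question says the server returns JSON and that the quotes are part of the raw string, another possible fix is to let a JSON parser decode the escapes. This is a sketch that assumes the raw body really is a valid JSON string literal, as the two samples in the question appear to be.

```python
import json

# Raw bodies as shown in the question, including the surrounding quotes.
raw_text = r'"some text \u0141\u00f3\u017a"'
raw_formula = r'"\\(\\Delta=b^{2}-4 a c\\)"'

# json.loads decodes \uXXXX escapes and turns \\ into a single backslash.
text = json.loads(raw_text)
formula = json.loads(raw_formula)

print(text)
print(formula)
```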

String formatting and TeX rendering

I am running into some issues getting things to display correctly when combining string formatting and TeX rendering in Python. I want to index a series of energy levels by integers. This works fine for single-digit integers, but when I try, for example:
s = r"$E_{}$".format(10)
the result looks like E₁0 (only the 1 is subscripted) while I want it to look like E₁₀. I have tried using double braces but this doesn't seem to work since
s = r"$E_{{}}$".format(10)
results in "E" without any subscript at all, and something like
s = r"$E_{{k}}$".format(k = 10)
predictably gives Eₖ (E with a subscript k).
To me it seems the problem here is that both the string formatting and Tex syntax make use of curly braces, and while doubling the braces does escape the formatting, this will not work for me, since I still want to insert the value of k somehow. Is there any way around this issue, or will I have to resort to the old school method of formatting strings?
Literal { can be included with {{ - and then you need another {} to get the formatting placeholder - so you need 3:
s = r"$E_{{{}}}$".format(10)
print(s)
Results in:
$E_{10}$
Which should be rendered into what you wanted.
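For comparison, here is a short sketch of the same result written three ways: str.format with triple braces, an f-string (Python 3.6+), and the old %-style formatting the question alludes to. The variable names are illustrative.

```python
k = 10

# {{ and }} produce literal braces; the inner {} is the placeholder.
with_format = r"$E_{{{}}}$".format(k)

# Same brace rules apply inside an f-string; the r prefix keeps backslashes raw.
with_fstring = rf"$E_{{{k}}}$"

# %-style formatting leaves braces alone, so no doubling is needed.
with_percent = r"$E_{%d}$" % k

print(with_format, with_fstring, with_percent)
```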

Can't replace a string with multiple escape characters

I am having trouble with the replace() method. I want to replace some part of a string, and the part which I want to replace consists of multiple escape characters. It looks something like this:
['<div class=\"content\">rn
To remove it, I have this block of code:
garbage_text = "[\'<div class=\\\"content\\\">rn "
entry = entry.replace(garbage_text,"")
However, it does not work. Nothing is removed from my complete string. Can anybody point out where exactly my thinking goes wrong? Thanks in advance.
Addition:
The complete string looks like this;
"['<div class=\"content\">rn gitar calmak icin kullanilan minik plastik garip nesne.rn </div>']"
You could use the triple quote format for your replacement string so that you don't have to bother with escaping at all:
garbage_text = """['<div class="content">rn """
Perhaps your 'entry' is not formatted correctly?
With an extra variable 'text', the following worked in Python 3.6.7:
>>> garbage_text
'[\'<div class=\\\'content\'\\">rn '
>>> text
'[\'<div class=\\\'content\'\\">rn And then there were none'
>>> entry = text.replace(garbage_text, "")
>>> entry
'And then there were none'
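One way to see why the original call removes nothing is to compare what the two literals actually contain. The sketch below assumes that entry holds no literal backslashes (that is, the backslashes shown in the question are only how the string was displayed); if entry does contain them, the outcome is reversed.

```python
# Literal from the question: \\\" in a normal string is a backslash plus a quote.
garbage_with_backslashes = "[\'<div class=\\\"content\\\">rn "

# Triple-quoted literal from the answer: no backslashes at all.
garbage_plain = """['<div class="content">rn """

# Assumed input: the complete string without literal backslashes.
entry = """['<div class="content">rn gitar calmak icin kullanilan minik plastik garip nesne.rn </div>']"""

print(garbage_with_backslashes)  # shows the two literal backslashes
print(garbage_plain)

unchanged = entry.replace(garbage_with_backslashes, "")  # no match
cleaned = entry.replace(garbage_plain, "")               # prefix removed
print(cleaned)
```

Printing the string, rather than looking at its repr, is usually the quickest way to tell whether a backslash is really part of the data.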

Escape double quotes when converting a dict to json in Python

I need to escape double quotes when converting a dict to json in Python, but I'm struggling to figure out how.
So if I have a dict like {'foo': 'bar'}, I'd like to convert it to json and escape the double quotes - so it looks something like:
'{\"foo\":\"bar\"}'
json.dumps doesn't do this, and I have tried something like:
json.dumps({'foo': 'bar'}).replace('"', '\\"') which ends up formatting like so:
'{\\"foo\\": \\"bar\\"}'
This seems like such a simple problem to solve but I'm really struggling with it.
Your last attempt json.dumps({'foo': 'bar'}).replace('"', '\\"') is actually correct for what you think you want.
The reason you see this:
'{\\"foo\\": \\"bar\\"}'
Is because you're looking at the representation of the string. The string itself has only a single backslash for each quote. If you use print() on that result, you will see a single backslash.
What you have does work. Python is showing you the literal representation of it. If you save it to a variable and print it, it shows you what you're looking for.
>>> import json
>>> a = json.dumps({'foo': 'bar'}).replace('"', '\\"')
>>> print(a)
{\"foo\": \"bar\"}
>>> a
'{\\"foo\\": \\"bar\\"}'
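The point about repr versus print can be verified by counting characters. The sketch also shows a possible alternative that is not in the answers above: calling json.dumps twice, which yields the same escaped text wrapped in an extra pair of quotes.

```python
import json

escaped = json.dumps({"foo": "bar"}).replace('"', '\\"')

print(escaped)        # what the string really contains
print(repr(escaped))  # how the interpreter displays it, backslashes doubled

# One backslash per quote character, not two.
backslash_count = escaped.count("\\")

# Encoding the JSON text as a JSON string escapes the quotes as well.
double_encoded = json.dumps(json.dumps({"foo": "bar"}))
print(double_encoded)
```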
