Extracting all LaTeX commands from a LaTeX file - Python

I am trying to extract all the LaTeX commands from a .tex file, and I have to use Python for this. I tried to extract the LaTeX commands into a list using the re module.
The problem is that this list does not contain the LaTeX commands whose names include special characters (such as \alpha*, \a', \#, \$, +, :, \; etc.); it only contains the commands that consist of letters.
I am presently using re.match. I already know the starting index of '\', which is at self.i. An example LaTeX code string could be:
\documentclass[envcountsame,envcountchap]{svmono}
and my current call is:
match_text = re.match("[\w]+", search_string[self.i + 1:])
I am able to extract 'documentclass'. But suppose there is another command like:
"\abstract*[alpha]{beta}"
"\${This is a latex document}"
"\:"
How do I extract only 'abstract*', '$', ':' from these strings?
I am new to Python and have tried various approaches, but am not able to extract all these command names. If there is a general Python regex that can handle all these cases, it would be useful.
NOTE: The book 'The Not So Short Introduction to LaTeX' states that LaTeX commands come in three formats:
1. They start with a backslash \ and then have a name consisting of letters only. Command names are terminated by a space, a number or any other 'non-letter'.
2. They consist of a backslash and exactly one non-letter.
3. Many commands exist in a 'starred variant' where a star is appended to the command name.

Here's the exact translation of your format specification:
\\(?:[^a-zA-Z]|[a-zA-Z]+)\*?
non-letter: [^a-zA-Z]
or letters: [a-zA-Z]+
starred variant: \*?
If your format description is accurate, this should do it. Unfortunately I don't know LaTeX so I'm not sure it's 100% OK.
From the feedback in the comments, it turns out the star is applicable only to letter commands, and there can be some other terminating characters as well. The final regex is:
\\(?:[^a-zA-Z]|[a-zA-Z]+[*=']?)
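As a quick sanity check, here is a minimal sketch (the sample strings are the ones from the question) that applies this final regex with re.findall:

import re

# a backslash followed by either a single non-letter,
# or a run of letters optionally terminated by *, = or '
command_re = re.compile(r"\\(?:[^a-zA-Z]|[a-zA-Z]+[*=']?)")

samples = [
    r"\documentclass[envcountsame,envcountchap]{svmono}",
    r"\abstract*[alpha]{beta}",
    r"\${This is a latex document}",
    r"\:",
]
for s in samples:
    # drop the leading backslash to keep just the command name
    print([m[1:] for m in command_re.findall(s)])
# -> ['documentclass'], ['abstract*'], ['$'], [':']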

LaTeX is a TeX macro package, and as such, everything that applies to TeX also applies to LaTeX.
The question you ask is a difficult one, as TeX is not a regular language. If you only want to deal with commands, you can check for the regex \\([A-Za-z]+ *|.|\n), with the caveat that TeX has active characters, that is, characters whose mere presence acts like a command. If you want to deal with command parameters, you'll have to check the individual command definitions, because TeX is a prefix (Polish-notation) language: commands come first, followed by a variable number of positional parameters. For parameter extraction, TeX uses brace matching, which is context-free and not regular, so you'll need a complete parser for that.
TeX allows you to redefine all character classes, so you can, for example, make digits act as letters and use them in command names (so \a23 would be a valid command name). This happens inside package definitions, where the @ character is made a letter so that packages can define commands that are inaccessible to users but available inside the package.
Eliminating LaTeX markup is difficult for this reason, and you can only achieve partial results. There are many different problems to solve (what to do with \include directives, what to do with valid text in parameters such as those of \chapter or \footnote, whether you want the index included, etc.).
Also, you have to be careful: if you try to eliminate command parameters, you will also eliminate part of your text (for example the text in \footnote, \abstract, \title, \chapter{...}, etc.). I don't know what effect you actually want to achieve, so I cannot give you more information in this respect.

Related

Identify a dot in an AIML pattern in Python

In a project of mine, I am trying to identify file names in a given sentence. For example, given "Could you please open abc.txt", I need to fetch the keyword "open" in order to know the kind of action that is expected, and I also need to identify the file name, for obvious reasons. A simple AIML tag for this is:
<aiml>
<category>
<pattern>* OPEN *</pattern>
<template>open <star index="2"/></template>
</category>
</aiml>
Here, in the template tag, I am just giving information about the operation to be performed and the file name. My Python code, on the other hand, takes care of performing the required action.
Now the problem is the '.' character. That character divides the sentence into two parts (in the example I mentioned above, the two sentences would be "Could you please open abc" and "txt"), which are individually matched against the defined AIML patterns. But in my case I don't want the '.' character to act as a delimiter. Basically, I want to identify file names that may or may not include an extension. Could anyone please help me out with this?
Thanks in advance!
By default AIML allows multi sentence input. This means full stops, exclamation marks and question marks are treated as separators between sentences. For example if you asked:
Good morning. My name is George. How are you today?
this is interpreted as 3 separate inputs. Normally this is a good thing as it means the AIML interpreter can re-use existing patterns for GOOD MORNING, MY NAME IS *, HOW ARE YOU *.
But in your case that's not helping as the full-stop before the extension is causing unwanted splitting. Depending on your AIML interpreter, sentence splitting is done in a pre-processing stage before sending the input to the interpreter. Some AIML interpreters have a configuration file that lets you define the sentence splitting characters, so you may simply be able to remove the full stop from the list of separators.
A better approach may be to pre-process the filenames and replace the full stop with the word DOT; you can then detect this in your pattern * OPEN *.
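For example, a minimal pre-processing sketch (the replacement word DOT and the file-name regex are just one way to do it; adjust them to your interpreter's conventions):

import re

def protect_filenames(sentence):
    # Replace the dot inside anything that looks like a file name (name.ext)
    # with the word DOT, so the sentence splitter leaves it alone.
    return re.sub(r'(\w+)\.(\w+)', r'\1 DOT \2', sentence)

print(protect_filenames("Could you please open abc.txt"))
# -> Could you please open abc DOT txt

Your Python handler can then map "abc DOT txt" back to "abc.txt" before opening the file.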
As a final comment, * OPEN * is a very wide-ranging pattern; it will also be invoked if someone says WHAT TIME IS THE SHOP OPEN TODAY, or any other input containing the word OPEN surrounded by text.

Why is re.sub not called re.replace

I've been writing Python for quite some time now, and so far it seems like the creators of the language put a lot of effort into the readability of the code; a good example of this is the re (regular expression) module.
Almost every method is clear in what it does:
re.search Scan through string looking for a match to the pattern
re.split Split the source string by the occurrences of the pattern
re.escape Escape all the characters in pattern except ASCII letters, numbers and '_'.
etc..
Until we hit the following two methods:
re.sub
re.subn
These methods are meant to do a regex-based replace, yet the naming convention seems strange and out of place to me (especially when starting out, I had to constantly look the method names up). C#, for instance, does call the method Regex.Replace.
Is there a reason behind naming these methods sub and subn?
Why didn't the developers simply name it re.replace?
The traditional name of this regex command is substitute (or substitution). It comes from the original Unix ed, where you use s to perform it, and this has been retained in sed; Perl also uses the s command syntax.
From Sed - An Introduction and Tutorial by Bruce Barnett
The essential command: s for substitution
Sed has several commands, but most people only learn the substitute
command: s. The substitute command changes all occurrences of the
regular expression into a new value. A simple example is changing
"day" in the "old" file to "night" in the "new" file:
sed s/day/night/ <old >new
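For comparison, the equivalent Python calls look like this; re.subn is simply the variant that also reports how many substitutions were made:

import re

text = "day after day"
print(re.sub(r"day", "night", text))   # night after night
print(re.subn(r"day", "night", text))  # ('night after night', 2)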

Loading regular expression patterns from external source?

I have a series of regular expression patterns defined for automated processing of text. Due to the design of the program, it's better to have these patterns separate in a text file, namely a JSON file. The pattern in Python is of the r'' type, but all I can provide is a string. I'd like to retain functionality such as grouping, and I'd like to keep features such as character classes ([A-z]), so I'm not talking about escaping everything.
I'm using Python 3.4. How do I properly load these patterns into the re module? And what kind of escaping problem should I watch out for?
I am not sure exactly what you want, but have a look at this:
If you have a file called input.txt containing \d+
then you can use it this way:
import re

with open("input.txt") as f:
    pattern = f.readline().strip()

x = "asasd3243sdfdsf23234sdsdf"
print(re.findall(pattern, x))
Output: ['3243', '23234']
Because the pattern is read from a file, the backslashes arrive as literal characters, so you need not escape anything.
The r'' thing in Python is not a different type from a simple ''. The r'' syntax simply creates a string that looks exactly like the one you typed, so the \n sequence stays as \n and isn't turned into a newline (the same happens with other special characters). The little r simply stops escape sequences from being interpreted.
Check it yourself with these simple lines in the console:
print('test \n test')
print(r'test \n test')
print(type(r''))
print(type(''))
Now, when you read strings from a JSON file, the unescaping is done for you. I don't know how you will create the JSON file, but you should take a look at the json module and its load method, which will allow you to read a JSON file.
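A minimal sketch of that idea; the file name patterns.json and the key names are assumptions made here purely for illustration:

import json
import re

# patterns.json might contain: {"number": "\\d+", "word": "[A-Za-z]+"}
with open("patterns.json") as f:
    raw_patterns = json.load(f)

# json.load already turns "\\d+" into the two-character string \d+,
# so the patterns can be compiled directly.
compiled = {name: re.compile(p) for name, p in raw_patterns.items()}
print(compiled["number"].findall("asasd3243sdfdsf23234sdsdf"))
# -> ['3243', '23234']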
You can use re.escape to escape the strings. However, that escapes everything, and you might want to keep some special characters. I'd just use the strings as they are and be careful about placing \ in the right places.
BTW: If you have many regular expressions, matching might get slow. You might want to consider alternatives such as esmre.

Accommodate two types of quotes in a regex

I am using a regex to replace quotes within an input string. My data contains two 'types' of quotes -
" and “
There's a very subtle difference between the two. Currently, I am explicitly mentioning both these types in my regex
\"*\“*
I am afraid though that in future data I may get a different 'type' of quote on which my regex may fail. How many different types of quotes exist? Is there a way to normalize these to just one type so that my regex won't break on unseen data?
Edit -
My input data consists of HTML files, and I am unescaping HTML entities and URL-encoded sequences while decoding to ASCII
escaped_line = HTMLParser.HTMLParser().unescape(urllib.unquote(line.decode('ascii','ignore')))
where line specifies each line in the HTML file. I need to 'ignore' the ASCII as all files in my database don't have the same encoding and I don't know the encoding prior to reading the file.
Edit2
I am unable to do this using the replace function. I tried replace('"', '') but it doesn't replace the other type of quote ('“'). If I add that quote in another replace call, it throws a non-ASCII character error.
Condition
No external libraries allowed; only native Python libraries may be used.
I don't think there is a "quotation marks" character class in Python's regex implementation so you'll have to do the matching yourself.
You could keep a list of common quotation mark unicode characters (here's a list for a good start) and build the part of regex that matches quotation marks programmatically.
I can only help you with the original question about quotation marks. As it turns out, Unicode defines many properties per character, and these are all available through the Unicode Character Database. "Quotation mark" is one of these properties.
How many different types of quotes exist?
29, according to Unicode, see below.
The Unicode standard brings us a definitive text file on Unicode properties, PropList.txt, among which a list of quotation marks. Since Python does not support all Unicode properties in regular expressions, you cannot currently use \p{QuotationMark}. However, it's trivial to create a regular expression character class:
# placed on multiple lines for readability; remove the spaces
# and then use it in your regex in place of the current quotes
[\u0022 \u0027 \u00AB \u00BB
\u2018 \u2019 \u201A \u201B
\u201C \u201D \u201E \u201F
\u2039 \u203A \u300C \u300D
\u300E \u300F \u301D \u301E
\u301F \uFE41 \uFE42 \uFE43
\uFE44 \uFF02 \uFF07 \uFF62
\uFF63]
As "tchrist" pointed out above, you can save yourself the trouble by using Matthew Barnett's regex library which supports \p{QuotationMark}.
Turns out there's an easier way to do this: just add a u prefix in front of the regex pattern you write in Python (in Python 2 the combined raw/unicode prefix must be written ur, not ru):
regexp = ur'\"*\“*'
Make sure you use the re.UNICODE flag when you want to compile/search/match your regex to your string.
re.findall(regexp, string, re.UNICODE)
Don't forget to include the
#!/usr/bin/python
# -*- coding:utf-8 -*-
at the start of the source file to make sure unicode strings can be written in your source file.

Parsing in Python: what's the most efficient way to suppress/normalize strings?

I'm parsing a source file, and I want to "suppress" strings. What I mean by this is to transform every string like "bla bla bla +/*" into something like "string" that is deterministic and does not contain any characters that may confuse my parser, because I don't care about the values of the strings. One of the issues here is string formatting using e.g. "%s"; please see my remark about this below.
Take, for example, the following pseudo code, which may be the contents of a file I'm parsing. Assume strings start with ", and the " character is escaped by doubling it (""):
print(i)
print("hello**")
print("hel"+"lo**")
print("h e l l o "+
"hello\n")
print("hell""o")
print(str(123)+"h e l l o")
print(uppercase("h e l l o")+"g o o d b y e")
Should be transformed to the following result:
print(i)
print("string")
print("string"+"string")
print("string"
"string")
print("string")
print(str(123)+"string")
print(uppercase("string")+"string")
Currently I treat it as a special case in the code (i.e. detect the beginning of a string and "manually" run until its end, with several sub-special cases on the way). If there's a Python library function I can use, or a nice regex that may make my code more efficient, that would be great.
A few remarks:
I would like the "start-of-string" character to be a variable, e.g. ' vs ".
I'm not parsing Python code at this stage, but I plan to, and there the problem obviously becomes more complex because strings can start in several ways and must end in a way corresponding to the start. I'm not attempting to deal with this right now, but if there's any well established best practice I would like to know about it.
The thing bothering me the most about this "suppression" is the case of string formatting with the likes of '%s', that are meaningful tokens. I'm currently not dealing with this and haven't completely thought it through, but if any of you have suggestions about how to deal with this that would be great. Please note I'm not interested in the specific type or formatting of the in-string tokens, it's enough for me to know that there are tokens inside the string (how many). Remark that may be important here: my tokenizer is not nested, because my goal is quite simple (I'm not compiling anything...).
I'm not quite sure about the escaping of the start-of-string character. What would you say are the common ways this is implemented in most programming languages? Is it enough to assume escaping by double occurrence (e.g. "") or by a two-character sequence (e.g. '\"')? Do I need to handle other cases (think of languages like Java, C/C++, PHP, C#)?
Option 1: To sanitize Python source code, try the built-in tokenize module. It can correctly find strings and other tokens in any Python source file.
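A minimal sketch of that option, assuming Python 3 and that replacing every STRING token with "string" is acceptable (untokenize may introduce some extra whitespace):

import io
import tokenize

def suppress_strings(source):
    # walk the token stream and swap every string literal for "string"
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.STRING:
            tok = tok._replace(string='"string"')
        tokens.append(tok)
    return tokenize.untokenize(tokens)

print(suppress_strings('print(uppercase("h e l l o") + "g o o d b y e")\n'))
# prints the line with both literals replaced by "string" (spacing may differ slightly)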
Option 2: Use pygments with HTML output, and replace anything in blue (etc.) with "string". pygments supports a few dozen languages.
Option 3: For most of the languages, you can build a custom regexp substitution. For example, the following sanitizes Python source code (but it doesn't work if the source file contains """ or '''):
import re

sanitized = re.sub(r'(#.*)|\'(?:[^\'\\]+|\\.)*\'|"(?:[^"\\]+|\\.)*"',
                   lambda match: match.group(1) or '"string"', source_code)
The regexp above works properly even if the strings contain backslashes (\", \\, \n, \\", \\\" etc. all work fine).
When you are building your regexp, make sure to match comments (so your regexp substitution won't touch strings inside comments) and regular expression literals (e.g. in Perl, Ruby and JavaScript), and pay attention you match backslashes and newlines properly (e.g. in Perl and Ruby a string can contain a newline).
Use a dedicated parser for each language — especially since people have already done that work for you. Most of the languages you mentioned have a grammar.
Nowhere do you mention that you take an approach using a lexer and parser. If in fact you do not, have a look at e.g. the tokenize module (which is probably what you want), or the 3rd party module PLY (Python Lex-Yacc). Your problem needs a systematic approach, and these tools (and others) provide it.
(Note that once you have tokenized the code, you can apply another specialized tokenizer to the contents of the strings to detect special formatting directives such as %s. In this case a regular expression may do the job, though.)
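For that inner step, a rough sketch, assuming you only care about how many printf-style directives a string contains (the pattern below covers the common %s/%d/%f forms and is not exhaustive):

import re

# matches directives such as %s, %d, %03d, %5.2f and the literal %%
FORMAT_RE = re.compile(r'%(?:%|[-+ #0]*\d*(?:\.\d+)?[sdifeEgGxXoc])')

print(len(FORMAT_RE.findall("value: %s, count: %03d, ratio: %5.2f, 100%%")))
# -> 4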
