preventing conflict between math symbols and html code - python

I want to present text out of math publications and every now and then I get something like
O(1/N_f) Corrections to the Thirring Model in 2<d<4
The last part will be misinterpreted as HTML. I have to paste this text directly into a website that allows HTML. The reason I need to allow HTML is that I use Elasticsearch and I want to highlight the search results (Elasticsearch puts tags in the text), so I can't just disable HTML interpretation of the text.
However, I can pre-process the text to prevent any conflict. For example above all conflict is avoided by using
text.replace('<', " < ")
in python. However, this is far from optimal since
1. It will introduce spaces even when they are not needed
2. It only accounts for this particular collision between math symbols and html
Since I figure I am not the first person who encounters this I was wondering whether there is a general solution for such a problem?

Use the xml.sax.saxutils.escape function:
import xml.sax.saxutils
escaped = xml.sax.saxutils.escape(text)
This will escape '&', '<', and '>' in the text string.
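A quick sketch of how this behaves on the title from the question; note that escape() also accepts an optional dict of extra entities as a second argument, e.g. if you need double quotes escaped as well:

```python
from xml.sax.saxutils import escape

title = "O(1/N_f) Corrections to the Thirring Model in 2<d<4"
print(escape(title))
# O(1/N_f) Corrections to the Thirring Model in 2&lt;d&lt;4

# Optional second argument: a mapping of additional characters to entities.
print(escape('say "hi" & bye', {'"': "&quot;"}))
# say &quot;hi&quot; &amp; bye
```

Since escape() replaces '&' before applying the other substitutions, already-escaped entities in the input will be double-escaped, so run it on raw text only.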

HTML has several characters with special meanings (including angle brackets) and is usually represented in ASCII, so a reliable way to represent these special characters is needed.
In HTML, escape sequences are used to represent them. For example, the & character is represented by the named escape sequence &amp; or the numeric escape sequence &#38;. Only the most common special characters have named escape sequences.
Here's a good list of sequences.

Related

Issues with re.search and unicode in python [duplicate]

I have been trying to extract certain text from PDF converted into text files. The PDF came from various sources and I don't know how they were generated.
The pattern I was trying to extract was simply two digits, followed by a hyphen, and then another two digits, e.g. 12-34. So I wrote a simple regex \d\d-\d\d and expected that to work.
However, when I tested it, I found that it missed some hits. Later I noted that there are at least two other hyphen-like characters, \u2212 and \xad. So I changed my regex to \d\d[-\u2212\xad]\d\d and it worked.
My question is, since I am going to extract so many PDF that I don't know what other variations of hyphen are out there, is there any regex expression covering all "hyphens", and hopefully looks better than the [-\u2212\xad] expression?
The solution you ask for in the question title implies a whitelisting approach, meaning that you need to enumerate the characters you consider similar to hyphens.
You may refer to the Punctuation, Dash category; that Unicode category lists all the possible Unicode hyphens.
You may use the PyPI regex module and its \p{Pd} pattern to match any Unicode hyphen.
Or, if you can only work with re, use
[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]
You may expand this list with other Unicode chars that contain minus in their Unicode names, see this list.
A blacklisting approach means you do not want to match specific chars between the two pairs of digits. If you want to match any non-whitespace, you may use \S. If you want to match any punctuation or symbols, use (?:[^\w\s]|_).
Note that the "soft hyphen", U+00AD, is not included in the \p{Pd} category and won't be matched by that construct. To include it, create a character class and add it:
[\xAD\p{Pd}]
[\xAD\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]
This is also a possible solution, if your regex engine allows it
/\p{Dash}/u
This will include all these characters.
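Putting the pieces together, here is a stdlib-only sketch that matches the question's two-digits-dash-two-digits pattern using the dash code points listed above, plus the soft hyphen U+00AD and the minus sign U+2212 that the asker actually encountered:

```python
import re

# Pd (dash punctuation) code points from the character class above, plus
# the soft hyphen U+00AD and the minus sign U+2212 from the question.
DASHES = ("\u00AD\u2212\u002D\u058A\u05BE\u1400\u1806"
          "\u2010\u2011\u2012\u2013\u2014\u2015"
          "\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C"
          "\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D")
# Escape each character so the literal '-' cannot form an accidental
# range inside the [...] character class.
DASH_CLASS = "[%s]" % "".join(re.escape(c) for c in DASHES)
PATTERN = re.compile(r"\d\d%s\d\d" % DASH_CLASS)

print(PATTERN.findall("12-34, 56\u221278 and 90\u00AD12"))
```

All three pairs are found, including the one joined by an (invisible) soft hyphen.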

Python regex that matches superscripted text

My question is very simple but I couldn't figure it out by myself: how do I match superscripted text with a regex in Python? I'd like to match patterns like [a-zA-Z0-9,[]], but only if the text is superscripted.
regards
The main problem is that information about "superscript" and "subscript" text is not conveyed at the character level.
Unicode even defines some characters to be used as sub- and superscript; most notably, all decimal digits have a corresponding character, but only a handful of other Latin letters have a full superscript or subscript character with its own code point. Check: https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts
So, if you want to match digits only, you could just put the corresponding characters in the regular expressions: "\u207X" (with X varying from 0 to 9) plus "\u00BX" with X in {2, 3, 9} - the table in the linked wikipedia article has all characters.
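For the digits-only case, that boils down to a small character class; a sketch:

```python
import re

# Superscript digit code points: ¹ U+00B9, ² U+00B2, ³ U+00B3,
# plus ⁰ (U+2070) and ⁴-⁹ (U+2074-U+2079).
SUP_DIGITS = re.compile("[\u00B2\u00B3\u00B9\u2070\u2074-\u2079]+")

print(SUP_DIGITS.findall("E = mc\u00B2, x\u2074 + y\u00B3\u2070"))
# ['²', '⁴', '³⁰']
```

The + quantifier groups runs of consecutive superscript digits (like ³⁰) into a single match.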
For the remaining characters, what takes place when we are faced with superscript text is that the formatting information lives in a protocol separate from the characters: for example, if you are dealing with HTML markup, it is the text inside the <sup> </sup> tags.
Just as with HTML, in any instance you find superscript text, it has to be marked up in a text protocol "outside" the characters themselves, and therefore outside what you'd look up in the characters with a regular expression.
If you are dealing with HTML text, you can search your text for the "<sup>" tag, for example. However, if it is formatted text inside a web page, there are tens of ways of marking the superscript text, as the super-script transformation can be specified in CSS, and the CSS may be applied to the page-text in several different ways.
Other text protocols exist that might encode superscript text, like "rich text" (RTF files). Otherwise, you have to say how the text you are dealing with is encoded, and how it encodes the markup for superscript text, in order for a proper regular expression to be built.
If it is plain HTML using "<sup>" tags, it could be as simple as:
re.findall(r"<sup.*?>(.*?)</sup>", text)
Otherwise, you should inspect your text stream, find out the superscript markup, and use an appropriate regexp, or even a better parsing tool (for HTML/XML you are usually better off using BeautifulSoup or other XML tools than regexps, for example).
And, of course, that applies if the information for which text is superscripted is embedded in the text channel, as some kind of markup. It might be on a side-channel: another data block telling at which text indexes the effect of superscript should apply. In this case, you essentially have to figure that out, and then use that information directly.

Accommodate two types of quotes in a regex

I am using a regex to replace quotes within an input string. My data contains two 'types' of quotes -
" and “
There's a very subtle difference between the two. Currently, I am explicitly mentioning both these types in my regex
\"*\“*
I am afraid though that in future data I may get a different 'type' of quote on which my regex may fail. How many different types of quotes exist? Is there a way to normalize these to just one type so that my regex won't break on unseen data?
Edit -
My input data consists of HTML files, and I am unescaping HTML entities and URL quoting while decoding to ASCII
escaped_line = HTMLParser.HTMLParser().unescape(urllib.unquote(line.decode('ascii','ignore')))
where line is each line in the HTML file. I need to 'ignore' decode errors because the files in my database don't all have the same encoding and I don't know the encoding prior to reading the file.
Edit2
I am unable to do this using the replace function. I tried replace('"','') but it doesn't replace the other type of quote, '“'. If I add that one in another replace call, Python throws a non-ASCII character error.
Condition
No external libraries allowed, only native python libraries could be used.
I don't think there is a "quotation mark" character class in Python's regex implementation, so you'll have to do the matching yourself.
You could keep a list of common quotation mark unicode characters (here's a list for a good start) and build the part of regex that matches quotation marks programmatically.
I can only help you with the original question about quotation marks. As it turns out, Unicode defines many properties per character, and these are all available through the Unicode Character Database. "Quotation mark" is one of these properties.
How many different types of quotes exist?
29, according to Unicode, see below.
The Unicode standard brings us a definitive text file on Unicode properties, PropList.txt, which includes a list of quotation marks. Since Python does not support all Unicode properties in regular expressions, you cannot currently use \p{QuotationMark}. However, it's trivial to create a regular expression character class:
# placed on multiple lines for readability; remove the spaces
# and line breaks, then use it in your regex in place of the current quotes
[\u0022 \u0027 \u00AB \u00BB
\u2018 \u2019 \u201A \u201B
\u201C \u201D \u201E \u201F
\u2039 \u203A \u300C \u300D
\u300E \u300F \u301D \u301E
\u301F \uFE41 \uFE42 \uFE43
\uFE44 \uFF02 \uFF07 \uFF62
\uFF63]
As "tchrist" pointed out above, you can save yourself the trouble by using Matthew Barnett's regex library which supports \p{QuotationMark}.
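A stdlib-only sketch that uses the character class above to normalize every Unicode quotation mark to a plain ASCII one, which works within the asker's no-external-libraries constraint:

```python
import re

# The 29 Quotation_Mark code points from Unicode's PropList.txt,
# as listed in the character class above.
QUOTES = ("\u0022\u0027\u00AB\u00BB\u2018\u2019\u201A\u201B"
          "\u201C\u201D\u201E\u201F\u2039\u203A\u300C\u300D"
          "\u300E\u300F\u301D\u301E\u301F\uFE41\uFE42\uFE43"
          "\uFE44\uFF02\uFF07\uFF62\uFF63")
QUOTE_RE = re.compile("[%s]" % re.escape(QUOTES))

def normalize_quotes(text, to='"'):
    # Replace any Unicode quotation mark with a single ASCII quote.
    return QUOTE_RE.sub(to, text)

print(normalize_quotes("He said \u201Chello\u201D and \u00ABbye\u00BB"))
# He said "hello" and "bye"
```

Once everything is normalized to one quote character, the original single-character replace (or regex) works on any input.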
Turns out there's a much easier way to do this. Just prefix your pattern literal with u (in Python 2, the u must come before the r):
regexp = ur'\"*\“*'
Make sure you use the re.UNICODE flag when you want to compile/search/match your regex to your string.
re.findall(regexp, string, re.UNICODE)
Don't forget to include the
#!/usr/bin/python
# -*- coding:utf-8 -*-
at the start of the source file to make sure unicode strings can be written in your source file.

Problem with eastern european characters when scraping data from the European Parliament Website

EDIT: thanks a lot for all the answers and points raised. As a novice I am a bit overwhelmed, but it is a great motivation to continue learning Python!!
I am trying to scrape a lot of data from the European Parliament website for a research project. The first step is to create a list of all parliamentarians; however, due to the many Eastern European names and the accents they use, I get a lot of missing entries. Here is an example of what is giving me trouble (notice the accent at the end of the family name):
<td class="listcontentlight_left">
ANDRIKIENĖ, Laima Liucija
<br/>
Group of the European People's Party (Christian Democrats)
<br/>
</td>
So far I have been using pyparsing and the following code:
#parser_names
name = Word(alphanums + alphas8bit)
begin, end = map(Suppress, "><")
names = begin + ZeroOrMore(name) + "," + ZeroOrMore(name) + end
for name in names.searchString(page):
    print(name)
However, this does not catch the name from the HTML above. Any advice on how to proceed?
Best, Thomas
P.S.: Here is all the code I have so far:
# -*- coding: utf-8 -*-
import urllib.request
from pyparsing_py3 import *
page = urllib.request.urlopen("http://www.europarl.europa.eu/members/expert/alphaOrder.do?letter=B&language=EN")
page = page.read().decode("utf8")
#parser_names
name = Word(alphanums + alphas8bit)
begin, end = map(Suppress, "><")
names = begin + ZeroOrMore(name) + "," + ZeroOrMore(name) + end
for name in names.searchString(page):
    print(name)
I was able to show 31 names starting with A with code:
extended_chars = srange(r"[\0x80-\0x7FF]")
special_chars = " -'"
name = Word(alphanums + alphas8bit + extended_chars + special_chars)
As John noticed, you need more Unicode characters (extended_chars), and some names contain hyphens etc. (special_chars). Count how many names you receive and check whether the page gives you the same count as I got for 'A'.
The range 0x80-0x7FF covers the code points encoded as 2-byte sequences in UTF-8, which should include all European languages. Among the pyparsing examples there is greetingInGreek.py for Greek and another example for parsing Korean text.
If 2 bytes are not enough then try:
extended_chars = u''.join(unichr(c) for c in xrange(127, 65536, 1))
Are you sure that writing your own parser to pick bits out of HTML is the best option? You might find it easier to use a dedicated HTML parser such as Beautiful Soup, which lets you specify the location you're interested in using the DOM; pulling the text from the first link inside a table cell with class "listcontentlight_left" is quite easy:
soup = BeautifulSoup(htmlDocument)
cells = soup.findAll("td", "listcontentlight_left")
for cell in cells:
    print cell.a.string
Looks like you've got some kind of encoding problem if western European names come through OK (they have lots of accents etc. too!). Show us all of your code plus the URL of a typical page that you are trying to scrape that has the East-only problem. Displaying the piece of HTML that you have is not much use; we have no idea what transformations it has been through; at the very least, use the result of the repr() function.
Update: The offending character in that MEP's name is U+0116 (LATIN CAPITAL LETTER E WITH DOT ABOVE), so it is not included in pyparsing's "alphanums + alphas8bit". The Westies (latin-1) will all fit in what you've got already. I know little about pyparsing; you'll need to find a pyparsing expression that includes ALL Unicode alphabetics ... not just Latin-n, in case they start using Cyrillic for the Bulgarian MEPs instead of the current transcription into ASCII :-)
Other observations:
(1) alphaNUMs ... digits in a name?
(2) names may include apostrophe and hyphen e.g. O'Reilly, Foughbarre-Smith
at first i thought i’d recommend to try and build a custom letter class from python’s unicodedata.category method, which, when given a character, will tell you which class that codepoint is assigned to according to the unicode character categories; this would tell you whether a codepoint is e.g. an uppercase or lowercase letter, a digit or whatever.
on second thought and reminiscent of an answer i gave the other day, let me suggest another approach. there are many implicit assumptions we have to get rid of when going from national to global; one of them is certainly that ‘a character equals a byte’, and one other is that ‘a person’s name is made up of letters, and i know what the possible letters are’. unicode is vast, and the eu currently has 23 official languages written in three alphabets; exactly what characters are used for each language will involve quite a bit of work to figure out. greek uses those fancy apostrophes and is distributed across at least 367 codepoints; bulgarian uses the cyrillic alphabet with a slew of extra characters unique to the language.
so why not simply turn the tables and take advantage of the larger context those names appear in? i browsed through some sample data and it looks like the general pattern for MEP names is LASTNAME, Firstname with (1) the last name in (almost) upper case; (2) a comma and a space; (3) the given names in ordinary case. this even holds in more ‘deviant’ examples like GERINGER de OEDENBERG, Lidia Joanna, GALLAGHER, Pat the Cope (wow), McGUINNESS, Mairead. It would take some work to recover the ordinary case from the last names (maybe leave all the lower case letters in place, and lower-case any capital letter that is preceded by another capital letter), but to extract the names is, in fact, simple:
fullname := lastname ", " firstname
lastname := character+
firstname := character+
that’s right—since the EUP was so nice to present names enclosed in an HTML tag, you already know the maximum extent of it, so you can just cut out that maximum extent and split it up in two parts. as i see it, all you have to look for is the first occurrence of a sequence of comma, space—everything before that is the last, anything behind that the given names of the person. i call that the ‘silhouette approach’ since it’s like looking at the negative, the outline, rather than the positive, what the form is made up from.
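a sketch of that silhouette approach (the function name is mine); everything before the first comma-space is the last name, everything after it the given names:

```python
def split_mep_name(fullname):
    # Split on the FIRST occurrence of ", ": the (mostly upper-case)
    # last name comes before it, the given names after it.
    lastname, _, firstname = fullname.partition(", ")
    return lastname, firstname

print(split_mep_name("GERINGER de OEDENBERG, Lidia Joanna"))
# ('GERINGER de OEDENBERG', 'Lidia Joanna')
print(split_mep_name("GALLAGHER, Pat the Cope"))
# ('GALLAGHER', 'Pat the Cope')
```

str.partition stops at the first match, so given names containing further commas or spaces survive intact.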
as has been noted earlier, some names use hyphens; now there are several codepoints in unicode that look like hyphens. let’s hope the typists over there in brussels were consistent in their usage. ah, and there are many surnames using apostrophes, like d'Hondt, d'Alembert. happy hunting: possible incarnations include U+0060, U+00B4, U+0027, U+02BC and a fair number of look-alikes. most of these codepoints would be ‘wrong’ to use in surnames, but when was the last time you saw those used correctly?
i somewhat distrust that alphanums + alphas8bit + extended_chars + special_chars pattern; at least that alphanums part is a tad bogey as it seems to include digits (which ones? unicode defines a few hundred digit characters), and that alphas8bit thingy does reek of a solvent made for another time. unicode conceptually works in a 32bit space. what’s 8bit intended to mean? letters found in codepage 852? c’mon this is 2010.
ah, and looking back i see you seem to be parsing the HTML with pyparsing. don’t do that. use e.g. beautiful soup for sorting out the markup; it’s quite good at dealing even with faulty HTML (most HTML in the wild does not validate) and once you get your head around its admittedly wonderlandish API (all you ever need is probably the find() method) it will be simple to fish out exactly those snippets of text you’re looking for.
Even though BeautifulSoup is the de facto standard for HTML parsing, pyparsing has some alternative approaches that lend themselves to HTML too (certainly a leg up over brute-force regexps). One function in particular is makeHTMLTags, which takes a single string argument (the base tag) and returns a 2-tuple of pyparsing expressions, one for the opening tag and one for the closing tag. Note that the opening tag expression does far more than just return the equivalent of "<"+tag+">". It also:
- handles upper/lower casing of the tag itself
- handles embedded attributes (returning them as named results)
- handles attribute names that have namespaces
- handles attribute values in single, double, or no quotes
- handles empty tags, as indicated by a trailing '/' before the closing '>'
- can be filtered for specific attributes using the withAttribute parse action
So instead of trying to match the specific name content, I suggest you try matching the surrounding <a> tag, and then accessing the title attribute. Something like this:
aTag,aEnd = makeHTMLTags("a")
for t,_,_ in aTag.scanString(page):
    if ";id=" in t.href:
        print t.title
Now you get whatever is in the title attribute, regardless of character set.

Building proper link with spaces

I have the following code in Python:
linkHTML = '<a href="%s">click here</a>' % strLink
The problem is that when strLink has spaces in it, the resulting href is cut off at the first space and the link is broken.
I can use strLink.replace(" ","+")
But I am sure there are other characters which can cause errors. I tried using
urllib.quote(strLink)
But it doesn't seem to help.
Thanks!
Joel
Make sure you use urllib.quote_plus(string[, safe]) to replace spaces with plus signs.
urllib.quote_plus(string[, safe])
Like quote(), but also replaces spaces by plus signs, as required for quoting HTML form values when building up a query string to go into a URL. Plus signs in the original string are escaped unless they are included in safe. It also does not have safe default to '/'.
from http://docs.python.org/library/urllib.html#urllib.quote_plus
Ideally you'd use the urllib.urlencode function, passing it a sequence of key/value pairs such as [("q", "with space"), ("s", "with space & other")].
As well as quote_plus(*), you also need to HTML-encode any text you output into HTML. Otherwise < and & symbols will be treated as markup, with potential security consequences. (OK, you're not going to get < in a URL, but you definitely are going to get &, so it takes just one parameter name that matches an HTML entity name and your string's messed up.)
html = '<a href="?q=%s">click here</a>' % cgi.escape(urllib.quote_plus(q))
*: actually plain old quote is fine too; I don't know what wasn't working for you, but it is a perfectly good way of URL-encoding strings. It converts spaces to %20 which is also valid, and valid in path parts too. quote_plus is optimal for generating query strings, but otherwise, when in doubt, quote is safest.
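For reference, a Python 3 sketch of the same idea (quote_plus and urlencode moved to urllib.parse, and cgi.escape's replacement is html.escape):

```python
from urllib.parse import quote_plus, urlencode
from html import escape

q = "with space & other"
# URL-encode the value first, then HTML-escape the result before
# embedding it in markup.
link = '<a href="?q=%s">click here</a>' % escape(quote_plus(q))
print(link)
# <a href="?q=with+space+%26+other">click here</a>

# urlencode builds a whole query string from key/value pairs:
print(urlencode([("q", "with space"), ("s", "with space & other")]))
# q=with+space&s=with+space+%26+other
```

Note the order matters: URL-encoding happens first, HTML-escaping last, since the HTML layer wraps the URL.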