Python regex that matches superscripted text - python

My question is very simple but I couldn't figure it out by myself: how do I match superscripted text with regex in Python? I'd like to match patterns like [a-zA-Z0-9,[]] but only when the text is superscripted.

The main problem is that information about "superscript" and "subscript" text is not conveyed at the character level.
Unicode does define some dedicated superscript and subscript characters: most notably, every decimal digit has a superscript counterpart, but only a handful of other Latin letters have a full superscript or subscript character with its own code point. See: https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts
So, if you want to match digits only, you could just put the corresponding characters in the regular expression: U+2070 and U+2074 through U+2079, plus the Latin-1 characters U+00B9, U+00B2 and U+00B3 - the table in the linked Wikipedia article has all the characters.
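For the digit-only case, a minimal sketch with the standard re module (the exact code points come from the table linked above):

```python
import re

# Superscript digits: U+2070 (zero), U+00B9/U+00B2/U+00B3 (1-3, from Latin-1),
# and U+2074-U+2079 (4-9).
SUPERSCRIPT_DIGITS = re.compile(u"[\u2070\u00b9\u00b2\u00b3\u2074-\u2079]+")

print(SUPERSCRIPT_DIGITS.findall(u"x\u00b2 + y\u00b3 = z\u2074"))
```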
For the remaining characters, superscript is formatting information carried in a protocol separate from the characters themselves: for example, if you are dealing with HTML markup, it is the text inside <sup> </sup> tags.
As with HTML, wherever superscript text occurs it has to be marked up in a text protocol "outside" of the characters themselves - and therefore outside what a regular expression over the characters alone can see.
If you are dealing with HTML text, you can search your text for the "<sup>" tag, for example. However, if it is formatted text inside a web page, there are tens of ways of marking the superscript text, as the super-script transformation can be specified in CSS, and the CSS may be applied to the page-text in several different ways.
Other text protocols exist that can encode superscript text, such as rich text (RTF files). In any case, you have to know how the text you are dealing with is encoded, and how it marks up superscript text, before a proper regular expression can be built.
If it is plain HTML using "<sup>" tags, it could be as simple as:
re.findall(r"<sup.*?>(.*?)</sup>", html_text)
Otherwise, you should inspect your text stream, find out the superscript markup, and use an appropriate regexp - or an altogether better parsing tool (for HTML/XML you are usually better off using BeautifulSoup or other XML tools than regexps, for example).
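For illustration, a minimal run of that pattern against a made-up HTML snippet:

```python
import re

# Hypothetical sample text containing two <sup> elements.
html = 'E = mc<sup>2</sup> and x<sup class="small">n</sup>'
print(re.findall(r"<sup.*?>(.*?)</sup>", html))  # ['2', 'n']
```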
And, of course, that applies if the information for which text is superscripted is embedded in the text channel, as some kind of markup. It might be on a side-channel: another data block telling at which text indexes the effect of superscript should apply. In this case, you essentially have to figure that out, and then use that information directly.

Related

Issues with re.search and unicode in python [duplicate]

I have been trying to extract certain text from PDF converted into text files. The PDF came from various sources and I don't know how they were generated.
The pattern I was trying to extract was simple: two digits, followed by a hyphen, then another two digits, e.g. 12-34. So I wrote a simple regex \d\d-\d\d and expected it to work.
However, when I tested it, I found that it missed some hits. Later I noticed that there are at least two other characters used as hyphens, \u2212 and \xad. So I changed my regex to \d\d[-\u2212\xad]\d\d and it worked.
My question is: since I am going to extract so many PDFs that I can't know what other variations of hyphen are out there, is there a regex expression covering all "hyphens" - hopefully one that looks better than [-\u2212\xad]?
The solution you ask for in the question title implies a whitelisting approach: you need to enumerate the characters you consider hyphen-like.
You may refer to the Punctuation, Dash category; that Unicode category lists all possible Unicode hyphens.
You may use the PyPI regex module with the \p{Pd} pattern to match any Unicode dash.
Or, if you can only work with re, use
[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]
You may expand this list with other Unicode chars that contain minus in their Unicode names, see this list.
A blacklisting approach means you do not want to match specific chars between the two pairs of digits. If you want to match any non-whitespace, you may use \S. If you want to match any punctuation or symbols, use (?:[^\w\s]|_).
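A quick sketch of the blacklisting approach with \S, which happens to cover the asker's \u2212 case as well (the sample string is made up):

```python
import re

# \S matches any non-whitespace character, so the U+2212 minus sign
# between digits matches too; the space-separated "90 12" does not.
hits = re.findall(r"\d\d\S\d\d", u"12-34 56\u221278 90 12")
print(hits)
```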
Note that the "soft hyphen", U+00AD, is not included in the \p{Pd} category and won't get matched by that construct. To include it, create a character class and add it (keeping the ASCII hyphen first, so that it is read as a literal rather than as a range operator):
[\xAD\p{Pd}]
[\u002D\xAD\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]
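A sketch of using such an extended class with the standard re module (the ASCII hyphen is kept first in the class so it stays literal; the sample strings are made up):

```python
import re

# Pd dashes plus the soft hyphen (U+00AD).
HYPHENS = (u"\u002D\u00AD\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A"
           u"\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D")
pattern = re.compile(u"\\d\\d[" + HYPHENS + u"]\\d\\d")

# Both the soft hyphen (U+00AD) and the Unicode hyphen (U+2010) match.
print(pattern.findall(u"12\u00ad34 and 56\u201078"))
```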
Another possible solution, if your regex engine supports Unicode property escapes (this is JavaScript syntax, for example, not Python's re):
/\p{Dash}/u
This will include all of these characters.

Split Documents into Paragraphs

I have a large stockpile of PDFs of documents. I use Apache Tika to convert them to text, and now I'd like to split them into paragraphs. I can't use regular expressions because the text conversion makes the distinction between paragraphs impossible: some documents have the standard way of a \n between paragraphs, but some have a \n between lines in the same paragraph and then a double \n between paragraphs (using Tika's conversion to HTML instead of text does not help).
Python's NLTK book has a way of splitting sentences using machine learning, so I thought of trying something similar with paragraphs, but I couldn't find training data for that.
Is there training data for that? Or should I try some complex regular expression that might work?
I will try to give an easier way to deal with your problem: check whether the file uses a double \n between paragraphs. If it does, treat the double \n as the paragraph separator; if not, treat each single \n as one.
Rough example to detect which convention a file uses:
f = open('yourfile', 'r')
a = f.read()
f.close()

# temp == 1 if paragraphs are separated by a double \n, otherwise 0
temp = 0
for z in range(len(a) - 1):
    if a[z:z + 2] == '\n\n':
        temp = 1
        break
After this you can use simple string replacement to check for single or double \n and replace them according to your need, to distinguish new lines from new paragraphs. (Please read the file in chunks if it is very big, otherwise you might have memory problems or slow code.)
You say
some documents have the standard way of a \n between paragraphs, but some have a \n between lines in the same paragraph and then a double \n between paragraphs
so I would preprocess all the files to detect which ones use the double newline between paragraphs. The files with double \n need to be stripped of all single newline characters, and all double newlines reduced to single ones.
You can then pass all the files to the next stage where you detect paragraphs using a single \n character.
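A sketch of that preprocessing step (normalize_paragraph_breaks is a made-up helper name, and the sample text is invented):

```python
import re

def normalize_paragraph_breaks(text):
    """Normalize text so that exactly one newline separates paragraphs."""
    if "\n\n" in text:
        # Double-newline style: lines inside a paragraph are hard-wrapped.
        # Join wrapped lines with a space, then use one '\n' per paragraph break.
        paragraphs = re.split(r"\n\s*\n", text)
        return "\n".join(" ".join(p.split("\n")) for p in paragraphs)
    return text  # already one paragraph per line

sample = "First line\nof paragraph one.\n\nParagraph two."
print(normalize_paragraph_breaks(sample))
```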
from nltk import tokenize

paragraph = 'para here'
sentences = tokenize.sent_tokenize(paragraph)
# sentences is a list of sentences - that's all you need

preventing conflict between math symbols and html code

I want to present text out of math publications and every now and then I get something like
O(1/N_f) Corrections to the Thirring Model in 2<d<4<
The last part will be misinterpreted as HTML. I have to paste this text directly on the website, which allows HTML. The reason I need to allow HTML is that I use Elasticsearch and I want to highlight the search results (Elasticsearch puts tags in the text), so I can't just prevent HTML interpretation of the text altogether.
However, I can pre-process the text to prevent any conflict. For example above all conflict is avoided by using
text.replace('<', " < ")
in python. However, this is far from optimal since
1. It will introduce spaces even when they are not needed
2. It only accounts for this particular collision between math symbols and html
Since I figure I am not the first person who encounters this I was wondering whether there is a general solution for such a problem?
Use the xml.sax.saxutils.escape function:
import xml.sax.saxutils
escaped = xml.sax.saxutils.escape(text)
This escapes '&', '<', and '>' in the string.
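Applied to the title from the question:

```python
import xml.sax.saxutils

title = "O(1/N_f) Corrections to the Thirring Model in 2<d<4"
escaped = xml.sax.saxutils.escape(title)
print(escaped)  # O(1/N_f) Corrections to the Thirring Model in 2&lt;d&lt;4
```

The same module also provides unescape for the reverse direction, should you need to recover the original text.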
HTML has several characters with special meanings (including angle brackets), and is also usually represented in ASCII, so a good way to represent these special characters is needed.
In HTML, escape sequences are used to represent them. For example, the & character is represented by the named escape sequence &amp; or the numerical escape sequence &#38;. Only the most common special characters have named escape sequences.
Here's a good list of sequences.

Accommodate two types of quotes in a regex

I am using a regex to replace quotes within an input string. My data contains two 'types' of quotes -
" and “
There's a very subtle difference between the two. Currently, I am explicitly mentioning both these types in my regex
\"*\“*
I am afraid though that in future data I may get a different 'type' of quote on which my regex may fail. How many different types of quotes exist? Is there a way to normalize these to just one type so that my regex won't break on unseen data?
Edit -
My input data consists of HTML files and I am escaping HTML entities and URLs to ASCII
escaped_line = HTMLParser.HTMLParser().unescape(urllib.unquote(line.decode('ascii','ignore')))
where line specifies each line in the HTML file. I need to 'ignore' the ASCII as all files in my database don't have the same encoding and I don't know the encoding prior to reading the file.
Edit2
I am unable to do so using the replace function. I tried replace('"','') but it doesn't replace the other type of quote, '“'. If I add that in another replace call, it throws a non-ASCII character error.
Condition
No external libraries allowed, only native python libraries could be used.
I don't think there is a "quotation marks" character class in Python's regex implementation so you'll have to do the matching yourself.
You could keep a list of common quotation mark unicode characters (here's a list for a good start) and build the part of regex that matches quotation marks programmatically.
I can only help you with the original question about quotation marks. As it turns out, Unicode defines many properties per character, and these are all available through the Unicode Character Database. "Quotation mark" is one of these properties.
How many different types of quotes exist?
29, according to Unicode, see below.
The Unicode standard brings us a definitive text file on Unicode properties, PropList.txt, among which a list of quotation marks. Since Python does not support all Unicode properties in regular expressions, you cannot currently use \p{QuotationMark}. However, it's trivial to create a regular expression character class:
# placed on multiple lines for readability; remove the spaces
# and then place it in your regex in place of the current quotes
[\u0022 \u0027 \u00AB \u00BB
\u2018 \u2019 \u201A \u201B
\u201C \u201D \u201E \u201F
\u2039 \u203A \u300C \u300D
\u300E \u300F \u301D \u301E
\u301F \uFE41 \uFE42 \uFE43
\uFE44 \uFF02 \uFF07 \uFF62
\uFF63]
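The class above can be used directly with the standard re module if you build it as a string:

```python
import re

# The 29 Unicode quotation-mark code points from the list above.
QUOTES = (u"\u0022\u0027\u00AB\u00BB\u2018\u2019\u201A\u201B\u201C\u201D"
          u"\u201E\u201F\u2039\u203A\u300C\u300D\u300E\u300F\u301D\u301E"
          u"\u301F\uFE41\uFE42\uFE43\uFE44\uFF02\uFF07\uFF62\uFF63")
quote_re = re.compile(u"[" + QUOTES + u"]")

# Strip both straight and curly quotes from a made-up sample.
print(quote_re.sub(u"", u'He said \u201chello\u201d and "bye"'))
```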
As "tchrist" pointed out above, you can save yourself the trouble by using Matthew Barnett's regex library which supports \p{QuotationMark}.
There is also a simpler tweak: make the pattern a Unicode string by prefixing it with u (in Python 2 the combined raw-Unicode prefix is written ur, not ru; in Python 3 all string literals are Unicode already).
regexp = ur'\"*\“*'
Make sure you use the re.UNICODE flag when you want to compile/search/match your regex to your string.
re.findall(regexp, string, re.UNICODE)
Don't forget to include the
#!/usr/bin/python
# -*- coding:utf-8 -*-
at the start of the source file to make sure unicode strings can be written in your source file.

Python, Unicode and parsing with lxml and how to deal with 35\xa0new

I am extracting a field from a webpage, and the tag's HTML text content looks like this...
35 new
In python the extracted data looks like this...
35\xa0new
How do I deal with Unicode in Python to convert it to a regular string?
"35 new"
What library do I use?
Thanks
Avoid working with regular strings whenever possible; unicodes are generally more useful for text, and there are many well-known solutions for manipulating and dealing with them.
You are getting unicode strings from the parser. You can replace certain characters if you prefer others. For example, your \xa0 is a non-breaking space, and you can replace it with a regular space:
text = text.replace(u"\xa0", u" ")
There could be many of these characters that you want to change, so it might be a long process of finding all the ones that occur in your data.
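One well-known shortcut that covers many such characters at once: NFKC normalization from the standard unicodedata module folds compatibility characters, including the non-breaking space, into their plain equivalents:

```python
import unicodedata

text = u"35\xa0new"
# NFKC maps U+00A0 (non-breaking space) to an ordinary space.
print(unicodedata.normalize("NFKC", text))  # 35 new
```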
