I've seen a lot of different posts for handling accented characters, but none that specifically find accented characters in a corpus of text. I'm trying to identify words in the text like nǚ, but the code should not include non-Latin-alphabet results. Ex: 女 should not be selected. The string I'm using for testing is:
"nǚ – woman; girl; daughter; female. A pictogram of a woman with her arms stretched. In old versions she was seated on her knees. It is a radical that forms part tón of characters related to women and their qualities. 女儿 nǚ'ér – daughter (woman + child) ǚa"
A working regex should select:
nǚ
nǚ'ér
ǚa
tón
Note:
There is a similar question here, but the problem is different. This person is just having trouble using regex with accents.
To match the accented letter, from this post you can use
[\u00C0-\u017F]
[À-ÖØ-öø-ÿ]
ǚ is not included in but you can extend unicode range to its value : [\u00C0-\u01DA]
' is not an accent you have to add it manually
Giving final \w*[\u00C0-\u01DA']\w* and Code Demo
A generic solution for Cyrillic, Arabic, etc. would be
[x for x in re.findall(r"\b[^\W\d_]+(?:['’][^\W\d_]+)*\b", s)
if re.search(r'[A-Za-z]',x) and re.search(r'(?![a-zA-Z])[^\W\d_]',x)]
re.findall(r"\b[^\W\d_]+(?:['’][^\W\d_]+)*\b" - finds all words that may contain apostrophes
if re.search(r'[A-Za-z]',x) - make sure there is a letter from ASCII range
re.search(r'(?![a-zA-Z])[^\W\d_]',x) - also, make sure there is a letter outside of ASCII range.
Related
I have been trying to extract certain text from PDF converted into text files. The PDF came from various sources and I don't know how they were generated.
The pattern I was trying to extract was a simply two digits, follows by a hyphen, and then another two digits, e.g. 12-34. So I wrote a simple regex \d\d-\d\d and expected that to work.
However when I test it I found that it missed some hits. Later I noted that there are at least two hyphens represented as \u2212 and \xad. So I changed my regex to \d\d[-\u2212\xad]\d\d and it worked.
My question is, since I am going to extract so many PDF that I don't know what other variations of hyphen are out there, is there any regex expression covering all "hyphens", and hopefully looks better than the [-\u2212\xad] expression?
The solution you ask for in the question title implies a whitelisting approach and means that you need to find the chars that you think are similar to hyphens.
You may refer to the Punctuation, Dash Category, that Unicode cateogry lists all the Unicode hyphens possible.
You may use a PyPi regex module and use \p{Pd} pattern to match any Unicode hyphen.
Or, if you can only work with re, use
[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]
You may expand this list with other Unicode chars that contain minus in their Unicode names, see this list.
A blacklisting approach means you do not want to match specific chars between the two pairs of digits. If you want to match any non-whitespace, you may use \S. If you want to match any punctuation or symbols, use (?:[^\w\s]|_).
Note that the "soft hyphen", U+00AD, is not included into the \p{Pd} category, and won't get matched with that construct. To include it, create a character class and add it:
[\xAD\p{Pd}]
[\xAD\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]
This is also a possible solution, if your regex engine allows it
/\p{Dash}/u
This will include all these characters.
I am trying to find words ending with 'ing' in the following
sentence = "Playing outdoor games when its raining outside is always fun!"
Now this is not my question itself as I found the necessary regex pattern to do it- (r'\b([A-z]+ing)\b').
The thing is I'm unable to understand why the above works but not what I tried below:
re.findall('([A-z]+ing)$',"Playing outdoor games when it's raining outside is always fun!")
Returns empty list even though the below doesn't
re.findall('([A-z]+ing)$','amazing')
Returns amazing
So this pattern can match single words ending with 'ing' but not words in sentences? Why?
What I found even more weird is this:
re.findall('\b([A-z]+ing)\b',"Playing outdoor games when it's raining outside is always fun!")
returns no matches (empty list). The only difference is not using the raw string notation (r)
I thought the 'r' notation was only necessary when we want to escape backslashes. So in that case:
Pattern1 - '\b([A-z]+ing)\b' should match playing, raining etc. instead of
Pattern2- r'\b([A-z]+ing)\b'
What exactly have I understood wrongly? I searched a lot of Stack Overflow answers and the official Python regex documentation and now I am more confused than when I started out particularly regarding the use of 'r'.
The $ matches end of line or end of whole text (depending on flag setting, here: only end of text). Using it right after the "ing" forces that the "ing" must appear at the end.
Raw string notation lets the escaped characters like \b go through to the underlying function (here: findall) to be processed further (here: as a special regex code for word boundary).
Without raw string notation, \b is the BACKSPACE control code (hex 0x08). This character is processed by the regex engine as a simple match of itself.
Using [A-z] to match all letters is also not right. It actually means to match any character in the Unicode table between A and z. As you can see here this includes e.g. [, ^ and \. If you only want the ASCII letters, use [A-Za-z] instead. If you want all Unicode word characters (letters and digits in any supported language and underscore) use \w.
To play around with regular expressions there is e.g. https://regex101.com/
I am using nltk.word_tokenize in Dari language. The problem is that we have space between one word.
For example the word "زنده گی" which means life. And the same; we have many other words. All words which end with the character "ه" we have to give a space for it, otherwise, it can be combined such as "زندهگی".
Can anyone help me using [tag:regex] or any other way that should not tokenize the words that a part of one word ends with "ه" and after that, there will be the "گ " character.
To resolve this problem in Persian we have a character calls Zero-width_non-joiner (or نیمفاصله in Persian or half space or semi space) which has two symbol codes. One is standard and the other is not standard but widely used :
\u200C : http://en.wikipedia.org/wiki/Zero-width_non-joiner
\u200F : Right-to-left mark (http://unicode-table.com/en/#200F)
As I know Dari is very similar to Persian. So first of all you should correct all the words like زنده گی to زندهگی and convert all wrong spaces to half spaces then you can simply use this regex to match all words of a sentence:
[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F]+
Online demo (the black bullet in test string is half space which is not recognizable for regex101 but if you check the match information part and see Match 5 you will see that is correct)
For converting wrong spaces of a huge text to half spaces there is an add on for Microsoft word calls virastyar which is free and open source. You can install it and refine your whole text. But consider this add on is created for Persian and not Dari. For example In Persian we write زندهگی as زندگی and it can not correct this word for you. But the other words like می شود would easily corrects and converts to میشود. Also you can add custom words to the database.
I'm working with a corpus linguistics tool called AntConc, where you have a document where every word is tagged as a part of speech (noun, adjective, etc), and you use specific commands to pull out matches. For example, if I was looking for a noun (which is tagged NN), I would use *_NN and it would find every noun in the document.
I need to translate my *_TAG syntax into python regex, and I have no idea how to do that. For example, I have a phrase: *_PP$ *_NN *_DT *_JJ *_NN (this translates to possessive pronoun, noun, determiner, adjective, noun; it would find things like "her voice an exact duplicate") in TAG format.
How does one go about changing things like that to regex? For now, I'll take just that basic stuff. Later I'll worry about figuring out how to do "or" and "if this then this" and whatnot.
If you need more info about the tags, try searching for POS tags CLAWS, which should give you a list.
Thanks so much for your help!
So I did some research and found this PDF file describing the notion of embedded tags and non-embedded tags. You are looking to find the embedded tags. So if I'm correct the input would be like this right?
her_PP$ voice_NN an_DT exact_JJ duplicate_NN
Only then in a larger body of text and you don't know the actual words, you just know the _XX tags.
In a regex, you have to be more specific then *. What you want in the place of the * is 1 or more of any character that is part of a word (letters, but could also contain hyphens maybe?). That makes this for the noun:
[\w-]+_NN
This means a character class [...] of word characters \w and the hyphen -, repeated one or more times +, followed by _NN.
For the possessive pronoun, it has a $ in there which has a special meaning in regexes, if you want the character $ and not its special meaning, you need to escape it with a preceding \ like so:
[\w-]+_PP\$
Lastly you want to consider which characters are allowed in between the words. Could be just white-space like spaces, tabs and enters, which would be \s+. Could also be "any character that isn't a word character" to allow periods, commas, quotes, colons, etc. That would be \W+ (note the upper case W to be the opposite of the lowercase \w).
Combined this would amount to this:
[\w-]+_PP\$\W+[\w-]+_NN\W+[\w-]+_DT\W+[\w-]+_JJ\W+[\w-]+_NN
Debuggex Demo
To do "an undetermined amount of unknown words" you would do this:
(?:[\w-]+\W+)*?
So the part that matches the word [\w-]+ and the part that goes in between \W+ are wrapped into a non-capturing group (?:...) and that group is said to occur 0 or more times with the * but as few times as possible with ? to avoid greediness. You can see it here and remove or add an X to see it will still match.
EDIT: thanks a lot for all the answers an points raised. As a novice I am a bit overwhelmed, but it is a great motivation for continuing learning python!!
I am trying to scrape a lot of data from the European Parliament website for a research project. The first step is to create a list of all parliamentarians, however due to the many Eastern European names and the accents they use i get a lot of missing entries. Here is an example of what is giving me troubles (notice the accents at the end of the family name):
<td class="listcontentlight_left">
ANDRIKIENĖ, Laima Liucija
<br/>
Group of the European People's Party (Christian Democrats)
<br/>
</td>
So far I have been using PyParser and the following code:
#parser_names
name = Word(alphanums + alphas8bit)
begin, end = map(Suppress, "><")
names = begin + ZeroOrMore(name) + "," + ZeroOrMore(name) + end
for name in names.searchString(page):
print(name)
However this does not catch the name from the html above. Any advice in how to proceed?
Best, Thomas
P.S: Here is all the code i have so far:
# -*- coding: utf-8 -*-
import urllib.request
from pyparsing_py3 import *
page = urllib.request.urlopen("http://www.europarl.europa.eu/members/expert/alphaOrder.do?letter=B&language=EN")
page = page.read().decode("utf8")
#parser_names
name = Word(alphanums + alphas8bit)
begin, end = map(Suppress, "><")
names = begin + ZeroOrMore(name) + "," + ZeroOrMore(name) + end
for name in names.searchString(page):
print(name)
I was able to show 31 names starting with A with code:
extended_chars = srange(r"[\0x80-\0x7FF]")
special_chars = ' -'''
name = Word(alphanums + alphas8bit + extended_chars + special_chars)
As John noticed you need more unicode characters (extended_chars) and some names have hypehen etc. (special chars). Count how many names you received and check if page has the same count as I do for 'A'.
Range 0x80-0x87F encode 2 bytes sequences in utf8 of probably all european languages. In pyparsing examples there is greetingInGreek.py for Greek and other example for Korean texts parsing.
If 2 bytes are not enough then try:
extended_chars = u''.join(unichr(c) for c in xrange(127, 65536, 1))
Are you sure that writing your own parser to pick bits out of HTML is the best option? You might find it easier to use a dedicated HTML parser. Beautiful Soup which lets you specify the location you're interested in using the DOM, so pulling the text from the first link inside a table cell with class "listcontentlight_left" is quite easy:
soup = BeautifulSoup(htmlDocument)
cells = soup.findAll("td", "listcontentlight_left")
for cell in cells:
print cell.a.string
Looks like you've got some kind of encoding problem if you are getting western European names OK (they have lots of accents etc also!). Show us all of your code plus the URL of a typical page that you are trying to scrape and has the East-only problem. Displaying the piece of html that you have is not much use; we have no idea what transformations it has been through; at the very least, use the result of the repr() function.
Update The offending character in that MEP's name is U+0116 (LATIN LETTER CAPITAL E WITH DOT ABOVE). So it is not included in pyparsing's "alphanums + alphas8bit". The Westies (latin-1) will all fit in what you've got already. I know little about pyparsing; you'll need to find a pyparsing expression that includes ALL unicode alphabetics ... not just Latin-n in case they start using Cyrillic for the Bulgarian MEPs instead of the current transcription into ASCII :-)
Other observations:
(1) alphaNUMs ... digits in a name?
(2) names may include apostrophe and hyphen e.g. O'Reilly, Foughbarre-Smith
at first i thought i’d recommend to try and build a custom letter class from python’s unicodedata.category method, which, when given a character, will tell you what class that codepoint is assigned to acc to the unicode character category; this would tell you whether a codepoint is e.g. an uppercase or lowercase letter, a digit or whatever.
on second thought and remiscent of an answer i gave the other day, let me suggest another approach. there are many implicit assumptions we have to get rid of when going from national to global; one of them is certainly that ‘a character equals a byte’, and one other is that ‘a person’s name is made up of letters, and i know what the possible letters are’. unicode is vast, and the eu currently has 23 official languages written in three alphabets; exactly what characters are used for each language will involve quite a bit of work to figure out. greek uses those fancy apostrophies and is distributed across at least 367 codepoints; bulgarian uses the cyrillic alphabet with a slew of extra characters unique to the language.
so why not simply turn the tables and take advantage of the larger context those names appear in? i brosed through some sample data and it looks like the general pattern for MEP names is LASTNAME, Firstname with (1) the last name in (almost) upper case; (2) a comma and a space; (3) the given names in ordinary case. this even holds in more ‘deviant’ examples like GERINGER de OEDENBERG, Lidia Joanna, GALLAGHER, Pat the Cope (wow), McGUINNESS, Mairead. It would take some work to recover the ordinary case from the last names (maybe leave all the lower case letters in place, and lower-case any capital letters that are preceded by another capital letters), but to extract the names is, in fact simple:
fullname := lastname ", " firstname
lastname := character+
firstname := character+
that’s right—since the EUP was so nice to present names enclosed in an HTML tag, you already know the maximum extent of it, so you can just cut out that maximum extent and split it up in two parts. as i see it, all you have to look for is the first occurrence of a sequence of comma, space—everything before that is the last, anything behind that the given names of the person. i call that the ‘silhouette approach’ since it’s like looking at the negative, the outline, rather than the positive, what the form is made up from.
as has been noted earlier, some names use hyphens; now there are several codepoints in unicode that look like hyphens. let’s hope the typists over there in brussels were consistent in their usage. ah, and there are many surnames using apostrophes, like d'Hondt, d'Alambert. happy hunting: possible incarnations include U+0060, U+00B4, U+0027, U+02BC and a fair number of look-alikes. most of these codepoints would be ‘wrong’ to use in surnames, but when was the last time you saw thos dits used correctly?
i somewhat distrust that alphanums + alphas8bit + extended_chars + special_chars pattern; at least that alphanums part is a tad bogey as it seems to include digits (which ones? unicode defines a few hundred digit characters), and that alphas8bit thingy does reek of a solvent made for another time. unicode conceptually works in a 32bit space. what’s 8bit intended to mean? letters found in codepage 852? c’mon this is 2010.
ah, and looking back i see you seem to be parsing the HTML with pyparsing. don’t do that. use e.g. beautiful soup for sorting out the markup; it’s quite good at dealing even with faulty HTML (most HTML in the wild does not validate) and once you get your head about it’s admittedly wonderlandish API (all you ever need is probably the find() method) it will be simple to fish out exactly those snippets of text you’re looking for.
Even though BeautifulSoup is the de facto standard for HTML parsing, pyparsing has some alternative approaches that lend themselves to HTML too (certainly a leg up over brute force reg exps). One function in particular is makeHTMLTags, which takes a single string argument (the base tag), and returns a 2-tuple of pyparsing expressions, one for the opening tag and one for the closing tag. Note that the opening tag expression does far more than just return the equivalent of "<"+tag+">". It also:
handles upper/lower casing of the tag
itself
handles embedded attributes
(returning them as named results)
handles attribute names that have
namespaces
handles attribute values in single, double, or no quotes
handles empty tags, as indicated by a
trailing '/' before the closing '>'
can be filtered for specific
attributes using the withAttribute
parse action
So instead of trying to match the specific name content, I suggest you try matching the surrounding <a> tag, and then accessing the title attribute. Something like this:
aTag,aEnd = makeHTMLTags("a")
for t,_,_ in aTag.scanString(page):
if ";id=" in t.href:
print t.title
Now you get whatever is in the title attribute, regardless of character set.