I am using nltk.word_tokenize on Dari text. The problem is that some single words are written with a space inside them.
For example, the word "زنده گی" means "life", and there are many other words like it. Whenever a part of a word ends with the character "ه", we have to put a space after it; otherwise the letters join, as in "زندهگی".
Can anyone help me, using regex or any other approach, avoid splitting such words, i.e. those where one part ends with "ه" and the next part starts with the character "گ"?
To solve this problem, Persian has a character called the zero-width non-joiner (نیمفاصله in Persian, also known as the half space or semi space). Two code points are used for it, one standard and one non-standard but widely used:
\u200C : http://en.wikipedia.org/wiki/Zero-width_non-joiner
\u200F : Right-to-left mark (http://unicode-table.com/en/#200F)
As far as I know, Dari is very similar to Persian. So first correct all words like زنده گی to زندهگی, converting every wrong space to a half space. Then you can simply use this regex to match all the words of a sentence:
[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F]+
Online demo (the black bullet in the test string is the half space, which regex101 cannot display; if you look at Match 5 in the match information, you will see that it is matched correctly)
To convert the wrong spaces in a large text to half spaces, there is a free, open-source add-on for Microsoft Word called virastyar. You can install it and clean up your whole text. Keep in mind, though, that this add-on was created for Persian, not Dari. For example, in Persian we write زندهگی as زندگی, so it cannot correct this word for you. Other words, such as می شود, are easily corrected and converted to میشود. You can also add custom words to its database.
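As a rough illustration of the two steps above (fix the spaces, then match words), here is a minimal Python sketch. It assumes the only case of interest is an ordinary space between "ه" (U+0647) and a following "گ" (U+06AF), as described in the question. The function names are hypothetical, and the rule will also join two genuinely separate words that happen to meet this way, so treat it as a starting point rather than a finished normaliser.

```python
import re

ZWNJ = "\u200c"   # zero-width non-joiner (half space)
HEH = "\u0647"    # the letter "ه"
GAF = "\u06af"    # the letter "گ"

# One or more ordinary spaces that sit between HEH and GAF.
SPACE_BETWEEN = re.compile(HEH + " +(?=" + GAF + ")")

# The word pattern from the answer above.
WORD = re.compile(r"[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F]+")


def join_half_spaces(text):
    """Replace the space in HEH + space + GAF with a half space (heuristic)."""
    return SPACE_BETWEEN.sub(HEH + ZWNJ, text)


def tokenize(text):
    """Normalise the spaces, then return every run matched by WORD."""
    return WORD.findall(join_half_spaces(text))


# "زنده گی" written with an ordinary space, followed by a second word.
sample = "\u0632\u0646\u062f\u0647 \u06af\u06cc \u0627\u0633\u062a"
tokens = tokenize(sample)
print(len(tokens))  # 2
```

The first token keeps its half space, so it stays one word for any later step that splits only on whitespace.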
I've seen a lot of different posts for handling accented characters, but none that specifically find accented characters in a corpus of text. I'm trying to identify words in the text like nǚ, but the code should not include non-Latin-alphabet results. Ex: 女 should not be selected. The string I'm using for testing is:
"nǚ – woman; girl; daughter; female. A pictogram of a woman with her arms stretched. In old versions she was seated on her knees. It is a radical that forms part tón of characters related to women and their qualities. 女儿 nǚ'ér – daughter (woman + child) ǚa"
A working regex should select:
nǚ
nǚ'ér
ǚa
tón
Note:
There is a similar question here, but the problem is different. This person is just having trouble using regex with accents.
To match accented letters, you can use either of these character classes (taken from this post):
[\u00C0-\u017F]
[À-ÖØ-öø-ÿ]
ǚ is not included in either of them, but you can extend the Unicode range up to its code point: [\u00C0-\u01DA]
' is not an accented letter, so you have to add it to the class manually.
This gives the final pattern \w*[\u00C0-\u01DA']\w* (Code Demo).
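To try the final pattern, here is a short Python sketch run on a shortened version of the test string from the question. The non-ASCII characters are written as \u escapes only to keep the snippet unambiguous; the variable names are illustrative.

```python
import re

# Accented Latin letters up to U+01DA, plus the ASCII apostrophe.
ACCENTED_WORD = re.compile(r"\w*[\u00C0-\u01DA']\w*")

# "nǚ – woman; forms part tón of. 女儿 nǚ'ér – daughter ǚa"
text = ("n\u01da \u2013 woman; forms part t\u00f3n of. "
        "\u5973\u513f n\u01da'\u00e9r \u2013 daughter \u01daa")

matches = ACCENTED_WORD.findall(text)
print(matches)  # ['nǚ', 'tón', "nǚ'ér", 'ǚa']
```

Two caveats: because the apostrophe is in the class, a plain word such as don't would also match, and since \w is Unicode-aware in Python 3, a CJK character written directly next to an accented letter with no space would be pulled into the match.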
A generic solution for Cyrillic, Arabic, etc. would be
[x for x in re.findall(r"\b[^\W\d_]+(?:['’][^\W\d_]+)*\b", s)
if re.search(r'[A-Za-z]',x) and re.search(r'(?![a-zA-Z])[^\W\d_]',x)]
re.findall(r"\b[^\W\d_]+(?:['’][^\W\d_]+)*\b" - finds all words that may contain apostrophes
if re.search(r'[A-Za-z]',x) - make sure there is a letter from ASCII range
re.search(r'(?![a-zA-Z])[^\W\d_]',x) - also, make sure there is a letter outside of ASCII range.
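For reference, the same filter can be wrapped in a function and run on a shortened version of the test string. This is only a sketch of the list comprehension above: the function name is hypothetical, and the typographic apostrophe is written as \u2019.

```python
import re

# Letters only (no digits, no underscore), with optional inner apostrophes.
WORD = re.compile(r"\b[^\W\d_]+(?:['\u2019][^\W\d_]+)*\b")
ASCII_LETTER = re.compile(r"[A-Za-z]")
NON_ASCII_LETTER = re.compile(r"(?![a-zA-Z])[^\W\d_]")


def accented_latin_words(s):
    """Keep words that contain both an ASCII letter and a non-ASCII letter."""
    return [w for w in WORD.findall(s)
            if ASCII_LETTER.search(w) and NON_ASCII_LETTER.search(w)]


# "nǚ – woman; tón 女儿 nǚ'ér ǚa"
s = "n\u01da \u2013 woman; t\u00f3n \u5973\u513f n\u01da'\u00e9r \u01daa"
result = accented_latin_words(s)
print(result)  # ['nǚ', 'tón', "nǚ'ér", 'ǚa']
```

Note that a word made only of accented letters (no ASCII letter at all) is dropped by this filter, and a word mixing ASCII letters with letters from another script would be kept.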
I am trying to find words ending with 'ing' in the following
sentence = "Playing outdoor games when its raining outside is always fun!"
Now this is not my question itself as I found the necessary regex pattern to do it- (r'\b([A-z]+ing)\b').
The thing is I'm unable to understand why the above works but not what I tried below:
re.findall('([A-z]+ing)$',"Playing outdoor games when it's raining outside is always fun!")
Returns empty list even though the below doesn't
re.findall('([A-z]+ing)$','amazing')
Returns amazing
So this pattern can match single words ending with 'ing' but not words in sentences? Why?
What I found even more weird is this:
re.findall('\b([A-z]+ing)\b',"Playing outdoor games when it's raining outside is always fun!")
returns no matches (empty list). The only difference is not using the raw string notation (r)
I thought the 'r' notation was only necessary when we want to escape backslashes. So in that case:
Pattern1 - '\b([A-z]+ing)\b' should match playing, raining etc. instead of
Pattern2- r'\b([A-z]+ing)\b'
What exactly have I understood wrongly? I searched a lot of Stack Overflow answers and the official Python regex documentation and now I am more confused than when I started out particularly regarding the use of 'r'.
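The $ matches at the end of a line or at the end of the whole text (depending on the flags; here, only at the end of the text). Putting it right after the "ing" requires the "ing" to be the last thing in the text. Your sentence ends with "fun!", so nothing matches, whereas 'amazing' does end with "ing".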
Raw string notation lets the escaped characters like \b go through to the underlying function (here: findall) to be processed further (here: as a special regex code for word boundary).
Without raw string notation, \b is the BACKSPACE control code (hex 0x08). This character is processed by the regex engine as a simple match of itself.
Using [A-z] to match all letters is also not right. It actually means to match any character in the Unicode table between A and z. As you can see here this includes e.g. [, ^ and \. If you only want the ASCII letters, use [A-Za-z] instead. If you want all Unicode word characters (letters and digits in any supported language and underscore) use \w.
To play around with regular expressions there is e.g. https://regex101.com/
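The points above can be checked with a few lines of Python. This is just a small demonstration using the sentence from the question, with [A-Za-z] substituted for [A-z].

```python
import re

sentence = "Playing outdoor games when it's raining outside is always fun!"

# Raw string: the regex engine receives backslash + b, i.e. a word boundary.
with_raw = re.findall(r"\b([A-Za-z]+ing)\b", sentence)
print(with_raw)  # ['Playing', 'raining']

# Non-raw string: Python turns '\b' into the single BACKSPACE character first.
print("\b" == "\x08", len("\b"), len(r"\b"))  # True 1 2
without_raw = re.findall("\b([A-Za-z]+ing)\b", sentence)
print(without_raw)  # []

# '$' anchors at the end of the text, and the sentence ends with "fun!".
anchored = re.findall(r"([A-Za-z]+ing)$", sentence)
print(anchored)  # []
print(re.findall(r"([A-Za-z]+ing)$", "amazing"))  # ['amazing']

# [A-z] also covers the punctuation that sits between 'Z' and 'a'.
print(bool(re.fullmatch(r"[A-z]", "^")))     # True
print(bool(re.fullmatch(r"[A-Za-z]", "^")))  # False
```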
I have a project where I'm trying to analyze a database of tweets. I need to write a python regex expression that pulls tweets mentioning specific twitter users. Here is an example tweet I'd like to capture.
"That #A_Person is a real jerk."
The regex that I've been trying is
([^.?!]*)(\b([#]A_Person)\b)([^.?!]*)
But it's not working and I've tried lots of variations. Any advice would be appreciated!
\b matches a word boundary, but # is not a word character, so if it occurs after a space, the match will fail. Try removing the word boundary there, and removing the extra groups, and add a character set at the end for [.?!] to include the final punctuation, and you get:
[^.?!]*#A_Person\b.*?[^.?!]*[.?!]
You also might consider including a check for the start of the string or the end of the last sentence, otherwise the engine will go through a lot of steps while going through areas without any matches. Perhaps use
(?:^|(?<=[.?!])\s*)
which will match the start of the string, or will lookbehind for [.?!] possibly followed by spaces. Put those together and you get
(?:^|(?<=[.?!])\s*)([^.?!]*#A_Person\b.*?[^.?!]*[.?!])
where the string you want is in the first group (no leading spaces).
https://regex101.com/r/447KsF/3
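In Python, the combined pattern could be used roughly like this. It is a sketch only: the function name and the sample tweets are made up for illustration, and the pattern is copied unchanged from above.

```python
import re

SENTENCE_WITH_MENTION = re.compile(
    r"(?:^|(?<=[.?!])\s*)([^.?!]*#A_Person\b.*?[^.?!]*[.?!])"
)


def sentences_mentioning(text):
    """Return each sentence in text that contains #A_Person (group 1)."""
    return SENTENCE_WITH_MENTION.findall(text)


tweets = [
    "That #A_Person is a real jerk.",
    "Nice weather today. I saw #A_Person downtown! Bye.",
    "No mention here.",
]
for tweet in tweets:
    print(sentences_mentioning(tweet))
# ['That #A_Person is a real jerk.']
# ['I saw #A_Person downtown!']
# []
```

Because the pattern requires a closing [.?!], a tweet that mentions the user but ends without punctuation would not be captured; making the final class optional is one way to relax that.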
Let's say I want to convert the word center to centre, theater to theatre, etc. In order to do so, I have written a regex like the one below:
s = "center ce..nnnnnnnnteeeerrrr mmmmeeeeet.eeerrr liiiiIIiter l1t3r"
regex = re.compile(r'(?:((?:(?:[l1]+\W*[i!1]+|m+\W*[e3]+|c+\W*[e3]+\W*n+)\W*t+|t+\W*h+\W*[e3]+\W*a+\W*t+|m+\W*a+\W*n+\W*[e3]+\W*u+\W*v+)\W*)([e3]+)(\W*)(r+))', re.I)
print(regex.sub(r'\1\4\3\2', s))
#prints "centre ce..nnnnnnnntrrrreeee mmmmeeeeet.rrreee liiiiIIitre l1tr3"
In order to account for loopholes like c.e.nn.ttteee,/rr (basically repeated characters and added punctuation), I have been forced to add \W* between each character.
However, people are still able to use strings like c.c.e.e.n.n.t.t.e.e.r.r, which don't match as there is punctuation between each letter, not just different letters.
I was wondering whether there is a smarter method of doing this, where I can use re.sub without removing whitespace/punctuation but nonetheless have it match.
I'm working with a corpus linguistics tool called AntConc, where you have a document where every word is tagged as a part of speech (noun, adjective, etc), and you use specific commands to pull out matches. For example, if I was looking for a noun (which is tagged NN), I would use *_NN and it would find every noun in the document.
I need to translate my *_TAG syntax into python regex, and I have no idea how to do that. For example, I have a phrase: *_PP$ *_NN *_DT *_JJ *_NN (this translates to possessive pronoun, noun, determiner, adjective, noun; it would find things like "her voice an exact duplicate") in TAG format.
How does one go about changing things like that to regex? For now, I'll take just that basic stuff. Later I'll worry about figuring out how to do "or" and "if this then this" and whatnot.
If you need more info about the tags, try searching for POS tags CLAWS, which should give you a list.
Thanks so much for your help!
So I did some research and found this PDF file describing the notion of embedded tags and non-embedded tags. You are looking to find the embedded tags. So if I'm correct the input would be like this right?
her_PP$ voice_NN an_DT exact_JJ duplicate_NN
The difference is that it sits inside a larger body of text and you don't know the actual words, only the _XX tags.
In a regex, you have to be more specific than *. What you want in place of the * is one or more of any character that can be part of a word (letters, but perhaps hyphens as well). For the noun, that gives:
[\w-]+_NN
This means a character class [...] of word characters \w and the hyphen -, repeated one or more times +, followed by _NN.
The possessive pronoun tag contains a $, which has a special meaning in regexes. If you want the literal character $ rather than its special meaning, you need to escape it with a preceding \ like so:
[\w-]+_PP\$
Lastly, you need to consider which characters are allowed between the words. It could be just whitespace such as spaces, tabs and newlines, which would be \s+. It could also be "any character that isn't a word character", to allow periods, commas, quotes, colons, etc. That would be \W+ (note the uppercase W, the opposite of the lowercase \w).
Combined this would amount to this:
[\w-]+_PP\$\W+[\w-]+_NN\W+[\w-]+_DT\W+[\w-]+_JJ\W+[\w-]+_NN
Debuggex Demo
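The translation described above can be automated with a small helper. This is a sketch under the assumption that every query term has the form *_TAG and that terms are separated by whitespace; the function name and the sample sentence are hypothetical.

```python
import re


def antconc_to_regex(query):
    """Translate a query such as '*_PP$ *_NN' into a regex string."""
    parts = []
    for term in query.split():
        wildcard, tag = term.split("_", 1)
        if wildcard != "*":
            raise ValueError("only '*_TAG' terms are handled: " + term)
        # re.escape takes care of special characters such as the $ in PP$.
        parts.append(r"[\w-]+_" + re.escape(tag))
    return r"\W+".join(parts)


pattern = antconc_to_regex("*_PP$ *_NN *_DT *_JJ *_NN")
print(pattern)

tagged = "and_CC her_PP$ voice_NN an_DT exact_JJ duplicate_NN"
found = re.search(pattern, tagged)
print(found.group(0))  # her_PP$ voice_NN an_DT exact_JJ duplicate_NN
```

The generated string is the same as the combined pattern above. One thing to watch: the last term has nothing after it, so *_NN at the end of a query would also match the start of a longer tag such as _NNS.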
To do "an undetermined amount of unknown words" you would do this:
(?:[\w-]+\W+)*?
So the part that matches the word [\w-]+ and the part that goes in between \W+ are wrapped into a non-capturing group (?:...) and that group is said to occur 0 or more times with the * but as few times as possible with ? to avoid greediness. You can see it here and remove or add an X to see it will still match.
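Here is a small Python check of that lazy group, placed between a possessive pronoun and an adjective. The query itself (PP$ followed at some distance by JJ) is just an example chosen for illustration.

```python
import re

# Possessive pronoun, then any number of words, then an adjective.
GAP = r"(?:[\w-]+\W+)*?"
pattern = re.compile(r"[\w-]+_PP\$\W+" + GAP + r"[\w-]+_JJ")

tagged = "her_PP$ voice_NN an_DT exact_JJ duplicate_NN"
match = pattern.search(tagged)
print(match.group(0))  # her_PP$ voice_NN an_DT exact_JJ

# With nothing in between, the gap matches zero words.
adjacent = pattern.search("her_PP$ exact_JJ")
print(adjacent.group(0))  # her_PP$ exact_JJ
```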