I am trying to find words ending with 'ing' in the following sentence:
sentence = "Playing outdoor games when its raining outside is always fun!"
Now this is not my question itself, as I found the necessary regex pattern to do it: r'\b([A-z]+ing)\b'.
The thing is, I'm unable to understand why the above works but what I tried below does not:
re.findall('([A-z]+ing)$',"Playing outdoor games when it's raining outside is always fun!")
returns an empty list, even though the following does find a match:
re.findall('([A-z]+ing)$','amazing')
returns ['amazing'].
So this pattern can match single words ending with 'ing' but not words in sentences? Why?
What I found even more weird is this:
re.findall('\b([A-z]+ing)\b',"Playing outdoor games when it's raining outside is always fun!")
returns no matches (an empty list). The only difference is not using the raw string notation (r).
I thought the 'r' notation was only necessary when we want to escape backslashes. So in that case:
Pattern 1 - '\b([A-z]+ing)\b' - should match playing, raining, etc., making
Pattern 2 - r'\b([A-z]+ing)\b' - unnecessary.
What exactly have I understood wrongly? I searched a lot of Stack Overflow answers and the official Python regex documentation, and now I am more confused than when I started out, particularly regarding the use of 'r'.
The $ matches the end of a line or the end of the whole text (depending on flag settings; here, only the end of the text). Placing it right after the "ing" forces the "ing" to appear at the very end of the string.
Raw string notation lets escape sequences like \b pass through unchanged to the underlying function (here findall), where they are processed further (here, as the special regex code for a word boundary).
Without raw string notation, \b is the BACKSPACE control character (hex 0x08). The regex engine then simply tries to match that literal character.
Using [A-z] to match all letters is also not right. It actually matches any character whose code point lies between A and z, and that range also includes e.g. [, ^ and \. If you only want the ASCII letters, use [A-Za-z] instead. If you want all Unicode word characters (letters and digits in any supported language, plus the underscore), use \w.
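For instance, here is a minimal sketch putting the three points together (using the sentence from the question, with [A-Za-z] substituted for [A-z]); the expected results are shown as comments:
import re

sentence = "Playing outdoor games when it's raining outside is always fun!"

print(re.findall(r'\b([A-Za-z]+ing)\b', sentence))    # ['Playing', 'raining']
print(re.findall('\\b([A-Za-z]+ing)\\b', sentence))   # same result: the backslashes escaped by hand
print(re.findall('\b([A-Za-z]+ing)\b', sentence))     # [] - here \b is a literal backspace character
print(re.findall(r'([A-Za-z]+ing)$', sentence))       # [] - the sentence does not end with 'ing'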
To play around with regular expressions there is e.g. https://regex101.com/
In Python 2, a Python variable name contains only ASCII letters, numbers and underscores, and it must not start with a number. Thus,
re.search(r'[_a-zA-Z][_a-zA-Z0-9]*', s)
will find a matching Python name in the str s.
In Python 3, the letters are no longer restricted to ASCII. I am in search for a new regex which will match any and all legal Python 3 variable names.
According to the docs, \w in a regex will match any Unicode word character, including digits and the underscore. I am however unsure whether this character set contains exactly those characters which may be used in variable names.
Even if the character set \w contains exactly the characters from which Python 3 variable names may legally be constructed, how do I use it to create my regex? Using just \w+ will also match "words" which start with a number, which is no good. I have the following solution in mind,
re.search(r'(\w&[^0-9])\w*', s)
where & is the "and" operator (just like | is the "or" operator). The parenthesized group would thus match any word character which is at the same time not a digit. The problem with this is that the & operator does not exist, and so I'm stuck with no solution.
Edit
Though the "double negative" trick (as explained in the answer by Patrick Artner below) can also be found in this question, note that this only partly answers my question. Using [^\W0-9]\w* only works if I am guaranteed that \w exactly matches the legal Unicode characters, plus the numbers 0-9. I would like a source of this knowledge, or some other regex which gets the job done.
You can use a double negative: \W is anything that \w is not, so disallowing \W allows any \w:
[^\W0-9]\w*
Essentially: one character that is neither a non-word character nor one of the digits 0-9, followed by any number of word characters.
Docs: regular-expression-syntax
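As a small illustrative sketch (the sample names are my own, and re.fullmatch is used so the whole string has to be a valid name):
import re

# Sample names chosen for illustration; not from the original question.
for name in ["snake_case", "übergröße", "变量", "2fast", "has space"]:
    print(name, bool(re.fullmatch(r'[^\W0-9]\w*', name)))
# snake_case True / übergröße True / 变量 True / 2fast False / has space False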
You could try using
^(?![0-9])\w+$
which will not partially match invalid variable names (the ^ and $ anchors require the whole string to be a valid name).
Alternatively, if you don't need to use a regex, str.isidentifier() will probably do what you want.
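For illustration, a quick sketch comparing the two checks on a few sample names of my own (note that isidentifier() also returns True for keywords such as 'class', so an extra keyword.iskeyword() check may be wanted):
import re

# Names chosen for illustration; not from the original answers.
for name in ["legal_name", "größe", "9lives", "with-dash"]:
    print(name, bool(re.match(r'^(?![0-9])\w+$', name)), name.isidentifier())
# legal_name True True / größe True True / 9lives False False / with-dash False False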
I am starting to learn how to write a Python spider to download some pictures from the web, and I found the code below. I know some basic regex.
I know \.jpg means a literal .jpg and | means "or". What is the meaning of [^\s]*? in the first line? Why is \s used here?
And what's the difference between the two regexes?
http:[^\s]*?(\.jpg|\.png|\.gif)
http://.*?(\.jpg|\.png|\.gif)
Alright, so to answer your first question, I'll break down [^\s]*?.
The square brackets ([]) indicate a character class. A character class basically means that you want to match anything in the class, at that position, one time. [abc] will match the strings a, b, and c. In this case, your character class is negated using the caret (^) at the beginning - this inverts its meaning, making it match anything but the characters in it.
\s is fairly simple - it's a common shorthand in many regex flavours for "any whitespace character". This includes spaces, tabs, and newlines.
*? is a little harder to explain. The * quantifier is fairly simple - it means "match this token (the character class in this case) zero or more times". The ?, when applied to a quantifier, makes it lazy - it will match as little as it can, going from left to right one character at a time.
In this case, what the whole pattern snippet [^\s]*? means is "match any sequence of non-whitespace characters, including the empty string". As mentioned in the comments, this can more succinctly be written as \S*?.
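To see what the ? does to a quantifier, here is a small sketch (the sample text is mine; I use .* rather than [^\s]* because the whitespace class already stops at the space, so the plain dot shows the greedy/lazy difference more clearly):
import re

text = "http://example.com/a.png then http://example.com/b.png"
print(re.search(r'http:.*\.png', text).group(0))
# greedy: http://example.com/a.png then http://example.com/b.png
print(re.search(r'http:.*?\.png', text).group(0))
# lazy:   http://example.com/a.png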
To answer the second part of your question, I'll compare the two regexes you give:
http:[^\s]*?(\.jpg|\.png|\.gif)
http://.*?(\.jpg|\.png|\.gif)
They both start the same way: attempting to match the protocol at the beginning of a URL and the subsequent colon (:) character. The first then matches any string that does not contain any whitespace and ends with the specified file extensions. The second, meanwhile, will match two literal slash characters (/) before matching any sequence of characters followed by a valid extension.
Now, it's obvious that both patterns are meant to match a URL, but both are incorrect. The first pattern, for instance, will match strings like
http:foo.bar.png
http:.png
Both of which are invalid. Likewise, the second pattern will permit spaces, allowing stuff like this:
http:// .jpg
http://foo bar.png
Which is equally illegal in valid URLs. A better regex for this (though I caution strongly against trying to match URLs with regexes) might look like:
https?://\S+\.(jpe?g|png|gif)
In this case, it'll match URLs starting with both http and https, as well as files that end in both variations of jpg.
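As a quick sketch of using it (the sample text is mine, and I've made the extension group non-capturing so that re.findall returns the whole URLs rather than just the captured extensions):
import re

text = "see https://example.com/cat.jpeg and http://example.com/dog.png here"
print(re.findall(r'https?://\S+\.(?:jpe?g|png|gif)', text))
# ['https://example.com/cat.jpeg', 'http://example.com/dog.png']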
I'm working with a corpus linguistics tool called AntConc, where you have a document where every word is tagged as a part of speech (noun, adjective, etc), and you use specific commands to pull out matches. For example, if I was looking for a noun (which is tagged NN), I would use *_NN and it would find every noun in the document.
I need to translate my *_TAG syntax into Python regex, and I have no idea how to do that. For example, I have a phrase in TAG format: *_PP$ *_NN *_DT *_JJ *_NN (this translates to possessive pronoun, noun, determiner, adjective, noun; it would find things like "her voice an exact duplicate").
How does one go about changing things like that to regex? For now, I'll take just that basic stuff. Later I'll worry about figuring out how to do "or" and "if this then this" and whatnot.
If you need more info about the tags, try searching for POS tags CLAWS, which should give you a list.
Thanks so much for your help!
So I did some research and found this PDF file describing the notion of embedded tags and non-embedded tags. You are looking to find the embedded tags. So if I'm correct, the input would look like this, right?
her_PP$ voice_NN an_DT exact_JJ duplicate_NN
Only then embedded in a larger body of text, where you don't know the actual words; you just know the _XX tags.
In a regex, you have to be more specific than *. What you want in place of the * is one or more of any character that can be part of a word (letters, but perhaps also hyphens?). That gives this for the noun:
[\w-]+_NN
This means a character class [...] of word characters \w and the hyphen -, repeated one or more times +, followed by _NN.
For the possessive pronoun, the tag contains a $, which has a special meaning in regexes. If you want the literal character $ and not its special meaning, you need to escape it with a preceding \, like so:
[\w-]+_PP\$
Lastly, you want to consider which characters are allowed in between the words. It could be just whitespace like spaces, tabs and newlines, which would be \s+. It could also be "any character that isn't a word character", to allow periods, commas, quotes, colons, etc. That would be \W+ (note the uppercase W, the opposite of the lowercase \w).
Combined this would amount to this:
[\w-]+_PP\$\W+[\w-]+_NN\W+[\w-]+_DT\W+[\w-]+_JJ\W+[\w-]+_NN
Debuggex Demo
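In Python, a rough sketch of using the combined pattern against the tagged example above could look like this (the variable names are my own):
import re

pattern = r'[\w-]+_PP\$\W+[\w-]+_NN\W+[\w-]+_DT\W+[\w-]+_JJ\W+[\w-]+_NN'
tagged = "her_PP$ voice_NN an_DT exact_JJ duplicate_NN"

match = re.search(pattern, tagged)
if match:
    print(match.group(0))  # her_PP$ voice_NN an_DT exact_JJ duplicate_NN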
To do "an undetermined number of unknown words" you would do this:
(?:[\w-]+\W+)*?
So the part that matches a word, [\w-]+, and the part that goes in between, \W+, are wrapped in a non-capturing group (?:...), and that group is allowed to occur 0 or more times with the *, but as few times as possible with the ? to avoid greediness. You can see it in the demo and remove or add a word to see that it will still match.
How do I add the tag NEG_ to all words that follow not, no or never, up until the next punctuation mark in a string (used for sentiment analysis)? I assume that regular expressions could be used, but I'm not sure how.
Input: It was never going to work, he thought. He did not play so well, so he had to practice some more.
Desired output: It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more.
Any idea how to solve this?
To make up for Python's re regex engine's lack of some Perl abilities, you can use a lambda expression in a re.sub function to create a dynamic replacement:
import re
string = "It was never going to work, he thought. He did not play so well, so he had to practice some more. Not foobar !"
transformed = re.sub(r'\b(?:not|never|no)\b[\w\s]+[^\w\s]',
                     lambda match: re.sub(r'(\s+)(\w+)', r'\1NEG_\2', match.group(0)),
                     string,
                     flags=re.IGNORECASE)
print(transformed)
Will print (demo here)
It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more. Not NEG_foobar !
Explanation
The first step is to select the parts of your string you're interested in. This is done with
\b(?:not|never|no)\b[\w\s]+[^\w\s]
Your negative keyword (\b is a word boundary, (?:...) a non-capturing group), followed by alphanumerics and spaces (\w is [0-9a-zA-Z_], \s is all kinds of whitespace), up until something that is neither an alphanumeric nor a space (acting as punctuation).
Note that the punctuation is mandatory here, but you could safely remove [^\w\s] to match end of string as well.
Now you're dealing with strings like "never going to work,". Just select the words preceded by spaces with
(\s+)(\w+)
And replace them with what you want
\1NEG_\2
I would not do this with a regexp. Rather, I would do the following (a rough sketch in code follows below the steps):
Split the input on punctuation characters.
For each fragment:
    Set a negation counter to 0.
    Split the fragment into words.
    For each word:
        Prepend the negation counter's number of NEG_ prefixes to the word (or counter mod 2, or just one if the counter is greater than 0).
        If the original word is in {no, never, not}, increase the negation counter by one.
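A rough sketch of that approach in code (the function name, the punctuation set, and the choice to tag each word just once are my own):
import re

def tag_negations(text):
    # Keep the separators (whitespace and punctuation) so the string can be rebuilt.
    tokens = re.split(r'(\s+|[.,;:!?])', text)
    out = []
    negating = False
    for token in tokens:
        if token and token in '.,;:!?':
            negating = False                      # punctuation ends the negation scope
            out.append(token)
        elif token.strip():                       # an actual word
            out.append('NEG_' + token if negating else token)
            if token.lower() in {'no', 'never', 'not'}:
                negating = True
        else:
            out.append(token)                     # whitespace or empty split artefacts
    return ''.join(out)

print(tag_negations("It was never going to work, he thought. "
                    "He did not play so well, so he had to practice some more."))
# It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more.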
You will need to do this in several steps (at least in Python - .NET languages can use a regex engine that has more capabilities):
First, match a part of a string starting with not, no or never. The regex \b(?:not?|never)\b([^.,:;!?]+) would be a good starting point. You might need to add more punctuation characters to that list if they occur in your texts.
Then, use the match result's group 1 as the target of your second step: Find all words (for example by splitting on whitespace and/or punctuation) and prepend NEG_ to them.
Join the string together again and insert the result in your original string in the place of the first regex's match.
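Put together, a sketch of those steps might look like this (the function name is mine, and I've made the trigger word a capturing group so it can be re-inserted unchanged):
import re

def tag_negation_scope(text):
    # Step 1: the trigger word plus everything up to (but not including) the next punctuation mark.
    pattern = re.compile(r'\b(not?|never)\b([^.,:;!?]+)', re.IGNORECASE)

    def repl(match):
        trigger, scope = match.group(1), match.group(2)
        # Step 2: prepend NEG_ to every word inside the matched scope.
        tagged = re.sub(r'(\w+)', r'NEG_\1', scope)
        # Step 3: splice the pieces back in place of the original match.
        return trigger + tagged

    return pattern.sub(repl, text)

print(tag_negation_scope("He did not play so well, so he had to practice some more."))
# He did not NEG_play NEG_so NEG_well, so he had to practice some more.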
I'm trying to add some light markdown support for a javascript preprocessor which I'm writing in Python.
For the most part it's working, but sometimes the regex I'm using is acting a little odd, and I think it's got something to do with raw-strings and escape sequences.
The regex is: (?<!\\)\"[^\"]+\"
Yes, I am aware that it only matches strings beginning with a " character. However, this project is born out of curiosity more than anything, so I can live with it for now.
To break it down:
(?<\\)\" # The group should begin with a quotation mark that is not escaped
[^\"]+ # and match any number of at least one character that is not a quotation mark (this is the biggest problem, I know)
\" # and end at the first quotation mark it finds
That being said, I (obviously) start hitting problems with things like this:
"This is a string with an \"escaped quote\" inside it"
I'm not really sure how to say "Everything but a quotation mark, unless that mark is escaped". I tried:
([^\"]|\\\")+ # a group of anything but a quote or an escaped quote
, but that led to very strange results.
I'm fully prepared to hear that I'm going about this all wrong. For the sake of simplicity, let's say that this regex will always start and end with double quotes (") to avoid adding another element in the mix. I really want to understand what I have so far.
Thanks for any assistance.
EDIT
As a test for the regex, I'm trying to find all string literals in the minified jQuery script with the following code (using unutbu's pattern below):
STRLIT = r'''(?x) # verbose mode
(?<!\\) # not preceded by a backslash
" # a literal double-quote
.*? # non-greedy 0-or-more characters
(?<!\\) # not preceded by a backslash
" # a literal double-quote
'''
f = open("jquery.min.js","r")
jq = f.read()
f.close()
literals = re.findall(STRLIT,jq)
The answer below fixes almost all issues. The ones that do arise are within jQuery's own regular expressions, which is a real edge case. The solution no longer misidentifies valid JavaScript as markdown links, which was really the goal.
I think I first saw this idea in... Jinja2's source code? Later transplanted it to Mako.
r'''(\"\"\"|\'\'\'|\"|\')((?<!\\)\\\1|.)*?\1'''
Which does the following:
(\"\"\"|\'\'\'|\"|\') matches a Python opening quote, because this happens to be taken from code for parsing Python. You probably don't need all those quote types.
((?<!\\)\\\1|.) matches: EITHER a matching quote that was escaped ONLY ONCE, OR any other character. So \\" will still be recognized as the end of the string.
*? non-greedily matches as many of those as possible.
And \1 is just the closing quote.
Alas, \\\" will still incorrectly be detected as the end of the string. (The template engines only use this to check if there is a string, not to extract it.) This is a problem very poorly suited for regular expressions; short of doing insane things in Perl, where you can embed real code inside a regex, I'm not sure it's possible even with PCRE. Though I'd love to be proven wrong. :) The killer is that (?<!...) has to be constant-length, but you want to check that there's any even number of backslashes before the closing quote.
If you want to get this correct, and not just mostly-correct, you might have to use a real parser. Have a look at parsley, pyparsing, or any of these tools.
edit: By the way, there's no need to check that the opening quote doesn't have a backslash before it. That's not valid syntax outside a string in JS (or Python).
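For reference, a quick sketch of running that pattern over a line of JS-like text (the sample text is mine; re.finditer with group(0) is used because re.findall would return the capture groups rather than the whole match):
import re

PATTERN = r'''(\"\"\"|\'\'\'|\"|\')((?<!\\)\\\1|.)*?\1'''
source = r'''var s = "he said \"hi\""; var t = 'ok';'''

for m in re.finditer(PATTERN, source):
    print(m.group(0))
# "he said \"hi\""
# 'ok'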
Perhaps use two negative lookbehinds:
import re
text = r'''"This is a string with an \"escaped quote\" inside it". While ""===r?+r:wt.test(r)?st.parseJSON(r) :r}catch(o){}st.data(e,n,r)}else r=t}return r}function s(e){var t;for(t in e)if(("data" '''
for match in re.findall(r'''(?x)   # verbose mode
                         (?<!\\)   # not preceded by a backslash
                         "         # a literal double-quote
                         .*?       # 0-or-more characters, non-greedy
                         (?<!\\)   # not preceded by a backslash
                         "         # a literal double-quote
                         ''', text):
    print(match)
yields
"This is a string with an \"escaped quote\" inside it"
""
"data"
The question mark in .*? makes the pattern non-greedy. The non-greediness causes the pattern to stop matching when it encounters the first unescaped double quotation mark.
Using Python, a correct regex for matching a double-quoted string is:
pattern = r'"(\\.|[^"])*"'
It describes strings that start and end with ". Each character inside the two double quotes is either an escaped character (a backslash followed by anything) or any character except ".
unutbu's answer is wrong because the valid string "\\\\" cannot be matched by that pattern.
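A brief check of that pattern (the test strings are mine; re.search(...).group(0) shows the whole match rather than the capturing group):
import re

pattern = r'"(\\.|[^"])*"'
tests = [r'"an \"escaped quote\" inside"',
         r'"ends with a real backslash \\"',
         '"\\\\"']                      # the "\\\\" case mentioned above
for text in tests:
    print(re.search(pattern, text).group(0))
# "an \"escaped quote\" inside"
# "ends with a real backslash \\"
# "\\"
A common hardening of this is to also exclude the backslash from the character class, i.e. r'"(\\.|[^"\\])*"', so that the two alternatives can never match the same character.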