I'm trying to get all 3-letter words that start with the letter 'a' and end with a double letter.
Like: app, add, all, arr, aoo, aee
I tried this, but it doesn't work very well:
words = re.findall(r" a(\w)\1* ", text)
You are using
words = re.findall(r" a(\w)\1* ", text)
The literal spaces in that pattern require a space on both sides of each word, and \1* also allows zero repeats of the captured letter.
You can improve it by using word boundaries and a specific repeat count:
\ba(\w)\1{1}\b
You want one and only one additional instance of the captured \w. The {1} enforces that by allowing exactly one repeat of \1; it is equivalent to a plain \1.
I think you need to use + instead of *:
words = re.findall(r"\ba(\w)\1+\b", text)
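Otherwise you will match things with non-double letters. Also use \b to detect word boundaries. Note that + also accepts longer runs such as "appp"; if the words must be exactly three letters long, use a plain \1 with no quantifier.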
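To see the difference between the variants, here is a small sketch. The sample text is made up for illustration, and it uses finditer rather than findall, because findall would return only the captured second letter instead of the whole word.

```python
import re

# Hypothetical sample text; the first six words come from the question's list.
text = "app add all arr aoo aee apple appp abc bad"

# \b anchors both ends, (\w) captures the second character,
# and \1 requires the third character to repeat it exactly once.
pattern = re.compile(r"\ba(\w)\1\b")

# group(0) is the whole word; findall would give only the captured letter.
words = [m.group(0) for m in pattern.finditer(text)]
print(words)  # ['app', 'add', 'all', 'arr', 'aoo', 'aee']

# With + instead, longer runs are accepted as well.
longer = [m.group(0) for m in re.finditer(r"\ba(\w)\1+\b", text)]
print(longer)  # ['app', 'add', 'all', 'arr', 'aoo', 'aee', 'appp']
```

Keep in mind that \w also matches digits and underscores, so something like a11 would match too; swap in [a-z] if only letters should count.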
I am quite new to regex and am working on string verification where both conditions must be met: the text must contain a 7-digit number starting with 4 or 7, and it must also contain one of the provided words.
What I have managed so far:
\b((4|7)\d{6})\b|(\border|Order|Bestellung|bestellung|commande|Commande|ordine|Ordine|objednavku|Objednavku|objednavka|Objednavka)
The regex above finds the numbers correctly, but the words come after an OR (|), where I need AND logic instead.
Could you please help me change it so that it works as an AND between the digits and the words?
You can use
(?s)^(?=.*\b(?:order|Order|Bestellung|bestellung|commande|Commande|ordine|Ordine|objednavku|Objednavku|objednavka|Objednavka)\b).*\b([47]\d{6})\b
If you can and want to use case-insensitive matching with re.I, you can use
(?si)^(?=.*\b(?:order|bestellung|commande|ordine|objednavk[ua])\b).*\b([47]\d{6})\b
This matches
^ - start of string
(?=.*\b(?:order|Order|Bestellung|bestellung|commande|Commande|ordine|Ordine|objednavku|Objednavku|objednavka|Objednavka)\b) - a positive lookahead that matches any zero or more chars, as many as possible, up to any of the whole words listed in the group
.* - zero or more chars, as many as possible
\b([47]\d{6})\b - a 7-digit number as a whole word that starts with 4 or 7.
Do not forget to use a raw string literal to define a regex in Python code:
pattern = r'(?si)^(?=.*\b(?:order|bestellung|commande|ordine|objednavk[ua])\b).*\b([47]\d{6})\b'
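As a quick check of the AND behaviour, the case-insensitive pattern can be run against a few strings. The sample strings below are invented for illustration.

```python
import re

pattern = re.compile(
    r'(?si)^(?=.*\b(?:order|bestellung|commande|ordine|objednavk[ua])\b)'
    r'.*\b([47]\d{6})\b'
)

# Hypothetical inputs covering each combination of the two conditions.
samples = [
    "Your order 4123456 has shipped",   # keyword and number
    "Ref 7654321\nBestellung folgt",    # number before keyword, across lines
    "Invoice 4123456 attached",         # number only
    "Order 5123456 received",           # keyword, wrong leading digit
]

results = []
for s in samples:
    m = pattern.search(s)
    results.append(m.group(1) if m else None)

print(results)  # ['4123456', '7654321', None, None]
```

Because .* is greedy, group 1 holds the last qualifying number when a string contains several.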
By default, everything in a regex is ANDed in sequence: abc means "a" AND then "b" AND then "c", so there is no need for an explicit AND operator.
Just remove the | between the number match and the words:
\b(4|7)\d{6}(order|Order|Bestellung|bestellung|commande|Commande|ordine|Ordine|objednavku|Objednavku|objednavka|Objednavka)\b
Note that \border in the original pattern is a word boundary (\b) followed by "order", so only "order" belongs in the group.
This matches only when the word follows the number immediately, as in "4958374order".
In Python re, I have long strings of text with > character chunks of different lengths. One string can have 3 consecutive > chars in the middle, >> in the beginning, or any such combination.
After splitting the string on spaces, I iterate through each word and want a regexp that identifies only the runs of exactly 2 occurrences (>>). I can't be sure whether the run is at the beginning, middle or end of the whole string, what characters come before or after it, or whether it is the only 2 characters in the string.
So far I could come up with:
word = re.sub(r'>{2}', '', word)
This ends up removing all occurrences of 2 or more. What regular expression would work for this requirement? Any help is appreciated.
You need to make sure the character of your choice does not appear immediately to the left or right of the match, using a pair of lookarounds: a negative lookbehind and a negative lookahead. The general scheme is
(?<!X)X{n}(?!X)
where (?<!X) means no X immediately on the left is allowed, X{n} means n occurrences of X, and (?!X) means no X immediately on the right is allowed.
In this case, use
r'(?<!>)>{2}(?!>)'
See the regex demo.
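A short sketch of how the lookaround version behaves on individual words; the word list is made up for illustration.

```python
import re

# Matches exactly two ">" that are not part of a longer run.
EXACTLY_TWO = re.compile(r'(?<!>)>{2}(?!>)')

# Hypothetical words, as they might look after splitting on spaces.
words = [">>start", "mid>>dle", "end>>", ">>", "a>>>b", ">one", ">>>>"]

cleaned = [EXACTLY_TWO.sub('', w) for w in words]
print(cleaned)  # ['start', 'middle', 'end', '', 'a>>>b', '>one', '>>>>']
```

Runs of one, three or four > are left alone because one of the two lookarounds fails at every position inside them.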
There is no need to split on spaces first if you don't have to.
Try (?<![^ ])[^ >]*>>[^ >]*(?![^ ])
It finds space-delimited segments that contain exactly one >> and no other > characters.
I have a text file of the type:
[...speech...]
NAME_OF_SPEAKER_1: [...speech...]
NAME_OF_SPEAKER_2: [...speech...]
My aim is to isolate the speeches of the various speakers. They are clearly identified because the name of each speaker is always written in uppercase letters (name + surname). However, the speeches can contain nouns (not people's names) that are also in uppercase, though only one of them is long enough to cause trouble (it has four letters; say it is 'ABCD'). I was thinking of identifying the position of each speaker's name (I assume every name is at least 3 letters long) with something like
re.search('[A-Z^(ABCD)]{3,}',text_to_search)
in order to exclude that specific (constant) word 'ABCD'. However, the command identifies that word instead of excluding it. Any ideas about how to overcome this problem?
In the pattern that you tried, you get partial matches, as there are no boundaries and [A-Z^(ABCD)]{3,} will match 3 or more of any of the listed characters.
A-Z already covers A, B, C and D, and ^ is a literal character when it is not the first one in the class, so the pattern could also be written as [A-Z^)(]{3,}
Instead of using the negated character class, you could assert that the word, which consists only of uppercase chars A-Z, does not contain ABCD by using a negative lookahead (?!...):
\b(?![A-Z]*ABCD)[A-Z]{3,}\b
If the name should start with 3 uppercase chars and can also contain lowercase chars, underscores or digits, you could add \w* after matching the 3 uppercase chars:
\b(?![A-Z]*ABCD)[A-Z]{3}\w*\b
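For illustration, this is how the first lookahead pattern behaves on a made-up line of text. The names JOHN and MARY are hypothetical, and ABCD stands in for the uppercase noun from the question.

```python
import re

# Hypothetical transcript fragment.
text = "JOHN: the ABCD report is late. MARY: I know, XABCDY is unrelated."

# The lookahead rejects any all-uppercase word that contains ABCD.
pattern = re.compile(r'\b(?![A-Z]*ABCD)[A-Z]{3,}\b')

names = pattern.findall(text)
print(names)  # ['JOHN', 'MARY']
```

The lookahead excludes every uppercase word that contains ABCD anywhere, not only the exact word, so a real name containing those four letters would be skipped as well.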
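Square brackets [] match single characters only. Round brackets () inside square brackets are literal characters too, not a group. That means:
[ABCD] is the same as [A-D], and [(ABCD)] is [A-D] plus the literal ( and ).
[^(ABCD)] matches any single character that is not A-D, ( or ).
A ^ negates the class only when it is the first character; in [A-Z^(ABCD)] it is a literal ^.
I would try something different:
^[A-Z]*?: matches each word written in capital letters that starts at the beginning of a line and is followed by a colon. In Python, pass re.MULTILINE so that ^ matches at the start of every line.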
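A rough sketch of how the line-start idea could be used to isolate the speeches. It assumes single-word uppercase names (ALICE and BOB are hypothetical) and uses [A-Z]+ rather than [A-Z]*? so that an empty name before the colon cannot match.

```python
import re

# Hypothetical transcript with single-word uppercase speaker names.
text = """ALICE: Good morning.
The ABCD report is ready.
BOB: Thanks, ABCD: noted.
ALICE: Any questions?"""

# re.MULTILINE lets ^ match at the start of every line, not only the string.
speaker = re.compile(r'^([A-Z]+):', re.MULTILINE)

matches = list(speaker.finditer(text))
speeches = []
for i, m in enumerate(matches):
    # A speech runs from the end of this label to the start of the next one.
    end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
    speeches.append((m.group(1), text[m.end():end].strip()))

for name, speech in speeches:
    print(name, "->", repr(speech))
```

An uppercase word in the middle of a line is ignored even when a colon follows it, but a speech line that happens to begin with "ABCD:" would still be taken for a speaker. Names made of several words or containing underscores would need a wider character class.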
I'd like to define a regular expression in Python 3 that extracts words that start with a letter and end with a digit.
What I've been trying is r'^[a-z][A-Z].[0-9]$'
and it didn't return a single word.
Use
r'\b[A-Za-z]\w*[0-9]\b'
This matches words that begin with a letter, have any number of word characters after it, and end in a digit. The word boundaries make sure that only whole words match.
As per the valuable comment below, consider an alternative:
r'\b[A-Za-z][A-Za-z0-9]*[0-9]\b'
The [A-Za-z0-9]* won't match underscores while \w will.
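The two variants can be compared on a small sample; the text below is made up for illustration.

```python
import re

# Hypothetical sample text.
text = "abc123 x9 9lives user_7 Test42 hello a1b"

# \w allows underscores (and digits) in the middle of the word.
with_underscore = re.findall(r'\b[A-Za-z]\w*[0-9]\b', text)
print(with_underscore)  # ['abc123', 'x9', 'user_7', 'Test42']

# Letters and digits only; user_7 no longer qualifies.
letters_digits = re.findall(r'\b[A-Za-z][A-Za-z0-9]*[0-9]\b', text)
print(letters_digits)  # ['abc123', 'x9', 'Test42']
```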
How do I add the tag NEG_ to all words that follow "not", "no" and "never" until the next punctuation mark in a string (used for sentiment analysis)? I assume that regular expressions could be used, but I'm not sure how.
Input: It was never going to work, he thought. He did not play so well, so he had to practice some more.
Desired output: It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more.
Any idea how to solve this?
To make up for Python's re regex engine's lack of some Perl abilities, you can use a lambda expression in a re.sub function to create a dynamic replacement:
import re

string = "It was never going to work, he thought. He did not play so well, so he had to practice some more. Not foobar !"
# The outer pattern selects each negated span; the lambda tags the words inside it.
transformed = re.sub(r'\b(?:not|never|no)\b[\w\s]+[^\w\s]',
                     lambda match: re.sub(r'(\s+)(\w+)', r'\1NEG_\2', match.group(0)),
                     string,
                     flags=re.IGNORECASE)
print(transformed)

This prints
It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more. Not NEG_foobar !
Explanation
The first step is to select the parts of your string you're interested in. This is done with
\b(?:not|never|no)\b[\w\s]+[^\w\s]
Your negative keyword (\b is a word boundary, (?:...) a non-capturing group), followed by alphanumerics and spaces (\w is [0-9a-zA-Z_], \s is any kind of whitespace), up until something that is neither an alphanumeric nor a space (acting as punctuation).
Note that the punctuation is mandatory here, but you could safely remove [^\w\s] to match end of string as well.
Now you're dealing with strings of the form "never going to work,". Just select the words preceded by spaces with
(\s+)(\w+)
And replace them with what you want
\1NEG_\2
I would not do this with regexp. Rather, I would:
1. Split the input on punctuation characters.
2. For each fragment:
   a. Set the negation counter to 0.
   b. Split the fragment into words.
   c. For each word:
      - Add as many NEG_ prefixes to the word as the negation counter says (or the counter mod 2, or just 1 if it is greater than 0).
      - If the original word is in {No, Never, Not}, increase the negation counter by one.
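A minimal sketch of this regex-free idea, written as a single pass over whitespace-separated tokens instead of a separate split on punctuation. It implements the "1 if greater than 0" variant of the counter; the function name tag_negations and the use of string.punctuation as the punctuation set are my own assumptions.

```python
import string

NEGATIONS = {"no", "not", "never"}

def tag_negations(text):
    """Prefix NEG_ to each word after a negation, until punctuation appears."""
    out = []
    negated = False  # the "1 if greater than 0" variant of the counter
    for token in text.split():
        word = token.rstrip(string.punctuation)
        trailing = token[len(word):]
        # Tag first, then check whether this word starts a negated stretch.
        out.append("NEG_" + token if negated and word else token)
        if word.lower() in NEGATIONS:
            negated = True
        if trailing:  # punctuation closes the fragment and resets the state
            negated = False
    return " ".join(out)

text = ("It was never going to work, he thought. "
        "He did not play so well, so he had to practice some more.")
print(tag_negations(text))
```

On the input from the question this reproduces the desired output. Runs of whitespace are collapsed to single spaces, since the tokens are re-joined with " ".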
You will need to do this in several steps (at least in Python - .NET languages can use a regex engine that has more capabilities):
First, match a part of a string starting with not, no or never. The regex \b(?:not?|never)\b([^.,:;!?]+) would be a good starting point. You might need to add more punctuation characters to that list if they occur in your texts.
Then, use the match result's group 1 as the target of your second step: Find all words (for example by splitting on whitespace and/or punctuation) and prepend NEG_ to them.
Join the string together again and insert the result in your original string in the place of the first regex's match.
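The three steps can be sketched with re.sub and a replacement function. The helper name tag_span and the re.IGNORECASE flag are my additions, not part of the steps above.

```python
import re

# Step 1: a negation keyword, then everything up to the next punctuation mark.
# re.IGNORECASE is an assumption so that "Not" and "Never" match too.
NEGATED_SPAN = re.compile(r'\b(?:not?|never)\b([^.,:;!?]+)', re.IGNORECASE)

def tag_span(match):
    # Step 2: prefix every word in group 1 with NEG_.
    tagged = re.sub(r'\w+', lambda w: 'NEG_' + w.group(0), match.group(1))
    # Step 3: put the keyword back in front of the tagged remainder.
    keyword = match.group(0)[:match.start(1) - match.start(0)]
    return keyword + tagged

text = ("It was never going to work, he thought. "
        "He did not play so well, so he had to practice some more.")
result = NEGATED_SPAN.sub(tag_span, text)
print(result)
```

re.sub takes care of inserting each rebuilt span back at the position of the original match, so no manual slicing of the full string is needed.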