Regex pattern for tweets

Regex pattern for tweets - python

I am building a tweet classification model, and I am having trouble finding a regex pattern that fits what I am looking for.
What I want the regex pattern to pick up:
Any hashtags used in the tweet but without the hash mark (example - #omg to just omg)
Any mentions used in the tweet but without the @ symbol (example - @username to just username)
I don't want any numbers or any words containing numbers returned (this is the most difficult task for me)
Other than that, I just want all words returned
Thank you in advance if you can help
Currently I am using this pattern: r"(?u)\b\w\w+\b" but it is failing to remove numbers.

This regex should work.
(#|@)?(?![^ ]*\d[^ ]*)([^ ]+)
Explanation:
(#|@)?: A 'hash' or 'at' character. Match 0 or 1 times.
(?!): Negative lookahead. Check ahead to see if the pattern in the brackets matches. If so, negate the match.
[^ ]*\d[^ ]*: any number of non-space characters, followed by a digit, followed by any number of non-space characters. This is nested in the negative lookahead, so if a number is found in the username or hashtag, the match is negated.
([^ ]+): One or more not space characters. The negative lookahead is a 0-length match, so if it passes, fetch the rest of the username/hashtag (Grouped with brackets so you can replace with $2).
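As a rough cross-check of the requirements in the question, here is a minimal sketch that uses a different pattern from the one above. The helper name tweet_tokens and the pattern itself are assumptions made for illustration. It uses lookarounds on both sides of a word so that a word containing a digit anywhere is skipped entirely, and the leading # or @ is kept outside the capturing group.

```python
import re

# Illustrative alternative, not the regex from the answer above.
# (?<![\w#@])  the token must not start in the middle of a word or tag
# [#@]?        optional hashtag or mention marker, left out of the group
# ([^\W\d]+)   one or more word characters that are not digits
# (?!\w)       the word must end here, so "abc123" is rejected as a whole
TOKEN_RE = re.compile(r"(?<![\w#@])[#@]?([^\W\d]+)(?!\w)")

def tweet_tokens(text):
    """Return digit-free words, with a leading # or @ stripped."""
    return TOKEN_RE.findall(text)

print(tweet_tokens("Loving #omg with @username in 2021 abc123 123abc great"))
# ['Loving', 'omg', 'with', 'username', 'in', 'great']
```

Note that [^\W\d] still allows underscores, which seems reasonable for usernames but may need tightening for other input.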

Related

Regular expression match when specific digits AND words appear

I am quite new to regex, working on string verification where I want both conditions to be met. I am matching text that contains a 7-digit number starting with 4 or 7, and the string also needs to contain one of the provided words.
What I managed so far:
\b((4|7)\d{6})\b|(\border|Order|Bestellung|bestellung|commande|Commande|ordine|Ordine|objednavku|Objednavku|objednavka|Objednavka)
Regex above correctly finds numbers but words are after OR statement which I would need to follow AND logic instead.
Could you please help me implement a change that would work as AND statement between digits and words?

You can use
(?s)^(?=.*\b(?:order|Order|Bestellung|bestellung|commande|Commande|ordine|Ordine|objednavku|Objednavku|objednavka|Objednavka)\b).*\b([47]\d{6})\b
If you can and want to use case-insensitive matching with re.I, you can use
(?si)^(?=.*\b(?:order|bestellung|commande|ordine|objednavk[ua])\b).*\b([47]\d{6})\b
See the regex demo.
This matches
^ - start of string
(?=.*\b(?:order|Order|Bestellung|bestellung|commande|Commande|ordine|Ordine|objednavku|Objednavku|objednavka|Objednavka)\b) - a positive lookahead that matches any zero or more chars, as many as possible, up to any of the whole words listed in the group
.* - zero or more chars, as many as possible
\b([47]\d{6})\b - a 7-digit number as a whole word that starts with 4 or 7.
Do not forget to use a raw string literal to define a regex in Python code:
pattern = r'(?si)^(?=.*\b(?:order|bestellung|commande|ordine|objednavk[ua])\b).*\b([47]\d{6})\b'
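A short usage sketch of the case-insensitive pattern, assuming a hypothetical helper called find_order_number and made-up sample strings. It returns the captured 7-digit number only when one of the keywords is also present somewhere in the text.

```python
import re

# Same pattern as above, split across two raw string literals for readability.
pattern = re.compile(
    r'(?si)^(?=.*\b(?:order|bestellung|commande|ordine|objednavk[ua])\b)'
    r'.*\b([47]\d{6})\b'
)

def find_order_number(text):
    """Return the 7-digit number if a keyword is also present, else None."""
    m = pattern.search(text)
    return m.group(1) if m else None

print(find_order_number("Your order 4123456 has shipped"))  # 4123456
print(find_order_number("4123456 was my Commande"))         # 4123456
print(find_order_number("Invoice 4123456"))                 # None
```

Because the .* before the number is greedy, a text with several qualifying numbers yields the last one.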

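By default, everything in a regex is already an AND: the parts of a pattern must all match, one after another.
If you write
abc,
it means "a" AND then "b" AND then "c",
so there is no need for a separate AND operator in regex.
Just remove the | between the number match and the words: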
\b(4|7)\d{6}(border|Order|Bestellung|bestellung|commande|Commande|ordine|Ordine|objednavku|Objednavku|objednavka|Objednavka)\b
I assume the backslash with the first word \border was a mistake.
This can match stuff like : "4958374border"

Python Regex: Match paragraph numbers

I am attempting to match paragraph numbers inside my block of text. Given the following sentence:
Refer to paragraph C.2.1a.5 for examples.
I would like to match the word C.2.1a.5.
My current code like so:
([0-9a-zA-Z]{1,2}\.)
Only matches C.2.1a. and es., which is not what I want. Is there a way to match the full C.2.1a.5 and not match es.?
https://regex101.com/r/cO8lqs/13723
I have attempted to use ^ and $, but doing so returns no matches.

You should use following regex to match the paragraph numbers in your text.
\b(?:[0-9a-zA-Z]{1,2}\.)+[0-9a-zA-Z]\b
Try this demo
Here is the explanation,
\b - Matches a word boundary hence avoiding matching partially in a large word like examples.
(?:[0-9a-zA-Z]{1,2}\.)+ - This matches one or two alphanumeric characters followed by a literal dot, as in your own regex, repeated one or more times.
[0-9a-zA-Z] - Finally the match ends with one alphanumeric character at the end. In case you want it to match one or two alphanumeric characters at the end too, just add {1,2} after it
\b - Matches a word boundary again to ensure it doesn't match partially in a large word.
EDIT:
As someone pointed out, your text may contain strings like A.A.A.A.A.A. or A.A.A or even 1.2. If you don't want to match these and only want strings that have exactly three dots, you should use the following regex, which is more specific in matching your paragraph numbers.
(?<!\.)\b(?:[0-9a-zA-Z]{1,2}\.){3}[0-9a-zA-Z]\b(?!\.)
This new regex matches only paragraph numbers having exactly three dots, and the negative lookahead and lookbehind ensure it doesn't match part of a larger string like A.A.A.A.A.A
Updated regex demo
Check these python sample codes,
import re
s = 'Refer to paragraph C.2.1a.5 for examples. Refer to paragraph A.A.A.A.A.A.A for examples. Some more A.A.A or like 1.22'
print(re.findall(r'(?<!\.)\b(?:[0-9a-zA-Z]{1,2}\.){3}[0-9a-zA-Z]\b(?!\.)', s))
Output,
['C.2.1a.5']
As for ^ and $: they are the start and end anchors. If you use them, the regex expects the match to span from the start to the end of the string (or of a line, with re.MULTILINE), which is not what you intend here. That is why using them returned no matches, as you already saw.

If a simpler version is required, you can use this regex, which is easy to understand and modify: ([A-Z]{1}\.[0-9]{1,3}\.[0-9]{1,3}[a-z]{1}\.[0-9]{1,3})

I think we should keep the regex expression simple and readable.
You can use the regex
(?:[a-zA-Z0-9]+\.){3}[a-zA-Z0-9]+
Explanation -
The expression (?:[a-zA-Z0-9]+\.){3} ensures that the group (?:[a-zA-Z0-9]+\.) is repeated 3 times within the word. The group contains one or more alphanumeric characters followed by a dot.
The word then ends with one or more alphanumeric characters.
Output:
['C.2.1a.5']

Regex backreference to match opposite case

Before I begin — it may be worth stating, that: this technically does not have to be solved using a Regex, it's just that I immediately thought of a Regex when I started solving this problem, and I'm interested in knowing whether it's possible to solve using a Regex.
I've spent the last couple hours trying to create a Regex that does the following.
The regex must match a string that is ten characters long, iff the first five characters and last five characters are identical but each individual character is opposite in case.
In other words, if you take the first five characters, invert the case of each individual character, that should match the last five characters of the string.
For example, the regex should match abCDeABcdE, since the first five characters and the last five characters are the same, but each matching character is opposite in case. In other words, flip_case("abCDe") == "ABcdE"
Here are a few more strings that should match:
abcdeABCDE, abcdEABCDe, zYxWvZyXwV.
And here are a few that shouldn't match:
abcdeABCDZ, although the case is opposite, the strings themselves do not match.
abcdeABCDe, is a very close match, but should not match since the e's are not opposite in case.
Here is the first regex I tried, which is obviously wrong since it doesn't account for the case-swap process.
/([a-zA-Z]{5})\1/g
My next thought was whether the following is possible in a regex, but I've been reading several Regex tutorials and I can't seem to find it anywhere.
/([A-Z])[\1+32]/g
This new regex (that obviously doesn't work) is supposed to match a single uppercase letter, immediately followed by itself-plus-32-ascii, so, in other words, it should match an uppercase letter followed immediately by its lowercase counterpart. But, as far as I know, you cannot "add an ascii value" to a backreference in a regex.
And, bonus points to whoever can answer this — in this specific case, the string in question is known to be 10 characters long. Would it be possible to create a regex that matches strings of an arbitrary length?

You want to use the following pattern with the Python regex module:
^(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L}))(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$)
See the regex demo
Details
^ - start of string
(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L})) - a positive lookahead with a sequence of five capturing groups that capture the first five letters individually
(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$) - a positive lookahead that makes sure that, at the end of the string, there are 5 letters that are the same as the ones captured at the start but of the opposite case.
In brief, the first (\p{L}) in the first lookahead captures the first a in abcdeABCDE and then, inside the second lookahead, (?!\1)(?i:\1) makes sure the fifth char from the end is the same letter (with the case-insensitive mode on), while the (?!\1) negative lookahead makes sure this letter is not identical to the one captured.
The re module does not support \p{L}, and before Python 3.6 it did not support inline modifier groups such as (?i:...) either, so this expression won't work with that module.
Python regex based module demo:
import regex
strs = ['abcdeABCDE', 'abcdEABCDe', 'zYxWvZyXwV', 'abcdeABCDZ', 'abcdeABCDe']
rx = r'^(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L}))(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$)'
for s in strs:
    print("Testing {}...".format(s))
    if regex.search(rx, s):
        print("Matched")
Output:
Testing abcdeABCDE...
Matched
Testing abcdEABCDe...
Matched
Testing zYxWvZyXwV...
Matched
Testing abcdeABCDZ...
Testing abcdeABCDe...

Match a sequence of numbers preceded by certain text

How do I match a sequence of numbers preceded by certain text but not return the text, just the sequence of numbers?
For example, let's assume I have the following string:
url = "sampleurl/485734/abcdefgh/83275/"
I want to match all the digits that come right after the word sampleurl. So far, I've been using the following code
re.search("sampleurl/[0-9]+", url).group(0)[9:]
that works, but I'm assuming there is a fancier way of doing that instead of needing to use [9:] at the end.
For a quick reference, I've been using regex101 to check the validation of the regex.

You can place a capturing group around the part you want and refer to that group number for the match result.
re.search(r'sampleurl/(\d+)', url).group(1)
Another way would be implementing a lookaround assertion.
re.search(r'(?<=sampleurl/)\d+', url).group(0)
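Both approaches side by side, as a small sketch on the URL from the question. The helper name extract_id is hypothetical. It adds a None check, because re.search returns None when the prefix is missing and calling .group() on that would raise an AttributeError.

```python
import re

url = "sampleurl/485734/abcdefgh/83275/"

# Capturing group: the prefix is part of the match, but group 1 holds only the digits.
with_group = re.search(r"sampleurl/(\d+)", url)
# Lookbehind: the prefix is asserted but not consumed, so group 0 is just the digits.
with_lookbehind = re.search(r"(?<=sampleurl/)\d+", url)

print(with_group.group(1))       # 485734
print(with_lookbehind.group(0))  # 485734

def extract_id(text):
    """Return the digits after 'sampleurl/', or None if there are none."""
    m = re.search(r"sampleurl/(\d+)", text)
    return m.group(1) if m else None

print(extract_id("otherurl/123/"))  # None
```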

Python, regex negative lookbehind behavior

I have a regular expression that should find up to 10 words in a line. That is, it should include the word just preceding the line feed but not the words after the line feed. I am using a negative lookbehind with "\n".
a = re.compile(r"((\w)+[\s /]){0,10}(?<!\n)")
r = a.search("THe car is parked in the garage\nBut the sun is shining hot.")
When I execute this regex and call the method r.group(), I am getting back the whole sentence but the last word that contains a period. I was expecting only the complete string preceding the new line. That is, "THe car is parked in the garage\n".
What is the mistake that I am making here with the negative look behind...?

I don't know why you would use a negative lookbehind. You are saying that you want a maximum of 10 words before a linefeed. The regex below should work. It uses a positive lookahead to ensure that there is a linefeed after the words. Also, when searching for words, use `\b\w+\b` instead of what you were using.
/(\b\w+\b)*(?=.*\\n)/
Python :
result = re.findall(r"(\b\w+\b)*(?=.*\\n)", subject)
Explanation :
# (\b\w+\b)*(?=.*\\n)
#
# Match the regular expression below and capture its match into backreference number 1 «(\b\w+\b)*»
# Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
# Note: You repeated the capturing group itself. The group will capture only the last iteration. Put a capturing group around the repeated group to capture all iterations. «*»
# Assert position at a word boundary «\b»
# Match a single character that is a “word character” (letters, digits, etc.) «\w+»
# Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# Assert position at a word boundary «\b»
# Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=.*\\n)»
# Match any single character that is not a line break character «.*»
# Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
# Match the character “\” literally «\\»
# Match the character “n” literally «n»
You may also wish to consider the fact that there could be no \n in your string.

If I read you right, you want to read up to 10 words, or the first newline, whichever comes first:
((?:(?<!\n)\w+\b[\s.]*){0,10})
This uses a negative lookbehind, but just before the word match, so it blocks getting any word after a newline.
This will need some tuning for imperfect input, but it's a start.

For this task there is the anchor $ to find the end of the string, and together with the modifier re.MULTILINE/re.M it will find the end of the line. So you would end up with something like this
(\b\w+\b[.\s /]{0,2}){0,10}$
See it here on Regexr
The \b is a word boundary. I included [.\s /]{0,2} to match a dot followed by a whitespace in my example. If you don't want the dots you need to make this part at least optional like this [\s /]? otherwise it will be missing at the last word and then the \s is matching the \n.
Update/Idea 2
OK, maybe I misunderstood your question with my first solution.
If you just want to not match a newline and continue in the second row, then just don't allow it. The problem is that the newline is matched by the \s in your character class. The \s is a class for whitespace and this includes also the newline characters \r and \n
You already have a space in the class, so just replace the \s with \t in case you want to allow tabs, and then you should be fine without the lookbehind. And of course, make the character class optional, otherwise the last word will not be matched.
((\w)+[\t /]?){0,10}
See it here on Regexr
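To see the effect of \s versus \t in the character class, here is a quick sketch that runs both patterns on the sentence from the question. The variable names are arbitrary.

```python
import re

text = "THe car is parked in the garage\nBut the sun is shining hot."

# Pattern from the question: \s also matches "\n", so the match
# continues into the second line until ten words are consumed.
original = re.search(r"((\w)+[\s /]){0,10}(?<!\n)", text).group()

# Variant from this answer: only tab, space and "/" may follow a word,
# and the separator is optional so the last word on the line is kept.
single_line = re.search(r"((\w)+[\t /]?){0,10}", text).group()

print(repr(original))
print(repr(single_line))
```

The first result still contains the newline and words from the second line, while the second stops at garage.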

I think you shouldn't be using a lookbehind at all. If you want to match up to ten words not including a newline, try this:
\S+(?:[ \t]+\S+){0,9}
A word is defined here as one or more non-whitespace characters, which includes periods, apostrophes, and other sentence punctuation as well as letters. If you know the text you're matching is regular prose, there's no point limiting yourself to \w+, which isn't really meant to match natural-language words anyway.
After the first word, it repeatedly matches one or more horizontal whitespace characters (space or TAB) followed by another word, for a maximum of ten words. If it encounters a newline before the tenth word, it simply stops matching at that point. There's no need to mention newlines in the regex at all.
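A brief sketch of this pattern on the sentence from the question, plus a made-up longer line to show the ten-word cap. The name first_words is hypothetical.

```python
import re

# Up to ten whitespace-delimited words, separated only by spaces or tabs.
first_words = re.compile(r"\S+(?:[ \t]+\S+){0,9}")

text = "THe car is parked in the garage\nBut the sun is shining hot."
print(repr(first_words.search(text).group()))
# 'THe car is parked in the garage'

long_line = "a b c d e f g h i j k l"
print(repr(first_words.search(long_line).group()))
# 'a b c d e f g h i j'
```

With re.findall or finditer the same pattern would also return later chunks, such as the second line, so search is used here to get only the first one.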
